Position Overview
As a GPU Specialist Cloud Engineer (CE) within the Oracle Cloud Infrastructure (OCI) Pre-Sales organization, you will serve as the primary technical authority for high-performance computing (HPC) and Artificial Intelligence infrastructure. You are not a generalist; you are the bridge between complex silicon capabilities and transformative business outcomes.
You will partner with Enterprise Sales teams to lead the technical discovery, architectural design, and proof-of-concept (PoC) execution for customers building the next generation of Large Language Models (LLMs), generative AI applications, and computationally intensive simulations. This role requires a deep understanding of NVIDIA/AMD hardware stacks, RDMA networking, and the software orchestration layers that make massive-scale GPU clusters hum.
Core Responsibilities
1. Strategic Technical Advisory
- Architectural Design: Design end-to-end AI infrastructure solutions on OCI, focusing on Superclusters that leverage NVIDIA H200/B300/GB300 or AMD Instinct™ accelerators.
- Optimization: Advise customers on right-sizing GPU shapes based on workload requirements (e.g., training vs. inference, FP8 vs. FP16 precision).
- Networking Excellence: Design high-throughput, low-latency interconnect fabrics using RoCE v2 (RDMA over Converged Ethernet) and OCI’s non-blocking leaf-spine architecture.
2. Hands-on Execution & Validation
- Proof of Concept (PoC): Lead deep-dive technical evaluations, demonstrating OCI’s superior price-performance ratios for model training and fine-tuning.
- Stack Integration: Assist customers in deploying and optimizing the NVIDIA AI Enterprise stack, Triton Inference Server, and NeMo Framework on OCI.
- Performance Tuning: Work directly with engineering teams to troubleshoot bottlenecks, whether they reside in the kernel, the NCCL (NVIDIA Collective Communications Library) configuration, or storage IOPS (see the sketch after this list).
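
For context on the kind of triage described above, here is a minimal, illustrative sketch (not OCI-specific guidance) of how NCCL environment variables and a PyTorch process group are commonly wired up on a multi-node RDMA cluster. The interface and device names are placeholders and would differ per environment.

```python
# Illustrative only: a minimal PyTorch distributed setup showing where NCCL
# tuning knobs typically enter the picture on an RDMA-capable cluster.
# Interface/device names below are placeholders, not OCI-specific values.
import os

import torch
import torch.distributed as dist


def init_distributed() -> int:
    # Common NCCL environment variables inspected during bottleneck triage;
    # actual values depend on the cluster's NIC naming and fabric layout.
    os.environ.setdefault("NCCL_DEBUG", "INFO")          # surface transport/topology choices in logs
    os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # placeholder control-plane interface
    os.environ.setdefault("NCCL_IB_HCA", "mlx5")         # placeholder RDMA device prefix

    # torchrun supplies RANK, WORLD_SIZE, and LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank


if __name__ == "__main__":
    rank = init_distributed()
    # A trivial all-reduce is a quick end-to-end sanity check of the collective path.
    t = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(t)
    if dist.get_rank() == 0:
        print(f"all_reduce result: {t.item()} across {dist.get_world_size()} ranks")
    dist.destroy_process_group()
```

Launched per node with torchrun, a check like this is often the first step before profiling deeper kernel, collective, or storage bottlenecks.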
3. Thought Leadership & Enablement
- Content Creation: Develop whitepapers, reference architectures, and blog posts detailing OCI’s competitive advantages in the AI sovereign cloud and private AI spaces.
- Market Intelligence: Stay ahead of the curve on the evolving landscape of AI accelerators, interconnects (InfiniBand vs. Ethernet), and distributed training frameworks (PyTorch, JAX, DeepSpeed).
Required Technical Competencies
| Domain | Expertise Required |
| --- | --- |
| GPU Architecture | Deep knowledge of CUDA cores, Tensor Cores, HBM3 memory, and NVLink/NVSwitch topologies. |
| Networking | Mastery of RDMA, RoCE, and high-speed fabric management for multi-node distributed training. |
| Storage | Experience with high-performance parallel file systems like Lustre, Weka, or OCI's High-Performance Storage for feeding data to GPUs at scale. |
| Orchestration | Proficiency in Kubernetes (OKE) for AI, Slurm for batch job scheduling, and the NVIDIA GPU Operator. |
| AI Frameworks | Hands-on experience with PyTorch, TensorFlow, and libraries for distributed computing like Megatron-LM (see the sketch after this table). |
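
As a small illustration of the AI Frameworks row above, the following is a hypothetical, toy-scale PyTorch DistributedDataParallel loop. It assumes an NCCL-capable multi-GPU environment launched with torchrun; the model and data are placeholders, not a real training workload.

```python
# Illustrative only: the minimal pattern for wrapping a model in PyTorch
# DistributedDataParallel once the NCCL process group is initialized.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group(backend="nccl")   # torchrun provides rank/world size
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(1024, 1024).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])     # gradients sync via NCCL all-reduce
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):                             # toy training loop
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()                             # gradient all-reduce happens here
        opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```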
Candidate Qualifications
- Education: Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related quantitative field.
- Experience: 10+ years in Pre-Sales Engineering, Systems Architecture, or HPC. At least 3 years specifically focused on GPU-accelerated computing.
- The "OCI Edge": Familiarity with OCI’s "Off-Box" virtualization and how it enables "Bare Metal" performance in a cloud environment.
- Communication: The ability to explain the difference between latency and throughput to a CTO, and to debug a Python script alongside a Data Scientist.
Career Level - IC5