
Master Principal Cloud Engineer – GPU & AI Infrastructure

Oracle
11 days ago
Full-time
On-site
Beijing, China
Description

Position Overview

As a GPU Specialist Cloud Engineer (CE) within the Oracle Cloud Infrastructure (OCI) Pre-Sales organization, you will serve as the primary technical authority for high-performance computing (HPC) and Artificial Intelligence infrastructure. You are not just a generalist; you are the bridge between complex silicon capabilities and transformative business outcomes.

You will partner with Enterprise Sales teams to lead the technical discovery, architectural design, and proof-of-concept (PoC) execution for customers building the next generation of Large Language Models (LLMs), generative AI applications, and computationally intensive simulations. This role requires a deep understanding of NVIDIA/AMD hardware stacks, RDMA networking, and the software orchestration layers that make massive-scale GPU clusters hum.


Core Responsibilities

1. Strategic Technical Advisory

  • Architectural Design: Design end-to-end AI infrastructure solutions on OCI, focusing on Superclusters that leverage NVIDIA H200/B300/GB300 or AMD Instinct™ accelerators.
  • Optimization: Advise customers on right-sizing GPU shapes based on workload requirements (e.g., training vs. inference, FP8 vs. FP16 precision).
  • Networking Excellence: Design high-throughput, low-latency interconnect fabrics using RoCE v2 (RDMA over Converged Ethernet) and OCI’s non-blocking leaf-spine architecture.

2. Hands-on Execution & Validation

  • Proof of Concept (PoC): Lead deep-dive technical evaluations, demonstrating OCI’s superior price-performance ratios for model training and fine-tuning.
  • Stack Integration: Assist customers in deploying and optimizing the NVIDIA AI Enterprise stack, Triton Inference Server, and NeMo Framework on OCI.
  • Performance Tuning: Work directly with engineering teams to troubleshoot bottlenecks, whether they reside in the kernel, the NCCL (NVIDIA Collective Communications Library) configuration, or storage IOPS.

3. Thought Leadership & Enablement

  • Content Creation: Develop whitepapers, reference architectures, and blog posts detailing OCI’s competitive advantages in the AI sovereign cloud and private AI spaces.
  • Market Intelligence: Stay ahead of the curve on the evolving landscape of AI accelerators, interconnects (InfiniBand vs. Ethernet), and distributed training frameworks (PyTorch, JAX, DeepSpeed).


Responsibilities

Required Technical Competencies

  • GPU Architecture: Deep knowledge of CUDA cores, Tensor Cores, HBM3 memory, and NVLink/NVSwitch topologies.
  • Networking: Mastery of RDMA, RoCE, and high-speed fabric management for multi-node distributed training.
  • Storage: Experience with high-performance parallel file systems such as Lustre, Weka, or OCI's High-Performance Storage for feeding data to GPUs at scale.
  • Orchestration: Proficiency in Kubernetes (OKE) for AI, Slurm for batch job scheduling, and the NVIDIA GPU Operator.
  • AI Frameworks: Hands-on experience with PyTorch, TensorFlow, and distributed-training libraries such as Megatron-LM.


Candidate Qualifications

  • Education: Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related quantitative field.
  • Experience: 10+ years in Pre-Sales Engineering, Systems Architecture, or HPC. At least 3 years specifically focused on GPU-accelerated computing.
  • The "OCI Edge": Familiarity with OCI's off-box virtualization and how it delivers bare-metal performance in a cloud environment.
  • Communication: Able to explain the difference between latency and throughput to a CTO, and equally comfortable debugging a Python script with a Data Scientist.




Qualifications

Career Level - IC5