We are seeking an experienced C++ AI Inference Engineer to design, optimize, and deploy high-performance AI inference engines using modern C++ and processor-specific optimizations. You will collaborate with research teams to productionize cutting-edge AI model architectures for CPU-based inference.
Key Responsibilities:
Collaborate with research teams to understand AI model architectures and requirements
Design and implement AI model inference pipelines using C++17/20 and SIMD intrinsics (AVX2/AVX-512)
Optimize cache-hierarchy utilization, NUMA-aware memory allocation, and matrix multiplication (GEMM) kernels
Develop operator fusion techniques and CPU inference engines for production workloads
Write production-grade, thread-safe C++ code with comprehensive unit testing
Profile and debug performance using Linux tools (perf, VTune, flame graphs)
Conduct code reviews and ensure compliance with coding standards
Stay current with HPC, OpenMP, and modern C++ best practices
Core Requirements:
Modern C++ (C++17/20) with smart pointers, coroutines, and concepts
SIMD intrinsics - AVX2 required, AVX-512 strongly preferred
Cache optimization - L1/L2/L3 prefetching and locality awareness
NUMA-aware programming for multi-socket systems
GEMM/blocked matrix multiplication kernel implementation
OpenMP 5.0+ for parallel computing
Linux performance profiling and debugging (perf, Valgrind, sanitizers)
Strongly Desired:
High-performance AI inference engine development
Operator fusion and kernel fusion techniques
HPC (High-Performance Computing) experience
Memory management and allocation optimization
Qualifications:
Bachelor's/Master's in Computer Science, Electrical Engineering, or related field
3-7+ years of proven C++ development experience
Linux/Unix expertise with strong debugging skills
Familiarity with linear algebra, numerical methods, and performance analysis
Experience with multi-threading, concurrency, and memory management
Strong problem-solving and analytical abilities
Nice to Have:
Knowledge of PyTorch/TensorFlow C++ backends
Real-time systems or embedded systems background
ARM SVE, RISC-V vector extensions, or Intel ISPC experience
What You'll Build:
Production-grade AI inference libraries powering LLMs and vision models
CPU-optimized inference pipelines for sub-millisecond latency
Cross-platform deployment across Intel Xeon, AMD EPYC, and ARM architectures
Performance optimizations reducing inference costs by 3-5x