Description
We're hiring a Software Engineer to build the infrastructure that powers our AI agents and ML systems end-to-end, from fine-tuning foundation models to shipping production-grade agent harnesses. You'll work across the stack: building MLOps pipelines, customizing LLMs, and deploying scalable agent systems on Kubernetes. This role sits at the intersection of ML engineering, platform engineering, and applied AI.
Responsibilities
- Design and build agent harnesses in Python — the runtime scaffolding that enables AI agents to perceive, reason, plan, and act reliably
- Develop and maintain a robust MLOps framework using Kubeflow and complementary tooling (MLflow, Argo, Airflow, or similar) to orchestrate training, evaluation, and deployment workflows
- Fine-tune foundation LLMs using techniques such as LoRA/QLoRA, SFT, and RLHF; manage datasets, training runs, and evaluation pipelines
- Deploy and operate services on Kubernetes, including model serving, autoscaling, and observability
- Build and integrate AI agents using modern agent frameworks (LangGraph, CrewAI, AutoGen, LlamaIndex, or similar)
- Apply software engineering rigor — SOLID principles, secure coding, static analysis, code reviews, and CI/CD — across all deliverables
- Collaborate with researchers, ML engineers, and product teams to take prototypes from notebook to production
Qualifications
Nice to Have
- Experience with distributed training (DeepSpeed, FSDP, Accelerate)
- Familiarity with vector databases, RAG architectures, and evaluation frameworks for LLMs
- Experience with model serving frameworks (vLLM, TGI, KServe, Triton)