Raphael San Andres

ML & AI Systems Engineer

ML and AI Systems Engineer with 3+ years building and deploying ML systems at scale. First hire on a customer-facing ML support team managing multi-node NVIDIA H100 GPU clusters at 99.9% uptime.

About

Hello! I'm Raphael. I get excited about linguistics, love diving into video game statistics, and have plenty of other interests on the side. Professionally, I work in ML/AI.

Experience

Stealth Startup

GPU Cloud Infrastructure Provider

ML Solutions Engineer (First Hire) | Software Engineer

Feb 2024 - Present

First engineer on a 5-person customer-facing ML support team serving several early-stage AI startups on GPU infrastructure
Managed multi-node NVIDIA H100 GPU Kubernetes clusters, maintaining 99.9% uptime across enterprise AI infrastructure
Helped design and implement automated node repair processes, reducing mean time to repair from 36-72 hours to under 1 hour
Built an internal knowledge base and a RAG system over the ticketing codebase to accelerate issue resolution across the support team
Leading infrastructure readiness for next-generation NVIDIA GB100 (Blackwell) GPU deployments
Developed bidirectional Jira-Kubernetes operators in Go, automating incident response for ~20 weekly node failures
Optimized SLURM and Kubernetes ML workflows, improving training job throughput for customers running distributed workloads

Weights & Biases

Machine Learning Support Engineer

Jan 2023 - Jan 2024

Debugged and resolved 600+ technical issues for ML practitioners at OpenAI, NVIDIA, and Microsoft, covering model integrations, LLM deployments, and on-premise instances
Triaged and traced 50+ bugs across the Weights & Biases SDK, web application, backend services, and managed instances, contributing bugfix PRs and new integrations
Managed ~20 customer requests daily while running debugging sessions, cross-team syncs, and building internal tooling (W&B integrations, frontend features)

Projects

Atlas AI

Active

Clinical intelligence platform — query patient records in natural language via a 3-node multi-agent RAG pipeline with 20+ medical tools.

LangGraph, FastAPI, Next.js, pgvector, AWS Bedrock, ECS, RDS, Ollama

Atlas AI MCP

Active

MCP server exposing healthcare AI tools for RAG-powered clinical queries, document reranking, and FHIR data ingestion. Published on PyPI and Smithery.

Python, MCP

Corium

Recent

Kubernetes operator with 3 custom CRDs for automated metrics collection, threshold alerting, and a monitoring dashboard.

Go, Kubebuilder, Prometheus, Grafana, Next.js

JaxStats

Active

Game performance analyzer with XGBoost ML scoring, 8-metric GPI breakdown, AI coaching, and live game overlays.

FastAPI, XGBoost, React, Chart.js, Ollama

Aphae

Active

AI agent office simulation — procedurally generated personalities (Big Five), emergent relationships, LLM-driven conversations, and a drama director.

Godot 4, GDScript, Ollama

Models from Scratch

Recent

MLP, CNN, and Transformer self-attention implemented from scratch with hand-derived forward and backward passes — no frameworks, no autograd.

Python, NumPy

Capstone (PSU AI 894)

Spring 2024

Predicts NFL formation based on player positions and coordinates.

Skills

Languages: Python, Go, TypeScript, SQL, R, C++

ML & AI: PyTorch, TensorFlow, Keras, XGBoost, HuggingFace, LangChain, LangGraph, Ray, OpenAI API, Ollama

Infrastructure: Kubernetes, Docker, SLURM, AWS SageMaker, GCP Vertex, Azure, Helm, Kustomize, GitHub Actions

Data & Observability: PostgreSQL, pgvector, DataDog, Prometheus, Grafana, Jupyter, Weights & Biases

Certifications: Google Data Analytics Professional Certificate

Education

Master's in Artificial Intelligence

Penn State · May 2025

Bachelor of Science in Statistics

UCLA · June 2022

Publications

RL with Mario (Article)