maxawad.com
Meta · Reality Labs
Case Study — Job Report

Debugging Digital Humans at Scale

Building the internal tooling and ML infrastructure that kept Meta’s Avatar pipeline healthy — from search and triage to automated testing, observability, and CI/CD gatekeeping.

ML Infrastructure Engineer · Meta · Reality Labs · Avatars Team
01 Context

The Organization

Meta’s Reality Labs division — formerly Oculus — is the arm responsible for building the hardware and software that power Meta’s mixed-reality ecosystem: Quest headsets, Ray-Ban Meta smart glasses, Horizon Worlds, and the underlying Avatars platform that gives users a persistent digital identity across every surface.

Reality Labs

Meta’s dedicated division for VR, AR, and mixed-reality products. Encompasses hardware (Quest, Ray-Ban), software (Horizon), and core platform services like Avatars.

Avatars Team

Responsible for creating realistic, expressive digital humans. Avatars have natural gestures, idle animations, and playful micro-expressions — representing users across Quest, Horizon, Messenger, and Instagram.

My Role

As an ML Infrastructure Engineer, I architected the experimentation infrastructure supporting Meta Avatars and Reality Labs research platforms. My work focused on the debugging tooling, pipeline testing, observability, and compute orchestration used by ~300 engineers.

02 The Problem

Why This Work Mattered

Meta Avatars are generated from a deep pipeline: face scans, body estimation, clothing selection, expression rigs, and rendering. With billions of potential configurations, edge cases surface constantly — avatars rendered too skinny, too tall, with broken textures, or mismatched proportions. Researchers and engineers needed fast, reliable tooling to find, diagnose, and fix these issues before they shipped to users.

~300 Engineers Supported
150+ GPU Cores Orchestrated
~2TB RAM per Compute Env
4 Services Built
03 Deliverables

Services & Applications Built

Four services and a centralized observability platform, all engineered around one mission: give the Avatars team total visibility into their pipeline so no rendering defect reaches production undetected.

Avatar Search & Chat Interface

A conversational search tool that let researchers query specific avatars across their account. Natural-language input made it possible to locate, inspect, and triage avatar issues by ID, configuration, or visual anomaly type.

Search · Chat UX · Query Engine · Researcher-Facing
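
The three lookup keys named above (ID, configuration, anomaly type) can be pictured as optional filters over avatar records. The following is only a minimal sketch of that idea: the `AvatarRecord` shape, the field names, and `search_avatars` are hypothetical illustrations, not the internal schema or query engine, and the natural-language parsing step is left out entirely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AvatarRecord:
    """Hypothetical record shape; the real schema is not described here."""
    avatar_id: str
    body_config: str
    anomaly_type: Optional[str]  # None when no defect has been flagged


def search_avatars(records, avatar_ids=None, body_config=None, anomaly_type=None):
    """Return records matching every filter that was supplied (None means 'any')."""
    wanted_ids = set(avatar_ids) if avatar_ids is not None else None
    matches = []
    for rec in records:
        if wanted_ids is not None and rec.avatar_id not in wanted_ids:
            continue
        if body_config is not None and rec.body_config != body_config:
            continue
        if anomaly_type is not None and rec.anomaly_type != anomaly_type:
            continue
        matches.append(rec)
    return matches


# Illustrative data only.
records = [
    AvatarRecord("av-001", "slim", "too_skinny"),
    AvatarRecord("av-002", "tall", None),
    AvatarRecord("av-003", "slim", "broken_texture"),
]
flagged = search_avatars(records, body_config="slim", anomaly_type="too_skinny")
print([r.avatar_id for r in flagged])  # ['av-001']
```

Treating each filter as optional keeps one function usable for both narrow lookups (a single ID) and broad triage sweeps (every avatar with a given anomaly type).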

Debugging Dashboard

A standalone product platform for retrieving and displaying avatar debugging data. It visualized rendering parameters, mesh metrics, body proportions, and expression rig states so engineers could pinpoint exactly where the pipeline produced a defect.

Dashboard · Data Viz · Internal Tool · Real-time

Email Automation Service

Automated notification pipelines that alerted stakeholders when avatar quality regressions were detected. Digest reports, threshold-based alerts, and escalation routing ensured the right people knew about issues before users did.

Automation · Alerts · Email · Monitoring

CI/CD Test Infrastructure

Built the merge-gate testing layer that enforced cross-team test suites. Code could not merge unless it passed validation from all dependent teams — preventing one team's change from breaking another team's avatar surface.

CI/CD · Testing · Merge Gates · Cross-Team
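
The merge rule described above (a change lands only when every dependent team's suite passes) reduces to a small decision function. This is a minimal sketch under stated assumptions: `evaluate_merge_gate`, the team names, and the test names are hypothetical, and treating a team with no reported results as a blocker is my assumption rather than a documented behavior of the real gate.

```python
def evaluate_merge_gate(required_teams, suite_results):
    """Decide whether a change may merge.

    required_teams: ordered list of teams whose suites must pass.
    suite_results:  dict mapping team name -> {test_name: passed_bool}.
    Returns (allowed, blockers), where blockers lists human-readable reasons.
    """
    blockers = []
    for team in required_teams:
        results = suite_results.get(team)
        if not results:
            # Assumption: missing validation blocks the merge just like a failure.
            blockers.append(f"{team}: no test results reported")
            continue
        for test_name, passed in results.items():
            if not passed:
                blockers.append(f"{team}: {test_name} failed")
    return (len(blockers) == 0, blockers)


# Illustrative run: one failing test and one team that never reported.
required = ["rendering", "expressions", "body_proportions"]
results = {
    "rendering": {"test_mesh_export": True},
    "expressions": {"test_blink": True, "test_smile": False},
}
allowed, blockers = evaluate_merge_gate(required, results)
print(allowed)
for reason in blockers:
    print(reason)
```

Returning the list of blockers alongside the verdict means the author of a rejected change can see which team's suite to look at without rerunning anything.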

Centralized Observability Platform

Developed a unified platform aggregating logs, debugging signals, and model outputs using Plog and internal infrastructure. Enabled real-time monitoring by research teams and Reality Labs leadership — giving end-to-end visibility into the experiment-to-render pipeline across GPU-intensive compute environments.

Plog · Observability · Real-time · Leadership Visibility
04 Architecture

System Design

The services formed an integrated debugging ecosystem. The search interface, dashboard, and observability platform consumed data from the avatar pipeline, while the email service monitored quality signals and the CI/CD layer enforced standards at merge time.

┌──────────────────────────────────────────────────────────────────────┐
│              META AVATARS: ML INFRASTRUCTURE ECOSYSTEM               │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐    │
│  │  Chat / Search   │  │ Debug Dashboard  │  │  Observability   │    │
│  │   Interface      │  │  (Standalone)    │  │  Platform (Plog) │    │
│  └───────┬──────────┘  └────────┬─────────┘  └────────┬─────────┘    │
│          │                      │                     │              │
│          ▼                      ▼                     ▼              │
│  ┌──────────────────────────────────────────────────────────────┐    │
│  │              C++ Experimentation Infrastructure              │    │
│  │    Experiment Tracking · Debugging · Results Aggregation     │    │
│  │  ┌──────────┐ ┌──────────┐ ┌────────────┐ ┌───────────────┐  │    │
│  │  │  Meshes  │ │   Rigs   │ │  Textures  │ │   Rendering   │  │    │
│  │  └──────────┘ └──────────┘ └────────────┘ └───────────────┘  │    │
│  └──────────────────────────────┬───────────────────────────────┘    │
│                                 │                                    │
│            ┌────────────────────┼────────────────────┐               │
│            ▼                    ▼                    ▼               │
│  ┌──────────────────┐  ┌────────────────┐  ┌──────────────────┐      │
│  │ Email Automation │  │ CI/CD Test Gate│  │  GPU Compute     │      │
│  │  (Monitoring)    │  │ (Merge Block)  │  │  Orchestration   │      │
│  └─────────┬────────┘  └────────┬───────┘  │  150+ cores      │      │
│            │                    │          │  ~2TB RAM        │      │
│            ▼                    ▼          │  ~55 min/render  │      │
│   Stakeholder Alerts    Cross-Team Tests   └──────────────────┘      │
│    & Digest Reports       Must pass ALL                              │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
05 Impact

Results & Outcomes

These tools became core infrastructure for the Avatars team’s daily workflow, reducing time-to-diagnosis and preventing pipeline breakages across teams.

~300 engineers supported
The C++ experimentation infrastructure enabled experiment tracking, debugging, and results aggregation across ML workflows used daily by hundreds of researchers and engineers.

Faster issue triage
Researchers could search and locate defective avatars through the chat interface instead of manually querying databases, reducing diagnosis time from hours to minutes.

Visual debugging at a glance
The debugging dashboard surfaced mesh geometry, body proportions, expression states, and rendering parameters in a single view, eliminating the need to inspect raw pipeline data.

Real-time observability for leadership
Centralized platform aggregating logs, debugging signals, and model outputs via Plog, enabling monitoring by both research teams and Reality Labs leadership.

Proactive quality monitoring
The email automation service caught regressions early by alerting teams when quality metrics drifted outside acceptable thresholds, before users encountered the issue.

Zero pipeline regressions from unvetted merges
Distributed validation systems including end-to-end, regression, and integration testing frameworks prevented regressions across complex ML research pipelines.

GPU-scale compute orchestration
Engineered high-performance compute orchestration for systems utilizing 150+ GPU cores and ~2TB RAM, enabling avatar generation pipelines requiring ~55 minutes per render.

06 Technology

Tech Stack & Environment

I worked within Meta’s internal infrastructure, using its proprietary tooling alongside industry-standard technologies.

Languages & Tooling

  • C++ (experimentation infra)
  • Python (pipeline automation)
  • Bash / Shell scripting
  • Jupyter Notebooks

Infrastructure & Compute

  • ALA servers (GPU clusters)
  • MTP developer platforms
  • Internal cloud environments
  • 150+ GPU core orchestration

Observability & Testing

  • Plog (logging infrastructure)
  • End-to-end test frameworks
  • Regression & integration suites
  • Cross-team merge gates

Domain

  • ML experimentation pipelines
  • Avatar generation (~55 min/render)
  • 3D mesh & body estimation
  • Expression rigging systems
07 Workflow

How a Typical Bug Flowed Through the System

From detection to resolution, the four services formed a continuous loop that kept avatar quality high and turnaround fast.

1. Detection

Email Automation Service

Quality metrics drift outside threshold — an avatar body type renders 15% narrower than expected. The monitoring service fires an alert email to the owning team with the affected avatar IDs, configuration snapshot, and severity level.
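
The "15% narrower than expected" case can be worked through as a relative-drift check. The sketch below is illustrative only: the function names, the 5% and 10% thresholds, and the 40.0 vs 34.0 measurements are assumptions chosen to reproduce a 15% shortfall, not values from the production monitoring service.

```python
def relative_drift(measured, expected):
    """Signed drift as a fraction of the expected value (negative = smaller)."""
    if expected == 0:
        raise ValueError("expected value must be non-zero")
    return (measured - expected) / expected


def classify_drift(measured, expected, warn_at=0.05, alert_at=0.10):
    """Map a measurement to a severity level using hypothetical thresholds."""
    drift = relative_drift(measured, expected)
    magnitude = abs(drift)
    if magnitude >= alert_at:
        severity = "alert"
    elif magnitude >= warn_at:
        severity = "warning"
    else:
        severity = "ok"
    return drift, severity


# A body width expected at 40.0 units but rendered at 34.0 is 15% narrower.
drift, severity = classify_drift(measured=34.0, expected=40.0)
print(f"{drift:+.0%} -> {severity}")  # -15% -> alert
```

Using the absolute value of the drift means an avatar rendered 15% too wide would be flagged at the same severity as one rendered 15% too narrow.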

2. Search & Triage

Chat / Search Interface

A researcher opens the chat tool, queries for the flagged avatar IDs, and filters by body configuration. The interface returns matching avatars, their generation timestamps, and the pipeline stage where the anomaly was introduced.
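
One way to picture "the pipeline stage where the anomaly was introduced" is to walk the stages in order and report the first one whose recorded metric leaves tolerance. This is a sketch under assumptions: the stage identifiers are adapted from the pipeline description earlier in this case study, while the single-metric snapshot, the 5% tolerance, and `first_anomalous_stage` are hypothetical.

```python
# Stage order adapted from the pipeline description; identifiers are illustrative.
PIPELINE_STAGES = [
    "face_scan",
    "body_estimation",
    "clothing_selection",
    "expression_rig",
    "rendering",
]


def first_anomalous_stage(stage_metrics, expected, tolerance=0.05):
    """Return the earliest stage whose metric deviates from `expected`
    by more than `tolerance` (as a fraction), or None if all are in range.

    stage_metrics maps stage name -> measured value; stages without a
    recorded value are skipped.
    """
    for stage in PIPELINE_STAGES:
        if stage not in stage_metrics:
            continue
        deviation = abs(stage_metrics[stage] - expected) / expected
        if deviation > tolerance:
            return stage
    return None


# Width ratio relative to the reference body (1.0 = as expected).
snapshot = {
    "face_scan": 1.00,
    "body_estimation": 0.85,
    "clothing_selection": 0.85,
    "rendering": 0.85,
}
print(first_anomalous_stage(snapshot, expected=1.0))  # body_estimation
```

Because later stages inherit the bad value, reporting the earliest deviating stage points triage at the source rather than at the symptoms downstream.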

3. Diagnosis

Debugging Dashboard

The engineer opens the standalone dashboard, loads the affected avatar, and inspects the mesh geometry, body proportion parameters, and expression rig state side by side. This pinpoints the issue to a body estimation weight that was incorrectly applied.

4. Fix & Validation

CI/CD Test Infrastructure

The engineer submits a fix. The merge gate runs the full cross-team test suite: avatar rendering tests, expression tests, body proportion tests, and integration tests from every dependent surface. All pass. Code merges. Pipeline stays healthy.


Case Study — Meta Reality Labs, Avatars Team · Prepared 2026