ACTIVE — FOUNDING ROLE

Founding AI Engineer

NEW YORK CITY · IN-PERSON

We're building the intelligence layer for all home-based care. Medicaid, Medicare, the VA, and private insurers authorize $300B annually for 14M Americans to receive care at home. Only 8M get it. $84B is spent on admin workers coordinating the gap. We're closing it by replacing coordination with autonomous agents.

We've built a sophisticated multi-agent harness in production: Casey handles scheduling and call-outs autonomously, Steffi plans schedules forward, Riley answers inbound calls, and Binky lets admins tune their own agents. At one customer, our agents handle 2,600+ calls a week at 97% containment. The four largest home care EMRs chose us as their system of action.

You're the person who turns this into a system that learns. You own the evaluation, simulation, retrieval, and alignment infrastructure that lets us ship agent improvements with confidence, measure what's actually working, and compound every production decision into a better model. This is the most senior AI hire at the company.

YOU OWN

  • Evaluation and simulation infrastructure — CI/CD for agent behavior. Run Casey against historical scenarios, measure fill-rate outcomes, catch regressions, A/B test prompt and orchestration changes before they hit production. This is the piece that lets us ship changes to a sophisticated multi-agent system with confidence instead of vibes.
  • Retrieval and grounding architecture — our agents operate over dense, agency-specific knowledge (SOPs, escalation policies, scheduling rules, compliance constraints). Design the retrieval layer that gets the right context to the right agent at the right turn, so the agent harness keeps scaling as the knowledge surface grows.
  • Alignment and outcome measurement — connect every ranking decision, outreach attempt, and agent choice to downstream business outcomes (fills, retention, utilization). The event data exists across multiple systems. Nobody has unified it. You will, and everything else compounds from it.
  • The agent intelligence roadmap — we have 100K+ scheduling decisions with paired outcome data, and the corpus is growing. You'll decide when fine-tuning, reward modeling, or new model architectures are the right next investment for the harness, and ship them.

YOU ARE

  • An applied scientist who ships production code. Not a notebook jockey — your mental model of system behavior is probabilistic, but your output is working systems.
  • Fluent in the full stack of interventions — prompting, retrieval, evals, fine-tuning, RL — and have the judgment to know which to reach for.
  • Someone whose first instinct when something breaks is to design an experiment, not just read a stack trace.
  • Done with the reactive trap. You stop treating symptoms and start building the systems that prevent whole classes of problems.
  • Unbothered if you lack healthcare domain experience. The domain is learnable in weeks.

HOW WE BUILD

Tony Xu built DoorDash by running tens of thousands of experiments a year. 95% fail. The 5% that work compound into permanent improvement for every customer. That's our model. The quality of our product is proportional to the number of experiments we can run.

Concrete example: this week we found that a significant slice of one customer's caregiver roster was blocked by stale availability data. The response wasn't a ticket — it was an experiment. Clean the data, measure whether fills-per-contact improves, build proactive monitoring that catches it across every customer. That's the loop.

Your job is to build the infrastructure that makes every engineer — yourself included — faster and more confident at running these experiments on a production agent system.

THE PROCESS

01 · Intro Call · 45 min, remote

Daniel walks you through the role and the reactive-trap diagnosis. You walk us through a reliability or evaluation system you've built for a production ML/LLM product.

02 · Technical Exercise · Async, own time

We hand you a real slice of our data (ranking snapshots, outreach events, outcomes). You design and prototype an offline evaluation framework. Record a Loom walking through your thinking.

03 · On-site Day · Half day, in-office

A review and live extension of your exercise with Greg and Ethan, an architecture conversation with Daniel, and a close with Victor.

04 · Offer

Target: offer signed by May 10.

APPLY

Send a short note to daniel+ai-eng@zingage.com — walk us through a reliability or evaluation system you've built for a production ML/LLM product, and the experiment that most changed your view of how to measure agent quality.