Hiring Playbook

How to hire an ML engineer who will ship, not stall

Most ML engineer hires go sideways because the screen tests what the candidate knows, not what they have done. Here are the questions that actually predict production performance.

A machine learning engineer hire goes one of two ways. Either the candidate ramps up, ships an improvement to an existing ML system in the first quarter, and starts owning a slice of the stack by month six. Or the candidate spends four months in notebooks, never ships, and drifts into a data analyst role while the team quietly hopes the next hire is better.

The difference between those outcomes is almost always set at the screening stage. Most interview loops test ML theory: tree depth, regularization, gradient descent. Those questions do not predict who will ship in production. This page lays out the questions that do.

What production ML actually requires

Shipping ML is a distributed systems job with a model attached. The model is 20% of the work. The other 80% is data pipelines, feature stores, serving infrastructure, monitoring, versioning, rollback, and on-call. An engineer who has only trained models in a notebook is missing most of the actual job.

Before you design the interview, be honest with yourself about which 80% the role owns. Is this a modeling-heavy seat on an existing platform, or does the candidate need to own the whole stack? If it is the latter, the candidate pool is five to ten times smaller, and the screen needs to reflect that.

The production ML screen

The following questions are the core of the screen we run on every senior ML search at Engineers in AI. Tony Kochhar, a 20-year engineering veteran, runs these conversations personally on senior roles. Each question is designed to be hard to fake.

  • Describe the last ML system you shipped, end to end. Training pipeline, feature store, serving path, monitoring. If the candidate cannot describe all four, they have owned none of them.
  • How did you know the model was working in production? Strong answers cite specific metrics, drift thresholds, and alerting. Weak answers cite A/B lift without confidence intervals.
  • What did you retrain on, and how often? What triggered it? This filters for engineers who have operated a model versus those who have only trained one.
  • Walk through a time your model was wrong in production. How did you notice? Real ML engineers have stories here. Candidates without production experience deflect.
  • What was your feature engineering pipeline, and where did it break? This is where most production ML time is actually spent.
  • Who was on-call for the ML system? If it was you, describe a real page. The single highest-signal question in the screen.

Signals that matter more than credentials

A Kaggle grandmaster title does not predict production ML performance. Neither does a PhD. Neither does a FAANG resume, on its own. The signals that actually matter are operational.

  • The candidate has been paged for an ML system they owned.
  • The candidate can describe the data distribution they trained on in plain language.
  • The candidate has killed or deprecated a model they shipped, because it was not working.
  • The candidate talks about data quality before they talk about model architecture.
  • The candidate has a cost-per-prediction number in their head.

Red flags worth catching early

  • "I shipped a model" with no discussion of how it was served, monitored, or rolled back. The model was handed off, not owned.
  • Exclusive focus on novel architectures with no discussion of baselines. Strong ML engineers beat the baseline before they try anything fancy.
  • No mention of data issues in the entire interview. Data is where production ML time goes. A candidate who skips it has not lived it.
  • Resumes full of modeling frameworks with zero mention of pipelines, orchestration, or infrastructure.

Common mistakes hiring managers make

The most common mistake is using the same interview loop for every ML hire, whether the role is platform, applied, or research. A second is letting one strong paper carry a candidate through the loop when the rest of the signal is weak. A third is chasing a name-brand resume over someone quieter who has shipped more real systems.

The way to avoid all three is to write down what the hire will ship in their first 90 days, and interview backwards from that list. If the hire will own a recommendation service, the loop should include a realistic recommendation system design question, not a generic ML theory quiz.

How the loop should be structured

An interview loop that actually predicts ML engineer performance has three stages.

  • A 45-to-60-minute conversation about a system the candidate shipped, with specific follow-ups on monitoring, retraining, and failure modes.
  • A scoped ML system design, not a leetcode puzzle.
  • A practical coding exercise that focuses on data handling and pipeline logic, not on exotic algorithms the candidate will never reach for in production.

The loop you want to avoid is the one that tests theory in three separate stages. If your interview is mostly a machine learning final exam, you will end up with candidates who can explain algorithms but stall the moment they have to debug a feature pipeline that silently dropped 3% of rows last Tuesday.

When to bring in a specialist recruiter

Use an ML specialist recruiter when your role is senior, the screening bar is high, and your internal team cannot filter on production signal. At a fully loaded ML engineer compensation of $350K to $500K, a bad hire is an expensive mistake. A flat 20% placement fee pays back the first time a bad candidate gets filtered out before your team spends a week on them.

Engineers in AI has closed over 1,000 technical placements in 20 years, including ML hires at Agoda, Hearst, Con Edison, and Trilogy. No retainer, no exclusivity, 90-day replacement guarantee. If you are hiring an ML engineer and want an engineering-led read on your role, book a hiring call.

Hire an ML engineer who has owned production

Screening built around shipped systems, not theory. Flat 20% fee. 90-day replacement.