AI Recruiting

Why Most AI Recruiting Fails at the Screening Stage

The problem is not that good AI engineers are hard to find. It is that most hiring processes cannot tell a senior MLE from someone who finished a Coursera course last quarter. Here is what actually separates the two.

A Series B company we worked with last year spent four months hiring an ML engineer. They ran the candidate through six rounds, including a take-home notebook, a system design interview, and a panel with two founders. The candidate had a PhD from a name-brand program, three NeurIPS papers, and a clean GitHub. They were hired in Q2. By Q4, they had shipped zero models into production. Not one.

When the VP of Engineering called us to run the replacement search, we spent thirty minutes on the post-mortem before we spent a minute on sourcing. The story was not complicated. The candidate was a genuinely strong researcher. They had never owned a training-to-serving pipeline end to end, never debugged a p99 latency regression in a retrieval system, never had to explain to finance why GPU spend doubled in a month. None of the six interview rounds had asked them to.

This is the pattern we see more than any other. The failure is almost never at the sourcing stage. It is at screening. And in AI specifically, screening is harder than most hiring teams realize, because the job titles have drifted so far from the underlying work that a resume is no longer a reliable signal of what someone can actually do.

The resume vs. reality gap in AI

Open any ten "Senior ML Engineer" resumes on LinkedIn and you will find ten different jobs. One person trained transformer models from scratch on a custom distributed cluster. The next wrote SQL to pull features for a scikit-learn logistic regression. A third fine-tuned an open-source LLM on internal docs and called it RAG. A fourth ran notebooks for a data science team and never shipped anything that served live traffic. All four have the same title. All four claim the same skills.

The gap has widened in the last three years. When foundation models became table stakes, the industry absorbed a wave of engineers who had real ML intuition from the classical era, plus a much larger wave of engineers who learned on the job by wiring API calls to OpenAI and Anthropic. Both populations are useful. They are also radically different hires, and they fail in different ways when placed in the wrong seat.

What a typical AI resume tells you:

  • Which frameworks someone has touched (PyTorch, JAX, TensorFlow, LangChain).
  • Which companies they worked at and for how long.
  • Which models or papers they were adjacent to.
  • Whether they have a graduate degree in a quantitative field.

What a typical AI resume does not tell you:

  • Whether they have ever owned a model in production past the first week.
  • How they handle the moment a model regresses silently on real user traffic.
  • Whether their intuition for cost-per-query is grounded in actual invoices or vibes.
  • How they decide when not to use an LLM.

The second list is where the job lives. The first list is where most interview loops stay.

A resume tells you which tools someone has been near. It does not tell you which ones have blown up in their hands.

What bad screening looks like

There are a few failure modes we see on repeat. They are worth naming because they persist at companies with otherwise excellent engineering cultures. Smart teams miss on AI hires because they are screening the way they screen for backend engineers, and the translation does not hold.

Asking about papers

Publication record is a signal, but it is a signal about a different job. A researcher who has authored a strong paper has demonstrated they can frame a novel problem, run experiments cleanly, and write up results. They have not necessarily demonstrated anything about training stability at scale, serving latency budgets, or the grubby work of keeping a model honest in production. When a founder leads an interview with "walk me through your favorite paper," they are optimizing for candidates who enjoy that conversation. Great for a research scientist role. Not the signal you want for someone who needs to keep a recommender from quietly rotting over six months.

LeetCode-style ML questions

"Implement k-means from scratch on a whiteboard." "Derive backpropagation for a two-layer network." These questions sort for a skill that almost never binds in real work. Production ML engineers do not implement k-means. They choose between a managed service and an open-source library, tune the inputs, and spend most of their time on data quality, feature drift, and the evaluation harness. Whiteboard derivations test whether a candidate studied for the interview. They do not test whether the candidate will notice that your training set has a 3% label noise problem that is capping model accuracy at 88%.

Generic system design

"Design Twitter" with an ML twist tacked on the end ("now add a recommendation feed") tends to produce a conversation about Kafka and Redis with a hand-wave at a model. This is fine for a staff backend engineer. It is not diagnostic for an MLE. The specific hard parts of ML systems (feature stores, online-offline parity, shadow deployments, evaluation infrastructure, the retraining loop) rarely come up unless the interviewer specifically steers toward them. And if the interviewer is a backend generalist, they often cannot steer there, because those are not the systems they build.

Take-homes that test the wrong thing

A clean notebook that trains a model on a toy dataset and reports an F1 score is a fine signal that someone is literate with scikit-learn or PyTorch. It tells you almost nothing about the part of the job that actually fails. The interesting question is never "can you train a model." It is "what did you do when the first version was worse than the baseline, and how did you know it was worse."

What good screening looks like

The questions that actually predict performance are boring-sounding on the surface and deep underneath. They all probe for the same thing: has this person felt the specific pain of owning an AI system past the demo stage? The pain has a shape. You can hear it in an answer within about ninety seconds.

We cluster the questions we ask into five categories. You do not need to cover all five in a single screen, but a complete interview loop should hit most of them.

1. Production deployment experience

Has the candidate owned a model from training through serving, or have they handed it off at the notebook stage? The difference is everything. A candidate who has deployed will talk unprompted about versioning, rollback, canaries, and the day the model started returning NaN in prod. A candidate who has not will describe the model architecture in loving detail and get quiet when you ask how they monitored it.
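The difference shows up in small habits. A candidate who has been paged for NaN in prod tends to describe guardrails like the sketch below: a serving-time sanity check with a fallback value and a counter that alerting can watch. This is illustrative Python, not any particular stack; the model interface, the [0, 1] score range, and the fallback are all assumptions.

```python
import math

def guarded_predict(model, features, fallback_score=0.0):
    """Serve a prediction with a sanity check and a safe fallback.

    If the model returns NaN/inf or an out-of-range score, return a
    safe default and count the event. In a real system the counter
    would be a metrics client (e.g. a Prometheus counter) wired to
    an alert on the bad-output rate; here it is a plain attribute.
    """
    score = model.predict(features)
    if not math.isfinite(score) or not (0.0 <= score <= 1.0):
        guarded_predict.bad_outputs += 1
        return fallback_score
    return score

guarded_predict.bad_outputs = 0
```

The interesting part is not the check itself but that a deployed candidate will volunteer it, along with what the alert threshold should be and what the rollback plan is when it fires.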

2. RAG and retrieval pipeline design

Almost every applied AI team in 2026 has a retrieval system somewhere. The people who have actually built one have strong opinions about chunking strategy, embedding model choice, hybrid search, reranking, and the failure modes of naive cosine similarity on heterogeneous corpora. The people who have read about it have a diagram. Ask specifically about what broke and what they changed in response.
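As one concrete example of why "hybrid search" is more than a buzzword in these answers: reciprocal rank fusion (RRF) is a common way to combine a dense embedding ranking with a keyword ranking, and it rewards documents that rank well in any list rather than trusting cosine similarity alone. The sketch below assumes each retriever returns a best-first list of document ids; the constant k=60 is the conventional default, not a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of doc ids into one hybrid ranking.

    rankings: list of lists, each ordered best-first (e.g. one from
    dense/embedding search, one from keyword/BM25 search). Each doc
    scores 1/(k + rank) per list it appears in, so agreement between
    retrievers outweighs a high rank in any single one.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A candidate who has built one of these can tell you why they fused instead of re-embedding, and what broke before they did.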

3. Model evaluation and failure analysis

This is the most underrated category. Anyone can run an eval. Fewer people can design one that actually tracks what the product cares about, and fewer still can reason about the cases where their eval disagrees with user behavior. Ask a candidate to describe a time they discovered their evaluation metric was lying to them. If they have a real story, they are a senior MLE. If they look puzzled, they have not been senior long.
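One way strong candidates make "my eval was lying to me" concrete is to measure how often the offline verdict agrees with real user behavior on the same interactions. A minimal sketch, assuming you have paired per-example booleans (both names here are hypothetical, not any standard API):

```python
def eval_online_agreement(offline_pass, user_accepted):
    """Fraction of sampled interactions where the offline eval and
    real users agree.

    offline_pass / user_accepted: parallel lists of bools, one per
    sampled production interaction. Low agreement means optimizing
    the offline metric will not move the product metric.
    """
    if len(offline_pass) != len(user_accepted):
        raise ValueError("need one offline verdict per user signal")
    agree = sum(o == u for o, u in zip(offline_pass, user_accepted))
    return agree / len(offline_pass)
```

The number itself matters less than the follow-up: a senior MLE will segment the disagreements and tell you which slice of traffic the eval is blind to.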

4. Inference infra and cost tradeoffs

Cost per query is an engineering constraint now, not a finance afterthought. A strong candidate can speak fluently about quantization, batching, KV-cache reuse, speculative decoding, and when to serve a smaller model with prompt engineering instead of a larger model without it. They will also have an opinion on the correct ratio of capability to cost for a specific use case, grounded in real numbers from real workloads.
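The arithmetic behind those opinions is not complicated, which is exactly why it is telling when a candidate cannot do it on the spot. A back-of-envelope sketch, assuming provider prices quoted in USD per million tokens (as most are) and a simple response cache that makes hits effectively free:

```python
def cost_per_query(prompt_tokens, output_tokens,
                   price_in_per_m, price_out_per_m,
                   cache_hit_rate=0.0):
    """Back-of-envelope average cost per query for an LLM endpoint.

    Prices are USD per million tokens. A response cache with hit
    rate h scales the average cost by (1 - h), since cache hits
    skip the model call entirely.
    """
    raw = (prompt_tokens * price_in_per_m
           + output_tokens * price_out_per_m) / 1_000_000
    return raw * (1.0 - cache_hit_rate)
```

With illustrative numbers (1,000 prompt tokens, 500 output tokens, $3/$15 per million), the raw cost is about $0.0105 per query, and a 30% cache hit rate brings the average to roughly $0.0074. A strong candidate runs this kind of math unprompted before proposing architecture changes.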

5. Debugging non-deterministic systems

Traditional software either works or throws. ML systems are probabilistic. A weak candidate will say "I'd add more logging" and stop. A strong candidate will describe the difference between a data bug, a distribution shift, a labeling artifact, and a genuine model regression, and will walk through how they tell them apart. This is the single hardest skill to fake and the single most valuable one to hire for.
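One concrete version of "how they tell them apart" is a distribution-shift check such as the population stability index, comparing a training-time feature sample against live traffic. The sketch below is a plain-Python PSI with the usual smoothing for empty bins; the 0.1 / 0.25 thresholds mentioned in the docstring are industry rules of thumb, not guarantees.

```python
import math

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference sample (e.g. training data) and live traffic.

    Rule of thumb: PSI < 0.1 is stable, 0.1-0.25 warrants a look,
    > 0.25 suggests real distribution shift. Bins are fixed from the
    reference sample's range; out-of-range live values land in the
    edge bins.
    """
    lo, hi = min(expected), max(expected)
    span = (hi - lo) or 1.0

    def bin_fractions(sample):
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / span * bins)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Smooth empty bins so log() is always defined.
        return [(c + 1e-6) / (len(sample) + 1e-6 * bins) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A strong candidate will also say when PSI is the wrong tool: it catches marginal shift in one feature, not a labeling artifact or a genuine regression, which is exactly the distinction the question is probing.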

You cannot screen for production AI skill with questions that have clean, textbook answers. The answers live in war stories.

A concrete example: three questions we ask every ML candidate

To make this less abstract, here are three of the questions we use in our technical screens at Engineers in AI. We run these before the candidate ever gets to the client. They are not trick questions. They are designed to give strong candidates room to show depth, and to surface quickly when that depth is not there.

Question 1. Tell me about a model you shipped that regressed in production. How did you find out, and what did you do?

Strong answer: The candidate walks through a specific incident with timestamps and numbers. They distinguish between what the monitoring caught, what the monitoring missed, and what a human finally noticed. They can describe the root cause (input distribution shift, upstream schema change, a silent label-pipeline bug) and the fix. They mention what they changed in the monitoring or evaluation setup afterward so the same class of bug would be caught earlier next time.

Weak answer: "We had monitoring dashboards and they showed the regression." Or they pivot to a training-time issue instead of a production one. If they have never been on-call for a model, this question flushes that out in under two minutes.

Question 2. You have a RAG system. Retrieval recall looks fine on your benchmark, but users say the answers feel wrong. Where do you look?

Strong answer: The candidate immediately questions the benchmark. They talk about whether the eval set reflects real user queries, whether "recall" is measuring what the product needs, whether the retriever is returning technically-relevant-but-practically-useless chunks, whether the reranker is collapsing diversity, whether the generator is ignoring retrieved context, and whether the chunks themselves are sized wrong for the query shape. They have a mental decision tree and they walk it quickly.

Weak answer: "I would increase top-k." Or they go straight to the embedding model without interrogating the data or the eval. The weak answer treats the system as a black box with knobs. The strong answer treats it as a pipeline with specific joints that can fail.

Question 3. Your model costs $0.08 per query. The product team needs it under $0.02. Walk me through your options.

Strong answer: The candidate opens with questions about the workload (throughput, latency budget, tolerance for quality regression) before they propose solutions. They discuss caching (at the prompt, embedding, and response level), batching, distillation to a smaller model, quantization, switching providers, rewriting the prompt to reduce tokens, and hybrid approaches where an expensive model is used only for hard cases. They have an intuition for which levers are worth pulling first based on the workload shape.

Weak answer: "Use a cheaper model," offered without any interrogation of the quality tradeoff. Or a laundry list of optimizations without any ordering or judgment about what to try first.
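The "hybrid approach" in the strong answer is often a cascade: serve the cheap model by default and escalate only when it is unsure. A minimal sketch, assuming each model returns an (answer, confidence) pair; how you get that confidence (logprobs, a verifier model, a heuristic) is itself a design decision the candidate should defend.

```python
def cascade_answer(query, cheap_model, strong_model,
                   confidence_threshold=0.8):
    """Two-tier cascade: try the cheap model, escalate when unsure.

    If most traffic is easy, average cost drops toward the cheap
    model's price while hard cases keep strong-model quality. The
    second element of the return value records which tier answered,
    so the escalation rate can be monitored.
    """
    answer, confidence = cheap_model(query)
    if confidence >= confidence_threshold:
        return answer, "cheap"
    answer, _ = strong_model(query)
    return answer, "strong"
```

The threshold is the whole game: set it from a labeled sample of real queries, not intuition, and track the escalation rate so you notice when the traffic mix changes.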

Notice what these questions have in common. None of them have a single correct answer. All of them reward specificity. All of them are easy to answer well if you have done the work, and impossible to fake convincingly if you have not. That is the bar a screen needs to clear in AI hiring right now.

How Engineers in AI approaches AI screening

Our working assumption is that the resume is a starting point for a conversation, not a verdict. Every candidate we put in front of a client has been through a technical screen run by someone who has shipped the kind of work they would be shipping. That sounds obvious. It is rarer than it should be, because most recruiters cannot run that screen themselves. They rely on the client's interview loop to do the filtering, which means that loop is doing two jobs at once: weeding out the candidates who only look the part, and evaluating the ones who are real. That is the stage where good pipelines go wrong.

We do this because of how the business is structured. Tony's 20 years in engineering before the recruiting work means the screens are a conversation between operators, not a checklist read off a form. The 20% flat fee keeps the incentive clean: we do not earn more by placing a weaker candidate faster, so we would rather lose a placement than push someone whose production chops we cannot vouch for. Across 1,000-plus placements at Agoda, Hearst, Con Edison, Trilogy, and others, the pattern holds. The hires that last are the ones where the screen was honest.

If you are hiring AI or ML engineers right now, the single most useful thing you can do this quarter is audit your own screening loop against the five categories above. Who on your team asks about production regressions? Who pushes on evaluation design? Who can tell whether a RAG answer is a sourcing problem, a retrieval problem, or a generation problem? If the answer is "nobody, or one person who is always busy," your loop is leaky. Hires will slip through. Some will work out. Some will cost you four months and a rehire.

The field is moving too fast to hire on proxies. The people who are good at this work leave fingerprints: specific stories, specific numbers, specific regrets. A screen that asks for those fingerprints, and knows what they look like when it gets them, will outperform a screen that asks for credentials. Every time.

Hiring AI Engineers

Get a screen that actually filters.

If your AI pipeline is leaking at the interview stage, we can help. A 30-minute hiring call will tell you whether your current loop is asking the right questions, and whether we are the right partner for the roles you need to fill this quarter.

Book a hiring call