Role Guide · 16 min read

Data Scientist Interview Prep: The Complete 2026 Guide

A practical guide to 2026 data scientist interviews — SQL screens, statistics, ML case studies, A/B testing, and the product-data hybrid rounds at FAANG — with how a real-time AI copilot actually fits each round.

Devon Park

Head of Research, Acedly

Why the DS interview looks the way it does in 2026

The data scientist title in 2026 is not the same job it was five years ago. Two structural shifts compressed the role from both sides. From the modelling side, ML engineers absorbed the productionisation work — anything that ships to a serving stack is now an MLE job, not a DS one. From the analytics side, "analytics engineers" with strong dbt and warehouse skills absorbed the dashboarding and metric-definition work. What's left in the middle, where most "DS" job postings now live, is product data science — the role that owns experimentation, metric design, and the statistical rigour behind product decisions.

This matters for interview preparation because the loop has tracked the role. Five years ago a DS loop was 60% modelling and 40% SQL. In 2026 it's closer to 60% SQL and experimentation, 25% product and metric sense, and 15% modelling — and the modelling round, when it appears, is increasingly a "case study" rather than a coding problem. Get the weighting wrong in your prep and you'll over-rotate on Kaggle-style competitions while the actual interview is asking you to diagnose why a metric dropped.

There is a second track — the ML-applied DS roles at companies that still distinguish between research scientists, ML engineers, and applied data scientists (Netflix, Stripe, Anthropic, DeepMind in the limited cases where they hire under the DS title). These loops invert the weighting back to ~50% modelling depth, with the case study round expected to go three layers deep — feature design, eval methodology, online evaluation — rather than skimming the surface. If you're targeting these roles, prep accordingly; if you're targeting Meta or Airbnb, do not.

The 2026 DS interview loop, stage by stage

A typical 2026 data scientist loop runs four to six stages over three to five weeks. The exact composition varies by company and track, but the shape is consistent.

Recruiter screen (30 minutes). Mostly logistics and compensation expectations, plus a few "tell me about yourself" probes. The signal here is whether you can articulate the kind of problems you've worked on in plain English — not jargon-heavy, not overly modest. Two to three crisp project descriptions, each tied to a measurable business outcome, is what gets you to the next round.

SQL / coding screen (45–60 minutes). This is the technical filter. Live coding in StrataScratch, DataLemur, Coderpad, or HackerRank — depending on the company. Two to three medium SQL problems plus, occasionally, a Python data-manipulation problem. The bar is correctness on the first run, decent variable naming, and an explanation of edge cases the interviewer didn't ask about. Time pressure is real; most candidates fail by overcomplicating the join.

On-site or virtual on-site (4–5 rounds, typically 45 minutes each). This is where the loops diverge by company:

  • Meta runs a SQL deep dive, a product/metric sense round, an A/B testing round, and a behavioural round ("what's a project you're proud of?"). The product round is the most heavily weighted.
  • Google is broader: SQL, statistics, an ML case study, a product round, and a "Googleyness" behavioural round. The ML case is more prominent than at Meta.
  • Amazon is leadership-principle-driven; expect every round to be threaded with LP probing, plus the technical content. SQL is short; statistics is short; the DS-specific round is usually a metric-design problem framed in LP language.
  • Netflix is the strategic-thinking outlier — fewer rounds, higher signal expected per round, and a strong emphasis on writing. You may be asked to write a one-page memo explaining your analysis.
  • Airbnb weighs host-side metrics heavily ("how would you measure host churn?") and runs a long product round.
  • ML-research-leaning shops (DeepMind, Anthropic, OpenAI in the cases where they hire DS at all) run a paper-discussion round and a deep modelling round in addition to the standard rota.

Hiring manager round (45 minutes). Typically scheduled last, sometimes between technical rounds. Less technical, more about fit and how you'd structure your first 90 days. Senior candidates should expect strategic questions — "what's the most important metric this team should track that they currently don't?"

The total time investment from first recruiter call to offer is rarely under three weeks, and frequently five to six. Build your prep schedule around that, not around a single high-stakes day.

SQL rounds: what's actually screened in 2026

The SQL screen is where most candidates lose the offer, and the gap between "I know SQL" and "I can pass a 60-minute SQL screen" is wider than candidates realise. Two skills dominate, window functions and CTE-based transformations, and they are usually tested through one of four canonical patterns.

Window functions — almost universally tested. The specific functions that come up over and over: ROW_NUMBER(), RANK(), DENSE_RANK() for top-N-per-group; LAG() and LEAD() for period-over-period change; SUM() OVER (PARTITION BY ... ORDER BY ...) for running totals. If you cannot write a top-N-per-group query without thinking, you will fail. The classic trap is reaching for self-joins or correlated subqueries when a window function would do it in five lines.
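As a minimal sketch of the top-N-per-group pattern, the query below ranks customers by spend within each region and keeps the top two. The orders table, its columns, and the rows are hypothetical, and the query runs through Python's built-in sqlite3 module only so it is easy to try locally (window functions need SQLite 3.25 or newer).

```python
import sqlite3

# Hypothetical table and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, region TEXT, spend REAL);
    INSERT INTO orders VALUES
        ('ana', 'EU', 120), ('ben', 'EU', 300), ('cy', 'EU', 50),
        ('dee', 'EU', 210), ('eli', 'US', 90), ('fay', 'US', 400),
        ('gus', 'US', 75);
""")

TOP_N_PER_GROUP = """
WITH ranked AS (
    SELECT
        region,
        customer,
        spend,
        -- Number rows within each region, highest spend first.
        ROW_NUMBER() OVER (
            PARTITION BY region
            ORDER BY spend DESC
        ) AS rn
    FROM orders
)
SELECT region, customer, spend
FROM ranked
WHERE rn <= 2          -- keep the top 2 per region
ORDER BY region, rn;
"""

top_two = conn.execute(TOP_N_PER_GROUP).fetchall()
for row in top_two:
    print(row)
```

ROW_NUMBER() returns exactly N rows per group even when spend is tied; swapping in RANK() or DENSE_RANK() keeps ties, which is worth saying out loud before the interviewer asks.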

CTEs and multi-step transformations — the modern style. Anything more complex than a single join is expected to be expressed as a chain of CTEs, named clearly, with each step doing one thing. The interviewer is looking at readability as well as correctness; a 40-line CTE chain with descriptive names beats a 12-line nested-subquery solution every time.

The four canonical patterns:

  1. Top-N per group (find the top 3 customers by spend in each region) — window function with RANK() or ROW_NUMBER(), filter on rank.
  2. Retention and cohort analysis (what fraction of January signups returned in February) — self-join on user_id with date arithmetic, or window functions for active-day flags.
  3. Funnel conversion (signup → activation → first purchase) — staged CTEs with LEFT JOIN or EXISTS checks, computing conversion rates between stages.
  4. Sessionisation (group rows into sessions where consecutive events are within 30 minutes) — LAG() to compute time deltas, then a running sum of "new session" flags.
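Pattern 4 is the one candidates most often describe correctly and then fail to write. A sketch under stated assumptions: a hypothetical events table with user_id and an integer ts in epoch seconds, a 30-minute threshold, and SQLite 3.25 or newer for window-function support.

```python
import sqlite3

# Hypothetical events; ts is in epoch seconds.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, ts INTEGER);
    INSERT INTO events VALUES
        (1, 0), (1, 600), (1, 1500),   -- gaps of 10 and 15 minutes
        (1, 5400),                     -- 65-minute gap: new session
        (2, 100), (2, 1900),           -- exactly 30 minutes: same session
        (2, 3701);                     -- 30 min 1 s gap: new session
""")

SESSIONISE = """
WITH deltas AS (
    -- Step 1: seconds since the same user's previous event.
    SELECT user_id, ts,
           ts - LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) AS gap
    FROM events
),
flags AS (
    -- Step 2: flag the first event, or any event after a gap over 30 minutes.
    SELECT user_id, ts,
           CASE WHEN gap IS NULL OR gap > 1800 THEN 1 ELSE 0 END AS is_new
    FROM deltas
)
-- Step 3: a running sum of the flags is the per-user session number.
SELECT user_id, ts,
       SUM(is_new) OVER (
           PARTITION BY user_id ORDER BY ts
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS session_no
FROM flags
ORDER BY user_id, ts;
"""

sessions = conn.execute(SESSIONISE).fetchall()
for row in sessions:
    print(row)
```

Whether a gap of exactly 30 minutes starts a new session is a boundary choice; this sketch treats it as the same session, and stating that choice aloud is part of the answer.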

The most common mistake by far is wrong granularity. If your output should have one row per user-day and you accidentally produce one row per event, every downstream count is wrong by orders of magnitude. Always state out loud what the granularity of the output should be before writing the query, and verify with a small SELECT COUNT(*) after.
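One way to make the granularity check concrete, sketched against a hypothetical events table: compare the row count of your result with the number of distinct keys at the grain you claimed. If the two differ, the grain is wrong.

```python
import sqlite3

# Hypothetical event log: several events per user per day.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, event_date TEXT, event_type TEXT);
    INSERT INTO events VALUES
        (1, '2026-01-05', 'open'), (1, '2026-01-05', 'click'),
        (1, '2026-01-06', 'open'),
        (2, '2026-01-05', 'open'), (2, '2026-01-05', 'open'),
        (2, '2026-01-05', 'purchase');
""")

# Intended grain: one row per user-day.
USER_DAY = """
    SELECT user_id, event_date, COUNT(*) AS n_events
    FROM events
    GROUP BY user_id, event_date
"""

# Rows actually produced by the query.
n_rows = conn.execute(f"SELECT COUNT(*) FROM ({USER_DAY})").fetchone()[0]
# Distinct keys at the intended grain.
n_keys = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT user_id, event_date FROM events)"
).fetchone()[0]
# Rows at the raw event grain, for contrast.
n_events = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]

print(n_rows, n_keys, n_events)
assert n_rows == n_keys, "output is not one row per user-day"
```

Here the user-day result has three rows against six raw events, so counting "active days" from the event table directly would double the answer.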

Statistics and probability rounds

The stats round is the most variable across companies. Some are theoretical (Bayes derivations, distribution properties); some are applied ("what test would you run for this scenario?"). Four sub-categories cover most of what's asked.

Bayesian / conditional probability. The Monty Hall problem still gets asked, surprisingly often, and related conditional-probability puzzles ("two coins, one fair one biased, you flip and see heads, what's P(biased)?") appear at most companies. The mechanical procedure is to write Bayes' theorem out, identify the prior, the likelihood, and the evidence, and compute. Doing this on a whiteboard while talking is the actual skill being tested; getting the answer is necessary but not sufficient.
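The two-coin question can be worked mechanically. The sketch below assumes the coin is picked at random (prior of 1/2) and that the biased coin lands heads with probability 3/4; the prompt leaves the bias unspecified, so that number is an assumption for illustration.

```python
from fractions import Fraction


def posterior_biased(prior_biased, p_heads_biased, p_heads_fair=Fraction(1, 2)):
    """P(biased | heads) by Bayes' theorem for a single observed flip."""
    prior_fair = 1 - prior_biased
    # Evidence: total probability of seeing heads across both coins.
    evidence = prior_biased * p_heads_biased + prior_fair * p_heads_fair
    # Posterior = prior * likelihood / evidence.
    return prior_biased * p_heads_biased / evidence


# Assumed inputs: coin chosen uniformly, biased coin shows heads 3/4 of the time.
post = posterior_biased(Fraction(1, 2), Fraction(3, 4))
print(post)  # 3/5
```

Using exact fractions mirrors the whiteboard arithmetic: prior times likelihood is 3/8, the evidence is 5/8, and the ratio is 3/5.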

Distributions and when to assume them. Normal approximations are useful but candidates over-apply them. The interviewer wants to hear: "I'd assume normal here because n is large enough that CLT applies, but I'd verify by checking the residuals; if the underlying data is heavy-tailed I'd reach for a t-distribution or a non-parametric alternative." Naming the assumption and the verification is the senior signal.

Hypothesis testing. The standard "what test would you run?" framework: identify the metric type (proportion, mean, count, ratio), check the assumptions (independence, normality, sample size), pick the test (z-test for proportions, t-test for means, chi-square for categorical, Mann-Whitney for non-normal), state the null and alternative, define the significance level, and discuss multiple-comparisons correction if applicable. You should be able to walk through this in 90 seconds for any given scenario.
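For the proportion case, the framework reduces to a few lines. This is a sketch of a pooled two-proportion z-test using only the standard library; the conversion counts are made up, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(x_a, n_a, x_b, n_b):
    """Two-sided z-test for H0: p_a == p_b, using the pooled standard error."""
    p_a, p_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 10.0% vs 11.0% conversion with 10,000 users per arm.
z, p_value = two_proportion_ztest(x_a=1000, n_a=10_000, x_b=1100, n_b=10_000)
print(f"z = {z:.2f}, p = {p_value:.3f}")

# Bonferroni correction when this is one of m metrics being tested.
m = 5
significant_after_correction = p_value < 0.05 / m
```

With these numbers the uncorrected p-value clears 0.05 but not the Bonferroni threshold of 0.01, which is exactly the multiple-comparisons point the interviewer is listening for.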

Confidence intervals. The trap is the interpretation. "There's a 95% chance the true mean is in this interval" is wrong — frequentist CIs don't make probability statements about parameters. The correct statement is: "if I repeated this experiment many times, 95% of the constructed intervals would contain the true mean." Get this wrong and stats-fluent interviewers will note it.
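The repeated-sampling interpretation is easy to demonstrate with a small simulation. The sketch below assumes normally distributed data with a known standard deviation (so a z-interval applies); the true mean, sigma, sample size, and seed are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def coverage(true_mean=10.0, sigma=2.0, n=30, reps=2000, level=0.95, seed=0):
    """Fraction of simulated intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / sqrt(n)  # known-sigma z-interval keeps the sketch simple
    hits = 0
    for _ in range(reps):
        sample_mean = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / reps


observed_coverage = coverage()
print(f"{observed_coverage:.3f} of intervals contained the true mean")
```

The true mean never moves; the intervals do. The printed fraction should land near 0.95, which is the long-run property the 95% refers to.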

A/B testing and experimentation: the bread and butter

This is the round most weighted at product-DS companies (Meta, Airbnb, Uber). The expected rubric, at minimum:

  1. Hypothesis — what do you think will happen and why? Tied to a behavioural mechanism, not just "we think this will be good."
  2. Metric — primary success metric, secondary metrics, guardrail metrics. Senior candidates always specify guardrails before being asked.
  3. Power calculation — how many samples to detect an X% lift at 80% power, 5% significance? You should be able to estimate this without a calculator using the rule-of-thumb n ≈ 16 × σ² / δ² per arm.
  4. Randomisation unit — user, session, device? The interviewer will probe this; pick deliberately.
  5. Guardrails and SRM check — sample ratio mismatch (the actual split deviates from the intended 50/50) is the most common signal of a broken experiment, and senior candidates check for it before reporting results.
  6. Analysis — point estimate, confidence interval, p-value (with multiple-comparisons correction if applicable), and the practical-versus-statistical-significance distinction.
  7. Decision — ship, don't ship, or iterate, with the trade-off explicit.
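Steps 3 and 5 are the two with real arithmetic in them. The sketch below compares the 16σ²/δ² rule of thumb with the normal-approximation formula it comes from, and runs a chi-square SRM check with one degree of freedom. The baseline rate, effect size, and arm counts are hypothetical, and the function names are mine.

```python
from math import ceil, erfc, sqrt
from statistics import NormalDist


def n_per_arm_rule_of_thumb(sigma2, delta):
    """n ~ 16 * sigma^2 / delta^2 per arm (80% power, 5% two-sided alpha)."""
    return ceil(16 * sigma2 / delta ** 2)


def n_per_arm_normal(sigma2, delta, alpha=0.05, power=0.80):
    """Normal-approximation sample size: 2 * (z_alpha/2 + z_power)^2 * sigma^2 / delta^2."""
    z = NormalDist()
    k = (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2
    return ceil(2 * k * sigma2 / delta ** 2)


def srm_p_value(n_control, n_treatment, expected_share_control=0.5):
    """Chi-square goodness-of-fit p-value for the observed split (1 degree of freedom)."""
    total = n_control + n_treatment
    exp_c = total * expected_share_control
    exp_t = total - exp_c
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    return erfc(sqrt(chi2 / 2))  # survival function of chi-square with 1 df


# Hypothetical test: 10% baseline conversion, detect a 1-point absolute lift.
baseline = 0.10
sigma2 = baseline * (1 - baseline)  # variance of a Bernoulli metric
rough_n = n_per_arm_rule_of_thumb(sigma2, delta=0.01)
exact_n = n_per_arm_normal(sigma2, delta=0.01)
print(rough_n, exact_n)

# Hypothetical observed split on an intended 50/50 allocation.
srm_p = srm_p_value(50_000, 51_500)
print(f"SRM p-value: {srm_p:.2g}")
```

The rule of thumb gives roughly 14,400 per arm against roughly 14,100 from the fuller formula, close enough for a whiteboard. The 16 is just 2 × (1.96 + 0.84)² rounded up. A tiny SRM p-value, as here, means the split is unlikely under correct randomisation and the results should not be read until the cause is found.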

The gotchas the interviewer will probe:

  • Novelty effects. Treatment looks great in week one and regresses by week three because users were just exploring the new feature.
  • Network effects. The classic Facebook News Feed gotcha — if the treatment changes who sees what, you can't randomise by user because the control group is contaminated by treated users' behaviour. The interviewer will sometimes phrase this as "what if we were testing a marketplace ranking change?" and they want to hear the network-interference framing.
  • Dilution. If only 10% of users see the feature, the lift on the full population is roughly the lift among those exposed users multiplied by 0.10, so a 5% lift among exposed users is about a 0.5% lift overall. Forgetting this turns a "5% lift" into a marketing claim that doesn't survive a second look.
  • Primary vs guardrail trade-offs. "What if revenue goes up but DAU goes down?" The senior answer involves the elasticity of the guardrail and a horizon question — short-term revenue lift that costs long-term engagement is rarely worth it.
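The dilution gotcha is plain arithmetic. As a sketch: if exposed users account for a share s of the baseline metric and the feature lifts their metric by l, the total moves from B to B(1 - s) + B·s(1 + l), so the overall relative lift is l × s. The function name is hypothetical, and using the user share for s assumes exposed and unexposed users have the same per-user baseline.

```python
def diluted_lift(lift_among_exposed, exposed_metric_share):
    """Relative lift on the whole population.

    exposed_metric_share is the fraction of the baseline metric that comes
    from exposed users; it equals the user share only if per-user baselines match.
    """
    return lift_among_exposed * exposed_metric_share


# 5% lift among the 10% of users who see the feature.
overall = diluted_lift(0.05, 0.10)
print(f"{overall:.2%}")  # 0.50%
```

If the exposed users are heavier than average (say they generate 30% of the metric while being 10% of users), the same 5% becomes 1.5% overall, so name which share you are using.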

A worked-example walkthrough — designing the experiment for a new homepage feed — should take about 8–10 minutes. Practice this until it's automatic.

ML case studies (the product DS angle)

When ML appears in a product DS loop, it appears as a case study, not a coding round. The framing is always some variant of "design a ranker for X" — feed ranking, search results, recommendations, ad selection. The expected structure:

  1. Business goal — what are we actually optimising for? Engagement, revenue, long-term retention? The senior signal is naming the long-term objective even when the proxy metric is short-term.
  2. Labels — what are the positive and negative classes? How are labels generated, and what biases does that introduce? (Position bias, selection bias, the cold-start problem.)
  3. Features — three to five categories: user features, item features, context features, interaction features, and (for sequence-aware models) recent-history features.
  4. Model class — gradient-boosted trees as the workhorse default, deep learning where the data and signal justify it. The senior candidate names the trade-off — interpretability, training cost, online-serving latency — rather than reaching for whatever's fashionable.
  5. Offline evaluation — AUC-ROC for classification, NDCG for ranking, RMSE for regression. The trap is stopping here; offline metrics correlate weakly with online business metrics, and the senior candidate names this.
  6. Online evaluation — A/B test design, primary and guardrail metrics, the loop back to the experimentation round.
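For step 5, it helps to be able to compute NDCG by hand rather than only name it. The sketch below uses the linear-gain form of DCG (relevance divided by log2 of rank + 1); an exponential-gain form, 2^rel - 1, is also common, so say which one you mean. The relevance labels are invented.

```python
from math import log2


def dcg(relevances):
    """Discounted cumulative gain with linear gain; ranks start at 1."""
    return sum(rel / log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg_at_k(relevances_in_ranked_order, k):
    """DCG of the model's top k, divided by the DCG of the ideal ordering."""
    ideal = sorted(relevances_in_ranked_order, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    if ideal_dcg == 0:
        return 0.0  # no relevant items at all
    return dcg(relevances_in_ranked_order[:k]) / ideal_dcg


# Graded relevance of the items in the order the ranker returned them.
perfect = ndcg_at_k([3, 2, 1], k=3)
reversed_order = ndcg_at_k([1, 2, 3], k=3)
print(perfect, round(reversed_order, 2))  # 1.0 0.79
```

Even a fully reversed list scores 0.79 here, a reminder that NDCG values sit in a narrow band and small offline differences need an online test before they mean anything.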

The honest depth expected by level: at L4 (junior) you can describe each step. At L5 (mid) you can argue trade-offs at each step. At L6 (senior) you can identify the two or three steps where this case study is unusual — what makes it harder than the textbook framing — and propose how you'd handle them.

Product / metric rounds: the DAU drop

The signature product round, asked at almost every company that does product DS interviews, is some variant of: "DAU dropped 5% week-over-week. Walk me through how you'd diagnose."

The expected framework, executed live:

  1. Validate the data first. Is the metric actually down, or is this an instrumentation issue? Check the logging pipeline, check for upstream changes, look for partial-day data.
  2. Segment. By geography, platform, OS, user cohort, acquisition channel. The drop is rarely uniform; finding the segment localises the cause.
  3. Decompose by behaviour. DAU = new users + returning users. Did new-user signup drop? Did returning-user retention drop? These have completely different causes.
  4. Decompose by funnel. Within each behaviour group: did app opens drop? Did the open-to-engagement rate drop? Each step has different upstream causes.
  5. Cross-reference with external events. Product launches (yours and competitors'), news cycles, holidays, paid-marketing changes, infrastructure incidents.
  6. Form a hypothesis, design a verification. Once you have a candidate explanation, what data would falsify it?
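Steps 2 to 4 all come down to the same calculation: split the metric into additive parts and ask how much of the total change each part explains. A minimal sketch with made-up platform numbers that add up to a 5% drop; the function and field names are hypothetical.

```python
def segment_contributions(last_week, this_week):
    """Per-segment change and its share of the total change, largest share first."""
    total_change = sum(this_week.values()) - sum(last_week.values())
    if total_change == 0:
        raise ValueError("total metric did not change; nothing to attribute")
    rows = []
    for segment, before in last_week.items():
        change = this_week[segment] - before
        rows.append({
            "segment": segment,
            "pct_change": change / before,                 # how hard the segment was hit
            "share_of_total_change": change / total_change,  # how much of the drop it explains
        })
    return sorted(rows, key=lambda r: r["share_of_total_change"], reverse=True)


# Hypothetical DAU by platform: 1,000,000 last week, 950,000 this week.
last_week = {"iOS": 400_000, "Android": 500_000, "Web": 100_000}
this_week = {"iOS": 398_000, "Android": 453_000, "Web": 99_000}

contributions = segment_contributions(last_week, this_week)
for row in contributions:
    print(f"{row['segment']:8s} {row['pct_change']:+.1%}  "
          f"{row['share_of_total_change']:.0%} of the drop")
```

In this example Android is down 9.4% and explains 94% of the drop, which turns "DAU fell 5%" into "something happened on Android". The same function works for new versus returning users or for funnel stages, as long as the parts sum to the whole.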

At Meta and Google, the expected output is a "metric tree" — a visual decomposition of the metric showing every input and the relative magnitude of its change. The senior candidate draws the tree before talking, then walks through it; the junior candidate talks first and never gets to the tree.

Where a real-time AI assistant helps DS rounds — and where it doesn't

Be honest about this. The DS loop has rounds where AI help can take you most of the way there, and rounds where it makes you sound like a fraud. The grid below is what we tell our own users.

AI assistance fit by DS interview round
| Feature | SQL | Stats | A/B testing | ML case | Product / metric | Behavioural |
| --- | --- | --- | --- | --- | --- | --- |
| AI help quality | Excellent | Good | Strong | Strong | Moderate | Strong |
| Latency requirement | Sub-200 ms (live coding) | Conversational | Conversational | Conversational | Conversational | Conversational |
| Stealth requirement | High (screen share) | Medium | Medium | Medium | Medium | Medium |
| Ethical comfort | Contested | Comfortable | Comfortable | Comfortable | Comfortable | Personal call |
| Recommended use mode | Script close to verbatim | Thinking aid | Framework prompt | Outline + you fill in | Brainstorm; defend yourself | Outline only; say it in your voice |

The honest summary: SQL screens are where AI assistance comes closest to taking the work over, because the syntax is rigid, the edit-distance from prompt to working query is small, and good assistants like Acedly read the editor directly. In statistics and A/B testing rounds, the AI is extremely useful for surfacing the right framework and the right test, but you still have to defend the answer when the interviewer probes. In product/metric rounds, the AI is a brainstorming partner but cannot defend the answer for you: the interviewer will ask "why that segment, not this one?" and you need to have an opinion. In behavioural rounds, the AI can produce a structure (situation-task-action-result), but the content has to be yours, in your voice, or it will sound rehearsed.

Acedly during a live DS round

Acedly is built for live human interviews where you control the disclosure. Specifically, for data-scientist rounds, five things matter:

Latency. Median end-to-end latency is approximately 98 ms — measured from end-of-utterance to first rendered token. That budget matters most in the SQL coding screen, where the gap between "the AI helps" and "the AI is too slow to help" is the difference between writing your own answer and copying its.

Coding-platform editor reading. Most DS SQL screens run on Coderpad or HackerRank — both verified surfaces where Acedly reads the problem statement, the schema, and the partial query the candidate has written, and uses all three as grounding context. A copilot that only listens to audio leaves the schema on the table, which means worse SQL suggestions.

Multi-model routing. SQL questions route to DeepSeek for code generation; stats and probability route to Claude for reasoning quality; product/metric rounds route to GPT for the broader business context. The router selects per-question, not per-session.

Eight verified platforms. Zoom, Microsoft Teams, Google Meet, Webex, Lark/Feishu, Amazon Chime, Coderpad, HackerRank. Together these cover roughly 95% of professional DS interview surfaces in 2026.

12+ programming languages, including SQL. Python, R, SQL (PostgreSQL, MySQL, BigQuery, Snowflake dialects), Scala, Java, JavaScript/TypeScript, Go, Rust, C++, Julia, MATLAB, Bash. The SQL dialect detection matters because the window-function syntax differs subtly between Postgres and MySQL.

A four-week DS interview prep plan

If you have four weeks, the schedule below is a workable allocation. Adjust the weighting if you're targeting a non-standard track.

Week 1 — SQL drilling. Two hours a day. Fifty problems on StrataScratch or DataLemur, biased toward window functions, retention/cohort patterns, and funnel queries. Every problem gets a one-sentence post-mortem: which pattern, which window function, what was the granularity. By the end of the week you should be able to recognise top-N-per-group from the first sentence.

Week 2 — statistics, probability, and A/B testing. One hour of theory (review of distributions, hypothesis testing, Bayes), one hour of applied — work through ten classic A/B testing prompts (sample size, novelty effects, network effects, dilution). Practice the seven-step rubric out loud until it's automatic. Reading: Kohavi, Tang, Xu's Trustworthy Online Controlled Experiments covers nearly everything that gets asked.

Week 3 — product and metric mock cases. Three mock cases per day, 30 minutes each, on a variety of metrics. The DAU drop, the engagement decline, the conversion-rate drop, the churn spike. Use the metric-tree framework every time. Record yourself; play it back; the first three are bad, the tenth is automatic.

Week 4 — company-specific. If you're targeting Meta, drill product cases from the Decode and Conquer track. If you're targeting Google, broaden across SQL, stats, and ML cases. If you're targeting Amazon, prepare an LP-mapped portfolio of two stories per principle. The last 48 hours: rest. Lighter problems, sleep, the mental refresh of recognising you've done the work.

The single highest-leverage habit, across every week: write a one-sentence post-mortem for every problem you solve. After 50 problems, you'll have a rolling table of your own pattern-recognition history. After 100, you'll be diagnosing DAU drops in your sleep.

Frequently asked questions