Data Analyst · Interview Prep 2026

Top Data Analyst Interview Questions & Answers

Data analyst interviews test SQL proficiency, statistical reasoning, business acumen, and the ability to communicate insights clearly. Here's what to expect across technical and behavioral rounds.

SQL & Technical (5 questions) · Statistics & Analysis (5 questions) · Business Case (5 questions)

1. SQL & Technical

Write a SQL query to find the second-highest salary in a table.

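Using a subquery: SELECT MAX(salary) FROM employees WHERE salary < (SELECT MAX(salary) FROM employees); Or using DENSE_RANK(): SELECT DISTINCT salary FROM (SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk FROM employees) ranked WHERE rnk = 2; DENSE_RANK() gives tied salaries the same rank, so rank 2 is the second-highest distinct salary. DISTINCT collapses ties into one row and works in dialects that don't support LIMIT.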
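To try both approaches yourself, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows are hypothetical, and the window function assumes a SQLite build of 3.25 or later (bundled with most recent Python versions).

```python
import sqlite3

# Hypothetical sample data; table and column names follow the answer above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ana", 90000), ("Ben", 120000), ("Cy", 120000), ("Di", 105000)],
)

# Approach 1: the largest salary strictly below the overall maximum.
subquery_result = conn.execute("""
    SELECT MAX(salary) FROM employees
    WHERE salary < (SELECT MAX(salary) FROM employees)
""").fetchone()[0]

# Approach 2: DENSE_RANK() assigns tied salaries the same rank,
# so rank 2 is the second-highest distinct salary.
ranked_result = conn.execute("""
    SELECT DISTINCT salary FROM (
        SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
        FROM employees
    ) ranked
    WHERE rnk = 2
""").fetchone()[0]

print(subquery_result, ranked_result)
```

Two employees share the top salary here on purpose: it is the case interviewers tend to probe, and both queries still return the second distinct value.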

What is the difference between INNER JOIN, LEFT JOIN, and FULL OUTER JOIN?

INNER JOIN returns only matching rows from both tables. LEFT JOIN returns all rows from the left table and matching rows from the right (NULLs where no match). FULL OUTER JOIN returns all rows from both tables, with NULLs where there's no match on either side.
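A small sketch can make the three results concrete. It uses Python's built-in sqlite3 module with hypothetical customers and orders tables. Because older SQLite builds lack FULL OUTER JOIN, the full outer result is emulated here with a LEFT JOIN plus the unmatched right-hand rows; that emulation is this sketch's choice, not something the answer above requires.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');  -- Ben has no orders
    INSERT INTO orders VALUES (10, 1), (11, 3);           -- order 11 has no customer
""")

# INNER JOIN: only rows with a match on both sides.
inner_rows = conn.execute("""
    SELECT c.name, o.id FROM customers c
    INNER JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id
""").fetchall()

# LEFT JOIN: every customer, with NULL (None) where no order matches.
left_rows = conn.execute("""
    SELECT c.name, o.id FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id
""").fetchall()

# FULL OUTER JOIN (emulated): the LEFT JOIN rows plus orders with no customer.
full_rows = conn.execute("""
    SELECT c.name, o.id FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    UNION ALL
    SELECT NULL, o.id FROM orders o
    WHERE NOT EXISTS (SELECT 1 FROM customers c WHERE c.id = o.customer_id)
""").fetchall()

print(inner_rows)
print(left_rows)
print(full_rows)
```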

What is a window function and when would you use one?

Window functions perform calculations across a set of rows related to the current row — without collapsing them like GROUP BY. Examples: ROW_NUMBER(), RANK(), LAG(), LEAD(), SUM() OVER(). Use them for running totals, ranking within groups, calculating period-over-period changes, and comparing a row to its group average.
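As an illustration, the sketch below computes a running total and a month-over-month change in one pass. The monthly_sales table and its values are hypothetical, and the code assumes a SQLite build of 3.25 or later for window-function support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monthly_sales (month TEXT, revenue INTEGER);
    INSERT INTO monthly_sales VALUES
        ('2025-01', 100), ('2025-02', 150), ('2025-03', 120);
""")

# Every input row is kept; each window function adds a column computed
# over the rows up to (or just before) the current one.
rows = conn.execute("""
    SELECT month,
           revenue,
           SUM(revenue) OVER (ORDER BY month) AS running_total,
           revenue - LAG(revenue) OVER (ORDER BY month) AS change_vs_prev
    FROM monthly_sales
    ORDER BY month
""").fetchall()

for row in rows:
    print(row)
```

The first month has no previous row, so LAG() returns NULL and the change comes back as None.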

How would you find duplicate records in a SQL table?

SELECT column1, column2, COUNT(*) AS occurrences FROM table_name GROUP BY column1, column2 HAVING COUNT(*) > 1; This groups by the columns that define uniqueness and returns each combination that appears more than once, along with how many times it appears.
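Here is the same pattern run end to end, as a minimal sketch with Python's built-in sqlite3 module. The contacts table, its columns, and its rows are hypothetical stand-ins for the placeholder names in the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (email TEXT, phone TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?)",
    [
        ("a@example.com", "111"),
        ("a@example.com", "111"),  # exact duplicate of the row above
        ("b@example.com", "222"),
        ("a@example.com", "333"),  # same email, different phone: not a duplicate
    ],
)

# Group by the columns that define uniqueness; keep groups seen more than once.
duplicates = conn.execute("""
    SELECT email, phone, COUNT(*) AS occurrences
    FROM contacts
    GROUP BY email, phone
    HAVING COUNT(*) > 1
""").fetchall()

print(duplicates)
```

Which columns go in GROUP BY is the real interview question: grouping by email alone would flag three rows here instead of two.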

What is indexing in a database and why does it matter?

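An index is a data structure (most commonly a B-tree) that speeds up queries that filter, join, or sort on the indexed columns (WHERE, JOIN, and ORDER BY clauses). Without an index, the database performs a full table scan (O(n)). With a B-tree index, lookups are O(log n); with a hash index, equality lookups are O(1) on average. Trade-off: indexes speed up reads but slow down writes, because every INSERT, UPDATE, or DELETE must also update the index, and they consume storage.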
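You can watch the query plan change when an index is added. The sketch below uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module; the orders table and the index name are hypothetical, and the exact plan wording varies between SQLite versions, so the code only prints it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, i % 100, float(i)) for i in range(1000)],
)

query = "SELECT * FROM orders WHERE customer_id = 42"


def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


plan_before = plan(query)  # no index yet: expect a scan of the whole table
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan(query)   # with the index: expect a search that uses it

print("before:", plan_before)
print("after: ", plan_after)
```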

2. Statistics & Analysis

Explain p-value in plain language.

A p-value is the probability of observing a result at least as extreme as yours if the null hypothesis were true. A p-value of 0.03 means that, if nothing were actually happening, you'd see a result this extreme about 3% of the time by random chance. It is not the probability that the null hypothesis is true. A common threshold is p < 0.05, but clearing it doesn't mean the effect is large or practically important: statistical significance ≠ practical significance.
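A coin-flip example makes the definition concrete. The sketch below computes an exact one-sided p-value for a hypothetical observation of 60 heads in 100 flips, with "the coin is fair" as the null hypothesis. The scenario and the function name are illustrative assumptions.

```python
from math import comb


def one_sided_p_value(heads, flips):
    """P(at least `heads` heads in `flips` tosses), assuming a fair coin."""
    # Count the outcomes at least as extreme as the observation,
    # then divide by the total number of equally likely outcomes.
    favorable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favorable / 2 ** flips


p_value = one_sided_p_value(60, 100)
print(f"p-value for 60+ heads in 100 flips: {p_value:.4f}")
```

The result falls below 0.05, so you would reject "the coin is fair" at that threshold, yet the p-value alone says nothing about how biased the coin is.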

What is the difference between mean, median, and mode?

Mean is the average (sum / count). Median is the middle value when sorted — resistant to outliers. Mode is the most frequent value. Use median for skewed distributions (income, home prices) where outliers distort the mean.
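A quick sketch with Python's statistics module shows how one outlier pulls the mean but leaves the median alone. The income figures are hypothetical.

```python
from statistics import mean, median, mode

# Hypothetical annual incomes in thousands; the last value is an outlier.
incomes = [30, 35, 35, 40, 45, 50, 1000]

income_mean = mean(incomes)      # pulled far upward by the single outlier
income_median = median(incomes)  # middle value of the sorted list
income_mode = mode(incomes)      # most frequent value

print(f"mean={income_mean:.1f}, median={income_median}, mode={income_mode}")
```

Six of the seven people earn 50 or less, so the median describes a typical person here far better than the mean does.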

What is selection bias and how does it affect analysis?

Selection bias occurs when the sample analyzed isn't representative of the population you're trying to understand. Example: analyzing only churned users to understand retention issues — you're missing the users who stayed. It leads to incorrect conclusions because the findings apply to the sample, not the broader population.
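The churned-users example can be sketched with a few made-up rows. In the hypothetical data below, each user is recorded as (churned, hit_bug). Looking only at churned users suggests the bug drives churn; adding the retained users shows they hit it just as often.

```python
# Hypothetical users as (churned, hit_bug) pairs.
users = (
    [(True, True)] * 3      # churned and hit the bug
    + [(True, False)] * 2   # churned, never hit the bug
    + [(False, True)] * 6   # retained and hit the bug
    + [(False, False)] * 4  # retained, never hit the bug
)


def bug_rate(rows):
    """Share of the given users who hit the bug."""
    rows = list(rows)
    return sum(hit for _, hit in rows) / len(rows)


churned_rate = bug_rate(u for u in users if u[0])
retained_rate = bug_rate(u for u in users if not u[0])

print(f"bug rate among churned users:  {churned_rate:.0%}")
print(f"bug rate among retained users: {retained_rate:.0%}")
```

"Most churned users hit the bug" is true in this sample, but it is also true of the users who stayed, so the churned-only view would have pointed at the wrong cause.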

A/B test ran for 2 weeks. Treatment group has 8% conversion vs. 6% for control. What questions would you ask?

Before declaring a winner: (1) Is it statistically significant? (2) What's the sample size — is the test powered? (3) Did the test run for at least one business cycle? (4) Are there segment differences that might confound the result? (5) Was the randomization valid? (6) Is the effect consistent over time (novelty effect check)?
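Questions (1) and (2) can be checked with a two-proportion z-test. The question gives no sample sizes, so the sketch below assumes two hypothetical ones (1,000 and 5,000 users per group) to show that the same 6% vs. 8% rates can land on either side of p < 0.05. It uses a pooled standard error and a normal approximation, which is one common choice rather than the only valid test.

```python
from math import erfc, sqrt


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return erfc(abs(z) / sqrt(2))


# Same 6% vs. 8% conversion rates at two assumed sample sizes per group.
p_small = two_proportion_p_value(60, 1000, 80, 1000)
p_large = two_proportion_p_value(300, 5000, 400, 5000)

print(f"n=1,000 per group: p = {p_small:.4f}")
print(f"n=5,000 per group: p = {p_large:.6f}")
```

With 1,000 users per group the difference does not clear 0.05; with 5,000 it clears it comfortably. That is why sample size comes before declaring a winner.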

How would you handle missing data in a dataset?

First, understand why data is missing — MCAR (missing completely at random), MAR (missing at random), or MNAR (missing not at random — the hardest case). Options: drop rows (only if MCAR and small percentage), impute with mean/median/mode (simple), use model-based imputation (more sophisticated), or create a binary 'is_missing' indicator and keep it. The right approach depends on how much data is missing and why.
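Two of these options, median imputation and an 'is_missing' indicator, combine naturally. The sketch below applies both to a hypothetical list of ages, using only the standard library.

```python
from statistics import median

# Hypothetical column with gaps; None marks a missing value.
ages = [23, None, 31, 27, None, 45]

# Compute the fill value from the observed entries only.
observed = [age for age in ages if age is not None]
fill_value = median(observed)

# Keep a flag so a later model can still see which values were imputed.
is_missing = [int(age is None) for age in ages]
imputed = [fill_value if age is None else age for age in ages]

print(imputed)
print(is_missing)
```

Building the indicator before overwriting the gaps matters: once the values are filled, the information about which rows were missing is gone.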

3. Business Case

How would you measure the success of a marketing campaign?

Define success metrics before the campaign runs: conversion rate, customer acquisition cost (CAC), return on ad spend (ROAS), incremental lift (measured against a holdout group that didn't see the campaign), and customer lifetime value of acquired users. Post-campaign: compare results to benchmarks, segment by channel/creative/audience, and attribute conversions across multi-touch journeys.
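The core formulas are short enough to write out. The campaign numbers below are hypothetical, and lift is computed here as the relative difference between the exposed and holdout conversion rates, which is one common definition.

```python
# Hypothetical campaign results.
spend = 10_000.0
new_customers = 250
attributed_revenue = 35_000.0
exposed_conversions, exposed_users = 500, 10_000
holdout_conversions, holdout_users = 40, 1_000

# Acquisition cost per new customer.
cac = spend / new_customers

# Revenue earned per unit of ad spend.
roas = attributed_revenue / spend

# Incremental lift: how much better the exposed group converted
# than the holdout group that never saw the campaign.
exposed_rate = exposed_conversions / exposed_users
holdout_rate = holdout_conversions / holdout_users
incremental_lift = (exposed_rate - holdout_rate) / holdout_rate

print(f"CAC={cac:.2f}, ROAS={roas:.2f}x, lift={incremental_lift:.0%}")
```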

Our app's DAU has been declining for 3 months. What would you investigate?

Systematic approach: (1) Segment the decline by platform, user cohort, geography, and acquisition source; (2) Check the activation funnel for new users; (3) Analyze retention curves by cohort to see if older or newer users are driving the decline; (4) Look at competitive landscape and market changes; (5) Survey churned users; (6) Check for product changes or bugs coinciding with the start of the decline.
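Step (1) is often the quickest to show in code. The sketch below breaks a hypothetical DAU drop down by platform; the numbers are invented for illustration, and the same pattern applies to cohort, geography, or acquisition source.

```python
# Hypothetical daily active users by platform, three months apart.
dau_before = {"ios": 50_000, "android": 40_000, "web": 10_000}
dau_now = {"ios": 49_000, "android": 30_000, "web": 10_000}


def pct_change(before, after):
    """Relative change from `before` to `after`."""
    return (after - before) / before


change_by_platform = {
    platform: pct_change(dau_before[platform], dau_now[platform])
    for platform in dau_before
}
total_change = pct_change(sum(dau_before.values()), sum(dau_now.values()))
worst_segment = min(change_by_platform, key=change_by_platform.get)

print(f"overall: {total_change:.1%}")
for platform, change in change_by_platform.items():
    print(f"{platform}: {change:.1%}")
print("largest decline:", worst_segment)
```

An overall decline that is concentrated in one segment turns a vague question into a specific one, such as what changed on that platform around the time the drop began.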

How would you prioritize which metrics to track for a new product feature?

Work backwards from the feature's goal. Define: primary metric (what the feature is directly trying to move), guardrail metrics (what it shouldn't harm), secondary metrics (adjacent KPIs), and leading indicators (early signals before primary metric changes). Too many metrics = no focus; too few = blind spots.

How would you estimate the number of coffee shops in New York City?

Fermi estimation: NYC population ~8.3M. Assume roughly one coffee shop per 2,000-3,000 people. Using the midpoint: 8,300,000 / 2,500 ≈ 3,300 coffee shops. As a cross-check, work from consumption: estimate cups of coffee bought per person per day, multiply by the population, and divide by the cups a typical shop sells per day. Show your assumptions clearly, because the reasoning matters more than the answer.
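Writing the arithmetic out also gives you a range to quote, not just a point estimate. The inputs below are the same assumptions stated in the answer.

```python
# Assumptions from the estimate above.
population = 8_300_000
people_per_shop_low = 2_000   # denser coverage, so more shops
people_per_shop_high = 3_000  # sparser coverage, so fewer shops

midpoint = (people_per_shop_low + people_per_shop_high) / 2

estimate = population / midpoint
upper_bound = population / people_per_shop_low
lower_bound = population / people_per_shop_high

print(f"estimate: ~{estimate:,.0f} shops")
print(f"range: {lower_bound:,.0f} to {upper_bound:,.0f}")
```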

How do you communicate a complex analysis to a non-technical audience?

Lead with the insight and recommendation — not the methodology. Use plain language, avoid jargon, focus on what the data means for their decision, and use one clear visualization per point. Anticipate the 'so what?' question and answer it before they ask. Save methodology details for the appendix or Q&A.

How to Prepare for Data Analyst Interviews

⚡ Practice SQL daily, especially window functions, CTEs, and query optimization
⚡ Review statistics fundamentals: hypothesis testing, confidence intervals, distributions
⚡ Be ready to walk through a full analysis case study from a previous job or personal project
⚡ Practice Fermi estimation problems (market sizing, traffic estimation)
⚡ Know the company's core business metrics and be ready to discuss how you'd measure them