Interview: Data Science & ML

ML interview questions — bias-variance, evaluation metrics, feature engineering, model selection, and production ML.

Data science and ML interviews test statistical intuition, practical pipeline knowledge, and ability to explain model decisions.

Core Concepts

Q: Bias vs variance?

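High bias (underfitting): The model is too simple and misses real patterns, so training and validation error are both high. Fix: add features or use a more complex model.
High variance (overfitting): The model memorizes the training data, so training error is low but validation error is high. Fix: regularization, more data, or a simpler model.

The goal is the bias-variance sweet spot: the model complexity at which error on unseen data is lowest.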
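As a rough illustration of the two failure modes, the sketch below fits a k-nearest-neighbour regressor to a small noisy dataset. The helper names (`knn_predict`, `mse`) and the data are made up for this example. With k=1 the model reproduces every training point (zero training error, high variance); with k equal to the dataset size it always predicts the global mean (high bias).

```python
import random

def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def mse(train_x, train_y, xs, ys, k):
    """Mean squared error of the k-NN model on the points (xs, ys)."""
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical data: y = x^2 plus Gaussian noise
train_x = [i / 10 for i in range(30)]
train_y = [x ** 2 + rng.gauss(0, 0.5) for x in train_x]
test_x = [i / 10 + 0.05 for i in range(30)]
test_y = [x ** 2 + rng.gauss(0, 0.5) for x in test_x]

for k in (1, 5, len(train_x)):
    train_err = mse(train_x, train_y, train_x, train_y, k)
    test_err = mse(train_x, train_y, test_x, test_y, k)
    print(f"k={k:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

Comparing the train and test columns is the usual diagnostic: a large gap points to variance, while two similarly high numbers point to bias.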

Q: What is cross-validation?

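Split the data into K folds. Train on K-1 folds and validate on the remaining fold, then repeat K times so that each fold serves as the validation set exactly once. Averaging the K scores gives a more reliable estimate than a single train/test split.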
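The splitting step can be sketched in plain Python. The function name `k_fold_indices` is hypothetical, and the sketch assumes the rows have already been shuffled, which is usually done first on real data.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train, val in folds:
    print("train:", train, "val:", val)
```

Every index lands in exactly one validation fold, so each sample is used for validation once and for training K-1 times.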

Q: Explain precision, recall, and F1.

Metric	Formula	When it matters
Precision	TP / (TP + FP)	Minimize false positives (spam filter)
Recall	TP / (TP + FN)	Minimize false negatives (cancer detection)
F1	2 × P × R / (P + R)	Balance precision and recall
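A minimal sketch of the three formulas, computed from binary labels. The function name and sample labels are illustrative, and returning 0.0 when a denominator is zero is an assumed convention rather than a universal one.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumed convention: an undefined ratio is reported as 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75 (TP=3, FP=1, FN=1)
```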

Q: ROC curve and AUC?

ROC plots True Positive Rate vs False Positive Rate at various thresholds. AUC (Area Under Curve) summarizes performance — 1.0 is perfect, 0.5 is random.
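AUC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way by counting pairs; `roc_auc` is a hypothetical helper, and the pairwise loop is only practical for small inputs.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # A tie between a positive and a negative counts as half a correct pair
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75: 3 of 4 pairs ordered correctly
```

A model that gives every example the same score gets 0.5 under this definition, which matches the "random" baseline.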

Feature Engineering

Q: How do you handle missing data?

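Drop rows when only a few have missing values, or drop a column when most of it is missing
Impute with the mean, median, or mode (computed on the training set only)
Model-based imputation (predict the missing value from the other features)
Create a “missing” indicator feature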
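Two of these options, median imputation and a missing indicator, are often combined. A minimal sketch, assuming missing values are represented as `None` and using a made-up `ages` column:

```python
from statistics import median

def impute_with_indicator(values):
    """Replace None with the median of observed values and flag what was missing."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing

ages = [25, None, 40, 31, None]  # hypothetical column with gaps
imputed, was_missing = impute_with_indicator(ages)
print(imputed)      # [25, 31, 40, 31, 31]
print(was_missing)  # [0, 1, 0, 0, 1]
```

In a real pipeline the fill value would be computed on the training split and reused unchanged at prediction time.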

Q: How do you encode categorical variables?

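One-hot encoding: for nominal categories with no inherent order (color: red, blue, green)
Label (ordinal) encoding: for ordinal categories, mapped to integers in their natural order (size: S < M < L)
Target encoding: replace each category with the mean target value, computed on training data only to avoid leakage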
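The three encodings can be sketched without any library. All function names here are hypothetical, and the target encoder is deliberately naive: it has no smoothing and no out-of-fold fitting, both of which a production version would likely need.

```python
def one_hot(values):
    """Map each value to a 0/1 vector with one slot per distinct category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories

def ordinal_encode(values, order):
    """Map each value to its position in an explicitly given order."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]

def target_encode(categories, targets):
    """Replace each category with the mean target observed for it.

    Fit this mapping on training rows only; using validation or test rows leaks the target.
    """
    totals, counts = {}, {}
    for c, t in zip(categories, targets):
        totals[c] = totals.get(c, 0.0) + t
        counts[c] = counts.get(c, 0) + 1
    means = {c: totals[c] / counts[c] for c in totals}
    return [means[c] for c in categories], means

colors = ["red", "blue", "green", "blue"]
vectors, cats = one_hot(colors)
sizes = ordinal_encode(["M", "S", "L"], order=["S", "M", "L"])
encoded, means = target_encode(["a", "b", "a", "b"], [1, 0, 0, 0])
print(cats, vectors)
print(sizes)
print(means)
```

Passing the order explicitly matters for ordinal data: sorting the labels alphabetically would put L before M before S.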

Q: Feature scaling — when and why?

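Needed for distance-based algorithms (KNN, SVM) and for models trained with gradient descent, where features on larger scales would otherwise dominate distances or slow convergence. Not needed for tree-based models, which split on one feature at a time.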

StandardScaler — zero mean, unit variance
MinMaxScaler — scale to [0, 1] range
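The arithmetic behind both scalers is short enough to write by hand. This sketch uses the population standard deviation, as StandardScaler does, and assumes the feature is not constant (otherwise both functions would divide by zero). The `incomes` values are invented.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale to zero mean and unit variance (population standard deviation)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20_000, 40_000, 60_000, 100_000]  # hypothetical feature
print(standardize(incomes))
print(min_max(incomes))  # [0.0, 0.25, 0.5, 1.0]
```

As with imputation, the mean, standard deviation, minimum, and maximum should come from the training data and be reused on new data.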

Model Selection

Q: When to use which algorithm?

Algorithm	Best For
Linear/Logistic Regression	Baseline, interpretable
Random Forest	General purpose, feature importance
Gradient Boosting (XGBoost)	Tabular data competitions
SVM	High-dimensional, clear margin
K-Means	Clustering, customer segmentation
Neural Networks	Images, text, large datasets

Q: What is regularization?

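Penalties added to the loss function to discourage large weights and reduce overfitting:

L1 (Lasso): penalizes the sum of absolute coefficient values and drives some coefficients exactly to zero (feature selection)
L2 (Ridge): penalizes the sum of squared coefficients and shrinks all of them toward zero, but rarely to exactly zero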
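The difference is easiest to see in a special case. Assuming orthonormal features, a squared-error loss of 0.5 * ||y - Xw||^2, and penalties of lam * ||w||_1 for L1 and 0.5 * lam * ||w||^2 for L2, each penalized coefficient is a simple function of the unpenalized one. The function names and coefficient values below are illustrative only.

```python
def l1_shrink(w, lam):
    """Soft-thresholding: move w toward zero by lam, clamping at exactly zero."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

def l2_shrink(w, lam):
    """Ridge shrinkage: scale w toward zero; it only reaches zero if w was zero."""
    return w / (1.0 + lam)

unpenalized = [3.0, -0.4, 0.2, -2.5]  # hypothetical least-squares coefficients
lam = 0.5
print([l1_shrink(w, lam) for w in unpenalized])  # [2.5, 0.0, 0.0, -2.0]
print([l2_shrink(w, lam) for w in unpenalized])
```

L1 removes the two small coefficients entirely, while L2 keeps all four and scales each by the same factor. With correlated features these closed forms no longer hold, but the qualitative behaviour is similar.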

Q: Random Forest vs Gradient Boosting?

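Random Forest: independent trees trained on bootstrap samples (bagging) and averaged; less prone to overfitting and fast to train in parallel
Gradient Boosting: trees trained sequentially, each fitting the errors of the ensemble so far; often higher accuracy but more sensitive to hyperparameters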
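The structural difference can be sketched with one-split regression trees (stumps) on a single feature. This is a toy under stated assumptions, not how any library implements either method: `fit_stump`, `bagging`, and `boosting` are hypothetical helpers, the data is a noise-free curve, and real random forests also subsample features.

```python
import random

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree on 1-D inputs and return a predict function."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest x, which would leave the right side empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all x identical: predict the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def bagging(xs, ys, n_trees, rng):
    """Random-forest style: independent stumps on bootstrap samples, averaged."""
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(tree(x) for tree in trees) / len(trees)

def boosting(xs, ys, n_trees, lr=0.5):
    """Gradient-boosting style: each stump fits the residuals of the ensemble so far."""
    trees, preds = [], [0.0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + lr * tree(x) for p, x in zip(preds, xs)]
    return lambda x: sum(lr * tree(x) for tree in trees)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 10 for i in range(20)]
ys = [x ** 2 for x in xs]  # hypothetical noise-free target
bagged = bagging(xs, ys, n_trees=25, rng=random.Random(0))
boosted = boosting(xs, ys, n_trees=25)
print("bagged stumps  MSE:", mse(bagged, xs, ys))
print("boosted stumps MSE:", mse(boosted, xs, ys))
```

The loop in `bagging` has no dependency between iterations, so it could run in parallel. The loop in `boosting` cannot, because each round needs the predictions from the previous one.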

Deep Learning

Q: What is backpropagation?

Algorithm to compute gradients of the loss with respect to each weight by applying the chain rule backward through the network. Weights are updated via gradient descent.
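A hand-written example on the smallest possible network makes the chain rule concrete. The network (one tanh hidden unit, one linear output), the squared-error loss, and all starting values are assumptions chosen for illustration.

```python
import math

def forward(w1, w2, x):
    """Tiny network: hidden = tanh(w1 * x), output = w2 * hidden."""
    hidden = math.tanh(w1 * x)
    return hidden, w2 * hidden

def loss(w1, w2, x, y):
    _, y_hat = forward(w1, w2, x)
    return 0.5 * (y_hat - y) ** 2

def backward(w1, w2, x, y):
    """Apply the chain rule from the loss back to each weight."""
    hidden, y_hat = forward(w1, w2, x)
    d_yhat = y_hat - y                        # dL/dy_hat
    d_w2 = d_yhat * hidden                    # dL/dw2 = dL/dy_hat * dy_hat/dw2
    d_hidden = d_yhat * w2                    # dL/dhidden
    d_w1 = d_hidden * (1 - hidden ** 2) * x   # tanh'(z) = 1 - tanh(z)^2, dz/dw1 = x
    return d_w1, d_w2

w1, w2, x, y, lr = 0.5, -0.3, 1.5, 1.0, 0.1
for step in range(200):
    d_w1, d_w2 = backward(w1, w2, x, y)
    w1, w2 = w1 - lr * d_w1, w2 - lr * d_w2   # gradient descent update
print("final loss:", loss(w1, w2, x, y))
```

A common sanity check is to compare these analytic gradients with finite differences of `loss`; the two should agree to several decimal places.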

Q: What is transfer learning?

Use a model pre-trained on a large dataset (e.g., ImageNet) as a starting point. Fine-tune the last layers on your smaller dataset. Dramatically reduces training time and data requirements.

Q: Overfitting in neural networks?

Dropout: randomly disable neurons during training
Early stopping: stop training when validation loss stops improving
Data augmentation: artificially expand the training set with label-preserving transformations
L2 regularization: weight decay
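Early stopping is usually implemented with a patience counter rather than stopping at the first increase, since validation loss is noisy. A minimal sketch, with a hypothetical `early_stopping` helper and made-up loss values:

```python
def early_stopping(val_losses, patience):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses."""
    best_loss, best_epoch, waited = float("inf"), -1, 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss, best_epoch, waited = val_loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch  # stop; restore weights from best_epoch
    return best_epoch, len(val_losses) - 1  # never triggered: ran all epochs

history = [1.00, 0.80, 0.70, 0.72, 0.71, 0.73, 0.60]  # hypothetical validation losses
print(early_stopping(history, patience=3))  # (2, 5)
```

With `patience=3` training halts at epoch 5 and misses the later improvement at epoch 6, which shows the trade-off: a larger patience costs more compute but is less likely to stop too early.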

Production ML

Q: How do you deploy an ML model?

Train and evaluate offline
Save model (joblib, torch.save, .keras)
Wrap in API (FastAPI/Flask)
Containerize (Docker)
Deploy with CI/CD
Monitor predictions and data drift

See Full-Stack ML Capstone.

Q: What is data drift?

Data drift occurs when the distribution of production data differs from that of the training data, so model accuracy can degrade over time. Monitor input feature distributions and retrain when they shift, or on a regular schedule.
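One simple way to monitor a numeric feature is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the empirical CDFs of the training and production samples. The sketch below is one option among several; the sample values and the alert threshold are assumptions.

```python
from bisect import bisect_right

def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two samples (0 = identical, 1 = disjoint)."""
    ref, cur = sorted(reference), sorted(current)
    return max(
        abs(bisect_right(ref, v) / len(ref) - bisect_right(cur, v) / len(cur))
        for v in ref + cur
    )

training_ages = [22, 25, 31, 35, 40, 44, 52, 60]      # hypothetical training feature
production_ages = [41, 45, 50, 55, 58, 63, 67, 70]    # hypothetical production feature
drift = ks_statistic(training_ages, production_ages)
DRIFT_THRESHOLD = 0.3  # assumed alert level; tune per feature
print(f"KS statistic = {drift:.2f}, drift alert = {drift > DRIFT_THRESHOLD}")
```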

Q: Train/serve skew?

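The training pipeline differs from the serving pipeline, for example because preprocessing is implemented twice and the two versions diverge. Fix by sharing one preprocessing implementation across both, such as a Scikit-learn Pipeline saved together with the model, or preprocessing exported as part of the model that TF Serving loads.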

Python-Specific ML

Q: Why NumPy over Python lists?

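Vectorized operations run in compiled C rather than in the Python interpreter, often 10-100x faster for numeric loops. Arrays are stored contiguously with a single dtype, which uses less memory than a list of Python objects.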
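A small comparison of the two styles, using a sum of squares as a stand-in workload. Timing is left out because the speedup depends heavily on the machine and the array size.

```python
import numpy as np

values = list(range(1_000))

# Pure Python: the interpreter executes one multiply-add per element
loop_result = sum(v * v for v in values)

# NumPy: the same sum of squares runs as a single call into compiled code
array = np.arange(1_000, dtype=np.int64)
vector_result = int(np.dot(array, array))

print(loop_result, vector_result)
# Each array element takes a fixed 8 bytes; list elements are pointers to full Python objects
print(array.itemsize, array.nbytes)
```

Wrapping each version in `timeit` is the usual way to measure the difference on your own hardware.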

Q: What is a Scikit-learn Pipeline?

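Chains preprocessing and modeling steps into a single estimator. Ensures the same transformations are applied during training and prediction. Helps prevent data leakage, because transformers are fitted only on the training portion of each split.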

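from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier()),
])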
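To show the leakage point in practice, the sketch below passes a pipeline to `cross_val_score`. It swaps in `LogisticRegression` for the random forest, since scaling actually affects that model, and uses synthetic data from `make_classification` in place of a real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real dataset
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# For each fold, the scaler is fitted on the training part only, then applied to the held-out part
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())

pipe.fit(X, y)                 # fit scaler and model together on all data
print(pipe.predict(X[:3]))     # the stored scaling is reapplied automatically
```

Scaling the full dataset before splitting would let statistics from the validation rows influence training, which is the leakage the pipeline avoids.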

Q: pandas .groupby() use cases?

Aggregating data by category — sales by region, average score by department, count by date.
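A short example of the sales-by-region case, using an invented DataFrame:

```python
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South"],
    "amount": [100, 250, 150, 300, 50],
})

# Total, mean, and row count per region in one pass
summary = sales.groupby("region")["amount"].agg(["sum", "mean", "count"])
print(summary)
```

The result is indexed by region, so a single value can be read with `summary.loc["North", "sum"]`.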
