Data science and ML interviews test statistical intuition, practical pipeline knowledge, and the ability to explain model decisions.

Core Concepts

Q: Bias vs variance?

  • High bias (underfitting): The model is too simple and misses real patterns. Fix: add features or use a more flexible model.
  • High variance (overfitting): The model memorizes noise in the training data. Fix: regularize, collect more data, or simplify the model.

The goal is to find the level of model complexity that minimizes error on unseen data.

Q: What is cross-validation?

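Split the data into K folds. Train on K-1 folds and validate on the remaining fold. Repeat K times so that each fold serves as the validation set once, then average the K scores. More reliable than a single train/test split, because every sample is used for validation exactly once.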
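The fold bookkeeping described above can be sketched in plain Python. This is a minimal illustration, not a library API: the helper name `k_fold_indices` is hypothetical, and it uses contiguous, unshuffled folds for readability. Real projects would normally shuffle first or use a library splitter.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        # Everything outside the validation fold is used for training.
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "val:", val_idx)
```

Each index appears in exactly one validation fold, which is the property that makes the averaged score use all of the data.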

Q: Explain precision, recall, and F1.

Metric     Formula              When it matters
Precision  TP / (TP + FP)       Minimize false positives (spam filter)
Recall     TP / (TP + FN)       Minimize false negatives (cancer detection)
F1         2 × P × R / (P + R)  Balance precision and recall
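As a worked example of the three formulas, the sketch below counts TP, FP, and FN for a small set of made-up binary labels. The function name `precision_recall_f1` and the sample labels are illustrative assumptions. Returning 0.0 when a denominator is zero is one common convention, not the only one.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)  # TP=2, FP=1, FN=2
```

Here precision is 2/3, recall is 1/2, and F1 is 4/7: the harmonic mean sits closer to the lower of the two values.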

Q: ROC curve and AUC?

ROC plots True Positive Rate vs False Positive Rate at various thresholds. AUC (Area Under Curve) summarizes performance — 1.0 is perfect, 0.5 is random.
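One way to build intuition for AUC: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by brute force over all positive/negative pairs. The function name `roc_auc` and the sample scores are illustrative, and the quadratic loop is only suitable for small inputs.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)  # 3 of the 4 pairs are ordered correctly
```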


Feature Engineering

Q: How do you handle missing data?

  1. Remove rows/columns (if few missing)
  2. Impute with mean/median/mode
  3. Model-based imputation
  4. Create “missing” indicator feature
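Options 2 and 4 are often combined. The sketch below imputes with the median and keeps a 0/1 flag recording which values were filled in. The helper name `impute_with_indicator` and the sample data are assumptions for illustration. In a real workflow the fill value would be computed on the training split only and reused at prediction time.

```python
from statistics import median


def impute_with_indicator(values):
    """Replace None with the median of observed values; also return a missing flag."""
    observed = [v for v in values if v is not None]
    fill = median(observed)  # computed from observed values only
    imputed = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing


ages = [25, None, 40, 31, None]
ages_imputed, ages_missing = impute_with_indicator(ages)
print(ages_imputed)
print(ages_missing)
```

The indicator column lets a model learn from the fact that a value was missing, which can itself be informative.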

Q: How do you encode categorical variables?

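  • One-hot encoding: for nominal categories with no inherent order (color: red, blue, green)
  • Ordinal (label) encoding: for ordered categories (size: S < M < L); map each category to an integer that preserves the order
  • Target encoding: replace each category with the mean target value for that category (compute it on training data only to avoid leakage)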
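The first two encodings can be sketched without any library. The helper names `one_hot` and `ordinal` are hypothetical. The category lists are passed in explicitly so that the column order and the integer order are fixed choices rather than accidents of the data.

```python
def one_hot(values, categories):
    """One binary column per category; exactly one 1 per row."""
    return [[1 if v == c else 0 for c in categories] for v in values]


def ordinal(values, ordered_categories):
    """Map each category to its position in an explicitly ordered list."""
    rank = {c: i for i, c in enumerate(ordered_categories)}
    return [rank[v] for v in values]


colors = ["red", "green", "blue", "red"]
sizes = ["M", "S", "L"]

color_vectors = one_hot(colors, categories=["red", "blue", "green"])
size_codes = ordinal(sizes, ordered_categories=["S", "M", "L"])

print(color_vectors)
print(size_codes)
```

Passing the order explicitly matters for ordinal data: sorting the labels alphabetically would place L before M and S.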

Q: Feature scaling — when and why?

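Important for distance-based algorithms (KNN, SVM) and for models trained with gradient descent, because features on larger scales otherwise dominate distances and gradients. Not needed for tree-based models, whose splits depend only on the ordering of values within a feature.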

  • StandardScaler — zero mean, unit variance
  • MinMaxScaler — scale to [0, 1] range
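The two transformations reduce to short formulas. The sketch below uses plain Python with hypothetical helper names (`standardize`, `min_max`) and made-up income values. It uses the population standard deviation and assumes the feature is not constant, since a constant column would cause a division by zero.

```python
from statistics import mean, pstdev


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Linearly rescale so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


incomes = [30_000, 45_000, 60_000, 120_000]
z = standardize(incomes)
scaled = min_max(incomes)
print(z)
print(scaled)
```

As with imputation, the statistics (mean, standard deviation, min, max) should come from the training data and be reused unchanged on validation and production data.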

Model Selection

Q: When to use which algorithm?

Algorithm                    Best For
Linear/Logistic Regression   Baseline, interpretable
Random Forest                General purpose, feature importance
Gradient Boosting (XGBoost)  Tabular data competitions
SVM                          High-dimensional, clear margin
K-Means                      Clustering, customer segmentation
Neural Networks              Images, text, large datasets

Q: What is regularization?

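A penalty on coefficient size, added to the loss function to discourage overfitting:

  • L1 (Lasso): penalizes the sum of absolute coefficient values; drives some coefficients exactly to zero (feature selection)
  • L2 (Ridge): penalizes the sum of squared coefficients; shrinks all coefficients toward zero without eliminating any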
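The different behavior of the two penalties is easiest to see for a single coefficient, where both problems have closed-form solutions. The sketch below assumes the simplified losses stated in the comments (equivalent to regression with orthonormal features). The function names and the sample coefficients are illustrative.

```python
def ridge_shrink(w_ols, lam):
    """Minimizer of 0.5*(w - w_ols)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return w_ols / (1.0 + lam)


def lasso_shrink(w_ols, lam):
    """Minimizer of 0.5*(w - w_ols)**2 + lam*abs(w): soft thresholding."""
    if w_ols > lam:
        return w_ols - lam
    if w_ols < -lam:
        return w_ols + lam
    return 0.0  # small coefficients are set exactly to zero


ols_coefficients = [3.0, -0.4, 0.2, -2.0]
lam = 0.5

ridge = [ridge_shrink(w, lam) for w in ols_coefficients]
lasso = [lasso_shrink(w, lam) for w in ols_coefficients]
print(ridge)
print(lasso)
```

Ridge scales every coefficient by the same factor and never reaches zero, while lasso subtracts a fixed amount and zeroes out anything smaller than the threshold.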

Q: Random Forest vs Gradient Boosting?

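  • Random Forest: independent trees trained in parallel on bootstrap samples (bagging) and averaged; less prone to overfitting and usually faster to train
  • Gradient Boosting: trees trained sequentially, each one fit to the errors of the ensemble so far; often higher accuracy, but more sensitive to hyperparameters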

Deep Learning

Q: What is backpropagation?

An algorithm that computes the gradient of the loss with respect to each weight by applying the chain rule backward through the network, from the output layer to the input. The weights are then updated with gradient descent.
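To make the chain rule concrete, here is a minimal sketch for a toy network with one tanh hidden unit and a squared-error loss. The architecture, parameter values, and function names are assumptions chosen for illustration. The analytic gradients are compared against finite differences, a common sanity check.

```python
import math


def forward(params, x, y):
    """Return the loss and the intermediate values needed for the backward pass."""
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)      # hidden activation
    y_hat = w2 * h + b2             # linear output
    loss = 0.5 * (y_hat - y) ** 2
    return loss, (h, y_hat)


def backward(params, x, y):
    """Apply the chain rule from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    _, (h, y_hat) = forward(params, x, y)
    d_y_hat = y_hat - y             # dL/dy_hat
    d_w2 = d_y_hat * h
    d_b2 = d_y_hat
    d_h = d_y_hat * w2              # propagate into the hidden unit
    d_z = d_h * (1.0 - h ** 2)      # derivative of tanh
    d_w1 = d_z * x
    d_b1 = d_z
    return [d_w1, d_b1, d_w2, d_b2]


def numerical_grads(params, x, y, eps=1e-6):
    """Central finite differences, used only to check the analytic gradients."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((forward(up, x, y)[0] - forward(down, x, y)[0]) / (2 * eps))
    return grads


params = [0.5, -0.1, 1.5, 0.2]
x, y = 2.0, 1.0
analytic = backward(params, x, y)
numeric = numerical_grads(params, x, y)

# One gradient descent step using the backpropagated gradients.
learning_rate = 0.1
new_params = [p - learning_rate * g for p, g in zip(params, analytic)]
print(forward(params, x, y)[0], "->", forward(new_params, x, y)[0])
```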

Q: What is transfer learning?

Use a model pre-trained on a large dataset (e.g., ImageNet) as a starting point, then fine-tune the last layers on your smaller dataset. This often substantially reduces training time and the amount of labeled data required.

Q: Overfitting in neural networks?

  • Dropout: randomly disable neurons during training
  • Early stopping: stop training when validation loss stops improving
  • Data augmentation: artificially expand the training set
  • L2 regularization: weight decay
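Early stopping is usually implemented with a patience counter, so that a single noisy epoch does not end training. The sketch below replays a made-up validation-loss curve. The function name `early_stopping_epoch`, the patience value, and the losses are all illustrative assumptions.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a recorded validation-loss curve."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Never triggered: training ran to the last recorded epoch.
    return len(val_losses) - 1, best_epoch


val_losses = [0.90, 0.70, 0.62, 0.60, 0.61, 0.63, 0.66]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch, best_epoch)
```

In practice the weights from `best_epoch` are the ones kept, which is why training frameworks typically checkpoint the model whenever validation loss improves.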

Production ML

Q: How do you deploy an ML model?

  1. Train and evaluate offline
  2. Serialize the model (joblib for scikit-learn, torch.save for PyTorch, .keras for Keras)
  3. Wrap in API (FastAPI/Flask)
  4. Containerize (Docker)
  5. Deploy with CI/CD
  6. Monitor predictions and data drift

See Full-Stack ML Capstone.

Q: What is data drift?

Data drift occurs when the distribution of production data differs from that of the training data, so model accuracy can degrade over time. Monitor input feature distributions and retrain when they shift, or on a regular schedule.
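One simple way to monitor a numeric feature is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the empirical CDFs of the training and production samples. The sketch below is a plain-Python illustration. The function name, the sample values, and especially the 0.3 alert threshold are assumptions, not recommended settings.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest absolute gap between the two empirical CDFs (0 = identical)."""
    ref, cur = sorted(reference), sorted(current)
    max_gap = 0.0
    for x in sorted(set(ref) | set(cur)):
        # Fraction of each sample that is <= x.
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


train_ages = [22, 25, 29, 31, 35, 38, 41, 44]
prod_ages = [age + 20 for age in train_ages]  # simulated shift in production

DRIFT_THRESHOLD = 0.3  # hypothetical alert level, tune per feature
ks_shifted = ks_statistic(train_ages, prod_ages)
drift_detected = ks_shifted > DRIFT_THRESHOLD
print(ks_shifted, drift_detected)
```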

Q: Train/serve skew?

The training pipeline differs from the serving pipeline, for example because preprocessing is implemented separately in each. Fix with unified pipelines that package preprocessing and model together (Scikit-learn Pipeline, TF Serving).


Python-Specific ML

Q: Why NumPy over Python lists?

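Vectorized operations run in compiled C loops, often 10-100x faster than the equivalent Python loop. Arrays store elements of a single type in contiguous memory, which is more compact than a list of Python objects.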
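A small side-by-side shows what "vectorized" means in practice. The data values are made up, and the example deliberately makes no timing claim, since the speedup depends on array size and hardware.

```python
import numpy as np

prices = [19.99, 5.49, 102.00, 7.25]
quantities = [3, 10, 1, 4]

# Pure Python: an explicit loop over boxed float and int objects.
totals_list = [p * q for p, q in zip(prices, quantities)]

# NumPy: a single elementwise multiply over contiguous typed buffers.
totals_array = np.array(prices) * np.array(quantities)

print(totals_array)
print(totals_array.sum())
```

On arrays this small the difference is negligible. The gap appears on large arrays, where the per-element interpreter overhead of the Python loop dominates.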

Q: What is a Scikit-learn Pipeline?

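Chains preprocessing and modeling steps into a single estimator. Ensures the same transformations are applied during training and prediction. Helps prevent data leakage, because the preprocessing steps are fit only on the training portion of the data.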

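  from sklearn.ensemble import RandomForestClassifier
  from sklearn.pipeline import Pipeline
  from sklearn.preprocessing import StandardScaler
  # fit() fits the scaler and then the model; predict() reuses the fitted scaler.
  pipe = Pipeline([
      ("scaler", StandardScaler()),
      ("model", RandomForestClassifier()),
  ])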

Q: pandas .groupby() use cases?

Aggregating data by category — sales by region, average score by department, count by date.
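The "sales by region" case can be sketched as follows. The DataFrame contents and column names are made up for illustration.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "West", "East", "West", "East"],
    "amount": [100, 250, 150, 50, 200],
})

# Split rows by region, then compute several aggregates per group.
summary = sales.groupby("region")["amount"].agg(["sum", "mean", "count"])
print(summary)
```

The result is indexed by region, with one column per aggregate, so a single value can be read with `summary.loc["East", "sum"]`.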