Interview: Data Science & ML
ML interview questions — bias-variance, evaluation metrics, feature engineering, model selection, and production ML.
Data science and ML interviews test statistical intuition, practical pipeline knowledge, and ability to explain model decisions.
Core Concepts
Q: Bias vs variance?
- High bias (underfitting): Model too simple, misses patterns. Fix: add features or use a more complex model.
- High variance (overfitting): Model memorizes training data. Fix: regularization, more data, simpler model.
The goal is the level of model complexity that minimizes error on unseen data rather than on the training set.
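A minimal sketch of the trade-off, assuming NumPy is available. The data is synthetic (a noisy sine curve chosen only for illustration), and the degrees 1, 3, and 12 are arbitrary picks. Polynomials of increasing degree are fit to the same training points, and training error is compared with validation error.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic data: noisy samples of sin(2*pi*x) on [0, 1]."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)
    return x, y

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

x_train, y_train = make_data(20)
x_val, y_val = make_data(200)

results = {}
for degree in (1, 3, 12):
    model = Polynomial.fit(x_train, y_train, degree)
    # Store (training MSE, validation MSE) for each model complexity.
    results[degree] = (mse(model, x_train, y_train), mse(model, x_val, y_val))
    print(degree, results[degree])
```

Training error can only go down as the degree grows. Validation error typically falls and then rises again, which is the overfitting signature; the exact numbers depend on the seed.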
Q: What is cross-validation?
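Split data into K folds. Train on K-1 folds, validate on the remaining fold. Repeat K times so each fold serves as the validation set once, then average the K scores. More reliable than a single train/test split because every sample is used for validation exactly once.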
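The splitting logic can be sketched in plain Python. The helper name `k_fold_indices` is hypothetical; in practice a library splitter such as scikit-learn's `KFold` would normally be used instead.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print(len(train_idx), len(val_idx))
```

A model would be trained on `train_idx` and scored on `val_idx` for each pair, and the K scores averaged.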
Q: Explain precision, recall, and F1.
| Metric | Formula | When it matters |
|---|---|---|
| Precision | TP / (TP + FP) | Minimize false positives (spam filter) |
| Recall | TP / (TP + FN) | Minimize false negatives (cancer detection) |
| F1 | 2 × P × R / (P + R) | Balance precision and recall |
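As a worked example of the formulas in the table, here is a small sketch that counts TP, FP, and FN directly from binary labels. The labels are made up for illustration, and returning 0.0 when a denominator is zero is one common convention, not the only one.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
# Here TP = 3, FP = 1, FN = 1, so all three metrics equal 0.75.
print(precision_recall_f1(y_true, y_pred))
```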
Q: ROC curve and AUC?
ROC plots True Positive Rate vs False Positive Rate at various thresholds. AUC (Area Under Curve) summarizes performance — 1.0 is perfect, 0.5 is random.
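AUC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way. It is quadratic in the number of examples, so it is meant for intuition rather than large datasets.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

# 3 of the 4 positive/negative pairs are ranked correctly, so AUC = 0.75.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```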
Feature Engineering
Q: How do you handle missing data?
- Remove rows/columns (if few missing)
- Impute with mean/median/mode
- Model-based imputation
- Create “missing” indicator feature
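Two of the options above, median imputation and a "missing" indicator, are often combined. This is a minimal sketch with a hypothetical helper and made-up values. In a real pipeline the fill value would be computed on the training split only and reused at prediction time.

```python
from statistics import median

def impute_with_indicator(values):
    """Return (filled_values, was_missing) using the median of observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [int(v is None) for v in values]
    return filled, was_missing

ages = [34, None, 29, 41, None, 52]
filled, was_missing = impute_with_indicator(ages)
print(filled)       # missing entries replaced by the median of 29, 34, 41, 52
print(was_missing)  # 1 where the original value was missing
```

The indicator column lets the model learn from the fact that a value was absent, which the imputed number alone would hide.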
Q: How do you encode categorical variables?
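- One-hot encoding: for nominal categories with no order (color: red, blue, green)
- Ordinal (label) encoding: for ordered categories (size: S < M < L), mapped to integers that preserve the order
- Target encoding: replace each category with the mean target value, computed on training data only to avoid target leakage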
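The three encodings can be sketched in plain Python as follows. All function names are hypothetical, and the `smoothing` parameter is an assumption added for illustration: it shrinks rare categories toward the global mean, which is one common way to make target encoding less noisy.

```python
from collections import defaultdict

def one_hot(values):
    """One binary column per category; returns (rows, column_order)."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values], categories

def ordinal(values, order):
    """Map ordered categories to integers that preserve the given order."""
    rank = {c: i for i, c in enumerate(order)}
    return [rank[v] for v in values]

def fit_target_encoding(train_cats, train_targets, smoothing=1.0):
    """Smoothed mean target per category, fit on training rows only."""
    global_mean = sum(train_targets) / len(train_targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(train_cats, train_targets):
        sums[c] += t
        counts[c] += 1
    mapping = {c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
               for c in counts}
    return mapping, global_mean  # global_mean is the fallback for unseen categories

rows, columns = one_hot(["red", "blue", "green", "red"])
sizes = ordinal(["S", "L", "M"], order=["S", "M", "L"])
mapping, fallback = fit_target_encoding(["a", "a", "b"], [1, 0, 1])
print(columns, rows, sizes, mapping)
```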
Q: Feature scaling — when and why?
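Needed for distance-based algorithms (KNN, SVM) and for models trained with gradient descent, where features on large scales otherwise dominate distances or slow convergence. Not needed for tree-based models, which split on thresholds and are unaffected by monotonic rescaling.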
- StandardScaler — zero mean, unit variance
- MinMaxScaler — scale to [0, 1] range
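Both scalers reduce to simple formulas. The sketch below uses hypothetical helper names and made-up values. It fits the statistics on training values and applies them to any input, which is why a test-time value can fall outside [0, 1] under min-max scaling. The population standard deviation is used, matching what scikit-learn's `StandardScaler` computes.

```python
from statistics import mean, pstdev

def standardize(train, values):
    """(v - mean) / std, with mean and std taken from the training values."""
    mu = mean(train)
    sigma = pstdev(train) or 1.0  # guard against constant features
    return [(v - mu) / sigma for v in values]

def min_max(train, values):
    """(v - min) / (max - min), with min and max taken from the training values."""
    lo, hi = min(train), max(train)
    span = (hi - lo) or 1.0  # guard against constant features
    return [(v - lo) / span for v in values]

train = [1.0, 2.0, 3.0, 4.0, 5.0]
print(standardize(train, train))
print(min_max(train, train))   # 0.0, 0.25, 0.5, 0.75, 1.0
print(min_max(train, [6.0]))   # 1.25: unseen values can leave the [0, 1] range
```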
Model Selection
Q: When to use which algorithm?
| Algorithm | Best For |
|---|---|
| Linear/Logistic Regression | Baseline, interpretable |
| Random Forest | General purpose, feature importance |
| Gradient Boosting (XGBoost) | Tabular data competitions |
| SVM | High-dimensional, clear margin |
| K-Means | Clustering, customer segmentation |
| Neural Networks | Images, text, large datasets |
Q: What is regularization?
A penalty added to the loss function to discourage large weights and reduce overfitting:
- L1 (Lasso): drives some coefficients exactly to zero (feature selection)
- L2 (Ridge): shrinks all coefficients toward zero without setting them exactly to zero
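The difference shows up clearly in the one-coefficient case. Assume an unregularized estimate `b` and the simplified losses written in the docstrings below. That setup is an illustrative assumption (it corresponds to an orthonormal design), not a general solver. L2 then rescales the coefficient, while L1 subtracts a fixed amount and clips at zero.

```python
def ridge_shrink(b, lam):
    """Minimizer of 0.5*(b - beta)**2 + 0.5*lam*beta**2: proportional shrinkage."""
    return b / (1.0 + lam)

def lasso_shrink(b, lam):
    """Minimizer of 0.5*(b - beta)**2 + lam*abs(beta): soft-thresholding."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0  # small coefficients are set exactly to zero

ols = [3.0, 0.4, -1.5, -0.2]
ridge = [ridge_shrink(b, 0.5) for b in ols]
lasso = [lasso_shrink(b, 0.5) for b in ols]
print(ridge)  # every coefficient is smaller in magnitude, none is zero
print(lasso)  # the two small coefficients become exactly 0.0
```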
Q: Random Forest vs Gradient Boosting?
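- Random Forest: Independent trees trained in parallel on bootstrap samples (bagging), less prone to overfitting, usually faster to train
- Gradient Boosting: Sequential trees, each fit to the errors of the previous ones, often higher accuracy but more sensitive to hyperparameters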
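To make "each corrects previous errors" concrete, here is a minimal sketch of gradient boosting for squared error on a single feature, using depth-1 trees (stumps). It illustrates only the boosting side of the comparison. The function names, data, and hyperparameters are hypothetical, and real libraries such as XGBoost add many refinements.

```python
def fit_stump(x, residuals):
    """Best single-split regressor on one feature (minimizes squared error)."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def gradient_boost(x, y, n_rounds=20, learning_rate=0.3):
    """Fit stumps sequentially; each one targets the current residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps, history = [], []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * stump(xi) for xi, pi in zip(x, preds)]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y))
    predict = lambda xi: base + learning_rate * sum(s(xi) for s in stumps)
    return predict, history

x = [i / 10 for i in range(20)]
y = [xi ** 2 for xi in x]
predict, history = gradient_boost(x, y)
print(history[0], history[-1])  # training MSE after the first and last round
```

A random forest would instead fit each tree independently on a bootstrap sample and average the predictions, so its trees never see each other's errors.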
Deep Learning
Q: What is backpropagation?
Algorithm to compute gradients of the loss with respect to each weight by applying the chain rule backward through the network. Weights are updated via gradient descent.
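The chain rule can be written out by hand for a tiny network. The sketch below assumes a single tanh hidden unit, a linear output, and squared-error loss, all chosen only to keep the example small. It then checks the analytic gradients against finite differences and takes one gradient-descent step.

```python
import math

def forward(params, x):
    """Return (hidden activation, prediction) for y_hat = w2 * tanh(w1*x + b1) + b2."""
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)
    return h, w2 * h + b2

def loss(params, x, y):
    _, y_hat = forward(params, x)
    return 0.5 * (y_hat - y) ** 2

def backward(params, x, y):
    """Gradients of the loss w.r.t. [w1, b1, w2, b2], applying the chain rule backward."""
    w1, b1, w2, b2 = params
    h, y_hat = forward(params, x)
    d_yhat = y_hat - y            # dL/dy_hat
    d_w2 = d_yhat * h             # output layer weight
    d_b2 = d_yhat                 # output layer bias
    d_h = d_yhat * w2             # propagate back to the hidden activation
    d_z = d_h * (1.0 - h ** 2)    # through tanh: d tanh(z)/dz = 1 - tanh(z)^2
    return [d_z * x, d_z, d_w2, d_b2]

def numeric_grad(params, x, y, eps=1e-6):
    """Central finite differences, used only to verify backward()."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, x, y) - loss(down, x, y)) / (2 * eps))
    return grads

params = [0.5, -0.1, 1.2, 0.3]
analytic = backward(params, x=0.8, y=1.0)
numeric = numeric_grad(params, x=0.8, y=1.0)

learning_rate = 0.1
updated = [p - learning_rate * g for p, g in zip(params, analytic)]
print(loss(params, 0.8, 1.0), loss(updated, 0.8, 1.0))
```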
Q: What is transfer learning?
Use a model pre-trained on a large dataset (e.g., ImageNet) as a starting point. Fine-tune the last layers on your smaller dataset. Dramatically reduces training time and data requirements.
Q: Overfitting in neural networks?
- Dropout: randomly disable neurons during training
- Early stopping: stop when validation loss stops improving
- Data augmentation: artificially expand the training set
- L2 regularization (weight decay): penalize large weights
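Early stopping is usually implemented with a patience counter. This is a minimal sketch over a precomputed list of validation losses. The helper name and the loss values are made up; in a real training loop the loss would be measured after each epoch and the best weights checkpointed.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stopped_epoch) for a sequence of validation losses."""
    if not val_losses:
        raise ValueError("need at least one validation loss")
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    stopped_epoch = len(val_losses) - 1
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                stopped_epoch = epoch
                break
    return best_epoch, stopped_epoch

val_losses = [0.90, 0.70, 0.62, 0.64, 0.66, 0.60, 0.59]
# Training stops at epoch 4 and keeps the weights from epoch 2.
print(early_stopping_epoch(val_losses, patience=2))
```

With `patience=2` the later improvement at epochs 5 and 6 is never reached, which shows why patience is a trade-off between compute and the risk of stopping too soon.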
Production ML
Q: How do you deploy an ML model?
- Train and evaluate offline
- Save model (joblib, torch.save, .keras)
- Wrap in API (FastAPI/Flask)
- Containerize (Docker)
- Deploy with CI/CD
- Monitor predictions and data drift
Q: What is data drift?
When the production data distribution shifts away from the training distribution, so model accuracy degrades over time. Monitor input feature distributions and retrain when drift is detected or on a regular schedule.
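One way to monitor a numeric feature is the Population Stability Index (PSI), which compares binned frequencies between training and production samples. It is named here as one possible choice, not something the text above prescribes. The sketch uses equal-width bins taken from the training range, a small `eps` to avoid division by zero, and synthetic data.

```python
import math
import random

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a training and a production sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width) if width > 0 else 0
            # Clamp so values outside the training range land in the edge bins.
            counts[min(max(idx, 0), n_bins - 1)] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))

rng = random.Random(0)
train_feature = [rng.gauss(0.0, 1.0) for _ in range(1000)]
prod_same = list(train_feature)
prod_shifted = [rng.gauss(1.0, 1.0) for _ in range(1000)]

print(psi(train_feature, prod_same))     # identical distributions give 0.0
print(psi(train_feature, prod_shifted))  # a shifted mean gives a clearly larger value
```

Alert thresholds for PSI vary by team and should be treated as rules of thumb rather than fixed limits.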
Q: Train/serve skew?
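Training pipeline differs from serving pipeline (e.g., different preprocessing), so the model receives differently prepared inputs in production. Fix with a single preprocessing path shared by both, such as a Scikit-learn Pipeline saved together with the model, or preprocessing exported as part of the model served by TF Serving.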
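The core idea is that fitted preprocessing parameters travel with the model and one `transform` function is used on both sides. This is a minimal sketch with hypothetical names, using JSON as a stand-in for whatever artifact format the deployment actually uses.

```python
import json
from statistics import mean, pstdev

def fit_preprocessor(values):
    """Learn scaling parameters from training data only."""
    return {"mean": mean(values), "std": pstdev(values) or 1.0}

def transform(params, value):
    """The single transformation shared by training and serving code."""
    return (value - params["mean"]) / params["std"]

# Training side: fit parameters, build features, save the artifact.
train_values = [10.0, 12.0, 14.0, 16.0]
params = fit_preprocessor(train_values)
train_features = [transform(params, v) for v in train_values]
artifact = json.dumps(params)

# Serving side: load the same parameters and call the same function.
served_params = json.loads(artifact)
served_feature = transform(served_params, 12.0)
print(served_feature == train_features[1])
```

Skew appears when the serving side recomputes statistics or reimplements the transformation separately instead of loading the saved artifact.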
Python-Specific ML
Q: Why NumPy over Python lists?
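Vectorized operations run in compiled C loops, often 10-100x faster than equivalent Python loops on numeric data. Arrays store fixed-type values contiguously, so they use far less memory than lists of Python objects.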
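A quick way to see the difference is to time the same computation both ways. This sketch assumes NumPy is installed. The array size and repeat count are arbitrary, and the measured speedup will vary by machine, so no specific number is claimed.

```python
import timeit
import numpy as np

n = 200_000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

def python_sum_of_squares():
    """Interpreted loop: one Python-level multiply and add per element."""
    return sum(v * v for v in py_list)

def numpy_sum_of_squares():
    """Vectorized: the loop runs in compiled code over a contiguous buffer."""
    return int(np.dot(np_array, np_array))

python_time = timeit.timeit(python_sum_of_squares, number=3)
numpy_time = timeit.timeit(numpy_sum_of_squares, number=3)
print(f"python: {python_time:.4f}s  numpy: {numpy_time:.4f}s")
print(np_array.nbytes)  # 8 bytes per int64 element
```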
Q: What is a Scikit-learn Pipeline?
Chains preprocessing and modeling steps. Ensures the same transformations are applied during training and prediction. Prevents data leakage in cross-validation, because preprocessing is fit on the training folds only.
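from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipe = Pipeline([
    ("scaler", StandardScaler()),  # fit on training data only
    ("model", RandomForestClassifier()),
])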
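A hedged usage sketch for a pipeline like the one above, assuming scikit-learn is installed. The dataset is synthetic (`make_classification`), and the hyperparameters are arbitrary. Passing the whole pipeline to `cross_val_score` means the scaler is refit inside each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to make the example self-contained.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Each fold fits scaler + model on the training part and scores on the held-out part.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())

# Fitting the pipeline once gives a single object that scales and predicts.
pipe.fit(X, y)
predictions = pipe.predict(X[:3])
print(predictions)
```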
Q: pandas .groupby() use cases?
Aggregating data by category — sales by region, average score by department, count by date.
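For instance, the "sales by region" case might look like the sketch below, assuming pandas is installed. The table and column names are made up for illustration.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "South", "North", "West", "South", "North"],
    "amount": [120.0, 80.0, 200.0, 50.0, 70.0, 30.0],
})

# One row per region, with several aggregates of the same column.
summary = sales.groupby("region")["amount"].agg(
    total="sum",
    average="mean",
    orders="count",
)
print(summary)
```

The same pattern covers the other cases by swapping the grouping key and the aggregation, for example grouping by department and taking the mean score.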