10 Essential ML Interview Questions

by Accenture

1.   What is the difference between probability and likelihood?

During the training phase, given some data, we determine the parameters of our model such that our model performance is maximized.


  • Likelihood: The observed outcome is held fixed and the parameters theta vary. During the training phase, given some outcome, we determine the theta that maximizes the probability that such an outcome occurs.

  • Probability: Theta is held fixed and the outcome varies. During the testing phase, given a theta, we determine the probability of observing the outcome.
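
The distinction can be made concrete with a coin-flip sketch. The binomial model, the data (7 heads in 10 flips) and the grid search over theta are illustrative assumptions, not part of the original answer; the same function is read once as a probability (theta fixed) and once as a likelihood (outcome fixed).

```python
from math import comb

def binomial_probability(k, n, theta):
    """Probability of k heads in n flips of a coin with heads-probability theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

# Probability: theta is fixed, the outcome k varies.
theta = 0.5
probs = [binomial_probability(k, 10, theta) for k in range(11)]

# Likelihood: the outcome is fixed (7 heads in 10 flips), theta varies.
grid = [i / 100 for i in range(101)]
likelihoods = [binomial_probability(7, 10, t) for t in grid]

# Maximum likelihood estimate: the theta on the grid with the highest likelihood.
theta_hat = grid[likelihoods.index(max(likelihoods))]
print(theta_hat)
```

The probabilities over all outcomes sum to one for a fixed theta, whereas the likelihood values over theta need not; the likelihood is only used to compare candidate parameters.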


2.   What is Bayes' theorem? How is it useful in a machine learning context?

Bayes' theorem allows us to determine posterior probabilities from our priors when presented with evidence. More simply, it is a method of revising existing predictions given new evidence.


Put in terms of odds: how much more likely A is than B now equals how much more likely A was than B before we saw the new evidence, times how much more likely this evidence would be to occur if A were true than if B were true.
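
Written out, the odds statement above and the more familiar form of the theorem look as follows. The symbol E for the evidence is notation introduced here for illustration.

```latex
\frac{P(A \mid E)}{P(B \mid E)}
  = \frac{P(A)}{P(B)} \times \frac{P(E \mid A)}{P(E \mid B)},
\qquad
P(A \mid E) = \frac{P(E \mid A)\, P(A)}{P(E)}
```

The left-hand ratio is the posterior odds, the first factor on the right is the prior odds, and the second factor is the likelihood ratio of the evidence.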

3.   What is the difference between discriminative and generative models?

Discriminative models learn decision boundaries between classes. A generative model explicitly models the actual distribution of each class. In the end, both predict a conditional probability, for example P(Animal | Features) when classifying animals, but they learn different probabilities to get there.


  • A generative model learns the joint probability distribution p(x,y). It predicts the conditional probability with the help of Bayes' theorem.

  • A discriminative model learns the conditional probability distribution p(y|x) directly.

  • Discriminative models tend to cope less well with outliers; generative models generally handle outliers better.


During the training phase, given some data, we determine the parameters of our model such that our model performance is maximized. In discriminative models, we maximize the conditional likelihood, that is, the conditional probability of the labels given the inputs and the model parameters. In generative models, we maximize the joint likelihood, that is, the joint probability of the inputs and labels given the model parameters.
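
As a sketch, the two training objectives and the way a generative model turns its joint distribution into a prediction can be written as follows. The notation (n training pairs (x_i, y_i), parameters theta, log-likelihoods) is an assumption made for illustration.

```latex
\begin{aligned}
\hat{\theta}_{\text{gen}}  &= \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i, y_i \mid \theta)
  && \text{(joint likelihood)} \\
\hat{\theta}_{\text{disc}} &= \arg\max_{\theta} \sum_{i=1}^{n} \log p(y_i \mid x_i, \theta)
  && \text{(conditional likelihood)} \\
p(y \mid x) &= \frac{p(x, y)}{\sum_{y'} p(x, y')}
  && \text{(prediction from the joint)}
\end{aligned}
```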


4.   Cross-validation

In normal cross-validation, say k-fold, we split the data (usually after shuffling) into k equal sized chunks, or folds. In each of k rounds, one fold is held out for testing the model and the remaining k-1 folds are used for training, so that every fold serves as the test set exactly once. The k outcomes are then averaged.
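
A minimal sketch of the procedure in plain Python is shown below. The function names and the `fit_and_score` callback are hypothetical, chosen for illustration; the sketch assumes the data has already been shuffled and works on row indices only.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def cross_validate(n_samples, k, fit_and_score):
    """Average the score returned by fit_and_score(train_idx, test_idx) over k folds."""
    scores = [fit_and_score(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

# Toy usage: the "score" is just the fraction of rows held out in each round.
mean_score = cross_validate(10, 5, lambda train, test: len(test) / 10)
print(mean_score)
```

In practice `fit_and_score` would train a model on the training rows and return its metric on the held-out rows.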


5.   How is a decision tree pruned?

Pruning removes nodes and branches from a decision tree to make it simpler, which mitigates overfitting and improves performance on unseen data. Suppose we have constructed a decision tree and have a validation set; for each leaf node we can determine the node purity. Ideally, we want the nodes to be as pure as possible for high accuracy, but it is very easy to overfit, so much so that a leaf node may contain only a single data point. We can mitigate this by pruning the decision tree with a method called cost-effective pruning.


Cost-effective pruning, which works much like the procedure commonly known as reduced-error pruning, proceeds as follows:

  1. Determine the performance of the original tree, T, on the validation data.

  2. Consider a sub-tree, t(1), and replace it in the original tree with a leaf.

  3. Determine the performance of the new tree, T(new), on the same validation data.

  4. If the change in performance on the validation set is insignificant, adopt the simpler (pruned) tree as the new original, following Occam’s razor, and continue to the next sub-tree. Otherwise, keep the sub-tree.

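[Figure: the original tree T, evaluated on the validation set, with sub-tree t(1) marked for replacement by a leaf. Node labels include "5 years" and "3 years"; one label reads "number of leaves".]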

Note that this method works from the bottom of the tree upward. A sub-tree considered for replacement by a leaf node should be one whose children are all leaf nodes, as shown in the example.
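
The pruning steps can be sketched in a few lines of Python. Everything here is an illustrative assumption rather than a reference implementation: the `Node` class, a binary tree with threshold splits, accuracy as the performance measure, and a `tolerance` parameter standing in for "insignificant".

```python
class Node:
    """Binary decision tree node; a node with no children is a leaf."""
    def __init__(self, prediction, feature=None, threshold=None, left=None, right=None):
        self.prediction = prediction  # majority class of the training points at this node
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right

    def is_leaf(self):
        return self.left is None and self.right is None

def predict(node, x):
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.prediction

def accuracy(root, X, y):
    return sum(predict(root, x) == label for x, label in zip(X, y)) / len(y)

def prune(root, node, X_val, y_val, tolerance=0.0):
    """Bottom-up: try to collapse every sub-tree whose children are both leaves."""
    if node.is_leaf():
        return
    prune(root, node.left, X_val, y_val, tolerance)
    prune(root, node.right, X_val, y_val, tolerance)
    if node.left.is_leaf() and node.right.is_leaf():
        before = accuracy(root, X_val, y_val)
        saved = (node.left, node.right)
        node.left = node.right = None        # replace the sub-tree with a leaf
        after = accuracy(root, X_val, y_val)
        if before - after > tolerance:       # significant drop: undo the pruning
            node.left, node.right = saved

# Hypothetical tree: the root split is useful, the deeper split fits noise.
tree = Node(prediction=0, feature=0, threshold=5.0,
            left=Node(prediction=0),
            right=Node(prediction=1, feature=1, threshold=2.0,
                       left=Node(prediction=0), right=Node(prediction=1)))
X_val = [[1.0, 0.0], [3.0, 4.0], [6.0, 1.0], [8.0, 3.0]]
y_val = [0, 0, 1, 1]
prune(tree, tree, X_val, y_val)
```

In this toy example the deeper split is collapsed because removing it does not hurt validation accuracy, while the root split is kept because collapsing it would.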

6.   How do you handle an imbalanced data set?

Imbalanced data occurs when we have a very large number of observations in one class (the majority class) and only a few observations in another class (the minority class). To balance the two classes, we can either undersample the majority class, that is, decrease its number of observations by random selection, or oversample the minority class, that is, artificially increase its number of observations.
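
Both options can be sketched with the standard library. The function names, the fixed seed and the toy class sizes are assumptions for illustration; oversampling here simply duplicates existing minority rows at random, which is only one of several ways to increase their number.

```python
import random

def undersample(majority, minority, seed=0):
    """Randomly keep only as many majority rows as there are minority rows."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)

def oversample(majority, minority, seed=0):
    """Randomly duplicate minority rows (sampling with replacement) to match the majority."""
    rng = random.Random(seed)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return list(majority), list(minority) + extra

majority = list(range(100))        # 100 hypothetical majority-class rows
minority = list(range(100, 110))   # 10 hypothetical minority-class rows

maj_u, min_u = undersample(majority, minority)
maj_o, min_o = oversample(majority, minority)
print(len(maj_u), len(min_u), len(maj_o), len(min_o))
```

Resampling is normally applied to the training split only, so that the test set still reflects the real class distribution.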


7.   How do you handle missing/corrupt data?

To deal with missing values, we can perform data imputation. The big idea is that where a value is missing, you fill one in, and how you do so depends on the type of the data.

For categorical values, you can add a new category, e.g. “unknown” or “other”. For numeric types, you can impute with zero and add an indicator variable showing that the value was missing; the model can then learn the effect of missingness from the indicator.
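
A small sketch of both imputation rules follows. It assumes missing entries are represented as `None`, and the function and column names are hypothetical.

```python
def impute_numeric(values, fill=0.0):
    """Return (filled_values, missing_indicator) for a numeric column."""
    filled = [fill if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return filled, indicator

def impute_categorical(values, placeholder="unknown"):
    """Replace missing categories with a dedicated placeholder category."""
    return [placeholder if v is None else v for v in values]

ages = [34.0, None, 51.0, None]
colors = ["red", None, "blue", "red"]

filled_ages, age_missing = impute_numeric(ages)
filled_colors = impute_categorical(colors)
print(filled_ages, age_missing, filled_colors)
```

The indicator column is passed to the model alongside the filled column, so an imputed zero can be told apart from a genuine zero.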


8.   How would you deal with outliers?

Analyze the data with and without outliers, since we cannot know in advance whether removing them will have an adverse effect. That said, there are two methods of dealing with outliers:

  1. Trimming, where we delete the outlier altogether

  2. Winsorizing, where we cap or floor the value at the closest acceptable non-outlier value, that is, the maximum or minimum value not considered an outlier.

The latter is usually the preferred technique.
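
The two methods differ only in what happens to a value outside the acceptable range, as this sketch shows. The bounds are passed in by hand here as an assumption; how they are chosen (percentiles, domain knowledge) is left open.

```python
def trim(values, lower, upper):
    """Trimming: drop every value outside [lower, upper]."""
    return [v for v in values if lower <= v <= upper]

def winsorize(values, lower, upper):
    """Winsorizing: clamp values below lower up to lower, and above upper down to upper."""
    return [min(max(v, lower), upper) for v in values]

data = [2, 3, 4, 5, 120, -50]
lower, upper = 2, 5   # smallest and largest values judged acceptable (hypothetical)

trimmed = trim(data, lower, upper)
winsorized = winsorize(data, lower, upper)
print(trimmed, winsorized)
```

Trimming shrinks the data set, while winsorizing keeps every row and only changes the extreme values.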


9.   How do you avoid overfitting?

This question is tricky, as you have to know what model you are dealing with to understand what techniques need to be applied to handle overfitting.



Tree-Based Models

Neural Networks


10.   What is the difference between L1 and L2 penalty terms?

The L1 method puts a constraint on the sum of the absolute values of the model parameters, which has to be less than a fixed value. To do so, it applies a shrinking (regularization) process that penalizes the coefficients of the regression variables, shrinking some of them to exactly zero.


The L2 method penalizes the sum of the squared values of the model parameters. It tries to keep the parameters small and acts as a penalty on models with many large feature weights, but it does not shrink them to exactly zero.


You can also mention that L1 is used in Lasso regression, and L2 is used in Ridge regression, and describe each of them in brief.
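
The different behavior of the two penalties can be seen in a one-dimensional sketch. It assumes the penalized (rather than constrained) form of each problem and a single weight w pulled toward an unpenalized estimate z; the function names and the value of `lam` are illustrative.

```python
def lasso_1d(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*abs(w): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0   # small estimates are set exactly to zero

def ridge_1d(z, lam):
    """Minimizer of 0.5*(w - z)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return z / (1.0 + lam)

unpenalized = [3.0, 0.4, -0.2, -2.5]   # hypothetical unpenalized weights
lam = 0.5

l1_weights = [lasso_1d(z, lam) for z in unpenalized]
l2_weights = [ridge_1d(z, lam) for z in unpenalized]
print(l1_weights)
print(l2_weights)
```

The L1 solution zeroes out the two small weights, while the L2 solution shrinks every weight by the same factor and leaves none at exactly zero.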