Statistical Methods in Machine Learning
Factor Analysis
1. Gentle Introduction
Factor analysis is used to estimate a model which explains variance/covariance between a set of observed variables (in a population) by a set of (usually) fewer unobserved factors and their weights.
The key concept of factor analysis is that multiple observed variables have similar patterns of responses because they are all associated with a latent (i.e. not directly measured) variable.
For example, people may respond similarly to questions about Income, Education, and Occupation, which are all associated with the latent variable Socioeconomic Status.
[Diagram: the observed variables Income, Education, and Occupation, each linked to the latent variable Socioeconomic Status.]
Another example: suppose we have a sample of individuals, and we want to know why they experience insomnia, have suicidal thoughts, hyperventilate, and feel nauseous most of the time.
Within that sample, there is a degree of variance and covariance among this set of variables. For example, there might be some covariance between insomnia and suicidal thoughts. When we use factor analysis, we suppose that the variance and covariance structure in our observed characteristics is due, at least in part, to some unobserved factors. Thus, the aim of factor analysis is to come up with a model that explains the covariance between a set of observed variables (in a population) by a set of (usually) fewer unobserved factors and their weights.
Assume that the unobserved factors in our example are depression and extreme anxiety. These two underlying factors are then taken to be responsible for the variance and covariance among all of the variables that we observed.
[Diagram: the latent factors Depression and Extreme Anxiety, each linked to the observed variables Insomnia, Suicidal Thoughts, Hyperventilate, and Feel Nauseous.]
Both factors, Depression and Anxiety, have their weightings, or causal effects, on each of the observed characteristics: Insomnia, Suicidal Thoughts, Hyperventilate, and Feel Nauseous. Typically, the weightings which these unobserved characteristics (factors) have on these observed characteristics differ from each other. These weightings are called loadings. You can see these loadings outlined below:
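[Diagram: Depression and Extreme Anxiety linked by loadings to Insomnia, Suicidal Thoughts, Hyperventilate, and Feel Nauseous; the variance of each observed variable is split into a "variant commonality" (shared factors) part and a "unique variance" part.]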
When doing factor analysis, we suppose that a certain proportion of the variance of, say, insomnia is due to the shared unobserved factors, Depression and Anxiety. We call this proportion a variant commonality, more often called the communality (shown in dark grey): the proportion of variance explained by a set of factors (in our example, Depression and Anxiety) that are also common (related) to the other observed variables. We also suppose that there is a proportion of insomnia which is not explained by these unobserved factors. We call this the unique variance of that particular observed variable, Insomnia (shown in green). It is unique because it is not caused by the common set of factors. Therefore, in factor analysis we suppose that there is a set of unobserved variables $e_i$ (where $i$ indexes a particular observed variable), which themselves explain the unique variance of that particular observed variable.
In our example, the unobserved variables $e_i$ that explain the unique variance of each observed variable are $e_1$, $e_2$, $e_3$, $e_4$, outlined below.
[Diagram: as above, with the unique factors $e_1$, $e_2$, $e_3$, $e_4$ attached to Insomnia, Suicidal Thoughts, Hyperventilate, and Feel Nauseous, respectively.]
If these unique factors $e_1$, $e_2$, $e_3$, $e_4$ were themselves correlated, then one proportion of the covariance would be due to the shared factors and a further proportion would be due to the unique factors. However, keep in mind that in factor analysis we assume that $e_1$, $e_2$, $e_3$, $e_4$ are independent of each other. We will get back to all the assumptions of factor analysis later in this article.
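The split between shared and unique variance can be made concrete with a small sketch. All numbers below are hypothetical, chosen only for illustration, and the sketch assumes that the two factors are standardized (variance 1) and independent of each other and of the unique factors. Under those assumptions, the covariance between two different observed variables comes only from the loadings, while each variable's own variance is the shared part plus its unique variance.

```python
# Hypothetical loadings (illustrative values, not estimates from data).
# Each tuple: (loading on Depression, loading on Extreme Anxiety).
loadings = {
    "Insomnia":          (0.7, 0.3),
    "Suicidal Thoughts": (0.8, 0.1),
    "Hyperventilate":    (0.1, 0.8),
    "Feel Nauseous":     (0.2, 0.6),
}

# Hypothetical variances of the unique factors e_1, ..., e_4.
unique_var = {
    "Insomnia": 0.42,
    "Suicidal Thoughts": 0.35,
    "Hyperventilate": 0.35,
    "Feel Nauseous": 0.60,
}


def implied_cov(a, b):
    """Model-implied covariance of observed variables a and b.

    Assumes standardized, independent factors and independent unique
    factors, so unique variance only contributes when a == b.
    """
    shared = sum(x * y for x, y in zip(loadings[a], loadings[b]))
    return shared + (unique_var[a] if a == b else 0.0)


def common_share(a):
    """Proportion of the variance of a explained by the shared factors."""
    shared = sum(x * x for x in loadings[a])
    return shared / implied_cov(a, a)


# Covariance between two symptoms: 0.7 * 0.8 + 0.3 * 0.1
print(implied_cov("Insomnia", "Suicidal Thoughts"))
# Shared proportion of the variance of Insomnia: (0.49 + 0.09) / (0.58 + 0.42)
print(common_share("Insomnia"))
```

If the unique factors were allowed to correlate, `implied_cov` would need an extra term for pairs of different variables, which is exactly the situation the independence assumption rules out.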
2. Introduction to Factor Analysis
More formally, factor analysis is a method for investigating whether a number of variables of interest, $Y_1, Y_2, \ldots, Y_p$, are linearly related to a smaller number of unobservable factors, $F_1, F_2, \ldots, F_k$.
The fact that the factors are not observable disqualifies regression and many other methods. We shall see, however, that under certain conditions the hypothesized factor model has certain implications, and these implications in turn can be tested against the observations. Exactly what these conditions and implications are, and how the model can be tested, must be explained with some care.
Factor analysis is best explained in the context of a simple example. Students entering a certain MBA program must take three required courses in finance, marketing, and business policy. Let $Y_1$, $Y_2$, and $Y_3$, respectively, represent a student's grades in these courses. The available data consist of the grades of five students (on a 10-point numerical scale), as shown in Figure 1.
Figure 1: Grades
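It has been suggested that these grades are functions of two underlying factors, $F_1$ and $F_2$, tentatively and rather loosely described as quantitative ability and verbal ability, respectively. It is assumed that each Y variable is linearly related to the two factors, as follows:
$$
\begin{aligned}
Y_1 &= \beta_{10} + \beta_{11} F_1 + \beta_{12} F_2 + e_1 \\
Y_2 &= \beta_{20} + \beta_{21} F_1 + \beta_{22} F_2 + e_2 \\
Y_3 &= \beta_{30} + \beta_{31} F_1 + \beta_{32} F_2 + e_3
\end{aligned}
$$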
The error terms $e_1$, $e_2$, and $e_3$ serve to indicate that the hypothesized relationships are not exact. In the special vocabulary of factor analysis, the parameters $\beta_{ij}$ are referred to as loadings, where $i$ indexes the observed variable $Y_i$ and $j$ indexes the unobserved factor $F_j$. For example, $\beta_{12}$ is called the loading of variable $Y_1$ on factor $F_2$. In this MBA program, finance is highly quantitative, while marketing and policy have a strong qualitative orientation. Quantitative skills should help a student in finance, but not in marketing or policy. Verbal skills should be helpful in marketing or policy but not in finance. In other words, it is expected that the loadings have roughly the following structure:

Variable          Loading on F_1 (quantitative)   Loading on F_2 (verbal)
Finance, Y_1      +                               0
Marketing, Y_2    0                               +
Policy, Y_3       0                               +
The grade in the finance course is expected to be positively related to quantitative ability but unrelated to verbal ability; the grades in marketing and policy, on the other hand, are expected to be positively related to verbal ability but unrelated to quantitative ability. Of course, the zeros in the preceding table are not expected to be exactly equal to zero. By '0' we mean approximately equal to zero, and by '+' a positive number substantially different from zero.
It may appear that the loadings can be estimated and the expectations tested by regressing each Y against the two factors. Such an approach, however, is not feasible because the factors cannot be observed. An entirely new strategy is required.
3. Assumptions & Implications
Let us turn to the process that generates the observations on $Y_1$, $Y_2$, and $Y_3$ according to the linear relationships of Section 2. The simplest model of factor analysis is based on two assumptions concerning these relationships. We shall first describe these assumptions and then examine their implications.
3.1. Assumptions
Assumption I: The error terms $e_i$ are independent of one another, and such that $E(e_i) = 0$ and $\mathrm{Var}(e_i) = \sigma_i^2$.
Assumption II: The unobservable factors $F_j$ are independent of one another and of the error terms, and are such that $E(F_j) = 0$ and $\mathrm{Var}(F_j) = 1$.
In the context of the present example, this means in part that there is no relationship between quantitative and verbal ability. In more advanced models of factor analysis, the condition that the factors are independent of one another can be relaxed. As for the factor means and variances, the assumption is that the factors are standardized. It is an assumption made for mathematical convenience; since the factors are not observable, we might as well think of them as measured in standardized form.
3.2. Implications
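Let us now examine some implications of these assumptions. Each observable variable is a linear function of independent factors and error terms, with intercept $\beta_{i0}$ and loadings $\beta_{i1}$ and $\beta_{i2}$, and can be written as:
$$Y_i = \beta_{i0} + \beta_{i1} F_1 + \beta_{i2} F_2 + (1)\, e_i$$
The variance of $Y_i$ can be calculated by applying the result that the variance of a linear function of independent random variables is the sum of their variances, each multiplied by its squared coefficient:
$$\mathrm{Var}(Y_i) = \beta_{i1}^2 \,\mathrm{Var}(F_1) + \beta_{i2}^2 \,\mathrm{Var}(F_2) + (1)^2 \,\mathrm{Var}(e_i) = \beta_{i1}^2 + \beta_{i2}^2 + \sigma_i^2$$
We see that the variance of $Y_i$ consists of two parts:
$$\mathrm{Var}(Y_i) = \underbrace{\beta_{i1}^2 + \beta_{i2}^2}_{\text{communality}} + \underbrace{\sigma_i^2}_{\text{specific variance}}$$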
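As a rough numerical check of this decomposition, the sketch below simulates one observed variable under Assumptions I and II and compares its sample variance with the sum of the squared loadings plus the variance of the error term. The intercept, loadings, and error variance are hypothetical values, and drawing the factors and errors from normal distributions is an extra assumption made only for the simulation; the model itself only constrains means, variances, and independence.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # large sample so sample moments sit close to their theoretical values

# Hypothetical parameters for a single observed variable Y_i (assumed values).
beta0, beta1, beta2 = 5.0, 0.9, 0.3  # intercept and the two loadings
sigma2 = 0.5                         # variance of the error term e_i

# Assumption II: independent factors with mean 0 and variance 1.
F1 = rng.standard_normal(n)
F2 = rng.standard_normal(n)

# Assumption I: error term with mean 0 and variance sigma2, independent of the factors.
e = rng.normal(0.0, np.sqrt(sigma2), n)

# The linear factor model for Y_i.
Y = beta0 + beta1 * F1 + beta2 * F2 + e

communality = beta1**2 + beta2**2       # part of Var(Y_i) due to the common factors
theoretical_var = communality + sigma2  # communality + specific variance
sample_var = Y.var(ddof=1)

print(f"communality        = {communality:.3f}")
print(f"theoretical Var(Y) = {theoretical_var:.3f}")
print(f"sample Var(Y)      = {sample_var:.3f}")
```

The intercept does not enter the variance, which is why only the loadings and the error variance appear in the decomposition.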
In Progress
X. Summary
Here is a list of takeaways:

Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. This method investigates whether a number of variables of interest are linearly related to a smaller number of unobservable factors.

In the special vocabulary of factor analysis, the parameters of these linear functions are referred to as loadings.

Under certain conditions (Assumptions I and II), the variance of each variable and the covariance of each pair of variables can be expressed in terms of the loadings and the variances of the error terms.

The communality of a variable is the part of its variance that is explained by the set of unobserved common factors. The specific variance of a variable is the part of its variance that is not explained by the common factors.

There exist an infinite number of sets of loadings yielding the same theoretical variances and covariances.

Factor analysis usually proceeds in two stages. In the first, one set of loadings is calculated which yields theoretical variances and covariances that fit the observed ones as closely as possible according to a certain criterion. These loadings, however, may not agree with the prior expectations, or may not lend themselves to a reasonable interpretation. Thus, in the second stage, the first loadings are "rotated" in an effort to arrive at another set of loadings that fit equally well the observed variances and covariances but are more consistent with prior expectations or more easily interpreted.

A method widely used for determining the first set of loadings is the principal component method. This method seeks values of the loadings that bring the estimate of the total communality as close as possible to the total of the observed variances.

When the variables are not measured in the same units, it is customary to standardize them prior to subjecting them to the principal component method so that all have mean equal to zero and variance equal to one.

The varimax rotation method encourages the detection of factors each of which is related to a few variables. It discourages the detection of factors influencing all variables.
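The two-stage procedure in the takeaways above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the data are simulated from hypothetical loadings, `pc_loadings` is an illustrative helper that standardizes the variables and scales the leading eigenvectors of the correlation matrix, and `varimax` is a commonly used SVD-based iteration for the varimax criterion. The final check illustrates that rotated loadings reproduce exactly the same fitted variances and covariances as the unrotated ones.

```python
import numpy as np


def pc_loadings(X, k):
    """Principal component loadings for k factors (illustrative helper)."""
    # Standardize so every variable has mean 0 and variance 1.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = (Z.T @ Z) / (len(Z) - 1)          # sample correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]      # indices of the k largest
    # Scale each eigenvector by the square root of its eigenvalue.
    return eigvecs[:, top] * np.sqrt(eigvals[top])


def varimax(L, gamma=1.0, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation via the usual SVD-based iteration."""
    p, k = L.shape
    R = np.eye(k)  # start from the unrotated solution
    d = 0.0
    for _ in range(max_iter):
        LR = L @ R
        grad = L.T @ (LR**3 - (gamma / p) * LR @ np.diag(np.sum(LR**2, axis=0)))
        u, s, vt = np.linalg.svd(grad)
        R = u @ vt  # closest orthogonal matrix to the gradient
        d_new = s.sum()
        if d_new < d * (1 + tol):  # criterion stopped improving
            break
        d = d_new
    return L @ R, R


# Simulated data from a hypothetical two-factor model (assumed values).
rng = np.random.default_rng(0)
n = 500
true_loadings = np.array([[0.9, 0.0],
                          [0.8, 0.1],
                          [0.1, 0.8],
                          [0.0, 0.9]])
F = rng.standard_normal((n, 2))        # independent standardized factors
E = 0.4 * rng.standard_normal((n, 4))  # independent error terms
X = F @ true_loadings.T + E

L = pc_loadings(X, k=2)  # stage 1: initial loadings
L_rot, R = varimax(L)    # stage 2: rotated loadings

print("unrotated loadings:\n", np.round(L, 2))
print("varimax loadings:\n", np.round(L_rot, 2))
# Both sets of loadings imply the same common-factor covariances.
print("same fit:", np.allclose(L @ L.T, L_rot @ L_rot.T))
```

Because `R` is orthogonal, any such rotation leaves `L @ L.T` unchanged, which is one way to see why infinitely many sets of loadings fit equally well. The signs and column order of the printed loadings are arbitrary and may differ between runs with other data.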