# 1.   Gentle Introduction

Factor analysis is used to estimate a model which explains variance/covariance between a set of observed variables (in a population) by a set of (usually) fewer unobserved factors and their weights.

The key concept of factor analysis is that multiple observed variables have similar patterns of responses because they are all associated with a latent (i.e. not directly measured) variable.

For example, people may respond similarly to questions about Income, Education, and Occupation, which are all associated with the latent variable Socioeconomic Status.

[Figure: the observed variables Income, Education, and Occupation, each linked to the latent variable Socioeconomic Status.]

Here is another example. Suppose we have data on a particular set of observed characteristics for a sample of individuals, and we want to know why they experience insomnia, have suicidal thoughts, hyperventilate, and feel nauseous most of the time.

Within that sample, there is a degree of variance and covariance among this set of variables. For example, there might be some covariance between insomnia and suicidal thoughts. When we use factor analysis, we suppose that the variance and covariance structure in our observed characteristics is, at least in part, due to some unobserved factors. Thus, the aim of factor analysis is to come up with a model that explains the covariance among a set of observed variables (in a population) by a set of (usually) fewer unobserved factors and their weights.

Assume that the unobserved factors in our example are depression and extreme anxiety. These two underlying factors are then responsible for the variance and covariance among all of the variables that we observed.

[Figure: the observed variables Insomnia, Suicidal Thoughts, Hyperventilate, and Feel Nauseous, linked to the unobserved factors Depression and Extreme Anxiety.]

Both factors, Depression and Extreme Anxiety, have weightings (causal effects) on each of the observed characteristics: Insomnia, Suicidal Thoughts, Hyperventilate, and Feel Nauseous. Typically, the weightings that these unobserved factors have on the observed characteristics differ from each other. These weightings are called loadings. You can see these loadings outlined below:

[Figure: loadings of Depression and Extreme Anxiety on Insomnia, Suicidal Thoughts, Hyperventilate, and Feel Nauseous. The variance of each observed variable is split into communality (shared factors) and unique variance.]
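To make the idea concrete, here is a minimal simulation sketch. The loading values are hypothetical (they are not taken from the figure), and the sketch assumes two independent, unit-variance factors and standardized observed variables. It generates data from the two factors and compares the sample correlations with the correlations implied by the loadings.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000  # number of simulated individuals

# Two independent latent factors: column 0 = Depression, column 1 = Extreme Anxiety.
factors = rng.standard_normal((n, 2))

# Hypothetical loadings, chosen only for illustration.
# Rows: Insomnia, Suicidal Thoughts, Hyperventilate, Feel Nauseous.
loadings = np.array([
    [0.7, 0.3],
    [0.8, 0.1],
    [0.1, 0.8],
    [0.2, 0.6],
])

# Unique parts, scaled so that every observed variable has unit variance.
unique_var = 1.0 - (loadings ** 2).sum(axis=1)
unique = rng.standard_normal((n, 4)) * np.sqrt(unique_var)

# Each observed variable is a weighted sum of the factors plus its unique part.
observed = factors @ loadings.T + unique

sample_corr = np.corrcoef(observed, rowvar=False)
implied_corr = loadings @ loadings.T + np.diag(unique_var)

print(np.round(sample_corr, 2))
print(np.round(implied_corr, 2))
```

Under these assumptions the two matrices should agree up to sampling noise: variables that load on the same factor end up correlated even though no observed variable directly influences another.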

When doing factor analysis, we suppose that a certain proportion of the variance of, e.g., insomnia is due to the shared unobserved factors, Depression and Anxiety. We call this proportion the communality, or common variance (dark grey in the figure): the proportion of variance explained by a set of factors (in our example, Depression and Anxiety) that are also common (related) to the other observed variables. We also suppose that there is a proportion of the variance of insomnia that is not explained by these unobserved factors. We call this the unique variance of that particular observed variable, Insomnia (green in the figure). It is unique because it is not caused by the common set of factors. Therefore, in factor analysis we suppose that there is a set of unobserved variables $e_i$ (where $i$ indexes a particular observed variable), which explain the unique variance of each observed variable.
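The split between shared and unique variance can be sketched numerically. The loadings below are hypothetical, and the calculation assumes standardized observed variables and uncorrelated factors. Under those assumptions the communality of a variable is the sum of its squared loadings, and the unique variance is whatever remains.

```python
import numpy as np

names = ["Insomnia", "Suicidal Thoughts", "Hyperventilate", "Feel Nauseous"]

# Hypothetical loadings; columns: Depression, Extreme Anxiety.
loadings = np.array([
    [0.7, 0.3],
    [0.8, 0.1],
    [0.1, 0.8],
    [0.2, 0.6],
])

# Communality: share of each variable's variance explained by the common factors
# (valid for standardized variables and uncorrelated factors).
communality = (loadings ** 2).sum(axis=1)

# Unique variance: the share not explained by the common factors.
unique_variance = 1.0 - communality

for name, h2, u2 in zip(names, communality, unique_variance):
    print(f"{name:18s} communality={h2:.2f}  unique variance={u2:.2f}")
```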

In our example, the unobserved variables $e_i$ that explain the unique variance of each observed variable are $e_1$, $e_2$, $e_3$, and $e_4$, outlined below.

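[Figure: the same diagram of Depression and Extreme Anxiety loading on Insomnia, Suicidal Thoughts, Hyperventilate, and Feel Nauseous, with a unique factor attached to each observed variable.]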

If these unique factors ($e_1$, $e_2$, $e_3$, $e_4$) are themselves correlated, then part of the covariance between the observed variables is due to the shared factors and part is due to the unique factors. However, keep in mind that factor analysis assumes that $e_1$, $e_2$, $e_3$, and $e_4$ are independent of each other. We will get back to all the assumptions of factor analysis later in this article.
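This decomposition can be written out as a short sketch. The notation is an assumption made for illustration: $Y_i$ and $Y_k$ are two different observed variables, $\beta_{ij}$ is the loading of $Y_i$ on factor $j$, and the factors are taken to be uncorrelated with each other and with the unique factors, with unit variance.

$$\operatorname{Cov}(Y_i, Y_k) = \underbrace{\sum_{j} \beta_{ij}\,\beta_{kj}}_{\text{due to shared factors}} + \underbrace{\operatorname{Cov}(e_i, e_k)}_{\text{due to unique factors}}, \qquad i \neq k.$$

When the unique factors are independent, $\operatorname{Cov}(e_i, e_k) = 0$, so all covariance between two observed variables comes from the shared factors.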

# 2.   Introduction to Factor Analysis

More formally, factor analysis is a method for investigating whether a number of variables of interest, $Y_1, Y_2, \dots, Y_l$, are linearly related to a smaller number of unobservable factors, $F_1, F_2, \dots, F_k$.
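One common way to write this linear relationship is sketched below. The notation is an assumption made for illustration: $Y_i$ is the $i$-th observed variable, $F_j$ is the $j$-th factor, $\beta_{i0}$ is an intercept, $\beta_{ij}$ is the weight of factor $j$ in variable $i$, and $e_i$ is an error term.

$$Y_i = \beta_{i0} + \beta_{i1}F_1 + \beta_{i2}F_2 + \dots + \beta_{ik}F_k + e_i, \qquad i = 1, \dots, l, \quad k < l.$$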

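The error terms $e_1$, $e_2$, and $e_3$ serve to indicate that the hypothesized relationships are not exact. In the special vocabulary of factor analysis, the parameters $\beta_{ij}$ are referred to as loadings, where $i$ indexes an observed variable $Y_i$ and $j$ indexes an unobserved factor $F_j$. For example, $\beta_{12}$ is called the loading of variable $Y_1$ on factor $F_2$.

As an illustration, take students' grades in three courses of an MBA program (Finance, marketing, and policy) as the observed variables, and quantitative and verbal ability as the two unobserved factors. In this MBA program, Finance is highly quantitative, while marketing and policy have a strong qualitative orientation. Quantitative skills should help a student in Finance, but not in marketing or policy. Verbal skills should be helpful in marketing or policy but not in Finance. In other words, it is expected that the loadings have roughly the following structure:

| Course | Loading on quantitative ability | Loading on verbal ability |
|---|---|---|
| Finance | + | 0 |
| Marketing | 0 | + |
| Policy | 0 | + |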

The grade in the Finance course is expected to be positively related to quantitative ability but unrelated to verbal ability; the grades in marketing and policy, on the other hand, are expected to be positively related to verbal ability but unrelated to quantitative ability. Of course, the zeros in the preceding table are not expected to be exactly equal to zero. By '0' we mean approximately equal to zero and by '+' a positive number substantially different from zero.
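The observable consequence of such a loading pattern can be sketched as follows. The numbers are hypothetical stand-ins for '+' and '0', and the calculation assumes standardized grades and two uncorrelated, unit-variance factors. It computes the correlations between the three grades that the pattern would imply.

```python
import numpy as np

courses = ["Finance", "Marketing", "Policy"]

# Hypothetical loadings standing in for the '+' / '0' pattern.
# Columns: quantitative ability, verbal ability.
loadings = np.array([
    [0.8, 0.0],  # Finance:   + on quantitative, 0 on verbal
    [0.0, 0.7],  # Marketing: 0 on quantitative, + on verbal
    [0.0, 0.6],  # Policy:    0 on quantitative, + on verbal
])

# Variance not explained by the two factors (assumes standardized grades).
unique_var = 1.0 - (loadings ** 2).sum(axis=1)

# Correlations implied by the loadings (assumes uncorrelated, unit-variance factors).
implied_corr = loadings @ loadings.T + np.diag(unique_var)

print(courses)
print(np.round(implied_corr, 2))
```

Under these assumptions, the marketing and policy grades are correlated with each other because they share the verbal factor, while the Finance grade is uncorrelated with both. Patterns of this kind in the observed correlations are what carry information about the unobserved loadings.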

It may appear that the loadings can be estimated and the expectations tested by regressing each Y against the two factors. Such an approach, however, is not feasible because the factors cannot be observed. An entirely new strategy is required.

# 3.   Assumptions & Implications

Let us turn to the process that generates the observations on $Y_1$, $Y_2$, and $Y_3$ according to Figure 1. The simplest model of factor analysis is based on two assumptions concerning the relationships in Figure 1. We shall first describe these assumptions and then examine their implications.

# X.   Summary

Here is a list of take-aways: