Srivaths Gondi

Factor analysis (Principal component analysis)


Dimensionality reduction


In a phrase, PCA (principal component analysis) is primarily a dimensionality reduction method, but if you understand the perspective behind how this statistical method works, it is even more interesting than the functionality itself.

PCA is a specific technique within factor analysis that helps us understand the correlation and the importance of each dimension or attribute in the data. Let's walk through this intuitively.

To understand the method, we must understand the problem we are trying to solve through factor analysis.


What is factor analysis?


Imagine you have many friends (it's sad if you actually have to imagine), and you observe how they behave in different situations. You might notice that some of your friends tend to be outgoing, while others are more reserved. Some may be really good at sports, while others excel in academic pursuits. These are all observable behaviors or characteristics that you can see on the surface.

Now, what if I told you that there are some underlying factors or hidden traits that contribute to these behaviors? Factor Analysis is like a magical tool that helps us uncover these hidden factors!


Factor Analysis is a statistical technique that allows us to dig beneath the surface and understand the hidden patterns and relationships among different variables. These variables can be anything from test scores and survey responses to the features of complex datasets.

Although factor analysis helps us understand quite a lot about the data, this blog is mainly focused on PCA (principal component analysis).


Back to PCA


PCA is an unsupervised learning algorithm, which means it doesn't need a target variable. It works just like how you convert a real-life 3D object into 2D images with a camera: just as some images contain more information about the object than others, this algorithm devises a method to determine the most useful directions for dimensionality reduction.
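To make this concrete, here is a minimal sketch using scikit-learn; the 6x3 data matrix is made up for illustration, and any numeric dataset would do:

    import numpy as np
    from sklearn.decomposition import PCA

    # Hypothetical data: 6 samples measured on 3 attributes
    rng = np.random.default_rng(42)
    X = rng.normal(size=(6, 3))

    # Ask for 2 components: the two "camera angles" that together
    # capture the most variation in the 3D data
    pca = PCA(n_components=2)
    X_2d = pca.fit_transform(X)

    print(X_2d.shape)  # (6, 2) -- each 3D point now has 2D coordinates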


Consider the following dataset:

[Figure: mouse gene dataset]

Given this dataset, you are asked to reduce the dimensionality from 3D to 2D.


To help you understand intuitively, let's consider a hypothetical situation:

You suspect a terrorist is located in a certain building, but you have only one surveillance camera to install in the area. The best course of action would be to install the camera in a location where it can capture most of the people and activity in the area for study and inspection. If you were given another camera, it would be installed in a spot covering what the first camera cannot, but the first camera would still gather much more information than the second.

Let's keep this scenario in mind and take a look at the plot for the dataset above.


[Figure: plot of the dataset]

In PCA, we aim to represent all the features or attributes using only one dimension, which is referred to as 'PC1'. Later on, this can be expanded into PC2, PC3, and so on.


Note: the number of principal components (PCn) is at most the number of dimensions.

The more principal components we keep, the more accurately they can represent the variation in the data across the different attributes, and the lower the chance of underfitting.



Finally, PC1 is the line onto which the projections of the plotted points have the maximum variance.


Note: PC1 is a whole new axis on its own. It is plotted with the help of eigenvectors.

[Figure: PCA plot showing PC1]

As you can see, PC1 is plotted in a manner that captures the most variance in the projections of the points.


Mathematically, PC1 is the unit eigenvector of the data's covariance matrix with the highest eigenvalue among the three, and the new origin for these principal components is the mean of the plotted points.
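To see this concretely, here is a small NumPy sketch of that computation (the 6x3 data matrix is again hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 3))  # hypothetical 6x3 dataset

    # The new origin is the mean of the plotted points
    X_centered = X - X.mean(axis=0)

    # Covariance matrix of the centered data (3x3 for 3 attributes)
    cov = np.cov(X_centered, rowvar=False)

    # np.linalg.eigh returns eigenvalues in ascending order
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # PC1 is the unit eigenvector paired with the largest eigenvalue
    pc1 = eigenvectors[:, -1]
    print(np.linalg.norm(pc1))  # 1.0 -- it is a unit vector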


To get a better view of this kind of plot, copy and paste code like the following into your editor (a minimal sketch that uses synthetic data in place of the dataset above):
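    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    # Synthetic stand-in for the 3-gene dataset
    rng = np.random.default_rng(1)
    X = rng.normal(size=(60, 3)) * np.array([4.0, 2.0, 0.5])
    X_centered = X - X.mean(axis=0)

    pca = PCA(n_components=2).fit(X_centered)
    pc1 = pca.components_[0]  # unit vector along the new PC1 axis

    fig = plt.figure()
    ax = fig.add_subplot(projection='3d')
    ax.scatter(*X_centered.T)

    # Draw PC1 as a line through the new origin (the mean)
    line = np.outer(np.linspace(-8, 8, 2), pc1)
    ax.plot(*line.T, color='red', label='PC1')
    ax.legend()
    plt.show()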



Benefits of plotting more PCs


However, if we plot more principal components, we can retain more of the variance and behavior across the different dimensions.

As mentioned before, it is always better to have more surveillance cameras inspecting the place, even though the first camera does its best to capture all the different views and corners of the building.


Since we are free to represent our original data in 2 of its 3 dimensions, we can conveniently use PC2 to interpret our data further.
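One way to quantify what each extra "camera" buys us is the explained-variance ratio that scikit-learn reports. A brief sketch, again on hypothetical data:

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.default_rng(2).normal(size=(6, 3))  # hypothetical data
    pca = PCA(n_components=2).fit(X)

    # Fraction of the total variance captured by PC1 and PC2 respectively;
    # PC1's share is always the largest, and PC2 adds what PC1 missed
    print(pca.explained_variance_ratio_)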


[Figure: PCA plot showing PC1 and PC2]

Upon close observation, it becomes clear that the data is far less spread out along PC2. However, PC2 captures certain elements and variations in the data that PC1 is unable to capture.


An apt way to imagine this would be to think of PC2 as capturing a video of a dance performance from the side of the stage, while PC1 captures the video diagonally across the stage, partially covering both perspectives.


To be more mathematically accurate, the principal components are always orthogonal to each other.

Now that we have plotted both, what are the values of the attributes on the new PC1 & PC2 axes?

They are nothing but the coordinates of the projections of the points onto each new axis. Visually, it may look something like this:


PC1:



[Figure: point projections onto PC1]

PC2:



[Figure: point projections onto PC2]

(The above figures have nothing to do with the dataset; they are just for visualization and understanding purposes.)




Coordinates of point projections on PC1:

[6.503507, 3.322388, 1.518042, -4.529308, -5.498534, -6.316095]


PC2:

[1.112709, 3.874071, 1.715182, -0.511816, -0.749608, -0.799539]
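These coordinates are just the dot products of each mean-centered point with the corresponding unit eigenvector. A hedged sketch of computing them (hypothetical data, scikit-learn's PCA):

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.default_rng(3).normal(size=(6, 3))  # hypothetical data
    X_centered = X - X.mean(axis=0)

    pca = PCA(n_components=2).fit(X_centered)

    # Coordinate on each new axis = projection onto that unit eigenvector
    pc1_coords = X_centered @ pca.components_[0]
    pc2_coords = X_centered @ pca.components_[1]

    # The principal components are orthogonal, as noted above
    assert abs(pca.components_[0] @ pca.components_[1]) < 1e-9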


Finally, we can replace all 3 genes with the 2 new dimensions, or axes, we have created (PC1 & PC2).


To summarize, we have ignored the original axes (x, y, z) and created our own axes (PC1, PC2) that capture all the behavior in just two dimensions.
