Correlation Analysis and Dimensionality Reduction

Correlation Analysis is meant for analyzing how two ordinal variables co-variate (increase and decrease together). It helps in at least two fields:
1. Regression analysis: By choosing which variables should be included.
2. Dimensionality reduction: If 2+ independent variables are correlated (this is called: multi-collinearity), we can keep one of them for the regression or any other analysis.

A famous measure of correlation is the Pearson-Linear correlation Coefficient:

Say if you have 2 variables: If they co-variate, they may get 0.5*0.5=0.25. But if they don’t, then their

First to determine how you may need to apply some pre-processing:

  • You may need to filter out the small samples. To do so you need to calculate the minimum acceptable sample size. Here are the formulas on how to do so.
  • If the variable is a rate, you need to calculate this rate and its composants (nominator and denominator).

After that, you calculate the correlation between each two variables. At least, check the correlation between your dependent variable and other “independent” variables.

Correlation Matrix:
I recommend building the whole correlation matrix between each two variables. This helps advancing the dimensionality reduction: If two or more “independent variables” are found to be correlated and all of them are correlated with the dependent variable, one of them can be kept to include in analysis and the others can be omitted.

In excel, go to Data tab > Data Analysis > Correlation. Then choose the input data as your matrix:

(Note the correlation matrix behind when zooming in: For the N variables, the matrix is NxN. The rows and columns have the names of all variables ).
And in the next box you choose the input range. Here you choose all your variables in one range and check the box “Labels in First Row”:

Then run ok > Essentially this is how the correlation matrix look like from distance:


Then check for the significance value for the corresponding degree of freedom and significance level, from the table here.

Last: Use the option: Conditional Formatting to highlight the correlation values that exceed the tabulated significance value. Like the above matrix. This can be achieved for both directions: positively and negatively correlated variables.

Leave a Reply

Your email address will not be published. Required fields are marked *