Correlation and Covariance

June 2, 2018
R Regression

Correlation



What?


A correlation, r is a statistical way that describes the degree of relationship between two quantitative variables.

A positive correlation means that when one variable increase, the other variable also increase and when one variable decrease, the other variable decrease. While negative correlation exists when one variable increase, the other variable decrease.



Why?


Correlation is important as it describes the direction (positive or negative) and also strength (weak or strong) of a relationship between two variables. For an example, a correlation, r can be used to explain the degree of association between GPA and study time of student in hours. However, bear in mind that the relationship between these two variables isn’t necessarily a causal one. It means that correlation, in this case, does not imply that the change in GPA is affected by the study time in hours and vice versa. So, no matter what variable is being used; in this case, whether as a response variable or a predictor variable, the correlation coefficient will remain the same.



How?


Correlation, r has the value that lies between -1 to 1 (from negative correlation to positive correlation).

  • when r= 0, there is no linear relationship between two variables.
  • when r= -1, there is a strong negative correlation between two variables.
  • when r= +1, there is a strong positive correlation between two variables.


The formula for calculating correlation, r between two quantitative variables is:


\[r = \frac {{S}{xy}}{\sqrt {{S}{xx}{S}_{yy}}}\]

  • Sxy is the sum of product of difference between each x value and its mean, \(\bar{X}\) and difference between each y value and its mean, \(\bar{Y}\).
  • Sxx is the sum of square of difference between each x value and its mean, \(\bar{X}\).
  • Syy is the sum of square of difference between each y value and its mean, \(\bar{Y}\).



To compute correlation, r by using statistical language, R use cor( ) function.


cor(x,y)
# correlation between 2 variables.



Example


“women” is a dataset from base library of R which consists of height and weight of 15 womens.



women
   height weight
1      58    115
2      59    117
3      60    120
4      61    123
5      62    126
6      63    129
7      64    132
8      65    135
9      66    139
10     67    142
11     68    146
12     69    150
13     70    154
14     71    159
15     72    164

cor(women$height, women$weight)
[1] 0.9954948


Since r= 0.995, so there is a strong positive linear relationship between weight and height of women.









Covariance



What?


Covariance is a measure of how a change in one variable is associated with changes in the second variable. For a variable, variance measures the deviation of data from its mean. When coming to two variables, covariance is used to indicate the direction of a linear relationship between the two variables. If both variables tend to increase or decrease together, the covariance coefficient is positive. If one variable tends to increase while the other variable decrease, the covariance coefficient is negative.



Why?


Covariance can be used to describe the direction (positive, negative) of a linear relationship between two variables. In addition, covariance is a measure of correlation and it can be used to determine the correlation coefficient, r between two variables. The formula for calculating correlation, r between two variables using covariance is:


\[r = \frac {cov(x,y)}{\sqrt {{S}{x}{S}_{y}}}\]

  • where cov(x,y) is the covariance of variable x and y.
  • Sx is the standard deviation of x.
  • Sy is the standard deviation of y.



How?


Unlike correlation, r, covariance is not limited to the range of -1 to +1. If both variables tend to increase or decrease together, the covariance coefficient is positive. A positive covariance coefficient indicates that there is a positive linear relationship between two variables while negative covariance coefficient explains that there is a negative linear relationship between two variables. The formula for calculating covariance coefficient is:


\[cov{(x,y)} = \frac{\sum\limits_{i=1}^{n}{(x_i-\overline{x}) \cdot (y_i-\overline{y})} }{n-1}\]

To perform computation of covariance between two variables in R, use cov( ) function.


cov(x,y)
# covariance of x and y.



Example



cov(women$height,women$weight)
[1] 69


Positive covariance coefficient, 69 indicates that there is positive linear relationship between weight and height of women.









Correlation vs Covariance



Correlation Covariance
No unit Expressed in units
Value lies from -1 to +1 Value lies from negative infinity to positive infinity
Direction and magnitude of linear relationship is known Magnitude of linear relationship is not known.
cor(x,y) cov(x,y)









Testing Hypothesis About a Correlation Coefficient



What?


Hypothesis testing is the idea to make use of sample statistics to make inference about population parameters. The sample statistics in this case refers to correlation coefficient, r. By convention, the population parameter that we use here is \(\rho\).



Why?


Normally, we know which variable should be regarded as a response variable, Y. Consider evaluating whether or not a linear relationship exists between human fertility rate and radiotherapy treatments. For this example, it is obvious that human fertility rate should be used as a response variable to be predicted by radiotherapy treatments.

What if we have the case like it is not obvious to treat which variable as a response variable? For an instance, we wish to evaluate whether or not a linear relationship exists between height and weight of women. In this situation, t-test for correlation coefficient is used instead of Z-test for testing:


  • \(H_0\): \(\rho\) = 0.
  • \(H_0\): \(\rho\) \(\ge\) 0.
  • \(H_0\): \(\rho\) \(\le\) 0.



How?


Step1:


Define null hypothesis, \(H_0\) and alternative hypothesis, \(H_1\).


Step2:


Set confidence level.


Step3:


Compute test statistic, t with formula:


\[t = \frac{r{\sqrt{N-2}}}{\sqrt {1-{r}^{2}}}\]


Step4:


Reject \(H_0\) if p-value is less than significance level, \(\alpha\).


Step5:


Conclusion.



To have a hypothesis testing for correlation coefficient in r, use cor.test(). There are two ways of syntax which provide the same output.

The first one is making use of with( ) function. With( ) function allow us to include the name of column inside the cor.test( ) function without the use of dollar sign, ‘$’.


with(dataname, cor.test(column_name1, column_name2, alternative= "greater" / "less" / "two.sided"))


or


The second one is making use of dollar sign, ‘$’ which has much shorter code compare to with( ) function.


cor.test(dataname$column_name1, data$column_name2, alternative= "greater" / "less" / "two.sided")



Example


I’m using back the ‘women’ dataset from the previous example as illustration. Assuming that we know in advance that correlation of height and weight should be positive (the increase of height will increase the weight also). So, our null hypothesis, \(H_0\) and alternative hypothesis, \(H_1\) is set as:


  • \(H_0\): \(\rho\) \(\le\) 0
  • \(H_1\): \(\rho\) \(>\) 0


We set our significance level, \(\alpha\) equal to 5%.



cor.test(women$height, women$weight, alternative= "greater")

    Pearson's product-moment correlation

data:  women$height and women$weight
t = 37.855, df = 13, p-value = 5.455e-15
alternative hypothesis: true correlation is greater than 0
95 percent confidence interval:
 0.9883962 1.0000000
sample estimates:
      cor 
0.9954948 


Since the p-value is less than 0.05 (5%), there is sufficient evidence to conclude that at \(\alpha\)=0.05, there is a significant linear relationship between height and weight of women.

Note that correlation coefficient, r is also known as Pearson’s product-moment correlation coefficient.









It’s 12 am.




Testing Difference Between Two Corrrelation Coefficients



What?


To compare the correlation coefficient, r between two variables from two different groups, we can also do a hyopothesis testing on it.


Why?


To determine the correlation coefficient, r for two different groups whether it’s different from each other for the same variables.


How?


Step1:


Z-test is used instead of t-test. And this require us to do a transformation which is called Fisher’s r to z transformation. The formula of the transformation is:


\[Z_r= \frac{[ln(1+r)-ln(1-r)]}{2}\]


Step2:


Set the confidence level.


Step3:


The test statistic, Z is:


\[Z= \frac{Z_1-Z_2}{{\sigma}_{z_1-z_2}}\]


where \[{\sigma}_{z_1-z_2}= \sqrt{\frac{1}{{N}_{1}-{3}}+\frac{1}{{N}_{2}-{3}}}\]


Step4:


Reject \(H_0\) if p-value is less than significance level, \(\alpha\).


Step5:


Conclusion.



To testing the difference between two correlation coefficients, use r.test( ) function in psych package.


install.packages("psych")
# install package

library("psych")
# load library

r.test(r12=r_1, n=n_1, r34=y, n2=n_2, twotailed = T/F)
# hypothesis testing


Example


Suppose we have two groups, Chemistry students and students from Engineering programme. For the sample of 100 students from Chemistry programme, the correlation between GPA and study time in hours is 0.789 while the correlation between these two variables for the 120 students from Engineering programme is 0.673. To test the difference between two correlation coefficients whether it’s the same at \(\alpha\) = 0.05:


  • \(H_0\): \(r_1\) = \(r_2\)
  • \(H_1\): \(r_1\) \(\ne\) \(r_2\)


library("psych")

r.test(r12=0.789, n=100, r34=0.673, n2=120, twotailed = T)
Correlation tests 
Call:r.test(n = 100, r12 = 0.789, r34 = 0.673, n2 = 120, twotailed = T)
Test of difference between two independent correlations 
 z value 1.84    with probability  0.07


Since the p-value is not less than 0.05, thus there is insufficient evidence to conclude that there is a difference between two correlation coefficients.









Correlation Matrix


In addition, to find a single correlation coefficient, r for two quantitative variables, we can also make use of cor( ) function to find a correlation matrix for multiple quantitative variables. This enables us to have a look at the relationship between multiple variables during the stage of exploratory analysis.


cor(x)
# correlation for multiple variables in table
# x is a dataset which consists of numerical variables.   


Example 1


Let’s use one of the popular sample datasets from base library named ‘mtcars’ for the next illustration. By typing in ‘mtcars’ in r console, you will found that the ‘mtcars’ dataset consists of 11 numerical variables. To find all of the correlation coefficients, r at once in a table:



cor(mtcars)
            mpg        cyl       disp         hp        drat         wt
mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
            qsec         vs          am       gear        carb
mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000


You might found that the values of correlation are showed in 7 decimal places and it is quite messy. To change the decimal places of the values (round off), apply round ( ) function.


round(x, digit place)
# replace x with name of dataset


Example 2


To display the output of ‘mtcars’ dataset from the previous example in 2 decimal places, you have two options here.

  • Store its output into a named variable first then only apply round( ) function.
  • One shot to display the round off output in R console.



data= cor(mtcars)
# store output into 'data'

round(data, 2)
       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43
qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00  0.74 -0.23 -0.21 -0.66
vs    0.66 -0.81 -0.71 -0.72  0.44 -0.55  0.74  1.00  0.17  0.21 -0.57
am    0.60 -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17  1.00  0.79  0.06
gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27
carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57  0.06  0.27  1.00
# display output in 2 decimal places(round off)


or



round(cor(mtcars), 2)
       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43
qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00  0.74 -0.23 -0.21 -0.66
vs    0.66 -0.81 -0.71 -0.72  0.44 -0.55  0.74  1.00  0.17  0.21 -0.57
am    0.60 -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17  1.00  0.79  0.06
gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27
carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57  0.06  0.27  1.00
# one shot display the new output in 2 decimal places









Missing value


Sometime we might get ‘NA’ as a result of cor( ) function. This is because of the dataset we put into the argument of cor( ) function consists of missing values. To encounter this kind of problem, apply use=“complete.obs” inside the function of cor( ) function. For more details about the usage of use( ) function, we can always refer the official documentation of correlation, covariance function by typing in ?cor or ?cov in R console.









Visualizing correlation



Scatter Plot













Scatter plot is a graphical way to show the relationship between two quantitative variables. To create a simple scatter plot in R, we can make use of R base graphics or download and install additional data visualization R packages like ‘ggplot2’, ‘ggvis’, ‘plotly’ and so on.









Visualizing Correlation Matrix




Symbolic Number Coding



From previous example, we determine correlation coefficients for multiple variables in a table form like:

       mpg   cyl  disp    hp
mpg   1.00 -0.85 -0.85 -0.78
cyl  -0.85  1.00  0.90  0.83
disp -0.85  0.90  1.00  0.79
hp   -0.78  0.83  0.79  1.00


synmnum( ) function is a simple function to visualize the structured matrix. In short, this function encodes range of values inside a matrix table as symbols we set for ease of interpretation.


     mpg cyl disp hp
mpg  &   +   +    + 
cyl  +   &   *    * 
disp +   *   &    * 
hp   +   *   *    & 
attr(,"legend")
[1] -0.9 '+' -0.6 '-' -0.3 '@' 0 '$' 0.3 '^' 0.6 '*' 0.9 '#' 0.95 '&' 1



If you notice, you will found that there’s a legend below the table, each value within an interval is represented by a symbol we set. For an example, the correlation between ‘mpg’ and ‘cyl’ is represented by a symbol, ‘+’, that means the correlation value for both of this variables is within the range of -0.90 and -0.60 (strong negative correlation).






Scatter Plot Matrix





A scatter plot matrix is a table of scatter plots in visualizing correlation matrix. The diagonal of the table is the list of variables inside of dataset, every variable on the rows of a table is our y-axis while every variable on the columns of a table will be our x-axis. For an example, let’s talk about the scatter plot above from the second row and first column, the ‘Solar.R’ is our y-axis and ‘Ozone’ is our x-axis.

Then what about the scatter plot from the third row and second column?

.
.
.
.


The answer is ‘Wind’ will be our y-axis while ‘Solar.R’ is our x-axis.






Network Plot



Network plot is one of the way to visualize correlation matrices. By using ‘corrr’ package, you can produce a network plot like:




For more on how to create your own network plot, click here.






Correlogram



A correlogram is another graphical display of a correlation matrix. To avoid ourselves from making hundred of plots during the stage of exploratory analysis, this technique is used. A customized correlogram can be simply plotted by using the available functions from ‘corrplot’ package.





A full correlogram which all the correlation coefficients for every pair of numerical variables are represented by symbols of ‘ellipse’.





You can also plot a upper or lower correlogram like:









This is a mixed correlogram which consists of correlation coefficient and pie chart for each pair of numerical variables. For the upper part of correlogram, every lower part of the correlation value is represented by a pie chart. A whole pie chart is equal to +ve 1 or -ve 1. By default, a red colour pie chart is representing for a negative correlation value, a blue colour pie chart is representing for a positive correlation value and white colour is used for a pie chart when two numerical variables have no correlation (r=0).








Additional Information









The End


Visualizing correlation matrices is not only limited to few of these, but I have to stop here, hope to hear from you about the other ways which I’m not able to cover here.









R vs Excel: lookup and text functions

June 29, 2018
R Excel

Create correlation network plot in a quick way

June 8, 2018
R Visualization

A hint on how to read World Bank data into R

May 19, 2018
R Import
comments powered by Disqus