Pearson Correlation Coefficient: Complete Guide with Examples

What Is the Pearson Correlation Coefficient?

The Pearson Correlation Coefficient (denoted as r) is a statistical measure that quantifies the strength and direction of the linear relationship between two variables (X and Y). Its value ranges between -1 and +1:

Scale: -1 (Perfect Negative), -0.5 (Weak Negative), 0 (No Correlation), +0.5 (Weak Positive), +1 (Perfect Positive)

  • +1: Perfect positive linear relationship (as X increases, Y increases)
  • -1: Perfect negative linear relationship (as X increases, Y decreases)
  • 0: No linear relationship (no consistent straight-line pattern between X and Y, although a non-linear relationship may still exist)

Pearson Correlation Formula

\[ r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}} \]

Where:

  • xᵢ and yᵢ: Individual data points for X and Y
  • x̄ and ȳ: Means (averages) of X and Y
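  • Numerator: Sum of the products of the deviations, which is proportional to the covariance between X and Y
  • Denominator: Square root of the product of the two sums of squared deviations, which is proportional to the product of the standard deviations of X and Y (the common scaling factor cancels between numerator and denominator)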
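The formula above can be sketched directly in Python. This is a minimal illustration using only the standard library; the function name `pearson_r` and its error handling are assumptions made for this example, not part of any particular library.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (illustrative helper)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    if len(xs) < 2:
        raise ValueError("at least two data points are required")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Numerator: sum of the products of the deviations from the means.
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations for X and for Y.
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)

    if sum_xx == 0 or sum_yy == 0:
        # r is undefined when one variable is constant.
        raise ValueError("r is undefined when a variable has zero variance")

    return sum_xy / math.sqrt(sum_xx * sum_yy)


# A perfectly linear relationship (y = 2x) gives r = 1.
r_linear = pearson_r([1, 2, 3], [2, 4, 6])
print(r_linear)  # prints 1.0
```

Raising an error for a constant variable is one possible design choice; returning NaN would be another reasonable convention.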

When to Use Pearson Correlation?

| Situation | Description | Example |
| --- | --- | --- |
| Linear Relationship Assessment | When checking for a linear relationship between two variables | Height vs. Weight, Study hours vs. Exam scores |
| Continuous Variables | When both variables are continuous | Temperature, Time, Age, Income |
| Normally Distributed Data | When the data follow an approximately normal distribution | Standardized test scores, Biological measurements |
| Research and Analysis | In scientific studies to understand relationships | Economic indicators, Psychological studies |

Examples of Appropriate Use:

  • Relationship between study hours and exam scores
  • Correlation between temperature and ice cream sales
  • Connection between advertising expenditure and sales revenue

When Not to Use Pearson Correlation?

Pearson Correlation should not be used when:

| Situation | Problem | Alternative Approach |
| --- | --- | --- |
| Non-Linear Relationships | Only measures linear relationships | Spearman’s rank or non-linear correlation |
| Categorical Data | Requires continuous variables | Chi-square test or other categorical methods |
| Presence of Outliers | Outliers can distort results | Robust correlation measures |
| Non-Normal Data | Assumes normal distribution | Non-parametric methods |
| Assuming Causation | Correlation ≠ Causation | Experimental design for causation |
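To illustrate the non-linear case from the table, here is a small sketch with made-up data where Y is completely determined by X (y = x²), yet r comes out as zero because the relationship is not linear. The helper `pearson_r` and the data are assumptions for this illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (illustrative helper)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Synthetic data: Y depends entirely on X, but through a symmetric curve.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]  # [4, 1, 0, 1, 4]

r_curve = pearson_r(xs, ys)
print(r_curve)  # prints 0.0
```

The positive and negative products of deviations cancel exactly, so a value of r near zero rules out a linear pattern only, not a relationship in general.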

Example of Inappropriate Use:

Checking correlation between “ice cream sales” and “shark attacks” can be misleading because both increase in summer (seasonal effect), but there’s no direct causation.

Step-by-Step Calculation Example

Problem Statement:

Calculate the Pearson Correlation Coefficient for the following dataset of 5 students’ study hours (X) and exam scores (Y):

| Student | Study Hours (X) | Exam Scores (Y) |
| --- | --- | --- |
| 1 | 2 | 50 |
| 2 | 4 | 60 |
| 3 | 6 | 75 |
| 4 | 8 | 85 |
| 5 | 10 | 90 |

Step 1: Calculate Means

\[ \bar{x} = \frac{2 + 4 + 6 + 8 + 10}{5} = \frac{30}{5} = 6 \] \[ \bar{y} = \frac{50 + 60 + 75 + 85 + 90}{5} = \frac{360}{5} = 72 \]

Step 2: Calculate Deviations and Products

Create a table for (xᵢ – x̄), (yᵢ – ȳ), and their products:

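| xᵢ | yᵢ | xᵢ - x̄ | yᵢ - ȳ | (xᵢ - x̄)(yᵢ - ȳ) | (xᵢ - x̄)² | (yᵢ - ȳ)² |
| --- | --- | --- | --- | --- | --- | --- |
| 2 | 50 | -4 | -22 | 88 | 16 | 484 |
| 4 | 60 | -2 | -12 | 24 | 4 | 144 |
| 6 | 75 | 0 | 3 | 0 | 0 | 9 |
| 8 | 85 | 2 | 13 | 26 | 4 | 169 |
| 10 | 90 | 4 | 18 | 72 | 16 | 324 |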
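Steps 1 and 2 can be reproduced with a few lines of Python. This is a sketch using the dataset from the problem statement; the variable names (`hours`, `scores`, `rows`) are chosen for this example.

```python
# Data from the worked example: study hours (X) and exam scores (Y).
hours = [2, 4, 6, 8, 10]
scores = [50, 60, 75, 85, 90]

# Step 1: means of X and Y.
mean_x = sum(hours) / len(hours)    # 6.0
mean_y = sum(scores) / len(scores)  # 72.0

# Step 2: one row per student with the deviations, their product,
# and the squared deviations.
rows = []
for x, y in zip(hours, scores):
    dx = x - mean_x
    dy = y - mean_y
    rows.append((x, y, dx, dy, dx * dy, dx ** 2, dy ** 2))

for row in rows:
    print(row)
```

Each printed tuple corresponds to one row of the deviation table, with the values shown as floats.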

Step 3: Sum the Columns

\[ \sum(x_i - \bar{x})(y_i - \bar{y}) = 88 + 24 + 0 + 26 + 72 = 210 \] \[ \sum(x_i - \bar{x})^2 = 16 + 4 + 0 + 4 + 16 = 40 \] \[ \sum(y_i - \bar{y})^2 = 484 + 144 + 9 + 169 + 324 = 1130 \]

Step 4: Apply the Formula

\[ r = \frac{210}{\sqrt{40 \times 1130}} \] \[ r = \frac{210}{\sqrt{45200}} \] \[ r \approx \frac{210}{212.60} \approx 0.988 \]
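As a quick check of Steps 3 and 4, the sums and the final ratio can be computed in Python. This is a minimal sketch with the same dataset; the variable names are chosen for this example. Rounded to three decimals, the result is 0.988.

```python
import math

# Data from the worked example: study hours (X) and exam scores (Y).
hours = [2, 4, 6, 8, 10]
scores = [50, 60, 75, 85, 90]

mean_x = sum(hours) / len(hours)
mean_y = sum(scores) / len(scores)

# Step 3: sum the three columns of the deviation table.
sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
sum_xx = sum((x - mean_x) ** 2 for x in hours)
sum_yy = sum((y - mean_y) ** 2 for y in scores)
print(sum_xy, sum_xx, sum_yy)  # prints 210.0 40.0 1130.0

# Step 4: apply the formula.
r = sum_xy / math.sqrt(sum_xx * sum_yy)
print(round(r, 3))  # prints 0.988
```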

Step 5: Interpretation

r ≈ 0.988 indicates a very strong positive linear relationship between study hours and exam scores.

In this sample, more study hours go with consistently higher exam scores. The coefficient alone does not show that the extra study time causes the higher scores.

Key Takeaways

  • Pearson Correlation (r) measures the strength and direction of the linear relationship between two continuous variables, on a scale from -1 to +1
  • Use it for approximately normally distributed continuous data with a linear relationship
  • Avoid it for non-linear data, categorical variables, or when outliers are present
  • Correlation does not imply causation
  • In our example, r ≈ 0.988 showed a very strong positive correlation between study hours and exam scores
