Pearson Correlation Coefficient: Complete Guide with Examples
What Is the Pearson Correlation Coefficient?
The Pearson Correlation Coefficient (denoted as r) is a statistical measure that quantifies the strength and direction of the linear relationship between two variables (X and Y). Its value ranges between -1 and +1:
Scale of r: perfect negative (-1) → weak negative → no correlation (0) → weak positive → perfect positive (+1)
- +1: Perfect positive linear relationship (as X increases, Y increases)
- -1: Perfect negative linear relationship (as X increases, Y decreases)
- 0: No linear relationship (X and Y show no consistent straight-line pattern, although a non-linear relationship may still exist)
Pearson Correlation Formula
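r = Σ(xᵢ – x̄)(yᵢ – ȳ) / √[Σ(xᵢ – x̄)² · Σ(yᵢ – ȳ)²]
Where:
- xᵢ and yᵢ: Individual data points for X and Y
- x̄ and ȳ: Means (averages) of X and Y
- Numerator: Sum of the products of paired deviations, which is proportional to the covariance of X and Y
- Denominator: Square root of the product of the two sums of squared deviations, which is proportional to the product of the standard deviations of X and Y (the shared scaling factor cancels)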
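As a minimal sketch of how these terms turn into a computation, the Python function below works from deviations about the means. The name `pearson_r` and its error handling are illustrative choices made here, not part of any library.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]  # x_i - x_bar
    dy = [y - mean_y for y in ys]  # y_i - y_bar
    sxy = sum(a * b for a, b in zip(dx, dy))  # numerator
    sxx = sum(a * a for a in dx)              # sum of squared x deviations
    syy = sum(b * b for b in dy)              # sum of squared y deviations
    if sxx == 0 or syy == 0:
        # A constant variable has zero spread, so r is undefined.
        raise ValueError("r is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)
```

For example, `pearson_r([1, 2, 3], [2, 4, 6])` returns 1.0 and `pearson_r([1, 2, 3], [6, 4, 2])` returns -1.0, matching the two perfect-correlation cases. In practice, an established routine such as `scipy.stats.pearsonr` or `statistics.correlation` (Python 3.10+) would usually be preferred.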
When to Use Pearson Correlation?
Situation | Description | Example |
---|---|---|
Linear Relationship Assessment | When checking for a linear relationship between two variables | Height vs. Weight, Study hours vs. Exam scores |
Continuous Variables | When both variables are continuous | Temperature, Time, Age, Income |
Normally Distributed Data | When the data are approximately normally distributed, which matters mainly for significance tests on r | Standardized test scores, Biological measurements |
Research and Analysis | In scientific studies that quantify how two measured variables relate | Economic indicators, Psychological studies |
Examples of Appropriate Use:
- Relationship between study hours and exam scores
- Correlation between temperature and ice cream sales
- Connection between advertising expenditure and sales revenue
When Not to Use Pearson Correlation?
Pearson Correlation should not be used when:
Situation | Problem | Alternative Approach |
---|---|---|
Non-Linear Relationships | Captures only linear association, so a strong curved relationship can give r near 0 | Spearman’s rank correlation (for monotonic relationships) or a non-linear model |
Categorical Data | Requires numeric, continuous variables | Chi-square test or other categorical methods |
Presence of Outliers | A single extreme point can inflate or deflate r substantially | Robust correlation measures |
Non-Normal Data | Significance tests for r assume approximately normal data | Non-parametric methods such as Spearman’s or Kendall’s rank correlation |
Assuming Causation | Correlation ≠ Causation | Experimental design for causation |
Example of Inappropriate Use:
Checking correlation between “ice cream sales” and “shark attacks” can be misleading: both increase in summer, so they correlate through a shared seasonal factor, and neither causes the other.
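Two of the pitfalls in the table, non-linear relationships and outliers, can be seen with a few lines of Python. This is an illustrative sketch only: `pearson_r` is a helper defined here (not a library function), the curved dataset is made up, and the extra point (12 hours, score 20) is a hypothetical outlier appended to the study-hours data from this guide's worked example.

```python
import math

def pearson_r(xs, ys):
    # Hypothetical helper: r from sums of deviations about the means.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Non-linear case: y is fully determined by x (y = x^2), yet r comes out as 0.
curve_x = [-2, -1, 0, 1, 2]
curve_y = [x ** 2 for x in curve_x]
r_curve = pearson_r(curve_x, curve_y)

# Outlier case: a clean upward trend, then the same data plus one made-up extreme point.
hours = [2, 4, 6, 8, 10]
scores = [50, 60, 75, 85, 90]
r_clean = pearson_r(hours, scores)
r_outlier = pearson_r(hours + [12], scores + [20])

print(f"curved data:  r = {r_curve:.3f}")
print(f"clean trend:  r = {r_clean:.3f}")
print(f"with outlier: r = {r_outlier:.3f}")
```

The curved data give r = 0 even though Y depends entirely on X, and the single added point is enough to turn a very strong positive correlation into a slightly negative one.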
Step-by-Step Calculation Example
Problem Statement:
Calculate the Pearson Correlation Coefficient for the following dataset of 5 students’ study hours (X) and exam scores (Y):
Student | Study Hours (X) | Exam Scores (Y) |
---|---|---|
1 | 2 | 50 |
2 | 4 | 60 |
3 | 6 | 75 |
4 | 8 | 85 |
5 | 10 | 90 |
Step 1: Calculate Means
- x̄ = (2 + 4 + 6 + 8 + 10) / 5 = 30 / 5 = 6
- ȳ = (50 + 60 + 75 + 85 + 90) / 5 = 360 / 5 = 72
Step 2: Calculate Deviations and Products
Create a table for (xᵢ – x̄), (yᵢ – ȳ), and their products:
xᵢ | yᵢ | xᵢ – x̄ | yᵢ – ȳ | (xᵢ – x̄)(yᵢ – ȳ) | (xᵢ – x̄)² | (yᵢ – ȳ)² |
---|---|---|---|---|---|---|
2 | 50 | -4 | -22 | 88 | 16 | 484 |
4 | 60 | -2 | -12 | 24 | 4 | 144 |
6 | 75 | 0 | 3 | 0 | 0 | 9 |
8 | 85 | 2 | 13 | 26 | 4 | 169 |
10 | 90 | 4 | 18 | 72 | 16 | 324 |
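Step 3: Sum the Columns
- Σ(xᵢ – x̄)(yᵢ – ȳ) = 88 + 24 + 0 + 26 + 72 = 210
- Σ(xᵢ – x̄)² = 16 + 4 + 0 + 4 + 16 = 40
- Σ(yᵢ – ȳ)² = 484 + 144 + 9 + 169 + 324 = 1130
Step 4: Apply the Formula
r = Σ(xᵢ – x̄)(yᵢ – ȳ) / √[Σ(xᵢ – x̄)² · Σ(yᵢ – ȳ)²] = 210 / √(40 × 1130) = 210 / √45200 ≈ 210 / 212.60 ≈ 0.988
Step 5: Interpretation
r ≈ 0.988 indicates a very strong positive linear relationship between study hours and exam scores.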
This means that as study hours increase, exam scores also consistently increase.
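The hand calculation can be double-checked with a short script. The sketch below mirrors Steps 1 to 4 using only the Python standard library. The variable names are illustrative, and the data are the five students from the problem statement.

```python
import math

hours = [2, 4, 6, 8, 10]       # X, study hours
scores = [50, 60, 75, 85, 90]  # Y, exam scores

# Step 1: means
n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# Step 2: deviations from the means
dx = [x - mean_x for x in hours]
dy = [y - mean_y for y in scores]

# Step 3: column sums
sum_xy = sum(a * b for a, b in zip(dx, dy))
sum_xx = sum(a * a for a in dx)
sum_yy = sum(b * b for b in dy)

# Step 4: apply the formula
r = sum_xy / math.sqrt(sum_xx * sum_yy)

print(f"means: {mean_x}, {mean_y}")            # means: 6.0, 72.0
print(f"sums: {sum_xy}, {sum_xx}, {sum_yy}")   # sums: 210.0, 40.0, 1130.0
print(f"r = {r:.4f}")                          # r = 0.9878
```

With only five data points, the result is best read as an illustration of the method and not as a precise estimate of the underlying relationship.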
Key Takeaways
- Pearson Correlation (r) measures the strength and direction of the linear relationship between two continuous variables, on a scale from -1 to +1
- Best suited to continuous, approximately normally distributed data with a linear relationship
- Use with caution, or choose an alternative, for non-linear relationships, categorical variables, or data with outliers
- Correlation does not imply causation
- In our example, r ≈ 0.988 showed a very strong positive correlation between study hours and exam scores