What is Multiple Regression? Formula, Example, and Use in 2025
Multiple Regression is an advanced statistical technique used to model the relationship between one dependent variable (the variable to be predicted) and two or more independent variables (factors that help in prediction). It describes a linear relationship and is widely used in 2025 for predictive analytics in fields like data science, machine learning, and business intelligence.
The result of Multiple Regression is an equation that predicts the dependent variable based on a combination of independent variables. Tools like Python, R, and SPSS are commonly used to perform these calculations.
When to Use Multiple Regression?
Multiple Regression is used when:
- Predicting a Dependent Variable with Multiple Factors: For example, predicting a student’s exam scores (dependent) based on study hours, attendance, coaching, and online learning platform usage (independent) in 2025’s education landscape.
- Linear Relationship Exists: There is a linear relationship between independent and dependent variables.
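- Causal Analysis: To estimate how each independent variable is associated with the dependent variable while holding the others constant. A causal reading additionally requires a sound study design, since regression by itself shows association, not causation.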
- Examples (2025 Context):
  - Predicting house prices (dependent) based on size, location, age, and smart home features (independent).
  - Evaluating employee performance (dependent) based on experience, training, education, and AI tool proficiency (independent).
When Not to Use Multiple Regression?
Multiple Regression should not be used in the following cases:
- Non-Linear Relationship: If the relationship between variables is non-linear, use non-linear regression or advanced machine learning models like neural networks (popular in 2025).
- High Multicollinearity: If independent variables are highly correlated with one another, coefficient estimates become unstable and hard to interpret. Check using the Variance Inflation Factor (VIF).
- Non-Normal Residuals: If regression residuals do not follow a normal distribution, p-values and confidence intervals may not be trustworthy, especially with small samples.
- Categorical Dependent Variable: If the dependent variable is categorical (e.g., Yes/No), use logistic regression or classification algorithms.
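The multicollinearity check in the list above can be sketched in a few lines. This is a minimal illustration, assuming NumPy is available; the helper name `vif` and both sample arrays are hypothetical and do not come from any dataset in this article. Each predictor's VIF is computed as 1 / (1 − R²), where R² comes from regressing that predictor on the remaining ones.

```python
import numpy as np

def vif(X):
    """Return the Variance Inflation Factor of each column of X (rows = observations)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        # Regress column j on an intercept plus all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        ss_res = float(residual @ residual)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        # Guard against division by zero when a predictor is explained exactly.
        factors.append(np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return factors

# Hypothetical data: two unrelated predictors, then two nearly identical ones.
independent = np.array([[-1, -1], [1, -1], [-1, 1], [1, 1]])
nearly_equal = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]])

print("unrelated predictors:", vif(independent))
print("nearly identical predictors:", vif(nearly_equal))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity; unrelated predictors sit close to 1.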
Multiple Regression Formula
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \epsilon \]
Where:
- Y: Dependent variable (to be predicted).
- \(\beta_0\): Intercept (expected value of Y when all independent variables are zero).
- \(\beta_1, \beta_2, \dots, \beta_k\): Coefficients representing the effect of each independent variable.
- \(X_1, X_2, \dots, X_k\): Independent variables.
- \(\epsilon\): Error term (the part of Y that the independent variables do not explain).
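For a dataset of n observations, a model with k independent variables is often written once per observation, or more compactly in matrix form. The index \(i\) and the symbols \(\mathbf{y}\), \(\mathbf{X}\), \(\boldsymbol{\beta}\), and \(\boldsymbol{\epsilon}\) are notation introduced here for illustration:

\[ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_k X_{ik} + \epsilon_i, \quad i = 1, \dots, n \]
\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \]

Here \(\mathbf{X}\) is an \(n \times (k+1)\) matrix whose first column is all ones (for the intercept) and whose remaining columns hold the independent variables, and \(\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_k)\).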
Step-by-Step Example
Problem: A school wants to determine how exam marks (Y) depend on study hours (X1) and attendance percentage (X2) for 5 students. The data is as follows:
Student | Study Hours (X1) | Attendance % (X2) | Exam Marks (Y) |
---|---|---|---|
1 | 5 | 90 | 85 |
2 | 3 | 80 | 70 |
3 | 6 | 95 | 90 |
4 | 4 | 85 | 78 |
5 | 2 | 75 | 65 |
Step 1: Define the Model
The model will be:
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon \]
Here, Y = Exam Marks, X1 = Study Hours, X2 = Attendance %.
Step 2: Calculate Coefficients
To calculate the coefficients (\(\beta_0, \beta_1, \beta_2\)) for Multiple Regression, we use the Ordinary Least Squares (OLS) method, which chooses the coefficients that minimize the sum of squared differences between the actual and predicted values of Y. The calculation is typically performed using software (e.g., Python, R, or SPSS). For this example, assume the software provided these results:
\[ \beta_0 = 10, \quad \beta_1 = 5, \quad \beta_2 = 0.6 \]
Thus, the model is:
\[ Y = 10 + 5X_1 + 0.6X_2 \]
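As a sketch of how software carries out an OLS fit, the snippet below runs NumPy's `np.linalg.lstsq` on the five students from the table (NumPy is an assumption here; R or SPSS would serve equally well). One caveat shows up in the output: in this sample, attendance equals 65 + 5 × study hours for every student, so the two predictors are perfectly collinear and least squares cannot identify a unique pair of slopes. `lstsq` then returns the minimum-norm solution among the many equally good ones, so the coefficients used in this walkthrough are best read as illustrative values rather than the output of this particular fit.

```python
import numpy as np

# Data from the table above.
study_hours = np.array([5, 3, 6, 4, 2], dtype=float)
attendance = np.array([90, 80, 95, 85, 75], dtype=float)
marks = np.array([85, 70, 90, 78, 65], dtype=float)

# Design matrix: a column of ones for the intercept, then X1 and X2.
X = np.column_stack([np.ones_like(study_hours), study_hours, attendance])

# Least-squares fit. An explicit rcond keeps the rank estimate robust to rounding.
beta, _, rank, _ = np.linalg.lstsq(X, marks, rcond=1e-10)
fitted = X @ beta

# In this tiny sample, attendance is an exact linear function of study hours.
collinear = np.allclose(attendance, 65 + 5 * study_hours)

print("rank of design matrix:", rank, "of", X.shape[1], "columns")
print("attendance == 65 + 5 * study_hours:", collinear)
print("fitted marks:", np.round(fitted, 1))
```

Run as is, the design matrix has rank 2 despite having three columns, which is the numerical symptom of the perfect collinearity. With real data, a check like this (or a VIF calculation) would normally come before interpreting individual coefficients.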
Step 3: Use the Model (Prediction)
For Student 1 (X1 = 5, X2 = 90):
\[ Y = 10 + 5(5) + 0.6(90) \]
\[ = 10 + 25 + 54 = 89 \]
Actual marks = 85, predicted = 89 (an error of 4 marks).
For Student 5 (X1 = 2, X2 = 75):
\[ Y = 10 + 5(2) + 0.6(75) \]
\[ = 10 + 10 + 45 = 65 \]
Actual marks = 65, predicted = 65 (an exact match).
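The same arithmetic can be repeated for all five students. The sketch below is plain Python and uses the illustrative coefficients from Step 2; the function name `predict` is a hypothetical helper introduced for this example.

```python
# Illustrative coefficients from Step 2.
b0, b1, b2 = 10, 5, 0.6

# (study hours, attendance %, actual marks) for students 1 to 5.
students = [(5, 90, 85), (3, 80, 70), (6, 95, 90), (4, 85, 78), (2, 75, 65)]

def predict(hours, attendance):
    """Evaluate Y = b0 + b1*X1 + b2*X2 for one student."""
    return b0 + b1 * hours + b2 * attendance

for number, (hours, attendance, actual) in enumerate(students, start=1):
    predicted = predict(hours, attendance)
    error = actual - predicted
    print(f"Student {number}: predicted {predicted:.1f}, actual {actual}, error {error:+.1f}")
```

Listing the error for every student, not just two of them, gives a quicker sense of where the model over- or under-predicts.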
Step 4: Interpret the Model
The equation \( Y = 10 + 5X_1 + 0.6X_2 \) means:
- \(\beta_0 = 10\): If study hours and attendance are both zero, the predicted marks are 10.
- \(\beta_1 = 5\): Each additional study hour increases predicted marks by 5 (holding attendance constant).
- \(\beta_2 = 0.6\): Each additional percentage point of attendance increases predicted marks by 0.6 (holding study hours constant).
Step 5: Check Model Accuracy
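The fit of a Multiple Regression model is commonly summarized by R² (the coefficient of determination), the share of variance in the dependent variable that the model explains. Assume the software reported R² = 0.92, meaning 92% of the variance in exam marks is explained by study hours and attendance. This indicates a strong in-sample fit, although with only 5 students it says little about how well the model would predict new students.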
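For readers who want to see how R² is computed, the sketch below applies the standard definition, R² = 1 − SS_res / SS_tot, to the five students using the illustrative model Y = 10 + 5X1 + 0.6X2. The function name `r_squared` is a hypothetical helper, not a library call.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

hours = [5, 3, 6, 4, 2]
attendance = [90, 80, 95, 85, 75]
marks = [85, 70, 90, 78, 65]

# Predictions from the illustrative model Y = 10 + 5*X1 + 0.6*X2.
predicted = [10 + 5 * h + 0.6 * a for h, a in zip(hours, attendance)]

print(f"R^2 = {r_squared(marks, predicted):.3f}")
```

With these inputs the result is roughly 0.80 rather than 0.92. That gap is expected: both the coefficients and the 0.92 figure are assumed values for the walkthrough, not the output of an actual fit to these five rows.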
Summary
Multiple Regression: Predicts a dependent variable using multiple independent variables, crucial for 2025’s data-driven decision-making.
Use: When predicting a variable using multiple factors, e.g., exam marks based on study hours and attendance.
Avoid: In cases of non-linear relationships, high multicollinearity, non-normal residuals, or categorical dependent variables.
Example: Predicted exam marks using study hours and attendance, model: \( Y = 10 + 5X_1 + 0.6X_2 \), R² = 0.92.
Formula: \( Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \epsilon \)