Demystifying Multiple Linear Regression: Python Code with pandas, numpy, and statsmodels
MLR is a statistical technique used to model the relationship between a continuous dependent variable (what you're trying to predict) and two or more independent variables (factors that influence the dependent variable). It's an extension of simple linear regression, which only considers one independent variable.
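Concretely, MLR fits a model of the form y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ + ε, where β₀ is the intercept, each βᵢ is the coefficient on feature xᵢ, and ε is the error term. Ordinary Least Squares (OLS) estimates the coefficients by minimizing the sum of squared residuals.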
Implementation in Python
Here's a breakdown of the steps involved:
Import Libraries:
```python
import pandas as pd
import numpy as np
import statsmodels.api as sm
```
- `pandas`: used for data manipulation, loading datasets from CSV, and creating DataFrames for storing and handling data.
- `numpy`: provides the mathematical functions and array operations that underpin the linear algebra in MLR.
- `statsmodels.api` (imported as `sm`): offers statistical functions, including an MLR implementation through the `OLS` (Ordinary Least Squares) class.
Load Data:
```python
data = pd.read_csv("your_data.csv")
```
Prepare Data (Optional):
- Handling Missing Values: You may need to impute (fill in) missing values using strategies like mean/median imputation, or more sophisticated techniques depending on the data.
- Encoding Categorical Variables (if applicable): If your data contains categorical features, you'll likely need to encode them, for example with one-hot encoding (creating a binary column for each category).
- Scaling Features (often recommended): It's generally good practice to scale features to a similar range using standardization (z-score) or normalization (min-max scaling), which improves numerical stability and makes coefficients easier to compare. Libraries like `scikit-learn` are helpful for all of these tasks; a minimal sketch follows.
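Here is a minimal preprocessing sketch using `scikit-learn` and pandas. The column names `age` and `city` are hypothetical stand-ins, not columns from any dataset above:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical data with a missing numeric value and a categorical column
data = pd.DataFrame({
    "age": [25, None, 35, 40],
    "city": ["NY", "LA", "NY", "SF"],
})

# Impute missing numeric values with the column mean
data[["age"]] = SimpleImputer(strategy="mean").fit_transform(data[["age"]])

# One-hot encode the categorical column into binary indicator columns
data = pd.get_dummies(data, columns=["city"])

# Standardize the numeric column (zero mean, unit variance)
data[["age"]] = StandardScaler().fit_transform(data[["age"]])
```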
Define Feature Matrix (X) and Target Variable (y):
```python
X = data.drop("target_variable", axis=1)  # Assuming "target_variable" is your target column
y = data["target_variable"]
```
- Create a DataFrame `X` containing only the independent variables (features) by dropping the target variable column from the DataFrame.
- Extract the target variable column into a separate pandas Series `y`.
Create and Fit the MLR Model:
```python
model = sm.OLS(y, sm.add_constant(X)).fit()  # Add a constant term for the intercept
```
- Create a linear regression model object using `sm.OLS`.
- Pass the target variable `y` and the feature matrix `X`, with a constant term added (using `sm.add_constant`), to the model constructor.
- Call the `fit()` method on the model to train it on the data.
Evaluate the Model:
```python
print(model.summary())  # View model summary statistics
```
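The fitted results object also exposes individual statistics directly, for example:

```python
print(model.rsquared)  # coefficient of determination (R-squared)
print(model.params)    # estimated coefficients, including the intercept
print(model.pvalues)   # p-values for each coefficient
```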
Key Points:
- `statsmodels` provides a comprehensive approach to regression analysis, offering various diagnostic tests and tools beyond just fitting a model.
- Consider using `scikit-learn` if your primary focus is building and evaluating machine learning models; it offers conveniences like cross-validation and hyperparameter tuning.
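As a quick sketch of the cross-validation point, assuming `X` and `y` hold your features and target (note that scikit-learn's `LinearRegression` fits its own intercept, so `X` should not include the added constant column):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated R-squared scores for a plain linear model
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores.mean())
```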
Putting it all together:

```python
import pandas as pd
import numpy as np
import statsmodels.api as sm

# Sample data (replace with your actual data); values are chosen so that
# the two features are correlated but not perfectly collinear
data = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Experience (Years)": [2, 5, 6, 9, 11],
    "Salary (USD)": [40000, 52000, 60000, 71000, 80000]
})

# Define feature matrix (X) and target variable (y)
X = data.drop("Salary (USD)", axis=1)  # Features
y = data["Salary (USD)"]               # Target variable

# Add a constant term for the intercept
X = sm.add_constant(X)

# Create and fit the multiple linear regression model
model = sm.OLS(y, X).fit()

# Print the model summary
print(model.summary())
```
This code:
- Imports the necessary libraries (`pandas`, `numpy`, and `statsmodels.api`).
- Creates a sample DataFrame `data` with features ("Age" and "Experience (Years)") and the target variable ("Salary (USD)"). Replace this with your actual data.
- Defines the feature matrix `X` and target variable `y` by extracting them from the DataFrame.
- Adds a constant term for the intercept to `X` using `sm.add_constant`.
- Creates a linear regression model `model` using `sm.OLS` and fits it to the data with the `fit()` method.
- Prints the model summary using `model.summary()`, which provides valuable information like coefficients, p-values, and R-squared.
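To apply the fitted model to new observations, build a DataFrame with the same feature columns and add the constant term before predicting; a minimal sketch with made-up values:

```python
# Hypothetical new observations with the same feature columns as training
new_X = pd.DataFrame({
    "Age": [28, 50],
    "Experience (Years)": [3, 12],
})

# Add the constant column so the layout matches the fitted design matrix
predictions = model.predict(sm.add_constant(new_X, has_constant="add"))
print(predictions)
```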
Beyond standard MLR, several alternatives and extensions are worth knowing.

Ridge Regression (scikit-learn):
- Strengths:
  - Reduces the impact of collinearity (highly correlated features) by adding an L2 penalty term (regularization) to the cost function.
  - Can be useful for datasets with many features or when overfitting is a concern.
- Weaknesses:
  - Does not perform feature selection: coefficients are shrunk toward zero but rarely become exactly zero.
  - The regularization strength (`alpha`) must be tuned.
- Implementation:

```python
from sklearn.linear_model import Ridge

ridge_model = Ridge(alpha=1.0)  # Adjust alpha for regularization strength
ridge_model.fit(X, y)
```
Lasso Regression (scikit-learn):
- Strengths:
  - Performs feature selection by shrinking some coefficients exactly to zero (an L1 penalty), potentially leading to a more interpretable model.
  - Useful for datasets with many features or when variable selection is desired.
- Weaknesses:
  - Can be unstable with highly correlated features, often keeping one arbitrarily and dropping the rest.
  - The regularization strength (`alpha`) must be tuned.
- Implementation:

```python
from sklearn.linear_model import Lasso

lasso_model = Lasso(alpha=1.0)  # Adjust alpha for regularization strength
lasso_model.fit(X, y)
```
Elastic Net (scikit-learn):
- Strengths:
  - Combines the L1 and L2 penalties of Lasso and Ridge, balancing feature selection with more stable handling of correlated features.
- Weaknesses:
  - Two hyperparameters (`alpha` and `l1_ratio`) must be tuned.
- Implementation:

```python
from sklearn.linear_model import ElasticNet

elastic_net_model = ElasticNet(alpha=1.0, l1_ratio=0.5)  # Adjust alpha and l1_ratio
elastic_net_model.fit(X, y)
```
Polynomial Regression (numpy, statsmodels):
- Strengths:
  - Captures non-linear relationships between features and the target variable.
  - Useful when the underlying relationship is not well-represented by a linear model.
- Weaknesses:
  - Can lead to overfitting if the polynomial degree is too high.
  - Feature engineering is often required to create the polynomial terms.
- Implementation (numpy + statsmodels; a simple sketch that adds squared terms only):

```python
import numpy as np
import statsmodels.api as sm

# Create squared terms with numpy (adjust the degree as needed);
# X is the raw feature matrix, without the constant column
X_arr = np.asarray(X, dtype=float)
X_poly = np.column_stack([X_arr, X_arr ** 2])

# Fit the model with the polynomial terms
model_poly = sm.OLS(y, sm.add_constant(X_poly)).fit()
```
Decision Trees (scikit-learn):
- Strengths:
  - Can handle non-linear relationships and complex interactions between features.
  - No need for feature scaling or assumptions about the underlying data distribution.
- Weaknesses:
  - Can be prone to overfitting if not carefully pruned or regularized.
  - May not be as interpretable as linear models.
- Implementation:

```python
from sklearn.tree import DecisionTreeRegressor

# Consider setting max_depth or min_samples_leaf to curb overfitting
tree_model = DecisionTreeRegressor()
tree_model.fit(X, y)
```