Fitting Theoretical Distributions to Real-World Data with Python's SciPy
What is it?
This process involves finding a theoretical probability distribution (like normal, exponential, etc.) that best describes the pattern observed in your actual data (empirical distribution). SciPy's scipy.stats
module provides tools to achieve this.
- Understanding data: By fitting a theoretical distribution, you gain insights into the underlying characteristics of your data, such as central tendency, spread, and potential skewness.
- Prediction: You can use the fitted distribution to make predictions about future data points that might follow the same pattern.
- Modeling: Fitted distributions can be used as building blocks in simulations or other statistical models.
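As a small illustration of the prediction point above: once a distribution is fitted, you can query it for probabilities and quantiles of future observations. This sketch uses simulated data and an arbitrary threshold of 8; both are placeholders for your own data and question.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
data = rng.normal(loc=5, scale=2, size=1000)  # placeholder data

# Fit a normal distribution (maximum-likelihood estimates of loc and scale)
loc, scale = stats.norm.fit(data)

# Probability that a future observation exceeds 8 (survival function)
p_exceed = stats.norm.sf(8, loc, scale)
print(f"P(X > 8) ~ {p_exceed:.3f}")

# 95th percentile of the fitted distribution (inverse CDF)
q95 = stats.norm.ppf(0.95, loc, scale)
print(f"95th percentile ~ {q95:.2f}")
```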
How it works:
Import necessary libraries:
import numpy as np
import scipy.stats as stats
Fit a distribution:
- Call the chosen distribution's fit method (e.g. stats.norm.fit(data)) to obtain maximum-likelihood estimates of its parameters — for most continuous distributions, a location and a scale, plus any shape parameters.
Evaluate the fit:
- The fit method itself does not return a goodness-of-fit measure. To obtain a p-value, apply a test such as Kolmogorov-Smirnov (stats.kstest) using the fitted parameters. A high p-value (e.g. > 0.05) means the hypothesis that the data come from the chosen distribution cannot be rejected.
- You can also visualize the fit using techniques like histograms and probability density functions (PDFs) to see how closely the theoretical distribution aligns with your data.
Example:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
# Sample data (replace with your actual data)
data = np.random.normal(loc=5, scale=2, size=1000) # Simulate normal distribution
# Fit a normal distribution (returns maximum-likelihood estimates)
loc, scale = stats.norm.fit(data)
# Print the fitted parameters
print("Fitted parameters: loc =", loc, "scale =", scale)
# Goodness of fit via the Kolmogorov-Smirnov test
ks_stat, p_value = stats.kstest(data, 'norm', args=(loc, scale))
print("Goodness-of-fit p-value:", p_value) # Should be high
# Create theoretical PDF from the fitted parameters
x = np.linspace(min(data), max(data), 100)
pdf = stats.norm.pdf(x, loc, scale)
# Plot the data histogram and theoretical PDF
plt.hist(data, density=True, bins=20, alpha=0.7, label='Data')
plt.plot(x, pdf, 'r-', label='Normal PDF')
plt.legend()
plt.show()
Additional considerations:
- Explore alternative distributions if the initial fit isn't satisfactory.
- Consider using methods like Kolmogorov-Smirnov (KS) test or Anderson-Darling test for a more rigorous goodness-of-fit evaluation.
- If you have a strong hunch about the distribution type, you can pass initial parameter guesses to the fit method, or pin a parameter at a known value with the floc/fscale keywords.
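As a sketch of that last point: for a shape-scale family like the gamma distribution, positional arguments after the data act as starting guesses for the shape parameters, and floc fixes the location. The simulated gamma data and the guess of 2.0 below are illustrative placeholders.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)
data = rng.gamma(shape=2.0, scale=3.0, size=1000)  # placeholder data

# Plain maximum-likelihood fit: returns (shape, loc, scale) for the gamma family
a, loc, scale = stats.gamma.fit(data)

# Fix the location at 0 (often sensible for strictly positive data)
# and give the optimizer a starting guess of 2.0 for the shape parameter
a0, loc0, scale0 = stats.gamma.fit(data, 2.0, floc=0)
print("shape:", a0, "loc:", loc0, "scale:", scale0)
```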
By following these steps and exploring different distributions, you can effectively fit theoretical distributions to your empirical data using SciPy in Python.
Exponential Distribution:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
# Sample data (replace with your actual data)
data = np.random.exponential(scale=3, size=1000) # Simulate exponential distribution
# Fit an exponential distribution (fixing the location at 0 is usual for exponential data)
loc, scale = stats.expon.fit(data, floc=0)
# Print the fitted parameters
print("Fitted parameters: loc =", loc, "scale =", scale)
# Goodness of fit via the Kolmogorov-Smirnov test
ks_stat, p_value = stats.kstest(data, 'expon', args=(loc, scale))
print("Goodness-of-fit p-value:", p_value)
# Create theoretical PDF from the fitted parameters
x = np.linspace(min(data), max(data), 100)
pdf = stats.expon.pdf(x, loc, scale)
# Plot the data histogram and theoretical PDF
plt.hist(data, density=True, bins=20, alpha=0.7, label='Data')
plt.plot(x, pdf, 'r-', label='Exponential PDF')
plt.legend()
plt.show()
Poisson Distribution (for discrete data):
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
# Sample data (replace with your actual data)
data = np.random.poisson(lam=5, size=1000) # Simulate Poisson distribution
# Discrete distributions in scipy.stats have no fit method;
# the maximum-likelihood estimate of the Poisson rate is the sample mean
lam = np.mean(data)
# Print the fitted parameter
print("Fitted rate (lambda):", lam)
# Create theoretical PMF (Probability Mass Function)
x = np.arange(min(data), max(data) + 1) # Discrete values for Poisson
pmf = stats.poisson.pmf(x, lam)
# Plot the data histogram and theoretical PMF
plt.hist(data, density=True, bins=np.arange(min(data)-0.5, max(data)+1.5), alpha=0.7, label='Data')
plt.bar(x, pmf, alpha=0.7, label='Poisson PMF')
plt.legend()
plt.show()
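Since the Kolmogorov-Smirnov test assumes a continuous distribution, a chi-square test is a more appropriate check for a discrete fit like the Poisson. A sketch, again on simulated data; the cutoff of k = 10 for pooling the upper tail is an arbitrary choice made so expected bin counts stay reasonable:

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(2)
data = rng.poisson(lam=5, size=1000)  # placeholder data
lam = data.mean()  # MLE of the Poisson rate

# Observed counts for k = 0..10, plus one pooled ">= 11" tail bin
k = np.arange(0, 11)
observed = np.array([(data == i).sum() for i in k] + [(data >= 11).sum()],
                    dtype=float)
# Expected counts under the fitted Poisson; the tail bin gets the leftover mass,
# so observed and expected totals match exactly
expected = np.append(stats.poisson.pmf(k, lam), stats.poisson.sf(10, lam)) * data.size

# One parameter (lambda) was estimated from the data, so ddof=1
chi2_stat, p_value = stats.chisquare(observed, expected, ddof=1)
print(f"chi-square p-value: {p_value:.3f}")
```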
Remember to replace the sample data with your actual data and adjust the distribution type (norm, expon, poisson, etc.) based on your data's characteristics. These examples provide a foundation for fitting various theoretical distributions using SciPy's scipy.stats module.
Curve Fitting with scipy.optimize.curve_fit:
- This method allows you to define a custom function that represents the theoretical distribution's probability density function (PDF) or cumulative distribution function (CDF).
- You then use curve_fit to find the parameters that minimize the squared difference between that function and the empirical data. Since curve_fit expects (x, y) pairs, fit the PDF to a histogram of the data (bin centers vs. densities) rather than to the raw sample.
Here's an example fitting a normal distribution:
import numpy as np
from scipy import optimize
import scipy.stats as stats
# Sample data (replace with your actual data)
data = np.random.normal(loc=5, scale=2, size=1000)
# Define function for normal distribution PDF
def normal_pdf(x, mu, sigma):
    return stats.norm.pdf(x, mu, sigma)
# Empirical density from a histogram: x = bin centers, y = densities
density, bin_edges = np.histogram(data, bins=20, density=True)
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
# Initial guess for parameters (adjust based on your data)
initial_guess = [np.mean(data), np.std(data)]
# Fit the normal PDF to the histogram
popt, pcov = optimize.curve_fit(normal_pdf, bin_centers, density, p0=initial_guess)
# Print the fitted parameters
print("Fitted parameters:", popt)
# Create theoretical PDF with fitted parameters
x = np.linspace(min(data), max(data), 100)
pdf = normal_pdf(x, *popt)
# Plot the data histogram and theoretical PDF (similar to the previous example)
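The CDF variant can be sketched the same way: fit the theoretical CDF to the empirical CDF (sorted sample vs. cumulative ranks), which avoids the binning choices a histogram fit requires. The simulated data below is a placeholder.

```python
import numpy as np
from scipy import optimize
import scipy.stats as stats

rng = np.random.default_rng(3)
data = rng.normal(loc=5, scale=2, size=1000)  # placeholder data

def normal_cdf(x, mu, sigma):
    return stats.norm.cdf(x, mu, sigma)

# Empirical CDF: sorted sample vs. cumulative ranks
x_sorted = np.sort(data)
ecdf = np.arange(1, data.size + 1) / data.size

# Fit the theoretical CDF to the empirical one
popt, pcov = optimize.curve_fit(normal_cdf, x_sorted, ecdf,
                                p0=[np.mean(data), np.std(data)])
mu_hat, sigma_hat = popt
print(f"mu ~ {mu_hat:.2f}, sigma ~ {sigma_hat:.2f}")
```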
Distribution Fitting with Sum of Squared Errors (SSE):
- This approach iterates through several theoretical distributions offered by SciPy.
- For each candidate, it fits the distribution, then computes the sum of squared errors (SSE) between the observed histogram counts and the counts the fitted distribution predicts for the same bins.
- The distribution with the lowest SSE is considered the best fit.
Here's an example implementation:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
# Sample data (replace with your actual data)
data = np.random.normal(loc=5, scale=2, size=1000)
# Function to calculate SSE between observed and expected histogram counts
def calculate_sse(data, distribution):
    counts, bin_edges = np.histogram(data, bins=20)
    bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    bin_width = bin_edges[1] - bin_edges[0]
    params = distribution.fit(data)
    expected = distribution.pdf(bin_centers, *params) * len(data) * bin_width
    return np.sum((counts - expected)**2)
# Loop through distributions and find best fit
best_distribution = None
best_sse = np.inf
for dist in [stats.norm, stats.expon, stats.gamma]: # Add more continuous distributions as needed
    try:
        sse = calculate_sse(data, dist)
        if sse < best_sse:
            best_distribution = dist
            best_sse = sse
    except Exception: # Some distributions may fail to fit a given dataset
        pass
# Print the best fitting distribution
print("Best fitting distribution:", best_distribution.name)
# Generate data and plot for best fit (similar to previous example)
Using Specialized Goodness-of-Fit Tests:
- SciPy does provide goodness-of-fit tests: stats.kstest implements the Kolmogorov-Smirnov test and stats.anderson the Anderson-Darling test; the statsmodels library adds related diagnostics such as the Lilliefors test.
- These tests provide a statistical p-value (or critical values, in the Anderson-Darling case) to assess how well the fitted distribution describes the data.
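A sketch of both tests on simulated normal data (the sample is a placeholder for your own):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(4)
data = rng.normal(loc=5, scale=2, size=1000)  # placeholder data
loc, scale = stats.norm.fit(data)

# Kolmogorov-Smirnov test against the fitted normal.
# Note: estimating the parameters from the same data makes this p-value
# optimistic; statsmodels' Lilliefors test corrects for that.
ks_stat, ks_p = stats.kstest(data, 'norm', args=(loc, scale))
print(f"KS statistic {ks_stat:.4f}, p-value {ks_p:.3f}")

# Anderson-Darling test (returns critical values rather than a p-value)
ad_result = stats.anderson(data, dist='norm')
print("AD statistic:", ad_result.statistic)
print("Critical values:", ad_result.critical_values)  # at 15%, 10%, 5%, 2.5%, 1%
```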
Remember:
- Choose the method that best suits your needs and data type.
- Curve fitting provides more flexibility but requires defining a function.
- Distribution fitting with SSE is simpler but may be less accurate.
- Goodness-of-fit tests offer a statistical measure of fit.
Experiment with these techniques and explore appropriate distributions for your data to find the best fit!