Fitting Theoretical Distributions to Real-World Data with Python's SciPy

2024-05-24

What is it?

This process involves finding a theoretical probability distribution (like normal, exponential, etc.) that best describes the pattern observed in your actual data (empirical distribution). SciPy's scipy.stats module provides tools to achieve this.

  • Understanding data: By fitting a theoretical distribution, you gain insights into the underlying characteristics of your data, such as central tendency, spread, and potential skewness.
  • Prediction: You can use the fitted distribution to make predictions about future data points that might follow the same pattern.
  • Modeling: Fitted distributions can be used as building blocks in simulations or other statistical models.
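As a sketch of the prediction use case: once a distribution is fitted, its `sf` (survival function) and `ppf` (quantile function) answer questions about future observations. The threshold 8 and the 95% level here are arbitrary illustrative choices on simulated data:

```python
import numpy as np
import scipy.stats as stats

# Simulated observations; replace with your actual data
rng = np.random.default_rng(0)
data = rng.normal(loc=5, scale=2, size=1000)

# Fit a normal distribution: fit() returns the estimated (loc, scale)
loc, scale = stats.norm.fit(data)

# Probability that a future observation exceeds the (arbitrary) threshold 8
p_exceed = stats.norm.sf(8, loc, scale)

# Value below which 95% of future observations are expected to fall
q95 = stats.norm.ppf(0.95, loc, scale)

print(f"P(X > 8) = {p_exceed:.3f}, 95th percentile = {q95:.2f}")
```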

How it works:

  1. Import necessary libraries:

    import numpy as np
    import scipy.stats as stats
    
  2. Fit the distribution:

    • Call the chosen distribution's fit method (e.g. stats.norm.fit(data)). For continuous distributions this returns the maximum-likelihood estimates of any shape parameters plus loc and scale; it does not return a p-value.

  3. Evaluate the fit:

    • A goodness-of-fit test such as scipy.stats.kstest provides a statistical measure of fit. A high p-value (e.g., > 0.05) suggests that the hypothesis that the data came from the fitted distribution cannot be rejected.
    • You can also visualize the fit using techniques like histograms and probability density functions (PDFs) to see how closely the theoretical distribution aligns with your data.
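These fitting and evaluation steps can be sketched as follows; note that `norm.fit` returns only the estimated `loc` and `scale`, while the p-value comes from a separate Kolmogorov-Smirnov test:

```python
import numpy as np
import scipy.stats as stats

# Simulated observations; replace with your actual data
rng = np.random.default_rng(1)
data = rng.normal(loc=5, scale=2, size=1000)

# fit() returns the maximum-likelihood estimates of loc and scale only
loc, scale = stats.norm.fit(data)

# Kolmogorov-Smirnov test of the data against the fitted distribution
statistic, p_value = stats.kstest(data, 'norm', args=(loc, scale))
print(f"loc = {loc:.2f}, scale = {scale:.2f}, KS p-value = {p_value:.3f}")
```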

Example:

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Sample data (replace with your actual data)
data = np.random.normal(loc=5, scale=2, size=1000)  # Simulate normal distribution

# Fit a normal distribution; fit() returns the estimated (loc, scale)
loc, scale = stats.norm.fit(data)

# Evaluate the fit with a Kolmogorov-Smirnov test
ks_statistic, p_value = stats.kstest(data, 'norm', args=(loc, scale))

# Print the fitted parameters and the p-value
print("Fitted parameters:", loc, scale)
print("Goodness-of-fit p-value:", p_value)  # Should be high here

# Create the theoretical PDF from the fitted parameters
x = np.linspace(min(data), max(data), 100)
pdf = stats.norm.pdf(x, loc, scale)

# Plot the data histogram and theoretical PDF
plt.hist(data, density=True, bins=20, alpha=0.7, label='Data')
plt.plot(x, pdf, 'r-', label='Normal PDF')
plt.legend()
plt.show()

Additional considerations:

  • Explore alternative distributions if the initial fit isn't satisfactory.
  • Consider using the Kolmogorov-Smirnov (KS) or Anderson-Darling tests for a more rigorous goodness-of-fit evaluation.
  • If you have a strong hunch about the distribution type, you can provide initial parameter values to the fit method.
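As a sketch of the last point: the fit methods of continuous distributions in scipy.stats accept positional initial guesses for shape parameters, and floc/fscale keywords to hold location or scale fixed rather than estimate them. Shown here with a gamma distribution on simulated data:

```python
import numpy as np
import scipy.stats as stats

# Simulated positive-valued data; replace with your actual data
rng = np.random.default_rng(2)
data = rng.gamma(shape=2.0, scale=3.0, size=1000)

# Free fit: shape, loc and scale are all estimated
a, loc, scale = stats.gamma.fit(data)

# Guided fit: 2.0 is an initial guess for the shape parameter,
# and floc=0 fixes the location at zero instead of estimating it
a0, loc0, scale0 = stats.gamma.fit(data, 2.0, floc=0)

print("Free fit:  ", a, loc, scale)
print("floc=0 fit:", a0, loc0, scale0)
```

Fixing the location is often appropriate for strictly positive data, and it usually makes the optimisation more stable.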

By following these steps and exploring different distributions, you can effectively fit theoretical distributions to your empirical data using SciPy in Python.

Exponential Distribution:

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Sample data (replace with your actual data)
data = np.random.exponential(scale=3, size=1000)  # Simulate exponential distribution

# Fit an exponential distribution; fit() returns the estimated (loc, scale)
loc, scale = stats.expon.fit(data)

# Evaluate the fit with a Kolmogorov-Smirnov test
ks_statistic, p_value = stats.kstest(data, 'expon', args=(loc, scale))

# Print the fitted parameters and the p-value
print("Fitted parameters:", loc, scale)
print("Goodness-of-fit p-value:", p_value)

# Create the theoretical PDF from the fitted parameters
x = np.linspace(min(data), max(data), 100)
pdf = stats.expon.pdf(x, loc, scale)

# Plot the data histogram and theoretical PDF
plt.hist(data, density=True, bins=20, alpha=0.7, label='Data')
plt.plot(x, pdf, 'r-', label='Exponential PDF')
plt.legend()
plt.show()

Poisson Distribution (for discrete data):

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Sample data (replace with your actual data)
data = np.random.poisson(lam=5, size=1000)  # Simulate Poisson distribution

# Fit a Poisson distribution. Discrete distributions in scipy.stats have no
# .fit() method; the maximum-likelihood estimate of the rate is the sample mean
lam = data.mean()

# Print the fitted parameter
print("Fitted rate (lambda):", lam)

# Create theoretical PMF (Probability Mass Function)
x = np.arange(min(data), max(data) + 1)  # Discrete values for Poisson
pmf = stats.poisson.pmf(x, lam)

# Plot the data histogram and theoretical PMF
plt.hist(data, density=True, bins=np.arange(min(data)-0.5, max(data)+1.5), alpha=0.7, label='Data')
plt.bar(x, pmf, alpha=0.7, label='Poisson PMF')
plt.legend()
plt.show()
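Because the KS test is not appropriate for discrete data, a chi-square test is a common way to check a Poisson fit. This sketch compares the observed counts per value with the counts expected under the fitted rate (small expected cells would normally be pooled; that step is omitted here for brevity):

```python
import numpy as np
import scipy.stats as stats

# Simulated count data; replace with your actual data
rng = np.random.default_rng(3)
data = rng.poisson(lam=5, size=1000)

# Maximum-likelihood estimate of the Poisson rate is the sample mean
lam = data.mean()

# Observed counts for each distinct value
values, observed = np.unique(data, return_counts=True)

# Expected counts under the fitted Poisson, rescaled so totals match
expected = stats.poisson.pmf(values, lam) * len(data)
expected *= observed.sum() / expected.sum()

# One estimated parameter -> ddof=1
chi2_stat, p_value = stats.chisquare(observed, expected, ddof=1)
print(f"chi-square = {chi2_stat:.2f}, p-value = {p_value:.4f}")
```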

Remember to replace the sample data with your actual data and adjust the distribution type (norm, expon, poisson, etc.) based on your data's characteristics. These examples provide a foundation for fitting various theoretical distributions using SciPy's scipy.stats module.




Curve Fitting with scipy.optimize.curve_fit:

  • This method allows you to define a custom function that represents the theoretical distribution's probability density function (PDF) or cumulative distribution function (CDF).
  • You then use curve_fit to find the parameters that minimize the squared difference between the theoretical function and the data's normalized histogram.

Here's an example fitting a normal distribution:

import numpy as np
from scipy import optimize
import scipy.stats as stats

# Sample data (replace with your actual data)
data = np.random.normal(loc=5, scale=2, size=1000)

# Define function for normal distribution PDF
def normal_pdf(x, mu, sigma):
  return stats.norm.pdf(x, mu, sigma)

# curve_fit needs (x, y) pairs, so fit the PDF to a normalized histogram
counts, bin_edges = np.histogram(data, bins=20, density=True)
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2

# Initial guess for parameters (adjust based on your data)
initial_guess = [np.mean(data), np.std(data)]

# Fit the normal PDF to the histogram
popt, pcov = optimize.curve_fit(normal_pdf, bin_centers, counts, p0=initial_guess)

# Print the fitted parameters
print("Fitted parameters:", popt)

# Create theoretical PDF with fitted parameters
x = np.linspace(min(data), max(data), 100)
pdf = normal_pdf(x, *popt)

# Plot the data histogram and theoretical PDF (similar to previous example)

Distribution Fitting with Sum of Squared Errors (SSE):

  • This approach iterates through different theoretical distributions offered by SciPy.
  • For each distribution, it fits the parameters, then calculates the sum of squared errors (SSE) between the fitted PDF (evaluated at the histogram bin centers) and the data's normalized histogram.
  • The distribution with the lowest SSE is considered the best fit.

Here's an example implementation:

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Sample data (replace with your actual data)
data = np.random.normal(loc=5, scale=2, size=1000)

# Fit a distribution, then measure SSE against the normalized histogram
def calculate_sse(data, distribution):
  counts, bins = np.histogram(data, bins=20, density=True)
  bin_centers = (bins[:-1] + bins[1:]) / 2
  params = distribution.fit(data)
  pdf = distribution.pdf(bin_centers, *params)
  return np.sum((counts - pdf)**2)

# Loop through distributions and find best fit
best_distribution = None
best_sse = np.inf

# Note: discrete distributions like stats.poisson have no .fit() method,
# so only continuous distributions belong in this list
for dist in [stats.norm, stats.expon, stats.gamma]:  # Add more as needed
  try:
    sse = calculate_sse(data, dist)
    if sse < best_sse:
      best_distribution = dist
      best_sse = sse
  except Exception:  # Some fits may fail to converge on a given dataset
    pass

# Print the best fitting distribution
print("Best fitting distribution:", best_distribution.name)

# Generate data and plot for best fit (similar to previous example)

Using Specialized Goodness-of-Fit Tests:

  • scipy.stats itself provides goodness-of-fit tests such as kstest (Kolmogorov-Smirnov) and anderson (Anderson-Darling); libraries like statsmodels offer further options such as the Lilliefors test.
  • These tests provide a statistical p-value to assess how well the fitted distribution describes the data.
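Both tests ship with scipy.stats; a sketch of each on simulated normal data:

```python
import numpy as np
import scipy.stats as stats

# Simulated data; replace with your actual data
rng = np.random.default_rng(4)
data = rng.normal(loc=5, scale=2, size=1000)

# Kolmogorov-Smirnov test against the fitted normal
loc, scale = stats.norm.fit(data)
ks_stat, ks_p = stats.kstest(data, 'norm', args=(loc, scale))
print(f"KS: statistic = {ks_stat:.4f}, p-value = {ks_p:.4f}")

# Anderson-Darling test (parameters are estimated internally)
result = stats.anderson(data, dist='norm')
print("AD statistic:    ", result.statistic)
print("Critical values: ", result.critical_values)
print("Significance (%):", result.significance_level)
# Normality is rejected at a given level when the statistic
# exceeds the corresponding critical value
```

Anderson-Darling returns critical values rather than a p-value, and it weights the distribution tails more heavily than the KS test.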

Remember:

  • Choose the method that best suits your needs and data type.
  • Curve fitting provides more flexibility but requires defining a function.
  • Distribution fitting with SSE is simpler but may be less accurate.
  • Goodness-of-fit tests offer a statistical measure of fit.

Experiment with these techniques and explore appropriate distributions for your data to find the best fit!


python numpy statistics

