Ziaeemehr
Posted on January 28, 2020
I am going to provide a quick guide to a data-analysis method, giving a short description of what the method does along with some simple examples. I usually do this for the little things I learn every day, documenting them in a simple and clear way to save time when I need them later.
I follow chapter 15, Statistical Modeling, of Numerical Python by Robert Johansson.
Patsy
The patsy library provides features for defining statistical models with a simple formula language inspired by statistical software such as R. It is designed to be a companion library for statistical modeling packages such as statsmodels. For more information about the project and its documentation, see the web page http://patsy.readthedocs.org.
As the simplest example, I am going to fit a line to some one-dimensional data, y = f(x). Knowing the function f(x) would allow us to compute the value of y for any value of x. If we do not know the function f(x), but we have access to data for observations {y_i, x_i}, we can parameterize f(x) and fit the values of the parameters to the data. An example of a parameterization of f(x) is the linear model f(x) = beta_0 + beta_1 * x, where the coefficients beta_0 and beta_1 are the parameters of the model.
Importing the libraries (matplotlib is needed for the plots below, imported as pl)
import patsy
import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.graphics.api as smg
Generating some data and plotting them.
x = np.arange(0, 10, 0.1)
y = 5 * np.random.rand(len(x)) + x
# plotting the data
pl.scatter(x, y, s=10)
Putting the data in a pandas DataFrame.
data = {"y": y, "x": x}
df_data = pd.DataFrame(data)
I pass the patsy formula y ~ x directly to the ordinary least squares regression (ols). Note that we leave out the coefficients in the model formula, as it is implicitly assumed that each term in the formula has a model parameter as its coefficient.
model = smf.ols("y ~ x", df_data)
result = model.fit()
print(result.params)
#Intercept 2.230047
#x 1.020011 (the slope)
#dtype: float64
Finally, I plot the fitted line.
pl.plot(x, result.fittedvalues, lw=2, c="k",
label="y=ax+b, a=%.3f, b=%.3f" % (result.params["x"],
result.params["Intercept"]))
pl.legend()
pl.xlabel("x")
pl.ylabel("y")
pl.show()
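Beyond the parameters, the fitted result object also reports goodness-of-fit statistics. A minimal sketch, reusing the same setup as above (the random seed is my own addition, so the numbers are reproducible):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(0)  # assumed seed, just to make the fit reproducible
x = np.arange(0, 10, 0.1)
y = 5 * np.random.rand(len(x)) + x
df_data = pd.DataFrame({"y": y, "x": x})

result = smf.ols("y ~ x", df_data).fit()
print(result.rsquared)   # coefficient of determination R^2
print(result.summary())  # full report: R^2, standard errors, p-values
```

The summary() report is often more useful than the raw params when judging whether the linear model is adequate.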
Instead of this, we could use the lower-level method of least-squares fitting, which gives us the same result.
X = np.vstack([np.ones(len(x)), x]).T
beta, res, rank, sval = np.linalg.lstsq(X, y, rcond=None)
print(beta)
#[2.23004704 1.02001068]
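To confirm that the two approaches agree, we can compare the lstsq coefficients against the statsmodels parameters directly. A quick sanity check (the seed is my own addition for reproducibility):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(0)  # assumed seed for reproducibility
x = np.arange(0, 10, 0.1)
y = 5 * np.random.rand(len(x)) + x

# statsmodels fit via a patsy formula
result = smf.ols("y ~ x", pd.DataFrame({"y": y, "x": x})).fit()

# lower-level least-squares fit on an explicit design matrix
X = np.vstack([np.ones(len(x)), x]).T
beta, res, rank, sval = np.linalg.lstsq(X, y, rcond=None)

# both solve the same least-squares problem, so the
# [intercept, slope] pairs match up to floating-point error
print(np.allclose(beta, [result.params["Intercept"], result.params["x"]]))
# prints True
```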
If X is a vector in Y = beta_0 + beta_1 X, we have multiple linear regression, and if Y is a vector, it is known as multivariate linear regression.
An example of multiple linear regression:
### from least square fitting
y = np.array([1, 2, 3, 4, 5])
x1 = np.array([6, 7, 8, 9, 10])
x2 = np.array([11, 12, 13, 14, 15])
X = np.vstack([np.ones(5), x1, x2, x1 * x2]).T
# print(X)
beta, res, rank, sval = np.linalg.lstsq(X, y, rcond=None)
# print (beta)
# [-5.55555556e-01 1.88888889e+00 -8.88888889e-01 -1.11022302e-15]
### DesignMatrix: create a dictionary that maps the variable
# names to the corresponding data arrays
data = {"y": y, "x1": x1, "x2": x2}
y, X = patsy.dmatrices("y ~ 1 + x1 + x2 + x1 * x2", data)
# print (X)
### DesignMatrix from data frame
df_data = pd.DataFrame(data)
y, X = patsy.dmatrices("y ~ 1 + x1 + x2 + x1:x2", df_data, return_type="dataframe")
# print(X)
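With return_type="dataframe", the design matrix can be inspected like any DataFrame. A short sketch of what patsy actually builds for this formula:

```python
import numpy as np
import pandas as pd
import patsy

df_data = pd.DataFrame({"y": np.array([1, 2, 3, 4, 5]),
                        "x1": np.array([6, 7, 8, 9, 10]),
                        "x2": np.array([11, 12, 13, 14, 15])})

y, X = patsy.dmatrices("y ~ 1 + x1 + x2 + x1:x2", df_data,
                       return_type="dataframe")
print(list(X.columns))     # ['Intercept', 'x1', 'x2', 'x1:x2']
print(X.iloc[0].tolist())  # first row: [1.0, 6.0, 11.0, 66.0]
```

Note how the interaction column x1:x2 is simply the element-wise product of the x1 and x2 columns.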
Ordinary least squares regression (OLS)
We can use the OLS class from the statsmodels library instead of the lower-level method np.linalg.lstsq.
model = sm.OLS(y, X)
result = model.fit()
# print(result.params)
# Intercept -5.555556e-01
# x1 1.888889e+00
# x2 -8.888889e-01
# x1:x2 -7.771561e-16
# dtype: float64
# print(result.params["x1:x2"])
We can pass the Patsy formula for the model directly when we create a model instance, which completely eliminates the need to first create the design matrices.
model = smf.ols("y ~ 1 + x1 + x2 + x1:x2", df_data)
result = model.fit()
# print(result.params)
# Intercept -5.555556e-01
# x1 1.888889e+00
# x2 -8.888889e-01
# x1:x2 -7.771561e-16
# dtype: float64
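Once fitted, the result object can also predict on new data; the new DataFrame only needs the variable names used in the formula, and patsy builds the design matrix behind the scenes. A sketch with the same toy data (the new x1/x2 points are my own arbitrary choice):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df_data = pd.DataFrame({"y": np.array([1, 2, 3, 4, 5]),
                        "x1": np.array([6, 7, 8, 9, 10]),
                        "x2": np.array([11, 12, 13, 14, 15])})

result = smf.ols("y ~ 1 + x1 + x2 + x1:x2", df_data).fit()

# on the training data the fit reproduces y exactly,
# since y is an exact linear function of x1 in this toy data
print(np.allclose(result.fittedvalues, df_data["y"]))

# predict for unseen points following the same pattern (x2 = x1 + 5)
df_new = pd.DataFrame({"x1": [11.0, 12.0], "x2": [16.0, 17.0]})
print(result.predict(df_new).values)  # approximately [6., 7.]
```

Beware that x1 and x2 are perfectly collinear here (x2 = x1 + 5), which is why the x1:x2 coefficient comes out as numerical noise (~1e-16).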
As concrete examples of patsy formulas, consider the following formulas and the resulting right-hand-side terms (which we can extract from the design_info attribute using the term_names attribute):
from collections import defaultdict
data = defaultdict(lambda: np.array([]))
print(patsy.dmatrices("y ~ a", data=data)[1].design_info.term_names)
# ['Intercept', 'a']
print(patsy.dmatrices("y ~ 1 + a + b", data=data)[1].design_info.term_names)
# ['Intercept', 'a', 'b']
print(patsy.dmatrices("y ~ -1 + a + b", data=data)[1].design_info.term_names)
# ['a', 'b']
print(patsy.dmatrices("y ~ a * b", data=data)[1].design_info.term_names)
# ['Intercept', 'a', 'b', 'a:b']
print(patsy.dmatrices("y ~ a * b * c", data=data)[1].design_info.term_names)
# ['Intercept', 'a', 'b', 'a:b', 'c', 'a:c', 'b:c', 'a:b:c']
print(patsy.dmatrices("y ~ a * b * c - a:b:c", data=data)[1].design_info.term_names)
# ['Intercept', 'a', 'b', 'a:b', 'c', 'a:c', 'b:c']