Get prediction of OLS fit from statsmodels - python-3.6

I am trying to get in-sample predictions from an OLS fit, as shown below:
import numpy as np
import pandas as pd
import statsmodels.api as sm
macrodata = sm.datasets.macrodata.load_pandas().data
macrodata.index = pd.period_range('1959Q1', '2009Q3', freq='Q')
mod = sm.OLS(macrodata['realgdp'], sm.add_constant(macrodata[['realdpi', 'realinv', 'tbilrate', 'unemp']])).fit()
mod.get_prediction(sm.add_constant(macrodata[['realdpi', 'realinv', 'tbilrate', 'unemp']])).summary_frame(0.95).head()
This works. But if I change the order of the regressors passed to mod.get_prediction, I get different estimates:
mod.get_prediction(sm.add_constant(macrodata[['tbilrate', 'unemp', 'realdpi', 'realinv']])).summary_frame(0.95).head()
This is surprising. Can't mod.get_prediction identify the regressors based on column names?

As noted in the comments, sm.OLS converts your data frame into an array for fitting. For prediction it likewise works on an array, so it expects the predictors in the same order as at fit time.
If you would like the column names to be used, use the formula interface; see the documentation for more details. Below I apply it to your example:
import statsmodels.api as sm
import statsmodels.formula.api as smf
macrodata = sm.datasets.macrodata.load_pandas().data
mod = smf.ols(formula='realgdp ~ realdpi + realinv + tbilrate + unemp', data=macrodata)
res = mod.fit()
In the order provided:
res.get_prediction(macrodata[['realdpi', 'realinv', 'tbilrate', 'unemp']]).summary_frame(0.95).head()
mean mean_se mean_ci_lower mean_ci_upper obs_ci_lower obs_ci_upper
0 2716.423418 14.608110 2715.506229 2717.340607 2710.782460 2722.064376
1 2802.820840 13.714821 2801.959737 2803.681943 2797.188729 2808.452951
2 2781.041564 12.615903 2780.249458 2781.833670 2775.419588 2786.663539
3 2786.894138 12.387428 2786.116377 2787.671899 2781.274166 2792.514110
4 2848.982580 13.394688 2848.141577 2849.823583 2843.353507 2854.611653
Results are the same if we flip the columns:
res.get_prediction(macrodata[['tbilrate', 'unemp', 'realdpi', 'realinv']]).summary_frame(0.95).head()
mean mean_se mean_ci_lower mean_ci_upper obs_ci_lower obs_ci_upper
0 2716.423418 14.608110 2715.506229 2717.340607 2710.782460 2722.064376
1 2802.820840 13.714821 2801.959737 2803.681943 2797.188729 2808.452951
2 2781.041564 12.615903 2780.249458 2781.833670 2775.419588 2786.663539
3 2786.894138 12.387428 2786.116377 2787.671899 2781.274166 2792.514110
4 2848.982580 13.394688 2848.141577 2849.823583 2843.353507 2854.611653
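To illustrate why column order matters with the array interface, here is a minimal sketch in plain Python. The coefficients, values, and helper names (`predict_positional`, `reorder`) are made up for illustration and are not part of statsmodels. The idea is only that a positional dot product pairs each coefficient with whatever value sits in the same slot, so the values have to be put back in the fit-time order first.

```python
def predict_positional(coefs, row):
    """Pair coefficients and values purely by position, as an array-based model does."""
    return sum(c * v for c, v in zip(coefs, row))


def reorder(row_by_name, fit_order):
    """Arrange named values in the column order that was used at fit time."""
    return [row_by_name[name] for name in fit_order]


# Hypothetical fit-time column order and made-up coefficients.
fit_order = ["const", "realdpi", "realinv"]
coefs = [10.0, 2.0, 0.5]

# The same observation, but with its columns in a different order.
row = {"realinv": 4.0, "const": 1.0, "realdpi": 3.0}

wrong = predict_positional(coefs, list(row.values()))        # 10*4 + 2*1 + 0.5*3 = 43.5
right = predict_positional(coefs, reorder(row, fit_order))   # 10*1 + 2*3 + 0.5*4 = 18.0
print(wrong, right)
```

With a fitted sm.OLS result, the analogous step would be to select the data frame columns in the order the model was fitted with before calling get_prediction.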

Related

Having issue using Julia library

I am trying to run this code in Julia to calculate the knn value, but I get the following error when I run it.
ERROR: LoadError: syntax: extra token "ScikitLearn" after end of expression
Stacktrace:
[1] top-level scope
# e:\Fontbonne\CIS 585 Independent Study\Code\knn.jl:6
in expression starting at e:\Fontbonne\CIS 585 Independent Study\Code\knn.jl:6
The error seems to be the library on line 6. I have searched for a couple of hours to try and find a solution. Any help would be greatly appreciated.
Here is the code:
import Pkg
Pkg.add("ScikitLearn")
using ScikitLearn: fit!, predict, @sk_import
using DataFrames, CSV, DataStructures
from ScikitLearn.neighbors import KNeighborsClassifier
from ScikitLearn.model_selection import train_test_split
from ScikitLearn.metrics import accuracy_score
function splitTrainTest(data, at = 0.8)
n = nrow(data)
ind = shuffle(1:n)
train_ind = view(ind, 1:floor(Int, at*n))
test_ind = view(ind, (floor(Int, at*n)+1):n)
return data[train_ind,:], data[test_ind,:]
end
# data preparation
df = open("breast-cancer.data") do file
read(file, String)
end
print(df)
X, y = splitTrainTest(df)
# split data into train and test
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
# make model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train, y_train)
# check accuracy
print(accuracy_score(y_test, knn.predict(x_test)))
That comment should have been an answer: You're doing
from ScikitLearn.neighbors import KNeighborsClassifier
which is Python syntax, not Julia syntax. If you're trying to use a Python model in ScikitLearn.jl, you probably want the @sk_import macro; in your case:
julia> @sk_import neighbors: KNeighborsClassifier
PyObject <class 'sklearn.neighbors._classification.KNeighborsClassifier'>

Import data vector from julia to R using RCall

Assume I have a Julia data array like this:
Any[Any[1,missing], Any[2,5], Any[3,6]]
I want to import it to R using RCall so I have an output equivalent to this:
data <- cbind(c(1,NA), c(2,5), c(3,6))
Note: the length of the data is dynamic and may not be 3!
Could anyone help me do this? Thank you.
You can just interpolate a matrix into R:
a = [ 1 2 3
missing 5 6 ]
R"data <- $a"
To reorganize your "array of arrays" into a matrix, concatenate them horizontally:
b = Any[Any[1,missing], Any[2,5], Any[3,6]]
a = hcat(b...)
R"data <- $a"

How to reproduce results of predict function in R

Let's say I train a model in R.
model <- lm(as.formula(paste((model_Data)[2],"~",paste((model_Data)[c(4,5,6,7,8,9,10,11,12,13,15,16,17,18,20,21,22,63,79,90,91,109,125,132,155,175,197,202,210,251,252,279,287,292,300,313,318)],collapse="+"),sep="")),data=model_Data)
I then use the model to predict an unknown.
prediction <- predict(model,unknown[1,])
1
8.037219
Instead of using predict, let's pull out the coefficients and do it manually.
model$coefficients
9.250265284
0.054054202
0.052738367
-0.55119556
0.019686046
0.392728331
0.794558094
0.200555755
-0.63218309
0.050404541
0.089660195
-0.04889444
-0.24645514
0.225817891
-0.10411162
0.108317865
0.004281512
0.219695437
0.037514904
-0.00914805
0.077885231
0.656321472
-0.05436867
0.033296525
0.072551915
-0.11498145
-0.03414029
0.081145352
0.11187141
0.690106624
NA
-0.11112986
-0.18002883
0.006238802
0.058387332
-0.04469568
-0.02520228
0.121577926
Looks like the model couldn't find a coefficient for one of the variables.
Here are the independent variables for our unknown.
2.048475484
1.747222331
-1.240658767
-1.26971135
-0.61858754
-1.186401425
-1.196781456
-0.437969964
-1.37330171
-1.392555895
-0.147275619
0.315190159
0.544014105
-1.137999082
0.464498153
-1.825631473
-1.824991143
0.61730876
-1.311527708
-0.457725059
-0.455920549
-0.196326975
0.636723746
0.128123676
-0.0064055
-0.788435688
-0.493452602
-0.563353694
-0.441559371
-1.083489708
-0.882784077
-0.567873188
1.068504735
1.364721122
0.294178454
2.302875604
-0.998685333
If I multiply each independent variable by its coefficient and add the intercept, the predicted value for the unknown is 8.450137349.
The predict function gave us 8.037219 and the manual calculation gave 8.450137349. What is happening within the predict function that is causing it to predict a different value than the manual calculation? What has to be done to make the values match?
I get a lot closer to the predict answer with the code below, which drops the NA coefficient together with the corresponding predictor value:
b <- c(9.250265284, 0.054054202, 0.052738367, -0.55119556, 0.019686046, 0.392728331, 0.794558094, 0.200555755, -0.63218309, 0.050404541, 0.089660195, -0.04889444, -0.24645514, 0.225817891, -0.10411162, 0.108317865, 0.004281512, 0.219695437, 0.037514904, -0.00914805, 0.077885231, 0.656321472, -0.05436867, 0.033296525, 0.072551915, -0.11498145, -0.03414029, 0.081145352, 0.11187141, 0.690106624, NA, -0.11112986, -0.18002883, 0.006238802, 0.058387332, -0.04469568, -0.02520228, 0.121577926)
x <- c(1, 2.048475484, 1.747222331, -1.240658767, -1.26971135, -0.61858754, -1.186401425, -1.196781456, -0.437969964, -1.37330171, -1.392555895, -0.147275619, 0.315190159, 0.544014105, -1.137999082, 0.464498153, -1.825631473, -1.824991143, 0.61730876, -1.311527708, -0.457725059, -0.455920549, -0.196326975, 0.636723746, 0.128123676, -0.0064055, -0.788435688, -0.493452602, -0.563353694, -0.441559371, -1.083489708, -0.882784077, -0.567873188, 1.068504735, 1.364721122, 0.294178454, 2.302875604, -0.998685333)
# remove the missing value in `b` and the corresponding value in `x`
x <- x[-31]
b <- b[-31]
x %*% b
# [,1]
# [1,] 8.036963
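The same idea, sketched in plain Python with made-up numbers rather than the model above: a term whose coefficient is missing (NaN here, standing in for R's NA) is skipped, so it contributes nothing to the sum. The helper name `manual_predict` is hypothetical.

```python
import math


def manual_predict(coefs, values):
    """Sum coef * value, skipping terms whose coefficient is NaN (standing in for R's NA)."""
    if len(coefs) != len(values):
        raise ValueError("coefs and values must have the same length")
    return sum(c * v for c, v in zip(coefs, values) if not math.isnan(c))


# Made-up example: the first coefficient is the intercept, so the first value is 1.
coefs = [1.0, 2.0, float("nan"), 0.5]
values = [1.0, 3.0, 7.0, 4.0]

pred = manual_predict(coefs, values)  # 1*1 + 2*3 + 0.5*4 = 9.0; the NaN term is dropped
print(pred)
```

Any small difference that remains after dropping the missing term would be consistent with the coefficients having been copied at printed precision, though that is an assumption about this particular case.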

rpy2 does not convert back to pandas

I have an R object that will not convert to Pandas, and the strange part is that it doesn't throw an error.
Updated with the code I'm using, sorry not to supply that up front -- and to miss the request for 2 weeks!
Python code that calls an R script
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
import datetime
from rpy2.robjects.conversion import localconverter
def serial_date_to_string(srl_no):
    new_date = datetime.datetime(1970,1,1,0,0) + datetime.timedelta(srl_no - 1)
    return new_date.strftime("%Y-%m-%d")
jurisdiction='TX'
r=ro.r
r_df=r['source']('farrington.R')
with localconverter(ro.default_converter + pandas2ri.converter):
    pd_from_r_df = ro.conversion.rpy2py(r_df)
The issue is that pd_from_r_df returns an R object rather than a Pandas dataframe:
>>> pd_from_r_df
R object with classes: ('list',) mapped to:
[ListSexpVector, BoolSexpVector]
value: <class 'rpy2.rinterface.ListSexpVector'>
<rpy2.rinterface.ListSexpVector object at 0x7faa4c4eff08> [RTYPES.VECSXP]
visible: <class 'rpy2.rinterface.BoolSexpVector'>
<rpy2.rinterface.BoolSexpVector object at 0x7faa4c4e7948> [RTYPES.LGLSXP]
Here's the R script "farrington.R", which returns a surveillance time series that ro.conversion.rpy2py (as used above) isn't converting to a pandas dataframe:
library('surveillance')
library(readr)
library(tidyr)
library(dplyr)
w<-1
b<-3
nfreq<-52
steps_back<- 28
alpha<-0.05
counts <- read_csv("Weekly_counts_of_death_by_jurisdiction_and_cause_of_death.csv")
counts<-counts[,!colnames(counts) %in% c('Cause Subgroup','Time Period','Suppress','Note','Average Number of Deaths in Time Period','Difference from 2015-2019 to 2020','Percent Difference from 2015-2019 to 2020')]
wide_counts_by_cause<-pivot_wider(counts,names_from='Cause Group',values_from='Number of Deaths',values_fn=(`Cause Group`=sum))
wide_state <- filter(wide_counts_by_cause,`State Abbreviation`==jurisdiction)
wide_state <- filter(wide_state,Type=='Unweighted')
wide_state[is.na(wide_state)] <-0
important_columns=c('Alzheimer disease and dementia','Cerebrovascular diseases','Heart failure','Hypertensive dieases','Ischemic heart disease','Other diseases of the circulatory system','Malignant neoplasms','Diabetes','Renal failure','Sepsis','Chronic lower respiratory disease','Influenza and pneumonia','Other diseases of the respiratory system','Residual (all other natural causes)')
all_columns <- append(c('Year','Week'),important_columns)
selected_wide_state<-wide_state[, names(wide_state) %in% all_columns]
start<-c(as.numeric(min(selected_wide_state[,'Year'])),as.numeric(min(selected_wide_state[,'Week'])))
freq<-as.numeric(max(selected_wide_state[,'Week']))
sts <- new("sts",epoch=1:nrow(numeric_wide_state),start=start,freq=freq,observed=numeric_wide_state)
sts_4 <- aggregate(sts[,important_columns],nfreq=nfreq)
start_idx=end_idx-steps_back
cntrlFar <- list(range=start_idx:end_idx,w==w,b==b,alpha==alpha)
surveil_ts_4_far <- farrington(sts_4,control=cntrlFar)
far_df<-tidy.sts(surveil_ts_4_far)
far_df
(using the NCHS data here [from a couple months back] https://data.cdc.gov/NCHS/Weekly-counts-of-death-by-jurisdiction-and-cause-o/u6jv-9ijr/ )
In R, calling source() on a script returns, by default, a list of two named components, $value and $visible, where:
$value is the value of the last expression evaluated, which in your case is the far_df data frame (in R, a data.frame is a class that extends the list type);
$visible is a logical vector indicating whether that last value would be printed, which in your case is TRUE. It would be FALSE had you ended the script at far_df <- tidy.sts(surveil_ts_4_far).
In fact, your Python output confirms this: it shows a list of [ListSexpVector, BoolSexpVector].
Therefore, since you only want the first item, index it by number or by name.
r_raw = ro.r['source']('farrington.R') # IN R: r_raw <- source('farrington.R')
# Either index by position ...
r_df = r_raw[0] # IN R: r_df <- r_raw[[1]]
# ... or, equivalently, by name
r_df = r_raw[r_raw.names.index('value')] # IN R: r_df <- r_raw$value
with localconverter(ro.default_converter + pandas2ri.converter):
    pd_from_r_df = ro.conversion.rpy2py(r_df)

How to make `Heatmaps` in `Bokeh` with a continuous color map, using Python 3?

I was trying to replicate this style of HeatMap that maps continuous values to a LinearColorMapper instance: http://docs.bokeh.org/en/latest/docs/gallery/unemployment.html
I wanted to make a HeatMap (w/ either charts or rect) and then add a single selection widget to select the obsv_id and then a slider widget to go through the dates.
However, I was having trouble in the beginning with the HeatMap itself with a single obsv_id/date pair. What am I doing wrong in creating this HeatMap? This would essentially be a 3x3 rectangle plot of the size variable and the loc variable.
Bonus: Can you help me/give some advice on how to wire the output of these widgets to control the plot?
I saw these posts but all of the examples use actual hex colors as a list instead of mapping using a continuous measure:
python bokeh, how to make a correlation plot? http://docs.bokeh.org/en/latest/docs/gallery/categorical.html
# Init
import numpy as np
import pandas as pd
from bokeh.plotting import figure, output_notebook, output_file, reset_output, show, ColumnDataSource
from bokeh.models import LinearColorMapper
reset_output()
output_notebook()
np.random.seed(0)
# Coords
dates = ["07-3","07-11","08-6","08-28"]
#locs = ["air","water","earth"]
locs = [0,1,2]
size = [3.0, 0.2, 0.025]
observations = ["obsv_%d"%_ for _ in range(10)]
# Data
Ar_tmp = np.zeros(( len(dates)*len(locs)*len(size)*len(observations), 5 ), dtype=object)
i = 0
for date in dates:
    for loc in locs:
        for s in size:
            for obsv_id in observations:
                Ar_tmp[i,:] = np.array([obsv_id, date, loc, s, np.random.random()])
                i += 1
DF_tmp = pd.DataFrame(Ar_tmp, columns=["obsv_id", "date", "loc", "size", "value"])
DF_tmp["value"] = DF_tmp["value"].astype(float)
DF_tmp["size"] = DF_tmp["size"].astype(float)
DF_tmp["loc"] = DF_tmp["loc"].astype(float)
# obsv_id date loc size value
# 0 obsv_0 07-3 air 3.0 0.548814
# 1 obsv_1 07-3 air 3.0 0.715189
# 2 obsv_2 07-3 air 3.0 0.602763
# 3 obsv_3 07-3 air 3.0 0.544883
# 4 obsv_4 07-3 air 3.0 0.423655
mapper = LinearColorMapper(low = DF_tmp["value"].min(), high = DF_tmp["value"].max())
# # Create Heatmap of a single observation and date pair
query_idx = set(DF_tmp.index[DF_tmp["obsv_id"] == "obsv_0"]) & set(DF_tmp.index[DF_tmp["date"] == "08-28"])
# p = HeatMap(data=DF_tmp.loc[query_idx,:], x="loc", y="size", values="value")
p = figure()
p.rect(x="loc", y="size",
       source=ColumnDataSource(DF_tmp.loc[query_idx,:]),
       fill_color={'field': 'value', 'transform': mapper},
       line_color=None)
show(p)
My Error:
# Javascript error adding output!
# TypeError: Cannot read property 'length' of null
# See your browser Javascript console for more details.
You have to provide a palette to LinearColorMapper. For example:
mapper = LinearColorMapper(
    palette='Magma256',
    low=DF_tmp["value"].min(),
    high=DF_tmp["value"].max()
)
From the LinearColorMapper doc:
class LinearColorMapper(palette=None, **kwargs)
Map numbers in a range [low, high] linearly into a sequence of colors (a palette).
Not related to your exception, but you'll also need to pass width and height parameters to p.rect().
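To make the "map numbers in a range [low, high] linearly into a sequence of colors" description concrete, here is a rough sketch of the idea in plain Python. It is not Bokeh's implementation; the function name `linear_color` and the four-color palette are made up for illustration. It also shows why a palette is required: without one there is nothing to index into.

```python
def linear_color(value, palette, low, high):
    """Pick the palette entry whose bin contains value, with bins evenly spaced over [low, high]."""
    if not palette:
        raise ValueError("a non-empty palette is required")
    if high <= low:
        raise ValueError("high must be greater than low")
    frac = (value - low) / (high - low)
    idx = int(frac * len(palette))
    # Clamp so values at or beyond the ends map to the first or last color.
    idx = max(0, min(len(palette) - 1, idx))
    return palette[idx]


# Hypothetical four-step grayscale palette.
palette = ["#000000", "#555555", "#aaaaaa", "#ffffff"]

print(linear_color(0.0, palette, 0.0, 1.0))  # "#000000"
print(linear_color(0.5, palette, 0.0, 1.0))  # "#aaaaaa"
print(linear_color(1.0, palette, 0.0, 1.0))  # "#ffffff"
```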
