How to add predicted values in a dataframe? - jupyter-notebook

I extended the predictions from this link to five values. Now I want to add the five new predicted values (New_Interest_Rate and New_Unemployment_Rate) so I can plot them in a new figure together with the original time series.
import pandas as pd
from sklearn import linear_model
import statsmodels.api as sm
Stock_Market = {'Year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
'Month': [12, 11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
'Interest_Rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
'Unemployment_Rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
'Stock_Index_Price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719]
}
df = pd.DataFrame(Stock_Market,columns=['Year','Month','Interest_Rate','Unemployment_Rate','Stock_Index_Price'])
X = df[['Interest_Rate','Unemployment_Rate']] # here we have 2 variables for multiple regression.
# If you just want one variable for simple linear regression, then use X = df['Interest_Rate'], for example.
# Alternatively, you may add additional variables within the brackets.
Y = df['Stock_Index_Price']
# with sklearn
regr = linear_model.LinearRegression()
regr.fit(X, Y)
print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)
# prediction with sklearn
New_Interest_Rate = [2.75, 3, 4, 1, 2]
New_Unemployment_Rate = [5.3, 4, 3, 2, 1]
for i in range(len(New_Interest_Rate)):
    print(str(i+1) + ' - Predicted Stock Index Price: \n',
          regr.predict([[New_Interest_Rate[i], New_Unemployment_Rate[i]]]))
# with statsmodels
X = sm.add_constant(X) # adding a constant
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
print_model = model.summary()
print(print_model)
I cannot figure out how to append the new values, because when I try, I get an error:
Interest_Rate=Interest_Rate.append(New_Interest_Rate)
TypeError: cannot concatenate object of type "<class 'float'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
My goal is to plot the extended predicted values. I use Jupyter Notebook. The original code comes from this link. Thank you!

Running the code you provided seems to work on my computer, though with some warning messages. The versions I'm using are Python 3.9.7, pandas 1.3.3-1, sklearn-pandas 2.2.0-1, and statsmodels 0.13.0. I just saved it to a file and ran it in a terminal with "python copypastedcode.py". I got this output:
Intercept:
1798.4039776258544
Coefficients:
[ 345.54008701 -250.14657137]
/usr/lib/python3.9/site-packages/sklearn/base.py:441: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
warnings.warn(
1 - Predicted Stock Index Price:
[1422.86238865]
/usr/lib/python3.9/site-packages/sklearn/base.py:441: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
warnings.warn(
2 - Predicted Stock Index Price:
[1834.43795318]
/usr/lib/python3.9/site-packages/sklearn/base.py:441: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
warnings.warn(
3 - Predicted Stock Index Price:
[2430.12461156]
/usr/lib/python3.9/site-packages/sklearn/base.py:441: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
warnings.warn(
4 - Predicted Stock Index Price:
[1643.6509219]
/usr/lib/python3.9/site-packages/sklearn/base.py:441: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
warnings.warn(
5 - Predicted Stock Index Price:
[2239.33758028]
OLS Regression Results
==============================================================================
Dep. Variable: Stock_Index_Price R-squared: 0.898
Model: OLS Adj. R-squared: 0.888
Method: Least Squares F-statistic: 92.07
Date: Wed, 20 Oct 2021 Prob (F-statistic): 4.04e-11
Time: 09:07:19 Log-Likelihood: -134.61
No. Observations: 24 AIC: 275.2
Df Residuals: 21 BIC: 278.8
Df Model: 2
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
const 1798.4040 899.248 2.000 0.059 -71.685 3668.493
Interest_Rate 345.5401 111.367 3.103 0.005 113.940 577.140
Unemployment_Rate -250.1466 117.950 -2.121 0.046 -495.437 -4.856
==============================================================================
Omnibus: 2.691 Durbin-Watson: 0.530
Prob(Omnibus): 0.260 Jarque-Bera (JB): 1.551
Skew: -0.612 Prob(JB): 0.461
Kurtosis: 3.226 Cond. No. 394.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The "X does not have valid feature names..." warnings can be fixed by changing
regr.fit(X,Y)
to
regr.fit(X.values, Y.values)
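Alternatively (my own note, not part of the original answer, and assuming scikit-learn 1.0 or later), you can keep the fitted feature names and avoid the warning by passing the new values to predict() as a DataFrame with matching column names:
new_X = pd.DataFrame({'Interest_Rate': New_Interest_Rate,
                      'Unemployment_Rate': New_Unemployment_Rate})
print(regr.predict(new_X))  # all five predictions at once, without the feature-name warning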
If you want to use New_Interest_Rate and New_Unemployment_Rate to create the regression, then you would need Y to have 5 more corresponding stock prices. I don't think that's what you want to do if you're trying to predict stock prices from interest and unemployment rates. Here's how you would do that, though:
New_Interest_Rate = [2.75, 3, 4, 1, 2]
New_Unemployment_Rate = [5.3, 4, 3, 2, 1]
New_Stock_Prices = [1,2,3,4,5]
X_new = pd.DataFrame(data={'Interest_Rate': New_Interest_Rate,'Unemployment_Rate': New_Unemployment_Rate})
Y_new = pd.DataFrame(data={'Stock_Index_Price': New_Stock_Prices})
regr = linear_model.LinearRegression()
X = X.append(X_new, ignore_index=True)
Y = Y.append(Y_new['Stock_Index_Price'], ignore_index=True)
regr.fit(X.values, Y.values)
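A note of my own (not part of the original answer): DataFrame.append and Series.append were deprecated in pandas 1.4 and removed in pandas 2.0, so on a recent pandas the same append step would be written with pd.concat:
X = pd.concat([X, X_new], ignore_index=True)                       # stack the new predictor rows under the old ones
Y = pd.concat([Y, Y_new['Stock_Index_Price']], ignore_index=True)  # same for the target values
regr.fit(X.values, Y.values)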
And if you want to make plots, you can make a small function to get stock predictions from input arrays with something like this:
def predict_stock_price(future_interest_rate, future_unemployment_rate):
    # .item() pulls out the single predicted value regardless of the output array's shape
    return [regr.predict([[i, j]]).item() for i, j in zip(future_interest_rate, future_unemployment_rate)]
prices = predict_stock_price(New_Interest_Rate, New_Unemployment_Rate)
print("list of predicted stock prices:", prices)
import matplotlib.pyplot as plt  # needed for plt.show() below
predicted_stock_market = {'Month': range(13, 13 + len(prices)),  # just to have a time axis to plot with
                          'Interest_Rate': New_Interest_Rate,
                          'Unemployment_Rate': New_Unemployment_Rate,
                          'Stock_Index_Price': prices}
predicted_df = pd.DataFrame(predicted_stock_market)
predicted_df.plot(x="Month", y="Stock_Index_Price", kind='scatter')
plt.show()
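To get back to the original goal of plotting the predictions together with the original time series, one option (my own sketch, not from the answer above; the Time column is just an illustrative running index) is to put both dataframes on a common time axis and draw them on the same axes:
df_plot = df.sort_values(['Year', 'Month']).reset_index(drop=True)
df_plot['Time'] = range(1, len(df_plot) + 1)                                     # observed months 1-24
pred_plot = predicted_df.copy()
pred_plot['Time'] = range(len(df_plot) + 1, len(df_plot) + 1 + len(pred_plot))   # predicted months 25-29
ax = df_plot.plot(x='Time', y='Stock_Index_Price', label='observed')
pred_plot.plot(x='Time', y='Stock_Index_Price', ax=ax, style='o', label='predicted')
plt.show()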

Related

Calculate partial eta-squared with type 3 sum of square in r

I have run a 2 x 2 x 2 mixed ANOVA using ezANOVA and type 3 sums of squares in R.
The code looks like
ezANOVA(data = D, between = condition, within = c(Notation,Operation), dv = Acc, wid = ID,type=3)
The output does not include the sums of squares, and the effect size reported is the generalized eta squared. I am not sure how to calculate the partial eta squared with type 3 sums of squares in R.
I have tried the aov() function together with the eta_squared() function from the effectsize package, but aov() uses type 1 sums of squares, so the effect size differs from the type 3 effect size.
Thus, I am wondering if there is any way to calculate the partial eta squared for a 3-way mixed ANOVA using type 3 sums of squares in R.
Thank you in advance for your help.
Short of manually calculating partial eta squared, I wasn't able to find a function that worked with ezANOVA. I want to point out that the column labeled ges is the generalized eta squared (not partial, though).
However, I do have a method that will work for using SS 3 from the package jmv, along with both within and between ANOVA, while providing partial eta squared. It's a bit more of a mouthful to put together the function, as well. I added tons of options that you don't have outlined in your function in the question. I did this because this package's help isn't all that helpful. You definitely don't need to use all of these parameters, but at least you'll know what the package is expecting if you do use these options.
Your question isn't reproducible. I started by creating some arbitrary data to work with.
library(jmv)
# some fake data to work with
set.seed(253)
df1 <- data.frame(x = rnorm(200, 50, 3),
y = rnorm(200, 25, 5),
z = rnorm(200, 1.5, .1),
direc = as.factor(rep(c("left","right"), times = 100)))
Next the repeated measures + between ANOVA:
fit = anovaRM(data = df1,
ss = "3", # type of SS (1, 2, or 3)
bs = list("direc"), # between subjects
bsTerms = list("direc"), # between subjects
rm = list(list(label = "tests", # within subjects
levels = c("pretest","mid","posttest"))),
# can use levels(data$factor) if easier
# does not have to be a real variable**
rmCells = list(list(measure = "x", # continuous value
cell = "pretest"), # group label
list(measure = "y", # continuous value
cell = "mid"), # group label
list(measure = "z", # continuous value
cell = "posttest")), # group
rmTerms = list("tests"), # grouping variable/within measures
emMeans = list(list("tests","direc")), # all grouping vars (em tables)
emmPlots = T, # show emm plot
emmTables = T, # show emm tables
effectSize = "partEta", # use partial eta (multi options, see help)
spherTests = T, # use Mauchley test
spherCorr = "GG", # Greenhouse (multi options, see help)
leveneTest = T, # check homogeneity (p > .05 = good**)
qq = T, # plot normality validation qq plot
postHocCorr = "tukey") # use TukeyHSD
This is the type of output you'll see when you call fit (or whatever you name your ANOVA object).
#
# REPEATED MEASURES ANOVA
#
# Within Subjects Effects
# ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# Sphericity Correction Sum of Squares df Mean Square F p η²-p
# ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# tests Greenhouse-Geisser 235493.0127 1.399431 168277.62891 9808.561967 < .0000001 0.9802130
# tests:direc Greenhouse-Geisser 105.1735 1.399431 75.15443 4.380599 0.0247609 0.0216454
# Residual Greenhouse-Geisser 4753.7668 277.087435 17.15620
# ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# Note. Type 3 Sums of Squares
#
#
# Between Subjects Effects
# ──────────────────────────────────────────────────────────────────────────────────────────
# Sum of Squares df Mean Square F p η²-p
# ──────────────────────────────────────────────────────────────────────────────────────────
# direc 22.01519 1 22.01519 1.954001 0.1637204 0.0097723
# Residual 2230.81167 198 11.26673
# ──────────────────────────────────────────────────────────────────────────────────────────
# Note. Type 3 Sums of Squares
#
#
# ASSUMPTIONS
#
# Tests of Sphericity
# ───────────────────────────────────────────────────────────────────────────────
# Mauchly's W p Greenhouse-Geisser ε Huynh-Feldt ε
# ───────────────────────────────────────────────────────────────────────────────
# tests 0.5708482 < .0000001 0.6997157 0.7031690
# ───────────────────────────────────────────────────────────────────────────────
#
#
# Homogeneity of Variances Test (Levene's)
# ───────────────────────────────────────────────
# F df1 df2 p
# ───────────────────────────────────────────────
# x 1.653217e-4 1 198 0.9897542
# y 0.42682247 1 198 0.5143102
# z 0.01824029 1 198 0.8927043
# ───────────────────────────────────────────────
#
#
# ESTIMATED MARGINAL MEANS
#
# TESTS:DIREC
#
# Estimated Marginal Means - tests:direc
# ───────────────────────────────────────────────────────────────────────────
# direc tests Mean SE Lower Upper
# ───────────────────────────────────────────────────────────────────────────
# left pretest 50.224630 0.307314811 49.618600 50.830660
# mid 24.048471 0.508157857 23.046375 25.050567
# posttest 1.499185 0.009470430 1.480509 1.517860
# right pretest 49.818121 0.307314811 49.212091 50.424151
# mid 25.590657 0.508157857 24.588561 26.592753
# posttest 1.512816 0.009470430 1.494140 1.531492
# ───────────────────────────────────────────────────────────────────────────
#
These are the two plots you'll see (or a variation of these: qq and emm).
This is really an amazing package, but it isn't very self-explanatory. If you have any questions, leave a comment.
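One more pointer beyond the answer above (my own suggestion, not part of the original answer, and assuming the data frame D from the question with columns ID, Acc, condition, Notation, and Operation): the afex package fits mixed ANOVAs with type 3 sums of squares by default, and effectsize::eta_squared() can then report partial eta squared directly:
library(afex)
library(effectsize)
# type 3 mixed ANOVA: one between factor, two within factors
fit_afex <- aov_ez(id = "ID", dv = "Acc", data = D,
                   between = "condition", within = c("Notation", "Operation"),
                   type = 3)
eta_squared(fit_afex, partial = TRUE)  # partial eta squared for each effect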

Generate table with side-by-side node models of `partykit::mob()` object

Let's say I fit a model using partykit::mob(). Afterward, I would like to generate a side-by-side table with all the nodes (including the model fitted on the whole sample). Here I attempted to do it using stargazer(), but other ways are more than welcome.
Below is an example along with my attempts to get the table.
library("partykit")
require("mlbench")
## Pima Indians diabetes data
data("PimaIndiansDiabetes", package = "mlbench")
## a simple basic fitting function (of type 1) for a logistic regression
logit <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...) {
glm(y ~ 0 + x, family = binomial, start = start, ...)
}
## set up a logistic regression tree
pid_tree <- mob(diabetes ~ glucose | pregnant + pressure + triceps + insulin +
mass + pedigree + age, data = PimaIndiansDiabetes, fit = logit)
pid_tree
# Model-based recursive partitioning (logit)
#
# Model formula:
# diabetes ~ glucose | pregnant + pressure + triceps + insulin +
# mass + pedigree + age
#
# Fitted party:
# [1] root
# | [2] mass <= 26.3: n = 167
# | x(Intercept) xglucose
# | -9.95150963 0.05870786
# | [3] mass > 26.3
# | | [4] age <= 30: n = 304
# | | x(Intercept) xglucose
# | | -6.70558554 0.04683748
# | | [5] age > 30: n = 297
# | | x(Intercept) xglucose
# | | -2.77095386 0.02353582
#
# Number of inner nodes: 2
# Number of terminal nodes: 3
# Number of parameters per node: 2
# Objective function: 355.4578
1.- Extract summary(pid_tree, node = x) + stargazer().
## I want to replicate this table, extracting the nodes from the partykit object.
library(stargazer)
m.glm<- glm(diabetes ~ glucose, family = binomial,data = PimaIndiansDiabetes)
typeof(m.glm)
## [1] "list"
class(m.glm)
## [1] "glm" "lm"
stargazer(m.glm)
## ommited output.
## Extracting summary from each node
summ_full_data <- summary(pid_tree, node = 1)
summ_node_2 <- summary(pid_tree, node = 2)
summ_node_4 <- summary(pid_tree, node = 4)
summ_node_5 <- summary(pid_tree, node = 5)
## trying to create stargazer table with coefficients
stargazer(m.glm,
summ_node_2,
summ_node_4,
summ_node_5,title="MOB Results")
##Error: $ operator is invalid for atomic vectors
2.- Extract pid_tree[x] + stargazer().
## Second Attempt (extracting modelparty objects instead)
node_2 <- pid_tree[2]
node_4 <- pid_tree[4]
node_5 <- pid_tree[5]
class(node_5)
##[1] "modelparty" "party"
stargazer(m.glm,
node_2,
node_4,
node_5,title="MOB Results")
# % Error: Unrecognized object type.
# % Error: Unrecognized object type.
# % Error: Unrecognized object type.
3.- Not really elegant, I know: Force class to emulate the glm object.
## Force class of object to emulate glm one
class(m.glm)
class(summ_node_2) <- c("glm", "lm")
stargazer(summ_node_2)
##Error in if (p > 0) { : argument is of length zero
A rather pragmatic solution would be to just re-fit the models using the rules found by partykit::mob() and then use stargazer() on them, but surely I am missing something here. Thanks in advance.
It's best to extract (or refit) the list of model objects per node and then apply the table package of choice. Personally, I don't like stargazer much and would much rather use modelsummary instead, or sometimes the good old memisc.
If the tree contains the model $objects in the $info (as for pid_tree) you can use nodeapply() for all nodeids() to extract these:
pid_models <- nodeapply(pid_tree, ids = nodeids(pid_tree), FUN = function(x) x$info$object)
If you just want to extract the fitted models for the terminal nodes (leaves) of the tree, then you can do so by setting ids = nodeids(pid_tree, terminal = TRUE).
Alternatively, especially when the model objects are not stored, you can easily refit them via:
pid_models <- refit.modelparty(pid_tree)
Here, you could also include node = nodeids(pid_tree, terminal = TRUE) to only refit the terminal node models.
In all cases you can subsequently use
msummary(pid_models)
to produce the model summary table. It supports a variety of output formats, and of course you can tweak the list further to change the results, e.g., by changing the models' names.
My bad, it was a small difference that makes it work. Here is a solution; I'm not sure it is the best way, but it does the work.
library(stargazer)
obj_node_full_sample<- pid_tree[1]$node$info$object
obj_node_2<- pid_tree[2]$node$info$object
obj_node_4<- pid_tree[4]$node$info$object
obj_node_5<- pid_tree[5]$node$info$object
stargazer(obj_node_full_sample,
          obj_node_2,
          obj_node_4,
          obj_node_5, title = "Results", align = TRUE)
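A slightly more compact variant of the same idea (my own sketch, reusing the nodeapply() call from the answer above) is to collect the root and terminal-node models in a list and hand them all to stargazer() with do.call():
library(partykit)
library(stargazer)
pid_models <- nodeapply(pid_tree, ids = c(1, nodeids(pid_tree, terminal = TRUE)),
                        FUN = function(x) x$info$object)
do.call(stargazer, c(unname(pid_models), list(title = "MOB Results", align = TRUE)))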

Representing Parametric Survival Model in 'Counting Process' form in JAGS

I'm trying to build a survival model in JAGS that allows for time-varying covariates. I'd like it to be a parametric model — for example, assuming survival follows the Weibull distribution (but I'd like to allow the hazard to vary, so exponential is too simple). So, this is essentially a Bayesian version of what can be done in the flexsurv package, which allows for time-varying covariates in parametric models.
Therefore, I want to be able to enter the data in a 'counting-process' form, where each subject has multiple rows, each corresponding to a time interval in which their covariates remained constant (as described in this pdf or here). This is the (start, stop] formulation that the survival or flexsurv packages allow.
Unfortunately, every explanation of how to perform survival analysis in JAGS seems to assume one row per subject.
I attempted to take this simpler approach and extend it to the counting process format, but the model does not correctly estimate the distribution.
A Failed Attempt:
Here's an example. First we generate some data:
library('dplyr')
library('survival')
## Make the Data: -----
set.seed(3)
n_sub <- 1000
current_date <- 365*2
true_shape <- 2
true_scale <- 365
dat <- data_frame(person = 1:n_sub,
true_duration = rweibull(n = n_sub, shape = true_shape, scale = true_scale),
person_start_time = runif(n_sub, min= 0, max= true_scale*2),
person_censored = (person_start_time + true_duration) > current_date,
person_duration = ifelse(person_censored, current_date - person_start_time, true_duration)
)
person person_start_time person_censored person_duration
(int) (dbl) (lgl) (dbl)
1 1 11.81416 FALSE 487.4553
2 2 114.20900 FALSE 168.7674
3 3 75.34220 FALSE 356.6298
4 4 339.98225 FALSE 385.5119
5 5 389.23357 FALSE 259.9791
6 6 253.71067 FALSE 259.0032
7 7 419.52305 TRUE 310.4770
Then we split the data into 2 observations per subject. I'm just splitting each subject at time = 300 (unless they didn't make it to time = 300, in which case they get just one observation).
## Split into multiple observations per person: --------
cens_point <- 300 # <----- try changing to 0 for no split; if so, model correctly estimates
dat_split <- dat %>%
group_by(person) %>%
do(data_frame(
split = ifelse(.$person_duration > cens_point, cens_point, .$person_duration),
START = c(0, split[1]),
END = c(split[1], .$person_duration),
TINTERVAL = c(split[1], .$person_duration - split[1]),
CENS = c(ifelse(.$person_duration > cens_point, 1, .$person_censored), .$person_censored), # <— edited original post here due to bug; but problem still present when fixing bug
TINTERVAL_CENS = ifelse(CENS, NA, TINTERVAL),
END_CENS = ifelse(CENS, NA, END)
)) %>%
filter(TINTERVAL != 0)
person split START END TINTERVAL CENS TINTERVAL_CENS
(int) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1 1 300.0000 0 300.0000 300.00000 1 NA
2 1 300.0000 300 487.4553 187.45530 0 187.45530
3 2 168.7674 0 168.7674 168.76738 1 NA
4 3 300.0000 0 300.0000 300.00000 1 NA
5 3 300.0000 300 356.6298 56.62979 0 56.62979
6 4 300.0000 0 300.0000 300.00000 1 NA
Now we can set up the JAGS model.
## Set-Up JAGS Model -------
dat_jags <- as.list(dat_split)
dat_jags$N <- length(dat_jags$TINTERVAL)
inits <- replicate(n = 2, simplify = FALSE, expr = {
list(TINTERVAL_CENS = with(dat_jags, ifelse(CENS, TINTERVAL + 1, NA)),
END_CENS = with(dat_jags, ifelse(CENS, END + 1, NA)) )
})
model_string <-
"
model {
# set priors on reparameterized version, as suggested
# here: https://sourceforge.net/p/mcmc-jags/discussion/610036/thread/d5249e71/?limit=25#8c3b
log_a ~ dnorm(0, .001)
log(a) <- log_a
log_b ~ dnorm(0, .001)
log(b) <- log_b
nu <- a
lambda <- (1/b)^a
for (i in 1:N) {
# Estimate Subject-Durations:
CENS[i] ~ dinterval(TINTERVAL_CENS[i], TINTERVAL[i])
TINTERVAL_CENS[i] ~ dweibull( nu, lambda )
}
}
"
library('runjags')
param_monitors <- c('a', 'b', 'nu', 'lambda')
fit_jags <- run.jags(model = model_string,
burnin = 1000, sample = 1000,
monitor = param_monitors,
n.chains = 2, data = dat_jags, inits = inits)
# estimates:
fit_jags
# actual:
c(a=true_shape, b=true_scale)
Depending on where the split point is, the model estimates very different parameters for the underlying distribution. It only gets the parameters right if the data isn't split into the counting process form. It seems like this is not the way to format the data for this kind of problem.
If I am missing an assumption and my problem is less related to JAGS and more related to how I'm formulating the problem, suggestions are very welcome. I might be despairing that time-varying covariates can't be used in parametric survival models (and can only be used in models like the Cox model, which assumes proportional hazards and doesn't actually estimate the underlying distribution). However, as I mentioned above, the flexsurv package's flexsurvreg() function in R does accommodate the (start, stop] formulation in parametric models.
If anyone knows how to build a model like this in another language (e.g. STAN instead of JAGS) that would be appreciated too.
Edit:
Chris Jackson provides some helpful advice via email:
I think the T() construct for truncation in JAGS is needed here. Essentially for each period (t[i], t[i+1]) where a person is alive but the covariate is constant, the survival time is left-truncated at the start of the period, and possibly also right-censored at the end. So you'd write something like y[i] ~ dweib(shape, scale[i])T(t[i], )
I tried implementing this suggestion as follows:
model {
# same as before
log_a ~ dnorm(0, .01)
log(a) <- log_a
log_b ~ dnorm(0, .01)
log(b) <- log_b
nu <- a
lambda <- (1/b)^a
for (i in 1:N) {
# modified to include left-truncation
CENS[i] ~ dinterval(END_CENS[i], END[i])
END_CENS[i] ~ dweibull( nu, lambda )T(START[i],)
}
}
Unfortunately this doesn't quite do the trick. With the old code, the model was mostly getting the scale parameter right, but doing a very bad job on the shape parameter. With this new code, it gets very close to the correct shape parameter, but consistently over-estimates the scale parameter. I have noticed that the degree of over-estimation is correlated with how late the split point comes. If the split-point is early (cens_point = 50), there's not really any over-estimation; if it's late (cens_point = 350), there is a lot.
I thought maybe the problem could be related to 'double-counting' the observations: if we see a censored observation at t = 300 and then, from that same person, an uncensored observation at t = 400, it seems intuitive to me that this person is contributing two data points to our inference about the Weibull parameters when really they should just be contributing one. I therefore tried incorporating a random effect for each person; however, this completely failed, with huge estimates (in the 50-90 range) for the nu parameter. I'm not sure why that is, but perhaps that's a question for a separate post. Since I'm not sure whether the problems are related, you can find the code for this whole post, including the JAGS code for that model, here.
You can use the rstanarm package, which is a wrapper around Stan. It allows you to use standard R formula notation to describe survival models. The stan_surv function accepts arguments in "counting process" form, and different baseline hazard functions, including the Weibull, can be used to fit the model.
The survival part of rstanarm (the stan_surv function) is not yet available on CRAN, so you should install the package directly from mc-stan.org:
install.packages("rstanarm", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
Please see the code below:
library(dplyr)
library(survival)
library(rstanarm)
## Make the Data: -----
set.seed(3)
n_sub <- 1000
current_date <- 365*2
true_shape <- 2
true_scale <- 365
dat <- data_frame(person = 1:n_sub,
true_duration = rweibull(n = n_sub, shape = true_shape, scale = true_scale),
person_start_time = runif(n_sub, min= 0, max= true_scale*2),
person_censored = (person_start_time + true_duration) > current_date,
person_duration = ifelse(person_censored, current_date - person_start_time, true_duration)
)
## Split into multiple observations per person: --------
cens_point <- 300 # <----- try changing to 0 for no split; if so, model correctly estimates
dat_split <- dat %>%
group_by(person) %>%
do(data_frame(
split = ifelse(.$person_duration > cens_point, cens_point, .$person_duration),
START = c(0, split[1]),
END = c(split[1], .$person_duration),
TINTERVAL = c(split[1], .$person_duration - split[1]),
CENS = c(ifelse(.$person_duration > cens_point, 1, .$person_censored), .$person_censored), # <— edited original post here due to bug; but problem still present when fixing bug
TINTERVAL_CENS = ifelse(CENS, NA, TINTERVAL),
END_CENS = ifelse(CENS, NA, END)
)) %>%
filter(TINTERVAL != 0)
dat_split$CENS <- as.integer(!(dat_split$CENS))
# Fit STAN survival model
mod_tvc <- stan_surv(
formula = Surv(START, END, CENS) ~ 1,
data = dat_split,
iter = 1000,
chains = 2,
basehaz = "weibull-aft")
# Print fit coefficients
mod_tvc$coefficients[2]
unname(exp(mod_tvc$coefficients[1]))
Output, which is consistent with true values (true_shape <- 2; true_scale <- 365):
> mod_tvc$coefficients[2]
weibull-shape
1.943157
> unname(exp(mod_tvc$coefficients[1]))
[1] 360.6058
You can also look at STAN source using rstan::get_stanmodel(mod_tvc$stanfit) to compare STAN code with the attempts you made in JAGS.

How to retrieve correlation matrix from gls models in R

I am using the gls function from the nlme package. You can copy and paste the following code to reproduce my analysis.
library(nlme) # Needed for gls function
# Read in wide format
tlc = read.table("http://www.hsph.harvard.edu/fitzmaur/ala2e/tlc.dat",header=FALSE)
names(tlc) = c("id","trt","y0","y1","y4","y6")
tlc$trt = factor(tlc$trt, levels=c("P","A"), labels=c("Placebo","Succimer"))
# Convert to long format
tlc.long = reshape(tlc, idvar="id", varying=c("y0","y1","y4","y6"), v.names="y", timevar="time", direction="long")
# Create week numerical variable
tlc.long$week = tlc.long$time-1
tlc.long$week[tlc.long$week==2] = 4
tlc.long$week[tlc.long$week==3] = 6
tlc.long$week.f = factor(tlc.long$week, levels=c(0,1,4,6))
The real analysis starts from here:
# Including group main effect assuming unstructured covariance:
mod1 = gls(y ~ trt*week.f, corr = corSymm(form = ~ time | id),
           weights = varIdent(form = ~ 1 | time), method = "REML", data = tlc.long)
summary(mod1)
In summary(mod1), the following parts of the results are the ones I would love to retrieve.
Correlation Structure: General
Formula: ~time | id
Parameter estimate(s):
Correlation:
1 2 3
2 0.571
3 0.570 0.775
4 0.577 0.582 0.581
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | time
Parameter estimates:
1 2 3 4
1.000000 1.325880 1.370442 1.524813
The closest I can get is to use the following method.
temp = mod1$modelStruct$varStruct
Variance function structure of class varIdent representing
1 2 3 4
1.000000 1.325880 1.370442 1.524813
However, whatever is stored in temp, I cannot get the numbers out. I tried as.numeric(temp) and unclass(temp), but neither works. There seems to be no way to just get those values as a clean numeric vector.
Thanks in advance!
When you run mod1$modelStruct$varStruct in the R console, R first inspects its class
> class(mod1$modelStruct$varStruct)
[1] "varIdent" "varFunc"
and then dispatches the corresponding print method. In this case it is nlme:::print.varFunc, i.e., the actual command being run is nlme:::print.varFunc(mod1$modelStruct$varStruct).
If you run nlme:::print.varFunc, you can see its function body:
function (x, ...)
{
    if (length(aux <- coef(x, uncons = FALSE, allCoef = TRUE)) >
        0) {
        cat("Variance function structure of class", class(x)[1],
            "representing\n")
        print(aux, ...)
    }
    else {
        cat("Variance function structure of class", class(x)[1],
            "with no parameters, or uninitialized\n")
    }
    invisible(x)
}
<bytecode: 0x7ff4bf688df0>
<environment: namespace:nlme>
What it does is evaluate the coefficients and print them, while the unevaluated x is returned invisibly.
Therefore, in order to get the correlation/variance parameters, you need:
coef(mod1$modelStruct$corStruct, uncons = FALSE, allCoef = TRUE)
coef(mod1$modelStruct$varStruct, uncons = FALSE, allCoef = TRUE)
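If what you are after is the full correlation matrix rather than the raw parameter vector, two options worth trying (my own note, assuming the mod1 fit from above) are nlme's corMatrix() method on the initialized corStruct, or cov2cor() applied to the marginal covariance matrix from getVarCov():
corMatrix(mod1$modelStruct$corStruct)[[1]]  # 4 x 4 within-subject correlation matrix (one per id; identical across subjects here)
cov2cor(getVarCov(mod1))                    # same correlations via the marginal variance-covariance matrix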

degrees of freedom, t-statistic, and f-values of combined multiply imputed data

I am a novice R user. I installed Zelig version 4.1-3 and Amelia II version 1.7. I am puzzled about how to obtain the degrees of freedom, t-statistics, and F-values of combined multiply imputed data using R packages and functions.
First, I loaded Amelia and Zelig:
require(Amelia)
require(Zelig)
Then, I loaded the sample data that came with Amelia:
data(freetrade)
I created 5 imputations for this dataset using the amelia function.
a.out <- amelia(freetrade, m = 5, ts = "year", cs = "country")
Then, to combine imputations, I used the zelig function:
z.out.imp <- zelig(tariff ~ polity + pop + gdp.pc + year + country,
data = a.out$imputations, model = "ls" )
However, when I used this code, I got coefficients that appeared to be those of individual imputations, and not those of the combined set:
summary(z.out.imp)
They were as follows:
Coefficients:
Value Std. Error t-stat p-value
(Intercept) 2.766176e+03 6.670110e+02 4.1471215 0.0003572868
polity 1.645011e-01 3.078134e-01 0.5344183 0.5938286336
pop -6.079963e-08 6.518429e-08 -0.9327345 0.3774275934
gdp.pc -4.246794e-04 1.945866e-03 -0.2182470 0.8319093062
year -1.335563e+00 3.519513e-01 -3.7947390 0.0009787456
countryIndonesia -7.000319e+01 4.646330e+01 -1.5066343 0.1700377061
countryKorea -8.643855e+01 4.671629e+01 -1.8502870 0.0926657863
countryMalaysia -8.815182e+01 5.389486e+01 -1.6356256 0.1393312364
countryNepal -8.215250e+01 5.475828e+01 -1.5002753 0.1702129176
countryPakistan -4.349869e+01 5.149729e+01 -0.8446791 0.4238033944
countryPhilippines -8.088975e+01 5.320694e+01 -1.5202857 0.1673234716
countrySriLanka -7.668840e+01 5.695485e+01 -1.3464771 0.2161986616
countryThailand -7.400481e+01 5.186395e+01 -1.4269026 0.1903428838
The Amelia manual shows what some of the coefficients for the combined multiply imputed dataset should be, although there is no explanation of how to obtain all of them using R (see page 46 of http://cran.r-project.org/web/packages/Amelia/vignettes/amelia.pdf):
Complete DF = 167
DF: min = 10.36
avg = 18.81
max = 37.62
F( 2, 10.4) = 15.50
Prob > F = 0.0008
Value Std. Error t-stat p-value
polity -0.206 0.39 -0.53 0.61
pop -3.21 e-08 8.72e-09 3.68 0.004
gdp.pc -0.0027 0.000644 -4.28 0.000
Intercept 32.7 2.66 12.29 0.000
Because the amelia function uses Monte Carlo simulations, we can expect small differences between runs. However, the huge difference in the intercept was a clue that the zelig function returned regression statistics for something other than the combined set.
The Amelia manual provides this code:
> b.out <-NULL
> se.out <-NULL
> for(i in 1:a.out$m){
+ ols.out <- lm(tariff ~ polity + pop + gdp.pc, data = a.out$imputations[[i]])
+ b.out <- rbind(b.out, ols.out$coef)
+ se.out <-rbind(se.out, coef(summary(ols.out))[,2])
+ }
> combined.results<-mi.meld(q=b.out, se = se.out)
> combined.results
I tried using it. The returned results are very close to the values and standard errors shown on page 46:
$q.mi
(Intercept) polity pop gdp.pc
[1,] 33.17325 -0.1499587 2.967196e-08 -0.002724229
$se.mi
(Intercept) polity pop gdp.pc
[1,] 2.116721 0.276968 6.061993e-09 0.0006596203
However, they do not include the t-statistic, degrees of freedom, or f-values.
Are there open-source packages or functions available in R so that I can obtain the degrees of freedom, t-statistic, and f-values without having to do manual calculations?
Thanks.
Here is an annotated and edited transcript of my attempt to reproduce your problem:
> library(Zelig)
ZELIG (Versions 4.1-3, built: 2013-01-30)
> a.out <- amelia(freetrade, idvars="country", m = 5)
Error: could not find function "amelia"
The first problem I had is that you did not mention that we need to load the Amelia package. After correcting that I tried again to run the first line:
> library(Amelia)
## (Version 1.7, built: 2013-02-10)
> a.out <- amelia(freetrade, idvars="country", m = 5)
Error in amelia(freetrade, idvars = "country", m = 5) :
object 'freetrade' not found
This fails because you did not say how to get the freetrade data. Guessing here:
> data(freetrade)
> a.out <- amelia(freetrade, m = 5)
Amelia Error Code: 38
The following variable(s) are characters:
country
You may have wanted to set this as a ID variable to remove it
from the imputation model or as an ordinal or nominal
variable to be imputed. Please set it as either and
try again.
Your example does not work because the amelia function needs to be told what to do with character variables. So I modified your example in order to get one that would actually run:
> a.out <- amelia(freetrade, idvars="country", m = 5)
> z.out.imp <- zelig(tariff ~ polity + pop + gdp.pc + year + country,
+ data = a.out$imputations, model = "ls")
Running summary on the result gives me the combined model statistics:
# This part works just fine.
> summary(z.out.imp)
Model: ls
Number of multiply imputed data sets: 5
Combined results:
Call:
lm(formula = formula, weights = weights, model = F, data = data)
Coefficients:
Value Std. Error t-stat p-value
(Intercept) 3.294715e+03 6.425487e+02 5.1275725 1.330807e-05
polity 2.761343e-01 3.354271e-01 0.8232319 4.145813e-01
pop -6.443769e-08 5.526885e-08 -1.1658953 2.659143e-01
gdp.pc 4.549885e-04 1.354139e-03 0.3359984 7.382138e-01
year -1.599422e+00 3.306932e-01 -4.8365739 2.649602e-05
countryIndonesia -7.396526e+01 4.112206e+01 -1.7986760 1.009329e-01
countryKorea -9.673542e+01 5.036909e+01 -1.9205317 8.713903e-02
countryMalaysia -9.271187e+01 4.998836e+01 -1.8546690 9.424041e-02
countryNepal -8.863525e+01 4.920061e+01 -1.8015072 9.990792e-02
countryPakistan -4.789370e+01 4.362907e+01 -1.0977475 2.960914e-01
countryPhilippines -8.548672e+01 4.662372e+01 -1.8335456 9.533829e-02
countrySriLanka -8.446560e+01 4.939918e+01 -1.7098586 1.170170e-01
countryThailand -8.026702e+01 4.741244e+01 -1.6929529 1.213329e-01
For combined results from datasets i to j, use summary(x, subset = i:j).
For separate results, use print(summary(x), subset = i:j).
In short, the only thing in your example that works for me is the one thing you claim does not work for you. Please post the code and output showing exactly what you did and exactly what happened, because at the moment I don't have enough information to help solve the problem.
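As a side note beyond the reproduction above (my own addition, not part of the original answer): if you do end up with the b.out and se.out matrices from the manual loop in the question, the pooled t-statistics and degrees of freedom can be computed with Rubin's (1987) rules, which is also what packages such as mice do internally via pool():
m <- nrow(b.out)                                 # number of imputations
qbar <- colMeans(b.out)                          # pooled coefficients
W <- colMeans(se.out^2)                          # within-imputation variance
B <- apply(b.out, 2, var)                        # between-imputation variance
T_total <- W + (1 + 1/m) * B                     # total variance (Rubin's rules)
t_stat <- qbar / sqrt(T_total)
df <- (m - 1) * (1 + W / ((1 + 1/m) * B))^2      # Rubin's degrees of freedom
p_val <- 2 * pt(-abs(t_stat), df)
round(cbind(estimate = qbar, se = sqrt(T_total), t = t_stat, df = df, p = p_val), 4)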
