I am currently trying to evaluate a tslm model using time series cross-validation. I want to use a fixed model (without parameter re-estimation) and look at the 1- to 3-step-ahead horizon forecasts for the evaluation period of the last year.
I am having trouble getting tsCV and tslm from the forecast library to work well together. What am I missing?
library(forecast)
library(ggfortify)
AirPassengers_train <- head(AirPassengers, 100)
AirPassengers_test <- tail(AirPassengers, 44)
## Holdout Evaluation
n_train <- length(AirPassengers_train)
n_test <- length(AirPassengers_test)
pred_train <- ts(rnorm(n_train))
pred_test <- ts(rnorm(n_test))
fit <- tslm(AirPassengers_train ~ trend + pred_train)
forecast(fit, newdata = data.frame(pred_train = pred_test)) %>%
accuracy(AirPassengers_test)
#> ME RMSE MAE MPE MAPE MASE
#> Training set 1.135819e-15 30.03715 23.41818 -1.304311 10.89785 0.798141
#> Test set 3.681350e+01 76.39219 55.35298 6.513998 11.96379 1.886546
#> ACF1 Theil's U
#> Training set 0.6997632 NA
#> Test set 0.7287923 1.412804
## tsCV Evaluation
fc_reg <- function(x) forecast(x, newdata = data.frame(pred_train = pred_test),
h = h, model = fit)
tsCV(AirPassengers_test, fc_reg, h = 1)
#> Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
#> 1957 NA NA NA NA NA NA NA NA
#> 1958 NA NA NA NA NA NA NA NA NA NA NA NA
#> 1959 NA NA NA NA NA NA NA NA NA NA NA NA
#> 1960 NA NA NA NA NA NA NA NA NA NA NA NA
forecast(AirPassengers_test, newdata = data.frame(pred_train = pred_test),
h = 1, model = fit)
#> Error in forecast.ts(AirPassengers_test, newdata = data.frame(pred_train = pred_test),
#> : Unknown model class
I have a feeling that https://gist.github.com/robjhyndman/d9eb5568a78dbc79f7acc49e22553e96 is relevant. How would I apply it to the scenario above?
For time series cross-validation, you should be fitting a separate model to every training set, not passing an existing model. With predictor variables, the function needs to be able to grab the relevant elements when fitting each model, and other elements when producing forecasts.
The following will work.
fc <- function(y, h, xreg)
{
  if(NROW(xreg) < length(y) + h)
    stop("Not enough xreg data for forecasting")
  X <- xreg[seq_along(y), ]        # predictor rows matching the training data
  fit <- tslm(y ~ X)               # refit the model on each training set
  X <- xreg[length(y) + seq(h), ]  # predictor rows for the forecast horizon
  forecast(fit, newdata = X)
}
# Predictors of the same length as the data
# and with the same time series characteristics.
pred <- ts(rnorm(length(AirPassengers)), start=start(AirPassengers),
frequency=frequency(AirPassengers))
# Now pass the whole time series and the corresponding predictors
tsCV(AirPassengers, fc, xreg=pred)
If you have more than one predictor variable, then xreg should be a matrix.
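For instance, a minimal sketch of the multi-predictor case (my example, not from the original answer; pred2 and its column names are made up), which also gives per-horizon accuracy for the 1- to 3-step forecasts asked about in the question:
# Two made-up predictors bound into a matrix so fc() can subset rows
pred2 <- matrix(rnorm(2 * length(AirPassengers)), ncol = 2,
                dimnames = list(NULL, c("p1", "p2")))
e <- tsCV(AirPassengers, fc, xreg = pred2, h = 3)
sqrt(colMeans(e^2, na.rm = TRUE))  # RMSE for each horizon h = 1, 2, 3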
I came here to post my ugly workaround to the same problem (and possibly find out what is wrong with it):
myxreg <- regmat[, c("xvar1", "xvar2")]
flm_xreg <- function(x, h, xreg, newxreg) {
  forecast(Arima(x, order = c(0, 0, 0), xreg = xreg), xreg = newxreg)
}
e <- tsCV(regmat[, "yvar"], flm_xreg, h = 14, xreg = myxreg)
I ended up using a function to forecast a trend. I'm not sure whether this is correctly specified, but the RMSE looks about right.
flm <- function(y, h) { forecast(tslm(y ~ trend, lambda=0), h=h) }
e <- tsCV(tsDF, flm, h=6)
sqrt(mean(e^2, na.rm=TRUE))
I need to calculate the correlation of some specific variables (columns). To compute the correlations for those specific columns, I use this code:
df<-read.csv("http://renatabrandt.github.io/EBC2015/data/varechem.csv", row.names=1)
cor_df<-(cor(df, method="spearman")[1:6, 7:14])%>%as.data.frame()
However, I would like R to create a new matrix, restricted to the set [1:6, 7:14], that keeps only the significant correlations (p-value < 0.05) and excludes the non-significant ones (p-value > 0.05). The non-significant entries could be deleted, filled in with NA, or returned as a new data.frame containing just the significant values.
Please find below one possible solution using the Hmisc, corrplot, and dplyr libraries.
Reprex
Compute the correlation coefficients and corresponding p-values using the rcorr() function from the Hmisc library:
library(Hmisc)
library(corrplot)
library(dplyr)
coeffs <- rcorr(as.matrix(df), type="spearman")[[1]][1:6, 7:14]
coeffs
#> Al Fe Mn Zn Mo Baresoil
#> N -0.151805133 -0.1295934 -0.01261144 -0.07526648 0.004643575 0.15481627
#> P -0.001739509 -0.1200000 0.60782609 0.73423234 0.035371924 0.03043478
#> K 0.006089604 -0.1156773 0.67579910 0.74244074 -0.039359822 0.18264841
#> Ca -0.289628187 -0.3982609 0.63130435 0.68638545 -0.175533171 0.27739130
#> Mg -0.187866932 -0.2382609 0.57043478 0.60069601 -0.118938093 0.29739130
#> S 0.320574163 0.1117634 0.51402480 0.77789865 0.334337367 0.07784301
#> Humdepth pH
#> N 0.1307120 -0.07186484
#> P 0.2102302 -0.12114884
#> K 0.2963972 -0.31001388
#> Ca 0.4396914 -0.25114066
#> Mg 0.4912655 -0.33161178
#> S 0.1698382 -0.21448892
pvalues <- rcorr(as.matrix(df), type="spearman")[[3]][1:6, 7:14]
pvalues
#> Al Fe Mn Zn Mo Baresoil
#> N 0.4788771 0.54615126 0.9533606683 7.266830e-01 0.9828194 0.4700940
#> P 0.9935636 0.57648987 0.0016290786 4.418653e-05 0.8696630 0.8877339
#> K 0.9774704 0.59039698 0.0002896520 3.264276e-05 0.8551122 0.3929703
#> Ca 0.1698232 0.05391473 0.0009388912 2.126270e-04 0.4119734 0.1894124
#> Mg 0.3793530 0.26221751 0.0036070461 1.909894e-03 0.5798929 0.1581543
#> S 0.1266908 0.60311127 0.0101838168 7.669395e-06 0.1103062 0.7176938
#> Humdepth pH
#> N 0.54266218 0.7386046
#> P 0.32412825 0.5728181
#> K 0.15961613 0.1404062
#> Ca 0.03156073 0.2365150
#> Mg 0.01477451 0.1134202
#> S 0.42754109 0.3141949
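As an aside (not part of the original answer), if you only need the NA-masked matrix rather than the plot, you can mask the coefficients directly with the p-values computed above:
masked <- coeffs
masked[pvalues >= 0.05] <- NA  # blank out the non-significant correlations
masked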
Visualization using the corrplot() function
r <- corrplot(coeffs,
              method = "number",
              p.mat = pvalues,
              sig.level = 0.05,  # display corr. coeff. only for p < 0.05
              insig = "blank",   # otherwise leave the cell blank
              tl.srt = 0,        # controls the orientation of the text labels
              tl.offset = 1)     # controls the offset of the text labels
Use the results of the corrplot() function to build a more traditional matrix of results:
# Keep only the correlation coefficients for pvalues < 0.05
ResultsMatrix <- r$corrPos %>%
mutate(corr = ifelse(p.value < 0.05, corr, NA))
# Set factors to control the order of rows and columns in the final cross-table
ResultsMatrix$xName <- factor(ResultsMatrix$xName,
levels = c("Al", "Fe", "Mn", "Zn", "Mo", "Baresoil", "Humdepth", "pH"))
ResultsMatrix$yName <- factor(ResultsMatrix$yName,
levels = c("N", "P", "K", "Ca", "Mg", "S"))
# Build the cross-table and get a dataframe as final result
xtabs(corr ~ yName + xName,
data = ResultsMatrix,
sparse = TRUE,
addNA = TRUE) %>%
as.matrix() %>%
as.data.frame()
Output
#> Al Fe Mn Zn Mo Baresoil Humdepth pH
#> N NA NA NA NA NA NA NA NA
#> P NA NA 0.6078261 0.7342323 NA NA NA NA
#> K NA NA 0.6757991 0.7424407 NA NA NA NA
#> Ca NA NA 0.6313043 0.6863854 NA NA 0.4396914 NA
#> Mg NA NA 0.5704348 0.6006960 NA NA 0.4912655 NA
#> S NA NA 0.5140248 0.7778986 NA NA NA NA
Created on 2021-12-21 by the reprex package (v2.0.1)
I am trying to run a zero-inflated negative binomial count model on data containing the number of campaign visits by a politician by county. (Log-likelihood tests indicate negative binomial is correct; a Vuong test suggests zero-inflated, though that could be thrown off by the fact that my zero-inflated model is clearly not converging.) I am using the pscl package in R. The problem is that when I run my model, the summary looks like this:
Call:
zeroinfl(formula = Sanders_Adjacent_Clinton_Visit ~ Relative_Divisiveness + Obama_General_Percent_12 +
Percent_Over_65 + Percent_F + Percent_White + Percent_HS + Per_Capita_Income +
Poverty_Rate + MRP_Ideology_Mean + Swing_State, data = Unity_Data, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-0.96406 -0.24339 -0.11744 -0.03183 16.21356
Count model coefficients (negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.216e+01 NA NA NA
Relative_Divisiveness -3.831e-01 NA NA NA
Obama_General_Percent_12 1.904e+00 NA NA NA
Percent_Over_65 -4.848e-02 NA NA NA
Percent_F 1.737e-01 NA NA NA
Percent_White 2.980e+00 NA NA NA
Percent_HS -3.563e-02 NA NA NA
Per_Capita_Income 7.413e-05 NA NA NA
Poverty_Rate -2.273e-02 NA NA NA
MRP_Ideology_Mean -8.316e-01 NA NA NA
Swing_State 1.580e+00 NA NA NA
Log(theta) 9.595e+00 NA NA NA
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.024e+02 NA NA NA
Relative_Divisiveness -3.265e+00 NA NA NA
Obama_General_Percent_12 -2.300e+01 NA NA NA
Percent_Over_65 -7.768e-02 NA NA NA
Percent_F 2.873e+00 NA NA NA
Percent_White 5.156e+00 NA NA NA
Percent_HS -5.097e-01 NA NA NA
Per_Capita_Income 2.831e-04 NA NA NA
Poverty_Rate 1.391e-02 NA NA NA
MRP_Ideology_Mean -2.569e+00 NA NA NA
Swing_State 5.075e-01 NA NA NA
Theta = 14696.9932
Number of iterations in BFGS optimization: 94
Log-likelihood: -596.5 on 23 Df
Obviously, all of those NAs are less than helpful to me. Any advice would be greatly appreciated! I'm pretty novice at R, Stack Overflow, and statistics, but I am trying to learn. I'm trying to provide everything needed for a minimal reproducible example, but I don't see anywhere to share my actual data... so if that's something you need in order to answer the question, let me know where I can put it!
I read that the R flexsurv package can also be used for modeling time-dependent covariates, according to Christopher Jackson (2016) ["flexsurv: a platform for parametric survival modeling in R", Journal of Statistical Software, 70(1)].
However, I was not able to figure out how, even after several adjustments and searches in online forums.
Before turning to the estimation of time-dependent covariates I tried to create a simple model with only time-independent covariates to test whether I specified the Surv object correctly. Here is a small example.
library(splitstackshape)
library(flexsurv)
## create sample data
n=50
set.seed(2)
t <- rpois(n,15)+1
x <- rnorm(n,t,5)
df <- data.frame(t,x)
df$id <- 1:n
df$rep <- df$t-1
Which looks like this:
t x id rep
1 12 17.696149 1 11
2 12 20.358094 2 11
3 11 2.058789 3 10
4 16 26.156213 4 15
5 13 9.484278 5 12
6 15 15.790824 6 14
...
And the long data:
long.df <- expandRows(df, "rep")
rep.vec<-c()
for(i in 1:n){
rep.vec <- c(rep.vec,1:(df[i,"t"]-1))
}
long.df$start <- rep.vec
long.df$stop <- rep.vec +1
long.df$censrec <- 0
long.df$censrec<-ifelse(long.df$stop==long.df$t,1,long.df$censrec)
Which looks like this:
t x id start stop censrec
1 12 17.69615 1 1 2 0
1.1 12 17.69615 1 2 3 0
1.2 12 17.69615 1 3 4 0
1.3 12 17.69615 1 4 5 0
1.4 12 17.69615 1 5 6 0
1.5 12 17.69615 1 6 7 0
1.6 12 17.69615 1 7 8 0
1.7 12 17.69615 1 8 9 0
1.8 12 17.69615 1 9 10 0
1.9 12 17.69615 1 10 11 0
1.10 12 17.69615 1 11 12 1
2 12 20.35809 2 1 2 0
...
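(As an aside, not part of the original question: a similar long format can be built with survival::survSplit; a rough sketch, assuming an all-events outcome for this toy data. Note that survSplit also produces the initial (0,1] interval, which the manual construction above skips.)
df$event <- 1  # in this toy data every subject fails at time t
long2 <- survSplit(Surv(t, event) ~ ., data = df, cut = 1:(max(df$t) - 1),
                   start = "start", end = "stop", event = "censrec")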
Now I can estimate a simple Cox model to see whether it works:
coxph(Surv(t)~x,data=df)
This yields:
coef exp(coef) se(coef) z p
x -0.0588 0.9429 0.0260 -2.26 0.024
And in the long format:
coxph(Surv(start,stop,censrec)~x,data=long.df)
I get:
coef exp(coef) se(coef) z p
x -0.0588 0.9429 0.0260 -2.26 0.024
Taken together I conclude that my transformation into the long format was correct. Now, turning to the flexsurv framework:
flexsurvreg(Surv(time=t)~x,data=df, dist="weibull")
yields:
Estimates:
data mean est L95% U95% se exp(est) L95% U95%
shape NA 5.00086 4.05569 6.16631 0.53452 NA NA NA
scale NA 13.17215 11.27876 15.38338 1.04293 NA NA NA
x 15.13380 0.01522 0.00567 0.02477 0.00487 1.01534 1.00569 1.02508
But
flexsurvreg(Surv(start,stop,censrec) ~ x ,data=long.df, dist="weibull")
causes an error:
Error in flexsurvreg(Surv(start, stop, censrec) ~ x, data = long.df, dist = "weibull") :
Initial value for parameter 1 out of range
Would anyone happen to know the correct syntax for the latter Surv object? If you use the correct syntax, do you get the same estimates?
Thank you very much,
best,
David
===============
EDIT AFTER FEEDBACK FROM 42
===============
library(splitstackshape)
library(flexsurv)
x<-c(8.136527, 7.626712, 9.809122, 12.125973, 12.031536, 11.238394, 4.208863, 8.809854, 9.723636)
t<-c(2, 3, 13, 5, 7, 37 ,37, 9, 4)
df <- data.frame(t,x)
#transform into long format for time-dependent covariates
df$id <- 1:length(df$t)
df$rep <- df$t-1
long.df <- expandRows(df, "rep")
rep.vec<-c()
for(i in 1:length(df$t)){
rep.vec <- c(rep.vec,1:(df[i,"t"]-1))
}
long.df$start <- rep.vec
long.df$stop <- rep.vec +1
long.df$censrec <- 0
long.df$censrec<-ifelse(long.df$stop==long.df$t,1,long.df$censrec)
coxph(Surv(t)~x,data=df)
coxph(Surv(start,stop,censrec)~x,data=long.df)
flexsurvreg(Surv(time=t)~x,data=df, dist="weibull")
flexsurvreg(Surv(start,stop,censrec) ~ x ,data=long.df, dist="weibull",inits=c(shape=.1, scale=1))
This yields the same estimates for both coxph models, but
Call:
flexsurvreg(formula = Surv(time = t) ~ x, data = df, dist = "weibull")
Estimates:
data mean est L95% U95% se exp(est) L95% U95%
shape NA 1.0783 0.6608 1.7594 0.2694 NA NA NA
scale NA 27.7731 3.5548 216.9901 29.1309 NA NA NA
x 9.3012 -0.0813 -0.2922 0.1295 0.1076 0.9219 0.7466 1.1383
N = 9, Events: 9, Censored: 0
Total time at risk: 117
Log-likelihood = -31.77307, df = 3
AIC = 69.54614
and
Call:
flexsurvreg(formula = Surv(start, stop, censrec) ~ x, data = long.df,
dist = "weibull", inits = c(shape = 0.1, scale = 1))
Estimates:
data mean est L95% U95% se exp(est) L95% U95%
shape NA 0.8660 0.4054 1.8498 0.3353 NA NA NA
scale NA 24.0596 1.7628 328.3853 32.0840 NA NA NA
x 8.4958 -0.0912 -0.3563 0.1739 0.1353 0.9128 0.7003 1.1899
N = 108, Events: 9, Censored: 99
Total time at risk: 108
Log-likelihood = -30.97986, df = 3
AIC = 67.95973
Reading the error message:
Error in flexsurvreg(Surv(start, stop, censrec) ~ x, data = long.df, dist = "weibull", :
initial values must be a numeric vector
And then, reading the help page ?flexsurvreg, it seemed worth attempting to set inits to a named numeric vector:
flexsurvreg(Surv(start,stop,censrec) ~ x ,data=long.df, dist="weibull", inits=c(shape=.1, scale=1))
Call:
flexsurvreg(formula = Surv(start, stop, censrec) ~ x, data = long.df,
dist = "weibull", inits = c(shape = 0.1, scale = 1))
Estimates:
data mean est L95% U95% se exp(est) L95% U95%
shape NA 5.00082 4.05560 6.16633 0.53454 NA NA NA
scale NA 13.17213 11.27871 15.38341 1.04294 NA NA NA
x 15.66145 0.01522 0.00567 0.02477 0.00487 1.01534 1.00569 1.02508
N = 715, Events: 50, Censored: 665
Total time at risk: 715
Log-likelihood = -131.5721, df = 3
AIC = 269.1443
Extremely similar results. My guess was basically a stab in the dark, so I have no guidance on how to make a choice if this had not succeeded other than to "expand the search."
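What "expanding the search" could look like in practice (a rough sketch, my construction, not from the thread): try a small grid of initial values and keep the first fit that succeeds.
# Grid-search initial values until flexsurvreg fits without error
grid <- expand.grid(shape = c(0.1, 0.5, 1, 2, 5), scale = c(1, 5, 10, 20))
for (i in seq_len(nrow(grid))) {
  fit <- try(flexsurvreg(Surv(start, stop, censrec) ~ x, data = long.df,
                         dist = "weibull",
                         inits = c(shape = grid$shape[i], scale = grid$scale[i])),
             silent = TRUE)
  if (!inherits(fit, "try-error")) break  # stop at the first successful fit
}
fit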
I just want to mention that in flexsurv v1.1.1, running this code:
flexsurvreg(Surv(start,stop,censrec) ~ x ,data=long.df, dist="weibull")
doesn't return any errors. It also gives the same estimates as the non-time-varying command:
flexsurvreg(Surv(time=t)~x,data=df, dist="weibull")
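To see which behavior to expect, check the installed version first:
packageVersion("flexsurv")  # per the note above, >= 1.1.1 no longer needs inits here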
I am trying to make an nls fit for a somewhat complicated expression that includes two integrals, with two of the fit parameters in their upper limits.
I got the error
"Error in nlsModel(formula, mf, start, wts) : singular gradient
matrix at initial parameter estimates".
I have already searched the previous answers, but they didn't help. The parameter initialization seems to be OK; I have tried changing the parameters, but none work. If my function has just one integral, everything works very nicely, but when I add a second integral term I get the error. I don't believe the function is over-parametrized, as I have performed other fits with many more parameters and they worked. Below I have included some data.
The minimal example is the following:
integrand <- function(X) {
return(X^4/(2*sinh(X/2))^2)
}
fitting = function(T1, T2, N, D, x){
int1 = integrate(integrand, lower=0, upper = T1)$value
int2 = integrate(integrand, lower=0, upper = T2)$value
return(N*(D/x)^2*(exp(D/x)/(1+exp(D/x))^2
)+(448.956*(x/T1)^3*int1)+(299.304*(x/T2)^3*int2))
}
fit <- nls(y ~ fitting(T1, T2, N, D, x), data = dat,
           start = list(T1 = 400, T2 = 200, N = 0.01, D = 2))
For reference, the fit that worked is the following:
integrand <- function(X) {
return(X^4/(2*sinh(X/2))^2)
}
fitting = function(T1, N, D, x){
int = integrate(integrand, lower=0, upper = T1)$value
return(N*(D/x)^2*(exp(D/x)/(1+exp(D/x))^2 )+(748.26)*(x/T1)^3*int)
}
fit <- nls(y ~ fitting(T1, N, D, x), data = dat, start = list(T1 = 400, N = 0.01, D = 2))
Data to illustrate the problem:
dat<- read.table(text="x y
0.38813 0.0198
0.79465 0.02206
1.40744 0.01676
1.81532 0.01538
2.23105 0.01513
2.64864 0.01547
3.05933 0.01706
3.47302 0.01852
3.88791 0.02074
4.26301 0.0256
4.67607 0.03028
5.08172 0.03507
5.48327 0.04283
5.88947 0.05017
6.2988 0.05953
6.7022 0.07185
7.10933 0.08598
7.51924 0.0998
7.92674 0.12022
8.3354 0.1423
8.7384 0.16382
9.14656 0.19114
9.55062 0.22218
9.95591 0.25542", header=TRUE)
I cannot figure out what is happening. I need to perform this fit with three integral components, but even with two I have this problem. I would greatly appreciate your help. Thank you.
You could try some other optimizers:
fitting1 <- function(par, x, y) {
  sum((fitting(par[1], par[2], par[3], par[4], x) - y)^2)  # residual sum of squares
}
library(optimx)
res <- optimx(c(400, 200, 0.01, 2),
              fitting1,
              x = dat$x, y = dat$y,
              control = list(all.methods = TRUE))
print(res)
# p1 p2 p3 p4 value fevals gevals niter convcode kkt1 kkt2 xtimes
#BFGS 409.7992 288.6416 -0.7594461 39.00871 1.947484e-03 101 100 NA 1 NA NA 0.22
#CG 401.1281 210.9087 -0.9026459 20.80900 3.892929e-01 215 101 NA 1 NA NA 0.25
#Nelder-Mead 414.6402 446.5080 -1.1298606 -227.81280 2.064842e-03 89 NA NA 0 NA NA 0.02
#L-BFGS-B 412.4477 333.1338 -0.3650530 37.74779 1.581643e-03 34 34 NA 0 NA NA 0.06
#nlm 411.8639 333.4776 -0.3652356 37.74855 1.581644e-03 NA NA 45 0 NA NA 0.04
#nlminb 411.9678 333.4449 -0.3650271 37.74753 1.581643e-03 50 268 48 0 NA NA 0.07
#spg 422.0394 300.5336 -0.5776862 38.48655 1.693119e-03 1197 NA 619 0 NA NA 1.06
#ucminf 412.7390 332.9228 -0.3652029 37.74829 1.581644e-03 45 45 NA 0 NA NA 0.05
#Rcgmin NA NA NA NA 8.988466e+307 NA NA NA 9999 NA NA 0.00
#Rvmmin NA NA NA NA 8.988466e+307 NA NA NA 9999 NA NA 0.00
#newuoa 396.3071 345.1165 -0.3650286 37.74754 1.581643e-03 3877 NA NA 0 NA NA 1.02
#bobyqa 410.0392 334.7074 -0.3650289 37.74753 1.581643e-03 7866 NA NA 0 NA NA 2.07
#nmkb 569.0139 346.0856 282.6526588 -335.32320 2.064859e-03 75 NA NA 0 NA NA 0.01
#hjkb 400.0000 200.0000 0.0100000 2.00000 3.200269e+00 1 NA 0 9999 NA NA 0.01
Levenberg-Marquardt converges too, but nlsLM fails when it tries to create an nls model object from the result because the gradient matrix is singular:
library(minpack.lm)
fit <- nlsLM(y ~ fitting(T1, T2, N, D, x),
             start = list(T1 = 412, T2 = 333, N = -0.36, D = 38), data = dat, trace = TRUE)
#It. 0, RSS = 0.00165827, Par. = 412 333 -0.36 38
#It. 1, RSS = 0.00158186, Par. = 417.352 329.978 -0.3652 37.746
#It. 2, RSS = 0.00158164, Par. = 416.397 330.694 -0.365025 37.7475
#It. 3, RSS = 0.00158164, Par. = 416.618 330.568 -0.365027 37.7475
#It. 4, RSS = 0.00158164, Par. = 416.618 330.568 -0.365027 37.7475
#Error in nlsModel(formula, mf, start, wts) :
# singular gradient matrix at initial parameter estimates
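Since the trace shows the converged parameters, one way forward (my suggestion, not part of the original answer) is to evaluate the model at those values directly; the singular gradient only blocks construction of the nls object, not the fit itself:
# Inspect the fit at the converged Levenberg-Marquardt parameters from the trace
best <- c(T1 = 416.618, T2 = 330.568, N = -0.365027, D = 37.7475)
yhat <- fitting(best["T1"], best["T2"], best["N"], best["D"], dat$x)
plot(dat$x, dat$y)
lines(dat$x, yhat, col = "red")
sqrt(mean((yhat - dat$y)^2))  # RMSE at the converged parameters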
I want to do a trend analysis for an ANOVA that has both between-Ss and within-Ss factors.
The between-Ss factor is "treatments".
The within-Ss factor is "trials".
tab20.09 <- data.frame(sid = rep(paste0("s", 1:10), each = 4),
                       treatments = rep(c("a1", "a2"), each = 20),
                       trials = rep(c("b1", "b2", "b3", "b4"), 10),
                       responses = c(3,5,9,6,7,11,12,11,9,13,14,12,4,8,11,7,1,3,5,4,5,6,11,7,10,12,18,15,10,15,15,14,6,9,13,9,3,5,9,7))
The ANOVA matches the one in the textbook (Keppel, 1973) exactly:
aov.model.1 <- aov(responses ~ treatments*trials + Error(sid/trials), data=tab20.09)
What I am having trouble with is the trend analysis. I want to look at the linear, quadratic, and cubic trends for “trials”. Would also be nice to look at these same trends for “treatments x trials”.
I have set up the contrasts for the trend analyses as:
contrasts(tab20.09$trials) <- cbind(c(-3, -1, 1, 3), c(1, -1, -1, 1), c(-1, 3, -3, 1))
contrasts(tab20.09$trials)
[,1] [,2] [,3]
b1 -3 1 -1
b2 -1 -1 3
b3 1 -1 -3
b4 3 1 1
for the linear, quadratic, and cubic trends.
According to Keppel the results for the trends should be:
TRIALS:
                   SS        df    MS      F
(Trials)           (175.70)  3
Linear             87.12     1     87.12   193.60
Quadratic          72.90     1     72.90   125.69
Cubic              15.68     1     15.68   9.50

TREATMENTS X TRIALS:
                   SS        df    MS      F
(Trtmt x Trials)   (3.40)    3
Linear             0.98      1     0.98    2.18
Quadratic          0.00      1     0.00    <1
Cubic              2.42      1     2.42    1.47

ERROR TERMS:
                   SS        df    MS
                   (21.40)   (24)
Linear             3.60      8     0.45
Quadratic          4.60      8     0.58
Cubic              13.20     8     1.65
I have faith in his answers, as once upon a time I had to derive them myself using a six-function calculator supplemented by paper and pencil. However, when I do this:
contrasts(tab20.09$trials) <- cbind(c(-3, -1, 1, 3), c(1, -1, -1, 1), c(-1, 3, -3, 1))
aov.model.2 <- aov(responses ~ treatments*trials + Error(sid/trials), data=tab20.09)
summary(lm(aov.model.2))
what I get seems not to make sense.
summary(lm(aov.model.2))
Call:
lm(formula = aov.model.2)
Residuals:
ALL 40 residuals are 0: no residual degrees of freedom!
Coefficients: (4 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.750e+00 NA NA NA
treatmentsa2 3.500e+00 NA NA NA
trials1 6.500e-01 NA NA NA
trials2 -1.250e+00 NA NA NA
trials3 -4.500e-01 NA NA NA
sids10 -3.250e+00 NA NA NA
sids2 4.500e+00 NA NA NA
sids3 6.250e+00 NA NA NA
sids4 1.750e+00 NA NA NA
sids5 -2.500e+00 NA NA NA
sids6 -2.000e+00 NA NA NA
sids7 4.500e+00 NA NA NA
sids8 4.250e+00 NA NA NA
sids9 NA NA NA NA
treatmentsa2:trials1 2.120e-16 NA NA NA
treatmentsa2:trials2 -5.000e-01 NA NA NA
treatmentsa2:trials3 5.217e-16 NA NA NA
trials1:sids10 1.500e-01 NA NA NA
trials2:sids10 7.500e-01 NA NA NA
trials3:sids10 5.000e-02 NA NA NA
trials1:sids2 -1.041e-16 NA NA NA
trials2:sids2 -2.638e-16 NA NA NA
trials3:sids2 5.000e-01 NA NA NA
trials1:sids3 -1.500e-01 NA NA NA
trials2:sids3 -2.500e-01 NA NA NA
trials3:sids3 4.500e-01 NA NA NA
trials1:sids4 -5.000e-02 NA NA NA
trials2:sids4 -7.500e-01 NA NA NA
trials3:sids4 1.500e-01 NA NA NA
trials1:sids5 -1.000e-01 NA NA NA
trials2:sids5 5.000e-01 NA NA NA
trials3:sids5 3.000e-01 NA NA NA
trials1:sids6 -1.000e-01 NA NA NA
trials2:sids6 5.000e-01 NA NA NA
trials3:sids6 -2.000e-01 NA NA NA
trials1:sids7 4.000e-01 NA NA NA
trials2:sids7 5.000e-01 NA NA NA
trials3:sids7 -2.000e-01 NA NA NA
trials1:sids8 -5.000e-02 NA NA NA
trials2:sids8 2.500e-01 NA NA NA
trials3:sids8 6.500e-01 NA NA NA
trials1:sids9 NA NA NA NA
trials2:sids9 NA NA NA NA
trials3:sids9 NA NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 39 and 0 DF, p-value: NA
Any ideas what I am doing wrong? I suspect there is some problem with "lm" and the ANOVA, but I don't know what, and I don't know how to put in my trend analyses.
###### MORE DETAILS in response to ssdecontrol's answer
Well, "trials" is a factor, as it codes four levels of experience that are being manipulated. Likewise, "sid" is the subject identification number, which is definitely nominal, not ordinal or interval. Subjects are pretty much always treated as factors in ANOVAs.
However, I did try both of these changes, and it greatly distorted the ANOVA (try it yourself and compare); it didn't seem to help. PERHAPS MORE DIRECTLY RELEVANT: when I try to create and apply my contrasts, I am told that it cannot be done, as my numerics need to be factors:
contrasts(tab20.09$trials) <- cbind(c(-3, -1, 1, 3), c(1, -1, -1, 1), c(-1, 3, -3, 1))
Error in `contrasts<-`(`*tmp*`, value = c(-3, -1, 1, 3, 1, -1, -1, 1, :
contrasts apply only to factors
STARTING OVER
I seem to make more progress using contr.poly, as in:
contrasts(tab20.09$trials) <- contr.poly(levels(tab20.09$trials))
The ANOVA doesn't change at all, so that is good. And when I do:
lm.model <- lm(responses ~ trials, data = tab20.09)
summary.lm(lm.model)
I get basically the same pattern as Keppel.
BUT, as I am interested in the linear trend of the interaction (treatments x trials), not just of trials, I tried this:
lm3 <- lm(responses ~ treatments*trials, data = tab20.09)
summary.lm(lm3)
and the main effect of "trials" goes away . . .
In Keppel’s treatment, he calculated separate error terms for each contrast (i.e., linear, quadratic, and cubic) and used them both for the main effect of “trials” and for the “treatments x trials” interaction.
I certainly could hand calculate all of these things again. Perhaps I could even write R functions for the general case; however, it seems difficult to believe that such a basic core contrast for experimental psychology has not yet found an R implementation!!??
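For illustration, here is a rough sketch of the classic contrast-score approach (my construction, not from the thread; it assumes the tab20.09 data above). Each subject's four trial scores are reduced to one score per trend, and a between-subjects ANOVA is run on those scores; the F ratios should match Keppel's even though the SS depend on the scaling of the contrast weights.
# Per-subject trend scores via orthonormal polynomial weights
w <- contr.poly(4)  # columns: linear, quadratic, cubic
wide <- reshape(tab20.09, idvar = c("sid", "treatments"),
                timevar = "trials", direction = "wide")
scores <- as.matrix(wide[, paste0("responses.b", 1:4)]) %*% w
colnames(scores) <- c("linear", "quadratic", "cubic")
wide <- cbind(wide, scores)
# The treatments line tests the treatments x linear-trend interaction;
# its residual line is the contrast-specific 8-df error term
summary(aov(linear ~ treatments, data = wide))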
Any help or suggestions would be greatly appreciated. Thanks. W
It looks like trials and sid are factors, but you are intending for them to be numeric/integer. Run sapply(tab20.09, class) to see if that's the case. That's what the output means; instead of fitting a continuous/count interaction, it's fitting a dummy variable for each level of each variable and computing all of the interactions between them.
To fix it, just reassign tab20.09$trials <- as.numeric(tab20.09$trials) and tab20.09$sid <- as.numeric(tab20.09$sid) in list syntax, or you can use matrix syntax like tab20.09[, c("trials", "sid")] <- apply(tab20.09[, c("trials", "sid")], 2, as.numeric). The first one is easier in this case, but you should be aware of the second one as well.