Is there an R package with a function that can:
(1) simulate the different values of an interaction variable,
(2) plot a graph that demonstrates the effect of the interaction on Y for different values of the terms in interaction, and
(3) works well with the models fitted with the lmer() function of the lme4 package?
I have looked in arm, ez, coefplot2, and fanovaGraph packages, but could not find what I was looking for.
I'm not sure about a package, but you can simulate data varying the terms in the interaction, and then graph it. Here is an example for a treatment by wave (i.e. longitudinal) interaction and the syntax to plot. I think the story behind the example is a treatment to improve oral reading fluency in school age children. The term of the interaction is modified by changing the function value for bX.
library(arm)
sim1 <- function (b0=50, bGrowth=4.672,bX=15, b01=.770413, b11=.005, Vint=771, Vslope=2.24, Verror=40.34) {
#observation ID
oID<-rep(1:231)
#participant ID
ID<-rep(1:77, each=3)
tmp2<-sample(0:1,77,replace=TRUE,prob=c(.5,.5))
ITT<-tmp2[ID]
#longitudinal wave: for example 0, 4, and 7 months after treatment
wave <-rep(c(0,4,7), 77)
bvaset<-rnorm(77, 0, 11.58)
bva<-bvaset[ID]
#random effect intercept
S.in <- rnorm(77, 0, sqrt(Vint))
#random effect for slope
S.sl<-rnorm(77, 0, sqrt(Vslope))
#observation level error
eps <- rnorm(3*77, 0, sqrt(Verror))
#Create Outcome as product of specified model
ORFset <- b0 + b01*bva+ bGrowth*wave +bX*ITT*wave+ S.in[ID]+S.sl[ID]*wave+eps[oID]
#if else statement to elimiante ORF values below 0
ORF<-ifelse(ORFset<0,0,ORFset)
#Put into a data frame
mydata <- data.frame( oID,ID,ITT, wave,ORF,bva,S.in[ID],S.sl[ID],eps)
#run the model
fit1<-lmer(ORF~1+wave+ITT+wave:ITT+(1+wave|ID),data=mydata)
fit1
#grab variance components
vc<-VarCorr(fit1)
#Select Tau and Sigma to select in the out object
varcomps=c(unlist(lapply(vc,diag)),attr(vc,"sc")^2)
#Produce object to output
out<-c(coef(summary(fit1))[4,"t value"],coef(summary(fit1))[4,"Estimate"],as.numeric(varcomps[2]),varcomps[3])
#outputs T Value, Estimate of Effect, Tau, Sigma Squared
out
mydata
}
mydata<-sim1(b0=50, bGrowth=4.672, bX=1.25, b01=.770413, b11=.005, Vint=771, Vslope=2.24, Verror=40.34)
xyplot(ORF~wave,groups=interaction(ITT),data=mydata,type=c("a","p","g"))
Try plotLMER.fnc() from the languageR package, or the effects package.
The merTools package has some functionality to make this easier, though it only applies to working with lmer and glmer objects. Here's how you might do it:
library(merTools)
# fit an interaction model
m1 <- lmer(y ~ studage * service + (1|d) + (1|s), data = InstEval)
# select an average observation from the model frame
examp <- draw(m1, "average")
# create a modified data.frame by changing one value
simCase <- wiggle(examp, var = "service", values = c(0, 1))
# modify again for the studage variable
simCase <- wiggle(simCase, var = "studage", values = c(2, 4, 6, 8))
After this, we have our simulated data which looks like:
simCase
y studage service d s
1 3.205745 2 0 761 564
2 3.205745 2 1 761 564
3 3.205745 4 0 761 564
4 3.205745 4 1 761 564
5 3.205745 6 0 761 564
6 3.205745 6 1 761 564
7 3.205745 8 0 761 564
8 3.205745 8 1 761 564
Next, we need to generate prediction intervals, which we can do with merTools::predictInterval (or without intervals you could use lme4::predict)
preds <- predictInterval(m1, level = 0.9, newdata = simCase)
Now we get a preds object, which is a 3 column data.frame:
preds
fit lwr upr
1 3.312390 1.2948130 5.251558
2 3.263301 1.1996693 5.362962
3 3.412936 1.3096006 5.244776
4 3.027135 1.1138965 4.972449
5 3.263416 0.6324732 5.257844
6 3.370330 0.9802323 5.073362
7 3.410260 1.3721760 5.280458
8 2.947482 1.3958538 5.136692
We can then put it all together to plot:
library(ggplot2)
plotdf <- cbind(simCase, preds)
ggplot(plotdf, aes(x = service, y = fit, ymin = lwr, ymax = upr)) +
geom_pointrange() + facet_wrap(~studage) + theme_bw()
Unfortunately the data here results in a rather uninteresting, but easy to interpret plot.
Related
I am trying to simulate two-level data, with level 1 interaction term.
For example,
I have two level-2 variables, two level -1 variable, and interaction variable.
CN<- 40 # number of cluster
nj<- 2 # cluster size, in this case, dyadic data.
l2v1 <- rep(rnorm(CN, mean=0, sd=1), each=nj) # level 2 variable 1
l2v2 <- rep(sample(rep(c(-1, 1), CN),each=nj) # level 2 variable 2, which is binary variable
l1v1 <- rnorm(CN*ng, 0, 1) # level 1 variable 2
l1v2 <- rnorm(sample(rep(c(-1,1),CN*nj) # level 2 variable 2, which is binary
error2 <- rep(rnorm(CN, 0,1) each = nj)) # error for level 2
error1 <- rnorm(CN*nj) # error for level 1
## putting together
y <- coef1*l1v1 + coef2*l1v2 + coef3*l2v1 + coef4*l2v2 + coef5 * l1v1* l1v2 + error2 + error1
In this case, how can I control ICC?
For example, I want to simulate this data with ICC of 0.3
ICC = between variance / total variance
ICC= {coef3^2 + coef4^2 + 1}/{coef3^2 + coef4^2 + 1 + 1+ coef1^2 + coef2^2 + coeff5^2}
and coef1 = coef3, coef 2= coef 4 due to some research questions.
so I plugged in the arbitrary numbers in coef 1, 2, 3, 4 and tried to set coeff 5 so that I can have data with targeted ICC. However, it seems like it does not work.
Did I miss something?
I have the following data frame:
lm mean resids sd resids resid 1 resid 2 resid 3 intercept beta
1 0.000000e+00 6.2806844 -3.6261548 7.2523096 -3.6261548 103.62615 24.989340
2 -2.960595e-16 8.7515899 -5.0527328 10.1054656 -5.0527328 141.96786 -1.047323
3 -2.960595e-16 5.9138984 -3.4143908 6.8287817 -3.4143908 206.29046 -26.448694
4 3.700743e-17 0.5110845 0.2950748 -0.5901495 0.2950748 240.89801 -35.806642
5 7.401487e-16 6.6260504 3.8255520 -7.6511040 3.8255520 187.03479 -23.444762
6 5.921189e-16 8.7217431 5.0355007 -10.0710014 5.0355007 41.43239 3.138396
7 0.000000e+00 5.5269434 3.1909823 -6.3819645 3.1909823 -119.90628 27.817845
8 -1.480297e-16 1.0204260 -0.5891432 1.1782864 -0.5891432 -180.33773 35.623363
9 -5.921189e-16 6.9488186 -4.0119023 8.0238046 -4.0119023 -64.72245 21.820226
10 -8.881784e-16 8.6621512 -5.0010953 10.0021906 -5.0010953 191.65339 -5.218767
Each row represents an estimated linear model with window length 3. I used rollapply on a separate dataframe with the function lm(y~t) to extract the coefficients and intercepts into a new dataframe, which I have combined with the residuals from the same model and their corresponding means and residuals.
Since the window length is 3, it implies that there are 3 residuals as shown, per model, in resid 1, resid 2 and resid 3. The mean and sd of these are included accordingly.
I am seeking to predict the next observation, in essence, k+1, where k is the window length, using the intercept and beta.
Recall that lm1 takes observations 1,2,3 to estimate the intercept and the beta, and lm2 takes 2,3,4, lm3 takes 3,4,5, etc. The function for the prediction should be:
predict_lm1 = intercept_lm1 + beta_lm1*(k+1)
Where k+1 = 4. For lm2:
predict_lm2 = intercept_lm2 + beta_lm2*(k+1)
Where k+1 = 5.
Clearly, k increases by 1 every time I move down one row in the dataset. This is because the explanatory variable is time, t, which is a sequence increasing by one per observation.
Should I use a for loop, or an apply function here?
How can I make a function that iterates down the rows and calculates the predictions accordingly with the information found in that row?
Thanks.
EDIT:
I managed to find a possible solution by writing the following:
n=nrow(dataset)
for(i in n){
predictions = dataset$Intercept + dataset$beta*(k+1)
}
However, k does not increase by 1 per iteration. Thus, k+1 is always = 4.
How can I make sure k increases by 1 accordingly?
EDIT 2
I managed to add 1 to k by writing the following:
n=nrow(dataset)
for(i in n){
x = 0
x[i] = k + 1
preds = dataset$`(Intercept)` + dataset$t*(x[i])
}
However, the first prediction is overestimated. It should be 203, whereas it is estimated as 228, implying that it sets the explanatory variable as 1 too high.
Yet, the second prediction is correct. I am not sure what I am doing wrong. Any advice?
EDIT 3
I managed to find a solution as follows:
n=nrow(dataset)
for(i in n){
x = k + 1
preds = dataset$`(Intercept)` + dataset$t*(x)
x = x + 1
}
Your loop is not iterating:
dataset <- read.table(text="lm meanresids sdresids resid1 resid2 resid3 intercept beta
1 0.000000e+00 6.2806844 -3.6261548 7.2523096 -3.6261548 103.62615 24.989340
2 -2.960595e-16 8.7515899 -5.0527328 10.1054656 -5.0527328 141.96786 -1.047323
3 -2.960595e-16 5.9138984 -3.4143908 6.8287817 -3.4143908 206.29046 -26.448694
4 3.700743e-17 0.5110845 0.2950748 -0.5901495 0.2950748 240.89801 -35.806642
5 7.401487e-16 6.6260504 3.8255520 -7.6511040 3.8255520 187.03479 -23.444762
6 5.921189e-16 8.7217431 5.0355007 -10.0710014 5.0355007 41.43239 3.138396
7 0.000000e+00 5.5269434 3.1909823 -6.3819645 3.1909823 -119.90628 27.817845
8 -1.480297e-16 1.0204260 -0.5891432 1.1782864 -0.5891432 -180.33773 35.623363
9 -5.921189e-16 6.9488186 -4.0119023 8.0238046 -4.0119023 -64.72245 21.820226
10 -8.881784e-16 8.6621512 -5.0010953 10.0021906 -5.0010953 191.65339 -5.218767", header=T)
n <- nrow(dataset)
predictions <- data.frame()
for(i in 1:n){
k <- i ##not sure where k is coming from but put it here
predictions <- rbind(predictions, dataset$intercept[i] + dataset$beta[i]*(k+1))
}
predictions
Using R and estimating a simple equation by least squares that has the dependent variable as a independent (explanatory, right hand side) variable, I want to forecast out of sample and use the dependent variable forecasts in the out of sample period as a lag for each step ahead.
I.e., I want to extend forecasts of y to be outside the data period
a <- lm( y ~ x + lag(y,1), data= dset1)
b <- forecast(a,newdata=dset2)
where dset2 has the full period of extra x variables, but not the lagged y.
Here is an example using the AirPassengers data set, where dset2 was created with some missing ap data. The results below show only row 143 gets filled in not 144 because forecast did not have the 143 lag.
I looked at the dyn dynlm and forecast packages but nonw seem to work with type of model. (I do not want to restate as an ARMA or a VAR)
What package can easily do this, or am I using forecast incorrectly?
I can loop and step ahead on period at a time, but rather not do that.
##Example case using airline data
data("AirPassengers", package = "datasets")
ap <- log(AirPassengers)
ap <- as.ts(ap)
d1 <- data.frame(ap, index= as.Date(ap))
m1 <- lm(ap ~ lag(ap,1), data=d1)
m2 <- dynlm(ap ~ lag(ap,1), data=d1)
m3 <- dyn(lm(ap ~ lag(ap,1), data=d1))
summary(m3)
##Neither lm or dyn or dynlm obects worked as I want
## Try forecast missing values, 2 steps, rows 143 and 144
d2 <- d1
d2$apx = d2$ap
d2$apx[143:144]= NA
mx <- lm(apx ~ lag(apx,1), data=d2)
b <- forecast(mx,newdata=d2)
Results:
> b
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
1 NA NA NA NA NA
2 4.756850 4.619213 4.894488 4.545513 4.968188
3 4.807218 4.669783 4.944653 4.596191 5.018245
....
140 6.411559 6.273546 6.549572 6.199644 6.623474
141 6.386407 6.248507 6.524306 6.174667 6.598146
142 6.216154 6.078941 6.353368 6.005467 6.426841
143 6.122453 5.985553 6.259354 5.912247 6.332659
144 NA NA NA NA NA
other lm like objects produced errors for forecast
mx <- dynlm(apx ~ lag(apx,1), data=d2)
b <- forecast(mx,newdata=d2)
Error in forecast.lm(mx, newdata = d2) : invalid type/length
(symbol/0) in vector allocation
mx <- dyn(lm(apx ~ lag(apx,1), data=d2))
b <- forecast(mx,newdata=d2)
Error in predict.lm(object, newdata = newdata, se.fit = TRUE, interval
= "prediction", : formal argument "se.fit" matched by multiple actual arguments
Using the ranger package I run the following script:
rf <- ranger(Surv(time, Y) ~ ., data = train_frame[1:50000, ], write.forest = TRUE, num.trees = 100)
test_frame <- train_frame[50001:100000, ]
preds <- predict(rf, test_frame)
chfs <- preds$chf
plot(chfs[1, ])
The cumulative hazard function has indexes 1 - 36 on the X-axis. Obviously this corresponds with time, but I'm not sure how: my time of observation variable ranges from a minimum of 0 to a maximum of 399. What is the mapping between the original data and the predicted output from predict.ranger, and how can I operationalize this to quantify degree of risk for a given subject after a given length of time?
Here's a sample of what my time/event data looks like:
Y time
<int> <dbl>
1 1 358
2 0 90
3 0 162
4 0 35
5 0 307
6 0 69
7 0 184
8 0 24
9 0 366
10 0 33
And here's what the CHF of the first subject looks like:
Can anyone help me connect the dots? There are no row or columns names on the "matrix" object that is preds$chf.
In the prediction object is vector called unique.death.times containing the time points where the CHF and survival estimates are computed. The chf matrix has observations in the rows and these time points in the columns, same for survival.
Reproducible example:
library(survival)
library(ranger)
## Split the data
n <- nrow(veteran)
idx <- sample(n, 2/3*n)
train <- veteran[idx, ]
test <- veteran[-idx, ]
## Grow RF and predict
rf <- ranger(Surv(time, status) ~ ., train, write.forest = TRUE)
preds <- predict(rf, test)
## Example CHF plot
plot(preds$unique.death.times, preds$chf[1, ])
## Example survival plot
plot(preds$unique.death.times, preds$survival[1, ])
Setting importance = "impurity" for survival forests should throw an error.
I am going to eventually do multivariate regression for a vary large set of predictors. To make sure that I am putting the data in correctly and getting expected results with a toy model. However when I try to use predict it does not predict on the new data, also since the size of the new data is different from the training set it gives me an error.
I have looked and tried various things on the Internet and none have worked. I am almost ready to give up and write my own functions but I am also building models with the please package, which I am guessing probably calls this internally already so I want to be consistent. Here is the short script I wrote:
x1<-c(1.1,3.4,5.6,1.2,5,6.4,0.9,7.2,5.4,3.1) # Orginal Variables
x2<-c(10,21,25,15.2,18.9,19,16.2,22.1,18.6,22)
y<-2.0*x1+1.12*x2+rnorm(10,mean=0,sd=0.2) # Define output variable
X<-data.frame(x1,x2)
lfit<-lm(y~.,X) # fit model
n_fit<-lfit$coefficients
xg1<-runif(15,1,10) # define new data
xg2<-runif(15,10,30)
X<-data.frame(xg1,xg2)# put into data frame
y_guess<-predict(lfit,newdata=X) #Predict based on fit
y_actual<-2.0*xg1+1.12*xg2 # actual values because I know the coefficients
y_pred=n_fit[1]+n_fit[2]*xg1+n_fit[3]*xg2 # What predict should give me based on fit
print(y_guess-y_actual) #difference check
print(y_guess-y_pred)
These are the values I am getting and the error message:
[1] -4.7171499 -16.9936498 6.9181074 -6.1964788 -11.1852816 0.9257043 -13.7968731 -6.6624086 15.5365141 -8.5009428
[11] -22.8866505 2.0804016 -1.8728602 -18.7670797 1.2251849
[1] -4.582645 -16.903164 7.038968 -5.878723 -11.149987 1.162815 -13.473351 -6.483111 15.731694 -8.456738
[11] -22.732886 2.390507 -1.662446 -18.627342 1.431469
Warning messages:
1: 'newdata' had 15 rows but variables found have 10 rows
2: In y_guess - y_actual :
longer object length is not a multiple of shorter object length
3: In y_guess - y_pred :
longer object length is not a multiple of shorter object length
The predicted coefficient are 1.97 and 1.13 and intercept -0.25, it should be 0 but I added noise, this would not cause a big discrepancy as it is. How do I get it so I can predict an independent test set.
From the help - documentation, ?predict.lm:
"Variables are first looked for in newdata and then searched for in the usual way (which will include the environment of the formula used in the fit)."
The data.frame(), created in: X <- data.frame(xg1, xg2), has different names: (xg1, xg2). predict() cannot find the original names (x1, x2) and will then search for the correct variables in the formula instead. The result is that you obtain the fitted values from your original data.
Solve this by making your names in the newdata consistent with the original:
X <- data.frame(x1=xg1, x2=xg2) :
x1 <- c(1.1, 3.4, 5.6, 1.2, 5, 6.4, 0.9, 7.2, 5.4, 3.1) # Orginal Variables
x2 <- c(10, 21, 25, 15.2, 18.9, 19, 16.2, 22.1, 18.6, 22)
y <- 2.0*x1 + 1.12*x2 + rnorm(10, mean=0, sd=0.2) # Define output variable
X <- data.frame(x1, x2)
lfit <- lm(y~., X) # fit model
n_fit <- lfit$coefficients
xg1 <- runif(15, 1, 10) # define new data
xg2 <- runif(15, 10, 30)
X <- data.frame(x1=xg1, x2=xg2) # put into data frame
y_guess <- predict(lfit, newdata=X) #Predict based on fit
y_actual <- 2.0*xg1 + 1.12*xg2 # actual values because I know the coefficients
y_pred = n_fit[1] + n_fit[2]*xg1 + n_fit[3]*xg2 # What predict should give me based on fit
> print(y_guess - y_actual) #difference check
1 2 3 4 5 6 7 8 9 10 11 12 13
-0.060223916 -0.047790535 -0.018274280 -0.096190467 -0.079490487 -0.063736231 -0.047506981 -0.009523583 -0.047774006 -0.084276807 -0.106322290 -0.030876942 -0.067232989
14 15
-0.023060651 -0.041264431
> print(y_guess - y_pred)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0