simulating two-level data with level 1 interaction term - r

I am trying to simulate two-level data with a level-1 interaction term.
For example,
I have two level-2 variables, two level-1 variables, and an interaction variable.
CN <- 40  # number of clusters
nj <- 2   # cluster size; in this case, dyadic data
l2v1 <- rep(rnorm(CN, mean = 0, sd = 1), each = nj)           # level-2 variable 1
l2v2 <- rep(sample(c(-1, 1), CN, replace = TRUE), each = nj)  # level-2 variable 2 (binary)
l1v1 <- rnorm(CN * nj, 0, 1)                                  # level-1 variable 1
l1v2 <- sample(c(-1, 1), CN * nj, replace = TRUE)             # level-1 variable 2 (binary)
error2 <- rep(rnorm(CN, 0, 1), each = nj)                     # level-2 error (random intercept)
error1 <- rnorm(CN * nj)                                      # level-1 error
## putting it together
y <- coef1*l1v1 + coef2*l1v2 + coef3*l2v1 + coef4*l2v2 + coef5*l1v1*l1v2 + error2 + error1
In this case, how can I control the ICC?
For example, I want to simulate this data with an ICC of 0.3:
ICC = between variance / total variance
ICC = (coef3^2 + coef4^2 + 1) / (coef3^2 + coef4^2 + 1 + coef1^2 + coef2^2 + coef5^2 + 1)
and coef1 = coef3, coef2 = coef4 due to some research questions.
So I plugged arbitrary numbers into coef1 through coef4 and tried to set coef5 so that the data would have the targeted ICC. However, it does not seem to work.
Did I miss something?
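One way to sanity-check a candidate set of coefficients is to simulate a large number of clusters and estimate the ICC empirically from an unconditional random-intercept model. A minimal sketch (it assumes coef1 through coef5 are already defined and that the simulation code above has been run, ideally with a much larger CN so the estimate is stable):
library(lme4)
cluster <- rep(1:CN, each = nj)                 # cluster indicator
d <- data.frame(y = y, cluster = cluster)
fit <- lmer(y ~ 1 + (1 | cluster), data = d)    # unconditional (intercept-only) model
vc <- as.data.frame(VarCorr(fit))
icc_hat <- vc$vcov[1] / sum(vc$vcov)            # between variance / total variance
icc_hat
Comparing icc_hat with the algebraic target (0.3) shows whether the chosen coefficients actually deliver the intended ICC.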

Related

calculate new random number considering distribution of already existing numbers in r

I have a dataframe with participants and I want to randomly assign them to a group (0, 1). Each group should have approximately the same number of participants.
My problem: I will keep adding participants. So, when I calculate a new random number for that participant, it should take into account the distribution of the random numbers I already have.
This is my code:
groupData <- data.frame(participant = c(1), Group = floor(runif(1, min=0, max=2)))
groupData[nrow(groupData) + 1,] = c(2,floor(runif(1, min=0, max=2))) # with this I will be adding participants
I think what you're saying is that when iteratively adding participants to groupData, you want to randomly assign them to a group such that over time, the groups will be evenly distributed.
N.B., iteratively adding rows to a frame scales horribly, so if you're doing this with a lot of data, it will slow down a lot. See "Growing Objects" in The R Inferno.
We can weight the groups in inverse proportion to their relative sizes, so that a new participant has a slightly higher likelihood of being assigned to an under-populated group.
For instance, if we already have 100 participants with unbalanced groups:
set.seed(42)
groupData <- data.frame(participant = 1:100, Group = sample(c(rep(0, 70), rep(1, 30))))
head(groupData)
# participant Group
# 1 1 0
# 2 2 0
# 3 3 0
# 4 4 1
# 5 5 0
# 6 6 1
table(groupData$Group)
# 0 1
# 70 30
then we can prioritize the under-filled group using
100 / (table(c(0:1, groupData$Group))-1)
# 0 1
# 1.428571 3.333333
which can be used with sample as in
sample(0:1, size = 1, prob = 100 / (table(c(0:1, groupData$Group)) - 1) )
I use table(c(0:1, ..)) - 1 because I want this to work when there may not yet be participants in one of the groups; by concatenating 0:1 to it, I ensure each group has a count of at least one, and the "minus one" compensates for this artificiality, trying to keep the ratios unbiased.
To "prove" that this eventually rounds out ...
for (pa in 101:400) {
newgroup <- sample(0:1, size = 1, prob = 100 / (table(c(0:1, groupData$Group))-1))
groupData <- rbind(groupData, data.frame(participant=pa, Group=newgroup))
}
library(ggplot2)
transform(groupData, GroupDiff = cumsum(Group == 0) - cumsum(Group == 1)) |>
ggplot(aes(participant, y = GroupDiff)) +
geom_point() +
geom_hline(yintercept=0) +
geom_vline(xintercept = 100) +
geom_text(data = data.frame(participant = 101, GroupDiff = c(-Inf, -1, 1),
                            vjust = c(-0.5, 0.5, -0.5),
                            label = c("Start of group-balancing", "Group0-heavy", "Group1-heavy")),
          hjust = 0, aes(label = label, vjust = vjust))
It is possible (even likely) that the balance will sway from side-to-side, but in general (asymptotically) it should stay balanced.
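If you want to fold that weighted draw into the step where you append a participant, one possible wrapper (a hypothetical helper, not part of the original code, with slightly different but equivalent weighting) is:
next_group <- function(groups) {
  # counts per group, including groups that are still empty
  counts <- table(factor(groups, levels = 0:1))
  # inverse weights; pmax() avoids dividing by zero when a group is empty
  w <- 1 / pmax(as.numeric(counts), 1)
  sample(0:1, size = 1, prob = w)
}
# usage when appending a new participant:
groupData <- rbind(groupData,
                   data.frame(participant = nrow(groupData) + 1,
                              Group = next_group(groupData$Group)))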
It occurs to me that the simplest method is just to assign people in pairs: draw a random number (0 or 1), assign person N to the group associated with that value, and assign person N+1 to the other group. That guarantees random assignment as well as perfectly equal group sizes; see the sketch below.
Whether this properly simulates the situation you want to analyze is a separate issue.
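A minimal sketch of that pairwise scheme (illustrative only; the function name is made up):
assign_pairs <- function(n_pairs) {
  first <- sample(0:1, n_pairs, replace = TRUE)  # random group for the first member of each pair
  as.vector(rbind(first, 1 - first))             # the second member gets the other group
}
set.seed(1)
assign_pairs(3)   # six assignments, perfectly balanced within each pair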

How to simulate data in R for a polynomial SVM

I'm new to simulating data in R and would like to know how to generate a polynomial separator of degree 2 in R. This is the question:
Generate a Data Set by Simulations:
We seek to generate 5000 cases x^1 ... x^5000 in R^4
each case x = [ x1 x2 x3 x4 ] has 4 numerical features.
using random sampling of a uniform distribution over the interval [-2, +2]:
select 16 random numbers A_ij with i = 1, 2, 3, 4 and j = 1, 2, 3, 4; select 4 random numbers B_i with i = 1, 2, 3, 4; select 1 random number c.
Define the polynomial of degree 2 in the 4 variables x1, x2, x3, x4 as follows:
Pol(x) = ∑_i ∑_j A_ij x_i x_j + ∑_i B_i x_i + c/20
so far I've generated A, B, and C:
A <- matrix(runif(16, -2, 2), nrow=4, ncol=4)
B <- runif(4, -2, 2)
C <- runif(1, -2, 2)
But I'm having trouble finding out how to define a polynomial using the values I generated.
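For what it's worth, one way to evaluate Pol(x) for all cases at once is to vectorise the double sum with matrix products. A minimal sketch (it assumes, as written in the assignment, that only c is divided by 20; adjust if the whole expression should be scaled):
set.seed(1)
n <- 5000
X <- matrix(runif(n * 4, -2, 2), nrow = n, ncol = 4)  # 5000 cases with 4 features each
A <- matrix(runif(16, -2, 2), nrow = 4, ncol = 4)
B <- runif(4, -2, 2)
C <- runif(1, -2, 2)
# Pol(x) = sum_i sum_j A_ij x_i x_j + sum_i B_i x_i + c/20, evaluated row-wise
pol <- rowSums((X %*% A) * X) + as.vector(X %*% B) + C / 20
# e.g. label each case by the sign of the polynomial for a classification task
y <- ifelse(pol > 0, 1, -1)
table(y)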

Modify summaryFunction in caret to compute grouped Brier-Score

I want to compare a multinomial logit model and a random forest using a grouped Brier score within cross-validation. The theoretical foundation of this approach is: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3702649/pdf/nihms461154.pdf
My dependent variable has three outcomes and my data set comprises lifetime data, where the lifetime lies between 0 and 5.
To make things reproducible, my dataset looks like:
library(data.table)
library(dplyr)  # for ntile() and the pipe used in customSummary() below
N <- 1000
X1 <- rnorm(N, 175, 7)
X2 <- rnorm(N, 30, 8)
length <- sample(0:5,N,T)
Ycont <- 0.5*X1 - 0.3*X2 + 10 + rnorm(N, 0, 6)
Ycateg <- ntile(Ycont,3)
df <- data.frame(id=1:N,length,X1, X2, Ycateg)
df$Ycateg=ifelse(df$Ycateg==1,"current",ifelse(df$Ycateg==2,"default","prepaid"))
df=setDT(df)[,.SD[rep(1L,length)],by = id]
df=df[ , time := 1:.N , by=id]
df=df[,-c("length")]
head(df)
id X1 X2 Ycateg time
1: 1 178.0645 10.84313 1 1
2: 2 169.4208 34.39831 1 1
3: 2 169.4208 34.39831 1 2
4: 2 169.4208 34.39831 1 3
5: 2 169.4208 34.39831 1 4
6: 2 169.4208 34.39831 1 5
What I did so far is:
library(caret)
fitControl <- trainControl(method = 'cv',number=5)
cv=train(as.factor(Ycateg)~.,
data = df,
method = "multinom",
maxit=150,
trControl = fitControl)
cv
Since the models are used to predict probabilities at each time point, I want to compute the following for each fold:
1. The Brier score for each category of the dependent variable: BS_{i,t,k} = (Y_{i,t,k} - p_{i,t,k})^2, where i denotes observation i of the test fold, t the time, and k the class of the dependent variable.
2. A summary of step 1 for that fold, computed as (1/n_t) * sum_i BS_{i,t,k}, where n_t is the number of observations with observed time t - i.e. a grouped computation.
So in the end, what I want to report - for example for a 3-fold CV, knowing that time ranges from 0 to 5 - is an output like this:
fold time Brier_0 Brier_1 Brier_2
1 1 0 0.39758714 0.11703814 0.8711775
2 1 1 0.99461281 0.95051037 0.1503217
3 1 2 0.01791559 0.83653814 0.1553521
4 1 3 0.92067849 0.55275340 0.6466206
5 1 4 0.73112563 0.07603891 0.5769286
6 1 5 0.29500600 0.66219814 0.7590742
7 2 0 0.24691469 0.06736522 0.8612998
8 2 1 0.13629191 0.55973431 0.5617303
9 2 2 0.48006915 0.01357407 0.4515544
10 2 3 0.01257112 0.40250469 0.1814620
. . . . . .
I know that I have to set up a customized version of the summaryFunction, but I'm really lost on how to do this. So my main aim is not to tune a model but to validate it.
There is one thing that should be remarked: the summaryFunction can only return a single numeric vector - correct me if I'm wrong. Further, the data parameter of the summaryFunction contains a column rowIndex which can be used to extract additional variables from the original data set.
customSummary <- function(data, lev = NULL, model = NULL) {
  # browser() # essential for debugging
  # observed outcome as dummy variables - one column per level of the outcome
  Y_obs <- model.matrix(~ data[, "obs"] - 1)
  # predicted probabilities
  Y_pre <- as.data.frame(data[, c("current", "default", "prepaid")])
  # row numbers, used to look up the time of each observation in the original data
  rows <- data[, "rowIndex"]
  time <- df[rows, ]$time
  # put it all together
  df_temp <- data.frame(Y_obs, Y_pre, time)
  names(df_temp) <- c("Y_cur", "Y_def", "Y_pre", "p_cur", "p_def", "p_pre", "time")
  # group by time and compute the Brier score per class
  out <- df_temp %>%
    group_by(time) %>%
    summarise(BS_cur = 1/n() * sum((Y_cur - p_cur)^2),
              BS_def = 1/n() * sum((Y_def - p_def)^2),
              BS_pre = 1/n() * sum((Y_pre - p_pre)^2))
  names(out) <- c("time", "BS_cur", "BS_def", "BS_pre")
  # caret's summaryFunction must return a single named numeric vector,
  # so stack the grouped results into one row
  out <- as.data.frame(out)
  out_stack <- stack(out)
  out_stack <- out_stack[(max(out$time)):length(out_stack[, 1]), ]
  out_stack <- out_stack[-1, ]
  out_stack$ind <- paste(out_stack$ind, out$time, sep = "_")
  out_final <- t(out_stack[, 1])
  names(out_final) <- out_stack[, 2]
  return(out_final)
}
# which type of cross validation to do
fitControl <- trainControl(method = 'cv',number=5,classProbs=TRUE,summaryFunction=customSummary, selectionFunction = "best", savePredictions = TRUE)
grid <- expand.grid(decay = 0 )
cv=train(as.factor(Ycateg)~.,
data = df,
method = "multinom",
maxit=150,
trControl = fitControl,
tuneGrid = grid
)
cv$resample
BS_cur_1 BS_cur_2 BS_cur_3 BS_cur_4 BS_cur_5 BS_def_1 BS_def_2 BS_def_3 BS_def_4 BS_def_5 BS_pre_1 BS_pre_2 BS_pre_3 BS_pre_4 BS_pre_5
1 0.1657623 0.1542842 0.1366912 0.1398001 0.2056348 0.1915512 0.2256758 0.2291467 0.2448737 0.2698545 0.1586134 0.2101389 0.1432483 0.2076886 0.1663780
2 0.1776843 0.1919503 0.1615440 0.1654297 0.1200515 0.2108787 0.2185783 0.2209958 0.2467931 0.2199898 0.1580643 0.1595971 0.2015860 0.1826029 0.1947144
3 0.1675981 0.1818885 0.1893253 0.1402550 0.1400997 0.2358501 0.2342476 0.2079819 0.1870549 0.2065355 0.2055711 0.1586077 0.1453172 0.1638555 0.2106146
4 0.1796041 0.1573086 0.1500860 0.1738538 0.1171626 0.2247850 0.2168341 0.2031590 0.1807209 0.2616180 0.1677508 0.1965577 0.1873078 0.1859176 0.1344115
5 0.1909324 0.1640292 0.1556209 0.1371598 0.1566207 0.2314311 0.1991000 0.2255612 0.2195158 0.2071910 0.1976272 0.1777507 0.1843828 0.1453439 0.1736540
Resample
1 Fold1
2 Fold2
3 Fold3
4 Fold4
5 Fold5
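If you then want the fold-by-time layout shown earlier in the question, one possible post-processing step (not part of the original post; it assumes the tidyr and dplyr packages) is to reshape cv$resample from wide to long:
library(dplyr)
library(tidyr)
cv$resample %>%
  pivot_longer(-Resample, names_to = c("class", "time"),
               names_pattern = "BS_(.*)_(.*)") %>%
  pivot_wider(names_from = class, values_from = value)
# gives one row per fold and time, with columns cur, def, and pre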

Prediction with lm

I have the following data frame:
lm mean_resids sd_resids resid1 resid2 resid3 intercept beta
1 0.000000e+00 6.2806844 -3.6261548 7.2523096 -3.6261548 103.62615 24.989340
2 -2.960595e-16 8.7515899 -5.0527328 10.1054656 -5.0527328 141.96786 -1.047323
3 -2.960595e-16 5.9138984 -3.4143908 6.8287817 -3.4143908 206.29046 -26.448694
4 3.700743e-17 0.5110845 0.2950748 -0.5901495 0.2950748 240.89801 -35.806642
5 7.401487e-16 6.6260504 3.8255520 -7.6511040 3.8255520 187.03479 -23.444762
6 5.921189e-16 8.7217431 5.0355007 -10.0710014 5.0355007 41.43239 3.138396
7 0.000000e+00 5.5269434 3.1909823 -6.3819645 3.1909823 -119.90628 27.817845
8 -1.480297e-16 1.0204260 -0.5891432 1.1782864 -0.5891432 -180.33773 35.623363
9 -5.921189e-16 6.9488186 -4.0119023 8.0238046 -4.0119023 -64.72245 21.820226
10 -8.881784e-16 8.6621512 -5.0010953 10.0021906 -5.0010953 191.65339 -5.218767
Each row represents an estimated linear model with window length 3. I used rollapply on a separate dataframe with the function lm(y ~ t) to extract the coefficients and intercepts into a new dataframe, which I have combined with the residuals from the same models and their corresponding means and standard deviations.
Since the window length is 3, it implies that there are 3 residuals as shown, per model, in resid 1, resid 2 and resid 3. The mean and sd of these are included accordingly.
I am seeking to predict the next observation, in essence, k+1, where k is the window length, using the intercept and beta.
Recall that lm1 takes observations 1,2,3 to estimate the intercept and the beta, and lm2 takes 2,3,4, lm3 takes 3,4,5, etc. The function for the prediction should be:
predict_lm1 = intercept_lm1 + beta_lm1*(k+1)
Where k+1 = 4. For lm2:
predict_lm2 = intercept_lm2 + beta_lm2*(k+1)
Where k+1 = 5.
Clearly, the value k+1 increases by 1 every time I move down one row in the dataset, because the explanatory variable is time, t, a sequence increasing by one per observation.
Should I use a for loop, or an apply function here?
How can I make a function that iterates down the rows and calculates the predictions accordingly with the information found in that row?
Thanks.
EDIT:
I managed to find a possible solution by writing the following:
n=nrow(dataset)
for(i in n){
predictions = dataset$Intercept + dataset$beta*(k+1)
}
However, k does not increase by 1 per iteration. Thus, k+1 is always = 4.
How can I make sure k increases by 1 accordingly?
EDIT 2
I managed to add 1 to k by writing the following:
n=nrow(dataset)
for(i in n){
x = 0
x[i] = k + 1
preds = dataset$`(Intercept)` + dataset$t*(x[i])
}
However, the first prediction is overestimated. It should be 203, whereas it is estimated as 228, implying that it sets the explanatory variable as 1 too high.
Yet, the second prediction is correct. I am not sure what I am doing wrong. Any advice?
EDIT 3
I managed to find a solution as follows:
n=nrow(dataset)
for(i in n){
x = k + 1
preds = dataset$`(Intercept)` + dataset$t*(x)
x = x + 1
}
Your loop is not iterating:
dataset <- read.table(text="lm meanresids sdresids resid1 resid2 resid3 intercept beta
1 0.000000e+00 6.2806844 -3.6261548 7.2523096 -3.6261548 103.62615 24.989340
2 -2.960595e-16 8.7515899 -5.0527328 10.1054656 -5.0527328 141.96786 -1.047323
3 -2.960595e-16 5.9138984 -3.4143908 6.8287817 -3.4143908 206.29046 -26.448694
4 3.700743e-17 0.5110845 0.2950748 -0.5901495 0.2950748 240.89801 -35.806642
5 7.401487e-16 6.6260504 3.8255520 -7.6511040 3.8255520 187.03479 -23.444762
6 5.921189e-16 8.7217431 5.0355007 -10.0710014 5.0355007 41.43239 3.138396
7 0.000000e+00 5.5269434 3.1909823 -6.3819645 3.1909823 -119.90628 27.817845
8 -1.480297e-16 1.0204260 -0.5891432 1.1782864 -0.5891432 -180.33773 35.623363
9 -5.921189e-16 6.9488186 -4.0119023 8.0238046 -4.0119023 -64.72245 21.820226
10 -8.881784e-16 8.6621512 -5.0010953 10.0021906 -5.0010953 191.65339 -5.218767", header=T)
n <- nrow(dataset)
predictions <- data.frame()
for (i in 1:n) {
  k <- i  ## not sure where k is coming from, but put it here
  predictions <- rbind(predictions, dataset$intercept[i] + dataset$beta[i] * (k + 1))
}
predictions
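Alternatively, assuming (as described in the question) a window length of 3 where model i is fit on observations i, i+1, i+2, the next time point for row i is simply i + 3, so the predictions can be computed without a loop. A sketch:
next_t <- seq_len(nrow(dataset)) + 3   # lm1 predicts t = 4, lm2 predicts t = 5, ...
dataset$prediction <- dataset$intercept + dataset$beta * next_t
head(dataset[, c("intercept", "beta", "prediction")])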

Simulating an interaction effect in a lmer() model in R

Is there an R package with a function that can:
(1) simulate the different values of an interaction variable,
(2) plot a graph that demonstrates the effect of the interaction on Y for different values of the terms in interaction, and
(3) works well with the models fitted with the lmer() function of the lme4 package?
I have looked in arm, ez, coefplot2, and fanovaGraph packages, but could not find what I was looking for.
I'm not sure about a package, but you can simulate data varying the terms in the interaction, and then graph it. Here is an example for a treatment-by-wave (i.e. longitudinal) interaction and the syntax to plot it. The story behind the example is a treatment to improve oral reading fluency in school-age children. The interaction term is modified by changing the value of the bX argument of the function.
library(arm)      # loads lme4 for lmer()
library(lattice)  # for xyplot() used below
sim1 <- function (b0=50, bGrowth=4.672,bX=15, b01=.770413, b11=.005, Vint=771, Vslope=2.24, Verror=40.34) {
#observation ID
oID<-rep(1:231)
#participant ID
ID<-rep(1:77, each=3)
tmp2<-sample(0:1,77,replace=TRUE,prob=c(.5,.5))
ITT<-tmp2[ID]
#longitudinal wave: for example 0, 4, and 7 months after treatment
wave <-rep(c(0,4,7), 77)
bvaset<-rnorm(77, 0, 11.58)
bva<-bvaset[ID]
#random effect intercept
S.in <- rnorm(77, 0, sqrt(Vint))
#random effect for slope
S.sl<-rnorm(77, 0, sqrt(Vslope))
#observation level error
eps <- rnorm(3*77, 0, sqrt(Verror))
#Create Outcome as product of specified model
ORFset <- b0 + b01*bva+ bGrowth*wave +bX*ITT*wave+ S.in[ID]+S.sl[ID]*wave+eps[oID]
#if else statement to eliminate ORF values below 0
ORF<-ifelse(ORFset<0,0,ORFset)
#Put into a data frame
mydata <- data.frame( oID,ID,ITT, wave,ORF,bva,S.in[ID],S.sl[ID],eps)
#run the model
fit1<-lmer(ORF~1+wave+ITT+wave:ITT+(1+wave|ID),data=mydata)
fit1
#grab variance components
vc<-VarCorr(fit1)
#Select Tau and Sigma to select in the out object
varcomps=c(unlist(lapply(vc,diag)),attr(vc,"sc")^2)
#Produce object to output
out<-c(coef(summary(fit1))[4,"t value"],coef(summary(fit1))[4,"Estimate"],as.numeric(varcomps[2]),varcomps[3])
#outputs T Value, Estimate of Effect, Tau, Sigma Squared
out
mydata
}
mydata<-sim1(b0=50, bGrowth=4.672, bX=1.25, b01=.770413, b11=.005, Vint=771, Vslope=2.24, Verror=40.34)
xyplot(ORF~wave,groups=interaction(ITT),data=mydata,type=c("a","p","g"))
Try plotLMER.fnc() from the languageR package, or the effects package.
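For example, with the effects package the fitted interaction from an lmer() model can be plotted directly. A minimal sketch (fit1 is assumed to be a model like the one fitted in the answer above; plotting arguments may need adjusting):
library(effects)
eff <- effect("wave:ITT", fit1)   # model-implied wave-by-ITT interaction
plot(eff, multiline = TRUE)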
The merTools package has some functionality to make this easier, though it only applies to working with lmer and glmer objects. Here's how you might do it:
library(merTools)
# fit an interaction model
m1 <- lmer(y ~ studage * service + (1|d) + (1|s), data = InstEval)
# select an average observation from the model frame
examp <- draw(m1, "average")
# create a modified data.frame by changing one value
simCase <- wiggle(examp, var = "service", values = c(0, 1))
# modify again for the studage variable
simCase <- wiggle(simCase, var = "studage", values = c(2, 4, 6, 8))
After this, we have our simulated data which looks like:
simCase
y studage service d s
1 3.205745 2 0 761 564
2 3.205745 2 1 761 564
3 3.205745 4 0 761 564
4 3.205745 4 1 761 564
5 3.205745 6 0 761 564
6 3.205745 6 1 761 564
7 3.205745 8 0 761 564
8 3.205745 8 1 761 564
Next, we need to generate prediction intervals, which we can do with merTools::predictInterval (or, without intervals, you could use predict() on the fitted merMod object).
preds <- predictInterval(m1, level = 0.9, newdata = simCase)
Now we get a preds object, which is a 3 column data.frame:
preds
fit lwr upr
1 3.312390 1.2948130 5.251558
2 3.263301 1.1996693 5.362962
3 3.412936 1.3096006 5.244776
4 3.027135 1.1138965 4.972449
5 3.263416 0.6324732 5.257844
6 3.370330 0.9802323 5.073362
7 3.410260 1.3721760 5.280458
8 2.947482 1.3958538 5.136692
We can then put it all together to plot:
library(ggplot2)
plotdf <- cbind(simCase, preds)
ggplot(plotdf, aes(x = service, y = fit, ymin = lwr, ymax = upr)) +
geom_pointrange() + facet_wrap(~studage) + theme_bw()
Unfortunately the data here results in a rather uninteresting, but easy to interpret plot.
