Having issues using the lme4 predict function on my mixed models - r

I’m having a bit of a struggle trying to use the lme4 predict function on my mixed models. When making predictions I want to be able to set some of my explanatory variables to a specified level but average across others.
Here’s some made-up data that is a simplified, nonsense version of my original dataset:
a <- data.frame(
  TLR4 = factor(rep(1:3, each = 4, times = 4)),
  repro.state = factor(rep(c("a", "j"), each = 6, times = 8)),
  month = factor(rep(1:2, each = 8, times = 6)),
  sex = factor(rep(1:2, each = 4, times = 12)),
  year = factor(rep(1:3, each = 32)),
  mwalkeri = sample(0:15, 96, replace = TRUE),
  AvM = 1:96
)
The AvM number is the water vole identification number. The response variable (mwalkeri) is a count of the number of fleas on each vole. The main explanatory variable I am interested in is TLR4, a gene with 3 different genotypes (coded 1, 2 and 3). The other explanatory variables included are reproductive state (adult or juvenile), month (1 or 2), sex (1 or 2) and year (1, 2 or 3). My model looks like this (of course this model is now inappropriate for the made-up data, but that shouldn't matter):
install.packages("lme4")
library(lme4)

mm <- glmer(mwalkeri ~ TLR4 + repro.state + month + sex + year + (1 | AvM),
            data = a, family = poisson,
            control = glmerControl(optimizer = "bobyqa"))
summary(mm)
I want to make predictions about parasite burden for each different Tlr4 genotype while accounting for all the other covariates. To do this I created a new dataset to specify the level I wanted to set each of the explanatory variables to and used the predict function:
b <- data.frame(
  TLR4 = factor(1:3),
  repro.state = factor(c("a", "a", "a")),
  month = factor(rep(1, times = 3)),
  sex = factor(rep(1, times = 3)),
  year = factor(rep(1, times = 3))
)
predict(mm, newdata=b, re.form=NA, type="response")
This did work, but I would really prefer to average across years instead of setting year to one particular level. However, whenever I attempt to average over year I get this error message:
Error in model.frame.default(delete.response(Terms), newdata, na.action = na.action, : factor year has new level
Is it possible for me to average across years instead of selecting a specified level? Also, I've not worked out how to get the standard error associated with these predictions. The only way I've been able to get standard error for predictions was using the lsmeans() function (from the lsmeans package):
c <- lsmeans(mm, "TLR4", type="response")
summary(c, type="response")
Which automatically generates the standard error. However, this is generated by averaging across all the other explanatory variables. I'm sure it's probably possible to change that, but I would rather use the predict() function if I can. My goal is to create a graph with TLR4 genotype on the x-axis and predicted parasite burden on the y-axis, to demonstrate the predicted differences in parasite burden for each genotype while all other significant covariates are accounted for.

You might be interested in the merTools package which includes a couple of functions for creating datasets of counterfactuals and then making predictions on that new data to explore the substantive impact of variables on the outcome. A good example of this comes from the README and the package vignette:
Let's take the case where we want to explore the impact of a model with an interaction term between a category and a continuous predictor. First, we fit a model with interactions:
library(lme4)
library(merTools)

data(VerbAgg)
fmVA <- glmer(r2 ~ (Anger + Gender + btype + situ)^2 +
                (1 | id) + (1 | item),
              family = binomial, data = VerbAgg)
Now we prep the data using the draw function in merTools. Here we draw the average observation from the model frame. We then wiggle the data by expanding the dataframe to include the same observation repeated but with different values of the variable specified by the var parameter. Here, we expand the dataset to all values of btype, situ, and Anger.
# Select the average case
newData <- draw(fmVA, type = "average")
newData <- wiggle(newData, var = "btype", values = unique(VerbAgg$btype))
newData <- wiggle(newData, var = "situ", values = unique(VerbAgg$situ))
newData <- wiggle(newData, var = "Anger", values = unique(VerbAgg$Anger))
head(newData, 10)
#> r2 Anger Gender btype situ id item
#> 1 N 20 F curse other 5 S3WantCurse
#> 2 N 20 F scold other 5 S3WantCurse
#> 3 N 20 F shout other 5 S3WantCurse
#> 4 N 20 F curse self 5 S3WantCurse
#> 5 N 20 F scold self 5 S3WantCurse
#> 6 N 20 F shout self 5 S3WantCurse
#> 7 N 11 F curse other 5 S3WantCurse
#> 8 N 11 F scold other 5 S3WantCurse
#> 9 N 11 F shout other 5 S3WantCurse
#> 10 N 11 F curse self 5 S3WantCurse
Now we simply pass this new dataset to predictInterval in order to generate predictions for these counterfactuals. Then we plot the predicted values against the continuous variable, Anger, and facet and group on the two categorical variables situ and btype respectively.
library(ggplot2)

plotdf <- predictInterval(fmVA, newdata = newData, type = "probability",
                          stat = "median", n.sims = 1000)
plotdf <- cbind(plotdf, newData)

ggplot(plotdf, aes(y = fit, x = Anger, color = btype, group = btype)) +
  geom_point() +
  geom_smooth(aes(color = btype), method = "lm") +
  facet_wrap(~ situ) +
  theme_bw() +
  labs(y = "Predicted Probability")

Related

Using mice for panel data models

My panel dataset contains 7 variables and 1452 observations, covering 6 years. I would like to regress y on x while controlling for the other variables. The data contains quite a lot of missing observations: 35% for the independent variable, x, and 23% for the dependent one, y.
The variables a, b, and c also contain missing values, but not to that extent.
A toy dataset looks like this:
Name  Year  id  y     x     a  b  c
A     2015  1   6     n.a.  9  4  1
A     2016  1   n.a.  2     9  3  n.a.
I used multiple imputation as provided by the mice function, which worked well. Diagnostics of the distributions of the imputed datasets also seem to be okay. Here is my code (excluding the diagnostics):
library(mice)

predictormatrix <- quickpred(data,
                             include = c("a", "b", "c", "x", "y"),
                             exclude = c("Name", "Year", "id"),
                             mincor = 0.1)
imp <- mice(data,
            predictorMatrix = predictormatrix,
            m = 5,
            maxit = 5,
            meth = 'pmm')
I can manage to conduct a simple pooled regression:
fitimp <- with(imp, lm(y ~ x + a + b + c))
summary(pool(fitimp))
However, as a pooled OLS does not take into account the structure of the panel data, I would like to fit a fixed effects and a random effects model and decide on a model on the basis of the Hausman test. I tried using the with function like this:
fitimp.fe <- with(imp,
                  plm(y ~ x + a + b + c,
                      data = imp,
                      index = c("Name", "Year"),
                      effect = "individual", model = "within"))
summary(pool(fitimp.fe))
But it gives me an error: No tidy method for objects of class mids. Plus a warning: Infinite sample size assumed.
Apart from fitting a fixed and random effects model to the imputed datasets, I do not know how to compare them (as mentioned, e.g., on the basis of a Hausman test). Can this be done with the with function?
I've been trying to solve this for quite some time now and would be very grateful if someone could help me. I found a lot about imputation for multilevel data, but if I understood it correctly, this does not apply to my dataset. Last but not least, I've read multiple times to install broom.mixed, which didn't help.
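One hedged workaround sketch, not from the original post: fit plm() to each completed dataset directly and pool the resulting list of fits with as.mira(). This assumes pool() can find a tidy() method for plm objects (broom provides one):
library(plm)
library(broom)   # tidy() methods, needed by pool()

# Fit the fixed-effects ("within") model to each completed dataset.
fits.fe <- lapply(seq_len(imp$m), function(i) {
  plm(y ~ x + a + b + c,
      data = complete(imp, i),
      index = c("Name", "Year"),
      effect = "individual", model = "within")
})

# Pool with Rubin's rules; as.mira() wraps a plain list of fits.
summary(pool(as.mira(fits.fe)))
A per-imputation Hausman test is then possible with plm::phtest() on matching "within" and "random" fits, though pooling test results across imputations is a separate problem.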

How can I incorporate the prior weight into my GLM function?

I am trying to incorporate prior class weights for my dependent variable in my logistic regression in R, using the glm function. The dataset I am using is created to predict churn.
So far I am using the function below:
V1_log <- glm(CH1 ~ RET + ORD + LVB + REV3, data = trainingset,
              family = binomial(link = 'logit'))
What I am looking for is how the weights argument works and how to include it in the function, or whether there is another way to incorporate this. The dependent variable is a nominal variable with the values 0 or 1. The dataset is imbalanced: only 10% of observations have a value of 1 on the dependent variable CH1, and the other 90% have a value of 0. Therefore the weights are (0.1, 0.9).
My dataset is built up in such a way that the independent variables vary in data type between continuous and class variables.

Although the ratio of 0s to 1s is 1:9, it does not mean the weights are 0.1 and 0.9. The weights decide how much emphasis you want to give each observation relative to the others.
In your case, if you want to predict something, it is essential that you split your data into train and test sets and see what influence the weights have on prediction.
Below, using the Pima Indian diabetes example, I subsample the Yes type so that the training set has a 1:9 ratio.
set.seed(111)
library(MASS)

# we sample 90 from No and 10 from Yes
idx = unlist(mapply(sample, split(1:nrow(Pima.tr), Pima.tr$type), c(90, 10)))
Data = Pima.tr
trn = Data[idx, ]
test = Data[-idx, ]
table(trn$type)
No Yes
90 10
Let's try regressing with weight 9 for positives and 1 for negatives:
library(caret)

W = 9
lvl = levels(trn$type)
# if positive we give it the defined weight, otherwise set it to 1
fit_wts = ifelse(trn$type == lvl[2], W, 1)
fit = glm(type ~ ., data = trn, weights = fit_wts, family = binomial)
# evaluate on the test set
pred = ifelse(predict(fit, test, type = "response") > 0.5, lvl[2], lvl[1])
pred = factor(pred, levels = lvl)
confusionMatrix(pred, test$type, positive = lvl[2])
Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  34  26
       Yes  8  32
You can see from the above that it's doing OK, but you are missing 26 positives and falsely labeling 8 negatives as positive. Let's say we try W = 3:
W = 3
lvl = levels(trn$type)
fit_wts = ifelse(trn$type == lvl[2], W, 1)
fit = glm(type ~ ., data = trn, weights = fit_wts, family = binomial)
pred = ifelse(predict(fit, test, type = "response") > 0.5, lvl[2], lvl[1])
pred = factor(pred, levels = lvl)
confusionMatrix(pred, test$type, positive = lvl[2])
Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  39  30
       Yes  3  28
Now almost all of the positive calls are correct, but we still miss a lot of potential "Yes" cases. The bottom line: the code above will work, but you need to run some checks to figure out the right weight for your data. You can also look at the other statistics provided by confusionMatrix in caret to guide your choice, as in the sketch below.
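For instance, a short sketch (an extension of the code above, not from the original answer) that sweeps a few candidate weights and reports test-set sensitivity and specificity:
# Sweep candidate weights and compare test-set sensitivity/specificity.
for (W in c(1, 3, 5, 9)) {
  fit_wts = ifelse(trn$type == lvl[2], W, 1)
  fit = glm(type ~ ., data = trn, weights = fit_wts, family = binomial)
  pred = factor(ifelse(predict(fit, test, type = "response") > 0.5,
                       lvl[2], lvl[1]), levels = lvl)
  cm = confusionMatrix(pred, test$type, positive = lvl[2])
  cat(sprintf("W = %d: sensitivity = %.2f, specificity = %.2f\n",
              W, cm$byClass["Sensitivity"], cm$byClass["Specificity"]))
}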
In your dataset trainingset, create a column called weights_col that contains your weights (0.1, 0.9) and then run:
V1_log <- glm(CH1 ~ RET + ORD + LVB + REV3, data = trainingset,
              family = binomial(link = 'logit'), weights = weights_col)
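One way to build such a column, assuming CH1 is coded 0/1 and the intent is to up-weight the rare class (that mapping is an assumption on my part, since the answer doesn't specify it):
# Hypothetical construction: the rare class (CH1 == 1) gets weight 0.9,
# the majority class gets 0.1.
trainingset$weights_col <- ifelse(trainingset$CH1 == 1, 0.9, 0.1)
Note that glm() may emit a "non-integer #successes in a binomial glm!" warning with non-integer weights; the model still fits.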

Get marginal effect and predicted probability for glmer model in R

I'm trying to calculate both the predicted probability values and marginal effects values (with p-values) for a categorical variable over time in a logistic regression model in R. Basically, I want to know 1) the predicted probability of the response variable (an event occurring) in each year for sample sites in one of 2 categories and 2) the average marginal effect of a site being in 1 category vs. the other in each year. I can get predicted probability values using the ggeffects package and marginal effects values from the margins package, but I haven't figured out a way to get both sets of values from a single package.
So my questions are 1) is there a package/method to get both of these sets of values, and 2) if I get the predicted probability values from ggeffects and the marginal effects values from margins, are these values compatible? Or are there differences in the ways that the packages treat the models that mean I can't assume the marginal effects from one correspond to the predicted probabilities of the other? 3) In the margins package, how can I get the average marginal effect of the interaction of two factor variables over time? And 4) how can I get margins() to work with a large dataset?
Here is some sample data:
### Make dataset
df <- data.frame(year = rep(2001:2010, each = 100),
                 state = rep(c("montana", "idaho",
                               "colorado", "wyoming", "utah"),
                             times = 10, each = 20),
                 site_id = as.factor(rep(1:100, times = 10)),
                 cat_variable = as.factor(rep(0:1, times = 5, each = 10)),
                 ind_cont_variable = rnorm(100, mean = 20, sd = 5),
                 event_occurred = as.factor(sample(c(0, 1),
                                                   replace = TRUE,
                                                   size = 1000)))

### Add dummy columns for states
library(fastDummies)
df <- dummy_cols(df,
                 select_columns = "state",
                 remove_first_dummy = TRUE)
I'm interested in the effects of the state and the categorical variable on the probability that the event occurred, and in how the effect of the state and categorical variable changed over time. Here's the model:
library(lme4)
fit_state <- glmer(event_occurred ~ ind_cont_variable +
                     cat_variable * year * state +
                     (1 | site_id),
                   data = df,
                   family = binomial(link = "logit"),
                   nAGQ = 0,
                   control = glmerControl(optimizer = "nloptwrap"))
I can use ggeffects to get the predicted probability values for each state and category combination over time:
library(ggeffects)
fit_pp_state <- data.frame(ggpredict(fit_state,
                                     terms = c("year [all]",
                                               "cat_variable",
                                               "state")))
head(fit_pp_state)
### x = year, predicted = predicted probability, group = categorical variable level, facet = state
#      x predicted std.error  conf.low conf.high group    facet
# 1 2001 0.2835665 0.3981910 0.1535170 0.4634655     0 colorado
# 2 2001 0.5911911 0.3762090 0.4089121 0.7514289     0    idaho
# 3 2001 0.5038673 0.3719418 0.3288209 0.6779708     0  montana
# 4 2001 0.7101610 0.3964843 0.5297327 0.8420101     0     utah
# 5 2001 0.5714579 0.3747205 0.3901606 0.7354088     0  wyoming
# 6 2001 0.6788503 0.3892568 0.4963910 0.8192719     1 colorado
This is really great for visualizing the changes in predicted probability over time in the 5 states. But I can't figure out how to go from these values to estimates of marginal effects using ggeffects. Using the margins package, I can get the marginal effect of the categorical variable over time, but I'm not sure how to interpret the outputs of the two different packages together or if that's even appropriate (my first two questions). In addition, I'm not sure how to get margins to give me the marginal effect of a sample site being in each combination of categorical variable level/state over time (bringing me to my third question):
library(margins)
fit_state_me <- summary(margins(fit_state,
                                at = list(year = 2001:2010),
                                variables = "cat_variable"))
head(fit_state_me)
# factor year AME SE z p lower
# cat_variable1 2001.0000 0.0224 0.0567 0.3953 0.6926 -0.0887
# cat_variable1 2002.0000 0.0146 0.0490 0.2978 0.7659 -0.0814
# cat_variable1 2003.0000 0.0062 0.0418 0.1478 0.8825 -0.0757
# cat_variable1 2004.0000 -0.0026 0.0359 -0.0737 0.9413 -0.0731
# cat_variable1 2005.0000 -0.0117 0.0325 -0.3604 0.7186 -0.0754
# cat_variable1 2006.0000 -0.0208 0.0325 -0.6400 0.5222 -0.0845
The actual dataset I'm using is fairly large (the csv of raw data is 1.51 GB and the regression model object is 1.29 GB when I save it as a .rds file). When I try to use margins() on my data, I get an error message:
Error: cannot allocate vector of size 369.5 Gb
Any advice for getting around this issue so that I can use this function on my data?
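One hedged workaround sketch: margins() accepts a data argument, so the effects can be computed one year at a time on that year's rows, which keeps the expanded design matrix small. Note this averages over each year's own observations, so it is not identical to at = list(year = ...), which sets year for the full sample:
# Compute AMEs of cat_variable separately for each year's observations.
me_by_year <- do.call(rbind, lapply(2001:2010, function(yr) {
  s <- summary(margins(fit_state,
                       data = df[df$year == yr, ],
                       variables = "cat_variable"))
  s$year <- yr
  s
}))
me_by_year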
I'd be grateful for any tips-- packages I should check out, mistakes I'm making in my code or my conceptual understanding, etc. Thank you!

Wrong prediction in linear SVM

I am writing an R script which, when run, gives the predicted value of the dependent variable. All of my variables are categorical (as shown in the picture) and assigned a number; the total number of classes is 101 (each class is a song name).
So I have a training dataset which contains pairs like {(2,5,6,1)82, (2,5,6,1)45, (2,5,3,1)34, ...}. I trained on this dataset using a linear SVM in RStudio, and for some given values of (x,y,z,w) it gives correct answers. But even though records like (2,5,6,1)82 and (2,5,6,1)45 exist in the training dataset, why doesn't it predict 82 or 45? I am pretty confused, as it ignores those records and produces an entirely new output, 23.
library(e1071)

training_set = dataset
classifier = svm(formula = Song ~ .,
                 data = training_set,
                 type = 'C-classification',
                 kernel = 'linear')
y_pred = predict(classifier, data.frame(Emotion = 2, Pact = 5,
                                        Mact = 6, Session = 1))
What I want is for the prediction to come as close as possible. What can I do to achieve these goals?
1. Get at least the 10 closest outcomes instead of 1 in R.
2. Is a linear SVM a good model here?
3. How do I get values like 82 and 45 as in the training dataset, and if no entry is present, find the closest one? (Is there any model for this beyond simple Euclidean distance?)
What makes you think that your classifier will predict the same outcome for a set of predictors as your original observation? I think there might be some fundamental misconceptions about how classification works.
Here is a simple counter-example using a linear regression model. The same principle applies to your SVM.
Simulate some data
set.seed(2017);
x <- 1:10;
y <- x + rnorm(10);
We now modify one value of y and show the data of (x,y) pairs.
y[3] = -10;
df <- cbind.data.frame(x = x, y = y);
df;
# x y
#1 1 2.434201
#2 2 1.922708
#3 3 -10.000000
#4 4 2.241395
#5 5 4.930175
#6 6 6.451906
#7 7 5.041634
#8 8 7.998476
#9 9 8.734664
#10 10 11.563223
Fit a model and get predictions.
fit <- lm(y ~ x, data = df);
pred <- predict(fit);
Let's take a look at predicted responses y.pred and compare them to the original data (x, y).
data.frame(df, y.pred = pred)
# x y y.pred
#1 1 2.434201 -2.1343357
#2 2 1.922708 -0.7418526
#3 3 -10.000000 0.6506304
#4 4 2.241395 2.0431135
#5 5 4.930175 3.4355966
#6 6 6.451906 4.8280796
#7 7 5.041634 6.2205627
#8 8 7.998476 7.6130458
#9 9 8.734664 9.0055288
#10 10 11.563223 10.3980119
Note how the predicted response for x=3 is y.pred=0.65 even though you observed y=-10.
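As for getting the 10 closest outcomes rather than a single label (question 1), one hedged option is to refit the e1071 SVM with probability estimates enabled and rank the per-class probabilities. This is a sketch, and class probabilities from an SVM are themselves an approximation:
# Refit with probability estimates, then take the 10 most likely songs.
library(e1071)
classifier <- svm(Song ~ ., data = training_set,
                  type = 'C-classification', kernel = 'linear',
                  probability = TRUE)
newcase <- data.frame(Emotion = 2, Pact = 5, Mact = 6, Session = 1)
pred <- predict(classifier, newcase, probability = TRUE)
probs <- attr(pred, "probabilities")
head(sort(probs[1, ], decreasing = TRUE), 10)  # top 10 candidate songs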

Equivalent R script for repeated measures SAS script, nested design

I'm having trouble converting an SAS script to the corresponding R script.
The model is a repeated measures analysis of the response (resp) based on treatment (trt) with plot (plot) nested in the treatment.
SAS code:
data data_set;
input trt $ plot time resp;
datalines;
Burn 1 1 27
Burn 1 9 25
Burn 1 12 18
Burn 1 15 21
Burn 2 1 5
Burn 2 9 15
Burn 2 12 10
Burn 2 15 12
...
Unburn 1 1 57
Unburn 1 9 46
Unburn 1 12 49
Unburn 1 15 51
Unburn 2 1 43
Unburn 2 9 59
Unburn 2 12 59
Unburn 2 15 60
;
proc mixed data = data_set;
class trt plot time;
model resp = trt time trt*time / ddfm = kr;
repeated time / subject = trt(plot) type = vc rcorr;
run;
R code attempted (loading the data set from a CSV file):
library(nlme)
data.set <- read.csv( "data_set.csv" )
data.set$plot <- factor( data.set$plot )
data.set$time <- factor( data.set$time )
model1 <- lme( resp ~ trt + time + trt:time, data = data.set, random = ~1 | plot )
This works, but isn't the desired model. Other attempts I've tried have generally resulted in the error:
Error in getGroups.data.frame(dataMix, groups) :
invalid formula for groups
Basically I'm off in the weeds here...
Question 1: how to specify the same model in R as what is already specified in SAS?
Question 2: I want to be able to change the covariance matrix to replicate other work done in SAS. I believe I know how to do this with the correlation parameter for the lme function. But please correct me if I'm wrong.
Thanks in advance.
The specification of the model in R would logically be:
model1 <- lme( resp ~ trt + time + trt:time, data = data.set, random = ~1 | trt:plot )
This is because plot is nested in treatment per the coding or, alternatively, there is an interaction between plot and treatment. However, if specified this way, it generates the error mentioned above:
Error in getGroups.data.frame(dataMix, groups) : invalid formula for groups
The problem has to do with the levels introduced (I think) by using such an interaction. Regardless of the exact issue, it can be resolved by creating a combined treatment-plot predictor variable:
data.set$trtplot <- with( data.set, factor( paste( trt, plot, sep = "." ) ) )
And then performing the analysis as follows:
model1 <- lme( resp ~ trt + time + trt:time, data = data.set, random = ~ 1 | trtplot )
For completeness this could just as easily be the following, where each predictor variable is added plus the interaction:
model1 <- lme( resp ~ trt * time, data = data.set, random = ~ 1 | trtplot )
This then matches the results achieved in SAS when a Compound Symmetry (CS) covariance structure is specified (although the AIC criterion is different; I'm not sure why). That is slightly different from the SAS code above, where a Variance Components (VC) covariance structure is specified, but changing the structure type in the SAS code is straightforward.
As for comparing different covariance structures, this appears to be more of a challenge. The covariance structures that I would like to investigate are:
Compound Symmetry (CS) - done
Variance Components(VC)
Unstructured (UN)
Spatial Power (SP)
Any thoughts would be most welcome!
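For what it's worth, a hedged sketch of how those structures might be approximated in nlme. The mappings to the SAS types are rough correspondences (CS ~ corCompSymm, UN ~ corSymm plus varIdent, spatial power ~ corCAR1 on numeric time), and the calls are untested on this dataset:
# Compound symmetry within each treatment-plot.
m_cs <- lme( resp ~ trt * time, data = data.set, random = ~ 1 | trtplot,
             correlation = corCompSymm(form = ~ 1 | trtplot) )

# Roughly unstructured: general correlation plus per-time variances.
m_un <- lme( resp ~ trt * time, data = data.set, random = ~ 1 | trtplot,
             correlation = corSymm(form = ~ 1 | trtplot),
             weights = varIdent(form = ~ 1 | time) )

# Spatial-power analogue: continuous-time AR(1) on numeric time.
data.set$time_num <- as.numeric(as.character(data.set$time))
m_sp <- lme( resp ~ trt * time, data = data.set, random = ~ 1 | trtplot,
             correlation = corCAR1(form = ~ time_num | trtplot) )

anova(m_cs, m_un, m_sp)  # compare fits by AIC/BIC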
