Converting mixed model formula from SAS to R

I want to fit a mixed model using the nlme package in R that is equivalent to the following SAS code:
proc mixed data = one;
class var1 var2 year loc rep;
model yld = var1 * var2;
random loc year(loc) rep*year(loc);
EDIT: Explanation of what the experiment is about
The same combination of var1 and var2 was tested in replicates (rep; replicates are numbered 1:3). The replicates (rep) are considered random. This set of experiments is repeated over locations (loc) and years (year). Although replicates are numbered 1:3 within each location and year for convenience (they have no names of their own), replicate 1 within one location and year has no correlation with replicate 1 within another location and year.
I tried the following code:
require(nlme)
fm1 <- lme(yld ~ var1*var2, data = one, random = loc + year / loc + rep * year / loc)
Is my code correct?
EDIT: Data and model based on suggestions
You can download the example data file from the following link:
https://sites.google.com/site/johndatastuff/mydata1.csv
data$var1 <- as.factor(data$var1)
data$var2 <- as.factor(data$var2)
data$year <- as.factor(data$year)
data$loc <- as.factor(data$loc)
data$rep <- as.factor(data$rep)
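The same conversions in one pass (a compact equivalent, assuming the column names above):
# convert all classification variables to factors in one step
cols <- c("var1", "var2", "year", "loc", "rep")
data[cols] <- lapply(data[cols], as.factor)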
Following suggestions from the comments below:
fm1 <- lme(yld ~ var1*var2, data = data, random = ~ loc + year / loc + rep * year / loc)
Error in getGroups.data.frame(dataMix, groups) :
Invalid formula for groups
EXPECTED OUTPUT BASED ON SAS
Type 3 tests of fixed effects:
var1*var2: Num DF = 14, Den DF = 238, F value = 16.12, Pr > F < 0.0001
Covariance parameter estimates:
loc = 0, year(loc) = 922161, year*rep(loc) = 2077492, residual = 1109238
EDIT: Just for information, I tried the following model and am still getting errors:
require(lme4)
fm1 <- lmer(yld ~ var1*var2 + (1|loc) + (1|year / loc) + (1|rep : (year / loc)),
data = data)
Error in rep:`:` : NA/NaN argument
In addition: Warning message:
In rep:`:` : numerical expression has 270 elements: only the first used

Thanks for the more detailed information. I stored the data in d to avoid confusion with the data function and parameter; the commands work either way, but avoiding the name data is generally considered good practice.
Note that the interaction is hard to fit because of the lack of balance between var1 and var2; for reference, here are the crosstabs:
> xtabs(~var1 + var2, data=d)
var2
var1 1 2 3 4 5
1 18 18 18 18 18
2 0 18 18 18 18
3 0 0 18 18 18
4 0 0 0 18 18
5 0 0 0 0 18
Normally to just fit the interaction (and no main effects) you'd use : instead of *, but here it works best to make a single factor, like this:
d$var12 <- factor(paste(d$var1, d$var2, sep=""))
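Base R's interaction() builds the same combined factor and sidesteps the label collisions that paste() can produce when level labels share characters (not an issue here, since the levels are single digits):
# equivalent combined factor; drop unused level combinations
d$var12 <- interaction(d$var1, d$var2, drop = TRUE)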
Then with nlme, try
fm1 <- lme(yld ~ var12, random = ~ 1 | loc/year/rep, data = d)
anova(fm1)
and with lme4, try
fm1 <- lmer(yld ~ var12 + (1 | loc/year/rep), data=d)
anova(fm1)
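To compare with the SAS covariance parameter estimates, the variance components can be pulled from the lme4 fit (a sketch; the nlme fit has an analogous VarCorr() method):
# variance components, comparable to SAS's estimates for
# loc, year(loc), rep(year*loc), and the residual
as.data.frame(VarCorr(fm1))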
Also note that because nlme and lme4 have overlapping function names, you should load only one at a time into your R session; to switch, close and restart R. (Other ways exist, as sketched below, but restarting is the simplest.)
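A minimal sketch of switching packages without a restart (assuming nlme is the one currently loaded; unloading is not always perfectly clean, which is why restarting is the safer advice):
# unload nlme, then load lme4
detach("package:nlme", unload = TRUE)
library(lme4)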

Related

Error with predict function and zero-inflated negative binomial model in R

I have been given an Rdata file containing a large number of inputs and outputs from a regression model. I have been able to extract the data analyzed by the model and reproduce the parameter estimates. However, when I attempt to use the original predict statement I receive an error even though the predict statement does not return an error when applied to the model stored in the Rdata file.
I am hoping there is enough information presented below that someone may be able to tell me how to correct my predict statement, my.probs, even though I am not providing a functional reproducible example. This, I think, is the first time I have ever posted a question here without providing such an example. The data set contains > 100,000 observations, is somewhat sensitive and I am unsure how I would reproduce the Rdata file.
library(msm)
library(MASS)
library(pscl)
# model output returned when extracting the model name from the `Rdata` file
original.model
# Call:
# zeroinfl(formula = AA ~ log(BB) + CC + DD + CC:DD | log(BB) + DD, data = original.data,
# offset = log(EE), dist = "negbin")
#
# Count model coefficients (negbin with log link):
# (Intercept) log(BB) CC3 CC4 CC5 DDPrivate CC3:DDPrivate CC4:DDPrivate CC5:DDPrivate
# -2.05317 0.31178 -0.41402 -0.71208 -0.92290 0.17878 -0.18476 -0.18674 0.07307
# Theta = 0.8551
#
# Zero-inflation model coefficients (binomial with logit link):
# (Intercept) log(BB) DDPrivate
# 1.6724 -0.5022 0.9742
#
# Warning message:
# In deparse(x$call, width.cutoff = floor(getOption("width") * 0.85)) :
# invalid 'cutoff' value for 'deparse', using default
# data for new observation for use in the predict statement
new.data
# DD EE CC BB
# 1 Private 1 4 1118.948
str(new.data)
#'data.frame': 1 obs. of 4 variables:
# $ DD : Factor w/ 2 levels "Public","Private": 2
# $ EE : num 1
# $ CC : Factor w/ 4 levels "2","3","4","5": 3
# $ BB: num 1119
original.probs <- predict(original.model, new.data, type='prob')
original.probs
# truncated probabilities returned by the predict statement; these would sum to one if the vector were not truncated
c(0.7534319, 0.1552296, 0.05681916, 0.02133936, 0.008116065, 0.003110019, 0.001197667)
# reproduce the original model
my.version <- zeroinfl(formula = AA ~ log(BB) + CC + DD + CC:DD | log(BB) + DD, offset = log(EE), dist = "negbin")
# Error returned by the predict statement
my.probs <- predict(my.version, new.data, type='prob')
my.probs
# Error in exp(X %*% object$coefficients$count + offsetx)[, 1] :
# incorrect number of dimensions
# In addition: Warning message:
# In X %*% object$coefficients$count + offsetx :
# Recycling array of length 1 in array-vector arithmetic is deprecated.
# Use c() or as.vector() instead.
The predict function worked after I grouped the input variables into a data.frame and included the data option in the zeroinfl model statement:
my.data <- data.frame(AA = AA,
BB = BB,
CC = CC,
DD = DD,
EE = EE)
my.version <- zeroinfl(formula = AA ~ log(BB) + CC + DD + CC:DD | log(BB) + DD,
offset = log(EE), dist = "negbin", data = my.data)
summary(my.version)
my.probs <- predict(my.version, new.data, type='prob')
my.probs
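If the refit reproduces the original estimates, the two sets of predicted probabilities should agree; a quick hypothetical check (assuming both model objects are still in the session):
# compare the first seven probabilities from each model;
# predict() returns a 1-row matrix for the single new observation
all.equal(as.vector(original.probs)[1:7], as.vector(my.probs)[1:7],
          tolerance = 1e-6)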

Applying logistic regression to simple dataset

I have been trying to apply logistic regression (or another ML algorithm) to this simple data set, but I have failed miserably and got many errors.
dim(data)
[1] 11580 12
head(data)
ReturnJan ReturnFeb ReturnMar ReturnApr ReturnMay ReturnJune
1 0.08067797 0.06625000 0.03294118 0.18309859 0.130333952 -0.01764234
2 -0.01067989 0.10211539 0.14549595 -0.08442804 -0.327300392 -0.35926605
3 0.04774193 0.03598972 0.03970223 -0.16235294 -0.147426982 0.04858934
4 -0.07404022 -0.04816956 0.01821862 -0.02467917 -0.006036217 -0.02530364
5 -0.03104575 -0.21267723 0.09147609 0.18933823 -0.153846154 -0.10611511
6 0.57980016 0.33225225 -0.40546095 -0.06000000 0.060732113 -0.21536106
And the 12th column, the one I am trying to predict, looks like this:
PositiveDec
0
0
0
1
1
1
Here is my attempt
new.data <- data[,-12] #Remove labels' column
index <- sample(1:nrow(new.data), size = 0.8*nrow(new.data))#Split data
train.data <- new.data[index,]
test.data <- new.data[-index,]
fit.glm <- glm(data[,12]~.,data = data, family = "binomial")
You are getting there, but have several syntactic errors and, as pointed out in comments, need to leave your outcome variable in. This should work:
index <- sample(1:nrow(data), size = 0.8 * nrow(data))
train.data <- data[index, ]
fit.glm <- glm(PositiveDec ~ ., data = train.data, family = "binomial")
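With the model fit on the training rows, the held-out rows can be scored in the usual way (a sketch; the 0.5 cutoff is an assumption, not a tuned threshold):
# score the held-out rows and compute simple accuracy
test.data <- data[-index, ]
probs <- predict(fit.glm, newdata = test.data, type = "response")
pred <- as.integer(probs > 0.5)
mean(pred == test.data$PositiveDec)  # proportion correctly classified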

Representing Parametric Survival Model in 'Counting Process' form in JAGS

I'm trying to build a survival model in JAGS that allows for time-varying covariates. I'd like it to be a parametric model — for example, assuming survival follows the Weibull distribution (but I'd like to allow the hazard to vary, so exponential is too simple). So, this is essentially a Bayesian version of what can be done in the flexsurv package, which allows for time-varying covariates in parametric models.
Therefore, I want to be able to enter the data in 'counting-process' form, where each subject has multiple rows, each corresponding to a time interval in which their covariates remained constant (as described in this pdf or here). This is the (start, stop] formulation that the survival or flexsurv packages allow.
Unfortunately, every explanation of how to perform survival analysis in JAGS seems to assume one row per subject.
I attempted to take this simpler approach and extend it to the counting process format, but the model does not correctly estimate the distribution.
A Failed Attempt:
Here's an example. First we generate some data:
library('dplyr')
library('survival')
## Make the Data: -----
set.seed(3)
n_sub <- 1000
current_date <- 365*2
true_shape <- 2
true_scale <- 365
dat <- data_frame(person = 1:n_sub,
true_duration = rweibull(n = n_sub, shape = true_shape, scale = true_scale),
person_start_time = runif(n_sub, min= 0, max= true_scale*2),
person_censored = (person_start_time + true_duration) > current_date,
person_duration = ifelse(person_censored, current_date - person_start_time, true_duration)
)
person person_start_time person_censored person_duration
(int) (dbl) (lgl) (dbl)
1 1 11.81416 FALSE 487.4553
2 2 114.20900 FALSE 168.7674
3 3 75.34220 FALSE 356.6298
4 4 339.98225 FALSE 385.5119
5 5 389.23357 FALSE 259.9791
6 6 253.71067 FALSE 259.0032
7 7 419.52305 TRUE 310.4770
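Before moving to JAGS, a quick frequentist sanity check that the simulated data recover the true parameters (a sketch using survival::survreg; note that survreg's Weibull parameterization differs from rweibull's):
# intercept-only Weibull fit on the unsplit data; convert back
# from survreg's AFT parameterization
fit_check <- survreg(Surv(person_duration, !person_censored) ~ 1,
                     data = dat, dist = "weibull")
exp(coef(fit_check))  # should be near true_scale = 365
1 / fit_check$scale   # should be near true_shape = 2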
Then we split the data into 2 observations per subject. I'm just splitting each subject at time = 300 (unless they didn't make it to time = 300, in which case they get just one observation).
## Split into multiple observations per person: --------
cens_point <- 300 # <----- try changing to 0 for no split; if so, model correctly estimates
dat_split <- dat %>%
group_by(person) %>%
do(data_frame(
split = ifelse(.$person_duration > cens_point, cens_point, .$person_duration),
START = c(0, split[1]),
END = c(split[1], .$person_duration),
TINTERVAL = c(split[1], .$person_duration - split[1]),
CENS = c(ifelse(.$person_duration > cens_point, 1, .$person_censored), .$person_censored), # <— edited original post here due to bug; but problem still present when fixing bug
TINTERVAL_CENS = ifelse(CENS, NA, TINTERVAL),
END_CENS = ifelse(CENS, NA, END)
)) %>%
filter(TINTERVAL != 0)
person split START END TINTERVAL CENS TINTERVAL_CENS
(int) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1 1 300.0000 0 300.0000 300.00000 1 NA
2 1 300.0000 300 487.4553 187.45530 0 187.45530
3 2 168.7674 0 168.7674 168.76738 1 NA
4 3 300.0000 0 300.0000 300.00000 1 NA
5 3 300.0000 300 356.6298 56.62979 0 56.62979
6 4 300.0000 0 300.0000 300.00000 1 NA
Now we can set up the JAGS model.
## Set-Up JAGS Model -------
dat_jags <- as.list(dat_split)
dat_jags$N <- length(dat_jags$TINTERVAL)
inits <- replicate(n = 2, simplify = FALSE, expr = {
list(TINTERVAL_CENS = with(dat_jags, ifelse(CENS, TINTERVAL + 1, NA)),
END_CENS = with(dat_jags, ifelse(CENS, END + 1, NA)) )
})
model_string <-
"
model {
# set priors on reparameterized version, as suggested
# here: https://sourceforge.net/p/mcmc-jags/discussion/610036/thread/d5249e71/?limit=25#8c3b
log_a ~ dnorm(0, .001)
log(a) <- log_a
log_b ~ dnorm(0, .001)
log(b) <- log_b
nu <- a
lambda <- (1/b)^a
for (i in 1:N) {
# Estimate Subject-Durations:
CENS[i] ~ dinterval(TINTERVAL_CENS[i], TINTERVAL[i])
TINTERVAL_CENS[i] ~ dweibull( nu, lambda )
}
}
"
library('runjags')
param_monitors <- c('a', 'b', 'nu', 'lambda')
fit_jags <- run.jags(model = model_string,
burnin = 1000, sample = 1000,
monitor = param_monitors,
n.chains = 2, data = dat_jags, inits = inits)
# estimates:
fit_jags
# actual:
c(a=true_shape, b=true_scale)
Depending on where the split point is, the model estimates very different parameters for the underlying distribution. It only gets the parameters right if the data isn't split into the counting process form. It seems like this is not the way to format the data for this kind of problem.
If I am missing an assumption and my problem is less related to JAGS and more related to how I'm formulating the problem, suggestions are very welcome. I might be despairing that time-varying covariates can't be used in parametric survival models (and can only be used in models like the Cox model, which assumes proportional hazards and doesn't actually estimate the underlying distribution); however, as I mentioned above, the flexsurv package in R does accommodate the (start, stop] formulation in parametric models.
If anyone knows how to build a model like this in another language (e.g. Stan instead of JAGS), that would be appreciated too.
Edit:
Chris Jackson provides some helpful advice via email:
I think the T() construct for truncation in JAGS is needed here. Essentially for each period (t[i], t[i+1]) where a person is alive but the covariate is constant, the survival time is left-truncated at the start of the period, and possibly also right-censored at the end. So you'd write something like y[i] ~ dweib(shape, scale[i])T(t[i], )
I tried implementing this suggestion as follows:
model {
# same as before
log_a ~ dnorm(0, .01)
log(a) <- log_a
log_b ~ dnorm(0, .01)
log(b) <- log_b
nu <- a
lambda <- (1/b)^a
for (i in 1:N) {
# modified to include left-truncation
CENS[i] ~ dinterval(END_CENS[i], END[i])
END_CENS[i] ~ dweibull( nu, lambda )T(START[i],)
}
}
Unfortunately this doesn't quite do the trick. With the old code, the model was mostly getting the scale parameter right, but doing a very bad job on the shape parameter. With this new code, it gets very close to the correct shape parameter, but consistently over-estimates the scale parameter. I have noticed that the degree of over-estimation is correlated with how late the split point comes. If the split-point is early (cens_point = 50), there's not really any over-estimation; if it's late (cens_point = 350), there is a lot.
I thought maybe the problem could be related to 'double-counting' the observations: if we see a censored observation at t=300 and then, from that same person, an uncensored observation at t=400, it seems intuitive that this person contributes two data points to our inference about the Weibull parameters when really they should contribute just one. I therefore tried incorporating a random effect for each person; however, this completely failed, with huge estimates (in the 50-90 range) for the nu parameter. I'm not sure why that is, but perhaps that's a question for a separate post. Since I'm not sure whether the problems are related, you can find the code for this whole post, including the JAGS code for that model, here.
You can use the rstanarm package, which is a wrapper around Stan. It lets you use standard R formula notation to describe survival models. The stan_surv function accepts arguments in 'counting process' form, and different baseline hazard functions, including the Weibull, can be used to fit the model.
The survival part of rstanarm (the stan_surv function) is not yet available on CRAN, so you should install the package directly from mc-stan.org:
install.packages("rstanarm", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
Please see the code below:
library(dplyr)
library(survival)
library(rstanarm)
## Make the Data: -----
set.seed(3)
n_sub <- 1000
current_date <- 365*2
true_shape <- 2
true_scale <- 365
dat <- data_frame(person = 1:n_sub,
true_duration = rweibull(n = n_sub, shape = true_shape, scale = true_scale),
person_start_time = runif(n_sub, min= 0, max= true_scale*2),
person_censored = (person_start_time + true_duration) > current_date,
person_duration = ifelse(person_censored, current_date - person_start_time, true_duration)
)
## Split into multiple observations per person: --------
cens_point <- 300 # <----- try changing to 0 for no split; if so, model correctly estimates
dat_split <- dat %>%
group_by(person) %>%
do(data_frame(
split = ifelse(.$person_duration > cens_point, cens_point, .$person_duration),
START = c(0, split[1]),
END = c(split[1], .$person_duration),
TINTERVAL = c(split[1], .$person_duration - split[1]),
CENS = c(ifelse(.$person_duration > cens_point, 1, .$person_censored), .$person_censored), # <— edited original post here due to bug; but problem still present when fixing bug
TINTERVAL_CENS = ifelse(CENS, NA, TINTERVAL),
END_CENS = ifelse(CENS, NA, END)
)) %>%
filter(TINTERVAL != 0)
dat_split$CENS <- as.integer(!(dat_split$CENS))
# Fit STAN survival model
mod_tvc <- stan_surv(
formula = Surv(START, END, CENS) ~ 1,
data = dat_split,
iter = 1000,
chains = 2,
basehaz = "weibull-aft")
# Print fit coefficients
mod_tvc$coefficients[2]
unname(exp(mod_tvc$coefficients[1]))
Output, which is consistent with true values (true_shape <- 2; true_scale <- 365):
> mod_tvc$coefficients[2]
weibull-shape
1.943157
> unname(exp(mod_tvc$coefficients[1]))
[1] 360.6058
You can also look at the generated Stan code using rstan::get_stanmodel(mod_tvc$stanfit) to compare it with the attempts you made in JAGS.
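Beyond the two coefficients shown above, the usual rstanarm accessors also apply to the fitted object (a sketch):
# posterior summaries and convergence diagnostics
print(mod_tvc, digits = 3)
summary(mod_tvc)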

Problems with apply in R

I have a problem with using the apply function in R. I made the following function:
TrainSupportVectorMachines <- function(trainingData,kernel,G,C){
#### train the model
fit<-svm(Device~.,data=trainingData,kernel=kernel,probability=TRUE,
gamma =G, costs=C)
return(fit);
}
I want to train the model with different values of cost (C). Therefore, I tried the following command:
cst = matrix(2^(-4:-2),ncol=3)
kernl = "sigmoid"
fitSVMBP <- apply(cst,2,function(x)TrainSupportVectorMachines(dtr1,kernl,0.625,x))
My expectation is that fitSVMBP becomes a list of SVM models fitted with different values for cost. But I get a list of SVM models that all have a cost of 1.
Does anybody know what I am doing wrong?
EDIT:
I use the e1071 package.
And the dataset looks like:
> head(dtr1)
Device Geslacht Leeftijd Invultijd Type Maanden.geleden
1 pc M 45 16.0 A 15
2 pc V 43 27.5 A 3
3 pc V 28 16.0 A 15
4 pc V 17 10.0 A 13
5 pc M 56 16.0 A 15
6 pc M 50 27.5 A 3
You have called the argument costs and not cost. Here's an example using the sample data in ?svm so you can try this:
model <- svm(Species ~ ., data = iris, cost=.6)
model$cost
# [1] 0.6
model <- svm(Species ~ ., data = iris, costs=.6)
model$cost
# [1] 1
R will do partial matching (so in this case cos=.6 would work) but if you overspecify an argument it doesn't match.
Nor will it always complain if you give it an argument it doesn't expect:
> model <- svm(Species ~ ., data = iris, costs=.6, asjkdakjshd=1)
>
Because unmatched args get caught in the ... argument.
If you take this too far, you get:
> model <- svm(Species ~ ., data = iris, c=.122)
Error in svm.default(x, y, scale = scale, ..., na.action = na.action) :
argument 4 matches multiple formal arguments
because c matches cost, coef0, class.weights and cachesize.
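For completeness, here is the helper from the question with the argument name fixed so that the cost values actually get used (a sketch mirroring the question's cst, kernl, and dtr1 objects):
library(e1071)
TrainSupportVectorMachines <- function(trainingData, kernel, G, C) {
  #### train the model; note cost = C, not costs = C
  svm(Device ~ ., data = trainingData, kernel = kernel,
      probability = TRUE, gamma = G, cost = C)
}
# one model per candidate cost; each fit now carries its own cost
fitSVMBP <- apply(cst, 2, function(x)
  TrainSupportVectorMachines(dtr1, kernl, 0.625, x))
sapply(fitSVMBP, function(m) m$cost)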

MCMCglmm multinomial model in R

I'm trying to create a model using the MCMCglmm package in R.
The data are structured as follows, where dyad, focal, and other are all random effects, r and present are predictor variables, and resp1-resp5 are outcome variables that capture the number of observed behaviors of different subtypes:
dyad focal other r present village resp1 resp2 resp3 resp4 resp5
1 10101 14302 0.5 3 1 0 0 4 0 5
2 10405 11301 0.0 5 0 0 0 1 0 1
…
So a model with only one outcome (teaching) is as follows:
prior_overdisp_i <- list(R=list(V=diag(2),nu=0.08,fix=2),
G=list(G1=list(V=1,nu=0.08), G2=list(V=1,nu=0.08), G3=list(V=1,nu=0.08), G4=list(V=1,nu=0.08)))
m1 <- MCMCglmm(teaching ~ trait-1 + at.level(trait,1):r + at.level(trait,1):present,
random= ~idh(at.level(trait,1)):focal + idh(at.level(trait,1)):other +
idh(at.level(trait,1)):X + idh(at.level(trait,1)):village,
rcov=~idh(trait):units, family = "zipoisson", prior=prior_overdisp_i,
data = data, nitt = nitt.1, thin = 50, burnin = 15000, pr = TRUE, pl = TRUE, verbose = TRUE, DIC = TRUE)
Hadfield's course notes (Ch 5) give an example of a multinomial model that uses only a single outcome variable with 3 levels (sheep horns of 3 types). Similar treatment can be found here: http://hlplab.wordpress.com/2009/05/07/multinomial-random-effects-models-in-r/ This is not quite right for what I'm doing, but contains helpful background info.
Another reference (Hadfield 2010) gives an example of a multi-response MCMCglmm that follows the same format but uses cbind() to predict a vector of responses, rather than a single outcome. The same model with multiple responses would look like this:
m1 <- MCMCglmm(cbind(resp1, resp2, resp3, resp4, resp5) ~ trait-1 +
at.level(trait,1):r + at.level(trait,1):present,
random= ~idh(at.level(trait,1)):focal + idh(at.level(trait,1)):other +
idh(at.level(trait,1)):X + idh(at.level(trait,1)):village,
rcov=~idh(trait):units,
family = cbind("zipoisson","zipoisson","zipoisson","zipoisson","zipoisson"),
prior=prior_overdisp_i,
data = data, nitt = nitt.1, thin = 50, burnin = 15000, pr = TRUE, pl = TRUE, verbose = TRUE, DIC = TRUE)
I have two programming questions here:
How do I specify a prior for this model? I've looked at the materials mentioned in this post but just can't figure it out.
I've run a similar version with only two response variables, but I only get one slope, where I thought I should get a different slope for each resp variable. Where am I going wrong, or have I misunderstood the model?
Answer to my first question, based on the HLP post and some help from a colleague/stats consultant:
# values for prior
k <- 5 # originally: length(levels(dative$SemanticClass)), so k = # of outcomes for SemanticClass aka categorical outcomes
I <- diag(k-1) #should make matrix of 0's with diagonal of 1's, dimensions k-1 rows and k-1 columns
J <- matrix(rep(1, (k-1)^2), c(k-1, k-1)) # should make k-1 x k-1 matrix of 1's
And for my model, using the multinomial5 family and 5 outcome variables, the prior is:
prior = list(
R = list(fix=1, V=0.5 * (I + J), n = 4),
G = list(
G1 = list(V = diag(4), n = 4)))
For my second question, I need to add an interaction term to the fixed effects in this model:
m <- MCMCglmm(cbind(Resp1, Resp2...) ~ -1 + trait*predictorvariable,
...
The result gives both main effects for the Response variables and posterior estimates for the Response/Predictor interaction (the effect of the predictor variable on each response variable).
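To see the per-response estimates, the fixed-effect posterior summary can be inspected (a sketch, assuming the fitted object m from above):
# posterior means, credible intervals, and pMCMC for each
# trait and trait:predictor term
summary(m)$solutions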
