I have the following data in a data.table:
h x1 y1 swNx1
1: 1 39.075565717 0 1.03317231703408
2: 1 40.445951251 0 7.14418755725832
3: 1 37.800722944 0 0.435946586361557
4: 1 41.085221504 0 0.381347141150498
5: 1 36.318077491 0 0.497077163135359
---
24996: 25 39.110138193 0 0.942922612158002
24997: 25 39.331940413 0 1.42227399208458
24998: 25 37.479473784 0 0.390657876415799
24999: 25 35.892044242 0 0.599937357458247
25000: 25 40.699588303 0 0.486486760245521
I've created a function to analyse them in svyglm:
msmMC <- function(y, x, sw, name){
  msm <- svyglm(y ~ x, family = quasibinomial(link = "logit"),
                design = svydesign(~ 1, weights = ~ sw))
  out <- cbind("name", coef(summary(msm))[2,1], coef(summary(msm))[2,2])
  return(out)
}
msmswNx1<-dt2[,list(dtmsm=list(msmMC(y1, x1, swNx1, Nx1))),by="h"]
outNx1 <- unlist(dt.lm[,msmswNx1])
When I run this function, I get the following error:
Error in [.data.table(dt2, , list(dtmsm = list(msmMC(y1, x1, swNx1, :
column or expression 1 of 'by' or 'keyby' is type list. Do not quote column names. Useage: DT[,sum(colC),by=list(colA,month(colB))]
Yet it works fine with a different model, such as glm or polr. So what is going on here? Why is svyglm so picky about by-group processing with a data.table?
I doubt that it has worked for lm, glm or polr either, as the error is an argument-matching one.
You will need to wrap the whole thing in list():
dt2[,list(dtmsm=list(msmMC(y1, x1, swNx1, Nx1))),by="h"]
Or perhaps you have just misplaced the list() call, given that msmMC appears to return an object that might be a data.frame, list or data.table:
dt2[,list(dtmsm=msmMC(y1, x1, swNx1, Nx1)),by="h"]
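For what it's worth, here is a minimal sketch (untested, reusing the column names from the question) of how msmMC itself could be tidied up: build a small data frame for each group, pass it to svydesign() explicitly, and use the name argument rather than the literal string "name":
library(survey)

msmMC <- function(y, x, sw, name){
  d   <- data.frame(y = y, x = x, sw = sw)                 # per-group data
  des <- svydesign(ids = ~ 1, weights = ~ sw, data = d)    # design built from that data
  msm <- svyglm(y ~ x, family = quasibinomial(link = "logit"), design = des)
  cbind(name, coef(summary(msm))[2, 1], coef(summary(msm))[2, 2])
}

msmswNx1 <- dt2[, list(dtmsm = list(msmMC(y1, x1, swNx1, "Nx1"))), by = "h"]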
I am applying the lmer() function across all columns in a data frame. I have made a list of variables and used lapply; below is the code:
varlist=names(Genus_abundance)[5:ncol(Genus_abundance)]
lapply(varlist, function(x){lmer(substitute(i ~ Status + (1|Match), list(i=as.name(x), data=Genus_abundance, na.action = na.exclude)))})
However, I keep getting this error:
Error in eval(predvars, data, env) : object 'Acetatifactor' not found
I have checked and Acetatifactor is in the Genus_abundance dataframe.
I'm a bit stuck about where it's going wrong.
EDIT:
Added a working example:
set.seed(43)
n <- 6
dat <- data.frame(id = 1:n, Status = rep(LETTERS[1:2], n/2),
                  age = sample(18:90, n, replace=TRUE), match = 1:n,
                  Acetatifactor = runif(n), Acutalibacter = runif(n),
                  Adlercreutzia = runif(n))
head(dat)
id Status age match Acetatifactor Acutalibacter Adlercreutzia
1 1 A 49 1 0.1861022 0.1364904 0.8626298
2 2 B 31 2 0.7297301 0.8246794 0.3169752
3 3 A 23 3 0.4118721 0.5923042 0.2592606
4 4 B 64 4 0.4140497 0.7943970 0.7422665
5 5 A 60 5 0.4803101 0.7690324 0.7473611
6 6 B 79 6 0.4274945 0.9180564 0.9179040
library(lme4)
varlist <- names(dat)[5:7]   # the outcome columns in the working example
lapply(varlist,
       function(x){lmer(substitute(i ~ Status + (1|match), list(i=as.name(x))),
                        data=dat)
       })
The specific problem here is misplaced parentheses. You should close the substitute(..., list(i=as.name(x))) with three close-parentheses so that the whole chunk is properly understood as the first argument to lmer(); in the original code, data= and na.action= ended up inside the list() passed to substitute().
More generally, I agree with @Kat in the comments that this is a good place to look. Since your arguments are already strings (not symbols), you don't really need all of the substitute() business and could use:
fit_fun <- function(v) {
  lmer(reformulate(c("Status", "(1|match)"), response = v),
       data = dat, na.action = na.exclude)
}
lapply(varlist, fit_fun)
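For reference, reformulate() just builds the formula from character strings; a quick illustration with one of the outcome columns from the working example:
reformulate(c("Status", "(1|match)"), response = "Acetatifactor")
## Acetatifactor ~ Status + (1 | match)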
Or you could fit the model once for the first response column and then use refit() to update that fit with each of the other columns. For large models this is much more efficient.
m1 <- lmer(resp1 ~ status + (1|match), ...)
m_other <- lapply(dd[-(1:3)], refit, object = m1)
c(list(m1), m_other)
I want to create a custom contrast function in emmeans which could remove a given list of levels from the input vector and apply the built-in contrast method ("trt.vs.ctrl") on the remaining levels. An example dataset is available here. I am using the following R code for computing ANOVA and post hoc comparisons:
library(car)       # for Anova()
library(emmeans)
options(contrasts=c("contr.sum", "contr.poly"))
my_lm <- lm(D1 ~ C*R, data=df)
Anova(my_lm, type = "III")
#show Interaction effects using emmeans
emmip(my_lm, C ~ R )
emm = emmeans(my_lm, ~ C * R)
emm
contrast(emmeans(my_lm, ~ C * R), "consec", by = "C")
#compare 1st with next 3 groups (how to remove other three levels?)
contrast(emmeans(my_lm, ~ C * R), "trt.vs.ctrl", by = "R")
The built-in contrast option ("trt.vs.ctrl") compares the first level with everything that follows it (there are 7 factor levels in C, and I want to remove the last 3 of them and compute the contrasts only for the remaining 4). An example of how to write a custom contrast function is provided in the official documentation:
skip_comp.emmc <- function(levels, skip = 1, reverse = FALSE) {
  if((k <- length(levels)) < skip + 1)
    stop("Need at least ", skip + 1, " levels")
  coef <- data.frame()
  coef <- as.data.frame(lapply(seq_len(k - skip - 1), function(i) {
    sgn <- ifelse(reverse, -1, 1)
    sgn * c(rep(0, i - 1), 1, rep(0, skip), -1, rep(0, k - i - skip - 1))
  }))
  names(coef) <- sapply(coef, function(x)
    paste(which(x == 1), "-", which(x == -1)))
  attr(coef, "adjust") = "fdr" # default adjustment method
  coef
}
However, due to my limited understanding, I am not sure where to apply the modifications needed to customise this example. Any ideas?
Is this something you are going to want to do lots of times in the future? My guess is not: you only want to do this once, or a few times at most, in which case it is way too much trouble to write a custom contrast function. Just get the contrast coefficients you need, and use them as the second argument to contrast().
Now, consider these results:
> con <- emmeans:::trt.vs.ctrl.emmc(1:7)
> con
  2 - 1 3 - 1 4 - 1 5 - 1 6 - 1 7 - 1
1    -1    -1    -1    -1    -1    -1
2     1     0     0     0     0     0
3     0     1     0     0     0     0
4     0     0     1     0     0     0
5     0     0     0     1     0     0
6     0     0     0     0     1     0
7     0     0     0     0     0     1
From the description, I think you just want the first 3 sets of contrast coefficients. So use those columns:
contrast(emm, con[, 1:3], by = "R")
Update
StackOverflow can occasionally inspire developers to add software features. In this case, I decided it could be useful to add an exclude argument to most built-in .emmc functions in emmeans (all except poly.emmc()). This was fairly straightforward to do, and those features are now incorporated in the latest push to github -- https://github.com/rvlenth/emmeans. These features will be included in the next CRAN update as well.
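With a version of emmeans that has that argument, the original goal should reduce to something like the following sketch (assuming the last three of the seven levels of C are the ones to drop):
## drop levels 5-7 of C, then compare the remaining levels to the first
contrast(emmeans(my_lm, ~ C * R), "trt.vs.ctrl", by = "R", exclude = 5:7)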
I am trying to build a logistic regression model using the glm function in R. My dependent variable is binomial with values 0 and 1 only, where 0 = Non-Return and 1 = Return.
I want to model the Non-Returns (the 0's), but R's glm by default models the 1's. SAS's proc logistic models the lower value by default and has a descending option to change that order; is there something similar for glm?
One way to achieve this would be to swap 0 and 1 in my raw data, but I don't want to change the raw data.
How can I do the same thing in R?
Just specify 1 - y as the DV:
set.seed(42)
y <- sample(c(0, 1), 10, TRUE)
#[1] 1 1 0 1 1 1 1 0 1 1
fit <- glm(y ~ 1, family = binomial)
coef(fit)
# (Intercept)
# 1.386294
log(mean(y) / (1 - mean(y)))
#[1] 1.386294
1 - y
#[1] 0 0 1 0 0 0 0 1 0 0
fit1 <- glm(1 - y ~ 1, family = binomial)
coef(fit1)
#(Intercept)
#-1.386294
log(mean(1 - y) / (1 - mean(1 - y)))
#[1] -1.386294
Alternatively, you can temporarily flip the response in the data passed to glm by using transform():
glm(y ~ 1, family = binomial, data = transform(data.frame(y = y), y = 1 - y))
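Another option worth knowing (not part of the answer above, just a common idiom) is to make the response a factor: glm() then models the probability of the second factor level, so relevel() controls which outcome is treated as the event. A quick sketch with the same y:
y_f <- relevel(factor(y), ref = "1")   # put "1" first, so "0" becomes the modelled event
fit0 <- glm(y_f ~ 1, family = binomial)
coef(fit0)
# (Intercept)
#   -1.386294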
I have a logistic model fit, say myfit, that I've saved. The data frame I'm using is in the following format (the first column is the outcome):
medical10 age female nonwhite bmi smoked condxs insuredd smi2d
1 0 60 0 1 29.97 0 0 1 0
2 0 42 0 1 25.85 1 3 1 1
3 0 62 1 0 25.06 0 1 1 0
4 0 62 0 0 36.27 0 2 0 0
5 0 32 0 0 33.36 0 0 1 0
6 0 41 0 0 21.70 1 0 0 0
...
What I would like to do is to make a logistic plot (in this form: http://ww2.coastal.edu/kingw/statistics/R-tutorials/logistic.html) for each combination of variables.
Since there are 8 predictor variables, there are 8 such plots, each with one variable on the x-axis while the other 7 are held constant. Is there a way I can automate these plots using ggplot2?
For instance, if 'x' was age, I would get the mean of bmi, and then pick 0 for female, 0 for nonwhite, 0 for smoked, 0 for condxs, 0 for insuredd and 0 for smi2d. I would then do a prediction and make a ggplot of x vs y.
However, this is quite tedious and I was hoping there was a better way?
I don't know of anything in particular in ggplot2 that will make this easy, but I did find a way (though it was more work than I was expecting; perhaps others can improve). Anyway, first let's define a more useful set of sample data:
N <- 100
set.seed(15)
invlogit <- function(x) exp(x)/(exp(x)+1)
dd <- transform(data.frame(age    = runif(N, 30, 60),
                           female = sample(0:1, N, replace=TRUE),
                           white  = sample(c("Y","N"), N, replace=TRUE),
                           bmi    = rnorm(N, 30, 2)),
                medical = as.numeric(invlogit((-60 + 2*age - 1.5*bmi + 3*female)/10) > runif(N)))
fit <- glm(medical ~ ., data=dd, family=binomial)
So now we have some data and a model. Now I'll define a helper function that will predict values for a single variable while holding the others fixed (numeric variables at their mean, factors at a single fixed level).
predictone <- function(fit, var, xlim=NULL, fix=list(), n=101,
                       xname=var, type="response") {
  tt <- terms(fit)
  vv <- as.list(attr(tt, "variables"))[-c(1, attr(tt, "response")+1)]
  vn <- sapply(vv, deparse)                      # names of the predictor variables
  stopifnot(var %in% vn)
  others <- vn[vn != var]
  def <- lapply(others, function(x) {            # value at which to hold each other variable
    val <- fit$data[[x]]                         # defined up front so both branches below can use it
    if(x %in% names(fix)) {
      if(is.factor(val)) {
        stopifnot(fix[[x]] %in% levels(val))
        val[val==fix[[x]]][1]
      } else {
        fix[[x]]
      }
    } else {
      if(is.factor(val)) {
        val[val==names(sort(table(val))[1])][1]
      } else {
        mean(val)
      }
    }
  })
  if(is.factor(fit$data[[var]])) {
    newdata <- data.frame(def, unique(fit$data[[var]]))
  } else {
    if(is.null(xlim)) {
      xlim <- range(fit$data[[var]])
    }
    newdata <- data.frame(def, seq(min(xlim), max(xlim), length.out=n))
  }
  names(newdata) <- c(others, var)
  pp <- data.frame(newdata[[var]], predict(fit, newdata, type=type))
  names(pp) <- c(xname, type)
  attr(pp, "fixed") <- setNames(def, others)
  pp
}
Basically this function exists to calculate typical values for all the other variables and then do the actual prediction. We can use it with the test data to make a bunch of plots:
library(ggplot2)
plots <- lapply(names(dd)[1:4], function(x) {
  if(is.factor(dd[[x]])) {
    ggplot(predictone(fit, x), aes_string(x=x, y="response")) + geom_point()
  } else {
    ggplot(predictone(fit, x), aes_string(x=x, y="response")) + geom_line()
  }
})
require(gridExtra)
do.call(grid.arrange, plots)
which will return a grid of prediction plots, one panel per predictor (plot output not shown here).
Note that factors are treated differently than regular numeric values. When you code categorical variables with 0/1 R can't tell they are categorical so it doesn't do a good job of inferring the values which make sense. I would encourage you to convert 0/1 values to a proper factor variable.
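For example, a conversion along these lines, done before fitting the model (the labels are just an assumption about what 0/1 mean here):
dd$female <- factor(dd$female, levels = c(0, 1), labels = c("male", "female"))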
An update to the R rms package, to be posted on CRAN on or about 2015-01-01, includes a new function ggplot.Predict (called via ggplot()) that provides a general way to generate such curves using ggplot2, handling multiple moving variables, interactions, etc. You can see some example usage at https://github.com/harrelfe/rms/blob/master/man/ggplot.Predict.Rd. You can do all this with the current version of rms using lattice graphics and the plot.Predict function.
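A rough sketch of that workflow (untested, reusing the simulated dd from the previous answer and assuming an rms version that includes ggplot.Predict):
library(rms)
ddist <- datadist(dd); options(datadist = "ddist")           # rms needs a datadist to choose default ranges/levels
f <- lrm(medical ~ age + bmi + female + white, data = dd)    # logistic fit via rms
ggplot(Predict(f, age))    # prediction curve for age, other predictors held at typical values
# plot(Predict(f, age))    # lattice equivalent with the current CRAN version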
I need to analyze some simulated data with the following structure:
h c x1 y1 x1c10
1 0 37.607056431 104.83097593 5
1 1 27.615251557 140.85532974 10
1 0 34.68915314 114.59312842 2
1 1 30.090387454 131.60485642 9
1 1 39.274429397 106.76042522 10
1 0 33.839385007 122.73681319 2
...
where h ranges from 1 to 2500 and indexes the Monte Carlo sample, each sample with 1000 observations. I'm analysing these data with the following code, which gives me two objects (fdN1, fdQB101):
mc <- 2500 ## number of Monte Carlo samples
fdN1 <- matrix(0, mc, 1000)
fdQB101 <- matrix(0, mc, 1000) ## 2500x1000 storage matrices, elements zero
for(j in 1:mc){
  fdN1[j,] <- dnorm(residuals(lm(x1 ~ c, data=s[s$h==j,])),
                    mean(residuals(lm(x1 ~ c, data=s[s$h==j,]))),
                    sd(residuals(lm(x1 ~ c, data=s[s$h==j,]))))
  x1c10 <- as.matrix(subset(s, s$h==j, select=x1c10))
  fdQB100 <- as.matrix(predict(polr(as.factor(x1c10) ~ c,
                                    method="logistic", data=s[s$h==j,]),
                               type="probs"))
  indx10 <- as.matrix(cbind(as.vector(seq(1:nrow(fdQB100))), x1c10))
  fdQB101[j,] <- fdQB100[indx10]
}
The objects fdN1 and fdQB101 are 2500x1000 matrices with predicted probabilities as elements. I need to create a function out of this loop that I can call with lapply() or mclapply(). When I wrap this in the following function command:
ndMC <- function(mc){
for(j in 1:mc){
...
}
return(list(fdN1,fdQB101))
}
lapply(mc,ndMC)
the objects fdN1 and fdQB101 are each returned as 2500x1000 matrices of zeros, instead of the predicted probabilities. What am I doing wrong?
You should be able to do this with the data.table package. Here is an example:
library(data.table)
dt<-data.table(h=rep(1L,6), c=c(0L,1L,0L,1L,1L,0L),
X1=c(37.607056431,27.615251557,34.68915314,30.090387454,39.274429397,33.839385007),
y1=c(104.83097593,140.85532974,114.59312842,131.60485642,106.76042522,122.73681319),
x1c10=c(5L,10L,2L,9L,10L,2L))
## Create a linear model for every grouping of variable h:
fdN1.partial<-dt[,list(lm=list(lm(X1~c))),by="h"]
## Retrieve the linear model for h==1:
fdN1.partial[h==1,lm]
## [[1]]
##
## Call:
## lm(formula = X1 ~ c)
##
## Coefficients:
## (Intercept) c
## 35.379 -3.052
You could also write a function to generalize this solution:
f.dnorm <- function(y, x) {
  f <- lm(y ~ x)
  out <- list(dnorm(residuals(f), mean(residuals(f)), sd(residuals(f))))
  return(out)
}
## Generate two dnorm lists for every grouping of variable h:
dt.lm<-dt[,list(dnormX11=list(f.dnorm(X1,rep(1,length(X1)))), dnormX1c=list(f.dnorm(X1,c))),by="h"]
## Retrieve one of the dnorm lists for h==1:
unlist(dt.lm[h==1,dnormX11])
## 1 2 3 4 5 6
## 0.06296194 0.03327407 0.08884549 0.06286739 0.04248756 0.09045784
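The polr part of the original loop (the fdQB101 matrix) can be handled the same way. Below is an untested sketch along those lines, assuming MASS::polr and the column names from the toy dt above; it is intended for the full 1000-observation groups, since polr is unlikely to converge on a 6-row example:
library(MASS)  ## for polr()
f.polr <- function(x1c10, c) {
  fit <- polr(as.factor(x1c10) ~ c, method = "logistic")
  pp  <- predict(fit, type = "probs")   ## one row per observation, one column per outcome level
  idx <- cbind(seq_along(x1c10), match(as.character(x1c10), colnames(pp)))
  list(pp[idx])                         ## probability of the observed category, as a list column
}
## Predicted probability of the observed category for every grouping of variable h:
dt.polr <- dt[, list(pQB101 = f.polr(x1c10, c)), by = "h"]
unlist(dt.polr[h == 1, pQB101])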