Looping over dependent variable in a mixed model - r

I am trying to loop over multiple response variables in a mixed model (using the rptGaussian function from the rptR package), but I have been unable to get it working despite several attempts. The following code, without a loop, works fine:
rptGaussian(Arg ~ (1|class)+(1|kit)+(1|sex),
grname=c("class","kit","sex","Fixed"),
data=ggm2, nboot=10, npermut=10, adjusted=FALSE)
However, when I try to loop more variables I get the error
Error in terms.default(formula) : no terms component nor attribute
I am trying the following code for the loop.
varlist<-c("var1", "var2")
blups.models <- lapply(varlist, function(x) {
rptGaussian(substitute(i ~ (1|class)+(1|kit)+(1|sex),
list(i = as.name(x))),
grname=c("class","kit","lab","Fixed"),
data=ggm2, nboot=10, npermut=10, adjusted=FALSE)
})
Here is a dummy data table:
sex class kit var1 var2 var3 var4
Female A Cont 10.79730768 10 20 18
Female A Exp 11.2474347 17 1 17
Female A Cont 11.64820939 10 5 17
Female A Exp 15.62800413 20 8 4
Female B Cont 12.41705885 5 16 8
Female B Exp 12.80249244 9 10 1
Female B Cont 10.76949177 6 13 2
Female B Exp 14.71370141 7 12 11
Male A Cont 8.931529823 8 3 6
Male A Exp 10.46899683 3 12 13
Male A Cont 8.363257621 3 13 17
Male A Exp 8.753117911 10 16 10
Male B Cont 9.110946315 9 13 4
Male B Exp 9.595131886 18 10 17
Male B Cont 9.454670188 1 10 11
Male B Exp 10.59379123 11 1 3

In general this kind of looping is easier (IMO) with string-based solutions, especially the reformulate() wrapper function, than with substitute().
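To see why the substitute() version fails, it may help to compare what each approach produces. The sketch below uses only base R (no rptR call), and the names f_sub and f_ref are just for illustration. substitute() hands back an unevaluated call rather than a formula, which would explain the terms.default() error, assuming rptGaussian() passes the object on to code that expects a formula.

```r
x <- "var1"

# substitute() returns an unevaluated call, not a formula object
f_sub <- substitute(i ~ (1|class) + (1|kit) + (1|sex), list(i = as.name(x)))
class(f_sub)        # "call"

# reformulate() builds a real formula from strings
f_ref <- reformulate("(1|class) + (1|kit) + (1|sex)", response = x)
class(f_ref)        # "formula"

# evaluating the substituted call also yields a formula
class(eval(f_sub))  # "formula"
```

So another possible fix, if you would rather keep substitute(), is to wrap its result in eval() before passing it on.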
I used read.table(header=TRUE,text="...") to read the data above and this slightly modified code for the single model:
library(rptR)
r1 <- rptGaussian(var1 ~ (1|class)+(1|kit)+(1|sex),
grname=c("class","kit","sex","Fixed"),
data=ggm2, nboot=10, npermut=10, adjusted=FALSE)
For multiple models:
varlist <- c("var1", "var2")
Make list of formulas:
formulas <- lapply(varlist,
reformulate,
termlabels="(1|class)+(1|kit)+(1|sex)")
Apply rptGaussian to formulas:
blups.models <- lapply(formulas,
rptGaussian,
grname=c("class","kit","sex","Fixed"),
data=ggm2, nboot=10, npermut=10, adjusted=FALSE)
If you want to collapse the results to a nice form, you have to figure out how to extract the results from a single fit into a data frame or similar structure. In this case the result is an rpt object, and methods(class="rpt") tells you that there are only print, plot, and summary methods, but the summary() method returns an object that has lots of potentially useful bits. Here's an example:
## extract estimates and standard errors of estimates as a 1-row data frame
sumfun <- function(x) {
ss <- summary(x)
se.names <- paste(rownames(ss$se),"se",sep=".")
cbind(ss$R,setNames(as.data.frame(t(ss$se)),se.names))
}
A possibly-better alternative would be to return data.frame(term=names(ss$R),rpt=unlist(ss$R),se=ss$se) (a 3-column by n-row data frame) instead.
I'm going to use dplyr::bind_rows() because it's handy, but you could use base-R tools (do.call(rbind, ...)) instead if you prefer.
names(blups.models) <- varlist
dplyr::bind_rows(lapply(blups.models,sumfun),
.id="var")
var class kit sex Fixed class.se kit.se sex.se Fixed.se
1 var1 0 0.1444659 0.65887365 0 0.04992624 0.2136589 0.2954982 0
2 var2 0 0.3322780 0.01734343 0 0.01981748 0.2243989 0.1158878 0
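For reference, a base-R version of the row-binding step might look like the sketch below. It does not call rptR: the list res is a hypothetical stand-in for lapply(blups.models, sumfun), and its numbers are made up purely for illustration.

```r
# Toy stand-ins for the 1-row data frames returned by sumfun();
# the numbers are made up for illustration
res <- list(
  var1 = data.frame(class = 0, kit = 0.1, class.se = 0.05, kit.se = 0.2),
  var2 = data.frame(class = 0, kit = 0.3, class.se = 0.02, kit.se = 0.2)
)

# rbind() uses the list names as row names when each piece has one row
combined <- do.call(rbind, res)

# Move the row names into a leading "var" column, mimicking .id = "var"
combined <- cbind(var = rownames(combined), combined, stringsAsFactors = FALSE)
rownames(combined) <- NULL
combined
```

Unlike bind_rows(), rbind() requires every piece to have the same column names, which holds here because all models share the same grname.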
Are you sure it makes sense to do repeatability scores across sexes and other categories with small numbers of levels?

Related

sandwich + mlogit: `Error in ef/X : non-conformable arrays` when using `vcovHC()` to compute robust/clustered standard errors

I am trying to compute robust/clustered standard errors after using mlogit() to fit a multinomial logit (MNL) model in a discrete choice problem. Unfortunately, I suspect I am running into problems because my data are in long format (this is a must in my case): I get the error `Error in ef/X : non-conformable arrays` after calling sandwich::vcovHC( , "HC0").
The Data
For illustration, please consider the following data. It contains 5 individuals (id_ind) who each choose among 3 alternatives (altern). Each individual chose three times, so there are 15 choice situations (id_choice). Each alternative is described by two generic attributes (x1 and x2), and the choices are recorded in y (1 if selected, 0 otherwise).
df <- read.table(header = TRUE, text = "
id_ind id_choice altern x1 x2 y
1 1 1 1 1.586788801 0.11887832 1
2 1 1 2 -0.937965347 1.15742493 0
3 1 1 3 -0.511504401 -1.90667519 0
4 1 2 1 1.079365680 -0.37267925 0
5 1 2 2 -0.009203032 1.65150370 1
6 1 2 3 0.870474033 -0.82558651 0
7 1 3 1 -0.638604013 -0.09459502 0
8 1 3 2 -0.071679538 1.56879334 0
9 1 3 3 0.398263302 1.45735788 1
10 2 4 1 0.291413453 -0.09107974 0
11 2 4 2 1.632831160 0.92925495 0
12 2 4 3 -1.193272276 0.77092623 1
13 2 5 1 1.967624379 -0.16373709 1
14 2 5 2 -0.479859282 -0.67042130 0
15 2 5 3 1.109780885 0.60348187 0
16 2 6 1 -0.025834772 -0.44004183 0
17 2 6 2 -1.255129594 1.10928280 0
18 2 6 3 1.309493274 1.84247199 1
19 3 7 1 1.593558740 -0.08952151 0
20 3 7 2 1.778701074 1.44483791 1
21 3 7 3 0.643191170 -0.24761157 0
22 3 8 1 1.738820924 -0.96793288 0
23 3 8 2 -1.151429915 -0.08581901 0
24 3 8 3 0.606695064 1.06524268 1
25 3 9 1 0.673866953 -0.26136206 0
26 3 9 2 1.176959443 0.85005871 1
27 3 9 3 -1.568225496 -0.40002252 0
28 4 10 1 0.516456176 -1.02081089 1
29 4 10 2 -1.752854918 -1.71728381 0
30 4 10 3 -1.176101700 -1.60213536 0
31 4 11 1 -1.497779616 -1.66301234 0
32 4 11 2 -0.931117325 1.50128532 1
33 4 11 3 -0.455543630 -0.64370825 0
34 4 12 1 0.894843784 -0.69859139 0
35 4 12 2 -0.354902281 1.02834859 0
36 4 12 3 1.283785176 -1.18923098 1
37 5 13 1 -1.293772990 -0.73491317 0
38 5 13 2 0.748091387 0.07453705 1
39 5 13 3 -0.463585127 0.64802031 0
40 5 14 1 -1.946438667 1.35776140 0
41 5 14 2 -0.470448172 -0.61326604 1
42 5 14 3 1.478763383 -0.66490028 0
43 5 15 1 0.588240775 0.84448489 1
44 5 15 2 1.131731049 -1.51323232 0
45 5 15 3 0.212145247 -1.01804594 0
")
The problem
We can fit an MNL using mlogit() and then try to extract its robust variance-covariance matrix as follows:
library(mlogit)
library(sandwich)
mo <- mlogit(formula = y ~ x1 + x2|0 ,
method ="nr",
data = df,
idx = c("id_choice", "altern"))
sandwich::vcovHC(mo, "HC0")
#Error in ef/X : non-conformable arrays
As we can see, sandwich::vcovHC produces an error saying that ef/X is non-conformable, where X <- model.matrix(x) and ef <- estfun(x, ...). After looking through the source code on the GitHub mirror, I spotted the problem: because the data are in long format, ef has dimensions 15 x 2 while X has dimensions 45 x 2.
My workaround
Since the show must go on, I am computing the robust and clustered standard errors manually, using code that I borrowed from sandwich and adjusted to match Stata's output.
> Robust Standard Errors
These lines are inspired by the sandwich::meat() function.
psi<- estfun(mo)
k <- NCOL(psi)
n <- NROW(psi)
rval <- (n/(n-1))* crossprod(as.matrix(psi))
vcov(mo) %*% rval %*% vcov(mo)
# x1 x2
# x1 0.23050261 0.09840356
# x2 0.09840356 0.12765662
Stata Equivalent
qui clogit y x1 x2 ,group(id_choice) r
mat li e(V)
symmetric e(V)[2,2]
y: y:
x1 x2
y:x1 .23050262
y:x2 .09840356 .12765662
> Clustered Standard Errors
Here, given that each individual answers 3 questions, it is highly likely that there is some degree of correlation among the answers of the same individual; hence a cluster correction should be preferred in such situations. Below I compute the cluster correction and show its equivalence with the Stata output of clogit , cluster().
id_ind_collapsed <- df$id_ind[!duplicated(mo$model$idx$id_choice)]
psi_2 <- rowsum(psi, group = id_ind_collapsed )
k_cluster <- NCOL(psi_2)
n_cluster <- NROW(psi_2)
rval_cluster <- (n_cluster/(n_cluster-1))* crossprod(as.matrix(psi_2))
vcov(mo) %*% rval_cluster %*% vcov(mo)
# x1 x2
# x1 0.1766707 0.1007703
# x2 0.1007703 0.1180004
Stata equivalent
qui clogit y x1 x2 ,group(id_choice) cluster(id_ind)
symmetric e(V)[2,2]
y: y:
x1 x2
y:x1 .17667075
y:x2 .1007703 .11800038
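The clustering step above boils down to summing the score rows within each cluster before taking the cross-product. The following base-R sketch illustrates this on made-up numbers; the matrix psi and the vector cluster here are toy stand-ins, not the objects from the mlogit fit.

```r
set.seed(1)
psi <- matrix(rnorm(12), ncol = 2)   # 6 observations x 2 parameters, made-up scores
cluster <- c(1, 1, 2, 2, 3, 3)       # hypothetical cluster membership

# One summed score row per cluster (3 x 2)
psi_cl <- rowsum(psi, group = cluster)

# Clustered "meat" with the G/(G-1) small-sample adjustment
g <- nrow(psi_cl)
meat_cl <- (g / (g - 1)) * crossprod(psi_cl)

# Same thing written as a sum of outer products, cluster by cluster
manual <- Reduce(`+`, lapply(split(seq_len(nrow(psi)), cluster), function(idx) {
  tcrossprod(colSums(psi[idx, , drop = FALSE]))
}))
all.equal(meat_cl, (g / (g - 1)) * manual, check.attributes = FALSE)
```

With every observation in its own cluster, rowsum() leaves the scores unchanged and the expression reduces to the n/(n-1) version used for the robust standard errors.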
The Question:
I would like to accommodate my computations within the sandwich ecosystem, meaning not computing the matrices manually but actually using the sandwich functions. Is it possible to make it work with models in long format like the one described here? For example, providing the meat and bread objects directly to perform the computations? Thanks in advance.
PS: I noted that there is a dedicated bread function in sandwich for mlogit, but I could not spot something like meat for mlogit, but anyways I am probably missing something here...
Why vcovHC does not work for mlogit
The class of HC covariance estimators can only be applied in models with a single linear predictor where the score function (aka estimating function) is the product of so-called "working residuals" and a regressor matrix. This is explained in some detail in the Zeileis (2006) paper (see Equation 7), provided as vignette("sandwich-OOP", package = "sandwich") in the package. The ?vcovHC help page also pointed to this but did not explain it very well. I have improved this in the documentation at http://sandwich.R-Forge.R-project.org/reference/vcovHC.html now:
The function meatHC is the real work horse for estimating the meat of HC sandwich estimators - the default vcovHC method is a wrapper calling sandwich and bread. See Zeileis (2006) for more implementation details. The theoretical background, exemplified for the linear regression model, is described below and in Zeileis (2004). Analogous formulas are employed for other types of models, provided that they depend on a single linear predictor and the estimating functions can be represented as a product of “working residual” and regressor vector (Zeileis 2006, Equation 7).
This means that vcovHC() is not applicable to multinomial logit models as they generally use separate linear predictors for the separate response categories. Similarly, two-part or hurdle models etc. are not supported.
Basic "robust" sandwich covariance
Generally, for computing the basic Eicker-Huber-White sandwich covariance matrix estimator, the best strategy is to use the sandwich() function and not the vcovHC() function. The former works for any model with estfun() and bread() methods.
For linear models sandwich(..., adjust = FALSE) (default) and sandwich(..., adjust = TRUE) correspond to HC0 and HC1, respectively. In a model with n observations and k regression coefficients the former standardizes with 1/n and the latter with 1/(n-k).
Stata, however, standardizes with 1/(n-1) in logit models, see:
Different Robust Standard Errors of Logit Regression in Stata and R. To the best of my knowledge there is no clear theoretical reason for using specifically one or the other adjustment. And already in moderately large samples, this makes no difference anyway.
Remark: The adjustment with 1/(n-1) is not directly available in sandwich() as an option. However, coincidentally, it is the default in vcovCL() without specifying a cluster variable (i.e., treating each observation as a separate cluster). So this is a convenient "trick" if you want to get exactly the same results as Stata.
Clustered covariance
This can be computed "as usual" via vcovCL(..., cluster = ...). For mlogit models, the only thing to keep in mind is that the cluster variable needs to be provided once per choice situation (as opposed to stacked several times in long format).
Replicating Stata results
With the data and model from your post:
vcovCL(mo)
## x1 x2
## x1 0.23050261 0.09840356
## x2 0.09840356 0.12765662
vcovCL(mo, cluster = df$id_ind[!duplicated(df$id_choice)])
## x1 x2
## x1 0.1766707 0.1007703
## x2 0.1007703 0.1180004

Perform multiple survival analysis with loop in R

I have recently been working on survival analysis with R. I have two data frames: geneDf for gene expression and survDf for the follow-up. Samples of both are shown below:
#Data frame: geneDf
geneID=c("EGFR","Her2","E2F1","PTEN")
patient1=c(12,23,56,23)
patient2=c(23,34,11,6)
patient3=c(56,44,32,45)
patient4=c(23,64,45,23)
geneDf=data.frame(patient1,patient2,patient3,patient4,geneID)
> geneDf
patient1 patient2 patient3 patient4 geneID
1 12 23 56 23 EGFR
2 23 34 44 64 Her2
3 56 11 32 45 E2F1
4 23 6 45 23 PTEN
#Data frame:survDf
ID=c("patient1","patient2","patient3","patient4")
time=c(23,7,34,56)
status=c(1,0,1,1)
survDf=data.frame(ID,time,status)
#
> survDf
ID time status
1 patient1 23 1
2 patient2 7 0
3 patient3 34 1
4 patient4 56 1
I extract the expression data of a specific gene from geneDf, use the median of its expression as the cut-off value to perform survival analysis with the “survival” package, and obtain the p-value from survdiff. In the following code I use the "EGFR" gene as an example.
#extract expression of a certain gene
targetGene<-subset(geneDf,grepl("EGFR",geneDf$geneID))
targetGene$geneID<-NULL
#Transpose the table and adjust its format
targetGene<-t(targetGene[,1:ncol(targetGene)])
targetGene<-data.frame(as.factor(rownames(targetGene)),targetGene)
colnames(targetGene)<-c("ID","Expression")
rownames(targetGene)<-NULL
targetGene$Expression1<-targetGene$Expression
targetGene$Expression1[ targetGene$Expression<median( targetGene$Expression)]<-1
targetGene$Expression1[ targetGene$Expression>=median( targetGene$Expression)]<-2
#Survival analysis
library(survival)
##Add survival object
survDf$SurvObj<-with(survDf, Surv(time,status==1))
## Kaplan-Meier estimator for stage
km<-survfit(SurvObj~targetGene$Expression1, data=survDf, conf.type = "log-log")
sdf<-survdiff(Surv(time, status) ~targetGene$Expression1, data=survDf)
#gain p value
p.val <-1-pchisq(sdf$chisq, length(sdf$n) - 1)
> p.val
[1] 0.1572992
I can do this for different genes one by one. But the question is: there are more than 10,000 genes to be analyzed. I want to obtain the p-values for all of them and put them in a new data frame. Do I need to use a loop or apply?
This is an ugly script, but it works.
In Data10, the first column must hold the time, the second the status, and the remaining columns any treatments (grouping variables) you want to test, with patients as row names.
library(survival)

loopsurff <- function(Data10) {
  # Columns 3..ncol(Data10) hold the grouping variables to test
  cols <- 3:ncol(Data10)
  out <- lapply(cols, function(j) {
    # Log-rank test of survival against the j-th column
    fit <- survdiff(Surv(Data10[, 1], Data10[, 2]) ~ Data10[, j])
    # Degrees of freedom = number of groups - 1
    p <- 1 - pchisq(fit$chisq, length(fit$n) - 1)
    data.frame("var1" = colnames(Data10)[j],
               "p.value" = as.numeric(sprintf("%.3f", p)))
  })
  # Stack the one-row results into a single data frame
  do.call(rbind, out)
}
You will get a data frame with one row per column of yourdata[,3:ncol(yourdata)], giving the column name and its p-value.
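The layout described above (time, status, then one grouping column per gene, with patients as row names) can be built from the question's two data frames. The sketch below is one possible way to do it in base R, using the median split from the question; the intermediate names ids, expr and groups are only illustrative.

```r
geneDf <- data.frame(patient1 = c(12, 23, 56, 23),
                     patient2 = c(23, 34, 11, 6),
                     patient3 = c(56, 44, 32, 45),
                     patient4 = c(23, 64, 45, 23),
                     geneID = c("EGFR", "Her2", "E2F1", "PTEN"))
survDf <- data.frame(ID = c("patient1", "patient2", "patient3", "patient4"),
                     time = c(23, 7, 34, 56),
                     status = c(1, 0, 1, 1))

# Put patients in rows and genes in columns, in the same order as survDf
ids <- as.character(survDf$ID)
expr <- t(as.matrix(geneDf[, ids]))
colnames(expr) <- as.character(geneDf$geneID)

# Median split per gene: 1 = below the median, 2 = at or above it
groups <- apply(expr, 2, function(v) ifelse(v < median(v), 1, 2))

Data10 <- data.frame(time = survDf$time, status = survDf$status,
                     groups, row.names = ids)
Data10
```

Data10 can then be passed to loopsurff(); with only four patients the resulting p-values are of course not meaningful.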

getting from histogram counts to cdf

I have a dataframe where I have values, and for each value I have the counts associated with that value. So, plotting counts against values gives me the histogram. I have three types, a, b, and c.
value counts type
0 139648267 a
1 34945930 a
2 5396163 a
3 1400683 a
4 485924 a
5 204631 a
6 98599 a
7 53056 a
8 30929 a
9 19556 a
10 12873 a
11 8780 a
12 6200 a
13 4525 a
14 3267 a
15 2489 a
16 1943 a
17 1588 a
... ... ...
How do I get from this to a CDF?
So far, my approach is super inefficient: I first write a function that sums up the counts up to that value:
get_cumulative <- function(x) {
result <- numeric(nrow(x))
for (i in seq_along(result)) {
result[i] = sum(x[x$value <= x$value[i], ]$counts)
}
x$cumulative <- result
x
}
Then I wrap this in a ddply that splits by the type. This is obviously not the best way, and I'd love any suggestions on how to proceed.
You can use ave and cumsum (assuming your data is in df and sorted by value):
transform(df, cdf=ave(counts, type, FUN=function(x) cumsum(x) / sum(x)))
Here is a toy example:
df <- data.frame(counts=sample(1:100, 10), type=rep(letters[1:2], each=5))
transform(df, cdf=ave(counts, type, FUN=function(x) cumsum(x) / sum(x)))
that produces:
counts type cdf
1 55 a 0.2750000
2 61 a 0.5800000
3 27 a 0.7150000
4 20 a 0.8150000
5 37 a 1.0000000
6 45 b 0.1836735
7 79 b 0.5061224
8 12 b 0.5551020
9 63 b 0.8122449
10 46 b 1.0000000
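Since the ave/cumsum approach relies on the rows being sorted by value within each type, it may be safer to sort explicitly first. A small sketch with made-up counts (the data frame below is a toy, not the data from the question):

```r
# Toy data, deliberately out of order
df <- data.frame(value  = c(2, 0, 1, 1, 0),
                 counts = c(5, 10, 5, 30, 10),
                 type   = c("a", "a", "a", "b", "b"))

# Sort by type, then by value, so cumsum() runs in the right order
df <- df[order(df$type, df$value), ]

# Cumulative share of counts within each type
df$cdf <- ave(df$counts, df$type, FUN = function(x) cumsum(x) / sum(x))
df
```

For type a this gives 0.50, 0.75, 1.00 at values 0, 1, 2, and for type b it gives 0.25, 1.00 at values 0, 1.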
If your data is in a data.frame DF (sorted by value), then the following should do it:
do.call(rbind, lapply(split(DF, DF$type), function(d) transform(d, cdf = cumsum(counts) / sum(counts))))
The HistogramTools package on CRAN has several functions for converting between Histograms and CDFs, calculating information loss or error margins, and plotting functions to help with this.
If you have a histogram h then calculating the Empirical CDF of the underlying dataset is as simple as:
library(HistogramTools)
h <- hist(runif(100), plot=FALSE)
plot(HistToEcdf(h))
If you first need to convert your input data of breaks and counts into an R Histogram object, then see the PreBinnedHistogram function first.

Novice needs to loop lm in R

I'm a PhD student in genetics and I am trying to do an association analysis of some genetic data using linear regression. In the table below I'm regressing each 'trait' against each 'SNP'. There is also an interaction term, included as 'var'.
I've only used R for 2 weeks and I don't have any programming background, so please explain any help provided, as I want to understand.
This is a sample of my data:
SampleID var trait1 trait2 trait3 SNP1 SNP2 SNP3
77856517 2 188 3 2 1 0 0
375689755 8 17 -1 -1 1 -1 -1
392513415 8 28 14 4 1 1 1
393612038 8 85 14 6 1 1 0
401623551 8 152 11 -1 1 0 0
348466144 7 -74 11 6 1 0 0
77852806 4 81 16 6 1 1 0
440614343 8 -93 8 0 0 1 0
77853193 5 3 6 5 1 1 1
and this is the code I've been using for a single regression:
result1 <-lm(trait1~SNP1+var+SNP1*var, na.action=na.exclude)
I want to run a loop where every trait is tested against each SNP.
I've been trying to modify codes I've found online but I always run into some error that I don't understand how to solve.
Thank you for any and all help.
Personally, I don't find this problem so easy, especially for an R novice.
Here is a solution based on building the regression formula dynamically.
The idea is to use the paste function to assemble the formula terms, y ~ x + var + x*var, and then coerce the resulting string to a formula using as.formula. Here y and x are the dynamic terms of the formula: y in c(trait1, trait2, ...) and x in c(SNP1, SNP2, ...). I use lapply to loop.
lapply(1:3,function(i){
y <- paste0('trait',i)
x <- paste0('SNP',i)
factor1 <- x
factor2 <- 'var'
factor3 <- paste(x,'var',sep='*')
listfactor <- c(factor1,factor2,factor3)
form <- as.formula(paste(y, "~",paste(listfactor,collapse="+")))
lm(formula = form, data = dat)
})
I hope someone comes up with an easier, or more R-ish, solution :)
EDIT
Thanks to @DWin's comment, we can simplify the formula to just y ~ x*var, since it means y is modeled by x, var and their interaction (x:var).
So the code above simplifies to:
lapply(1:3, function(i) {
  y <- paste0('trait', i)
  x <- paste0('SNP', i)
  # right-hand side: x*var expands to x + var + x:var
  RHS <- paste(x, 'var', sep = '*')
  form <- as.formula(paste(y, "~", RHS))
  lm(formula = form, data = dat)
})
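Note that the loops above pair trait i only with SNP i. If every trait should be tested against every SNP, as the question asks, one option is to enumerate all pairs with expand.grid() first. The sketch below assumes a data frame called dat with columns named trait1, trait2, SNP1, SNP2 and var; the data here are randomly generated purely so that the example runs, and the names combos and fits are illustrative.

```r
set.seed(42)
# Made-up data standing in for the real table
dat <- data.frame(var    = sample(1:8, 20, replace = TRUE),
                  trait1 = rnorm(20),
                  trait2 = rnorm(20),
                  SNP1   = sample(-1:1, 20, replace = TRUE),
                  SNP2   = sample(-1:1, 20, replace = TRUE))

# Every trait/SNP combination, one per row
combos <- expand.grid(trait = c("trait1", "trait2"),
                      snp   = c("SNP1", "SNP2"),
                      stringsAsFactors = FALSE)

# Fit one model per combination, building each formula from strings
fits <- lapply(seq_len(nrow(combos)), function(i) {
  form <- as.formula(paste(combos$trait[i], "~", combos$snp[i], "* var"))
  lm(form, data = dat)
})
names(fits) <- paste(combos$trait, combos$snp, sep = "_")
```

Each element can then be inspected by name, for example summary(fits[["trait1_SNP2"]]).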

create dataframe in for loop using dataframe array

I have a dataframe like the one below. I need to extract sub-dataframes based on the regions that are available in RL.
>avg_data
region SN value
beta 1 32
alpha 2 44
beta 3 55
beta 4 60
atp 5 22
> RL
V1
1 beta
2 alpha
The result should be in an array, something like REGR[beta], which should contain the beta-related information, as below:
region SN value
beta 1 32
beta 3 55
beta 4 60
Similarly for REGR[alpha]
region SN value
alpha 2 44
So that I can pass REGR as a argument for plotting graph.
REGR <- data.frame()
for (i in levels(RL$V1)){
REGR[i,] <- avg_data[avg_data$region==i, ];
}
I made some mistake in the above code. Please correct me. Thank you.
The split function may be of interest to you. From the help page, split divides the data in the vector x into the groups defined by f.
So for your data, it may look something like:
> split(avg_data, avg_data$region)
$alpha
region SN value
2 alpha 2 44
$atp
region SN value
5 atp 5 22
$beta
region SN value
1 beta 1 32
3 beta 3 55
4 beta 4 60
If you want to filter out the records that do not occur in RL, I'd probably do that in a preprocessing step using the %in% function and [ for extraction:
x <- avg_data[avg_data$region %in% RL$V1,]
#-----
region SN value
1 beta 1 32
2 alpha 2 44
3 beta 3 55
4 beta 4 60
That's what I'd feed to split if you want to drop atp.
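Putting the two steps together, a minimal sketch with the data from the question might look like this. The name keep is just for illustration, and as.character() is used so that unused factor levels such as atp do not show up as empty list elements if region happens to be a factor.

```r
avg_data <- data.frame(region = c("beta", "alpha", "beta", "beta", "atp"),
                       SN = 1:5,
                       value = c(32, 44, 55, 60, 22))
RL <- data.frame(V1 = c("beta", "alpha"))

# Keep only the regions listed in RL, then split into a named list
keep <- avg_data[avg_data$region %in% RL$V1, ]
REGR <- split(keep, as.character(keep$region))

REGR[["beta"]]   # the three beta rows
REGR[["alpha"]]  # the single alpha row
```

REGR is a named list of data frames, so it is indexed with [[ ]] (or $) rather than with REGR[beta].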
The approach above may be overkill if you are just wanting to plot. Here's an example using sapply to iterate through each level of region and make a plot:
sapply(unique(x$region), function(z)
plot(x[x$region == z,"value"], main=z[1]))
