Novice needs to loop lm in R - r

I'm a PhD student of genetics and I am trying do association analysis of some genetic data using linear regression. In the table below I'm regressing each 'trait' against each 'SNP' There is also a interaction term include as 'var'
I've only used R for 2 weeks and I don't have any programming background so please explain any help provided as I want to understand.
This is a sample of my data:
Sample ID var trait 1 trait 2 trait 3 SNP1 SNP2 SNP3
77856517 2 188 3 2 1 0 0
375689755 8 17 -1 -1 1 -1 -1
392513415 8 28 14 4 1 1 1
393612038 8 85 14 6 1 1 0
401623551 8 152 11 -1 1 0 0
348466144 7 -74 11 6 1 0 0
77852806 4 81 16 6 1 1 0
440614343 8 -93 8 0 0 1 0
77853193 5 3 6 5 1 1 1
and this is the code I've been using for a single regression:
result1 <-lm(trait1~SNP1+var+SNP1*var, na.action=na.exclude)
I want to run a loop where every trait is tested against each SNP.
I've been trying to modify codes I've found online but I always run into some error that I don't understand how to solve.
Thank you for any and all help.

Personally I don't find the problem so easy. Specially for an R novice.
Here a solution based on creating dynamically the regression formula.
The idea is to use paste function to create different formula terms, y~ x + var + x * var then coercing the result string tp a formula using as.formula. Here y and x are the formula dynamic terms: y in c(trait1,trai2,..) and x in c(SNP1,SNP2,...). Of course here I use lapply to loop.
lapply(1:3,function(i){
y <- paste0('trait',i)
x <- paste0('SNP',i)
factor1 <- x
factor2 <- 'var'
factor3 <- paste(x,'var',sep='*')
listfactor <- c(factor1,factor2,factor3)
form <- as.formula(paste(y, "~",paste(listfactor,collapse="+")))
lm(formula = form, data = dat)
})
I hope someone come with easier solution, ore more R-ish one:)
EDIT
Thanks to #DWin comment , we can simplify the formula to just y~x*var since it means y is modeled by x,var and x*var
So the code above will be simplified to :
lapply(1:3,function(i){
y <- paste0('trait',i)
x <- paste0('SNP',i)
LHS <- paste(x,'var',sep='*')
form <- as.formula(paste(y, "~",LHS)
lm(formula = form, data = dat)
})

Related

sandwich + mlogit: `Error in ef/X : non-conformable arrays` when using `vcovHC()` to compute robust/clustered standard errors

I am trying to compute robust/cluster standard errors after using mlogit() to fit a Multinomial Logit (MNL) in a Discrete Choice problem. Unfortunately, I suspect I am having problems with it because I am using data in long format (this is a must in my case), and getting the error #Error in ef/X : non-conformable arrays after sandwich::vcovHC( , "HC0").
The Data
For illustration, please gently consider the following data. It represents data from 5 individuals (id_ind ) that choose among 3 alternatives (altern). Each of the five individuals chose three times; hence we have 15 choice situations (id_choice). Each alternative is represented by two generic attributes (x1 and x2), and the choices are registered in y (1 if selected, 0 otherwise).
df <- read.table(header = TRUE, text = "
id_ind id_choice altern x1 x2 y
1 1 1 1 1.586788801 0.11887832 1
2 1 1 2 -0.937965347 1.15742493 0
3 1 1 3 -0.511504401 -1.90667519 0
4 1 2 1 1.079365680 -0.37267925 0
5 1 2 2 -0.009203032 1.65150370 1
6 1 2 3 0.870474033 -0.82558651 0
7 1 3 1 -0.638604013 -0.09459502 0
8 1 3 2 -0.071679538 1.56879334 0
9 1 3 3 0.398263302 1.45735788 1
10 2 4 1 0.291413453 -0.09107974 0
11 2 4 2 1.632831160 0.92925495 0
12 2 4 3 -1.193272276 0.77092623 1
13 2 5 1 1.967624379 -0.16373709 1
14 2 5 2 -0.479859282 -0.67042130 0
15 2 5 3 1.109780885 0.60348187 0
16 2 6 1 -0.025834772 -0.44004183 0
17 2 6 2 -1.255129594 1.10928280 0
18 2 6 3 1.309493274 1.84247199 1
19 3 7 1 1.593558740 -0.08952151 0
20 3 7 2 1.778701074 1.44483791 1
21 3 7 3 0.643191170 -0.24761157 0
22 3 8 1 1.738820924 -0.96793288 0
23 3 8 2 -1.151429915 -0.08581901 0
24 3 8 3 0.606695064 1.06524268 1
25 3 9 1 0.673866953 -0.26136206 0
26 3 9 2 1.176959443 0.85005871 1
27 3 9 3 -1.568225496 -0.40002252 0
28 4 10 1 0.516456176 -1.02081089 1
29 4 10 2 -1.752854918 -1.71728381 0
30 4 10 3 -1.176101700 -1.60213536 0
31 4 11 1 -1.497779616 -1.66301234 0
32 4 11 2 -0.931117325 1.50128532 1
33 4 11 3 -0.455543630 -0.64370825 0
34 4 12 1 0.894843784 -0.69859139 0
35 4 12 2 -0.354902281 1.02834859 0
36 4 12 3 1.283785176 -1.18923098 1
37 5 13 1 -1.293772990 -0.73491317 0
38 5 13 2 0.748091387 0.07453705 1
39 5 13 3 -0.463585127 0.64802031 0
40 5 14 1 -1.946438667 1.35776140 0
41 5 14 2 -0.470448172 -0.61326604 1
42 5 14 3 1.478763383 -0.66490028 0
43 5 15 1 0.588240775 0.84448489 1
44 5 15 2 1.131731049 -1.51323232 0
45 5 15 3 0.212145247 -1.01804594 0
")
The problem
Consequently, we can fit an MNL using mlogit() and extract their robust variance-covariance as follows:
library(mlogit)
library(sandwich)
mo <- mlogit(formula = y ~ x1 + x2|0 ,
method ="nr",
data = df,
idx = c("id_choice", "altern"))
sandwich::vcovHC(mo, "HC0")
#Error in ef/X : non-conformable arrays
As we can see there is an error produced by sandwich::vcovHC, which says that ef/X is non-conformable. Where X <- model.matrix(x) and ef <- estfun(x, ...). After looking through the source code on the mirror on GitHub I spot the problem which comes from the fact that, given that the data is in long format, ef has dimensions 15 x 2 and X has 45 x 2.
My workaround
Given that the show must continue, I am computing the robust and cluster standard errors manually using some functions that I borrow from sandwich and I adjusted to accommodate the Stata's output.
> Robust Standard Errors
These lines are inspired on the sandwich::meat() function.
psi<- estfun(mo)
k <- NCOL(psi)
n <- NROW(psi)
rval <- (n/(n-1))* crossprod(as.matrix(psi))
vcov(mo) %*% rval %*% vcov(mo)
# x1 x2
# x1 0.23050261 0.09840356
# x2 0.09840356 0.12765662
Stata Equivalent
qui clogit y x1 x2 ,group(id_choice) r
mat li e(V)
symmetric e(V)[2,2]
y: y:
x1 x2
y:x1 .23050262
y:x2 .09840356 .12765662
> Clustered Standard Errors
Here, given that each individual answers 3 questions is highly likely that there is some degree of correlation among individuals; hence cluster corrections should be preferred in such situations. Below I compute the cluster correction in this case and I show the equivalence with the Stata output of clogit , cluster().
id_ind_collapsed <- df$id_ind[!duplicated(mo$model$idx$id_choice,)]
psi_2 <- rowsum(psi, group = id_ind_collapsed )
k_cluster <- NCOL(psi_2)
n_cluster <- NROW(psi_2)
rval_cluster <- (n_cluster/(n_cluster-1))* crossprod(as.matrix(psi_2))
vcov(mo) %*% rval_cluster %*% vcov(mo)
# x1 x2
# x1 0.1766707 0.1007703
# x2 0.1007703 0.1180004
Stata equivalent
qui clogit y x1 x2 ,group(id_choice) cluster(id_ind)
symmetric e(V)[2,2]
y: y:
x1 x2
y:x1 .17667075
y:x2 .1007703 .11800038
The Question:
I would like to accommodate my computations within the sandwich ecosystem, meaning not computing the matrices manually but actually using the sandwich functions. Is it possible to make it work with models in long format like the one described here? For example, providing the meat and bread objects directly to perform the computations? Thanks in advance.
PS: I noted that there is a dedicated bread function in sandwich for mlogit, but I could not spot something like meat for mlogit, but anyways I am probably missing something here...
Why vcovHC does not work for mlogit
The class of HC covariance estimators can just be applied in models with a single linear predictor where the score function aka estimating function is the product of so-called "working residuals" and a regressor matrix. This is explained in some detail in the Zeileis (2006) paper (see Equation 7), provided as vignette("sandwich-OOP", package = "sandwich") in the package. The ?vcovHC also pointed to this but did not explain it very well. I have improved this in the documentation at http://sandwich.R-Forge.R-project.org/reference/vcovHC.html now:
The function meatHC is the real work horse for estimating the meat of HC sandwich estimators - the default vcovHC method is a wrapper calling sandwich and bread. See Zeileis (2006) for more implementation details. The theoretical background, exemplified for the linear regression model, is described below and in Zeileis (2004). Analogous formulas are employed for other types of models, provided that they depend on a single linear predictor and the estimating functions can be represented as a product of “working residual” and regressor vector (Zeileis 2006, Equation 7).
This means that vcovHC() is not applicable to multinomial logit models as they generally use separate linear predictors for the separate response categories. Similarly, two-part or hurdle models etc. are not supported.
Basic "robust" sandwich covariance
Generally, for computing the basic Eicker-Huber-White sandwich covariance matrix estimator, the best strategy is to use the sandwich() function and not the vcovHC() function. The former works for any model with estfun() and bread() methods.
For linear models sandwich(..., adjust = FALSE) (default) and sandwich(..., adjust = TRUE) correspond to HC0 and HC1, respectively. In a model with n observations and k regression coefficients the former standardizes with 1/n and the latter with 1/(n-k).
Stata, however, divides by 1/(n-1) in logit models, see:
Different Robust Standard Errors of Logit Regression in Stata and R. To the best of my knowledge there is no clear theoretical reason for using specifically one or the other adjustment. And already in moderately large samples, this makes no difference anyway.
Remark: The adjustment with 1/(n-1) is not directly available in sandwich() as an option. However, coincidentally, it is the default in vcovCL() without specifying a cluster variable (i.e., treating each observation as a separate cluster). So this is a convenient "trick" if you want to get exactly the same results as Stata.
Clustered covariance
This can be computed "as usual" via vcovCL(..., cluster = ...). For mlogit models you just have to consider that the cluster variable just needs to be provided once (as opposed to stacked several times in long format).
Replicating Stata results
With the data and model from your post:
vcovCL(mo)
## x1 x2
## x1 0.23050261 0.09840356
## x2 0.09840356 0.12765662
vcovCL(mo, cluster = df$id_choice[1:15])
## x1 x2
## x1 0.1766707 0.1007703
## x2 0.1007703 0.1180004

Looping over dependent variable in a mixed model

I am trying to loop over multiple variables in a mixed model (using the rptGaussian function from the rptR package) but I am unable to do it despite several efforts. I am trying the following code. I use the following code without a loop and it works fine:
(rptGaussian(Arg ~ (1|class)+(1|kit)+(1|sex),
grname=c("class","kit","sex","Fixed"),
data=ggm2, nboot=10, npermut=10, adjusted=FALSE)
However, when I try to loop more variables I get the error
Error in terms.default(formula) : no terms component nor attribute
I am trying the following code for the loop.
varlist<-c("var1", "var2")
blups.models <- lapply(varlist, function(x) {
rptGaussian(substitute(i ~ (1|class)+(1|kit)+(1|sex),
list(i = as.name(x))),
grname=c("class","kit","lab","Fixed"),
data=ggm2, nboot=10, npermut=10, adjusted=FALSE)
})
Here is a dummy data table:
sex class kit var1 var2 var3 var4
Female A Cont 10.79730768 10 20 18
Female A Exp 11.2474347 17 1 17
Female A Cont 11.64820939 10 5 17
Female A Exp 15.62800413 20 8 4
Female B Cont 12.41705885 5 16 8
Female B Exp 12.80249244 9 10 1
Female B Cont 10.76949177 6 13 2
Female B Exp 14.71370141 7 12 11
Male A Cont 8.931529823 8 3 6
Male A Exp 10.46899683 3 12 13
Male A Cont 8.363257621 3 13 17
Male A Exp 8.753117911 10 16 10
Male B Cont 9.110946315 9 13 4
Male B Exp 9.595131886 18 10 17
Male B Cont 9.454670188 1 10 11
Male B Exp 10.59379123 11 1 3
In general this kind of looping is easier (IMO) with string-based solutions, especially the reformulate() wrapper function, than with substitute().
I used read.table(header=TRUE,text="...") to read the data above and this slightly modified code for the single model:
library(rptR)
r1 <- rptGaussian(var1 ~ (1|class)+(1|kit)+(1|sex),
grname=c("class","kit","sex","Fixed"),
data=ggm2, nboot=10, npermut=10, adjusted=FALSE)
For multiple models:
varlist <- c("var1", "var2")
Make list of formulas:
formulas <- lapply(varlist,
reformulate,
termlabels="(1|class)+(1|kit)+(1|sex)")
Apply rptGaussian to formulas:
blups.models <- lapply(formulas,
rptGaussian,
grname=c("class","kit","sex","Fixed"),
data=ggm2, nboot=10, npermut=10, adjusted=FALSE)
If you want to collapse the results to a nice form, you have to figure out how to extract the results from a single fit into a data frame or similar structure. In this case the result is a rpt object and methods(class="rpt") tells you that there are only print, plot, and summary methods, but the summary() method returns an object that has lots of potentially useful bits. Here's an example:
## extract estimates and standard errors of estimates as a 1-row data frame
sumfun <- function(x) {
ss <- summary(x)
se.names <- paste(rownames(ss$se),"se",sep=".")
cbind(ss$R,setNames(as.data.frame(t(ss$se)),se.names))
}
A possibly-better alternative would be to return data.frame(term=names(ss$R),rpt=unlist(ss$R),se=ss$se) (a 3-column by n-row data frame) instead.
I'm going to use dplyr::bind_rows() because it's handy, but you could use base-R tools (do.call(rbind(...))) instead if you prefer.
names(blups.models) <- varlist
dplyr::bind_rows(lapply(blups.models,sumfun),
.id="var")
var class kit sex Fixed class.se kit.se sex.se Fixed.se
1 var1 0 0.1444659 0.65887365 0 0.04992624 0.2136589 0.2954982 0
2 var2 0 0.3322780 0.01734343 0 0.01981748 0.2243989 0.1158878 0
Are you sure it makes sense to do repeatability scores across sexes and other categories with small numbers of levels?

How can I filter out rows from linear regression based on another linear regression

I would like to conduct a linear regression that will have three steps: 1) Running the regression on all data points 2) Taking out the 10 outiers as found by using the absolute distanse value of rstandard 3) Running the regression again on the new data frame.
I know how to do it manually but these is very awkwarding. Is there a way to do it automatically? Can it be done for taking out columns as well?
Here is my toy data frame and code (I'll take out 2 top outliers):
df <- read.table(text = "userid target birds wolfs
222 1 9 7
444 1 8 4
234 0 2 8
543 1 2 3
678 1 8 3
987 0 1 2
294 1 7 16
608 0 1 5
123 1 17 7
321 1 8 7
226 0 2 7
556 0 20 3
334 1 6 3
225 0 1 1
999 0 3 11
987 0 30 1 ",header = TRUE)
model<- lm(target~ birds+ wolfs,data=df)
rstandard <- abs(rstandard(model))
df<-cbind(df,rstandard)
g<-subset(df,rstandard > sort(unique(rstandard),decreasing=T)[3])
g
userid target birds wolfs rstandard
4 543 1 2 3 1.189858
13 334 1 6 3 1.122579
modelNew<- lm(target~ birds+ wolfs,data=df[-c(4,13),])
I don't see how you could do this without estimating two models, the first to identify the most influential cases and the second on the data without those cases. You could simplify your code and avoid cluttering the workspace, however, by doing it all in one shot, with the subsetting process embedded in the call to estimate the "final" model. Here's code that does this for the example you gave:
model <- lm(target ~ birds + wolfs,
data = df[-(as.numeric(names(sort(abs(rstandard(lm(target ~ birds + wolfs, data=df))), decreasing=TRUE)))[1:2]),])
Here, the initial model, evaluation of influence, and ensuing subsetting of the data are all built into the code that comes after the first data =.
Also, note that the resulting model will differ from the one your code produced. That's because your g did not correctly identify the two most influential cases, as you can see if you just eyeball the results of abs(rstandard(lm(target ~ birds + wolfs, data=df))). I think it has to do with your use of unique(), which seems unnecessary, but I'm not sure.

Plotting Logistic Regression in R, but I keep getting errors

I'm trying to plot a logistic regression in R, for a continuous independent variable and a dichotomous dependent variable. I have very limited experience with R, but my professor has asked me to add this graph to a paper I'm writing, and he said R would probably be the best way to create it. Anyway, I'm sure there are tons of mistakes here, but this is the sort of this previously suggested on StackOverflow:
ggplot(vvv, aes(x = vvv$V1, y=vvv$V2)) + geom_point() + stat_smooth(method="glm", family="binomial", se=FALSE)
curve(predict(ggg, data.frame(V1=x), type="response"), add=TRUE)
where vvv is the name of my csv file (31 obs. of 2 variables), V1 is the continuous variable, and V2 is the dichotomous one. Also, ggg (List of 30?) is the following:
ggg<- glm(formula = vvv$V2 ~ vvv$V1, family = "binomial", data = vvv)
The ggplot function produces a graph of my data points, but no logistic regression curve. The curve function results in the following error:
"Error in curve(predict(ggg, data.frame(V1 = x), type = "resp"), add = TRUE) : 'expr' did not evaluate to an object of length 'n'
In addition: Warning message:'newdata' had 101 rows but variables found have 31 rows"
I'm not sure what the problem is, and I'm having trouble finding resources for this specific issue. Can anybody help? It would be greatly appreciated :)
Edit: Thanks to anyone who responded! My data, vvv, is the following, where the percent was the initial probability for presence/absence of a species in a specific area, and the 1 and 0 indicate whether or not a species ended up being observed.:
V1 V2
1 95.00% 1
2 95.00% 0
3 95.00% 1
4 92.00% 1
5 92.00% 1
6 92.00% 1
7 92.00% 1
8 92.00% 1
9 92.00% 1
10 92.00% 1
11 85.00% 1
12 85.00% 1
13 85.00% 1
14 85.00% 1
15 85.00% 1
16 80.00% 1
17 80.00% 0
18 77.00% 1
19 77.00% 1
20 75.00% 0
21 70.00% 1
22 70.00% 0
23 70.00% 0
24 70.00% 1
25 70.00% 0
26 69.00% 1
27 65.00% 0
28 60.00% 1
29 50.00% 1
30 35.00% 0
31 25.00% 0
As #MrFlick commented, V1 is probably a factor. So, first you have to change it to numeric class. This just substitutes "%" for nothing and divides by 100, so you will have proportions as numeric class:
vvv$V1<-as.numeric(sub("%","",vvv$V1))/100
Doing this, you can use your own code and you will have a plot for a logistic regression:
ggplot(vvv, aes(x = vvv$V1, y=vvv$V2)) + geom_point() + stat_smooth(method="glm", family="binomial", se=F)
This should print not only the points, but also the logistic regression curve. I don't understand what is the point of using curves. From what I could understand from your question, this is enough for what you need.

coxph() X matrix deemed to be singular;

I'm having some trouble using coxph(). I've two categorical variables:"tecnologia" and "pais", and I want to evaluate the possible interaction effect of "pais" on "tecnologia"."tecnologia" is a variable factor with 2 levels: gps and convencional. And "pais" as 2 levels: PT and ES. I have no idea why this warning keeps appearing.
Here's the code and the output:
cox_AC<-coxph(Surv(dados_temp$dias_seg,dados_temp$status)~tecnologia*pais,data=dados_temp)
Warning message:
In coxph(Surv(dados_temp$dias_seg, dados_temp$status) ~ tecnologia * :
X matrix deemed to be singular; variable 3
> cox_AC
Call:
coxph(formula = Surv(dados_temp$dias_seg, dados_temp$status) ~
tecnologia * pais, data = dados_temp)
coef exp(coef) se(coef) z p
tecnologiagps -0.152 0.859 0.400 -0.38 7e-01
paisPT 1.469 4.345 0.406 3.62 3e-04
tecnologiagps:paisPT NA NA 0.000 NA NA
Likelihood ratio test=23.8 on 2 df, p=6.82e-06 n= 127, number of events= 64
I'm opening another question about this subject, although I made a similar one some months ago, because I'm facing the same problem again, with other data. And this time I'm sure it's not a data related problem.
Can somebody help me?
Thank you
UPDATE:
The problem does not seem to be a perfect classification
> xtabs(~status+tecnologia,data=dados)
tecnologia
status conv doppler gps
0 39 6 24
1 30 3 34
> xtabs(~status+pais,data=dados)
pais
status ES PT
0 71 8
1 49 28
> xtabs(~tecnologia+pais,data=dados)
pais
tecnologia ES PT
conv 69 0
doppler 1 8
gps 30 28
Here's a simple example which seems to reproduce your problem:
> library(survival)
> (df1 <- data.frame(t1=seq(1:6),
s1=rep(c(0, 1), 3),
te1=c(rep(0, 3), rep(1, 3)),
pa1=c(0,0,1,0,0,0)
))
t1 s1 te1 pa1
1 1 0 0 0
2 2 1 0 0
3 3 0 0 1
4 4 1 1 0
5 5 0 1 0
6 6 1 1 0
> (coxph(Surv(t1, s1) ~ te1*pa1, data=df1))
Call:
coxph(formula = Surv(t1, s1) ~ te1 * pa1, data = df1)
coef exp(coef) se(coef) z p
te1 -23 9.84e-11 58208 -0.000396 1
pa1 -23 9.84e-11 100819 -0.000229 1
te1:pa1 NA NA 0 NA NA
Now lets look for 'perfect classification' like so:
> (xtabs( ~ s1+te1, data=df1))
te1
s1 0 1
0 2 1
1 1 2
> (xtabs( ~ s1+pa1, data=df1))
pa1
s1 0 1
0 2 1
1 3 0
Note that a value of 1 for pa1 exactly predicts having a status s1 equal to 0. That is to say, based on your data, if you know that pa1==1 then you can be sure than s1==0. Thus fitting Cox's model is not appropriate in this setting and will result in numerical errors.
This can be seen with
> coxph(Surv(t1, s1) ~ pa1, data=df1)
giving
Warning message:
In fitter(X, Y, strats, offset, init, control, weights = weights, :
Loglik converged before variable 1 ; beta may be infinite.
It's important to look at these cross tables before fitting models. Also it's worth starting with simpler models before considering those involving interactions.
If we add the interaction term to df1 manually like this:
> (df1 <- within(df1,
+ te1pa1 <- te1*pa1))
t1 s1 te1 pa1 te1pa1
1 1 0 0 0 0
2 2 1 0 0 0
3 3 0 0 1 0
4 4 1 1 0 0
5 5 0 1 0 0
6 6 1 1 0 0
Then check it with
> (xtabs( ~ s1+te1pa1, data=df1))
te1pa1
s1 0
0 3
1 3
We can see that it's a useless classifier, i.e. it does not help predict status s1.
When combining all 3 terms, the fitter does manage to produce a numerical value for te1 and pe1 even though pe1 is a perfect predictor as above. However a look at the values for the coefficients and their errors shows them to be implausible.
Edit #JMarcelino: If you look at the warning message from the first coxph model in the example, you'll see the warning message:
2: In coxph(Surv(t1, s1) ~ te1 * pa1, data = df1) :
X matrix deemed to be singular; variable 3
Which is likely the same error you're getting and is due to this problem of classification. Also, your third cross table xtabs(~ tecnologia+pais, data=dados) is not as important as the table of status by interaction term. You could add the interaction term manually first as in the example above then check the cross table. Or you could say:
> with(df1,
table(s1, pa1te1=pa1*te1))
pa1te1
s1 0
0 3
1 3
That said, I notice one of the cells in your third table has a zero (conv, PT) meaning you have no observations with this combination of predictors. This is going to cause problems when trying to fit.
In general, the outcome should be have some values for all levels of the predictors and the predictors should not classify the outcome as exactly all or nothing or 50/50.
Edit 2 #user75782131 Yes, generally speaking xtabs or a similar cross-table should be performed in models where the outcome and predictors are discrete i.e. have a limited no. of levels. If 'perfect classification' is present then a predictive model / regression may not be appropriate. This is true for example for logistic regression (outcome is binary) as well as Cox's model.

Resources