I sometimes pull the names of the variables I used in a model into a vector and do other things with them (e.g. descriptives). The problem is that I sometimes wrap variables in the formula with "as.numeric(var)" or "as.factor(var)", or center them with "I(var-15)". I then need the names of the original variables.
I can't simply call gsub("as.factor(", "", names(lmfit$model)) because I get an error (the unescaped "(" is read as the start of a regex group), and I want to avoid deleting parts of variable names that happen to contain "I". So I need to strip I(* - any number) and as.factor(*), where * is the variable name that should remain untouched.
Let's say I have a vector of coefficients from a model:
outcome <- c(1:9)
INDEX <- c(18,17,15,20,10,20,25,13,12)
BODYFAT <- c(18,18,15,20,20,20,15,20,15)
lmfit <- glm(outcome ~ as.factor(BODYFAT) + I(INDEX-15), family = gaussian())
names(lmfit$model)
How would you work on names(lmfit$model) to get the original variable names back (i.e. BODYFAT and INDEX)?
I've started creating some clunky code to remove all the centering numbers (assuming 1 to 500 should be enough in most cases)
b<-paste(paste0("- ",1:500,"|",collapse=""),"-501",collapse="")
library(stringr)
str_replace_all(names(lmfit$model),b, " ")
But I'm having real problems with removing I() and as.factor(). Any suggestions?
Many thanks in advance
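One possible approach, sketched here with the example data from the question: instead of editing the strings in names(lmfit$model), ask the formula for its bare symbols with all.vars(). A regex alternative is included for cases where only the character names are available; the pattern is an assumption that covers as.factor(), as.numeric() and I() wrappers only, not arbitrary transformations.

```r
outcome <- c(1:9)
INDEX   <- c(18, 17, 15, 20, 10, 20, 25, 13, 12)
BODYFAT <- c(18, 18, 15, 20, 20, 20, 15, 20, 15)
lmfit <- glm(outcome ~ as.factor(BODYFAT) + I(INDEX - 15), family = gaussian())

# all.vars() returns the variable symbols in a formula, ignoring function calls
vars <- all.vars(formula(lmfit))

# Predictors only: drop the response from the terms object first
predictors <- all.vars(delete.response(terms(lmfit)))

# Regex alternative on the column names of the model frame.
# The pattern is anchored, so only a leading "I(" or "as.factor(" is stripped
# and variable names that merely contain "I" are left alone.
nm <- names(lmfit$model)
stripped <- sub("^(?:as\\.factor|as\\.numeric|I)\\(([A-Za-z.][A-Za-z0-9._]*).*\\)$",
                "\\1", nm, perl = TRUE)
```

Both routes give the original names for this example; the all.vars() route does not depend on how the terms happen to be written.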
I am trying desperately to automate some model testing in lme4::lmer (as I have too many to do).
The functions I use run some models, find the best one by checking their stats, and create res, which has two lists in it:
res$res is a data frame
res$res$model is the text I need to rerun the best model (to keep it simple, I only use the first one)
res$fits is a List of 1, with res$fits[1] being a "Formal Class 'lmerMod'" with 13 slots, the name of which is always exactly the same as models
Here's some code to make more sense:
models <- theBigList
## run function
res <- fit.func(models=models, response='bnParam1')
# show model selection table
res$res
# This is where you get the best model from above and put it in here to set it up for plotting.
#
models <- res$res$model[1]
# run function
res <- fit.func(models=models, response='bnParam1')
## model selection table
res$res
# Once you get a model where the best result is the same as the "previous" one, you copy and paste it in here to graph it.
# It will be the one with the lowest CV.R2 from the 2nd 'models'
top.fit <- res$fits$'INSERT models HERE'
#top.fit is a class lmerMod, and has the list of everything needed to be extracted and calculated to ggplot it
Normally I copy and paste the text for the best model into the space where it says 'INSERT models HERE', but I would like to automate it.
I can't seem to use models as an input, nor force it, eg as.Class or as.String, things like that, nor use other ways of referencing from a list. I am at a loss as to how to assign the right variable.
EDIT #######
So res$res is the first List in res that is a data frame, it will output something like this:
> res$res
model nPar D aic d.aic w.aic R2 cv.R2
1 Sp + (1|Spec) + SE + TC 51 3804.244 3906.244 0 1 0.6376789 0.2586369
To expand on my last sentence which is the most important. Normally the last bit of code passes the parameters to lme4::fixef like this for e.g.:
top.fit <- res$fits$"Sp + (1|Spec) + SE + TC"
This line of code also returns that same model string (which I discovered earlier, but it changes every time I run a different analysis):
models <- res$res$model[1]
> models
[1] "Sp + (1|Spec) + SE + TC"
So I'd basically like to write something like top.fit <- res$fits$models, but I assume there is some form of type incompatibility or problem with using 'models' within the reference to the List/Class?
Single square brackets return a list of length one rather than the element itself, which wasn't working.
I solved this by referencing with double square brackets, which extract the element itself, so it is passed on as the original lmerMod object it needs to be. Now I know.
top.fit <- res$fits[[1]]
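A minimal sketch of the difference between $, [ and [[ when the element name is stored in a variable. The list below is a hypothetical stand-in for res$fits, holding a placeholder string instead of a real lmerMod fit.

```r
# Hypothetical stand-in for res$fits: one element, named by its model string
fits <- list("Sp + (1|Spec) + SE + TC" = "placeholder for an lmerMod fit")
models <- "Sp + (1|Spec) + SE + TC"

# $ does not evaluate `models`; it looks for an element literally named "models"
by_dollar <- fits$models

# [ keeps the list wrapper: the result is a list of length one
by_single <- fits[models]

# [[ evaluates `models` and extracts the element itself
by_double <- fits[[models]]
```

With this pattern, something like res$fits[[models]] should select the fit by name, which may be safer than [[1]] if the list ever holds more than one model.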
I am new in writing loops and I have some difficulties there. I already looked through other questions, but didn't find the answer to my specific problem.
So let's just create a random dataset, give it column names and set the variables as character:
d<-data.frame(replicate(4,sample(1:9,197,rep=TRUE)))
colnames(d)<-c("variable1","variable2","trait1","trait2")
d$variable1<-as.character(d$variable1)
d$variable2<-as.character(d$variable2)
Now I define the vector over which I want to loop. It corresponds to trait 1 and trait 2:
trt.nm <- names(d[c(3,4)])
Now I want to apply the following model for trait 1 and trait 2 (which should now be as column names in trt.nm) in a loop:
library(lme4)
for(trait in trt.nm)
{
lmer (trait ~ 1 + variable1 + (1|variable2) ,data=d)
}
Now I get the error that variable lengths differ. How could this be explained?
If I apply the model without loop for each trait, I get a result, so the problem has to be somewhere in the loop, I think.
trait is a string, so you'll have to convert it to a formula to work; see http://www.cookbook-r.com/Formulas/Creating_a_formula_from_a_string/ for more info.
Try this (you'll have to add a print statement or save the result to actually see what it does, but this will run without errors):
for(trait in trt.nm) {
lmer(as.formula(paste(trait, " ~ 1 + variable1 + (1|variable2)")), data = d)
}
Another suggestion would be to use a list and lapply or purrr::map instead. Good luck!
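The lapply suggestion could look roughly like this. It is a sketch that rebuilds the example data from the question (the set.seed call is an addition for reproducibility) and keeps the results in a named list. The fitting step is guarded so the formula-building part runs even where lme4 is not installed.

```r
set.seed(1)  # assumption: added only to make the random data reproducible
d <- data.frame(replicate(4, sample(1:9, 197, replace = TRUE)))
colnames(d) <- c("variable1", "variable2", "trait1", "trait2")
d$variable1 <- as.character(d$variable1)
d$variable2 <- as.character(d$variable2)
trt.nm <- names(d[c(3, 4)])

# Build one formula per trait; setNames() makes the result a named list
forms <- lapply(setNames(trt.nm, trt.nm), function(trait) {
  as.formula(paste(trait, "~ 1 + variable1 + (1 | variable2)"))
})

# Fit each model and keep the fits, e.g. fits$trait1
if (requireNamespace("lme4", quietly = TRUE)) {
  fits <- lapply(forms, function(f) lme4::lmer(f, data = d))
}
```

Unlike the for loop, this stores every fit, so summary(fits$trait1) and similar calls work afterwards.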
I understand that in the following
aa <- sapply(c("BMI","KOL"),function(x) as.formula(paste('Surv(BL_AGE,CVD_AGE,INCIDENT_CVD) ~', paste(colnames(s)[c(21,259,330,380)], collapse='+'))))
I am missing x, but I really don't understand how and where to insert it to be correct.
Thank you for any help.
Making this an answer instead of a comment due to amount of text.
If I understand you correctly, you're trying to iterate over a list of variables, which you want to add (each in turn) to a set of independent variables in a survival model. The issue in the code you gave is that you don't give x a place. There are several approaches to do so.
The first one is very similar to what you're doing, and creates the formulas. I demonstrate this using the 'cancer' dataset:
library(survival)
data(cancer)
myvars <- c("meal.cal","wt.loss")
a1 <- sapply(myvars,function(x){
as.formula(sprintf("Surv(time, status)~age+sex+%s",x))
}
)
#then we can fit our models
lapply(a1,function(x){coxph(formula=x,data=cancer)})
In my opinion, this is a bit convoluted and can be done in one step:
models <- lapply(myvars, function(x){
form <- as.formula(sprintf("Surv(time, status)~age+sex+%s",x))
fit <- coxph(formula=form, data=cancer)
return(fit)
})
Using the code you started with, we can simply add 'x' to the vector of independent variables. However, this is not very readable code and I'm always a bit nervous about feeding column indices to models. You might be safer using variable names instead.
aa <- sapply(c("BMI","KOL"),function(x) as.formula(paste('Surv(BL_AGE,CVD_AGE,INCIDENT_CVD) ~', paste(c(x,colnames(s)[c(21,259,330,380)]), collapse='+'))))
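As a sketch of the "variable names instead of indices" idea, reformulate() can assemble the formula from character vectors. The covariate names below are taken from the cancer example above and are only illustrative; building the formulas does not evaluate Surv(), so this part runs without the survival package attached.

```r
myvars <- c("meal.cal", "wt.loss")  # variables to add one at a time
covars <- c("age", "sex")           # fixed covariates, given by name

# One formula per variable in myvars, each with the fixed covariates appended
forms <- lapply(setNames(myvars, myvars), function(x) {
  reformulate(c(x, covars), response = "Surv(time, status)")
})
```

Each element of forms can then be passed to coxph(formula = ..., data = cancer) as in the lapply call above.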
Currently I have a piece of code that looks like this:
as.formula(paste0('Y~',paste('factor','(', names(te)[w],')', sep="",collapse="+")))
The response (Y) and the predictors TRY1, TRY2, UYP21 and GHT9 are column names of the data frame te, and w is a vector that indexes the column names, as only specific columns from the data frame are chosen for the model.
My problem is that this code writes every predictor as factor(). How can I write a piece that decides that for w=12 (the 12th column of te) it should not be factor but as.numeric?
More generally, it should check the class of each data frame column with class() and then decide whether to use factor or as.numeric. The desired output is
Y~factor(TRY1)+factor(TRY2)+factor(UYP21)+as.numeric(GHT9)
while the current code produces
Y~factor(TRY1)+factor(TRY2)+factor(UYP21)+factor(GHT9)
The answer provided works very well, but the problem is that it really would need to be as.numeric, not only numeric.
This isn't the best coding, but maybe it helps.
forFormula <- NULL
for(i in 1:dim(te)[2]){
one <- paste0(class(te[,i]), "(", colnames(te)[i], ")")
forFormula <- c(forFormula, one)
}
forFormula <- as.formula(paste("Y ~", (paste(forFormula, collapse="+"))))
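A variant that emits as.numeric() for numeric columns and factor() for everything else, restricted to the columns selected by w, might look like the sketch below. The data frame te and the index vector w are made-up stand-ins with the column names from the question.

```r
# Hypothetical data with the column names used in the question
te <- data.frame(
  Y     = c(1.2, 3.4, 2.2, 5.1),
  TRY1  = c("a", "b", "a", "b"),
  TRY2  = factor(c("x", "y", "y", "x")),
  UYP21 = c("u", "v", "u", "v"),
  GHT9  = c(10, 12, 9, 15),
  stringsAsFactors = FALSE
)
w <- c(2, 3, 4, 5)  # columns chosen as predictors

# Pick the wrapper per column: as.numeric for numeric columns, factor otherwise
wrapper <- ifelse(vapply(te[w], is.numeric, logical(1)), "as.numeric", "factor")

rhs  <- paste0(wrapper, "(", names(te)[w], ")", collapse = "+")
form <- as.formula(paste0("Y~", rhs))
```

Here rhs is "factor(TRY1)+factor(TRY2)+factor(UYP21)+as.numeric(GHT9)", which matches the desired output.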
I am trying to code a fixed effects regression, but I have MANY dummy variables. Basically, I have 184 variables on the RHS of my equation. Instead of writing this out, I am trying to create a loop that will pass through each column (I have named each column with a number).
This is the code I have so far, but the paste is not working. I may be totally off base using paste, but I wasn't sure how else to approach this, and I am getting an error (see below).
FE.model <- plm(avg.kw ~ 0 + (for (i in 41:87) {
paste("hour.dummy",i,sep="") + paste("dummy.CDH",i,sep="")
+ paste("dummy.MA",i,sep="") + paste("DR.variable",i,sep="")
}),
data = data.reg,
index=c('Site.ID','date.hour'),
model='within',
effect='individual')
summary(FE.model)
As an example for the column names, when i=41 the names should be "hour.dummy41" "dummy.CDH41", etc.
I'm getting the following error:
Error in paste("hour.dummy", i, sep = "") + paste("dummy.CDH", i, sep = "") : non-numeric argument to binary operator
So I'm not sure if it's the paste function that is not appropriate here, or if it's the loop. I can't seem to find a way to loop through column names easily in R.
Any help is much appreciated!
Ignoring worries about fitting a model with so many terms for the moment, you probably want to generate a string, and then cast it as a formula:
#create a data.frame where rows are the parts of the variable names, then collapse it
rhs <- do.call(paste, c(as.list(expand.grid(c("hour.dummy","dummy.CDH"), 41:87)), sep="", collapse=" + "))
fml <- as.formula(sprintf("avg.kw ~ %s", rhs))
FE.model <- plm(fml, ...
I've only put in two of the dummy prefixes in the second line, but you should get the idea.
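For completeness, the same idea with all four prefixes from the question can be sketched with outer() and reformulate(). Only the formula is built here; the plm() call itself is left out because it needs the original data. The intercept = FALSE argument stands in for the "0 +" in the question's formula.

```r
prefixes <- c("hour.dummy", "dummy.CDH", "dummy.MA", "DR.variable")
idx <- 41:87

# Every prefix/index combination, e.g. "hour.dummy41", "dummy.CDH41", ...
rhs_terms <- as.vector(outer(prefixes, idx, paste0))

# Assemble avg.kw ~ <all terms> with the intercept removed
fml <- reformulate(rhs_terms, response = "avg.kw", intercept = FALSE)
```

The resulting fml can be passed as the first argument to plm() in place of the hand-written formula.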