I tried to predict the t121 column using the lm command, like this:
Model<-lm(t121 ~ t1 + t2 + ..... +t120, mydata)
My data contain more than 100 variables, so predicting each column separately with lm is tedious. That is why I want to loop over the columns. This is what I have written so far:
for(j in 120:179){
model[[j+1]]<-lm(t[j+1] ~ add1(t1:t[j]),mydata)
}
In place of add1 I also tried add.bigq and sum, but none of these three commands works. Which command is suitable there?
From what I understand, you want to write a loop that allows you to use lm with different formulas. The nice thing about lm is that it can take objects of the class formula as its first argument. Let's see how that works.
# Create a data set
df <- data.frame(col1=(1:10+rnorm(10)), col2 = 1:10, col3 = rnorm(10), col4 = rnorm(10))
If we want to run lm on col1 as the dependent and col2 as the independent variable, then we can do this:
model_a <- lm(col1 ~ col2, data = df)
form_b <- as.formula("col1 ~ col2")
model_b <- lm(form_b, data = df)
all.equal(model_a,model_b)
# [1] "Component “call”: target, current do not match when deparsed"
So the only thing that differed between the two models is that the function call was different (in model_b we used form_b, not col1 ~ col2). Other than that, the models are identical.
So now you know how to use the formula class to run lm. You can easily construct formulas with paste by setting collapse to " + ":
ind_vars <- paste(names(df)[-1],collapse = " + ")
form_lm <- paste(names(df)[1], "~", ind_vars)
form_lm
# [1] "col1 ~ col2 + col3 + col4"
If we want three different models, we can do a couple of things, for example:
lis <- list()
for (i in 2:length(names(df))) {
ind_vars <- paste(names(df)[2:i], collapse="+")
form_lm <- paste(names(df)[1], "~", ind_vars)
lis[[i-1]] <- lm(form_lm,data=df)
}
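Applied to the original question, the same idea could look roughly like the sketch below. It assumes the columns are literally named t1, t2, ... and that each column from some starting index onward should be regressed on all columns before it. The small simulated mydata (8 columns, first response t5) is only a stand-in for the real data, where the starting index would be 121. It uses reformulate(), a base R helper that builds the formula from character vectors, instead of paste.

```r
# Hypothetical stand-in for `mydata`: 30 rows, columns t1 ... t8
set.seed(1)
mydata <- as.data.frame(matrix(rnorm(30 * 8), ncol = 8))
names(mydata) <- paste0("t", 1:8)

first_response <- 5  # assumption: this would be 121 for the data in the question

models <- list()
for (j in first_response:ncol(mydata)) {
  response   <- paste0("t", j)
  predictors <- paste0("t", seq_len(j - 1))
  # reformulate() builds the formula  t_j ~ t1 + ... + t_(j-1)
  models[[response]] <- lm(reformulate(predictors, response = response),
                           data = mydata)
}

names(models)
```

Each list element is an ordinary lm fit, so summary(models[["t5"]]) or predict() work as usual.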
I wanted to model my snps array. I can do this one by one using the following code.
Data$DX=as.factor(Data$DX)
univariate=glm(relevel(DX, "CON") ~ relevel(rs6693065_D,"AA"), family = binomial, data = Data)
summary(univariate)
exp(cbind(OR = coef(univariate), confint(univariate)))
How can I do this for all the other SNPs using a loop or apply? There are hundreds of them (rs6693065_D, rs6693065_A and so on). In the code above, only "rs6693065_D" needs to be replaced by each of the other SNPs in turn.
Best Regards
Zillur
Consider developing a generalized method to handle any snps. Then call it iteratively passing every snps column using lapply or sapply:
# GENERALIZED METHOD
proc_glm <- function(snps) {
univariate <- glm(relevel(Data$DX, "CON") ~ relevel(snps, "AA"), family = binomial)
return(exp(cbind(OR = coef(univariate), confint(univariate))))
}
# BUILD LIST OF FUNCTION OUTPUT
glm_list <- lapply(Data[3:426], proc_glm)
Use tryCatch in case some columns raise errors, for example from relevel:
# BUILD LIST OF FUNCTION OUTPUT
glm_list <- lapply(Data[3:426], function(col)
tryCatch(proc_glm(col), error = function(e) e))
To build a data frame instead, adjust the method so it takes a column name and returns a data frame, then follow the lapply call with do.call + rbind:
proc_glm <- function(col){
# BUILD FORMULA BY STRING
univariate <- glm(as.formula(paste("y ~", col)), family = binomial, data = Data)
# RETURN DATA FRAME OF COLUMN AND ESTIMATES
cbind.data.frame(COL = col,
exp(cbind(OR = coef(univariate), confint(univariate)))
)
}
# BUILD LIST OF DFs, PASSING COLUMN NAMES
glm_list <- lapply(names(Data)[3:426], function(col)
tryCatch(proc_glm(col), error = function(e) NA))
# APPEND ALL DFs FOR SINGLE MASTER DF
final_df <- do.call(rbind, glm_list)
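For a version that can be run end to end, here is a rough sketch on simulated data. The data frame, the SNP column names (rs1_D, rs2_D) and the helper fit_snp are made up for illustration. It also swaps confint for confint.default (Wald intervals) to keep the example fast, which is not what the question used. Failed fits return NULL here, which rbind silently skips.

```r
set.seed(42)
n <- 200

# Hypothetical stand-in for `Data`: case/control outcome plus two SNP genotype columns
Data <- data.frame(
  DX    = factor(sample(c("CON", "CASE"), n, replace = TRUE)),
  rs1_D = factor(sample(c("AA", "AB", "BB"), n, replace = TRUE)),
  rs2_D = factor(sample(c("AA", "AB", "BB"), n, replace = TRUE))
)
Data$DX <- relevel(Data$DX, ref = "CON")

# Hypothetical helper: fit DX ~ <snp> with "AA" as the reference genotype
fit_snp <- function(snp) {
  d <- Data
  d[[snp]] <- relevel(d[[snp]], ref = "AA")
  fit <- glm(reformulate(snp, response = "DX"), family = binomial, data = d)
  # Odds ratios with Wald confidence limits (assumption: Wald instead of profile CIs)
  est <- exp(cbind(OR = coef(fit), confint.default(fit)))
  data.frame(SNP = snp, term = rownames(est), est,
             row.names = NULL, check.names = FALSE)
}

snp_cols <- setdiff(names(Data), "DX")
results  <- lapply(snp_cols, function(s)
  tryCatch(fit_snp(s), error = function(e) NULL))

# One row per coefficient, all SNPs stacked
final_df <- do.call(rbind, results)
final_df
```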
I want to fit a number of statistical models based on selection criteria specified in a data frame. As a basic example, say I had 2 response variables and 2 explanatory variables:
#######################Data Input############################
Responses <- as.data.frame(matrix(sample(0:10, 1*100, replace=TRUE), ncol=2))
colnames(Responses) <- c("A","B")
Explanatories <- as.data.frame(matrix(sample(20:30, 1*100, replace=TRUE), ncol=2))
colnames(Explanatories) <- c("x","y")
I then define which statistical models I would like to run, which can include different combinations of response / explanatory variables and different statistical functions:
###################Model selection#########################
Function <- c("LIN","LOG","EXP") ##Linear, Logarithmic (base 10) and exponential - see the formula for these below
Respo <- c("A","B","B")
Explan <- c("x","x","y")
Model_selection <- data.frame(Function,Respo,Explan)
How do I then perform a list of models based on these selection criteria? Here is an example of the models I would like to create based on the inputs from the Model_selection data frame.
####################Model creation#########################
Models <- list(
lm(Responses$A ~ Explanatories$x),
lm(Responses$B ~ log10(Explanatories$x)),
lm(Responses$B ~ exp(Explanatories$y))
)
I would guess that some kind of loop function would be required and after looking around perhaps paste too? Thanks in advance for any help with this
This isn't the prettiest solution, but it seems to work for your example:
Models <- list()
idx <- 1L
for (row in 1:nrow(Model_selection)){
if (Model_selection$Function[row]=='LOG'){
expl <- paste0('LOG', Model_selection$Explan[row])
Explanatories[[expl]] <- log10(Explanatories[[Model_selection$Explan[row]]])
Models[[idx]] <- lm(Responses[[Model_selection$Respo[row]]] ~ Explanatories[[expl]])
}
if (Model_selection$Function[row]=='EXP'){
expl <- paste0('EXP', Model_selection$Explan[row])
Explanatories[[expl]] <- exp(Explanatories[[Model_selection$Explan[row]]])
Models[[idx]] <- lm(Responses[[Model_selection$Respo[row]]] ~ Explanatories[[expl]])
}
if (Model_selection$Function[row]=='LIN'){
expl <- paste0('LIN', Model_selection$Explan[row])
Explanatories[[expl]] <- Explanatories[[Model_selection$Explan[row]]]
Models[[idx]] <- lm(Responses[[Model_selection$Respo[row]]] ~ Explanatories[[expl]])
}
names(Models)[idx] <- paste(Model_selection$Respo[row], '~', expl)
idx <- idx+1L
}
Models
This is a perfect use-case for the tidyverse
library(tidyverse)
## cbind both data sets into one
my_data <- cbind(Responses, Explanatories)
## use 'mutate' to change function names to the existing function names
## mutate_all to transform implicit factors to characters
## NB this step could be omitted if Function already used the proper names
model_params <- Model_selection %>%
mutate(Function = case_when(Function == "LIN" ~ "identity",
Function == "LOG" ~ "log10",
Function == "EXP" ~ "exp")) %>%
mutate_all(as.character)
## create a function which estimates the model given the parameters
## NB: function params must be named exactly like columns
## in the model_selection df
make_model <- function(Function, Respo, Explan) {
my_formula <- formula(paste0(Respo, "~", Function, "(", Explan, ")"))
my_mod <- lm(my_formula, data = my_data)
## syntactic sugar: such that we see the value of the formula in the print
my_mod$call$formula <- my_formula
my_mod
}
## use purrr::pmap to loop over the model params
## creates a list with all the models
pmap(model_params, make_model)
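If you would rather stay in base R, the same table-driven idea can be sketched with sprintf templates. The templates lookup vector is my own naming, and the data are simulated in one combined data frame, as in the tidyverse answer:

```r
set.seed(1)

# Responses and explanatories combined into one data frame
my_data <- data.frame(
  A = sample(0:10, 50, replace = TRUE),
  B = sample(0:10, 50, replace = TRUE),
  x = sample(20:30, 50, replace = TRUE),
  y = sample(20:30, 50, replace = TRUE)
)

Model_selection <- data.frame(
  Function = c("LIN", "LOG", "EXP"),
  Respo    = c("A", "B", "B"),
  Explan   = c("x", "x", "y"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: how each label transforms the explanatory variable
templates <- c(LIN = "%s", LOG = "log10(%s)", EXP = "exp(%s)")

# Build one formula string per row of Model_selection
rhs      <- sprintf(unname(templates[Model_selection$Function]),
                    Model_selection$Explan)
formulas <- paste(Model_selection$Respo, "~", rhs)

# Fit every model and name the list by its formula
Models <- lapply(formulas, function(f) lm(as.formula(f), data = my_data))
names(Models) <- formulas

names(Models)
```

Adding a new function type then only needs one more entry in templates, rather than another if block.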
My dataset has 6 variables (x1, x2, x3, x4, x5, x6). I wish to create a function that takes one variable as input and regresses it on the rest of the variables in the data set.
For instance,
fitRegression <- function(data, dependentVariable) {
fit = lm(formula = x1 ~., data = data1)
return(fit)
}
fitRegression(x2)
However, this function only ever returns the results for x1. My desired result is that I can input any variable and the function automatically builds the formula with the rest of the variables.
For Example:
fitRegression(x2)
should remove x2 from the predictor list, so that x2 is regressed on x1, x3, x4, x5, x6,
and if:
fitRegression(x3)
should remove x3 from the predictor list, so that x3 is regressed on x1, x2, x4, x5, x6.
Is there any way to express this in my function, or is there a better function for it?
You can do it like this:
# sample data
sampleData <- data.frame(matrix(rnorm(500),100,5))
colnames(sampleData) <- c("A","B","C","D","E")
# function
fitRegression <- function(mydata, dependentVariable) {
# select your independent and dependent variables
dependentVariableIndex<-which(colnames(mydata)==dependentVariable)
independentVariableIndices<-which(colnames(mydata)!=dependentVariable)
fit = lm(formula = as.formula(paste(colnames(mydata)[dependentVariableIndex], "~", paste(colnames(mydata)[independentVariableIndices], collapse = "+"), sep = "" )), data = mydata)
return(fit)
}
# ground truth
lm(formula = A~B+C+D+E, data = sampleData)
# reconcile results
fitRegression(sampleData, "A")
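A more compact variant of the same function, as a sketch: setdiff picks the predictors and reformulate builds the formula. The function name fit_regression and the mtcars subset are only for illustration.

```r
# Regress `dependent` (a column name given as a string) on all other columns
fit_regression <- function(data, dependent) {
  stopifnot(dependent %in% names(data))
  predictors <- setdiff(names(data), dependent)
  lm(reformulate(predictors, response = dependent), data = data)
}

cars_small <- mtcars[, c("mpg", "wt", "hp", "disp")]

fit_wt <- fit_regression(cars_small, "wt")  # wt ~ mpg + hp + disp
names(coef(fit_wt))
```

Because the formula lists the predictors explicitly, the printed call shows exactly which variables went into each model.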
You want to select the Y variable in your argument. The main difficulty is to pass this argument without any quotes in your function (it is apparently the expected result in your code). Therefore you can use this method, using the combination deparse(substitute(...)):
fitRegression <- function(data, dependentVariable) {
formula <- as.formula(paste0(deparse(substitute(dependentVariable)), "~."))
return(lm(formula, data) )
}
fitRegression(mtcars, disp)
That will return the model.
The code below uses "purrr" and "caret" to produce a list of models, one per column, each using all the other columns as predictors.
df <-mtcars
library(purrr);library(caret)
#create training set
vect <- createDataPartition(1:nrow(df), p=0.8, list = FALSE)
#build model list
ModList <- 1:length(df) %>%
map(function(col) train(y= df[vect,col], x= df[vect,-col], method="lm"))
In the minimal example below, I am trying to use the values of a character string vars in a regression formula. However, I am only able to pass the string of variable names ("v2+v3+v4") to the formula, not the real meaning of this string (e.g., "v2" is dat$v2).
I know there are better ways to run the regression (e.g., lm(v1 ~ v2 + v3 + v4, data=dat)). My situation is more complex, and I am trying to figure out how to use a character string in a formula. Any thoughts?
Updated below code
# minimal example
# create data frame
v1 <- rnorm(10)
v2 <- sample(c(0,1), 10, replace=TRUE)
v3 <- rnorm(10)
v4 <- rnorm(10)
dat <- cbind(v1, v2, v3, v4)
dat <- as.data.frame(dat)
# create objects of column names
c.2 <- colnames(dat)[2]
c.3 <- colnames(dat)[3]
c.4 <- colnames(dat)[4]
# shortcut to get to the type of object my full code produces
vars <- paste(c.2, c.3, c.4, sep="+")
### TRYING TO SOLVE FROM THIS POINT:
print(vars)
# [1] "v2+v3+v4"
# use vars in regression
regression <- paste0("v1", " ~ ", vars)
m1 <- lm(as.formula(regression), data=dat)
Update:
@Arun was correct about the missing "" on v1 in the first example. This fixed my example, but I was still having problems with my real code. In the code chunk below, I adapted my example to better reflect my actual code. I chose to create a simpler example at first, thinking that the problem was the string vars.
Here's an example that does not work :) Uses the same data frame dat created above.
dv <- colnames(dat)[1]
r2 <- colnames(dat)[2]
# the following loop creates objects r3, r4, r5, and r6
# r5 and r6 are interaction terms
for (v in 3:4) {
r <- colnames(dat)[v]
assign(paste("r",v,sep=""),r)
r <- paste(colnames(dat)[2], colnames(dat)[v], sep="*")
assign(paste("r",v+2,sep=""),r)
}
# combine r3, r4, r5, and r6 then collapse and remove trailing +
vars2 <- sapply(3:6, function(i) {
paste0("r", i, "+")
})
vars2 <- paste(vars2, collapse = '')
vars2 <- substr(vars2, 1, nchar(vars2)-1)
# concatenate dv, r2 (as a factor), and vars into `eq`
eq <- paste0(dv, " ~ factor(",r2,") +", vars2)
Here is the issue:
print(eq)
# [1] "v1 ~ factor(v2) +r3+r4+r5+r6"
Unlike regression in the first example, eq does not bring in the column names (e.g., v3). The object names (e.g., r3) are retained. As such, the following lm() command does not work.
m2 <- lm(as.formula(eq), data=dat)
I see a couple issues going on here. First, and I don't think this is causing any trouble, but let's make your data frame in one step so you don't have v1 through v4 floating around both in the global environment as well as in the data frame. Second, let's just make v2 a factor here so that we won't have to deal with making it a factor later.
dat <- data.frame(v1 = rnorm(10),
v2 = factor(sample(c(0,1), 10, replace=TRUE)),
v3 = rnorm(10),
v4 = rnorm(10) )
Part One
Now, for your first part, it looks like this is what you want:
lm(v1 ~ v2 + v3 + v4, data=dat)
Here's a simpler way to do that, though you still have to specify the response variable.
lm(v1 ~ ., data=dat)
Alternatively, you certainly can build up the formula with paste and call lm on it.
f <- paste(names(dat)[1], "~", paste(names(dat)[-1], collapse=" + "))
# "v1 ~ v2 + v3 + v4"
lm(f, data=dat)
However, my preference in these situations is to use do.call, which evaluates expressions before passing them to the function; this makes the resulting object more suitable for calling functions like update on. Compare the call part of the output.
do.call("lm", list(as.formula(f), data=as.name("dat")))
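To make the difference in the call concrete, here is a small sketch on simulated data. The data frame and the names m_direct and m_docall are just for this illustration.

```r
set.seed(1)
dat <- data.frame(v1 = rnorm(20), v2 = rnorm(20), v3 = rnorm(20))

f <- "v1 ~ v2 + v3"

# Passing the string directly: the stored call only remembers the symbol `f`
m_direct <- lm(f, data = dat)

# do.call evaluates as.formula(f) first, so the stored call holds the actual formula
m_docall <- do.call("lm", list(as.formula(f), data = as.name("dat")))

m_direct$call
# lm(formula = f, data = dat)
m_docall$call
# lm(formula = v1 ~ v2 + v3, data = dat)

# The fits themselves are the same
all.equal(coef(m_direct), coef(m_docall))
```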
Part Two
About your second part, it looks like this is what you're going for:
lm(v1 ~ factor(v2) + v3 + v4 + v2*v3 + v2*v4, data=dat)
First, because v2 is a factor in the data frame, we don't need that part, and secondly, this can be simplified further by better using R's methods for using arithmetical operations to create interactions, like this.
lm(v1 ~ v2*(v3 + v4), data=dat)
I'd then simply create the formula using paste; the loop with assign, even in the larger case, is probably not a good idea.
f <- paste(names(dat)[1], "~", names(dat)[2], "* (",
paste(names(dat)[-c(1:2)], collapse=" + "), ")")
# "v1 ~ v2 * ( v3 + v4 )"
It can then be called using either lm directly or with do.call.
lm(f, data=dat)
do.call("lm", list(as.formula(f), data=as.name("dat")))
About your code
The problem you had with trying to use r3 etc. was that the formula needs the contents of the variable r3 (the string "v3"), not the name r3 itself. To look up the contents by name, you need get, like this, and then you'd collapse the values together with paste.
vars <- sapply(paste0("r", 3:6), get)
paste(vars, collapse=" + ")
However, a better way would be to avoid assign and just build a vector of the terms you want, like this.
vars <- NULL
for (v in 3:4) {
vars <- c(vars, colnames(dat)[v], paste(colnames(dat)[2],
colnames(dat)[v], sep="*"))
}
paste(vars, collapse=" + ")
A more R-like solution would be to use lapply:
vars <- unlist(lapply(colnames(dat)[3:4],
function(x) c(x, paste(colnames(dat)[2], x, sep="*"))))
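Putting the pieces together, a sketch of the whole path from column names to fitted model might look like this. The data are simulated, with v2 already a factor as suggested above, so factor() is not needed in the formula.

```r
set.seed(1)
dat <- data.frame(v1 = rnorm(20),
                  v2 = factor(rep(c(0, 1), 10)),
                  v3 = rnorm(20),
                  v4 = rnorm(20))

# Main effects v3, v4 plus their interactions with v2, as character terms
vars <- unlist(lapply(colnames(dat)[3:4],
                      function(x) c(x, paste(colnames(dat)[2], x, sep = "*"))))

# Assemble the full formula string, then convert and fit
eq <- paste(colnames(dat)[1], "~", colnames(dat)[2], "+",
            paste(vars, collapse = " + "))
eq
# [1] "v1 ~ v2 + v3 + v2*v3 + v4 + v2*v4"

m2 <- lm(as.formula(eq), data = dat)
names(coef(m2))
```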
TL;DR: use paste.
create_ctree <- function(col){
myFormula <- as.formula(paste(col, "~ ."))
ctree(myFormula, data)
}
create_ctree("class")
I am trying to create a model to predict "y" from data "D", which contains the predictors x1 to x100 and 200 other variables. Since the Xs are not stored consecutively, I can't select them by column position.
I can't use ctree(y ~ ., data = D) because of the other variables. Is there a way to refer to x1 through x100 in the model,
instead of writing out a very long call like
ctree(y ~ x1 + x2 + ..... + x100)
Any recommendations would be appreciated.
Two more. The simplest in my mind is to subset the data:
ctree(y ~ ., data = D[, c("y", paste0("x", 1:100))])
Or a more functional approach to building dynamic formulas:
ctree(reformulate(paste0("x", 1:100), "y"), data = D)
Construct your formula as a text string, and convert it with as.formula.
vars <- names(D)[1:100] # or wherever your desired predictors are
fm <- paste("y ~", paste(vars, collapse="+"))
fm <- as.formula(fm)
ctree(fm, data=D, ...)
You can use this:
fmla = as.formula(paste("y", paste0("x", 1:100, collapse=" + "), sep=" ~ "))
ctree(fmla, data = D)
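If the predictor columns are scattered among the others and you don't want to hard-code 1:100, one option is to pick them by name pattern. This is only a sketch on a tiny made-up D. It assumes the predictors are exactly the columns named x followed by digits, and the ctree call is left as a comment since it needs the party or partykit package.

```r
set.seed(1)

# Hypothetical stand-in for `D`: x columns interleaved with unrelated ones
D <- data.frame(y = rnorm(10), x1 = rnorm(10), other_a = rnorm(10),
                x2 = rnorm(10), other_b = rnorm(10), x3 = rnorm(10))

# Keep only the columns whose names are "x" followed by digits
x_cols <- grep("^x[0-9]+$", names(D), value = TRUE)

fmla <- reformulate(x_cols, response = "y")
fmla
# y ~ x1 + x2 + x3

# ctree(fmla, data = D)  # requires library(party) or library(partykit)
```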