I'm creating a function that performs cross-validation and ridge regression to select predictors for a model. The inputs of my function are dataframe and the desired outcome variable outcome (what is being predicted). I'm using model.matrix() to create an x matrix that I will pass to glmnet(). My function uses outcome as the object argument in model.matrix(), but it looks like outcome is the wrong data type to pass through model.matrix(). Using model.matrix() normally, I would write something like model.matrix(Weight~.,dataframe). In this case, however, model.matrix won't work as model.matrix(outcome~.,dataframe) or model.matrix(dataframe$outcome~.,dataframe). Any ideas?
If 'outcome' is an object that stores the column name as a string (e.g. "Weight"), then we can paste the formula together as text and convert it with formula():
model.matrix(formula(paste(outcome, "~ .")), dataframe)
A reproducible example with the 'iris' dataset:
data(iris)
outcome <- "Species"
m1 <- model.matrix(formula(paste(outcome, "~ .")), iris)
m2 <- model.matrix(Species ~ ., iris)
identical(m1, m2)
#[1] TRUE
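To connect this back to the original function, here is a minimal sketch of a helper that takes the data frame and the outcome name as a string and returns the x matrix and y vector. The name build_xy is hypothetical, reformulate() is used as a base R alternative to paste() plus formula(), and the code assumes the data has no missing values (model.matrix() drops incomplete rows by default, which would misalign x and y).

```r
# Hypothetical helper: build x and y for an outcome given as a string.
build_xy <- function(dataframe, outcome) {
  # reformulate(".", response = "mpg") yields the formula mpg ~ .
  fo <- reformulate(".", response = outcome)
  x <- model.matrix(fo, data = dataframe)
  # Drop the intercept column; glmnet() fits its own intercept by default.
  x <- x[, colnames(x) != "(Intercept)", drop = FALSE]
  # Assumes no missing values, so rows of x and y line up.
  y <- dataframe[[outcome]]
  list(x = x, y = y)
}

xy <- build_xy(mtcars, "mpg")
dim(xy$x)
```

The resulting xy$x and xy$y could then be passed on to glmnet() or cv.glmnet().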
Related
I am trying to run many linear regressions models simultaneously. Please help me to make a code for this.
I am working with two data frames. The first data frame has 100 dependent variables and the second has 100 independent variables. I want simple linear models like
lm1 <- lm(data_frame_1[[1]] ~ data_frame_2[[1]])
lm2 <- lm(data_frame_1[[2]] ~ data_frame_2[[2]])
and so on. That means I have to run 100 regression models, and I want to do this simultaneously. Please help me write code that runs all of these models at once.
It is not that clear what you mean by simultaneously. But maybe doing a loop is fine in your case?
model.list = list()
for (i in 1:100){
  # regress column i of the first data frame on column i of the second
  model.list[[i]] = lm(data_frame_1[[i]] ~ data_frame_2[[i]])
}
Using dataframe_1 and dataframe_2 defined in the Note at the end we define a function LM that takes an x name and y name and performs a regression of y on x using the columns from those data frames. The result is a list of lm objects. Note that the Call: line in the output of each output list component correctly identifies which columns were used.
LM <- function(xname, yname) {
  fo <- formula(paste(yname, "~", xname))
  do.call("lm", list(fo, quote(cbind(dataframe_1, dataframe_2))))
}
Map(LM, names(dataframe_1), names(dataframe_2))
giving:
$x1
Call:
lm(formula = y1 ~ x1, data = cbind(dataframe_1, dataframe_2))
Coefficients:
(Intercept) x1
3.0001 0.5001
... etc ...
Note
Using the built-in anscombe data frame, define dataframe_1 as the x columns and dataframe_2 as the y columns.
dataframe_1 <- anscombe[grep("x", names(anscombe))]
dataframe_2 <- anscombe[grep("y", names(anscombe))]
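Once the models are in a list, the coefficients can be collected into a single table. The following is a sketch under the same anscombe setup; the names xs, ys, fits and coef_table are made up for illustration. It maps over the columns directly rather than over their names, so it is shorter than LM above, but the Call: line of each model will only show y ~ x instead of the actual column names.

```r
xs <- anscombe[grep("x", names(anscombe))]  # columns x1..x4
ys <- anscombe[grep("y", names(anscombe))]  # columns y1..y4

# Map walks over both data frames column by column, in parallel.
fits <- Map(function(x, y) lm(y ~ x), xs, ys)

# One row per model, one column per coefficient.
coef_table <- t(sapply(fits, coef))
coef_table
```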
I have estimated several models (a, b) and I want to calculate predicted probabilities for each model using a single data frame (df) and store the predicted probabilities of each model as new variables in that data frame. For example:
a <- lm(y ~ z, df) # estimate model a
b <- glm(w ~ x, df) # estimate model b
models <- c("a","b") # create vector of model objects
for (i in models) {
assign(
paste("df$", i, sep = ""),
predict(i, df)
)}
I have tried the above but receive the error "no applicable method for 'predict' applied to an object of class "character"" with the last word changing as I change class of the predicted object, e.g. predict(as.numeric(i),df).
Any ideas? Ideally I could vectorize this as well.
You should rarely have to use assign(), and $ should not be used with variable names. The [[ ]] operator is better for dynamic subsetting than $. It would also be easier if you made a list of the models themselves rather than just their names. Here's an example:
df<-data.frame(x=runif(30), y=runif(30), w=runif(30), z=runif(30))
a <- lm(y ~ z, df) # estimate model a
b <- lm(w ~ x, df) # estimate model b
models <- list(a=a,b=b) # create a named list of model objects
# 1) for loop
for (m in names(models)) {
df[[m]] <- predict(models[[m]], df)
}
Or, rather than a for loop, you could generate all the values with Map and then append them with cbind afterward
# 2) Map/cbind
df <- cbind(df, Map(function(m) predict(m,df), models))
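A third variant, sketched here as an assumption rather than part of the original answer, assigns the whole list of predictions in one step. Because it indexes df by the model names, re-running it overwrites the existing columns instead of appending duplicates.

```r
set.seed(1)  # only to make the illustration repeatable
df <- data.frame(x = runif(30), y = runif(30), w = runif(30), z = runif(30))
models <- list(a = lm(y ~ z, df), b = lm(w ~ x, df))

# lapply returns a named list of prediction vectors; assigning it to
# df[names(models)] creates (or overwrites) one column per model.
df[names(models)] <- lapply(models, predict, newdata = df)
head(df)
```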
Suppose I am given a data frame df at runtime. How do I fit a polynomial regression in which every predictor is a column of df and enters with a constant degree k >= 2?
The difficulty is that df is read at runtime, so the number and names of its columns are unknown when the script is written (though I do know the response variable is the first column). So when I call lm I do not know how to write the formula.
In case of k = 1, then I can simply write a generic linear formula
names(df)[1] <- "y"
lm(y ~ ., data = df)
Is there something similar I can do for a polynomial formula?
One rather convoluted way is to create a formula for the lm regression call by pasting the terms together.
# some data
dat <- data.frame(replicate(10, rnorm(20)))
# Create formula - wrap every column name except the first in poly(..., 2)
form <- formula(paste(names(dat)[1], " ~ ",
paste0("poly(", names(dat)[-1], ", 2)", collapse="+")))
# run regression
lm(form , data=dat)
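The same idea can be written for an arbitrary degree k. This is a sketch, assuming the column names are syntactically valid R names and that there are enough rows to estimate all the terms; the data and the names terms, form and fit are made up for illustration. reformulate() is used here in place of paste() plus formula().

```r
set.seed(42)  # only to make the illustration repeatable
df <- data.frame(replicate(4, rnorm(20)))  # stand-in for runtime data: X1..X4
k <- 3L                                    # polynomial degree

# One poly() term per predictor; the first column is the response.
terms <- sprintf("poly(%s, %d)", names(df)[-1], k)
form <- reformulate(terms, response = names(df)[1])

fit <- lm(form, data = df)
form
```

Note that poly() produces orthogonal polynomials by default; pass raw = TRUE inside the sprintf() template if raw powers are wanted.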
I have a regression model created with by. I know I can use sapply to extract specific parts of the model for each factor, but what if I wanted something like the whole summary, anova, etc.?
model <- with(data, by(data, factor, function(data) lm(y ~ x, data=data)))
sapply will coerce the results of summary.lm and anova.lm to a matrix. I think you may want to use lapply, which applies a function (here summary) on each element in the list produced by by, and returns a list.
models <- by(warpbreaks, warpbreaks$tension, function(x){
lm(breaks ~ wool, data = x)
})
lapply(models, summary)
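Keeping the full model objects in the list means individual pieces can still be pulled out later. A small sketch on the same warpbreaks example (the names r2 and pvals are made up for illustration): sapply is fine here because each extracted value is a single number.

```r
models <- by(warpbreaks, warpbreaks$tension, function(x) {
  lm(breaks ~ wool, data = x)
})

# R-squared of each per-tension model
r2 <- sapply(models, function(m) summary(m)$r.squared)

# p-value of the wool term from each model's ANOVA table
pvals <- sapply(models, function(m) anova(m)[["Pr(>F)"]][1])

r2
pvals
```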
I am using ddply to execute glm on subsets of my data. I am having difficulty accessing the estimated Y values. I am able to get the model parameter estimates using the below code, but all the variations I've tried to get the fitted values have fallen short. The dependent and independent variables in the glm model are column vectors, as is the "Dmsa" variable used in the ddply operation.
Define the model:
Model <- function(df){coef(glm(Y~D+O+B+A+log(M), family=poisson(link="log"), data=df))}
Execute the model on subsets:
Modrpt <- ddply(msadata, "Dmsa", Model)
Printing Modrpt gives the model coefficients, but no Y estimates.
I know that if I wasn't using ddply, I can access the glm estimated Y values by using the code:
Model <- glm(Y~D+O+B+A+log(M), family=poisson(link="log"), data=msadata)
fits <- Model$fitted.values
I have tried both of the following to get the fitted values for the subsets, but no luck:
fits <- fitted.values(ddply(msadata, "Dmsa", Model))
fits <- ddply(msadata, "Dmsa", fitted.values(Model))
I'm sure this is very easy to code... unfortunately, I'm just learning R. Does anyone know where I am going wrong?
You can use an anonymous function in your call to ddply e.g.
require(plyr)
data(iris)
model <- function(df){
lm( Petal.Length ~ Sepal.Length + Sepal.Width , data = df )
}
ddply( iris , "Species" , function(x) fitted.values( model(x) ) )
This has the advantage that you can also, without rewriting your model function, get the coef values by doing
ddply( iris , "Species" , function(x) coef( model(x) ) )
As @James points out, this will fall down if you have splits of unequal size; it is better to use dlply, which puts the result of each subset in its own list element.
(I make no claims for statistical relevance or correctness of the example model - it is just an example)
I'd recommend doing this in two steps:
library(plyr)
# First fit the models
models <- dlply(iris, "Species", lm,
formula = Petal.Length ~ Sepal.Length + Sepal.Width )
# Next, extract the fitted values
ldply(models, fitted.values)
# Or maybe, to get one row per fitted value
ldply(models, function(m) data.frame(fitted = fitted.values(m)))
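For comparison, the same two-step approach can be sketched in base R with split() and lapply() instead of plyr. This is an alternative technique, not what the answers above use, and the names pieces, fits and iris_fit are made up for illustration. Because each group gets its own list element, it also copes with groups of unequal size.

```r
# Step 1: fit one model per species
pieces <- split(iris, iris$Species)
models <- lapply(pieces, function(d) {
  lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = d)
})

# Step 2: extract the fitted values, one vector per species
fits <- lapply(models, fitted.values)

# Optionally put the fitted values back next to the original rows
iris_fit <- iris
iris_fit$fitted <- unsplit(fits, iris$Species)
head(iris_fit)
```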