Predict() + assign() loop across variables - r

I have estimated several models (a, b) and I want to calculate predicted probabilities for each model using a single data frame (df) and store the predicted probabilities of each model as new variables in that data frame. For example:
a <- lm(y ~ z, df) # estimate model a
b <- glm(w ~ x, df) # estimate model b
models <- c("a","b") # create vector of model objects
for (i in models) {
assign(
paste("df$", i, sep = ""),
predict(i, df)
)}
I have tried the above but receive the error "no applicable method for 'predict' applied to an object of class "character"" with the last word changing as I change class of the predicted object, e.g. predict(as.numeric(i),df).
Any ideas? Ideally I could vectorize this as well.

predict() fails because i is the character string "a", not the model object. You should rarely need assign(), and $ cannot be used with a column name stored in a variable; the [[ ]] operator is the right tool for dynamic subsetting. It is also easier to keep a list of the models rather than just their names. Here's an example:
df<-data.frame(x=runif(30), y=runif(30), w=runif(30), z=runif(30))
a <- lm(y ~ z, df) # estimate model a
b <- lm(w ~ x, df) # estimate model b
models <- list(a=a,b=b) # create named list of model objects
# 1) for loop
for (m in names(models)) {
df[[m]] <- predict(models[[m]], df)
}
Or, rather than a for loop, you could generate all the values with Map and then append them with cbind afterward:
# 2) Map/cbind
df <- cbind(df, Map(function(m) predict(m,df), models))
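If the models are only available by name, as in the question, mget() can turn the character vector into a named list first. A minimal sketch, assuming the objects a and b exist in the calling environment; the simulated data, model_names and the pred_ column prefix are illustrative only:

```r
set.seed(42)
df <- data.frame(x = runif(30), y = runif(30), w = runif(30), z = runif(30))
a <- lm(y ~ z, df)
b <- lm(w ~ x, df)

model_names <- c("a", "b")
models <- mget(model_names)   # named list of the fitted objects

# one vector of predictions per model, stored as new columns
preds <- lapply(models, predict, newdata = df)
df[paste0("pred_", names(preds))] <- preds

head(df)
```

Prefixing the new columns avoids clashing with an existing column that happens to share a model's name.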

Related

fitting linear regression models with different predictors using loops

I want to fit regression models using a single predictor variable at a time. In total I have 7 predictors and 1 response variable. I want to write a chunk of code that picks a predictor variable from the data frame and fits a model. I would also like to extract the regression coefficient (not the intercept) and its sign, and store them in 2 vectors. Here's my code:
for (x in (1:7))
{
fit <- lm(distance ~ FAA_unique_with_duration_filtered[x] , data=FAA_unique_with_duration_filtered)
coeff_values<-summary(fit)$coefficients[,1]
coeff_value<-coeff_values[2]
append(coeff_value_vector,coeff_value , after = length(coeff_value_vector))
append(RCs_sign_vector ,sign(coeff_values[2]) , after = length(RCs_sign_vector))
}
Here x should select the first column, then the second, and so on. However, I am getting the following error:
Error in model.frame.default(formula = distance ~ FAA_unique_with_duration_filtered[x], :
invalid type (list) for variable 'FAA_unique_with_duration_filtered[x]'
Is there a way to do this using loops?
You don't really need loops for this.
Suppose we want to regress y1, the 5th column of the built-in anscombe dataset, separately on each of the first 4 columns.
Then:
a <- anscombe
reg <- function(i) coef(lm(y1 ~., a[c(5, i)]))[[2]] # use lm
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
# or
a <- anscombe
reg <- function(i) cov(a$y1, a[[i]]) / var(a[[i]]) # use formula for slope
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
Alternatively, use a for loop, where reg is either of the reg definitions above:
a <- anscombe
coefs <- numeric(4)
for(i in 1:4) coefs[i] <- reg(i)
signs <- sign(coefs)
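The error in the question arises because FAA_unique_with_duration_filtered[x] is a one-column data frame (a list), not a vector. If an explicit loop over column names is preferred, the formula can be built with reformulate(). A sketch on the same anscombe data; predictors is an illustrative name:

```r
a <- anscombe
predictors <- names(a)[1:4]

# preallocate a named result vector, one slot per predictor
coefs <- setNames(numeric(length(predictors)), predictors)

for (p in predictors) {
  fit <- lm(reformulate(p, response = "y1"), data = a)  # e.g. y1 ~ x1
  coefs[p] <- coef(fit)[[2]]                            # slope, not intercept
}
signs <- sign(coefs)
```

Note that append() in the question's loop returns a new vector and does not modify its argument, so its result must be assigned back; preallocating as above sidesteps that.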

How to extract arguments used in a fitted nls model for use in a second fitting with do.call

I am trying to use the same original arguments from a fitted nls model in the fitting of a second model using a subset of the data (for a cross validation exercise). I can retrieve the arguments (e.g. fit$call), but am having a hard time passing these arguments to do.call.
# original model ----------------------------------------------------------
# generate data
set.seed(1)
n <- 100
x <- sort(rlnorm(n, 1, 0.2))
y <- 0.1*x^3 * rlnorm(n, 0, 0.1)
df <- data.frame(x, y)
plot(y~x,df)
# fit model
fit <- nls(y ~ a*x^b, data=df, start=list(a=1,b=2), lower=list(a=0, b=0), algo="port")
summary(fit)
plot(y~x,df)
lines(df$x, predict(fit), col=4)
# perform model fit on subset with same starting arguments ----------------
# df sampled subset
dfsub <- df[sample(nrow(df), nrow(df)*0.5),]
dim(dfsub)
plot(y~x, dfsub)
ARGS <- fit$call # original call information
ARGS$data <- dfsub # substitute dfsub as data
ARGS <- as.list(ARGS) # change class from "call", to "list"
fitsub <- do.call(nls, args = ARGS )
# Error in xj[i] : invalid subscript type 'closure'
Also, as a side note, fit$data just returns the name of the data object. Is the data actually contained within the fitted nls object as well (as lm and other model fits sometimes do)?
Use update to add a subset argument:
nr <- nrow(df)
update(fit, subset = sample(nr, nr * 0.5) )
You can use the update function to refit the model with a different data set:
fitsub <- update(fit, data = dfsub)
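Roughly what update() does can also be done by hand: edit the stored call and evaluate it. A minimal sketch reusing the question's simulated data; cl is an illustrative name, and the data argument is replaced with the symbol dfsub rather than the data frame itself so the stored call stays readable:

```r
set.seed(1)
n <- 100
x <- sort(rlnorm(n, 1, 0.2))
y <- 0.1 * x^3 * rlnorm(n, 0, 0.1)
df <- data.frame(x, y)
fit <- nls(y ~ a * x^b, data = df, start = list(a = 1, b = 2),
           lower = list(a = 0, b = 0), algorithm = "port")

dfsub <- df[sample(nrow(df), nrow(df) * 0.5), ]

cl <- fit$call            # the original call, with all its arguments
cl$data <- quote(dfsub)   # point it at the subset
fitsub <- eval(cl)        # refit with the same start, lower and algorithm

coef(fitsub)
```

Evaluating the call object directly avoids converting it to a list, where the leading function name becomes an extra unnamed argument.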

Formula from Data.frame Columns

I want to create a regression model from a vector (IC50) against a number of different molecular descriptors (A,B,C,D etc).
I want to use:
model <- lm (IC50 ~ A + B + C + D)
The molecular descriptors are found in the columns of a data.frame. I would like to use a function that takes the IC50 vector and the appropriately subsetted data.frame as inputs.
My problem is that I can't convert the columns to a formula for the model.
Can anyone help?
Sample data and feeble attempt,
IC50 <- c(0.1,0.2,0.55,0.63,0.005)
descs <- data.frame(A=c(0.002,0.2,0.654,0.851,0.654),
B=c(56,25,89,55,60),
C=c(0.005,0.006,0.004,0.009,0.007),
D=c(189,202,199,175,220))
model <- function(x=IC50,y=descs) {
a <- lm(x ~ y)
return(a)
}
I went down the substitute/deparse route but this didn't import the data.
You can simply do:
model <- function(x = IC50, y = descs)
lm(x ~ ., data = y)
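If an explicit formula with the descriptor names is wanted (for example, so that the printed model shows IC50 ~ A + B), it can be assembled from the column names with reformulate(). A sketch under the assumption that the response is bound into the same data frame; fit_descs and its argument names are hypothetical:

```r
IC50 <- c(0.1, 0.2, 0.55, 0.63, 0.005)
descs <- data.frame(A = c(0.002, 0.2, 0.654, 0.851, 0.654),
                    B = c(56, 25, 89, 55, 60),
                    C = c(0.005, 0.006, 0.004, 0.009, 0.007),
                    D = c(189, 202, 199, 175, 220))

fit_descs <- function(response, descriptors) {
  dat <- cbind(IC50 = response, descriptors)                  # response + predictors
  form <- reformulate(names(descriptors), response = "IC50")  # IC50 ~ A + B + ...
  lm(form, data = dat)
}

fit <- fit_descs(IC50, descs[c("A", "B")])
formula(fit)
```

With only five observations, fitting all four descriptors plus an intercept leaves no residual degrees of freedom, which is why the usage line keeps two.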

R:fit dynamic number of explanatory variable into polynomial regression

Suppose I am given a data frame df at runtime. How do I fit a polynomial regression model in which each predictor is a column of df, raised to a constant degree k >= 2?
The difficulty is that df is read at runtime, so the number and names of its columns are unknown when the script is written (but I do know the response variable is the 1st column). So when I call lm I do not know how to write the formula.
In case of k = 1, then I can simply write a generic linear formula
names(df)[1] <- "y"
lm(y ~ ., data = df)
Is there something similar I can do for a polynomial formula?
One rather convoluted way is to create a formula for the lm regression call by pasting the terms together.
# some data
dat <- data.frame(replicate(10, rnorm(20)))
# Create formula - wrap every column name except the first in poly(..., 2)
form <- formula(paste(names(dat)[1], " ~ ",
paste0("poly(", names(dat)[-1], ", 2)", collapse="+")))
# run regression
lm(form , data=dat)
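The same idea can be wrapped in a helper that takes the degree k as an argument. A minimal sketch, assuming syntactically valid column names and the response in the first column; poly_formula is a hypothetical name, and the data here are smaller than above so that a degree of 3 leaves enough observations:

```r
set.seed(1)
dat <- data.frame(replicate(4, rnorm(50)))   # columns X1..X4, X1 is the response

poly_formula <- function(data, k) {
  # one poly() term per predictor column, all of degree k
  terms <- sprintf("poly(%s, %d)", names(data)[-1], k)
  reformulate(terms, response = names(data)[1])
}

form <- poly_formula(dat, k = 3)
fit <- lm(form, data = dat)
form
```

Each predictor contributes k coefficients, so the number of rows has to comfortably exceed 1 + k times the number of predictors.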

using lm() in R for a series of independent fits

I want to use lm() in R to fit a series (actually 93) separate linear regressions. According to the R lm() help manual:
"If response is a matrix a linear model is fitted separately by least-squares to each column of the matrix."
This works fine as long as there are no missing data points in the Y response matrix. When there are missing points, instead of fitting each regression with the available data, every row that has a missing data point in any column is discarded. Is there any way to specify that lm() should fit all of the columns in Y independently and not discard rows where an individual column has a missing data point?
If you want to run n separate regressions of Y1, Y2, ..., Yn on X, a single lm() call with a matrix response won't do it; fit one model per column with R's apply functions instead:
# create the response matrix and set some random values to NA
values <- runif(50)
values[sample(1:length(values), 10)] <- NA
Y <- data.frame(matrix(values, ncol=5))
colnames(Y) <- paste0("Y", 1:5)
# single regression term
X <- runif(10)
# create regression between each column in Y and X
lms <- lapply(colnames(Y), function(y) {
form <- paste0(y, " ~ X")
lm(form, data=Y)
})
# lms is a list of lm objects, can access them via [[]] operator
# or work with it using apply functions once again
sapply(lms, function(x) {
summary(x)$adj.r.squared
})
#[1] -0.06350560 -0.14319796 0.36319518 -0.16393125 0.04843368
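To check that each fit really keeps the rows available for its own column, compare the number of observations per model. A sketch with NAs placed at fixed positions; keeping X in the same data frame (here called dat) is an assumption that avoids relying on a global variable:

```r
set.seed(1)
dat <- data.frame(matrix(runif(50), ncol = 5))
colnames(dat) <- paste0("Y", 1:5)
dat$Y1[1] <- NA      # one missing value in Y1
dat$Y2[2:3] <- NA    # two missing values in Y2
dat$X <- runif(10)

responses <- paste0("Y", 1:5)
lms <- lapply(responses, function(y) {
  lm(reformulate("X", response = y), data = dat)
})
names(lms) <- responses

# rows used by each separate fit
sapply(lms, nobs)

# rows used by the single matrix-response fit
nobs(lm(cbind(Y1, Y2, Y3, Y4, Y5) ~ X, data = dat))
```

The separate fits use 9, 8, 10, 10 and 10 rows, while the matrix-response fit drops every row with any missing value and keeps only 7.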

Resources