Creating variable names from varying lists - r

I am trying to create variable names from lists in R, but am struggling!
What I would ultimately like to do is to use previously created lists to create a formula for a multiple linear regression, whereby each value within the list will identify one of the explanatory variables of the regression formula.
I am starting with x lists of variable lengths (GoodModels_LMi, where i goes from 1 to x) and want to use each list to create a separate formula.
for (i in 1:x){
lm(formula created from appropriate list)
i<-i+1
}
The lists correspond to variable numbers to be chosen from a data matrix (AllData). So for example if:
GoodModels_LM1<-c(2,4,8)
I would like my regression formula to be:
AllData[,1]~AllData[,2]+AllData[,4]+AllData[,8]
I have been trying to use as.formula() and paste() to achieve this, however, I am not sure how to create the second part of my formula.
as.formula(paste("AllData[,",i,"]~",paste(?????????)))
I know that this below is not right, but is as close as I have come:
paste("AllData[,",paste("GoodModels_LM",i,sep=""),"]",collapse="+")
I have also looked into assign(), but have not succeeded as the value argument was the same as the x argument.
Thanks very much for any help with this!
Olivia

Your formula should contain the column names, not the actual data. Here is a small demo using iris.
Imagine you want to run a regression using columns 2, 4, and 5 from iris. First, construct a formula using paste():
vars <- c(2, 4, 5)
frm <- paste("Sepal.Length ~ ", paste(names(iris)[vars], collapse=" + "))
frm
"Sepal.Length ~ Sepal.Width + Petal.Width + Species"
So, the object frm is a string containing a formula that you can pass to lm():
lm(frm, iris)
Call:
lm(formula = frm, data = iris)
Coefficients:
      (Intercept)        Sepal.Width        Petal.Width
           2.5211             0.6982             0.3716
Speciesversicolor   Speciesvirginica
           0.9881             1.2376
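Applied to the original question, the same idea might look like the sketch below. It assumes AllData can be converted to a data frame with column names, and that the index vectors are kept in one list (good_models, a hypothetical name) rather than in separate GoodModels_LMi objects. The data here are random placeholders, and reformulate() from base R is used in place of paste().

```r
# Hypothetical stand-in for AllData: 20 rows, 10 numeric columns named V1..V10
set.seed(42)
AllData <- as.data.frame(matrix(rnorm(200), ncol = 10))

# Keep the index vectors in one list instead of GoodModels_LM1, GoodModels_LM2, ...
good_models <- list(c(2, 4, 8), c(3, 5))

fits <- lapply(good_models, function(idx) {
  # Build a formula such as V1 ~ V2 + V4 + V8 from the column names
  fo <- reformulate(names(AllData)[idx], response = names(AllData)[1])
  lm(fo, data = AllData)
})

formula(fits[[1]])
```

Keeping the index vectors in a list avoids get() or assign() on constructed object names, and each fitted model's Call: line shows the real column names.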

Related

linear regression function creating a list instead of a model

I'm trying to fit an lm model using R. However, for some reason this code creates a list of the data instead of the usual regression model.
The code I use is this one
lm(Soc_vote ~ Inclusivity + Gini + Un_den, data = general_inclusivity_sweden )
But instead of the usual coefficients, the title of the variable appears mixed with the data in this way:
(Intercept)  Inclusivity0.631  Inclusivity0.681  Inclusivity0.716  Inclusivity0.9
      35.00             -4.00             -6.74             -4.30            4.90
Does anybody know why this happened and how it can be fixed?
What you are seeing is called a named num (a numeric vector with names). You can do the following:
Model <- lm(Soc_vote ~ Inclusivity + Gini + Un_den, data = general_inclusivity_sweden) # Assign the model to an object called Model
summary(Model) # Summary table
Model$coefficients # See only the coefficients as a named numeric vector
Model$coefficients[[1]] # See the first coefficient without name
If you want all the coefficients without names (so just a numeric vector), try:
unname(coef(Model))
It would be good if you could provide a sample of your data but I'm guessing that the key problem is that the numeric data in Inclusivity is stored as a factor. e.g.,
library(tidyverse)
x <- tibble(incl = as.factor(c(0.631, 0.681, 0.716)),
soc_vote=1:3)
lm(soc_vote ~ incl, x)
Call:
lm(formula = soc_vote ~ incl, data = x)
Coefficients:
(Intercept) incl0.681 incl0.716
1 1 2
Whereas, if you first convert the Inclusivity column to double, you get
y <- x %>% mutate(incl = as.double(as.character(incl)))
lm(soc_vote ~ incl, y)
Call:
lm(formula = soc_vote ~ incl, data = y)
Coefficients:
(Intercept) incl
-13.74 23.29
Note that I needed to convert to character first, since converting a factor directly to a number returns the integer codes of its levels (1, 2, 3, ...) rather than the values shown in the labels.
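For readers not using the tidyverse, a base R sketch of the same check and conversion could look like this. The data frame is a made-up example that mirrors the one above, not the asker's data.

```r
# Hypothetical data: a numeric-looking column stored as a factor
x <- data.frame(incl = factor(c("0.631", "0.681", "0.716")),
                soc_vote = 1:3)

is.factor(x$incl)                            # check how the column is stored

codes  <- as.numeric(x$incl)                 # integer level codes, not the values
values <- as.numeric(as.character(x$incl))   # the values shown in the labels

# Replace the factor with the numeric values and refit
x$incl <- values
fit <- lm(soc_vote ~ incl, data = x)
names(coef(fit))
```

Running str() on the data frame before fitting is a quick way to spot columns that were read in as factors or characters.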

Regarding multiple linear models simultaneously

I am trying to run many linear regressions models simultaneously. Please help me to make a code for this.
I am working with two data frames. The first data frame has 100 dependent variables and the second has 100 independent variables. Now I want simple linear models like
lm1 <- lm(data_frame_1[[1]] ~ data_frame_2[[1]])
lm2 <- lm(data_frame_1[[2]] ~ data_frame_2[[2]])
and so on. That means I have to run 100 regression models, and I want to do this simultaneously. Please help me write the code to run all of these models at once.
It is not that clear what you mean by simultaneously. But maybe doing a loop is fine in your case?
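# Pre-allocate a list to hold the 100 fitted models
model.list <- vector("list", 100)
for (i in 1:100) {
  # Regress column i of the first data frame on column i of the second
  model.list[[i]] <- lm(data_frame_1[[i]] ~ data_frame_2[[i]])
}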
Using dataframe_1 and dataframe_2 defined in the Note at the end we define a function LM that takes an x name and y name and performs a regression of y on x using the columns from those data frames. The result is a list of lm objects. Note that the Call: line in the output of each output list component correctly identifies which columns were used.
LM <- function(xname, yname) {
fo <- formula(paste(yname, "~", xname))
do.call("lm", list(fo, quote(cbind(dataframe_1, dataframe_2))))
}
Map(LM, names(dataframe_1), names(dataframe_2))
giving:
$x1
Call:
lm(formula = y1 ~ x1, data = cbind(dataframe_1, dataframe_2))
Coefficients:
(Intercept) x1
3.0001 0.5001
... etc ...
Note
Using the built-in anscombe data frame, define dataframe_1 as the x columns and dataframe_2 as the y columns.
dataframe_1 <- anscombe[grep("x", names(anscombe))]
dataframe_2 <- anscombe[grep("y", names(anscombe))]
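If the Call: line does not matter, a shorter variant is to map over the columns themselves and then pull one number out of each model. This is only a sketch on the anscombe data from the Note. The names models and slopes are made up for illustration.

```r
# x columns and y columns of the built-in anscombe data, as in the Note
dataframe_1 <- anscombe[grep("x", names(anscombe))]
dataframe_2 <- anscombe[grep("y", names(anscombe))]

# Fit one simple regression per pair of columns; x and y are plain vectors here
models <- Map(function(x, y) lm(y ~ x), dataframe_1, dataframe_2)

# Extract the slope of each model into a named numeric vector
slopes <- sapply(models, function(m) coef(m)[["x"]])
slopes
```

The trade-off is that every model's call reads lm(formula = y ~ x), so the list names (x1, x2, ...) are the only record of which columns were used.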

Using a for loop to extract coefficients from multiple models

I have multiple cox models (with one variable static in all models) and am trying to extract the coefficient for that variable.
In all models the coefficient is indexed as follows: for example, in model1 it is model1[[8]][1]; for model2 it is model2[[8]][1], etc. I attempted to create a for loop in R, as shown below, but it is not working.
Could someone help me understand why I am getting an error when running the following code?
for (i in 1:5) {
coef[i] <- exp(summary(model[i])[[8]][1])
}
I get the following error "object 'model' not found".
Many thanks in advance
A
Here is an example of what I meant in my comment
data(iris)
model1 <- lm(data = iris, Sepal.Length ~ Sepal.Width + Species)
model2 <- lm(data = iris, Sepal.Length ~ Sepal.Width)
You can do this so you don't have to type all the models.
model.list <- mget(grep("model[0-9]+$", ls(), value = TRUE))
ls() lists all the objects in your workspace, grep() keeps the names that end in "model" followed by a number, and mget() collects those objects into a named list.
coefs<-lapply(model.list,function(x)coef(x)[2])
unlist(coefs)
Sepal.Width Sepal.Width
0.8035609 -0.2233611
Here's a generalized example:
model1 <- 1:5
model2 <- 2:6
I can execute a function like mean to find the average of each vector with a for loop:
for(i in 1:2) print(mean(get(paste0('model', i))))
#[1] 3
#[1] 4
It works. But a more standard approach is to store the objects in a list. Then I can apply the desired function with built-in functions like sapply:
lst <- list(model1, model2)
sapply(lst, mean)
#[1] 3 4
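The error in the question arises because model[i] asks R to subset an object literally called model, which does not exist; it does not build the name model1. Putting the two answers together, a sketch for exponentiated coefficients might look like the following. Logistic regressions on mtcars stand in for the Cox models here (an assumption made so the example runs with base R only), and the shared variable is picked by name rather than by position.

```r
# Stand-in models: two logistic regressions that share the predictor wt
model1 <- glm(am ~ wt, data = mtcars, family = binomial)
model2 <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Store the models in a named list instead of looping over object names
models <- list(model1 = model1, model2 = model2)

# Exponentiated coefficient of the shared variable, one value per model
ratios <- vapply(models, function(m) exp(coef(m)[["wt"]]), numeric(1))
ratios
```

Selecting the coefficient by name is less fragile than an index such as [[8]][1], which depends on the internal layout of the summary object.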

R:fit dynamic number of explanatory variable into polynomial regression

Suppose I am given a data frame df at runtime. How do I fit a polynomial regression model in which each predictor is a column of df and has a constant degree k >= 2?
The difficulty is that df is read at runtime, so the number and names of its columns are unknown when the script is written (but I do know the response variable is the 1st column). So when I call lm I do not know how to write the formula.
In case of k = 1, then I can simply write a generic linear formula
names(df)[1] <- "y"
lm(y ~ ., data = df)
Is there something similar I can do for a polynomial formula?
One rather convoluted way is to create a formula for the lm regression call by pasting the terms together.
# some data
dat <- data.frame(replicate(10, rnorm(20)))
# Create formula - wrap every column name except the first in poly(..., 2)
form <- formula(paste(names(dat)[1], " ~ ",
paste0("poly(", names(dat)[-1], ", 2)", collapse="+")))
# run regression
lm(form , data=dat)
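To make the degree a parameter, the same string-building idea can be sketched with sprintf() and reformulate(). The variable k and the simulated data are placeholders, and the sketch assumes the column names are syntactically valid R names.

```r
# Simulated data: response in column 1, three predictors (names X1..X4)
set.seed(1)
dat <- data.frame(replicate(4, rnorm(30)))

k <- 3L  # polynomial degree, assumed to be chosen at runtime

# One poly() term per predictor column, e.g. "poly(X2, 3)"
poly_terms <- sprintf("poly(%s, %d)", names(dat)[-1], k)
form <- reformulate(poly_terms, response = names(dat)[1])

fit <- lm(form, data = dat)
length(coef(fit))  # intercept plus k coefficients per predictor
```

poly() uses orthogonal polynomials by default; passing raw = TRUE inside the term string would give raw powers instead.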

Accessing fitted.values when using ddply

I am using ddply to execute glm on subsets of my data. I am having difficulty accessing the estimated Y values. I am able to get the model parameter estimates using the below code, but all the variations I've tried to get the fitted values have fallen short. The dependent and independent variables in the glm model are column vectors, as is the "Dmsa" variable used in the ddply operation.
Define the model:
Model <- function(df){coef(glm(Y~D+O+B+A+log(M), family=poisson(link="log"), data=df))}
Execute the model on subsets:
Modrpt <- ddply(msadata, "Dmsa", Model)
Print Modrpt gives the model coefficients, but no Y estimates.
I know that if I wasn't using ddply, I can access the glm estimated Y values by using the code:
Model <- glm(Y~D+O+B+A+log(M), family=poisson(link="log"), data=msadata)
fits <- Model$fitted.values
I have tried both of the following to get the fitted values for the subsets, but no luck:
fits <- fitted.values(ddply(msadata, "Dmsa", Model))
fits <- ddply(msadata, "Dmsa", fitted.values(Model))
I'm sure this is a very easy to code...unfortunately, I'm just learning R. Does anyone know where I am going wrong?
You can use an anonymous function in your call to ddply e.g.
require(plyr)
data(iris)
model <- function(df){
lm( Petal.Length ~ Sepal.Length + Sepal.Width , data = df )
}
ddply( iris , "Species" , function(x) fitted.values( model(x) ) )
This has the advantage that you can also, without rewriting your model function, get the coef values by doing
ddply( iris , "Species" , function(x) coef( model(x) ) )
As @James points out, this will fail if the splits are of unequal size; it is better to use dlply, which puts the result of each subset in its own list element.
(I make no claims for statistical relevance or correctness of the example model - it is just an example)
I'd recommend doing this in two steps:
library(plyr)
# First fit the models
models <- dlply(iris, "Species", lm,
formula = Petal.Length ~ Sepal.Length + Sepal.Width )
# Next, extract the fitted values
ldply(models, fitted.values)
# Or, for one fitted value per row (long format)
ldply(models, function(m) data.frame(fitted = fitted.values(m)))
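If plyr is not available, the same two-step pattern can be sketched in base R with split() and lapply(). The names models and fits are illustrative, and the model is the same example model as above.

```r
# Step 1: fit one model per Species subset
models <- lapply(split(iris, iris$Species), function(df) {
  lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = df)
})

# Step 2: stack the fitted values into one long data frame
fits <- do.call(rbind, lapply(names(models), function(g) {
  data.frame(Species = g, fitted = fitted(models[[g]]))
}))

head(fits)
```

Because each group contributes its own rows, this long layout also works when the groups have different sizes.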

Resources