How to use one variable in regression with many independent variables in lm() - r

I need to reproduce this code using all of these variables.
composite <- read.csv("file.csv", header = T, stringsAsFactors = FALSE)
composite <- subset(composite, select = -Date)
model1 <- lm(indepvariable ~., data = composite, na.action = na.exclude)
composite is a data frame with 82 variables.
UPDATE:
I have found a way to create an object that contains only the significantly correlated variables, to narrow down the number of independent variables.
I now have a variable, sigvars, which holds the names from an object that sorts a correlation matrix and keeps only the variables with correlation coefficients > 0.5 or < -0.5. Here is the code:
sortedcor <- sort(cor(composite)[,1])
regvar = NULL
k = 1
for(i in 1:length(sortedcor)){
if(sortedcor[i] > .5 | sortedcor[i] < -.5){
regvar[k] = i
k = k+1
}
}
regvar
sigvars <- names(sortedcor[regvar])
However, it is not working in my lm() function:
model1 <- lm(data.matrix(composite[1]) ~ sigvars, data = composite)
Error: Error in model.frame.default(formula = data.matrix(composite[1]) ~ sigvars, : variable lengths differ (found for 'sigvars')

Think about what sigvars is for a minute...?
After sigvars <- names(sortedcor[regvar]), sigvars is a character vector of column names. Say your data have 100 rows and 5 variables come out as significant using the method you've chosen (which doesn't sound overly defensible to me). The model formula you are using will result in composite[, 1] being a vector of length 100 (100 rows) and sigvars being a character vector of length 5, which is why lm() complains that the variable lengths differ.
Assuming you have the variables you want to include in the model, then you could do:
form <- reformulate(sigvars, response = names(composite)[1])
model1 <- lm(form, data = composite)
or
model1 <- lm(composite[,1] ~ ., data = composite[, sigvars])
In the latter case, do yourself a favour and write the name of the dependent variable into the formula instead of composite[,1].
Also, you don't seem to have appreciated the difference between [i] and [i,j] for data frames, hence you are doing data.matrix(composite[1]) which is taking the first component of composite, leaving it as a data frame, then converting that to a matrix via the data.matrix() function. All you really need is just the name of the dependent variable on the LHS of the formula.
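As a rough sketch of the points above on data anyone can run, the following uses the built-in mtcars data set in place of composite (an assumption, since the original file is not available), with its first column mpg standing in for the dependent variable. It first reproduces the error and then builds the formula with reformulate(). The 0.5 cutoff is simply the one from the question, not a recommendation.

```r
# mtcars stands in for `composite`; its first column (mpg) is the response
dat <- mtcars
response <- names(dat)[1]

# correlations of every column with the response, dropping the response itself
cors <- cor(dat)[, response]
cors <- cors[names(cors) != response]
sigvars <- names(cors)[abs(cors) > 0.5]

# passing the character vector directly fails: 32 rows vs. a handful of names
msg <- tryCatch(
  {
    lm(dat[[response]] ~ sigvars)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# building a proper formula from the names works
form <- reformulate(sigvars, response = response)
model1 <- lm(form, data = dat)
print(form)
print(coef(model1))
```

Dropping the response from the correlation vector matters here: its correlation with itself is 1, so it would otherwise pass the cutoff and end up on both sides of the formula.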

The error is here:
model1 <- lm(data.matrix(composite[1]) ~ sigvars, data = composite)
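sigvars is a character vector of column names, not a set of model terms. A formula usually has the form lm(var1 ~ var2 + var3 + var4); you, however, are passing the names as a single vector, which amounts to something like lm(var1 ~ var2 var3 var4).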
Hopefully that helps.

Related

"Error in model.frame.default(data = train, formula = cost ~ .) : variable lengths differ", but all variables are length 76?

I'm modeling burrito prices in San Diego to determine whether some burritos are over/under priced (according to the model). I'm attempting to use regsubsets() to determine the best linear model, using the BIC, on a data frame of 76 observations of 14 variables. However, I keep getting an error saying that variable lengths differ, and thus a linear model doesn't work.
I've tried rounding all the observations in the data frame to one decimal place, I've used the length() function on each variable in the data frame to make sure they're all the same length, and before I made the model I used na.omit() on the data frame to make sure no NAs were present. By the way, the original dataset can be found here: https://www.kaggle.com/srcole/burritos-in-san-diego. I cleaned it up a bit in Excel first, removing all the categorical variables that appeared after the "overall" column.
burritos <- read.csv("/Users/Jack/Desktop/R/STOR 565 R Projects/Burritos.csv")
burritos <- burritos[ ,-c(1,2,5)]
burritos <- na.exclude(burritos)
burritos <- round(burritos, 1)
library(leaps)
library(MASS)
yelp <- burritos$Yelp
google <- burritos$Google
cost <- burritos$Cost
hunger <- burritos$Hunger
tortilla <- burritos$Tortilla
temp <- burritos$Temp
meat <- burritos$Meat
filling <- burritos$Meat.filling
uniformity <- burritos$Uniformity
salsa <- burritos$Salsa
synergy <- burritos$Synergy
wrap <- burritos$Wrap
overall <- burritos$overall
variable <- sample(1:nrow(burritos), 50)
train <- burritos[variable, ]
test <- burritos[-variable, ]
null <- lm(cost ~ 1, data = train)
full <- regsubsets(cost ~ ., data = train) #This is where error occurs
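One likely cause, judging only from the code above: the columns are named Cost, Yelp and so on, but the formula uses the lowercase cost. R is case-sensitive, so cost is not found in train and is taken instead from the global vector cost <- burritos$Cost, which holds all 76 values, while the . expands to the 50-row columns of train. The sketch below reproduces this on made-up data. The column values are random, only three columns are included, and lm() stands in for regsubsets() so that no extra package is needed.

```r
set.seed(565)
# made-up stand-in for the burrito data: 76 rows, capitalised column names
burritos <- data.frame(Cost = runif(76, 5, 10),
                       Yelp = runif(76, 3, 5),
                       Google = runif(76, 3, 5))
cost <- burritos$Cost                     # global vector of length 76

variable <- sample(1:nrow(burritos), 50)
train <- burritos[variable, ]             # 50 rows

# `cost` is not a column of train, so the length-76 global vector is used
msg <- tryCatch(
  {
    lm(cost ~ ., data = train)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# using the actual column name keeps every variable at 50 rows
full <- lm(Cost ~ ., data = train)
print(nobs(full))
```

If this is indeed the cause, writing Cost ~ . in the regsubsets() call (and dropping the loose copies such as cost, yelp and google) should be enough, since the formula then only refers to columns of train.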

PGLS returns an error when referring to variables by their column position in a caper object

I am carrying out PGLS between a trait and 21 environmental variables for a clade of plant species. I am using a loop to do this 21 times, once for each of the environmental variables, and extract the p-values and some other values into a results matrix.
When normally carrying each PGLS individually I will refer to the variables by their column names, for example:
pgls(trait1 ~ meanrainfall, data = caperobject)
But in order to loop this process for multiple environmental variables, I am referring to the variables by their column position in the data frame (which is in the form of the caper object for PGLS) instead of their column name:
pgls(caperobject[,2] ~ caperobject[,5], data = caperobject)
This returns the error:
Error in model.frame.default(formula, data$data, na.action = na.pass) :
invalid type (list) for variable 'caperobject[, 2]'
This is not a problem when running a linear regression using the original data frame -- referring to the variables by their column position only produces this error when the caper object is used as the data in PGLS. Does this way of referring to columns not work for caper objects? Is there another way I could refer to the columns so I can incorporate them into a PGLS loop?
The solution is to use caperobject$data[,2] ~ caperobject$data[,5], because an object of class comparative.data is a list, with the trait values stored in its data element. Here is an example:
library(ape)
library(caper)
# generate random data
set.seed(245937)
tr <- rtree(10)
dat <- data.frame(taxa = tr$tip.label,
trait1 = rTraitCont(tr, root.value = 3),
meanrainfall = rnorm(10, 50, 10))
# prepare a comparative.data structure
caperobject <- comparative.data(tr, dat, taxa, vcv = TRUE, vcv.dim = 3)
# run PGLS
pgls(trait1 ~ meanrainfall, data = caperobject)
pgls(caperobject$data[, 1] ~ caperobject$data[, 2], data = caperobject)
Both options return identical values for the intercept = 3.13 and slope = -0.003.
When a problem comes down to data format, it is good practice to check how the data are stored with str(caperobject).

Use string of independent variables within the lm function

I have a dataframe with many variables. I want to apply a linear regression to explain the last one with the others. As I had too much to write, I thought about creating a string with the independent variables, e.g. Var1 + Var2 +...+ VarK. I achieved it by pasting "+" to all column names except for the last one with this code:
ExVar <- toString(paste(names(datos)[1:11], "+ ", collapse = ''))
I also had to remove the last "+":
ExVar <- substr(ExVar, 1, nchar(ExVar)-2)
So I copied and pasted the ExVar string within the lm() function and the result looked like this:
m1 <- lm(calidad ~ Var1 + Var2 +...+ VarK)
The question is: Is there any way to use "ExVar" within the lm() function as a string, not as a variable, to have a cleaner code?
For better understanding:
If I use this code:
m1 <- lm(calidad ~ ExVar)
It interprets ExVar as an independent variable.
The following will all produce the same results. I am providing multiple methods because there are simpler ways of doing what you are asking (see examples 2 and 3) than writing the expression as a string.
First, I will generate some example data:
n <- 100
p <- 11
dat <- array(rnorm(n*p),c(n,p))
dat <- as.data.frame(dat)
colnames(dat) <- paste0("X",1:p)
If you really want to specify the model as a string, this example code will help:
ExVar <- toString(paste(names(dat[2:11]), "+ ", collapse = ''))
ExVar <- substr(ExVar, 1, nchar(ExVar)-3)
model1 <- paste("X1 ~ ",ExVar)
fit1 <- lm(eval(parse(text = model1)),data = dat)
Otherwise, note that the 'dot' notation will specify all other variables in the model as predictors.
fit2 <- lm(X1 ~ ., data = dat)
Or, you can select the predictors and outcome variables by column, if your data is structured as a matrix.
dat <- as.matrix(dat)
fit3 <- lm(dat[,1] ~ dat[,-1])
All three of these fit objects have the same estimates:
fit1
fit2
fit3
If you have a dataframe and you want to explain the last column using all the rest, then you can use the code below:
lm(calidad~.,dat)
or you can use
lm(rev(dat))#Only if the last column is your response variable
Either of the two above will give you the results needed.
To do it your way:
EXV=as.formula(paste0("calidad~",paste0(names(dat)[-12],collapse = '+')))
lm(EXV,dat)
There is no need to do it this way since the lm function itself will do this by using the first code above.
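If a formula built from names is still wanted, reformulate() avoids both the manual trimming of the trailing "+" and eval(parse()). A minimal sketch, assuming a made-up data frame datos with eleven predictors Var1 to Var11 and the response calidad in the last column (the names mirror the question, the values are random):

```r
set.seed(42)
# hypothetical data: 11 predictors plus the response `calidad`
datos <- as.data.frame(matrix(rnorm(100 * 11), ncol = 11))
names(datos) <- paste0("Var", 1:11)
datos$calidad <- rnorm(100)

# every column except the response becomes a term on the right-hand side
predictors <- setdiff(names(datos), "calidad")
form <- reformulate(predictors, response = "calidad")
m1 <- lm(form, data = datos)

# same fit as the dot notation
m2 <- lm(calidad ~ ., data = datos)
print(all.equal(coef(m1), coef(m2)))
```

Selecting the predictors with setdiff() rather than by position means the code does not depend on where the response column sits in the data frame.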

How to use predict from a model stored in a list in R?

I have a dataframe dfab that contains 2 columns, which I used as arguments to generate a series of linear models as follows:
models = list()
for (i in 1:10){
models[[i]] = lm(fc_ab10 ~ (poly(nUs_ab, i)), data = dfab)
}
dfab has 32 observations and I want to predict fc_ab10 for only 1 value.
I thought of doing so:
newdf = data.frame(newdf = nUs_ab)
newdf[] = 0
newdf[1,1] = 56
prediction = predict(models[[1]], newdata = newdf)
First I tried writing newdf as a dataframe with only one position, but since there are 32 in the dataset on which the model was built, I thought I had to provide at least 32 points as well. I don't think this is necessary though.
Every time I run that piece of code I am given the following error:
Error: variable 'poly(nUs_ab, i)' was fitted with type “nmatrix.1” but type “numeric” was supplied.
In addition: Warning message:
In Z/rep(sqrt(norm2[-1L]), each = length(x)) :
longer object length is not a multiple of shorter object length
I thought all I need to use predict was a LM model, predictors (the number 56) given in a column-named dataframe. Obviously, I am mistaken.
How can I fix this issue?
Thanks.
newdf should be a data.frame with column name nUs_ab, otherwise R won't be able to know which column to operate upon (i.e., generate the prediction design matrix). So the following code should work
newdf = data.frame(nUs_ab = 56)
prediction = predict(models[[1]], newdata = newdf)
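One caveat worth checking, as an assumption about how the models were built rather than something stated in the question: the formula poly(nUs_ab, i) refers to the loop variable i, so at prediction time the degree is looked up again from whatever i currently holds (10 once the loop has finished), which may not match the stored model. A sketch that writes the degree into each formula with bquote(), on made-up data with the same column names and only three degrees instead of ten:

```r
set.seed(1)
# made-up stand-in for dfab: 32 observations, same column names
dfab <- data.frame(nUs_ab = runif(32, 10, 100))
dfab$fc_ab10 <- 2 + 0.5 * dfab$nUs_ab + rnorm(32)

models <- vector("list", 3)
for (i in 1:3) {
  # .(i) is replaced by the current value, so the formula holds a literal degree
  form <- as.formula(bquote(fc_ab10 ~ poly(nUs_ab, .(i))))
  models[[i]] <- lm(form, data = dfab)
}

# a single new value, in a column named like the predictor
newdf <- data.frame(nUs_ab = 56)
prediction <- predict(models[[1]], newdata = newdf)
print(prediction)
```

With the degree stored in the formula, each model in the list can be used for prediction regardless of what i holds later in the session.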

Looping through list for each element computation in R

I am new to R and trying to get something done, so apologies in advance if this is a stupid way of doing it.
I am trying to get the coefficients and the relevance of the x-values to the y-values. The values in X are the criteria against which the correlation is being tested.
I need to find the positive or negative relevance/confidence for the results represented in myList. Rather than putting one column in Y manually, I just want to iterate through the list and get the result for each column.
library(rms)
parameters <- read.csv(file="C:/Users/manjaria/Documents/Lek papers/validation_csv.csv", header=TRUE)
#attach(parameters)
myList <- c("name1","name2","name3","name4","name5")
for (cnt in seq(length(myList))) {
Y<- cbind(myList[cnt])
X<- cbind(age,female,income,employed,traveldays,modesafety,prPoolsize)
XVar <-c("age","female","income","employed","traveldays","modesafety","prPoolsize")
summary (Y)
summary (X)
table(Y)
ddist<- datadist(XVar)
options(datadist = 'ddist')
ologit<- lrm(Y ~ X, data = parameters)
print(ologit)
fitted<- predict(ologit, newdata=parameters, type = "fitted.ind")
colMeans(fitted)
}
I encounter:
Error in model.frame.default(formula = Y ~ X, data = parameters, na.action = function (frame) :
variable lengths differ (found for 'X')
If I skip the for loop and use a static name for Y, like Y <- cbind(name1), it works well.
