Use more than 100 features with the lm function in R

Assume I have a data frame with 101 columns, where the first 100 are named data1 to data100 and the 101st column is named y.
I want to use the lm function in R with data1 to data100 as the features.
I know this can be written as:
lin_reg <- lm(y ~ data1+...+data100, dataframe)
Is there a better way of doing this?

lin_reg <- lm(y ~ ., data = dataframe)
This assumes the data frame contains only the outcome and the feature variables, with no extra columns. The "." means "use every other column in the data frame as a predictor".
Since, as per the comment, the OP wants to exclude certain columns:
dataframe_subset <- dataframe[, !names(dataframe) %in% c("data5", "data10")]
lin_reg <- lm(y ~ ., data = dataframe_subset)
This example excludes the columns data5 and data10.
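As an alternative that avoids building a second data frame, terms can be removed from the "." expansion directly in the formula. The sketch below uses made-up data with only ten feature columns; the data and its size are assumptions for illustration, not the OP's data.

```r
# Toy data standing in for the question's `dataframe` (hypothetical values).
set.seed(1)
dataframe <- as.data.frame(matrix(rnorm(50 * 10), ncol = 10))
names(dataframe) <- paste0("data", 1:10)
dataframe$y <- rnorm(50)

# "." expands to every column except y; "- data5 - data10" drops those two terms.
lin_reg <- lm(y ~ . - data5 - data10, data = dataframe)

# The excluded columns should not appear among the coefficient names.
print(names(coef(lin_reg)))
```

This keeps the original data frame intact, which can be convenient when the excluded columns are still needed later (for example, for plotting).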

You can create the formula dynamically with reformulate:
lin_reg <- lm(reformulate(paste0('data', 1:100), 'y'), dataframe)
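To see what reformulate builds, it can help to print a small case first. The following is a minimal sketch; the three-term formula and the `keep` vector are only illustrations.

```r
# reformulate() joins the term labels with "+" and attaches the response.
f <- reformulate(paste0("data", 1:3), response = "y")
print(f)

# The same idea handles exclusions: drop unwanted names before building the formula.
keep <- setdiff(paste0("data", 1:100), c("data5", "data10"))
f_subset <- reformulate(keep, response = "y")

# Number of predictors left on the right-hand side.
print(length(attr(terms(f_subset), "term.labels")))
```

Because the formula names its columns explicitly, this also works when the data frame contains extra columns that should not enter the model.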

Related

Passing arguments to subset within a function

I am attempting to fit a bunch of different models to a single dataset. Each of the models uses a different combination of outcome variable and data subset. To fit all of these models, I created a dataframe with one column for the outcome variable and one column specifying the data subset (as a string). (Note that the subsets are overlapping so there doesn't appear to be an obvious way to do this using nest().) I then created a new function which takes one row of this dataframe and calls "lm" using these options. Lastly, I use pmap to map this function to the dataframe.
After a bunch of experimentation, I found an approach that works but that is rather inelegant (see below for a simplified version of what I did). It seems like there should be a way to pass the subset condition to the subset argument in lm rather than using eval(parse(text = condition)) to first create a logical vector. I read the Advanced R section on metaprogramming in the hope that it would provide some insight, but I was unable to find anything that works.
Any suggestions would be helpful.
library(tidyverse)
outcomes <- c("mpg", "disp")
sub_conditions <- c("mtcars$cyl >=6", "mtcars$wt > 2")
models <- expand.grid(y = outcomes, condition = sub_conditions) %>% mutate_all(as.character)
fit <- function(y, condition) {
# Create the formula to use in all models
rx <- paste(y, "~ hp + am")
log_vec <- eval(parse(text = condition))
lm(rx, data = mtcars[log_vec,])
}
t <- pmap(models, fit)
Are you sure you want to pass the conditions as strings?
If so, there are not many options. You can use rlang::parse_expr as an alternative to parse(text = ...).
fit <- function(y, condition) {
rx <- paste(y, "~ hp + am")
lm(rx, data = mtcars[eval(rlang::parse_expr(condition)),])
}
and call it via
purrr::pmap(models, fit)
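If the conditions do not have to be strings, one possible approach is to store them as unevaluated expressions and splice them into the lm call with bquote, so that they reach lm's subset argument directly. This is only a sketch in base R; the names `conditions`, `fit_one`, `grid` and `fits` are made up for illustration.

```r
# Conditions stored as unevaluated expressions (column names only, no "mtcars$").
conditions <- list(six_plus_cyl = quote(cyl >= 6), heavy = quote(wt > 2))
outcomes <- c("mpg", "disp")

fit_one <- function(y, cond) {
  f <- reformulate(c("hp", "am"), response = y)
  # bquote() substitutes the formula and the condition into the call,
  # so lm() sees e.g. lm(mpg ~ hp + am, data = mtcars, subset = cyl >= 6).
  eval(bquote(lm(.(f), data = mtcars, subset = .(cond))))
}

# Every outcome/condition pair, fitted and stored in a named list.
grid <- expand.grid(y = outcomes, cond = names(conditions),
                    stringsAsFactors = FALSE)
fits <- Map(function(y, cond) fit_one(y, conditions[[cond]]), grid$y, grid$cond)
names(fits) <- paste(grid$y, grid$cond, sep = "_")

print(sapply(fits, nobs))
```

Since the condition is evaluated inside the data frame by lm itself, it does not need the `mtcars$` prefix, and the stored call of each fit shows the subset that was used.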

Unnest a ts class

My data contains sales for multiple customers, each with different start and end dates, so I applied simple exponential smoothing to each customer.
I used the following code to apply ses:
library(zoo)
library(forecast)
z <- read.zoo(data_set,FUN = function(x) as.Date(x) + seq_along(x) / 10^10 , index = "Date", split = "customer_id")
L <- lapply(as.list(z), function(x) ts(na.omit(x),frequency = 52))
HW <- lapply(L, ses)
Now my output is a list whose elements have uneven lengths. How can I unnest or unlist the output into a data frame in R and get the fitted values, actuals and residuals along with their dates, sales and customer_id?
Note: the reason I am posting my input data rather than HW is that HW is too large.
I would use the tidyverse packages to handle this problem.
library(tidyverse)
map(HW, ~ .x %>%
as.data.frame %>% # convert each element of the list to a data.frame
rownames_to_column) %>% # add row names as a column within each element
bind_rows(.id = "customer_id") # bind all elements and add customer ID
I am not sure how to relate the dates and actual sales to your output (HW). If you explain that, I may be able to provide a solution to that part of the problem too.
First, I stored all the unique customer_id values in a variable called k:
k <- unique(data_set$customer_id)
Then I created an empty data frame:
b <- data.frame()
I extracted the fitted values for each customer in a for loop, stored them in a, and appended them to the data frame b with rbind:
for(key in k){
print(a <- as.data.frame((as.numeric(HW_ses[[key]]$model$fitted))))
b <- rbind(b,a)
}
Finally, I used cbind to attach the data frame b to the input data set:
data_set_final <- cbind(data_set,b)

Use paste and !is.na to subset data frame

I am trying to define a subset of a dataframe for a standard lm model using a "for loop". In the subset expression, I want to refer to col1 using paste and subset all observations where col1-3 is not NA. I have tried the following things, but they do not work:
for(i in 1:3) {
lm(y ~ x1 + x2, data=subset(df, x3=="Y" & !is.na(paste0("col", i))))
}
OR define the colname separately:
for(i in 1:3) {
colname <- as.name(paste0("col", i))
lm(y ~ x1 + x2, data=subset(df, x3=="Y" & !is.na(colname)))
}
BTW: This is a simplified code to illustrate what I am trying to do. The code in my script does not give an error but ignores the !is.na condition of the subset expression. However, it works if done manually like this:
lm(y ~ x1 + x2, data=subset(df, x3=="Y" & !is.na(col1)))
I would greatly appreciate some advice!
Thanks in advance!
FK
The is.na() portion is being "ignored" because what you think is being evaluated isn't what is actually evaluated. What is evaluated is:
!is.na("col1")
The string "col1" is not NA, so this evaluates to TRUE and is recycled for all rows in your data. The issue is that you have a variable name stored as a string, while subset() needs a logical vector. So you need a way to turn the variable name stored in the string into the corresponding logical vector that subset() needs. You can update your code along these lines:
for(i in 1:3) {
lm(y ~ x1 + x2, data=subset(df, x3=="Y" & !is.na(df[[paste0("col", i)]])))
}
This isn't optimal, though, and there are other ways you can, and probably should, update your code. Something along the lines of:
for(i in 1:3) {
lm(y ~ x1 + x2, data = df,
subset = df$x3 == "Y" & !is.na(df[[paste0("col", i)]]))
}
is a bit cleaner, as it uses lm()'s own subset argument to subset your data.
You still have the issue that you're not storing the results of your calls to lm() anywhere.
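One way to keep the fitted models is to collect them in a named list with lapply instead of a bare for loop. The sketch below is only an illustration: the data frame is made up to mimic the column names in the question, and `cols` and `models` are hypothetical names.

```r
# Made-up data with the column names used in the question.
set.seed(42)
df <- data.frame(
  y = rnorm(30), x1 = rnorm(30), x2 = rnorm(30),
  x3 = rep(c("Y", "N"), 15),
  col1 = c(NA, rnorm(29)),
  col2 = c(rnorm(28), NA, NA),
  col3 = rnorm(30)
)

cols <- paste0("col", 1:3)

# One model per column; rows are kept when x3 is "Y" and that column is not NA.
models <- lapply(cols, function(col) {
  lm(y ~ x1 + x2, data = df, subset = df$x3 == "Y" & !is.na(df[[col]]))
})
names(models) <- cols

# Number of observations actually used by each model.
print(sapply(models, nobs))
```

Each fit can then be retrieved by name, for example `summary(models$col2)`.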

For loops: Running through column names

I was looking for a shorter way to write this using for loops
ie: i is 1 to 22 and my data will add columns 1 through 22 in the multiple regression:
reg <-lm(log(y)~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+z1+z2+z3+z4+z5+z6+z7+z8+z9+z10+z11+z12, data)
To clarify, x1 and x2 and x3 are all column names - they are x two (not x squared), I am trying to do a multiple regression with the last 22 columns in my data set
Someone suggested to do this:
reg1 <- lm(log(data$y)~terms( as.formula(
paste(" ~ (", paste0("X", 29:ncol(data) , collapse="+"), ")")
)
))
But it doesn't work. I don't think it is doing a multiple regression (xone + xtwo + xthree); rather, it assigned the binary value 1 to each variable x1, x2, x3... and added them, which is not what I want.
I know that a for-loop was requested but it would have been a clumsy strategy, so here's a possible correct strategy:
formchr <- paste(
paste( "log(y)" , paste0( "x", 1:10, collapse="+"), sep="~"),
# the LHS and first 10 terms
paste0( "z", 1:12, collapse="+"), #next 12 terms
sep="+") # put both parts together
reg1 <- lm( as.formula(formchr), data=data)
The full character-version of the formula should be passed to the as.formula function and the paste and paste0 functions are fully vectorized, so no loop is needed.
If the first 22 columns were the desired target for the RHS terms, you could have pasted together names(data)[1:22] or ...[29:50] if those were the locations, and this would be substituted for the RHS terms in the second paste above, dropping the third paste.
The only reason I used data as the name of an object is that it was implied by the question. It is a very confusing practice to use that name. data is an R function, and objects should have specific names that do not overlap with function names. The other very commonly abused name in this regard is df, which is the density function of the F distribution.
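The names()-based variant described above can be sketched as follows. The data here are simulated, and the object name `dat`, the default column names V1 to V50, and the positions 29:50 are assumptions for illustration only.

```r
# Simulated data: 50 predictor columns (V1..V50) plus a positive response y.
set.seed(7)
dat <- as.data.frame(matrix(rnorm(200 * 50), ncol = 50))
dat$y <- exp(rnorm(200))  # positive, so that log(y) is defined

# Pick the predictors by position and build the formula from their names.
rhs <- names(dat)[29:50]
form <- reformulate(rhs, response = quote(log(y)))

reg <- lm(form, data = dat)
print(form)
```

Selecting by position is fragile if the column order changes, so selecting by a name pattern (for example with grep on names(dat)) may be safer in practice.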
You could first subset your data into a data.frame which contains only the columns of interest. Then, you can run a linear model using the . formula syntax to select all columns other than the y variable.
Example using 1000 rows and 50 columns of data:
N <- 1000
P <- 50
data <- as.data.frame(replicate(P, rnorm(N))) # P independent random columns
Assign your y data to y (kept positive here so that log(y) is defined).
y <- data.frame(y = exp(rnorm(N)))
Create a new data.frame containing y and the last 22 columns.
model_data <- cbind(y, data[ ,29:50])
colnames(model_data) <- c("y", paste0("x", 1:10), paste0("z",1:12))
The following should do the trick. The . formula syntax will select all columns other than the y column.
reg <-lm(log(y) ~ ., data = model_data)

With R, loop over data frames, and assign appropriate names to objects created in the loop

This is something which data analysts do all the time (especially when working with survey data which features missing responses). It's common to first multiply impute a set of complete data matrices, fit models to each of these matrices, and then combine the results. At the moment I'm doing things by hand and looking for a more elegant solution.
Imagine there are 5 *.csv files in the working directory, named dat1.csv, dat2.csv, ... dat5.csv. I want to estimate the same linear model using each data set.
Given this answer, a first step is to gather a list of the files, which I do with the following
csvdat <- list.files(pattern="dat.*csv")
Now I want to do something like
for(x in csvdat) {
lm.which(csvdat == "x") <- lm(y ~ x1 + x2, data = x)
}
The "which" statement is my silly way of trying to number each model in turn, using the position in the csvdat list that the loop is currently up to. That is, I'd like this loop to return a set of 5 lm objects with the names lm.1, lm.2, etc.
Is there some simple way to create these objects, and name them so that I can easily indicate which data set they correspond to?
Thanks for your help!
Another approach is to use the plyr package to do the looping. Using the example constructed by @chl, here is how you would do it:
require(plyr)
# read csv files into list of data frames
data_frames = llply(csvdat, read.csv)
# run regression models on each data frame
regressions = llply(data_frames, lm, formula = y ~ .)
names(regressions) = csvdat
Use a list to store the results of your regression models as well, e.g.
foo <- function(n) return(transform(X <- as.data.frame(replicate(2, rnorm(n))),
y = V1+V2+rnorm(n)))
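# row.names = FALSE keeps the row index from becoming a predictor in y ~ .
write.csv(foo(10), file="dat1.csv", row.names=FALSE)
write.csv(foo(10), file="dat2.csv", row.names=FALSE)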
csvdat <- list.files(pattern="dat.*csv")
lm.res <- list()
for (i in seq(along=csvdat))
lm.res[[i]] <- lm(y ~ ., data=read.csv(csvdat[i]))
names(lm.res) <- csvdat
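Once the fits live in a named list, combining them is straightforward. The sketch below writes three small made-up files to a temporary directory, fits the same model to each, and tabulates the coefficients; the directory name, the simulated data and the object names are all assumptions for illustration.

```r
# Write three small example files to a temporary directory (made-up data).
set.seed(123)
tmp <- file.path(tempdir(), "lm_list_demo")
dir.create(tmp, showWarnings = FALSE)
for (i in 1:3) {
  d <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
  d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(20)
  write.csv(d, file.path(tmp, paste0("dat", i, ".csv")), row.names = FALSE)
}

# Fit the same model to every file and name each fit after its file.
files <- list.files(tmp, pattern = "^dat.*\\.csv$", full.names = TRUE)
fits <- lapply(files, function(f) lm(y ~ x1 + x2, data = read.csv(f)))
names(fits) <- basename(files)

# One column of coefficients per file, then their average across files.
coef_table <- sapply(fits, coef)
pooled <- rowMeans(coef_table)
print(coef_table)
print(pooled)
```

For multiply imputed data, averaging the coefficients gives the pooled point estimates only; pooled standard errors need additional steps that are not shown here.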
What you want is a combination of the functions seq_along() and assign().
seq_along creates a vector from 1 to 5 if there are five objects in csvdat (giving the appropriate numbers rather than only the file names). Then assign (using paste to build the appropriate strings from the numbers) lets you create the variables.
Note that you will also need to load the data file first (was missing in your example):
for (x in seq_along(csvdat)) {
data.in <- read.csv(csvdat[x]) #be sure to change this to read.table if necessary
assign(paste("lm.", x, sep = ""), lm(y ~ x1 + x2, data = data.in))
}
seq_along is not strictly necessary; there are other ways to solve the numbering problem.
The critical function is assign. With assign you can create variables with a name based on a string. See ?assign for further info.
Following chl's comments (see his post) everything in one line:
for (x in seq_along(csvdat)) assign(paste("lm", x, sep = "."), lm(y ~ x1 + x2, data = read.csv(csvdat[x])))
