Regression in R iteratively by levels in categorical variable - r

I have a small data set that should be well suited to modeling (<1 million records), but one variable is giving me problems. It is a categorical variable called [store] with ~98 levels, one per store name. I am trying to predict each store's sales [sales], which is a continuous numeric variable. The resulting vector size is over 10GB, and R crashes with memory errors. Is it possible to build 98 separate regression equations and run them one by one, one for each level of [store]?
My other idea would be to create 10 or 15 clusters of the [store] variable, then use the cluster names as the categorical variable when predicting [sales].

Sure, this is a pretty common type of analysis. For instance, here is how you would split up the iris dataset by the Species variable and then build a separate model predicting Sepal.Width from Sepal.Length in each subset:
data(iris)
models <- lapply(split(iris, iris$Species), function(df) lm(Sepal.Width~Sepal.Length, data=df))
The result is a list of the species-specific regression models.
To predict, I think it would be most efficient to first split your test set, then call the corresponding prediction function on each subset, and finally recombine:
test.iris <- iris
test.spl <- split(test.iris, test.iris$Species)
predictions <- unlist(lapply(test.spl, function(df) {
  # look the model up by name; a bare factor would be used as its integer code
  predict(models[[as.character(df$Species[1])]], newdata=df)
}))
test.ordered <- do.call(rbind, test.spl) # Test obs. in same order as predictions
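If the test rows need to stay in their original order, one variant of the steps above is to put the per-group predictions back with unsplit(). This is only a sketch on iris; the shuffled test.iris and the pred column are made up for illustration.

```r
data(iris)
models <- lapply(split(iris, iris$Species),
                 function(df) lm(Sepal.Width ~ Sepal.Length, data = df))

set.seed(1)
test.iris <- iris[sample(nrow(iris)), ]   # illustrative test set in shuffled order
test.spl <- split(test.iris, test.iris$Species)

# Pair each subset with the model of the same name and predict
pred.list <- Map(function(m, df) unname(predict(m, newdata = df)),
                 models[names(test.spl)], test.spl)

# unsplit() reverses split(), so predictions line up with the rows of test.iris
test.iris$pred <- unsplit(pred.list, test.iris$Species)
```

With this, there is no need to reorder the test set to match the predictions.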
Of course, for your problem you'll need to decide how to subset the data. One reasonable approach would be clustering with something like kmeans and then passing the cluster of each point to the split function.
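As a rough sketch of that clustering idea, assuming each store is summarised by its mean sales before clustering (the toy data, the price predictor and the choice of three clusters are all made up here, not taken from the question):

```r
set.seed(1)
# Hypothetical toy data: 12 stores with 50 rows each
sales_df <- data.frame(
  store = factor(rep(paste0("store", 1:12), each = 50)),
  price = runif(600, 1, 10)
)
store_effect <- rnorm(12, mean = 100, sd = 30)
sales_df$sales <- store_effect[as.integer(sales_df$store)] -
  3 * sales_df$price + rnorm(600)

# One summary value per store, then cluster the stores (not the rows)
store_means <- tapply(sales_df$sales, sales_df$store, mean)
km <- kmeans(store_means, centers = 3, nstart = 10)
store_cluster <- setNames(km$cluster, names(store_means))

# Map each row to its store's cluster and fit one model per cluster
sales_df$cluster <- unname(store_cluster[as.character(sales_df$store)])
cluster_models <- lapply(split(sales_df, sales_df$cluster),
                         function(df) lm(sales ~ price, data = df))
```

Each model only ever sees the rows of one cluster, so the design matrix stays small. What to cluster on (mean sales, store size, region, ...) is a modeling decision this sketch does not settle.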

Related

Creating a data frame with pooled imputated values

I want to do a regression analysis, and in order to account for missing data in my data set, I imputed the NAs with mice. So far everything's fine: I ran mice() with m=5 and now I have the 5 imputations. The next step according to the documentation would be to do the actual regression and combine the different imputations by using pool(), so:
model <- with(data_imp, expr = ...)
summary(pool(model))
This will create the regression output for my analysis. Seems good so far.
However, I also want to provide some descriptive statistics (namely, a boxplot) on my dependent and independent variables. I therefore need a data frame that contains both a) the values that were already given and b) the combined, imputed values that were inserted in place of the NAs and used in the regression model. How can I create this data.frame?
I saw in this tutorial that you can combine the imputed data with the already given values into a new data frame by using data_full <- complete(data_imp), but apparently this only works if you want to choose one specific imputation out of the 5 (for example data_full <- complete(data_imp, 1) to choose the first imputation). If you don't specify any number, I think it will just use the first imputation. However, I want to use the combined, estimated values from all 5 imputations and put them into one data frame. How can I do this?
I'd be really grateful for every piece of advice :) Thanks in advance!
I'm not entirely clear on how you want to display the data, but if you want to look at a box plot for each imputation, you can iterate through each of the 5 imputations run with complete() using lapply() or purrr::map() as per below:
library(mice)
library(tidyverse)
imp <- mice(nhanes)
map_dfr(1:5, ~ complete(imp, .x), .id = "imp") |>
ggplot(aes(x = imp, y = bmi)) +
geom_boxplot()

Why is R removing some residuals and how to avoid it?

I am creating linear models in R and testing them for model assumptions.
I noticed that when I create my models, R removes some residuals, giving this:
(2 observations deleted due to missingness)
This prevents me from checking the relationship between the independent variable and the residuals and any further analysis because of the different lengths for x and y.
edit:
Do you have any ideas on how to fix this?
R isn't removing residuals when you run lm(). Rather, it cannot compute residuals for observations that have missing data in any model variable, and it does not use those observations in the fit. The summary(model_5) output therefore notifies you that some observations could not be used (i.e., were deleted).
To run a correlation between the residuals and the independent variable when their lengths differ, and when for some reason we cannot find the rows with missing data to remove them from the dataset (e.g., if dataset[!complete.cases(dataset), ] isn't working), we first need to figure out another way to find which observations are kept or removed in the model. We may be able to rely on the observation ID or the dataset's row names for this.
Example
# sample data
set.seed(12345)
dataset <- data.frame(indep_var = c(NA, rnorm(9)), dep_var = c(rnorm(9), NA))
dataset$index <- rownames(dataset)
# model residuals
resid <- lm(data=dataset, dep_var ~ indep_var)$residuals
dataset.resid <- data.frame(index = names(resid), resid)
# join or match the residuals and the variables by their observation identifier
cor.data <- dplyr::inner_join(dataset.resid, dataset, by = "index")
# correlation analysis
cor.test(~ resid + indep_var, cor.data)
Note that names(resid) are the row names of the observations used in the model from dataset. Any unused rownames(dataset) or observations/rows from dataset (due to missingness) will not be included in names(resid).
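Another option that may be simpler, assuming you can refit the model, is to pass na.action = na.exclude to lm(). residuals() then pads the dropped observations with NA, so the result has one entry per row of the original data. Note that this applies to residuals(fit), not to fit$residuals. The sketch below reuses the sample data from above; fit is just an illustrative name.

```r
# same sample data as above
set.seed(12345)
dataset <- data.frame(indep_var = c(NA, rnorm(9)), dep_var = c(rnorm(9), NA))

# na.exclude still drops incomplete rows from the fit,
# but remembers their positions for residuals() and fitted()
fit <- lm(dep_var ~ indep_var, data = dataset, na.action = na.exclude)

dataset$resid <- residuals(fit)   # same length as nrow(dataset), NA where a row was dropped

# cor.test() drops the NA rows itself
cor.test(~ resid + indep_var, data = dataset)
```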

Unequal Data for Model Compare in R

I'm fairly new to R and am trying to compare two models with the modelCompare function. However, the data set that I am working with is a bit large and has unevenly distributed missing values. When I try the following code for example:
Model_A <- lm(DV~var1*var2 + cont.var, data=df)
Model_C <- lm(DV~ cont.var, data=df)
modelCompare(Model_C,Model_A)
I get an error that the models have different N values and cannot be compared because data is differentially omitted between the two models. Is there an easy way to remove this variation, as I will be running a number of regression analyses with this data set?
What are you looking to compare? If you want to compare intercepts between the models, just print them:
Model_A
Model_C
If you want to compare accuracy of the model, use a training and testing dataset!
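For the differing N itself, one common approach is to fit both models on the same rows, i.e. the rows that are complete for every variable in the larger model. The sketch below uses simulated data with the question's variable names; df_cc and full_formula are made-up names, and base R's anova() stands in for modelCompare, which I have not run here.

```r
set.seed(42)
# Simulated stand-in for the real data, with unevenly distributed NAs
df <- data.frame(DV = rnorm(200), var1 = rnorm(200),
                 var2 = rnorm(200), cont.var = rnorm(200))
df$var1[sample(200, 15)] <- NA
df$var2[sample(200, 10)] <- NA

# Keep only rows complete for every variable used by the larger model
full_formula <- DV ~ var1 * var2 + cont.var
vars <- all.vars(full_formula)
df_cc <- df[complete.cases(df[, vars]), ]

# Both models now use exactly the same observations
Model_A <- lm(full_formula, data = df_cc)
Model_C <- lm(DV ~ cont.var, data = df_cc)
anova(Model_C, Model_A)   # nested-model F test
```

If you run many regressions on the same data set, building df_cc once from the union of all variables involved keeps N constant across all of them, at the cost of discarding more rows.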

quick way to fit the same mixed model to multiple dataframes or subsets in R?

I have a dataset containing growth data for multiple crops. The experimental treatment is within-crop (i.e. I am not comparing between crops) and has a randomised block design, with repeated measures over time. I would like to run the same mixed model on each different crop in R:
model<-lmer(response~treatment*time+(1|block), data=data)
My dataset contains 20 crops and I want to extract all the fixed intercepts and coefficients, so it is annoying to run the model on each crop separately. Somehow I got the impression from some other posts that lmList, given the below formula, would run a separate but common mixed model on each crop. However having now checked the help page and the output in more detail, I think it just does a fixed effect linear model (lm).
model_set <- lme4::lmList(response~treatment*time+(1|block)|crop,data=data)
I can't find any nice posts on setting up a loop to run the same mixed effect model on multiple datasets and extract coefficients, or if there is any existing function for this. It would be great if someone could recommend a quick simple solution.
As an example dataset I suggest the lme4 sleepstudy data, but let's just make a fake 'group' factor to mimic my 'crop' factor:
library(lme4)
data(sleepstudy)
library(car)
sleepstudy$group<-recode(sleepstudy$Subject,
" c('308','309','310','330','331','332','333','334') = 'group1';
c('335','337','349','350','351','352','369','370','371','372') = 'group2'")
model<-lmer(Reaction~Days+(1|Subject), data=sleepstudy)
How can I automate running the same model on each group separately and then extracting the intercept and coefficients?
Many thanks in advance for any suggestions.
Today I have stumbled across the 'split' function, which means I can answer my own question. The split function subsets a dataframe into a list of dataframes based on levels of a specified factor. A simple loop can then be set up to run the same model on each dataframe in the list.
list_df <- split(sleepstudy, sleepstudy$group) #split example dataset by group factor
results_df <- as.data.frame(matrix(ncol=2,nrow=length(list_df))) # make an empty dataframe
colnames(results_df)<-c("intercept","Days") #give the dataframe column names
for (i in seq_along(list_df)){ #run a loop over the dataframes in the list
  mod<-lmer(Reaction~Days+(1|Subject),data=list_df[[i]]) #mixed model
  results_df[i,]<-fixef(mod) #extract coefficients to dataframe
  rownames(results_df)[i]<-names(list_df)[i] #assign rowname to results from data used
}

Manually set coefficient for new factor level when predicting

I have a linear model where one of the independent variables is a factor and where I am trying to make predictions on a data set that contains a new factor level (a factor level that wasn't in the data set the model was estimated on). I want to be able to make predictions for the observations with the new factor level by manually specifying the coefficient that will be applied to the factor. For example, suppose I estimate daily sales volumes for three types of stores, and I introduce a fourth type of store into the dataset. I have no historical data for it, but I might assume it will behave like some weighted combination of the other stores, for whom I have model coefficients.
If I try to apply predict.lm() to the new data I will get an error telling me that the factor has new levels (this makes sense).
df <- data.frame(y=rnorm(100), x1=factor(rep(1:4,25)))
lm1 <- lm(y ~ x1, data=df)
newdata <- data.frame(y=rnorm(100), x1=factor(rep(1:5,20)))
predict(lm1, newdata)
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor x1 has new levels 5
I could do the prediction manually by simply multiplying the coefficients by the individual columns in the data.frame. However, this is cumbersome given that the real model I'm working with has many variables and interaction terms, and I want to be able to easily cycle through various model specifications by changing the model formula. Is there a way for me to essentially add a new coefficient to a model object and then use it to make forecasts? If not, is there another approach that is less cumbersome than setting up the entire prediction step manually?
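Assuming you want level 5 to be an evenly weighted combination of the four known levels, you can build the model matrix, plug in a weight of 0.25 for each level, and multiply it by the coefficients from the model:
n.mat <- model.matrix(~x1, data=newdata) # columns: (Intercept), x12, x13, x14, x15
n.mat[n.mat[,5] == 1, 2:4] <- .25 # level-5 rows: weight 0.25 on x12..x14 (level 1 is the baseline)
n.mat <- n.mat[,-5] # drop x15, which has no coefficient in lm1
n.prediction <- n.mat %*% coef(lm1)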
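For a model with more variables and interaction terms, a possible way to avoid editing the model matrix by hand is to predict the new-level rows once per known level and take the weighted sum of those predictions. Because lm is linear in its coefficients, this should match the weighted-coefficient idea as long as the weights sum to 1. Everything named here (predict_new_level, the covariate z, the weights w) is hypothetical and only meant as a sketch.

```r
# Hypothetical helper: rows with an unseen factor level are predicted as a
# weighted combination of the predictions for the known levels.
predict_new_level <- function(model, newdata, factor_name, weights) {
  known <- model$xlevels[[factor_name]]
  stopifnot(all(names(weights) %in% known))
  lev <- as.character(newdata[[factor_name]])
  is_new <- !(lev %in% known)
  pred <- rep(NA_real_, nrow(newdata))

  # Rows with a known level: ordinary predict()
  if (any(!is_new)) {
    old <- newdata[!is_new, , drop = FALSE]
    old[[factor_name]] <- factor(lev[!is_new], levels = known)
    pred[!is_new] <- predict(model, newdata = old)
  }

  # Rows with a new level: predict once per known level, then weight
  if (any(is_new)) {
    rows <- newdata[is_new, , drop = FALSE]
    per_level <- do.call(cbind, lapply(names(weights), function(lv) {
      rows[[factor_name]] <- factor(rep(lv, nrow(rows)), levels = known)
      predict(model, newdata = rows)
    }))
    pred[is_new] <- drop(per_level %*% weights)
  }
  pred
}

# Usage on toy data with an interaction term (z is a made-up covariate)
set.seed(1)
df <- data.frame(y = rnorm(100), x1 = factor(rep(1:4, 25)), z = rnorm(100))
lm1 <- lm(y ~ x1 * z, data = df)
newdata <- data.frame(x1 = factor(rep(1:5, 20)), z = rnorm(100))
w <- c("1" = 0.25, "2" = 0.25, "3" = 0.25, "4" = 0.25)
pred <- predict_new_level(lm1, newdata, "x1", w)
```

Changing the model formula only requires refitting lm1; the helper reads the known levels from the model object.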
Here is what you could do:
Using rbind, stack the training and test datasets.
Factorize the predictors.
Divide the stack back to training and test datasets.
This way all the levels will be present in both the datasets.
