Why is R removing some residuals and how to avoid it? - r

I am creating linear models in R and testing them for model assumptions.
I noticed that when I create my models, R removes some residuals, giving this:
(2 observations deleted due to missingness)
This prevents me from checking the relationship between the independent variable and the residuals and any further analysis because of the different lengths for x and y.
Do you have any ideas on how to fix this?

R isn't removing residuals when you run lm(). Rather, it cannot compute residuals for observations that have a missing value in any model variable, so those observations are dropped before the model is fitted. The summary(model_5) output simply notifies you how many observations could not be used (i.e., were deleted).
To correlate the residuals with the independent variable when their lengths differ, and when for some reason you cannot identify and remove the incomplete rows from the dataset beforehand (e.g., if dataset[!complete.cases(dataset), ] isn't working), you first need another way to find which observations were kept or removed in the model. The observation ID or the dataset's row names can often serve this purpose.
Example
# sample data
set.seed(12345)
dataset <- data.frame(indep_var = c(NA, rnorm(9)), dep_var = c(rnorm(9), NA))
dataset$index <- rownames(dataset)
# model residuals
resid <- lm(dep_var ~ indep_var, data = dataset)$residuals
dataset.resid <- data.frame(index = names(resid), resid)
# join or match the residuals and the variables by their observation identifier
cor.data <- dplyr::inner_join(dataset.resid, dataset, by = "index")
# correlation analysis
cor.test(~ resid + indep_var, cor.data)
Note that names(resid) are the row names of the observations used in the model from dataset. Any unused rownames(dataset) or observations/rows from dataset (due to missingness) will not be included in names(resid).
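If the goal is only to get residuals with the same length as the original data, an alternative sketch is to fit the model with na.action = na.exclude. With that setting, residuals() pads the dropped observations with NA, so the vector lines up row by row with the dataset. The sample data below is the same toy data as above; the object name fit is just an illustration.

```r
# Same toy data as in the example above
set.seed(12345)
dataset <- data.frame(indep_var = c(NA, rnorm(9)), dep_var = c(rnorm(9), NA))

# na.exclude drops incomplete rows for fitting, but remembers where they were
fit <- lm(dep_var ~ indep_var, data = dataset, na.action = na.exclude)

# residuals() re-inserts NA for the dropped rows, so lengths match nrow(dataset)
dataset$resid <- residuals(fit)

# fit$residuals, by contrast, still holds only the observations actually used
length(dataset$resid)   # same as nrow(dataset)
length(fit$residuals)   # number of complete cases

# cor.test() drops the NA rows itself via its default na.action
cor.test(~ resid + indep_var, data = dataset)
```

Note that the padding is done by the extractor functions (residuals(), fitted()), not by the stored component fit$residuals.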

Related

Fit multiple linear regression without an intercept with the function lm() in R

Can you please help with this question in R? I need to get more than one predictor:
Fit multiple linear regression without an intercept with the function lm() to train data
using variable (y.train) as a goal variable and variables (X.mat.train) as
predictors. Look at the vector of estimated coefficients of the model and compare it with
the vector of ’true’ values beta.vec graphically
(Tip: build a plot for the differences of the absolute values of estimated and true values).
I have already tried it with the code posted at the end, but it gives me only one predictor, whereas in this example I need more than one.
I think the problem is in the first line, but I couldn't find a way to fix it.
I can't post the data set here because it's large, but I have a variable that stores 190 observations from a vector (y.train) and another that stores 190 observations from a matrix (X.mat.train). The model should give more than one predictor, but for me it gives only one.
simple.fit = lm(y.train~0+ X.mat.train) #goal var #intercept # predictor
summary(simple.fit)# Showing the linear regression output
plot(simple.fit)
abline(simple.fit)
n <- summary(simple.fit)$coefficients
estimated_coeff <- n[ , 1]
estimated_coeff
plot(estimated_coeff)
#Coefficients: X.mat.train 0.5018
v <- sum(beta.vec)
#0.5369
plot(beta.vec)
plot(beta.vec, simple.fit)
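For reference, here is a minimal sketch of the task with simulated data. It assumes X.mat.train is a numeric matrix with several columns; the dimensions, the values in beta.vec, and the simulated data are all made up for illustration. When the right-hand side of the formula is a matrix, lm() returns one coefficient per column, so a single coefficient suggests that X.mat.train holds only one column (dim(X.mat.train) would show this).

```r
set.seed(42)

# Hypothetical training data: 190 observations, 5 predictors
n <- 190
beta.vec <- c(1.5, -2, 0.5, 0, 3)          # made-up "true" coefficients
X.mat.train <- matrix(rnorm(n * length(beta.vec)), nrow = n)
y.train <- drop(X.mat.train %*% beta.vec) + rnorm(n)

# "0 +" removes the intercept; a matrix predictor yields one coefficient per column
fit <- lm(y.train ~ 0 + X.mat.train)
estimated_coeff <- coef(fit)
estimated_coeff

# Differences of the absolute values of estimated and true coefficients
abs_diff <- abs(estimated_coeff) - abs(beta.vec)
plot(abs_diff, type = "h", xlab = "Coefficient index",
     ylab = "|estimated| - |true|")
abline(h = 0, lty = 2)
```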

Creating a data frame with pooled imputed values

I want to do a regression analysis, and in order to account for missing data in my data set, I imputed the NAs with mice. So far everything's fine: I ran mice() with m=5 and now have the 5 imputed data sets. The next step according to the documentation would be to do the actual regression and combine the different imputations by using pool(), so:
model <- with (data_imp, exp = ...)
summary(pool(model))
This will create the regression output for my analysis. Seems good so far.
However, I also want to provide some descriptive statistics (namely, a boxplot) on my dependent and independent variables. For that I need a data frame that contains both a) the values that were already given and b) the combined, imputed values that were inserted in place of the NAs and used in the regression model. How can I create this data.frame?
I saw in this tutorial that you can combine the imputed data with the already given values into a new data frame by using data_full <- complete(data_imp), but apparently this only works if you choose one specific imputation out of the 5 (for example, data_full <- complete(data_imp, 1) to choose the first imputation). If you don't specify a number, I think it will just use the first imputation. However, I want to take the estimated values from all 5 imputations and combine them into one data frame. How can I do this?
I'd be really grateful for every piece of advice :) Thanks in advance!
I'm not entirely clear on how you want to display the data, but if you want to look at a box plot for each imputation, you can iterate through each of the 5 imputations run with complete() using lapply() or purrr::map() as per below:
library(mice)
library(tidyverse)
imp <- mice(nhanes)
# mice::complete() is called explicitly so that tidyr::complete() cannot mask it
map_dfr(1:5, ~ mice::complete(imp, .x), .id = "imp") |>
  ggplot(aes(x = imp, y = bmi)) +
  geom_boxplot()

How to perform post-hoc tests (including Tukey and Games-Howell) on imputed data using MICE

I would like to perform post-hoc tests on imputed data using MICE in R.
Typically, MICE imputes the data, which are then converted to long format to calculate total scores and can be converted back into a mids object. The analysis is then run on that object, giving a mira object, after which the results can be pooled.
However, I am not able to get this running for post-hoc tests, including Tukey and Games-Howell. Would someone be able to help?
IMP <- mice(data, m=5, maxit=10)
IMP_long <- data.frame(complete(IMP, include = TRUE, action = 'long'))
IMP_mids <- as.mids(IMP_long, where = NULL, .imp = '.imp', .id = '.id')
fit <- with(IMP_mids, expr=lm(total_score ~ GroupingVariable))
The grouping variable consists of 3 groups which I would like to compare pairwise. Namely 1vs2, 2vs3 and 1vs3.
summary(pool(fit))
This gives comparisons of two groups relative to the intercept (the reference group). The same happens when I set contrasts before creating the model.
Does anyone know how to compare the three groups in one analysis with Tukey and/or Games-Howell?
Thanks in advance!!

R: glmrob can't predict models with dropped co-linear columns, while glm can?

I'm learning to implement robust glms in R, but can't figure out why I am unable to get glmrob to predict values from my regression models when I have a model where some columns are dropped due to co-linearity. Specifically when I use the predict function to predict values from a glmrob, it always gives NA for all values. I don't observe this when predicting values from the same data & model using glm. It doesn't seem to matter what data I use -- as long as there is a NA coefficient in the fitted model (and the NA isn't the last coefficient in the coefficient vector), the predict does not work.
This behavior holds for all datasets and models I have tried where an internal column is dropped due to co-linearity. I include a fake data set where two columns are dropped from the model, which gives two NAs in the coefficient list. Both glm and glmrob give nearly identical coefficients, yet predict only works with the glm model. So my question is: what don't I understand about robust regression that would prevent my glmrob models from generating predicted values?
library(robustbase)
#Make fake data with two categorical predictors
df <- data.frame("category" = rep(c("A","B","C"),each=6))
df$location <- rep(1:6,each=3)
val <- rep(c(500,50,5000),each=6)+rep(c(50,100,25,200,100,1),each=3)
df$value <- rpois(NROW(df),val)
#note that predict works if we omit the newdata parameter. However I need the newdata param
#so I use the original dataframe here as a stand-in.
mod <- glm(val ~ category + as.factor(location), data=df, family=poisson)
predict(mod, newdata=df) # works fine
mod <- glmrob(val ~ category + as.factor(location), data=df, family=poisson)
predict(mod, newdata=df) #predicts NA for all values
I've been digging into this and have concluded that the problem does not lie in my understanding of robust regression, but rather the problem lies with a bug in the robustbase package. The predict.lmrob function does not correctly pick the necessary coefficients from the model before the prediction. It needs to pick the first x non-NA coefficients (where x=rank of the model matrix). Instead it merely picks the first x coefficients without checking if they are NA. This explains why this problem only surfaces for models where the NA isn't the last coefficient in the coefficient vector.
To fix this, I copied the predict.lmrob source using:
getAnywhere(predict.lmrob)
and created my own replacement function. In this function I made a single modification to the code:
...
p <- object$rank
if (is.null(p)) {
    df <- Inf
    p <- sum(!is.na(coef(object)))
    #piv <- seq_len(p) # old code
    piv <- which(!is.na(coef(object))) # new code
}
else {
    p1 <- seq_len(p)
    piv <- if (p)
        qr(object)$pivot[p1]
}
...
I've run a few hundred datasets using this change and it has worked well.
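The difference between the two index choices can be illustrated with base R alone. The sketch below uses glm() as a stand-in for glmrob() (an assumption made only so the example runs without robustbase) on the fake data from the question. It shows the indexing logic only and does not reproduce the internals of predict.lmrob. The names dat, old_piv, and new_piv are made up for the illustration.

```r
# Fake data from the question: location is nested within category,
# so two location dummies are aliased and get NA coefficients
dat <- data.frame(category = rep(c("A", "B", "C"), each = 6))
dat$location <- rep(1:6, each = 3)
val <- rep(c(500, 50, 5000), each = 6) + rep(c(50, 100, 25, 200, 100, 1), each = 3)

fit <- glm(val ~ category + as.factor(location), data = dat, family = poisson)
cf <- coef(fit)
p <- sum(!is.na(cf))                 # number of estimable coefficients

old_piv <- seq_len(p)                # first p positions, NA or not
new_piv <- which(!is.na(cf))         # positions of the non-NA coefficients

anyNA(cf[old_piv])                   # TRUE when an NA sits before the last position
anyNA(cf[new_piv])                   # FALSE: only estimated coefficients are picked
```

When every NA coefficient sits at the end of the vector, the two index sets coincide, which matches the observation that the problem only appears when the NA is not the last coefficient.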

Regression in R iteratively by levels in categorical variable

I have a small data set which should be great for modeling (<1 million records), but one variable is giving me problems. It's a categorical variable with ~98 levels called [store], which holds the name of each store. I am trying to predict each store's sales [sales], which is a continuous numeric variable. With this variable in the model, the vector size is over 10GB and R crashes with memory errors. Is it possible to make 98 different regression equations, and run them one by one for every level of [store]?
My other idea would be to try and create 10 or 15 clusters of this [store] variable, then use the cluster names as my categorical variable in predicting the [sales] variable (continuous variable).
Sure, this is a pretty common type of analysis. For instance, here is how you would split up the iris dataset by the Species variable and then build a separate model predicting Sepal.Width from Sepal.Length in each subset:
data(iris)
models <- lapply(split(iris, iris$Species), function(df) lm(Sepal.Width~Sepal.Length, data=df))
The result is a list of the species-specific regression models.
To predict, I think it would be most efficient to first split your test set, then call the corresponding prediction function on each subset, and finally recombine:
test.iris <- iris
test.spl <- split(test.iris, test.iris$Species)
predictions <- unlist(lapply(test.spl, function(df) {
  # index the model list by name; a bare factor would be used as its integer code
  predict(models[[as.character(df$Species[1])]], newdata=df)
}))
test.ordered <- do.call(rbind, test.spl) # Test obs. in same order as predictions
Of course, for your problem you'll need to decide how to subset the data. One reasonable approach would be clustering with something like kmeans and then passing the cluster of each point to the split function.
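As a rough sketch of that clustering idea, the example below groups stores by their mean sales with kmeans() and then fits one model per cluster. Everything here is hypothetical: the simulated data, the column names (store, price, sales), the choice of 4 clusters, and mean sales as the clustering feature are assumptions for illustration only.

```r
set.seed(1)

# Hypothetical data: 20 stores with 50 records each
sales_df <- data.frame(
  store = factor(rep(sprintf("store_%02d", 1:20), each = 50)),
  price = runif(1000, min = 1, max = 10)
)
store_effect <- rnorm(20, mean = 100, sd = 30)
sales_df$sales <- store_effect[as.integer(sales_df$store)] -
  3 * sales_df$price + rnorm(1000, sd = 5)

# One summary value per store, then cluster the stores (not the records)
store_means <- tapply(sales_df$sales, sales_df$store, mean)
km <- kmeans(store_means, centers = 4, nstart = 10)

# Map each record to its store's cluster, looking the cluster up by store name
sales_df$cluster <- factor(km$cluster[as.character(sales_df$store)])

# One regression per cluster, as in the split/lapply pattern above
cluster_models <- lapply(split(sales_df, sales_df$cluster),
                         function(d) lm(sales ~ price, data = d))
sapply(cluster_models, coef)
```

Clustering on a per-store summary keeps the kmeans input small (one row per store), which sidesteps the memory problem caused by the many-level factor.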
