Effect of an interaction term in a linear model - r

I am using the effects package to find the effect of variables in my linear model.
library(effects)
data(iris)
lm1 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length*Petal.Width,data=iris)
For a simple term in the model, I can get the effects for each data point using
effect("Sepal.Width", lm1, xlevels=iris['Sepal.Width'])
How can I get a similar 1-dimensional vector of values for my interaction term at each data point? Does this even make sense? Everything I've tried returns a 2-d matrix, e.g.
effect("Petal.Length:Petal.Width", lm1 ,xlevels=iris['Petal.Length']*iris['Petal.Width'])
I'm not sure what should be used for the xlevels argument in this case to give me more than just the default 5 equally spaced points.

I think I've figured out something that gives me what I want.
# Create dataframe with all possible combinations
eff_df <- data.frame(effect("Petal.Length:Petal.Width",lm1,xlevels=list(Petal.Length=iris$Petal.Length, Petal.Width=iris$Petal.Width)))
# Create a merge key in eff_df (a separator avoids ambiguous concatenations)
eff_df$merge_col <- paste(eff_df$Petal.Length, eff_df$Petal.Width, sep = "_")
# Create the matching key in iris
iris$merge_col <- paste(iris$Petal.Length, iris$Petal.Width, sep = "_")
# Only eff_df$fit values whose key appears in iris will be merged in
iris <- merge(iris, eff_df[, c("merge_col", "fit")], by = "merge_col", all.x = TRUE)
Then the effects vector is stored in iris$fit.
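An alternative sketch that avoids the pasted key entirely: merge directly on the two predictor columns (same lm1 and effect() call as above).
eff_df <- data.frame(effect("Petal.Length:Petal.Width", lm1,
                            xlevels = list(Petal.Length = iris$Petal.Length,
                                           Petal.Width  = iris$Petal.Width)))
iris_fit <- merge(iris, eff_df[, c("Petal.Length", "Petal.Width", "fit")],
                  by = c("Petal.Length", "Petal.Width"), all.x = TRUE)
# iris_fit$fit now holds the interaction effect evaluated at each observation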

Related

How to start R regression at specific value

I need to find a regression in R which has the form of
lm(Binary_value ~ Age, data=dataframe)
But my age variable starts at 15 years old, so I'm not interested in ages of 14 and below. How can I specify that I only want my regression to be accurate from age 15 upward and not worry about smaller values? I tried it this way:
lm(Binary_value ~ Age, data=dataframe)
But I get nonsense results for higher ages.
First things first, remember that R is case-sensitive, so the function is lm, not LM; I edited your question to fix that. Second, a regression only includes the data that is available: it will not magically make up data points for ages below 15 if they are not already present, so there is no issue there. However, the regression line will not be drawn only for ages >= 15, because it uses the model coefficients to extend back to an intercept. An example below with fake data:
#### Create Fake Data ####
set.seed(123)
x <- 15:100                    # use these numbers for age
age <- sample(x,               # using x
              size = 1000,     # sample 1000 times
              replace = TRUE)  # sample with replacement
outcome <- age * 0.60 + rnorm(n = 1000, sd = 15) # make fake outcome variable
df <- data.frame(age, outcome)

#### Fit Data ####
fit <- lm(outcome ~ age, data = df)
summary(fit)
plot(age, outcome)
abline(fit, col = "red")
You will see that the regression line, even though the data only include ages 15 and up, still extends to the left where there is no data. This is because the intercept is a conditional value derived from the coefficients, not an observed data point.
P.S. I used a normal Gaussian regression for this example because you used the lm function in your question, but included a binary response. For a logistic regression, the rationale would be the same, but it would use glm instead.
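For completeness, a minimal sketch of that logistic version, assuming a 0/1 response column Binary_value and predictor Age in dataframe as in the question:
glm_fit <- glm(Binary_value ~ Age, data = dataframe, family = binomial)
summary(glm_fit)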

loop of AIC criterion for optimal model selection

I have an xlsx file (data.xlsx) with 20 columns. The first column contains the dependent variable (Y) and all the others the independent ones (X1, X2, X3, ..., X19). I want to create a loop that calculates the AIC for all possible combinations of the Xi's with Y and prints the results. I have the following code, which does not include the loop.
install.packages("readxl")
library(readxl)
data<-read_excel("data.xlsx")
data
lm1 <- lm(Y ~ . , data = data)
AIC(lm1) # calculate the AIC value for one model
Y <- data[,1]
Y
X <- data[,-1]
X
How can I create a loop to calculate the AIC of all combinations? Can someone help me?
There would be a lot of possible combinations (2^19 - 1 = 524,287 non-empty subsets of the predictors). If your goal is to find the best model, you should try the stepAIC function from the MASS package with direction = "both".
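If you do want the full enumeration, here is a minimal sketch of the brute-force loop, assuming (as in your code) the data frame is called data and the response column is literally named Y; with 19 predictors it fits 524,287 models, so it will take a while.
library(MASS)  # for stepAIC below
predictors <- setdiff(names(data), "Y")
aics <- list()
for (k in seq_along(predictors)) {
  for (vars in combn(predictors, k, simplify = FALSE)) {
    f <- reformulate(vars, response = "Y")
    aics[[paste(vars, collapse = " + ")]] <- AIC(lm(f, data = data))
  }
}
head(sort(unlist(aics)))  # combinations with the smallest AIC
# Alternatively, let stepAIC search instead of enumerating everything:
best <- stepAIC(lm(Y ~ ., data = data), direction = "both")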

R: glmrob can't predict models with dropped co-linear columns, while glm can?

I'm learning to implement robust GLMs in R, but I can't figure out why I am unable to get glmrob to predict values from my regression models when some columns are dropped due to co-linearity. Specifically, when I use the predict function on a glmrob fit, it always gives NA for all values. I don't observe this when predicting values from the same data and model using glm. It doesn't seem to matter what data I use: as long as there is an NA coefficient in the fitted model (and the NA isn't the last coefficient in the coefficient vector), predict does not work.
This behavior holds for all datasets and models I have tried where an internal column is dropped due to co-linearity. I include a fake data set where two columns are dropped from the model, which gives two NAs in the coefficient list. Both glm and glmrob give nearly identical coefficients, yet predict only works with the glm model. So my question is: what don't I understand about robust regression that would prevent my glmrob models from generating predicted values?
library(robustbase)
#Make fake data with two categorical predictors
df <- data.frame("category" = rep(c("A","B","C"),each=6))
df$location <- rep(1:6,each=3)
val <- rep(c(500,50,5000),each=6)+rep(c(50,100,25,200,100,1),each=3)
df$value <- rpois(NROW(df),val)
#note that predict works if we omit the newdata parameter. However I need the newdata param
#so I use the original dataframe here as a stand-in.
mod <- glm(val ~ category + as.factor(location), data=df, family=poisson)
predict(mod, newdata=df) # works fine
mod <- glmrob(val ~ category + as.factor(location), data=df, family=poisson)
predict(mod, newdata=df) #predicts NA for all values
I've been digging into this and have concluded that the problem does not lie in my understanding of robust regression; rather, it is a bug in the robustbase package. The predict.lmrob function does not correctly pick out the necessary coefficients from the model before the prediction: it needs to pick the first x non-NA coefficients (where x = rank of the model matrix), but it merely picks the first x coefficients without checking whether they are NA. This explains why the problem only surfaces for models where the NA isn't the last coefficient in the coefficient vector.
To fix this, I copied the predict.lmrob source using:
getAnywhere(predict.lmrob)
and created my own replacement function. In this function I made a single modification to the code:
...
p <- object$rank
if (is.null(p)) {
    df <- Inf
    p <- sum(!is.na(coef(object)))
    # piv <- seq_len(p)                  # old code
    piv <- which(!is.na(coef(object)))   # new code
} else {
    p1 <- seq_len(p)
    piv <- if (p)
        qr(object)$pivot[p1]
}
...
I've run a few hundred datasets using this change and it has worked well.
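If you'd rather not maintain a patched copy of predict.lmrob, a quick workaround (not the package's method, just a manual sketch using the objects from the question) is to build the model matrix yourself, drop the NA coefficients, and apply the inverse link:
cf <- coef(mod)  # glmrob coefficients; contains NA for the dropped columns
mm <- model.matrix(~ category + as.factor(location), data = df)
eta <- drop(mm[, !is.na(cf), drop = FALSE] %*% cf[!is.na(cf)])
manual_pred <- exp(eta)  # inverse link for family = poisson, i.e. type = "response"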

Random predictions from linear model in R

I have some data with some missing values for one variable, and I want to be able to create (random) predictions for what these could be. Here's my first thought:
# miss indicates where the observations with missing response are
library(MASS)
model <- glm.nb(data[-miss,4] ~ ., data=data[-miss,-4])
predict(model, newdata=data[miss,-4])
However, if I repeat the last line, it gives the same answers over and over: it appears to give the predicted mean of the response given the data and the model. I want a random prediction that incorporates variance, i.e. a random draw from the distribution of the response of an observation with such predictors under the given model.
It could have something to do with the pred.var argument, but I'm unsure how to use that.
Suppose we have data like this:
set.seed(101)
dd <- data.frame(x=(1:20)*0.1)
dd$y <- rnbinom(20,mu=exp(dd$x),size=1)
## make some missing values
miss <- c(2,3,5)
dd$y[miss] <- NA
Now fit a model:
m1 <- MASS::glm.nb(y~x,dd,na.action=na.exclude)
Now use predictions from that model to get the expected mean value and rnbinom to generate the random values ...
p <- predict(m1,newdata=dd,type="response")
randvals <- rnbinom(length(p),mu=p,size=m1$theta)
(This gives random values for every element, not just the missing ones, but obviously you can pick out just the ones you want ...) It would be nice if the simulate method did this, but it's not quite flexible enough ...
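A small follow-on, using the objects above, if you only want to fill in the missing responses:
dd$y_imputed <- dd$y
dd$y_imputed[miss] <- randvals[miss]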

Regression in R iteratively by levels in categorical variable

So I have a small data set which should be great for modeling (<1 million records), but one variable is giving me problems. It's a categorical variable with ~98 levels called [store]; this is the name of each store. I am trying to predict each store's sales [sales], which is a continuous numeric variable. The vector size is over 10GB and R crashes with memory errors. Is it possible to make 98 different regression equations and run them one by one for every level of [store]?
My other idea would be to try and create 10 or 15 clusters of this [store] variable, then use the cluster names as my categorical variable in predicting the [sales] variable (continuous variable).
Sure, this is a pretty common type of analysis. For instance, here is how you would split up the iris dataset by the Species variable and then build a separate model predicting Sepal.Width from Sepal.Length in each subset:
data(iris)
models <- lapply(split(iris, iris$Species), function(df) lm(Sepal.Width~Sepal.Length, data=df))
The result is a list of the species-specific regression models.
To predict, I think it would be most efficient to first split your test set, then call the corresponding prediction function on each subset, and finally recombine:
test.iris <- iris
test.spl <- split(test.iris, test.iris$Species)
predictions <- unlist(lapply(test.spl, function(df) {
  predict(models[[df$Species[1]]], newdata = df)
}))
test.ordered <- do.call(rbind, test.spl) # Test obs. in same order as predictions
Of course, for your problem you'll need to decide how to subset the data. One reasonable approach would be clustering with something like kmeans and then passing the cluster of each point to the split function; see the sketch below.
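A rough sketch of the clustering idea, using hypothetical column names store and sales in a data frame dat (adjust to your data):
store_means <- tapply(dat$sales, dat$store, mean)   # per-store average sales
km <- kmeans(store_means, centers = 10)             # group the ~98 stores into 10 clusters
dat$store_cluster <- factor(km$cluster[match(dat$store, names(store_means))])
fit <- lm(sales ~ store_cluster, data = dat)        # 10-level factor instead of 98 levels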
