Fitting a linear regression model by group gives NaN p-values

I am using plyr::ddply to run a regression model
model <- rating ~ A + B + C + D + E + F
by the factor resp.id. I can create a data frame of the betas by each factor with:
indiv.betas <- ddply(data.coded, "resp.id",
                     function(df) coef(lm(model, data=df)))
I am now trying to extract the p-values for the variables by the factor using:
indiv.pvalues <- ddply(data.coded, "resp.id",
                       function(df) coef(summary(lm(model, data=df)))[, "Pr(>|t|)"])
Unfortunately, it just gives me a data frame full of NaN values.
However, if I run the model across the entire data set, I can successfully extract the p-values from this one model as a data frame with:
pvalue <- as.data.frame(coef(summary(lm(model, data=data.coded)))[, "Pr(>|t|)"])
How can I create a data frame of the p-values by the factor?
Thanks.

When you fit a single model
rating ~ A + B + C + D + E + F
to the whole data set, you get meaningful, non-NA results; but when you fit the same model to each subset defined by resp.id, you get NaN. I am quite sure that for some factor levels you don't have enough data to fit the model. It is a good idea to first check how many observations there are in each group. You can use:
N <- with(data.coded, tapply(rating, resp.id, FUN = length))
Your model has 7 coefficients (1 for the intercept and 1 each for A, ..., F), so a group needs at least 8 observations to leave a positive residual degree of freedom. which(N < 8) will tell you which factor levels produce NaN.
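If you would rather skip such groups than get NaN rows, a minimal sketch (reusing model and data.coded from the question; safe_pvalues is a hypothetical helper name, and the size check mirrors the rule above):
# Defensive wrapper: groups returning NULL are dropped by ddply,
# so undersized groups simply vanish from the result.
safe_pvalues <- function(df) {
  if (nrow(df) < 8) return(NULL)  # fewer than 8 rows: no residual df
  coef(summary(lm(model, data = df)))[, "Pr(>|t|)"]
}
indiv.pvalues <- ddply(data.coded, "resp.id", safe_pvalues)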
In this part, I will show that I am not able to reproduce your problem with the iris dataset.
library(plyr)
model <- Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width
ddply(iris, "Species", function(df) coefficients(lm(model, data=df)))
# Species (Intercept) Sepal.Width Petal.Length Petal.Width
#1 setosa 2.351890 0.6548350 0.2375602 0.2521257
#2 versicolor 1.895540 0.3868576 0.9083370 -0.6792238
#3 virginica 0.699883 0.3303370 0.9455356 -0.1697527
ddply(iris, "Species", function(df) coef(summary(lm(model, data=df)))[, 4])
# Species (Intercept) Sepal.Width Petal.Length Petal.Width
#1 setosa 3.034183e-07 6.834434e-09 2.593594e-01 0.470987
#2 versicolor 5.112246e-04 6.488965e-02 1.666695e-06 0.125599
#3 virginica 1.961563e-01 6.439972e-02 1.074269e-13 0.395875
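As an aside, plyr is retired these days; assuming you have dplyr and broom available, a roughly equivalent modern idiom returns the estimates and p-values together in one long data frame:
library(dplyr)
library(broom)
# One row per Species and term, with estimate, std.error, statistic, p.value
iris %>%
  group_by(Species) %>%
  group_modify(~ tidy(lm(model, data = .x)))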
In this part, I will show why NaN appears when there are too few observations for the number of coefficients.
set.seed(0)
x1 <- rnorm(3); x2 <- rnorm(3); x3 <- rnorm(3)
y <- rnorm(3)
fit <- lm(y ~ x1 + x2 + x3)  ## 3 observations, 4 coefficients
coef(summary(fit))
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.4217653 NaN NaN NaN
#x1 0.4124869 NaN NaN NaN
#x2 1.1489330 NaN NaN NaN
Note that x3 does not even appear in the table: its coefficient is NA (not estimable from 3 observations), so summary() drops it; and with zero residual degrees of freedom, the standard errors, t values and p-values are all NaN.

Related

Is there a way to change the default formula in mlr3?

I'm studying the {mlr3} package and I would like to compare 2 models:
Y ~ .
Y ~ .^2
I did not find a way in the documentation to specify the formula when creating the task; the default is Y ~ . (all main effects). I can find a hacky way with model.matrix, setting up another task with the new columns, but I might be missing something.
You are probably looking for PipeOpModelMatrix from mlr3pipelines. For instance:
task <- as_task_regr(iris[1:4], target = "Sepal.Length")
lrn <- lrn("regr.lm")
pop <- po("modelmatrix", formula = ~ . ^ 2)
gr <- GraphLearner$new(pop %>>% lrn)
gr$train(task)
gr$model$modelmatrix$outtasklayout
#output
id type
1: (Intercept) numeric
2: Petal.Length numeric
3: Petal.Length:Petal.Width numeric
4: Petal.Length:Sepal.Width numeric
5: Petal.Width numeric
6: Petal.Width:Sepal.Width numeric
7: Sepal.Width numeric
gr$model$regr.lm$model
#output
Call:
stats::lm(formula = task$formula(), data = task$data())
Coefficients:
(Intercept) `(Intercept)` Petal.Length Petal.Width Sepal.Width
1.39837 NA 1.15756 -1.66219 0.84812
`Petal.Length:Petal.Width` `Petal.Length:Sepal.Width` `Petal.Width:Sepal.Width`
0.06695 -0.16772 0.27626
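As a usage note, the trained GraphLearner then predicts as usual, with the interaction columns built on the fly, so the two formulas can be compared with a measure, e.g.:
pr <- gr$predict(task)
pr$score(msr("regr.rmse"))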

How would I get the pattern of errors on test items for a logistic regression model?

I am trying to analyse the pattern of errors (accuracy) on test items for the model coded below. I would like to find out how often setosa and versicolor irises are incorrectly classified as virginica, and how often virginica irises are incorrectly classified as not virginica. Can this be done? Any suggestions would be great. Here are my logistic regression model and a classifier built from it:
library(datasets)
iris$dummy_virginica_iris <- 0
iris$dummy_virginica_iris[iris$Species == 'virginica'] <- 1
iris$dummy_virginica_iris
# Logistic regression model.
glm <- glm(dummy_virginica_iris ~ Petal.Width + Sepal.Width,
           data = iris,
           family = 'binomial')
summary(glm)
# Classifer.
glm.pred <- predict(glm, type="response")
virginica <- ifelse(glm.pred > .5, TRUE, FALSE)
You can create a new vector to separate the flowers into virginica / non-virginica like this:
species <- as.character(iris$Species)
species[species != "virginica"] <- "non-virginica"
Then you can just tabulate this against your model's predictions as a 2 x 2 contingency table:
result <- table(virginica, species)
print(result)
#          species
# virginica non-virginica virginica
#     FALSE            96         3
#     TRUE              4        47
This allows easy calculation of the sensitivity, specificity and accuracy of your model, like this:
sensitivity <- result[2, 2] / sum(result[, 2])
specificity <- result[1, 1] / sum(result[, 1])
accuracy <- (result[1, 1] + result[2, 2]) / sum(result)
sensitivity
# [1] 0.94
specificity
# [1] 0.96
accuracy
# [1] 0.9533333
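If you also want the errors broken down by the original three species (e.g. how many of the 4 false positives were setosa vs. versicolor), tabulate the classifier output against iris$Species instead of the binarized vector:
# Rows: classifier decision; columns: actual species
table(virginica, iris$Species)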

Linear regression with `lm()`: prediction interval for aggregated predicted values

I'm using predict.lm(fit, newdata=newdata, interval="prediction") to get predictions and their prediction intervals (PIs) for new observations. Now I would like to aggregate (sum and mean) these predictions and their PIs based on an additional variable (i.e. a spatial aggregation on the zip-code level of predictions for single households).
I learned from StackExchange that you cannot aggregate the prediction intervals of single predictions just by aggregating their limits. The post is very helpful for understanding why this can't be done, but I have a hard time translating it into actual code. Here's a reproducible example:
library(dplyr)
set.seed(123)
data(iris)
#Split dataset in training and prediction set
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
train <- iris[train_ind, ]
pred <- iris[-train_ind, ]
#Fit regression model
fit1 <- lm(Petal.Width ~ Petal.Length, data=train)
#Fit multiple linear regression model
fit2 <- lm(Petal.Width ~ Petal.Length + Sepal.Width + Sepal.Length, data=train)
#Predict Petal.Width for new data incl. prediction intervals for each prediction
predictions1 <- predict(fit1, newdata=pred, interval="prediction")
predictions2 <- predict(fit2, newdata=pred, interval="prediction")
# Aggregate data by summing predictions per species
# NOT correct for prediction intervals
# (funs() is deprecated in current dplyr; list(sum = sum, mean = mean) is the modern equivalent)
predictions_agg1 <- data.frame(predictions1, Species=pred$Species) %>%
  group_by(Species) %>%
  summarise_all(funs(sum, mean))
predictions_agg2 <- data.frame(predictions2, Species=pred$Species) %>%
  group_by(Species) %>%
  summarise_all(funs(sum, mean))
I couldn't find a good tutorial or package describing how to properly aggregate predictions and their PIs in R when using predict.lm(). Is there something out there? I would highly appreciate it if you could point me in the right direction on how to do this in R.
Your question is closely related to a thread I answered two years ago: linear model with `lm`: how to get prediction variance of sum of predicted values. It provides an R implementation of Glen_b's answer on Cross Validated. Thanks for quoting that Cross Validated thread; I didn't know about it, and perhaps I can leave a comment there linking to this Stack Overflow thread.
I have polished my original answer, wrapping the line-by-line code into easy-to-use functions lm_predict and agg_pred. Solving your question is then simplified to applying those functions by group.
Consider the iris example in your question, and the second model fit2 for demonstration.
set.seed(123)
data(iris)
#Split dataset in training and prediction set
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
train <- iris[train_ind, ]
pred <- iris[-train_ind, ]
#Fit multiple linear regression model
fit2 <- lm(Petal.Width ~ Petal.Length + Sepal.Width + Sepal.Length, data=train)
We split pred by the grouping variable Species, then apply lm_predict (with diag = FALSE) to each sub data frame.
oo <- lapply(split(pred, pred$Species), lm_predict, lmObject = fit2, diag = FALSE)
To use agg_pred we need to specify a weight vector whose length equals the number of observations in each group. We can determine these lengths from the fit element of each oo[[i]]:
n <- lengths(lapply(oo, "[[", 1))
#setosa versicolor virginica
# 11 13 14
If the aggregation operation is a sum, the weights are all 1:
w <- lapply(n, rep.int, x = 1)
#List of 3
# $ setosa : num [1:11] 1 1 1 1 1 1 1 1 1 1 ...
# $ versicolor: num [1:13] 1 1 1 1 1 1 1 1 1 1 ...
# $ virginica : num [1:14] 1 1 1 1 1 1 1 1 1 1 ...
SUM <- Map(agg_pred, w, oo)
SUM[[1]] ## result for the first group, for example
#$mean
#[1] 2.499728
#
#$var
#[1] 0.1271554
#
#$CI
# lower upper
#1.792908 3.206549
#
#$PI
# lower upper
#0.999764 3.999693
sapply(SUM, "[[", "CI") ## some nice presentation for CI, for example
# setosa versicolor virginica
#lower 1.792908 16.41526 26.55839
#upper 3.206549 17.63953 28.10812
If the aggregation operation is an average, we rescale w by n and call agg_pred again.
w <- mapply("/", w, n)
#List of 3
# $ setosa : num [1:11] 0.0909 0.0909 0.0909 0.0909 0.0909 ...
# $ versicolor: num [1:13] 0.0769 0.0769 0.0769 0.0769 0.0769 ...
# $ virginica : num [1:14] 0.0714 0.0714 0.0714 0.0714 0.0714 ...
AVE <- Map(agg_pred, w, oo)
AVE[[2]] ## result for the second group, for example
#$mean
#[1] 1.3098
#
#$var
#[1] 0.0005643196
#
#$CI
# lower upper
#1.262712 1.356887
#
#$PI
# lower upper
#1.189562 1.430037
sapply(AVE, "[[", "PI") ## some nice presentation for PI, for example
# setosa versicolor virginica
#lower 0.09088764 1.189562 1.832255
#upper 0.36360845 1.430037 2.072496
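For intuition, here is a minimal sketch of the quantity agg_pred computes for a weighted sum, using standard lm theory (the covariance of the fitted values at new points is X V X', and each new response adds independent noise with variance sigma^2; lm_predict / agg_pred in the linked thread are the careful implementation):
X <- model.matrix(delete.response(terms(fit2)), pred)  # design matrix at the new data
V <- vcov(fit2)                                        # covariance of coefficient estimates
sig2 <- summary(fit2)$sigma ^ 2                        # residual variance
w <- rep(1, nrow(X))                                   # weights: all 1 for a plain sum
# Var(w'y_new) = w' (X V X') w + sig2 * sum(w^2)
var_sum <- drop(t(w) %*% X %*% V %*% t(X) %*% w) + sig2 * sum(w ^ 2)
pt_est <- sum(w * (X %*% coef(fit2)))
PI <- pt_est + c(-1, 1) * qt(0.975, df.residual(fit2)) * sqrt(var_sum)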
This is great, thank you so much! One thing I forgot to mention: in my actual application I need to sum ~300,000 predictions, which would create a full variance-covariance matrix of about ~700 GB. Do you have any idea whether there is a computationally more efficient way to get directly to the variance of the sum?
Use the fast_agg_pred function provided in the revision of the original Q & A. Let's start it all over.
set.seed(123)
data(iris)
#Split dataset in training and prediction set
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
train <- iris[train_ind, ]
pred <- iris[-train_ind, ]
#Fit multiple linear regression model
fit2 <- lm(Petal.Width ~ Petal.Length + Sepal.Width + Sepal.Length, data=train)
## list of new data
newdatlist <- split(pred, pred$Species)
n <- sapply(newdatlist, nrow)
#setosa versicolor virginica
# 11 13 14
If the aggregation operation is a sum, we do
w <- lapply(n, rep.int, x = 1)
SUM <- mapply(fast_agg_pred, w, newdatlist,
              MoreArgs = list(lmObject = fit2, alpha = 0.95),
              SIMPLIFY = FALSE)
If the aggregation operation is an average, we do
w <- mapply("/", w, n)
AVE <- mapply(fast_agg_pred, w, newdatlist,
              MoreArgs = list(lmObject = fit2, alpha = 0.95),
              SIMPLIFY = FALSE)
Note that we can't use Map in this case, as we need to pass additional fixed arguments to fast_agg_pred; use mapply here, with MoreArgs and SIMPLIFY = FALSE.
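As for why fast_agg_pred can avoid the huge matrix: the n-by-n covariance X V X' only ever enters through a quadratic form, which can be regrouped so that only a length-p vector X'w is needed (a minimal sketch, reusing X, V, sig2 and w from the sum sketch above):
# w' (X V X') w + sig2 * w'w  ==  (X'w)' V (X'w) + sig2 * sum(w^2)
# The left-hand side needs an n x n matrix; the right-hand side never forms it.
u <- crossprod(X, w)  # length-p vector X'w
var_sum <- drop(t(u) %*% V %*% u) + sig2 * sum(w ^ 2)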

Using neural networks neuralnet in R to predict factor values

I am using the neuralnet package with several inputs to predict an output.
Originally, my output is a factor variable, and I saw the error:
Error in neurons[[i]] %*% weights[[i]] :
requires numeric/complex matrix/vector arguments
When I converted the output to a numeric variable, the error disappeared. Is there a way to fit a neural network with a factor output?
I adapted code that I found at this site, which uses the iris dataset with the neuralnet package to predict iris species from the morphological data.
Without a reproducible example I'm not sure whether this applies to your case. The key here is to convert each level of the factor response into its own binary variable. Prediction is a bit different from other models in R: you choose the factor level with the highest score.
Example code:
library(neuralnet)
# Make training and validation data
set.seed(1)
train <- sample(nrow(iris), nrow(iris)*0.5)
valid <- seq(nrow(iris))[-train]
iristrain <- iris[train,]
irisvalid <- iris[valid,]
# Binarize the categorical output
iristrain <- cbind(iristrain, iristrain$Species == 'setosa')
iristrain <- cbind(iristrain, iristrain$Species == 'versicolor')
iristrain <- cbind(iristrain, iristrain$Species == 'virginica')
names(iristrain)[6:8] <- c('setosa', 'versicolor', 'virginica')
# Fit model
nn <- neuralnet(
  setosa + versicolor + virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
  data = iristrain,
  hidden = 3
)
plot(nn)
# Predict
comp <- compute(nn, irisvalid[-5])
pred.weights <- comp$net.result
idx <- apply(pred.weights, 1, which.max)
pred <- c('setosa', 'versicolor', 'virginica')[idx]
table(pred, irisvalid$Species)
#pred         setosa versicolor virginica
#  setosa         23          0         0
#  versicolor      1         21         7
#  virginica       0          1        22
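A quick overall accuracy from that table, for example:
mean(pred == as.character(irisvalid$Species))
# [1] 0.88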
The neuralnet call above can raise warnings or occasionally fail to converge. If that happens, add lifesign = "full" to monitor the training:
nn <- neuralnet(
  setosa + versicolor + virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
  data = iristrain, hidden = 3, lifesign = "full")
If
comp <- compute(nn, irisvalid[-5])
does not work, call compute with an explicit namespace and column selection:
comp <- neuralnet::compute(nn, irisvalid[, 1:4])

R and factor coding in formula

How do I use the formula interface if I want custom-valued dummies, e.g. values 1 and 2 rather than 0 and 1? The estimation might look like the following, where supp is a factor variable.
fit <- lm(len ~ dose + supp, data = ToothGrowth)
In this example there is not much use for the different values, but in many cases of a re-written model it can be useful.
EDIT: Actually, I have e.g. 3 levels and want the two columns to be coded differently, so that one is a 1/0 variable and the other a 1/2 variable. The example above has only two levels.
You can set the contrasts to whatever you want by creating the matrix you have in mind and either passing it to the contrasts argument of lm() or setting it as the default contrast of the factor itself.
Some sample data:
set.seed(6)
d <- data.frame(g=gl(3,5,labels=letters[1:3]), x=round(rnorm(15,50,20)))
The contrasts you have in mind:
mycontrasts <- matrix(c(0,0,1,0,1,1), byrow=TRUE, nrow=3)
colnames(mycontrasts) <- c("12","23")
mycontrasts
# 12 23
#[1,] 0 0
#[2,] 1 0
#[3,] 1 1
Then you use this in the lm call:
> lm(x ~ g, data=d, contrasts=list(g=mycontrasts))
Call:
lm(formula = x ~ g, data = d, contrasts = list(g = mycontrasts))
Coefficients:
(Intercept) g12 g23
58.8 -13.6 5.8
We can check that it does the right thing by comparing the means:
> diff(tapply(d$x, d$g, mean))
b c
-13.6 5.8
The default treatment contrasts use the first level as the baseline:
> lm(x ~ g, data=d)
Call:
lm(formula = x ~ g, data = d)
Coefficients:
(Intercept) gb gc
58.8 -13.6 -7.8
But that can be changed with the contrasts() replacement function:
> contrasts(d$g) <- mycontrasts
> lm(x ~ g, data=d)
Call:
lm(formula = x ~ g, data = d)
Coefficients:
(Intercept) g12 g23
58.8 -13.6 5.8
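As an aside, this particular coding, where each coefficient is a difference between successive levels, is also available ready-made as successive-difference contrasts in MASS (the slopes come out the same; only the intercept changes, to the mean of the group means rather than the first group mean):
library(MASS)
lm(x ~ g, data = d, contrasts = list(g = contr.sdif(3)))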
