I have a large dataset from which I intend to generate a 10% sample and run my machine learning model on it 20 times. To test how this will work, I decided to try it on the iris dataset. First, I split the dataset into training and testing sets and then used a while loop for a simple trial, but it doesn't seem to work and I got an error message. Is there something I missed?
### partitioning dataset
part <- sample(1:150, size = 100, replace = F)
training <- iris[part,]
testing <- iris[-part,]
## using a loop
n <-1
while (n<6) {
Train(n)<-training[sample(1:100,0.3*nrow(training), replace = F),]
fit <- randomForest(Species~., data = Train(n))
pred <- predict(fit, testing)
confusionMatrix(pred, testing$Species))
n <-n+1
}
The error message I got is
Error: unexpected '}' in "}"
Here is the loop corrected and tested. The parse error comes from the stray closing parenthesis on the confusionMatrix() line, which aborts the while block as it is being read, so the lone } that follows then triggers "unexpected '}'". In addition, Train(n) <- ... is not a valid way to create a numbered object in R, so the per-iteration results are stored in a list instead.
suppressPackageStartupMessages({
  library(randomForest)
  library(caret)
})
set.seed(2022)
part <- sample(1:150, size = 100, replace = FALSE)
training <- iris[part,]
testing <- iris[-part,]
## using a loop
result <- vector("list", 5L)
n <- 1L
while(n < 6L) {
  Train <- training[sample(1:100, 0.3*nrow(training), replace = FALSE), ]
  fit <- randomForest(Species ~ ., data = Train)
  pred <- predict(fit, testing)
  result[[n]] <- confusionMatrix(pred, testing$Species)
  n <- n + 1L
}
## see the first result
result[[1]]
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction setosa versicolor virginica
#> setosa 16 0 0
#> versicolor 0 11 1
#> virginica 0 3 19
#>
#> Overall Statistics
#>
#> Accuracy : 0.92
#> 95% CI : (0.8077, 0.9778)
#> No Information Rate : 0.4
#> P-Value [Acc > NIR] : 1.565e-14
#>
#> Kappa : 0.8778
#>
#> Mcnemar's Test P-Value : NA
#>
#> Statistics by Class:
#>
#> Class: setosa Class: versicolor Class: virginica
#> Sensitivity 1.00 0.7857 0.9500
#> Specificity 1.00 0.9722 0.9000
#> Pos Pred Value 1.00 0.9167 0.8636
#> Neg Pred Value 1.00 0.9211 0.9643
#> Prevalence 0.32 0.2800 0.4000
#> Detection Rate 0.32 0.2200 0.3800
#> Detection Prevalence 0.32 0.2400 0.4400
#> Balanced Accuracy 1.00 0.8790 0.9250
Created on 2022-05-11 by the reprex package (v2.0.1)
There is nothing to gain from a while loop over a for loop here: you are incrementing n manually, and that is exactly what for loops are meant for.
The equivalent for loop is the following.
result <- vector("list", 5L)
for(n in 1:5) {
  Train <- training[sample(1:100, 0.3*nrow(training), replace = FALSE), ]
  fit <- randomForest(Species ~ ., data = Train)
  pred <- predict(fit, testing)
  result[[n]] <- confusionMatrix(pred, testing$Species)
}
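For the original goal (a 10% sample drawn repeatedly, 20 times), the same pattern carries over directly. The sketch below is only an illustration reusing the training/testing split from above; run_once and results20 are names I made up, the sampling is stratified by class (my choice, so small demo samples still contain every species), and for your real data you would swap in your own data frame and outcome column.
## Minimal sketch: 20 runs, each on a fresh 10% sample of the training data.
run_once <- function(dat, test) {
  ## stratified 10% sample so every class is represented even in small demos
  idx <- unlist(lapply(split(seq_len(nrow(dat)), dat$Species),
                       function(i) sample(i, size = ceiling(0.1 * length(i)))))
  fit <- randomForest(Species ~ ., data = dat[idx, ])
  confusionMatrix(predict(fit, test), test$Species)
}
results20 <- lapply(1:20, function(i) run_once(training, testing))
results20[[1]]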
I use the "vegan" package to perform a PERMANOVA (adonis2()), and I also want to calculate the effect size (ω²). For this, I tried to use omega_squared() from the "effectsize" package, but I failed. I think it does not understand the output table, specifically the part with the mean squares. Is it possible to fix this or do I have to calculate manually?
library(vegan)
#> Loading required package: permute
#> Loading required package: lattice
#> This is vegan 2.6-4
library(effectsize)
data(dune)
data(dune.env)
ado <- adonis2(dune ~ Management, data = dune.env, permutations = 100)
ado
#> Permutation test for adonis under reduced model
#> Terms added sequentially (first to last)
#> Permutation: free
#> Number of permutations: 100
#>
#> adonis2(formula = dune ~ Management, data = dune.env, permutations = 100)
#> Df SumOfSqs R2 F Pr(>F)
#> Management 3 1.4686 0.34161 2.7672 0.009901 **
#> Residual 16 2.8304 0.65839
#> Total 19 4.2990 1.00000
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
w2 <- omega_squared(ado)
#> Error in `[[<-.data.frame`(`*tmp*`, "Mean_Square", value = numeric(0)): replacement has 0 rows, data has 3
interpret_omega_squared(w2)
#> Error in interpret(es, rules): object 'w2' not found
Created on 2022-11-15 with reprex v2.0.2
EDIT
I tried to do it manually:
library(vegan, quietly = T, warn.conflicts = F)
#> This is vegan 2.6-4
library(effectsize)
library(dplyr, quietly = T, warn.conflicts = F)
library(tibble)
library(purrr)
data(dune)
data(dune.env)
ado <- adonis2(dune ~ Management, data = dune.env, permutations = 100)
w2 <- omega_squared(ado) # Does not work
#> Error in `[[<-.data.frame`(`*tmp*`, "Mean_Square", value = numeric(0)): replacement has 0 rows, data has 3
interpret_omega_squared(w2) # Does not work
#> Error in interpret(es, rules): object 'w2' not found
ado_tidy <- tibble( # manually create Adonis test result table
parameter = c("Management", "Residual", "Total"),
df = ado %>% pull("Df"), # Degree of freedom
ss = ado %>% pull("SumOfSqs"), # sum of squares
meansqs = ss / df, # mean squares
p_r2 = ado %>% pull("R2"), # partial R²
f = ado %>% pull("F"), # F value
p = ado %>% pull("Pr(>F)") # p value
)
ado_tidy
#> # A tibble: 3 x 7
#> parameter df ss meansqs p_r2 f p
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Management 3 1.47 0.490 0.342 2.77 0.00990
#> 2 Residual 16 2.83 0.177 0.658 NA NA
#> 3 Total 19 4.30 0.226 1 NA NA
# Formula:
# W2 = (DFm * (F - 1)) / ((DFm * (F - 1)) + (DFm + 1))
W2 <- abs(
(ado_tidy %>% pull(df) %>% chuck(3) * (ado_tidy %>% pull(f) %>% chuck(1) - 1)) /
((ado_tidy %>% pull(df) %>% chuck(3) * (ado_tidy %>% pull(f) %>% chuck(1) - 1) +
ado_tidy %>% pull(df) %>% chuck(3) + 1)
)
)
W2
#> [1] 0.6267099
interpret_omega_squared(W2, rules = "field2013")
#> [1] "large"
#> (Rules: field2013)
Created on 2022-11-15 with reprex v2.0.2
Hopefully, the equation is correct...
Here is the MicEco::adonis_OmegaSq function edited so that it works both with the current vegan::adonis2 and deprecated vegan::adonis:
#' Calculate (partial) Omega-squared (effect-size calculation) for PERMANOVA and add it to the input object
#'
#' @param adonisOutput An adonis object
#' @param partial Should partial omega-squared be calculated (sample size adjusted). Default TRUE
#' @return Original adonis object with the (partial) Omega-squared values added
#' @import vegan
#' @export
adonis_OmegaSq <- function(adonisOutput, partial = TRUE){
    if(!(is(adonisOutput, "adonis") || is(adonisOutput, "anova.cca")))
        stop("Input should be an adonis object")
    if (is(adonisOutput, "anova.cca")) {
        aov_tab <- adonisOutput
        aov_tab$MeanSqs <- aov_tab$SumOfSqs / aov_tab$Df
        aov_tab$MeanSqs[length(aov_tab$Df)] <- NA
    } else {
        aov_tab <- adonisOutput$aov.tab
    }
    heading <- attr(aov_tab, "heading")
    MS_res <- aov_tab[pmatch("Residual", rownames(aov_tab)), "MeanSqs"]
    SS_tot <- aov_tab[rownames(aov_tab) == "Total", "SumsOfSqs"]
    N <- aov_tab[rownames(aov_tab) == "Total", "Df"] + 1
    if(partial){
        omega <- apply(aov_tab, 1, function(x) (x["Df"]*(x["MeanSqs"]-MS_res))/(x["Df"]*x["MeanSqs"]+(N-x["Df"])*MS_res))
        aov_tab$parOmegaSq <- c(omega[1:(length(omega)-2)], NA, NA)
    } else {
        omega <- apply(aov_tab, 1, function(x) (x["SumsOfSqs"]-x["Df"]*MS_res)/(SS_tot+MS_res))
        aov_tab$OmegaSq <- c(omega[1:(length(omega)-2)], NA, NA)
    }
    if (is(adonisOutput, "adonis"))
        cn_order <- c("Df", "SumsOfSqs", "MeanSqs", "F.Model", "R2",
                      if (partial) "parOmegaSq" else "OmegaSq", "Pr(>F)")
    else
        cn_order <- c("Df", "SumOfSqs", "F", if (partial) "parOmegaSq" else "OmegaSq",
                      "Pr(>F)")
    aov_tab <- aov_tab[, cn_order]
    attr(aov_tab, "names") <- cn_order
    attr(aov_tab, "heading") <- heading
    if (is(adonisOutput, "adonis"))
        adonisOutput$aov.tab <- aov_tab
    else
        adonisOutput <- aov_tab
    return(adonisOutput)
}
source() this function and it should work. In my test it gave the same results for both adonis2 and adonis.
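For completeness, a small usage sketch, assuming the function above has been sourced and using the same dune example as in the question; it should return the adonis2 table with a parOmegaSq column added.
library(vegan)
data(dune)
data(dune.env)
ado <- adonis2(dune ~ Management, data = dune.env, permutations = 100)
adonis_OmegaSq(ado, partial = TRUE)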
I have imputed data saved as a mids object and am trying to adapt my usual workflow around imputed data. However, I cannot figure out how to use sjPlot's tab_corr() and tab_df() and psych's describe on a mids object.
My goal is to generate a table of descriptive statistics and a correlation matrix without averaging the imputed datasets together. I was able to generate correlations using miceadds::micombine.cor, but the output isn't formatted like a typical correlation matrix. I also can individually compute means, SDs, etc. of variables from the mids object, but I'm looking for something that will generate a table.
library(mice)
library(miceadds)
library(sjPlot)
library(tidyverse)
library(psych)
set.seed(123)
## correlation matrix
data(nhanes)
imp <- mice(nhanes, print = FALSE)
head(micombine.cor(mi.res = imp)) # ugly
#> variable1 variable2 r rse fisher_r fisher_rse fmi
#> 1 age bmi -0.38765907 0.1899398 -0.40904214 0.2234456 0.09322905
#> 2 age hyp 0.51588273 0.1792162 0.57071301 0.2443348 0.25939786
#> 3 age chl 0.37685482 0.2157535 0.39638877 0.2515615 0.30863126
#> 4 bmi hyp -0.01748158 0.2244419 -0.01748336 0.2245067 0.10249784
#> 5 bmi chl 0.29082393 0.2519295 0.29946608 0.2752862 0.44307791
#> 6 hyp chl 0.30271060 0.1984525 0.31250096 0.2185381 0.04935528
#> t p lower95 upper95
#> 1 -1.83061192 0.06715849 -0.68949235 0.0288951
#> 2 2.33578315 0.01950255 0.09156846 0.7816509
#> 3 1.57571320 0.11509191 -0.09636276 0.7111171
#> 4 -0.07787455 0.93792784 -0.42805131 0.3990695
#> 5 1.08783556 0.27666771 -0.23557593 0.6852881
#> 6 1.42996130 0.15272813 -0.11531056 0.6296450
data(iris)
iris %>%
select(-c(Species)) %>%
tab_corr() # pretty
              Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
Sepal.Length                -0.118       0.872***      0.818***
Sepal.Width   -0.118                     -0.428***     -0.366***
Petal.Length  0.872***      -0.428***                  0.963***
Petal.Width   0.818***      -0.366***    0.963***
Computed correlation used pearson-method with listwise-deletion.
## descriptive statistics
psych::describe(imp) # error
#> Warning in mean.default(x, na.rm = na.rm): argument is not numeric or logical:
#> returning NA
#> Error in is.data.frame(x): 'list' object cannot be coerced to type 'double'
mean(imp$data$age) # inefficient
#> [1] 1.76
iris %>%
select(-c(Species)) %>%
psych::describe() %>%
select(-(c(vars, n, median, trimmed, mad))) %>%
tab_df() # pretty
mean sd min max range skew kurtosis se
5.84 0.83 4.30 7.90 3.60 0.31 -0.61 0.07
3.06 0.44 2.00 4.40 2.40 0.31 0.14 0.04
3.76 1.77 1.00 6.90 5.90 -0.27 -1.42 0.14
1.20 0.76 0.10 2.50 2.40 -0.10 -1.36 0.06
Created on 2021-12-11 by the reprex package (v2.0.1)
The previous code was incorrect. I have created two functions, mice_df and mice_cor (link to Github repo here) that will generate a correlation matrix and a table of descriptive statistics from a mids object using Rubin's Rules.
gtsummary will neatly format models based on mids objects.
library(mice)
library(gtsummary)
library(tablecloth)
library(dplyr)
data(nhanes)
imp <- mice(nhanes, m = 3, print = FALSE)
mod <- with(imp, lm(age ~ bmi + chl))
tbl_regression(as.mira(mod)) %>% as_kable()
vs <- c("bmi", "chl", "age", "hyp")
title <- "Table 1: Correlation matrix"
mice_cor(imp = imp,
vs = vs,
title = title)
I see my error now: I was using imp when I should have used imp$data. This works for tab_df and tab_corr, but not tab_model.
library(tidyverse)
library(sjPlot)
library(mice)
library(psych)
set.seed(123)
# Imputed data
data(nhanes)
imp <- mice(nhanes, print = FALSE)
## tab_df
imp$data %>%
select(age, bmi, chl) %>%
psych::describe() %>%
tab_df(.)
## tab_corr
imp$data %>%
select(age, bmi, chl) %>%
tab_corr(.)
## tab_model
mod <- with(imp, lm(age ~ bmi)) %>% pool()
summary(mod)
tab_model(mod) # error...
#> Error in fam.info$is_linear || identical(fam.info$link_function, "identity"): invalid 'x' type in 'x || y'
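As a side note on the descriptive-statistics goal: instead of describing imp$data (the original data with missing values), a hedged alternative sketch is to run psych::describe() on each completed dataset via mice::complete() in long format; the group_by()/group_map() step is just one way I'm assuming here to do the per-imputation split, not part of the original answers.
library(mice)
library(dplyr)
library(psych)
data(nhanes)
imp <- mice(nhanes, print = FALSE)
long <- complete(imp, action = "long")   # stacked completed datasets, with .imp and .id columns
long %>%
  group_by(.imp) %>%
  group_map(~ psych::describe(select(.x, age, bmi, chl)))  # one descriptives table per imputation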
I am trying to analyse the pattern of errors (accuracy) on test items for the model I coded below. I would like to find out how often the setosa and versicolor species of iris are incorrectly classified as virginica, and how often the virginica species is incorrectly classified as not virginica. Could this be done? Any suggestions would be great. Here are my logistic regression model and a classifier built from it...
library(datasets)
iris$dummy_virginica_iris <- 0
iris$dummy_virginica_iris[iris$Species == 'virginica'] <- 1
iris$dummy_virginica_iris
# Logistic regression model.
glm <- glm(dummy_virginica_iris ~ Petal.Width + Sepal.Width,
data = iris,
family = 'binomial')
summary(glm)
# Classifer.
glm.pred <- predict(glm, type="response")
virginica <- ifelse(glm.pred > .5, TRUE, FALSE)
You can create a new vector to separate the flowers into virginica / non-virginica like this:
species <- as.character(iris$Species)
species[species != "virginica"] <- "non-virginica"
Then you can just tabulate this against your model's predictions as a 2 x 2 contingency table:
result <- table(virginica, species)
print(result)
#          species
# virginica non-virginica virginica
#     FALSE            96         3
#     TRUE              4        47
Which allows for easy calculations of sensitivity, specificity and accuracy of your model like this:
sensitivity <- result[2, 2] / sum(result[, 2])
specificity <- result[1, 1] / sum(result[, 1])
accuracy <- (result[1, 1] + result[2, 2]) / sum(result)
sensitivity
# [1] 0.94
specificity
# [1] 0.96
accuracy
# [1] 0.9533333
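If you already use caret, a hedged alternative sketch: confusionMatrix() on the same two vectors reports the counts together with sensitivity, specificity and accuracy in one call. The factor levels and the choice of positive class below are my assumptions, not part of the original answer.
library(caret)
pred_class <- factor(ifelse(glm.pred > .5, "virginica", "non-virginica"),
                     levels = c("non-virginica", "virginica"))
truth <- factor(species, levels = c("non-virginica", "virginica"))
confusionMatrix(pred_class, truth, positive = "virginica")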
I have a matrix of the counts of transitions from one state to another and I would like to calculate the Maximum Likelihood Estimates, standard errors and standard deviations. The "markovchain" package has an example but the data is a sequence. My data is obtained from a balanced panel dataset of 155 companies so the example code they provide doesn't work for me.
This is the example I followed:
data(rain)
rain$rain[1:10]
[1] "6+" "1-5" "1-5" "1-5" "1-5" "1-5" "1-5" "6+" "6+" "6+"
#obtaining the empirical transition matrix
createSequenceMatrix(stringchar = rain$rain)
0 1-5 6+
0 362 126 60
1-5 136 90 68
6+ 50 79 124
#fitting the DTMC by MLE
alofiMcFitMle <- markovchainFit(data = rain$rain, method = "mle", name = "Alofi")
alofiMcFitMle
$estimate
Alofi
A 3 - dimensional discrete Markov Chain defined by the following states:
0, 1-5, 6+
The transition matrix (by rows) is defined as follows:
0 1-5 6+
0 0.6605839 0.2299270 0.1094891
1-5 0.4625850 0.3061224 0.2312925
6+ 0.1976285 0.3122530 0.4901186
$standardError
0 1-5 6+
0 0.03471952 0.02048353 0.01413498
1-5 0.03966634 0.03226814 0.02804834
6+ 0.02794888 0.03513120 0.04401395
$confidenceInterval
$confidenceInterval$confidenceLevel
[1] 0.95
$confidenceInterval$lowerEndpointMatrix
0 1-5 6+
0 0.6034754 0.1962346 0.08623909
1-5 0.3973397 0.2530461 0.18515711
6+ 0.1516566 0.2544673 0.41772208
$confidenceInterval$upperEndpointMatrix
0 1-5 6+
0 0.7176925 0.2636194 0.1327390
1-5 0.5278304 0.3591988 0.2774279
6+ 0.2436003 0.3700387 0.5625151
$logLikelihood
[1] -1040.419
Because I already have a matrix of count data I can't use the above code. I just want to take my 6x6 matrix of transition counts and determine the maximum likelihood estimators, standard errors (confidence interval) and standard deviation. Does anyone have an example I could follow?
I'm not sure how you could do this with the markovchain package, but you could write a function that would produce similar results. The function below generates the estimates of the transition probabilities through MLE, but the standard errors and the lower and upper confidence bounds via parametric bootstrap. You can either use the standard errors and estimates to make normal theory confidence intervals (ci = "normal") or use the relevant quantiles of the parametric bootstrap distributions (ci = "quantile"). The parallel flag identifies whether the levels of the matrix should be considered ordered or not. If ordered, the model uses an ordered logit to calculate probabilities. If unordered (i.e., parallel = FALSE), then a multinomial logit is used to calculate probabilities.
If I understand correctly, you're starting with something that looks like the output from createSequenceMatrix(stringchar = rain$rain), without the underlying sequence of values that generated it. Here's how it would work.
Generate the data.
library(markovchain)
#> Package: markovchain
#> Version: 0.9.0
#> Date: 2022-07-01
#> BugReport: https://github.com/spedygiorgio/markovchain/issues
data(rain)
mat <- createSequenceMatrix(stringchar = rain$rain)
Make the functions to estimate the models and print the results.
mcmat_est <- function(mat, parallel = FALSE, ci = c("normal", "quantile"), conf_level = .95, R = 2500){
  ci <- match.arg(ci)
  if(!is.null(rownames(mat))){
    vals <- rownames(mat)
  }else{
    vals <- 1:nrow(mat)
  }
  state_vals <- lapply(vals, \(i)data.frame(vals = rep(vals, mat[i, ])))
  if(parallel){
    state_vals <- lapply(state_vals, \(x){x$vals <- factor(x$vals, levels=vals, ordered=TRUE); x})
    mods <- lapply(state_vals, \(x)MASS::polr(vals ~ 1, data=x, Hess = TRUE))
    draws <- lapply(mods, \(m)cbind(-Inf, MASS::mvrnorm(R, m$zeta, vcov(m)), Inf))
    qs <- lapply(draws, \(d)plogis(d))
    probs <- lapply(qs, \(x)x[,-1] - x[,-ncol(x)])
    ests <- lapply(mods, \(m)plogis(c(-Inf, m$zeta, Inf)))
    ests <- lapply(ests, \(e)e[-1]-e[-length(e)])
  }else{
    state_vals <- lapply(state_vals, \(x){x$vals <- factor(x$vals, levels=vals); x})
    mods <- lapply(state_vals, \(x)nnet::multinom(vals ~ 1, data=x, Hess = TRUE, trace=FALSE))
    draws <- lapply(mods, \(m)MASS::mvrnorm(R, c(0, coef(m)),
                                            cbind(0, rbind(0, vcov(m)))))
    probs <- lapply(draws, \(d)prop.table(exp(d), 1))
    ests <- lapply(mods, \(m)exp(c(0, coef(m))))
    ests <- lapply(ests, \(e)e/sum(e))
  }
  logLik <- sum(sapply(mods, logLik))
  nobs <- sum(sapply(mods, \(x)attr(logLik(x), "nobs")))
  df <- sum(sapply(mods, \(x)attr(logLik(x), "df")))
  attr(logLik, "nobs") <- nobs
  attr(logLik, "df") <- df
  class(logLik) <- "logLik"
  est <- do.call(rbind, ests)
  se <- do.call(rbind, lapply(probs, \(p)apply(p, 2, sd)))
  crit_z <- 1-(1-conf_level)/2
  if(ci == "normal"){
    lwr <- est - qnorm(crit_z)*se
    upr <- est + qnorm(crit_z)*se
  }else{
    lwr <- do.call(rbind, lapply(probs, \(p)apply(p, 2, quantile, 1-crit_z)))
    upr <- do.call(rbind, lapply(probs, \(p)apply(p, 2, quantile, crit_z)))
  }
  rownames(est) <- rownames(se) <- rownames(lwr) <- rownames(upr) <- rownames(mat)
  colnames(est) <- colnames(se) <- colnames(lwr) <- colnames(upr) <- colnames(mat)
  res <- list(estimate = est, se = se, lower = lwr, upper = upr, logLik = logLik)
  class(res) <- "mcest"
  res
}
print.mcest <- function(x, ..., digits = 3){
  cat("\nEstimated Transition Probabilities\n")
  print(x$estimate, digits=digits)
  cat("\nStandard Errors of Transition Probabilities\n")
  print(x$se, digits=digits)
  cat("\nLower Confidence Bounds\n")
  print(x$lower, digits=digits)
  cat("\nUpper Confidence Bounds\n")
  print(x$upper, digits=digits)
  cat('\n')
  print(x$logLik)
}
Estimate the models using the function I generated:
out <- mcmat_est(mat)
out
#>
#> Estimated Transition Probabilities
#> 0 1-5 6+
#> 0 0.661 0.230 0.109
#> 1-5 0.463 0.306 0.231
#> 6+ 0.198 0.312 0.490
#>
#> Standard Errors of Transition Probabilities
#> 0 1-5 6+
#> 0 0.0203 0.0181 0.0135
#> 1-5 0.0281 0.0269 0.0248
#> 6+ 0.0245 0.0289 0.0309
#>
#> Lower Confidence Bounds
#> 0 1-5 6+
#> 0 0.621 0.194 0.083
#> 1-5 0.407 0.253 0.183
#> 6+ 0.150 0.256 0.430
#>
#> Upper Confidence Bounds
#> 0 1-5 6+
#> 0 0.700 0.265 0.136
#> 1-5 0.518 0.359 0.280
#> 6+ 0.246 0.369 0.551
#>
#> 'log Lik.' -1040.419 (df=6)
Compare to the markovchainFit() output.
fit <- markovchainFit(data = rain$rain, method = "mle", name = "Alofi")
fit
#> $estimate
#> Alofi
#> A 3 - dimensional discrete Markov Chain defined by the following states:
#> 0, 1-5, 6+
#> The transition matrix (by rows) is defined as follows:
#> 0 1-5 6+
#> 0 0.6605839 0.2299270 0.1094891
#> 1-5 0.4625850 0.3061224 0.2312925
#> 6+ 0.1976285 0.3122530 0.4901186
#>
#>
#> $standardError
#> 0 1-5 6+
#> 0 0.03471952 0.02048353 0.01413498
#> 1-5 0.03966634 0.03226814 0.02804834
#> 6+ 0.02794888 0.03513120 0.04401395
#>
#> $confidenceLevel
#> [1] 0.95
#>
#> $lowerEndpointMatrix
#> 0 1-5 6+
#> 0 0.5925349 0.1897800 0.0817850
#> 1-5 0.3848404 0.2428780 0.1763188
#> 6+ 0.1428496 0.2433971 0.4038528
#>
#> $upperEndpointMatrix
#> 0 1-5 6+
#> 0 0.7286330 0.2700740 0.1371931
#> 1-5 0.5403296 0.3693669 0.2862663
#> 6+ 0.2524073 0.3811089 0.5763843
#>
#> $logLikelihood
#> [1] -1040.419
Created on 2022-12-15 by the reprex package (v2.0.1)
The log-likelihoods for the two models are the same, which provides some confidence that the underlying estimation process is similar. The standard errors are different, though: the original function doesn't use a parametric bootstrap, and I'm not sure exactly what it does since the estimation code is written in C++ and I haven't investigated the source. If this difference is unacceptable, perhaps someone else will be able to get closer.
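One more hedged note, since the question starts from a count matrix rather than a sequence: the MLE of each transition probability is simply the row proportion, and the standard errors in markovchainFit()'s output appear to match sqrt(p_hat / n_row), so a plain MLE table (without the bootstrap) can be built directly from the counts. A sketch, reusing mat from above (or your own 6x6 count matrix):
n_row <- rowSums(mat)
p_hat <- mat / n_row                 # MLE: each row of counts divided by its row total
se    <- sqrt(p_hat / n_row)         # appears to match markovchainFit()'s $standardError
z     <- qnorm(0.975)
list(estimate = p_hat,
     standardError = se,
     lower = pmax(p_hat - z * se, 0),
     upper = pmin(p_hat + z * se, 1))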
I'm using predict.lm(fit, newdata=newdata, interval="prediction") to get predictions and their prediction intervals (PI) for new observations. Now I would like to aggregate (sum and mean) these predictions and their PI's based on an additional variable (i.e. a spatial aggregation on the zip code level of predictions for single households).
I learned from StackExchange that you cannot aggregate the prediction intervals of single predictions just by aggregating the limits of those intervals. The post is very helpful for understanding why this can't be done, but I have a hard time translating it into actual code.
Here's a reproducible example:
library(dplyr)
set.seed(123)
data(iris)
#Split dataset in training and prediction set
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
train <- iris[train_ind, ]
pred <- iris[-train_ind, ]
#Fit regression model
fit1 <- lm(Petal.Width ~ Petal.Length, data=train)
#Fit multiple linear regression model
fit2 <- lm(Petal.Width ~ Petal.Length + Sepal.Width + Sepal.Length, data=train)
#Predict Pedal.Width for new data incl prediction intervals for each prediction
predictions1<-predict(fit1, newdata=pred, interval="prediction")
predictions2<-predict(fit2, newdata=pred, interval="prediction")
# Aggregate data by summing predictions for species
#NOT correct for prediction intervals
predictions_agg1<-data.frame(predictions1,Species=pred$Species) %>%
group_by(Species) %>%
summarise_all(funs(sum,mean))
predictions_agg2<-data.frame(predictions2,Species=pred$Species) %>%
group_by(Species) %>%
summarise_all(funs(sum,mean))
I couldn't find a good tutorial or package which describes how to properly aggregate predictions and their PI's in R when using predict.lm(). Is there something out there? Would highly appreciate if you could point me in the right direction on how to do this in R.
Your question is closely related to a thread I answered two years ago: linear model with `lm`: how to get prediction variance of the sum of predicted values. It provides an R implementation of Glen_b's answer on Cross Validated. Thanks for quoting that Cross Validated thread; I didn't know about it, and perhaps I can leave a comment there linking to this Stack Overflow thread.
I have polished my original answer, wrapping the line-by-line code into easy-to-use functions lm_predict and agg_pred. Solving your question then reduces to applying those functions by group.
Consider the iris example in your question, and the second model fit2 for demonstration.
set.seed(123)
data(iris)
#Split dataset in training and prediction set
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
train <- iris[train_ind, ]
pred <- iris[-train_ind, ]
#Fit multiple linear regression model
fit2 <- lm(Petal.Width ~ Petal.Length + Sepal.Width + Sepal.Length, data=train)
We split pred by group Species, then apply lm_predict (with diag = FALSE) on all sub data frames.
oo <- lapply(split(pred, pred$Species), lm_predict, lmObject = fit2, diag = FALSE)
To use agg_pred we need to specify a weight vector whose length equals the number of new data points in each group. We can determine this by consulting the length of fit in each oo[[i]]:
n <- lengths(lapply(oo, "[[", 1))
#setosa versicolor virginica
# 11 13 14
If the aggregation operation is a sum, we do
w <- lapply(n, rep.int, x = 1)
#List of 3
# $ setosa : num [1:11] 1 1 1 1 1 1 1 1 1 1 ...
# $ versicolor: num [1:13] 1 1 1 1 1 1 1 1 1 1 ...
# $ virginica : num [1:14] 1 1 1 1 1 1 1 1 1 1 ...
SUM <- Map(agg_pred, w, oo)
SUM[[1]] ## result for the first group, for example
#$mean
#[1] 2.499728
#
#$var
#[1] 0.1271554
#
#$CI
# lower upper
#1.792908 3.206549
#
#$PI
# lower upper
#0.999764 3.999693
sapply(SUM, "[[", "CI") ## some nice presentation for CI, for example
# setosa versicolor virginica
#lower 1.792908 16.41526 26.55839
#upper 3.206549 17.63953 28.10812
If the aggregation operation is an average, we rescale w by n and call agg_pred.
w <- mapply("/", w, n)
#List of 3
# $ setosa : num [1:11] 0.0909 0.0909 0.0909 0.0909 0.0909 ...
# $ versicolor: num [1:13] 0.0769 0.0769 0.0769 0.0769 0.0769 ...
# $ virginica : num [1:14] 0.0714 0.0714 0.0714 0.0714 0.0714 ...
AVE <- Map(agg_pred, w, oo)
AVE[[2]] ## result for the second group, for example
#$mean
#[1] 1.3098
#
#$var
#[1] 0.0005643196
#
#$CI
# lower upper
#1.262712 1.356887
#
#$PI
# lower upper
#1.189562 1.430037
sapply(AVE, "[[", "PI") ## some nice presentation for CI, for example
# setosa versicolor virginica
#lower 0.09088764 1.189562 1.832255
#upper 0.36360845 1.430037 2.072496
This is great! Thank you so much! There is one thing I forgot to mention: in my actual application I need to sum ~300,000 predictions, which would create a full variance-covariance matrix of roughly 700 GB. Do you have any idea whether there is a computationally more efficient way to get directly to the sum of the variance-covariance matrix?
Use the fast_agg_pred function provided in the revision of the original Q & A. Let's start it all over.
set.seed(123)
data(iris)
#Split dataset in training and prediction set
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
train <- iris[train_ind, ]
pred <- iris[-train_ind, ]
#Fit multiple linear regression model
fit2 <- lm(Petal.Width ~ Petal.Length + Sepal.Width + Sepal.Length, data=train)
## list of new data
newdatlist <- split(pred, pred$Species)
n <- sapply(newdatlist, nrow)
#setosa versicolor virginica
# 11 13 14
If the aggregation operation is a sum, we do
w <- lapply(n, rep.int, x = 1)
SUM <- mapply(fast_agg_pred, w, newdatlist,
MoreArgs = list(lmObject = fit2, alpha = 0.95),
SIMPLIFY = FALSE)
If the aggregation operation is an average, we do
w <- mapply("/", w, n)
AVE <- mapply(fast_agg_pred, w, newdatlist,
MoreArgs = list(lmObject = fit2, alpha = 0.95),
SIMPLIFY = FALSE)
Note that we can't use Map in this case because we need to pass additional fixed arguments to fast_agg_pred; use mapply in this situation, with MoreArgs and SIMPLIFY = FALSE.
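For reference, here is a hedged sketch (not the lm_predict / agg_pred / fast_agg_pred code from the linked answer, and agg_pred_sketch is a name I made up) of the linear algebra that makes the "fast" aggregation possible: the aggregated prediction is w' X0 b, its confidence variance is (X0' w)' V (X0' w) with V = vcov(fit), and the prediction variance adds sigma^2 * sum(w^2), so the full n-by-n covariance matrix never needs to be formed.
## Sketch of an aggregated prediction with CI/PI, avoiding the n x n covariance matrix.
agg_pred_sketch <- function(lmObject, newdata, w, level = 0.95) {
  X0  <- model.matrix(delete.response(terms(lmObject)), newdata)
  xw  <- drop(crossprod(X0, w))                 # X0' w, a length-p vector
  mu  <- sum(xw * coef(lmObject))               # w' X0 b, the aggregated prediction
  vCI <- drop(t(xw) %*% vcov(lmObject) %*% xw)  # variance of the aggregated fitted value
  vPI <- vCI + summary(lmObject)$sigma^2 * sum(w^2)  # plus aggregated error variance
  tq  <- qt(1 - (1 - level) / 2, df = lmObject$df.residual)
  list(mean = mu,
       CI = mu + c(-1, 1) * tq * sqrt(vCI),
       PI = mu + c(-1, 1) * tq * sqrt(vPI))
}
## e.g. the sum over the setosa rows of `pred`:
agg_pred_sketch(fit2, subset(pred, Species == "setosa"),
                w = rep(1, sum(pred$Species == "setosa")))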