I wish to create a data.frame with two columns, where each column itself contains multiple columns. (I need it to feed plsr in the pls package.)
It's like the oliveoil data.
> oliveoil
chemical.Acidity chemical.Peroxide chemical.K232 chemical.K270 chemical.DK sensory.yellow sensory.green
G1 0.7300 12.7000 1.9000 0.1390 0.0030 21.4 73.4
G2 0.1900 12.3000 1.6780 0.1160 -0.0040 23.4 66.3
G3 0.2600 10.3000 1.6290 0.1160 -0.0050 32.7 53.5
G4 0.6700 13.7000 1.7010 0.1680 -0.0020 30.2 58.3
G5 0.5200 11.2000 1.5390 0.1190 -0.0010 51.8 32.5
I1 0.2600 18.7000 2.1170 0.1420 0.0010 40.7 42.9
I2 0.2400 15.3000 1.8910 0.1160 0.0000 53.8 30.4
I3 0.3000 18.5000 1.9080 0.1250 0.0010 26.4 66.5
I4 0.3500 15.6000 1.8240 0.1040 0.0000 65.7 12.1
I5 0.1900 19.4000 2.2220 0.1580 -0.0030 45.0 31.9
S1 0.1500 10.5000 1.5220 0.1160 -0.0040 70.9 12.2
S2 0.1600 8.1400 1.5270 0.1063 -0.0020 73.5 9.7
S3 0.2700 12.5000 1.5550 0.0930 -0.0020 68.1 12.0
S4 0.1600 11.0000 1.5730 0.0940 -0.0030 67.6 13.9
S5 0.2400 10.8000 1.3310 0.0850 -0.0030 71.4 10.6
S6 0.3000 11.4000 1.4150 0.0930 -0.0040 71.4 10.0
sensory.brown sensory.glossy sensory.transp sensory.syrup
G1 10.1 79.7 75.2 50.3
G2 9.8 77.8 68.7 51.7
G3 8.7 82.3 83.2 45.4
G4 12.2 81.1 77.1 47.8
G5 8.0 72.4 65.3 46.5
I1 20.1 67.7 63.5 52.2
I2 11.5 77.8 77.3 45.2
I3 14.2 78.7 74.6 51.8
I4 10.3 81.6 79.6 48.3
I5 28.4 75.7 72.9 52.8
S1 10.8 87.7 88.1 44.5
S2 8.3 89.9 89.7 42.3
S3 10.8 78.4 75.1 46.4
S4 11.9 84.6 83.8 48.5
S5 10.8 88.1 88.5 46.7
S6 11.4 89.5 88.5 47.2
And it is a data.frame with 2 columns:
> is.data.frame(oliveoil)
[1] TRUE
> dim(oliveoil)
[1] 16 2
I tried the following code:
x = data.frame(a = c(1,2,3), b = c(1,3,4))
y = data.frame(c = c(3,4,5), d = c(5,4,2))
d = data.frame(x = x, y = y)
it returns:
> d
x.a x.b y.c y.d
1 1 1 3 5
2 2 3 4 4
3 3 4 5 2
but I cannot call x with d$x
> d$x
NULL
what I expect is:
> d$x
a b
1 1 1
2 2 3
3 3 4
I am expecting some argument to the data.frame function to make this work, something like:
d = data.frame(x = x, y = y, merge.columns = F)
But I cannot find any such argument in the docs.
The pls::plsr() function does not require data to be set up exactly like oliveoil. plsr() allows the response term to be a matrix, and oliveoil has a particular way of storing matrices, but you can supply any matrix to plsr().
For example, this fits a model without error:
library(pls)
n <- 100
y <- matrix(rnorm(n), nrow = 10)
x <- matrix(rnorm(n), nrow = 10)
plsr(y ~ x)
# Partial least squares regression , fitted with the kernel algorithm.
# Call:
# plsr(formula = y ~ x)
Also consider the yarn dataset, which is also used in the pls docs; it just stores regular matrices in a data frame rather than using the I() approach of oliveoil.
For a bit more explanation:
The sub-components of oliveoil are not actually of class data.frame.
If you run str(oliveoil), you'll see the sensory and chemical objects in oliveoil are cast as AsIs objects. They're not technically data frame-classed objects, and in fact they were probably matrices with named rows and columns to begin with.
str(oliveoil)
'data.frame': 16 obs. of 2 variables:
$ chemical: 'AsIs' num [1:16, 1:5] 0.73 0.19 0.26 0.67 0.52 0.26 0.24 0.3 0.35 0.19 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr "G1" "G2" "G3" "G4" ...
.. ..$ : chr "Acidity" "Peroxide" "K232" "K270" ...
$ sensory : 'AsIs' num [1:16, 1:6] 21.4 23.4 32.7 30.2 51.8 40.7 53.8 26.4 65.7 45 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr "G1" "G2" "G3" "G4" ...
.. ..$ : chr "yellow" "green" "brown" "glossy" ...
The AsIs class means they were stored in oliveoil using the I() function (per the R docs, I() stands for "Inhibit Interpretation/Conversion", i.e. treat the object "as is"). I() protects an object from being converted into something else during an operation, such as storage into a data frame.
You can reproduce this with a simple example (although note that if you try and store two data frames in a data frame with I() you'll get an error):
n <- 100
matrix_a <- matrix(rnorm(n), nrow = 10)
matrix_b <- matrix(rnorm(n), nrow = 10)
df <- data.frame(a = I(matrix_a), b = I(matrix_b))
str(df)
'data.frame': 10 obs. of 2 variables:
$ a: 'AsIs' num [1:10, 1:10] -0.817 -0.233 -1.987 0.523 -1.596 ...
$ b: 'AsIs' num [1:10, 1:10] 1.9189 -0.7043 0.0624 0.0152 -0.5409 ...
And df now contains matrix_a as $a and matrix_b as $b:
df$a
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -0.8167554 -0.61629222 0.3673423 1.30882012 0.97618868 -0.53124825
[2,] -0.2329451 0.08556506 -0.5839086 0.86298000 1.20452166 0.09825958
[3,] -1.9873738 -0.93537922 0.1057309 0.63585036 -1.09604531 1.33080572
[4,] 0.5227912 1.89505993 1.1184905 1.20683770 -0.02431886 -1.15878634
# ...
You could also save matrix_a and matrix_b into a data frame directly, without I(), though note that each matrix is then split into individual columns (a.1, a.2, ...):
# also works
df2 <- data.frame(a = matrix_a, b = matrix_b, foo = letters[1:10])
TL;DR - plsr() takes any matrix, but if you want your data stored in a data frame, create a matrix and save it into a data frame, with or without I().
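Applied to the original question, a minimal sketch (reusing the x and y data frames from the question) might look like this; d$x then returns the whole matrix rather than NULL:

```r
# Wrap each inner data frame as a matrix with I() to reproduce the
# oliveoil-style structure the question asks for.
x <- data.frame(a = c(1, 2, 3), b = c(1, 3, 4))
y <- data.frame(c = c(3, 4, 5), d = c(5, 4, 2))
d <- data.frame(x = I(as.matrix(x)), y = I(as.matrix(y)))
dim(d)  # 3 rows, 2 columns, analogous to oliveoil's 16 x 2
d$x     # the full matrix, with columns a and b
```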
Related
I would like to convert a list to a dataframe (picture below). I used do.call(rbind.data.frame, contrast); however, I got Error in xi[[j]] : this S4 class is not subsettable. I can still read the elements separately. Does anyone know what causes this?
I got this list when running the ART ANOVA test using the ARTool package.
Update
This is my original code to calculate and fit the model.
Organism_df_posthoc <- bird_metrics_long_new %>%
rbind(plant_metrics_long_new) %>%
mutate(Type = factor(Type, levels = c("Forest", "Jungle rubber", "Rubber", "Oil palm"))) %>%
mutate(Category = factor(Category)) %>%
group_by(Category) %>%
mutate_at(c("PD"), ~(scale(.) %>% as.vector())) %>%
ungroup() %>%
nest_by(n1) %>%
mutate(fit = list(art.con(art(PD ~ Category + Type + Category:Type, data = data),
"Category:Type",adjust = "tukey", interaction = T)))
And the output of fit is that I showed already.
With rbind instead of rbind.data.frame, there is a specific method for 'emmGrid' objects, and R can dispatch to the correct method by matching the class if we specify just rbind:
do.call(rbind, contrast)
-output
wool tension emmean SE df lower.CL upper.CL
A L 44.6 3.65 48 33.6 55.5
A M 24.0 3.65 48 13.0 35.0
A H 24.6 3.65 48 13.6 35.5
B L 28.2 3.65 48 17.2 39.2
B M 28.8 3.65 48 17.8 39.8
B H 18.8 3.65 48 7.8 29.8
A L 44.6 3.65 48 33.6 55.5
A M 24.0 3.65 48 13.0 35.0
A H 24.6 3.65 48 13.6 35.5
B L 28.2 3.65 48 17.2 39.2
B M 28.8 3.65 48 17.8 39.8
B H 18.8 3.65 48 7.8 29.8
Confidence level used: 0.95
Conf-level adjustment: bonferroni method for 12 estimates
The reason is that a specific rbind method is registered when we load emmeans:
> methods('rbind')
[1] rbind.data.frame rbind.data.table* rbind.emm_list* rbind.emmGrid* rbind.grouped_df* rbind.zoo*
The structure created in the example below matches the structure the OP showed.
Using rbind.data.frame doesn't dispatch correctly because the objects' class is already emmGrid.
data
library(multcomp)
library(emmeans)
warp.lm <- lm(breaks ~ wool*tension, data = warpbreaks)
warp.emmGrid <- emmeans(warp.lm, ~ tension | wool)
contrast <- list(warp.emmGrid, warp.emmGrid)
If the OP used 'ARTool' and the columns differ, the above solution may not work, because rbind requires all objects to have the same column names. In that case we can convert each element to a tibble by looping over the list with map (from purrr) and bind them:
library(ARTool)
library(purrr)
library(tibble)
map_dfr(contrast, as_tibble)
-output
# A tibble: 42 × 8
contrast estimate SE df t.ratio p.value Moisture_pairwise Fertilizer_pairwise
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
1 m1 - m2 -23.1 4.12 8.00 -5.61 0.00226 NA NA
2 m1 - m3 -33.8 4.12 8.00 -8.20 0.000169 NA NA
3 m1 - m4 -15.2 4.12 8.00 -3.68 0.0256 NA NA
4 m2 - m3 -10.7 4.12 8 -2.59 0.118 NA NA
5 m2 - m4 7.92 4.12 8 1.92 0.291 NA NA
6 m3 - m4 18.6 4.12 8 4.51 0.00849 NA NA
7 NA 6.83 10.9 24 0.625 0.538 m1 - m2 f1 - f2
8 NA 15.3 10.9 24 1.40 0.174 m1 - m3 f1 - f2
9 NA -5.83 10.9 24 -0.533 0.599 m1 - m4 f1 - f2
10 NA 8.50 10.9 24 0.777 0.445 m2 - m3 f1 - f2
# … with 32 more rows
data
data(Higgins1990Table5, package = "ARTool")
m <- art(DryMatter ~ Moisture*Fertilizer + (1|Tray), data=Higgins1990Table5)
a1 <- art.con(m, ~ Moisture)
a2 <- art.con(m, "Moisture:Fertilizer", interaction = TRUE)
contrast <- list(a1, a2)
I've run some analysis that outputs data in the following format:
> sft
Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
1 1 0.35400 8.4300 0.7710 146.00 145.00 166.0
2 2 0.21900 2.2500 0.8960 83.30 82.80 107.0
3 3 0.17300 1.1600 0.9310 49.90 49.80 72.0
4 4 0.04100 0.3070 0.7360 31.60 31.20 50.3
5 5 0.00165 -0.0298 0.4610 21.30 21.00 37.3
6 6 0.05310 -0.1780 -0.1240 15.30 14.60 28.9
7 7 0.21300 -0.2610 -0.0113 11.60 10.90 24.0
8 8 0.63800 -0.5280 0.5560 9.27 8.18 22.3
9 9 0.82500 -0.6310 0.8110 7.69 6.14 21.2
10 10 0.85000 -0.7400 0.8100 6.59 4.97 20.3
11 11 0.82200 -0.8310 0.7710 5.77 3.95 19.6
12 12 0.81900 -0.8480 0.7680 5.16 3.27 19.0
13 13 0.73300 -0.8670 0.6660 4.67 2.80 18.4
14 14 0.65300 -0.9170 0.5840 4.28 2.39 17.9
15 15 0.70200 -0.9130 0.6440 3.97 2.22 17.4
What I want is to extract the Power that gave the highest (maximum) SFT.R.sq value.
Here is the table's attributes:
>str(sft)
List of 2
$ powerEstimate: int NA
$ fitIndices :'data.frame': 15 obs. of 7 variables:
..$ Power : int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
..$ SFT.R.sq : num [1:15] 0.35392 0.21883 0.17291 0.04098 0.00165 ...
..$ slope : num [1:15] 8.4267 2.2461 1.158 0.307 -0.0298 ...
..$ truncated.R.sq: num [1:15] 0.771 0.896 0.931 0.736 0.461 ...
..$ mean.k. : num [1:15] 145.8 83.3 49.9 31.6 21.3 ...
..$ median.k. : num [1:15] 145.1 82.8 49.8 31.2 21 ...
..$ max.k. : num [1:15] 165.6 107.1 72 50.3 37.3 ...
I can grab the two columns I need easily with:
sft$fitIndices$Power
sft$fitIndices$SFT.R.sq
But I can't get the actual power associated with the highest SFT.R.sq value:
>sft$fitIndices$Power[max(sft$fitIndices$SFT.R.sq)]
integer(0)
Examples of what I'm trying to do usually involve dataframes where you extract a value based on the value from another column - but it doesn't seem to work with attributes.
We need which.max, which returns the position of the maximum value, for subsetting 'Power':
sft$fitIndices$Power[which.max(sft$fitIndices$SFT.R.sq)]
Also, if we need to slice the row, extract the data.frame element and slice (note that order_by takes an unquoted column name, not a string):
library(dplyr)
library(purrr)
pluck(sft, "fitIndices") %>%
  slice_max(n = 1, order_by = SFT.R.sq)
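A toy example (made-up numbers standing in for sft$fitIndices, not the OP's actual object) shows why indexing with max() returns integer(0) while which.max() works:

```r
# Hypothetical stand-in for sft$fitIndices
fit <- data.frame(Power = 1:5, SFT.R.sq = c(0.354, 0.219, 0.173, 0.850, 0.733))
fit$Power[max(fit$SFT.R.sq)]        # max() gives 0.85; as an index it truncates to 0 -> integer(0)
fit$Power[which.max(fit$SFT.R.sq)]  # which.max() gives the position 4 -> Power 4
```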
I was trying to forecast a time series problem using lm() and my data looks like below
Customer_key date sales
A35 2018-05-13 31
A35 2018-05-20 20
A35 2018-05-27 43
A35 2018-06-03 31
BH22 2018-05-13 60
BH22 2018-05-20 67
BH22 2018-05-27 78
BH22 2018-06-03 55
Converted my df to a list format by
df <- dcast(df, date ~ customer_key,value.var = c("sales"))
df <- subset(df, select = -c(dt))
demandWithKey <- as.list(df)
I am trying to write a function that I can apply across all customers:
my_fun <- function(x) {
fit <- lm(ds_load ~ date, data=df) ## After changing to list ds_load and date column names
## are no longer available for formula
fit_b <- forecast(fit$fitted.values, h=20) ## forecast using lm()
return(data.frame(c(fit$fitted.values, fit_b[["mean"]])))
}
fcast <- lapply(df, my_fun)
I know the above function doesn't work, but basically I'm looking for getting both the fitted values and forecasted values for a grouped data.
But I've tried other methods, such as tslm() (after converting to time-series data), with no luck. I can get lm() to work on just one customer, though. Also, many questions/posts cover only fitting the model, but I would like to forecast at the same time as well.
lm() is for a regression model,
but here you have a time series, so to forecast the series you have to use one of the time-series models (ARMA, ARCH, GARCH, ...),
so you can use the auto.arima() function from the "forecast" package in R.
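A rough sketch of that suggestion (assuming the forecast package is installed; the sales numbers are borrowed from customer A35 in the question, and a real series would need far more observations than this):

```r
library(forecast)
# Weekly sales for one customer, taken from the question
y <- ts(c(31, 20, 43, 31))
fit <- auto.arima(y)   # select an ARIMA model automatically
forecast(fit, h = 20)  # forecast the next 20 periods
```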
I don't know what you're up to exactly, but you could make this less complicated.
Using by avoids the need to reshape your data: it splits your data, e.g. by customer ID as in your case, and applies a function to the subsets (i.e. it's a combination of split and lapply; see ?by).
Since you want to compare fitted and forecasted values somehow in your result, you probably need predict rather than $fitted.values; otherwise the values won't be of the same length. Because your independent variable is a date in weekly intervals, you may use seq.Date with the first date as the starting value; the sequence has length equal to the number of actual values (nrow per customer) plus the h= argument of the forecast.
For demonstration purposes I add the fitted values as first column in the following.
res <- by(dat, dat$cus_key, function(x) {
H <- 20 ## globally define 'h'
fit <- lm(sales ~ date, x)
fitted <- fit$fitted.values
pred <- predict(fit, newdata=data.frame(
date=seq(x$date[1], length.out= nrow(x) + H, by="week")))
fcst <- c(fitted, forecast(fitted, h=H)$mean)
fit.na <- `length<-`(unname(fitted), length(pred)) ## for demonstration
return(cbind(fit.na, pred, fcst))
})
Result
res
# dat$cus_key: A28
# fit.na pred fcst
# 1 41.4 41.4 41.4
# 2 47.4 47.4 47.4
# 3 53.4 53.4 53.4
# 4 59.4 59.4 59.4
# 5 65.4 65.4 65.4
# 6 NA 71.4 71.4
# 7 NA 77.4 77.4
# 8 NA 83.4 83.4
# 9 NA 89.4 89.4
# 10 NA 95.4 95.4
# 11 NA 101.4 101.4
# 12 NA 107.4 107.4
# 13 NA 113.4 113.4
# 14 NA 119.4 119.4
# 15 NA 125.4 125.4
# 16 NA 131.4 131.4
# 17 NA 137.4 137.4
# 18 NA 143.4 143.4
# 19 NA 149.4 149.4
# 20 NA 155.4 155.4
# 21 NA 161.4 161.4
# 22 NA 167.4 167.4
# 23 NA 173.4 173.4
# 24 NA 179.4 179.4
# 25 NA 185.4 185.4
# ----------------------------------------------------------------
# dat$cus_key: B16
# fit.na pred fcst
# 1 49.0 49.0 49.0
# 2 47.7 47.7 47.7
# 3 46.4 46.4 46.4
# 4 45.1 45.1 45.1
# 5 43.8 43.8 43.8
# 6 NA 42.5 42.5
# 7 NA 41.2 41.2
# 8 NA 39.9 39.9
# 9 NA 38.6 38.6
# 10 NA 37.3 37.3
# 11 NA 36.0 36.0
# 12 NA 34.7 34.7
# 13 NA 33.4 33.4
# 14 NA 32.1 32.1
# 15 NA 30.8 30.8
# 16 NA 29.5 29.5
# 17 NA 28.2 28.2
# 18 NA 26.9 26.9
# 19 NA 25.6 25.6
# 20 NA 24.3 24.3
# 21 NA 23.0 23.0
# 22 NA 21.7 21.7
# 23 NA 20.4 20.4
# 24 NA 19.1 19.1
# 25 NA 17.8 17.8
# ----------------------------------------------------------------
# dat$cus_key: C12
# fit.na pred fcst
# 1 56.4 56.4 56.4
# 2 53.2 53.2 53.2
# 3 50.0 50.0 50.0
# 4 46.8 46.8 46.8
# 5 43.6 43.6 43.6
# 6 NA 40.4 40.4
# 7 NA 37.2 37.2
# 8 NA 34.0 34.0
# 9 NA 30.8 30.8
# 10 NA 27.6 27.6
# 11 NA 24.4 24.4
# 12 NA 21.2 21.2
# 13 NA 18.0 18.0
# 14 NA 14.8 14.8
# 15 NA 11.6 11.6
# 16 NA 8.4 8.4
# 17 NA 5.2 5.2
# 18 NA 2.0 2.0
# 19 NA -1.2 -1.2
# 20 NA -4.4 -4.4
# 21 NA -7.6 -7.6
# 22 NA -10.8 -10.8
# 23 NA -14.0 -14.0
# 24 NA -17.2 -17.2
# 25 NA -20.4 -20.4
As you can see, prediction and forecast yield the same values, since both methods are based on the same single explanatory variable date in this case.
Toy data:
set.seed(42)
dat <- transform(expand.grid(cus_key=paste0(LETTERS[1:3], sample(12:43, 3)),
date=seq.Date(as.Date("2018-05-13"), length.out=5, by="week")),
sales=sample(20:80, 15, replace=TRUE))
As an R beginner, I have found it surprisingly difficult to figure out how to compute descriptive statistics on multiply imputed data (more so than running some of the other basic analyses, such as correlations and regressions).
These types of questions are prefaced with apologies (Descriptive statistics (Means, StdDevs) using multiply imputed data: R) but have not been answered (https://stats.stackexchange.com/questions/296193/pooling-basic-descriptives-from-several-multiply-imputed-datasets-using-mice) or quickly attract downvotes.
Here is a description of a miceadds function (https://www.rdocumentation.org/packages/miceadds/versions/2.10-14/topics/stats0), which I find difficult to follow for data stored in the mids format.
I have gotten some output, such as the mean, median, min, and max, using summary(complete(imp)), but would love to know how to get additional summary output (e.g., skew/kurtosis, standard deviation, variance).
Illustration borrowed from a previous poster above:
> imp <- mice(nhanes, seed = 23109)
iter imp variable
1 1 bmi hyp chl
1 2 bmi hyp chl
1 3 bmi hyp chl
1 4 bmi hyp chl
1 5 bmi hyp chl
2 1 bmi hyp chl
2 2 bmi hyp chl
2 3 bmi hyp chl
> summary(complete(imp))
age bmi hyp chl
1:12 Min. :20.40 1:18 Min. :113
2: 7 1st Qu.:24.90 2: 7 1st Qu.:186
3: 6 Median :27.40 Median :199
Mean :27.37 Mean :194
3rd Qu.:30.10 3rd Qu.:218
Max. :35.30 Max. :284
Would someone kindly take the time to illustrate how one might take the mids object to get the basic descriptives?
Below are some steps you can follow to better understand what happens to the R objects along the way. I would also recommend looking at this tutorial:
https://gerkovink.github.io/miceVignettes/
library(mice)
# nhanes object is just a simple dataframe:
data(nhanes)
str(nhanes)
#'data.frame': 25 obs. of 4 variables:
# $ age: num 1 2 1 3 1 3 1 1 2 2 ...
#$ bmi: num NA 22.7 NA NA 20.4 NA 22.5 30.1 22 NA ...
#$ hyp: num NA 1 1 NA 1 NA 1 1 1 NA ...
#$ chl: num NA 187 187 NA 113 184 118 187 238 NA ...
# you can generate multivariate imputation using mice() function
imp <- mice(nhanes, seed=23109)
#The output variable is an object of class "mids" which you can explore using str() function
str(imp)
# List of 17
# $ call : language mice(data = nhanes)
# $ data :'data.frame': 25 obs. of 4 variables:
# ..$ age: num [1:25] 1 2 1 3 1 3 1 1 2 2 ...
# ..$ bmi: num [1:25] NA 22.7 NA NA 20.4 NA 22.5 30.1 22 NA ...
# ..$ hyp: num [1:25] NA 1 1 NA 1 NA 1 1 1 NA ...
# ..$ chl: num [1:25] NA 187 187 NA 113 184 118 187 238 NA ...
# $ m : num 5
# ...
# $ imp :List of 4
#..$ age: NULL
#..$ bmi:'data.frame': 9 obs. of 5 variables:
#.. ..$ 1: num [1:9] 28.7 30.1 22.7 24.9 30.1 35.3 27.5 29.6 33.2
#.. ..$ 2: num [1:9] 27.2 30.1 27.2 25.5 29.6 26.3 26.3 30.1 30.1
#.. ..$ 3: num [1:9] 22.5 30.1 20.4 22.5 27.4 22 26.3 27.4 35.3
#.. ..$ 4: num [1:9] 27.2 22 22.7 21.7 25.5 27.2 24.9 30.1 22
#.. ..$ 5: num [1:9] 28.7 28.7 20.4 21.7 25.5 22.5 22.5 25.5 22.7
#...
#You can extract individual components of this object using $, for example
#To view the actual imputation for bmi column
imp$imp$bmi
# 1 2 3 4 5
# 1 28.7 27.2 22.5 27.2 28.7
# 3 30.1 30.1 30.1 22.0 28.7
# 4 22.7 27.2 20.4 22.7 20.4
# 6 24.9 25.5 22.5 21.7 21.7
# 10 30.1 29.6 27.4 25.5 25.5
# 11 35.3 26.3 22.0 27.2 22.5
# 12 27.5 26.3 26.3 24.9 22.5
# 16 29.6 30.1 27.4 30.1 25.5
# 21 33.2 30.1 35.3 22.0 22.7
# The above output is again just a regular dataframe:
str(imp$imp$bmi)
# 'data.frame': 9 obs. of 5 variables:
# $ 1: num 28.7 30.1 22.7 24.9 30.1 35.3 27.5 29.6 33.2
# $ 2: num 27.2 30.1 27.2 25.5 29.6 26.3 26.3 30.1 30.1
# $ 3: num 22.5 30.1 20.4 22.5 27.4 22 26.3 27.4 35.3
# $ 4: num 27.2 22 22.7 21.7 25.5 27.2 24.9 30.1 22
# $ 5: num 28.7 28.7 20.4 21.7 25.5 22.5 22.5 25.5 22.7
# complete() function returns imputed dataset:
mat <- complete(imp)
# The output of this function is a regular data frame:
str(mat)
# 'data.frame': 25 obs. of 4 variables:
# $ age: num 1 2 1 3 1 3 1 1 2 2 ...
# $ bmi: num 28.7 22.7 30.1 22.7 20.4 24.9 22.5 30.1 22 30.1 ...
# $ hyp: num 1 1 1 2 1 2 1 1 1 1 ...
# $ chl: num 199 187 187 204 113 184 118 187 238 229 ...
# So you can run any descriptive statistics you need with this object
# Just like you would do with a regular dataframe:
> summary(mat)
# age bmi hyp chl
# Min. :1.00 Min. :20.40 Min. :1.00 Min. :113.0
# 1st Qu.:1.00 1st Qu.:24.90 1st Qu.:1.00 1st Qu.:187.0
# Median :2.00 Median :27.50 Median :1.00 Median :204.0
# Mean :1.76 Mean :27.48 Mean :1.24 Mean :204.9
# 3rd Qu.:2.00 3rd Qu.:30.10 3rd Qu.:1.00 3rd Qu.:229.0
# Max. :3.00 Max. :35.30 Max. :2.00 Max. :284.0
There are several mistakes in both your code and the answer from Katia, and the link provided by Katia is no longer available.
To compute simple statistics after multiple imputation, you must follow Rubin's rules, which is the method mice uses when pooling a selected set of model fits.
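For model-based statistics, that pooling is built into mice: fit the model on each imputed dataset with with() and pool the results per Rubin's rules with pool(). A minimal sketch (a regression of bmi on age, chosen here purely for illustration):

```r
library(mice)
imp <- mice(nhanes, seed = 23109, printFlag = FALSE)
fit <- with(imp, lm(bmi ~ age))  # one fit per imputed dataset
pool(fit)                        # estimates pooled per Rubin's rules
```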
When using
library(mice)
imp <- mice(nhanes, seed = 23109)
mat <- complete(imp)
mat
age bmi hyp chl
1 1 28.7 1 199
2 2 22.7 1 187
3 1 30.1 1 187
4 3 22.7 2 204
5 1 20.4 1 113
6 3 24.9 2 184
7 1 22.5 1 118
8 1 30.1 1 187
9 2 22.0 1 238
10 2 30.1 1 229
11 1 35.3 1 187
12 2 27.5 1 229
13 3 21.7 1 206
14 2 28.7 2 204
15 1 29.6 1 238
16 1 29.6 1 238
17 3 27.2 2 284
18 2 26.3 2 199
19 1 35.3 1 218
20 3 25.5 2 206
21 1 33.2 1 238
22 1 33.2 1 229
23 1 27.5 1 131
24 3 24.9 1 284
25 2 27.4 1 186
You only return the first imputed dataset, whereas you imputed five by default. See ?mice::complete for more information: "The default is action = 1L returns the first imputed data set."
To get all five imputed datasets, you have to specify the action argument of mice::complete:
mat2 <- complete(imp, "long")
mat2
.imp .id age bmi hyp chl
1 1 1 1 28.7 1 199
2 1 2 2 22.7 1 187
3 1 3 1 30.1 1 187
4 1 4 3 22.7 2 204
5 1 5 1 20.4 1 113
6 1 6 3 24.9 2 184
7 1 7 1 22.5 1 118
8 1 8 1 30.1 1 187
9 1 9 2 22.0 1 238
10 1 10 2 30.1 1 229
...
115 5 15 1 29.6 1 187
116 5 16 1 25.5 1 187
117 5 17 3 27.2 2 284
118 5 18 2 26.3 2 199
119 5 19 1 35.3 1 218
120 5 20 3 25.5 2 218
121 5 21 1 22.7 1 238
122 5 22 1 33.2 1 229
123 5 23 1 27.5 1 131
124 5 24 3 24.9 1 186
125 5 25 2 27.4 1 186
Both summary(mat) and summary(mat2) are misleading.
Let's focus on bmi. The first provides the mean bmi over the first imputed dataset only. The second provides the mean of an artificial dataset m times larger than the original; that stacked dataset also has inappropriately low variance.
mean(mat$bmi)
27.484
mean(mat2$bmi)
26.5192
I have not found a better solution than manually applying Rubin's rule to the mean estimate. The correct point estimate is simply the mean of the estimates across all imputed datasets:
res <- with(imp, mean(bmi)) #get the mean for each imputed dataset, stored in res$analyses
do.call(sum, res$analyses) / 5 #compute mean over m = 5 mean estimations
26.5192
The variance / standard deviation has to be calculated appropriately as well. You can use Rubin's rules to pool any simple statistic you wish; you can find out how here: https://bookdown.org/mwheymans/bookmi/rubins-rules.html
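Sketched in code (pool_mean is a made-up helper for illustration, not a mice function), Rubin's rules combine the average within-imputation variance Ubar with the between-imputation variance B:

```r
# Rubin's rules for a scalar estimate:
#   pooled estimate Qbar = mean of the m per-imputation estimates
#   total variance  T    = Ubar + (1 + 1/m) * B
pool_mean <- function(estimates, variances) {
  m    <- length(estimates)
  qbar <- mean(estimates)  # pooled point estimate
  ubar <- mean(variances)  # average within-imputation variance
  b    <- var(estimates)   # between-imputation variance
  c(estimate = qbar, se = sqrt(ubar + (1 + 1/m) * b))
}
# made-up per-imputation means of bmi and squared standard errors of those means
est  <- c(26.4, 26.7, 26.3, 26.6, 26.6)
vars <- c(0.90, 1.00, 0.80, 1.10, 0.95)
pool_mean(est, vars)
```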
Hope this helps.
I am trying to compute multiple quantiles for a certain variable:
> res1 <- aggregate(airquality$Wind, list(airquality$Month), function (x) quantile(x, c(0.9, 0.95, 0.975)))
> head(res1)
Group.1 x.90% x.95% x.97.5%
1 5 16.6000 17.5000 18.8250
2 6 14.9000 15.5600 17.3650
3 7 14.3000 14.6000 14.9000
4 8 12.6000 14.0500 14.6000
5 9 14.9600 15.5000 15.8025
The result looks good at first, but aggregate actually returns it in a very strange form: the last 3 columns are not separate columns of the data.frame, but a single matrix!
> names(res1)
[1] "Group.1" "x"
> dim(res1)
[1] 5 2
> class(res1[,2])
[1] "matrix"
This causes a lot of problems in further processing.
A few questions:
1. Why is aggregate() behaving so strangely?
2. Is there any way to persuade it to produce the result I expect?
3. Or am I perhaps using the wrong function for this purpose? Is there another preferred way to get the wanted result?
Of course I could transform the output of aggregate(), but I am looking for a simpler, more straightforward solution.
Q1: Why is the behavior so strange?
This is actually documented behavior at ?aggregate (though it may still be unexpected). The relevant argument to look at is simplify.
If simplify is set to FALSE, aggregate will produce a list instead in a case like this:
res2 <- aggregate(airquality$Wind, list(airquality$Month), function (x)
quantile(x, c(0.9, 0.95, 0.975)), simplify = FALSE)
str(res2)
# 'data.frame': 5 obs. of 2 variables:
# $ Group.1: int 5 6 7 8 9
# $ x :List of 5
# ..$ 1 : Named num 16.6 17.5 18.8
# .. ..- attr(*, "names")= chr "90%" "95%" "97.5%"
# ..$ 32 : Named num 14.9 15.6 17.4
# .. ..- attr(*, "names")= chr "90%" "95%" "97.5%"
# ..$ 62 : Named num 14.3 14.6 14.9
# .. ..- attr(*, "names")= chr "90%" "95%" "97.5%"
# ..$ 93 : Named num 12.6 14.1 14.6
# .. ..- attr(*, "names")= chr "90%" "95%" "97.5%"
# ..$ 124: Named num 15 15.5 15.8
# .. ..- attr(*, "names")= chr "90%" "95%" "97.5%"
Now, both a matrix and a list as columns may seem like strange behavior, but I presume it's more a case of "by design" than a "bug" or a "flaw".
For instance, consider the following: we want to aggregate both the "Wind" and "Temp" columns from the "airquality" dataset, and we know that each aggregation will result in multiple columns (as we would expect with quantile).
res3 <- aggregate(cbind(Wind, Temp) ~ Month, airquality,
function (x) quantile(x, c(0.9, 0.95, 0.975)))
res3
# Month Wind.90% Wind.95% Wind.97.5% Temp.90% Temp.95% Temp.97.5%
# 1 5 16.6000 17.5000 18.8250 74.000 77.500 79.500
# 2 6 14.9000 15.5600 17.3650 87.300 91.100 92.275
# 3 7 14.3000 14.6000 14.9000 89.000 91.500 92.000
# 4 8 12.6000 14.0500 14.6000 94.000 95.000 96.250
# 5 9 14.9600 15.5000 15.8025 91.100 92.550 93.000
In some ways, keeping these values as matrix-columns might make sense: the aggregated data are easily accessible by their original column names:
res3$Temp
# 90% 95% 97.5%
# [1,] 74.0 77.50 79.500
# [2,] 87.3 91.10 92.275
# [3,] 89.0 91.50 92.000
# [4,] 94.0 95.00 96.250
# [5,] 91.1 92.55 93.000
Q2: How do you get the results as separate columns in a data.frame?
But a list as a column is just as awkward to deal with as a matrix as a column in many cases. If you want to "flatten" your matrix into columns, use do.call(data.frame, ...):
do.call(data.frame, res1)
# Group.1 x.90. x.95. x.97.5.
# 1 5 16.60 17.50 18.8250
# 2 6 14.90 15.56 17.3650
# 3 7 14.30 14.60 14.9000
# 4 8 12.60 14.05 14.6000
# 5 9 14.96 15.50 15.8025
str(.Last.value)
# 'data.frame': 5 obs. of 4 variables:
# $ Group.1: int 5 6 7 8 9
# $ x.90. : num 16.6 14.9 14.3 12.6 15
# $ x.95. : num 17.5 15.6 14.6 14.1 15.5
# $ x.97.5.: num 18.8 17.4 14.9 14.6 15.8
Q3: Are there other alternatives?
As with most things R, yes of course. My preferred alternative would be to use the "data.table" package, with which you can do:
library(data.table)
as.data.table(airquality)[, as.list(quantile(Wind, c(.9, .95, .975))),
by = Month]
# Month 90% 95% 97.5%
# 1: 5 16.60 17.50 18.8250
# 2: 6 14.90 15.56 17.3650
# 3: 7 14.30 14.60 14.9000
# 4: 8 12.60 14.05 14.6000
# 5: 9 14.96 15.50 15.8025
str(.Last.value)
# Classes ‘data.table’ and 'data.frame': 5 obs. of 4 variables:
# $ Month: int 5 6 7 8 9
# $ 90% : num 16.6 14.9 14.3 12.6 15
# $ 95% : num 17.5 15.6 14.6 14.1 15.5
# $ 97.5%: num 18.8 17.4 14.9 14.6 15.8
# - attr(*, ".internal.selfref")=<externalptr>