How best to index for max values in a data frame? - r

The dataset in use here is genotype, from the CRAN package MASS.
> names(genotype)
[1] "Litter" "Mother" "Wt"
> str(genotype)
'data.frame': 61 obs. of 3 variables:
$ Litter: Factor w/ 4 levels "A","B","I","J": 1 1 1 1 1 1 1 1 1 1 ...
$ Mother: Factor w/ 4 levels "A","B","I","J": 1 1 1 1 1 2 2 2 3 3 ...
$ Wt : num 61.5 68.2 64 65 59.7 55 42 60.2 52.5 61.8 ...
This was the given question from a tutorial:
Exercise 6.7. Find the heaviest rats born to each mother in the genotype data.
tapply, split by the factor genotype$Mother, gives:
> tapply(genotype$Wt, genotype$Mother, max)
A B I J
68.2 69.8 61.8 61.0
Also:
> out <- tapply(genotype$Wt, genotype[,1:2], max)
> out
Mother
Litter A B I J
A 68.2 60.2 61.8 61.0
B 60.3 64.7 59.0 51.3
I 68.0 69.8 61.3 54.5
J 59.0 59.5 61.4 54.0
The first tapply gives the heaviest rat from each mother, and the second (out) gives a table that lets me identify which Litter of each mother was heaviest. Is there another way to match which Litter has the most weight for each mother, for instance if the two-dimensional table is very large?

We could use data.table. We convert the 'data.frame' to a 'data.table' (setDT(genotype)), create the index using which.max, and subset the rows of the dataset grouped by 'Mother'.
library(data.table) # v1.9.5+
setDT(genotype)[, .SD[which.max(Wt)], by = Mother]
# Mother Litter Wt
#1: A A 68.2
#2: B I 69.8
#3: I A 61.8
#4: J A 61.0
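For comparison, here is the same which.max-per-group idea in base R (a sketch, not part of the original answer):
# Split by Mother, keep the heaviest row in each group, then recombine
do.call(rbind, lapply(split(genotype, genotype$Mother),
                      function(d) d[which.max(d$Wt), ]))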
If we are only interested in the max of 'Wt' by 'Mother':
setDT(genotype)[, list(Wt=max(Wt)), by = Mother]
# Mother Wt
#1: A 68.2
#2: B 69.8
#3: I 61.8
#4: J 61.0
Based on the last tapply code shown by the OP, if we need similar output, we can use dcast from the devel version of 'data.table':
dcast(setDT(genotype), Litter ~ Mother, value.var='Wt', max)
# Litter A B I J
#1: A 68.2 60.2 61.8 61.0
#2: B 60.3 64.7 59.0 51.3
#3: I 68.0 69.8 61.3 54.5
#4: J 59.0 59.5 61.4 54.0
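Incidentally, to answer the follow-up in the question directly: given the 2-D table out from the OP's tapply call, the Litter attaining each Mother's maximum can be read off with apply and which.max (a sketch using the objects defined above):
rownames(out)[apply(out, 2, which.max)]
# [1] "A" "I" "A" "A"   (Litter with the largest Wt for Mothers A, B, I, J)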
data
library(MASS)
data(genotype)

From stats:
aggregate(. ~ Mother, data = genotype, max)
or
aggregate(Wt ~ Mother, data = genotype, max)
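If you also need to know which Litter attains each maximum from the aggregate() route, a common idiom (a sketch, not part of the original answer) is to merge the per-Mother maxima back onto the data:
maxima <- aggregate(Wt ~ Mother, data = genotype, max)
merge(maxima, genotype) # joins on Mother and Wt, so only the max-Wt rows survive (ties yield multiple rows)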

Related

R: Why is merge dropping data? How to interpolate missing values for a merge

I am trying to merge two relatively large datasets. I am merging by SiteID (a unique indicator of location) and date/time, which is recorded as Year, Month=Mo, Day, and Hour=Hr.
The problem is that the merge is dropping data somewhere. Minimum, Maximum, Mean, and Median values all change, when they should be the same data, simply merged. I have made the data into characters and checked that the character strings match, yet I still lose data. I have tried left_join as well, but that doesn't seem to help. See below for more details.
EDIT: Merge is dropping data because data do not exist for every combination of ("SiteID", "Year", "Mo", "Day", "Hr"). So I needed to interpolate missing values in dB before I could merge (see answer below).
END EDIT
See the link at the bottom of the page to reproduce this example.
PC17$Mo<-as.character(PC17$Mo)
PC17$Year<-as.character(PC17$Year)
PC17$Day<-as.character(PC17$Day)
PC17$Hr<-as.character(PC17$Hr)
PC17$SiteID<-as.character(PC17$SiteID)
dB$Mo<-as.character(dB$Mo)
dB$Year<-as.character(dB$Year)
dB$Day<-as.character(dB$Day)
dB$Hr<-as.character(dB$Hr)
dB$SiteID<-as.character(dB$SiteID)
# confirm that data are stored as characters
str(PC17)
str(dB)
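As an aside, those repeated conversions can be written more compactly (a sketch assuming both data frames share these column names):
key_cols <- c("SiteID", "Year", "Mo", "Day", "Hr")
PC17[key_cols] <- lapply(PC17[key_cols], as.character)
dB[key_cols] <- lapply(dB[key_cols], as.character)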
Now, to compare my SiteID values, I use unique to see what character strings I have, and setdiff to see if R recognizes any as missing. One SiteID is missing from each, but this is okay, because it is truly missing in the data (not a character-string issue).
sort(unique(PC17$SiteID))
sort(unique(dB$SiteID))
setdiff(PC17$SiteID, dB$SiteID) ## TR2U is the only one missing, this is ok
setdiff(dB$SiteID, PC17$SiteID) ## FI7D is the only one missing, this is ok
Now when I look at the data (summarised by SiteID), it looks like a nice, full data frame, meaning I have data for every site that I should have.
library(dplyr)
dB %>%
  group_by(SiteID) %>%
  summarise(
    min_dBL50 = min(dbAL050, na.rm = TRUE),
    max_dBL50 = max(dbAL050, na.rm = TRUE),
    mean_dBL50 = mean(dbAL050, na.rm = TRUE),
    med_dBL50 = median(dbAL050, na.rm = TRUE)
  )
# A tibble: 59 x 5
SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
<chr> <dbl> <dbl> <dbl> <dbl>
1 CU1D 35.3 57.3 47.0 47.6
2 CU1M 33.7 66.8 58.6 60.8
3 CU1U 31.4 55.9 43.1 43.3
4 CU2D 40 58.3 45.3 45.2
5 CU2M 32.4 55.8 41.6 41.3
6 CU2U 31.4 58.1 43.9 42.6
7 CU3D 40.6 59.5 48.4 48.5
8 CU3M 35.8 75.5 65.9 69.3
9 CU3U 40.9 59.2 46.6 46.2
10 CU4D 36.6 49.1 43.6 43.4
# ... with 49 more rows
Here, I merge the two datasets PC17 and dB by "SiteID", "Year", "Mo", "Day", "Hr", keeping all PC17 rows even if they have no matching dB values (all.x=TRUE).
However, when I look at the summary of this data, all of the SiteID now have different values, and some sites, such as "CU3D" and "CU4D", are missing completely.
PCdB<-(merge(PC17, dB, by=c("SiteID", "Year","Mo","Day", "Hr"), all.x=TRUE))
PCdB %>%
  group_by(SiteID) %>%
  summarise(
    min_dBL50 = min(dbAL050, na.rm = TRUE),
    max_dBL50 = max(dbAL050, na.rm = TRUE),
    mean_dBL50 = mean(dbAL050, na.rm = TRUE),
    med_dBL50 = median(dbAL050, na.rm = TRUE)
  )
# A tibble: 59 x 5
SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
<chr> <dbl> <dbl> <dbl> <dbl>
1 CU1D 47.2 54 52.3 54
2 CU1M 35.4 63 49.2 49.2
3 CU1U 35.3 35.3 35.3 35.3
4 CU2D 42.3 42.3 42.3 42.3
5 CU2M 43.1 43.2 43.1 43.1
6 CU2U 43.7 43.7 43.7 43.7
7 CU3D Inf -Inf NaN NA
8 CU3M 44.1 71.2 57.6 57.6
9 CU3U 45 45 45 45
10 CU4D Inf -Inf NaN NA
# ... with 49 more rows
I set everything to characters with as.character() in the first lines. Additionally, I have checked Year, Day, Mo, and Hr with setdiff and unique just as I did above with SiteID, and there don't appear to be any issues with those character strings not matching.
I have also tried dplyr function left_join to merge the datasets, and it hasn't made a difference.
Probably solved when using na.rm = TRUE in your summarising functions...
A data.table approach:
library( data.table )
dt.PC17 <- fread( "./PC_SO.csv" )
dt.dB <- fread( "./dB.csv" )
# data.table left join on "SiteID", "Year", "Mo", "Day", "Hr" (keeping dt.PC17's rows), then summarise
dt.PCdB <- dt.dB[ dt.PC17, on = .( SiteID, Year, Mo, Day, Hr ) ]
#summarise, and order by SiteID
result <- setorder( dt.PCdB[, list( min_dBL50 = min( dbAL050, na.rm = TRUE ),
                                    max_dBL50 = max( dbAL050, na.rm = TRUE ),
                                    mean_dBL50 = mean( dbAL050, na.rm = TRUE ),
                                    med_dBL50 = median( dbAL050, na.rm = TRUE ) ),
                              by = "SiteID" ],
                    SiteID )
head( result, 10 )
# SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
# 1: CU1D 47.2 54.0 52.300 54.00
# 2: CU1M 35.4 63.0 49.200 49.20
# 3: CU1U 35.3 35.3 35.300 35.30
# 4: CU2D 42.3 42.3 42.300 42.30
# 5: CU2M 43.1 43.2 43.125 43.10
# 6: CU2U 43.7 43.7 43.700 43.70
# 7: CU3D Inf -Inf NaN NA
# 8: CU3M 44.1 71.2 57.650 57.65
# 9: CU3U 45.0 45.0 45.000 45.00
# 10: CU4D Inf -Inf NaN NA
If you would like to perform a left join but exclude rows that cannot be matched (so you do not get rows like the "CU3D" one above), use:
dt.PCdB <- dt.dB[ dt.PC17, on = .( SiteID, Year, Mo, Day, Hr ), nomatch = 0L ]
This will result in:
# SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
# 1: CU1D 47.2 54.0 52.300 54.00
# 2: CU1M 35.4 63.0 49.200 49.20
# 3: CU1U 35.3 35.3 35.300 35.30
# 4: CU2D 42.3 42.3 42.300 42.30
# 5: CU2M 43.1 43.2 43.125 43.10
# 6: CU2U 43.7 43.7 43.700 43.70
# 7: CU3M 44.1 71.2 57.650 57.65
# 8: CU3U 45.0 45.0 45.000 45.00
# 9: CU4M 52.4 55.9 54.150 54.15
# 10: CU4U 51.3 51.3 51.300 51.30
In the end, I answered this question with a better understanding of the data. The merge function itself was not dropping any values; it was doing exactly what it was told. However, since the datasets were merged by SiteID, Year, Mo, Day, Hr, the result contained Inf, NaN, and NA values for a few SiteID.
The reason for this is that dB is not a fully continuous dataset to merge with: for some SiteID, the data did not overlap in all variables (SiteID, Year, Mo, Day, Hr).
So I solved this problem with interpolation; that is, I filled in the missing values based on values from the dates on either side. The package imputeTS was valuable here.
I first interpolated the missing values between the dates that had data, and then re-merged the datasets.
library(imputeTS)
library(tidyverse)
### We want to interpolate dB values by SiteID in the dB dataset BEFORE merging.
### Why? Because the merge drops all the data that would help with the interpolation!
dB<-read.csv("dB.csv")
dB_clean <- dB %>%
  mutate_if(is.integer, as.character)
# Create a wide table with a column for each jDay; missing days
# show up as NA's. All the NA's in the columns represent
# missing jDays that we should add. jDay is an integer 'julian date'.
dB_NA_find <- dB_clean %>%
  count(SiteID, jDay) %>%
  spread(jDay, n)
dB_NA_find
# A tibble: 59 x 88
# SiteID `13633` `13634` `13635` `13636` `13637` `13638` `13639` `13640`
# <fct> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 CU1D NA NA NA NA NA NA NA NA
# 2 CU1M NA 11 24 24 24 24 24 24
# 3 CU1U NA 11 24 24 24 24 24 24
# 4 CU2D NA NA NA NA NA NA NA NA
# 5 CU2M NA 9 24 24 24 24 24 24
# 6 CU2U NA 9 24 24 24 24 21 NA
# 7 CU3D NA NA NA NA NA NA NA NA
# 8 CU3M NA NA NA NA NA NA NA NA
# 9 CU3U NA NA NA NA NA NA NA NA
# 10 CU4D NA NA NA NA NA NA NA NA
# Take the NA jDay entries and make the desired line for each
dB_rows_to_add <- dB_NA_find %>%
  gather(jDay, count, 2:88) %>%
  filter(is.na(count)) %>%
  select(-count, -`NA`)
# Add these lines to the original, remove the NA jDay rows
# (these have been replaced with real jDay rows), and sort
dB <- dB_clean %>%
  bind_rows(dB_rows_to_add) %>%
  filter(jDay != "NA") %>%
  arrange(SiteID, jDay)
length((dB$DailyL50.x[is.na(dB$DailyL50.x)])) ## How many NAs do I have?
# [1] 3030
## Here is where we do the na.interpolation with package imputeTS
# prime the for loop with a placeholder row of zeros (one per column)
D <- rep("0", 17)
sites<-unique(dB$SiteID)
for (i in 1:length(sites)) {
  temp <- dB[dB$SiteID == sites[i], ]
  temp <- temp[order(temp$jDay), ]
  temp$DayL50 <- na.interpolation(temp$DailyL50.x, option = "spline")
  D <- rbind(D, temp)
}
# delete the first row of zeros from above 'priming'
dBN<-D[-1,]
length((dBN$DayL50[is.na(dBN$DayL50)])) ## How many NAs do I have?
# [1] 0
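For reference, the priming trick (D <- rep("0", 17) plus rbind) can be avoided entirely; here is a loop-free sketch of the same per-site interpolation, assuming the same columns as above:
dBN <- do.call(rbind, lapply(split(dB, dB$SiteID), function(temp) {
  temp <- temp[order(temp$jDay), ]
  temp$DayL50 <- na.interpolation(temp$DailyL50.x, option = "spline")
  temp
}))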
Because I did the above interpolation of NAs based on jDay, I am missing the Month (Mo), Day, and Year information for those rows.
dBN$Year<-"2017" #all data are from 2017
## I could not figure out how jDay was formatted, so I created a manual 'key'
## to get Mo and Day by counting from a known date/jDay pair in the original data
#Example:
# 13635 is Mo=5 Day=1
# 13665 is Mo=5 Day=31
# 13666 is Mo=6 Day=1
# 13695 is Mo=6 Day=30
key4<-data.frame("jDay"=c(13633:13634), "Day"=c(29:30), "Mo"=4)
key5<-data.frame("jDay"=c(13635:13665), "Day"=c(1:31), "Mo"=5)
key6<-data.frame("jDay"=c(13666:13695), "Day"=c(1:30), "Mo"=6)
key7<-data.frame("jDay"=c(13696:13719), "Day"=c(1:24), "Mo"=7)
#make master 'key'
key<-rbind(key4,key5,key6,key7)
# Merge 'key' with dataset so all rows now have 'Mo' and 'Day' values
dBM<-merge(dBN, key, by="jDay", all.x=TRUE)
# clean unnecessary columns and rename 'Mo' and 'Day' so they match the PC17 dataset
dBM<-dBM[ , -c(2,3,6:16)]
colnames(dBM)[5:6]<-c("Day","Mo")
# I noticed an issue with duplication: merging with PC17 created a massive data frame
dBM %>% ### too many observations per day would make the merge duplicate out of control
  count(SiteID, jDay, DayL50) %>%
  summarise(
    min = min(n, na.rm = TRUE),
    mean = mean(n, na.rm = TRUE),
    max = max(n, na.rm = TRUE)
  )
## To fix this I kept only distinct observations, so that each day has 1 observation
dB <- distinct(dBM, .keep_all = TRUE)
### Now run the check above again; there should be 1 observation per day
Now when you do the merge of dB and PC17, the interpolated values (that were NAs before) should be included. It will look something like this:
> PCdB <- (merge(PC17, dB, by = c("SiteID", "Year", "Mo", "Day"), all.x = TRUE, all = FALSE, no.dups = TRUE))
> ### all.x=TRUE is important. This keeps all PC17 data, even rows that don't have corresponding dB data.
> library(dplyr)
#Here is the NA interpolated 'dB' dataset
> dB %>%
+ group_by(SiteID) %>%
+ dplyr::summarise(
+ min_dBL50=min(DayL50, na.rm=TRUE),
+ max_dBL50=max(DayL50, na.rm=TRUE),
+ mean_dBL50=mean(DayL50, na.rm=TRUE),
+ med_dBL50=median(DayL50, na.rm=TRUE)
+ )
# A tibble: 59 x 5
SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
<chr> <dbl> <dbl> <dbl> <dbl>
1 CU1D 44.7 53.1 49.4 50.2
2 CU1M 37.6 65.2 59.5 62.6
3 CU1U 35.5 51 43.7 44.8
4 CU2D 42 52 47.8 49.3
5 CU2M 38.2 49 43.1 42.9
6 CU2U 34.1 53.7 46.5 47
7 CU3D 46.1 53.3 49.7 49.4
8 CU3M 44.5 73.5 61.9 68.2
9 CU3U 42 52.6 47.0 46.8
10 CU4D 42 45.3 44.0 44.6
# ... with 49 more rows
# Now here is the PCdB merged dataset, and we are no longer missing values!
> PCdB %>%
+ group_by(SiteID) %>%
+ dplyr::summarise(
+ min_dBL50=min(DayL50, na.rm=TRUE),
+ max_dBL50=max(DayL50, na.rm=TRUE),
+ mean_dBL50=mean(DayL50, na.rm=TRUE),
+ med_dBL50=median(DayL50, na.rm=TRUE)
+ )
# A tibble: 60 x 5
SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
<chr> <dbl> <dbl> <dbl> <dbl>
1 CU1D 44.8 50 46.8 47
2 CU1M 59 63.9 62.3 62.9
3 CU1U 37.9 46 43.6 44.4
4 CU2D 42.1 51.6 45.6 44.3
5 CU2M 38.4 48.3 44.2 45.5
6 CU2U 39.8 50.7 45.7 46.4
7 CU3D 46.5 49.5 47.7 47.7
8 CU3M 67.7 71.2 69.5 69.4
9 CU3U 43.3 52.6 48.1 48.2
10 CU4D 43.2 45.3 44.4 44.9
# ... with 50 more rows

Combine rows based on ranges in a column

I have a pretty large dataset with a column for time in seconds, and I want to combine rows whose times are close (0.1-0.2 seconds apart) into one row, taking the mean.
Here is an example of how the data looks:
BPM seconds
63.9 61.899
63.9 61.902
63.8 61.910
62.1 130.94
62.1 130.95
61.8 211.59
63.8 280.5
60.3 290.4
So I would want to combine the first 3 rows, then the 2 rows after that, and the rest would stand alone. That is, I would want the data to look like this:
BPM seconds
63.9 61.904
62.1 130.95
61.8 211.59
63.8 280.5
60.3 290.4
We need to create groups; this is the important bit, the rest is standard aggregation:
cumsum(!c(0, diff(df1$seconds)) < 0.2)
# [1] 0 0 0 1 1 2 3 4
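To unpack that expression (note that !x < 0.2 parses as !(x < 0.2), since ! has lower precedence than <):
gaps <- c(0, diff(df1$seconds)) # gap to the previous row; 0 for the first row
new_grp <- !(gaps < 0.2)        # TRUE where a gap of at least 0.2 s starts a new group
cumsum(new_grp)                 # running group id: 0 0 0 1 1 2 3 4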
Then aggregate using aggregate:
aggregate(df1[, 2], list(cumsum(!c(0, diff(df1$seconds)) < 0.2)), mean)
# Group.1 x
# 1 0 61.90367
# 2 1 130.94500
# 3 2 211.59000
# 4 3 280.50000
# 5 4 290.40000
Or use dplyr:
library(dplyr)
df1 %>%
  group_by(myGroup = cumsum(!c(0, diff(seconds)) < 0.2)) %>%
  summarise(BPM = first(BPM),
            seconds = mean(seconds))
# # A tibble: 5 x 3
# myGroup BPM seconds
# <int> <dbl> <dbl>
# 1 0 63.9 61.9
# 2 1 62.1 131.
# 3 2 61.8 212.
# 4 3 63.8 280.
# 5 4 60.3 290.
Reproducible example data:
df1 <- read.table(text = "BPM seconds
63.9 61.899
63.9 61.902
63.8 61.910
62.1 130.94
62.1 130.95
61.8 211.59
63.8 280.5
60.3 290.4", header = TRUE)

Descriptive stats for MI data in R: Take 3

As an R beginner, I have found it surprisingly difficult to figure out how to compute descriptive statistics on multiply imputed data (more so than running some of the other basic analyses, such as correlations and regressions).
These types of questions are prefaced with apologies (Descriptive statistics (Means, StdDevs) using multiply imputed data: R) but have not been answered (https://stats.stackexchange.com/questions/296193/pooling-basic-descriptives-from-several-multiply-imputed-datasets-using-mice) or are quickly downvoted.
Here is a description of a miceadds function (https://www.rdocumentation.org/packages/miceadds/versions/2.10-14/topics/stats0), which I find difficult to follow for data stored in the mids format.
I have gotten some output, such as mean, median, min, and max, using summary(complete(imp)), but would love to know how to get additional summary output (e.g., skew/kurtosis, standard deviation, variance).
Illustration borrowed from a previous poster above:
> imp <- mice(nhanes, seed = 23109)
iter imp variable
1 1 bmi hyp chl
1 2 bmi hyp chl
1 3 bmi hyp chl
1 4 bmi hyp chl
1 5 bmi hyp chl
2 1 bmi hyp chl
2 2 bmi hyp chl
2 3 bmi hyp chl
> summary(complete(imp))
age bmi hyp chl
1:12 Min. :20.40 1:18 Min. :113
2: 7 1st Qu.:24.90 2: 7 1st Qu.:186
3: 6 Median :27.40 Median :199
Mean :27.37 Mean :194
3rd Qu.:30.10 3rd Qu.:218
Max. :35.30 Max. :284
Would someone kindly take the time to illustrate how one might take the mids object to get the basic descriptives?
Below are some steps to help you better understand what happens to the R objects at each stage. I would also recommend looking at this tutorial:
https://gerkovink.github.io/miceVignettes/
library(mice)
# The nhanes object is just a simple data frame:
data(nhanes)
str(nhanes)
# 'data.frame': 25 obs. of 4 variables:
# $ age: num 1 2 1 3 1 3 1 1 2 2 ...
# $ bmi: num NA 22.7 NA NA 20.4 NA 22.5 30.1 22 NA ...
# $ hyp: num NA 1 1 NA 1 NA 1 1 1 NA ...
# $ chl: num NA 187 187 NA 113 184 118 187 238 NA ...
# You can generate multivariate imputations using the mice() function
imp <- mice(nhanes, seed = 23109)
# The output is an object of class "mids", which you can explore using the str() function
str(imp)
# List of 17
# $ call : language mice(data = nhanes)
# $ data :'data.frame': 25 obs. of 4 variables:
# ..$ age: num [1:25] 1 2 1 3 1 3 1 1 2 2 ...
# ..$ bmi: num [1:25] NA 22.7 NA NA 20.4 NA 22.5 30.1 22 NA ...
# ..$ hyp: num [1:25] NA 1 1 NA 1 NA 1 1 1 NA ...
# ..$ chl: num [1:25] NA 187 187 NA 113 184 118 187 238 NA ...
# $ m : num 5
# ...
# $ imp :List of 4
#..$ age: NULL
#..$ bmi:'data.frame': 9 obs. of 5 variables:
#.. ..$ 1: num [1:9] 28.7 30.1 22.7 24.9 30.1 35.3 27.5 29.6 33.2
#.. ..$ 2: num [1:9] 27.2 30.1 27.2 25.5 29.6 26.3 26.3 30.1 30.1
#.. ..$ 3: num [1:9] 22.5 30.1 20.4 22.5 27.4 22 26.3 27.4 35.3
#.. ..$ 4: num [1:9] 27.2 22 22.7 21.7 25.5 27.2 24.9 30.1 22
#.. ..$ 5: num [1:9] 28.7 28.7 20.4 21.7 25.5 22.5 22.5 25.5 22.7
#...
# You can extract individual components of this object using $; for example,
# to view the actual imputations for the bmi column:
imp$imp$bmi
# 1 2 3 4 5
# 1 28.7 27.2 22.5 27.2 28.7
# 3 30.1 30.1 30.1 22.0 28.7
# 4 22.7 27.2 20.4 22.7 20.4
# 6 24.9 25.5 22.5 21.7 21.7
# 10 30.1 29.6 27.4 25.5 25.5
# 11 35.3 26.3 22.0 27.2 22.5
# 12 27.5 26.3 26.3 24.9 22.5
# 16 29.6 30.1 27.4 30.1 25.5
# 21 33.2 30.1 35.3 22.0 22.7
# The above output is again just a regular dataframe:
str(imp$imp$bmi)
# 'data.frame': 9 obs. of 5 variables:
# $ 1: num 28.7 30.1 22.7 24.9 30.1 35.3 27.5 29.6 33.2
# $ 2: num 27.2 30.1 27.2 25.5 29.6 26.3 26.3 30.1 30.1
# $ 3: num 22.5 30.1 20.4 22.5 27.4 22 26.3 27.4 35.3
# $ 4: num 27.2 22 22.7 21.7 25.5 27.2 24.9 30.1 22
# $ 5: num 28.7 28.7 20.4 21.7 25.5 22.5 22.5 25.5 22.7
# The complete() function returns an imputed dataset:
mat <- complete(imp)
# The output of this function is a regular data frame:
str(mat)
# 'data.frame': 25 obs. of 4 variables:
# $ age: num 1 2 1 3 1 3 1 1 2 2 ...
# $ bmi: num 28.7 22.7 30.1 22.7 20.4 24.9 22.5 30.1 22 30.1 ...
# $ hyp: num 1 1 1 2 1 2 1 1 1 1 ...
# $ chl: num 199 187 187 204 113 184 118 187 238 229 ...
# So you can run any descriptive statistics you need with this object
# Just like you would do with a regular dataframe:
> summary(mat)
# age bmi hyp chl
# Min. :1.00 Min. :20.40 Min. :1.00 Min. :113.0
# 1st Qu.:1.00 1st Qu.:24.90 1st Qu.:1.00 1st Qu.:187.0
# Median :2.00 Median :27.50 Median :1.00 Median :204.0
# Mean :1.76 Mean :27.48 Mean :1.24 Mean :204.9
# 3rd Qu.:2.00 3rd Qu.:30.10 3rd Qu.:1.00 3rd Qu.:229.0
# Max. :3.00 Max. :35.30 Max. :2.00 Max. :284.0
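Since mat is a regular data frame, the extra statistics the question asks about (standard deviation, variance, skew, kurtosis) can be computed with any of the usual tools. A sketch (psych::describe assumes that package is installed), with the caveat from the next answer that this describes a single completed dataset only:
sapply(mat, sd)  # standard deviation of each column
sapply(mat, var) # variance of each column
# psych::describe(mat) # also reports skew and kurtosis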
There are several mistakes in both your code and the answer from Katia, and the link provided by Katia is no longer available.
To compute simple statistics after multiple imputation, you must follow Rubin's rules, which is the method mice applies for a selected set of supported model fits.
When using
library(mice)
imp <- mice(nhanes, seed = 23109)
mat <- complete(imp)
mat
age bmi hyp chl
1 1 28.7 1 199
2 2 22.7 1 187
3 1 30.1 1 187
4 3 22.7 2 204
5 1 20.4 1 113
6 3 24.9 2 184
7 1 22.5 1 118
8 1 30.1 1 187
9 2 22.0 1 238
10 2 30.1 1 229
11 1 35.3 1 187
12 2 27.5 1 229
13 3 21.7 1 206
14 2 28.7 2 204
15 1 29.6 1 238
16 1 29.6 1 238
17 3 27.2 2 284
18 2 26.3 2 199
19 1 35.3 1 218
20 3 25.5 2 206
21 1 33.2 1 238
22 1 33.2 1 229
23 1 27.5 1 131
24 3 24.9 1 284
25 2 27.4 1 186
You only return the first imputed dataset, whereas you imputed five by default. See ?mice::complete for more information: "The default is action = 1L returns the first imputed data set."
To get the five imputed datasets, you have to specify the action argument of mice::complete:
mat2 <- complete(imp, "long")
mat2
.imp .id age bmi hyp chl
1 1 1 1 28.7 1 199
2 1 2 2 22.7 1 187
3 1 3 1 30.1 1 187
4 1 4 3 22.7 2 204
5 1 5 1 20.4 1 113
6 1 6 3 24.9 2 184
7 1 7 1 22.5 1 118
8 1 8 1 30.1 1 187
9 1 9 2 22.0 1 238
10 1 10 2 30.1 1 229
...
115 5 15 1 29.6 1 187
116 5 16 1 25.5 1 187
117 5 17 3 27.2 2 284
118 5 18 2 26.3 2 199
119 5 19 1 35.3 1 218
120 5 20 3 25.5 2 218
121 5 21 1 22.7 1 238
122 5 22 1 33.2 1 229
123 5 23 1 27.5 1 131
124 5 24 3 24.9 1 186
125 5 25 2 27.4 1 186
Both summary(mat) and summary(mat2) are wrong.
Let's focus on bmi. The first provides the mean bmi over the first imputed dataset only. The second provides the mean of an artificial m-times-larger dataset, which also has inappropriately low variance.
mean(mat$bmi)
27.484
mean(mat2$bmi)
26.5192
I have not found a better solution than manually applying Rubin's rules to the mean estimate. The correct estimate is simply the mean of the estimates across all imputed datasets:
res <- with(imp, mean(bmi)) #get the mean for each imputed dataset, stored in res$analyses
do.call(sum, res$analyses) / 5 #compute mean over m = 5 mean estimations
26.5192
The variance / standard deviation also has to be pooled appropriately, and Rubin's rules let you compute any simple statistic you wish (a sketch follows). You can find the details here: https://bookdown.org/mwheymans/bookmi/rubins-rules.html
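A sketch of those rules for the mean of bmi: with m imputations, the total variance is T = W + (1 + 1/m) * B, where W is the average within-imputation variance and B is the between-imputation variance of the estimates (the variable names below are ours, not from mice):
m <- imp$m                                    # number of imputations (5)
ests <- unlist(with(imp, mean(bmi))$analyses) # per-dataset means
wvar <- unlist(with(imp, var(bmi) / length(bmi))$analyses) # within-imputation variance of each mean
qbar <- mean(ests)                            # pooled point estimate (26.5192)
W <- mean(wvar)                               # average within-imputation variance
B <- var(ests)                                # between-imputation variance
Tvar <- W + (1 + 1/m) * B                     # total variance by Rubin's rules
c(estimate = qbar, se = sqrt(Tvar))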
Hope this helps.

Using variations of `apply` in R

Often in research we have to produce a summary table. I would like to create such a table using tapply in R. The only problem is that I have 40 variables, and I would like to perform the same operation on all 40. Here is an example of the data:
Age Wt Ht Type
79 134 66 C
67 199 64 C
39 135 78 T
92 149 61 C
33 138 75 T
68 139 71 C
95 198 62 T
65 132 65 T
56 138 81 C
71 193 78 T
Essentially, I would like it to produce the means of all the variables given the Type. It should look like this:
C T
Age 72.4 60.6
Wt 151.8 159.2
Ht 68.6 71.6
I tried using
sapply(df, tapply(df, df$Type, mean))
but got an error.
Any guidance would be appreciated.
Try:
> sapply(df[1:3], tapply, df$Type, mean)
Age Wt Ht
C 72.4 151.8 68.6
T 60.6 159.2 71.6
Note that this is transposed relative to the layout in the question; wrap it in t() to flip it. Alternatively, you can use colMeans, which gives that orientation directly:
> sapply(split(df[1:3], df$Type), colMeans)
C T
Age 72.4 60.6
Wt 151.8 159.2
Ht 68.6 71.6
You could use aggregate:
res <- aggregate(DF[,names(DF) != 'Type'],list(DF$Type),mean)
> res
Group.1 Age Wt Ht
1 C 72.4 151.8 68.6
2 T 60.6 159.2 71.6
then transpose it:
m <- t(res[-1]) # convert the data.frame (excluding the first col) into a matrix and transpose it
colnames(m) <- res[[1]] # set colnames of the matrix from the data.frame's first col
> m
C T
Age 72.4 60.6
Wt 151.8 159.2
Ht 68.6 71.6
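Reproducible example data (assembled from the table in the question; the second answer refers to the same data as DF):
df <- read.table(text = "Age Wt Ht Type
79 134 66 C
67 199 64 C
39 135 78 T
92 149 61 C
33 138 75 T
68 139 71 C
95 198 62 T
65 132 65 T
56 138 81 C
71 193 78 T", header = TRUE)
DF <- df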

Reshaping a data frame with more than one measure variable

I'm using a data frame similar to this one:
df <- data.frame(student = c(rep(1, 5), rep(2, 5)), month = c(1:5, 1:5),
                 quiz1p1 = seq(20, 20.9, 0.1), quiz1p2 = seq(30, 30.9, 0.1),
                 quiz2p1 = seq(80, 80.9, 0.1), quiz2p2 = seq(90, 90.9, 0.1))
print(df)
student month quiz1p1 quiz1p2 quiz2p1 quiz2p2
1 1 1 20.0 30.0 80.0 90.0
2 1 2 20.1 30.1 80.1 90.1
3 1 3 20.2 30.2 80.2 90.2
4 1 4 20.3 30.3 80.3 90.3
5 1 5 20.4 30.4 80.4 90.4
6 2 1 20.5 30.5 80.5 90.5
7 2 2 20.6 30.6 80.6 90.6
8 2 3 20.7 30.7 80.7 90.7
9 2 4 20.8 30.8 80.8 90.8
10 2 5 20.9 30.9 80.9 90.9
It describes grades received by students during five months, in two quizzes, each divided into two parts.
I need to get the two quizzes into separate rows, so that each student in each month will have two rows, one per quiz, and two columns, one for each part of the quiz.
When I melt the table:
dfL <- melt.data.frame(df, c("student", "month"))
I get the two parts of the quiz on separate lines too, and
dcast(dfL, student + month ~ variable)
of course gets me right back where I started. I can't find a way to cast the table back into the required form.
Is there a way to make the melt command do something like:
melt.data.frame(df, measure.var1=c("quiz1p1","quiz2p1"),
measure.var2=c("quiz1p2","quiz2p2"))
Here's how you could do this with reshape(), from base R:
df2 <- reshape(df, direction="long",
idvar = 1:2, varying = list(c(3,5), c(4,6)),
v.names = c("p1", "p2"), times = c("quiz1", "quiz2"))
## Checking the output
rbind(head(df2, 3), tail(df2, 3))
# student month time p1 p2
# 1.1.quiz1 1 1 quiz1 20.0 30.0
# 1.2.quiz1 1 2 quiz1 20.1 30.1
# 1.3.quiz1 1 3 quiz1 20.2 30.2
# 2.3.quiz2 2 3 quiz2 80.7 90.7
# 2.4.quiz2 2 4 quiz2 80.8 90.8
# 2.5.quiz2 2 5 quiz2 80.9 90.9
You can also use column names (instead of column numbers) for idvar and varying. It's more verbose, but seems like better practice to me:
## The same operation as above, using just column *names*
df2 <- reshape(df, direction="long", idvar=c("student", "month"),
varying = list(c("quiz1p1", "quiz2p1"),
c("quiz1p2", "quiz2p2")),
v.names = c("p1", "p2"), times = c("quiz1", "quiz2"))
I think this does what you want:
#Break variable into two columns, one for the quiz and one for the part of the quiz
dfL <- transform(dfL, quiz = substr(variable, 1,5),
part = substr(variable, 6,7))
#Adjust your dcast call:
dcast(dfL, student + month + quiz ~ part)
#-----
student month quiz p1 p2
1 1 1 quiz1 20.0 30.0
2 1 1 quiz2 80.0 90.0
3 1 2 quiz1 20.1 30.1
...
18 2 4 quiz2 80.8 90.8
19 2 5 quiz1 20.9 30.9
20 2 5 quiz2 80.9 90.9
There was a very similar question asked about half a year ago, for which I wrote the following function:
melt.wide = function(data, id.vars, new.names) {
  require(reshape2)
  require(stringr)
  data.melt = melt(data, id.vars = id.vars)
  new.vars = data.frame(do.call(
    rbind, str_extract_all(data.melt$variable, "[0-9]+")))
  names(new.vars) = new.names
  cbind(data.melt, new.vars)
}
You can use the function to "melt" your data as follows:
dfL <-melt.wide(df, id.vars=1:2, new.names=c("Quiz", "Part"))
head(dfL)
# student month variable value Quiz Part
# 1 1 1 quiz1p1 20.0 1 1
# 2 1 2 quiz1p1 20.1 1 1
# 3 1 3 quiz1p1 20.2 1 1
# 4 1 4 quiz1p1 20.3 1 1
# 5 1 5 quiz1p1 20.4 1 1
# 6 2 1 quiz1p1 20.5 1 1
tail(dfL)
# student month variable value Quiz Part
# 35 1 5 quiz2p2 90.4 2 2
# 36 2 1 quiz2p2 90.5 2 2
# 37 2 2 quiz2p2 90.6 2 2
# 38 2 3 quiz2p2 90.7 2 2
# 39 2 4 quiz2p2 90.8 2 2
# 40 2 5 quiz2p2 90.9 2 2
Once the data are in this form, you can much more easily use dcast() to get whatever form you desire. For example
head(dcast(dfL, student + month + Quiz ~ Part))
# student month Quiz 1 2
# 1 1 1 1 20.0 30.0
# 2 1 1 2 80.0 90.0
# 3 1 2 1 20.1 30.1
# 4 1 2 2 80.1 90.1
# 5 1 3 1 20.2 30.2
# 6 1 3 2 80.2 90.2
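For reference, the same reshape is also a one-liner with tidyr (>= 1.0) and pivot_longer; this is a modern alternative, not part of the original answers:
library(tidyr)
# ".value" makes the p1/p2 part of each name its own column,
# while quiz1/quiz2 become values of a new 'quiz' column.
pivot_longer(df, cols = starts_with("quiz"),
             names_to = c("quiz", ".value"),
             names_pattern = "(quiz\\d)(p\\d)")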
