Simplifying the process of creating a summary table in R

I am pretty sure I am complicating things. I have a data frame with p variables (here: v1 to v3) and two factor variables (here: sex and unemp):
> head(df)
  sex unemp v1 v2 v3
1   0     0  2  4  4
2   0     0  2  1  1
3   1     0  3  3  5
4   1     1  2  3  5
5   0     0  1  2  5
6   1     0  3  5  4
I would now like to transform my data (i.e. compute median and mean and then rearrange the summary table) in such a way that the resulting data frame looks like this (here for men; the same applies for women):
> df.res.men
   median.unemp.1 median.unemp.0 mean.unemp.1 mean.unemp.0
v1            2.0            2.0     2.666667     2.391304
v2            2.0            3.5     2.500000     3.369565
v3            4.5            3.0     4.166667     2.956522
Here is the full code:
library(plyr)
## generate data
set.seed(1)
df <- data.frame(sex=rbinom(100, 1, 0.5),
                 unemp=rbinom(100, 1, 0.2),
                 v1=sample(1:5, 100, replace=TRUE),
                 v2=sample(1:5, 100, replace=TRUE),
                 v3=sample(1:5, 100, replace=TRUE))
head(df)
## compute mean and median for all variables by sex and unemp
df.mean <- ddply(df, .(unemp, sex), .fun=colMeans, na.rm=TRUE)
df.mean
df.median <- ddply(df, .(unemp, sex), .fun=function(x) apply(x, 2, median, na.rm=TRUE))
df.median
## rearrange summary table
df.res.men <- cbind(t(subset(df.median, sex==0 & unemp==1)),
                    t(subset(df.median, sex==0 & unemp==0)),
                    t(subset(df.mean, sex==0 & unemp==1)),
                    t(subset(df.mean, sex==0 & unemp==0)))
df.res.men <- df.res.men[-c(1:2), ]
colnames(df.res.men) <- c("median.unemp.1", "median.unemp.0",
                          "mean.unemp.1", "mean.unemp.0")
df.res.men

Here is one approach using plyr and reshape2:
library(plyr); library(reshape2)
dfm <- melt(df, id = c('sex', 'unemp'))
df2 <- ddply(dfm, .(variable, unemp, sex), summarize,
             avg = mean(value), med = median(value))
df2m <- melt(df2, id = 1:3, variable.name = 'sum_fun')
df_0 <- dcast(df2m, sex + variable ~ sum_fun + unemp, subset = .(sex == 0))
df_0
#   sex variable    avg_0  avg_1 med_0 med_1
# 1   0       v1 2.794872 3.0000     3   3.5
# 2   0       v2 3.102564 2.8750     3   3.0
# 3   0       v3 3.205128 3.1875     3   4.0
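To get the corresponding table for the other sex, only the subset changes (a minor variation on the call above):
df_1 <- dcast(df2m, sex + variable ~ sum_fun + unemp, subset = .(sex == 1))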

Here's a two-line solution using reshape alone. The default column names need a bit of work, but the syntax of the melt() and cast() statements is nicely expressive.
(One important note: unlike reshape, reshape2 cannot take a vector of summary functions as its fun.aggregate argument, as I've done below with c(mean, median). Thanks to Ramnath for pointing that out.)
library(reshape)
dmelt <- melt(df, id=c('sex', 'unemp'))
# Results for sex 0
cast(dmelt, variable ~ unemp, c(mean, median), subset = sex==0)
# variable 0_mean 0_median 1_mean 1_median
# 1 v1 2.391304 2.0 2.666667 2.0
# 2 v2 3.369565 3.5 2.500000 2.0
# 3 v3 2.956522 3.0 4.166667 4.5
# Results for sex 1
cast(dmelt, variable ~ unemp, c(mean, median), subset = sex==1)
# variable 0_mean 0_median 1_mean 1_median
# 1 v1 3.027778 3 2.416667 2.0
# 2 v2 2.638889 2 2.750000 3.0
# 3 v3 3.027778 3 2.583333 2.5
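If you are on reshape2 rather than reshape, one workaround for that single-function limitation (a sketch, assuming a session with plyr and reshape2 attached; dmelt2 is just a fresh name to avoid clobbering the object above) is to cast each summary separately and merge:
library(plyr); library(reshape2)
dmelt2 <- melt(df, id = c('sex', 'unemp'))
means <- dcast(dmelt2, variable ~ unemp, mean, subset = .(sex == 0))
medians <- dcast(dmelt2, variable ~ unemp, median, subset = .(sex == 0))
merge(means, medians, by = "variable", suffixes = c("_mean", "_median"))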

Solution without reshaping the data:
f <- function(x) rbind(each(mean, median)(na.omit(x)))
#
# This should work, but it doesn't quite:
# it almost works, except that the output is not labelled with the function names.
#
df.res <- ddply(df, .(unemp, sex), .fun=numcolwise(f))
#
# A workaround:
#
df.res <- dlply(df, .(unemp, sex), .fun=numcolwise(f))
df.res <- cbind(attr(df.res, "split_labels"), do.call(rbind, df.res))
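For what it's worth, current dplyr (1.0 or later) handles the labelling that numcolwise() fumbles here via across(); a sketch, assuming the same df:
library(dplyr)
df %>%
  group_by(unemp, sex) %>%
  summarise(across(v1:v3, list(mean = mean, median = median),
                   .names = "{.fn}.{.col}"),  # e.g. mean.v1, median.v1, ...
            .groups = "drop")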


How to convert an output of "by" function to a data frame in R?

Here is what I want to do:
I have a dataframe df defined as:
col1 <- c("a","a","a","a","a","a","b","b","b","b","b","b")
col2 <- c("z","z","x","x","z","x", "z","z","x","x","z","x")
col3 <- c(1,2,3,4,5,6,7,8,9,10,11,12)
df <- data.frame(col1,col2,col3)
and a function pred that calculates the mean, defined as:
pred <- function(subset_df){return(mean(subset_df$col3))}
I want to obtain, via the by function, a data frame in the format below:
col1 col2 col3_mean
a x 4.33
a z 2.66
b x 10.33
b z 8.66
I am currently using the by() function to partition the data into its strata and apply the pred() function, which calculates a mean:
by_keys <- c("col1","col2")
data_sub <- by(df, df[,by_keys], pred)
data_sub <- do.call(rbind, data_sub)
I am getting the error: "Error in do.call(rbind, data_sub) : second argument must be a list".
I tried a solution from a similar thread, but I don't get col1 and col2 in the desired format:
as.data.frame(vapply(data_sub, unlist, unlist(data_sub[[1]])))
Would appreciate any help on this.
Indeed, by as you have set it up will not return a list but a simplified structure, because pred returns numeric vectors. Adjust your pred function to return data frames; those are not simplified, which forces by to return a list that can then be passed into do.call.
pred <- function(subset_df){
  df <- data.frame(col1 = subset_df$col1[[1]],
                   col2 = subset_df$col2[[1]],
                   col3_mean = mean(subset_df$col3))
  return(df)
}
data_sub_list <- by(df, df[,by_keys], pred)
data_sub <- do.call(rbind, data_sub_list)
data_sub
# col1 col2 col3_mean
# 1 a x 4.333333
# 2 b x 10.333333
# 3 a z 2.666667
# 4 b z 8.666667
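The rows come back in by's split order (the first by-key varying fastest). To match the row order of your desired output, sort afterwards:
data_sub <- data_sub[order(data_sub$col1, data_sub$col2), ]
data_sub
#   col1 col2 col3_mean
# 1    a    x  4.333333
# 3    a    z  2.666667
# 2    b    x 10.333333
# 4    b    z  8.666667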
However, as commented by @Onyambu, this type of grouped aggregation can be done with aggregate, which returns a data frame directly.
# FORMULA VERSION
aggregate(col3 ~ col1 + col2, df, mean)
#   col1 col2      col3
# 1    a    x  4.333333
# 2    b    x 10.333333
# 3    a    z  2.666667
# 4    b    z  8.666667
# NON-FORMULA VERSION
aggregate(df$col3, by=list(col1=df$col1, col2=df$col2), mean)
#   col1 col2         x
# 1    a    x  4.333333
# 2    b    x 10.333333
# 3    a    z  2.666667
# 4    b    z  8.666667
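If you also want the col3_mean column name from your desired output, renaming the aggregate result afterwards is enough (one simple option):
res <- aggregate(col3 ~ col1 + col2, df, mean)
names(res)[3] <- "col3_mean"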
Usually, by (being the object-oriented wrapper to tapply) is best for larger, more extensive data frame operations where you need to run each subset through a function iteratively. In fact, when you need multiple aggregates, by becomes genuinely useful:
pred <- function(subset_df){
  df <- data.frame(col1 = subset_df$col1[[1]],
                   col2 = subset_df$col2[[1]],
                   col3_mean = mean(subset_df$col3),
                   col3_sd = sd(subset_df$col3),
                   col3_median = median(subset_df$col3),
                   col3_min = min(subset_df$col3),
                   col3_max = max(subset_df$col3),
                   col3_sum = sum(subset_df$col3),
                   col3_25pct = quantile(subset_df$col3)[[2]],
                   col3_75pct = quantile(subset_df$col3)[[4]],
                   col3_IQR = IQR(subset_df$col3))
  return(df)
}
data_sub_list <- by(df, df[,by_keys], pred)
data_sub <- do.call(rbind, data_sub_list)
# col1 col2 col3_mean col3_sd col3_median col3_min col3_max col3_sum col3_25pct col3_75pct col3_IQR
# 1 a x 4.333333 1.527525 4 3 6 13 3.5 5.0 1.5
# 2 b x 10.333333 1.527525 10 9 12 31 9.5 11.0 1.5
# 3 a z 2.666667 2.081666 2 1 5 8 1.5 3.5 2.0
# 4 b z 8.666667 2.081666 8 7 11 26 7.5 9.5 2.0
Use dplyr:
library(dplyr)
df %>%
  group_by(col1, col2) %>%
  summarize(col3_mean = mean(col3)) %>%
  as.data.frame
col1 col2 col3_mean
1 a x 4.333
2 a z 2.667
3 b x 10.333
4 b z 8.667

Factor levels reaching certain values

I need to find out how many factor levels have values of a continuous variable at or above each of several cut points.
The code below produces the desired result for the example data, but it is a rather awkward workaround.
My real data frame is much larger, and the real plot should show more values (or be continuous) on the x-axis. I would appreciate a more general solution.
set.seed(5)
df <- data.frame(ID = factor(c("a","a","b","c","d","e","e")), values = runif(7,0,6))
seq <- 1:5
length.unique <- function(x) length(unique(x))
sub1 <- df[which(df$values >= 1), ]
sub2 <- df[which(df$values >= 2), ]
sub3 <- df[which(df$values >= 3), ]
sub4 <- df[which(df$values >= 4), ]
sub5 <- df[which(df$values >= 5), ]
N_IDs <- c(length.unique(sub1$ID), length.unique(sub2$ID), length.unique(sub3$ID),
           length.unique(sub4$ID), length.unique(sub5$ID))
plot(N_IDs ~ seq, type="b")
Using the tidyverse, you can save some time by first calculating the maximum value for each ID:
library(tidyverse)
idmax <- df %>% group_by(ID) %>% summarize(max=max(values)) %>% pull(max)
Then, for each cut point, return the count of IDs that pass:
map_df(1:5, ~ data.frame(cut = ., count = sum(idmax >= .)))
# cut count
# 1 1 4
# 2 2 3
# 3 3 3
# 4 4 3
# 5 5 1
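The purrr step is convenience only; a plain base-R sketch over the same idmax vector returns identical counts:
sapply(1:5, function(ct) sum(idmax >= ct))
# [1] 4 3 3 3 1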
Using a non-equi join with data.table:
library(data.table)
setDT(df)
df[.(seq = 1:5), on = .(values >= seq), allow.cartesian = TRUE,
   .(N_IDs = uniqueN(ID)), by = .EACHI]
# values N_IDs
#1: 1 4
#2: 2 3
#3: 3 3
#4: 4 3
#5: 5 1
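If the original plot is still the goal, the joined table feeds straight into it (assuming the column names printed above):
res <- df[.(seq = 1:5), on = .(values >= seq), allow.cartesian = TRUE,
          .(N_IDs = uniqueN(ID)), by = .EACHI]
plot(N_IDs ~ values, data = res, type = "b")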

Two-way contingency table in R

I have a dataframe and I want to output a two-way contingency table from two of the columns. They both have values "Too Little", "About Right" or "Too Much".
I'm typing
df %>%
  filter(!is.na(col1)) %>%
  group_by(col1) %>%
  summarise(count = n())
for both of them separately and get something like this:
col1          count
<fctr>        <int>
Too Little    19259
About Right    9539
Too Much       2816
What I would like to achieve is this:
       Too Little  About Right  Too Much  Total
col1        19259         9539      2816  31614
col2        20619         9374      2262  32255
Total       39878        18913      5078  63869
I've been trying to use table function
addmargins(table(df$col1, df$col2))
But the result is not what I want:
             Too Little  About Right  Too Much    Sum
Too Little        13770         4424       740  18934
About Right        4901         3706       700   9307
Too Much           1250          800       679   2729
Sum               19921         8930      2119  30970
I'd give tabulate a try, which is the foundation for table (see ?tabulate). For example given
set.seed(123)
vals <- LETTERS[1:3]
df <- as.data.frame(replicate(3, sample(vals, 5, T)))
df <- data.frame(lapply(df, "levels<-", vals))
then you could do
m <- t(sapply(df, tabulate, nbins = length(vals)))
colnames(m) <- vals
addmargins(m)
# A B C Sum
# V1 1 1 3 5
# V2 1 3 1 5
# V3 1 2 2 5
# Sum 3 6 6 15
Or (via @thelatemail) just
addmargins(t(sapply(df, table)))
# A B C Sum
# V1 1 1 3 5
# V2 1 3 1 5
# V3 1 2 2 5
# Sum 3 6 6 15
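If you prefer the "Total" labels from your desired output over addmargins' default "Sum", renaming the margins afterwards is the simplest fix (a small cosmetic step on the result above):
m <- addmargins(t(sapply(df, table)))
rownames(m)[nrow(m)] <- "Total"  # rename the margin row
colnames(m)[ncol(m)] <- "Total"  # rename the margin column
m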
We can use table in a loop, then rbind:
# Using dummy data from @lukeA's answer
addmargins(do.call(rbind, lapply(df, table)))
# A B C Sum
# V1 1 1 3 5
# V2 1 3 1 5
# V3 1 2 2 5
# Sum 3 6 6 15
Benchmarking
# bigger data
set.seed(123)
vals <- LETTERS[1:20]
df1 <- as.data.frame(replicate(20, sample(vals, 100000, T)))
df1 <- data.frame(lapply(df1, "levels<-", vals))
microbenchmark::microbenchmark(
  lukeA = {
    m1 <- t(sapply(df1, tabulate, nbins = length(vals)))
    colnames(m1) <- vals
    m1 <- addmargins(m1)
  },
  # vals is only used in lukeA's solution, so include its creation in the timing
  lukeA_1 = {
    vals <- LETTERS[1:20]
    m2 <- t(sapply(df1, tabulate, nbins = length(vals)))
    colnames(m2) <- vals
    m2 <- addmargins(m2)
  },
  thelatemail = {m3 <- addmargins(t(sapply(df1, table)))},
  zx8754 = {m4 <- addmargins(do.call(rbind, lapply(df1, table)))}
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# lukeA 2.349969 2.371922 2.518447 2.473839 2.558653 3.363738 100
# lukeA_1 2.351680 2.377196 2.523473 2.473839 2.542831 3.459242 100
# thelatemail 38.316506 42.054136 43.785777 42.674912 44.234193 90.287809 100
# zx8754 38.695101 41.979728 44.933602 42.762006 44.244314 110.834292 100

Returning observations that only occur once in a group [duplicate]

This question already has answers here:
Subset data frame based on number of rows per group (4 answers)
I am trying to group a data.frame by a factor variable, and then return rows of the data.frame that correspond to observations that occur exactly once in each group. For example, consider the following data:
x = matrix(c(1,1,2,2,2,3,4,4,5,4), nrow = 5, ncol = 2, byrow = F)
x = data.frame(x)
x
# X1 X2
# 1 1 3
# 2 1 4
# 3 2 4
# 4 2 5
# 5 2 4
I would like to group the data by the values in column 1, then return the rows for which the value in column 2 occurs only once within a group. Here, the function would return the first, second, and fourth rows.
Desired output
# X1 X2
# 1 1 3
# 2 1 4
# 4 2 5
I am looking to apply this to a dataset with more than 1 million rows.
In base R, you can try ave:
x[with(x, ave(X2, X1, X2, FUN = length)) == 1, ]
# X1 X2
# 1 1 3
# 2 1 4
# 4 2 5
Because ave scales poorly when there are many groups and multiple grouping variables, you may want to create a single combined grouping variable first:
x[with(x, ave(X2, sprintf("%s__%s", X1, X2), FUN = length)) == 1, ]
The speeds will vary widely according to the nature of your data.
You can also try:
library(dplyr)
x %>%
group_by(X1, X2) %>%
filter(n() == 1)
# Source: local data frame [3 x 2]
# Groups: X1, X2 [3]
#
# X1 X2
# (dbl) (dbl)
# 1 1 3
# 2 1 4
# 3 2 5
We can use data.table. We convert the data.frame to a data.table with setDT(x) and group by the first column, "X1": if a group has only one observation, we return that row; otherwise we drop all duplicated values of "X2" and return only the unique rows.
library(data.table)
setDT(x)[, if(.N==1) .SD else
.SD[!(duplicated(X2)|duplicated(X2, fromLast=TRUE))], X1]
# X1 X2
#1: 1 3
#2: 1 4
#3: 2 5
If we are using both "X1" and "X2" as grouping variables:
setDT(x)[x[, .I[.N==1], .(X1, X2)]$V1]
# X1 X2
#1: 1 3
#2: 1 4
#3: 2 5
NOTE: the data.table approaches are very fast and the code is compact.
Or, without using any group-by option at all, with base R we can do
x[!(duplicated(x)|duplicated(x, fromLast=TRUE)),]
# X1 X2
#1 1 3
#2 1 4
#4 2 5
Or with tally from dplyr
library(dplyr)
x %>%
group_by_(.dots= names(x)) %>%
tally() %>%
filter(n==1) %>%
select(-n)
Note that this should be faster than the other dplyr solution.
Benchmarks
library(data.table)
library(dplyr)
Sample data
set.seed(24)
x1 <- data.frame(X1 = sample(1:5000, 1e6, replace=TRUE),
                 X2 = sample(1:10000, 1e6, replace=TRUE))
x2 <- copy(as.data.table(x1))
Base R approaches
system.time(x1[with(x1, ave(X2, sprintf("%s__%s", X1, X2), FUN = length)) == 1, ])
# user system elapsed
# 20.245 0.002 20.280
system.time(x1[!(duplicated(x1)|duplicated(x1, fromLast=TRUE)), ])
# user system elapsed
# 1.994 0.000 1.998
dplyr approaches
system.time(x1 %>% group_by(X1, X2) %>% filter(n() == 1))
# user system elapsed
# 33.400 0.006 33.467
system.time(x1 %>% group_by_(.dots= names(x2)) %>% tally() %>% filter(n==1) %>% select(-n))
# user system elapsed
# 2.331 0.000 2.333
data.table approaches
system.time(x2[x2[, .I[.N==1], list(X1, X2)]$V1])
# user system elapsed
# 1.128 0.001 1.131
system.time(x2[, .N, by = list(X1, X2)][N == 1][, N := NULL][])
# user system elapsed
# 0.320 0.000 0.323
Summary: The "data.table" approaches win hands down, but if you're unable to use the package for some reason, using duplicated from base R also performs quite well.
With base, something like
do.call(rbind, lapply(split(x, x$X1),
        function(y) y[y$X2 %in% names(which(table(y$X2) == 1)), ]))
# X1 X2
# 1.1 1 3
# 1.2 1 4
# 2 2 5
where split divides x into a list of data frames by the levels of X1, lapply keeps within each piece only the rows whose X2 value occurs exactly once (as counted by table), and do.call(rbind, ...) reassembles the resulting data frames into a single one.
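If the concatenated row names (1.1, 1.2, 2) are unwanted, reset them on the assembled result:
res <- do.call(rbind, lapply(split(x, x$X1),
       function(y) y[y$X2 %in% names(which(table(y$X2) == 1)), ]))
rownames(res) <- NULL  # plain 1, 2, 3 row names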

Mean of variable by two factors

I have the following data:
a <- c(1,1,1,1,2,2,2,2)
b <- c(2,4,6,8,2,3,4,1)
c <- factor(c("A","B","A","B","A","B","A","B"))
df <- data.frame(sp = a,
                 length = b,
                 method = c)
I can use the following to get a count of the number of samples of each species by method:
n <- with(df, tapply(sp, method, function(x) count(x)))
How do I also get the mean length by method for each species?
Personally I would use aggregate:
aggregate(length ~ sp, data = df, FUN = "mean")
# by species only
# sp length
#1 1 5.0
#2 2 2.5
aggregate(length ~ sp + method, data = df, FUN = "mean")
# by species and method
# sp method length
#1 1 A 4
#2 2 A 3
#3 1 B 6
#4 2 B 2
For everything together you may want:
aggregate(length ~ method, data = df, function(x) c(m = mean(x), counts = length(x)))
# counts and mean for each method
# method length.m length.counts
#1 A 3.5 4.0
#2 B 4.0 4.0
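Staying closer to the tapply already used for the counts, a two-factor tapply gives the species-by-method means directly (base R, same df):
with(df, tapply(length, list(sp, method), mean))
#   A B
# 1 4 6
# 2 3 2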
The plyr library is very helpful for this kind of thing:
library(plyr)
new.df <- ddply(df, c("method", "sp"), summarise,
                mean.length = mean(length),
                max.length = max(length),
                n.obs = length(length))
gives you
> new.df
method sp mean.length max.length n.obs
1 A 1 4 6 2
2 A 2 3 4 2
3 B 1 6 8 2
4 B 2 2 3 2
More examples at http://www.inside-r.org/packages/cran/plyr/docs/ddply.
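Note that plyr is no longer under active development; for reference, a sketch of the same summary in its successor dplyr:
library(dplyr)
df %>%
  group_by(method, sp) %>%
  summarise(mean.length = mean(length),
            max.length = max(length),
            n.obs = n(),
            .groups = "drop")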
