R: aggregate by all factor levels (present and not present)

I can aggregate a data.frame trivially with dplyr with the following:
z <- data.frame(a = rnorm(20), b = rep(letters[1:4], each = 5))
library(dplyr)
z %>%
group_by(b) %>%
summarise(out = n())
Source: local data frame [4 x 2]
b out
(fctr) (int)
1 a 5
2 b 5
3 c 5
4 d 5
However, sometimes a dataset may be missing a factor level, in which case I would like the output for that level to be 0.
For example, let's say the typical dataset should have 5 groups.
z$b <- factor(z$b, levels = letters[1:5])
This particular dataset clearly has no rows for level e, but another one could. How can I aggregate this data so that the count for missing factor levels is 0?
Desired output:
Source: local data frame [5 x 2]
b out
(fctr) (int)
1 a 5
2 b 5
3 c 5
4 d 5
5 e 0

One way to approach this is to use complete from "tidyr". You first have to use mutate to convert column "b" to a factor that carries all five levels:
library(dplyr)
library(tidyr)
z %>%
mutate(b = factor(b, letters[1:5])) %>%
group_by(b) %>%
summarise(out = n()) %>%
complete(b, fill = list(out = 0))
# Source: local data frame [5 x 2]
#
# b out
# (fctr) (dbl)
# 1 a 5
# 2 b 5
# 3 c 5
# 4 d 5
# 5 e 0
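As an alternative to the summarise-then-complete pipeline above, here is a minimal sketch that assumes dplyr 0.8 or later, where count() accepts a .drop argument. The seed is arbitrary and only there to make the sketch reproducible; the column name out mirrors the question.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, not part of the original question
z <- data.frame(a = rnorm(20), b = rep(letters[1:4], each = 5))
z$b <- factor(z$b, levels = letters[1:5])  # "e" is a level with no rows

# .drop = FALSE keeps empty factor levels as zero-count groups
res <- z %>% count(b, name = "out", .drop = FALSE)
res
```

Because the zero comes from counting an empty group rather than from a fill value, the out column should stay an integer.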

A workaround is to join with a table containing all levels:
z <- full_join(z, data.frame(b = levels(z$b)))
This will set all the missing rows for your analysis variables to NA, which in the general case would make more sense than setting them to zero. You can change them to zero if necessary with z[is.na(z)] <- 0.

You could use xtabs:
xtabs(a ~ b, z)
This sums z$a within each level of z$b rather than just counting the rows per level as in your example, but that's easily achieved with table:
table(z$b)
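If a data frame in the shape of the desired output is needed, the base R counting approach can be sketched as follows. This assumes b is already a factor with all five levels; the column name out is taken from the question.

```r
z <- data.frame(
  a = rnorm(20),
  b = factor(rep(letters[1:4], each = 5), levels = letters[1:5])
)

# table() counts every level of a factor, including levels with no rows
counts <- as.data.frame(table(b = z$b), responseName = "out")
counts
```

No packages are required here, which may be convenient when dplyr is not available.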

Related

Create a list of all values of a variable grouped by another variable in R

I have a data frame that contains two variables, like this:
df <- data.frame(group=c(1,1,1,2,2,3,3,4),
type=c("a","b","a", "b", "c", "c","b","a"))
> df
group type
1 1 a
2 1 b
3 1 a
4 2 b
5 2 c
6 3 c
7 3 b
8 4 a
I want to produce a table showing for each group the combination of types it has in the data frame as one variable e.g.
group alltypes
1 1 a, b
2 2 b, c
3 3 b, c
4 4 a
The output would always list the types in the same order (e.g. groups 2 and 3 get the same result) and there would be no repetition (e.g. group 1 is not "a, b, a").
I tried doing this using dplyr and summarize, but I can't work out how to get it to meet these two conditions - the code I tried was:
> df %>%
+ group_by(group) %>%
+ summarise(
+ alltypes = paste(type, collapse=", ")
+ )
# A tibble: 4 × 2
group alltypes
<dbl> <chr>
1 1 a, b, a
2 2 b, c
3 3 c, b
4 4 a
I also tried turning type into a set of individual counts, but not sure if that's actually useful:
> df %>%
+ group_by(group, type) %>%
+ tally %>%
+ spread(type, n, fill=0)
Source: local data frame [4 x 4]
Groups: group [4]
group a b c
* <dbl> <dbl> <dbl> <dbl>
1 1 2 1 0
2 2 0 1 1
3 3 0 1 1
4 4 1 0 0
Any suggestions would be greatly appreciated.
I think you were very close. You could call the sort and unique functions to make sure your result adheres to your conditions as follows:
df %>% group_by(group) %>%
summarize(type = paste(sort(unique(type)),collapse=", "))
returns:
# A tibble: 4 x 2
group type
<int> <chr>
1 1 a, b
2 2 b, c
3 3 b, c
4 4 a
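For readers who prefer to stay in base R, the same sort-and-deduplicate idea can be sketched with aggregate. This is only an illustration of the technique, not part of the original answer; the column is renamed to alltypes to match the question.

```r
df <- data.frame(group = c(1, 1, 1, 2, 2, 3, 3, 4),
                 type  = c("a", "b", "a", "b", "c", "c", "b", "a"))

# For each group: drop repeats, sort, then collapse into one string
res <- aggregate(type ~ group, data = df,
                 FUN = function(x) paste(sort(unique(x)), collapse = ", "))
names(res)[2] <- "alltypes"
res
```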
To expand on Florian's answer this could be extended to generating an ordered list based on values in your data set. An example could be determining the order of dates:
library(lubridate)
library(tidyverse)
# Generate random dates
set.seed(123)
Date = ymd("2018-01-01") + sort(sample(1:200, 10))
A = ymd("2018-01-01") + sort(sample(1:200, 10))
B = ymd("2018-01-01") + sort(sample(1:200, 10))
C = ymd("2018-01-01") + sort(sample(1:200, 10))
# Combine to data set
data = bind_cols(as.data.frame(Date), as.data.frame(A), as.data.frame(B), as.data.frame(C))
# Get order of dates for each row
data %>%
mutate(D = Date) %>%
gather(key = Var, value = D, -Date) %>%
arrange(Date, D) %>%
group_by(Date) %>%
summarize(Ord = paste(Var, collapse=">"))
Somewhat tangential to the original question but hopefully helpful to someone.

two factor group_by then add row number R dplyr [duplicate]

I have a data frame (df):
a <- c("up","up","up","up","down","down","down","down")
b <- c("l","r","l","r","l","l","r","r")
df <- data.frame(a,b)
I would like to add a third column (c) which contains the order of entries, grouped by columns a and b that looks something like this:
a b c
1 up l 1
2 up r 1
3 up l 2
4 up r 2
5 down l 1
6 down l 2
7 down r 1
8 down r 2
I have tried solutions using dplyr that have not worked:
order <- df %>%
group_by(a) %>%
group_by(b) %>%
mutate(c = row_number()) # This counts the order based on `b`, ignoring `a`
order <- df %>%
group_by(a) %>%
group_by(b) %>%
mutate(c = seq_len(n())) # This counts the order based on `b`, ignoring `a`
I would prefer to keep using dplyr and pipes if possible, but other suggestions are welcome
You need to combine a and b in the same group_by statement.
order <- df %>%
group_by(a, b) %>%
mutate(c = row_number())
order
# Source: local data frame [8 x 3]
# Groups: a, b [4]
#
# a b c
# <fctr> <fctr> <int>
# 1 up l 1
# 2 up r 1
# 3 up l 2
# 4 up r 2
# 5 down l 1
# 6 down l 2
# 7 down r 1
# 8 down r 2
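To see why the chained group_by calls in the question did not work, the following sketch inspects the grouping variables directly. It assumes dplyr 1.0.0 or later, where group_by takes an .add argument; by default a second group_by replaces the existing grouping instead of extending it.

```r
library(dplyr)

df <- data.frame(
  a = c("up", "up", "up", "up", "down", "down", "down", "down"),
  b = c("l", "r", "l", "r", "l", "l", "r", "r")
)

# A second group_by() overrides the first one
overridden <- df %>% group_by(a) %>% group_by(b)
group_vars(overridden)

# .add = TRUE extends the existing grouping instead
added <- df %>%
  group_by(a) %>%
  group_by(b, .add = TRUE) %>%
  mutate(c = row_number())
group_vars(added)
```

Listing both columns in a single group_by(a, b) call, as in the answer, remains the simpler form.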

How to sum a variable based on factor?

Here is an example of my data:
Type <- c('A','A','A','A','B','B','C','D')
Name <- c('DK', 'MO', 'OM', 'LSO', 'GOP', 'ADG','BFC','TMD')
Value <- c(3,2,5,3,6,5,7,6)
Dat <- data.frame(Type, Name,Value)
Dat
Type Name Value
1 A DK 3
2 A MO 2
3 A OM 5
4 A LSO 3
5 B GOP 6
6 B ADG 5
7 C BFC 7
8 D TMD 6
What I'm trying to get is the sum of the value when Type=A. In this case, it is 13. I found some similar examples by applying dplyr, but I don't need the type nor the name. Please help and thank you!
Using dplyr, you would use group_by to group by each type or, if you only want type A, you could filter where Type == "A". In both cases you would then summarize by the sum of the value. I've shown both examples below.
library(dplyr)
Type <- c('A','A','A','A','B','B','C','D')
Name <- c('DK', 'MO', 'OM', 'LSO', 'GOP', 'ADG','BFC','TMD')
Value <- c(3,2,5,3,6,5,7,6)
Dat <- data.frame(Type, Name,Value)
Dat
res1 <- Dat %>%
group_by(Type) %>%
summarize(sum(Value))
res1
# Source: local data frame [4 x 2]
#
# Type sum(Value)
# (fctr) (dbl)
#1 A 13
#2 B 11
#3 C 7
#4 D 6
res2 <- Dat %>%
filter(Type == "A") %>%
summarize(sum(Value))
res2
# sum(Value)
#1 13
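Since the question asks for just the number, without the type or name columns, a base R sketch may also be useful. It uses logical subsetting for a single type and tapply for all types at once; the variable names total_A and totals are made up for this illustration.

```r
Dat <- data.frame(
  Type  = c("A", "A", "A", "A", "B", "B", "C", "D"),
  Name  = c("DK", "MO", "OM", "LSO", "GOP", "ADG", "BFC", "TMD"),
  Value = c(3, 2, 5, 3, 6, 5, 7, 6)
)

# Sum of Value for rows where Type is "A", returned as a plain number
total_A <- sum(Dat$Value[Dat$Type == "A"])
total_A

# Named vector with one sum per Type
totals <- tapply(Dat$Value, Dat$Type, sum)
totals
```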

Dplyr filtering based on two variables

I want to use dplyr to determine which observations in a dataframe meet the following condition:
Within each Group, the combined total of Var2 for observations where Var1 == "good" is greater than the combined total of Var2 for observations where Var1 == "bad"
Here's the toy dataframe:
library(dplyr)
set.seed(seed = 10)
df <- data.frame("Id" = 1:12,
"Group" = paste(sapply(toupper(letters[1:3]), rep, times = 4,simplify = T)),
"Var1" = sample(rep(c("good","bad"),times = 1000),size = 12),
"Var2" = sample(rep(1:10, times = 1000),size = 12))
print(df)
Id Group Var1 Var2
1 1 A good 6
2 2 A bad 9
3 3 A good 10
4 4 A good 7
5 5 B bad 9
6 6 B bad 1
7 7 B bad 6
8 8 B good 6
9 9 C good 1
10 10 C bad 8
11 11 C good 4
12 12 C bad 2
So far I've determined that I should be using some combination of group_by(),summarise(), and filter() but I can't seem to wrap my head around a good way to do it. Here's what I've come up with so far:
keepers <- df %>%
group_by(Group, Var1) %>%
summarise(Total = sum(Var2)) %>%
print()
Source: local data frame [6 x 3]
Groups: Group [?]
Group Var1 Total
(chr) (chr) (int)
1 A bad 9
2 A good 23
3 B bad 16
4 B good 6
5 C bad 10
6 C good 5
What next steps should I take? Ultimately the analysis should return "A", because it's the only Group where Total is greater for the good observations than for the bad observations.
How about using spread, then filter:
> library(tidyr)
> df %>% group_by(Group, Var1) %>%
+ summarise(Total = sum(Var2)) %>%
+ spread(Var1,Total) %>%
+ filter(good>bad)
Source: local data frame [1 x 3]
Group bad good
1 A 9 23
A similar option with data.table. We convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'Group', 'Var1', get the sum of 'Var2', reshape from 'long' to 'wide' and filter the rows where the 'good' is greater than 'bad'.
library(data.table)
dcast(setDT(df)[, sum(Var2) , by = .(Group, Var1)],
Group~Var1, value.var='V1')[good>bad]
# Group bad good
#1: A 9 23
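Both answers above reshape the totals from long to wide before comparing. A sketch that skips the reshape computes the two conditional sums directly inside summarise. The data frame below is typed in from the printed output in the question rather than regenerated with set.seed, because the sampled values may differ between R versions.

```r
library(dplyr)

df <- data.frame(
  Id    = 1:12,
  Group = rep(c("A", "B", "C"), each = 4),
  Var1  = c("good", "bad", "good", "good", "bad", "bad",
            "bad", "good", "good", "bad", "good", "bad"),
  Var2  = c(6, 9, 10, 7, 9, 1, 6, 6, 1, 8, 4, 2)
)

res <- df %>%
  group_by(Group) %>%
  summarise(good = sum(Var2[Var1 == "good"]),  # total for "good" rows
            bad  = sum(Var2[Var1 == "bad"])) %>%  # total for "bad" rows
  filter(good > bad)
res$Group
```

A group with no "bad" rows gets a bad total of 0 here rather than NA, which may or may not be the behaviour you want.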

Proper idiom for adding zero count rows in tidyr/dplyr

Suppose I have some count data that looks like this:
library(tidyr)
library(dplyr)
X.raw <- data.frame(
x = as.factor(c("A", "A", "A", "B", "B", "B")),
y = as.factor(c("i", "ii", "ii", "i", "i", "i")),
z = 1:6
)
X.raw
# x y z
# 1 A i 1
# 2 A ii 2
# 3 A ii 3
# 4 B i 4
# 5 B i 5
# 6 B i 6
I'd like to tidy and summarise like this:
X.tidy <- X.raw %>% group_by(x, y) %>% summarise(count = sum(z))
X.tidy
# Source: local data frame [3 x 3]
# Groups: x
#
# x y count
# 1 A i 1
# 2 A ii 5
# 3 B i 15
I know that for x=="B" and y=="ii" we have observed count of zero, rather than a missing value. i.e. the field worker was actually there, but because there wasn't a positive count no row was entered into the raw data. I can add the zero count explicitly by doing this:
X.fill <- X.tidy %>% spread(y, count, fill = 0) %>% gather(y, count, -x)
X.fill
# Source: local data frame [4 x 3]
#
# x y count
# 1 A i 1
# 2 B i 15
# 3 A ii 5
# 4 B ii 0
But that seems a little bit of a roundabout way of doing things. Is there a cleaner idiom for this?
Just to clarify: My code already does what I need it to do, using spread then gather, so what I'm interested in is finding a more direct route within tidyr and dplyr.
Since dplyr 0.8 you can do it by setting the parameter .drop = FALSE in group_by:
X.tidy <- X.raw %>% group_by(x, y, .drop = FALSE) %>% summarise(count=sum(z))
X.tidy
# # A tibble: 4 x 3
# # Groups: x [2]
# x y count
# <fct> <fct> <int>
# 1 A i 1
# 2 A ii 5
# 3 B i 15
# 4 B ii 0
This keeps a group for every combination of the levels of the factor columns, so if you have character columns you might want to convert them to factors first (thanks to Pake for the note).
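To illustrate the note about character columns, here is a hedged sketch in which the grouping columns start out as character vectors and are converted to factors before grouping. The name X.chr is hypothetical and stands in for the same data as X.raw with character columns; dplyr 0.8 or later is assumed.

```r
library(dplyr)

X.chr <- data.frame(
  x = c("A", "A", "A", "B", "B", "B"),
  y = c("i", "ii", "ii", "i", "i", "i"),
  z = 1:6,
  stringsAsFactors = FALSE
)

X.tidy <- X.chr %>%
  mutate(x = factor(x), y = factor(y)) %>%  # .drop = FALSE only expands factor levels
  group_by(x, y, .drop = FALSE) %>%
  summarise(count = sum(z)) %>%
  ungroup()
X.tidy
```

Without the mutate step the empty B/ii combination would not appear, since character columns carry no information about unobserved values.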
The complete function from tidyr is made for just this situation.
From the docs:
This is a wrapper around expand(), left_join() and replace_na that's
useful for completing missing combinations of data.
You could use it in two ways. First, you could use it on the original dataset before summarizing, "completing" the dataset with all combinations of x and y, and filling z with 0 (you could use the default NA fill and use na.rm = TRUE in sum).
X.raw %>%
complete(x, y, fill = list(z = 0)) %>%
group_by(x,y) %>%
summarise(count = sum(z))
Source: local data frame [4 x 3]
Groups: x [?]
x y count
<fctr> <fctr> <dbl>
1 A i 1
2 A ii 5
3 B i 15
4 B ii 0
You can also use complete on your pre-summarized dataset. Note that complete respects grouping. X.tidy is grouped, so you can either ungroup and complete the dataset by x and y or just list the variable you want completed within each group - in this case, y.
# Complete after ungrouping
X.tidy %>%
ungroup %>%
complete(x, y, fill = list(count = 0))
# Complete within grouping
X.tidy %>%
complete(y, fill = list(count = 0))
The result is the same for each option:
Source: local data frame [4 x 3]
x y count
<fctr> <fctr> <dbl>
1 A i 1
2 A ii 5
3 B i 15
4 B ii 0
You can use tidyr's expand to make all combinations of levels of factors, and then left_join:
X.tidy %>% expand(x, y) %>% left_join(X.tidy)
# Joining by: c("x", "y")
# Source: local data frame [4 x 3]
#
# x y count
# 1 A i 1
# 2 A ii 5
# 3 B i 15
# 4 B ii NA
Then you may keep values as NAs or replace them with 0 or any other value.
This isn't a complete solution to the problem either, but it's faster and more RAM-friendly than spread & gather.
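The final replacement step can be sketched with tidyr's replace_na. This is a self-contained version of the expand and left_join approach, with the join keys spelled out and the summary ungrouped first so that expand builds combinations across the whole table.

```r
library(dplyr)
library(tidyr)

X.raw <- data.frame(
  x = factor(c("A", "A", "A", "B", "B", "B")),
  y = factor(c("i", "ii", "ii", "i", "i", "i")),
  z = 1:6
)

X.tidy <- X.raw %>%
  group_by(x, y) %>%
  summarise(count = sum(z)) %>%
  ungroup()

X.fill <- X.tidy %>%
  expand(x, y) %>%                          # all combinations of x and y levels
  left_join(X.tidy, by = c("x", "y")) %>%   # missing combinations get NA
  replace_na(list(count = 0L))              # turn those NAs into zero counts
X.fill
```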
plyr has the functionality you're looking for, but dplyr doesn't (yet), so you need some extra code to include the zero-count groups, as shown by @momeara. Also see this question. In plyr::ddply you just add .drop=FALSE to keep zero-count groups in the final result. For example:
library(plyr)
X.tidy = ddply(X.raw, .(x,y), summarise, count=sum(z), .drop=FALSE)
X.tidy
x y count
1 A i 1
2 A ii 5
3 B i 15
4 B ii 0
You could explicitly build all possible combinations and then join them with the tidy summary:
X.fill <- expand.grid(x = unique(X.tidy$x), y = unique(X.tidy$y)) %>%
left_join(X.tidy, by = c("x", "y")) %>%
mutate(count = ifelse(is.na(count), 0, count)) # replace NA counts with 0
You can also use the data.table package and its Cross Join CJ() function for that.
require(data.table)
X = data.table(X.raw)[
CJ(y = y,
x = x,
unique = TRUE),
on = .(x, y)
][ , .(z = sum(z)), .(x, y) ][ order(x, y) ]
X
# filling the NAs with 0s
setnafill(X, fill = 0, cols = 'z')
X
# x y z
# 1: A i 1
# 2: A ii 5
# 3: B i 15
# 4: B ii 0
Though it's not initially asked for, I'm adding a data.table solution here for the sake of completeness and to also link to the related data.table question.

Resources