Proper idiom for adding zero count rows in tidyr/dplyr - r

Suppose I have some count data that looks like this:
library(tidyr)
library(dplyr)
X.raw <- data.frame(
x = as.factor(c("A", "A", "A", "B", "B", "B")),
y = as.factor(c("i", "ii", "ii", "i", "i", "i")),
z = 1:6
)
X.raw
# x y z
# 1 A i 1
# 2 A ii 2
# 3 A ii 3
# 4 B i 4
# 5 B i 5
# 6 B i 6
I'd like to tidy and summarise like this:
X.tidy <- X.raw %>% group_by(x, y) %>% summarise(count = sum(z))
X.tidy
# Source: local data frame [3 x 3]
# Groups: x
#
# x y count
# 1 A i 1
# 2 A ii 5
# 3 B i 15
I know that for x=="B" and y=="ii" we have an observed count of zero rather than a missing value; i.e. the field worker was actually there, but because there wasn't a positive count, no row was entered into the raw data. I can add the zero count explicitly by doing this:
X.fill <- X.tidy %>% spread(y, count, fill = 0) %>% gather(y, count, -x)
X.fill
# Source: local data frame [4 x 3]
#
# x y count
# 1 A i 1
# 2 B i 15
# 3 A ii 5
# 4 B ii 0
But that seems a little bit of a roundabout way of doing things. Is there a cleaner idiom for this?
Just to clarify: My code already does what I need it to do, using spread then gather, so what I'm interested in is finding a more direct route within tidyr and dplyr.

Since dplyr 0.8 you can do it by setting the parameter .drop = FALSE in group_by:
X.tidy <- X.raw %>% group_by(x, y, .drop = FALSE) %>% summarise(count=sum(z))
X.tidy
# # A tibble: 4 x 3
# # Groups: x [2]
# x y count
# <fct> <fct> <int>
# 1 A i 1
# 2 A ii 5
# 3 B i 15
# 4 B ii 0
This keeps groups for every combination of levels of the factor columns, so if you have character columns you might want to convert them to factors first (thanks to Pake for the note).
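For example, if x and y had arrived as character columns, a minimal sketch of the conversion (assuming dplyr >= 1.0 for across()):
X.raw %>%
  mutate(across(c(x, y), factor)) %>%   # convert characters to factors first
  group_by(x, y, .drop = FALSE) %>%
  summarise(count = sum(z))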

The complete function from tidyr is made for just this situation.
From the docs:
This is a wrapper around expand(), left_join() and replace_na() that's
useful for completing missing combinations of data.
You could use it in two ways. First, you could use it on the original dataset before summarizing, "completing" the dataset with all combinations of x and y and filling z with 0 (alternatively, keep the default NA fill and use na.rm = TRUE in sum).
X.raw %>%
complete(x, y, fill = list(z = 0)) %>%
group_by(x,y) %>%
summarise(count = sum(z))
Source: local data frame [4 x 3]
Groups: x [?]
x y count
<fctr> <fctr> <dbl>
1 A i 1
2 A ii 5
3 B i 15
4 B ii 0
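For reference, a sketch of the NA-fill variant mentioned above: keep complete()'s default NA fill and let sum() drop the NA.
X.raw %>%
  complete(x, y) %>%       # missing combinations get z = NA
  group_by(x, y) %>%
  summarise(count = sum(z, na.rm = TRUE))  # all-NA group sums to 0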
You can also use complete on your pre-summarized dataset. Note that complete respects grouping. X.tidy is grouped, so you can either ungroup and complete the dataset by x and y or just list the variable you want completed within each group - in this case, y.
# Complete after ungrouping
X.tidy %>%
ungroup() %>%
complete(x, y, fill = list(count = 0))
# Complete within grouping
X.tidy %>%
complete(y, fill = list(count = 0))
The result is the same for each option:
Source: local data frame [4 x 3]
x y count
<fctr> <fctr> <dbl>
1 A i 1
2 A ii 5
3 B i 15
4 B ii 0

You can use tidyr's expand to make all combinations of levels of factors, and then left_join:
X.tidy %>% expand(x, y) %>% left_join(X.tidy)
# Joining by: c("x", "y")
# Source: local data frame [4 x 3]
#
# x y count
# 1 A i 1
# 2 A ii 5
# 3 B i 15
# 4 B ii NA
Then you may keep values as NAs or replace them with 0 or any other value.
This way isn't a fully self-contained solution either, but it's faster and more RAM-friendly than spread and gather.
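If you do want zeros rather than NAs, a sketch of the follow-up step using tidyr::replace_na():
X.tidy %>%
  expand(x, y) %>%
  left_join(X.tidy, by = c("x", "y")) %>%
  replace_na(list(count = 0))   # swap the join NAs for explicit zeros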

plyr has the functionality you're looking for, but dplyr didn't until version 0.8 (see the .drop = FALSE answer above), so on older dplyr you need some extra code to include the zero-count groups, as shown by @momeara. Also see this question. In plyr::ddply you just add .drop = FALSE to keep zero-count groups in the final result. For example:
library(plyr)
X.tidy = ddply(X.raw, .(x,y), summarise, count=sum(z), .drop=FALSE)
X.tidy
x y count
1 A i 1
2 A ii 5
3 B i 15
4 B ii 0

You could explicitly make all possible combinations and then join them with the tidy summary:
X.fill <- expand.grid(x = unique(X.tidy$x), y = unique(X.tidy$y)) %>%
  left_join(X.tidy, by = c("x", "y")) %>%
  mutate(count = ifelse(is.na(count), 0, count)) # replace NA values with 0s

You can also use the data.table package and its cross-join function CJ() for that.
require(data.table)
X = data.table(X.raw)[
CJ(y = y,
x = x,
unique = TRUE),
on = .(x, y)
][ , .(z = sum(z)), .(x, y) ][ order(x, y) ]
X
# filling the NAs with 0s
setnafill(X, fill = 0, cols = 'z')
X
# x y z
# 1: A i 1
# 2: A ii 5
# 3: B i 15
# 4: B ii 0
Though it's not initially asked for, I'm adding a data.table solution here for the sake of completeness and to also link to the related data.table question.
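A more compact variant of the same idea, as a sketch: group during the join with by = .EACHI, and let na.rm = TRUE turn the unmatched (all-NA) group into a 0 instead of filling afterwards.
data.table(X.raw)[
  CJ(x = x, y = y, unique = TRUE),      # all x/y combinations
  .(z = sum(z, na.rm = TRUE)),          # sum within each joined group
  on = .(x, y), by = .EACHI
]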

Related

Merging data with partial match

I have two large data frames, and want to merge them based on one of the column. However, some of the cells only have partial match. Please see the example below:
df1 = data.frame(SampleID = c(1:6), Gene = c("ARF5;ARG1","AP3B1","CLDN5","XPO1;STX7","ABCC4","FLOT1"))
df2 = data.frame(Operation = c("Y"), Gene = c("ARG1","CLDN5;STK10","XPO1","PDE5A","ARF5","IPO7","VAPB","ABCC4"))
#-----------------
SampleID Gene
1 ARF5;ARG1
2 AP3B1
3 CLDN5
4 XPO1;STX7
5 ABCC4
6 FLOT1
#-----------------
Operation Gene
Y ARG1
Y CLDN5;STK10
Y XPO1
Y PDE5A
Y ARF5
Y IPO7
Y VAPB
Y ABCC4
Expected Output
#-----------------
SampleID Gene Operation
1 ARF5;ARG1 Y
2 AP3B1 -
3 CLDN5 Y
4 XPO1;STX7 Y
5 ABCC4 Y
6 FLOT1 -
You can see that df1$Gene and df2$Gene only partially match, and I want to add the Operation information to df1 whenever there is a match. In the example, df1 rows 1 and 4 partially match df2 rows 1 and 3. For rows with no match the value can be NA, or whatever. I have thousands of rows in my data frames, so I cannot adjust them one by one.
Using dplyr and fuzzyjoin:
library(dplyr)
# library(fuzzyjoin) # regex_left_join
df2 %>%
mutate(Gene = sapply(strsplit(Gene, ";"), function(z) paste0("\\b(", paste(z, collapse = "|"), ")\\b"))) %>%
fuzzyjoin::regex_left_join(df1, ., by = "Gene") %>%
group_by(SampleID) %>%
summarize(Gene = Gene.x[1], Operation = na.omit(Operation)[1], .groups = "drop")
# # A tibble: 6 x 3
# SampleID Gene Operation
# <int> <chr> <chr>
# 1 1 ARF5;ARG1 Y
# 2 2 AP3B1 NA
# 3 3 CLDN5 Y
# 4 4 XPO1;STX7 Y
# 5 5 ABCC4 Y
# 6 6 FLOT1 NA
The first step converts df2$Gene[2] from CLDN5;STK10 to \\b(CLDN5|STK10)\\b, a pattern that allows a match on any of its ;-delimited values (inferred from your expected output).
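For reference, the converted patterns for the sample df2$Gene (a quick check of that mutate step; the values are deterministic given the data above):
sapply(strsplit(df2$Gene, ";"), function(z) paste0("\\b(", paste(z, collapse = "|"), ")\\b"))
# [1] "\\b(ARG1)\\b"  "\\b(CLDN5|STK10)\\b" "\\b(XPO1)\\b" "\\b(PDE5A)\\b"
# [5] "\\b(ARF5)\\b"  "\\b(IPO7)\\b"        "\\b(VAPB)\\b" "\\b(ABCC4)\\b"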
Edit: if you have a lot of other columns, you may be able to add them to the grouping such that you don't need to explicitly summarize them (with [1]). For example, the above might be rewritten as:
df2 %>%
mutate(Gene = sapply(strsplit(Gene, ";"), function(z) paste0("\\b(", paste(z, collapse = "|"), ")\\b"))) %>%
fuzzyjoin::regex_left_join(df1, ., by = "Gene") %>%
rename(Gene = Gene.x) %>%
group_by(across(SampleID:Gene)) %>%
summarize(Operation = na.omit(Operation)[1], .groups = "drop")
# # A tibble: 6 x 3
# SampleID Gene Operation
# <int> <chr> <chr>
# 1 1 ARF5;ARG1 Y
# 2 2 AP3B1 NA
# 3 3 CLDN5 Y
# 4 4 XPO1;STX7 Y
# 5 5 ABCC4 Y
# 6 6 FLOT1 NA
(Renaming from Gene.x to Gene is not necessary but looked nice :-)
This method assumes that all columns that you want to keep are either consecutive (allowing for fromcolumn:tocolumn use of :-ranges) or not difficult to add individually.

R program questions

I am trying to get some unique combinations of two variables.
For each value of x, I would like to keep it only if it has a single unique y value, and drop those x values that have several y values. But several x values could share the same y value.
For example,
a=data.frame(x=c(1,1,2,4,5,5),y=c(2,3,3,3,6,6))
and I would like to get the output like:
b=data.frame(x=c(2,4,5),y=c(3,3,6))
I have tried unique(), but it does not help this situation.
Thank you!
First we use unique to omit repeated rows with the same x and y values (keeping only one copy of each). Any repeated x values that are left have different y values, so we want to get rid of them. We use the standard way to remove all copies of any duplicated values as in this R-FAQ.
a=data.frame(x=c(1,1,2,4,5,5),y=c(2,3,3,3,6,6))
b = unique(a)
b = b[!duplicated(b$x) & !duplicated(b$x, fromLast = TRUE), ]
b
# x y
# 3 2 3
# 4 4 3
# 5 5 6
Fans of dplyr would probably do it like this, producing the same result.
library(dplyr)
a %>%
group_by(x) %>%
filter(n_distinct(y) == 1) %>%
distinct
Using dplyr:
library(dplyr)
a <- data.frame(x=c(1,1,2,4,5,5),y=c(2,3,3,3,6,6))
a %>%
distinct() %>%
add_count(x) %>% # adds an implicit group_by(x)
filter(n == 1) %>%
select(-n)
#> # A tibble: 3 x 2
#> # Groups: x [3]
#> x y
#> <dbl> <dbl>
#> 1 2 3
#> 2 4 3
#> 3 5 6
Created on 2018-11-14 by the reprex package (v0.2.1)

Multiple values in one cell

I have data looking somewhat similar to this:
number type results
1 5 x, y, z
2 6 a
3 8 x
1 5 x, y
Basically, I have data in Excel that has commas in a couple of individual cells and I need to count each value that is separated by a comma, after a certain requirement is met by subsetting.
Question: How do I go about receiving the sum of 5 when subsetting the data with number == 1 and type == 5, in R?
If we need the total count, then another option is str_count after subsetting
library(stringr)
with(df, sum(str_count(results[number==1 & type==5], "[a-z]"), na.rm = TRUE))
#[1] 5
Or with gregexpr from base R
with(df, sum(lengths(gregexpr("[a-z]", results[number==1 & type==5])), na.rm = TRUE))
#[1] 5
If some elements have no matching pattern, use
with(df, sum(unlist(lapply(gregexpr("[a-z]",
results[number==1 & type==5]), `>`, 0)), na.rm = TRUE))
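Both patterns above assume single-letter values. If the comma-separated entries can be longer words, one hedge is to count the separators instead (a sketch; assumes no empty results strings):
with(df, sum(str_count(results[number == 1 & type == 5], ",") + 1))
# [1] 5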
Here is an option using dplyr and tidyr. filter() keeps only the rows that meet the conditions, separate_rows() splits the comma-separated values into their own rows, group_by() groups the data, and tally() counts the rows.
dt2 <- dt %>%
filter(number == 1, type == 5) %>%
separate_rows(results) %>%
group_by(results) %>%
tally()
# # A tibble: 3 x 2
# results n
# <chr> <int>
# 1 x 2
# 2 y 2
# 3 z 1
Or you can replace group_by() and tally() with a single count(results), as the following code shows.
dt2 <- dt %>%
filter(number == 1, type == 5) %>%
separate_rows(results) %>%
count(results)
DATA
dt <- read.table(text = "number type results
1 5 'x, y, z'
2 6 a
3 8 x
1 5 'x, y'",
header = TRUE, stringsAsFactors = FALSE)
Here is a method using base R. You split results on the commas, take the length of each resulting vector, then add these up grouped by number.
aggregate(sapply(strsplit(df$results, ","), length), list(df$number), sum)
Group.1 x
1 1 5
2 2 1
3 3 1
Your data:
df = read.table(text="number type results
1 5 'x, y, z'
2 6 'a'
3 8 'x'
1 5 'x, y'",
header=TRUE, stringsAsFactors=FALSE)
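For the specific subset the question asks about (number == 1 and type == 5), the same strsplit idea gives the requested total of 5; a minimal sketch:
with(subset(df, number == 1 & type == 5),
  sum(lengths(strsplit(results, ","))))
# [1] 5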

Create a mapping table of duplicated id / keys

I have a statistical routine that does not tolerate exact duplicate rows (without an ID), as they result in zero distances.
So I first detect and remove duplicates, apply my routine, and then merge back the records that were set aside.
For simplicity, assume I use rownames as the ID/key.
I have found the following way to achieve my result in base R:
data <- data.frame(x=c(1,1,1,2,2,3),y=c(1,1,1,4,4,3))
# check duplicates and get their ID -- cf. https://stackoverflow.com/questions/12495345/find-indices-of-duplicated-rows
dup1 <- duplicated(data)
dupID <- rownames(data)[dup1 | duplicated(data[nrow(data):1, ])[nrow(data):1]]
# keep only those records that do have duplicates, to avoid running the following steps on all rows
datadup <- data[dupID,]
# "hash" row
rowhash <- apply(datadup, 1, paste, collapse="_")
idmaps <- split(rownames(datadup),rowhash)
idmaptable <- do.call("rbind",lapply(idmaps,function(vec)data.frame(mappedid=vec[1],otherids=vec[-1],stringsAsFactors = FALSE)))
Which gives me what I want, i.e. the deduplicated data (easy) and the mapping table.
> (data <- data[!dup1,])
x y
1 1 1
4 2 4
6 3 3
> idmaptable
mappedid otherids
1_1.1 1 2
1_1.2 1 3
2_4 4 5
I wonder whether there is a simpler or more efficient method (data.table / dplyr accepted). Any alternatives to propose?
With data.table...
library(data.table)
setDT(data)
# tag groups of dupes
data[, g := .GRP, by=x:y]
# do whatever analysis
f = function(DT) Reduce(`+`, DT)
resDT = unique(data, by="g")[, res := f(.SD), .SDcols = x:y][]
# "update join" the results back to the main table if needed
data[resDT, on=.(g), res := i.res ]
The OP skipped a central part of the example (usage of the deduped data), so I just made up f.
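If you also need the OP's mapping table, a sketch of pulling it out of the g tags (the if drops singleton groups):
idmaptable = data[, .(id = .I), by = g][      # original row indices per group
  , if (.N > 1) .(mappedid = id[1], otherids = id[-1]), by = g]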
A solution using tidyverse. I usually don't store information in the row names, so I created ID and ID2 to store information. But of course, you can change that based on your needs.
library(tidyverse)
idmaptable <- data %>%
rowid_to_column() %>%
group_by(x, y) %>%
filter(n() > 1) %>%
unite(ID, x, y) %>%
mutate(ID2 = 1:n()) %>%
group_by(ID) %>%
mutate(ID_type = ifelse(row_number() == 1, "mappedid", "otherids")) %>%
spread(ID_type, rowid) %>%
fill(mappedid) %>%
drop_na(otherids) %>%
mutate(ID2 = 1:n())
idmaptable
# A tibble: 3 x 4
# Groups: ID [2]
ID ID2 mappedid otherids
<chr> <int> <int> <int>
1 1_1 1 1 2
2 1_1 2 1 3
3 2_4 1 4 5
Some improvements to your base R solution,
df <- data[duplicated(data)|duplicated(data, fromLast = TRUE),]
do.call(rbind, lapply(split(rownames(df),
do.call(paste, c(df, sep = '_'))), function(i)
data.frame(mapped = i[1],
others = i[-1],
stringsAsFactors = FALSE)))
Which gives,
mapped others
1_1.1 1 2
1_1.2 1 3
2_4 4 5
And of course,
unique(data)
x y
1 1 1
4 2 4
6 3 3

R: aggregate by all factor levels (present and not present)

I can aggregate a data.frame trivially with dplyr with the following:
z <- data.frame(a = rnorm(20), b = rep(letters[1:4], each = 5))
library(dplyr)
z %>%
group_by(b) %>%
summarise(out = n())
Source: local data frame [4 x 2]
b out
(fctr) (int)
1 a 5
2 b 5
3 c 5
4 d 5
However, sometimes a dataset may be missing a factor level. In that case I would like the output to be 0.
For example, let's say the typical dataset should have 5 groups.
z$b <- factor(z$b, levels = letters[1:5])
But clearly there aren't any "e" values in this particular dataset, though there could be in another. How can I aggregate this data so that the count for missing factor levels is 0?
Desired output:
Source: local data frame [5 x 2]
b out
(fctr) (int)
1 a 5
2 b 5
3 c 5
4 d 5
5 e 0
One way to approach this is to use complete from "tidyr". You have to use mutate first to set the factor levels of column "b":
library(dplyr)
library(tidyr)
z %>%
mutate(b = factor(b, letters[1:5])) %>%
group_by(b) %>%
summarise(out = n()) %>%
complete(b, fill = list(out = 0))
# Source: local data frame [5 x 2]
#
# b out
# (fctr) (dbl)
# 1 a 5
# 2 b 5
# 3 c 5
# 4 d 5
# 5 e 0
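On dplyr >= 0.8 there is also a shorter route, a sketch using count() with .drop = FALSE (the count column is named n):
z %>%
  mutate(b = factor(b, letters[1:5])) %>%
  count(b, .drop = FALSE)   # keeps the zero-count level "e"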
A workaround is to join with a table containing all levels:
z <- full_join(z, data.frame(b = levels(z$b)))
This will set all the missing rows for your analysis variables to NA, which in the general case would make more sense than setting them to zero. You can change them to zero if necessary with z[is.na(z)] <- 0.
You could use xtabs:
xtabs(a ~ b, z)
This sums z$a within each level of z$b rather than just counting rows as in your example, but a plain count (including zeros for missing levels) is easily achieved with table:
table(z$b)
# a b c d e
# 5 5 5 5 0
