combine lists nested in a tibble - r

I want to count the number of unique values in a table that contains several from...to pairs like below:
tmp <- tribble(
~group, ~from, ~to,
1, 1, 10,
1, 5, 8,
1, 15, 20,
2, 1, 10,
2, 5, 10,
2, 15, 18
)
I tried to nest all values in a list for each row (works), but combining these nested lists into one vector and counting the uniques doesn't work as expected.
tmp %>%
group_by(group) %>%
rowwise() %>%
mutate(nrs = list(c(from:to))) %>%
summarise(n_uni = length(unique(unlist(list(nrs)))))
The desired output looks like this:
tibble(group = c(1, 2),
n_uni = c(length(unique(unlist(list(tmp$nrs[tmp$group == 1])))),
length(unique(unlist(list(tmp$nrs[tmp$group == 2]))))))
# # A tibble: 2 × 2
# group n_uni
# <dbl> <int>
#1 1 16
#2 2 14
Any help would be much appreciated!

tmp %>%
rowwise() %>%
mutate(nrs = list(from:to)) %>%
group_by(group) %>%
summarise(n_uni = n_distinct(unlist(nrs)))
The issue with OP's approach is that rowwise() replaces the existing grouping with a per-row grouping, so the initial group_by step no longer has any effect and summarise runs once per row instead of once per group. Thus we have to call group_by after creating nrs in the rowwise step.
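A minimal sketch of the ordering issue, plus a variant that avoids rowwise() altogether. The variant uses base R's Map() to build one sequence per row; it is an illustration rather than part of the answer above, and the names per_row and per_group are hypothetical.

```r
library(dplyr)
library(tibble)

tmp <- tribble(
  ~group, ~from, ~to,
  1,  1, 10,
  1,  5,  8,
  1, 15, 20,
  2,  1, 10,
  2,  5, 10,
  2, 15, 18
)

# Original ordering: rowwise() takes over the grouping, so summarise()
# produces one result per input row instead of one per group.
per_row <- tmp %>%
  group_by(group) %>%
  rowwise() %>%
  mutate(nrs = list(from:to)) %>%
  summarise(n_uni = n_distinct(unlist(nrs)))

# Variant without rowwise(): Map() builds one sequence per row,
# and the grouping is applied just before summarising.
per_group <- tmp %>%
  mutate(nrs = Map(seq, from, to)) %>%
  group_by(group) %>%
  summarise(n_uni = n_distinct(unlist(nrs)))

per_group$n_uni  # 16 14
```

Because the grouping is set only once, right before summarise, there is no interaction between rowwise() and group_by() to keep track of.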

Related

How to obtain maximum counts by group

Using tidyverse, I would like to obtain the maximum count of events (e.g., dates) by group. Here is a minimum reproducible example:
Data frame:
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5),
event = c(12, 6, 1, 7, 13, 9, 4, 8, 2, 5, 11, 3, 10, 14))
The following code produces the desired output, but seems overly complicated:
df %>%
group_by(id) %>%
mutate(count = n()) %>%
ungroup() %>%
select(count) %>%
slice_max(count, n = 1, with_ties = FALSE)
Is there a simpler/better way? The following works, but top_n has been superseded by slice_max and it is recommended that the latter be used instead.
df %>%
count(id) %>%
distinct(n) %>% # to remove tied values
top_n(1)
Any suggestions?
If you want something with fewer steps, you could try base R table() to get the counts in a vector and then take the max(). By default it returns the max value only once even if it appears a few times in the vector.
max(table(df$id))
[1] 4
Or if you want it in tidyverse style
df$id %>%
table() %>%
max()
If you want the largest event value within each group (where id is the grouping variable), then:
df %>%
group_by(id) %>%
summarise(max_n_events = max(event))
If instead you do not care about the specific values in the event column and only want to count rows per id, the solution proposed by @Josh above can also be written as follows:
df %>% group_by(id) %>% count() %>% ungroup() %>% summarise(max(n))
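As a sketch of the count-then-reduce idea in current dplyr idiom (assuming a dplyr version that provides slice_max(); the column name n_events and the object names are illustrative):

```r
library(dplyr)

df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5),
  event = c(12, 6, 1, 7, 13, 9, 4, 8, 2, 5, 11, 3, 10, 14)
)

# Count rows per id, then reduce to the single largest count.
max_count <- df %>%
  count(id, name = "n_events") %>%
  summarise(max_n_events = max(n_events))

# Same idea with slice_max(), which also keeps one id that attains the maximum.
top_id <- df %>%
  count(id, name = "n_events") %>%
  slice_max(n_events, n = 1, with_ties = FALSE)

max_count$max_n_events  # 4
```

Naming the count column explicitly avoids the clash between a column called n and the n argument of slice_max().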

Sum of a list in R data frame

I have a column of type "list" in my data frame, I want to create a column with the sum.
I guess there is no visual difference, but my column consists of list(1,2,3)s and not c(1,2,3)s :
tibble(
MY_DATA = list(
list(2, 7, 8),
list(3, 10, 11),
list(4, 2, 8)
),
NOT_MY_DATA = list(
c(2, 7, 8),
c(3, 10, 11),
c(4, 2, 8)
)
)
Unfortunately when I try
mutate(NEW_COL = MY_LIST_COL_D %>% unlist() %>% sum())
the result is that every cell in the new column contains the sum of the entire source column (so a value in the millions)
I tried reduce and it did work, but was slow, I was looking for a better solution.
You could use purrr::map_dbl, which returns a vector of type double:
library(tibble)
library(dplyr)
library(purrr)
df = tibble(
MY_LIST_COL_D = list(
c(2, 7, 8),
c(3, 10, 11),
c(4, 2, 8)
)
)
df %>%
mutate(NEW_COL= map_dbl(MY_LIST_COL_D, sum), .keep = 'unused')
# NEW_COL
# <dbl>
# 1 17
# 2 24
# 3 14
Is this what you were looking for? If you don't want to remove the list column just disregard the .keep argument.
Update
With the underlying structure being lists, you can still apply the same logic, but one way to solve the issue is to unlist:
df = tibble(
MY_LIST_COL_D = list(
list(2, 7, 8),
list(3, 10, 11),
list(4, 2, 8)
)
)
df %>%
mutate(NEW_COL = map_dbl(MY_LIST_COL_D, ~ sum(unlist(.x))), .keep = 'unused')
# NEW_COL
# <dbl>
# 1 17
# 2 24
# 3 14
You can use rowwise in dplyr. Calling sum() directly assumes each element is a numeric vector, since sum() errors on a list:
library(dplyr)
df %>% rowwise() %>% mutate(NEW_COL = sum(MY_LIST_COL_D))
rowwise will also make your attempt work :
df %>% rowwise() %>% mutate(NEW_COL = MY_LIST_COL_D %>% unlist() %>% sum())
You can also use sapply in base R (for elements that are numeric vectors):
df$NEW_COL <- sapply(df$MY_LIST_COL_D, sum)
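A base R sketch that handles both layouts from the question, nested lists and numeric vectors, by unlisting each element before summing. The helper sum_elements and the column names are hypothetical, chosen only for this illustration.

```r
library(tibble)

df <- tibble(
  AS_LISTS = list(list(2, 7, 8), list(3, 10, 11), list(4, 2, 8)),
  AS_VECTORS = list(c(2, 7, 8), c(3, 10, 11), c(4, 2, 8))
)

# Sum each element of a list column; unlist() flattens a nested list
# and leaves a plain numeric vector unchanged.
sum_elements <- function(col) {
  vapply(col, function(el) sum(unlist(el)), numeric(1))
}

df$SUM_LISTS <- sum_elements(df$AS_LISTS)
df$SUM_VECTORS <- sum_elements(df$AS_VECTORS)

df$SUM_LISTS  # 17 24 14
```

vapply() with numeric(1) guarantees a plain double vector, which sapply() does not when the input is empty.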

Looking for a dplyr function to apply a filter conditionally

I have a data frame of various hematology values and their collection times. Those values should only be collected at specific times, but occasionally an extra one is added. I want to remove any instances where a value was collected outside the scheduled time.
To illustrate the issue, here's some code to create a very simplified version of the data frame I'm working with (plus some example schedules):
example <- tibble("Parameter" = c(rep("hgb", 3), rep("bili", 3), rep("LDH", 3)),
"Collection" = c(1, 3, 4, 1, 5, 6, 0, 4, 8))
hgb_sampling <- c(1, 4)
bili_sampling <- c(1, 5)
ldh_sampling <- c(0, 4)
So, I need a way to conditionally apply a filter based on the value in the Parameter column. The solution needs to fit into a dplyr pipeline and yield something like this:
filtered <- tibble("Parameter" = c(rep("hgb", 2), rep("bili", 2), rep("LDH", 2)),
"Collection" = c(1, 4, 1, 5, 0, 4))
I've tried a couple things (they all amount to something like the below) but the use of "Parameter" trips things up:
df <- example %>%
{if (Parameter == "hgb") filter(., Collection %in% hgb_sampling)}
Any suggestions?
You could create a reference tibble, join it with example and keep only selected rows.
library(dplyr)
ref_df <- tibble::tibble(Parameter = c("hgb","bili", "LDH"),
value = list(c(1, 4), c(1, 5), c(0, 4)))
example %>%
inner_join(ref_df, by = 'Parameter') %>%
group_by(Parameter) %>%
filter(Collection %in% unique(unlist(value))) %>%
select(Parameter, Collection)
# Parameter Collection
# <chr> <dbl>
#1 hgb 1
#2 hgb 4
#3 bili 1
#4 bili 5
#5 LDH 0
#6 LDH 4
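The reference-table idea can also be sketched with a long-format schedule (one row per allowed Parameter/Collection pair) and semi_join(), which keeps only the rows of example that have a match. This is an alternative to the list-column join above, not the answerer's code; the names schedule and filtered are illustrative.

```r
library(dplyr)
library(tibble)

example <- tibble(
  Parameter = c(rep("hgb", 3), rep("bili", 3), rep("LDH", 3)),
  Collection = c(1, 3, 4, 1, 5, 6, 0, 4, 8)
)

# One row per scheduled collection time for each parameter.
schedule <- tribble(
  ~Parameter, ~Collection,
  "hgb",  1,
  "hgb",  4,
  "bili", 1,
  "bili", 5,
  "LDH",  0,
  "LDH",  4
)

# Keep only rows whose (Parameter, Collection) pair is on the schedule.
filtered <- example %>%
  semi_join(schedule, by = c("Parameter", "Collection"))

filtered$Collection  # 1 4 1 5 0 4
```

Since semi_join() never adds columns or duplicates rows, no select() or unlist() step is needed afterwards.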
Put your valid times in a list with names matching the values in Parameter, then group by Parameter and filter Collection against the matching element of sample_list:
sample_list <- list(hgb = c(1, 4), bili = c(1, 5), LDH = c(0, 4))
example %>%
group_by(Parameter) %>%
filter(Collection %in% sample_list[[first(Parameter)]])
Output:
# A tibble: 6 x 2
Parameter Collection
<chr> <dbl>
1 hgb 1
2 hgb 4
3 bili 1
4 bili 5
5 LDH 0
6 LDH 4
Try purrr::imap_dfr:
library(tidyverse)
example <- tibble("Parameter" = c(rep("hgb", 3), rep("bili", 3), rep("LDH", 3)),
"Collection" = c(1, 3, 4, 1, 5, 6, 0, 4, 8))
l <- list(hgb = c(1, 4), bili = c(1, 5), LDH = c(0, 4))
imap_dfr(l, ~example %>%
filter(Parameter == .y & Collection %in% .x))
# # A tibble: 6 x 2
# Parameter Collection
# <chr> <dbl>
# 1 hgb 1
# 2 hgb 4
# 3 bili 1
# 4 bili 5
# 5 LDH 0
# 6 LDH 4
Simple method that is very easy to modify, add, remove, debug, ...
library(dplyr)
example %>%
filter(Parameter=="hgb" & Collection %in% c(1, 4) |
Parameter=="bili" & Collection %in% c(1, 5) |
Parameter=="LDH" & Collection %in% c(0, 4) )
Or if you want the values to be within a range:
example %>%
filter(Parameter=="hgb" & between(Collection, 1, 4) |
Parameter=="bili" & between(Collection, 1, 5) |
Parameter=="LDH" & between(Collection, 0, 4))
One option involving dplyr, stringr and tibble could be:
enframe(mget(ls(pattern = "sampling"))) %>%
mutate(name = str_extract(name, "[^_]+")) %>%
right_join(example %>%
mutate(Parameter = tolower(Parameter)), by = c("name" = "Parameter")) %>%
filter(Collection %in% unlist(value)) %>%
select(-value)
name Collection
<chr> <dbl>
1 hgb 1
2 hgb 4
3 bili 1
4 bili 5
5 ldh 0
6 ldh 4
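If stored in a separate df as shown by @Ronak Shah, then you can do the following (grouping by Parameter first, so that each row is checked only against its own parameter's schedule):
example %>%
group_by(Parameter) %>%
filter(Collection %in% unlist(ref_df$value[match(Parameter, ref_df$Parameter)])) %>%
ungroup()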
An additional solution:
library(tidyverse)
library(purrr)
fltr <- list(hgb = c(1, 4), bili = c(1, 5), LDH = c(0,4)) %>%
enframe(name = "Parameter")
example %>%
group_by(Parameter) %>%
nest() %>%
left_join(fltr, by = "Parameter") %>%
mutate(out = map2(.x = data, .y = value, .f = ~ filter(.x, Collection %in% .y))) %>%
unnest(out) %>%
select(Parameter, Collection)

Top_n return both max and min value - R

Is it possible for the top_n() command to return both the max and min values at the same time?
Using the example from the reference page https://dplyr.tidyverse.org/reference/top_n.html
I tried the following
df <- data.frame(x = c(10, 4, 1, 6, 3, 1, 1))
df %>% top_n(c(1,-1)) ## returns an error
df <- data.frame(x = c(10, 4, 1, 6, 3, 1, 1))
df %>% top_n(1) %>% top_n(-1) ## returns only max value
Thanks
Not really involving top_n(), but you can try:
df %>%
arrange(x) %>%
slice(c(1, n()))
x
1 1
2 10
Or:
df %>%
slice(which(x == max(x) | x == min(x))) %>%
distinct()
Or (provided by @Gregor):
df %>%
slice(c(which.min(x), which.max(x)))
Or using filter():
df %>%
filter(x %in% range(x) & !duplicated(x))
Idea similar to @Jakub's answer with purrr::map_dfr
library(tidyverse) # dplyr and purrr for map_dfr
df %>%
map_dfr(c(1, -1), top_n, wt = x, x = .)
# x
# 1 10
# 2 1
# 3 1
# 4 1
Here is an option with top_n where we pass a logical vector that is TRUE for the min/max values (found with range), and then keep the distinct rows, because the extremes can be tied (here the minimum appears more than once).
library(dplyr)
df %>%
top_n(x %in% range(x), 1) %>%
distinct
# x
#1 10
#2 1
I like @tmfmnk's answer. If you want to use the top_n function, you can do this:
df <- data.frame(x = c(10, 4, 1, 6, 3, 1, 1))
bind_rows(
df %>% top_n(1),
df %>% top_n(-1)
)
# this solution addresses the specification in comments
df %>%
group_by(y) %>%
summarise(min = min(x),
max = max(x),
average = mean(x))
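Since top_n() has been superseded, the bind_rows() approach can be sketched with slice_max() and slice_min() instead (assuming a dplyr version that provides them; the name extremes is illustrative). Setting with_ties = FALSE returns a single row per extreme, so no distinct() step is needed.

```r
library(dplyr)

df <- data.frame(x = c(10, 4, 1, 6, 3, 1, 1))

# One row for the maximum and one for the minimum, ignoring ties.
extremes <- bind_rows(
  slice_max(df, x, n = 1, with_ties = FALSE),
  slice_min(df, x, n = 1, with_ties = FALSE)
)

extremes$x  # 10 1
```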

use dplyr with missing data

Once again a continuation of my previous 2 questions, but a slightly different problem. Another wrinkle in the data I have been working with:
date <- c("2016-03-24","2016-03-24","2016-03-24","2016-03-24","2016-03-24",
"2016-03-24","2016-03-24","2016-03-24","2016-03-24")
location <- c(1,1,2,2,3,3,4,"out","out")
sensor <- c(1,16,1,16,1,16,1,1,16)
Temp <- c(35,34,92,42,21,47,42,63,12)
df <- data.frame(date,location,sensor,Temp)
Some of my data have missing values. They are not indicated by NA. They are just not in the data period.
I want to subtract location "out" from location "4" ignoring the other locations and I want to do it by date and sensor. I have successfully done this with data locations that have all the data with the following code
df %>%
filter(location %in% c(4, 'out')) %>%
group_by(date, sensor) %>%
summarize(Diff = Temp[location=="4"] - Temp[location=="out"],
location = first(location)) %>%
select(1, 2, 4, 3)
However, for data with a missing date I get the following error: "Error: expecting a single value". I think this is because dplyr does not know what to do when it reaches a missing data point.
Doing some research, it seems like do is the way to go, but it returns a data frame without any of the values subtracted from one another.
df %>%
filter(location %in% c(4, 'out')) %>%
group_by(date, sensor) %>%
do(Diff = Temp[location=="4"] - Temp[location=="out"],
location = first(location)) %>%
select(1, 2, 4, 3)
Is there a way to override dplyr and tell it to return NA if it can't find one of the entries to subtract?
library(tidyverse)
date <- c("2016-03-24", "2016-03-24", "2016-03-24", "2016-03-24", "2016-03-24",
"2016-03-24", "2016-03-24", "2016-03-24", "2016-03-24")
location <- c(1, 1, 2, 2, 3, 3, 4, "out", "out")
sensor <- c(1, 16, 1, 16, 1, 16, 1, 1, 16)
Temp <- c(35, 34, 92, 42, 21, 47, 42, 63, 12)
df <- data_frame(date, location, sensor, Temp)
# edge case helper
`%||0%` <- function (x, y) { if (is.null(x) || length(x) == 0) y else x }
df %>%
filter(location %in% c(4, 'out')) %>%
mutate(location=factor(location, levels=c("4", "out"))) %>% # make location a factor
arrange(sensor, location) %>% # order 4 before out, so diff() returns out - 4
group_by(date, sensor) %>%
summarize(Diff = diff(Temp) %||0% NA, location = first(location)) %>% # deal with the edge case
select(1, 2, 4, 3)
## Source: local data frame [2 x 4]
## Groups: date [1]
##
## date sensor location Diff
## <chr> <dbl> <fctr> <dbl>
## 1 2016-03-24 1 4 21
## 2 2016-03-24 16 out NA
If we want to return NA, the possible option is
library(dplyr)
df %>%
filter(location %in% c(4, 'out')) %>%
group_by(date, sensor) %>%
arrange(sensor, location) %>%
summarise(Diff = if(n()==1) NA else diff(Temp), location = first(location)) %>%
select(1, 2, 4, 3)
# date sensor location Diff
# <fctr> <dbl> <fctr> <dbl>
#1 2016-03-24 1 4 21
#2 2016-03-24 16 out NA
and an equivalent option in data.table is
library(data.table)
setDT(df)[location %in% c(4, 'out')][
order(sensor, location), .(Diff = if(.N==1) NA_real_ else diff(Temp),
location = location[1]), .(date, sensor)][, c(1, 2, 4, 3), with = FALSE]
# date sensor location Diff
#1: 2016-03-24 1 4 21
#2: 2016-03-24 16 out NA
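A minimal sketch of another way to get NA for incomplete groups, assuming it is acceptable to wrap each lookup in a small helper. The helper first_or_na and the name result are hypothetical. Unlike the diff()-based answers, which compute out minus 4 (21), this keeps the question's original direction, location 4 minus out, so the complete group gives -21.

```r
library(dplyr)

df <- data.frame(
  date = rep("2016-03-24", 9),
  location = c(1, 1, 2, 2, 3, 3, 4, "out", "out"),
  sensor = c(1, 16, 1, 16, 1, 16, 1, 1, 16),
  Temp = c(35, 34, 92, 42, 21, 47, 42, 63, 12),
  stringsAsFactors = FALSE
)

# Return NA instead of a zero-length vector when a reading is absent.
first_or_na <- function(x) if (length(x) == 0) NA_real_ else x[1]

result <- df %>%
  filter(location %in% c("4", "out")) %>%
  group_by(date, sensor) %>%
  summarise(
    Diff = first_or_na(Temp[location == "4"]) -
      first_or_na(Temp[location == "out"]),
    .groups = "drop"
  )

result$Diff  # -21 NA
```

Because each side of the subtraction always has length one, summarise() never sees a zero-length result, which is what triggered the original error.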
