use dplyr with missing data - r

Once again, this is a continuation of my previous two questions, but with a slightly different problem: another wrinkle in the data I have been working with.
date <- c("2016-03-24","2016-03-24","2016-03-24","2016-03-24","2016-03-24",
"2016-03-24","2016-03-24","2016-03-24","2016-03-24")
location <- c(1,1,2,2,3,3,4,"out","out")
sensor <- c(1,16,1,16,1,16,1,1,16)
Temp <- c(35,34,92,42,21,47,42,63,12)
df <- data.frame(date,location,sensor,Temp)
Some of my data have missing values. They are not indicated by NA; the rows are simply absent from the data.
I want to subtract location "out" from location "4", ignoring the other locations, and I want to do it by date and sensor. I have done this successfully for locations with complete data, using the following code:
df %>%
filter(location %in% c(4, 'out')) %>%
group_by(date, sensor) %>%
summarize(Diff = Temp[location=="4"] - Temp[location=="out"],
location = first(location)) %>%
select(1, 2, 4, 3)
However, for data with a missing date I get the following error: Error: expecting a single value. I think this is because, when one of the two locations is absent from a group, the subsetted Temp is a zero-length vector and dplyr does not know what to do with it.
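This is easy to reproduce in base R with scratch vectors (for illustration only): subsetting with a condition that matches nothing returns a zero-length vector, and arithmetic on it is also zero-length, which summarize() rejects.
Temp <- 42
location <- "4"
Temp[location == "4"] - Temp[location == "out"] # 42 minus numeric(0)
## numeric(0)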
Doing some research, it seems like do is the way to go, but it returns a data frame without any of the values subtracted from one another.
df %>%
filter(location %in% c(4, 'out')) %>%
group_by(date, sensor) %>%
do(Diff = Temp[location=="4"] - Temp[location=="out"],
location = first(location)) %>%
select(1, 2, 4, 3)
Is there a way to override dplyr and tell it to return NA if it can't find one of the entries to subtract?

library(tidyverse)
date <- c("2016-03-24", "2016-03-24", "2016-03-24", "2016-03-24", "2016-03-24",
"2016-03-24", "2016-03-24", "2016-03-24", "2016-03-24")
location <- c(1, 1, 2, 2, 3, 3, 4, "out", "out")
sensor <- c(1, 16, 1, 16, 1, 16, 1, 1, 16)
Temp <- c(35, 34, 92, 42, 21, 47, 42, 63, 12)
df <- tibble(date, location, sensor, Temp)
# edge-case helper: fall back to y when x is NULL or zero-length
`%||0%` <- function(x, y) { if (is.null(x) || length(x) == 0) y else x }
df %>%
filter(location %in% c(4, 'out')) %>%
mutate(location=factor(location, levels=c("4", "out"))) %>% # make location a factor
arrange(sensor, location) %>% # order it so we can use diff()
group_by(date, sensor) %>%
summarize(Diff = diff(Temp) %||0% NA, location = first(location)) %>% # deal with the edge case
select(1, 2, 4, 3)
## Source: local data frame [2 x 4]
## Groups: date [1]
##
## date sensor location Diff
## <chr> <dbl> <fctr> <dbl>
## 1 2016-03-24 1 4 21
## 2 2016-03-24 16 out NA
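To see what the helper buys us: when a group contains only one of the two locations, diff() returns a zero-length vector, and %||0% swaps it for NA (a minimal illustration of the summarize() line above).
diff(c(42, 63))   ## 21 -- both locations present in the group
diff(42)          ## numeric(0) -- only one location present
diff(42) %||0% NA ## NA, which summarize() accepts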

If we want to return NA, one possible option is
library(dplyr)
df %>%
filter(location %in% c(4, 'out')) %>%
group_by(date, sensor) %>%
arrange(sensor, location) %>%
summarise(Diff = if(n()==1) NA else diff(Temp), location = first(location)) %>%
select(1, 2, 4, 3)
# date sensor location Diff
# <fctr> <dbl> <fctr> <dbl>
#1 2016-03-24 1 4 21
#2 2016-03-24 16 out NA
and an equivalent option in data.table is
library(data.table)
setDT(df)[location %in% c(4, 'out')][
order(sensor, location), .(Diff = if(.N==1) NA_real_ else diff(Temp),
location = location[1]), .(date, sensor)][, c(1, 2, 4, 3), with = FALSE]
# date sensor location Diff
#1: 2016-03-24 1 4 21
#2: 2016-03-24 16 out NA
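Another route that yields NA without any helper is to reshape before subtracting (a sketch, assuming tidyr >= 1.0 for pivot_wider): missing date/sensor/location combinations become NA automatically, and the subtraction propagates them. Note the sign: this computes "4" minus "out" as the question asks, whereas diff(Temp) in the answers above computes "out" minus "4".
library(tidyr)
df %>%
filter(location %in% c(4, "out")) %>%
pivot_wider(names_from = location, values_from = Temp) %>%
mutate(Diff = `4` - `out`)
## sensor 1 gets Diff = -21; sensor 16 has no location "4", so Diff is NA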

Related

identify which group contains sequence of non-zero values

I am trying to identify which groups in a column contain a specific sequence length of non-zero numbers. In the basic example below, where the goal is to find the groups with a non-zero sequence of length 5, only group b would be correct.
set.seed(123)
df <- data.frame(
id = seq(1:40),
grp = sort(rep(letters[1:4], 10)),
x = c(
c(0, sample(1:10, 3), rep(0, 6)),
c(0, 0, sample(1:10, 5), rep(0, 3)),
c(rep(0, 6), sample(1:10, 4)),
c(0, 0, sample(1:10, 3), 0, sample(1:10, 2), 0, 0))
)
One limited solution, using cumsum below to count the non-zero values, does not work when there are breaks in the sequence: with a target length of 5, group d is incorrectly included.
library(dplyr)
df %>%
group_by(grp) %>%
mutate(cc = cumsum(x != 0)) %>% filter(cc == 5) %>% distinct(grp)
The desired output for a sequence length of 5 would identify only group b, not d.
You may use rle to find runs of consecutive non-zero numbers within each group.
library(dplyr)
find_groups <- function(x, n) {
tmp <- rle(x != 0)
any(tmp$lengths[tmp$values] >= n)
}
#apply the function for each group
df %>%
group_by(grp) %>%
dplyr::filter(find_groups(x, 5)) %>%
ungroup %>%
distinct(grp)
# grp
# <chr>
#1 b
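As an illustration of what rle() sees for a group shaped like b (made-up values with the same zero pattern), tmp$lengths[tmp$values] picks out the length of each non-zero run:
rle(c(0, 0, 4, 7, 1, 8, 9, 0, 0, 0) != 0)
## Run Length Encoding
##   lengths: int [1:3] 2 5 3
##   values : logi [1:3] FALSE TRUE FALSE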
In data.table (note this checks for runs of length exactly 5, and would also match a run of five zeros):
library(data.table)
setDT(df)[, .N == 5, .(grp, rleid(!x))][(V1), .(grp)]
grp
1: b
Split the sequence breaks into different groups by adding cumsum(x == 0) to the group_by statement. Then filter for the groups that contain 5 non-zero rows.
library(dplyr)
df %>%
group_by(grp, cumsum(x == 0)) %>%
filter(sum(x != 0) == 5) %>%
ungroup() %>%
distinct(grp)
#> # A tibble: 1 × 1
#> grp
#> <chr>
#> 1 b
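To see why this works, here is cumsum(x == 0) on a short made-up vector: each zero increments the counter, so every unbroken non-zero run falls into its own group.
x <- c(0, 0, 3, 7, 1, 8, 9, 0, 5, 2)
cumsum(x == 0)
## [1] 1 2 2 2 2 2 2 3 3 3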

combine lists nested in a tibble

I want to count the number of unique values in a table that contains several from...to pairs like below:
tmp <- tribble(
~group, ~from, ~to,
1, 1, 10,
1, 5, 8,
1, 15, 20,
2, 1, 10,
2, 5, 10,
2, 15, 18
)
I tried to nest all values in a list for each row (works), but combining these nested lists into one vector and counting the uniques doesn't work as expected.
tmp %>%
group_by(group) %>%
rowwise() %>%
mutate(nrs = list(c(from:to))) %>%
summarise(n_uni = length(unique(unlist(list(nrs)))))
The desired output looks like this:
tibble(group = c(1, 2),
n_uni = c(length(unique(unlist(list(tmp$nrs[tmp$group == 1])))),
length(unique(unlist(list(tmp$nrs[tmp$group == 2]))))))
# # A tibble: 2 × 2
# group n_uni
# <dbl> <int>
#1 1 16
#2 2 14
Any help would be much appreciated!
tmp %>%
rowwise() %>%
mutate(nrs = list(from:to)) %>%
group_by(group) %>%
summarise(n_uni = n_distinct(unlist(nrs)))
The issue with the OP's approach is that rowwise() is the equivalent of a new grouping that drops the initial group_by() step. Thus we have to group_by() after creating nrs in the rowwise step.
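A quick way to confirm this (assuming dplyr >= 1.0): group_vars() reports the grouping before rowwise() but not after.
tmp %>% group_by(group) %>% group_vars()
## [1] "group"
tmp %>% group_by(group) %>% rowwise() %>% group_vars()
## character(0)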

Nested loops in R that create new variable names and lags

I am new to R, but experienced in Stata. To learn R, I am tracking Covid-19 infections. That requires creating seven-day trailing averages, and I do so with the following loop.
for (mylag in c(1:7)) {
data <- data %>% group_by(state) %>% mutate(!!paste0("deathIncrease", "_", mylag) := lag(deathIncrease, mylag)) %>% ungroup()
}
This works, but then I want to run the same code, not just for deaths, but also for cases. So I tried the following.
var_list <- c("deathIncrease", "positiveIncrease")
for (var in var_list) {
for (mylag in c(1:7)) {
var <- enquo(var)
varname <- enquo( paste0(quo_name(var), "_", mylag) )
data <- data %>% group_by(state) %>% mutate(!!varname := lag(!!var, mylag)) %>% ungroup()
}
}
But that leads to the error arg must be a symbol. Any help would be much appreciated. In Stata, loops are simpler. Is there no package that gets R to automatically fill in the looping variables everywhere, like so: {{ var }}?
Edit: here is a minimal working example. The first way to create lags works, but only for var1. The second nested loop does not.
df <- tribble(
~group_var, ~var1, ~var2,
"A", 1, 10,
"A", 2, 11,
"A", 3, 12,
"B", 1, 10,
"B", 2, 11,
"B", 3, 12)
for (mylag in c(1:2)) {
df <- df %>% group_by(group_var) %>% mutate(!!paste0("var1", "_lag", mylag) := lag(var1, mylag)) %>% ungroup()
}
## Another loop
var_list <- c("var1", "var2")
for (myvar in var_list) {
for (mylag in c(1:2)) {
myvar <- enquo(myvar)
varname <- enquo( paste0(quo_name(myvar), "_", mylag) )
data <- data %>% group_by(state) %>% mutate(!!varname := lag(!!myvar, mylag)) %>% ungroup()
}
}
You can use the function get(), as in lag(get(myvar), mylag), to point to the specific column that the string myvar refers to:
for(mylag in 1:7){
for(myvar in c('deathIncrease', 'positiveIncrease')){
data <- data %>%
group_by(state) %>%
mutate(
!!paste0(myvar, '_', mylag) := lag(get(myvar), mylag)
) %>%
ungroup()
}
}
My first solution contained a function that did not respect grouped data. I wanted to look at that anyway, so I spent a bit of time making it respect grouped data as well.
This is my solution now. It works as expected on grouped data, but it feels a bit hacky, to be honest.
add_lag <- function(.data, column, days) {
group <- unlist(groups(.data))
if(is.null(group)){
new <- mapply(function(x, y) {
lag(x, y)
}, x = .data[column], y = sort(rep(days, length(column))))
if(is.null(dim(new))){
new <- t(new)
}
new <- as.data.frame(new, stringsAsFactors = FALSE)
names(new) <- paste0(column, "_", sort(rep(days, length(column))))
new <- as_tibble(new) %>%
select(sort(names(new)))
mutate(.data, !!!new)
} else {
tmp <- .data %>%
nest()
tmp$data <- lapply(tmp$data, function(x,y){
x %>%
add_lag(column,y)
}, y = days)
tmp %>% unnest(c(data))
}
}
df <- tribble(
~group_var, ~var1, ~var2,
"A", 1, 10,
"A", 2, 11,
"A", 3, 12,
"B", 1, 10,
"B", 2, 11,
"B", 3, 12)
df %>%
group_by(group_var) %>%
add_lag("var1", 1:2)
# A tibble: 6 x 5
# Groups: group_var [2]
group_var var1 var2 var1_1 var1_2
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 10 NA NA
2 A 2 11 1 NA
3 A 3 12 2 1
4 B 1 10 NA NA
5 B 2 11 1 NA
6 B 3 12 2 1
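For what it's worth, newer versions of dplyr and rlang make the original loop much simpler: rlang >= 0.4.3 accepts a glue string on the left-hand side of :=, and the .data pronoun replaces get()/enquo(). A sketch against the example df (assuming dplyr >= 1.0):
library(dplyr)
for (myvar in c("var1", "var2")) {
for (mylag in 1:2) {
df <- df %>%
group_by(group_var) %>%
mutate("{myvar}_lag{mylag}" := lag(.data[[myvar]], mylag)) %>%
ungroup()
}
}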

Looking for a dplyr function to apply a filter conditionally

I have a data frame of various hematology values and their collection times. Those values should only be collected at specific times, but occasionally an extra one is added. I want to remove any instances where a value was collected outside the scheduled time.
To illustrate the issue, here's some code to create a very simplified version of the data frame I'm working with (plus some example schedules):
example <- tibble("Parameter" = c(rep("hgb", 3), rep("bili", 3), rep("LDH", 3)),
"Collection" = c(1, 3, 4, 1, 5, 6, 0, 4, 8))
hgb_sampling <- c(1, 4)
bili_sampling <- c(1, 5)
ldh_sampling <- c(0, 4)
So, I need a way to conditionally apply a filter based on the value in the Parameter column. The solution needs to fit into a dplyr pipeline and yield something like this:
filtered <- tibble("Parameter" = c(rep("hgb", 2), rep("bili", 2), rep("LDH", 2)),
"Collection" = c(1, 4, 1, 5, 0, 4))
I've tried a couple of things (they all amount to something like the code below), but the use of Parameter trips things up (inside the braces, Parameter is not evaluated in the context of the data frame, and if() expects a single value):
df <- example %>%
{if (Parameter == "hgb") filter(., Collection %in% hgb_sampling)}
Any suggestions?
You could create a reference tibble, join it with example and keep only selected rows.
library(dplyr)
ref_df <- tibble::tibble(Parameter = c("hgb","bili", "LDH"),
value = list(c(1, 4), c(1, 5), c(0, 4)))
example %>%
inner_join(ref_df, by = 'Parameter') %>%
group_by(Parameter) %>%
filter(Collection %in% unique(unlist(value))) %>%
select(Parameter, Collection)
# Parameter Collection
# <chr> <dbl>
#1 hgb 1
#2 hgb 4
#3 bili 1
#4 bili 5
#5 LDH 0
#6 LDH 4
Put your valid times in a named list whose names match the values in Parameter, then group by Parameter and filter Collection against the corresponding list element in sample_list:
sample_list <- list(hgb = c(1, 4), bili = c(1, 5), LDH = c(0, 4))
example %>%
group_by(Parameter) %>%
filter(Collection %in% sample_list[[first(Parameter)]])
Output:
# A tibble: 6 x 2
Parameter Collection
<chr> <dbl>
1 hgb 1
2 hgb 4
3 bili 1
4 bili 5
5 LDH 0
6 LDH 4
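Here first(Parameter) collapses each group's Parameter column to a single name, so the double-bracket lookup returns that group's schedule:
sample_list[["hgb"]]
## [1] 1 4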
Try purrr::imap_dfr:
library(tidyverse)
example <- tibble("Parameter" = c(rep("hgb", 3), rep("bili", 3), rep("LDH", 3)),
"Collection" = c(1, 3, 4, 1, 5, 6, 0, 4, 8))
l <- list(hgb = c(1, 4), bili = c(1, 5), LDH = c(0, 4))
imap_dfr(l, ~example %>%
filter(Parameter == .y & Collection %in% .x))
# # A tibble: 6 x 2
# Parameter Collection
# <chr> <dbl>
# 1 hgb 1
# 2 hgb 4
# 3 bili 1
# 4 bili 5
# 5 LDH 0
# 6 LDH 4
A simple method that is easy to modify, extend, and debug:
library(dplyr)
example %>%
filter(Parameter=="hgb" & Collection %in% c(1, 4) |
Parameter=="bili" & Collection %in% c(1, 5) |
Parameter=="LDH" & Collection %in% c(0, 4) )
Or if you want the values to be within a range:
example %>%
filter(Parameter=="hgb" & between(Collection, 1, 4) |
Parameter=="bili" & between(Collection, 1, 5) |
Parameter=="LDH" & between(Collection, 0, 4))
One option involving dplyr, stringr and tibble could be:
enframe(mget(ls(pattern = "sampling"))) %>%
mutate(name = str_extract(name, "[^_]+")) %>%
right_join(example %>%
mutate(Parameter = tolower(Parameter)), by = c("name" = "Parameter")) %>%
filter(Collection %in% unlist(value)) %>%
select(-value)
name Collection
<chr> <dbl>
1 hgb 1
2 hgb 4
3 bili 1
4 bili 5
5 ldh 0
6 ldh 4
If stored in a separate df as shown by #Ronak Shah, then you can do:
example %>%
# compare each row's Collection against its own Parameter's schedule;
# unlist()-ing all matched values at once would pool the sets together
filter(mapply(`%in%`, Collection, ref_df$value[match(Parameter, ref_df$Parameter)]))
An additional solution:
library(tidyverse)
library(purrr)
fltr <- list(hgb = c(1, 4), bili = c(1, 5), LDH = c(0,4)) %>%
enframe(name = "Parameter")
example %>%
group_by(Parameter) %>%
nest() %>%
left_join(fltr) %>%
mutate(out = map2(.x = data, .y = value, .f = ~ filter(.x, Collection %in% .y))) %>%
unnest(out) %>%
select(Parameter, Collection)

Top_n return both max and min value - R

Is it possible for the top_n() command to return both the max and the min value at the same time?
Using the example from the reference page https://dplyr.tidyverse.org/reference/top_n.html
I tried the following
df <- data.frame(x = c(10, 4, 1, 6, 3, 1, 1))
df %>% top_n(c(1,-1)) ## returns an error
df <- data.frame(x = c(10, 4, 1, 6, 3, 1, 1))
df %>% top_n(1) %>% top_n(-1) ## returns only max value
Thanks
Not really involving top_n(), but you can try:
df %>%
arrange(x) %>%
slice(c(1, n()))
x
1 1
2 10
Or:
df %>%
slice(which(x == max(x) | x == min(x))) %>%
distinct()
Or (provided by #Gregor):
df %>%
slice(c(which.min(x), which.max(x)))
Or using filter():
df %>%
filter(x %in% range(x) & !duplicated(x))
Idea similar to #Jakub's answer with purrr::map_dfr
library(tidyverse) # dplyr, plus purrr for map_dfr
df %>%
map_dfr(c(1, -1), top_n, wt = x, x = .)
# x
# 1 10
# 2 1
# 3 1
# 4 1
Here is an option with top_n where we pass a logical vector that is TRUE for the min/max (computed with range), and then get the distinct rows, since there are ties for the range, i.e. duplicate elements are present:
library(dplyr)
df %>%
top_n(x %in% range(x), 1) %>%
distinct
# x
#1 10
#2 1
I like #tmfmnk's answer. If you want to use the top_n function, you can do this:
df <- data.frame(x = c(10, 4, 1, 6, 3, 1, 1))
bind_rows(
df %>% top_n(1),
df %>% top_n(-1)
)
# this solution addresses the specification in comments
df %>%
group_by(y) %>%
summarise(min = min(x),
max = max(x),
average = mean(x))
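As a side note, top_n() has since been superseded; in dplyr >= 1.0 the same min-plus-max result can be written with slice_min()/slice_max() (a sketch):
library(dplyr)
bind_rows(
df %>% slice_max(x, n = 1),
df %>% slice_min(x, n = 1, with_ties = FALSE) # drop the tied 1s, keep one row
)
# x
# 1 10
# 2 1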
