Performing logical tests against multiple rows in R - r

I'm fairly new to R and struggling to get this. The type of problem I'm trying to address involves one data frame containing books and the start and end pages of a particular chapter.
book <- c("Dune", "LOTR", "LOTR", "OriginOfSpecies", "OldManSea")
chapt.start <- c(300, 8, 94, 150, 600)
chapt.end <- c(310, 19, 110, 158, 630)
df1 <- data.frame(book, chapt.start, chapt.end)
df1
book chapt.start chapt.end
1 Dune 300 310
2 LOTR 8 19
3 LOTR 94 110
4 OriginOfSpecies 150 158
5 OldManSea 600 630
My second dataframe contains a list of book titles and a single page.
title <- c("LOTR", "LOTR", "LOTR", "OriginOfSpecies", "OldManSea", "OldManSea")
page <- c(4, 12, 30, 200, 620, 650)
df2 <- data.frame(title, page)
df2
title page
1 LOTR 4
2 LOTR 12
3 LOTR 30
4 OriginOfSpecies 200
5 OldManSea 620
6 OldManSea 650
What I'm trying to ask, for each row in df1, is whether df2 contains any row with the corresponding book title and a page within the chapter, i.e. df2$title == df1$book & df2$page > df1$chapt.start & df2$page < df1$chapt.end.
The desired output for these data would be FALSE, TRUE, FALSE, FALSE, TRUE
Is this best approached with some kind of for loop, ifelse, sapply, or something different? Thanks for your help, people!

This is a range-based join. There are three good ways to do this in R. All of these return the page number itself instead of TRUE/FALSE; it should be straightforward to convert to logical with something like !is.na(page).
sqldf
library(sqldf)
sqldf(
  "select df1.*, df2.page
   from df1
   left join df2 on df1.book = df2.title
     and df2.page between df1.[chapt.start] and df1.[chapt.end]")
# book chapt.start chapt.end page
# 1 Dune 300 310 NA
# 2 LOTR 8 19 12
# 3 LOTR 94 110 NA
# 4 OriginOfSpecies 150 158 NA
# 5 OldManSea 600 630 620
fuzzyjoin
(Edited out, see IanCampbell's answer below.)
data.table
library(data.table)
DT1 <- as.data.table(df1)
DT2 <- as.data.table(df2)
DT2[, p2 := page][DT1, on = .(title == book, p2 >= chapt.start, p2 <= chapt.end)]
# title page p2 p2.1
# <char> <num> <num> <num>
# 1: Dune NA 300 310
# 2: LOTR 12 8 19
# 3: LOTR NA 94 110
# 4: OriginOfSpecies NA 150 158
# 5: OldManSea 620 600 630
The reason I add p2 as a copy of page is that in a data.table non-equi (range) join, the columns used in the inequality conditions come back holding the bounds from the other table, so without the copy we would lose the original page values.
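To get the TRUE/FALSE vector the question asks for, here is a minimal sketch on top of the data.table result (with this sample data each chapter matches at most one page, so the join returns exactly one row per row of df1):
res <- DT2[, p2 := page][DT1, on = .(title == book, p2 >= chapt.start, p2 <= chapt.end)]
!is.na(res$page)
# [1] FALSE  TRUE FALSE FALSE  TRUE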

You're looking for a non-equi join. This can be accomplished in many ways, but I prefer the fuzzyjoin package:
library(fuzzyjoin)
fuzzy_left_join(df1, df2,
by = c("book" = "title", "chapt.start" = "page", "chapt.end" = "page"),
match_fun = c(`==`, `<=`, `>=`))
book chapt.start chapt.end title page
1 Dune 300 310 <NA> NA
2 LOTR 8 19 LOTR 12
3 LOTR 94 110 <NA> NA
4 OriginOfSpecies 150 158 <NA> NA
5 OldManSea 600 630 OldManSea 620
From here it's easy to get to the desired output:
library(dplyr)
fuzzy_left_join(df1, df2,
by = c("book" = "title", "chapt.start" = "page", "chapt.end" = "page"),
match_fun = c(`==`, `<=`, `>=`)) %>%
mutate(result = !is.na(page)) %>%
select(-c(title,page))
book chapt.start chapt.end result
1 Dune 300 310 FALSE
2 LOTR 8 19 TRUE
3 LOTR 94 110 FALSE
4 OriginOfSpecies 150 158 FALSE
5 OldManSea 600 630 TRUE

Using dplyr only, i.e. without purrr or fuzzyjoin:
library(dplyr)
df2 %>%
  right_join(df1 %>% mutate(id = row_number()), by = c("title" = "book")) %>%
  group_by(id, title) %>%
  summarise(desired = ifelse(is.na(as.logical(sum(chapt.start <= page & page <= chapt.end))),
                             FALSE,
                             as.logical(sum(chapt.start <= page & page <= chapt.end))))
# A tibble: 5 x 3
# Groups: id [5]
id title desired
<int> <chr> <lgl>
1 1 Dune FALSE
2 2 LOTR TRUE
3 3 LOTR FALSE
4 4 OriginOfSpecies FALSE
5 5 OldManSea TRUE
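The ifelse/sum construction can be written more compactly with any(); a hedged sketch of the same pipeline (na.rm = TRUE turns the all-NA comparison produced by the unmatched Dune row into FALSE):
library(dplyr)
df2 %>%
  right_join(df1 %>% mutate(id = row_number()), by = c("title" = "book")) %>%
  group_by(id, title) %>%
  summarise(desired = any(chapt.start <= page & page <= chapt.end, na.rm = TRUE),
            .groups = "drop")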

Another approach, using purrr, without joining the data.
Create a logical check variable for df1:
library(dplyr)
library(purrr)
# This function is designed to take ..., which is a row of data from pmap,
# and then look up whether any record in df2 matches the conditions
look_up_check_df1 <- function(..., page_df) {
  book_record <- tibble(...)
  any_record <- page_df %>%
    filter(title == book_record[["book"]],
           page >= book_record[["chapt.start"]],
           page <= book_record[["chapt.end"]])
  nrow(any_record) > 0
}
df1$check <- pmap_lgl(df1, look_up_check_df1, page_df = df2)
df1
#> book chapt.start chapt.end check
#> 1 Dune 300 310 FALSE
#> 2 LOTR 8 19 TRUE
#> 3 LOTR 94 110 FALSE
#> 4 OriginOfSpecies 150 158 FALSE
#> 5 OldManSea 600 630 TRUE
The same logic, applied to df2:
# If the check is for df2, the function just needs a small revision
look_up_check <- function(..., book_chapters_df) {
  page_record <- tibble(...)
  any_record <- book_chapters_df %>%
    filter(book == page_record[["title"]],
           chapt.start <= page_record[["page"]],
           chapt.end >= page_record[["page"]])
  nrow(any_record) > 0
}
# Run pmap_lgl, which passes each row of df2 into look_up_check
# and returns a vector of logical TRUE/FALSE
df2$check <- pmap_lgl(df2, look_up_check, book_chapters_df = df1)
df2
#> title page check
#> 1 LOTR 4 FALSE
#> 2 LOTR 12 TRUE
#> 3 LOTR 30 FALSE
#> 4 OriginOfSpecies 200 FALSE
#> 5 OldManSea 620 TRUE
#> 6 OldManSea 650 FALSE
Created on 2021-04-12 by the reprex package (v1.0.0)
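For comparison, the same per-row check can be done in base R without joins or purrr; a minimal sketch using mapply over the columns of df1:
has_page <- mapply(function(b, s, e) {
  any(df2$title == b & df2$page >= s & df2$page <= e)
}, df1$book, df1$chapt.start, df1$chapt.end, USE.NAMES = FALSE)
has_page
# [1] FALSE  TRUE FALSE FALSE  TRUE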

Related

Rowwise logical operation with tidy selection of input variables - dplyr

I'm looking for a more eloquent way to write R code for a kind of case that I've encountered more than once. Here is an example of the data and some code that accomplishes the result I want:
library(tidyverse)
df <- tibble(id = 1:5, primary_county = 101:105, secondary_county = 201:205)
specific_counties <- c(101, 103, 202, 205)
df |>
mutate(target_area =
primary_county %in% specific_counties | secondary_county %in% specific_counties)
The result is:
# A tibble: 5 × 4
id primary_county secondary_county target_area
<int> <int> <int> <lgl>
1 1 101 201 TRUE
2 2 102 202 TRUE
3 3 103 203 TRUE
4 4 104 204 FALSE
5 5 105 205 TRUE
I want to know if there is a way to get the same result using code that would be more succinct and eloquent if I were dealing with more columns of the "..._county" variety. Specifically, in my code above, the expression %in% specific_counties must be repeated with an | for each extra column I want to handle. Is there a way to not have to repeat this kind of phrase multiple times?
These logical rowwise operations are superbly well handled by dplyr::if_any() or dplyr::if_all():
library(dplyr)
df %>%
mutate(target_area = if_any(ends_with('county'), ~. %in% specific_counties))
# A tibble: 5 × 4
id primary_county secondary_county target_area
<int> <int> <int> <lgl>
1 1 101 201 TRUE
2 2 102 202 TRUE
3 3 103 203 TRUE
4 4 104 204 FALSE
5 5 105 205 TRUE
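if_all() works the same way when every county column has to match; a small sketch of that (hypothetical) stricter requirement:
df %>%
  mutate(all_in_area = if_all(ends_with('county'), ~ . %in% specific_counties))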
We can also use:
purrr::reduce with |,
rowSums with as.logical
purrr::pmap_lgl with any(c(...) %in% x)
library(purrr)
library(dplyr)
df %>%
mutate(target_area = reduce(across(ends_with('county'), ~.x %in% specific_counties),
`|`))
## OR ##
df %>%
mutate(target_area = rowSums(across(ends_with('county'), ~.x %in% specific_counties)) %>%
as.logical)
## OR ##
df %>%
mutate(target_area = pmap_lgl(across(ends_with('county')),
~any(c(...) %in% specific_counties)))
For reference, this other answer of mine shows similar usages for if_any, and reduce(|) in a filter() operation:
R - Remove rows from dataframe that contain only zeros in numeric columns, base R and pipe-friendly methods?
Additional related questions/answers:
Logical function across multiple columns using "any" function
How to create a new column based on if any of a subset of columns are NA with the dplyr
This improves a little on what you have, though I'm not sure how "eloquent" I'd call it:
df %>%
mutate(
target_area = rowSums(
sapply(select(cur_data(), matches("_county")),
`%in%`, specific_counties)) > 0
)
# # A tibble: 5 x 4
# id primary_county secondary_county target_area
# <int> <int> <int> <lgl>
# 1 1 101 201 TRUE
# 2 2 102 202 TRUE
# 3 3 103 203 TRUE
# 4 4 104 204 FALSE
# 5 5 105 205 TRUE
Or you can list the columns explicitly, replacing the select(.., matches(..)) with list(primary_county, secondary_county).
Add as many columns to the list(..) as you want.
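A minimal sketch of that explicit-list variant (same result, assuming only these two county columns):
df %>%
  mutate(
    target_area = rowSums(
      sapply(list(primary_county, secondary_county),
             `%in%`, specific_counties)) > 0
  )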

Join overlapping ranges from two data frames in r

Note: This question was closed as a 'duplicate'. The solutions offered here and here did not answer my question. They showed how to merge when a single entry fell within a range, I'm trying to identify overlapping ranges and joining them. Perhaps my title could have been better...
I have a main data set main_df with a start and end time (in seconds). I would like to see if the time range in main_df falls within a list of ranges in lookup_df, and if so, grab the value from lookup_df. Additionally, if the main_df falls within two different lookup ranges, duplicate the row so each value is represented.***
library(tibble)
main_df <- tibble(start = c(30, 124, 161),
                  end = c(80, 152, 185))
lookup_df <- tibble(start = c(34, 73, 126, 141, 174, 221),
                    end = c(69, 123, 136, 157, 189, 267),
                    value = c('a', 'b', 'b', 'b', 'b', 'a'))
# Do something here to get the following:
> final_df
# A tibble: 4 x 4
start end value notes
<dbl> <dbl> <chr> <chr>
1 30 80 a ""
2 30 80 b "Duplicate because it falls within a and b"
3 124 152 b "Falls within two lookups but both are b"
4 161 185 b ""
***Edit: Looking at the way I've structured the problem...
#Not actual code
left_join(main_df, lookup_df, by = some_range_join_function) %>%
  add_rows(through_some_means)
Rather than having to add a new row I could flip how I'm joining them...
semi_join(lookup_df, main_df, by = some_range_join_function)
You could do some logical comparisons and then handle each case of what should happen when all matches are 'b', when they are 'a' and 'b', etc. This way you could easily add more cases (e.g. both are 'a', or one is 'a' and several are 'b') which you didn't declare in the OP. The approach yields NULL if there are no matches, which gets omitted during rbind.
f <- \(x, y) {
  w <- which((x[1] >= y[, 1] & x[1] <= y[, 2]) | (x[2] >= y[, 1] & x[1] <= y[, 2]))
  if (length(w) > 0) {
    d <- data.frame(t(x), value = cbind(y[w, 3]), notes = '')
    if (length(w) >= 2) {
      if (all(d$value == 'b')) {
        d <- d[!duplicated(d$value), ]
        d$notes[1] <- 'both b'
      } else {
        d$notes[nrow(d)] <- 'a & b'
      }
    }
    d
  }
}
apply(main_df, 1, f, lookup_df, simplify=F) |> do.call(what=rbind)
# start end value notes
# 1 30 80 a
# 2 30 80 b a & b
# 3 124 152 b both b
# 4 161 185 b
Data:
main_df <- structure(list(start = c(2, 30, 124, 161), end = c(1, 80, 152,
185)), row.names = c(NA, -4L), class = "data.frame")
lookup_df <- structure(list(start = c(34, 73, 126, 141, 174, 221), end = c(69,
123, 136, 157, 189, 267), value = c("a", "b", "b", "b", "b",
"a")), row.names = c(NA, -6L), class = "data.frame")
Another option is fuzzyjoin::interval_join:
library(fuzzyjoin)
library(dplyr)
interval_join(main_df, lookup_df, by = c("start", "end"), mode = "inner") %>%
group_by(value, start.x, end.x) %>%
slice(1) %>%
select(start = start.x, end = end.x, value)
# A tibble: 4 × 3
# Groups: value, start, end [4]
start end value
<dbl> <dbl> <chr>
1 30 80 a
2 30 80 b
3 124 152 b
4 161 185 b
You can use foverlaps from data.table for this.
library(data.table)
setDT(main_df) # make it a data.table if needed
setDT(lookup_df) # make it a data.table if needed
setkey(main_df, start, end) # set the keys of 'y'
foverlaps(lookup_df, main_df, nomatch = NULL) # do the lookup
# start end i.start i.end value
# 1: 30 80 34 69 a
# 2: 30 80 73 123 b
# 3: 124 152 126 136 b
# 4: 124 152 141 157 b
# 5: 161 185 174 189 b
Or, to get the cleaned result (the OP's final_df) directly:
unique(foverlaps(lookup_df, main_df, nomatch = NULL)[, .(start, end, value)])
start end value
1: 30 80 a
2: 30 80 b
3: 124 152 b
4: 161 185 b
A possible solution, based on powerjoin:
library(tidyverse)
library(powerjoin)
power_left_join(
main_df, lookup_df,
by = ~ (.x$start <= .y$start & .x$end >= .y$end) |
(.x$start >= .y$start & .x$start <= .y$end) |
(.x$start <= .y$start & .x$end >= .y$start),
keep = "left") %>%
distinct()
#> # A tibble: 4 x 3
#> start end value
#> <dbl> <dbl> <chr>
#> 1 30 80 a
#> 2 30 80 b
#> 3 124 152 b
#> 4 161 185 b
Or using tidyr::crossing:
library(tidyverse)
crossing(main_df, lookup_df,
.name_repair = ~ c("start", "end", "start2", "end2", "value")) %>%
filter((start <= start2 & end >= end2) |
(start >= start2 & start <= end2) | (start <= start2 & end >= start2)) %>%
select(-start2, -end2) %>%
distinct()
#> # A tibble: 4 x 3
#> start end value
#> <dbl> <dbl> <chr>
#> 1 30 80 a
#> 2 30 80 b
#> 3 124 152 b
#> 4 161 185 b
You can use the fuzzyjoin package to join based on intervals with the fuzzyjoin::interval_*_join() functions.
I'll be using an inner join, because if you use a semi join like you propose, you will lose the value col and get just 3 rows.
library(tidyverse)
library(fuzzyjoin)
fuzzyjoin::interval_inner_join(lookup_df, main_df, by = c("start", "end"), type = "any")
#> # A tibble: 5 × 5
#> start.x end.x value start.y end.y
#> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 34 69 a 30 80
#> 2 73 123 b 30 80
#> 3 126 136 b 124 152
#> 4 141 157 b 124 152
#> 5 174 189 b 161 185
As you can see, interval_inner_join() preserves the by cols from both tables, since they are not the same in a fuzzy join. Also, we still have separate rows for those cases in main_df that match multiple cases in lookup_df. Thus, we do some cleanup of the joined table:
interval_inner_join(lookup_df, main_df,
by = c("start", "end"),
type = "any") |>
select(-ends_with(".x")) |> # remove lookup interval cols
distinct() |> # remove duplicate
rename_with(str_remove, ends_with(".y"), "\\.y") # remove suffixes from col names
#> # A tibble: 4 × 3
#> value start end
#> <chr> <dbl> <dbl>
#> 1 a 30 80
#> 2 b 30 80
#> 3 b 124 152
#> 4 b 161 185
Finally, a clarification of terminology: In your question you state you want to join based on the interval from main_df falling within the interval from lookup_df. This is possible by using type = "within" in interval_*_join(). But based on the examples you provide, it appears you want to join based on any overlap. This can be done with type = "any", but it is the default, so you don't need to specify it.
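For completeness, a sketch of the strict-containment variant (under the assumption, following the IRanges convention fuzzyjoin builds on, that the first table supplies the intervals that must be contained):
interval_inner_join(main_df, lookup_df,
                    by = c("start", "end"),
                    type = "within")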

Sampling different x and different sample size in R

Say I have a table like this:
Students   Equipment #
A          101
A          102
A          103
B          104
B          105
B          106
B          107
B          108
C          109
C          110
C          111
C          112
I want to grab equipment # samples from each student in the data frame with varying sample sizes.
For example, I want 1 equipment # from student "A", 2 from student "B", and 3 from student "C". How can I achieve this in R?
This is the code that I have now, but I'm only getting 1 equipment # printed from each student.
students <- unique(df$`Students`)
sample_size <- c(1,2,3)
for (i in students){
s <- sample(df[df$`Students` == i,]$`Equipment #`, size = sample_size, replace = FALSE)
print(s)
}
You can create a dataframe which holds each student and the number of rows to be sampled. Join the data and use sample_n to sample those rows.
library(dplyr)
sample_data <- data.frame(Students = c('A', 'B', 'C'), nr = 1:3)
df %>%
left_join(sample_data, by = 'Students') %>%
group_by(Students) %>%
sample_n(first(nr)) %>%
ungroup() %>%
select(-nr) -> s
s
# Students Equipment
# <chr> <int>
#1 A 102
#2 B 108
#3 B 105
#4 C 110
#5 C 112
#6 C 111
You're close. You need to index the sample_size vector with the loop, otherwise it will just take the first item in the vector for each iteration.
library(dplyr)
# set up data
df <- data.frame(Students = c(rep("A", 3),
rep("B", 5),
rep("C", 4)),
Equipment_num = 101:112)
# create vector of students
students <- df %>%
pull(Students) %>%
unique()
# sample and print
for (i in seq_along(students)) {
p <- df %>%
filter(Students == students[i]) %>%
slice_sample(n = i)
print(p)
}
#> Students Equipment_num
#> 1 A 102
#> Students Equipment_num
#> 1 B 107
#> 2 B 105
#> Students Equipment_num
#> 1 C 109
#> 2 C 110
#> 3 C 112
Created on 2021-08-06 by the reprex package (v2.0.0)
Actually this is a much more elegant and generalizable way to tackle this problem.
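For completeness, the minimal change to the original loop is just to index sample_size alongside students; a sketch, assuming the Equipment_num column name from the reproduced data above:
students <- unique(df$Students)
sample_size <- c(1, 2, 3)
for (i in seq_along(students)) {
  s <- sample(df$Equipment_num[df$Students == students[i]],
              size = sample_size[i], replace = FALSE)
  print(s)
}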

Sort by dates and value R

I have this dataframe:
a <- c(1,2,3,4,5)
b <- c(100, 300, NA, 430, 270)
c <- c('2018-02-01', '2020-02-17', '2019-10-24', '2019-10-24', '2020-02-17')
df <- data.frame(a,b,c)
names(df) <- c('id', "value", "Date")
All I want is to sort by Date and value in descending order. I find that the Date ordering can be reversed using rev, but with the NA the order is not correct:
df[rev(order(df$Date, df$value)),]
The result that I want should be like this:
id value Date
2 2 300 2020-02-17
5 5 270 2020-02-17
4 4 430 2019-10-24
3 3 NA 2019-10-24
1 1 100 2018-02-01
Any suggestions for dealing with the NA in this dataframe without dropping it?
You can try dplyr's arrange(), since it sorts NA values last by default:
library(dplyr)
df %>%
  arrange(desc(Date), desc(value))
gives:
id value Date
1 2 300 2020-02-17
2 5 270 2020-02-17
3 4 430 2019-10-24
4 3 NA 2019-10-24
5 1 100 2018-02-01
Using na.last = FALSE should do the trick:
df[rev(order(df$Date, df$value, na.last = FALSE)),]
If you want to use data.table package, you can do it like this:
library(data.table)
setorderv(df, cols = c("Date","value"), order = -1, na.last=TRUE)

Single row per id to multiple row per id

I'd like to expand observations from single row-per-id to multiple rows-per-id based on a given time interval:
> dput(df)
structure(list(id = c(123, 456, 789), gender = c(0, 1, 1), yr.start = c(2005,
2010, 2000), yr.last = c(2007, 2012, 2000)), .Names = c("id",
"gender", "yr.start", "yr.last"), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -3L))
> df
# A tibble: 3 x 4
id gender yr.start yr.last
<dbl> <dbl> <dbl> <dbl>
1 123 0 2005 2007
2 456 1 2010 2012
3 789 1 2000 2000
I want to get id expanded into one row per year:
> dput(df_out)
structure(list(id = c(123, 123, 123, 456, 456, 456, 789), gender = c(0,
0, 0, 1, 1, 1, 1), yr = c(2005, 2006, 2007, 2010, 2011, 2012,
2000)), .Names = c("id", "gender", "yr"), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -7L))
> df_out
# A tibble: 7 x 3
id gender yr
<dbl> <dbl> <dbl>
1 123 0 2005
2 123 0 2006
3 123 0 2007
4 456 1 2010
5 456 1 2011
6 456 1 2012
7 789 1 2000
I know how to melt/reshape, but I'm not sure how I can expand the years.
Thanks.
Here is a base R method.
# expand years to a list
yearList <- mapply(":", df$yr.start, df$yr.last)
Now, use this list to calculate the number of rows to repeat for each ID (the second argument of rep) and then append it as a vector (transformed from list with unlist) using cbind.
# get data.frame
cbind(df[rep(seq_along(df$id), lengths(yearList)), c("id", "gender")], yr=unlist(yearList))
id gender yr
1 123 0 2005
1.1 123 0 2006
1.2 123 0 2007
2 456 1 2010
2.1 456 1 2011
2.2 456 1 2012
3 789 1 2000
You could gather into long format and then fill in the missing rows via complete using tidyr.
library(dplyr)
library(tidyr)
df %>%
gather(group, yr, starts_with("yr") ) %>%
group_by(id, gender) %>%
complete(yr = full_seq(yr, period = 1) )
You can use select to get rid of the extra column.
df %>%
gather(group, yr, starts_with("yr") ) %>%
select(-group) %>%
group_by(id, gender) %>%
complete(yr = full_seq(yr, period = 1) )
# A tibble: 8 x 3
# Groups: id, gender [3]
id gender yr
<dbl> <dbl> <dbl>
1 123 0 2005
2 123 0 2006
3 123 0 2007
4 456 1 2010
5 456 1 2011
6 456 1 2012
7 789 1 2000
8 789 1 2000
Here is a tidyverse solution
library(tidyverse)
df %>%
group_by(id, gender) %>%
nest() %>%
mutate(data = map(data, ~ seq(.x$yr.start, .x$yr.last))) %>%
unnest() %>%
rename(year = data)
# A tibble: 7 x 3
id gender year
<dbl> <dbl> <int>
1 123 0 2005
2 123 0 2006
3 123 0 2007
4 456 1 2010
5 456 1 2011
6 456 1 2012
7 789 1 2000
As the OP mentions that his production data set has more than 1 M rows and he is benchmarking the different solutions, it might be worthwhile to try a data.table version:
library(data.table) # CRAN version 1.10.4 used
data.table(DF)[, .(yr = yr.start:yr.last), by = .(id, gender)]
which returns
id gender yr
1: 123 0 2005
2: 123 0 2006
3: 123 0 2007
4: 456 1 2010
5: 456 1 2011
6: 456 1 2012
7: 789 1 2000
If there are more non-varying columns than just gender it might be more efficient to do a join rather than including all those columns in the grouping parameter by =:
DT <- data.table(DF)
DT[DT[, .(yr = yr.start:yr.last), by = id], on = "id"]
id gender yr.start yr.last yr
1: 123 0 2005 2007 2005
2: 123 0 2005 2007 2006
3: 123 0 2005 2007 2007
4: 456 1 2010 2012 2010
5: 456 1 2010 2012 2011
6: 456 1 2010 2012 2012
7: 789 1 2000 2000 2000
Note that both approaches assume that id is unique in the input data.
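If id were not unique, one hedged workaround is to group on a temporary per-row key instead (a sketch; row_id is a hypothetical helper column):
DT <- data.table(DF)
DT[, row_id := .I]  # one key per input row
DT[, .(yr = yr.start:yr.last), by = .(row_id, id, gender)][, row_id := NULL][]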
Benchmarking
The OP has noted that he is surprised that the above data.table solution is five times slower than lmo's base R solution, apparently with the OP's production data set of more than 1 M rows.
Also, the question has attracted 5 different answers plus additional suggestions. So, it's worthwhile to compare the solutions in terms of processing speed.
Data
As the production data set isn't available, and problem size, among other factors like the structure of the data, matters for benchmarking, sample data sets are created.
# parameters
n_rows <- 1E2
yr_range <- 10L
start_yr <- seq(2000L, length.out = 10L, by = 1L)
# create sample data set
set.seed(123L)
library(data.table)
DT <- data.table(id = seq_len(n_rows),
gender = sample(0:1, n_rows, replace = TRUE),
yr.start = sample(start_yr, n_rows, replace = TRUE))
DT[, yr.last := yr.start + sample(0:yr_range, n_rows, replace = TRUE)]
DF <- as.data.frame(DT)
str(DT)
Classes ‘data.table’ and 'data.frame': 100 obs. of 4 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ gender : int 0 1 0 1 1 0 1 1 1 0 ...
$ yr.start: int 2005 2003 2004 2009 2004 2008 2009 2006 2004 2001 ...
$ yr.last : int 2007 2013 2010 2014 2008 2017 2013 2009 2005 2002 ...
- attr(*, ".internal.selfref")=<externalptr>
For the first run, 100 rows are created, the start year can vary between 2000 and 2009, and the span of years an individual id can cover is between 0 and 10 years. Thus, the result set should be expected to have approximately 100 * (10 / 2 + 1) = 600 rows.
Also, only one additional column gender is included, although the OP has said that the production data may have 2 to 10 non-varying columns.
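The exact expected size for a given sample can be checked directly; a quick sketch:
# total number of output rows for this particular sample
sum(DT$yr.last - DT$yr.start + 1L)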
Code
library(magrittr)
bm <- microbenchmark::microbenchmark(
lmo = {
yearList <- mapply(":", DF$yr.start, DF$yr.last)
res_lmo <- cbind(DF[rep(seq_along(DF$id), lengths(yearList)), c("id", "gender")],
yr=unlist(yearList))
},
hao = {
res_hao <- DF %>%
dplyr::group_by(id, gender) %>%
tidyr::nest() %>%
dplyr::mutate(data = purrr::map(data, ~ seq(.x$yr.start, .x$yr.last))) %>%
tidyr::unnest() %>%
dplyr::rename(yr = data)
},
aosmith = {
res_aosmith <- DF %>%
tidyr::gather(group, yr, dplyr::starts_with("yr") ) %>%
dplyr::select(-group) %>%
dplyr::group_by(id, gender) %>%
tidyr::complete(yr = tidyr::full_seq(yr, period = 1) )
},
jason = {
res_jason <- DF %>%
dplyr::group_by(id, gender) %>%
dplyr::do(data.frame(yr=.$yr.start:.$yr.last))
},
uwe1 = {
res_uwe1 <- DT[, .(yr = yr.start:yr.last), by = .(id, gender)]
},
uwe2 = {
res_uwe2 <- DT[DT[, .(yr = yr.start:yr.last), by = id], on = "id"
][, c("yr.start", "yr.last") := NULL]
},
frank1 = {
res_frank1 <- DT[rep(1:.N, yr.last - yr.start + 1L),
.(id, gender, yr = DT[, unlist(mapply(":", yr.start, yr.last))])]
},
frank2 = {
res_frank2 <- DT[, {
m = mapply(":", yr.start, yr.last); c(.SD[rep(.I, lengths(m))], .(yr = unlist(m)))},
.SDcols=id:gender]
},
times = 3L
)
Note that references to tidyverse functions are explicit in order to avoid name conflicts due to a cluttered name space.
First run
Unit: microseconds
expr min lq mean median uq max neval
lmo 655.860 692.6740 968.749 729.488 1125.193 1520.899 3
hao 40610.776 41484.1220 41950.184 42357.468 42619.887 42882.307 3
aosmith 319715.984 336006.9255 371176.437 352297.867 396906.664 441515.461 3
jason 77525.784 78197.8795 78697.798 78869.975 79283.804 79697.634 3
uwe1 834.079 870.1375 894.869 906.196 925.264 944.332 3
uwe2 1796.910 1810.8810 1880.482 1824.852 1922.268 2019.684 3
frank1 981.712 1057.4170 1086.680 1133.122 1139.164 1145.205 3
frank2 994.172 1003.6115 1081.016 1013.051 1124.438 1235.825 3
For the given problem size of 100 rows, the timings clearly indicate that the dplyr/tidyr solutions are orders of magnitude slower than the base R or data.table solutions.
The results are essentially consistent:
all.equal(as.data.table(res_lmo), res_uwe1)
all.equal(res_hao, res_uwe1)
all.equal(res_jason, res_uwe1)
all.equal(res_uwe2, res_uwe1)
all.equal(res_frank1, res_uwe1)
all.equal(res_frank2, res_uwe1)
return TRUE except all.equal(res_aosmith, res_uwe1) which returns
[1] "Incompatible type for column yr: x numeric, y integer"
Second run
Due to the long execution times, the tidyverse solutions are skipped when benchmarking larger problem sizes.
With the modified parameters
n_rows <- 1E4
yr_range <- 100L
the result set is expected to consist of about 500'000 rows.
Unit: milliseconds
expr min lq mean median uq max neval
lmo 425.026101 447.716671 455.85324 470.40724 471.26681 472.12637 3
uwe1 9.555455 9.796163 10.05562 10.03687 10.30571 10.57455 3
uwe2 18.711805 18.992726 19.40454 19.27365 19.75091 20.22817 3
frank1 22.639031 23.129131 23.58424 23.61923 24.05685 24.49447 3
frank2 13.989016 14.124945 14.47987 14.26088 14.72530 15.18973 3
For the given problem size and structure, the data.table solutions are the fastest, while the base R approach is an order of magnitude slower. The most concise solution, uwe1, is also the fastest here.
Note that the results depend on the structure of the data, in particular the parameters n_rows and yr_range and the number of non-varying columns. If there are more of those columns than just gender the timings might look differently.
The benchmark results contradict the OP's observation on execution speed, which needs to be investigated further.
Another way using do in dplyr, but it's slower than the base R method.
df %>%
group_by(id, gender) %>%
do(data.frame(yr=.$yr.start:.$yr.last))
# # A tibble: 7 x 3
# # Groups: id, gender [3]
# id gender yr
# <dbl> <dbl> <int>
# 1 123 0 2005
# 2 123 0 2006
# 3 123 0 2007
# 4 456 1 2010
# 5 456 1 2011
# 6 456 1 2012
# 7 789 1 2000
