I have a dataframe that looks like this
df <- data.frame("Month" = c("April","April","May","May","June","June","June"),
"ID" = c(11, 11, 12, 10, 11, 11, 11),
"Region" = c("East", "West", "North", "East", "North" ,"East", "West"),
"Qty" = c(120, 110, 110, 110, 100, 90, 70),
"Sales" = c(1000, 1100, 900, 1000, 1000, 800, 650),
"Leads" = c(10, 12, 9, 8, 6, 5, 4))
Month ID Region Qty Sales Leads
April 11 East 120 1000 10
April 11 West 110 1100 12
May 12 North 110 900 9
May 10 East 110 1000 8
June 11 North 100 1000 6
June 11 East 90 800 5
June 11 West 70 650 4
I want a dataframe that looks like this
Month ID Qty Sales Leads Region
April 11 230 2100 22 East
May 12 110 900 9 North
May 10 110 1000 8 East
June 11 260 2450 15 North
I am using a the following code
result <- df %>% group_by(Month, ID) %>% mutate(across(.cols = Qty:Leads, ~sum(.x, na.rm = T))) %>% slice(n = 1)
result$Region <- NULL
I have over 2 million such rows and it is taking forever to calculate the aggregate.
I am using mutate and slice instead of summarize because the df is arranged in a certain way and I want to retain the Region in that first row.
However I think there could be a more efficient way. Please help on both. Can't figure it out for the life of me.
summarize makes more sense to me than mutate and slice. This should save you some time.
library(dplyr)
result <- df %>%
group_by(Month, ID) %>%
summarize(across(.cols = Qty:Leads, ~sum(.x, na.rm = T)),
Region = first(Region))
result
# # A tibble: 4 x 6
# # Groups: Month [3]
# Month ID Qty Sales Leads Region
# <chr> <dbl> <dbl> <dbl> <dbl> <chr>
# 1 April 11 230 2100 22 East
# 2 June 11 260 2450 15 North
# 3 May 10 110 1000 8 East
# 4 May 12 110 900 9 North
Here is a data.table solution.
library(data.table)
setDT(df)
cols <- c("Qty", "Sales", "Leads")
df[, c(lapply(.SD, sum, na.rm = TRUE),
Region = first(Region)), .SDcols = cols,
by = .(Month, ID)][]
# Month ID Qty Sales Leads Region
# 1: April 11 230 2100 22 East
# 2: May 12 110 900 9 North
# 3: May 10 110 1000 8 East
# 4: June 11 260 2450 15 North
We can apply generic speed-up strategies:
Do less
Choose an appropriate back-end
Use appropriate data structures
dplyr provides syntactic sugar for data manipulation, but may not be the most efficient when it comes to handling large data sets.
solution 1
We could rewrite the code slightly to be more efficient by using the collapse package, which provides a C++ interface to dplyr functions. It prepends dplyr functions with f, with one exception fsubset which is similar to dplyr::filter (or base R subset).
library(collapse)
df |>
fgroup_by(Month, ID) |>
fsummarise(Qty = fsum(Qty),
Sales = fsum(Sales),
Leads = fsum(Leads),
Region = fsubset(Region, 1L),
keep.group_vars = T) |>
as_tibble() # optional
#> # A tibble: 4 x 6
#> Month ID Qty Sales Leads Region
#> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 April 11 230 2100 22 East
#> 2 June 11 260 2450 15 North
#> 3 May 10 110 1000 8 East
#> 4 May 12 110 900 9 North
Where |> (Requires R version > 3.5) is a slightly faster pipe than %>%. Its result is ungrouped.
solution 2
data.table is often lauded for its speed, memory use and utility. The easiest conversion from existing dplyr code to use data.table is using the dtplyr package, which ships with tidyverse. We can convert it by adding two lines of code.
library(dtplyr)
df1 <- lazy_dt(df)
df1 %>%
group_by(Month, ID) %>%
summarize(across(.cols = Qty:Leads, ~sum(.x, na.rm = T)),
Region = first(Region)) %>%
as_tibble() # or data.table()
Note that this results is an ungrouped data.frame at the end.
Benchmarks
Approaches are put in wrapper functions. dplyr here is www's approach. All approaches outputting is a tibble.
bench::mark(collapse = collapse(df), dplyr = dplyr(df), dtplyr = dtplyr(df),
time_unit = "ms", iterations = 200)[c(1, 3,5,7)]
# A tibble: 3 x 4
expression median mem_alloc n_itr
<bch:expr> <dbl> <bch:byt> <int>
1 collapse 0.316 0B 200
2 dplyr 5.42 8.73KB 195
3 dtplyr 6.67 120.21KB 196
We can see that collapse is more memory efficient, and significantly faster compared to dplyr. dtplyr approach is included here, as its time complexity is different than that of dplyr and its convenience to rewrite.
Per #www's request, an inclusion of pure data.table approach, wrapper functions rewritten for brevity. Input/ Output is a data.frame for collapse and a data.table for data.table respectively.
data.table = \(x){setDT(x); cols = c("Qty", "Sales", "Leads");x[, c(lapply(.SD, sum, na.rm = T), Region = first(Region)), .SDcols = cols, by = .(Month, ID)][]}
# retainig the `|>` pipes for readability, impact is ~4us.
collapse = \(x) x|>fgroup_by(Month, ID)|>fsummarise(Qty = fsum(Qty),Sales = fsum(Sales),Leads = fsum(Leads),Region = fsubset(Region, 1L),keep.group_vars = T)
dt <- as.data.table(df)
bench::mark(collapse(df), iterations = 10e3)[c(1,3,5,7)] ; bench::mark(data.table(dt), iterations = 10e3)[c(1,3,5,7)]
expression median mem_alloc n_itr
<bch:expr> <bch:tm> <bch:byt> <int>
1 collapse(df) 150us 0B 9988
2 data.table(dt) 796us 146KB 9939
The difference between collapse and pure data.table, for such a small dataset, is negligible. The reason for speed increase is likely the use of fsum instead of base R sum.
Related
I have a dataframe like the one below...
df <- data.frame(row.names = c(1,2,3,4,5,6,7,8), Week = c(1,1,2,2,52,52,53,53), State = c("Florida", "Georgia","Florida", "Georgia","Florida", "Georgia","Florida", "Georgia"), Count_2001 = c(25,16,83,45,100,98,22,34), Count_2002 = c(3, 78, 22, 5, 78, 6, 88, 97))
I am now trying to manipulate this dataset such that only weeks 52 and 53 get summed together for each state in the list, across all of the Count columns. Similar to this example.. GROUP BY for specific rows
The new dataset should have these rows summed together to create the new Week 52 row for each state, like this example below...
df2 <- data.frame(row.names = c(1,2,3,4,5,6), Week = c(1,1,2,2,52,52), State = c("Florida", "Georgia","Florida", "Georgia","Florida", "Georgia"), Count_2001 = c(25,16,83,45,122,132), Count_2002 = c(3, 78, 22, 5, 166, 103))
Is there an easy solution for this in R?
Change your 53s to 52s and do a sum by group:
library(dplyr)
df %>%
mutate(Week = case_when(Week == 53 ~ 52, TRUE ~ Week)) %>%
group_by(State, Week) %>%
summarize(across(everything(), sum))
# # A tibble: 6 x 4
# # Groups: State [2]
# State Week Count_2001 Count_2002
# <chr> <dbl> <dbl> <dbl>
# 1 Florida 1 25 3
# 2 Florida 2 83 22
# 3 Florida 52 122 166
# 4 Georgia 1 16 78
# 5 Georgia 2 45 5
# 6 Georgia 52 132 103
Using aggregate.
s <- 52:53
tp <- transform(aggregate(cbind(Count_2001, Count_2002) ~ State, df[df$Week %in% s, ], sum),
Week=52)
df <- merge(df[!df$Week %in% s, ], tp, all=T)
df
# Week State Count_2001 Count_2002
# 1 1 Florida 25 3
# 2 1 Georgia 16 78
# 3 2 Florida 83 22
# 4 2 Georgia 45 5
# 5 52 Florida 122 166
# 6 52 Georgia 132 103
A simple alternative to using anything state specific would just be to create a new column with weeks at the level of aggregation that works!
I'd get this by doing: (using the tidyverse library)
df <- df %>%
mutate(week1 = if_else(week %in% c(52,53),52,week)
and then you can summate as
dfsumm <- df %>%
group_by(state, week1)%>%
summarise()
I have a question that I find kind of hard to explain with a MRE and in an easy
way to answer, mostly because I don't fully understand where the problem lies
myself. So that's my sorry for being vague preamble.
I have a tibble with many sample and reference measurements, for which I want
to do some linear interpolation for each sample. I do this now by taking out
all the reference measurements, rescaling them to sample measurements using
approx, and then patching it back in. But because I take it out first, I
cannot do it nicely in a group_by dplyr pipe way. right now I do it with a
really ugly workaround where I add empty (NA) newly created columns to the
sample tibble, then do it with a for-loop.
So my question is really: how can I implement the approx part within groups
into the pipe, so that I can do everything within groups? I've experimented
with dplyr::do(), and ran into the vignette on "programming with dplyr", but
searching mostly gives me broom::augment and lm stuff that I think operates
differently... (e.g. see
Using approx() with groups in dplyr). This thread also seems promising: How do you use approx() inside of mutate_at()?
Somebody on irc recommended using a conditional mutate, with case_when, but I
don't fully understand where and how within this context yet.
I think the problem lies in the fact that I want to filter out part of the data
for the following mutate operations, but the mutate operations rely on the
grouped data that I just filtered out, if that makes any sense.
Here's a MWE:
library(tidyverse) # or just dplyr, tibble
# create fake data
data <- data.frame(
# in reality a dttm with the measurement time
timestamp = c(rep("a", 7), rep("b", 7), rep("c", 7)),
# measurement cycle, normally 40 for sample, 41 for reference
cycle = rep(c(rep(1:3, 2), 4), 3),
# wheather the measurement is a reference or a sample
isref = rep(c(rep(FALSE, 3), rep(TRUE, 4)), 3),
# measurement intensity for mass 44
r44 = c(28:26, 30:26, 36, 33, 31, 38, 34, 33, 31, 18, 16, 15, 19, 18, 17)) %>%
# measurement intensity for mass 45, normally also masses up to mass 49
mutate(r45 = r44 + rnorm(21, 20))
# of course this could be tidied up to "intensity" with a new column "mass"
# (44, 45, ...), but that would make making comparisons even harder...
# overview plot
data %>%
ggplot(aes(x = cycle, y = r44, colour = isref)) +
geom_line() +
geom_line(aes(y = r45), linetype = 2) +
geom_point() +
geom_point(aes(y = r45), shape = 1) +
facet_grid(~ timestamp)
# what I would like to do
data %>%
group_by(timestamp) %>%
do(target_cycle = approx(x = data %>% filter(isref) %>% pull(r44),
y = data %>% filter(isref) %>% pull(cycle),
xout = data %>% filter(!isref) %>% pull(r44))$y) %>%
unnest()
# immediately append this new column to the original dataframe for all the
# samples (!isref) and then apply another approx for those values.
# here's my current attempt for one of the timestamps
matchref <- function(dat) {
# split the data into sample gas and reference gas
ref <- filter(dat, isref)
smp <- filter(dat, !isref)
# calculate the "target cycle", the points at which the reference intensity
# 44 matches the sample intensity 44 with linear interpolation
target_cycle <- approx(x = ref$r44,
y = ref$cycle, xout = smp$r44)
# append the target cycle to the sample gas
smp <- smp %>%
group_by(timestamp) %>%
mutate(target = target_cycle$y)
# linearly interpolate each reference gas to the target cycle
ref <- ref %>%
group_by(timestamp) %>%
# this is needed because the reference has one more cycle
mutate(target = c(target_cycle$y, NA)) %>%
# filter out all the failed ones (no interpolation possible)
filter(!is.na(target)) %>%
# calculate interpolated value based on r44 interpolation (i.e., don't
# actually interpolate this value but shift it based on the 44
# interpolation)
mutate(r44 = approx(x = cycle, y = r44, xout = target)$y,
r45 = approx(x = cycle, y = r45, xout = target)$y) %>%
select(timestamp, target, r44:r45)
# add new reference gas intensities to the correct sample gasses by the target cycle
left_join(smp, ref, by = c("time", "target"))
}
matchref(data)
# and because now "target" must be length 3 (the group size) or one, not 9
# I have to create this ugly for-loop
# for which I create a copy of data that has the new columns to be created
mr <- data %>%
# filter the sample gasses (since we convert ref to sample)
filter(!isref) %>%
# add empty new columns
mutate(target = NA, r44 = NA, r45 = NA)
# apply matchref for each group timestamp
for (grp in unique(data$timestamp)) {
mr[mr$timestamp == grp, ] <- matchref(data %>% filter(timestamp == grp))
}
Here's one approach that spreads the references and samples to new columns. I drop r45 for simplicity in this example.
data %>%
select(-r45) %>%
mutate(isref = ifelse(isref, "REF", "SAMP")) %>%
spread(isref, r44) %>%
group_by(timestamp) %>%
mutate(target_cycle = approx(x = REF, y = cycle, xout = SAMP)$y) %>%
ungroup
gives,
# timestamp cycle REF SAMP target_cycle
# <fct> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 30 28 3
# 2 a 2 29 27 4
# 3 a 3 28 26 NA
# 4 a 4 27 NA NA
# 5 b 1 31 26 NA
# 6 b 2 38 36 2.5
# 7 b 3 34 33 4
# 8 b 4 33 NA NA
# 9 c 1 15 31 NA
# 10 c 2 19 18 3
# 11 c 3 18 16 2.5
# 12 c 4 17 NA NA
Edit to address comment below
To retain r45 you can use a gather-unite-spread approach like this:
df %>%
mutate(isref = ifelse(isref, "REF", "SAMP")) %>%
gather(r, value, r44:r45) %>%
unite(ru, r, isref, sep = "_") %>%
spread(ru, value) %>%
group_by(timestamp) %>%
mutate(target_cycle_r44 = approx(x = r44_REF, y = cycle, xout = r44_SAMP)$y) %>%
ungroup
giving,
# # A tibble: 12 x 7
# timestamp cycle r44_REF r44_SAMP r45_REF r45_SAMP target_cycle_r44
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 30 28 49.5 47.2 3
# 2 a 2 29 27 48.8 48.7 4
# 3 a 3 28 26 47.2 46.8 NA
# 4 a 4 27 NA 47.9 NA NA
# 5 b 1 31 26 51.4 45.7 NA
# 6 b 2 38 36 57.5 55.9 2.5
# 7 b 3 34 33 54.3 52.4 4
# 8 b 4 33 NA 52.0 NA NA
# 9 c 1 15 31 36.0 51.7 NA
# 10 c 2 19 18 39.1 37.9 3
# 11 c 3 18 16 39.2 35.3 2.5
# 12 c 4 17 NA 39.0 NA NA
I am looking for a way to check wether two columns in a data frame contain the same elements for one or more rows, then eliminate the row containing more NAs.
Lets assume we have a data frame as such:
x <- data.frame("Year" = c(2017,2017,2017,2018,2018),
"Country" = c("Sweden", "Sweden", "Norway", "Denmark", "Finland"),
"Sales" = c(15, 15, 18, 13, 12),
"Campaigns" = c(3, NA, 4, 1, 1),
"Employees" = c(15, 15, 12, 8, 9),
"Satisfaction" = c(0.8, NA, 0.9, 0.95, 0.87),
"Expenses" = c(NA, NA, 9000, 7500, 4300))
Note that the entry for Sweden in the year 2017 is there twice, but the first row has one entry with NA while the other one contains NAs in three places. Now I would like to check wether two rows contain the same "Year" and "Country", then proceed to eliminate the row containing the higher amount of NAs, in this case the second row. I did some research but I could not seem to find a solution for this particular case.
Thank you very much in advance.
Using dplyr:
library(dplyr)
x %>%
mutate(n_na = rowSums(is.na(.))) %>% ## calculate NAs for each row
group_by(Year, Country) %>% ## for each year/country
arrange(n_na) %>% ## sort by number of NAs
slice(1) %>% ## take the first row
select(-n_na) ## remove the NA counter column
# A tibble: 4 x 7
# Groups: Year, Country [4]
Year Country Sales Campaigns Employees Satisfaction Expenses
<dbl> <fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2017 Norway 18 4 12 0.90 9000
2 2017 Sweden 15 3 15 0.80 NA
3 2018 Denmark 13 1 8 0.95 7500
4 2018 Finland 12 1 9 0.87 4300
We can use a data.table approach
library(data.table)
ind <- setDT(x)[, {
i1 <- Reduce(`+`, lapply(.SD, is.na))
.I[i1 > 0 & (i1 == max(i1))]
}, .(Year, Country)]$V1
x[-ind]
# Year Country Sales Campaigns Employees Satisfaction Expenses
#1: 2017 Sweden 15 3 15 0.80 NA
#2: 2017 Norway 18 4 12 0.90 9000
#3: 2018 Denmark 13 1 8 0.95 7500
#4: 2018 Finland 12 1 9 0.87 4300
Base R solution:
x$nas <- rowSums(sapply(x, is.na))
do.call(rbind,
by(x, x[c("Year","Country")],
function(df) head(df[order(df$nas),,drop=FALSE], n=1)))
# Year Country Sales Campaigns Employees Satisfaction Expenses nas
# 4 2018 Denmark 13 1 8 0.95 7500 0
# 5 2018 Finland 12 1 9 0.87 4300 0
# 3 2017 Norway 18 4 12 0.90 9000 0
# 1 2017 Sweden 15 3 15 0.80 NA 1
Not too surprisingly, the data.table implementation is the fast, though I"m a little surprised by how much faster it was than base R. Being a small dataset could affect this. (In the benchmarking, I had to create a copy of the original, since data.table modifies the data in-place, so x is no longer a data.frame.)
microbenchmark(
data.table = {
x0 <- copy(x)
ind <- setDT(x0)[, {
i1 <- Reduce(`+`, lapply(.SD, is.na))
.I[i1 > 0 & (i1 == max(i1))]
}, .(Year, Country)]$V1
x0[-ind]
},
dplyr = {
x %>%
mutate(n_na = rowSums(is.na(.))) %>% ## calculate NAs for each row
group_by(Year, Country) %>% ## for each year/country
arrange(n_na) %>% ## sort by number of NAs
slice(1) %>% ## take the first row
select(-n_na) ## remove the NA counter column
},
base = {
x0 <- x
x0$nas <- rowSums(sapply(x0, is.na))
do.call(rbind,
by(x0, x0[c("Year","Country")],
function(df) head(df[order(df$nas),,drop=FALSE], n=1)))
}
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# data.table 1.223477 1.441005 1.973714 1.582861 1.919090 12.837569 100
# dplyr 2.675239 2.901882 4.465172 3.079295 3.806453 42.261540 100
# base 2.039615 2.209187 2.737758 2.298714 2.570760 8.586946 100
I need to count of future visits by specific customer in the next 7 days. I solved this with purrr:map2 but I'm experiencing very slow performance. I think I must be missing something basic about how to use purrr. How do I speed this up? Thanks
This toy example takes 2.3 secs with 100 rows, but 3.3 minutes with 1000 rows on my machine. My actual data has 400K rows!
library(tidyverse)
set.seed(123)
rows <- 1000
df= data.frame(cust_num = sample(c("123","124","128"),rows,replace=T),
date = sample(seq(as.Date('2017/01/01'), as.Date('2017/01/31'), by="day"), rows, replace=T))
df <- df %>%
rowwise() %>%
mutate( visits.next.7.days = map2_lgl(df$cust_num,df$date,~.x==cust_num&.y>date&.y<(date+7)) %>% sum() )
Here's an option that uses purrr::reduce to sum the list of vectors returned by data.table::shift (a vectorized version of lead/lag). pmap_int with sum would do the same as reduce with + if you like, but it a little slower. You could similarly do map(1:7, ~lead(n, .x, default = 0L)) instead of data.table::shift, but it's more code and slower.
library(tidyverse)
set.seed(123)
rows <- 1000
df = data.frame(cust_num = sample(c("123","124","128"), rows, replace = TRUE),
date = sample(seq(as.Date('2017/01/01'),
as.Date('2017/01/31'),
by = "day"),
rows, replace = TRUE))
df2 <- df %>%
count(cust_num, date) %>%
group_by(cust_num) %>%
# add dates with no occurrences; none in sample data, but quite possible in real
complete(date = seq(min(date), max(date), by = 'day'), fill = list(n = 0L)) %>%
mutate(visits_next_7 = reduce(data.table::shift(n, 1:7, type = 'lead', fill = 0L), `+`)) %>%
right_join(df)
df2
#> # A tibble: 1,000 x 4
#> # Groups: cust_num [?]
#> cust_num date n visits_next_7
#> <fctr> <date> <int> <int>
#> 1 123 2017-01-09 10 78
#> 2 128 2017-01-19 12 70
#> 3 124 2017-01-05 15 73
#> 4 128 2017-01-27 14 37
#> 5 128 2017-01-27 14 37
#> 6 123 2017-01-15 19 74
#> 7 124 2017-01-24 12 59
#> 8 128 2017-01-10 10 78
#> 9 124 2017-01-03 19 77
#> 10 124 2017-01-14 8 84
#> # ... with 990 more rows
This may not be the most efficient algorithm, as depending on the spacing of your data, complete could potentially expand your data dramatically.
Further, with data this size, you may find data.table is more practical unless you want to put your data in a database and access it with dplyr.
A solution using the zoo package. The idea is to group the data by cust_num and date and count the row number first, and then use the lead function to shift the count number by 1 and use rollapply to calculate the sum of the next six days (not include the beginning date). Finally, use left_join to merge the results back to the original data frame. This should be much faster than your original approach. df3 is the final output.
library(dplyr)
library(zoo)
df2 <- df %>%
count(cust_num, date) %>%
ungroup() %>%
mutate(n2 = lead(n)) %>%
mutate(visits.next.7.days = rollapply(n2, width = 6, FUN = sum, na.rm = TRUE,
align = "left", partial = TRUE)) %>%
select(cust_num, date, visits.next.7.days)
df3 <- df %>% left_join(df2, by = c("cust_num", "date"))
head(df3)
# cust_num date visits.next.7.days
# 1 123 2017-01-09 70
# 2 128 2017-01-19 54
# 3 124 2017-01-05 58
# 4 128 2017-01-27 37
# 5 128 2017-01-27 37
# 6 123 2017-01-15 68
I'd like to expand observations from single row-per-id to multiple rows-per-id based on a given time interval:
> dput(df)
structure(list(id = c(123, 456, 789), gender = c(0, 1, 1), yr.start = c(2005,
2010, 2000), yr.last = c(2007, 2012, 2000)), .Names = c("id",
"gender", "yr.start", "yr.last"), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -3L))
> df
# A tibble: 3 x 4
id gender yr.start yr.last
<dbl> <dbl> <dbl> <dbl>
1 123 0 2005 2007
2 456 1 2010 2012
3 789 1 2000 2000
I want to get id expanded into one row per year:
> dput(df_out)
structure(list(id = c(123, 123, 123, 456, 456, 456, 789), gender = c(0,
0, 0, 1, 1, 1, 1), yr = c(2005, 2006, 2007, 2010, 2011, 2012,
2000)), .Names = c("id", "gender", "yr"), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -7L))
> df_out
# A tibble: 7 x 3
id gender yr
<dbl> <dbl> <dbl>
1 123 0 2005
2 123 0 2006
3 123 0 2007
4 456 1 2010
5 456 1 2011
6 456 1 2012
7 789 1 2000
I know how to melt/reshape, but I'm not sure how I can expand the years.
Thanks.
Here is a base R method.
# expand years to a list
yearList <- mapply(":", df$yr.start, df$yr.last)
Now, use this list to calculate the number of rows to repeat for each ID (the second argument of rep) and then append it as a vector (transformed from list with unlist) using cbind.
# get data.frame
cbind(df[rep(seq_along(df$id), lengths(yearList)), c("id", "gender")], yr=unlist(yearList))
id gender yr
1 123 0 2005
1.1 123 0 2006
1.2 123 0 2007
2 456 1 2010
2.1 456 1 2011
2.2 456 1 2012
3 789 1 2000
You could gather into long format and then fill in the missing rows via complete using tidyr.
library(dplyr)
library(tidyr)
df %>%
gather(group, yr, starts_with("yr") ) %>%
group_by(id, gender) %>%
complete(yr = full_seq(yr, period = 1) )
You can use select to get rid of the extra column.
df %>%
gather(group, yr, starts_with("yr") ) %>%
select(-group) %>%
group_by(id, gender) %>%
complete(yr = full_seq(yr, period = 1) )
# A tibble: 8 x 3
# Groups: id, gender [3]
id gender yr
<dbl> <dbl> <dbl>
1 123 0 2005
2 123 0 2006
3 123 0 2007
4 456 1 2010
5 456 1 2011
6 456 1 2012
7 789 1 2000
8 789 1 2000
Here is a tidyverse solution
library(tidyverse)
df %>%
group_by(id, gender) %>%
nest() %>%
mutate(data = map(data, ~ seq(.x$yr.start, .x$yr.last))) %>%
unnest() %>%
rename(year = data)
# A tibble: 7 x 3
id gender year
<dbl> <dbl> <int>
1 123 0 2005
2 123 0 2006
3 123 0 2007
4 456 1 2010
5 456 1 2011
6 456 1 2012
7 789 1 2000
As the OP mentions that his production data set has more than 1 M rows and he is benchmarking the different solutions, it might be worthwhile to try a data.table version:
library(data.table) # CRAN version 1.10.4 used
data.table(DF)[, .(yr = yr.start:yr.last), by = .(id, gender)]
which returns
id gender yr
1: 123 0 2005
2: 123 0 2006
3: 123 0 2007
4: 456 1 2010
5: 456 1 2011
6: 456 1 2012
7: 789 1 2000
If there are more non-varying columns than just gender it might be more efficient to do a join rather than including all those columns in the grouping parameter by =:
data.table(DF)[DF[, .(yr = yr.start:yr.last), by = id], on = "id"]
id gender yr.start yr.last yr
1: 123 0 2005 2007 2005
2: 123 0 2005 2007 2006
3: 123 0 2005 2007 2007
4: 456 1 2010 2012 2010
5: 456 1 2010 2012 2011
6: 456 1 2010 2012 2012
7: 789 1 2000 2000 2000
Note that both approaches assume that id is unique in the input data.
Benchmarking
The OP has noted that he is surprised that above data.table solution is five times slower than lmo's base R solution, apparently with OP's production data set of more than 1 M rows.
Also, the question has attracted 5 different answers plus additional suggestions. So, it's worthwhile to compare the solution in terms of processing speed.
Data
As the production data set isn't available, and problem size among other factors like the strcuture of the data is important for benchmarking, sample data sets are created.
# parameters
n_rows <- 1E2
yr_range <- 10L
start_yr <- seq(2000L, length.out = 10L, by = 1L)
# create sample data set
set.seed(123L)
library(data.table)
DT <- data.table(id = seq_len(n_rows),
gender = sample(0:1, n_rows, replace = TRUE),
yr.start = sample(start_yr, n_rows, replace = TRUE))
DT[, yr.last := yr.start + sample(0:yr_range, n_rows, replace = TRUE)]
DF <- as.data.frame(DT)
str(DT)
Classes ‘data.table’ and 'data.frame': 100 obs. of 4 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ gender : int 0 1 0 1 1 0 1 1 1 0 ...
$ yr.start: int 2005 2003 2004 2009 2004 2008 2009 2006 2004 2001 ...
$ yr.last : int 2007 2013 2010 2014 2008 2017 2013 2009 2005 2002 ...
- attr(*, ".internal.selfref")=<externalptr>
For the first run, 100 rows are created, the start year can vary between 2000 and 2009, and the span of years an indivdual id can cover is between 0 and 10 years. Thus, the result set should be expected to have approximately 100 * (10 + 1) / 2 rows.
Also, only one additional column gender is included although the OP has told that the producion data may have 2 to 10 non-varying columns.
Code
library(magrittr)
bm <- microbenchmark::microbenchmark(
lmo = {
yearList <- mapply(":", DF$yr.start, DF$yr.last)
res_lmo <- cbind(DF[rep(seq_along(DF$id), lengths(yearList)), c("id", "gender")],
yr=unlist(yearList))
},
hao = {
res_hao <- DF %>%
dplyr::group_by(id, gender) %>%
tidyr::nest() %>%
dplyr::mutate(data = purrr::map(data, ~ seq(.x$yr.start, .x$yr.last))) %>%
tidyr::unnest() %>%
dplyr::rename(yr = data)
},
aosmith = {
res_aosmith <- DF %>%
tidyr::gather(group, yr, dplyr::starts_with("yr") ) %>%
dplyr::select(-group) %>%
dplyr::group_by(id, gender) %>%
tidyr::complete(yr = tidyr::full_seq(yr, period = 1) )
},
jason = {
res_jason <- DF %>%
dplyr::group_by(id, gender) %>%
dplyr::do(data.frame(yr=.$yr.start:.$yr.last))
},
uwe1 = {
res_uwe1 <- DT[, .(yr = yr.start:yr.last), by = .(id, gender)]
},
uwe2 = {
res_uwe2 <- DT[DT[, .(yr = yr.start:yr.last), by = id], on = "id"
][, c("yr.start", "yr.last") := NULL]
},
frank1 = {
res_frank1 <- DT[rep(1:.N, yr.last - yr.start + 1L),
.(id, gender, yr = DT[, unlist(mapply(":", yr.start, yr.last))])]
},
frank2 = {
res_frank2 <- DT[, {
m = mapply(":", yr.start, yr.last); c(.SD[rep(.I, lengths(m))], .(yr = unlist(m)))},
.SDcols=id:gender]
},
times = 3L
)
Note that references to tidyverse functions are explicit in order to avoid name conflicts due to a cluttered name space.
First run
Unit: microseconds
expr min lq mean median uq max neval
lmo 655.860 692.6740 968.749 729.488 1125.193 1520.899 3
hao 40610.776 41484.1220 41950.184 42357.468 42619.887 42882.307 3
aosmith 319715.984 336006.9255 371176.437 352297.867 396906.664 441515.461 3
jason 77525.784 78197.8795 78697.798 78869.975 79283.804 79697.634 3
uwe1 834.079 870.1375 894.869 906.196 925.264 944.332 3
uwe2 1796.910 1810.8810 1880.482 1824.852 1922.268 2019.684 3
frank1 981.712 1057.4170 1086.680 1133.122 1139.164 1145.205 3
frank2 994.172 1003.6115 1081.016 1013.051 1124.438 1235.825 3
For the given problem size of 100 rows, the timings clearly indicate that the dplyr/ tidyr solutions are magnitudes slower than base R or data.table solutions.
The results are essentially consistent:
all.equal(as.data.table(res_lmo), res_uwe1)
all.equal(res_hao, res_uwe1)
all.equal(res_jason, res_uwe1)
all.equal(res_uwe2, res_uwe1)
all.equal(res_frank1, res_uwe1)
all.equal(res_frank2, res_uwe1)
return TRUE except all.equal(res_aosmith, res_uwe1) which returns
[1] "Incompatible type for column yr: x numeric, y integer"
Second run
Due to the long execution times, the tidyverse solutions are skipped when benchmarking larger problem sizes.
With the modified parameters
n_rows <- 1E4
yr_range <- 100L
the result set is expected to consist of about 500'000 rows.
Unit: milliseconds
expr min lq mean median uq max neval
lmo 425.026101 447.716671 455.85324 470.40724 471.26681 472.12637 3
uwe1 9.555455 9.796163 10.05562 10.03687 10.30571 10.57455 3
uwe2 18.711805 18.992726 19.40454 19.27365 19.75091 20.22817 3
frank1 22.639031 23.129131 23.58424 23.61923 24.05685 24.49447 3
frank2 13.989016 14.124945 14.47987 14.26088 14.72530 15.18973 3
For the given problem size and structure the data.table solutions are the fastest while the base R approach is a magnitude slower. The most concise solution uwe1 is also the fastest, here.
Note that the results depend on the structure of the data, in particular the parameters n_rows and yr_range and the number of non-varying columns. If there are more of those columns than just gender the timings might look differently.
The benchmark results are in contradiction to the OP's observation on execution speed which needs to be further investigated.
Another way using do in dplyr, but it's slower than the base R method.
df %>%
group_by(id, gender) %>%
do(data.frame(yr=.$yr.start:.$yr.last))
# # A tibble: 7 x 3
# # Groups: id, gender [3]
# id gender yr
# <dbl> <dbl> <int>
# 1 123 0 2005
# 2 123 0 2006
# 3 123 0 2007
# 4 456 1 2010
# 5 456 1 2011
# 6 456 1 2012
# 7 789 1 2000