I am trying to replicate the tidyr::complete() function in sparklyr. I have a data frame with some missing rows and I need to fill them in. In dplyr/tidyr I can do:
library(tidyverse)
library(lubridate)  # for as_date() below

data <- tibble(
  "id" = c(1, 1, 2, 2),
  "dates" = c("2020-01-01", "2020-01-03", "2020-01-01", "2020-01-03"),
  "values" = c(3, 4, 7, 8))
# A tibble: 4 x 3
id dates values
<dbl> <chr> <dbl>
1 1 2020-01-01 3
2 1 2020-01-03 4
3 2 2020-01-01 7
4 2 2020-01-03 8
data %>%
mutate(dates = as_date(dates)) %>%
group_by(id) %>%
complete(dates = seq.Date(min(dates), max(dates), by="day"))
# A tibble: 6 x 3
# Groups: id [2]
id dates values
<dbl> <date> <dbl>
1 1 2020-01-01 3
2 1 2020-01-02 NA
3 1 2020-01-03 4
4 2 2020-01-01 7
5 2 2020-01-02 NA
6 2 2020-01-03 8
However, the complete() function does not work on a sparklyr table:
data_spark %>%
mutate(dates = as_date(dates)) %>%
group_by(id) %>%
complete(dates = seq.Date(min(dates), max(dates), by="day"))
Error in UseMethod("complete_") :
no applicable method for 'complete_' applied to an object of class "c('tbl_spark', 'tbl_sql', 'tbl_lazy', 'tbl')"
Is there a way to write a UDF, or otherwise achieve a similar result?
Thank you
Under the hood, tidyr::complete() just performs a full join followed by an optional NA fill. You can replicate its effect by using sdf_copy_to() to create a new Spark data frame that holds a single column of dates (built with seq.Date() between your start and end dates), and then performing a full_join() between that and your dataset.
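A minimal sketch of that sdf_copy_to() approach, using the question's data (my own sketch, not the answer below; all_dates and scaffold are illustrative names, and sc and data_spark are assumed to exist as in the question). Note that it fills every id over one global date range, unlike the grouped complete(), which uses each id's own min/max:
library(sparklyr)
library(dplyr)

# Build the full date scaffold locally (hard-coding the question's range for
# brevity; in practice you would compute the min/max dates first), then copy it to Spark.
all_dates <- tibble(
  dates = as.character(seq.Date(as.Date("2020-01-01"), as.Date("2020-01-03"), by = "day"))
)
scaffold <- sdf_copy_to(sc, all_dates, overwrite = TRUE)

# Cross the dates with the distinct ids via a dummy join column, then join the
# original data back on; the missing combinations come through with NA values.
scaffold %>%
  mutate(join_by = TRUE) %>%
  full_join(data_spark %>% distinct(id) %>% mutate(join_by = TRUE), by = "join_by") %>%
  select(id, dates) %>%
  left_join(data_spark, by = c("id", "dates"))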
Here's a method that does all of the work in Spark.
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
data <- tibble(
id = c(1, 1, 2, 2),
dates = c("2020-01-02", "2020-01-04", "2020-01-01", "2020-01-03"),
values = c(1, 2, 3, 4)
)
data_spark <- copy_to(sc, data)
We need to generate all combinations of dates and id. To do this, we need to know the total number of days and the first date.
days_info <-
data_spark %>%
summarise(
first_date = min(dates),
total_days = datediff(max(dates), min(dates))
) %>%
collect()
days_info
#> # A tibble: 1 x 2
#> first_date total_days
#> <chr> <int>
#> 1 2020-01-01 3
sdf_seq() can be used to generate an integer sequence in Spark, which we then turn into every combination of dates and id.
dates_id_combinations <-
  sdf_seq(
    sc,
    from = 0,
    to = days_info$total_days,
    repartition = 1
  ) %>%
  transmute(
    dates = date_add(local(days_info$first_date), id),
    join_by = TRUE
  ) %>%
  full_join(data_spark %>% distinct(id) %>% mutate(join_by = TRUE)) %>%
  select(dates, id)
dates_id_combinations
#> # Source: spark<?> [?? x 2]
#> dates id
#> <date> <dbl>
#> 1 2020-01-01 1
#> 2 2020-01-01 2
#> 3 2020-01-02 1
#> 4 2020-01-02 2
#> 5 2020-01-03 1
#> 6 2020-01-03 2
#> 7 2020-01-04 1
#> 8 2020-01-04 2
full_join the original data frame and the combination data frame. Then filter based on the min/max date for each group.
data_spark %>%
group_by(id) %>%
mutate(first_date = min(dates), last_date = max(dates)) %>%
full_join(dates_id_combinations) %>%
filter(dates >= min(first_date), dates <= max(last_date)) %>%
arrange(id, dates) %>%
select(id, dates)
#> # Source: spark<?> [?? x 2]
#> # Groups: id
#> # Ordered by: id, dates
#> id dates
#> <dbl> <chr>
#> 1 1 2020-01-02
#> 2 1 2020-01-03
#> 3 1 2020-01-04
#> 4 2 2020-01-01
#> 5 2 2020-01-02
#> 6 2 2020-01-03
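The result above keeps only id and dates. To mirror tidyr::complete(), which also carries values (NA for the filled-in rows), you could left_join the original data back on. A hedged sketch, assuming the pipeline above has been assigned to completed (my name, not the answer's) and that dates is cast to a date type on both sides so the join keys match:
# `completed` = the filtered result from the pipeline above (my name for it)
completed %>%
  mutate(dates = to_date(dates)) %>%                 # to_date() is the Spark SQL cast
  left_join(
    data_spark %>% mutate(dates = to_date(dates)),
    by = c("id", "dates")
  ) %>%
  arrange(id, dates)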
My task is to count the length of periods from given start/end dates that were extracted from a large dataset.
Here is sample data.
library(tidyverse)
library(lubridate)  # for ymd()

data <- tibble(ID = rep.int(c(1, 2), times = c(3, 2)),
               start = ymd(c("2022-03-03", "2022-03-03", "2022-03-04", "2022-03-20", "2022-03-22")),
               end = ymd(c("2022-03-03", "2022-03-04", "2022-03-07", "2022-03-22", "2022-03-23")))
data
# A tibble: 5 × 3
ID start end
<dbl> <date> <date>
1 1 2022-03-03 2022-03-03
2 1 2022-03-03 2022-03-04
3 1 2022-03-04 2022-03-07
4 2 2022-03-20 2022-03-22
5 2 2022-03-22 2022-03-23
I've figured this out with the method introduced here.
data2 <- data %>%
  rowwise() %>%
  do(tibble(ID = .$ID,
            Date = seq(.$start, .$end, by = 1))) %>%
  distinct() %>%
  ungroup() %>%
  count(ID)
data2
# A tibble: 2 × 2
ID n
<dbl> <int>
1 1 5
2 2 4
However, occasionally all of the observations in the extracted start/end columns are NA.
The method above then fails at seq(), because there are no valid dates to build a sequence from.
For example:
na_data <- tibble(ID = rep.int(c(1, 2), times = c(3, 2)),
                  start = ymd(NA),
                  end = ymd(NA))
na_data
# A tibble: 5 × 3
ID start end
<dbl> <date> <date>
1 1 NA NA
2 1 NA NA
3 1 NA NA
4 2 NA NA
5 2 NA NA
na_data %>%
  rowwise() %>%
  do(tibble(ID = .$ID,
            Date = seq(.$start, .$end, by = 1))) %>%
  distinct() %>%
  ungroup() %>%
  count(ID)
Error in seq.int(0, to0 - from, by) : 'to' must be a finite number
It is difficult for me to check beforehand whether all of the data in the selected columns are NA, because I run many processes like this at once on data from the same dataset.
To run them, I usually select the entire script in RStudio with Ctrl+A and start it, but the error interrupts my tasks partway through.
Does anyone have a solution that handles all-NA data, or a way to avoid being interrupted by the error so the remaining code still runs?
Thank you.
This solution (1) creates lubridate Intervals for each row; (2) merges them by group, using a modification of @AllanCameron's int_merge() function that handles NAs; and (3) sums the days per Interval within each group.
To fully test it, I made two additional example datasets -- one including discontinuous date intervals, and one where only some values are NA.
library(lubridate)
library(dplyr)
int_merge <- function(x, na.rm = FALSE) {
  # Handle NA intervals up front: either drop them or propagate an NA interval.
  if (na.rm) {
    if (all(is.na(x))) return(interval(NA, NA))
    if (any(is.na(x))) x <- x[!is.na(x)]
  } else {
    if (any(is.na(x))) return(interval(NA, NA))
  }
  if (length(x) == 1) return(x)
  # Sort by start date, then fold each interval into the last merged one if they overlap.
  x <- x[order(int_start(x))]
  y <- x[1]
  for (i in 2:length(x)) {
    if (int_overlaps(y[length(y)], x[i]))
      y[length(y)] <- interval(start = min(int_start(c(y[length(y)], x[i]))),
                               end = max(int_end(c(y[length(y)], x[i]))))
    else
      y <- c(y, x[i])
  }
  return(y)
}
data %>%
mutate(interval = interval(start, end)) %>%
group_by(ID) %>%
summarize(
interval = sum(as.numeric(int_merge(interval), unit = "days") + 1)
)
#> # A tibble: 2 × 2
#> ID interval
#> <dbl> <dbl>
#> 1 1 5
#> 2 2 4
discontinuous_data %>%
mutate(interval = interval(start, end)) %>%
group_by(ID) %>%
summarize(
interval = sum(as.numeric(int_merge(interval), unit = "days") + 1)
)
#> # A tibble: 2 × 2
#> ID interval
#> <dbl> <dbl>
#> 1 1 8
#> 2 2 4
na_data %>%
mutate(interval = interval(start, end)) %>%
group_by(ID) %>%
summarize(
interval = sum(as.numeric(int_merge(interval), unit = "days") + 1)
)
#> # A tibble: 2 × 2
#> ID interval
#> <dbl> <dbl>
#> 1 1 NA
#> 2 2 NA
partial_na_data %>%
mutate(interval = interval(start, end)) %>%
group_by(ID) %>%
summarize(
interval = sum(as.numeric(int_merge(interval), unit = "days") + 1)
)
#> # A tibble: 2 × 2
#> ID interval
#> <dbl> <dbl>
#> 1 1 NA
#> 2 2 4
partial_na_data %>% # with `na.rm = TRUE`
mutate(interval = interval(start, end)) %>%
group_by(ID) %>%
summarize(
interval = sum(as.numeric(int_merge(interval, na.rm = TRUE), unit = "days") + 1)
)
#> # A tibble: 2 × 2
#> ID interval
#> <dbl> <dbl>
#> 1 1 7
#> 2 2 4
Created on 2022-11-25 with reprex v2.0.2
Additional example data:
discontinuous_data <- tibble(ID = rep.int(c(1, 2), times = c(3, 2)),
start = ymd(c("2022-03-03", "2022-03-03", "2022-03-10", "2022-03-20", "2022-03-22")),
end = ymd(c("2022-03-03", "2022-03-04", "2022-03-15", "2022-03-22", "2022-03-23")))
partial_na_data <- tibble(ID = rep.int(c(1, 2), times = c(3, 2)),
start = ymd(c("2022-03-03", "2022-03-03", "2022-03-10", "2022-03-20", "2022-03-22")),
end = ymd(c("2022-03-03", NA, "2022-03-15", "2022-03-22", "2022-03-23")))
I'm not sure why this has been so difficult, but I've exhausted my R knowledge. I'm trying to return the date from a column into a new column if it falls between two dates. It must be done with SQL-friendly verbs (i.e. dplyr).
sample <- data.frame(
  id = c(1, 1, 2, 3, 4),
  paint = c('zwbc',
            'zbbb',
            'zwbs',
            'aass',
            'zwbc'),
  date = c('2020-03-01',
           '2020-04-01',
           '2019-01-01',
           '2019-12-31',
           '2020-05-01'))
I've tried the following:
sam2 <- sample %>%
group_by(id) %>%
mutate(flag = if_else(paint == 'zwbc', 1, 0)) %>%
mutate(paint_date = if_else(flag == 1 + (date > '2020-1-1' & date < '2020-1-1'), date, NULL)) %>%
ungroup()
Does this solve your problem?
library(tidyverse)
sample <- data.frame(
id = c(1, 1, 2, 3, 4),
paint = c('zwbc',
'zbbb',
'zwbs',
'aass',
'zwbc'),
date = c('2020-03-01',
'2020-04-01',
'2019-01-01',
'2019-12-31',
'2020-05-01'))
sample %>%
group_by(id) %>%
mutate(flag = if_else(paint == 'zwbc', 1, 0))
#> # A tibble: 5 × 4
#> # Groups: id [4]
#> id paint date flag
#> <dbl> <chr> <chr> <dbl>
#> 1 1 zwbc 2020-03-01 1
#> 2 1 zbbb 2020-04-01 0
#> 3 2 zwbs 2019-01-01 0
#> 4 3 aass 2019-12-31 0
#> 5 4 zwbc 2020-05-01 1
# To exclude the bottom row (date > specified date cutoff)
sample %>%
group_by(id) %>%
mutate(flag = if_else(paint == 'zwbc', 1, 0)) %>%
mutate(paint_date = if_else(flag == 1 & date > '2020-01-01' & date < '2020-04-01', date, NULL)) %>%
ungroup()
#> # A tibble: 5 × 5
#> id paint date flag paint_date
#> <dbl> <chr> <chr> <dbl> <chr>
#> 1 1 zwbc 2020-03-01 1 2020-03-01
#> 2 1 zbbb 2020-04-01 0 <NA>
#> 3 2 zwbs 2019-01-01 0 <NA>
#> 4 3 aass 2019-12-31 0 <NA>
#> 5 4 zwbc 2020-05-01 1 <NA>
Created on 2022-10-12 by the reprex package (v2.0.1)
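One note on the string comparisons: they only work because ISO yyyy-mm-dd dates sort lexically in date order. A hedged variant of my own (not part of the answer above) converts the column to Date first and uses dplyr::between(), which also translates to SQL BETWEEN via dbplyr:
sample %>%
  mutate(date = as.Date(date)) %>%
  group_by(id) %>%
  mutate(flag = if_else(paint == 'zwbc', 1, 0)) %>%
  # between() is inclusive on both ends, unlike the strict > / < used above
  mutate(paint_date = if_else(flag == 1 &
                                between(date, as.Date('2020-01-01'), as.Date('2020-04-01')),
                              date, as.Date(NA))) %>%
  ungroup()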
I have a tibble, df. I would like to group it and then use dplyr::pull to create vectors from the grouped data frame. I have provided a reprex below.
df is the base tibble and my desired output is reflected by df2; I just don't know how to get there programmatically. I have tried to use pull to achieve this output, but pull did not seem to recognize the group_by() and instead created a vector out of the whole column. Is what I'm trying to achieve possible with dplyr or base R? Note: new_col is supposed to be a vector created from the name column.
library(tidyverse)
library(reprex)
df <- tibble(group = c(1,1,1,1,2,2,2,3,3,3,3,3),
name = c('Jim','Deb','Bill','Ann','Joe','Jon','Jane','Jake','Sam','Gus','Trixy','Don'),
type = c(1,2,3,4,3,2,1,2,3,1,4,5))
df
#> # A tibble: 12 x 3
#> group name type
#> <dbl> <chr> <dbl>
#> 1 1 Jim 1
#> 2 1 Deb 2
#> 3 1 Bill 3
#> 4 1 Ann 4
#> 5 2 Joe 3
#> 6 2 Jon 2
#> 7 2 Jane 1
#> 8 3 Jake 2
#> 9 3 Sam 3
#> 10 3 Gus 1
#> 11 3 Trixy 4
#> 12 3 Don 5
# Desired Output - New Col is a column of vectors
df2 <- tibble(group=c(1,2,3),name=c("Jim","Jane","Gus"), type=c(1,1,1), new_col = c("'Jim','Deb','Bill','Ann'","'Joe','Jon','Jane'","'Jake','Sam','Gus','Trixy','Don'"))
df2
#> # A tibble: 3 x 4
#> group name type new_col
#> <dbl> <chr> <dbl> <chr>
#> 1 1 Jim 1 'Jim','Deb','Bill','Ann'
#> 2 2 Jane 1 'Joe','Jon','Jane'
#> 3 3 Gus 1 'Jake','Sam','Gus','Trixy','Don'
Created on 2020-11-14 by the reprex package (v0.3.0)
Maybe this is what you are looking for:
library(dplyr)
df <- tibble(group = c(1,1,1,1,2,2,2,3,3,3,3,3),
name = c('Jim','Deb','Bill','Ann','Joe','Jon','Jane','Jake','Sam','Gus','Trixy','Don'),
type = c(1,2,3,4,3,2,1,2,3,1,4,5))
df %>%
group_by(group) %>%
mutate(new_col = name, name = first(name, order_by = type), type = first(type, order_by = type)) %>%
group_by(name, type, .add = TRUE) %>%
summarise(new_col = paste(new_col, collapse = ","))
#> `summarise()` regrouping output by 'group', 'name' (override with `.groups` argument)
#> # A tibble: 3 x 4
#> # Groups: group, name [3]
#> group name type new_col
#> <dbl> <chr> <dbl> <chr>
#> 1 1 Jim 1 Jim,Deb,Bill,Ann
#> 2 2 Jane 1 Joe,Jon,Jane
#> 3 3 Gus 1 Jake,Sam,Gus,Trixy,Don
EDIT: If new_col should be a list of vectors, then you could do `summarise(new_col = list(c(new_col)))`:
df %>%
group_by(group) %>%
mutate(new_col = name, name = first(name, order_by = type), type = first(type, order_by = type)) %>%
group_by(name, type, .add = TRUE) %>%
summarise(new_col = list(c(new_col)))
Another option would be to use tidyr::nest:
df %>%
group_by(group) %>%
mutate(new_col = name, name = first(name, order_by = type), type = first(type, order_by = type)) %>%
nest(new_col = new_col)
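To then get an ordinary character vector for a single group out of the list-column, you can pull() it and index into the result. A small usage sketch of my own, with the list-column result from the EDIT assigned to df_nested (a name I'm introducing; the question's df2 holds strings instead):
df_nested <- df %>%
  group_by(group) %>%
  mutate(new_col = name,
         name = first(name, order_by = type),
         type = first(type, order_by = type)) %>%
  group_by(name, type, .add = TRUE) %>%
  summarise(new_col = list(c(new_col)))

# pull() the list-column, then take the first element to get group 1's names.
df_nested %>%
  ungroup() %>%
  filter(group == 1) %>%
  pull(new_col) %>%
  .[[1]]
# c("Jim", "Deb", "Bill", "Ann")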
I'm trying to assess which unit in a pair is the "winner". group_by() %>% mutate() is close to the right thing, but it's not quite there. In particular,
dat %>% group_by(pair) %>% mutate(winner = ifelse(score[1] > score[2], c(1, 0), c(0, 1))) doesn't work.
The code below does, but it's clunky with an intermediate summary data frame. Can we improve on this?
library(tidyverse)
set.seed(343)
# units within pairs get scores
dat <-
data_frame(pair = rep(1:3, each = 2),
unit = rep(1:2, 3),
score = rnorm(6))
# figure out who won in each pair
summary_df <-
dat %>%
group_by(pair) %>%
summarize(winner = which.max(score))
# merge back and determine whether each unit won
dat <-
left_join(dat, summary_df, "pair") %>%
mutate(won = as.numeric(winner == unit))
dat
#> # A tibble: 6 x 5
#> pair unit score winner won
#> <int> <int> <dbl> <int> <dbl>
#> 1 1 1 -1.40 2 0
#> 2 1 2 0.523 2 1
#> 3 2 1 0.142 1 1
#> 4 2 2 -0.847 1 0
#> 5 3 1 -0.412 1 1
#> 6 3 2 -1.47 1 0
Created on 2018-09-26 by the reprex package (v0.2.0).
Maybe related to: Weird group_by + mutate + which.max behavior
You could do:
dat %>%
  group_by(pair) %>%
  mutate(won = score == max(score),
         winner = unit[won == TRUE])
# A tibble: 6 x 5
# Groups: pair [3]
pair unit score won winner
<int> <int> <dbl> <lgl> <int>
1 1 1 -1.40 FALSE 2
2 1 2 0.523 TRUE 2
3 2 1 0.142 TRUE 1
4 2 2 -0.847 FALSE 1
5 3 1 -0.412 TRUE 1
6 3 2 -1.47 FALSE 1
Using rank:
dat %>% group_by(pair) %>% mutate(won = rank(score) - 1)
More for fun (and slightly faster), using the outcome of the comparison (score[1] > score[2]) to index a vector of 'won' alternatives:
dat %>% group_by(pair) %>%
mutate(won = c(0, 1, 0)[1:2 + (score[1] > score[2])])
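Another compact variant (my own sketch, not from the answers above) uses which.max() directly inside the grouped mutate():
dat %>%
  group_by(pair) %>%
  # the unit with the highest score in each pair is the winner
  mutate(won = as.numeric(unit == unit[which.max(score)])) %>%
  ungroup()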
I have a data frame with columns labeled sales1, sales2, price1, price2, and I want to calculate revenues by multiplying sales1 * price1 and so on for each numeric suffix, in an iterative fashion.
data <- data_frame(
"sales1" = c(1, 2, 3),
"sales2" = c(2, 3, 4),
"price1" = c(3, 2, 2),
"price2" = c(3, 3, 5))
data
# A tibble: 3 x 4
# sales1 sales2 price1 price2
# <dbl> <dbl> <dbl> <dbl>
#1 1 2 3 3
#2 2 3 2 3
#3 3 4 2 5
Why doesn't the following code work?
data %>%
mutate (
for (i in seq_along(1:2)) {
paste0("revenue",i) = paste0("sales",i) * paste0("price",i)
}
)
Assuming your columns are already ordered (sales1, sales2, price1, price2), we can split the data frame into two parts and then multiply them:
data[grep("sales", names(data))] * data[grep("price", names(data))]
# sales1 sales2
#1 3 6
#2 4 9
#3 6 20
If the columns are not already sorted according to their names, we can sort them using order() and then use the above command:
data <- data[order(names(data))]
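A small follow-up sketch of my own (not part of the answer above) that renames the products to revenue1/revenue2 and attaches them to the original data:
revenue <- data[grep("sales", names(data))] * data[grep("price", names(data))]
names(revenue) <- sub("sales", "revenue", names(revenue))  # sales1 -> revenue1, ...
cbind(data, revenue)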
This answer is not brief. For that, @RonakShah's existing answer is the one to look at!
My response is intended to address a broader concern about how difficult this is in the tidyverse. My understanding is that it is difficult because the data is not currently in a "tidy" format. Instead, you can create a tidy data frame like so:
library(tidyverse)
tidy_df <- data %>%
rownames_to_column() %>%
gather(key, value, -rowname) %>%
extract(key, c("variable", "id"), "([a-z]+)([0-9]+)") %>%
spread(variable, value)
Which then makes the final calculation straightforward
tidy_df %>% mutate(revenue = sales * price)
#> # A tibble: 6 x 5
#> rowname id price sales revenue
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 1 1 3 1 3
#> 2 1 2 3 2 6
#> 3 2 1 2 2 4
#> 4 2 2 3 3 9
#> 5 3 1 2 3 6
#> 6 3 2 5 4 20
If you need to get the data back into the original format you can, although this feels clunky to me (I'm sure this can be improved in some way):
tidy_df %>% mutate(revenue = sales * price) %>%
gather(key, value, -c(rowname, id)) %>%
unite(key, key, id, sep = "") %>%
spread(key, value) %>%
select(starts_with("price"),
starts_with("sales"),
starts_with("revenue"))
#> # A tibble: 3 x 6
#> price1 price2 sales1 sales2 revenue1 revenue2
#> * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 3 3 1 2 3 6
#> 2 2 3 2 3 4 9
#> 3 2 5 3 4 6 20
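For what it's worth, gather()/spread() have since been superseded in tidyr; a rough equivalent of the reshaping step using pivot_longer() (my adaptation, not the original answer) would be:
data %>%
  rownames_to_column() %>%
  pivot_longer(-rowname,
               names_to = c(".value", "id"),
               names_pattern = "([a-z]+)([0-9]+)") %>%
  mutate(revenue = sales * price)
# and back to wide form, if needed:
#   pivot_wider(names_from = id, values_from = c(sales, price, revenue), names_sep = "")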