I am trying to join two data.frames based on a condition. Consider the following situation where I have df_a and df_b.
library(tidyverse)
# Dummy data A
df_a <- tibble(
  id = c("a", "b", "c", "a"),
  text = c("hi", "why", "bye", "cry"),
  created_date = as.Date(c("2020-01-01", "2020-02-02", "2020-03-01", "2020-04-04"))
)
# Dummy data B
df_b <- tibble(
  id = c("a", "b", "c", "a"),
  group = c("GROUP A", "GROUP B", "GROUP C", "GROUP C"),
  start_date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01", "2020-02-04"))
)
> df_a
# A tibble: 4 x 3
id text created_date
<chr> <chr> <date>
1 a hi 2020-01-01
2 b why 2020-02-02
3 c bye 2020-03-01
4 a cry 2020-04-04
> df_b
# A tibble: 4 x 3
id group start_date
<chr> <chr> <date>
1 a GROUP A 2020-01-01
2 b GROUP B 2020-01-01
3 c GROUP C 2020-01-01
4 a GROUP C 2020-02-04
# Current solution
d_current_sol <- df_a %>%
  left_join(
    df_b %>% distinct(id, .keep_all = TRUE),
    by = "id"
  )
> d_current_sol
# A tibble: 4 x 5
id text created_date group start_date
<chr> <chr> <date> <chr> <date>
1 a hi 2020-01-01 GROUP A 2020-01-01
2 b why 2020-02-02 GROUP B 2020-01-01
3 c bye 2020-03-01 GROUP C 2020-01-01
4 a cry 2020-04-04 GROUP A 2020-01-01
# Desired solution
> d_desired
# A tibble: 4 x 5
id text created_date start_date group
<chr> <chr> <date> <date> <chr>
1 a hi 2020-01-01 2020-01-01 GROUP A
2 b why 2020-02-02 2020-01-01 GROUP B
3 c bye 2020-03-01 2020-01-01 GROUP C
4 a cry 2020-04-04 2020-02-04 GROUP C
As you can see, in df_b the start date for id = "a" is first 2020-01-01 and then 2020-02-04.
What I would like is for any row in df_a dated on or after 2020-02-04 to have "GROUP C" as its associated group. This is illustrated in the desired data frame d_desired.
However, with a traditional left_join, where each row is matched to a single row, we only get the earliest entry (d_current_sol). This would be easy enough to do using map() and a custom function inside mutate(); however, that does not translate to SQL when the dplyr query is converted.
Does anyone know a way of doing this within the tidyverse paradigm? I will be using dbplyr to run this on a PostgreSQL database.
Does this work?
df_a %>%
  left_join(df_b, by = "id") %>%
  filter(created_date >= start_date) %>%
  group_by(id, created_date) %>%
  top_n(1, start_date)
# A tibble: 4 x 5
# Groups: id, created_date [4]
id text created_date group start_date
<chr> <chr> <date> <chr> <date>
1 a hi 2020-01-01 GROUP A 2020-01-01
2 b why 2020-02-02 GROUP B 2020-01-01
3 c bye 2020-03-01 GROUP C 2020-01-01
4 a cry 2020-04-04 GROUP C 2020-02-04
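Note that top_n() is superseded in current dplyr; slice_max() is the modern equivalent and, as far as I know, dbplyr (>= 2.0.0) can translate it as well. A sketch of the same pipeline:
df_a %>%
  left_join(df_b, by = "id") %>%
  filter(created_date >= start_date) %>%
  group_by(id, created_date) %>%
  slice_max(start_date, n = 1) %>%
  ungroup()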
Related
I would like to track the minimum and maximum numbers of occurrences across two columns. This should be done in a rolling fashion from the beginning of the data, so we can track how many times each ID has appeared overall at each date. It doesn't matter in which column an ID appears.
The result should be as follows. In row 1, neither B nor C has occurred yet, so min_appearances is 0, while max_appearances is 1 because A and D were present. By row 5, A and D have each been present 3 times, but B and C only 2. I'm not concerned with which ID is present, only with the minimum and maximum counts. The real data is also more complicated, so pairs are not static: A could face C, and so on.
# A tibble: 8 x 5
date id1 id2 min_appearances max_appearances
<date> <chr> <chr> <dbl> <dbl>
1 2020-01-01 A D 0 1
2 2020-01-02 B C 1 1
3 2020-01-03 C B 1 2
4 2020-01-04 D A 2 2
5 2020-01-05 A D 2 3
6 2020-01-06 B C 3 3
7 2020-01-07 C B 3 4
8 2020-01-08 D A 4 4
DATA:
library(dplyr)
date <- seq(as.Date("2020/1/1"), by = "day", length.out = 8)
id1 <- rep(c("A", "B", "C", "D"), 2)
id2 <- rep(c("D", "C", "B", "A"), 2)
dt <- tibble(date = date,
             id1 = id1,
             id2 = id2)
Here's a way to do it using functions from the tidyverse. First, pivot_longer() to get the data into a shape that is easier to handle. Then compute the cumulative count of each unique id. Compute the min and max for each row over the "count" columns. Finally, take the last min and max values for each pair, and pivot back to wide.
library(tidyverse)
dt %>%
  pivot_longer(cols = -date, values_to = "id") %>%
  mutate(map_dfc(unique(id), ~ tibble("count_{.x}" := cumsum(id == .x)))) %>%
  mutate(min_appearances = do.call(pmin, select(., starts_with("count"))),
         max_appearances = do.call(pmax, select(., starts_with("count")))) %>%
  group_by(date) %>%
  mutate(across(min_appearances:max_appearances, last),
         n = row_number()) %>%
  pivot_wider(c(date, min_appearances, max_appearances),
              names_from = n, values_from = id, names_prefix = "id") %>%
  relocate(order(colnames(.)))
date id1 id2 max_appearances min_appearances
<date> <chr> <chr> <int> <int>
1 2020-01-01 A D 1 0
2 2020-01-02 B C 1 1
3 2020-01-03 C B 2 1
4 2020-01-04 D A 2 2
5 2020-01-05 A D 3 2
6 2020-01-06 B C 3 3
7 2020-01-07 C B 4 3
8 2020-01-08 D A 4 4
I'm trying to mark all dates that fall within any of several ranges held in a different table.
The events table among other variables contains start_date and end_date of events:
events <- tibble(
  name = c("Event A", "Event B"),
  start_date = as.Date(c("2021-10-17", "2021-02-10")),
  end_date = as.Date(c("2021-10-19", "2021-02-19"))
)
The date_info table contains date, statistic and value information in the long format for all days of the year:
date_info <- tibble(
  date = as.Date(c("2021-10-16", "2021-10-16", "2021-10-17", "2021-10-17")),
  statistic = c("var1", "var2", "var1", "var2"),
  value = c(10, 54, 23, 34)
)
I need to make a new column in date_info to mark dates which fall within any date range of events.
I've tried the approach below, but it works only if there is a single event in events:
library(tidyverse)
date_info %>%
  mutate(in_range = if_else(date < events$start_date | date > events$end_date,
                            FALSE, TRUE))
I thought about creating a date_range vector in events so that the code below could be used to mark the dates:
library(tidyverse)
date_info %>%
  mutate(in_range = if_else(date %in% events$date_range, TRUE, FALSE))
However, I'm not sure this is the best approach. Additionally, I'm not sure how to build such a date range, since seq() works on a single start/end pair rather than on vectors.
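For illustration, such a combined range could be built with Map() and seq(). A sketch, assuming every start_date <= end_date:
all_event_days <- as.Date(
  unlist(Map(seq, events$start_date, events$end_date, MoreArgs = list(by = "day"))),
  origin = "1970-01-01"  # unlist() drops the Date class, so restore it
)
date_info %>%
  mutate(in_range = date %in% all_event_days)
This enumerates every day of every event, though, which gets wasteful for long ranges; the join-based answers below avoid that.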
This can be done as a range-based or non-equi join. Unfortunately, dplyr alone could not do it until version 1.1.0 (see the first option below), but one of the following should work fine.
The code below assigns the particular events$name to each row, not just an "in range" indicator. It's not hard to simplify that with in_range = !is.na(name) or similar.
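dplyr (>= 1.1.0)
If a recent dplyr is available, join_by() supports non-equi joins natively; a sketch:
date_info %>%
  left_join(events, join_by(date >= start_date, date <= end_date))
The options below predate that feature and work with older dplyr as well.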
fuzzyjoin
# library(fuzzyjoin)
date_info %>%
  fuzzyjoin::fuzzy_left_join(
    events,
    by = c(date = "start_date", date = "end_date"),
    match_fun = list(`>=`, `<=`)
  )
# # A tibble: 4 x 6
# date statistic value name start_date end_date
# <date> <chr> <dbl> <chr> <date> <date>
# 1 2021-10-16 var1 10 NA NA NA
# 2 2021-10-16 var2 54 NA NA NA
# 3 2021-10-17 var1 23 Event A 2021-10-17 2021-10-19
# 4 2021-10-17 var2 34 Event A 2021-10-17 2021-10-19
sqldf
# library(sqldf)
sqldf::sqldf("
select t1.*, t2.name
from date_info t1
left join events t2 on t1.date between t2.start_date and t2.end_date")
# date statistic value name
# 1 2021-10-16 var1 10 <NA>
# 2 2021-10-16 var2 54 <NA>
# 3 2021-10-17 var1 23 Event A
# 4 2021-10-17 var2 34 Event A
data.table
library(data.table)
date_info_DT <- as.data.table(date_info)
events_DT <- as.data.table(events)
date_info_DT[events_DT, name := i.name,
             on = .(date >= start_date, date <= end_date)][]
# date statistic value name
# <Date> <char> <num> <char>
# 1: 2021-10-16 var1 10 <NA>
# 2: 2021-10-16 var2 54 <NA>
# 3: 2021-10-17 var1 23 Event A
# 4: 2021-10-17 var2 34 Event A
(There's also data.table::foverlaps, which requires the second data.table to be keyed.)
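A minimal foverlaps sketch, using fresh copies di and ev to keep it self-contained, and assuming valid start_date <= end_date ranges:
di <- as.data.table(date_info)[, date_end := date]  # zero-width interval on the left side
ev <- as.data.table(events)
setkey(ev, start_date, end_date)  # foverlaps requires the second table to be keyed
foverlaps(di, ev, by.x = c("date", "date_end"), type = "within")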
Another option, a bit simpler (it requires no class changes):
date_info %>%
  mutate(in_range = data.table::inrange(date, events$start_date, events$end_date))
# # A tibble: 4 x 4
# date statistic value in_range
# <date> <chr> <dbl> <lgl>
# 1 2021-10-16 var1 10 FALSE
# 2 2021-10-16 var2 54 FALSE
# 3 2021-10-17 var1 23 TRUE
# 4 2021-10-17 var2 34 TRUE
Here's a solution using map() from the purrr package that should work. It could be more concise, but I made it very explicit so it's not overwhelming if you're not familiar with the syntax.
library(tidyverse)
date_info |>
  mutate(
    in_range_n = map_dbl(date, .f = function(date) {
      filter(events, start_date <= date, end_date >= date) |>
        nrow()
    }),
    in_range = in_range_n > 0
  ) |>
  select(-in_range_n)
Output:
# A tibble: 4 x 4
date statistic value in_range
<date> <chr> <dbl> <lgl>
1 2021-10-16 var1 10 FALSE
2 2021-10-16 var2 54 FALSE
3 2021-10-17 var1 23 TRUE
4 2021-10-17 var2 34 TRUE
Let me know if I misunderstood the problem!
Using base R:
date_info$in_range <- sapply(date_info$date, function(date) {
  any(date >= events$start_date & date <= events$end_date)
})
gives
date statistic value in_range
<date> <chr> <dbl> <lgl>
1 2021-10-16 var1 10 FALSE
2 2021-10-16 var2 54 FALSE
3 2021-10-17 var1 23 TRUE
4 2021-10-17 var2 34 TRUE
Is there a straightforward way in dplyr to expand a dataframe by replicating a row based on a specific vector?
For example, I have following dataframe:
df <- tibble(Year = c(2019),
             cat1 = c("A", "B"),
             cat2 = c("X", "Y"),
             value1 = c(1, 2),
             value2 = c(10, 20))
selected_years <- c(2019:2021)
where I would like to replicate the row with cat1 == "A" for the years 2019-2021. The values of some columns (value1, the cat columns) should be carried over from the original 2019 row; other columns (value2) should be filled with NA.
The final output should look like:
Year cat value1 value2
2019 A 1 10
2020 A 1 NA
2021 A 1 NA
2019 B 2 20
I tried bind_rows(); however, the result is not fully what I wanted (I only get the "A" part, not the "B" part), and I am not sure whether this is really the most intuitive dplyr way to go, or whether another approach (or even a specific function) would be more suitable:
df %>%
  filter(cat1 == "A", Year == 2019) %>%
  bind_rows(
    data.frame(
      Year = setdiff(selected_years, .$Year),
      cat1 = .$cat1,
      value1 = .$value1
    )
  )
Edit:
I also tried using expand and right_join, but then my desired column values are not repeated:
df %>%
  dplyr::right_join(df %>%
                      filter(cat1 == "A", Year == 2019) %>%
                      expand(Year = c(2019:2021)))
Maybe an approach involving case_when?
The part where you want to keep only specific values and not others makes this tricky. It is easy to expand the vector over all values using rowwise() and unnest() together with a condition in if_else(). In the last step we just reset to NA the values we don't want to replicate. If there is more than one column to set to NA, we can use across(), as sketched below the output.
library(tidyverse)
df <- tibble(Year = c(2019),
             cat1 = c("A", "B"),
             cat2 = c("X", "Y"),
             value1 = c(1, 2),
             value2 = c(10, 20))
selected_years <- c(2019:2021)
df %>%
  rowwise() %>%
  mutate(Year = if_else(cat1 == "A", list(selected_years), list(Year))) %>%
  unnest(Year) %>%
  mutate(value2 = if_else(Year != 2019, NA_real_, value2))
#> # A tibble: 4 x 5
#> Year cat1 cat2 value1 value2
#> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 2019 A X 1 10
#> 2 2020 A X 1 NA
#> 3 2021 A X 1 NA
#> 4 2019 B Y 2 20
Created on 2021-12-08 by the reprex package (v2.0.1)
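The across() variant mentioned above could look like this (a sketch; list every column to reset inside c()):
df %>%
  rowwise() %>%
  mutate(Year = if_else(cat1 == "A", list(selected_years), list(Year))) %>%
  unnest(Year) %>%
  mutate(across(c(value2), ~ if_else(Year != 2019, NA_real_, .x)))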
Or we could create a df2 and full_join it with df:
library(dplyr)
df2 <- tibble(Year = selected_years,
              cat1 = "A",
              cat2 = "X",
              value1 = 1)
df %>%
  full_join(df2, by = c("Year", "cat1", "cat2", "value1"))
#> # A tibble: 4 x 5
#> Year cat1 cat2 value1 value2
#> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 2019 A X 1 10
#> 2 2019 B Y 2 20
#> 3 2020 A X 1 NA
#> 4 2021 A X 1 NA
Created on 2021-12-08 by the reprex package (v2.0.1)
library(tidyverse)
tibble(selected_years) %>%
  mutate(cat1 = "A") %>%
  full_join(df, by = "cat1") %>%
  mutate(selected_years = ifelse(is.na(selected_years), Year, selected_years)) %>%
  group_by(cat1) %>%
  mutate(value2 = ifelse(row_number() != 1, NA, value2)) %>%
  ungroup() %>%
  select(Year = selected_years, cat = cat1, value1, value2)
Year cat value1 value2
<dbl> <chr> <dbl> <dbl>
1 2019 A 1 10
2 2020 A 1 NA
3 2021 A 1 NA
4 2019 B 2 20
A solution based on dplyr::bind_rows:
library(tidyverse)
df <- tibble(Year = c(2019),
             cat1 = c("A", "B"),
             cat2 = c("X", "Y"),
             value1 = c(1, 2),
             value2 = c(10, 20))
selected_years <- c(2020:2021)
df %>%
  bind_rows(data.frame(Year = selected_years,
                       filter(., cat1 == "A") %>% select(-Year, -value2))) %>%
  arrange(cat1)
#> # A tibble: 4 × 5
#> Year cat1 cat2 value1 value2
#> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 2019 A X 1 10
#> 2 2020 A X 1 NA
#> 3 2021 A X 1 NA
#> 4 2019 B Y 2 20
How can I use R to create a rank column? Below is an example.
This is what I have:
Date group
12/5/2020 A
12/5/2020 A
11/7/2020 A
11/7/2020 A
11/9/2020 B
11/9/2020 B
10/8/2020 B
This is what I want:
Date group rank
12/5/2020 A 2
12/5/2020 A 2
11/7/2020 A 1
11/7/2020 A 1
11/9/2020 B 2
11/9/2020 B 2
10/8/2020 B 1
tidyverse
(I'm using dplyr here since I think it is easy to see the steps being done.)
A first approach might be to capitalize on R's factor function, which assigns an integer to each distinct value so that operations on the factor are faster (compared with strings). That is, it takes a (possibly looooong) vector of strings and converts it into a just-as-long vector of integers (much smaller and faster) plus a very short vector of strings, where the integers are indices into the small vector of strings. This small vector holds the factor's "levels".
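A quick illustration of that mechanism:
f <- factor(c("12/5/2020", "11/7/2020", "12/5/2020"))
levels(f)      # "11/7/2020" "12/5/2020"  (the short vector of strings)
as.integer(f)  # 2 1 2  (indices into the levels)
Applying that idea per group: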
library(dplyr)
group_by(dat, group) %>%
  mutate(rank = as.integer(factor(Date))) %>%
  ungroup()
# # A tibble: 7 x 3
# Date group rank
# <chr> <chr> <int>
# 1 12/5/2020 A 2
# 2 12/5/2020 A 2
# 3 11/7/2020 A 1
# 4 11/7/2020 A 1
# 5 11/9/2020 B 2
# 6 11/9/2020 B 2
# 7 10/8/2020 B 1
This "sorta" works, but there are two problems:
First, it relies on the lexicographic sorting of the Date column, which happens to be acceptable for this data sample but will fail in general. A better way is to convert to something more appropriately sortable, such as a Date object.
Failing sorts:
sort(c("12/9/2020", "11/9/2020", "2/9/2020"))
# [1] "11/9/2020" "12/9/2020" "2/9/2020"
dat %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  group_by(group) %>%
  mutate(rank = as.integer(factor(Date))) %>%
  ungroup()
# # A tibble: 7 x 3
# Date group rank
# <date> <chr> <int>
# 1 2020-12-05 A 2
# 2 2020-12-05 A 2
# 3 2020-11-07 A 1
# 4 2020-11-07 A 1
# 5 2020-11-09 B 2
# 6 2020-11-09 B 2
# 7 2020-10-08 B 1
Second, there really are better functions for ranking, such as dplyr::dense_rank (which #akrun put in an answer first ... I was building to it, honestly):
dat %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  group_by(group) %>%
  mutate(rank = dense_rank(Date)) %>%
  ungroup()
# # A tibble: 7 x 3
# Date group rank
# <date> <chr> <int>
# 1 2020-12-05 A 2
# 2 2020-12-05 A 2
# 3 2020-11-07 A 1
# 4 2020-11-07 A 1
# 5 2020-11-09 B 2
# 6 2020-11-09 B 2
# 7 2020-10-08 B 1
We can use dense_rank after converting the 'Date' column to Date class:
library(dplyr)
library(lubridate)
df1 %>%
  group_by(group) %>%
  mutate(rank = dense_rank(mdy(Date)))
# A tibble: 7 x 3
# Groups: group [2]
# Date group rank
# <chr> <chr> <int>
#1 12/5/2020 A 2
#2 12/5/2020 A 2
#3 11/7/2020 A 1
#4 11/7/2020 A 1
#5 11/9/2020 B 2
#6 11/9/2020 B 2
#7 10/8/2020 B 1
data
df1 <- structure(list(Date = c("12/5/2020", "12/5/2020", "11/7/2020",
"11/7/2020", "11/9/2020", "11/9/2020", "10/8/2020"), group = c("A",
"A", "A", "A", "B", "B", "B")), class = "data.frame", row.names = c(NA,
-7L))
Convert the Date column to an actual date object, arrange the data by Date, and use match with unique to get the rank column.
library(dplyr)
df %>%
  mutate(Date = lubridate::mdy(Date)) %>%
  arrange(group, Date) %>%
  group_by(group) %>%
  mutate(rank = match(Date, unique(Date)))
# Date group rank
# <date> <chr> <int>
#1 2020-11-07 A 1
#2 2020-11-07 A 1
#3 2020-12-05 A 2
#4 2020-12-05 A 2
#5 2020-10-08 B 1
#6 2020-11-09 B 2
#7 2020-11-09 B 2
data
df <- structure(list(Date = c("12/5/2020", "12/5/2020", "11/7/2020",
"11/7/2020", "11/9/2020", "11/9/2020", "10/8/2020"), group = c("A",
"A", "A", "A", "B", "B", "B")), class = "data.frame", row.names = c(NA, -7L))
I have a table sf with 2 fields: Customer id and Buy_date. Buy_date is unique for each customer, and there can be more than 3 different Buy_date values per customer. I want to calculate the difference between consecutive Buy_dates for each Customer, and the mean of those differences. How can I do this?
Example
Customer Buy_date
1 2018/03/01
1 2018/03/19
1 2018/04/3
1 2018/05/10
2 2018/01/02
2 2018/02/10
2 2018/04/13
I want the results for each customer in the format
Customer mean
Here's a dplyr solution.
Your data:
df <- data.frame(Customer = c(1, 1, 1, 1, 2, 2, 2),
                 Buy_date = c("2018/03/01", "2018/03/19", "2018/04/3", "2018/05/10",
                              "2018/01/02", "2018/02/10", "2018/04/13"))
Grouping, computing the mean Buy_date, and summarising:
library(dplyr)
df %>%
  group_by(Customer) %>%
  mutate(mean = mean(as.POSIXct(Buy_date))) %>%
  group_by(Customer, mean) %>%
  summarise()
Output:
# A tibble: 2 x 2
# Groups: Customer [?]
Customer mean
<dbl> <dttm>
1 1 2018-03-31 06:30:00
2 2 2018-02-17 15:40:00
Or, as #r2evans points out in the comments, for the mean gap between consecutive Buy_dates:
df %>%
  group_by(Customer) %>%
  mutate(mean = mean(diff(as.POSIXct(Buy_date)))) %>%
  group_by(Customer, mean) %>%
  summarise()
Output:
# A tibble: 2 x 2
# Groups: Customer [?]
Customer mean
<dbl> <time>
1 1 23.3194444444444
2 2 50.4791666666667
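If plain numeric day counts are preferred over the difftime display above, a sketch using as.Date and as.numeric:
df %>%
  group_by(Customer) %>%
  summarise(mean_days = mean(as.numeric(diff(as.Date(Buy_date)))))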
I am not exactly sure of the desired output, but this is what I think you want.
library(dplyr)
library(zoo)
dat <- read.table(text =
"Customer Buy_date
1 2018/03/01
1 2018/03/19
1 2018/04/3
1 2018/05/10
2 2018/01/02
2 2018/02/10
2 2018/04/13", header = T, stringsAsFactors = F)
dat$Buy_date <- as.Date(dat$Buy_date)
dat %>%
  group_by(Customer) %>%
  mutate(diff_between = as.vector(diff(zoo(Buy_date), na.pad = TRUE)),
         mean_days = mean(diff_between, na.rm = TRUE))
This produces:
Customer Buy_date diff_between mean_days
<int> <date> <dbl> <dbl>
1 1 2018-03-01 NA 23.3
2 1 2018-03-19 18 23.3
3 1 2018-04-03 15 23.3
4 1 2018-05-10 37 23.3
5 2 2018-01-02 NA 50.5
6 2 2018-02-10 39 50.5
7 2 2018-04-13 62 50.5
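The zoo dependency could also be avoided with dplyr::lag, which subtracts the previous Buy_date within each group; a sketch:
dat %>%
  group_by(Customer) %>%
  mutate(diff_between = as.numeric(Buy_date - lag(Buy_date)),
         mean_days = mean(diff_between, na.rm = TRUE))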
EDITED BASED ON USER COMMENTS:
Because you said you have factors rather than characters, convert them first:
dat$Buy_date <- as.Date(as.character(dat$Buy_date))
dat$Customer <- as.character(dat$Customer)