Retrieve only selected columns in R with date criteria

I have two tables (orders and prices), and I would like to bring Monthly_code and daily_code from the orders table into the prices table, matching on ID and on the record date falling between Time_in and Time_out. Neither table has a unique primary key.
**Orders table data**
library(data.table)
orders <- data.table(ID = c(1,1,1,2,2,3), Monthly_code = c('xx','xx','vv','uu','mm','gg'),
daily_code = c('xx-1','xx-1','vv-1','uu-1','mm-1','gg-1'),
Time_in = c('12/1/2020','12/16/2020','12/28/2020', '6/1/2020', '4/5/2020', '6/9/2020'),
Time_out = c('12/6/2020', '12/27/2020', '12/31/2020','6/13/2020','4/12/2020','6/23/2020'))
**Prices table data**
prices <- data.table(ID = c(1,1,1,1,2,2,2,3), record_date = c('12/2/2020','12/3/2020','12/4/2020',
'12/5/2020', '6/6/2020', '6/7/2020', '6/8/2020' , '6/20/2020'), Price = c(20,22,21,22,13,15,22,30))
**Expected results data**
price_2 <- data.table(ID = c(1,1,1,1,2,2,2,3), record_date = c('12/2/2020','12/3/2020','12/4/2020',
'12/5/2020', '6/6/2020', '6/7/2020', '6/8/2020' , '6/20/2020'),
Price = c(20,22,21,22,13,15,22,30), Monthly_code = c('xx','xx','xx','xx', 'uu','uu', 'uu','gg'),
daily_code = c('xx-1', 'xx-1', 'xx-1','xx-1', 'uu-1', 'uu-1','uu-1','gg-1'))

You can use the fuzzyjoin package to join two data frames on a range condition: here, equal ID and record_date falling between Time_in and Time_out.
library(dplyr)
library(lubridate)
library(fuzzyjoin)
orders %>%
mutate(across(starts_with('Time'), mdy)) %>%
fuzzy_right_join(prices %>% mutate(record_date = mdy(record_date)),
by = c('ID', 'Time_in' = 'record_date', 'Time_out' = 'record_date'),
match_fun = c(`==`, `<=`, `>=`)) -> result
result
# ID.x Monthly_code daily_code Time_in Time_out ID.y record_date Price
#1 1 xx xx-1 2020-12-01 2020-12-06 1 2020-12-02 20
#2 1 xx xx-1 2020-12-01 2020-12-06 1 2020-12-03 22
#3 1 xx xx-1 2020-12-01 2020-12-06 1 2020-12-04 21
#4 1 xx xx-1 2020-12-01 2020-12-06 1 2020-12-05 22
#5 2 uu uu-1 2020-06-01 2020-06-13 2 2020-06-06 13
#6 2 uu uu-1 2020-06-01 2020-06-13 2 2020-06-07 15
#7 2 uu uu-1 2020-06-01 2020-06-13 2 2020-06-08 22
#8 3 gg gg-1 2020-06-09 2020-06-23 3 2020-06-20 30
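Since both tables in the question are data.tables, a non-equi join may be another option. The following is only a sketch and not part of the answer above; it assumes the dates are written as month/day/year and parses them with as.Date.

```r
library(data.table)

orders <- data.table(
  ID = c(1, 1, 1, 2, 2, 3),
  Monthly_code = c('xx', 'xx', 'vv', 'uu', 'mm', 'gg'),
  daily_code = c('xx-1', 'xx-1', 'vv-1', 'uu-1', 'mm-1', 'gg-1'),
  Time_in = c('12/1/2020', '12/16/2020', '12/28/2020', '6/1/2020', '4/5/2020', '6/9/2020'),
  Time_out = c('12/6/2020', '12/27/2020', '12/31/2020', '6/13/2020', '4/12/2020', '6/23/2020')
)
prices <- data.table(
  ID = c(1, 1, 1, 1, 2, 2, 2, 3),
  record_date = c('12/2/2020', '12/3/2020', '12/4/2020', '12/5/2020',
                  '6/6/2020', '6/7/2020', '6/8/2020', '6/20/2020'),
  Price = c(20, 22, 21, 22, 13, 15, 22, 30)
)

# Parse the character dates (assumed to be month/day/year)
orders[, Time_in := as.Date(Time_in, format = "%m/%d/%Y")]
orders[, Time_out := as.Date(Time_out, format = "%m/%d/%Y")]
prices[, record_date := as.Date(record_date, format = "%m/%d/%Y")]

# For each prices row, look up the orders row with the same ID whose
# [Time_in, Time_out] interval contains record_date
result <- orders[prices,
                 on = .(ID, Time_in <= record_date, Time_out >= record_date),
                 .(ID, record_date = i.record_date, Price, Monthly_code, daily_code)]
result
```

With this form, a prices row that matches no order is kept with NA codes, and a row that matches several overlapping orders appears once per match, so it is worth checking the row count against the original prices table.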

How to use repeated labels when grouping data using cut?

I need to create a new column to categorize my experiment. The time series data is divided into 10-minute groups using cut. I want to add a column that loops through 3 labels, say A, B, C, and then goes back to A.
test<- tibble(date_time = seq.POSIXt(ymd_hms('2020-02-02 00:00:00'),
ymd_hms('2020-02-02 01:00:00'),
by= '30 sec' ),
cat = cut(date_time, breaks = '10 min'))
I want to get something like this
date_time cat
<dttm> <fct>
1 2020-02-02 00:00:00 A
2 2020-02-02 00:05:30 A
3 2020-02-02 00:10:00 B
4 2020-02-02 00:20:30 C
5 2020-02-02 00:30:00 A
6 2020-02-02 00:31:30 A
I have used the labels option in cut before with a known number of factors, but not like this. Any help is welcome.
You could use cut(labels = F) to create numeric labels for your intervals, then use these as an index into the built-in LETTERS vector (or a vector of custom labels). The modulo operator and a little arithmetic will make it cycle through A, B, and C:
library(tidyverse)
library(lubridate)
test<- tibble(date_time = seq.POSIXt(ymd_hms('2020-02-02 00:00:00'),
ymd_hms('2020-02-02 01:00:00'),
by= '30 sec' ),
cat_num = cut(date_time, breaks = '10 min', labels = F),
cat = LETTERS[((cat_num - 1) %% 3) + 1]
)
date_time cat_num cat
<dttm> <int> <chr>
1 2020-02-02 00:00:00 1 A
2 2020-02-02 00:00:30 1 A
3 2020-02-02 00:01:00 1 A
4 2020-02-02 00:01:30 1 A
5 2020-02-02 00:02:00 1 A
6 2020-02-02 00:02:30 1 A
7 2020-02-02 00:03:00 1 A
8 2020-02-02 00:03:30 1 A
9 2020-02-02 00:04:00 1 A
10 2020-02-02 00:04:30 1 A
...48 more rows...
59 2020-02-02 00:29:00 3 C
60 2020-02-02 00:29:30 3 C
61 2020-02-02 00:30:00 4 A
62 2020-02-02 00:30:30 4 A
...59 more rows...
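To see why the index arithmetic cycles, here is a small sketch on plain integers. The labels vector and the bin numbers are illustrative only and stand in for the output of cut(..., labels = FALSE).

```r
labels <- c("A", "B", "C")   # any custom labels would work here
bin <- 1:7                   # interval numbers, as cut(..., labels = FALSE) would return

# Shift to 0-based, wrap around with modulo, shift back to 1-based for indexing
cycled <- labels[((bin - 1) %% length(labels)) + 1]
cycled
# "A" "B" "C" "A" "B" "C" "A"
```

Using length(labels) instead of a hard-coded 3 means the same expression keeps working if the number of labels changes.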
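Using my santoku package (note that lbl_seq("A") continues with D, E, ... rather than cycling back to A after C):
library(santoku)
library(lubridate)
library(tibble)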
test <- tibble(
date_time = seq.POSIXt(ymd_hms('2020-02-02 00:00:00'),
ymd_hms('2020-02-02 01:00:00'),
by= '30 sec'
),
cut_time = chop_width(date_time,
minutes(10),
labels = lbl_seq("A")
)
)
table(test$cut_time)
A B C D E F G
20 20 20 20 20 20 1

Expand dataset by count column in Dplyr

I have a dataset as follows:
library(tidyverse)
df <- data.frame(
report_date = c("2020-03-14", "2020-03-14", "2020-03-19", "2020-03-20"),
start_date = c("2020-03-06", "2020-03-10", "2020-03-11", "2020-03-11"),
count = c(1, 2, 1, 3)
)
Looking like:
report_date start_date count
1 2020-03-14 2020-03-06 1
2 2020-03-14 2020-03-10 2
3 2020-03-19 2020-03-11 1
4 2020-03-20 2020-03-11 3
I want to expand the data using the count column, i.e. repeat each row as many times as its count value and set count to 1 in every resulting row.
The desired result should make this clear:
df_final <- data.frame(
report_date = c("2020-03-14", "2020-03-14", "2020-03-14", "2020-03-19",
"2020-03-20", "2020-03-20", "2020-03-20"),
start_date = c("2020-03-06", "2020-03-10", "2020-03-10", "2020-03-11",
"2020-03-11", "2020-03-11", "2020-03-11"),
count = c(1, 1, 1, 1, 1, 1, 1)
)
report_date start_date count
1 2020-03-14 2020-03-06 1
2 2020-03-14 2020-03-10 1
3 2020-03-14 2020-03-10 1
4 2020-03-19 2020-03-11 1
5 2020-03-20 2020-03-11 1
6 2020-03-20 2020-03-11 1
7 2020-03-20 2020-03-11 1
Thanks!
We can use uncount() from tidyr to replicate each row count times (which drops the count column by default) and then recreate count as 1:
library(dplyr)
library(tidyr)
df %>%
uncount(count) %>%
mutate(count = 1)
Output:
report_date start_date count
1 2020-03-14 2020-03-06 1
2 2020-03-14 2020-03-10 1
3 2020-03-14 2020-03-10 1
4 2020-03-19 2020-03-11 1
5 2020-03-20 2020-03-11 1
6 2020-03-20 2020-03-11 1
7 2020-03-20 2020-03-11 1
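For comparison, the same expansion can be sketched in base R by repeating row indices. This is an alternative to the tidyr approach, not what the answer above uses; the name expanded is just for illustration.

```r
df <- data.frame(
  report_date = c("2020-03-14", "2020-03-14", "2020-03-19", "2020-03-20"),
  start_date = c("2020-03-06", "2020-03-10", "2020-03-11", "2020-03-11"),
  count = c(1, 2, 1, 3)
)

# Repeat each row index as many times as its count, then subset by those indices
idx <- rep(seq_len(nrow(df)), times = df$count)
expanded <- df[idx, ]

expanded$count <- 1          # every expanded row now represents a single unit
rownames(expanded) <- NULL   # drop the 2, 2.1, 4, 4.1, ... row names left by subsetting
expanded
```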

Extracting a date from a column and adding the year if missing in R

I am trying to extract dates from text and create a new column in a dataset. Dates are entered in different formats in column A1 (either mm-dd-yy or mm-dd). I need to find a way to identify the date in column A1 and then add the year if it is missing. Thus far, I have been able to extract the date regardless of the format; however, when I use as.Date on the new column A2, the date with mm-dd format becomes <NA>. I am aware that there might not be a direct solution for this situation, but a workaround (generalizable to a larger data set) would be great. The year would go from September 2019 to August 2020. Additionally, I am not sure why the format I use within the as.Date function is unable to control how the date gets displayed. This latter issue is not that important, but I am surprised by the behavior of the R function. A solution in tidyverse would be much appreciated.
library(tidyverse)
library(stringr)
db <- data.frame(A1 = c("review 11/18", "begins 12/4/19", "3/5/20", NA, "deadline 09/5/19", "9/3"))
db %>% mutate(A2 = str_extract(A1, "[0-9/0-9]+"))
# A1 A2
#1 review 11/18 11/18
#2 begins 12/4/19 12/4/19
#3 3/5/20 3/5/20
#4 <NA> <NA>
#5 deadline 09/5/19 09/5/19
#6 9/3 9/3
db %>% mutate(A2 = str_extract(A1, "[0-9/0-9]+")) %>%
mutate(A2 = A2 %>% as.Date(., "%m/%d/%y"))
# A1 A2
# 1 review 11/18 <NA>
# 2 begins 12/4/19 2019-12-04
# 3 3/5/20 2020-03-05
# 4 <NA> <NA>
# 5 deadline 09/5/19 2019-09-05
# 6 9/3 <NA>
Perhaps:
library(tidyverse)
db <- data.frame(A1 = c("review 11/18", "begins 12/4/19", "3/5/20", NA, "deadline 09/5/19", "9/3"))
# the year runs from September 2019 to August 2020
(db <-
db %>%
mutate(A2 = str_extract(A1, '[\\d\\d/]+'),
A2 = if_else(str_count(A2, '/') == 1 & as.numeric(str_extract(A2, '\\d+')) > 8, paste0(A2, '/19'), A2),
A2 = if_else(str_count(A2, '/') == 1 & as.numeric(str_extract(A2, '\\d+')) <= 8, paste0(A2, '/20'), A2),
A2 = as.Date(A2, "%m/%d/%y")) )
#> A1 A2
#> 1 review 11/18 2019-11-18
#> 2 begins 12/4/19 2019-12-04
#> 3 3/5/20 2020-03-05
#> 4 <NA> <NA>
#> 5 deadline 09/5/19 2019-09-05
#> 6 9/3 2019-09-03
Created on 2021-11-21 by the reprex package (v2.0.1)
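On the side question of why the format given to as.Date does not control how the date is displayed: the format argument only describes how to read the input string. A short base R sketch of the difference between parsing and displaying:

```r
# Parsing: the format string must describe the whole input,
# so a string with no year cannot satisfy "%m/%d/%y"
no_year <- as.Date("11/18", format = "%m/%d/%y")
no_year
# NA

# A complete string parses, and a Date always prints as yyyy-mm-dd
d <- as.Date("12/4/19", format = "%m/%d/%y")
d
# "2019-12-04"

# To control the display, convert back to character with format()
shown <- format(d, "%m/%d/%y")
shown
# "12/04/19"
```

The result of format() is a character vector, so it is best applied only at the point of display, after any date arithmetic is done.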
Well, this is neither a beautiful, concise, nor tidyverse solution, but it does work and should be flexible in its modularity.
library(tidyverse)
db <- data.frame(A1 = c("review 11/18", "begins 12/4/19", "3/5/20", NA, "deadline 09/5/19", "9/3"))
db <- db %>% mutate(A2 = str_extract(A1, "[0-9/]+"))
# Split each extracted date on "/" and look at the pieces row by row
parts <- str_split(db$A2, "/", n = 3)
n_parts <- lengths(parts)                      # 2 for mm/dd, 3 for mm/dd/yy, 1 for NA
month <- as.numeric(sapply(parts, `[`, 1))     # leading month of each entry
no_year <- !is.na(db$A2) & n_parts == 2        # rows where the year is missing
# Months 9-12 belong to 2019, months 1-8 to 2020
db$A2 <- ifelse(test = no_year & month >= 9, yes = paste0(db$A2, "/19"), no = db$A2)
db$A2 <- ifelse(test = no_year & month < 9, yes = paste0(db$A2, "/20"), no = db$A2)
db <- db %>% mutate(A2 = A2 %>% as.Date(., "%m/%d/%y"))
db
A1 A2
1 review 11/18 2019-11-18
2 begins 12/4/19 2019-12-04
3 3/5/20 2020-03-05
4 <NA> <NA>
5 deadline 09/5/19 2019-09-05
6 9/3 2019-09-03
I like the rematch2 package for many regex scenarios.
The first pattern tries to match the full m/d/y values. The second pattern tries to match the partial m/d values (it also separates the month from the day, so the year can be determined as 2019 or 2020).
Once those pieces are isolated, the rest is just a sequence of small steps.
db |>
rematch2::bind_re_match(from = A1, "^.*?(?<mdy>\\d{1,2}/\\d{1,2}/\\d{2})$") |>
rematch2::bind_re_match(from = A1, "^.*?(?<md_m>\\d{1,2})/(?<md_d>\\d{1,2})$") |>
dplyr::mutate(
md_m = as.integer(md_m),
md_y = dplyr::if_else(9L <= md_m, "19", "20"), # It's 2019 if the month is Sept or later
md = sprintf("%i/%s/%s", md_m, md_d, md_y), # Assemble components
md = as.Date(md , "%m/%d/%y"), # Convert data type
mdy = as.Date(mdy, "%m/%d/%y"), # Convert data type
date = dplyr::coalesce(mdy, md), # Prefer the mdy if it's not missing
)
Output:
A1 mdy md_m md_d md_y md date
1 review 11/18 <NA> 11 18 19 2019-11-18 2019-11-18
2 begins 12/4/19 2019-12-04 4 19 20 2020-04-19 2019-12-04
3 3/5/20 2020-03-05 5 20 20 2020-05-20 2020-03-05
4 <NA> <NA> NA <NA> <NA> <NA> <NA>
5 deadline 09/5/19 2019-09-05 5 19 20 2020-05-19 2019-09-05
6 9/3 <NA> 9 3 19 2019-09-03 2019-09-03

Create variable for day of the experiment

I have a large data set that spanned a month in time with the data stamped in a column called txn_date like the below. (this is a toy reproduction of it)
dat1 <- read.table(text = "var1 txn_date
5 2020-10-25
1 2020-10-25
3 2020-10-26
4 2020-10-27
1 2020-10-27
3 2020-10-31
3 2020-11-01
8 2020-11-02 ", header = TRUE)
Ideally I would like to get a column in my data frame for each date in the data which I think could be done by first getting a single column that is 1 for the first date that appears and then so on.
So something like this
dat1 <- read.table(text = "var1 txn_date day
5 2020-10-25 1
1 2020-10-25 1
3 2020-10-26 2
4 2020-10-27 3
1 2020-10-27 3
3 2020-10-31 7
3 2020-11-01 8
8 2020-11-02 9 ", header = TRUE)
I'm not quite sure how to get this. The txn_date column is as.Date in my actual data frame. I think if I could get the single day column like is listed above (then convert it to a factor) then I could always one hot encode the actual levels of that column if I need to. Ultimately I need to use the day of the experiment as a regressor in a regression I'm going to run.
Something along the lines of y ~ x + day_1 + day_2 +...+ error
Would this be suitable?
library(tidyverse)
dat1 <- read.table(text = "var1 txn_date
5 2020-10-25
1 2020-10-25
3 2020-10-26
4 2020-10-27
1 2020-10-27
3 2020-10-31
3 2020-11-01
8 2020-11-02 ", header = TRUE)
dat1$txn_date <- as.Date(dat1$txn_date)
dat1 %>%
mutate(days = txn_date - txn_date[1] + 1)
# var1 txn_date days
#1 5 2020-10-25 1 days
#2 1 2020-10-25 1 days
#3 3 2020-10-26 2 days
#4 4 2020-10-27 3 days
#5 1 2020-10-27 3 days
#6 3 2020-10-31 7 days
#7 3 2020-11-01 8 days
#8 8 2020-11-02 9 days
We create a sequence of dates based on the min and max of 'txn_date' and match
dates <- seq(min(as.Date(dat1$txn_date)),
max(as.Date(dat1$txn_date)), by = '1 day')
dat1$day <- with(dat1, match(as.Date(txn_date), dates))
dat1$day
#[1] 1 1 2 3 3 7 8 9
Or we may use the factor route:
with(dat1, as.integer(factor(txn_date, levels = as.character(dates))))
#[1] 1 1 2 3 3 7 8 9
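Since the stated goal is to use the day as a regressor, here is a hedged sketch of the one-hot step using base R's model.matrix. It recomputes the day index as an integer offset from the earliest date; the name day_dummies is illustrative only.

```r
dat1 <- data.frame(
  var1 = c(5, 1, 3, 4, 1, 3, 3, 8),
  txn_date = as.Date(c("2020-10-25", "2020-10-25", "2020-10-26", "2020-10-27",
                       "2020-10-27", "2020-10-31", "2020-11-01", "2020-11-02"))
)

# Day of the experiment: whole days since the earliest date, starting at 1
dat1$day <- as.integer(dat1$txn_date - min(dat1$txn_date)) + 1L

# Treating day as a factor yields an intercept plus one indicator column
# per observed day except the first, which serves as the reference level
day_dummies <- model.matrix(~ factor(day), data = dat1)
colnames(day_dummies)
```

In practice the explicit matrix is often unnecessary, because lm() expands factors itself; a formula such as y ~ x + factor(day) would produce the same indicator columns internally.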

How many days from the list were in given period [R]

I'd like to use R to count how many days from the following list:
2020-10-01
2020-10-03
2020-10-07
2020-10-08
2020-10-09
2020-10-10
2020-10-14
2020-10-17
2020-10-21
2020-10-22
2020-10-27
2020-10-29
2020-10-30
fall within each given period from start to end:
id start end
1 2020-10-05 2020-10-30
2 2020-10-06 2020-10-29
3 2020-10-10 2020-10-12
And the result should be for example:
id number of days
1 5
2 18
3 12
Here is a tidyverse approach with lubridate and dplyr.
library(lubridate)
library(dplyr)
df %>%
count(id, start, end,
wt = days %within% interval(start, end),
name = "number_of_days")
#> id start end number_of_days
#> 1 1 2020-10-05 2020-10-30 11
#> 2 2 2020-10-06 2020-10-29 10
#> 3 3 2020-10-10 2020-10-12 1
For each row, count the number of days within the interval of start and end (extremes included).
(If you don't want to see start and end just remove them from the first line of count)
Where:
days <- c("2020-10-01",
"2020-10-03",
"2020-10-07",
"2020-10-08",
"2020-10-09",
"2020-10-10",
"2020-10-14",
"2020-10-17",
"2020-10-21",
"2020-10-22",
"2020-10-27",
"2020-10-29",
"2020-10-30")
df <- read.table(text = " id start end
1 2020-10-05 2020-10-30
2 2020-10-06 2020-10-29
3 2020-10-10 2020-10-12", header = TRUE)
days <- as.Date(days)
df$start <- as.Date(df$start)
df$end <- as.Date(df$end)
Assuming all the dates are of class Date (with the list of days stored in df1$dates and the periods in df2), you can use mapply:
df2$num_days <- mapply(function(x, y) sum(df1$dates >= x & df1$dates <= y), df2$start, df2$end)
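A self-contained sketch of the same counting idea on the question's data, written with vapply over row indices instead of mapply. The object name periods is an assumption for illustration; both endpoints are treated as inclusive.

```r
days <- as.Date(c("2020-10-01", "2020-10-03", "2020-10-07", "2020-10-08",
                  "2020-10-09", "2020-10-10", "2020-10-14", "2020-10-17",
                  "2020-10-21", "2020-10-22", "2020-10-27", "2020-10-29",
                  "2020-10-30"))

periods <- data.frame(
  id = 1:3,
  start = as.Date(c("2020-10-05", "2020-10-06", "2020-10-10")),
  end = as.Date(c("2020-10-30", "2020-10-29", "2020-10-12"))
)

# For each period, count the listed days with start <= day <= end
periods$number_of_days <- vapply(
  seq_len(nrow(periods)),
  function(i) sum(days >= periods$start[i] & days <= periods$end[i]),
  integer(1)
)
periods$number_of_days
# 11 10 1
```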
