Convert character column to date column in lubridate [duplicate]

This question already has answers here:
Convert date-time string to class Date
(4 answers)
Closed 3 years ago.
I have a data frame that looks like this:
# A tibble: 10 x 4
id incoming_date expiry_date end_date
<dbl> <chr> <chr> <chr>
1 1 11.17.18 10.1.19 03.1.19
2 11 03.4.19 NA 03.20.19
3 2 03.17.19 02.1.20 05.7.19
4 2 05.7.19 NA 06.15.19
5 4 06.11.19 05.1.21 06.22.19
6 1 06.12.19 04.1.21 NA
7 2 06.12.19 04.1.21 NA
8 13 11.16.18 06.1.19 02.20.19
9 7 02.19.19 12.1.21 3.23.19
10 1 03.19.19 01.1.21 09.10.19
Reproducible example:
library(tidyverse)
library(lubridate)
df <- as_tibble(structure(list(id = c(1, 11, 2, 2, 4, 1, 2, 13, 7, 1),
  incoming_date = c("11.17.18", "03.4.19", "03.17.19", "05.7.19", "06.11.19",
    "06.12.19", "06.12.19", "11.16.18", "02.19.19", "03.19.19"),
  expiry_date = c("10.1.19", NA, "02.1.20", NA, "05.1.21", "04.1.21",
    "04.1.21", "06.1.19", "12.1.21", "01.1.21"),
  end_date = c("03.1.19", "03.20.19", "05.7.19", "06.15.19", "06.22.19",
    NA, NA, "02.20.19", "3.23.19", "09.10.19")),
  row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame")))
I want to convert the columns from character to date format, as follows:
id incoming_date expiry_date end_date
<dbl> <date> <date> <date>
1 1 2018-11-17 2019-10-01 2019-03-01
2 11 2019-03-04 NA 2019-03-20
3 2 2019-03-17 2020-02-01 2019-05-07
4 2 2019-05-07 NA 2019-06-15
5 4 2019-06-11 2021-05-01 2019-06-22
6 1 2019-06-12 2021-04-01 NA
7 2 2019-06-12 2021-04-01 NA
8 13 2018-11-16 2019-06-01 2019-02-20
9 7 2019-02-19 2021-12-01 2019-03-23
10 1 2019-03-19 2021-01-01 2019-09-10
What I tried to do - starting from the incoming_date column - is:
df %>%
mutate(incoming_date_parsed = strptime(as.POSIXct(incoming_date), "%m.%d.%Y"))
throwing the error:
Error in as.POSIXlt.character(x, tz, ...) :
character string is not in a standard unambiguous format
In logical order, what I want to do is:
Convert the column from character to a date class
Parse the date into the format YYYY-MM-dd

You're really close!
There's no reason to convert to POSIXct first and then to Date in your example.
And you need a lower-case 'y' for two-digit (YY) years.
df <- df %>%
mutate(incoming_date_parsed = as.Date(incoming_date, "%m.%d.%y"))
> head(df$incoming_date_parsed)
[1] "2018-11-17" "2019-03-04" "2019-03-17" "2019-05-07" "2019-06-11" "2019-06-12"

Related

Joining two data frames on the closest date in R

I have two datasets that I would like to join based on date. One is a survey dataset, and the other is a list of prices at various dates. The dates don't match exactly, so I would like to join on the nearest date in the survey dataset (the price data is weekly).
Here's a brief snippet of what the survey dataset looks like (there are many other variables, but here's the two most relevant):
ID          actual.date
20120377    2012-09-26
2020455822  2020-11-23
20126758    2012-10-26
20124241    2012-10-25
2020426572  2020-11-28
And here's the price dataset (also much larger, but you get the idea):
date        price.var1        price.var2
2017-10-30  2.74733926399869  2.73994826674735
2015-03-16  2.77028200438506  2.74079930272231
2010-10-18  3.4265947805337   3.41591263539176
2012-10-29  4.10095806545397  4.14717556976502
2012-01-09  3.87888859352037  3.93074237884497
What I would like to do is join the price dataset to the survey dataset, joining on the nearest date.
I've tried a number of different things, none of which have worked to my satisfaction.
#reading in sample data
library(data.table)
library(dplyr)
survey <- fread(" ID actual.date
1: 20120377 2012-09-26
2: 2020455822 2020-11-23
3: 20126758 2012-10-26
4: 20124241 2012-10-25
5: 2020426572 2020-11-28
> ") %>% select(-V1)
price <- fread("date price.var1 price.var2
1: 2017-10-30 2.747339 2.739948
2: 2015-03-16 2.770282 2.740799
3: 2010-10-18 3.426595 3.415913
4: 2012-10-29 4.100958 4.147176
5: 2012-01-09 3.878889 3.930742") %>% select(-V1)
#using data.table
setDT(survey)[,DT_DATE := actual.date]
setDT(price)[,DT_DATE := date]
survey_price <- survey[price,on=.(DT_DATE),roll="nearest"]
#This works, and they join, but it drops a ton of observations, which won't work
#using dplyr
library(dplyr)
survey_price <- left_join(survey,price,by=c("actual.date"="date"))
#this joins them without dropping observations, but all of the price variables become NAs
You were almost there.
In the DT[i, on] syntax, i should be survey so the join keeps all of survey's rows:
setDT(survey)
setDT(price)
survey_price <- price[survey,on=.(date=actual.date),roll="nearest"]
survey_price
date price.var1 price.var2 ID
<IDat> <num> <num> <int>
1: 2012-09-26 4.100958 4.147176 20120377
2: 2020-11-23 2.747339 2.739948 2020455822
3: 2012-10-26 4.100958 4.147176 20126758
4: 2012-10-25 4.100958 4.147176 20124241
5: 2020-11-28 2.747339 2.739948 2020426572
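If you also want to see which price row each survey date matched (an assumption about what you may need), one sketch is to copy the price date into its own column before joining:
setDT(price)[, price.date := date]  # keep a copy of each price row's own date
survey_price <- price[survey, on = .(date = actual.date), roll = "nearest"]
# after the join, 'date' holds the survey date and 'price.date' the matched price date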
Another option: convert the dates to numeric, find for each price date the closest survey date with Closest() from DescTools, and replace the price date with that value before doing an exact merge.
Example datasets (libraries loaded up front):
library(tidyverse)
library(lubridate)
library(DescTools)
survey <- tibble(
ID = sample(20000:40000, 9, replace = TRUE),
actual.date = seq(today() %m+% days(5), today() %m+% days(5) %m+% months(2),
"week")
)
price <- tibble(
date = seq(today(), today() %m+% months(2), by = "week"),
price_1 = sample(2:6, 9, replace = TRUE),
price_2 = sample(2:6, 9, replace = TRUE)
)
survey
# A tibble: 9 x 2
ID actual.date
<int> <date>
1 34592 2022-05-07
2 37846 2022-05-14
3 22715 2022-05-21
4 22510 2022-05-28
5 30143 2022-06-04
6 34348 2022-06-11
7 21538 2022-06-18
8 39802 2022-06-25
9 36493 2022-07-02
price
# A tibble: 9 x 3
date price_1 price_2
<date> <int> <int>
1 2022-05-02 6 6
2 2022-05-09 3 2
3 2022-05-16 6 4
4 2022-05-23 6 2
5 2022-05-30 2 6
6 2022-06-06 2 4
7 2022-06-13 2 2
8 2022-06-20 3 5
9 2022-06-27 5 6
price <- price %>%
  mutate(date = as_date(Closest(as.numeric(survey$actual.date),
                                as.numeric(date))))
price
# A tibble: 9 x 3
date price_1 price_2
<date> <int> <int>
1 2022-05-07 6 6
2 2022-05-14 3 2
3 2022-05-21 6 4
4 2022-05-28 6 2
5 2022-06-04 2 6
6 2022-06-11 2 4
7 2022-06-18 2 2
8 2022-06-25 3 5
9 2022-07-02 5 6
merge(survey, price, by.x = "actual.date", by.y = "date")
actual.date ID price_1 price_2
1 2022-05-07 34592 6 6
2 2022-05-14 37846 3 2
3 2022-05-21 22715 6 4
4 2022-05-28 22510 6 2
5 2022-06-04 30143 2 6
6 2022-06-11 34348 2 4
7 2022-06-18 21538 2 2
8 2022-06-25 39802 3 5
9 2022-07-02 36493 5 6

Find if a date overlaps between multiple pairs of vectorised dates

I'm trying to find whether a date falls between any of multiple pairs of dates stored wide in my dataset - the number of pairs shown here is just an example; the real number may be larger or smaller. I'm not sure this is the most sensible layout, but working long didn't seem to work. It is also a very common way to handle overlapping dates and date pairs in SPSS, where you can have multiple numbered variable 'sets' like the ones here and it works through each set to give you a response.
Here is an example dataset:
person key_date 1_end_date 2_end_date 3_end_date 4_end_date 1_start_date 2_start_date 3_start_date 4_start_date
1 1 2019-09-30 2019-05-23 2019-09-30 2016-07-22 <NA> 2019-05-23 2019-09-30 2016-07-22 <NA>
2 2 2019-06-07 2019-05-16 2019-06-07 <NA> <NA> 2019-05-16 <NA> <NA> <NA>
3 3 2020-03-09 2016-06-02 2019-08-09 2020-05-27 2020-02-12 2016-06-02 2019-08-09 2020-05-27 2020-03-09
test <- structure(list(
  person = 1:3,
  key_date = structure(c(18169, 18054, 18330), class = "Date"),
  `1_end_date` = structure(c(18039, 18032, 16954), class = "Date"),
  `2_end_date` = structure(c(18169, 18054, 18117), class = "Date"),
  `3_end_date` = structure(c(17004, NA, 18409), class = "Date"),
  `4_end_date` = structure(c(NA, NA, 18304), class = "Date"),
  `1_start_date` = structure(c(18039, 18032, 16954), class = "Date"),
  `2_start_date` = structure(c(18169, NA, 18117), class = "Date"),
  `3_start_date` = structure(c(17004, NA, 18409), class = "Date"),
  `4_start_date` = structure(c(NA, NA, 18330), class = "Date")
), row.names = c(NA, 3L), class = "data.frame")
The expected output would be just a binary flag to indicate that the key_date exists between any pair of start_date and end_date. In the example given, that would mean person 1 and 3. Any ideas how to do this? Is this really inefficient?
tidyverse approach
library(tidyverse)
result <- test %>%
  mutate(across(ends_with("end_date"), ~
    key_date <= . & key_date >= get(str_replace(cur_column(), "end", "start")),
    .names = '{.col}_flag')) %>%
  rowwise() %>%
  mutate(Flag1 = sum(c_across(ends_with("flag")), na.rm = TRUE)) %>%
  ungroup() %>%
  select(-ends_with("flag"))
> result$Flag1
[1] 1 0 0
The complete output will look like:
> result
# A tibble: 3 x 11
person key_date `1_end_date` `2_end_date` `3_end_date` `4_end_date` `1_start_date` `2_start_date` `3_start_date` `4_start_date` Flag1
<int> <date> <date> <date> <date> <date> <date> <date> <date> <date> <dbl>
1 1 2019-09-30 2019-05-23 2019-09-30 2016-07-22 NA 2019-05-23 2019-09-30 2016-07-22 NA 1
2 2 2019-06-07 2019-05-16 2019-06-07 NA NA 2019-05-16 NA NA NA 0
3 3 2020-03-09 2016-06-02 2019-08-09 2020-05-27 2020-02-12 2016-06-02 2019-08-09 2020-05-27 2020-03-09 0
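The same check can also be written by reshaping to long form, which scales with the number of start/end pairs. A sketch, assuming the column-name pattern shown above:
library(tidyverse)

test %>%
  # split "1_end_date" etc. into a set number plus start/end column pair
  pivot_longer(-c(person, key_date),
               names_to = c("set", ".value"),
               names_pattern = "(\\d+)_(start|end)_date") %>%
  group_by(person) %>%
  # flag persons whose key_date falls inside any start/end interval
  summarise(Flag1 = as.integer(any(key_date >= start & key_date <= end,
                                   na.rm = TRUE)))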

Set up data in order to use Prophet() in R

I want to use the prophet() function in R, but I cannot transform my "YearWeek" column into a Date column.
The "YearWeek" column stores values from 201401 up to 201937, i.e. starting in 2014 week 1 up to 2019 week 37.
I don't know how to convert this column to a date of the form yyyy-ww needed to use prophet().
Does anyone know how to do this?
Thank you in advance.
One solution could be to append a 01 to the end of your yyyy-ww formatted dates.
Data:
library(tidyverse)
df <- cross2(2014:2019, str_pad(1:52, width = 2, pad = 0)) %>%
map_df(set_names, c("year", "week")) %>%
transmute(date = paste(year, week, sep = "")) %>%
arrange(date)
head(df)
#> # A tibble: 6 x 1
#> date
#> <chr>
#> 1 201401
#> 2 201402
#> 3 201403
#> 4 201404
#> 5 201405
#> 6 201406
Now let's append the 01 and convert to date:
df %>%
mutate(date = paste(date, "01", sep = ""),
new_date = as.Date(date, "%Y%U%w"))
#> # A tibble: 312 x 2
#> date new_date
#> <chr> <date>
#> 1 20140101 2014-01-05
#> 2 20140201 2014-01-12
#> 3 20140301 2014-01-19
#> 4 20140401 2014-01-26
#> 5 20140501 2014-02-02
#> 6 20140601 2014-02-09
#> 7 20140701 2014-02-16
#> 8 20140801 2014-02-23
#> 9 20140901 2014-03-02
#> 10 20141001 2014-03-09
#> # ... with 302 more rows
Created on 2019-10-10 by the reprex package (v0.3.0)
More info about the %U (week of year) and %w (weekday) conversion specifications can be found in ?strptime.
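From there, fitting the model is mostly a renaming exercise: prophet expects a data frame with a ds date column and a y value column. A minimal sketch, where value is a placeholder for whatever weekly series you are modelling:
library(prophet)
library(dplyr)

model_df <- df %>%
  transmute(ds = as.Date(paste0(date, "01"), "%Y%U%w"),
            y = value)  # 'value' is a hypothetical measurement column

m <- prophet(model_df)  # fit; forecast with make_future_dataframe() + predict()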

Subtracting dates based on conditions using dplyr in r

Below is an example of a table I am working with.
df = data.frame(Test_ID = c('a1','a1','a1','a1','a1','a1','a1','a2','a2','a2','a2','a2','a2'),
Event_ID = c('Failure_x', 'Failure_x', 'Failure_y', 'Failure_y', 'Failure_x',
'Failure_x', 'Failure_y', 'Failure_x', 'Failure_y', 'Failure_y',
'Failure_x','Failure_x', 'Failure_y'),
Fail_Date = c('2018-10-10 17:52:20', '2018-10-11 17:02:16', '2018-10-14 12:52:20',
'2018-11-11 16:18:34', '2018-11-12 17:03:06', '2018-11-25 10:50:10',
'2018-12-01 10:28:50', '2018-09-12 19:02:08', '2018-09-20 11:32:25',
'2018-10-13 14:43:30', '2018-10-15 14:22:28', '2018-10-30 21:55:45',
'2018-11-17 11:53:35'))
I want to subtract the failure dates (by Test_ID) only where Failure_y occurs after Failure_x. The Fail_Date for Event_ID Failure_y will be subtracted from the Fail_Date for Event_ID Failure_x. Within a group I can have multiple Failure_y's. The second Failure_y will be subtracted from the Failure_x occurring after the first instance of Failure_y.
I have tried to use dplyr to create a column TIME_BETWEEN_FAILURES.
library(dplyr)
library(lubridate)
df$Fail_Date <- as.POSIXct(df$Fail_Date, format = "%Y-%m-%d %H:%M:%S")
df <- df %>% group_by(Test_ID) %>%
  mutate(TIME_BETWEEN_FAILURES = ifelse(Event_ID == "Failure_y" & lag(Event_ID) == "Failure_x",
                                        difftime(Fail_Date, first(Fail_Date), units = "hours"),
                                        ''))
I was able to create the Time_BETWEEN_FAILURES correctly only for the first instance using first() in dplyr. That's where I am currently stuck. Any help on this matter will be appreciated.
[Screenshot of the output from the snippet above was attached here.]
[Screenshot of the required output was attached here.]
Thanks.
df %>%
  # rev(cumsum(rev(...))) numbers rows in blocks that each end with a Failure_y,
  # counting from the bottom of the data upwards
  group_by(gr = rev(cumsum(rev(Event_ID) == "Failure_y")), Test_ID) %>%
  # within each block, hours from the first Failure_x to the closing Failure_y
  mutate(time_between_failures = ifelse(n() > 1 & Event_ID == "Failure_y",
                                        difftime(Fail_Date[n()], Fail_Date[1L], units = "hours"),
                                        NA))
# A tibble: 13 x 5
# Groups: gr, Test_ID [6]
Test_ID Event_ID Fail_Date gr time_between_failures
<fct> <fct> <dttm> <int> <dbl>
1 a1 Failure_x 2018-10-10 17:52:20 6 NA
2 a1 Failure_x 2018-10-11 17:02:16 6 NA
3 a1 Failure_y 2018-10-14 12:52:20 6 91
4 a1 Failure_y 2018-11-11 16:18:34 5 NA
5 a1 Failure_x 2018-11-12 17:03:06 4 NA
6 a1 Failure_x 2018-11-25 10:50:10 4 NA
7 a1 Failure_y 2018-12-01 10:28:50 4 449.
8 a2 Failure_x 2018-09-12 19:02:08 3 NA
9 a2 Failure_y 2018-09-20 11:32:25 3 185.
10 a2 Failure_y 2018-10-13 14:43:30 2 NA
11 a2 Failure_x 2018-10-15 14:22:28 1 NA
12 a2 Failure_x 2018-10-30 21:55:45 1 NA
13 a2 Failure_y 2018-11-17 11:53:35 1 790.

Creating intervals from time series data

I have a data frame of users and access times. Access times can be duplicated.
I am trying to create a list of users grouped and named by a given time interval, e.g. year.
timestamp user
1 2013-03-06 01:00:00 1
2 2014-07-06 21:00:00 1
3 2014-07-31 23:00:00 2
4 2014-08-09 17:00:00 2
5 2014-08-14 20:00:00 2
6 2014-08-14 22:00:00 3
7 2014-08-16 15:00:00 3
8 2014-08-19 02:00:00 1
9 2014-12-28 18:00:00 1
10 2015-01-17 17:00:00 1
11 2015-01-22 22:00:00 2
12 2015-01-22 22:00:00 3
13 2015-03-23 15:00:00 4
14 2015-04-05 18:00:00 1
15 2015-04-06 01:00:00 2
My code example already creates a list of users grouped by year.
My problem is that this approach requires modifying the table, which becomes a problem with my tables of a million entries.
test <- structure(list(timestamp = c("2013-03-06 01:00:00", "2014-07-06 21:00:00",
"2014-07-31 23:00:00", "2014-08-09 17:00:00", "2014-08-14 20:00:00",
"2014-08-14 22:00:00", "2014-08-16 15:00:00", "2014-08-19 02:00:00",
"2014-12-28 18:00:00", "2015-01-17 17:00:00", "2015-01-22 22:00:00",
"2015-01-22 22:00:00", "2015-03-23 15:00:00", "2015-04-05 18:00:00",
"2015-04-06 01:00:00"), user = c(1L, 1L, 2L, 2L, 2L, 3L, 3L,
1L, 1L, 1L, 2L, 3L, 4L, 1L, 2L)), .Names = c("timestamp", "user"
), class = "data.frame", row.names = c(NA, -15L))
require(lubridate)
#Creating "POSIXct" object from string timestamp
timestamp <- lapply(test$timestamp,
function(x)parse_date_time(x, "y-m-d H:M:S"))
test$timestamp <- do.call(c,timestamp)
print(class(test$timestamp))
#Adding column for year
test <- cbind(test,sapply(timestamp, function(x)year(x)))
colnames(test)[3]<- "year"
#Creating list of year time intervals for users
intervals <- names(table(test$year))
users <- lapply(intervals, function(x)test[test$year %in% x,"user"])
names(users) <- intervals
without timestamps
treat the timestamp as a character. This only works if, for every timestamp, the first 4 digits represent the year.
library(dplyr)
test %>%
group_by( user, substr(timestamp,1,4 ) ) %>%
summarise( )
# user `substr(timestamp, 1, 4)`
# <int> <chr>
# 1 1 2013
# 2 1 2014
# 3 1 2015
# 4 2 2014
# 5 2 2015
# 6 3 2014
# 7 3 2015
# 8 4 2015
dplyr + lubridate
will extract the year from the timestamp
library( dplyr )
library( lubridate )
test %>%
mutate( timestamp = as.POSIXct( timestamp, format = "%Y-%m-%d %H:%M:%S" ) ) %>%
group_by( user, lubridate::year( timestamp ) ) %>%
summarise( )
# # Groups: user [?]
# user `year(timestamp)`
# <int> <dbl>
# 1 1 2013
# 2 1 2014
# 3 1 2015
# 4 2 2014
# 5 2 2015
# 6 3 2014
# 7 3 2015
# 8 4 2015
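If you want the group sizes rather than just the group keys, a short sketch with count() (instead of an empty summarise()) gives a tidy tally:
library(dplyr)

# one row per user/year combination, with the number of accesses in n
test %>%
  mutate(year = substr(timestamp, 1, 4)) %>%
  count(user, year)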
table
a frequency table is also quickly made
table( test$user, substr( test$timestamp, 1, 4 ) )
# 2013 2014 2015
# 1 1 3 2
# 2 0 3 2
# 3 0 2 1
# 4 0 0 1
There are many more alternatives... pick one.
edit
if speed is an issue, try data.table
dcast(
  setDT(test)[, timestamp := as.POSIXct(timestamp, format = "%Y-%m-%d %H:%M:%S")
              ][, .N, by = .(user, year = data.table::year(timestamp))],
  user ~ year,
  value.var = "N")
# user 2013 2014 2015
# 1: 1 1 3 2
# 2: 2 NA 3 2
# 3: 3 NA 2 1
# 4: 4 NA NA 1
Another option using the lightning fast data.table package:
library(data.table)
setDT(test) # make `test` a data.table 'by reference' (no copy is made at all)
test[, j=.(users=list(unique(user))),
     by=.(year=substr(timestamp,1,4))]
year users
1: 2013 1
2: 2014 1,2,3
3: 2015 1,2,3,4
Again assuming your test$timestamp column is a character vector - otherwise substitute lubridate::year() as needed.
Update:
Simple change to show grouping instead by month (just as it was mentioned in a comment):
test[, j=.(users=list(unique(user))),
     by=.(ym=substr(timestamp,1,7))]
ym users
1: 2013-03 1
2: 2014-07 1,2
3: 2014-08 2,3,1
4: 2014-12 1
5: 2015-01 1,2,3
6: 2015-03 4
7: 2015-04 1,2
Or group by day, to help demonstrate how to subset with chaining:
test[, j=.(users=list(unique(user))),
     by=.(ymd=substr(timestamp,1,10))][ymd>='2014-08-01' & ymd<='2014-08-21']
ymd users
1: 2014-08-09 2
2: 2014-08-14 2,3
3: 2014-08-16 3
4: 2014-08-19 1
Note on filtering/subsetting: if you are only interested in a subset of dates for a "one off" calculation (and are not saving the whole aggregated set for other purposes), it will likely be more efficient to do the subset in i of DT[i, j, by] for that calculation.
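A sketch of that subset-first pattern, assuming timestamp is still a character column:
# filtering in i first means only the matching rows are ever aggregated
test[substr(timestamp, 1, 10) %between% c("2014-08-01", "2014-08-21"),
     .(users = list(unique(user))),
     by = .(ymd = substr(timestamp, 1, 10))]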
You could also use base (stats) function aggregate() as follows:
aggregate( x = test$user,
by = list(year=substr(test$timestamp,1,4)),
FUN = unique )
Result:
year x
1 2013 1
2 2014 1, 2, 3
3 2015 1, 2, 3, 4
The above works on the assumption that your timestamp column is just a character vector, exactly as in your structured example data. In that case you can substr out the year directly with substr(test$timestamp,1,4), avoiding the need to first convert to dates.
However, if you have the timestamp column already as a date, simply substitute the lubridate::year() function you demonstrated in your attempted solution.
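For completeness, the same aggregate() call when the timestamp column has already been converted to a date-time, as the note above suggests (a sketch):
aggregate(x = test$user,
          by = list(year = lubridate::year(test$timestamp)),
          FUN = unique)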
