I have a dataframe with a few columns that contain time/date information. I'm familiar with using lubridate to parse date-times (i.e. mm/dd/yyyy hh:mm:ss), but this dataframe has the date-time in reverse order (i.e. hh:mm:ss mm/dd/yyyy). How do I get this to read as a date-time? The column is currently read as character, which is useless to me. Below is an example of what my dataset looks like. I can't make the "time_date" column read as a date-time.
library(tibble)
df <- tribble(~activity, ~time_date,
"run", "15:06:17 03/08/2016",
"skip", "09:01:00 03/08/2016")
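First parse the column as a date-time using a format that matches the input. After that, you can use strftime to get the desired text format, but note that strftime returns character, so keep the POSIXct result (datetimes below) if you need a real date-time column: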
datetimes <- as.POSIXct(df$time_date, format = "%H:%M:%S %m/%d/%Y")
df$time_date <- strftime(datetimes, format = "%m/%d/%Y %H:%M:%S")
df
#> # A tibble: 2 × 2
#> activity time_date
#> <chr> <chr>
#> 1 run 03/08/2016 15:06:17
#> 2 skip 03/08/2016 09:01:00
Created on 2023-01-04 with reprex v2.0.2
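If the goal is a real date-time column rather than a reordered string, one option is to keep the POSIXct result and call format() only for display. A minimal sketch, assuming the values are in UTC (the tz choice and the names x, elapsed_secs and shown are illustrative, not from the question):

```r
# Sample strings from the question: time first, date second
x <- c("15:06:17 03/08/2016", "09:01:00 03/08/2016")

# Parse once into POSIXct; tz = "UTC" is an assumption for this sketch
datetimes <- as.POSIXct(x, format = "%H:%M:%S %m/%d/%Y", tz = "UTC")
class(datetimes)

# Date-time arithmetic works on the parsed values
elapsed_secs <- as.numeric(difftime(datetimes[1], datetimes[2], units = "secs"))

# format() returns character, so use it only when printing or exporting
shown <- format(datetimes, "%m/%d/%Y %H:%M:%S")
```

Storing datetimes in the data frame and formatting only at output time keeps sorting, differences and lubridate accessors available.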
With dplyr and lubridate on character class data.
library(dplyr)
library(lubridate)
df %>%
rowwise() %>%
mutate(dd = strsplit(time_date, " "),
date_time = mdy_hms(paste(unlist(dd)[2], unlist(dd)[1])),
dd = NULL) %>%
ungroup()
# A tibble: 2 × 3
activity time_date date_time
<chr> <chr> <dttm>
1 run 15:06:17 03/08/2016 2016-03-08 15:06:17
2 skip 09:01:00 03/08/2016 2016-03-08 09:01:00
Alternatively using str_extract
library(stringr) # provides str_extract()
df %>%
mutate(date_time = mdy_hms(paste(str_extract(time_date, " \\d+/.+"),
str_extract(time_date, "\\d+:.+ "))))
# A tibble: 2 × 3
activity time_date date_time
<chr> <chr> <dttm>
1 run 15:06:17 03/08/2016 2016-03-08 15:06:17
2 skip 09:01:00 03/08/2016 2016-03-08 09:01:00
You can use lubridate::parse_date_time() and specify the order as "HMS mdy":
library(dplyr)
library(lubridate)
df %>%
mutate(date_time = parse_date_time(time_date, "HMS mdy"))
# A tibble: 2 × 3
activity time_date date_time
<chr> <chr> <dttm>
1 run 15:06:17 03/08/2016 2016-03-08 15:06:17
2 skip 09:01:00 03/08/2016 2016-03-08 09:01:00
I am working with a large list of dataframes that use inconsistent date formats. I would like to conditionally mutate across the list so that any dataframe that contains a string will use one date format, and those that do not contain the string use another format. In other words, I want to distinguish between dataframes launched in year 2019 (which use mdy) and those launched in all others years (which use dmy).
The following code will conditionally mutate rows within a dataframe, but I am unsure how to conditionally mutate across the entire column.
dataframes %>% map(~.x %>%
mutate(date_time = if_else(str_detect(date_time, "/19 "),
mdy_hms(date_time), dmy_hms(date_time))))
Thank you!
edit
Data and code example. There are dataframes that contain a mixture of years.
library(tidyverse)
library(lubridate)
dataframes <- list(
tibble(date_time = c("07/06/19 01:00:00 PM", "07/06/20 01:00:00 PM"), num = 1:2), # July 6th
tibble(date_time = c("06/07/20 01:00:00 PM", "06/07/21 01:00:00 PM"), num = 1:2) # July 6th
)
dataframes %>%
map(~.x %>%
mutate(date_time = if_else(str_detect(date_time, "/19 "),
mdy_hms(date_time), dmy_hms(date_time)),
date = date(date_time),
month = month(date_time),
doy = yday(date_time)))
[[1]]
# A tibble: 2 × 5
date_time num date month doy
<dttm> <int> <date> <dbl> <dbl>
1 2019-07-06 13:00:00 1 2019-07-06 7 187
2 2020-06-07 13:00:00 2 2020-06-07 6 159
[[2]]
# A tibble: 2 × 5
date_time num date month doy
<dttm> <int> <date> <dbl> <dbl>
1 2020-07-06 13:00:00 1 2020-07-06 7 188
2 2021-07-06 13:00:00 2 2021-07-06 7 187
If you are trying to determine the format of the date column for the whole data.frame based on the presence of any date from 2019, then a small tweak of your code should work.
Instead of testing each record for the presence of "/19 ", wrap the condition of if_else() in any(str_detect(...)), which returns TRUE if any of the values are TRUE. However, the result of any() always has length 1, so you then need to rep() it to match the number of rows in the data.frame using dplyr::n().
library(tidyverse)
library(lubridate)
dataframes <- list(
tibble(date_time = c("07/06/19 01:00:00 PM", "07/06/20 01:00:00 PM"), num = 1:2), # July 6th
tibble(date_time = c("06/07/20 01:00:00 PM", "06/07/21 01:00:00 PM"), num = 1:2) # July 6th
)
dataframes %>%
map( ~ .x %>%
mutate(
date_time = if_else(str_detect(date_time, "/19 ") %>%
any() %>%
rep(n()),
mdy_hms(date_time),
dmy_hms(date_time)),
date = date(date_time),
month = month(date_time),
doy = yday(date_time)
))
#> [[1]]
#> # A tibble: 2 × 5
#> date_time num date month doy
#> <dttm> <int> <date> <dbl> <dbl>
#> 1 2019-07-06 13:00:00 1 2019-07-06 7 187
#> 2 2020-07-06 13:00:00 2 2020-07-06 7 188
#>
#> [[2]]
#> # A tibble: 2 × 5
#> date_time num date month doy
#> <dttm> <int> <date> <dbl> <dbl>
#> 1 2020-07-06 13:00:00 1 2020-07-06 7 188
#> 2 2021-07-06 13:00:00 2 2021-07-06 7 187
Created on 2022-07-20 by the reprex package (v2.0.1)
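Because the decision is made once per data frame, a plain if/else on the whole column is another way to express it. A sketch under the same assumption as above (any "/19 " in the column means mdy); the helper name parse_by_launch_year is hypothetical:

```r
library(tibble)
library(dplyr)
library(purrr)
library(stringr)
library(lubridate)

# Hypothetical helper: pick one parser for the whole vector
parse_by_launch_year <- function(x) {
  if (any(str_detect(x, "/19 "))) {
    mdy_hms(x)   # data frame contains a 2019 record: month/day/year
  } else {
    dmy_hms(x)   # otherwise: day/month/year
  }
}

# Same example data as in the question
dataframes <- list(
  tibble(date_time = c("07/06/19 01:00:00 PM", "07/06/20 01:00:00 PM"), num = 1:2),
  tibble(date_time = c("06/07/20 01:00:00 PM", "06/07/21 01:00:00 PM"), num = 1:2)
)

parsed <- map(dataframes, ~ mutate(.x, date_time = parse_by_launch_year(date_time)))
```

Unlike if_else(), which evaluates both branches for every row, this runs only the parser that was chosen.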
I'm doing something quite simple. Given a dataframe of start dates and end dates for specific periods I want to expand/create a full sequence for each period binned by week (with the factor for each row), then output this in a single large dataframe.
For instance:
library(tidyverse)
library(lubridate)
# Dataset
start_dates = ymd_hms(c("2019-05-08 00:00:00",
"2020-01-17 00:00:00",
"2020-03-03 00:00:00",
"2020-05-28 00:00:00",
"2020-12-10 00:00:00",
"2021-05-07 00:00:00",
"2022-01-04 00:00:00"), tz = "UTC")
end_dates = ymd_hms(c( "2019-10-24 00:00:00",
"2020-03-03 00:00:00",
"2020-05-28 00:00:00",
"2020-12-10 00:00:00",
"2021-05-07 00:00:00",
"2022-01-04 00:00:00",
"2022-01-19 00:00:00"), tz = "UTC")
df1 = data.frame(studying = paste0("period", 1:7), start_dates, end_dates)
It was suggested to me to use do(), which currently works fine, but I hate relying on things that are superseded. I also have a way of doing it using map2. But the documentation (https://dplyr.tidyverse.org/reference/do.html) suggests you can use nest_by(), across() and summarise() to do the same job as do(). How would I go about getting the same result? I've tried a lot of things but I just can't seem to get it.
# do() way to do it
df1 %>%
group_by(studying) %>%
do(data.frame(week=seq(.$start_dates,.$end_dates,by="1 week")))
# transmute() way to do it
df1 %>%
transmute(weeks = map2(start_dates,end_dates, seq, by = "1 week"), studying) %>%
unnest(cols = c(weeks))
As the documentation of ?do suggests, we can now use summarise and replace the . with across():
library(tidyverse)
library(lubridate)
df1 %>%
group_by(studying) %>%
summarise(week = seq(across()$start_dates,
across()$end_dates,
by = "1 week"))
#> `summarise()` has grouped output by 'studying'. You can override using the
#> `.groups` argument.
#> # A tibble: 134 x 2
#> # Groups: studying [7]
#> studying week
#> <chr> <dttm>
#> 1 period1 2019-05-08 00:00:00
#> 2 period1 2019-05-15 00:00:00
#> 3 period1 2019-05-22 00:00:00
#> 4 period1 2019-05-29 00:00:00
#> 5 period1 2019-06-05 00:00:00
#> 6 period1 2019-06-12 00:00:00
#> 7 period1 2019-06-19 00:00:00
#> 8 period1 2019-06-26 00:00:00
#> 9 period1 2019-07-03 00:00:00
#> 10 period1 2019-07-10 00:00:00
#> # … with 124 more rows
Created on 2022-01-19 by the reprex package (v0.3.0)
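In more recent dplyr releases the same idea can be written with reframe(), which is meant for results that return several rows per group. A sketch, assuming dplyr 1.1.0 or later; the two short periods below are made up to keep the result small and are not the question's data:

```r
library(dplyr)
library(lubridate)

# Two illustrative periods (hypothetical dates, shorter than those in the question)
df1 <- data.frame(
  studying    = paste0("period", 1:2),
  start_dates = ymd_hms(c("2019-05-08 00:00:00", "2020-01-17 00:00:00"), tz = "UTC"),
  end_dates   = ymd_hms(c("2019-05-22 00:00:00", "2020-01-31 00:00:00"), tz = "UTC")
)

# One row per week within each period; reframe() returns an ungrouped result
weekly <- df1 %>%
  group_by(studying) %>%
  reframe(week = seq(start_dates, end_dates, by = "1 week"))
```

Columns are referenced directly inside reframe(), so no across() call is needed here.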
You can also use tidyr::complete:
df1 %>%
group_by(studying) %>%
complete(start_dates = seq(from = start_dates, to = end_dates, by = "1 week")) %>%
select(-end_dates, weeks = start_dates)
# A tibble: 134 x 2
# Groups: studying [7]
studying weeks
<chr> <dttm>
1 period1 2019-05-08 00:00:00
2 period1 2019-05-15 00:00:00
3 period1 2019-05-22 00:00:00
4 period1 2019-05-29 00:00:00
5 period1 2019-06-05 00:00:00
6 period1 2019-06-12 00:00:00
7 period1 2019-06-19 00:00:00
8 period1 2019-06-26 00:00:00
9 period1 2019-07-03 00:00:00
10 period1 2019-07-10 00:00:00
# ... with 124 more rows
Although marked Experimental, the help file for group_modify does say that
‘group_modify()’ is an evolution of ‘do()’
and, in fact, the code for the example in the question using group_modify is nearly the same as with do.
# with group_modify
df2 <- df1 %>%
group_by(studying) %>%
group_modify(~ data.frame(week = seq(.$start_dates, .$end_dates, by = "1 week")))
# with do
df0 <- df1 %>%
group_by(studying) %>%
do(data.frame(week = seq(.$start_dates, .$end_dates, by = "1 week")))
identical(df2, df0)
## [1] TRUE
Not sure if this is exactly what you are looking for, but here is my attempt with rowwise and unnest:
df1 %>%
rowwise() %>%
mutate(week = list(seq(start_dates, end_dates, by = "1 week"))) %>%
select(studying, week) %>%
unnest(cols = c(week))
Another approach:
library(tidyverse)
df1 %>%
group_by(studying) %>%
summarise(df = tibble(weeks = seq(start_dates, end_dates, by = 'week'))) %>%
unnest(df)
#> `summarise()` has grouped output by 'studying'. You can override using the `.groups` argument.
#> # A tibble: 134 × 2
#> # Groups: studying [7]
#> studying weeks
#> <chr> <dttm>
#> 1 period1 2019-05-08 00:00:00
#> 2 period1 2019-05-15 00:00:00
#> 3 period1 2019-05-22 00:00:00
#> 4 period1 2019-05-29 00:00:00
#> 5 period1 2019-06-05 00:00:00
#> 6 period1 2019-06-12 00:00:00
#> 7 period1 2019-06-19 00:00:00
#> 8 period1 2019-06-26 00:00:00
#> 9 period1 2019-07-03 00:00:00
#> 10 period1 2019-07-10 00:00:00
#> # … with 124 more rows
Created on 2022-01-20 by the reprex package (v2.0.1)
I have a date format like 01-June-2020. I would like to convert it to time series data in R. I tried as.Date but it returns NAs.
Here is the data:
dput(head(TData))
structure(list(Date = c("31-May-20", "01-Jun-20", "02-Jun-20",
"03-Jun-20", "04-Jun-20", "07-Jun-20"), Price = c(7213.03, 7288.81,
7285.23, 7222.41, 7207.78, 7267.86), Open = c(7050.66, 7213.03,
7288.81, 7285.23, 7222.41, 7207.78), High = c(7338.96, 7288.81,
7321.36, 7311.85, 7207.78, 7277.7), Low = c(7149.71, 7202.14,
7277.63, 7202.39, 7129.25, 7233.67), Vol. = c("307.44M", "349.59M",
"343.52M", "286.85M", "234.18M", "225.87M"), `Change %` = c("2.30%",
"1.05%", "-0.05%", "-0.86%", "-0.20%", "0.83%")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
We have to specify the format. By default, as.Date expects YYYY-MM-DD, i.e. %Y-%m-%d. Here, it is a 2-digit day (%d), followed by the abbreviated month name (%b) and a 2-digit year (%y).
TData$Date <- as.Date(TData$Date, '%d-%b-%y')
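The difference between %y and %Y matters for this data. A small sketch on two of the sample strings (the variable names are illustrative); it assumes an English or C locale, since %b matches month names in the current locale:

```r
dates <- c("31-May-20", "01-Jun-20")

# %y reads a two-digit year, so "20" becomes 2020
parsed <- as.Date(dates, format = "%d-%b-%y")

# %Y reads the digits as the full year, so "20" becomes the year 20
literal <- as.Date(dates, format = "%d-%b-%Y")

# Extract the calendar year of each result for comparison
years_parsed  <- as.POSIXlt(parsed)$year + 1900
years_literal <- as.POSIXlt(literal)$year + 1900
```

If as.Date returns NA for every value, a mismatch between the format string and the text (or a non-English locale for %b) is a likely cause.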
If we want to create time series data, we may use xts:
library(lubridate)
library(xts)
library(dplyr)
TData %>%
mutate(Date = dmy(Date)) %>%
select(Date, where(is.numeric)) %>%
{xts(.[-1], order.by = .$Date)}
Price Open High Low
2020-05-31 7213.03 7050.66 7338.96 7149.71
2020-06-01 7288.81 7213.03 7288.81 7202.14
2020-06-02 7285.23 7288.81 7321.36 7277.63
2020-06-03 7222.41 7285.23 7311.85 7202.39
2020-06-04 7207.78 7222.41 7207.78 7129.25
2020-06-07 7267.86 7207.78 7277.70 7233.67
or we may use tsibble:
library(tsibble)
TData %>%
mutate(Date = dmy(Date)) %>%
select(Date, where(is.numeric)) %>%
as_tsibble(index = Date)
-output
# A tsibble: 6 x 5 [1D]
Date Price Open High Low
<date> <dbl> <dbl> <dbl> <dbl>
1 2020-05-31 7213. 7051. 7339. 7150.
2 2020-06-01 7289. 7213. 7289. 7202.
3 2020-06-02 7285. 7289. 7321. 7278.
4 2020-06-03 7222. 7285. 7312. 7202.
5 2020-06-04 7208. 7222. 7208. 7129.
6 2020-06-07 7268. 7208. 7278. 7234.
We can also use lubridate package functions. Since months are stored as abbreviated month names we use %b instead of %m, and since the year has two digits we use %y rather than %Y:
library(lubridate)
TData %>%
mutate(Date = as_date(Date, format = "%d-%b-%y"))
# A tibble: 6 x 7
Date Price Open High Low Vol. `Change %`
<date> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 2020-05-31 7213. 7051. 7339. 7150. 307.44M 2.30%
2 2020-06-01 7289. 7213. 7289. 7202. 349.59M 1.05%
3 2020-06-02 7285. 7289. 7321. 7278. 343.52M -0.05%
4 2020-06-03 7222. 7285. 7312. 7202. 286.85M -0.86%
5 2020-06-04 7208. 7222. 7208. 7129. 234.18M -0.20%
6 2020-06-07 7268. 7208. 7278. 7234. 225.87M 0.83%
I want to use the Prophet() function in R, but I cannot convert my column "YearWeek" to a Date column.
I have a column "YearWeek" that stores values from 201401 up to 201937 i.e. starting in 2014 week 1 up to 2019 week 37.
I don't know how to declare this column as a date in the form yyyy-ww needed to use the Prophet() function.
Does anyone know how to do this?
Thank you in advance.
One solution could be to append a 01 to the end of your yyyyww-formatted values.
Data:
library(tidyverse)
df <- cross2(2014:2019, str_pad(1:52, width = 2, pad = 0)) %>%
map_df(set_names, c("year", "week")) %>%
transmute(date = paste(year, week, sep = "")) %>%
arrange(date)
head(df)
#> # A tibble: 6 x 1
#> date
#> <chr>
#> 1 201401
#> 2 201402
#> 3 201403
#> 4 201404
#> 5 201405
#> 6 201406
Now let's append the 01 and convert to date:
df %>%
mutate(date = paste(date, "01", sep = ""),
new_date = as.Date(date, "%Y%U%w"))
#> # A tibble: 312 x 2
#> date new_date
#> <chr> <date>
#> 1 20140101 2014-01-05
#> 2 20140201 2014-01-12
#> 3 20140301 2014-01-19
#> 4 20140401 2014-01-26
#> 5 20140501 2014-02-02
#> 6 20140601 2014-02-09
#> 7 20140701 2014-02-16
#> 8 20140801 2014-02-23
#> 9 20140901 2014-03-02
#> 10 20141001 2014-03-09
#> # ... with 302 more rows
Created on 2019-10-10 by the reprex package (v0.3.0)
More info about a numeric week of the year can be found here.
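Note that %w reads a single digit (0 = Sunday to 6 = Saturday), so in "20140101" it consumes only the "0" and the trailing "1" is ignored, which is why every date above falls on a Sunday. If the Monday of each week is wanted instead, a sketch that appends a single "1" (the vector name yearweek is hypothetical, and whether the %U week convention matches how YearWeek was generated is an assumption worth checking):

```r
# A few YearWeek values in yyyyww form
yearweek <- c("201401", "201402", "201937")

# Append weekday 1 (Monday) and parse as year, week of year (%U), weekday (%w)
monday <- as.Date(paste0(yearweek, "1"), format = "%Y%U%w")

# Weekday index of each result (0 = Sunday, 1 = Monday, ...)
weekday_index <- as.POSIXlt(monday)$wday
```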
I've pieced together the code below from other SO answers, but I'm getting stuck on an error message. I searched SO for similar errors and resolutions but haven't been able to figure it out, so help is appreciated.
For every group ("id"), I want to get the difference between the start times for consecutive rows.
Reproducible data:
require(dplyr)
df <-data.frame(id=as.numeric(c("1","1","1","2","2","2")),
start= c("1/31/17 10:00","1/31/17 10:02","1/31/17 10:45",
"2/10/17 12:00", "2/10/17 12:20","2/11/17 09:40"))
time <- strptime(df$start, format = "%m/%d/%y %H:%M")
df %>%
group_by(id)%>%
mutate(diff = time - lag(time),
diff_mins = as.numeric(diff, units = 'mins'))
Gets me error:
Error in mutate_impl(.data, dots) :
Column diff must be length 3 (the group size) or one, not 6
In addition: Warning message:
In unclass(time1) - unclass(time2) :
longer object length is not a multiple of shorter object length
Do you mean something like this?
There is no need for lag here, a simple diff on the grouped times is sufficient.
df %>%
mutate(start = as.POSIXct(start, format = "%m/%d/%y %H:%M")) %>%
group_by(id) %>%
mutate(diff = c(0, diff(start)))
## A tibble: 6 x 3
## Groups: id [2]
# id start diff
# <dbl> <dttm> <dbl>
#1 1. 2017-01-31 10:00:00 0.
#2 1. 2017-01-31 10:02:00 2.
#3 1. 2017-01-31 10:45:00 43.
#4 2. 2017-02-10 12:00:00 0.
#5 2. 2017-02-10 12:20:00 20.
#6 2. 2017-02-11 09:40:00 1280.
You can use lag and difftime (per Hadley):
df %>%
mutate(time = as.POSIXct(start, format = "%m/%d/%y %H:%M")) %>%
group_by(id) %>%
mutate(diff = difftime(time, lag(time)))
# A tibble: 6 x 4
# Groups: id [2]
id start time diff
<dbl> <fct> <dttm> <time>
1 1. 1/31/17 10:00 2017-01-31 10:00:00 <NA>
2 1. 1/31/17 10:02 2017-01-31 10:02:00 2
3 1. 1/31/17 10:45 2017-01-31 10:45:00 43
4 2. 2/10/17 12:00 2017-02-10 12:00:00 <NA>
5 2. 2/10/17 12:20 2017-02-10 12:20:00 20
6 2. 2/11/17 09:40 2017-02-11 09:40:00 1280
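Both answers let R pick the units of the difference automatically, which can vary with the size of the gaps. A sketch that fixes the units to minutes and returns a plain numeric column, reusing the question's data (tz = "UTC" and the column name diff_mins are assumptions made for the example):

```r
library(dplyr)

df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  start = c("1/31/17 10:00", "1/31/17 10:02", "1/31/17 10:45",
            "2/10/17 12:00", "2/10/17 12:20", "2/11/17 09:40"),
  stringsAsFactors = FALSE
)

result <- df %>%
  # Parse inside the pipeline so the column is grouped along with the rows
  mutate(start = as.POSIXct(start, format = "%m/%d/%y %H:%M", tz = "UTC")) %>%
  group_by(id) %>%
  # Explicit units keep every group in minutes; the first row of a group is NA
  mutate(diff_mins = as.numeric(difftime(start, lag(start), units = "mins"))) %>%
  ungroup()
```

The original error came from the free-standing time vector of length 6 being used inside a grouped mutate; parsing the column within the pipeline avoids that.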