I have the following data frame in R:
ID Date1 Date2
1 21-03-16 8:36 22-03-16 12:36
1 23-03-16 9:36 24-03-16 01:36
1 22-03-16 10:36 25-03-16 11:46
1 23-03-16 11:36 28-03-16 10:16
My desired dataframe is
ID Date1 Date1_time Date2 Date2_time
1 2016-03-21 08:36:00 2016-03-22 12:36:00
1 2016-03-23 09:36:00 2016-03-24 01:36:00
1 2016-03-22 10:36:00 2016-03-25 11:46:00
1 2016-03-23 11:36:00 2016-03-28 10:16:00
I can do this individually using strptime, like the following:
df$Date1 <- strptime(df$Date1, format='%d-%m-%y %H:%M')
df$Date1_time <- strftime(df$Date1 ,format="%H:%M:%S")
df$Date1 <- strptime(df$Date1, format='%Y-%m-%d')
But I have many date columns to convert like this. How can I write a function in R that will do this?
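For reference, a minimal base-R sketch of such a function (the helper name `split_datetime` is illustrative, and it assumes the d-m-y h:m format shown above):

```r
# Hypothetical helper: split each datetime column into a Date column
# plus a "_time" character column, appending the new columns in place.
split_datetime <- function(df, cols, format = "%d-%m-%y %H:%M") {
  for (col in cols) {
    dt <- strptime(df[[col]], format = format)
    df[[paste0(col, "_time")]] <- strftime(dt, format = "%H:%M:%S")
    df[[col]] <- as.Date(dt)
  }
  df
}

df <- data.frame(ID = 1,
                 Date1 = "21-03-16 8:36",
                 Date2 = "22-03-16 12:36")
split_datetime(df, c("Date1", "Date2"))
#   ID      Date1      Date2 Date1_time Date2_time
# 1  1 2016-03-21 2016-03-22   08:36:00   12:36:00
```

Note that the function returns a modified copy, so the original data frame is untouched unless you reassign the result.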
You can do this with dplyr::mutate_at to operate on multiple columns. See select helpers for more info on efficiently specifying which columns to operate on.
Then you can use lubridate and hms for date and time functions.
library(dplyr)
library(lubridate)
library(hms)
df <- readr::read_csv(
'ID,Date1,Date2
1,"21-03-16 8:36","22-03-16 12:36"
1,"23-03-16 9:36","24-03-16 01:36"
1,"22-03-16 10:36","25-03-16 11:46"
1,"23-03-16 11:36","28-03-16 10:16"'
)
df
#> # A tibble: 4 x 3
#> ID Date1 Date2
#> <int> <chr> <chr>
#> 1 1 21-03-16 8:36 22-03-16 12:36
#> 2 1 23-03-16 9:36 24-03-16 01:36
#> 3 1 22-03-16 10:36 25-03-16 11:46
#> 4 1 23-03-16 11:36 28-03-16 10:16
df %>%
mutate_at(vars(Date1, Date2), dmy_hm) %>%
mutate_at(vars(Date1, Date2), funs("date" = date(.), "time" = as.hms(.))) %>%
select(-Date1, -Date2)
#> # A tibble: 4 x 5
#> ID Date1_date Date2_date Date1_time Date2_time
#> <int> <date> <date> <time> <time>
#> 1 1 2016-03-21 2016-03-22 08:36:00 12:36:00
#> 2 1 2016-03-23 2016-03-24 09:36:00 01:36:00
#> 3 1 2016-03-22 2016-03-25 10:36:00 11:46:00
#> 4 1 2016-03-23 2016-03-28 11:36:00 10:16:00
Using dplyr for manipulation:
convertTime <- function(x)as.POSIXct(x, format='%d-%m-%y %H:%M')
df %>%
mutate_at(vars(Date1, Date2), convertTime) %>%
group_by(ID) %>%
mutate_all(funs("date"=as.Date(.), "time"=format(., "%H:%M:%S")))
# Source: local data frame [4 x 7]
# Groups: ID [1]
#
#      ID               Date1               Date2 Date1_date Date2_date Date1_time Date2_time
#   <int>              <dttm>              <dttm>     <date>     <date>      <chr>      <chr>
# 1     1 2016-03-21 08:36:00 2016-03-22 12:36:00 2016-03-21 2016-03-22   08:36:00   12:36:00
# 2     1 2016-03-23 09:36:00 2016-03-24 01:36:00 2016-03-23 2016-03-24   09:36:00   01:36:00
# 3     1 2016-03-22 10:36:00 2016-03-25 11:46:00 2016-03-22 2016-03-25   10:36:00   11:46:00
# 4     1 2016-03-23 11:36:00 2016-03-28 10:16:00 2016-03-23 2016-03-28   11:36:00   10:16:00
I had the same problem; you can try this approach using strsplit:
x <- df$Date1
y <- t(as.data.frame(strsplit(as.character(x), ' ')))
row.names(y) <- NULL
# store the split data in new columns
df$date <- y[, 1]  # date column
df$time <- y[, 2]  # time column
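A tidier sketch of the same split, assuming tidyr is available, uses separate() and avoids the manual transpose (the `into` names are illustrative):

```r
library(tidyr)

df <- data.frame(ID = 1,
                 Date1 = c("21-03-16 8:36", "23-03-16 9:36"))

# Split Date1 on the space; remove = FALSE keeps the original column
df <- separate(df, Date1, into = c("date", "time"),
               sep = " ", remove = FALSE)
df
#   ID         Date1     date time
# 1  1 21-03-16 8:36 21-03-16 8:36
# 2  1 23-03-16 9:36 23-03-16 9:36
```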
I want to create a variable with the number of the day a participant took a survey (first day, second day, third day, etc.).
The issue is that there are participants that took the survey after midnight.
For example, this is what it looks like:
Id  date
1   08/03/2020 08:17
1   08/03/2020 12:01
1   08/04/2020 15:08
1   08/04/2020 22:16
2   07/03/2020 08:10
2   07/03/2020 12:03
2   07/04/2020 15:07
2   07/05/2020 00:16
3   08/22/2020 09:17
3   08/23/2020 11:04
3   08/24/2020 00:01
4   10/03/2020 08:37
4   10/03/2020 11:13
4   10/04/2020 15:20
4   10/04/2020 23:05
This is what I want:
Id  date              day
1   08/03/2020 08:17  1
1   08/03/2020 12:01  1
1   08/04/2020 15:08  2
1   08/04/2020 22:16  2
2   07/03/2020 08:10  1
2   07/03/2020 12:03  1
2   07/04/2020 15:07  2
2   07/05/2020 00:16  2
3   08/22/2020 09:17  1
3   08/23/2020 11:04  2
3   08/24/2020 00:01  2
4   10/03/2020 08:37  1
4   10/03/2020 11:13  1
4   10/04/2020 15:20  2
4   10/04/2020 23:05  2
How can I create the day variable, taking into consideration that participants who took the survey after midnight still belong to the previous day?
I tried the code here, but I have issues with participants taking surveys after midnight.
Please check the code below. Note that it counts calendar days from each participant's first response, so the after-midnight row for Id 2 (07/05/2020 00:16) comes out as day 3 rather than the desired day 2.
code
library(dplyr)
library(tidyr)

data2 <- data %>%
mutate(date2 = as.Date(date, format = "%m/%d/%Y %H:%M")) %>%
group_by(id) %>%
mutate(row = row_number(),
date3 = as.Date(ifelse(row == 1, date2, NA), origin = "1970-01-01")) %>%
fill(date3) %>%
ungroup() %>%
mutate(diff = as.numeric(date2 - date3 + 1)) %>%
select(-date2, -date3, -row)
output
#> id date diff
#> 1 1 08/03/2020 08:17 1
#> 2 1 08/03/2020 12:01 1
#> 3 1 08/04/2020 15:08 2
#> 4 1 08/04/2020 22:16 2
#> 5 2 07/03/2020 08:10 1
#> 6 2 07/03/2020 12:03 1
#> 7 2 07/04/2020 15:07 2
#> 8 2 07/05/2020 00:16 3
Here is one approach that explicitly shows the dates being considered. First, make sure your date is in POSIXct format, as suggested in the comments (if not done already). Then, if the hour is less than 2 (midnight to 2 AM), subtract 1 from the date so that survey_date reflects the day before; if the hour is not less than 2, just keep the date. The timezone argument tz is set to "" to avoid confusion or uncertainty. Finally, after grouping by Id, subtract the first survey_date from each survey_date to get the number of days since the first survey. You can use as.numeric to make this column numeric if desired.
Note: if you want to just note consecutive days taken the survey (and ignore gaps in days between surveys) you can substitute for the last line:
mutate(day = cumsum(survey_date != lag(survey_date, default = first(survey_date))) + 1)
This will increase day by 1 every new survey_date found for a given Id.
library(tidyverse)
library(lubridate)
df %>%
mutate(date = as.POSIXct(date, format = "%m/%d/%Y %H:%M", tz = "")) %>%
mutate(survey_date = if_else(hour(date) < 2,
as.Date(date, format = "%Y-%m-%d", tz = "") - 1,
as.Date(date, format = "%Y-%m-%d", tz = ""))) %>%
group_by(Id) %>%
mutate(day = survey_date - first(survey_date) + 1)
Output
Id date survey_date day
<int> <dttm> <date> <drtn>
1 1 2020-08-03 08:17:00 2020-08-03 1 days
2 1 2020-08-03 12:01:00 2020-08-03 1 days
3 1 2020-08-04 15:08:00 2020-08-04 2 days
4 1 2020-08-04 22:16:00 2020-08-04 2 days
5 2 2020-07-03 08:10:00 2020-07-03 1 days
6 2 2020-07-03 12:03:00 2020-07-03 1 days
7 2 2020-07-04 15:07:00 2020-07-04 2 days
8 2 2020-07-05 00:16:00 2020-07-04 2 days
9 3 2020-08-22 09:17:00 2020-08-22 1 days
10 3 2020-08-23 11:04:00 2020-08-23 2 days
11 3 2020-08-24 00:01:00 2020-08-23 2 days
12 4 2020-10-03 08:37:00 2020-10-03 1 days
13 4 2020-10-03 11:13:00 2020-10-03 1 days
14 4 2020-10-04 15:20:00 2020-10-04 2 days
15 4 2020-10-04 23:05:00 2020-10-04 2 days
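The consecutive-day substitution from the note above can be sketched stand-alone (the data here are illustrative and include a deliberate gap):

```r
library(dplyr)

surveys <- data.frame(
  Id = c(1, 1, 1),
  survey_date = as.Date(c("2020-08-03", "2020-08-05", "2020-08-05"))  # 08-04 skipped
)

# day increments whenever survey_date changes within an Id,
# regardless of how many calendar days were skipped in between
res <- surveys %>%
  group_by(Id) %>%
  mutate(day = cumsum(survey_date != lag(survey_date,
                                         default = first(survey_date))) + 1)
res$day
#> [1] 1 2 2
```

The skipped calendar day (08-04) does not create a gap in day, which is how this differs from the date-arithmetic approach above.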
I'm trying to calculate a rolling window over a fixed time interval. Suppose the interval is 48 hours. I would like to get every data point contained between the date of the current observation and 48 hours before that observation. For example, if the datetime of the current observation is 05-07-2022 14:15:28, I would like, for that position, a count of every occurrence between that date and 03-07-2022 14:15:28. Seconds are not fundamental to the analysis.
library(tidyverse)
library(lubridate)
df = tibble(id = 1:7,
date_time = ymd_hm('2022-05-07 15:00', '2022-05-09 13:45', '2022-05-09 13:51', '2022-05-09 17:00',
'2022-05-10 15:25', '2022-05-10 17:18', '2022-05-11 14:00'))
# A tibble: 7 × 2
id date_time
<int> <dttm>
1 1 2022-05-07 15:00:00
2 2 2022-05-09 13:45:00
3 3 2022-05-09 13:51:00
4 4 2022-05-09 17:00:00
5 5 2022-05-10 15:25:00
6 6 2022-05-10 17:18:00
7 7 2022-05-11 14:00:00
With the example window of 48 hours, that would yield:
# A tibble: 7 × 4
id date_time lag_48hours count
<int> <dttm> <dttm> <dbl>
1 1 2022-05-07 15:00:00 2022-05-05 15:00:00 1
2 2 2022-05-09 13:45:00 2022-05-07 13:45:00 2
3 3 2022-05-09 13:51:00 2022-05-07 13:51:00 3
4 4 2022-05-09 17:00:00 2022-05-07 17:00:00 3
5 5 2022-05-10 15:25:00 2022-05-08 15:25:00 4
6 6 2022-05-10 17:18:00 2022-05-08 17:18:00 5
7 7 2022-05-11 14:00:00 2022-05-09 14:00:00 4
I added the lag column for illustration purposes. Any idea how to obtain the count column? I need to be able to adjust the window (48 hours in this example).
I'd encourage you to use slider, which allows you to do rolling window analysis using an irregular index.
library(tidyverse)
library(lubridate)
library(slider)
df = tibble(
id = 1:7,
date_time = ymd_hm(
'2022-05-07 15:00', '2022-05-09 13:45', '2022-05-09 13:51', '2022-05-09 17:00',
'2022-05-10 15:25', '2022-05-10 17:18', '2022-05-11 14:00'
)
)
df %>%
mutate(
count = slide_index_int(
.x = id,
.i = date_time,
.f = length,
.before = dhours(48)
)
)
#> # A tibble: 7 × 3
#> id date_time count
#> <int> <dttm> <int>
#> 1 1 2022-05-07 15:00:00 1
#> 2 2 2022-05-09 13:45:00 2
#> 3 3 2022-05-09 13:51:00 3
#> 4 4 2022-05-09 17:00:00 3
#> 5 5 2022-05-10 15:25:00 4
#> 6 6 2022-05-10 17:18:00 5
#> 7 7 2022-05-11 14:00:00 4
How about this...
df %>%
mutate(count48 = map_int(date_time,
~sum(date_time <= . & date_time > . - 48 * 60 * 60)))
# A tibble: 7 × 3
id date_time count48
<int> <dttm> <int>
1 1 2022-05-07 15:00:00 1
2 2 2022-05-09 13:45:00 2
3 3 2022-05-09 13:51:00 3
4 4 2022-05-09 17:00:00 3
5 5 2022-05-10 15:25:00 4
6 6 2022-05-10 17:18:00 5
7 7 2022-05-11 14:00:00 4
library(dplyr)
library(lubridate)
I have some data like this
f <- tribble(~a, ~date,
"BVH", 201801,
"HBYU", 202012,
"CYC", 202112,
"AC", 202109)
And I need to transform it to the last day of the month. I do this:
f %>% mutate(date = ym(date))
# A tibble: 4 x 2
a date
<chr> <date>
1 BVH 2018-01-01
2 HBYU 2020-12-01
3 CYC 2021-12-01
4 AC 2021-09-01
What I would like is, this
# A tibble: 4 x 2
a date
<chr> <date>
1 BVH 2018-01-31
2 HBYU 2020-12-31
3 CYC 2021-12-31
4 AC 2021-09-30
lubridate's rollforward does exactly what you want:
> f %>% mutate(date = rollforward(ym(date)))
# A tibble: 4 x 2
a date
<chr> <date>
1 BVH 2018-01-31
2 HBYU 2020-12-31
3 CYC 2021-12-31
4 AC 2021-09-30
Here are two other approaches -
library(dplyr)
library(lubridate)
f %>%
mutate(date = ym(date),
date1 = date + months(1) - 1,
date2 = ceiling_date(date, 'month') - 1)
# a date date1 date2
# <chr> <date> <date> <date>
#1 BVH 2018-01-01 2018-01-31 2018-01-31
#2 HBYU 2020-12-01 2020-12-31 2020-12-31
#3 CYC 2021-12-01 2021-12-31 2021-12-31
#4 AC 2021-09-01 2021-09-30 2021-09-30
I have data which looks like
library(dplyr)
library(lubridate)
Date_Construct= c("10/03/2018 00:00", "10/03/2018 00:00","01/01/2016 00:00","21/03/2015 01:25", "21/03/2015 01:25", "17/04/2016 00:00","17/04/2016 00:00", "20/02/2012 00:00","20/02/2020 00:00")
Date_first_use = c("02/08/2018 00:00","02/08/2018 00:00", "01/04/2016 00:00","NA", "NA", "NA", "NA","13/08/2012 00:00","20/04/2020 00:00")
Date_fail = c("02/08/2019 00:00","02/08/2019 00:00", "21/06/2018 06:42","NA" , "NA" , "17/04/2016 00:00", "17/04/2016 00:00","13/08/2014 07:45","NA")
P_ID = c("0001", "0001" ,"0001" ,"0001", "0001","34000","34000","34000", "00425")
Comp_date= c("16/05/2019 00:00", "10/04/2018 12:55","25/06/2017 00:00","22/04/2015 00:00","08/05/2015 00:00" ,"04/05/2017 00:00" ,"15/07/2016 00:00","01/03/2014 00:00", "20/03/2020 00:00")
Type = c("a","a","b","c","c","b","b","a","c")
dfq = data.frame(P_ID, Type, Date_Construct, Date_first_use, Date_fail, Comp_date) %>%
  mutate(across(contains("Date", ignore.case = TRUE), dmy_hm)) %>%
  arrange(P_ID, desc(Date_Construct)) %>%
  group_by(P_ID, Date_Construct, Type) %>%
  mutate(A_ID = cur_group_id()) %>%
  select(P_ID, A_ID, Type, Date_Construct, Date_first_use, Date_fail, Comp_date)
View(dfq)
It is a data frame of different items (A_ID) of type a/b/c, created for different clients (P_ID), with date of construction, date of first use and date of failure. Each P_ID may have multiple A_ID, and each A_ID may have multiple Comp_date.
I need to supply a date for where Date_fail is NA, which is the Date_construct of the next constructed A_ID for the same P_ID.
i.e. Date_fail for P_ID 0001, A_ID 1 should be 2016-01-01 00:00:00.
For A_ID which there are no subsequent A_ID (as is the case for P_ID 00425, A_ID 4), the Date_fail should remain NA .
So result should look like:
P_ID A_ID Type Date_Construct Date_first_use Date_fail Comp_date
1 0001 1 c 2015-03-21 01:25:00 NA 2016-01-01 00:00:00 2015-04-22 00:00:00
2 0001 1 c 2015-03-21 01:25:00 NA 2016-01-01 00:00:00 2015-05-08 00:00:00
3 0001 2 b 2016-01-01 00:00:00 2016-04-01 2018-06-21 06:42:00 2017-06-25 00:00:00
4 0001 3 a 2018-03-10 00:00:00 2018-08-02 2019-08-02 00:00:00 2019-05-16 00:00:00
5 0001 3 a 2018-03-10 00:00:00 2018-08-02 2019-08-02 00:00:00 2018-04-10 12:55:00
6 00425 4 c 2020-02-20 00:00:00 2020-04-20 NA 2020-03-20 00:00:00
7 34000 5 a 2012-02-20 00:00:00 2012-08-13 2014-08-13 07:45:00 2014-03-01 00:00:00
8 34000 6 b 2016-04-17 00:00:00 NA 2016-04-17 00:00:00 2017-05-04 00:00:00
9 34000 6 b 2016-04-17 00:00:00 NA 2016-04-17 00:00:00 2016-07-15 00:00:00
I tried this, which I thought worked, but it just gives me the Date_Construct of the next row in the group, which isn't correct because some A_ID have multiple entries:
dfq %>%
  arrange(P_ID, Date_Construct) %>%
  group_by(P_ID) %>%
  mutate(Date_fail2 = sort(Date_Construct, decreasing = FALSE)[row_number(Date_Construct) + 1]) %>%
  mutate(Date_fail = if_else(is.na(Date_fail), paste(Date_fail2), paste(Date_fail)))
I'm ideally looking for a dplyr solution as I find them easier to understand and reproduce.
One solution is to nest all the variables that can be different for the same A_ID. (In this case only Comp_date)
library(tidyr)
nested = dfq %>%
ungroup() %>%
arrange(P_ID, A_ID) %>%
nest(extra = Comp_date)
This results in a tibble with one row for each A_ID, where the different Comp_dates are comfortably nested in their own tibbles:
> nested
# A tibble: 6 x 7
# Groups: P_ID, Type, Date_Construct [6]
P_ID A_ID Type Date_Construct Date_first_use Date_fail extra
<fct> <int> <fct> <dttm> <dttm> <dttm> <list>
1 0001 1 c 2015-03-21 01:25:00 NA NA <tibble [2 × 1]>
2 0001 2 b 2016-01-01 00:00:00 2016-04-01 00:00:00 2018-06-21 06:42:00 <tibble [1 × 1]>
3 0001 3 a 2018-03-10 00:00:00 2018-08-02 00:00:00 2019-08-02 00:00:00 <tibble [2 × 1]>
4 00425 4 c 2020-02-20 00:00:00 2020-04-20 00:00:00 NA <tibble [1 × 1]>
5 34000 5 a 2012-02-20 00:00:00 2012-08-13 00:00:00 2014-08-13 07:45:00 <tibble [1 × 1]>
6 34000 6 b 2016-04-17 00:00:00 NA 2016-04-17 00:00:00 <tibble [2 × 1]>
You can now modify this using normal dplyr methods. Your own approach would probably work as well here, but it can be done much more cleanly using coalesce and lead. Don't forget to unnest at the end to get your original structure back:
result = nested %>%
group_by(P_ID) %>%
mutate(Date_fail = coalesce(Date_fail, lead(Date_Construct))) %>%
unnest(extra)
Result:
> result
# A tibble: 9 x 7
# Groups: P_ID [3]
P_ID A_ID Type Date_Construct Date_first_use Date_fail Comp_date
<fct> <int> <fct> <dttm> <dttm> <dttm> <dttm>
1 0001 1 c 2015-03-21 01:25:00 NA 2016-01-01 00:00:00 2015-04-22 00:00:00
2 0001 1 c 2015-03-21 01:25:00 NA 2016-01-01 00:00:00 2015-05-08 00:00:00
3 0001 2 b 2016-01-01 00:00:00 2016-04-01 00:00:00 2018-06-21 06:42:00 2017-06-25 00:00:00
4 0001 3 a 2018-03-10 00:00:00 2018-08-02 00:00:00 2019-08-02 00:00:00 2019-05-16 00:00:00
5 0001 3 a 2018-03-10 00:00:00 2018-08-02 00:00:00 2019-08-02 00:00:00 2018-04-10 12:55:00
6 00425 4 c 2020-02-20 00:00:00 2020-04-20 00:00:00 NA 2020-03-20 00:00:00
7 34000 5 a 2012-02-20 00:00:00 2012-08-13 00:00:00 2014-08-13 07:45:00 2014-03-01 00:00:00
8 34000 6 b 2016-04-17 00:00:00 NA 2016-04-17 00:00:00 2017-05-04 00:00:00
9 34000 6 b 2016-04-17 00:00:00 NA 2016-04-17 00:00:00 2016-07-15 00:00:00
I have a date-time column with non-consecutive date-times (all on the hour), like this:
dat <- data.frame(dt = as.POSIXct(c("2018-01-01 12:00:00",
"2018-01-13 01:00:00",
"2018-02-01 11:00:00")))
# Output:
# dt
#1 2018-01-01 12:00:00
#2 2018-01-13 01:00:00
#3 2018-02-01 11:00:00
I'd like to expand the rows of column dt so that every hour in between the very minimum and maximum date-times is present, looking like:
# Desired output:
# dt
#1 2018-01-01 12:00:00
#2 2018-01-01 13:00:00
#3 2018-01-01 14:00:00
#4 .
#5 .
And so on. tidyverse-based solutions are preferred.
#DavidArenburg's comment is the way to go for a vector. However, if you want to expand dt inside a data frame with other columns that you would like to keep, you might be interested in tidyr::complete combined with tidyr::full_seq:
dat <- data.frame(dt = as.POSIXct(c("2018-01-01 12:00:00",
"2018-01-13 01:00:00",
"2018-02-01 11:00:00")))
dat$a <- letters[1:3]
dat
#> dt a
#> 1 2018-01-01 12:00:00 a
#> 2 2018-01-13 01:00:00 b
#> 3 2018-02-01 11:00:00 c
library(tidyr)
res <- complete(dat, dt = full_seq(dt, 60 ** 2))
print(res, n = 5)
#> # A tibble: 744 x 2
#> dt a
#> <dttm> <chr>
#> 1 2018-01-01 12:00:00 a
#> 2 2018-01-01 13:00:00 <NA>
#> 3 2018-01-01 14:00:00 <NA>
#> 4 2018-01-01 15:00:00 <NA>
#> 5 2018-01-01 16:00:00 <NA>
#> # ... with 739 more rows
Created on 2018-03-12 by the reprex package (v0.2.0).
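For a bare vector (the route the comment referenced above takes), base seq() presumably suffices:

```r
dat <- data.frame(dt = as.POSIXct(c("2018-01-01 12:00:00",
                                    "2018-02-01 11:00:00")))

# One timestamp per hour from the earliest to the latest observation
all_hours <- seq(min(dat$dt), max(dat$dt), by = "hour")
length(all_hours)
#> [1] 744
```

The 744 values line up with the 744 rows that complete() produces above.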