Fill down data frame in R with date and numeric values

I have a data frame that looks like this:
df <- tibble(
  date = c('6/8/2021 18:58',
           '6/8/2021 19:00',
           '6/8/2021 19:05',
           '6/8/2021 19:07'),
  values = c(1, 0, 1, 0)
)
date            values
6/8/2021 18:58  1
6/8/2021 19:00  0
6/8/2021 19:05  1
6/8/2021 19:07  0
I need it to be transformed to look like this:
date             values
6/8/2021 18:58   1
6/9/2021 18:59   1
6/10/2021 19:00  0
6/11/2021 19:01  0
6/12/2021 19:02  0
6/13/2021 19:03  0
6/14/2021 19:04  0
6/15/2021 19:05  1
6/16/2021 19:06  1
6/17/2021 19:07  0
Thanks in advance for the help.

Use complete to fill in a sequence of dates from the first to the last timestamp in 1-minute steps, after converting the 'date' column to POSIXct:
library(dplyr)
library(tidyr)
library(lubridate)
df %>%
  # parse the m/d/Y H:M strings into POSIXct
  mutate(date = mdy_hm(date)) %>%
  # add a row for every missing minute, filling 'values' with 0
  complete(date = seq(first(date), last(date), by = '1 min'),
           fill = list(values = 0)) %>%
  # shift each row's date forward by (row number - 1) days to match the desired
  # output, and set a row to 1 if it or the previous row was 1
  mutate(date = date + days(row_number() - 1),
         values = +(values | lag(values)))
-output
# A tibble: 10 x 2
date values
<dttm> <int>
1 2021-06-08 18:58:00 1
2 2021-06-09 18:59:00 1
3 2021-06-10 19:00:00 0
4 2021-06-11 19:01:00 0
5 2021-06-12 19:02:00 0
6 2021-06-13 19:03:00 0
7 2021-06-14 19:04:00 0
8 2021-06-15 19:05:00 1
9 2021-06-16 19:06:00 1
10 2021-06-17 19:07:00 0
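The +(values | lag(values)) step is what carries each 1 forward one extra row: a row becomes 1 if either it or the previous row was 1, and the unary + turns the logical result back into an integer. A standalone toy illustration (my own example vector, not from the answer):
library(dplyr)
v <- c(1, 0, 0, 1, 0)
+(v | lag(v))
#> [1] 1 1 0 1 1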

Maybe you can try base R code like the one below:
df <- transform(
  df,
  date = strptime(date, "%m/%d/%Y %H:%M")
)
dfout <- transform(
  data.frame(Date = with(df, seq(date[1], date[length(date)], by = "1 min"))),
  Values = with(df, values[findInterval(Date, df$date)])
)
which gives
Date Values
1 2021-06-08 18:58:00 1
2 2021-06-08 18:59:00 1
3 2021-06-08 19:00:00 0
4 2021-06-08 19:01:00 0
5 2021-06-08 19:02:00 0
6 2021-06-08 19:03:00 0
7 2021-06-08 19:04:00 0
8 2021-06-08 19:05:00 1
9 2021-06-08 19:06:00 1
10 2021-06-08 19:07:00 0
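As a quick illustration of why the findInterval() call does the filling here (my own check, not part of the original answer): for each minute in the new sequence it returns the index of the most recent original timestamp, so values[...] carries the last observed value forward. Assuming df$date has already been converted as above:
# 19:03 falls between the 2nd original row (19:00) and the 3rd (19:05),
# so findInterval returns 2 and the filled value is values[2], i.e. 0
findInterval(as.POSIXct("2021-06-08 19:03:00"), df$date)
#> [1] 2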

Related

How can I create a day number variable in R based on dates?

I want to create a variable with the number of the day a participant took a survey (first day, second day, third day, etc.).
The issue is that there are participants that took the survey after midnight.
For example, this is what it looks like:
Id  date
1   08/03/2020 08:17
1   08/03/2020 12:01
1   08/04/2020 15:08
1   08/04/2020 22:16
2   07/03/2020 08:10
2   07/03/2020 12:03
2   07/04/2020 15:07
2   07/05/2020 00:16
3   08/22/2020 09:17
3   08/23/2020 11:04
3   08/24/2020 00:01
4   10/03/2020 08:37
4   10/03/2020 11:13
4   10/04/2020 15:20
4   10/04/2020 23:05
This is what I want:
Id  date              day
1   08/03/2020 08:17  1
1   08/03/2020 12:01  1
1   08/04/2020 15:08  2
1   08/04/2020 22:16  2
2   07/03/2020 08:10  1
2   07/03/2020 12:03  1
2   07/04/2020 15:07  2
2   07/05/2020 00:16  2
3   08/22/2020 09:17  1
3   08/23/2020 11:04  2
3   08/24/2020 00:01  2
4   10/03/2020 08:37  1
4   10/03/2020 11:13  1
4   10/04/2020 15:20  2
4   10/04/2020 23:05  2
How can I create the day variable so that participants who took the survey after midnight are still counted as belonging to the previous day?
I tried the code here, but I have issues with participants taking surveys after midnight.
Please check the code below.
code
data2 <- data %>%
  mutate(date2 = as.Date(date, format = "%m/%d/%Y %H:%M")) %>%
  group_by(id) %>%
  mutate(row = row_number(),
         date3 = as.Date(ifelse(row == 1, date2, NA), origin = "1970-01-01")) %>%
  fill(date3) %>%
  ungroup() %>%
  mutate(diff = as.numeric(date2 - date3 + 1)) %>%
  select(-date2, -date3, -row)
output
#> id date diff
#> 1 1 08/03/2020 08:17 1
#> 2 1 08/03/2020 12:01 1
#> 3 1 08/04/2020 15:08 2
#> 4 1 08/04/2020 22:16 2
#> 5 2 07/03/2020 08:10 1
#> 6 2 07/03/2020 12:03 1
#> 7 2 07/04/2020 15:07 2
#> 8 2 07/05/2020 00:16 3
Here is one approach that explicitly shows the dates being considered. First, make sure your date is in POSIXct format, as suggested in the comments (if not already done). Then, if the hour is less than 2 (midnight to 2 AM), subtract 1 from the date so that survey_date reflects the day before; if the hour is not less than 2, just keep the date. The timezone argument tz is set to "" to avoid confusion or uncertainty. Finally, after grouping by Id, subtract the first survey_date from each survey_date to get the number of days since the first survey. You can use as.numeric to make this column numeric if desired.
Note: if you just want to count consecutive survey days (ignoring gaps in days between surveys), you can substitute the following for the last line:
mutate(day = cumsum(survey_date != lag(survey_date, default = first(survey_date))) + 1)
This increases day by 1 for every new survey_date found for a given Id; a full sketch with this substitution appears after the output below.
library(tidyverse)
library(lubridate)
df %>%
  mutate(date = as.POSIXct(date, format = "%m/%d/%Y %H:%M", tz = "")) %>%
  mutate(survey_date = if_else(hour(date) < 2,
                               as.Date(date, format = "%Y-%m-%d", tz = "") - 1,
                               as.Date(date, format = "%Y-%m-%d", tz = ""))) %>%
  group_by(Id) %>%
  mutate(day = survey_date - first(survey_date) + 1)
Output
Id date survey_date day
<int> <dttm> <date> <drtn>
1 1 2020-08-03 08:17:00 2020-08-03 1 days
2 1 2020-08-03 12:01:00 2020-08-03 1 days
3 1 2020-08-04 15:08:00 2020-08-04 2 days
4 1 2020-08-04 22:16:00 2020-08-04 2 days
5 2 2020-07-03 08:10:00 2020-07-03 1 days
6 2 2020-07-03 12:03:00 2020-07-03 1 days
7 2 2020-07-04 15:07:00 2020-07-04 2 days
8 2 2020-07-05 00:16:00 2020-07-04 2 days
9 3 2020-08-22 09:17:00 2020-08-22 1 days
10 3 2020-08-23 11:04:00 2020-08-23 2 days
11 3 2020-08-24 00:01:00 2020-08-23 2 days
12 4 2020-10-03 08:37:00 2020-10-03 1 days
13 4 2020-10-03 11:13:00 2020-10-03 1 days
14 4 2020-10-04 15:20:00 2020-10-04 2 days
15 4 2020-10-04 23:05:00 2020-10-04 2 days
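For reference, here is a minimal sketch of the same pipeline with the consecutive-day substitution from the note above (same data and assumptions as the answer; a sketch, not a tested drop-in):
df %>%
  mutate(date = as.POSIXct(date, format = "%m/%d/%Y %H:%M", tz = "")) %>%
  mutate(survey_date = if_else(hour(date) < 2,
                               as.Date(date, format = "%Y-%m-%d", tz = "") - 1,
                               as.Date(date, format = "%Y-%m-%d", tz = ""))) %>%
  group_by(Id) %>%
  # day goes up by 1 each time a new survey_date appears within an Id,
  # so gaps between survey days are ignored
  mutate(day = cumsum(survey_date != lag(survey_date, default = first(survey_date))) + 1)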

Filter data based on subgroups in R

In reality it's much more complex, but let's say my data looks like this:
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  event = c(0, 0, 0, 1, 1, 1, 1, 0, 0, 0),
  day = c(1, 3, 3, 1, 6, 6, 7, 1, 4, 6),
  time = c("2016-10-25 14:00:00", "2016-10-27 12:00:15", "2016-10-27 15:30:00",
           "2016-10-23 11:00:00", "2016-10-28 08:00:15", "2016-10-28 23:00:00", "2016-10-29 12:00:00",
           "2016-10-24 15:00:00", "2016-10-27 15:00:15", "2016-10-29 16:00:00"))
df$time <- as.POSIXct(df$time)
Output:
id event day time
1 1 0 1 2016-10-25 14:00:00
2 1 0 3 2016-10-27 12:00:15
3 1 0 3 2016-10-27 15:30:00
4 2 1 1 2016-10-23 11:00:00
5 2 1 6 2016-10-28 08:00:15
6 2 1 6 2016-10-28 23:00:00
7 2 1 7 2016-10-29 12:00:00
8 3 0 1 2016-10-24 15:00:00
9 3 0 4 2016-10-27 15:00:15
10 3 0 6 2016-10-29 16:00:00
What I need to do:
If event is 0, I want to keep only the last 24 hours per id.
If event is 1, I want to keep the 6th day.
I know how to keep the last 24 hours in general:
library(lubridate)
last_twentyfour_hours <- df %>%
  group_by(id) %>%
  filter(time > last(time) - hours(24))
But how do I filter differently for each group?
Thank you very much in advance!
Grouped by 'id' and 'event', do a filter with if/else, i.e. if 0 is in 'event', use the OP's condition; otherwise return the rows where 'day' is 6:
library(dplyr)
library(lubridate)
df %>%
  group_by(id, event) %>%
  filter(if (0 %in% event) time > last(time) - hours(24) else day == 6) %>%
  ungroup
-output
# A tibble: 5 × 4
id event day time
<dbl> <dbl> <dbl> <dttm>
1 1 0 3 2016-10-27 12:00:15
2 1 0 3 2016-10-27 15:30:00
3 2 1 6 2016-10-28 08:00:15
4 2 1 6 2016-10-28 23:00:00
5 3 0 6 2016-10-29 16:00:00
We could use the & and | operators:
df %>%
  group_by(id) %>%
  filter(event == 0 & time > last(time) - hours(24) |
           event == 1 & day == 6)
id event day time
<dbl> <dbl> <dbl> <dttm>
1 1 0 3 2016-10-27 12:00:15
2 1 0 3 2016-10-27 15:30:00
3 2 1 6 2016-10-28 08:00:15
4 2 1 6 2016-10-28 23:00:00
5 3 0 6 2016-10-29 16:00:00

Determine the number of processes running each day and the average days since those processes commenced, in R

I have a large dataset of processes (their IDs), start dates and corresponding end dates.
What I want is divided into two parts: first, how many processes are running each day; second, the running processes' mean number of days since they started.
A sample data set looks like this:
> dput(df)
structure(list(Process = c("P001", "P002", "P003", "P004", "P005"
), Start = c("01-01-2020", "02-01-2020", "03-01-2020", "08-01-2020",
"13-01-2020"), End = c("10-01-2020", "09-01-2020", "04-01-2020",
"17-01-2020", "19-01-2020")), class = "data.frame", row.names = c(NA,
-5L))
df
> df
Process Start End
1 P001 01-01-2020 10-01-2020
2 P002 02-01-2020 09-01-2020
3 P003 03-01-2020 04-01-2020
4 P004 08-01-2020 17-01-2020
5 P005 13-01-2020 19-01-2020
For the first part I have proceeded like this:
library(tidyverse)
df %>%
  pivot_longer(cols = c(Start, End), names_to = 'event', values_to = 'dates') %>%
  mutate(dates = as.Date(dates, format = "%d-%m-%Y")) %>%
  mutate(dates = if_else(event == 'End', dates + 1, dates)) %>%
  arrange(dates, event) %>%
  mutate(processes = ifelse(event == 'Start', 1, -1),
         processes = cumsum(processes)) %>%
  select(-Process, -event) %>%
  complete(dates = seq.Date(min(dates), max(dates), by = '1 day')) %>%
  fill(processes)
# A tibble: 20 x 2
dates processes
<date> <dbl>
1 2020-01-01 1
2 2020-01-02 2
3 2020-01-03 3
4 2020-01-04 3
5 2020-01-05 2
6 2020-01-06 2
7 2020-01-07 2
8 2020-01-08 3
9 2020-01-09 3
10 2020-01-10 2
11 2020-01-11 1
12 2020-01-12 1
13 2020-01-13 2
14 2020-01-14 2
15 2020-01-15 2
16 2020-01-16 2
17 2020-01-17 2
18 2020-01-18 1
19 2020-01-19 1
20 2020-01-20 0
For the second part, the desired output adds a mean days column: for each date, the average number of days that the processes running on that date have been running so far (illustrated with a screenshot and an explanation in the original post).
A tidyverse approach would be preferred, please.
Here is one approach:
library(tidyverse)
df %>%
  # Convert to date
  mutate(across(c(Start, End), lubridate::dmy),
         # Create a sequence of dates from start to end
         Dates = map2(Start, End, seq, by = 'day')) %>%
  # Get data in long format
  unnest(Dates) %>%
  # Remove columns
  select(-Start, -End) %>%
  # For each process
  group_by(Process) %>%
  # Count number of days spent on it
  mutate(days_spent = row_number() - 1) %>%
  # For each date
  group_by(Dates) %>%
  # Count number of processes running and average days
  summarise(process = n(),
            mean_days = mean(days_spent))
This returns:
# Dates process mean_days
# <date> <int> <dbl>
# 1 2020-01-01 1 0
# 2 2020-01-02 2 0.5
# 3 2020-01-03 3 1
# 4 2020-01-04 3 2
# 5 2020-01-05 2 3.5
# 6 2020-01-06 2 4.5
# 7 2020-01-07 2 5.5
# 8 2020-01-08 3 4.33
# 9 2020-01-09 3 5.33
#10 2020-01-10 2 5.5
#11 2020-01-11 1 3
#12 2020-01-12 1 4
#13 2020-01-13 2 2.5
#14 2020-01-14 2 3.5
#15 2020-01-15 2 4.5
#16 2020-01-16 2 5.5
#17 2020-01-17 2 6.5
#18 2020-01-18 1 5
#19 2020-01-19 1 6
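As a quick sanity check on the 2020-01-08 row (my own arithmetic, not part of the answer): the processes running that day are P001 (7 days since its 01-01-2020 start), P002 (6 days) and P004 (0 days), so the mean is
mean(c(7, 6, 0))
#> [1] 4.333333
which matches the 4.33 shown above.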

Time-interval overlap match by group

Suppose I have the following DF:
id flag time
1 1 2017-01-01 UTC--2017-01-07 UTC
1 0 2018-01-01 UTC--2019-01-01 UTC
1 0 2017-01-03 UTC--2017-01-09 UTC
2 1 2017-01-01 UTC--2017-01-15 UTC
2 1 2018-07-01 UTC--2018-09-01 UTC
2 1 2018-10-12 UTC--2018-10-20 UTC
2 0 2017-01-12 UTC--2017-01-16 UTC
2 0 2017-03-01 UTC--2017-03-15 UTC
2 0 2017-12-01 UTC--2017-12-31 UTC
2 0 2018-08-15 UTC--2018-09-19 UTC
2 0 2018-10-01 UTC--2018-10-21 UTC
Created with the following code:
library(lubridate)

df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2),
                 flag = c(1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0),
                 time = c(interval(ymd(20170101), ymd(20170107)),
                          interval(ymd(20180101), ymd(20190101)),
                          interval(ymd(20170103), ymd(20170109)),
                          # Cases
                          interval(ymd(20170101), ymd(20170115)),
                          interval(ymd(20180701), ymd(20180901)),
                          interval(ymd(20181012), ymd(20181020)),
                          # Controls
                          interval(ymd(20170112), ymd(20170116)),
                          interval(ymd(20170301), ymd(20170315)),
                          interval(ymd(20171201), ymd(20171231)),
                          interval(ymd(20180815), ymd(20180919)),
                          interval(ymd(20181001), ymd(20181021))))
And I want to obtain this result
id flag time value
1 1 2017-01-01 UTC--2017-01-07 UTC NA
1 0 2018-01-01 UTC--2019-01-01 UTC 0
1 0 2017-01-03 UTC--2017-01-09 UTC 1
2 1 2017-01-01 UTC--2017-01-15 UTC NA
2 1 2018-07-01 UTC--2018-09-01 UTC NA
2 1 2018-10-12 UTC--2018-10-20 UTC NA
2 0 2017-01-12 UTC--2017-01-16 UTC 1
2 0 2017-03-01 UTC--2017-03-15 UTC 0
2 0 2017-12-01 UTC--2017-12-31 UTC 0
2 0 2018-08-15 UTC--2018-09-19 UTC 1
2 0 2018-10-01 UTC--2018-10-21 UTC 1
That is, I want to compare the time intervals with flag = 0 to all intervals with flag = 1 within each group, to see whether there is at least one overlap between a flag 0 interval and a flag 1 interval.
For this purpose I have tried lubridate's int_overlaps function.
I have tried the following code, but it does not work:
result <- df %>%
  group_by(id) %>%
  mutate(value = ifelse(flag == 0 & int_overlaps(time, any(time[flag == 1])), 1, 0))
I have found a very similar approach:
R: Determine if each date interval overlaps with all other date intervals in a dataframe
You can use map_int from purrr to see if any intervals overlap within each id:
library(tidyverse)
library(lubridate)
df %>%
  group_by(id) %>%
  mutate(value = ifelse(flag == 0, map_int(time, ~ any(int_overlaps(.x, time[flag == 1]))), NA))
Output
# A tibble: 11 x 4
# Groups: id [2]
id flag time value
<dbl> <dbl> <Interval> <int>
1 1 1 2017-01-01 UTC--2017-01-07 UTC NA
2 1 0 2018-01-01 UTC--2019-01-01 UTC 0
3 1 0 2017-01-03 UTC--2017-01-09 UTC 1
4 2 1 2017-01-01 UTC--2017-01-15 UTC NA
5 2 1 2018-07-01 UTC--2018-09-01 UTC NA
6 2 1 2018-10-12 UTC--2018-10-20 UTC NA
7 2 0 2017-01-12 UTC--2017-01-16 UTC 1
8 2 0 2017-03-01 UTC--2017-03-15 UTC 0
9 2 0 2017-12-01 UTC--2017-12-31 UTC 0
10 2 0 2018-08-15 UTC--2018-09-19 UTC 1
11 2 0 2018-10-01 UTC--2018-10-21 UTC 1
I am adding another answer, extracted from here:
R: Determine if each date interval overlaps with all other date intervals in a dataframe
result <- df %>%
  group_by(id) %>%
  mutate(value = map(seq_along(time), function(x) {
    y = setdiff(seq_along(time[flag == 1]), x)
    return(any(int_overlaps(time[x], time[y])))
  }))

How to insert missing time (by minutes) in data frame? And how to assign corresponding y values for that missing time as NA?

I have data that looks like this:
SN TimeStamp MOTOR
1 1/27/20 18:00 0
2 1/27/20 18:02 1
3 1/27/20 18:04 0
4 1/27/20 18:05 1
5 1/27/20 18:08 0
How can I make it look like this?
SN TimeStamp MOTOR
1 1/27/20 18:00 0
2 1/27/20 18:01 NA
3 1/27/20 18:02 1
4 1/27/20 18:03 NA
5 1/27/20 18:04 0
6 1/27/20 18:05 1
7 1/27/20 18:06 NA
8 1/27/20 18:07 NA
9 1/27/20 18:08 0
Basically, my question is: how can I insert the missing timestamps and assign the corresponding MOTOR values as NA?
I would be thankful if anyone could help. I just started learning R and this has been giving me a headache since this morning.
Thanks.
You can create a data frame whose timestamp is a sequence from the min to the max value of your original data frame, and then do a left join (here using dplyr and lubridate):
library(lubridate)
library(dplyr)
df_or$TimeStamp = mdy_hm(df_or$TimeStamp) # Convert TimeStamp into appropriate date format
DF <- data.frame(TimeStamp = seq(min(df_or$TimeStamp),max(df_or$TimeStamp), by = "min"))
DF %>% left_join(., df_or, by = "TimeStamp")
TimeStamp MOTOR
1 2020-01-27 18:00:00 0
2 2020-01-27 18:01:00 NA
3 2020-01-27 18:02:00 1
4 2020-01-27 18:03:00 NA
5 2020-01-27 18:04:00 0
6 2020-01-27 18:05:00 1
7 2020-01-27 18:06:00 NA
8 2020-01-27 18:07:00 NA
9 2020-01-27 18:08:00 0
Data
df_or <- data.frame(TimeStamp = c("1/27/20 18:00", "1/27/20 18:02", "1/27/20 18:04", "1/27/20 18:05", "1/27/20 18:08"),
                    MOTOR = c(0, 1, 0, 1, 0))
Another possible approach is to use library(padr):
library(padr)
library(dplyr)
df %>%
  pad(interval = 'min') %>%
  mutate(SN = row_number())
Output
SN TimeStamp Motor
1 1 2020-01-27 18:00:00 0
2 2 2020-01-27 18:01:00 NA
3 3 2020-01-27 18:02:00 1
4 4 2020-01-27 18:03:00 NA
5 5 2020-01-27 18:04:00 0
6 6 2020-01-27 18:05:00 1
7 7 2020-01-27 18:06:00 NA
8 8 2020-01-27 18:07:00 NA
9 9 2020-01-27 18:08:00 0
Data
df <- data.frame(
  SN = 1:5,
  TimeStamp = as.POSIXct(c("1/27/20 18:00", "1/27/20 18:02", "1/27/20 18:04", "1/27/20 18:05", "1/27/20 18:08"),
                         format = "%m/%d/%y %H:%M"),
  Motor = c(0, 1, 0, 1, 0)
)
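For consistency with the complete() approach used in the first answer on this page, here is a roughly equivalent tidyr sketch (my own variation, assuming TimeStamp is already POSIXct as in the data above):
library(dplyr)
library(tidyr)

df %>%
  # add a row for every missing minute; Motor stays NA in the new rows
  complete(TimeStamp = seq(min(TimeStamp), max(TimeStamp), by = "min")) %>%
  mutate(SN = row_number())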
