I have a dataset that looks like this:
id land datetime
pb1 0 2004-04-05 01:44:00
pb1 1 2004-04-05 02:00:00
pb1 1 2004-04-06 16:00:00
pb2 1 2004-04-05 18:00:00
pb2 1 2004-04-05 20:00:00
library(data.table)
DT = data.table(
  id = c("pb1", "pb1", "pb1", "pb2", "pb2"),
  land = c(0L, 1L, 1L, 1L, 1L),
  datetime = as.POSIXct(c("2004-04-05 01:44:00", "2004-04-05 02:00:00",
                          "2004-04-06 16:00:00", "2004-04-05 18:00:00",
                          "2004-04-05 20:00:00"))
)
I would like to make a column that cumulatively adds time (in days) but ONLY if there is a '1' in the land column. I also would like the count to reset when the id changes.
I have tried a variety of methods using data.table, rleid, and even a nested for loop with no success. I have gotten errors using code like this:
DT[, total :=land*diff(as.numeric(datetime)), .(id, rleid(land))]
I have tried variations of the solution here: Calculating cumulative time in R
I'm not sure of the best way to calculate the time interval (I had no success with difftime or lubridate).
I want the end result to look like this:
id land datetime cumtime.land
pb1 0 2004-04-05 01:44:00 0
pb1 1 2004-04-05 02:00:00 0
pb1 1 2004-04-06 16:00:00 1.58333
pb2 1 2004-04-05 18:00:00 0
pb2 1 2004-04-05 20:00:00 0.08333
I could not replicate @Japp's comment, but you can easily do this with dplyr.
Depending on what your exact expected output is, you could stop before the summarize call:
library(dplyr)
df <- read.table(text =
"id land datetime
pb1 0 '2004-04-05 01:44:00'
pb1 1 '2004-04-05 02:00:00'
pb1 1 '2004-04-06 16:00:00'
pb1 1 '2004-04-07 16:00:00'
pb2 1 '2004-04-05 18:00:00'
pb2 1 '2004-04-05 20:00:00'", header = TRUE) %>%
  mutate(datetime = as.POSIXct(datetime, format = '%Y-%m-%d %H:%M:%S'))
x = df %>%
group_by(id) %>%
arrange(id, datetime) %>%
mutate(time.land=ifelse(land==0 | is.na(lag(land)) | lag(land)==0,
0,
difftime(datetime, lag(datetime), units="days"))) %>%
mutate(cumtime.land=time.land + ifelse(is.na(lag(time.land)), 0, lag(time.land)))
id land datetime time.land cumtime.land
<fct> <int> <dttm> <dbl> <dbl>
1 pb1 0 2004-04-05 01:44:00 0 0
2 pb1 1 2004-04-05 02:00:00 0 0
3 pb1 1 2004-04-06 16:00:00 1.58 1.58
4 pb1 1 2004-04-07 16:00:00 1 2.58
5 pb2 1 2004-04-05 18:00:00 0 0
6 pb2 1 2004-04-05 20:00:00 0.0833 0.0833
The key is the dplyr::lag() function, which takes the value from the row just above in the table (which is why you have to arrange() it beforehand).
By wrapping this inside the ifelse, I'm checking that land and the previous land were not 0 (and that we are not on the first line of the id, where lag(anything) is missing).
I then reuse the lag() function to get the cumtime.land variable.
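Note that cumtime.land here only adds the single previous time.land value, so for runs of three or more consecutive land rows it drifts from a true running total. If a genuine cumulative sum per id is what you want, a minimal sketch (same df as above) replaces that last step with cumsum():
df %>%
  group_by(id) %>%
  arrange(id, datetime) %>%
  mutate(time.land = ifelse(land == 0 | is.na(lag(land)) | lag(land) == 0,
                            0,
                            difftime(datetime, lag(datetime), units = "days")),
         cumtime.land = cumsum(time.land))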
I believe you're after:
DT[land == 1, cumtime.land :=
     cumsum(c(0, diff(as.numeric(datetime))))/86400, by = id]
as.numeric(datetime) converts it to seconds since the epoch, so we divide by 86400 (seconds per day) to convert to days.
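A quick sanity check of that conversion on two of the timestamps above:
x <- as.POSIXct(c("2004-04-05 02:00:00", "2004-04-06 16:00:00"))
diff(as.numeric(x))         # 136800 seconds
diff(as.numeric(x)) / 86400 # 1.583333 days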
Somewhat more "official" in the sense of leveraging time/date classes directly is to use difftime and shift:
DT[land == 1, by = id,
cumtime.land :=
cumsum(as.double(difftime(
datetime, shift(datetime, fill = datetime[1L]), units = 'days'
)))]
I switched the order of the by argument simply to help with formatting.
We use datetime[1L] as the fill so that the initial difference is 0; we need as.double because cumsum errors on difftime input, since it doesn't know how to handle difftime objects.
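For illustration, the difference that as.double() smooths over:
d <- difftime(as.POSIXct("2004-04-06 16:00:00"),
              as.POSIXct("2004-04-05 02:00:00"), units = "days")
# cumsum(d) would error: cumsum is not defined for "difftime" objects
cumsum(as.double(d)) # 1.583333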
See also:
Calculate cumsum() while ignoring NA values
https://stackoverflow.com/a/40227629/3576984
I have a dataset where every row corresponds to a participant. It has a categorical variable called "Injury.Cause".
Injury.Date.Time Injury.Cause
3608 2019-05-22 00:00:00 Motor Vehicle
3915 2019-03-25 10:00:00 Accidental
3916 2019-03-25 16:00:00 Burn
3917 2019-03-25 10:00:00 Accidental
3920 2019-03-25 00:00:00 Fall
3928 2019-03-27 00:00:00 Fall
3929 2019-03-26 21:50:00 Motor Vehicle
3930 2019-03-27 17:00:00 Fall
3931 2019-03-26 00:00:00 Motor Vehicleter
I want to make a line plot with multiple lines (each line representing a cause of injury over time), where the y-axis shows the total number of occurrences (frequency) per month for each cause of injury.
I assume the first step is that I have to arrange my data as follows:
Date Motor Vehicle Accidental Burn Fall
2021-03-22 3 2 1 2
2021-03-23 1 1 0 3
This example is shown in days, but I believe I can control the time frame when making the plot. I will be exploring the changes across 3-month intervals on the x-axis.
Thank you in advance
Rami
We can get the data in the required structure using table() after extracting the date from the timestamp.
table(as.Date(df$Injury.Date.Time), df$Injury.Cause)
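For reproducibility, here is one construction of the sample data (transcribed from the rows shown in the question, with the row-name IDs dropped):
df <- data.frame(
  Injury.Date.Time = as.POSIXct(c(
    "2019-05-22 00:00:00", "2019-03-25 10:00:00", "2019-03-25 16:00:00",
    "2019-03-25 10:00:00", "2019-03-25 00:00:00", "2019-03-27 00:00:00",
    "2019-03-26 21:50:00", "2019-03-27 17:00:00", "2019-03-26 00:00:00")),
  Injury.Cause = c("Motor Vehicle", "Accidental", "Burn", "Accidental",
                   "Fall", "Fall", "Motor Vehicle", "Fall", "Motor Vehicleter")
)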
Or in tidyverse -
library(tidyverse)
df %>%
count(Date = as.Date(Injury.Date.Time), Injury.Cause) %>%
pivot_wider(names_from = Injury.Cause, values_from = n, values_fill = 0)
# Date Accidental Burn Fall `Motor Vehicle` `Motor Vehicleter`
# <date> <int> <int> <int> <int> <int>
#1 2019-03-25 2 1 1 0 0
#2 2019-03-26 0 0 0 1 1
#3 2019-03-27 0 0 2 0 0
#4 2019-05-22 0 0 0 1 0
However, if you want to plot the data, you should have it in long format rather than wide format.
df %>%
count(Date = as.Date(Injury.Date.Time), Injury.Cause) %>%
ggplot() + aes(Date, n, color = Injury.Cause) + geom_line()
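Since the question asks for frequencies per month rather than per day, here is a hedged sketch of the monthly version, binning the dates with lubridate::floor_date():
library(lubridate)
df %>%
  count(Month = floor_date(as.Date(Injury.Date.Time), "month"), Injury.Cause) %>%
  ggplot() + aes(Month, n, color = Injury.Cause) + geom_line()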
This seems like it should be straightforward but I cannot find a way to do this.
I have a sales cycle that begins around August 1 of each year and need to sum sales by week number. I need to create a "week number" field where week #1 begins on a date that I specify. So far I have looked at lubridate, base R, and strftime, and I cannot find a way to change the "start" date from 01/01/YYYY to something else.
The solution needs to let me specify the start date and iterate week numbers as 7-day blocks from that start date. The actual start date doesn't always fall on a Sunday or Monday.
Example data frame:
eg_data <- data.frame(
cycle = c("cycle2019", "cycle2019", "cycle2018", "cycle2018", "cycle2017", "cycle2017", "cycle2016", "cycle2016"),
dates = as.POSIXct(c("2019-08-01" , "2019-08-10" ,"2018-07-31" , "2018-08-16", "2017-08-03" , "2017-08-14" , "2016-08-05", "2016-08-29")),
week_n = c("1", "2","1","3","1","2","1","4"))
I'd like the result to look like what is above - it would take the min date for each cycle and use that as a starting point, then iterate up week numbers based on a given date's distance from the cycle starting date.
This almost works. (Doing date arithmetic gives us durations in seconds: there may be a smoother way to convert with lubridate tools?)
secs_per_week <- 60*60*24*7
(eg_data
%>% group_by(cycle)
%>% mutate(nw=1+as.numeric(round((dates-min(dates))/secs_per_week)))
)
The results don't match for 2017, because there is an 11-day gap between the first and second observation ...
cycle dates week_n nw
<chr> <dttm> <chr> <dbl>
5 cycle2017 2017-08-03 00:00:00 1 1
6 cycle2017 2017-08-14 00:00:00 2 3
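Replacing round() with floor() reproduces the expected week numbers for this data, since a date 11 days into the cycle should still be week 2, not week 3 (a sketch; as.double() with units = "days" also avoids the manual seconds arithmetic):
library(dplyr)
(eg_data
  %>% group_by(cycle)
  %>% mutate(nw = 1 + floor(as.double(dates - min(dates), units = "days") / 7))
)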
If someone has a better answer please post it, but this works.
Take the data frame in the example, eg_data:
eg_data %>%
group_by(cycle) %>%
mutate(
cycle_start = as.Date(min(dates)),
days_diff = as.Date(dates) - cycle_start,
week_n = days_diff / 7,
week_n_whole = ceiling(days_diff / 7) ) -> eg_data_check # note: the cycle start date itself gets week 0 here
(First time I've answered my own question)
library("lubridate")
eg_data %>%
as_tibble() %>%
group_by(cycle) %>%
mutate(new_week = week(dates)-31)
This doesn't quite work the same as your example, but perhaps with some fiddling based on your domain experience you could adapt it:
library(lubridate)
library(stringr) # for str_sub()
eg_data %>%
mutate(aug1 = ymd_h(paste(str_sub(cycle, start = -4), "080100")),
week_n2 = ceiling((dates - aug1)/ddays(7)))
EDIT: If you have specific known dates for the start of each cycle, it might be helpful to join those dates to your data for the calc:
library(lubridate)
cycle_starts <- data.frame(
cycle = c("cycle2019", "cycle2018", "cycle2017", "cycle2016"),
start_date = ymd_h(c(2019080100, 2018072500, 2017080500, 2016071300))
)
eg_data %>%
left_join(cycle_starts) %>%
mutate(week_n2 = ceiling((dates - start_date)/ddays(7)))
#Joining, by = "cycle"
# cycle dates week_n start_date week_n2
#1 cycle2019 2019-08-01 1 2019-08-01 1
#2 cycle2019 2019-08-10 2 2019-08-01 2
#3 cycle2018 2018-07-31 1 2018-07-25 1
#4 cycle2018 2018-08-16 3 2018-07-25 4
#5 cycle2017 2017-08-03 1 2017-08-05 0
#6 cycle2017 2017-08-14 2 2017-08-05 2
#7 cycle2016 2016-08-05 1 2016-07-13 4
#8 cycle2016 2016-08-29 4 2016-07-13 7
This is a concise solution using lubridate:
library(lubridate)
eg_data %>%
group_by(cycle) %>%
mutate(new_week = floor(as.period(ymd(dates) - ymd(min(dates))) / weeks()) + 1)
# A tibble: 8 x 4
# Groups: cycle [4]
cycle dates week_n new_week
<chr> <dttm> <chr> <dbl>
1 cycle2019 2019-08-01 00:00:00 1 1
2 cycle2019 2019-08-10 00:00:00 2 2
3 cycle2018 2018-07-31 00:00:00 1 1
4 cycle2018 2018-08-16 00:00:00 3 3
5 cycle2017 2017-08-03 00:00:00 1 1
6 cycle2017 2017-08-14 00:00:00 2 2
7 cycle2016 2016-08-05 00:00:00 1 1
8 cycle2016 2016-08-29 00:00:00 4 4
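For reference, dividing a period by weeks() returns a plain number of weeks, which is what makes the floor(...) + 1 idiom work; a quick check on the longest gap in the data:
library(lubridate)
as.period(ymd("2016-08-29") - ymd("2016-08-05")) / weeks() # 3.428571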
I have an R data frame as shown below:
test_df <- data.frame("subbject_id" = c(1,2,3,4,5),
"date_1" = c("01/01/2003","12/31/2007","12/30/2008","01/02/2007","01/01/2007"))
I would like to get the number of days to the previous year and to the next year, i.e., the days elapsed since the start of the year and the days remaining until the end of the year.
I was trying something like this:
library(lubridate)
test_df$current_yr = year(mdy(test_df$date_1))
prev_yr = test_df$current_yr - 1 # (subtract 1 to get the previous year)
next_yr = test_df$current_yr + 1 # (add 1 to get the next year)
days_to_prev_yr = days_in_year(current_yr) # this doesn't work
In Python, I know there are things like day of the year and offsets.YearEnd(0), etc., which I learned from this post. But can you help me with how to do this in R?
I expect my output to look like the result shown below.
You can use ceiling_date() and floor_date() from lubridate to get the first and last days of the year, and then take the difference from date_1 to get days_to_previous_year and days_to_next_year.
library(dplyr)
library(lubridate)
test_df %>%
mutate(date_1 = mdy(date_1),
previous_year = floor_date(date_1, 'year'),
next_year = ceiling_date(date_1, 'year', change_on_boundary = TRUE) - 1,
days_to_previous_year = as.integer(date_1 - previous_year),
days_to_next_year = as.integer(next_year - date_1)) %>%
select(-previous_year, -next_year)
# subbject_id date_1 days_to_previous_year days_to_next_year
#1 1 2003-01-01 0 364
#2 2 2007-12-31 364 0
#3 3 2008-12-30 364 1
#4 4 2007-01-02 1 363
#5 5 2007-01-01 0 364
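One caveat worth noting: by default ceiling_date() leaves instants that already sit on a boundary unchanged, which is why change_on_boundary = TRUE is used above; otherwise the January 1 rows would get a negative days_to_next_year:
library(lubridate)
ceiling_date(ymd("2003-01-01"), "year")                            # "2003-01-01"
ceiling_date(ymd("2003-01-01"), "year", change_on_boundary = TRUE) # "2004-01-01"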
One dplyr and lubridate option could be:
test_df %>%
mutate(date_1 = mdy(date_1),
days_to_prev_year = date_1 - mdy(paste0("01-01-", year(date_1))),
days_to_next_year = mdy(paste0("12-31-", year(date_1))) - date_1)
subbject_id date_1 days_to_prev_year days_to_next_year
1 1 2003-01-01 0 days 364 days
2 2 2007-12-31 364 days 0 days
3 3 2008-12-30 364 days 1 days
4 4 2007-01-02 1 days 363 days
5 5 2007-01-01 0 days 364 days
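A shorter equivalent is possible with lubridate's yday() day-of-year helper (paralleling the day-of-the-year idea mentioned from Python); a sketch, with make_date() building the December 31 date:
library(dplyr)
library(lubridate)
test_df %>%
  mutate(date_1 = mdy(date_1),
         days_to_prev_year = yday(date_1) - 1, # days since January 1
         days_to_next_year = yday(make_date(year(date_1), 12, 31)) - yday(date_1))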
I have some trips for each person in each household. The first 2 rows of the data look like this:
Household person trip are.time depends.time duration
1 1 0 02:20:00 08:20:00 NA
1 1 1 08:50:00 17:00:00 30
This means that the person started the trip at 8:20 and reached the destination at 8:50; that's why the duration is 30 (the trip duration in minutes).
Now I want to put the start time of each trip in the same row as that trip, like this:
Household person trip start.time are.time depends.time duration
1 1 0 NA 02:20:00 08:20:00 NA
1 1 1 08:20:00 08:50:00 17:00:00 30
Notice that for trip zero of each person we do not have a start time, so I put NA.
A solution using dplyr with base R functions for time manipulation.
First, create the data:
df <- data.frame(
Household = c(1, 1),
person = c(1, 1),
trip = c(0, 1),
are.time = c("02:20:00", "08:50:00"),
depends.time = c("08:20:00", "17:00:00"),
duration = c(NA, 30)
)
Then subtract the duration where it is not NA. I multiply duration by 60 because as.numeric(strptime()) converts the string to a value in units of seconds, and we want to subtract duration*60 seconds from that value.
library(dplyr)
df %>%
mutate(start.time = case_when(
!is.na(duration) ~ as.character(as.POSIXct(as.numeric(strptime(are.time, "%H:%M:%S")) - duration*60, origin="1970-01-01"), format="%H:%M:%S"),
TRUE ~ NA_character_
))
Output:
Household person trip are.time depends.time duration start.time
1 1 1 0 02:20:00 08:20:00 NA <NA>
2 1 1 1 08:50:00 17:00:00 30 08:20:00
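An alternative reading is to take the previous trip's depends.time directly with dplyr::lag(), grouped per person; a sketch (same df as above), assuming the trips are already ordered within each person:
df %>%
  group_by(Household, person) %>%
  mutate(start.time = lag(depends.time)) %>%
  ungroup()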
I am new to R and stuck on getting the start time and end time of runs of Flag = 1. The condition is that 1 should repeat (consecutively) at least 4 times; only then should its start and end time be captured. My input looks like:
Time Stamp Flag
00:00:00 1
00:00:10 1
00:00:20 1
00:00:30 1
00:00:40 0
00:00:50 0
00:01:00 0
00:01:10 0
00:01:20 0
00:01:30 1
00:01:40 1
00:01:50 1
00:02:00 0
00:02:10 1
00:02:20 1
00:02:30 1
00:02:40 1
00:02:50 1
00:03:00 1
00:03:10 1
00:03:20 1
and my output should look like this:
Start Time End Time Duration
00:00:00 00:00:30 00:00:30
00:02:10 00:03:20 00:01:10
I have achieved it using a for loop with if/else, but as the data is huge it takes a lot of time, so I need to optimize it.
Is there any built-in function in R that I can adapt and use?
First create the data using:
st <- data.frame(Time = format(seq(from=as.POSIXct("2012-1-1 00:00:00", tz="UTC"),
to=as.POSIXct("2012-1-1 00:03:20", tz="UTC"),
by="10 secs") , "%H:%M:%S" ),
Flag = c(1,1,1,1,0,0,0,0,0,1,1,1,0,1,1,1,1,1,1,1,1))
st$Time <- as.POSIXct(st$Time, format = "%H:%M:%S")
Using dplyr I would approach this as:
library(dplyr)
st %>%
mutate(gr = cumsum(lag(Flag, default = Flag[1]) != Flag)) %>%
filter(Flag == 1) %>%
group_by(gr) %>%
filter(length(gr) >= 4) %>%
summarise(start.time = first(Time),
last.time = last(Time)) %>%
mutate(Duration = last.time - start.time)
The result is
# A tibble: 2 × 4
gr start.time last.time Duration
<int> <dttm> <dttm> <time>
1 0 2017-06-08 00:00:00 2017-06-08 00:00:30 30 secs
2 4 2017-06-08 00:02:10 2017-06-08 00:03:20 70 secs
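The same run-length idea can be sketched in data.table with rleid() (assumes st as built above; the if (.N >= 4) in j returns NULL for short runs, which drops them):
library(data.table)
setDT(st)[, gr := rleid(Flag)][Flag == 1,
  if (.N >= 4) .(start.time = first(Time),
                 last.time  = last(Time),
                 Duration   = last(Time) - first(Time)),
  by = gr]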