I am working with a dataset of taxi trips by drivers in NYC. I have the driver ID as well as the pickup and dropoff date and time for every trip. Now I want to calculate the waiting time between the dropoff time of the previous trip and the pickup time of the next trip. So I need the time difference between two columns with a lag of one (the dropoff time comes from the previous row, the pickup time from the current row), grouped by driver ID (to make sure I am not calculating the time difference between trips of two different drivers).
A possible data set looks like this:
hack_license = c("303F79923DA5DA7A10DF15E2D91CDCF7","697ABFCDF7E7C77A01183C857132F2A4","697ABFCDF7E7C77A01183C857132F2A4","697ABFCDF7E7C77A01183C857132F2A4","ABE23CA71E2DE84972281BA1C70B6EBB","ABE23CA71E2DE84972281BA1C70B6EBB","BA83D7C383EAA4F9D78A1A8B83CB3E92","BA83D7C383EAA4F9D78A1A8B83CB3E92","D476A1872F1F6594BD638C274483ED06","D476A1872F1F6594BD638C274483ED06")
pickup_datetime = c("2013-12-31 23:01:07","2013-12-31 23:04:00","2013-12-31 23:31:00","2013-12-31 23:40:00","2013-12-31 23:16:39","2013-12-31 23:24:05","2013-12-31 23:09:10","2013-12-31 23:26:26","2013-12-31 23:13:00","2013-12-31 23:22:00")
dropoff_datetime = c("2013-12-31 23:20:33","2013-12-31 23:28:00","2013-12-31 23:33:00","2013-12-31 23:48:00","2013-12-31 23:22:29","2013-12-31 23:28:37","2013-12-31 23:21:24","2013-12-31 23:36:54","2013-12-31 23:20:00","2013-12-31 23:27:00")
data <- data.frame(hack_license,pickup_datetime,dropoff_datetime)
I tried to use dplyr and lubridate like this, but it doesn't work.
data %>%
group_by(data$hack_license) %>%
group_by(hack_license) %>%
mutate(waiting_time_in_secs = difftime(pickup_datetime,
lag(dropoff_datetime), units = 'secs'))
Maybe some of you can help me out here. Would be great!
Convert both the pickup and dropoff columns to datetimes, then, for each hack_license, calculate the difference between the current pickup time and the previous dropoff time.
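library(dplyr)
library(lubridate)

data <- data %>%
  # parse the character columns into POSIXct datetimes
  mutate(pickup_datetime = ymd_hms(pickup_datetime),
         dropoff_datetime = ymd_hms(dropoff_datetime)) %>%
  group_by(hack_license) %>%
  # lag() takes the previous row's dropoff within the same driver
  mutate(waiting_time_in_secs = as.numeric(difftime(pickup_datetime,
                                                    lag(dropoff_datetime), units = 'secs')))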
# hack_license pickup_datetime dropoff_datetime waiting_time_in_secs
# <chr> <dttm> <dttm> <dbl>
# 1 303F79923DA5DA7A10DF15E2D91CDCF7 2013-12-31 23:01:07 2013-12-31 23:20:33 NA
# 2 697ABFCDF7E7C77A01183C857132F2A4 2013-12-31 23:04:00 2013-12-31 23:28:00 NA
# 3 697ABFCDF7E7C77A01183C857132F2A4 2013-12-31 23:31:00 2013-12-31 23:33:00 180
# 4 697ABFCDF7E7C77A01183C857132F2A4 2013-12-31 23:40:00 2013-12-31 23:48:00 420
# 5 ABE23CA71E2DE84972281BA1C70B6EBB 2013-12-31 23:16:39 2013-12-31 23:22:29 NA
# 6 ABE23CA71E2DE84972281BA1C70B6EBB 2013-12-31 23:24:05 2013-12-31 23:28:37 96
# 7 BA83D7C383EAA4F9D78A1A8B83CB3E92 2013-12-31 23:09:10 2013-12-31 23:21:24 NA
# 8 BA83D7C383EAA4F9D78A1A8B83CB3E92 2013-12-31 23:26:26 2013-12-31 23:36:54 302
# 9 D476A1872F1F6594BD638C274483ED06 2013-12-31 23:13:00 2013-12-31 23:20:00 NA
#10 D476A1872F1F6594BD638C274483ED06 2013-12-31 23:22:00 2013-12-31 23:27:00 120
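One thing to watch: lag() simply takes the previous row within each group, so the result is only a waiting time if each driver's trips are in chronological order. Here is a minimal sketch, assuming the rows may arrive unsorted. The `trips` data frame and its shortened driver IDs are made up for illustration.

```r
library(dplyr)

# Hypothetical, deliberately unsorted trips for two drivers
trips <- data.frame(
  hack_license = c("A", "A", "B", "A"),
  pickup_datetime = as.POSIXct(c("2013-12-31 23:31:00", "2013-12-31 23:04:00",
                                 "2013-12-31 23:09:10", "2013-12-31 23:40:00"), tz = "UTC"),
  dropoff_datetime = as.POSIXct(c("2013-12-31 23:33:00", "2013-12-31 23:28:00",
                                  "2013-12-31 23:21:24", "2013-12-31 23:48:00"), tz = "UTC")
)

waits <- trips %>%
  # sort first so that "previous row" really means "previous trip"
  arrange(hack_license, pickup_datetime) %>%
  group_by(hack_license) %>%
  mutate(waiting_time_in_secs = as.numeric(difftime(pickup_datetime,
                                                    lag(dropoff_datetime),
                                                    units = "secs"))) %>%
  ungroup()

waits$waiting_time_in_secs
# NA 180 420 NA
```

The first trip of each driver has no previous dropoff, so its waiting time stays NA.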
I'm currently building some charts of covid-related data. My script downloads the most recent data and goes from there. I wind up with dataframes that look like
Date state positiveIncrease totalTestResultsIncrease
1 2020-05-19 NM 158 4367
2 2020-05-18 NM 81 4669
3 2020-05-17 NM 195 4126
4 2020-05-16 NM 159 4857
5 2020-05-15 NM 139 4590
6 2020-05-14 NM 152 4722
I've been aggregating to weekly data using the tq_transmute function from tidyquant.
NMweeklyPos <- NMdata %>% tq_transmute(select = positiveIncrease, mutate_fun = apply.weekly, FUN=sum)
This works, but it aggregates on week of the year, with weeks starting on Sunday.
Date positiveIncrease
<dttm> <int>
1 2020-03-08 00:00:00 0
2 2020-03-15 00:00:00 13
3 2020-03-22 00:00:00 44
4 2020-03-29 00:00:00 180
5 2020-04-05 00:00:00 306
6 2020-04-12 00:00:00 631
So for instance if I ran it today (which happens to be a Wednesday) my last entry is a partial week with Monday, Tuesday, Wednesday.
Date positiveIncrease
<dttm> <int>
1 2020-04-19 00:00:00 624
2 2020-04-26 00:00:00 862
3 2020-05-03 00:00:00 1072
4 2020-05-10 00:00:00 1046
5 2020-05-17 00:00:00 1079
6 2020-05-19 00:00:00 239
For purposes of my chart this winds up being a small value, and so I have been discarding the partial weeks at the end, but that means I'm throwing out the most recent data.
I would prefer to throw out a partial week from the start of the dataset and have the aggregation automatically use weeks that end on whatever day the script is being run. So if I ran it today (Wednesday) it would aggregate on weeks ending Wednesday, so that the most current data is included, and I could drop the partial week from the beginning of the data. But tomorrow it would choose weeks ending Thursday, etc. And I don't want to have to hardcode the week end day and change it each time.
How can I go about achieving that?
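Using lubridate, the code below finds the weekday of the end date and uses it to set where each week starts. With the default settings (no lubridate.week.start option set), wday() numbers the days from Sunday = 1 while the week_start argument of floor_date() numbers them from Monday = 1, so passing the wday() value straight through starts each week on the day after the end date, which means every week ends on the end date's weekday.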
Hope this helps!
library(dplyr)
library(lubridate)

end = as.Date("2020-04-14")
data = data.frame(
  date = seq.Date(as.Date("2020-01-01"), end, by = "day"),
  val = 1
)
# get the day of the week (by default wday() counts Sunday as 1)
weekday = wday(end)
# floor_date() counts Monday as 1 for week_start, so each week starts the day after `end` and ends on `end`'s weekday
data %>%
  mutate(week = floor_date(date, "week", week_start = weekday)) %>%
  group_by(week) %>%
  summarise(total = sum(val))
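If you would rather not depend on how weekdays are numbered, the same idea can be sketched in base R by counting whole weeks backwards from the end date. This is only an illustration: `daily`, `val` and the date range are made up, and `end` stands in for `Sys.Date()`.

```r
# Stand-in for Sys.Date(); 2020-05-20 was a Wednesday
end <- as.Date("2020-05-20")

# Made-up daily data: one unit per day
daily <- data.frame(date = seq(as.Date("2020-03-01"), end, by = "day"), val = 1)

# Number of days between each row and the end date
days_back <- as.integer(end - daily$date)

# Label each row with the last day of its 7-day block, counted back from `end`
daily$week_end <- end - 7 * (days_back %/% 7)

weekly <- aggregate(val ~ week_end, data = daily, FUN = sum)
n_days <- aggregate(val ~ week_end, data = daily, FUN = length)$val

# Keep only complete weeks; the partial week, if any, is the oldest one
weekly <- weekly[n_days == 7, ]
```

Every remaining row of `weekly` covers exactly seven days and ends on the same weekday as `end`, so the most recent data is always included.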
I have a nice dataset that includes user logins to my website, but no user logouts. A user can access the site again the next day, and then another line is registered. I'm looking to calculate how much time each user spent on the site.
The working assumption is that the site is quite interactive, so the time between a user's first and last action can be taken as the time the user was on the site. The problem is defining the final action. For example, a user could stay on the same page all night, but this is unlikely (without taking any action). It is very clear to me how to calculate this conceptually, but I haven't been able to find the proper code.
df <- read_csv(url("https://srv-file9.gofile.io/download/eBInE9/sheet1.csv"))
df %>% group_by(`User ID`)
## Now I am wondering how to cluster the items by time and to calculate lags..
Any thoughts?
I am assuming that the csv file you read has a POSIXct column which contains the login time. If not, you should make sure this exists.
Here's some code which generates time differences using the lag function by ID. The first time difference for each group will be NA. I generate some random time data first as you have not provided sample data (as you should have).
library(dplyr)
library(tibble)

# one candidate timestamp per minute over January 2020
random_times <- seq(as.POSIXct('2020/01/01'), as.POSIXct('2020/02/01'), by = "1 mins")
# draw 1000 random login times and assign each to one of six users
login <- tibble(login = sort(sample(random_times, 1000)),
                ID = sample(LETTERS[1:6], 1000, replace = TRUE))
login <- login %>%
  group_by(ID) %>%
  mutate(login_delayed = lag(login, 1)) %>%
  mutate(login_time = login - login_delayed)
With this output:
> login
# A tibble: 1,000 x 4
# Groups: ID [6]
login ID login_delayed login_time
<dttm> <chr> <dttm> <drtn>
1 2020-01-01 01:03:00 A NA NA mins
2 2020-01-01 01:11:00 A 2020-01-01 01:03:00 8 mins
3 2020-01-01 01:46:00 E NA NA mins
4 2020-01-01 02:33:00 E 2020-01-01 01:46:00 47 mins
5 2020-01-01 02:47:00 A 2020-01-01 01:11:00 96 mins
6 2020-01-01 10:43:00 F NA NA mins
7 2020-01-01 11:44:00 A 2020-01-01 02:47:00 537 mins
8 2020-01-01 11:57:00 A 2020-01-01 11:44:00 13 mins
9 2020-01-01 12:02:00 F 2020-01-01 10:43:00 79 mins
10 2020-01-01 12:57:00 D NA NA mins
# ... with 990 more rows
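To get from these lags to time spent on the site, one common approach is to start a new session whenever the gap to the previous action exceeds some threshold, then measure each session from its first to its last action. The sketch below assumes a 30-minute threshold; that number, the `events` data and all column names are illustrative, not taken from the question's file.

```r
library(dplyr)

# Hypothetical action log for two users
events <- data.frame(
  ID = c("A", "A", "A", "A", "B", "B"),
  login = as.POSIXct(c("2020-01-01 09:00:00", "2020-01-01 09:10:00",
                       "2020-01-01 09:25:00", "2020-01-02 14:00:00",
                       "2020-01-01 10:00:00", "2020-01-01 10:05:00"), tz = "UTC")
)

gap_limit_mins <- 30  # assumed cut-off; tune it to your site

sessions <- events %>%
  arrange(ID, login) %>%
  group_by(ID) %>%
  mutate(gap_mins = as.numeric(difftime(login, lag(login), units = "mins")),
         # a new session starts at the first action or after a long gap
         session = cumsum(is.na(gap_mins) | gap_mins > gap_limit_mins)) %>%
  group_by(ID, session) %>%
  summarise(session_mins = as.numeric(difftime(max(login), min(login), units = "mins")),
            .groups = "drop")

# Total time per user, summed over that user's sessions
time_on_site <- sessions %>%
  group_by(ID) %>%
  summarise(total_mins = sum(session_mins))
```

With this definition a session consisting of a single action counts as zero minutes, so you may want to add a fixed allowance per session if that undercounts real usage.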
I have a dataset of temperature values taken at specific datetimes across five locations. For whatever reason, sometimes the readings are every hour, and some every four hours. Another issue is that when the time changed as a result of daylight savings, the readings are off by one hour. I am interested in the readings taken every four hours and would like to subset these by day and night to ultimately get daily and nightly mean temperatures.
To summarise, the readings I am interested in are either:
0800, 1200, 1600 =day
2000, 0000, 0400 =night
Recordings between 0800-1600 and 2000-0400 each day should be averaged.
During daylight savings, the equivalent times are:
0900, 1300, 1700 =day
2100, 0100, 0500 =night
Recordings between 0900-1700 and 2100-0500 each day should be averaged.
In the process, I am hoping to subset by site.
There are also some NA values or blank cells which should be ignored.
So far, I tried to subset by one hour of interest just to see if it worked, but haven't got any further than that. Any tips on how to subset by a series of times of interest? Thanks!
temperature <- read.csv("SeaTemperatureData.csv",
                        stringsAsFactors = FALSE)
temperature <- subset(temperature, select = -c(X)) # remove last column that contains comments, not needed
temperature$Date.Time <- as.POSIXct(temperature$Date.Time,
                                    format = "%d/%m/%Y %H:%M")
#subset data by time, we only want to include temperatures recorded at certain times
temperature.goat <- subset(temperature, Date.Time==c('01:00:00'), select=c("Goat.Island"))
Date.Time Goat.Island Tawharanui Kawau Tiritiri Noises
1 2019-06-10 16:00:00 16.820 16.892 16.749 16.677 15.819
2 2019-06-10 20:00:00 16.773 16.844 16.582 16.654 15.796
3 2019-06-11 00:00:00 16.749 16.820 16.749 16.606 15.819
4 2019-06-11 04:00:00 16.487 16.796 16.654 16.558 15.796
5 2019-06-11 08:00:00 16.582 16.749 16.487 16.463 15.867
6 2019-06-11 12:00:00 16.630 16.773 16.725 16.654 15.867
One possible solution is to extract hours from your DateTime variable, then filter for particular hours of interest.
Here is a made-up example over 4 days:
library(lubridate)

df <- data.frame(DateTime = seq(ymd_hms("2020-02-01 00:00:00"), ymd_hms("2020-02-05 00:00:00"), by = "hour"),
                 Value = sample(1:100, 97, replace = TRUE))
DateTime Value
1 2020-02-01 00:00:00 99
2 2020-02-01 01:00:00 51
3 2020-02-01 02:00:00 44
4 2020-02-01 03:00:00 49
5 2020-02-01 04:00:00 60
6 2020-02-01 05:00:00 56
Now, you can extract hours with hour function of lubridate and subset for the desired hour:
subset(df, hour(DateTime) == 5)
DateTime Value
6 2020-02-01 05:00:00 56
30 2020-02-02 05:00:00 31
54 2020-02-03 05:00:00 65
78 2020-02-04 05:00:00 80
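To keep a whole series of hours rather than a single one, the same hour extraction can be combined with %in%. Here is a base R sketch using the day and night hours from the question; the `readings` data frame and its values are made up for illustration.

```r
# Made-up hourly readings over two days
readings <- data.frame(
  DateTime = seq(as.POSIXct("2020-02-01 00:00:00", tz = "UTC"),
                 as.POSIXct("2020-02-02 23:00:00", tz = "UTC"), by = "hour")
)
readings$Value <- seq_len(nrow(readings))

# Hour of day as an integer (0-23)
readings$Hour <- as.integer(format(readings$DateTime, "%H"))

day_hours   <- c(8, 12, 16)
night_hours <- c(20, 0, 4)

# Label the hours of interest; every other hour becomes NA
readings$Period <- ifelse(readings$Hour %in% day_hours, "day",
                   ifelse(readings$Hour %in% night_hours, "night", NA))

# Keep only the four-hourly readings
four_hourly <- readings[!is.na(readings$Period), ]
```

For the daylight saving period you would swap in the shifted hour vectors (9, 13, 17 and 21, 1, 5) for the affected date range.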
EDIT: Getting mean of each sites per subset of hours
Per the OP's request in the comments, the goal is now to calculate the mean of the values at each site for different periods of the day.
Basically, you want two periods per day, one from 8:00 to 17:00 and the other from 18:00 to 7:00.
Here is a more elaborate example based on the previous one:
df <- data.frame(DateTime = seq(ymd_hms("2020-02-01 00:00:00"), ymd_hms("2020-02-05 00:00:00"), by = "hour"),
Site1 = sample(1:100,97, replace = TRUE),
Site2 = sample(1:100,97, replace = TRUE))
DateTime Site1 Site2
1 2020-02-01 00:00:00 100 6
2 2020-02-01 01:00:00 9 49
3 2020-02-01 02:00:00 86 12
4 2020-02-01 03:00:00 34 55
5 2020-02-01 04:00:00 76 29
6 2020-02-01 05:00:00 41 1
So now you can label each time point as "Daily" or "Night", then group by this category for each day and calculate the mean of each individual site using summarise_at:
library(dplyr)

df %>% mutate(Date = date(DateTime),
              Hour = hour(DateTime),
              Category = ifelse(between(hour(DateTime), 8, 17), "Daily", "Night")) %>%
  group_by(Date, Category) %>%
  summarise_at(vars(c(Site1, Site2)), ~ mean(., na.rm = TRUE))
# A tibble: 9 x 4
# Groups: Date [5]
Date Category Site1 Site2
<date> <chr> <dbl> <dbl>
1 2020-02-01 Daily 56.9 63.1
2 2020-02-01 Night 58.9 46.6
3 2020-02-02 Daily 54.5 47.6
4 2020-02-02 Night 36.9 41.7
5 2020-02-03 Daily 42.3 56.9
6 2020-02-03 Night 44.1 55.9
7 2020-02-04 Daily 54.3 50.4
8 2020-02-04 Night 54.8 34.3
9 2020-02-05 Night 75 16
Does it answer your question?
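One caveat: grouping by calendar date puts the early-morning and late-evening hours of the same date into one "Night" group, even though they belong to two different nights. If a night should run from 20:00 through 04:00 the next morning, as in the question, one option is to shift each timestamp back by 8 hours before taking the date. This is a base R sketch on made-up four-hourly data; `readings`, `Site1` and `PeriodDate` are illustrative names.

```r
# Made-up four-hourly readings: 08:00, 12:00, 16:00, 20:00, 00:00, 04:00, ...
readings <- data.frame(
  DateTime = seq(as.POSIXct("2019-06-10 08:00:00", tz = "UTC"),
                 by = "4 hours", length.out = 12),
  Site1 = as.numeric(1:12)
)

readings$Hour   <- as.integer(format(readings$DateTime, "%H"))
readings$Period <- ifelse(readings$Hour %in% c(8, 12, 16), "day", "night")

# Shifting back 8 hours makes 20:00, 00:00 and 04:00 fall on the same date
readings$PeriodDate <- as.Date(readings$DateTime - 8 * 3600, tz = "UTC")

# aggregate() with a formula drops rows whose value is NA by default
means <- aggregate(Site1 ~ PeriodDate + Period, data = readings, FUN = mean)
```

Here the night of 2019-06-10 averages the readings at 20:00 on the 10th and at 00:00 and 04:00 on the 11th.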
I want to convert the seconds in "TOTALSEC" to a 24-hour time of day (i.e. 14:32:40). However, it needs to be based on the starting time in "Time". For example, I have "12:00:00" as the start time under Time and "3630" under TOTALSEC, and I want the converted time to read 13:00:30. For context, Time is the start time of an audio recorder, and TOTALSEC is the number of seconds after the start time at which the audio clip occurs. So in the above example, I started the recorder at 12:00:00, and the audio clip occurred 3630 seconds after it started recording.
> head(stacksubset)
# A tibble: 6 x 2
TOTALSEC Time Date
<dbl> <chr> <chr>
1 10613.67 15:53:50 2017-05-30
2 56404.35 17:29:44 2017-05-29
3 20480.54 16:16:12 2017-06-13
4 60613.47 15:53:50 2017-05-30
5 80034.30 16:16:12 2017-06-02
6 50710.37 16:16:12 2017-05-27
I was thinking it might have to be a loop function of some sort (there's 2000 additional lines after the above 6).
data <- data.frame(TOTALSEC = c(10613.67, 56404.35, 20480.54, 60613.47, 80034.30, 50710.37),
Time = c("15:53:50", "17:29:44", "16:16:12", "15:53:50", "16:16:12", "16:16:12"))
Because the date actually matters here (adding the seconds can roll past midnight), each time is pasted to a date (today's date) and then converted to a datetime in the timezone of your system.
Then the two are added together.
library(dplyr)
library(lubridate)
data %>% mutate(Time = as_datetime(paste(Sys.Date(), Time), tz = Sys.timezone(location = TRUE)),
                end_time = TOTALSEC + Time)
TOTALSEC Time end_time
1 10613.67 2017-11-23 15:53:50 2017-11-23 18:50:43
2 56404.35 2017-11-23 17:29:44 2017-11-24 09:09:48
3 20480.54 2017-11-23 16:16:12 2017-11-23 21:57:32
4 60613.47 2017-11-23 15:53:50 2017-11-24 08:44:03
5 80034.30 2017-11-23 16:16:12 2017-11-24 14:30:06
6 50710.37 2017-11-23 16:16:12 2017-11-24 06:21:22
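If you want to use each recording's own date and end up with just the time of day, the same addition works in base R, with no loop needed because the arithmetic is vectorised. This is a sketch with two made-up rows; the `Date` column name and the fixed UTC time zone are assumptions (UTC sidesteps daylight saving jumps).

```r
# Two illustrative rows: the 12:00:00 + 3630 s example and one row from the sample
clips <- data.frame(
  TOTALSEC = c(3630, 10613.67),
  Time = c("12:00:00", "15:53:50"),
  Date = c("2017-05-30", "2017-05-30")
)

# Build the recorder start as a full datetime
start <- as.POSIXct(paste(clips$Date, clips$Time), tz = "UTC")

# Adding a number to a POSIXct adds that many seconds
clips$clip_time <- format(start + clips$TOTALSEC, "%H:%M:%S")

clips$clip_time
# "13:00:30" "18:50:43"
```

Note that format() drops the fractional seconds; keep the POSIXct value itself if you need them or need to know when a clip rolls over to the next day.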
Consider this:
library(lubridate)
time <- seq(ymd_hms("2014-02-24 23:00:00"), ymd_hms("2014-06-25 08:32:00"), by = "hour")
group <- rep(LETTERS[1:20], each = length(time))
value <- sample(-10^3:10^3, length(time), replace = TRUE)
df2 <- data.frame(time, group, value)
> head(df2)
time group value
1 2014-02-24 23:00:00 A 246
2 2014-02-25 00:00:00 A -261
3 2014-02-25 01:00:00 A 628
4 2014-02-25 02:00:00 A 429
5 2014-02-25 03:00:00 A -49
6 2014-02-25 04:00:00 A -749
I would like to create a variable that contains, for each group, the rolling mean of value
over the last 5 days (not including the current observation)
only considering observations that fall at the exact same hour as the current observation.
In other words:
At time 2014-02-24 23:00:00, df2['rolling_mean_same_hour'] contains the mean of the values of value observed at 23:00:00 during the last 5 days in the data (not including 2014-02-24 of course).
I would like to do that in either dplyr or data.table. I confess I have no idea how to do that.
Any ideas?
Many thanks!
Group the data by the group variable and by the hour of the time variable, then compute rollmean() within each group. With a right-aligned window rollmean() includes the current observation, so wrap it in shift() to lag the result by one row and exclude it:
library(data.table); library(zoo)
setDT(df2)  # df2 was created as a data.frame; convert it to a data.table by reference
df2[, .(rolling_mean_same_hour = shift(
          rollmean(value, 5, na.pad = TRUE, align = 'right'),
          n = 1,
          type = 'lag'),
        time), .(hour(time), group)]
# hour group rolling_mean_same_hour time
# 1: 23 A NA 2014-02-24 23:00:00
# 2: 23 A NA 2014-02-25 23:00:00
# 3: 23 A NA 2014-02-26 23:00:00
# 4: 23 A NA 2014-02-27 23:00:00
# 5: 23 A NA 2014-02-28 23:00:00
# ---
#57796: 22 T -267.0 2014-06-20 22:00:00
#57797: 22 T -389.6 2014-06-21 22:00:00
#57798: 22 T -311.6 2014-06-22 22:00:00
#57799: 22 T -260.0 2014-06-23 22:00:00
#57800: 22 T -26.8 2014-06-24 22:00:00
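To see what the shifted rolling mean produces, here is a small base R sketch of the same logic on data where the answer can be checked by hand. The helper `lagged_roll_mean` and the `obs` data are made up for illustration. Like the code above, it averages the previous 5 observations at the same hour, which matches "the last 5 days" only if no day is missing.

```r
# Mean of the k values before each position; NA until k earlier values exist
lagged_roll_mean <- function(x, k) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n > k) {
    for (i in (k + 1):n) out[i] <- mean(x[(i - k):(i - 1)])
  }
  out
}

# Made-up data: one group, readings every 12 hours (at 23:00 and 11:00)
obs <- data.frame(
  time = seq(as.POSIXct("2014-02-24 23:00:00", tz = "UTC"),
             by = "12 hours", length.out = 14),
  group = "A",
  value = as.numeric(1:14)
)
obs$hour <- as.integer(format(obs$time, "%H"))

# ave() applies the helper within each group/hour combination, keeping row order
obs$rolling_mean_same_hour <- ave(obs$value, obs$group, obs$hour,
                                  FUN = function(v) lagged_roll_mean(v, 5))

tail(obs$rolling_mean_same_hour, 4)
# 5 6 7 8
```

Row 11 (the sixth 23:00 reading) gets mean(1, 3, 5, 7, 9) = 5, and every earlier row is NA because fewer than five previous readings exist at that hour.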