I have a large data frame with one column with time and a second column with speed measurements (km/h). Here is a short example of the data:
df <- data.frame(time = as.POSIXct(c("2019-04-01 13:55:18", "2019-04-01 14:03:18",
"2019-04-01 14:14:18", "2019-04-01 14:26:55",
"2019-04-01 14:46:55", "2019-04-01 15:01:55")),
speed = c(4.5, 6, 3.2, 5, 4, 2))
Is there any way to create a new data frame that holds the distance driven every 20 minutes, from 2019-04-01 14:00:00 to 2019-04-01 15:00:00, assuming that the speed changes are linear? I was trying to find a solution with integrals, but was not sure if that is the correct way to do it. Thanks for the help!
Here is a solution using a combination of zoo::na.approx and dplyr functions.
library(zoo)
library(dplyr)
time_seq <- data.frame(time = seq(min(df$time), max(df$time), by = "secs"))
df <- merge(time_seq, df, all.x = TRUE)
df$speed <- na.approx(df$speed)  # linear interpolation of speed at every second
df %>%
filter(time >= "2019-04-01 14:00:00" & time < "2019-04-01 15:00:00") %>%
mutate(km = speed / 3600) %>%  # km covered in this one second (speed in km/h divided by 3600 s/h)
group_by(group = cut(time, breaks = "20 min")) %>%
summarise(distance = sum(km))
Which gives:
# A tibble: 3 x 2
group distance
<fct> <dbl>
1 2019-04-01 14:00:00 1.50
2 2019-04-01 14:20:00 1.54
3 2019-04-01 14:40:00 1.16
Explanation:
The first step is to create a one-second sequence of time points spanning the data (time_seq), so that the speed between two measured time points can be computed. The sequence is then merged with the data frame and the resulting NAs are filled by linear interpolation with na.approx.
Then, using dplyr verbs, the data frame is filtered to the window of interest and the 20-minute bins are created with cut. The final distance is the sum of the per-second distances (speed/3600) within each 20-minute bin.
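Since the question mentions integrals: the per-second sum above is a left Riemann sum of speed over time. A minimal sketch of the trapezoidal-rule variant on the same interpolated data (assuming df is the second-resolution frame built above) gives essentially the same numbers:
# trapezoidal rule on the 1-second grid; with piecewise-linear speed this is
# the exact integral of v(t) dt within each bin (boundary seconds aside)
df %>%
  filter(time >= "2019-04-01 14:00:00" & time < "2019-04-01 15:00:00") %>%
  group_by(group = cut(time, breaks = "20 min")) %>%
  summarise(distance = sum(rollmean(speed, 2)) / 3600)  # zoo::rollmean averages adjacent speeds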
I have two data frames: one with the in-times of employees and the other with their out-times. Both hold timestamps for about 4000 employees over the last year (excluding weekends and public holidays), so each data frame has 4000 rows and 250 columns. I would like to find the number of hours each employee spent at work per day; my approach is to take the difference between the two data frames using the difftime() function. I used the code below and expected a resulting data frame of 4000 rows and 250 columns of time differences, but the data came back as one single column. How can I get the difference between the two data frames as a data frame with 4000 rows and 250 columns?
hours_spent <- as.data.frame(as.matrix(difftime(as.matrix(out_time_data_hrs),
                                                as.matrix(in_time_data_hrs),
                                                units = "hours")))
Input data looks like below (screenshots of the in-time data frame, the out-time data frame, and the expected output were attached as images and are not reproduced here).
Here's a small and simple example based on the data you posted and a possible solution:
# example data in_times
df1 = data.frame(`2018-08-01` = c("2018-08-01 10:30:00", "2018-08-01 10:25:00"),
`2018-08-02` = c("2018-08-02 10:20:00", "2018-08-02 10:45:00"))
# example data out_times
df2 = data.frame(`2018-08-01` = c("2018-08-01 17:33:00", "2018-08-01 18:06:00"),
`2018-08-02` = c("2018-08-02 17:11:00", "2018-08-02 17:45:00"))
library(tidyverse)
# reshape datasets
df1_resh = df1 %>%
mutate(empl_id = row_number()) %>% # add an employee id (using the row number)
gather(day, in_time, -empl_id) # reshape dataset
df2_resh = df2 %>%
mutate(empl_id = row_number()) %>%
gather(day, out_time, -empl_id)
# join datasets and calculate hours spent
left_join(df1_resh, df2_resh, by=c("empl_id","day")) %>%
mutate(hours_spent = difftime(out_time, in_time))
# empl_id day in_time out_time hours_spent
# 1 1 X2018.08.01 2018-08-01 10:30:00 2018-08-01 17:33:00 7.050000 hours
# 2 2 X2018.08.01 2018-08-01 10:25:00 2018-08-01 18:06:00 7.683333 hours
# 3 1 X2018.08.02 2018-08-02 10:20:00 2018-08-02 17:11:00 6.850000 hours
# 4 2 X2018.08.02 2018-08-02 10:45:00 2018-08-02 17:45:00 7.000000 hours
You can use this as the final piece of code if you want to reshape back to your initial format:
left_join(df1_resh, df2_resh, by=c("empl_id","day")) %>%
mutate(hours_spent = difftime(out_time, in_time)) %>%
select(empl_id, day, hours_spent) %>%
spread(day, hours_spent)
# empl_id X2018.08.01 X2018.08.02
# 1 1 7.050000 hours 6.85 hours
# 2 2 7.683333 hours 7.00 hours
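As an aside, in tidyr 1.0 and later the gather/spread pair has been superseded by pivot_longer/pivot_wider; a sketch of the first reshape in that style, under the same tidyverse setup as above:
# equivalent reshape with the newer tidyr verbs (requires tidyr >= 1.0.0)
df1_resh = df1 %>%
  mutate(empl_id = row_number()) %>%
  pivot_longer(-empl_id, names_to = "day", values_to = "in_time")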
My requirement is satisfied by just doing the below; pretty straightforward:
employee_hrs_df <- out_time_data - in_time_data
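That works because arithmetic between two data frames is applied element-wise, and subtracting two POSIXct values yields a difftime. A minimal sketch, assuming both frames hold POSIXct columns aligned by row and column:
# toy one-column in/out frames with POSIXct values
in_t  <- data.frame(d1 = as.POSIXct(c("2018-08-01 10:30:00", "2018-08-01 10:25:00")))
out_t <- data.frame(d1 = as.POSIXct(c("2018-08-01 17:33:00", "2018-08-01 18:06:00")))

out_t - in_t  # element-wise difftime: 7.050000 and 7.683333 hours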
I have made temperature measurements at a high time resolution of 10 minutes on different urban tree species whose reactions should be compared, so I am especially interested in periods of heat. The task I am failing at is to select complete days from my dataset based on a maximum value, e.g. days with at least one measurement above 30 °C should be subsetted from my data frame completely.
Below you find a reproducible example that should illustrate my problem:
In my Measurings data frame I have calculated a column indicating whether the individual measurement is above or below 30 °C. I wanted to use that column to tell other functions whether to pick a day or not when producing a new data frame: whenever the value is above 30 °C at any time of a day, that day should be included from 00:00 to 23:59 in the new data frame for further analysis.
start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
tseq <- seq(from = start, length.out = 1000, by = "hours")
Measurings <- data.frame(
Time = tseq,
Temp = sample(20:35,1000, replace = TRUE),
Variable1 = sample(1:200,1000, replace = TRUE),
Variable2 = sample(300:800,1000, replace = TRUE)
)
Measurings$heat30 <- ifelse(Measurings$Temp > 30,"heat", "normal")
Measurings$otheroption30 <- ifelse(Measurings$Temp > 30,"1", "0")
The example yields a data frame analogous to the structure of my data:
head(Measurings)
Time Temp Variable1 Variable2 heat30 otheroption30
1 2018-05-18 00:00:00 28 56 377 normal 0
2 2018-05-18 01:00:00 23 65 408 normal 0
3 2018-05-18 02:00:00 29 78 324 normal 0
4 2018-05-18 03:00:00 24 157 432 normal 0
5 2018-05-18 04:00:00 32 129 794 heat 1
6 2018-05-18 05:00:00 25 27 574 normal 0
So how do I subset to get a new data frame where all days with at least one entry marked as "heat" are kept?
I know that, for example, dplyr::filter can filter the individual entries (row 5 in the head of the example). But how do I tell it to take all of day 2018-05-18?
I am quite new to analyzing data with R, so I would appreciate any suggestions for a working solution. dplyr is what I have been using for quite a few tasks, but I am open to whatever works.
Thanks a lot, Konrad
Create a variable which specifies the day (dropping hours, minutes, etc.). Then iterate over the unique dates and keep only those subsets whose heat30 column contains "heat" at least once:
library(dplyr)

Measurings <- Measurings %>% mutate(Time2 = format(Time, "%Y-%m-%d"))

newdf <- lapply(unique(Measurings$Time2), function(x){
  rr <- Measurings %>% filter(Time2 == x)  # all measurements of date x
  # keep the whole day only if heat30 contains "heat" at least once
  if (any(rr$heat30 == "heat")) rr else NULL
}) %>% bind_rows()
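For reference, a more compact dplyr-only sketch of the same idea, grouping by calendar day and keeping every group that contains at least one "heat" row (assuming Time is POSIXct with tz = "CET" as in the example):
library(dplyr)

hot_days <- Measurings %>%
  group_by(Day = as.Date(Time, tz = "CET")) %>%  # calendar day of each measurement
  filter(any(heat30 == "heat")) %>%              # keep whole days with >= 1 "heat" row
  ungroup()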
Below is one possible solution using the dataset provided in the question. Please note that this is not a great example as all days will probably include at least one observation marked as over 30 °C (i.e. there will be no days to filter out in this dataset but the code should do the job with the actual one).
# import packages
library(dplyr)
library(stringr)
# break the time stamp into Day and Hour
time_df <- as_data_frame(str_split(Measurings$Time, " ", simplify = T))
# name the columns
names(time_df) <- c("Day", "Hour")
# create a new measurement data frame with separate Day and Hour columns
new_measurings_df <- bind_cols(time_df, Measurings[-1])
# form the new data frame by filtering the days marked as heat
new_df <- new_measurings_df %>%
filter(Day %in% new_measurings_df$Day[new_measurings_df$heat30 == "heat"])
To be more precise, you are creating a random sample of 1000 hourly observations varying between 20 and 35 °C across roughly 40 days. As a result, it is very likely that every single day will have at least one observation above 30 °C in your example. Additionally, it is always good practice to set a seed to ensure reproducibility.
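For example, placing a fixed seed before the sampling calls makes the simulated data, and hence the filtering result, reproducible:
set.seed(42)  # fix the RNG state so sample() draws the same values on every run
Measurings <- data.frame(
  Time = tseq,
  Temp = sample(20:35, 1000, replace = TRUE)
  # (Variable1/Variable2 omitted for brevity)
)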
I have a data frame with start and stop times for an experiment and I want to calculate the duration of each experiment (one line per experiment). Data frame:
start_t stop_t
7:35 7:48
23:50 00:15
11:22 12:06
I created a function to convert the times to POSIX format and calculate the duration, testing whether start and stop cross midnight:
TimeDiff <- function(t1,t2) {
if (as.numeric(as.POSIXct(paste("2016-01-01", t1))) > as.numeric(as.POSIXct(paste("2016-01-01", t2)))) {
t1n <- as.numeric(as.POSIXct(paste("2016-01-01", t1)))
t2n <- as.numeric(as.POSIXct(paste("2016-01-02", t2)))
}
if (as.numeric(as.POSIXct(paste("2016-01-01", t1))) < as.numeric(as.POSIXct(paste("2016-01-01", t2)))) {
t1n <- as.numeric(as.POSIXct(paste("2016-01-01", t1)))
t2n <- as.numeric(as.POSIXct(paste("2016-01-01", t2)))
}
#calculate time-difference in seconds
t2n - t1n
}
Then I wanted to apply this function to my data frame using either the 'mutate' function in 'dplyr' or an 'apply' function, e.g.:
mutate(df, dur = TimeDiff(start_t, stop_t))
But the result is that the 'dur' column is filled with the same value everywhere. I ended up using a clunky for-loop to apply my function to the data frame, but I would like a more elegant solution. Help wanted!
The day can be incremented when the time stamp passes midnight, but I am not sure that is necessary if you just want to test whether start and stop cross midnight. Hope this helps!
df = data.frame(start_t = c("7:35", "23:50","11:22"), stop_t=c("7:48", "00:15", "12:06"), stringsAsFactors = F)
myfun = function(tvec1, tvec2, units_args="secs") {
tvec1_t = as.POSIXct(paste("2016-01-01", tvec1))
tvec2_t = as.POSIXct(paste("2016-01-01", tvec2))
time_diff = difftime(tvec2_t, tvec1_t, units = units_args)
return( time_diff )
}
# append new columns (base R)
df$time_diff = myfun(df$start_t, df$stop_t)
df$cross = ifelse(df$time_diff < 0, 1, 0)
output:
start_t stop_t time_diff cross
1 7:35 7:48 780 secs 0
2 23:50 00:15 -84900 secs 1
3 11:22 12:06 2640 secs 0
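If the actual duration is needed rather than just the crossing flag, a hedged follow-up (assuming no experiment lasts 24 hours or more) adds one day, 86400 seconds, wherever the raw difference went negative:
# durations in seconds; add 24 h where stop_t fell on the day after start_t
df$duration_secs = as.numeric(df$time_diff) + ifelse(df$cross == 1, 86400, 0)
df$duration_secs
# [1]  780 1500 2640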
Since you don't have dates but only times, there is indeed the problem of experiments crossing midnight. Your function does not work, because it is not vectorized, i.e. it doesn't compute the difference for each element on its own.
The following works but is still not perfectly elegant:
If the start happened before the end, we simply subtract to get the duration.
If we cross midnight (the heuristic for this is not very stable), we calculate the difference until midnight and add the duration on the next day.
library(tidyverse)
diff_time <- function(start, end) {
case_when(start < end ~ end - start,
start > end ~ parse_time("23:59") - start + end + parse_time("0:01")
)
}
df %>%
mutate_all(parse_time) %>%
mutate(duration = diff_time(start_t, stop_t))
#> start_t stop_t duration
#> 1 07:35:00 07:48:00 780 secs
#> 2 23:50:00 00:15:00 1500 secs
#> 3 11:22:00 12:06:00 2640 secs
If you had dates, you could simply do:
df %>%
mutate(duration = stop_t - start_t)
Data
df <- read.table(text = "start_t stop_t
7:35 7:48
23:50 00:15
11:22 12:06", header = T)
The simplest way I can think of involves lubridate:
library(lubridate)
library(dplyr)
#make a fake df
df <- data.frame(start = c('7:35', '23:50', '11:22'), stop = c('7:48', '00:15', '12:06'), stringsAsFactors = FALSE)
#convert to lubridate minutes/seconds format, then subtract
df %>%
mutate(start = ms(start), stop = ms(stop)) %>%
mutate(dur= stop - start)
Output:
start stop dur
1 7M 35S 7M 48S 13S
2 23M 50S 15S -23M -35S
3 11M 22S 12M 6S 1M -16S
The problem with your circumstance is that the second line will confuse lubridate: it returns a negative duration (-23M -35S above) because it assumes all of these times fall on the same day. You should probably add the day:
library(lubridate)
library(dplyr)
#make a fake df
df <- data.frame(start = c('2017/10/08 7:35', '2017/10/08 23:50', '2017/10/08 11:22'), stop = c('2017/10/08 7:48', '2017/10/09 00:15', '2017/10/08 12:06'), stringsAsFactors = FALSE)
#convert to lubridate minutes/seconds format, then subtract
df %>%
mutate(start = ymd_hm(start), stop = ymd_hm(stop)) %>%
mutate(dur= stop - start)
Output:
start stop dur
1 2017-10-08 07:35:00 2017-10-08 07:48:00 13 mins
2 2017-10-08 23:50:00 2017-10-09 00:15:00 25 mins
3 2017-10-08 11:22:00 2017-10-08 12:06:00 44 mins
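If you prefer a plain numeric column over the difftime class that subtraction returns, difftime() with explicit units wrapped in as.numeric() converts it; a small sketch on the same data:
# durations as plain numbers of minutes rather than a difftime column
df %>%
  mutate(start = ymd_hm(start), stop = ymd_hm(stop)) %>%
  mutate(dur_mins = as.numeric(difftime(stop, start, units = "mins")))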
Intro:
I would like to aggregate some 5-minute data into 10-minute data. Specifically, I only want to aggregate on the 10-minute marks (00:10:00, 00:20:00, 00:30:00, etc.).
The code below almost achieves this, but the breaks land on the 5-minute marks instead of the 10-minute marks (00:05:00, 00:15:00, 00:25:00). I believe cut() anchors the breaks at the first timestamp in the data frame.
Are there any ways to achieve "nice" 10-min breaks using cut {base} and group_by() {dplyr}? I would be okay with just removing the first row of data, but I really need the solution to handle many different files, each with a unique starting point.
Thanks in advance!
Example Code:
date <- c("2017-06-14 14:35:00", "2017-06-14 14:40:00", "2017-06-14 14:45:00", "2017-06-14 14:50:00")
co <- as.numeric(c(5.17,10.07,13.88,13.78))
no <- as.numeric(c(34.98,32.45,31.34,29.09))
no2 <- as.numeric(c(0.00,0.00,0.00,0.00))
o3 <- as.numeric(c(5.17,10.07,13.88,13.78))
data <- data.frame(date, co, no , no2, o3)
data$date <- strptime(data$date, format = "%Y-%m-%d %H:%M")
data$date <- as.POSIXct(data$date)
head(data)
data_10min <- data %>%
group_by(date = cut(date, breaks = "10 min")) %>%
summarize(co = mean(co), no = mean(no), no2 = mean(no2), o3 = mean(o3))
head(data_10min)
Desired Output:
2017-06-14 14:40:00
2017-06-14 14:50:00
Just adding 300 seconds (half of the 10-minute interval) to the date column inside group_by gets the desired result, because the observations sit on 5-minute marks:
library(magrittr)
library(dplyr)
data_10min <- data %>%
  group_by(date = cut(date + 300, breaks = "10 min")) %>%
  summarize_each(funs(mean))
data_10min
The result:
# # A tibble: 2 × 5
# date co no no2 o3
# <fctr> <dbl> <dbl> <dbl> <dbl>
# 1 2017-06-14 14:40:00 7.62 33.715 0 7.62
# 2 2017-06-14 14:50:00 13.83 30.215 0 13.83
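If the files can start at arbitrary offsets (not only on 5-minute marks), a sketch with lubridate::ceiling_date avoids anchoring on the first row entirely; this assumes dplyr >= 1.0 for across() and assigns each observation to the 10-minute mark that closes its interval:
library(dplyr)
library(lubridate)

data_10min <- data %>%
  group_by(date = ceiling_date(date, "10 minutes")) %>%  # snap up to the next :10 mark
  summarize(across(co:o3, mean))
data_10min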
I have a large file of time-series data, which looks as follows. The dataset covers years, in increments of 15 minutes. A small subset looks like:
uniqueid time
a 2014-04-30 23:30:00
a 2014-04-30 23:45:00
a 2014-05-01 00:00:00
a 2014-05-01 00:15:00
a 2014-05-12 13:45:00
a 2014-05-12 14:00:00
b 2014-05-12 13:45:00
b 2014-05-12 14:00:00
b 2014-05-12 14:30:00
To reproduce above:
time<-c("2014-04-30 23:30:00","2014-04-30 23:45:00","2014-05-01 00:00:00","2014-05-01 00:15:00",
"2014-05-12 13:45:00","2014-05-12 14:00:00","2014-05-12 13:45:00","2014-05-12 14:00:00",
"2014-05-12 14:30:00")
uniqueid<-c("a","a","a","a","a","a","b","b","b")
mydf<-data.frame(uniqueid,time)
My goal is to count the number of rows per unique id per consecutive timeflow. A consecutive timespan is when a unique id is stamped every 15 minutes in a row (such as id a, which is stamped from 2014-04-30 23:30 until 2014-05-01 00:15, hence 4 rows). When this flow of 15-minute iterations is disrupted (after 2014-05-01 00:15 there is no stamp at 00:30), the next timestamp should count as the start of a new consecutive timeflow, and the number of rows is again counted until that flow is disrupted. Time is POSIXct.
As you can see in above example; a consecutive timeflow may cover different days, different months, or different years. I have many unique ids (and as said, a very large file), so I'm looking for a way that my computer can handle (loops probably wouldn't work).
I am looking for output something like:
uniqueid flow number_rows
a 1 4
a 2 2
b 3 2
b 4 1
I have looked into some time packages (such as lubridate), but given my limited R knowledge, I don't even know where to begin.
I hope all is clear - if not, I'd be happy to try to clarify it further. Thank you very much in advance!
Another way to do this with data.table also using a time difference would be to make use of the data.table internal values for group number and number of rows in each group:
library(data.table)
res<-setDT(mydf)[, list(number_rows=.N,flow=.GRP),
by=.(uniqueid,cumsum(as.numeric(difftime(time,shift(time,1L,type="lag",fill=0))) - 15))][,cumsum:=NULL]
print(res)
uniqueid number_rows flow
1: a 4 1
2: a 2 2
3: b 2 3
4: b 1 4
Also, since the reproduction code you posted didn't quite align with the subset you showed (time needs to be POSIXct), I have included my data below:
Data
time<-as.POSIXct(c("2014-04-30 23:30:00","2014-04-30 23:45:00","2014-05-01 00:00:00","2014-05-01 00:15:00",
"2014-05-12 13:45:00","2014-05-12 14:00:00","2014-05-12 13:45:00","2014-05-12 14:00:00",
"2014-05-12 14:30:00"))
uniqueid<-c("a","a","a","a","a","a","b","b","b")
mydf<-data.frame(uniqueid,time)
You can group by uniqueid and the cumulative sum over whether the time difference between consecutive rows is not equal to 15 minutes; that gives the flow id, and a count of rows then gives what you need.
The justification of the logic: whenever the time difference within a uniqueid is not 15, a new flow should start, so we label those rows TRUE; combined with cumsum, this becomes a new flow id for the following consecutive rows:
library(dplyr)
# convert the time column to POSIXct class so that diff() operates on real times
mydf$time <- as.POSIXct(mydf$time, format = "%Y-%m-%d %H:%M:%S")
mydf %>% group_by(uniqueid, flow = 1 + cumsum(c(F, diff(time) != 15))) %>%
summarize(num_rows = n())
# Source: local data frame [4 x 3]
# Groups: uniqueid [?]
#
# uniqueid flow num_rows
# <fctr> <dbl> <int>
# 1 a 1 4
# 2 a 2 2
# 3 b 3 2
# 4 b 4 1
Base R is pretty fast. Using crude benchmarking, I found it finished in half the time of DT, and I got tired of waiting for dplyr.
# estimated size of data, years x days x hours x 15mins x uniqueids
5*365*24*4*1000 # = approx 180M
# make data with posixct and characters of 180M rows, mydf is approx 2.5GB in memory
time<-rep(as.POSIXct(c("2014-04-30 23:30:00","2014-04-30 23:45:00","2014-05-01 00:00:00","2014-05-01 00:15:00",
"2014-05-12 13:45:00","2014-05-12 14:00:00","2014-05-12 13:45:00","2014-05-12 14:00:00",
"2014-05-12 14:30:00")),times = 20000000)
uniqueid<-rep(as.character(c("a","a","a","a","a","a","b","b","b")),times = 20000000)
mydf<-data.frame(uniqueid,time = time)
rm(time,uniqueid);gc()
Base R:
# assumes that uniqueid's are in groups and in order, and there won't be a followed by b that have the 15 minute "flow"
starttime <- Sys.time()
# find failed flows
mydf$diff <- c(0,diff(mydf$time))
mydf$flowstop <- mydf$diff != 15
# give each flow an id
mydf$flowid <- cumsum(mydf$flowstop)
# clean up vars
mydf$time <- mydf$diff <- mydf$flowstop <- NULL
# find flow length
mydfrle <- rle(mydf$flowid)
# get uniqueid/flowid pairs (unique() is too slow)
mydf <- mydf[!duplicated(mydf$flowid), ]
# append rle and remove separate var
mydf$number_rows <- mydfrle$lengths
rm(mydfrle)
print(Sys.time()-starttime)
# Time difference of 30.39437 secs
data.table:
library(data.table)
starttime <- Sys.time()
res<-setDT(mydf)[, list(number_rows=.N,flow=.GRP),
by=.(uniqueid,cumsum(as.numeric(difftime(time,shift(time,1L,type="lag",fill=0))) - 15))][,cumsum:=NULL]
print(Sys.time()-starttime)
# Time difference of 57.08156 secs
dplyr:
library(dplyr)
# convert the time column to POSIXct class so that we can apply the diff function correctly
starttime <- Sys.time()
mydf %>% group_by(uniqueid, flow = 1 + cumsum(c(F, diff(time) != 15))) %>%
summarize(num_rows = n())
print(Sys.time()-starttime)
# too long, did not finish after a few minutes
I think the assumption of uniqueids and times being in order is a big one, and the other solutions might be able to take advantage of it better. order() is easy enough to use.
I'm not sure about the impact of memory, or of the impact of different data sets that aren't so simple. It should be easy enough to break it into chunks and process if memory is an issue. It takes more code in Base R for sure.
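For completeness, the ordering that the base approach assumes can be established up front with order():
# sort by id, then time, so each flow is a contiguous block of rows
mydf <- mydf[order(mydf$uniqueid, mydf$time), ]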
Having both the "id" and "time" columns ordered, we can build a single grouping to operate on by creating a logical vector that is TRUE wherever either "id" changes or the gap in "time" is > 15 minutes.
With:
id = as.character(mydf$uniqueid)
tm = mydf$time
find where "id":
id_gr = c(TRUE, id[-1] != id[-length(id)])
and "time":
tm_gr = c(0, difftime(tm[-1], tm[-length(tm)], units = "mins")) > 15
change and combine them in:
gr = id_gr | tm_gr
which shows wherever either "id" changed or "time" > 15.
And to get the result:
tab = tabulate(cumsum(gr)) ## basically, the only operation per group -- 'n by group'
data.frame(id = id[gr], flow = seq_along(tab), n = tab)
# id flow n
#1 a 1 4
#2 a 2 2
#3 b 3 2
#4 b 4 1
On a larger scale:
set.seed(1821); nid = 1e4
dat = replicate(nid, as.POSIXct("2016-07-07 12:00:00 EEST") +
cumsum(sample(c(1, 5, 10, 15, 20, 30, 45, 60, 90, 120, 150, 200, 250, 300), sample(5e2:1e3, 1), TRUE)*60),
simplify = FALSE)
names(dat) = make.unique(rep_len(letters, nid))
dat = data.frame(id = rep(names(dat), lengths(dat)), time = do.call(c, dat))
system.time({
id = as.character(dat$id); tm = dat$time
id_gr = c(TRUE, id[-1] != id[-length(id)])
tm_gr = c(0, difftime(tm[-1], tm[-length(tm)], units = "mins")) > 15
gr = id_gr | tm_gr
tab = tabulate(cumsum(gr))
ans1 = data.frame(id = id[gr], flow = seq_along(tab), n = tab)
})
# user system elapsed
# 1.44 0.19 1.66
For comparison, I included MikeyMike's answer:
library(data.table)
dat2 = copy(dat)
system.time({
ans2 = setDT(dat2)[, list(flow = .GRP, n = .N),
by = .(id, cumsum(as.numeric(difftime(time,
shift(time, 1L, type = "lag", fill = 0),
units = "mins")) > 15))][, cumsum := NULL]
})
# user system elapsed
# 3.95 0.22 4.26
identical(as.data.table(ans1), ans2)
#[1] TRUE