I have a data frame with start and stop times for an experiment and I want to calculate the duration of each experiment (one line per experiment). Data frame:
start_t stop_t
7:35 7:48
23:50 00:15
11:22 12:06
I created a function to convert the times to POSIXct and calculate the duration, testing whether start and stop cross midnight:
TimeDiff <- function(t1,t2) {
if (as.numeric(as.POSIXct(paste("2016-01-01", t1))) > as.numeric(as.POSIXct(paste("2016-01-01", t2)))) {
t1n <- as.numeric(as.POSIXct(paste("2016-01-01", t1)))
t2n <- as.numeric(as.POSIXct(paste("2016-01-02", t2)))
}
if (as.numeric(as.POSIXct(paste("2016-01-01", t1))) < as.numeric(as.POSIXct(paste("2016-01-01", t2)))) {
t1n <- as.numeric(as.POSIXct(paste("2016-01-01", t1)))
t2n <- as.numeric(as.POSIXct(paste("2016-01-01", t2)))
}
#calculate time-difference in seconds
t2n - t1n
}
Then I wanted to apply this function to my data frame using either the 'mutate' function in 'dplyr' or an 'apply' function, e.g.:
mutate(df, dur = TimeDiff(start_t, stop_t))
But the result is that the 'dur' column is filled with the same value in every row. I ended up using a clunky for-loop to apply my function to the data frame, but would like a more elegant solution. Help wanted!
The day can be incremented when the time stamp passes midnight. I am not sure that is necessary if you only want to test whether start and stop cross midnight; a negative difference already flags those rows. Hope this helps!
df = data.frame(start_t = c("7:35", "23:50","11:22"), stop_t=c("7:48", "00:15", "12:06"), stringsAsFactors = F)
myfun = function(tvec1, tvec2, units_args="secs") {
tvec1_t = as.POSIXct(paste("2016-01-01", tvec1))
tvec2_t = as.POSIXct(paste("2016-01-01", tvec2))
time_diff = difftime(tvec2_t, tvec1_t, units = units_args)
return( time_diff )
}
# append new columns (base R)
df$time_diff = myfun(df$start_t, df$stop_t)
df$cross = ifelse(df$time_diff < 0, 1, 0)
output:
start_t stop_t time_diff cross
1 7:35 7:48 780 secs 0
2 23:50 00:15 -84900 secs 1
3 11:22 12:06 2640 secs 0
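The negative value in row 2 is the midnight crossing. One way to turn it into a real duration, sketched here under the assumption that no experiment runs longer than 24 hours, is to take the difference modulo one day. The helper name `dur_secs` and the fixed UTC time zone are illustrative choices, not part of the answer above.

```r
df <- data.frame(start_t = c("7:35", "23:50", "11:22"),
                 stop_t  = c("7:48", "00:15", "12:06"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: duration in seconds between two clock times.
# Assumption: a stop time earlier than the start time means "next day".
dur_secs <- function(t1, t2) {
  s1 <- as.numeric(as.POSIXct(paste("2016-01-01", t1), tz = "UTC"))
  s2 <- as.numeric(as.POSIXct(paste("2016-01-01", t2), tz = "UTC"))
  # %% 86400 wraps a negative difference into the next day
  (s2 - s1) %% 86400
}

df$dur <- dur_secs(df$start_t, df$stop_t)
df
```

Because every step works on whole vectors, the same helper can also be called inside `mutate(df, dur = dur_secs(start_t, stop_t))`.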
Since you don't have dates but only times, there is indeed the problem of experiments crossing midnight. Your function does not work, because it is not vectorized, i.e. it doesn't compute the difference for each element on its own.
The following works but is still not perfectly elegant:
If the start happened before the end, we simply subtract to get the duration.
If we cross midnight (the heuristic for this is not very stable), we calculate the difference until midnight and add the duration on the next day.
library(tidyverse)
diff_time <- function(start, end) {
case_when(start < end ~ end - start,
start > end ~ parse_time("23:59") - start + end + parse_time("0:01")
)
}
df %>%
mutate_all(parse_time) %>%
mutate(duration = diff_time(start_t, stop_t))
#> start_t stop_t duration
#> 1 07:35:00 07:48:00 780 secs
#> 2 23:50:00 00:15:00 1500 secs
#> 3 11:22:00 12:06:00 2640 secs
If you had dates, you could simply do:
df %>%
mutate(duration = stop_t - start_t)
Data
df <- read.table(text = "start_t stop_t
7:35 7:48
23:50 00:15
11:22 12:06", header = T)
The simplest way I can think of involves lubridate:
library(lubridate)
library(dplyr)
#make a fake df
df <- data.frame(start = c('7:35', '23:50', '11:22'), stop = c('7:48', '00:15', '12:06'), stringsAsFactors = FALSE)
#convert to lubridate minutes/seconds format, then subtract
df %>%
mutate(start = ms(start), stop = ms(stop)) %>%
mutate(dur= stop - start)
Output:
start stop dur
1 7M 35S 7M 48S 13S
2 23M 50S 15S -23M -35S
3 11M 22S 12M 6S 1M -16S
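Two things go wrong here. First, ms() reads the strings as minutes:seconds, so 7:35 becomes 7 minutes 35 seconds; hm() is the parser for hours:minutes. Second, even with the right parser the second row comes out negative (about minus 23 hours 35 minutes), because lubridate has to assume that all of these times are on the same day. You should probably add the date: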
library(lubridate)
library(dplyr)
#make a fake df
df <- data.frame(start = c('2017/10/08 7:35', '2017/10/08 23:50', '2017/10/08 11:22'), stop = c('2017/10/08 7:48', '2017/10/09 00:15', '2017/10/08 12:06'), stringsAsFactors = FALSE)
#parse as date-times (year/month/day hour:minute), then subtract
df %>%
mutate(start = ymd_hm(start), stop = ymd_hm(stop)) %>%
mutate(dur= stop - start)
Output:
start stop dur
1 2017-10-08 07:35:00 2017-10-08 07:48:00 13 mins
2 2017-10-08 23:50:00 2017-10-09 00:15:00 25 mins
3 2017-10-08 11:22:00 2017-10-08 12:06:00 44 mins
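If the dates really are unavailable, a rough workaround is to parse the clock times with `hm()`, convert them to seconds with `period_to_seconds()`, and wrap negative differences into the next day. This is only a sketch: it assumes every experiment lasts less than 24 hours, and the column name `dur_mins` is made up for illustration.

```r
library(lubridate)

df <- data.frame(start = c("7:35", "23:50", "11:22"),
                 stop  = c("7:48", "00:15", "12:06"),
                 stringsAsFactors = FALSE)

# hm() parses "hours:minutes"; period_to_seconds() turns a Period into seconds
start_s <- period_to_seconds(hm(df$start))
stop_s  <- period_to_seconds(hm(df$stop))

# Assumption: a stop time earlier than the start time means "next day"
df$dur_mins <- ((stop_s - start_s) %% 86400) / 60
df
```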
Related
I have a large data frame with one column of times and a second column of speed measurements (km/h). Here is a short example of the data:
df <- data.frame(time = as.POSIXct(c("2019-04-01 13:55:18", "2019-04-01 14:03:18",
"2019-04-01 14:14:18", "2019-04-01 14:26:55",
"2019-04-01 14:46:55", "2019-04-01 15:01:55")),
speed = c(4.5, 6, 3.2, 5, 4, 2))
Is there any way to build a new data frame that gives the distance driven in every 20-minute interval from 2019-04-01 14:00:00 to 2019-04-01 15:00:00, assuming that the speed changes linearly between measurements? I was trying to find solutions with integrals, but was not sure if that is the correct way to do it. Thanks for the help!
Here is a solution using a combination of zoo::na.approx and dplyr functions.
library(zoo)
library(dplyr)
seq = data.frame(time = seq(min(df$time),max(df$time), by = 'secs'))
df <- merge(seq,df,all.x=T)
df$speed <- na.approx(df$speed)
df %>%
filter(time >= "2019-04-01 14:00:00" & time < "2019-04-01 15:00:00") %>%
mutate(km = speed/3600) %>%
group_by(group = cut(time, breaks = "20 min")) %>%
summarise(distance = sum(km))
Which gives:
# A tibble: 3 x 2
group distance
<fct> <dbl>
1 2019-04-01 14:00:00 1.50
2 2019-04-01 14:20:00 1.54
3 2019-04-01 14:40:00 1.16
Explanation:
The first step is to create a sequence of time frames to compute the speed between two times points (seq). The sequence is then merged with the data frame and NAs are filled using na.approx.
Then, using dplyr verbs, the data frame is filtered, and the 20 minutes sequences are created using cut. The final distance is the sum of every 1-sec distance in the 20 minutes time frame.
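Since the speed is assumed to change linearly, the same distances can also be obtained without a per-second grid by applying the trapezoid rule between the measurement times. The following is a base R sketch of that idea, not the method of the answer above; `window_km` is a hypothetical helper and the UTC time zone is an assumption made to keep the example self-contained.

```r
df <- data.frame(
  time = as.POSIXct(c("2019-04-01 13:55:18", "2019-04-01 14:03:18",
                      "2019-04-01 14:14:18", "2019-04-01 14:26:55",
                      "2019-04-01 14:46:55", "2019-04-01 15:01:55"), tz = "UTC"),
  speed = c(4.5, 6, 3.2, 5, 4, 2)
)
t_num <- as.numeric(df$time)  # seconds since the epoch

# Edges of the 20-minute windows from 14:00 to 15:00
edges <- as.numeric(seq(as.POSIXct("2019-04-01 14:00:00", tz = "UTC"),
                        by = "20 min", length.out = 4))

# Hypothetical helper: trapezoid rule over one window, result in km
window_km <- function(lo, hi) {
  # knots: the window edges plus any measurement times strictly inside
  x <- sort(unique(c(lo, hi, t_num[t_num > lo & t_num < hi])))
  # linearly interpolated speed (km/h) at each knot
  y <- approx(t_num, df$speed, xout = x)$y
  # area under the curve in km/h * s, divided by 3600 to get km
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2) / 3600
}

distance <- mapply(window_km, head(edges, -1), tail(edges, -1))
round(distance, 2)
```

The results should come out close to the per-second sums shown above, since both approaches integrate the same piecewise-linear speed curve.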
I would like to find the mid time between two times in column SLQ300 (sleep time) and SLQ310 (wake up time) in my data frame of 6000 participants.
Added: I have already the duration in column SLD012. So if we could add half of the duration to the sleep time, it would be great.
Eg. in the first row it should be the midpoint between 23:00 and 07:00, which is 03:00.
And in the 9th row, it should be between 01:00 and 06:00, which is 03:30.
Thank you in advance!
Try this:
time2num <- function(x) {
vapply(strsplit(x, ':'), function(y) sum(as.numeric(y) * c(60, 1)),
numeric(1), USE.NAMES=FALSE)
}
# sample data
dat <- data.frame(id=1:2, tm1=c("23:00","01:00"), tm2=c("07:00","06:00"))
# the code
dat[,c("tm1n","tm2n")] <- lapply(dat[,c("tm1","tm2")], time2num)
dat
# id tm1 tm2 tm1n tm2n
# 1 1 23:00 07:00 1380 420
# 2 2 01:00 06:00 60 360
with(dat, ifelse(tm1n > tm2n, 24*60, 0) + tm2n - tm1n)
# [1] 480 300 ### minutes
Or you can use modulus:
with(dat, tm2n - tm1n) %% (24*60)
(though I haven't tested it in all sorts of combinations).
And the mid time:
format(as.POSIXct(paste(Sys.Date(), dat$tm1)) +
60*with(dat, ifelse(tm1n > tm2n, 24*60, 0) + tm2n - tm1n)/2,
format="%H:%M")
# [1] "03:00" "03:30"
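The midpoint can also be computed purely in minutes after midnight, which avoids attaching today's date to the times. A minimal sketch, assuming sleep never lasts 24 hours or more; the names `to_minutes`, `sleep` and `wake` are illustrative stand-ins for the helper and columns above.

```r
# Hypothetical helper: "HH:MM" -> minutes after midnight
to_minutes <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(60, 1)), numeric(1))
}

sleep <- c("23:00", "01:00")
wake  <- c("07:00", "06:00")

s   <- to_minutes(sleep)
# duration in minutes, wrapped past midnight when wake < sleep
dur <- (to_minutes(wake) - s) %% 1440
# start plus half the duration, wrapped back into a single day
mid <- floor(s + dur / 2) %% 1440

mid_time <- sprintf("%02d:%02d", as.integer(mid %/% 60), as.integer(mid %% 60))
mid_time
```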
I have measured temperature at a high time resolution of 10 minutes on different urban tree species, whose reactions should be compared. I am therefore especially interested in periods of heat. The task I fail to do on my dataset is to select complete days based on a maximum value, e.g. days with at least one measurement above 30 °C should be subsetted from my data frame in full.
Below you find a reproducible example that should illustrate my problem:
In my Measurings data frame I have calculated a column indicating whether each individual measurement is above or below 30 °C. I wanted to use that column to tell other functions whether they should pick a day or not when producing a new data frame. If the value is above 30 °C at any time of the day, I want to include that whole date, from 00:00 to 23:59, in the new data frame for further analyses.
start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
tseq <- seq(from = start, length.out = 1000, by = "hours")
Measurings <- data.frame(
Time = tseq,
Temp = sample(20:35,1000, replace = TRUE),
Variable1 = sample(1:200,1000, replace = TRUE),
Variable2 = sample(300:800,1000, replace = TRUE)
)
Measurings$heat30 <- ifelse(Measurings$Temp > 30,"heat", "normal")
Measurings$otheroption30 <- ifelse(Measurings$Temp > 30,"1", "0")
The example is yielding a Dataframe analog to the structure of my Data:
head(Measurings)
Time Temp Variable1 Variable2 heat30 otheroption30
1 2018-05-18 00:00:00 28 56 377 normal 0
2 2018-05-18 01:00:00 23 65 408 normal 0
3 2018-05-18 02:00:00 29 78 324 normal 0
4 2018-05-18 03:00:00 24 157 432 normal 0
5 2018-05-18 04:00:00 32 129 794 heat 1
6 2018-05-18 05:00:00 25 27 574 normal 0
So how do I subset to get a new data frame containing all the days on which at least one entry is marked as "heat"?
I know that, for example, dplyr::filter could filter the individual entries (row 5 in the head of the example). But how could I tell it to take the whole day 2018-05-18?
I am quite new to analyzing data with R, so I would appreciate any suggestions on a working solution to my problem. dplyr is what I have been using for quite a few tasks, but I am open to whatever works.
Thanks a lot, Konrad
Create a variable that specifies the day (dropping hours, minutes etc.). Iterate over the unique dates and keep only those subsets whose heat30 column contains "heat" at least once:
Measurings <- Measurings %>% mutate(Time2 = format(Time, "%Y-%m-%d"))
res <- NULL
newdf <- lapply(unique(Measurings$Time2), function(x){
ss <- Measurings %>% filter(Time2 == x) %>% select(heat30) %>% pull(heat30) # take heat30 vector
rr <- Measurings %>% filter(Time2 == x) # select date x
# check if heat30 vector contains heat value at least once, if so bind that subset
if(any(ss == "heat")){
res <- rbind(res, rr)
}
return(res)
}) %>% bind_rows()
Below is one possible solution using the dataset provided in the question. Please note that this is not a great example as all days will probably include at least one observation marked as over 30 °C (i.e. there will be no days to filter out in this dataset but the code should do the job with the actual one).
# import packages
library(dplyr)
library(stringr)
# break the time stamp into Day and Hour
time_df <- as.data.frame(str_split(Measurings$Time, " ", simplify = T), stringsAsFactors = FALSE)
# name the columns
names(time_df) <- c("Day", "Hour")
# create a new measurement data frame with separate Day and Hour columns
new_measurings_df <- bind_cols(time_df, Measurings[-1])
# form the new data frame by filtering the days marked as heat
new_df <- new_measurings_df %>%
filter(Day %in% new_measurings_df$Day[new_measurings_df$heat30 == "heat"])
To be more precise, you are creating a random sample of 1000 hourly temperature observations between 20 and 35, which spans about 42 days. As a result, it is very likely that every single day will have at least one observation marked as over 30 °C in your example. Additionally, it is always good practice to set a seed with set.seed() to ensure reproducibility.
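A more compact way to keep whole days is a grouped filter: group by calendar day and keep the groups in which any temperature exceeds the threshold. The sketch below uses a small hand-made data set (an assumption, chosen so that one day is actually dropped) instead of the random sample, and a fixed UTC time zone.

```r
library(dplyr)

start <- as.POSIXct("2018-05-18 00:00", tz = "UTC")

# Toy data: three days with three readings each; only day 2 stays below 30
Measurings <- data.frame(
  Time = seq(from = start, length.out = 9, by = "8 hours"),
  Temp = c(28, 31, 25,   22, 24, 29,   27, 26, 33)
)

hot_days <- Measurings %>%
  group_by(Day = as.Date(Time, tz = "UTC")) %>%  # calendar day of each reading
  filter(any(Temp > 30)) %>%                     # keep the whole day if any reading is hot
  ungroup()

hot_days
```

With the real data, the same `group_by()` plus `filter(any(...))` pattern could use the existing `heat30 == "heat"` column as the condition.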
I have a .csv file with the following data:
duration time | starting time | finish time
1,#1996-06-18 23:25:00#,#1996-06-18 23:26:00#
23,#1996-06-18 23:28:00#,#1996-06-18 23:51:00#
1,#1996-06-18 23:59:00#,#1996-06-19#
1,#1996-06-18 23:24:00#,#1996-06-18 23:25:00#
8,#1996-06-18 23:51:00#,#1996-06-18 23:59:00#
3,#1996-06-19#,#1996-06-19 00:03:00#
12,#1996-06-19 00:12:00#,#1996-06-19 00:24:00#
3,#1996-06-18 23:03:00#,#1996-06-18 23:06:00#
The lines with a bare #1996-06-19# have incomplete elements. My question is how I can complete those elements (start time and finish time) in R using the duration and the other element, i.e. add the duration to the start time to obtain the finish time (assuming that this data is in a data frame).
Basically, how do I add 00:00:00 to these rows (in an R data frame)?
You can use the lubridate package to convert the duration column to time objects. Using the dplyr package you can manipulate your data.frame as follows:
library(dplyr)
library(magrittr)
library(lubridate)
df <- data.frame(duration = c(1, 23, 1, 1, 8, 3),
start_time = c("1996-06-18 23:25:00",
"1996-06-18 23:28:00",
"1996-06-18 23:59:00",
"1996-06-18 23:24:00",
"1996-06-18 23:51:00",
"1996-06-19"),
end_time = c("1996-06-18 23:26:00",
"1996-06-18 23:51:00",
"1996-06-19",
"1996-06-18 23:25:00",
"1996-06-18 23:59:00",
"1996-06-19 00:03:00"))
df1 <- df %>%
mutate(start_time = as.POSIXct(start_time, format = "%Y-%m-%d %H:%M:%S"), duration = minutes(duration),
end_time = as.POSIXct(end_time, format = "%Y-%m-%d %H:%M:%S")) %>%
mutate(start_time = ifelse(is.na(start_time),
(end_time - duration),
start_time),
end_time = ifelse(is.na(end_time),
(start_time + duration),
end_time))
df1 %<>%
mutate(start_time = as.POSIXct(start_time, origin = "1970-01-01"),
end_time = as.POSIXct(end_time, origin = "1970-01-01"))
and then your output would look like this:
> df1
duration start_time end_time
1 1M 0S 1996-06-18 23:25:00 1996-06-18 23:26:00
2 23M 0S 1996-06-18 23:28:00 1996-06-18 23:51:00
3 1M 0S 1996-06-18 23:59:00 1996-06-19 00:00:00
4 1M 0S 1996-06-18 23:24:00 1996-06-18 23:25:00
5 8M 0S 1996-06-18 23:51:00 1996-06-18 23:59:00
6 3M 0S 1996-06-19 00:00:00 1996-06-19 00:03:00
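For the narrower question of just adding 00:00:00 to the incomplete entries, a simpler route is to pad the strings before parsing. This is a minimal base R sketch, assuming that a bare date is always exactly 10 characters long (`YYYY-MM-DD`) and that the `#` delimiters have already been stripped; the variable names are illustrative.

```r
times <- c("1996-06-18 23:59:00", "1996-06-19")

# Assumption: a 10-character entry is a date with the midnight time missing
padded <- ifelse(nchar(times) == 10, paste(times, "00:00:00"), times)

parsed <- as.POSIXct(padded, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
format(parsed, "%Y-%m-%d %H:%M:%S")

# The completed times are consistent with the 1-minute duration of that row
difftime(parsed[2], parsed[1], units = "mins")
```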
I have a data frame with hour stamp and corresponding temperature measured. The measurements are taken at random intervals over time continuously. I would like to convert the hours to respective date-time and temperature measured. My data frame looks like this: (The measurement started at 20/05/2016)
Time, Temp
09.25,28
10.35,28.2
18.25,29
23.50,30
01.10,31
12.00,36
02.00,25
I would like to create a data.frame with respective date-time and Temp like below:
Time, Temp
2016-05-20 09:25,28
2016-05-20 10:35,28.2
2016-05-20 18:25,29
2016-05-20 23:50,30
2016-05-21 01:10,31
2016-05-21 12:00,36
2016-05-22 02:00,25
I am thankful for any comments and tips on the packages or functions in R, I can have a look to do this. Thanks for your time.
A possible solution in base R:
df$Time <- as.POSIXct(strptime(paste('2016-05-20', sprintf('%05.2f',df$Time)), format = '%Y-%m-%d %H.%M', tz = 'GMT'))
df$Time <- df$Time + cumsum(c(0,diff(df$Time)) < 0) * 86400 # 86400 = 60 * 60 * 24
which gives:
> df
Time Temp
1 2016-05-20 09:25:00 28.0
2 2016-05-20 10:35:00 28.2
3 2016-05-20 18:25:00 29.0
4 2016-05-20 23:50:00 30.0
5 2016-05-21 01:10:00 31.0
6 2016-05-21 12:00:00 36.0
7 2016-05-22 02:00:00 25.0
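To see why the `cumsum` line works, it can help to look at the intermediate vectors on the raw clock values. The sketch below uses the `Time` column as plain numbers; it relies on the same assumption as the solution, namely that consecutive measurements are less than 24 hours apart, and the variable names are illustrative.

```r
time_num <- c(9.25, 10.35, 18.25, 23.50, 1.10, 12.00, 2.00)

# TRUE wherever the clock value drops, i.e. midnight was passed
wrapped <- c(0, diff(time_num)) < 0
wrapped
# FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE

# Running count of midnight crossings = number of days to add to each row
day_offset <- cumsum(wrapped)
day_offset
# 0 0 0 0 1 1 2
```

Multiplying `day_offset` by 86400 gives the number of seconds to add to each parsed time.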
An alternative with data.table, using shift to detect each drop in the time and cumsum to count the days passed:
library(data.table)
setDT(df)[, Time := as.POSIXct(strptime(paste('2016-05-20', sprintf('%05.2f',Time)), format = '%Y-%m-%d %H.%M', tz = 'GMT')) +
cumsum(Time < shift(Time, fill = Time[1])) * 86400]
Or with dplyr:
library(dplyr)
df %>%
mutate(Time = as.POSIXct(strptime(paste('2016-05-20',
sprintf('%05.2f',Time)),
format = '%Y-%m-%d %H.%M', tz = 'GMT')) +
cumsum(c(0,diff(Time)) < 0)*86400)
which will both give the same result.
Used data:
df <- read.table(text='Time, Temp
09.25,28
10.35,28.2
18.25,29
23.50,30
01.10,31
12.00,36
02.00,25', header=TRUE, sep=',')
You can use a custom date format combined with some code that detects when a new day begins (assuming the first measurement takes place earlier in the day than the last measurement of the previous day).
# starting day
start_date = "2016-05-20"
values=read.csv('values.txt', colClasses=c("character",NA))
last=c("0",head(values$Time,-1)) # previous row's time; "0" for the first row
day=cumsum(values$Time<last)
Time = strptime(paste(start_date,values$Time), "%Y-%m-%d %H.%M")
Time = Time + day*86400
values$Time = Time