I have been looking around, but I still can't find a way to subset my data frame by time. Here is the sample data:
Duration End Date Start Date
228 2013-01-03 09:10:00 2013-01-03 09:06:00
1675 2013-01-04 17:34:00 2013-01-04 17:06:00
393 2013-01-04 17:54:00 2013-01-04 17:48:00
426 2013-01-04 11:10:00 2013-01-04 11:03:00
827 2013-01-01 16:13:00 2013-01-01 15:59:00
780 2013-01-01 16:13:00 2013-01-01 16:00:00
The End Date and Start Date are in POSIXct format, and here is what I have tried if I only want times between 8:00 and 9:30.
tm1 <- as.POSIXct("08:00", format = "%H:%M")
tm2 <- as.POSIXct("09:30", format = "%H:%M")
df.time <- with(df, df[format('Start Date', '%H:%M')>= tm1 & format('End Date', '%H:%M')< tm2, ])
but this returns an error. I have also tried this, but it didn't work either.
df.time <- subset(df, format('Start Date', '%H:%M') >= '8:00' & format('End Date', '%H:%M') < '9:30'))
Can anybody tell me what I am doing wrong? Thanks
Assuming that the start and end dates are always the same and only the times differ, and that you want those rows for which the time starts at or after 8:00 and ends before 9:30, convert the date/time values to character strings of the form HH:MM and compare:
subset(DF, format(`Start Date`, "%H:%M") >= "08:00" &
format(`End Date`, "%H:%M") < "09:30")
giving:
Duration End Date Start Date
1 228 2013-01-03 09:10:00 2013-01-03 09:06:00
Note: We used the following for DF. (Next time please use dput to provide your data in reproducible form.)
DF <- structure(list(Duration = c(228L, 1675L, 393L, 426L, 827L, 780L
), `End Date` = structure(c(1357222200, 1357338840, 1357340040,
1357315800, 1357074780, 1357074780), class = c("POSIXct", "POSIXt"
), tzone = ""), `Start Date` = structure(c(1357221960, 1357337160,
1357339680, 1357315380, 1357073940, 1357074000), class = c("POSIXct",
"POSIXt"), tzone = "")), .Names = c("Duration", "End Date", "Start Date"
), row.names = c(NA, -6L), class = "data.frame")
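The string comparison above works because "%H:%M" always yields zero-padded strings, so they sort in clock order; an unpadded bound such as '8:00' would not compare correctly against "09:06". As an alternative, here is a minimal sketch that compares minutes since midnight as numbers instead. The helper minutes_of_day and the stand-in data frame DF2 are hypothetical names, and the times are assumed to be in UTC so that the result does not depend on the local time zone.

```r
# Hypothetical helper: minutes since midnight for a POSIXct vector.
minutes_of_day <- function(x, tz = "UTC") {
  lt <- as.POSIXlt(x, tz = tz)
  lt$hour * 60 + lt$min
}

# Small stand-in data set (assumed to be in UTC).
DF2 <- data.frame(Duration = c(228L, 1675L), check.names = FALSE)
DF2$`Start Date` <- as.POSIXct(c("2013-01-03 09:06:00", "2013-01-04 17:06:00"), tz = "UTC")
DF2$`End Date`   <- as.POSIXct(c("2013-01-03 09:10:00", "2013-01-04 17:34:00"), tz = "UTC")

# Keep rows that start at or after 08:00 and end before 09:30.
res <- DF2[minutes_of_day(DF2$`Start Date`) >= 8 * 60 &
           minutes_of_day(DF2$`End Date`) < 9 * 60 + 30, ]
res
```

Numeric bounds avoid any dependence on string padding, at the cost of one small helper.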
I have a dataset, df1, and I would like to convert all the values from the 24-hour clock to the 12-hour (AM/PM) clock.
Date Name
1/2/2020 16:46 A
1/2/2020 16:51 B
I would like:
Date Name
1/2/2020 4:46:47 PM A
1/2/2020 4:51:44 PM B
I have tried:
df$Date<- format(df$Date, "%m/%d/%Y %I:%M:%S %p")
dput:
structure(list(Date = structure(1:2, .Label = c("1/2/2020 16:46",
"1/2/2020 16:51"), class = "factor"), Name = structure(1:2, .Label = c("A",
"B"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
You can first convert the data to POSIXct format and then use format to get data in the required format.
df$Date <- format(as.POSIXct(df$Date, format = "%m/%d/%Y %H:%M"),
"%m/%d/%Y %I:%M:%S %p")
#Can also use mdy_hm from lubridate
#df$Date <- format(lubridate::mdy_hm(df$Date), "%m/%d/%Y %I:%M:%S %p")
df
# Date Name
#1 01/02/2020 04:46:00 PM A
#2 01/02/2020 04:51:00 PM B
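To see how the format codes in this approach treat the awkward hours, here is a small sketch with made-up input strings covering midnight, noon, and an afternoon time. The vector x and the choice of tz = "UTC" are assumptions for illustration. Note that the text produced by %p depends on the locale, so only the hour part is guaranteed to look the same everywhere.

```r
# Made-up inputs chosen to hit the edge cases: midnight, noon, afternoon.
x <- c("1/2/2020 00:05", "1/2/2020 12:30", "1/2/2020 16:46")

# Parse as 24-hour times; UTC is assumed to keep the example deterministic.
parsed <- as.POSIXct(x, format = "%m/%d/%Y %H:%M", tz = "UTC")

hour12 <- format(parsed, "%I")  # zero-padded hour on the 12-hour clock
ampm   <- format(parsed, "%p")  # AM/PM marker (text depends on the locale)
out    <- format(parsed, "%m/%d/%Y %I:%M:%S %p")
out
```

Both 00:05 and 12:30 come out with hour "12", which is what a 12-hour clock expects.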
Assuming you want to actually convert a string in one format to a string in another format rather than having it as a (more useful) actual date/time, you can use a little arithmetic and string chopping along with mapply:
splits <- strsplit(as.character(df$Date), " |:")
Hours <- as.numeric(sapply(splits, `[`, 2))
AMPM <- c(" AM", " PM")[Hours %/% 12 + 1]
Hours <- (Hours - 1) %% 12 + 1  # 0 -> 12 (midnight), 13..23 -> 1..11, 1..12 unchanged
df$Date <- mapply(function(x, y, z) paste0(x[1], " ", y, ":", x[3], z), splits, Hours, AMPM)
df
#> Date Name
#> 1 1/2/2020 4:46 PM A
#> 2 1/2/2020 4:51 PM B
Created on 2020-02-26 by the reprex package (v0.3.0)
Under the same assumptions as the previous answer by Allan, here is another way to convert from the 24-hour to the 12-hour clock.
library(tidyverse)
library(lubridate)
df <- tibble(
  date = ymd_hms(c("2020/01/02 16:46:00", "2020/01/02 16:51:00"), tz = "UTC"),
  name = c("A", "B")
)
# Assign the result back to df so that the new column is kept
df <- df %>%
  mutate(date_hour = hour(date),
         am_pm = if_else(date_hour >= 12, "PM", "AM"),  # noon counts as PM
         date_hour = (date_hour - 1) %% 12 + 1,         # 0 -> 12, 13 -> 1
         newdatetime = paste0(date(date), " ", date_hour, ":",
                              sprintf("%02d", minute(date)), " ", am_pm)) %>%  # zero-pad minutes
  select(-c(date_hour, am_pm))
df
# A tibble: 2 x 3
date name newdatetime
<dttm> <chr> <chr>
1 2020-01-02 16:46:00 A 2020-01-02 4:46 PM
2 2020-01-02 16:51:00 B 2020-01-02 4:51 PM
Hope this helps!
I have two columns
starttime endtime
2019-11-05 18:02:04 2019-11-05 00:02:04
2019-08-02 20:18:00 2019-08-02 01:10:00
2019-12-07 17:28:00 2019-12-07 18:00:00
I am trying to find the difference in time between starttime and endtime
mutate(col = difftime(endtime, starttime, units = "hours"))
but I am getting negative hours, which makes no sense. I need it to be endtime - starttime, because anything else would mess things up for the data frame that I have. I believe that 0.533 is right. I got:
col
-18
-19
0.533
We can increment endtime by one day (86400 seconds) if starttime > endtime and then use difftime:
library(dplyr)
df %>%
mutate(endtime = if_else(starttime > endtime, endtime + 86400, endtime),
col = difftime(endtime,starttime,units = "hours"))
# starttime endtime col
#1 2019-11-05 18:02:04 2019-11-06 00:02:04 6.0000000 hours
#2 2019-08-02 20:18:00 2019-08-03 01:10:00 4.8666667 hours
#3 2019-12-07 17:28:00 2019-12-07 18:00:00 0.5333333 hours
data
df <- structure(list(starttime = structure(c(1572976924, 1564777080,
1575739680), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
endtime = structure(c(1572912124, 1564708200, 1575741600), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), row.names = c(NA, -3L), class = "data.frame")
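If only the duration is needed and endtime itself does not have to be corrected, the same wrap-around can be sketched with modular arithmetic in base R. This is a minimal sketch, assuming that no interval is ever longer than 24 hours; the vector names raw_hours and wrapped_hours are made up for illustration.

```r
starttime <- as.POSIXct(c("2019-11-05 18:02:04", "2019-08-02 20:18:00",
                          "2019-12-07 17:28:00"), tz = "UTC")
endtime   <- as.POSIXct(c("2019-11-05 00:02:04", "2019-08-02 01:10:00",
                          "2019-12-07 18:00:00"), tz = "UTC")

# Raw differences are negative whenever endtime falls "before" starttime.
raw_hours <- as.numeric(difftime(endtime, starttime, units = "hours"))

# %% 24 wraps negative values into [0, 24), i.e. treats endtime as next-day.
wrapped_hours <- raw_hours %% 24
wrapped_hours
```

Unlike the if_else version, this never touches the endtime column, so the stored timestamps stay as they were recorded.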
I have 3 data frames, df1 = a time interval, df2 = list of IDs, df3 = list of IDs with associated date.
df1 <- structure(list(season = structure(c(2L, 1L), .Label = c("summer",
"winter"), class = "factor"), mindate = structure(c(1420088400,
1433131200), class = c("POSIXct", "POSIXt")), maxdate = structure(c(1433131140,
1448945940), class = c("POSIXct", "POSIXt")), diff = structure(c(150.957638888889,
183.040972222222), units = "days", class = "difftime")), .Names = c("season",
"mindate", "maxdate", "diff"), row.names = c(NA, -2L), class = "data.frame")
df2 <- structure(list(ID = c(23796, 23796, 23796)), .Names = "ID", row.names = c(NA,
-3L), class = "data.frame")
df3 <- structure(list(ID = c("23796", "123456", "12134"), time = structure(c(1420909920,
1444504500, 1444504500), class = c("POSIXct", "POSIXt"), tzone = "US/Eastern")), .Names = c("ID",
"time"), row.names = c(NA, -3L), class = "data.frame")
The code should check whether df2$ID == df3$ID. If true, and if df3$time >= df1$mindate and df3$time <= df1$maxdate, then the result is df1$maxdate - df3$time, else df1$maxdate - df1$mindate. I tried using the ifelse function. This works when I manually specify specific cells, but this is not what I want, as I have many more (uneven) rows in each of the dfs.
df1$result <- ifelse(df2[1,1] == df3[1,1] & df3[1,2] >= df1$mindate & df3[1,2] <= df1$maxdate,
difftime(df1$maxdate,df3[1,2],units="days"),
difftime(df1$maxdate,df1$mindate,units="days"))
EDIT: The desired output is (when removing last row of df2):
season mindate maxdate diff result
1 winter 2015-01-01 2015-05-31 23:59:00 150.9576 days 141.9576
2 summer 2015-06-01 2015-11-30 23:59:00 183.0410 days 183.0410
Any ideas? I don't see how I could merge dfs to make them of the same length. Note that df2 can be of any row length and not affect the code. Issues arise when df1 and df3 differ in # of rows.
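The comparison operators are vectorized, so the whole test fits in one ifelse call (this assumes df3 has been trimmed to the same number of rows as df1, as in the desired output):
transform(df1, result = ifelse(df3$ID %in% df2$ID & df3$time >= mindate & df3$time <= maxdate, difftime(maxdate, df3$time, units = "days"), difftime(maxdate, mindate, units = "days")))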
season mindate maxdate diff result
1 winter 2014-12-31 21:00:00 2015-05-31 20:59:00 150.9576 days 141.9576
2 summer 2015-05-31 21:00:00 2015-11-30 20:59:00 183.0410 days 183.0410
You can also use the %between% operator from the data.table package:
library(data.table)
transform(df1,result=ifelse(df3$ID%in%df2$ID&df3$time%between%df1[2:3],
difftime(maxdate,df3$time),difftime(maxdate,mindate)))
season mindate maxdate diff result
1 winter 2014-12-31 21:00:00 2015-05-31 20:59:00 150.9576 days 141.9576
2 summer 2015-05-31 21:00:00 2015-11-30 20:59:00 183.0410 days 183.0410
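For the case where df1 and df3 have different numbers of rows, one possible base R sketch is to loop over the intervals rather than pair the rows up. The data below is a simplified UTC stand-in for the question's data, and the rule "use the earliest matching time when several fall inside an interval" is an assumption, since the question does not say what should happen then.

```r
df1 <- data.frame(
  season  = c("winter", "summer"),
  mindate = as.POSIXct(c("2015-01-01 00:00:00", "2015-06-01 00:00:00"), tz = "UTC"),
  maxdate = as.POSIXct(c("2015-05-31 23:59:00", "2015-11-30 23:59:00"), tz = "UTC")
)
df2 <- data.frame(ID = c(23796, 23796, 23796))
df3 <- data.frame(
  ID   = c("23796", "123456", "12134"),
  time = as.POSIXct(c("2015-01-10 00:00:00", "2015-10-10 12:00:00",
                      "2015-10-10 12:00:00"), tz = "UTC")
)

# Keep only the df3 times whose ID also appears in df2.
hits <- df3$time[df3$ID %in% df2$ID]

# One result per interval, independent of how many rows df3 has.
df1$result <- vapply(seq_len(nrow(df1)), function(i) {
  inside <- hits[hits >= df1$mindate[i] & hits <= df1$maxdate[i]]
  # Assumption: with several matches, measure from the earliest one.
  from <- if (length(inside) > 0) min(inside) else df1$mindate[i]
  as.numeric(difftime(df1$maxdate[i], from, units = "days"))
}, numeric(1))
df1
```

Because the loop runs over df1 only, df2 and df3 can be of any length without recycling warnings.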
I am experimenting with bike station data and have a for loop that extracts bikes that started at a different station from the one where they last stopped. It then rearranges stoptime and starttime to show the movement of the bike by the operator (from where it stopped to where it next started), along with the difftime, i.e. the difference in time between when it last ended and when it next started.
Sample data
starttime stoptime start.station.id end.station.id bikeid
1 2017-01-16 13:08:18 2017-01-16 13:28:13 3156 466 1
2 2017-01-10 19:10:31 2017-01-10 19:16:02 422 3090 1
3 2017-01-04 08:47:42 2017-01-04 08:57:10 507 442 1
4 2017-01-12 18:08:33 2017-01-12 18:36:09 546 3151 2
5 2017-01-21 09:52:13 2017-01-21 10:21:07 3243 212 2
6 2017-01-26 05:46:18 2017-01-26 05:49:13 470 168 2
My code
raw_data = test
unique_id = unique(raw_data$bikeid)
output1 <- data.frame("bikeid"= integer(0), "end.station.id"= integer(0), "start.station.id" = integer(0), "diff.time" = numeric(0), "stoptime" = character(),"starttime" = character(), stringsAsFactors=FALSE)
for (bikeid in unique_id)
{
onebike <- raw_data[ which(raw_data$bikeid== bikeid), ]
onebike$starttime <- strptime(onebike$starttime, "%Y-%m-%d %H:%M:%S", tz = "EST")
onebike <- onebike[order(onebike$starttime, decreasing = FALSE),]
if(nrow(onebike) >=2 ){
for(i in 2:nrow(onebike )) {
print(onebike)
if(is.integer(onebike[i-1,"end.station.id"]) & is.integer(onebike[i,"start.station.id"]) &
onebike[i-1,"end.station.id"] != onebike[i,"start.station.id"]){
diff_time <- as.double(difftime(strptime(onebike[i,"starttime"], "%Y-%m-%d %H:%M:%S", tz = "EST"),
strptime(onebike[i-1,"stoptime"], "%Y-%m-%d %H:%M:%S", tz = "EST")
,units = "secs"))
new_row <- c(bikeid, onebike[i-1,"end.station.id"], onebike[i,"start.station.id"], diff_time, as.character(onebike[i-1,"stoptime"]), as.character(onebike[i,"starttime"]))
output1[nrow(output1) + 1,] = new_row
}
}
}
}
Output
bikeid end.station.id start.station.id diff.time stoptime starttime
1 1 442 422 555201 2017-01-04 08:57:10 2017-01-10 19:10:31
2 1 3090 3156 496336 2017-01-10 19:16:02 2017-01-16 13:08:18
3 2 3151 3243 746164 2017-01-12 18:36:09 2017-01-21 09:52:13
4 2 212 470 415511 2017-01-21 10:21:07 2017-01-26 05:46:18
5 3 3112 351 1587161 2017-01-12 08:58:42 2017-01-30 17:51:23
However, on a large dataset this for loop takes a very long time. Is there a way to use dplyr or data.table to speed up this loop, or to rearrange the data in a way that avoids looping? I would appreciate any kind of explanation or suggestions.
Sample data (in dput)
structure(list(starttime = structure(c(1484572098, 1484075431,
1483519662, 1484244513, 1484992333, 1485409578, 1484210616, 1483727948,
1485798683), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
stoptime = structure(c(1484573293, 1484075762, 1483520230,
1484246169, 1484994067, 1485409753, 1484211522, 1483729024,
1485799997), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
start.station.id = c(3156L, 422L, 507L, 546L, 3243L, 470L,
439L, 309L, 351L), end.station.id = c(466L, 3090L, 442L,
3151L, 212L, 168L, 3112L, 439L, 433L), bikeid = c(1, 1, 1,
2, 2, 2, 3, 3, 3)), .Names = c("starttime", "stoptime", "start.station.id",
"end.station.id", "bikeid"), row.names = c(NA, -9L), class = "data.frame")
One approach would be the following. I called your data foo. Start by sorting your data by bikeid and starttime. Then, for each bikeid, create new columns (i.e., next.start.station.id and next.start.time) using lead(), and find the time difference using difftime(). After that, remove the rows that have the same id for end.station.id and next.start.station.id. Finally, arrange the columns as you wish.
library(dplyr)
foo %>%
arrange(bikeid, starttime) %>% # if necessary, arrange(bikeid, starttime, stoptime)
group_by(bikeid) %>%
mutate(next.start.station.id = lead(start.station.id),
next.start.time = lead(starttime),
diff.time = difftime(next.start.time, stoptime, units = "secs")) %>%
filter(end.station.id != next.start.station.id) %>%
select(bikeid, end.station.id, next.start.station.id, diff.time, stoptime, next.start.time)
bikeid end.station.id next.start.station.id diff.time stoptime next.start.time
<dbl> <int> <int> <time> <dttm> <dttm>
1 1.00 442 422 555201 2017-01-04 08:57:10 2017-01-10 19:10:31
2 1.00 3090 3156 496336 2017-01-10 19:16:02 2017-01-16 13:08:18
3 2.00 3151 3243 746164 2017-01-12 18:36:09 2017-01-21 09:52:13
4 2.00 212 470 415511 2017-01-21 10:21:07 2017-01-26 05:46:18
5 3.00 3112 351 1587161 2017-01-12 08:58:42 2017-01-30 17:51:23
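The same shift-and-compare idea can be sketched in base R without any packages, by lining up each trip with the next one after sorting. This is only a sketch under the assumption that the data has at least two rows; the helper to_time and the names prev, nxt, moved and out are made up, and the data is the bikeid 1 and 3 subset of the dput above.

```r
# Hypothetical helper: build UTC timestamps from epoch seconds.
to_time <- function(secs) as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

foo <- data.frame(
  starttime = to_time(c(1484572098, 1484075431, 1483519662,
                        1484210616, 1483727948, 1485798683)),
  stoptime  = to_time(c(1484573293, 1484075762, 1483520230,
                        1484211522, 1483729024, 1485799997)),
  start.station.id = c(3156L, 422L, 507L, 439L, 309L, 351L),
  end.station.id   = c(466L, 3090L, 442L, 3112L, 439L, 433L),
  bikeid = c(1, 1, 1, 3, 3, 3)
)

foo  <- foo[order(foo$bikeid, foo$starttime), ]
n    <- nrow(foo)
prev <- foo[-n, ]  # trip i
nxt  <- foo[-1, ]  # trip i + 1, aligned row by row with prev

# Same bike, but the next trip starts somewhere other than where this one ended.
moved <- prev$bikeid == nxt$bikeid & prev$end.station.id != nxt$start.station.id

out <- data.frame(
  bikeid           = prev$bikeid[moved],
  end.station.id   = prev$end.station.id[moved],
  start.station.id = nxt$start.station.id[moved],
  diff.time        = as.numeric(difftime(nxt$starttime[moved],
                                         prev$stoptime[moved], units = "secs")),
  stoptime         = prev$stoptime[moved],
  starttime        = nxt$starttime[moved]
)
out
```

The bikeid equality check in moved plays the role of group_by(), so pairs that straddle two different bikes are never reported.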
I have a data frame consisting of:
- two columns with start and end timestamps (POSIXct class) of various projects
- another timestamp (POSIXct class) column showing events which occurred within the start and end timeframe
- a project id column
Naturally, projects have multiple events.
Projid Event BEGIN_DT END_DT
1 04/12/2013 09:00:00 04/12/2013 08:12:00 04/14/2013 20:14:00
1 04/13/2013 15:16:24 04/12/2013 08:12:00 04/14/2013 20:14:00
2 06/06/2012 18:00:00 06/06/2012 13:54:32 08/06/2012 23:59:43
2 06/07/2012 22:54:32 06/06/2012 13:54:32 08/06/2012 23:59:43
I would like to add a field showing for each event the 60 min time bucket it belongs to (as in first hour or second hour or n-th hour of the project etc...). How could this be done?
How about the following using a floor on the difftime in hours:
# Your sample data
df <- structure(list(
Projid = c(1L, 1L, 2L, 2L),
Event = structure(c(1365721200, 1365830184, 1338969600, 1339073672), class = c("POSIXct", "POSIXt"), tzone = ""),
BEGIN_DT = structure(c(1365718320, 1365718320, 1338954872, 1338954872), class = c("POSIXct", "POSIXt"), tzone = ""),
END_DT = structure(c(1365934440, 1365934440, 1344261583, 1344261583), class = c("POSIXct", "POSIXt"), tzone = "")),
.Names = c("Projid", "Event", "BEGIN_DT", "END_DT"), row.names = c(NA, -4L), class = "data.frame");
# Add hour bin
df$hourBin <- floor(difftime(df$Event, df$BEGIN_DT, units = "hours")) + 1;
df;
#Projid Event BEGIN_DT END_DT hourBin
#1 1 2013-04-12 09:00:00 2013-04-12 08:12:00 2013-04-14 20:14:00 1 hours
#2 1 2013-04-13 15:16:24 2013-04-12 08:12:00 2013-04-14 20:14:00 32 hours
#3 2 2012-06-06 18:00:00 2012-06-06 13:54:32 2012-08-06 23:59:43 5 hours
#4 2 2012-06-07 22:54:32 2012-06-06 13:54:32 2012-08-06 23:59:43 34 hours
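If a plain integer bucket is preferred over a difftime column that prints with "hours", the result can be converted explicitly. Here is a small sketch with made-up event times (the vectors begin and event are assumptions) that also shows where the bucket boundaries fall: an event exactly 60 minutes after the start lands in the second bucket.

```r
begin <- as.POSIXct("2013-04-12 08:12:00", tz = "UTC")

# Made-up events: at the start, 59 min in, exactly 60 min in, and about 31 h in.
event <- begin + c(0, 59 * 60, 60 * 60, 31 * 3600 + 264)

# Elapsed hours as a plain number, floored, then shifted so buckets start at 1.
elapsed  <- as.numeric(difftime(event, begin, units = "hours"))
hour_bin <- as.integer(floor(elapsed)) + 1L
hour_bin
```

An integer column like this is easier to group or join on than a difftime.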