I have a set of traffic data with date and time columns, but I am having trouble subsetting it by specific times. Is there a way to subset data based on date and time ranges? Using filter or subset does not seem to work for me.
For example, I would like to extract the data from 17/08/2019 to 19/08/2019 for the following time periods: 06:00 to 07:00, 08:30 to 10:00, 12:00 to 13:00, 17:30 to 19:00, 19:00 to 20:00 and 20:00 to 22:00. I would appreciate any advice!
Vehicle.No. Date Time Payment.Amount
SXX0001A 17/08/2019 00:01 1.25
SXX0002A 17/08/2019 00:21 5
SXX0003A 17/08/2019 00:31 0
SXX0004A 17/08/2019 02:01 3
SXX0005A 17/08/2019 03:01 2
SXX0006A 17/08/2019 18:01 1.25
.
.
.
SXX0007A 18/08/2019 00:01 1.25
SXX0008A 18/08/2019 02:01 1.25
SXX0009A 18/08/2019 19:01 1.25
SXX0010A 18/08/2019 20:01 1.25
.
.
.
SXX0006A 20/08/2019 02:01 1.25
SXX0006A 20/08/2019 03:01 3.25
SXX0006A 20/08/2019 01:01 5.25
SXX0006A 20/08/2019 12:01 0
SXX0006A 20/08/2019 14:01 1.25
.
.
.
The first thing is to make sure that your Date and Time variables are in date and time formats respectively. It is impossible to tell, from what you are providing, whether this is the case or whether those variables are characters or factors.
Let's assume that they are characters:
df <- read.table(
text =
"Vehicle.No. Date Time Payment.Amount
SXX0001A 17/08/2019 00:01 1.25
SXX0002A 17/08/2019 00:21 5
SXX0003A 17/08/2019 00:31 0
SXX0004A 17/08/2019 02:01 3
SXX0005A 17/08/2019 03:01 2
SXX0006A 17/08/2019 18:01 1.25
SXX0007A 18/08/2019 00:01 1.25
SXX0008A 18/08/2019 02:01 1.25
SXX0009A 18/08/2019 19:01 1.25
SXX0010A 18/08/2019 20:01 1.25
SXX0006A 20/08/2019 02:01 1.25
SXX0006A 20/08/2019 03:01 3.25
SXX0006A 20/08/2019 01:01 5.25
SXX0006A 20/08/2019 12:01 0
SXX0006A 20/08/2019 14:01 1.25",
stringsAsFactors = FALSE,
header = TRUE
)
str(df$Date)
chr [1:15] "17/08/2019" "17/08/2019" "17/08/2019" "17/08/2019" ...
str(df$Time)
chr [1:15] "00:01" "00:21" "00:31" "02:01" "03:01" "18:01" "00:01" "02:01" ...
Let's create 2 new variables (date and datetime) in date and datetime formats. I am creating a datetime variable rather than a time one because this will come in handy later. The package readr has great functions to parse vectors.
library(dplyr)
library(readr)
df <-
df %>%
mutate(
date = parse_date(Date, "%d/%m/%Y"),
datetime = parse_datetime(paste(Date, Time), "%d/%m/%Y %H:%M")
)
str(df$date)
Date[1:15], format: "2019-08-17" "2019-08-17" "2019-08-17" ...
str(df$datetime)
POSIXct[1:15], format: "2019-08-17 00:01:00" "2019-08-17 00:21:00" ...
It is not clear to me how you want your output (do you want to filter the data that fit in any of the times you list? or do you want to filter for each date and time period separately?). Let's assume that you want all of the data that fit in any of the date and time periods you list.
Since we need to filter for the same time periods on several days, we will use purrr to avoid code repetition:
Create a list of filtered data frames (each element corresponding to one of the days of interest).
Create a function that filters the data for all the time periods of interest on a given day. This function uses the package lubridate.
Apply the function to each element of the list and output a single data frame with purrr::map_df(), then remove the date and datetime variables we created (though maybe you should keep them and get rid of your Date and Time variables instead).
library(purrr)
library(lubridate)
ls <- list(
filter(df, date == "2019-08-17"),
filter(df, date == "2019-08-18"),
filter(df, date == "2019-08-19")
)
select_times <- function(df) {
df %>%
filter(
datetime %within% interval(paste(unique(df$date), "06:00:00"),
paste(unique(df$date), "07:00:00")) |
datetime %within% interval(paste(unique(df$date), "08:30:00"),
paste(unique(df$date), "10:00:00")) |
datetime %within% interval(paste(unique(df$date), "12:00:00"),
paste(unique(df$date), "13:00:00")) |
# 17:30-19:00, 19:00-20:00 and 20:00-22:00 are contiguous, so one interval covers all three
datetime %within% interval(paste(unique(df$date), "17:30:00"),
paste(unique(df$date), "22:00:00"))
)
}
map_df(ls, select_times) %>%
select(- date, - datetime)
Output:
Vehicle.No. Date Time Payment.Amount
1 SXX0006A 17/08/2019 18:01 1.25
2 SXX0009A 18/08/2019 19:01 1.25
3 SXX0010A 18/08/2019 20:01 1.25
This is the subset of your data for the time periods of interest during the days of interest.
For alternative solutions, you might want to look at the package xts. This post could be useful.
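As a rough alternative to splitting the data by day, the time of day can be turned into minutes since midnight and compared against a small table of windows. This is only a sketch in base R: the names windows, to_minutes and in_any_window are made up for illustration, the data frame is a few rows copied from the question, and the window bounds are treated as inclusive.

```r
# Small stand-in for the traffic data (a few rows from the question)
df <- data.frame(
  Vehicle.No. = c("SXX0001A", "SXX0006A", "SXX0009A", "SXX0010A", "SXX0006A"),
  Date = c("17/08/2019", "17/08/2019", "18/08/2019", "18/08/2019", "20/08/2019"),
  Time = c("00:01", "18:01", "19:01", "20:01", "12:01"),
  Payment.Amount = c(1.25, 1.25, 1.25, 1.25, 0),
  stringsAsFactors = FALSE
)

# Time windows of interest as minutes since midnight (hypothetical helper table)
windows <- data.frame(
  start = c(6 * 60, 8 * 60 + 30, 12 * 60, 17 * 60 + 30),
  end   = c(7 * 60, 10 * 60, 13 * 60, 22 * 60)
)

# Convert "HH:MM" strings to minutes since midnight
to_minutes <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]), numeric(1))
}

# TRUE where a value falls inside at least one window (bounds inclusive)
in_any_window <- function(m) {
  vapply(m, function(v) any(v >= windows$start & v <= windows$end), logical(1))
}

# Keep rows inside the date range AND inside any of the time windows
day <- as.Date(df$Date, format = "%d/%m/%Y")
keep <- day >= as.Date("2019-08-17") & day <= as.Date("2019-08-19") &
  in_any_window(to_minutes(df$Time))
result <- df[keep, ]
print(result)
```

Because the windows do not depend on the day, no per-day list or interval objects are needed here; the trade-off is that windows crossing midnight would need extra handling.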
I have date and time as separate columns, which I combined into a single column using library(lubridate).
Now I want to create a new column that holds the elapsed time between two consecutive rows for each unique ID.
I tried diff, but the error I get says that the new column's row count is off by one compared to the original data set.
s1$DT <- with(s1, mdy(Date.of.Collection) + hm(MILITARY.TIME)) # this worked; needs library(lubridate)
s1$ElapsedTime <- diff(s1$DT) # fails: diff() returns one fewer value than there are rows
units(s1$ElapsedTime)<-"hours"
Subject.ID time DT Time elapsed
1 Dose 8/1/2018 8:15 0
1 time point1 8/1/2018 9:56 0.070138889
1 time point2 8/2/2018 9:56 1.070138889
2 Dose 9/4/2018 10:50 0
2 time point1 9/11/2018 11:00 7.006944444
3 Dose 10/1/2018 10:20 0
3 time point1 10/2/2018 14:22 1.168055556
3 time point2 10/3/2018 12:15 2.079861111
From your comment, you don't need a "diff"; in conventional R-speak, a "diff" would be T1-T0, T2-T1, T3-T2, ..., Tn - Tn-1.
For you, one of these will work to give you T1,2,...,n - T0.
Base R
do.call(
rbind,
by(patients, patients$Subject.ID, function(x) {
x$elapsed <- x$realDT - x$realDT[1]
units(x$elapsed) <- "hours"
x
})
)
# Subject.ID time1 DT Time elapsed realDT
# 1.1 1 Dose 8/1/2018 8:15 0.000000 hours 2018-08-01 08:15:00
# 1.2 1 time_point1 8/1/2018 9:56 1.683333 hours 2018-08-01 09:56:00
# 1.3 1 time_point2 8/2/2018 9:56 25.683333 hours 2018-08-02 09:56:00
# 2.4 2 Dose 9/4/2018 10:50 0.000000 hours 2018-09-04 10:50:00
# 2.5 2 time_point1 9/11/2018 11:00 168.166667 hours 2018-09-11 11:00:00
# 3.6 3 Dose 10/1/2018 10:20 0.000000 hours 2018-10-01 10:20:00
# 3.7 3 time_point1 10/2/2018 14:22 28.033333 hours 2018-10-02 14:22:00
# 3.8 3 time_point2 10/3/2018 12:15 49.916667 hours 2018-10-03 12:15:00
dplyr
library(dplyr)
patients %>%
group_by(Subject.ID) %>%
mutate(elapsed = `units<-`(realDT - realDT[1], "hours")) %>%
ungroup()
data.table
library(data.table)
patDT <- copy(patients)
setDT(patDT)
patDT[, elapsed := `units<-`(realDT - realDT[1], "hours"), by = "Subject.ID"]
Notes:
The "hours" in the $elapsed column is just an artifact of dealing with a time-difference thing, it should not affect most operations. To get rid of it, make sure you're in the right units ("hours", "secs", ..., see ?units) and use as.numeric.
The only reasons I used as.POSIXct as above are that I'm not a lubridate user, and the data as provided is not in a time format. You shouldn't need it if your Time is a proper time format, in which case you'd use that field instead of my hacky realDT.
On similar lines, if you do calculate realDT and use it, you really don't need both realDT and the pair of DT and Time.
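The first note can be illustrated with a minimal sketch in base R. The two timestamps are taken from subject 1 in the sample data, and the variable names are only for illustration.

```r
# Two timestamps from the sample data (subject 1: dose and first time point)
t0 <- as.POSIXct("2018-08-01 08:15", tz = "UTC")
t1 <- as.POSIXct("2018-08-01 09:56", tz = "UTC")

# A difftime object carries its units along
elapsed <- difftime(t1, t0, units = "hours")
print(elapsed)

# Naming the units explicitly when converting avoids surprises
# if the object happens to be stored in minutes or seconds
elapsed_num <- as.numeric(elapsed, units = "hours")
print(elapsed_num)
```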
The data I used:
patients <- read.table(header=TRUE, stringsAsFactors=FALSE, text="
Subject.ID time1 DT Time elapsed
1 Dose 8/1/2018 8:15 0
1 time_point1 8/1/2018 9:56 0.070138889
1 time_point2 8/2/2018 9:56 1.070138889
2 Dose 9/4/2018 10:50 0
2 time_point1 9/11/2018 11:00 7.006944444
3 Dose 10/1/2018 10:20 0
3 time_point1 10/2/2018 14:22 1.168055556
3 time_point2 10/3/2018 12:15 2.079861111")
# this is necessary for me because DT/Time here are not POSIXt (they're just strings)
patients$realDT <- as.POSIXct(paste(patients$DT, patients$Time), format = "%m/%d/%Y %H:%M")
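If the conventional consecutive differences (T1-T0, T2-T1, ...) are wanted after all, one possible sketch pads each subject's diff() with a leading zero so that the result has as many values as there are rows. It uses base R's ave(); the column name step_hours and the trimmed-down data frame are assumptions made for this example.

```r
# Trimmed-down version of the sample data, with realDT already parsed
patients <- data.frame(
  Subject.ID = c(1, 1, 1, 2, 2),
  realDT = as.POSIXct(c("2018-08-01 08:15", "2018-08-01 09:56", "2018-08-02 09:56",
                        "2018-09-04 10:50", "2018-09-11 11:00"), tz = "UTC")
)

# diff() returns n - 1 values per group, so prepend a 0 for each subject's first row.
# as.numeric() on POSIXct gives seconds, hence the division by 3600 to get hours.
patients$step_hours <- ave(
  as.numeric(patients$realDT),
  patients$Subject.ID,
  FUN = function(x) c(0, diff(x)) / 3600
)
print(patients)
```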
I have a dataset where one column holds date and time values. Every date has multiple entries. The first row for every date has a date value in the form 29MAY2018_00:00:00.000000, while the rest of the rows for the same date have time values, e.g. 20:00 - 21:00. The problem is that I want to sum the values in another column for each day.
The sample data has the following format
Date A
29MAY2018_00:00:00.000000
20:00 - 21:00 0.009
21:00 - 22:00 0.003
22:00 - 23:00 0.0003
23:00 - 00:00 0
30MAY2018_00:00:00.000000
00:00 - 01:00 -0.0016
01:00 - 02:00 -0.0012
02:00 - 03:00 -0.0002
03:00 - 04:00 -0.0023
04:00 - 05:00 0
05:00 - 06:00 -0.0005
20:00 - 21:00 -0.0042
21:00 - 22:00 -0.0035
22:00 - 23:00 -0.0026
23:00 - 00:00 -0.001
I have created a new column
data$C[data$A ==0 ] <- 0
data$C[data$A < 0 ] <- -1
data$C[data$A > 0 ] <- 1
I need to sum the column C for every date.
The output should be
A B
29-MAY-2019 4
30-MAY-2019 -9
31-MAY-2019 3
One option is to create a grouping column based on where a full datetime value occurs in 'Date', then summarise each group: take the first 'Date', convert it to Date class (with anydate from the anytime package) and sum the sign of 'A'.
library(tidyverse)
library(anytime)
data %>%
group_by(grp = cumsum(str_detect(Date, "[A-Z]"))) %>%
summarise(Date = anydate(first(Date)),
B = sum(sign(A), na.rm = TRUE))
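The grouping trick can also be sketched in base R alone. The data frame below is a shortened copy of the sample, with A set to NA on the date rows (an assumption about how the file is read in), and the day label is kept as a plain string rather than parsed, since month abbreviations are locale-dependent.

```r
# Shortened copy of the sample data; date rows carry no value in A
data <- data.frame(
  Date = c("29MAY2018_00:00:00.000000", "20:00 - 21:00", "21:00 - 22:00",
           "22:00 - 23:00", "23:00 - 00:00",
           "30MAY2018_00:00:00.000000", "00:00 - 01:00", "01:00 - 02:00"),
  A = c(NA, 0.009, 0.003, 0.0003, 0, NA, -0.0016, -0.0012),
  stringsAsFactors = FALSE
)

# Rows containing a capital letter (the month name) start a new day
is_header <- grepl("[A-Z]", data$Date)
grp <- cumsum(is_header)

# Keep the part before the underscore as the day label
day_label <- sub("_.*$", "", data$Date[is_header])

# Sum the signs of A within each day, ignoring the NA on the date row
B <- as.vector(tapply(sign(data$A), grp, sum, na.rm = TRUE))
result <- data.frame(Date = day_label, B = B, stringsAsFactors = FALSE)
print(result)
```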
I have data from field instruments where values for 7 different parameters are measured and recorded every 15 minutes. The data set extends over many years. Sometimes the instruments fail or are taken off-line for preventive maintenance, leaving incomplete days in the record. In post-processing the data, I would like to remove those incomplete days (or, stated alternatively, retain only the complete days).
An abbreviated example of what the data might look like:
Date Temp
2012-02-01 00:01:00 18.5
2012-02-01 00:16:00 18.4
2012-02-01 00:31:00 18.6
.
.
.
2012-02-01 23:31:00 19.0
2012-02-01 23:46:00 18.9
2012-02-02 00:01:00 19.0
2012-02-02 00:16:00 19.0
2012-02-03 00:01:00 17.0
2012-02-03 00:16:00 17.1
2012-02-03 00:31:00 17.0
.
.
.
2012-02-03 23:31:00 18.0
2012-02-03 23:46:00 18.2
So 2012-02-01 and 2012-02-03 are complete days and I'd like to remove 2012-02-02 as it is an incomplete day.
Convert dates to days
Count the number of observations per day
Retain only those days with the maximum number of observations
The code
library(dplyr)
library(lubridate)
dataset %>%
mutate(Day = floor_date(Date, unit = "day")) %>%
group_by(Day) %>%
mutate(nObservation = n()) %>%
ungroup() %>% # compare against the maximum over all days, not within each day
filter(nObservation == max(nObservation))
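# assumes df$Date holds the day only (no time of day) as a plain vector, sorted by day,
# and that a complete day has 96 readings
Date.rle = rle(df$Date)
Date.good = Date.rle$values[Date.rle$lengths == 96]
df = df[df$Date %in% Date.good,]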
Here is one base R method that should work:
# create a day variable
df$day <- as.Date(df$Date, format="%Y-%m-%d")
# calculate the number of observations per day
df$obsCnt <- ave(df$Temp, df$day, FUN=length)
# subset data: keep days with at least 96 observations
dfNew <- df[df$obsCnt >= 96,]
I put the threshold at 96 observations a day (one every 15 minutes), but it is easily adjusted.
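A small self-contained sketch of the same count-and-filter idea, using table() instead of ave(). To keep the example short it assumes 3 readings per complete day rather than 96; expected_per_day is a made-up name and would be set to 96 for 15-minute data.

```r
# Toy data: 3 readings per day stand in for 96; 2012-02-02 is incomplete
df <- data.frame(
  Date = c("2012-02-01 00:01:00", "2012-02-01 08:01:00", "2012-02-01 16:01:00",
           "2012-02-02 00:01:00",
           "2012-02-03 00:01:00", "2012-02-03 08:01:00", "2012-02-03 16:01:00"),
  Temp = c(18.5, 18.4, 18.6, 19.0, 17.0, 17.1, 17.0),
  stringsAsFactors = FALSE
)

expected_per_day <- 3  # would be 96 for 15-minute data

# The first 10 characters of the timestamp string are the day
df$day <- substr(df$Date, 1, 10)

# Count readings per day and keep only days with the full count
counts <- table(df$day)
complete_days <- names(counts)[counts == expected_per_day]
dfNew <- df[df$day %in% complete_days, ]
print(dfNew)
```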
I start with a data frame titled 'dat' in R that looks like the following:
datetime lat long id extra step
1 8/9/2014 13:00 31.34767 -81.39117 36 1 31.38946
2 8/9/2014 17:00 31.34767 -81.39150 36 1 11155.67502
3 8/9/2014 23:00 31.30683 -81.28433 36 1 206.33342
4 8/10/2014 5:00 31.30867 -81.28400 36 1 11152.88177
What I need to do is find out which days have fewer than 3 entries and remove all entries associated with those days from the original data.
I initially did this by the following:
library(plyr)
datetime<-dat$datetime
###strip the time down to only have the date no hh:mm:ss
date<- strptime(datetime, format = "%m/%d/%Y")
### bind the date to the old data
dat2<-cbind(date, dat)
### count using just the date so you can ID which days have fewer than 3 points
datecount<- count(dat2, "date")
datecount<- subset(datecount, datecount$freq < 3)
This ends up producing the following:
row.names date freq
1 49 2014-09-26 1
2 50 2014-09-27 2
3 135 2014-12-21 2
Which is great, but I cannot figure out how to remove the entries from these days with less than three entries from the original 'dat' because this is a compressed version of the original data frame.
So to try and deal with this I have come up with another way of looking at the problem. I will use the strptime and cbind from above:
datetime<-dat$datetime
###strip the time down to only have the date no hh:mm:ss
date<- strptime(datetime, format = "%m/%d/%Y")
### bind the date to the old data
dat2<-cbind(date, dat)
And I will use the column titled "extra". I would like to create a new column that holds the sum of the values in this "extra" column for each simplified strptime date, and apply this new value to all entries from that date, like the following:
date datetime lat long id extra extra_sum
1 2014-08-09 8/9/2014 13:00 31.34767 -81.39117 36 1 3
2 2014-08-09 8/9/2014 17:00 31.34767 -81.39150 36 1 3
3 2014-08-09 8/9/2014 23:00 31.30683 -81.28433 36 1 3
4 2014-08-10 8/10/2014 5:00 31.30867 -81.28400 36 1 4
5 2014-08-10 8/10/2014 13:00 31.34533 -81.39317 36 1 4
6 2014-08-10 8/10/2014 17:00 31.34517 -81.39317 36 1 4
7 2014-08-10 8/10/2014 23:00 31.34483 -81.39283 36 1 4
8 2014-08-11 8/11/2014 5:00 31.30600 -81.28317 36 1 2
9 2014-08-11 8/11/2014 13:00 31.34433 -81.39300 36 1 2
The code that creates the "extra_sum" column is what I am struggling with.
After creating this I can simply subset my data to all entries that have a value >2. Any help figuring out how to use my initial methodology or this new one to remove days with fewer than 3 entries from my initial data set would be much appreciated!
The plyr way.
library(plyr)
datetime <- dat$datetime
###strip the time down to only have the date no hh:mm:ss
date <- strptime(datetime, format = "%m/%d/%Y")
### bind the date to the old data
dat2 <-cbind(date, dat)
dat3 <- ddply(dat2, .(date), function(df){
if (nrow(df)>=3) {
return(df)
} else {
return(NULL)
}
})
I recommend using the data.table package
library(data.table)
dat<-data.table(dat)
dat$Date<-as.Date(as.character(dat$datetime), format = "%m/%d/%Y")
dat_sum<-dat[, .N, by = Date ]
dat_3plus<-dat_sum[N>=3]
dat<-dat[Date%in%dat_3plus$Date]
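The "extra_sum" column described in the question can be sketched in base R with ave(), which returns one value per row rather than one per group. This is only an illustration on a cut-down copy of the sample data; dat_kept is a made-up name.

```r
# Cut-down copy of the sample data (only the columns needed here)
dat <- data.frame(
  datetime = c("8/9/2014 13:00", "8/9/2014 17:00", "8/9/2014 23:00",
               "8/10/2014 5:00", "8/10/2014 13:00", "8/10/2014 17:00", "8/10/2014 23:00",
               "8/11/2014 5:00", "8/11/2014 13:00"),
  extra = 1,
  stringsAsFactors = FALSE
)

# as.Date() ignores the trailing time once the date part matches the format
dat$date <- as.Date(dat$datetime, format = "%m/%d/%Y")

# Sum "extra" within each date and repeat the sum on every row of that date
dat$extra_sum <- ave(dat$extra, as.character(dat$date), FUN = sum)

# Keep only the days with at least 3 entries
dat_kept <- dat[dat$extra_sum > 2, ]
print(dat_kept)
```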
Date DE VE
12/1/2016 93.387 0.095
11/1/2016 77.968 0.095
10/1/2016 65.184 0.095
9/1/2016 63.984 0.095
8/1/2016 67.657 0.095
The dates are in %m/%d/%Y format.
DE and VE are daily averages. How can I convert them from daily averages to monthly totals in R, based on the actual number of days in each month? For example, the total for 12/2016 is 93.387 * 31. I need the monthly totals for every month from 2006-01 to 2016-12.
To find the number of days in a month you can use the days_in_month function in the lubridate package.
The function takes a date-time object, so you have to convert your Date column to a known date/datetime-based class (i.e. "POSIXct, POSIXlt, Date, chron, yearmon, yearqtr, zoo, zooreg, timeDate, xts, its, ti, jul, timeSeries, and fts objects").
Then you can just mutate your df, multiplying the daily averages by the number of days in the month.
library(lubridate)
library(dplyr)
myDf <- read.table(text = "Date DE VE
12/1/2016 93.387 0.095
11/1/2016 77.968 0.095
10/1/2016 65.184 0.095
9/1/2016 63.984 0.095
8/1/2016 67.657 0.095", header = TRUE)
mutate(myDf, Date = as.Date(Date, format = "%m/%d/%Y"),
monthlyTotalDE = DE * days_in_month(Date),
monthlyTotalVE = VE * days_in_month(Date))
# Date DE VE monthlyTotalDE monthlyTotalVE
# 1 2016-12-01 93.387 0.095 2894.997 2.945
# 2 2016-11-01 77.968 0.095 2339.040 2.850
# 3 2016-10-01 65.184 0.095 2020.704 2.945
# 4 2016-09-01 63.984 0.095 1919.520 2.850
# 5 2016-08-01 67.657 0.095 2097.367 2.945
EDIT
In mutate, if you use a new column name, that column is appended to the data frame.
If you want to avoid adding columns, reuse the existing column names; those columns are then overwritten, e.g.
mutate(myDf, Date = as.Date(Date, format = "%m/%d/%Y"),
DE = DE * days_in_month(Date),
VE = VE * days_in_month(Date))
# Date DE VE
# 1 2016-12-01 2894.997 2.945
# 2 2016-11-01 2339.040 2.850
# 3 2016-10-01 2020.704 2.945
# 4 2016-09-01 1919.520 2.850
# 5 2016-08-01 2097.367 2.945
If you have a lot of columns to compute, I suggest using mutate_each; it is very powerful and saves you the pain of doing it manually with mutate, or the loss of performance of a traditional loop. (mutate_each and funs are deprecated in current dplyr releases in favour of across().)
Use vars to include/exclude variables in mutate.
You can exclude variables manually using the variable name preceded by a minus:
vars = -Date, or use a vector to exclude several variables: vars = c(Date, DE).
You can also use the special selection helpers from dplyr::select, such as one_of(c("DE", "VE")) or DE:VE; see ?dplyr::select for more information. To drop variables, put - before the helper: -contains("Date").
Warning: if you use vars to include variables, don't spell out the named argument vars = in your call if you want to keep the column names.
myDf %>%
mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
mutate_each(funs(. * days_in_month(Date)),
vars = -Date)
# Date DE VE
# 1 2016-12-01 2894.997 2.945
# 2 2016-11-01 2339.040 2.850
# 3 2016-10-01 2020.704 2.945
# 4 2016-09-01 1919.520 2.850
# 5 2016-08-01 2097.367 2.945
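With a current dplyr (1.0.0 or later), the same multiplication over every column except Date could be written with across(). This is a sketch on two rows of the sample data rather than a drop-in replacement. The unname() call is an addition made here so that the month-name labels returned by days_in_month() do not end up attached to the columns.

```r
library(dplyr)
library(lubridate)

# Two rows of the sample data
myDf <- data.frame(
  Date = c("12/1/2016", "11/1/2016"),
  DE = c(93.387, 77.968),
  VE = c(0.095, 0.095),
  stringsAsFactors = FALSE
)

result <- myDf %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  # multiply every column except Date by the number of days in that row's month
  mutate(across(-Date, ~ .x * unname(days_in_month(Date))))
print(result)
```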