I have a data frame like this (head):
date Value
1: 2016-12-31 169361280
2: 2017-01-01 169383153
3: 2017-01-02 169494585
4: 2017-01-03 167106852
5: 2017-01-04 166750164
6: 2017-01-05 164086438
I would like to calculate a ratio, and for that I need the maximum of every period. The maximum normally falls on the last day of the month, but sometimes it falls a few days before or after (28, 29, 30, 31, 01, 02).
To calculate it properly, I would like to assign to my reference date (the last day of the month) the maximum value of this group of days, so that the ratio reflects what it is supposed to.
This could be a reproducible example:
library(data.table)
library(zoo)
Start<-as.Date("2016-12-31")
End<-Sys.Date()
window<-data.table(seq(Start,End,by='1 day'))
dt<-cbind(window,rep(rnorm(nrow(window))))
colnames(dt)<-c("date","value")
# Create a Dateseq
DateSeq <- function(st, en, freq) {
st <- as.Date(as.yearmon(st))
en <- as.Date(as.yearmon(en))
as.Date(as.yearmon(seq(st, en, by = paste(as.character(12/freq),
"months"))), frac = 1)
}
# df to be fulfilled with the group max.
Value.Max.Month<-data.frame(DateSeq(Start,End,12))
colnames(Value.Max.Month)<-c("date")
date
1 2016-12-31
2 2017-01-31
3 2017-02-28
4 2017-03-31
5 2017-04-30
6 2017-05-31
7 2017-06-30
8 2017-07-31
9 2017-08-31
10 2017-09-30
11 2017-10-31
12 2017-11-30
13 2017-12-31
14 2018-01-31
15 2018-02-28
16 2018-03-31
You could use data.table:
library(data.table)
library(lubridate)
library(zoo)
Start <- as.Date("2016-12-31")
End <- Sys.Date()
window <- data.table(seq(Start,End,by='1 day'))
dt <- cbind(window,rep(rnorm(nrow(window))))
colnames(dt) <- c("date","value")
dt <- data.table(dt)
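# period = last day of the month that each date falls in
dt[, period := as.Date(as.yearmon(date)) %m+% months(1) - 1]
# maximum per period, then one row per period
dt[, maximum := max(value), by = period][, unique(maximum), by = period]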
In the first expression we create a new column called period. Then we group by this new column and look for the maximum in value. In the last expression we just output these unique rows.
Notice that to get the last day of each period we take the first day of the month, add one month using lubridate, and then subtract 1 day.
The output is:
period V1
1: 2016-12-31 -0.7832116
2: 2017-01-31 2.1988660
3: 2017-02-28 1.6644812
4: 2017-03-31 1.2464980
5: 2017-04-30 2.8268820
6: 2017-05-31 1.7963104
7: 2017-06-30 1.3612476
8: 2017-07-31 1.7325457
9: 2017-08-31 2.7503439
10: 2017-09-30 2.4369036
11: 2017-10-31 2.4544802
12: 2017-11-30 3.1477730
13: 2017-12-31 2.8461506
14: 2018-01-31 1.8862944
15: 2018-02-28 1.8946470
16: 2018-03-31 0.7864341
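Note that the grouping above takes the maximum within each calendar month. If the maximum should instead come from a few days on either side of each month end (28, 29, 30, 31, 01, 02), as the question describes, one possible sketch in base R follows. The half-width k = 2, the toy data, and the names dat, ref and window_max are assumptions for illustration, not part of the original answer.

```r
set.seed(1)
dates <- seq(as.Date("2016-12-31"), as.Date("2017-03-31"), by = "1 day")
dat <- data.frame(date = dates, value = rnorm(length(dates)))

# Month-end reference dates: first day of the following month minus one day
firsts <- seq(as.Date("2017-01-01"), as.Date("2017-04-01"), by = "1 month")
ref <- data.frame(date = firsts - 1)

k <- 2  # assumed half-width of the window, in days

# For each reference date, take the max over all rows within k days of it
ref$window_max <- vapply(seq_along(ref$date), function(i) {
  in_window <- abs(as.numeric(dat$date - ref$date[i])) <= k
  max(dat$value[in_window])
}, numeric(1))

ref
```

Each reference date then carries the largest value observed in its surrounding window, which may come from the first days of the next month.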
I have a data frame like the one below. Three staff members have hourly readings over several days, but the readings are incomplete (every staff member should have 24 readings a day).
The staff members have different numbers of readings on each day. I am only interested in the staff member with the most readings on a given day.
There are many days. I want to insert the missing (hourly) rows only for the staff member with the most readings on each day. That is, insert only for Jack on 2018-03-02, only for David on 2018-03-03, and only for Kate on 2018-03-04.
I tried the lines from this question (even though they fill everything without differentiation), but I am not getting there.
How can this be done in R?
date_time <- c("2/3/2018 0:00","2/3/2018 1:00","2/3/2018 2:00","2/3/2018 3:00","2/3/2018 5:00","2/3/2018 6:00","2/3/2018 7:00","2/3/2018 8:00","2/3/2018 9:00","2/3/2018 10:00","2/3/2018 11:00","2/3/2018 12:00","2/3/2018 13:00","2/3/2018 14:00","2/3/2018 16:00","2/3/2018 17:00","2/3/2018 18:00","2/3/2018 19:00","2/3/2018 21:00","2/3/2018 22:00","2/3/2018 23:00","3/3/2018 0:00","3/3/2018 0:00","3/3/2018 1:00","3/3/2018 2:00","3/3/2018 4:00","3/3/2018 5:00","3/3/2018 7:00","3/3/2018 8:00","3/3/2018 9:00","3/3/2018 11:00","3/3/2018 12:00","3/3/2018 14:00","3/3/2018 15:00","3/3/2018 17:00","3/3/2018 18:00","3/3/2018 20:00","3/3/2018 22:00","3/3/2018 23:00","4/3/2018 0:00","4/3/2018 0:00","4/3/2018 1:00","4/3/2018 2:00","4/3/2018 3:00","4/3/2018 5:00","4/3/2018 6:00","4/3/2018 7:00","4/3/2018 8:00","4/3/2018 10:00","4/3/2018 11:00","4/3/2018 12:00","4/3/2018 14:00","4/3/2018 15:00","4/3/2018 16:00","4/3/2018 17:00","4/3/2018 19:00","4/3/2018 20:00","4/3/2018 22:00","4/3/2018 23:00")
staff <- c("Jack","Jack","Kate","Jack","Jack","Jack","Jack","Jack","Jack","Jack","Jack","Jack","Kate","Jack","Jack","Jack","David","David","Jack","Kate","David","David","David","David","David","David","David","David","David","David","David","David","David","David","David","David","David","Jack","Kate","David","David","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Jack")
reading <- c(7.5,8.3,7,6.9,7.1,8.1,8.4,8.8,6,7.1,8.9,7.3,7.4,6.9,11.3,18.8,4.6,6.7,7.7,7.8,7,7,6.6,6.8,6.7,6.1,7.1,6.3,7.2,6,5.8,6.6,6.5,6.4,7.2,8.4,6.5,6.5,5.5,6.7,7,7.5,6.5,7.5,7.2,6.3,7.3,8,7,8.2,6.5,6.8,7.5,7,6.1,5.7,6.7,4.3,6.3)
df <- data.frame(date_time, staff, reading)
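One option is to do this separately. Create a data.table of the dates of interest and the corresponding 'staff', and expand it to the full hourly sequence of date-times. Then rbind this with the original dataset and summarise by staff and date_time: where a group contains an NA reading (a row that came from the expanded sequence), take the sum with na.rm = TRUE, otherwise keep the reading. Hours that were missing therefore end up with a reading of 0.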
library(data.table)
stf <- c("Jack", "David", "Kate")
date <- as.Date(c("2018-03-02", "2018-03-03", "2018-03-04"))
df1 <- data.table(date, staff= stf)[, .(date_time = seq(as.POSIXct(paste(date, "00:00:00"),
tz = "GMT"),
length.out = 24, by = "1 hour")), staff]
setDT(df)[, date_time := as.POSIXct(date_time, format = "%d/%m/%Y %H:%M", tz = "GMT")]
res <- rbindlist(list(df, df1), fill = TRUE)[,
.(reading = if(any(is.na(reading))) sum(reading, na.rm = TRUE) else reading),
.(staff, date_time)]
table(res$staff, as.Date(res$date_time))
# 2018-03-02 2018-03-03 2018-03-04
# David 3 24 2
# Jack 24 1 1
# Kate 3 1 24
head(res)
# staff date_time reading
#1: Jack 2018-03-02 00:00:00 7.5
#2: Jack 2018-03-02 01:00:00 8.3
#3: Kate 2018-03-02 02:00:00 7.0
#4: Jack 2018-03-02 03:00:00 6.9
#5: Jack 2018-03-02 05:00:00 7.1
#6: Jack 2018-03-02 06:00:00 8.1
tail(res)
# staff date_time reading
#1: Kate 2018-03-04 04:00:00 0
#2: Kate 2018-03-04 09:00:00 0
#3: Kate 2018-03-04 13:00:00 0
#4: Kate 2018-03-04 18:00:00 0
#5: Kate 2018-03-04 21:00:00 0
#6: Kate 2018-03-04 23:00:00 0
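The staff and dates above are entered by hand. As a rough sketch, the staff member with the most readings on each day could also be derived from the data. The toy data frame and the names toy, counts and top are made up for illustration. Ties go to whichever staff name sorts first, because which.max returns the first maximum.

```r
toy <- data.frame(
  date_time = as.POSIXct(c("2018-03-02 00:00", "2018-03-02 01:00", "2018-03-02 02:00",
                           "2018-03-03 00:00", "2018-03-03 01:00", "2018-03-03 02:00"),
                         tz = "GMT"),
  staff = c("Jack", "Jack", "Kate", "David", "Kate", "David"),
  stringsAsFactors = FALSE
)

# Count readings per day and staff member
toy$day <- as.Date(toy$date_time, tz = "GMT")
counts <- table(toy$day, toy$staff)

# For each day (row), pick the staff member (column) with the highest count
top <- data.frame(
  day = as.Date(rownames(counts)),
  staff = colnames(counts)[apply(counts, 1, which.max)],
  stringsAsFactors = FALSE
)

top
```

A table like top could then stand in for the hand-written stf and date vectors.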
Try this code:
Identify each daily hour and all staff members
# parse first: min()/max() on the raw strings would compare them alphabetically, not chronologically
dt_parsed<-as.POSIXct(date_time,format="%d/%m/%Y %H:%M")
date_h<-seq(min(dt_parsed),max(dt_parsed),by=60*60)
staff_u<-unique(staff)
comb<-expand.grid(staff_u,date_h)
colnames(comb)<-c("staff","date_time")
Uniform date format in df
df$date_time<-as.POSIXct(df$date_time,format="%d/%m/%Y %H:%M")
Merge information
out<-merge(comb,df,all.x=T)
Your output:
head(out)
staff date_time reading
1 Jack 2018-03-02 00:00:00 7.5
2 Jack 2018-03-02 01:00:00 8.3
3 Jack 2018-03-02 02:00:00 NA
4 Jack 2018-03-02 03:00:00 6.9
5 Jack 2018-03-02 04:00:00 NA
6 Jack 2018-03-02 05:00:00 7.1
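To see why the merge leaves NA for missing hours, here is a small self-contained sketch of the same grid-plus-left-join idea. The data and the names grid, obs and filled are hypothetical, and it uses POSIXct with an explicit time zone, which is an assumption rather than what the answer does.

```r
hours <- seq(as.POSIXct("2018-03-02 00:00", tz = "UTC"), by = "hour", length.out = 4)

# Every staff/hour combination that should exist
grid <- expand.grid(staff = c("Jack", "Kate"), date_time = hours,
                    stringsAsFactors = FALSE)

# Observed readings: only three of the eight combinations are present
obs <- data.frame(staff = c("Jack", "Jack", "Kate"),
                  date_time = hours[c(1, 2, 3)],
                  reading = c(7.5, 8.3, 7.0),
                  stringsAsFactors = FALSE)

# Left join: keep every row of grid, fill reading with NA where obs has no match
filled <- merge(grid, obs, all.x = TRUE)

filled
```

Note that this fills the gaps for every staff member. Restricting it to one staff member per day would need an extra filtering step on grid before the merge.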
I have some data which is formatted in the following way:
time count
00:00 17
00:01 62
00:02 41
So I have data from 00:00 to 23:59 with one count per minute. I'd like to group the data in intervals of 15 minutes such that:
time count
00:00-00:15 148
00:16-00:30 284
I have tried to do it manually, but this is exhausting, so I am sure there has to be a function or something to do it easily, but I haven't figured out yet how to do it.
I'd really appreciate some help!!
Thank you very much!
For data that's in POSIXct format, you can use the cut function to create 15-minute groupings, and then aggregate by those groups. The code below shows how to do this in base R and with the dplyr and data.table packages.
First, create some fake data:
set.seed(4984)
dat = data.frame(time=seq(as.POSIXct("2016-05-01"), as.POSIXct("2016-05-01") + 60*99, by=60),
count=sample(1:50, 100, replace=TRUE))
Base R
Cut the data into 15-minute groups:
dat$by15 = cut(dat$time, breaks="15 min")
time count by15
1 2016-05-01 00:00:00 22 2016-05-01 00:00:00
2 2016-05-01 00:01:00 11 2016-05-01 00:00:00
3 2016-05-01 00:02:00 31 2016-05-01 00:00:00
...
98 2016-05-01 01:37:00 20 2016-05-01 01:30:00
99 2016-05-01 01:38:00 29 2016-05-01 01:30:00
100 2016-05-01 01:39:00 37 2016-05-01 01:30:00
Now aggregate by the new grouping column, using sum as the aggregation function:
dat.summary = aggregate(count ~ by15, FUN=sum, data=dat)
by15 count
1 2016-05-01 00:00:00 312
2 2016-05-01 00:15:00 395
3 2016-05-01 00:30:00 341
4 2016-05-01 00:45:00 318
5 2016-05-01 01:00:00 349
6 2016-05-01 01:15:00 397
7 2016-05-01 01:30:00 341
dplyr
library(dplyr)
dat.summary = dat %>% group_by(by15=cut(time, "15 min")) %>%
summarise(count=sum(count))
data.table
library(data.table)
dat.summary = setDT(dat)[ , list(count=sum(count)), by=cut(time, "15 min")]
UPDATE: To answer the comment, for this case the end point of each grouping interval is as.POSIXct(as.character(dat$by15)) + 60*15 - 1. In other words, the endpoint of the grouping interval is 15 minutes minus one second from the start of the interval. We add 60*15 - 1 because POSIXct is denominated in seconds. The as.POSIXct(as.character(...)) is because cut returns a factor and this just converts it back to date-time so that we can do math on it.
If you want the end point to the nearest minute before the next interval (instead of the nearest second), you could do as.POSIXct(as.character(dat$by15)) + 60*14.
If you don't know the break interval, for example, because you chose the number of breaks and let R pick the interval, you could find the number of seconds to add by doing max(unique(diff(as.POSIXct(as.character(dat$by15))))) - 1.
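If the time column holds plain "HH:MM" strings, as in the question, rather than POSIXct values, one hedged alternative is to convert each time to minutes since midnight and bucket with integer division. The toy data and the names dat_hm, minutes, bucket_start and totals are assumptions for illustration. Each bucket is labelled by its start time and covers that minute plus the following 14.

```r
# Two hours of per-minute data with a count of 1 in every minute
times <- sprintf("%02d:%02d", rep(0:1, each = 60), rep(0:59, times = 2))
dat_hm <- data.frame(time = times, count = 1L, stringsAsFactors = FALSE)

# Convert "HH:MM" to minutes since midnight
parts <- strsplit(dat_hm$time, ":", fixed = TRUE)
minutes <- vapply(parts, function(p) as.integer(p[1]) * 60L + as.integer(p[2]),
                  integer(1))

# Floor to the start of the 15-minute bucket and format it back as "HH:MM"
bucket_start <- (minutes %/% 15L) * 15L
dat_hm$bucket <- sprintf("%02d:%02d", bucket_start %/% 60L, bucket_start %% 60L)

totals <- aggregate(count ~ bucket, data = dat_hm, FUN = sum)
totals
```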
The cut approach is handy but slow with large data frames. The following approach is approximately 1,000x faster than the cut approach (tested with 400k records).
# Function: Truncate (floor) POSIXct to time interval (specified in seconds)
# Author: Stephen McDaniel # PowerTrip Analytics
# Date : 2017MAY
# Copyright: (C) 2017 by Freakalytics, LLC
# License: MIT
floor_datetime <- function(date_var, floor_seconds = 60,
                           origin = "1970-01-01") { # defaults to minute rounding
  if(!inherits(date_var, "POSIXct")) stop("Please pass in a POSIXct variable")
  # floor() propagates NA, so no separate NA check is needed and the
  # function stays vectorized over columns with more than one row
  as.POSIXct(floor(as.numeric(date_var) / floor_seconds) * floor_seconds,
             origin = origin)
}
Sample output:
test <- data.frame(good = as.POSIXct(Sys.time()),
bad1 = as.Date(Sys.time()),
bad2 = as.POSIXct(NA))
test$good_15 <- floor_datetime(test$good, 15 * 60)
test$bad1_15 <- floor_datetime(test$bad1, 15 * 60)
Error in floor_datetime(test$bad1, 15 * 60) :
Please pass in a POSIXct variable
test$bad2_15 <- floor_datetime(test$bad2, 15 * 60)
test
good bad1 bad2 good_15 bad2_15
1 2017-05-06 13:55:34.48 2017-05-06 <NA> 2017-05-06 13:45:00 <NA>
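As a sketch of how a floored timestamp can replace cut() as the grouping key, the example below aggregates per-minute counts into 15-minute totals. The helper floor_to_interval is a hypothetical stand-in written for this example, and UTC is an assumed time zone.

```r
# Hypothetical helper: floor POSIXct values to a multiple of `seconds`
floor_to_interval <- function(x, seconds) {
  stopifnot(inherits(x, "POSIXct"))
  as.POSIXct(floor(as.numeric(x) / seconds) * seconds,
             origin = "1970-01-01", tz = "UTC")
}

# 45 minutes of per-minute data with a count of 1 in every minute
start <- as.POSIXct("2016-05-01 00:00:00", tz = "UTC")
dat <- data.frame(time = start + 60 * (0:44), count = 1L)

# Group by the floored timestamp and sum the counts
dat$by15 <- floor_to_interval(dat$time, 15 * 60)
totals <- aggregate(count ~ by15, data = dat, FUN = sum)
totals
```

The grouping column stays a date-time rather than a factor, so it can be used in further date arithmetic without converting back.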
You can do it in one line by using the trs function from the foqat package, like this:
df_15mins=trs(df, "15 mins")
Below is a repeatable example:
library(foqat)
head(aqi[,c(1,2)])
# Time NO
#1 2017-05-01 01:00:00 0.0376578
#2 2017-05-01 01:01:00 0.0341483
#3 2017-05-01 01:02:00 0.0310285
#4 2017-05-01 01:03:00 0.0357016
#5 2017-05-01 01:04:00 0.0337507
#6 2017-05-01 01:05:00 0.0238120
#mean
aqi_15mins=trs(aqi[,c(1,2)], "15 mins")
head(aqi_15mins)
# Time NO
#1 2017-05-01 01:00:00 0.02736549
#2 2017-05-01 01:15:00 0.03244958
#3 2017-05-01 01:30:00 0.03743626
#4 2017-05-01 01:45:00 0.02769419
#5 2017-05-01 02:00:00 0.02901817
#6 2017-05-01 02:15:00 0.03439455
I am trying to subtract one hour from date/times within a POSIXct column that are earlier than or equal to a time stated in a different comparison dataframe for that particular ID.
For example:
#create sample data
Time<-as.POSIXct(c("2015-10-02 08:00:00","2015-11-02 11:00:00","2015-10-11 10:00:00","2015-11-11 09:00:00","2015-10-24 08:00:00","2015-10-27 08:00:00"), format = "%Y-%m-%d %H:%M:%S")
ID<-c(01,01,02,02,03,03)
data<-data.frame(Time,ID)
Which produces this:
Time ID
1 2015-10-02 08:00:00 1
2 2015-11-02 11:00:00 1
3 2015-10-11 10:00:00 2
4 2015-11-11 09:00:00 2
5 2015-10-24 08:00:00 3
6 2015-10-27 08:00:00 3
I then have another dataframe with a key date and time for each ID to compare against. The Time in data should be compared against Comparison in ComparisonData for the particular ID it is associated with. If the Time value in data is earlier than or equal to the comparison value, one hour should be subtracted from the value in data:
#create sample comparison data
Comparison<-as.POSIXct(c("2015-10-29 08:00:00","2015-11-02 08:00:00","2015-10-26 08:30:00"), format = "%Y-%m-%d %H:%M:%S")
ID<-c(01,02,03)
ComparisonData<-data.frame(Comparison,ID)
This should look like this:
Comparison ID
1 2015-10-29 08:00:00 1
2 2015-11-02 08:00:00 2
3 2015-10-26 08:30:00 3
In summary, the code should check all times of a certain ID to see if any are earlier than or equal to the value specified in ComparisonData and if they are, subtract one hour. This should give this data frame as an output:
Time ID
1 2015-10-02 07:00:00 1
2 2015-11-02 11:00:00 1
3 2015-10-11 09:00:00 2
4 2015-11-11 09:00:00 2
5 2015-10-24 07:00:00 3
6 2015-10-27 08:00:00 3
I have looked at similar solutions such as this but I cannot understand how to also check the times using the right timing with that particular ID.
I think ddply seems quite a promising option but I'm not sure how to use it for this particular problem.
Here's a quick and efficient solution using data.table. First we join the two data sets by ID, and then we modify only the Times that are earlier than or equal to Comparison.
library(data.table) # v1.9.6+
setDT(data)[ComparisonData, end := i.Comparison, on = "ID"]
data[Time <= end, Time := Time - 3600L][, end := NULL]
data
# Time ID
# 1: 2015-10-02 07:00:00 1
# 2: 2015-11-02 11:00:00 1
# 3: 2015-10-11 09:00:00 2
# 4: 2015-11-11 09:00:00 2
# 5: 2015-10-24 07:00:00 3
# 6: 2015-10-27 08:00:00 3
Alternatively, we could do this in one step while joining, using ifelse (not sure how efficient this is, though):
setDT(data)[ComparisonData,
Time := ifelse(Time <= i.Comparison,
Time - 3600L, Time),
on = "ID"]
data
# Time ID
# 1: 2015-10-02 07:00:00 1
# 2: 2015-11-02 11:00:00 1
# 3: 2015-10-11 09:00:00 2
# 4: 2015-11-11 09:00:00 2
# 5: 2015-10-24 07:00:00 3
# 6: 2015-10-27 08:00:00 3
I am sure there is going to be a better solution than this, however, I think this works.
for(i in 1:nrow(data)) {
  # use <= so that times equal to the comparison value are shifted too
  if(data$Time[i] <= ComparisonData[data$ID[i], 1]){
    data$Time[i] <- data$Time[i] - 3600
  }
}
# Time ID
#1 2015-10-02 07:00:00 1
#2 2015-11-02 11:00:00 1
#3 2015-10-11 09:00:00 2
#4 2015-11-11 09:00:00 2
#5 2015-10-24 07:00:00 3
#6 2015-10-27 08:00:00 3
This iterates through every row in data.
ComparisonData[data$ID[i], 1] gets the Comparison column of ComparisonData at row number data$ID[i]. This works here only because the IDs 1, 2, 3 coincide with the row numbers of ComparisonData. If the Time value in data is earlier than or equal to that comparison time, it is reduced by 1 hour.
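The same comparison can be sketched without a loop, using match() to look up each row's comparison time by ID rather than by row position. This is a base R sketch under the assumption that every ID in data appears exactly once in ComparisonData. The names cmp and shift are introduced here for illustration, and UTC is an assumed time zone.

```r
Time <- as.POSIXct(c("2015-10-02 08:00:00", "2015-11-02 11:00:00",
                     "2015-10-11 10:00:00", "2015-11-11 09:00:00",
                     "2015-10-24 08:00:00", "2015-10-27 08:00:00"), tz = "UTC")
data <- data.frame(Time, ID = c(1, 1, 2, 2, 3, 3))

Comparison <- as.POSIXct(c("2015-10-29 08:00:00", "2015-11-02 08:00:00",
                           "2015-10-26 08:30:00"), tz = "UTC")
ComparisonData <- data.frame(Comparison, ID = c(1, 2, 3))

# Comparison time for every row of data, aligned by ID
cmp <- ComparisonData$Comparison[match(data$ID, ComparisonData$ID)]

# Rows whose Time is earlier than or equal to their comparison time
shift <- data$Time <= cmp

# Subtract one hour (3600 seconds) from those rows only
data$Time[shift] <- data$Time[shift] - 3600
data
```

Because the lookup goes through the ID values, this does not depend on the IDs being 1, 2, 3 in row order.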