Simple manipulation on large data frame in R

I have a relatively large data frame with roughly 40 million rows and 12 columns; part of it is shown below. It holds 3-hour averaged ozone data for counties in the US. Each row represents one county on one day (from 19800101 to 20161231, for 3108 counties). Note that the data file is 7.05 GB.
index date state.fips county.fp X07.30 X10.30 X13.30 X16.30 X19.30 X21.30 X01.30 X04.30
1 01001 1980-01-01 01 001 29.98488 29.47778 29.12294 29.98976 31.69830 31.56405 30.48744 29.62118
2 01001 1980-01-02 01 001 29.03014 28.75464 28.58736 30.26555 32.39263 32.43746 31.70940 31.14960
3 01001 1980-01-03 01 001 30.69475 30.19832 29.68841 30.28920 31.61882 31.43047 31.01369 30.58366
4 01001 1980-01-04 01 001 30.20852 29.69874 29.47550 30.55639 32.62610 34.47959 35.54881 35.78104
5 01001 1980-01-05 01 001 35.80190 35.69129 35.89026 38.51287 39.82833 39.49016 38.73464 38.09185
6 01001 1980-01-06 01 001 37.32787 36.55899 35.96070 36.62670 37.03226 36.71239 35.86387 35.05945
The problem is that the times in the column names are in UTC, and I need to convert them to US local time. There are four time zones in the contiguous US, namely the Eastern, Central, Mountain, and Pacific time zones. Yes, I only cover the contiguous US. How should I start this manipulation?
Also, please note that the original data file is large (7.05 GB), so we may run into out-of-memory errors. I am working on a laptop with 16 GB RAM.
Below is my code so far. However, I don't know how to add dplyr::case_when to adjust the time zones.
names(ozone) <- gsub("^X","", names(ozone)) # get rid of X in columns names
ozone <- pivot_longer(ozone, cols = c('01.30','04.30','07.30',
'10.30','13.30','16.30','19.30','21.30'),
names_to = 'time', values_to = 'ozone_val')
ozone$date <- ymd(ozone$date) # convert to date format
ozone$date = as.POSIXct(paste(ozone$date, ozone$time),
format = "%Y-%m-%d %H.%M",
tz = 'UTC')
ozone$date <- with_tz(ozone$date, "America/New_York") # how to apply case_when here
ozone$time <- substr(ozone$date, 12, 19)
ozone$year.day <- substr(ozone$date, 1, 10)
ozone <- subset(ozone, select = -date)
ozone_1 <- pivot_wider(ozone, id_cols = c('index','state.fips','county.fp','year.day'),
names_from = 'time', values_from = 'ozone_val')
Any ideas?

This should get you started, but you'll need to post a more complete reproducible example and/or some more info about what exactly you are looking for. You should be able to use this general framework as long as you do not run out of memory (e.g., you could use something like dplyr::case_when() to assign the time zone based on the state, or subset by state after making the column POSIXct). Hope this gets you started!
Also, I am happy to explain anything that is unclear!
library(data.table)
setDT(data)
setnames(data, gsub("^X", "", names(data))) # drop the leading X from the time columns
dt <- melt(data, id.vars = c("index", "date", "state.fips", "county.fp"),
variable.name = "time", value.name = "ozone_val")
# the source times are UTC, so parse them as UTC; convert to local time afterwards
dt[, date := as.POSIXct(paste(as.character(date), time),
format = "%Y-%m-%d %H.%M",
tz = "UTC")]
print(dt, nrows = 10)
index date state.fips county.fp time ozone_val
1: 1001 1980-01-01 07:30:00 1 1 07.30 29.98488
2: 1001 1980-01-02 07:30:00 1 1 07.30 29.03014
3: 1001 1980-01-03 07:30:00 1 1 07.30 30.69475
4: 1001 1980-01-04 07:30:00 1 1 07.30 30.20852
5: 1001 1980-01-05 07:30:00 1 1 07.30 35.80190
---
44: 1001 1980-01-02 04:30:00 1 1 04.30 31.14960
45: 1001 1980-01-03 04:30:00 1 1 04.30 30.58366
46: 1001 1980-01-04 04:30:00 1 1 04.30 35.78104
47: 1001 1980-01-05 04:30:00 1 1 04.30 38.09185
48: 1001 1980-01-06 04:30:00 1 1 04.30 35.05945
Data:
data <- read.table(header = T, text = "index date state.fips county.fp X07.30 X10.30 X13.30 X16.30 X19.30 X21.30 X01.30 X04.30
1 01001 1980-01-01 01 001 29.98488 29.47778 29.12294 29.98976 31.69830 31.56405 30.48744 29.62118
2 01001 1980-01-02 01 001 29.03014 28.75464 28.58736 30.26555 32.39263 32.43746 31.70940 31.14960
3 01001 1980-01-03 01 001 30.69475 30.19832 29.68841 30.28920 31.61882 31.43047 31.01369 30.58366
4 01001 1980-01-04 01 001 30.20852 29.69874 29.47550 30.55639 32.62610 34.47959 35.54881 35.78104
5 01001 1980-01-05 01 001 35.80190 35.69129 35.89026 38.51287 39.82833 39.49016 38.73464 38.09185
6 01001 1980-01-06 01 001 37.32787 36.55899 35.96070 36.62670 37.03226 36.71239 35.86387 35.05945")
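One possible way to do the per-state conversion is sketched below in base R. The tz_lookup table is hypothetical and lists only three states for illustration; a real table would need every state, and states that straddle two zones would need a county-level table. A POSIXct vector carries a single time zone attribute, so this sketch stores the local time as text, formatted once per zone.

```r
# Hypothetical lookup from state FIPS code to Olson time zone name
# (illustration only: three states, not a complete table).
tz_lookup <- c("01" = "America/Chicago",      # Alabama
               "06" = "America/Los_Angeles",  # California
               "36" = "America/New_York")     # New York

obs <- data.frame(state.fips = c("01", "06", "36"),
                  stringsAsFactors = FALSE)
obs$utc <- as.POSIXct(rep("1980-01-01 07:30", 3), tz = "UTC")
obs$tz  <- unname(tz_lookup[obs$state.fips])

# Format the UTC instants zone by zone; each pass handles all rows of one zone,
# so the loop runs at most once per time zone, not once per row.
obs$local <- NA_character_
for (z in unique(obs$tz)) {
  rows <- obs$tz == z
  obs$local[rows] <- format(obs$utc[rows], "%Y-%m-%d %H:%M", tz = z)
}
obs
```

On the full data the same loop would run four times, and the local date and time can then be split out of obs$local with substr() before reshaping back to wide format.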

Related

How to insert missing dates/times using R based on criteria?

I have a data frame like the one below. Three staff members take hourly readings each day, but the readings are incomplete (every staff member should have 24 readings a day).
The staff members have different numbers of readings on each day, and I am only interested in the one with the most readings on a given day.
There are many days. I want to insert the missing hourly rows only for the staff member with the most readings on each day. That is, insert rows for 2018-03-02 only for Jack, for 2018-03-03 only for David, and for 2018-03-04 only for Kate.
I tried the lines from this question (even though they fill everything without differentiation), but I am not getting there.
How can it be done in R?
date_time <- c("2/3/2018 0:00","2/3/2018 1:00","2/3/2018 2:00","2/3/2018 3:00","2/3/2018 5:00","2/3/2018 6:00","2/3/2018 7:00","2/3/2018 8:00","2/3/2018 9:00","2/3/2018 10:00","2/3/2018 11:00","2/3/2018 12:00","2/3/2018 13:00","2/3/2018 14:00","2/3/2018 16:00","2/3/2018 17:00","2/3/2018 18:00","2/3/2018 19:00","2/3/2018 21:00","2/3/2018 22:00","2/3/2018 23:00","3/3/2018 0:00","3/3/2018 0:00","3/3/2018 1:00","3/3/2018 2:00","3/3/2018 4:00","3/3/2018 5:00","3/3/2018 7:00","3/3/2018 8:00","3/3/2018 9:00","3/3/2018 11:00","3/3/2018 12:00","3/3/2018 14:00","3/3/2018 15:00","3/3/2018 17:00","3/3/2018 18:00","3/3/2018 20:00","3/3/2018 22:00","3/3/2018 23:00","4/3/2018 0:00","4/3/2018 0:00","4/3/2018 1:00","4/3/2018 2:00","4/3/2018 3:00","4/3/2018 5:00","4/3/2018 6:00","4/3/2018 7:00","4/3/2018 8:00","4/3/2018 10:00","4/3/2018 11:00","4/3/2018 12:00","4/3/2018 14:00","4/3/2018 15:00","4/3/2018 16:00","4/3/2018 17:00","4/3/2018 19:00","4/3/2018 20:00","4/3/2018 22:00","4/3/2018 23:00")
staff <- c("Jack","Jack","Kate","Jack","Jack","Jack","Jack","Jack","Jack","Jack","Jack","Jack","Kate","Jack","Jack","Jack","David","David","Jack","Kate","David","David","David","David","David","David","David","David","David","David","David","David","David","David","David","David","David","Jack","Kate","David","David","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Jack")
reading <- c(7.5,8.3,7,6.9,7.1,8.1,8.4,8.8,6,7.1,8.9,7.3,7.4,6.9,11.3,18.8,4.6,6.7,7.7,7.8,7,7,6.6,6.8,6.7,6.1,7.1,6.3,7.2,6,5.8,6.6,6.5,6.4,7.2,8.4,6.5,6.5,5.5,6.7,7,7.5,6.5,7.5,7.2,6.3,7.3,8,7,8.2,6.5,6.8,7.5,7,6.1,5.7,6.7,4.3,6.3)
df <- data.frame(date_time, staff, reading)
One option is to do this in separate steps: create a data.table of the dates of interest and the corresponding 'staff', expand it to the full hourly sequence, rbind it with the original dataset, and then summarise reading by staff and date_time with a condition that handles the inserted NA rows.
library(data.table)
stf <- c("Jack", "David", "Kate")
date <- as.Date(c("2018-03-02", "2018-03-03", "2018-03-04"))
df1 <- data.table(date, staff= stf)[, .(date_time = seq(as.POSIXct(paste(date, "00:00:00"),
tz = "GMT"),
length.out = 24, by = "1 hour")), staff]
setDT(df)[, date_time := as.POSIXct(date_time, "%d/%m/%Y %H:%M", tz = "GMT")]
res <- rbindlist(list(df, df1), fill = TRUE)[,
.(reading = if(any(is.na(reading))) sum(reading, na.rm = TRUE) else reading),
.(staff, date_time)]
table(res$staff, as.Date(res$date_time))
# 2018-03-02 2018-03-03 2018-03-04
# David 3 24 2
# Jack 24 1 1
# Kate 3 1 24
head(res)
# staff date_time reading
#1: Jack 2018-03-02 00:00:00 7.5
#2: Jack 2018-03-02 01:00:00 8.3
#3: Kate 2018-03-02 02:00:00 7.0
#4: Jack 2018-03-02 03:00:00 6.9
#5: Jack 2018-03-02 05:00:00 7.1
#6: Jack 2018-03-02 06:00:00 8.1
tail(res)
# staff date_time reading
#1: Kate 2018-03-04 04:00:00 0
#2: Kate 2018-03-04 09:00:00 0
#3: Kate 2018-03-04 13:00:00 0
#4: Kate 2018-03-04 18:00:00 0
#5: Kate 2018-03-04 21:00:00 0
#6: Kate 2018-03-04 23:00:00 0
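The answer above types in the staff member for each day by hand. With many days, that pairing can be derived from the data. The following is a minimal base R sketch on a small made-up sample (the six rows are illustrative, not the question's data); ties are resolved in favour of the first name in alphabetical order, which is an assumption.

```r
# Small illustrative sample in the question's d/m/Y format
df <- data.frame(
  date_time = c("2/3/2018 0:00", "2/3/2018 1:00", "2/3/2018 2:00",
                "3/3/2018 0:00", "3/3/2018 1:00", "3/3/2018 4:00"),
  staff = c("Jack", "Jack", "Kate", "David", "Kate", "David"),
  stringsAsFactors = FALSE
)

# Calendar day of each reading (the trailing time part is ignored by the format)
df$day <- as.Date(df$date_time, format = "%d/%m/%Y")

# Count readings per day and staff member, then keep the top one per day
counts <- as.data.frame(table(day = df$day, staff = df$staff),
                        stringsAsFactors = FALSE)
top <- do.call(rbind, lapply(split(counts, counts$day),
                             function(d) d[which.max(d$Freq), ]))
top
```

The day and staff columns of top could then replace the hand-written date and stf vectors.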
Try this code:
Identify every hour in the date range and all staff members (parse the strings first, so that min() and max() compare times rather than text)
date_p<-as.POSIXct(date_time,format="%d/%m/%Y %H:%M",tz="GMT")
date_h<-seq(min(date_p),max(date_p),by="hour")
staff_u<-unique(staff)
comb<-expand.grid(staff_u,date_h)
colnames(comb)<-c("staff","date_time")
Use the same date-time class and time zone in df
df$date_time<-as.POSIXct(df$date_time,format="%d/%m/%Y %H:%M",tz="GMT")
Merge information
out<-merge(comb,df,all.x=TRUE)
Your output:
head(out)
staff date_time reading
1 Jack 2018-03-02 00:00:00 7.5
2 Jack 2018-03-02 01:00:00 8.3
3 Jack 2018-03-02 02:00:00 NA
4 Jack 2018-03-02 03:00:00 6.9
5 Jack 2018-03-02 04:00:00 NA
6 Jack 2018-03-02 05:00:00 7.1

R Sum rows by hourly rate

I'm getting started with R, so please bear with me
For example, I have this data.table (or data.frame) object :
Time Station count_starts count_ends
01/01/2015 00:30 A 2 3
01/01/2015 00:40 A 2 1
01/01/2015 00:55 B 1 1
01/01/2015 01:17 A 3 1
01/01/2015 01:37 A 1 1
My end goal is to group the "Time" column to hourly and sum the count_starts and count_ends based on the hourly time and station :
Time Station sum(count_starts) sum(count_ends)
01/01/2015 01:00 A 4 4
01/01/2015 01:00 B 1 1
01/01/2015 02:00 A 4 2
I did some research and found out that I should use the xts library.
Thanks for helping me out
UPDATE :
I converted the type of transactions$Time to POSIXct, so the xts package should be able to use the timeseries directly.
This can also be done in base R. The only difference is that each bucket is labelled with the start of its hour, so the hour shown is one less than in the requested output:
dat=read.table(text = "Time Station count_starts count_ends
'01/01/2015 00:30' A 2 3
'01/01/2015 00:40' A 2 1
'01/01/2015 00:55' B 1 1
'01/01/2015 01:17' A 3 1
'01/01/2015 01:37' A 1 1",
header = TRUE, stringsAsFactors = FALSE)
dat$Time=cut(strptime(dat$Time,"%m/%d/%Y %H:%M"),"hour")
aggregate(.~Time+Station,dat,sum)
Time Station count_starts count_ends
1 2015-01-01 00:00:00 A 4 4
2 2015-01-01 01:00:00 A 4 2
3 2015-01-01 00:00:00 B 1 1
You can use the order function to rearrange the table or even the sort.POSIXlt function:
m=aggregate(.~Time+Station,dat,sum)
m[order(m[,1]),]
Time Station count_starts count_ends
1 2015-01-01 00:00:00 A 4 4
3 2015-01-01 00:00:00 B 1 1
2 2015-01-01 01:00:00 A 4 2
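The one-hour difference from the requested output comes from labelling each bucket by its start instead of its end. A minimal sketch of both labels in base R, assuming the timestamps are in UTC:

```r
x <- as.POSIXct(c("2015-01-01 00:30", "2015-01-01 00:40", "2015-01-01 00:55",
                  "2015-01-01 01:17", "2015-01-01 01:37"), tz = "UTC")

# Start of the hour each timestamp falls in
hour_start <- as.POSIXct(trunc(x, units = "hours"))

# End of that hour; note that a timestamp exactly on the hour
# is also pushed to the next hour by this arithmetic
hour_end <- hour_start + 3600

format(hour_end, "%Y-%m-%d %H:%M", tz = "UTC")
```

Grouping on hour_end instead of the cut() result should give the 01:00 and 02:00 labels from the question.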
A solution using dplyr and lubridate. The key is to use ceiling_date to convert the date time column to hourly time-step, and then group and summarize the data.
library(dplyr)
library(lubridate)
dt2 <- dt %>%
mutate(Time = mdy_hm(Time)) %>%
mutate(Time = ceiling_date(Time, unit = "hour")) %>%
group_by(Time, Station) %>%
summarise(`sum(count_starts)` = sum(count_starts),
`sum(count_ends)` = sum(count_ends)) %>%
ungroup()
dt2
# # A tibble: 3 x 4
# Time Station `sum(count_starts)` `sum(count_ends)`
# <dttm> <chr> <int> <int>
# 1 2015-01-01 01:00:00 A 4 4
# 2 2015-01-01 01:00:00 B 1 1
# 3 2015-01-01 02:00:00 A 4 2
DATA
dt <- read.table(text = "Time Station count_starts count_ends
'01/01/2015 00:30' A 2 3
'01/01/2015 00:40' A 2 1
'01/01/2015 00:55' B 1 1
'01/01/2015 01:17' A 3 1
'01/01/2015 01:37' A 1 1",
header = TRUE, stringsAsFactors = FALSE)
Explanation
mdy_hm converts the string to a date-time class. The name means "month-day-year hour-minute" and has to match the structure of the string. ceiling_date rounds a date-time object up to the unit specified. group_by groups the data by the given variables, and summarise computes the summaries per group.
There are basically two things required:
1) Round the Time down to the start of its 1-hour window:
library(data.table)
library(lubridate)
data=data.table(Time=c('01/01/2015 00:30','01/01/2015 00:40','01/01/2015 00:55','01/01/2015 01:17','01/01/2015 01:37'),Station=c('A','A','B','A','A'),count_starts=c(2,2,1,3,1),count_ends=c(3,1,1,1,1))
data[,Time_conv:=as.POSIXct(strptime(Time,'%d/%m/%Y %H:%M'))]
data[,Time_round:=floor_date(Time_conv,unit="1 hour")]
2) Aggregate the data table obtained above by hour and station to get the desired result:
New_data=data[,list(count_starts_sum=sum(count_starts),count_ends_sum=sum(count_ends)),by=list(Time_round,Station)]

Finding each time of daily max variable in climate data

I have a large dataset spanning many years with several variables, but the ones I am interested in are wind speed and dateTime. I want to find the time of the max wind speed for every day in the data set. I have hourly data in POSIXct format, with WS as a numeric with occasional NAs. Below is a short data set that should illustrate my point; its dateTime did not come out as hourly data, but it is enough for a sample.
dateTime <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
as.POSIXct("2011-01-29 23:00:00", tz = "GMT"),
by = 60*24)
WS <- sample(0:20,length(dateTime),rep=TRUE) # one value per timestamp
WD <- sample(0:390,length(dateTime),rep=TRUE)
Temp <- sample(0:40,length(dateTime),rep=TRUE)
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[WS>15] <- NA
I have previously tried creating a new column with just the date (without the time) to isolate each day. However, everything I have tried (aggregate, splitting, xts) has only returned a shortened data frame with date and WS. Aggregate was the only one that kept a time, but it gave me 23:00:00 as a constant time, which isn't correct.
I have looked at How to calculate daily means, medians, from weather variables data collected hourly in R?, https://stats.stackexchange.com/questions/7268/how-to-aggregate-by-minute-data-for-a-week-into-hourly-means and others but none have answered this question, or the solutions have not returned an ideal result.
I need to compare the results of this analysis with another data frame, so hence the reason I need the actual time when the max wind speed occurred for each day in the dataset. I have a feeling there is a simple solution, however, this has me frustrated.
A dplyr solution may be:
library(dplyr)
df %>%
mutate(date = as.Date(dateTime)) %>%
left_join(
df %>%
mutate(date = as.Date(dateTime)) %>%
group_by(date) %>%
summarise(max_ws = max(WS, na.rm = TRUE)) %>%
ungroup(),
by = "date"
) %>%
select(-date)
# dateTime WS WD Temp max_ws
# 1 2011-01-01 00:00:00 NA 313 2 15
# 2 2011-01-01 00:24:00 7 376 1 15
# 3 2011-01-01 00:48:00 3 28 28 15
# 4 2011-01-01 01:12:00 15 262 24 15
# 5 2011-01-01 01:36:00 1 149 34 15
# 6 2011-01-01 02:00:00 4 319 33 15
# 7 2011-01-01 02:24:00 15 280 22 15
# 8 2011-01-01 02:48:00 NA 110 23 15
# 9 2011-01-01 03:12:00 12 93 15 15
# 10 2011-01-01 03:36:00 3 5 0 15
Dee asked for: "I want to find the time of the max wind speed for every day in the data set." Other answers have calculated the max(WS) for every day, but not the hour at which it occurred.
So I propose the following solution with dplyr:
library(dplyr)
library(lubridate) # provides hour()
set.seed(12345)
dateTime <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
as.POSIXct("2011-01-29 23:00:00", tz = "GMT"),
by = 60*24)
WS <- sample(0:20,1738,rep=TRUE)
WD <- sample(0:390,1738,rep=TRUE)
Temp <- sample(0:40,1738,rep=TRUE)
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[WS>15] <- NA
df %>%
group_by(Date = as.Date(dateTime)) %>%
mutate(Hour = hour(dateTime),
Hour_with_max_ws = Hour[which.max(WS)])
Note that if several hours share the same maximum wind speed (15 in the example above), only the first of them is returned, even though a wind speed of 15 was reached on that date at hours 0, 3, 4, 21 and 22. You might therefore need more specific logic for ties.
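To get one row per day that carries the timestamp of the maximum, the same which.max() idea can be applied per day in base R. This is a minimal sketch on a small deterministic sample (the WS values are made up); like the code above, it keeps the first maximum when there are ties.

```r
# Two days of hourly data with some missing wind speeds
dateTime <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
                by = "hour", length.out = 48)
WS <- rep(c(1, 5, NA, 9, 3, 2), 8)
WS[40] <- 20                      # a clear maximum on day 2 at 15:00
df <- data.frame(dateTime, WS)

# Split by calendar day and keep the row holding the daily maximum.
# which.max() skips NA values; a day with only NA values would drop out.
df$day <- as.Date(df$dateTime)
daily <- do.call(rbind, lapply(split(df, df$day),
                               function(d) d[which.max(d$WS), ]))
daily
```

The resulting dateTime column holds the actual time of each daily maximum and can be joined to another data frame by day.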
For the sake of completeness (and because I like the concise code) here is a "one-liner" using data.table:
library(data.table)
setDT(df)[, max.ws := max(WS, na.rm = TRUE), by = as.IDate(dateTime)][]
dateTime WS WD Temp max.ws
1: 2011-01-01 00:00:00 NA 293 22 15
2: 2011-01-01 00:24:00 15 55 14 15
3: 2011-01-01 00:48:00 NA 186 24 15
4: 2011-01-01 01:12:00 4 300 22 15
5: 2011-01-01 01:36:00 0 120 36 15
---
1734: 2011-01-29 21:12:00 12 249 5 15
1735: 2011-01-29 21:36:00 9 282 21 15
1736: 2011-01-29 22:00:00 12 238 6 15
1737: 2011-01-29 22:24:00 10 127 21 15
1738: 2011-01-29 22:48:00 13 297 0 15

Vectorising iterative operation across rows

I've seen a lot of questions on here about vectorising for loops, but couldn't find any that involve vectorising a for loop to populate a cell based on the value of a cell in a row below (apologies if I'm just being blind though...).
I have a dataframe with 1.6 million rows of salaries and the date each person started earning that salary. Each person can have multiple salaries, and so multiple rows, each with a different date that it was updated.
Code for a dummy dataset is as follows:
df1 <- data.frame("id" = c(1,1,2,2,3,3,4,4,5,5,6,6),
"salary" = c(15456,16594,
17364,34564,
34525,33656,
23464,23467,
16794,27454,
40663,42743),
"start_date" = sample(seq(as.Date('2016/01/01'),as.Date(Sys.Date()), by="day"), 12))
df1 <- df1[order(df1$id,df1$start_date),]
I want to create a column with an end date for each salary, which is calculated as the day before the subsequent salary entry. If there is no subsequent salary entry, then it's set as today's date. This is my code, including a for loop, to do that:
df1$end_date <- Sys.Date()
for (i in 1:(nrow(df1)-1)){
if(df1[i,1]== df1[i+1,1]){
df1[i,4] <- df1[i+1,3]-1
}
print(i)
}
However, I know that for loops are not the most efficient way, but how would I go about vectorising this?
Using the dplyr package, you could do:
library(dplyr)
df1 %>%
group_by(id) %>%
mutate(end_date=lead(start_date-1,default=Sys.Date()))
Which returns:
id salary start_date end_date
<dbl> <dbl> <date> <date>
1 1 15456 2016-02-14 2016-03-02
2 1 16594 2016-03-03 2017-05-22
3 2 17364 2016-01-17 2016-11-28
4 2 34564 2016-11-29 2017-05-22
5 3 33656 2016-08-17 2016-11-25
6 3 34525 2016-11-26 2017-05-22
7 4 23464 2016-01-20 2017-05-05
8 4 23467 2017-05-06 2017-05-22
9 5 27454 2016-02-29 2016-12-15
10 5 16794 2016-12-16 2017-05-22
11 6 42743 2016-03-14 2017-01-29
12 6 40663 2017-01-30 2017-05-22
You can use library(data.table). Subtract 1 so that each end date is the day before the next start date:
setDT(df1)[, end_date := shift(start_date - 1, type = "lead", fill = Sys.Date()), by = id][]
With data.table and shift, you can use below:
library(data.table)
df1 <- data.table("id" = c(1,1,2,2,3,3,4,4,5,5,6,6),
"salary" = c(15456,16594,
17364,34564,
34525,33656,
23464,23467,
16794,27454,
40663,42743),
"start_date" = sample(seq(as.Date('2016/01/01'),as.Date(Sys.Date()), by="day"), 12))
df1 <- df1[order(id,start_date),]
df1[, EndDate := shift(start_date - 1, type="lead"), by = id] # day before the next start date
df1[is.na(EndDate), EndDate := Sys.Date()]
If I understand your question, the following base R code will work.
df1$end <- ave(df1$start_date, df1$id, FUN=function(x) c(tail(x, -1) - 1, Sys.Date()))
ave is used to perform the group-level operation. The function takes the second through final start dates and subtracts 1 from each. The result is concatenated with today's date, which becomes the end date of the last entry in each group.
This returns
df1
id salary start_date end
1 1 15456 2016-03-20 2016-12-06
2 1 16594 2016-12-07 2017-05-22
3 2 17364 2016-10-17 2016-07-27
4 2 34564 2016-07-28 2017-05-22
5 3 34525 2016-05-26 2016-05-01
6 3 33656 2016-05-02 2017-05-22
7 4 23464 2017-04-17 2016-01-19
8 4 23467 2016-01-20 2017-05-22
9 5 16794 2016-09-12 2016-05-06
10 5 27454 2016-05-07 2017-05-22
11 6 40663 2016-10-03 2016-03-28
12 6 42743 2016-03-29 2017-05-22
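The loop in the question compares each row with the one below it, and that comparison can be written directly on shifted vectors without any grouping function. A minimal base R sketch follows, assuming the data is already sorted by id and start_date; the sample rows and the fixed today value (standing in for Sys.Date()) are made up for reproducibility.

```r
df1 <- data.frame(
  id = c(1, 1, 2, 2, 2),
  start_date = as.Date(c("2016-01-10", "2016-06-01",
                         "2016-02-01", "2016-03-15", "2017-01-01"))
)
today <- as.Date("2017-05-22")    # fixed stand-in for Sys.Date()

n <- nrow(df1)
# TRUE where the next row belongs to the same person (never for the last row)
next_same  <- c(df1$id[-n] == df1$id[-1], FALSE)
# Start date of the next row; the padding value for the last row is never used
next_start <- c(df1$start_date[-1], today)

# Default to today, then overwrite where a later salary entry exists
df1$end_date <- today
df1$end_date[next_same] <- next_start[next_same] - 1
df1
```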

Using a rolling time interval to count rows in R and dplyr

Let's say I have a dataframe of timestamps with the corresponding number of tickets sold at that time.
Timestamp ticket_count
(time) (int)
1 2016-01-01 05:30:00 1
2 2016-01-01 05:32:00 1
3 2016-01-01 05:38:00 1
4 2016-01-01 05:46:00 1
5 2016-01-01 05:47:00 1
6 2016-01-01 06:07:00 1
7 2016-01-01 06:13:00 2
8 2016-01-01 06:21:00 1
9 2016-01-01 06:22:00 1
10 2016-01-01 06:25:00 1
I want to know how to calculate the number of tickets sold within a certain time frame of all tickets. For example, I want to calculate the number of tickets sold up to 15 minutes after all tickets. In this case, the first row would have three tickets, the second row would have four tickets, etc.
Ideally, I'm looking for a dplyr solution, as I want to do this for multiple stores with a group_by() function. However, I'm having a little trouble figuring out how to hold each Timestamp fixed for a given row while simultaneously searching through all Timestamps via dplyr syntax.
In the current development version of data.table, v1.9.7, non-equi joins are implemented. Assuming your data.frame is called df and the Timestamp column is POSIXct type:
require(data.table) # v1.9.7+
window = 15L # minutes
(counts = setDT(df)[.(t=Timestamp+window*60L), on=.(Timestamp<t),
.(counts=sum(ticket_count)), by=.EACHI]$counts)
# [1] 3 4 5 5 5 9 11 11 11 11
# add that as a column to original data.table by reference
df[, counts := counts]
For each row in t, all rows where df$Timestamp < that_row are fetched, and by=.EACHI instructs the expression sum(ticket_count) to run for each row in t. That gives your desired result.
Hope this helps.
This is a simpler version of the ugly one I wrote earlier..
# install.packages('dplyr')
library(dplyr)
your_data %>%
mutate(timestamp = as.POSIXct(timestamp, format = '%m/%d/%Y %H:%M'),
ticket_count = as.numeric(ticket_count)) %>%
mutate(window = cut(timestamp, '15 min')) %>%
group_by(window) %>%
dplyr::summarise(tickets = sum(ticket_count))
window tickets
(fctr) (dbl)
1 2016-01-01 05:30:00 3
2 2016-01-01 05:45:00 2
3 2016-01-01 06:00:00 3
4 2016-01-01 06:15:00 3
Here is a solution using data.table. Also incorporating different stores.
Example data:
library(data.table)
dt <- data.table(Timestamp = as.POSIXct("2016-01-01 05:30:00")+seq(60,120000,by=60),
ticket_count = sample(1:9, 2000, T),
store = c(rep(c("A","B","C","D"), 500)))
Now apply the following:
ts <- dt$Timestamp
for(x in ts) {
end <- x+900
dt[Timestamp <= end & Timestamp >= x ,CS := sum(ticket_count),by=store]
}
This gives you
Timestamp ticket_count store CS
1: 2016-01-01 05:31:00 3 A 13
2: 2016-01-01 05:32:00 5 B 20
3: 2016-01-01 05:33:00 3 C 19
4: 2016-01-01 05:34:00 7 D 12
5: 2016-01-01 05:35:00 1 A 15
---
1996: 2016-01-02 14:46:00 4 D 10
1997: 2016-01-02 14:47:00 9 A 9
1998: 2016-01-02 14:48:00 2 B 2
1999: 2016-01-02 14:49:00 2 C 2
2000: 2016-01-02 14:50:00 6 D 6
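For comparison, here is a minimal base R sketch of a window that is bounded on both sides: for every row it sums the tickets sold from that row's timestamp up to 15 minutes later. Treating the upper bound as inclusive is an assumption, chosen because it reproduces the "three tickets, then four tickets" counts described in the question. The loop is quadratic in the number of rows, so it is only meant to illustrate the logic.

```r
ts <- as.POSIXct(c("2016-01-01 05:30:00", "2016-01-01 05:32:00",
                   "2016-01-01 05:38:00", "2016-01-01 05:46:00",
                   "2016-01-01 05:47:00", "2016-01-01 06:07:00",
                   "2016-01-01 06:13:00", "2016-01-01 06:21:00",
                   "2016-01-01 06:22:00", "2016-01-01 06:25:00"), tz = "UTC")
ticket_count <- c(1, 1, 1, 1, 1, 1, 2, 1, 1, 1)
window <- 15 * 60                 # window length in seconds

# For each row, sum the tickets whose timestamp lies in [ts[i], ts[i] + window]
counts <- vapply(seq_along(ts), function(i) {
  in_window <- ts >= ts[i] & ts <= ts[i] + window
  sum(ticket_count[in_window])
}, numeric(1))
counts
# [1] 3 4 3 2 1 5 5 3 2 1
```

For several stores, the same function could be applied to each store's rows separately, for example via split() or inside a grouped operation.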
