Get aggregate sum of data by day and hour - r

The below is an example of the data I have.
        date    time size           filename day.of.week
1 2015-01-16 5:36:12 1577 01162015053400.xml      Friday
2 2015-01-16 5:38:09 2900 01162015053600.xml      Friday
3 2015-01-16 5:40:09 3130 01162015053800.xml      Friday
What I would like to do is sum up the size of the files for each hour.
I would like a resulting data table that looks like:
date hour size
2015-01-16 5 7607
2015-01-16 6 10000
So forth and so on.
But I can't quite seem to get the output I need.
I've tried ddply and aggregate, but I end up summing the entire day; I'm not sure how to break it down by the hour in the time column.
And I've got multiple days' worth of data, so it's not only for that one day. It runs from that day, almost every day, until yesterday.
Thanks!

The following should do the trick, assuming your example data are stored in a data frame called "test":
library(lubridate) # for hms and hour functions
test$time <- hms(test$time)
test$hour <- factor(hour(test$time))
library(dplyr)
test %>%
  select(-time) %>% # dplyr doesn't like this column for some reason
  group_by(date, hour) %>%
  summarise(size=sum(size))
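For a self-contained check, the same idea can be sketched on the three sample rows from the question. The data frame is rebuilt by hand here, the name result is only illustrative, and the .groups argument assumes dplyr 1.0 or later.

```r
library(dplyr)
library(lubridate)

# The three sample rows from the question, rebuilt by hand
test <- data.frame(
  date = as.Date(rep("2015-01-16", 3)),
  time = c("5:36:12", "5:38:09", "5:40:09"),
  size = c(1577, 2900, 3130)
)

result <- test %>%
  mutate(hour = hour(hms(time))) %>%  # extract the hour from the time string
  group_by(date, hour) %>%            # one group per day and hour
  summarise(size = sum(size), .groups = "drop")

result  # a single row: 2015-01-16, hour 5, size 7607
```

Because the grouping uses both date and hour, the same hour on different days stays in separate rows.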

You can use data.table:
library(data.table)
dt <- as.data.table(df) # df is your data frame
# Define a timestamp column.
dt[, timestamp := as.POSIXct(paste(date, time), format = "%Y-%m-%d %H:%M:%S")]
# Aggregate by hour: truncate each timestamp to the start of its hour and sum the sizes.
dt[, .(size = sum(size)), by = .(hour = as.POSIXct(trunc(timestamp, "hours")))]
The benefit is that data.table is blazing fast!

Use a compound group_by(date, hour), with hour extracted from the time column.
That will do it.

If you convert your date and time columns into a single POSIXct date-time column named when (similar to a previous answer, i.e. df$when <- as.POSIXct(strptime(paste(df$date, df$time), format = "%Y-%m-%d %H:%M:%S"))), you could use:
aggregate(df[c("size")], FUN=sum, by=list(d=as.POSIXct(trunc(df$when, "hour"))))
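If separate date and hour columns are wanted, as in the desired output in the question, a base R sketch along these lines may help. The fourth row and the names out and hour are made up for illustration, and tz = "UTC" is an assumption to keep the parsing independent of the local time zone.

```r
# Sample data: three rows modelled on the question plus one hypothetical row on another day
df <- data.frame(
  date = c("2015-01-16", "2015-01-16", "2015-01-16", "2015-01-17"),
  time = c("5:36:12", "5:38:09", "6:02:00", "5:10:00"),
  size = c(1577, 2900, 3130, 500)
)

# Combine date and time into one POSIXct column, then pull out the hour
df$when <- as.POSIXct(paste(df$date, df$time),
                      format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
df$hour <- as.integer(format(df$when, "%H"))

# Sum size for every (date, hour) combination and sort the result
out <- aggregate(size ~ date + hour, data = df, FUN = sum)
out <- out[order(out$date, out$hour), ]
out
```

Grouping on both columns keeps hour 5 of 2015-01-16 apart from hour 5 of 2015-01-17.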

Related

How to convert time to standard format and calculate time difference

newdf=data.frame(date=as.Date(c("2021-01-04","2021-01-05","2021-01-06","2021-01-07")),
time=c("10:32:29","11:25","12:18:42","09:58"))
This is my data frame. I want to calculate the time difference, in hours, between consecutive days. Note that some time values do not contain seconds, so they first have to be converted to a standard form. Could you please suggest a method that handles both problems in R?
Paste date and time together in one column, use parse_date_time to convert the values to a standard format (POSIXct), and use difftime to calculate the difference between consecutive times in hours.
library(dplyr)
library(tidyr)
library(lubridate)
newdf %>%
  unite(datetime, date, time, sep = ' ') %>%
  mutate(datetime = parse_date_time(datetime, c('Ymd HMS', 'Ymd HM')),
         # units must be named: the third positional argument of difftime is tz
         difference_in_hours = round(as.numeric(difftime(datetime,
                                     lag(datetime), units = 'hours')), 2))
# datetime difference_in_hours
#1 2021-01-04 10:32:29 NA
#2 2021-01-05 11:25:00 24.88
#3 2021-01-06 12:18:42 24.90
#4 2021-01-07 09:58:00 21.66
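The same steps can also be sketched in base R without extra packages. This is only one possible approach: it assumes every time is either H:M or H:M:S, and the names has_seconds, full_time and diff_hours are illustrative.

```r
newdf <- data.frame(
  date = as.Date(c("2021-01-04", "2021-01-05", "2021-01-06", "2021-01-07")),
  time = c("10:32:29", "11:25", "12:18:42", "09:58")
)

# Append ":00" to times that have only hours and minutes
has_seconds <- grepl("^[0-9]{1,2}:[0-9]{2}:[0-9]{2}$", newdf$time)
full_time <- ifelse(has_seconds, newdf$time, paste0(newdf$time, ":00"))

# Parse date and time together; UTC avoids daylight-saving jumps
datetime <- as.POSIXct(paste(newdf$date, full_time),
                       format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Difference to the previous row in hours; the first row has no predecessor
diff_hours <- c(NA, as.numeric(diff(datetime), units = "hours"))
round(diff_hours, 2)
```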

Filter date time POSIXct data

I am trying to filter a large dataset down to records that occur on the hour. The data looks like this:
I want to filter the Date_Time field to be only the records that are on the hour i.e. "yyyy-mm-dd XX:00:00" or within 10 min of the hour. So, for example, this dataset would reduce down to row 1 and 5. Does anyone have a suggestion?
You can extract the minute value from the datetime and select the rows that fall within 10 minutes after the hour.
result <- subset(df, as.integer(format(UTC_datetime, '%M')) <= 10)
Or with dplyr and lubridate -
library(dplyr)
library(lubridate)
result <- df %>% filter(minute(UTC_datetime) <= 10)
Using data.table
library(data.table)
setDT(df)[minute(UTC_datetime)<=10]
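The filters above keep records up to 10 minutes after the hour. If "within 10 min of the hour" is meant to include the 10 minutes before the hour as well, a base R sketch could look like this. The sample timestamps are made up, and the column name UTC_datetime is taken from the answers above.

```r
# Hypothetical sample data with a POSIXct column
df <- data.frame(
  UTC_datetime = as.POSIXct(
    c("2020-01-01 10:00:00", "2020-01-01 10:25:00",
      "2020-01-01 10:52:30", "2020-01-01 11:07:00"),
    tz = "UTC"
  )
)

# Minute of the hour as an integer (seconds are ignored)
mins <- as.integer(format(df$UTC_datetime, "%M"))

# Keep rows at most 10 minutes past the hour or at most 10 minutes before it
result <- df[mins <= 10 | mins >= 50, , drop = FALSE]
result
```

Here rows 1, 3 and 4 are kept. Since only the minute is compared, a timestamp such as 10:10:59 also passes; compare seconds too if the cutoff has to be exact.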

R: create index for xts time object from calendar week , e.g. 201501 ... 201553

I know how to get the week from an index, but don't know the other way around: how to create an index if I have the calendar weeks (in this case, from an SAP system with 0CALWEEK as 201501, 201502 ... 201552, 201553).
Found this:
How to Parse Year + Week Number in R?
but a day is needed and it's not clear how to set it, especially at the end of the year (year-week-day: YEAR-53-01 does not always exist, since the first day of week 53 might be a Monday, and then 01 (Sunday) is not in that week).
I could try to get in the source system the first day of the corresponding week (through SQL) but thought R might do it easier...
Do you have any suggestions?
(Which day counts as the first day of the week is not important, since I will create all objects the same way and then merge/cbind them, then continue the analysis. If zoo is easier, I'll go with it.)
Thanks!
The problem is that all indices end in 2015-07-29:
data <- 1:4
weeks <- c('201501','201502','201552','201553')
weeks_2 <- as.Date(weeks,format='%Y%w')
xts(data, order.by = weeks_2)
[,1]
2015-07-29 1
2015-07-29 2
2015-07-29 3
2015-07-29 4
test <- xts(data, order.by = weeks_2)
index(test)
[1] "2015-07-29" "2015-07-29" "2015-07-29" "2015-07-29"
You can use the as.Date() function; I think it is the easiest way:
weeks <- c('201501','201502','201552','201553')
as.Date(paste0(weeks,'1'),format='%Y%W%w') # paste a dummy day
## [1] "2015-01-05" "2015-01-12" "2015-12-28" NA
Where:
%W: Week 00-53 with Monday as the first day of the week
or
%U: Week 00-53 with Sunday as the first day of the week
%w: Weekday 0-6, where Sunday is 0
Under the %W convention, week 53 doesn't exist in 2015, hence the NA. And if you want to start with 2015-01-01, just set the right weekday:
weeks <- c('201500','201501','201502','201551','201552')
as.Date(paste0(weeks,'4'),format='%Y%W%w')
## [1] "2015-01-01" "2015-01-08" "2015-01-15" "2015-12-24" "2015-12-31"
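If the calendar weeks follow ISO 8601 numbering (an assumption; the source system's week convention should be checked), 2015 does have a week 53, and the Monday of each week can be computed directly. The helper name iso_week_monday is hypothetical, and the sketch does not validate week numbers that do not exist in a given year.

```r
# Hypothetical helper: Monday of an ISO 8601 week, given "YYYYWW" strings
iso_week_monday <- function(calweek) {
  year <- as.integer(substr(calweek, 1, 4))
  week <- as.integer(substr(calweek, 5, 6))
  # ISO week 1 is the week that contains 4 January
  jan4 <- as.Date(sprintf("%d-01-04", year))
  # as.POSIXlt()$wday is 0 for Sunday; shift so that Monday is 0
  days_since_monday <- (as.POSIXlt(jan4)$wday + 6) %% 7
  week1_monday <- jan4 - days_since_monday
  week1_monday + (week - 1) * 7
}

weeks <- c("201501", "201502", "201552", "201553")
idx <- iso_week_monday(weeks)
idx
# [1] "2014-12-29" "2015-01-05" "2015-12-21" "2015-12-28"
```

The resulting Date vector can then be passed as order.by when building the xts object.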
You may try with substr() and lubridate
library(lubridate)
# a number from your list: 201502
# set the year
x <- ymd("2015-01-1")
# retrieve second week
week(x) <- 2
x
[1] "2015-01-08"
You can use the result for your index or rownames().
zoo and xts are great for time series once you have set the names;
be sure to remove any column with dates from your data frame.

Using scale_x_date in ggplot2 with different columns

Say I have the following data:
Date Month Year Miles Activity
3/1/2014 3 2014 72 Walking
3/1/2014 3 2014 85 Running
3/2/2014 3 2014 42 Running
4/1/2014 4 2014 65 Biking
1/1/2015 1 2015 21 Walking
1/2/2015 1 2015 32 Running
I want to make graphs that display the sum of each month's date for miles, grouped and colored by year. I know that I can make a separate data frame with the sum of the miles per month per activity, but the issue is in displaying. Here in Excel is basically what I want--the sums displayed chronologically and colored by activity.
I know ggplot2 has a scale_x_date command, but I run into issues on "both sides" of the problem--if I use the Date column as my X variable, they're not summed. But if I sum my data how I want it in a separate data frame (i.e., where every activity for every month has just one row), I can't use both Month and Year as my x-axis--at least, not in any way that I can get scale_x_date to understand.
(And, I know, if Excel is graphing it correctly why not just use Excel--unfortunately, my data is so large that Excel was running very slowly and it's not feasible to keep using it.) Any ideas?
The below worked fine for me with the small dataset. If you convert your data.frame to a data.table, you can sum the data up to the miles per activity and month level with just a couple of preprocessing steps. I've left some comments in the code to give you an idea of what's going on, but it should be pretty self-explanatory.
# Assuming your dataframe looks like this
df <- data.frame(Date = c('3/1/2014','3/1/2014','4/2/2014','5/1/2014','5/1/2014','6/1/2014','6/1/2014'), Miles = c(72,14,131,534,123,43,56), Activity = c('Walking','Walking','Biking','Running','Running','Running', 'Biking'))
# Load lubridate, data.table and ggplot2
library(lubridate)
library(data.table)
library(ggplot2)
# Convert dataframe to a data.table
setDT(df)
df[, Date := as.Date(Date, format = '%m/%d/%Y')] # Convert data to a column of Class Date -- check with class(df[, Date]) if you are unsure
df[, Date := floor_date(Date, unit = 'month')] # Reduce all dates to the first day of the month for summing later on
# Create ggplot object using data.tables functionality to sum the miles
ggplot(df[, sum(Miles), by = .(Date, Activity)], aes(x = Date, y = V1, colour = factor(Activity))) + # data.table creates the column V1, which is the sum of miles
  geom_line() +
  scale_x_date(date_labels = '%b-%y') # %b displays the abbreviated month name
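The summing step on its own can also be sketched in base R, which gives the summed column an explicit name instead of V1. The names Month and monthly are illustrative; the data is the same small example as above.

```r
df <- data.frame(
  Date = c("3/1/2014", "3/1/2014", "4/2/2014", "5/1/2014",
           "5/1/2014", "6/1/2014", "6/1/2014"),
  Miles = c(72, 14, 131, 534, 123, 43, 56),
  Activity = c("Walking", "Walking", "Biking", "Running",
               "Running", "Running", "Biking")
)

# Parse the dates, then map every date to the first day of its month
df$Date <- as.Date(df$Date, format = "%m/%d/%Y")
df$Month <- as.Date(format(df$Date, "%Y-%m-01"))

# One row per month and activity, with the miles summed
monthly <- aggregate(Miles ~ Month + Activity, data = df, FUN = sum)
monthly
```

Because Month is a Date column, monthly can be plotted with aes(x = Month, y = Miles, colour = Activity) and scale_x_date() in the same way as the data.table result.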

Using dplyr::mutate between two dataframes to create column based on date range

Right now I have two dataframes. One contains over 11 million rows of a start date, end date, and other variables. The second dataframe contains daily values for heating degree days (basically a temperature measure).
set.seed(1)
library(lubridate)
date.range <- ymd(paste(2008,3,1:31,sep="-"))
daily <- data.frame(date=date.range,value=runif(31,min=0,max=45))
intervals <- data.frame(start=daily$date[1:5],end=daily$date[c(6,9,15,24,31)])
In reality my daily dataframe has every day for 9 years and my intervals dataframe has entries that span over arbitrary dates in this time period. What I wanted to do was to add a column to my intervals dataframe called nhdd that summed over the values in daily corresponding to that time interval (end exclusive).
For example, in this case the first entry of this new column would be
sum(daily$value[1:5])
and the second would be
sum(daily$value[2:8]) and so on.
I tried using the following code
intervals <- mutate(intervals,nhdd=sum(filter(daily,date>=start&date<end)$value))
This is not working and I think it might have something to do with not referencing the columns correctly but I'm not sure where to go.
I'd really like to use dplyr to solve this and not a loop because 11 million rows will take long enough using dplyr. I tried using more of lubridate but dplyr doesn't seem to support the Period class.
Edit: I'm actually using dates from as.Date now instead of lubridate, but the basic question of how to refer to a different dataframe from within mutate still stands.
eps <- .Machine$double.eps
library(dplyr)
intervals %>%
  rowwise() %>%
  mutate(nhdd = sum(daily$value[between(daily$date, start, end - eps )]))
# start end nhdd
#1 2008-03-01 2008-03-06 144.8444
#2 2008-03-02 2008-03-09 233.4530
#3 2008-03-03 2008-03-15 319.5452
#4 2008-03-04 2008-03-24 531.7620
#5 2008-03-05 2008-03-31 614.2481
In case you find the dplyr solution a bit slow (basically due to rowwise), you might want to use data.table for pure speed:
library(data.table)
setkey(setDT(intervals), start, end)
setDT(daily)[, date1 := date]
foverlaps(daily, by.x = c("date", "date1"), intervals)[, sum(value), by=c("start", "end")]
# start end V1
#1: 2008-03-01 2008-03-06 144.8444
#2: 2008-03-02 2008-03-09 233.4530
#3: 2008-03-03 2008-03-15 319.5452
#4: 2008-03-04 2008-03-24 531.7620
#5: 2008-03-05 2008-03-31 614.2481
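The outputs shown above appear to include the end date: the first value matches sum(daily$value[1:6]) rather than sum(daily$value[1:5]), since subtracting .Machine$double.eps from a Date of this magnitude leaves it unchanged. For an end-exclusive sum, as described in the question, one vectorised option is a cumulative sum with index lookups. This is a sketch under the assumptions that daily is sorted by date, has no missing days, and contains every start and end date; the names cs, i_start and i_end are illustrative.

```r
set.seed(1)
date.range <- as.Date("2008-03-01") + 0:30
daily <- data.frame(date = date.range, value = runif(31, min = 0, max = 45))
intervals <- data.frame(start = daily$date[1:5],
                        end = daily$date[c(6, 9, 15, 24, 31)])

# cs[k + 1] holds the sum of the first k daily values, so cs[1] is 0
cs <- c(0, cumsum(daily$value))

# Row positions of each start and end date in daily
i_start <- match(intervals$start, daily$date)
i_end <- match(intervals$end, daily$date)

# Sum of values with start <= date < end, computed for all rows at once
intervals$nhdd <- cs[i_end] - cs[i_start]
intervals
```

Since there is no per-row loop, this should scale to millions of intervals; if any start or end date is missing from daily, match() returns NA and the corresponding nhdd will be NA.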
