R zoo - aggregating many records with the same time entry

I consistently need to take transaction data and aggregate it by Day, Week, Month, Quarter, Year - essentially, it's time-series data. I started to apply zoo/xts to my data in hopes I could aggregate the data faster, but either I don't fully understand the packages' purpose or I'm applying them incorrectly.
In general, I would like to calculate the number of orders and the number of products ordered by category, by time period (day, week, month, etc).
#Create the data
clients <- 1:10
dates <- seq(as.Date("2012/1/1"), as.Date("2012/9/1"), "days")
categories <- LETTERS[1:5]
products <- data.frame(numProducts = 1:10,
category = sample(categories, 1000, replace = TRUE),
clientID = sample(clients, 1000, replace = TRUE),
OrderDate = sample(dates, 1000, replace = TRUE))
I could do this with plyr and reshape, but I think this is a roundabout way to do so.
#Aggregate by date and category
products.day <- ddply(products, .(OrderDate, category), summarize, numOrders = length(numProducts), numProducts = sum(numProducts))
#Aggregate by Month and category
products.month <- ddply(products, .(Month = months(OrderDate), Category = category), summarize, numOrders = length(numProducts), numProducts = sum(numProducts))
#Make a wide-version of the data frame
products.month.wide <- cast(products.month, Month~Category, sum)
I tried to apply zoo to the data like so:
products.TS <- aggregate(products$numProducts, yearmon, mean)
It returned this error:
Error in aggregate.data.frame(as.data.frame(x), ...) :
'by' must be a list
I've read the zoo vignettes and documentation, but every example that I've found only shows 1 record/row/entry per time entry.
Do I have to pre-aggregate the data I want to time-series on? I was hoping that I could simply group by the fields I want, then have the months or quarters get added to the data frame incrementally to the X-axis.
Is there a better approach to aggregating this or a more appropriate package?

products$numProducts is a vector, not a zoo object. You'd need to create a zoo object before you can use method dispatch to call aggregate.zoo.
pz <- with(products, zoo(numProducts, OrderDate))
products.TS <- aggregate(pz, as.yearmon, mean)
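To also get per-category totals by period (the by-category part of the question), one hedged option is to split by category first and aggregate each resulting zoo series; a minimal sketch, assuming the products data frame created above:
library(zoo)
# One zoo series per category, each aggregated to monthly totals.
# Duplicate dates within a category trigger a zoo warning, but aggregate() still works.
by.cat <- lapply(split(products, products$category), function(d) {
  aggregate(zoo(d$numProducts, d$OrderDate), as.yearmon, sum)
})
# Merge into one wide zoo object: one column per category, one row per month
products.month.zoo <- do.call(merge, by.cat)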

Related

How can I rewrite code that applies a function to a subset of rows based on another vector in different R ecosystems?

In my problem I have to apply a function to a subset of each individual time series, based on a set of dates extracted from the original data.
So, I have a data.frame with a time series for each individual between 2005-01-01 and 2010-12-31 (test_final_ind_series) and a sample of individual-date pairs (sample_events), ideally extracted from the same data.
With these, my example attempts to calculate an average over a subset of the time-series values exp, conditional on the individual and the date in sample_events.
I did this in 2 different ways:
1: a simple but effective code that gets the job done very quickly
The idea is to take the data for a specific individual plus a time lag and a window width (like a rolling average); the function exp_summary then outputs the requested average.
To repeat the operation for each row in sample_events, I nest the individual series by individual ID and then attach the sample of dates. Finally, I run a loop that applies the function to each individual's nested data frame.
#Sample data
set.seed(111)
exp_series <- data.frame(
id = as.character(rep(1:10000, each=2191)),
date = rep(seq(as.Date('2005-01-01'),
as.Date('2010-12-31'), by = 'day'),times=10000),
exp = rep(rnorm(n=10000, mean=10, sd=5),times=2191)
)
sample_dates <- data.frame(
Event_id = as.character(replicate(10000,sample(1:10000,size = 1,replace = TRUE))),
Event_date = sample(
seq(as.Date('2005-01-01'),
as.Date('2010-12-31'), by = 'day'),
size =10000,replace = TRUE)
)
#This function, given a dataframe with dates and exposure series (df)
#an event_date
#a lag value
#a width of the window
#Outputs the average for a user-defined time window
exp_summary<- function(df, event_date, lag=0,width=0){
df<-as.data.table(df)
end<-as.character(as.Date(event_date)-lag)
start<-as.character(max(as.Date(end)-width, min(df$date)))# I need this in case the time window goes beyond the time limits (earliest date)
return(mean(df[date %between% c(start,end)]$exp))
}
#Nest dataframes
exp_series_nest <- exp_series %>%
group_by(id) %>%
nest()
#Merge with sample events, including only the necessary dates
full_data<-merge(exp_series_nest,sample_dates, by.x="id", by.y="Event_id",all.x = FALSE, all.y=TRUE)
#Initialize dataframe in advance
summaries1<-setNames(data.frame(matrix(ncol = 2, nrow = nrow(full_data))), c("id", "mean"))
summaries1$id<-full_data$id
#Loop over each id, whose data is a nested data.frame
system.time(for (i in 1:nrow(full_data)){
summaries1$mean[i]<-exp_summary(full_data$data[[i]], full_data$Event_date[i], lag=1, width=365)
})
2: using the highly-flexible package runner
With the same data, I just need to specify the arguments properly. I have also opened an issue on the GitHub repository to speed up this code with parallelization.
system.time(summaries2 <- sample_dates %>%
group_by(Event_id) %>%
mutate(
mean = runner(
x = exp_series[exp_series$id == Event_id[1],],
k = "365 days",
lag = "1 days",
idx =exp_series$date[exp_series$id == Event_id[1]],
at = Event_date,
f = function(x) {mean(x$exp)},
na_pad=FALSE
)
)
)
They give the same results up to the second decimal, but method 1 is much faster than method 2, and the difference becomes obvious on very large datasets.
My question is: for method 1, how can I write the last loop more concisely within the data.table and/or tidyverse ecosystems? I really struggle to make nested lists and "normal" columns embedded in the same data frame work together.
Also, if you have any other recommendations I am open to hearing them! I am here more out of curiosity than need, as method 1 already solves my problem acceptably.
With data.table, you could join exp_series with the range you wish in sample_dates and calculate mean by=.EACHI:
library(data.table)
setDT(exp_series)
setDT(sample_dates)
lag <- 1
width <- 365
# Define range
sample_dates[,':='(begin=Event_date-width-lag,end=Event_date-lag)]
# Calculate mean by .EACHI
summariesDT <- exp_series[sample_dates,.(id,mean=mean(exp))
,on=.(id=Event_id,date>=begin,date<=end),by=.EACHI][
,.(id,mean)]
Note that this returns the same results as summaries1 only for Event_id values without duplicates in sample_dates.
The results are different in case of duplicates, for instance Event_id==1002:
sample_dates[Event_id==1002]
Event_id Event_date begin end
<char> <Date> <Date> <Date>
1: 1002 2010-08-17 2009-08-16 2010-08-16
2: 1002 2010-06-23 2009-06-22 2010-06-22
If you don't have duplicates in your real data, this shouldn't be a problem.
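For the tidyverse side of the question, the loop in method 1 can also be written with purrr; a minimal sketch, assuming full_data and exp_summary exactly as defined above (an illustration, not benchmarked):
library(dplyr)
library(purrr)
# map2_dbl() walks the nested data frames and their event dates in parallel,
# returning one mean per row of full_data, just like the loop in method 1
summaries1b <- full_data %>%
  mutate(mean = map2_dbl(data, Event_date, exp_summary, lag = 1, width = 365)) %>%
  select(id, mean)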

Average of a few maximum values in R

I find the monthly maximum value from daily data of many companies with this R code:
DF$date <- as.Date(DF$date,format="%Y-%m-%d")
Output <- aggregate(DF[,-1],
by=list(Month=format(DF$date,"%y-%m")),
FUN=max)
However, I could not figure out how to change this code to find the average of the two maximum values in a month, or the average of the three maximum values in a month. As a new learner of R, it would be very helpful if you could help me with this.
We can sort the values in descending order, take the head of 'n' values, and get the mean by using an anonymous function call in FUN:
aggregate(DF[,-1], by=list(Year_Month=format(DF$date,"%y-%m")),
FUN = function(x) mean(head(sort(x, decreasing = TRUE), 2)))
Or this can be done with tidyverse
library(dplyr)
DF %>%
group_by(Year_Month = format(date, "%y-%m")) %>%
summarise_at(vars(-date), funs(mean(head(sort(., decreasing = TRUE), 2))))
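A data.table variant of the same idea (a sketch, assuming as above that DF has a date column and the remaining columns hold the company values):
library(data.table)
setDT(DF)
# Mean of the top two values per month, for every company column
DF[, lapply(.SD, function(x) mean(head(sort(x, decreasing = TRUE), 2))),
   by = .(Year_Month = format(date, "%y-%m")),
   .SDcols = setdiff(names(DF), "date")]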

Moving averages

I have daily data for over 100 years that looks like
01.01.1856 12
02.01.1956 9
03.01.1956 -12
04.01.1956 7
etc.
I wish to calculate the 30-year running average for this huge dataset. I tried converting the data into a time series but still can't figure out how to go about it. I would prefer a simple method that works with a data.frame.
I guess the preparation is the difficult part, considering leap years.
So I will show one way of preparing the data before using the already mentioned runmean function from the caTools package.
First we create example data (not necessary for you, but helpful for understanding).
Second, I split the data frame into a list of data frames, one for each year, and take the mean value of each year. These two steps could be done at once (see the one-call sketch after the code below), but I think the separated way is easier to understand and to adapt.
#example data
Days <- seq(as.Date("1958-01-01"), as.Date("2015-12-31"), by="days")
Values <- runif(length(Days))
DF <- data.frame(Days = Days, Values = Values)
#start of script
Years <- format(DF$Days, "%Y")
UniqueYears <- unique(format(DF$Days, "%Y"))
#Create subset of years
#look for every unique year which element of days is in this year.
YearlySubset <- lapply(UniqueYears, function(x){
DF[which(Years == x), ]
})
YearlyMeanValues <- sapply(YearlySubset, function(x){
mean(x$Values)
})
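As mentioned above, the subset-and-mean steps can also be collapsed into a single call; a small sketch with base tapply that gives the same yearly means:
#One call instead of the lapply/sapply pair above
YearlyMeanValues <- tapply(DF$Values, format(DF$Days, "%Y"), mean)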
Now the running mean is applied:
#install.packages("caTools")
require(caTools)
RM <- data.frame(Years = UniqueYears, RunningMean30y = runmean(YearlyMeanValues, 30))
In case I didn't understand you correctly at first and you want a running mean for every day over a window of about 30 years, you could of course simply do:
RM <- cbind(DF, runmean(DF$Values, 365 * 30))
And considering your problems creating a timeseries:
DF[ , 1] <- as.Date(DF[ , 1], format = "%d.%m.%Y") # day.month.year, matching the dates shown in the question
I would also suggest exploring RcppRoll in combination with dplyr, which together provide a fairly convenient way to calculate rolling averages, sums, etc.
Code
# Libs
library(RcppRoll) # 'roll'-ing functions for R vectors and matrices.
library(dplyr) # data grammar (convenience)
library(zoo) # time series (convenience)
library(magrittr) # compound assignment pipe-operator (convenience)
# Data
data("UKgas")
## Convert to data frame to make example better
UKgas <- data.frame(Y = as.matrix(UKgas), date = time(UKgas))
# Calculations
UKgas %<>%
# To make example more illustrative I converted the data to a quarterly format
mutate(date = as.yearqtr(date)) %>%
arrange(date) %>%
# The window size can be changed to reflect any period
mutate(roll_mean = roll_mean(Y, n = 4, align = "right", fill = NA))
Notes
As the data provided in the example was fairly modest I used quarterly UK gas consumption data available via the data function in the utils package.

Group R table entries by month

I have a CSV file with two columns, a unix timestamp and a version string. What I finally want to achieve is to group the data by month and plot it, so that the single months are entries on the x axis and a line is plotted for each unique version string, where the y axis values represent the number of hits in that month.
Here is a small example CSV:
timestamp,version
1434974143,1.0.0
1435734004,1.1.0
1435734304,1.0.0
1435735386,1.2.0
I'm new to R, so I encountered several problems. First I successfully read the csv with
mydata <- read.csv("data.csv")
and figured out an ugly function that converts a single timestamp into an R date:
as_time <- function(val){
return(head(as.POSIXct(as.numeric(as.character(val)),origin="1970-01-01",tz="GMT")))
}
But none of the several apply functions seemed to work on the table column.
So how do I create a data structure, that groups the version hits by month, and can be plotted later?
It's easier than you think!
You are essentially looking for the hist function.
#Let's make some mock data
# Set the random seed for reproducibility
set.seed(12345)
my.data <- data.frame(timestamp = runif(1000, 1420000000, 1460000000),
version = sample(1:5, 1000, replace = T))
my.data$timestamp <- as.POSIXct(my.data$timestamp, origin = "1970-01-01")
# Histogram of the data, irrespective of version
hist(my.data$timestamp, "month")
# If you want to see the version then split the data first...
my.data.split <- split(my.data, my.data$version)
# Then apply hist to each subset; keep the result as a list so the
# per-version counts can be row-bound into a matrix below
counts <- lapply(my.data.split, function (x)
{
h <- hist(x$timestamp, br = "month", plot = FALSE)
h$counts
})
# Transform into a matrix and plot
counts <- do.call("rbind", counts)
barplot(counts, beside = T)
You can use the as.yearmon function from the zoo package to get year/month formats:
library(zoo)
dat$yearmon <- as.yearmon(as.POSIXct(dat$timestamp, origin = "1970-01-01", tz = "GMT"))
Then it depends what you want to do with your data. For example, number of version hits per month (thanks to #Frank for fixing):
library(dplyr)
dat %>% group_by(yearmon, version) %>%
summarise(hits = n())
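If you also want the line-per-version plot described in the question, one possible continuation (a sketch using ggplot2; the scale_x_yearmon axis helper comes from zoo, which is already loaded):
library(ggplot2)
dat %>%
  group_by(yearmon, version) %>%
  summarise(hits = n()) %>%
  ggplot(aes(x = yearmon, y = hits, colour = factor(version))) +
  geom_line() +
  scale_x_yearmon()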

Recovering tapply results into the original data-frame in R

I have a data frame with annual exports of firms to different countries in different years. My problem is I need to create a variable that says, for each year, how many firms there are in each country. I can do this perfectly with a "tapply" command, like
incumbents <- tapply(id, destination-year, function(x) length(unique(x)))
and it works just fine. My problem is that incumbents has length length(destination-year), and I need it to have length length(id) (there are many firms serving each destination each year), so I can use it in a subsequent regression (matched by year and destination, of course). A "for" loop can do this, but it is very time-consuming since the database is rather huge.
Any suggestions?
You don't provide a reproducible example, so I can't test this, but you should be able to use ave:
incumbents <- ave(id, destination-year, FUN=function(x) length(unique(x)))
Just "merge" the tapply summary back in with the original data frame with merge.
Since you didn't provide example data, I made some. Modify accordingly.
n = 1000
id = sample(1:10, n, replace=T)
year = sample(2000:2011, n, replace=T)
destination = sample(LETTERS[1:6], n, replace=T)
`destination-year` = paste(destination, year, sep='-')
dat = data.frame(id, year, destination, `destination-year`)
Now tabulate your summaries. Note how I reformatted to a data frame and made the names match the original data.
incumbents = tapply(id, `destination-year`, function(x) length(unique(x)))
incumbents = data.frame(`destination-year`=names(incumbents), incumbents)
Finally, merge back in with the original data:
merge(dat, incumbents)
By the way, instead of combining destination and year into a third variable, like it seems you've done, tapply can handle both variables directly as a list:
library(reshape2) # melt() reshapes the tapply table back into a long data frame
incumbents = melt(tapply(id, list(destination=destination, year=year), function(x) length(unique(x))))
Using #JohnColby's excellent example data, I was thinking of something more along the lines of this:
#I prefer not to deal with the pesky '-' in a variable name
destinationYear = paste(destination, year, sep='-')
dat = data.frame(id, year, destination, destinationYear)
require(plyr)
dat <- ddply(dat,.(destinationYear),transform,newCol = length(unique(id)))
#Or if more speed is required, use data.table
require(data.table)
datTable <- data.table(dat)
datTable <- datTable[,transform(.SD,newCol = length(unique(id))),by = destinationYear]
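For completeness, a hedged dplyr equivalent of the same per-group count, using the same mock data (n_distinct() counts the unique ids within each destination-year):
library(dplyr)
dat <- dat %>%
  group_by(destinationYear) %>%
  mutate(newCol = n_distinct(id)) %>%
  ungroup()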
