Given a list of events with timestamps, how do I use R to calculate the average rate at which such events occur?

I have a CSV file with a list of posts from an online discussion forum. I have the timestamp for each post in this format: YYYY-MM-DD hh:mm:ss.
I want to calculate how often a new post is submitted, as in "X posts per second". I think what I need is just the mean, median and sd for the rate of posting (posts per second). I just loaded the CSV:
d <- read.csv("posts.csv")
colnames(d) <- c("post.id", "timestamp")

The average number of posts per second is just 1 over the interval since the last post, so make a vector of diff(times) and then take mean(1/as.numeric(diff(times))).
> posts <- data.frame(ids = paste(letters[sample(1:26, 100, replace=TRUE)], sample(1:100)),
                      time = Sys.time() + cumsum(abs(rnorm(100)) * 100))
> mean(1/as.numeric(diff(posts$time)))
[1] 0.03545346
Edit: I thought that by using cumsum I would get the time series ordered, but that was not the case, so it's amended to take abs(rnorm(100)).
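For the question's own data frame d, a minimal sketch of the same idea (assuming the timestamp column parses cleanly with as.POSIXct and no two posts share an identical timestamp):
# Sketch only: parse the timestamps, take the gaps between consecutive posts,
# and summarise the instantaneous rate 1/gap (posts per second)
times <- sort(as.POSIXct(d$timestamp, format = "%Y-%m-%d %H:%M:%S"))
gaps <- as.numeric(diff(times), units = "secs")
c(mean = mean(1/gaps), median = median(1/gaps), sd = sd(1/gaps))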

Something like:
tt <- table(cut(as.POSIXlt(d$timestamp),"1 sec"))
c(mean(tt),median(tt),sd(tt))
You didn't provide a reproducible example, so I'm not 100% sure this works, but it should be something like that. I also don't know how well it will scale to giant data sets.
More detail (with example):
set.seed(1001)
n <- 1e5
nt <- 1e5
z <- seq(as.POSIXct("2010-09-01"),length=nt,by="1 sec")
length(z)
z2 <- sample(z,size=n,replace=TRUE)
tt <- table(cut(z2,"1 sec"))
c(mean(tt),median(tt),sd(tt))
This tiny example suggests that the cut() command might be slow.
Play with the 'nt' (number of seconds in the time interval from beginning to end) and 'n' (number of samples) parameters to get a sense of how long your problem will take.
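A rough way to do that timing (a sketch using the z2 object defined above):
# time just the binning step for the current values of n and nt
system.time(tt <- table(cut(z2, "1 sec")))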

I don't know your programming language, but if you can convert the timestamps to milliseconds, just subtract the lowest from the highest timestamp, divide by the number of posts (rows in posts.csv), then divide by 1000 (milliseconds), and you're left with the average number of seconds per post; its reciprocal is posts per second. If you can get the timestamps in seconds, it is the same except you don't divide by 1000.
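In R that arithmetic might look like this (a sketch, assuming times is a POSIXct vector of post times as in the earlier sketch):
total_secs <- as.numeric(difftime(max(times), min(times), units = "secs"))
total_secs / length(times)    # average seconds per post
length(times) / total_secs    # average posts per second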

Related

R: Turn timestamps into (as short as possible) integers

Edit 1: I think a possible solution would be to count the number of 15-minute intervals elapsed since a starting date. If anyone has thoughts on this, please come forward. Thanks
As the title says, I am looking for a way to turn timestamps into as small as possible integers.
Explanation of the situation:
I am working with "panelAR". I have T>N panel-data containing different timestamps that look like this (300,000 rows in total):
df$timestamp[1]
[1] "2013-08-01 00:15:00 UTC"
class(df$timestamp)
[1] "POSIXct" "POSIXt"
I am using panelAR and thus need the timestamp as an integer. I can't simply use "as.integer" because I would exceed R's maximum integer value, resulting in only NAs. This was my first try to work around this problem:
df$timestamp <- as.numeric(gsub("[: -]", "" , df$timestamp, perl=TRUE))
#Keep the characters starting at the 3rd position (because the leading "20" is irrelevant)
#and stop before the 2nd-to-last position (because they all end at 00 seconds).
#(I need shorter integers in order to not exceed R's integer maximum.)
df$timestamp <- substr(df$timestamp, 3, nchar(df$timestamp)-2)
#Save as integer
df$timestamp <- as.integer(df$timestamp)
#Result
df$timestamp[1]
[1] 1308010015
This allows panelAR to work with it, but the numbers seem to be way too large. When I try to run a regression with it, I get the following error message:
"cannot allocate vector of size 1052.2 GB"
I am looking for a way to turn these timestamps into (as small as possible) integers in order to work with panelAR.
Any help is greatly appreciated.
This big number that you get corresponds to the number of seconds elapsed since 1970-01-01 00:00:00. Do your timestamps have regular intervals? If they do, say every 15 minutes, you could divide all the integers by 900, and it might help.
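A sketch of that suggestion: a POSIXct is stored as seconds since the epoch, so dividing by 900 counts 15-minute steps, assuming the stamps sit on a regular 15-minute grid.
df$timestampnew <- as.integer(as.numeric(df$timestamp) / 900)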
Another option is to pick your earliest date and subtract it from the others
#generate some dates:
a <- as.POSIXct("2013-01-01 00:00:00 UTC")
b <- as.POSIXct("2013-08-01 00:15:00 UTC")
series <- seq(a,b, by = 'min')
#calculate the difference from the earliest date (numeric result, in seconds)
integer <- as.numeric(difftime(series, min(series), units = "secs"))
If you still get memory problems, I might combine both.
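Combining both might look like this (a sketch; it mirrors the difftime-based solution the questioner settled on below):
# count 15-minute (900-second) steps since the earliest timestamp
df$timestampnew <- as.integer(as.numeric(difftime(df$timestamp, min(df$timestamp), units = "secs")) / 900)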
I managed to solve the main question. Since this still results in a memory error, I think it stems from the number of observations and the way panelAR computes things. I will open a separate question for that matter.
I used
df$timestampnew <- as.integer(difftime(df$timestamp, "2013-01-01 00:00:00", units = "min")/15)
to get integers that count the number of 15-min intervals elapsed since a certain date.
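As a quick sanity check of that line (a sketch with explicit UTC to avoid timezone surprises), a single stamp from the question gives its 15-minute-step count since the chosen origin:
as.integer(difftime(as.POSIXct("2013-08-01 00:15:00", tz = "UTC"),
                    as.POSIXct("2013-01-01 00:00:00", tz = "UTC"),
                    units = "min") / 15)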

Efficient time-weighted averages

Given a dataset containing short intervals, and a value representing an average measure of something over each interval, I would like to average those values up to the calendar year, separately for each individual ("id").
The issue is that these intervals are not aligned with calendar year, so time-weighting of those values is necessary in order to get the best estimate of the annual average from the shorter interval averages.
Note that the intervals are inclusive for the start date and exclusive for the end date.
Example data
start_date and end_date are intervals that are unique non-overlapping within levels of id:
set.seed(30)
library(lubridate)
library(data.table)
x <- CJ(id=1:5, start_date=seq(from=as.Date("2005-01-12"),by=14,length=100))
#add noise so intervals don't all start on 2005-01-12
x[,start_date:=start_date + rbinom(1,size=20,prob=.15)*15L,by=id]
#all intervals are two weeks:
x[,end_date:=start_date+14]
x[,value:=rnorm(nrow(x))]
#for each id, calculate the mean value over each calendar year.
years <- c(year(min(x$start_date)), year(max(x$start_date)))
Additional constraints:
- works for intervals that aren't exactly two weeks long
- works even if the intervals aren't all the same length (as long as they're non-overlapping)
- works even if the earliest start_date isn't the same for each participant
- averages for calendar years that don't have enough periods for that id to complete the year should be NA
Potential solution that is too slow for my purposes.
complete_date_seq <- seq(as.Date(ymd(paste0(years[1],"-01-01"))), as.Date(ymd(paste0(years[2],"-12-31"))), by=1)
m <- matrix(NA, nrow=length(unique(x$id)), ncol=length(complete_date_seq))
rownames(m) <- unique(x$id)
colnames(m) <- as.character(complete_date_seq)
for(i in 1:nrow(m)){
  temp <- x[id==rownames(m)[i]]
  for(j in 1:nrow(temp)){
    m[i, as.Date(complete_date_seq) %within% temp[j, interval(start_date, end_date-1)]] <- temp[j, value]
  }
}
out <- CJ(id=unique(x$id),year=years[1]:years[2])
intervalfromyear <- function(y) interval(as.Date(ymd(paste0(y,"-01-01"))), as.Date(ymd(paste0(y,"-12-31"))))
out[, annual_avg:=mean(m[rownames(m)==.BY$id,complete_date_seq %within% intervalfromyear(.BY$year)]) ,by=c("id","year")]
I'm guessing there's some package for doing time-weighting that I'm not aware of. Is this true? Ideally there's a native data.table solution that's fast.
This is basically the same approach I proposed in the question, but a lot more efficient since it creates a long data.table rather than a matrix. I spent some time looking for a different solution (one that doesn't involve creating a cell for each date and instead uses a weighted-average formula) using foverlaps, but it was way more work, less easily extensible, and more error-prone.
#switch from exclusive to inclusive end_date
x[, actual_end_date:=as.Date(as.numeric(end_date)-1,origin="1970-01-01")]
z <- x[, list(date=seq(start_date,actual_end_date,by=1),value),by=c("id","start_date")]
complete_date_seq <- seq(from=as.Date(paste0(years[1],"-01-01")),
to=as.Date(paste0(years[2],"-12-31")),by=1)
missing_dates <- z[,list(date=as.Date(setdiff( complete_date_seq,date ),origin="1970-01-01"),value=NA),by=id]
result <- rbind(z,missing_dates,fill=TRUE)[order(id,date)]
result[, year:=substr(date,1,4)]
result[, mean(value),by=c("id","year")]
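As a small hedged addition, the last line can name the output column; note that mean() without na.rm is what returns NA for any year containing uncovered days, which is exactly the NA behaviour required by the constraints above:
annual <- result[, .(annual_avg = mean(value)), by = .(id, year)]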

Plotting truncated times from zoo time series

Let's say I have a data frame with lots of values under these headers:
df <- data.frame(c("Tid", "Value"))
#Tid.format = %Y-%m-%d %H:%M
Then I turn that data frame over to zoo, because I want to handle it as a time series:
library("zoo")
df <- zoo(df$Value, df$Tid)
Now I want to produce a smooth scatter plot over which time of day each measurement was taken (i.e. discard date information and only keep time) which supposedly should be done something like this: https://stat.ethz.ch/pipermail/r-help/2009-March/191302.html
But it seems the time() function doesn't produce any time at all; instead it just produces a number sequence. Whatever I do from that link, I can't get a scatter plot of values over an average day. The data.frame code that actually does work (without using zoo time series) looks like this (i.e. extracting the hour from the time and converting it to numeric):
smoothScatter(data.frame(as.numeric(format(df$Tid,"%H")), df$Value))
Another thing I want to do is produce a density plot of how many measurements I have per hour. I have plotted on hours using a regular data.frame with no problems, so the data I have is fine. But when I try to do it using zoo then I either get errors or I get the wrong results when trying what I have found through Google.
I did manage to get something plotted through this line:
plot(density(as.numeric(trunc(time(df),"01:00:00"))))
But it is not correct. It seems again that it is just producing a sequence from 1 to 217, where I wanted it to truncate any date information and just keep the time, rounded off to hours.
I am able to plot this:
plot(density(df))
Which produces a density plot of the Values. But I want a density plot over how many values were recorded per hour of the day.
So, if someone could please help me sort this out, that would be great. In short, what I want to do is:
1) smoothScatter(x-axis: time of day (0-24), y-axis: value)
2) plot(density(x-axis: time of day (0-24)))
EDIT:
library("zoo")
df <- data.frame(Tid=strptime(c("2011-01-14 12:00:00","2011-01-31 07:00:00","2011-02-05 09:36:00","2011-02-27 10:19:00"),"%Y-%m-%d %H:%M"),Values=c(50,52,51,52))
df <- zoo(df$Values,df$Tid)
summary(df)
df.hr <- aggregate(df, trunc(df, "hours"), mean)
summary(df.hr)
png("temp.png")
plot(df.hr)
dev.off()
This code uses some actual values that I have. I would have expected the plot of "df.hr" to be an hourly average, but instead I get some weird new index that is not time at all...
There are three problems with the aggregate statement in the question:
1. We wish to truncate the times, not df.
2. trunc.POSIXt unfortunately returns a POSIXlt result, so it needs to be converted back to POSIXct.
3. It seems you did not intend to truncate to the hour in the first place but rather wanted to extract the hours.
To address the first two points the aggregate statement needs to be changed to:
tt <- as.POSIXct(trunc(time(df), "hours"))
aggregate(df, tt, mean)
but to address the last point it needs to be changed entirely to
tt <- as.POSIXlt(time(df))$hour
aggregate(df, tt, mean)
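Tying this back to the two goals in the question, a hedged sketch using the example zoo object df from the EDIT above:
library(zoo)
hrs <- as.POSIXlt(time(df))$hour         # hour of day (0-23) of each measurement
smoothScatter(hrs, coredata(df))         # goal 1: value vs. time of day
plot(density(hrs, from = 0, to = 24))    # goal 2: when during the day measurements occur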

Timeline Event Concentration

Given a series of events, is there an algorithm for determining if a certain number of events occur in a certain period of time? For example, given list of user logins, are there any thirty day periods that contain more than 10 logins?
I can come up with a few brute-force ways to do this; I'm just wondering if there is an algorithm or name for this kind of problem that I haven't turned up with the usual Google searching.
In general it is called binning. It is basically aggregating one variable (e.g. events) over an index (e.g. time) using count as a summary function.
Since you didn't provide data I'll just show a simple example:
# Start with a dataframe of dates and number of events
data <- data.frame(date=paste('2013', rep(1:12, each=20), rep(1:20, times=12), sep='-'),
                   logins=rpois(12*20, 5))
# Make sure to store dates as class Date, it can be useful for other purposes
data$date <- as.Date(data$date)
# Now bin it. This is just a dirty trick, exactly how you do it depends on what you want.
# Lets just sum the number of events for each month
data$month <- sub('-', '', substr(data$date, 6, 7))
aggregate(logins~month, data=data, sum, na.rm=TRUE)
Is that what you wanted?
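The original question about sliding 30-day windows can also be checked directly. A brute-force sketch (not part of the answer above), assuming one row per login and a Date vector login_times; it is enough to anchor each candidate window at a login:
# for each login, count the logins in the 30 days starting at it,
# then ask whether any such window holds more than 10 logins
login_times <- sort(as.Date(c("2013-01-01", "2013-01-05", "2013-01-20")))  # toy data
window_days <- 30
counts <- vapply(seq_along(login_times), function(i)
  sum(login_times >= login_times[i] & login_times < login_times[i] + window_days),
  integer(1))
any(counts > 10)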

Obtaining or subsetting the first 5 minutes of each day of data from an xts

I would like to subset out the first 5 minutes of each day from minutely time series data; however, the first 5 minutes do not occur at the same time each day, so something like xtsobj["T09:00/T09:05"] would not work, since the start of the first 5 minutes changes. For example, sometimes it starts at 9:20am or some other random time in the morning instead of 9am.
So far, I have been able to subset out the first minute for each day using a function like:
k <- diff(index(xtsobj))> 10000
xtsobj[c(1, which(k)+1)]
i.e. finding gaps in the data that are larger than 10000 seconds. Going from that to finding the first 5 minutes of each day is proving more difficult, as the data is not always evenly spaced: between the first minute and the 5th minute there could be anywhere from 2 to 5 rows, and thus using something like:
xtsobj[c(1, which(k)+6)]
and then binding the results together is not always accurate. I was hoping that a function like 'first' could be used, but I wasn't sure how to do this for multiple days; perhaps that might be the optimal solution. Is there a better way of obtaining this information?
Many thanks to the Stack Overflow community in advance.
split(xtsobj, "days") will create a list with an xts object for each day.
Then you can apply head to each day:
lapply(split(xtsobj, "days"), head, 5)
or more generally
lapply(split(xtsobj, "days"), function(x) {
x[1:5, ]
})
Finally, you can rbind the days back together if you want.
do.call(rbind, lapply(split(xtsobj, "days"), function(x) x[1:5, ]))
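If you need the first 5 minutes of clock time rather than the first 5 rows (which, given the uneven spacing you describe, may differ), a hedged variation on the same split/lapply idea:
do.call(rbind, lapply(split(xtsobj, "days"), function(x) {
  x[index(x) < index(x)[1] + 5 * 60, ]   # rows within 5 minutes of each day's first observation
}))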
What about using the package lubridate? First find the starting point of each day (which, as you say, changes somewhat randomly), and then use the function minutes().
So it would be something like:
five_minutes_after = starting_point_each_day + minutes(5)
Then you can use the usual xts subsetting, doing something like:
five_min_period = paste(starting_point_each_day, five_minutes_after, sep='/')
xtsobj[five_min_period]
Edit:
#Joshua
I think this works, look at this example:
library(xts)
library(lubridate)
x <- xts(cumsum(rnorm(20, 0, 0.1)), Sys.time() - seq(60, 1200, 60))
starting_point_each_day= index(x[1])
five_minutes_after = index(x[1]) + minutes(5)
five_min_period = paste(starting_point_each_day,five_minutes_after,sep='/')
x[five_min_period]
In my previous example I made a mistake: I put the five_min_period between quotes.
Was that what you were pointing out, Joshua? Also, maybe the starting point is not necessary; just:
until5min=paste('/',five_minutes_after,sep="")
x[until5min]
