Given a series of events, is there an algorithm for determining if a certain number of events occur in a certain period of time? For example, given list of user logins, are there any thirty day periods that contain more than 10 logins?
I can come up with a few brute force ways to do this, just wondering if there is an algorithm or name for this kind of problem that I havent turned up with the usual google searching.
In general it is called binning. It is basically aggregating one variable (e.g. events) over an index (e.g. time) using count as a summary function.
Since you didn't provide data I'll just show a simple example:
# Start with a dataframe of dates and number of events
data <- data.frame(date=paste('2013', rep(1:12, each=20), rep(1:20, times=12), sep='-'),
logins=rpois(12*20, 5))
# Make sure to store dates as class Date, it can be useful for other purposes
data$date <- as.Date(data$date)
# Now bin it. This is just a dirty trick, exactly how you do it depends on what you want.
# Lets just sum the number of events for each month
data$month <- sub('-', '', substr(data$date, 6, 7))
aggregate(logins~month, data=data, sum, na.rm=TRUE)
Is that what you wanted?
Related
I've been struggling with a bit of timestamp data (haven't had to work with dates much until now, and it shows). Hope you can help out.
I'm working with data from a website showing for each customer (ID) their respective visits and the timestamp for those visits. It's grouped in the sense that one customer might have multiple visits/timestamps.
The df is structured as follows, in a long format:
df <- data.frame("Customer" = c(1, 1, 1, 2, 3, 3),
"Visit" =c(1, 2, 3, 1, 1, 2), # e.g. customer ID #1 has visited the site three times.
"Timestamp" = c("2019-12-31 12:13:25", "2019-12-31 16:13:25", "2020-01-05 10:13:25", "2019-11-12 15:18:42", "2019-11-13 19:22:35", "2019-12-10 19:43:55"))
Note: In the real dataset the timestamp isn't a factor but some other haggard character-type abomination which I should probably first try to convert into a POSIXct format somehow.
What I would like to do here is to create a df that displays per customer their average time between visits (let's say in minutes, or hours). Visitors with only a single visit (e.g., second customer in my example) could be filtered out in advance or should display a 0. My final goal is to visualize that distribution, and possibly calculate a grand mean across all customers.
Because the number of visits can vary drastically (e.g. one or 256 visits) I can't just use a 'wide' version of the dataset where a fixed number of visits are the columns which I could then subtract and average.
I'm at a bit of a loss how to best approach this type of problem, thanks a bunch!
Using dplyr:
df %>%
arrange(Customer, Timestamp) %>%
group_by(Customer) %>%
mutate(Difference = Timestamp - lag(Timestamp)) %>%
summarise(mean(Difference, na.rm = TRUE))
Due to the the grouping, the first value of difference for any costumer should be NA (including those with only one visit), so they will be dropped with the mean.
Using base R (no extra packages):
sort the data, ordering by customer Id, then by timestamp.
calculate the time difference between consecutive rows (using the diff() function), grouping by customer id (tapply() does the grouping).
find the average
squish that into a data.frame.
# 1 sort the data
df$Timestamp <- as.POSIXct(df$Timestamp)
# not debugged
df <- df[order(df$Customer, df$Timestamp),]
# 2 apply a diff.
# if you want to force the time units to seconds, convert
# the timestamp to numeric first.
# without conversion
diffs <- tapply(df$Timestamp, df$Customer, diff)
# ======OR======
# convert to seconds
diffs <- tapply(as.numeric(df$Timestamp), df$Customer, diff)
# 3 find the averages
diffs.mean <- lapply(diffs, mean)
# 4 squish that into a data.frame
diffs.df <- data.frame(do.call(rbind, diffs.mean))
diffs.df$Customer <- names(diffs.mean)
# 4a tidy up the data.frame names
names(diffs.df)[1] <- "Avg_Interval"
diffs.df
You haven't shown your timestamp strings, but when you need to wrangle them, the lubridate package is your friend.
I'm using R to analyze 365 days of data collected on over 40,000 events. The events occur at various times of the day. I wish to aggregate the events and calculate means at various intervals such as 2, 8, 12 hour or daily. I've seen CUT and AGGREGATE used but it does not appear to provide the intervals as required.
Any suggestions would be greatly appreciated.
To use the CUT function one must first define the break points. In order to do that, use the seq function.
mydateseq<-seq(as.POSIXct("2016-01-01"), by="2 hour", length.out = 20)
There are options to set the start/stop points or the number of elements. In this example the breaks are set every 2 hours but this is adjustable. See ?seq.POSIXt for more help. Be sure to set the start/stop to completely capture the date range of interest.
Once the date sequence is defined this can be passed to cut function to aggregate or use the group_by function in the dplyr package.
I've seen a lot of solutions to working with groups of times or date, like aggregate to sum daily observations into weekly observations, or other solutions to compute a moving average, but I haven't found a way do what I want, which is to pluck relative dates out of data keyed by an additional variable.
I have daily sales data for a bunch of stores. So that is a data.frame with columns
store_id date sales
It's nearly complete, but there are some missing data points, and those missing data points are having a strong effect on our models (I suspect). So I used expand.grid to make sure we have a row for every store and every date, but at this point the sales data for those missing data points are NAs. I've found solutions like
dframe[is.na(dframe)] <- 0
or
dframe$sales[is.na(dframe$sales)] <- mean(dframe$sales, na.rm = TRUE)
but I'm not happy with the RHS of either of those. I want to replace missing sales data with our best estimate, and the best estimate of sales for a given store on a given date is the average of the sales 7 days prior and 7 days later. E.g. for Sunday the 8th, the average of Sunday the 1st and Sunday the 15th, because sales is significantly dependent on day of the week.
So I guess I can use
dframe$sales[is.na(dframe$sales)] <- my_func(dframe)
where my_func(dframe) replaces every stores' sales data with the average of the store's sales 7 days prior and 7 days later (ignoring for the first go around the situation where one of those data points is also missing), but I have no idea how to write my_func in an efficient way.
How do I match up the store_id and the dates 7 days prior and future without using a terribly inefficient for loop? Preferably using only base R packages.
Something like:
with(
dframe,
ave(sales, store_id, FUN=function(x) {
naw <- which(is.na(x))
x[naw] <- rowMeans(cbind(x[naw+7],x[naw-7]))
x
}
)
)
I would like to subset out the first 5 minutes of time series data for each day from minutely data, however the first 5 minutes do not occur at the same time each day thus using something like xtsobj["T09:00/T09:05"] would not work since the beginning of the first 5 minutes changes. i.e. sometimes it starts at 9:20am or some other random time in the morning instead of 9am.
So far, I have been able to subset out the first minute for each day using a function like:
k <- diff(index(xtsobj))> 10000
xtsobj[c(1, which(k)+1)]
i.e. finding gaps in the data that are larger than 10000 seconds, but going from that to finding the first 5 minutes of each day is proving more difficult as the data is not always evenly spaced out. I.e. between first minute and 5th minute there could be from 2 row to 5 rows and thus using something like:
xtsobj[c(1, which(k)+6)]
and then binding the results together
is not always accurate. I was hoping that a function like 'first' could be used, but wasn't sure how to do this for multiple days, perhaps this might be the optimal solution. Is there a better way of obtaining this information?
Many thanks for the stackoverflow community in advance.
split(xtsobj, "days") will create a list with an xts object for each day.
Then you can apply head to the each day
lapply(split(xtsobj, "days"), head, 5)
or more generally
lapply(split(xtsobj, "days"), function(x) {
x[1:5, ]
})
Finally, you can rbind the days back together if you want.
do.call(rbind, lapply(split(xtsobj, "days"), function(x) x[1:5, ]))
What about you use the package lubridate, first find out the starting point each day that according to you changes sort of randomly, and then use the function minutes
So it would be something like:
five_minutes_after = starting_point_each_day + minutes(5)
Then you can use the usual subset of xts doing something like:
5_min_period = paste(starting_point_each_day,five_minutes_after,sep='/')
xtsobj[5_min_period]
Edit:
#Joshua
I think this works, look at this example:
library(lubridate)
x <- xts(cumsum(rnorm(20, 0, 0.1)), Sys.time() - seq(60,1200,60))
starting_point_each_day= index(x[1])
five_minutes_after = index(x[1]) + minutes(5)
five_min_period = paste(starting_point_each_day,five_minutes_after,sep='/')
x[five_min_period]
In my previous example I made a mistake, I put the five_min_period between quotes.
Was that what you were pointing out Joshua? Also maybe the starting point is not necessary, just:
until5min=paste('/',five_minutes_after,sep="")
x[until5min]
I have a CSV file with a list of posts from an online discussion forum. I have the timestamp for each post in this format: YYYY-MM-DD hh:mm:ss.
I want to calculate how often a new post is submitted, as in "X posts per second". I think what I need is just the mean, median and sd for the rate of posting (posts per second). I just loaded the CSV:
d <- read.csv("posts.csv")
colnames(d) <- c("post.id", "timestamp")
The average number of posts per second is just 1/interval from last posting, so make a vector of diff(times) and then take mean(1/as.numeric(diff(times))).
> posts <- data.frame(ids = paste(letters[sample(1:26, 100, replace=TRUE)],
sample(1:100) ), time=Sys.time() +cumsum(abs(rnorm(100))*100) )
> mean( 1/as.numeric(diff(posts$time)) )
[1] 0.03545346
Edit: I thought that by using cumsum I would get the time series ordered, but that was not the case, so it's amended to take abs(rnorm(100) ).
Something like:
tt <- table(cut(as.POSIXlt(d$timestamp),"1 sec"))
c(mean(tt),median(tt),sd(tt))
You didn't provide a reproducible example so I'm not 100% sure this works, but something like that ... also don't know how well it will scale to giant data sets.
More detail (with example):
set.seed(1001)
n <- 1e5
nt <- 1e5
z <- seq(as.POSIXct("2010-09-01"),length=nt,by="1 sec")
length(z)
z2 <- sample(z,size=n,replace=TRUE)
tt <- table(cut(z2,"1 sec"))
c(mean(tt),median(tt),sd(tt))
This tiny example suggests that the cut() command might be slow.
Play with the 'nt' (number of seconds in the time interval from beginning to end) and 'n' (number of samples) parameters to get a sense of how long your problem will take.
i dont know your programming language, but if you could convert the timestamp to milliseconds, just subtract the lowest from the highest timestamp, then divide by the number of posts (rows in the posts.csv) then divide by 1000 (milliseconds) and your left with posts per second. Or if you can get the timestamp in seconds, it is the same except don't divide by 1000.