I'm using R to analyze 365 days of data covering over 40,000 events. The events occur at various times of day. I want to aggregate the events and calculate means at various intervals, such as 2, 8, or 12 hours, or daily. I've seen cut and aggregate used, but they do not appear to provide the intervals I need.
Any suggestions would be greatly appreciated.
To use the cut function, one must first define the break points. To do that, use the seq function.
mydateseq<-seq(as.POSIXct("2016-01-01"), by="2 hour", length.out = 20)
There are options to set the start/stop points or the number of elements. In this example the breaks are set every 2 hours, but this is adjustable. See ?seq.POSIXt for more help. Be sure to set the start/stop points so they completely capture the date range of interest.
Once the date sequence is defined, it can be passed to the cut function for aggregation, or used with the group_by function in the dplyr package.
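A minimal sketch of the full pipeline, assuming a data frame events with a POSIXct timestamp column and a numeric value column (both names are invented for illustration):
# Hypothetical data: one year of events at random times
set.seed(1)
events <- data.frame(
  timestamp = as.POSIXct("2016-01-01") + runif(1000, 0, 365 * 24 * 3600),
  value     = rnorm(1000)
)
# Breaks every 2 hours, spanning the full date range
breaks <- seq(as.POSIXct("2016-01-01"), as.POSIXct("2017-01-01"), by = "2 hour")
# Assign each event to its interval, then average within intervals
events$interval <- cut(events$timestamp, breaks = breaks)
aggregate(value ~ interval, data = events, FUN = mean)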
I have a dataset with solar power generation for 24 hours a day over many days. I now have to find the average of the power generated by time of day; for example, the average of the power generated at 9:00:00 AM.
Start by stripping out the time from the date-time variable.
Assuming your data frame is called myData:
library(lubridate)
myData$Hour <- hour(strptime(myData$Time, format = "%Y-%m-%d %H:%M:%S"))
Then use ddply from the plyr package, which allows us to apply a function to a subset of the data.
library(plyr)
myMeans <- ddply(myData[, c("Hour", "IT_solar_generation")], "Hour", numcolwise(mean))
The resulting data frame will have one column called Hour, which gives you the hour of day, and another with the mean generation at each hour.
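If you prefer dplyr, an equivalent sketch (same assumed column names):
library(dplyr)
myData %>%
  group_by(Hour) %>%
  summarise(mean_generation = mean(IT_solar_generation, na.rm = TRUE))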
Now, on another but important note: when you ask a question, you should provide information on the attempts you've made so far to answer it. This isn't a help desk.
How does the ts() function use its frequency parameter? What is the effect of assigning a wrong value as the frequency?
I am trying to use 1.5 years of website usage data to build a time series model so that I can forecast the usage for coming periods. I am using data at daily level. What should be the frequency here - 7 or 365 or 365.25?
The frequency is "the" period at which seasonal cycles repeat. I use "the" in scare quotes since, of course, there are often multiple cycles in time series data. For instance, daily data often exhibit weekly patterns (a frequency of 7) and yearly patterns (a frequency of 365 or 365.25 - the difference often does not matter).
In your case, I would assume that weekly patterns dominate, so I would assign frequency=7. If your data exhibits additional patterns, e.g., holiday effects, you can use specialized methods accounting for multiple seasonalities, or work with dummy coding and a regression-based framework.
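For illustration, a sketch with simulated daily data (the series, its length, and all the numbers here are invented):
set.seed(42)
# 1.5 years of daily "usage" with a weekly pattern plus noise
usage <- 100 + 10 * sin(2 * pi * (1:548) / 7) + rnorm(548, 0, 3)
# frequency = 7 tells ts() that one seasonal cycle spans 7 observations
usage_ts <- ts(usage, frequency = 7)
# Decompose to check that the weekly pattern is picked up
plot(stl(usage_ts, s.window = "periodic"))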
Here, the frequency parameter is not a frequency that you can observe in the data of your time series. Instead, you have to specify the frequency at which samples of the time series were taken. In your case, this is simply 1 day, or 1.
The value you give here will influence the results you get later when running analysis operations (examples are average requests per time unit, or a Fourier transform to get the (real) frequencies in the data). For example, if you wanted to get all your results in units of hours instead of days, you would pass 24 instead of 1 as the frequency, because your data samples were taken at a frequency of once every 24 hours.
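A quick illustration of how the frequency argument changes the time axis (toy data):
x <- rnorm(48)
time(ts(x, frequency = 1))[1:5]    # 1.000 2.000 3.000 4.000 5.000
time(ts(x, frequency = 24))[1:5]   # 1.000 1.042 1.083 1.125 1.167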
I have logs of the number of arrivals at a bank, every half hour for one month.
I am trying to find different cluster groups according to the number of arrivals. I tried clustering according to the day, and according to the hour (not of a specific day). I would like to try clustering according to the hour of a specific day.
An example:
Thursdays at 14:00 and Sundays at 15:00 are one cluster, with an average of 10000 arrivals.
Mondays at 13:00, Mondays at 10:00, and Tuesdays at 16:00 are one cluster, with an average of 15000 arrivals.
All the rest are another cluster, with an average of 2000 arrivals.
I have a csv file with the columns: Date, Day(1-7), Time, Arrivals
Until now I used this:
km <- kmeans(table, 3, 15)
plot(km)
(I tried 3 clusters.) This code clusters pairs of columns: the plot is a 3x3 matrix showing each pair of the 3 columns plotted against each other.
Is there a way to do that?
k-means and similar algorithms will yield meaningless results on this kind of data.
The problem is you are using the wrong tool for the wrong problem on the wrong data.
Your data is: Date, Day(1-7), Time, Arrivals
K-means will try to minimize variance. But does variance make any sense on this data set? How do you know which k makes the most sense? Since Arrivals likely has the largest variance of these attributes, it will completely dominate your result.
The question you should first try to answer is: what is a good result? Then consider ways of visualizing the results to verify that you are onto something. Once you have visualized the data, consider ways to manually mark the desired result on the visualization; that may well be good enough for you, and better than hoping k-means yields a somewhat meaningful result, because on this kind of mixed-type data it usually does not work very well.
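A hedged sketch of that "visualize first" advice, assuming a data frame with the columns described (Day 1-7, Time formatted like "14:00", Arrivals); the file name and the Time parsing are invented:
arrivals <- read.csv("arrivals.csv")
arrivals$Hour <- as.integer(substr(arrivals$Time, 1, 2))
# Mean arrivals for every (day, hour) cell
m <- tapply(arrivals$Arrivals, list(arrivals$Day, arrivals$Hour), mean)
# A heatmap often shows the high- and low-traffic groups directly
image(x = 1:7, y = as.numeric(colnames(m)), z = m,
      xlab = "Day of week (1-7)", ylab = "Hour of day",
      main = "Mean arrivals by day and hour")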
Given a series of events, is there an algorithm for determining if a certain number of events occur in a certain period of time? For example, given list of user logins, are there any thirty day periods that contain more than 10 logins?
I can come up with a few brute-force ways to do this; I'm just wondering if there is an algorithm or a name for this kind of problem that I haven't turned up with the usual Google searching.
In general it is called binning. It is basically aggregating one variable (e.g. events) over an index (e.g. time) using count as a summary function.
Since you didn't provide data, I'll just show a simple example:
# Start with a data frame of dates and number of events
data <- data.frame(
  date   = paste('2013', rep(1:12, each = 20), rep(1:20, times = 12), sep = '-'),
  logins = rpois(12 * 20, 5)
)
# Make sure to store dates as class Date; it can be useful for other purposes
data$date <- as.Date(data$date)
# Now bin it. Exactly how you do this depends on what you want;
# let's just sum the number of events for each month
data$month <- format(data$date, '%m')
aggregate(logins ~ month, data = data, sum, na.rm = TRUE)
Is that what you wanted?
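If you need sliding thirty-day windows rather than fixed calendar bins, a small brute-force sketch on the same data frame; checking only windows that end at an observed date is enough, since any window can be shifted to end at its last event without losing counts:
dates  <- sort(data$date)
counts <- sapply(dates, function(d) sum(dates > d - 30 & dates <= d))
any(counts > 10)   # TRUE if some 30-day window contains more than 10 logins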
I would like to subset out the first 5 minutes of time series data for each day from minutely data. However, the first 5 minutes do not occur at the same time each day, so something like xtsobj["T09:00/T09:05"] would not work, since the beginning of the first 5 minutes changes; sometimes it starts at 9:20 am or some other random time in the morning instead of 9 am.
So far, I have been able to subset out the first minute for each day using a function like:
k <- diff(index(xtsobj)) > 10000
xtsobj[c(1, which(k)+1)]
i.e. finding gaps in the data that are larger than 10000 seconds. But going from that to finding the first 5 minutes of each day is proving more difficult, as the data is not always evenly spaced: between the first minute and the 5th minute there could be anywhere from 2 to 5 rows, so using something like:
xtsobj[c(1, which(k)+6)]
and then binding the results together is not always accurate. I was hoping that a function like first could be used, but I wasn't sure how to do this for multiple days; perhaps that might be the optimal solution. Is there a better way of obtaining this information?
Many thanks to the Stack Overflow community in advance.
split(xtsobj, "days") will create a list with an xts object for each day.
Then you can apply head to each day:
lapply(split(xtsobj, "days"), head, 5)
or, more generally:
lapply(split(xtsobj, "days"), function(x) {
x[1:5, ]
})
Finally, you can rbind the days back together if you want.
do.call(rbind, lapply(split(xtsobj, "days"), function(x) x[1:5, ]))
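Since the question mentions first: xts's first() also accepts a time span instead of a row count, which matches "the first 5 minutes" more closely than "the first 5 rows" when the data are unevenly spaced (a sketch using the same object name):
library(xts)
# first(x, "5 mins") keeps every observation in the first 5 minutes of each
# day, however many rows that happens to be
do.call(rbind, lapply(split(xtsobj, "days"), first, "5 mins"))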
What about using the lubridate package: first find the starting point of each day (which, as you say, changes somewhat randomly), and then use the minutes function.
So it would be something like:
five_minutes_after = starting_point_each_day + minutes(5)
Then you can use the usual xts subsetting, doing something like:
five_min_period = paste(starting_point_each_day, five_minutes_after, sep = '/')
xtsobj[five_min_period]
Edit:
@Joshua
I think this works; look at this example:
library(lubridate)
library(xts)
x <- xts(cumsum(rnorm(20, 0, 0.1)), Sys.time() - seq(60, 1200, 60))
starting_point_each_day= index(x[1])
five_minutes_after = index(x[1]) + minutes(5)
five_min_period = paste(starting_point_each_day,five_minutes_after,sep='/')
x[five_min_period]
In my previous example I made a mistake: I put five_min_period between quotes.
Was that what you were pointing out, Joshua? Also, maybe the starting point is not necessary; just:
until5min = paste('/', five_minutes_after, sep = "")
x[until5min]