Calculate sum of counts per min from data frame in R

I've been trying to figure this out for a while, but haven't been able to do so. I found a lot of similar questions which didn't help at all.
I have around 43000 records in a data frame in R. The date column is in the format "2011-11-15 02:00:01", and the other column is the count. The structure of the data frame:
str(results)
'data.frame': 43070 obs. of 2 variables:
$ dates: Factor w/ 43070 levels "2011-11-15 02:00:01",..: 1 2 3 4 5 6 7 8 9 10 ...
$ count: num 1 2 1 1 1 1 2 3 1 2 ...
How can I get the total count per min?
I also want to convert the results data frame into JSON. I used the rjson package, which converted the entire data frame into a single JSON element. When I inserted it into MongoDB, there was only one _id for all 43000 records. What did I do wrong?

You can use the xts package to get the counts/minute quite easily.
install.packages("xts")
require("xts")
results_xts <- xts(results$count, order.by = as.POSIXct(as.character(results$dates)))  # as.character() because dates is a factor
This converts your data frame to an xts object. There are a bunch of functions (apply.daily, apply.yearly, etc.) in xts that apply a function over different time frames, but there isn't one for minutes. Fortunately the code for those functions is super simple, so just run:
ep <- endpoints(results_xts, "minutes")
period.apply(results_xts, ep, FUN = sum)
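To make that concrete, here is a minimal sketch of the whole pipeline on made-up data (the sample values below are hypothetical, not the asker's):
library(xts)
# hypothetical data: one observation per second for three minutes
dates <- seq(as.POSIXct("2011-11-15 02:00:01"), by = "sec", length.out = 180)
results <- data.frame(dates = factor(format(dates)), count = 1)
results_xts <- xts(results$count, order.by = as.POSIXct(as.character(results$dates)))
ep <- endpoints(results_xts, "minutes")   # last row of each minute
period.apply(results_xts, ep, FUN = sum)  # total count per minute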
Sorry, I don't know the answer to your other question.

A caveat: this is untested, but here is my solution for getting the counts per minute. Maybe someone will chime in on the JSON part; I'm not familiar with that.
Here's my example time series and count:
now <- Sys.time()
tseq <- seq(now, length.out = 130, by = "sec")
count<-rep(1, 130)
We find the indices where our minutes switch via the following:
mins<-c(0,diff(floor(cumsum(c(0,diff(tseq)))/60)))
indxs<-which(mins%in%1)
Let me break that down (as there are many things nested in there):
1. First we diff over the time sequence, then add a 0 on the front because we lose an observation with diff.
2. Second, cumsum the diff-ed vector, giving us the seconds value in each spot (this could probably also be done with a simple format call over the vector of times).
3. Third, divide that vector, now the seconds in each spot, by 60 so we get a value in each spot corresponding to the minutes.
4. Fourth, floor it so we get integers.
5. diff that vector so we get 0's everywhere except 1's where the minute switches.
6. Add a 0 to the front since we lose an observation with the diff.
7. Get the indices of the 1's with the which call.
Then we find the starts and ends of our minutes:
startpoints<-indxs
endpoints<-c(indxs[2:length(indxs)], length(mins))
Then we simply sum over the corresponding subsets:
mapply(function(start, end) sum(count[start:end]), start=startpoints, end=endpoints)
#[1] 61 10
We get 61 for the first point because the first subset includes both the 0th and the 60th second.
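For comparison, here is a shorter base-R sketch of the same idea (untested against the asker's actual data): truncate each timestamp to its minute, then sum the counts within each minute.
now <- Sys.time()
tseq <- seq(now, length.out = 130, by = "sec")
count <- rep(1, 130)
minute <- format(tseq, "%Y-%m-%d %H:%M")  # drop the seconds
tapply(count, minute, sum)                # total count per minute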

Related

All data in one column

I have this soccer data all in one column.
Round 36 # Round of the league------------------------------------
29.07. 20:45 # Date and time of the match
Barcelona # Home Team
4 - 1 # FT result
Getafe # Away team
(2 - 0) # HT result
29.07. 20:45 # date and time of the second match of the round
Valencia
2 - 3
Laci
(1 - 2)
Round 35 # repeating pattern -------------------------------------------------
How can I move all the data for a certain round of the league into a new column? E.g. I want all observations from the Round 36 marker to the Round 35 marker in a single column, and so on.
I really do not have any idea how to solve this. I tried to transpose the data so that I could work better with observations as variables but still nothing. I am just a beginner in R and would appreciate any help.
thanks
Assuming your data is within a variable named lines (e.g., lines[1] = "Round 36" is the first entry, lines[2] = "29.07. 20:45" is the next entry, and so forth), we can spot the round markers, split the vector into a list, and finally bind it into a data.frame (assuming the rounds have equal length; if not, you will have to do some manual work):
#Figure out where each round is.
rounds <- grepl('^Round', lines)
# Split it into separate lists. cumsum(rounds) will be an index for each group.
data <- split(lines, cumsum(rounds))
# Bind the data into a data.frame (assuming all have the same amount of data)
bound <- do.call(rbind, data)
Of course without a reproducible example it is hard to test the final result.
Note that if the soccer data does not have equal amount of data between rounds or if the data does not come in the same order, the resulting data.frame may not make immediate sense (if round 45 has 7 elements but round 46 has 4, round 46 will recycle element 1, 2 and 3 to fill out the missing values), but it might make it simpler to do some follow up data cleaning.
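To make that concrete, here is a minimal sketch on two made-up rounds of equal length (the lines vector below is hypothetical):
lines <- c("Round 36", "29.07. 20:45", "Barcelona", "4 - 1", "Getafe", "(2 - 0)",
           "Round 35", "22.07. 20:45", "Valencia", "2 - 3", "Lazio", "(1 - 2)")
rounds <- grepl('^Round', lines)
data <- split(lines, cumsum(rounds))
bound <- do.call(rbind, data)
# bound is now a 2 x 6 character matrix, one row per round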

Explanation for aggregate and cbind function

First, I can't understand the aggregate function and cbind; I need an explanation in really simple words. Second, I have this data:
permno number mean std
1 10107 120 0.0117174000 0.06802718
2 11850 120 0.0024398083 0.04594591
3 12060 120 0.0005072167 0.08544500
4 12490 120 0.0063569167 0.05325215
5 14593 120 0.0200060583 0.08865493
6 19561 120 0.0154743500 0.07771348
7 25785 120 0.0184815583 0.16510082
8 27983 120 0.0025951333 0.09538822
9 55976 120 0.0092889000 0.04812975
10 59328 120 0.0098526167 0.07135423
I need to process this with
data_processed2 <- aggregate(cbind(return)~permno, Data_summary, median)
I can't understand this command. Please explain it to me very simply. Thank you!
cbind takes two or more tables (data frames), puts them side by side, and then makes them into one big table. So for example, if you have one table with columns A, B and C, and another with columns D and E, after you cbind them you'll have one table with five columns: A, B, C, D and E. For the rows, cbind assumes all tables are in the same order.
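As a tiny sketch of that (the column names are made up):
df1 <- data.frame(A = 1:3, B = 4:6, C = 7:9)
df2 <- data.frame(D = 10:12, E = 13:15)
cbind(df1, df2)  # one table with columns A, B, C, D, E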
As noted by Rui, in your example cbind doesn't do anything, because return is not a table, and even if it was, it's only one thing.
aggregate takes a table, divides it by some variable, and then calculates a statistic on a variable within each group. For example, if I have data for sales by month and day of month, I can aggregate by month and calculate the average sales per day for each of the months.
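Here's a minimal sketch of that sales example (the numbers are made up):
sales <- data.frame(month = rep(c("Jan", "Feb"), each = 3),
                    amount = c(10, 12, 8, 20, 18, 22))
aggregate(amount ~ month, sales, mean)  # average sales per day, by month
#   month amount
# 1   Feb     20
# 2   Jan     10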
The command you provided uses the following syntax:
aggregate(VARIABLES~GROUPING, DATA, FUNCTION)
VARIABLES (cbind(return) - which doesn't make sense, really) is the list of all the variables for which your statistic will be calculated.
GROUPING (permno) is the variable by which you will break the data into groups (in the sample data you provided each row has a unique value for this variable, so that doesn't really make sense either).
DATA is the data frame you're using.
FUNCTION is median.
So this call will break Data_summary into groups that have the same permno, and calculate the median for each of the columns.
With the data you provided you'll basically get the same table back, since you're grouping the data by groups of one row each... Actually, since cbind(return) doesn't match any column in the data you showed, as far as I can tell you'll get nothing back.

Interval search on a data frame

I have a data frame of timestamps and values; the timestamps are seconds from the epoch. Here is how the top few elements of that data frame look:
val = seq(1,19)
ts = seq(1342980888,1342982000,by=60)
x = data.frame(ts = ts,val = val)
head(x)
ts val
1 1342980888 1
2 1342980948 2
3 1342981008 3
4 1342981068 4
5 1342981128 5
6 1342981188 6
I would like some kind of interval-search function which takes as input a timestamp, say 1342980889 (the ts in the first row plus 1), and returns 1, 2 (the row numbers) as output. Basically, I want to find the two rows whose timestamps bracket the input timestamp 1342980889. While this is relatively easy to do using "which", I suspect "which" does a vector scan, and as the real data frame is quite large I want to do it using a binary search. Thanks much in advance.
You should use the findInterval function. It will give you the index of the row where x$ts is immediately smaller than (or equal to) the value you are looking for, and you just have to add one to get the other index.
findInterval(1342980889, x$ts)
# [1] 1
Also note that the function is vectorized, i.e., the first argument can be a vector of values to look for:
findInterval(c(1342980889, 1342981483), x$ts)
# [1] 1 10
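So, to get the two bracketing row numbers for the example timestamp, a short follow-up sketch (findInterval is documented to use a binary search, which is what you were after):
idx <- findInterval(1342980889, x$ts)
c(idx, idx + 1)
# [1] 1 2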

How to SUM multiple columns within a zoo object

This should be exceptionally simple. I have a zoo object with 500 time series (each one a different product) and 250 periods of sales. The zoo object is perfectly rectangular; all series contain an observation at each point in time. My index column is a very simple 1...250.
My difficulty is in trying to aggregate all of the time series to form a "Total Sales" series.
I've tried using aggregate, which seems focused on aggregating rows e.g. days into months. But I want to keep every time period, just aggregate the time series together. This is a simplified version of my zoo object shown below with only 5 series.
head(z.all)
1 2 3 4 5
1 1232.205 1558.056 993.9784 1527.066 359.6946
2 1262.194 1665.084 1092.0105 1834.313 484.5073
3 1301.034 1528.607 900.4158 1587.548 525.5191
4 1014.082 1352.090 1085.6376 1785.034 490.9164
5 1452.149 1623.015 1197.3709 1944.189 600.5150
6 1463.359 1205.948 1155.0340 1528.887 556.6371
When I try to aggregate using either of the following 2 commands I get exactly the same data as in my original zoo object!!
aggregate(z.all[,1:num.series], index(z.all), sum)
aggregate(z.all, index(z.all), sum)
However I am able to aggregate by doing this, though it's not realistic for 500 columns! I want to avoid using a loop if possible.
z.all[,1] + z.all[,2]
Apologies if this is not the right protocol, it's my first post in this site.
I hope I understood correctly what you want. But if it is a row sum you are looking for:
rowSums(z.all)
directly from the base package (?rowSums). This function adds up all values along each row:
D<-cbind(rep(1,10),c(1:10))
colSums(D)
[1] 10 55
rowSums(D)
[1] 2 3 4 5 6 7 8 9 10 11
The opposite would be colSums(), which sums each column.
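Applied back to the question, a minimal sketch (assuming z.all is the zoo object from the question; wrapping the result in zoo() keeps the original index):
library(zoo)
total.sales <- zoo(rowSums(z.all), index(z.all))  # a single "Total Sales" series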

How to graph requests per second from web log file using R

I'm trying to graph requests per second using our apache log files.
I've massaged the log down to a simple listing of the timestamps, one entry per request.
04:02:28
04:02:28
04:02:28
04:02:29
...
I can't quite figure out how to get R to recognize these as times and aggregate them to per-second counts.
Thanks for any help
The lubridate package makes working with dates and times very easy.
Here is an example using the hms() function of lubridate. hms converts a character string into a data frame with separate columns for hours, minutes and seconds. There are similar functions mdy (month-day-year), dmy (day-month-year), ms (minutes-seconds)... you get the point.
library(lubridate)
data <- c("04:02:28", "04:02:28", "04:02:28", "04:02:29")
times <- hms(data)
times$second
[1] 28 28 28 29
At this point, times is a straight-forward data frame, and you can isolate any column you wish:
str(times)
Classes 'period' and 'data.frame': 4 obs. of 6 variables:
$ year : num 0 0 0 0
$ month : num 0 0 0 0
$ day : num 0 0 0 0
$ hour : num 4 4 4 4
$ minute: num 2 2 2 2
$ second: num 28 28 28 29
It seems to me that since you already have time-stamps at one-second granularity, all you need to do is do a frequency-count of the time-stamps and plot the frequencies in the original time-order. Say timeStamps is your array of time-stamps, then you would do:
plot(c( table( timeStamps ) ) )
I'm assuming you want to plot the log-messages in each one-second interval over a certain period. Also I'm assuming that the HMS time-stamps are within one day. Note that the table function produces a frequency-count of its argument.
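A short end-to-end sketch of that idea, using the sample stamps from the question as the hypothetical timeStamps vector:
timeStamps <- c("04:02:28", "04:02:28", "04:02:28", "04:02:29")
counts <- table(timeStamps)  # frequency count per one-second stamp
plot(c(counts), type = "h", xlab = "time", ylab = "requests per second", xaxt = "n")
axis(1, at = seq_along(counts), labels = names(counts))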
I'm not exactly sure how to do this correctly, but this should be one possible way, and maybe it helps you.
1. Instead of strings, get the data as UNIX timestamps from the database; these denote the number of seconds since 1970-01-01.
2. Use hist(data) to plot a histogram, for example. Or you may use the melt command from the reshape2 package and then dcast to create a data frame where one column is the timestamp and another column gives the number of transactions at that time.
3. Use as.POSIXlt(your.unix.timestamps, origin="1970-01-01", tz="GMT") to convert the timestamps to datetime structures R understands.
4. Then add labels to the plot using the data from point 3 and format.
Here's an example:
# original data
data.timestamps = c(1297977452, 1297977452, 1297977453, 1297977454, 1297977454, 1297977454, 1297977455, 1297977455)
data.unique.timestamps = unique(data.timestamps)
# get the labels
data.labels = format(as.POSIXlt(data.unique.timestamps, origin="1970-01-01", tz="GMT"), "%H:%M:%S")
# plot the histogram without axes
hist(data.timestamps, axes=F)
# add axes manually
axis(2)
axis(1, at=unique(data.timestamps), labels=data.labels)
Hope this helps
