How to SUM multiple columns within a zoo object - R

This should be exceptionally simple. I have a zoo object with 500 time series (each one a different product) and 250 periods of sales. The zoo object is perfectly rectangular; all series contain an observation at each point in time. My index column is a very simple 1...250.
My difficulty is in trying to aggregate all of the time series to form a "Total Sales" series.
I've tried using aggregate, which seems focused on aggregating rows, e.g. days into months. But I want to keep every time period and just aggregate the time series together. A simplified version of my zoo object with only 5 series is shown below.
head(z.all)
1 2 3 4 5
1 1232.205 1558.056 993.9784 1527.066 359.6946
2 1262.194 1665.084 1092.0105 1834.313 484.5073
3 1301.034 1528.607 900.4158 1587.548 525.5191
4 1014.082 1352.090 1085.6376 1785.034 490.9164
5 1452.149 1623.015 1197.3709 1944.189 600.5150
6 1463.359 1205.948 1155.0340 1528.887 556.6371
When I try to aggregate using either of the following 2 commands I get exactly the same data as in my original zoo object!!
aggregate(z.all[,1:num.series], index(z.all), sum)
aggregate(z.all, index(z.all), sum)
However I am able to aggregate by doing this, though it's not realistic for 500 columns! I want to avoid using a loop if possible.
z.all[,1] + z.all[,2]
Apologies if this is not the right protocol; it's my first post on this site.

I hope I understood correctly what you want. If it is a row sum you are looking for:
rowSums(z.all)
directly from the base package (see ?rowSums). This function adds up all values along each row:
D <- cbind(rep(1, 10), 1:10)
colSums(D)
[1] 10 55
rowSums(D)
[1] 2 3 4 5 6 7 8 9 10 11
The opposite is colSums(), which sums each column.
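Applied to the zoo object in the question, one possible sketch (untested against the real data, and assuming z.all is the full 500-column object) is to wrap the row sums back into a zoo series so the time index is kept:
library(zoo)

# Total sales across all series, kept as a zoo series on the original index
total.sales <- zoo(rowSums(coredata(z.all)), order.by = index(z.all))
head(total.sales)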

Related

(R, dplyr) How to aggregate-window data where rows must be conditionally included?

I've googled around but have not found anything similar to this; I'm hoping what I'm trying to do has already been done by someone else before.
I have a set of data with timestamps.
I need a running cumulative count of transactions per second, calculated as a true rolling one-second window. It would be nice to just truncate/round off to the nearest second, but that won't be enough for my use case.
Timestamp    Current TPS
00:00:00.1   1
00:00:00.2   2
00:00:00.3   3
00:00:00.4   4
00:00:00.5   5
00:00:00.6   6
00:00:00.7   7
00:00:00.8   8
00:00:00.9   9
00:00:01.0   10   <- 10 TPS here
00:00:01.1   10
00:00:01.2   10   <- still 10 TPS here
00:00:01.4   9    <- only 9 here, because no event at 00:00:01.3
00:00:01.5   9
00:00:01.5   10
00:00:01.8   8
Initially, I was planning to calculate a time interval difference between rows, but that doesn't solve the question of how to determine which rows should be included or excluded in the aggregate window.
This morning, I thought about mutating a new column that is just the subsecond portion of the time. Then I would subtract that new column from the time column, and cumsum it inside a second if_else mutate that does a look-back over the last X rows?
Does that sound reasonable? Have I overlooked some other/better approach?
library(dplyr)

timestamps <- c("00:00:00.1", "00:00:00.2", "00:00:00.3", "00:00:00.4",
                "00:00:00.5", "00:00:00.6", "00:00:00.7", "00:00:00.8",
                "00:00:00.9", "00:00:01.0", "00:00:01.1", "00:00:01.2",
                "00:00:01.4", "00:00:01.5", "00:00:01.5", "00:00:01.8") %>%
  lubridate::hms() %>%  # convert to a time period in hours, minutes, seconds
  as.numeric()          # convert that to a number of seconds

slider::slide_index_dbl(timestamps,
                        timestamps,
                        ~ length(.x),   # = how many timestamps are in the window
                        .before = .99)  # Note: using 1 here gave me an incorrect result,
                                        # presumably due to floating-point arithmetic errors
                                        # https://en.wikipedia.org/wiki/Floating-point_error_mitigation
[1] 1 2 3 4 5 6 7 8 9 10 10 10 9 10 10 8
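Since the question mentions dplyr, one possible follow-up sketch (untested, assuming the raw timestamp strings live in a data frame column) is to attach the rolling count with mutate:
library(dplyr)

df <- tibble::tibble(raw = c("00:00:00.1", "00:00:00.2", "00:00:01.0"))  # toy data

df <- df %>%
  mutate(secs = as.numeric(lubridate::hms(raw)),
         tps  = slider::slide_index_dbl(secs, secs, ~ length(.x), .before = .99))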

Explanation for aggregate and cbind function

First, I can't understand the aggregate function and cbind; I need an explanation in really simple words. Second, I have this data:
permno number mean std
1 10107 120 0.0117174000 0.06802718
2 11850 120 0.0024398083 0.04594591
3 12060 120 0.0005072167 0.08544500
4 12490 120 0.0063569167 0.05325215
5 14593 120 0.0200060583 0.08865493
6 19561 120 0.0154743500 0.07771348
7 25785 120 0.0184815583 0.16510082
8 27983 120 0.0025951333 0.09538822
9 55976 120 0.0092889000 0.04812975
10 59328 120 0.0098526167 0.07135423
I need to process this with:
data_processed2 <- aggregate(cbind(return)~permno, Data_summary, median)
I can't understand this command; please explain it to me very simply. Thank you!
cbind takes two or more tables (dataframes), puts them side by side, and then makes them into one big table. So, for example, if you have one table with columns A, B and C, and another with columns D and E, after you cbind them you'll have one table with five columns: A, B, C, D and E. For the rows, cbind assumes all tables are in the same order.
As noted by Rui, in your example cbind doesn't do anything, because return is not a table, and even if it was, it's only one thing.
aggregate takes a table, divides it up by some variable, and then calculates a statistic on a variable within each group. For example, if I have data for sales by month and day of month, I can aggregate by month and calculate the average sales per day for each of the months.
The command you provided uses the following syntax:
aggregate(VARIABLES~GROUPING, DATA, FUNCTION)
Variables (cbind(return), which doesn't make sense here, really) is the list of all the variables for which your statistic will be calculated.
Grouping (permno) is the variable by which you will break the data into groups (in the sample data you provided each row has a unique value for this variable, so that doesn't really make sense either).
Data is the dataframe you're using.
Function is median.
So this call will break Data_summary into groups that have the same permno, and calculate the median for each of the columns.
With the data you provided, you'll basically get the same table back, since you're grouping the data into groups of one row each... Actually, since your variables are an empty group as far as I can tell, you'll get nothing back.
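For instance, a tiny made-up version of the sales-by-month example above (every name and number here is invented purely for illustration):
dat <- data.frame(
  month = rep(c("Jan", "Feb"), each = 3),
  day   = 1:3,
  sales = c(10, 12, 8, 20, 18, 25)
)

# Average sales per day within each month
aggregate(sales ~ month, data = dat, FUN = mean)
#   month sales
# 1   Feb    21
# 2   Jan    10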

Can I assign a time value for an entire dataframe in R?

I would like to implement the following solution:
I have datasets that are generated every 5 minutes; they have several columns and rows, where the rows are the variables and the columns are the metrics.
The issue is that I would like to load those entire datasets and assign to them the time at which they were loaded, so that the same load time applies to every variable and metric value in that dataset. Does that make sense?
Adding a column with the time as a metric for each one of the variables would work, but I would like to know if I could do it globally for the dataframe in R, so that every time I access this dataframe the time of the extraction is kept in an internal variable that applies to each element of the dataframe.
Thank you.
EDIT: Imagine the following table at time T1; I load it into one dataframe and call it df1.
PlaneID FlightTime Passengers Cost
1 123 5 6
2 34 4 2
3 93 3 1
Now, the very next day, I receive a new data report at time T2 and save it in df2:
PlaneID FlightTime Passengers Cost
1 33 10 16
2 134 2 1
3 393 3 6
Now what I'd like to do (despite them being in different dataframes) would be, for example, to analyze the number of passengers for each plane during T1 and T2, and create a time series out of it.
I hope that makes it clearer.
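A minimal base-R sketch of one way to do this (the load times below are invented, and df1/df2 are the two reports above) is to stamp each report with its load time and then stack them:
# Hypothetical load times for the two reports
df1$LoadTime <- as.POSIXct("2023-05-01 10:00:00")
df2$LoadTime <- as.POSIXct("2023-05-02 10:00:00")

all_reports <- rbind(df1, df2)

# Passengers per plane over time, e.g. for building a time series per PlaneID
passengers <- all_reports[, c("PlaneID", "LoadTime", "Passengers")]
passengers[order(passengers$PlaneID, passengers$LoadTime), ]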

Calculate sum of counts per min from data frame in R

I've been trying to figure this out for a while, but haven't been able to do so. I found a lot of similar questions which didn't help at all.
I have around 43000 records in data frame in R. The date column is in the format "2011-11-15 02:00:01", and the other column is the count. The structure of the data frame:
str(results)
'data.frame': 43070 obs. of 2 variables:
$ dates: Factor w/ 43070 levels "2011-11-15 02:00:01",..: 1 2 3 4 5 6 7 8 9 10 ...
$ count: num 1 2 1 1 1 1 2 3 1 2 ...
How can I get the total count per min?
And I also want to convert the results data frame into JSON. I used the rjson package, which converted the entire data frame into a single JSON element. When I inserted it into MongoDB, there was only one _id for all 43000 records. What did I do wrong?
You can use the xts package to get the counts/minute quite easily.
install.packages("xts")
require("xts")
results_xts <- xts(results$count, order.by = as.POSIXlt(results$dates))
This converts your dataframe to an xts object. There are a bunch of functions (apply.daily, apply.yearly, etc.) in xts that apply functions to different time frames, but there isn't one for minutes. Fortunately, the code for those functions is super simple, so just run:
ep <- endpoints(results_xts, "minutes")
period.apply(results_xts, ep, FUN = sum)
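As an aside, a base-R sketch of the same per-minute total (untested against the real 43,000-row data, and assuming results$dates holds timestamp strings like those shown above):
results$minute <- format(as.POSIXct(as.character(results$dates)), "%Y-%m-%d %H:%M")
aggregate(count ~ minute, data = results, FUN = sum)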
Sorry, I don't know the answer to your other question.
An asterisk here: this is untested, but here is my solution for getting the counts per minute. Maybe someone will chime in on the JSON part; I'm not familiar with that.
Here's my example time series and count:
now <- Sys.time()   # example start time
tseq <- seq(now, length.out = 130, by = "sec")
count <- rep(1, 130)
We find the indices of where our minutes switch via the following:
mins <- c(0, diff(floor(cumsum(c(0, diff(tseq))) / 60)))
indxs <- which(mins %in% 1)
Let me break that down (as there are many things nested in there).
First, we diff over the time sequence, then add a 0 on the front because we lose an observation with diff.
Second, cumsum the diff-ed vector, giving us the elapsed seconds in each spot (this could probably also be done with a simple format call over the vector of times).
Third, divide that vector, now the seconds in each spot, by 60 so we get a value in each spot corresponding to the minutes.
Fourth, floor it so we get integers.
Fifth, diff that vector so we get 0's everywhere except 1's where the minute switches.
Sixth, add a 0 to that vector since we lose an observation with the diff.
Finally, get the indices of the 1's with the which call.
Then we find the starts and ends of our minutes:
startpoints <- indxs
endpoints   <- c(indxs[2:length(indxs)], length(mins))
Then we simply sum over the corresponding subsets:
mapply(function(start, end) sum(count[start:end]), start = startpoints, end = endpoints)
#[1] 61 10
We get 61 for the first point because we include both the 0th and the 60th second in the first subset.

filtering large data sets to exclude an identical element across all columns

I am a relatively new R user, and most of the complex coding (and packages) looks like Greek to me. It has been a long time since I used a programming language (Java/Perl) and I have only used R for very simple manipulations in the past (basic loading data from file, subsetting, ANOVA/T-Test). However, I am working on a project where I had no control over the data layout and the data file is very lengthy.
In my data, I have 172 rows, each representing a participant in a survey, and 158 columns, each representing a question number. The answers for each question are 1-5. The raw data uses the number "99" to indicate that a question was not answered. I need to exclude any questions where a participant did not answer, without excluding the entire participant.
Part Q001 Q002 Q003 Q004
1 2 4 99 2
2 3 99 1 3
3 4 4 2 5
4 99 1 3 2
5 1 3 4 2
In the past I have used the subset feature to filter my data
data.filter <- subset(data, Q001 != 99)
That works fine when I am working with sets where all my answers are contained in one column; it just deletes the whole row where the answer was not available.
However, with the answers in this set spread across 158 columns, if I subset out 99 in column 1 (Q001), I also filter out that entire Participant.
I'd like to know if there is a way to filter/subset the data such that my large data set would end up having 'blanks' wherever the "99" occurred, so that these 99's would not inflate or otherwise interfere with the statistics I run on the rest of the numbers. I need to be able to calculate means per question and run ANOVAs and T-Tests on various questions.
Resp Q001 Q002 Q003 Q004
1 2 4 2
2 3 1 3
3 4 4 2 5
4 1 3 2
5 1 3 4 2
Is this possible to do in R? I've tried to filter it before submitting to R, but it won't read the data file in when I have blanks, and I'd like to be able to use the whole data set without creating a subset for each question (which I will do if I have to... it's just time consuming if there is a better code or package to use)
Any assistance would be greatly appreciated!
You could replace the 99's with NA and then calculate the colMeans omitting NAs:
df <- replicate(20, sample(c(1, 2, 3, 99), 4))  # example matrix with some 99's
colMeans(df)                 # not what we want: the 99's inflate the means
dfc <- df
dfc[dfc == 99] <- NA         # recode 99 as missing
colMeans(dfc, na.rm = TRUE)  # means ignoring the NA's
You can also indicate which values should be treated as NA's when you read in your data. For your particular case:
mydata <- read.table('dat_base', na.strings = "99")
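Applied to the survey layout in the question (a sketch, assuming the data frame is called data and its first column is Part), the per-question means would then be:
data[data == 99] <- NA
colMeans(data[, -1], na.rm = TRUE)   # mean answer per question, ignoring the blanks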
