Cumulative sum of 30 rows. SLOW code needs improvement - R

Need help to speed up this code!
The goal is to create a dataframe where the TPS (transactions per second) from the first DF, TPS_Jan7_11h_13h_CheckIMEI, is accumulated from record 1 to 30, then reset to 0, and so on.
This is what it looks like in graph form:
https://docs.google.com/spreadsheets/d/1-286za99C5gdHLDErR9B4ZazVrZFFINGaH3xzVMghFk/edit?usp=sharing
My dataset has more than 6 million rows...
I start by creating a sequence of the row positions where my cumulative variable needs to reset to 0. Then I go through the full dataset and just add each value on top of the previous one.
I have been running this for a few hours on a quad-core x64 machine with 8 GB of RAM and it is still running... so... crazy slow!
Any ideas how to speed this up? Subsets or some magic with Tables?
Here's the code:
# Create a sequence of when to reset the cumulative TPS
TPS_Jan7_11h_13h_CheckIMEI_seq30 <- seq(from = 1, to = nrow(TPS_Jan7_11h_13h_CheckIMEI), by = 30)
# Initialize Dataframe
TPS_Jan7_11h_13h_CheckIMEI_CumulTPS30 <- data.frame(matrix(ncol = 3, nrow = nrow(TPS_Jan7_11h_13h_CheckIMEI)))
colnames(TPS_Jan7_11h_13h_CheckIMEI_CumulTPS30) <- c("CumulTPS","100%","130%")
TPS_Jan7_11h_13h_CheckIMEI_CumulTPS30[2] = 1000*30
TPS_Jan7_11h_13h_CheckIMEI_CumulTPS30[3] = (1000*30)*1.3
CumulVal = 0
TPS_Jan7_11h_13h_CheckIMEI_CumulTPS30$CumulTPS[1] = TPS_Jan7_11h_13h_CheckIMEI$TPS[1]
for (i in 2:nrow(TPS_Jan7_11h_13h_CheckIMEI)) {
  CumulVal = CumulVal + TPS_Jan7_11h_13h_CheckIMEI$TPS[i-1]
  TPS_Jan7_11h_13h_CheckIMEI_CumulTPS30$CumulTPS[i] = CumulVal
  # print(CumulVal)
  if (i %in% TPS_Jan7_11h_13h_CheckIMEI_seq30) CumulVal = 0
}
The TPS DF is simply a timestamp in the first column and the TPS values in the TPS column.
The goal is to recreate what I put in the spreadsheet example, but on millions of rows!
Thanks,
Simon

Use dplyr to group your data into groups of 30 records, then compute the cumulative sum for each value in each group.
Here's some code; note that it needs some refinement to include all values - take a look at the cut documentation for help:
library(dplyr)
# Create a sequence of when to reset the cumulative TPS
TPS_Jan7_11h_13h_CheckIMEI_seq30 <- seq(from = 1, to = nrow(TPS_Jan7_11h_13h_CheckIMEI), by = 30)
#use cut() to add a factor column to the data frame with a different level for each group of 30
TPS_Jan7_11h_13h_CheckIMEI_CumulTPS30$numgroup = cut(as.numeric(row.names(TPS_Jan7_11h_13h_CheckIMEI_CumulTPS30)), TPS_Jan7_11h_13h_CheckIMEI_seq30)
#aggregate by the new column and get the cumulative sum at each line, within each group
newdf = TPS_Jan7_11h_13h_CheckIMEI_CumulTPS30 %>% group_by(numgroup) %>% mutate(cumulsum = cumsum(TPS))
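One possible refinement (a sketch, not a definitive fix) is to skip cut() entirely and build the group id from the row number, so every row falls into a group of 30; this assumes the TPS values sit in TPS_Jan7_11h_13h_CheckIMEI$TPS as described in the question:
library(dplyr)
newdf <- TPS_Jan7_11h_13h_CheckIMEI %>%
  mutate(numgroup = (row_number() - 1) %/% 30) %>%  # 0,0,...,0,1,1,... changes every 30 rows
  group_by(numgroup) %>%
  mutate(CumulTPS = cumsum(TPS)) %>%
  ungroup()
Because cumsum() is vectorized within each group, this avoids the row-by-row loop and resets every 30 rows, so it should finish quickly even on millions of rows.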

Dividing one dataframe into many with names in R

I have some large data frames that are big enough to push the limits of R on my machine; e.g., the one on which I'm currently working is 2 columns by 70 million rows. The contents aren't important, but just in case, column 1 is a string and column 2 is an integer.
What I would like to do is split that data frame into n parts (say, 20, but preferably something that could change on a case-by-case basis) so that I can work on each of the smaller data frames one at a time. That means that (a) the result has to produce things that are named (e.g., "newdf_1", "newdf_2", ... "newdf_20" or something), and (b) each line in the original data frame needs to be in one (and only one) of the new "sub" data frames. The order does not matter, but doing it sequentially by rows makes sense to me.
Once I do the work, I will start to recombine them (using rbind()) one pair at a time.
I've looked at split(), but from what I can tell, it is designed to work with factors (which I don't have).
Any ideas?
You can create a new column and split the data frame based on that column. The column does not need to be a factor, but it needs to be a data type that can be converted to a factor by the split function.
# Number of groups
N <- 20
dat$group <- 1:nrow(dat) %% N
# Add 1 to group
dat$group <- dat$group + 1
# Split the dat by group
dat_list <- split(dat, f = ~group)
# Set the name of the list
names(dat_list) <- paste0("newdf_", 1:N)
Data
set.seed(123)
# Create example data frame
dat <- data.frame(
  A = sample(letters, size = 70000000, replace = TRUE),
  B = rpois(70000000, lambda = 1)
)
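Since the question mentions recombining the pieces with rbind(), here is a short sketch of that step (plus an alternative, sequential grouping, in case interleaved rows are not what you want):
# Work on one chunk at a time, then stitch the list back together
one_piece <- dat_list[["newdf_1"]]
recombined <- do.call(rbind, dat_list)
# If the chunks should be contiguous blocks of rows rather than interleaved,
# this grouping could be used instead of the modulo approach above:
dat$group <- ceiling(seq_len(nrow(dat)) / (nrow(dat) / N))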
Here's a tidyverse-based solution. Try using read_csv_chunked().
library(readr)
library(dplyr)
# practice data
tibble(string = sample(letters, 1e6, replace = TRUE),
       value = rnorm(1e6)) %>%
  write_csv("test.csv")
# here's the solution
partial_data <- read_csv_chunked("test.csv",
                                 DataFrameCallback$new(function(x, pos) filter(x, string == "a")),
                                 chunk_size = 1000)
You can wrap the call to read_csv_chunked in a function where you change the string that you subset on.
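A sketch of such a wrapper (the function and argument names are placeholders, not from the original answer; the column name string comes from the practice data above):
library(readr)
library(dplyr)
read_subset <- function(file, letter, chunk_size = 10000) {
  # keep only the rows of each chunk whose string column matches `letter`
  read_csv_chunked(
    file,
    DataFrameCallback$new(function(x, pos) filter(x, string == letter)),
    chunk_size = chunk_size
  )
}
partial_a <- read_subset("test.csv", "a")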
This is more or less a repeat of this question:
How to read only lines that fulfil a condition from a csv into R?

Calculating the difference of elements in a vector with varying lag/lead

I have some lab data and I am looking to calculate the difference between sample measurements over a moving time frame/window, e.g. 2 minutes (as opposed to static 0-2, 2-4, 4-6 minute windows).
The problem is that although the data is sampled every second, there are some missed samples (e.g. 1, 2, 4, 6, 7), so I cannot use a fixed lag function, especially for larger time windows.
Here is the most promising approach I have tried: I attempt to calculate the difference in the row positions and then use that to determine the lag value.
library(tidyverse)
df <- data.frame(sample_group = c(rep("a", 25), rep("b", 25)),
                 t_seconds = 1:50,
                 measurement = seq(1, 100, 2))
df <- df[-c(5, 10, 23, 33, 44), ]  # remove samples
t_window = 5
df_diff <- df %>%
  group_by(sample_group) %>%
  arrange(t_seconds) %>%
  mutate(lag_row = min(which(t_seconds >= t_seconds + t_window)) - min(which(t_seconds == t_seconds)),  # attempt to identify the lag value for each element
         Meas_diff = measurement - lag(measurement, lag_row))
In this example (lag_row) I am trying to use both an element of a vector and the vector itself, which obviously does not work! To make it clearer, I have added '_v' to identify what I wanted as a vector and '_e' as an element of that vector: min(which(t_seconds_v >= t_seconds_e + t_window)) - min(which(t_seconds_v == t_seconds_e))
I have tried to stay away from using loops but I have failed to solve the problem.
I would appreciate it if anyone has any better ideas.
Your first step should be inserting missing observations into your time series. Then you could fill the missing values using a Last-Observation-Carried-Backwards operation. This provides you with a complete regular time series.
Your desired output is very unclear, so the next step after that in the following example is just a guess. Adjust as needed.
# complete the time series (using a data.table join):
library(data.table)
setDT(df)
df_fill <- df[, .SD[data.table(t_seconds = min(t_seconds):max(t_seconds)),
                    on = "t_seconds"],
              by = sample_group]
df_fill[, filled := is.na(measurement)]
# last observation carried backwards
library(zoo)
df_fill[, measurement := na.locf(measurement, fromLast = TRUE), by = sample_group]
# differences over the time window (shift with type = "lead" looks t_window rows ahead)
df_fill[, diff_value := shift(measurement, t_window, type = "lead") - measurement, by = sample_group]

Random generation of numbers using R

I have some data that involves zebu (beef animals) that are labeled 1-40. I need to divide them into 4 groups of 10 each. I need to choose them randomly to remove any bias and I need to use R and Excel. Thank you please help.
There are ways of doing this that require less code, but here's a verbose example that lets me explain what's happening.
Here's the dataset I'll be using since I don't know exactly how your data look.
beef <- data.frame(number = 1:40,
                   weight = round(rnorm(40, mean = 2000, sd = 500)))
Because your animals are numbered from 1 to 40, you can create a new dataframe that contains those numbers with a random group number (1 to 4) as the second column.
num_group <- data.frame(
  number = 1:40,
  group = sample(x = 1:4, size = 40, replace = TRUE)
)
Join the two dataframes together and you have your answer.
merge(beef, num_group)
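Note that sample(1:4, 40, replace = TRUE) does not guarantee exactly 10 animals per group. If the group sizes must be exactly 10, one simple alternative (a sketch, not the only way) is to shuffle a fixed group vector:
# Exactly 10 per group: repeat each group label 10 times, then shuffle the order
num_group <- data.frame(number = 1:40,
                        group = sample(rep(1:4, each = 10)))
table(num_group$group)  # confirms 10 animals in each group
merge(beef, num_group)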
To shuffle the data in Excel, follow this tip:
Create a new column in your data and apply RAND() to it.
It will generate a random number for each row; sort by that column and your data will be shuffled.
Later, load the data into R, select 10 rows at a time, and assign a class to each set.
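That last step in R could look something like this (a sketch; the file name is hypothetical and assumes the shuffled sheet was exported from Excel as a CSV):
zebu <- read.csv("zebu_shuffled.csv")  # rows are already in random order after the Excel sort
zebu$group <- rep(1:4, each = 10)      # first 10 rows -> group 1, next 10 -> group 2, ...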

Summing rows grouped by another parameter in R

I am trying to calculate some rates for time-on-condition parameters, and have written the following, which successfully calculates the desired rates. But I'm sure there must be a more succinct way to do this using data.table methods. Any suggestions?
Background on what I'm trying to achieve with the code.
For each run number there are 10 record numbers. Each record number refers to a value bin (the full range of values for each parameter is split into 10 equal sized bins). The values are counts of time spent in each bin. I am trying to sum the counts for P1 over each run number (calling this opHours for the run number). I then want to divide each of the bin counts by the opHours to show the proportion of each run that is spent in each bin.
library(data.table)
#### Create dummy parameter values
P1 <- rnorm(2000,400, 50);
Date <- seq(from=as.Date("2010/1/1"), by = "day", length.out = length(P1));
RECORD_NUMBER <- rep(1:10, 200);
RUN_NUMBER <- rep(1:200, each=10, len = 2000);
#### Combine the dummy parameters into a dataframe
data <- data.frame(Date, RECORD_NUMBER, RUN_NUMBER, P1);
#### Calculating operating hours for each run
setDT(data);
running_hours_table <- data[ , .(opHours = sum(P1)), by = .(RUN_NUMBER)];
#### Set the join keys for the data and running_hours tables
setkey(data, RUN_NUMBER);
setkey(running_hours_table, RUN_NUMBER);
#### Combine tables row-wise
data <- data[running_hours_table];
data$P1.countRate <- (data$P1 / data$opHours)
Is it possible to generate the opHours column in the data table without first creating a separate table and then joining them back together?
data[ , opHours := sum(P1), by = .(RUN_NUMBER)]
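Putting it together, opHours and the rate can both be added by reference, without the separate table and the join (a sketch that reuses the dummy data from the question, where setDT(data) has already been run):
data[, opHours := sum(P1), by = RUN_NUMBER]   # grouped sum added as a column in place
data[, P1.countRate := P1 / opHours]          # proportion of each run spent in each bin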
You should probably read some materials about data.table, for example the Getting Started wiki or the data.table cheat sheet.

Average columns in reverse, and return sum of those averages in R

I'm just starting to learn R for forecasting and analysis purposes, and I've decided to try and create a full package for the forecasting model I'm using (Additive Pickup). I work for a hotel, and one of the things I do on a regular basis is forecast our demand, so this will certainly make this part of my job faster and easier!
I've already created a few functions that will get me a data frame of my pickup numbers, and now I'm working on a function to average a user defined number of columns in that new data frame. I've included code to create some sample data, and the code I'm working on below.
Sample Data:
test = data.frame(replicate(10, sample(0:2, 32, rep = TRUE)))
Broken Code:
averagePickup = function(data, day, periods) {
  # data will be your Pickup Data
  # day is the day you're forecasting for (think row number)
  # periods is the period or range of periods that you need to average (a column or range of columns).
  pStart = ncol(data)
  pEnd = ncol(data) - periods
  row = (day - 1)
  new_frame = as.data.frame(matrix(nrow = 1, ncol = periods))
  for (i in pStart:pEnd) {
    new_frame[1, i] = mean(data[1:row, i])
  }
  return(sum(new_frame[1, 1:i]))
}
The goal of this is to iterate backwards from the last column in the data to a user defined period. For example, setting "periods" to 1 should return the sum of the average of the last column only. Setting it to 2 would yield the sum of the averages of the last column and second to last column.
However, when I try to run a test of this I get an error that reads
Error in `[<-.data.frame`(`*tmp*`, 1, i, value = 0.9) :
  new columns would leave holes after existing columns
Any advice you guys could lend would be so appreciated. Also, let me know if I made absolutely zero sense, and apologies for the essay on this question... Note that this has to iterate backwards because of the way the input data is formatted.
I think this is what you want. The fix is to write into columns 1 through periods of new_frame (counting back from the last column of data), so the assignment never skips column positions, which is what caused the "holes" error:
averagePickup = function(data, day, periods) {
  # data will be your Pickup Data
  # day is the day you're forecasting for (think row number)
  # periods is the period or range of periods that you need to average (a column or range of columns).
  pStart = ncol(data)
  pEnd = ncol(data) - (periods - 1)
  row = (day - 1)
  new_frame <- as.data.frame(matrix(nrow = 1, ncol = periods))
  for (i in pStart:pEnd) {
    new_frame[1, 1 + abs(ncol(data) - i)] <- mean(data[1:row, i])
  }
  return(sum(new_frame[1, 1:ncol(new_frame)]))
}
averagePickup(test,1,5)
[1] 7
I believe this does what you're looking for:
colMeans will return the average for each column
colMeans(test)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1.15625 1.00000 0.90625 1.03125 1.15625 1.09375 0.81250 0.93750 1.15625 0.84375
Now, instead of every column, you only want the last x columns. dim will give you the dimensions of your matrix/dataframe, and the second value is the number of columns.
dim(test)[2]
You can now subset your dataframe dynamically
test[, (dim(test)[2] - x + 1):dim(test)[2]]
Finally, plug the subsetted dataframe into the colMeans function, and wrap a sum around it.
sum(colMeans(test[, (dim(test)[2] - x + 1):dim(test)[2]]))
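Wrapped up like the function in the question, a minimal sketch could look like this (n_cols is a placeholder name for the number of trailing columns; it averages over all rows, like the colMeans example above, so restrict the rows first if you need the day cutoff from the question):
sum_last_col_means <- function(data, n_cols) {
  cols <- (ncol(data) - n_cols + 1):ncol(data)   # indices of the trailing columns
  sum(colMeans(data[, cols, drop = FALSE]))      # average each column, then sum the averages
}
sum_last_col_means(test, 2)  # e.g. the last two columns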
