R- Speed up calculation related with subset of data.table - r

Need help on speed up for case below:
I am having roughly 8.5 Millions rows of orders history for 1.3M orders. I need to calculate the time it take between two steps of each order. I use calculation as below:
History[, time_to_next_status:=
get_time_to_next_step(id_sales_order_item_status_history,
id_sales_order_item, History_subset),
by='id_sales_order_item_status_history']
In the code above:
id_sales_order_item - id of a sales order item - there are multiple history record have the same id_sales_order_item
id_sales_order_item_status_history - id of a row
History_subset is a subset of History which contains only 3 columns [id_sales_order_item_status_history, id_sales_order_item, created_at] needed in the calculations.
created_at is the time the history was created
The function get_time_to_next_step is defined as below
get_time_to_next_step <- function(id_sales_order_item_status_history, filter_by,
dataSet){
dataSet <- dataSet %.% filter(id_sales_order_item == filter_by)
index <- match(currentId, dataSet$id_sales_order_item_status_history)
time_to_next_status <- dataSet[index + 1, created_at] - dataSet[index, created_at]
time_to_next_status
}
The issues is that it take 15mins to run arround 10k records of the History. So it would take up to ~9 days to complete the calculation. Is there anyway I can fasten this up without break the data in to multiple subset?

I will take a shot. Can't you try something like this..
History[ , Index := 1:.N, by= id_sales_order_item]
History[ , time_to_next_status := created_at[Index+1]-created_at[Index], by= id_sales_order_item]
I would think this would be pretty fast.

Related

How can I speed up this R code, in which I use stringdist?

I'm trying to clean up our customer database by identifying customer data that is similar enough to consider them the same customer (thus, give them the same customer id). I've concatenated relevant customerdata into one column named customerdata. I've found the R package stringdist and I'm using the following code to calculate the distance between every single record:
output <- df$id
for(i in 1:(length(df$customerdata)-1) ){
for(j in (i+1):length(df$customerdata)){
if(abs(df$customerdataLEN[i]-df$customerdataLEN[j]) < 10){
if( stringdist(df$customerdata[i],df$customerdata[j])<10){
output[j] <- df$id[i]
}
}
}
}
df$newcustomerid <- output
So here, I first initialize a vector named output with customerid data. Then, I loop through all customers. I have a column called customerdatalength. To reduce calculation time, I first check if there is large (>10) difference in length between columns. If that is the case, I don't bother calculating the stringdist. Otherwise, if the distance between the two customers is < 10, I consider them to be the same customer, and I give that customer the same id.
I'm looking to speed up the process however. At 2000 rows, this loop takes 2 minutes. At 7400 rows, this loop takes 32 minutes. I'm looking to run this on around 1 000 000 rows. Does anyone have any idea on how to improve the speed of this loop?

Find row where multiple criteria are met, return a different column, for a full dataset

I have two datasets; 'data' and 'noiseaware'. noiseaware contains a RoomCode and a time stamp.
RoomCode last_trigger
GTX-513 2020-05-09 00:30:28
data contains a ton of things, including a reservation code, a check-in time stamp, a check-out time stamp, and a RoomCode. Ie
ReservationID RoomCode checkin_time checkOutDate
25307070gawgw GTX-513 2020-04-09 00:30:28 2020-05-09 00:30:28
My objective is that for each line in noiseaware, I want to find the corresponding reservation ID that matches the following combination:
Is after the checkInDate
Is before the checkOutDate
Has the same RoomCode
That in logic is as follows:
noiseaware$last_trigger <= data$checkOutDate & noiseaware$last_trigger >= data$checkInDate & data$RoomCode == noiseaware$RoomCode
However, I can't work out how to turn that logic - which returns a vector of true and false values - into something that returns the ReservationId. If it makes any difference, there should only be one matching ID for the above criteria.
Once I can do that, I'd then want to loop through and do the same for each line in noiseaware. I suppose I could do that with lapply?
Sounds like something dplyr can handle easily.
You will need to left_join table noiseaware to data by RoomCode.
And then filter out the samples you don't need.
Here's an example. Without a sample data, I have no way to test this. You may need to tweak these codes to accommodate the actual data. But the basic idea is there.
library("dplyr")
noiseaware %>%
left_join(data, by = "RoomCode") %>%
filter(last_trigger > checkin_time & last_trigger < checkOutDate)
An option using data.table:
library(data.table)
setDT(noiseaware)[, last_trigger :=
setDT(data)[.SD, on=.(RoomCode, checkInDate<=last_trigger, checkOutDate>=last_trigger),
mult="last", x.ReservationID]
]
mult="last" uses the last observation if there are multiple results for a row in noiseaware.

Speed up complex data.table operation (subset, sum, group)

I have a large data.table which I need to subset, sum and group the same way on several occurrences in my code. Therefore, I store the result to save time. The operation still takes rather long and I would like to know how to speed it up.
inco <- inventory[period > p, sum(incoming), by = articleID][,V1]
The keys of inventory are period and articleID. The size varies depending on the parameters but is always greater than 3 GB. It has about 62,670,000 rows of 7 variables.
I comment on my thought so far:
1. Subset: period > p
This could be faster with vector scanning, but I would need to generate the sequence from p to max(p) for that, taking additional time. Plus, the data.table is already sorted by p. So I suppose, the gain in speed is not high.
2. Aggregate: sum(incoming)
No idea how to improve this.
3. Group: by = articleID
This grouping might be faster with another key setting of the table, but this would have a bad impact on my other code.
4. Access: [, V1]
This could be neglected and done during later operations, but I doubt a speed gain.
Do you have ideas for detailed profiling or improving this operation?
Minimum reproducible example
(decrease n to make it run on your machine, if necessary):
library(data.table)
p <- 100
n <- 10000
inventory <- CJ(period=seq(1,n,1), weight=c(0.1,1), volume=c(1,10), price=c(1,1000), E_demand=c(1000), VK=seq(from=0.2, to=0.8, by=0.2), s=c(seq(1,99,1), seq(from=100, to=1000, by=20)))
inventory[, articleID:=paste0("W",weight,"V",volume,"P",price,"E", round(E_demand,2), "VK", round(VK,3), "s",s)]
inventory[, incoming:=rgamma( rate=1,shape=0.3, dim(inventory)[1])]
setkey(inventory, period, articleID)
inco <- inventory[period > p, sum(incoming), by = articleID][,V1]

Summarized huge data, How to handle it with R?

I am working on EBS, Forex market Limit Order Book(LOB): here is an example of LOB in a 100 millisecond time slice:
datetime|side(0=Bid,1=Ask)| distance(1:best price, 2: 2nd best, etc.)| price
2008/01/28,09:11:28.000,0,1,1.6066
2008/01/28,09:11:28.000,0,2,1.6065
2008/01/28,09:11:28.000,0,3,1.6064
2008/01/28,09:11:28.000,0,4,1.6063
2008/01/28,09:11:28.000,0,5,1.6062
2008/01/28,09:11:28.000,1,1,1.6067
2008/01/28,09:11:28.000,1,2,1.6068
2008/01/28,09:11:28.000,1,3,1.6069
2008/01/28,09:11:28.000,1,4,1.6070
2008/01/28,09:11:28.000,1,5,1.6071
2008/01/28,09:11:28.500,0,1,1.6065 (I skip the rest)
To summarize the data, They have two rules(I have changed it a bit for simplicity):
If there is no change in LOB in Bid or Ask side, they will not record that side. Look at the last line of the data, millisecond was 000 and now is 500 which means there was no change at LOB in either side for 100, 200, 300 and 400 milliseconds(but those information are important for any calculation).
The last price (only the last) is removed from a given side of the order book. In this case, a single record with nothing in the price field. Again there will be no record for whole LOB at that time.
Example:2008/01/28,09:11:28.800,0,1,
I want to calculate minAsk-maxBid(1.6067-1.6066) or weighted average price (using sizes of all distances as weights, there is size column in my real data). I want to do for my whole data. But as you see the data has been summarized and this is not routine. I have written a code to produce the whole data (not just summary). This is fine for small data set but for a large one I am creating a huge file. I was wondering if you have any tips how to handle the data? How to fill the gaps while it is efficient.
You did not give a great reproducible example so this will be pseudo/untested code. Read the docs carefully and make adjustments as needed.
I'd suggest you first filter and split your data into two data.frames:
best.bid <- subset(data, side == 0 & distance == 1)
best.ask <- subset(data, side == 1 & distance == 1)
Then, for each of these two data.frames, use findInterval to compute the corresponding best ask or best bid:
best.bid$ask <- best.ask$price[findInterval(best.bid$time, best.ask$time)]
best.ask$bid <- best.bid$price[findInterval(best.ask$time, best.bid$time)]
(for this to work you might have to transform date/time into a linear measure, e.g. time in seconds since market opening.)
Then it should be easy:
min.spread <- min(c(best.bid$ask - best.bid$price,
best.ask$bid - best.ask$price))
I'm not sure I understand the end of day particularity but I bet you could just compute the spread at market close and add it to the final min call.
For the weighted average prices, use the same idea but instead of the two best.bid and best.ask data.frames, you should start with two weighted.avg.bid and weighted.avg.ask data.frames.

R Accumulate equity data - add time and price

I have some data formatted as below. I have done some analysis on this and would like to be able to plot the price development in the same graph as the analyzed data.
This requires me to have the same x-axes for the data.
So I would like to aggregate the "shares" column in say 150 increments, and add the "finalprice" and "time" to this.
The aggregation should include the latest time and price, so if the aggregation needs to occur over two or more rows of data then the last row should provide the price and time data.
My question is how to create a new vector with 150 shares per row.
The length of the vector will equal sum(shares)/150.
Is there an easy way to do this? Thanks in advance.
Edit:
I thought about expanding the observations using rep(finalprice, shares) and then getting each 150th value of the expanded vector.
Data sample:
"date","ord","shares","finalprice","time","stock"
20120702,E,2000,99.35,540.84753333,500
20120702,E,28000,99.35,540.84753333,500
20120702,E,50,99.5,542.03073333,500
20120702,E,13874,99.5,542.29411667,500
20120702,E,292,99.5,542.30191667,500
20120702,E,784,99.5,542.30193333,500
20120702,E,13300,99.35,543.04805,500
20120702,E,16658,99.35,543.04805,500
20120702,E,42,99.5,543.04805,500
20120702,E,400,99.4,546.17173333,500
20120702,E,100,99.4,547.07,500
20120702,E,2219,99.3,549.47988333,500
20120702,E,781,99.3,549.5238,500
20120702,E,50,99.3,553.4052,500
20120702,E,1500,99.35,559.86275,500
20120702,E,103,99.5,567.56726667,500
20120702,E,1105,99.7,573.93326667,500
20120702,E,4100,99.5,582.2657,500
20120702,E,900,99.5,582.2657,500
20120702,E,1024,99.45,582.43891667,500
20120702,E,8214,99.45,582.43891667,500
20120702,E,10762,99.45,582.43895,500
20120702,E,1250,99.6,586.86446667,500
20120702,E,5000,99.45,594.39061667,500
20120702,E,20000,99.45,594.39061667,500
20120702,E,15000,99.45,594.39061667,500
20120702,E,4000,99.45,601.34491667,500
20120702,E,8700,99.45,603.53608333,500
20120702,E,3290,99.6,609.23213333,500
I think I got it solved.
expand <- rep(finalprice, shares)
Increment <- expand[seq(from = 1, to = length(expand), by = 150)]

Resources