Using BigQuery's APPROX_QUANTILES efficiently

Right now if I want to get the decile of some value I'd do
SELECT
APPROX_QUANTILES(value, 100)[SAFE_ORDINAL(10)] as p10,
APPROX_QUANTILES(value, 100)[SAFE_ORDINAL(20)] as p20,
APPROX_QUANTILES(value, 100)[SAFE_ORDINAL(30)] as p30,
APPROX_QUANTILES(value, 100)[SAFE_ORDINAL(40)] as p40,
APPROX_QUANTILES(value, 100)[SAFE_ORDINAL(50)] as p50,
APPROX_QUANTILES(value, 100)[SAFE_ORDINAL(60)] as p60,
APPROX_QUANTILES(value, 100)[SAFE_ORDINAL(70)] as p70,
APPROX_QUANTILES(value, 100)[SAFE_ORDINAL(80)] as p80,
APPROX_QUANTILES(value, 100)[SAFE_ORDINAL(90)] as p90,
APPROX_QUANTILES(value, 100)[SAFE_ORDINAL(100)] as p100
FROM table
I wanted to make sure this is not 10x-ing the work for BigQuery, and to ask whether there is a more compact way to write it.

If you run the query and then check the execution plan, you will see that BigQuery only computes the quantiles once, then extracts the various elements of the array in a second step. You don't need to worry about trying to deduplicate the APPROX_QUANTILES aggregation yourself.
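As for the second part of the question, one more compact way to write it (a sketch only, reusing the placeholder names value and table from the query above) is to ask for deciles directly and unpack the resulting array into rows:
SELECT
  o * 10 AS percentile,
  decile AS value
FROM (
  SELECT APPROX_QUANTILES(value, 10) AS deciles
  FROM table
), UNNEST(deciles) AS decile WITH OFFSET AS o
ORDER BY percentile
This returns one row per decile (the 0th through 100th percentile) rather than one wide row, so the output shape is different from the original query, but it states the intent with a single APPROX_QUANTILES call.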

Related

How to remove some values from a 4-dimensional matrix?

I'm working with a 4-dimensional matrix (Year, Simulation, Flow, Time instant: 10x5x20x10) in R, and I need to remove some values from it. For example, for year 1 I need to remove simulations 1 and 2; for year 2 I need to remove simulation 5.
Can anyone suggest how I can make such changes?
Arrays (which is how R documentation usually refers to higher-dimensional 'matrices') can be indexed with negative values in the same way as matrices or vectors: a negative value removes the corresponding row/column/slice. So if you wanted to remove year 1 completely (for example), you could use a[-1,,,]; to remove simulation 5 completely, a[,-5,,].
However, arrays can't be "ragged": there has to be something in every row/column/slice combination. You could replace the values you want to remove with NAs (and then make sure to account for the NAs appropriately when computing, e.g. using na.rm = TRUE in sum()/min()/max()/median()/etc.): a[1, 1:2, , ] <- NA or a[2, 5, , ] <- NA in your examples.
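A minimal sketch of the NA approach, using the dimensions from the question (the values here are made up for illustration):
# Year x Simulation x Flow x Time, filled with dummy values
a <- array(rnorm(10 * 5 * 20 * 10), dim = c(10, 5, 20, 10))
a[1, 1:2, , ] <- NA             # remove simulations 1 and 2 for year 1
a[2, 5, , ]   <- NA             # remove simulation 5 for year 2
mean(a[1, , , ], na.rm = TRUE)  # later summaries must skip the NAs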
If you knew that all values of Flow and Time would always be present, you could store your data as a list of lists of matrices: e.g.
results <- list(Year1 = list(Simulation1 = matrix(...),
                             Simulation2 = matrix(...),
                             ...),
                Year2 = list(Simulation1 = matrix(...),
                             Simulation2 = matrix(...),
                             ...))
Then you could easily remove years, or simulations within years, by setting them to NULL, but it would make indexing a little harder (e.g. "retrieve the Simulation1 values for all years" would require an lapply or a loop across years).
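For instance, with the list-of-lists layout above, removal and cross-year retrieval might look like this (a sketch; the names follow the example above):
results$Year1$Simulation2 <- NULL   # drop one simulation within a year
results$Year2 <- NULL               # drop a whole year
# "Simulation1 for all remaining years" now needs a loop or lapply
sim1_all_years <- lapply(results, function(y) y$Simulation1)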

R - Speed up calculation related to subsets of a data.table

I need help speeding up the case below:
I have roughly 8.5 million rows of order history for 1.3M orders. I need to calculate the time it takes between two steps of each order. I use the calculation below:
History[, time_to_next_status :=
            get_time_to_next_step(id_sales_order_item_status_history,
                                  id_sales_order_item, History_subset),
        by = 'id_sales_order_item_status_history']
In the code above:
id_sales_order_item - the id of a sales order item; multiple history records can share the same id_sales_order_item
id_sales_order_item_status_history - the id of a row
History_subset - a subset of History containing only the 3 columns needed in the calculation: [id_sales_order_item_status_history, id_sales_order_item, created_at]
created_at - the time the history record was created
The function get_time_to_next_step is defined as below
get_time_to_next_step <- function(id_sales_order_item_status_history, filter_by,
                                  dataSet) {
  dataSet <- dataSet %>% filter(id_sales_order_item == filter_by)
  index <- match(id_sales_order_item_status_history,
                 dataSet$id_sales_order_item_status_history)
  time_to_next_status <- dataSet[index + 1, created_at] - dataSet[index, created_at]
  time_to_next_status
}
The issue is that it takes 15 minutes to run on around 10k records of History, so it would take roughly 9 days to complete the calculation. Is there any way I can speed this up without breaking the data into multiple subsets?
I will take a shot. Can you try something like this?
History[, Index := 1:.N, by = id_sales_order_item]
History[, time_to_next_status := created_at[Index + 1] - created_at[Index],
        by = id_sales_order_item]
I would think this would be pretty fast.
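With more recent data.table versions, the same idea can be written with shift() (a sketch; it assumes the rows of each order item should be ordered by created_at, so setorder is applied first):
setorder(History, id_sales_order_item, created_at)
History[, time_to_next_status := shift(created_at, type = "lead") - created_at,
        by = id_sales_order_item]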

Summarized huge data: how to handle it with R?

I am working on EBS Forex market Limit Order Book (LOB) data. Here is an example of the LOB in a 100-millisecond time slice:
datetime|side(0=Bid,1=Ask)| distance(1:best price, 2: 2nd best, etc.)| price
2008/01/28,09:11:28.000,0,1,1.6066
2008/01/28,09:11:28.000,0,2,1.6065
2008/01/28,09:11:28.000,0,3,1.6064
2008/01/28,09:11:28.000,0,4,1.6063
2008/01/28,09:11:28.000,0,5,1.6062
2008/01/28,09:11:28.000,1,1,1.6067
2008/01/28,09:11:28.000,1,2,1.6068
2008/01/28,09:11:28.000,1,3,1.6069
2008/01/28,09:11:28.000,1,4,1.6070
2008/01/28,09:11:28.000,1,5,1.6071
2008/01/28,09:11:28.500,0,1,1.6065 (I skip the rest)
To summarize the data, they apply two rules (I have changed them a bit for simplicity):
If there is no change in the LOB on the Bid or Ask side, they do not record that side. Look at the last line of the data: the millisecond field was 000 and is now 500, which means there was no change in the LOB on either side at 100, 200, 300 and 400 milliseconds (but that information is important for any calculation).
When the last price (only the last) is removed from a given side of the order book, they write a single record with nothing in the price field. Again, there is no record for the whole LOB at that time.
Example: 2008/01/28,09:11:28.800,0,1,
I want to calculate minAsk - maxBid (1.6067 - 1.6066) or a weighted average price (using the sizes at all distances as weights; there is a size column in my real data), and I want to do this for my whole data set. But as you can see, the data has been summarized, so this is not routine. I have written code that expands the summary back to the full data, which is fine for a small data set, but for a large one it creates a huge file. I was wondering if you have any tips on how to handle the data, i.e. how to fill the gaps efficiently.
You did not give a great reproducible example so this will be pseudo/untested code. Read the docs carefully and make adjustments as needed.
I'd suggest you first filter and split your data into two data.frames:
best.bid <- subset(data, side == 0 & distance == 1)
best.ask <- subset(data, side == 1 & distance == 1)
Then, for each of these two data.frames, use findInterval to compute the corresponding best ask or best bid:
best.bid$ask <- best.ask$price[findInterval(best.bid$time, best.ask$time)]
best.ask$bid <- best.bid$price[findInterval(best.ask$time, best.bid$time)]
(for this to work you might have to transform date/time into a linear measure, e.g. time in seconds since market opening.)
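For example, one way to build such a linear measure (a sketch; date and clock are hypothetical column names here for the first two CSV fields of the sample data):
# parse "2008/01/28" plus "09:11:28.000" into POSIXct with fractional seconds,
# then convert to numeric seconds so findInterval() can use it
data$time <- as.numeric(as.POSIXct(paste(data$date, data$clock),
                                   format = "%Y/%m/%d %H:%M:%OS"))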
Then it should be easy:
min.spread <- min(c(best.bid$ask - best.bid$price,
                    best.ask$bid - best.ask$price))
I'm not sure I understand the end-of-day particularity, but I bet you could just compute the spread at market close and add it to the final min call.
For the weighted average prices, use the same idea but instead of the two best.bid and best.ask data.frames, you should start with two weighted.avg.bid and weighted.avg.ask data.frames.
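For instance, the weighted.avg.bid data.frame might be built along these lines (a sketch; size is the column assumed in the question, and time is the linear measure from above):
bids <- subset(data, side == 0)
bids$price.size <- bids$price * bids$size
weighted.avg.bid <- aggregate(cbind(price.size, size) ~ time, data = bids, FUN = sum)
# size-weighted average bid price per time slice
weighted.avg.bid$price <- weighted.avg.bid$price.size / weighted.avg.bid$size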

Ratio between two datasets with conditions in R

I have two data sets. One data set is dye, with two columns, time and dye; the other is sed, which also has time and sed. I want to create a new variable, new, as the ratio of sed to dye: if either dye or sed is zero then new should be zero, otherwise new = sed / dye.
The code I have so far is as follows:
dyei94j66 <- read.table("i94 j66 dye time series.Dat",header=T,sep="\t")
names(dyei94j66) <- c("Time","Dye")
head(dyei94j66)
sedi94j66 <- read.table("i94 j66 sediment time series.Dat", header=T,sep="\t")
names(sedi94j66) <- c("Time","Sed")
head(sedi94j66)
newi94j66 <- data.frame(sedi94j66$Sed / dyei94j66$Dye)
head(newi94j66)
# Merge two datasets
totaldata <- merge(dyei94j66,sedi94j66, all=TRUE)
I don't know how to apply the condition when calculating the ratio; maybe I need to create a function, I am not sure. Right now, when either dye or sed is zero, the value calculated by R is NaN, so I want to change that NaN to zero while calculating the ratio.
I hope I explained what I am looking for. Thanks.
The data for the i94 j66 sediment time series can be found at https://www.dropbox.com/s/1cbo8wul9nb8zeh/i94%20j66%20sediment%20time%20series.Dat
The data for the i94 j66 dye time series can be found at https://www.dropbox.com/s/sqm8rip2xjh3tiu/i94%20j66%20dye%20time%20series.Dat
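One way to express the conditional ratio described above, once the two data sets are merged (a sketch using the column names from the code; it assumes the merge lines the two series up on Time):
totaldata <- merge(dyei94j66, sedi94j66, all = TRUE)
# zero whenever either input is zero, otherwise the plain ratio
totaldata$New <- ifelse(totaldata$Dye == 0 | totaldata$Sed == 0,
                        0,
                        totaldata$Sed / totaldata$Dye)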

Dealing with a data table with redundant rows

The title is not precisely stated, but I could not come up with other words that summarize exactly what I am going to ask.
I have a table of the following form:
value (0 < v < 1)    # of events
0.5677               100000
0.5688               5000
0.1111               6000
...                  ...
0.5688               200000
0.1111               35000
Here are some of the things I would like to do with this table: draw a histogram, compute the mean value, fit a distribution, etc. So far I have only figured out how to do this with vectors like
v=(0.5677,...,0.5688,...,0.1111,...)
but not with tables.
Since the number of possible values is huge (they are almost continuous), I guess expanding into a new table would not be very effective, so doing this without modifying the original table or building another one would be highly desirable. But if it has to be done that way, it's okay. Thanks in advance.
Appendix: What I want to figure out is how to treat this table as a usual data vector:
If I had the following vector representing the exact same data as above:
v = (0.5677, ..., 0.5677, 0.5688, ..., 0.5688, 0.1111, ..., 0.1111)
    # 0.5677 repeated 100000 times, 0.5688 repeated 5000 + 200000 times,
    # 0.1111 repeated 6000 + 35000 times
then I would just need to apply basic functions like plot, mean, etc. to get what I want. I hope this makes my question clearer.
Your data consist of a value and a count for that value so you are looking for functions that will use the count to weight the value. Type ?weighted.mean to get information on a function that will compute the mean for weighted (grouped) data. For density plots, you want to use the weights= argument in the density() function. For the histogram, you just need to use cut() to combine values into a small number of groups and then use aggregate() to sum the counts for all the values in the group. You will find a variety of weighted statistical measures in package Hmisc (wtd.mean, wtd.var, wtd.quantile, etc).
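A minimal sketch of those suggestions, using the counts from the table above (the column names value and count are assumed):
library(Hmisc)  # assumed to be installed, for the wtd.* functions

d <- data.frame(value = c(0.5677, 0.5688, 0.1111, 0.5688, 0.1111),
                count = c(100000, 5000, 6000, 200000, 35000))

weighted.mean(d$value, d$count)                           # weighted mean
plot(density(d$value, weights = d$count / sum(d$count)))  # weighted density
wtd.quantile(d$value, weights = d$count)                  # weighted quantiles (Hmisc)

# histogram-style summary: bin the values, then sum the counts per bin
d$bin <- cut(d$value, breaks = seq(0, 1, by = 0.1))
counts.per.bin <- aggregate(count ~ bin, data = d, FUN = sum)
barplot(counts.per.bin$count, names.arg = counts.per.bin$bin)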
