Join of two tables in prestoDB based on closest value - bigdata

I have two tables, one with timeseries values and another one with prices. For example
timeseries
purchase_date user_id item_id
1618231488 123 2313
1618244875 435 2314
1618266985 23 2313
1618268671 54 144
...
price
item_id price_date price
2313 1618231400 233.67
2313 1618232400 294.12
2313 1618254400 224.14
144 1618254400 212.34
...
Goal: evaluate the price of each item purchased by a user at the given purchase timestamp.
As you can see from the sample data, a purchase can happen at any timestamp, while an item's price is stored roughly every hour, but inconsistently. So, for example, an item's price may be updated every hour one day and only a couple of times the next. This happens because the system records a price only when it changes.
Using an SQL query (the engine is PrestoDB), and keeping in mind that the timeseries table can be millions of rows (while the price table is up to a few hundred thousand), how can I get a table as follows
timeseries_with_price
date user item price
1618231488 123 2313 xxx.xx
1618244875 435 2314 xxx.xx
1618266985 23 2313 xxx.xx
1618268671 54 144 xxx.xx
...
Where xxx.xx are the prices at the given timestamps. Of course, an item's price doesn't change linearly; it changes when a new record is written to the database. So if I have (simplifying timestamps for readability)
price at date 100 equal to 14.02$
price at date 300 equal to 23.02$
These are the transactions' prices based on purchase date:
purchase_date: 100 -> price 14.02$
purchase_date: 148 -> price 14.02$
purchase_date: 299 -> price 14.02$
purchase_date: 300 -> price 23.02$
purchase_date: 348 -> price 23.02$

You could join both tables on item_id, keep only the price rows recorded at or before each purchase, and pick the most recent one (smallest nonnegative time difference), for example:
select purchase_date,
       user_id,
       item_id,
       min_by(price, purchase_date - price_date) as price
from timeseries
join price using (item_id)
where price_date <= purchase_date
group by purchase_date, user_id, item_id
See the docs for the min_by function here: https://prestodb.io/docs/current/functions/aggregate.html#min_by
I'm purposefully placing the price table on the right side of the join since it's the smaller of the two. Ideally though, when a user makes a purchase you would keep a pointer to a unique identifier in the price table.
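As a runnable illustration, the same "price effective at each purchase timestamp" lookup can be sketched with Python's built-in sqlite3. This is only a sketch under SQLite semantics: Presto's min_by() has no SQLite equivalent, so a correlated subquery picks the latest price at or before each purchase, matching the worked example in the question. Table and column names follow the question; the values are its sample data.

```python
# A minimal sketch, assuming SQLite semantics: a correlated subquery picks the
# latest price recorded at or before each purchase (as in the question's
# worked example). Data and names are copied from the question.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE timeseries (purchase_date INTEGER, user_id INTEGER, item_id INTEGER);
CREATE TABLE price (item_id INTEGER, price_date INTEGER, price REAL);
INSERT INTO timeseries VALUES (1618231488, 123, 2313), (1618266985, 23, 2313);
INSERT INTO price VALUES (2313, 1618231400, 233.67),
                         (2313, 1618232400, 294.12),
                         (2313, 1618254400, 224.14);
""")

rows = conn.execute("""
SELECT t.purchase_date, t.user_id, t.item_id,
       (SELECT p.price
        FROM price p
        WHERE p.item_id = t.item_id
          AND p.price_date <= t.purchase_date
        ORDER BY p.price_date DESC
        LIMIT 1) AS price
FROM timeseries t
ORDER BY t.purchase_date
""").fetchall()
```

Note that on millions of rows this correlated lookup can degrade badly; in Presto the grouped-aggregate form shown above pushes the work into a single join.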

Related

What is the R code for the following Data set

I have a data set which has products and their quantity sold. I want to write R code which tells me the best-selling product.
Products Quantity
Laminated 520
Laminated 150
Laminated 639
Laminated 702
SUPERSTAR 3
TAMAX 500
TAMAX 20
TAMAX 40
GreenDragon 40
GreenDragon 50
XPLODE 40
XPLODE 20
EXPERT 40
KHANJARBIOSL 40
Here, just by looking at the data set, we can say Laminated is the best product in terms of quantity sold. Can we write R code for this?
Thanks
There could be multiple ways to do this. One way, using tapply, is to get the sum of Quantity for each Product and then take the name of the maximum value.
names(which.max(tapply(df$Quantity, df$Products, sum, na.rm = TRUE)))
#[1] "Laminated"
You can use the data.table package. First do the sum, then sort in descending order on the aggregated value, then fetch the first row.
tb = data.frame("Products" =c("Laminated", "Laminated", "Laminated", "Laminated", "SUPERSTAR", "TAMAX", "TAMAX", "TAMAX", "GreenDragon", "GreenDragon", "XPLODE", "XPLODE", "EXPERT", "KHANJARBIOSL"), "Quantity" = c(520,150,639,702,3,500,20,40,40,50,40,20,40,40))
library(data.table)
tb = data.table(tb)
tb[,sum(Quantity), by="Products"][order(-V1)][1]
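For comparison, here is a dependency-free Python sketch of the same aggregation (sum Quantity per product, then take the product with the largest total); the data is copied from the question.

```python
# Sum Quantity per product, then take the product with the largest total --
# the same logic as the tapply and data.table answers above.
from collections import defaultdict

sales = [("Laminated", 520), ("Laminated", 150), ("Laminated", 639),
         ("Laminated", 702), ("SUPERSTAR", 3), ("TAMAX", 500),
         ("TAMAX", 20), ("TAMAX", 40), ("GreenDragon", 40),
         ("GreenDragon", 50), ("XPLODE", 40), ("XPLODE", 20),
         ("EXPERT", 40), ("KHANJARBIOSL", 40)]

totals = defaultdict(int)
for product, qty in sales:
    totals[product] += qty

best = max(totals, key=totals.get)  # product with the largest summed quantity
```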

Need the average of last 17 days and compare with previous 17 days

I was looking for a query which will give me the count of records loaded for the last 17 days and compare that count with the previous 17 days.
There is a drop in the records loaded, and I have to see the drop and, if possible, the percentage drop. I was looking at the moving average window function, but its window is specified in a number of rows, not days.
select call_start_dt, count(*)
from dpd.network_activity
where call_start_dt > date - 45
  and dtype_cd = '8,006'
group by call_start_dt
order by call_start_dt desc
This is my base query.
Required Output:
call_start_dt count(*) average( for last 17 days)
1st june 1000 800
2nd june 675 800
....
....
17 rows
call_start_dt count(*) average( next 17 days)
next 17 rows
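For illustration, the requested comparison can be sketched in plain Python: split the daily counts into two 17-day windows, average each, and compute the percentage change. The run date and the daily counts below are fabricated (declining toward the present to mimic the described drop); only the windowing logic matters.

```python
# A minimal sketch, assuming counts have already been grouped by call_start_dt
# (as in the base query). daily_counts maps day -> records loaded (made up).
from datetime import date, timedelta

today = date(2021, 6, 18)  # hypothetical run date
daily_counts = {today - timedelta(days=i): 700 + 10 * i for i in range(34)}

last_17 = [daily_counts[today - timedelta(days=i)] for i in range(17)]
prev_17 = [daily_counts[today - timedelta(days=i)] for i in range(17, 34)]

avg_last = sum(last_17) / 17
avg_prev = sum(prev_17) / 17
pct_change = (avg_last - avg_prev) / avg_prev * 100  # negative means a drop
```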

Correct history data

I have a scenario where I have to correct history data. The current data is like below:
Scenario 1 (phase_cd changes):
Status_cd event_id phase_cd start_dt end_dt
110 23456 30 1/1/2017 ?
110 23456 31 1/2/2017 ?
Scenario 2 (status_cd changes):
Status_cd event_id phase_cd start_dt end_dt
110 23456 30 1/1/2017 ?
111 23456 30 1/2/2017 ?
The key columns are status_cd and phase_cd. If either of them changes, the history should be corrected so that the start_dt of the next record becomes the end_dt of the previous record.
Here both records are left open, which is not correct.
Please suggest how to handle both scenarios.
Thanks.
How are your history rows ordered in the table? In other words, how do you decide which history rows to compare to see if a value was changed? And how do you uniquely identify a history row entry?
If you order your history rows by start_dt, for example, you can compare the previous and current row values using window functions, like Rob suggested:
UPDATE MyHistoryTable
FROM (
    -- Get source history rows that need to be updated
    -- (window aliases can't be referenced in the WHERE clause of the same
    -- SELECT, so they are computed in an inner query first)
    SELECT history_row_id, start_dt_next
    FROM (
        SELECT
            history_row_id, -- Change this field to match your table
            status_cd,
            phase_cd,
            MAX(status_cd) OVER(ORDER BY start_dt ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) AS status_cd_next, -- "status_cd" value of the next history row
            MAX(phase_cd) OVER(ORDER BY start_dt ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) AS phase_cd_next,
            MAX(start_dt) OVER(ORDER BY start_dt ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) AS start_dt_next
        FROM MyHistoryTable
    ) w
    WHERE status_cd <> status_cd_next -- Check "status_cd" values are different
       OR phase_cd <> phase_cd_next -- Check "phase_cd" values are different
) src
SET end_dt = src.start_dt_next -- Update "end_dt" of the current history row to the "start_dt" of the next history row
WHERE MyHistoryTable.history_row_id = src.history_row_id -- Match source rows to target rows
This assumes you have a column to uniquely identify each history row, called "history_row_id". Give it a try and let me know.
I don't have a TD system to test on, so you may need to futz with the table aliases too. You'll also probably need to handle the edge cases (i.e. first/last rows in the table).
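The end-dating rule itself can be sketched in a few lines of Python: order the history rows by start_dt and, whenever status_cd or phase_cd changes, close the earlier row with the next row's start_dt. The rows below use the question's columns and the second scenario's sample values.

```python
# Close each history row with the next row's start_dt whenever status_cd or
# phase_cd changes between consecutive rows (ordered by start_dt).
from datetime import date

rows = [
    {"status_cd": 110, "event_id": 23456, "phase_cd": 30,
     "start_dt": date(2017, 1, 1), "end_dt": None},
    {"status_cd": 111, "event_id": 23456, "phase_cd": 30,
     "start_dt": date(2017, 1, 2), "end_dt": None},
]

rows.sort(key=lambda r: r["start_dt"])
for cur, nxt in zip(rows, rows[1:]):
    if cur["status_cd"] != nxt["status_cd"] or cur["phase_cd"] != nxt["phase_cd"]:
        cur["end_dt"] = nxt["start_dt"]  # close the superseded row
```

The last row stays open (end_dt is None), which matches the usual "current record" convention.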

mySql sum a column and return only entries with an entry in last 10 minutes

Here's a table; the time when the query runs (i.e. now) is 2010-07-30 22:41:14
number | person | timestamp
45 mike 2008-02-15 15:31:14
56 mike 2008-02-15 15:30:56
67 mike 2008-02-17 13:31:14
34 mike 2010-07-30 22:31:14
56 bob 2009-07-30 22:37:14
67 bob 2009-07-30 22:37:14
22 tom 2010-07-30 22:37:14
78 fred 2010-07-30 22:37:14
I'd like a query that adds up the number for each person, then displays only the name totals that have a recent entry, say in the last 60 minutes. The difficulty seems to be that although it's possible to use AND timestamp > NOW() - INTERVAL 60 MINUTE, this has the effect of stopping the full sum of the number.
The results I would like from the above are
Mike 202
tom 22
fred 78
bob is not included: his latest entry is not recent enough (it's a year old!). mike, although he has several old entries, is valid because he has one recent entry; but, crucially, it still adds up his full 'number' and not just those within the time period.
go on get your head round that one in a single query ! and thanks
andy.
You want a HAVING clause:
select name, sum(number), max(timestamp_column)
from table
group by name
HAVING max(timestamp_column) > NOW() - INTERVAL 60 MINUTE;
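To see the HAVING pattern work end to end, here is a sqlite3 reproduction with the question's sample data, with "now" pinned to 2010-07-30 22:41:14 as stated in the question (the column is named ts since timestamp is best avoided as an identifier):

```python
# Sum per person, but keep only people whose most recent entry falls within
# the last 60 minutes of the fixed "now". Data copied from the question.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (number INTEGER, person TEXT, ts TEXT)")
conn.executemany("INSERT INTO entries VALUES (?, ?, ?)", [
    (45, "mike", "2008-02-15 15:31:14"), (56, "mike", "2008-02-15 15:30:56"),
    (67, "mike", "2008-02-17 13:31:14"), (34, "mike", "2010-07-30 22:31:14"),
    (56, "bob", "2009-07-30 22:37:14"), (67, "bob", "2009-07-30 22:37:14"),
    (22, "tom", "2010-07-30 22:37:14"), (78, "fred", "2010-07-30 22:37:14"),
])

totals = dict(conn.execute("""
SELECT person, SUM(number)
FROM entries
GROUP BY person
HAVING MAX(ts) > datetime('2010-07-30 22:41:14', '-60 minutes')
""").fetchall())
```

mike's full 202 is summed even though only one of his entries is recent, while bob drops out entirely, which is exactly the behaviour the question asks for.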
andrew - in the spirit of education, i'm not going to show the query (actually, i'm being lazy but don't tell anyone) :).
basically tho', you'd have to do a subselect within your main criteria select. in pseudo-code it would be:
select person, (select sum(number) from table1 t2 where t2.person = t1.person) as total
from table1 t1 where timestamp > now() - INTERVAL 60 MINUTE
that will blow up, but you get the gist...
jim

Get Correct Price based on Effectivity Date

I have a problem getting the right "Price" for a product based on Effectivity date.
Example, I have 2 tables:
a. "Transaction" table --> this contains the products ordered, and
b. "Item Master" table --> this contains the product prices and effectivity dates of those prices
Inside the "Transaction" table:
INVOICE_NO INVOICE_DATE PRODUCT_PKG_CODE PRODUCT_PKG_ITEM
1234 6/29/2009 ProductA ProductA-01
1234 6/29/2009 ProductA ProductA-02
1234 6/29/2009 ProductA ProductA-03
Inside the "Item_Master" table:
PRODUCT_PKG_CODE PRODUCT_PKG_ITEM PRODUCT_ITEM_PRICE EFFECTIVITY_DATE
ProductA ProductA-01 25 6/1/2009
ProductA ProductA-02 22 6/1/2009
ProductA ProductA-03 20 6/1/2009
ProductA ProductA-01 15 5/1/2009
ProductA ProductA-02 12 5/1/2009
ProductA ProductA-03 10 5/1/2009
ProductA ProductA-01 19 4/1/2009
ProductA ProductA-02 17 4/1/2009
ProductA ProductA-03 15 4/1/2009
In my report, I need to display the Invoices and Orders, as well as the Price of the Order Item which was effective at the time it was paid (Invoice Date).
My query looks like this (my source db is Oracle):
SELECT T.INVOICE_NO,
       T.INVOICE_DATE,
       T.PRODUCT_PKG_CODE,
       T.PRODUCT_PKG_ITEM,
       P.PRODUCT_ITEM_PRICE
FROM TRANSACTION T, ITEM_MASTER P
WHERE T.PRODUCT_PKG_CODE = P.PRODUCT_PKG_CODE
  AND T.PRODUCT_PKG_ITEM = P.PRODUCT_PKG_ITEM
  AND P.EFFECTIVITY_DATE <= T.INVOICE_DATE
  AND T.INVOICE_NO = '1234';
...which shows 2 prices for each item. I tried several different query styles, but to no avail, so I decided it's time to get help. :) Thanks to any of you who can share your knowledge. --CJ--
p.s. Sorry, my post doesn't even look right! :D
If it's returning two rows with different effective dates that are less than the invoice date, you may want to change your date join to
AND P.EFFECTIVITY_DATE = (
    select max(effectivity_date)
    from item_master im
    where im.product_pkg_code = t.product_pkg_code
      and im.product_pkg_item = t.product_pkg_item
      and im.effectivity_date <= t.invoice_date)
or something like that, to only get the one price that is the most recent one before the invoice date.
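Here is that correlated-subquery shape run end to end with sqlite3. The schema is a simplified stand-in for the question's tables (the transaction table is renamed txn because TRANSACTION is reserved in SQLite, and the two package-code/item columns are collapsed into one item column); dates are stored as ISO strings so they compare correctly.

```python
# Join each invoice line to the price row whose effectivity date is the
# latest one on or before the invoice date. Prices/dates from the question.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE txn (invoice_no TEXT, invoice_date TEXT, item TEXT);
CREATE TABLE item_master (item TEXT, price REAL, effectivity_date TEXT);
INSERT INTO txn VALUES ('1234', '2009-06-29', 'ProductA-01'),
                       ('1234', '2009-06-29', 'ProductA-02');
INSERT INTO item_master VALUES
  ('ProductA-01', 25, '2009-06-01'), ('ProductA-02', 22, '2009-06-01'),
  ('ProductA-01', 15, '2009-05-01'), ('ProductA-02', 12, '2009-05-01'),
  ('ProductA-01', 19, '2009-04-01'), ('ProductA-02', 17, '2009-04-01');
""")

rows = conn.execute("""
SELECT t.invoice_no, t.item, p.price
FROM txn t
JOIN item_master p ON p.item = t.item
WHERE p.effectivity_date = (SELECT MAX(effectivity_date)
                            FROM item_master im
                            WHERE im.item = t.item
                              AND im.effectivity_date <= t.invoice_date)
ORDER BY t.item
""").fetchall()
```

Each invoice line now gets exactly one price: the June 1 price, since that is the most recent one on or before the June 29 invoice date.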
Analytics is your friend. You can use the FIRST_VALUE() function, for example, to get all the product_item_prices for the given product, sort by effectivity_date (descending), and just pick the first one. You'll need a DISTINCT as well so that only one row is returned for each transaction.
SELECT DISTINCT
T.INVOICE_NO,
T.INVOICE_DATE,
T.PRODUCT_PKG_CODE,
T.PRODUCT_PKG_ITEM,
FIRST_VALUE(P.PRODUCT_ITEM_PRICE)
OVER (PARTITION BY T.INVOICE_NO, T.INVOICE_DATE,
T.PRODUCT_PKG_CODE, T.PRODUCT_PKG_ITEM
ORDER BY P.EFFECTIVITY_DATE DESC)
as PRODUCT_ITEM_PRICE
FROM TRANSACTION T,
ITEM_MASTER P
WHERE T.PRODUCT_PKG_CODE = P.PRODUCT_PKG_CODE
AND T.PRODUCT_PKG_ITEM = P.PRODUCT_PKG_ITEM
AND P.EFFECTIVITY_DATE <= T.INVOICE_DATE
AND T.INVOICE_NO = '1234';
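The FIRST_VALUE approach can likewise be reproduced with sqlite3 (window functions require SQLite 3.25+). One item is enough to show the shape; as before, the transaction table is renamed txn because TRANSACTION is reserved in SQLite, and the package columns are collapsed into a single item column.

```python
# DISTINCT collapses the duplicate rows the join produces, leaving the price
# from the most recent effectivity date for each invoice line.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE txn (invoice_no TEXT, invoice_date TEXT, item TEXT);
CREATE TABLE item_master (item TEXT, price REAL, effectivity_date TEXT);
INSERT INTO txn VALUES ('1234', '2009-06-29', 'ProductA-01');
INSERT INTO item_master VALUES
  ('ProductA-01', 25, '2009-06-01'),
  ('ProductA-01', 15, '2009-05-01'),
  ('ProductA-01', 19, '2009-04-01');
""")

rows = conn.execute("""
SELECT DISTINCT t.invoice_no, t.item,
       FIRST_VALUE(p.price) OVER (
           PARTITION BY t.invoice_no, t.item
           ORDER BY p.effectivity_date DESC) AS price
FROM txn t
JOIN item_master p ON p.item = t.item
WHERE p.effectivity_date <= t.invoice_date
""").fetchall()
```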
While your question's formatting is a bit too messy for me to get all the details, it sure does look like you're looking for the standard SQL construct ROW_NUMBER() OVER with both PARTITION BY and ORDER BY -- it's in PostgreSQL 8.4 and has been in Oracle (and MS SQL Server, and DB2...) for quite a while, and it's the handiest way to select the "top" (or "top N") rows "by group", in a certain order, in a SQL query. Look it up; see the PostgreSQL-specific docs.
