QUANTILE FUNCTION IN TERADATA

I want to get some categories with the QUANTILE function, but I am only getting the minimum and maximum quantile values.
Example
SELECT *
FROM
( SELECT ID, SALES, QUANTILE(5, SALES) AS QUINTIL
  FROM
  ( SELECT ID, SUM(SALES) AS SALES
    FROM TABL1
    WHERE TIME_PER = 202004
    GROUP BY 1
  ) AS TABL
) AS TABL2
RESULT
(screenshot of the query output omitted)

Related

Create a function that counts the number of users with a certain sexe and a certain category in R

I have a dataframe called "hcv" where each row corresponds to one user, showing their Sexe and blood Category (0, 0s, 1, 2, or 3). The Sexe and Category variables are both factors in my dataset.
I need to create a function that takes a sexe and a category and returns the number of users matching both arguments.
So far this is what I have:
pow <- function(sexe, category) {
  result <- sum(numbers == sexe, by = Category)
  print(paste(result))
}
Thanks for your help!
This should be enough:
sum(hcv$Sexe == sexe & hcv$Category == category)
Also perhaps look at:
hcv %>% with(table(Sexe, Category))
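Putting it together as the requested function, a minimal sketch (the toy hcv below is hypothetical, standing in for the asker's real data):

# hypothetical toy data in place of the real hcv
hcv <- data.frame(
  Sexe     = factor(c("M", "F", "F", "M", "F")),
  Category = factor(c("0", "1", "0s", "1", "0"))
)

# count users matching both arguments
pow <- function(sexe, category) {
  sum(hcv$Sexe == sexe & hcv$Category == category)
}

pow("F", "0")   # 1: only the last row is both F and category 0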

Churn Rate in R

I have a data set named df_a with millions of rows. I want to calculate the churn rate and group it by month.
On the sample data I ran the following code to prepare my data.
The logic is to find the minimum month (the acquired month),
find the last month present in the records,
and find the difference in months, grouping that difference into buckets.
The code is below:
library(data.table)
library(anytime)   # for anydate()

df_a <- data.table(df)
df_a[, min_date := min(yw), by = c("CUSTOMER_DIMENSION_ID")]
df_a[, max_date := max(yw), by = c("CUSTOMER_DIMENSION_ID")]
df_a$min_date_m <- anydate(df_a$min_date)
df_a$max_date_m <- anydate(df_a$max_date)
df_a$diff_days  <- df_a$max_date_m - df_a$min_date_m
df_a$difference <- as.numeric(df_a$diff_days) / (365.25 / 12)
df_a$Month_Bucket <- ifelse(df_a$difference >= 0  & df_a$difference < 3,  "3",
                     ifelse(df_a$difference >= 3  & df_a$difference < 6,  "3-6",
                     ifelse(df_a$difference >= 6  & df_a$difference < 9,  "6-9",
                     ifelse(df_a$difference >= 9  & df_a$difference < 12, "9-12",
                     ifelse(df_a$difference >= 12 & df_a$difference < 24, "12-24",
                            "24+")))))
data_a <- df_a[c(1, 1:nrow(df_a)), ]   # note: this duplicates row 1
setDT(data_a)
xxx <- (cohorts <- dcast(unique(data_a)[, cohort := min(yw), by = CUSTOMER_DIMENSION_ID],
                         cohort ~ Month_Bucket))
I am getting the output in the following format:
Month      3
2020-08    92876
2020-07    144873
However, the output is not correct.
What I want is:
Month      no. of unique customers acquired   0-3   3-6   6-9
2019-08    85749
2019-07    128060
The output is basically summing up the customers across months and assigning a bucket. However, if I acquire 85,749 customers in 2019-08, I will have, say, 25k customers in the 0-3 bucket and 25k again in the 3-6 bucket.
One could do:
data_unique <- unique(data_a)
ccc <- (cohorts <- dcast(data_unique[, cohort := min(yw),
                                     by = CUSTOMER_DIMENSION_ID],
                         cohort ~ Month_Bucket,
                         value.var = "CUSTOMER_DIMENSION_ID",
                         fun.aggregate = function(x) length(unique(x))))
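To see what the value.var / fun.aggregate combination does, here is a minimal self-contained sketch with hypothetical toy data standing in for data_a:

library(data.table)

dt <- data.table(
  CUSTOMER_DIMENSION_ID = c(1, 1, 2, 3, 3, 3),
  cohort       = c("2019-07", "2019-07", "2019-07", "2019-08", "2019-08", "2019-08"),
  Month_Bucket = c("3", "3-6", "3", "3", "3-6", "6-9")
)
# one column per bucket, counting distinct customers per acquisition cohort
dcast(dt, cohort ~ Month_Bucket,
      value.var = "CUSTOMER_DIMENSION_ID",
      fun.aggregate = function(x) length(unique(x)))
#   cohort    3   3-6   6-9
#   2019-07   2     1     0
#   2019-08   1     1     1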

Price weighted average adjusted for new buys and sells in R

I'm trying to find the final weighted mean of my stocks after buys and sells.
So I am looking for a weighted average, adjusted for the buys and sells.
This is an example of my data. I have more than one stock, but I can apply the solution to the others using group_by.
library(data.table)
library(dplyr)

ledger <- data.table(
  ID  = c(rep("b", 3), rep("x", 2)),
  Prc = c(10, 20, 15, 35, 40),
  Qty = c(300, -50, 100, 50, -10),
  Op  = c("Purchase", "Sale", "Purchase", "Purchase", "Sale")
)
ledger <- ledger %>%
  group_by(ID) %>%
  mutate(Stock = cumsum(Qty))
ledger <- as.data.table(ledger)
View(ledger)
While looking for an answer, I found this code:
ledger[, Stock := cumsum(Qty)]                 # compute Stock value
ledger[, `:=`(id = .I, AvgPrice = NA_real_)]   # add id and AvgPrice columns
ledger[1, AvgPrice := Prc]                     # compute AvgPrice for the first row
# work through the remaining rows and find the AvgPrice
ledger[ledger[, .I[-1]], AvgPrice := {
  if (Op == "Sale") {
    ledger[.I - 1, AvgPrice]
  } else {
    round(((Qty * Prc) + ledger[.I - 1, AvgPrice * Stock]) /
            (Qty + ledger[.I - 1, Stock]),
          digits = 2)
  }
}, by = id]
ledger[, id := NULL]                           # remove id column
That works very well, but I need it grouped by my ID, so that it doesn't compute the average across all stocks together.
Thanks for your contribution!
I think it would help to store some useful intermediate values for the calculation.
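For concreteness, a hypothetical DF reconstructed from the output printed below (the question's quantities recast into separate Buy and Sell columns):

# hypothetical input matching the printed output below
DF <- data.frame(
  ID    = rep("b", 3),
  price = c(10, 20, 15),
  Buy   = c(100, 200, 50),
  Sell  = c(50, 100, 0)
)

With that in place: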
DF$Net_Stocks <- DF$Buy - DF$Sell
DF$Stocks_Owned <- cumsum(DF$Net_Stocks)
DF$Cost <- DF$price * DF$Buy
DF$Sales <- DF$price * DF$Sell
DF$Net_Cost <- DF$Cost - DF$Sales
DF$Total_Cost <- cumsum(DF$Net_Cost)
DF$Cost_Per_Held_Stock <- DF$Total_Cost / DF$Stocks_Owned
Now you can see better what is happening:
DF
#> ID price Buy Sell Net_Stocks Stocks_Owned Cost Sales Net_Cost Total_Cost Cost_Per_Held_Stock
#> b 10 100 50 50 50 1000 500 500 500 10.00000
#> b 20 200 100 100 150 4000 2000 2000 2500 16.66667
#> b 15 50 0 50 200 750 0 750 3250 16.25000
If you only want the formula for the last column, it is:
cumsum((DF$Buy - DF$Sell) * DF$price) / cumsum(DF$Buy - DF$Sell)
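Since the asker needs the result per stock, here is a minimal sketch that runs the same running-average logic grouped by ID, using the ledger from the question (it keeps the found code's assumption that a sale leaves the average purchase price unchanged):

library(data.table)

ledger <- data.table(
  ID  = c(rep("b", 3), rep("x", 2)),
  Prc = c(10, 20, 15, 35, 40),
  Qty = c(300, -50, 100, 50, -10),
  Op  = c("Purchase", "Sale", "Purchase", "Purchase", "Sale")
)
ledger[, Stock := cumsum(Qty), by = ID]
ledger[, AvgPrice := {
  avg <- numeric(.N)
  for (i in seq_len(.N)) {
    if (i == 1) {
      avg[i] <- Prc[i]             # first row of each ID: average is the price paid
    } else if (Op[i] == "Sale") {
      avg[i] <- avg[i - 1]         # a sale leaves the average unchanged
    } else {
      # weighted average of the new purchase and the shares already held
      avg[i] <- round((Qty[i] * Prc[i] + avg[i - 1] * Stock[i - 1]) /
                        (Qty[i] + Stock[i - 1]), 2)
    }
  }
  avg
}, by = ID]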

Summation over nearby observations

I have a large data.frame which includes the price of goods and the quantity sold at each price. I would like to find the total quantity of goods sold at a price similar (within a range) to the price of each row. For example, for the jth observation (row), I would like the sum of the quantity of goods sold at a price lower than Price_j + 50 and higher than Price_j - 50, and similarly for the other observations.
I can run a for loop over the observations and filter the data on each observation's price.
library(dplyr)

df <- data.frame(Price = runif(100) * 100, Q = runif(100) * 1000)
SumQ <- data.frame()
for (i in 1:nrow(df)) {
  df_filtered <- df %>%
    filter(Price < Price[i] + 50 & Price > Price[i] - 50) %>%
    summarize(sumQ = sum(Q))
  SumQ <- rbind(SumQ, df_filtered$sumQ)
}
Is there a more efficient way to do this? I have a large dataset, and running the for loop over all observations takes a lot of time.
You want to avoid looping and binding the results - this will be very slow. Instead, try:
with(df, sapply(Price, function(x) sum(Q[Price < x+50 & Price > x-50])))
Or with dplyr and purrr you could do:
df %>% mutate(sumQ = map_dbl(Price,
                             ~ sum(Q[Price < . + 50 & Price > . - 50])))
Price Q sumQ
1 5.2272345 284.433416 28356.80
2 17.7292069 454.122990 35459.90
3 9.7329295 509.266254 29989.69
4 68.1042808 131.169813 41230.23
5 38.5612268 938.653962 45227.63
6 44.5808938 774.296761 47758.30
...
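Both approaches above still compare every price against every other, so they scale quadratically with the number of rows. For a truly large dataset, here is a sketch of an O(n log n) alternative (not from the original answer; it uses base R's order, cumsum, and findInterval):

# sort prices once and precompute prefix sums of Q in price order
ord <- order(df$Price)
p   <- df$Price[ord]
csQ <- c(0, cumsum(df$Q[ord]))

# for each price x, sum Q over rows with x - 50 < p < x + 50
hi <- findInterval(df$Price + 50, p, left.open = TRUE)   # count of p <  x + 50
lo <- findInterval(df$Price - 50, p)                     # count of p <= x - 50
df$sumQ <- csQ[hi + 1] - csQ[lo + 1]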

R data.table merge / full outer join with na.fill / nomatch based on formula

What I need is to perform a full outer join with some kind of smart na.fill / nomatch in an efficient way. I've already done it using a loop, but I would like to use matrix algebra or data.table operations to speed up the process.
The data below are a sample of a stock's open-order information; the full outer join is performed between a dataset of open ask orders and a dataset of open bid orders. Dataset A holds the asks, B the bids. Both datasets store atomic orders and their cumulative sums. The task is to match all ask orders with bid orders by cumulative value, and vice versa.
Populate example data:
library(data.table)

price  <- c(11.25, 11.26, 11.35, 12.5, 14.2)
amount <- c(1.2, 0.4, 2.75, 6.5, 15.2)
A <- data.table(ask_price = price, ask_amount = amount,
                ask_cum_amount = cumsum(amount),
                cum_value = cumsum(price * amount),
                ask_avg_price = cumsum(price * amount) / cumsum(amount))

price  <- c(11.18, 11.1, 10.55, 10.25, 9.7)
amount <- c(0.15, 0.6, 10.2, 3.5, 12)
B <- data.table(bid_price = price, bid_amount = amount,
                bid_cum_amount = cumsum(amount),
                cum_value = cumsum(price * amount),
                bid_avg_price = cumsum(price * amount) / cumsum(amount))
A regular full outer join and its results:
setkey(A, cum_value)
setkey(B, cum_value)
C <- merge(A,B,all=TRUE)
print(C)
The na.fill / nomatch pseudocode formula, applied to every row (ask or bid) whose cum_value has no match (keep in mind that every field other than cum_value belongs to either the ask or the bid side):
avg_price["current NA"] <- cum_value["last non NA"]/cum_value["current NA"] * avg_price["last non NA"] + (1-cum_value["last non NA"]/cum_value["current NA"]) * price["next non NA"]
cum_amount["current NA"] <- cum_value["current NA"] / avg_price["current NA"]
Expected results:
D <- data.table(
cum_value = c(1.677,8.337,13.5,18.004,49.2165,115.947,130.4665,151.822,268.222,346.3065),
ask_price = c(NA,NA,11.25,11.26,11.35,NA,12.5,NA,NA,14.2),
ask_amount = c(NA,NA,1.2,0.4,2.75,NA,6.5,NA,NA,15.2),
ask_cum_amount = c(0.149066666666667,0.741066666666667,1.2,1.6,4.35,9.66496172396059,10.85,12.3126600707381,20.4097766460076,26.05),
ask_avg_price = c(11.25,11.25,11.25,11.2525,11.31414,11.9966331281534,12.02456,12.3305605066459,13.1418390633132,13.29392),
bid_price = c(11.18,11.1,NA,NA,NA,10.55,NA,10.25,9.7,NA),
bid_amount = c(0.15,0.6,NA,NA,NA,10.2,NA,3.5,12,NA),
bid_cum_amount = c(0.15,0.75,1.23858478466587,1.66517233847558,4.6230572556498,10.95,12.3652404387114,14.45,26.45,NA),
bid_avg_price = c(11.18,11.116,10.8995364444444,10.8120940902022,10.6458772362927,10.58877,10.5510685899445,10.50671,10.14072,NA)
)
print(D)
Note that in the expected results the last NA is still NA; this is because the opposite order could not be matched, as the market depth is not enough to fill the order at any price.
Is it possible to get the expected results using matrix algebra, data.table operations, or any other efficient approach that avoids looping over the full dataset?
Thanks in advance.
Merge it back again with A and B with a roll to find the last/next non-NA prices.
E.g. see the output values of bid_avg_price for these two merges:
B[merge(A, B, all = T), roll = Inf]
B[merge(A, B, all = T), roll = -Inf]
That should give you all the info you need to compute those quantities.
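As a sketch of how those rolled values feed the question's fill formula (bid side shown; the ask side is symmetric). One caveat: a rolling join reports the i table's key value, so B's own cum_value is copied to a separate column first:

C <- merge(A, B, all = TRUE)                      # keys on cum_value set above
B2 <- copy(B)[, bid_cv := cum_value]              # preserve B's own cum_value
setkey(B2, cum_value)
last_bid <- B2[C[, .(cum_value)], roll = Inf]     # last non-NA bid row per cum_value
next_bid <- B2[C[, .(cum_value)], roll = -Inf]    # next non-NA bid row per cum_value

# the question's formula: weight the last average with the next price
w    <- last_bid$bid_cv / C$cum_value
fill <- w * last_bid$bid_avg_price + (1 - w) * next_bid$bid_price
C$bid_avg_price  <- ifelse(is.na(C$bid_avg_price), fill, C$bid_avg_price)
C$bid_cum_amount <- ifelse(is.na(C$bid_cum_amount),
                           C$cum_value / C$bid_avg_price, C$bid_cum_amount)

Rows beyond the available market depth keep their NA, matching the expected table D.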
