I have a table of loan issuance and repayments by customers that I have preprocessed like this
customerID | balanceChange | trxDate | TYPE
242105 | 500 | 20170605 | loan
242105 | 1500 | 20170605 | loan
242105 | -1000 | 20170607 | payment
242111 | 500 | 20170605 | loan
242111 | -500 | 20170606 | payment
242111 | 500 | 20170607 | loan
242111 | -500 | 20170609 | payment
242151 | 500 | 20170605 | loan
What I would like to do is, for the loans issued on each day, (1) count how many of them have been paid back in full, and (2) compute how many days it took the customers to pay them back.
The rule of the repayment is of course FIFO (First In First Out), so the oldest loan gets paid back first.
In the example above, the solution would be
trxDate | nRepayments | timeGap(days)
20170605 | 2 | 1.5
20170606 | 0 | 0
20170607 | 1 | 2
The explanation for why the solution looks like that: on 20170605, there are 4 loans issued (2 to customerID 242105, and the other two to 242111 and 242151), but only 2 of those loans were paid back (the 500 given to 242105 and the 500 given to 242111). The timeGap is the average number of days it took the customers to pay those loans back (242105 paid back on 20170607 - 2 days, and 242111 paid back on 20170606 - 1 day), so (2 + 1) / 2 = 1.5.
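For reference, the sample above can be reconstructed as a data.table like this (the column types are my assumption; the object is called data_loans to match the script below):

library(data.table)

data_loans <- data.table(
  customerID    = c(242105, 242105, 242105, 242111, 242111, 242111, 242111, 242151),
  balanceChange = c(500, 1500, -1000, 500, -500, 500, -500, 500),
  trxDate       = c(20170605, 20170605, 20170607, 20170605, 20170606, 20170607, 20170609, 20170605),
  TYPE          = c("loan", "loan", "payment", "loan", "payment", "loan", "payment", "loan")
)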
I have tried to calculate the nRepayments (I figured if I did this the timeGap should be a piece of cake) with the following R script.
#Recoveries
library(dplyr)
library(data.table)

data_loans_rec <- data_loans %>% arrange(customerID, trxDate) %>% as.data.table()
data_loans_rec[is.na(data_loans_rec)] <- 0
# drop payments that come before any loan, then re-index per customer
data_loans_rec <- data_loans_rec[, index := seq_len(.N), by = customerID][!(index == 1 & TYPE == "payment")][, index := seq_len(.N), by = customerID]

# number of loans issued per day (grouped by trxDate so it can be matched below)
n_loans_given <- data_loans[TYPE == "loan", ][, .(nloans = .N), .(trxDate)][order(trxDate)]
n_loans_rec <- copy(n_loans_given)
n_loans_rec[, nloans := 0]
unique_cust <- unique(data_loans_rec$customerID)

#Check repayment for every customer================
for (i in 1:length(unique_cust)) {
  cur_cust <- unique_cust[i]
  # plain vectors of this customer's loan amounts, loan dates and payments
  list_loan      <- data_loans_rec[customerID == cur_cust & TYPE == "loan", balanceChange]
  list_loan_time <- data_loans_rec[customerID == cur_cust & TYPE == "loan", trxDate]
  list_pay       <- data_loans_rec[customerID == cur_cust & TYPE == "payment", balanceChange]
  if (length(list_pay) == 0) { # if there are no payments
    list_pay <- c(0)
  }
  sum_paid <- sum(abs(list_pay))
  # walk through the loans in date order (FIFO) while the payments still cover them
  for (i_loantime in seq_along(list_loan_time)) {
    loan_curr <- list_loan[i_loantime]
    loan_left <- loan_curr - sum_paid
    if (loan_left <= 0) {
      n_loans_rec[trxDate == list_loan_time[i_loantime], nloans := nloans + 1]
      sum_paid <- sum_paid - loan_curr
      print(paste(i_loantime, list_loan_time[i_loantime], n_loans_rec[trxDate == list_loan_time[i_loantime], nloans]))
      # break
    } else {
      break
    }
  }
  print(i)
}
The idea is that for every customer, I make a list of loans, loan dates, and payments. The best case scenario is when the customer's total amount of loans is equal to, or (due to dirty data) less than, the total amount of payments (full repayment). Then the number of repayments equals the number of loans issued to that customer. The average case is when customers make a partial payment. In that case, I sum the total amount of payments, and I iterate through each loan the customer made while summing the total amount of loans as I go. When the amount of loans finally exceeds the amount of payments, I count how many loans have actually been covered by the customer's payments.
The problem is that I have millions of customers, and each of them has made loans and payments at least 5 times. So, since I am using a nested loop, it takes hours to complete.
So, I am asking here if anyone has ever come across this problem and/or has a better, more efficient solution.
Thanks in advance!
Your logic is quite complicated and with this answer I don't attempt to replicate it fully; my intention is just to give you some ideas on how to optimise.
Also, as mentioned in comments, you could try to parallelise, or maybe use another programming language.
Anyway, as your setup already uses data.table, you can try to use global operations on the full set as much as you can, which will usually be faster than your big loop. For example, something like this.
I first calculate, per customer id, the balance and the sum of payments done:
data_loans_rec <- data_loans_rec[, balance := sum(balanceChange), by = customerID]
data_loans_rec <- data_loans_rec[, sumPayments := sum(balanceChange[TYPE == "payment"]), by = customerID]
With this, you already know that every customer with balance 0 has repaid everything:
data_loans_rec <- data_loans_rec[TYPE == "loan" & balance == 0, repaid := TRUE, by = list(customerID, index)]
These operations of course read a lot of data if you have millions of customers, but I'd say that data.table should handle them pretty quickly.
For the rest of the customers, but only for the rows that are loans and for which you don't yet know whether they have been repaid, you can use a function applied per customer with data.table.
setRepaid <- function(balanceChange, sumPayments) {
  # note that here you get a vector for all the loans of a customer
  sumPay <- (-1) * sumPayments[1]
  if (sumPay == 0)
    return(rep(FALSE, length(balanceChange)))
  number_of_loans_paid <- 0
  for (i in 1:length(balanceChange)) {
    if (sum(balanceChange[1:i]) > sumPay)
      break
    number_of_loans_paid <- number_of_loans_paid + 1
  }
  return(c(rep(TRUE, number_of_loans_paid), rep(FALSE, length(balanceChange) - number_of_loans_paid)))
}
data_loans_rec <- data_loans_rec[TYPE == "loan" & is.na(repaid), repaid := setRepaid(balanceChange, sumPayments), by = list(customerID) ]
With that you get the desired result, at least for your example.
   customerID balanceChange  trxDate    TYPE index balance sumPayments repaid
1:     242105           500 20170605    loan     1    1000       -1000   TRUE
2:     242105          1500 20170605    loan     2    1000       -1000  FALSE
3:     242105         -1000 20170607 payment     3    1000       -1000     NA
4:     242111           500 20170605    loan     1       0       -1000   TRUE
5:     242111          -500 20170606 payment     2       0       -1000     NA
6:     242111           500 20170607    loan     3       0       -1000   TRUE
7:     242111          -500 20170609 payment     4       0       -1000     NA
8:     242151           500 20170605    loan     1     500           0  FALSE
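If you then want the per-day nRepayments count from the question, that should just be an aggregation of the repaid flag (note that dates on which no loans were issued, like 20170606, will simply not appear):

n_repaid_per_day <- data_loans_rec[TYPE == "loan",
                                   .(nRepayments = sum(repaid, na.rm = TRUE)),
                                   by = trxDate][order(trxDate)]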
Advantages being: the final loop works over far fewer customers, you have some stuff already precalculated, and you rely on data.table to actually replace your explicit loop. Hopefully this approach will give you an improvement; I think it is worth a try.
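For the timeGap part of the question, which the repaid flag alone does not give you, here is a different, untested sketch along the same lines: match each loan to the first payment whose cumulative total covers it (FIFO), then aggregate per issue date. It assumes data_loans is a data.table with exactly the four columns from the question and trxDate stored as an integer of the form yyyymmdd.

library(data.table)

dt <- copy(data_loans)[order(customerID, trxDate)]
dt[, trxDate := as.Date(as.character(trxDate), format = "%Y%m%d")]

loans <- dt[TYPE == "loan"]
loans[, cumLoan := cumsum(balanceChange), by = customerID]
pays <- dt[TYPE == "payment"]
pays[, cumPay := cumsum(-balanceChange), by = customerID]

# A loan is repaid on the first payment date at which the customer's cumulative
# payments reach the cumulative loan amount up to and including that loan (FIFO).
loans[, repaidDate := {
  cid <- .BY$customerID
  p <- pays[customerID == cid]
  idx <- findInterval(cumLoan, p$cumPay, left.open = TRUE) + 1L
  p$trxDate[idx]   # NA when the loan is never fully covered
}, by = customerID]

# Aggregate per issue date (again, issue dates with no loans will not appear).
loans[, .(nRepayments = sum(!is.na(repaidDate)),
          timeGap     = mean(as.numeric(repaidDate - trxDate), na.rm = TRUE)),
      by = trxDate][order(trxDate)]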
Related
I am very new to Kusto queries and I have one that is giving me the proper data, which I export to Excel to manage. My only problem is that (right now) I only care about yesterday and today, in two separate sheets. I can manually change the datetime values in the query, but I would like to be able to just refresh the data and have it pull the newest numbers.
It sounds pretty simple, but I cannot figure out how to specify the exact time window I want. It has to be from 2:00 am on day 1 until 1:59 am on day 2.
Thanks
['Telemetry.WorkStation']
| where NexusUid == "08463c7b-fe37-43b6-a0d2-237472b9774d"
| where TelemetryLocalTimeStamp >= make_datetime(2023,2,15,2,0,0) and TelemetryLocalTimeStamp < make_datetime(2023,2,16,01,59,0)
| where NumberOfBinPresentations >0
Use ago(), now(), startofday() and some datetime arithmetic:
// Sample data generation. Not part of the solution.
let ['Telemetry.WorkStation'] = materialize(range i from 1 to 1000000 step 1 | extend NexusUid = "08463c7b-fe37-43b6-a0d2-237472b9774d", TelemetryLocalTimeStamp = ago(2d * rand()));
// Solution starts here.
['Telemetry.WorkStation']
| where NexusUid == "08463c7b-fe37-43b6-a0d2-237472b9774d"
| where TelemetryLocalTimeStamp >= startofday(ago(1d)) + 2h
and TelemetryLocalTimeStamp < startofday(now()) + 2h
| summarize count(), min(TelemetryLocalTimeStamp), max(TelemetryLocalTimeStamp)
count_ | min_TelemetryLocalTimeStamp  | max_TelemetryLocalTimeStamp
500539 | 2023-02-15T02:00:00.0162031Z | 2023-02-16T01:59:59.8883692Z
I'm looking to create buckets for certain requests based on duration. For requests with name "A", I need a count of when the duration was <2 secs, 2-4 secs, and >4 secs. I get the data individually using:
requests
| where name == "A"
| where duration <= 2000
| summarize count()
but what I really need is the number as a percentage of the total "A" requests, for example, a table like:
Name <2secs 2-4 secs >4secs
A 89% 98% 99%
Thanks,
Chris
One way to do it is to rely on the performanceBucket field. This will give some distribution, but the performance buckets are preconfigured.
requests
| where timestamp > ago(1d)
| summarize count() by performanceBucket
Another approach is to do something like this:
requests
| where timestamp > ago(1d)
| extend requestPeformanceBucket = iff(duration < 2000, "<2secs",
    iff(duration < 4000, "2secs-4secs", ">4secs"))
| summarize count() by requestPeformanceBucket
And here is how to get percentage:
let dataSet = requests
| where timestamp > ago(1d);
let totalCount = toscalar(dataSet | count);
dataSet
| extend requestPeformanceBucket = iff(duration < 2000, "<2secs",
    iff(duration < 4000, "2secs-4secs", ">4secs"))
| summarize count() by requestPeformanceBucket
| project ["Bucket"]=requestPeformanceBucket,
["Count"]=count_,
["Percentage"]=strcat(round(todouble(count_) / totalCount * 100, 2), "%")
I'm sending customEvents to Azure Application Insights that look like this:
timestamp | name | customDimensions
----------------------------------------------------------------------------
2017-06-22T14:10:07.391Z | StatusChange | {"Status":"3000","Id":"49315"}
2017-06-22T14:10:14.699Z | StatusChange | {"Status":"3000","Id":"49315"}
2017-06-22T14:10:15.716Z | StatusChange | {"Status":"2000","Id":"49315"}
2017-06-22T14:10:21.164Z | StatusChange | {"Status":"1000","Id":"41986"}
2017-06-22T14:10:24.994Z | StatusChange | {"Status":"3000","Id":"41986"}
2017-06-22T14:10:25.604Z | StatusChange | {"Status":"2000","Id":"41986"}
2017-06-22T14:10:29.964Z | StatusChange | {"Status":"3000","Id":"54234"}
2017-06-22T14:10:35.192Z | StatusChange | {"Status":"2000","Id":"54234"}
2017-06-22T14:10:35.809Z | StatusChange | {"Status":"3000","Id":"54234"}
2017-06-22T14:10:39.22Z | StatusChange | {"Status":"1000","Id":"74458"}
Assuming that status 3000 is an error status, I'd like to get an alert when a certain percentage of Ids end up in the error status during the past hour.
As far as I know, Insights cannot do this by default, so I would like to try the approach described here to write an Analytics query that could trigger the alert. This is the best I've been able to come up with:
customEvents
| where timestamp > ago(1h)
| extend isError = iff(toint(customDimensions.Status) == 3000, 1, 0)
| summarize failures = sum(isError), successes = sum(1 - isError) by timestamp bin = 1h
| extend ratio = todouble(failures) / todouble(failures+successes)
| extend failure_Percent = ratio * 100
| project iff(failure_Percent < 50, "PASSED", "FAILED")
However, for my alert to work properly, the query should:
Return "PASSED" even if there are no events within the hour (another alert will take care of the absence of events)
Only take into account the final status of each Id within the hour.
As the request is written, if there are no events, the query returns neither "PASSED" nor "FAILED".
It also takes into account any records with Status == 3000, which means that the example above would return "FAILED" (5 out of 10 records have Status 3000), while in reality only 1 out of 4 Ids ended up in error state.
Can someone help me figure out the correct query?
(And optional secondary questions: Has anyone setup a similar alert using Insights? Is this a correct approach?)
As mentioned, since you're only querying on a single hour you don't need to bin the timestamp, or use it as part of your aggregation at all.
To answer your questions:
The way to overcome no data at all would be to inject a synthetic row into your table which will translate to a success result if no other result is found
If you want your pass/fail criteria to be based on the final status for each ID, then you need to use argmax in your summarize - it will return the status corresponding to maximal timestamp.
So to wrap it all up:
customEvents
| where timestamp > ago(1h)
| extend isError = iff(toint(customDimensions.Status) == 3000, 1, 0)
| summarize argmax(timestamp, isError) by tostring(customDimensions.Id)
| summarize failures = sum(max_timestamp_isError), successes = sum(1 - max_timestamp_isError)
| extend ratio = todouble(failures) / todouble(failures+successes)
| extend failure_Percent = ratio * 100
| project Result = iff(failure_Percent < 50, "PASSED", "FAILED"), IsSynthetic = 0
| union (datatable(Result:string, IsSynthetic:long) ["PASSED", 1])
| top 1 by IsSynthetic asc
| project Result
Regarding the bonus question - you can set up alerting based on Analytics queries using Flow. See here for a related question/answer
I'm presuming that the query returns no rows if you have no data in the hour, because the timestamp bin = 1h (aka bin(timestamp,1h)) doesn't return any bins?
but if you're only querying the last hour, i don't think you need the bin on timestamp at all?
without having your data it's hard to repro exactly but... you could try something like (beware syntax errors):
customEvents
| where timestamp > ago(1h)
| extend isError = iff(toint(customDimensions.Status) == 3000, 1, 0)
| summarize totalCount = count(), failures = countif(isError == 1), successes = countif(isError ==0)
| extend ratio = iff(totalCount == 0, 0, todouble(failures) / todouble(failures+successes))
| extend failure_Percent = ratio * 100
| project iff(failure_Percent < 50, "PASSED", "FAILED")
hypothetically, getting rid of the hour binning should just give you back a single row here of
totalCount = 0, failures = 0, successes = 0, so the math for failure percent should give you back 0 failure ratio, which should get you "PASSED".
without being able to try it, I'm not sure if that works or still returns you no row if there's no data?
for your second question, you could use something like
let maxTimestamp = toscalar(customEvents | where timestamp > ago(1h)
| summarize max(timestamp));
customEvents | where timestamp == maxTimestamp ...
// ... more query here
to get just the row(s) that have the timestamp of the last event in the hour?
I am trying to use Neo4j to write a query that aggregates quantities along a particular sub-graph.
We have two stores, Store1 and Store2, one with supplier S1 and the other with supplier S2. We move 100 units from Store1 into Store3 and 200 units from Store2 to Store3.
We then move 100 units from Store3 to Store4. So now Store4 has 100 units and approximately 33 originated from supplier S1 and 66 from supplier S2.
I need the query to effectively return this information, E.g.
S1, 33
S2, 66
I have a recursive query to aggregate all the movements along each path
MATCH p=(store1:Store)-[m:MOVE_TO*]->(store2:Store { Name: 'Store4'})
RETURN store1.Supplier, reduce(amount = 0, n IN relationships(p) | amount + n.Quantity) AS reduction
Returns:
| store1.Supplier | reduction |
|-----------------|-----------|
| S1              | 200       |
| S2              | 300       |
| null            | 100       |
Desired:
| store1.Supplier | reduction |
|-----------------|-----------|
| S1              | 33.33     |
| S2              | 66.67     |
What about this one :
MATCH (s:Store) WHERE s.name = 'Store4'
MATCH (s)<-[t:MOVE_TO]-()<-[r:MOVE_TO]-(supp)
WITH t.qty as total, collect(r) as movements
WITH total, movements, reduce(totalSupplier = 0, r IN movements | totalSupplier + r.qty) as supCount
UNWIND movements as movement
RETURN startNode(movement).name as supplier, round(100.0*movement.qty/supCount) as pct
Which returns :
supplier pct
Store1 33
Store2 67
Returned 2 rows in 151 ms
So the following is pretty ugly, but it works for the example you've given.
MATCH (s4:Store { Name:'Store4' })<-[r1:MOVE_TO]-(s3:Store)<-[r2:MOVE_TO*]-(s:Store)
WITH s3, r1.Quantity as Factor, SUM(REDUCE(amount = 0, r IN r2 | amount + r.Quantity)) AS Total
MATCH (s3)<-[r1:MOVE_TO*]-(s:Store)
WITH s.Supplier as Supplier, REDUCE(amount = 0, r IN r1 | amount + r.Quantity) AS Quantity, Factor, Total
RETURN Supplier, Quantity, Total, toFloat(Quantity) / toFloat(Total) * Factor as Proportion
I'm sure it can be improved.
I'm trying to write some code to analyze my company's insurance plan offerings... but they're complicated! The PPO plan is straightforward, but the high deductible health plans are complicated, as they introduced a "split" deductible and out of pocket maximum (individual and total) for the family plans. It works like this:
Once the individual meets the individual deductible, he/she is covered at 90%
Once the remaining 1+ individuals on the plan meet the total deductible, the entire family is covered at 90%
The individual cannot satisfy the family deductible with only their medical expenses
I want to feed in a vector of expenses for my family members (there are four of them) and output the total cost for each plan. Below is a table of possible scenarios, with the following column codes:
ded_ind: did one individual meet the individual deductible?
ded_tot: was the total deductible reached?
oop_ind: was the individual out of pocket max reached?
oop_tot: was the total out of pocket max reached?
exp_ind = the expenses of the highest spender
exp_rem = the expenses of the remaining /other/ family members (not the highest spender)
oop_max_ind = the level of expenses at which the individual has paid their out of pocket maximum (when ded_ind + 0.1 * (exp_ind - ded_ind) = out of pocket max for the individual)
oop_max_fam = same as for individual, but for remaining family members
The table:
| ded_ind | oop_ind | ded_rem | oop_rem | formula
|---------+---------+---------+---------+---------------------------------------------------------------------------|
| 0 | 0 | 0 | 0 | exp_ind + exp_rem |
| 1 | 0 | 0 | 0 | ded_ind + 0.1 * (exp_ind - ded_ind) + exp_rem |
| 0 | 0 | 1 | 0 | exp_ind + ded_rem + 0.1 * (exp_rem - ded_rem) |
| 1 | 1 | 0 | 0 | oop_max_ind + exp_fam |
| 1 | 0 | 1 | 0 | ded_ind + 0.1 * (exp_ind - ded_ind) + ded_rem + 0.1 * (exp_rem - ded_rem) |
| 0 | 0 | 1 | 1 | oop_max_rem + exp_ind |
| 1 | 0 | 1 | 1 | ded_ind + 0.1 * (exp_ind - ded_ind) + oop_max_rem |
| 1 | 1 | 1 | 0 | oop_ind_max + ded_rem + 0.1 * (exp_rem - ded_rem) |
| 1 | 1 | 1 | 1 | oop_ind_max + oop_rem_max |
Omitted: combinations such as 0 1 0 0, 0 0 0 1, 0 1 1 0, and 0 1 0 1 are not present, as oop_ind and oop_rem cannot have been met if ded_ind and ded_rem, respectively, have not been met.
My current code is a somewhat massive ifelse loop like so (not the code, but what it does):
check if plan is ppo or hsa
if hsa plan
  if exp_ind + exp_rem < ded_rem        # didn't meet family deductible
    if exp_ind < ded_ind                # individual deductible also not met
      cost = exp_ind + exp_rem
    else if exp_ind < oop_max_ind       # ded_ind met, is oop_max_ind?
      cost = ded_ind + 0.1 * (exp_ind - ded_ind) + exp_rem   # didn't reach oop_max_ind
    else
      cost = oop_max_ind + exp_rem      # reached oop_max_ind
  else ...
After the else, the total is greater than the family deductible. I check to see if it was contributed by more than two people and just continue on like that.
My question, now that I've given some background to the problem: Is there a better way to manage conditional situations like this than ifelse loops to filter them down a bit at a time?
The code ends up seeming redundant, as one checks for some higher level conditions (consider the table where ded_rem is met or not met... one still has to check for ded_ind and oop_max_ind in both cases, and the code is the same... just positioned at two different places in the ifelse structure).
Could this be done with some sort of matrix operation? Are there other examples online of more clever ways to deal with filtering of conditions?
Many thanks for any suggestions.
P.S. I'm using R and will be creating an interactive with shiny so that other employees can input best and worst case scenarios for each of their family members and see which plan comes out ahead via a dot or bar chart.
The suggestion to convert to a binary value based on the result gave me an idea, which also helped me learn that one can do vectorized TRUE / FALSE checks (I guess that was probably obvious to many).
Here's my current idea:
expenses will be a vector of individual forecast medical expenses for the year (example of three people):
expenses <- c(1500, 100, 400)
We set exp_ind to the max value, and sum the rest for exp_rem
exp_ind <- max(expenses)
# [1] index of which() for cases with multiple max values
exp_rem <- sum(expenses[-which(expenses == exp_ind)[1]])
For any given plan, I can set up a vector with the cutoffs, for example:
individual deductible = 1000
individual out of pocket max = 2000 (need to incur 11k of expenses to get there)
family deductible = 2000
family out of pocket max = 4000 (need to incur 22k of expenses to get there)
Set those values:
ded_ind <- 1000
oop_max_ind <- 11000
ded_tot <- 2000
oop_max_tot <- 22000
cutoffs <- c(ded_ind, oop_max_ind, ded_tot, oop_max_tot)
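As a quick sanity check of those two converted levels (10% coinsurance after the deductible):

# ded + 0.1 * (x - ded) = oop_max  =>  x = ded + (oop_max - ded) / 0.1
1000 + (2000 - 1000) / 0.1  # 11000 -> oop_max_ind
2000 + (4000 - 2000) / 0.1  # 22000 -> oop_max_tot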
Now we can check the input expense against the cutoffs:
result <- as.numeric(rep(c(exp_ind, exp_rem), each = 2) > cutoffs)
Last, convert to binary:
result_bin <- sum(2^(seq_along(result) - 1) * result)
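With the example expenses above (exp_ind = 1500, exp_rem = 500) this works out to:

result
# [1] 1 0 0 0
result_bin
# [1] 1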
Now I can set up functions for the possible outcomes based on the value in result_bin:
if(result_bin == 1) {cost <- ded_ind + 0.1 * (exp_ind - ded_ind) + exp_rem }
cost
[1] 1550
We can check this...
High spender would have paid his 1000 and then 10% of remaining 500 = 1050
Other members did not reach the family deductible and paid the full 400 + 100 = 500
Total: 1550
I still need to create a mapping of results_bin values to corresponding functions, but doing a vectorized check and converting a unique binary value is much, much better, in my opinion, than my ifelse nested mess.
I look at it like this: I'd have had to set the variables and write the functions anyway; this saves me 1) explicitly writing all the conditions, 2) the redundancy issue I was talking about, in that one ends up writing identical "sibling" branches of parent splits in the ifelse structure, and lastly, 3) the code is far, far, far more easily followed.
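For illustration, one possible shape for that mapping (only two of the cases filled in here, reusing the formulas above) is a named list of cost functions keyed by result_bin:

cost_fns <- list(
  "0" = function(exp_ind, exp_rem, ded_ind) exp_ind + exp_rem,
  "1" = function(exp_ind, exp_rem, ded_ind) ded_ind + 0.1 * (exp_ind - ded_ind) + exp_rem
)
cost <- cost_fns[[as.character(result_bin)]](exp_ind, exp_rem, ded_ind)
cost
# [1] 1550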
Since this question is not very specific, here is a simpler example/answer:
# example data
test <- expand.grid(opt1=0:1,opt2=0:1)
# create a unique identifier to represent the binary variables
test$code <- with(test, paste(opt1, opt2, sep=""))
# create an input variable to be used in functions
test$var1 <- 1:4
#  opt1 opt2 code var1
#1    0    0   00    1
#2    1    0   10    2
#3    0    1   01    3
#4    1    1   11    4
Respective functions to apply depending on binary conditions, along with intended results for each combo:
var1 + 10 #code 00 - intended result = 11
var1 + 100 #code 10 - intended result = 102
var1 + 1000 #code 01 - intended result = 1003
var1 + var1 #code 11 - intended result = 8
Use ifelse combinations to do the calculations:
test$result <- with(test,
  ifelse(code == "00", var1 + 10,
    ifelse(code == "10", var1 + 100,
      ifelse(code == "01", var1 + 1000,
        ifelse(code == "11", var1 + var1,
          NA
        )))))
Result:
  opt1 opt2 code var1 result
1    0    0   00    1     11
2    1    0   10    2    102
3    0    1   01    3   1003
4    1    1   11    4      8
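A variation on the same idea, in case the number of codes grows: instead of nesting ifelse calls you can dispatch on the code column through a named list of functions (a sketch using the same test data):

fns <- list("00" = function(v) v + 10,
            "10" = function(v) v + 100,
            "01" = function(v) v + 1000,
            "11" = function(v) v + v)
test$result2 <- mapply(function(code, v) fns[[code]](v),
                       test$code, test$var1, USE.NAMES = FALSE)
test$result2
# [1]   11  102 1003    8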