Dealing with conditionals in a better manner than deeply nested ifelse blocks - r

I'm trying to write some code to analyze my company's insurance plan offerings... but they're complicated! The PPO plan is straightforward, but the high deductible health plans are trickier: they introduce a "split" deductible and out-of-pocket maximum (individual and total) for the family plans. It works like this:
Once the individual meets the individual deductible, he/she is covered at 90%
Once the remaining 1+ individuals on the plan meet the total deductible, the entire family is covered at 90%
The individual cannot satisfy the family deductible with only their medical expenses
I want to feed in a vector of expenses for my family members (there are four of them) and output the total cost for each plan. Below is a table of possible scenarios, with the following column codes:
ded_ind: did one individual meet the individual deductible?
ded_rem: did the remaining family members meet the total (family) deductible?
oop_ind: was the individual out of pocket max reached?
oop_rem: was the total (family) out of pocket max reached by the remaining family members?
exp_ind: the expenses of the highest spender
exp_rem: the expenses of the remaining (other) family members, i.e. not the highest spender
oop_max_ind: the level of expenses at which the individual has paid their out of pocket maximum (i.e. the point where ded_ind + 0.1 * (exp_ind - ded_ind) equals the individual out of pocket max)
oop_max_rem: same as for the individual, but for the remaining family members
The table:
| ded_ind | oop_ind | ded_rem | oop_rem | formula                                                                   |
|---------|---------|---------|---------|---------------------------------------------------------------------------|
| 0       | 0       | 0       | 0       | exp_ind + exp_rem                                                         |
| 1       | 0       | 0       | 0       | ded_ind + 0.1 * (exp_ind - ded_ind) + exp_rem                             |
| 0       | 0       | 1       | 0       | exp_ind + ded_rem + 0.1 * (exp_rem - ded_rem)                             |
| 1       | 1       | 0       | 0       | oop_max_ind + exp_rem                                                     |
| 1       | 0       | 1       | 0       | ded_ind + 0.1 * (exp_ind - ded_ind) + ded_rem + 0.1 * (exp_rem - ded_rem) |
| 0       | 0       | 1       | 1       | oop_max_rem + exp_ind                                                     |
| 1       | 0       | 1       | 1       | ded_ind + 0.1 * (exp_ind - ded_ind) + oop_max_rem                         |
| 1       | 1       | 1       | 0       | oop_max_ind + ded_rem + 0.1 * (exp_rem - ded_rem)                         |
| 1       | 1       | 1       | 1       | oop_max_ind + oop_max_rem                                                 |
Omitted: rows such as 0 1 0 0, 0 0 0 1, 0 1 1 0, and 0 1 0 1 do not appear, because oop_ind and oop_rem cannot be met unless ded_ind and ded_rem, respectively, have been met.
My current code is a somewhat massive nested if/else structure like so (not the actual code, but what it does):
check if plan is ppo or hsa
if hsa plan
    if exp_ind + exp_rem < ded_rem          # didn't meet family deductible
        if exp_ind < ded_ind                # individual deductible also not met
            cost = exp_ind + exp_rem
        else if exp_ind < oop_max_ind       # ded_ind met, oop_max_ind not reached
            cost = ded_ind + 0.1 * (exp_ind - ded_ind) + exp_rem
        else                                # reached oop_max_ind
            cost = oop_max_ind + exp_rem
    else ...
After the else, the total is greater than the family deductible. I check to see if it was contributed by more than two people and just continue on like that.
My question, now that I've given some background to the problem: Is there a better way to manage conditional situations like this than nested if/else blocks that filter them down a bit at a time?
The code ends up seeming redundant: one checks some higher-level conditions (consider the table where ded_rem is met or not met), yet still has to check ded_ind and oop_max_ind in both cases, and the code is identical, just positioned at two different places in the if/else structure.
Could this be done with some sort of matrix operation? Are there other examples online of more clever ways to deal with filtering of conditions?
Many thanks for any suggestions.
P.S. I'm using R and will be building an interactive app with Shiny so that other employees can input best- and worst-case scenarios for each of their family members and see which plan comes out ahead via a dot or bar chart.

The suggestion to convert to a binary value based on the result gave me an idea, which also helped me learn that one can do vectorized TRUE / FALSE checks (I guess that was probably obvious to many).
Here's my current idea:
expenses will be a vector of individual forecast medical expenses for the year (example of three people):
expenses <- c(1500, 100, 400)
We set exp_ind to the max value, and sum the rest for exp_rem
exp_ind <- max(expenses)
# take the first which() index, in case several people tie for the max
exp_rem <- sum(expenses[-which(expenses == exp_ind)[1]])
For any given plan, I can set up a vector with the cutoffs, for example:
individual deductible = 1000
individual out of pocket max = 2000 (need to incur 11k of expenses to get there)
family deductible = 2000
family out of pocket max = 4000 (need to incur 22k of expenses to get there)
Set those values:
ded_ind <- 1000
oop_max_ind <- 11000
ded_tot <- 2000
oop_max_tot <- 22000
cutoffs <- c(ded_ind, oop_max_ind, ded_tot, oop_max_tot)
Now we can check the input expense against the cutoffs:
result <- as.numeric(rep(c(exp_ind, exp_rem), each = 2) > cutoffs)
Last, convert to binary:
result_bin <- sum(2^(seq_along(result) - 1) * result)
Now I can set up functions for the possible outcomes based on the value in result_bin:
if(result_bin == 1) {cost <- ded_ind + 0.1 * (exp_ind - ded_ind) + exp_rem }
cost
[1] 1550
We can check this...
High spender would have paid his 1000 and then 10% of remaining 500 = 1050
Other members did not reach the family deductible and paid the full 400 + 100 = 500
Total: 1550
I still need to create a mapping of result_bin values to corresponding functions, but doing a vectorized check and converting to a unique binary value is much, much better, in my opinion, than my nested ifelse mess.
I look at it like this: I'd have had to set the variables and write the functions anyway; this saves me 1) explicitly writing all the conditions, 2) the redundancy issue I mentioned, where one ends up writing identical "sibling" branches at different points in the if/else structure, and 3) the code is far more easily followed.
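One possible shape for that mapping (just a sketch; only the cases worked through above are filled in, the rest would come straight from the table) is a named list of functions keyed by the result_bin value:

# Sketch of a result_bin -> cost-function mapping. Only codes 0 and 1 are
# shown; the remaining entries would follow the formulas in the table.
cost_funs <- list(
  "0" = function() exp_ind + exp_rem,
  "1" = function() ded_ind + 0.1 * (exp_ind - ded_ind) + exp_rem
  # "4" = ..., "5" = ..., etc.
)
cost <- cost_funs[[as.character(result_bin)]]()
cost
# [1] 1550

A switch(as.character(result_bin), ...) would work the same way; the point is that each formula lives in exactly one place.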

Since this question is not very specific, here is a simpler example/answer:
# example data
test <- expand.grid(opt1=0:1,opt2=0:1)
# create a unique identifier to represent the binary variables
test$code <- with(test, paste(opt1, opt2, sep = ""))
# create an input variable to be used in functions
test$var1 <- 1:4
# opt1 opt2 code var1
#1 0 0 00 1
#2 1 0 10 2
#3 0 1 01 3
#4 1 1 11 4
Respective functions to apply depending on binary conditions, along with intended results for each combo:
var1 + 10 #code 00 - intended result = 11
var1 + 100 #code 10 - intended result = 102
var1 + 1000 #code 01 - intended result = 1003
var1 + var1 #code 11 - intended result = 8
Use ifelse combinations to do the calculations:
test$result <- with(test,
                    ifelse(code == "00", var1 + 10,
                    ifelse(code == "10", var1 + 100,
                    ifelse(code == "01", var1 + 1000,
                    ifelse(code == "11", var1 + var1,
                           NA)))))
Result:
opt1 opt2 code var1 result
1 0 0 00 1 11
2 1 0 10 2 102
3 0 1 01 3 1003
4 1 1 11 4 8
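For what it's worth, the same dispatch can also be written without nested ifelse(), as a named list of functions keyed by the code column (a sketch built on the example data above):

# One function per binary code; dispatch by looking the code up in a named list.
funs <- list(
  "00" = function(x) x + 10,
  "10" = function(x) x + 100,
  "01" = function(x) x + 1000,
  "11" = function(x) x + x
)
test$result2 <- mapply(function(code, x) funs[[code]](x), test$code, test$var1)
# same results as above: 11 102 1003 8

This scales better when the number of codes grows, since adding a case is one new list entry rather than another level of nesting.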

Related

SQLite find table row where a subset of columns satisfies a specified constraint

I have the following SQLite table
CREATE TABLE visits(urid INTEGER PRIMARY KEY AUTOINCREMENT,
hash TEXT,dX INTEGER,dY INTEGER,dZ INTEGER);
Typical content would be
# select * from visits;
urid | hash | dx | dY | dZ
------+-----------+-------+--------+------
1 | 'abcd' | 10 | 10 | 10
2 | 'abcd' | 11 | 11 | 11
3 | 'bcde' | 7 | 7 | 7
4 | 'abcd' | 13 | 13 | 13
5 | 'defg' | 20 | 21 | 17
What I need to do here is identify the urid for the table row which satisfies the constraint
hash = 'abcd' AND nearby >= (abs(dX - tX) + abs(dY - tY) + abs(dZ - tZ))
with the smallest deviation, in the sense of the smallest sum of absolute distances.
In the present instance with
nearby = 7
tX = tY = tZ = 12
there are three rows that meet the above constraint but with different deviations
urid | hash | dx | dY | dZ | deviation
------+-----------+-------+--------+--------+---------------
1 | 'abcd' | 10 | 10 | 10 | 6
2 | 'abcd' | 11 | 11 | 11 | 3
4 | 'abcd' | 13 | 13 | 13 | 3
in which case I would like to have reported urid = 2 or urid = 4 - I don't actually care which one gets reported.
Left to my own devices I would fetch the full set of matching rows and then drill down to the one that matches my secondary constraint - smallest deviation - in my own Java code. However, I suspect that is not necessary and it can be done in SQL alone. My knowledge of SQL is sadly too limited here. I hope that someone here can put me on the right path.
I now have managed to do the following
CREATE TEMP TABLE h1 AS
SELECT urid, (abs(dX - 12) + abs(dY - 12) + abs(dZ - 12)) AS devi
FROM visits WHERE hash = 'abcd';
which gives
--SELECT * FROM h1
urid | devi |
-------+-----------+
1 | 6 |
2 | 3 |
4 | 3 |
following which I issue
SELECT urid FROM h1 ORDER BY devi ASC LIMIT 1;
which yields urid = 2, the result I am after. Whilst this works, I would like to know if there is a better/simpler way of doing this.
You're so close! You have all of the components you need, you just have to put them together into a single query.
Consider:
SELECT urid
, (abs(dX - :tx) + abs(dY - :ty) + abs(dZ - :tz)) AS devi
FROM visits
WHERE hash=:hashval AND devi < :nearby
ORDER BY devi
LIMIT 1
Line by line: first you list the columns and computed values you want (:tx, :ty, :tz and the other tokens starting with a colon are placeholders; in your code you want to prepare a statement and then bind values to the placeholders before executing it) from the visits table.
Then in the WHERE clause you restrict what rows get returned to those matching the particular hash (That column should have an index for best results... CREATE INDEX visits_idx_hash ON visits(hash) for example), and that have a devi that is less than the value of the :nearby placeholder. (I think devi < :nearby is clearer than :nearby >= devi).
Then you say that you want those results sorted in increasing order according to devi, and LIMIT the returned results to a single row because you don't care about any others (If there are no rows that meet the WHERE constraints, nothing is returned).
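If it helps, here is a rough sketch (my addition, not from the answer) of binding those placeholders from R with DBI/RSQLite, R being the language used elsewhere in this thread; the database file name and bound values are only illustrative. The subquery computes devi once so the outer WHERE can filter on it directly:

library(DBI)

# Assumed: an SQLite file "visits.db" containing the visits table from the question.
con <- dbConnect(RSQLite::SQLite(), "visits.db")

sql <- "
  SELECT urid, devi
  FROM (SELECT urid,
               abs(dX - :tx) + abs(dY - :ty) + abs(dZ - :tz) AS devi
        FROM visits
        WHERE hash = :hashval)
  WHERE devi <= :nearby
  ORDER BY devi
  LIMIT 1"

dbGetQuery(con, sql,
           params = list(tx = 12, ty = 12, tz = 12, hashval = "abcd", nearby = 7))
dbDisconnect(con)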

Efficient loan repayment calculation

I have a table of loan issuance and repayments by customers that I have preprocessed like this
customerID | balanceChange | trxDate | TYPE
242105 | 500 | 20170605 | loan
242105 | 1500 | 20170605 | loan
242105 | -1000 | 20170607 | payment
242111 | 500 | 20170605 | loan
242111 | -500 | 20170606 | payment
242111 | 500 | 20170607 | loan
242111 | -500 | 20170609 | payment
242151 | 500 | 20170605 | loan
What I would like to do is to (1) count for each of the loans issued every day, how many of them have been paid back in full, and (2) how many days did it take the customer to pay them.
The rule of the repayment is of course FIFO (First In First Out), so the oldest loan gets paid back first.
In the example above, the solution would be
trxDate | nRepayments | timeGap(days)
20170605 | 2 | 1.5
20170606 | 0 | 0
20170607 | 1 | 2
So, the explanation of why the solution looks like that: on 20170605 there are 4 loans issued (2 to customerID 242105, and the other two to 242111 and 242151), but only 2 of those loans were paid back (the 500 given to 242105 and the 500 given to 242111). The timeGap is the average number of days it took those customers to pay them back (242105 paid back on 20170607 - 2 days, and 242111 paid back on 20170606 - 1 day), so (2+1)/2 = 1.5.
I have tried to calculate the nRepayments (I figured if I did this the timeGap should be a piece of cake) with the following R script.
#Recoveries
data_loans_rec <- data_loans %>% arrange(customerID, trxDate) %>% as.data.table()
data_loans_rec[is.na(data_loans_rec)] <- 0
data_loans_rec <- data_loans_rec[, index := seq_len(.N), by = customerID][!(index == 1 & TYPE == "payment")][, index := seq_len(.N), by = customerID]
n_loans_given <- data_loans[TYPE == "loan", ][, .(nloans = .N), .(trxDate)][order(trxDate)]
n_loans_rec <- copy(n_loans_given)
n_loans_rec[, nloans:=0]
unique_cust <- unique(data_loans_rec$customerID)
#Check repayment for every customer================
for (i in 1:length(unique_cust)) {
  cur_cust <- unique_cust[i]
  list_loan      <- as.vector(data_loans_rec[customerID == cur_cust & TYPE == "loan", .(balanceChange)])
  list_loan_time <- as.vector(data_loans_rec[customerID == cur_cust & TYPE == "loan", .(trxDate)])
  list_pay       <- as.vector(data_loans_rec[customerID == cur_cust & TYPE == "payment", .(balanceChange)])
  if (dim(list_pay)[1] == 0) {  # if there are no payments
    list_pay <- c(0)
  }
  sum_paid <- sum(abs(list_pay))
  i_paid_until <- 0
  for (i_loantime in 1:(dim(list_loan_time)[1])) {
    # if there is only one loan
    if (i_loantime == 0) {
      i_loantime <- 1
    }
    loan_curr <- list_loan[i_loantime]
    loan_left <- loan_curr - sum_paid
    if (loan_left <= 0) {
      n_loans_rec[trxDate == list_loan_time[i_loantime], nloans := nloans + 1]
      sum_paid <- sum_paid - loan_curr
      print(paste(i_loantime, list_loan_time[i_loantime],
                  n_loans_rec[trxDate == list_loan_time[i_loantime], .(nloans)]))
      # break
    } else {
      break
    }
  }
  print(i)
}
The idea is that for every customer, I make a list of loans, loan dates, and payments. The best case is when the customer's total amount of loans is equal to or less than (due to dirty data) the total amount of payments, i.e. full repayment; then the number of repayments equals the number of loans issued to that customer. The typical case is when the customer makes a partial payment. In that case, I sum the total amount of payments and iterate through each loan the customer made, summing the loan amounts as I go. When the running loan total finally exceeds the amount of payments, I count how many loans have actually been covered by the customer's payments.
The problem is I have millions of customers, and each of them has made loans and payments at least 5 times. Since I am using a nested loop, it takes hours to complete.
So, I am asking here if anyone has ever come across this problem and/or have a better, more efficient solution.
Thanks in advance!
Your logic is quite complicated and with this answer I don't attempt to replicate it fully; my intention is just to give you some ideas on how to optimise.
Also, as mentioned in comments, you could try to parallelise, or maybe use another programming language.
Anyway, as your setup already uses data.table, you can try to apply operations to the full table as much as possible, which will usually be faster than your big loop. For example, something like this.
I first calculate, per customer id, the balance and the sum of payments done:
data_loans_rec <- data_loans_rec[, balance := sum(balanceChange), by = customerID]
data_loans_rec <- data_loans_rec[, sumPayments := sum(balanceChange[TYPE == "payment"]), by = customerID]
With this, you already know that every customer with balance 0 has repaid everything:
data_loans_rec <- data_loans_rec[TYPE == "loan" & balance == 0, repaid := TRUE, by = list(customerID, index)]
These operations of course read a lot of data if you have millions of customers, but I'd say that data.table should handle them pretty quickly.
For the rest of the customers, but only for those registers that are a loan and you don't know yet if they have been repaid, you can use a data.table function.
setRepaid <- function(balanceChange, sumPayments) {
  # note that here you get a vector for all the loans of a customer
  sumPay <- (-1) * sumPayments[1]
  if (sumPay == 0)
    return(rep(FALSE, length(balanceChange)))
  number_of_loans_paid <- 0
  for (i in 1:length(balanceChange)) {
    if (sum(balanceChange[1:i]) > sumPay)
      break
    number_of_loans_paid <- number_of_loans_paid + 1
  }
  return(c(rep(TRUE, number_of_loans_paid),
           rep(FALSE, length(balanceChange) - number_of_loans_paid)))
}
data_loans_rec <- data_loans_rec[TYPE == "loan" & is.na(repaid), repaid := setRepaid(balanceChange, sumPayments), by = list(customerID) ]
With that you get the desired result, at least for your example.
customerID balanceChange trxDate TYPE index balance sumPayments repaid
1: 242105 500 20170605 loan 1 1000 -1000 TRUE
2: 242105 1500 20170605 loan 2 1000 -1000 FALSE
3: 242105 -1000 20170607 payment 3 1000 -1000 NA
4: 242111 500 20170605 loan 1 0 -1000 TRUE
5: 242111 -500 20170606 payment 2 0 -1000 NA
6: 242111 500 20170607 loan 3 0 -1000 TRUE
7: 242111 -500 20170609 payment 4 0 -1000 NA
8: 242151 500 20170605 loan 1 500 0 FALSE
Advantages: the final loop works over far fewer customers, you have some values precalculated, and you rely on data.table instead of your explicit loop. Hopefully this approach gives you an improvement; I think it is worth a try.
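As a possible final step (my sketch, not part of the answer above), the repaid flag can be rolled up into the per-day counts the question asks for; note that dates on which no loans were issued, such as 20170606, simply won't appear:

# Count, per issue date, how many of that day's loans are flagged as repaid.
n_loans_rec <- data_loans_rec[TYPE == "loan",
                              .(nRepayments = sum(repaid, na.rm = TRUE)),
                              by = trxDate][order(trxDate)]
n_loans_rec
#     trxDate nRepayments
# 1: 20170605           2
# 2: 20170607           1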

GameTheory package: Convert data frame of games to Coalition Set

I am looking to explore the GameTheory package from CRAN, but I would appreciate help converting my data (a data frame of unique combinations and results) into the required coalition object. I believe the precursor to this is an ordered list of all coalition values (https://cran.r-project.org/web/packages/GameTheory/vignettes/GameTheory.pdf).
My real data has n ~ 30 'players' and a large number of unique combinations (say 1000), for which I have 1 and 0 identifiers to describe the combinations. The data is sparsely populated in that I do not have data for all combinations, but I will assume combinations not described have zero value. I plan to have one specific 'player' who will appear in all combinations and act as a baseline.
By way of example this is the data frame I am starting with:
require(GameTheory)
games <- read.csv('C:\\Users\\me\\Desktop\\SampleGames.csv', header = TRUE, row.names = 1)
games
n1 n2 n3 n4 Stakes Wins Success_Rate
1 1 1 0 0 800 60 7.50%
2 1 0 1 0 850 45 5.29%
3 1 0 0 1 150000 10 0.01%
4 1 1 1 0 300 25 8.33%
5 1 1 0 1 1800 65 3.61%
6 1 0 1 1 1900 55 2.89%
7 1 1 1 1 700 40 5.71%
8 1 0 0 0 3000000 10 0.00333%
where n1 is my universal player, and in this instance, I have described all combinations.
To calculate my 'base' coalition value from player {1} alone, I am looking to perform the calculation: 0.00333% (success rate) * all stakes, i.e.
0.00333% * (800 + 850 + 150000 + 300 + 1800 + 1900 + 700 + 3000000) = 105
I'll then have zero values for {2}, {3} and {4} as they never "play" alone in this example.
To calculate my first pair coalition value, I am looking to perform the calculation:
7.5%(800 + 300 + 1800 + 700) + 0.00333%(850 + 150000 + 1900 + 3000000) = 375
This is calculated as players {1,2} base win rate (7.5%) by the stakes they feature in, plus player {1} base win rate (0.00333%) by the combinations he features in that player {2} does not - i.e. exclusive sets.
This logic is repeated for the other unique combinations. For example, row 4 is the combination {1,2,3}, so the calculation is:
7.5% * (800 + 1800) + 5.29% * (850 + 1900) + 8.33% * (300 + 700) + 0.00333% * (3000000 + 150000) = 529
Descriptively, that is set {1,2}'s success rate times the stakes for the combinations it appears in that {3} does not, {1,3} where {2} does not feature, {1,2,3} for their joint occurrences, and the base player {1} for the rows where neither {2} nor {3} occurs.
My expected outcome therefore should look like this I believe:
c(105,0,0,0, 375,304,110,0,0,0, 529,283,246,0, 400)
where the first four numbers are the single player combinations {1} {2} {3} and {4}, the next six numbers are two player combinations {1,2} {1,3} {1,4} (and the null cases {2,3} {2,4} {3,4} which don't exist), then the next four are the three player combinations {1,2,3} {1,2,4} {1,3,4} and the null case {2,3,4}, and lastly the full combination set {1,2,3,4}.
I'd then feed this in to the DefineGame function of the package to create my coalitions object.
Appreciate any help: I have tried to be as descriptive as possible. I really don't know where to start on generating the necessary sets and set exclusions.
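For what it's worth, the calculation described above (every coalition containing n1 collects, for each game row, the stakes times the success rate of the exact sub-coalition of it that plays in that row) could be prototyped roughly as below. This is only a sketch, not the GameTheory API; it assumes Success_Rate comes in as text like "7.50%", and rounding of the displayed percentages may make the results differ slightly from the example vector above.

# Assumption: Success_Rate is text like "7.50%"; convert it to a fraction.
players <- c("n1", "n2", "n3", "n4")
rate    <- as.numeric(sub("%", "", games$Success_Rate)) / 100
rowSets <- lapply(seq_len(nrow(games)), function(i) players[games[i, players] == 1])

# All coalitions ordered by size: singletons, pairs, triples, grand coalition.
coalitions <- unlist(lapply(seq_along(players),
                            function(k) combn(players, k, simplify = FALSE)),
                     recursive = FALSE)

value <- sapply(coalitions, function(S) {
  if (!"n1" %in% S) return(0)                       # n1 features in every played combination
  sum(sapply(seq_along(rowSets), function(i) {
    sub <- intersect(S, rowSets[[i]])               # members of S who play in this row
    j   <- which(sapply(rowSets, setequal, sub))    # the row whose line-up is exactly `sub`
    if (length(j) == 0) 0 else games$Stakes[i] * rate[j[1]]
  }))
})
round(value)   # ordered coalition values: {1},{2},{3},{4},{1,2},{1,3},... as described above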

Recursive query with sub-graph aggregation

I am trying to use Neo4j to write a query that aggregates quantities along a particular sub-graph.
We have two stores, Store1 and Store2, one with supplier S1 and the other with supplier S2. We move 100 units from Store1 into Store3 and 200 units from Store2 to Store3.
We then move 100 units from Store3 to Store4. So now Store4 has 100 units, of which approximately 33 originated from supplier S1 and 67 from supplier S2.
I need the query to effectively return this information, E.g.
S1, 33
S2, 66
I have a recursive query to aggregate all the movements along each path
MATCH p=(store1:Store)-[m:MOVE_TO*]->(store2:Store { Name: 'Store4'})
RETURN store1.Supplier, reduce(amount = 0, n IN relationships(p) | amount + n.Quantity) AS reduction
Returns:
| store1.Supplier | reduction|
|-------------------- |-------------|
| S1 | 200 |
| S2 | 300 |
| null | 100 |
Desired:
| store1.Supplier | reduction|
|---------------------|-------------|
| S1 | 33.33 |
| S2 | 66.67 |
What about this one:
MATCH (s:Store) WHERE s.name = 'Store4'
MATCH (s)<-[t:MOVE_TO]-()<-[r:MOVE_TO]-(supp)
WITH t.qty as total, collect(r) as movements
WITH total, movements, reduce(totalSupplier = 0, r IN movements | totalSupplier + r.qty) as supCount
UNWIND movements as movement
RETURN startNode(movement).name as supplier, round(100.0*movement.qty/supCount) as pct
Which returns:
supplier pct
Store1 33
Store2 67
Returned 2 rows in 151 ms
So the following is pretty ugly, but it works for the example you've given.
MATCH (s4:Store { Name:'Store4' })<-[r1:MOVE_TO]-(s3:Store)<-[r2:MOVE_TO*]-(s:Store)
WITH s3, r1.Quantity as Factor, SUM(REDUCE(amount = 0, r IN r2 | amount + r.Quantity)) AS Total
MATCH (s3)<-[r1:MOVE_TO*]-(s:Store)
WITH s.Supplier as Supplier, REDUCE(amount = 0, r IN r1 | amount + r.Quantity) AS Quantity, Factor, Total
RETURN Supplier, Quantity, Total, toFloat(Quantity) / toFloat(Total) * Factor as Proportion
I'm sure it can be improved.

Can I calculate the average of these numbers?

I was wondering if it's possible to calculate the average of some numbers if I have this:
int currentCount = 12;
float currentScore = 6.1123 (scores are on a scale of 1 to 10).
Now, if I receive another score (let's say 4.5), can I recalculate the average so it would be something like:
int currentCount now equals 13
float currentScore now equals ?????
or is this impossible and I still need to remember the list of scores?
The following formulas allow you to track averages just from stored average and count, as you requested.
currentScore = (currentScore * currentCount + newValue) / (currentCount + 1)
currentCount = currentCount + 1
This relies on the fact that your average is currently your sum divided by the count. So you simply multiply count by average to get the sum, add your new value and divide by (count+1), then increase count.
So, let's say you have the data {7,9,11,1,12} and the only thing you're keeping is the average and count. As each number is added, you get:
+--------+-------+----------------------+----------------------+
| Number | Count | Actual average | Calculated average |
+--------+-------+----------------------+----------------------+
| 7 | 1 | (7)/1 = 7 | (0 * 0 + 7) / 1 = 7 |
| 9 | 2 | (7+9)/2 = 8 | (7 * 1 + 9) / 2 = 8 |
| 11 | 3 | (7+9+11)/3 = 9 | (8 * 2 + 11) / 3 = 9 |
| 1 | 4 | (7+9+11+1)/4 = 7 | (9 * 3 + 1) / 4 = 7 |
| 12 | 5 | (7+9+11+1+12)/5 = 8 | (7 * 4 + 12) / 5 = 8 |
+--------+-------+----------------------+----------------------+
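As a quick sanity check (an R sketch of my own, since R is the language used elsewhere on this page), the update formula reproduces the plain mean for the numbers above:

# Running-average update: new_avg = (avg * count + x) / (count + 1)
update_avg <- function(avg, count, x) (avg * count + x) / (count + 1)

scores <- c(7, 9, 11, 1, 12)
avg <- 0; count <- 0
for (x in scores) {
  avg   <- update_avg(avg, count, x)
  count <- count + 1
}
avg            # 8
mean(scores)   # 8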
I like to store the sum and the count. It avoids an extra multiply each time.
current_sum += input;
current_count++;
current_average = current_sum/current_count;
It's quite easy really, when you look at the formula for the average: (A1 + A2 + ... + AN) / N. Now, if you have the old average and N (the count of numbers), you can easily calculate the new average:
newScore = (currentScore * currentCount + someNewValue)/(currentCount + 1)
You can store currentCount and sumScore and you calculate sumScore/currentCount.
or... if you want to be silly, you can do it in one line :
current_average = (current_sum = current_sum + newValue) / ++current_count;
:)
float currentScore now equals (currentScore * (currentCount-1) + 4.5)/currentCount ?
