Can I calculate the average of these numbers? - math

I was wondering if it's possible to calculate the average of some numbers if I have this:
int currentCount = 12;
float currentScore = 6.1123f; (scores are on a scale of 1 to 10)
Now, if I receive another score (let's say 4.5), can I recalculate the average so it would be something like:
int currentCount now equals 13
float currentScore now equals ?????
or is this impossible and I still need to remember the list of scores?

The following formulas allow you to track averages just from stored average and count, as you requested.
currentScore = (currentScore * currentCount + newValue) / (currentCount + 1)
currentCount = currentCount + 1
This relies on the fact that your average is currently your sum divided by the count. So you simply multiply count by average to get the sum, add your new value and divide by (count+1), then increase count.
So, let's say you have the data {7,9,11,1,12} and the only thing you're keeping is the average and count. As each number is added, you get:
+--------+-------+---------------------+----------------------+
| Number | Count | Actual average      | Calculated average   |
+--------+-------+---------------------+----------------------+
| 7      | 1     | (7)/1 = 7           | (0 * 0 + 7) / 1 = 7  |
| 9      | 2     | (7+9)/2 = 8         | (7 * 1 + 9) / 2 = 8  |
| 11     | 3     | (7+9+11)/3 = 9      | (8 * 2 + 11) / 3 = 9 |
| 1      | 4     | (7+9+11+1)/4 = 7    | (9 * 3 + 1) / 4 = 7  |
| 12     | 5     | (7+9+11+1+12)/5 = 8 | (7 * 4 + 12) / 5 = 8 |
+--------+-------+---------------------+----------------------+
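In code, the same update might look like this minimal sketch (RunningAverage is a hypothetical helper class, not from the original answer):

// Minimal sketch of the incremental-average update described above.
class RunningAverage {
    private int count = 0;
    private float average = 0f;

    void add(float newValue) {
        // average currently equals sum / count, so multiply back to the sum,
        // fold in the new value, and divide by the new count.
        average = (average * count + newValue) / (count + 1);
        count++;
    }

    float getAverage() { return average; }
    int getCount() { return count; }
}

Feeding it 7, 9, 11, 1, 12 in order reproduces the "Calculated average" column in the table above.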

I like to store the sum and the count. It avoids an extra multiply each time.
current_sum += input;
current_count++;
current_average = current_sum/current_count;

It's quite easy, really, when you look at the formula for the average: (A1 + A2 + ... + AN) / N. Now, if you have the old average and N (the count of numbers), you can easily calculate the new average:
newScore = (currentScore * currentCount + someNewValue)/(currentCount + 1)

You can store currentCount and sumScore, and calculate sumScore / currentCount whenever you need the average.

Or... if you want to be silly, you can do it in one line:
current_average = (current_sum = current_sum + newValue) / ++current_count;
:)

float currentScore now equals (currentScore * (currentCount-1) + 4.5)/currentCount ?

Related

Kusto: Apply function on multiple column values during bag_unpack

Given a dynamic field, say, milestones, with a value like {"ta": 1655859586546, "tb": 1655859586646}:
How do I print a table with columns like "ta", "tb", etc., with a single row containing unixtime_milliseconds_todatetime(tolong(taValue)), unixtime_milliseconds_todatetime(tolong(tbValue)), etc.?
I figured that I'll need to write a function that I can call, so I created this:
let f = view(a:string) {
    unixtime_milliseconds_todatetime(tolong(a))
};
I can use this function with a normal column as: project f(columnName).
However, in this case it's a dynamic field, and the number of items in the list is large, so I do not want to enter the fields manually. This is what I have so far.
log_table
| take 1
| evaluate bag_unpack(milestones, "m_") // This gives me fields as columns
// | project-keep m_* // This would work, if I just wanted the value, however, I want `view(columnValue)
| project-keep f(m_*) // This of course doesn't work, but explains the idea.
Based on the mv-apply operator
// Generate data sample. Not part of the solution.
let log_table = materialize(range record_id from 1 to 10 step 1 | mv-apply range(1, 1 + rand(5), 1) on (summarize milestones = make_bag(pack_dictionary(strcat("t", make_string(to_utf8("a")[0] + toint(rand(26)))), 1600000000000 + rand(60000000000)))));
// Solution Starts here.
log_table
| mv-apply kv = milestones on
  (
      extend k = tostring(bag_keys(kv)[0])
    | extend v = unixtime_milliseconds_todatetime(tolong(kv[k]))
    | summarize milestones = make_bag(pack_dictionary(k, v))
  )
| evaluate bag_unpack(milestones)
Sample output (flattened here): one row per record_id (1 through 10), with one column per unpacked key (ta, tb, tc, ..., tz); each populated cell holds the converted datetime for that key, e.g. 2021-07-06T20:24:47.767Z.

SQLite find table row where a subset of columns satisfies a specified constraint

I have the following SQLite table
CREATE TABLE visits(urid INTEGER PRIMARY KEY AUTOINCREMENT,
                    hash TEXT, dX INTEGER, dY INTEGER, dZ INTEGER);
Typical content would be
# select * from visits;
 urid | hash   | dX | dY | dZ
------+--------+----+----+----
    1 | 'abcd' | 10 | 10 | 10
    2 | 'abcd' | 11 | 11 | 11
    3 | 'bcde' |  7 |  7 |  7
    4 | 'abcd' | 13 | 13 | 13
    5 | 'defg' | 20 | 21 | 17
What I need to do here is identify the urid for the table row which satisfies the constraint
hash = 'abcd' AND nearby >= (abs(dX - tX) + abs(dY - tY) + abs(dZ - tZ))
with the smallest deviation, i.e. the smallest sum of absolute distances.
In the present instance with
nearby = 7
tX = tY = tZ = 12
there are three rows that meet the above constraint but with different deviations
 urid | hash   | dX | dY | dZ | deviation
------+--------+----+----+----+-----------
    1 | 'abcd' | 10 | 10 | 10 | 6
    2 | 'abcd' | 11 | 11 | 11 | 3
    4 | 'abcd' | 13 | 13 | 13 | 3
in which case I would like to have reported urid = 2 or urid = 4 - I don't actually care which one gets reported.
Left to my own devices I would fetch the full set of matching rows and then drill down to the one that matches my secondary constraint (smallest deviation) in my own Java code. However, I suspect that this is not necessary and that it can be done in SQL alone. My knowledge of SQL is sadly too limited here; I hope that someone can put me on the right path.
I now have managed to do the following
CREATE TEMP TABLE h1(urid INTEGER, devi INTEGER);
INSERT INTO h1
SELECT urid, (abs(dX - 12) + abs(dY - 12) + abs(dZ - 12)) AS devi
FROM visits WHERE hash = 'abcd';
which gives
-- SELECT * FROM h1;
 urid | devi
------+------
    1 |    6
    2 |    3
    4 |    3
following which I issue
select urid from h1 order by devi asc limit 1;
which yields urid = 2, the result I am after. Whilst this works, I would like to know if there is a better/simpler way of doing this.
You're so close! You have all of the components you need, you just have to put them together into a single query.
Consider:
SELECT urid
     , (abs(dX - :tx) + abs(dY - :ty) + abs(dZ - :tz)) AS devi
  FROM visits
 WHERE hash = :hashval AND devi < :nearby
 ORDER BY devi
 LIMIT 1
Line by line: first you list the columns and computed values you want (:tx, :ty, :tz and the rest are placeholders; in your code you prepare a statement and then bind values to the placeholders before executing it) from the visits table.
Then in the WHERE clause you restrict what rows get returned to those matching the particular hash (That column should have an index for best results... CREATE INDEX visits_idx_hash ON visits(hash) for example), and that have a devi that is less than the value of the :nearby placeholder. (I think devi < :nearby is clearer than :nearby >= devi).
Then you say that you want those results sorted in increasing order according to devi, and LIMIT the returned results to a single row because you don't care about any others (If there are no rows that meet the WHERE constraints, nothing is returned).
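Since the question mentions doing this from Java, here is a minimal sketch of preparing and binding that statement. It assumes the sqlite-jdbc driver and a database file named visits.db (both assumptions, not from the question), and it uses positional ? placeholders because standard JDBC does not support named ones; the bound values are the example numbers from the question.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class NearestVisit {
    public static void main(String[] args) throws SQLException {
        String sql = "SELECT urid, (abs(dX - ?) + abs(dY - ?) + abs(dZ - ?)) AS devi "
                   + "FROM visits WHERE hash = ? AND devi < ? ORDER BY devi LIMIT 1";
        // visits.db is an assumed file name; point this at wherever your database lives.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:visits.db");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, 12);        // tX
            ps.setInt(2, 12);        // tY
            ps.setInt(3, 12);        // tZ
            ps.setString(4, "abcd"); // hash
            ps.setInt(5, 7);         // nearby
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    System.out.println("urid = " + rs.getInt("urid")
                            + ", deviation = " + rs.getInt("devi"));
                }
            }
        }
    }
}

With the sample data above this prints urid = 2 (or 4), deviation = 3.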

Recursive query with sub-graph aggregation

I am trying to use Neo4j to write a query that aggregates quantities along a particular sub-graph.
We have two stores, Store1 and Store2, one supplied by S1 and the other by S2. We move 100 units from Store1 into Store3 and 200 units from Store2 to Store3.
We then move 100 units from Store3 to Store4. So now Store4 has 100 units, of which roughly 33 originated from supplier S1 and 67 from supplier S2.
I need the query to effectively return this information, e.g.
S1, 33
S2, 66
I have a recursive query to aggregate all the movements along each path
MATCH p=(store1:Store)-[m:MOVE_TO*]->(store2:Store { Name: 'Store4'})
RETURN store1.Supplier, reduce(amount = 0, n IN relationships(p) | amount + n.Quantity) AS reduction
Returns:
| store1.Supplier | reduction |
|-----------------|-----------|
| S1              | 200       |
| S2              | 300       |
| null            | 100       |
Desired:
| store1.Supplier | reduction |
|-----------------|-----------|
| S1              | 33.33     |
| S2              | 66.67     |
What about this one:
MATCH (s:Store) WHERE s.name = 'Store4'
MATCH (s)<-[t:MOVE_TO]-()<-[r:MOVE_TO]-(supp)
WITH t.qty as total, collect(r) as movements
WITH total, movements, reduce(totalSupplier = 0, r IN movements | totalSupplier + r.qty) as supCount
UNWIND movements as movement
RETURN startNode(movement).name as supplier, round(100.0*movement.qty/supCount) as pct
Which returns:
| supplier | pct |
|----------|-----|
| Store1   | 33  |
| Store2   | 67  |
Returned 2 rows in 151 ms
So the following is pretty ugly, but it works for the example you've given.
MATCH (s4:Store { Name:'Store4' })<-[r1:MOVE_TO]-(s3:Store)<-[r2:MOVE_TO*]-(s:Store)
WITH s3, r1.Quantity as Factor, SUM(REDUCE(amount = 0, r IN r2 | amount + r.Quantity)) AS Total
MATCH (s3)<-[r1:MOVE_TO*]-(s:Store)
WITH s.Supplier as Supplier, REDUCE(amount = 0, r IN r1 | amount + r.Quantity) AS Quantity, Factor, Total
RETURN Supplier, Quantity, Total, toFloat(Quantity) / toFloat(Total) * Factor as Proportion
I'm sure it can be improved.

Dealing with conditionals in a better manner than deeply nested ifelse blocks

I'm trying to write some code to analyze my company's insurance plan offerings... but they're complicated! The PPO plan is straightforward, but the high deductible health plans are complicated, as they introduced a "split" deductible and out of pocket maximum (individual and total) for the family plans. It works like this:
Once the individual meets the individual deductible, he/she is covered at 90%
Once the remaining 1+ individuals on the plan meet the total deductible, the entire family is covered at 90%
The individual cannot satisfy the family deductible with only their medical expenses
I want to feed in a vector of expenses for my family members (there are four of them) and output the total cost for each plan. Below is a table of possible scenarios, with the following column codes:
ded_ind: did one individual meet the individual deductible?
ded_tot: was the total deductible reached?
oop_ind: was the individual out of pocket max reached
oop_tot: was the total out of pocket max reached?
exp_ind = the expenses of the highest spender
exp_rem = the expenses of the remaining /other/ family members (not the highest spender)
oop_max_ind = the level of expenses at which the individual has paid their out of pocket maximum (the point where ded_ind + 0.1 * (exp_ind - ded_ind) equals the individual out of pocket max)
oop_max_rem = same as for the individual, but for the remaining family members
The table:
| ded_ind | oop_ind | ded_rem | oop_rem | formula                                                                    |
|---------+---------+---------+---------+----------------------------------------------------------------------------|
|    0    |    0    |    0    |    0    | exp_ind + exp_rem                                                          |
|    1    |    0    |    0    |    0    | ded_ind + 0.1 * (exp_ind - ded_ind) + exp_rem                              |
|    0    |    0    |    1    |    0    | exp_ind + ded_rem + 0.1 * (exp_rem - ded_rem)                              |
|    1    |    1    |    0    |    0    | oop_max_ind + exp_rem                                                      |
|    1    |    0    |    1    |    0    | ded_ind + 0.1 * (exp_ind - ded_ind) + ded_rem + 0.1 * (exp_rem - ded_rem)  |
|    0    |    0    |    1    |    1    | oop_max_rem + exp_ind                                                      |
|    1    |    0    |    1    |    1    | ded_ind + 0.1 * (exp_ind - ded_ind) + oop_max_rem                          |
|    1    |    1    |    1    |    0    | oop_max_ind + ded_rem + 0.1 * (exp_rem - ded_rem)                          |
|    1    |    1    |    1    |    1    | oop_max_ind + oop_max_rem                                                  |
Omitted: 0 1 0 0, 0 0 0 1, 0 1 1 0, and 0 1 0 1 are not present, as oop_ind and oop_rem could not have been met if ded_ind and ded_rem, respectively, have not been met.
My current code is a somewhat massive nested ifelse structure like so (not the actual code, but what it does):
check if plan is ppo or hsa
if hsa plan
    if exp_ind + exp_rem < ded_rem        # didn't meet family deductible
        if exp_ind < ded_ind              # individual deductible also not met
            cost = exp_ind + exp_rem
        else if exp_ind < oop_max_ind     # ded_ind met, but oop_max_ind not reached
            cost = ded_ind + 0.1 * (exp_ind - ded_ind) + exp_rem
        else                              # reached oop_max_ind
            cost = oop_max_ind + exp_rem
    else ...
After the else, the total is greater than the family deductible. I check to see if it was contributed by more than two people and just continue on like that.
My question, now that I've given some background to the problem: Is there a better way to manage conditional situations like this than ifelse loops to filter them down a bit at a time?
The code ends up seeming redundant, as one checks for some higher level conditions (consider the table where ded_rem is met or not met... one still has to check for ded_ind and oop_max_ind in both cases, and the code is the same... just positioned at two different places in the ifelse structure).
Could this be done with some sort of matrix operation? Are there other examples online of more clever ways to deal with filtering of conditions?
Many thanks for any suggestions.
P.S. I'm using R and will be creating an interactive with shiny so that other employees can input best and worst case scenarios for each of their family members and see which plan comes out ahead via a dot or bar chart.
The suggestion to convert to a binary value based on the result gave me an idea, which also helped me learn that one can do vectorized TRUE / FALSE checks (I guess that was probably obvious to many).
Here's my current idea:
expenses will be a vector of individual forecast medical expenses for the year (example of three people):
expenses <- c(1500, 100, 400)
We set exp_ind to the max value, and sum the rest for exp_rem
exp_ind <- max(expenses)
# [1] index of which() for cases with multiple max values
exp_rem <- sum(expenses[-which(expenses == exp_ind)[1]])
For any given plan, I can set up a vector with the cutoffs, for example:
individual deductible = 1000
individual out of pocket max = 2000 (need to incur 11k of expenses to get there)
family deductible = 2000
family out of pocket max = 4000 (need to incur 22k of expenses to get there)
Set those values:
ded_ind <- 1000
oop_max_ind <- 11000
ded_tot <- 2000
oop_max_tot <- 22000
cutoffs <- c(ded_ind, oop_max_ind, ded_tot, oop_max_tot)
Now we can check the input expense against the cutoffs:
result <- as.numeric(rep(c(exp_ind, exp_rem), each = 2) > cutoffs)
Last, convert to binary:
result_bin <- sum(2^(seq_along(result) - 1) * result)
Now I can set up functions for the possible outcomes based on the value in result_bin:
if(result_bin == 1) {cost <- ded_ind + 0.1 * (exp_ind - ded_ind) + exp_rem }
cost
[1] 1550
We can check this...
High spender would have paid his 1000 and then 10% of remaining 500 = 1050
Other members did not reach the family deductible and paid the full 400 + 100 = 500
Total: 1550
I still need to create a mapping of result_bin values to corresponding functions, but doing a vectorized check and converting it to a unique binary value is much, much better, in my opinion, than my nested ifelse mess.
I look at it like this: I'd have had to set the variables and write the functions anyway; this saves me 1) explicitly writing all the conditions, 2) the redundancy issue I was talking about, in which one ends up writing identical "sibling" branches at different points of the ifelse structure, and 3) the code is far, far more easily followed.
Since this question is not very specific, here is a simpler example/answer:
# example data
test <- expand.grid(opt1 = 0:1, opt2 = 0:1)
# create a unique identifier to represent the binary variables
test$code <- with(test, paste(opt1, opt2, sep = ""))
# create an input variable to be used in functions
test$var1 <- 1:4
#   opt1 opt2 code var1
# 1    0    0   00    1
# 2    1    0   10    2
# 3    0    1   01    3
# 4    1    1   11    4
Respective functions to apply depending on binary conditions, along with intended results for each combo:
var1 + 10 #code 00 - intended result = 11
var1 + 100 #code 10 - intended result = 102
var1 + 1000 #code 01 - intended result = 1003
var1 + var1 #code 11 - intended result = 8
Use ifelse combinations to do the calculations:
test$result <- with(test,
    ifelse(code == "00", var1 + 10,
    ifelse(code == "10", var1 + 100,
    ifelse(code == "01", var1 + 1000,
    ifelse(code == "11", var1 + var1,
    NA)))))
Result:
  opt1 opt2 code var1 result
1    0    0   00    1     11
2    1    0   10    2    102
3    0    1   01    3   1003
4    1    1   11    4      8

How to calculate row / col from grid position?

Given a grid where I know the number of rows (which is fixed), and I know the current count of columns (which can grow arbitrarily), how do I calculate the row and column of a square from its index?
              Cols --->
           |  0 |  1 |  2 |  3 | ...
       ----+----+----+----+----+----
         0 |  0 |  3 |  6 |  9 | ...
       ----+----+----+----+----+----
  Rows   1 |  1 |  4 |  7 |  A | ...
       ----+----+----+----+----+----
         2 |  2 |  5 |  8 |  B | ...
       ----+----+----+----+----+----
         .    .    .    .    .  ...
So, given:
final int mRowCount = /* something */;
int mColCount;
And given some function:
private void func(int index) {
int row = index % mRowCount;
int col = ???
How do I correctly calculate col? It must be a function of both the number of columns and rows, I'm pretty sure. But my brain is failing me.
Sample: If index == 4, then row = 1, col = 1. If index == 2 then row = 2, col = 0.
Thanks.
int col = index / mRowCount;
I didn't really understand your setup, but if you have a normal grid with a progressive index like in the Android GridLayout:
+-------------------+
| 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|
| 10| 11| 12| 13| 14|
|---|---|---|---|---|
| 15| 16| 17| 18| 19|
+-------------------+
The calculation is:
int col = index % colCount;
int row = index / colCount;
For example:
row of index 6 = 6 / 5 = 1
column of index 12 = 12 % 5 = 2
index = col * mRowCount + row
then
row = index % mRowCount;
col = index / mRowCount;
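A quick sanity check of those formulas against the sample values in the question (a small standalone sketch, not from the original answers):

// Column-major layout, as in the question's diagram: index = col * rowCount + row.
public class GridIndex {
    static final int ROW_COUNT = 3; // fixed number of rows, taken from the diagram above

    static int rowOf(int index) { return index % ROW_COUNT; }
    static int colOf(int index) { return index / ROW_COUNT; }

    public static void main(String[] args) {
        System.out.println(rowOf(4) + ", " + colOf(4)); // prints "1, 1"
        System.out.println(rowOf(2) + ", " + colOf(2)); // prints "2, 0"
    }
}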
I believe the column would be obtained by integer division:
int col = index / mRowCount;
It would be possible to limit it to a single division (eliminate the modulus operation) by replacing it with a multiplication and subtraction. I'm not sure if that is less costly; probably wouldn't matter in most situations:
int col = index / mRowCount;
int row = index - col * mRowCount;
These formulas assume 1-based row, column, and index numbering:
row = CEILING(index / numberOfColumns)
(CEILING rounds up to the next integer)
col = MOD(index, numberOfColumns)
There is one exception you must take into account: when MOD = 0, i.e. when the col result is zero, set col = numberOfColumns.
(For example, say numberOfColumns = 48; then MOD(96, 48) is zero, and so is MOD(48, 48), because anything divisible by 48 gives MOD = 0. In other words, when MOD = 0 you are in the last, highest column, i.e. column numberOfColumns.)
column = index / max_rows;
row = index % max_rows;
row = index / numberOfColumns
and
column = index % numberOfColumns

Resources