How to build a distribution of percentages in U-SQL?

In my database I have the column "Study hours per week". I want to build a distribution using U-SQL which groups the % of students into each 'student hour' bucket. Is there a built-in function to help me achieve this?
Essentially, I want to populate the right side of this table:
Study Hours per week | % of students
<= 1   |
<= 5   |
<= 10  |
<= 20  |
<= 40  |
<= 100 |
Example: If we had 10 unique students with the following study hours/week: [5, 6, 10, 9, 2, 25, 18, 5, 12, 1] the resulting output should be:
Study Hours per week | % of students
<= 1 | 10%
<= 5 | 40%
<= 10 | 70%
<= 20 | 90%
<= 40 | 100%
<= 100 | 100%
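One way to compute the cumulative percentages is to cross-join the student rowset with a small rowset of bucket limits and count conditionally. A minimal sketch, not tested, assuming a @students rowset with an int StudyHours column (all names here are illustrative):

@buckets =
    SELECT * FROM (VALUES (1), (5), (10), (20), (40), (100)) AS B(Bucket);

@dist =
    SELECT b.Bucket,
           // each student contributes one row per bucket, so COUNT(*) per
           // bucket is the total student count and the conditional SUM
           // counts those at or under the bucket limit
           100.0 * SUM(s.StudyHours <= b.Bucket ? 1 : 0) / COUNT(*) AS PctOfStudents
    FROM @students AS s
         CROSS JOIN @buckets AS b
    GROUP BY b.Bucket;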

Related

Find the size/level of a node from its position within a quadtree

I have a quadtree. The root node (level 0) is positioned at 0,0 by its centre. It has a width of 16, so its corners are at -8,-8 and 8,8. Since it's a quadtree, the root contains four children, each of which contain four children of their own and so on. The deepest level is level 3 at width 2. Here's a dodgy Paint drawing to better illustrate what I mean:
The large numbers indicate the centre position and width of each node. The small numbers around the sides indicate positions.
Given a valid position, how can I figure out what level or size of node exists at that position? It seems like this should be obvious but I can't get my head around the maths. I see the patterns in the diagram but I can't seem to translate it into code.
For instance, the node at position 1,1 is size 2/level 3. Position 4,6 is invalid because it's between nodes. Position -6,-2 is size 4/level 2.
Additional Notes:
Positions are addresses. They are exact and not worldspace, which is why it's possible to have an invalid address.
In practice the root node size could be as large as 4096, or even larger.
Observe that the coordinate values for the centre of each node are always +/- odd multiples of a power-of-2, the latter being related to the node size:
Node size | Allowed centre coordinates | Factor
-----------------------------------------------------
2 | 1, 3, 5, 7, 9, 11 ... | 1 = 2/2
-----------------------------------------------------
4 | 2, 6, 10, 14, 18, 22 ... | 2 = 4/2
-----------------------------------------------------
8 | 4, 12, 20, 28, 36, 44 ... | 4 = 8/2
-----------------------------------------------------
16 | 8, 24, 40, 56, 72, 88 ... | 8 = 16/2
The root node is a special case since it is always centred on 0,0, but in a larger quad-tree the 16x16 nodes would follow the same pattern.
Crucially, both X,Y values of the coordinate must share the same power-of-2 factor. This means that the binary representations of their absolute values must have the same number of trailing zeros. For your examples:
Example | Binary | Zeros | Valid
----------------------------------
X = 1 | 000001 | 0 | Y
Y = 1 | 000001 | 0 | S = 2
----------------------------------
X = 4 | 000100 | 2 | N
Y = 6 | 000110 | 1 |
----------------------------------
X =-6 | 000110 | 1 | Y
Y =-2 | 000010 | 1 | S = 4
Expressions for the desired results:
Size (S) = 2 ^ (Zeros + 1)
Level = [Maximum Level] - Zeros
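These expressions translate directly into code. A minimal sketch in C, assuming integer addresses inside the root's extent (names are illustrative):

#include <stdio.h>
#include <stdlib.h>

/* Trailing zeros in the binary representation of |v| (v must be nonzero). */
static int trailing_zeros(int v)
{
    int n = abs(v), z = 0;
    while ((n & 1) == 0) { n >>= 1; z++; }
    return z;
}

/* Level of the node centred at (x, y), or -1 for an invalid address.
 * max_level is the deepest level (3 in the 16-wide example above). */
static int level_at(int x, int y, int max_level)
{
    if (x == 0 && y == 0) return 0;               /* root centre */
    if (x == 0 || y == 0) return -1;              /* on the root's centre lines */
    int zx = trailing_zeros(x), zy = trailing_zeros(y);
    if (zx != zy || zx >= max_level) return -1;   /* factors differ: between nodes */
    return max_level - zx;                        /* size = 2^(zx + 1) */
}

int main(void)
{
    printf("%d\n", level_at( 1,  1, 3));   /* 3  (size 2)  */
    printf("%d\n", level_at( 4,  6, 3));   /* -1 (invalid) */
    printf("%d\n", level_at(-6, -2, 3));   /* 2  (size 4)  */
    return 0;
}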

Kusto - Group by duration value to show numbers

I use the below query to calculate the time difference between 2 events, but I am not sure how to group the durations. I tried the case() function but it does not seem to work. Is there a way to group the durations? For example, a pie or column chart showing the number of items with durations of more than 2 hours, more than 5 hours, and more than 10 hours. Thanks
| where EventName in ('Handligrequest', 'Requestcomplete')
| summarize Time_diff = anyif(Timestamp, EventName == 'Requestcomplete') - anyif(Timestamp, EventName == 'Handligrequest') by CorrelationId
| where isnotnull(Time_diff)
| extend Duration = format_timespan(Time_diff, 's')
| sort by Duration desc
// Generate data sample. Not part of the solution.
let t = materialize (range i from 1 to 1000 step 1 | extend Time_diff = 24h*rand());
// Solution starts here.
t
| summarize count() by time_diff_range = case(
      Time_diff >= 10h, "10h <= x",
      Time_diff >= 5h,  "5h <= x < 10h",
      Time_diff >= 2h,  "2h <= x < 5h",
                        "x < 2h")
| render piechart
time_diff_range | count_
--------------- | ------
10h <= x        | 590
5h <= x < 10h   | 209
x < 2h          | 89
2h <= x < 5h    | 112
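Plugged into the original query, the same bucketing would look something like this (the source table name Events is a placeholder; event names are taken from the question):

Events
| where EventName in ('Handligrequest', 'Requestcomplete')
| summarize Time_diff = anyif(Timestamp, EventName == 'Requestcomplete') - anyif(Timestamp, EventName == 'Handligrequest') by CorrelationId
| where isnotnull(Time_diff)
| summarize count() by time_diff_range = case(
      Time_diff >= 10h, "10h <= x",
      Time_diff >= 5h,  "5h <= x < 10h",
      Time_diff >= 2h,  "2h <= x < 5h",
                        "x < 2h")
| render piechart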

Calculating Survival rate from month to month without losing starting values

I have a set of code that divides the number of living specimens by the initial count.
I am trying to determine the survival rate for the entire 5-month experiment, but there seems to be an issue with the computation each month. For the initial month, the code computes the correct survival rate (i.e. 48/50 = 96%). The issue comes in when computing the next month, where the code computes the survival rate from 48 instead of 50 (i.e. 46/48 survived, instead of 46/50, which is what I need). It continues this way for the remainder of the experiment (30/46 for month 3, then 20/30 for month 4).
Additionally, each of the "dead" specimens is then added to an NA group automatically (there should be no NA groups). I think if the first issue is taken care of then the NA issue won't happen. Is there a way to fix this with the code I have, or do I need to rearrange the data in Excel?
I have 2 species in 4 habitats that need this code for analysis.
Thanks!
Month 1
| Species | Cage | nStart | nAlive | PropAlive |
|---------|------|--------|--------|-----------|
| X       | 1    | 10     | 9      | .9        |
| Y       | 2    | 10     | 8      | .8        |
Month 2
| Species | Cage | nStart | nAlive | PropAlive (nAlive/nStart) |
|---------|------|--------|--------|---------------------------|
| X       | 1    | 9      | 8      | .89                       |
| Y       | 2    | 8      | 7      | .875                      |
Month 2 should be 8/10 and 7/10 for PropAlive, not 8/9 and 7/8.
library(readxl)
library(tidyverse)
library(lme4)
library(car)
library(emmeans)
JulyData <- read_excel("~/R/Cage Data Final 2016 EMV 1.20.xlsx", sheet = "7.1.2016")
str(JulyData)
summary(JulyData$Lice)
AllCages <- distinct(JulyData, Cage, Species)
AllCages$nStart <- rep(10, nrow(AllCages))
Alive <- JulyData %>%
  filter(!is.na(Lice)) %>%
  group_by(Cage, Species) %>%
  summarise(nAlive = n())
CleanData <- merge(AllCages, Alive, all = TRUE)
CleanData$nAlive[is.na(CleanData$nAlive)] <- 0
CleanData$nAlive[CleanData$nAlive > 10] <- 10
CleanData <- CleanData %>%
  separate(Cage, c("Habitat", "Rep"), 1, remove = FALSE) %>%
  mutate(nDead = nStart - nAlive)
CleanData
CleanData %>%
  group_by(Species, Habitat) %>%
  summarize(nStart = sum(nStart),
            nAlive = sum(nAlive),
            PropAlive = nAlive / nStart)
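For the symptom described in the question (each month's rate computed against the previous month's survivors), the general fix is to carry the original starting count through the whole experiment and always divide by it. A minimal sketch with made-up numbers, not the asker's data:

# Keep the month-1 starting count as the denominator for every later month.
library(dplyr)
monthly <- tibble(Month = 1:4, nAlive = c(48, 46, 30, 20))
nStart0 <- 50  # initial specimen count, fixed for the whole experiment
monthly %>%
  mutate(PropAlive = nAlive / nStart0)
# PropAlive: 0.96, 0.92, 0.60, 0.40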
So, the issue was related to formatting within the data. The code is right - a simple labeling error was the issue.

Dealing with conditionals in a better manner than deeply nested ifelse blocks

I'm trying to write some code to analyze my company's insurance plan offerings... but they're complicated! The PPO plan is straightforward, but the high deductible health plans are complicated, as they introduced a "split" deductible and out of pocket maximum (individual and total) for the family plans. It works like this:
Once the individual meets the individual deductible, he/she is covered at 90%
Once the remaining 1+ individuals on the plan meet the total deductible, the entire family is covered at 90%
The individual cannot satisfy the family deductible with only their medical expenses
I want to feed in a vector of expenses for my family members (there are four of them) and output the total cost for each plan. Below is a table of possible scenarios, with the following column codes:
ded_ind: did one individual meet the individual deductible?
ded_rem: was the remaining (family) deductible reached?
oop_ind: was the individual out-of-pocket max reached?
oop_rem: was the remaining (family) out-of-pocket max reached?
exp_ind: the expenses of the highest spender
exp_rem: the expenses of the remaining (other) family members (not the highest spender)
oop_max_ind: the level of expenses at which the individual has paid their out-of-pocket maximum (when ded_ind + 0.1 * (exp_ind - ded_ind) equals the individual out-of-pocket max)
oop_max_rem: same as for the individual, but for the remaining family members
The table:
| ded_ind | oop_ind | ded_rem | oop_rem | formula                                                                   |
|---------+---------+---------+---------+---------------------------------------------------------------------------|
| 0       | 0       | 0       | 0       | exp_ind + exp_rem                                                         |
| 1       | 0       | 0       | 0       | ded_ind + 0.1 * (exp_ind - ded_ind) + exp_rem                             |
| 0       | 0       | 1       | 0       | exp_ind + ded_rem + 0.1 * (exp_rem - ded_rem)                             |
| 1       | 1       | 0       | 0       | oop_max_ind + exp_rem                                                     |
| 1       | 0       | 1       | 0       | ded_ind + 0.1 * (exp_ind - ded_ind) + ded_rem + 0.1 * (exp_rem - ded_rem) |
| 0       | 0       | 1       | 1       | oop_max_rem + exp_ind                                                     |
| 1       | 0       | 1       | 1       | ded_ind + 0.1 * (exp_ind - ded_ind) + oop_max_rem                         |
| 1       | 1       | 1       | 0       | oop_max_ind + ded_rem + 0.1 * (exp_rem - ded_rem)                         |
| 1       | 1       | 1       | 1       | oop_max_ind + oop_max_rem                                                 |
Omitted: 0 1 0 0, 0 0 0 1, 0 1 1 0, and 0 1 0 1 are not present, as oop_ind and oop_rem could not have been met if ded_ind and ded_rem, respectively, have not been met.
My current code is a somewhat massive nested ifelse structure like so (not the actual code, but what it does):
check if plan is ppo or hsa
if hsa plan
  if exp_ind + exp_rem < ded_rem    # didn't meet family deductible
    if exp_ind < ded_ind            # individual deductible also not met
      cost = exp_ind + exp_rem
    else if exp_ind < oop_max_ind   # ded_ind met; did exp_ind reach oop_max_ind?
      cost = ded_ind + 0.1 * (exp_ind - ded_ind) + exp_rem   # didn't reach oop_max_ind
    else
      cost = oop_max_ind + exp_rem  # reached oop_max_ind
  else ...
After the else, the total is greater than the family deductible. I check to see if it was contributed by more than two people and just continue on like that.
My question, now that I've given some background to the problem: is there a better way to manage conditional situations like this than nested ifelse structures that filter them down a bit at a time?
The code ends up seeming redundant: one checks for some higher-level conditions, yet (consider the table where ded_rem is met or not met) one still has to check for ded_ind and oop_max_ind in both cases, and the code is the same, just positioned at two different places in the ifelse structure.
Could this be done with some sort of matrix operation? Are there other examples online of more clever ways to deal with filtering of conditions?
Many thanks for any suggestions.
P.S. I'm using R and will be creating an interactive app with Shiny so that other employees can input best- and worst-case scenarios for each of their family members and see which plan comes out ahead via a dot or bar chart.
The suggestion to convert to a binary value based on the result gave me an idea, which also helped me learn that one can do vectorized TRUE / FALSE checks (I guess that was probably obvious to many).
Here's my current idea:
expenses will be a vector of individual forecast medical expenses for the year (example of three people):
expenses <- c(1500, 100, 400)
We set exp_ind to the max value, and sum the rest for exp_rem
exp_ind <- max(expenses)
# take only the first index from which() in case of ties for the max
exp_rem <- sum(expenses[-which(expenses == exp_ind)[1]])
For any given plan, I can set up a vector with the cutoffs, for example:
individual deductible = 1000
individual out of pocket max = 2000 (need to incur 11k of expenses to get there)
family deductible = 2000
family out of pocket max = 4000 (need to incur 22k of expenses to get there)
Set those values:
ded_ind <- 1000
oop_max_ind <- 11000
ded_tot <- 2000
oop_max_tot <- 22000
cutoffs <- c(ded_ind, oop_max_ind, ded_tot, oop_max_tot)
Now we can check the input expense against the cutoffs:
result <- as.numeric(rep(c(exp_ind, exp_rem), each = 2) > cutoffs)
Last, convert to binary:
result_bin <- sum(2^(seq_along(result) - 1) * result)
Now I can set up functions for the possible outcomes based on the value in result_bin:
if(result_bin == 1) {cost <- ded_ind + 0.1 * (exp_ind - ded_ind) + exp_rem }
cost
[1] 1550
We can check this...
High spender would have paid his 1000 and then 10% of remaining 500 = 1050
Other members did not reach the family deductible and paid the full 400 + 100 = 500
Total: 1550
I still need to create a mapping of result_bin values to corresponding functions, but doing a vectorized check and converting to a unique binary value is much, much better, in my opinion, than my nested ifelse mess.
I look at it like this: I'd have had to set the variables and write the functions anyway; this saves me 1) explicitly writing all the conditions, 2) the redundancy issue I was talking about, where one ends up writing identical "sibling" branches of parent splits in the ifelse structure, and 3) the code is far more easily followed.
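A sketch of that mapping, using the plan values above (the function list is illustrative, with one entry per reachable result_bin value from the table):

cost_fns <- list(
  "0" = function() exp_ind + exp_rem,                             # nothing met
  "1" = function() ded_ind + 0.1 * (exp_ind - ded_ind) + exp_rem  # ded_ind only
  # ... remaining entries follow the formula table
)
cost <- cost_fns[[as.character(result_bin)]]()
cost
# [1] 1550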
Since this question is not very specific, here is a simpler example/answer:
# example data
test <- expand.grid(opt1=0:1,opt2=0:1)
# create a unique identifier to represent the binary variables
test$code <- with(test, paste(opt1, opt2, sep=""))
# create an input variable to be used in functions
test$var1 <- 1:4
# opt1 opt2 code var1
#1 0 0 00 1
#2 1 0 10 2
#3 0 1 01 3
#4 1 1 11 4
Respective functions to apply depending on binary conditions, along with intended results for each combo:
var1 + 10 #code 00 - intended result = 11
var1 + 100 #code 10 - intended result = 102
var1 + 1000 #code 01 - intended result = 1003
var1 + var1 #code 11 - intended result = 8
Use ifelse combinations to do the calculations:
test$result <- with(test,
  ifelse(code == "00", var1 + 10,
  ifelse(code == "10", var1 + 100,
  ifelse(code == "01", var1 + 1000,
  ifelse(code == "11", var1 + var1,
  NA
)))))
Result:
  opt1 opt2 code var1 result
1    0    0   00    1     11
2    1    0   10    2    102
3    0    1   01    3   1003
4    1    1   11    4      8
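For more than a handful of codes, the same dispatch reads more cleanly with dplyr::case_when, which replaces the nesting with one condition/result pair per line (unmatched rows become NA automatically):

library(dplyr)
test <- test %>%
  mutate(result = case_when(
    code == "00" ~ var1 + 10,
    code == "10" ~ var1 + 100,
    code == "01" ~ var1 + 1000,
    code == "11" ~ 2 * var1   # 2 * var1 keeps all branches numeric
  ))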

Can I calculate the average of these numbers?

I was wondering if it's possible to calculate the average of some numbers if I have this:
int currentCount = 12;
float currentScore = 6.1123f; // scores range from 1 to 10
Now, if I receive another score (let's say 4.5), can I recalculate the average so it would be something like:
int currentCount now equals 13
float currentScore now equals ?????
or is this impossible and I still need to remember the list of scores?
The following formulas allow you to track averages just from stored average and count, as you requested.
currentScore = (currentScore * currentCount + newValue) / (currentCount + 1)
currentCount = currentCount + 1
This relies on the fact that your average is currently your sum divided by the count. So you simply multiply count by average to get the sum, add your new value and divide by (count+1), then increase count.
So, let's say you have the data {7,9,11,1,12} and the only thing you're keeping is the average and count. As each number is added, you get:
+--------+-------+----------------------+----------------------+
| Number | Count | Actual average | Calculated average |
+--------+-------+----------------------+----------------------+
| 7 | 1 | (7)/1 = 7 | (0 * 0 + 7) / 1 = 7 |
| 9 | 2 | (7+9)/2 = 8 | (7 * 1 + 9) / 2 = 8 |
| 11 | 3 | (7+9+11)/3 = 9 | (8 * 2 + 11) / 3 = 9 |
| 1 | 4 | (7+9+11+1)/4 = 7 | (9 * 3 + 1) / 4 = 7 |
| 12 | 5 | (7+9+11+1+12)/5 = 8 | (7 * 4 + 12) / 5 = 8 |
+--------+-------+----------------------+----------------------+
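In code, the same update is only a few lines. A minimal C sketch using the numbers from the question:

#include <stdio.h>

int main(void)
{
    int   currentCount = 12;
    float currentScore = 6.1123f;   /* current average */
    float newValue     = 4.5f;      /* incoming score  */

    /* Recover the sum from the average, fold in the new value, re-divide. */
    currentScore = (currentScore * currentCount + newValue) / (currentCount + 1);
    currentCount = currentCount + 1;

    printf("count = %d, average = %f\n", currentCount, currentScore);  /* 13, ~5.9883 */
    return 0;
}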
I like to store the sum and the count. It avoids an extra multiply each time.
current_sum += input;   // running total (use a float/double sum to avoid integer division)
current_count++;
current_average = current_sum / current_count;
It's quite easy really when you look at the formula for the average: (A1 + A2 + ... + AN) / N. Now, if you have the old average and N (the count of numbers), you can easily calculate the new average:
newScore = (currentScore * currentCount + someNewValue)/(currentCount + 1)
You can store currentCount and sumScore and you calculate sumScore/currentCount.
or... if you want to be silly, you can do it in one line :
current_average = (current_sum = current_sum + newValue) / ++current_count;
:)
float currentScore now equals (currentScore * (currentCount-1) + 4.5)/currentCount ?
