Here is the data:
P0(24,0) P25(32.1875,26.6735) P50(35.4167,31.383) P75(40.45,42.6203)
P90(50.55,59.1531) P95(53.05,77.3846) P99(60.21,128.643)
P99.5(60.605,236.321) P99.9(60.921,5854.43) P100(61,63000)
I know P0 means the 0th percentile and so on. What do the values in the brackets mean?
It seems like there are two variables, a and b.
The first entry corresponds to a and the second entry corresponds to b.
P25(32.1875,26.6735) means 32.1875 is the 25th percentile of a and 26.6735 is the 25th percentile of b.
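One way to sanity-check that reading is to parse the pairs into two separate percentile tables, one per variable. A Python sketch (the names a and b are just the labels used above):

```python
# Parse "P<q>(a,b)" entries into two percentile tables, assuming the first
# number in each pair belongs to variable a and the second to variable b.
import re

raw = ("P0(24,0) P25(32.1875,26.6735) P50(35.4167,31.383) "
       "P75(40.45,42.6203) P90(50.55,59.1531) P95(53.05,77.3846) "
       "P99(60.21,128.643) P99.5(60.605,236.321) "
       "P99.9(60.921,5854.43) P100(61,63000)")

a_pct, b_pct = {}, {}
for q, a, b in re.findall(r"P([\d.]+)\(([^,]+),([^)]+)\)", raw):
    a_pct[float(q)] = float(a)   # percentile of variable a
    b_pct[float(q)] = float(b)   # percentile of variable b

print(a_pct[25], b_pct[25])  # 32.1875 26.6735
```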
The nearZeroVar() function from the mixOmics R package has the following signature:
nearZeroVar(x, freqCut=95/5, uniqueCut=15) # default values shown
Here is the description of what this function does, straight from the source.
For example, an example of near zero variance predictor is one that,
for 1000 samples, has two distinct values and 999 of them are a single
value.
To be flagged, first the frequency of the most prevalent value over
the second most frequent value (called the “frequency ratio”) must be
above freqCut. Secondly, the “percent of unique values,” the number of
unique values divided by the total number of samples (times 100), must
also be below uniqueCut.
In the above example, the frequency ratio is 999 and the unique value
percentage is 0.0001.
I understand that the frequency ratio would be 999/1 = 999, because 999 samples share one value and a single sample has the other. But shouldn't the unique value percentage be 2/1000 * 100 = 0.2, since there are 2 unique values over 1000 samples? How does one obtain 0.0001 as the answer?
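The two quantities, as defined in the quoted description, can be checked directly with a short Python sketch of the 999-vs-1 example:

```python
# Frequency ratio = count of most common value / count of second most common;
# unique-value percentage = distinct values / total samples * 100.
from collections import Counter

x = [0] * 999 + [1]              # 1000 samples, two distinct values
counts = Counter(x).most_common()

freq_ratio = counts[0][1] / counts[1][1]     # 999 / 1
unique_pct = 100 * len(set(x)) / len(x)      # 2 / 1000 * 100

print(freq_ratio, unique_pct)  # 999.0 0.2
```

With the definitions as quoted, the arithmetic comes out to 0.2, matching the question's calculation rather than 0.0001.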
On this page there is the official example of the roll period function.
What function is used? (given roll period N)
With a simple moving average, the first N values should be NA, but they are not.
There are 99 values, so if I set the roll period to 99, I expected a straight line, but that is not what I get.
When I set it to 50 or 60, it seems that only the values after the first 50 change.
Does anyone know the function, or how I can find it?
It's a trailing average.
So if there are 100 values and you set 50 as the roll period, then:
the first value will just be the first value
the second will be the average of the first two
...
the 50th will be the average of the first 50
the 51st will be the average of values 2..51
...
the 100th will be the average of values 51..100
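A minimal Python sketch of that trailing average, assuming the window simply shrinks at the start (which matches the behavior described above) instead of producing NA values:

```python
# Trailing average: each output value averages the last n inputs, using a
# shorter window for the first n-1 positions instead of emitting NA.
def trailing_average(values, n):
    out = []
    for i in range(len(values)):
        window = values[max(0, i - n + 1): i + 1]  # up to n most recent values
        out.append(sum(window) / len(window))
    return out

vals = list(range(1, 101))           # values 1..100
avg = trailing_average(vals, 50)
print(avg[0], avg[1], avg[49], avg[99])  # 1.0 1.5 25.5 75.5
```

The first output is just the first value, the 50th is the mean of values 1..50 (25.5), and the 100th is the mean of values 51..100 (75.5), exactly as in the description above.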
So basically what I am trying to do is calculate the top 10:
df_priceusd[order(df_priceusd$pop,decreasing=T)[1:10],]
and now I want to take the prices of the top 10 and calculate their mean, to get one mean value for all 10.
Can I somehow implement this:
mean(df_priceusd$mean_priceusd)
into my other code?
Or should I approach it from another angle?
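The same idea can be sketched in Python with made-up data (the column names mirror the R data frame above, but the values are hypothetical): sort by popularity, keep the top 10 rows, then average their prices.

```python
# Hypothetical data standing in for df_priceusd: each row has a popularity
# score and a price. Take the 10 most popular rows and average their prices.
import random

random.seed(0)
rows = [{"pop": random.randint(0, 1000), "priceusd": random.uniform(1, 100)}
        for _ in range(50)]

top10 = sorted(rows, key=lambda r: r["pop"], reverse=True)[:10]
mean_price = sum(r["priceusd"] for r in top10) / len(top10)
print(len(top10), mean_price > 0)  # 10 True
```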
I have googled and keep ending up with formulas that are too slow. I suspect that if I split the formula into steps (creating calculated columns), I might see some performance gain.
I have a table with some numeric columns along with some that will be used as slicers. The intention is to have the 10th, 25th, 50th, 75th and 90th percentiles over some numeric columns for the selected slicer.
This is what I have for the 10th Percentile over the column "Total Pd".
TotalPaid10thPercentile :=
MINX(
    FILTER(
        VALUES(ClaimOutcomes[Total Pd]),
        CALCULATE(
            COUNTROWS(ClaimOutcomes),
            ClaimOutcomes[Total Pd] <= EARLIER(ClaimOutcomes[Total Pd])
        ) > COUNTROWS(ClaimOutcomes) * 0.1
    ),
    ClaimOutcomes[Total Pd]
)
It takes several minutes and still no data shows up. I have around 300K records in this table.
I figured out a way to break the calculation down into a series of steps, which yielded a pretty fast solution.
For calculating the 10th percentile on Amount Paid in the table Data, I followed the standard out-of-the-book formula:
Calculate the ordinal rank of the 10th percentile element
10ptOrdinalRank:=0.10*(COUNTX('Data', [Amount Paid]) - 1) + 1
It might come out as a decimal (fractional) number like 112.45
Compute the decimal part
10ptDecPart:=[10ptOrdinalRank] - TRUNC([10ptOrdinalRank])
Compute the ordinal rank of the element just below (floor)
10ptFloorElementRank:=FLOOR([10ptOrdinalRank], 1)
Compute the ordinal rank of the element just above (ceiling)
10ptCeilingElementRank:=CEILING([10ptOrdinalRank], 1)
Compute the element corresponding to the floor rank
10ptFloorElement:=MAXX(TOPN([10ptFloorElementRank], 'Data', [Amount Paid], 1), [Amount Paid])
Compute the element corresponding to the ceiling rank
10ptCeilingElement:=MAXX(TOPN([10ptCeilingElementRank], 'Data', [Amount Paid], 1), [Amount Paid])
Compute the percentile value
10thPercValue:=[10ptFloorElement] + [10ptDecPart]*([10ptCeilingElement]-[10ptFloorElement])
I found the performance remarkably faster than some other solutions I found on the net. Hope it helps someone in the future.
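The six steps above amount to ordinary linear interpolation between the two order statistics straddling the target rank. A compact Python sketch of the same formula:

```python
# Linear-interpolation percentile: compute the 1-based ordinal rank, take the
# order statistics just below and above it, and interpolate by the fractional
# part of the rank. Mirrors the DAX steps above (hypothetical sample data).
import math

def percentile_linear(values, p):
    xs = sorted(values)
    rank = p * (len(xs) - 1) + 1          # 1-based ordinal rank
    lo = math.floor(rank)                  # rank of element just below
    hi = math.ceil(rank)                   # rank of element just above
    frac = rank - lo                       # decimal part of the rank
    return xs[lo - 1] + frac * (xs[hi - 1] - xs[lo - 1])

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(percentile_linear(data, 0.10))  # 19.0
```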
If I have two vectors:
A<-c(1,2,3,4,5,6,7,8,9)
B<-c(10,20,30,40,50,60,70,80,90)
where each value in B corresponds to the value in A at the same position. If I, for example, run:
summary(B)
this will give me summary statistics for the values in B. My question is how do I figure out which values in A those summary stats correspond to?
The first quartile statistic of B can be accessed like this:
summary(B)[2]
1st Qu.
30
Then B == summary(B)[2] gives you a logical vector you can use to subset A and extract the corresponding value, like this:
A[B==summary(B)[2]]
[1] 3
So in this case the value 30 in B corresponds to 3 in A. Note that in a more realistic setting, you might find several values in B that match a summary statistic (or none, in the case of the mean).
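The same matching idea, sketched in Python (the quartile computation here is a simplification that happens to agree with R's summary() for this particular 9-element vector):

```python
# Find which entries of B equal a given summary statistic, then pull the
# corresponding entries of A via the shared positions.
A = [1, 2, 3, 4, 5, 6, 7, 8, 9]
B = [10, 20, 30, 40, 50, 60, 70, 80, 90]

first_quartile = sorted(B)[len(B) // 4]   # 30 for this vector
matches = [a for a, b in zip(A, B) if b == first_quartile]
print(matches)  # [3]
```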