Calculating PERCENTILE in DAX - olap

I have googled and keep ending up with formulas which are too slow. I suspect if I split the formula in steps (creating calculated columns), I might see some performance gain.
I have a table having some numeric columns along with some which would end up as slicers. The intention is to have 10th, 25th, 50th, 75th and 90th percentile over some numeric columns for the selected slicer.
This is what I have for the 10th Percentile over the column "Total Pd".
TotalPaid10thPercentile:=MINX(
FILTER(
VALUES(ClaimOutcomes[Total Pd]),
CALCULATE(
COUNTROWS(ClaimOutcomes),
ClaimOutcomes[Total Pd] <= EARLIER(ClaimOutcomes[Total Pd])
)> COUNTROWS(ClaimOutcomes)*0.1
),
ClaimOutcomes[Total Pd]
)
It takes several minutes and still no data shows up. I have around 300K records in this table.

I figured out a way to break the calculation down in a series of steps, which fetched a pretty fast solution.
For calculating the 10th percentile on Amount Paid in the table Data, I followed the below out-of-the-book formula :
Calculate the Ordinal rank for the 10th percentile element
10ptOrdinalRank:=0.10*(COUNTX('Data', [Amount Paid]) - 1) + 1
It might come out a decimal(fraction) number like 112.45
Compute the decimal part
10ptDecPart:=[10ptOrdinalRank] - TRUNC([10ptOrdinalRank])
Compute the ordinal rank of the element just below(floor)
10ptFloorElementRank:=FLOOR([10ptOrdinalRank],1)
Compute the ordinal rank of the element just above(ceiling)
10ptCeilingElementRank:=CEILING([10ptOrdinalRank], 1)
Compute element corresponding to floor
10ptFloorElement:=MAXX(TOPN([10ptFloorElementRank], 'Data',[Amount Paid],1), [Amount Paid])
Compute element corresponding to ceiling
10ptCeilingElement:=MAXX(TOPN([10ptCeilingElementRank], 'Data',[Amount Paid],1), [Amount Paid])
Compute the percentile value
10thPercValue:=[10ptFloorElement] + [10ptDecPart]*([10ptCeilingElement]-[10ptFloorElement])
I have found the performance remarkably faster than some other solutions I found on the net. Hope it helps someone in future.

Related

Finding a minimum of a row in a matrix

I've been asked for an assignment to generate 1000 random samples of size 100 from a Uniform[0,1] distribution. For each sample, record the minimum. Then find the mean and variance of these minima.
I used this code so far so that I can have all my samples in a matrix with 1000 rows and 100 columns. (I think that's what it is doing).
x <- matrix(runif(100000,0,1),1000,100)
So now I need to find the minimum of each row, then from there I need the expected and variance and I really don't know where to start. Thanks in advance.
Use this,
apply(x,2,min)
You will get min values for each column

In R: sort the maximum dissimilarity between rows in a matrix

I have a matrix, which includes 100 rows and 10 columns, here I want to compare the diversity between rows and sort them. And then, I want to select the 10 maximum dissimilarity rows from it, Which method can I use?
set.seed(123)
mat <- matrix(runif(100 * 10), nrow = 100, ncol = 10)
My initial method is to calculate the similarity (e.g. saying tanimoto coefficient or others: http://en.wikipedia.org/wiki/Jaccard_index ) between two rows, and dissimilairty = 1 - similarity, and then compare the dissimilarty values. At last I will sort all dissimilarity value, and select the 10 maximum dissimilarity values. But it seems that the result is a 100 * 100 matrix, maybe need efficient method to such calculation if there are a large number of rows. However, this is just my thought, maybe not right, so I need help.
[update]
After looking for some literatures. I find the one definition for the maximum dissimilarity method.
Maximum dissimilarity method: It begins by randomly choosing a data record as the first cluster center. The record maximally distant from the first point is selected as the next cluster center. The record maximally distant from both current points is selected after that . The process repeats itself until there is a sufficient number of cluster centers.
Here in my question, the sufficient number should be 10.
Thanks.
First of all, the Jacard Index is not right for you. From the wikipedia page
The Jaccard coefficient measures similarity between finite sample sets...
Your matrix has samples of floats, so you have a different problem (note that the Index in question is defined in terms of intersections; that should be a red flag right there :-).
So, you have to decide what you mean by dissimilarity. One natural interpretation would be to say row A is more dissimilar from the data set than row B if it has a greater Euclidean distance to the center of mass of the data set. You can think of the center of mass of the data set as the vector you get by taking the mean of each of the colums and putting them together (apply(mat, 2, mean)).
With this, you can take the distance of each row to that central vector, and then get an ordering on those distances. From that you can work back to the rows you desire from the original matrix.
All together:
center <- apply(mat, 2, mean)
# not quite the distances, actually, but their squares. That will work fine for us though, since the order
# will still be the same
dists <- apply(mat, 1, function(row) sum((row - center) ** 2))
# this gives us the row indices in order of least to greaest dissimiliarity
dist.order <- order(dists)
# Now we just grab the 10 most dissimilar of those
most.dissimilar.ids <- dist.order[91:100]
# and use them to get the corresponding rows of the matrix
most.dissimilar <- mat[most.dissimilar.ids,]
If I was actually writing this, I probably would have compressed the last three lines as most.dissimilar <- mat[order(dists)[91:100],], but hopefully having it broken up like this makes it a little easier to see what's going on.
Of course, if distance from the center of mass doesn't make sense as the best way of thinking of "dissimilarity" in your context, then you'll have to amend with something that does.

R: How to get a count for a certain value in a matrix row in R?

Ok I have the following problem:
I have several ranks in a matrix in r. (I've got this by ranking asset returns. Ranks>=3 get an NA, Ranks <3 get the rank number. If some assets share a rank, less NAs are in a row). Here are two example rows and two example rows of a matrix with returns.
ranks<-matrix(c(1,1,2,NA,NA, 1,2,NA,NA,NA),nrow=2,ncol=5)
returns<-matrix(c(0.3,0.1,-0.5,-0.7,0.2,0.1,0.4,0.05,-0.7,-0.3),nrow=2,ncol=5)
Now if all assets are equally bought for our portfolio, I can calculate the average return with:
Mat.Ret<-returns*ranks
Mean.Ret<-rowMeans(Mat.Ret,na.rm=TRUE)
However I want to have the option of giving a vector of weights for the two ranks and these weights say how big of a percentage this particular rank should have in my portfolio. As an example we have a vector of
weights<-c(0.7,0.3)
Now how would I use this in my code? I want to calculate basically ranks*returns*weights. If only ONE rank 1 and ONE rank 2 are in the table, the code works. But how would I do this variable? I mean a solution would be to calculate for each rank how many times it exists in a particular row and then divide the weight by this count. And then I would multiply this "net weight" * rank * returns.
But I have no clue how to do this in code..any help?
UPDATE AFTER FIRST COMMENTS
Ok I want to Keep it flexible to adjust the weights depending on how many times a certain rank is given. A user can choose the top 5 ranked assets, so none or several assets may share ranks. So the distribution of weights must be very flexible. I've programmed a formula which doesn't work yet since I'm obviously not yet experienced enough with the whole matrix and vector selection syntax I guess. This is what I got so far:
ranks<-apply(ranks,1,function(x)distributeWeightsPerMatrixRow(x,weights))
distributeWeightsPerMatrixRow<-function(MatrixRow,Weights){
if(length(Weights)==length(MatrixRow[!is.na(MatrixRow)])){
MatrixRow <- Weights[MatrixRow]
} else {
for(i in 1:length(MatrixRow)){
if(!is.na(MatrixRow[i])){
EqWeights<-length(MatrixRow[MatrixRow==MatrixRow[i]])
MatrixRow[i]<-sum(Weights[MatrixRow[i]:(MatrixRow[i]+EqWeights-1)])/EqWeights
}
}
}
return(MatrixRow)
}
EDIT2:
Function seems to work, however now the resulting ranks object is the transposed version of the original matrix without the column names..
Since your ranks are integers above zero, you can use this matrix for indexing the vector ranks:
mat.weights <- weights[ranks]
mat.weighted.ret <- returns * ranks * mat.weights
Update based on comment.
I suppose you're looking for something like this:
if (length(unique(na.omit(as.vector(ranks)))) == 1)
mat.weights <- (!is.na(ranks)) * 0.5
else
mat.weights <- weights[ranks]
mat.weighted.ret <- returns * ranks * mat.weights
If there is only one rank. All weights become 0.5.

Summarized huge data, How to handle it with R?

I am working on EBS, Forex market Limit Order Book(LOB): here is an example of LOB in a 100 millisecond time slice:
datetime|side(0=Bid,1=Ask)| distance(1:best price, 2: 2nd best, etc.)| price
2008/01/28,09:11:28.000,0,1,1.6066
2008/01/28,09:11:28.000,0,2,1.6065
2008/01/28,09:11:28.000,0,3,1.6064
2008/01/28,09:11:28.000,0,4,1.6063
2008/01/28,09:11:28.000,0,5,1.6062
2008/01/28,09:11:28.000,1,1,1.6067
2008/01/28,09:11:28.000,1,2,1.6068
2008/01/28,09:11:28.000,1,3,1.6069
2008/01/28,09:11:28.000,1,4,1.6070
2008/01/28,09:11:28.000,1,5,1.6071
2008/01/28,09:11:28.500,0,1,1.6065 (I skip the rest)
To summarize the data, They have two rules(I have changed it a bit for simplicity):
If there is no change in LOB in Bid or Ask side, they will not record that side. Look at the last line of the data, millisecond was 000 and now is 500 which means there was no change at LOB in either side for 100, 200, 300 and 400 milliseconds(but those information are important for any calculation).
The last price (only the last) is removed from a given side of the order book. In this case, a single record with nothing in the price field. Again there will be no record for whole LOB at that time.
Example:2008/01/28,09:11:28.800,0,1,
I want to calculate minAsk-maxBid(1.6067-1.6066) or weighted average price (using sizes of all distances as weights, there is size column in my real data). I want to do for my whole data. But as you see the data has been summarized and this is not routine. I have written a code to produce the whole data (not just summary). This is fine for small data set but for a large one I am creating a huge file. I was wondering if you have any tips how to handle the data? How to fill the gaps while it is efficient.
You did not give a great reproducible example so this will be pseudo/untested code. Read the docs carefully and make adjustments as needed.
I'd suggest you first filter and split your data into two data.frames:
best.bid <- subset(data, side == 0 & distance == 1)
best.ask <- subset(data, side == 1 & distance == 1)
Then, for each of these two data.frames, use findInterval to compute the corresponding best ask or best bid:
best.bid$ask <- best.ask$price[findInterval(best.bid$time, best.ask$time)]
best.ask$bid <- best.bid$price[findInterval(best.ask$time, best.bid$time)]
(for this to work you might have to transform date/time into a linear measure, e.g. time in seconds since market opening.)
Then it should be easy:
min.spread <- min(c(best.bid$ask - best.bid$price,
best.ask$bid - best.ask$price))
I'm not sure I understand the end of day particularity but I bet you could just compute the spread at market close and add it to the final min call.
For the weighted average prices, use the same idea but instead of the two best.bid and best.ask data.frames, you should start with two weighted.avg.bid and weighted.avg.ask data.frames.

Generate a specific amount of random numbers that add up to a defined value

I would like to unit test the time writing software used at my company. In order to do this I would like to create sets of random numbers that add up to a defined value.
I want to be able to control the parameters:
Min and max value of the generated number
The n of the generated numbers
The sum of the generated numbers
For example, in 250 days a person worked 2000 hours. The 2000 hours have to randomly distributed over the 250 days. The maximum time time spend per day is 9 hours and the minimum amount is .25
I worked my way trough this SO question and found the method
diff(c(0, sort(runif(249)), 2000))
This results in 1 big number a 249 small numbers. That's why I would to be able to set min and max for the generated number. But I don't know where to start.
You will have no problem meeting any two out of your three constraints, but all three might be a problem. As you note, the standard way to generate N random numbers that add to a sum is to generate N-1 random numbers in the range of 0..sum, sort them, and take the differences. This is basically treating your sum as a number line, choosing N-1 random points, and your numbers are the segments between the points.
But this might not be compatible with constraints on the numbers themselves. For example, what if you want 10 numbers that add to 1000, but each has to be less than 100? That won't work. Even if you have ranges that are mathematically possible, forcing compliance with all the constraints might mean sacrificing uniformity or other desirable properties.
I suspect the only way to do this is to keep the sum constraint, the N constraint, do the standard N-1, sort, and diff thing, but restrict the resolution of the individual randoms to your desired minimum (in other words, instead of 0..100, maybe generate 0..10 times 10).
Or, instead of generating N-1 uniformly random points along the line, generate a random sample of points along the line within a similar low-resolution constraint.

Resources