I'm creating a Monte Carlo model in R. My model creates matrices filled with either zeros or values that fall within the constraints. I'm running a couple hundred thousand n values through my model, and I want to find the average of the non-zero values in the matrices I've created. I'm guessing I can do something in the last section.
Thanks for the help!
Code:
n<-252500
PaidLoss_1<-numeric(n)
PaidLoss_2<-numeric(n)
PaidLoss_3<-numeric(n)
PaidLoss_4<-numeric(n)
PaidLoss_5<-numeric(n)
PaidLoss_6<-numeric(n)
PaidLoss_7<-numeric(n)
PaidLoss_8<-numeric(n)
PaidLoss_9<-numeric(n)
for(i in 1:n){
claim_type<-rmultinom(1,1,c(0.00166439057698873, 0.000810856947763742, 0.00183509730283373, 0.000725503584841243, 0.00405428473881871, 0.00725503584841243, 0.0100290201433936, 0.00529190850119495, 0.0103277569136224, 0.0096449300102424, 0.00375554796858996, 0.00806589279617617, 0.00776715602594742, 0.000768180266302492, 0.00405428473881871, 0.00226186411744623, 0.00354216456128371, 0.00277398429498122, 0.000682826903379993))
claim_type<-which(claim_type==1)
claim_Amanda<-runif(1, min=34115, max=2158707.51)
claim_Bob<-runif(1, min=16443, max=413150.50)
claim_Claire<-runif(1, min=30607.50, max=1341330.97)
claim_Doug<-runif(1, min=17554.20, max=969871)
if(claim_type==1){PaidLoss_1[i]<-1*claim_Amanda}
if(claim_type==2){PaidLoss_2[i]<-0*claim_Amanda}
if(claim_type==3){PaidLoss_3[i]<-1* claim_Bob}
if(claim_type==4){PaidLoss_4[i]<-0* claim_Bob}
if(claim_type==5){PaidLoss_5[i]<-1* claim_Claire}
if(claim_type==6){PaidLoss_6[i]<-0* claim_Claire}
}
PaidLoss1<-sum(PaidLoss_1)/2525
PaidLoss3<-sum(PaidLoss_3)/2525
PaidLoss5<-sum(PaidLoss_5)/2525
PaidLoss7<-sum(PaidLoss_7)/2525
partial output of my numeric matrix
First, let me make sure I've wrapped my head around what you want to do: you have several columns -- in your example, PaidLoss_1, ..., PaidLoss_9, which have many entries. Some of these entries are 0, and you'd like to take the average (within each column) of the entries that are not zero. Did I get that right?
If so:
Comment 1: At the very end of your code, you might want to avoid using sum and dividing by a number to get the mean you want. It obviously works, but it opens you up to a risk: if you ever change the value of n at the top, then in the best case scenario you have to edit several lines down below, and in the worst case scenario you forget to do that. So, I'd suggest something more like mean(PaidLoss_1) to get your mean.
Right now, you have n as 252500, and your denominator at the end is 2525, which has the effect of inflating your mean by a factor of 100. Maybe that's what you wanted; if so, I'd recommend mean(PaidLoss_1) * 100 for the same reasons as above.
Comment 2: You can do what you want via subsetting. Take a smaller example as a demonstration:
test <- c(10, 0, 10, 0, 10, 0)
mean(test) # gives 5
test!=0 # a vector of TRUE/FALSE for which are nonzero
test[test!=0] # the subset of test which we found to be nonzero
mean(test[test!=0]) # gives 10, the average of the nonzero entries
The middle three lines are just for demonstration; the only necessary lines to do what you want are the first (to declare the vector) and the last (to get the mean). So your code should be something like PaidLoss1 <- mean(PaidLoss_1[PaidLoss_1 != 0]), or perhaps that times 100.
Comment 3: You might consider organizing your data into a matrix or data frame. Instead of typing PaidLoss_1, PaidLoss_2, etc., it might make sense to collect all this PaidLoss stuff into a matrix, whose elements you could then access with [ , ] indexing. This would clean up the code and save you a lot of typing; you could also then use the apply() family of functions so you don't have to repeat the same commands (such as the mean) for each column. A data frame or another structure would work too; having some structure will make your life easier.
(And to be super clear, your code is exactly what my code looked like when I first started writing in R. You can decide if it's worth pursuing some of that optimization; it probably just depends how much time you plan to eventually spend in R.)
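To make Comment 3 concrete, here's a minimal sketch with two made-up claim types (the matrix layout and the apply() call are the point, not the numbers):

```r
set.seed(1)
n <- 1000
PaidLoss <- matrix(0, nrow = n, ncol = 2)  # one column per claim type

for (i in 1:n) {
  claim_type <- sample(1:2, 1)
  PaidLoss[i, claim_type] <- runif(1, min = 100, max = 200)
}

# Mean of the non-zero entries of every column in one line:
nonzero_means <- apply(PaidLoss, 2, function(col) mean(col[col != 0]))
```

The same pattern scales to your nine columns without writing nine separate lines.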
I am trying to find a way to have a list of numbers be calculated to show all combinations that add up a desired sum WITHOUT repeating any of the input numbers after they've been used once, just consider it removed.
Example:
Input: 2, 5, 5, 10, 15, 10
Desired Sum: 20
Combinations: [10+10] [5+15]
So the leftover numbers are just 2 and 5: since it can't reuse numbers that have already been consumed, it can't also output [5+5+10].
The problem I'm having while searching for this function, even here on Stack Overflow, is that I can find a million working solutions that allow repeats, but absolutely none that prevent repeats of the INPUT numbers. (I can find some that disallow repeated combinations, but that doesn't do what I need; with many repeated input numbers, there will inevitably be repeated combinations.)
My purpose is that I need to group a large set of numbers (potentially hundreds of whole values between 5 and 14) into packs of 40 with no repeat of the input numbers.
Here's an example of the inputs I would like to use right now
[5,9,5,5,6,7,8,7,5,6,8,8,7,5,7,5,7,5,5,8,8,8,9,5,8,6,5,8,8,8,5,6,9,6,9,8,7,5,9,5,6,8,5,5,5,7,7,6,8,7,8,6,9,6,6,6,8,5,6,6,6,5,8,6,6,6,8,9,10,10,10,10,14,14,13,13,8,6,7,7,12,12,12,11,11,12,12,12,5,10,5,6,6,11,6,6,9,10,6,13,6,5,8,7,8,5,6,6,8,6,5]
And have that list show all possible combinations of 40 without using each value more than once.
Thanks for any help you can provide.
Simple (not very fast) method:
Repeat:
Find the first set with the needed sum
Remove its elements from the list
Repeat:
Check if a similar set still exists // very likely for a limited number range
Remove its elements from the list
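In R (the language of the surrounding questions), the pseudocode above could be sketched like this. find_group and greedy_groups are my own names, and because the search is greedy it may find different (and fewer) packs than an exhaustive enumeration would:

```r
# Depth-first search for ONE index subset of nums summing to target.
# Returns the indices, or NULL if no such subset exists.
find_group <- function(nums, target) {
  search <- function(idx, remaining, chosen) {
    if (remaining == 0) return(chosen)
    if (length(idx) == 0 || remaining < 0) return(NULL)
    # try including the first remaining index ...
    with_first <- search(idx[-1], remaining - nums[idx[1]], c(chosen, idx[1]))
    if (!is.null(with_first)) return(with_first)
    search(idx[-1], remaining, chosen)  # ... or skip it
  }
  search(seq_along(nums), target, integer(0))
}

# Repeatedly pull out groups, removing used numbers so they can't repeat.
greedy_groups <- function(nums, target) {
  groups <- list()
  repeat {
    g <- find_group(nums, target)
    if (is.null(g)) break
    groups[[length(groups) + 1]] <- nums[g]
    nums <- nums[-g]  # consume the used inputs
  }
  list(groups = groups, leftover = nums)
}
```

For example, greedy_groups(c(2, 5, 5, 10, 15, 10), 20) returns groups that each sum to 20, with the unused numbers in $leftover. For hundreds of inputs this per-group exhaustive search will be slow, as the answer warns.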
I'm trying to make a function that determines what bucket a certain value goes into based off of a given vector. So my function has two inputs: a vector determining the break points for the bucket
(ex: if the vector is (1,4,5,10) the buckets would be <=1, 1-4, 4-5, 5-10, >10)
and a certain number. I want the function to output a certain value determining the bucket.
For example, if I input 0.9 the output could be 1; for 1.6 the output could be 4; for 5.8 the output could be 10; and for 13, the output could be "10+".
The way I'm doing it right now is I first check if the input number is bigger than the vector's largest element or smaller than the vector's smallest element. If not, I then run a for loop (can't figure out how to use apply) to check if the number is in each specific interval. The problem is this is way too inefficient because I'm dealing with a large data set. Does anyone know an efficient way to do this?
The cut() function is convenient for bucketing: cut(splitme, breaks = vectorwithsplits).
However, it looks like you're actually trying to figure out an insertion point. You need something like binary search.
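Both built-ins can be shown on the breaks from the question; the labels below are my guess at the desired outputs (1, 4, 10, "10+"):

```r
breaks <- c(1, 4, 5, 10)
x <- c(0.9, 1.6, 5.8, 13)

# findInterval() is a vectorized binary search; 0 means "below the first break"
findInterval(x, breaks)

# cut() attaches labels directly; pad the breaks with -Inf/Inf for the ends
cut(x, breaks = c(-Inf, breaks, Inf), labels = c("1", "4", "5", "10", "10+"))
```

Both functions are vectorized, so they handle the whole data set in one call with no for loop.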
I am trying to automate a process instead of individually compute PPCC values for a large number of test cases. The details of my functions do not matter (though for reference I'm using Lmomco), my issue is either putting this into a loop or somehow using plyr or apply to repeat over and over. I do not know how to automate the string. For example I have sorted data by "M" parameter:
testx.100cv1<-by(x.cv1$first_year,x.cv1$M,sort)
I then apply a function here:
testexp<-lapply(testx.100cv1,parexp)
Now I want to do something to each "M", where in the example below, M = 1.02. Right now, I am manually changing this value and then recomputing for every M (and I have a lot of them). I'm looking for a way to write this M value into a loop so it reads it automatically.
exp<-quaexp(plotpos,testexp$'1.02')
PPCCexp<-cor(exp,testx.100cv1$'1.02')
I want to compute PPCC values for many distributions, so without automating, this will take over my life for a week.
Thanks!
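One way to avoid hand-editing '1.02' is to loop over the list names. Here is a self-contained sketch that uses base R's qexp() in place of lmomco's parexp()/quaexp(), with made-up data standing in for testx.100cv1:

```r
set.seed(42)
# Made-up stand-in for the data sorted by M in the question
sims <- list('1.02' = sort(rexp(50, rate = 2)),
             '1.50' = sort(rexp(50, rate = 5)))
plotpos <- ((1:50) - 0.5) / 50          # plotting positions

# One PPCC per M, computed by looping over the names instead of hand-editing
PPCC <- sapply(names(sims), function(m) {
  rate <- 1 / mean(sims[[m]])           # method-of-moments exponential fit
  cor(qexp(plotpos, rate), sims[[m]])   # correlation of fit vs. sorted data
})
```

With the real objects, the same pattern would be sapply(names(testexp), function(m) cor(quaexp(plotpos, testexp[[m]]), testx.100cv1[[m]])), giving a named vector of PPCC values, one per M.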
I was wondering if there was a Graphite function or a way of getting the number of points above a certain value. Let's say I have 2,44,24,522,52,534 for the same time and I want to get the number of points over 40, in this case it would be 3. Any tips?
Thanks
You can use removeBelowValue(your.metric, 40) to only display points above 40.
Then use something to turn every non-zero value into 1. I'm thinking of pow(_, 0), but I'm not sure how it behaves with the None values produced by removeBelowValue. If you use a recent (>0.9.x) version of Graphite, you can use isNonNull instead of pow.
Finally, aggregate the resulting 1s with summarize(); you only have to select your time range.
Suggestion: summarize(pow(removeBelowValue(your.metric, 40), 0), '1hour', 'sum')