Distribution of money following different beta distributions - R

I am trying to find a methodology (or, even better, the code) to do the following in NetLogo. Any help is appreciated (I could always try to rewrite the code from R or MATLAB to NetLogo):
I have $5,000, which I want to distribute among 10,000 actors following different beta distributions. The maximum amount an actor may receive is $1.
Basically, I am looking for a way to assign each of the 10,000 actors a random number in the [0, 1] interval, drawn from different beta distributions, while keeping the mean of the distributed values equal to 0.5. That way the purchasing power of the population (10,000 actors with a mean of 0.5 gives $5,000) stays the same for beta(1,1) (a uniform population) as well as for, say, beta(4,1) (a rich population).
An example with 5 actors distributing $2.50:
beta(1,1): 0.5 - 0.5 - 0.5 - 0.5 - 0.5 (mean 0.5)
beta(4,1): 0.1 - 0.2 - 0.5 - 0.7 - 1.0 (mean 0.5)
I've been thinking: if there is no ready-made solution, maybe the following could work. We can write the shape of the frequency distribution of beta(4,1) as y = ax^2 + b for some values of a and b (both increase exponentially).
In my case, the integral from 0 to 1 of y = ax^2 + b should be 5000. Playing around with values for a and b should give me the shape of beta(4,1).
The number of actors receiving 0.1 would then be the integral from 0 to 0.1 of y = ax^2 + b, where a and b are the parameters of the shape resembling beta(4,1).
Is my reasoning clear enough? Could someone extend it? Is there a link between the beta distribution and a function of a, b and x?
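To the last question: the density of beta(a, 1) on [0, 1] is a·x^(a-1), so the shape of beta(4,1) is the cubic 4x^3 rather than a quadratic ax^2 + b. As for generating the distribution itself, one possible approach is sketched below in R (easy enough to port to NetLogo; the function name distribute_budget is made up for illustration): draw each actor's share from the desired beta and rescale so the shares sum to the budget.

distribute_budget <- function(n = 10000, total = 5000, a = 1, b = 1, cap = 1) {
  x <- rbeta(n, a, b)          # raw draws in [0, 1]
  x <- x * total / sum(x)      # rescale so the shares sum to the budget
  pmin(x, cap)                 # enforce the per-actor cap of $1
}

w <- distribute_budget(a = 4, b = 1)   # "rich population" shape
sum(w); mean(w)                        # roughly 5000 and 0.5

Note that rescaling distorts the nominal beta shape, and for shapes whose mean is below 0.5 (e.g. beta(1,4)) the cap can bind and pull the total slightly under the budget.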

Related

Is there any way to calculate effect size between a pre-test and a post-test when pre-test scores are 0 (or almost 0)?

I would like to calculate an effect size between scores from pre-test and post-test of my studies.
However, due to the nature of my research, pre-test scores are usually 0 or almost 0 (before the treatment, participants usually have no knowledge of the topic in question).
I cannot just use Cohen's d to calculate effect sizes since the pre-test scores do not follow a normal distribution.
Is there any way I can calculate effect sizes in this case?
Any suggestions would be greatly appreciated.
You are looking for Cohen's d to see whether the difference between the two time points (pre- and post-treatment) is large or small. Cohen's d can be calculated as follows:
d = (mean_post - mean_pre) / sqrt((variance_post + variance_pre) / 2)
where variance_post and variance_pre are the sample variances. Nothing here requires the pre- and post-treatment scores to be normally distributed.
There are multiple R packages that provide a function for Cohen's d: effsize, pwr and lsr. With lsr your R code would look like this:
library(lsr)
cohensD(pre_test_vector, post_test_vector)
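For reference, the formula above can also be computed by hand (a minimal sketch; pre_test_vector and post_test_vector stand in for your data, and with equal group sizes this matches the pooled-SD value that cohensD() returns):

d <- (mean(post_test_vector) - mean(pre_test_vector)) /
  sqrt((var(post_test_vector) + var(pre_test_vector)) / 2)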
Sidenote: sample means tend toward a normal distribution as the sample size grows, so as long as your sample size is large enough, the average scores are approximately normally distributed (Central Limit Theorem).

Error probability function

I have DNA amplicons with base mismatches which can arise during the PCR amplification process. My interest is in the probability that a sequence contains errors, given the error rate per base, the number of mismatches and the number of bases in the amplicon.
I came across an article [Cummings, S. M. et al (2010). Solutions for PCR, cloning and sequencing errors in population genetic analysis. Conservation Genetics, 11(3), 1095–1097. doi:10.1007/s10592-009-9864-6]
that proposes this formula to calculate the probability mass function in such cases.
I implemented the formula with R as shown here
pcr.prob <- function(k, N, eps) {
  # P(at least k errors) = 1 - P(at most k-1 errors among N bases)
  v <- numeric(k)
  for (i in 1:k) {
    # binomial probability of exactly (k - i) errors, i.e. 0, 1, ..., k-1
    v[i] <- choose(N, k - i) * eps^(k - i) * (1 - eps)^(N - (k - i))
  }
  1 - sum(v)
}
From the article: suppose we analysed an 800 bp amplicon using a PCR of 30 cycles with 1.85 × 10^-5 misincorporations per base per cycle, and found 10 unique sequences that are each 3 bp different from their most similar sequence. The probability that a novel sequence was generated by three independent PCR errors equals P = 0.0011.
However when I use my implementation of the formula I get a different value.
pcr.prob(3,800,0.0000185)
[1] 5.323567e-07
What could I be doing wrong in my implementation? Am I misinterpreting something?
Thanks
I think they've got the right number (0.00113), but badly explained in their paper.
The calculation you want to be doing is:
pbinom(3, 800, 1-(1-1.85e-5)^30, lower=FALSE)
I.e., what is the probability of seeing more than three modifications among 800 independent bases, when each base goes through 30 amplification cycles that each have a 1.85e-5 chance of going wrong? The term 1 - (1 - 1.85e-5)^30 is the probability that a given base does not stay correct through all 30 cycles.
Thinking about this more, you will start to see floating-point inaccuracies when working with very small probabilities here: computing 1 - x for a small x starts to lose precision once the absolute value of x drops below roughly 1e-10. Working with log-probabilities is a good idea at that point, and the log1p function in particular is a great help. Using:
pbinom(3, 800, 1-exp(log1p(-1.85e-5)*30), lower=FALSE)
will continue to work even when the error incorporation rate is very low.
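As a quick sanity check (a sketch, assuming the pcr.prob() function from the question is loaded): converting the per-cycle rate into a per-base probability over 30 cycles makes the two approaches agree and reproduces the paper's figure.

eps_cycle <- 1.85e-5
p_base <- 1 - exp(log1p(-eps_cycle) * 30)   # P(a base is wrong after 30 cycles)
pbinom(3, 800, p_base, lower.tail = FALSE)  # ~0.0011, the value reported in the paper
pcr.prob(4, 800, p_base)                    # same quantity: P(X >= 4) = P(X > 3)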

Fit negative binomial distribution in R

I have a data set derived from the sport Snooker:
https://www.dropbox.com/s/1rp6zmv8jwi873s/snooker.csv
Column "playerRating" can take the values from 0 to 1, and describes how good a player is:
0: bad player
1: good player
Column "suc" is the number of consecutive balls potted by each player with the specific rating.
I am trying to prove two things about the number of consecutive balls potted until the first miss:
The distribution of successes follows a negative binomial.
The number of successes depends on the player's rating, i.e. a really good player will manage to pot more consecutive balls.
I am using the "fitdistrplus" package to fit my data; however, I am unable to find a way of using "playerRating" as an input parameter (covariate).
Any help would be much appreciated!
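One possible route (a sketch, assuming the CSV really contains the columns "suc" and "playerRating" described above): fit the marginal negative binomial with fitdistrplus for the first claim, and use a negative binomial regression (MASS::glm.nb) so the mean can depend on the rating for the second claim.

library(fitdistrplus)
library(MASS)

snooker <- read.csv("snooker.csv")

# 1. Does a negative binomial describe the counts at all?
fit_nb <- fitdist(snooker$suc, "nbinom")
summary(fit_nb)
plot(fit_nb)   # goodness-of-fit plots

# 2. Do consecutive pots increase with player rating?
mod <- glm.nb(suc ~ playerRating, data = snooker)
summary(mod)   # a significantly positive playerRating coefficient supports the second claim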

R fast AUC function for non-binary dependent variable

I'm trying to calculate the AUC for a large-ish data set and am having trouble finding an implementation that both handles values that aren't just 0s and 1s and runs reasonably quickly.
So far I've tried the ROCR package, but it only handles 0s and 1s, and the pROC package will give me an answer but can take 5-10 minutes to calculate 1 million rows.
As a note, all of my values fall between 0 and 1 but are not necessarily exactly 1 or 0.
EDIT: both the actual values and the predictions fall between 0 and 1.
Any suggestions?
EDIT2:
ROCR can deal with situations like this:
Ex.1
actual prediction
1 0
1 1
0 1
0 1
1 0
or like this:
Ex.2
actual prediction
1 .25
1 .1
0 .9
0 .01
1 .88
but NOT situations like this:
Ex.3
actual prediction
.2 .25
.6 .1
.98 .9
.05 .01
.72 .88
pROC can deal with Ex.3 but it takes a very long time to compute. I'm hoping that there's a faster implementation for a situation like Ex.3.
"So far I've tried the ROCR package, but it only handles 0's and 1's"
Are you talking about the reference class memberships or the predicted class memberships?
The latter can be between 0 and 1 in ROCR; have a look at its example data set ROCR.simple.
If your reference is in [0, 1], you could have a look at (disclaimer: my) package softclassval. You'd have to construct the ROC/AUC from sensitivity and specificity calculations yourself, though. So unless you come up with an optimized algorithm (as the ROCR developers did), it will probably be slow as well. In that case you'll also have to think about what exactly sensitivity and specificity should mean, as this is ambiguous with reference memberships in (0, 1).
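As an aside, for the case where the reference is binary and only the predictions are continuous (Ex.1/Ex.2), a rank-based (Mann-Whitney) AUC is fast even on a million rows. A minimal sketch, not part of ROCR or pROC, and it does not cover the Ex.3 case:

fast_auc <- function(actual, predicted) {
  # Wilcoxon / Mann-Whitney formulation of the AUC; actual must be 0/1
  r  <- rank(predicted)                 # average ranks handle ties
  n1 <- sum(actual == 1)
  n0 <- sum(actual == 0)
  (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}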
Update after clarification of the question
You need to be aware that aggregating the reference (actual) values loses information. E.g., if you have actual = 0.5 and prediction = 0.8, what is that supposed to mean? Suppose these values really came from 10 underlying tests, with actual = 5/10 and prediction = 8/10.
By summarizing the 10 tests into two numbers, you lose the information of whether the same 5 out of the 10 were meant or not. Without it, actual = 5/10 and prediction = 8/10 is consistent with anything between 30% and 70% correct recognition!
[Figure omitted: an illustration of this ambiguity for the sensitivity, i.e. correct recognition, e.g., of click-through.]
You can find the whole poster and two presentations discussing such issues at softclassval.r-forge.r-project.org, section "About softclassval".
Going on with these thoughts, weighted versions of the mean absolute error, mean squared error, root mean squared error, etc. can be used as well.
However, all these different ways of expressing the same performance characteristic of the model (e.g. sensitivity = % correct recognition of actual click-through events) do have different meanings, and while they coincide with the usual calculation in unambiguous reference and prediction situations, they react differently to ambiguous references / partial reference class memberships.
Note also that, because you use continuous values in [0, 1] for both reference/actual and prediction, the whole test will be condensed into a single point (not a curve!) in the ROC or specificity-sensitivity plot.
Bottom line: the aggregation of the data gets you into trouble here. So if you can somehow get the information on the single clicks, go and get it!
Can you use other error measures to assess method performance, e.g. mean absolute error or root mean square error (see the sketch after the link below)?
This post might also help you out, but if you have different numbers of classes for observed and predicted values, you might run into some issues:
https://stat.ethz.ch/pipermail/r-help/2008-September/172537.html
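A minimal sketch of those error measures, for actual and predicted vectors in [0, 1]:

mae  <- function(actual, predicted) mean(abs(actual - predicted))
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))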

Likert Rank ordering optimization heuristic possible?

I can't find the type of problem I have, and I was wondering if someone knew what kind of statistics it involves. I'm not sure it's even a type of problem that can be optimized.
I'd like to optimize over three variables, or more precisely over the combination of two of them. The first is the average on a Likert scale, the second is the number of times the item was rated on that Likert scale, and the third is the item ID. The Likert scale is [1, 2, 3, 4].
So:
3.25, 200, item1 would mean that item1 was rated 200 times and got an average rating of 3.25.
I have a bunch of items and I'd like to find the high-value items. For instance, an item with (4, 1) would be poor because, while it has the highest rating, it was only rated once. And (1, 1000) would also be poor, for the inverse reason.
Is there a way to optimize this with a simple heuristic? Someone told me to look into confidence bands, but I am not sure how that would work. Thanks!
Basically you want to ignore scores with fewer than x ratings, where x is a threshold that can be estimated based on the variance in your data.
I would recommend estimating the standard deviation of your ratings, putting a threshold on the standard error of an item's average, and translating that error into the minimum number of ratings required to meet the bound. See: http://en.wikipedia.org/wiki/Standard_error_(statistics)
For example, if your ratings have standard deviation 0.5 and you want the standard error of an item's average to be at most 0.1, you need (0.5/0.1)^2 = 25 ratings; for a 95% confidence half-width of 0.1 you would need roughly (1.96 × 0.5/0.1)^2 ≈ 96 ratings.
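One way to turn this into a ranking heuristic (a sketch; the data frame layout and function name are made up for illustration) is to sort items by a lower confidence bound on their average rating, so sparsely rated items are pulled down automatically:

# items: data frame with columns id, avg (mean rating), n (number of ratings)
score_items <- function(items, sd_est = 0.5, z = 1.96) {
  lower <- items$avg - z * sd_est / sqrt(items$n)   # approx. 95% lower bound on the true mean
  items[order(lower, decreasing = TRUE), ]
}

items <- data.frame(id  = c("item1", "item2", "item3"),
                    avg = c(3.25, 4.0, 1.0),
                    n   = c(200, 1, 1000))
score_items(items)   # item1 comes out on top: high average and enough ratings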
