Let we have N integer slots initially containing 0's and an infinite sequence of similar independent negative binomial variables wi ~ NB(l, q). Each next value from the sequence is added to the slot containing the minimal value. The question is: what is the distribution of the step number at which any of the slots exceeds given limit k?
If you ask, it's a simplified model of attack on vulnerable ("one unit lost means all die") stack of units in Freeciv game.
Related
The following code comes from this book, Statistics and Data Analysis For Financial Engineering, which describes how to generate simulation data of ARCH(1) model.
library(TSA)
library(tseries)
n = 10200
set.seed("7484")
e = rnorm(n)
a = e
y = e
sig2 = e^2
omega = 1
alpha = 0.55
phi = 0.8
mu = 0.1
omega/(1-alpha) ; sqrt(omega/(1-alpha))
for (t in 2:n){
a[t] = sqrt(sig2[t])*e[t]
y[t] = mu + phi*(y[t-1]-mu) + a[t]
sig2[t+1] = omega + alpha * a[t]^2
}
plot(e[10001:n],type="l",xlab="t",ylab=expression(epsilon),main="(a) white noise")
My question is that why we need to discard the first 10000 simulation?
========================================================
Bottom Line Up Front
Truncation is needed to deal with sampling bias introduced by the simulation model's initialization when the simulation output is a time series.
Details
Not all simulations require truncation of initial data. If a simulation produces independent observations, then no truncation is needed. The problem arises when the simulation output is a time series. Time series differ from independent data because their observations are serially correlated (also known as autocorrelated). For positive correlations, the result is similar to having inertia—observations which are near neighbors tend to be similar to each other. This characteristic interacts with the reality that computer simulations are programs, and all state variables need to be initialized to something. The initialization is usually to a convenient state, such as "empty and idle" for a queueing service model where nobody is in line and the server is available to immediately help the first customer. As a result, that first customer experiences zero wait time with probability 1, which is certainly not the case for the wait time of some customer k where k > 1. Here's where serial correlation kicks us in the pants. If the first customer always has a zero wait time, that affects some unknown quantity of subsequent customer's experiences. On average they tend to be below the long term average wait time, but gravitate more towards that long term average as k, the customer number, increases. How long this "initialization bias" lingers depends on both how atypical the initialization is relative to the long term behavior, and the magnitude and duration of the serial correlation structure of the time series.
The average of a set of values yields an unbiased estimate of the population mean only if they belong to the same population, i.e., if E[Xi] = μ, a constant, for all i. In the previous paragraph, we argued that this is not the case for time series with serial correlation that are generated starting from a convenient but atypical state. The solution is to remove some (unknown) quantity of observations from the beginning of the data so that the remaining data all have the same expected value. This issue was first identified by Richard Conway in a RAND Corporation memo in 1961, and published in refereed journals in 1963 - [R.W. Conway, "Some tactical problems on digital simulation", Manag. Sci. 10(1963)47–61]. How to determine an optimal truncation amount has been and remains an active area of research in the field of simulation. My personal preference is for a technique called MSER, developed by Prof. Pres White (University of Virginia). It treats the end of the data set as the most reliable in terms of unbiasedness, and works its way towards the front using a fairly simple measure to detect when adding observations closer to the front produces a significant deviation. You can find more details in this 2011 Winter Simulation Conference paper if you're interested. Note that the 10,000 you used may be overkill, or it may be insufficient, depending on the magnitude and duration of serial correlation effects for your particular model.
It turns out that serial correlation causes other problems in addition to the issue of initialization bias. It also has a significant effect on the standard error of estimates, as pointed out at the bottom of page 489 of the WSC2011 paper, so people who calculate the i.i.d. estimator s2/n can be off by orders of magnitude on the estimated width of confidence intervals for their simulation output.
I am running knn (in R) on a dataset where objects are classified A or B. However, there are many more A's than B's (18 of class A for every 1 of class B).
How should I combat this? If I use a k of 18, for example, and there are 7 B's in the neighbors (way more than the average B's in a group of 18), the test data will still be classified as A when it should probably be B.
I am thinking that a lower k will help me. Is there any rule of thumb for choosing the value of k, as it relates to the frequencies of the classes in the train set?
Ther is no such rule, for your case i would try a very small k probably between 3 and 6.
About the dataset, unless your test data or real world data are found in about the same ratio you have mentioned ( 18:1 ) i would remove some A's for more accurate results, i wont advise you doing it if the ratio is indeed close to the real world data because you will lose the effect of the ratio (lower probability classify for a lower probability data).
So we are given a set of integers from 0 to n. This is then randomized. The goal is to calculate the number of expected integers which remain in the same position in both lists. I have tried to set up two indicator variables for each integer and then mapping it to the two different sets, but I don't really know how to go from there.
The random variable X, representing the number of your integers which remain in the same position after randomisation, follows the binomial distribution with n+1 trials and a probability of 1/(n+1), therefore the expected number of integers remaining in place is 1.
My reasoning is:
Each integer can move to any other position in the list after randomisation, with equal probability. So whether an integer remains in place can be considered a Bernoulli distribution, with probability 1/(n+1), since there are n+1 possible position it could move to, and only 1 position for it to have remained in place.
There are therefore n+1 Bernoulli distributions, all with the same probability, and all independent of each other. (A Bernoulli distribution represents a yes / no outcome where the yes has a given probability.)
The binomial distribution is defined as the number of successes in a given number of identical independent trials, or (equivalently) the number of "yes" outcomes in a given number of independent Bernoulli distributions with the same probability.
The number of your integers which remain in place after randomisation is therefore a bimonial distribution, probability 1/(n+1) and with n+1 trials.
The mean of a binomial distribution with n trials with probability p is np, therefore in your case the expected number of integers remaining in place is (n+1) . (1/(n+1)) which is 1.
For more info on the binomial distribution, see wikipedia.
I have a parallel application in which I am computing in each node the variance of each partition of datapoint based on the calculated mean, but how can I compute the global variance (sum of all the variances)?
I thought that it would be a simple sum of the variances and divided by the number of nodes, but it is not giving me a close result...
The global variation is a sum.
You can compute parts of the sum in parallel trivially, and then add them together.
sum(x1...x100) = sum(x1...x50) + sum(x51...x100)
The same way, you can compute the global averages - compute the global sum, compute the sum of the object counts, divide (don't divide by the number of nodes; but by the total number of objects).
mean = sum/count
Once you have the mean, you can compute the sum of squared deviations using the distributed sum formula above (applied to (xi-mean)^2), then divide by count-1 to get the variance.
Do not use E[X^2] - (E[X])^2
While this formula "mean of square minus square of mean" is highly popular, it is numerically unstable when you are using floating point math. It's known as catastrophic cancellation.
Because the two values can be very close, you lose a lot of digits in precision when computing the difference. I've seen people get a negative variance this way...
With "big data", numerical problems gets worse...
Two ways to avoid these problems:
Use two passes. Computing the mean is stable, and gets you rid of the subtraction of the squares.
Use an online algorithm such as the one by Knuth and Welford, then use weighted sums to combine the per-partition means and variances. Details on Wikipedia In my experience, this often is slower; but it may be beneficial on Hadoop due to startup and IO costs.
You need to add the sums and sums of squares of each partition to get the global sum and sum of squares and then use them to calculate the global mean and variance.
UPDATE: E[X2] - E[X]2 and cancellation...
To figure out how important cancellation error is when calculating the standard deviation with
σ = √(E[X2] - E[X]2)
let us assume that we have both E[X2] and E[X]2 accurate to 12 significant decimal figures. This implies that σ2 has an error of order 10-12 × E[X2] or, if there has been significant cancellation, equivalently 10-12 × E[X]2 when σ will have an error of approximate order 10-6 × E[X]; one millionth the mean.
For many, if not most, statistical analyses this is negligable, in the sense that it falls within other sources of error (like measurement error), and so you can in good consciense simply set negative variances to zero before you take the square root.
If you really do care about deviations of this magnitude (and can show that it's a feature of the thing you are measuring and not, for example, an artifact of the method of measurement) then you can start worrying about cancellation. That said, the most likely explanation is that you have used an inappropriate scale for your data, such as measuring daily temperatures in Kelvin rather than Celcius!
I have a data set derived from the sport Snooker:
https://www.dropbox.com/s/1rp6zmv8jwi873s/snooker.csv
Column "playerRating" can take the values from 0 to 1, and describes how good a player is:
0: bad player
1: good player
Column "suc" is the number of consecutive balls potted by each player with the specific rating.
I am trying to prove 2 things regarding the number of consecutive balls potted until first miss:
The distribution of successes follows a negative binomial
The number of success depends on the player's worth. ie if a player is really good, he will manage to pot more consecutive balls.
I am using the "fitdistrplus" package to fit my data, however, I am unable to find a way of using the "playerRatings" as input parameters.
Any help would be much appreciated!