The consensus on set.seed in R is that it effectively generates a long sequence of pseudo-random numbers, pre-determined by the seed. The first call you make into this sequence (with the first non-deterministic function you use) takes the first batch from that sequence, the second call takes the next batch, and so forth.
I am wondering what the limits to this are. Specifically, what happens when you get to the end of that long sequence? Let's say, after setting a seed, you then sample from the first 100 integers repeatedly. Would there come a point where you start generating the same samples (in the same order) as you were seeing at the beginning? How long would this take? (Does it depend on the seed?) If not, how would reaching the 'end' of the sequence and presumably circling back to the beginning manifest?
The ?RNGkind help page in R gives more details on the default random number generator, the "Mersenne Twister" algorithm:
"Mersenne-Twister": From Matsumoto and Nishimura (1998); code
updated in 2002. A twisted GFSR with period 2^19937 - 1 and
equidistribution in 623 consecutive dimensions (over the
whole period). The ‘seed’ is a 624-dimensional set of 32-bit
integers plus a current position in that set.
As stated there, the "period" (the length of time it takes to get back to the beginning and start repeating values) is 2^19937 - 1, which is approximately 10^(19937/log2(10)), or about 10^6001.
If the size of your "batches" happened to line up exactly with the period, then you would indeed start getting the same batches again.
I'm not sure how many pseudorandom samples R uses to pick a sample of size 1 from a set. Ideally it would be only 1 (so your "batch size" would be 1), but it might be more depending on the generality/complexity of the sampling algorithm.
I know that runif() translates more or less directly from the PRNG, so a sequence of runif() calls would indeed repeat exactly.
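To make the repeat-on-reseed behavior concrete, here is a small demonstration in Python rather than R: NumPy's legacy RandomState also uses the Mersenne Twister (MT19937), the same algorithm as R's default, so the behavior is analogous.

```python
import numpy as np

# NumPy's legacy RandomState uses MT19937, the same algorithm as R's default.
rng = np.random.RandomState(seed=42)
first_batch = rng.uniform(size=5)    # first "batch" from the stream
second_batch = rng.uniform(size=5)   # continues the stream, so it differs

rng = np.random.RandomState(seed=42) # re-seed: back to the start of the stream
replay = rng.uniform(size=5)         # identical to first_batch
```

Re-seeding rewinds the stream; only exhausting the full 2^19937 - 1 period would make it wrap around on its own.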
I'm writing a javascript program that sends a list of MIDI signals over a specified period of time.
If the signals are sent evenly, it's easy to determine how long to wait in between each signal: it's just the total duration divided by the number of signals.
However, I want to be able to offer a setting where the signals aren't sent equally: either the signals are sent with increasing or decreasing speed. In either case, the number of signals and the total amount of time remain the same.
Here's a picture to visualize what I'm talking about
Is there a simple logarithmic/exponential function where I can compute what these values are? I'm especially hoping it might be possible to use the same equation for both, simply changing a variable.
Thank you so much!
Since you do not give any method to get a pulse value, from the previous value or any other way, I assume we are free to come up with our own.
In both of your cases, it looks like you start with an initial time interval: let's call it a. Then the next interval is that value multiplied by a constant ratio: let's call it r. In the first decreasing case, your value of r is between zero and one (it looks like around 0.6), while in the second case your value of r is greater than one (around 1.6). So your time intervals, in Python notation, are
a, a*r, a*r**2, a*r**3, ...
Then the time of each signal is the sum of a geometric series,
a * (1 - r**n) / (1 - r)
where n is the number of the pulse (1 for the first, 2 for the second, etc.). That formula is valid if r is not one, but if r is one then the sequence is a trivial sequence of a regular signal and the nth signal is given at time
a * n
This is not a "fixed result" since you have two degrees of freedom--you can choose values of a and of r.
If you want to spread the signals more evenly, just bring r closer to one. A value of one is perfectly even; a value farther from one is more clumped at one end. One disadvantage of this method is that if the signal intervals are decreasing (r < 1), then the signal times converge to a limit and the signals effectively stop, namely at
a / (1 - r)
If you have signals already sent or received and you want to find the value of r, just take three consecutive signals: r is the time interval between the 2nd and 3rd signals divided by the time interval between the 1st and 2nd. If you want to check whether this model fits a given set of signals, compute r at multiple signals; if the value of r is nearly constant, the model is a good fit.
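Putting the formulas together, here is a sketch in Python (the original question is about JavaScript, but the formula translates directly); the function and parameter names are mine:

```python
def signal_times(total_duration, n, r):
    """Times of n signals within total_duration, where each interval is
    r times the previous one. r < 1 means decreasing intervals (speeding
    up), r > 1 increasing intervals (slowing down), r == 1 evenly spaced."""
    if r == 1:
        return [total_duration * k / n for k in range(1, n + 1)]
    # Solve a * (1 - r**n) / (1 - r) == total_duration for the first interval a,
    # so the number of signals and the total time stay fixed as r changes.
    a = total_duration * (1 - r) / (1 - r**n)
    # Time of the k-th signal is the geometric partial sum of the intervals.
    return [a * (1 - r**k) / (1 - r) for k in range(1, n + 1)]
```

The same function covers both of your cases: only the value of r changes.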
Here's roughly what my current code does (in Python):
import random

count = 0
for item in items:
    if random.random() < 1/20:  # 1/20 chance to add one to count
        count += 1
This is more or less my current code, but there could be hundreds of thousands of items in that list, so it gets inefficient fast. (Isn't this called O(n) or something?)
Is there a way to compress this into one equation?
Let's look at the properties of the random variable you've described. Quoting Wikipedia:
The binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p.
Let N be the number of items in the list, and C be a random variable representing the count you're obtaining from your pseudocode. C follows a binomial distribution with n = N trials and success probability p = 1/20.
The remaining problem is how to efficiently draw a sample from a random variable with that distribution. There are a number of libraries that let you draw samples from random variables with a specified PDF. I've never had to implement it myself, so I don't know the details exactly, but many are open source and you can refer to the implementation yourself.
Here's how you would calculate count with the numpy library in Python:
import numpy as np

n, p = 100000, 1/20  # n = number of items in the list, p = 1/20 chance per item
count = np.random.binomial(n, p)  # draw a single sample
Apparently the OP was asking for a more efficient way to generate random numbers with the same distribution this gives. I thought the question was how to do the exact same operation as the loop, but as a one-liner (and preferably with no temporary list that exists just to be iterated over).
If you sample a random number generator n times, it's going to have at best O(n) run time, regardless of how the code looks.
In some interpreted languages, using more compact syntax might make a noticeable difference in the constant factors of run time. Other things can affect the run time, like whether you store all the random values and then process them, or process them on the fly with no temporary storage.
None of this will allow you to avoid having your run time scale up linearly with n.
I need to write a function that returns one of the numbers (-2, -1, 0, 1, 2) at random, but I need the average of the output to be a specific number (say, 1.2).
I saw similar questions, but all the answers seem to rely on the target range being wide enough.
Is there a way to do this (without saving state) with this small selection of possible outputs?
UPDATE: I want to use this function for (randomized) testing, as a stub for an expensive function which I don't want to run. The consumer of this function runs it a couple of hundred times and takes an average. I've been using a simple randint function, but the average is always very close to 0, which is not realistic.
Point is, I just need something simple that won't always average to 0. I don't really care what the actual average is. I may have asked the question wrong.
Do you really mean to require that specific value to be the average, or rather the expected value? In other words, if the generated sequence happened to contain an extraordinary number of small values in its initial part, should the rest of the sequence compensate to get the overall average right? I assume not; I assume you want all your samples to be computed independently (after all, you said you don't want any state), in which case you can only control the expected value.
If you assign a probability p(v) to each of your possible choices v, then the expected value will be the sum of these values, weighted by their probabilities:

EV = -2*p(-2) - 1*p(-1) + 1*p(1) + 2*p(2) = 1.2

(the value 0 contributes nothing to the sum).
As additional constraints you have to require that each of these probabilities is non-negative, and that the above four add up to at most 1, with the remainder taken by the fifth probability p(0).
There are many possible assignments satisfying these requirements, and any one of them will do what you asked for. Which of them are reasonable for your application depends on what that application does.
You can use a PRNG which generates variables uniformly distributed in the range [0,1), and then map these to the cases you described by taking the cumulative sums of the probabilities as cut points.
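As a sketch, here is one hypothetical assignment of probabilities with expected value 1.2, sampled via cumulative cut points as just described (the specific probabilities are only an example; any assignment satisfying the constraints works):

```python
import random

# One hypothetical probability assignment for (-2, -1, 0, 1, 2) whose
# expected value is 1.2: 0.1*0 + 0.6*1 + 0.3*2 = 1.2 (probabilities sum to 1).
values = [-2, -1, 0, 1, 2]
probs = [0.0, 0.0, 0.1, 0.6, 0.3]

def sample():
    u = random.random()              # uniform in [0, 1)
    cumulative = 0.0
    for v, p in zip(values, probs):
        cumulative += p              # the cut points are the cumulative sums
        if u < cumulative:
            return v
    return values[-1]                # guard against floating-point round-off
```

Each call is stateless, as requested, and averaging a few hundred calls lands close to 1.2 rather than 0.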
Specifically, I'm asking about the log-log counting approach.
I'll try and clarify the use of probabilistic counters although note that I'm no expert on this matter.
The aim is to count to very very large numbers using only a little space to store the counter (e.g. using a 32 bits integer).
Morris came up with the idea to maintain a "log count", so instead of counting n, the counter holds log₂(n). In other words, given a value c of the counter, the real count represented by the counter is 2ᶜ.
As logs are not generally integers, the problem becomes deciding when the counter c should be incremented, since we can only do so in steps of 1.
The idea here is to use a "probabilistic counter": on each call to a method Increment on our counter, we update the actual counter value with probability p. This is useful because it can be shown that the expected value represented by the counter under probabilistic updates is in fact n. In other words, on average, the value represented by our counter after n calls to Increment is n (though at any one point in time the counter probably has some error). We are trading accuracy for the ability to count up to very large numbers with little storage space (e.g. a single register).
One scheme to achieve this, as described by Morris, is to have a counter value c represent the actual count 2ᶜ (i.e. the counter holds the log₂ of the actual count). We update this counter with probability 1/2ᶜ where c is the current value of the counter.
Note that choosing this "base" of 2 means that the representable counts are always powers of 2 (hence the term "order of magnitude estimate"). It is also possible to choose another base b > 1 (typically such that b < 2) so that the error is smaller, at the cost of a smaller maximum count.
The "log log" comes into play because a number x needs about log₂(x) bits to be represented, so the counter c = log₂(n) itself needs only about log₂(log₂(n)) bits.
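A minimal sketch of a Morris counter in Python, following the base-2 scheme described above (the class name is mine, and I use 2^c - 1 as the estimate, which Morris's analysis shows is unbiased since E[2^c] = n + 1 after n increments):

```python
import random

class MorrisCounter:
    """Morris' approximate counter: stores c (roughly log2 of n) instead of n."""
    def __init__(self):
        self.c = 0

    def increment(self):
        # Advance the stored exponent with probability 1 / 2**c, so the
        # expected represented count grows by exactly 1 per call.
        if random.random() < 1.0 / (1 << self.c):
            self.c += 1

    def estimate(self):
        # 2**c - 1 is the unbiased estimate of the number of increments.
        return (1 << self.c) - 1
```

Any single counter can be far off (the variance is large), but averaged over many counters the estimate converges to the true count.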
There are in fact many other schemes to approximate counting, and if you are in need of such a scheme you should probably research which one makes sense for your application.
References:
See Philippe Flajolet's analysis for a proof of the average value represented by the counter, or a much simpler treatment in the solutions to problem 5-1 in the book "Introduction to Algorithms". The paper by Morris is usually behind paywalls; I could not find a free version to post here.
It's not exactly about the log-log counting approach, but I think it can help you.
Using Morris' algorithm, the counter represents an "order of magnitude estimate" of the actual count. The approximation is mathematically unbiased.
To increment the counter, a pseudo-random event is used, such that the incrementing is a probabilistic event. To save space, only the exponent is kept. For example, in base 2, the counter can estimate the count to be 1, 2, 4, 8, 16, 32, and all of the powers of two. The memory requirement is simply to hold the exponent.
As an example, to increment from 4 to 8, a pseudo-random number would be generated such that with probability 0.25 the counter is advanced. Otherwise, the counter remains at 4. (From Wikipedia.)
Hello good people of stackoverflow, this is a conceptual question and could possibly belong in math.stackexchange.com, however since this relates to the processing speed of a CPU, I put it in here.
Anyways, my question is pretty simple. I have to calculate the sum of the cubes of 3 numbers in a range of numbers. That sounds confusing to me, so let me give an example.
I have a range of numbers, (0, 100), and a list of each number's cube. I have to compute every combination of 3 cubes from this set: for example, 0^3 + 0^3 + 0^3, 1^3 + 0^3 + 0^3, ... 98^3 + 99^3 + 100^3. I hope that makes sense; I'm not sure if I explained it well enough.
So anyways, after all the sums are computed and checked against a list of numbers to see if any match, the program moves on to the next set, (100, 200). This set needs to compute everything from 100-200 + 0-200 + 0-200. Then (200, 300) will need to do 200-300 + 0-300 + 0-300, and so on.
So, my question is: does the time a CPU takes to add numbers grow with their size? And will the time taken for each set increase at a predictable rate, and is that growth exponential or something slower?
The time to add two numbers is logarithmic with the magnitude of the numbers, or linear with the size (length) of the numbers.
For a 32-bit computer, numbers up to 2^32 will take 1 unit of time to add, numbers up to 2^64 will take 2 units, etc.
As I understand the question you have roughly 100*100*100 combinations for the first set (let's ignore that addition is commutative). For the next set you have 100*200*200, and for the third you have 100*300*300. So it looks like you have an O(n^2) process going on there. So if you want to calculate twice as many sets, it will take you four times as long. If you want to calculate thrice as many, it's going to take nine times as long. This is not exponential (such as 2^n), but usually referred to as quadratic.
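To see the quadratic growth concretely, here is a quick Python check following the answer's accounting (the function name and block sizes are assumptions taken from the question's description):

```python
# Hypothetical count of (a, b, c) triples examined in set k: the first
# number ranges over a block of 100 values, the other two over [0, 100*k).
def combinations_in_set(k):
    return 100 * (100 * k) * (100 * k)

# Doubling or tripling the number of sets multiplies the work by 4 and 9:
ratios = [combinations_in_set(k) // combinations_in_set(1) for k in (1, 2, 3)]
print(ratios)   # [1, 4, 9], i.e. quadratic growth in k
```

Quadratic, not exponential: each successive set is predictably more expensive, but the cost of set k grows like k^2, not 2^k.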
It depends on how long "and so on" lasts. As long as your maximum number, cubed, fits in your longest integer type, no: addition always takes just one instruction, so it's constant time.
Now, if you assume an arbitrary-precision machine, say writing these numbers on the tape of a Turing machine in decimal symbols, then adding will take longer. In that case, consider how long: think about how the length of a string of decimal symbols grows to represent a number n. Addition will take time at least proportional to that length.
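A quick Python illustration of that growth (Python integers are arbitrary precision, so len(str(n)) is exactly the decimal length the machine has to process):

```python
# Each extra decimal digit multiplies the representable range by 10,
# so the written length of n grows like log10(n): logarithmic in the
# magnitude, which is the lower bound on arbitrary-precision addition time.
for n in (10**3, 10**6, 10**9, 10**12):
    print(n, "->", len(str(n)), "digits")   # 4, 7, 10, 13 digits
```

So multiplying the numbers by a thousand only adds three digits of work per addition.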