Getting caught in loops in R

I am looking at whether or not certain 'systems' for betting really do work as claimed, namely, that they have a positive expectation. One such system is based on the rebate on loss. You basically have a large master pot, say $1 million. Your bankroll for each game is $50k.
The way it works is as follows:
Start with $50k, always bet on banker
If you win, add the money to the master pot. Then play again with $50k.
If you lose (now you're at $30k), play till you either:
(a) hit 0, in which case you get a rebate of 10% and begin playing again with $50k + $5k = $55k; or
(b) win back more than the initial bankroll, in which case you add the excess to the master pot and play again with $50k.
Continue until you double the master pot.
I just can't find an easy way of programming the possible cases in R, since play can go down an arbitrarily long (if improbable) path.
For example, you start at 50k: lose 20, win 19, and now you're at 49; lose 20, lose 20, and now you're at 9. You either lose the 9 and get back $5k, or you win, and this cycle continues until you either end up with more than $50k or hit 0, get the 10% rebate on the $50k, and start again with $50k + $5k = $55k.
Here's some code I started, but I haven't figured out a good way of handling the cases where you get stuck, or of keeping track of the number of games played. Thanks again for your help. Obviously, I understand you may be busy and not have time.
p.loss <- .4462466
p.win  <- .4585974
p.tie  <- 1 - (p.win + p.loss)
prob   <- c(p.win, p.tie, p.loss)
bet  <- 20                # units: $1k
x    <- c(19, 0, -20)     # banker win (pays 0.95:1), tie, loss
r    <- 10                # rebate = 10%, i.e. br.i/r = 5 back after a bust
br.i <- 50                # bankroll for each game
br   <- 200               # master pot
goal <- 2 * br            # stop when the master pot doubles
n.games   <- 0
max.games <- 1e6          # safety cap so an unlucky path can't run forever

# NB: the stated rules are ambiguous about whether, after a rebate, the
# "win more than the initial bankroll" target is 50 or 55; here it is
# taken to be the current buy-in.
start <- br.i             # buy-in for the next game (br.i + br.i/r after a bust)
while (br < goal && br >= start && n.games < max.games) {
  cbr.i <- start
  repeat {
    n.games <- n.games + 1
    stake <- min(bet, cbr.i)  # bet whatever is left when under 20
    cbr.i <- cbr.i + sample(x, 1, prob = prob) * stake / bet
    if (cbr.i > start) {      # ahead of the buy-in: bank the profit
      br    <- br + (cbr.i - start)
      start <- br.i
      break
    }
    if (cbr.i <= 0) {         # busted: the master pot loses the buy-in;
      br    <- br - start     # the next game starts with the rebate added
      start <- br.i + br.i / r
      break
    }
    if (n.games >= max.games) break
  }
}
c(master.pot = br, games.played = n.games)

The way you've phrased the rules doesn't make the game entirely clear to me, but here's some general advice on how you might solve your problem.
First of all, sit down with pen and paper, and see if you can make some progress towards an analytic solution. If the game is sufficiently complicated, this might not be possible, but you may get some more insight into how the game works.
The next step, if that fails, is to run a simulation. This means writing a function that accepts a starting level of player cash, a level of house cash (optionally this could be infinite), and a maximum number of bets to place. It then simulates playing the game, placing bets according to your betting system, until either:
i. The player goes broke
ii. The house goes broke
iii. You reach the maximum number of bets. (You need this maximum so you don't get stuck simulating forever.)
The function should return the amount of cash that the player has after all the bets have been placed.
Run this function lots of times, and compare the end cash with the starting cash. The average of end cash / start cash is a measure of your expectation.
Try the simulation with different inputs. (For instance, with many gambling games, even if you could theoretically make an infinite amount of money in the long run, stochastic variation means that you go broke before you get there.)
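To make that concrete, here's a minimal sketch of such a simulation in R; the even-money coin flip is a stand-in for your actual game and betting system:
play <- function(player.cash, house.cash = Inf, max.bets = 10000,
                 bet = 1, p.win = 0.5) {
  for (i in seq_len(max.bets)) {
    if (player.cash < bet || house.cash < bet) break  # someone went broke
    if (runif(1) < p.win) {                           # place one bet
      player.cash <- player.cash + bet
      house.cash  <- house.cash  - bet
    } else {
      player.cash <- player.cash - bet
      house.cash  <- house.cash  + bet
    }
  }
  player.cash  # cash after all bets have been placed
}

start.cash <- 100
end.cash   <- replicate(10000, play(start.cash))
mean(end.cash / start.cash)  # estimate of the expectation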

Related

Flipping coin simulation with gain/loss

Suppose you successively toss a fair coin; each time the result is heads you win $1, while if you get tails you lose $1. Your initial capital is $3. The throws stop if your capital is zeroed or you reach $10. Let X_n be the process that describes your capital after the nth throw.
1. Simulate the X_n process 1000 times and present the graph of its evolution in R.
2. Estimate the average number of consecutive throws until you stop. Is the result expected?
Can someone help me solve this or at least understand the steps I am supposed to take?
Someone already posted a link to a solution of your homework in the comments. I fear, however, that this uncommented code is incomprehensible to you, given that you have asked the question in the first place.
I would therefore suggest first writing your own implementation with an outer for loop and an inner while loop conditioned on the running capital; call rbinom in each run and recompute the running capital. Store the length of each resulting run in a numeric vector and call mean on this vector.
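A direct (unvectorized) implementation along those lines might look like the sketch below; for a fair coin starting at $3 with absorbing barriers at 0 and 10, the theoretical mean duration is 3 * (10 - 3) = 21 throws, which the estimate should approach:
n.sim  <- 1000
throws <- numeric(n.sim)
paths  <- vector("list", n.sim)
for (i in seq_len(n.sim)) {
  capital <- 3
  path <- capital
  while (capital > 0 && capital < 10) {             # stop at $0 or $10
    capital <- capital + 2 * rbinom(1, 1, 0.5) - 1  # +1 heads, -1 tails
    path <- c(path, capital)
  }
  paths[[i]] <- path
  throws[i] <- length(path) - 1
}
mean(throws)                                        # compare with 21
plot(paths[[1]], type = "s", xlab = "throw", ylab = "capital")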
It will start becoming interesting when you measure the runtime of your solution, which will be surprisingly slow. To speed it up, you must use "vectorization", which the linked solution uses, but this is a completely different topic to be left for a different lesson...

Friends selection algorithm

In a .NET project we have a group of 200 people of two types, let's say x and y, who need to be separated into groups of 7 or 8.
We have a web page where the people write other members they want to be in a group with. Each person builds a list of wanted members.
After this, there should be an algorithm to build the 7-8 member groups considering the people's preferences, subject to the following condition: each group has at least 2 people of each type (x/y).
I'm pretty sure there must be a well-known algorithm for something like this, but I didn't find one. Does anyone know how to do it?
This problem smells NP-hard, so I suggest using Artificial Intelligence tools.
A possible approach is steepest ascent hill climbing [SAHC]
First, we define our utility function u as mentioned in the comments to the question: for each user, the number of wanted members who ended up in that user's group, summed over all users. Define u(s) = -1 for an illegal solution s. (A concrete sketch of u in R follows the pseudocode below.)
Next, we define our 'world': S is the set of all possible solutions.
For each solution s in S we define:
next(s) = {all solutions obtained by moving one person to a different group}
All we have to do now is run SAHC with random restarts:
1. best <- -INFINITY
2. while there is more time:
3. s <- a random legal solution
4. NEXT <- next(s)
5. if max{ u(n) : n in NEXT } <= u(s): // s is the top of a hill
5.1. if u(s) > best: best <- u(s); best.s <- s // if s is better than the previous result, store it
5.2. go to 2. // restart the hill climbing from a different random point
6. else:
6.1. s <- argmax{ u(n) : n in NEXT } // climb the steepest hill
6.2. go to 4.
7. return best.s // when out of time, return the best solution found so far
It is an anytime algorithm, meaning it will get a better result the more time you give it to run, and eventually [at time infinity] it will find the optimal result.
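As promised, a minimal sketch of the utility function u in R. The inputs are hypothetical: wants is a 0/1 matrix where wants[i, j] == 1 means person i listed person j, type is a vector of "x"/"y" labels, and groups assigns each person a group number.
u <- function(groups, wants, type) {
  for (g in unique(groups)) {
    members <- which(groups == g)
    if (length(members) < 7 || length(members) > 8) return(-1)  # illegal size
    if (sum(type[members] == "x") < 2 ||
        sum(type[members] == "y") < 2) return(-1)               # illegal type mix
  }
  # for each person, count the wanted members who share their group
  # (assumes wants[i, i] == 0, i.e. nobody lists themselves)
  sum(sapply(seq_along(groups),
             function(i) sum(wants[i, groups == groups[i]])))
}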

Mysterious combination

I decided to learn concurrency and wanted to find out in how many ways instructions from two different processes could overlap. The code for both processes is just a 10 iteration loop with 3 instructions performed in each iteration. I figured out that the problem consisted of leaving X instructions fixed at a point and then fitting the other X instructions from the other process into the spaces, taking into account that they must stay ordered (instruction 4 of process B must always come before instruction 20).
I wrote a program to count this number. Looking at the results, I found out that the solution is n combination k, where k is the number of instructions executed throughout the whole loop of one process (so for 10 iterations it would be 30), and n is k*2 (2 processes). In other words: n objects, with n/2 of them fixed, and the other n/2 to be fitted among the spaces without losing their order.
OK, problem solved. No, not really. I have no idea why this is. I understand what the definition of a combination is: the number of ways you can take k elements from a group of n such that all the groups are different but the order in which you take the elements doesn't matter. In this case we have n elements and we are actually taking them all, because all the instructions are executed (n C n).
If one explains it by saying that there are 2k blue (A) and red (B) objects in a bag and you take k objects from the bag, you are still only taking k instructions when 2k instructions are actually executed. Can you please shed some light on this?
Thanks in advance.
FWIW it can be viewed like this: you have a bag with k blue and k red balls. Balls of the same color are indistinguishable, in analogy with the restriction that the order of instructions within the same process/thread is fixed (which is not true in modern processors, by the way, but let's keep it simple for now). How many different ways can you pull all the balls from the bag?
My combinatorial skills are quite rusty, but my first guess is
(2k)! / (k! * k!)
which, according to Wikipedia, indeed equals the binomial coefficient (2k choose k).
For n processes, it can be generalized by having balls of n different colors in the bag.
Update: Note that in the strict sense, this models only the situation when different processes are executed on a single processor, so all instructions from all processes must be ordered linearly on the processor level. In a multiprocessor environment, several instructions can be executed literally at the same time.
Generally, I agree with Péter's answer, but since it does not seem to have fully clicked for the OP, here's my shot at it (purely from a mathematical/combinatorial standpoint).
You have 2 sets of 30 (k) instructions that you're putting together, for a total of 60 (n) instructions. Since each set of 30 must be kept in order, we don't need to track which instruction within a set it is, just which set each instruction is from. So, we have 60 "slots" in which to place 30 instructions from one set (say, red) and 30 instructions from the other set (say, blue).
Let's start by placing the 30 red instructions into the 60 slots. There are (60 choose 30) = 60!/(30!30!) ways to do this (we're choosing which 30 slots of the 60 are filled by red instructions). Now, we still have the 30 blue instructions, but we only have 30 open slots left. There is (30 choose 30) = 30!/(30!0!) = 1 way to place the blue instructions in the remaining slots. So, in total, there are (60 choose 30) * (30 choose 30) = (60 choose 30) * 1 = (60 choose 30) ways to do it.
Now, let's suppose that instead of 2 sets of 30, you have 3 sets (red, green, blue) of k instructions. You have a total of 3k slots to fill. First, place the red ones: (3k choose k) = (3k)!/(k!(3k-k)!) = (3k)!/(k!(2k)!). Now, place the green ones into the remaining 2k slots: (2k choose k) = (2k)!/(k!k!). Finally, place the blue ones into the last k slots: (k choose k) = k!/(k!0!) = 1. In total: (3k choose k) * (2k choose k) * (k choose k) = ( (3k)! * (2k)! * k! ) / ( k!(2k)! * k!k! * k!0! ) = (3k)!/(k!k!k!).
As further extensions (though I'm not going to provide a full explanation):
if you have 3 sets of instructions with length a, b, and c, the number of possibilities is (a+b+c)!/(a!b!c!).
if you have n sets of instructions where the ith set has ki instructions, the number of possibilities is (k1+k2+...+kn)!/(k1!k2!...kn!).
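A quick numeric check of these formulas in R:
choose(60, 30)                          # two sets of 30: ~1.18e17 interleavings
k <- 10
choose(3*k, k) * choose(2*k, k) * choose(k, k)
factorial(3*k) / factorial(k)^3         # multinomial form, same value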
Péter's answer is fine enough, but it doesn't explain just why concurrency is difficult. That's because more and more often nowadays you've got multiple execution units available (be they cores, CPUs, nodes, computers, whatever). That in turn means that the possibilities for overlap between instructions are increased still further; there's no guarantee that what happens can be modeled correctly with any conventional interleaving.
This is why it is important to use semaphores/mutexes correctly, and why memory barriers matter: all of these things turn the truly nasty picture into something that is far easier to understand. But because mutexes reduce the number of possible executions, they also reduce the overall performance and potential efficiency. It's definitely tricky, and that in turn is why it is far better to work in terms of message passing between threads of activity that do not otherwise interact; it's easier to understand, and having fewer synchronizations is better.

Finding area of straight line with graph (Math question but needed for flot)

Okay, so this is a straight math question and I read up on meta that those need to be written to sound like programming questions. I'll do my best...
So I have a graph made in flot that shows the network usage (in bytes/sec) for the user. The data is 4 minutes apart when there is activity, and otherwise set at the start of the usage range (let's say day 1) and the end of the range (day 7). The data is coming from a CGI script I have no control over, so I'm fairly limited in what I can provide the user.
I never took trig or calculus, so I'm pretty much in over my head. What I want is for the user to have the option to click any point on the graph and see their bandwidth usage for that moment. Since the lines between real data points are drawn straight, this can be done by getting the points before and after where the user has clicked and finding the y-interval.
It took me weeks to finally get a helpful math person to explain this to me. Everyone else has insisted on trying to teach me Riemann sum techniques and all sorts of other heavy stuff that is not only confusing to me, but doesn't seem necessary for the problem.
But I also want the user to be able to highlight the graph from two arbitrary points on the x-axis (time) to get the total amount of network usage during that range. I know this would be inaccurate, but I need it to be the right inaccurate using a solid equation.
I thought this was the area under the line, but experiments with much simpler graphs make this seem just far too high. I figured out I could take the distance from y2 - y1 and multiply it by x2 - x1 and then divide by two to get the area of the graph below the line like a triangle, but again, the numbers seemed too high. (Maybe they are just big numbers and I don't get this math stuff at all.)
So what I need, if anyone would be really awesome enough to provide it before this question is closed down for being too pure-math, is either the name of the concept I should be researching or the equation itself. Or the bad news that I do need advanced math to get an accurate result.
As a last note: I am not bad at math, I just am not familiar with math beyond 10th grade, so I need some place to start. All the math sites seem to either keep it too simple or go way over my pay grade.
If I understood correctly what you're asking (and that is somewhat doubtful), you should find what you seek in these links:
Linear interpolation
(calculating the value of the point in between)
Trapezoidal rule
(calculating the area below the "curve")
Edit, so we can get this over :) without much ado:
So I have a graph made in flot that shows the network usage (in bytes/sec) for the user. The data is 4 minutes apart when there is activity, and otherwise set at the start of the usage range (let's say day 1) and the end of the range (day 7). The data is coming from a CGI script I have no control over, so I'm fairly limited in what I can provide the user.
What is a "flot" ?
Okey, so you have speed on y axis [in bytes/sec]; and time on x axis in [sec], right?
That means, that if you're flotting (I'm bored, yes :) speed over time, in linear segments, interpolating at some particular point in time you'll get speed at that particular point in time.
If you wish to calculate how much bandwidth you've spend, you need to determine the area beneath that curve. The area from point "a" to point "b" will determine the spended bandwidth in [bytes] in that time period.
It took me weeks to finally get a helpful math person to explain this to me. Everyone else has insisted on trying to teach me Riemann sum techniques and all sorts of other heavy stuff that is not only confusing to me, but doesn't seem necessary for the problem.
In the immortal words of Snoopy: "Good grief !"
But I also want the user to be able to highlight the graph from two arbitrary points on the x-axis (time) to get the total amount of network usage during that range. I know this would be inaccurate, but I need it to be the right inaccurate using a solid equation.
It would not be inaccurate.
It would actually be perfectly accurate (well, apart from roundoff error in bytes :), since you're using linear interpolation on linear segments.
I thought this was the area under the line, but experiments with much simpler graphs make this seem just far too high. I figured out I could take the distance from y2 - y1 and multiply it by x2 - x1 and then divide by two to get the area of the graph below the line like a triangle, but again, the numbers seemed too high. (Maybe they are just big numbers and I don't get this math stuff at all.)
"like a triangle" --> should be "like a trapezoid"
If you do deltax*(y1+y2)/2 you will get the area (this works only for linear segments). This is the basic principle of the trapezoidal rule.
If you're uncertain about what you're calculating, use dimensional analysis: speed is in bytes/sec, time is in sec, bandwidth is in bytes. Multiplying speed*time = bandwidth, and so on.
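For concreteness, here's a minimal sketch of both operations in R; the function names and the inputs t and y (sample times in seconds, rates in bytes/sec) are made up for illustration:
# instantaneous value at time t0: linear interpolation between samples
rate.at <- function(t0, t, y) approx(t, y, xout = t0)$y

# total bytes between times a and b: trapezoidal rule over the segments
bytes.between <- function(a, b, t, y) {
  tt <- c(a, t[t > a & t < b], b)       # endpoints plus interior sample times
  yy <- approx(t, y, xout = tt)$y
  sum(diff(tt) * (head(yy, -1) + tail(yy, -1)) / 2)
}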
What I want is for the user to have the option to click any point on the graph and see their bandwidth usage for that moment. Since the lines between real data points are drawn straight, this can be done by getting the points before and after where the user has clicked and finding the y-interval.
Yes, that's a good way to find that instantaneous value. When you report that value back, it's in the same units as the y-axis, so that means bytes/sec, right?
I don't know how rapidly the rate changes between points, but it's even simpler if you simply pick the closest point and report its value. You simplify your problem without sacrificing too much accuracy.
I thought this was the area under the line, but experiments with much simpler graphs make this seem just far too high. I figured out I could take the distance from y2 - y1 and multiply it by x2 - x1 and then divide by two to get the area of the graph below the line like a triangle, but again, the numbers seemed too high. (Maybe they are just big numbers and I don't get this math stuff at all.)
To calculate the total bytes over a given time interval, you should find the indices closest to the starting and ending points, multiply each value of y by the spacing of your x-points, and add them all together. That will give you the total # of bytes consumed during that time interval, but there's one more wrinkle you might have forgotten.
You said that the points come in "4 minutes apart", and your y-axis is in bytes/second. Remember that units matter. Your area is the sum of bytes/second times a spacing in minutes. To make the units come out right you have to multiply by 60 seconds/minute to get the final value of bytes that you want.
If that "too high" value is still off, consider units again. It's 1024 bytes per kbyte, and 1024*1024 bytes per MB. Check the units of the values you're checking the calculation against.
UPDATE:
No wonder you're having problems. Your original question CLEARLY stated bytes/sec. Even this question is imprecise and confusing. How did you arrive at "amount of data" at a given time stamp? Are those the total bytes transferred since the last time stamp? If yes, simply add the values between the start and end of the interval you want and convert to whatever units are convenient for you.
The network usage total is not in bytes (kilo-, mega-, whatever) per second. It would be in just straight bytes (or kilo-, or whatever).
For example, 2 megabytes per second over an interval of 10 seconds would be 20 megabytes total. It would not be 20 megabytes per second.
Or do you perhaps want average bytes per second over an interval?
This would be a lot easier for you if you would accept that there is well-established terminology for the concepts that you are having trouble expressing concisely or accurately, and that these mathematical terms have been around far longer than you. Since you've clearly gone through most of the trouble of understanding the concepts, you might as well break down and start calling them by their proper names.
That said:
There are 2 obvious ways to graph bandwidth, and two ways you might be getting the bandwidth data from the server. First, there's the cumulative usage function, which for any time is simply the total amount of data transferred since the start of the measurement. If you plot this function, you get a graph that never decreases (since you can't un-download something). The units of the values of this function will be bytes or kB or something like that.
What users are typically interested in is the instantaneous usage function, an indicator of how much bandwidth you are using right now. In mathematical terms, this is the derivative of the cumulative function. This derivative can take on any value from 0 (you aren't downloading) to the rated speed of your network link (indicating that you're pushing as much data as possible through your connection). The units of this function are bytes per second, or something related like Mbps (megabits per second).
You can approximate the instantaneous bandwidth with the average data usage over the past few seconds. This is computed as
(number of bytes transferred)
-----------------------------------------------------------------
(number of seconds that elapsed while transferring those bytes)
Generally speaking, the smaller the time interval, the more accurate the approximation. For simplicity's sake, you usually want to compute this as "number of bytes transferred since last report" divided by "number of seconds since last report".
As an example, if the server is giving you a report every 4 minutes of "total number of bytes transferred today", then it is giving you the cumulative function and you need to approximate the derivative. The instantaneous bandwidth usage rate you can report to users is:
(total transferred as of now) - (total as of 4 minutes ago) bytes
-----------------------------------------------------------
4*60 seconds
If the server is giving you reports of the form "number of bytes transferred since last report", then you can directly report this to users and plot that data relative to time. On the other hand, if the user (or you) is concerned about a quota on total bytes transferred per day, then you will need to transform the (approximately) instantaneous data you have into the cumulative data. This process, known as computing the integral, is the opposite of computing the derivative, and is in some ways conceptually simpler. If you've kept track of each of the reports from the server and the timestamp, then for each time, the value you plot is the total of all the reports that came in before that time. If you're doing this in realtime, then every time you get a new report, the graph jumps up by the amount in that report.
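Both directions are essentially one-liners in R; the names ts, total, and per.report are hypothetical (report timestamps in seconds, cumulative bytes, and bytes since the last report):
rate <- diff(total) / diff(ts)    # cumulative -> instantaneous: approximate derivative, in bytes/sec
cumulative <- cumsum(per.report)  # per-report -> cumulative: the integral, in bytes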
I am not bad at math, ... I just am not familiar with math beyond 10th grade
This is like saying "I'm not bad at programming, I have no trouble with ifs and loops but I never got around to writing more than one function."
I would suggest you enrol in a maths class of some kind. An understanding of matrices and the basics of calculus gives you an appreciation of many things, and can be useful in all sorts of areas. You'll be able to understand more of Wikipedia articles and SO answers - and questions!
If you can't afford that, try to find some lecture videos or something.
Everyone else has insisted on trying to teach me Riemann sum techniques
I can't see why. You don't need them for this - though if you had learned them, I expect you would find it easier to come up with a solution. You see, Riemann sums attempt to give you a "familiar" notion of area. The sort of area you (hopefully) learned years ago.
Getting the area below your usage graph between two points will tell you (approximately) how much was used over that period.
How do you find the area of a floor plan? You break it up into rectangles and triangles, find the area of each, and add them together. You can do the same thing with your graph, basically. Someone has worked out a simple way of doing this called the trapezoidal rule. It's just a matter of choosing how to divide your graph into strips, and in your case this is easy: just use the data points themselves as dividers. (You'll also need to work out the value of the graph at the left and right ends of the region selected by the user, using linear interpolation.)
If there's anything I've said that isn't clear to you (as there may well be), please leave a comment.

Detecting and fixing overflows

We have a particle detector hard-wired to use 16-bit and 8-bit buffers. Every now and then, there are certain [predicted] peaks of particle fluxes passing through it; that's okay. What is not okay is that these fluxes usually reach magnitudes above the capacity of the buffers to store them; thus, overflows occur. On a chart, they look like the flux suddenly drops and begins growing again. Can you propose a [mostly] accurate method of detecting points of data suffering from an overflow?
P.S. The detector is physically inaccessible, so fixing it the 'right way' by replacing the buffers doesn't seem to be an option.
Update: Some clarifications as requested. We use python at the data processing facility; the technology used in the detector itself is pretty obscure (treat it as if it was developed by a completely unrelated third party), but it is definitely unsophisticated, i.e. not running a 'real' OS, just some low-level stuff to record the detector readings and to respond to remote commands like power cycle. Memory corruption and other problems are not an issue right now. The overflows occur simply because the designer of the detector used 16-bit buffers for counting the particle flux, and sometimes the flux exceeds 65535 particles per second.
Update 2: As several readers have pointed out, the intended solution would have something to do with analyzing the flux profile to detect sharp declines (e.g. by an order of magnitude) in an attempt to separate them from normal fluctuations. Another problem arises: can restorations (points where the original flux drops back below the overflow level) be detected by simply running the correction program against the reversed (along the x axis) flux profile?
// Java (the original pseudocode made concrete; the same idea works in C)
int[] unwrap(short[] x)
{
    int[] y = new int[x.length];
    y[0] = x[0] & 0xFFFF;  // raw 16-bit reading, treated as unsigned
    for (int i = 1; i < x.length; i++)
    {
        y[i] = y[i - 1] + signExtend((short) (x[i] - x[i - 1]));
        // works fine as long as the "real" values of x[i] and x[i-1]
        // differ by less than 1/2 of the span of allowable values
        // of x's storage type (= 32768 in the case of 16-bit buffers).
        // Otherwise there is ambiguity.
    }
    return y;
}

int signExtend(short x)
{
    return x;  // widening a short to an int sign-extends in Java and in C
}

// exercise for the reader: write similar code to unwrap 8-bit arrays
// to a 16-bit or 32-bit array
Of course, ideally you'd fix the detector software to max out at 65535 to prevent wraparound of the sort that is causing your grief. I understand that this isn't always possible, or at least isn't always possible to do quickly.
When the particle flux exceeds 65535, does it do so quickly, or does the flux gradually increase and then gradually decrease? This makes a difference in what algorithm you might use to detect this. For example, if the flux goes up slowly enough:
true flux    measurement
    5000         5000
   10000        10000
   30000        30000
   50000        50000
   70000         4464
   90000        24464
   60000        60000
   30000        30000
   10000        10000
then you'll tend to have a large negative drop at times when you have overflowed. A much larger negative drop than you'll have at any other time. This can serve as a signal that you've overflowed. To find the end of the overflow time period, you could look for a large jump to a value not too far from 65535.
All of this depends on the maximum true flux that is possible and on how rapidly the flux rises and falls. For example, is it possible to get more than 128k counts in one measurement period? Is it possible for one measurement to be 5000 and the next measurement to be 50000? If the data is not well-behaved enough, you may be able to make only a statistical judgment about when you have overflowed.
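For what it's worth, under the same assumption (true steps smaller than half the 16-bit span) the unwrapping itself is compact in R; m is a hypothetical vector of raw 16-bit readings:
unwrap16 <- function(m) {
  d <- (diff(m) + 32768) %% 65536 - 32768  # map each step into [-32768, 32767]
  cumsum(c(m[1], d))
}
unwrap16(c(50000, 4464))  # recovers 50000 70000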
Your question needs to provide more information about your implementation - what language/framework are you using?
Data overflows in software (which is what I think you're talking about) are bad practice and should be avoided. What you are seeing (strange data output) is only one possible side effect of data overflows, and it is merely the tip of the iceberg of the sorts of issues you can see.
You could quite easily experience more serious issues like memory corruption, which can cause programs to crash loudly, or worse, obscurely.
Is there any validation you can do to prevent the overflows from occurring in the first place?
I really don't think you can fix it without fixing the underlying buffers. How are you supposed to tell the difference between the sequences of values (0, 1, 2, 1, 0) and (0, 1, 65538, 1, 0)? You can't.
How about using an HMM where the hidden state is whether you are in an overflow and the emissions are the observed particle flux?
The tricky part would be coming up with the probability models for the transitions (which will basically encode the time-scale of peaks) and for the emissions (which you can build if you know how the flux behaves and how overflow affects measurement). These are domain-specific questions, so there probably aren't ready-made solutions out there.
But once you have the model, everything else (fitting your data, quantifying uncertainty, simulation, etc.) is routine.
You can only do this if the actual jumps between successive values are much smaller than 65536. Otherwise, an overflow-induced valley artifact is indistinguishable from a real valley; you can only guess. You can try to match overflows to their corresponding restorations by analysing the signal from the right and the left simultaneously (assuming that there is a recognizable baseline).
Other than that, all you can do is adjust your experiment by repeating it with different original particle flows, so that the real valleys will not move but the artifact ones will move to the point of overflow.
