Monte Carlo simulations involving multiple probabilities - math

I have to create code that displays the probability of drawing the ace of spades and then any 2. The main problem is that I can only write code that displays the probability of drawing the ace of spades and then the 2 of spades, i.e. 1/52 * 1/51; I cannot get 1/52 and 4/51. How do I get these probabilities?
Here is the code I have so far:
M = 100000; %number of MC experiments to run
N = 0; %number of successful MC experiments
P = 0; %probability
figure(1); %create a new figure window
hold on; %hold all plots
%start experiment loop
for i=1:M
    deck = randperm(52)'; %generate deck of cards, 52x1 column vector
    pos1 = randi(52); %select position to draw from randomly
    pos2 = randi(52); %select position to draw from randomly
    while pos2 == pos1
        pos2 = randi(52);
    endwhile
    if (deck(pos1) == 1 && deck(pos2) == 2)
        N += 1; %increment number of successful experiments
    endif
    plot(i,N/M,'r*') %plot running estimate of the probability
endfor
hold off; %release all plots
P = N/M; %calculate probability
format long %prefer long format
disp('Probability of drawing Ace of Spades and a 2 is:'), disp(P)

Originally a comment but exceeded the word count. More of a hint than an answer:
First of all, you should be clear about your encoding. What corresponds to an ace of spades? What corresponds to a 2? Exactly how does 1:52 map to the deck? I can't think of a natural encoding in which the number 1 corresponds to the ace of spades and the number 2 corresponds to a card which is a 2. A principled solution is to use quotient and remainder upon division by 4 to determine rank and suit respectively. A cheap but serviceable solution (even though it isn't very natural) is to have 1 correspond to the ace of spades and the numbers 2,3,4,5 correspond to the 4 twos (leaving the rest of the encoding unspecified).
Once you get that straight in your mind, just shuffle the deck and look at the top two cards (no reason to further randomize the selection of cards, pos1 and pos2 are pointless: just use 1,2). Is the top one the ace of spades under your encoding? Is the next one a 2 under your encoding?
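A minimal sketch of this approach in Python (the logic ports directly to the Octave code above), assuming the cheap encoding from the hint: 1 is the ace of spades and 2 through 5 are the four twos, with the rest of the encoding left unspecified.
import random

M = 100000                 # number of Monte Carlo experiments
N = 0                      # number of successful experiments
deck = list(range(1, 53))  # encoding: 1 = ace of spades, 2..5 = the four 2s
for _ in range(M):
    random.shuffle(deck)   # shuffle the whole deck each experiment
    # is the top card the ace of spades, and the next card any 2?
    if deck[0] == 1 and deck[1] in (2, 3, 4, 5):
        N += 1
print("Estimated probability:", N / M)  # exact: (1/52)*(4/51) ~ 0.00151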

Related

Random walk on the integers - computing the "steps" in R

The walk starts at 0 and at each step moves 1 or -1 with equal probability. To illustrate this walk, consider the following: a marker is placed at zero on the number line, and a fair coin is flipped. If the coin lands on heads, the marker is moved one unit to the right. If it lands on tails, the marker is moved one unit to the left. Basically, I know how to simulate a random walk, but instead of having the position x for each n:
function(n) {
  return(cumsum(c(0, sample(c(-1, 1), size = n - 1, replace = TRUE))))
}
I want to get a vector with each modification (either 1 or -1), but I don't really know how to modify my current function. Maybe with another variable counting the difference between each position?
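A hint, sketched in Python rather than R for illustration: the per-step moves are the sampled +/-1 vector itself, and the positions are its cumulative sum, so you can either return the sample before accumulating it or recover the steps by differencing the positions.
import random

def random_walk(n):
    # each move is +1 or -1 with equal probability (n-1 moves, n positions)
    steps = [random.choice((-1, 1)) for _ in range(n - 1)]
    positions = [0]
    for s in steps:
        positions.append(positions[-1] + s)  # cumulative sum of the steps
    return steps, positions
In the R function above, that means returning the sample(c(-1, 1), size = n - 1, replace = TRUE) vector directly, or applying diff() to the cumsum() result.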

Plotting large time series

Summary of Question:
Are there any easy-to-implement algorithms for reducing the number of points needed to represent a time series without altering how it appears in a plot?
Motivating Problem:
I'm trying to interactively visualize 10 to 15 channels of data logged from an embedded system at ~20 kHz. Logs can cover upwards of an hour of time which means that I'm dealing with between 1e8 and 1e9 points. Further, I care about potentially small anomalies that last for very short periods of time (i.e. less than 1 ms) such that simple decimation isn't an option.
Not surprisingly, most plotting libraries get a little sad if you do the naive thing and try to hand them arrays of data larger than the dedicated GPU memory. It's actually a bit worse than this on my system; using a vector of random floats as a test case, I'm only getting about 5e7 points out of the stock Matlab plotting function and Python + matplotlib before my refresh rate drops below 1 FPS.
Existing Questions and Solutions:
This problem is somewhat similar to a number of existing questions such as:
How to plot large data vectors accurately at all zoom levels in real time?
How to plot large time series (thousands of administration times/doses of a medication)?
[Several Cross Validated questions]
but deals with larger data sets and/or is more stringent about fidelity at the cost of interactivity (it would be great to get 60 FPS silky smooth panning and zooming, but realistically, I would be happy with 1 FPS).
Clearly, some form of data reduction is needed. There are two paradigms that I have found while searching for existing tools that solve my problem:
Decimate but track outliers: A good example of this is Matlab + dsplot (i.e. the tool suggested in the accepted answer of the first question I linked above). dsplot decimates down to a fixed number of evenly spaced points, but then adds back in outliers identified using the standard deviation of a high pass FIR filter. While this is probably a viable solution for several classes of data, it potentially has difficulties if there is substantial frequency content past the filter cutoff frequency and may require tuning.
Plot min and max: With this approach, you divide the time series up into intervals corresponding to each horizontal pixel and plot just the minimum and maximum values in each interval. Matlab + Plot (Big) is a good example of this, but uses an O(n) calculation of min and max, making it a bit slow by the time you get to 1e8 or 1e9 points. A binary search tree in a mex function or python would solve this problem, but is complicated to implement. Rough sketches of both ideas follow.
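To make the two paradigms concrete, here is a rough sketch of each in Python/NumPy. These are illustrations under stated assumptions, not the dsplot or Plot (Big) implementations; the filter length, cutoff, and sigma threshold in the first are arbitrary tuning knobs, which is exactly the tuning concern noted above.
import numpy as np
from scipy import signal

def decimate_with_outliers(y, n_target, cutoff=0.25, n_sigma=4.0):
    # Paradigm 1: evenly spaced decimation, then re-add high-pass outliers.
    keep = np.zeros(len(y), dtype=bool)
    keep[::max(1, len(y) // n_target)] = True       # evenly spaced points
    b = signal.firwin(65, cutoff, pass_zero=False)  # high-pass FIR filter
    hp = signal.lfilter(b, 1.0, y)
    keep |= np.abs(hp) > n_sigma * hp.std()         # re-add flagged anomalies
    return np.flatnonzero(keep)                     # indices of points to plot

def minmax_downsample(y, n_pixels):
    # Paradigm 2: keep the min and max of each pixel-sized interval.
    n = len(y) - len(y) % n_pixels        # trim so the bins divide evenly
    bins = y[:n].reshape(n_pixels, -1)    # one row per horizontal pixel
    out = np.empty(2 * n_pixels)
    out[0::2] = bins.min(axis=1)          # each pixel's vertical extent is
    out[1::2] = bins.max(axis=1)          # spanned by its min and max
    return out
The min/max sketch is the O(n) variant; the open question is whether a precomputed structure (e.g. a tree of interval minima/maxima) is needed to keep it interactive at 1e8 to 1e9 points.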
Are there any simpler solutions that do what I want?
Edit (2018-02-18): Question refactored to focus on algorithms instead of tools implementing algorithms.
I had the very same problem displaying pressure time series from hundreds of sensors, with samples every minute over several years. In some cases (like when cleaning the data) I wanted to see all the outliers; in others I was more interested in the trend. So I wrote a function that can reduce the number of data points using two methods: Visvalingam and Douglas-Peucker. The first tends to remove outliers, and the second keeps them. I've optimized the function to work over large datasets.
I did that after realizing that most plotting methods weren't capable of handling that many points, and the ones that were decimated the dataset in a way I couldn't control. The function is the following:
function [X, Y, indices, relevance] = lineSimplificationI(X,Y,N,method,option)
%lineSimplificationI Reduce the number of points of the line described by X
%and Y to N, preserving the most relevant ones.
% Uses adapted versions of the Visvalingam and Douglas-Peucker algorithms.
% The number of points of the line is reduced iteratively until reaching
% N non-NaN points. Repeated NaN points in the original data are deleted but
% non-repeated NaNs are preserved to keep line breaks.
% The two available methods are
%
% Visvalingam: The relevance of a point is proportional to the area of
% the triangle defined by the point and its two neighbors.
%
% Douglas-Peucker: The relevance of a point is proportional to the
% distance between it and the straight line defined by its two neighbors.
% Note that the implementation here is iterative, NOT recursive as in
% the original algorithm. This allows better handling of large data sets.
%
% DIFFERENCES: Visvalingam tends to remove outliers while Douglas-Peucker
% keeps them.
%
% INPUTS:
% X: X coordinates of the line points
% Y: Y coordinates of the line points
% N: Target number of non-NaN points in the simplified line
% method: Either 'Visvalingam' or 'DouglasPeucker' (default)
% option: Either 'silent' (default) or 'verbose' if additional outputs
% of the calculations are desired.
%
% OUTPUTS:
% X: X coordinates of the simplified line points
% Y: Y coordinates of the simplified line points
% indices: Indices to the positions of the points preserved in the
% original X and Y. Therefore the output X is equal to the input
% X(indices).
% relevance: Relevance of the returned points. It can be used to further
% simplify the line dynamically by keeping only points with
% higher relevance. But this will produce bigger distortions of
% the line shape than calling lineSimplificationI again with a
% smaller value for N, as removing a point changes the relevance
% of its neighbors.
%
% Implementation by Camilo Rada - camilo#rada.cl
%
if nargin < 3
    error('Line points positions X, Y and target point count N MUST be specified');
end
if nargin < 4
    method='DouglasPeucker';
end
if nargin < 5
    option='silent';
end
doDisplay=strcmp(option,'verbose');
X=double(X(:));
Y=double(Y(:));
indices=1:length(Y);
if length(X)~=length(Y)
    error('Vectors X and Y MUST have the same number of elements');
end
if N>=length(Y)
    relevance=ones(length(Y),1);
    if doDisplay
        disp('N is greater than or equal to the number of points in the line. Original X,Y were returned. Relevances were not computed.')
    end
    return
end
% Removing repeated NaNs from Y
% We find all the NaNs with another NaN to the left
repeatedNaNs= isnan(Y(2:end)) & isnan(Y(1:end-1));
% We also treat the first element as a repeated NaN if it is NaN
repeatedNaNs=[isnan(Y(1)); repeatedNaNs(:)];
Y=Y(~repeatedNaNs);
X=X(~repeatedNaNs);
indices=indices(~repeatedNaNs);
% Removing trailing NaN if any
if isnan(Y(end))
    Y=Y(1:end-1);
    X=X(1:end-1);
    indices=indices(1:end-1);
end
pCount=length(X);
if doDisplay
    disp(['Initial point count = ' num2str(pCount)])
    disp(['Non-repeated NaN count in data = ' num2str(sum(isnan(Y)))])
end
iterCount=0;
while pCount>N
    iterCount=iterCount+1;
    % If the vertices of a triangle are at the points (x1,y1), (x2,y2) and
    % (x3,y3), the area of such a triangle is
    % area = abs((x1*(y2-y3)+x2*(y3-y1)+x3*(y1-y2))/2)
    % so twice the areas of the triangles defined by each point of X,Y and
    % its two neighbors are
    twiceTriangleArea=abs((X(1:end-2).*(Y(2:end-1)-Y(3:end))+X(2:end-1).*(Y(3:end)-Y(1:end-2))+X(3:end).*(Y(1:end-2)-Y(2:end-1))));
    switch method
        case 'Visvalingam'
            % In this case the relevance is given by the area of the
            % triangle formed by each point and its two neighbors
            relevance=twiceTriangleArea/2;
        case 'DouglasPeucker'
            % In this case the relevance is given by the distance between
            % the point and the straight line formed by its two neighbors
            neighborDistances=ppDistance([X(1:end-2) Y(1:end-2)],[X(3:end) Y(3:end)]);
            relevance=twiceTriangleArea./neighborDistances;
        otherwise
            error(['Unknown method: ' method]);
    end
    relevance=[Inf; relevance; Inf];
    % We remove the pCount-N least relevant points as long as they are not contiguous
    [srelevance, sortorder]=sort(relevance,'descend');
    firstFinite=find(isfinite(srelevance),1,'first');
    startPos=uint32(firstFinite+N+1);
    toRemove=sort(sortorder(startPos:end));
    if isempty(toRemove)
        break;
    end
    % Now we have to deal with contiguous elements, as removing one will
    % change the relevance of its neighbors. Therefore we have to
    % identify pairs of contiguous points and only remove the one with
    % lesser relevance.
    % contiguousToKeep will be true for an element if the next or the
    % previous element is also flagged for removal
    contiguousToKeep=[diff(toRemove(:))==1; false] | [false; (toRemove(1:end-1)-toRemove(2:end))==-1];
    notContiguous=~contiguousToKeep;
    % And the relevances associated with the elements flagged for removal
    contRel=relevance(toRemove);
    % Now we rearrange the contiguity flags into two rows; if both rows are
    % true in a given column, we have a case of two contiguous points that
    % are both flagged for removal. This process depends on the rearrangement,
    % as contiguous elements can end up in different columns, so it has to
    % be done twice to make sure no contiguous elements are removed together
    nContiguous=length(contiguousToKeep);
    for paddingMode=1:2
        % The rearrangement is only possible if we have an even number of
        % elements, so we add one dummy element at the end if needed
        if paddingMode==1
            if mod(nContiguous,2)
                pcontiguous=[contiguousToKeep; false];
                pcontRel=[contRel; -Inf];
            else
                pcontiguous=contiguousToKeep;
                pcontRel=contRel;
            end
        else
            if mod(nContiguous,2)
                pcontiguous=[false; contiguousToKeep];
                pcontRel=[-Inf; contRel];
            else
                pcontiguous=[false; contiguousToKeep(1:end-1)];
                pcontRel=[-Inf; contRel(1:end-1)];
            end
        end
        contiguousPairs=reshape(pcontiguous,2,[]);
        pcontRel=reshape(pcontRel,2,[]);
        % Finding columns with contiguous elements
        contCols=all(contiguousPairs);
        if ~any(contCols) && paddingMode==2
            break;
        end
        % Finding the row of the more relevant element of each column
        [~, mostRelevantRow]=max(pcontRel);
        % The index in contiguousToKeep of the first element of each pair is
        if paddingMode==1
            firstElementIdx=((1:size(contiguousPairs,2))*2)-1;
        else
            firstElementIdx=((1:size(contiguousPairs,2))*2)-2;
        end
        % and the index in contiguousToKeep of the more relevant element of
        % each pair is
        mostRelevantIdx=firstElementIdx+mostRelevantRow-1;
        % Now we unflag the more relevant element of each pair so that it is
        % dropped from the removal list (kept in the line) and only the less
        % relevant one is removed
        contiguousToKeep(mostRelevantIdx(contCols))=false;
    end
    % And now we drop those more relevant contiguous points from the
    % toRemove list
    toRemove=toRemove(contiguousToKeep | notContiguous);
    if any(diff(toRemove(:))==1) && doDisplay
        warning([num2str(sum(diff(toRemove(:))==1)) ' contiguous elements removed in one iteration.'])
    end
    toRemoveLogical=false(pCount,1);
    toRemoveLogical(toRemove)=true;
    X=X(~toRemoveLogical);
    Y=Y(~toRemoveLogical);
    indices=indices(~toRemoveLogical);
    pCount=length(X);
    nRemoved=sum(toRemoveLogical);
    if doDisplay
        disp(['Iteration ' num2str(iterCount) ', Point count = ' num2str(pCount) ' (' num2str(nRemoved) ' removed)'])
    end
    if nRemoved==0
        break;
    end
end
end
function d = ppDistance(p1,p2)
% Euclidean distance between corresponding rows of the point lists p1 and p2
d=sqrt((p1(:,1)-p2(:,1)).^2+(p1(:,2)-p2(:,2)).^2);
end

Find start point (time) of each cycle in a sine wave

I am trying to achieve a sine wave gradually changing from 8 Hz to 2 Hz over 5 seconds:
This waveform was produced in Cool Edit. I gave it a start frequency of 8Hz, an end frequency of 2Hz and a duration of 5 seconds. The sine wave gradually changes from one frequency to the other over the given time.
My question is, how can I accurately find the start time of each cycle (highlighted with a red dot), using a FOR loop?
Pseudo code:
time = 5 //Duration
freq1 = 8 //Start frequency
freq2 = 2 //End frequency
cycles = ( (freq1 + freq2) / 2 ) * time //Total number of cycles
for(i = 0; i < cycles; i++) {
    /* Formula to find start time of each cycle */
}
That is backward thinking for this problem and leads to madness in the program. Note also that the individual cycles will not be pure sine waves, because the frequency is changing (they will be slightly distorted), which you will not reproduce with your generator; and there is only a very slight chance the signal will end exactly on zero after 5 s. Instead, generate a continuous sine wave with variable frequency:
First compute the actual frequency
linear interpolation will suffice (unless you need a different kind of change)
f = f0 + (f1-f0)*t/T
where:
f0 = 8 [Hz] start frequency
f1 = 2 [Hz] stop frequency
T  = 5 [s] change time
t  = <0,T> is the actual time in [s]
then compute the sine wave data
for (t=0.0,angle=0.0;t<=T;t+=dt)
{
    f=f0+((f1-f0)*t/T); // actual frequency
    signal=Amplitude*sin(angle); // your signal; put it in an array or output it somewhere ...
    angle+=6.283185307179586476925286766559*dt*f; // update phase
    while (angle>6.283185307179586476925286766559) // wrap, just to avoid floating point rounding problems
        angle-=6.283185307179586476925286766559;
}
Here dt [s] is the time step you want to sample your signal with. If you are generating this in real time and outputting to real HW, you can use a timer or measure the time directly (with performance counters on Windows, or RDTSC, or whatever you have at your disposal).
If you have a predefined number of samples n for this, then
dt=T/double(n-1);
[Sample output plot omitted: the generated chirp, with n = image width.]
If you also need the number of periods, then increment a counter inside the angle-wrapping while loop. That wrap is also where your zero (cycle-start) points are (but if the sample rate is too low or you need high precision, you need to interpolate the real zero position).
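As a sketch of that last remark (in Python for illustration; f0, f1, T, dt as defined above), each wrap of the phase accumulator marks a cycle start, and linear interpolation inside the step recovers the exact time:
import math

f0, f1, T = 8.0, 2.0, 5.0   # start/stop frequency [Hz], duration [s]
dt = 0.001                  # sample period [s]
TWO_PI = 2.0 * math.pi

angle, t = 0.0, 0.0
cycle_starts = [0.0]        # the first cycle starts at t = 0
while t <= T:
    f = f0 + (f1 - f0) * t / T          # actual frequency
    step = TWO_PI * dt * f
    angle += step                       # update phase
    if angle > TWO_PI:                  # phase wrapped: a new cycle begins
        angle -= TWO_PI
        frac = 1.0 - angle / step       # fraction of the step before the wrap
        cycle_starts.append(t + frac * dt)
    t += dt

print(len(cycle_starts) - 1, "complete cycles")  # ~ ((f0+f1)/2)*T = 25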

Dimensions of fractals: box counting, Hausdorff, packing in R^n space

I would like to calculate the dimensions of a fractal written as an n-dimensional array of 0s and 1s. This includes the box-counting, Hausdorff and packing dimensions.
I only have an idea of how to code the box-counting dimension (just counting the 1s in the n-dimensional matrix and then using this formula:
boxing_count=-log(v)/log(n);
where v is the number of 1s and n is the space dimension (R^n).
This approach simulates counting minimal-resolution boxes, 1 x 1 x ... x 1, so numerically it is like the limit eps->0. What do you think about this solution?
Do you have any idea (or maybe code) for calculating hausdorff or packing dimension?
The Hausdorff and packing dimension are purely mathematical tools based in measure theory. They have wonderful properties in that context but are not well suited for experimentation. In short, there is no reason to expect that you can estimate their values based on a single matrix approximation to some set.
Box counting dimension, by contrast, is well suited for numerical investigation. Specifically, let N(e) denote the number of squares of side length e required to cover your fractal set. As you seem to know, the box counting dimension of your set is the limit as e->0 of
log(N(e))/log(1/e)
However, I don't think that just choosing the smallest available value of e is generally a good idea. The standard interpretation in the physics literature, as I understand it, is to presume that the relationship between N(e) and e should be maintained over a broad range of values. A standard way to compute the box-counting dimension is to compute N(e) for choices of e drawn from a sequence that tends geometrically to zero. We then fit a line to the points in a log-log plot of N(e) versus 1/e. The box-counting dimension should be approximately the slope of that line.
Example
As a concrete example, the following Python code generates a binary matrix that describes a fractal structure.
import numpy as np
size = 1024
first_row = np.zeros(size, dtype=int)
first_row[int(size/2)-1] = 1
rows = np.zeros((int(size/2),size),dtype=int)
rows[0] = first_row
for i in range(1,int(size/2)):
    rows[i] = (np.roll(rows[i-1],-1) + rows[i-1] + np.roll(rows[i-1],1)) % 2
m = int(np.log(size)/np.log(2))
rows = rows[0:2**(m-1),0:2**m]
We can view the fractal structure by simply interpreting each 1 as a black pixel and each 0 as a white pixel.
import matplotlib.pyplot as plt
plt.matshow(rows, cmap = plt.cm.binary)
This matrix makes a nice test since it can be shown that there is an actual limiting object whose fractal dimension is log(1+sqrt(5))/log(2) or approximately 1.694, yet it's complicated enough to make the box counting estimate a little tricky.
Now, this matrix is 512 rows by 1024 columns; it decomposes naturally into 2 matrices that are 512 by 512. Each of those decomposes naturally into 4 matrices that are 256 by 256, etc. For each such decomposition, we need to count the number of submatrices that have at least one non-zero element. We can perform this analysis as follows:
cnts = []
for lev in range(m):
    block_size = 2**lev
    cnt = 0
    for j in range(int(size/(2*block_size))):
        for i in range(int(size/block_size)):
            cnt = cnt + rows[j*block_size:(j+1)*block_size, i*block_size:(i+1)*block_size].any()
    cnts.append(cnt)
data = np.array([(2**(m-(k+1)),cnts[k]) for k in range(m)])
data
# Out:
# array([[ 512, 45568],
# [ 256, 22784],
# [ 128, 7040],
# [ 64, 2176],
# [ 32, 672],
# [ 16, 208],
# [ 8, 64],
# [ 4, 20],
# [ 2, 6],
# [ 1, 2]])
Now, your idea is to simply compute log(45568)/log(512) or approximately 1.7195, which is not too bad. I'm recommending that we examine a log-log plot of this data.
xs = np.log(data[:,0])
ys = np.log(data[:,1])
plt.plot(xs,ys, 'o')
This indeed looks close to linear, indicating that we might expect our box-counting technique to work reasonably well. First, though, it might be reasonable to exclude the one point that appears to be an outlier. In fact, that's one of the desirable characteristics of this approach. Here's how to do so:
plt.plot(xs,ys, 'o')
xs = xs[1:]
ys = ys[1:]
A = np.vstack([xs, np.ones(len(xs))]).T
m,b = np.linalg.lstsq(A, ys)[0]
def line(x): return m*x+b
ys = line(xs)
plt.plot(xs,ys)
m
# Out: 1.6902585379630133
Well, the result looks pretty good. In particular, this is a definitive example that this approach can work better than the simple idea of using just one data point. In fairness, though, it's not hard to find examples where the simple approach works better. Also, this set is regular enough that we get some nice results. Generally, one can't really expect box-counting computations to be too reliable.

Boxplot: the main rectangles delimit which percentage of data points?

I used the command:
boxplot(V15~Class,data=trainData, main="V15 value depending on Class", xlab="Class", ylab="V15")
I would like to understand what percentage of points lies inside the rectangle(s).
I mean: if I take all the samples inside the main rectangle, what percentage of the total count of samples will that be?
I found the documentation, but cannot figure out the answer from it.
The help text for boxplot, which you refer to, suggests that you should "See Also boxplot.stats which does the computation". From the "Details" section:
The two ‘hinges’ are versions of the first and third quartile, i.e., close to quantile(x, c(1,3)/4).
The hinges equal the quartiles for odd n (where n <- length(x)) and differ for even n.
Whereas the quartiles only equal observations for n %% 4 == 1 (n = 1 mod 4),
the hinges do so additionally for n %% 4 == 2 (n = 2 mod 4), and are in the middle of two observations otherwise.
So yes, basically the middle 50% of the values fall inside the box, but the details of the calculation depend on the nature of the data.
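A quick numerical check of that "middle 50%" statement, sketched in Python for illustration (the question is about R's boxplot, but the arithmetic is the same; n = 1001 is chosen so that n %% 4 == 1 and the hinges coincide with the quartiles):
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1001)                 # n %% 4 == 1: hinges == quartiles
q1, q3 = np.quantile(x, [0.25, 0.75])     # the box's lower and upper hinges
inside = np.mean((x >= q1) & (x <= q3))   # fraction of samples in the box
print(round(100 * inside, 1), "% of the samples fall inside the box")  # ~50%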
