Basic Arithmetic Coding - math

I have to explain to my class how to do basic arithmetic coding on a small message. I've been investigating lots of documents and reading a lot and I can say I theoretically understand how this method works, but still have some questions.
I'm stepping through these examples (first example, second page) - we have 'eaii!' message, and we want to code it using the arithmetic method.
In the example, it sets
Symbol Probability Range
a .2 [0 , 0.2)
e .3 [0.2 , 0.5)
i .1 [0.5 , 0.6)
o .2 [0.6 , 0.8)
u .1 [0.8 , 0.9)
! .1 [0.9 , 1.0)
My first question is, how did it set probabilities?, my logic tells me that if i have two 'i' symbols then that symbol should have the highest probability, shouldn't it?
Also how did it determined which range to start from and other ranges afterwards??
Another example was coding the message 'abc', which was set like this:
Symbol Probability Range
a .7 [0 , 0.7)
b .1 [0.7 , 0.8)
c .2 [0.8 , 1.0)
I also don't understand why the first symbol has substantially greater probability than the others and even if it was an order of appearance thing, I don't understand how it set it to 0.7, like why not 0.8 or 0.5.
I hope I made myself clear and I'd appreciate any kind of help.

They are imagining a fixed model for the data that was established long before that specific message is to be encoded. The model was in principle constructed from a large ensemble of such messages, so there is no reason to believe that eaii! by itself should match the probabilities in the model. Of course, the model is just for illustration purposes, and no more real than the eaii! message. (Though I think I said exactly that the other day when I was pulling something out of the oven.)
The order of the symbols in the model is arbitrary. It just needs to be the same model on both ends. It is of course important that the probabilities add up to one.
The second model is simply another arbitrary model to illustrate how a symbol can be coded in less than a bit, when it has a probability greater than 1/2. For that model, each a in a series of a's would take a little over half a bit.

I think the probability is for the sample.
To determine which probability to test for a specific sample, just count each character, and divide, for each, its number of occurrences by the total number of characters. (sum is 1).
Note that the arithmetic algorithm is efficient when there is a huge repetition of the same character.
For example :
aaaaaaaaaaaaaaaaaaaabaaaaaaaaaaaaaaaabaaaaa will compress very good
abfpababbahnajdapkalamkmdamlkapaaapokpokpdq will not compress very good (try Huffman)

Related

Test if regression weights are different from 1

I am doing an lm()regression with R where I use stock quotations. I used exponential weights for the regression : the older the data, the less weight. My weights formula is like this : alpha^(seq(685,1,by=-1))) (the data length is 685), and to find alpha I tried every value between 0.9 and 1.1 with a step of 0.0001 and I chose the alpha which minimizes the difference between the predicted values and the real values. This alpha is equal to 0.9992 so I would like to know if it is statistically different from 1.
In other words I would like to know if the weights are different from 1. Is it possible to achieve that and if so, how could I do this ?
I don't really know whether this question should be asked on stats.stackexchange but it involves Rso I hope it is not misplaced.

Decimal points - Probability value of 0 in Language R

How to treat p value in R ?
I am expecting very low p values like:
1.00E-80
I need to -log10
-log10(1.00E-80)
-log10(0) is Inf, but Inf at sense of rounding too.
But is seems that after 1.00E-308, R yields 0.
1/10^308
[1] 1e-308
1/10^309
[1] 0
Is the accuracy of p-value display with lm function the same as the cutoff point, 1e-308, or it is just designed such that we need a cutoff point and I need to consider a different cutoff point - such as 1e-100 (for example) to replace 0 with <1e-100.
There are a variety of possible answers -- which one is most useful depends on the context:
R is indeed incapable under ordinary circumstances of storing floating-point values closer to zero than .Machine$double.xmin, which varies by platform but is typically (as you discovered) on the order of 1e-308. If you really need to work with numbers this small and can't find a way to work on the log scale directly, you need to search Stack Overflow or the R wiki for methods for dealing with arbitrary/extended precision values (but you probably should try to work on the log scale -- it will be much less of a hassle)
in many circumstances R actually computes p values on the (natural) log scale internally, and can if requested return the log values rather than exponentiating them before giving the answer. For example, dnorm(-100,log=TRUE) gives -5000.919. You can convert directly to the log10 scale (without exponentiating and then using log10) by dividing by log(10): dnorm(-100,log=TRUE)/log(10)=-2171, which would be too small to represent in floating point. For the p*** (cumulative distribution function) functions, use log.p=TRUE rather than log=TRUE. (This particular point depends heavily on your particular context. Even if you are not using built-in R functions you may be able to find a way to extract results on the log scale.)
in some cases R presents p-value results as being <2.2e-16 even when a more precise value is known: (t1 <- t.test(rnorm(10,100),rnorm(10,80)))
prints
....
t = 56.2902, df = 17.904, p-value < 2.2e-16
but you can still extract the precise p-value from the result
> t1$p.value
[1] 1.856174e-18
(in many cases this behaviour is controlled by the format.pval() function)
An illustration of how all this would work with lm:
d <- data.frame(x=rep(1:5,each=10))
set.seed(101)
d$y <- rnorm(50,mean=d$x,sd=0.0001)
lm1 <- lm(y~x,data=d)
summary(lm1) prints the p-value of the slope as <2.2e-16, but if we use coef(summary(lm1)) (which does not use the p-value formatting), we can see that the value is 9.690173e-203.
A more extreme case:
set.seed(101); d$y <- rnorm(50,mean=d$x,sd=1e-7)
lm2 <- lm(y~x,data=d)
coef(summary(lm2))
shows that the p-value has actually underflowed to zero. However, we can still get an answer on the log scale:
tval <- coef(summary(lm2))["x","t value"]
2*pt(abs(tval),df=48,lower.tail=FALSE,log.p=TRUE)/log(10)
gives -692.62 (you can check this approach with the previous example where the p-value doesn't overflow and see that you get the same answer as printed in the summary).
Small numbers are generally hard to deal with.
The limit in R for infinite is caused by the use of double precision floating point :
?double All R platforms are required to work with values conforming to the IEC 60559 (also known as IEEE 754) standard. This basically works with a precision of 53 bits, and represents to that precision a range of absolute values from about 2e-308 to 2e+308.
http://en.wikipedia.org/wiki/Double_precision_floating-point_format
You may find the Rmpfr package helpful here as it allows you to create multiple precision numbers.
install.packages("Rmpfr")
require(Rmpfr)
log(mpfr(1/10^309, precBits=500))

R fast AUC function for non-binary dependent variable

I'm trying to calculate the AUC for a large-ish data set and having trouble finding one that both handles values that aren't just 0's or 1's and works reasonably quickly.
So far I've tried the ROCR package, but it only handles 0's and 1's and the pROC package will give me an answer but could take 5-10 minutes to calculate 1 million rows.
As a note all of my values fall between 0 - 1 but are not necessarily 1 or 0.
EDIT: both the answers and predictions fall between 0 - 1.
Any suggestions?
EDIT2:
ROCR can deal with situations like this:
Ex.1
actual prediction
1 0
1 1
0 1
0 1
1 0
or like this:
Ex.2
actual prediction
1 .25
1 .1
0 .9
0 .01
1 .88
but NOT situations like this:
Ex.3
actual prediction
.2 .25
.6 .1
.98 .9
.05 .01
.72 .88
pROC can deal with Ex.3 but it takes a very long time to compute. I'm hoping that there's a faster implementation for a situation like Ex.3.
So far I've tried the ROCR package, but it only handles 0's and 1's
Are you talking about the reference class memberships or the predicted class memberships?
The latter can be between 0 and 1 in ROCR, have a look at its example data set ROCR.simple.
If your reference is in [0, 1], you could have a look at (disclaimer: my) package softclassval. You'd have to construct the ROC/AUC from sensitivity and specificity calculations, though. So unless you think of an optimized algorithm (as ROCR developers did), it'll probably take long, too. In that case you'll also have to think what exactly sensitivity and specificity should mean, as this is ambiguous with reference memberships in (0, 1).
Update after clarification of the question
You need to be aware that grouping the reference or actual together looses information. E.g., if you have actual = 0.5 and prediction = 0.8, what is that supposed to mean? Suppose these values were really actual = 5/10 and prediction = 5/10.
By summarizing the 10 tests into two numbers, you loose the information whether the same 5 out of the 10 were meant or not. Without this, actual = 5/10 and prediction = 8/10 is consistent with anything between 30 % and 70 % correct recognition!
Here's an illustration where the sensitivity is discussed (i.e. correct recognition e.g. of click-through):
You can find the whole poster and two presentaions discussing such issues at softclassval.r-forge.r-project.org, section "About softclassval".
Going on with these thoughts, weighted versions of mean absolute, mean squared, root mean squared etc. errors can be used as well.
However, all those different ways to express of the same performance characteristic of the model (e.g. sensitivity = % correct recognitions of actual click-through events) do have a different meaning, and while they coincide with the usual calculation in unambiguous reference and prediction situations, they will react differently with ambiguous reference / partial reference class membership.
Note also, as you use continuous values in [0, 1] for both reference/actual and prediction, the whole test will be condensed into one point (not a line!) in the ROC or specificity-sensitivity plot.
Bottom line: the grouping of the data gets you in trouble here. So if you could somehow get the information on the single clicks, go and get it!
Can you use other error measures for assessing method performance? (e.g. Mean Absolute Error, Root Mean Square Error)?
This post might also help you out, but if you have different numbers of classes for observed and predicted values, then you might run into some issues.
https://stat.ethz.ch/pipermail/r-help/2008-September/172537.html

Finding a reasonable (noise-free) maximum element in a vector

Consider a vector V riddled with noisy elements. What would be the fastest (or any) way to find a reasonable maximum element?
For e.g.,
V = [1 2 3 4 100 1000]
rmax = 4;
I was thinking of sorting the elements and finding the second differential {i.e. diff(diff(unique(V)))}.
EDIT: Sorry about the delay.
I can't post any representative data since it contains 6.15e5 elements. But here's a plot of the sorted elements.
By just looking at the plot, a piecewise linear function may work.
Anyway, regarding my previous conjecture about using differentials, here's a plot of diff(sort(V));
I hope it's clearer now.
EDIT: Just to be clear, the desired "maximum" value would be the value right before the step in the plot of the sorted elements.
NEW ANSWER:
Based on your plot of the sorted amplitudes, your diff(sort(V)) algorithm would probably work well. You would simply have to pick a threshold for what constitutes "too large" a difference between the sorted values. The first point in your diff(sort(V)) vector that exceeds that threshold is then used to get the threshold to use for V. For example:
diffThreshold = 2e5;
sortedVector = sort(V);
index = find(diff(sortedVector) > diffThreshold,1,'first');
signalThreshold = sortedVector(index);
Another alternative, if you're interested in toying with it, is to bin your data using HISTC. You would end up with groups of highly-populated bins at both low and high amplitudes, with sparsely-populated bins in between. It would then be a matter of deciding which bins you count as part of the low-amplitude group (such as the first group of bins that contain at least X counts). For example:
binEdges = min(V):1e7:max(V); % Create vector of bin edges
n = histc(V,binEdges); % Bin amplitude data
binThreshold = 100; % Pick threshold for number of elements in bin
index = find(n < binThreshold,1,'first'); % Find first bin whose count is low
signalThreshold = binEdges(index);
OLD ANSWER (for posterity):
Finding a "reasonable maximum element" is wholly dependent upon your definition of reasonable. There are many ways you could define a point as an outlier, such as simply picking a set of thresholds and ignoring everything outside of what you define as "reasonable". Assuming your data has a normal-ish distribution, you could probably use a simple data-driven thresholding approach for removing outliers from a vector V using the functions MEAN and STD:
nDevs = 2; % The number of standard deviations to use as a threshold
index = abs(V-mean(V)) <= nDevs*std(V); % Index of "reasonable" values
maxValue = max(V(index)); % Maximum of "reasonable" values
I would not sort then difference. If you have some reason to expect continuity or bounded change (the vector is of consecutive sensor readings), then sorting will destroy the time information (or whatever the vector index represents). Filtering by detecting large spikes isn't a bad idea, but you would want to compare the spike to a larger neighborhood (2nd difference effectively has you looking within a window of +-2).
You need to describe formally the expected information in the vector, and the type of noise.
You need to know the frequency and distribution of errors and non-errors. In the simplest model, the elements in your vector are independent and identically distributed, and errors are all or none (you randomly choose to store the true value, or an error). You should be able to figure out for each element the chance that it's accurate, vs. the chance that it's noise. This could be very easy (error data values are always in a certain range which doesn't overlap with non-error values), or very hard.
To simplify: don't make any assumptions about what kind of data an error produces (the worst case is: you can't rule out any of the error data points as ridiculous, but they're all at or above the maximum among non-error measurements). Then, if the probability of error is p, and your vector has n elements, then the chance that the kth highest element in the vector is less or equal to the true maximum is given by the cumulative binomial distribution - http://en.wikipedia.org/wiki/Binomial_distribution
First, pick your favorite method for identifying outliers...
If you expect the numbers to come from a normal distribution, you can use a say 2xsd (standard deviation) above the mean to determine your max.
Do you have access to bounds of your noise-free elements. For example, do you know that your noise-free elements are between -10 and 10 ?
In that case, you could remove noise, and then find the max
max( v( find(v<=10 & v>=-10) ) )

how to generate pseudo-random positive definite matrix with constraints on the off-diagonal elements? [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
how to generate pseudo-random positive definite matrix with constraints on the off-diagonal elements?
The user wants to impose a unique, non-trivial, upper/lower bound on the correlation between every pair of variable in a var/covar matrix.
For example: I want a variance matrix in which all variables have 0.9 > |rho(x_i,x_j)| > 0.6, rho(x_i,x_j) being the correlation between variables x_i and x_j.
Thanks.
There are MANY issues here.
First of all, are the pseudo-random deviates assumed to be normally distributed? I'll assume they are, as any discussion of correlation matrices gets nasty if we diverge into non-normal distributions.
Next, it is rather simple to generate pseudo-random normal deviates, given a covariance matrix. Generate standard normal (independent) deviates, and then transform by multiplying by the Cholesky factor of the covariance matrix. Add in the mean at the end if the mean was not zero.
And, a covariance matrix is also rather simple to generate given a correlation matrix. Just pre and post multiply the correlation matrix by a diagonal matrix composed of the standard deviations. This scales a correlation matrix into a covariance matrix.
I'm still not sure where the problem lies in this question, since it would seem easy enough to generate a "random" correlation matrix, with elements uniformly distributed in the desired range.
So all of the above is rather trivial by any reasonable standards, and there are many tools out there to generate pseudo-random normal deviates given the above information.
Perhaps the issue is the user insists that the resulting random matrix of deviates must have correlations in the specified range. You must recognize that a set of random numbers will only have the desired distribution parameters in an asymptotic sense. Thus, as the sample size goes to infinity, you should expect to see the specified distribution parameters. But any small sample set will not necessarily have the desired parameters, in the desired ranges.
For example, (in MATLAB) here is a simple positive definite 3x3 matrix. As such, it makes a very nice covariance matrix.
S = randn(3);
S = S'*S
S =
0.78863 0.01123 -0.27879
0.01123 4.9316 3.5732
-0.27879 3.5732 2.7872
I'll convert S into a correlation matrix.
s = sqrt(diag(S));
C = diag(1./s)*S*diag(1./s)
C =
1 0.0056945 -0.18804
0.0056945 1 0.96377
-0.18804 0.96377 1
Now, I can sample from a normal distribution using the statistics toolbox (mvnrnd should do the trick.) As easy is to use a Cholesky factor.
L = chol(S)
L =
0.88805 0.012646 -0.31394
0 2.2207 1.6108
0 0 0.30643
Now, generate pseudo-random deviates, then transform them as desired.
X = randn(20,3)*L;
cov(X)
ans =
0.79069 -0.14297 -0.45032
-0.14297 6.0607 4.5459
-0.45032 4.5459 3.6549
corr(X)
ans =
1 -0.06531 -0.2649
-0.06531 1 0.96587
-0.2649 0.96587 1
If your desire was that the correlations must ALWAYS be greater than -0.188, then this sampling technique has failed, since the numbers are pseudo-random. In fact, that goal will be a difficult one to achieve unless your sample size is large enough.
You might employ a simple rejection scheme, whereby you do the sampling, then redo it repeatedly until the sample has the desired properties, with the correlations in the desired ranges. This may get tiring.
An approach that might work (but one that I've not totally thought out at this point) is to use the standard scheme as above to generate a random sample. Compute the correlations. I they fail to lie in the proper ranges, then identify the perturbation one would need to make to the actual (measured) covariance matrix of your data, so that the correlations would be as desired. Now, find a zero mean random perturbation to your sampled data that would move the sample covariance matrix in the desired direction.
This might work, but unless I knew that this is actually the question at hand, I won't bother to go any more deeply into it. (Edit: I've thought some more about this problem, and it appears to be a quadratic programming problem, with quadratic constraints, to find the smallest perturbation to a matrix X, such that the resulting covariance (or correlation) matrix has the desired properties.)
This is not a complete answer, but a suggestion of a possible constructive method:
Looking at the characterizations of the positive definite matrices (http://en.wikipedia.org/wiki/Positive-definite_matrix) I think one of the most affordable approaches could be using the Sylvester criterion.
You can start with a trivial 1x1 random matrix with positive determinant and expand it in one row and column step by step while ensuring that the new matrix has also a positive determinant (how to achieve that is up to you ^_^).
Woodship,
"First of all, are the pseudo-random deviates assumed to be normally distributed?"
yes.
"Perhaps the issue is the user insists that the resulting random matrix of deviates must have correlations in the specified range."
Yes, that's the whole difficulty
"You must recognize that a set of random numbers will only have the desired distribution parameters in an asymptotic sense."
True, but this is not the problem here: your strategy works for p=2, but fails for p>2, regardless of sample size.
"If your desire was that the correlations must ALWAYS be greater than -0.188, then this sampling technique has failed, since the numbers are pseudo-random. In fact, that goal will be a difficult one to achieve unless your sample size is large enough."
It is not a sample size issue b/c with p>2 you do not even observe convergence to the right range for the correlations, as sample size growths: i tried the technique you suggest before posting here, it obviously is flawed.
"You might employ a simple rejection scheme, whereby you do the sampling, then redo it repeatedly until the sample has the desired properties, with the correlations in the desired ranges. This may get tiring."
Not an option, for p large (say larger than 10) this option is intractable.
"Compute the correlations. I they fail to lie in the proper ranges, then identify the perturbation one would need to make to the actual (measured) covariance matrix of your data, so that the correlations would be as desired."
Ditto
As for the QP, i understand the constraints, but i'm not sure about the way you define the objective function; by using the "smallest perturbation" off some initial matrix, you will always end up getting the same (solution) matrix: all the off diagonal entries will be exactly equal to either one of the two bounds (e.g. not pseudo random); plus it is kind of an overkill isn't it ?
Come on people, there must be something simpler

Resources