Coin Toss Plot similar to Feller - r

From Feller (1950) An Introduction to Probability Theory:
A path of length n can be interpreted as the record of an ideal experiment consisting of n successive tosses of a coin. If +1 stands for heads, then Sk equals the (positive or negative) excess of the accumulated number of heads over tails at the conclusion of the kth trial. The classical description introduces the fictitious gambler Peter who at each trial wins or loses a unit amount. The sequence S1, S2,...Sn then represents Peter's successive cumulative gains.
I have a column of ones and zeros from a real coin toss experiment and would like to construct a graph similar to the one Feller presents (as described above). cumsum and plotCsum don't seem to be quite what I am looking for.

I have a column of ones and zeros
Maybe it works if you convert the 0 into -1
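A minimal Python sketch of that suggestion, assuming the tosses are available as a list of 0/1 integers. The function name feller_path is made up for illustration. It maps each 0 to -1 and takes the running sum, with S0 = 0 prepended as the starting point of the path.

```python
from itertools import accumulate

def feller_path(tosses):
    """Return [S0, S1, ..., Sn] for a list of 0/1 tosses (1 = heads)."""
    # Heads counts as +1, tails as -1, as in Feller's description.
    steps = [1 if t == 1 else -1 for t in tosses]
    # The running sum of the steps is the excess of heads over tails.
    return [0] + list(accumulate(steps))

path = feller_path([1, 0, 0, 1, 1])
print(path)
```

Plotting the returned values against their index (0 to n) as a line chart should give a picture of the kind Feller describes.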


How to obtain the maximum sum of the array with the following condition?

Suppose the problem posed is as follows:
On Mars there lives a colony of worms. Each worm is represented as an element of a 1D array. Worms decide to eat each other, but any worm can eat only its nearest neighbour. Each worm has a preset amount of energy (i.e. the value of the element). On Mars, the laws dictate that when a worm i with energy x eats another worm with energy y, the i-th worm’s final energy becomes x-y. A worm is allowed to have negative energy levels.
Find the maximum value of energy of the last standing worm.
Sample data:
0,-1,-1,-1,-1 has answer 4.
2,1,2,1 has answer 4.
What will be the suitable logic to address this problem?
This problem has a surprisingly simple O(N) solution.
If any two members of the array have different signs, the answer is the sum of the absolute values of all elements.
To see why, imagine a single positive value in the array while all other elements are negative (Example 1). The best strategy is to keep this value positive and gradually eat all the neighbors to increase it. The position of the positive value doesn't matter. The strategy is the same in the case of a single negative element.
In the more general case, if an array of size N has values of different signs, we can always reduce it to an array of size N-1 that still has different signs, because there must be a pair of neighbors with different signs, which we can combine to form a number of whichever sign we prefer.
For example with this array : [1,2,-5,4,-10]
we can combine either (2,-5) or (4,-10). Let's combine (4,-10) to get [1,2,-5,-14]
We can only take (2,-5) now. So our array now is: [1,-7,-14]
Again, only (1,-7) is possible. But this time we have to keep the combined value positive. So we are left with: [8,-14]
Final combining gives us 22, sum of all absolute values.
If all values have the same sign, our first move is to produce a value of the opposite sign by combining a neighbor pair at as little "cost" as possible. Intuitively, we don't want to waste two big numbers on this conversion. If we take a neighbor pair x,y, the combined value (of the opposite sign) has magnitude abs(x-y). Since the result is simply the sum of absolute values, we can interpret this as "losing" abs(x) and abs(y) from the maximum possible output and "gaining" abs(x-y) instead. So the "cost" of using this pair for the sign conversion is abs(x)+abs(y)-abs(x-y). Since we need to minimise this cost, we choose the neighbor pair in the initial array that has the lowest such value.
So if we take the above array but now all values are positive [1,2,5,4,10]:
"cost" of converting (1,2) to -1 is 1+2-abs(-1)=2.
"cost" of converting (2,5) to -3 is 2+5-abs(-3)=4.
"cost" of converting (5,4) to -1 is 5+4-abs(-1)=8.
"cost" of converting (4,10) to -6 is 4+10-abs(-6)=8.
So, we take and convert pair (1,2) to -1. Then just sum absolute values of resultant array to get 20. Notice that this value is exactly 2 less than our previous example.
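The rule described in this answer can be sketched in Python as follows. The function name is hypothetical, and two details are assumptions not spelled out above: a single worm simply keeps its own energy, and zeros are treated as compatible with either sign (they cost nothing to convert).

```python
def max_last_worm_energy(energies):
    """O(N) sketch: largest possible energy of the last standing worm."""
    if len(energies) == 1:
        # Assumption: with one worm there is nothing to eat.
        return energies[0]
    total = sum(abs(e) for e in energies)
    has_positive = any(e > 0 for e in energies)
    has_negative = any(e < 0 for e in energies)
    if has_positive and has_negative:
        # Mixed signs: every element can contribute its absolute value.
        return total
    # Same sign everywhere: pay the cheapest neighbor-pair conversion cost.
    min_cost = min(
        abs(x) + abs(y) - abs(x - y)
        for x, y in zip(energies, energies[1:])
    )
    return total - min_cost

print(max_last_worm_energy([1, 2, -5, 4, -10]))  # mixed-sign example from the answer
print(max_last_worm_energy([1, 2, 5, 4, 10]))    # all-positive example from the answer
```

Both passes over the array are linear, so the whole computation stays O(N).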

Understanding the probability of a double-six if i roll two dice

The probability of a double-six in one throw of two dice is 1/36 or 0.028.
If I threw a pair of dice a hundred times, would 3 (0.028 * 100) be
1. The number of times (3) I would get a double-six
2. The probability (3%) of getting a double-six on all throws.
I have a feeling the correct answer is number 1, because intuitively the chance of getting a double six every time on a hundred throws seems to be a lot lower than 3%.
Please explain, as simply as you can, which is the correct understanding and why.
The probability of not having a double six in one throw (all but one outcome divided by all outcomes):
35/36
The probability of not having a double six in N throws:
(35/36)**N /* where ** is raising to the N-th power */
The probability of having at least one double six in N throws
P(N) = 1 - (35/36)**N
if N == 100 we have
P(100) == 0.94022021...
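As a quick check of the formula above, here is a small Python sketch. The function name is made up for illustration. It also prints the quantity from the question, 100 * 1/36, which is the expected number of double sixes in 100 throws rather than a probability.

```python
def p_at_least_one_double_six(n):
    """Probability of at least one double six in n throws of two dice."""
    # Complement of "no double six in any of the n throws".
    return 1 - (35 / 36) ** n

p_100 = p_at_least_one_double_six(100)
expected_count = 100 * (1 / 36)  # average number of double sixes in 100 throws

print(round(p_100, 4))
print(round(expected_count, 2))
```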
It is nearly 1., but with a twist in the interpretation: 2.8 is the average number of double sixes you would get if you performed a series of experiments with 100 throws each. The correct answer for 2. was given by Dmitry.

Generate a specific amount of random numbers that add up to a defined value

I would like to unit test the time writing software used at my company. In order to do this I would like to create sets of random numbers that add up to a defined value.
I want to be able to control the parameters:
Min and max value of the generated number
The n of the generated numbers
The sum of the generated numbers
For example, in 250 days a person worked 2000 hours. The 2000 hours have to be randomly distributed over the 250 days. The maximum time spent per day is 9 hours and the minimum is .25
I worked my way through this SO question and found the method
diff(c(0, sort(runif(249)), 2000))
This results in 1 big number and 249 small numbers. That's why I would like to be able to set a min and max for the generated numbers. But I don't know where to start.
You will have no problem meeting any two out of your three constraints, but all three might be a problem. As you note, the standard way to generate N random numbers that add to a sum is to generate N-1 random numbers in the range of 0..sum, sort them, and take the differences. This is basically treating your sum as a number line, choosing N-1 random points, and your numbers are the segments between the points.
But this might not be compatible with constraints on the numbers themselves. For example, what if you want 10 numbers that add to 1000, but each has to be less than 100? That won't work. Even if you have ranges that are mathematically possible, forcing compliance with all the constraints might mean sacrificing uniformity or other desirable properties.
I suspect the only way to do this is to keep the sum constraint, the N constraint, do the standard N-1, sort, and diff thing, but restrict the resolution of the individual randoms to your desired minimum (in other words, instead of 0..100, maybe generate 0..10 times 10).
Or, instead of generating N-1 uniformly random points along the line, generate a random sample of points along the line within a similar low-resolution constraint.
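One possible way to honor all three constraints is sketched below in Python. This is not the sort-and-diff method described above but a different, simpler technique that borrows the low-resolution idea: every value starts at the minimum, and the remainder is handed out in fixed increments to randomly chosen entries that still have room. The function name, the step parameter, and the choice of 0.25 as the resolution are assumptions for illustration, and the resulting distribution is not claimed to be uniform over all valid solutions.

```python
import random

def random_parts(n, total, lo, hi, step=0.25, rng=random):
    """Return n values in [lo, hi], each a multiple of step, summing to total."""
    # Work in whole units of `step` so the sum stays exact.
    units_lo, units_hi = round(lo / step), round(hi / step)
    units_total = round(total / step)
    if not n * units_lo <= units_total <= n * units_hi:
        raise ValueError("constraints cannot all be satisfied")
    parts = [units_lo] * n
    open_slots = list(range(n))  # indices that are still below the maximum
    remaining = units_total - n * units_lo
    while remaining > 0:
        k = rng.randrange(len(open_slots))
        i = open_slots[k]
        parts[i] += 1
        remaining -= 1
        if parts[i] == units_hi:
            # Slot is full: remove it in O(1) by swapping with the last one.
            open_slots[k] = open_slots[-1]
            open_slots.pop()
    return [p * step for p in parts]

hours = random_parts(250, 2000, 0.25, 9)
print(len(hours), sum(hours), min(hours), max(hours))
```

The up-front feasibility check covers cases like the one mentioned above, where the requested sum cannot be reached within the given bounds.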

Calculate correlation coefficient between words?

For a text analysis program, I would like to analyze the co-occurrence of certain words in a text. For example, I would like to see that e.g. the words "Barack" and "Obama" appear more often together (i.e. have a positive correlation) than others.
This does not seem to be that difficult. However, to be honest, I only know how to calculate the correlation between two numbers, but not between two words in a text.
How can I best approach this problem?
How can I calculate the correlation between words?
I thought of using conditional probabilities, since e.g. Barack Obama is much more probable than Obama Barack; however, the problem I try to solve is much more fundamental and does not depend on the ordering of the words
The Ngram Statistics Package (NSP) is devoted precisely to this task. They have a paper online which describes the association measures they use. I haven't used the package myself, so I cannot comment on its reliability/requirements.
Well, a simple way to solve your question is by shaping the data into a 2x2 matrix
              obama | not obama
barack          A   |     B
not barack      C   |     D
and scoring all occurring bi-grams in the matrix. That way you can, for instance, use a simple chi-squared test.
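A rough Python sketch of this idea, assuming the text is already tokenized into a list of lowercase words. The function name is hypothetical. It fills the four cells A, B, C, D from the adjacent word pairs and applies the standard Pearson chi-squared statistic for a 2x2 table, N(AD - BC)^2 / ((A+B)(C+D)(A+C)(B+D)).

```python
def bigram_chi_squared(tokens, w1, w2):
    """Chi-squared score for the bigram (w1, w2) from a 2x2 contingency table."""
    bigrams = list(zip(tokens, tokens[1:]))
    n = len(bigrams)
    a = sum(1 for x, y in bigrams if x == w1 and y == w2)  # w1 then w2
    b = sum(1 for x, y in bigrams if x == w1 and y != w2)  # w1 then other
    c = sum(1 for x, y in bigrams if x != w1 and y == w2)  # other then w2
    d = n - a - b - c                                      # neither
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # Assumption: a degenerate table (an empty row or column) scores 0.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

tokens = "barack obama met barack obama and then obama spoke".split()
print(bigram_chi_squared(tokens, "barack", "obama"))
print(bigram_chi_squared(tokens, "met", "obama"))
```

Note that the statistic is large for both strong attraction and strong avoidance, so the sign of AD - BC is worth checking when only positive association is of interest.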
I don't know how this is commonly done, but I can think of one crude way to define a notion of correlation that captures word adjacency.
Suppose the text has length N, say it is an array
text[0], text[1], ..., text[N-1]
Suppose the following words appear in the text
word[0], word[1], ..., word[k]
For each word word[i], define a vector of length N-1
X[i] = array(); // of length N-1
as follows: the jth entry of the vector is 1 if the word is either the jth word or the (j+1)th word of the text, and zero otherwise.
// compute the vector X[i]
for (j = 0:N-2){
    if (text[j] == word[i] OR text[j+1] == word[i])
        X[i][j] = 1;
    else
        X[i][j] = 0;
}
Then you can compute the correlation coefficient between word[a] and word[b] as the dot product between X[a] and X[b] (note that the dot product is the number of times these words are adjacent) divided by the product of their lengths (the length is the square root of the number of appearances of the word, well maybe twice that). Call this quantity COR(X[a],X[b]). Clearly COR(X[a],X[a]) = 1, and COR(X[a],X[b]) is larger if word[a], word[b] are often adjacent.
This can be generalized from "adjacent" to other notions of near - for example we could have chosen to use 3 word (or 4, 5, etc.) blocks instead. One can also add weights, probably do many more things as well if desired. One would have to experiment to see what is useful, if any of it is of use at all.
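The construction described in this answer could look like this in Python, as a sketch under the assumption that the text is a list of tokens. The function names are made up, and returning 0 for a word that never appears is a choice made here, not something the answer specifies.

```python
import math

def adjacency_vector(text, word):
    """X[j] = 1 if word is at position j or j+1 of text, else 0 (length N-1)."""
    return [
        1 if text[j] == word or text[j + 1] == word else 0
        for j in range(len(text) - 1)
    ]

def cor(text, a, b):
    """Dot product of the two adjacency vectors divided by their lengths."""
    xa, xb = adjacency_vector(text, a), adjacency_vector(text, b)
    dot = sum(p * q for p, q in zip(xa, xb))
    # Entries are 0 or 1, so the squared length is just the number of ones.
    len_a, len_b = math.sqrt(sum(xa)), math.sqrt(sum(xb))
    if len_a == 0 or len_b == 0:
        return 0.0  # assumption: an absent word correlates with nothing
    return dot / (len_a * len_b)

text = "barack obama met the press and barack obama spoke".split()
print(cor(text, "barack", "obama"))
print(cor(text, "barack", "press"))
```

Widening the window from two words to three or more, as suggested above, only changes the condition inside adjacency_vector.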
This problem sounds like a bigram, a sequence of two "tokens" in a larger body of text. See this Wikipedia entry, which has additional links to the more general n-gram problem.
If you want to do a full analysis, you'd most likely take any given pair of words and do a frequency analysis. E.g., the sentence "Barack Obama is the Democratic candidate for President," has 8 words, so there are 8 choose 2 = 28 possible pairs.
You can then ask statistical questions like, "in how many pairs does 'Obama' follow 'Barack', and in how many pairs does some other word (not 'Obama') follow 'Barack'?" In this case, there are 7 pairs that include 'Barack' but in only one of them is it paired with 'Obama'.
Do the same for every possible word pair (e.g., "in how many pairs does 'candidate' follow 'the'?"), and you've got a basis for comparison.
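The pair counts in this answer can be reproduced with a few lines of Python. This is only an illustration of the counting for the one example sentence, using itertools.combinations from the standard library. The variable names are arbitrary.

```python
from itertools import combinations

sentence = "Barack Obama is the Democratic candidate for President"
words = sentence.split()

# Every unordered pair of words in the sentence: 8 choose 2.
pairs = list(combinations(words, 2))
with_barack = [p for p in pairs if "Barack" in p]
barack_obama = [p for p in with_barack if "Obama" in p]

print(len(pairs), len(with_barack), len(barack_obama))
```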

Cosine similarity and tf-idf

I am confused by the following comment about TF-IDF and Cosine Similarity.
I was reading up on both and then on wiki under Cosine Similarity I find this sentence "In case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90."
Now I'm wondering....aren't they 2 different things?
Is tf-idf already inside the cosine similarity? If yes, then what the heck - I can only see the inner dot products and euclidean lengths.
I thought tf-idf was something you could do before running cosine similarity on the texts. Did I miss something?
Tf-idf is a transformation you apply to texts to get two real-valued vectors. You can then obtain the cosine similarity of any pair of vectors by taking their dot product and dividing that by the product of their norms. That yields the cosine of the angle between the vectors.
If d2 and q are tf-idf vectors, then
$$\cos\theta = \frac{d_2 \cdot q}{\lVert d_2 \rVert \, \lVert q \rVert}$$
where θ is the angle between the vectors. As θ ranges from 0 to 90 degrees, cos θ ranges from 1 to 0. θ can only range from 0 to 90 degrees, because tf-idf vectors are non-negative.
There's no particularly deep connection between tf-idf and the cosine similarity/vector space model; tf-idf just works quite well with document-term matrices. It has uses outside of that domain, though, and in principle you could substitute another transformation in a VSM.
(Formula taken from the Wikipedia, hence the d2.)
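For illustration, the computation described in this answer (dot product divided by the product of the norms) can be written in plain Python. The function name is arbitrary, and returning 0 for an all-zero vector is a convention chosen here to avoid dividing by zero.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # no shared terms
```

The function does not care where the numbers come from, which is the point made above: tf-idf is one way to produce the vectors, and raw counts or another weighting would work with the same code.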
TF-IDF is just a way to measure the importance of tokens in text; it's just a very common way to turn a document into a list of numbers (the term vector that provides one edge of the angle you're getting the cosine of).
To compute cosine similarity, you need two document vectors; the vectors represent each unique term with an index, and the value at that index is some measure of how important that term is to the document and to the concept of document similarity in general.
You could simply count the number of times each term occurred in the document (Term Frequency), and use that integer result for the term score in the vector, but the results wouldn't be very good. Extremely common terms (such as "is", "and", and "the") would cause lots of documents to appear similar to each other. (Those particular examples can be handled by using a stopword list, but other common terms that are not general enough to be considered a stopword cause the same sort of issue. On Stackoverflow, the word "question" might fall into this category. If you were analyzing cooking recipes, you'd probably run into issues with the word "egg".)
TF-IDF adjusts the raw term frequency by taking into account how frequently each term occurs in general (the Document Frequency). Inverse Document Frequency is usually the log of the number of documents divided by the number of documents the term occurs in (formula from Wikipedia):
$$\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}$$
where N is the number of documents in the corpus D and the denominator is the number of documents that contain the term t.
Think of the 'log' as a minor nuance that helps things work out in the long run -- it grows when its argument grows, so if the term is rare, the IDF will be high (lots of documents divided by very few documents), if the term is common, the IDF will be low (lots of documents divided by lots of documents ~= 1).
Say you have 100 recipes, and all but one require eggs. Now you have three more documents that all contain the word "egg": once in the first document, twice in the second document and once in the third document. The term frequency for 'egg' in each document is 1 or 2, and the document frequency is 99 (or, arguably, 102, if you count the new documents; let's stick with 99).
The TF-IDF of 'egg' is:
1 * log (100/99) = 0.01 # document 1
2 * log (100/99) = 0.02 # document 2
1 * log (100/99) = 0.01 # document 3
These are all pretty small numbers; in contrast, let's look at another word that occurs in only 9 of your 100-recipe corpus: 'arugula'. It occurs once in the first doc, twice in the second, and does not occur in the third document.
The TF-IDF for 'arugula' is:
1 * log (100/9) = 2.41 # document 1
2 * log (100/9) = 4.82 # document 2
0 * log (100/9) = 0 # document 3
'arugula' is really important for document 2, at least compared to 'egg'. Who cares how many times egg occurs? Everything contains egg! These term vectors are a lot more informative than simple counts, and they will result in documents 1 & 2 being much closer together (with respect to document 3) than they would be if simple term counts were used. In this case, the same result would probably arise (hey! we only have two terms here), but the difference would be smaller.
The take-home here is that TF-IDF generates more useful measures of a term in a document, so you don't focus on really common terms (stopwords, 'egg'), and lose sight of the important terms ('arugula').
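The egg/arugula numbers can be reproduced with a short Python sketch. It assumes the natural logarithm (which matches the values shown), a corpus size of 100, document frequencies of 99 and 9, and per-document counts taken from the calculations above (1, 2, 0 for 'arugula'). All variable and function names are made up for illustration.

```python
import math

N_DOCS = 100
doc_freq = {"egg": 99, "arugula": 9}
terms = ["egg", "arugula"]
term_counts = [
    {"egg": 1, "arugula": 1},  # document 1
    {"egg": 2, "arugula": 2},  # document 2
    {"egg": 1, "arugula": 0},  # document 3
]

def tfidf_vector(counts):
    """Term count times log(N / document frequency), one entry per term."""
    return [counts[t] * math.log(N_DOCS / doc_freq[t]) for t in terms]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

raw = [[c[t] for t in terms] for c in term_counts]
weighted = [tfidf_vector(c) for c in term_counts]

for vec in weighted:
    print([round(x, 2) for x in vec])
# Similarity of document 1 to document 3, with raw counts and with tf-idf.
print(cosine(raw[0], raw[2]), cosine(weighted[0], weighted[2]))
```

With raw counts, document 3 still looks fairly similar to document 1 because both contain 'egg'. With tf-idf weights the shared common term contributes almost nothing, so the similarity drops sharply.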
The complete mathematical procedure for cosine similarity is explained in these tutorials
Suppose you want to calculate the cosine similarity between two documents: the first step is to calculate the tf-idf vectors of the two documents, then take the dot product of these two vectors and divide it by the product of their norms. Those tutorials will help you :)
tf/idf weighting has some cases where it fails and generates NaN errors during computation. It's very important to read this:
Tf-idf is just used to build the vectors from the documents, based on tf (Term Frequency), which counts how many times the term occurs in the document, and idf (Inverse Document Frequency), which measures how rare the term is across the whole collection.
Then you can find the cosine similarity between the documents.
TF-IDF gives a term frequency-inverse document frequency matrix, and computing cosine similarity against that document matrix returns similar listings.
