Generating two sets of numbers where the sum of each set and the sum of their dot product is N - r

In this question Getting N random numbers that the sum is M, the object was to generate a set of random numbers that sums to a specific number N. After reading this question, I started playing around with the idea of generating sets of numbers that satisfy this condition
sum(A) == sum(B) && sum(B) == sum(A * B)
An example of this would be
A <- c(5, 5, -10, 6, 6, -12)
B <- c(5, -5, 0, 6, -6, 0)
In this case, the three sums equal zero. Obviously, those sets aren't random, but they satisfy the condition. Is there a way to generate 'random' sets of data that satisfy the above condition? (As opposed to using a little algorithm as in the above example.)
(Note: I tagged this as an R question, but the language really doesn't matter to me.)

You'd need to define the first vector in n-dimensional space, and the 2nd one will have N-2 degrees of freedom (i.e. random numbers) since the sum and one angle are already determined.
The 2nd vector would need to be transformed into N-dimensional space; There are infinitely many transforms that could work, so if you don't care about the probability distribution of the resulting vectors, just choose the one that's most intuitive to you.
There's a nice geometrical interpretation to the first constraint: it constrains the 2nd vector to a (hyper-)plane in N-dimensional space; the 2nd constraint doesn't have a simple geometric interpretation.
check out hyperspherical cooridnates.

You can generate one set completely randomly. And generate randomly all numbers in set B except for two numbers. Since you have two equations you should be able to solve for those two numbers.

Related

Series vector for approximating pi

I've been set a question about Madhava's approximation of pi. The first part of it is to create a vector which contains the first 20 terms in the series. I know I could just input the first 20 terms into a vector, however that seems like a really long winded way of doing things. I was wondering if there is an easier way to create the vector?
Currently I have the vector
g = c((-3)^(-0)/(2*0+1), (-3)^(-1)/(2*1+1), (-3)^(-2)/(2*2+1), (-3)^(-3)/(2*3+1), (-3)^(-4)/(2*4+1), (-3)^(-5)/(2*5+1), (-3)^(-6)/(2*6+1), (-3)^(-7)/(2*7+1), (-3)^(-8)/(2*8+1), (-3)^(-9)/(2*9+1), (-3)^(-10)/(2*10+1), (-3)^(-11)/(2*11+1), (-3)^(-12)/(2*12+1), (-3)^(-13)/(2*13+1), (-3)^(-14)/(2*14+1), (-3)^(-15)/(2*15+1), (-3)^(-16)/(2*16+1), (-3)^(-17)/(2*17+1), (-3)^(-18)/(2*18+1), (-3)^(-19)/(2*19+1), (-3)^(-20)/(2*20+1))
And
h = sqrt(12)
So I have done g*h to get the approximation of pi. Surely there's an easier way of doing this?
Apologies if this is relatively basic, I am very new to R and still learning how to properly use stack overflow.
Thanks.
One of the best features of R is that it is vectorised. This means that we can do operations element-wise on entire vectors rather than having to type out the operation for each element. For example, if you wanted to find the square of the first five natural numbers (starting at one), we can do this:
(1:5)^2
which results in the output
[1] 1 4 9 16 25
instead of having to do this:
c(1^2, 2^2, 3^2, 4^2, 5^2)
which gives the same output.
Applying this amazing property of R to your situation, instead of having to manually construct the whole vector, we can just do this:
series <- sqrt(12) * c(1, -1) / 3^(0:19) / seq(from=1, by=2, length.out=20)
sum(series)
which gives the following output:
[1] 3.141593
and we can see more decimal places by doing this:
sprintf("%0.20f", sum(series))
[1] "3.14159265357140338182"
To explain a little further what I did in that line of code to generate the series:
We want to multiply the entire thing by the square root of 12, hence the sqrt(12), which will be applied to every element of the resulting vector
We need the signs of the series to alternate, which is accomplished via * c(1, -1); this is because of recycling, where R recycles elements of vectors when doing vector operations. It will multiply the first element by one, the second element by -1, then recycle and multiply the third element by 1, the fourth by -1, etc.
We need to divide each element by 1, 3, 9, etc., which is accomplished by / 3^(0:19) which gives / c(3^0, 3^1, ...)
Lastly, we also need to divide by 1, 3, 5, 7, etc. which is accomplished by seq(from=1, by=2, length.out=20) (see help(seq))

Using head() to print n ordered rows in dataframe from random starting position

I know I can use
head(sample(x),m)
to print a random selection of m rows from my dataset, but in this case each new draw is randomized. What if, instead of randomizing every draw, I wanted to randomize only the starting position for the first draw, while preserving the order of subsequent rows?
To illustrate, imagine we have a dataset of n rows and I wanted to print m of them in order, starting from a random position. The randomly drawn starting position is 5, so my desired function would print 5, 6, 7, ..., m < n.
This is more of a theoretical question, not a diagnostic one, so I don't believe a MWE example is needed...please let me know if you think it is and I will be happy to provide one.
We create a numeric index using the sample element and adding with the sequence of 'n' rows that should follow it. If the sampled index is say the last row, then we can create a condition to check for those cases
i1 <- sample(nrow(df1), 1)+ 0:3
df1[ i1[i1 <= nrow(df1)], ]

Stirling numbers of the second kind with multisets

I was looking at [Stirling numbers of the second kind], which are the total number of ways to split a set of length n into k non-empty subsets, where order does not matter.(http://mathworld.wolfram.com/StirlingNumberoftheSecondKind.html), and was wondering how to write a non-naive algorithm to compute
S(n, k {occurences of each element})
Where
S(6, 3, {1, 2, 3} )
would give the total number of ways a set with 6 elements in which 3 are the same element and a different 2 are another element (and 1 is its unique element) could be split into 3 non-empty sets, ignoring permutations.
There is a recursive formula for regular Stirling numbers of the second kind S(n, k), but unlikely to be a comparable function for multisets.
So what's an algorithm that could calculate this number?
Relevant question on Math.SE here, without a real method to calculate this number.

In R: sort the maximum dissimilarity between rows in a matrix

I have a matrix, which includes 100 rows and 10 columns, here I want to compare the diversity between rows and sort them. And then, I want to select the 10 maximum dissimilarity rows from it, Which method can I use?
set.seed(123)
mat <- matrix(runif(100 * 10), nrow = 100, ncol = 10)
My initial method is to calculate the similarity (e.g. saying tanimoto coefficient or others: http://en.wikipedia.org/wiki/Jaccard_index ) between two rows, and dissimilairty = 1 - similarity, and then compare the dissimilarty values. At last I will sort all dissimilarity value, and select the 10 maximum dissimilarity values. But it seems that the result is a 100 * 100 matrix, maybe need efficient method to such calculation if there are a large number of rows. However, this is just my thought, maybe not right, so I need help.
[update]
After looking for some literatures. I find the one definition for the maximum dissimilarity method.
Maximum dissimilarity method: It begins by randomly choosing a data record as the first cluster center. The record maximally distant from the first point is selected as the next cluster center. The record maximally distant from both current points is selected after that . The process repeats itself until there is a sufficient number of cluster centers.
Here in my question, the sufficient number should be 10.
Thanks.
First of all, the Jacard Index is not right for you. From the wikipedia page
The Jaccard coefficient measures similarity between finite sample sets...
Your matrix has samples of floats, so you have a different problem (note that the Index in question is defined in terms of intersections; that should be a red flag right there :-).
So, you have to decide what you mean by dissimilarity. One natural interpretation would be to say row A is more dissimilar from the data set than row B if it has a greater Euclidean distance to the center of mass of the data set. You can think of the center of mass of the data set as the vector you get by taking the mean of each of the colums and putting them together (apply(mat, 2, mean)).
With this, you can take the distance of each row to that central vector, and then get an ordering on those distances. From that you can work back to the rows you desire from the original matrix.
All together:
center <- apply(mat, 2, mean)
# not quite the distances, actually, but their squares. That will work fine for us though, since the order
# will still be the same
dists <- apply(mat, 1, function(row) sum((row - center) ** 2))
# this gives us the row indices in order of least to greaest dissimiliarity
dist.order <- order(dists)
# Now we just grab the 10 most dissimilar of those
most.dissimilar.ids <- dist.order[91:100]
# and use them to get the corresponding rows of the matrix
most.dissimilar <- mat[most.dissimilar.ids,]
If I was actually writing this, I probably would have compressed the last three lines as most.dissimilar <- mat[order(dists)[91:100],], but hopefully having it broken up like this makes it a little easier to see what's going on.
Of course, if distance from the center of mass doesn't make sense as the best way of thinking of "dissimilarity" in your context, then you'll have to amend with something that does.

What does rnorm in R return when the sd argument contains a vector?

What does the following code do:
rnorm(10, mean=2, sd=1:10)
The first number is from N(2,1)
The second number if from N(2,2)
The third number is from N(2,3)
etc...?
The first argument tells R how many random variates you want returned. In this case, it will give you back 10 values. Those values will be drawn from normal distributions with mean equal to 2. In addition, all 10 values will be drawn from distributions with different standard deviations, the first with SD=1, the second 2, ..., the 10th SD=10. Perhaps the thing to understand is that R, by its nature, is vectorized. That is, there is no such thing as a scalar, only a vector of length=1. (I recognize that that doesn't make a lot of sense within pure math, but it does in computer science.) As a result, arguments are often 'recycled' so that they will all match the length of the longest vector, i.e., you end up with a vector of 10 means, each equal to 2, to match your vector of 10 SDs. HTH.

Resources