In this post there is a method to initialize the centers for the K-means algorithm in R. However, the data used there is scalar (i.e. single numbers).
A variation on this question: what if the data has multiple dimensions? In that case each center should be a vector, so start should be a collection of vectors. I tried something like:
C1<- c(1,2)
C2<- c(4,-5)
to have my two initial centers, and then use
kmeans(dat, c(C1,C2))
but it didn't work. I also tried cbind() instead of c(). Same result...
You expand the matrix start to have cluster rows and variables columns (dimensions), where cluster is the number of clusters you are attempting to identify and variables is the number of variables in the data set.
Here is an extension of the post you linked to, expanding the example to 3 dimensions (variables), x, y, and z:
set.seed(1)
dat <- data.frame(x = rnorm(99, mean = c(-5, 0 , 5)),
y = rnorm(99, mean = c(-5, 0, 5)),
z = rnorm(99, mean = c(-5, 2, -4)))
plot(dat)
The plot (a pairs plot of x, y, and z) shows the three simulated groups.
Now we need to specify cluster centres for each of our three clusters. This is done via a matrix as before:
start <- matrix(c(-5, 0, 5, -5, 0, 5, -5, 2, -4), nrow = 3, ncol = 3)
> start
[,1] [,2] [,3]
[1,] -5 -5 -5
[2,] 0 0 2
[3,] 5 5 -4
Here, the important thing to note is that the clusters are in rows. The columns are the coordinates of the specified cluster centre in each dimension. Hence for cluster 1 we are specifying that the centroid is at (-5, -5, -5).
Calling kmeans()
kmeans(dat, start)
results in it picking groups very close to our initial starting points (as it should for this example):
> kmeans(dat, start)
K-means clustering with 3 clusters of sizes 33, 33, 33
Cluster means:
x y z
1 -4.8371412 -4.98259934 -4.953537
2 0.2106241 0.07808787 2.073369
3 4.9708243 4.77465974 -4.047120
Clustering vector:
[1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2
[39] 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1
[77] 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
Within cluster sum of squares by cluster:
[1] 117.78043 77.65203 77.00541
(between_SS / total_SS = 93.8 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
It is worth noting here the output for the cluster centres:
Cluster means:
x y z
1 -4.8371412 -4.98259934 -4.953537
2 0.2106241 0.07808787 2.073369
3 4.9708243 4.77465974 -4.047120
This layout is exactly the same as the matrix start.
You don't have to build the matrix directly using matrix(), nor do you have to specify the centres column-wise. For example:
c1 <- c(-5, -5, -5)
c2 <- c( 0, 0, 2)
c3 <- c( 5, 5, -4)
start2 <- rbind(c1, c2, c3)
> start2
[,1] [,2] [,3]
c1 -5 -5 -5
c2 0 0 2
c3 5 5 -4
Or
start3 <- matrix(c(-5, -5, -5,
0, 0, 2,
5, 5, -4), ncol = 3, nrow = 3, byrow = TRUE)
> start3
[,1] [,2] [,3]
[1,] -5 -5 -5
[2,] 0 0 2
[3,] 5 5 -4
If those are more comfortable for you.
The key thing to remember is that variables are in columns, cluster centres in the rows.
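A quick sanity check of the shape (a minimal sketch, using dat and start from above): kmeans() requires the data and the centre matrix to have the same number of columns.
stopifnot(ncol(start) == ncol(dat))  # one column per variable in both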
## Your centers
C1 <- c(1, 2)
C2 <- c(4, -5)
## Simulate some data with groups around these centers
library(MASS)
set.seed(0)
dat <- rbind(mvrnorm(100, mu=C1, Sigma = matrix(c(2,3,3,10), 2)),
mvrnorm(100, mu=C2, Sigma = matrix(c(10,3,3,2), 2)))
clusts <- kmeans(dat, rbind(C1, C2)) # get clusters with your center starting points
## Look at them
plot(dat, col=clusts$cluster)
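If you also want to see the fitted centres on that plot, one optional extra line (not part of the original answer):
points(clusts$centers, pch = 8, cex = 2, col = 1:2)  # mark the two cluster centres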
Consider the Markov chain with state space S = {1, 2}, transition matrix
P = [ 1/2  1/2 ]
    [  0    1  ]
(row i gives the transition probabilities out of state i, matching mat below), and initial distribution α = (1/2, 1/2).
Suppose the source code for the simulation is the following:
alpha <- c(1, 1) / 2
mat <- matrix(c(1 / 2, 0, 1 / 2, 1), nrow = 2, ncol = 2)
chainSim <- function(alpha, mat, n)
{
out <- numeric(n)
out[1] <- sample(1:2, 1, prob = alpha)
for(i in 2:n)
out[i] <- sample(1:2, 1, prob = mat[out[i - 1], ])
out
}
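For reference, the repeated simulation below could have been produced with something like this (a sketch; the actual seed behind the output shown is unknown):
set.seed(123)  # assumption: the original seed is not given in the post
sim <- replicate(10, chainSim(alpha, mat, 1 + 5))  # 6 x 10 matrix: rows are X0, ..., X5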
Suppose the following is the result of a 5-step Markov Chain simulation repeated 10 times:
> sim
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 2 1 1 2 2 2 1 1 1 2
[2,] 2 1 2 2 2 2 2 1 1 2
[3,] 2 1 2 2 2 2 2 1 2 2
[4,] 2 2 2 2 2 2 2 1 2 2
[5,] 2 2 2 2 2 2 2 2 2 2
[6,] 2 2 2 2 2 2 2 2 2 2
What would be the values of the following?
1. P(X1 = 1, X3 = 1)
2. P(X5 = 2 | X0 = 1, X2 = 1)
3. E(X2)
I tried them as follows:
mean(sim[4, ] == 1 && sim[2, ]== 1)
?
c(1,2) * mean(sim[2, ])
What would be (2)? Am I correct with the rest?
Kindly explain your response.
You are almost correct about 1: there is a difference whether you use && or &, see
?`&&`
It should be
mean(sim[1 + 1, ] == 1 & sim[1 + 3, ] == 1)
Then 2 is given by
mean(sim[1 + 5, sim[1 + 0, ] == 1 & sim[1 + 2, ] == 1] == 2)
where you may get NaN if the conditioning event {X0 = 1, X2 = 1} never appears in your simulation.
Lastly, point 3 is
mean(sim[1 + 2, ])
since a natural estimator of the expected value is simply the sample average.
Take advantage of the problem structure: state 2 is an absorbing state. The only way to have X1 = 1 and X3 = 1 is for the chain to start in state 1 and stay in state 1 at every step up to X3, so P(X1 = 1, X3 = 1) = P(X0 = 1) * (1/2)^3 = (0.5)^4 = 0.0625.
In terms of simulation, rather than
mean(sim[4, ] == 1 && sim[2, ] == 1)
it should be
mean(sim[4, ] == 1 & sim[2, ] == 1)
because && only checks the first component.
For the second part, one possible way is to note that
P(X5=2|X0=1, X2=1)=P(X5=2,X0=1, X2=1)/P(X0=1, X2=1)
so you can estimate the numerator and the denominator separately and then compute the ratio.
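For instance, a minimal sketch of that ratio estimate with the sim matrix above (row r of sim holds X_(r-1)):
num <- mean(sim[1 + 5, ] == 2 & sim[1 + 0, ] == 1 & sim[1 + 2, ] == 1)
den <- mean(sim[1 + 0, ] == 1 & sim[1 + 2, ] == 1)
num / den  # estimate of P(X5 = 2 | X0 = 1, X2 = 1); NaN if the conditioning event never occurs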
Alternatively, by the Markov property and time homogeneity, P(X5 = 2 | X0 = 1, X2 = 1) = P(X5 = 2 | X2 = 1) = P(X3 = 2 | X0 = 1).
For the third question, E(X2) is a single number, not a vector. It can be estimated by mean(sim[3, ]).
I was wondering if there might be a way in R to distribute n among k units without repetition (e.g., 3 5 2 is the same as 5 3 2, 2 3 5, and 5 2 3) and without allowing 0s (i.e., no 9 1 0), and to see the make-up of this distribution.
For example if n = 9 and k = 3 then we expect the make-up to be:
(Note: k will always be the # of columns)
3 3 3
4 3 2
4 1 4
5 2 2
5 1 3
6 2 1
7 1 1
makeup <- function(n, k){
# your suggested solution #
}
These are called integer partitions (more specifically restricted integer partitions) and can efficiently be generated with the packages partitions or arrangements like so:
partitions::restrictedparts(9, 3, include.zero = FALSE)
[1,] 7 6 5 4 5 4 3
[2,] 1 2 3 4 2 3 3
[3,] 1 1 1 1 2 2 3
arrangements::partitions(9, 3)
[,1] [,2] [,3]
[1,] 1 1 7
[2,] 1 2 6
[3,] 1 3 5
[4,] 1 4 4
[5,] 2 2 5
[6,] 2 3 4
[7,] 3 3 3
They are much faster than the other solutions provided here:
library(microbenchmark)
microbenchmark(arrangePack = arrangements::partitions(20, 5),
partsPack = partitions::restrictedparts(20, 5, include.zero = FALSE),
myfun2(20, 5, 20),
myfun1(20, 5, 20),
makeup(20, 5),
mycomb(20, 5), times = 3, unit = "relative")
Unit: relative
expr min lq mean median uq max neval
arrangePack 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 3
partsPack 3.070203 2.755573 2.084231 2.553477 1.854912 1.458389 3
myfun2(20, 5, 20) 10005.679667 8528.784033 6636.284386 7580.133387 5852.625112 4872.050067 3
myfun1(20, 5, 20) 12770.400243 10574.957696 8005.844282 9164.764625 6897.696334 5610.854109 3
makeup(20, 5) 15422.745155 12560.083171 9248.916738 10721.316721 7812.997976 6162.166646 3
mycomb(20, 5) 1854.125325 1507.150003 1120.616461 1284.278219 950.015812 760.280469 3
In fact, for the example below, the other functions will error out because of memory:
system.time(arrangements::partitions(100, 10))
user system elapsed
0.068 0.031 0.099
arrangements::npartitions(100, 10)
[1] 2977866
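If you want the result in the shape asked for in the question, a thin wrapper around arrangements is enough (a sketch; it assumes the arrangements package is installed):
makeup <- function(n, k) {
  arrangements::partitions(n, k)  # one partition of n into k positive parts per row
}
makeup(9, 3)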
You may try gtools::combinations for this, as below, with the repeats.allowed = TRUE option:
m <- gtools::combinations(9, 3, repeats.allowed = TRUE)
m[rowSums(m) == 9,]
A possible function is below; with options(expressions = 500000) it can go up to n = 500 (it ran successfully on my machine for n = 500, r = 3):
mycomb <- function(n, r, sumval){
  m <- gtools::combinations(n, r, repeats.allowed = TRUE)
  m[rowSums(m) == sumval, ]
}
mycomb(9,3,9)
Output:
# [,1] [,2] [,3]
#[1,] 1 1 7
#[2,] 1 2 6
#[3,] 1 3 5
#[4,] 1 4 4
#[5,] 2 2 5
#[6,] 2 3 4
#[7,] 3 3 3
Here's a base solution using expand.grid. I'm not going to recommend it for large n, but it works:
makeup <- function(n, k) {
  x <- expand.grid(rep(list(1:n), k))      # generate all combinations of k values from 1:n
  x <- x[rowSums(x) == n,]                 # filter out stuff that doesn't sum to n
  x <- as.data.frame(t(apply(x, 1, sort))) # order everything
  unique(x)                                # keep non-duplicates
}
A little rethinking simplifies this greatly. If we have a vector of n objects, we can break it apart at n - 1 different spots. Starting from this, we can reduce the work substantially:
makeup <- function(n, k) {
  splits <- combn(n-1, k-1)                        # locations where to split up the data
  bins <- rbind(rep(0, ncol(splits)), splits)      # add an extra "split" before the 1st element
  x <- apply(bins, 2, function(x) c(x[-1], n) - x) # count how many items in each bin
  x <- as.data.frame(t(apply(x, 2, sort)))         # order everything
  unique(x)                                        # keep non-duplicates
}
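Calling makeup(9, 3) should then reproduce the seven partitions listed in the question (possibly in a different row order):
makeup(9, 3)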
Using a matrix in base R:
myfun1 <- function( n, k){
x <- as.matrix(expand.grid( rep(list(seq_len(n)), k)))
x <- x[rowSums(x) == n,]
x[ ! duplicated( t( apply(x, 1, sort)) ),]
}
myfun1( n = 9, k = 3 )
Maybe this, using data.table:
myfun2 <- function( n, k){
require('data.table')
dt <- do.call(CJ, rep(list(seq_len(n)), k))
dt <- dt[rowSums(dt) == n,]
dt[which(!duplicated(dt[, transpose(lapply( transpose(.SD), sort ))])),]
}
myfun2( n = 9, k = 3 )
# V1 V2 V3
# 1: 7 1 1
# 2: 6 2 1
# 3: 5 3 1
# 4: 4 4 1
# 5: 5 2 2
# 6: 4 3 2
# 7: 3 3 3
I have a matrix like the following:
i j value
[1,] 3 6 0.194201129
[2,] 3 5 0.164547043
[3,] 3 4 0.107149279
[4,] 4 3 0.004927017
[5,] 3 1 0.080454448
[6,] 1 2 0.003220612
[7,] 2 6 0.162313646
[8,] 3 3 0.114992628
[9,] 4 1 0.015337253
[10,] 1 6 0.026550051
[11,] 3 2 0.057004116
[12,] 4 2 0.006441224
[13,] 4 5 0.025641026
[14,] 2 4 0.004885993
[15,] 1 1 0.036552785
[16,] 1 5 0.048249186
[17,] 1 4 0.006053565
[18,] 1 3 0.004970296
As you can see, for some i, j pairs there is an inverse pair. For example, for i = 3, j = 1 there is a pair with i = 1, j = 3.
Here is what I want to achieve.
For every i, j pair, subtract the value of its inverse pair and take the absolute value of the difference. For pairs that have no inverse pair, subtract 0.
Here are a couple of examples:
For i = 3, j = 5 there is no inverse pair (i = 5, j = 3) and thus the calculation becomes:
abs(0.164547043 - 0)
For i = 3, j = 1 there is an inverse pair on the matrix with i = 1, j = 3 and thus the calculation is going to be :
abs(0.004970296 - 0.080454448)
I approached this by writing a bunch of code (65 lines) full of for loops, and it's hard to read and edit.
So I was wondering whether there is a more efficient way to do something like that, using more compact functions.
Motivated by a previous post whose answer was pretty simple (using the aggregate() function), and by searching online for such functions, I'm trying to use mapply() here, but the truth is that I cannot handle the inverse pairs.
EDIT:
Here is the dput() of the matrix:
memMatrix <- structure(c(3, 3, 3, 4, 3, 1, 2, 3, 4, 1, 3, 4, 4, 2, 1, 1, 1,
1, 6, 5, 4, 3, 1, 2, 6, 3, 1, 6, 2, 2, 5, 4, 1, 5, 4, 3, 0.194201128983738,
0.164547043451226, 0.107149278958536, 0.00492701677834917, 0.0804544476798398,
0.00322061191626409, 0.162313646044361, 0.114992627755601, 0.0153372534398016,
0.0265500506171091, 0.0570041160347523, 0.00644122383252818,
0.0256410256410256, 0.00488599348534202, 0.0365527853282693,
0.0482491856677524, 0.0060535654765406, 0.00497029586494912), .Dim = c(18L,
3L), .Dimnames = list(NULL, c("i", "j", "value")))
Also, here is the code that works so far, but it is a lot more complicated.
Here memMatrix is the matrix given at the top of the post. You can also see a small difference: I'm multiplying the absolute value by a variable called probability_distribution, but that doesn't really matter. I left the multiplication out of the initial post to keep it simple.
subFunc <- function( memMatrix , probability_distribution )
{
# Node specific edge relevance matrix
node_edgeRelm <- matrix(ncol = 3)
colnames(node_edgeRelm) <- c("i","j","rel")
node_edgeRelm <- na.omit(node_edgeRelm)
for ( row in 1:nrow( memMatrix ) )
{
pair_i <- memMatrix[row,"i"]
pair_j <- memMatrix[row,"j"]
# If already this pair of i and j has been calculated continue with the next pair
# At the end of a new calculation, we store the i,j (verse) values in order from lower to higher
# and then we check here for the inverse j,i values (if exists).
if( pair_i < pair_j )
if( any(node_edgeRelm[,"i"] == pair_i & node_edgeRelm[,"j"] == pair_j) ) next
if( pair_j < pair_i )
if( any(node_edgeRelm[,"i"] == pair_j & node_edgeRelm[,"j"] == pair_i) ) next
# Verse i,j
mepm_ij <- as.numeric( memMatrix[which( memMatrix[,"i"] == pair_i & memMatrix[,"j"] == pair_j ), "value"] )
if( length(mepm_ij) == 0 )
mepm_ij <- 0
# Inverse j,i
mepm_ji <- as.numeric( memMatrix[which( memMatrix[,"i"] == pair_j & memMatrix[,"j"] == pair_i ), "value"] )
if( length(mepm_ji) == 0 )
mepm_ji <- 0
# Calculate the edge relevance for that specific initial node x and pair i,j
edge_relevance <- probability_distribution * abs( mepm_ij - mepm_ji )
# Store that specific edge relevance with an order from lower to higher node
if ( pair_i < pair_j)
node_edgeRelm <- rbind( node_edgeRelm, c( as.numeric(pair_i), as.numeric(pair_j), as.numeric(edge_relevance) ) )
else
node_edgeRelm <- rbind( node_edgeRelm, c( as.numeric(pair_j), as.numeric(pair_i), as.numeric(edge_relevance) ) )
}
na.omit(node_edgeRelm)
}
You can run it as subFunc(memMatrix, 1/3).
Assuming that the input is the matrix m, group the value elements by pairs that have the same i, j or j, i. There will be either 1 or 2 value elements in each such group, so for any specific group append a zero to that 1- or 2-length vector, take the first 2 elements, difference them, and take the absolute value. This procedure does not change the row order. It gives a data frame, but it could be converted back to a matrix if need be using as.matrix. No packages are used.
absdiff <- function(x) abs(diff(c(x, 0)[1:2]))
transform(m, value = ave(value, pmin(i, j), pmax(i, j), FUN = absdiff))
giving:
i j value
1 3 6 0.194201129
2 3 5 0.164547043
3 3 4 0.102222262
4 4 3 0.102222262
5 3 1 0.075484152
6 1 2 0.003220612
7 2 6 0.162313646
8 3 3 0.114992628
9 4 1 0.009283688
10 1 6 0.026550051
11 3 2 0.057004116
12 4 2 0.001555230
13 4 5 0.025641026
14 2 4 0.001555230
15 1 1 0.036552785
16 1 5 0.048249186
17 1 4 0.009283688
18 1 3 0.075484152
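If a matrix is needed again rather than a data frame (as mentioned above), a small sketch, assuming you store the result in res:
res <- transform(m, value = ave(value, pmin(i, j), pmax(i, j), FUN = absdiff))
as.matrix(res)  # back to a numeric matrix with columns i, j, value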
Here is a solution with library(purrr) to make match() work on lists.
library(purrr)
Create a match that operates on lists:
match2 = as_mapper(match)
Create a list of length-2 vectors holding the two values, then a second list with the values reversed, then match the two lists:
i = match2(L <- map2(df[,1], df[,2], c),
map(L, rev))
Extract the third column at the matched indices:
v = df[i,3]
Replace the NA/unmatched values with 0, do the subtraction, then take abs():
cbind(df, abs(df[,3]-replace(v, is.na(v), 0)))
You can try a tidyverse solution:
library(tidyverse)
df %>% as.tibble() %>%
rowwise() %>%
mutate(id=paste(sort(c(i,j)), collapse = "_")) %>%
group_by(id) %>%
mutate(n=paste0("n", 1:n())) %>%
select(-1,-2) %>%
spread(n, value, fill = 0) %>%
mutate(result=abs(n1-n2))
# A tibble: 14 x 4
# Groups: id [14]
id n1 n2 result
<chr> <dbl> <dbl> <dbl>
1 1_1 0.036552785 0.000000000 0.036552785
2 1_2 0.003220612 0.000000000 0.003220612
3 1_3 0.080454448 0.004970296 0.075484152
4 1_4 0.015337253 0.006053565 0.009283688
5 1_5 0.048249186 0.000000000 0.048249186
6 1_6 0.026550051 0.000000000 0.026550051
7 2_3 0.057004116 0.000000000 0.057004116
8 2_4 0.006441224 0.004885993 0.001555230
9 2_6 0.162313646 0.000000000 0.162313646
10 3_3 0.114992628 0.000000000 0.114992628
11 3_4 0.107149279 0.004927017 0.102222262
12 3_5 0.164547043 0.000000000 0.164547043
13 3_6 0.194201129 0.000000000 0.194201129
14 4_5 0.025641026 0.000000000 0.025641026
The idea is:
Sort i and j rowwise and paste them together in a new column id.
Group by id and add the number of occurrences n.
Spread by n.
Calculate the absolute difference.
Base R:
Let's say the name of your matrix is mat:
> B=matrix(0,max(mat[,1:2]),max(mat[,1:2]))
> B[mat[,1:2]]=mat[,3]
> A=cbind(which(upper.tri(B,T),T),abs(`diag<-`(B,0)[upper.tri(B,T)]-t(B)[upper.tri(B,T)]))
> A[A[,3]>0,]
row col
[1,] 1 1 0.036552785
[2,] 1 2 0.003220612
[3,] 1 3 0.075484152
[4,] 2 3 0.057004116
[5,] 3 3 0.114992628
[6,] 1 4 0.009283688
[7,] 2 4 0.001555230
[8,] 3 4 0.102222262
[9,] 1 5 0.048249186
[10,] 3 5 0.164547043
[11,] 4 5 0.025641026
[12,] 1 6 0.026550051
[13,] 2 6 0.162313646
[14,] 3 6 0.194201129
I have a vector made of 0s and non-zero numbers. I would like to know the length and starting position of each series of non-zero numbers:
a = c(0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 2.6301334, 1.8372030, 0.0000000, 0.0000000, 0.0000000, 1.5632647, 1.1433757, 0.0000000, 1.5412216, 0.8762267, 0.0000000, 1.3087967, 0.0000000, 0.0000000, 0.0000000)
Based on a previous post, it is easy to find the starting positions of the non-zero regions:
Finding the index of first changes in the elements of a vector in R
c(1,1+which(diff(a)!=0))
However, I cannot seem to work out a way of finding the lengths of these regions.
I have tried the following:
dif=diff(which(a==0))
dif_corrected=dif-1 # to correct for the added lengths
row=rbind(position=seq(length(a)), length=c(1, dif_corrected))
position 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
length 1 0 0 0 0 2 0 0 2 2 1 0 0 1 0
NOTE: not all columns are displayed (there are actually 20).
Then I subset this to take away 0 values:
> row[,-which(row[2,]==0)]
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
position 1 6 9 10 11 14 19
length 1 2 2 2 1 1 2
This seems like a decent way of coming up with the positions and lengths of each non-zero series, but it is incorrect:
Position 9 (identified as the start of a non-zero series) is a 0 and instead 10 and 11 are non-zero, so I would expect position 10 and a length of 2 to appear here.
The only result that is correct is position 6, the start of the first non-zero series; it is correctly identified as having a length of 2. All other positions are incorrect.
Can anyone tell me how to index correctly to identify the starting position of each non-zero series and the corresponding lengths?
NOTE: I only did this in R because of the usefulness of the which command, but it would also be good to know how to do this in numpy and create a dictionary of positions and length values.
It seems like rle could be useful here.
# a slightly simpler vector
a <- c(0, 0, 1, 2, 0, 2, 1, 2, 0, 0, 0, 1)
# runs of zero and non-zero elements
r <- rle(a != 0)
# lengths of non-zero elements
r$lengths[r$values]
# [1] 2 3 1
# start of non-zero runs
cumsum(r$lengths)[r$values] - r$lengths[r$values] + 1
# [1] 3 6 12
This also works on vectors with only 0 or non-0, and does not depend on whether or not the vector starts/ends with 0 or non-0. E.g.:
a <- c(1, 1)
a <- c(0, 0)
a <- c(1, 1, 0, 1, 1)
a <- c(0, 0, 1, 1, 0, 0)
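A small wrapper that combines the two pieces above into one result (a sketch, not part of the original answer; the helper name nonzero_runs is made up):
nonzero_runs <- function(a) {
  r <- rle(a != 0)
  data.frame(start  = cumsum(r$lengths)[r$values] - r$lengths[r$values] + 1,
             length = r$lengths[r$values])
}
nonzero_runs(c(0, 0, 1, 2, 0, 2, 1, 2, 0, 0, 0, 1))
#   start length
# 1     3      2
# 2     6      3
# 3    12      1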
A possible data.table alternative, using rleid to create groups and .I to get the start index and calculate the length:
library(data.table)
d <- data.table(a)
d[ , .(start = min(.I), len = max(.I) - min(.I) + 1, nonzero = (a != 0)[1]),
by = .(run = rleid(a != 0))]
# run start len nonzero
# 1: 1 1 2 FALSE
# 2: 2 3 2 TRUE
# 3: 3 5 1 FALSE
# 4: 4 6 3 TRUE
# 5: 5 9 3 FALSE
# 6: 6 12 1 TRUE
If desired, the runs can then easily be sliced by the 'nonzero' column.
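For example, that slicing could look like this (assuming the result of the call above is stored in res):
res <- d[ , .(start = min(.I), len = max(.I) - min(.I) + 1, nonzero = (a != 0)[1]),
          by = .(run = rleid(a != 0))]
res[nonzero == TRUE]  # keep only the runs of non-zero values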
For numpy this is a parallel method to #Maple (with a fix for arrays ending with a nonzero):
import numpy as np

def subSeries(a):
    # 1 where a is non-zero, 0 where it is (close to) zero
    d = np.logical_not(np.isclose(a, np.zeros_like(a))).astype(int)
    # pad with a zero on each side so runs at either end are detected
    starts = np.flatnonzero(np.diff(np.r_[0, d, 0]) == 1)
    ends = np.flatnonzero(np.diff(np.r_[0, d, 0]) == -1)
    return np.c_[starts, ends - starts]  # 0-based start index, run length
Definition:
sublistLen = function(list) {
  z_list <- c(0, list, 0)                     # pad with zeros so runs at either end are caught
  ids_start <- which(diff(z_list != 0) == 1)  # start position of each non-zero run
  ids_end <- which(diff(z_list != 0) == -1)   # one past the end of each run
  lengths <- ids_end - ids_start              # run lengths
  return(
    list(
      'ids_start' = ids_start,
      'ids_end' = ids_end - 1,                # actual end positions
      'lengths' = lengths)
  )
}
Example:
> a <- c(-2,0,0,12,5,0,124,0,0,0,0,4,48,24,12,2,0,9,1)
> sublistLen(a)
$ids_start
[1] 1 4 7 12 18
$ids_end
[1] 1 5 7 16 19
$lengths
[1] 1 2 1 5 2
I'm trying to create a vector whose elements add up to a specific number. For example, let's say I want to create a vector with 4 elements that must add up to 20, so its elements could be 6, 6, 4, 4 or 2, 5, 7, 6, whatever. I tried a few lines using sample() and seq(), but I can't get it to work.
Any help appreciated.
To divide into 4 parts, you need three breakpoints from the 19 possible breaks between 20 numbers. Then your parts are just the sizes of the intervals between 0, your breakpoints, and 20:
> sort(sample(19,3))
[1] 5 7 12
> diff(c(0, 5,7,12,20))
[1] 5 2 5 8
Test: let's create a big matrix of them. Each column is an instance:
> trials = sapply(1:1000, function(X){diff(c(0,sort(sample(19,3)),20))})
> trials[,1:6]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 3 1 8 13 3 2
[2,] 4 7 10 2 9 5
[3,] 2 11 1 4 3 7
[4,] 11 1 1 1 5 6
Do they all add to 20?
> all(apply(trials,2,sum)==20)
[1] TRUE
Are there any weird cases?
> range(trials)
[1] 1 17
No: there are no zeroes and nothing bigger than 17, which would be the (1, 1, 1, 17) case. You can't have an 18 without a zero.
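Wrapped up as a function for general n and k (a small sketch; rand_parts is a made-up name, and it assumes n > k so positive parts exist):
rand_parts <- function(n, k) {
  diff(c(0, sort(sample(n - 1, k - 1)), n))  # k positive integers summing to n
}
set.seed(42)
rand_parts(20, 4)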
foo = function(n, sum1){
#Divide sum1 into 'n' parts
x = rep(sum1/n, n)
#For each x, sample a value from 1 to that value minus one
f = sapply(x, function(a) sample(1:(a-1), 1))
#Add and subtract f from 'x' so that sum(x) does not change
x = x + sample(f)
x = x - sample(f)
x = floor(x)
x[n] = x[n] - (sum(x) - sum1)
return(x)
}
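For example (a quick check; the output is random, but the parts should always sum to sum1 when sum1/n is a whole number):
set.seed(1)
x <- foo(4, 20)
x
sum(x)  # 20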