Split a number into random parts with a constraint - R

I am trying to find a smart way to split a number (e.g. 50) into a number of random parts (e.g. 20), BUT under the constraint that no generated random value can be greater than a specific value (e.g. 4).
So in this case I would expect as output a vector of 20 values whose sum is 50, with none of the 20 values greater than 4 (e.g. 2.5, 1.3, 3.9 etc.).
I had a look at similar questions, but from what I can see they deal with splitting a number into equal or random parts and none of them includes the constraint, which is the bit I am stuck on! Any help would be highly appreciated!

Here is a fast (random) solution, as long as you can accept parts with one decimal place.
Every time you run partitionsSample, you will get a different answer.
library(RcppAlgos)
# input
goal <- 50
parts <- 20
maxval <- 4
# sample with 1 decimal
m <- partitionsSample(maxval * 10, parts, repetition = FALSE, target = goal * 10, n = 1)
# divide by ten
answer <- m[1,]/10
# [1] 0.2 1.4 1.5 1.6 1.7 1.8 1.9 2.2 2.3 2.6 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.7 3.9
# check
sum(answer)
# [1] 50

set.seed(42)
repeat {
vec <- runif(20)
vec <- 50 * vec/sum(vec)
if (all(vec <= 4)) break
}
sum(vec)
# [1] 50
vec
# [1] 3.7299658 3.8207653 1.1666852 3.3860087 2.6166080 2.1165253 3.0033133 0.5490801 2.6787741 2.8747815 1.8663641
# [12] 2.9320577 3.8109668 1.0414675 1.8849202 3.8327490 3.9885516 0.4790347 1.9367197 2.2846613
Note: for some combinations this could run for a very long time, because a vector in which all values are at or below 4 becomes improbable. It works well here, but if you (say) change 20 to 10, it can never succeed and will loop forever: 10 values of at most 4 sum to at most 40, which is less than 50.
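The impossible case described in the note can be detected before looping. A minimal sketch, assuming a hypothetical wrapper `split_capped()` and an arbitrary `max_tries` cap (neither is part of the answer above):

```r
# Hypothetical helper: rejection sampling with an up-front feasibility check.
split_capped <- function(total, parts, cap, max_tries = 10000L) {
  # `parts` values of at most `cap` can sum to at most parts * cap
  if (parts * cap < total) {
    stop("infeasible: parts * cap is smaller than total")
  }
  for (i in seq_len(max_tries)) {
    vec <- runif(parts)
    vec <- total * vec / sum(vec)   # rescale so the parts sum to `total`
    if (all(vec <= cap)) return(vec)
  }
  stop("no valid split found within max_tries attempts")
}

set.seed(42)
vec <- split_capped(50, 20, 4)
```

The cap on attempts only guards against near-infeasible inputs, where a valid draw exists but is very unlikely; the right value for it depends on how long you are willing to wait.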

Another possible solution (by limiting the range of the interval of the uniform distribution, it is more likely to get a solution):
set.seed(123)
x <- 50
n <- 20
threshold <- 4
gotit <- FALSE
while (!gotit)
{
v <- round(runif(n, 0, x/(n/2)), 1)
# exact comparison of doubles; isTRUE(all.equal(sum(v), x)) is a more tolerant test
if (sum(v) == x && all(v <= threshold))
gotit <- TRUE
}
v
#> [1] 2.2 3.0 2.2 3.0 3.0 0.5 2.4 2.5 2.7 4.0 1.0 1.7 1.2 2.8 2.9 3.3 2.9 3.0 3.0
#> [20] 2.7

Related

R Use of number 2 prefacing nrow function?

Probably a silly question, but what is the purpose of the number 2 in the code below (specifically the 2:nrow)?
cumsum_Petal.Width <- iris$Petal.Width[1] # Create new vector object
for(i in 2:nrow(iris)) { # Use nrow as condition
cumsum_Petal.Width[i] <- # Calculate cumulative sum
cumsum_Petal.Width[i - 1] + iris$Petal.Width[i]
}
cumsum_Petal.Width # Print vector to RStudio console
# 0.2 0.4 0.6 0.8 1.0 1.4 1.7 1.9 2.1 2.2
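The loop starts at 2 because element 1 of the result is filled in before the loop, and each iteration reads the previous element through `i - 1`; starting at 1 would try to read index 0. A minimal sketch of the same idea on a small made-up vector `x` (not taken from the question), checked against base R's `cumsum()`:

```r
x <- c(0.2, 0.2, 0.2, 0.2, 0.2, 0.4)   # made-up example values

running <- x[1]                  # element 1 is set before the loop
for (i in 2:length(x)) {         # so the loop begins at element 2
  running[i] <- running[i - 1] + x[i]
}

all.equal(running, cumsum(x))    # compare with the built-in cumulative sum
```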

How to subset a vector so that the subset represents the general shape of the original vector in R

I have vectors of different sizes, and I want to take the same number of samples from each of them (for example 10 samples per vector), in such a way that the samples represent each vector.
Suppose that one of my vectors is
y=c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23)
What are the 10 representative points of this vector?
In case you are referring to retaining the shape of the curve, you can try preserving the local minima and maxima:
library(dplyr) # provides %>%, mutate, if_else, lag, lead, filter, select
df = as.data.frame(y)
y2 <- df %>%
mutate(loc_minima = if_else(lag(y) > y & lead(y) > y, TRUE, FALSE)) %>%
mutate(loc_maxima = if_else(lag(y) < y & lead(y) < y, TRUE, FALSE)) %>%
filter(loc_minima == TRUE | loc_maxima == TRUE) %>%
select(y)
Though this does not guarantee you'll have exactly 10 points.
Thanks to @minem, I got my answer. Perfect!
library(kmlShape)
Px=(1:length(y))
Py=y
par(mfrow=c(1,2))
plot(Px,Py,type="l",main="original points")
plot(DouglasPeuckerNbPoints(Px,Py,10),type="b",col=2,main="reduced points")
The result uses the Ramer–Douglas–Peucker algorithm (figure: original points versus reduced points).
The best answer has already been given, but since I was working on it, I will post my naive heuristic solution.
Disclaimer: this is certainly less efficient and more naive than the Ramer–Douglas–Peucker algorithm, but in this case it gives a similar result...
# Try to remove iteratively one element from the vector until we reach N elements only.
# At each iteration, the reduced vector is interpolated and completed again
# using a spline, then it's compared with the original one and the
# point leading to the smallest difference is selected for the removal.
heuristicDownSample <- function(x,y,n=10){
idxReduced <- 1:length(x)
while(length(idxReduced) > n){ # use the n argument rather than a hard-coded 10
minDist <- NULL
idxFinal <- NULL
for(idxToRemove in 1:length(idxReduced)){
newIdxs <- idxReduced[-idxToRemove]
spf <- splinefun(x[newIdxs],y[newIdxs])
full <- spf(x)
dist <- sum((full-y)^2)
if(is.null(minDist) || dist < minDist){
minDist <- dist
idxFinal <- newIdxs
}
}
idxReduced <- idxFinal
}
return(list(x=x[idxReduced],y=y[idxReduced]))
}
Usage :
y=c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23)
x <- 1:length(y)
reduced <- heuristicDownSample(x,y,10)
par(mfrow=c(1,2))
plot(x=x,y=y,type="b",main="original")
plot(x=reduced$x,y=reduced$y,type="b",main="reduced",col='red')
You could use cut to generate a factor that indicates which bin each value belongs to (here four equal-width bins between the minimum and the maximum, although the code calls them quintiles), and then sample from there:
df <- data.frame(values = c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23))
cutpoints <- seq(min(df$values), max(df$values), length.out = 5)
> cutpoints
[1] -2.00 4.25 10.50 16.75 23.00
df$quintiles <- cut(df$values, cutpoints, include.lowest = TRUE)
> df
values quintiles
1 2.5 [-2,4.25]
2 1.0 [-2,4.25]
3 0.0 [-2,4.25]
4 1.2 [-2,4.25]
5 2.0 [-2,4.25]
6 3.0 [-2,4.25]
7 2.0 [-2,4.25]
8 1.0 [-2,4.25]
9 0.0 [-2,4.25]
10 -2.0 [-2,4.25]
11 -1.0 [-2,4.25]
12 0.5 [-2,4.25]
13 2.0 [-2,4.25]
14 3.0 [-2,4.25]
15 6.0 (4.25,10.5]
16 5.0 (4.25,10.5]
17 7.0 (4.25,10.5]
18 9.0 (4.25,10.5]
19 11.0 (10.5,16.8]
20 15.0 (10.5,16.8]
21 23.0 (16.8,23]
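Now you could split the values by bin, calculate the proportion of values in each bin and sample from the groups.
groups <- split(df$values, df$quintiles) # split the values themselves, so that length(x) in resample() counts values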
probs <- prop.table(table(df$quintiles))
nsample <- as.vector(ceiling(probs*10))
> nsample
[1] 7 2 1 1
resample <- function(x, ...) x[sample.int(length(x), ...)]
mysamples <- mapply(function(x, y) resample(x = x, size = y), groups, nsample)
z <- unname(unlist(mysamples))
> z
[1] 2.0 1.0 0.0 1.0 3.0 0.5 3.0 5.0 9.0 11.0 23.0
Due to ceiling(), this may lead to 11 cases being sampled instead of 10.
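One way to avoid the overshoot from `ceiling()` is to round the allocations so that they sum to exactly 10. A rough sketch using largest-remainder rounding; the helper name `allocate()` is made up for illustration, and unlike `ceiling()` it can assign zero samples to a small group:

```r
# Hypothetical helper: split `total` draws across groups in proportion to
# `probs`, rounding so that the counts sum to exactly `total`.
allocate <- function(probs, total) {
  raw   <- probs * total
  base  <- floor(raw)
  short <- total - sum(base)                 # draws still to hand out
  extra <- order(raw - base, decreasing = TRUE)[seq_len(short)]
  base[extra] <- base[extra] + 1             # largest remainders get one more
  base
}

values <- c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23)
bins   <- cut(values, seq(min(values), max(values), length.out = 5),
              include.lowest = TRUE)
probs  <- as.vector(prop.table(table(bins)))
nsample <- allocate(probs, 10)
sum(nsample)
```

Whether dropping the smallest group is acceptable depends on the purpose of the sample; if every bin must be represented, the counts would need a minimum of one before the remainder is distributed.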
Apparently you are interested in systematic sampling. If so, maybe the following can help.
set.seed(1234)
n <- 10
step <- floor(length(y)/n)
first <- sample(step, 1)
z <- y[step*(seq_len(n) - 1) + first]

Sum elements of a list in R

I have multiple files with the same structure. I would like to calculate the sum of each first element of each file.
Here is the head of one file:
> head(NodesA1to10)
NodeAMean NodeBMean NodeCMean NodeBMeanclo NodeCMeanclo NodeAMeanclo NodeBMeanin NodeCMeanin
1 0.6 0.5 0.1 0.06978355 0.08988359 0.06957624 0.3890288 0.2389382
2 0.6 1.1 0.4 0.04403793 0.04955660 0.04052408 0.4039787 0.2533702
3 0.5 1.6 0.4 0.03244537 0.03386485 0.02925561 0.4301655 0.2714006
4 0.6 2.7 1.3 0.02629242 0.02545891 0.02101667 0.4710782 0.2987464
5 1.0 2.9 1.4 0.02194354 0.02050273 0.01643439 0.5221127 0.3531018
6 1.8 5.1 2.0 0.01895985 0.01674231 0.01369418 0.5651066 0.3922610
NodeAMeanin NodeBMeanout NodeCMeanout NodeAMeanout
1 0.3550979 0.3236412 0.3807460 0.2786777
2 0.6173852 0.4587713 0.4784292 0.4671891
3 0.7020968 0.5933347 0.5594147 0.5536033
4 0.7686962 0.7336590 0.6230537 0.5872133
5 0.8024617 0.8738159 0.7147912 0.6457293
6 0.8289504 1.0158102 0.7665834 0.7186818
I have 10 files in total with the same format. I would like to calculate the sum as below:
N<-10
B1_1 <-sum(NodesA1to10$NodeBMean[1],NodesA11to20$NodeBMean[1],NodesA21to30$NodeBMean[1],NodesA31to40$NodeBMean[1],
NodesA41to50$NodeBMean[1],NodesA61to70$NodeBMean[1],NodesA71to80$NodeBMean[1],NodesA81to90$NodeBMean[1],
NodesA91to100$NodeBMean[1])/N
B1_2 <-sum(NodesA1to10$NodeBMean[2],NodesA11to20$NodeBMean[2],NodesA21to30$NodeBMean[2],NodesA31to40$NodeBMean[2],
NodesA41to50$NodeBMean[2],NodesA61to70$NodeBMean[2],NodesA71to80$NodeBMean[2],NodesA81to90$NodeBMean[2],
NodesA91to100$NodeBMean[2])/N
This code works for me, but I have to repeat it 10 times and then again for different conditions, so I would really like to do it in a faster way. I tried to use a list:
allNodesA <- list(NodesA1to10,NodesA11to20,NodesA21to30,
NodesA31to40,NodesA41to50,NodesA51to60,NodesA61to70,
NodesA71to80,NodesA81to90,NodesA91to100)
B1<-lapply(allNodesA, function(z) { z$NodeBMean <- sum(z$NodeBMean[[1]]);z})
NodesBb<-do.call(rbind.data.frame, B1)
B1<-list()
for (j in 1:10)
{
B1[[j]]<-lapply(allNodesA, function(z) { sum(z[[j]]$NodeBMean[j])})
}
But this solution gives me an error:
Error in z[[j]]$NodeBMean : $ operator is invalid for atomic vectors
Could someone please help me find a better way to get the sum of every 1st, 2nd, ..., 10th element across many files?
Something like this may work for you.
N <-10
allNodesA <- list(NodesA1to10,NodesA11to20,NodesA21to30,
NodesA31to40,NodesA41to50,NodesA51to60,NodesA61to70,
NodesA71to80,NodesA81to90,NodesA91to100)
vals <- numeric()
for(i in 1:nrow(NodesA1to10)){
vals <- append(vals, sum(sapply(allNodesA, function(x) x[i,"NodeBMean"])) / N)
}
vals
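The same result can also be obtained without growing a vector inside a loop. A minimal sketch, assuming every data frame has the same number of rows and a `NodeBMean` column; the three small data frames below are made-up stand-ins for `NodesA1to10` and the others:

```r
# Made-up stand-ins for the real data frames
a  <- data.frame(NodeBMean = c(0.5, 1.1, 1.6))
b  <- data.frame(NodeBMean = c(1.5, 0.9, 2.4))
c3 <- data.frame(NodeBMean = c(1.0, 1.0, 2.0))
allNodes <- list(a, b, c3)

# One column per data frame, one row per position
mat <- sapply(allNodes, function(df) df$NodeBMean)

# Mean of the 1st, 2nd, ... elements across all data frames
vals <- rowMeans(mat)
vals
```

Note that `rowMeans()` divides by the number of data frames in the list, which matches dividing by `N` only when `N` equals the length of the list.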

Multiply each component of vector by another vector (resulting in vector of length m*n)

Say I am making parts that come in three sizes, and each size has a certain tolerance:
target <- c(2, 4, 6)
tolerance <- c(0.95, 1.05)
What I'd like to end up with is an array that contains the limits of the tolerance for each target (i.e. target*0.95, target*1.05):
tol = (2*0.95, 2*1.05, 4*0.95, 4*1.05, 6*0.95, 6*1.05)
Here's a really ugly way of getting there, but I know there is a simple way to do this.
j<-1
tol<-NULL
for (i in target){
tol[j] <- i*tolerance[1]
tol[j+1] <- i*tolerance[2]
j<-j+2
}
The vector tol can be calculated using outer() like this:
tol <- c(outer(tolerance,target))
#> tol
#[1] 1.9 2.1 3.8 4.2 5.7 6.3
You can achieve that using matrix product:
target <- c(2, 4, 6)
tolerance <- c(0.95, 1.05)
target %*% t(tolerance)
[,1] [,2]
[1,] 1.9 2.1
[2,] 3.8 4.2
[3,] 5.7 6.3
The other answer would have my preference, but this alternative might generalise better in some specific context (more than two vectors)
Reduce("*", expand.grid(list(tolerance, target)))
Mostly for fun - using R's recycling:
rep(target, each = length(tolerance)) * tolerance
#[1] 1.9 2.1 3.8 4.2 5.7 6.3
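A quick check, as a sketch, that the approaches above give the same six values in the same order. The matrix product has to be transposed before flattening, since R stores matrices column by column; the names `a`, `b`, `c3` and `d` are only for illustration:

```r
target <- c(2, 4, 6)
tolerance <- c(0.95, 1.05)

a  <- c(outer(tolerance, target))
b  <- c(t(target %*% t(tolerance)))           # transpose, then flatten
c3 <- Reduce("*", expand.grid(list(tolerance, target)))
d  <- rep(target, each = length(tolerance)) * tolerance

same <- isTRUE(all.equal(a, b)) &&
  isTRUE(all.equal(a, c3)) &&
  isTRUE(all.equal(a, d))
same
```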

Computing sparse pairwise distance matrix in R

I have an NxM matrix (N points in M dimensions) and I want to compute the NxN matrix of Euclidean distances between the N points. In my problem, N is about 100,000. As I plan to use this matrix for a k-nearest neighbor algorithm, I only need to keep the k smallest distances, so the resulting NxN matrix is very sparse. This is in contrast to what comes out of dist(), for example, which would result in a dense matrix (and probably storage problems for my size N).
The packages for kNN that I've found so far (knnflex, kknn, etc) all appear to use dense matrices. Also, the Matrix package does not offer a pairwise distance function.
Closer to my goal, I see that the spam package has a nearest.dist() function that allows one to only consider distances less than some threshold, delta. In my case, however, a particular value of delta may produce too many distances (so that I have to store the NxN matrix densely) or too few distances (so that I can't use kNN).
I have seen previous discussion on trying to perform k-means clustering using the bigmemory/biganalytics packages, but it doesn't seem like I can leverage these methods in this case.
Does anybody know a function/implementation that will compute a distance matrix in a sparse fashion in R? My (dreaded) backup plan is to have two for loops and save results in a Matrix object.
Well, we can't have you resorting to for-loops, now can we :)
There is of course the question of how to represent the sparse matrix. A simple way is to have it only contain the indices of the points that are closest (and recalculate as needed). But in the solution below, I put both distance ('d1' etc) and index ('i1' etc) in a single matrix:
sparseDist <- function(m, k) {
m <- t(m)
n <- ncol(m)
d <- vapply( seq_len(n-1L), function(i) {
d<-colSums((m[, seq(i+1L, n), drop=FALSE]-m[,i])^2)
o<-sort.list(d, na.last=NA, method='quick')[seq_len(k)]
c(sqrt(d[o]), o+i)
}, numeric(2*k)
)
dimnames(d) <- list(c(paste('d', seq_len(k), sep=''),
paste('i', seq_len(k), sep='')), colnames(m)[-n])
d
}
Trying it out on 9 2d-points:
> m <- matrix(c(0,0, 1.1,0, 2,0, 0,1.2, 1.1,1.2, 2,1.2, 0,2, 1.1,2, 2,2),
9, byrow=TRUE, dimnames=list(letters[1:9], letters[24:25]))
> print(dist(m), digits=2)
a b c d e f g h
b 1.1
c 2.0 0.9
d 1.2 1.6 2.3
e 1.6 1.2 1.5 1.1
f 2.3 1.5 1.2 2.0 0.9
g 2.0 2.3 2.8 0.8 1.4 2.2
h 2.3 2.0 2.2 1.4 0.8 1.2 1.1
i 2.8 2.2 2.0 2.2 1.2 0.8 2.0 0.9
> print(sparseDist(m, 3), digits=2)
a b c d e f g h
d1 1.1 0.9 1.2 0.8 0.8 0.8 1.1 0.9
d2 1.2 1.2 1.5 1.1 0.9 1.2 2.0 NA
d3 1.6 1.5 2.0 1.4 1.2 2.2 NA NA
i1 2.0 3.0 6.0 7.0 8.0 9.0 8.0 9.0
i2 4.0 5.0 5.0 5.0 6.0 8.0 9.0 NA
i3 5.0 6.0 9.0 8.0 9.0 7.0 NA NA
And trying it on a larger problem (10k points). Still, on 100k points and more dimensions it will take a long time (like 15-30 minutes).
n<-1e4; m<-3; m=matrix(runif(n*m), n)
system.time( d <- sparseDist(m, 3) ) # 9 seconds on my machine...
P.S. Just noted that you posted an answer as I was writing this: the solution here is roughly twice as fast because it doesn't calculate the same distance twice (the distance between points 1 and 13 is the same as between points 13 and 1).
For now I am using the following, inspired by this answer. The output is a k x n matrix where element (j, i) is the index of the data point that is the jth closest to point i (points are stored as the columns of x).
n <- 10
d <- 3
x <- matrix(rnorm(n * d), ncol = n)
min.k.dists <- function(x,k=5) {
apply(x,2,function(r) {
b <- colSums((x - r)^2)
o <- order(b)
o[1:k]
})
}
min.k.dists(x) # first row should be 1:ncol(x); these points have distance 0
dist(t(x)) # can check answer against this
If one is worried about how ties are handled and whatnot, perhaps rank() should be incorporated.
The above code seems somewhat fast, but I'm sure it could be improved (though I don't have time to go the C or fortran route). So I'm still open to fast and sparse implementations of the above.
Below I include a parallelized version that I ended up using:
min.k.dists <- function(x,k=5,cores=1) {
require(parallel) # mclapply() now lives in the base 'parallel' package; 'multicore' is no longer on CRAN
xx <- as.list(as.data.frame(x))
names(xx) <- c()
m <- mclapply(xx,function(r) {
b <- colSums((x - r)^2)
o <- order(b)
o[1:k]
},mc.cores=cores)
t(do.call(rbind,m))
}
If you want to keep the logic of your min.k.dists function and return duplicate distances, you might want to consider modifying it a bit. It seems pointless to return the first row, which is just each point at distance 0 from itself, right? By incorporating some of the tricks in my other answer, you can speed up your version by some 30%:
min.k.dists2 <- function(x, k=4L) {
k <- max(2L, k + 1L)
apply(x, 2, function(r) {
sort.list(colSums((x - r)^2), na.last=NA, method='quick')[2:k]
})
}
> n<-1e4; m<-3; m=matrix(runif(n*m), n)
> system.time(d <- min.k.dists(t(m), 4)) #To get 3 nearest neighbours and itself
user system elapsed
17.26 0.00 17.30
> system.time(d <- min.k.dists2(t(m), 3)) #To get 3 nearest neighbours
user system elapsed
12.7 0.0 12.7
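On a small problem, the neighbour indices can be checked against the dense dist() result. A minimal sketch with made-up data (`pts`, `nn` and the other names are only for illustration); it compares distances rather than indices so that ties do not cause spurious mismatches:

```r
set.seed(1)
pts <- matrix(runif(50 * 3), nrow = 3)   # 50 made-up points, one per column
k <- 3

# Indices of the k nearest neighbours of each point, excluding the point itself
nn <- apply(pts, 2, function(r) {
  order(colSums((pts - r)^2))[2:(k + 1)]
})

# Reference: full dense distance matrix (only practical for small n)
full <- as.matrix(dist(t(pts)))

# Distances to the neighbours found above, and the k smallest
# non-self distances taken directly from the dense matrix
d_nn  <- sapply(seq_len(ncol(pts)), function(i) full[nn[, i], i])
d_ref <- apply(full, 2, function(col) sort(col)[2:(k + 1)])

all.equal(unname(d_nn), unname(d_ref))
```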
