I'm not exactly sure how to go about this in R. I've got a data set with 40 values, some of which repeat, and I want to perform a small bootstrap on this dataset to find the mean of two or more consecutive values. For example, suppose I randomly select a value from the dataset provided below and the very first value, 0.2, is chosen, so x1=0.2. How can I make sure that, in the same for loop, R selects the next value in the dataset as x2, which here is also 0.2? It would then appear as x1=0.2 and x2=0.2.
I can't really think of a way to do this, since it would need to happen on every iteration, and because the sample() function returns a value rather than a position, it is hard to pinpoint exactly which element was selected when there are repeated values.
I've provided sample code that calculates the mean for 1 observation, and I would like to get it to work for 2 consecutive observations so that I can calculate the means individually and display them.
If anyone has any way to handle this I would appreciate it.
Thanks ahead of time.
x <- c(0.20, 0.20, 0.21, 0.21, 0.21, 0.20, 0.19, 0.18, 0.16, 0.10,
       0.02, -0.02, 0.01, 0.03, 0.07, 0.14, 0.22, 0.13, 0.12,
       0.16, 0.17, 0.18, 0.18, 0.17, 0.15, 0.15, 0.13, 0.12,
       0.10, 0.08, 0.06, 0.04, 0.03, 0.02, 0.03, 0.05, 0.34,
       0.13, 0.11, 0.12)
B <- 500
result1 <- numeric(B)
# result2 <- numeric(B)
for (b in 1:B) {
  x1 <- sample(x = x, size = 1, replace = TRUE)
  # x2 <-
  result1[b] <- x1
  # result2[b] <- x2
}
mean1 <- mean(result1)
# mean2 <- mean(result2)
A simple approach could be:
result <- matrix(nrow = B, ncol = 2)
for (b in 1:B) {
  idx1 <- sample(seq_along(x), size = 1)
  idx2 <- idx1 %% length(x) + 1  # wrap around: the element after the last is the first
  result[b, 1] <- x[idx1]
  result[b, 2] <- x[idx2]
}
This stores the results in a matrix:
> result
[,1] [,2]
[1,] 0.21 0.21
[2,] 0.12 0.20
[3,] 0.21 0.21
[4,] 0.10 0.02
[5,] 0.10 0.02
[6,] 0.21 0.20
[7,] 0.02 -0.02
[8,] -0.02 0.01
[9,] 0.21 0.20
[10,] 0.17 0.15
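With the pairs stored in result, the two means the question asks for can then be taken column-wise; for example:
mean1 <- mean(result[, 1])
mean2 <- mean(result[, 2])
colMeans(result)  # both at once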
Sample the indices of x, then use this to subset x for result1. Use the sampled index + 1 to subset x for result2. However, you also need a wrap-around so that if you sample the last member of x, you take the first member as the "next" value.
B <- 500
result1 <- numeric(B)
result2 <- numeric(B)
for (i in 1:B) {
  j <- sample(seq_along(x), 1)
  k <- if (j == 40) 1 else j + 1
  result1[i] <- x[j]
  result2[i] <- x[k]
}
mean(result1)
#> [1] 0.12618
mean(result2)
#> [1] 0.13034
Note also that since R is vectorized, you don't need a loop here at all. You could just do:
result1 <- sample(seq_along(x), 500, replace = TRUE)
result2 <- result1 + 1
result2[result2 == 41] <- 1
mean(x[result1])
#> [1] 0.12568
mean(x[result2])
#> [1] 0.12596
Created on 2022-03-28 by the reprex package (v2.0.1)
Could you work out all the possible consecutive means and then sample from that? How about:
library(RcppRoll)
x <- c(0.20, 0.20, 0.21, 0.21, 0.21, 0.20, 0.19, 0.18, 0.16, 0.10,
       0.02, -0.02, 0.01, 0.03, 0.07, 0.14, 0.22, 0.13, 0.12,
       0.16, 0.17, 0.18, 0.18, 0.17, 0.15, 0.15, 0.13, 0.12,
       0.10, 0.08, 0.06, 0.04, 0.03, 0.02, 0.03, 0.05, 0.34,
       0.13, 0.11, 0.12)
rollmean <- roll_mean(x, 2)
r <- sample(rollmean, 500, replace = TRUE)
hist(r)
Which gives you a histogram of the 500 sampled two-point rolling means.
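If you would rather avoid the RcppRoll dependency, the same two-point rolling mean can be computed in base R; a quick sketch that should match roll_mean(x, 2) up to floating-point error:
rollmean_base <- (head(x, -1) + tail(x, -1)) / 2  # average of each value and its successor
all.equal(rollmean_base, roll_mean(x, 2))
# expected: TRUE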
I want to generate a vector of a given length, e.g., n = 5. Each value in the vector should be a proportion (i.e., a value between 0 and 1) so that across n elements they sum up to 1.
Unfortunately, I have two vectors: one (mymins) defines the allowed lower boundaries of each proportion and the other (mymaxs) defines the allowed top boundaries of each proportion.
In my example below the desired proportion for the first element is allowed to fall anywhere between 0.3 and 0.9. And for the last element, the desired proportion is allowed to fall between 0.05 and 0.7.
mymins <- c(0.3, 0.1, 0, 0.2, 0.05)
mymaxs <- c(0.9, 1, 1, 1, 0.7)
Let's assume that mymins are always 'legitimate' (i.e., their sum is never larger than 1).
How could I find a set of 5 proportions such that they all sum to 1 but lie within the boundaries?
Here is what I tried:
n = 5
mydif <- mymaxs - mymins # possible range for each proportion
myorder <- rank(mydif) # order those differences from smallest to largest
mytarget <- sum(mydif) # sum up the 5 ranges
x <- sort(runif(n))[myorder] # generate 5 random values and sort them in the order of mydif
x2 <- mymins + x / sum(x) * mytarget # rescale random values to sum up to mytarget and add them to mymins
x3 <- x2/sum(x2) # rescale x2 to sum up to 1
As you can see, I am not very far - because after rescaling some values are outside of their allowed boundaries.
I should probably also mention that I need this operation to be fast - because I am using it in an optimization loop.
I also tried to find a solution using optim; however, the problem is that it always finds the same solution - and I need to generate a DIFFERENT solution every time I find the proportions:
myfun <- function(x) {
  x <- round(x, 4)
  abovemins <- x - mymins
  n_belowmins <- sum(abovemins < 0)
  if (n_belowmins > 0) return(100000)
  belowmax <- x - mymaxs
  n_abovemax <- sum(belowmax > 0)
  if (n_abovemax > 0) return(100000)
  mydist <- abs(sum(x) - 1)
  return(mydist)
}
myopt <- optim(par = mymins + 0.01, fn = myfun)
myopt$par
sum(round(myopt$par, 4))
Thank you very much for your suggestions!
Perhaps it's better to think of this in a different way. Your samples actually need to sum to 0.35 (which is 1 - sum(mymins)) and then be added on to the minimum values:
constrained_sample <- function(mymins, mymaxs) {
  sizes <- mymaxs - mymins
  samp <- runif(5) * sizes
  samp / sum(samp) * (1 - sum(mymins)) + mymins
}
It works like this:
constrained_sample(mymins, mymaxs)
#> [1] 0.31728333 0.17839397 0.07196067 0.29146744 0.14089459
We can test this works by running the following loop, which will print a message to the console if any of the criteria aren't met:
for (i in 1:1000) {
  test <- constrained_sample(mymins, mymaxs)
  if (!all(test > mymins) | !all(test < mymaxs) | abs(sum(test) - 1) > 1e-6) cat("failure")
}
This throws no errors, since the criteria are always met. However, as @GregorThomas points out, the bounds aren't realistic in this case. We can see a range of solutions constrained by your conditions using a boxplot:
library(ggplot2)

samp <- constrained_sample(mymins, mymaxs)
for (i in 1:999) samp <- rbind(samp, constrained_sample(mymins, mymaxs))
df <- data.frame(val = c(samp[, 1], samp[, 2], samp[, 3], samp[, 4], samp[, 5]),
                 index = factor(rep(1:5, each = 1000)))
ggplot(df, aes(x = index, y = val)) + geom_boxplot()
Because you need 5 random numbers to sum to 1, you really only have 4 independent numbers and one dependent number.
mymins <- c(0.3, 0.1, 0, 0.2, 0.05)
mymaxs <- c(0.9, 1, 1, 1, 0.7)
set.seed(42)
iter <- 1000
while (iter > 0 &&
       (
         (1 - sum(x <- runif(4, mymins[-5], mymaxs[-5]))) < mymins[5] ||
           (1 - sum(x)) > mymaxs[5]
       )
) iter <- iter - 1
if (iter < 1) {
  # failed
  stop("unable to find something within 1000 iterations")
} else {
  x <- c(x, 1 - sum(x))
}
sum(x)
# [1] 1
all(mymins <= x & x <= mymaxs)
# [1] TRUE
x
# [1] 0.37732330 0.21618036 0.07225311 0.24250359 0.09173965
The reason I use iter there is to make sure you don't take an "infinite" amount of time to find something. If your mymins and mymaxs combination makes this mathematically infeasible (as your first example was), then you don't need to spin forever. If it is mathematically improbable to find a solution in a reasonable amount of time, you need to weigh how long you are willing to keep trying.
One reason this takes so long is that we are iteratively pulling entropy. If you expect this to go for a long time, then it is generally better to pre-calculate as much as you think you'll need (overall) and run things as a matrix.
set.seed(42)
n <- 10000
m <- matrix(runif(prod(n, length(mymins)-1)), nrow = n)
m <- t(t(m) * (mymaxs[-5] - mymins[-5]) + mymins[-5])
remainders <- (1 - rowSums(m))
ind <- mymins[5] <= remainders & remainders <= mymaxs[5]
table(ind)
# ind
# FALSE TRUE
# 9981 19
m <- cbind(m[ind,,drop=FALSE], remainders[ind])
nrow(m)
# [1] 19
rowSums(m)
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
head(m)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0.3405821 0.1306152 0.05931363 0.2199362 0.24955282
# [2,] 0.3601376 0.1367465 0.20235704 0.2477507 0.05300821
# [3,] 0.4469526 0.1279795 0.02265618 0.2881733 0.11423845
# [4,] 0.5450527 0.1029903 0.07503371 0.2052423 0.07168103
# [5,] 0.3161519 0.1469783 0.15290720 0.3268470 0.05711557
# [6,] 0.4782448 0.1185735 0.01664063 0.2178225 0.16871845
all(
  mymins[1] <= m[,1] & m[,1] <= mymaxs[1],
  mymins[2] <= m[,2] & m[,2] <= mymaxs[2],
  mymins[3] <= m[,3] & m[,3] <= mymaxs[3],
  mymins[4] <= m[,4] & m[,4] <= mymaxs[4],
  mymins[5] <= m[,5] & m[,5] <= mymaxs[5]
)
# [1] TRUE
This time it took 10000 attempts to make 19 valid combinations. It might take more or fewer attempts based on randomness, so ymmv with regards to how much you need to pre-generate.
If your example bounds are realistic, we can refine them quite a bit, narrowing the range of possibilities. For the current version of the question with:
mymins = c(0.3, 0.1, 0, 0.2, 0.05)
mymaxs = c(0.9, 1, 1, 1, 0.7)
What's the max for x[1]? Well, if x[2:5] take on their minimum values, they will add up to 0.1 + 0 + 0.2 + 0.05 = 0.35, so based on the other mins alone we know that the max value for x[1] is 1 - 0.35 = 0.65. The 0.9 in mymaxs is way too high.
We can calculate the actual max values by taking the minimum of the max values implied by the other minimums and the mymaxs vector:
new_max = pmin(mymaxs, 1 - (sum(mymins) - mymins))
new_max
# [1] 0.65 0.45 0.35 0.55 0.40
We can similarly revise the min bounds, though in this case even the revised max bounds new_max are high enough that they don't have any impact on the minimums.
new_min = pmax(mymins, 1 - (sum(new_max) - new_max))
new_min
# [1] 0.30 0.10 0.00 0.20 0.05
With these adjustments, we should be able to see easily if any solutions are possible (all(new_min < new_max)). And then generating random numbers as in r2evans's answer should go much quicker using the new bounds.
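For instance, a rough sketch (not from the original answers) of plugging new_min and new_max into the vectorized rejection step above; the acceptance rate should improve noticeably:
set.seed(42)
n <- 10000
m <- matrix(runif(n * 4), nrow = n)                       # first four coordinates
m <- t(t(m) * (new_max[-5] - new_min[-5]) + new_min[-5])  # scale into the tightened bounds
remainders <- 1 - rowSums(m)
keep <- new_min[5] <= remainders & remainders <= new_max[5]
m <- cbind(m[keep, , drop = FALSE], remainders[keep])
nrow(m)  # should be well above the 19 hits obtained with the raw bounds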
I'm working with the popbio package on a population model. It looks something like this:
library(popbio)
babies <- 0.3
kids <- 0.5
teens <- 0.75
adults <- 0.98
A <- c(0,0,0,0,teens*0.5,adults*0.8,
babies,0,0,0,0,0,
0,kids,0,0,0,0,
0,0,kids,0,0,0,
0,0,0,teens,0,0,
0,0,0,0,teens,adults
)
A <- matrix(A, ncol = 6, byrow = TRUE)
N<-c(10,10,10,10,10,10)
N<-matrix (N, ncol=1)
model <- pop.projection(A,N,iterations=10)
model
I'd like to know how I can randomise the input so that at each iteration, which represents a year in this case, I'd get a different input for the matrix elements. So, for instance, my model runs for 10 years, and I'd like to have the baby survival rate change for each year. babies <- rnorm(1,0.3,0.1) doesn't do it because that still leaves me with a single value, just randomly selected.
Update: This is distinct from running 10 separate models with different initial, random values. I'd like the update to occur within a single model run, which itself has 10 iterations in the pop.projection function.
Hope you can help.
I know this answer is very late, but here's one approach using expressions. First, use an expression to create the matrix.
vr <- list( babies=0.3, kids=0.5, teens=0.75, adults=0.98 )
Ax <- expression( matrix(c(
0,0,0,0,teens*0.5,adults*0.8,
babies,0,0,0,0,0,
0,kids,0,0,0,0,
0,0,kids,0,0,0,
0,0,0,teens,0,0,
0,0,0,0,teens,adults), ncol=6, byrow = TRUE ))
A1 <- eval(Ax, vr)
lambda(A1)
[1] 1.011821
Next, use an expression to create vital rates with rnorm or other functions.
vr2 <- expression( list( babies=rnorm(1,0.3,0.1), kids=0.5, teens=0.75, adults=0.98 ))
A2 <- eval(Ax, eval( vr2))
lambda(A2)
[1] 1.014586
Apply the expression to 100 matrices.
x <- sapply(1:100, function(x) lambda(eval(Ax, eval(vr2))))
quantile(x, c(.05,.95))
5% 95%
0.996523 1.025900
Finally, make two small changes to pop.projection by adding the vr option and a line to evaluate A at each time step.
pop.projection2 <- function(Ax, vr, n, iterations = 20)
{
  x <- length(n)
  t <- iterations
  stage <- matrix(numeric(x * t), nrow = x)
  pop <- numeric(t)
  change <- numeric(t - 1)
  for (i in 1:t) {
    stage[, i] <- n
    pop[i] <- sum(n)
    if (i > 1) {
      change[i - 1] <- pop[i]/pop[i - 1]
    }
    ## evaluate Ax
    A <- eval(Ax, eval(vr))
    n <- A %*% n
  }
  colnames(stage) <- 0:(t - 1)
  w <- stage[, t]
  pop.proj <- list(lambda = pop[t]/pop[t - 1], stable.stage = w/sum(w),
                   stage.vectors = stage, pop.sizes = pop, pop.changes = change)
  pop.proj
}
n <-c(10,10,10,10,10,10)
pop.projection2(Ax, vr2, n, 10)
$lambda
[1] 0.9874586
$stable.stage
[1] 0.33673579 0.11242588 0.08552367 0.02189786 0.02086656 0.42255023
$stage.vectors
0 1 2 3 4 5 6 7 8 9
[1,] 10 11.590000 16.375700 19.108186 20.2560223 20.5559445 20.5506251 20.5898222 20.7603581 20.713271
[2,] 10 4.147274 3.332772 4.443311 5.6693931 1.9018887 6.8455597 5.3879202 10.5214540 6.915534
[3,] 10 5.000000 2.073637 1.666386 2.2216556 2.8346965 0.9509443 3.4227799 2.6939601 5.260727
[4,] 10 5.000000 2.500000 1.036819 0.8331931 1.1108278 1.4173483 0.4754722 1.7113899 1.346980
[5,] 10 7.500000 3.750000 1.875000 0.7776139 0.6248948 0.8331209 1.0630112 0.3566041 1.283542
[6,] 10 17.300000 22.579000 24.939920 25.8473716 25.9136346 25.8640330 25.9715930 26.2494195 25.991884
$pop.sizes
[1] 60.00000 50.53727 50.61111 53.06962 55.60525 52.94189 56.46163 56.91060 62.29319 61.51194
$pop.changes
[1] 0.8422879 1.0014610 1.0485765 1.0477793 0.9521023 1.0664832 1.0079517 1.0945797 0.9874586
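If you want to see how much the year-to-year randomness moves the projected growth rate, a small follow-up sketch (assuming Ax, vr2 and n as defined above) repeats the stochastic projection and summarises lambda:
lambdas <- replicate(100, pop.projection2(Ax, vr2, n, 10)$lambda)
quantile(lambdas, c(0.05, 0.95))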
I have an expression
qbinom(0.05, n, .47) - 1
and I want to create a loop which iterates this expression over n, for n from 20 to 200. Each iteration of this loop produces a number, and I want to take the maximum of the 181 numbers it will produce. So, something like:
for (n in 20:200) {
  max(qbinom(0.05, n, .47) - 1)
}
But I'm not sure how exactly to do this.
Thanks!
First, I will show you how to do this with a loop.
n <- 20:200
MAX <- -Inf  ## initialize maximum
for (i in 1:length(n)) {
  x <- qbinom(0.05, n[i], 0.47) - 1
  if (x > MAX) MAX <- x
}
MAX
# [1] 81
Note, I am not keeping a record of all 181 values generated. Each value is treated as a temporary value and will be overwritten in the next iteration. In the end, we only have a single value MAX.
If you want to retain all the records at the same time, we first need to initialize a vector to hold them.
n <- 20:200
MAX <- -Inf              ## initialize maximum
x <- numeric(length(n))  ## vector to hold the records
for (i in 1:length(n)) {
  x[i] <- qbinom(0.05, n[i], 0.47) - 1
  if (x[i] > MAX) MAX <- x[i]
}
## check the first few values of `x`
head(x)
# [1] 5 5 6 6 6 7
MAX
# [1] 81
Now for the vectorized solution:
max(qbinom(0.05, 20:200, 0.47) - 1)
# [1] 81
R functions related to probability distributions are vectorized in the same fashion. For those related to binomial distributions, you can read ?rbinom for details.
Note, the vectorization is achieved via the recycling rule. For example, by specifying:
qbinom(0.05, 1:4, 0.47)
R will first do the recycling:
p:    0.05 0.05 0.05 0.05
size: 1    2    3    4
prob: 0.47 0.47 0.47 0.47
then evaluate
qbinom(p[i], size[i], prob[i])
via a C-level loop.
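As a quick sanity check (a sketch, not part of the original answer), the vectorized call should agree with an explicit loop over n:
all(qbinom(0.05, 20:200, 0.47) == sapply(20:200, function(n) qbinom(0.05, n, 0.47)))
# expected: TRUE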
Follow-up
How would I be able to know which of the 20:200 corresponds to the maximum using the vectorization solution?
We can use
x <- qbinom(0.05, 20:200, 0.47) - 1
i <- which.max(x)
# [1] 179
Note, i is the position in vector 20:200. To get the n you want, you need:
(20:200)[i]
# 198
The maximum is
x[i]
# [1] 81
I want to apply a function to a matrix input a. This function should change the first element to c[a[1]] and the next elements to b[a[i],a[i+1]], for i from 1 up to ncol(a) - 1.
example input:
a <- matrix(c(1,4,3,1),nrow=1)
b <- matrix(1:25,ncol=5,nrow=5)
c <- matrix(4:8,ncol=5,nrow=1)
expected output:
>a
4 16 14 3
#c[a[1]] gave us the first element: 4
#b[a[1],a[2]] gave us the second element: 16
#b[a[2],a[3]] gave us the third element: 14
#b[a[3],a[4]] gave us the fourth element: 3
I've been trying to use mapply() without any success so far. The idea is to avoid loops, since those can lead to a major performance decrease in R.
Step 1: using single index for addressing matrix
In R, matrix elements are stored in a vector in column-major order, so A[i, j] is the same as A[(j-1)*nrow(A) + i]. Consider an example with a random 3-by-3 matrix:
set.seed(1); A <- round(matrix(runif(9), 3, 3), 2)
> A
[,1] [,2] [,3]
[1,] 0.27 0.91 0.94
[2,] 0.37 0.20 0.66
[3,] 0.57 0.90 0.63
Now, this matrix has 3 rows (nrow(A) = 3). Compare:
A[2,3] # 0.66
A[(3-1) * 3 + 2] # 0.66
Step 2: vectorizing
You can address multiple elements of a matrix at a time. However, to get individual elements rather than a block, you have to use single-index mode (not entirely precise here; see @alexis_laz's remark later). For example, suppose you want to extract A[1,2] and A[3,1]. If you do:
A[c(1,3), c(2,1)]
# [,1] [,2]
# [1,] 0.91 0.27
# [2,] 0.90 0.57
You actually get a block. Now, if you use single indexing, you get what you need:
A[3 * (c(2,1) - 1) + c(1,3)]
# [1] 0.91 0.57
Step 3: getting single index for your problem
Suppose n <- length(a) and you want to address those elements of b:
a[1] a[2]
a[2] a[3]
. .
. .
a[n-1] a[n]
you can use the single index nrow(b) * (a[2:n] - 1) + a[1:(n-1)].
Step 4: complete solution
Since you only have a single row for a and c, you should store them as vectors rather than matrices.
a <- c(1,4,3,1)
c <- 4:8
If you were given matrices and have no choice (as they currently are in your question), you can convert them into vectors with:
a <- as.numeric(a)
c <- as.numeric(c)
Now, as discussed, we have the index to address the b matrix:
n <- length(a)
b_ind <- nrow(b) * (a[2:n] - 1) + a[1:(n-1)]
You also need element a[1] of c as the first element of your final result, so we concatenate c[a[1]] and b[b_ind]:
a <- c(c[a[1]], b[b_ind])
# > a
# [1] 4 16 14 3
This approach is fully vectorized, even better than the *apply family.
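For comparison, here is a sketch of the mapply() route the question mentions (using a0 for the original input vector, since a has just been overwritten); it gives the same result but is typically slower:
a0 <- c(1, 4, 3, 1)
c(c[a0[1]], mapply(function(i, j) b[i, j], a0[-length(a0)], a0[-1]))
# expected: 4 16 14 3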
alexis_laz's remark
alexis_laz reminds me that we can use "matrix-index" as well, i.e., we can also address matrix b via:
b[cbind(a[1:(n-1)],a[2:n])] ## or b[cbind(a[-n], a[-1])]
However, I think using a single index is slightly faster, because with a matrix index we have to read the index matrix row by row in order to address b, so we pay a higher memory latency than with a plain vector index.
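A quick check (a sketch using the original a from the question) that the two indexing styles pick out the same elements:
a <- c(1, 4, 3, 1)
n <- length(a)
identical(b[nrow(b) * (a[2:n] - 1) + a[1:(n-1)]], b[cbind(a[-n], a[-1])])
# expected: TRUE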
I'm trying to remove the shackles of some legacy code that we use to make decision trees in a retail setting. I got to playing with hclust in R and it's beautiful and I'd like to use it. The heavy lifting for calculating distances is done in SQL and I get an output like this:
main  with  dist
A     A     0.00
A     B     1.37
A     C     0.64
B     B     0.00
B     C     0.10
C     C     0.00
That's loaded as a data frame right now (just reading the SQL query dump), but hclust wants a matrix of distances. E.g.,:
A B C
--+-----------------
A | 0
B | 1.37 0
C | 0.64 0.1 0
My thinking is too procedural and I'm trying to do it in nested loops at the moment. Can someone point me in the direction of something more R-idiomatic to do this?
Thanks!
If you are looking for an actual distance matrix in R, try:
as.dist(xtabs(dist ~ with + main, mydf), diag = TRUE)
# A B C
# A 0.00
# B 1.37 0.00
# C 0.64 0.10 0.00
I'm presuming that the combinations of "main" and "with" are unique, otherwise xtabs would sum the "dist" values.
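For reference, a reproducible sketch of the input data frame assumed above (column names main, with and dist, matching the question's dump):
mydf <- data.frame(main = c("A", "A", "A", "B", "B", "C"),
                   with = c("A", "B", "C", "B", "C", "C"),
                   dist = c(0, 1.37, 0.64, 0, 0.1, 0))
as.dist(xtabs(dist ~ with + main, mydf), diag = TRUE)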
I would suggest changing from letters to numbers (which is straightforward using the ASCII codes) and then using the linearized indices of R matrices to access each pair in a vectorized manner.
Minimal example:
N <- 3
d <- data.frame(x = c(1,2), y = c(2,3), v = c(0.1, 0.2))
m <- matrix(0, N, N)
m[(d$y-1)*N+d$x] = d$v
The output is:
[,1] [,2] [,3]
[1,] 0 0.1 0.0
[2,] 0 0.0 0.2
[3,] 0 0.0 0.0
EDIT: To preserve arbitrary strings as row and col names, consider the following example:
codes <- c('A','B','C')
N <- 3
d <- data.frame(x = c('A','B'), y = c('B','C'), v = c(0.1, 0.2))
m <- matrix(0, N, N)
m[(vapply(d$y, function(x) which(codes == x), 0) - 1) * N +
    vapply(d$x, function(x) which(codes == x), 0)] = d$v
rownames(m) = codes
colnames(m) = codes
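A slightly shorter equivalent sketch uses match() and a matrix index instead of vapply() (assuming codes and d as defined just above):
m2 <- matrix(0, N, N, dimnames = list(codes, codes))
m2[cbind(match(d$x, codes), match(d$y, codes))] <- d$v
m2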