R while loop with vector condition

I want to vectorize a function that uses a while-loop.
The original function is
getParamsLeadtime <- function(leadtimeMean, in_tolerance, tolerance){
searchShape=0
quantil=0
# iterates the parameters until the percentage of values is within the interval of tolerance
while (quantil < in_tolerance){
searchShape = searchShape+1
quantil <- pgamma(leadtimeMean+tolerance,shape=searchShape,rate=searchShape/leadtimeMean) -
pgamma(leadtimeMean-tolerance,shape=searchShape,rate=searchShape/leadtimeMean)
}
leadtimeShape <- searchShape
leadtimeRate <- searchShape/leadtimeMean
return(c(leadtimeShape, leadtimeRate))
}
I would like to have a vectorized call to this function to apply it to a data frame. Currently I am looping through it:
leadtimes <- data.frame()
for (a in seq(92:103)) {
leadtimes <- rbind(leadtimes, getParamsLeadtime(a, .85,2))
}
When I tried to vectorize the function, while did not seem to accept a vector as its condition. The following warning occurred:
Warning message:
In while (input["U"] < rep(tolerance, dim(input)[1])) { :
the condition has length > 1 and only the first element will be used
This leads me to believe that while does not accept vectors. Can you tell me how to vectorize the function?
On a sidenote, I wonder why the column names of the resulting leadtimes-data.frame appear to be values:
> leadtimes
X1 X1.1
1 1 1.000000
2 1 0.500000
3 4 1.333333
4 8 2.000000
5 13 2.600000
6 19 3.166667
7 25 3.571429
8 33 4.125000
9 42 4.666667
10 52 5.200000
11 63 5.727273
12 74 6.166667

Here's an option that is pretty performant.
We vectorize the calculation of pgamma for a given mean lead time, for both the +tol and the -tol case, over a sufficiently large sequence of shp values. We then take the (vectorized) difference and compare it to in_tol. The index (minus 1, since the sequence starts at 0) of the first element of the vector that is not less than in_tol is the lowest value of shp for which the pgamma difference reaches in_tol.
f <- function(lead, in_tol, tol) {
shp <- which(!(pgamma(lead + tol, 0:10000, (0:10000)/lead) -
pgamma(lead - tol, 0:10000, (0:10000)/lead))
< in_tol)[1] - 1
rate <- shp/lead
c(shp, rate)
}
We can then sapply this over a range of mean lead times.
t(sapply(1:12, f, 0.85, 2))
## [,1] [,2]
## [1,] 1 1.000000
## [2,] 1 0.500000
## [3,] 4 1.333333
## [4,] 8 2.000000
## [5,] 13 2.600000
## [6,] 19 3.166667
## [7,] 25 3.571429
## [8,] 33 4.125000
## [9,] 42 4.666667
## [10,] 52 5.200000
## [11,] 63 5.727273
## [12,] 74 6.166667
system.time(leadtimes <- sapply(1:103, f, 0.85, 2))
## user system elapsed
## 1.28 0.00 1.30
You just need to make sure you choose a sensible upper ceiling for the shape parameter (here I've chosen 10000, which was more than generous). Note that if you don't choose an upper limit that is high enough, some return values will be NA.
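For comparison, the original while loop can also be vectorized directly by tracking which elements still need iterating, which avoids having to pick an upper limit for the shape. The following is a minimal sketch rather than a tuned implementation: the name getParamsLeadtimeVec and the helper vector active are made up for illustration, and returning a data frame with named columns is an assumption about the desired output (it also sidesteps the value-like column names seen in the question).

```r
# Vectorized variant of the question's getParamsLeadtime(): every element of
# leadtimeMean gets its own shape counter, and the loop runs until all are done.
getParamsLeadtimeVec <- function(leadtimeMean, in_tolerance, tolerance) {
  shape  <- numeric(length(leadtimeMean))     # current shape per element, starts at 0
  active <- rep(TRUE, length(leadtimeMean))   # elements still below in_tolerance
  while (any(active)) {                       # any() reduces the vector to a single TRUE/FALSE
    shape[active] <- shape[active] + 1
    m <- leadtimeMean[active]
    s <- shape[active]
    quantil <- pgamma(m + tolerance, shape = s, rate = s / m) -
               pgamma(m - tolerance, shape = s, rate = s / m)
    active[active] <- quantil < in_tolerance  # finished elements drop out of the loop
  }
  data.frame(shape = shape, rate = shape / leadtimeMean)
}

leadtimes <- getParamsLeadtimeVec(1:12, 0.85, 2)
leadtimes
```

The key point is that while needs a single logical value, so the vector condition is collapsed with any(), and the per-element state is kept in active.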

Related

Vectorizing a for loop that changes columns of a matrix

Say I have a vector of ages of 100 trees. Then I age those trees up for 5, 10, 15, and 20 years into the future to create a matrix of tree ages for this year and four 5-year planning periods in the future.
But then, I decide to cut some of those trees (only 10 per planning period), documented in a matrix of T/F values where T is harvested and F is not (trees can't be harvested twice).
age.vec <- sample(x = 1:150, size = 100, replace = T) # create our trees
age.mat <- cbind(age.vec, age.vec+5, age.vec + 10, age.vec + 15, age.vec + 20) # grow them up
x.mat <- matrix(data = F, nrow = 100, ncol = 5) # create the empty harvest matrix
x.mat[cbind(sample(1:100, size = 50), rep(1:5, each = 10))] <- T # 10 trees/year harvested
So then, the ages of trees that are harvested become zero in that year:
age.mat[x.mat] <- 0
I then would like to age the harvested trees up again for the following periods. E.g. if a tree were harvested in the first planning period, in the second planning period (5 years later), I want the age of the tree to be 5, then in the third planning period (10 years later), I want the age of the tree to be 10. I have successfully implemented this in the following for loop:
for (i in 2:5){ # we don't need to calculate over the first year
age.mat[,i]<-age.mat[,i-1]+5L # add 5 to previous year
age.mat[x.mat[,i],i] <- 0L # reset age of harvested trees to zero
}
This works, however, it is clunky and slow. Is there a way to implement this faster (i.e. without the for loop)? It also is implemented within a function, which means that using "apply" actually slows things down, so it needs to be vectorized directly. This is something I'm iterating over thousands of times so speed is of the essence!
Thank you!
An alternative to the t(apply(...)) call in @Jon Spring's answer is matrixStats::rowCumsums.
library(matrixStats)
n <- 1e4L
n10 <- n/10L
age.mat <- outer(sample(150, n, TRUE), seq(0, 20, 5), "+")
x.mat <- matrix(FALSE, n, 5) # create the empty harvest matrix
# sample harvests so that no tree is harvested twice
x.mat[matrix(c(sample(n, n/2L), sample(n10:(6L*n10 - 1L)) %/% n10), n/2L)] <- TRUE
f1 <- function(age, x) {
age[x[,1],] <- 0
for (i in 2:5){ # we don't need to calculate over the first year
age[,i] <- age[,i - 1] + 5L # add 5 to previous year
age[x[,i], i] <- 0L # reset age of harvested trees to zero
}
age
}
f2 <- function(age, x) {
age - rowCumsums(x*age)
}
microbenchmark::microbenchmark(f1 = f1(age.mat, x.mat),
f2 = f2(age.mat, x.mat),
check = "equal")
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> f1 294.4 530.2 1023.450 566.6 629.35 33222.8 100
#> f2 135.2 263.6 334.622 284.2 307.15 4343.6 100
The approach below looks to be about 12x faster than the for loop, based on testing with rbenchmark.
Here's an approach relying on the fact that harvesting a tree doesn't stop the passage of time, it just resets the clock. So we can think of a harvest as subtracting the harvest age from all future ages.
x.die <- x.mat * age.mat
x.dif <- t(apply(x.die, 1, cumsum))
age.mat2 <- age.mat - x.dif
In x.die, multiplying the harvests by the ages gives the age at each harvest. The next line calculates the cumulative sum of these across each row, and finally we subtract those sums from the original ages.
I assume your "trees can't be harvested twice" means we won't ever see two TRUEs in one row of x.mat? My code won't work right if there were more than one harvest per tree location.
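To make the "harvest resets the clock" idea concrete, here is a small worked example on three hypothetical trees, checked against the question's for loop. The tree ages and harvest positions are made up for illustration, and each tree is harvested at most once, as assumed above.

```r
# Three hypothetical trees aged over five periods (0, 5, 10, 15, 20 years ahead)
age.mat <- outer(c(95, 40, 10), seq(0, 20, 5), "+")
x.mat <- matrix(FALSE, 3, 5)
x.mat[1, 2] <- TRUE   # tree 1 harvested in period 2
x.mat[2, 4] <- TRUE   # tree 2 harvested in period 4

# Reference result: the for loop from the question
loop.res <- age.mat
loop.res[x.mat[, 1], 1] <- 0
for (i in 2:5) {
  loop.res[, i] <- loop.res[, i - 1] + 5
  loop.res[x.mat[, i], i] <- 0
}

# Cumulative-sum approach: subtract the age at harvest from that period onwards
x.die <- x.mat * age.mat
x.dif <- t(apply(x.die, 1, cumsum))
vec.res <- age.mat - x.dif

vec.res[1, ]   # 95 0 5 10 15
stopifnot(all(vec.res == loop.res))
```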
I found a way to do it! I implemented the idea of going backwards from @Jon Spring: I created a matrix with the age of the stand at harvest filled in for the harvest year and all subsequent years, then subtracted that from my pre-made aged-up matrix. I built a function similar to "fill" from tidyr or "na.locf" from zoo (because those were too slow).
First I used arrayInd to determine the positions in the matrix of trees that were changed. I then used that to make another matrix that combined a repeat of each index row a number of times equal to the number of periods minus the period the tree was harvested in plus one, and a sequence vector of the same length that sequences from the period of the index number to the number of periods.
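nper <- ncol(x.mat) # number of planning periods (assumed to be the number of columns)
x.ind <- arrayInd(which(x.mat), dim(x.mat)) # row and column (tree and period) of each harvest
x.new.ind <- cbind(rep(x.ind[,1], times = nper-x.ind[,2]+1), sequence(nvec = nper-x.ind[,2]+1, from = x.ind[,2]))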
For example, if there was a tree harvested at position [4, 2], meaning the fourth tree was harvested in the second period, and we had 5 periods total, it would create a matrix:
[,1] [,2]
[1,] 4 2
[2,] 4 3
[3,] 4 4
[4,] 4 5
Then I made a vector with the ages of the trees that were harvested in the correct positions, and zeros in the rest of the positions (e.g. for our example, if the tree harvested was 100 years old, we would have a vector of 0 0 0 100 0 (if we had 5 trees)).
ages.vec <- vector(mode = "integer", length = nrow(age.mat))
ages.vec[x.ind[,1]]<- age.mat[x.ind]
I then multiplied this vector by a logical matrix with "T" at the row, column positions in the matrix above.
Continuing with the above example, we get:
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 0 0
[2,] 0 0 0 0 0
[3,] 0 0 0 0 0
[4,] 0 100 100 100 100
[5,] 0 0 0 0 0
I then subtracted it from our current (already aged-up) ages matrix. So tree four was 95 100 105 110 115 and now it is 95 0 5 10 15.
new.ages.mat<- age.mat - replace(x.mat, x.new.ind, TRUE)*ages.vec
Though this might not be the most elegant solution, according to microbenchmark it is 90x faster than our for loop, and 3x faster than the lovely apply function that Jon created. I would put in the microbenchmark calls and results, but this post is long enough already! I know there's a better way to create ages.vec and incorporate it; I am going to continue working on that and will update this answer with my results!
This approach builds on which() with arr.ind=TRUE to create a two-column matrix that encodes the starting locations (in the first column) and times (in the second column) for new tree planting. It does violate the functional programming paradigm by using <<- to assign new values to age.mat "in place".
fiveseq <- seq(0,20, by=5) # this way one only needs to call `seq` once
apply(which(x.mat, arr.ind=TRUE) , 1,
function(r) {age.mat[ r[1], r[2]:5] <<- fiveseq[ 1:(6-r[2])] } )
In summary, it locates the new locations and intervals and replaces the rest of that row with the right number of items from the sequence {0, 5, 10, 15, 20}
(I would be interested in seeing how this compares with the benchmarking framework that you have already established.)
You can use apply to work on each vector rowwise, then use some logic within the function to adjust the values.
Should be about 4 times faster
age.mat |>
apply(1, \(x) {
if(any(x == 0 & (which(x == 0) != length(x)))) {
x[which(x == 0):length(x)] <- (0:(length(x) - which(x == 0))) * 5
x
} else x
}) |> t()
[,1] [,2] [,3] [,4] [,5]
[1,] 101 0 5 10 15
[2,] 55 60 65 70 75
[3,] 23 28 33 0 5
[4,] 0 5 10 15 20
[5,] 23 28 33 0 5
[6,] 84 0 5 10 15
[7,] 52 57 62 0 5
[8,] 26 31 36 41 0
[9,] 114 119 124 129 0
[10,] 33 38 43 48 53
[11,] 144 149 154 159 164
[12,] 19 24 29 34 39
[13,] 43 48 53 58 63
[14,] 69 74 79 84 89
[15,] 98 103 108 113 118
[16,] 110 115 120 125 130
[17,] 8 13 18 23 28
[18,] 16 21 26 31 36
[19,] 1 6 11 16 21
[20,] 60 65 0 5 10

Reverse indexing of a matrix in R

I am trying to revert the indexing of a matrix in R. The following example illustrates my problem:
#sample data:
set.seed(21)
m <- matrix(sample(100,size = 100),10,10)
# sorting:
t(apply(m,1,order))
# new exemplary order after sorting:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 3 7 10 6 5 9 2 4 1 8
[2,] 1 6 4 7 3 9 5 8 2 10
[3,] 2 5 8 10 4 7 9 1 3 6
[4,] 8 1 9 2 7 3 4 6 10 5
[5,] 6 9 5 2 7 3 10 4 8 1
[6,] 2 7 4 8 6 9 3 10 1 5
[7,] 1 6 4 10 3 2 7 8 9 5
[8,] 1 2 6 9 3 10 5 7 4 8
[9,] 9 4 5 7 10 2 8 3 1 6
[10,] 6 8 4 3 2 1 5 10 7 9
# we can create m2 with the above sorting. We also add 1000 to all values
m2 <- t(apply(m,1,function(x){
x[order(x)]
})) + 1000
# the next step would be to obtain the original arrangement of columns again, as described below.
After the sorting of my data we have the following situation: In row 1, the 3rd column (of matrix m2) is mapped to the original first column (of matrix m), the 7th column is mapped to the original second column, the 10th column to the original 3rd column, and so on.
My question is as follows: Can I somehow revert this mapping in R? What I mean by this is again for row 1, move the 1st column (of m2) to the position of the 3rd column (of m), then move the 2nd column to the position of the 7th, move the 3rd to the position of the 10th, and so on.
In the end, what I am trying to achieve is to sort my data but save the existing arrangement of the columns somehow, so that later, after some transformations of my data, I can rearrange them into the original ordering again. When I use the usual sorting algorithms in R, I lose the old positioning of my columns. Of course, most of the time you would not need it anymore, but at the moment I do.
Background
I think it will help to examine the effect of the order() and rank() functions on a simple vector. Consider:
x <- c('c','b','d','b','a');
seq_along(x);
## [1] 1 2 3 4 5
order(x);
## [1] 5 2 4 1 3
rank(x); ## default is ties.method='average'
## [1] 4.0 2.5 5.0 2.5 1.0
rank(x,ties.method='first');
## [1] 4 2 5 3 1
rank(x,ties.method='last'); ## available from 3.3.0
## [1] 4 3 5 2 1
rank(x,ties.method='random'); ## we can ignore this one, obviously
## [1] 4 2 5 3 1
rank(x,ties.method='max');
## [1] 4 3 5 3 1
rank(x,ties.method='min');
## [1] 4 2 5 2 1
(I used character values to demonstrate that these principles and algorithms can apply to any (comparable) data type, not just numeric types. But obviously this includes numeric types.)
The order() function returns a vector that is the same length as the input vector. The order values represent a reordering of the input indexes (which are shown above courtesy of seq_along()) in such a way that when the input vector is indexed with the order vector, it will be sorted (according to the chosen sort method, which (if not explicitly overridden by a method argument), is radixsort for integer, logical, and factor, shellsort otherwise, and takes into account the collation order of the current locale for character values when not using radixsort). In other words, for an element of the result vector, its value gives the input index of the element in the input vector that should be moved to that position in order to sort it.
To try to put it even more plainly, an element of the order vector basically says "place the input vector element with this index in my position". Or, in a slightly more generic way (which will dovetail with the parallel description of rank()):
order element: the input vector element with this index sorts into my position.
In a sense, rank() does the inverse of what order() does. Its elements correspond to the elements of the input vector by index, and its values give a representation of the sort order of the corresponding input element (with tiebreaking behavior depending on the ties.method argument; this contrasts with order(), which always preserves the incoming order of ties, equivalent to ties.method='first' for rank()).
To use the same language structure that I just used for order(), which is the plainest manner of expression I can think of:
rank element: the input vector element in my position sorts into this index.
Of course, this description is only perfectly accurate for ties.method='first'. For the others, the destination index for ties will actually be the reverse of the incoming order (for 'last'), the lowest index of the duplicate set (for 'min'), the highest (for 'max'), the average (for 'average', which is actually the default), or random (for 'random'). But for our purposes, since we need to mirror the proper sort order as per order() (and therefore sort(), which uses order() internally), let's ignore the other cases from this point forward.
I've thought of one final way to articulate the behaviors of the order() and rank() functions: order() defines how to pull elements of the input vector into a sorted order, while rank() defines how to push elements of the input vector into a sorted order.
This is why indexing the input vector with the results of order() is the correct way to sort it. Indexing a vector is inherently a pulling operation. Each respective index vector element effectively pulls the input vector element that is stored at the index given by that index vector element into the position occupied by that index vector element in the index vector.
Of course, the "push vector" produced by rank() cannot be used in the same way as the "pull vector" produced by order() to directly sort the input vector, since indexing is a pull operation. But we can ask, is it in any way possible to use the push vector to sort the input vector? Yes, I've thought of how this can be done. The solution is index-assigning, which is inherently a push operation. Specifically, we can index the input vector with the push vector as the (lvalue) LHS and assign the input vector itself as the RHS.
So, here are the three methods you can use to sort a vector:
x[order(x)];
## [1] "a" "b" "b" "c" "d"
sort(x); ## uses order() internally
## [1] "a" "b" "b" "c" "d"
y <- x; y[rank(y,ties.method='first')] <- y; y; ## (copied to protect x, but not necessary)
## [1] "a" "b" "b" "c" "d"
An interesting property of the rank() function with ties.method='first' is that it is idempotent. This is because, once you've produced a rank vector, ranking it again will not change the result. Think about it: say the first element ranks 4th. Then the first call will produce a 4 in that position. Running rank() again will again find that it ranks 4th. You don't even need to specify ties.method anymore for the subsequent calls to rank, because the values will have become distinct on the first call's (potential) tiebreaking.
rank(x,ties.method='first');
## [1] 4 2 5 3 1
rank(rank(x,ties.method='first'));
## [1] 4 2 5 3 1
rank(rank(rank(x,ties.method='first')));
## [1] 4 2 5 3 1
y <- rank(x,ties.method='first'); for (i in seq_len(1e3L)) y <- rank(y); y;
## [1] 4 2 5 3 1
On the other hand, order() is not idempotent. Repeatedly calling order() has the interesting effect of alternating between the push and pull vectors.
order(x);
## [1] 5 2 4 1 3
order(order(x));
## [1] 4 2 5 3 1
order(order(order(x)));
## [1] 5 2 4 1 3
Think about it: if the last element sorts 1st, then the first call to order() will pull it into the 1st position by placing its index (which is largest of all indexes) into the 1st position. The second call to order() will identify that the element in the 1st position is largest in the entire vector, and thus will pull index 1 into the last position, which is equivalent to ranking the last element with its rank of 1.
Solutions
Based on all of the above, we can devise 3 solutions to your problem of "desorting", if you will.
For input, let's assume that we have (1) the input vector x, (2) its sort order o, and (3) the sorted and possibly transformed vector xs. For output we need to produce the same vector xs but desorted according to o.
Common input:
x <- c('c','b','d','b','a'); ## input vector
o <- order(x); ## order vector
xs <- x[o]; ## sorted vector
xs <- paste0(xs,seq_along(xs)); ## somewhat arbitrary transformation
x;
## [1] "c" "b" "d" "b" "a"
o;
## [1] 5 2 4 1 3
xs;
## [1] "a1" "b2" "b3" "c4" "d5"
Method 1: pull rank()
Since the order and rank vectors are effectively inverses of each other (i.e. pull and push vectors), one solution is to compute the rank vector in addition to the order vector o, and use it to desort xs.
xs[rank(x,ties.method='first')];
## [1] "c4" "b2" "d5" "b3" "a1"
Method 2: pull repeated order()
Alternatively, instead of computing rank(), we can simply use a repeated order() call on o to generate the same push vector, and use it as above.
xs[order(o)];
## [1] "c4" "b2" "d5" "b3" "a1"
Method 3: push order()
I was thinking to myself that, since we already have the order vector o, we really shouldn't have to go to the trouble of computing another order or rank vector. Eventually I realized that the best solution is to use the pull vector o as a push vector. This accomplishes the desorting objective with the least work.
xs[o] <- xs;
xs;
## [1] "c4" "b2" "d5" "b3" "a1"
Benchmarking
library(microbenchmark);
desort.rank <- function(x,o,xs) xs[rank(x,ties.method='first')];
desort.2order <- function(x,o,xs) xs[order(o)];
desort.assign <- function(x,o,xs) { xs[o] <- xs; xs; };
## simple test case
x <- c('c','b','d','b','a');
o <- order(x);
xs <- x[o];
xs <- paste0(xs,seq_along(xs));
ex <- desort.rank(x,o,xs);
identical(ex,desort.2order(x,o,xs));
## [1] TRUE
identical(ex,desort.assign(x,o,xs));
## [1] TRUE
microbenchmark(desort.rank(x,o,xs),desort.2order(x,o,xs),desort.assign(x,o,xs));
## Unit: microseconds
## expr min lq mean median uq max neval
## desort.rank(x, o, xs) 106.487 122.523 132.15393 129.366 139.843 253.171 100
## desort.2order(x, o, xs) 9.837 12.403 15.66990 13.686 16.251 76.122 100
## desort.assign(x, o, xs) 1.711 2.567 3.99916 3.421 4.277 17.535 100
## scale test case
set.seed(1L);
NN <- 1e4; NE <- 1e5; x <- sample(seq_len(NN),NE,T);
o <- order(x);
xs <- x[o];
xs <- xs+seq(0L,NE-1L)/NE;
ex <- desort.rank(x,o,xs);
identical(ex,desort.2order(x,o,xs));
## [1] TRUE
identical(ex,desort.assign(x,o,xs));
## [1] TRUE
microbenchmark(desort.rank(x,o,xs),desort.2order(x,o,xs),desort.assign(x,o,xs));
## Unit: milliseconds
## expr min lq mean median uq max neval
## desort.rank(x, o, xs) 36.488185 37.486967 39.89157 38.613191 39.145405 85.849143 100
## desort.2order(x, o, xs) 16.764414 17.262630 18.10341 17.443527 19.014296 28.338835 100
## desort.assign(x, o, xs) 1.457014 1.498495 1.82893 1.527363 1.592151 4.255573 100
So, clearly the index-assignment solution is the best.
Demo
Below is a demonstration of how this solution can be used for your sample input.
I honestly think that a simple for-loop over the rows is preferable to an apply() call in this case, since you can modify the matrix in-place. If you need to preserve the sorted intermediate matrix, you can copy it before applying this desorting operation.
## generate input matrix
set.seed(21L); m <- matrix(sample(seq_len(100L)),10L); m;
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 79 61 1 66 40 39 2 86 44 26
## [2,] 25 84 49 35 67 32 36 70 50 100
## [3,] 69 6 90 51 30 92 65 34 68 42
## [4,] 18 54 72 73 85 75 55 15 27 77
## [5,] 93 16 23 58 9 7 19 64 8 46
## [6,] 88 4 60 13 98 47 5 29 56 80
## [7,] 10 45 43 14 95 11 74 76 83 38
## [8,] 17 24 57 82 63 28 71 87 53 59
## [9,] 91 41 81 21 22 94 33 62 12 37
## [10,] 78 52 48 31 89 3 97 20 99 96
## sort each row, capturing sort order in rowwise order matrix
o <- matrix(NA_integer_,nrow(m),ncol(m)); ## preallocate
for (ri in seq_len(nrow(m))) m[ri,] <- m[ri,o[ri,] <- order(m[ri,],decreasing=T)];
## whole-matrix transformation
## embed row index as the tenths digit, column index as the hundredths digit (arbitrary)
m <- m+(row(m)-1L)/nrow(m)+(col(m)-1L)/ncol(m)/10;
## desort
for (ri in seq_len(nrow(m))) m[ri,o[ri,]] <- m[ri,]; m;
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 79.01 61.03 1.09 66.02 40.05 39.06 2.08 86.00 44.04 26.07
## [2,] 25.19 84.11 49.15 35.17 67.13 32.18 36.16 70.12 50.14 100.10
## [3,] 69.22 6.29 90.21 51.25 30.28 92.20 65.24 34.27 68.23 42.26
## [4,] 18.38 54.36 72.34 73.33 85.30 75.32 55.35 15.39 27.37 77.31
## [5,] 93.40 16.46 23.44 58.42 9.47 7.49 19.45 64.41 8.48 46.43
## [6,] 88.51 4.59 60.53 13.57 98.50 47.55 5.58 29.56 56.54 80.52
## [7,] 10.69 45.64 43.65 14.67 95.60 11.68 74.63 76.62 83.61 38.66
## [8,] 17.79 24.78 57.75 82.71 63.73 28.77 71.72 87.70 53.76 59.74
## [9,] 91.81 41.84 81.82 21.88 22.87 94.80 33.86 62.83 12.89 37.85
## [10,] 78.94 52.95 48.96 31.97 89.93 3.99 97.91 20.98 99.90 96.92
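If the row loop is not wanted, the same push-style desorting can be sketched for the whole matrix at once with two-column matrix indexing. This is only an illustration of the index-assignment idea above, not part of the benchmark; the names idx, s, s2 and res are made up here, and the "+ 1000" transformation is taken from the question.

```r
set.seed(21)
m <- matrix(sample(100, 100), 10, 10)

o <- t(apply(m, 1, order))         # row-wise order ("pull") matrix
idx <- cbind(c(row(o)), c(o))      # (row, source column) pairs in column-major order

s <- matrix(m[idx], nrow(m))       # sorted rows: s[i, j] == m[i, o[i, j]]
s2 <- s + 1000                     # some transformation of the sorted data

res <- matrix(NA_real_, nrow(m), ncol(m))
res[idx] <- s2                     # push every value back to its original column

stopifnot(all(res == m + 1000))
```

Because idx, s and s2 are all traversed in the same column-major order, each value lands back in the cell it was pulled from.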
rank() is the complement of order(). Save the original rank() and use it to get back to the original ordering after rearranging with order().
I think your example is overcomplicated (far from minimal!) by putting things in a matrix and doing extra stuff. Because you are applying functions at the row-level you just need to solve it for a vector. An example:
set.seed(47)
x = rnorm(10)
xo = order(x)
xr = rank(x)
x[xo][xr] == x
# [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
In your case, you can perform whatever transformations you want on the ordered vector x[xo], then index the result by [xr] to get back to the original ordering.
sorted_result = x[xo] + c(1, diff(x[xo])) # some order-dependent transformation
final_result = sorted_result[xr] # back to original ordering
If there's a possibility of ties, you'll want to use ties.method = 'first' in the rank() call.
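A tiny example, with made-up values, of why the ties setting matters: with the default ties.method = 'average', tied elements get fractional ranks such as 3.5, which are truncated when used as indexes, so two original positions end up pointing at the same sorted element.

```r
x  <- c(3, 1, 3, 2)                              # the value 3 is tied
xo <- order(x)
sorted <- paste0(x[xo], letters[seq_along(x)])   # "1a" "2b" "3c" "3d"

back_first <- sorted[rank(x, ties.method = "first")]
back_avg   <- sorted[rank(x)]                    # ranks 3.5 are truncated to index 3

back_first   # "3c" "1a" "3d" "2b"
back_avg     # "3c" "1a" "3c" "2b"  ("3d" is lost, "3c" appears twice)
```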
Taking this back to the matrix example:
m3 = t(apply(m, 1, function(x) {
xo = order(x)
xr = rank(x, ties.method = 'first')
(x[xo] + 1000)[xr] # add 1000 to sorted matrix and then "unsort"
}))
# check that it worked
all(m3 == (m + 1000))
# [1] TRUE

R: calling a matrix value of column 2 dependent on the value of column 1

I admit that I am totally new to R and have a few beginner's problems;
my problem is the following:
I have quite a long matrix TEST of length 5000 with 2 columns (column 1 = time; column 2 = concentration of a species).
I want to use the right concentration values for calculation of propensities in stochastic simulations.
I already have an algorithm that gives me the simulation time t_sim; what I need is a line of code that gives the respective concentration value at t = t_sim;
also: the time vector might have a big step size, so t_sim would have to be rounded up to the next larger time value in order to look up the respective concentration value.
I know this is probably quite an easy problem, but I really do not see the solution in R.
Best wishes and many thanks,
Arne
Without sample data this answer is kind of a shot in the dark, but I think that this might work:
t_conc <- TEST[which.min(abs(t_sim-TEST[,1])),2]
where TEST is the matrix with two columns as described in the OP and the output t_conc is the concentration that corresponds to the value of time in the matrix that is closest to the input value t_sim.
Here's another shot in the dark:
set.seed(1);
N <- 20; test <- matrix(c(sort(sample(100,N)),rnorm(N,0.5,0.2)),N,dimnames=list(NULL,c('time','concentration')));
test;
## time concentration
## [1,] 6 0.80235623
## [2,] 16 0.57796865
## [3,] 19 0.37575188
## [4,] 20 0.05706002
## [5,] 27 0.72498618
## [6,] 32 0.49101328
## [7,] 34 0.49676195
## [8,] 37 0.68876724
## [9,] 43 0.66424424
## [10,] 57 0.61878026
## [11,] 58 0.68379547
## [12,] 61 0.65642726
## [13,] 62 0.51491300
## [14,] 63 0.10212966
## [15,] 67 0.62396515
## [16,] 83 0.48877425
## [17,] 86 0.46884090
## [18,] 88 0.20584952
## [19,] 89 0.40436999
## [20,] 97 0.58358831
t_sim <- 39;
test[findInterval(t_sim,test[,'time']),'concentration'];
## concentration
## 0.6887672
Note that findInterval() returns the index of the lesser time value if t_sim falls between two time values, as my example shows. If you want the greater, you need a bit more work:
i <- findInterval(t_sim,test[,'time']);
if (test[i,'time'] != t_sim && i < nrow(test)) i <- i+1;
test[i,'concentration'];
## concentration
## 0.6642442
If you want the nearest, see R: find nearest index.
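The round-up lookup can also be wrapped in a small vectorized helper. This is a sketch under a few assumptions: the function name conc_at and the example table are hypothetical, the time column is sorted in increasing order, and values of t_sim beyond the last time are clamped to the last row. It relies on the left.open argument of findInterval(), available since R 3.3.0.

```r
# Concentration at the first tabulated time that is >= t_sim
conc_at <- function(tab, t_sim) {
  times <- tab[, 1]
  # With left.open = TRUE, findInterval() returns i such that
  # times[i] < t_sim <= times[i + 1], so i + 1 is the round-up index
  i <- findInterval(t_sim, times, left.open = TRUE) + 1L
  i <- pmin(i, nrow(tab))          # clamp t_sim values beyond the last time
  unname(tab[i, 2])
}

tab <- cbind(time = c(0, 10, 20, 30), concentration = c(1.0, 0.8, 0.5, 0.2))
conc_at(tab, c(12, 10, -5, 99))    # 0.5 0.8 1.0 0.2
```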

Transforming rows in a PCA context using dudi.pca

I have a huge matrix of genetic data (1e7 rows representing individuals x 5,000 columns representing markers) on which I would like to perform a PCA in order to keep c. 20 columns. However, due to memory issues, I cannot perform PCA using either dudi.pca or big.PCA on R 3.1.2 on a 8GB 64bits machine.
An alternative was to compute an approximation of the coordinates of principal axes on a row-subset of the matrix and then transform the whole matrix using a linear combination with the approximate PA coordinates.
I am facing a simple PCA-related problem using dudi.pca: how can I get the row coordinates using the original matrix and the matrix of column coordinates (= principal axes) ?
Here is a simple example, let's take a random matrix M (3 rows and 4 columns) such as:
M=
1 9 10 13
20 13 20 7
18 19 17 10
Doing dudi.pca(M, center=T, scale=T) and keeping only one PC, dudi.pca outputs the following $c1 matrix (column normed scores ie principal axes):
c1 =
-0.547
-0.395
-0.539
0.504
To compute the row coordinates of the data on the first principal axis, I thought doing the inner product:
r =
-0.547*1 + -0.395*9 + -0.539*10 + 0.504*13
-0.547*20 + -0.395*13 + -0.539*20 + 0.504*7
-0.547*18 + -0.395*19 + -0.539*17 + 0.504*10
i.e.
r =
-2.944
-23.331
-21.481
But if I look up at the $li (row coordinates ie principal components) natively computed by dudi.pca on the same dataset, I read:
r' =
2.565
-1.559
-1.005
Am I doing something wrong when computing the row coordinates from the dudi.pca $c1 matrix?
Many thanks for your help,
Quaerens.
Code :
> M=matrix(c(1,9,10,13,20,13,20,7,18,19,17,10), ncol=4, byrow=T)
> M
[,1] [,2] [,3] [,4]
[1,] 1 9 10 13
[2,] 20 13 20 7
[3,] 18 19 17 10
> N=dudi.pca(M, center=T, scale=T, scannf=F, nf=1)
> N$c1
CS1
V1 -0.5468634
V2 -0.3955638
V3 -0.5389504
V4 0.5039863
> r=c( M[1,] %*% N$c1[,1], M[2,] %*% N$c1[,1], M[3,] %*% N$c1[,1] )
> r
[1] -2.94462 -23.33070 -21.48155
> N$li
Axis1
1 2.565165
2 -1.559546
3 -1.005619
If this is still of interest...
ade4 works on the duality diagram; hence, when p is greater than n, the singular value decomposition is carried out on the n x n symmetric matrix.
library(ade4)
M=matrix(c(1,9,10,13,20,13,20,7,18,19,17,10), ncol=4, byrow=T)
M
## [,1] [,2] [,3] [,4]
## [1,] 1 9 10 13
## [2,] 20 13 20 7
## [3,] 18 19 17 10
N=dudi.pca(M, center=T, scale=T, scannf=F, nf=1)
#dimensions of M
n=3
p=4
X=scalewt(M,center=T,scale=T)
#this could be done in two ways. Singular Value Decomposition or Duality Diagrams.
#Consider a singular value decomposition X = U D V'; the columns of U are the left singular vectors, the columns of V are the right singular vectors, and D is the diagonal matrix of singular values
svd=svd(X)
#These are equivalent
N$c1
svd$v[,1]
#Equivalent
N$eig
## [1] 3.341175 0.658825
#the eigen values are the squared singular values divided by n (uniform row weights 1/n)
svd$d[1:2]^2/n
## [1] 3.341175 0.658825
#Diagonal matrix of eigen values
lambda=diag(svd$d^2/n)
#N$lw gives the row weights
N$lw
#0.3333333 0.3333333 0.3333333
#find the inverse of the diagonal matrix of row weights; this is the normalization part
K=solve(sqrt(diag(N$lw,n)))%*%svd$u
#These are equivalent
head(K[,1])
## [1] 1.4033490 -0.8531958 -0.5501532
head(N$l1)
## RS1
## 1 1.4033490
## 2 -0.8531958
## 3 -0.5501532
#Find Principal Components (keep the first two axes; the third singular value is numerically zero)
pc=(K%*%sqrt(lambda))[,1:2]
#These are equivalent
head(pc)
## [,1] [,2]
## [1,] 2.565165 -0.1420130
## [2,] -1.559546 -0.9154578
## [3,] -1.005619 1.0574707
head(N$li)
## Axis1
## 1 2.565165
## 2 -1.559546
## 3 -1.005619
This could also be done using the duality diagram implemented in ade4.
For references on the duality diagram implemented in ade4, see: http://projecteuclid.org/euclid.aoas/1324399594
Q<-diag(p)
D<-diag(1/n, n)
rk<-qr(X)
rank=rk$rank
#Statistical Triplets
V<-t(X)%*%D%*%X
W<-X%*%Q%*%t(X)
#Compute the eigen values and vectors of the statistical triplet
example.eigen=eigen(W%*%D)
#Equivalent
N$eig
## [1] 3.341175 0.658825
example.eigen$values[1:rank]
## [1] 3.341175 0.658825
#Diagonal matrix of eigen values
lambda=diag(example.eigen$values[1:rank])
#find the inverse of the diagonal matrix of row weights; this is the normalizing part
Binv<-solve(sqrt(D))
K=Binv%*%example.eigen$vectors[,1:rank]
#These are equivalent
head(K[,1])
## [1] 1.4033490 -0.8531958 -0.5501532
head(N$l1)
## RS1
## 1 1.4033490
## 2 -0.8531958
## 3 -0.5501532
#Find Principal Components
pc=K%*%sqrt(lambda)
#These are equivalent
head(pc)
## [,1] [,2]
## [1,] 2.565165 -0.1420130
## [2,] -1.559546 -0.9154578
## [3,] -1.005619 1.0574707
head(N$li)
## Axis1
## 1 2.565165
## 2 -1.559546
## 3 -1.005619
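To connect this back to the original question: the row coordinates come out right once the rows are centred and scaled before taking the inner product with the principal axis. The sketch below uses base R only and copies the $c1 values from the question instead of calling dudi.pca. It assumes that the scaling uses standard deviations with divisor n (matching the uniform row weights 1/n shown in N$lw); the helper name project is made up for illustration.

```r
M <- matrix(c(1, 9, 10, 13, 20, 13, 20, 7, 18, 19, 17, 10), ncol = 4, byrow = TRUE)

# Column means and standard deviations with divisor n (not n - 1)
ctr <- colMeans(M)
sdn <- sqrt(colMeans(sweep(M, 2, ctr)^2))

# First principal axis, copied from the N$c1 output in the question
c1 <- c(-0.5468634, -0.3955638, -0.5389504, 0.5039863)

# Centre and scale the rows with ctr and sdn, then project onto the axis
project <- function(rows) drop(sweep(sweep(rows, 2, ctr), 2, sdn, "/") %*% c1)

li <- project(M)
round(li, 3)   # 2.565 -1.560 -1.006, i.e. N$li from the question
```

For the large-matrix use case, the same project() could in principle be applied to rows outside the subset, provided ctr, sdn and c1 were estimated on the subset; how good that approximation is depends on how representative the subset is.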

transform this function using normal programming code and without using R functions

I have this function in R from a previous question here
shift <- function(d, k) rbind( tail(d,k), head(d,-k), deparse.level = 0 )
This function rotates the data frame d by k, which means it takes k rows from the end of the data frame and places them at the top.
I want to create the same function (in the same language) without using R's pre-made functions (head, tail, ...), using only the basics of programming (for, ...).
How can this be done?
Well, I don't know what you mean by "without using R functions", since pretty much everything is an R function, but here is a solution using only the very generic nrow() (number of rows of a matrix), %% (modulus) and seq_len() (seq_len(n) is equivalent to 1:n, except that it also behaves correctly when n is 0):
m <- matrix(1:40,,2,byrow=TRUE)
shift2 <- function(d, k) d[(seq_len(nrow(d))-k-1)%%(nrow(d))+1,]
shift2(m,5)
[,1] [,2]
[1,] 31 32
[2,] 33 34
[3,] 35 36
[4,] 37 38
[5,] 39 40
[6,] 1 2
[7,] 3 4
[8,] 5 6
[9,] 7 8
[10,] 9 10
[11,] 11 12
[12,] 13 14
[13,] 15 16
[14,] 17 18
[15,] 19 20
[16,] 21 22
[17,] 23 24
[18,] 25 26
[19,] 27 28
[20,] 29 30
If by "normal programming code" you mean that it shouldn't be vectorized, then, well, you are learning either the wrong language in the right way or the right language in the wrong way. In R, every time you come up with a vectorized solution instead of for loops, you should be happy.
But if you really really want to do this with loops here is exactly the same function unvectorized:
shift3 <- function(d, k)
{
out <- matrix(,nrow(d),ncol(d))
sorts <- (seq_len(nrow(d))-k-1)%%(nrow(d))+1
for (i in seq_len(nrow(d))) out[i,] <- d[sorts[i],]
return(out)
}
Proof they are all equal:
all(shift(m,5) == shift2(m,5) & shift2(m,5) == shift3(m,5))
[1] TRUE
EDIT:
Actually shift3() there STILL contained a lot of vectorizations, showing just how native that is in R. Here is a fully unvectorized version:
shift3 <- function(d, k)
{
out <- matrix(,nrow(d),ncol(d))
sor <- numeric(1)
for (i in seq_len(nrow(d)))
{
if (i-k < 1) sor <- nrow(d)-k+i else sor <- i-k
for (j in seq_len(ncol(d))) out[i,j] <- d[sor,j]
}
return(out)
}
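As a quick check of the loop-only idea, here is a minimal sketch of a variant that also wraps k with the modulus, so that k larger than the number of rows (or negative) still works. The name shift_loop and the wrapping of k are assumptions added for illustration; they are not part of the original shift().

```r
shift_loop <- function(d, k) {
  n <- nrow(d)
  k <- k %% n                        # assumption: wrap k so that any integer is accepted
  out <- matrix(NA, n, ncol(d))
  for (i in seq_len(n)) {
    src <- i - k                     # row of d that ends up in row i
    if (src < 1) src <- src + n      # wrap around to the end of d
    for (j in seq_len(ncol(d))) out[i, j] <- d[src, j]
  }
  out
}

m <- matrix(1:40, , 2, byrow = TRUE)
res <- shift_loop(m, 5)
res[1, ]   # 31 32, the same first row as shift2(m, 5) above
```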

Resources