R ranges: 1:0 - illogical behavior - r

I have an array X of length N, and I'd like to compute sum(X[(i+1):N]) - sum(X[1:(i-1)]). This works fine if my index, i, is within 2..(N-1). If it's equal to 1, the second term will return the first element of the array rather than exclude it. If it's equal to N, the first term will return the last element of the array rather than exclude it. seq_len is the only function I'm aware of that does the job, but only for the 2nd term (it indexes 1:n). What I need is a range function that will return NULL (rather than throw an exception like seq) when its 2nd argument is below its first. The sum function will do the rest. Is anyone aware of one, or do I have to write one myself?

I suggest an alternate path for generating indexing sequences: seq_len, which reacts intuitively in the extremes.
Bottom Line Up Front: use sum(X[-seq_len(i)]) - sum(X[seq_len(i-1)]) instead.
First, some sample data:
X <- 1:10
N <- length(X)
Your approach, at the two extremes:
i <- 1
X[(i+1):N]
# [1] 2 3 4 5 6 7 8 9 10
X[1:(i-1)] # oops
# [1] 1
That should return "nothing", I believe. (More to the point, sum(...) should return 0. For the record, sum(integer(0)) is 0.)
i <- 10
X[(i+1):N] # oops
# [1] NA 10
X[1:(i-1)]
# [1] 1 2 3 4 5 6 7 8 9
There's your other error, where you'd expect "nothing" in the first subset.
Instead, I suggest you use seq_len:
i <- 1
X[-seq_len(i)]
# [1] 2 3 4 5 6 7 8 9 10
X[seq_len(i-1)]
# integer(0)
i <- 10
X[-seq_len(i)]
# integer(0)
X[seq_len(i-1)]
# [1] 1 2 3 4 5 6 7 8 9
Both seem fine, and something in the middle makes sense.
i <- 5
X[-seq_len(i)]
# [1] 6 7 8 9 10
X[seq_len(i-1)]
# [1] 1 2 3 4
In this contrived example, what we're looking for at any value of i:
1: sum(2:10) - 0 = 54 - 0 = 54
2: sum(3:10) - sum(1:1) = 52 - 1 = 51
3: sum(4:10) - sum(1:2) = 49 - 3 = 46
...
10: 0 - sum(1:9) = 0 - 45 = -45
And we now get that:
func <- function(i, x) sum(x[-seq_len(i)]) - sum(x[seq_len(i-1)])
sapply(c(1,2,3,10), func, X)
# [1] 54 51 46 -45
Edit:
李哲源's answer got me thinking that you don't need to re-sum the numbers before and after i every time. Compute the cumulative sums once and reuse them. This could easily be faster if your vector is large.
Xb <- c(0, cumsum(X)[-N])
Xb
# [1] 0 1 3 6 10 15 21 28 36 45
Xa <- c(rev(cumsum(rev(X)))[-1], 0)
Xa
# [1] 54 52 49 45 40 34 27 19 10 0
sapply(c(1,2,3,10), function(i) Xa[i] - Xb[i])
# [1] 54 51 46 -45
So this suggests that your summed value at any value of i is
Xs <- Xa - Xb
Xs
# [1] 54 51 46 39 30 19 6 -9 -26 -45
where you can find the specific value with Xs[i]. No repeated summing required.
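If you would still rather have the range function the question asks for, here is a minimal sketch. The name safe_range is hypothetical (it is not a base R function); it simply returns an empty vector when the upper bound is below the lower one, so sum() yields 0 for that term.

```r
# Hypothetical helper (not part of base R): an ascending range that is
# empty, instead of counting down, when `to` is below `from`.
safe_range <- function(from, to) {
  if (to < from) integer(0) else seq.int(from, to)
}

X <- 1:10
N <- length(X)

# Same quantity as in the question, now safe at i = 1 and i = N
f <- function(i) sum(X[safe_range(i + 1, N)]) - sum(X[safe_range(1, i - 1)])
res <- sapply(c(1, 2, 3, 10), f)
res
```

This should give the same four values as the seq_len version above.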

Related

Comparing each element in two columns and set another column

I have a data frame (after fread from a file) with two columns (dep and label). I want to set another column (mark) with an id value depending on the match. If the 'dep' entry matches a 'label' entry, mark gets the 'id' of the matched 'label'. If there is no match, mark gets the value of its own 'id'. Currently I have a workaround solution with loops, but I know there should be a neater, more idiomatic way to do it in R.
trace <- data.table(id=seq(1:7),dep=c(-1,45,40,47,0,45,43),
label=c(99,40,43,45,47,42,48), mark=rep("",7))
id dep label mark
1: 1 -1 99 1
2: 2 45 40 2
3: 3 40 43 2
4: 4 47 45 4
5: 5 0 47 5
6: 6 45 42 4
7: 7 43 48 3
I know loops are slow in R. Just to give an example, the following naive for/while works for small sizes, but my data set is huge.
trace$mark <- trace$id
for (i in 1:length(trace$id)){
val <- trace$dep[i]
j <- 1
while(j<=i && val !=-1 && val!=0){ # don't compare if val is -1/0
if(val==trace$label[j]){
trace$mark[i] <- trace$id[j]
}
j <-j +1
}
}
I have also tried using the following approach but it works only if there is a single match.
match <- which(trace$dep %in% trace$label)
match_to <- which(trace$label %in% trace$dep)
trace$mark[match] <- trace$mark[match_to]
This solution might help:
trace[trace[,.(id,dep=label)],mark:=as.character(i.id),on="dep"]
trace[mark=="",mark:=as.character(id)]
# id dep label mark
# 1: 1 -1 99 1
# 2: 2 45 40 4
# 3: 3 -1 43 3
# 4: 4 47 45 5
# 5: 5 -1 47 5
# 6: 6 45 42 4
# 7: 7 43 48 3
Update:
To make sure you are not matching dep with 0 or -1 values you can just add another line.
trace[dep %in% c(0,-1), mark:= as.character(id)]
OR
Try this:
trace[trace[!dep %in% c(0,-1),.(id,dep=label)],mark:=as.character(i.id),on="dep"]
trace[mark=="",mark:=as.character(id)]
The solution that worked
trace[trace[,.(id,dep=label)],on=.(id<=id,dep),mark:=as.character(i.id),allow.cartesian=TRUE]
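For comparison, here is a base R sketch of the rule implemented by the loop in the question (match dep against labels in the same or an earlier row, and skip dep values of 0 and -1). It is not the data.table solution above, and it assumes the labels are unique: match() returns the first matching row, whereas the loop keeps the last one.

```r
id    <- 1:7
dep   <- c(-1, 45, 40, 47, 0, 45, 43)
label <- c(99, 40, 43, 45, 47, 42, 48)

j  <- match(dep, label)               # row whose label equals this dep (NA if none)
ok <- !is.na(j) &                     # a match exists ...
      j <= seq_along(dep) &           # ... in the same or an earlier row ...
      !(dep %in% c(0, -1))            # ... and dep is not a sentinel value
mark <- ifelse(ok, id[j], id)         # matched id, else the row's own id
mark
```

On the sample data this should reproduce the expected mark column shown in the question.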

How can I tell a for loop in R to regenerate a sample if the sample contains a certain pair of species?

I am creating 1000 random communities (vectors) from a species pool of 128 with certain operations applied to the community and stored in a new vector. For simplicity, I have been practicing writing code using 10 random communities from a species pool of 20. The problem is that there are a couple of pairs of species such that if one of the pairs is generated in the random community, I need that community to be thrown out and a new one regenerated. I have been able to code it so that if the pair is found in a community, that community (vector) is labeled NA. I also know how to tell the loop to skip that vector using the "next" command. But with both of these options, I do not get all of the communities that I need.
Here is my code using the NA option, but again that ends up shorting me communities.
C<-c(1:20)
D<-numeric(10)
X<- numeric(5)
for(i in 1:10){
X<-sample(C, size=5, replace = FALSE)
if("10" %in% X & "11" %in% X) X=NA else X=X
if("1" %in% X & "2" %in% X) X=NA else X=X
print(X)
D[i]<-sum(X)
}
print(D)
This is what my result looks like.
[1] 5 1 7 3 14
[1] 20 8 3 18 17
[1] NA
[1] NA
[1] 4 7 1 5 3
[1] 16 1 11 3 12
[1] 14 3 8 10 15
[1] 7 6 18 3 17
[1] 6 5 7 3 20
[1] 16 14 17 7 9
> print(D)
[1] 30 66 NA NA 20 43 50 51 41 63
Thanks so much!
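One possible approach, sketched here under assumptions rather than taken from an answer: wrap the draw in a repeat loop and only leave it once the sample contains none of the forbidden pairs. The names pool, forbidden and has_forbidden are made up for this illustration, and set.seed is there only to make the sketch reproducible.

```r
set.seed(42)                              # reproducibility of this sketch only
pool <- 1:20
forbidden <- list(c(10, 11), c(1, 2))     # pairs that must not occur together

# TRUE if the community contains both members of any forbidden pair
has_forbidden <- function(comm) {
  any(vapply(forbidden, function(p) all(p %in% comm), logical(1)))
}

communities <- vector("list", 10)
for (i in seq_along(communities)) {
  repeat {
    comm <- sample(pool, size = 5, replace = FALSE)
    if (!has_forbidden(comm)) break       # redraw until the community is valid
  }
  communities[[i]] <- comm
}
D <- vapply(communities, sum, numeric(1))
D
```

Because invalid draws are replaced instead of skipped or set to NA, D always ends up with one sum per requested community.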

R:Calculating percentage values across a matrix based on the values in another matrix

I have two matrices, one is a 10x1 double matrix, that can be expanded to any user preset number, eg. 100.
View(min_matrx)
V1
1 27
2 46
3 30
4 59
5 46
6 45
7 34
8 31
9 52
10 46
The other matrix looks like this, there are more rows not shown:
View(main_matrx)
row.names sum_value
s17 45
s7469 213
s20984 24
s17309 214
s7432369 43
s221320984 12
s17556 34
s741269 11
s20132984 35
For each row name in main_matrx I want to count the number of times that a value more than the sum_value in main_matrx appears in min_matrx. Then I want to divide that by the number of rows in min_matrx and add that value as a new column in main_matrx.
For example, in row 1 of main_matrx for s17, the number of times a value appears that is more than 45 in min_matrx =5 times.
Now divide that 5 by 10 rows of min_matrx=> 5/10 =0.5 would be the value I'd like to have as a new column in main_matrx for s17. Then the same formula for all the s_ids in the row names.
So far I have fiddled with:
for(s in 1:length(main_matrx)) {
new<-sum(main_matrx[s,]>min_CPRS_set)/length(min_matrx)
}
and I tried using apply() but I'm still not getting results.
apply(main_matrx,1:length(main_matrx), function(x) sum(main_matrx>min_CPRS_set)/length(min_matrx)))
Now, I'm just stuck because it's not working. I'm still new to R so my code isn't particularly efficient. Any suggestions?
Lots of ways to approach this. Here's one that came to mind. (I think I understand what you're after, but an example is much easier to follow than words alone. In the future I'd suggest including an example alongside the text of the question.)
Where x is an element, y is a vector
FUN <- function(x, y = min_matrx[, 1]) {
sum(y > x)/length(y)
}
main_matrx$new <- sapply(main_matrx[, 2], FUN)
## > main_matrx
## row.names sum_value new
## 1 s17 45 0.5
## 2 s7469 213 0.0
## 3 s20984 24 1.0
## 4 s17309 214 0.0
## 5 s7432369 43 0.6
## 6 s221320984 12 1.0
## 7 s17556 34 0.6
## 8 s741269 11 1.0
## 9 s20132984 35 0.6
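The same calculation can be checked on its own with plain vectors. This is a self-contained sketch using the numbers from the question; min_vals and sum_value are stand-in names for the matrix columns, and mean() of a logical vector is used as a shorthand for sum(...)/length(...).

```r
min_vals <- c(27, 46, 30, 59, 46, 45, 34, 31, 52, 46)
sum_value <- c(s17 = 45, s7469 = 213, s20984 = 24, s17309 = 214,
               s7432369 = 43, s221320984 = 12, s17556 = 34,
               s741269 = 11, s20132984 = 35)

# For each sum_value, the fraction of min_vals strictly greater than it
frac_above <- sapply(sum_value, function(s) mean(min_vals > s))
frac_above
```

The result is a named vector, so it can be assigned directly as a new column as long as the row order is unchanged.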

Maximum Intermediate Volatility

I have two vectors, a and b. See attached.
a is the signal and is a probability.
b is the absolute percentage change the next period.
Signalt <- seq(0, 1, 0.05)
I would like to find the maximum absolute return occurring within each intermediate 5%-tile (Signalt) of the a vector. So if it is
0.01, 0.02, 0.03, 0.06 0.07
then it should calculate the maximum return between
0.01 and 0.02,
0.01 and 0.03,
0.02 and 0.03.
Then move on to
0.06 and 0.07 do it over etc.
Output would then be combined in a matrix or table when the entire sequence has run.
It should follow the index from vector a and b.
i is an index that is updated by one every time that a crosses into a new percentile. τ(i) is the bucket associated with the ith cross.
a is the probability vector which has length tao. This vector should be analyzed in its 5% tiles, with the maximum intermediate absolute return being the output. The price change of next period is the vector b. This would be represented by P in the equation below.
l and m are indexes.
Every time Signal moves from one 5% tile to another, we compute the
largest absolute return that occurs between any two intermediate
buckets, until Signal moves to another 5% tile. For example, suppose
that Signal moves into the 85th percentile and 4 volume buckets later
moves into the 90th percentile. We would then calculate absolute
returns between buckets 1 and 2, 1 and 3, 1 and 4, 2 and 3, 2 and 4, 3
and 4. We are interested in the maximum absolute return. We would then
calculate the max return in the following percentile bucket, move on
to the next, which could be an 85th percentile and so on. So we let i
be an index that is updated by 1 every time that Signal moves from one
percentile into another, and τ(i) the bucket associated with the ith
cross.
This is the equation I am using. The notation might vary slightly.
Now my question is how to go about this. Perhaps someone has an intuitive solution to this.
I hope my question is clear.
"a","b"
0,0.013013698630137
0,0.0013522650439487
0,0.00135409614082593
0,0.00203389830508471
0.27804813511593,0.00135317997293627
0.300237801284318,0
0.495965075167796,0.00405405405405412
0.523741892051237,0.000672947510094168
0.558753750296458,0.00202020202020203
0.665762829019002,0.000672043010752743
0.493106479913899,0.000671591672263272
0.344592579573497,0.000672043010752854
0.336263897823707,0.00201748486886366
0.35884763774257,0.00536912751677865
0.23662807979007,0.00133511348464632
0.212636893966841,0.00267379679144386
0.362212830513403,0.000666666666666593
0.319216408413927,0.00333555703802535
0.277670854167344,0
0.310143323100971,0
0.374104373036218,0.00267737617135211
0.190943075221511,0.00268456375838921
0.165770070508112,0.00200803212851386
0.240310208616952,0.00133600534402145
0.212418038918236,0.00200133422281523
0.204282022136019,0.00200534759358306
0.363725074298064,0.000667111407605114
0.451807761954326,0.000666666666666593
0.369296011692801,0.000666222518321047
0.37503495989363,0.0026666666666666
0.323386355686901,0.00132978723404265
0.189216171830472,0.00266311584553924
0.185252052821193,0.00199203187250996
0.174882909380997,0.000662690523525522
0.149291525540782,0.00132625994694946
0.196824215268048,0.00264900662251666
0.164611993131396,0.000660501981505912
0.125470998266484,0.00132187706543285
0.179999532586703,0.00264026402640272
0.368749638521621,0.000658327847267826
0.427799340926225,0
My interpretation of the question
I hope I understand your question correctly. Here is what I understood:
For each row you compute which 5% percentile it belongs to
Whenever that percentile changes, you start a new bucket
All rows from the same bucket result in a single resulting value
If there is only a single row in a bucket, the b value from that row is the resulting value
Otherwise, you compute all abs(b[l]/b[m]-1) where m<l and both belong to the same bucket
Basic answer
Code
This code here does what I describe above:
# read the data (shortened, full data in OP)
d <- read.table(textConnection("a,b
0,0.013013698630137
[…]
0.427799340926225,0
"), sep=",", header=TRUE)
# compute percentile number for each line
d$percentile <- floor(d$a/0.05)*5 + 5
# start a new bucket whenever the percentile changes
d$bucket <- cumsum(c(1, diff(d$percentile) != 0))
# compute a single number for all rows of the same bucket
aggregate(b ~ percentile + bucket, d, function(b) {
if(length(b) == 1) return(b); # special case of only a single row
m <- outer(b, b, function(pm, pl) abs(pl/pm - 1)) # compare all pairs
return(max(m[upper.tri(m)])) # only return pairs with m < l
})
Output
The result will look like this:
percentile bucket b
1 5 1 0.8960891071
2 30 2 0.0013531800
3 35 3 0.0000000000
4 50 4 0.0040540541
5 55 5 0.0006729475
6 60 6 0.0020202020
7 70 7 0.0006720430
8 50 8 0.0006715917
9 35 9 2.0020174849
10 40 10 0.0053691275
11 25 11 1.0026737968
12 40 12 0.0006666667
13 35 13 0.0033355570
14 30 14 0.0000000000
15 35 15 0.0000000000
16 40 16 0.0026773762
17 20 17 0.2520080321
18 25 18 0.5010026738
19 40 19 0.0006671114
20 50 20 0.0006666667
21 40 21 3.0026666667
22 35 22 0.0013297872
23 20 23 0.7511597084
24 15 24 0.0013262599
25 20 25 0.7506605020
26 15 26 0.0013218771
27 20 27 0.0026402640
28 40 28 0.0006583278
29 45 29 0.0000000000
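To see the bucketing at work on something small, here is a sketch using the five signal values from the example in the question. The b values are made up purely for illustration, and max_ret is just a named copy of the aggregation function used above.

```r
a <- c(0.01, 0.02, 0.03, 0.06, 0.07)        # signal values from the question's example
b <- c(0.010, 0.020, 0.005, 0.004, 0.006)   # hypothetical returns (illustration only)

percentile <- floor(a / 0.05) * 5 + 5           # 5, 5, 5, 10, 10
bucket <- cumsum(c(1, diff(percentile) != 0))   # 1, 1, 1, 2, 2

max_ret <- function(b) {
  if (length(b) == 1) return(b)                 # single row: return b itself
  m <- outer(b, b, function(pm, pl) abs(pl / pm - 1))
  max(m[upper.tri(m)])                          # only pairs with m < l
}
res <- tapply(b, bucket, max_ret)
res
```

The first three rows share a bucket, so three pairs are compared there; the last two rows form a second bucket with a single pair.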
Additional columns
Code
If you also want to know the number of items in each group, then I suggest you use the plyr package:
library(plyr)
aggB <- function(b) {
if(length(b) == 1) return(b)
m <- outer(b, b, function(pm, pl) abs(pl/pm - 1))
return(max(m[upper.tri(m)]))
}
ddply(d, .(bucket), summarise,
percentile = percentile[1], n = length(b), maxr = aggB(b))
Output
This will give you the following result:
bucket percentile n maxr
1 1 5 4 0.8960891071
2 2 30 1 0.0013531800
3 3 35 1 0.0000000000
4 4 50 1 0.0040540541
5 5 55 1 0.0006729475
6 6 60 1 0.0020202020
7 7 70 1 0.0006720430
8 8 50 1 0.0006715917
9 9 35 2 2.0020174849
10 10 40 1 0.0053691275
11 11 25 2 1.0026737968
12 12 40 1 0.0006666667
13 13 35 1 0.0033355570
14 14 30 1 0.0000000000
15 15 35 1 0.0000000000
16 16 40 1 0.0026773762
17 17 20 2 0.2520080321
18 18 25 3 0.5010026738
19 19 40 1 0.0006671114
20 20 50 1 0.0006666667
21 21 40 2 3.0026666667
22 22 35 1 0.0013297872
23 23 20 3 0.7511597084
24 24 15 1 0.0013262599
25 25 20 2 0.7506605020
26 26 15 1 0.0013218771
27 27 20 1 0.0026402640
28 28 40 1 0.0006583278
29 29 45 1 0.0000000000
I'm not sure I understand, but here is an attempt. My idea is to group the data by centiles, then do the calculation on each group using by.
To group the data, I create a new variable, split:
##dat$split <- cut(dat$a,seq(0, 1, 0.05),include.lowest=T)
dat$split <- c(0,cumsum(diff(dat$a) > 0.05))
Using by, I can apply my function to each group. The degenerate cases (a NULL P, or a group with a single value) return 0.
by(dat,dat$split,FUN =function(x){
P <- x$b
if( is.null(P)||length(P) ==1) return(0)
nn <- length(P)
ind <- expand.grid(1:nn,1:nn) ## generate all index pairs
ret <- abs(P[ind[,1]]/P[ind[,2]]-1) ## perform P_l/P_m-1 (vectorized)
list(P=P,
ret.max = max(ret),
ret.ind = ind[which.max(ret),])
})
Here is the resulting list. For each interval I show:
P (the prob values),
the maximum return,
the indexes from which this maximum is computed.
For example:
dat$split: 0
$P
[1] 0.0130 0.0014 0.0014 0.0020
$ret.max
[1] 8.6236
$ret.ind
Var1 Var2
5 1 2
---------------------------------------------------------------------------------------------------------------
dat$split: 1
$P
[1] 0.0014 0.0000
$ret.max
[1] 1
$ret.ind
Var1 Var2
2 2 1

under what circumstances does R recycle?

I have two variables, x (holding 5 values) and y (holding 11 values). When I run the statement
> v <- 2*x +y +1
R responds:
Error at 2* x+y: Longer object length is not a multiple of shorter object length.
I tried: 1*x gives me 5 values of x, but y has 11 values. So R says it can’t add 11 to 5 values? – This raises the general question: Under what circumstances does recycling work?
Recycling works in your example:
> x <- seq(5)
> y <- seq(11)
> x+y
[1] 2 4 6 8 10 7 9 11 13 15 12
Warning message:
In x + y : longer object length is not a multiple of shorter object length
> v <- 2*x +y +1
Warning message:
In 2 * x + y :
longer object length is not a multiple of shorter object length
> v
[1] 4 7 10 13 16 9 12 15 18 21 14
The "error" that you reported is in fact a "warning" which means that R is notifying you that it is recycling but recycles anyway. You may have options(warn=2) turned on, which converts warnings into error messages.
In general, avoid relying on recycling. If you get in the habit of ignoring the warnings, some day it will bite you and your code will fail in some very hard to diagnose way.
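To make the rule concrete, here is a small sketch with made-up vectors: recycling is silent when the longer length is a multiple of the shorter one (a length-one value such as the 2 or the 1 in the question always qualifies), and it produces the warning above otherwise. The handler only records that a warning occurred and then suppresses it.

```r
# Lengths 6 and 2: 6 is a multiple of 2, so recycling is silent.
silent <- 1:6 + 1:2          # 1+1, 2+2, 3+1, 4+2, 5+1, 6+2

# Lengths 5 and 2: 5 is not a multiple of 2, so R still recycles but warns.
warned <- FALSE
partial <- withCallingHandlers(
  1:5 + 1:2,
  warning = function(w) {
    warned <<- TRUE                  # remember that R complained
    invokeRestart("muffleWarning")   # carry on with the recycled result
  }
)
silent
partial
warned
```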
Recycling is not something to rely on here. For an unambiguous elementwise result, use vectors of the same length:
x_samelen = c(1,2,3)
y_samelen = c(10,20,30)
x_samelen*y_samelen
[1] 10 40 90
If vectors are of the same length, the result is well defined and understood. You can do "recycling", but it really is not advisable to do so.
I wrote a short script to make your two vectors of the same length, via padding the short vector. This will let you execute your code without warnings:
x_orig <- c(1,2,3,4,5,6,7,8,9,10,11)
y_orig <- c(21,22,23,24,25)
if ( length(x_orig)>length(y_orig) ) {
x <- x_orig
y <- head(x = as.vector(t(rep(x=y_orig, times=ceiling(length(x_orig)/length(y_orig))))), n = length(x_orig) )
cat("padding y\r\n")
} else {
x <- head(x = as.vector(t(rep(x=x_orig, times=ceiling(length(y_orig)/length(x_orig))))), n = length(y_orig) )
y <- y_orig
cat("padding x\r\n")
}
The results are:
x_orig
[1] 1 2 3 4 5 6 7 8 9 10 11
y_orig
[1] 21 22 23 24 25
x
[1] 1 2 3 4 5 6 7 8 9 10 11
y
[1] 21 22 23 24 25 21 22 23 24 25 21
If you reverse x_orig and y_orig:
x_orig
[1] 21 22 23 24 25
y_orig
[1] 1 2 3 4 5 6 7 8 9 10 11
x
[1] 21 22 23 24 25 21 22 23 24 25 21
y
[1] 1 2 3 4 5 6 7 8 9 10 11
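As a shorter alternative to the script above, base R's rep_len() repeats a vector up to a requested length, so both vectors can be extended to the longer of the two lengths without branching. This is a sketch with the same sample vectors; n is a name introduced here.

```r
x_orig <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
y_orig <- c(21, 22, 23, 24, 25)

n <- max(length(x_orig), length(y_orig))   # target length
x <- rep_len(x_orig, n)                    # already length n, so unchanged
y <- rep_len(y_orig, n)                    # repeated and truncated to length n
x + y
```

This makes the recycling explicit in the code, which is the main point of padding by hand.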
