what number is greater than x and falls after position y - r

Which number of x is > 5 and falls after the 10th position? It is the number in position 11.
But I find myself writing long code to get to the answer, and I am wondering if there is a quicker way.
x <- c(5,7,3,6,9,4,1,4,7,10,8,5,7,9,7,1, 8, 4, 4,9);
Identify the location of all numbers >5 call it x1:
x1 <- which(x>5);
Identify the first number of the locations(x1) that occurred after 10th position:
first(which(x1 >10))
This identifies location 6 of x1.
Identify the location of that number in the original vector (x):
x1[first(which(x1 >10))];
Now we have the position of the value we want in the original vector (x), and this code pulls the value we want:
x[x1[first(which(x1 >10))]]
This seems like very long code to answer a simple question. Do you know a shorter/simpler way to get the same result?

What's wrong with some very simple indexing and subsetting?
This gives the indices of all elements in x greater than 5 and which occur after position 10:
> which(x > 5 & seq_along(x) > 10)
[1] 11 13 14 15 17 20
So the final answer is
> which(x > 5 & seq_along(x) > 10)[1]
[1] 11
or
> head(which(x > 5 & seq_along(x) > 10), 1)
[1] 11
The trick used here is to generate a vector of indices of x using the seq_along() function. That way we can generate all the matches in a single logical statement, and then select the first of these.
If you want to extract the identified element then:
> want <- which(x > 5 & seq_along(x) > 10)[1]
> x[want]
[1] 8
[It wasn't clear whether you wanted to identify which element met your criteria or the value of that element.]
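As a small sketch, the same one-liner can be wrapped in a helper. The name first_after and its arguments are hypothetical, chosen only for illustration. It relies on the fact that indexing an empty result with [1] gives NA, so "no match" is reported as NA rather than as an error.

```r
x <- c(5, 7, 3, 6, 9, 4, 1, 4, 7, 10, 8, 5, 7, 9, 7, 1, 8, 4, 4, 9)

# Hypothetical helper: index of the first element greater than `threshold`
# that sits after position `pos`; NA when nothing matches.
first_after <- function(x, threshold, pos) {
  which(x > threshold & seq_along(x) > pos)[1]
}

first_after(x, 5, 10)      # 11
x[first_after(x, 5, 10)]   # 8
first_after(x, 100, 10)    # NA, because no element exceeds 100
```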

An alternative version using functional programming to extract the index:
> Find(function(y) x[y] > 5 & y > 10, seq_along(x))
[1] 11
or to extract the element:
> x[Find(function(y) x[y] > 5 & y > 10, seq_along(x))]
[1] 8
I don't know how its performance compares with simple indexing, though.
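One way to look into this, sketched under the assumption that base R's Position() and system.time() are acceptable here: Position() returns the index of the first match directly, and system.time() gives a rough timing. The vector big is made up for the comparison. Timings depend on the machine and the data, so none are shown.

```r
x <- c(5, 7, 3, 6, 9, 4, 1, 4, 7, 10, 8, 5, 7, 9, 7, 1, 8, 4, 4, 9)

# Both approaches should agree on the index.
idx_which <- which(x > 5 & seq_along(x) > 10)[1]
idx_pos   <- Position(function(i) x[i] > 5 && i > 10, seq_along(x))
idx_which == idx_pos

# Rough timing on a longer, made-up vector. which() tests every element at
# once in vectorized code; Find() calls an R function per element but can
# stop at the first match.
big <- rep(x, 5000)
system.time(which(big > 5 & seq_along(big) > 10)[1])
system.time(Find(function(i) big[i] > 5 && i > 10, seq_along(big)))
```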

Related

Obtaining index within vector using sliding window containing minimum number of values in R

I've searched around but haven't been able to find a solution to my question yet. I'm not really sure where to start.
I have a numeric vector in R. For example:
vec<-c(8,1,2,5,20,1,6,7,13,1,8,1,14,1,1,4,2,7)
I'm looking to find the index where the value '1' occurs at least 3 times within a window of 5. So in the above example, the output would be '10' as the window containing '1,8,1,14,1' is the first sequence of 5 values where 3 values are '1' and the index of the start of that sequence is 10.
Any help would be appreciated.
If you only want the indices, try rollapply from the zoo package like this:
> library(zoo)
> which(rollapply(vec, 5, FUN=function(x) sum(x==1)>=3))
[1] 10 11 12
Try this one-liner. Note that each of the three returned indexes satisfies the condition.
library(zoo)
which(rollapply(vec, 5, function(x) sum(x == 1) >= 3, fill = FALSE, align = "left"))
## [1] 10 11 12
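A base R loop that stops at the first matching window:
vec <- c(8,1,2,5,20,1,6,7,13,1,8,1,14,1,1,4,2,7)
window <- 5
numberToFind <- 1
timesToFind <- 3
for (i in seq_len(length(vec) - window + 1)) {
  # ">=" because the value must occur at least timesToFind times in the window
  if (sum(vec[i:(i + window - 1)] == numberToFind) >= timesToFind) {
    print(i)
    break
  }
}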
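For comparison, here is a base R sketch that avoids both the loop and the zoo dependency by differencing a cumulative sum of matches. The function name window_starts and its arguments are hypothetical. It returns every window start that qualifies; take the first element if only the first is wanted.

```r
vec <- c(8, 1, 2, 5, 20, 1, 6, 7, 13, 1, 8, 1, 14, 1, 1, 4, 2, 7)

# Hypothetical helper: start indices of all windows of length `window`
# in which `value` occurs at least `times` times.
window_starts <- function(vec, value, window, times) {
  n_windows <- length(vec) - window + 1
  if (n_windows < 1) return(integer(0))
  # hits[k] is the number of matches among vec[1..(k - 1)]
  hits <- c(0, cumsum(vec == value))
  starts <- seq_len(n_windows)
  # matches inside vec[i..(i + window - 1)] for every start i
  counts <- hits[starts + window] - hits[starts]
  which(counts >= times)
}

window_starts(vec, value = 1, window = 5, times = 3)      # 10 11 12
window_starts(vec, value = 1, window = 5, times = 3)[1]   # 10
```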

How do I shift the values in a column of a data frame either up or down?

I'm trying to write some code that effectively shifts the values in the first column of a dataframe either up or down. The conditions for it moving up or down are as follows:
1) If the difference between the value directly below the selected element in the data frame 'playerlist' and the value of the selected element is less than OR equal to the difference between the value directly above the selected element in the data frame and the value of the selected element, then the data in the first column shifts up (i.e. playerlist[1, 1] becomes playerlist[2, 1], playerlist[2, 1] becomes playerlist[3, 1], etc.).
2) If the converse is true (i.e. the difference between the value directly below the selected element in the data frame 'playerlist' and the value of the selected element is strictly more than the difference between the value directly above the selected element in the data frame and the value of the selected element), then the data shifts down (i.e. playerlist[3, 1] becomes playerlist[2, 1], playerlist[2, 1] becomes playerlist[1, 1], etc.).
3) If neither the value above nor the value below the selected element is less than the selected value, then nothing happens.
NB:
*number_of_players is an external input; in the example below it has the value 7 (i.e. playerlist contains 7 rows).
**Take x to be the row of the selected data (i.e. the selected data is always playerlist[x, 1]).
dicear <- function(x) {  # x is the row of the player playing the card
  y <- x - 1  # row above the selected one
  z <- x + 1  # row below the selected one
  if (x == 1) {  # wrap around; note `==` compares, whereas `<-` would assign
    y <- number_of_players
  }
  if (x == number_of_players) {
    z <- 1
  }
  n <- nrow(playerlist)
  old <- playerlist[, 1]  # snapshot, so the loops read the unshifted values
  if (old[x] > old[z] && (old[x] - old[z]) >= (old[x] - old[y])) {
    # shift up: each row takes the value from the row below it
    for (i in 1:n) {
      if (i == n) {
        dummy <- 1
      } else {
        dummy <- i + 1
      }
      playerlist[i, 1] <<- old[dummy]
    }
  } else if (old[x] > old[y] && (old[x] - old[y]) > (old[x] - old[z])) {
    # shift down: each row takes the value from the row above it
    for (i in 1:n) {
      if (i == 1) {
        dummy <- n
      } else {
        dummy <- i - 1
      }
      playerlist[i, 1] <<- old[dummy]
    }
  }
}
To help clarify your question, here are some guidelines that make the problem easier for me to approach. Shifting data around in a vector is simpler to reason about than moving data in the columns of a data frame, and a data frame column can be extracted as a vector. Your question asks to evaluate the differences between the previous value (i-1) and the following value (i+1), where i is the position being evaluated. As given, this excludes the first and final values: the first value has no previous value and the final value has no next value for the difference calculation. I will focus on a single position in a given vector.
Given the vector z <- c(1, 2, 8, 4, 5), let's go through the procedure given by your guidelines. For simplicity, calculate the absolute value of the differences when evaluating z[2].
> abs(z[1] - z[2])
[1] 1
> abs(z[2] - z[3])
[1] 6
Here abs(z[2] - z[3]) > abs(z[1] - z[2]), so the value z[2] shifts down and the vector becomes z <- c(1,8,2,4,5).
Repeating the procedure on z[2], which is now 8, gives the following result: z <- c(1,2,8,4,5), which is the original vector. So instead let's test z[3], which is 2: z <- c(1,2,8,4,5). Again we are back to the original vector. My thinking may be flawed. Please provide examples in the comments if I have made a mistake.
That considered, the following may be useful.
z <- c(1,2,8,4,5)
for(i in 1:5) print(c(z[i], z[-i]))
If all you want to do is shift a particular value around, the simple for loop given above will print() those sequences, where i is the iteration variable running from 1 through 5 (1:5). As i advances, the value at z[i] is moved to the first position and the remaining values follow in their original order. None of the values are lost.
> for(i in 1:5) print(c(z[i], z[-i]))
[1] 1 2 8 4 5
[1] 2 1 8 4 5
[1] 8 1 2 4 5
[1] 4 1 2 8 5
[1] 5 1 2 8 4
You can also calculate the differences in the vector using diff().
> diff(z)
[1] 1 6 -4 1
> abs(diff(z))
[1] 1 6 4 1
The values calculated by diff() are essentially the same values you wish to evaluate where the first value given by diff() is the difference between z[1] and z[2]; the second is the difference between z[2] and z[3], and so on. You could perform comparisons using difference values from diff().
> diffs <- diff(z)
> diffs[1] > diffs[2] | diffs[2] <= diffs[3]
[1] FALSE
> diffs[2] > diffs[3] | diffs[3] <= diffs[4]
[1] TRUE
Keep in mind that z has 5 elements whereas diffs has only 4 elements.
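The index offset between z and diffs can be made explicit. This sketch lines up, for each interior position, the gap to the previous value and the gap to the next value. The variable names are made up for illustration, and absolute differences are assumed, as in the worked example earlier.

```r
z <- c(1, 2, 8, 4, 5)
gaps <- abs(diff(z))              # 1 6 4 1; gaps[i] is the gap between z[i] and z[i + 1]

interior <- 2:(length(z) - 1)     # positions that have both neighbours: 2 3 4
prev_gap <- gaps[interior - 1]    # 1 6 4
next_gap <- gaps[interior]        # 6 4 1

# One row per interior position, with the comparison from rule 1 of the question.
data.frame(position = interior, prev_gap, next_gap,
           next_le_prev = next_gap <= prev_gap)
```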
It would help if you provided more details, such as an example of playerlist and the output you expect to see in the data frame.
Another option that may or may not be helpful is the sort() function. When I first read your question I thought maybe you were trying to sort your data one step at a time, for example to change a player's ranking after each turn of a game. You can arrange your data from smallest to largest using sort().
> sort(z)
[1] 1 2 4 5 8
There is an interesting principle here to explore.
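If the goal is the wrap-around shift described in the question, it can be written without a loop by re-indexing the column. This is a minimal sketch, assuming that "shift up" means each row takes the value of the row below it, with the first value wrapping to the end. The helper names rotate_up and rotate_down and the example scores are hypothetical.

```r
# Hypothetical helpers: rotate a vector by one position, wrapping around.
rotate_up   <- function(v) c(v[-1], v[1])                      # first value moves to the end
rotate_down <- function(v) c(v[length(v)], v[-length(v)])      # last value moves to the front

z <- c(1, 2, 8, 4, 5)
rotate_up(z)     # 2 8 4 5 1
rotate_down(z)   # 5 1 2 8 4

# Applied to the first column of a made-up data frame:
playerlist <- data.frame(score = c(3, 9, 4, 7, 1, 6, 2))
playerlist[, 1] <- rotate_up(playerlist[, 1])
playerlist[, 1]  # 9 4 7 1 6 2 3
```

Because the whole column is replaced in one assignment, no value is read after it has already been overwritten.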

Get index of vector between 1st and 2nd appearance of number 1

Suppose we have a vector:
v <- c(0,0,0,1,0,0,0,1,1,1,0,0)
Expected output:
v_index <- c(5,6,7)
v always starts and ends with 0. There is only one possibility of having cluster of zeros between two 1s.
Seems simple enough, can't get my head around...
I think this will do
which(cumsum(v == 1L) == 1L)[-1L]
## [1] 5 6 7
The idea here is to separate all the instances of "one"s to groups and select the first group while removing the occurrence of the "one" at the beginning (because you only want the zeroes).
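Printing the intermediate values makes the grouping visible. This sketch also shows a variant that filters on v == 0 instead of dropping the first index. That variant is an alternative spelling of the same idea, not part of the answer above.

```r
v <- c(0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0)

grp <- cumsum(v == 1L)     # 0 0 0 1 1 1 1 2 3 4 4 4: running count of ones seen so far
which(grp == 1L)           # 4 5 6 7: the first one and everything up to the second one
which(grp == 1L)[-1L]      # 5 6 7: drop the leading one itself

# Equivalent here: keep only the zeros inside the first group.
which(grp == 1L & v == 0)  # 5 6 7
```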
v <- c(0,0,0,1,0,0,0,1,1,1,0,0)
v_index<-seq(which(v!=0)[1]+1,which(v!=0)[2]-1,1)
> v_index
[1] 5 6 7
Explanation: I ask which indices are not equal to 0:
which(v!=0)
then I take the first and second index from that vector and create a sequence out of it.
This is probably one of the simplest answers out there. Find which items are equal to one, then produce a sequence using the first two indexes, incrementing the first and decrementing the other.
block <- which(v == 1)
start <- block[1] + 1
end <- block[2] - 1
v_index <- start:end
v_index
[1] 5 6 7

Arithmetic Progression series in R

I am new to this forum. I guess something like this has been asked before but, I am not really sure if that is what I want.
I have a sequence like this,
1 2 3 4 5 8 9 10 12 14 15 17 18 19
So, what I wish to do is get all the numbers which form a series, i.e. the numbers belonging to a set should all have a constant difference from the previous element, and the set should contain at least 3 elements.
For example, I can see that (1,2,3,4,5) forms one such series, in which numbers appear after an interval of 1; the total size of this set is 5, which satisfies the minimum threshold.
(1,3,5) forms another such pattern, in which the numbers appear after an interval of 2.
(8,10,12,14) forms another such pattern with an interval of 2. So, as you can see, the interval of repetition can be anything.
Also, for a particular set, I want its maximal one. I don't want (8,10,12) as the output (although it satisfies the minimum threshold of 3 and the constant difference); I only want the one of maximal length, i.e. (8,10,12,14).
Similarly, for (1,2,3,4,5), I don't want (1,2,3) or (2,3,4,5) as the output, only the one of maximal length, i.e. (1,2,3,4,5).
How can I do this in R?
Edit: That is, I want any set which forms a basic AP series with any difference; however, the size of the series should meet the minimum threshold of 3 and it should be maximal.
Edit2: I have tried using rle and acf in R, but that doesn't entirely solve my problem.
Edit3: When I used acf, it basically gave me the maximum peak difference that I could have used. However, I want all the possible differences. Also, rle is quite different: it gave me the longest continuous sequence of identical numbers, which does not occur in my case.
If you are looking for sequences of consecutive numbers, then cgwtools::seqle will find them for you in the same way rle finds a sequence of repeated values.
In the general case of basically any subset of your data which form such a sequence, such as the 8,10,12,14 case you cite, your criteria are so general as to be very difficult to satisfy. You'd have to start at each element of your series and do a forward-looking search for x[j] +1, x[j]+2, x[j]+3 ... ad infinitum. This suggests using some tree-based algorithms.
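To make the forward-looking search concrete, here is a brute-force sketch. It is not a tree-based algorithm, and its cost grows quickly with the length of the input. It assumes integer-valued data (exact matching with %in% is fragile for non-integers) and treats a progression as maximal when it cannot be extended in either direction with the same step. The function name maximal_aps is hypothetical.

```r
# Hypothetical helper: all maximal arithmetic progressions with at least
# `min_len` members among the distinct values of x (integer-valued input assumed).
maximal_aps <- function(x, min_len = 3) {
  x <- sort(unique(x))
  n <- length(x)
  out <- list()
  if (n < 2) return(out)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      d <- x[j] - x[i]                    # candidate step, always positive
      if ((x[i] - d) %in% x) next         # extendable backwards, so x[i] is not a start
      len <- 2
      while ((x[i] + len * d) %in% x) {   # extend forwards as far as possible
        len <- len + 1
      }
      if (len >= min_len) {
        out[[length(out) + 1]] <- seq(x[i], by = d, length.out = len)
      }
    }
  }
  out
}

x <- c(1, 2, 3, 4, 5, 8, 9, 10, 12, 14, 15, 17, 18, 19)
res <- maximal_aps(x)
# res includes (1,2,3,4,5), (1,3,5) and (8,10,12,14), but not (8,10,12) or (1,2,3)
```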
Here's a potential solution - albeit a very ugly, sloppy one:
##
arithSeq <- function(x=nSeq, minSize=4){
  ##
  dx <- diff(x,lag=1)
  Runs <- rle(dx)
  ##
  rLens <- Runs[[1]]
  rVals <- Runs[[2]]
  pStart <- c(
    rep(1,rLens[1]),
    rep(cumsum(1+rLens[-length(rLens)]),times=rLens[-1])
  )
  pEnd <- pStart + c(
    rep(rLens[1]-1, rLens[1]),
    rep(rLens[-1],times=rLens[-1])
  )
  pGrp <- rep(1:length(rLens),times=rLens)
  pLen <- rep(rLens, times=rLens)
  dAll <- data.frame(
    pStart=pStart,
    pEnd=pEnd,
    pGrp=pGrp,
    pLen=pLen,
    runVal=rep(rVals,rLens)
  )
  ##
  dSub <- subset(dAll, pLen >= minSize - 1)
  ##
  uVals <- unique(dSub$runVal)
  ##
  maxSub <- subset(dSub, runVal==uVals[1])
  maxLen <- max(maxSub$pLen)
  maxSub <- subset(maxSub, pLen==maxLen)
  ##
  if(length(uVals) > 1){
    for(i in 2:length(uVals)){
      iSub <- subset(dSub, runVal==uVals[i])
      iMaxLen <- max(iSub$pLen)
      iSub <- subset(iSub, pLen==iMaxLen)
      maxSub <- rbind(
        maxSub,
        iSub)
    }
    ##
  }
  ##
  deDup <- maxSub[!duplicated(maxSub),]
  seqStarts <- as.numeric(rownames(deDup))
  outList <- vector("list", nrow(deDup))
  for(i in 1:nrow(deDup)){
    outList[[i]] <- list(
      Sequence = x[seqStarts[i]:(seqStarts[i]+deDup[i,"pLen"])],
      Length=deDup[i,"pLen"]+1,
      StartPosition=seqStarts[i],
      EndPosition=seqStarts[i]+deDup[i,"pLen"])
  }
  ##
  return(outList)
  ##
}
##
There are things that can definitely be improved in this function. For instance, I made a mistake somewhere in the calculation of pStart and pEnd, the start and end indices of a given arithmetic sequence; it just so happened that the true start positions of such sequences are given by the row names of one of the intermediate data.frames, so I used those as a hacky workaround. Anyway, the function accepts a numeric vector x and a minimum length parameter, minSize, and returns a list containing information about the sequences meeting the criteria you outlined above.
set.seed(1234)
lSeq <- sample(1:25,100000,replace=TRUE)
nSeq <- c(1:10,12,33,13:17,16:26)
##
> arithSeq(nSeq)
[[1]]
[[1]]$Sequence
[1] 16 17 18 19 20 21 22 23 24 25 26
[[1]]$Length
[1] 11
[[1]]$StartPosition
[1] 18
[[1]]$EndPosition
[1] 28
##
> arithSeq(x=lSeq,minSize=5)
[[1]]
[[1]]$Sequence
[1] 13 16 19 22 25
[[1]]$Length
[1] 5
[[1]]$StartPosition
[1] 12760
[[1]]$EndPosition
[1] 12764
[[2]]
[[2]]$Sequence
[1] 11 13 15 17 19
[[2]]$Length
[1] 5
[[2]]$StartPosition
[1] 37988
[[2]]$EndPosition
[1] 37992
Like I said, it's sloppy and inelegant, but it should get you started.
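As a possible simplification, the start and end positions can be computed directly from the rle() output instead of being recovered from row names. This is only a sketch. Unlike arithSeq, it returns every contiguous run that is long enough, not just the longest run per step, and the names runs, starts and ends are made up for illustration.

```r
x <- c(1:10, 12, 33, 13:17, 16:26)
min_size <- 4

runs <- rle(diff(x))                     # runs of equal consecutive differences
ends <- cumsum(runs$lengths)             # index of each run's last difference
starts <- ends - runs$lengths + 1        # index of each run's first difference

# k equal differences in a row span k + 1 elements of x
keep <- runs$lengths >= min_size - 1
found <- data.frame(
  start  = starts[keep],                 # 1 13 18
  end    = ends[keep] + 1,               # 10 17 28
  step   = runs$values[keep],            # 1 1 1
  length = runs$lengths[keep] + 1        # 10 5 11
)
found

x[found$start[3]:found$end[3]]           # 16 17 18 19 20 21 22 23 24 25 26
```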

Sum object in a column between an interval defined by another dataframe

I am trying to obtain the sum of values of a column (B) based on the interval between two values on another column (A) in a "reference" dataframe (df):
A <- seq(1:10)
B <- c(4,3,5,7,5,7,4,7,3,7)
df <- data.frame(A,B)
I have found two ways of doing this:
y <- sum(subset(df, A < 3 & A >= 1, select = "B"))
> y
[1] 7
and
z <- with(df,sum(df[A<3 & A>=1,"B"]))
> z
[1] 7
However, I would like to do this based on two vectors of values stored in another dataframe
C <- c(3,7,7)
D <- c(1,1,5)
df2 <- data.frame(C,D)
to obtain a column of y values for each pair of C and D values.
I have created a function:
myfn <- function(c,d) {
  y <- sum(subset(df, A < c & A >= d, select = "B"))
  return(y)
}
Which works fine with numbers
myfn(3,1)
[1] 7
but not with vectors.
myfn(c=C,d=D)
[1] 19
Warning messages:
1: In A < a :
longer object length is not a multiple of shorter object length
2: In A >= b :
longer object length is not a multiple of shorter object length
> myfn(df2$C,df2$D)
[1] 19
Warning messages:
1: In A < a :
longer object length is not a multiple of shorter object length
2: In A >= b :
longer object length is not a multiple of shorter object length
>
Does anyone have any suggestion about how I could calculate such interval for sequence of values?
Try:
mapply(myfn, C, D)
# [1] 7 31 12
The problem is that your function is not naturally vectorized. You can see that because its return value comes from sum(), which collapses its whole input into a single number rather than returning one value per element.
Beyond that, if you look at myfn, the expression A < c & A >= d doesn't make sense when c and d have more than one value. There, you are comparing each value in df to the corresponding value in your C and D vectors (so first value to first, second to second, etc.), instead of comparing all the values in df to each value in C and D in turn.
By using mapply, I'm basically looping through your function with as arguments a single value from C and D at a time.
Fortunately, in your case the length of the column in df is not a multiple of the length of C and D, so you actually got a warning. If it were a multiple (for example, if they were the same length), the values would have been recycled silently: no warning, and a single value as the answer instead of the three you are presumably looking for.
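The recycling behaviour is easy to check in isolation. In this sketch, tryCatch() is used only to record whether a warning was raised; the variable names are made up.

```r
A <- 1:10

# Length 3 does not divide 10: R recycles the shorter vector and warns.
warned <- tryCatch({ A < c(3, 7, 7); FALSE }, warning = function(w) TRUE)
warned   # TRUE

# Length 2 divides 10: the same element-by-element recycling happens silently.
silent <- tryCatch({ A < c(3, 7); TRUE }, warning = function(w) FALSE)
silent   # TRUE
```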
There are better ways to do this, but the mapply approach is pretty trivial here and works with your code pretty much as is.
Another way...
is.between <- function(x,vec){
  return(x>=min(vec) & x<max(vec))
}
apply(df2,1,function(x){sum(df[is.between(df$A,x),]$B)})
# [1] 7 31 12
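Since the question asks for a column of y values, the result can be stored directly in df2. A self-contained sketch follows. The helper name interval_sum and its argument names are hypothetical; they are chosen to avoid an argument called c, which is easy to confuse with the base function c().

```r
df  <- data.frame(A = 1:10, B = c(4, 3, 5, 7, 5, 7, 4, 7, 3, 7))
df2 <- data.frame(C = c(3, 7, 7), D = c(1, 1, 5))

# Hypothetical helper: sum of B over rows with lower <= A < upper,
# written for a single (upper, lower) pair.
interval_sum <- function(upper, lower) {
  sum(df$B[df$A < upper & df$A >= lower])
}

# mapply() calls the helper once per row of df2, pairing C[i] with D[i].
df2$y <- mapply(interval_sum, df2$C, df2$D)
df2$y   # 7 31 12
```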
