Find values that are between list of numbers - r

I have two lists of numbers like below.
x <- c(1, 5, 10, 17, 21, 30)
y <- c(2, 7, 19)
In my dataset, x divides 1 to 30 into segments (1-5, 5-10, 10-17, 17-21, 21-30). Would it be possible to match the numbers in y to these segments? (In this case, I'd want to get c(1,5,17) as an output because 2 is between 1 and 5, 7 is between 5 and 10, and 19 is between 17 and 21.)

?findInterval to the rescue:
x[findInterval(y,x)]
#[1] 1 5 17
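As a rough sketch of how this behaves at the edges: findInterval() returns, for each value, the index of the largest break that is less than or equal to it. The `edge` values below are made up for illustration and are not part of the question.

```r
x <- c(1, 5, 10, 17, 21, 30)
y <- c(2, 7, 19)

# Index of the largest break <= each y, then the break itself
idx <- findInterval(y, x)
matched <- x[idx]

# Illustrative edge cases (hypothetical values, not from the question)
edge <- c(0, 5, 30, 31)
edge_idx <- findInterval(edge, x)

# A value equal to a break belongs to the segment starting at that break.
# A value below min(x) gets index 0, which x[...] would silently drop,
# so replace 0 with NA to keep one result per input.
edge_idx[edge_idx == 0] <- NA
edge_matched <- x[edge_idx]
```

Whether values below the first break should become NA or raise an error depends on your data; the NA replacement here is just one option.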

Using cut is another option (note that the result is a factor, not a numeric vector)
cut(y, breaks = x, labels = x[-length(x)])
#[1] 1 5 17
It could also be done with labels = FALSE, which returns integer interval codes that can be used to index x
x[cut(y, breaks = x, labels = FALSE)]
#[1] 1 5 17
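One thing worth checking, assuming values in y can fall exactly on a break: by default cut() uses intervals that are closed on the right, while findInterval() uses intervals closed on the left, so the two approaches can disagree for such values. A small sketch (the variable names are only for illustration):

```r
x <- c(1, 5, 10, 17, 21, 30)
on_break <- 5  # a value sitting exactly on a break

# Default cut(): intervals are (1,5], (5,10], ... so 5 lands in the first
cut_default <- cut(on_break, breaks = x, labels = FALSE)

# right = FALSE: intervals are [1,5), [5,10), ... so 5 lands in the second
cut_left <- cut(on_break, breaks = x, labels = FALSE, right = FALSE)

# findInterval() also treats intervals as closed on the left
fi <- findInterval(on_break, x)
```

If you want cut() to agree with findInterval() on break values, right = FALSE is probably what you need.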

You can do this with sapply and a simple function
sapply(y, function(a) x[max(which(x<a))])
[1] 1 5 17

Related

Examine if a value is in an interval using R

Having the following vector:
t <- c(2, 6, 8, 20, 22, 30, 40, 45, 60)
I would like to find the values that fall between the following intervals:
g <- list(c(1,20), c(20, 40))
The desired output is:
1, 20 c(2, 6, 8)
20, 40 c(20, 22, 30)
Using the dplyr library, I do the following:
library(dplyr)
for (i in t) {
  for (h in g) {
    if (between(i, h[[1]], h[[2]])) {
      print(c(i, h[[1]], h[[2]]))
    }
  }
}
Is there a better way of doing this in R?
We can loop over the list 'g' and, for each interval, subset 't' with a logical vector built from >= on the lower bound and < on the upper bound
lapply(g, function(x) t[t >= x[1] & t < x[2]])
-output
[[1]]
[1] 2 6 8
[[2]]
[1] 20 22 30
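If you also want the interval shown next to each result, as in the desired output, one possible sketch is to name the list elements after the interval bounds. The naming scheme here is my own choice, not something the question requires.

```r
t <- c(2, 6, 8, 20, 22, 30, 40, 45, 60)
g <- list(c(1, 20), c(20, 40))

# Same half-open subsetting as above: lower bound inclusive, upper exclusive
res <- lapply(g, function(x) t[t >= x[1] & t < x[2]])

# Label each element with its interval, e.g. "1, 20"
names(res) <- sapply(g, paste, collapse = ", ")
```

After this, res[["1, 20"]] gives the values in the first interval.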
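Using purrr and dplyr (note that between() is inclusive at both ends, so the boundary values 20 and 40 are kept):
library(purrr)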
library(dplyr)
map(g,~keep(t,between(t,.[1],.[2])))
[[1]]
[1] 2 6 8 20
[[2]]
[1] 20 22 30 40
You may find findInterval() from base R useful:
g <- c(1, 20, 40)
t <- c(2, 6, 8, 20, 22, 30, 40, 45, 60)
findInterval(t, g)
#> [1] 1 1 1 2 2 2 3 3 3
So t[1], t[2] and t[3] are in the first interval, t[4], t[5] and t[6] in the second, and t[7], t[8] and t[9] in the third (meaning that these values are greater than or equal to the right end point of the second interval).
If you had values lower than 1, they would be labelled 0:
t2 <- c(-1, 0, 2, 6, 8, 20, 22, 30, 40, 45, 60)
findInterval(t2, g)
#> [1] 0 0 1 1 1 2 2 2 3 3 3
You can save the result of findInterval() as e.g. y and use which(y==1) to find which entries correspond to the first interval.
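A minimal sketch of that last step, plus split() to get all groups at once (the names `first` and `groups` are just for illustration):

```r
g <- c(1, 20, 40)
t <- c(2, 6, 8, 20, 22, 30, 40, 45, 60)

# Interval index for every element of t
y <- findInterval(t, g)

# Entries that fall in the first interval
first <- t[which(y == 1)]

# All entries grouped by interval index; the list names are "1", "2", "3"
groups <- split(t, y)
```

Group "3" holds everything at or beyond the last break, so drop it if you only care about the two intervals in the question.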
We can try cut + is.na like below
lapply(
  g,
  function(x) {
    t[!is.na(cut(t, x, include.lowest = TRUE))]
  }
)
which gives
[[1]]
[1] 2 6 8 20
[[2]]
[1] 20 22 30 40

How to condense non-sequential integers?

I'm trying to condense non-sequential numbers to subset haplotype data. I could do it manually, but given that I've got hundreds to do, I'd rather not if there's an alternative.
class(haplotype1[[1]])
#[1] "integer"
haplotype1[[1]]
#[1] 1 2 3 4 5 7 8 9 10 11
I want to get [1:5, 7:11], which seems simple, but I haven't found a solution exactly matching my problem
Thanks!
Using cumsum to create the sequential groups (here x is the integer vector, e.g. x <- haplotype1[[1]]),
tapply(x, cumsum(c(TRUE, diff(x) != 1)), FUN = function(i)paste(i[1], i[length(i)], sep = ':'))
# 1 2
#"1:5" "7:11"
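To see why the cumsum trick works, here is the grouping broken into steps. I'm assuming x holds the same integers as haplotype1[[1]] in the question; the intermediate names are only for illustration.

```r
x <- c(1L, 2L, 3L, 4L, 5L, 7L, 8L, 9L, 10L, 11L)

# TRUE wherever the step to the next value is not exactly 1 (a run ends)
gaps <- diff(x) != 1

# The first element always starts a run, so prepend TRUE
new_run <- c(TRUE, gaps)

# Counting the run starts gives every element the id of its run
run_id <- cumsum(new_run)
```

Here run_id is 1 for the values 1 to 5 and 2 for the values 7 to 11, which is the grouping that tapply() then summarises.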
It's unclear what type of object you want to create. I would just store the start and end values.
x <- c(1, 2, 3, 4, 5, 7, 8, 9, 10, 11)
starts <- x[!c(FALSE, diff(x) == 1L)]
#[1] 1 7
ends <- x[!c(diff(x) == 1L, FALSE)]
#[1] 5 11
paste(starts, ends, sep = ":")
#[1] "1:5" "7:11"
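If the goal is to use the ranges for subsetting rather than printing them, the start and end values can be turned back into index vectors. This is only a sketch under that assumption:

```r
x <- c(1, 2, 3, 4, 5, 7, 8, 9, 10, 11)
starts <- x[!c(FALSE, diff(x) == 1)]
ends   <- x[!c(diff(x) == 1, FALSE)]

# One sequence per (start, end) pair
ranges <- Map(seq, starts, ends)

# Flattened back into a single index vector
idx <- unlist(ranges)
```

Each element of ranges can be used on its own, for example to subset one block at a time.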
Maybe you want something like this ?
vec <- c(1, 2, 3, 4, 5, 7, 8, 9, 10, 11)
split(vec, cumsum(c(1,diff(vec)>1)))
# $`1`
# [1] 1 2 3 4 5
#
# $`2`
# [1] 7 8 9 10 11

Consecutive Sum of a Vector

This is a question following a previous one. In that question, it is suggested to use rollapply to calculate the sum of the 1st, 2nd and 3rd entries of a vector; then the 2nd, 3rd and 4th; and so on.
My question is how to calculate the sum of the 1st, 2nd and 3rd; then the 4th, 5th and 6th. That is, rolling without overlapping. Can this be easily done, please?
Same idea. You just need to specify the by argument. Default is 1.
x <-c(1, 5, 4, 5, 7, 8, 9, 2, 1)
zoo::rollapply(x, 3, by = 3, sum)
#[1] 10 20 12
#or another Base R option
sapply(split(x, ceiling(seq_along(x)/3)), sum)
# 1 2 3
#10 20 12
Using tapply in base R:
set.seed(1)
vec <- sample(10, 20, replace = TRUE)
#[1] 3 4 6 10 3 9 10 7 7 1 3 2 7 4 8 5 8 10 4 8
unname(tapply(vec, (seq_along(vec)-1) %/% 3, sum))
# [1] 13 22 24 6 19 23 12
Alternatively,
colSums(matrix(vec[1:(ceiling(length(vec)/3)*3)], nrow = 3), na.rm = TRUE)
#[1] 13 22 24 6 19 23 12
vec[1:(ceiling(length(vec)/3)*3)] fills in the vector with NA if the length is not divisible by 3. Then, you simply ignore NAs in colSums.
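A small sketch of the padding step on its own, using a made-up vector whose length (8) is not a multiple of 3:

```r
v <- c(1, 5, 4, 5, 7, 8, 9, 2)

# Round the length up to the next multiple of 3
padded_len <- ceiling(length(v) / 3) * 3

# Indexing past the end of a vector yields NA for the missing positions
padded <- v[1:padded_len]

# One block of three per column, then sum each column ignoring the NA
m <- matrix(padded, nrow = 3)
block_sums <- colSums(m, na.rm = TRUE)
```

The last block here only contains 9 and 2, so its sum is a partial one.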
Yet another one using cut and aggregate:
x <- ceiling(length(vec)/3)*3
df <- data.frame(vec=vec[1:x], col=cut(1:x, breaks = seq(0,x,3)))
aggregate(vec~col, df, sum, na.rm = TRUE)[[2]]
#[1] 13 22 24 6 19 23 12
We can use roll_sum from RcppRoll which would be very efficient
library(RcppRoll)
roll_sum(x, n=3)[c(TRUE, FALSE, FALSE)]
#[1] 10 20 12
data
x <-c(1, 5, 4, 5, 7, 8, 9, 2, 1)
you can define the window size, and do:
x <-c(1, 5, 4, 5, 7, 8, 9, 2, 1)
n <- 3
diff(c(0, cumsum(x)[slice.index(x, 1)%%n == 0]))
p.s. using the input from the answer by @Sotos
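The same idea step by step, with seq_along() swapped in for slice.index() (my substitution, which should be equivalent for a plain vector):

```r
x <- c(1, 5, 4, 5, 7, 8, 9, 2, 1)
n <- 3

# Running total of x
totals <- cumsum(x)

# TRUE at positions 3, 6, 9, ... where a block of n elements ends
block_ends <- seq_along(x) %% n == 0

# Differences between consecutive block-end totals are the block sums
block_sums <- diff(c(0, totals[block_ends]))
```

Note that a trailing incomplete block is dropped by this approach, since it never reaches a block-end position.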

Sorting by successive vectors in R [duplicate]

I have a vector x, that I would like to sort based on the order of values in vector y. The two vectors are not of the same length.
x <- c(2, 2, 3, 4, 1, 4, 4, 3, 3)
y <- c(4, 2, 1, 3)
The expected result would be:
[1] 4 4 4 2 2 1 3 3 3
What about this one:
x[order(match(x,y))]
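Broken into steps, in case it helps to see what each call contributes (the intermediate names are only for illustration):

```r
x <- c(2, 2, 3, 4, 1, 4, 4, 3, 3)
y <- c(4, 2, 1, 3)

# Position of each x value within y: 4 maps to 1, 2 to 2, 1 to 3, 3 to 4
pos <- match(x, y)

# Permutation that sorts those positions (ties keep their original order)
ord <- order(pos)

# x rearranged to follow the order of y
sorted <- x[ord]
```

Values of x that do not occur in y get NA from match(), and order() puts those last by default.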
You could convert x into an ordered factor:
x.factor <- factor(x, levels = y, ordered=TRUE)
sort(x)
sort(x.factor)
Obviously, changing your numbers into factors can radically change the way code downstream reacts to x. But since you didn't give us any context about what happens next, I thought I would suggest this as an option.
How about:
rep(y,table(x)[as.character(y)])
(Ian's is probably still better)
If you need to order by "y" regardless of whether the values are numbers or characters:
x[order(ordered(x, levels = y))]
[1] 4 4 4 2 2 1 3 3 3
By steps:
a <- ordered(x, levels = y) # Create an ordered factor from "x" with levels in the order of "y".
[1] 2 2 3 4 1 4 4 3 3
Levels: 4 < 2 < 1 < 3
b <- order(a) # Get the permutation that sorts "x" by that level order.
[1] 4 6 7 1 2 5 3 8 9
x[b] # Reorder "x" according to the order in "y".
[1] 4 4 4 2 2 1 3 3 3
[Edit: Clearly Ian has the right approach, but I will leave this in for posterity.]
You can do this without loops by indexing on your y vector. Add an incrementing numeric value to y and merge them:
y <- data.frame(index=1:length(y), x=y)
x <- data.frame(x=x)
x <- merge(x,y)
x <- x[order(x$index),"x"]
x
[1] 4 4 4 2 2 1 3 3 3
x <- c(2, 2, 3, 4, 1, 4, 4, 3, 3)
y <- c(4, 2, 1, 3)
z <- c() # start with an empty result vector, otherwise c(z, ...) fails on the first iteration
for(i in y) { z <- c(z, rep(i, sum(x==i))) }
The result in z: 4 4 4 2 2 1 3 3 3
The important steps:
for(i in y) -- Loops over the elements of interest.
z <- c(z, ...) -- Concatenates each subexpression in turn
rep(i, sum(x==i)) -- Repeats i (the current element of interest) sum(x==i) times (the number of times we found i in x).
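The same steps can be sketched without growing z inside a loop, by counting first and repeating once. The `counts` name is mine, and this assumes every value of x occurs in y:

```r
x <- c(2, 2, 3, 4, 1, 4, 4, 3, 3)
y <- c(4, 2, 1, 3)

# How many times each element of y occurs in x, in the order of y
counts <- vapply(y, function(i) sum(x == i), integer(1))

# Repeat each element of y by its count
z <- rep(y, times = counts)
```

Like the loop, this drops any value of x that does not appear in y.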
You can also use sqldf and do it with a SQL join, like the following:
library(sqldf)
x <- data.frame(x = c(2, 2, 3, 4, 1, 4, 4, 3, 3))
y <- data.frame(y = c(4, 2, 1, 3))
result <- sqldf("SELECT x.x FROM y JOIN x on y.y = x.x")
ordered_x <- result[[1]]

Map numbers to smallest in a vector of numbers in R

Given a vector of numbers, I'd like to map each to the smallest in a separate vector that the number does not exceed. For example:
# Given these
v1 <- 1:10
v2 <- c(2, 5, 11)
# I'd like to return
result <- c(2, 2, 5, 5, 5, 11, 11, 11, 11, 11)
Try
cut(v1, c(0, v2), labels = v2)
[1] 2 2 5 5 5 11 11 11 11 11
Levels: 2 5 11
which can be converted to a numeric vector using as.numeric(as.character(...)).
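A short sketch of why the as.character() step matters: calling as.numeric() directly on the factor returns the internal level codes, not the labels.

```r
v1 <- 1:10
v2 <- c(2, 5, 11)

f <- cut(v1, c(0, v2), labels = v2)

# Internal level codes (1, 2, 3), probably not what you want
codes <- as.numeric(f)

# The labels converted back to numbers
values <- as.numeric(as.character(f))
```

Here codes runs over 1, 2 and 3, while values contains the actual elements of v2.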
Another way (Thanks for the edit @Ananda)
v2[findInterval(v1, v2 + 1) + 1]
# [1] 2 2 5 5 5 11 11 11 11 11
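The v2 + 1 shift relies on the inputs being whole numbers. If v1 can contain non-integers, a possible alternative is the left.open argument of findInterval(), which makes the intervals open on the left and closed on the right. The v1 values below are made up to show the difference:

```r
v1 <- c(1, 2, 2.5, 5, 5.5, 11)  # illustrative input with non-integers
v2 <- c(2, 5, 11)

# Shifting the breaks by 1 misplaces 2.5 (it exceeds 2 but is mapped to 2)
shifted <- v2[findInterval(v1, v2 + 1) + 1]

# Intervals (-Inf, 2], (2, 5], (5, 11] handle non-integers as intended
left_open <- v2[findInterval(v1, v2, left.open = TRUE) + 1]
```

Values larger than max(v2) index past the end of v2 and come back as NA with either version.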

Resources