To simplify my question: I want to extract certain ranges from a numeric vector. For example, extracting 3 ranges from 1:20 at the same time:
1 < x < 5
8 < x < 12
17 < x < 20
Therefore, the expected output is 2, 3, 4, 9, 10, 11, 18, 19.
I tried to use the function findInterval() and its arguments rightmost.closed and left.open to do that, but no combination of arguments achieves the goal.
x <- 1:20
v <- c(1, 5, 8, 12, 17, 20)
x[findInterval(x, v) %% 2 == 1]
# [1] 1 2 3 4 8 9 10 11 17 18 19
x[findInterval(x, v, rightmost.closed = T) %% 2 == 1]
# [1] 1 2 3 4 8 9 10 11 17 18 19 20
x[findInterval(x, v, left.open = T) %% 2 == 1]
# [1] 2 3 4 5 9 10 11 12 18 19 20
By the way, the conditions can also be given as a matrix like this:
[,1] [,2]
[1,] 1 5
[2,] 8 12
[3,] 17 20
I don't want to use a for loop if it's not necessary.
I am grateful for any help.
I'd probably do it using purrr::map2 or Map, passing your lower bounds and upper bounds as arguments and filtering your dataset with a custom function:
library(purrr)
x <- 1:20
lower_bounds <- c(1, 8, 17)
upper_bounds <- c(5, 12, 20)
map2(
lower_bounds, upper_bounds, function(lower, upper) {
x[x > lower & x < upper]
}
)
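map2 returns a list with one element per range. If a single flat vector is wanted, as in the expected output of the question, a base R sketch with Map and unlist could look like this (same x and bounds as above; res is just an illustrative name):

```r
x <- 1:20
lower_bounds <- c(1, 8, 17)
upper_bounds <- c(5, 12, 20)

# Map() pairs each lower bound with its upper bound and returns a list;
# unlist() flattens that list into one vector
res <- unlist(Map(function(lower, upper) x[x > lower & x < upper],
                  lower_bounds, upper_bounds))
res
# [1]  2  3  4  9 10 11 18 19
```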
You may use data.table::inrange and its incbounds argument. Assuming ranges are in a matrix 'm', as shown in your question:
m <- matrix(v, ncol = 2, byrow = TRUE)
x[data.table::inrange(x, m[ , 1], m[ , 2], incbounds = FALSE)]
# [1] 2 3 4 9 10 11 18 19
You were on the right path, and left.open indeed helps, but rightmost.closed actually concerns only the last interval rather than the right "side" of each interval. Hence, we need to call findInterval twice, once with the defaults and once with left.open = TRUE, and keep the values that pass both tests. As you yourself figured out, it looks like an optimal way to do that is
x[findInterval(x, v) %% 2 == 1 & findInterval(x, v, left.open = TRUE) %% 2 == 1]
# [1] 2 3 4 9 10 11 18 19
Clearly there are alternatives. E.g.,
fun <- function(x, v)
if(length(v) > 1) v[1] < x & x < v[2] | fun(x, v[-1:-2]) else FALSE
x[fun(x, v)]
# [1] 2 3 4 9 10 11 18 19
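Another loop-free alternative, sketched here under the assumption that the ranges are stored in a two-column matrix m as in the question, is to compare every value against every bound with outer(). The names inside and res are only illustrative.

```r
x <- 1:20
m <- matrix(c(1, 5, 8, 12, 17, 20), ncol = 2, byrow = TRUE)

# 20 x 3 logical matrix: entry [i, j] is TRUE when x[i] lies strictly
# inside the j-th range (lower bound in column 1, upper bound in column 2)
inside <- outer(x, m[, 1], ">") & outer(x, m[, 2], "<")

# keep the values that fall inside at least one range
res <- x[rowSums(inside) > 0]
res
# [1]  2  3  4  9 10 11 18 19
```

This builds a length(x) by nrow(m) matrix, so it is best suited to a modest number of ranges.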
I found an easy way just with sapply() :
x <- 1:20
v <- c(1, 5, 8, 12, 17, 20)
(v.df <- as.data.frame(matrix(v, 3, 2, byrow = T)))
# V1 V2
# 1 1 5
# 2 8 12
# 3 17 20
y <- sapply(x, function(x){
ind <- (x > v.df$V1 & x < v.df$V2)
if(any(ind)) x else NA
})
y[!is.na(y)]
# [1] 2 3 4 9 10 11 18 19
I'm looking for a way to identify a growing season which consists of a number of days greater than say 60 between the last frost day of spring and the first frost day in the fall. A general version of this problem is this. If I have a vector of numbers like testVec, I want the item numbers of the beginning and end range of values where the number of items is 5 or greater and all of them are greater than 0.
testVec <- c(1,3,4,0, 1, -5, 6, 0, 1,3,4,6,7,5,9, 0)
In this example, the relevant range is 1,3,4,6,7,5,9 which is testVec[9] to testVec[15]
One option could be:
testVec[with(rle(testVec > 0), rep(lengths * values >= 5, lengths))]
[1] 1 3 4 6 7 5 9
Here, the idea is to, first, create runs of values that are smaller or equal to zero and bigger than zero. Second, it checks whether the runs of values bigger than zero are of length 5 or more. Finally, it subsets the original vector for the runs of values bigger than zero with length 5 or more.
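To make the three steps concrete, the intermediate objects can be inspected one at a time. This is only an unpacked sketch of the one-liner above; r and keep are illustrative names.

```r
testVec <- c(1, 3, 4, 0, 1, -5, 6, 0, 1, 3, 4, 6, 7, 5, 9, 0)

# Step 1: runs of "greater than zero" (TRUE) and "not greater than zero" (FALSE)
r <- rle(testVec > 0)
r$lengths
# [1] 3 1 1 1 1 1 7 1
r$values
# [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE

# Step 2: lengths * values zeroes out the FALSE runs, so only TRUE runs
# of length 5 or more pass the test; rep() expands back to one flag per element
keep <- rep(r$lengths * r$values >= 5, r$lengths)

# Step 3: positions (and hence first/last index) of the qualifying run
range(which(keep))
# [1]  9 15
```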
1) rleid This also handles any number of sequences including zero. rleid(ok) is a vector the same length as ok such that the first run of identical elements is replaced with 1, the second run with 2 and so on. The result is a list of vectors where each vector has its positions in the original input as its names.
library(data.table)
getSeq <- function(x) {
names(x) <- seq_along(x)
ok <- x > 0
s <- split(x[ok], rleid(ok)[ok])
unname(s)[lengths(s) >= 5]
}
getSeq(testVec)
## [[1]]
## 9 10 11 12 13 14 15
## 1 3 4 6 7 5 9
getSeq(numeric(16))
## list()
getSeq(c(testVec, 10 * testVec))
## [[1]]
## 9 10 11 12 13 14 15
## 1 3 4 6 7 5 9
##
## [[2]]
## 25 26 27 28 29 30 31
## 10 30 40 60 70 50 90
If a data frame were desired then following gives the values and which sequence the row came from. The row names indicate the positions in the original input.
gs <- getSeq(c(testVec, 10 * testVec))
names(gs) <- seq_along(gs)
if (length(gs)) stack(gs) else gs
## values ind
## 9 1 1
## 10 3 1
## 11 4 1
## 12 6 1
## 13 7 1
## 14 5 1
## 15 9 1
## 25 10 2
## 26 30 2
## 27 40 2
## 28 60 2
## 29 70 2
## 30 50 2
## 31 90 2
2) gregexpr Replace each element that is > 0 with 1 and each other element with 0 pasting the 0's and 1's into a single character string. Then use gregexpr to look for sequences of 1's at least 5 long and for the ith such nonoverlapping sequence return the first positions, g, and lengths, attr(g, "match.length"). Define a function vals which extracts the values at the required positions from testVec of the ith such nonoverlapping sequence returning a list such that the ith component of the list is the ith such sequence. The names in the output vector are its positions in the input.
getSeq2 <- function(x) {
g <- gregexpr("1{5,}", paste(+(x > 0), collapse = ""))[[1]]
vals <- function(i) {
ix <- seq(g[i], length = attr(g, "match.length")[i])
setNames(x[ix], ix)
}
if (length(g) == 1 && g == -1) list() else lapply(seq_along(g), vals)
}
getSeq2(testVec)
## [[1]]
## 9 10 11 12 13 14 15
## 1 3 4 6 7 5 9
The above handles any number of sequences including 0 but if we knew there were exactly one sequence (which is the case for the example in the question) then it could be simplified to the following where the return value is just that vector:
g <- gregexpr("1{5,}", paste(+(testVec > 0), collapse = ""))[[1]]
ix <- seq(g, length = attr(g, "match.length"))
setNames(testVec[ix], ix)
## 9 10 11 12 13 14 15
## 1 3 4 6 7 5 9
You could "fix" @tmfmnk's solution like this:
f1 <- function(x, threshold, n) {
range(which(with(rle(x > threshold), rep(lengths * values >= n, lengths))))
}
x <- c(1, 3, 4, 0, 1, -5, 6, 0, 1,3,4,6,7,5,9, 0)
f1(x, 0, 5)
#[1] 9 15
But that does not work well when there are multiple runs
xx <- c(x, x)
f1(xx, 0, 5)
#[1] 9 31
Here is another, not so concise approach that returns the start and end of the longest run (the first one if there are ties).
f2 <- function(x, threshold, n) {
y <- x > threshold
y[is.na(y)] <- FALSE
a <- ave(y, cumsum(!y), FUN=cumsum)
m <- max(a)
if (m < n) return (c(NA, NA))
i <- which(a == m)[1]
c(i-m+1, i)
}
f2(x, 0, 5)
#[1] 9 15
f2(xx, 0, 5)
#[1] 9 15
or with rle
f3 <- function(x, threshold, n) {
y <- x > threshold
r <- rle(y)
m <- max(r$lengths)
if (m < n) return (c(NA, NA))
i <- sum(r$lengths[1:which.max(r$lengths)[1]])
c(i-max(r$lengths)+1, i)
}
f3(x, 0, 5)
#[1] 9 15
f3(xx, 0, 5)
#[1] 9 15
If you wanted the first run that is at least n, that is you do not want a next run, even if it is longer, you could do
f4 <- function(x, threshold, n) {
y <- with(rle(x > threshold), rep(lengths * values >= n, lengths))
i <- which(y)[1]
j <- i + which(!y[-c(1:i)])[1] - 1
c(i, j)
}
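If every qualifying run is wanted rather than just one, the start and end of each run can be read off the cumulative run lengths. The following is a sketch, not part of the answers above; run_bounds is a hypothetical helper name, and treating NA as "not above the threshold" is an assumption.

```r
run_bounds <- function(x, threshold, n) {
  # NA values are treated as not exceeding the threshold (assumption)
  r <- rle(!is.na(x) & x > threshold)
  ends <- cumsum(r$lengths)          # last index of every run
  starts <- ends - r$lengths + 1     # first index of every run
  keep <- r$values & r$lengths >= n  # TRUE runs that are long enough
  cbind(start = starts[keep], end = ends[keep])
}

x <- c(1, 3, 4, 0, 1, -5, 6, 0, 1, 3, 4, 6, 7, 5, 9, 0)
run_bounds(x, 0, 5)
#      start end
# [1,]     9  15
run_bounds(c(x, x), 0, 5)
#      start end
# [1,]     9  15
# [2,]    25  31
```

When no run qualifies, the result is a matrix with zero rows, which can be tested with nrow().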
I am trying to make a piece-wise function. This is a really basic one. I want y to be a list of values (preferably not just a list of integers but a list of real numbers like (1.34, 20.92) in the future).
How might I make a piece-wise function?
y <- 1:10
if (y < 2){
print("CAN'T COMPUTE")
} else if (y >= 2 & y < 6){
print(y^2)
} else {
print(y * 2)
}
Let me give it a try:
library("dplyr")
y <- 1:10
y %>%
as_tibble() %>%
mutate(res = case_when(y < 2 ~ "CAN'T COMPUTE",
y >= 2 & y < 6 ~ as.character(y^2),
TRUE ~ as.character(y*2)))
Here are the results:
# A tibble: 10 x 2
value res
<int> <chr>
1 1 CAN'T COMPUTE
2 2 4
3 3 9
4 4 16
5 5 25
6 6 12
7 7 14
8 8 16
9 9 18
10 10 20
Here are some base R approaches. We have used NA instead of a character string in order to produce a numeric vector result. The first uses a nested ifelse. The second uses a single ifelse to select between NA and the other values and computes the other values using a formula. The third computes which leg of the result is wanted (1, 2 or 3) and then uses switch to select that leg. The fourth is a variation of the third that uses findInterval to compute the leg number.
ifelse(y < 2, NA, ifelse(y < 6, y^2, 2*y))
## [1] NA 4 9 16 25 12 14 16 18 20
ifelse(y < 2, NA, (y < 6) * y^2 + (y >= 6) * 2*y)
## [1] NA 4 9 16 25 12 14 16 18 20
mapply(switch, 1 + (y >= 2) + (y >= 6), NA, y^2, 2*y)
## [1] NA 4 9 16 25 12 14 16 18 20
mapply(switch, findInterval(y, c(-Inf, 2, 6, Inf), left.open = FALSE), NA, y^2, 2*y)
## [1] NA 4 9 16 25 12 14 16 18 20
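Since the question mentions real-valued input such as 1.34 or 20.92, here is one more sketch that wraps the same three legs in a function using logical indexing. The name piecewise is hypothetical, and NA again stands in for "CAN'T COMPUTE".

```r
piecewise <- function(y) {
  out <- rep(NA_real_, length(y))  # leg 1 (y < 2) stays NA
  mid <- y >= 2 & y < 6            # leg 2: square
  high <- y >= 6                   # leg 3: double
  out[mid] <- y[mid]^2
  out[high] <- 2 * y[high]
  out
}

piecewise(c(1.34, 2, 5.5, 20.92))
# [1]    NA  4.00 30.25 41.84
```

Each leg is evaluated only on the elements that belong to it, which avoids warnings when one of the formulas is not defined for the other legs.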
The input vector is as below,
data=c(1,1,1,1,11,1,1,1,1,12,1,1,2,1,1,1)
I want the output to be 1,1,1,1,11,11,11,11,11,12,12,12,2,2,2,2, where the 1's following a non-1 value are replaced with that non-1 value, in R.
I tried the following code
data=c(1,1,1,1,11,1,1,1,1,12,1,1,2,1,1,1)
sapply(data, function(x) ifelse (lag(x)!=1,lag(x),x))
but it didn't yield the expected output.
You can convert every 1 after the first non-1 value to NA then use zoo::na.locf():
library(zoo)
x <- c(1,1,1,1,11,1,1,1,1,12,1,1,2,1,1,1)
x[seq_along(x) > which.max(x != 1) & x == 1] <- NA
na.locf(x)
[1] 1 1 1 1 11 11 11 11 11 12 12 12 2 2 2 2
Or using replace() to add the NA values:
na.locf(replace(x, seq_along(x) > which.max(x != 1) & x == 1, NA))
In response to your comment about applying it to groups, you can use ave():
df <- data.frame(x = c(x, rev(x)), grp = rep(1:2, each = length(x)))
ave(df$x, df$grp, FUN = function(y)
na.locf(replace(y, seq_along(y) > which.max(y != 1) & y == 1, NA))
)
You can write your custom fill function:
x <- c(1,1,1,1,11,1,1,1,1,12,1,1,2,1,1,1)
myfill <- function(x) {
mem <- x[1]
for (i in seq_along(x)) {
if (x[i] == 1) {
x[i] <- mem
} else {
mem <- x[i]
}
}
x
}
myfill(x)
# 1 1 1 1 11 11 11 11 11 12 12 12 2 2 2 2
You could match unique 1 and non-1 values with the cumsum of non-1 values.
(c(1, x[x != 1]))[match(cumsum(x != 1), 0:3)]
# [1] 1 1 1 1 11 11 11 11 11 12 12 12 2 2 2 2
Data
x <- c(1, 1, 1, 1, 11, 1, 1, 1, 1, 12, 1, 1, 2, 1, 1, 1)
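The 0:3 in the match() call is specific to this vector, which has three non-1 values. A sketch that avoids the hard-coded range is to use the running count of non-1 values directly as an index (fill_values and filled are illustrative names):

```r
x <- c(1, 1, 1, 1, 11, 1, 1, 1, 1, 12, 1, 1, 2, 1, 1, 1)

# candidate fill values: 1 for the leading stretch, then each non-1 value in order
fill_values <- c(1, x[x != 1])

# cumsum(x != 1) counts how many non-1 values have been seen so far (0, 1, 2, ...);
# adding 1 turns that count into a position in fill_values
filled <- fill_values[cumsum(x != 1) + 1]
filled
# [1]  1  1  1  1 11 11 11 11 11 12 12 12  2  2  2  2
```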
You can use rle from base to overwrite 1 with the value before.
x <- rle(data)
y <- c(FALSE, (x$values == 1)[-1])
x$values[y] <- x$values[which(y)-1]
inverse.rle(x)
# [1] 1 1 1 1 11 11 11 11 11 12 12 12 2 2 2 2
What is the optimal way to get the index of all elements that are repeated # times? I want to identify the elements that are duplicated more than 2 times.
rle() and rleid() both hint to the values I need but neither method directly gives me the indices.
I came up with this code:
t1 <- c(1, 10, 10, 10, 14, 37, 3, 14, 8, 8, 8, 8, 39, 12)
t2 <- lag(t1,1)
t2[is.na(t2)] <- 0
t3 <- ifelse(t1 - t2 == 0, 1, 0)
t4 <- rep(0, length(t3))
for (i in 2:length(t3)) t4[i] <- ifelse(t3[i] > 0, t3[i - 1] + t3[i], 0)
which(t4 > 1)
returns:
[1] 4 11 12
and those are the values I need.
Are there any R-functions that are more appropriate?
One option with data.table. No real reason to use this instead of lag/shift when n = 2, but for larger n this would save you from creating a large number of new lagged vectors.
library(data.table)
which(rowid(rleid(t1)) > 2)
# [1] 4 11 12
Explanation:
rleid will produce a unique value for each "run" of equal values, and rowid will mark how many elements "into" the run each element is. What you want is elements more than 2 "into" a run.
data.table(
t1,
rleid(t1),
rowid(t1))
# t1 V2 V3
# 1: 1 1 1
# 2: 10 2 1
# 3: 10 2 2
# 4: 10 2 3
# 5: 14 3 1
# 6: 37 4 1
# 7: 3 5 1
# 8: 14 6 2
# 9: 8 7 1
# 10: 8 7 2
# 11: 8 7 3
# 12: 8 7 4
# 13: 39 8 1
# 14: 12 9 1
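If data.table is not available, the "position within a run" idea can be sketched in base R with rle() and sequence(); pos_in_run is an illustrative name.

```r
t1 <- c(1, 10, 10, 10, 14, 37, 3, 14, 8, 8, 8, 8, 39, 12)

# rle() gives the length of each run of equal neighbouring values;
# sequence() turns those lengths into 1, 2, 3, ... counters within each run
pos_in_run <- sequence(rle(t1)$lengths)
pos_in_run
# [1] 1 1 2 3 1 1 1 1 1 2 3 4 1 1

# elements that are more than 2 "into" their run
which(pos_in_run > 2)
# [1]  4 11 12
```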
Edit: If, as in the example posed by this question, no two runs (even length-1 "runs") are of the same value (or if you don't care whether the duplicates are next to each other), you can just use which(rowid(t1) > 2) instead. (This is noted by Frank in the comments)
Hopefully this example clarifies the differences
a <- c(1, 1, 1, 2, 2, 1)
which(rowid(a) > 2)
# [1] 3 6
which(rowid(rleid(a)) > 2)
# [1] 3
You can use dplyr::lag or data.table::shift (note, default for shift is to lag, so shift(t1, 1) is equal to shift(t1, 1, type = "lag")):
which(t1 == lag(t1, 1) & lag(t1, 1) == lag(t1, 2))
[1] 4 11 12
# Or
which(t1 == shift(t1, 1) & shift(t1, 1) == shift(t1, 2))
[1] 4 11 12
If you need it to scale for several duplicates you can do the following (thanks for the tip @IceCreamToucan):
n <- 2
df1 <- sapply(0:n, function(x) shift(t1, x))
which(rowMeans(df1 == df1[,1]) == 1)
[1] 4 11 12
This is usually a case where rle is useful, e.g.
v1 <- rle(t1)
i1 <- seq_along(t1)[t1 %in% v1$values[v1$lengths > 2]]
i2 <- t1[t1 %in% v1$values[v1$lengths > 2]]
tapply(i1, i2, function(i) tail(i, -2))
#$`8`
#[1] 11 12
#$`10`
#[1] 4
You can unlist and get it as a vector,
unlist(tapply(i1, i2, function(i) tail(i, -2)))
#81 82 10
#11 12 4
There is also a function called rleid in data.table package which we can use,
unlist(lapply(Filter(function(i) length(i) > 2, split(seq_along(t1), data.table::rleid(t1))),
function(i) tail(i, -2)))
#2 71 72
#4 11 12
Another possibility involving rle() could be:
pseudo_rleid <- with(rle(t1), rep(seq_along(values), lengths))
which(ave(t1, pseudo_rleid, FUN = function(x) seq_along(x) > 2) != 0)
[1] 4 11 12
df <- data.frame(x = seq(1:10))
I want this:
df$y <- c(1, 2, 3, 4, 5, 15, 20 , 25, 30, 35)
i.e. each y is the sum of the previous five x values. This implies the first five y will be the same as x.
What I get is this:
df$y1 <- c(df$x[1:4], RcppRoll::roll_sum(df$x, 5))
x y y1
1 1 1
2 2 2
3 3 3
4 4 4
5 5 15
6 15 20
7 20 25
8 25 30
9 30 35
10 35 40
In summary, I need y but I am only able to achieve y1
1) enhanced sum function Define a function Sum which sums its first 5 values if it receives 6 values and returns the last value otherwise. Then use it with partial=TRUE in rollapplyr:
library(zoo)
x <- df$x
Sum <- function(x) if (length(x) < 6) tail(x, 1) else sum(head(x, -1))
rollapplyr(x, 6, Sum, partial = TRUE)
## [1] 1 2 3 4 5 15 20 25 30 35
2) sum 6 and subtract off original Another possibility is to take the running sum of 6 elements filling in the first 5 elements with NA and subtracting off the original vector. Finally fill in the first 5.
replace(rollsumr(x, 6, fill = NA) - x, 1:5, head(x, 5))
## [1] 1 2 3 4 5 15 20 25 30 35
3) specify offsets A third possibility is to use the offset form of width to specify the prior 5 elements:
c(head(x, 5), rollapplyr(x, list(-(1:5)), sum))
## [1] 1 2 3 4 5 15 20 25 30 35
4) alternative specification of offsets In this alternative we specify an offset of 0 for each of the first 5 elements and offsets of -(1:5) for the rest.
width <- replace(rep(list(-(1:5)), length(x)), 1:5, list(0))
rollapply(x, width, sum)
## [1] 1 2 3 4 5 15 20 25 30 35
Note
The scheme for filling in the first 5 elements seems quite unusual and you might consider using partial sums for the first 5 with NA or 0 for the first one since there are no prior elements for that one:
rollapplyr(x, list(-(1:5)), sum, partial = TRUE, fill = NA)
## [1] NA 1 3 6 10 15 20 25 30 35
rollapplyr(x, list(-(1:5)), sum, partial = TRUE, fill = 0)
## [1] 0 1 3 6 10 15 20 25 30 35
rollapplyr(x, 6, sum, partial = TRUE) - x
## [1] 0 1 3 6 10 15 20 25 30 35
A simple approach would be:
df <- data.frame(x = seq(1:10))
mysum <- function(x, k = 5) {
res <- rep(NA, length(x))
for (i in seq_along(x)) {
if (i <= k) { # edited ;-)
res[i] <- x[i]
} else {
res[i] <- sum(x[(i-k):(i-1)])
}
}
res
}
mysum(df$x)
# [1] 1 2 3 4 5 15 20 25 30 35
mysum <- function(x, k = 5) {
  # the first k values are returned unchanged
  res <- x[1:k]
  # each later value is the sum of the k values before it
  rest <- sapply(1:(length(x) - k), function(i) sum(x[i:(i + k - 1)]))
  c(res, rest)
}
mysum(df$x)
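A loop-free sketch of the same "sum of the previous k values" rule uses cumulative sums: the sum of x[(i-k):(i-1)] is the difference between two running totals. The name prev_sum is hypothetical, and returning x unchanged when it has k or fewer elements is an assumption.

```r
prev_sum <- function(x, k = 5) {
  n <- length(x)
  if (n <= k) return(x)    # nothing beyond the first k values (assumption)
  cs <- c(0, cumsum(x))    # cs[j + 1] is the sum of x[1:j]
  y <- as.numeric(x)       # first k values stay as they are
  idx <- (k + 1):n
  # sum of x[(i - k):(i - 1)] equals cs[i] - cs[i - k]
  y[idx] <- cs[idx] - cs[idx - k]
  y
}

prev_sum(1:10)
# [1]  1  2  3  4  5 15 20 25 30 35
```

Because it relies on a single cumsum() pass, the cost does not grow with the window size k.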