I'm looking for a way to identify a growing season, defined as a run of more than, say, 60 days between the last frost day of spring and the first frost day of fall. A general version of the problem is this: given a vector of numbers like testVec, I want the positions of the first and last elements of each run that has 5 or more items, all of them greater than 0.
testVec <- c(1,3,4,0, 1, -5, 6, 0, 1,3,4,6,7,5,9, 0)
In this example, the relevant range is 1,3,4,6,7,5,9 which is testVec[9] to testVec[15]
One option could be:
testVec[with(rle(testVec > 0), rep(lengths * values >= 5, lengths))]
[1] 1 3 4 6 7 5 9
The idea is, first, to split the vector into runs of values that are greater than zero and runs of values that are not. Second, check which runs of positive values have length 5 or more. Finally, subset the original vector to those runs.
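To make the three steps concrete, the intermediate results can be inspected one at a time. This is only an illustrative sketch in base R; the names r and keep are introduced here and are not part of the answer above.

```r
testVec <- c(1, 3, 4, 0, 1, -5, 6, 0, 1, 3, 4, 6, 7, 5, 9, 0)

# Step 1: runs of "greater than zero" (TRUE) and "not greater than zero" (FALSE)
r <- rle(testVec > 0)
r$lengths  # 3 1 1 1 1 1 7 1
r$values   # TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE

# Step 2: lengths * values is 0 for FALSE runs, so only long TRUE runs pass;
# rep() expands the per-run flag back to one flag per element
keep <- rep(r$lengths * r$values >= 5, r$lengths)

# Step 3: positions and values of the qualifying run
which(keep)    # 9 10 11 12 13 14 15
testVec[keep]  # 1 3 4 6 7 5 9
```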
1) rleid This handles any number of qualifying sequences, including none. rleid(ok) is a vector the same length as ok such that the first run of identical elements is replaced with 1, the second run with 2 and so on. The result is a list of vectors where each vector has its positions in the original input as its names.
library(data.table)
getSeq <- function(x) {
names(x) <- seq_along(x)
ok <- x > 0
s <- split(x[ok], rleid(ok)[ok])
unname(s)[lengths(s) >= 5]
}
getSeq(testVec)
## [[1]]
## 9 10 11 12 13 14 15
## 1 3 4 6 7 5 9
getSeq(numeric(16))
## list()
getSeq(c(testVec, 10 * testVec))
## [[1]]
## 9 10 11 12 13 14 15
## 1 3 4 6 7 5 9
##
## [[2]]
## 25 26 27 28 29 30 31
## 10 30 40 60 70 50 90
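If data.table is not available, the run ids that rleid() produces can be emulated with base R's rle(). The following is a minimal sketch of that idea; run_id is a hypothetical helper name, not a data.table function.

```r
# Hypothetical base-R stand-in for data.table::rleid() on a single vector:
# number the runs 1, 2, 3, ... and repeat each number by the run's length
run_id <- function(v) {
  r <- rle(v)
  rep(seq_along(r$lengths), r$lengths)
}

testVec <- c(1, 3, 4, 0, 1, -5, 6, 0, 1, 3, 4, 6, 7, 5, 9, 0)
ids <- run_id(testVec > 0)
ids  # 1 1 1 2 3 4 5 6 7 7 7 7 7 7 7 8
```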
If a data frame were desired then following gives the values and which sequence the row came from. The row names indicate the positions in the original input.
gs <- getSeq(c(testVec, 10 * testVec))
names(gs) <- seq_along(gs)
if (length(gs)) stack(gs) else gs
## values ind
## 9 1 1
## 10 3 1
## 11 4 1
## 12 6 1
## 13 7 1
## 14 5 1
## 15 9 1
## 25 10 2
## 26 30 2
## 27 40 2
## 28 60 2
## 29 70 2
## 30 50 2
## 31 90 2
2) gregexpr Replace each element that is > 0 with 1 and each other element with 0 pasting the 0's and 1's into a single character string. Then use gregexpr to look for sequences of 1's at least 5 long and for the ith such nonoverlapping sequence return the first positions, g, and lengths, attr(g, "match.length"). Define a function vals which extracts the values at the required positions from testVec of the ith such nonoverlapping sequence returning a list such that the ith component of the list is the ith such sequence. The names in the output vector are its positions in the input.
getSeq2 <- function(x) {
g <- gregexpr("1{5,}", paste(+(x > 0), collapse = ""))[[1]]
vals <- function(i) {
ix <- seq(g[i], length = attr(g, "match.length")[i])
setNames(x[ix], ix)
}
if (length(g) == 1 && g == -1) list() else lapply(seq_along(g), vals)
}
getSeq2(testVec)
## [[1]]
## 9 10 11 12 13 14 15
## 1 3 4 6 7 5 9
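To see what gregexpr is matching against, it may help to look at the intermediate 0/1 string and the raw match information. The variable names s and g below are only for illustration.

```r
testVec <- c(1, 3, 4, 0, 1, -5, 6, 0, 1, 3, 4, 6, 7, 5, 9, 0)

# One character per element: "1" where the element is > 0, "0" otherwise
s <- paste(+(testVec > 0), collapse = "")
s  # "1110101011111110"

# Start position and length of every run of at least five 1's
g <- gregexpr("1{5,}", s)[[1]]
as.integer(g)            # 9
attr(g, "match.length")  # 7
```

Because every element maps to exactly one character, the match position in the string is also the position in the vector.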
The above handles any number of sequences, including none, but if we knew there was exactly one sequence (which is the case for the example in the question) then it could be simplified to the following, where the return value is just that vector:
g <- gregexpr("1{5,}", paste(+(testVec > 0), collapse = ""))[[1]]
ix <- seq(g, length = attr(g, "match.length"))
setNames(testVec[ix], ix)
## 9 10 11 12 13 14 15
## 1 3 4 6 7 5 9
You could "fix" @tmfmnk's solution like this:
f1 <- function(x, threshold, n) {
range(which(with(rle(x > threshold), rep(lengths * values >= n, lengths))))
}
x <- c(1, 3, 4, 0, 1, -5, 6, 0, 1,3,4,6,7,5,9, 0)
f1(x, 0, 5)
#[1] 9 15
But that does not work well when there are multiple runs, because range() spans from the start of the first run to the end of the last one:
xx <- c(x, x)
f1(xx, 0, 5)
#[1] 9 31
Here is another, not so concise approach that returns the start and end of the longest run (the first one if there are ties).
f2 <- function(x, threshold, n) {
y <- x > threshold
y[is.na(y)] <- FALSE
a <- ave(y, cumsum(!y), FUN=cumsum)
m <- max(a)
if (m < n) return (c(NA, NA))
i <- which(a == m)[1]
c(i-m+1, i)
}
f2(x, 0, 5)
#[1] 9 15
f2(xx, 0, 5)
#[1] 9 15
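Or with rle:
f3 <- function(x, threshold, n) {
y <- x > threshold
y[is.na(y)] <- FALSE
r <- rle(y)
# length of each TRUE run; FALSE runs count as 0 so they can never be the longest
len <- r$lengths * r$values
m <- max(len)
if (m < n) return (c(NA, NA))
i <- sum(r$lengths[1:which.max(len)])
c(i-m+1, i)
}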
f3(x, 0, 5)
#[1] 9 15
f3(xx, 0, 5)
#[1] 9 15
If you want the first run of length at least n, ignoring any later run even if it is longer, you could do:
f4 <- function(x, threshold, n) {
y <- with(rle(x > threshold), rep(lengths * values >= n, lengths))
i <- which(y)[1]
j <- i + which(!y[-c(1:i)])[1] - 1
c(i, j)
}
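None of the functions above returns the bounds of every qualifying run at once. A minimal sketch of one way to do that with rle follows; run_bounds is a hypothetical name, and treating NA as "not above the threshold" is an assumption made here, not something stated in the answers.

```r
# Hypothetical helper: start and end positions of every run of at least n
# consecutive values above threshold (NA counts as not above threshold)
run_bounds <- function(x, threshold = 0, n = 5) {
  ok <- !is.na(x) & x > threshold
  r <- rle(ok)
  end <- cumsum(r$lengths)        # last position of each run
  start <- end - r$lengths + 1    # first position of each run
  keep <- r$values & r$lengths >= n
  data.frame(start = start[keep], end = end[keep])
}

x <- c(1, 3, 4, 0, 1, -5, 6, 0, 1, 3, 4, 6, 7, 5, 9, 0)
run_bounds(x)        # one row: 9 15
run_bounds(c(x, x))  # two rows: 9 15 and 25 31
```

For the growing-season case, x would be the daily minimum temperature, threshold the frost temperature, and n the minimum season length plus one.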
In R, I systematically try to avoid "for" loops and use the lapply() family instead.
But how can I do so when each iteration builds on the result of the previous one?
For example: is it possible to obtain the same result as below with an lapply approach?
a <- c()
b <- c()
set.seed(1L) # required for reproducible data
for (i in 1:10){
a <- c(a, sample(c(0,1), 1))
b <- c(b, (paste(a, collapse = "-")))
}
data.frame(a, b)
> data.frame(a, b)
a b
1 0 0
2 1 0-1
3 0 0-1-0
4 0 0-1-0-0
5 1 0-1-0-0-1
6 0 0-1-0-0-1-0
7 0 0-1-0-0-1-0-0
8 0 0-1-0-0-1-0-0-0
9 1 0-1-0-0-1-0-0-0-1
10 1 0-1-0-0-1-0-0-0-1-1
EDIT
My question was very badly worded. The new example below is more illustrative: is there any way to use the lapply family if each iteration is calculated from the previous one?
a <- c()
b <- c()
for (i in 1:10){
a <- c(a, sample(c(0,1), 1))
b <- c(b, (paste(a, collapse = "-")))
}
data.frame(a, b)
> data.frame(a, b)
a b
1 0 0
2 1 0-1
3 0 0-1-0
4 1 0-1-0-1
5 1 0-1-0-1-1
6 1 0-1-0-1-1-1
7 1 0-1-0-1-1-1-1
8 0 0-1-0-1-1-1-1-0
9 1 0-1-0-1-1-1-1-0-1
10 1 0-1-0-1-1-1-1-0-1-1
For the sake of completeness, there is also the accumulate() function from the purrr package.
So, building on the answers of Sotos and ThomasIsCoding:
df <- data.frame(a = 1:10)
df$b <- purrr::accumulate(df$a, paste, sep = "-")
df
a b
1 1 1
2 2 1-2
3 3 1-2-3
4 4 1-2-3-4
5 5 1-2-3-4-5
6 6 1-2-3-4-5-6
7 7 1-2-3-4-5-6-7
8 8 1-2-3-4-5-6-7-8
9 9 1-2-3-4-5-6-7-8-9
10 10 1-2-3-4-5-6-7-8-9-10
The differences from Reduce() are
that accumulate() is a verb of its own (no additional parameter accumulate = TRUE is required)
and that additional arguments like sep = "-" can be passed on to the mapped function, which may help to avoid creating an anonymous function.
EDIT
If I understand correctly OP's edit of the question, the OP is asking if a for loop which computes a result iteratively can be replaced by lapply().
This is difficult to answer for me. Here are some thoughts and observations:
First, accumulate() still will work:
set.seed(1L) # required for reproducible data
df <- data.frame(a = sample(0:1, 10L, TRUE))
df$b <- purrr::accumulate(df$a, paste, sep = "-")
df
a b
1 0 0
2 1 0-1
3 0 0-1-0
4 0 0-1-0-0
5 1 0-1-0-0-1
6 0 0-1-0-0-1-0
7 0 0-1-0-0-1-0-0
8 0 0-1-0-0-1-0-0-0
9 1 0-1-0-0-1-0-0-0-1
10 1 0-1-0-0-1-0-0-0-1-1
This is possible because the computation of a can be pulled out of the loop as it does not depend on b.
IMHO, accumulate() and Reduce() do what the OP is looking for, although they are not called lapply(): they take the result of the previous iteration and combine it with the current value, for instance
Reduce(`+`, 1:3)
returns the sum of 1, 2, and 3 by iteratively computing ((1 + 2) + 3). The intermediate results can be shown by using the accumulate parameter
Reduce(`+`, 1:3, accumulate = TRUE)
[1] 1 3 6
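The same mechanism works for any recurrence in which each result is computed from the previous result and the current element. As a small sketch, the recurrence below (state = 0.5 * previous state + current value) is a made-up example chosen only to show the pattern; prev and cur are illustrative argument names.

```r
v <- c(1, 2, 3, 4)

# Reduce() passes the previous result as the first argument and the
# current element as the second; with accumulate = TRUE every
# intermediate state is kept. Without init, the first state is v[1].
res <- Reduce(function(prev, cur) 0.5 * prev + cur, v, accumulate = TRUE)
res  # 1.000 2.500 4.250 6.125
```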
Second, there is a major difference between a for loop and functions of the lapply() family: lapply(X, FUN, ...) requires a function FUN to be called on each element of X. So, scoping rules for functions apply.
When we transplant the body of the loop into an anonymous function within lapply()
a <- c()
b <- c()
set.seed(1L) # required for reproducible data
lapply(1:10, function(i) {
a <- c(a, sample(c(0,1), 1))
b <- c(b, (paste(a, collapse = "-")))
})
we get
[[1]]
[1] "0"
[[2]]
[1] "1"
[[3]]
[1] "0"
[[4]]
[1] "0"
[[5]]
[1] "1"
[[6]]
[1] "0"
[[7]]
[1] "0"
[[8]]
[1] "0"
[[9]]
[1] "1"
[[10]]
[1] "1"
data.frame(a, b)
data frame with 0 columns and 0 rows
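Due to the scoping rules, the assignments to a and b inside the function create variables that are local to the function. The a and b defined outside of the function are read but never modified, so each call starts again from the empty vectors.
This can be fixed by using the superassignment operator <<-, which here assigns to the a and b in the global environment: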
a <- c()
b <- c()
set.seed(1L) # required for reproducible data
lapply(1:10, function(i) {
a <<- c(a, sample(c(0,1), 1))
b <<- c(b, (paste(a, collapse = "-")))
})
data.frame(a, b)
a b
1 0 0
2 1 0-1
3 0 0-1-0
4 0 0-1-0-0
5 1 0-1-0-0-1
6 0 0-1-0-0-1-0
7 0 0-1-0-0-1-0-0
8 0 0-1-0-0-1-0-0-0
9 1 0-1-0-0-1-0-0-0-1
10 1 0-1-0-0-1-0-0-0-1-1
However, global assignment is considered bad programming practice and should be avoided, see, e.g., the 6th Circle of Patrick Burns' The R Inferno and many questions on SO.
Third, the loop as written grows vectors in each iteration. This is also considered bad practice because the data must be copied over and over again, which can slow things down tremendously with increasing size. See, e.g., the 2nd Circle of Patrick Burns' The R Inferno.
However, the original code
a <- c()
b <- c()
set.seed(1L) # required for reproducible data
for (i in 1:10) {
a <- c(a, sample(c(0,1), 1))
b <- c(b, (paste(a, collapse = "-")))
}
data.frame(a, b)
can be re-written as
a <- integer(10)
b <- character(10)
set.seed(1L) # required for reproducible data
for (i in seq_along(a)) {
a[i] <- sample(c(0,1), 1)
b[i] <- if (i == 1L) a[1] else paste(b[i-1], a[i], sep = "-")
}
data.frame(a, b)
Here, vectors are pre-allocated with the required size to hold the result. Elements to update are identified by subscripting.
Calculation of b[i] still depends only on the value from the previous iteration, b[i-1], and the current value a[i], as requested by the OP.
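As a quick sanity check, both versions can be wrapped in functions and run from the same seed to confirm that they produce the same result. The wrapper names grow_version and prealloc_version are hypothetical and exist only for this comparison.

```r
# Loop that grows the vectors (as in the question)
grow_version <- function(n) {
  a <- c()
  b <- c()
  for (i in seq_len(n)) {
    a <- c(a, sample(c(0, 1), 1))
    b <- c(b, paste(a, collapse = "-"))
  }
  data.frame(a, b)
}

# Loop that fills pre-allocated vectors
prealloc_version <- function(n) {
  a <- numeric(n)
  b <- character(n)
  for (i in seq_along(a)) {
    a[i] <- sample(c(0, 1), 1)
    b[i] <- if (i == 1L) a[1] else paste(b[i - 1], a[i], sep = "-")
  }
  data.frame(a, b)
}

set.seed(1L)
r1 <- grow_version(10)
set.seed(1L)
r2 <- prealloc_version(10)
identical(as.character(r1$b), as.character(r2$b))  # TRUE
```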
Another way is to use Reduce with accumulate = TRUE, i.e.
df$new <- do.call(rbind, Reduce(paste, split(df, seq(nrow(df))), accumulate = TRUE))
which gives,
a new
1 1 1
2 2 1 2
3 3 1 2 3
4 4 1 2 3 4
5 5 1 2 3 4 5
6 6 1 2 3 4 5 6
7 7 1 2 3 4 5 6 7
8 8 1 2 3 4 5 6 7 8
9 9 1 2 3 4 5 6 7 8 9
10 10 1 2 3 4 5 6 7 8 9 10
You can use sapply (lapply would work too, but it returns a list) to iterate over every value of a in df, create a sequence up to that value, and paste the values together.
df <- data.frame(a = 1:10)
df$b <- sapply(df$a, function(x) paste(seq(x), collapse = "-"))
df
# a b
#1 1 1
#2 2 1-2
#3 3 1-2-3
#4 4 1-2-3-4
#5 5 1-2-3-4-5
#6 6 1-2-3-4-5-6
#7 7 1-2-3-4-5-6-7
#8 8 1-2-3-4-5-6-7-8
#9 9 1-2-3-4-5-6-7-8-9
#10 10 1-2-3-4-5-6-7-8-9-10
If the data can contain non-numerical values, on which we cannot use seq, such as
df <- data.frame(a =letters[1:10])
then we can use
df$b <- sapply(seq_along(df$a), function(x) paste(df$a[seq_len(x)], collapse = "-"))
df
# a b
#1 a a
#2 b a-b
#3 c a-b-c
#4 d a-b-c-d
#5 e a-b-c-d-e
#6 f a-b-c-d-e-f
#7 g a-b-c-d-e-f-g
#8 h a-b-c-d-e-f-g-h
#9 i a-b-c-d-e-f-g-h-i
#10 j a-b-c-d-e-f-g-h-i-j
Another way of using Reduce, different from the approach by @Sotos:
df$b <- Reduce(function(...) paste(...,sep = "-"), df$a, accumulate = TRUE)
such that
> df
a b
1 1 1
2 2 1-2
3 3 1-2-3
4 4 1-2-3-4
5 5 1-2-3-4-5
6 6 1-2-3-4-5-6
7 7 1-2-3-4-5-6-7
8 8 1-2-3-4-5-6-7-8
9 9 1-2-3-4-5-6-7-8-9
10 10 1-2-3-4-5-6-7-8-9-10
Suppose you have two vectors y and x with length(y) < length(x).
What I want is to count how often all elements of y appear in x in the same order.
I can assume that the elements of y appear at least one time in x in the right order.
Example:
y = c(10,20)
x = c(10,20,20,10,20)
The index pairs of the appearances of y in x are:
(1,2),(1,3),(1,5),(4,5)
so my result should be 4.
I've already written a function:
countAllPositionsOfLCS <- function(y, x) {
potIndexList <- lapply(y, function(k) {
which(k == x)
})
previousIndices <- potIndexList[[1]]
counter <- length(potIndexList[[1]])
if (length(potIndexList) >= 2) {
for (k in 2:length(potIndexList)) {
newIndices <- potIndexList[[k]]
for (i in 1:length(previousIndices)) {
currentFittingInds <- newIndices[which(previousIndices[i] < newIndices)]
counter <- counter + length(currentFittingInds) - 1
}
previousIndices <- newIndices
}
}
counter
}
I tested this function with
c(10,20,30) and c(10,20,20,10,20,20,30,30)
The proper result should be 12, but the function says 10.
The index triples are 1,2,7 | 1,2,8 | 1,3,7 | 1,3,8 | 1,5,7 | 1,5,8 | 1,6,7 | 1,6,8 | 4,5,7 | 4,5,8 | 4,6,7 | 4,6,8.
My idea was the following:
First of all I look where the values of y appear in x, and for each value of y I store the indices in the list potIndexList.
Then I go through all elements of this list. Say we are at element potIndexList[[k]], which holds all the indices where the k-th element of y appears in x. Since I want to preserve the order of y in x, I have to get rid of some indices. Therefore I go through the indices and check whether previousIndices[i] < potIndexList[[k]]. If this is TRUE, I know that the order is right.
I suggest using expand.grid unless your actual problem involves much larger vectors:
y = c(10,20,30)
x = c(10,20,20,10,20,20,30,30)
#find matches:
inds <- lapply(y, function(z) which(z == x))
#all combinations of matches:
res <- do.call(expand.grid, inds)
#remove combinations with wrong order:
res <- res[apply(res, 1, function(z) all(diff(z) > 0)),]
# Var1 Var2 Var3
# 1 1 2 7
# 3 1 3 7
# 5 1 5 7
# 6 4 5 7
# 7 1 6 7
# 8 4 6 7
# 9 1 2 8
#11 1 3 8
#13 1 5 8
#14 4 5 8
#15 1 6 8
#16 4 6 8
nrow(res)
#[1] 12
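If only the count is needed and the vectors are too large for expand.grid, a dynamic-programming pass over x avoids building the combinations at all. This is a different technique from the one above, sketched under the assumption that the match positions themselves are not required; count_subseq and ways are hypothetical names.

```r
# Hypothetical helper: number of ways y occurs in x as an ordered
# (not necessarily contiguous) subsequence
count_subseq <- function(y, x) {
  # ways[j + 1] = number of ways to match the first j elements of y
  # using the part of x seen so far; the empty prefix matches once
  ways <- c(1, numeric(length(y)))
  for (xi in x) {
    # walk y backwards so the current x element extends only matches
    # that were completed before it
    for (j in rev(seq_along(y))) {
      if (y[j] == xi) ways[j + 1] <- ways[j + 1] + ways[j]
    }
  }
  ways[length(y) + 1]
}

count_subseq(c(10, 20), c(10, 20, 20, 10, 20))                  # 4
count_subseq(c(10, 20, 30), c(10, 20, 20, 10, 20, 20, 30, 30))  # 12
```

The running time is proportional to length(x) * length(y), and the memory use to length(y).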
I've a list with groups of numbers as follows:
myList <- list(1:5, c(1,3,4,7), 2:6, c(3,6:9), 4:8)
myList
#[[1]]
#[1] 1 2 3 4 5
#
#[[2]]
#[1] 1 3 4 7
#
#[[3]]
#[1] 2 3 4 5 6
#
#[[4]]
#[1] 3 6 7 8 9
#
#[[5]]
#[1] 4 5 6 7 8
So we have 5 groups of numbers in myList. I want to merge two groups if they have at least s% same numbers in them. I compute similarity using the following function call:
get.Overlap <- function(group_A, group_B)
{
common <- length(intersect(group_A, group_B))
minimum_length <- min(length(group_A), length(group_B))
#formula: |A ^ B| / min{|A|, |B|}
overlap_score <- common / minimum_length
overlap_score
}
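The same score can be computed for all pairs of groups at once with an incidence matrix, which may be a useful building block for a faster merge. The sketch below assumes that each group contains no duplicated values, and it builds dense matrices, so for tens of thousands of groups a sparse representation would be needed instead. The names vals, inc, common, sizes and overlap are illustrative.

```r
myList <- list(1:5, c(1, 3, 4, 7), 2:6, c(3, 6:9), 4:8)

# Incidence matrix: one row per distinct value, one column per group
vals <- sort(unique(unlist(myList)))
inc <- sapply(myList, function(g) as.integer(vals %in% g))

# Entry [i, j] is the number of values shared by groups i and j
common <- crossprod(inc)

# Divide by the size of the smaller group of each pair
sizes <- lengths(myList)
overlap <- common / outer(sizes, sizes, pmin)

overlap[1, 3]  # 0.8
overlap[1, 2]  # 0.75
```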
I have implemented mergeList(myList, threshold_s) using loops as follows:
mergeList <- function(myList, threshold_s)
{
returnList <- list()
i <- 1
while(i <= length(myList))
{
thisList <- myList[[i]]
j <- i + 1
while(j <= length(myList))
{
tempList <- myList[[j]]
if(get.Overlap(thisList, tempList) >= threshold_s)
{
thisList <- union(thisList, tempList)
myList[[j]] <- NULL
}
else
{
j <- j + 1
}
}
returnList <- c(returnList, list(thisList))
i <- i + 1
}
returnList
}
Now if I call mergeList(myList, threshold_s), where threshold_s is set to 0.80, meaning 80% similarity, the output will be
#[[1]]
#[1] 1 2 3 4 5 6
#
#[[2]]
#[1] 1 3 4 7
#
#[[3]]
#[1] 3 4 6 7 8 9
The complexity is comparatively high. I am looking for a fast implementation of mergeList(myList, threshold_s) for a large list, say with a length of around 50,000.
Thanks in advance.
I have a numeric vector in R, say
v= c(2,3,5,6,7,6,3,2,3,4,5,7,8,9,6,1,1,2,5,6,7,11,2,3,4)
Now, I have to find all consecutive sub-vectors of size 4 in it, with the condition that each element of a sub-vector must be greater than 2 and that all sub-vectors must be disjoint, in the sense that no two sub-vectors can contain the same index. So my output will be:
(3,5,6,7),(3,4,5,7),(5,6,7,11)
Edited:
Other examples for illustration: for
v=c(3,3,3,3,1,3,3,3,3,3,3,3,3)
the output will be
(3,3,3,3), (3,3,3,3),(3,3,3,3).
And for
v= c(2,3,5,5,7,6,3,2,3,4,5,7,8,9,6,1,1,2,5,6,7,11,2,3,4)
output will be
(3,5,5,7),(3,4,5,7),(5,6,7,11)
The second condition simply says that if we find any sub-array, say (v[m],v[m+1],v[m+2],v[m+3]), with each element greater than 2, then it goes into the output, and the next sub-array can only start from v[m+4] (if possible).
This solution uses embed() to create a matrix of lags and then extracts the desired rows from this matrix:
v <- c(2,3,5,6,7,6,3,2,3,4,5,7,8,9,6,1,1,2,5,6,7,11,2,3,4)
e <- embed(v, 4)
ret <- which(
apply(e, 1, function(x)all(x > 2)) &
apply(e, 1, function(x)length(unique(x)) == 4)
)
rows <- ret[c(1, 1 + which(diff(ret) > 4))]
e[rows, 4:1]
[,1] [,2] [,3] [,4]
[1,] 3 5 6 7
[2,] 3 4 5 7
[3,] 5 6 7 11
Try:
fun1 <- function(vec, n, cond1) {
lst1 <- lapply(1:(length(vec) - n+1), function(i) {
x1 <- vec[i:(i + (n-1))]
if (all(diff(x1) >= 0) & all(x1 > cond1))
x1
})
indx <- which(sapply(lst1, length) == n)
indx2 <- unlist(lapply(split(indx, cumsum(c(TRUE, diff(indx) != 1))), function(x) x[seq(1,
length(x), by = n-1)]))
lst1[indx2]
}
v1 <- c(3,3,3,3,1,3,3,3,3,3,3,3,3)
v2 <- c(2,3,5,5,7,6,3,2,3,4,5,7,8,9,6,1,1,2,5,6,7,11,2,3,4)
v3 <- c(2,3,5,6,7,6,3,2,3,4,5,7,8,9,6,1,1,2,5,6,7,11,2,3,4)
fun1(v1,4,2)
#[[1]]
#[1] 3 3 3 3
#[[2]]
#[1] 3 3 3 3
#[[3]]
#[1] 3 3 3 3
fun1(v2,4,2)
#[[1]]
#[1] 3 5 5 7
#[[2]]
#[1] 3 4 5 7
#[[3]]
#[1] 5 6 7 11
fun1(v3,4,2)
#[[1]]
#[1] 3 5 6 7
#[[2]]
#[1] 3 4 5 7
#[[3]]
#[1] 5 6 7 11
Here is another idea based on rle:
ff = function(x, size, thres)
{
valid_subsets = sapply(head(seq_along(x), -(size - 1)),
function(i) all(x[i:(i + (size - 1))] > thres))
r = rle(valid_subsets)
lapply(unlist(mapply(function(a, b) a + (seq_len(b) - 1) * size,
(cumsum(r$lengths) - r$lengths + 1)[which(r$values)],
(r$lengths[which(r$values)] + size - 1) %/% size)),
function(i) x[i:(i + (size - 1))])
}
ff(c(3,3,3,3,1,3,3,3,3,3,3,3,3), 4, 2)
ff(c(2,3,5,6,7,6,3,2,3,4,5,7,8,9,6,1,1,2,5,6,7,11,2,3,4), 4, 2)
Testing on another vector (I assume this is the correct output):
set.seed(4); xx = sample(1:10, 20, T)
xx
# [1] 6 1 3 3 9 3 8 10 10 1 8 3 2 10 5 5 10 6 10 8
ff(xx, 4, 2)
#[[1]]
#[1] 3 3 9 3
#
#[[2]]
#[1] 10 5 5 10
Unless I'm missing something, on "xx" (as well as on other cases) the other posted answers do not seem to work:
fun1(xx, 4, 2)
#[[1]]
#[1] 3 8 10 10
#e[rows, 4:1]
#[1] 9 3 8 10
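For comparison, the same greedy rule (take a window as soon as one fits, then continue right after it) can be written by applying rle directly to x > thres: a run of L consecutive qualifying elements yields L %/% size disjoint windows. This is a sketch under the assumption that x contains no NA values; disjoint_windows is a hypothetical name.

```r
# Hypothetical helper: disjoint windows of `size` consecutive elements,
# all greater than `thres`, taken greedily from left to right
disjoint_windows <- function(x, size, thres) {
  r <- rle(x > thres)
  run_end <- cumsum(r$lengths)
  run_start <- run_end - r$lengths + 1
  keep <- r$values & r$lengths >= size
  # within each qualifying run, windows start every `size` positions
  starts <- unlist(Map(function(s, len) s + (seq_len(len %/% size) - 1) * size,
                       run_start[keep], r$lengths[keep]))
  lapply(starts, function(i) x[i:(i + size - 1)])
}

v3 <- c(2,3,5,6,7,6,3,2,3,4,5,7,8,9,6,1,1,2,5,6,7,11,2,3,4)
w3 <- disjoint_windows(v3, 4, 2)
w3  # (3,5,6,7), (3,4,5,7), (5,6,7,11)

xx <- c(6,1,3,3,9,3,8,10,10,1,8,3,2,10,5,5,10,6,10,8)
wx <- disjoint_windows(xx, 4, 2)
wx  # (3,3,9,3), (10,5,5,10)
```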