Find start and end positions/indices of runs/consecutive values - r

Problem: Given an atomic vector, find the start and end indices of runs in the vector.
Example vector with runs:
x = rev(rep(6:10, 1:5))
# [1] 10 10 10 10 10 9 9 9 9 8 8 8 7 7 6
Output from rle():
# Run Length Encoding
# lengths: int [1:5] 5 4 3 2 1
# values : int [1:5] 10 9 8 7 6
Desired output:
# start end
# 1 1 5
# 2 6 9
# 3 10 12
# 4 13 14
# 5 15 15
The base rle class doesn't appear to provide this functionality, but the class Rle and function rle2 do. However, given how minor the functionality is, sticking to base R seems more sensible than installing and loading additional packages.
There are code snippets elsewhere (including on SO) that solve the slightly different problem of finding start and end indices for runs that satisfy some condition. I wanted something more general that could be done in one line and didn't involve assigning temporary variables or values.
Answering my own question because I was frustrated by the lack of search results. I hope this helps somebody!

Core logic:
# Example vector and rle object
x = rev(rep(6:10, 1:5))
rle_x = rle(x)
# Compute endpoints of run
end = cumsum(rle_x$lengths)
# Each run starts one past the previous run's end (base lag() does not shift a plain vector)
start = c(1, head(end, -1) + 1)
# Display results
data.frame(start, end)
# start end
# 1 1 5
# 2 6 9
# 3 10 12
# 4 13 14
# 5 15 15
Tidyverse/dplyr way (data frame-centric):
library(dplyr)
rle(x) %>%
  unclass() %>%
  as.data.frame() %>%
  mutate(end = cumsum(lengths),
         start = c(1, dplyr::lag(end)[-1] + 1)) %>%
  magrittr::extract(c(1,2,4,3)) # To re-order start before end for display
Because the start and end vectors are the same length as the values component of the rle object, solving the related problem of identifying endpoints for runs meeting some condition is straightforward: filter or subset the start and end vectors using the condition on the run values.
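As a minimal sketch of that filtering step in base R: the condition used here (run value at least 9) and the names runs and runs_ge9 are only illustrative assumptions, not part of the original answer.

```r
x <- rev(rep(6:10, 1:5))
r <- rle(x)

# End of each run is the running total of the run lengths;
# the start follows from the end and the length of the same run
end <- cumsum(r$lengths)
start <- end - r$lengths + 1

runs <- data.frame(value = r$values, start, end)

# Keep only the runs whose value meets the (hypothetical) condition
runs_ge9 <- runs[runs$value >= 9, ]
runs_ge9
```

Keeping value alongside start and end in one data frame means any condition on the run values can be applied with ordinary row subsetting.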

A data.table possibility, where .I and .N are used to pick relevant indices, per group defined by rleid runs.
data.table(x)[ , .(start = .I[1], end = .I[.N]), by = rleid(x)][, rleid := NULL][]
# start end
# 1: 1 5
# 2: 6 9
# 3: 10 12
# 4: 13 14
# 5: 15 15


Summing sequences in r using data.table

I am trying to sum pieces of a series using data.table in r. The idea is that I define a start index and an end index as columns in the table, then make a third column for "sum of the series between start and end indexes."
series = c(1,2,3,4,5,6)
a = data.table(start=c(1,2,3),end=c(4,5,6))
a[,S := sum(series[start:end])]
Expected result:
start end S
1: 1 4 10
2: 2 5 14
3: 3 6 18
Actual result:
Warning messages:
1: In start:end : numerical expression has 3 elements: only the first used
2: In start:end : numerical expression has 3 elements: only the first used
> a
start end S
1: 1 4 10
2: 2 5 10
3: 3 6 10
What am I missing here? If I just do a[,S := start+end] the code executes as one would expect.
One option is to loop over the 'start' and 'end' columns with Map, build the sequence (:) from each pair of elements, sum the corresponding slice of series, and unlist the resulting list to assign (:=) it to a new column:
a[, S := unlist(Map(function(x, y) sum(series[x:y]), start, end))]
# start end S
#1: 1 4 10
#2: 2 5 14
#3: 3 6 18
The : operator is not vectorized over its operands, i.e. it uses only the first element on either side, which is why it raised the warning.
Maybe you can try cumsum like below, which allows you to apply vectorized operations within data.table:
cs <- cumsum(series)
a[,S := cs[end]-c(0,cs)[start]]
which gives
start end S
1: 1 4 10
2: 2 5 14
3: 3 6 18
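The cumsum trick can be checked outside data.table. The following is a small base R sketch; the names cs0, S_fast and S_ref are made up for illustration. It compares the prefix-sum result with an explicit slice-by-slice sum.

```r
series <- c(1, 2, 3, 4, 5, 6)
start <- c(1, 2, 3)
end <- c(4, 5, 6)

# Prefix sums with a leading 0, so that cs0[i + 1] equals sum(series[1:i])
cs0 <- c(0, cumsum(series))

# Sum of series[start:end] is the prefix sum up to end minus the one before start
S_fast <- cs0[end + 1] - cs0[start]

# Reference result: one explicit slice per (start, end) pair
S_ref <- mapply(function(s, e) sum(series[s:e]), start, end)

S_fast
S_ref
```

Both vectors should agree; the prefix-sum version avoids building one index sequence per row.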
You can use the arithmetic series formula (this works here only because series is 1, 2, ..., 6, so each element equals its index):
a[, S := (end - start + 1) * (start + end) / 2]
start end S
1: 1 4 10
2: 2 5 14
3: 3 6 18
Your code would work if you make this a rowwise operation so each start and end represent a single value at a time.
a[,S := sum(series[start:end]), 1:nrow(a)]
# start end S
#1: 1 4 10
#2: 2 5 14
#3: 3 6 18

R rearrange data

I have a bunch of texts written by the same person, and I'm trying to estimate the templates they use for each text. The way I'm going about this is:
create a TermDocumentMatrix for all the texts
take the raw Euclidean distance of each pair
cut out any pair greater than X distance (10 for the sake of argument)
flatten the forest
return one example of each template with some summarized stats
I'm able to get to the point of having the distance pairs, but I am unable to convert the dist instance to something I can work with. There is a reproducible example at the bottom.
The data in the dist instance looks like this:
The row and column names correspond to indexes in the original list of texts, which I can use to achieve step 5.
What I have been trying to get out of this is a sparse matrix with col name, row name, value.
col, row, value
1 2 14.966630
1 3 12.449900
1 4 13.490738
1 5 12.688578
1 6 12.369317
2 3 12.449900
2 4 13.564660
2 5 12.922848
2 6 12.529964
3 4 5.385165
3 5 5.830952
3 6 5.830952
4 5 7.416198
4 6 7.937254
5 6 7.615773
From this point I would be comfortable cutting out all pairs greater than my cutoff and flattening the forest, i.e. returning 3 templates in this example, a group containing only document 1, a group containing only document 2 and a third group containing documents 3, 4, 5, and 6.
I have tried a bunch of things from creating a matrix out of this and then trying to make it sparse, to directly using the vector inside of the dist class, and I just can't seem to figure it out.
Reproducible example:
tdm <- matrix(c(1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,2,0,0,0,0,0,1,0,0,0,0,0,3,1,2,2,2,3,2,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,1,2,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,4,1,1,1,1,1,0,0,1,1,1,1,0,1,0,0,0,0,0,0,1,1,1,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,2,0,0,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,2,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,1,1,0,1,1,1,1,0,1,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,1,0,0,1,1,1,1,0,1,0,1,0,0,2,0,0,0,0,0,1,0,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,3,1,1,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1,1,0,0,0,1,0,0,2,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,3,1,1,1,1,0,1,0,0,0,0,1,2,0,1,1,0,0,0,0,1,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,1,1,1,1,0,1,0,0,0,0,0,0,0,1,0,0,1,1,1,1,1,1,0,0,0,0,1,0,0,1,0,1,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,1,1,0,1,0,0,0,0,0,1,1,1,2,1,1,1,0,0,0,0,1,2,2,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,1,1,1,0,2,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,2,0,2,2,3,2,1,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,2,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,2,1,1,1,1,1,0,1,0,0,0,0,1,1,0,0,0,0,1,0,0,0,1,0,0,1,
1,1,1,1,1,0,0,0,0,0,1,0,0,0,0,1,0,1,1,1,1,1,0,0,1,1,1,0,0,1,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1,2,1,1,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,1,0,2,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,3,0,1,1,1,1,0,0,1,0,1,1,1,0,0,0,0,0,1,0,0,0,0,0,4,2,4,6,4,3,1,0,1,2,1,1,0,1,0,0,0,0,2,0,0,0,0,0,0,1,1,1,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,2,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,2,1,2,2,2,2,1,0,1,2,1,1,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,2,2,2,2,2,2,3,3,4,5,3,1,2,1,1,1,1,1,1,0,0,0,0,3,3,0,0,1,1,0,1,0,0,0,0), nrow=6)
rownames(tdm) <- 1:6
colnames(tdm) <- paste("term", 1:229, sep="")
tdm.dist <- dist(tdm)
# I'm stuck turning tdm.dist into what I have shown
A classic approach to turn a "matrix"-like object into a [row, col, value] "data.frame" is the as.data.frame(as.table(.)) route. Specifically here, we need:
subset(as.data.frame(as.table(as.matrix(tdm.dist))), as.numeric(Var1) < as.numeric(Var2))
But that includes way too many coercions and creation of a larger object only to be subset immediately.
Since dist stores its values in a "lower.tri"angle form we could use combn to generate the row/col indices and cbind with the "dist" object:
data.frame(do.call(rbind, combn(attr(tdm.dist, "Size"), 2, simplify = FALSE)), c(tdm.dist))
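To see why the combn ordering lines up with the values stored in a "dist" object, here is a self-contained sketch on a tiny made-up data set. The matrix m and the column names col, row and value are assumptions for illustration, not the question's term-document matrix.

```r
# Hypothetical 4-point data set
m <- matrix(c(0, 0,
              3, 4,
              6, 8,
              0, 1), ncol = 2, byrow = TRUE)
d <- dist(m)
n <- attr(d, "Size")

# combn() enumerates the pairs (1,2), (1,3), ..., (3,4), which is the same
# column-wise lower-triangle order in which dist stores its values
idx <- combn(n, 2)
long <- data.frame(col = idx[1, ], row = idx[2, ], value = as.vector(d))
long

# Cross-check against the full matrix form
check <- as.matrix(d)[cbind(long$row, long$col)]
```

The cross-check against as.matrix(d) is only there to confirm the pairing; on large inputs one would skip it to avoid building the full matrix.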
Also, the "Matrix" package has some flexibility that, along with its memory efficiency in creating objects, could be used here:
tmp = combn(attr(tdm.dist, "Size"), 2)
summary(sparseMatrix(i = tmp[2, ], j = tmp[1, ], x = c(tdm.dist),
dims = rep_len(attr(tdm.dist, "Size"), 2), symmetric = TRUE))
Additionally, among different functions that handle "dist" objects,
cutree(hclust(tdm.dist), h = 10)
#1 2 3 4 5 6
#1 2 3 3 3 3
groups by specifying the cut height.
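One thing worth noting: hclust uses complete linkage by default, whereas "drop every pair above the cutoff and flatten the forest" reads more like single linkage (connected components). The sketch below uses made-up one-dimensional points, not the question's data, to show that the two can differ; treat it as an illustration under that assumption rather than a recommendation for the actual texts.

```r
# Hypothetical points on a line: 0, 6 and 12 form a chain with steps of 6; 40 is isolated
pts <- matrix(c(0, 6, 12, 40), ncol = 1)
d <- dist(pts)

# Single linkage joins two groups when their closest members are within the height,
# so chained points end up together
groups_single <- cutree(hclust(d, method = "single"), h = 10)

# Complete linkage (the hclust default) joins two groups only when their
# farthest members are within the height
groups_complete <- cutree(hclust(d, method = "complete"), h = 10)

# Document indices per group under single linkage
split(seq_along(groups_single), groups_single)
```

With these points, single linkage keeps the chain 0, 6, 12 in one group, while complete linkage splits it because 0 and 12 are more than 10 apart.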
That's how I've done a very similar thing in the past using dplyr and tidyr packages.
You can run the chained (%>%) script row by row to see how the dataset is updated step by step.
library(dplyr)
library(tidyr)
# tdm and tdm.dist are built exactly as in the question's reproducible example
tdm.dist %>%
as.matrix() %>% # update dist object to a matrix
data.frame() %>% # update matrix to a data frame
setNames(nm = 1:ncol(.)) %>% # update column names
mutate(names1 = 1:nrow(.)) %>% # use rownames as a variable
gather(names2, value , -names1) %>% # reshape data
filter(names1 <= names2) # keep the values only once
# names1 names2 value
# 1 1 1 0.000000
# 2 1 2 14.966630
# 3 2 2 0.000000
# 4 1 3 12.449900
# 5 2 3 12.449900
# 6 3 3 0.000000
# 7 1 4 13.490738
# 8 2 4 13.564660
# 9 3 4 5.385165
# 10 4 4 0.000000
# 11 1 5 12.688578
# 12 2 5 12.922848
# 13 3 5 5.830952
# 14 4 5 7.416198
# 15 5 5 0.000000
# 16 1 6 12.369317
# 17 2 6 12.529964
# 18 3 6 5.830952
# 19 4 6 7.937254
# 20 5 6 7.615773
# 21 6 6 0.000000

R Split data.frame using a column that represents an on/off switch

I have data that looks like the following:
a <- data.frame(cbind(x=seq(50),
I would like to split the data.frame a on the column z, with each run of identical values becoming a separate data.frame in a list, i.e. in my example the first 5 rows would be the first item in the list, the next 8 rows the second item, the next 3 rows the item after that, and so on.
Simple factors combine all the 1s together and all the 0s together...
I'm sure that there is a simple way to do this, but it has eluded me for the moment.
Try the rleid function in data.table v > 1.9.5
split(a, rleid(a$z))
# $`1`
# x y z
# 1 1 -0.03737561 0
# 2 2 -0.48663043 0
# 3 3 -0.98518106 0
# 4 4 0.09014355 0
# 5 5 -0.07703517 0
# $`2`
# x y z
# 6 6 0.3884339 1
# 7 7 1.5962833 1
# 8 8 -1.3750668 1
# 9 9 0.7987056 1
# 10 10 0.3483114 1
# 11 11 -0.1777759 1
# 12 12 1.1239553 1
# 13 13 0.4841117 1
Or, also with cumsum:
split(a, c(0, cumsum(diff(a$z) != 0)))
Here are some base R options.
Using rle. A variant of the rleid function in the comments by @Spacedman
split(a,inverse.rle(within.list(rle(a$z), values <- seq_along(values))))
By using cumsum after creating a logical index based on whether the adjacent elements are equal or not
split(a, cumsum(c(TRUE, a$z[-1]!=a$z[-nrow(a)])))
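For a quick runnable check of the cumsum approach, here is a sketch on a small made-up data frame. The data and the names run_id and pieces are assumptions, since the question's full data is not shown.

```r
# Hypothetical data: z switches between 0 and 1 in runs of length 2, 3 and 1
a <- data.frame(x = 1:6, z = c(0, 0, 1, 1, 1, 0))

# TRUE marks the first row of each run; cumsum turns those marks into a run id
run_id <- cumsum(c(TRUE, diff(a$z) != 0))

# One data frame per run, in order of appearance
pieces <- split(a, run_id)
sapply(pieces, nrow)
```

Note that the two runs of 0 stay in separate list elements, which is the behaviour a plain split(a, a$z) does not give.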

input sequential numbers without specific end in a data frame's column in r

I would like to add a sequence of numbers as a new column in a data frame. The sequence should restart based on the value in another column (i.e. it counts up from 1 until that value changes to another value).
My problem is how to define the ending point for each sequence in r.
A part of my data frame with the column "V2" which I intend to add:
V1 V2(new added column with sequential numbers)
12 1
12 2
12 3
12 4
12 5
13 1
13 2
13 3
13 4
13 5
13 6
14 1
14 2
14 3
14 4
I tried to use the following code, which was not working!
count <- table(df$V1)
c <- as.integer(names(count)[df$V1==12])
df$V2<- seq(1,c, by=1)
It sounds like you might be looking for rle since you're interested in any time the "V1" variable changes.
Try the following:
> sequence(rle(df$V1)$lengths)
[1] 1 2 3 4 5 1 2 3 4 5 6 1 2 3 4
rle is a very good solution but you could also have used ave:
df$V2 <- ave(df$V1, df$V1, FUN=seq_along)
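The two approaches agree on the data shown in the question, but they can differ when a value comes back later in the column: rle counts within each run, while ave counts within each distinct value. A small sketch with a made-up vector (V1, by_run and by_value are illustrative names):

```r
# Hypothetical column in which the value 12 returns after a run of 13s
V1 <- c(12, 12, 13, 13, 13, 12, 12)

# Counter restarts at every change of value
by_run <- sequence(rle(V1)$lengths)

# Counter continues across all occurrences of the same value
by_value <- ave(V1, V1, FUN = seq_along)

by_run
by_value
```

Which one is wanted depends on whether a repeated value should start a new sequence or continue the old one.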
Well Ananda beats my effort:
vec = numeric(0)
for(i in unique(df$V1)){
  # number of rows that carry this value of V1
  n = length(df$V1[df$V1 == i])
  vec = c(vec, 1:n)
}

recursive replacement in R

I am trying to clean some data and would like to replace zeros with values from the previous date. I was hoping the following code would work, but it doesn't:
temp = c(1,2,4,5,0,0,6,7)
temp[which(temp==0)] = temp[which(temp==0)-1]
It returns
1 2 4 5 5 0 6 7
instead of
1 2 4 5 5 5 6 7
Which I was hoping for.
Is there a nice way of doing this without looping?
The operation is called "Last Observation Carried Forward" and is usually used to fill data gaps. It's a common operation for time series and is thus implemented in the package zoo:
library(zoo)
temp = c(1,2,4,5,0,0,6,7)
temp[temp==0] <- NA
na.locf(temp)
#[1] 1 2 4 5 5 5 6 7
You could use essentially your same logic, except you'll want to apply it to the values vector that results from using rle, and then rebuild the vector with inverse.rle:
temp = c(1,2,4,5,0,0,6,0)
o <- rle(temp)
o$values[o$values == 0] <- o$values[which(o$values == 0) - 1]
inverse.rle(o)
#[1] 1 2 4 5 5 5 6 6
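For completeness, a loop-free base R sketch that does not need rle or zoo. It relies on cummax over positions and assumes the first element is non-zero; with a leading zero the index 0 would silently drop elements. The names idx and filled are illustrative.

```r
temp <- c(1, 2, 4, 5, 0, 0, 6, 0)

# Position of each element if it is non-zero, otherwise 0
idx <- seq_along(temp) * (temp != 0)

# cummax carries the most recent non-zero position forward,
# so indexing with it repeats the last non-zero value
filled <- temp[cummax(idx)]
filled
```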
