When we want a sequence in R, we use either construction:
> 1:5
[1] 1 2 3 4 5
> seq(1,5)
[1] 1 2 3 4 5
This produces a sequence from start to stop (inclusive).
Is there a way to generate a sequence from start to stop (exclusive)? For example:
[1] 1 2 3 4
Also, I don't want to use a workaround like a minus operator, like:
seq(1,5-1)
This is because I would like the statements in my code to be elegant and concise. In my real-world example, the start and stop are not hardcoded integers but descriptive variable names. Using the variable_name - 1 construction just makes my script uglier and harder for a reviewer to read.
PS: The difference between this question and "remove the last element of a vector" is that I am asking about sequence generation, while that question focuses on removing the last element of an existing vector.
Moreover, the answers provided here are different and relevant to my problem.
One possible solution would be
head(1:5, -1)
# [1] 1 2 3 4
or you could define your own function
seq_last_exclusive <- function(x) return(x[-length(x)])
seq_last_exclusive(1:5)
# [1] 1 2 3 4
We can use the following function
f <- function(start, stop, ...) {
  if (identical(start, stop)) {
    return(vector("integer", 0))
  }
  seq.int(from = start, to = stop - 1L, ...)
}
Test
f(1, 5)
# [1] 1 2 3 4
f(1, 1)
# integer(0)
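Another way to sketch the same idea is to build the sequence from its length with seq_len, which returns an empty vector for a length of zero without a special case. The helper name seq_excl is an assumption for illustration, and the sketch assumes integer-valued endpoints:

```r
# Hypothetical helper (not from the answers above): integers in [from, to).
# Assumes from and to are whole numbers.
seq_excl <- function(from, to) {
  n <- max(to - from, 0)   # number of elements; 0 when to <= from
  from + seq_len(n) - 1
}

seq_excl(1, 5)   # 1 2 3 4
seq_excl(3, 3)   # empty numeric vector
```

The subtraction still happens, but it is hidden inside the helper, so the calling code only mentions the descriptive start and stop names.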
Related
I have a df like this (with ~800,000 lines)
# str
# 1 .||.
# 2 .
# 3 .|..
# 4 ..
and I want a new data frame like this (record the location in each character string with a .) (sorry about the formatting of columns)
# str loc
# 1 .||. 1 4
# 2 . 1
# 3 .|.. 1 3 4
# 4 .. 1 2
I can get the locations with gregexpr(".", str, fixed = TRUE), but I don’t know how to get the first part of the gregexpr output, without the three attribute parts. I will later use the location vectors in other calculations. As gregexpr is vectorized, I do not want to use a loop to do this, as this would take too long. I think this problem must have been addressed in previous questions, but I can’t find a solution. Also, if there is a completely different way to handle this, please tell me.
Here's an example. Is this what you mean?
S = c("appleap", "tapppapp")
P = "ap"
lapply(gregexpr(P, S), function(x) as.vector(x))
#[[1]]
#[1] 1 6
#[[2]]
#[1] 2 6
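Applied to the data frame from the question, the same pattern might look like the sketch below. The four-row df is rebuilt here from the sample shown in the question, and storing the result as a list column named loc is just one possible layout:

```r
# Sample data rebuilt from the question (the real df has ~800,000 rows).
df <- data.frame(str = c(".||.", ".", ".|..", ".."), stringsAsFactors = FALSE)

# gregexpr is vectorized over df$str; as.vector drops the match attributes.
df$loc <- lapply(gregexpr(".", df$str, fixed = TRUE), as.vector)

df$loc[[1]]   # 1 4
df$loc[[3]]   # 1 3 4
```

Note that gregexpr reports a string with no match as -1 rather than as an empty vector, so rows without any "." would need separate handling.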
Suppose I have a long string such as:
c<-"abcabcdabcdeabcdefghijkabcdabcaba"
My question is how to quickly count the number of exact occurrences of "abcd" in c.
1) gregexpr First paste "abcd" onto c so that there is at least 1 match. (This is needed because gregexpr returns -1 for any component of c having no matches rather than a zero length numeric vector.) Now, gregexpr returns a list whose components are numeric vectors of the starting positions of the matches one component per component of c -- in this case c only has one component but the code below works more generally. Now find the lengths of the components of the result of gregexpr and subtract 1 to take into account the extra abcd we added. No packages are used.
Example 1
lengths(gregexpr("abcd", paste(c, "abcd"))) - 1
## [1] 4
Note: If we knew that there was at least one match it could be slightly simplified to: lengths(gregexpr("abcd", c)) .
Example 2
Here is another example. Here DF has 3 rows and the corresponding components of c have 4, 4, and 0 occurrences of "abcd".
DF <- data.frame(c = c(c, c, "X")) # test input
lengths(gregexpr("abcd", paste(DF$c, "abcd"))) - 1
## [1] 4 4 0
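To see the -1 convention directly, here is a small sketch that counts only the positive match positions instead of pasting on an extra "abcd". The variable names s, texts and m are assumptions chosen to avoid masking the base function c:

```r
s <- "abcabcdabcdeabcdefghijkabcdabcaba"   # the string the question calls c
texts <- c(s, s, "X")

m <- gregexpr("abcd", texts, fixed = TRUE)
m[[3]][1]                                  # -1 marks "no match"

# Count real matches: positions are >= 1, the no-match marker is -1.
counts <- vapply(m, function(pos) sum(pos > 0), integer(1))
counts                                     # 4 4 0
```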
2) regmatches
Here is an alternative approach. This approach has the advantage that no special code is needed for the no-match case. Again, no packages are used.
Here are the same two examples:
lengths(regmatches(c, gregexpr("abcd", c)))
## [1] 4
lengths(regmatches(DF$c, gregexpr("abcd", DF$c)))
## [1] 4 4 0
Using the stringr library, you can do it as follows (on a larger data set, it should be fairly fast and efficient):
library(stringr)
c <- "abcabcdabcdeabcdefghijkabcdabcaba"
c
[1] "abcabcdabcdeabcdefghijkabcdabcaba"
str_count(c, 'abcd')
[1] 4
This will work on a column of a data frame as follows:
df <- data.frame(txt = rep(c, 10))
df$abcd_count <- str_count(df$txt, 'abcd')
df
txt abcd_count
1 abcabcdabcdeabcdefghijkabcdabcaba 4
2 abcabcdabcdeabcdefghijkabcdabcaba 4
3 abcabcdabcdeabcdefghijkabcdabcaba 4
4 abcabcdabcdeabcdefghijkabcdabcaba 4
5 abcabcdabcdeabcdefghijkabcdabcaba 4
6 abcabcdabcdeabcdefghijkabcdabcaba 4
7 abcabcdabcdeabcdefghijkabcdabcaba 4
8 abcabcdabcdeabcdefghijkabcdabcaba 4
9 abcabcdabcdeabcdefghijkabcdabcaba 4
10 abcabcdabcdeabcdefghijkabcdabcaba 4
Here is one method using base R's gsub and strsplit:
# example
temp <- "abcabcdabcdeabcdefghijkabcdabcaba"
# substitute pattern for character not in string, here 9
temp2 <- gsub("abcd", "9", temp)
# split on 9, and count number of elements
length(strsplit(temp2, split="9")[[1]]) - 1
You need the [[1]] because strsplit is designed to operate over vectors of strings, here the vector is of length 1. An alternative to [[1]] in this case is unlist.
Also, 1 is subtracted because the number of elements is one larger than the number of abcd patterns.
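One caveat with the split-and-count approach, as far as I can tell: strsplit does not return a trailing empty string, so a match at the very end of the input goes uncounted. Below is a minimal sketch of that edge case, together with an alternative that counts by the number of characters removed. count_fixed is a hypothetical helper name, not something from the answer above:

```r
# Edge case: the only match sits at the end of the string.
x <- "abcd"
split_count <- length(strsplit(gsub("abcd", "9", x), split = "9")[[1]]) - 1
split_count   # 0, although "abcd" occurs once

# Hypothetical alternative: characters removed divided by pattern length.
count_fixed <- function(x, pattern) {
  (nchar(x) - nchar(gsub(pattern, "", x, fixed = TRUE))) / nchar(pattern)
}

temp <- "abcabcdabcdeabcdefghijkabcdabcaba"
count_fixed(temp, "abcd")   # 4
count_fixed(x, "abcd")      # 1
```

This variant also avoids having to pick a placeholder character that is guaranteed not to appear in the string.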
I'm reading a book on R and I don't understand the behavior of the seq function. Can someone please explain to me what it's doing when you give it a vector such as what is shown below on line 4?
> seq(1,5,1)
[1] 1 2 3 4 5
> x <- c(1,5,1)
> seq(x)
[1] 1 2 3
seq generates a sequence basically, so:
seq(from, to, increment)
printed out 1 to 5 incrementing by 1 each time.
The c function then combines its arguments into a vector, so x is the vector c(1, 5, 1). When seq is called with a single vector argument of length greater than one, it behaves like seq_along(x) and returns the sequence from 1 to length(x), here 1 2 3.
Check the documentation in the links below to see the default methods.
Sequence generation: seq
Combine/concatenate: c
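A short sketch of the difference, using only base R. The do.call line is an addition for illustration, showing one way to pass the elements of x as separate arguments if that was the intent:

```r
x <- c(1, 5, 1)

seq(x)                     # one vector argument: 1 2 3
seq_along(x)               # the same result, stated explicitly

seq(5)                     # a single number n gives 1:n

do.call(seq, as.list(x))   # spreads x into seq(1, 5, 1): 1 2 3 4 5
```

Because seq treats a single number and a longer vector differently, seq_along(x) is usually the safer choice when the goal is to index over x.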
data:
row A B
1 1 1
2 1 1
3 1 2
4 1 3
5 1 1
6 1 2
7 1 3
Hi all! What I'm trying to do (example above) is to sum those values in column A, but only when column B = 1 (so starting with a simple subset line - below).
sum(data$A[data$B==1])
However, I only want to do this the first time that condition occurs until the values switch. If that condition re-occurs later in the column (row 5 in the example), I'm not interested in it!
I'd really appreciate your help in this (I suspect simple) problem!
Using data.table for syntax elegance, you can use rle to get this done
library(data.table)
DT <- data.table(data)
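DT[, B1 := {
  bb <- rle(B == 1)          # runs of TRUE/FALSE for B == 1
  r <- bb$values
  r[r] <- seq_len(sum(r))    # number the TRUE runs 1, 2, ...; FALSE runs become 0
  bb$values <- r
  inverse.rle(bb)            # expand back to one value per row
}]
DT[B1 == 1, sum(A)]          # sum A over the first run only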
# [1] 2
Here's a rather elaborate way of doing that:
data$counter = cumsum(data$B == 1)
sum(data$A[(data$counter >= 1:nrow(data) - sum(data$counter == 0)) &
(data$counter != 0)])
Another way:
idx <- which(data$B == 1)
sum(data$A[idx[idx == (seq_along(idx) + idx[1] - 1)]])
# [1] 2
# or alternatively
sum(data$A[idx[idx == seq(idx[1], length.out = length(idx))]])
# [1] 2
The idea: first get all indices where B is 1. For the data above that is c(1,2,5). Starting from the first index, 1, you want to keep only the indices that are consecutive (that is, c(1,2,3,...)). So, from 1 take that many consecutive numbers and compare them element by element with the indices. They stop being equal the moment the indices are no longer consecutive, and once there's a mismatch, all the following numbers will mismatch as well. So, the leading elements for which the comparison is TRUE are exactly the ones in the first consecutive run (which is what you want).
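The comparison can be traced step by step on the sample data. The data frame below is rebuilt from the table in the question, and the intermediate names expected, keep and first_run are assumptions added for readability:

```r
# Sample data rebuilt from the question.
data <- data.frame(row = 1:7, A = rep(1, 7), B = c(1, 1, 2, 3, 1, 2, 3))

idx <- which(data$B == 1)                            # 1 2 5
expected <- seq(idx[1], length.out = length(idx))    # 1 2 3
keep <- idx == expected                              # TRUE TRUE FALSE

first_run <- idx[keep]                               # 1 2
total <- sum(data$A[first_run])
total                                                # 2
```

This sketch assumes B equals 1 at least once; with no match, idx is empty and idx[1] is NA, so that case would need a guard.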
Say I applied the cut function on seq(15) like this cut(seq(15), 5)
I would get a list of bins in which each element would fall. What if I want to extract the members or elements of the third level? How can I refer to the elements that would fall in the 3rd bin after cutting the sequence?
Addressing Arun's comment: I will provide the cut function a vector of breaks like this: temp <- cut(seq(15), c(.9,4,8,12,15)). I am looking for the elements of seq(15) that fall in the 3rd level. They are 9,10,11,12. There is already an answer below that worked.
You can use labels=F to get
cut(seq(15),5,labels=F)
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
Then
x <- seq(15)
x[cut(x,5,labels=F)==3]
[1] 7 8 9
Your question is poorly worded and somewhat ambiguous, but you can use basic indexing for this:
temp <- cut(seq(15), 5)
temp[temp == levels(temp)[3]]
# [1] (6.6,9.4] (6.6,9.4] (6.6,9.4]
# Levels: (0.986,3.79] (3.79,6.6] (6.6,9.4] (9.4,12.2] (12.2,15]
Or, if you wanted the relevant values from seq(15):
seq(15)[temp == levels(temp)[3]]
# [1] 7 8 9
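If the members of every bin are needed, not just the third, one option is to split the original vector by the factor that cut returns. This is a sketch using the explicit breaks from the question's edit; the name bins is an assumption:

```r
x <- seq(15)

# One list element per level of the cut factor.
bins <- split(x, cut(x, c(0.9, 4, 8, 12, 15)))

bins[[3]]       # 9 10 11 12
lengths(bins)   # how many elements fell into each bin
```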