I hope this question is appropriate.
If I have a string like:
> x <- "ABCDDDDDABC"
regexpr returns the position (and length) of the first match, while gregexpr returns the positions and lengths of all matches.
> regexpr("ABC",x)
[1] 1
attr(,"match.length")
[1] 3
attr(,"useBytes")
[1] TRUE
> gregexpr("ABC",x)
[[1]]
[1] 1 9
attr(,"match.length")
[1] 3 3
attr(,"useBytes")
[1] TRUE
What I want to do is find only the position of the last ABC; I don't care about any before it.
My desired output is:
[1] 9
attr(,"match.length")
[1] 3
attr(,"useBytes")
[1] TRUE
This returns only the position of the last ABC. Is there a simple way to do this, perhaps with a wildcard I am unaware of? I have looked on the forum and online but cannot find a general solution.
IMPORTANT NOTE: I do not want a regex that takes into account the context around the ABC in this particular string. I want something that gives me the position of the last ABC in any string I feed it.
We can use the convenient stri_locate_last_regex function from stringi:
library(stringi)
as.vector(stri_locate_last_regex(x, "ABC")[,1])
#[1] 9
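If adding a package is not an option, the same result can be approximated in base R by taking the last element of what gregexpr returns. This is only a sketch: last_match is a hypothetical helper name, not part of base R or stringi.

```r
# Hypothetical helper: start and length of the last match of `pattern` in `text`.
# Returns -1 for both values when there is no match (the gregexpr convention).
last_match <- function(pattern, text) {
  m <- gregexpr(pattern, text)[[1]]
  n <- length(m)
  c(start = m[n], length = attr(m, "match.length")[n])
}

res <- last_match("ABC", "ABCDDDDDABC")
res
```

For the example string this should give a start of 9 and a length of 3, matching the desired output above.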
Alternation with quantifiers in the gregexpr and str_extract_all functions
require(stringr)
gregexpr(pattern = "(h|a)*", "xxhx")
[[1]]
[1] 1 2 3 4
attr(,"match.length")
[1] 0 0 1 0
attr(,"useBytes")
[1] TRUE
str_extract_all(pattern = "(h|a)*", "xxhx")
[[1]]
[1] "" "" "h" "" ""
Why does gregexpr report 3 empty matches while str_extract_all reports 4?
This comes down to how the TRE (gregexpr) and ICU (str_extract_all) regex engines deal with empty (also called "zero-length") regex matches. The TRE engine advances the regex index after a zero-length match, while ICU allows testing the same position twice.
It becomes obvious what positions are tried by both engines if you use replacing functions:
> gsub("(h|a)*", "-\\1", "xxhx")
[1] "-x-x-hx-"
> str_replace_all("xxhx", "(h|a)*", "-\\1")
[1] "-x-x-h-x-"
The TRE engine matched h and moved the index past the following x, while the ICU engine matched h and stopped right after it, so it could still match the empty location before x.
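If the empty matches are not actually wanted, one way to sidestep the engine difference is to require at least one character with + instead of *. A minimal base R sketch (the variable names are illustrative only):

```r
text <- "xxhx"

# "+" needs at least one character, so zero-length matches cannot occur.
hits <- gregexpr("(h|a)+", text)

starts  <- as.integer(hits[[1]])                      # where each match begins
lens    <- as.integer(attr(hits[[1]], "match.length")) # how long each match is
matched <- regmatches(text, hits)[[1]]                # the matched substrings
matched
```

With no zero-length matches in play, the only hit is the single "h" at position 3, and there is no empty position left for the engines to disagree about.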
I have a vector that looks like this:
data <- c("0115", "0159", "0256", "0211")
I want to filter the data based on the first 2 characters of each element. For example:
group 1 - elements that start with 01
group 2 - elements that start with 02
Any idea how to accomplish this?
You might want to use a regular expression (regex) to find strings that start with "01" or "02".
The base approach is to use grep(), which returns the indices of strings that match a pattern. Here's an example; notice I've changed the 2nd and 4th data elements to demonstrate how just searching for "01" or "02" leads to an incorrect answer:
d <- c("0115", "0102", "0256", "0201")
grep("01", d)
#> [1] 1 2 4
d[grep("01", d)]
#> [1] "0115" "0102" "0201"
Because this searches for "01" anywhere, you get "0201" in the mix. To avoid this, add "^" to the pattern to anchor the match at the start of the string:
grep("^01", d)
#> [1] 1 2
d[grep("^01", d)]
#> [1] "0115" "0102"
If you use the stringr package, you can also use str_detect() in the same way:
library(stringr)
d[str_detect(d, "^01")]
#> [1] "0115" "0102"
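If the goal is to get every group at once rather than one prefix at a time, a regex is not strictly needed. The sketch below assumes the prefix is always the first two characters, and uses base R's substr, split and startsWith:

```r
d <- c("0115", "0102", "0256", "0201")

# Take the first two characters of each element as the group key.
prefix <- substr(d, 1, 2)

# split() returns a named list with one element per distinct prefix.
groups <- split(d, prefix)
groups[["01"]]
groups[["02"]]

# startsWith() is a regex-free way to test a single prefix.
only_01 <- d[startsWith(d, "01")]
```

Here groups[["01"]] holds "0115" and "0102", and groups[["02"]] holds "0256" and "0201".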
For digits I have done so:
digits <- c("0","1","2","3","4","5","6","7","8","9")
You can use [:punct:] to detect punctuation. This detects
[!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]
Either in gregexpr
x = c("we are friends!, Good Friends!!")
gregexpr("[[:punct:]]", x)
[[1]]
[1] 15 16 30 31
attr(,"match.length")
[1] 1 1 1 1
attr(,"useBytes")
[1] TRUE
or via stringi
# Gives 4
stringi::stri_count_regex(x, "[:punct:]")
Notice the , is counted as punctuation.
The question seems to be about getting individual counts of particular punctuation marks. @Joba provides a neat answer in the comments:
## Create a vector of punctuation marks you are interested in
punct = strsplit('[]?!"\'#$%&(){}+*/:;,._`|~[<=>@^-]\\', '')[[1]]
Then count how often they appear
counts = stringi::stri_count_fixed(x, punct)
Decorate the vector
setNames(counts, punct)
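If only the punctuation marks that actually occur are of interest, a base R alternative is to extract every match and tabulate it. This is a sketch that uses the example string from above, not part of the quoted answer:

```r
x <- "we are friends!, Good Friends!!"

# Pull out every punctuation character, then count each distinct one.
marks  <- regmatches(x, gregexpr("[[:punct:]]", x))[[1]]
counts <- table(marks)
counts
```

For this string the table should show "!" three times and "," once. Unlike the fixed-vector approach, marks with a count of zero simply do not appear.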
You can use regular expressions.
stringi::stri_count_regex("amdfa, ad,a, ad,. ", "[:punct:]")
https://en.wikipedia.org/wiki/Regular_expression might help too.
I want to get the indices of all occurrences of certain characters in a word. Assume the characters I am looking for are: l, e, a, z.
I tried the following regex in the grep function, and tens of modifications of it, but I keep getting something other than what I want.
grep("/([leazoscnz]{1})/", "ylaf", value = F)
gives me
numeric(0)
where I would like:
[1] 2 3
To make grep work with individual characters of a string, you first need to split the string into a vector of single characters. You can use strsplit for this:
strsplit("ylaf", split="")[[1]]
[1] "y" "l" "a" "f"
Next you need to simplify your regular expression (R patterns are not delimited by slashes, so the "/" characters were being matched literally) and try the grep again:
strsplit("ylaf", split="")[[1]]
grep("[leazoscnz]", strsplit("ylaf", split="")[[1]])
[1] 2 3
But it is easier to use gregexpr:
gregexpr("[leazoscnz]", "ylaf")
[[1]]
[1] 2 3
attr(,"match.length")
[1] 1 1
attr(,"useBytes")
[1] TRUE
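When the targets are plain single characters, the regex can be avoided altogether by comparing the split characters against a set with %in%. A small sketch, where wanted is just an illustrative name:

```r
wanted <- c("l", "e", "a", "z")

# Split the word into single characters and keep the indices that are in the set.
chars     <- strsplit("ylaf", split = "")[[1]]
positions <- which(chars %in% wanted)
positions
```

This returns a plain integer vector (2 and 3 here) with no match.length attributes to strip, and it is not affected by characters that have a special meaning inside a regex character class.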
I have tried several packages in order to build an R program that takes a text file as input and produces a list of the words in that file. Each word should have a vector with all the positions at which it occurs in the file.
As an example, if the text file has the string:
"this is a nice text with nice characters"
The output should be something like:
$this
[1] 1
$is
[1] 2
$a
[1] 3
$nice
[1] 4 7
$text
[1] 5
$with
[1] 6
$characters
[1] 8
I came across a useful post, http://r.789695.n4.nabble.com/Memory-usage-in-R-grows-considerably-while-calculating-word-frequencies-td4644053.html, but it does not include the positions of each words.
I found a similar function called "str_locate"; however, I want positions counted in "words", not "characters".
Any guidance on which packages or techniques to use would be really appreciated.
You can do this with base R (which, curiously, produces exactly your suggested output):
# data
x <- "this is a nice text with nice characters"
# split on whitespace
words <- strsplit(x, split = ' ')[[1]]
# find positions of every word
sapply(unique(words), function(x) which(x == words))
### result ###
$this
[1] 1
$is
[1] 2
$a
[1] 3
$nice
[1] 4 7
$text
[1] 5
$with
[1] 6
$characters
[1] 8
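One caveat with sapply is that it simplifies its result: if every word occurs exactly once, it returns a named vector rather than a list. A variant that always returns a list is to split the word indices by the words themselves. This is a sketch, not the original answer's method:

```r
x <- "this is a nice text with nice characters"
words <- strsplit(x, split = " ")[[1]]

# Factor levels in order of first appearance keep the output in reading order;
# a plain split(seq_along(words), words) would sort the words alphabetically.
positions <- split(seq_along(words), factor(words, levels = unique(words)))
positions$nice
```

positions$nice gives 4 and 7, and names(positions) lists the words in the order they first appear.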