Find matching indices for pattern vector in grep - r

Is there an easy way to find corresponding vector indices when working with grep?
v=c(123,456,789,651)
pat=c(1,35,47,8)
id=grep(paste0(pat, collapse="|"), v)
v[id]
[1] 123 789 651
I would like to generate:
pat_id
[1] 1 4 1
So that pat[pat_id] would give me the values in pat that matched.
pat[pat_id]
[1] 1 8 1
match() cannot be used in this case because strings have to be identical to count as a match.

We can loop over v and use str_detect since it is vectorised to find if the pattern exists in any of them and return the index or the vector directly.
library(stringr)
unlist(sapply(v, function(x) which(str_detect(x, as.character(pat)))))
#[1] 1 4 1
If the final goal is to get pat vectors instead we can directly do
unlist(sapply(v, function(x) pat[str_detect(x, as.character(pat))]))
#[1] 1 8 1

Related

is there an equivalent of the 'match' function in R, that works with regex?

advantage of 'match', it's returning the matching indices from the lexicon
disadvantage it doesn't accept regex
Corpus<- c('animalada', 'fe', 'fernandez', 'ladrillo')
Lexicon<- c('animal', 'animalada', 'fe', 'fernandez', 'ladr', 'ladrillo')
Index <- match(Corpus, Lexicon)
match returns the indices of the dictionary
Index
# [1] 2 3 4 6
Lexicon[Index]
# [1] "animalada" "fe" "fernandez" "ladrillo"
I need to work with a dictionary that includes regex
Lexicon<- c('anima.+$', '.*ez$', '^fe.*$', 'ladr.*$')
problem the 'match' function, doesn't work with regex !
Use str_which + sapply. Note that one regex can apply to multiple values, hence the list.
library(stringr)
sapply(Corpus, \(x) str_which(x, Lexicon))
# $animalada
# [1] 1
#
# $fe
# [1] 3
#
# $fernandez
# [1] 2 3
#
# $ladrillo
# [1] 4

Create and apply a function that calculates the length of many strings in a vector

Suppose I have a long vector with characters which is more or less like this:
vec <- c("32, 25", "5", "15, 24")
I want to apply a function which give me the number of strings for any element separated by a comma and returns me a vector with any individual length. Using lapply and my toy vector, this is my approach:
lapply(vec, function(x) {
a <- strsplit(x, ",")
y <- length(a[[1:length(a)]])
unlist(y[1:length(y)])
})
[[1]]
[1] 2
[[2]]
[1] 1
[[3]]
[1] 2
This almost gives me what I want since first element has 2 strings, second element 1 string and third element 2 strings. The problem is I can't achieve that my function returns me a vector of the form c(2,1,2). I'm using this function to create a new variable on some data.frame which I'm working with.
Any idea will be much appreciated.
You could do:
stringr::str_count(vec, ",") + 1
#> [1] 2 1 2
Or, in base R:
nchar(gsub("[^,]", "", vec)) + 1
#> [1] 2 1 2

Get indexes of unique values in a vector

I have a vector like this.
filenames <- c("kisyu2_mst.csv", "kisyu3_mst.csv", "kisyu2_mst.csv",
"kisyu3_mst.csv", "kisyu3_mst.csv")
I need to get indices from filenames vector for each unique value.output look like this
for "kisyu2_mst.csv" indices vector c(1,3)
for "kisyu3_mst.csv" indices vector c(2,4,5)
Finally, I need to insert it to a list like this:
final <- list("kisyu2_mst.csv" = c(1,3), "kisyu3_mst.csv"=c(2,4,5))
How to get the indices of unique value from the vector?
We can use split
split(seq_along(filenames), filenames)
#$kisyu2_mst.csv
#[1] 1 3
#$kisyu3_mst.csv
#[1] 2 4 5
We could try which:
sapply(unique(filenames), function(i) which(filenames %in% i))
# $kisyu2_mst.csv
# [1] 1 3
#
# $kisyu3_mst.csv
# [1] 2 4 5
We can use tapply
tapply(seq_along(filenames), filenames, FUN = I)
#$kisyu2_mst.csv
#[1] 1 3
#$kisyu3_mst.csv
#[1] 2 4 5

Count number of occurrences at end of string

I want to count how many commas are at the end of a string with a regex:
x <- c("w,x,,", "w,x,", "w,x", "w,x,,,")
I'd like to get:
[1] 2 1 0 3
This gives:
library(stringi)
stringi::stri_count_regex(x, ",+$")
## [1] 1 1 0
Because I'm using a quantifier but don't know how to count actual number of times single character was repeated at end.
The "match.length" attribute within the regexpr seem to get the job done (-1 is used to distinguish no match from zero-width matches such as lookaheads)
attr(regexpr(",+$", x), "match.length")
## [1] 2 1 -1 3
Another option (with contribution from #JasonAizkalns) would be
nchar(x) - nchar(gsub(",+$", "", x))
## [1] 2 1 0 3
Or using stringi package combined with nchar while specifying , keepNA = TRUE (this way no matches will be specified as NAs)
library(stringi)
nchar(stri_extract_all_regex(x, ",+$"), keepNA = TRUE)
## [1] 2 1 NA 3

Extracting unique numbers from string in R

I have a list of strings which contain random characters such as:
list=list()
list[1] = "djud7+dg[a]hs667"
list[2] = "7fd*hac11(5)"
list[3] = "2tu,g7gka5"
I'd like to know which numbers are present at least once (unique()) in this list. The solution of my example is:
solution: c(7,667,11,5,2)
If someone has a method that does not consider 11 as "eleven" but as "one and one", it would also be useful. The solution in this condition would be:
solution: c(7,6,1,5,2)
(I found this post on a related subject: Extracting numbers from vectors of strings)
For the second answer, you can use gsub to remove everything from the string that's not a number, then split the string as follows:
unique(as.numeric(unlist(strsplit(gsub("[^0-9]", "", unlist(ll)), ""))))
# [1] 7 6 1 5 2
For the first answer, similarly using strsplit,
unique(na.omit(as.numeric(unlist(strsplit(unlist(ll), "[^0-9]+")))))
# [1] 7 667 11 5 2
PS: don't name your variable list (as there's an inbuilt function list). I've named your data as ll.
Here is yet another answer, this one using gregexpr to find the numbers, and regmatches to extract them:
l <- c("djud7+dg[a]hs667", "7fd*hac11(5)", "2tu,g7gka5")
temp1 <- gregexpr("[0-9]", l) # Individual digits
temp2 <- gregexpr("[0-9]+", l) # Numbers with any number of digits
as.numeric(unique(unlist(regmatches(l, temp1))))
# [1] 7 6 1 5 2
as.numeric(unique(unlist(regmatches(l, temp2))))
# [1] 7 667 11 5 2
A solution using stringi
# extract the numbers:
nums <- stri_extract_all_regex(list, "[0-9]+")
# Make vector and get unique numbers:
nums <- unlist(nums)
nums <- unique(nums)
And that's your first solution
For the second solution I would use substr:
nums_first <- sapply(nums, function(x) unique(substr(x,1,1)))
You could use ?strsplit (like suggested in #Arun's answer in Extracting numbers from vectors (of strings)):
l <- c("djud7+dg[a]hs667", "7fd*hac11(5)", "2tu,g7gka5")
## split string at non-digits
s <- strsplit(l, "[^[:digit:]]")
## convert strings to numeric ("" become NA)
solution <- as.numeric(unlist(s))
## remove NA and duplicates
solution <- unique(solution[!is.na(solution)])
# [1] 7 667 11 5 2
A stringr solution with str_match_all and piped operators. For the first solution:
library(stringr)
str_match_all(ll, "[0-9]+") %>% unlist %>% unique %>% as.numeric
Second solution:
str_match_all(ll, "[0-9]") %>% unlist %>% unique %>% as.numeric
(Note: I've also called the list ll)
Use strsplit using pattern as the inverse of numeric digits: 0-9
For the example you have provided, do this:
tmp <- sapply(list, function (k) strsplit(k, "[^0-9]"))
Then simply take a union of all `sets' in the list, like so:
tmp <- Reduce(union, tmp)
Then you only have to remove the empty string.
Check out the str_extract_numbers() function from the strex package.
pacman::p_load(strex)
list=list()
list[1] = "djud7+dg[a]hs667"
list[2] = "7fd*hac11(5)"
list[3] = "2tu,g7gka5"
charvec <- unlist(list)
print(charvec)
#> [1] "djud7+dg[a]hs667" "7fd*hac11(5)" "2tu,g7gka5"
str_extract_numbers(charvec)
#> [[1]]
#> [1] 7 667
#>
#> [[2]]
#> [1] 7 11 5
#>
#> [[3]]
#> [1] 2 7 5
unique(unlist(str_extract_numbers(charvec)))
#> [1] 7 667 11 5 2
Created on 2018-09-03 by the reprex package (v0.2.0).

Resources