stringr equivalent to grep - r

Is there a stringr equivalent to base R's grep function?
I want to have the index of the string that matches. Example:
grep("F|Y", LETTERS)
[1] 6 25
With stringr my workaround would be using which as follows:
which(str_detect(LETTERS, "F|Y"))
[1] 6 25

Sorry for the late answer but it might be helpful for future visitors:
Now you can use str_which(string, pattern) which is a wrapper around which(str_detect(string, pattern)) and equivalent to grep(pattern, string).
str_which(LETTERS, "F|Y")
[1] 6 25
More details at: http://stringr.tidyverse.org/reference/str_subset.html
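For comparison, the same lookup can be sketched in base R alone. This is only an illustration; the variable names idx_grep and idx_which are made up for the example.

```r
# Base R: grep() returns the indices of the matching elements directly.
idx_grep <- grep("F|Y", LETTERS)

# which(grepl()) is the base R analogue of which(str_detect()).
idx_which <- which(grepl("F|Y", LETTERS))

# Both approaches give the same indices.
identical(idx_grep, idx_which)

# value = TRUE returns the matching strings instead of their positions.
grep("F|Y", LETTERS, value = TRUE)
```

Note the argument order: grep takes the pattern first, while the stringr functions take the string first.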

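With the newer stringr releases, str_like is also available. It is designed for SQL LIKE patterns (% and _ as wildcards) rather than regular expressions, so regex syntax such as | is outside its documented behaviour. A safer form passes one literal per call:
which(str_like(LETTERS, "F") | str_like(LETTERS, "Y"))
[1] 6 25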
Read more about the stringr updates that are linked below.
Hope this helps everyone.

Related

Regex get string between intervals underscores

I've seen a lot of similar questions, but I wasn't able to get the desired output.
I have a string means_variab_textimput_x2_200.txt and I want to catch ONLY what is between the second and third underscores: textimput
I'm using R, stringr, I've tried many things, but none solved the issue:
my_string <- "means_variab_textimput_x2_200.txt"
str_extract(my_string, '[_]*[^_]*[_]*[^_]*[_]*[^_]*')
"means_variab_textimput"
str_extract(my_string, '^(?:([^_]+)_){4}')
"means_variab_textimput_x2_"
str_extract(my_string, '[_]*[^_]*[_]*[^_]*[_]*[^_]*\\.') ## the closer I got was this
"_textimput_x2_200."
Any ideas? Ps: I'm VERY new to Regex, so details would be much appreciated :)
additional question: can I also get only a "part" of the word? let's say, instead of textimput only text but without counting the words? It would be good to know both possibilities
Several related questions were helpful, but I couldn't get the final expected results. Thanks in advance.
stringr uses ICU-based regular expressions. One option would be a regex lookbehind, but the prefix here does not have a fixed length, so (?<= wouldn't work. Another option is either to remove the unwanted substrings with str_remove, or to use str_replace: match the whole string, capture the third word (one or more characters that are not _, i.e. ([^_]+)), and replace everything with the backreference (\\1) to the captured word
library(stringr)
str_replace(my_string, "^[^_]+_[^_]+_([^_]+)_.*", "\\1")
[1] "textimput"
If we need only the substring
str_replace(my_string, "^[^_]+_[^_]+_([^_]{4}).*", "\\1")
[1] "text"
In base R, it is easier to split the string with strsplit and pick the third word by indexing
strsplit(my_string, "_")[[1]][3]
# [1] "textimput"
Or use perl = TRUE in regexpr
regmatches(my_string, regexpr("^([^_]+_){2}\\K[^_]+", my_string, perl = TRUE))
# [1] "textimput"
For the substring
regmatches(my_string, regexpr("^([^_]+_){2}\\K[^_]{4}", my_string, perl = TRUE))
[1] "text"
Following up on question asked in comment about restricting the size of the extracted word, this can easily be achieved using quantification. If, for example, you want to extract only the first 4 letters:
sub("[^_]+_[^_]+_([^_]{4}).*$", "\\1", my_string)
[1] "text"
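The split-and-index idea can be wrapped in a small helper when a different field is needed. This is a minimal sketch in base R; nth_field is a hypothetical name introduced here, not a function from stringr or base R.

```r
# Hypothetical helper: return the n-th separator-delimited field of each string.
nth_field <- function(x, n, sep = "_") {
  # fixed = TRUE treats sep as a literal string, not a regex
  parts <- strsplit(x, sep, fixed = TRUE)
  # Return NA for strings that have fewer than n fields
  vapply(parts,
         function(p) if (length(p) >= n) p[n] else NA_character_,
         character(1))
}

my_string <- "means_variab_textimput_x2_200.txt"
field <- nth_field(my_string, 3)   # the third field
field
substr(field, 1, 4)                # only the first four characters of it
```

Splitting avoids counting underscores inside a regex, at the cost of building the full list of fields for every string.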

Replace specific pattern (shortening notations) by full notation in R

I have a data frame of short forms like
Ann-e/i is the short form for Anne and Anni
How can I replace the pattern -e/i in the data frame by the full notations?
Another example is Matt-e/i for Matte and Matti.
Thanks in advance for any help!
x <- c("Ann-e/i", "Matt-e/i")
gsub("(^[a-zA-Z]+?)-([a-z])/([a-z])$", "\\1\\2 and \\1\\3", x)
[1] "Anne and Anni" "Matte and Matti"
Wimpel's suggestion using gsub from base R works well and is quite flexible. Another approach is provided by the package stringr from the tidyverse, which might be more intuitive.
library(stringr)
strings <- c("Ann-e/i", "Annerl", "Matt-e/i")
str_replace(strings, "(\\w+)-e/i", "\\1i or \\1e")
#> [1] "Anni or Anne" "Annerl" "Matti or Matte"
Created on 2021-11-08 by the reprex package (v2.0.1)
You'll find it helpful to learn about regular expressions (regex), if you're not already familiar with them. Since there are several varieties of regex with different syntax, here's a link that is specific to using it with stringr. https://stringr.tidyverse.org/articles/regular-expressions.html
If you have comma-separated values, you can do either of these, depending on your desired outcome:
Data:
string <- c("Annerl,Ann-e/i", "Matt-e/i")
First solution:
sub("(^\\w+)-(\\w)/(\\w)$", "\\1\\2 and \\1\\3", unlist(strsplit(string, ",")))
# [1] "Annerl" "Anne and Anni" "Matte and Matti"
Second:
c(sub("(^\\w+),(\\w+)-(\\w)/(\\w)$", "\\1, \\2\\3 and \\2\\4", string[grepl(",", string)]),
sub("(^\\w+)-(\\w)/(\\w)$", "\\1\\2 and \\1\\3", string[grep(",", string, invert = TRUE)]))
# [1] "Annerl, Anne and Anni" "Matte and Matti"
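The capture-group approach above can be folded into one function that leaves ordinary names untouched. This is a sketch under assumptions: expand_short is a hypothetical name, and the pattern assumes every short form looks like stem-suffix/suffix with lowercase suffixes of any length.

```r
# Hypothetical helper: expand "stem-a/b" into "stema and stemb".
expand_short <- function(x) {
  pat <- "^([A-Za-z]+)-([a-z]+)/([a-z]+)$"
  first  <- sub(pat, "\\1\\2", x)   # stem + first suffix
  second <- sub(pat, "\\1\\3", x)   # stem + second suffix
  # Only rewrite the elements that really are short forms
  ifelse(grepl(pat, x), paste(first, "and", second), x)
}

expand_short(c("Ann-e/i", "Annerl", "Matt-e/i"))
```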

Partial string extraction with stringr - getting NA

I'm trying to extract part of a string using stringr.
I'm aiming for the output to be E5_1_C33 and E5_1_C23, but instead I'm getting NA.
Any help would be appreciated!
library(stringr)
mystring <- c("can_ComplianceWHOInfrastructurePol_E5_1_C33","can_ComplianceWHOInfrastructurePol_E5_1_C23")
str_extract(mystring, "A\\d_\\d_B\\d\\d$")
I slightly modified your line, since you need to match any letter, not only A and B:
str_extract(mystring, "[A-Za-z]\\d_\\d_[A-Za-z]\\d\\d$")
Here's a base R approach using gsub
> gsub(".*(\\w{2}_\\w{1}_\\w{3})$", "\\1", mystring)
[1] "E5_1_C33" "E5_1_C23"
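The original pattern returns NA because A and B in it are literal letters, which the strings do not contain at those positions. Below is a base R sketch of the letter-class version. It also shows why a range written as [A-z] is broader than it looks: it covers the ASCII characters between Z and a, including the underscore. perl = TRUE is used here so that ranges are compared by code point.

```r
mystring <- c("can_ComplianceWHOInfrastructurePol_E5_1_C33",
              "can_ComplianceWHOInfrastructurePol_E5_1_C23")

# letter, digit, "_", digit, "_", letter, two digits, end of string
pattern <- "[A-Za-z]\\d_\\d_[A-Za-z]\\d{2}$"
codes <- regmatches(mystring, regexpr(pattern, mystring, perl = TRUE))
codes

# The underscore sits between "Z" and "a" in ASCII, so [A-z] matches it
grepl("[A-z]", "_", perl = TRUE)
grepl("[A-Za-z]", "_", perl = TRUE)
```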

Using tidyr package in R, are we able to filter and extract links from a tibble?

Let's say I have this in my tibble,
Transcript
1 Hi i would like to find out more about http://mywebsite.com/internalfaq/faq/154200 please help
2 Hello my results were withheld at https://mywebsite.com/123 hope you can help
3 Hello my friend join me at https://mywebsite.com/456
I tried
links = data %>%
extract(Transcript, url.pattern)
but it's not giving me what I want. It's not returning me the list of links even though I supply the url pattern. It returns me the first word only. Is there something wrong here that I did?
Thanks in advance!
This is my url pattern: https://mywebsite.com/.*
The into argument of extract must be specified. Also, extract needs a capture group, so try adding parentheses to your regex.
url.pattern <- "(https://mywebsite.com/[^> | ]*)"
data %>%
extract(Transcript, into = 'link',regex = url.pattern)
you can use regmatches
regmatches(h,gregexpr("http.*?(\\d+)",h))
[[1]]
[1] "https://mywebsite.com/internalfaq/faq/154200" "http://mywebsite.com/internalfaq/faq/154200"
[[2]]
[1] "https://mywebsite.com/123" "https://mywebsite.com/123"
[[3]]
[1] "https://mywebsite.com/456"
This gives you the whole URLs. Here h is Transcript[,1]; it is a vector and not a data frame.
Since it seems the webpages are repeated, you can obtain only the first one in each string by using regexpr instead of gregexpr:
regmatches(h,regexpr("http.*?(\\d+)",h))
[1] "https://mywebsite.com/internalfaq/faq/154200" "https://mywebsite.com/123"
[3] "https://mywebsite.com/456"
You can also use the sub function with a backreference:
sub("(.*:)(.*\\d+)(.*)","https:\\2",h)
[1] "https://mywebsite.com/internalfaq/faq/154200" "https://mywebsite.com/123"
[3] "https://mywebsite.com/456"
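One caveat with regmatches and regexpr is that strings without a match are silently dropped, so the result can be shorter than the input. The sketch below keeps one entry per row. It assumes the plain-text transcripts exactly as shown in the question; h, url_pattern and links are illustrative names.

```r
h <- c(
  "Hi i would like to find out more about http://mywebsite.com/internalfaq/faq/154200 please help",
  "Hello my results were withheld at https://mywebsite.com/123 hope you can help",
  "Hello my friend join me at https://mywebsite.com/456"
)

# http or https, then everything up to the next whitespace character
url_pattern <- "https?://[^[:space:]]+"
m <- regexpr(url_pattern, h)

# Start with NA everywhere, then fill in the rows that had a match
links <- rep(NA_character_, length(h))
links[m > 0] <- regmatches(h, m)
links
```

Because links has the same length as h, it can be assigned back as a new column of the original data.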

Extract a part of string on a particular reference

I need to extract the number that comes after "&r=" in the link below.
http://asdf.com/product/eyewear/eyeglasses?Brand[]=Allen%20Solly&r=472020&ck-source=google-adwords&ck-campaign=eyeglasses-cat-brand-broad&ck-adgroup=eyeglasses-dersdc-cat-brand-broad&keyword={keyword}&matchtype={matchtype}&network={network}&creative={creative}&adposition={adposition}
Here's what I tried (c has my link stored in it):
sub(".*&r=", "",c)
"472020&ck-source=google-adwords&ck-campaign=eyeglasses-cat-brand-broad&ck-adgroup=eyeglasses-dersdc-cat-brand-broad&keyword={keyword}&matchtype={matchtype}&network={network}&creative={creative}&adposition={adposition}"
This only gives me the whole part of the string after "&r=".
I only need the number, i.e. 472020.
Any idea?
Here is how to get it using sub
sub(".*=(\\d+)&.*", "\\1", z)
#[1] "472020"
or
as.integer(sub(".*=(\\d+)&.*", "\\1", z))
#[1] 472020
For completeness sake, here it is with the base R regmatches/regexpr combo:
regmatches(z, regexpr("(?<=\\&r\\=)\\d+",z,perl=TRUE))
It uses the same Perl-flavoured regex as @akrun's stringr version. regexpr (or gregexpr if several matches of the same pattern are expected in the same string) matches the pattern, while regmatches extracts it (it is vectorized so several strings can be matched/extracted at once).
> as.integer(regmatches(z,regexpr("(?<=\\&r\\=)\\d+",z,perl=TRUE)))
#[1] 472020
We can use str_extract
library(stringr)
as.numeric(str_extract(z, "(?<=\\&r\\=)\\d+"))
#[1] 472020
If there are several matches use str_extract_all in place of str_extract
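As an alternative to a regex keyed on the digits, the query string can be split into key=value pairs and looked up by name. This is a rough sketch in base R: get_query_param is a hypothetical helper, z below is a shortened copy of the link from the question, and the code assumes a simple query string with no fragment and an = in every pair.

```r
# Hypothetical helper: look up one parameter in a URL's query string.
get_query_param <- function(url, name) {
  query <- sub("^[^?]*\\?", "", url)                  # drop everything up to "?"
  pairs <- strsplit(query, "&", fixed = TRUE)[[1]]    # "key=value" pieces
  keys  <- sub("=.*$", "", pairs)                     # text before the first "="
  vals  <- sub("^[^=]*=", "", pairs)                  # text after the first "="
  vals[keys == name][1]                               # NA if the key is absent
}

# Shortened version of the link from the question
z <- "http://asdf.com/product/eyewear/eyeglasses?Brand[]=Allen%20Solly&r=472020&ck-source=google-adwords"
r_value <- as.integer(get_query_param(z, "r"))
r_value
```

Looking the value up by key avoids accidentally picking up another numeric parameter that happens to appear earlier in the link.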

Resources