Extract substring using regular expression in R

Extract substring using regular expression in R - r

I am new to regular expression and have read http://www.gastonsanchez.com/Handling_and_Processing_Strings_in_R.pdf regex documents. I know similar questions have been posted previously, but I still had a difficult time trying to figuring out my case.
I have a vector of string filenames, try to extract substring, and save as new filenames. The filenames follow the the pattern below:
\w_\w_(substring to extract)_\d_\d_Month_Date_Year_Hour_Min_Sec_(AM or PM)
For example, ABC_DG_MS-15-0452-268_206_281_12_1_2017_1_53_11_PM, ABC_RE_SP56-01_A_206_281_12_1_2017_1_52_34_AM, the substring will be MS-15-0452-268 and SP56-01_A
I used
map(strsplit(filenames, '_'),3)
but failed, because the new filenames could have _, too.
I turned to regular expression for advanced matching, and come up with this
gsub("^[^\n]+_\\d_\\d_\\d_\\d_(AM | PM)$", "", filenames)
still did not get what I needed.

You may use
filenames <- c('ABC_DG_MS-15-0452-268_206_281_12_1_2017_1_53_11_PM', 'ABC_RE_SP56-01_A_206_281_12_1_2017_1_52_34_AM')
gsub('^(?:[^_]+_){2}(.+?)_\\d+.*', '\\1', filenames)
Which yields
[1] "MS-15-0452-268" "SP56-01_A"
The pattern here is
^ # start of the string
(?:[^_]+_){2} # not _, twice
(.+?) # anything lazily afterwards
_\\d+ # until there's _\d+
.* # consume the rest of the string
This pattern is replaced by the first captured group and hence the filename in question.

Call me a hack. But if that is guaranteed to be the format of all my strings, then I would just use strsplit to hack the name apart, then only keep what I wanted:
string <- 'ABC_DG_MS-15-0452-268_206_281_12_1_2017_1_53_11_PM'
string_bits <- strsplit(string, '_')[[1]]
file_name<- string_bits[3]
file_name
[1] "MS-15-0452-268"
And if you had a list of many file names, you could remove the explicit [[1]] use sapply() to get the third element of every one:
sapply(string_bits, "[[", 3)

Related

Ignore last "/" in R regex

Given the string "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/", I need to generate a regex filter so that it ignores the last char if it is an "/" .
I tried the following regex "(http:////)?compras\\.dados\\.gov\\.br.*\\?.*(?<!//)" as of regexr.com/4om61, but it doesn´t work when I run in R as:
regex_exp_R <- "(http:////)?compras\\.dados\\.gov\\.br.*\\?.*(?<!//)"
grep(regex_exp_R, "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/", perl = T, value = T)
I need this to work in pure regex and grep function, without using any string R package.
Thank you.
Simplified Case:
After important contributions of you all, one last issue remains.
Because I will use regex as an input in another friunction, the solution must work with pure regex and grep.
The remaining point is a very basic one: given the strings "a1bc/" or "a1bc", the regex must return "a1bc". Building on suggestions I received, I tried
grep(".*[^//]" ,"a1bc/", perl = T, value = T), but still get "a1bc/" instead of "a1bc". Any hints? Thank you.

If you want to return the string without the last / you can do this several ways. Below are a couple options using base R:
Using a back-reference in gsub() (sub() would work too here):
gsub("(.*?)/*$", "\\1", x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"
# or, adapting your original pattern
gsub("((http:////)?compras\\.dados\\.gov\\.br.*\\?.*?)/*$", "\\1", x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"
By position using ifelse() and substr() (this will proabbly be a little bit faster if scaling matters)
ifelse(substr(x, nchar(x), nchar(x)) == "/", substr(x, 1, nchar(x)-1), x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"
Data:
x <- "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/"

Use sub to remove a trailing /:
x <- c("a1bc/", "a2bc")
sub("/$", "", x)
This changes nothing on a string that does not end in /.
As others have pointed out, grep does not modify strings. It returns a numeric vector of indices of the matched strings or a vector of the (unmodified) matched items. It's usually used to subset a character vector.

You can use a negative look-behind at the end to ensure it doesn't end with the character you don't want (in this case, a /). The regex would then be:
.+(?<!\/)
You can view it here with your three input examples: https://regex101.com/r/XB9f7K/1/. If you only want it to match urls, then you would change the .+ part at the beginning to your url regex.

How about trying gsub("(.*?)/+$","\\1",s)?

Extracting substring from string if pattern varies [duplicate]

This question already has answers here:
Function to extract domain name from URL in R
(5 answers)
Closed 4 years ago.
I would like to extract names from weblinks using substr(). My problem is that the patterns vary slightly, so I am not sure how to account for the variances. Here is a sample:
INPUT:
list <- c("https://www.gatcoin.io/wp-content/uploads/2017/08/GATCOIN-Whitepaper_ENG-1.pdf",
"https://appcoins.io/pdf/appcoins_whitepaper.pdf",
"https://pareto.network/download/Pareto-Technical-White-Paper.pdf",
"http://betbox.ai/BetBoxBizWhitepaper.pdf",
"https://www.aidcoin.co/assets/documents/whitepaper.pdf")
What I want as Output
c("gatcoin", "appcoins", "pareto", "betbox", "aidcoin")
In my understanding I need to specify the start and end of the string to be extracted, but sometimes start would be "https://", while other times it would be "https://www."
How could I solve this?

You can do this easily with stringr...
library(stringr)
str_match(list, "\\/(www\\.)*(\\w+)\\.")[,3]
[1] "gatcoin" "appcoins" "pareto" "betbox" "aidcoin"
The regex extracts the first sequence of letters between a slash and an optional www., and the next dot.
The equivalent in base R is slightly messier...
sub(".+?\\/(?:www\\.)*(\\w+)\\..+", "\\1", list)
This adds the start and end of the string as well, replacing the whole lot with just the capture group you want. It sets the optional www. as a non-capturing group, as sub and str_match behave differently if the first group is not found.

list <- c("https://www.gatcoin.io/wp-content/uploads/2017/08/GATCOIN- Whitepaper_ENG-1.pdf",
"https://appcoins.io/pdf/appcoins_whitepaper.pdf",
"https://pareto.network/download/Pareto-Technical-White-Paper.pdf",
"http://betbox.ai/BetBoxBizWhitepaper.pdf",
"https://www.aidcoin.co/assets/documents/whitepaper.pdf")
pattern <- c("https://", "www.", "http://")
for(p in pattern) list <- gsub(p, "", list)
unlist(lapply(strsplit(list, "[.]"), function(x) x[1]))
[1] "gatcoin" "appcoins" "pareto" "betbox" "aidcoin"

You could use Regular Expressions. However, that is reinventing the wheel. People have thought about how to split URLs before so use an already existing function.
For example parse_url of the httr package. Or google "R parse URL" for alternatives.
urls <- list("https://www.gatcoin.io/wp-content/uploads/2017/08/GATCOIN-Whitepaper_ENG-1.pdf",
"https://appcoins.io/pdf/appcoins_whitepaper.pdf",
"https://pareto.network/download/Pareto-Technical-White-Paper.pdf",
"http://betbox.ai/BetBoxBizWhitepaper.pdf",
"https://www.aidcoin.co/assets/documents/whitepaper.pdf")
Use lapply to use parse_url for every element of urls
parsed <- lapply(urls, httr::parse_url)
Now you have a list of lists. Each element of the list parsed has multiple elements itself which contain the parts of the URL`.
Extract all the elements parsed[[...]]$hostname:
hostname <- sapply(parsed, function(e) e$hostname)
Split those by the dot and take the second last element:
domain <- strsplit(hostname, "\\.")
domain <- sapply(domain, function(d) d[length(d)-1])

This Regex captures the word after the ://(www.) .
(?::\/\/(?:www.)?)(\w+)

Match & Replace String, utilising the original string in the replacement, in R

I am trying to get to grips with the world of regular expressions in R.
I was wondering whether there was any simple way of combining the functionality of "grep" and "gsub"?
Specifically, I want to append some additional information to anything which matches a specific pattern.
For a generic example, lets say I have a character vector:
char_vec <- c("A","A B","123?")
Then lets say I want to append any letter within any element of char_vec with
append <- "_APPEND"
Such that the result would be:
[1] "A_APPEND" "A_APPEND B_APPEND" "123?"
Clearly a gsub can replace the letters with append, but this does not keep the original expression (while grep would return the letters but not append!).
Thanks in advance for any / all help!

It seems you are not familiar with backreferences that you may use in the replacement patterns in (g)sub. Once you wrap a part of the pattern with a capturing group, you can later put this value back into the result of the replacement.
So, a mere gsub solution is possible:
char_vec <- c("A","A B","123?")
append <- "_APPEND"
gsub("([[:alpha:]])", paste0("\\1", append), char_vec)
## => [1] "A_APPEND" "A_APPEND B_APPEND" "123?"
See this R demo.
Here, ([[:alpha:]]) matches and captures into Group 1 any letter and \1 in the replacement reinserts this value into the result.

Definatly not as slick as #Wiktor Stribiżew but here is what i developed for another method.
char_vars <- c('a', 'b', 'a b', '123')
grep('[A-Za-z]', char_vars)
gregexpr('[A-Za-z]', char_vars)
matches = regmatches(char_vars,gregexpr('[A-Za-z]', char_vars))
for(i in 1:length(matches)) {
for(found in matches[[i]]){
char_vars[i] = sub(pattern = found,
replacement = paste(found, "_append", sep=""),
x=char_vars[i])
}
}

regex - define boundary using characters & delimiters

I realize this is a rather simple question and I have searched throughout this site, but just can't seem to get my syntax right for the following regex challenges. I'm looking to do two things. First have the regex to pick up the first three characters and stop at a semicolon. For example, my string might look as follows:
Apt;House;Condo;Apts;
I'd like to go here
Apartment;House;Condo;Apartment
I'd also like to create a regex to substitute a word in between delimiters, while keep others unchanged. For example, I'd like to go from this:
feline;labrador;bird;labrador retriever;labrador dog; lab dog;
To this:
feline;dog;bird;dog;dog;dog;
Below is the regex I'm working with. I know ^ denotes the beginning of the string and $ the end. I've tried many variations, and am making substitutions, but am not achieving my desired out put. I'm also guessing one regex could work for both? Thanks for your help everyone.
df$variable <- gsub("^apt$;", "Apartment;", df$variable, ignore.case = TRUE)

Here is an approach that uses look behind (so you need perl=TRUE):
> tmp <- c("feline;labrador;bird;labrador retriever;labrador dog; lab dog;",
+ "lab;feline;labrador;bird;labrador retriever;labrador dog; lab dog")
> gsub( "(?<=;|^) *lab[^;]*", "dog", tmp, perl=TRUE)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
The (?<=;|^) is the look behind, it says that any match must be preceded by either a semi-colon or the beginning of the string, but what is matched is not included in the part to be replaced. The * will match 0 or more spaces (since your example string had one case where there was space between the semi-colon and the lab. It then matches a literal lab followed by 0 or more characters other than a semi-colon. Since * is by default greedy, this will match everything up to, but not including' the next semi-colon or the end of the string. You could also include a positive look ahead (?=;|$) to make sure it goes all the way to the next semi-colon or end of string, but in this case the greediness of * will take care of that.
You could also use the non-greedy modifier, then force to match to end of string or semi-colon:
> gsub( "(?<=;|^) *lab.*?(?=;|$)", "dog", tmp, perl=TRUE)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
The .*? will match 0 or more characters, but as few as it can get away with, stretching just until the next semi-colon or end of line.
You can skip the look behind (and perl=TRUE) if you match the delimiter, then include it in the replacement:
> gsub("(;|^) *lab[^;]*", "\\1dog", tmp)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
With this method you need to be careful that you only match the delimiter on one side (the first in my example) since the match consumes the delimiter (not with the look-ahead or look-behind), if you consume both delimiters, then the next will be skipped and only every other field will be considered for replacement.

I'd recommend doing this in two steps:
Split the string by the delimiters
Do the replacements
(optional, if that's what you gotta do) Smash the strings back together.
To split the string, I'd use the stringr library. But you can use base R too:
myString <- "Apt;House;Condo;Apts;"
# base R
splitString <- unlist(strsplit(myString, ";", fixed = T))
# with stringr
library(stringr)
splitString <- as.vector(str_split(myString, ";", simplify = T))
Once you've done that, THEN you can do the text substitution:
# base R
fixedApts <- gsub("^Apt$|^Apts$", "Apartment", splitString)
# with stringr
fixedApts <- str_replace(splitString, "^Apt$|^Apts$", "Apartment")
# then do the rest of your replacements
There's probabably a better way to do the replacements than regular expressions (using switch(), maybe?)
Use paste0(fixedApts, collapse = "") to collapse the vector into a single string at the end if that's what you need to do.

R: Why can't for loop or c() work out for grep function?

Thanks for grep using a character vector with multiple patterns, I figured out my own problem as well.
The question here was how to find multiple values by using grep function,
and the solution was either these:
grep("A1| A9 | A6")
or
toMatch <- c("A1", "A9", "A6")
matches <- unique (grep(paste(toMatch,collapse="|")
So I used the second suggestion since I had MANY values to search for.
But I'm curious why c() or for loop doesn't work out instead of |.
Before I researched the possible solution in stackoverflow and found recommendations above, I tried out two alternatives that I'll demonstrate below:
First, what I've written in R was something like this:
find.explore.l<-lapply(text.words.bl ,function(m) grep("^explor",m))
But then I had to 'grep' many words, so I tried out this
find.explore.l<-lapply(text.words.bl ,function(m) grep(c("A1","A2","A3"),m))
It didn't work, so I tried another one(XXX is the list of words that I'm supposed to find in the text)
for (i in XXX){
find.explore.l<-lapply(text.words.bl ,function(m) grep("XXX[i]"),m))
.......(more lines to append lines etc)
}
and it seemed like R tried to match XXX[i] itself, not the words inside.
Why can't c() and for loop for grep return right results?
Someone please let me know! I'm so curious :P

From the documentation for the pattern= argument in the grep() function:
Character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are allowed except for regexpr and gregexpr.
This confirms that, as #nrussell said in a comment, grep() is not vectorized over the pattern argument. Because of this, c() won't work for a list of regular expressions.
You could, however, use a loop, you just have to modify your syntax.
toMatch <- c("A1", "A9", "A6")
# Loop over values to match
for (i in toMatch) {
grep(i, text)
}
Using "XXX[i]" as your pattern doesn't work because it's interpreting that as a regular expression. That is, it will match exactly XXXi. To reference an element of a vector of regular expressions, you would simply use XXX[i] (note the lack of surrounding quotes).
You can apply() this, but in a slightly different way than you had done. You apply it to each regex in the list, rather than each text string.
lapply(toMatch, function(rgx, text) grep(rgx, text), text = text)
However, the best approach would be, as you already have in your post, to use
matches <- unique(grep(paste(toMatch, collapse = "|"), text))

Consider that:
XXX <- c("a", "b", "XXX[i]")
grep("XXX[i]", XXX, value=T)
character(0)
grep("XXX\\[i\\]", XXX, value=T)
[1] "XXX[i]"
What is R doing? It is using special rules for the first argument of grep. The brackets are considered special characters ([ and ]). I put in two backslashes to tell R to consider them regular brackets. And imgaine what would happen if I put that last expression into a for loop? It wouldn't do what I expected.
If you would like a for loop that goes through a character vector of possible matches, take out the quotes in the grep function.
#if you want the match returned
matches <- c("a", "b")
for (i in matches) print(grep(i, XXX, value=T))
[1] "a"
[1] "b"
#if you want the vector location of the match
for (i in matches) print(grep(i, XXX))
[1] 1
[1] 2
As the comments point out, grep(c("A1","A2","A3"),m)) is violating the grep required syntax.