This question already has answers here:
Function to extract domain name from URL in R
(5 answers)
Closed 4 years ago.
I would like to extract names from weblinks using substr(). My problem is that the patterns vary slightly, so I am not sure how to account for the variances. Here is a sample:
INPUT:
list <- c("https://www.gatcoin.io/wp-content/uploads/2017/08/GATCOIN-Whitepaper_ENG-1.pdf",
"https://appcoins.io/pdf/appcoins_whitepaper.pdf",
"https://pareto.network/download/Pareto-Technical-White-Paper.pdf",
"http://betbox.ai/BetBoxBizWhitepaper.pdf",
"https://www.aidcoin.co/assets/documents/whitepaper.pdf")
What I want as output:
c("gatcoin", "appcoins", "pareto", "betbox", "aidcoin")
As I understand it, I need to specify the start and end of the substring to extract, but sometimes the start would be "https://", while other times it would be "https://www.".
How could I solve this?
You can do this easily with stringr...
library(stringr)
str_match(list, "\\/(www\\.)*(\\w+)\\.")[,3]
[1] "gatcoin" "appcoins" "pareto" "betbox" "aidcoin"
The regex captures the first run of word characters that sits between a slash (plus an optional www.) and the next dot.
The equivalent in base R is slightly messier...
sub(".+?\\/(?:www\\.)*(\\w+)\\..+", "\\1", list)
This adds the start and end of the string as well, replacing the whole lot with just the capture group you want. It sets the optional www. as a non-capturing group, as sub and str_match behave differently if the first group is not found.
list <- c("https://www.gatcoin.io/wp-content/uploads/2017/08/GATCOIN-Whitepaper_ENG-1.pdf",
"https://appcoins.io/pdf/appcoins_whitepaper.pdf",
"https://pareto.network/download/Pareto-Technical-White-Paper.pdf",
"http://betbox.ai/BetBoxBizWhitepaper.pdf",
"https://www.aidcoin.co/assets/documents/whitepaper.pdf")
pattern <- c("https://", "www.", "http://")
for(p in pattern) list <- gsub(p, "", list)
unlist(lapply(strsplit(list, "[.]"), function(x) x[1]))
[1] "gatcoin" "appcoins" "pareto" "betbox" "aidcoin"
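A slightly safer variant of the same idea, sketched below, passes fixed = TRUE so that "www." is stripped as a literal string rather than as a regex in which the dot matches any character:

```r
urls <- c("https://www.gatcoin.io/wp.pdf", "http://betbox.ai/BetBox.pdf")
# strip the prefixes literally (fixed = TRUE) rather than as regexes
for (p in c("https://", "http://", "www.")) {
  urls <- gsub(p, "", urls, fixed = TRUE)
}
# the chunk before the first literal dot is the name we want
sapply(strsplit(urls, ".", fixed = TRUE), `[`, 1)
# [1] "gatcoin" "betbox"
```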
You could use regular expressions, but that would be reinventing the wheel. People have thought about how to split URLs before, so use an already existing function.
For example parse_url of the httr package. Or google "R parse URL" for alternatives.
urls <- list("https://www.gatcoin.io/wp-content/uploads/2017/08/GATCOIN-Whitepaper_ENG-1.pdf",
"https://appcoins.io/pdf/appcoins_whitepaper.pdf",
"https://pareto.network/download/Pareto-Technical-White-Paper.pdf",
"http://betbox.ai/BetBoxBizWhitepaper.pdf",
"https://www.aidcoin.co/assets/documents/whitepaper.pdf")
Use lapply to apply parse_url to every element of urls:
parsed <- lapply(urls, httr::parse_url)
Now you have a list of lists: each element of parsed is itself a list whose elements contain the parts of the URL.
Extract all the elements parsed[[...]]$hostname:
hostname <- sapply(parsed, function(e) e$hostname)
Split those by the dot and take the second last element:
domain <- strsplit(hostname, "\\.")
domain <- sapply(domain, function(d) d[length(d)-1])
This regex captures the word after the :// and an optional www. (with the dot escaped):
(?::\/\/(?:www\.)?)(\w+)
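A minimal base-R sketch of applying that pattern (perl = TRUE is used so the non-capturing group is handled reliably):

```r
urls <- c("https://www.gatcoin.io/a.pdf", "http://betbox.ai/b.pdf")
# group 1 is the word following "://" and an optional "www."
m <- regmatches(urls, regexec("://(?:www\\.)?(\\w+)", urls, perl = TRUE))
sapply(m, `[`, 2)
# [1] "gatcoin" "betbox"
```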
Hello there: I currently have a list of file names (hundreds) whose parts are separated by multiple "/". I would like to find the last "/" in each name and replace it with "/Old". A quick example of what I have tried:
I have managed to do it for a single file name in the list but can't seem to apply it to the whole list.
Test<- "Cars/sedan/Camry"
To find the last "/" in the name, I tried the following:
Last <- tail(gregexpr("/", Test)[[1]], n= 1)
str_sub(Test, Last, Last)<- "/Old"
Which gives me:
Test
[1] "Cars/sedan/OldCamry"
This is exactly what I need, but I am having trouble applying tail and gregexpr to my whole list of names so that it processes them all at once.
Thanks for any help!
Apologies for my poor formatting still adjusting.
If your file names are in a character vector you can use str_replace() from the stringr package for this:
items <- c(
"Cars/sedan/Camry",
"Cars/sedan/XJ8",
"Cars/SUV/Cayenne"
)
stringr::str_replace(items, pattern = "([^/]+$)", replacement = "Old\\1")
[1] "Cars/sedan/OldCamry" "Cars/sedan/OldXJ8" "Cars/SUV/OldCayenne"
Keeping a stringi function as an alternative.
If your dataframe is "df" and your text is in a column named "text":
library(dplyr)
library(stringi)
df %>%
  mutate(new_text = stringi::stri_replace_last_fixed(text, '/', '/Old'))
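If stringi is not available, a base-R sketch of the same "replace only the last /" idea uses a look-ahead to find the slash with no later slash after it (perl = TRUE required):

```r
paths <- c("Cars/sedan/Camry", "Cars/SUV/Cayenne")
# match the "/" that is followed by no further "/" (i.e. the last one)
sub("/(?=[^/]*$)", "/Old", paths, perl = TRUE)
# [1] "Cars/sedan/OldCamry" "Cars/SUV/OldCayenne"
```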
This question already has an answer here:
Replace every single character at the start of string that matches a regex pattern
(1 answer)
Closed 2 years ago.
I have a vector
test <- c("NNNCTCGTNNNGTCGTNN", "NNNNNCGTNNNGTCGTGN")
and I want to replace the run of N at the head of each element with the same number of "-" characters.
When I use gsub, it replaces the whole run with a single "-".
gsub("^N+", "-", test)
# [1] "-CTCGTNNNGTCGTNN" "-CGTNNNGTCGTGN"
But I want the result looks like this
# "---CTCGTNNNGTCGTNN", "-----CGTNNNGTCGTGN"
Is there any R function that can do this? Thanks for your patience and advice.
You can write:
test <- c("NNNCTCGTNNNGTCGTNN", "NNNNNCGTNNNGTCGTGN", "XNNNNNCGTNNNGTCGTGN")
gsub("\\GN", "-", test, perl = TRUE)
which returns:
"---CTCGTNNNGTCGTNN" "-----CGTNNNGTCGTGN" "XNNNNNCGTNNNGTCGTGN"
\G, which is supported by Perl (and by PCRE (PHP), Ruby, Python's PyPI regex engine and others), asserts that the current position is at the beginning of the string for the first match and at the end of the previous match thereafter.
If the string were "NNNCTCGTNNNGTCGTNN" the first three "N"'s would each be matched (and replaced with a hyphen by gsub), then the attempt to match "C" would fail, terminating the match and string replacement.
One approach would be to use the stringr functions, which support regex callbacks:
library(stringr)
test <- c("NNNCTCGTNNNGTCGTNN", "NNNNNCGTNNNGTCGTGN")
repl <- function(x) gsub("N", "-", x)
str_replace_all(test, "^N+", repl)
[1] "---CTCGTNNNGTCGTNN" "-----CGTNNNGTCGTGN"
The strategy here is to first match ^N+ to capture one or more leading N. Then, we pass that match to a callback function which replaces each N with a dash.
I am trying to get to grips with the world of regular expressions in R.
I was wondering whether there was any simple way of combining the functionality of "grep" and "gsub"?
Specifically, I want to append some additional information to anything which matches a specific pattern.
For a generic example, lets say I have a character vector:
char_vec <- c("A","A B","123?")
Then lets say I want to append any letter within any element of char_vec with
append <- "_APPEND"
Such that the result would be:
[1] "A_APPEND" "A_APPEND B_APPEND" "123?"
Clearly gsub can replace the letters with append, but that does not keep the original letter (while grep would find the letters but not append anything!).
Thanks in advance for any / all help!
It seems you are not familiar with backreferences that you may use in the replacement patterns in (g)sub. Once you wrap a part of the pattern with a capturing group, you can later put this value back into the result of the replacement.
So, a mere gsub solution is possible:
char_vec <- c("A","A B","123?")
append <- "_APPEND"
gsub("([[:alpha:]])", paste0("\\1", append), char_vec)
## => [1] "A_APPEND" "A_APPEND B_APPEND" "123?"
Here, ([[:alpha:]]) matches and captures into Group 1 any letter and \1 in the replacement reinserts this value into the result.
Definitely not as slick as @Wiktor Stribiżew's answer, but here is another method I developed.
char_vars <- c('a', 'b', 'a b', '123')
grep('[A-Za-z]', char_vars)
gregexpr('[A-Za-z]', char_vars)
matches = regmatches(char_vars,gregexpr('[A-Za-z]', char_vars))
for (i in 1:length(matches)) {
  for (found in matches[[i]]) {
    char_vars[i] <- sub(pattern = found,
                        replacement = paste(found, "_append", sep = ""),
                        x = char_vars[i])
  }
}
Given the string "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/", I need to generate a regex filter so that it ignores the last char if it is an "/" .
I tried the following regex "(http:////)?compras\\.dados\\.gov\\.br.*\\?.*(?<!//)" as in regexr.com/4om61, but it doesn't work when I run it in R:
regex_exp_R <- "(http:////)?compras\\.dados\\.gov\\.br.*\\?.*(?<!//)"
grep(regex_exp_R, "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/", perl = T, value = T)
I need this to work in pure regex and grep function, without using any string R package.
Thank you.
Simplified Case:
After important contributions from you all, one last issue remains.
Because I will use the regex as an input to another function, the solution must work with pure regex and grep.
The remaining point is a very basic one: given the strings "a1bc/" or "a1bc", the regex must return "a1bc". Building on suggestions I received, I tried
grep(".*[^//]" ,"a1bc/", perl = T, value = T), but still get "a1bc/" instead of "a1bc". Any hints? Thank you.
If you want to return the string without the last / you can do this several ways. Below are a couple options using base R:
Using a back-reference in gsub() (sub() would work too here):
gsub("(.*?)/*$", "\\1", x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"
# or, adapting your original pattern
gsub("((http:////)?compras\\.dados\\.gov\\.br.*\\?.*?)/*$", "\\1", x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"
By position using ifelse() and substr() (this will probably be a little bit faster if scaling matters):
ifelse(substr(x, nchar(x), nchar(x)) == "/", substr(x, 1, nchar(x)-1), x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"
Data:
x <- "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/"
Use sub to remove a trailing /:
x <- c("a1bc/", "a2bc")
sub("/$", "", x)
This changes nothing on a string that does not end in /.
As others have pointed out, grep does not modify strings. It returns a numeric vector of indices of the matched strings or a vector of the (unmodified) matched items. It's usually used to subset a character vector.
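A quick illustration of the difference:

```r
x <- c("a1bc/", "a2bc")
grep("/$", x)                # indices of the matches: 1
grep("/$", x, value = TRUE)  # the matched, unmodified strings: "a1bc/"
sub("/$", "", x)             # sub is what actually edits: "a1bc" "a2bc"
```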
You can use a negative look-behind at the end to ensure it doesn't end with the character you don't want (in this case, a /). The regex would then be:
.+(?<!\/)
You can view it here with your three input examples: https://regex101.com/r/XB9f7K/1/. If you only want it to match urls, then you would change the .+ part at the beginning to your url regex.
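In R this can be sketched with base functions (the look-behind needs perl = TRUE):

```r
x <- c("a1bc/", "a1bc")
# extract the longest prefix whose final character is not "/"
regmatches(x, regexpr(".+(?<!/)", x, perl = TRUE))
# [1] "a1bc" "a1bc"
```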
How about trying gsub("(.*?)/+$","\\1",s)?
I am new to regular expressions and have read the regex document at http://www.gastonsanchez.com/Handling_and_Processing_Strings_in_R.pdf. I know similar questions have been posted previously, but I still had a difficult time figuring out my case.
I have a vector of string filenames; I am trying to extract a substring from each and save it as a new filename. The filenames follow the pattern below:
\w_\w_(substring to extract)_\d_\d_Month_Date_Year_Hour_Min_Sec_(AM or PM)
For example, given ABC_DG_MS-15-0452-268_206_281_12_1_2017_1_53_11_PM and ABC_RE_SP56-01_A_206_281_12_1_2017_1_52_34_AM, the substrings would be MS-15-0452-268 and SP56-01_A.
I used
map(strsplit(filenames, '_'),3)
but failed, because the new filenames could have _, too.
I turned to regular expression for advanced matching, and come up with this
gsub("^[^\n]+_\\d_\\d_\\d_\\d_(AM | PM)$", "", filenames)
still did not get what I needed.
You may use
filenames <- c('ABC_DG_MS-15-0452-268_206_281_12_1_2017_1_53_11_PM', 'ABC_RE_SP56-01_A_206_281_12_1_2017_1_52_34_AM')
gsub('^(?:[^_]+_){2}(.+?)_\\d+.*', '\\1', filenames)
Which yields
[1] "MS-15-0452-268" "SP56-01_A"
The pattern here is
^ # start of the string
(?:[^_]+_){2} # one or more non-underscore characters followed by _, twice
(.+?) # anything lazily afterwards
_\\d+ # until there's _\d+
.* # consume the rest of the string
This pattern is replaced by the first captured group and hence the filename in question.
Call me a hack. But if that is guaranteed to be the format of all my strings, then I would just use strsplit to hack the name apart, then only keep what I wanted:
string <- 'ABC_DG_MS-15-0452-268_206_281_12_1_2017_1_53_11_PM'
string_bits <- strsplit(string, '_')[[1]]
file_name<- string_bits[3]
file_name
[1] "MS-15-0452-268"
And if you had a vector of many file names, you could drop the explicit [[1]] and use sapply() to take the third element of each split:
sapply(strsplit(filenames, '_'), "[[", 3)
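Applied to both example names from the question, this sketch shows the hack's limitation: it keeps only the piece before the third underscore, so it truncates a target that itself contains an underscore:

```r
filenames <- c('ABC_DG_MS-15-0452-268_206_281_12_1_2017_1_53_11_PM',
               'ABC_RE_SP56-01_A_206_281_12_1_2017_1_52_34_AM')
# third chunk of each split; note the "_A" of the second name is lost
sapply(strsplit(filenames, '_'), `[[`, 3)
# [1] "MS-15-0452-268" "SP56-01"
```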