How to get the text between two words in R?

I am trying to get the text between two words in a sentence.
For example the sentence is -
x <- "This is my first sentence"
Now I want the text between "This" and "first", which is "is my".
I have tried various R functions such as grep, grepl, pmatch, and str_split, but could not get exactly what I want.
The closest I have come is with gsub:
gsub(".*This\\s*|first*", "", x)
The output it gives is
[1] "is my sentence"
In reality, what I need is only
[1] "is my"
Any help would be appreciated.

You need .* after 'first' so that the second alternative also matches (and removes) the rest of the string:
gsub('^.*This\\s*|\\s*first.*$', '', x)
#[1] "is my"

Another approach using rm_between from the qdapRegex package.
library(qdapRegex)
rm_between(x, 'This', 'first', extract=TRUE)[[1]]
# [1] "is my"

Since this question is used as a reference, I'll add some possible solutions to build a complete overview. Both are based on a look-ahead/look-behind regex pattern.
base R
regmatches( x, gregexpr("(?<=This ).*(?= first)", x, perl = TRUE ) )
stringr
stringr::str_extract_all( x, "(?<=This ).+(?= first)" )
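The look-around idea above can be wrapped in a small reusable function. This is only a sketch: the name text_between is hypothetical, and it assumes the two markers are literal words that contain no regex metacharacters and are separated from the wanted text by a single space.

```r
# Hypothetical helper: return the text between two literal marker words.
# Assumes the markers contain no regex metacharacters.
text_between <- function(x, left, right) {
  # Lazy .*? stops at the first occurrence of the right-hand marker
  pattern <- paste0("(?<=", left, " ).*?(?= ", right, ")")
  # perl = TRUE is required for look-behind and look-ahead in base R
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

x <- "This is my first sentence"
text_between(x, "This", "first")
# [1] "is my"
```

Elements without a match come back as NA, so the result has the same length as the input.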

Related

R: How to split string into pieces

I'm trying to split tons of strings as below:
x = "�\001�\001�\001�\001�\001\002CN�\001\bShandong�\001\004Zibo�\002$ABCDEFGHIJK�\002\aIMG_HAS�\002�\002�\002�\002�\002�\002�\002�\002\02413165537405763268743�\002\001�\002�\002�\002�\003�\003�\003����\005�\003�\003�\003�\003"
into four pieces
'CN', 'Shandong', 'Zibo', 'ABCDEFGHIJK'
I've tried
stringr::str_split(x, '\\00.')
which returns the original x unchanged.
Also,
trimws(gsub("�\\00?", "", x, perl = T))
which only removes the unknown character �.
Could someone help me with this? Thanks for doing so.
You can try with str_extract_all :
stringr::str_extract_all(x, '[A-Za-z_]+')[[1]]
[1] "CN" "Shandong" "Zibo" "ABCDEFGHIJK" "IMG_HAS"
With base R :
regmatches(x, gregexpr('[A-Za-z_]+', x))[[1]]
Here we extract every run of upper-case letters, lower-case letters, or underscores. Everything else is ignored, so characters like � do not appear in the final output.
We can use strsplit from base R
setdiff(strsplit(x, "[^A-Za-z]+")[[1]], "")
#[1] "CN" "Shandong" "Zibo" "ABCDEFGHIJK" "IMG" "HAS"
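Both answers return more than the four requested pieces. A minimal sketch of trimming the result, using a shortened stand-in for the string in the question (control characters only, without the unknown bytes) and assuming the four wanted fields always come first:

```r
# Shortened stand-in for the question's string; the real data also
# contains undecodable bytes, which are left out here.
x <- "\001\002CN\001\bShandong\001\004Zibo\002$ABCDEFGHIJK\002\aIMG_HAS"

# Extract every run of letters or underscores
words <- regmatches(x, gregexpr("[A-Za-z_]+", x))[[1]]

# Keep only the first four pieces (assumption: they always come first)
head(words, 4)
# [1] "CN"          "Shandong"    "Zibo"        "ABCDEFGHIJK"
```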

Use regular expressions to extract specific characters

text <- c('d__Viruses|f__Closteroviridae|g__Closterovirus|s__Citrus_tristeza_virus',
'd__Viruses|o__Tymovirales|f__Alphaflexiviridae|g__Mandarivirus|s__Citrus_yellow_vein_clearing_virus',
'd__Viruses|o__Ortervirales|f__Retroviridae|s__Columba_palumbus_retrovirus')
I have tried but failed:
str_extract(text, pattern = 'f.*\\|')
How can I get
f__Closteroviridae
f__Alphaflexiviridae
f__Retroviridae
Any help will be highly appreciated!
Make the regex non-greedy and since you don't want "|" in final output use positive lookahead.
stringr::str_extract(text, 'f.*?(?=\\|)')
#[1] "f__Closteroviridae" "f__Alphaflexiviridae" "f__Retroviridae"
In base R, we can use sub :
sub('.*(f_.*?)\\|.*', '\\1', text)
#[1] "f__Closteroviridae" "f__Alphaflexiviridae" "f__Retroviridae"
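To see why the original attempt failed, it may help to compare the greedy and non-greedy patterns side by side on one string. This sketch uses base R only; the variable names are illustrative.

```r
one <- "d__Viruses|f__Closteroviridae|g__Closterovirus|s__Citrus_tristeza_virus"

# Greedy: .* runs as far as possible, so the match ends at the LAST pipe
greedy <- regmatches(one, regexpr("f.*\\|", one))

# Lazy plus look-ahead: stops before the FIRST pipe and excludes it
lazy <- regmatches(one, regexpr("f.*?(?=\\|)", one, perl = TRUE))

greedy
# [1] "f__Closteroviridae|g__Closterovirus|"
lazy
# [1] "f__Closteroviridae"
```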
For a base R solution, I would use regmatches along with gregexpr:
m <- gregexpr("\\bf__[^|]+", text)
as.character(regmatches(text, m))
[1] "f__Closteroviridae" "f__Alphaflexiviridae" "f__Retroviridae"
The advantage of using gregexpr as above is that should an input contain more than one f__ matching term, we could also capture it. For example:
x <- 'd__Viruses|f__Closteroviridae|g__Closterovirus|f__some_virus'
m <- gregexpr("\\bf__[^|]+", x)
regmatches(x, m)[[1]]
[1] "f__Closteroviridae" "f__some_virus"
Data:
text <- c('d__Viruses|f__Closteroviridae|g__Closterovirus|s__Citrus_tristeza_virus',
'd__Viruses|o__Tymovirales|f__Alphaflexiviridae|g__Mandarivirus|s__Citrus_yellow_vein_clearing_virus',
'd__Viruses|o__Ortervirales|f__Retroviridae|s__Columba_palumbus_retrovirus')

Extract string using `rm_between` function

I want to extract strings using rm_between function from the library(qdapRegex)
I need to extract the string between the second "|" and the word "_HUMAN".
I can't figure out how to select the second "|" and not the first.
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")
prots <- rm_between(example, '|', 'HUMAN', extract=TRUE)
Thank you!!
Another alternative using regmatches, regexpr and using perl=TRUE to make use of \K
^(?:[^|]*\|){2}\K[^|_]+(?=_HUMAN)
For example
regmatches(example, regexpr("^(?:[^|]*\\|){2}\\K[^|_]+(?=_HUMAN)", example, perl=TRUE))
Output
[1] "EIFCL" "EIF3C"
In your rm_between(example, '|', 'HUMAN', extract=TRUE) command, | matches the leftmost | and HUMAN matches the first HUMAN after it.
Note that the default value of the fixed argument is TRUE, so | and HUMAN are treated as literal chars.
You need to make the pattern a regex pattern by setting fixed=FALSE. However, ^(?:[^|]*\|){2} will not work as the left argument regex, because the qdapRegex package creates an ICU regex with lookarounds (since you use extract=TRUE, which sets include.markers to FALSE), namely (?<=^(?:[^|]*\|){2}).*?(?=HUMAN), and ICU does not allow an unbounded quantifier such as * inside a lookbehind.
As a workaround, you could use a constrained-width lookbehind by replacing * with a limiting quantifier that has a reasonably large max parameter. Say, if you do not expect more than 1000 chars between each pipe, you may use {0,1000}:
rm_between(example, '^(?:[^|]{0,1000}\\|){2}', '_HUMAN', extract=TRUE, fixed=FALSE)
# => [[1]]
# [1] "EIFCL"
#
# [[2]]
# [1] "EIF3C"
However, you really should think of using simpler approaches, like those described in other answers. Here is another variation with sub:
sub("^(?:[^|]*\\|){2}(.*?)_HUMAN.*", "\\1", example)
# => [1] "EIFCL" "EIF3C"
Details
^ - start of string
(?:[^|]*\\|){2} - two occurrences of any 0 or more non-pipe chars followed by a pipe char (so, matching up to and including the second |)
(.*?) - Group 1: any 0 or more chars, as few as possible
_HUMAN.* - _HUMAN and the rest of the string.
\\1 in the replacement keeps only the Group 1 value in the result.
A stringr variation:
stringr::str_match(example, "^(?:[^|]*\\|){2}(.*?)_HUMAN")[,2]
# => [1] "EIFCL" "EIF3C"
With str_match, the captures can be accessed easily, we do it with [,2] to get Group 1 value.
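If stringr is not available, base R can return capture groups in a similar way with regexec() and regmatches(). A minimal sketch using the same pattern (perl = TRUE in regexec() needs R 3.3.0 or later):

```r
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")

# regexec() records the full match and each capture group;
# regmatches() then returns one character vector per input string
m <- regmatches(example,
                regexec("^(?:[^|]*\\|){2}(.*?)_HUMAN", example, perl = TRUE))

# Element 1 is the full match, element 2 is Group 1
group1 <- vapply(m, function(g) g[2], character(1))
group1
# [1] "EIFCL" "EIF3C"
```

Strings that do not match yield NA in group1, which mirrors what str_match returns.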
This is not exactly what you asked for, but you can achieve the result with base R:
sub("^.*\\|([^\\|]+)_HUMAN.*$", "\\1", example)
This solution uses a regular expression.
"^.*\\|([^\\|]+)_HUMAN.*$" matches the entire character string; the greedy .* runs up to the last | before _HUMAN.
\\1 in the replacement refers to whatever was matched inside the first pair of parentheses.
Using regular gsub:
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")
gsub(".*?\\|.*?\\|(.*?)_HUMAN", "\\1", example)
#> [1] "EIFCL" "EIF3C"
The part (.*?) is replaced by itself as the replacement contains the back-reference \\1.
If you absolutely prefer qdapRegex you can try:
rm_between(example, '.{0,100}\\|.{0,100}\\|', '_HUMAN', fixed = FALSE, extract = TRUE)
The reason why we have to use .{0,100} instead of .*? is that the underlying stringi needs a maximum length for the look-behind pattern (i.e. the left argument in rm_between).
Just saying that you could easily just use sapply()/strsplit():
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")
unlist(sapply(strsplit(example, "|", fixed = TRUE),
function(item) strsplit(item[3], "_HUMAN", fixed = TRUE)))
# [1] "EIFCL" "EIF3C"
It splits each string on | and then splits the third piece of each result on _HUMAN.

str_replace (package stringr) cannot replace brackets in r?

I have a string, say
fruit <- "()goodapple"
I want to remove the brackets in the string. I decide to use stringr package because it usually can handle this kind of issues. I use :
str_replace(fruit,"()","")
But nothing is replaced, and the following is returned:
[1] "()good"
If I only want to replace the right half bracket, it works:
str_replace(fruit,")","")
[1] "(good"
However, the left half bracket does not work:
str_replace(fruit,"(","")
and the following error is shown:
Error in sub("(", "", "()good", fixed = FALSE, ignore.case = FALSE, perl = FALSE) :
invalid regular expression '(', reason 'Missing ')''
Does anyone know why this happens? How can I remove the "()" in the string, then?
Escaping the parentheses does it...
str_replace(fruit,"\\(\\)","")
# [1] "goodapple"
You may also want to consider exploring the "stringi" package, which has a similar approach to "stringr" but has more flexible functions. For instance, there is stri_replace_all_fixed, which would be useful here since your search string is a fixed pattern, not a regex pattern:
library(stringi)
stri_replace_all_fixed(fruit, "()", "")
# [1] "goodapple"
Of course, basic gsub handles this just fine too:
gsub("()", "", fruit, fixed=TRUE)
# [1] "goodapple"
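The underlying reason is that ( and ) are regex metacharacters that delimit a group, so the pattern "()" is an empty group that matches the empty string rather than the two characters. A short base R sketch of the options for matching them literally:

```r
fruit <- "()goodapple"

# "()" as a regex is an empty group: it matches nothing visible
unchanged <- sub("()", "", fruit, perl = TRUE)
unchanged
# [1] "()goodapple"

# Three ways to match the literal characters
escaped <- sub("\\(\\)", "", fruit)           # backslash-escape each one
classed <- sub("[(][)]", "", fruit)           # wrap each in a character class
literal <- sub("()", "", fruit, fixed = TRUE) # turn regex matching off

c(escaped, classed, literal)
# [1] "goodapple" "goodapple" "goodapple"
```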
The accepted answer works for your exact problem, but not for the more general problem:
my_fruits <- c("()goodapple", "(bad)apple", "(funnyapple")
str_replace(my_fruits,"\\(\\)","")
## "goodapple" "(bad)apple" "(funnyapple"
This is because the regex exactly matches a "(" followed by a ")".
Assuming you care only about bracket pairs, this is a stronger solution:
str_replace(my_fruits, "\\([^()]{0,}\\)", "")
## "goodapple" "apple" "(funnyapple"
Building off of MJH's answer, this removes all ( or ):
my_fruits <- c("()goodapple", "(bad)apple", "(funnyapple")
str_replace_all(my_fruits, "[()]", "")
[1] "goodapple" "badapple" "funnyapple"

How to Convert "space" into "%20" with R

As the title says, I'm trying to figure out how to convert the spaces between words to %20.
For example,
> y <- "I Love You"
How can I turn y into I%20Love%20You?
> y
[1] "I%20Love%20You"
Thanks a lot.
Another option would be URLencode():
y <- "I love you"
URLencode(y)
[1] "I%20love%20you"
gsub() is one option:
R> gsub(pattern = " ", replacement = "%20", x = y)
[1] "I%20Love%20You"
The function curlEscape() from the package RCurl gets the job done.
library('RCurl')
y <- "I love you"
curlEscape(urls=y)
[1] "I%20love%20you"
I like URLencode(), but be aware that it does not work as expected if your URL already contains a %20 together with a real space: by default the string is returned unchanged, and the repeated option re-encodes the existing % as %25, which is not what you want either.
In my case, I needed to run both URLencode() and gsub consecutively to get exactly what I needed, like so:
a = "encoded%20space/real space.csv"
URLencode(a)
#returns: "encoded%20space/real space.csv"
#note the spaces that are not transformed
URLencode(a, repeated=TRUE)
#returns: "encoded%2520space/real%20space.csv"
#note the %2520 in the first part
gsub(" ", "%20", URLencode(a))
#returns: "encoded%20space/real%20space.csv"
In this particular example, gsub() alone would have been enough, but URLencode() is of course doing more than just replacing spaces.
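An alternative sketch for partially encoded input is to decode first and then encode once, so that existing escapes and raw spaces end up in the same form. This assumes every % in the input starts a valid %XX escape; a stray literal % would not survive URLdecode() cleanly. The variable names are illustrative.

```r
partly <- "encoded%20space/real space.csv"

# By default URLencode() sees the existing %20 and returns the input as is
as_is <- utils::URLencode(partly)

# Decode to plain text first, then encode everything exactly once
normalised <- utils::URLencode(utils::URLdecode(partly))

as_is
# [1] "encoded%20space/real space.csv"
normalised
# [1] "encoded%20space/real%20space.csv"
```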

Resources