Extracting substring using R

I want to extract a substring (the description details) from the following strings:
string1 <- "#{self=https://somesite.atlassian.net/rest/api/2/status/1; description=The issue is open and ready for the assignee to start work on it.; iconUrl=https://somesite.atlassian.net/images/icons/statuses/open.png; name=Open; id=1; statusCategory=}"
string2 <- "#{self=https://somesite.atlassian.net/rest/api/2/status/10203; description=; iconUrl=https://somesite.atlassian.net/images/icons/statuses/generic.png; name=Full Curation; id=10203; statusCategory=}"
I am trying to get the following
ExtractedSubString1 = "The issue is open and ready for the assignee to start work on it."
ExtractedSubString2 = ""
I tried this:
library(stringr)
ExtractedSubString1 <- substr(string1, str_locate(string1, "description=")+12, str_locate(string1, "; iconUrl")-1)
ExtractedSubString2 <- substr(string2, str_locate(string2, "description=")+12, str_locate(string2, "; iconUrl")-1)
Looking for a better way to accomplish this.

Using only base R's sub and back referencing, you could do
sub(".*description=(.*?);.*", "\\1", c(string1, string2))
[1] "The issue is open and ready for the assignee to start work on it." ""
The ".*" matches any run of characters, "description=" is a literal match, and ".*?" also matches any run of characters, but the ? makes the match lazy rather than greedy. ";" is a literal, and the "()" capture the sub-expression that is lazily matched. The back reference "\\1" returns the sub-expression captured in the parentheses.
Using the base R functions regexec and regmatches gets a bit closer to the method in the OP. sapply with "[" is then used to extract the desired result.
sapply(regmatches(c(string1, string2),
regexec(".*description=(.*?);.*", c(string1, string2))),
"[", 2)
[1] "The issue is open and ready for the assignee to start work on it." ""
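If the same lookup is needed for several fields, the regexec/regmatches idea above can be wrapped in a small function. This is only a sketch: extract_field() is a hypothetical helper name, the strings are shortened stand-ins for the ones in the question, and it assumes that values never contain a ";" and that the key is safe to paste into a regex.

```r
# Shortened stand-ins for the strings in the question (assumed to be quoted).
string1 <- "#{self=https://somesite.atlassian.net/rest/api/2/status/1; description=The issue is open.; name=Open; id=1; statusCategory=}"
string2 <- "#{self=https://somesite.atlassian.net/rest/api/2/status/10203; description=; name=Full Curation; id=10203; statusCategory=}"

# extract_field() is a hypothetical helper, not part of base R.
# It returns the text between "key=" and the next ";" (or NA if the key is absent).
extract_field <- function(x, key) {
  pattern <- paste0(key, "=([^;]*)")
  hits <- regmatches(x, regexec(pattern, x))
  # Each hit is c(full match, capture group 1), or character(0) when nothing matched.
  vapply(hits,
         function(hit) if (length(hit) == 2) hit[2] else NA_character_,
         character(1), USE.NAMES = FALSE)
}

extract_field(c(string1, string2), "description")
extract_field(c(string1, string2), "name")
extract_field(string1, "priority")  # key not present, so NA
```

Unlike sub, which hands back the whole input when the pattern does not match, this variant makes a missing field visible as NA.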

You could try:
test.1 <- gsub("description=", "", strsplit(string1, "; ")[[1]][2])
test.2 <- gsub("description=", "", strsplit(string2, "; ")[[1]][2])
This splits each string on "; ", which divides it into 6 elements. The square brackets select the 2nd element, and gsub replaces "description=" with an empty string to remove it.
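Picking the 2nd element only works while description stays in second position. A slightly more defensive sketch of the same splitting idea turns the whole string into a named vector, so fields can be looked up by name. parse_fields() is a hypothetical helper, the string is a shortened stand-in for the one in the question, and the code assumes no value contains "; ".

```r
# Shortened stand-in for string1 in the question (assumed to be quoted).
string1 <- "#{self=https://somesite.atlassian.net/rest/api/2/status/1; description=The issue is open.; name=Open; id=1; statusCategory=}"

# parse_fields() is a hypothetical helper, not part of base R.
parse_fields <- function(x) {
  body  <- gsub("^#?[{]|[}]$", "", x)               # drop the leading "#{" and trailing "}"
  parts <- strsplit(body, "; ", fixed = TRUE)[[1]]  # one "key=value" piece per field
  keys  <- sub("=.*", "", parts)                    # text before the first "="
  vals  <- sub("^[^=]*=", "", parts)                # text after the first "="
  stats::setNames(vals, keys)
}

fields <- parse_fields(string1)
fields[["description"]]
fields[["name"]]
```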

Related

R retrieving strings with sub: Why this does not work?

I would like to extract parts of strings. The string is:
> (x <- 'ab/cd efgh "xyz xyz"')
[1] "ab/cd efgh \"xyz xyz\""
Now, I would like first to extract the first part:
> # get "ab/cd efgh"
> sub(" \"[/A-Za-z ]+\"","",x)
[1] "ab/cd efgh"
But I don't succeed in extracting the second part:
> # get "xyz xyz"
> sub("(\"[A-Za-z ]+\")$","\\1",x, perl=TRUE)
[1] "ab/cd efgh \"xyz xyz\""
What is wrong with this code?
Thanks for help.
Your last snippet does not work because you reinsert the whole match back into the result: (\"[A-Za-z ]+\")$ matches and captures ", 1+ letters and spaces, " into Group 1 and \1 in the replacement puts it back.
You may actually get the last part inside quotes by removing all chars other than " at the start of the string:
x <- 'ab/cd efgh "xyz xyz"'
sub('^[^"]+', "", x)
The sub here will find and replace just once: it matches the start of the string (with ^) followed by 1+ chars other than " (the [^"]+ negated character class).
To get this to work with sub, you have to match the whole string. The help file says
For sub and gsub return a character vector of the same length and with the same attributes as x (after possible coercion to character). Elements of character vectors x which are not substituted will be returned unchanged (including any declared encoding).
So to get this to work with your regex, prepend the sometimes risky catch-all ".*"
sub(".*(\"[A-Za-z ]+\")$","\\1",x, perl=TRUE)
[1] "\"xyz xyz\""
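An alternative to substituting is to extract the match directly with regexpr and regmatches, which avoids having to match the whole string. A small sketch, assuming the string contains exactly one quoted part:

```r
x <- 'ab/cd efgh "xyz xyz"'

# Replacing the match with itself leaves the string unchanged.
unchanged <- sub('("[A-Za-z ]+")$', "\\1", x)

# Extract the quoted part instead of substituting around it.
quoted <- regmatches(x, regexpr('"[^"]*"', x))

# Drop the quote characters if only the inner text is wanted.
inner <- gsub('"', "", quoted, fixed = TRUE)

unchanged
quoted
inner
```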

Retrieving a specific part of a string in R

I have the following vector of strings
[1] "/players/playerpage.htm?ilkidn=BRYANPHI01"
[2] "/players/playerpage.htm?ilkidhh=WILLIROB027"
[3] "/players/playerpage.htm?ilkid=THOMPWIL01"
I am looking for a way to retrieve the part of the string that is placed after the equal sign meaning I would like to get a vector like this
[1] "BRYANPHI01"
[2] "WILLIROB027"
[3] "THOMPWIL01"
I tried using substr, but for it to work I have to know exactly where the equal sign is placed in the string and where the part I want to retrieve ends.
We can use sub to match the zero or more characters that are not a = ([^=]*) followed by a = and replace it with ''.
sub("[^=]*=", "", str1)
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
data
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
Using stringr,
library(stringr)
word(str1, 2, sep = '=')
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
Using strsplit,
strsplit(str1, "=")[[1]][2]
# [1] "BRYANPHI01"
With Sotos comment to get results as vector:
sapply(str1, function(x){
strsplit(x, "=")[[1]][2]
})
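sapply over a character vector uses the input strings as names of the result. If a plain, unnamed character vector is preferred, one possible variation is to split once and then pull out the second piece with vapply (a sketch, assuming every string contains exactly one "="):

```r
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
          "/players/playerpage.htm?ilkidhh=WILLIROB027",
          "/players/playerpage.htm?ilkid=THOMPWIL01")

# strsplit is vectorised, so split all strings in one call.
parts <- strsplit(str1, "=", fixed = TRUE)

# Take the second piece of each split; vapply guarantees a character result.
ids <- vapply(parts, function(p) p[2], character(1))
ids
```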
Another solution based on regex, but extracting instead of substituting, which may be more efficient.
I use the stringi package, which is built on the ICU regex engine and supports look-behind (base R's regex functions support it only with perl = TRUE).
library(stringi)
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
stri_extract_all_regex(str1, pattern="(?<==).+$", simplify=TRUE)
(?<==) is a look-behind: regex will match only if preceded by an equal sign, but the equal sign will not be part of the match.
.+$ matches everything until the end. You could replace the dot with a more precise symbol if you are confident about the format of what you match. For example, '\w' matches any word character (letters, digits and underscore), so you could use "(?<==)\\w+$" (the \ must be escaped so you end up with \\w).
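For comparison, the same look-behind pattern can be used without extra packages, as long as perl = TRUE is set. This is a sketch of that base R equivalent; note that regmatches silently drops elements that have no match, so the result can be shorter than the input.

```r
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
          "/players/playerpage.htm?ilkidhh=WILLIROB027",
          "/players/playerpage.htm?ilkid=THOMPWIL01")

# (?<==) requires a preceding "=", \\w+$ takes the word characters up to the end.
ids <- regmatches(str1, regexpr("(?<==)\\w+$", str1, perl = TRUE))
ids
```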

Subsetting data with only entries within the parentheses

How can I subset data so that the description column contains only the entries within the parentheses?
data=
ID       description                                control
1814668  glycoprotein 2 (Gp2) (Fy2)                 LMN_2904435
1791634  claudin 10 (Cldn10), transcript variant 1  ILMN_1214954 NM
1790993  claudin 10 (Cldn10), transcript variant 2  ILMN_2515816
output
ID       description  control
1814668  Gp2, Fy2     LMN_2904435
1791634  Cldn10       ILMN_1214954 NM
1790993  Cldn10       ILMN_2515816
You could try
df2$description <- gsub('.*\\(([^)]+)\\).*', '\\1', df2$description)
Or use bracketXtract from qdap
library(qdap)
unlist(bracketXtract(df2$description, 'round'))
Or
library(qdapRegex)
unlist(rm_round(df2$description, extract=TRUE))
Update
Based on the new dataset "df2N",
df2N$description <- sapply(rm_round(df2N$description,
extract=TRUE),toString)
Or using str_extract
library(stringr)
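# A plain pattern string works here because stringr's regex engine supports
# lookarounds; the perl() wrapper comes from older stringr releases.
sapply(str_extract_all(df2N$description,
'(?<=\\()[^)]+(?=\\))'), toString)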
Probably not as great as @akrun's solutions, but here is another option, using the function gsub (twice...) from base R:
df2$description <- gsub("^,\\s|,\\s$",
"",
gsub("^[^(]*\\(|\\)[^()]*\\(|\\)[^(]*$",
", ",
df2$description, perl=TRUE))
#[1] "Gp2, Fy2" "Cldn10" "Cldn10"
First, it tells R to search for any of:
^[^(]*\\(: anything that is not an opening bracket, at the beginning of the string, ending with an opening bracket
\\)[^()]*\\(: a closing bracket followed by anything that is not a bracket, ending with an opening bracket
\\)[^(]*$: a closing bracket, followed by anything that is not an opening bracket, up to the end of the string
and to replace it with a comma followed by a space.
Second, it replaces the "comma followed by a space" at the beginning and at the end of the string with an empty string.
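Another base R route is to extract every parenthesised piece with gregexpr and regmatches and then join the pieces, rather than deleting everything around them. A sketch, using a made-up vector desc that mirrors the description column (the fourth entry is an added assumption to show the no-match case):

```r
desc <- c("glycoprotein 2 (Gp2) (Fy2)",
          "claudin 10 (Cldn10), transcript variant 1",
          "claudin 10 (Cldn10), transcript variant 2",
          "no brackets here")

# Lookbehind for "(" and lookahead for ")" keep the brackets out of the match.
hits <- regmatches(desc, gregexpr("(?<=\\()[^()]+(?=\\))", desc, perl = TRUE))

# toString joins multiple hits with ", " and gives "" when there were none.
inside <- vapply(hits, toString, character(1))
inside
```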

How to get words that end with certain characters within each string r

I have a vector of strings that looks like:
str <- c("bills slashed for poor families today", "your calls are charged", "complaints dept awaiting refund")
I want to get all the words that end with the letter s and remove the s. I have tried:
gsub("s$","",str)
but it doesn't work because it tries to match with the strings that end with s instead of words. I'm trying to get an output that looks like:
[1] bill slashed for poor familie today
[2] your call are charged
[3] complaint dept awaiting refund
Any pointers as to how I can do this? Thanks
$ checks for the end of the string, not the end of a word.
To check for the word boundaries you should use \b
So:
gsub("s\\b", "", str)
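To see the difference between the two anchors side by side, here is a short sketch. The vector is called txt instead of str only to avoid masking the str function, and perl = TRUE is an added choice, not something the answer requires. Keep in mind that "s\\b" strips every word-final s, so short words such as "this" or "is" are affected too.

```r
txt <- c("bills slashed for poor families today",
         "your calls are charged",
         "complaints dept awaiting refund")

# "s$" only looks at the end of each whole string, so nothing changes here.
end_of_string <- gsub("s$", "", txt)

# "s\\b" looks at the end of each word.
end_of_word <- gsub("s\\b", "", txt, perl = TRUE)

end_of_string
end_of_word

# Caveat: non-plural words ending in s are trimmed as well.
gsub("s\\b", "", "this is", perl = TRUE)
```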
Here's a non base R solution:
library(rebus)
library(stringr)
plurals <- "s" %R% BOUNDARY
str_replace_all(str, pattern = plurals, replacement = "")
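You could also match the "s" together with the whitespace that follows it:
gsub(pattern = "s{1}(?>\\s)", " ", x = str, perl = TRUE)
I am no expert on regex, but note that (?>...) is an atomic group rather than a lookahead: this expression matches an "s" followed by a whitespace character, consumes both, and puts a space back. So an "s" at the end of a word is removed as long as whitespace follows it; an "s" at the very end of the string is left alone.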
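For reference, an actual positive lookahead is written (?=...). The sketch below checks for whitespace or the end of the string without consuming it, so the replacement can be empty; the example sentence is made up to show the end-of-string case.

```r
txt <- c("bills slashed for poor families today",
         "your calls are charged",
         "complaints dept awaiting refund")

# Remove an "s" only when whitespace or the end of the string comes next.
gsub("s(?=\\s|$)", "", txt, perl = TRUE)

# The two patterns differ when a string ends in "s".
with_lookahead <- gsub("s(?=\\s|$)", "", "he likes cats", perl = TRUE)
with_atomic    <- gsub("s{1}(?>\\s)", " ", "he likes cats", perl = TRUE)
with_lookahead
with_atomic
```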

Newb regex help: string with ampersand, using R

I know this should be simple, but I cannot return a subset of characters from a string using regex in R.
Foo <- 'propertyid=R206411&state_id='
Reg <- 'propertyid=(.*)&state_id='
Test <- grep(pattern=Reg, x=Foo, value=TRUE)
This captures the entire string for me and I want to capture just the R206411. The string I want to capture might vary in length and content, so the key is to have the capture begin after the '=' in propertyid=, and then end the capture once it sees the '&' in '&state_id'.
Thanks for your time.
You can use positive lookbehind and lookahead assertions like this:
Foo <- 'propertyid=R206411&state_id='
Reg <- gregexpr('(?<=propertyid=).*(?=&state_id=)', Foo, perl=TRUE)
regmatches(Foo, Reg)
Well, grep doesn't play well with captured groups which is what you are trying to do. What you probably want is gsub
Foo <- 'propertyid=R206411&state_id='
Reg <- 'propertyid=(.*)&state_id='
gsub(Reg, "\\1", Foo)
# [1] "R206411"
Here we take your pattern and replace the match with "\\1" (R requires us to escape the backslash, so \1 is written with a doubled backslash), which stands for the first capture group (which is what the parentheses indicate). Since the pattern matches the whole string, the whole string is replaced with just the captured portion.
The strapplyc function in the gsubfn package can do exactly that. Using Foo and Reg from the question:
> library(gsubfn)
>
> strapplyc(Foo, Reg, simplify = TRUE)
[1] "R206411"
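One thing to watch with the gsub approach is that a string which does not match the pattern comes back unchanged. If that matters, a base R sketch with regexec and regmatches can return NA instead. The extra elements of Foo and the tighter [^&]* capture are assumptions added for illustration.

```r
Foo <- c("propertyid=R206411&state_id=",
         "propertyid=X9&state_id=",
         "no id here")
Reg <- "propertyid=([^&]*)&state_id="

# gsub leaves the non-matching element as it was.
gsub(Reg, "\\1", Foo)

# regexec keeps the capture group separate, so a miss can be reported as NA.
hits <- regmatches(Foo, regexec(Reg, Foo))
ids <- vapply(hits,
              function(hit) if (length(hit) == 2) hit[2] else NA_character_,
              character(1))
ids
```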
