Subsetting data to keep only the entries within parentheses - R

How can I subset the data so that the description column contains only the entries within the parentheses?
data= ID description control
1814668 glycoprotein 2 (Gp2) (Fy2) LMN_2904435
1791634 claudin 10 (Cldn10), transcript variant 1 ILMN_1214954 NM
1790993 claudin 10 (Cldn10), transcript variant 2 ILMN_2515816
output
ID description control
1814668 Gp2, Fy2 LMN_2904435
1791634 Cldn10 ILMN_1214954 NM
1790993 Cldn10 ILMN_2515816

You could try
df2$description <- gsub('.*\\(([^)]+)\\).*', '\\1', df2$description)
Or use bracketXtract from qdap
library(qdap)
unlist(bracketXtract(df2$description, 'round'))
Or
library(qdapRegex)
unlist(rm_round(df2$description, extract=TRUE))
Update
Based on the new dataset "df2N",
df2N$description <- sapply(rm_round(df2N$description,
extract=TRUE),toString)
Or using str_extract
library(stringr)
sapply(str_extract_all(df2N$description,
'(?<=\\()[^)]+(?=\\))'), toString)
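The same extraction can also be sketched in base R without extra packages. The vector `desc` below is a hypothetical stand-in for the description column, built from the rows shown in the question.

```r
# Hypothetical stand-in for the description column in the question
desc <- c("glycoprotein 2 (Gp2) (Fy2)",
          "claudin 10 (Cldn10), transcript variant 1",
          "claudin 10 (Cldn10), transcript variant 2")

# gregexpr() locates every run of non-")" characters sitting between "(" and ")";
# the lookbehind and lookahead require perl = TRUE
hits <- regmatches(desc, gregexpr("(?<=\\()[^)]+(?=\\))", desc, perl = TRUE))

# Collapse the matches of each row into one comma-separated string
inside <- sapply(hits, toString)
print(inside)
```

Because gregexpr returns all matches per element, rows with several parenthesized entries keep every one of them.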

Probably not as neat as @akrun's solutions, but here is another option using gsub from base R (twice):
df2$description <- gsub("^,\\s|,\\s$",
"",
gsub("^[^(]*\\(|\\)[^()]*\\(|\\)[^(]*$",
", ",
df2$description, perl=TRUE))
#[1] "Gp2, Fy2" "Cldn10" "Cldn10"
First, it tells R to search for any of:
^[^(]*\\( : anything that is not an opening bracket, from the beginning of the string, ending with an opening bracket
\\)[^()]*\\( : a closing bracket, followed by anything that is not a bracket, ending with an opening bracket
\\)[^(]*$ : a closing bracket, followed by anything that is not an opening bracket, up to the end of the string
and to replace each match with a comma followed by a space.
Second, it replaces the "comma followed by a space" left at the beginning and at the end of the string with an empty string.
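To make the two passes visible, here is a small sketch that keeps the intermediate result. The vector `desc` and the names `step1` and `step2` are assumptions introduced only for illustration.

```r
# Hypothetical stand-in for the description column in the question
desc <- c("glycoprotein 2 (Gp2) (Fy2)",
          "claudin 10 (Cldn10), transcript variant 1")

# Pass 1: everything outside the parentheses (brackets included) becomes ", "
step1 <- gsub("^[^(]*\\(|\\)[^()]*\\(|\\)[^(]*$", ", ", desc, perl = TRUE)
print(step1)

# Pass 2: strip the ", " left over at the start and at the end
step2 <- gsub("^,\\s|,\\s$", "", step1)
print(step2)
```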

Related

Trying to extract a substring from a major strings containing special characters

I am trying to extract a substring from the following strings:
x <- "U1+ ^Eucalyptus baxteri s.s.,Eucalyptus viminalis subsp. cygnetensis\\^tree\\7\\i;U2 Acacia melanoxylon,Banksia marginata\\tree\\7\\i;M1 ^Leucopogon parviflorus,Spyridium parvifolium,Leucopogon lanceolatus var. lanceolatus\\^shrub\\4\\r;M2 Xanthorrhoea minor subsp. lutea,Pteridium esculentum,Billardiera scandens\\fern,grass-tree,vine\\3\\i;G1 ^Veronica calycina,Brunonia australis,Deyeuxia quadriseta,Dianella revoluta var. revoluta s.l.,Dichelachne crinita\\rush,^forb,tussock grass,other grass,sedge\\2\\c;G2 Lagenophora stipitata,Luzula meridionalis,Lomandra nana,Pimelea humilis,Acrotriche serrulata\\rush,heath shrub,forb,vine,sedge\\1\\i"
I want to extract \7\i, the substring that comes immediately after \^tree. However, I am having trouble using the sub() function, probably because of the special characters in the main string. Any help is appreciated.
sub(".*?\\^tree([^;]+).*", "\\1", x)
[1] "\\7\\i"
Note that each double backslash in the printed result is just R's escaped display of a single literal backslash.
cat(sub(".*?\\^tree([^;]+).*", "\\1", x))
\7\i
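A quick way to convince yourself of this is to count the characters. The sketch below uses a hypothetical variable `s` holding the extracted value.

```r
s <- "\\7\\i"          # typed with escapes in R source

n <- nchar(s)          # number of characters actually stored
print(n)

writeLines(s)          # shows the raw contents, like cat()

# Matching one literal backslash in a regex takes four backslashes in R source
has_backslash <- grepl("\\\\", s)
print(has_backslash)
```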

Removing leading question marks from first two words of data frame entries in R

I have a large data frame in R with column "NameFull" holding a text string made up of two words (binomial scientific name), followed by author name(s) and initials. Both have been corrupted (presumably UTF translation issues). This means that in the binomials any leading "x" (indicating hybrids) has been replaced with "?". Unfortunately any non-standard characters in the author names have also been replaced with "?" so I cannot just replace all "?" with x.
I simply want to replace any leading "?" in the first two words with "x" (I will then have to manually compose a list of corrected author names to replace the corrupted ones, unless anyone has a bright idea on that!).
Example chunk of df:
df.corrupt <- data.frame(Bing = 1:6, FullName = c("?Anthematricaria dominii Rohlena", "?Anthemimatricaria inolens P.Fourn.", "?Anthemimatricaria maleolens P.Fourn.", "Achillea ?albinea Bjel?i? & K.Mal?", "Achillea carpatica B?ocki ex Dubovik", "Floscaldasia azorelloides SklenĀ ? & H.Rob."), Bang = 1:6)
I've tried to shoehorn it into regex but can't get close. Any help appreciated!
As I understand it, you want to replace ? only if it occurs in word-initial position in either the first or the second word; if that's correct, this should work:
Data: (I've changed a few chars)
df.corrupt <- data.frame(Bing = 1:6,
FullName = c("?Anthematricaria dominii ?Rohlena",
"?Anthemimatricaria inolens P.Fourn.",
"?Anthemimatricaria maleolens ?P.Fourn.",
"Achillea ?albinea Bjel?i? & K.Mal?",
"Achillea carpatica B?ocki ex Dubovik",
"Floscaldasia azorelloides Sklen ? & H.Rob."), Bang = 1:6)
Solution:
library(stringr)
str_replace_all(df.corrupt$FullName, "^\\?|(?<=^(\\?)?\\b\\w{1,100}\\b\\s)\\?", "x")
[1] "xAnthematricaria dominii ?Rohlena" "xAnthemimatricaria inolens P.Fourn."
[3] "xAnthemimatricaria maleolens ?P.Fourn." "Achillea xalbinea Bjel?i? & K.Mal?"
[5] "Achillea carpatica B?ocki ex Dubovik" "Floscaldasia azorelloides Sklen ? & H.Rob."
This stringr solution puts x where ? occurs right at the start of the string (^) or (|), using a positive lookbehind (i.e., a non-consuming, non-capturing assertion), where it follows a whitespace character (\\s) that is preceded by a word of 1 to 100 \\w characters delimited by word boundaries (\\b), itself preceded by an optional ? at the start of the string.
We can check for a ? that follows a space or sits at the start of the string and replace it with 'x'. Note that this replaces a leading ? in every word, not only in the first two, so it also changes author names such as "?Rohlena".
trimws(gsub("(^|\\s)\\?", " x", df.corrupt$FullName))
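If only the first two words should be touched, a base R sketch could use two anchored sub calls. The helper name `fix_hybrid_marks` and the sample vector `nm` are hypothetical and not part of the original answers.

```r
# Hypothetical helper: replace a leading "?" in the first and second word only
fix_hybrid_marks <- function(x) {
  x <- sub("^\\?", "x", x)             # "?" at the very start of the string
  sub("^(\\S+\\s+)\\?", "\\1x", x)     # "?" starting the second word
}

nm <- c("?Anthematricaria dominii ?Rohlena",
        "Achillea ?albinea Bjel?i? & K.Mal?",
        "Floscaldasia azorelloides Sklen ? & H.Rob.")

fixed_names <- fix_hybrid_marks(nm)
print(fixed_names)
```

Anchoring both patterns at ^ means that any ? from the third word onward is left alone.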

How to take only that part of a string which occurs before a pattern of 2 dots?

I used a code of regular expressions which only took stuff before the 2nd occurrence of a dot. The following is the code:-
colnames(final1)[i] <- gsub("^([^.]*.[^.]*)..*$", "\\1", colnames(final)[i])
But now I realized I wanted to take the stuff before the first occurrence of a pattern of 2 dots.
I tried
gsub(",.*$", "", colnames(final)[i]) (changed the , to ..)
gsub("...*$", "", colnames(final)[i])
But it didn't work
The example to try on
KC1.Comdty...PX_LAST...USD......Comdty........
converted to
KC1.Comdty.
or
"LIT.US.Equity...PX_LAST...USD......Comdty........"
to
"LIT.US.Equity."
Can anyone suggest anything?
Thanks
We could use sub to match 2 or more dots followed by other characters and replace it with blank
sub("\\.{2,}.*", "", str1)
#[1] "KC1.Comdty" "LIT.US.Equity"
The . is a metacharacter that matches any character, so we need to escape it (\\.) to match a literal dot.
data
str1 <- c("KC1.Comdty...PX_LAST...USD......Comdty.......", "LIT.US.Equity...PX_LAST...USD......Comdty........")
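The effect of escaping the dot can be seen on a tiny example. The input `x` and the variable names below are assumptions chosen for illustration.

```r
x <- "a.b"

unescaped <- sub(".", "", x)                # "." matches any character, so "a" is removed
escaped   <- sub("\\.", "", x)              # the escaped dot matches only the literal "."
bracketed <- sub("[.]", "", x)              # a character class is an equivalent spelling
literal   <- sub(".", "", x, fixed = TRUE)  # fixed = TRUE turns regex syntax off

print(c(unescaped, escaped, bracketed, literal))
```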
Another solution with strsplit:
str1 <- c("KC1.Comdty...PX_LAST...USD......Comdty.......", "LIT.US.Equity...PX_LAST...USD......Comdty........")
sapply(strsplit(str1, "\\.{2}\\w"), "[", 1)
# [1] "KC1.Comdty." "LIT.US.Equity."
To also include the dot at the end with @akrun's answer, one can do:
sub("\\.{2}\\w.*", "", str1)
# [1] "KC1.Comdty." "LIT.US.Equity."

Extracting substring using R

I want to extract substring (description details) from the following strings:
string1 <- "#{self=https://somesite.atlassian.net/rest/api/2/status/1; description=The issue is open and ready for the assignee to start work on it.; iconUrl=https://somesite.atlassian.net/images/icons/statuses/open.png; name=Open; id=1; statusCategory=}"
string2 <- "#{self=https://somesite.atlassian.net/rest/api/2/status/10203; description=; iconUrl=https://somesite.atlassian.net/images/icons/statuses/generic.png; name=Full Curation; id=10203; statusCategory=}"
I am trying to get the following
ExtractedSubString1 = "The issue is open and ready for the assignee to start work on it."
ExtractedSubString2 = ""
I tried this:
library(stringr)
ExtractedSubString1 <- substr(string1, str_locate(string1, "description=")+12, str_locate(string1, "; iconUrl")-1)
ExtractedSubString2 <- substr(string2, str_locate(string2, "description=")+12, str_locate(string2, "; iconUrl")-1)
Looking for a better way to accomplish this.
Using only base R's sub and back referencing, you could do
sub(".*description=(.*?);.*", "\\1", c(string1, string2))
[1] "The issue is open and ready for the assignee to start work on it." ""
The ".*" matches any sequence of characters, "description=" is a literal match, and ".*?" also matches any sequence of characters, but the ? makes the match lazy rather than greedy. ";" is a literal, and the "()" capture the lazily matched sub-expression. The back reference "\\1" returns the sub-expression captured in the parentheses.
Using the base R functions regexec and regmatches gets a bit closer to the method in the OP. sapply with "[" is then used to extract the desired result.
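The difference between the lazy and the greedy capture shows up on a shortened input. The string `s` below is made up for illustration, and perl = TRUE is an assumption added here so that the lazy quantifier follows PCRE rules.

```r
# Made-up input with several "; " separators after the description
s <- "id=7; description=hello; name=Open; status="

# Lazy capture stops at the first ";" after "description="
lazy <- sub(".*description=(.*?);.*", "\\1", s, perl = TRUE)

# Greedy capture runs on to the last ";" in the string
greedy <- sub(".*description=(.*);.*", "\\1", s, perl = TRUE)

print(lazy)
print(greedy)
```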
sapply(regmatches(c(string1, string2),
regexec(".*description=(.*?);.*", c(string1, string2))),
"[", 2)
[1] "The issue is open and ready for the assignee to start work on it." ""
You could try:
test.1 <- gsub("description=", "", strsplit(string1, "; ")[[1]][2])
test.2 <- gsub("description=", "", strsplit(string2, "; ")[[1]][2])
This simply splits the string on "; ", which divides each string into 6 elements. The square brackets select the 2nd element, and gsub replaces "description=" with nothing to remove it.
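Building on the split idea, one could turn the whole string into a named vector and look fields up by name. This is only a sketch: the helper `parse_fields` is hypothetical, and it assumes that no field value itself contains "; ".

```r
# Hypothetical helper: split "key=value; key=value" pairs into a named character vector
parse_fields <- function(s) {
  s <- gsub("^#?[{]|[}]$", "", s)                 # drop the surrounding "#{" and "}"
  parts <- strsplit(s, "; ", fixed = TRUE)[[1]]   # one element per field
  vals <- sub("^[^=]*=", "", parts)               # text after the first "="
  names(vals) <- sub("=.*", "", parts)            # text before the first "="
  vals
}

string1 <- "#{self=https://somesite.atlassian.net/rest/api/2/status/1; description=The issue is open and ready for the assignee to start work on it.; iconUrl=https://somesite.atlassian.net/images/icons/statuses/open.png; name=Open; id=1; statusCategory=}"
string2 <- "#{self=https://somesite.atlassian.net/rest/api/2/status/10203; description=; iconUrl=https://somesite.atlassian.net/images/icons/statuses/generic.png; name=Full Curation; id=10203; statusCategory=}"

d1 <- parse_fields(string1)[["description"]]
d2 <- parse_fields(string2)[["description"]]
print(d1)
print(d2)
```

Looking fields up by name avoids relying on the description always being the second element.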

Using regular expression in string replacement

I have a broken csv file that I am attempting to read into R and repair using a regular expression.
The reason it is broken is that it contains some fields which include a comma but does not wrap those fields in double quotes. So I have to use a regular expression to find these fields, and wrap them in double quotes.
Here is an example of the data source:
DataField1,DataField2,Price
ID1,Value1,
ID2,Value2,$500.00
ID3,Value3,$1,250.00
So you can see that in the third row, the Price field contains a comma but it is not wrapped in double quotes. This breaks the read.table function.
My approach is to use readLines and str_replace_all to wrap the price with commas in double quotes. But I am not good at regular expressions and stuck.
vector <- readLines(file)
vector_temp <- str_replace_all(vector, ",\\$[0-9]+,\\d{3}\\.\\d{2}", ",\"\\$[0-9]+,\\d{3}\\.\\d{2}\"")
I want the output to be:
DataField1,DataField2,Price
ID1,Value1,
ID2,Value2,$500.00
ID3,Value3,"$1,250.00"
With this format, I can read into R.
Appreciate any help!
lines <- readLines(textConnection(object="DataField1,DataField2,Price
ID1,Value1,
ID2,Value2,$500.00
ID3,Value3,$1,250.00"))
library(stringi)
library(tidyverse)
stri_split_regex(lines, ",", n=3, simplify=TRUE) %>%
as_data_frame() %>%
docxtractr::assign_colnames(1)
## DataField1 DataField2 Price
## 1 ID1 Value1
## 2 ID2 Value2 $500.00
## 3 ID3 Value3 $1,250.00
From there you can use readr::write_csv() or write.csv().
The extra facilities in the stringi or stringr packages do not seem to be needed; gsub seems perfectly suited for this. You just need to understand capture groups with paired parentheses (brackets to Brits) and the double-backslash-n convention (\\1, \\2, ...) for referring to capture-group matches in the replacement argument:
txt <- "DataField1,DataField2,Price, extra
ID1,Value1, ,
ID2,Value2,$500.00,
ID3,Value3,$1,250.00, o"
vector<- gsub("([$][0-9]{1,3}([,]([0-9]{3})){0,10}([.][0-9]{0,2}))" , "\"\\1\"", readLines(textConnection(txt)) )
> read.csv(text=vector)
DataField1 DataField2 Price extra
1 ID1 Value1
2 ID2 Value2 $500.00
3 ID3 Value3 $1,250.00 o
This puts quotes around a specific pattern: a dollar sign and one to three digits, optionally followed by repeated groups of a comma and three digits, then a period and up to two digits. There might be earlier SO questions about formatting as "currency".
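To reproduce exactly the output requested in the question (quotes only around prices that contain a comma), a narrower pattern could be used. This is a sketch that assumes the price is always the last field and always has two decimals; the variable names are made up.

```r
lines <- c("DataField1,DataField2,Price",
           "ID1,Value1,",
           "ID2,Value2,$500.00",
           "ID3,Value3,$1,250.00")

# Quote a trailing price only when it has at least one thousands group;
# "\\1" in the replacement refers to the whole captured price
fixed <- sub("(\\$[0-9]{1,3}(,[0-9]{3})+\\.[0-9]{2})$", "\"\\1\"", lines)
print(fixed)

# The repaired lines can now be parsed as ordinary CSV
df <- read.csv(text = fixed, stringsAsFactors = FALSE)
print(df)
```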
Here are some solutions:
1) read.pattern This uses read.pattern in the gsubfn package to read in a file (assumed to be called sc.csv) such that the capture groups, i.e. the parenthesized portions, of the pattern are the fields. This will read in the file and process it all in one step so it is not necessary to use readLines first.
^(.*?), that begins the pattern will match everything from the start until the first comma. Then (.*?), will match to the next comma and finally (.*)$ will match everything else to the end. Normally * is greedy, i.e. it matches as much as it can, but the question mark after it makes it ungreedy. We needed to specify perl=TRUE so that it uses perl regular expressions since by default gsubfn uses tcl regular expressions based on Henry Spencer's regex parser which does not support *? . If you would rather have character columns instead of factor then add the as.is=TRUE argument to read.pattern.
The final line of code removes the $ and , characters from the Price column and converts it to numeric. (Omit this line if you actually want it formatted.)
library(gsubfn)
DF <- read.pattern("sc.csv", pattern = "^(.*?),(.*?),(.*)$", perl = TRUE, header = TRUE)
DF$Price <- as.numeric(gsub("[$,]", "", DF$Price)) ##
giving:
> DF
DataField1 DataField2 Price
1 ID1 Value1 NA
2 ID2 Value2 500
3 ID3 Value3 1250
2) sub This uses a very simple regular expression (just a single-character match) and no packages. Using vector as defined in the question, this replaces the first two commas with semicolons. The result can then be read in using sep = ";".
read.table(text = sub(",", ";", sub(",", ";", vector)), header = TRUE, sep = ";")
Add the line marked ## in (1) if you want numeric prices.
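Put together as a self-contained sketch, approach (2) plus the numeric conversion might look like this. The inline `vector` stands in for the result of readLines on the real file, which is an assumption made so the example runs on its own.

```r
# Stand-in for readLines(file) from the question
vector <- c("DataField1,DataField2,Price",
            "ID1,Value1,",
            "ID2,Value2,$500.00",
            "ID3,Value3,$1,250.00")

# Each sub() call replaces only the first remaining comma, so two calls turn
# the two field separators into semicolons and leave the price comma alone
semi <- sub(",", ";", sub(",", ";", vector))

DF <- read.table(text = semi, header = TRUE, sep = ";", stringsAsFactors = FALSE)

# Strip "$" and "," and convert; the empty price becomes NA
DF$Price <- as.numeric(gsub("[$,]", "", DF$Price))
print(DF)
```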

Resources