Selecting specific letters from a character string after a specific symbol - r

I need to extract all the characters before each "(" and combine them with ";"
stringr::word(pilist, 2, sep = '(\\s*|\\')
pilist = "pi1(tag1,tag2);pi2(tag3,tag4,tag5);"
I expect the output as
"pi1;pi2"

You can try this pattern
\([^)]+\)|;+$
Note: write the escape character as \\ or \ depending on your regex engine (inside an R string literal it is \\).
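As a minimal sketch of how that pattern could be applied in R (assuming the pilist string from the question; the name cleaned is only illustrative), the backslashes are doubled inside the R string literal:

```r
# Input string from the question
pilist <- "pi1(tag1,tag2);pi2(tag3,tag4,tag5);"

# Remove every "(...)" group and any trailing ";"
# \\( and \\) are literal parentheses; [^)]+ is anything up to the closing one
cleaned <- gsub("\\([^)]+\\)|;+$", "", pilist)
cleaned
```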

If your string has exactly the structure shown, we can use gsub to remove everything between round brackets, along with the trailing ;
gsub("\\(.*?\\)|;$", "", pilist)
#[1] "pi1;pi2"
However, following your description, it can also be done by extracting the words we want instead of removing the rest, using str_extract_all
paste0(stringr::str_extract_all(pilist, "(\\w+)(?=\\(.*\\))")[[1]], collapse = ";")
#[1] "pi1;pi2"

Related

Delete string parts within delimiter

I have a string as"dfgdf" sa"2323":
a <- "as\"dfgdf\" sa\"2323\""
The delimiter (same for the start and the end) here is ". So what I want is to get a string where everything within the delimiters is deleted, but not the delimiters themselves. So the end result string should look like as"" sa""
You could match " and forget what is matched using \K
Then use a negated character class matching any char except " or a whitespace character, and use a lookahead to assert " to the right.
Use perl=TRUE to enable Perl-like regular expressions.
a <- "as\"dfgdf\" sa\"2323\""
gsub('"\\K[^"\\s]+(?=")', "", a, perl=TRUE)
Output
[1] "as\"\" sa\"\""
R demo
Here is another base R option using paste0 + strsplit
s <- paste0(paste0(unlist(strsplit(a, '"\\w+"')), '""'), collapse = "")
which gives
> s
[1] "as\"\" sa\"\""
> cat(s)
as"" sa""
Here is one option that uses regex lookarounds to match a word (\\w+) that is preceded by a double quote and followed by one, and replaces it with blank ("")
cat(gsub('(?<=")\\w+(?=")', "", a, perl = TRUE), "\n")
#as"" sa""
Or without regex lookaround
cat(gsub('"\\w+"', '""', a), "\n")
#as"" sa""
I also found a way with stringr library:
library(stringr)
a <- "as\"dfgdf\" sa\"2323\""
result <- str_replace_all(a, "\".*?\"", "\"\"")
cat(result)

Erase comma and apostrophe in character R

I want to remove the comma and the apostrophe, but keep the decimal point, in the following character string. After that, I want to convert it to numeric.
I have this:
characterExample <- "234'564,900.99"
I want 234564900.99
I tried the following, but it does not work:
result <- gsub("[:punct:].","", characterExample)
Another option is to explicitly remove the characters you want to remove:
gsub("[',]", "", characterExample)
#[1] "234564900.99"
An option is to not match the digits or the . by using ^ within the square bracket
gsub("[^0-9.]+","", characterExample)
#[1] "234564900.99"
Or another option is to make use of SKIP/FAIL for the ., while matching the rest of the punct
gsub("(\\.)(*SKIP)(*F)|[[:punct:]]+", "", characterExample, perl = TRUE)
#[1] "234564900.99"
NOTE: Both solutions make sure that it matches any punct characters other than the . and replace with blank ("")
You can also use the pipe symbol (alternation) like this:
gsub(",|'","", characterExample)
#[1] "234564900.99"
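The question also asks for a numeric value at the end. A short sketch of that last step, assuming the cleaned string contains only digits and a decimal point (the names cleaned and num are illustrative):

```r
characterExample <- "234'564,900.99"

# Strip the apostrophe and comma, keep the decimal point
cleaned <- gsub("[',]", "", characterExample)

# Convert to numeric; as.numeric() returns NA (with a warning)
# if anything non-numeric remains in the string
num <- as.numeric(cleaned)

# print() shows 7 significant digits by default,
# so ask for more to see the decimals
print(num, digits = 12)
```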

Extract all text after last occurrence of a special character

I have the string in R
BLCU142-09|Apodemia_mejicanus
and I would like to get the result
Apodemia_mejicanus
Using the stringr R package, I have tried
str_replace_all("BLCU142-09|Apodemia_mejicanus", "[[A-Z0-9|-]]", "")
# [1] "podemia_mejicanus"
which is almost what I need, except that the A is missing.
You can use
sub(".*\\|", "", x)
This will remove all text up to and including the last pipe char. See the regex demo. Details:
.* - any zero or more chars as many as possible
\| - a | char (| is a special regex metacharacter that is an alternation operator, so it must be escaped, and since string literals in R can contain string escape sequences, the | is escaped with a double backslash).
See the R demo online:
x <- c("BLCU142-09|Apodemia_mejicanus", "a|b|c|BLCU142-09|Apodemia_mejicanus")
sub(".*\\|", "", x)
## => [1] "Apodemia_mejicanus" "Apodemia_mejicanus"
We can match one or more characters that are not a | ([^|]+) from the start (^) of the string followed by | in str_remove to remove that substring
library(stringr)
str_remove(str1, "^[^|]+\\|")
#[1] "Apodemia_mejicanus"
If we also include [A-Z] in the character class, it matches every upper case letter, including the A in Apodemia, and replaces it with blank ("") as in the OP's str_replace_all
data
str1 <- "BLCU142-09|Apodemia_mejicanus"
You can always choose to _extract rather than _remove:
s <- "BLCU142-09|Apodemia_mejicanus"
stringr::str_extract(s,"[[:alpha:]_]+$")
## [1] "Apodemia_mejicanus"
Depending on how permissive you want to be, you could also use [[:alpha:]]+_[[:alpha:]]+ as your target.
I would keep it simple (regexpr finds the first |, so this assumes the string contains only one):
substring(my_string, regexpr("|", my_string, fixed = TRUE) + 1L)
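If the string can contain several | characters and only the text after the last one is wanted, a base R sketch without regex could look like this (my_string here is example data, and last_pipe and result are illustrative names):

```r
my_string <- c("BLCU142-09|Apodemia_mejicanus",
               "a|b|c|BLCU142-09|Apodemia_mejicanus")

# gregexpr() returns all match positions per element; keep the last one
last_pipe <- sapply(gregexpr("|", my_string, fixed = TRUE),
                    function(m) m[length(m)])

# Take everything after the last literal "|"
result <- substring(my_string, last_pipe + 1L)
result
```

With fixed = TRUE the | is treated as a plain character, so no escaping is needed.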

How to take only that part of a string which occurs before a pattern of 2 dots?

I used a regular expression that only kept the part before the 2nd occurrence of a dot. The following is the code:
colnames(final1)[i] <- gsub("^([^.]*.[^.]*)..*$", "\\1", colnames(final)[i])
But now I realized I wanted to take the part before the first occurrence of a pattern of 2 dots.
I tried
gsub(",.*$", "", colnames(final)[i]) (changed the , to ..)
gsub("...*$", "", colnames(final)[i])
But it didn't work
The example to try on
KC1.Comdty...PX_LAST...USD......Comdty........
converted to
KC1.Comdty.
or
"LIT.US.Equity...PX_LAST...USD......Comdty........"
to
"LIT.US.Equity."
Can anyone suggest anything?
Thanks
We could use sub to match 2 or more dots followed by other characters and replace it with blank
sub("\\.{2,}.*", "", str1)
#[1] "KC1.Comdty" "LIT.US.Equity"
The . is a metacharacter implying any character. So, we need to escape (\\.) to get the literal meaning of the character
data
str1 <- c("KC1.Comdty...PX_LAST...USD......Comdty.......", "LIT.US.Equity...PX_LAST...USD......Comdty........")
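To see why the unescaped attempt from the question fails, here is a small comparison on one of the example names (x, unescaped, and escaped are illustrative names):

```r
x <- "KC1.Comdty...PX_LAST...USD......Comdty........"

# Unescaped: each "." matches any character,
# so the pattern matches the whole string and everything is removed
unescaped <- gsub("...*$", "", x)

# Escaped: "\\." matches a literal dot,
# so only the part from the first run of 2+ dots onward is removed
escaped <- sub("\\.{2,}.*", "", x)

unescaped  # empty string
escaped    # "KC1.Comdty"
```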
Another solution with strsplit:
str1 <- c("KC1.Comdty...PX_LAST...USD......Comdty.......", "LIT.US.Equity...PX_LAST...USD......Comdty........")
sapply(strsplit(str1, "\\.{2}\\w"), "[", 1)
# [1] "KC1.Comdty." "LIT.US.Equity."
To also include the dot at the end with @akrun's answer, one can do:
sub("\\.{2}\\w.*", "", str1)
# [1] "KC1.Comdty." "LIT.US.Equity."

Removing the second "|" on the last position

Here are some examples from my data:
a <-c("sp|Q9Y6W5|","sp|Q9HB90|,sp|Q9NQL2|","orf|NCBIAAYI_c_1_1023|",
"orf|NCBIACEN_c_10_906|,orf|NCBIACEO_c_5_1142|",
"orf|NCBIAAYI_c_258|,orf|aot172_c_6_302|,orf|aot180_c_2_405|")
For a: The individual strings can contain even more entries of "sp|" and "orf"
The results have to be like this:
[1] "sp|Q9Y6W5" "sp|Q9HB90,sp|Q9NQL2" "orf|NCBIAAYI_c_1_1023"
"orf|NCBIACEN_c_10_906,orf|NCBIACEO_c_5_1142"
"orf|NCBIAAYI_c_258,orf|aot172_c_6_302,orf|aot180_c_2_405"
So the aim is to remove the last "|" for each "sp|" and "orf|" entry. It seems that "|" is a special challenge because it is a metacharacter in regular expressions. Furthermore, the length and composition of the "orf|" entries vary a lot. The only things they have in common are "orf|" or "sp|" at the beginning and "|" in the last position. I tried different things with gsub(), and also with the stringr package, regexpr(), and [:punct:], but nothing really worked. Maybe it was just the wrong combination.
We can use gsub to match the | that is followed by a , or is at the end ($) of the string and replace with blank ("")
gsub("[|](?=(,|$))", "", a, perl = TRUE)
#[1] "sp|Q9Y6W5"
#[2] "sp|Q9HB90,sp|Q9NQL2"
#[3] "orf|NCBIAAYI_c_1_1023"
#[4] "orf|NCBIACEN_c_10_906,orf|NCBIACEO_c_5_1142"
#[5] "orf|NCBIAAYI_c_258,orf|aot172_c_6_302,orf|aot180_c_2_405"
Or we split by ",", remove the last character with substr, and paste the list elements together
sapply(strsplit(a, ","), function(x) paste(substr(x, 1, nchar(x)-1), collapse=","))
An easy alternative that might work. You need to escape the "|" using "\\|".
# Input
a <-c("sp|Q9Y6W5|","sp|Q9HB90|,sp|Q9NQL2|","orf|NCBIAAYI_c_1_1023|",
"orf|NCBIACEN_c_10_906|,orf|NCBIACEO_c_5_1142|",
"orf|NCBIAAYI_c_258|,orf|aot172_c_6_302|,orf|aot180_c_2_405|")
# Expected output
b <- c("sp|Q9Y6W5", "sp|Q9HB90,sp|Q9NQL2", "orf|NCBIAAYI_c_1_1023" ,
"orf|NCBIACEN_c_10_906,orf|NCBIACEO_c_5_1142" ,
"orf|NCBIAAYI_c_258,orf|aot172_c_6_302,orf|aot180_c_2_405")
res <- gsub("\\|,", ",", gsub("\\|$", "", a))
all(res == b)
#[1] TRUE
You could construct a single regex call to gsub, but this is simple and easy to understand. The inner gsub looks for | at the end of the string and removes it. The outer gsub looks for |, and replaces it with ,.
You do not have to use a PCRE regex here as all you need can be done with the default TRE regex (if you specify perl=TRUE, the pattern is compiled with a PCRE regex engine and is sometimes slower than TRE default regex engine).
Here is the single simple gsub call:
gsub("\\|(,|$)", "\\1", a)
See the online R demo. No lookarounds are really necessary, as you see.
Pattern details
\\| - a literal | symbol (because if you do not escape it or put into a bracket expression it will denote an alternation operator, see the line below)
(,|$) - a capturing group (referenced to with \1 from the replacement pattern) matching either of the two alternatives:
, - a comma
| - or (the alternation operator)
$ - end of string anchor.
The \1 in the replacement string tells the regex engine to insert the contents stored in the capturing group #1 back into the resulting string (so, the commas are restored that way where necessary).
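A compact check of the capture-group approach, using a shortened version of the input from the question (res is an illustrative name):

```r
a <- c("sp|Q9Y6W5|", "sp|Q9HB90|,sp|Q9NQL2|", "orf|NCBIAAYI_c_1_1023|")

# "\\|" matches a literal pipe; "(,|$)" captures either the comma
# or the (empty) end-of-string match that follows it.
# "\\1" puts the captured text back, so commas survive and only the pipe is dropped.
res <- gsub("\\|(,|$)", "\\1", a)
res
```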
