Match all elements with a punctuation mark except asterisk in R [duplicate]

I have a vector vec, some of whose elements contain a punctuation mark. I want to return all elements that contain a punctuation mark, except those that contain an asterisk.
vec <- c("a,","abc","ef","abc-","abc|","abc*01")
> vec[grepl("[^*][[:punct:]]", vec)]
[1] "a," "abc-" "abc|" "abc*01"
Why does it return "abc*01" if there is a negation [^*] for it?

You can try nested grep calls, as below:
grep("\\*",grep("[[:punct:]]",vec,value = TRUE), value = TRUE,invert = TRUE) # nested `grep`s for double filtering
or
grep("[^\\*[:^punct:]]",vec,perl = TRUE, value = TRUE) # but this fails for a case like `abc*01|` (thanks to @Tim Biegeleisen for the feedback)
which gives
[1] "a," "abc-" "abc|"

You could use grepl here:
vec <- c("a,","abc-","abc|","abc*01")
vec[grepl("^(?!.*\\*).*[[:punct:]].*$", vec, perl=TRUE)]
[1] "a," "abc-" "abc|"
The regex pattern used, ^(?!.*\\*).*[[:punct:]].*$, only matches strings that do not contain any asterisk characters and that contain at least one punctuation character:
^ from the start of the string
(?!.*\*) assert that no * occurs anywhere in the string
.* match any content
[[:punct:]] match any single punctuation character (it cannot be *, because the lookahead has already ruled that out)
.* match any content
$ end of the string
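As a rough sketch of why the original pattern lets "abc*01" through, and of two ways around it: [^*][[:punct:]] only asks for some non-asterisk character followed by a punctuation character, and a pair inside "abc*01" satisfies that. The extra element "abc*01|" and the variable names below are assumptions added for illustration.

```r
vec <- c("a,", "abc", "ef", "abc-", "abc|", "abc*01", "abc*01|")

# What the original pattern actually matched inside "abc*01":
culprit <- regmatches("abc*01", regexpr("[^*][[:punct:]]", "abc*01"))
culprit
# "c*": a non-asterisk character followed by a punctuation character

# Option 1: two separate tests combined with logical operators
has_punct <- grepl("[[:punct:]]", vec)
has_star  <- grepl("*", vec, fixed = TRUE)  # fixed = TRUE treats * literally
two_step  <- vec[has_punct & !has_star]

# Option 2: one PCRE pattern with a negative lookahead
lookahead <- vec[grepl("^(?!.*\\*).*[[:punct:]]", vec, perl = TRUE)]

two_step
lookahead
```

Both options also reject "abc*01|", the case that trips up the negated character class.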

Related

Extract all text after last occurrence of a special character

I have the string in R
BLCU142-09|Apodemia_mejicanus
and I would like to get the result
Apodemia_mejicanus
Using the stringr R package, I have tried
str_replace_all("BLCU142-09|Apodemia_mejicanus", "[[A-Z0-9|-]]", "")
# [1] "podemia_mejicanus"
which is almost what I need, except that the A is missing.
You can use
sub(".*\\|", "", x)
This removes all text up to and including the last pipe character. Details:
.* - any zero or more chars, as many as possible
\| - a | char (| is a special regex metacharacter that is an alternation operator, so it must be escaped, and since string literals in R can contain string escape sequences, the | is escaped with a double backslash).
For example:
x <- c("BLCU142-09|Apodemia_mejicanus", "a|b|c|BLCU142-09|Apodemia_mejicanus")
sub(".*\\|", "", x)
## => [1] "Apodemia_mejicanus" "Apodemia_mejicanus"
With str_remove, we can match one or more characters that are not a | ([^|]+) from the start (^) of the string, followed by a |, and remove that substring:
library(stringr)
str_remove(str1, "^[^|]+\\|")
#[1] "Apodemia_mejicanus"
If [A-Z] is also part of the pattern, as in the OP's str_replace_all call, it matches every upper-case letter, including the A in Apodemia, and replaces it with a blank ("").
data
str1 <- "BLCU142-09|Apodemia_mejicanus"
You can always choose to _extract rather than _remove:
s <- "BLCU142-09|Apodemia_mejicanus"
stringr::str_extract(s,"[[:alpha:]_]+$")
## [1] "Apodemia_mejicanus"
Depending on how permissive you want to be, you could also use [[:alpha:]]+_[[:alpha:]]+ as your target.
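I would keep it simple (this keeps everything after the first |, which is all that is needed when the string contains a single |):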
substring(my_string, regexpr("|", my_string, fixed = TRUE) + 1L)
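The two base R approaches above differ once a string contains more than one pipe. A minimal sketch, assuming a made-up second input with two pipes:

```r
x <- c("BLCU142-09|Apodemia_mejicanus", "a|b|Apodemia_mejicanus")

# Greedy .* runs to the last pipe, so only the final field survives
after_last <- sub(".*\\|", "", x)

# regexpr() reports the position of the first pipe, so everything after it is kept
after_first <- substring(x, regexpr("|", x, fixed = TRUE) + 1L)

after_last
after_first
```

For the single-pipe string both calls return the same result; for the second string only the sub() version drops the middle field.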

Delete everything after second comma from string [duplicate]

I would like to remove everything after the second comma in a string (including the second comma). Here is an example:
x <- 'Day, Bobby, Jean, Gav'
gsub("(.*),.*", "\\1", x)
and it gives:
[1] "Day, Bobby, Jean"
while I want:
[1] "Day, Bobby"
regardless of the number of names that may exist in x.
Use
> x <- 'Day, Bobby, Jean, Gav'
> sub("^([^,]*,[^,]*),.*", "\\1", x)
[1] "Day, Bobby"
The ^([^,]*,[^,]*),.* pattern matches
^ - start of string
([^,]*,[^,]*) - Group 1: 0+ non-commas, a comma, and 0+ non-commas
,.* - a comma and the rest of the string.
The \1 in the replacement pattern will keep Group 1 value in the result.
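We can also use strsplit and then paste the first two pieces back together (splitting on a comma plus any following spaces, so that toString does not double the space):
toString(head(strsplit(x, ",\\s*")[[1]], 2))
#[1] "Day, Bobby"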
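The split-and-rejoin idea extends to any number of fields and to a whole vector of strings. A minimal sketch; keep_first_n is a hypothetical helper name, not a base R function, and it splits on the literal separator only:

```r
# Keep the first n separator-delimited fields of each string.
keep_first_n <- function(x, n, sep = ",") {
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts,
         function(p) paste(head(p, n), collapse = sep),
         character(1),
         USE.NAMES = FALSE)
}

x <- c("Day, Bobby, Jean, Gav", "Day, Bobby", "Day")
first_two <- keep_first_n(x, 2)
first_two
```

Strings with fewer than n fields are returned unchanged, because head() simply returns what is there.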

How do I remove suffix from a list of Ensembl IDs in R [duplicate]

I have a large list which contains expressed genes from many cell lines. Ensembl genes often come with version suffixes, but I need to remove them. I've found several references that describe how to do this, but they do not work for me, likely because of my data structure (I think it's a nested array within a list?). Can someone help me with the particulars of the code and with my understanding of my own data structures?
Here's some example data
>listOfGenes_version <- list("cellLine1" = c("ENSG001.1", "ENSG002.1", "ENSG003.1"), "cellLine2" = c("ENSG003.1", "ENSG004.1"))
>listOfGenes_version
$cellLine1
[1] "ENSG001.1" "ENSG002.1" "ENSG003.1"
$cellLine2
[1] "ENSG003.1" "ENSG004.1"
And what I would like to see is
>listOfGenes_trimmed
$cellLine1
[1] "ENSG001" "ENSG002" "ENSG003"
$cellLine2
[1] "ENSG003" "ENSG004"
Here are some things I tried that did not work:
>listOfGenes_trimmed <- str_replace(listOfGenes_version, pattern = ".[0-9]+$", replacement = "")
Warning message:
In stri_replace_first_regex(string, pattern, fix_replacement(replacement), :
argument is not an atomic vector; coercing
>listOfGenes_trimmed <- lapply(listOfGenes_version, gsub('\\..*', '', listOfGenes_version))
Error in match.fun(FUN) :
'gsub("\\..*", "", listOfGenes_version)' is not a function, character or symbol
Thanks so much!
An option would be to specify the pattern as a literal . (a metacharacter, so it is escaped) followed by one or more digits (\\d+) at the end ($) of the string, and replace the match with an empty string ("")
lapply(listOfGenes_version, sub, pattern = "\\.\\d+$", replacement = "")
#$cellLine1
#[1] "ENSG001" "ENSG002" "ENSG003"
#$cellLine2
#[1] "ENSG003" "ENSG004"
By default the pattern is treated as a regular expression, in which . is a metacharacter that matches any character, so we need to escape it to match a literal dot.
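The two failed attempts in the question fail for different reasons: str_replace was handed the whole list instead of a character vector, and lapply was handed the result of a gsub call instead of a function. A minimal sketch of the per-element version, where strip_version is a hypothetical helper name and [0-9] is used in place of \\d (equivalent for this data):

```r
listOfGenes_version <- list(
  cellLine1 = c("ENSG001.1", "ENSG002.1", "ENSG003.1"),
  cellLine2 = c("ENSG003.1", "ENSG004.1")
)

# lapply() needs a function as its second argument; it calls that function
# once per list element, passing the element's character vector.
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

listOfGenes_trimmed <- lapply(listOfGenes_version, strip_version)
str(listOfGenes_trimmed)
```

The result keeps the list structure and the cell line names; only the strings inside each element change.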

Extracting string between punctuation, when present

I'm trying to extract a string after a : or ; and before a ; if the 2nd punctuation is present, then to remove everything after a ; if present. Goal result is a number.
The current code is able to do between : and ; OR after : but cannot handle ; alone or : alone.
Also, gsub(|(OF 100); SEE NOTE) isn't working, and I'm not sure why the initial : isn't being excluded and needs the gsub at all.
test<-c("Score (ABC): 2 (of 100); see note","Amount of ABC; 30%","Presence of ABC: negative","ABC not tested")
#works for :/;
toupper((regmatches(toupper(test), gregexpr(":\\s* \\K.*?(?=;)", toupper(test), perl=TRUE))))
#works for :
test<-toupper((regmatches(toupper(test), gregexpr(":\\s* (.*)", toupper(test), perl=TRUE))))
#removes extra characters:
test<-gsub(": |(OF 100); SEE NOTE|%|; ","",test)
#Negative to numeric:
test[grepl("NEGATIVE|<1",test)]<-0
test
Expected result: 2 30 0
Here are some solutions.
The first two are base. The first only uses very simple regular expressions. The second is shorter and the regular expression is only a bit more complicated. In both cases we return NA if there is no match but you can replace NAs with 0 (using ifelse(is.na(x), 0, x) where x is the answer with NAs) afterwards if that is important to you.
The third is almost the same as the second but uses strapply in gsubfn. It returns 0 instead of NA.
1) read.table Replace all colons with semicolons and read it in as semicolon-separated fields. Pick off the second such field and remove the first non-digit and everything after it. Then convert what is left to numeric.
DF <- read.table(text = gsub(":", ";", test),
as.is = TRUE, fill = TRUE, sep = ";", strip.white = TRUE)
as.numeric(sub("\\D.*", "", DF$V2))
##[1] 2 30 NA
2) strcapture Match from the start characters which are not colon or semicolon and then match a colon or semicolon and then match a space and finally capture digits. Return the captured digits converted to numeric.
strcapture("^[^:;]+[;:] (\\d+)", test, list(num = numeric(0)))$num
##[1] 2 30 NA
3) strapply Using the same pattern as in (2) convert the match to numeric and return 0 if the match is empty.
library(gsubfn)
strapply(test, "^[^:;]+[;:] (\\d+)", as.numeric, simplify = TRUE, empty = 0)
## [1] 2 30 0
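For the base R variants, the NA-to-0 replacement mentioned above can be sketched as follows. This assumes the three-element version of test that the outputs above correspond to, and uses a data.frame prototype for strcapture:

```r
test <- c("Score (ABC): 2 (of 100); see note",
          "Amount of ABC; 30%",
          "Presence of ABC: negative")

# Capture the digits that directly follow "<delimiter><space>"
num <- strcapture("^[^:;]+[;:] (\\d+)", test,
                  data.frame(num = numeric(0)))$num

# Strings without a number after the delimiter come back as NA; map them to 0
num_zero <- ifelse(is.na(num), 0, num)
num_zero
```

Whether 0 is the right stand-in for "negative" is a decision about the data, which is why the NA step is kept separate from the extraction.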
Another approach:
out <- gsub('(^.+?[;:][^0-9]+)(\\d+)(.*$)|^.+', '\\2', test)
out[out == ''] <- 0
as.numeric(out)
## [1] 2 30 0
Per the OP's description (italics is mine):
extract a string after a : or ; and before a ; if the 2nd punctuation is present, then to remove everything after a ; if present. Goal result is a number.
I think some of the other suggestions may miss that italicized criteria. So here is the OP's test set with one extra condition at the end to test that:
test<-c( "Score (ABC): 2 (of 100); see note",
"Amount of ABC; 30%",
"Presence of ABC: negative",
"...and before a ; if the second punctuation is present, then remove everything after a ; if present [so 666 should not be returned]")
One-liner to return results as requested:
sub( pattern='.+?[:;]\\D*?[^;](\\d*).*?;*.*',
replacement='\\1',
x=test, perl=TRUE)
Results matching OP's request:
[1] "2" "30" "" ""
If the OP really wants an integer with zeros where no match is found, set the sub() replacement = '0\\1' and wrap with as.integer() as follows:
as.integer( sub( pattern='.+?[:;]\\D*?[^;](\\d*).*?;*.*',
replacement='0\\1',
x=test, perl=TRUE) )
Result:
[1] 2 30 0 0
Fully working online R (R 3.3.2) example:
https://ideone.com/TTuKzG
Regexp explanation
OP wants to find just one match in a string so the sub() function works just fine.
Technique for using sub() is to make a pattern that matches all strings, but use a capture group in the middle to capture zero or more digits if conditions around it are met.
The pattern .+?[:;]\\D*?[^;](\\d*).*?;*.* is read as follows
.+? Match any character (except for line terminators) + between one and unlimited times ? as few times as possible, expanding as needed (lazy)
[:;] Match a single character in the list between the square brackets, in this case : or ;
\\D Match any character that's NOT a digit (equal to [^0-9])
*? Quantifier * Matches between zero and unlimited times ? as few times as possible, expanding as needed (lazy)
[^;] The ^ hat as first character between square brackets means: Match a single character NOT present in the list between the square brackets, in this case match any character NOT ;
(\\d*) Everything between parentheses is a capturing group - this is the 1st capturing group: \\d* matches a digit (equal to [0-9]) between zero and unlimited times, as many times as possible (greedy)
;* Match the ; character * between zero and unlimited times [so ; does not have to be present but is matched if it is there: This is the key to excluding anything after the second delimiter as the OP requested]
.* Match any character * between zero and unlimited times, as many times as possible (greedy) [so picks up everything to the end of the line]
The replacement = \\1 refers to the 1st capture group in our pattern. We replace everything that was matched by the pattern with what we found in the capture group. \\d* can match no digits, so will return an empty string if there is no number found where we are expecting it.
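One caveat with the sub() technique: when the pattern does not match at all (for example, a string with neither : nor ;), sub() returns the input unchanged. A minimal sketch that guards against this with grepl(); extract_score is a hypothetical helper name, and returning NA for strings without a delimiter is an assumption about the desired behaviour:

```r
extract_score <- function(x) {
  pattern <- ".+?[:;]\\D*?[^;](\\d*).*?;*.*"
  hit <- grepl(pattern, x, perl = TRUE)
  out <- rep(NA_integer_, length(x))
  # Prefix the capture with 0 so an empty capture becomes 0 rather than ""
  out[hit] <- as.integer(sub(pattern, "0\\1", x[hit], perl = TRUE))
  out
}

scores <- extract_score(c("Score (ABC): 2 (of 100); see note",
                          "Amount of ABC; 30%",
                          "Presence of ABC: negative",
                          "ABC not tested"))
scores
```

Here the last string yields NA instead of being passed through to as.integer() as unmatched text.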

r: regex for containing pattern with negation

Suppose I have the following strings and want to use grep to see which match:
business_metric_one
business_metric_one_dk
business_metric_one_none
business_metric_two
business_metric_two_dk
business_metric_two_none
And so on for various other metrics. I want to only match the first one of each group (business_metric_one and business_metric_two and so on). They are not in an ordered list so I can't index and have to use grep. At first I thought to do:
.*metric.*[^_dk|^_none]$
But this doesn't seem to work. Any ideas?
You need to use a PCRE pattern to filter the character vector:
x <- c("business_metric_one","business_metric_one_dk","business_metric_one_none","business_metric_two","business_metric_two_dk","business_metric_two_none")
grep("metric(?!.*_(?:dk|none))", x, value=TRUE, perl=TRUE)
## => [1] "business_metric_one" "business_metric_two"
The metric(?!.*_(?:dk|none)) pattern matches
metric - a metric substring
(?!.*_(?:dk|none)) - a negative lookahead that fails the match if, after any 0+ chars other than line break chars, there is a _ followed by either dk or none.
NOTE: if you only need to match values that contain metric and do not end with _dk or _none, use a variation, metric.*$(?<!_dk|_none), where the (?<!_dk|_none) negative lookbehind fails the match if the string ends with either _dk or _none.
You can also do something like this:
grep("^([[:alpha:]]+_){2}[[:alpha:]]+$", string, value = TRUE)
# [1] "business_metric_one" "business_metric_two"
or use grepl to match dk and none, then negate the logical when you're indexing the original string:
string[!grepl("(dk|none)", string)]
# [1] "business_metric_one" "business_metric_two"
or, with a more specific pattern:
string[!grepl("business_metric_[[:alpha:]]+_(dk|none)", string)]
# [1] "business_metric_one" "business_metric_two"
Data:
string = c("business_metric_one","business_metric_one_dk","business_metric_one_none","business_metric_two","business_metric_two_dk","business_metric_two_none")
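To see why the pattern from the question matches nothing: [^_dk|^_none] is a character class, so it excludes single characters (_, d, k, |, ^, n, o, e), not the substrings _dk and _none. A minimal sketch contrasting it with an anchored suffix test; the variable names are assumptions:

```r
x <- c("business_metric_one", "business_metric_one_dk", "business_metric_one_none",
       "business_metric_two", "business_metric_two_dk", "business_metric_two_none")

# Every string here ends in e, k, or o, all of which the class excludes
class_hits <- grepl(".*metric.*[^_dk|^_none]$", x)
class_hits

# Negating an anchored suffix match keeps the plain metric names
kept <- x[grepl("metric", x) & !grepl("_(dk|none)$", x)]
kept
```

Anchoring with $ matters here: without it, a name that merely contains _dk or _none somewhere in the middle would also be dropped.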
