Unexpected regex behavior when executed in base R [duplicate]

As per this link, I wrote a regex that does not give the expected result when executed for a specific string in R:
string <- "0,9% BB"
regex <- "^ ?\\d+[\\d ,\\.]*[B-DF-HJ-NP-TV-Z\\/]*%?"
grep(regex, string, value = T, perl = T)
The result output is
[1] "0,9% BB"
instead of the desired result (the one output by the link):
[1] "0,9%"
What am I missing to get the desired output? Preferably base R, please.

This returns "0,9%" using only base R
string <- "0,9% BB"
regex <- "^ ?\\d+[\\d ,\\.]*[B-DF-HJ-NP-TV-Z\\/]*%?"
regmatches(x = string, m = regexpr(regex,string,perl = TRUE))
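To see the difference side by side, here is a quick sketch with a second, made-up input string ("12,3% CD" is hypothetical, not from the question):
strings <- c("0,9% BB", "12,3% CD")
grep(regex, strings, value = TRUE, perl = TRUE)            # whole matching elements: "0,9% BB" "12,3% CD"
regmatches(strings, regexpr(regex, strings, perl = TRUE))  # extracted matches only:  "0,9%" "12,3%"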

find names/text that end with certain pattern (using BASE R) [duplicate]

I'm trying to find all variables names that end with "DE".
I'm already using this "grep" to find variables that start with "TR"
names(df )[ grep( "^(TR)" , names(df) ) ]
Is there a grep way to find the end pattern.
Just an fyi, I know I can do it with "ends_with" but I'm trying to find a base r method.
names( df %>% select(ends_with( "DE" ) ) )
Thanks.
There is already endsWith in base R
names(df)[endsWith(names(df), "DE")]
A reproducible example:
> names(iris)[endsWith(names(iris), "Length")]
[1] "Sepal.Length" "Petal.Length"
Or with grep, use $ to specify the end of the string (also, with grep, we can use value = TRUE, which returns the names instead of the numeric index):
grep("DE$", names(df), value = TRUE)
# similar to
names(df)[grep("DE$", names(df)]
Or in a base R pipe
grep("DE$", names(df)) |>
`[`(names(df), i = _)
grep("Length", names(iris)) |>
`[`(names(iris), i = _)
[1] "Sepal.Length" "Petal.Length"

How to get the most frequent character within a character string? [duplicate]

Suppose the following character string:
test_string <- "A A B B C C C H I"
Is there any way to extract the most frequent value within test_string?
Something like:
extract_most_frequent_character(test_string)
Output:
#C
We can use scan to read the string as a vector of individual elements by splitting at the space, get the frequency count with table, return the named index that has the max count (which.max), and get its name:
extract_most_frequent_character <- function(x) {
  names(which.max(table(scan(text = x, what = '', quiet = TRUE))))
}
Testing:
extract_most_frequent_character(test_string)
[1] "C"
Or with strsplit
extract_most_frequent_character <- function(x) {
  names(which.max(table(unlist(strsplit(x, "\\s+")))))
}
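To make the individual steps visible, here is the same pipeline unrolled on the example string (a sketch, not an additional answer):
chars <- scan(text = test_string, what = '', quiet = TRUE)  # "A" "A" "B" "B" "C" "C" "C" "H" "I"
tab <- table(chars)                                         # counts: A 2, B 2, C 3, H 1, I 1
names(which.max(tab))                                       # "C"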
Here is another base R option (not as elegant as @akrun's answer):
> intToUtf8(names(which.max(table(utf8ToInt(gsub("\\s", "", test_string))))))
[1] "C"
One possibility involving stringr could be:
names(which.max(table(str_extract_all(test_string, "[A-Z]", simplify = TRUE))))
[1] "C"
Or marginally shorter:
names(which.max(table(str_extract_all(test_string, "[A-Z]")[[1]])))
Here is a solution using the stringr package, table, and which.max:
library(stringr)
test_string <- str_split(test_string, " ")
test_string <- table(test_string)
names(test_string)[which.max(test_string)]
[1] "C"

How to keep only specific punctuation mark in a column [duplicate]

In the column text, how is it possible to remove all punctuation marks but keep only the ?
data.frame(id = c(1), text = c("keep<>-??it--!##"))
expected output
data.frame(id = c(1), text = c("keep??it"))
A more general solution would be to use nested gsub commands that convert ? to a particular unusual string (like "foobar"), get rid of all punctuation, then write "foobar" back to ?:
gsub("foobar", "?", gsub("[[:punct:]]", "", gsub("\\?", "foobar", df$text)))
#> [1] "keep??it"
Using gsub you could do:
gsub("(\\?+)|[[:punct:]]","\\1",df$text)
[1] "keep??it"
gsub('[[:punct:] ]+', ' ', data) removes all punctuation, which is not what you want.
But this is:
library(stringr)
sapply(df, function(x) str_replace_all(x, "<|>|-|!|#|#",""))
id text
[1,] "1" "a"
[2,] "2" "keep??it"
Better IMO than other answers because there is no need for nesting, and it lets you define whichever characters to substitute.
Here's another solution using negative lookahead:
gsub("(?!\\?)[[:punct:]]", "", df$text, perl = T)
[1] "keep??it"
The negative lookahead asserts that the next character is not a ? and then matches any punctuation.
Data:
df <- data.frame(id = c(1), text = c("keep<>-??it--!##"))
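Another base R variation (a sketch, not from the original answers): a negated character class that keeps only alphanumerics, whitespace, and question marks:
gsub("[^[:alnum:][:space:]?]", "", df$text)
#> [1] "keep??it"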

Match all elements with punctuation mark except asterisk in r [duplicate]

I have a vector vec whose elements contain punctuation marks. I want to return all elements with a punctuation mark except the ones containing an asterisk.
vec <- c("a,","abc","ef","abc-","abc|","abc*01")
> vec[grepl("[^*][[:punct:]]", vec)]
[1] "a," "abc-" "abc|" "abc*01"
Why does it return "abc*01" if there is a negation [^*] for it?
That element matches because [^*][[:punct:]] finds "c*" in "abc*01": the "c" satisfies [^*] and the "*" itself supplies the punctuation. To exclude it, maybe you can try grep like below
grep("\\*",grep("[[:punct:]]",vec,value = TRUE), value = TRUE,invert = TRUE) # nested `grep`s for double filtering
or
grep("[^\\*[:^punct:]]",vec,perl = TRUE, value = TRUE) # but this will fail for case `abc*01|` (thanks for feedback from #Tim Biegeleisen)
which gives
[1] "a," "abc-" "abc|"
You could use grepl here:
vec <- c("a,","abc-","abc|","abc*01")
vec[grepl("^(?!.*\\*).*[[:punct:]].*$", vec, perl=TRUE)]
[1] "a," "abc-" "abc|"
The regex pattern used, ^(?!.*\\*).*[[:punct:]].*$, will only match content that does not contain any asterisk characters while also containing at least one punctuation character:
^ from the start of the string
(?!.*\*) asserts that no * occurs anywhere in the string
.* match any content
[[:punct:]] match any single punctuation character (but not *)
.* match any content
$ end of the string
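For readers who prefer two simple tests over one pattern, an equivalent base R sketch (not part of the original answers) combines two grepl() calls:
vec[grepl("[[:punct:]]", vec) & !grepl("*", vec, fixed = TRUE)]
[1] "a,"   "abc-" "abc|"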

How do I remove suffix from a list of Ensembl IDs in R [duplicate]

I have a large list which contains expressed genes from many cell lines. Ensembl genes often come with version suffixes, but I need to remove them. I've found several references that describe this here or here, but they will not work for me, likely because of my data structure (I think it's a nested array within a list?). Can someone help me with the particulars of the code and with my understanding of my own data structures?
Here's some example data
>listOfGenes_version <- list("cellLine1" = c("ENSG001.1", "ENSG002.1", "ENSG003.1"), "cellLine2" = c("ENSG003.1", "ENSG004.1"))
>listOfGenes_version
$cellLine1
[1] "ENSG001.1" "ENSG002.1" "ENSG003.1"
$cellLine2
[1] "ENSG003.1" "ENSG004.1"
And what I would like to see is
>listOfGenes_trimmed
$cellLine1
[1] "ENSG001" "ENSG002" "ENSG003"
$cellLine2
[1] "ENSG003" "ENSG004"
Here are some things I tried that did not work:
>listOfGenes_trimmed <- str_replace(listOfGenes_version, pattern = ".[0-9]+$", replacement = "")
Warning message:
In stri_replace_first_regex(string, pattern, fix_replacement(replacement), :
argument is not an atomic vector; coercing
>listOfGenes_trimmed <- lapply(listOfGenes_version, gsub('\\..*', '', listOfGenes_version))
Error in match.fun(FUN) :
'gsub("\\..*", "", listOfGenes_version)' is not a function, character or symbol
Thanks so much!
An option would be to specify the pattern as . (a metacharacter, so it needs to be escaped) followed by one or more digits (\\d+) at the end ($) of the string, and replace with blank (''):
lapply(listOfGenes_version, sub, pattern = "\\.\\d+$", replacement = "")
#$cellLine1
#[1] "ENSG001" "ENSG002" "ENSG003"
#$cellLine2
#[1] "ENSG003" "ENSG004"
The . is a metacharacter that matches any character, so we need to escape it to match the literal value, since the pattern is treated as a regex by default.
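For completeness, the lapply() attempt in the question fails because gsub(...) is evaluated immediately instead of being passed as a function; wrapping it in an anonymous function (a minimal sketch) gives the same result as above:
lapply(listOfGenes_version, function(x) gsub("\\..*", "", x))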
