extract regular expressions from collapsed characters (including "|") [duplicate] - r

I want to detect (and then extract) month names from text using str_detect and str_extract.
For this, I create an object containing all month names and abbreviations.
m <- paste(c(month.name, month.abb), collapse = "|")
> m
[1] "January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
Then, I want to detect any of the entries occurring as a single word (surrounded by word boundaries):
stringr::str_detect(c("inJan", "Jan"), str_glue("\\b{m}\\b"))
This, however, returns TRUE TRUE (I expect FALSE TRUE, as the first is not a single word).
I suspect this is due to the collapsing of the list, as stringr::str_detect(c("inJan", "Jan"), str_glue("\\bJan\\b")) returns the expected FALSE TRUE.
I need to detect occurrences of m, however. What's the best way to go about this?
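The suspicion above is right in spirit: "|" has the lowest precedence in a regular expression, so in "\\bJanuary|February|...|Dec\\b" the leading "\\b" attaches only to "January" and the trailing one only to "Dec". Every alternative in between, including the bare "Jan", is unanchored. A minimal sketch of one possible fix, wrapping the alternation in a non-capturing group so the boundaries apply to the whole set; it uses base R with perl = TRUE, and the third test string is made up for illustration:

```r
# Build the alternation from R's built-in month vectors, as in the question.
m <- paste(c(month.name, month.abb), collapse = "|")
x <- c("inJan", "Jan", "due in January 2020")  # third element is illustrative

# Ungrouped: "\\b" binds only to "January" (front) and "Dec" (back).
ungrouped <- paste0("\\b", m, "\\b")
# Grouped: the boundaries now apply to the alternation as a whole.
grouped <- paste0("\\b(?:", m, ")\\b")

res_ungrouped <- grepl(ungrouped, x, perl = TRUE)
res_grouped <- grepl(grouped, x, perl = TRUE)
print(res_ungrouped)  # TRUE TRUE TRUE
print(res_grouped)    # FALSE TRUE TRUE

# Extract the first whole-word month from the elements that match.
hits <- regmatches(x, regexpr(grouped, x, perl = TRUE))
print(hits)           # "Jan" "January"
```

With stringr, the same grouped pattern should work in the original call, e.g. str_detect(x, str_glue("\\b(?:{m})\\b")), and str_extract accepts it as well.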

Related

Exact match on word in middle of string in R [duplicate]

I looked at this question (How to filter Exact match string using dplyr), but my case is slightly different: the word is not necessarily at the start and can occur anywhere in the string. I want TRUE to be returned only for the first element, not for the second and third.
library(stringr)
vec <- c("this should be selected", "thisus should not be selected","not selected thisis too")
str_detect(vec,"this")
Current output
TRUE TRUE TRUE
Expected output
TRUE FALSE FALSE
Use a word boundary (\\b)
stringr::str_detect(vec,"\\bthis\\b")
#[1] TRUE FALSE FALSE
In base R:
grepl('\\bthis\\b', vec)

Remove characters which repeat more than twice in a string [duplicate]

I have this text:
F <- "hhhappy birthhhhhhdayyy"
and I want to remove the repeated characters. I tried the code from
https://stackoverflow.com/a/11165145/10718214
and it works, but I need to collapse a character only when it repeats more than twice; if it appears exactly twice, it should be kept.
So the output that I expect is
"happy birthday"
Any help?
Try using gsub with the pattern (.)\\1{2,}:
F <- ("hhhappy birthhhhhhdayyy")
gsub("(.)\\1{2,}", "\\1", F)
[1] "happy birthday"
Explanation of regex:
(.) match and capture any single character
\\1{2,} then match the same character two or more additional times (three or more in a row in total)
We replace each such run with a single copy of the character. In the replacement string, \\1 refers to the first capture group.
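To see that doubled characters survive while longer runs collapse, here is a small sketch of the same pattern applied to the question's input and to a second, made-up string. The helper name collapse_runs is hypothetical and only wraps the gsub call from the answer:

```r
# Hypothetical helper: collapse any run of 3+ identical characters to one.
collapse_runs <- function(x) gsub("(.)\\1{2,}", "\\1", x)

out1 <- collapse_runs("hhhappy birthhhhhhdayyy")
out2 <- collapse_runs("aabbbcccc")  # illustrative input

print(out1)  # "happy birthday" (the "pp" in "happy" is kept)
print(out2)  # "aabc" ("aa" is kept, "bbb" and "cccc" collapse)
```

Because "." matches any character, runs of three or more spaces or punctuation marks are collapsed in the same way.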

Remove part of column name post the second "_" [duplicate]

I have a vector which has names of the columns
group <- c("amount_bin_group", "fico_bin_group", "cltv_bin_group", "p_region_bin")
I want to remove everything from the second "_" onward in each element, i.e. I want it to be
group <- c("amount_bin", "fico_bin", "cltv_bin", "p_region")
I can split this into two vectors and try gsub or substr. However, it would be nice to do this in a single vectorized step. Any thoughts?
I checked other posts on the same topic, but none of them covers this exact case.
> sub("(.*)_.*$", "\\1", group)
[1] "amount_bin" "fico_bin" "cltv_bin" "p_region"
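Note that "(.*)" is greedy, so this pattern actually cuts at the last "_", which coincides with the second one here only because every element contains exactly two underscores. A hedged sketch of a stricter variant that cuts at the second "_" regardless of how many follow; the extra vector is made up for illustration:

```r
group <- c("amount_bin_group", "fico_bin_group", "cltv_bin_group", "p_region_bin")
extra <- c("a_b_c_d", "no_second")  # hypothetical names with 3 and 1 underscores

# Greedy version from the answer: strips from the LAST underscore.
greedy_extra <- sub("(.*)_.*$", "\\1", extra)

# Stricter version: keep "<no underscores>_<no underscores>", drop the rest.
strict <- "^([^_]*_[^_]*)_.*$"
strict_group <- sub(strict, "\\1", group)
strict_extra <- sub(strict, "\\1", extra)

print(greedy_extra)  # "a_b_c" "no"
print(strict_group)  # "amount_bin" "fico_bin" "cltv_bin" "p_region"
print(strict_extra)  # "a_b" "no_second"
```

Elements with fewer than two underscores do not match the stricter pattern and are returned unchanged.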

Difference between [A-Z] and LETTERS in grep [duplicate]

I am trying to only keep rows whose id contains letters. And I find the following two ways give different results.
df[grep("[A-Z]",df$id),]
df[grep(LETTERS,df$id),]
It seems the second way will omit many rows that actually have letters.
Why?
If you want to grep patterns in a vector try this:
to_match <- paste(LETTERS, collapse = "|")
to_match
[1] "A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z"
and then
df[grep(to_match, df$id), ]
Explanation:
You will match any of the characters in "to_match" since they are separated by the "or" operator "|".
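As for the "Why?": grep expects a single pattern string. When it is given a vector such as LETTERS, it uses only the first element (here "A") and warns that the rest are ignored, so only ids containing "A" are kept. The sketch below illustrates this on a made-up id vector; it passes LETTERS[1] explicitly to mimic that behaviour rather than triggering the warning:

```r
ids <- c("123", "A45", "9B", "456", "77Z")  # hypothetical ids

by_class <- grepl("[A-Z]", ids)
by_alternation <- grepl(paste(LETTERS, collapse = "|"), ids)
# What a length-26 pattern effectively reduces to: the first element only.
first_only <- grepl(LETTERS[1], ids)

print(by_class)        # FALSE TRUE TRUE FALSE TRUE
print(by_alternation)  # FALSE TRUE TRUE FALSE TRUE
print(first_only)      # FALSE TRUE FALSE FALSE FALSE
```

For single characters, the character class "[A-Z]" and the collapsed alternation select the same rows; the alternation is the more general tool when the patterns are whole words.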

Edit character length of row names in R [duplicate]

I have recently started working in bioinformatics, and I have to edit the row.names of my variable. Here is my situation:
I have clinical data and gene expression values downloaded from the Cancer Genome Atlas. I have to match row names, but in the clinical data the row names look like "TCGA-6D-AA2E", whereas in the gene expression data they look like "TCGA-6D-AA2E-01A-11R-A38B-07".
Normally I use the "match" command to match row names, but the character lengths are not the same. So my question is: "Is there an easy way to edit the character length of row names?"
You could use the grep function instead:
gene.names <- c("TCGA-6D-AA2E-01A-11R-A38B-07", "TCGC-6D-AA2E-01A-11R-A38B-07", "TAGA-6D-AA2E-01A-11R-07", "TCGA-6D-AA2E-A38B-07")
pick <- "TCGA-6D-AA2E"
grep(pick, gene.names)
# [1] 1 4
Edit based on the comment: Use substr to pick the first 12 characters:
substr(gene.names, 1, 12)
#[1] "TCGA-6D-AA2E" "TCGC-6D-AA2E" "TAGA-6D-AA2E" "TCGA-6D-AA2E"
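Once the names are truncated, match can be used as before. A minimal sketch, assuming the clinical row names are available as a character vector; clinical.ids and its contents are made up for illustration:

```r
gene.names <- c("TCGA-6D-AA2E-01A-11R-A38B-07", "TCGC-6D-AA2E-01A-11R-A38B-07",
                "TAGA-6D-AA2E-01A-11R-07", "TCGA-6D-AA2E-A38B-07")
clinical.ids <- c("TCGA-6D-AA2E", "TAGA-6D-AA2E")  # hypothetical clinical row names

# Truncate the expression names to 12 characters, then look each one up
# in the clinical ids; NA marks names with no clinical counterpart.
idx <- match(substr(gene.names, 1, 12), clinical.ids)
print(idx)  # 1 NA 2 1
```

Each position of idx gives the row of the clinical data that corresponds to that expression sample.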
