Subset a character vector in R based on a specific pattern

Subset a character vector in R based on a specific pattern - r

I have a vector of character ids, as rownames of a data frame in R. The rownames have the following pattern:
head(foo)
[1] "ENSG00000197372 (ZNF675)" "ENSG00000112624 (GLTSCR1L)"
[3] "ENSG00000151320 (AKAP6)" "ENSG00000139910 (NOVA1)"
[5] "ENSG00000137449 (CPEB2)" "ENSG00000004779 (NDUFAB1)"
I would like to somehow subset the above rownames (~700 entries) in order to keep only the gene symbols in the parenthesis part-i.e. ZNF675-and drop the rest part: is this possible through a function like gsub ?

We can use sub to match characters that are not (, then capture the characters inside the ( which is not a ) and replace it with the backreference (\\1) of the captured group
row.names(foo) <- sub("^[^(]+\\(([^)]+).*", "\\1", row.names(foo))
row.names(foo)
#[1] "ZNF675" "GLTSCR1L" "AKAP6" "NOVA1" "CPEB2" "NDUFAB1"
Or using str_extract from stringr
library(stringr)
str_extract(row.names(foo), "(?<=\\()[^)]+")
data
foo <- data.frame(col1 = rnorm(6))
row.names(foo) <- c("ENSG00000197372 (ZNF675)",
"ENSG00000112624 (GLTSCR1L)", "ENSG00000151320 (AKAP6)",
"ENSG00000139910 (NOVA1)",
"ENSG00000137449 (CPEB2)", "ENSG00000004779 (NDUFAB1)")

Related

R how to match and extract character letters of different length in a string

So I have a column of contract names df$name like below
FB210618C00280000
ADM210618C00280000
M210618P00280000
I would like to extract the FB, ADM and M. That is I want to extract characters in the string and they are of different length and stop once the first number occurs, and I don't want to extract the C or P.
The below code will give me the C or P
stri_extract_all_regex(df$name, "[a-z]+")

We can use stri_extract_first from stringi
library(stringi)
stri_extract_first(df$name, regex = "[A-Z]+")
#[1] "FB" "ADM" "M"
Or we can use base R with sub
sub("\\d+.*", "", df$name)
#[1] "FB" "ADM" "M"
Or use trimws from base R
trimws(df$name, whitespace = "\\d+.*")
data
df <- data.frame(name = c("FB210618C00280000", "ADM210618C00280000",
"M210618P00280000"))

You can use
library(stringr)
str_extract(df$name, "^[A-Za-z]+")
# Or
str_extract(df$name, "^\\p{L}+")
The stringr::str_extract function will extract the first occurrence of a pattern and ^[A-Za-z]+ / ^\p{L}+ regex matches one or more letters at the start of the string. Note \p{L} matches any Unicode letters.
See the regex demo.
Same pattern can be used with stringi::stri_extract_first():
library(stringi)
stri_extract_first(df$name, regex="^[A-Za-z]+")

Regular expression to exclude a string pattern in R

Please, I want to rename the columns of my table by removing the year label. Here are my columns names :
"PROV_201601" "MNT_201602" "PROV_201612" .... and so on !
My objective is to remove the "2016" from the name of the column. I am only familiar with R but not with regular expressions.
Any help is appreciated !
Thank you.

We can try with sub to match a _ capture as a group followed by four digits (\\d{4}) and replace with the backreference of the captured group (\\1) or use _
sub("(_)\\d{4}", "\\1", v1)
#[1] "PROV_01" "MNT_02" "PROV_12"
If it is specific to 2016 then
sub("2016", "", v1)
#[1] "PROV_01" "MNT_02" "PROV_12"
data
v1 <- c("PROV_201601", "MNT_201602", "PROV_201612")

First, use sub() to replace all instances of "2016" with "". This will eliminate 2016 from the character strings.
col1 <- c("PROV_201601", "MNT_201602", "PROV_201612")
col2 <- sub("2016", "", col1)
Now rename your columns of data frame dat using names():
names(dat) <- col2

Changing column names in dataframe using gsub

I have an atomic vector like:
col_names_to_be_changed <- c("PRODUCTIONDATE", "SPEEDRPM", "PERCENTLOADATCURRENTSPEED", sprintf("SENSOR%02d", 1:18))
I'd like to have _ between words, have them all lower case, except first letters of words (following R Style for dataframes from advanced R). I'd like to have something like this:
new_col_names <- c("Production_Date", "Percent_Load_At_Current_Speed", sprintf("Sensor_%02d", 1:18))
Assume that my words are limited to this list:
list_of_words <- c('production', 'speed', 'percent', 'load', 'at', 'current', 'sensor')
I am thinking of an algorithm that uses gsub, puts _ wherever it finds a word from the above list and then Capitalizes the first letter of each word. Although I can do this manually, I'd like to learn how this can be done more beautifully using gsub. Thanks.

You can take the list of words and paste them with a look-behind ((?<=)). I added the (?=.{2,}) because this will also match the "AT" in "DATE" since "AT" is in the list of words, so whatever is in the list of words will need to be followed by 2 or more characters to be split with an underscore.
The second gsub just does the capitalization
list_of_words <- c('production', 'speed', 'percent', 'load', 'at', 'current', 'sensor')
col_names_to_be_changed <- c("PRODUCTIONDATE", "SPEEDRPM", "PERCENTLOADATCURRENTSPEED", sprintf("SENSOR%02d", 1:18))
(pattern <- sprintf('(?i)(?<=%s)(?=.{2,})', paste(list_of_words, collapse = '|')))
# [1] "(?i)(?<=production|speed|percent|load|at|current|sensor)(?=.{2,})"
(split_words <- gsub(pattern, '_', tolower(col_names_to_be_changed), perl = TRUE))
# [1] "production_date" "speed_rpm" "percent_load_at_current_speed"
# [4] "sensor_01" "sensor_02" "sensor_03"
gsub('(?<=^|_)([a-z])', '\\U\\1', split_words, perl = TRUE)
# [1] "Production_Date" "Speed_Rpm" "Percent_Load_At_Current_Speed"
# [4] "Sensor_01" "Sensor_02" "Sensor_03"

Replace element in vector based on first letter of character string

Consider the vectors below:
ID <- c("A1","B1","C1","A12","B2","C2","Av1")
names <- c("ALPHA","BRAVO","CHARLIE","AVOCADO")
I want to replace the first character of each element in vector ID with vector names based on the first letter of vector names. I also want to add a _0 before each number between 0:9.
Note that the elements Av1 and AVOCADO throw things off a bit, especially with the lowercase v in Av1.
The result should look like this:
res <- c("ALPHA_01","BRAVO_01","CHARLIE_01","ALPHA_12","BRAVO_02","CHARLIE_02", "AVOCADO_01")
I know it should be done with regex but I've been trying for 2 days now and haven't got anywhere.

We can use gsubfn.
library(gsubfn)
#remove the number part from 'ID' (using `sub`) and get the unique elements
nm1 <- unique(sub("\\d+", "", ID))
#using gsubfn, replace the non-numeric elements with the matching
#key/value pair in the replacement
#finally format to add the "_" with sub
sub("(\\d+)$", "_0\\1", gsubfn("(\\D+)", as.list(setNames(names, nm1)), ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_02"
#[5] "BRAVO_02" "CHARLIE_02" "AVOCADO_01"
The (\\d+) indicates one or more numeric elements, and (\\D+) is one or more non-numeric elements. We are wrapping it within the brackets to capture as a group and replace it with the backreference (\\1 - as it is the first backreference for the captured group).
Update
If the condition would be to append 0 only to those 'ID's that have numbers less than 10, then we can do this with a second gsubfn and sprintf
gsubfn("(\\d+)", ~sprintf("_%02d", as.numeric(x)),
gsubfn("(\\D+)", as.list(setNames(names, nm1)), ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_12"
#[5] "BRAVO_02" "CHARLIE_02" "AVOCADO_01"

Doing this via base R, we can search for second character being V (as in AVOCADO) and substring 2 characters if that's true or 1 character if not. This will capture both AVOCADO and ALPHA. We then match those substrings with the letters extracted from ID (also convert toupper to capture Av with AV). Finally paste _0 along with the number found in each ID
paste0(names[match(toupper(sub('\\d+', '', ID)),
ifelse(substr(names, 2, 2) == 'V', substr(names, 1, 2),
substr(names, 1, 1)))],'_0', sub('\\D+', '', ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_02" "BRAVO_02" "CHARLIE_02" "AVOCADO_01"

split paired samples based on substring

I have two groups of paired samples that could be separated by the first two letters. I would like to make two groups based on the pairing using something like [tn][abc].
Example of paired samples:
nb-008 ta-008
na015 ta-015
data:
> colnames(data)
"nb-008" "nb-014" "na015" "na-018" "ta-008" "tc-014" "ta-015" "ta-018"
patient <- factor(sapply(str_split(colnames(data), '[tn][abc]'), function(x) x[[1]]))

We can create a grouping variable with sub. We match the pattern of 2 characters (..) from the beginning of the string (^) followed by - (if present), followed by one or more characters (.*) that we capture as a group (inside the brackets), and replace by the backreference (\\1). This can be used to split the column names.
split(colnames(data), sub('^..-?(.*)', '\\1', colnames(data))))
#$`008`
#[1] "nb-008" "ta-008"
#$`014`
#[1] "nb-014" "tc-014"
#$`015`
#[1] "na015" "ta-015"
#$`018`
#[1] "na-018" "ta-018"
data
v1 <- c("nb-008", "nb-014", "na015", "na-018",
"ta-008", "tc-014", "ta-015", "ta-018" )
set.seed(24)
data <- setNames(as.data.frame(matrix(sample(0:8, 8*5,
replace=TRUE), ncol=8)), v1)

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Subset a character vector in R based on a specific pattern - r

Related

R how to match and extract character letters of different length in a string

Regular expression to exclude a string pattern in R

Changing column names in dataframe using gsub

Replace element in vector based on first letter of character string

split paired samples based on substring

Categories

Resources