Regular expression to exclude a string pattern in R

I want to rename the columns of my table by removing the year label. Here are my column names:
"PROV_201601" "MNT_201602" "PROV_201612" ... and so on.
My objective is to remove the "2016" from each column name. I am familiar with R but not with regular expressions.
Any help is appreciated !
Thank you.

We can use sub to match a _ (captured as a group) followed by four digits (\\d{4}), and replace the match with the backreference to the captured group (\\1). Since the group only ever contains _, a literal "_" works as the replacement too.
sub("(_)\\d{4}", "\\1", v1)
#[1] "PROV_01" "MNT_02" "PROV_12"
If it is specific to 2016 then
sub("2016", "", v1)
#[1] "PROV_01" "MNT_02" "PROV_12"
data
v1 <- c("PROV_201601", "MNT_201602", "PROV_201612")
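As a rough illustration of why the two patterns above are not interchangeable, the sketch below uses a made-up name, "ID2016_201601" (not from the question), in which "2016" also occurs before the underscore.

```r
# Hypothetical column name in which "2016" appears twice (an assumption for illustration).
v2 <- "ID2016_201601"

# Hard-coded year: removes the first "2016", which is not the date part here.
plain <- sub("2016", "", v2)

# Underscore followed by four digits: only the year after "_" is dropped.
anchored <- sub("(_)\\d{4}", "\\1", v2)

plain     # "ID_201601"
anchored  # "ID2016_01"
```

For the names in the question both calls give the same result, so the choice only matters if the year could also appear elsewhere in a name.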

First, use sub() to replace the first occurrence of "2016" in each string with "". Each of these names contains "2016" only once, so this removes the year; gsub() would be needed to replace every occurrence.
col1 <- c("PROV_201601", "MNT_201602", "PROV_201612")
col2 <- sub("2016", "", col1)
Now rename your columns of data frame dat using names():
names(dat) <- col2
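The two steps can also be applied directly to the names of a data frame. This is a minimal sketch; the contents of dat are invented here, and only the column names come from the question.

```r
# Toy data frame; the values are placeholders, only the names matter.
dat <- data.frame(PROV_201601 = 1:2, MNT_201602 = 3:4, PROV_201612 = 5:6)

# sub() is vectorised, so all names can be rewritten in a single step.
names(dat) <- sub("2016", "", names(dat))

names(dat)
# "PROV_01" "MNT_02" "PROV_12"
```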

Related

Replacing multiple punctuation marks in a column of data

Column in a df:
chr10:123453:A:C
chr10:2345543:TTTG:CG
chr10:3454757:G:C
chr10:4567875765:C:G
Desired output:
chr10:123453_A/C
chr10:2345543_TTTG/CG
chr10:3454757_G/C
chr10:4567875765_C/G
I think I could use strsplit, but I wanted to try to do it all in an R one-liner. Any ideas would be greatly welcome!
Try this:
gsub(":([A-Z]+):([A-Z]+)$", "_\\1/\\2", x, perl = TRUE)
[1] "chr10:123453_A/C" "chr10:2345543_TTTG/CG"
Here we use backreferences twice: \\1 recalls what is between the penultimate and the last :, whereas \\2 recalls what comes after the last :.
Data:
x <- c("chr10:123453:A:C","chr10:2345543:TTTG:CG")
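For reference, here is the same replacement run on all four values from the question. This sketch uses sub() without perl = TRUE, on the assumption that only one match per string is needed; the variable names x4 and out are illustrative.

```r
# All four values from the question.
x4 <- c("chr10:123453:A:C", "chr10:2345543:TTTG:CG",
        "chr10:3454757:G:C", "chr10:4567875765:C:G")

# The $ anchor ties the match to the last two ":"-separated fields.
out <- sub(":([A-Z]+):([A-Z]+)$", "_\\1/\\2", x4)

out
# "chr10:123453_A/C" "chr10:2345543_TTTG/CG" "chr10:3454757_G/C" "chr10:4567875765_C/G"
```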

How to delete the 0 before a number?

I'm having the following difficulty in R: a dataframe has a column with some IDs, which are in the character format. What's the most concise way I can delete the 0s in front?
Example:
This is what I have:
ID <- as.character(c("001001","0001002","01003","001004","1005"))
order <- c("a","b","c","d","e")
df <- as.data.frame(cbind(ID, order))
This is what I want:
ID2 <- as.character(c("1001","1002","1003","1004","1005"))
order2 <- c("a","b","c","d","e")
df2 <- as.data.frame(cbind(ID2, order2))
I've tried replacing strings, but that also deletes the 0s I want to keep (e.g. ID2[1] becomes "11").
Thanks in advance!
Use trimws from base R (the whitespace argument needs R 3.6.0 or later):
trimws(c("001001","0001002","01003","001004","1005"),which = "left",whitespace = "0")
#> [1] "1001" "1002" "1003" "1004" "1005"
Created on 2020-06-30 by the reprex package (v0.3.0)
It is easier to do this if we convert to integer or numeric class, as numeric values cannot have a 0 prefix. After the conversion, just wrap with as.character if the column needs to remain character.
df$ID <- as.character(as.integer(df$ID))
df$ID
#[1] "1001" "1002" "1003" "1004" "1005"
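One caveat with the conversion approach, sketched below on made-up IDs (neither "0xyz" nor the long ID is from the question): anything that is not a plain integer within R's integer range becomes NA.

```r
# Hypothetical IDs: one normal, one with letters, one beyond the integer range.
ids <- c("001001", "0xyz", "0004567875765")

# suppressWarnings() hides the coercion warnings that R raises for the NAs.
converted <- suppressWarnings(as.character(as.integer(ids)))

converted
# "1001" NA NA
```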
It could also be done in a regex way (unnecessary though)
df$ID <- sub("^0+", "", df$ID)
In the above code, we match one or more 0s (0+) at the start (^) of the string and replace them with an empty string ("").
If the IDs contain characters other than digits, another option is to capture the digits after the leading 0s and replace the whole match with the backreference (\\1) to the captured group. This ensures that a string such as "0xyz" is left unchanged.
df$ID <- sub("^0+(\\d+)$", "\\1", df$ID)
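The difference between the two regex variants shows up on edge cases. The sketch below uses made-up inputs ("0xyz" and "000" are not from the question), and adds perl = TRUE to the second call as an assumption so that the greedy matching follows PCRE rules.

```r
# Hypothetical IDs: a normal one, one with letters, and one made only of zeros.
ids <- c("001001", "0xyz", "000")

# Strip every leading zero, whatever follows.
strip_any <- sub("^0+", "", ids)

# Strip leading zeros only when the rest of the string is all digits.
strip_digits <- sub("^0+(\\d+)$", "\\1", ids, perl = TRUE)

strip_any     # "1001" "xyz"  ""
strip_digits  # "1001" "0xyz" "0"
```

The second form keeps one digit for an all-zero ID, which may or may not be what is wanted.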

Subset a character vector in R based on a specific pattern

I have a vector of character ids, as rownames of a data frame in R. The rownames have the following pattern:
head(foo)
[1] "ENSG00000197372 (ZNF675)" "ENSG00000112624 (GLTSCR1L)"
[3] "ENSG00000151320 (AKAP6)" "ENSG00000139910 (NOVA1)"
[5] "ENSG00000137449 (CPEB2)" "ENSG00000004779 (NDUFAB1)"
I would like to subset the above rownames (~700 entries) to keep only the gene symbol inside the parentheses (e.g. ZNF675) and drop the rest. Is this possible with a function like gsub?
We can use sub to match the characters from the start of the string that are not a (, then the ( itself, capture the characters inside that are not a ), and replace the whole match with the backreference (\\1) to the captured group.
row.names(foo) <- sub("^[^(]+\\(([^)]+).*", "\\1", row.names(foo))
row.names(foo)
#[1] "ZNF675" "GLTSCR1L" "AKAP6" "NOVA1" "CPEB2" "NDUFAB1"
Or using str_extract from stringr
library(stringr)
str_extract(row.names(foo), "(?<=\\()[^)]+")
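If stringr is not available, the same lookbehind pattern can be used with base R's regexpr() and regmatches(). This is a sketch on two of the rownames from the question; the variable names ids, m and symbols are illustrative.

```r
# Two of the rownames from the question.
ids <- c("ENSG00000197372 (ZNF675)", "ENSG00000112624 (GLTSCR1L)")

# Lookbehind requires perl = TRUE; regexpr() finds the first match per string.
m <- regexpr("(?<=\\()[^)]+", ids, perl = TRUE)

# regmatches() returns only the matched text (strings without a match are dropped).
symbols <- regmatches(ids, m)

symbols
# "ZNF675" "GLTSCR1L"
```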
data
foo <- data.frame(col1 = rnorm(6))
row.names(foo) <- c("ENSG00000197372 (ZNF675)",
"ENSG00000112624 (GLTSCR1L)", "ENSG00000151320 (AKAP6)",
"ENSG00000139910 (NOVA1)",
"ENSG00000137449 (CPEB2)", "ENSG00000004779 (NDUFAB1)")

How to remove '.' from column content in a dataframe?

I have a dataframe containing a number of ensembl gene annotations, the DF looks like this:
geneID
1 ENSG00000000005.5
2 ENSG00000001561.6
3 ENSG00000002726.18
4 ENSG00000005302.16
5 ENSG00000005379.14
6 ENSG00000006116.3
so I would like to delete that "." and the numbers at the end of every ID. In total I have 11224 rows.
I tried using the gsub command gsub(".","",colnames(dataframe)) but this is not helping.
Any suggestions?
Thank you in advance.
If we need to keep the . at the end, capture the characters up to and including the . (as . is a metacharacter meaning any character, escape it with \\), followed by one or more digits (\\d+) until the end of the string, and replace with the backreference (\\1) to the captured group.
df1$geneID <- sub("^(.*\\.)\\d+$", "\\1", df1$geneID)
If the intention is to remove the . with the numbers after that, match the dot followed by one or more numbers and replace with blank ("")
df1$geneID <- sub("\\.\\d+", "", df1$geneID)
df1$geneID
#[1] "ENSG00000000005" "ENSG00000001561" "ENSG00000002726" "ENSG00000005302"
#[5] "ENSG00000005379" "ENSG00000006116"
You can use the following code to remove everything from the '.' onwards:
gsub("\\..*", "", df$geneID)
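To see why the attempt in the question did not help, the sketch below compares an unescaped dot, a literal dot (fixed = TRUE) and the escaped pattern on two of the IDs. Note also that the question applied gsub to colnames(dataframe) rather than to the geneID column itself. The variable names are illustrative.

```r
# Two of the IDs from the question.
ids <- c("ENSG00000000005.5", "ENSG00000002726.18")

# Unescaped "." matches any character, so every character is removed.
unescaped <- gsub(".", "", ids)

# fixed = TRUE treats "." literally, but leaves the version digits behind.
literal <- gsub(".", "", ids, fixed = TRUE)

# Escaped dot plus trailing digits removes the whole version suffix.
trimmed <- sub("\\.\\d+$", "", ids)

unescaped  # "" ""
literal    # "ENSG000000000055" "ENSG0000000272618"
trimmed    # "ENSG00000000005" "ENSG00000002726"
```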

r Remove parts of column name after certain characters

I have a large data set with thousands of columns. The column names include various unwanted characters as follows:
col1_3x_xxx
col2_3y_xyz
col3_3z_zyx
I would like to remove all character strings starting with "_3" from all column names to be left with clean:
col1
col2
col3
What is the most efficient way to do this for 5000+ columns?
Certainly late with this answer, but just in case someone is looking for a solution. For a single column with index col:
colnames(df1)[col] <- sub("_3.*", "", colnames(df1)[col])
And if you have multiple columns :
for (col in seq_len(ncol(df1))) {
  colnames(df1)[col] <- sub("_3.*", "", colnames(df1)[col])
}
We can use sub
sub("_3.*", "", df1[,1])
#[1] "col1" "col2" "col3"
We can try the str_extract with regular expression pattern "^[^_]+(?=_)":
stringr::str_extract(c("col1_3x_xxx", "col2_3y_xyz", "col3_3z_zyx"), "^[^_]+(?=_)")
[1] "col1" "col2" "col3"
where in the pattern:
The first ^ matches the beginning of the string; [^_]+ matches one or more characters other than _ (inside the brackets, ^ negates the set). (?=_) is a lookahead: it requires that a _ follows the match without including it in the result.
You can use
names(df) = gsub(pattern = "_3.*", replacement = "", x = names(df))
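Because sub() is vectorised, no loop is needed even for thousands of columns. Below is a minimal sketch on a toy data frame (the values are invented), with a check for duplicate names after trimming, since two columns could end up with the same cleaned name.

```r
# Toy data frame with the column names from the question; values are placeholders.
df1 <- data.frame(col1_3x_xxx = 1, col2_3y_xyz = 2, col3_3z_zyx = 3)

# Remove "_3" and everything after it from every name in one call.
names(df1) <- sub("_3.*", "", names(df1))

names(df1)
# "col1" "col2" "col3"

# anyDuplicated() returns 0 when all cleaned names are still unique.
dup_index <- anyDuplicated(names(df1))
```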
