how can i remove certain characters from my rownames? - r

I want to remove a part of the rownames in my data frame. i want to remove "."and characters after "." Does anyone know?
head(rownames(data))
[1] "ENSG00000000003.15" "ENSG00000000005.6" "ENSG00000000419.13" "ENSG00000000457.14"
[5] "ENSG00000000460.17" "ENSG00000000938.13"
i wanna change it to
[1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457"
[5] "ENSG00000000460" "ENSG00000000938"
how

Try this,
rownames(data) <- sub("\\..*", "", rownames(data))

You can try this:
rownames(data) <- unlist(strsplit(rownames(data), split='.',fixed = T))[1]

It looks like they have a fixed length. If it is true, then
rownames(data) = substr(rownames(data),1,15)

A slightly more sophisticated method is this:
sub("([^.]+).*", "\\1", rownames(data))
Here, we define a capture group using a negative character class that includes any character but the period . and, using backreference \\1, 'recollect' just the matching series of digits in sub's replacement argument.

Another option using str_split like this:
library(stringr)
rows <- c("ENSG00000000003.15", "ENSG00000000005.6", "ENSG00000000419.13", "ENSG00000000457.14", "ENSG00000000460.17", "ENSG00000000938.13")
str_split(rows, "\\.", simplify=T)[,1]
#> [1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457"
#> [5] "ENSG00000000460" "ENSG00000000938"
Created on 2022-07-29 by the reprex package (v2.0.1)

Related

append letter to a string in r

I have a vector:
c("BAAAVAST", "BAACEZ", "BAAGECBA", "LOL")
And I would like to remove "BAA" from the words that contain it. And to those words I would like to append ".PR".
Desired outcome:
c("AVAST.PR", "CEZ.PR", "GECBA.PR", "LOL")
Any ideas? Ideally using stringr. Thank you a lot.
You could use the following solution:
gsub("BAA(.*)", "\\1\\.PR", vec)
[1] "AVAST.PR" "CEZ.PR" "GECBA.PR" "LOL"
You could use
library(stringr)
# optimized thanks to Anoushiravan
str_replace(c("BAAAVAST", "BAACEZ", "BAAGECBA", "LOL"), "BAA(\\w*)", "\\1.PR")
#> [1] "AVAST.PR" "CEZ.PR" "GECBA.PR" "LOL"
use \\w* if you want to match word characters only or .* if there are no limitations to the characters.
This is verbose than the other answers. It finds strings with 'BAA' and appends 'PR.' to it.
inds <- grepl('BAA', vec, fixed = TRUE)
vec[inds] <- paste(sub('BAA', '', vec[inds]), 'PR', sep = '.')
vec
#[1] "AVAST.PR" "CEZ.PR" "GECBA.PR" "LOL"

Replace colnames to substring of colname

I wonder how I I can replace the colnames of my data frame to be the unique string in the original colname?
> colnames(df.iso)
[1] "../trimmed/100G.tally.fasta" "../trimmed/100R.tally.fasta" "../trimmed/106G.tally.fasta"
[4] "../trimmed/106R.tally.fasta" "../trimmed/122G.tally.fasta" "../trimmed/122R.tally.fasta"
[7] "../trimmed/124G.tally.fasta" "../trimmed/124R.tally.fasta" "../trimmed/126G.tally.fasta"
[10] "../trimmed/126R.tally.fasta" "../trimmed/134G.tally.fasta" "../trimmed/134R.tally.fasta"
We can use sub with ?basename to extract the substring from the column names. Assign the output back to the column names to reflect the change.
colnames(df.iso) <- sub("\\..*", '', basename(colnames(df.iso)))
If we don't want to use basename, sub can also be used alone.
colnames(df.iso) <- sub("([^/]+/){2}([^.]+).*",
"\\2", colnames(df.iso))
Similarly to #Akrun's second answer,
colnames(df.iso) <- sub("[^0-9]+([0-9]+[A-Z])\\.tal.*", "\\1", colnames(df.iso))
Should also do the trick. His first method is likely faster, which probably won't matter here.

Get rid of repetitive characters from a column name in R

Here is a portion of my large dataframe
> a
SS29.SS29 PP1.PP1 SS4.SS4 CC43.CC43 FF57.FF57 NN23.NN23 MM25.MM25 KK9.KK9 MM55.MM55 AA75.AA75 SS88.SS88
1 669.9544 1.068153 35.86534 24.47688 1.058007 72.20306 1.854856 10.15414 0.08715572 0.02006310 0.1817582
2 651.2092 1.164428 37.59895 27.41381 1.095322 73.48029 1.927993 10.09958 0.09096972 0.02261701 0.1855258
How I'd be able to get rid of the double column names separated by a dot? e.g. for the first column I'd like to have SS29 instead of repetitive SS29.SS29, for the second column PP1 and so on. Is there any automated way of doing it?
The simplest way would be to use sub to remove the substring after the dot . character.
names(a) <- sub('\\.[^.]*', '', names(a))
You could use sub
names(a) <- sub("[.](.*)", "", names(a))
# [1] "SS29" "PP1" "SS4" "CC43" "FF57" "NN23"
# [7] "MM25" "KK9" "MM55" "AA75" "SS88"
or a substring
substring(names(a), 1, regexpr("[.]", names(a))-1)
# [1] "SS29" "PP1" "SS4" "CC43" "FF57" "NN23"
# [7] "MM25" "KK9" "MM55" "AA75" "SS88"
or strsplit
names(a) <- unlist(strsplit(names(a), "[.](.*)"))
# [1] "SS29" "PP1" "SS4" "CC43" "FF57" "NN23"
# [7] "MM25" "KK9" "MM55" "AA75" "SS88"
You can assign new column names with
colnames(a) <- new_column_names
To compute new_column_names, you can use regular expressions, e.g.. the gsub function, as ssdecontrol suggested.
new_column_names <- gsub(...)

Extract characters between specified characters in R

I have this variable
x= "379_exp_mirror1.csv"
I need to extract the number ("379") at the beggining (which doesn't always have 3 characters), i.e. everything before the first "". And then I need to extract everything between the second "" and the ".", in this case "mirror1".
I have tried several combinations with sub and gsub with no success, can anyone give me some indications please?
Thank you
You can use regular expression. For your problem ^(?<Number>[0-9]*)_.* do the job
1/ Test your regular expression with this website : http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
Or you can split string with underscore and then try parse (int.TryParse). I think the second is better but if you want to be a regular expression master try the first method
You can use sub to extract the substrings:
x <- "379_exp_mirror1.csv"
sub("_.*", "", x)
# [1] "379"
sub("^(?:.*_){2}(.*?)\\..*", "\\1", x)
# [1] "mirror1"
Another approach with gregexpr:
regmatches(x, gregexpr("^.*?(?=_)|(?<=_)[^_]*?(?=\\.)", x, perl = TRUE))[[1]]
# [1] "379" "mirror1"
May be you can try:
library(stringr)
x <- "379_exp_mirror1.csv"
str_extract_all(x, perl('^[0-9]+(?=_)|[[:alnum:]]+(?=\\.)'))[[1]]
#[1] "379" "mirror1"
Or
strsplit(x, "[._]")[[1]][c(T,F)]
#[1] "379" "mirror1"
Or
scan(text=gsub("[.]","_", x),what="",sep="_")[c(T,F)]
#Read 4 items
#[1] "379" "mirror1"

numeric sort a list of strings in R

I have a list:
a <- ["12file.txt", "8file.txt", "66file.txt"]
I would like to sort by number:
a would be: ["8file.txt", "12file.txt", "66file.txt"]
Now I could get only this:
a = ["12file.txt", "66file.txt", "8file.txt"]
Thanks
I'm assuming you have a character vector:
a <- c("12file.txt", "8file.txt", "66file.txt")
I would approach this by pulling out the number at the start of each string and sorting on that:
num <- as.numeric(sub("([0-9]+).*", "\\1", a))
a[order(num)]
#[1] "8file.txt" "12file.txt" "66file.txt"
You could also pad your strings with spaces by setting a field length to sprintf to achieve the sorting you want:
a[order(sprintf("%10s",a))]
[1] "8file.txt" "12file.txt" "66file.txt"
You can use str_sort(..., numeric = TRUE) function from stringr package:
library(stringr)
a <- c("12file.txt", "8file.txt", "66file.txt")
str_sort(a, numeric = TRUE)
#> [1] "8file.txt" "12file.txt" "66file.txt"

Resources