Replace colnames to substring of colname - r

I wonder how I can replace the colnames of my data frame with the unique string contained in each original colname?
> colnames(df.iso)
[1] "../trimmed/100G.tally.fasta" "../trimmed/100R.tally.fasta" "../trimmed/106G.tally.fasta"
[4] "../trimmed/106R.tally.fasta" "../trimmed/122G.tally.fasta" "../trimmed/122R.tally.fasta"
[7] "../trimmed/124G.tally.fasta" "../trimmed/124R.tally.fasta" "../trimmed/126G.tally.fasta"
[10] "../trimmed/126R.tally.fasta" "../trimmed/134G.tally.fasta" "../trimmed/134R.tally.fasta"

We can use sub with ?basename to extract the substring from the column names. Assign the output back to the column names to reflect the change.
colnames(df.iso) <- sub("\\..*", '', basename(colnames(df.iso)))
If we don't want to use basename, sub can also be used alone.
colnames(df.iso) <- sub("([^/]+/){2}([^.]+).*",
"\\2", colnames(df.iso))

Similar to @akrun's second answer,
colnames(df.iso) <- sub("[^0-9]+([0-9]+[A-Z])\\.tal.*", "\\1", colnames(df.iso))
should also do the trick. His first method is likely faster, although that probably won't matter here.
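To see what each pattern does, here is a small sketch that runs all three approaches on a few of the column names from the question. The vector cols and the result names are made up for illustration; the patterns are the ones from the answers above.

```r
# A few of the column names from the question (illustrative subset)
cols <- c("../trimmed/100G.tally.fasta",
          "../trimmed/100R.tally.fasta",
          "../trimmed/106G.tally.fasta")

# basename() drops the directory part, sub() then removes
# everything from the first dot onward
ids_basename <- sub("\\..*", "", basename(cols))

# sub() alone: skip two "xxx/" components, keep the text up to the first dot
ids_sub_only <- sub("([^/]+/){2}([^.]+).*", "\\2", cols)

# sub() alone: keep the digits plus one capital letter before ".tal"
ids_digits <- sub("[^0-9]+([0-9]+[A-Z])\\.tal.*", "\\1", cols)

ids_basename
# All three approaches should agree on these inputs
identical(ids_basename, ids_sub_only) && identical(ids_basename, ids_digits)
```

The basename() version depends least on the exact directory layout, so it is probably the safest choice if the paths may vary.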

Related

Add a character to a specific part of a string?

I have a list of file names as such:
"A/B/file.jpeg"
"A/C/file2.jpeg"
"B/C/file3.jpeg"
and a couple of variations of such.
My question is: how can I insert "new" (or any other characters) into each of these file names right after the second "/"? The length of the name shouldn't matter; the text just has to be placed after the second "/".
Results would ideally be:
"A/B/newfile.jpeg"
"A/C/newfile2.jpeg" etc.
Thanks!
Another possible solution, based on stringr::str_replace:
library(stringr)
l <- c("A/B/file.jpeg", "A/B/file2.jpeg", "A/B/file3.jpeg")
str_replace(l, "\\/(?=file)", "\\/new")
#> [1] "A/B/newfile.jpeg" "A/B/newfile2.jpeg" "A/B/newfile3.jpeg"
Using gsub.
gsub('(file)', 'new\\1', x)
# [1] "A/B/newfile.jpeg" "A/C/newfile2.jpeg" "B/C/newfile3.jpeg"
Data:
x <- c("A/B/file.jpeg", "A/C/file2.jpeg", "B/C/file3.jpeg")
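Both answers above key on the literal text "file". If the file names vary, a position-based pattern may be closer to what the question asks for. The following is only a sketch under that assumption: it captures everything up to and including the second "/" and re-emits it followed by the new text. The names res, y and res_y are made up for the example.

```r
x <- c("A/B/file.jpeg", "A/C/file2.jpeg", "B/C/file3.jpeg")

# (?:[^/]*/){2} matches two "component/" chunks from the start of the string;
# the outer group captures them so "\\1" can put them back before "new"
res <- sub("^((?:[^/]*/){2})", "\\1new", x, perl = TRUE)
res

# The same pattern does not depend on the file being called "file..."
y <- "img/2021/photo.png"
res_y <- sub("^((?:[^/]*/){2})", "\\1new", y, perl = TRUE)
res_y
```

Strings with fewer than two "/" are left unchanged, because the pattern does not match them.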

Conditional shift of substring position in R

> Df1
[1] "HM_004_T" "HM_004_T2" "HM_005_T" "HM_005_T2" "HM_007_T" "HM_007_T2" "HM_088_TR"
[8] "HM_088_T3"
This refers to the question "change position of word within a string in r", but my problem is slightly different. I first want to delete _T when it appears on its own, and I want to remove _T2, _T3 or _TR from the end and move them in front of all the other text.
My ideal output will be:
Df1 <- c("HM_004", "T2_HM_004", "HM_005", "T2_HM_005", "HM_007", "T2_HM_007", "TR_HM_088", "T3_HM_088")
Input data
Df1 <- c("HM_004_T", "HM_004_T2", "HM_005_T", "HM_005_T2", "HM_007_T", "HM_007_T2", "HM_088_TR", "HM_088_T3")
You can do this with nested sub and backreference:
DF1 <- sub("(.*)_(T\\w)$", "\\2_\\1", sub("_T$", "", DF1))
Here the inner sub call deletes a string-final _T. Its result is passed to the outer sub call, which switches the order of (i) whatever comes before the last underscore _ and (ii) T followed by a digit or a letter (\\w), by referring to these two substrings with the backreferences \\1 and \\2.
Result:
DF1
[1] "HM_004" "T2_HM_004" "HM_005" "T2_HM_005" "HM_007" "T2_HM_007" "TR_HM_088" "T3_HM_088"
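If the nested call is hard to read, it may help to split it into two steps and look at the intermediate vector. This is just the same two sub calls unrolled; the names step1 and step2 are made up for the example.

```r
DF1 <- c("HM_004_T", "HM_004_T2", "HM_005_T", "HM_005_T2",
         "HM_007_T", "HM_007_T2", "HM_088_TR", "HM_088_T3")

# Step 1: drop a trailing "_T" that stands on its own;
# "_T2", "_T3" and "_TR" do not end in "_T", so they are untouched
step1 <- sub("_T$", "", DF1)
step1

# Step 2: move a trailing "T" + one word character to the front;
# elements without such a suffix do not match and stay as they are
step2 <- sub("(.*)_(T\\w)$", "\\2_\\1", step1)
step2
```

The order matters: running step 2 first would not change the result here, but removing the bare _T first keeps each pattern as simple as possible.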
Data:
DF1 <- c("HM_004_T", "HM_004_T2", "HM_005_T", "HM_005_T2",
"HM_007_T", "HM_007_T2", "HM_088_TR", "HM_088_T3")
You can achieve this relatively easily with the stringr package and its functions str_remove() and str_replace().
I am assuming that the patterns of interest always occur at the end of the text and that they are always preceded by _.
Please have a look at the updated code below. It now treats the pattern _T*, where * can be a digit or a letter, as a valid target.
library(stringr)
Df1 <- c("HM_004_T", "HM_004_T2", "HM_005_T", "HM_005_T2",
"HM_007_T", "HM_007_T2", "HM_088_TR", "HM_088_T3")
# Here I remove a trailing "_T" that stands on its own; suffixes such as
# "_T2" or "_TR" (a digit or a letter after "_T") are handled in the next step
df2 <- str_remove(Df1, "_T$")
# Here I replace the patterns through the group reference
final <- str_replace( df2, "(^.*)_(T\\d+$|T\\w+$)", "\\2_\\1" )
final
#> [1] "HM_004" "T2_HM_004" "HM_005" "T2_HM_005" "HM_007" "T2_HM_007"
#> [7] "TR_HM_088" "T3_HM_088"
# A more concise way would be the following, where \\w is the workhorse.
final <- str_replace( df2, "(^.*)_(T\\w$)", "\\2_\\1" )
final
#> [1] "HM_004" "T2_HM_004" "HM_005" "T2_HM_005" "HM_007" "T2_HM_007"
#> [7] "TR_HM_088" "T3_HM_088"
Created on 2021-02-16 by the reprex package (v1.0.0)
Does this work for you?

replace and remove part of string in rownames

I want to remove part of the rownames in my data frame: everything that does not match one of the strings in the grepl pattern below should be removed, so that only the matching string (listed on the right-hand side) remains. Does anyone know how to do this?
df[grepl(".*lncRNA.*|.*snRNA.*|.*snoRNA.*|.*precursor_RNA.*", rownames(df))] <- c("lncRNA","snRNA","snoRNA","precursor_RNA")
head(rownames(df))
[3208] "URS000075AF9C-snoRNA_GTATGTGTGGACAGCACTGAGACTGAGTCT"
[3209] "URS000075B029-snRNA_AACTCTGAGTCTTAAGCTAATTTTTTGAGGCCTTGTTCCGACA"
[3210] "URS000075B029-snRNA_ATTTCCGTGGAGAGGAACAACTCTGAGTCTTAAGCTAATTT"
[3211] "URS000075B0E3-lncRNA_GTAAGGGGCAGTAAG"
[3212] "URS000075B261-precursor_RNA_CTTTCTATGCTCCTGTTCTGC"
[3213] "URS000075B2ED-lncRNA_CACTCAGGACCCACC"
Desired output:
[3208] "snoRNA"
[3209] "snRNA"
[3210] "snRNA"
[3211] "lncRNA"
[3212] "precursor_RNA"
[3213] "lncRNA"
We can use gsub to match one or more characters that are not a - ([^-]+) from the start (^) of the string, followed by a -, or (|) an underscore followed by one or more characters that are not an underscore (_[^_]+) up to the end of the string ($), and replace the match with an empty string ("").
gsub("^[^-]+-|_[^_]+$", "", v1)
#[1] "snoRNA" "snRNA" "snRNA" "lncRNA"
#[5] "precursor_RNA" "lncRNA"
If we are doing this on the rownames
gsub("^[^-]+-|_[^_]+$", "", rownames(df))
data
v1 <- c("URS000075AF9C-snoRNA_GTATGTGTGGACAGCACTGAGACTGAGTCT",
"URS000075B029-snRNA_AACTCTGAGTCTTAAGCTAATTTTTTGAGGCCTTGTTCCGACA",
"URS000075B029-snRNA_ATTTCCGTGGAGAGGAACAACTCTGAGTCTTAAGCTAATTT",
"URS000075B0E3-lncRNA_GTAAGGGGCAGTAAG",
"URS000075B261-precursor_RNA_CTTTCTATGCTCCTGTTCTGC",
"URS000075B2ED-lncRNA_CACTCAGGACCCACC")
Welcome to StackOverflow! You've done well with giving us some example input and output, but please consider providing a reproducible example to make it easier for us to help you.
In your case, I think you may be able to use gsub, capture the middle part, and refer to it with \\1 in the replacement.
x <- c("URS000075AF9C-snoRNA_GTATGTGTGGACAGCACTGAGACTGAGTCT",
"URS000075B029-snRNA_AACTCTGAGTCTTAAGCTAATTTTTTGAGGCCTTGTTCCGACA",
"URS000075B029-snRNA_ATTTCCGTGGAGAGGAACAACTCTGAGTCTTAAGCTAATTT",
"URS000075B0E3-lncRNA_GTAAGGGGCAGTAAG",
"URS000075B261-precursor_RNA_CTTTCTATGCTCCTGTTCTGC",
"URS000075B2ED-lncRNA_CACTCAGGACCCACC")
# replace the string with the captured group (ie regex in brackets)
gsub("^.*(lncRNA|snRNA|snoRNA|precursor_RNA).*$", "\\1", x)
# [1] "snoRNA" "snRNA" "snRNA" "lncRNA"
# [5] "precursor_RNA" "lncRNA"
Rownames have to be unique though, so you may need to store the result in a column of your dataframe instead (or you could use make.unique() to make them unique, but I think saving the result as a column in your dataframe would make more sense).
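As a rough sketch of that suggestion, the extracted type can be stored in a new column instead of overwriting the rownames. The data frame, the count column and the name rna_type are invented for the example; only the regular expression comes from the answer above.

```r
# Hypothetical data frame using four of the rownames from the question
df <- data.frame(
  count = c(5, 3, 8, 2),  # made-up values
  row.names = c("URS000075AF9C-snoRNA_GTATGTGTGGACAGCACTGAGACTGAGTCT",
                "URS000075B029-snRNA_AACTCTGAGTCTTAAGCTAATTTTTTGAGGCCTTGTTCCGACA",
                "URS000075B029-snRNA_ATTTCCGTGGAGAGGAACAACTCTGAGTCTTAAGCTAATTT",
                "URS000075B0E3-lncRNA_GTAAGGGGCAGTAAG")
)

# Keep the original rownames and store the RNA type in its own column
df$rna_type <- sub("^.*(lncRNA|snRNA|snoRNA|precursor_RNA).*$", "\\1",
                   rownames(df))
df$rna_type

# If the types really must become rownames, make.unique() appends
# ".1", ".2", ... to repeated values so they are valid rownames
unique_names <- make.unique(df$rna_type)
unique_names
```

Keeping the type as a column also makes it easy to group or filter by it later, which is harder to do with suffixed rownames.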

Get rid of repetitive characters from a column name in R

Here is a portion of my large dataframe
> a
SS29.SS29 PP1.PP1 SS4.SS4 CC43.CC43 FF57.FF57 NN23.NN23 MM25.MM25 KK9.KK9 MM55.MM55 AA75.AA75 SS88.SS88
1 669.9544 1.068153 35.86534 24.47688 1.058007 72.20306 1.854856 10.15414 0.08715572 0.02006310 0.1817582
2 651.2092 1.164428 37.59895 27.41381 1.095322 73.48029 1.927993 10.09958 0.09096972 0.02261701 0.1855258
How can I get rid of the doubled column names separated by a dot? E.g. for the first column I'd like to have SS29 instead of the repetitive SS29.SS29, for the second column PP1, and so on. Is there an automated way of doing this?
The simplest way would be to use sub to remove the substring after the dot . character.
names(a) <- sub('\\.[^.]*', '', names(a))
You could use sub
names(a) <- sub("[.](.*)", "", names(a))
# [1] "SS29" "PP1" "SS4" "CC43" "FF57" "NN23"
# [7] "MM25" "KK9" "MM55" "AA75" "SS88"
or a substring
substring(names(a), 1, regexpr("[.]", names(a))-1)
# [1] "SS29" "PP1" "SS4" "CC43" "FF57" "NN23"
# [7] "MM25" "KK9" "MM55" "AA75" "SS88"
or strsplit
names(a) <- unlist(strsplit(names(a), "[.](.*)"))
# [1] "SS29" "PP1" "SS4" "CC43" "FF57" "NN23"
# [7] "MM25" "KK9" "MM55" "AA75" "SS88"
You can assign new column names with
colnames(a) <- new_column_names
To compute new_column_names, you can use regular expressions, e.g. the gsub function, as ssdecontrol suggested.
new_column_names <- gsub(...)
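One possible way to fill this in, as a sketch only: the data frame below is a cut-down stand-in for the one in the question, and the extra column A.B is invented to show that names whose two halves differ are left alone. The guard is an assumption on my part, not something the question asked for.

```r
# Cut-down stand-in for the data frame in the question;
# A.B is a made-up column whose two halves are NOT identical
a <- data.frame(SS29.SS29 = c(669.9544, 651.2092),
                PP1.PP1   = c(1.068153, 1.164428),
                A.B       = c(1, 2))

old <- colnames(a)

# Split each name at the literal dot and check that it consists of
# exactly two identical halves
halves <- strsplit(old, ".", fixed = TRUE)
is_dup <- vapply(halves,
                 function(p) length(p) == 2 && p[1] == p[2],
                 logical(1))

# Strip the dot and everything after it only for the duplicated names
new_column_names <- ifelse(is_dup, sub("\\..*$", "", old), old)
colnames(a) <- new_column_names
colnames(a)
```

If every column is known to follow the X.X pattern, the guard can be dropped and the sub call applied to all names directly.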

R: Replacing rownames of data frame by a substring[2]

I have a question about the use of gsub. The rownames of my data share the same partial names. See below:
> rownames(test)
[1] "U2OS.EV.2.7.9" "U2OS.PIM.2.7.9" "U2OS.WDR.2.7.9" "U2OS.MYC.2.7.9"
[5] "U2OS.OBX.2.7.9" "U2OS.EV.18.6.9" "U2O2.PIM.18.6.9" "U2OS.WDR.18.6.9"
[9] "U2OS.MYC.18.6.9" "U2OS.OBX.18.6.9" "X1.U2OS...OBX" "X2.U2OS...MYC"
[13] "X3.U2OS...WDR82" "X4.U2OS...PIM" "X5.U2OS...EV" "exp1.U2OS.EV"
[17] "exp1.U2OS.MYC" "EXP1.U20S..PIM1" "EXP1.U2OS.WDR82" "EXP1.U20S.OBX"
[21] "EXP2.U2OS.EV" "EXP2.U2OS.MYC" "EXP2.U2OS.PIM1" "EXP2.U2OS.WDR82"
[25] "EXP2.U2OS.OBX"
In my previous question, I asked if there is a way to get the same names for the same partial names. See this question: Replacing rownames of data frame by a sub-string
The answer is a very nice solution. The function gsub is used in this way:
transfecties = gsub(".*(MYC|EV|PIM|WDR|OBX).*", "\\1", rownames(test))
Now I have another problem: the program I run R with (Galaxy) doesn't recognize the | character. Is there another way to get the same result without using |?
Thanks!
If you don't want to use the "|" character, you can try something like :
Rnames <-
c( "U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "U2OS.WDR.2.7.9", "U2OS.MYC.2.7.9" ,
"U2OS.OBX.2.7.9" , "U2OS.EV.18.6.9" ,"U2O2.PIM.18.6.9" ,"U2OS.WDR.18.6.9" )
Rlevels <- c("MYC","EV","PIM","WDR","OBX")
tmp <- sapply(Rlevels,grepl,Rnames)
apply(tmp,1,function(i)colnames(tmp)[i])
[1] "EV" "PIM" "WDR" "MYC" "OBX" "EV" "PIM" "WDR"
But I would seriously consider mentioning this to the team of galaxy, as it seems to be rather awkward not to be able to use the symbol for OR...
I wouldn't recommend doing this in general in R, as it is far less efficient than the solution @csgillespie provided, but an alternative is to loop over the strings you want to match and do the replacement for each string separately, i.e. search for "MYC" and replace only in those rownames that match "MYC".
Here is an example using the x data from @csgillespie's answer:
x <- c("U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "U2OS.WDR.2.7.9", "U2OS.MYC.2.7.9",
"U2OS.OBX.2.7.9", "U2OS.EV.18.6.9", "U2O2.PIM.18.6.9","U2OS.WDR.18.6.9",
"U2OS.MYC.18.6.9","U2OS.OBX.18.6.9", "X1.U2OS...OBX","X2.U2OS...MYC")
Copy the data so we have something to compare with later (this just for the example):
x2 <- x
Then create a list of strings you want to match on:
matches <- c("MYC","EV","PIM","WDR","OBX")
Then we loop over the values in matches and do three things (numbered ## 1 to ## 3 in the code):
1. Create the regular expression by pasting together the current match string i with the other bits of the regular expression we want to use.
2. Use grepl() to return a logical indicator for those elements of x2 that contain the string i.
3. Use the same style of gsub() call as you were already shown, but only on the elements of x2 that matched the string, and replace only those elements.
The loop is:
for(i in matches) {
rgexp <- paste(".*(", i, ").*", sep = "") ## 1
ind <- grepl(rgexp, x2) ## 2
x2[ind] <- gsub(rgexp, "\\1", x2[ind]) ## 3
}
x2
Which gives:
> x2
[1] "EV" "PIM" "WDR" "MYC" "OBX" "EV" "PIM" "WDR" "MYC" "OBX" "OBX" "MYC"
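A variation on the same idea, in case it is useful: the sketch below avoids regular expressions altogether by using fixed-string matching, and it returns NA for names that contain none of the labels. The function name extract_label and the last test string are invented for the example.

```r
# Hypothetical helper: return the first label found in each name, or NA
extract_label <- function(x, labels) {
  out <- rep(NA_character_, length(x))
  for (lab in labels) {
    # fixed = TRUE treats lab as a plain string, so no "|" (or any other
    # regex syntax) is needed; names that already have a label are skipped
    hit <- is.na(out) & grepl(lab, x, fixed = TRUE)
    out[hit] <- lab
  }
  out
}

x <- c("U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "X1.U2OS...OBX", "no.label.here")
labels_found <- extract_label(x, c("MYC", "EV", "PIM", "WDR", "OBX"))
labels_found
```

Because the first matching label wins, the order of the labels vector matters if one name could contain more than one of them.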
