Get rid of repetitive characters from a column name in R - r

Here is a portion of my large dataframe
> a
SS29.SS29 PP1.PP1 SS4.SS4 CC43.CC43 FF57.FF57 NN23.NN23 MM25.MM25 KK9.KK9 MM55.MM55 AA75.AA75 SS88.SS88
1 669.9544 1.068153 35.86534 24.47688 1.058007 72.20306 1.854856 10.15414 0.08715572 0.02006310 0.1817582
2 651.2092 1.164428 37.59895 27.41381 1.095322 73.48029 1.927993 10.09958 0.09096972 0.02261701 0.1855258
How can I get rid of the doubled column names separated by a dot? E.g. for the first column I'd like to have SS29 instead of the repetitive SS29.SS29, for the second column PP1, and so on. Is there an automated way of doing it?

The simplest way is to use sub to remove the dot and everything after it.
names(a) <- sub('\\.[^.]*', '', names(a))
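If some column names might contain a dot without being duplicated, a backreference can restrict the change to names whose two halves are identical. This is only a minimal sketch on a toy data frame; a_demo, its values, and the extra x.y column are made up for illustration.

```r
# Toy data frame; the column names mimic the pattern in the question,
# plus one hypothetical column (x.y) that should be left alone.
a_demo <- data.frame(SS29.SS29 = c(669.95, 651.21),
                     PP1.PP1   = c(1.07, 1.16),
                     x.y       = c(1, 2))

# "^(.+)\\.\\1$" matches only names of the form <name>.<same name>;
# the backreference \\1 requires the second half to repeat the first.
names(a_demo) <- sub("^(.+)\\.\\1$", "\\1", names(a_demo), perl = TRUE)

names(a_demo)
# [1] "SS29" "PP1"  "x.y"
```

Names that do not follow the duplicated pattern are returned unchanged, which makes this variant a bit safer on a large data frame with mixed column names.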

You could use sub
names(a) <- sub("[.](.*)", "", names(a))
# [1] "SS29" "PP1" "SS4" "CC43" "FF57" "NN23"
# [7] "MM25" "KK9" "MM55" "AA75" "SS88"
or a substring
substring(names(a), 1, regexpr("[.]", names(a))-1)
# [1] "SS29" "PP1" "SS4" "CC43" "FF57" "NN23"
# [7] "MM25" "KK9" "MM55" "AA75" "SS88"
or strsplit
names(a) <- unlist(strsplit(names(a), "[.](.*)"))
# [1] "SS29" "PP1" "SS4" "CC43" "FF57" "NN23"
# [7] "MM25" "KK9" "MM55" "AA75" "SS88"

You can assign new column names with
colnames(a) <- new_column_names
To compute new_column_names, you can use regular expressions, e.g. the gsub function, as ssdecontrol suggested.
new_column_names <- gsub(...)

Related

making for loop for character vector in R

char_vector <- c("Africa", "identical", "ending" ,"aa" ,"bb", "rain" ,"Friday" ,"transport") # character vector
Suppose I have the above character vector
I would like to write a for loop that prints only the elements of the vector that have more than 5 characters and start with a vowel,
and that also deletes from the vector the elements that do not start with a vowel.
I wrote this for loop, but it also prints empty results (character(0)):
for (i in char_vector){
if (str_length(i) > 5){
i <- str_subset(i, "^[AEIOUaeiou]")
print(i)
}
}
The result for the above is
[1] "Africa"
[1] "identical"
[1] "ending"
character(0)
character(0)
My desired result would only be the first 3 characters
I'm really new to R and facing huge difficulty with creating a for loop for this problem. Any help would be greatly appreciated!
Use grepl with the pattern ^[AEIOUaeiuo]\w{5,}$:
char_vector <- c("Africa", "identical", "ending" ,"aa" ,"bb", "rain" ,"Friday" ,"transport")
char_vector <- char_vector[grepl("^[AEIOUaeiuo]\\w{5,}$", char_vector)]
char_vector
[1] "Africa" "identical" "ending"
The regex pattern used here says to match words which:
^ from the start of the word
[AEIOUaeiuo] starts with a vowel
\w{5,} followed by 5 or more characters (total length > 5)
$ end of the word
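To see how the two conditions combine, it can help to evaluate them separately. The sketch below uses only base R; the intermediate names long_enough, starts_with_vowel, both, and vowel_only are introduced here for illustration.

```r
char_vector <- c("Africa", "identical", "ending", "aa", "bb",
                 "rain", "Friday", "transport")

long_enough       <- nchar(char_vector) > 5               # more than 5 characters
starts_with_vowel <- grepl("^[AEIOUaeiou]", char_vector)  # first letter is a vowel

# Both tests are vectorized, so they can be combined element-wise with &.
both <- char_vector[long_enough & starts_with_vowel]
both
# [1] "Africa"    "identical" "ending"

# Dropping only the elements that do not start with a vowel:
vowel_only <- char_vector[starts_with_vowel]
vowel_only
# [1] "Africa"    "identical" "ending"    "aa"
```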
You don't need a for loop, because we use vectorized functions in R.
A simple solution using grep and substr (refer to Tim Biegeleisen's answer for details):
substr(grep("^[aeiou].{5}", char_vector, ignore.case = TRUE, value = TRUE), 1, 3)
# [1] "Afr" "ide" "end"
With stringr functions, you'd rather use str_detect instead of str_subset, and you can take advantage of the fact that those functions are vectorized:
library(stringr)
char_vector[str_length(char_vector) > 5 & str_detect(char_vector, "^[AEIOUaeiou]")]
#[1] "Africa" "identical" "ending"
or if you want your for loop as a single vector:
vec <- c()
for (i in char_vector){
if (str_length(i) > 5 && str_detect(i, "^[AEIOUaeiou]")){
vec <- c(vec, i)
}
}
vec
# [1] "Africa" "identical" "ending"
The first 3 characters?
library(stringr)
for (i in char_vector){
if (str_length(i) > 5 && str_detect(i, "^[AEIOUaeiou]")) {
word <- str_sub(i, 1, 3)
print(word)
}
}
output is:
[1] "Afr"
[1] "ide"
[1] "end"
Using only base R functions. No need for a loop. I wrapped the steps in a function so you can use it with other character vectors. You could make this code shorter (see @utubun's answer), but I feel it is easier to understand the process with a "one line, one step" approach.
char_vector <- c("Africa", "identical", "ending" ,"aa" ,"bb", "rain" ,"Friday" ,"transport")
yourfun <- function(char_vector){
char_vector <- char_vector[nchar(char_vector) > 5] # keep only the strings with more than 5 characters
char_vector <- char_vector[grep(pattern = "^[AEIOUaeiou]", char_vector)] # keep the strings that start with a vowel
return(char_vector) # return the matching strings
# remove comments to get the first three characters of each string
# out <- substring(char_vector, 1, 3) # select only the first 3 characters of each string
# return(out)
}
yourfun(char_vector = char_vector)
#> [1] "Africa" "identical" "ending"
Created on 2022-05-09 by the reprex package (v2.0.1)

Replace colnames to substring of colname

I wonder how I can replace the colnames of my data frame with the unique string in each original colname?
> colnames(df.iso)
[1] "../trimmed/100G.tally.fasta" "../trimmed/100R.tally.fasta" "../trimmed/106G.tally.fasta"
[4] "../trimmed/106R.tally.fasta" "../trimmed/122G.tally.fasta" "../trimmed/122R.tally.fasta"
[7] "../trimmed/124G.tally.fasta" "../trimmed/124R.tally.fasta" "../trimmed/126G.tally.fasta"
[10] "../trimmed/126R.tally.fasta" "../trimmed/134G.tally.fasta" "../trimmed/134R.tally.fasta"
We can use sub with ?basename to extract the substring from the column names. Assign the output back to the column names to reflect the change.
colnames(df.iso) <- sub("\\..*", '', basename(colnames(df.iso)))
If we don't want to use basename, sub can also be used alone.
colnames(df.iso) <- sub("([^/]+/){2}([^.]+).*",
"\\2", colnames(df.iso))
Similarly to @Akrun's second answer,
colnames(df.iso) <- sub("[^0-9]+([0-9]+[A-Z])\\.tal.*", "\\1", colnames(df.iso))
should also do the trick. His first method is likely faster, though that probably won't matter here.
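As a quick check, the three patterns above can be run side by side on a few of the names from the question. This is only a sketch: cols holds a hand-copied subset of the column names, and the via_* names are introduced here for illustration.

```r
# A few of the original column names, copied by hand for the demo.
cols <- c("../trimmed/100G.tally.fasta",
          "../trimmed/100R.tally.fasta",
          "../trimmed/106G.tally.fasta")

# 1) Strip the directory with basename(), then drop everything from the first dot.
via_basename <- sub("\\..*", "", basename(cols))

# 2) Skip two "<dir>/" components, keep the text up to the first dot.
via_sub <- sub("([^/]+/){2}([^.]+).*", "\\2", cols)

# 3) Keep the digits plus the single capital letter before ".tal".
via_digits <- sub("[^0-9]+([0-9]+[A-Z])\\.tal.*", "\\1", cols)

via_basename
# [1] "100G" "100R" "106G"
```

The basename version makes the fewest assumptions about the directory depth, so it is probably the easiest one to reuse if the paths change.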

Adding leading 0s in r

I have a large data frame that is filled with characters such as:
x <- c("Y188","Y204" ,"Y221","EP121_1" ,"Y233" , "Y248" ,"Y268", "BB2","BB20",
"BB32" ,"BB044" ,"BB056" , "Y234" , "Y249" ,"Y271" ,"BB3", "BB21", "BB33",
"BB045","BB057" ,"Y236", "Y250", "Y272" , "BB4", "BB22" )
As you can see, certain tags such as BB20 only have two digits. I would like every element to have at least 3 digits, like this (the issue is only in the BB tags, if that helps):
Y188, Y204, Y221, EP121_1, Y233, Y248, Y268, BB002, BB020, BB032, BB044,
BB056, Y234, Y249, Y271, BB003, BB021, BB033, BB045, BB057, Y236, Y250,
Y272, BB004, BB022
I've looked into the sprintf and formatC functions but am still having no luck.
A forceful approach with a nested gsub call:
gsub("(.*[A-Z])(\\d{1}$)", "\\100\\2",
gsub("(.*[A-Z])(\\d{2}$)", "\\10\\2", x))
# [1] "Y188" "Y204" "Y221" "EP121_1" "Y233" "Y248" "Y268" "BB002" "BB020"
# [10] "BB032" "BB044" "BB056" "Y234" "Y249" "Y271" "BB003" "BB021" "BB033"
# [19] "BB045" "BB057" "Y236" "Y250" "Y272" "BB004" "BB022"
There is surely a more general way to do this, but for such a localized task, two simple sub calls can be enough: add one leading zero for two-digit numbers, two leading zeros for one-digit numbers.
x <- sub("^BB(\\d{1})$","BB00\\1",x)
x <- sub("^BB(\\d{2})$","BB0\\1",x)
This works, but it will have edge cases:
# indicator for numeric of length less than three
num <- gsub("[^0-9]", "", x)
id <- nchar(num) < 3
# overwrite relevant values with the reformatted ones
x[id] <- paste0(gsub("[0-9]", "", x)[id],
formatC(as.numeric(num[id]), width = 3, flag = "0"))
[1] "Y188" "Y204" "Y221" "EP121_1" "Y233" "Y248" "Y268" "BB002" "BB020" "BB032"
[11] "BB044" "BB056" "Y234" "Y249" "Y271" "BB003" "BB021" "BB033" "BB045" "BB057"
[21] "Y236" "Y250" "Y272" "BB004" "BB022"
It can be done using the sprintf and gsub functions. This step extracts the numeric values and changes their format.
num=sprintf("%03d",as.numeric(gsub("[^[:digit:]]", "", x)))
The next step is to paste the reformatted numbers back onto the letters.
x=paste(gsub("[^[:alpha:]]", "", x),num,sep="")
Note that this rebuilds every element from its letters and digits only, so a tag such as EP121_1 becomes EP1211.
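One way to leave tags such as EP121_1 untouched is to rewrite only the elements that consist of BB followed by digits. The helper below is a sketch under that assumption; the name pad_bb and its width argument are hypothetical, and the input is a shortened version of the vector from the question.

```r
# Shortened sample of the tags from the question.
x <- c("Y188", "EP121_1", "BB2", "BB20", "BB044")

# Hypothetical helper: pad the numeric part of "BB<digits>" tags with
# leading zeros up to `width` digits and leave every other tag as it is.
pad_bb <- function(tags, width = 3) {
  is_bb <- grepl("^BB[0-9]+$", tags)            # only pure BB tags
  n <- as.integer(sub("^BB", "", tags[is_bb]))  # numeric part as integer
  tags[is_bb] <- paste0("BB", formatC(n, width = width, flag = "0"))
  tags
}

padded <- pad_bb(x)
padded
# [1] "Y188"    "EP121_1" "BB002"   "BB020"   "BB044"
```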

Replace specific characters within strings

I would like to remove specific characters from strings within a vector, similar to the Find and Replace feature in Excel.
Here are the data I start with:
group <- data.frame(c("12357e", "12575e", "197e18", "e18947")
I start with just the first column; I want to produce the second column by removing the e's:
group group.no.e
12357e 12357
12575e 12575
197e18 19718
e18947 18947
With a regular expression and the function gsub():
group <- c("12357e", "12575e", "197e18", "e18947")
group
[1] "12357e" "12575e" "197e18" "e18947"
gsub("e", "", group)
[1] "12357" "12575" "19718" "18947"
What gsub does here is to replace each occurrence of "e" with an empty string "".
See ?regexp or ?gsub for more help.
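The difference between sub and gsub matters when a string contains the character more than once. A small sketch (the string s is made up for illustration):

```r
s <- "e19e7e"

first_only <- sub("e", "", s)   # sub() replaces only the first match
all_of_them <- gsub("e", "", s) # gsub() replaces every match

first_only
# [1] "19e7e"
all_of_them
# [1] "197"
```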
Regular expressions are your friends:
R> ## also adds missing ')' and sets column name
R> group<-data.frame(group=c("12357e", "12575e", "197e18", "e18947"))
R> group
group
1 12357e
2 12575e
3 197e18
4 e18947
Now use gsub() with the simplest possible replacement pattern: empty string:
R> group$groupNoE <- gsub("e", "", group$group)
R> group
group groupNoE
1 12357e 12357
2 12575e 12575
3 197e18 19718
4 e18947 18947
R>
Summarizing 2 ways to replace strings:
group<-data.frame(group=c("12357e", "12575e", "197e18", "e18947"))
1) Use gsub
group$group.no.e <- gsub("e", "", group$group)
2) Use the stringr package
library(stringr)
group$group.no.e <- str_replace_all(group$group, "e", "")
Both will produce the desired output:
group group.no.e
1 12357e 12357
2 12575e 12575
3 197e18 19718
4 e18947 18947
You do not need to create a data frame from a vector of strings if you want to replace some characters in it. Regular expressions are a good choice for this, as already mentioned by @Andrie and @Dirk Eddelbuettel.
Pay attention: if you want to replace special characters, like dots, you should escape them or put them in a character class, as shown in the example below:
ctr_names <- c("Czech.Republic","New.Zealand","Great.Britain")
gsub("[.]", " ", ctr_names)
this will produce
[1] "Czech Republic" "New Zealand" "Great Britain"
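To illustrate why the dot needs special treatment, the sketch below compares an unescaped dot with three ways of matching it literally. The result names (unescaped, escaped, char_class, fixed_match) are introduced here for illustration.

```r
ctr_names <- c("Czech.Republic", "New.Zealand", "Great.Britain")

# An unescaped dot is a regex wildcard, so every character is replaced.
unescaped <- gsub(".", " ", ctr_names[1])

# Three equivalent ways to match a literal dot:
escaped     <- gsub("\\.", " ", ctr_names)              # backslash escape
char_class  <- gsub("[.]", " ", ctr_names)              # character class
fixed_match <- gsub(".", " ", ctr_names, fixed = TRUE)  # no regex at all

fixed_match
# [1] "Czech Republic" "New Zealand"    "Great Britain"
```

fixed = TRUE is often the simplest choice when the pattern is a plain string rather than a regular expression.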
Use the stringi package:
require(stringi)
group<-data.frame(c("12357e", "12575e", "197e18", "e18947"))
stri_replace_all(group[,1], "", fixed="e")
[1] "12357" "12575" "19718" "18947"
> library(stringr)
> group <- c('12357e', '12575e', '12575e', ' 197e18', 'e18947')
> pattern <- "e"
> replacement <- ""
> group <- str_replace(group, pattern, replacement)
> group
[1] "12357" "12575" "12575" " 19718" "18947"
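chartr is sometimes suggested as well, but it only translates characters one-for-one and cannot delete them; a call with an empty replacement fails because old is longer than new:
group$group.no.e <- chartr("e", "", group$group) # error: 'old' is longer than 'new'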

R: Replacing rownames of data frame by a substring[2]

I have a question about the use of gsub. The rownames of my data have the same partial names. See below:
> rownames(test)
[1] "U2OS.EV.2.7.9" "U2OS.PIM.2.7.9" "U2OS.WDR.2.7.9" "U2OS.MYC.2.7.9"
[5] "U2OS.OBX.2.7.9" "U2OS.EV.18.6.9" "U2O2.PIM.18.6.9" "U2OS.WDR.18.6.9"
[9] "U2OS.MYC.18.6.9" "U2OS.OBX.18.6.9" "X1.U2OS...OBX" "X2.U2OS...MYC"
[13] "X3.U2OS...WDR82" "X4.U2OS...PIM" "X5.U2OS...EV" "exp1.U2OS.EV"
[17] "exp1.U2OS.MYC" "EXP1.U20S..PIM1" "EXP1.U2OS.WDR82" "EXP1.U20S.OBX"
[21] "EXP2.U2OS.EV" "EXP2.U2OS.MYC" "EXP2.U2OS.PIM1" "EXP2.U2OS.WDR82"
[25] "EXP2.U2OS.OBX"
In my previous question, I asked if there is a way to get the same names for the same partial names. See this question: Replacing rownames of data frame by a sub-string
The answer is a very nice solution. The function gsub is used in this way:
transfecties = gsub(".*(MYC|EV|PIM|WDR|OBX).*", "\\1", rownames(test))
Now I have another problem: the program I run R with (Galaxy) doesn't recognize the | character. Is there another way to get the same result without using |?
Thanks!
If you don't want to use the "|" character, you can try something like :
Rnames <-
c( "U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "U2OS.WDR.2.7.9", "U2OS.MYC.2.7.9" ,
"U2OS.OBX.2.7.9" , "U2OS.EV.18.6.9" ,"U2O2.PIM.18.6.9" ,"U2OS.WDR.18.6.9" )
Rlevels <- c("MYC","EV","PIM","WDR","OBX")
tmp <- sapply(Rlevels,grepl,Rnames)
apply(tmp,1,function(i)colnames(tmp)[i])
[1] "EV" "PIM" "WDR" "MYC" "OBX" "EV" "PIM" "WDR"
But I would seriously consider mentioning this to the Galaxy team, as it seems rather awkward not to be able to use the symbol for OR...
I wouldn't recommend doing this in general in R, as it is far less efficient than the solution @csgillespie provided, but an alternative is to loop over the various strings you want to match and do the replacements on each string separately, i.e. search for "MYC" and replace only in those rownames that match "MYC".
Here is an example using the x data from @csgillespie's answer:
x <- c("U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "U2OS.WDR.2.7.9", "U2OS.MYC.2.7.9",
"U2OS.OBX.2.7.9", "U2OS.EV.18.6.9", "U2O2.PIM.18.6.9","U2OS.WDR.18.6.9",
"U2OS.MYC.18.6.9","U2OS.OBX.18.6.9", "X1.U2OS...OBX","X2.U2OS...MYC")
Copy the data so we have something to compare with later (this is just for the example):
x2 <- x
Then create a list of strings you want to match on:
matches <- c("MYC","EV","PIM","WDR","OBX")
Then we loop over the values in matches and do three things (numbered ##X in the code):
1. Create the regular expression by pasting together the current match string i with the other bits of the regular expression we want to use.
2. Using grepl(), return a logical indicator for those elements of x that contain the string i.
3. Use the same style of gsub() call as you were already shown, but only on the elements of x2 that matched the string, and replace only those elements.
The loop is:
for(i in matches) {
rgexp <- paste(".*(", i, ").*", sep = "") ## 1
ind <- grepl(rgexp, x) ## 2
x2[ind] <- gsub(rgexp, "\\1", x2[ind]) ## 3
}
x2
Which gives:
> x2
[1] "EV" "PIM" "WDR" "MYC" "OBX" "EV" "PIM" "WDR" "MYC" "OBX" "OBX" "MYC"
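If the environment has trouble with regular-expression metacharacters in general, the same loop can be written without any regex at all. This is only a sketch of that idea, not something the answers above tested on Galaxy: it assumes that plain substring matching with fixed = TRUE is acceptable there, and it uses a shortened, hand-picked sample of the row names.

```r
# Shortened sample of the row names from the question.
x <- c("U2OS.EV.2.7.9", "U2O2.PIM.18.6.9", "X1.U2OS...OBX",
       "X2.U2OS...MYC", "EXP2.U2OS.WDR82")
matches <- c("MYC", "EV", "PIM", "WDR", "OBX")

x2 <- x
for (m in matches) {
  hit <- grepl(m, x, fixed = TRUE)  # literal substring test, no regex
  x2[hit] <- m                      # the replacement is the label itself
}
x2
# [1] "EV"  "PIM" "OBX" "MYC" "WDR"
```

Because the whole element is replaced by the label, no gsub() call is needed; if a name contained two labels, the one that comes later in matches would win.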
