R: Replacing rownames of data frame by a substring [2]

I have a question about the use of gsub. The rownames of my data share the same partial names. See below:
> rownames(test)
[1] "U2OS.EV.2.7.9" "U2OS.PIM.2.7.9" "U2OS.WDR.2.7.9" "U2OS.MYC.2.7.9"
[5] "U2OS.OBX.2.7.9" "U2OS.EV.18.6.9" "U2O2.PIM.18.6.9" "U2OS.WDR.18.6.9"
[9] "U2OS.MYC.18.6.9" "U2OS.OBX.18.6.9" "X1.U2OS...OBX" "X2.U2OS...MYC"
[13] "X3.U2OS...WDR82" "X4.U2OS...PIM" "X5.U2OS...EV" "exp1.U2OS.EV"
[17] "exp1.U2OS.MYC" "EXP1.U20S..PIM1" "EXP1.U2OS.WDR82" "EXP1.U20S.OBX"
[21] "EXP2.U2OS.EV" "EXP2.U2OS.MYC" "EXP2.U2OS.PIM1" "EXP2.U2OS.WDR82"
[25] "EXP2.U2OS.OBX"
In my previous question, I asked if there is a way to get the same names for the same partial names. See this question: Replacing rownames of data frame by a sub-string
The answer is a very nice solution. The function gsub is used in this way:
transfecties = gsub(".*(MYC|EV|PIM|WDR|OBX).*", "\\1", rownames(test))
Now, I have another problem, the program I run with R (Galaxy) doesn't recognize the | characters. My question is, is there another way to get to the same solution without using this |?
Thanks!

If you don't want to use the "|" character, you can try something like:
Rnames <-
c( "U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "U2OS.WDR.2.7.9", "U2OS.MYC.2.7.9" ,
"U2OS.OBX.2.7.9" , "U2OS.EV.18.6.9" ,"U2O2.PIM.18.6.9" ,"U2OS.WDR.18.6.9" )
Rlevels <- c("MYC","EV","PIM","WDR","OBX")
tmp <- sapply(Rlevels,grepl,Rnames)
apply(tmp,1,function(i)colnames(tmp)[i])
[1] "EV" "PIM" "WDR" "MYC" "OBX" "EV" "PIM" "WDR"
But I would seriously consider mentioning this to the Galaxy team, as it seems rather awkward not to be able to use the symbol for OR...
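If the problem is only that a literal "|" in the script text gets mangled (an assumption; the question does not say why Galaxy rejects it), one possible workaround is to build the character from its code point so it never appears in the source. A minimal sketch, where rn, keys and bar are illustrative names:

```r
# Sample row names taken from the question.
rn <- c("U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "U2OS.WDR.2.7.9", "U2OS.MYC.2.7.9")
keys <- c("MYC", "EV", "PIM", "WDR", "OBX")

# Code point 124 is the vertical bar, so no literal bar is typed here.
bar <- intToUtf8(124)

# Same pattern as the original gsub() call, assembled at run time.
pattern <- paste0(".*(", paste(keys, collapse = bar), ").*")
labels <- gsub(pattern, "\\1", rn)
labels
# [1] "EV"  "PIM" "WDR" "MYC"
```

This keeps the single vectorised gsub() call; whether it helps depends on where in the Galaxy pipeline the character is lost.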

I wouldn't recommend doing this in general in R as it is far less efficient than the solution @csgillespie provided, but an alternative is to loop over the various strings you want to match and do the replacements on each string separately, i.e. search for "MYC" and replace only in those rownames that match "MYC".
Here is an example using the x data from @csgillespie's answer:
x <- c("U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "U2OS.WDR.2.7.9", "U2OS.MYC.2.7.9",
"U2OS.OBX.2.7.9", "U2OS.EV.18.6.9", "U2O2.PIM.18.6.9","U2OS.WDR.18.6.9",
"U2OS.MYC.18.6.9","U2OS.OBX.18.6.9", "X1.U2OS...OBX","X2.U2OS...MYC")
Copy the data so we have something to compare with later (this is just for the example):
x2 <- x
Then create a list of strings you want to match on:
matches <- c("MYC","EV","PIM","WDR","OBX")
Then we loop over the values in matches and do three things (numbered ## 1 to ## 3 in the code):
1. Create the regular expression by pasting together the current match string i with the other bits of the regular expression we want to use.
2. Use grepl() to return a logical indicator for those elements of x that contain the string i.
3. Use the same style of gsub() call as you were already shown, but only on the elements of x2 that matched the string, and replace only those elements.
The loop is:
for(i in matches) {
rgexp <- paste(".*(", i, ").*", sep = "") ## 1
ind <- grepl(rgexp, x) ## 2
x2[ind] <- gsub(rgexp, "\\1", x2[ind]) ## 3
}
x2
Which gives:
> x2
[1] "EV" "PIM" "WDR" "MYC" "OBX" "EV" "PIM" "WDR" "MYC" "OBX" "OBX" "MYC"
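Since the replacement is always the matched string itself, the loop can be reduced to a plain fixed-string search. The following is only a sketch of that idea; label_by_substring is a hypothetical helper name, and it assumes each name contains at most one of the keys (if several match, the last key in the vector wins):

```r
# Hypothetical helper: label each element of x by the key it contains.
label_by_substring <- function(x, keys) {
  out <- x
  for (k in keys) {
    # fixed = TRUE treats k as a literal string, not a regular expression
    hit <- grepl(k, x, fixed = TRUE)
    out[hit] <- k
  }
  out
}

x <- c("U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "U2OS.WDR.2.7.9", "U2OS.MYC.2.7.9",
       "U2OS.OBX.2.7.9", "U2OS.EV.18.6.9", "U2O2.PIM.18.6.9", "U2OS.WDR.18.6.9",
       "U2OS.MYC.18.6.9", "U2OS.OBX.18.6.9", "X1.U2OS...OBX", "X2.U2OS...MYC")
res <- label_by_substring(x, c("MYC", "EV", "PIM", "WDR", "OBX"))
res
#  [1] "EV"  "PIM" "WDR" "MYC" "OBX" "EV"  "PIM" "WDR" "MYC" "OBX" "OBX" "MYC"
```

Elements that contain none of the keys are returned unchanged, which makes unmatched names easy to spot.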

Related

How to randomly reshuffle letters in words

I am trying to make a word scrambler in R. So i have put some words in a collection and tried to use strsplit() to split the letters of each word in the collection.
But I don't understand how to jumble the letters of a word and merge them to one word in R Tool. Does anyone know how can I solve this?
This is what I have done
Once you've split the words, you can use sample() to rescramble the letters, and then paste0() with collapse="", to concatenate back into a 'word'
lapply(words, function(x) paste0(sample(strsplit(x, split="")[[1]]), collapse=""))
You can use the stringi package if you want:
> stringi::stri_rand_shuffle(c("hello", "goodbye"))
[1] "oellh" "deoygob"
Here's a one-liner:
lapply(lapply(strsplit(strings, ""), sample), paste0, collapse = "")
[[1]]
[1] "elfi"
[[2]]
[1] "vleo"
[[3]]
[1] "rmsyyet"
Use unlist to get rid of the list:
unlist(lapply(lapply(strsplit(strings, ""), sample), paste0, collapse = ""))
Data:
strings <- c("life", "love", "mystery")
You can use the sample function for this.
Here is an example of doing it for a single word. You can use this within your for-loop:
yourword <- "hello"
# split: Split will return a list with one char vector in it.
# We only want to interact with the vector not the list, so we extract the first
# (and only) element with "[[1]]"
jumble <- strsplit(yourword,"")[[1]]
jumble <- sample(jumble, # sample random element from jumble
size = length(jumble), # as many times as the length of jumble
# ergo all Letters
replace = FALSE # do not sample an element multiple times
)
restored <- paste0(jumble,
collapse = "" # join the letters with no separator
)
As the answer from langtang suggests, you can use the apply family for this, which is more efficient. But maybe this answer helps the understanding of what R is actually doing here.
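The pieces above can be combined into one small function that returns a character vector rather than a list. This is a sketch only; scramble is an illustrative name and the seed is set just to make the example repeatable:

```r
# Hypothetical helper: shuffle the letters of every word in a character vector.
scramble <- function(words) {
  vapply(
    strsplit(words, ""),                                   # one vector of letters per word
    function(chars) paste(sample(chars), collapse = ""),   # permute, then glue back together
    character(1)
  )
}

set.seed(1)  # only so that repeated runs give the same shuffle
shuffled <- scramble(c("life", "love", "mystery"))
shuffled
```

vapply() is used instead of lapply() so the result is already a plain character vector and no unlist() step is needed.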

Extracting and matching regular expressions in R

I have a list of strings, an example is shown below (the actual list has a much bigger variety in format)
[1] "AB-123"
[2] "AB-312"
[3] "AB-546"
[4] "ZXC/123456"
Assuming [1] is the correct format, I want to extract the regular expression from [1] and match it against the rest to detect that [4] is inconsistent. Is there a method to do this or is there a better way to achieve the same outcome?
*EDIT - I found something close to what I require. Does anyone know of any packages that do this?
Given a string, generate a regex that can parse *similar* strings
We can use grepl, taking the part of the first element before the hyphen as the pattern:
grepl(sub("-.*", "", v1[1]), v1[-1])
data
v1 <- c( "AB-123" , "AB-312" , "AB-546" , "ZXC/123456")
Here's an attempt at making a function which checks whether each character of each value is a Character, a Digit, or Other. It is a bit rough, but I'm sure it can be expanded upon to match exactly what you want:
test <- c("AB-123", "AB-312", "AB-546", "ZXC/123456")
compare_1st <- function(x) {
x <- toupper(x)
chars <- list("A",1,"-")
repl <- c("[A-Z]", "[0-9]", "[^0-9A-Z]")
for(i in seq_along(repl)) x <- gsub(repl[i], chars[i], x)
out <- x[1] == x
attr(out, "values") <- chartr("A1-", "CDO", x)
out
}
compare_1st(test)
#[1] TRUE TRUE TRUE FALSE
#attr(,"values")
#[1] "CCODDD" "CCODDD" "CCODDD" "CCCODDDDDD"
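A related sketch, closer to the "generate a regex from an example" idea in the question: turn the reference string into a pattern character by character and test the others against it. template_to_regex is a hypothetical name, and the mapping (any letter, any digit, everything else literal) is an assumption about what "same format" should mean:

```r
# Hypothetical helper: build an anchored regex from one example string.
template_to_regex <- function(s) {
  chars <- strsplit(s, "")[[1]]
  parts <- ifelse(grepl("[A-Za-z]", chars), "[A-Za-z]",       # letter -> any letter
           ifelse(grepl("[0-9]", chars), "[0-9]",             # digit  -> any digit
                  paste0("\\Q", chars, "\\E")))               # other  -> literal (PCRE quoting)
  paste0("^", paste(parts, collapse = ""), "$")
}

v1 <- c("AB-123", "AB-312", "AB-546", "ZXC/123456")
pattern <- template_to_regex(v1[1])
# perl = TRUE is needed because \Q...\E is PCRE syntax
consistent <- grepl(pattern, v1, perl = TRUE)
consistent
# [1]  TRUE  TRUE  TRUE FALSE
```

Unlike compare_1st above, this yields a reusable pattern, so new values can be checked later without keeping the reference string around.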

Replace colnames to substring of colname

I wonder how I can replace the colnames of my data frame with the unique string in the original colname?
> colnames(df.iso)
[1] "../trimmed/100G.tally.fasta" "../trimmed/100R.tally.fasta" "../trimmed/106G.tally.fasta"
[4] "../trimmed/106R.tally.fasta" "../trimmed/122G.tally.fasta" "../trimmed/122R.tally.fasta"
[7] "../trimmed/124G.tally.fasta" "../trimmed/124R.tally.fasta" "../trimmed/126G.tally.fasta"
[10] "../trimmed/126R.tally.fasta" "../trimmed/134G.tally.fasta" "../trimmed/134R.tally.fasta"
We can use sub with ?basename to extract the substring from the column names. Assign the output back to the column names to reflect the change.
colnames(df.iso) <- sub("\\..*", '', basename(colnames(df.iso)))
If we don't want to use basename, sub can also be used alone.
colnames(df.iso) <- sub("([^/]+/){2}([^.]+).*",
"\\2", colnames(df.iso))
Similarly to @akrun's second answer,
colnames(df.iso) <- sub("[^0-9]+([0-9]+[A-Z])\\.tal.*", "\\1", colnames(df.iso))
should also do the trick. His first method is likely faster, although that probably won't matter here.
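Because shortening names can accidentally produce duplicates, it may be worth checking the new names before assigning them. A small sketch on a toy data frame (the two columns and their values are made up for illustration):

```r
# Toy data frame standing in for df.iso; the values are invented.
df.iso <- data.frame(a = 1:2, b = 3:4)
colnames(df.iso) <- c("../trimmed/100G.tally.fasta", "../trimmed/100R.tally.fasta")

# Drop the directory part, then everything from the first dot onwards.
short <- sub("\\..*", "", basename(colnames(df.iso)))

# Stop early if two columns would end up with the same name.
stopifnot(!anyDuplicated(short))

colnames(df.iso) <- short
colnames(df.iso)
# [1] "100G" "100R"
```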

Adding leading 0s in r

I have a large data frame that is filled with characters such as:
x <- c("Y188","Y204" ,"Y221","EP121_1" ,"Y233" , "Y248" ,"Y268", "BB2","BB20",
"BB32" ,"BB044" ,"BB056" , "Y234" , "Y249" ,"Y271" ,"BB3", "BB21", "BB33",
"BB045","BB057" ,"Y236", "Y250", "Y272" , "BB4", "BB22" )
As you can see, certain tags such as BB20 only have two digits. I would like every tag to have at least 3 digits, like this (the issue is only in the BB tags, if that helps):
Y188, Y204, Y221, EP121_1, Y233, Y248, Y268, BB002, BB020, BB032, BB044,
BB056, Y234, Y249, Y271, BB003, BB021, BB033, BB045, BB057, Y236, Y250,
Y272, BB004, BB022
I've looked into the sprintf and formatC functions but am still having no luck.
A forceful approach with a nested gsub call:
gsub("(.*[A-Z])(\\d{1}$)", "\\100\\2",
gsub("(.*[A-Z])(\\d{2}$)", "\\10\\2", x))
# [1] "Y188" "Y204" "Y221" "EP121_1" "Y233" "Y248" "Y268" "BB002" "BB020"
# [10] "BB032" "BB044" "BB056" "Y234" "Y249" "Y271" "BB003" "BB021" "BB033"
# [19] "BB045" "BB057" "Y236" "Y250" "Y272" "BB004" "BB022"
There is surely a more general way to do this, but for such a localized task, two simple sub calls can be enough: add one leading zero for two-digit numbers, and two leading zeros for one-digit numbers.
x <- sub("^BB(\\d{1})$","BB00\\1",x)
x <- sub("^BB(\\d{2})$","BB0\\1",x)
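This works, but has edge cases (for example, a hypothetical tag such as "EP1_1", with fewer than three digits split around a separator, would be rearranged to "EP_011"):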
# indicator for numeric of length less than three
num <- gsub("[^0-9]", "", x)
id <- nchar(num) < 3
# overwrite relevant values with the reformatted ones
x[id] <- paste0(gsub("[0-9]", "", x)[id],
formatC(as.numeric(num[id]), width = 3, flag = "0"))
[1] "Y188" "Y204" "Y221" "EP121_1" "Y233" "Y248" "Y268" "BB002" "BB020" "BB032"
[11] "BB044" "BB056" "Y234" "Y249" "Y271" "BB003" "BB021" "BB033" "BB045" "BB057"
[21] "Y236" "Y250" "Y272" "BB004" "BB022"
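It can be done using the sprintf and gsub functions (note that this drops every character that is neither a letter nor a digit, so EP121_1 becomes EP1211). This step extracts the numeric values and changes their format: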
num=sprintf("%03d",as.numeric(gsub("[^[:digit:]]", "", x)))
The next step is to paste the reformatted numbers back onto the letters:
x=paste(gsub("[^[:alpha:]]", "", x),num,sep="")
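Since the question says only the BB tags need fixing, a narrower sketch is to pad just the elements that are exactly "BB" followed by digits and leave everything else (including EP121_1) untouched. pad_bb and its width argument are illustrative names, not part of any answer above:

```r
# Hypothetical helper: zero-pad the numeric part of pure "BB<digits>" tags.
pad_bb <- function(x, width = 3) {
  is_bb <- grepl("^BB[0-9]+$", x)                 # only tags of the form BB + digits
  digits <- as.integer(sub("^BB", "", x[is_bb]))  # numeric part of those tags
  fmt <- paste0("BB%0", width, "d")               # e.g. "BB%03d"
  x[is_bb] <- sprintf(fmt, digits)
  x
}

padded <- pad_bb(c("Y188", "EP121_1", "BB2", "BB20", "BB044"))
padded
# [1] "Y188"    "EP121_1" "BB002"   "BB020"   "BB044"
```

Tags that already have three or more digits come out unchanged, so the function can be applied repeatedly without harm.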

Splitting merged words (with mini-dictionary)

I have a set of words: some of which are merged terms, and others that are just simple words. I also have a separate list of words that I am going to use to compare with my first list (as a dictionary) in order to 'un-merge' certain words.
Here's an example:
ListA <- c("dopamine", "andthe", "lowerswim", "other", "different")
ListB <- c("do", "mine", "and", "the", "lower", "owe", "swim")
My general procedure would be something like this:
1. Search for patterns from ListB that occur twice in a word in ListA, where the merged terms are consecutive (no spare letters in the word). So for example, from ListA 'lowerswim' would match with 'lower' and 'swim', not 'owe' and 'swim'.
2. For each selected word, check if that word exists in ListB. If yes, then keep it in ListA. Otherwise, split the word into the two words matched with words from ListB.
Does this sound sensible? And if so, how do I implement it in R? Maybe it sounds quite routine but at the moment I'm having trouble with:
- Searching for words inside words. I can match words from lists no problem, but I'm not sure how to use grep or an equivalent to go further than this.
- Declaring that the words must be consecutive. I've been thinking about this for a while but I can't seem to get anything to work.
Can anyone please send me in the right direction?
I think the first step would be to build all the combined pairs from ListB:
pairings <- expand.grid(ListB, ListB)
combos <- apply(pairings, 1, function(x) paste0(x[1], x[2]))
combos
# [1] "dodo" "minedo" "anddo" "thedo" "lowerdo" "owedo" "swimdo"
# [8] "domine" "minemine" "andmine" "themine" "lowermine" "owemine" "swimmine"
# [15] "doand" "mineand" "andand" "theand" "lowerand" "oweand" "swimand"
# [22] "dothe" "minethe" "andthe" "thethe" "lowerthe" "owethe" "swimthe"
# [29] "dolower" "minelower" "andlower" "thelower" "lowerlower" "owelower" "swimlower"
# [36] "doowe" "mineowe" "andowe" "theowe" "lowerowe" "oweowe" "swimowe"
# [43] "doswim" "mineswim" "andswim" "theswim" "lowerswim" "oweswim" "swimswim"
You can use str_extract from the stringr package to extract the element of combos that is contained within each element of ListA, if such an element exists:
library(stringr)
matches <- str_extract(ListA, paste(combos, collapse="|"))
matches
# [1] NA "andthe" "lowerswim" NA NA
Finally, you want to split the words in ListA that matched a pair of elements from ListB, unless this word is already in ListB. I suppose there are lots of ways to do this, but I'll use lapply and unlist:
newA <- unlist(lapply(seq_along(ListA), function(idx) {
if (is.na(matches[idx]) || ListA[idx] %in% ListB) {
return(ListA[idx])
} else {
return(as.vector(as.matrix(pairings[combos == matches[idx],])))
}
}))
newA
# [1] "dopamine" "and" "the" "lower" "swim" "other" "different"
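Note that str_extract looks for a combo anywhere inside a word, so a longer word that merely contains "andthe" would also be split. If the whole word must be exactly two dictionary words with no spare letters, a base R sketch using exact matching could look like this (split_merged is a hypothetical helper name):

```r
ListA <- c("dopamine", "andthe", "lowerswim", "other", "different")
ListB <- c("do", "mine", "and", "the", "lower", "owe", "swim")

# All ordered pairs of dictionary words and their concatenations.
pairings <- expand.grid(first = ListB, second = ListB, stringsAsFactors = FALSE)
combos <- paste0(pairings$first, pairings$second)

# Hypothetical helper: split a word only if it is exactly one of the combos.
split_merged <- function(word) {
  if (word %in% ListB) return(word)    # already a dictionary word: keep it
  hit <- match(word, combos)           # exact match, so no spare letters allowed
  if (is.na(hit)) return(word)         # not a merged pair: keep it
  c(pairings$first[hit], pairings$second[hit])
}

newA <- unlist(lapply(ListA, split_merged))
newA
# [1] "dopamine"  "and"       "the"       "lower"     "swim"      "other"     "different"
```

match() returns only the first matching pair, so if two different pairs spell the same merged word, only one split is reported.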
