Conditional shift of substring position in R


> Df1
[1] "HM_004_T" "HM_004_T2" "HM_005_T" "HM_005_T2" "HM_007_T" "HM_007_T2" "HM_088_TR"
[8] "HM_088_T3"
This follows on from the question "change position of word within a string in r", but my problem is slightly different. I first want to delete _T when it appears on its own at the end of a string, and I want to move _T2, _T3 or _TR from the end of the string to the front, before all other text.
My ideal output would be:
Df1 <- c("HM_004", "T2_HM_004", "HM_005", "T2_HM_005", "HM_007", "T2_HM_007", "TR_HM_088", "T3_HM_088")
Input data
Df1 <- c("HM_004_T", "HM_004_T2", "HM_005_T", "HM_005_T2", "HM_007_T", "HM_007_T2", "HM_088_TR", "HM_088_T3")

You can do this with nested sub and backreference:
DF1 <- sub("(.*)_(T\\w)$", "\\2_\\1", sub("_T$", "", DF1))
Here you delete string-final _T in the first sub operation and pass the result to the second sub operation. The second call switches the order of (i) whatever comes before the last underscore _ and (ii) T followed by a single word character (\\w, i.e. a letter, a digit or an underscore), by referring to these two substrings with the backreferences \\1 and \\2.
Result:
DF1
[1] "HM_004" "T2_HM_004" "HM_005" "T2_HM_005" "HM_007" "T2_HM_007" "TR_HM_088" "T3_HM_088"
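To see what each pass contributes, the nested call can be unrolled into two steps. This is only an illustrative sketch: the names x, step1 and step2 are made up here and the input is a subset of the question's data.

```r
# Illustrative input (a subset of the question's data)
x <- c("HM_004_T", "HM_004_T2", "HM_088_TR")

# Pass 1: drop a bare "_T" at the end of the string
step1 <- sub("_T$", "", x)

# Pass 2: move a trailing "_T<one word character>" to the front;
# strings without such a suffix do not match and are returned unchanged
step2 <- sub("(.*)_(T\\w)$", "\\2_\\1", step1)

print(step1)
print(step2)
```

Because sub() returns non-matching strings untouched, the order of the two passes matters: the bare _T has to be removed first so that the second pattern only ever sees the suffixes that should be moved.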
Data:
DF1 <- c("HM_004_T", "HM_004_T2", "HM_005_T", "HM_005_T2",
"HM_007_T", "HM_007_T2", "HM_088_TR", "HM_088_T3")

You can achieve this relatively easily with the package stringr and the functions str_remove() and str_replace().
I am assuming that the patterns of interest always occur at the end of the text and that they are always preceded by _.
Please have a look at the updated code below. It now also treats _T followed by a letter (e.g. _TR), not just _T followed by a digit, as a pattern to move to the front.
library(stringr)
Df1 <- c("HM_004_T", "HM_004_T2", "HM_005_T", "HM_005_T2",
"HM_007_T", "HM_007_T2", "HM_088_TR", "HM_088_T3")
# Here I remove a bare "_T" at the end of the string; suffixes such as
# "_T2" or "_TR" are left in place for the next step
df2 <- str_remove(Df1, "_T$")
# Here I replace the patterns through the group reference
final <- str_replace( df2, "(^.*)_(T\\d+$|T\\w+$)", "\\2_\\1" )
final
#> [1] "HM_004" "T2_HM_004" "HM_005" "T2_HM_005" "HM_007" "T2_HM_007"
#> [7] "TR_HM_088" "T3_HM_088"
# A more concise way is the following, where \\w is the workhorse.
final <- str_replace( df2, "(^.*)_(T\\w$)", "\\2_\\1" )
final
#> [1] "HM_004" "T2_HM_004" "HM_005" "T2_HM_005" "HM_007" "T2_HM_007"
#> [7] "TR_HM_088" "T3_HM_088"
Created on 2021-02-16 by the reprex package (v1.0.0)
Does this work for you?
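One difference between the two patterns above is worth spelling out: T\\w matches exactly one character after the T, while T\\w+ also accepts longer suffixes. The sketch below assumes a hypothetical suffix _T10 that does not occur in the question's data, and uses base R's sub() instead of stringr so that it runs without extra packages; [[:alnum:]] is used in place of \\w to keep underscores out of the suffix.

```r
# "_T10" is a made-up suffix, added only to show the difference
x <- c("HM_004_T", "HM_004_T2", "HM_004_T10")

# Step 1: remove a bare "_T" at the end
trimmed <- sub("_T$", "", x)

# Step 2a: exactly one character after the T ("_T10" is NOT moved)
one_char <- sub("^(.*)_(T\\w)$", "\\2_\\1", trimmed)

# Step 2b: one or more letters/digits after the T ("_T10" IS moved)
multi_char <- sub("^(.*)_(T[[:alnum:]]+)$", "\\2_\\1", trimmed)

print(one_char)
print(multi_char)
```

Which variant is appropriate depends on whether longer suffixes can occur in the real data.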

Related

Gtools mixedsort not working as expected on numeric string

I have a string
str1 <- "T-759..780, -D-27..758_E, -D-781..1338_C"
And I tried to use gtools::mixedsort to order these comma separated strings.
sapply(strsplit(str1 , ','), function(x) toString(gtools::mixedsort(x)))
I get
" -D-781..1338_C, -D-27..758_E, T-759..780"
I am expecting
"-D-27..758_E, T-759..780, -D-781..1338_C"
Not sure what I need to do to get the expected output.

I think you have a misconception about how mixedsort() works. It doesn't sort by the numbers in the string alone: it splits each string into separate character and number parts and sorts on all of them in order. I hope this small example illustrates how mixedsort() works. It starts by sorting the elements of the vector c("B_1", "A_2", "A_10") by their first string part c("B", "A", "A"), so A always comes before B, and then it sorts the two A elements by their numbers 2 and 10:
# example showing how mixedsort works
example <- c("B_1", "A_2", "A_10")
gtools::mixedsort(example)
#> [1] "A_2" "A_10" "B_1"
sort(example) # in comparison to normal sort, which doesn't recognize parts of the string as numbers
#> [1] "A_10" "A_2" "B_1"
Created on 2022-09-02 by the reprex package (v2.0.1)
But according to your example, you want to sort a vector by the first number that appears in each element, ignoring a possible - in front of the number. In that case, you can just use a regular expression to extract the first number in a string with gsub(".*?([0-9]+).*", "\\1", x), convert it to numeric, and use that to sort the vector. I wrote a small function for it:
# function to sort by first number, ignoring minus before the number
sort.first.number <- function(x) {
  # as.numeric() makes order() compare values rather than character strings
  v <- as.numeric(gsub(".*?([0-9]+).*", "\\1", x))
  x[order(v)]
}
str1 <- "T-759..780, -D-27..758_E, -D-781..1338_C"
sapply(strsplit(str1 , ','), function(x) toString(sort.first.number(x)))
#> [1] " -D-27..758_E, T-759..780, -D-781..1338_C"
Created on 2022-09-02 by the reprex package (v2.0.1)
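One detail to watch with this approach: the extracted numbers are character strings, and order() on character strings compares them as text, so "1000" would come before "27". The following is a minimal sketch of a variant that converts to numeric first and also strips the space after each comma when splitting. The name sort_by_first_number and the second test vector are assumptions made for illustration; perl = TRUE is set so that the lazy quantifier .*? is handled by PCRE.

```r
# Sort strings by the first run of digits they contain, compared as numbers
sort_by_first_number <- function(x) {
  first <- as.numeric(sub(".*?([0-9]+).*", "\\1", x, perl = TRUE))
  x[order(first)]
}

str1 <- "T-759..780, -D-27..758_E, -D-781..1338_C"

# Split on a comma plus any following whitespace, so no leading spaces remain
parts <- strsplit(str1, ",\\s*")[[1]]
sorted <- toString(sort_by_first_number(parts))
print(sorted)

# Numbers with different digit counts (made-up values) still sort numerically
mixed <- sort_by_first_number(c("A-1000", "B-27", "C-759"))
print(mixed)
```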

How do I use regular expressions to match a substring?

I want to change the rownames of cov_stats, such that it contains a substring of the FileName column values. I only want to retain the string that begins with "SRR" followed by 8 digits (e.g., SRR18826803).
cov_list <- list.files(path="./stats/", full.names=T)
cov_stats <- rbindlist(sapply(cov_list, fread, simplify=F), use.names=T, idcol="FileName")
rownames(cov_stats) <- gsub("^\.\/\SRR*_\stats.\txt", "SRR*", cov_stats[["FileName"]])
Second attempt
rownames(cov_stats) <- gsub("^SRR[:digit:]*", "", cov_stats[["FileName"]])
Original strings
> cov_stats[["FileName"]]
[1] "./stats/SRR18826803_stats.txt" "./stats/SRR18826804_stats.txt"
[3] "./stats/SRR18826805_stats.txt" "./stats/SRR18826806_stats.txt"
[5] "./stats/SRR18826807_stats.txt" "./stats/SRR18826808_stats.txt"
Desired substring output
[1] "SRR18826803" "SRR18826804"
[3] "SRR18826805" "SRR18826806"
[5] "SRR18826807" "SRR18826808"

Would this work for you?
library(stringr)
stringr::str_extract(cov_stats[["FileName"]], "SRR.{0,8}")

You can use
rownames(cov_stats) <- sub("^\\./stats/(SRR\\d{8}).*", "\\1", cov_stats[["FileName"]])
See the regex demo. Details:
^ - start of string
\./stats/ - ./stats/ string
(SRR\d{8}) - Group 1 (\1): SRR string and then eight digits
.* - the rest of the string till its end.
Note that sub is used (not gsub) because there is only one expected replacement operation in the input string (since the regex matches the whole string).
See the R demo:
cov_stats <- c("./stats/SRR18826803_stats.txt", "./stats/SRR18826804_stats.txt", "./stats/SRR18826805_stats.txt", "./stats/SRR18826806_stats.txt", "./stats/SRR18826807_stats.txt")
sub("^\\./stats/(SRR\\d{8}).*", "\\1", cov_stats)
## => [1] "SRR18826803" "SRR18826804" "SRR18826805" "SRR18826806" "SRR18826807"
An equivalent extraction stringr approach:
library(stringr)
rownames(cov_stats) <- str_extract(cov_stats[["FileName"]], "SRR\\d{8}")
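If stringr is not available, the same extraction can be sketched in base R with regexpr() and regmatches(). The vector files below is a made-up stand-in for the FileName column. The sketch also shows the POSIX class written as [[:digit:]]: the class name [:digit:] is only valid inside a bracket expression, which is likely one reason the second attempt in the question did not behave as intended.

```r
# Stand-in for cov_stats[["FileName"]]
files <- c("./stats/SRR18826803_stats.txt",
           "./stats/SRR18826804_stats.txt")

# regexpr() locates the first match per string; regmatches() pulls it out
m <- regexpr("SRR[[:digit:]]{8}", files)
ids <- regmatches(files, m)
print(ids)
```

Note that regmatches() silently drops elements without a match, so the result can be shorter than the input; str_extract() returns NA in that case instead.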

How to extract everything after a specific string?

I'd like to extract everything after "-" in vector of strings in R.
For example in :
test = c("Pierre-Pomme","Jean-Poire","Michel-Fraise")
I'd like to get
c("Pomme","Poire","Fraise")
Thanks !

With str_extract. \\b is a zero-length token that matches a word boundary, i.e. the position between a word character and a non-word character such as -:
library(stringr)
str_extract(test, '\\b\\w+$')
# [1] "Pomme" "Poire" "Fraise"
We can also use a back reference with sub. \\1 refers to the string matched by the first capture group (.+), which here is one or more characters following the last - up to the end of the string:
sub('.+-(.+)', '\\1', test)
# [1] "Pomme" "Poire" "Fraise"
This also works with str_replace if that is already loaded:
library(stringr)
str_replace(test, '.+-(.+)', '\\1')
# [1] "Pomme" "Poire" "Fraise"
A third option is to use strsplit and extract the second word from each element of the list (similar to word from @akrun's answer):
sapply(strsplit(test, '-'), `[`, 2)
# [1] "Pomme" "Poire" "Fraise"
stringr also has a str_split variant of this:
str_split(test, '-', simplify = TRUE)[,2]
# [1] "Pomme" "Poire" "Fraise"

We can use sub to match characters (.*) until the - and in the replacement specify ""
sub(".*-", "", test)
Or another option is word
library(stringr)
word(test, 2, sep="-")

I think the other answers might be what you're looking for, but if you don't want to lose the original context you can try something like this:
library(tidyverse)
tibble(test) %>%
  separate(test, c("first", "last"), remove = FALSE)
This will return a dataframe containing the original strings plus components, which might be more useful down the road:
# A tibble: 3 x 3
  test          first  last
  <chr>         <chr>  <chr>
1 Pierre-Pomme  Pierre Pomme
2 Jean-Poire    Jean   Poire
3 Michel-Fraise Michel Fraise

For some reason the responses here didn't work for my particular string. I found this response more helpful (it uses a lookbehind with stringr): stringr str_extract capture group capturing everything.
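A lookbehind can also be sketched in base R by setting perl = TRUE. This is an illustration rather than the code from the linked answer; the helper names after_first and after_last and the string "Jean-Paul-Poire" are assumptions made here. The comparison shows where the approaches differ once a string contains more than one -.

```r
test <- c("Pierre-Pomme", "Jean-Poire", "Michel-Fraise")

# Lookbehind: everything after the FIRST "-" (the "-" itself is not consumed)
after_first <- function(x) {
  regmatches(x, regexpr("(?<=-).*$", x, perl = TRUE))
}

# Greedy sub(): everything after the LAST "-"
after_last <- function(x) sub(".*-", "", x)

print(after_first(test))
print(after_first("Jean-Paul-Poire"))
print(after_last("Jean-Paul-Poire"))
```

For the question's data both helpers give the same result; they only diverge for strings with several hyphens.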

Extract text with gsub

I am setting up an automated data analysis procedure and, more or less at the end of the procedure, I would like to extract automatically the name of the file that has been analysed. I have a data frame with a column containing names, with the following style:
Baseline/Cell_Line_2_KB_1813_B_Baseline
Dose 0001/Cell_Line_3_KB1720_1_0001
Dose 0010/Cell_Line_1_KB1810 mat_0010
I would like to extract just these parts of the names: "KB_1813_B", "KB1720_1" and "KB1810 mat", into a separate column.
I used gsub with the following command:
df$column.with.names <- gsub(".*KB|_.*", "KB", df$column.with.new.names)
I could easily remove the first part of the problem, but I am stuck trying to remove the second part. Is there some command in gsub to remove everything, starting from the end of the name, until you encounter a special character ( "_" in my case)?
Thank you :)

We can use str_extract
library(stringr)
str_extract(df$column.with.new.names, "KB_*\\d+[_ ]*[^_]*")
#[1] "KB_1813_B" "KB1720_1" "KB1810 mat"
Or the same pattern can be captured as a group with sub
sub(".*(KB_*\\d+[_ ]*[^_]*).*", "\\1", df$column.with.new.names)
#[1] "KB_1813_B" "KB1720_1" "KB1810 mat"
data
df <- data.frame(column.with.new.names = c("Baseline/Cell_Line_2_KB_1813_B_Baseline",
"Dose 0001/Cell_Line_3_KB1720_1_0001",
"Dose 0010/Cell_Line_1_KB1810 mat_0010"), stringsAsFactors = FALSE)

The way to do this is using regex groups:
x <- c("Baseline/Cell_Line_2_KB_1813_B_Baseline",
"Dose 0001/Cell_Line_3_KB1720_1_0001",
"Dose 0010/Cell_Line_1_KB1810 mat_0010")
gsub("^.+Cell_Line_._(.+)_.+$", "\\1", x)
[1] "KB_1813_B" "KB1720_1" "KB1810 mat"
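The question also asks how to remove everything from the end of the name back to the last "_". One way to sketch that is in two passes, assuming (as in the sample data) that the part to drop never contains an underscore and that "KB" occurs only once per name. The intermediate names no_tail and kb are made up for this illustration.

```r
x <- c("Baseline/Cell_Line_2_KB_1813_B_Baseline",
       "Dose 0001/Cell_Line_3_KB1720_1_0001",
       "Dose 0010/Cell_Line_1_KB1810 mat_0010")

# Pass 1: drop the last "_" and everything after it
no_tail <- sub("_[^_]*$", "", x)

# Pass 2: drop everything before the first "KB" (lazy match needs perl = TRUE)
kb <- sub("^.*?(KB)", "\\1", no_tail, perl = TRUE)
print(kb)
```

The pattern "_[^_]*$" works because [^_]* cannot cross another underscore, so the match is forced to start at the last one.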

R: Replacing rownames of data frame by a substring[2]

I have a question about the use of gsub. The rownames of my data share the same partial names. See below:
> rownames(test)
[1] "U2OS.EV.2.7.9" "U2OS.PIM.2.7.9" "U2OS.WDR.2.7.9" "U2OS.MYC.2.7.9"
[5] "U2OS.OBX.2.7.9" "U2OS.EV.18.6.9" "U2O2.PIM.18.6.9" "U2OS.WDR.18.6.9"
[9] "U2OS.MYC.18.6.9" "U2OS.OBX.18.6.9" "X1.U2OS...OBX" "X2.U2OS...MYC"
[13] "X3.U2OS...WDR82" "X4.U2OS...PIM" "X5.U2OS...EV" "exp1.U2OS.EV"
[17] "exp1.U2OS.MYC" "EXP1.U20S..PIM1" "EXP1.U2OS.WDR82" "EXP1.U20S.OBX"
[21] "EXP2.U2OS.EV" "EXP2.U2OS.MYC" "EXP2.U2OS.PIM1" "EXP2.U2OS.WDR82"
[25] "EXP2.U2OS.OBX"
In my previous question, I asked if there is a way to get the same names for the same partial names. See this question: Replacing rownames of data frame by a sub-string
The answer is a very nice solution. The function gsub is used in this way:
transfecties = gsub(".*(MYC|EV|PIM|WDR|OBX).*", "\\1", rownames(test))
Now, I have another problem, the program I run with R (Galaxy) doesn't recognize the | characters. My question is, is there another way to get to the same solution without using this |?
Thanks!

If you don't want to use the "|" character, you can try something like :
Rnames <-
c( "U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "U2OS.WDR.2.7.9", "U2OS.MYC.2.7.9" ,
"U2OS.OBX.2.7.9" , "U2OS.EV.18.6.9" ,"U2O2.PIM.18.6.9" ,"U2OS.WDR.18.6.9" )
Rlevels <- c("MYC","EV","PIM","WDR","OBX")
tmp <- sapply(Rlevels,grepl,Rnames)
apply(tmp,1,function(i)colnames(tmp)[i])
[1] "EV" "PIM" "WDR" "MYC" "OBX" "EV" "PIM" "WDR"
But I would seriously consider mentioning this to the team of galaxy, as it seems to be rather awkward not to be able to use the symbol for OR...
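The apply() call above returns a plain character vector only when every name matches exactly one level; a name that matches none (or several) would turn the result into a list. The following is a hedged sketch of a variant that returns NA in those cases. The helper label_of and the row name "unknown.sample" are hypothetical, and fixed = TRUE is used so the levels are treated as literal text rather than regular expressions.

```r
levels_to_find <- c("MYC", "EV", "PIM", "WDR", "OBX")
rnames <- c("U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "X3.U2OS...WDR82", "unknown.sample")

# Return the single level contained in `name`, or NA if zero or several match
label_of <- function(name, levels) {
  hit <- levels[vapply(levels, grepl, logical(1), x = name, fixed = TRUE)]
  if (length(hit) == 1) hit else NA_character_
}

labels <- vapply(rnames, label_of, character(1),
                 levels = levels_to_find, USE.NAMES = FALSE)
print(labels)
```

Like the answer above, this avoids the | character entirely by testing one level at a time.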

I wouldn't recommend doing this in general in R as it is far less efficient than the solution @csgillespie provided, but an alternative is to loop over the various strings you want to match and do the replacements on each string separately, i.e. search for "MYC" and replace only in those rownames that match "MYC".
Here is an example using the x data from @csgillespie's answer:
x <- c("U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "U2OS.WDR.2.7.9", "U2OS.MYC.2.7.9",
"U2OS.OBX.2.7.9", "U2OS.EV.18.6.9", "U2O2.PIM.18.6.9","U2OS.WDR.18.6.9",
"U2OS.MYC.18.6.9","U2OS.OBX.18.6.9", "X1.U2OS...OBX","X2.U2OS...MYC")
Copy the data so we have something to compare with later (this just for the example):
x2 <- x
Then create a list of strings you want to match on:
matches <- c("MYC","EV","PIM","WDR","OBX")
Then we loop over the values in matches and do three things (numbered ## 1 to ## 3 in the code):
1. Create the regular expression by pasting together the current match string i with the other bits of the regular expression we want to use.
2. Use grepl() to return a logical indicator for those elements of x that contain the string i.
3. Use the same style of gsub() call as you were already shown, but only on the elements of x2 that matched the string, and replace only those elements.
The loop is:
for(i in matches) {
    rgexp <- paste(".*(", i, ").*", sep = "") ## 1
    ind <- grepl(rgexp, x) ## 2
    x2[ind] <- gsub(rgexp, "\\1", x2[ind]) ## 3
}
x2
Which gives:
> x2
[1] "EV" "PIM" "WDR" "MYC" "OBX" "EV" "PIM" "WDR" "MYC" "OBX" "OBX" "MYC"
