How to I use regular expressions to match a substring? - r

I want to change the rownames of cov_stats, such that it contains a substring of the FileName column values. I only want to retain the string that begins with "SRR" followed by 8 digits (e.g., SRR18826803).
cov_list <- list.files(path="./stats/", full.names=T)
cov_stats <- rbindlist(sapply(cov_list, fread, simplify=F), use.names=T, idcol="FileName")
rownames(cov_stats) <- gsub("^\.\/\SRR*_\stats.\txt", "SRR*", cov_stats[["FileName"]])
Second attempt
rownames(cov_stats) <- gsub("^SRR[:digit:]*", "", cov_stats[["FileName"]])
Original strings
> cov_stats[["FileName"]]
[1] "./stats/SRR18826803_stats.txt" "./stats/SRR18826804_stats.txt"
[3] "./stats/SRR18826805_stats.txt" "./stats/SRR18826806_stats.txt"
[5] "./stats/SRR18826807_stats.txt" "./stats/SRR18826808_stats.txt"
Desired substring output
[1] "SRR18826803" "SRR18826804"
[3] "SRR18826805" "SRR18826806"
[5] "SRR18826807" "SRR18826808"

Would this work for you?
library(stringr)
stringr::str_extract(cov_stats[["FileName"]], "SRR.{0,8}")

You can use
rownames(cov_stats) <- sub("^\\./stats/(SRR\\d{8}).*", "\\1", cov_stats[["FileName"]])
See the regex demo. Details:
^ - start of string
\./stats/ - ./stats/ string
(SRR\d{8}) - Group 1 (\1): SRR string and then eight digits
.* - the rest of the string till its end.
Note that sub is used (not gsub) because there is only one expected replacement operation in the input string (since the regex matches the whole string).
See the R demo:
cov_stats <- c("./stats/SRR18826803_stats.txt", "./stats/SRR18826804_stats.txt", "./stats/SRR18826805_stats.txt", "./stats/SRR18826806_stats.txt", "./stats/SRR18826807_stats.txt")
sub("^\\./stats/(SRR\\d{8}).*", "\\1", cov_stats)
## => [1] "SRR18826803" "SRR18826804" "SRR18826805" "SRR18826806" "SRR18826807"
An equivalent extraction stringr approach:
library(stringr)
rownames(cov_stats) <- str_extract(cov_stats[["FileName"]], "SRR\\d{8}")

Related

Convert sign in column names if not at certain position in R [duplicate]

I have a character string of names which look like
"_6302_I-PAL_SPSY_000237_001"
I need to remove the first occurred underscore, so that it will be as
"6302_I-PAL_SPSY_000237_001"
I aware of gsub but it removes all of underscores. Thank you for any suggestions.
gsub function do the same, to remove starting of the string symbol ^ used
x <- "_6302_I-PAL_SPSY_000237_001"
x <- gsub("^\\_","",x)
[1] "6302_I-PAL_SPSY_000237_001"
We can use sub with pattern as _ and replacement as blanks (""). This will remove the first occurrence of '_'.
sub("_", "", str1)
#[1] "6302_I-PAL_SPSY_000237_001"
NOTE: This will remove the first occurence of _ and it will not limit based on the position i.e. at the start of the string.
For example, suppose we have string
str2 <- "6302_I-PAL_SPSY_000237_001"
sub("_", "", str2)
#[1] "6302I-PAL_SPSY_000237_001"
As the example have _ in the beginning, another option is substring
substring(str1, 2)
#[1] "6302_I-PAL_SPSY_000237_001"
data
str1 <- "_6302_I-PAL_SPSY_000237_001"
This can be done with base R's trimws() too
string1<-"_6302_I-PAL_SPSY_000237_001"
trimws(string1, which='left', whitespace = '_')
[1] "6302_I-PAL_SPSY_000237_001"
In case we have multiple words with leading underscores, we may have to include a word boundary (\\b) in our regex, and use either gsub or stringr::string_remove:
string2<-paste(string1, string1)
string2
[1] "_6302_I-PAL_SPSY_000237_001 _6302_I-PAL_SPSY_000237_001"
library(stringr)
str_remove_all(string2, "\\b_")
> str_remove_all(string2, "\\b_")
[1] "6302_I-PAL_SPSY_000237_001 6302_I-PAL_SPSY_000237_001"

R splitting string on predefined location

I have string, which should be split into parts from "random" locations. Split occurs always from next comma after colon.
My idea was to find colons with
stringr::str_locate_all(test, ":") %>%
unlist()
then find commas
stringr::str_locate_all(test, ",") %>%
unlist()
and from there to figure out position where it should be split up, but could not find suitable way to do it. Feels like there is always 6 characters after colon before the comma, but I can't be sure about that for whole data.
Here is example string:
dput(test)
"AA,KK,QQ,JJ,TT,99,88:0.5083,66,55:0.8303,AK,AQ,AJs,AJo:0.9037,ATs:0.0024,ATo:0.5678"
Here is what result should be
dput(result)
c("AA,KK,QQ,JJ,TT,99,88:0.5083", "66,55:0.8303", "AK,AQ,AJs,AJo:0.9037",
"ATs:0.0024", "ATo:0.5678")
Perehaps we can use regmatches like below
> regmatches(test, gregexpr("(\\w+,?)+:[0-9.]+", test))[[1]]
[1] "AA,KK,QQ,JJ,TT,99,88:0.5083" "66,55:0.8303"
[3] "AK,AQ,AJs,AJo:0.9037" "ATs:0.0024"
[5] "ATo:0.5678"
here is one option with strsplit - replace the , after the digit followed by the . and one or more digits (\\d+) with a new delimiter using gsub and then split with strsplit in base R
result1 <- strsplit(gsub("([0-9]\\.[0-9]+),", "\\1;", test), ";")[[1]]
-checking
> identical(result, result1)
[1] TRUE
If the number of characters are fixed, use a regex lookaround
result1 <- strsplit(test, "(?<=:.{6}),", perl = TRUE)[[1]]

String replace with regex condition

I have a pattern that I want to match and replace with an X. However, I only want the pattern to be replaced if the preceding character is either an A, B or not preceeded by any character (beginning of string).
I know how to replace patterns using the str_replace_all function but I don't know how I can add this additional condition. I use the following code:
library(stringr)
string <- "0000A0000B0000C0000D0000E0000A0000"
pattern <- c("XXXX")
replacement <- str_replace_all(string, pattern, paste0("XXXX"))
Result:
[1] "XXXXAXXXXBXXXXCXXXXDXXXXEXXXXAXXXX"
Desired result:
Replacement only when preceding charterer is A, B or no character:
[1] "XXXXAXXXXBXXXXC0000D0000E0000AXXXX"
You may use
gsub("(^|[AB])0000", "\\1XXXX", string)
See the regex demo
Details
(^|[AB]) - Capturing group 1 (\1): start of string (^) or (|) A or B ([AB])
0000 - four zeros.
R demo:
string <- "0000A0000B0000C0000D0000E0000A0000"
pattern <- c("XXXX")
gsub("(^|[AB])0000", "\\1XXXX", string)
## -> [1] "XXXXAXXXXBXXXXC0000D0000E0000AXXXX"
Could you please try following. Using positive lookahead method here.
string <- "0000A0000B0000C0000D0000E0000A0000"
gsub(x = string, pattern = "(^|A|B)(?=0000)((?i)0000?)",
replacement = "\\1xxxx", perl=TRUE)
Output will be as follows.
[1] "xxxxAxxxxBxxxxC0000D0000E0000Axxxx"
Thanks to Wiktor Stribiżew for the answer! It also works with the stringr package:
library(stringr)
string <- "0000A0000B0000C0000D0000E0000A0000"
pattern <- c("0000")
replace <- str_replace_all(string, paste0("(^|[AB])",pattern), "\\1XXXX")
replace
[1] "XXXXAXXXXBXXXXC0000D0000E0000AXXXX"

removes part of string in r

I'm trying to extract ES at the end of a string
> data <- c("phrases", "phases", "princesses","class","pass")
> data1 <- gsub("(\\w+)(s)+?es\\b", "\\1\\2", data, perl=TRUE)
> gsub("(\\w+)s\\b", "\\1", data1, perl=TRUE)
[1] "phra" "pha" "princes" "clas" "pas"
I get this result
[1] "phra" "pha" "princes" "clas" "pas"
but in reality what I need to obtain is:
[1] "phras" "phas" "princess" "clas" "pas"
You can use a word boundary (\\b) if it is guaranteed that each word is followed by a punctuation or is at the end of the string:
data <- c("phrases, phases, princesses, bases")
gsub('es\\b', '', data)
# [1] "phras, phas, princess, bas"
With your method, just wrap everything till the second + with one set of parentheses:
gsub("(\\w+s+)es\\b", "\\1", data)
# [1] "phras, phas, princess, bas"
There is also no need to make + lazy with ?, since you are trying to match as many consecutive s's as possible.
Edit:
OP changed the data and the desired output. Below is a simple solution that removes either es or s at the end of each string:
data <- c("phrases", "phases", "princesses","class","pass")
gsub('(es|s)\\b', '', data)
# [1] "phras" "phas" "princess" "clas" "pas"
maybe you are looking for a lookbehind assertion (which is a 0 length match)
"(?<=s)es\\b"
or because lookbehind can't have a variable length perl \K construct to keep out of match left of \K
"\\ws\\Kes\\b"

Retrieving a specific part of a string in R

I have the next vector of strings
[1] "/players/playerpage.htm?ilkidn=BRYANPHI01"
[2] "/players/playerpage.htm?ilkidhh=WILLIROB027"
[3] "/players/playerpage.htm?ilkid=THOMPWIL01"
I am looking for a way to retrieve the part of the string that is placed after the equal sign meaning I would like to get a vector like this
[1] "BRYANPHI01"
[2] "WILLIROB027"
[3] "THOMPWIL01"
I tried using substr but for it to work I have to know exactly where the equal sign is placed in the string and where the part i want to retrieve ends
We can use sub to match the zero or more characters that are not a = ([^=]*) followed by a = and replace it with ''.
sub("[^=]*=", "", str1)
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
data
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
Using stringr,
library(stringr)
word(str1, 2, sep = '=')
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
Using strsplit,
strsplit(str1, "=")[[1]][2]
# [1] "BRYANPHI01"
With Sotos comment to get results as vector:
sapply(str1, function(x){
strsplit(x, "=")[[1]][2]
})
Another solution based on regex, but extracting instead of substituting, which may be more efficient.
I use the stringi package which provides a more powerful regex engine than base R (in particular, supporting look-behind).
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
stri_extract_all_regex(str1, pattern="(?<==).+$", simplify=T)
(?<==) is a look-behind: regex will match only if preceded by an equal sign, but the equal sign will not be part of the match.
.+$ matches everything until the end. You could replace the dot with a more precise symbol if you are confident about the format of what you match. For example, '\w' matches any alphanumeric character, so you could use "(?<==)\\w+$" (the \ must be escaped so you end up with \\w).

Resources