Extract a substring in R with no pattern - r

If one of my strings in a column looks like,
string = "P/project/dhi_intro_genomics/genomics/gene/pag-files-per-patient/000tg82e-99c4-4h20-9ude-d95e15005a 3c_KXgES5FtCpLhQce7mGkuMX/XML/JH_DN_S9_2000-12-27_MTW-29FEB1997UW"
Is there a str_extract code to get
sub_string = "000tg82e-99c4-4h20-9ude-d95e15005a 3c"
from the original 'string'?

We can use a lookbehind pattern to extract the run of characters that are not _ following the patient/ substring
library(stringr)
str_extract(string, "(?<=patient\\/)0+[^_]+")
[1] "000tg82e-99c4-4h20-9ude-d95e15005a 3c"
If there is no pattern and we want to extract the 7th element based on the delimiter /
trimws(strsplit(string, "/")[[1]][7], whitespace = "_.*")
[1] "000tg82e-99c4-4h20-9ude-d95e15005a 3c"
Or with str_replace
str_replace(string, "([^/]+/){6}([^_]+)_.*", "\\2")
[1] "000tg82e-99c4-4h20-9ude-d95e15005a 3c"
For the new string
str_new <- c("/P/project/dlf_intro_aion/Y0793/Y0793_8665030498_T1_K1IJ2_ps20200918125614.htj.gz.pd5",
"/P/project/dlf_intro_aion/H051/H0518946_032983_T1_K1ID2_ps20289239171246.par.gz"
)
str_replace(str_new, "^/?([^/]+/){4}[^_]+_([^_]+)_.*", "\\2")
[1] "8665030498" "032983"
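For readers who prefer to avoid the stringr dependency, the lookbehind idea above can be sketched in base R with regexpr() and regmatches(). This is only a minimal sketch, assuming the target always follows patient/ and ends at the first underscore; perl = TRUE is needed because R's default regex engine does not support lookbehind.

```r
# Sample path from the question (the space inside the ID is kept as posted).
string <- "P/project/dhi_intro_genomics/genomics/gene/pag-files-per-patient/000tg82e-99c4-4h20-9ude-d95e15005a 3c_KXgES5FtCpLhQce7mGkuMX/XML/JH_DN_S9_2000-12-27_MTW-29FEB1997UW"

# (?<=patient/) is a lookbehind: it requires "patient/" just before the match
# without including it; [^_]+ then takes everything up to the first underscore.
m <- regexpr("(?<=patient/)[^_]+", string, perl = TRUE)
sub_string <- regmatches(string, m)
sub_string
```

Note that regmatches() drops elements with no match, so on a whole column the result can be shorter than the input.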

Related

Convert sign in column names if not at certain position in R [duplicate]

I have a character string of names which look like
"_6302_I-PAL_SPSY_000237_001"
I need to remove the first occurred underscore, so that it will be as
"6302_I-PAL_SPSY_000237_001"
I am aware of gsub, but it removes all of the underscores. Thank you for any suggestions.
The gsub function can do this too: anchor the pattern with ^ so that only an underscore at the start of the string is removed
x <- "_6302_I-PAL_SPSY_000237_001"
x <- gsub("^\\_","",x)
[1] "6302_I-PAL_SPSY_000237_001"
We can use sub with pattern as _ and replacement as blanks (""). This will remove the first occurrence of '_'.
sub("_", "", str1)
#[1] "6302_I-PAL_SPSY_000237_001"
NOTE: This removes the first occurrence of _ wherever it appears; it is not restricted to the start of the string.
For example, suppose we have string
str2 <- "6302_I-PAL_SPSY_000237_001"
sub("_", "", str2)
#[1] "6302I-PAL_SPSY_000237_001"
As the example has _ at the beginning, another option is substring
substring(str1, 2)
#[1] "6302_I-PAL_SPSY_000237_001"
data
str1 <- "_6302_I-PAL_SPSY_000237_001"
This can be done with base R's trimws() too
string1<-"_6302_I-PAL_SPSY_000237_001"
trimws(string1, which='left', whitespace = '_')
[1] "6302_I-PAL_SPSY_000237_001"
If a string holds several words with leading underscores, we can add a word boundary (\\b) to the regex and use either gsub or stringr::str_remove_all:
string2<-paste(string1, string1)
string2
[1] "_6302_I-PAL_SPSY_000237_001 _6302_I-PAL_SPSY_000237_001"
library(stringr)
str_remove_all(string2, "\\b_")
[1] "6302_I-PAL_SPSY_000237_001 6302_I-PAL_SPSY_000237_001"
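To make the difference between the anchored and unanchored patterns concrete, here is a small base R sketch. The two-element vector is a hypothetical example, built from the strings in the answers above, where only the first element has a leading underscore.

```r
# Hypothetical column names: only the first has a leading underscore.
x <- c("_6302_I-PAL_SPSY_000237_001", "6302_I-PAL_SPSY_000237_001")

# Unanchored: removes the first underscore wherever it occurs.
unanchored <- sub("_", "", x)

# Anchored with ^: removes an underscore only at the start of the string.
anchored <- sub("^_", "", x)

unanchored
anchored
```

The anchored version leaves the second element untouched, which is usually what is wanted when cleaning column names.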

How do I use regular expressions to match a substring?

I want to change the rownames of cov_stats, such that it contains a substring of the FileName column values. I only want to retain the string that begins with "SRR" followed by 8 digits (e.g., SRR18826803).
cov_list <- list.files(path="./stats/", full.names=T)
cov_stats <- rbindlist(sapply(cov_list, fread, simplify=F), use.names=T, idcol="FileName")
rownames(cov_stats) <- gsub("^\.\/\SRR*_\stats.\txt", "SRR*", cov_stats[["FileName"]])
Second attempt
rownames(cov_stats) <- gsub("^SRR[:digit:]*", "", cov_stats[["FileName"]])
Original strings
> cov_stats[["FileName"]]
[1] "./stats/SRR18826803_stats.txt" "./stats/SRR18826804_stats.txt"
[3] "./stats/SRR18826805_stats.txt" "./stats/SRR18826806_stats.txt"
[5] "./stats/SRR18826807_stats.txt" "./stats/SRR18826808_stats.txt"
Desired substring output
[1] "SRR18826803" "SRR18826804"
[3] "SRR18826805" "SRR18826806"
[5] "SRR18826807" "SRR18826808"
Would this work for you?
library(stringr)
stringr::str_extract(cov_stats[["FileName"]], "SRR.{0,8}")
You can use
rownames(cov_stats) <- sub("^\\./stats/(SRR\\d{8}).*", "\\1", cov_stats[["FileName"]])
Details:
^ - start of string
\./stats/ - ./stats/ string
(SRR\d{8}) - Group 1 (\1): SRR string and then eight digits
.* - the rest of the string till its end.
Note that sub is used (not gsub) because there is only one expected replacement operation in the input string (since the regex matches the whole string).
See the R demo:
cov_stats <- c("./stats/SRR18826803_stats.txt", "./stats/SRR18826804_stats.txt", "./stats/SRR18826805_stats.txt", "./stats/SRR18826806_stats.txt", "./stats/SRR18826807_stats.txt")
sub("^\\./stats/(SRR\\d{8}).*", "\\1", cov_stats)
## => [1] "SRR18826803" "SRR18826804" "SRR18826805" "SRR18826806" "SRR18826807"
An equivalent extraction stringr approach:
library(stringr)
rownames(cov_stats) <- str_extract(cov_stats[["FileName"]], "SRR\\d{8}")
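A different route, not taken in the answers above, is to drop the directory with basename() and then strip the suffix. The sketch below assumes every file name ends in _stats.txt, as in the question; the grepl() check is an added sanity test, not part of the original answers.

```r
# File paths in the shape shown in the question.
files <- c("./stats/SRR18826803_stats.txt", "./stats/SRR18826804_stats.txt")

# basename() drops the directory part; sub() then strips the assumed
# "_stats.txt" suffix, leaving the run accession.
ids <- sub("_stats\\.txt$", "", basename(files))

# Sanity check: every ID is "SRR" followed by exactly eight digits.
all_valid <- all(grepl("^SRR[0-9]{8}$", ids))

ids
all_valid
```

Checking the result this way guards against sub() returning an unchanged string when the pattern does not match.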

R how to match and extract character letters of different length in a string

So I have a column of contract names df$name like below
FB210618C00280000
ADM210618C00280000
M210618P00280000
I would like to extract the FB, ADM and M. That is I want to extract characters in the string and they are of different length and stop once the first number occurs, and I don't want to extract the C or P.
The below code will give me the C or P
stri_extract_all_regex(df$name, "[a-z]+")
We can use stri_extract_first from stringi
library(stringi)
stri_extract_first(df$name, regex = "[A-Z]+")
#[1] "FB" "ADM" "M"
Or we can use base R with sub
sub("\\d+.*", "", df$name)
#[1] "FB" "ADM" "M"
Or use trimws from base R
trimws(df$name, whitespace = "\\d+.*")
data
df <- data.frame(name = c("FB210618C00280000", "ADM210618C00280000",
"M210618P00280000"))
You can use
library(stringr)
str_extract(df$name, "^[A-Za-z]+")
# Or
str_extract(df$name, "^\\p{L}+")
The stringr::str_extract function extracts the first occurrence of a pattern, and the ^[A-Za-z]+ / ^\p{L}+ regex matches one or more letters at the start of the string. Note that \p{L} matches any Unicode letter.
Same pattern can be used with stringi::stri_extract_first():
library(stringi)
stri_extract_first(df$name, regex="^[A-Za-z]+")
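If the other parts of the contract name are needed as well, capture groups with regexec() and regmatches() from base R can split the string in one pass. This is a sketch under an assumed layout (letters, six digits, C or P, then the remaining digits) inferred from the three sample names; the variable names symbol and type are illustrative.

```r
df <- data.frame(name = c("FB210618C00280000", "ADM210618C00280000",
                          "M210618P00280000"),
                 stringsAsFactors = FALSE)

# Assumed layout: letters, six digits, C or P, then the remaining digits.
pat <- "^([A-Z]+)([0-9]{6})([CP])([0-9]+)$"

# regexec() records the full match plus each capture group;
# regmatches() turns that into one character vector per input string.
parts <- do.call(rbind, regmatches(df$name, regexec(pat, df$name)))

# Column 1 is the full match, columns 2-5 are the capture groups.
symbol <- parts[, 2]
type   <- parts[, 4]

symbol
type
```

A name that does not fit the assumed layout yields an empty match, so the rbind() step is only safe when every name matches.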

String replace with regex condition

I have a pattern that I want to match and replace with an X. However, I only want the pattern to be replaced if the preceding character is either an A or a B, or if it is not preceded by any character (beginning of string).
I know how to replace patterns using the str_replace_all function but I don't know how I can add this additional condition. I use the following code:
library(stringr)
string <- "0000A0000B0000C0000D0000E0000A0000"
pattern <- c("0000")
replacement <- str_replace_all(string, pattern, paste0("XXXX"))
Result:
[1] "XXXXAXXXXBXXXXCXXXXDXXXXEXXXXAXXXX"
Desired result:
Replacement only when the preceding character is A, B or no character:
[1] "XXXXAXXXXBXXXXC0000D0000E0000AXXXX"
You may use
gsub("(^|[AB])0000", "\\1XXXX", string)
Details
(^|[AB]) - Capturing group 1 (\1): start of string (^) or (|) A or B ([AB])
0000 - four zeros.
R demo:
string <- "0000A0000B0000C0000D0000E0000A0000"
gsub("(^|[AB])0000", "\\1XXXX", string)
## -> [1] "XXXXAXXXXBXXXXC0000D0000E0000AXXXX"
You could also try the following, which uses a positive lookahead.
string <- "0000A0000B0000C0000D0000E0000A0000"
gsub(x = string, pattern = "(^|A|B)(?=0000)((?i)0000?)",
replacement = "\\1xxxx", perl=TRUE)
Output will be as follows.
[1] "xxxxAxxxxBxxxxC0000D0000E0000Axxxx"
Thanks to Wiktor Stribiżew for the answer! It also works with the stringr package:
library(stringr)
string <- "0000A0000B0000C0000D0000E0000A0000"
pattern <- c("0000")
replace <- str_replace_all(string, paste0("(^|[AB])",pattern), "\\1XXXX")
replace
[1] "XXXXAXXXXBXXXXC0000D0000E0000AXXXX"
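The same condition can also be written without a capture group, using a negative lookbehind. This is a sketch of an alternative formulation, not the method from the answers above; it relies on perl = TRUE, since lookbehind is not available in R's default regex engine.

```r
string <- "0000A0000B0000C0000D0000E0000A0000"

# (?<![^AB]) is a negative lookbehind: it fails when the previous character
# exists and is something other than A or B, so it succeeds at the start of
# the string or right after A or B.
result <- gsub("(?<![^AB])0000", "XXXX", string, perl = TRUE)
result
```

Because the lookbehind consumes no characters, the replacement string does not need a backreference such as \\1.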

stringr str_extract capture group capturing everything

I'm looking to extract the year from a string. This always comes after an 'X' and before "." then a string of other characters.
Using stringr's str_extract I'm trying the following:
year = str_extract(string = 'X2015.XML.Outgoing.pounds..millions.'
, pattern = 'X(\\d{4})\\.')
I thought the brackets would define the capture group, returning 2015, but I actually get the complete match X2015.
Am I doing this correctly? Why am I not trimming "X" and "."?
The capture group is irrelevant in this case. The function str_extract will return the whole match including characters before and after the capture group.
You can work with a lookbehind and a lookahead instead. Both are zero-width assertions, so the characters they check are not included in the match.
library(stringr)
str_extract(string = 'X2015.XML.Outgoing.pounds..millions.',
pattern = '(?<=X)\\d{4}(?=\\.)')
# [1] "2015"
This regex matches four consecutive digits that are preceded by an X and followed by a period (.).
I believe the most idiomatic way is to use str_match:
str_match(string = 'X2015.XML.Outgoing.pounds..millions.',
pattern = 'X(\\d{4})\\.')
Which returns the complete match followed by capture groups:
     [,1]     [,2]
[1,] "X2015." "2015"
As such the following will do the trick:
str_match(string = 'X2015.XML.Outgoing.pounds..millions.',
pattern = 'X(\\d{4})\\.')[2]
Alternatively, you can use gsub:
string = 'X2015.XML.Outgoing.pounds..millions.'
gsub("X(\\d{4})\\..*", "\\1", string)
# [1] "2015"
or str_replace from stringr:
library(stringr)
str_replace(string, "X(\\d{4})\\..*", "\\1")
# [1] "2015"
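For a base R counterpart to the capture-group approach, regexec() together with regmatches() returns the full match followed by each group. The sketch below applies it to a vector; the second column name is a hypothetical sibling of the one in the question, added only to show the vectorized behaviour.

```r
# The first name is from the question; the second is a hypothetical sibling.
cols <- c("X2015.XML.Outgoing.pounds..millions.",
          "X2016.XML.Outgoing.pounds..millions.")

# regexec() keeps capture groups; element 2 of each result is group 1.
m <- regmatches(cols, regexec("X([0-9]{4})\\.", cols))
years <- vapply(m, function(g) g[2], character(1))
years
```

An input with no match produces an empty vector, so g[2] becomes NA rather than raising an error.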
