Why does group capturing extract everything in str_replace in R?

Context
a = 'g_pm10year1126.81 - 139.90'
I have a character vector a. I want to extract the content after year1 in the string a ("126.81 - 139.90").
By using str_extract(a, "(?<=year1).*") I successfully extracted the content I wanted.
After that, I tried to use group capturing in the str_replace function, but it returned the whole string a.
Question
My question is why str_replace(a, "(?<=year1)(.*)", '\\1') returns "g_pm10year1126.81 - 139.90".
As I understand it, it should return "126.81 - 139.90".
Reproducible code:
library(stringr)
a = 'g_pm10year1126.81 - 139.90'
> str_extract(a, "(?<=year1).*")
[1] "126.81 - 139.90"
> str_replace(a, "(?<=year1)(.*)", '\\1')
[1] "g_pm10year1126.81 - 139.90"

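The issue is that you are replacing the captured group with itself. The lookbehind (?<=year1) is zero-width, so the match consists only of the text captured by (.*); replacing that match with \\1 changes nothing, and you end up with your input string.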
To achieve your desired result using str_replace you have to replace the part before the captured group, i.e. you could do:
library(stringr)
a = 'g_pm10year1126.81 - 139.90'
str_replace(a, "^.*?(?<=year1)(.*)", '\\1')
#> [1] "126.81 - 139.90"
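To see why nothing changes, it can help to check which characters the pattern actually matches. The following is a minimal sketch using the same string a; the str_remove call at the end is only one possible alternative, not necessarily what the answer's author would choose.

```r
library(stringr)

a <- 'g_pm10year1126.81 - 139.90'

# The lookbehind is zero-width, so the match starts right after "year1"
# and covers only the text captured by (.*).
str_locate(a, "(?<=year1)(.*)")

# Replacing that match with group 1 puts the same text back,
# so the result is identical to the input.
unchanged <- identical(str_replace(a, "(?<=year1)(.*)", "\\1"), a)

# One alternative: delete everything up to and including "year1".
tail_part <- str_remove(a, "^.*?year1")
tail_part
```

str_replace only rewrites the matched region, so anything outside the match (here the prefix up to year1) is always kept unless the pattern consumes it.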

Related

How do I use regular expressions to match a substring?

I want to change the rownames of cov_stats, such that it contains a substring of the FileName column values. I only want to retain the string that begins with "SRR" followed by 8 digits (e.g., SRR18826803).
cov_list <- list.files(path="./stats/", full.names=T)
cov_stats <- rbindlist(sapply(cov_list, fread, simplify=F), use.names=T, idcol="FileName")
rownames(cov_stats) <- gsub("^\.\/\SRR*_\stats.\txt", "SRR*", cov_stats[["FileName"]])
Second attempt
rownames(cov_stats) <- gsub("^SRR[:digit:]*", "", cov_stats[["FileName"]])
Original strings
> cov_stats[["FileName"]]
[1] "./stats/SRR18826803_stats.txt" "./stats/SRR18826804_stats.txt"
[3] "./stats/SRR18826805_stats.txt" "./stats/SRR18826806_stats.txt"
[5] "./stats/SRR18826807_stats.txt" "./stats/SRR18826808_stats.txt"
Desired substring output
[1] "SRR18826803" "SRR18826804"
[3] "SRR18826805" "SRR18826806"
[5] "SRR18826807" "SRR18826808"
Would this work for you?
library(stringr)
stringr::str_extract(cov_stats[["FileName"]], "SRR.{0,8}")
You can use
rownames(cov_stats) <- sub("^\\./stats/(SRR\\d{8}).*", "\\1", cov_stats[["FileName"]])
See the regex demo. Details:
^ - start of string
\./stats/ - ./stats/ string
(SRR\d{8}) - Group 1 (\1): SRR string and then eight digits
.* - the rest of the string till its end.
Note that sub is used (not gsub) because there is only one expected replacement operation in the input string (since the regex matches the whole string).
See the R demo:
cov_stats <- c("./stats/SRR18826803_stats.txt", "./stats/SRR18826804_stats.txt", "./stats/SRR18826805_stats.txt", "./stats/SRR18826806_stats.txt", "./stats/SRR18826807_stats.txt")
sub("^\\./stats/(SRR\\d{8}).*", "\\1", cov_stats)
## => [1] "SRR18826803" "SRR18826804" "SRR18826805" "SRR18826806" "SRR18826807"
An equivalent extraction stringr approach:
library(stringr)
rownames(cov_stats) <- str_extract(cov_stats[["FileName"]], "SRR\\d{8}")
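For comparison, the same extraction can be sketched in base R with regexpr() and regmatches(). The vector name files is made up for this illustration. The sketch writes the POSIX class inside a bracket expression, [[:digit:]], which is the form R's regex documentation requires.

```r
# Illustrative input; in the question these values come from cov_stats[["FileName"]].
files <- c("./stats/SRR18826803_stats.txt", "./stats/SRR18826804_stats.txt")

# regexpr() locates the first match in each string;
# regmatches() returns the matched text.
# Note: regmatches() drops elements that have no match.
ids <- regmatches(files, regexpr("SRR[[:digit:]]{8}", files))
ids
## => [1] "SRR18826803" "SRR18826804"
```

Because regmatches() silently drops non-matching elements, the result can be shorter than the input; str_extract returns NA for those elements instead, which is safer when the result is assigned back as row names.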

Find a pattern and wrap the next letter in parentheses

I have to find different patterns in a data frame column; once a pattern is found, the next letter should be wrapped in parentheses:
Data:
a <- c('(acetyl)RKJOEQLKQ', 'LDFEION(acetyl)EFNEOW')
if the pattern is: '(acetyl)'
this is the output that I'd like to achieve:
Expected output:
b <- c('(R)KJOEQLKQ', 'LDFEION(E)FNEOW')
I know that I can find the pattern with gsub:
b <- gsub('(acetyl)', replacement = '', a)
However, I'm not sure how to wrap the letter that follows the pattern in parentheses once the pattern is found.
Any help would be appreciated.
You can use
a <- c('(acetyl)RKJOEQLKQ', 'LDFEION(acetyl)EFNEOW')
gsub('\\(acetyl\\)(.)', '(\\1)', a)
## => [1] "(R)KJOEQLKQ" "LDFEION(E)FNEOW"
See the regex demo and the online R demo.
Details:
\(acetyl\) - matches a literal string (acetyl)
(.) - captures into Group 1 any single char
The (\1) replacement pattern replaces the matches with ( + Group 1 value + ).
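Since the question mentions several different patterns, the same idea can be sketched as a small function. wrap_next is a hypothetical helper name, not part of any package. It quotes the tag with \Q...\E so that parentheses and other metacharacters in the tag are taken literally, which requires perl = TRUE; it assumes the tag itself does not contain \E.

```r
# Hypothetical helper: wrap the character that follows a literal tag in parentheses.
wrap_next <- function(x, tag) {
  # \Q...\E makes the tag literal (PCRE syntax, hence perl = TRUE below).
  pattern <- paste0("\\Q", tag, "\\E(.)")
  # Replace "tag + next char" with "(" + next char + ")".
  gsub(pattern, "(\\1)", x, perl = TRUE)
}

a <- c('(acetyl)RKJOEQLKQ', 'LDFEION(acetyl)EFNEOW')
b <- wrap_next(a, "(acetyl)")
b
## => [1] "(R)KJOEQLKQ" "LDFEION(E)FNEOW"
```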

Extracting a certain substring (email address)

I'm attempting to pull a certain substring from a variable that looks like this:
v1 <- c("Persons Name <personsemail#email.com>","person 2 <person2#email.com>")
(this variable has hundreds of observations)
I want to eventually make a second variable that pulls their email to give this output:
v2 <- c("personsemail#email.com", "person2#email.com")
How would I do this? Is there a certain package I can use? Or do I need to make a function incorporating grep and substr?
Those look like what R might call a "person". There is an as.person() function that can split out the email address. For example
v1 <- c("Persons Name <personsemail#email.com>","person 2 <person2#email.com>")
unlist(as.person(v1)$email)
# [1] "personsemail#email.com" "person2#email.com"
For more information, see the ?person help page.
One option with str_extract from stringr
library(stringr)
str_extract(v1, "(?<=\\<)[^>]+")
#[1] "personsemail#email.com" "person2#email.com"
You can look for the pattern "anything**, then <, then (anything), then >, then anything" and replace that pattern with the part between the parentheses, indicated by \1 (and an extra \ to escape).
sub('.*<(.*)>.*', '\\1', v1)
# [1] "personsemail#email.com" "person2#email.com"
** "anything" actually means anything but line breaks
You can look for a pattern that looks like an email address using regexpr. If a match is found, extract the relevant part using substring. The starting position and match length are provided by regexpr (it returns -1 when there is no match).
inds = regexpr(pattern = "<(.*#.*\\..*)>", v1)
ifelse(inds > 0,
       substring(v1, inds + 1, inds + attr(inds, "match.length") - 2),
       NA)
#[1] "personsemail#email.com" "person2#email.com"
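The approaches above behave differently when an entry contains no address at all. The sketch below illustrates this with a made-up second entry ("no address here"); the grepl() guard is just one way to handle it.

```r
# The second entry is made up to show the no-match case.
v1 <- c("Persons Name <personsemail#email.com>", "no address here")

# With sub(), an element that does not match is returned unchanged.
plain <- sub(".*<(.*)>.*", "\\1", v1)

# Guarding with grepl() turns such elements into NA instead.
v2 <- ifelse(grepl("<[^>]+>", v1),
             sub(".*<([^>]+)>.*", "\\1", v1),
             NA_character_)
v2
```

With hundreds of observations, returning NA for malformed entries makes them easy to find afterwards with is.na().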

Regular expressions, stringr - have the regex, can't get it to work in R

I have a data frame with a field called "full.path.name"
This contains things like
s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx
01 GROUP is a pattern of variable size in the whole string.
I would like to add a new field onto the data frame called "short.path"
and it would contain things like
s:///01 GROUP
s:///02 GROUP LONGER NAME
I've managed to extract the last four characters of the file using stringr, I think I should use stringr again.
This gives me the file extension
sfiles$file_type<-as.factor(str_sub(sfiles$Type.of.file,-4))
I went to https://www.regextester.com/
and got this
s:///*.[^/]*
as the regex to use
so I tried it below
sfiles$file_path_short<-as.factor(str_match(sfiles$Full.path.name,regex("s:///*.[^/]*")))
What I thought I would get is a new field on my data frame containing
01 GROUP etc
I get NA
When I try this
sfiles$file_path_short<-str_extract(sfiles$Full.path.name,"[S]")
Gives me S
Where am I going wrong?
When I use: https://regexr.com/
I get
\d* [A-Z]* [A-Z]*[^/]
How do I put that into
sfiles$file_path_short<-str_extract(sfiles$Full.path.name,\d* [A-Z]* [A-Z]*[^\/])
And make things work?
EDIT:
There are two solutions here.
The reason the solutions didn't work at first was because
sfiles$Full.path.name
was longer than 255 characters in some cases.
What I did:
To make g_t_m's regex work
library(tidyverse)
#read the file
sfiles<-read.csv("H:/sdrive_files.csv", stringsAsFactors = F)
# add a field to calculate path length and filter out
sfiles$file_path_length <- str_length(sfiles$Full.path.name)
sfiles<-sfiles%>%filter(file_path_length <=255)
# then use str_replace to take out the full path name and leave only the top
# folder names
sfiles$file_path_short <- as.factor(str_replace(sfiles$Full.path.name, "(^.+?/[^/]+?)/.+$", "\\1"))
levels(sfiles$file_path_short)
[1] "S:///01 GROUP 1"
[2] "S:///02 GROUP 2"
[3] "S:///03 GROUP 3"
[4] "S:///04 GROUP 4"
[5] "S:///05 GROUP 5"
[6] "S:///06 GROUP 6"
[7] "S:///07 GROUP 7"
I think it was the full.path.name field that was causing problems.
To make Wiktor's answer work I did this:
#read the file
sfiles<-read.csv("H:/sdrive_files.csv", stringsAsFactors = F)
str(sfiles)
sfiles$file_path_length <- str_length(sfiles$Full.path.name)
sfiles<-sfiles%>%filter(file_path_length <=255)
sfiles$file_path_short <- str_replace(sfiles$Full.path.name, "(^.+?/[^/]+?)/.+$", "\\1")
You may use a mere
sfiles$file_path_short <- str_extract(sfiles$Full.path.name, "^s:///[^/]+")
If you plan to exclude s:/// from the results, wrap it within a positive lookbehind:
"(?<=^s:///)[^/]+"
See the regex demo
Details
^ - start of string
s:/// - a literal substring
[^/]+ - a negated character class matching any 1+ chars other than /.
(?<=^s:///) - a positive lookbehind that requires the presence of s:/// at the start of the string immediately to the left of the current location (but this value does not appear in the resulting matches since lookarounds are non-consuming patterns).
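The two patterns described above can be compared side by side. This is a minimal sketch; the second path and the upper-case example are made up. The upper-case case is included because the factor levels shown in the question's edit start with S:///, and a lower-case s in the pattern would not match those unless case is ignored.

```r
library(stringr)

# The second path is a made-up example.
paths <- c("s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx",
           "s:///02 GROUP LONGER NAME/some file.pdf")

# Keeps the s:/// prefix in the result.
with_prefix <- str_extract(paths, "^s:///[^/]+")

# The lookbehind requires the prefix but leaves it out of the match.
without_prefix <- str_extract(paths, "(?<=^s:///)[^/]+")

# If the drive letter may be upper case, ignore_case = TRUE is one option.
upper <- str_extract("S:///01 GROUP 1/file.docx",
                     regex("^s:///[^/]+", ignore_case = TRUE))
```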
Firstly, I would amend your regex to extract the file extension, since file extensions are not always 4 characters long:
library(stringr)
df <- data.frame(full.path.name = c("s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx",
"s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.pdf"), stringsAsFactors = F)
df$file_type <- str_replace(basename(df$full.path.name), "^.+\\.(.+)$", "\\1")
df$file_type
[1] "docx" "pdf"
Then, the following code should give you your short name:
df$file_path_short <- str_replace(df$full.path.name, "(^.+?/[^/]+?)/.+$", "\\1")
df
                                              full.path.name file_type file_path_short
1 s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx      docx   s:///01 GROUP
2  s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.pdf       pdf   s:///01 GROUP
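As a possible alternative for the extension step, base R ships tools::file_ext(). The sketch below is not part of the answer above; the second path (a file without an extension) is made up to show how that case is handled.

```r
# The second path is a made-up example of a file without an extension.
paths <- c("s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx",
           "s:///01 GROUP/01 SUBGROUP/README")

# file_ext() returns the text after the last dot, or "" if there is none.
ext <- tools::file_ext(paths)
ext
```

The str_replace pattern "^.+\\.(.+)$" leaves a name without a dot unchanged (it would return "README" here), so the two approaches differ for files that have no extension.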

Retrieving a specific part of a string in R

I have the following vector of strings
[1] "/players/playerpage.htm?ilkidn=BRYANPHI01"
[2] "/players/playerpage.htm?ilkidhh=WILLIROB027"
[3] "/players/playerpage.htm?ilkid=THOMPWIL01"
I am looking for a way to retrieve the part of the string that is placed after the equal sign meaning I would like to get a vector like this
[1] "BRYANPHI01"
[2] "WILLIROB027"
[3] "THOMPWIL01"
I tried using substr, but for it to work I have to know exactly where the equal sign is placed in the string and where the part I want to retrieve ends.
We can use sub to match the zero or more characters that are not a = ([^=]*) followed by a = and replace it with ''.
sub("[^=]*=", "", str1)
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
data
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
Using stringr,
library(stringr)
word(str1, 2, sep = '=')
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
Using strsplit,
strsplit(str1, "=")[[1]][2]
# [1] "BRYANPHI01"
Following Sotos's comment, to get the results as a vector:
sapply(str1, function(x){
strsplit(x, "=")[[1]][2]
})
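The sapply call above returns a vector named after the input strings. If an unnamed character vector is preferred, a sketch along the following lines may work: it splits every string once and uses vapply to pick the second piece, with fixed = TRUE because "=" is meant literally.

```r
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
          "/players/playerpage.htm?ilkidhh=WILLIROB027",
          "/players/playerpage.htm?ilkid=THOMPWIL01")

# strsplit() returns one character vector per input string;
# `[` with index 2 picks the part after the first "=".
ids <- vapply(strsplit(str1, "=", fixed = TRUE), `[`, character(1), 2)
ids
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
```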
Another solution based on regex, but extracting instead of substituting, which may be more efficient.
I use the stringi package, whose ICU regex engine supports look-behind (base R's default engine does not, although base R functions accept look-behind when called with perl = TRUE).
library(stringi)
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
stri_extract_all_regex(str1, pattern="(?<==).+$", simplify=T)
(?<==) is a look-behind: regex will match only if preceded by an equal sign, but the equal sign will not be part of the match.
.+$ matches everything until the end. You could replace the dot with a more precise symbol if you are confident about the format of what you match. For example, '\w' matches any word character (letters, digits and the underscore), so you could use "(?<==)\\w+$" (the \ must be escaped so you end up with \\w).
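The same look-behind pattern can also be tried without extra packages. The following is a sketch in base R, assuming every string contains an equal sign; perl = TRUE switches regexpr() to the PCRE engine, which understands (?<=...).

```r
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
          "/players/playerpage.htm?ilkidhh=WILLIROB027",
          "/players/playerpage.htm?ilkid=THOMPWIL01")

# Match word characters at the end of the string that are preceded by "=".
# regmatches() drops strings without a match, hence the assumption above.
ids <- regmatches(str1, regexpr("(?<==)\\w+$", str1, perl = TRUE))
ids
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
```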
