R gsub fixed pattern and non-fixed pattern at the same time

I have a column in an R data.table of data with text like this:
> my_table
descr
1: DESCRIPTIONA - JONES:4:2
2: DESCRIPTIONB - WILDER:6:7
---
253: DESCRIPTIONA - MANN:5:8
254: DESCRIPTIONB - ROBERTS:3:4
Notice there are two kinds of descriptions: DESCRIPTIONA and DESCRIPTIONB. I want to replace the whole description part, including the name, up to the first colon with A if it's DESCRIPTIONA and B if it's DESCRIPTIONB. In other words, I don't care about the name at all. The output should look something like this:
> my_table
descr
1: A:4:2
2: B:6:7
---
253: A:5:8
254: B:3:4
I'm trying to use gsub to accomplish this, but I can't get the regex to replace just the DESCRIPTIONA - JONES part of DESCRIPTIONA - JONES:4:2. It's difficult because each name is different and has a different length. Any ideas?

x = c(
"DESCRIPTIONA - JONES:4:2",
"DESCRIPTIONB - WILDER:6:7",
"DESCRIPTIONA - MANN:5:8",
"DESCRIPTIONB - ROBERTS:3:4"
)
gsub(pattern = "DESCRIPTION(.)[^:]*", replacement = "\\1", x)
# [1] "A:4:2" "B:6:7" "A:5:8" "B:3:4"
Explanation: "DESCRIPTION(.)[^:]*" matches the word DESCRIPTION, then a single character (.) which is "saved" as a capturing group by the parens (), then it continues to match non-colon characters [^:], as many as possible (*). It replaces the full match with the first (\\1) capturing group.
You can play with it here to understand better: https://regex101.com/r/Sc7oC1/1
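Since the question works on a data.table column rather than a plain vector, the same call can be applied to the column by reference; a minimal sketch, assuming the table and column are named my_table and descr as shown in the question:
library(data.table)
# toy table mirroring the structure shown in the question
my_table <- data.table(descr = c(
  "DESCRIPTIONA - JONES:4:2", "DESCRIPTIONB - WILDER:6:7",
  "DESCRIPTIONA - MANN:5:8", "DESCRIPTIONB - ROBERTS:3:4"
))
# update the column in place with the same pattern and backreference
my_table[, descr := gsub("DESCRIPTION(.)[^:]*", "\\1", descr)]
my_table
##    descr
## 1: A:4:2
## 2: B:6:7
## 3: A:5:8
## 4: B:3:4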

Related

Find pattern and wrap between parenthesis the next letter

I have to find different patterns in a data frame column; once a pattern is found, the next letter should be wrapped in parentheses:
Data:
a <- c('(acetyl)RKJOEQLKQ', 'LDFEION(acetyl)EFNEOW')
if the pattern is: '(acetyl)'
this is the output that I'd like to achieve:
Expected output:
b <- c('(R)KJOEQLKQ', 'LDFEION(E)FNEOW')
I know that I can find the pattern with gsub:
b <- gsub('(acetyl)', replacement = '', a)
However, I'm not sure how to wrap the letter that follows the pattern in parentheses.
Any help would be appreciated.
You can use
a <- c('(acetyl)RKJOEQLKQ', 'LDFEION(acetyl)EFNEOW')
gsub('\\(acetyl\\)(.)', '(\\1)', a)
## => [1] "(R)KJOEQLKQ" "LDFEION(E)FNEOW"
See the regex demo and the online R demo.
Details:
\(acetyl\) - matches a literal string (acetyl)
(.) - captures into Group 1 any single char
The (\1) replacement pattern replaces the matches with ( + Group 1 value + ).
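If stringr is preferred, the same pattern and backreference carry over to str_replace_all unchanged; a small equivalent sketch, not part of the original answer:
library(stringr)
a <- c('(acetyl)RKJOEQLKQ', 'LDFEION(acetyl)EFNEOW')
str_replace_all(a, '\\(acetyl\\)(.)', '(\\1)')
## [1] "(R)KJOEQLKQ"     "LDFEION(E)FNEOW"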

Keep only the first letter of each word after a comma

I have strings like Sacher, Franz Xaver or Nishikawa, Kiyoko.
Using R, I want to change them to Sacher, F. X. or Nishikawa, K..
In other words, the first letter of each word after the comma should be retained, followed by a dot (and a space if another word follows).
Here is a related response, but it cannot be applied to my case 1:1 as it does not have a comma in its strings; simply adding (?<=, ) does not seem to work.
E.g. in the following attempts, gsub() replaces everything, while my str_replace_all() attempt leads to an error:
TEST <- c("Sacher, Franz Xaver", "Nishikawa, Kiyoko", "Al-Assam, Muhammad")
# first attempt
# (resembles the response from the other thread)
gsub('\\b(\\pL)\\pL{2,}|.','\\U\\1', TEST, perl = TRUE)
# second attempt
# error: "Incorrect unicode property"
stringr::str_replace_all(TEST, '(?<=, )\\b(\\pL)\\pL{2,}|.','\\U\\1')
I would be grateful for your help!
You can use
gsub("(*UCP)^[^,]+(*SKIP)(*F)|\\b(\\p{L})\\p{L}*", "\\U\\1.", TEST, perl=TRUE)
See the regex demo. Details:
(*UCP) - the PCRE verb that will make \b Unicode aware
^[^,]+(*SKIP)(*F) - start of string, then one or more chars other than a comma; then the match is failed and skipped, and the next match attempt starts at the location where the failure occurred
| - or
\b - word boundary
(\p{L}) - Group 1: any Unicode letter
\p{L}* - zero or more Unicode letters
See the R demo:
TEST <- c("Sacher, Franz Xaver", "Nishikawa, Kiyoko", "Al-Assam, Muhammad")
gsub("(*UCP)^[^,]+(*SKIP)(*F)|\\b(\\p{L})\\p{L}*", "\\U\\1.", TEST, perl=TRUE)
## => [1] "Sacher, F. X." "Nishikawa, K." "Al-Assam, M."
A crude approach, splitting the string:
TEST <- c("Sacher, Franz Xaver", "Nishikawa, Kiyoko", "Al-Assam, Muhammad")
sapply(strsplit(TEST, '\\s+'), function(x)
paste0(x[1], paste0(substr(x[-1], 1, 1), collapse = '.'), '.'))
#[1] "Sacher,F.X." "Nishikawa,K." "Al-Assam,M."
An approach using multiple backreferences:
gsub("(\\b\\w+,\\s)(\\b\\w).*(\\b\\w)*", "\\1\\2.\\3", TEST)
[1] "Sacher, F." "Nishikawa, K." "Al-Assam, M."
Here, we use three capturing groups to refer back to in gsub's replacement argument via backreferences:
(\\b\\w+,\\s): the first group captures the last name plus the comma and the following whitespace
(\\b\\w): the second group captures the initial of the first name
(\\b\\w)*: the third group is meant to capture the initial of the middle name, but because the greedy .* consumes the rest of the string and the group is quantified with *, it stays empty, which is why the output above is "Sacher, F." rather than "Sacher, F. X."
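For comparison, the comma can also be handled without lookarounds or (*SKIP) by splitting each string at the comma and abbreviating only the right-hand part; this is my own sketch, not taken from the answers above:
library(stringr)
TEST <- c("Sacher, Franz Xaver", "Nishikawa, Kiyoko", "Al-Assam, Muhammad")
# split into surname and given names at the first comma
parts <- str_split_fixed(TEST, ",\\s*", 2)
# abbreviate every word in the given-names part, then glue the pieces back together
paste0(parts[, 1], ", ", gsub("(\\p{L})\\p{L}*", "\\1.", parts[, 2], perl = TRUE))
## [1] "Sacher, F. X."  "Nishikawa, K."  "Al-Assam, M."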

How to replace only specific groups in a match in R using stringr package?

In my R project I am using package stringr to perform regex operations.
text <- "My code #snippet wanna get this# is simple"
pattern <- "#([^ \t]+) (.+)#"
pattern looks for stuff inside #...#. The following code:
stringr::str_match_all(text, pattern)
Will give me the content of the groups I am targeting:
[[1]]
[,1] [,2] [,3]
[1,] "#snippet wanna get this#" "snippet" "wanna get this"
How do I replace the content of group 3 (and only that) with a different text? The final desired result would be:
My code #snippet REPLACED WITH THIS# is simple
I am playing with stringr::str_replace_all, but I don't see how to solve this issue: I keep replacing the whole match rather than just a single group's content.
You may capture what you need to keep and just match what you need to replace, use
> gsub("(#[^ \t#]+ )[^#]*(#)", "\\1REPLACED WITH THIS\\2", text)
[1] "My code #snippet REPLACED WITH THIS# is simple"
Details
(#[^ \t#]+ ) - Group 1: #, then any 1+ chars other than #, space and tab, and a space
[^#]* - 0+ chars other than #
(#) - Group 2: a # char
Another way: use gsubfn with a pattern where all your pattern parts are captured into separate groups and then rebuild the replacement after performing the required manipulations:
> gsubfn::gsubfn("(#[^ \t#]+ )([^#]*)(#)", function(x, y, z) paste0(x, "REPLACED WITH THIS", z), text)
[1] "My code #snippet REPLACED WITH THIS# is simple"
Here, the x, y and z refer to the groups defined in the pattern:
(#[^ \t#]+ )([^#]*)(#)
| --- x ---||- y -||z|
With stringr, you may use a pattern with lookbehind/lookahead, though you should be very careful with this:
> stringr::str_replace_all(text, "(?<=#[^ \t#]{1,1000} )[^#]*(?=#)", "REPLACED WITH THIS")
[1] "My code #snippet REPLACED WITH THIS# is simple"
The (?<=#[^ \t#]{1,1000} ) lookbehind works because it matches a known, bounded-length pattern (the {1,1000} says there can be from 1 to 1000 occurrences of any char other than space, tab and #), and this "constrained-width lookbehind" is supported because stringr uses the ICU regex library.
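For completeness, the group can also be replaced purely by position with base R, avoiding the lookbehind entirely; a sketch of my own using regexec(), not part of the original answer:
text <- "My code #snippet wanna get this# is simple"
m <- regexec("#([^ \t]+) (.+)#", text)[[1]]
# m[1] is the whole match; m[3] and attr(m, "match.length")[3] give the start
# and width of the second capture group (column 3 in the str_match_all output)
start <- m[3]
len <- attr(m, "match.length")[3]
paste0(substr(text, 1, start - 1), "REPLACED WITH THIS", substring(text, start + len))
## [1] "My code #snippet REPLACED WITH THIS# is simple"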

Regular expressions, stringr - have the regex, can't get it to work in R

I have a data frame with a field called "full.path.name"
This contains things like
s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx
01 GROUP is a pattern of variable size in the whole string.
I would like to add a new field onto the data frame called "short.path"
and it would contain things like
s:///01 GROUP
s:///02 GROUP LONGER NAME
I've managed to extract the last four characters of the file name using stringr, and I think I should use stringr again.
This gives me the file extension
sfiles$file_type<-as.factor(str_sub(sfiles$Type.of.file,-4))
I went to https://www.regextester.com/
and got this
s:///*.[^/]*
as the regex to use
so I tried it below
sfiles$file_path_short<-as.factor(str_match(sfiles$Full.path.name,regex("s:///*.[^/]*")))
What I thought I would get is a new field on my data frame containing
01 GROUP etc
I get NA
When I try this
sfiles$file_path_short<-str_extract(sfiles$Full.path.name,"[S]")
I just get S.
Where am I going wrong?
When I use: https://regexr.com/
I get
\d* [A-Z]* [A-Z]*[^/]
How do I put that into
sfiles$file_path_short<-str_extract(sfiles$Full.path.name,\d* [A-Z]* [A-Z]*[^\/])
And make things work?
EDIT:
There are two solutions here.
The reason the solutions didn't work at first was because
sfiles$Full.path.name
was longer than 255 characters in some cases.
What I did:
To make g_t_m's regex work
library(tidyverse)
#read the file
sfiles <- read.csv("H:/sdrive_files.csv", stringsAsFactors = F)
# add a field to calculate path length and filter out over-long paths
sfiles$file_path_length <- str_length(sfiles$Full.path.name)
sfiles <- sfiles %>% filter(file_path_length <= 255)
# then use str_replace to take out the full path name and leave only the
# top folder names
sfiles$file_path_short <- as.factor(str_replace(sfiles$Full.path.name, "(^.+?/[^/]+?)/.+$", "\\1"))
levels(sfiles$file_path_short)
[1] "S:///01 GROUP 1"
[2] "S:///02 GROUP 2"
[3] "S:///03 GROUP 3"
[4] "S:///04 GROUP 4"
[5] "S:///05 GROUP 5"
[6] "S:///06 GROUP 6"
[7] "S:///07 GROUP 7
I think it was the full.path.name field that was causing problems.
To make Wiktor's answer work I did this:
#read the file
sfiles<-read.csv("H:/sdrive_files.csv", stringsAsFactors = F)
str(sfiles)
sfiles$file_path_length <- str_length(sfiles$Full.path.name)
sfiles<-sfiles%>%filter(file_path_length <=255)
sfiles$file_path_short <- str_replace(sfiles$Full.path.name, "(^.+?/[^/]+?)/.+$", "\\1")
You may use a mere
sfiles$file_path_short <- str_extract(sfiles$Full.path.name, "^s:///[^/]+")
If you plan to exclude s:/// from the results, wrap it within a positive lookbehind:
"(?<=^s:///)[^/]+"
See the regex demo
Details
^ - start of string
s:/// - a literal substring
[^/]+ - a negated character class matching any 1+ chars other than /.
(?<=^s:///) - a positive lookbehind that requires the presence of s:/// at the start of the string immediately to the left of the current location (but this value does not appear in the resulting matches since lookarounds are non-consuming patterns).
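Applied to the sample path from the question, the two patterns differ only in whether the s:/// prefix is kept; a quick illustration of my own, assuming stringr is loaded:
library(stringr)
full <- "s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx"
str_extract(full, "^s:///[^/]+")       # keeps the prefix
## [1] "s:///01 GROUP"
str_extract(full, "(?<=^s:///)[^/]+")  # drops the prefix
## [1] "01 GROUP"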
Firstly, I would amend your regex to extract the file extension, since file extensions are not always 4 characters long:
library(stringr)
df <- data.frame(full.path.name = c("s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx",
"s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.pdf"), stringsAsFactors = F)
df$file_type <- str_replace(basename(df$full.path.name), "^.+\\.(.+)$", "\\1")
df$file_type
[1] "docx" "pdf"
Then, the following code should give you your short name:
df$file_path_short <- str_replace(df$full.path.name, "(^.+?/[^/]+?)/.+$", "\\1")
df
full.path.name file_type file_path_short
1 s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx docx s:///01 GROUP
2 s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.pdf pdf s:///01 GROUP
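As a non-regex cross-check, base R's path helpers give the same two fields, assuming every path is exactly drive/group/subgroup/file as in the sample data; this sketch reuses the df built above and is not part of the original answer:
# file extension without assuming a fixed length
tools::file_ext(df$full.path.name)
## [1] "docx" "pdf"
# two dirname() calls walk up from the file to the top-level folder
dirname(dirname(df$full.path.name))
## [1] "s:///01 GROUP" "s:///01 GROUP"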

Using gsub or sub function to only get part of a string?

Col
WBU-ARGU*06:03:04
WBU-ARDU*08:01:01
WBU-ARFU*11:03:05
WBU-ARFU*03:456
I have a column with 75 rows of values like those above. I am not quite sure how to use gsub or sub to keep everything up to and including the integers after the first colon.
Expected output:
Col
WBU-ARGU*06:03
WBU-ARDU*08:01
WBU-ARFU*11:03
WBU-ARFU*03:456
I tried this but it doesn't seem to work:
gsub("*..:","", df$col)
The following may help you here too.
sub("([^:]*):([^:]*).*","\\1:\\2",df$dat)
Output will be as follows.
> sub("([^:]*):([^:]*).*","\\1:\\2",df$dat)
[1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456b"
The input data frame is as follows.
dat <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456b")
df <- data.frame(dat)
Explanation (for reference only):
sub(" ##using sub for global subtitution function of R here.
([^:]*) ##By mentioning () we are keeping the matched values from vector's element into 1st place of memory(which we could use later), which is till next colon comes it will match everything.
: ##Mentioning letter colon(:) here.
([^:]*) ##By mentioning () making 2nd place in memory for matched values in vector's values which is till next colon comes it will match everything.
.*" ##Mentioning .* to match everything else now after 2nd colon comes in value.
,"\\1:\\2" ##Now mentioning the values of memory holds with whom we want to substitute the element values \\1 means 1st memory place \\2 is second memory place's value.
,df$dat) ##Mentioning df$dat dataframe's dat value.
You may use
df$col <- sub("(\\d:\\d+):\\d+$", "\\1", df$col)
See the regex demo
Details
(\\d:\\d+) - Capturing group 1 (its value will be accessible via \1 in the replacement pattern): a digit, a colon and 1+ digits.
: - a colon
\\d+ - 1+ digits
$ - end of string.
R Demo:
col <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456")
sub("(\\d:\\d+):\\d+$", "\\1", col)
## => [1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456"
Alternative approach:
df$col <- sub("^(.*?:\\d+).*", "\\1", df$col)
See the regex demo
Here,
^ - start of string
(.*?:\\d+) - Group 1: any 0+ chars, as few as possible (due to the lazy *? quantifier), then : and 1+ digits
.* - the rest of the string.
However, it should be used with the PCRE regex engine, so pass perl=TRUE:
col <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456")
sub("^(.*?:\\d+).*", "\\1", col, perl=TRUE)
## => [1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456"
See the R online demo.
sub("(\\d+:\\d+):\\d+$", "\\1", df$Col)
[1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456"
Alternatively match what you want (instead of subbing out what you don't want) with stringi:
stringi::stri_extract_first(df$Col, regex = "[A-Z-\\*]+\\d+:\\d+")
Slightly more concise with stringr:
stringr::str_extract(df$Col, "[A-Z-\\*]+\\d+:\\d+")
# or
stringr::str_extract(df$Col, "[\\w-*]+\\d+:\\d+")
