Extract characters from string based on rule (repeated hyphen) - r

I have a large dataframe with a column that looks something like this:
var <- c("150507-001-0000001", "KMD070515-2-0000001",
"15144KMD01AA-0000001", "Z75Z151222-0000001")
What I want to do is extract part of the string: all characters up to the second hyphen. So this is what I need:
150507-001
KMD070515-2
15144KMD01AA-0000001
Z75Z151222-0000001
So I know if I only wanted the data before the hyphen I'd do this:
> var <- sub("-.*", "", var)
> var
150507
KMD070515
15144KMD01AA
Z75Z151222
I've also tried a package qdap which kinda gave me what I wanted:
library("qdap")
var <- beg2char(var, "-", 2)
I do get the column I need with the last code, but something seems to be wrong: when I do a left_join based on that column, it doesn't work. I can find a match by copy-pasting in the data view, but left_join doesn't find anything. A left_join on the var made with sub (see above) does work, however. But for some of my rows I need the characters after the first hyphen (and before the second) to find a match.

Here is a non-regex solution, for those who might be interested:
x <- "150507-001-0000001"
paste(strsplit(x, "-")[[1]][1:2], collapse="-")
[1] "150507-001"
If you wanted to apply this logic to your entire vector, then use:
sapply(var, function(x) paste(strsplit(x, "-")[[1]][1:2], collapse="-"))
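A minimal sketch of the same split-and-paste idea applied to the whole vector, assuming base R only. It uses vapply so the result is an unnamed character vector, and head(parts, 2) so that an element with fewer than two pieces is returned as-is. The name first_two is only illustrative.

```r
var <- c("150507-001-0000001", "KMD070515-2-0000001",
         "15144KMD01AA-0000001", "Z75Z151222-0000001")

# Split each element on a literal hyphen, keep at most the first two pieces,
# and glue them back together with a hyphen.
first_two <- vapply(
  strsplit(var, "-", fixed = TRUE),
  function(parts) paste(head(parts, 2), collapse = "-"),
  character(1)
)
first_two
```

Because sapply on a character vector uses the input strings as names, the unnamed result here may be easier to assign back to a data frame column.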

We can use sub to match one or more characters that are not a - ([^-]+), followed by a - and another run of characters that are not a -, capture all of that as a group ((...)), and replace the whole string with the backreference (\\1) to the captured group:
sub("^([^-]+-[^-]+).*", "\\1", var)
#[1] "150507-001" "KMD070515-2"
#[3] "15144KMD01AA-0000001" "Z75Z151222-0000001"
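To check that keys extracted this way line up in a join, here is a small sketch. The data frames orders and lookup, and the label column, are hypothetical and not from the question, and base merge with all.x = TRUE stands in for left_join so that no package is needed.

```r
# Hypothetical data: `orders` holds the full ids, `lookup` holds shortened keys.
orders <- data.frame(
  id = c("150507-001-0000001", "KMD070515-2-0000001",
         "15144KMD01AA-0000001", "Z75Z151222-0000001"),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  key = c("150507-001", "KMD070515-2"),
  label = c("first", "second"),
  stringsAsFactors = FALSE
)

# Keep everything before the second hyphen (or the whole string if there is
# only one hyphen).
orders$key <- sub("^([^-]+-[^-]+).*", "\\1", orders$id)

# all.x = TRUE keeps every row of `orders`, like a left join.
joined <- merge(orders, lookup, by = "key", all.x = TRUE)
joined
```

Rows whose key has no counterpart in lookup get NA in label. If a join on a key built some other way finds nothing, comparing the two key columns with identical() or nchar() may help reveal invisible differences.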

Related

Extract text in two columns from a string

I have a table where one column has data like this:
table$test_string<- "[projectname](https://somewebsite.com/projectname/Abc/xyz-09)"
1.) I am trying to extract the first part of this string within the square brackets in one column, i.e.
table$project_name <- "projectname"
using the regex:
project_name <- "^\\[|(?:[a-zA-Z]|[0-9])+|\\]$"
table$project_name <- str_extract(table$test_string, project_name)
If I test the regex on 1 value (1 row individually) of the table, the above regex works with using
str_extract_all(table$test_string, project_name[[1]][2]).
However, I get NA when I apply the regex pattern to the whole table and an error if I use str_extract_all.
2.) Second part of the string, which is a URL in another column,
table$url_link <- "https://somewebsite.com/projectname/Abc/xyz-09"
I am using the following regex expression for URL:
url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
table$url_link <- str_extract(table$test_string, url_pattern)
This works on the whole table; however, I still get the last parenthesis ')' in the URL.
What am I missing here? Why does the first regex work on an individual value and not on the whole table?
And for the URL, how do I avoid capturing the last parenthesis?
It feels like you could simplify things considerably by using parentheses to group capture. For example:
test_string<- "[projectname](https://somewebsite.com/projectname/Abc/xyz-09)"
regex <- "\\[(.*)\\]\\((.*)\\)"
gsub(regex, "\\1", test_string)
#> [1] "projectname"
gsub(regex, "\\2", test_string)
#> [1] "https://somewebsite.com/projectname/Abc/xyz-09"
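The same pattern can be applied to a whole column, since sub is vectorised. The following is a sketch only: the data frame links and its second row are made up for illustration, and it assumes every value has the [name](url) shape.

```r
# Hypothetical table; the second row is invented to show vectorised behaviour.
links <- data.frame(
  test_string = c(
    "[projectname](https://somewebsite.com/projectname/Abc/xyz-09)",
    "[other](https://example.org/docs/page_1)"
  ),
  stringsAsFactors = FALSE
)

regex <- "\\[(.*)\\]\\((.*)\\)"

# Group 1 is the text inside the square brackets, group 2 the text inside
# the round brackets.
links$project_name <- sub(regex, "\\1", links$test_string)
links$url_link <- sub(regex, "\\2", links$test_string)
links[, c("project_name", "url_link")]
```

Values that do not match the pattern are returned unchanged by sub, so a check with grepl(regex, links$test_string) beforehand may be worthwhile on real data.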
We can make use of convenient functions from qdapRegex
library(qdapRegex)
rm_round(test_string, extract = TRUE)[[1]]
#[1] "https://somewebsite.com/projectname/Abc/xyz-09"
rm_square(test_string, extract = TRUE)[[1]]
#[1] "projectname"
data
test_string<- "[projectname](https://somewebsite.com/projectname/Abc/xyz-09)"

str_extract expressions in R

I would like to convert this:
AIR-GEN-SUM-UD-ELA-NH-COMBINED-3-SEG1
to this:
ELA-3
I tried this function:
str_extract(.,pattern = ":?(ELA).*(\\d\\-)"))
it printed this:
"ELA-NH-COMBINED-3-"
I need to get rid of the text or anything between the two extracts. The number will be a number between 3 and 9. How should I modify my expression in pattern =?
Thanks!
1) Match everything up to -ELA, followed by anything (.*) up to a -, followed by captured digits (\\d+), followed by a - and then anything. Then replace all of that with ELA- followed by the captured digits. No packages are used.
x <- "AIR-GEN-SUM-UD-ELA-NH-COMBINED-3-SEG1"
sub(".*-ELA.*-(\\d+)-.*", "ELA-\\1", x)
## [1] "ELA-3"
2) Another approach if there is only one numeric field is that we can read in the fields, grep out the numeric one and preface it with ELA- . No packages are used.
s <- scan(text = x, what = "", quiet = TRUE, sep = "-")
paste("ELA", grep("^\\d+$", s, value = TRUE), sep = "-")
## [1] "ELA-3"
TL;DR;
You can't do that with a single call to str_extract because you cannot match discontinuous portions of texts within a single match operation.
Again, it is impossible to match texts that are separated with other text into one group.
Work-arounds/Solutions
There are two solutions:
Capture parts of text you need and then join them (2 operations: match + join)
Capture parts of text you need and then replace with backreferences to the groups needed (1 replace operation)
Capturing groups only keep parts of text you match in separate memory buffers, but you also need a method or function that is capable of accessing these chunks.
Here, in R, str_extract drops them, but str_match keeps them in the result.
library(stringr)
s <- "AIR-GEN-SUM-UD-ELA-NH-COMBINED-3-SEG1"
# Column 1 of m is the full match; columns 2 and 3 are the captured groups
m <- str_match(s, ":?(ELA).*-(\\d+)")
paste0(m[,2], "-", m[,3])
This prints ELA-3. See R demo online.
Another way is to replace while capturing the parts you need to keep and then using backreferences to those parts in the replacement pattern:
x <- "AIR-GEN-SUM-UD-ELA-NH-COMBINED-3-SEG1"
sub("^.*-ELA.*?-([^-]+)-[^-]+$", "ELA-\\1", x)
See this R demo
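For comparison, the "capture, then join" approach can also be sketched in base R with regexec and regmatches, which keep the captured groups much as str_match does. The second input string and the [3-9] digit restriction (taken from the question's statement that the number is between 3 and 9) are assumptions for illustration.

```r
x <- c("AIR-GEN-SUM-UD-ELA-NH-COMBINED-3-SEG1",
       "AIR-GEN-SUM-UD-ELA-NH-COMBINED-7-SEG2")  # second value is made up

# regexec() returns match positions for the full match and each group;
# regmatches() turns them into c(full_match, group1, group2) per element.
parts <- regmatches(x, regexec("-(ELA)-.*-([3-9])-", x))

# Join group 1 and group 2 with a hyphen.
result <- vapply(parts, function(p) paste(p[2], p[3], sep = "-"), character(1))
result
```

An element that does not match yields an empty vector in parts, so p[2] and p[3] would be NA; real data may need a check on lengths(parts) first.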

Match & Replace String, utilising the original string in the replacement, in R

I am trying to get to grips with the world of regular expressions in R.
I was wondering whether there was any simple way of combining the functionality of "grep" and "gsub"?
Specifically, I want to append some additional information to anything which matches a specific pattern.
For a generic example, let's say I have a character vector:
char_vec <- c("A","A B","123?")
Then let's say I want to append the following to every letter within any element of char_vec:
append <- "_APPEND"
Such that the result would be:
[1] "A_APPEND" "A_APPEND B_APPEND" "123?"
Clearly a gsub can replace the letters with append, but this does not keep the original expression (while grep would return the letters but not append!).
Thanks in advance for any / all help!
It seems you are not familiar with backreferences that you may use in the replacement patterns in (g)sub. Once you wrap a part of the pattern with a capturing group, you can later put this value back into the result of the replacement.
So, a mere gsub solution is possible:
char_vec <- c("A","A B","123?")
append <- "_APPEND"
gsub("([[:alpha:]])", paste0("\\1", append), char_vec)
## => [1] "A_APPEND" "A_APPEND B_APPEND" "123?"
See this R demo.
Here, ([[:alpha:]]) matches and captures into Group 1 any letter and \1 in the replacement reinserts this value into the result.
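A short sketch of two further uses of backreferences, with made-up inputs: capturing a whole run of letters instead of a single letter, and using two groups to reorder parts of a string.

```r
x <- c("AB 12", "x-y")

# One group around a run of letters: the suffix is added once per word.
tagged <- gsub("([[:alpha:]]+)", "\\1_APPEND", x)
tagged

# Two groups: \\1 and \\2 can be placed in any order in the replacement.
swapped <- sub("^([^-]+)-([^-]+)$", "\\2-\\1", "left-right")
swapped
```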
Definitely not as slick as @Wiktor Stribiżew's answer, but here is another method I developed.
char_vars <- c('a', 'b', 'a b', '123')
grep('[A-Za-z]', char_vars)
gregexpr('[A-Za-z]', char_vars)
matches = regmatches(char_vars, gregexpr('[A-Za-z]', char_vars))
# Note: sub() replaces the first occurrence of each found letter, so an
# element that repeats a letter is not handled correctly by this loop.
for (i in seq_along(matches)) {
  for (found in matches[[i]]) {
    char_vars[i] = sub(pattern = found,
                       replacement = paste(found, "_append", sep = ""),
                       x = char_vars[i])
  }
}
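A possible loop-free variant of this idea, sketched with the replacement form of regmatches from base R. It rewrites each match in place, so repeated letters are not a problem; the helper name add_suffix is hypothetical.

```r
char_vars <- c("a", "b", "a b", "123")

# Locate every letter in every element.
m <- gregexpr("[A-Za-z]", char_vars)

# Append the suffix to each match; return an empty result untouched so that
# elements without matches are left alone.
add_suffix <- function(found) {
  if (length(found)) paste0(found, "_append") else found
}

# `regmatches<-` substitutes the new values at the matched positions.
regmatches(char_vars, m) <- lapply(regmatches(char_vars, m), add_suffix)
char_vars
```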

how to write a regular expression to extract a specific element from string in r

I have a string list as below:
df = read.table(text="AC1=60;AD=393,115;AF1=0.318816;BQB=0.508823;DP=1016;DP4=393
AC1=190;AD=2,747;AF1=1;BQB=0.0722892;DP=749;DP4=2,0,747,0;FQ=-43.6844
AC1=150;AD=1,5;AF1=0.787353;DP=6;DP4=1,0,5,0;VDB=0.00215942
AC1=47;AD=660,182;AF1=0.24862;BQB=0.680047;DP=1684;DP4=660,0,182,0
AC1=47;AD=659,183;AF1=0.248425;DP=842;DP4=0,659,0,183;FQ=999
AC1=78;AD=23,17;AF1=0.408247;BQB=1;DP=40;DP4=23,0,17,0", header=FALSE, stringsAsFactors=F)
each element is separated by ";". I would like to extract out only "DP=[0-9]" part. The result is expected as:
DP=1016
DP=749
DP=6
DP=1684
DP=842
DP=40
I appreciate any help.
In base:
gsub(".*((?<=;)DP=[^;]+(?=;)).*", "\\1", df$V1, perl=TRUE)
#[1] "DP=1016" "DP=749" "DP=6" "DP=1684" "DP=842" "DP=40"
I was surprised when the resident genius on regex suggested the use of packages for text extraction. sub and gsub can get unruly when pulling out a specific string:
library(stringr)
str_extract_all(df$V1, "(?<=;)DP=[^;]+(?=;)")
Here is one regular expression that will work
gsub(".*;(DP=[0-9.]+);.*$", "\\1", df$V1)
If the "DP=" substring can contain multiple entries separated by commas, as substrings like "DP4=" do in the example data, then, as @pierre-lafortune notes in the comments below and in his answer, you might be better off with the [^;] character class:
gsub(".*;(DP=[^;]+);.*$", "\\1", df$V1)
Of course, you could just add the comma to the character class,
gsub(".*;(DP=[0-9.,]+);.*$", "\\1", df$V1)
but there may be other characters you want to keep as well. So [^;] would be the most inclusive approach.
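As an alternative to a single regex, the fields can be split on ";" and picked by key. This is a sketch under the assumption that every field has the key=value shape; the helper get_field is hypothetical, and only two of the question's rows are used.

```r
info <- c("AC1=60;AD=393,115;AF1=0.318816;BQB=0.508823;DP=1016;DP4=393",
          "AC1=150;AD=1,5;AF1=0.787353;DP=6;DP4=1,0,5,0;VDB=0.00215942")

# Return the first "key=value" field for `key` in each string, or NA if absent.
get_field <- function(x, key) {
  prefix <- paste0(key, "=")
  vapply(strsplit(x, ";", fixed = TRUE), function(fields) {
    hit <- fields[startsWith(fields, prefix)]
    if (length(hit)) hit[1] else NA_character_
  }, character(1))
}

dp <- get_field(info, "DP")

# Strip the key to get the numeric value.
dp_value <- as.numeric(sub("^DP=", "", dp))
dp
dp_value
```

Matching on the prefix "DP=" rather than "DP" keeps "DP4=..." fields from being picked up, and a missing key gives NA instead of returning the whole input string as gsub would.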

Resources