R extract text until, and not including x - r

I have a bunch of strings of mixed length, all with a year embedded. I am trying to extract just the text part, that is, everything up to where the number starts. I am having problems with lookahead assertions, assuming that is the proper way to do such extractions.
Here is what I have (returns no match):
>grep("\\b.(?=\\d{4})","foo_1234_bar",perl=T,value=T)
In the example I am looking to extract just foo but there may be several, and of mixed lengths, separated by _ before the year portion.

Look-aheads may be overkill here. Use the underscore and the 4 digits as the structure, combined with a non-greedy quantifier to prevent the 'dot' from gobbling up everything:
/(.+?)_\d{4}/
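The first capture group ($1) holds 'foo'. The pattern above is in Perl notation; in an R string literal the backslash must be doubled (\\d{4}).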
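In R, the same idea might look like the sketch below. It assumes the text part is always followed by an underscore and a four-digit year; the second sample string is made up for illustration.

```r
# Sample input: the first string is from the question, the second is hypothetical
x <- c("foo_1234_bar", "foo_baz_1999_qux")

# Lazily capture everything before the first "_" that is followed by 4 digits,
# then replace the whole string with that captured group
text_part <- sub("^(.+?)_\\d{4}.*$", "\\1", x, perl = TRUE)
text_part
# [1] "foo"     "foo_baz"
```

Note that sub() returns a string unchanged when the pattern does not match, so inputs without a year pass through as they are.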

This will grab everything up until the first digit
x <- c("asdfas_1987asdf", "asd_das_12")
regmatches(x, regexpr("^[^[:digit:]]*", x))
#[1] "asdfas_" "asd_das_"
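If the trailing underscore is unwanted, one possible variant (a sketch, not part of the answer above) deletes everything from the underscores that directly precede the first digit to the end of the string:

```r
x <- c("asdfas_1987asdf", "asd_das_12")

# Drop any underscores immediately before the first digit, plus everything after
no_underscore <- sub("_*[[:digit:]].*$", "", x)
no_underscore
# [1] "asdfas"  "asd_das"
```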

Another approach (I often find that strsplit is faster than regex searching, but not always, and this does still use a slight bit of regex):
x <- c("asdfas_1987asdf", "asd_das_12") #shamelessly stealing Dason's example
sapply(strsplit(x, "[0-9]+"), "[[", 1)

Related

How to match any character existing between a pattern and a semicolon

I am trying to get anything existing between sample_id= and ; in a vector like this:
sample_id=10221108;gender=male
tissue_id=23;sample_id=321108;gender=male
treatment=no;tissue_id=98;sample_id=22
My desired output would be:
10221108
321108
22
How can I get this?
I've been trying several things like this, but I don't find the way to do it correctly:
clinical_data$sample_id<-c(sapply(myvector, function(x) sub("subject_id=.;", "\\1", x)))
You could use sub with a capture group to isolate that which you are trying to match:
out <- sub("^.*\\bsample_id=(\\d+).*$", "\\1", x)
out
[1] "10221108" "321108" "22"
Data:
x <- c("sample_id=10221108;gender=male",
"tissue_id=23;sample_id=321108;gender=male",
"treatment=no;tissue_id=98;sample_id=22")
Note that the actual output above is character, not numeric. But, you may easily convert using as.numeric if you need to do that.
Edit:
If you are unsure that the sample IDs would always be just digits, here is another version you may use to capture any content following sample_id:
out <- sub("^.*\\bsample_id=([^;]+).*$", "\\1", x)
out
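One caveat with sub() is that it returns the input unchanged when the pattern does not match. If some elements might lack sample_id, a base R sketch along these lines returns NA for them instead; the second sample string is made up for illustration.

```r
# Second element is hypothetical: it has no sample_id field
x <- c("sample_id=10221108;gender=male", "tissue_id=23;gender=male")

# Look-behind finds the value after "sample_id=" without including the key
m <- regexpr("(?<=\\bsample_id=)[^;]+", x, perl = TRUE)

# Start with NA everywhere, then fill in only the elements that matched
ids <- rep(NA_character_, length(x))
ids[m != -1] <- regmatches(x, m)
ids
# [1] "10221108" NA
```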
You could try str_extract from the stringr package.
If your data is separated by line (one record per element of x), you can do:
library(stringr)
str_extract(x, "(?<=\\bsample_id=)[[:digit:]]+") # targets a run of digits that is preceded by sample_id=; the + captures all of the digits
This extracts just the number from each element. If all of your data is collected in a single string, it becomes a tad more difficult, because you have to tell the extraction to continue after the first match. The code would look something like this:
str_extract_all(x, "(?<=sample_id=)\\d+")
This extracts all of the numbers you're looking for, and the output is a list. From there you can manipulate the list as you see fit.

Ignore last "/" in R regex

Given the string "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/", I need a regex filter that ignores the last character if it is a "/".
I tried the following regex "(http:////)?compras\\.dados\\.gov\\.br.*\\?.*(?<!//)", as on regexr.com/4om61, but it doesn't work when I run it in R as:
regex_exp_R <- "(http:////)?compras\\.dados\\.gov\\.br.*\\?.*(?<!//)"
grep(regex_exp_R, "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/", perl = T, value = T)
I need this to work in pure regex and grep function, without using any string R package.
Thank you.
Simplified Case:
After the important contributions from all of you, one last issue remains.
Because I will use the regex as an input to another function, the solution must work with pure regex and grep.
The remaining point is a very basic one: given the strings "a1bc/" or "a1bc", the regex must return "a1bc". Building on suggestions I received, I tried
grep(".*[^//]" ,"a1bc/", perl = T, value = T), but still get "a1bc/" instead of "a1bc". Any hints? Thank you.
If you want to return the string without the last / you can do this several ways. Below are a couple options using base R:
Using a back-reference in gsub() (sub() would work too here):
gsub("(.*?)/*$", "\\1", x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"
# or, adapting your original pattern (a literal / needs no escaping, so http:// is written plainly)
gsub("((http://)?compras\\.dados\\.gov\\.br.*\\?.*?)/*$", "\\1", x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"
By position, using ifelse() and substr() (this will probably be a little faster if scaling matters):
ifelse(substr(x, nchar(x), nchar(x)) == "/", substr(x, 1, nchar(x)-1), x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"
Data:
x <- "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/"
Use sub to remove a trailing /:
x <- c("a1bc/", "a2bc")
sub("/$", "", x)
This changes nothing on a string that does not end in /.
As others have pointed out, grep does not modify strings. It returns a numeric vector of indices of the matched strings or a vector of the (unmodified) matched items. It's usually used to subset a character vector.
You can use a negative look-behind at the end to ensure it doesn't end with the character you don't want (in this case, a /). The regex would then be:
.+(?<!\/)
You can view it here with your three input examples: https://regex101.com/r/XB9f7K/1/. If you only want it to match urls, then you would change the .+ part at the beginning to your url regex.
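To see that look-behind at work in R, it has to be paired with a function that returns the matched substring instead of the whole element. The sketch below uses regexpr() and regmatches() with perl = TRUE; whether this fits the "pure regex and grep" constraint depends on the downstream function, which is not shown in the question.

```r
x <- c("a1bc/", "a1bc",
       "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/")

# Greedy .+ backtracks until the character before the match end is not "/"
m <- regexpr(".+(?<!/)", x, perl = TRUE)

# regmatches() returns the matched part only, unlike grep(value = TRUE)
trimmed <- regmatches(x, m)
trimmed[1:2]
# [1] "a1bc" "a1bc"
```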
How about trying gsub("(.*?)/+$","\\1",s)?

R: parse nested parentheses

I would like to parse nested parentheses using R. No, this is not JSON. I have seen examples using Perl, PHP, and Python, but I am having trouble getting anything to work in R. Here is an example of some data:
(a(a(a)(aa(a)a)a)a)((b(b)b)b)(((cc)c)c)
I would like to split this string based on the three parent parentheses into three separate strings:
(a(a(a)(aa(a)a)a)a)
((b(b)b)b)
(((cc)c)c)
One of the challenges I am facing is the lack of a consistent structure in terms of total pairs of child parentheses within the parent parentheses, and the number of consecutive open or closed parentheses. Notice the consecutive open parentheses in the data with Bs and with Cs. This has made attempts to use regex very difficult. Also, the data within a given parent parentheses will have many common characters to other parent parentheses, so looking for all "a"s or "b"s is not possible - I fabricated this data to help people see the three parent parentheses better.
Basically I am looking for a function that identifies parent parentheses. In other words, a function that can find parentheses that are not contained within other parentheses, and return all instances of this for a given string.
Any ideas? I appreciate the help.
Here is one directly adapted from Regex Recursion with \\((?>[^()]|(?R))*\\):
s = "(a(a(a)(aa(a)a)a)a)((b(b)b)b)(((cc)c)c)"
matched <- gregexpr("\\((?>[^()]|(?R))*\\)", s, perl = T)
substring(s, matched[[1]], matched[[1]] + attr(matched[[1]], "match.length") - 1)
# [1] "(a(a(a)(aa(a)a)a)a)" "((b(b)b)b)" "(((cc)c)c)"
Assuming that the parentheses are balanced, you can try the following (this works like a PDA, a pushdown automaton, if you are familiar with the theory of computation):
str <- '(a(a(a)(aa(a)a)a)a)((b(b)b)b)(((cc)c)c)'
indices <- c(0, which(cumsum(sapply(unlist(strsplit(str, split='')),
function(x) ifelse(x == '(', 1, ifelse(x==')', -1, 0))))==0))
sapply(1:(length(indices)-1), function(i) substring(str, indices[i]+1, indices[i+1]))
# [1] "(a(a(a)(aa(a)a)a)a)" "((b(b)b)b)" "(((cc)c)c)"
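The same depth-counting idea can be wrapped in a small function. This is only a sketch: split_top_level is a hypothetical name, and it additionally assumes you want to skip any characters that sit between top-level groups.

```r
# Hypothetical helper: return each top-level parenthesised group in s
split_top_level <- function(s) {
  chars <- strsplit(s, "", fixed = TRUE)[[1]]
  step  <- (chars == "(") - (chars == ")")   # +1 on "(", -1 on ")", 0 otherwise
  depth <- cumsum(step)                      # nesting depth after each character
  # Refuse unbalanced input rather than returning misleading pieces
  stopifnot(all(depth >= 0), sum(step) == 0)
  starts <- which(chars == "(" & depth == 1) # opens that enter depth 1
  ends   <- which(chars == ")" & depth == 0) # closes that return to depth 0
  substring(s, starts, ends)
}

groups <- split_top_level("(a(a(a)(aa(a)a)a)a)((b(b)b)b)(((cc)c)c)")
groups
# [1] "(a(a(a)(aa(a)a)a)a)" "((b(b)b)b)"          "(((cc)c)c)"
```

Because the start and end positions are computed separately, text outside any parentheses, as in "x(a)y(b(c))", is dropped instead of being glued onto the next group.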

How do I remove a specific sign like a comma partially from a data set

I have a data set like this:
Quest_main=c("quest2,","quest5,","quest4,","quest12,","quest4,","quest5,quest7")
And I would like to remove the comma from for example "quest2," so that it is "quest2", but not from the "quest5,quest7". I think I have to use substr or ifelse, but I am not sure. The final result is this when I call up Quest_main:
"quest2" "quest5" "quest4" "quest12" "quest4" "quest5,quest7"
Thanks!
All you need is
gsub(",$","",Quest_main)
The $ signifies the end of a string: for full explanation, see the (long and complicated) ?regexp, or a more general introduction to regular expressions, or search for the tags [r] [regex] on Stack Overflow.
If you insist on doing it with substr() and ifelse(), you can:
nc <- nchar(Quest_main)
lastchar <- substr(Quest_main,nc,nc)
ifelse(lastchar==",",substr(Quest_main,1,nc-1),
Quest_main)
With substring and ifelse:
ifelse(substring(Quest_main,nchar(Quest_main))==',',substring(Quest_main,1,nchar(Quest_main)-1),Quest_main)
Here's an alternative approach (just for general knowledge) using negative lookahead
gsub("(,)(?!\\w)", "", Quest_main, perl = TRUE)
## [1] "quest2" "quest5" "quest4" "quest12" "quest4" "quest5,quest7"
This approach is more general in case you want to delete commas not only from end of the word, but specify other conditions too
A more general solution would be using stringi's stri_trim_right, which will work in cases where Ben's or Jealie's solutions fail, for example when there are many commas at the end of the string that you want to get rid of:
Quest_main <- c("quest2,,,," ,"quest5,quest7,,,,")
Quest_main
#[1] "quest2,,,," "quest5,quest7,,,,"
library(stringi)
stri_trim_right(Quest_main, pattern = "[^,]")
#[1] "quest2" "quest5,quest7"
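For comparison, runs of trailing commas can also be handled in base R by adding a + quantifier to the anchored pattern. This is a sketch of an alternative, not part of the stringi answer above; the third sample string is added for illustration.

```r
Quest_main <- c("quest2,,,,", "quest5,quest7,,,,", "quest4")

# ",+$" matches one or more commas only at the very end of each string
cleaned <- sub(",+$", "", Quest_main)
cleaned
# [1] "quest2"        "quest5,quest7" "quest4"
```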

R - Using grep and gsub to return more than one match in the same (character) vector element

Imagine we want to find all of the FOOs and subsequent numbers in the string below and return them as a vector (apologies for unreadability, I wanted to make the point there is no regular pattern before and after the FOOs):
xx <- "xasdrFOO1921ddjadFOO1234dakaFOO12345ndlslsFOO1643xasdf"
We can use this to find one of them (taken from 1)
gsub(".*(FOO[0-9]+).*", "\\1", xx)
[1] "FOO1643"
However, I want to return all of them, as a vector.
I've thought of a complicated way to do it using strsplit() and gregexpr(), but I feel there is a better (and easier) way.
You may be interested in regmatches:
> regmatches(xx, gregexpr("FOO[0-9]+", xx))[[1]]
[1] "FOO1921" "FOO1234" "FOO12345" "FOO1643"
xx <- "xasdrFOO1921ddjadFOO1234dakaFOO12345ndlslsFOO1643xasdf"
library(stringr)
str_extract_all(xx, "(FOO[0-9]+)")[[1]]
#[1] "FOO1921" "FOO1234" "FOO12345" "FOO1643"
This can take vectors of strings as well, and the matches will be returned as list elements, one per input string.
Slightly shorter version.
library(gsubfn)
strapplyc(xx,"FOO[0-9]*")[[1]]
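The base R regmatches() approach also works on a vector of strings, returning one list element per input. A small sketch with made-up inputs:

```r
# Hypothetical inputs: two matches, no match, one match
xs <- c("xFOO1yFOO22", "nothing here", "FOO333")

# gregexpr() finds every match per element; regmatches() extracts them as a list
hits <- regmatches(xs, gregexpr("FOO[0-9]+", xs))
lengths(hits)
# [1] 2 0 1

# Flatten to a single character vector if the grouping is not needed
all_hits <- unlist(hits)
all_hits
# [1] "FOO1"   "FOO22"  "FOO333"
```

Elements without a match yield character(0), so they vanish when the list is flattened with unlist().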
