R: parse nested parentheses - r

I would like to parse nested parentheses using R. No, this is not JSON. I have seen examples using Perl, PHP, and Python, but I am having trouble getting anything to work in R. Here is an example of some data:
(a(a(a)(aa(a)a)a)a)((b(b)b)b)(((cc)c)c)
I would like to split this string based on the three parent parentheses into three separate strings:
(a(a(a)(aa(a)a)a)a)
((b(b)b)b)
(((cc)c)c)
One of the challenges I am facing is the lack of a consistent structure in terms of total pairs of child parentheses within the parent parentheses, and the number of consecutive open or closed parentheses. Notice the consecutive open parentheses in the data with Bs and with Cs. This has made attempts to use regex very difficult. Also, the data within a given parent parentheses will have many common characters to other parent parentheses, so looking for all "a"s or "b"s is not possible - I fabricated this data to help people see the three parent parentheses better.
Basically I am looking for a function that identifies parent parentheses. In other words, a function that can find parentheses that are not contained within other parentheses, and return all instances of this for a given string.
Any ideas? I appreciate the help.

Here is one directly adapted from Regex Recursion with \\((?>[^()]|(?R))*\\):
s = "(a(a(a)(aa(a)a)a)a)((b(b)b)b)(((cc)c)c)"
matched <- gregexpr("\\((?>[^()]|(?R))*\\)", s, perl = TRUE)
substring(s, matched[[1]], matched[[1]] + attr(matched[[1]], "match.length") - 1)
# [1] "(a(a(a)(aa(a)a)a)a)" "((b(b)b)b)" "(((cc)c)c)"

Assuming that the parentheses are balanced, you can try the following (this works like a PDA, a pushdown automaton, if you are familiar with the theory of computation):
str <- '(a(a(a)(aa(a)a)a)a)((b(b)b)b)(((cc)c)c)'
indices <- c(0, which(cumsum(sapply(unlist(strsplit(str, split='')),
function(x) ifelse(x == '(', 1, ifelse(x==')', -1, 0))))==0))
sapply(1:(length(indices)-1), function(i) substring(str, indices[i]+1, indices[i+1]))
# [1] "(a(a(a)(aa(a)a)a)a)" "((b(b)b)b)" "(((cc)c)c)"
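The depth-counting idea in the answer above can also be written with explicit start and end positions. This is only a minimal sketch, assuming the parentheses are balanced; the function name split_top_level is made up for illustration. Unlike the version above, it skips any characters that sit outside all parentheses.

```r
# Hypothetical helper (name chosen for illustration): returns the top-level
# parenthesised groups of a string, assuming the parentheses are balanced.
split_top_level <- function(s) {
  chars  <- strsplit(s, "")[[1]]
  opens  <- chars == "("
  closes <- chars == ")"
  # nesting depth after each character: +1 for "(", -1 for ")"
  depth  <- cumsum(opens - closes)
  starts <- which(opens & depth == 1)   # "(" that opens a top-level group
  ends   <- which(closes & depth == 0)  # ")" that closes a top-level group
  substring(s, starts, ends)
}

groups <- split_top_level("(a(a(a)(aa(a)a)a)a)((b(b)b)b)(((cc)c)c)")
print(groups)
```

Tracking starts and ends separately means a string such as "x(a)y(b(c))" would yield only the two parenthesised groups.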

difference between <- and = in R with an example [duplicate]

This question already has answers here:
What are the differences between "=" and "<-" assignment operators?
I was wondering if there is a technical difference between the assignment operators "=" and "<-" in R. So, does it make any difference if I use:
Example 1: a = 1 or a <- 1
Example 2: a = c(1:20) or a <- c(1:20)
Thanks for your help
Sven
Yes there is. This is what the help page of '=' says:
"The operators <- and = assign into the environment in which they are evaluated. The operator <- can be used anywhere, whereas the operator = is only allowed at the top level (e.g., in the complete expression typed at the command prompt) or as one of the subexpressions in a braced list of expressions."
With "can be used" the help file means assigning an object here. In a function call you can't assign an object with = because = means assigning arguments there.
Basically, if you use <- then you assign a variable that you will be able to use in your current environment. For example, consider:
matrix(1,nrow=2)
This just makes a 2 row matrix. Now consider:
matrix(1,nrow<-2)
This also gives you a two-row matrix, but now we also have an object called nrow which evaluates to 2! What happened is that in the second call we didn't pass 2 to the argument nrow; we assigned 2 to an object called nrow and sent that value to the second argument of matrix, which happens to be nrow.
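To see the side effect described above without cluttering the workspace, the call can be wrapped in local(). This is a small sketch; the names res, m, and m2 are made up for illustration.

```r
# Run the call in a throwaway scope so the stray variable does not leak out
res <- local({
  m <- matrix(1, nrow <- 2)   # creates a variable `nrow` in this local scope
  list(dims = dim(m), nrow_value = nrow)
})
print(res$dims)        # dimensions of the matrix that was built
print(res$nrow_value)  # the value that `nrow <- 2` left behind

# An assignment is matched by position, not by name: here the value 3 is
# passed to the *first* argument of matrix(), which is `data`, not `nrow`
m2 <- local(matrix(nrow <- 3))
print(dim(m2))
```

The second call shows why using <- inside a function call is fragile: the name on the left of <- has no influence on which argument receives the value.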
Edit:
As for the edited questions. Both are the same. The use of = or <- can cause a lot of discussion as to which one is best. Many style guides advocate <- and I agree with that, but do keep spaces around <- assignments or they can become quite hard to interpret. If you don't use spaces (you should, except on twitter), I prefer =, and never use ->!
But really it doesn't matter what you use as long as you are consistent in your choice. Using = on one line and <- on the next results in very ugly code.

In R: Searching a column for different string patterns and replace all of them

I have a column with different game titles. In order to collect them, I have to change all of them to a single, consistent spelling.
For example, I have:
str_replace_all(FavouriteGames_DF$FavGame1, pattern = c("SKYRIM|
THE ELDER SCROLLS V: SKYRIM|
ELDER SCROLLS SKYRIM|
ELDER SCROLLS V SKYRIM|
SKYRIM (BETHESDA 2011)|
SKYRIM (MODDED)|
THE ELDERSCROLLS V: SKYRIM"),
replacement = "THE ELDER SCROLLS 5: SKYRIM")
The problem is, that str_replace_all is kinda bad for this, as it can't just search for any matching pattern and replace it with the replacement, but apparently has to go through it in order and I can't predict where in the DataSet which term will arrive.
I do not want the function to replace incomplete matches (i.e., turning "THE ELDERSCROLLS V: SKYRIM" into "THE ELDERSCROLLS V: THE ELDER SCROLLS 5: SKYRIM").
Putting the patterns into pattern = c("1", "2") it will not work at all, because it can only check for the patterns in order.
I also tried the FindReplace function from the DataCombine package, but that one doesn't seem to work either for reasons I do not quite understand (claiming I am missing dimensions and the vector not being a character vector). Anyway, I want to use as few packages as possible and would prefer to stay in the tidyverse.
Does anybody have a good solution? I do not want to search for each term on its own, as I have to do this a lot and I already have to do it for 6 columns, since mutate_at doesn't seem to work with str_replace.
Thanks!
My comment as an answer:
FavouriteGames_DF[FavouriteGames_DF$FavGame1 %in% pattern, ]$FavGame1 <- replacement
A handy solution would be to just use "SKYRIM" as the pattern, as it is the common word in all the patterns you specified. You could define a very simple function to check for that word and then use sapply to apply it to each element of the column you want to check:
check <- function(x) {
  # split one title into words and look for the keyword
  y <- unlist(strsplit(x, " "))
  if ("SKYRIM" %in% y)
    return("THE ELDER SCROLLS 5: SKYRIM")
  else
    return(x)
}
# apply element-wise; lapply on the data frame would pass the whole column at once
FavouriteGames_DF$FavGame1 <- sapply(FavouriteGames_DF$FavGame1, check, USE.NAMES = FALSE)
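Since the question mentions six columns and rules out partial matches, the exact-match idea with %in% can be applied to several columns at once. The following is a rough sketch in base R; the data frame games, its contents, and the second column FavGame2 are invented for illustration.

```r
# Known spellings to unify (exact matches only, so no regex escaping is needed)
variants <- c("SKYRIM",
              "THE ELDER SCROLLS V: SKYRIM",
              "ELDER SCROLLS SKYRIM",
              "ELDER SCROLLS V SKYRIM",
              "SKYRIM (BETHESDA 2011)",
              "SKYRIM (MODDED)",
              "THE ELDERSCROLLS V: SKYRIM")
canonical <- "THE ELDER SCROLLS 5: SKYRIM"

# Toy data standing in for FavouriteGames_DF (assumed character columns)
games <- data.frame(
  FavGame1 = c("SKYRIM (MODDED)", "THE ELDERSCROLLS V: SKYRIM", "PORTAL 2"),
  FavGame2 = c("TETRIS", "SKYRIM", "ELDER SCROLLS SKYRIM"),
  stringsAsFactors = FALSE
)

# Replace whole-cell matches in every listed column
cols <- c("FavGame1", "FavGame2")
games[cols] <- lapply(games[cols], function(col) {
  col[col %in% variants] <- canonical
  col
})
print(games)
```

Because %in% compares whole strings, a cell is replaced only when it equals one of the listed spellings, and the order of the spellings does not matter.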

Ignore last "/" in R regex

Given the string "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/", I need to generate a regex filter so that it ignores the last char if it is a "/".
I tried the following regex "(http:////)?compras\\.dados\\.gov\\.br.*\\?.*(?<!//)" as of regexr.com/4om61, but it doesn't work when I run it in R as:
regex_exp_R <- "(http:////)?compras\\.dados\\.gov\\.br.*\\?.*(?<!//)"
grep(regex_exp_R, "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/", perl = T, value = T)
I need this to work in pure regex and grep function, without using any string R package.
Thank you.
Simplified Case:
After important contributions of you all, one last issue remains.
Because I will use the regex as an input to another function, the solution must work with pure regex and grep.
The remaining point is a very basic one: given the strings "a1bc/" or "a1bc", the regex must return "a1bc". Building on suggestions I received, I tried
grep(".*[^//]" ,"a1bc/", perl = T, value = T), but still get "a1bc/" instead of "a1bc". Any hints? Thank you.
If you want to return the string without the last / you can do this several ways. Below are a couple options using base R:
Using a back-reference in gsub() (sub() would work too here):
gsub("(.*?)/*$", "\\1", x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"
# or, adapting your original pattern
gsub("((http:////)?compras\\.dados\\.gov\\.br.*\\?.*?)/*$", "\\1", x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"
By position using ifelse() and substr() (this will probably be a little bit faster if scaling matters):
ifelse(substr(x, nchar(x), nchar(x)) == "/", substr(x, 1, nchar(x)-1), x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"
Data:
x <- "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/"
Use sub to remove a trailing /:
x <- c("a1bc/", "a2bc")
sub("/$", "", x)
This changes nothing on a string that does not end in /.
As others have pointed out, grep does not modify strings. It returns a numeric vector of indices of the matched strings or a vector of the (unmodified) matched items. It's usually used to subset a character vector.
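A quick side-by-side may help here. This is just an illustrative sketch; the variable names idx, kept, and trimmed are made up.

```r
x <- c("a1bc/", "a2bc")

# grep only reports which elements match the pattern ...
idx  <- grep("/$", x)                  # positions of the matching elements
kept <- grep("/$", x, value = TRUE)    # the matching elements, unmodified

# ... whereas sub returns a modified copy of every element
trimmed <- sub("/$", "", x)

print(idx)
print(kept)
print(trimmed)
```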
You can use a negative look-behind at the end to ensure it doesn't end with the character you don't want (in this case, a /). The regex would then be:
.+(?<!\/)
You can view it here with your three input examples: https://regex101.com/r/XB9f7K/1/. If you only want it to match urls, then you would change the .+ part at the beginning to your url regex.
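If extracting (rather than filtering) is acceptable, the look-behind pattern can be tried in base R with regexpr() and regmatches(), which return the matched part of each string. This is a sketch under that assumption; note that look-behind needs perl = TRUE and that the slash does not have to be escaped in an R string.

```r
x <- c("http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/",
       "a1bc/",
       "a1bc")

# .+ is greedy, then backtracks until the match no longer ends in "/"
m <- regexpr(".+(?<!/)", x, perl = TRUE)
trimmed <- regmatches(x, m)
print(trimmed)
```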
How about trying gsub("(.*?)/+$","\\1",s)?

Data Frame containing hyphens using R

I have created a list (Based on items in a column) in order to subset my dataset into smaller datasets relating to a particular variable. This list contains strings with hyphens in them -.
dim.list <- c('Age_CareContactDate-Gender', 'Age_CareContactDate-Group',
'Age_ServiceReferralReceivedDate-Gender',
'Age_ServiceReferralReceivedDate-Gender-0-18',
'Age_ServiceReferralReceivedDate-Group',
'Age_ServiceReferralReceivedDate-Group-ReferralReason')
I have then written some code to loop through each item in this list subsetting my main data.
for (i in dim.list) {assign(paste("df1.",i,sep=""),df[df$Dimension==i,])}
This works fine, however when I come to aggregate this in order to get some summary statistics I can't reference the dataset as R stops reading after the hyphen (I assume that the hyphen is some special character)
If I use a different list without hyphens e.g.
dim.list.abr <- c('ACCD_Gen','ACCD_Grp',
'ASRRD_Gen',
'ASRRD_Gen_0_18',
'ASRRD_Grp',
'ASRRD_Grp_RefRsn')
When my for loop above executes I get 6 data.frames with no observations.
Why is this happening?
Comment to answer:
Hyphens aren't allowed in standard variable names. Think of a simple example: a-b. Is it a variable name with a hyphen or is it a minus b? The R interpreter assumes a minus b, because it doesn't require spaces for binary operations. You can force non-standard names to work using backticks, e.g.,
# terribly confusing names:
`a-b` <- 5
`x+y` <- 10
`mean(x^2)` <- "this is awful"
but you're better off following the rules and using standard names without special characters like + - * / % $ # ! & | ^ ( [ ' " in them. At ?Quotes there is a section on Names and Identifiers:
Identifiers consist of a sequence of letters, digits, the period (.) and the underscore. They must not start with a digit nor underscore, nor with a period followed by a digit. Reserved words are not valid identifiers.
So that's why you're getting an error, but what you're doing isn't good practice. I completely agree with Axeman's comments. Use split to divide up your data frame into a list. And keep it in a list rather than use assign, it will be much easier to loop over or use lapply with that way. You might want to read my answer at How to make a list of data frames for a lot of discussion and examples.
Regarding your comment "dim.list is not the complete set of unique entries in the Dimensions column", that just means you need to subset before you split:
nice_list = df[df$Dimension %in% dim.list, ]
nice_list = split(nice_list, nice_list$Dimension)
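As a small illustration of why the list is easier to work with, here is a sketch on made-up data; the toy data frame, its Value column, and the shortened dimension names are assumptions for the example only.

```r
# Toy data standing in for the real data frame
df <- data.frame(
  Dimension = c("Age-Gender", "Age-Gender", "Age-Group", "Other"),
  Value     = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)
dim.list <- c("Age-Gender", "Age-Group")

# Subset first, then split into a named list of data frames
nice_list <- df[df$Dimension %in% dim.list, ]
nice_list <- split(nice_list, nice_list$Dimension)

# List names are plain strings, so hyphens are not a problem with [[ ]]
print(nice_list[["Age-Gender"]])

# Summaries over every piece at once, with no assign() or get() needed
totals <- sapply(nice_list, function(d) sum(d$Value))
print(totals)
```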

R extract text until, and not including x

I have a bunch of strings of mixed length, but all with a year embedded. I am trying to extract just the text part, that is, everything until the number starts, and am having problems with lookahead assertions, assuming that is the proper way to do such extractions.
Here is what I have (returns no match):
grep("\\b.(?=\\d{4})", "foo_1234_bar", perl = TRUE, value = TRUE)
In the example I am looking to extract just foo but there may be several, and of mixed lengths, separated by _ before the year portion.
Look-aheads may be overkill here. Use the underscore and the 4 digits as the structure, combined with a non-greedy quantifier to prevent the 'dot' from gobbling up everything:
/(.+?)_\d{4}/
-first matching group ($1) holds 'foo'
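In R, one way to get at that first group is sub() with a back-reference. This is a sketch, not the only way to do it; the second test string and the variable name text_part are made up, and the pattern is anchored with ^ and .*$ so that the whole string is replaced by the captured text.

```r
x <- c("foo_1234_bar", "foo_bar_2001_baz")

# (.+?) lazily captures everything before the first "_" followed by 4 digits
text_part <- sub("^(.+?)_\\d{4}.*$", "\\1", x, perl = TRUE)
print(text_part)
```

Strings that contain no underscore followed by four digits do not match and would be returned unchanged.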
This will grab everything up until the first digit
x <- c("asdfas_1987asdf", "asd_das_12")
regmatches(x, regexpr("^[^[:digit:]]*", x))
#[1] "asdfas_" "asd_das_"
Another approach (I often find that strsplit is faster than regex searching, but not always, and this does still use a slight bit of regex):
x <- c("asdfas_1987asdf", "asd_das_12") #shamelessly stealing Dason's example
sapply(strsplit(x, "[0-9]+"), "[[", 1)
