pyparsing recursive grammar space separated list inside a comma separated list - recursion

Have the following string that I'd like to parse:
((K00134,K00150) K00927,K11389) (K00234,K00235)
each step is separated by a space and alternation is represented by a comma. I'm stuck in the first part of the string where there is a space inside the brackets. The desired output I'm looking for is:
[[['K00134', 'K00150'], 'K00927'], 'K11389'], ['K00234', 'K00235']
What I've got so far is a basic setup to do recursive parsing, but I'm stumped on how to code in a space separated list into the bracket expression
from pyparsing import Word, Literal, Combine, nums, \
Suppress, delimitedList, Group, Forward, ZeroOrMore
ortholog = Combine(Literal('K') + Word(nums, exact=5))
exp = Forward()
ortholog_group = Suppress('(') + Group(delimitedList(ortholog)) + Suppress(')')
atom = ortholog | ortholog_group | Group(Suppress('(') + exp + Suppress(')'))
exp <<= atom + ZeroOrMore(exp)

You are on the right track, but I think you only need one place where you include grouping with ()'s, not two.
import pyparsing as pp
LPAR,RPAR = map(pp.Suppress, "()")
ortholog = pp.Combine('K' + pp.Word(pp.nums, exact=5))
ortholog_group = pp.Forward()
ortholog_group <<= pp.Group(LPAR + pp.OneOrMore(ortholog_group | pp.delimitedList(ortholog)) + RPAR)
expr = pp.OneOrMore(ortholog_group)
tests = """\
((K00134,K00150) K00927,K11389) (K00234,K00235)
"""
expr.runTests(tests)
gives:
((K00134,K00150) K00927,K11389) (K00234,K00235)
[[['K00134', 'K00150'], 'K00927', 'K11389'], ['K00234', 'K00235']]
[0]:
[['K00134', 'K00150'], 'K00927', 'K11389']
[0]:
['K00134', 'K00150']
[1]:
K00927
[2]:
K11389
[1]:
['K00234', 'K00235']
This is not exactly what you said you were looking for:
you wanted: [[['K00134', 'K00150'], 'K00927'], 'K11389'], ['K00234', 'K00235']
I output : [[['K00134', 'K00150'], 'K00927', 'K11389'], ['K00234', 'K00235']]
I'm not sure why there is grouping in your desired output around the space-separated part (K00134,K00150) K00927. Is this your intention or a typo? If intentional, you'll need to rework the definition of ortholog_group, something that will do a delimited list of space-delimited groups in addition to the grouping at parens. The closest I could get was this:
[[[[['K00134', 'K00150']], 'K00927'], ['K11389']], [['K00234', 'K00235']]]
which required some shenanigans to group on spaces, but not group bare orthologs when grouped with other groups. Here is what it looked like:
ortholog_group <<= pp.Group(LPAR + pp.delimitedList(pp.Group(ortholog_group*(1,) & ortholog*(0,))) + RPAR) | pp.delimitedList(ortholog)
The & operator in combination with the repetition operators gives the space-delimited grouping (*(1,) is equivalent to OneOrMore, *(0,) with ZeroOrMore, but also supports *(10,) for "10 or more", or *(3,5) for "at least 3 and no more than 5"). This too is not quite exactly what you asked for, but may get you closer if indeed you need to group the space-delimited bits.
But I must say that grouping on spaces is ambiguous - or at least confusing. Should "(A,B) C D" be [[A,B],C,D] or [[A,B],C],[D] or [[A,B],[C,D]]? I think, if possible, you should permit comma-delimited lists, and perhaps space-delimited also, but require the ()'s when items should be grouped.

Related

jinja2 variable inside a variable

I am passing a dictionary to a template.
dict road_len = {"US_NEWYORK_MAIN":24,
"US_BOSTON_WALL":18,
"FRANCE_PARIS_RUE":9,
"MEXICO_CABOS_STILL":8}
file_handle = output.txt
env.globals.update(country = "MEXICO")
env.globals.update(city = "CABOS")
env.globals.update(street = "STILL")
file_handle.write(env.get_template(template.txt).render(road_len=road_len)))
template.txt
This is a road length is: {{road_len["{{country}}_{{city}}_{{street}}"]}}
Expected output.txt
This is a road length is: 8
But this does not work as nested variable substitution are not allowed.
You never nest Jinja {{..}} markers. You're already inside a template context, so if you want to use the value of a variable you just use the variable. It helps if you're familiar with Python, because you can use most of Python's string formatting constructs.
So you could write:
This is a road length is: {{road_len["%s_%s_%s" % (country, city, street)]}}
Or:
This is a road length is: {{road_len[country + "_" + city + "_" + street]}}
Or:
This is a road length is: {{road_len["{}_{}_{}".format(country, city, street)]}}

regex to replace text *outside* of {}

I want to use regex to replace commands or tags around strings. My use case is converting LaTeX commands to bookdown commands, which means doing things like replacing \citep{*} with [#*], \ref{*} with \#ref(*), etc. However, lets stick to the generalized question:
Given a string <begin>somestring<end> where <begin> and <end> are known and somestring is an arbitrary sequence of characters, can we use regex to susbstitute <newbegin> and <newend> to get the string <newbegin>somestring<newend>?
For example, consider the LaTeX command \citep{bonobo2017}, which I want to convert to [#bonobo2017]. For this example:
<begin> = \citep{
somestring = bonobo2017
<end> = }
<newbegin> = [#
<newend> = ]
This question is basically the inverse of this question.
I'm hoping for an R or notepad++ solution.
Additional Examples
Convert \citet{bonobo2017} to #bonobo2017
Convert \ref{myfigure} to \#ref(myfigure)
Convert \section{Some title} to # Some title
Convert \emph{something important} to *something important*
I'm looking for a template regex that I can fill in my <begin>, <end>, <newbegin> and <newend> on a case-by-case basis.
You can try something like this with dplyr + stringr:
string = "\\citep{bonobo2017}"
begin = "\\citep{"
somestring = "bonobo2017"
end = "}"
newbegin = "[#"
newend = "]"
library(stringr)
library(dplyr)
string %>%
str_extract(paste0("(?<=\\Q", begin, "\\E)\\w+(?=\\Q", end, "\\E)")) %>%
paste0(newbegin, ., newend)
or:
string %>%
str_replace_all(paste0("\\Q", begin, "\\E|\\Q", end, "\\E"), "") %>%
paste0(newbegin, ., newend)
You can also make it a function for convenience:
convertLatex = function(string, BEGIN, END, NEWBEGIN, NEWEND){
string %>%
str_replace_all(paste0("\\Q", BEGIN, "\\E|\\Q", END, "\\E"), "") %>%
paste0(NEWBEGIN, ., NEWEND)
}
convertLatex(string, begin, end, newbegin, newend)
# [1] "[#bonobo2017]"
Notes:
Notice that I manually added an additional \ to "\\citep{bonobo2017}", this is because raw strings don't exist in R(I hope they do exist), so a single \ would be treated as an escape character. I need another \ to escape the first \.
The regex in str_extract uses positive lookbehind and positve lookahead to extract the somestring in between begin and end.
str_replace takes another approach of removing begin and end from string.
The "\\Q", "\\E" pair in the regex means "Backslash all nonalphanumeric characters" and "\\E" ends the expression. This is especially useful in your case since you likely have special characters in your Latex command. This expression automatically escapes them for you.

r check if string contains special characters

I am checking if a string contains any special characters. This is what I have, and its not working,
if(grepl('^\\[:punct:]', val))
So if anybody can tell me what I am missing, that will be helpful.
Special characters
~ ` ! ## $ % ^ & * | : ; , ." |
As #thelatemail pointed out in the comments you can use:
grepl('[^[:punct:]]', val)
which will result in TRUE or FALSE for each value in your vector. You can add sum() to the beginning of the statement to get the total number of these cases.
You can also use:
grepl('[^[:alnum:]]', val)
which will check for any value that is not a letter or a number.

String recognition in idl

I have the following strings:
F:\Sheyenne\ROI\SWIR32_subset\SWIR32_2005210_East_A.dat
F:\Sheyenne\ROI\SWIR32_subset\SWIR32_2005210_Froemke-Hoy.dat
and from each I want to extract the three variables, 1. SWIR32 2. the date and 3. the text following the date. I want to automate this process for about 200 files, so individually selecting the locations won't exactly work for me.
so I want:
variable1=SWIR32
variable2=2005210
variable3=East_A
variable4=SWIR32
variable5=2005210
variable6=Froemke-Hoy
I am going to be using these to add titles to graphs later on, but since the position of the text in each string varies I am unsure how to do this using strmid
I think you want to use a combination of STRPOS and STRSPLIT. Something like the following:
s = ['F:\Sheyenne\ROI\SWIR32_subset\SWIR32_2005210_East_A.dat', $
'F:\Sheyenne\ROI\SWIR32_subset\SWIR32_2005210_Froemke-Hoy.dat']
name = STRARR(s.length)
date = name
txt = name
foreach sub, s, i do begin
sub = STRMID(sub, 1+STRPOS(sub, '\', /REVERSE_SEARCH))
parts = STRSPLIT(sub, '_', /EXTRACT)
name[i] = parts[0]
date[i] = parts[1]
txt[i] = STRJOIN(parts[2:*], '_')
endforeach
You could also do this with a regular expression (using just STRSPLIT) but regular expressions tend to be complicated and error prone.
Hope this helps!

grep on two strings

I'm working to grab two different elements in a string.
The string look like this,
str <- c('a_abc', 'b_abc', 'abc', 'z_zxy', 'x_zxy', 'zxy')
I have tried with the different options in ?grep, but I can't get it right, 'm doing something like this,
grep('[_abc]:[_zxy]',str, value = TRUE)
and what I would like is,
[1] "a_abc" "b_abc" "z_zxy" "x_zxy"
any help would be appreciated.
Use normal parentheses (, not the square brackets [
grep('_(abc|zxy)',str, value = TRUE)
[1] "a_abc" "b_abc" "z_zxy" "x_zxy"
To make the grep a bit more flexible, you could do something like:
grep('_.{3}$',str, value = TRUE)
Which will match an underscore _ followed by any character . three times {3} followed immediately by the end of the string $
this should work: grep('_abc|_zxy', str, value=T)
X|Y matches when either X matches or Y matches
In this case just doing:
str[grep("_",str)]
will work... is it more complicated in your specific case?

Resources