Julia: regex match lines in a file like grep

I would like to see a Julia code snippet that reads a file and returns the lines (as strings) that match a regular expression.
I welcome multiple techniques, but output should be equivalent to the following:
$> grep -E '^AB[AJ].*TO' 'webster-unabridged-dictionary-1913.txt'
ABACTOR
ABATOR
ABATTOIR
ABJURATORY
I'm using GNU grep 3.1 here, and the first line of each entry in the file is the all caps word on its own.

You could also use the filter function to do this in one line.
filter(line -> ismatch(r"^AB[AJ].*TO",line),readlines(open("webster-unabridged-dictionary-1913.txt")))
(this is Julia 0.6 syntax; in Julia 1.0 and later, replace ismatch by occursin)
filter applies a function returning a Boolean to each element of an array and returns only those elements for which the function returns true. The function in this case is the anonymous function line -> ismatch(r"^AB[AJ].*TO",line), which basically says to call each element of the array being filtered (each line, in this case) line.
I think this might not be the best solution for very large files as the entire file needs to be loaded into memory before filtering, but for this example it seems to be just as fast as the for loop using eachline. Another difference is that this solution returns the results as an array rather than printing each of them, which depending on what you want to do with the matches might be a good or bad thing.
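For comparison, the memory trade-off described above can be sketched in Python. This is only an illustration in a different language, not the Julia code from this answer; the function names grep_eager and grep_lazy are made up for the example. The first reads every line before filtering (like filter over readlines), the second streams the file one line at a time (like the eachline loop).

```python
import re

# Same pattern as the grep example in the question.
PATTERN = re.compile(r"^AB[AJ].*TO")


def grep_eager(path, pattern=PATTERN):
    """Load all lines into memory first, then filter them."""
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f.readlines()]
    return [line for line in lines if pattern.search(line)]


def grep_lazy(path, pattern=PATTERN):
    """Yield matching lines one at a time; only one line is held at once."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if pattern.search(line):
                yield line
```

Both return the same matches; the lazy version only matters once the file is too large to hold in memory comfortably.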

My favored solution uses a simple loop and is very easy to understand.
julia> open("webster-unabridged-dictionary-1913.txt") do f
           for i in eachline(f)
               if ismatch(r"^AB[AJ].*TO", i) println(i) end
           end
       end
ABACTOR
ABATOR
ABATTOIR
ABJURATORY
notes
Lines with tab separations have the tabs preserved (no literal output of '\t')
my source file in this example has the dictionary words in all caps alone on one line above the definition; the complete line is returned.
the file I/O operation is wrapped in a do block syntax structure, which expresses an anonymous function more conveniently than the x -> f(x) lambda syntax for multi-line functions. This is particularly expressive with the file open() command, which is defined to run a try-finally-close operation when called with a function as an argument.
Julia docs: Strings/Regular Expressions
regex objects take the form r"<regex_literal_here>"
the regex itself is a string
based on the PCRE (Perl Compatible Regular Expressions) library
matches become regex match objects
example
julia> reg = r"^AB[AJ].*TO";
julia> typeof(reg)
Regex
julia> test = match(reg, "ABJURATORY")
RegexMatch("ABJURATO")
julia> typeof(test)
RegexMatch

Typing ; at the start of a line switches Julia's REPL to shell mode for running command-line commands, so this works in the REPL:
;grep -E '^AB[AJ].*TO' 'webster-unabridged-dictionary-1913.txt'

Related

Julia: Piping standard input to process from string

In my Julia code, I want to call various external commands that get data from standard input and produce output on standard output. I'd like to store data in strings and have them read and written to these processes. For definiteness, let's say the process is tr [a-z] [A-Z]. My wrapper would be
function toupper(string)
    fn, fh = mktemp()
    print(fh, string)
    close(fh)
    result = pipeline(fn, `tr [a-z] [A-Z]`) |> readstring
    rm(fn)
    result
end
(this is Julia 0.6 syntax; in later versions, replace readstring by io->read(io,String))
I would like a cleaner way of doing this; ideally, a command printer(string) that creates a stream producing the contents of the string, such that the command above would be coded as
toupper(string) = pipeline(printer(string), `tr [a-z] [A-Z]`) |> readstring
(Indeed, there will be lots of commands like the above, and I'd like, for efficiency reasons, to avoid creating and deleting all these temporary files)
OK, I found it: printer(string) could just be coded as IOBuffer(string).
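The same idea (feed an in-memory string to a process on standard input, with no temporary file) can be sketched in Python for comparison. This is not the Julia solution itself, only an analogous illustration; the name to_upper is made up here, and the sketch assumes a tr executable is available on the PATH.

```python
import subprocess


def to_upper(text):
    """Send `text` to `tr a-z A-Z` on stdin and return what it writes to stdout."""
    result = subprocess.run(
        ["tr", "a-z", "A-Z"],  # assumes `tr` is installed
        input=text,            # the string is piped to the child's stdin
        capture_output=True,
        text=True,
        check=True,            # raise if tr exits with an error
    )
    return result.stdout
```

As with IOBuffer(string) in Julia, the data never touches the file system.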

How to process latex commands in R?

I work with knitr() and I wish to transform inline Latex commands like "\label" and "\ref", depending on the output target (Latex or HTML).
In order to do that, I need to (programmatically) generate valid R strings that correctly represent the backslash: for example "\label" should become "\\label". The goal would be to replace all backslashes in a text fragment with double-backslashes.
But it seems that I cannot even read these strings, let alone process them. If I define:
okstr <- function(str) "do something"
then when I call
okstr("\label")
I directly get the error "unrecognized escape sequence"
(of course, as \l is not a valid escape sequence).
So my question is: does anybody know a way to read strings in R without using the escaping mechanism?
Yes, I know I could do it manually, but that's the point: I need to do it programmatically.
There are many questions that are close to this one, and I have spent some time browsing, but I have found none that yields a workable solution for this.
Best regards.
Inside R code, you need to adhere to R’s syntactic conventions. And since \ in strings is used as an escape character, it needs to form a valid escape sequence (and \l isn’t a valid escape sequence in R).
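There is simply no way around this in an ordinary string literal. (R 4.0.0 and later also accept raw string literals such as r"(\label)", in which the backslash is taken literally.)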
But if you are reading the string from elsewhere, e.g. using readLines, scan or any of the other file reading functions, you are already getting the correct string, and no handling is necessary.
Alternatively, if you absolutely want to write LaTeX-like commands in literal strings inside R, just use a different character for \; for instance, +. Just make sure that your function correctly handles it everywhere, and that you keep a way of getting a literal + back. Here’s a suggestion:
okstr("+label{1 ++ 2}")
The implementation of okstr then needs to replace single + by \, and double ++ by + (making the above result in \label{1 + 2}). But consider in which order this needs to happen, and how you’d like to treat more complex cases; for instance, what should the following yield: okstr("1 +++label")?
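One possible answer to the ordering question is to scan the string once from left to right and, at each +, prefer the two-character token ++ over a single +. The sketch below shows that rule in Python rather than R, purely to illustrate the logic; it is one choice among several, and under it okstr("1 +++label") yields 1 +\label.

```python
import re


def okstr(text):
    """Replace '++' by a literal '+' and a lone '+' by a backslash.

    The alternation tries '++' first, so a run of plus signs is consumed
    pairwise from the left and any leftover single '+' becomes '\\'.
    """
    return re.sub(
        r"\+\+|\+",
        lambda m: "+" if m.group() == "++" else "\\",
        text,
    )
```

Doing the two replacements as separate passes would give a different result, because the first pass would change the plus signs the second pass sees; the single scan avoids that interaction.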

Julia: docstrings and LaTeX

Julia has docstrings capabilities, which are documented here https://docs.julialang.org/en/stable/manual/documentation/. I'm under the impression that it has support for LaTeX code, but I'm not sure if the intention is that the LaTeX code should look like code or like an interpretation. In the following, the LaTeX code is garbled somewhat (see rho, for instance) and not interpreted (rho does not look like ρ). Am I doing something wrong?
Is there a way to get LaTeX code look interpreted?
What I mean by interpreted is something like what they do at https://math.stackexchange.com/.
The documentation says that LaTeX code should be wrapped around double back-quotes and that Greek letters should be typed as ρ rather than \rho. But that rather defeats the point of being able to include LaTeX code, doesn't it?
Note: Version 0.5.2 run in Juno/Atom console.
"""
Module blabla
The objective function is:
``\max \mathbb{E}_0 \int_0^{\infty} e^{-\rho t} F(x_t) dt``
"""
module blabla
end
If I then execute the module and query I get this:
With triple quotes, the dollar signs disappear, but the formula is printed on a dark background:
EDIT Follow-up to David P. Sanders' suggestion to use the Documenter.jl package.
using Documenter
doc"""
Module blabla
The objective function is:
$\max \mathbb{E}_0 \int_0^{\infty} e^{-\rho t} F(x_t) dt$
"""
module blabla
end
Gives the following: the LaTeX code appears to print correctly, but it's not interpreted (ρ is displayed as \rho). I followed suggestions in: https://juliadocs.github.io/Documenter.jl/stable/man/latex.html to
Rendering LaTeX code as actual equations has to be supported by whichever software renders your docstrings. So, the short answer to why you're not seeing any rendered equations in Juno is that LaTeX rendering is currently not supported in Juno (as Matt B. pointed out, there's an open issue for that).
The doc"" string literal / string macro is there to get around another issue. Backslashes and dollar signs normally have a special meaning in string literals -- escape sequences and variable interpolation, respectively (e.g. \n gets replaced by a newline character, $(variable) inserts the value of variable into the string). This, of course, clashes with the ordinary LaTeX commands and delimiters (e.g. \frac, $...$). So, to actually have backslashes and dollar signs in a string you need to escape them all with backslashes, e.g.:
julia> "\$\\frac{x}{y}\$" |> print
$\frac{x}{y}$
Not doing that will either give an error:
julia> "$\frac{x}{y}$" |> print
ERROR: syntax: invalid interpolation syntax: "$\"
or invalid characters in the resulting strings:
julia> "``\e^x``" |> print
``^x``
Having to escape everything all the time would, of course, be annoying when writing docstrings. So, to get around this, as David pointed out, you can use the doc string macro / non-standard string literal. In a doc literal all standard escape sequences are ignored, so an unescaped LaTeX string doesn't cause any issues:
julia> doc"$\frac{x}{y}$" |> print
$$
\frac{x}{y}
$$
Note: doc also parses the Markdown in the string and actually returns a Base.Markdown.MD object, not a string, which is why the printed string is a bit different from the input.
Finally, you can then use these doc-literals as normal docstrings, but you can then freely use LaTeX syntax without having to worry about escaping everything:
doc"""
$\frac{x}{y}$
"""
function foo end
This is also documented in Documenter's manual, although it is not actually specific to Documenter.
Double backticks vs dollar signs. The preferred way to mark LaTeX in Julia docstrings or documentation is by using double backticks or ```math blocks, as documented in the manual. Dollar signs are supported for backwards compatibility.
Note: Documenter's manual and the show methods for Markdown objects in Julia should be updated to reflect this.
You can use
doc"""
...
"""
This is a "non-standard string literal" used by the Documenter.jl package; see https://docs.julialang.org/en/stable/manual/strings/#non-standard-string-literals.

Shell command to parse/print stream

Specific question
What is a shell command to turn strings like this
class A(B, C):
into sets of strings like this
B -> A;
C -> A;
Where A, B, and C are all of the form \w+ and, where I've written "B, C" I really mean any number of terms separated by commas and whitespace. I.e. "B, C" could equally be "B" or "B, C, D, E".
Big picture
I'm visualizing the class hierarchy of a Python project. I'm looking into a directory for all .py files, grepping for class declarations and then converting them to DOT format. So far I've used find and grep to get a list of lines. I've done what is above in a small python script. If possible I'd like to use just the standard unix toolchain instead. Ideally I'd like to find another composable tool to pipe into and out of and complete the chain.
You want primitive? This sed script should work on every UNIX since V7 (but I haven't tested it on anything really old so be careful). Run it as sed -n -f scriptfile infile > outfile
: loop
/^class [A-Za-z0-9_][A-Za-z0-9_]*(\([A-Za-z0-9_][A-Za-z0-9_]*, *\)*[A-Za-z0-9_][A-Za-z0-9_]*):$/{
h
s/^class \([A-Za-z0-9_][A-Za-z0-9_]*\)(\([A-Za-z0-9_][A-Za-z0-9_]*\)[,)].*/\2 -> \1;/
p
g
s/\(class [A-Za-z0-9_][A-Za-z0-9_]*(\)[A-Za-z0-9_][A-Za-z0-9_]*,* */\1/
b loop
}
Those are BREs (Basic Regular Expressions). They don't have a + operator (that's only found in Extended Regular Expressions) and they definitely don't have \w (which was invented by perl). So your simple \w+ becomes [A-Za-z0-9_][A-Za-z0-9_]* and I had to use it several times, resulting in major ugliness.
In pseudocode form, what the thing does is:
while the line matches /^class \w+(comma-separated-list-of \w+):$/ {
    save the line in the hold space
    capture the outer \w and the first \w in the parentheses
    replace the entire line with the new string "\2 -> \1;" using the captures
    print the line
    retrieve the line from the hold space
    delete the first member of the comma-separated list
}
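For reference, the same line-by-line transformation can be sketched with Python's re module, where \w+ is available directly. This is a minimal sketch under the question's own assumption that every base is a plain \w+ name (a dotted base such as pkg.Base would be split in two); the names CLASS_RE and edges are made up for the example.

```python
import re

# One class declaration per line: the class name, then the raw base list.
CLASS_RE = re.compile(r"^class (\w+)\(([^)]*)\):\s*$")


def edges(line):
    """Return DOT edges 'Base -> Class;' for one 'class A(B, C):' line."""
    m = CLASS_RE.match(line)
    if not m:
        return []  # not a class declaration with a base list
    name, bases = m.groups()
    return ["%s -> %s;" % (base, name) for base in re.findall(r"\w+", bases)]
```

To use it as a filter after find and grep, loop over sys.stdin and print each edge on its own line.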
Using Python's ast module to parse Python is easy as, well, Python.
import ast

class ClassDumper(ast.NodeVisitor):
    def visit_ClassDef(self, clazz):
        def expand_name(expr):
            # Turn a base-class expression into a dotted name where possible
            if isinstance(expr, ast.Name):
                return expr.id
            if isinstance(expr, ast.Attribute):
                return '%s.%s' % (expand_name(expr.value), expr.attr)
            return ast.dump(expr)
        for base in clazz.bases:
            print('%s -> %s;' % (clazz.name, expand_name(base)))
        # Keep walking so that nested classes are visited too
        ClassDumper.generic_visit(self, clazz)

# Parse this very file and print one edge per (class, base) pair
ClassDumper().visit(ast.parse(open(__file__).read()))
(This isn't quite right w.r.t. nesting, as it'll output Inner -> Base; instead of Outer.Inner -> Base;, but you could fix that by keeping track of context in a manual walk.)

using R to copy files

As part of a larger task performed in R run under windows, I would like to copy selected files between directories. Is it possible to give within R a command like cp patha/filea*.csv pathb (notice the wildcard, for extra spice)?
I don't think there is a direct way (shy of shelling-out), but something like the following usually works for me.
flist <- list.files("patha", "^filea.*[.]csv$", full.names = TRUE)
file.copy(flist, "pathb")
Notes:
I purposely decomposed this into two steps; they can be combined.
Note the regular expression: R uses true regexes here rather than wildcards, and separates the file pattern from the path as two separate arguments.
Note the ^ and $ (beginning/end of string) in the regex. This is a common gotcha: these anchors are implicit in wildcard-type patterns but must be written out in regexes, lest file names that match the wildcard pattern but also start and/or end with additional text be selected as well.
In the Windows world, people will typically add the ignore.case = TRUE argument to list.files, in order to emulate the fact that directory searches are case insensitive with this OS.
R's glob2rx() function provides a convenient way to convert wildcard patterns to regular expressions. For example fpattern = glob2rx('filea*.csv') returns a different but equivalent regex.
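The anchoring gotcha is easy to see in a small experiment. The sketch below uses Python, whose fnmatch.translate plays a role similar to glob2rx (it is a Python standard-library function, not part of R); the file names are invented for the example.

```python
import fnmatch
import re

# Hypothetical directory listing.
names = ["filea.csv", "filea_2020.csv", "old_filea1.csv", "filea1.csv.bak"]

# Unanchored regex: matches anywhere inside the name, so it over-selects.
loose = [n for n in names if re.search(r"filea.*[.]csv", n)]

# Wildcard converted to a regex: anchored at both ends, like the shell glob.
glob_regex = fnmatch.translate("filea*.csv")
strict = [n for n in names if re.match(glob_regex, n)]

print(loose)
print(strict)
```

Only the anchored pattern reproduces what cp patha/filea*.csv would have selected.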
You can
use system() to fire off a command as if it was on shell, incl globbing
use list.files() aka dir() to do the globbing / regex matching yourself and then copy the files individually
use file.copy on individual files as shown in mjv's answer

Resources