I would like to see a code snippet of julia that will read a file and return lines (string type) that match a regular expression.
I welcome multiple techniques, but output should be equivalent to the following:
$> grep -E '^AB[AJ].*TO' 'webster-unabridged-dictionary-1913.txt'
ABACTOR
ABATOR
ABATTOIR
ABJURATORY
I'm using GNU grep 3.1 here, and the first line of each entry in the file is the all caps word on its own.
You could also use the filter function to do this in one line.
filter(line -> ismatch(r"^AB[AJ].*TO",line),readlines(open("webster-unabridged-dictionary-1913.txt")))
filter applies a function returning a Boolean to each element of an array, and returns only those elements for which the function returns true. The function in this case is the anonymous function line -> ismatch(r"^AB[AJ].*TO",line), which names each element of the array being filtered (each line, in this case) line and tests it against the regex.
I think this might not be the best solution for very large files as the entire file needs to be loaded into memory before filtering, but for this example it seems to be just as fast as the for loop using eachline. Another difference is that this solution returns the results as an array rather than printing each of them, which depending on what you want to do with the matches might be a good or bad thing.
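The same trade-off between loading everything and streaming shows up in other languages too. As a rough comparison in Python (not part of the original answer; the function names and the tiny stand-in file are made up for this sketch), the first function reads all lines before filtering, while the second yields matches one line at a time:

```python
import re
import tempfile
from pathlib import Path

PATTERN = re.compile(r"^AB[AJ].*TO")

def matches_eager(path):
    """Read every line into memory first, then filter (like readlines + filter)."""
    lines = Path(path).read_text().splitlines()
    return [line for line in lines if PATTERN.search(line)]

def matches_lazy(path):
    """Yield matching lines one at a time (like looping over eachline)."""
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if PATTERN.search(line):
                yield line

# Tiny stand-in for the dictionary file so the sketch runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("ABACK\nABACTOR\nDefn: One who steals cattle.\nABATOR\nABJURATORY\nZEBRA\n")
    sample_path = tmp.name

eager_result = matches_eager(sample_path)
lazy_result = list(matches_lazy(sample_path))
Path(sample_path).unlink()  # remove the temporary file again

print(eager_result)
```

Both functions return the same lines; only the peak memory use differs, which matters once the file is much larger than this sample.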
My favored solution uses a simple loop and is very easy to understand.
julia> open("webster-unabridged-dictionary-1913.txt") do f
           for i in eachline(f)
               if ismatch(r"^AB[AJ].*TO", i) println(i) end
           end
       end
ABACTOR
ABATOR
ABATTOIR
ABJURATORY
notes
Lines with tab separations have the tabs preserved (no literal output of '\t')
my source file in this example has the dictionary words in all caps alone on one line above the definition; the complete line is returned.
the file I/O operation is wrapped in do-block syntax, which expresses an anonymous function more conveniently than the lambda x -> f(x) syntax for multi-line functions. This is particularly expressive with the file open() command, which is defined with a try-finally-close operation when called with a function as an argument.
Julia docs: Strings/Regular Expressions
regex objects take the form r"<regex_literal_here>"
the regex itself is a string
based on the PCRE (Perl-compatible regular expressions) library
matches become regex match objects
example
julia> reg = r"^AB[AJ].*TO";
julia> typeof(reg)
Regex
julia> test = match(reg, "ABJURATORY")
RegexMatch("ABJURATO")
julia> typeof(test)
RegexMatch
Putting ; in front is Julia's way of running command-line commands, so this works in Julia's REPL:
;grep -E '^AB[AJ].*TO' 'webster-unabridged-dictionary-1913.txt'
I am particularly looking at R, Perl, and shell. But any other programming language would be fine too.
QUESTION
Is there a way to visually or programmatically inspect and index a matched string based on the regex? This is intended for referencing back to the first regex and its results inside of a second regex, so as to be able to modify a part of the matched string and write new rules for that particular part.
https://regex101.com does visualize how a certain string matches the regular expression. But it is far from perfect and is not efficient for my huge dataset.
PROBLEM
I have around 12000 matched strings (DNA sequences) for my first regex, and I want to process these strings and based on some strict rules find some other strings in a second file that go well together with those 12000 matches based on those strict rules.
SIMPLIFIED EXAMPLE
This is my first regex (a simplified, shorter version of my original regex) that runs through my first text file.
[ACGT]{1,12000}(AAC)[AG]{2,5}[ACGT]{2,5}(CTGTGTA)
Let's suppose that it finds the following three sub-strings in my large text file:
1. AAACCCGTGTAATAACAGACGTACTGTGTA
2. TTTTTTTGCGACCGAGAAACGGTTCTGTGTA
3. TAACAAGGACCCTGTGTA
Now I have a second file which includes a very large string. From this second file, I am only interested in extracting those sub-strings that match a new (second) regex which itself is dependent on my first regex in few sections. Therefore, this second regex has to take into account the substrings matched in the first file and look at how they have matched to the first regex!
Allow me, for the sake of simplicity, to index my first regex for better illustration in this way:
first.regex.p1 = [ACGT]{1,12000}
first.regex.p2 = (AAC)
first.regex.p3 = [AG]{2,5}
first.regex.p4 = [ACGT]{2,5}
first.regex.p5 = (CTGTGTA)
Now my second (new) regex which will search the second text file and will be dependent on the results of the first regex (and how the substrings returned from the first file have matched the first regex) will be defined in the following way:
second.regex = (CTAAA)[AC]{5,100}(TTTGGG){**rule1**}(CTT)[AG]{10,5000}{**rule2**}
In here rule1 and rule2 are dependent on the matches coming from the first regex on the first file. Hence;
rule1 = look at the matched strings from file1 and complement the pattern of first.regex.p3 that is found in the matched substring from file1 (the complement should of course have the same length)
rule2 = look at the matched strings from file1 and complement the pattern of first.regex.p4 that is found in the matched substring from file1 (the complement should of course have the same length)
You can see that second regex has sections that belong to itself (i.e. they are independent of any other file/regex), but it also has sections that are dependent on the results of the first file and the rules of the first regex and how each sub-string in the first file has matched that first regex!
Now again for the sake of simplicity, I use the third matched substring from file1 (because it is shorter than the other two) to show you how a possible match from the second file looks like and how it satisfies the second regex:
This is what we had from our first regex run through the first file:
3. TAACAAGGACCCTGTGTA
So in this match, we see that:
T has matched first.regex.p1
AAC has matched first.regex.p2
AAGGA has matched first.regex.p3
CC has matched first.regex.p4
CTGTGTA has matched first.regex.p5
Now in our second regex for the second file we see that when looking for a substring that matches the second regex, we are dependent on the results coming from the first file (which match the first regex). Particularly we need to look at the matched substrings and complement the parts that matched first.regex.p3 and first.regex.p4 (rule1 and rule2 from second.regex).
complement means:
A will be substituted by T
T -> A
G -> C
C -> G
So if you have TAAA, the complement will be ATTT.
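As a concrete illustration of this mapping, here is a minimal Python sketch. The function name complement and the table name are labels chosen for the example, not anything from an existing library:

```python
# Map each base to its complement: A<->T and G<->C.
COMPLEMENT_TABLE = str.maketrans("ATGC", "TACG")

def complement(sequence: str) -> str:
    """Return the base-by-base complement; the length stays the same."""
    return sequence.translate(COMPLEMENT_TABLE)

print(complement("TAAA"))   # ATTT
print(complement("AAGGA"))  # TTCCT
print(complement("CC"))     # GG
```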
Therefore, going back to this example:
TAACAAGGACCCTGTGTA
We need to complement the following to satisfy the requirements of the second regex:
AAGGA has matched first.regex.p3
CC has matched first.regex.p4
And complements are:
TTCCT (based on rule1)
GG (based on rule2)
So an example of a substring that matches second.regex is this:
CTAAAACACCTTTGGGTTCCTCTTAAAAAAAAAGGGGGAGAGAGAAGAAAAAAAGAGAGGG
This is only one example! But in my case I have 12000 matched substrings! I cannot figure out how to even approach this problem. I have tried writing pure regex, but I have completely failed to implement anything that properly follows this logic. Perhaps I shouldn't even be using regex?
Is it possible to do this entirely with regex? Or should I look at another approach? Is it possible to index a regex and in the second regex reference back to the first regex and force the regex to consider the matched substrings as returned by first regex?
This can be done programmatically in Perl, or any other language.
Since you need input from two different files, you cannot do this in pure regex, as regex cannot read files. You cannot even do it in one pattern, as no regex engine remembers what you matched before on a different input string. It has to be done in the program surrounding your matches, which should very well be regex, as that's what regex is meant for.
You can build the second pattern up step by step. I've implemented a more advanced version in Perl that can easily be adapted to suit other pattern combinations as well, without changing the actual code that does the work.
Instead of file 1, I will use the DATA section. It holds all three example input strings. Instead of file 2, I use your example output for the third input string.
The main idea behind this is to split up both patterns into sub-patterns. For the first one, we can simply use an array of patterns. For the second one, we create anonymous functions that we will call with the match results from the first pattern to construct the second complete pattern. Most of them just return a fixed string, but two actually take a value from the arguments to build the complements.
use strict;
use warnings;

sub complement {
    my $string = shift;
    $string =~ tr/ATGC/TACG/;    # this is a transliteration, faster than s///
    return $string;
}

# first regex, split into sub-patterns
my @first = (
    qr([ACGT]{1,12000}),
    qr(AAC),
    qr([AG]{2,5}),
    qr([ACGT]{2,5}),
    qr(CTGTGTA),
);

# second regex, split into sub-patterns as callbacks
my @second = (
    sub { return qr(CTAAA) },
    sub { return qr([AC]{5,100}) },
    sub { return qr(TTTGGG) },
    sub {
        my (@matches) = @_;

        # complement the pattern of first.regex.p3
        return complement( $matches[3] );
    },
    sub { return qr(CTT) },
    sub { return qr([AG]{10,5000}) },
    sub {
        my (@matches) = @_;

        # complement the pattern of first.regex.p4
        return complement( $matches[4] );
    },
);

my $file2 = "CTAAAACACCTTTGGGTTCCTCTTAAAAAAAAAGGGGGAGAGAGAAGAAAAAAAGAGAGGG";

while ( my $file1 = <DATA> ) {

    # this pattern will match the full thing in $1, and each sub-section in $2, $3, ...
    # @matches will contain (full, $2, $3, $4, $5, $6)
    my @matches = ( $file1 =~ m/(($first[0])($first[1])($first[2])($first[3])($first[4]))/g );

    # iterate the list of anonymous functions and call each of them,
    # passing in the match results of the first match
    my $pattern2 = join q{}, map { '(' . $_->(@matches) . ')' } @second;

    my @matches2 = ( $file2 =~ m/($pattern2)/ );
}

__DATA__
AAACCCGTGTAATAACAGACGTACTGTGTA
TTTTTTTGCGACCGAGAAACGGTTCTGTGTA
TAACAAGGACCCTGTGTA
These are the generated second patterns for your three input substrings.
((?^:CTAAA))((?^:[AC]{5,100}))((?^:TTTGGG))(TCT)((?^:CTT))((?^:[AG]{10,5000}))(GCAT)
((?^:CTAAA))((?^:[AC]{5,100}))((?^:TTTGGG))(CC)((?^:CTT))((?^:[AG]{10,5000}))(AA)
((?^:CTAAA))((?^:[AC]{5,100}))((?^:TTTGGG))(TTCCT)((?^:CTT))((?^:[AG]{10,5000}))(GG)
If you're not familiar with this, it's what happens if you print a pattern that was constructed with the quoted regex operator qr//.
The pattern matches your example output for the third case. The resulting @matches2 looks like this when dumped out using Data::Printer.
[
[0] "CTAAAACACCTTTGGGTTCCTCTTAAAAAAAAAGGGGGAGAGAGAAGAAAAAAAGAGAGGG",
[1] "CTAAA",
[2] "ACACC",
[3] "TTTGGG",
[4] "TTCCT",
[5] "CTT",
[6] "AAAAAAAAAGGGGGAGAGAGAAGAAAAAAAGAGAG",
[7] "GG"
]
I cannot say anything about the speed of this implementation, but I believe it will be reasonably fast.
If you wanted to find other combinations of patterns, all you would have to do is replace the sub { ... } entries in those two arrays. If the first match has a number of sub-patterns other than five, you'd also construct that pattern programmatically. I've not done that above to keep things simpler. Here's what it would look like.
my @matches = ( $file1 =~ join q{}, map { "($_)" } @first);
If you want to learn more about this kind of strategy, I suggest you read Mark Jason Dominus' excellent Higher Order Perl, which is available for free as a PDF here.
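The same strategy carries over to other languages. Below is a rough Python rendering of the idea, offered as a sketch rather than a faithful port of the Perl program: the names FIRST_PARTS, SECOND_PARTS and build_second_pattern are invented here, and the inputs are the sample strings from the question.

```python
import re

COMPLEMENT_TABLE = str.maketrans("ATGC", "TACG")

def complement(sequence):
    """Base-by-base complement (A<->T, G<->C)."""
    return sequence.translate(COMPLEMENT_TABLE)

# First regex, split into sub-patterns; each one becomes its own capture group.
FIRST_PARTS = [r"[ACGT]{1,12000}", r"AAC", r"[AG]{2,5}", r"[ACGT]{2,5}", r"CTGTGTA"]
FIRST_REGEX = re.compile("".join(f"({part})" for part in FIRST_PARTS))

# Second regex as callbacks; each receives the match object of the first regex.
SECOND_PARTS = [
    lambda m: r"CTAAA",
    lambda m: r"[AC]{5,100}",
    lambda m: r"TTTGGG",
    lambda m: complement(m.group(3)),  # rule1: complement of first.regex.p3
    lambda m: r"CTT",
    lambda m: r"[AG]{10,5000}",
    lambda m: complement(m.group(4)),  # rule2: complement of first.regex.p4
]

def build_second_pattern(first_match):
    """Call every callback and wrap each result in a capture group."""
    return "".join(f"({part(first_match)})" for part in SECOND_PARTS)

file1_lines = [
    "AAACCCGTGTAATAACAGACGTACTGTGTA",
    "TTTTTTTGCGACCGAGAAACGGTTCTGTGTA",
    "TAACAAGGACCCTGTGTA",
]
file2 = "CTAAAACACCTTTGGGTTCCTCTTAAAAAAAAAGGGGGAGAGAGAAGAAAAAAAGAGAGGG"

results = []
for line in file1_lines:
    first_match = FIRST_REGEX.search(line)
    if first_match is None:
        continue
    second_match = re.search(build_second_pattern(first_match), file2)
    results.append(second_match.groups() if second_match else None)

for line, result in zip(file1_lines, results):
    print(line, "->", result)
```

Only the third input produces a second pattern that matches the example string, which agrees with the Perl output shown above.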
Using stringr in R
Extract matches to regex_1: "[ACGT]{1,12000}(AAC)[AG]{2,5}[ACGT]{2,5}(CTGTGTA)"
reg_1_matches = stringr::str_extract_all(sequences, "[ACGT]{1,12000}(AAC)[AG]{2,5}[ACGT]{2,5}(CTGTGTA)")
reg_1_matches = unlist(reg_1_matches)
Let's assume the matches were:
reg_1_matches = c("TTTTTTTGCGACCGAGAAACGGTTCTGTGTA", "TAACAAGGACCCTGTGTA")
Use stringr::str_match with capturing groups (...)
df_ps = stringr::str_match(reg_1_matches, "[ACGT]{1,12000}AAC([AG]{2,5})([ACGT]{2,5})CTGTGTA")
p3 = df_ps[,2]
p4 = df_ps[,3]
Complement
rule_1 = chartr(old= "ACGT", "TGCA", p3)
rule_2 = chartr(old= "ACGT", "TGCA", p4)
Construct regex_2
paste("(CTAAA)[AC]{5,100}(TTTGGG)", rule_1, "(CTT)[AG]{10,5000}", rule_2, sep="")
all in one go:
reg_1_matches = stringr::str_extract_all(sequences, "[ACGT]{1,12000}(AAC)[AG]{2,5}[ACGT]{2,5}(CTGTGTA)")
reg_1_matches = unlist(reg_1_matches)
df_ps = stringr::str_match(reg_1_matches, "[ACGT]{1,12000}AAC([AG]{2,5})([ACGT]{2,5})CTGTGTA")
p3 = df_ps[,2]
p4 = df_ps[,3]
rule_1 = chartr(old= "ACGT", "TGCA", p3)
rule_2 = chartr(old= "ACGT", "TGCA", p4)
paste("(CTAAA)[AC]{5,100}(TTTGGG)", rule_1, "(CTT)[AG]{10,5000}", rule_2, sep="")
This question really brings to mind the old saying about regular expressions, though in this case the languages you're matching against are regular, so RE is a good fit for this.
Unfortunately, my Perl is somewhat lacking, but fundamentally this sounds like a Regex problem rather than an R or Perl problem, so I'll do my best to answer it on that basis.
Perl's regex engine supports capture groups. The substrings matching bracketed subexpressions in your regex can be made available after matching:
use feature qw(say);
$foo = 'foo';
'aaa' =~ /(a)(a+)/;
say($1); # => 'a'
say($2); # => 'aa'
say("Matched!") if 'aaaa' =~ /${2}/;
What I'd suggest doing is bracketing your regex up properly, picking apart the capture groups after matching, and then sticking them together into a new regex, say...
use feature qw(say);
'ACGTAACAGAGATCTGTGTA' =~ /([ACGT]{1,12000})(AAC)([AG]{2,5})([ACGT]{2,5})(CTGTGTA)/ ; # Note that I've added a lot of (s and )s here so that the results get sorted into nice groups
say($1); # => 'ACGT'
say($2); # => 'AAC'
say($3); # => 'AGAG'
say($4); # => 'AT'
say($5); # => 'CTGTGTA'
$complemented_3 = complement($3); # You can probably implement these yourself...
$complemented_4 = complement($4);
$new_regex = qr/${complemented_3}[ACGT]+${complemented_4}/; # qr// compiles the pattern; a bare /.../ would match against $_ instead
If the sections have actual meaning, then I'd also advise looking up named capture groups, and giving the results decent names rather than $1, $2, $3....
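To show what named capture groups look like in practice, here is a small sketch in Python rather than Perl. The group names (prefix, anchor, purines, spacer, tail) are made up for the illustration, and the input is the same test string as above:

```python
import re

COMPLEMENT_TABLE = str.maketrans("ATGC", "TACG")

# Same first regex as above, but every section gets a (hypothetical) name.
FIRST_REGEX = re.compile(
    r"(?P<prefix>[ACGT]{1,12000})"
    r"(?P<anchor>AAC)"
    r"(?P<purines>[AG]{2,5})"
    r"(?P<spacer>[ACGT]{2,5})"
    r"(?P<tail>CTGTGTA)"
)

match = FIRST_REGEX.search("ACGTAACAGAGATCTGTGTA")
parts = match.groupdict()  # dict of group name -> matched text

# Complement the two sections of interest and stick them into a new regex.
complemented_3 = parts["purines"].translate(COMPLEMENT_TABLE)
complemented_4 = parts["spacer"].translate(COMPLEMENT_TABLE)
new_regex = re.compile(f"{complemented_3}[ACGT]+{complemented_4}")

print(parts)
print(new_regex.pattern)
```

Referring to parts["purines"] instead of a numbered group keeps the code readable if sections are later added or reordered.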
awk solution.
The requirements are not that complicated: a simple script can do the trick. There's just one complication: every regex that is a result from your first match has to be matched against all lines of the second file. Here's where we use xargs to solve that.
Now, whatever language you pick, it looks like the number of matches being made is going to be extensive, so some remarks about the regexes need to be made first.
The regex for the first file is going to be slow, because in
[ACGT]{1,12000}(AAC)[AG]{2,5}[ACGT]{2,5}(CTGTGTA)
the number of possibilities for the first part [ACGT]{1,12000} is huge. Actually, it only says: pick any element from A, C, G, T and do that between 1 and 12000 times. Then match the rest. Couldn't we do a
AAC([AG]{2,5})([ACGT]{2,5})CTGTGTA$
instead? The speed gain is considerable.
A similar remark can be made to the regex for the second file. If you replace
(CTAAA)[AC]{5,100}(TTTGGG){**rule1**}(CTT)[AG]{10,5000}{**rule2**}
with
(CTAAA)[AC]{5,100}(TTTGGG){**rule1**}(CTT)[AG]*{**rule2**}$
you will experience some improvement.
Because I started this answer with the low complication-factor of the requirements, let's see some code:
$ cat tst.awk
match($0, /AAC([AG]{2,5})([ACGT]{2,5})CTGTGTA$/, a) {
    r = sprintf("(CTAAA)[AC]{5,100}(TTTGGG)(%s)(CTT)[AG]*(%s)$",
                translate(a[1]),
                translate(a[2]));
    print r
}

function translate(word) {
    cmd = "echo '" word "' | tr 'ACGT' 'TGCA'";
    res = ((cmd | getline line) > 0 ? line : "");
    close(cmd);
    return res
}
What this will do is produce the regex for your second file. (I've added extra grouping for demo purposes). Now, let's take a look at the second script:
$ cat tst2.awk
match($0, regex, a){ printf("Found matches %s and %s\n", a[3], a[5]) }
What this will do is take a regex and match it against every line read from the second input file. We need to provide this script with a value for regex, like this:
$ awk -f tst.awk input1.txt | xargs -I {} -n 1 awk -v regex={} -f tst2.awk input2.txt
The -v option of awk lets us define the variable regex, which is fed into this call by the first script.
$ cat input1.txt
TAACAAGGACCCTGTGTA
$ cat input2.txt
CTAAAACACCTTTGGGTTCCTCTTAAAAAAAAAGGGGGAGAGAGAAGAAAAAAAGAGAGGG
and the result is:
$ awk -f tst.awk input1.txt | xargs -I {} -n 1 awk -v regex={} -f tst2.awk input2.txt
Found matches TTCCT and GG
In conclusion: should you use regexes to solve your problem? Yes, but don't be too ambitious and try to match the whole string in one go. Quantifiers like {1,12000} are going to slow you down, whatever language you pick.
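Before swapping in the shortened first pattern, it is worth checking that it captures the same sections as the original on known inputs. The Python sketch below does that for the three sample strings from the question; the helper name captures is made up. Agreement on these samples does not prove the two patterns are equivalent in general: for instance, the original requires at least one base before AAC and is not anchored to the end of the line.

```python
import re

# Original first regex (only p3 and p4 captured) and the shortened, anchored version.
ORIGINAL = re.compile(r"[ACGT]{1,12000}AAC([AG]{2,5})([ACGT]{2,5})CTGTGTA")
SHORTENED = re.compile(r"AAC([AG]{2,5})([ACGT]{2,5})CTGTGTA$")

samples = [
    "AAACCCGTGTAATAACAGACGTACTGTGTA",
    "TTTTTTTGCGACCGAGAAACGGTTCTGTGTA",
    "TAACAAGGACCCTGTGTA",
]

def captures(regex, text):
    """Return the captured (p3, p4) pair, or None if the regex does not match."""
    found = regex.search(text)
    return found.groups() if found else None

comparison = [(captures(ORIGINAL, s), captures(SHORTENED, s)) for s in samples]

for sample, (from_original, from_shortened) in zip(samples, comparison):
    print(sample, from_original, from_shortened, from_original == from_shortened)
```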
(Note: This is a successor question to my posting zsh: Command substitution and proper quoting, but now with an additional complication).
I have a function _iwpath_helper, which outputs to stdout a path, which possibly contains spaces. For the sake of this discussion, let's assume that _iwpath_helper always returns a constant text, for instance
function _iwpath_helper
{
    echo "home/rovf/my directory with spaces"
}
I also have a function quote_stripped, which expects one parameter; if this parameter is surrounded by quotes, it removes them and returns the remaining text. If the parameter is not surrounded by quotes, it returns it unchanged. Here is its definition:
function quote_stripped
{
    echo ${1//[\"\']/}
}
Now I combine both functions in the following way:
target=$(quote_stripped "${(q)$(_iwpath_helper)}")
(Of course, 'quote_stripped' would be unnecessary in this toy example, because _iwpath_helper doesn't return a quote-delimited path here, but in the real application, it sometimes does).
The problem now is that the variable target contains a real backslash character, i.e. if I do a
echo +++$target+++
I see
+++home/rovf/my\ directory\ with\ spaces+++
and if I try to
cd $target
I get on my system the error message, that the directory
home/rovf/my/ directory/ with/ spaces
would not exist.
(In case you are wondering where the forward slashes come from: I'm running on Cygwin, and I guess that the cd command just interprets backslashes as forward slashes in this case, to better accommodate the Windows environment).
I guess the backslashes which physically appear in the variable target are caused by the (q) expansion flag which I apply to $(_iwpath_helper). My problem now is that I cannot simply drop the (q), because without it, the function quote_stripped would receive in parameter $1 only the first part of the string, up to the first space (/home/rovf/my).
How can I write this correctly?
I think you just want to avoid trying to strip quotes manually, and use the (Q) expansion flag. Compare:
% v="a b c d"
% echo "$v"
a b c d
% echo "${(q)v}"
a\ b\ c\ d
% echo "${(Q)${(q)v}}"
a b c d
chepner was right: The way I tried to unquote the string was silly (I was thinking too much in a "Bourne Shell way"), and I should have used the (Q) flag.
Here is my solution:
target="${(Q)$(_iwpath_helper)}"
No need for the quote_stripped function anymore....
Let's say I have highlighted (matched) the text enclosed in brackets using
/(.*)
Now, how do I copy only the highlighted text (i.e. the matching pattern, not the entire line) into a buffer, so that I can paste it somewhere?
Multiple approaches are presented in this Vim Tips Wiki page. The simplest approach is the following custom command:
function! CopyMatches(reg)
  let hits = []
  %s//\=len(add(hits, submatch(0))) ? submatch(0) : ''/ge
  let reg = empty(a:reg) ? '+' : a:reg
  execute 'let @'.reg.' = join(hits, "\n") . "\n"'
endfunction
command! -register CopyMatches call CopyMatches(<q-reg>)
When you search, you can use the e flag to move to the end of the match. So if I understand your question correctly, if you searched using e.g.:
/bar
And you wish to copy it, use:
y//e
This will yank using the previous search pattern until the end of the match.
Do you want to combine every (foo) in the buffer in one register (which would look like (foo)(bar)(baz)…) or do you want to yank a single (foo) that you matched?
The latter is done with ya( if you want the parentheses or yi( if you only want what's between them.
Ingo's answer takes care of the former.