I have line such as these:
[something]
[[something else]]
[[[another text here]]]
...
And I want to capture the inner text which is something, something else and another text here.
For this I wrote this regex:
m/^([^\[\]]+|\[(?1)\])$/gm
Unfortunately it does not capture the inner text even if I put a capture group around [^\[\]]+. I suppose the capture group latch its content at the first match, not during the last recursion.
How is it possible to capture the inner text using a capture group?
It isn't possible to get the capture group content from a recursion at the ground level with PCRE.
A workaround for your particular examples is to not use the recursion feature and to check if there's always a closing bracket for each opening bracket:
/\A (?:\[(?=[^]]*(]\1?+)))+ ([^][]*) \1 \z/x
(in group 2, demo)
But you can't use this way with more complicated strings (with several groups at the same level).
Related
I've been enjoying the powerful function aregexec that allows me to mine strings in a fuzzy way.
For that I can search for a string of nucleotide "ATGGCTTCGTC" within a DNA section with defined allowance of insertion, deletion and substitute.
However, it only show me the first match without finishing the whole string. For example,
If I run
aregexec("a","adfasdfasdfaa")
only the first "a" will show up from the result. I'd like to see all the matches.
I wonder if there are other more powerful functions or a argument to be added to this one.
Thank you very much.
P.S. I explained the fuzzy search poorly. I mean, the match doesn't have to be perfect. Say if I allow an substitution of one character, and search AATTGG in ctagtactaAATGGGatctgct, the capital part will be considered a match. I can similarly allow insertions and deletions of certain characters.
gregexpr will show every time there is the pattern in the string, like in this example.
gregexpr("as","adfasdfasdfaa")
There are many more information if you use ?grep in R, it will explain every aspect of using regex.
I have a double challenge.
First, I want to match lines that contain two (or eventually more) specified words within certain distance in whatever order.
Using lookaround I manage to select lines matching two or more words, regardless of the order within they occur. I can also easily add more words to be found in the same line, so it this can also be applied without much effort when more word must occur in order to be selected. The disadvantage is that can't detail the maximal distance between them.
^(?=.*\john)(?=.*\jack).*$
By using the pipe operator I can detail both orders in which the terms may occur as well as the accepted distance between them, but when more words should be matched the code becomes lengthy and errorsensitive.
jack.{0,100}john|john.{0,100}jack
Is there a way to combine the respective advantages of both approaches in one regular expression?
Second, ideally I would like that only 'jack' and 'john' (and are selected in the line but not the whole line.
Is there a possibility to do this all at once?
For this case, you have to use the second approach. But it can't be possible with regex alone.. You have to ask for language tools help like paste in-order to build a regex (given in the second format).
In python, I would do like below to create a long regex.
>>> def create_reg(lis):
out = []
for i in lis:
out.append(''.join(i) + '|' + ''.join([i[2],i[1], i[0]]))
return '(?:' + '|'.join(out) + ')'
>>> lst = [('john', '{0,100}', 'jack'), ('foo', '{0,100}', 'bar')]
>>> create_reg(lst)
'(?:john{0,100}jack|jack{0,100}john|foo{0,100}bar|bar{0,100}foo)'
>>>
Alright, I've been given a program that requires me to take a .txt file of varying symbols in rows and columns that would look like this.
..........00
...0....0000
...000000000
0000.....000
............
..#########.
..#...#####.
......#####.
...00000....
and using command arguments to specify row and column, requires me to select a symbol and replace that symbol with an asterisk. The problem i have with this is that it then requires me to recur up, down, left, and right any of the same symbol and change those into an asterisk.
As i understand it, if i were to enter "1 2" into my argument list it would change the above text into.
**********00
***0....0000
***000000000
0000.....000
............
..#########.
..#...#####.
......#####.
...00000....
While selecting the specified character itself isn't a problem, how do i have any similar, adjacent symbols change and then the ones next to those. I have looked around but can't find any information and as my teacher has had a different subs for the last 3 weeks, i havent had a chance to clarify my questions with them. I've been told that recursion can be used, but my actual experience using recursion is limited. Any suggestions or links i can follow to get a better idea on what to do? Would it make sense to add a recursive method that takes the coordinates given adds and subtracts from the row and column respectively to check if the symbol is the same and repeats?
Load in char by char, row by row, into a 2D array of characters. That'll make it a lot easier to move up and down and left and right, all you need to do is move one of the array indexes.
You can also take advantage of recursion. Make a function that changes all adjacent matching characters, and then call that same function on all adjacent matching characters.
I want to know if there is a way to check for the last element in a fusion for_each loop (in order to apply special code for this case)
Edit : Maybe a better question should be :
I have played with fusion::for_each, now I want to apply code on each element of a fusion sequence with special code (special code does not mean "extra code" but different code) for the last element. May be I should use iterators (an example please)?
Some ideas:
1) use boost::fusion::fold, count your way though, and on the last one, perform your edit
2) if all types in the tuple are heterogenous, match on type to determine last one
3) include some sort of marker for the last one on which you can match
4) use the 'prior(end(v))' operators to manipulate the last element when for_each processing is complete
I'm trying to set up a validation expression for an ASP.Net Regular Expression Validator control. It is for validating the creation of a user name, so I want to limit the number of characters, and I also want to prevent them from using spaces. Here's what I've got so far:
^.*(?=.{5,20})(?=.*\w{5,255}).*$
The \w{5,255} part prevents spaces and special characters (except for underscores, apparently). I have no idea how "5,255" makes it work, but it does; I just copied it from somewhere else.
The main problem I'm having is that if the first or last character is a space (or special character), it passes validation, which is not acceptable. Can anyone help me? I'm sure it is something simple, but I know next to nothing about regular expressions.
You can use something simpler like this:
^[a-zA-Z0-9_]{5,255}$
This will allow alphanumeric usernames between 5-255 characters in length.
(let's expand overall understanding of how to at least use regex!)
The main reason why the posted regex wasn't working is because you were attempting to use lookahead. Lookahead is a 0-length pattern that just guarantees that the next part of the string will match a certain pattern (and is usually used to take advantage of it being 0-length, so it doesn't expand your capturing group).
Effectively, what your regex (going off of the original /^.(?=.{5,20})(?=.\w{5,255}).*$/) meant was:
^. "The beginning of our line should match any single character (provided it's not a newline, although this depends on the regex implementation as well as flags that may or may not have been passed in)"
(?= "and guarantee that after here"
.{5,20}) are any 5-20 characters."
(?= "Also, after that same first character (since, remember, lookahead is 0-length), guarantee"
. "one arbitrary character"
\w{5,255}) "and 5-255 word characters."
.*$ And of course, since all of that exhaustive matching was 0-length, we want the rest of the line to be an arbitrary number of characters."
What you technically could have done to use lookaround was ^(?=\w{5,255}).{5,255}$, but that's just overly convoluted. I'd suggest just using \w{5,255} or something along those lines.