How back references work in Unix grep

I have a file, "file7.txt".
The contents of file7.txt,
I want to know how back references work with the grep command.
When I type the commands, I get the following results,
So now I want to know how this works.

I will explain the first grep that you tried.
grep '\([a-z]\)\1'
This matches any line that contains a lowercase letter immediately followed by the same letter.
f i l l i n g
grep looks for a character in the range 'a' to 'z'.
Starting at the beginning of the line, it tries every character, one by one.
The group \( \) captures that character, and \1 requires the next character to be identical to it. In "filling" this succeeds at "ll".
You need a way of remembering what you found, and of checking whether the same text occurs again.
In grep's basic regular expressions you mark part of a pattern with "\(" and "\)" (plain "(" and ")" with grep -E).
You can recall the remembered text with "\" followed by a single digit.
You can have up to 9 remembered groups, \1 through \9.
This lets a single pattern describe repeated text, instead of one pattern per possible repetition (aa, bb, cc, ...).
This is how all back references work.
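As a rough illustration of the explanation above, here is a minimal sketch in Python rather than grep. Python's re module writes groups as ( ) where basic grep writes \( \), but \1 means the same thing. The helper name first_doubled is made up for this example.

```python
import re

# Python writes the group as ( ) where basic grep writes \( \).
# \1 must match exactly the text that group 1 captured.
DOUBLED = re.compile(r"([a-z])\1")

def first_doubled(word):
    """Return the first doubled lowercase letter pair in word, or None."""
    m = DOUBLED.search(word)
    return m.group(0) if m else None

print(first_doubled("filling"))  # ll
print(first_doubled("file"))     # None
```

"file" has no two identical adjacent letters, so nothing matches, which is the same reason grep would not print such a line.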

Related

Using grep to filter rows with two or more patterns in the string in R

I need to index all the rows that have a string beginning with either "B-" or "B^" in one of the columns. I tried a bunch of combinations, but I suspect it is not working because the "-" and "^" signs are also part of the grep pattern syntax.
dataset[grep('^(B-|B^)[^B-|B^]*$', dataset$Col1),]
With the above script, rows beginning with "B^" are not being extracted. Please suggest a smart way to handle this.
You can escape the special characters with \\ in the grep pattern:
dataset[grep('^(B\\-|B\\^)[^B\\-|B\\^]*$', dataset$Col1),]
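For further explanation: ^ is an anchor that matches the beginning of a string, so you have to escape it to match a literal ^ in the middle of a pattern. The [] form a character class, and a leading ^ inside it negates the class, so [^B-|B^]* matches any run of characters that are not in the class (note that B-| inside a class is read as a range from B to |, not as three separate characters). That part is unnecessary here.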
The simplified regex is:
dataset[grep('^(B-|B\\^)', dataset$Col1),]
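The simplified pattern can be tried outside R as well. The sketch below uses Python's re module as a stand-in, with made-up sample values in col1. The helper rows_with_prefix is hypothetical and returns 0-based indices, whereas R's grep returns 1-based ones.

```python
import re

# Same idea as the simplified R pattern: only the literal ^ needs escaping.
PREFIX = re.compile(r"^(B-|B\^)")

def rows_with_prefix(values):
    """Return the 0-based indices of values that start with 'B-' or 'B^'."""
    return [i for i, v in enumerate(values) if PREFIX.search(v)]

col1 = ["B-12", "B^7", "AB-1", "B12", "B^"]  # made-up sample data
print(rows_with_prefix(col1))  # [0, 1, 4]
```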

can someone explain this regular expression inside gsub()? [duplicate]

I'm trying to understand a regular expression someone has written in the gsub() function.
I had never used regular expressions before seeing this code. I have tried to work out how it gets the final result with some googling, but I have hit a wall, so to speak.
gsub('.*(.{2}$)', '\\1',"my big fluffy cat")
This code returns the last two characters of the given string. In the above example it returns "at". This is the expected result, but from my brief foray into regular expressions I don't understand why this code does what it does.
What I understand is that '.*' means look for any character 0 or more times. So it's going to look at the entire string, and this is what will be replaced.
The part in brackets looks for any two characters at the end of the string. It would make more sense to me if this part in brackets was in place of the '\\1'. To me it would then read: look at the entire string and replace it with the last two characters of that string.
All that does, though, is output the actual code as the replacement, e.g. ".{2}$".
Finally, I don't understand why '\\1' is in the replacement part of the function. To me this is just saying replace the entire string with a single backslash and the number one. I say a single backslash because it's my understanding that the first backslash is just there to make the second backslash a non-special character.
There are two ways of using gsub. The most common way is probably:
gsub("-","TEST","This is a - ")
which would return
This is a TEST
This simply finds every match of the regular expression and replaces it with the replacement string.
The second way to use gsub is the one you described, using \\1, \\2, \\3, ...
These refer to the text matched by the first, second or third capture group in your regular expression.
A capture group is anything inside parentheses, e.g. (capture_group_1)(capture_group_2)...
Explanation
Your analysis is correct:
"What I understand is that '.*' means look for any character 0 or more times. So it's going to look at the entire string, and this is what will be replaced."
"The part in brackets looks for any two characters at the end of the string."
The last two characters are placed in a capture group, and we simply replace the whole string with the contents of this capture group. The pattern text itself is never used as the replacement.
If it helps, check out the result of this expression:
gsub('(.*)(.{2}$)', 'Group 1: \\1, Group 2: \\2',"my big fluffy cat")
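For readers without R at hand, the same two substitutions can be sketched with Python's re.sub, which behaves like gsub for this pattern. In a Python raw string the back reference is written \1 rather than \\1. The variable names are only illustrative.

```python
import re

text = "my big fluffy cat"

# Group 1 takes as much as it can while still leaving two characters for group 2.
labelled = re.sub(r"(.*)(.{2}$)", r"Group 1: \1, Group 2: \2", text)
print(labelled)  # Group 1: my big fluffy c, Group 2: at

# Replacing the whole match with group 1 alone keeps just the last two characters.
last_two = re.sub(r".*(.{2}$)", r"\1", text)
print(last_two)  # at
```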
I hope these examples help you understand it better.
Say we have the string foobarabcabcdef.
.* matches the whole string.
.*abc matches from the beginning up to and including the last abc (greedy matching), so it matches foobarabcabc.
.*(...)$ matches the whole string as well, but the last 3 characters are grouped. Even without (), the matched text always forms a default group, group 0; each () adds group 1, 2, 3, and so on. Think about .*(...)(...)(...)$, which gives us:
group 0 : the whole string
group 1 : "abc", the first "abc"
group 2 : "abc", the 2nd "abc"
group 3 : "def", the last 3 chars
So, back to your example: \\1 is a reference to group 1. What it does is "replace the whole match with the text matched by group 1". That is, the text matched by .{2}$ becomes the replacement.
As for the backslashes, it is all about escaping in R's string syntax: inside an R string literal "\\" stands for one backslash, so the replacement that gsub actually receives is \1.
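The group numbering described above can be checked with a short sketch. It uses Python's re module in place of R, on the assumption that the two treat this pattern the same way. The names greedy and groups are just for this example.

```python
import re

s = "foobarabcabcdef"

# Greedy .* runs as far as it can, so the match ends at the LAST "abc".
greedy = re.match(r".*abc", s).group(0)
print(greedy)  # foobarabcabc

# Group 0 is the whole match; groups 1..3 are the three parenthesised parts.
m = re.search(r".*(...)(...)(...)$", s)
groups = [m.group(i) for i in range(4)]
print(groups)  # ['foobarabcabcdef', 'abc', 'abc', 'def']
```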
The important part of that regular expression is the parentheses, which form what is called a "capturing group".
The regular expression .*(.{2}$) says: match anything, and capture the last 2 characters of the string. The replacement \\1 refers to that group, so it replaces the whole match with the captured group, which is the last two characters in this case.

zsh match files not containing dash

I have a file listing like the following:
001file.jpg
003file.jpg
001-800x600-sq.jpg
001-800x600.jpg
002-800x600-sq.jpg
002-800x600.jpg
003-800x600-sq.jpg
003-800x600.jpg
004-800x531-sq.jpg
004-800x531.jpg
005-800x531-sq.jpg
005-800x531.jpg
006-800x531-sq.jpg
006-800x531.jpg
007-800x531-sq.jpg
007-800x531.jpg
008-800x1067-sq.jpg
008-800x1067.jpg
009-800x1067-sq.jpg
009-800x1067.jpg
010-800x533-sq.jpg
010-800x533.jpg
011-800x1200-sq.jpg
011-800x1200.jpg
012-800x533-sq.jpg
012-800x533.jpg
013-800x600-sq.jpg
013-800x600.jpg
014-800x1067-sq.jpg
014-800x1067.jpg
015-800x533-sq.jpg
015-800x533.jpg
016-800x533-sq.jpg
016-800x533.jpg
In zsh, I want to list all files beginning with a number and not containing a dash in the filename, so I tried:
print -l <->[^-]*.jpg
with no success. What is wrong with this pattern?
This is, I think, similar to the case that the documentation for <-> warns about:
Be careful when using other wildcards adjacent to patterns of this form; for example, <0-9>* will actually match any number whatsoever at the start of the string, since the `<0-9>' will match the first digit, and the `*' will match any others. This is a trap for the unwary, but is in fact an inevitable consequence of the rule that the longest possible match always succeeds. Expressions such as `<0-9>[^[:digit:]]*' can be used instead.
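In print -l <->[^-]*.jpg, the <-> matches the first digit, then [^-] matches the 2nd digit, and * matches everything else, dashes included.
Use instead
print -l <->[^[:digit:]-]*.jpg
Here [^[:digit:]-] forces the character right after the number to be neither a digit nor a dash, which is enough to exclude the dashed names in your listing.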
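The trap can be reproduced outside zsh. The sketch below translates the two globs into rough Python regular expressions. This is an approximation chosen for illustration, not how zsh itself evaluates <->, and the short names list is a sample taken from the listing in the question.

```python
import re

names = ["001file.jpg", "003file.jpg", "001-800x600-sq.jpg", "001-800x600.jpg"]

# Rough regex translations of the two zsh globs (an approximation, not zsh itself).
naive = re.compile(r"\d+[^-].*\.jpg")    # like <->[^-]*.jpg
fixed = re.compile(r"\d+[^\d-].*\.jpg")  # like <->[^[:digit:]-]*.jpg

# In the naive pattern the number can shrink to one digit, so [^-] lands on a digit.
naive_hits = [n for n in names if naive.fullmatch(n)]
fixed_hits = [n for n in names if fixed.fullmatch(n)]

print(naive_hits)  # all four names
print(fixed_hits)  # ['001file.jpg', '003file.jpg']
```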

Creating RegEx That Reads Entire String

My current regex is only picking up part of my string. It creates a match as soon as one is found, even though I need the longer version of that match to hit. For example, I am creating matches for both:
SSS111
and
SSS111-L
The first SSS111 matches fine with my current regex, but the SSS111-L is only getting matched to the SSS111, leaving the -L out.
How can I create a greedy regex to read the whole line before matching? I am currently using
[-A-Z0-9]{3,12}
to capture the numbers and letters, but have not had any luck outside of this.
Regex quantifiers are greedy by default; usually that is what causes problems, not what is missing.
Here I think you only have to escape the '-':
#"[\-A-Z0-9]{3,12}"
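Whether the hyphen really needs escaping depends on the regex flavor, which the question does not name. As a hedged check in Python's re module, where a leading hyphen inside a class is already literal, the original pattern takes the whole of SSS111-L. If the match is still cut short in your environment, the cause may lie outside this pattern. The sample sentence is made up.

```python
import re

# In Python a hyphen placed first in a character class is a literal hyphen.
CODE = re.compile(r"[-A-Z0-9]{3,12}")

short = CODE.fullmatch("SSS111").group(0)
longer = CODE.fullmatch("SSS111-L").group(0)
found = CODE.search("item SSS111-L here").group(0)  # made-up sample text

print(short)   # SSS111
print(longer)  # SSS111-L
print(found)   # SSS111-L
```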

Please explain the below unix code

echo "1,a,20,000,aa,s" | sed 's/,\([^0]\)/|\1/g'
Output:
1|a|20,000|aa|s
Please explain the above command.
I am unable to understand this execution.
The given command uses sed to substitute certain characters with other characters.
The basic form for this is
s/FIND/REPLACE/
where FIND is a regular expression and REPLACE is the replacement text.
The g at the end stands for global. It means that every occurrence of a pattern matching FIND in the input line is replaced, not only the first.
On to the expressions used:
FIND ,\([^0]\) This pattern matches every two-character string that consists of a , followed by any character other than 0. The \( \) around [^0] capture that second character.
REPLACE |\1 This produces a two-character string that starts with a | followed by the character captured in FIND. (The \1 recalls the text matched by the first group.) That is why the comma in 20,000 survives: it is followed by a 0.
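The same substitution can be sketched in Python, assuming re.sub as a stand-in for sed's s///g. Python writes the group as ( ) instead of \( \), and re.sub replaces every match by default, like the g flag.

```python
import re

line = "1,a,20,000,aa,s"

# A comma followed by any character except 0; that character is kept via \1.
result = re.sub(r",([^0])", r"|\1", line)
print(result)  # 1|a|20,000|aa|s
```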
For a detailed overview of the sed commands I suggest you also read here: http://www.grymoire.com/Unix/Sed.html#uh-1
And to look up how to read regular expressions: http://www.grymoire.com/Unix/Regular.html
Of course there are many more sites on this topic if the above pages do not help.

Resources