unix grep command

I have a text file named "file1" containing the following data :
apple
appLe
app^e
app\^e
Now the commands given are :
1.)grep app[\^lL]e file1
2.)grep "app[\^lL]e" file1
3.)grep "app[l\^L]e" file1
4.)grep app[l\^L]e file1
output in 1st case : app^e
output in 2nd case :
apple
appLe
app^e
output in 3rd case :
apple
appLe
app^e
output in 4th case :
apple
appLe
app^e
Why is that? Please help!

1.)grep app[\^lL]e file1
The pattern is unquoted, so the shell removes the escape (\) before grep sees it, and this is equivalent to app[^lL]e. Because ^ is the first character inside the brackets, it negates the set: the bracket expression matches any character other than l or L.
2.)grep "app[\^lL]e" file1
This time the double quotes preserve the \, so grep receives app[\^lL]e. Inside a bracket expression \ is an ordinary character, and ^ is no longer first, so nothing is negated: it matches \ or ^ or l or L.
3.)grep "app[l\^L]e" file1
^ negates the set only if it is the first character, so this matches l or \ or ^ or L.
4.)grep app[l\^L]e file1
The shell removes the \ again, so grep sees app[l^L]e. The ^ is not the first character, so it is literal, and this matches l or ^ or L.

In the first case, grep app[\^lL]e file1, the pattern is not quoted on the command line, so the shell processes it and strips the backslash. The search pattern effectively becomes
app[^lL]e
and means: "app", then any character except "l" or "L", then "e". The only line that fits is
app^e
In the other cases, ^ is not the first character inside the brackets, either because the quoted backslash precedes it or because it sits in the middle of the set, so it is matched literally.
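
To check what grep actually receives in each case, the quoting can be inspected directly. This is a minimal sketch, assuming a POSIX-like shell (sh or bash) and a scratch directory created with mktemp, so that the unquoted pattern cannot expand to an existing file name. The variable names are only for illustration.

```shell
#!/bin/sh
# Work in an empty scratch directory so the unquoted pattern cannot
# accidentally expand to an existing file name.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' 'apple' 'appLe' 'app^e' 'app\^e' > file1

# What the shell hands to a command: the unquoted form loses the
# backslash, the double-quoted form keeps it.
seen_unquoted=$(printf '%s' app[\^lL]e)
seen_quoted=$(printf '%s' "app[\^lL]e")
printf 'unquoted: %s\nquoted:   %s\n' "$seen_unquoted" "$seen_quoted"

# Cases 1 and 2 from the question.
out1=$(grep app[\^lL]e file1)
out2=$(grep "app[\^lL]e" file1)
printf 'case 1:\n%s\ncase 2:\n%s\n' "$out1" "$out2"
```

Quoting the pattern with single quotes is the usual way to make sure grep sees exactly what was typed.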

How to split a file into multiple files based on condition match with line numbers starting from beginning in all split files in UNIX?

For this I used a command like below using awk
awk '/H.*/{x="F"++i;next}{print NR-1 "," $0 > x;}' words.txt
which splits into multiple files when any Header pattern matches.
words.txt
Header
LLLL
AAAA
Header
SSSS
DDDD
After splitting with the above command, I am getting this output:
File1.txt
1. LLLL
2. AAAA
File2.txt
3. SSSS
4. DDDD
What I am expecting is line numbers starting from 1 in each file, like below:
File1.txt
1. LLLL
2. AAAA
File2.txt
1. SSSS
2. DDDD

If you want the line count to restart in each output file, use the following:
awk '/H.*/{count=1;close(x);x="F"++i;next}{print count++ "," $0 > x;}' words.txt
close() is added to avoid the "too many open files" error that can occur when many output files are left open.
Explanation of the above code:
awk ' ##Start of the awk program.
/H.*/{ ##Runs for every line that contains an H (the header lines here).
count=1 ##Reset the line counter to 1 for the new output file.
close(x) ##Close the previous output file (if any), whose name is stored in x, to avoid the too many open files error.
x="F"++i ##Set x to the letter F followed by the value of i, which is incremented by 1 each time (F1, F2, ...).
next ##Skip the remaining statements for this line.
}
{ ##These statements run for every other line.
print count++ "," $0 > x ##Write the counter, a comma and the current line to the file named by x, then increment the counter.
}
' words.txt ##Name of the input file.
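
The one-liner above can be tried on the sample data from the question. This is only a sketch, assuming a POSIX-like shell and a scratch directory from mktemp. The output files are named F1 and F2 because of the x="F"++i assignment, and the separator is a comma rather than the ". " shown in the question.

```shell
#!/bin/sh
# Recreate the sample input in a scratch directory.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' Header LLLL AAAA Header SSSS DDDD > words.txt

# Split on header lines, restarting the counter for each output file.
awk '/H.*/{count=1;close(x);x="F"++i;next}{print count++ "," $0 > x;}' words.txt

# Show what ended up in each file.
f1=$(cat F1)
f2=$(cat F2)
printf 'F1:\n%s\nF2:\n%s\n' "$f1" "$f2"
```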

Here
awk '/^Header$/{close(f); f="File"++n".txt"; l=0; next}{print ++l". "$0 > f}' words.txt
Result
$ cat File1.txt
1. LLLL
2. AAAA
$
$ cat File2.txt
1. SSSS
2. DDDD

Getting the last x digits from output with grep command

I need help getting the last 16 digits from the output I get with this command ;
cat q5data.txt | grep -o '[0-9]*[0-9]\{16\}'
The output I get is :
6420029454020029
26787889786973463
92272417810036027222591368318424
1147142436072964
I'd like only the last 16 digits of each of the numbers above, so the output would look like this:
6420029454020029
6787889786973463
7222591368318424
1147142436072964
So the question is: how would I get the last 16 digits?
q5data contains this:
0111102.82575525572371251FriThuSat32169716436971243.1415 foo100001$$$3.14153
foo`3.1415Green100010blah2.8
2.85720948213811501Purple`WedTueBLACK1869228491762178BLACK$$3.14100001Feb010000
taoblahfoopiGreen010111
VOIDchiOrangeSatNILLVOIDBLACK$$$Sat3.14155378825854705118Mar$WHITEAug`Tue
4421929582063064
2.8$$$$BLACKSun$"blah$ThublahJun2057411253659033Orange$$Sun$$fubar'
BLACKSun8061215743158569Jul'010101`2.8MayFri$$'blah
100001$3.141533.14153taoBLACKWHITE3.141532.8'foo"chi`BLACK$$$3300209361826966
5976364681345632YellowFri"JanWHITEWedWHITE3652470302503667WHITE
1237496282374608WHITEpiNILLVOID110111WHITEApr'$$$2.83536505910579946111010
54891762211716313.14$$RedWedtaoMonFri110010$$3068508931421361$PurpleNILLWHITE9242959892278294Sep
000110BlueOct2582940799974379
phifoo$
Purple3.1415Green '
3.14BLACKTuepiYellowWHITEchi35798399298233973.14153.1415WHITEpitao$SunBlue010110
NULLBLACKTue1650665049652872`2.8$'$$$NULL3.14SatGreen$$3.141533.14153GreenVOIDJul"
chichifubarWedpiBLACK3.14153BLACKpiWHITEThu$ BLACK
blah2.8fubar4411479881441554$$`BLACKWHITE1101113.14SepWHITEJanThuGreen
$$WHITE'"3675572769992033fooBlueNULL100000'
BLACK 3.14WHITEDecfubarOrangeMay NILLWHITE2570850288634750 101011$$$Mon
Tue" 3.143.14phiSat7665425103246257MayphiTue'0010110101112.8BLACK$fubar"
0358649831711525100010'FriJunThu"3.14SunGreenfubarMonWHITEVOID$$$VOID1877369637528056Jan$010010
GreenTue000111ThuBLACKApr011010
Jun6244216458497289`PurpleAug$$$2685357800265115''2.8taopi101100$$chiFeb
9471418620899225VOID8617331495319240NULLWHITEblah5461478451014026
6352741666667105
WHITEfooOct011010pi$$$110100BLACKBLACKTuePurpleWHITE9093492271343727SepNovchi
Orange3.144596443153024361`"'$$78253311502390510101103.14153Friphi $Mon
1385825179552755YellowBLACK001011Sep$$RedFebfubarMon010010000010fubar"Jul0110117544560082562350
3.141653642540032022chi'Orange
1253542283769081tao4876457038962098MonSunMayWHITEYellow3.14153$Orange000101blah
RedSatNILLphiVOIDWedfubarGreen chi$$piphiJul$$$111001`9540185369262601NILLVOID
7006440921851679Wed3.14152.8chiGreenThu$$Tuefoofooblahpi$$$taopi$ May 'Feb
MayNILLblah8007182476768737JantaophiThutao$'Jul AprNILLBLACK'3.14153Feb3.1415
57067714600406493.141537231229468300261Mon$`SunNILL `NULL3.14153foochi1000109494160741986074
6577869219715310JulJanBLACKfubarBLACK2.8phiGreen0091496849086433
SunBlue2355648762601053 3.1415NULL$$$BLACK100011 ThuDecJun2.83.1415phiFeb"
9173525733960126BLACK 3.14153`110001PurpleRedFebfubarVOIDfoo$$$blah9330024102534139
Jun$$VOIDVOID4099554992034342Julpi9976331355660412taoWHITEGreen$$100010NILLVOID
3.14153phiSatphi43658305924319679197159994746838phipiApr
3.1415RedblahMayfooJul100011NovtaoMon3.141533.14JanGreen$$ OctNILLfooWHITE3.1415
96027197435535111011013.14VOID3583462878046156NULL3.1415blahOrangefoo 100101taofoo3.14153"3.1415
$$Red3.14Marblah'
3797758515388131tao $$$101010NULL2268984774582096BlueBlue3.14153Oct`
74321533961822933.14153994759453326425$$Jul001111PurpleGreenTueNovJan2742714540787707Blue$$$
0010003.14blah3.14ThuWHITE$$$$blah
3997313793176662 3.141463510697622121Yellow 3.1415'Jul`3.14153NILL2.8Thuphi
3134920264311067fooNov`NULL1111119335359393623483Tue$$$GreenVOIDtaoRedTueAug$$3.141532.8Sat'
3.14153Oct100010FebJan$$3.1415pi$$'chiRed$$$NILL8614261680268364
fubarBLACKpi110001110101pichi0126011887834143GreenNILLYellow NILLfoo101000 $$$
RedTueNULLThu2.814091424413091162.8 WHITE$WHITE60620358244865230211111773156587'pi
Yellow3.1415$$$$$
"Aug3.1415VOIDBLACK0810996065354809$$$NULLfoo$$Orange6850772642048628WedBLACK
BLACKBluepi 70173555329860651869981769139132phi$$$$$$3.14Feb2.86083883638401362
6420029454020029WHITE26787889786973463.14 3.14 Mon`92272417810036027222591368318424$$$tao
fooTue"1147142436072964AprPurpleSep
Okay so, at the beginning of q5data we see 0111102. and right after this we see: 82575525572371251 (17 digits)
I'd like it to output the last 16 digits (2575525572371251)
Thank you :)

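To anchor the match at the end of a run of digits, use \b
grep -o '[0-9]\{16\}\b' q5data.txt
so this will match 16 digits up to a word boundary. Letters and underscores count as word characters, so a run that is immediately followed by a letter (as in 82575525572371251Fri) is not matched this way.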

If you want to capture digits in runs terminated by any non-digit character, you need a negative lookahead (this requires the -P option, a GNU grep extension that POSIX grep does not have)
$ grep -Po '[0-9]{16}(?![0-9])' q5data.txt
e.g.
$ echo "12345678901234567890aaa" | grep -Po '[0-9]{16}(?![0-9])'
5678901234567890

If you want the last 16 digits from every run of 16 or more digits, then you could filter through grep twice:
grep -Eo '[0-9]{16,}' <q5data.txt | grep -Eo '.{16}$'
The first selects all runs of 16 or more digits, and the second selects the last 16 characters from each run.
Testing this on the first line of your input file gives:
$ grep -Eo '[0-9]{16,}' <<<'0111102.82575525572371251FriThuSat32169716436971243.1415 foo100001$$$3.14153' | grep -Eo '.{16}$'
2575525572371251
2169716436971243
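
As a further check, the same two-stage filter can be run on the two input lines that produced the four numbers in the question. This is a sketch, assuming a grep that supports -o (GNU and BSD grep do). The function name last16 is made up for the example.

```shell
#!/bin/sh
# last16: read text on stdin and print the last 16 digits of every run
# of 16 or more digits (hypothetical helper name).
last16() {
    grep -Eo '[0-9]{16,}' | grep -Eo '.{16}$'
}

# The last two lines of q5data.txt, single-quoted so the shell leaves
# the backtick, the dollar signs and the double quote alone.
result=$(printf '%s\n' \
    '6420029454020029WHITE26787889786973463.14 3.14 Mon`92272417810036027222591368318424$$$tao' \
    'fooTue"1147142436072964AprPurpleSep' | last16)
printf '%s\n' "$result"
```

The first grep isolates each run of digits on its own line, which is what lets the second grep use the end-of-line anchor.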

grep -Eo '([0-9]{16})$' q5data.txt

Regular Expression To exclude sub-string name(job corps) Includes at least 1 upper case letter, 1 lower case letter, 1 number and 1 symbol except "#"

Regular Expression To exclude sub-string name(job corps)
Includes at least 1 upper case letter, 1 lower case letter, 1 number and 1 symbol except "#"
I have written something like below :
^((?!job corps).)(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[!#$%^&*]).*$
I tested the above regular expression, but it is not working for the special characters.
Can anyone guide me on this?

If I understand your requirements correctly, you can use this pattern:
^(?![^a-z]*$|[^A-Z]*$|[^0-9]*$|[^!#$%^&*]*$|.*?job corps)[^#]*$
If you only want to allow characters from [a-zA-Z0-9^#$%&*], change the pattern to:
^(?![^a-z]*$|[^A-Z]*$|[^0-9]*$|[^!#$%^&*]*$|.*?job corps)[a-zA-Z0-9^#$%&*]*$
details:
^ # start of the string
(?! # not followed by any of these cases
[^a-z]*$ # no lowercase letter before the end
|
[^A-Z]*$ # no uppercase letter before the end
|
[^0-9]*$ # no digit before the end
|
[^!#$%^&*]*$ # none of the symbols !#$%^&* before the end
|
.*?job corps # any characters and "job corps"
)
[^#]* # characters that are not a #
$ # end of the string
Note: you can write the range #$%& as #-& to save a character.
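
The first pattern can be exercised from a shell with a few sample strings. This is a sketch, assuming GNU grep built with PCRE support (the -P option). The helper name is_valid and the sample strings are invented for the example.

```shell
#!/bin/sh
# is_valid: succeed if the argument satisfies the first pattern above.
# Hypothetical helper; -P needs a grep with PCRE support.
is_valid() {
    printf '%s\n' "$1" |
        grep -Pq '^(?![^a-z]*$|[^A-Z]*$|[^0-9]*$|[^!#$%^&*]*$|.*?job corps)[^#]*$'
}

# One string that meets every rule, then one violation per rule:
# no uppercase, no symbol, contains #, contains "job corps".
for candidate in 'Passw0rd!' 'password1!' 'Passw0rd' 'Passw0rd#' 'job corps Aa1!'; do
    if is_valid "$candidate"; then
        printf 'accepted: %s\n' "$candidate"
    else
        printf 'rejected: %s\n' "$candidate"
    fi
done
```

Only the first candidate should be accepted. The single quotes around the pattern keep the shell from touching $, ! and *.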

stribizhev, your answer is correct
^(?!.*job corps)(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[!#$%^&*])(?!.*#).*$
You can verify the expression at the following URL:
http://www.freeformatter.com/regex-tester.html

How to separate unique characters from several words in a "indic" text file?

I've a plain text file.
> Input: इंजेक्शन इंटरनॅशनल इंटिग्रेटेड इंटिरिअर इंडस्ट्री
All words are separated by one or more spaces. I want to collect all unique chars from the text file. I'm looking for a unix command; the order of the result chars is not important.
> Expected result: इं जे क्श न ट र नॅ श ल इ्रे टे ड टि रिअ र ड स्ट्री
With the command Klaus has provided
cat <file>|sed -e 's/\(.\)/\1\n/g'|sort -u|tr -d '\n'
Result comes as:
ं अ इ क ग ज ट ड न र ल श सिीॅे्
I don't want to separate horizontal or vertical conjuncts or dependent vowels from its base character.
I just want to separate complete characters in a word from each other.
Can we achieve this with UNIX commands?
"base character" + "dependent vowel" = "complete character"
- क ा का
- क ि कि
Klaus's command works for English text only, but it doesn't work with Indic languages such as Hindi.
Input: hi1 hello-2 how!3 "are4 ?you5
result: h i e l o w a r y u 1 2 3 4 5 - ! "
Note:- You have to install Indic support in your OS.
Also, download Mangal font from http://hindi-fonts.com/fonts/Mangal

Try this:
cat <file>|sed -e 's/\(.\)/\1\n/g'|sort -u|tr -d '\n'
or simplified (taken from fedorqui's comment, thanks! I had never seen & used in the replacement part before. Good to learn something new!):
sed 's/./&\n/g' <file> | sort -u | tr -d '\n'
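
The pipeline puts a newline after every single character that . matches, then removes duplicates, so anything stored as a separate character (such as a dependent vowel sign) ends up on its own line. A small sketch on ASCII input, assuming GNU sed (\n in the replacement is a GNU extension). The wrapper name unique_chars is made up, and LC_ALL=C is added only to make the sort order predictable.

```shell
#!/bin/sh
# unique_chars: hypothetical wrapper around the pipeline above.
# Reads text on stdin and prints each distinct character once.
unique_chars() {
    sed 's/./&\n/g' | LC_ALL=C sort -u | tr -d '\n'
}

# The space between the words is a character too, so it shows up
# in the result alongside the letters, digits and the hyphen.
printf 'hi1 hello-2\n' | unique_chars
printf '\n'
```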

Grep examples - can't understand

Given the following commands:
ls | grep ^b[^b]*b[^b]
ls | grep ^b[^b]*b[^b]*
I know that ^ marks the start of the line, but can anyone give me a brief explanation about
these commands? what do they do? (Step by step)
thanks!

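^ can mean two things:
it marks the beginning of a line,
or it negates the character set (within [])
So the second command means:
lines starting with 'b'
followed by any number (0+) of characters other than 'b'
followed by another 'b'
followed by any number of characters other than 'b' (or nothing at all)
It will match
bb
bzzzzzb
bzzzzzbzzzzzzz
but not
zzzzbb
bzzzzzxzzzzzz
The first command ends in [^b] without the *, so it requires a non-'b' character right after the second 'b': of the names above it matches only bzzzzzbzzzzzzz.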

1) The name starts with b, continues with 0 or more characters that are not b, then a b, then one character that is not b.
2) The name starts with b, continues with 0 or more characters that are not b, then a b, then 0 or more characters that are not b.
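
The difference between the two patterns can be seen by feeding both the same list of names. This is a sketch in which printf stands in for ls and the names are the ones used in the answers above. The patterns are single-quoted so the shell does not try to expand the brackets and * as a file name glob.

```shell
#!/bin/sh
# names: stand-in for the output of ls (sample names only).
names() {
    printf '%s\n' bb bzzzzzb bzzzzzbzzzzzzz zzzzbb bzzzzzxzzzzzz
}

# First pattern: a non-b character must follow the second b.
first=$(names | grep '^b[^b]*b[^b]')
# Second pattern: the trailing [^b]* may match nothing at all.
second=$(names | grep '^b[^b]*b[^b]*')

printf 'first pattern:\n%s\nsecond pattern:\n%s\n' "$first" "$second"
```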