I use grep very often and am familiar with its ability to return matching lines (by default) and non-matching lines (using the -v option). However, I want to be able to grep a file once to separate matching and non-matching lines.
If this is not possible, please let me know. I realize I could do this easily in perl or awk, but am curious if it is possible with grep.
Thanks!
If it does NOT have to be grep, this is a single-pass split based on a pattern: lines where the pattern is found go to file1, and lines where it is not found go to file2.
awk '/pattern/ {print $0 > "file1"; next}{print $0 > "file2"}' inputfile
I had the exact same problem and I wrote a small Perl script for that [1]. It only accepts one argument: the regex to grep input on.
[1] https://gist.github.com/tonejito/c9c0bffd75d8c81483f9107c609439e1
It reads STDIN line by line and checks each line against the given regex; matched lines go to STDOUT and unmatched lines go to STDERR.
I made it this way because this tool sits in the middle of a pipeline, and I use shell redirection to save the files in their final location.
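The linked Perl script is not reproduced here, but the behavior it describes (matches to STDOUT, everything else to STDERR) can be roughly sketched with awk. The function name split_match and the sample data are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# split_match REGEX (hypothetical helper): reads stdin, writes matching
# lines to stdout and non-matching lines to stderr, in a single pass.
split_match() {
    awk -v re="$1" '$0 ~ re { print; next } { print > "/dev/stderr" }'
}

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Shell redirection decides where each stream ends up.
printf 'apple\nbanana\ncherry\n' |
    split_match 'an' > "$workdir/matched.txt" 2> "$workdir/unmatched.txt"

cat "$workdir/matched.txt"
```

Note that awk processes backslash escapes in values passed with -v, so a regex containing backslashes would need them doubled.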
Step 1: Read the file
Step 2: Replace spaces with newlines and save the result in a temporary file
Step 3: Get only the lines containing '_' from the temporary file and save them into multiwords.txt
Step 4: Exclude the lines that contain '_' from the temporary file, then save the result into singlewords.txt
Step 5: Delete the temporary file
tr ' ' '\n' < file > tmp.txt; grep '_' tmp.txt > multiwords.txt; grep -v '_' tmp.txt > singlewords.txt; rm -f tmp.txt
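The same five steps can also be collapsed into one pass with no temporary file, by letting awk route each word as it arrives. This is a sketch under the assumption that awk is acceptable here; the sample input and the multi/single variable names are made up for the example.

```shell
#!/bin/sh
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

printf 'foo bar_baz qux\nalpha_beta gamma\n' > "$workdir/file"

# Put each space-separated word on its own line, then send words that
# contain "_" to multiwords.txt and all other words to singlewords.txt.
tr ' ' '\n' < "$workdir/file" |
    awk -v multi="$workdir/multiwords.txt" -v single="$workdir/singlewords.txt" '
        /_/ { print > multi; next }
            { print > single }
    '

cat "$workdir/multiwords.txt"
```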
Assume that there are two files
File1 - lookup.txt
CAN
USD
INR
EUR
Another file Input.txt
1~Canada~CAN
2~United States of America~USD
3~Brazil~BRL
Both files may be huge, hypothetically several thousand records. I'm trying to pick out the records in Input.txt whose code appears in the lookup file.
The expected output should be
1~Canada~CAN
2~United States of America~USD
I tried to do something like below
#!/bin/sh
lookupFile=$1 #lookup.txt
inputFile=$2 #input.txt
outputFile=$3 #output.txt
while IFS= read -r line
do
awk -F'~' '{if ($3==$line) print >> $outputFile}' $inputFile
done < "$lookupFile"
But I'm getting error like
awk: cmd. line:1: (FILENAME=input.txt FNR=2) fatal: can't redirect to
How can I fix this issue? Also, if the files are really huge, with several thousand records to search, is this an efficient way?
With your shown samples, please try the following awk code. This can be done in a single awk; we just need to set the field separator to ~ before input.txt is read.
awk 'FNR==NR{arr[$0];next} ($3 in arr)' lookup.txt FS="~" input.txt
Explanation:
awk ' ##starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when lookup.txt is being read.
arr[$0] ##Creating array arr with $0 as index.
next ##next to skip all further statements from here.
}
($3 in arr) ##If $3 is present in arr then print that line.
' lookup.txt FS="~" input.txt ##Mentioning Input_files and setting FS to ~ before input.txt
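As for the error in the question: inside single quotes the shell does not expand $line or $outputFile, so awk treats them as its own (unset) variables. A minimal sketch of two common ways around that follows: let the shell do the output redirection, and pass any shell value into awk with -v. The array name codes and the variable code are illustrative assumptions.

```shell
#!/bin/sh
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

printf 'CAN\nUSD\nINR\nEUR\n' > "$workdir/lookup.txt"
printf '1~Canada~CAN\n2~United States of America~USD\n3~Brazil~BRL\n' > "$workdir/input.txt"

# One pass over each file; the shell, not awk, redirects the output.
awk -F'~' 'FNR == NR { codes[$0]; next } $3 in codes' \
    "$workdir/lookup.txt" "$workdir/input.txt" > "$workdir/output.txt"

# When a single shell value must be compared inside awk, pass it with -v.
code=USD
match=$(awk -F'~' -v code="$code" '$3 == code' "$workdir/input.txt")
echo "$match"
```

Because each file is read only once, this avoids rerunning awk over input.txt for every line of the lookup file, which is what the loop in the question does.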
A non-awk solution whose performance you could compare:
$ grep -wFf lookup.txt input.txt
1~Canada~CAN
2~United States of America~USD
Warning: this does not restrict the match to the last field. So if some values in lookup.txt can also be found elsewhere in input.txt, prefer another solution. Or, if lookup.txt contains nothing that could be interpreted as a regular expression operator, preprocess it before running grep. Example with bash, sed and grep:
$ grep -f <( sed 's/.*/~&$/' lookup.txt ) input.txt
1~Canada~CAN
2~United States of America~USD
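To make the warning concrete, here is a small sketch with one extra, made-up record (4~CAN Corp~BRL) in which a lookup value appears outside the last field. It writes the anchored patterns to a file instead of using bash process substitution, so it should also run in a plain POSIX shell; the file name patterns.txt is an assumption.

```shell
#!/bin/sh
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

printf 'CAN\nUSD\n' > "$workdir/lookup.txt"
printf '1~Canada~CAN\n3~Brazil~BRL\n4~CAN Corp~BRL\n' > "$workdir/input.txt"

# Word match anywhere on the line: also picks up "CAN Corp".
loose=$(grep -wFf "$workdir/lookup.txt" "$workdir/input.txt")

# Anchor each lookup value to the end of the line, after a "~".
sed 's/.*/~&$/' "$workdir/lookup.txt" > "$workdir/patterns.txt"
strict=$(grep -f "$workdir/patterns.txt" "$workdir/input.txt")

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```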
FILE1.TXT
0020220101
or
01 20220101
I need to extract the date part from the file, where the date starts with 2.
Options tried:
t_FILE_DT1='awk -F"2" '{PRINT $NF}' FILE1.TXT'
t_FILE_DT2='cut -d'2' -f2- FILE1.TXT'
echo "$t_FILE_DT1"
echo "$t_FILE_DT2"
1st output : 0101
2nd output : 0220101
Expected Output: 20220101
I'm new to Linux scripting. Could someone help me see where I'm going wrong?
Use grep like so:
printf '0020220101\n01 20220101\n' | grep -P -o '\d{8}\b'
20220101
20220101
Here, GNU grep uses the following options:
-P : Use Perl regexes.
-o : Print only the matches (each match on its own output line), not the entire lines.
SEE ALSO:
grep manual
perlre - Perl regular expressions
Using any awk:
$ awk '{print substr($0,length()-7)}' file
20220101
20220101
The above was run on this input file:
$ cat file
0020220101
01 20220101
Regarding PRINT $NF in your question - PRINT != print. Get out of the habit of using all-caps unless you're writing Cobol. See correct-bash-and-shell-script-variable-capitalization for some reasons.
The 2 in your scripts tells awk and cut to use the character 2 as the field separator, so each will carve up the input into substrings everywhere a 2 occurs.
The 's in your question are single quotes, which make strings literal. You were intending to use backticks, `cmd`, but those are deprecated in favor of $(cmd) anyway.
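Putting those points together, the assignment from the question could be written along these lines. This is a sketch that reuses the question's variable name and sample data and takes the last eight characters of each line; the temporary directory is only there to keep the example self-contained.

```shell
#!/bin/sh
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

printf '0020220101\n01 20220101\n' > "$workdir/FILE1.TXT"

# $(...) captures the command's output; lowercase print is the awk keyword.
t_FILE_DT1=$(awk '{ print substr($0, length($0) - 7) }' "$workdir/FILE1.TXT")
echo "$t_FILE_DT1"
```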
Instead of looking for what comes "after" the 2 (and having to worry about whether a space is involved as well), I would think about extracting the last 8 characters, which you know for a fact is your date.
input="/path/to/txt/file/FILE1.TXT"
while IFS= read -r line
do
# read in the last 8 characters of $line .. You KNOW this is the date ..
# No need to worry about exact matching at that point, or spaces ..
myDate=${line: -8}
echo "$myDate"
done < "$input"
About the cut and awk commands that you tried:
Using awk -F"2" '{PRINT $NF}' file sets the field separator to 2, and $NF is the last field, so the value of the last field is 0101
Using cut -d'2' -f2- file also uses a delimiter of 2 and then prints all fields starting at the second field, which gives 0220101
If you want to match the 2 followed by 7 digits until the end of the string:
awk '
match ($0, /2[0-9]{7}$/) {
print substr($0, RSTART, RLENGTH)
}
' file
Output
20220101
The accepted answer shows how to extract the first eight digits, but that's not what you asked.
grep -o '2.*' file
will extract from the first occurrence of 2, and
grep -o '2[0-9]*' file
will extract all the digits after every occurrence of 2. If you specifically want eight digits, try
grep -Eo '2[0-9]{7}'
maybe also with a -w option if you want to only accept a match between two word boundaries. If you specifically want only digits after the first occurrence of 2, maybe try
sed -n 's/[^2]*\(2[0-9]*\).*/\1/p' file
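On the two sample lines from the question these commands all print the same thing, so the differences only show up on other input. The sketch below runs them on a made-up line that contains an early stray 2; the variable names are illustrative.

```shell
#!/bin/sh
sample='id2 date 20220101 end'

# From the first 2 to the end of the line.
from_first=$(printf '%s\n' "$sample" | grep -o '2.*')

# Every run of digits that starts with 2, one match per output line.
each_run=$(printf '%s\n' "$sample" | grep -o '2[0-9]*')

# Only a 2 followed by exactly seven more digits.
eight=$(printf '%s\n' "$sample" | grep -Eo '2[0-9]{7}')

# Only the digits at the first occurrence of 2.
first_only=$(printf '%s\n' "$sample" | sed -n 's/[^2]*\(2[0-9]*\).*/\1/p')

printf '%s\n---\n%s\n---\n%s\n---\n%s\n' "$from_first" "$each_run" "$eight" "$first_only"
```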
I want to grep a file for a particular text, save all the matching records to a new file, and also make sure the matched records are removed from the original file.
296949657|QL|163744584|163744581|20441||
292465754|RE|W757|3012|301316469|00|
296950717|RC|7264|00001|013|27082856203|
292465754|QL|191427266|191427266|16405||
296950717|RC|7264|AETNAACTIVE|HHRPPO|27082856203|
299850356|RC|7700|153447|0891185100102-A|W19007007201|
292465754|RE|W757|3029|301316469|00|
299850356|RC|7700|153447|0891185100104-A|W19007007201|
293695591|QL|743559415|743559410|18452||
297348183|RC|6602|E924|0048|CD101699303|
297348183|RC|6602|E924|0051|CD101699303|
108327882|QL|613440276|613440275|17435||
I have written an awk command that works as expected for small files, but for larger files it does not. I'm sure I have missed something.
awk '{print $0 > ($0~/RC/?"RC_RECORDS":"TEST.DAT")}' TEST.DAT
Any thoughts on how to fix this?
Update 1
Now, in the above file, I always want to check the value of column two. If it is RC, the record should move to the RC_RECORDS file, and if it is RE, it should move to RE_RECORDS. How can this be done?
Case 1:
So, for example, if I have a record such as
108327882|RE|613440276|613440275|RC||
then it should go to RE_RECORDS file.
Case 2:
108327882|RC|613440276|613440275|RE||
then it should go to RC_RECORDS
Case 3:
108327882|QL|613440276|613440275|RC||
then it should not go to either RE_RECORDS or RC_RECORDS
Case 4:
108327882|QL|613440276|613440275|RE||
then it should not go to either RE_RECORDS or RC_RECORDS
My hunch is that
awk '/\|RC\|/ {print > "RC_RECORDS.DAT";next} {print > "NEWTEST.DAT"}' TEST.DAT | awk '$2 == "RC"'
awk '/\|RE\|/ {print > "RE_RECORDS.DAT";next} {print > "FINAL_NEWTEST.DAT"}' NEWTEST.DAT | awk '$2 == "RE"'
but I wanted to check if there's a better and quicker solution out there.
I think this is what you want:
Option 1
awk -F'|' '
$2=="RC" {print >"RC_RECORDS.TXT";next}
$2=="RE" {print >"RE_RECORDS.TXT";next}
{print >"OTHER_RECORDS.TXT"}' file
You can put it all on one line if you prefer, like this:
awk -F'|' '$2=="RC"{print >"RC_RECORDS.TXT";next} $2=="RE"{print >"RE_RECORDS.TXT";next}{print >"OTHER_RECORDS.TXT"}' file
Option 2
Or you can see how grep compares for speed/readability:
grep -E "^[[:alnum:]]+\|RC\|" file > RC_RECORDS.TXT &
grep -E "^[[:alnum:]]+\|RE\|" file > RE_RECORDS.TXT &
grep -vE "^[[:alnum:]]+\|R[CE]\|" file > OTHER_RECORDS.TXT &
wait
Option 3
This solution uses 2 awk processes and maybe achieves better parallelism in the I/O. The first awk extracts the RC records to a file and passes the rest onwards. The second awk extracts the RE records to a file and passes the rest on to be written to the OTHER_RECORDS.TXT file.
awk -F'|' '$2=="RC"{print >"RC_RECORDS.TXT";next} 1' file | awk -F'|' '$2=="RE"{print >"RE_RECORDS.TXT";next} 1' > OTHER_RECORDS.TXT
I created an 88M record file (3 GB), and ran some tests on a desktop iMac as follows:
Option 1: 65 seconds
Option 2: 92 seconds
Option 3: 53 seconds
Your mileage may vary.
My file looks like this, i.e. 33% RE records, 33% RC records and rest junk:
00000000|RE|abcdef|ghijkl|mnopq|rstu
00000001|RC|abcdef|ghijkl|mnopq|rstu
00000002|XX|abcdef|ghijkl|mnopq|rstu
00000003|RE|abcdef|ghijkl|mnopq|rstu
00000004|RC|abcdef|ghijkl|mnopq|rstu
00000005|XX|abcdef|ghijkl|mnopq|rstu
00000006|RE|abcdef|ghijkl|mnopq|rstu
00000007|RC|abcdef|ghijkl|mnopq|rstu
00000008|XX|abcdef|ghijkl|mnopq|rstu
00000009|RE|abcdef|ghijkl|mnopq|rstu
Sanity Check
wc -l *TXT
29333333 OTHER_RECORDS.TXT
29333333 RC_RECORDS.TXT
29333334 RE_RECORDS.TXT
88000000 total
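The question also asked for the matched records to be removed from the original file. Writing to TEST.DAT while awk is still reading it is the likely reason the original one-liner only appeared to work on small inputs. A hedged sketch of one way to do this safely is to write the remainder to a temporary file and rename it afterwards; the name TEST.DAT.tmp and the three sample rows are assumptions.

```shell
#!/bin/sh
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

printf '%s\n' \
    '296949657|QL|163744584|163744581|20441||' \
    '296950717|RC|7264|00001|013|27082856203|' \
    '292465754|RE|W757|3012|301316469|00|' > "$workdir/TEST.DAT"

# RC records go to RC_RECORDS; everything else goes to a temporary file
# that replaces the original only if awk finished without error.
awk -F'|' -v rc="$workdir/RC_RECORDS" -v rest="$workdir/TEST.DAT.tmp" '
    $2 == "RC" { print > rc; next }
               { print > rest }
' "$workdir/TEST.DAT" && mv "$workdir/TEST.DAT.tmp" "$workdir/TEST.DAT"

cat "$workdir/TEST.DAT"
```

If every record is an RC record, the temporary file is never created and the mv fails, leaving the original untouched; a real script would need to handle that case explicitly.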
I'm trying to read multiple files in an AWK script, but when I switch between files, the field separator (FS) needs to change as well. At this point I have:
FILENAME=="A.txt"{
FS=";"
# DoSomething
}
FILENAME=="B.txt"{
FS=" - "
# DoSomething
}
But as you might know, the FS will not get set correctly for the first line of the file. How can I solve this?
You can specify the field separators at the command line:
awk -f a.awk FS=";" A.txt FS=" - " B.txt
In this way, the field separator will change for each file.
From http://www.delorie.com/gnu/docs/gawk/gawk_82.html :
Any awk variable can be set by including a variable assignment among
the arguments on the command line when awk is invoked
and
With it, a variable is set either at the beginning of the awk run or
in between input files.
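A small self-contained sketch of that command-line form, with made-up file contents, printing the field count and second field of each record:

```shell
#!/bin/sh
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

printf 'a;b;c\n' > "$workdir/A.txt"
printf 'x - y - z\n' > "$workdir/B.txt"

# Each FS=... assignment takes effect just before the file that follows it
# is opened, so the first record of every file is already split correctly.
result=$(awk '{ print NF, $2 }' FS=';' "$workdir/A.txt" FS=' - ' "$workdir/B.txt")
echo "$result"
```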
You can do it as @HakonHaegland suggests by setting FS between file names in the arg list if you are listing the files individually. That is the typical way to do this.
Alternatively, if you can't do that (e.g. because you need to use * or similar for the file list), you can use BEGINFILE if you are using GNU awk. Otherwise, you can do it the way you already are, adding an assignment of $0 to itself after changing FS to force awk to re-split the record, e.g.:
$ cat file
a-b-c
d e f
$ awk '{print NF, $1}' file
1 a-b-c
3 d
$ awk '{FS="-"; $0=$0; print NF, $1}' file
3 a
1 d e f
If you are going to do it that way it's best done just once at the start of each file (when FNR==1).
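A sketch of that FNR==1 variant, assuming the separator can be chosen from the file name; the file names and contents here are invented for the example.

```shell
#!/bin/sh
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

printf 'a;b;c\nd;e;f\n' > "$workdir/A.txt"
printf 'x - y - z\nu - v - w\n' > "$workdir/B.txt"

# On the first record of each file: choose FS from the file name, then
# assign $0 to itself so that this record is re-split with the new FS.
result=$(awk '
    FNR == 1 { FS = (FILENAME ~ /A\.txt$/) ? ";" : " - "; $0 = $0 }
    { print NF, $2 }
' "$workdir/A.txt" "$workdir/B.txt")
echo "$result"
```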
I need to get the records from a text file in Unix. The delimiter is multiple blanks. For example:
2U2133   1239
1290fsdsf   3234
From this, I need to extract
1239
3234
The delimiter for all records will be always 3 blanks.
I need to do this in a Unix script (.scr) and write the output to another file or use it as an input to a do-while loop. I tried the below:
while read readline
do
read_int=`echo "$readline"`
cnt_exc=`grep "$read_int" ${Directory path}/file1.txt| wc -l`
if [ $cnt_exc -gt 0 ]
then
int_1=0
else
int_2=0
fi
done < awk -F' ' '{ print $2 }' ${Directoty path}/test_file.txt
test_file.txt is the input file and file1.txt is a lookup file. But the above is not working and gives me syntax errors near awk -F.
I tried writing the output to a file. The following worked in command line:
more test_file.txt | awk -F' ' '{ print $2 }' > output.txt
This is working and writing the records to output.txt in command line. But the same command does not work in the unix script (It is a .scr file)
Please let me know where I am going wrong and how I can resolve this.
Thanks,
Visakh
The job of replacing multiple delimiters with just one is left to tr:
cat <file_name> | tr -s ' ' | cut -d ' ' -f 2
tr translates or deletes characters, and is perfectly suited to prepare your data for cut to work properly.
The manual states:
-s, --squeeze-repeats
replace each sequence of a repeated character that is
listed in the last specified SET, with a single occurrence
of that character
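Applied to the two sample records from the question (with their three-blank delimiter), the difference between cut alone and cut after squeezing looks like this in a quick sketch; the variable names are illustrative.

```shell
#!/bin/sh
sample='2U2133   1239
1290fsdsf   3234'

# Without squeezing, field 2 is the empty string between the first two blanks.
unsqueezed=$(printf '%s\n' "$sample" | cut -d ' ' -f 2)

# After tr -s collapses each run of blanks into one, field 2 is the number.
squeezed=$(printf '%s\n' "$sample" | tr -s ' ' | cut -d ' ' -f 2)

echo "$squeezed"
```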
It depends on the version or implementation of cut on your machine. Some versions support an option, usually -i, that means 'ignore blank fields' or, equivalently, allow multiple separators between fields. If that's supported, use:
cut -i -d' ' -f 2 data.file
If not (and it is not universal — and maybe not even widespread, since neither GNU nor MacOS X have the option), then using awk is better and more portable.
You need to pipe the output of awk into your loop, though:
awk -F' ' '{print $2}' ${Directory_path}/test_file.txt |
while read readline
do
read_int=`echo "$readline"`
cnt_exc=`grep "$read_int" ${Directory_path}/file1.txt| wc -l`
if [ $cnt_exc -gt 0 ]
then int_1=0
else int_2=0
fi
done
The only residual issue is whether the while loop is in a sub-shell and therefore not modifying your main shell script's variables, just its own copy of those variables.
With bash, you can use process substitution:
while read readline
do
read_int=`echo "$readline"`
cnt_exc=`grep "$read_int" ${Directory_path}/file1.txt| wc -l`
if [ $cnt_exc -gt 0 ]
then int_1=0
else int_2=0
fi
done < <(awk -F' ' '{print $2}' ${Directory_path}/test_file.txt)
This leaves the while loop in the current shell, but arranges for the output of the command to appear as if from a file.
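Process substitution is a bash feature. For a plain POSIX shell, one alternative (swapped in here, not taken from the answer) is to feed the loop from a here-document that contains the command's output, which also keeps the loop in the current shell. The counters found and missing and the sample files are assumptions made for the sketch.

```shell
#!/bin/sh
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

printf '2U2133   1239\n1290fsdsf   3234\n' > "$workdir/test_file.txt"
printf '1239\n9999\n' > "$workdir/file1.txt"

found=0
missing=0
# The here-document carries awk's output, so the loop is not part of a
# pipeline and the counters are still set after "done".
while read -r value
do
    if grep -q -- "$value" "$workdir/file1.txt"
    then found=$((found + 1))
    else missing=$((missing + 1))
    fi
done <<EOF
$(awk '{ print $2 }' "$workdir/test_file.txt")
EOF

echo "found=$found missing=$missing"
```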
The blank in ${Directory path} is not normally legal — unless it is another Bash feature I've missed out on; you also had a typo (Directoty) in one place.
Other ways of doing the same thing aside, the error in your program is this: You cannot redirect from (<) the output of another program. Turn your script around and use a pipe like this:
awk -F' ' '{ print $2 }' "${Directory_path}/test_file.txt" | while read readline
etc.
Besides, the use of "readline" as a variable name may or may not get you into problems.
In this particular case, you can use the following line
sed 's/   /\t/g' <file_name> | cut -f 2
to get your second columns.
In bash you can start from something like this:
for n in $(cut -d " " -f 4 "${Directory_path}/test_file.txt")
do
grep -c "$n" "${Directory_path}"/file*.txt
done
This should have been a comment, but since I cannot comment yet, I am adding this here.
This is from an excellent answer here: https://stackoverflow.com/a/4483833/3138875
tr -s ' ' <text.txt | cut -d ' ' -f4
tr -s '<character>' squeezes multiple repeated instances of <character> into one.
It's not working in the script because of the typo in "Directo*t*y path" (last line of your script).
Cut isn't flexible enough. I usually use Perl for that:
perl -F'   ' -lane 'print $F[1]' file.txt
Instead of a triple space after -F you can put any Perl regular expression. You access fields as $F[n], where n is the field number (counting starts at zero). This way there is no need for sed or tr.