How to interleave lines from two text files - unix

What's the easiest/quickest way to interleave the lines of two (or more) text files? Example:
File 1:
line1.1
line1.2
line1.3
File 2:
line2.1
line2.2
line2.3
Interleaved:
line1.1
line2.1
line1.2
line2.2
line1.3
line2.3
Sure, it's easy to write a little Perl script that opens them both and does the task. But I was wondering if it's possible to get away with less code, maybe a one-liner using Unix tools?

paste -d '\n' file1 file2
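Since the question also asks about more than two files, here is a minimal sketch of the same paste call with three inputs. The file names and contents are made up for the demo; paste reuses the single newline delimiter between every pair of files.

```shell
# Hypothetical sample files, created in a temporary directory for the demo.
tmp=$(mktemp -d)
printf '%s\n' line1.1 line1.2 > "$tmp/file1"
printf '%s\n' line2.1 line2.2 > "$tmp/file2"
printf '%s\n' line3.1 line3.2 > "$tmp/file3"

# The delimiter list holds only a newline, so it is reused between all files.
interleaved=$(paste -d '\n' "$tmp/file1" "$tmp/file2" "$tmp/file3")
printf '%s\n' "$interleaved"

rm -rf "$tmp"
```

If the files have different lengths, paste emits empty lines for the exhausted ones.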

Here's a solution using awk:
awk '{print; if(getline < "file2") print}' file1
produces this output:
line 1 from file1
line 1 from file2
line 2 from file1
line 2 from file2
...etc
Using awk can be useful if you want to add some extra formatting to the output, for example if you want to label each line based on which file it comes from:
awk '{print "1: "$0; if(getline < "file2") print "2: "$0}' file1
produces this output:
1: line 1 from file1
2: line 1 from file2
1: line 2 from file1
2: line 2 from file2
...etc
Note: this code assumes that file1 is of greater than or equal length to file2.
If file1 contains more lines than file2 and you want to output blank lines for file2 after it finishes, add an else clause to the getline test:
awk '{print; if(getline < "file2") print; else print ""}' file1
or
awk '{print "1: "$0; if(getline < "file2") print "2: "$0; else print"2: "}' file1
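For the opposite case, where file2 is longer than file1, one possible sketch is to drain the remaining file2 lines in an END block. The variable name other and the sample contents are assumptions for the demo, and reading into a named variable (line) avoids overwriting $0.

```shell
# Hypothetical sample files: file2 is longer than file1.
tmp=$(mktemp -d)
printf '%s\n' a1 a2 > "$tmp/file1"
printf '%s\n' b1 b2 b3 b4 > "$tmp/file2"

merged=$(awk -v other="$tmp/file2" '
    # Print the current file1 line, then the next file2 line if there is one.
    { print; if ((getline line < other) > 0) print line }
    # After file1 ends, print whatever is still unread in file2.
    END { while ((getline line < other) > 0) print line }
' "$tmp/file1")
printf '%s\n' "$merged"

rm -rf "$tmp"
```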

@Sujoy's answer points in a useful direction. You can add line numbers, sort, and strip the line numbers:
(cat -n file1 ; cat -n file2 ) | sort -n | cut -f2-
Note (of interest to me) this needs a little more work to get the ordering right if instead of static files you use the output of commands that may run slower or faster than one another. In that case you need to add/sort/remove another tag in addition to the line numbers:
(cat -n <(command1...) | sed 's/^/1\t/' ; cat -n <(command2...) | sed 's/^/2\t/' ; cat -n <(command3) | sed 's/^/3\t/' ) \
| sort -n | cut -f2- | sort -n | cut -f2-
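One caveat with the numbering approach: as far as I know, GNU sort breaks ties between equal numeric keys by comparing the whole line, so lines that share a line number can come out ordered by content rather than by file. A minimal sketch, assuming a sort that supports -s (stable sort, available in GNU and BSD sort but not required by POSIX); the sample contents are made up so that the difference shows.

```shell
# Hypothetical sample files whose contents do not sort in file order.
tmp=$(mktemp -d)
printf '%s\n' zeta alpha > "$tmp/file1"
printf '%s\n' beta gamma > "$tmp/file2"

# -s keeps the input order for lines with the same line number,
# so the file1 line always precedes the file2 line.
interleaved=$( (cat -n "$tmp/file1"; cat -n "$tmp/file2") | sort -s -n | cut -f2- )
printf '%s\n' "$interleaved"

rm -rf "$tmp"
```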

With GNU sed:
sed 'R file2' file1
Output:
line1.1
line2.1
line1.2
line2.2
line1.3
line2.3

Here's a GUI way to do it: Paste them into two columns in a spreadsheet, copy all cells out, then use regular expressions to replace tabs with newlines.

cat file1 file2 |sort -t. -k 2.1
Here it is specified that the separator is "." and that the sort key starts at the first character of the second field.

Related

Append data from 1 file to another using AWK

I have an already existing script to check the exclusive data between 2 files and load it in 3rd file. The command is below.
var='FNR == NR {keys[$1 $2]; next} !($1 $2 in keys)'
awk -F\| "$var" file1.dat file2.dat > file3.dat
The requirement is to reuse the same but just append the data from file2 to file3 ignoring file1. I tried to do the below but it is spooling the data from both file1 and file2. All I need is, though there are 2 file names provided in the awk command, only the 2nd file data to be appended.
var='{print $0}'
awk -F\| "$var" file1.dat file2.dat > file3.dat
Can anyone help with the exact command?
Below is the data in each file and expected output.
File1 (Can have 0 or more) - We should not look at this file at all
123
456
789
File2:
123
ABC
XYZ
456
Expected output in File3 (All from file2 and just ignore file1 input, but I have to have the file1 name in awk command)
123
ABC
XYZ
456
If you must use file1 and file2 in arguments to awk command and want to output content from file2 only then you can just use:
awk 'BEGIN {delete ARGV[1]} 1' file1 file2 > file3
123
ABC
XYZ
456
delete ARGV[1] will delete first argument from argument list.
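Since the question is about appending to file3, here is a small sketch of the same command with >> instead of >. The temporary directory and the pre-existing line in file3 are assumptions for the demo.

```shell
# Hypothetical sample files matching the question's data.
tmp=$(mktemp -d)
printf '%s\n' 123 456 789 > "$tmp/file1"
printf '%s\n' 123 ABC XYZ 456 > "$tmp/file2"
printf '%s\n' existing > "$tmp/file3"

# file1 is named on the command line but removed from ARGV before any input is read.
awk 'BEGIN { delete ARGV[1] } 1' "$tmp/file1" "$tmp/file2" >> "$tmp/file3"

result=$(cat "$tmp/file3")
printf '%s\n' "$result"

rm -rf "$tmp"
```

Because file1 is dropped before awk opens any input, this also behaves the same when file1 is empty.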
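With your shown samples, please try the following awk code (written and tested in GNU awk). It uses nextfile to skip the first input file, file1, and reads from the second file onward. Note that NR==1 assumes file1 has at least one line; if file1 is empty, the first record belongs to file2 and file2 is skipped instead.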
awk 'NR==1{nextfile} 1' file1 file2
also remember not to waste time splitting unneeded fields
{m,g}awk 'BEGIN { delete ARGV[_^=FS="^$"] }_' file1 file2
and it's MUUUCH faster not reading it a row at a time :
mawk2 'BEGIN { delete ARGV[_^=FS="^$"] }_' "${m2p}" "${m3t}"
out9: 1.85GiB 0:00:01 [1.11GiB/s] [1.11GiB/s] [ <=>]
f9d2e18d22eb58e5fc2173863cff238e stdin
mawk2 'BEGIN { delete ARGV[_^=RS=FS="^$"] }_^(ORS=__)' "${m2p}" "${m3t}"
out9: 1.85GiB 0:00:00 [1.92GiB/s] [1.92GiB/s] [<=> ]
f9d2e18d22eb58e5fc2173863cff238e stdin
and try to avoid the slow default mode of gawk :
gawk 'BEGIN { delete ARGV[_^=FS="^$"] }_' "${m2p}" "${m3t}"
out9: 1.85GiB 0:00:03 [ 620MiB/s] [ 620MiB/s] [ <=> ]
f9d2e18d22eb58e5fc2173863cff238e stdin

Using Awk how to merge fields between files, F2 of file1 plus last 8char of F2 in file 2

I have two files, file1 and file2. I need to replace the F2 value of file1 with F2 of file1 followed by the last 8 characters of F2 in file2.
File 1 :
123456|AAAAAAA|BBBBBB|CCCCCCC
444444|kkkkkkk|rrrrrr|NNNNNNN
File 2:
AAAAAAA|DDDDDD12345678
kkkkkkk|987654321aaaaa
Expected Output
123456|AAAAAAA12345678|BBBBBB|CCCCCCC
444444|kkkkkkk321aaaaa|rrrrrr|NNNNNNN
I have tried the awk command below, but I am not sure how to fetch the last 8 characters of F2 from file2:
# awk -F"|" 'NR==FNR{a[$1]=$2} NR>FNR{$2=$2a[$2];print}' OFS='|' File2 File1
123456|AAAAAAADDDDDD12345678|BBBBBB|CCCCCCC
444444|kkkkkkk987654321aaaaa|rrrrrr|NNNNNNN
In order to get the last 8 characters of a[$2], you need to use substr:
substr(a[$2],length(a[$2])-7)
The above takes the substring of a[$2] starting at position length(a[$2])-7.
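To see the substr arithmetic in isolation, here is a minimal sketch that applies it to the two F2 values from File 2. With a 14-character string, length-7 is 7, so the substring runs from character 7 to the end, which is 8 characters.

```shell
# Print the last 8 characters of each input line.
last8=$(printf '%s\n' 'DDDDDD12345678' '987654321aaaaa' |
    awk '{ print substr($0, length($0) - 7) }')
printf '%s\n' "$last8"
```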
With that one change, your code produces your desired output:
$ awk -F"|" 'NR==FNR{a[$1]=$2} NR>FNR{$2=$2 substr(a[$2],length(a[$2])-7);print}' OFS='|' File2 File1
123456|AAAAAAA12345678|BBBBBB|CCCCCCC
444444|kkkkkkk321aaaaa|rrrrrr|NNNNNNN
As Ghoti points out in the comments, the more usual awk style is to use next so as to avoid the need for the second condition, NR>FNR, as follows:
awk -F"|" 'NR==FNR{a[$1]=$2;next} {$2=$2 substr(a[$2],length(a[$2])-7);print}' OFS='|' File2 File1
When awk encounters next, it skips the rest of the commands and starts over on the next line.
As awk programmers often value conciseness over clarity, it is common to see the print statement replaced with a 1:
awk -F"|" 'NR==FNR{a[$1]=$2;next} {$2=$2 substr(a[$2],length(a[$2])-7)} 1' OFS='|' File2 File1
In this case, 1 is a condition and it always evaluates to true. Since no command is associated with that condition, the default command is executed which is print.

what is the alternate way to count the occurrence of each word without using 'uniq -c' command?

Is it possible to count the occurrence of each word like using uniq -c but with the count after the word rather than before?
Example scenario
Input file named test1.txt, which contains the following data
Renault:cilo:84563
Renault:cilo:84565
M&M:Thar:84566
Tata:nano:84567
M&M:quanto:84568
M&M:quanto:84569
The fields used in the above data are car_company:car_model:customerID
Desired result
cilo 2
Thar 1
nano 1
quanto 2
(car_model and number of cars sold grouped by car_model)
My code
cat test1.txt | cut -d: -f2 | uniq -c
Actual Result
2 cilo
1 Thar
1 nano
2 quanto
Is it possible to do the above process without using uniq -c ,so that I can swap the order of the fields (columns)?
You can use uniq, and simply post-process its output to swap the columns:
cut -d: -f2 test1.txt | uniq -c | awk '{printf "%s\t%s\n", $2, $1}'
EDIT: Added \n, as noted in a comment.
Save your command's output into a file named "badresult":
cat test1.txt | cut -d: -f2 | uniq -c > badresult
Then cut the seventh field and save it into a file named "counts" (use a space (" ") as the separator):
cut -d" " -f7 badresult > counts
Then cut the eighth field and save it into a file named "models" (again with a space as the separator):
cut -d" " -f8 badresult > models
Now you have your counts and models in separate files. All that remains is to show these two files side by side with the "pr" command (-m: one file per column, -T: no page headers). Note that the field numbers depend on how uniq -c pads its counts; with GNU uniq they hold only while every count is a single digit.
pr -m -T models counts
Using awk:
cat test1.txt | cut -d: -f2 | uniq -c | awk '{ t = $1; $1 = $2; $2 = t; print }'
The little awk code exchanges fields 1 and 2 using a temporary.
You just need awk for this:
$ awk -F: '{a[$2]++} END {for (i in a) print i, a[i]}' file
cilo 2
quanto 2
nano 1
Thar 1
This goes through every line keeping track of how many times the second field has appeared. Since everything is stored in the array a, then it is just a matter of looping through it and printing its content.
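The for (i in a) loop visits keys in an unspecified order, which is why the output above differs from the desired result. A possible sketch that keeps the order in which each model first appears; the array names count and order are my own choice, and the sample file is recreated in a temporary directory.

```shell
# Recreate the question's sample data.
tmp=$(mktemp -d)
cat > "$tmp/test1.txt" <<'EOF'
Renault:cilo:84563
Renault:cilo:84565
M&M:Thar:84566
Tata:nano:84567
M&M:quanto:84568
M&M:quanto:84569
EOF

counts=$(awk -F: '
    !($2 in count) { order[++n] = $2 }   # remember each model the first time it is seen
    { count[$2]++ }                      # tally every occurrence
    END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }
' "$tmp/test1.txt")
printf '%s\n' "$counts"

rm -rf "$tmp"
```

Unlike uniq -c, this also counts correctly when equal models are not on adjacent lines.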

Duplicates in an unix text file based on multiple fields

I have a requirement to find duplicates based on three columns in a .txt file in unix which is delimited by ,.
Input:
a,b,c,d,e,f,gf,h
a,bd,cg,dd,ey,f,g,h
a,b,df,d,e,fd,g,h
a,b,ck,d,eg,f,g,h
Let's say we are finding duplicates based on fields 1, 2, and 5.
Expected output:
a,b,c,d,e,f,gf,h
a,b,df,d,e,fd,g,h
Can anyone help to write a script for this or is there a command already available?
I tried like this:
awk -F, '!x[$1,$2,$3]++' file.txt
but it did not work.
One way using awk:
awk -F, 'FNR==NR { x[$1,$2,$5]++; next } x[$1,$2,$5] > 1' a.txt a.txt
This is simple, but reads the file twice. On the first pass (FNR==NR), it maintains counts based on the key fields. During the second pass, it prints a line if its key was found more than once.
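As a quick check, here is the two-pass command run against the question's sample input, recreated in a temporary file for the demo. Lines 1 and 3 share the key (a, b, e), so those are the two that should be printed.

```shell
# Recreate the question's input.
tmp=$(mktemp -d)
cat > "$tmp/a.txt" <<'EOF'
a,b,c,d,e,f,gf,h
a,bd,cg,dd,ey,f,g,h
a,b,df,d,e,fd,g,h
a,b,ck,d,eg,f,g,h
EOF

# Pass 1 counts each (field 1, field 2, field 5) key; pass 2 prints lines whose key count exceeds 1.
dups=$(awk -F, 'FNR==NR { x[$1,$2,$5]++; next } x[$1,$2,$5] > 1' "$tmp/a.txt" "$tmp/a.txt")
printf '%s\n' "$dups"

rm -rf "$tmp"
```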
Another way using awk:
awk -F, '{if (x[$1$2$5]) { y[$1$2$5]++; print $0; if (y[$1$2$5] == 1) { print x[$1$2$5] } } x[$1$2$5] = $0}' a.txt
Explanation:
1 awk -F,
2 '{if (x[$1$2$5])
3 { y[$1$2$5]++; print $0;
4 if (y[$1$2$5] == 1)
5 { print x[$1$2$5] }
6 } x[$1$2$5] = $0
7 }'
Line 2: If x has $1$2$5, this key was seen before, do steps 3-5
Line 3: Increment the count and print the line because it is a dup
Line 4: This means, We are seeing this key for the 2nd time, so we need to print the first line with this key. Last time we saw this key we did not know whether it was a dup or not. So we print the first line in step 5.
Line 6: Store the current line against the key so we can use it in step 2
Another way using sort, uniq and awk
Note: uniq command has an option '-f' to skip the specified number of fields before it starts comparison.
sort -t, -k1,1 -k2,2 -k5,5 a.txt | awk -F, 'BEGIN { OFS = " "} {print $0, $1, $2, $5}' | sed 's/,/ /g' | uniq -f8 -D | sed 's/ /,/g' | cut -d',' -f 1-8
This sorts based on fields 1, 2, and 5. awk prints the original line and appends fields 1, 2, and 5. sed changes the delimiter because uniq has no option to specify a delimiter. uniq skips the first 8 fields (the original line), compares the rest of the line, and prints all duplicate lines (-D is a GNU extension). The final sed and cut restore the commas and drop the appended key fields.
I had a similar issue
I needed to eliminate duplicate detail records while preserving the flat-file record formatting and the sequence of the records.
The duplication was caused by a time expansion of the date field in column 2 of the detail records only.
The receiving system was reporting duplication on columns 4 and 5.
I cobbled together this quick hack to resolve it.
First read the file data into an array
Then we can read and manipulate the individual records (crudely with a counter) as demonstrated in this snippet integrating a case statement to logically treat the various record types.
Cheers!
infile="[input file name]"
readarray -t inrecs < "$infile"
filebase=$(echo "$infile" | cut -d '.' -f1)
i=1
for inrec in "${inrecs[@]}"; do
    field1=$(echo "${inrecs[$i-1]}" | cut -d',' -f1)
    field2=$(echo "${inrecs[$i-1]}" | cut -d',' -f2)
    field3=$(echo "${inrecs[$i-1]}" | cut -d',' -f3)
    field4=$(echo "${inrecs[$i-1]}" | cut -d',' -f4)
    field5=$(echo "${inrecs[$i-1]}" | cut -d',' -f5)
    field6=$(echo "${inrecs[$i-1]}" | cut -d',' -f6)
    field7=$(echo "${inrecs[$i-1]}" | cut -d',' -f7)
    field8=$(echo "${inrecs[$i-1]}" | cut -d',' -f8)
    case $field1 in
    'H')    # header record: start a new output file
        echo "$field1,$field2,$field3" > "${filebase}.new"
        ;;
    'D')    # detail record: write it only once per (field4, field5) pair
        dupecount=0
        dupecount=$(zegrep -c -e "${field4},${field5}" "${infile}")
        if [[ "$dupecount" -gt 1 ]]; then
            writtencount=0
            writtencount=$(zegrep -c -e "${field4},${field5}" "${filebase}.new")
            if [[ "${writtencount}" -eq 0 ]]; then
                echo "$field1,$field2,$field3,$field4,$field5,$field6,$field7,$field8," >> "${filebase}.new"
            fi
        else
            echo "$field1,$field2,$field3,$field4,$field5,$field6,$field7,$field8," >> "${filebase}.new"
        fi
        ;;
    'T')    # trailer record: recompute the detail count
        dcount=$(zegrep -c '^D' "${filebase}.new")
        echo "$field1,$field2,$dcount,$field4" >> "${filebase}.new"
        ;;
    esac
    ((i++))
done

Advanced grep unix

Usually the grep command is used to display the lines containing the specified pattern. Is there any way to display n lines before and after the line which contains the specified pattern?
Can this be achieved using awk?
Yes, use
grep -B num1 -A num2
to include num1 lines of context before the match, and num2 lines of context after the match.
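A minimal sketch of these options, assuming a grep that supports -B and -A (GNU and BSD grep do). The input is just the numbers 1 to 10 from seq, and the match is the line "5".

```shell
# Show 2 lines before and 1 line after the line that is exactly "5".
context=$(seq 1 10 | grep -B 2 -A 1 '^5$')
printf '%s\n' "$context"
```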
EDIT:
Seems the OP is using AIX. This has a different set of options which doesn't include -B and -A
this link describes grep on AIX 4.3 (it doesn't look promising)
Matt's perl script might be a better solution.
Here is what I usually do on AIX:
before=2   # number of lines to show before the match
after=2    # number of lines to show after the match
grep -n <pattern> <filename> | cut -d':' -f1 | xargs -n1 -I % awk "NR<=%+$after && NR>=%-$before" <filename>
If you do not want the extra two variables, you can always write it as a one-liner:
grep -n <pattern> <filename> | cut -d':' -f1 | xargs -n1 -I % awk 'NR<=%+<<after>> && NR>=%-<<before>>' <filename>
Suppose I have a pattern 'stack' and the filename is flow.txt
I want 2 lines before and 3 lines after. The command will be:
grep -n 'stack' flow.txt | cut -d':' -f1 | xargs -n1 -I % awk 'NR<=%+3 && NR>=%-2' flow.txt
If I want only the 2 lines before, the command will be:
grep -n 'stack' flow.txt | cut -d':' -f1 | xargs -n1 -I % awk 'NR<=% && NR>=%-2' flow.txt
If I want only the 3 lines after, the command will be:
grep -n 'stack' flow.txt | cut -d':' -f1 | xargs -n1 -I % awk 'NR<=%+3 && NR>=%' flow.txt
For multiple files, replace the grep and awk pipeline with a single awk script. Taking the example above, with the pattern 'stack' and the file names flow.*, 2 lines before and 3 lines after, the command will be:
awk 'BEGIN {
    before=2; after=3; pattern="stack";
    i=0; hold[before]=""; afterprints=0}
{
    # Print the lines following the previous match
    if (afterprints > 0)
    {
        print FILENAME ":" FNR ":" $0
        afterprints--   # lines still to print after a match; reset if a new match is found
        if (afterprints == 0) print "---"
    }
    # Look for the pattern in the current line
    if ( match($0, pattern) > 0 )
    {
        # Print the lines in the round-robin hold buffer, oldest first, up to line-1.
        # before > 0     => the user wants lines before (also avoids dividing by 0 in %)
        # afterprints==0 => we have not printed these lines already
        for(j=i; j < i+before && before > 0 && afterprints == 0 ; j++)
            print hold[j%before]
        if (afterprints == 0) # print the line if we have not printed it already
            print FILENAME ":" FNR ":" $0
        afterprints=after
    }
    if (before > 0) # Store the line in the round-robin hold buffer
    {   hold[i]=FILENAME ":" FNR ":" $0
        i=(i+1)%before }
}' flow.*
From the tags, it's likely that the system has a grep that may not support providing context (Solaris is one system that doesn't and I can't remember about AIX). If that is the case, there's a perl script that may help at http://www.sun.com/bigadmin/jsp/descFile.jsp?url=descAll/cgrep__context_grep.
If you have sed you could use this shell script
BEFORE=2
AFTER=3
FILE=file.txt
PATTERN=pattern
for i in $(grep -n $PATTERN $FILE | sed -e 's/\:.*//')
do head -n $(($AFTER+$i)) $FILE | tail -n $(($AFTER+$BEFORE+1))
done
What it does is, grep -n prefixes each match with the line it was found at, the sed strips all but the line it was found at. Then you use head to get the lines up to the line it was found on plus an additional $AFTER lines. That's then piped to tail to just get $BEFORE + $AFTER + 1 lines (that is, your matching line plus the number of lines before and after)
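One thing to watch with head and tail: when a match sits within $AFTER lines of the end of the file, head returns fewer lines than requested, so tail reaches further back than $BEFORE. A possible variant, sketched here with sed line ranges and a made-up 10-line input; the lower bound is clamped to 1 because sed has no line 0.

```shell
BEFORE=2
AFTER=3
PATTERN='^10$'

# Hypothetical input: the numbers 1 to 10, one per line.
tmp=$(mktemp -d)
FILE="$tmp/file.txt"
seq 1 10 > "$FILE"

context=$(
    grep -n "$PATTERN" "$FILE" | cut -d: -f1 | while read -r n; do
        start=$((n - BEFORE))
        [ "$start" -lt 1 ] && start=1              # clamp matches near the top
        sed -n "${start},$((n + AFTER))p" "$FILE"  # a range past the last line just stops at EOF
    done
)
printf '%s\n' "$context"

rm -rf "$tmp"
```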
Sure there is (from the grep man page):
-B NUM, --before-context=NUM
Print NUM lines of leading context before matching lines.
Places a line containing a group separator (--) between
contiguous groups of matches. With the -o or --only-matching
option, this has no effect and a warning is given.
-A NUM, --after-context=NUM
Print NUM lines of trailing context after matching lines.
Places a line containing a group separator (--) between
contiguous groups of matches. With the -o or --only-matching
option, this has no effect and a warning is given.
and if you want the same amount of lines before AND after the match, use:
-C NUM, -NUM, --context=NUM
Print NUM lines of output context. Places a line containing a
group separator (--) between contiguous groups of matches. With
the -o or --only-matching option, this has no effect and a
warning is given.
you can use awk
awk 'BEGIN{t=4}
c--&&c>=0
/pattern/{ c=t; for(i=NR;i<NR+t;i++)print a[i%t] }
{ a[NR%t]=$0}
' file
output
$ more file
1
2
3
4
5
pattern
6
7
8
9
10
11
$ ./shell.sh
2
3
4
5
6
7
8
9
