Transposing multiple columns in multiple rows keeping one column fixed in Unix - unix

I have one file that looks like below
1234|A|B|C|10|11|12
2345|F|G|H|13|14|15
3456|K|L|M|16|17|18
I want the output as
1234|A
1234|B
1234|C
2345|F
2345|G
2345|H
3456|K
3456|L
3456|M
I have tried with the below script.
awk -F"|" '{print $1","$2","$3","$4}' file.dat | awk -F"," '{OFS=RS;$1=$1}1'
But the output is generated as below.
1234
A
B
C
2345
F
G
H
3456
K
L
M
Any help is appreciated.

What about a single simple awk process such as this:
$ awk -F\| '{print $1 "|" $2 "\n" $1 "|" $3 "\n" $1 "|" $4}' file.dat
1234|A
1234|B
1234|C
2345|F
2345|G
2345|H
3456|K
3456|L
3456|M
No messing with RS and OFS.

If you want to do this dynamically, then you could pass in the number of fields that you want, and then use a loop starting from the second field.
In the script, you might first check that the number of fields is greater than or equal to the number you pass in (in this case n=4):
awk -F\| -v n=4 '
NF >= n {
for(i=2; i<=n; i++) print $1 "|" $i
}
' file
Output
1234|A
1234|B
1234|C
2345|F
2345|G
2345|H
3456|K
3456|L
3456|M

# perl -lne'($a,@b)=((split/\|/)[0..3]);foreach (@b){print join"|",$a,$_}' file.dat
1234|A
1234|B
1234|C
2345|F
2345|G
2345|H
3456|K
3456|L
3456|M

Related

awk $4 column if column = value with characters thereafter

I have a file with the following data within for example:
20 V 70000003d120f88 1 2
20 V 70000003d120f88 2 2
20x00 V 70000003d120f88 2 2
10020 V 70000003d120f88 1 5
I want to get the sum of the 4th column data.
Using the command below I can achieve this, but the row 20x00 is excluded. I want everything that starts with 20 to be summed, and nothing that has other characters before the 20, so 20* for example:
cat testdata.out | awk '{if ($1 == '20') print $4;}' | awk '{s+=$1}END{printf("%.0f\n", s)}'
The output value must be:
5
How can I achieve this using awk? The attempt below does not work either:
cat testdata.out | awk '$1 ~ /'20'/ {print $4;}' | awk '{s+=$1}END{printf("%.0f\n", s)}'
There is no need to use 3 processes; everything can be done by one AWK process. Check it out:
awk '$1 ~ /^20/ { a+=$4 } END { print a }' testdata.out
explanation:
$1 ~ /^20/ checks to see if $1 starts with 20
if yes, we add $4 in the variable a
finally, we print the variable a
result 5
EDIT:
Ed Morton rightly points out that the result should always be of the same type, which can be solved by adding 0 to the result.
You can also set the exit status if you need to distinguish whether a result of 0 is due to no matches
(exit status 0) or to matching only zero values (exit status 1).
The exit status for different input data can be checked with, e.g., echo $?
The code would look like this:
awk '$1 ~ /^20/ { a+=$4 } END { print a+0; exit(a!="") }' testdata.out
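To make the exit-status idea concrete, here is a rough sketch that runs the same awk command on three small inputs. The function names (sum20, report) and the file names are made up for this example; the first file is the sample data from the question.

```shell
# sum20 is a hypothetical wrapper around the command above.
sum20() {
    awk '$1 ~ /^20/ { a += $4 } END { print a+0; exit (a != "") }' "$1"
}

# report is also hypothetical: it shows the sum and what the exit status means.
report() {
    if total=$(sum20 "$1"); then
        echo "$1: total=$total (no row started with 20)"
    else
        echo "$1: total=$total (at least one row started with 20)"
    fi
}

cat > match.out <<'EOF'
20 V 70000003d120f88 1 2
20 V 70000003d120f88 2 2
20x00 V 70000003d120f88 2 2
10020 V 70000003d120f88 1 5
EOF
printf '20 V 70000003d120f88 0 2\n' > zeros.out      # a match, but the value is 0
printf '10020 V 70000003d120f88 1 5\n' > nomatch.out # nothing starts with 20

report match.out
report zeros.out
report nomatch.out
```

Both zeros.out and nomatch.out print a total of 0, and only the exit status tells them apart.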
Figured it out:
cat testdata.out | awk '$1 ~ /'^20'/ {print $4;}' | awk '{s+=$1}END{printf("%.0f\n", s)}'
The above might not work for all cases, but below will suffice:
i=20
cat testdata.out | awk '{if ($1 == "'"$i"'" || $1 == ""'"${i}"'"x00") print $4;}' | awk '{s+=$1}END{printf("%.0f\n", s)}'

OFS when using if-else statement in awk

I have a simple text file, delimited by multiple spaces, and with a different number of columns (6 or 5).
What I am trying to do is, for the rows with more than 5 columns, combine the 2 last columns in one, doing:
cat data.txt | awk '{if(NF>5) print $1,$2,$3,$4,$5"_"$6; else print $0} OFS="," ' > data.csv
The problem is that the OFS is not working for the else statement.
Example - input:
a d e t er ap
b q j n mm
Output that I am getting:
a,d,e,t,er_ap
b q j n mm
Desirable output:
a,d,e,t,er_ap
b,q,j,n,mm
Any suggestions?
Set your OFS in the BEGIN block so that it's a comma before any processing happens. Also when you do print $0 without manipulating the line in any way, awk will just spit out the line as-is with whatever delimiters are in place in the source file. Personally I think that's dumb, but that's awk. As a workaround, just set one column equal to itself, then print:
awk 'BEGIN{OFS=","}{if(NF>5) print $1,$2,$3,$4,$5"_"$6; else {$1=$1;print $0}}' data.txt
If you anticipate more than 6 columns, you can have it join all of the columns from column 5 onward with underscores, using some printf trickery:
awk '{for (i=1;i<=NF;i++){printf (i==NF)?"%s\n":(i>=5)?"%s_":"%s,", $i}}' data.txt
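To see why the $1=$1 assignment matters, here is a small sketch using one row with irregular spacing (the input line is made up for illustration). Awk only rejoins the fields with OFS when a field is assigned, so an untouched record keeps its original spacing.

```shell
# Without the assignment, $0 is printed exactly as it was read.
printf 'b  q j   n mm\n' | awk 'BEGIN { OFS = "," } { print $0 }'
# prints: b  q j   n mm

# Assigning any field (even to itself) makes awk rebuild $0 using OFS.
printf 'b  q j   n mm\n' | awk 'BEGIN { OFS = "," } { $1 = $1; print $0 }'
# prints: b,q,j,n,mm
```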

Is there way to extract all the duplicate records based on a particular column?

I'm trying to extract all (only) the duplicate values from a pipe delimited file.
My data file has 800 thousand rows with multiple columns, and I'm particularly interested in column 3. So I need to find the duplicate values of column 3 and extract all the rows containing them from that file.
I am, however, able to achieve this as shown below:
cat Report.txt | awk -F'|' '{print $3}' | sort | uniq -d >dup.txt
and then I loop over the result as shown below:
while read dup
do
grep "$dup" Report.txt >>only_dup.txt
done <dup.txt
I've also tried the awk method
while read dup
do
awk -v a=$dup '$3 == a { print $0 }' Report.txt>>only_dup.txt
done <dup.txt
But, as I have large number of records in the file, it's taking ages to complete. So I'm looking for an easy and quick alternative.
For example, I have data like this:
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
7|learning|Windows|Business|Requirements
8|learning|Mac|Business|Requirements
And my expected output which doesn't include unique records:
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
6|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
This may be what you want:
$ awk -F'|' 'NR==FNR{cnt[$3]++; next} cnt[$3]>1' file file
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
or if the file's too large for all the keys ($3 values) to fit in memory (which shouldn't be a problem with just the unique $3 values from 800,000 lines):
$ cat tst.awk
BEGIN { FS="|" }
{ currKey = $3 }
currKey == prevKey {
if ( !prevPrinted++ ) {
print prevRec
}
print
next
}
{
prevKey = currKey
prevRec = $0
prevPrinted = 0
}
$ sort -t'|' -k3,3 file | awk -f tst.awk
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
6|learning|Unix|Business|Team
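For reference, the two-pass form (the first command in this answer) can be tried end to end as below. This is only a sketch: the function name dup_rows and the file name sample.txt are invented for the example, and the data is the sample from the question.

```shell
# dup_rows is a hypothetical helper: it passes the same file to awk twice.
dup_rows() {
    awk -F'|' '
        NR == FNR { cnt[$3]++; next }   # first pass: NR equals FNR only while reading the first file argument
        cnt[$3] > 1                     # second pass: print rows whose column 3 occurred more than once
    ' "$1" "$1"
}

cat > sample.txt <<'EOF'
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
7|learning|Windows|Business|Requirements
8|learning|Mac|Business|Requirements
EOF

dup_rows sample.txt
```

Only the distinct column 3 values are held in memory, and the rows come out in their original order because the second pass simply re-reads the file.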
EDIT2: As per Ed's suggestion, I fine-tuned my answer with more meaningful (IMO) array names.
awk '
match($0,/[^\|]*\|/){
val=substr($0,RSTART+RLENGTH)
if(!unique_check_count[val]++){
numbered_indexed_array[++count]=val
}
actual_valued_array[val]=(actual_valued_array[val]?actual_valued_array[val] ORS:"")$0
line_count_array[val]++
}
END{
for(i=1;i<=count;i++){
if(line_count_array[numbered_indexed_array[i]]>1){
print actual_valued_array[numbered_indexed_array[i]]
}
}
}
' Input_file
Edit by Ed Morton: FWIW here's how I'd have named the variables in the above code:
awk '
match($0,/[^\|]*\|/) {
key = substr($0,RSTART+RLENGTH)
if ( !numRecs[key]++ ) {
keys[++numKeys] = key
}
key2recs[key] = (key in key2recs ? key2recs[key] ORS : "") $0
}
END {
for ( keyNr=1; keyNr<=numKeys; keyNr++ ) {
key = keys[keyNr]
if ( numRecs[key]>1 ) {
print key2recs[key]
}
}
}
' Input_file
EDIT: Since the OP changed Input_file to be | delimited, I changed the code a bit as follows to deal with the new Input_file (thanks to Ed Morton for pointing it out).
awk '
match($0,/[^\|]*\|/){
val=substr($0,RSTART+RLENGTH)
if(!a[val]++){
b[++count]=val
}
c[val]=(c[val]?c[val] ORS:"")$0
d[val]++
}
END{
for(i=1;i<=count;i++){
if(d[b[i]]>1){
print c[b[i]]
}
}
}
' Input_file
Could you please try the following? It gives the output in the same order in which the lines occur in Input_file.
awk '
match($0,/[^ ]* /){
val=substr($0,RSTART+RLENGTH)
if(!a[val]++){
b[++count]=val
}
c[val]=(c[val]?c[val] ORS:"")$0
d[val]++
}
END{
for(i=1;i<=count;i++){
if(d[b[i]]>1){
print c[b[i]]
}
}
}
' Input_file
Output will be as follows.
2 learning Unix Business Team
4 learning Unix Business Team
6 learning Unix Business Team
3 learning Linux Business Requirements
5 learning Linux Business Requirements
Explanation for above code:
awk ' ##Starting awk program here.
match($0,/[^ ]* /){ ##Using match function of awk which matches regex till first space is coming.
val=substr($0,RSTART+RLENGTH) ##Creating variable val whose value is sub-string is from starting point of RSTART+RLENGTH value to till end of line.
if(!a[val]++){ ##Checking condition if value of array a with index val is NULL then go further and increase its index too.
b[++count]=val ##Creating array b whose index is increment value of variable count and value is val variable.
} ##Closing BLOCK for if condition of array a here.
c[val]=(c[val]?c[val] ORS:"")$0 ##Creating array named c whose index is variable val and value is $0 along with keep concatenating its own value each time it comes here.
d[val]++ ##Creating array named d whose index is variable val and its value is keep increasing with 1 each time cursor comes here.
} ##Closing BLOCK for match here.
END{ ##Starting END BLOCK section for this awk program here.
for(i=1;i<=count;i++){ ##Starting for loop from i=1 to till value of count here.
if(d[b[i]]>1){ ##Checking if value of array d with index b[i] is greater than 1 then go inside block.
print c[b[i]] ##Printing value of array c whose index is b[i].
}
}
}
' Input_file ##Mentioning Input_file name here.
Another in awk:
$ awk -F\| '{ # set delimiter
n=$1 # store number
sub(/^[^|]*/,"",$0) # remove number from string
if($0 in a) { # if $0 in a
if(a[$0]==1) # if $0 seen the second time
print b[$0] $0 # print first instance
print n $0 # also print current
}
a[$0]++ # increase match count for $0
b[$0]=n # number stored to b and only needed once
}' file
Output for the sample data:
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
Also, would this work:
$ sort -k 2 file | uniq -D -f 1
or -k2,5 or something similar? No, because the delimiter changed from space to pipe.
Two steps of improvement.
First step:
After
awk -F'|' '{print $3}' Report.txt | sort | uniq -d >dup.txt
# or
cut -d "|" -f3 < Report.txt | sort | uniq -d >dup.txt
you can use
grep -f <(sed 's/.*/^.*|.*|&|.*|/' dup.txt) Report.txt
# or without process substitution
sed 's/.*/^.*|.*|&|.*|/' dup.txt > dup.sed
grep -f dup.sed Report.txt
Second step:
Use awk as given in other, better, answers.

print duplicate entries without deleting unix/linux

Let's say I have a file like this with 2 columns
56-cde
67-cde
56-cao
67-cgh
78-xyz
456-hhh
456-jjjj
45678-nnmn
45677-abdc
45678-aief
I am trying to get an output like this:
56-cde
56-cao
67-cde
67-cgh
456-hhh
456-jjjj
45678-aief
45678-nnmn
So basically instead of printing out the unique values I need to print the duplicates:
I tried to accomplish this using awk like this :
cat input.txt | awk -F"-" '{print $1,$2}' | sort -n | uniq -w 2 -D
This is without doubt showing me what values in column 1 have been duplicated, and also displaying the duplicated values of column 1 along with the respective column 2 values. But since I am hardcoding the number of bytes to 2, it displays the duplicated values only for the 2 digit numbers in column one. Is there a way to do this using awk ?
Thanks in advance.
See if your uniq has a -D option. My cygwin version does:
cat input.txt | sort | uniq -w 2 -D
another awk solution without arrays (but with presort)
sort -n file | awk -F- '
NR==1{p=$1; a=$0; next}
p==$1{a=a RS $0; c++; next}
c{print a}
{a=$0; p=$1; c=0}
END{if(c) print a}'
This is what I came up with (just an awk program, no external sort, uniq etc.):
BEGIN { FS = "-" }
{ arr[$1] = arr[$1] "-" $2 }
END {
for (i in arr) {
if ((n = split(arr[i], a)) < 3) continue
for (j = 2; j <= n; ++j)
print i"-"a[j]
}
}
It collects all numbers along with the different strings attached
in arr (assuming the strings won't contain dashes -).
With gawk, you could use arrays of arrays in order to avoid the concatenation and splitting with dashes.
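If gawk is not available, a similar effect can be had in any POSIX awk with a two-part subscript instead of true arrays of arrays. The sketch below is a swapped-in variant, not the gawk feature itself: the function name dups_by_key and the array names n and val are invented here, and it still assumes the strings contain no dashes.

```shell
# dups_by_key is a hypothetical helper; val[key, j] emulates a per-key list.
dups_by_key() {
    awk -F- '
        { n[$1]++; val[$1, n[$1]] = $2 }    # n[key] counts rows, val[key, 1..n[key]] keeps the strings
        END {
            for (k in n)                    # note: for-in visits keys in an unspecified order
                if (n[k] > 1)
                    for (j = 1; j <= n[k]; j++)
                        print k "-" val[k, j]
        }
    ' "$1"
}

cat > input.txt <<'EOF'
56-cde
67-cde
56-cao
67-cgh
78-xyz
456-hhh
456-jjjj
45678-nnmn
45677-abdc
45678-aief
EOF

dups_by_key input.txt
```

Rows within one key keep their input order, but the keys themselves may come out in any order, so pipe the result through sort -n if that matters.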
I would handle the varying-number-of-digits case by pre-conditioning the data so that the number field is a fixed large width (and use that width in uniq):
cat input.txt | awk -F- '{printf "%12d-%s\n",$1,$2}'| sort | uniq -w 12 -D
If you need the output left-justified as well, just tack on this post-conditioning step:
| awk '{print $1}'
Using Perl
$ cat two_cols.txt
56-cde
67-cde
56-cao
67-cgh
78-xyz
456-hhh
456-jjjj
45678-nnmn
45677-abdc
45678-aief
$ perl -F"-" -lane ' @t=@{$kv{$F[0]}}; push(@t,$_); $kv{$F[0]}=[@t]; END { while(($x,$y)=each(%kv)){ print join("\n",@{$y}) if scalar @{$y}>1 }} ' two_cols.txt
67-cde
67-cgh
56-cde
56-cao
456-hhh
456-jjjj
45678-nnmn
45678-aief
$

Is Awk and multiple file processing possible?

I need to process two file contents. I was wondering if we can pull it off using a single nawk statement.
File A contents:
AAAAAAAAAAAA 1
BBBBBBBBBBBB 2
CCCCCCCCCCCC 3
File B contents:
XXXXXXXXXXX 3
YYYYYYYYYYY 2
ZZZZZZZZZZZ 1
I would like to check whether $2 (the 2nd field) in file A is the reverse of $2 in file B.
I was wondering how to write rules in nawk for multi-file processing ?
How would we distinguish A's $2 from B's $2
EDIT: I need to compare $2 of A's first line (which is 1) with the $2 of B's last line (which is 1 again). Then compare $2 of line 2 in A with $2 of the second-to-last line of B, and so on.
You can do something like this -
[jaypal:~/Temp] cat f1
AAAAAAAAAAAA 1
BBBBBBBBBBBB 2
CCCCCCCCCCCC 3
DDDDDDDDDDDD 4
[jaypal:~/Temp] cat f2
AAAAAAAAAAA 5
XXXXXXXXXXX 3
YYYYYYYYYYY 2
ZZZZZZZZZZZ 1
Solution:
awk '
NR==FNR {a[i++]=$2; next}
{print (a[--i] == $2 ? "Match " $2 FS a[i] : "Do not match " $2 FS a[i])}' f2 f1
Match 1 1
Match 2 2
Match 3 3
Do not match 4 5
You can make awk process files serially, but you can't easily make it process two files in parallel. You probably can achieve the effect with careful use of getline but 'careful' is the operative term.
I think in this case, with simple two-column files, I'd be inclined to use:
paste "File A" "File B" |
awk '{ process fields $1, $2 from File A and fields $3, $4 from file B }'
You would need to make sure the two files are in the appropriate order, etc.
If your input is more complex, then this may not work so well, though you can choose the character that separates the data from the two files with paste -d'|' ... to use a pipe to separate the two records, and awk -F'|' '{ ... }' to read $1 as the info from File A and $2 as the info from File B.
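As a rough sketch of the paste idea applied to the question's data: reverse B first (here with tac, which is a GNU coreutils tool and may be missing on other systems), paste it next to A, and let awk compare the two number columns. The function name compare_reversed and the variable verdict are made up for this example.

```shell
# compare_reversed is a hypothetical helper: $1 is file A, $2 is file B.
compare_reversed() {
    tac "$2" | paste "$1" - |
    awk '{
        # After paste, $1 and $2 come from A, $3 and $4 from the reversed B.
        verdict = ($2 == $4) ? "Match" : "Do not match"
        print verdict, $2, $4
    }'
}

printf 'AAAAAAAAAAAA 1\nBBBBBBBBBBBB 2\nCCCCCCCCCCCC 3\n' > A
printf 'XXXXXXXXXXX 3\nYYYYYYYYYYY 2\nZZZZZZZZZZZ 1\n' > B

compare_reversed A B
```

This relies on both files having the same number of lines and on the fields containing no embedded whitespace, since paste joins the lines with a tab and awk then splits on any whitespace.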
Have you thought about doing something like the following?
diff --brief <(awk '{print $2}' A) <(tac B | awk '{print $2}')
tac reverses the lines of file B and then you can compare the two columns.
