Removing empty lines and duplicate lines from text file - unix

I recently used the awk command to remove duplicate lines and the blank lines between them, but I am not getting the desired output file.
Input file:
a b

a b

c d

c d

e f

e f
Desired output (I want to remove the duplicate lines and all the blank lines in between):
a b
c d
e f
I used the following code:
awk '!x[$0]++' inputfile > outputfile
And got this output:
a b

c d
e f
The blank line between the first line and the rest is still in the output file.
Help please and thank you.

awk 'NF && !seen[$0]++' inputfile.txt > outputfile.txt
NF skips blank lines, including lines containing only tabs or spaces (their field count is zero).
!seen[$0]++ keeps only the first occurrence of each line, removing duplicates.
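A quick way to check this on the sample data (reconstructing the blank lines described in the question):

```shell
# recreate the input with blank lines between the duplicate pairs
printf 'a b\n\na b\n\nc d\n\nc d\n\ne f\n\ne f\n' > input.txt

# NF skips blank/whitespace-only lines; !seen[$0]++ keeps first occurrences
awk 'NF && !seen[$0]++' input.txt
# prints:
# a b
# c d
# e f
```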

If the original line order of the input is important, then the following will not work for you. If you don't care about the order, then read on.
For me, awk is not the best tool for this problem.
Since you are trying to use awk, I assume you are in a unix-like environment, so:
When I hear "eliminate blank lines" I think "grep".
When I hear "eliminate duplicate lines" I think "uniq" (which requires sort, though not in your example since it is already sorted.)
So, given a file 'in.txt' that duplicates your example, the following produces the desired output.
grep -v "^[[:space:]]*$" in.txt | uniq
Now, if your real data is not sorted, that won't work. Instead use:
grep -v "^[[:space:]]*$" in.txt | sort -u
Your output may be in a different order than the input in this case.

cat test
a b

a b

c d

c d

e f

e f
awk '$0 !~ /^[[:space:]]*$/' test
a b
a b
c d
c d
e f
e f
This removes only the blank lines; the duplicates remain, so combine it with !seen[$0]++ (as above) to deduplicate as well.

Related

Count the number of times a word appears in a specific order (Unix)

Count how many times A and B show up in this example file:
Ex:
1,2,3,A
2,3,1,A
3,1,2,A
1,2,3,B
1,3,2,B
Expected Output should be:
A 3
B 2
So far I have:
grep -cw "*A" <file>
with output:
3
Which only displays the number of occurrences.
Try this:
mayankp@mayank:~/$ cat t1.txt
1,2,3,A
2,3,1,A
3,1,2,A
1,2,3,B
1,3,2,B
mayankp@mayank:~/$ awk -F, 'NR{arr[$4]++} END{for (a in arr) print a, arr[a]}' t1.txt
A 3
B 2
Let me know if this helps.
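For comparison, the same counts can be produced without awk, assuming the letter is always the fourth comma-separated field (this pipeline is an alternative to the answer above, not part of it; note uniq -c puts the count before the value):

```shell
printf '1,2,3,A\n2,3,1,A\n3,1,2,A\n1,2,3,B\n1,3,2,B\n' > t1.txt

# take field 4, sort so identical values are adjacent, then count the runs
cut -d, -f4 t1.txt | sort | uniq -c
# prints (with implementation-dependent leading spaces):
#   3 A
#   2 B
```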

Drop or remove column using awk

I want to drop the first 3 columns.
This is my data:
DETAIL 02032017
Name Gender State School Class
A M Melaka SS D
B M Johor BB E
C F Pahang AA F
EOF 3
I want my data like this:
DETAIL 02032017
School Class
SS D
BB E
AA F
EOF 3
This is the command I am currently using:
awk -v date="$(date +"%d%m%Y")" -F\| 'NR==1 {h=$0; next}
{file="TEST_"$1"_"$2"_"date".csv";
print (a[file]++?"": "DETAIL"date"" ORS h ORS) $0 > file} END{for(file in a) print "EOF " a[file] > file}' testing.csv
Can anyone help me?
Thank you :)
I want to remove the first three columns.
If you just want to remove the first three columns, you can just set them to empty strings, leaving alone those that don't have three columns, something like:
awk 'NF>=3 {$1=""; $2=""; $3=""; print; next}{print}'
That has the potentially annoying habit of still having the field separators between those empty fields but, since modifying columns will reformat the line anyway, I assume that's okay:
DETAIL 02032017
School Class
SS D
BB E
AA F
EOF 3
If awk is the only tool being used to process them, the spacing won't matter. If you do want to preserve formatting (meaning that the columns are at very specific locations on the line), you can just get a substring of the entire line:
awk '{if (NF>=3) {$0 = substr($0,25)}; print}'
Since that doesn't modify individual fields, it won't trigger a recalculation of the line that would change its format:
DETAIL 02032017
School Class
SS D
BB E
AA F
EOF 3
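If the leftover separators do bother you, a third option is to delete the first three fields (and their trailing blanks) with sub() instead of blanking them; a sketch, assuming the fields are separated by spaces:

```shell
# sample data from the question
printf 'DETAIL 02032017\nName Gender State School Class\nA M Melaka SS D\nB M Johor BB E\nC F Pahang AA F\nEOF 3\n' > data.txt

# strip the first three space-separated fields, leaving short lines alone
awk 'NF>=3 {sub(/^[^ ]+ +[^ ]+ +[^ ]+ +/, "")} {print}' data.txt
# prints the desired output, with no leading separators:
# DETAIL 02032017
# School Class
# SS D
# BB E
# AA F
# EOF 3
```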

Extract data before and after matching (BIG FILE )

I have a big file (around 80K lines).
My main goal is to find a pattern and print, for example, 10 lines before and 10 lines after each match.
The pattern occurs multiple times across the file.
Using the grep command:
grep -i <my_pattern>* -B 10 -A 10 <my_file>
I get only some of the data; I think it must be something related to the buffer size.
I need a command (grep, sed, awk) that will handle all the matches
and print 10 lines before and after the pattern.
Example :
my patterns hide here:
a
b
c
pattern_234
c
b
a
a
b
c
pattern_567
c
b
a
This happens multiple times across the file.
Running this command:
grep -i pattern_* -B 3 -A 3 <my_file>
gives the right output:
a
b
c
pattern_234
c
b
a
a
b
c
pattern_567
c
b
a
It works, but not every time:
if I have 80 matches, not all 80 will be shown.
awk to the rescue
awk -v n=4 '                # n is the context line count, passed as an argument
{
    for (i=1; i<=n; i++)    # keep a sliding window of the past n lines
        p[i] = p[i+1]
    p[n+1] = $0
}
/pattern/ {                 # when the pattern matches
    c = n + 1               # set the counter to the after-match line count
    for (i=1; i<=n; i++)    # print the previously saved lines
        print p[i]
}
c-- > 0                     # print the match and the n lines after it
' file
will print 4 lines before and 4 lines after each match of pattern; change the value of n as per your need.
If you need different before and after counts, use two variables:
awk -vb=2 -va=3 '{for(i=1;i<=b;i++) p[i]=p[i+1];p[b+1]=$0} /pattern/{c=a+1;for(i=1;i<=b;i++) print p[i]} c-->0'
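As a quick sanity check of the symmetric version on the example data (using n=2 to keep the output short):

```shell
# one match with two lines of context on each side
printf 'a\nb\nc\npattern_234\nc\nb\na\n' > my_file

awk -v n=2 '{for(i=1;i<=n;i++) p[i]=p[i+1]; p[n+1]=$0}
/pattern/ {c=n+1; for(i=1;i<=n;i++) print p[i]}
c-- > 0' my_file
# prints:
# b
# c
# pattern_234
# c
# b
```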

How to produce the same string for x number of lines and paste 2 files? unix

How to produce the same string for x number of lines and then use paste to combine the files:
I have a file with an unknown number of lines, e.g.:
$ echo -e "a\tb\tc\nd\te\tf" > in.txt
$ cat in.txt
a b c
d e f
I want to concatenate the files with a new column that has the same string for every row. I have tried using echo to create a file and then paste to combine the columns, but I have to know the number of rows in in.txt first before creating in2.txt with echo.
$ echo -e "a\tb\tc\nd\te\tf" > in.txt
$ cat in.txt
a b c
d e f
$ echo -e "x\nx\n" > in2.txt
$ paste in.txt in2.txt
a b c x
d e f x
How else can I achieve the same output for an unknown number of lines in in.txt? E.g.:
[in:]
a b c
d e f
[out:]
a b c x
d e f x
My data consists of a million lines with 3 columns in in.txt, 50-200 characters per line, so the solution needs to keep this "big" data size in mind.
One way with join:
echo | join input - -a 1 -o "1.1 1.2 1.3 2.1" -e x
Though just doing a sed replace should be much better.
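A streaming one-liner along those lines, which never needs to know the line count and handles a million rows in a single pass (awk is used here rather than sed because \t in a sed replacement is not portable):

```shell
# tab-separated sample from the question
printf 'a\tb\tc\nd\te\tf\n' > in.txt

# append the constant column "x", tab-separated, to every line
awk -v OFS='\t' '{print $0, "x"}' in.txt
# prints:
# a	b	c	x
# d	e	f	x
```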

Grep examples - can't understand

Given the following commands:
ls | grep ^b[^b]*b[^b]
ls | grep ^b[^b]*b[^b]*
I know that ^ marks the start of the line, but can anyone give me a brief explanation of these commands? What do they do, step by step?
Thanks!
^ can mean two things:
mark the beginning of a line
or it negates the character set (within [])
So, it means:
lines starting with 'b'
matching any (0 or more) characters other than 'b'
matching another 'b'
followed by one character that is not 'b' (or, in the second command, zero or more such characters)
It will match
bb
bzzzzzb
bzzzzzbzzzzzzz
but not
zzzzbb
bzzzzzxzzzzzz
1) starts with 'b', continues with 0 or more characters which are not 'b', then 'b', then one character which is not 'b'
2) starts with 'b', continues with 0 or more characters which are not 'b', then 'b', then 0 or more characters which are not 'b'
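You can verify both descriptions directly by running the names from the match/no-match lists above through grep (the pattern is quoted here so the shell does not expand * and [^b] as globs):

```shell
# candidate names from the answer above
printf 'bb\nbzzzzzb\nbzzzzzbzzzzzzz\nzzzzbb\nbzzzzzxzzzzzz\n' > names.txt

# second pattern: the trailing [^b]* may match nothing, so "bb" qualifies
grep '^b[^b]*b[^b]*' names.txt
# prints:
# bb
# bzzzzzb
# bzzzzzbzzzzzzz

# first pattern: the trailing [^b] requires one more non-'b' character,
# so of these names it matches only bzzzzzbzzzzzzz
grep '^b[^b]*b[^b]' names.txt
```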
