Remove duplicate lines based on starting pattern using bash - unix

I'm trying to remove duplicates in a list of Jira tickets that follow the following syntax:
XXXX-12345: a description
where 12345 is a pattern like [0-9]+ and the XXXX is constant. For example, the following list:
XXXX-1111: a description
XXXX-2222: another description
XXXX-1111: yet another description
should get cleaned up like this:
XXXX-1111: a description
XXXX-2222: another description
I've been trying with sed, but what I had worked on Mac and not on Linux. I think it would be easier with awk, but I'm not an expert in either of them.
I tried:
sed -r '$!N; /^XXXX-[0-9]+\n\1/!P; D' file

This simple awk should get the output:
awk '!seen[$1]++' file
XXXX-1111: a description
XXXX-2222: another description

If the digits are the only thing defining a dup, you could do:
awk -F: '{split($1,arr,/-/); if (seen[arr[2]]++) next} 1' file
If the XXXX is always the same, you can simplify to:
awk -F: '!seen[$1]++' file
Either prints:
XXXX-1111: a description
XXXX-2222: another description
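To see why `!seen[$1]++` keeps only the first line per key, the one-liner can be spelled out in long form. This is a minimal sketch using the sample lines from the question; the variable names `tickets` and `deduped` are made up for the illustration.

```shell
# Sample input from the question, written to a temporary file.
tickets=$(mktemp)
printf '%s\n' \
  'XXXX-1111: a description' \
  'XXXX-2222: another description' \
  'XXXX-1111: yet another description' > "$tickets"

# Long form of: awk -F: '!seen[$1]++'
# With -F: the key is everything before the first colon, e.g. XXXX-1111.
deduped=$(awk -F: '{
  if (seen[$1] == 0) print   # first time this key appears: keep the line
  seen[$1]++                 # remember that the key has been seen
}' "$tickets")

printf '%s\n' "$deduped"
rm -f "$tickets"
```

The array `seen` counts how often each key has occurred, so any later line with the same key finds a non-zero count and is skipped.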

This might work for you (GNU sed):
sed -nE 'G;/^([^:]*:).*\n\1/d;P;h' file
-nE turn on explicit printing and extended regexps.
G append unique lines from the hold space to the current line.
/^([^:]*:).*\n\1/d If the current line key already exists, delete it.
P otherwise, print the current line and
h store unique lines in the hold space
N.B. Your sed solution would work (not as is but with some tweaking) but only if the file(s) were sorted by the key.
sed -E 'N;/^([^:]*:).*\n\1/!P;D' file

Related

Scraping 5 characters off a webpage using sed or something better?

I have a downloaded webpage I would like to scrape using sed or awk. A colleague has told me that what I'm trying to achieve isn't possible with sed, and this is probably correct seeing as he is a bit of a Linux guru.
What I am trying to achieve:
I am trying to scrape a locally stored HTML document for every value within this HTML label, which appears hundreds of times on the page.
For example:
<label class="css-l019s7">54321</label>
<label class="css-l019s7">55555</label>
This label class never changes, so it seems the perfect point to do scraping and get the values:
54321
55555
There are hundreds of occurrences of this data and I need to get a list of them all.
As sed probably isn't capable of this, I would be forever grateful if someone could demonstrate AWK or something else.
Thank you.
Things I've tried:
sed -En 's#(^.*<label class=\"css-l019s7\">)(.*)(</label>.*$)#\2#gp' TS.html
This code above managed to extract about 40 of the numbers out of 320. There must be a little bug in this sed command for it to work partially.
Use a parser like xmllint:
xmllint --html --recover --xpath '//label[@class="css-l019s7"]/text()' TS.html
As an interest in sed was expressed (note that html can use newlines instead of spaces, for example, so this is not very robust):
sed 't a
s/<label class="css-l019s7">\([^<]*\)/\
\1\
/;D
:a
P;D' TS.html
Using awk:
awk '$1~/^label class="css-l019s7"$/ { print $2 }' RS='<' FS='>' TS.html
or:
awk '$1~/^[\t\n\r ]*label[\t\n\r ]/ &&
$1~/[\t\n\r ]class[\t\n\r ]*=[\t\n\r ]*"css-l019s7"[\t\n\r ]*([\t\n\r ]|$)/ {
print $2
}' RS='<' FS='>' TS.html
someone could demonstrate AWK or something else?
This task seems to me best suited to a CSS selector. If you are allowed to install tools, you might use Element Finder in the following way:
elfinder -s "label.css-l019s7"
which will search for labels with class css-l019s7 in files in current directory.
with grep you can get the values
grep -Eo '([[:digit:]]{5})' file
54321
55555
with awk you can pin down where the values are, here in lines that start with <label or end with /label>:
awk '/^<label|\/label>$/ {if (match($0,/[[:digit:]]{5}/)) { pattern = substr($0,RSTART,RLENGTH); print pattern}}' file
54321
55555
using GNU awk and gensub:
awk '/label class/ && /css-l019s7/ { str=gensub("(<label class=\"css-l019s7\">)(.*)(</label>)","\\2",1,$0);print str}' file
Search for lines containing "label class" and "css-l019s7". Split the match into three groups and replace it with the second group, storing the result in the variable str. Print str.
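One plausible reason the sed attempt in the question returned only some of the values is that several labels share one line, while a single greedy substitution yields at most one value per line. That is an assumption, since the page itself is not shown. Under that assumption, a minimal sketch with grep -o and sed could look like this; the sample page and the variable names `page` and `values` are hypothetical.

```shell
# Hypothetical sample page: two labels share one line, one sits on its own line.
page=$(mktemp)
printf '%s\n' \
  '<div><label class="css-l019s7">54321</label><label class="css-l019s7">55555</label></div>' \
  '<label class="css-l019s7">12345</label>' \
  '<label class="other">99999</label>' > "$page"

# grep -o prints each match on its own line, so several labels on one
# input line are all captured; sed then strips the surrounding tags.
values=$(grep -o '<label class="css-l019s7">[^<]*</label>' "$page" |
  sed -e 's/<[^>]*>//g')

printf '%s\n' "$values"
rm -f "$page"
```

Like the other line-based approaches, this does not handle a label whose content spans several lines; a real HTML parser remains the more robust choice.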

Removing comments from a datafile. What are the differences?

Let's say that you would like to remove comments from a datafile using one of two methods:
cat file.dat | sed -e "s/\#.*//"
cat file.dat | grep -v "#"
How do these individual methods work, and what is the difference between them? Would it also be possible for a person to write the clean data to a new file, while avoiding any possible warnings or error messages to end up in that datafile? If so, how would you go about doing this?
How do these individual methods work, and what is the difference
between them?
Both remove comments, but sed and grep are two different commands and they behave differently. Your sed command replaces everything from the first # to the end of the line with nothing, so any text before the # is kept. grep -v, on the other hand, skips every line that has a # anywhere in it.
You could get more information on these by man page as follows:
man grep:
-v, --invert-match
Invert the sense of matching, to select non-matching lines. (-v is specified by POSIX.)
man sed:
s/regexp/replacement/
Attempt to match regexp against the pattern space. If successful, replace that portion matched with replacement. The replacement may contain the special character & to refer to that portion of the pattern space which matched, and the special escapes \1 through \9 to refer to the corresponding matching sub-expressions in the regexp.
Would it also be possible for a person to write the clean data to a
new file, while avoiding any possible warnings or error messages to
end up in that datafile?
Yes, we could redirect the errors by appending 2>/dev/null to either command.
If so, how would you go about doing this?
You could append 2>/dev/null 1>output_file to the command, which discards error messages and writes the cleaned data to output_file.
Explanation of the sed command, for understanding purposes only. Note that there is no need to use cat and then sed; you could run sed -e "s/\#.*//" Input_file instead.
sed -e " ##Start sed and pass it the script that follows.
s/ ##Use s to substitute the text matched by the regexp that follows.
\#.* ##Match from the first # through the end of the line.
//" ##If the regexp matched, replace the match with nothing.
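The redirection described above can be tried on a small sample. This is a minimal sketch; the directory `workdir` and the file names `file.dat`, `clean.dat`, `missing.dat` and `empty.dat` are hypothetical names chosen for the illustration.

```shell
# Work in a scratch directory so nothing else is touched.
workdir=$(mktemp -d)
printf '%s\n' 'first' '# second' 'thi # rd' > "$workdir/file.dat"

# stdout (the cleaned data) goes to clean.dat, stderr goes to /dev/null.
sed -e 's/#.*//' "$workdir/file.dat" > "$workdir/clean.dat" 2>/dev/null

# A missing input file produces an error message on stderr only,
# so nothing from it ends up in the output file.
status=0
sed -e 's/#.*//' "$workdir/missing.dat" > "$workdir/empty.dat" 2>/dev/null || status=$?

cleaned=$(cat "$workdir/clean.dat")
leaked=$(cat "$workdir/empty.dat")
printf '%s\n' "$cleaned"
rm -rf "$workdir"
```

Checking the exit status (`status` here) is still worthwhile, because discarding stderr hides the reason for a failure.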
That grep -v will lose all the lines that have # on them, for example:
$ cat file
first
# second
thi # rd
so
$ grep -v "#" file
first
This drops every line that has a # on it, including any data before the #, which is not desirable. Instead you could use:
$ grep -o "^[^#]*" file
first
thi
This keeps the text before the # like the sed command does, but without producing empty lines. From man grep:
-o, --only-matching
Print only the matched (non-empty) parts of a matching line,
with each such part on a separate output line.
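The difference between the approaches can be checked side by side. The sketch below uses the same three sample lines and adds two extra sed expressions, which are an addition for illustration and not part of the question: one trims trailing whitespace and one deletes lines left empty. The variable names are hypothetical.

```shell
sample=$(mktemp)
printf '%s\n' 'first' '# second' 'thi # rd' > "$sample"

# Whole lines containing # are dropped, including the data before it.
with_grep_v=$(grep -v '#' "$sample")

# Strip from # to end of line, trim trailing blanks, then delete lines left empty.
with_sed=$(sed -e 's/#.*//' -e 's/[[:space:]]*$//' -e '/^$/d' "$sample")

printf 'grep -v:\n%s\n\nsed:\n%s\n' "$with_grep_v" "$with_sed"
rm -f "$sample"
```

Note that none of these commands can tell a comment marker from a # that is part of the data, for example inside a quoted string.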

How to delete one specific line in a file and modify the next line in unix?

I have a text file and there was a mistake when it was created. To fix this I need to delete a line with a specific unique string and delete the characters in the following line that precede the # symbol. I was able to do this with sed and cut but it only output that one line, not the many other 1000s of lines in my file. Here is an example of the part of the file that needs fixing. I know the line #s (delete 45603341 and modify 45603342) where this mistake occurs.
#HWI-1KL135:70:C305EACXX:5:2105:6727:102841 1:N:0:CAGATC
CCAAGTGTCACCTCTTTTATTTATTGATTT#HWI-1KL135:70:C305EACXX:5:1101:1178:2203 1:N:0:CAGATC
I need the output to look like this and for it to leave the rest of the file intact.
#HWI-1KL135:70:C305EACXX:5:1101:1178:2203 1:N:0:CAGATC
Thanks!
How about:
sed -i -e '45603341d;45603342s/^.*\(#.*\)$/\1/' <filename>
where you replace <filename> with the name of your file.
If you want to change a particular line and delete the above line then run,
sed -ri '45603342s/^([^#]*)(#.*)$/\2/g; 45603341d' aa
Example:
$ cat aa
#HWI-1KL135:70:C305EACXX:5:2105:6727:102841 1:N:0:CAGATC
CCAAGTGTCACCTCTTTTATTTATTGATTT#HWI-1KL135:70:C305EACXX:5:1101:1178:2203 1:N:0:CAGATC
$ sed -r '2s/^([^#]*)(#.*)$/\2/g; 1d' aa
#HWI-1KL135:70:C305EACXX:5:1101:1178:2203 1:N:0:CAGATC
This might work for you (GNU sed):
sed '45603341!b;N;s/^.*\n[^#]*//' file
Leave any line other than 45603341 as is. On that line, append the following line and then remove everything from the start up to, but not including, the first # in the appended line.
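Before running any of these on the real file, the idea can be tested on a few lines. This is a minimal sketch on a four-line sample; the line number in `bad` and the names `reads`, `next` and `fixed` are hypothetical stand-ins for the real values.

```shell
reads=$(mktemp)
printf '%s\n' \
  'line one' \
  '#HWI-1KL135:70:C305EACXX:5:2105:6727:102841 1:N:0:CAGATC' \
  'CCAAGTGTCACCTCTTTTATTTATTGATTT#HWI-1KL135:70:C305EACXX:5:1101:1178:2203 1:N:0:CAGATC' \
  'line four' > "$reads"

bad=2                 # hypothetical line number of the line to delete
next=$((bad + 1))     # the line that follows it

# Delete line $bad; on line $next remove everything before the first #.
fixed=$(sed -e "${bad}d" -e "${next}s/^[^#]*//" "$reads")

printf '%s\n' "$fixed"
rm -f "$reads"
```

Writing to standard output first, and only then switching to in-place editing, makes it easy to confirm that the rest of the file is left intact.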
An alternative approach to sed is to use Vim macros (this also works on Windows). The main disadvantage is that macros cannot be integrated into scripts the way sed can. The main advantage is that they allow complex replacements like "search for this pattern, then clear the line, go down 3 lines, move to column 40, switch lines, ...". If you are already familiar with Vim it's also much more intuitive.
In this particular case you will have to do something like
qq (start macro recording)
/^#HWI.*CAGATC$ (search pattern)
dd (delete line)
vw (select word)
d (delete selected word)
q (end macro)
To run the macro 100 times:
100@q

Modify grep output with a find and replace command

I use grep to filter a big log file into a smaller one, but the output still contains a long directory path that is the same on every line. I have to do a find and replace every time.
Isn't there any way I can run grep -r "format" log.log | execute a find-and-replace thing?
Sed will do what you want. Basic syntax to replace all the matches of foo with bar in-place in $file is:
sed -i 's/foo/bar/g' "$file"
If you're just wanting to delete rather than replace, simply leave out the 'bar' (so s/foo//g).
See this tutorial for a lot more detail, such as regex support.
sed -n '/match/s/pattern/repl/p'
Will print all the lines that match the regex match, with all instances of pattern replaced by repl. Since your lines may contain paths, you will probably want to use a different delimiter. / is customary, but you can also do:
sed -n '\#match#s##repl#p'
In the second case, omitting pattern will cause match to be used for the pattern to be replaced.
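Putting grep and sed together in one pipeline might look like the sketch below. The log lines, the search word and the directory in `prefix` are all hypothetical, since the question does not show the real path.

```shell
log=$(mktemp)
printf '%s\n' \
  '/very/long/common/dir/app/a.c: format error' \
  '/very/long/common/dir/lib/b.c: ok' \
  '/very/long/common/dir/lib/c.c: bad format' > "$log"

prefix='/very/long/common/dir/'   # hypothetical common path to strip

# Filter with grep, then let sed remove the prefix; | is used as the
# s delimiter so the slashes in the path need no escaping.
# The prefix is treated as a regex, so special characters in it
# would need escaping.
short=$(grep 'format' "$log" | sed -e "s|^${prefix}||")

printf '%s\n' "$short"
rm -f "$log"
```

Because the result goes to standard output, it can be redirected to the smaller log file directly, with no separate find-and-replace step afterwards.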

How to remove blank lines from a Unix file

I need to remove all the blank lines from an input file and write into an output file. Here is my data as below.
11216,33,1032747,64310,1,0,0,1.878,0,0,0,1,1,1.087,5,1,1,18-JAN-13,000603221321
11216,33,1033196,31300,1,0,0,1.5391,0,0,0,1,1,1.054,5,1,1,18-JAN-13,059762153003
11216,33,1033246,31300,1,0,0,1.5391,0,0,0,1,1,1.054,5,1,1,18-JAN-13,000603211032
11216,33,1033280,31118,1,0,0,1.5513,0,0,0,1,1,1.115,5,1,1,18-JAN-13,055111034001
11216,33,1033287,31118,1,0,0,1.5513,0,0,0,1,1,1.115,5,1,1,18-JAN-13,000378689701
11216,33,1033358,31118,1,0,0,1.5513,0,0,0,1,1,1.115,5,1,1,18-JAN-13,000093737301
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802041926
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802041954
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802049326
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802049383
11216,33,1036985,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000093415580
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781202001
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781261305
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781603955
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781615746
sed -i '/^$/d' foo
This tells sed to delete every line matching the regex ^$, i.e. every empty line. The -i flag edits the file in-place; if your sed doesn't support that, you can write the output to a temporary file and replace the original:
sed '/^$/d' foo > foo.tmp
mv foo.tmp foo
If you also want to remove lines consisting only of whitespace (not just empty lines) then use:
sed -i '/^[[:space:]]*$/d' foo
Edit: also remove whitespace at the end of lines, because apparently you've decided you need that too:
sed -i '/^[[:space:]]*$/d;s/[[:space:]]*$//' foo
awk 'NF' filename
awk 'NF > 0' filename
sed -i '/^$/d' filename
awk '!/^$/' filename
awk '/./' filename
The NF also removes lines containing only blanks or tabs, the regex /^$/ does not.
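The difference between NF and the regex can be seen by counting the lines each one keeps. This is a small sketch on made-up data; the names `data`, `nf_count` and `dot_count` are hypothetical.

```shell
data=$(mktemp)
# Four lines: data, empty, blanks and a tab only, data.
printf 'a,1\n\n \t \nb,2\n' > "$data"

# NF is 0 for empty and for whitespace-only lines, so both are dropped.
nf_count=$(awk 'NF { n++ } END { print n + 0 }' "$data")

# /./ matches any line with at least one character, blanks included.
dot_count=$(awk '/./ { n++ } END { print n + 0 }' "$data")

echo "NF keeps $nf_count lines, /./ keeps $dot_count lines"
rm -f "$data"
```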
Use grep to match any line that has nothing between the start anchor (^) and the end anchor ($):
grep -v '^$' infile.txt > outfile.txt
If you want to remove lines with only whitespace, you can still use grep. I am using Perl regular expressions in this example, but there are other ways:
grep -P -v '^\s*$' infile.txt > outfile.txt
or, without Perl regular expressions:
grep -v '^[[:space:]]*$' infile.txt > outfile.txt
sed -e '/^ *$/d' input > output
Deletes all lines which consist only of blanks (or are completely empty). You can change the blank to [ \t] where the \t is a representation for tab. Whether your shell or your sed will do the expansion varies, but you can probably type the tab character directly. And if you're using GNU or BSD sed, you can do the edit in-place, if that's what you want, with the -i option.
If I execute the above command still I have blank lines in my output file. What could be the reason?
There could be several reasons. It might be that you don't have blank lines but you have lots of spaces at the end of a line so it looks like you have blank lines when you cat the file to the screen. If that's the problem, then:
sed -e 's/ *$//' -e '/^ *$/d' input > output
The new regex removes repeated blanks at the end of the line; see previous discussion for blanks or tabs.
Another possibility is that your data file came from Windows and has CRLF line endings. Unix sees the carriage return at the end of the line; it isn't a blank, so the line is not removed. There are multiple ways to deal with that. A reliable one is tr to delete (-d) character code octal 15, aka control-M or \r or carriage return:
tr -d '\015' < input | sed -e 's/ *$//' -e '/^ *$/d' > output
If neither of those works, then you need to show a hex dump or octal dump (od -c) of the first two lines of the file, so we can see what we're up against:
head -n 2 input | od -c
Judging from the comments that sed -i does not work for you, you are not working on Linux or Mac OS X or BSD — which platform are you working on? (AIX, Solaris, HP-UX spring to mind as relatively plausible possibilities, but there are plenty of other less plausible ones too.)
You can try the POSIX named character classes such as sed -e '/^[[:space:]]*$/d'; it will probably work, but is not guaranteed. You can try it with:
echo "Hello World" | sed 's/[[:space:]][[:space:]]*/   /'
If it works, there'll be three spaces between the 'Hello' and the 'World'. If not, you'll probably get an error from sed. That might save you grief over getting tabs typed on the command line.
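The trailing-blank and CRLF cases discussed above can be reproduced with a small made-up file. This is a sketch only; the file contents and the names `input`, `output` and `result` are hypothetical.

```shell
input=$(mktemp)
output=$(mktemp)
# Hypothetical Windows-style file: every line ends in CR LF,
# the second line holds only blanks and the fourth is empty.
printf 'row1,1\r\n  \r\nrow2,2  \r\n\r\n' > "$input"

# Delete carriage returns (octal 15), trim trailing blanks,
# then delete the lines that are left empty.
tr -d '\015' < "$input" | sed -e 's/ *$//' -e '/^$/d' > "$output"

result=$(cat "$output")
printf '%s\n' "$result"
rm -f "$input" "$output"
```

Without the tr step, the lines that hold only a carriage return would survive, because a carriage return is not a blank and the line is therefore not empty.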
grep . file
grep looks at your file line-by-line; the dot . matches anything except a newline character. The output from grep is therefore all the lines that consist of something other than a single newline.
with awk
awk 'NF > 0' filename
To be thorough and remove lines even if they include spaces or tabs something like this in perl will do it:
cat file.txt | perl -lane "print if /\S/"
Of course there are the awk and sed equivalents. Best not to assume the lines are totally blank as ^$ would do.
Cheers
You can use sed's -i option to edit in-place without using a temporary file:
sed -i '/^$/d' file

Resources