Replacing multiple occurrences of a word using sed in Unix

I have a requirement in Unix to replace every occurrence of a word with a space.
My file looks something like the example below. I need to replace |NA| with a space.
File Format
1234|NA|NA|abcd|xyz
2345|NA|NA|NA|lmn
456|NA|abcd|xya|ggh
Expected Output
1234| | |abcd|xyz
2345| | | |lmn
456| |abcd|xya|ggh
I am using the following command but it only replaces the very first occurrence:
sed 's/|NA|/| |/g'

While the g modifier does make "global" replacements, the replacements must be non-overlapping. When overlapping replacements are required, one must loop:
$ sed ':a; s/|NA|/| |/g; ta' file.txt
1234| | |abcd|xyz
2345| | | |lmn
456| |abcd|xya|ggh
The above was tested on GNU sed. For BSD (OS X) sed (hat tip: Jonathan Leffler), the label a must be at the end of its own -e expression:
sed -e ':a' -e 's/|NA|/| |/g; ta' file.txt
How it works
:a creates a label a.
s/|NA|/| |/g performs the substitution that you want but only for non-overlapping instances of |NA|.
ta tells sed to jump to label a if the preceding substitution command resulted in any changes to the line. In this way, the substitution command is repeated as many times as necessary to replace every occurrence of |NA|.
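To see why the loop is needed, here is a small illustration using a line from the sample input: the first g pass replaces only the non-overlapping matches, and a second pass picks up the occurrence that was left behind.
$ echo '2345|NA|NA|NA|lmn' | sed 's/|NA|/| |/g'
2345| |NA| |lmn
$ echo '2345| |NA| |lmn' | sed 's/|NA|/| |/g'
2345| | | |lmn
The :a ... ta loop simply automates re-running the substitution until nothing changes.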

Just use awk for clarity, simplicity, portability, extensibility, etc., etc.:
$ awk '{while(gsub(/\|NA\|/,"| |"));}1' file
1234| | |abcd|xyz
2345| | | |lmn
456| |abcd|xya|ggh
First time through the loop the gsub() replaces all odd-numbered occurrences of the regexp and the 2nd time through it replaces any that are left. It will work as-is with any awk on any UNIX system.

I failed utterly trying to escape the vertical bar, then made an attempt without the vertical bar involved and it worked! I also missed at first that the replacement is only one space; that is now corrected. This way the field is easily widened by adding more spaces.
awk '{gsub(/NA/," ")}1' file
1234| | |abcd|xyz
2345| | | |lmn
456| |abcd|xya|ggh

Related

how to list a specific string or number in a file in Unix

For example, if your file has the following lines:
1=10200|2=2343i|3=otit|5=89898|54=9546i96i|10=2459
1=10200|54=9546i96i|10=2459|2=2343i|3=otit|5=8
1=10200|5=IGY|14=897|459=122|132=1|54=9546i96i|10=2459
1=10200|2=2343i|5=0|54=9546i96i
The output should be
5=89898
5=8
5=IGY
5=0
You could use grep with the -o flag to return only the regexp matches.
Assuming you have a file.txt that you want to parse:
cat file.txt | grep -o -E "(\||^)5=[^|]*" | grep -o "5=[^|]*"
This will match anything that starts with 5= up until the first |.
By running this command on the input you provided I get:
5=89898
5=8
5=IGY
5=0
Cheers
Edit: as Walter A suggested, my previous solution did not cover all cases.
I have added an extra parsing step: first, you get all strings that match 5=... at the start of a line, or |5=..., and then you remove the |.
Use (^|[|]) for matching start of field (start of line or |) and remember/match string until next | or end-of-line.
sed -nr 's/.*(^|[|])(5=[^|]*).*/\2/p' file
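If you prefer to work field by field, a small awk sketch (it assumes | is the only delimiter) that prints every field starting with 5= is:
awk -F'|' '{for (i = 1; i <= NF; i++) if ($i ~ /^5=/) print $i}' file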

Removing comments from a datafile. What are the differences?

Let's say that you would like to remove comments from a datafile using one of two methods:
cat file.dat | sed -e "s/\#.*//"
cat file.dat | grep -v "#"
How do these individual methods work, and what is the difference between them? Would it also be possible for a person to write the clean data to a new file, while avoiding any possible warnings or error messages to end up in that datafile? If so, how would you go about doing this?
How do these individual methods work, and what is the difference between them?
They produce similar results, even though sed and grep are two different commands. Your sed command substitutes everything from # to the end of the line with nothing, so the line itself is kept (possibly empty). grep -v, on the other hand, skips every line that contains a # anywhere in it.
You can get more information on these from the man pages:
man grep:
-v, --invert-match
Invert the sense of matching, to select non-matching lines. (-v is specified by POSIX.)
man sed:
s/regexp/replacement/
Attempt to match regexp against the pattern space. If successful, replace that portion matched with replacement. The replacement may contain the special character & to refer to that portion of the pattern space which matched, and the special escapes \1 through \9 to refer to the corresponding matching sub-expressions in the regexp.
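A quick toy illustration of the difference (the file name and contents here are made up):
$ printf 'keep this\ndata # trailing comment\n# whole-line comment\n' > file.dat
$ sed -e "s/\#.*//" file.dat
keep this
data

$ grep -v "#" file.dat
keep this
sed keeps an (empty) line where the whole-line comment was and keeps the data in front of the trailing comment, while grep -v drops both commented lines entirely.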
Would it also be possible for a person to write the clean data to a
new file, while avoiding any possible warnings or error messages to
end up in that datafile?
Yes, you could redirect the errors by adding 2>/dev/null to both commands.
If so, how would you go about doing this?
You could write it as 1>output_file 2>/dev/null (or, equivalently, > output_file 2>/dev/null).
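For instance, a sketch that writes the cleaned data to clean.dat (the output file name is just an example) while discarding any error messages:
sed -e "s/\#.*//" file.dat > clean.dat 2>/dev/null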
Explanation of the sed command (for understanding only; there is no need to use cat and then sed, you could simply run sed -e "s/\#.*//" Input_file instead):
sed -e "   ##Initiate the sed command; the script to execute follows.
s/         ##Start a substitution on the regexp that follows.
\#.*       ##Match a literal # and everything after it to the end of the line.
//"        ##If the regexp matches, replace that portion with nothing.
That grep -v will lose all the lines that have # on them, for example:
$ cat file
first
# second
thi # rd
so
$ grep -v "#" file
first
drops every line that has a # on it, which is not favorable. Rather you should:
$ grep -o "^[^#]*" file
first
thi
much like the sed command does, but this way you won't get empty lines. From man grep:
-o, --only-matching
Print only the matched (non-empty) parts of a matching line,
with each such part on a separate output line.
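If you would rather stay with sed and still avoid the empty lines, a sketch that strips the comment and then deletes any line that ends up empty:
$ sed -e 's/#.*//' -e '/^$/d' file
first
thi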

Using grep to search DNA sequence files

I am trying to using Unix's grep to search for specific sequences within files. The files are usually very large (~1Gb) of 'A's, 'T's, 'C's, and 'G's. These files also span many, many lines with each line being a word of 60ish characters. The problem I am having is that when I search for a specific sequence within these files grep will return results for the pattern that occur on a single line, but not if the pattern spans a line (has a line break somewhere in the middle). For example:
Using
$ grep -i -n "GACGGCT" grep3.txt
To search the file grep3.txt (I put the target 'GACGGCT's in double stars)
GGGCTTCGA**GACGGCT**GACGGCTGCCGTGGAGTCT
CCAGACCTGGCCCTCCCTGGCAGGAGGAGCCTG**GA
CGGCT**AGGTGAGAGCCAGCTCCAAGGCCTCTGGGC
CACCAGGCCAGCTCAGGCCACCCCTTCCCCAGTCA
CCCCCCAAGAGGTGCCCCAGACAGAGCAGGGGCCA
GGCGCCCTGAGGC**GACGGCT**CTCAGCCTCCGCCCC
Returns
3:GGGCTTCGAGACGGCTGACGGCTGCCGTGGAGTCT
8:GGCGCCCTGAGGCGACGGCTCTCAGCCTCCGCCCC
So, my problem here is that grep does not find the GACGGCT that spans the end of line 2 and the beginning of line 3.
How can I use grep to find target sequences that may or may not include a linebreak at any point in the string? Or how can I tell grep to ignore linebreaks in the target string? Is there a simple way to do this?
pcregrep -nM "G[\n]?A[\n]?C[\n]?G[\n]?G[\n]?C[\n]?T" grep3.txt
1:GGGCTTCGAGACGGCTGACGGCTGCCGTGGAGTCT
2:CCAGACCTGGCCCTCCCTGGCAGGAGGAGCCTGGA
CGGCTAGGTGAGAGCCAGCTCCAAGGCCTCTGGGC
6:GGCGCCCTGAGGCGACGGCTCTCAGCCTCCGCCCC
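If the target sequence is longer, typing \n? between every base by hand gets tedious. Here is a small sketch (variable names are just examples) that builds the same kind of pattern from a plain sequence:
seq=GACGGCT
pattern=$(printf '%s' "$seq" | sed 's/./&\\n?/g; s/\\n?$//')   # gives G\n?A\n?C\n?G\n?G\n?C\n?T
pcregrep -nM "$pattern" grep3.txt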
I assume that each of your lines is 60 characters long. Then the command below should work:
tr '\n' ' ' < grep3.txt | sed -e 's/ //g' -e 's/.\{60\}/&^/g' | tr '^' '\n' | grep -i -n "GACGGCT"
output :
1:GGGCTTCGA**GACGGCT**GACGGCTGCCGTGGAGTCTCCAGACCTGGCCCTCCCTGGC
2:AGGAGGAGCCTG**GACGGCT**AGGTGAGAGCCAGCTCCAAGGCCTCTGGGCCACCAGG
4:CCAGGCGCCCTGAGGC**GACGGCT**CTCAGCCTCCGCCCC
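If you do not need line numbers at all, a simpler sketch is to join the whole file into a single line and then search or count (beware that this creates one very long line, which may be slow on ~1 GB files):
tr -d '\n' < grep3.txt | grep -o -i "GACGGCT" | wc -l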

Returning Data Until N'th Occurrence of Character X

Within a directory I have multiple files that have multiple version numbers within them. I am grepping each file within the directory for these version numbers, sorting them in order to get the most recent version number, and then piping that into 'tail -1' to keep only the most recent version number and not every grep result.
The data looks something like this:
file1: asdf garbage 1.2.4.1 garbagetext asdf
file2: fsdaf garbage asdfsda 4.3.2.10 fdsaf
and so on. I have already accomplished extracting the most recent version number. I did this with the following:
grep -o '[0-9]\{1,\}\.[0-9]\{1,\}\.[0-9]\{1,\}\.[0-9]\{1,\}' * | sort | tail -1
The next part is what I am having trouble on. I am trying to extract the number (whether it be one number character or two number characters) before the first period and return that result. Then, I am assuming with a slightly different command do the same thing but for the number after the first period. And again for the number after the second period and finally after the third period.
I have little to no experience with sed or awk, but after a little research I believe either one of these tools are the way to accomplish this.
Thank you!
Edit: Alright I got it, but I am certain this can be done in a much easier way. What I have is the following:
grep -o '[0-9]\{1,\}\.[0-9]\{1,\}\.[0-9]\{1,\}\.[0-9]\{1,\}' * | sort | tail -1 | grep -o '[0-9]\{1,\}' | sed -n 2p
or sed -n 1p, 3p, 4p depending on which value I want.
To get the latest version number:
grep -P -o "\d+\.\d+\.\d+\.\d+" * |sed 's/.*://g'|awk -F'.' '{v[$0]=($1"."$2$3$4)+0;}END{m=0;for(x in v)if(v[x]>m){m=v[x];n=x;}print n}'
to extract numbers:
kent$ echo "10.2.30.4"|awk -F'.' -v OFS="\n" '$1=$1'
10
2
30
4
You can put the two lines together.
To extract a version number, without having to know how many dots are in it, I would use
grep -o '[0-9.]\+' filename | sort --version-sort | tail -1
(assuming you have GNU sort, with the --version-sort option)
To get just the major version number, pipe the above into one of
sed 's/\..*//'
while read line; do echo ${line%%.*}; done
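For any other component (the number after the first, second, or third period), a sketch that treats the dot as a field separator (the version string here is just an example):
$ echo "10.2.30.4" | cut -d. -f3
30
$ echo "10.2.30.4" | awk -F. '{print $3}'
30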

Delete specific line number(s) from a text file using sed?

I want to delete one or more specific line numbers from a file. How would I do this using sed?
If you want to delete lines 5 through 10 and line 12:
sed -e '5,10d;12d' file
This will print the results to the screen. If you want to save the results to the same file:
sed -i.bak -e '5,10d;12d' file
This will store the unmodified file as file.bak, and delete the given lines.
Note: Line numbers start at 1. The first line of the file is 1, not 0.
You can delete a particular single line by its line number with:
sed -i '33d' file
This will delete line 33 and save the updated file.
And with awk as well; this prints every line whose number (NR) is not 5, 10, or 25:
awk 'NR!~/^(5|10|25)$/' file
$ cat foo
1
2
3
4
5
$ sed -e '2d;4d' foo
1
3
5
$
This is very often a symptom of an antipattern. The tool which produced the line numbers may well be replaced with one which deletes the lines right away. For example,
grep -nh error logfile | cut -d: -f1 | deletelines logfile
(where deletelines is the utility you are imagining you need) is the same as
grep -v error logfile
Having said that, if you are in a situation where you genuinely need to perform this task, you can generate a simple sed script from the file of line numbers. Humorously (but perhaps slightly confusingly) you can do this with sed.
sed 's%$%d%' linenumbers
This accepts a file of line numbers, one per line, and produces, on standard output, the same line numbers with d appended after each. This is a valid sed script, which we can save to a file, or (on some platforms) pipe to another sed instance:
sed 's%$%d%' linenumbers | sed -f - logfile
On some platforms, sed -f does not understand the option argument - to mean standard input, so you have to redirect the script to a temporary file and clean it up when you are done, or maybe replace the lone dash with /dev/stdin or /proc/self/fd/0 if your OS (or shell) has that.
As always, you can add -i before the -f option to have sed edit the target file in place, instead of producing the result on standard output. On *BSDish platforms (including OSX) you need to supply an explicit argument to -i as well; a common idiom is to supply an empty argument; -i ''.
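For illustration, a quick run of the script-generating step (the line numbers and file names are made up):
$ printf '5\n10\n12\n' > linenumbers
$ sed 's%$%d%' linenumbers
5d
10d
12d
$ sed 's%$%d%' linenumbers | sed -f - logfile    # GNU sed: prints logfile without lines 5, 10 and 12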
The shortest: deleting the first line with sed
sed -i '1d' file
As Brian explains, the form is <address><command>, where the <address> is 1 and the <command> is d.
I would like to propose a generalization with awk.
When the file is made of blocks of a fixed size and the lines to delete are repeated for each block, awk can work well in such a case:
awk '{nl=((NR-1)%2000)+1; if ( (nl<714) || ((nl>1025)&&(nl<1029)) ) print $0}' OriginFile.dat > MyOutputCuttedFile.dat
In this example the block size is 2000 and I want to print lines [1..713] and [1026..1028] of each block.
NR is the variable used by awk to store the current line number.
% gives the remainder (or modulus) of the division of two integers;
nl=((NR-1)%BLOCKSIZE)+1 Here we write in the variable nl the line number inside the current block. (see below)
|| and && are the logical operator OR and AND.
print $0 writes the full line
Why ((NR-1)%BLOCKSIZE)+1:
(NR-1) We need a shift of one because 1%3=1, 2%3=2, but 3%3=0.
+1 We add again 1 because we want to restore the desired order.
+-----+------+----------+------------+
| NR | NR%3 | (NR-1)%3 | (NR-1)%3+1 |
+-----+------+----------+------------+
| 1 | 1 | 0 | 1 |
| 2 | 2 | 1 | 2 |
| 3 | 0 | 2 | 3 |
| 4 | 1 | 0 | 1 |
+-----+------+----------+------------+
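A quick way to convince yourself, with a toy block size of 3 and keeping only line 2 of each block (the numbers are chosen purely for the demonstration):
$ seq 7 | awk '{nl=((NR-1)%3)+1; if (nl==2) print $0}'
2
5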
cat -b /etc/passwd | sed -E 's/^( )+(<line_number>)(\t)(.*)/--removed---/g;s/^( )+([0-9]+)(\t)//g'
cat -b -> print the lines with line numbers
s/^( )+(<line_number>)(\t)(.*)/--removed---/g -> replace the numbered line you want to drop (here with the placeholder --removed---; use an empty replacement to remove its content entirely)
s/^( )+([0-9]+)(\t)//g -> strip the line numbers that cat printed from the remaining lines
