arrange a file with sed - unix

Okay, I have a file which looks like this:
>S000632122
Bacteria;domain;"Actinobacteria";Actinobacteria;Acidimicrobidae;Acidimicrobiales;order;"Acidimicrobineae";Acidimicrobiaceae;Acidimicrobium;
>S000632121
Bacteria;domain;"Actinobacteria";Actinobacteria;Acidimicrobidae;Acidimicrobiales;order;"Acidimicrobineae";Acidimicrobiaceae;Acidimicrobium;
>S000541758
Bacteria;domain;"Actinobacteria";Actinobacteria;Acidimicrobidae;Acidimicrobiales;order;"Acidimicrobineae";Acidimicrobiaceae;Acidimicrobium;
But I want something like this
>S000632122\tBacteria; domain; actinobacteria\n
>S000548245\tBacteria; domain; actinobacteria\n
I tried with sed, but I'm a bit lost.
Here is what I've done:
sed ':a;N;$!ba;s/\n/\t/g' my file
But it gives me this:
>S000632122 \tBacteria;domain;"Actinobacteria";Actinobacteria;Acidimicrobidae;Acidimicrobiales;order;"Acidimicrobineae";Acidimicrobiaceae;Acidimicrobium;\t>S000632121\tBacteria;domain;"Actinobacteria";Actinobacteria;Acidimicrobidae;Acidimicrobiales;order;"Acidimicrobineae";Acidimicrobiaceae;Acidimicrobium;\t>S000541758
Thank you in advance for your help

This sort of thing is almost always simpler with awk:
$ cat input
>S000632122
Bacteria;domain;"Actinobacteria";Actinobacteria;Acidimicrobidae;Acidimicrobiales;order;"Acidimicrobineae";Acidimicrobiaceae;Acidimicrobium;
>S000632121
Bacteria;domain;"Actinobacteria";Actinobacteria;Acidimicrobidae;Acidimicrobiales;order;"Acidimicrobineae";Acidimicrobiaceae;Acidimicrobium;
>S000541758
Bacteria;domain;"Actinobacteria";Actinobacteria;Acidimicrobidae;Acidimicrobiales;order;"Acidimicrobineae";Acidimicrobiaceae;Acidimicrobium;
$ awk '/^>S[0-9]*$/{ printf "%s\t", $0; next} {printf "%s; %s; %s\n", $1, $2, tolower($4)}' FS=\; input
>S000632122 Bacteria; domain; actinobacteria
>S000632121 Bacteria; domain; actinobacteria
>S000541758 Bacteria; domain; actinobacteria
It's not clear to me from the question if you actually want literal text \t and \n in the output. If you do, you could do:
$ awk '/^>S[0-9]*$/{ printf "%s\\t", $0; next} {printf "%s; %s; %s\\n\n", $1, $2, tolower($4)}' FS=\; input
>S000632122\tBacteria; domain; actinobacteria\n
>S000632121\tBacteria; domain; actinobacteria\n
>S000541758\tBacteria; domain; actinobacteria\n
In each of these, the first clause matches the regex ^>S[0-9]*$ and prints those lines with a trailing tab. (Removing the newline effectively joins the next line in the output.) Every other line is printed according to the format string.

This might work for you (GNU sed):
sed -E 'N;s/\n(([^;]*;){3}).*/\t\L\1/;s/;/\n/3;s//& /g;s/"//g' file
Or if the tab and newline are literal:
sed -E 'N;s/\n(([^;]*;){3}).*/\\t\L\1/;s/;/\\n/3;s//& /g;s/"//g' file
Append the following line.
Replace the newline by a tab and remove all but the first 3 fields of the second line (also lowercase the second line at the same time).
Replace the 3rd ; by a newline.
Put a space after all remaining ;'s.
Remove any "'s.
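For comparison, the attempt in the question joins the whole file into one line because the :a;N;$!ba loop pulls every line into the pattern space before the substitution runs. A minimal sketch of joining the lines pairwise instead, assuming GNU sed (for \t in the replacement) and an input with an even number of lines; the function name join_pairs is only illustrative:

```shell
# Join each header line with the line that follows it, separated by a tab.
# N appends the next input line to the pattern space; the s command then
# replaces the embedded newline with a tab.
join_pairs() {
    sed 'N;s/\n/\t/'
}

printf '>S1\nBacteria;domain;\n>S2\nArchaea;domain;\n' | join_pairs
```

Trimming and lowercasing the taxonomy fields would still need the extra substitutions from the commands above.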

Related

Surround every line with single quote except empty lines

My goal is to add a single apostrophe to every line in the file and skip empty lines.
file.txt:
Quote1
Quote2
Quote3
So far I have used sed:
sed -e "s/\(.*\)/'\1'/"
Which does the job but creates apostrophes also in empty lines:
'Quote1'
'Quote2'
''
'Quote3'
My goal:
'Quote1'
'Quote2'
'Quote3'
How could I achieve this using sed, or should it be awk?
.* means zero or more characters, you want 1 or more characters which in any sed would be ..*:
$ sed "s/..*/'&'/" file
'Quote1'
'Quote2'
'Quote3'
You can also write that regexp as .\+ in GNU sed, .\{1,\} in POSIX seds, and .+ in GNU or OSX/BSD sed when invoked with -E.
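As a quick check of the POSIX spelling, here is a small sketch using .\{1,\}; the function name quote_nonempty is made up for illustration:

```shell
# Quote every line that contains at least one character, using the
# POSIX basic-regex interval .\{1,\} instead of ..* or the GNU-only .\+
quote_nonempty() {
    sed "s/.\{1,\}/'&'/"
}

printf 'Quote1\n\nQuote3\n' | quote_nonempty
```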
The above assumes lines of all blanks should be quoted. If that's wrong then:
$ sed "s/.*[^[:blank:]].*/'&'/" file
'Quote1'
'Quote2'
'Quote3'
In any awk assuming lines of all blanks should be quoted:
$ awk '/./{$0="\047" $0 "\047"}1' file
'Quote1'
'Quote2'
'Quote3'
otherwise:
$ awk 'NF{$0="\047" $0 "\047"}1' file
'Quote1'
'Quote2'
'Quote3'
You can see the difference between the above with this:
$ printf ' \n' | sed "s/..*/'&'/"
' '
$ printf ' \n' | sed "s/.*[^[:blank:]].*/'&'/"
$ printf ' \n' | awk '/./{$0="\047" $0 "\047"}1'
' '
$ printf ' \n' | awk 'NF{$0="\047" $0 "\047"}1'
$
One way:
awk '$1{$0 = q $0 q}1' q="'" file
Add the quotes only if the first field ($1) has some value (a first field of 0 also counts as false). The trailing 1 prints every line.
Assuming you want to add the single quotes to lines that contain nothing but whitespace:
sed -E "/./s/(.*)/'\1'/"
Another sed
sed '/^$/!{s/^/\x27/;s/$/\x27/}' file
The script above says:
Look for an empty line with the pattern /^$/.
For lines that don't match that pattern (!), substitute the start (^) and the end ($) with a single quote (\x27).
You can use perl to check for a negative lookbehind, asserting that you won't match an "empty" line:
perl -pe 's/(?<!$)(.*)/"\1"/' file
Another alternative is to be more specific in your regex, as @edmorton suggested in his answer.
Tested with GNU sed:
sed -E "s/\S+/'&'/" file.txt

Split line with multiple delimiters in Unix

I have the below lines in a file
id=1234,name=abcd,age=76
id=4323,name=asdasd,age=43
except that the real file has many more tag=value fields on each line.
I want the final output to be like
id,name,age
1234,abcd,76
4323,asdasd,43
I want everything to the left of each = to come out as the first row, separated by commas, and everything to the right of each = to follow below, one row per input line.
Is there a way to do it with awk or sed? Please let me know if a for loop is required.
I am working on Solaris 10; the local sed is not GNU sed (so there is no -r option, nor -E).
$ cat tst.awk
BEGIN { FS="[,=]"; OFS="," }
NR==1 {
    for (i=1;i<NF;i+=2) {
        printf "%s%s", $i, (i<(NF-1) ? OFS : ORS)
    }
}
{
    for (i=2;i<=NF;i+=2) {
        printf "%s%s", $i, (i<NF ? OFS : ORS)
    }
}
$ awk -f tst.awk file
id,name,age
1234,abcd,76
4323,asdasd,43
Assuming they don't really exist in your input, I removed the ...s etc. that were cluttering up your example before running the above. If that stuff really does exist in your input, clarify how you want the text "(n number of fields)" to be identified and removed (string match? position on line? something else?).
EDIT: since you like the brevity of the cat|head|sed; cat|sed approach posted in another answer, here's the equivalent in awk:
$ awk 'NR==1{h=$0;gsub(/=[^,]+/,"",h);print h} {gsub(/[^,]+=/,"")} 1' file
id,name,age
1234,abcd,76
4323,asdasd,43
FILE=yourfile.txt
# first line (header)
cat "$FILE" | head -n 1 | sed -r "s/=[^,]+//g"
# other lines (data)
cat "$FILE" | sed -r "s/[^,]+=//g"
sed -r '1 s/^/id,name,age\n/;s/id=|name=|age=//g' my_file
edit: or use
sed '1 s/^/id,name,age\n/;s/id=\|name=\|age=//g'
output
id,name,age
1234,abcd,76 ...(n number of fields)
4323,asdasd,43...
The following simply combines the best of the sed-based answers so far, showing you can have your cake and eat it too. If your sed does not support the -r option, chances are that -E will do the trick; all else failing, one can replace R+ by RR* where R is [^,]
sed -r '1s/=[^,]+//g; s/[^,]+=//g'
(That is, the portable incantation would be:
sed "1s/=[^,][^,]*//g; s/[^,][^,]*=//g"
)

AWK to print field $2 first, then field $1

Here is the input(sample):
name1#gmail.com|com.emailclient.account
name2#msn.com|com.socialsite.auth.account
I'm trying to achieve this:
Emailclient name1#gmail.com
Socialsite name2#msn.com
If I use AWK like this:
cat foo | awk 'BEGIN{FS="|"} {print $2 " " $1}'
it messes up the output by overlaying field 1 on the top of field 2.
Any tips/suggestions? Thank you.
A couple of general tips (besides the DOS line ending issue):
cat is for concatenating files, it's not the only tool that can read files! If a command doesn't read files then use redirection like command < file.
You can set the field separator with the -F option so instead of:
cat foo | awk 'BEGIN{FS="|"} {print $2 " " $1}'
Try:
awk -F'|' '{print $2" "$1}' foo
This will output:
com.emailclient.account name1#gmail.com
com.socialsite.auth.account name2#msn.com
To get the desired output you could do a variety of things. I'd probably split() the second field:
awk -F'|' '{split($2,a,".");print a[2]" "$1}' file
emailclient name1#gmail.com
socialsite name2#msn.com
Finally to get the first character converted to uppercase is a bit of a pain in awk as you don't have a nice built in ucfirst() function:
awk -F'|' '{split($2,a,".");print toupper(substr(a[2],1,1)) substr(a[2],2),$1}' file
Emailclient name1#gmail.com
Socialsite name2#msn.com
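If the capitalisation is needed in more than one place, a user-defined awk function keeps the main rule readable. This is only a sketch: ucfirst is not built into awk, it is defined here, and app_and_address is an illustrative wrapper name:

```shell
# Print the capitalised second dot-component of field 2, then field 1.
app_and_address() {
    awk -F'|' '
        # ucfirst is a helper defined here, not an awk built-in
        function ucfirst(s) { return toupper(substr(s, 1, 1)) substr(s, 2) }
        { split($2, a, "."); print ucfirst(a[2]), $1 }
    '
}

printf 'name1#gmail.com|com.emailclient.account\n' | app_and_address
```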
If you want something more concise (although you give up a sub-process) you could do:
awk -F'|' '{split($2,a,".");print a[2]" "$1}' file | sed 's/^./\U&/'
Emailclient name1#gmail.com
Socialsite name2#msn.com
Use a dot or a pipe as the field separator:
awk -v FS='[.|]' '{
printf "%s%s %s.%s\n", toupper(substr($4,1,1)), substr($4,2), $1, $2
}' << END
name1#gmail.com|com.emailclient.account
name2#msn.com|com.socialsite.auth.account
END
gives:
Emailclient name1#gmail.com
Socialsite name2#msn.com
Maybe your file has CRLF line terminators, i.e. every line ends with \r\n.
awk then sees $2 as $2\r. The \r (carriage return) moves the cursor back to the start of the line.
So the output is effectively $2\r$1: $2 is printed first, the cursor returns to the start of the line, and then $1 is printed over it. That is why field 2 is overlaid by field 1.
The awk is fine. I'm guessing the file is from a Windows system and has a CR (^M, ASCII 0x0d) at the end of each line.
This will cause the cursor to go to the start of the line after $2.
Use dos2unix or vi with :se ff=unix to get rid of the CRs.
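If converting the file is not an option, the carriage return can also be dropped inside awk itself. A minimal sketch, assuming the CR only ever appears at the very end of a line; swap_fields is an illustrative name:

```shell
# Remove a trailing carriage return before printing, so the terminal
# never receives the \r that caused the overlay.
swap_fields() {
    awk -F'|' '{ sub(/\r$/, ""); print $2 " " $1 }'
}

printf 'name1#gmail.com|com.emailclient.account\r\n' | swap_fields
```

Because sub() modifies $0, awk re-splits the fields, so $2 no longer carries the CR.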

How to append string before every third line in file

File is
a#gmail.com,b#yahoo.com
xyz#gmail.com
abc#gmail.com
ff#yahoo.co.in
jf#rediff.com
oop#hotmail.com
Output should be:
U|a#gmail.com,b#yahoo.com
D|xyz#gmail.com
R|abc#gmail.com
U|ff#yahoo.co.in
D|jf#rediff.com
R|oop#hotmail.com
I want to prepend a specific string to each line, cycling through U|, D| and R| every three lines.
#!/usr/bin/sed -f
s/^/U|/
n
s/^/D|/
n
s/^/R|/
Useful one-line scripts for sed
$ awk 'BEGIN {split("UDR",p,"")} {print p[((NR-1)%3)+1] "|" $0}' a.txt
U|a#gmail.com,b#yahoo.com
D|xyz#gmail.com
R|abc#gmail.com
U|ff#yahoo.co.in
D|jf#rediff.com
R|oop#hotmail.com
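Splitting a string on an empty separator may not behave the same in every awk, so here is a sketch of the same modular-arithmetic idea using substr instead; prefix_udr is an illustrative name:

```shell
# (NR - 1) % 3 + 1 cycles through 1, 2, 3, which picks U, D, R in turn.
prefix_udr() {
    awk '{ print substr("UDR", (NR - 1) % 3 + 1, 1) "|" $0 }'
}

printf 'a#gmail.com\nxyz#gmail.com\nabc#gmail.com\nff#yahoo.co.in\n' | prefix_udr
```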

Removing trailing / starting newlines with sed, awk, tr, and friends

I would like to remove all of the empty lines from a file, but only when they are at the end/start of a file (that is, if there are no non-empty lines before them, at the start; and if there are no non-empty lines after them, at the end.)
Is this possible outside of a fully-featured scripting language like Perl or Ruby? I’d prefer to do this with sed or awk if possible. Basically, any light-weight and widely available UNIX-y tool would be fine, especially one I can learn more about quickly (Perl, thus, not included.)
From Useful one-line scripts for sed:
# Delete all leading blank lines at top of file (only).
sed '/./,$!d' file
# Delete all trailing blank lines at end of file (only).
sed -e :a -e '/^\n*$/{$d;N;};/\n$/ba' file
Therefore, to remove both leading and trailing blank lines from a file, you can combine the above commands into:
sed -e :a -e '/./,$!d;/^\n*$/{$d;N;};/\n$/ba' file
So I'm going to borrow part of @dogbane's answer for this, since that sed line for removing the leading blank lines is so short...
tac is part of coreutils, and reverses a file. So do it twice:
tac file | sed -e '/./,$!d' | tac | sed -e '/./,$!d'
It's certainly not the most efficient, but unless you need efficiency, I find it more readable than everything else so far.
Here's a one-pass solution in awk: it does not start printing until it sees a non-empty line, and when it sees an empty line, it remembers it until the next non-empty line.
awk '
    /[[:graph:]]/ {
        # a non-empty line
        # set the flag to begin printing lines
        p=1
        # print the accumulated "interior" empty lines
        for (i=1; i<=n; i++) print ""
        n=0
        # then print this line
        print
    }
    p && /^[[:space:]]*$/ {
        # a potentially "interior" empty line. remember it.
        n++
    }
' filename
Note, due to the mechanism I'm using to consider empty/non-empty lines (with [[:graph:]] and /^[[:space:]]*$/), interior lines with only whitespace will be truncated to become truly empty.
As mentioned in another answer, tac is part of coreutils, and reverses a file. Combining the idea of doing it twice with the fact that command substitution will strip trailing new lines, we get
echo "$(echo "$(tac "$filename")" | tac)"
which doesn't depend on sed. You can use echo -n to strip the remaining trailing newline off.
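The trailing-newline behaviour of command substitution can be seen in isolation with a small sketch (POSIX sh; strip_trailing is an illustrative name). Note that the whole input is held in memory by the shell, so this only suits small files:

```shell
# $(cat) captures stdin; the shell drops every trailing newline from the
# captured text, and printf then adds back exactly one.
strip_trailing() {
    printf '%s\n' "$(cat)"
}

printf 'foo\n\nbar\n\n\n' | strip_trailing
```

Interior blank lines are left alone; only the run of newlines at the very end is removed.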
Here's an adapted sed version, which also considers "empty" those lines with just spaces and tabs on it.
sed -e :a -e '/[^[:blank:]]/,$!d; /^[[:space:]]*$/{ $d; N; ba' -e '}'
It's basically the accepted answer version (considering BryanH's comment), but the dot . in the first command was changed to [^[:blank:]] (anything not blank) and the \n inside the second command address was changed to [[:space:]] to allow newlines, spaces and tabs.
An alternative version, without using the POSIX classes, but your sed must support inserting \t and \n inside […]. GNU sed does, BSD sed doesn't.
sed -e :a -e '/[^\t ]/,$!d; /^[\n\t ]*$/{ $d; N; ba' -e '}'
Testing:
prompt$ printf '\n \t \n\nfoo\n\nfoo\n\n \t \n\n'
foo
foo
prompt$ printf '\n \t \n\nfoo\n\nfoo\n\n \t \n\n' | sed -n l
$
\t $
$
foo$
$
foo$
$
\t $
$
prompt$ printf '\n \t \n\nfoo\n\nfoo\n\n \t \n\n' | sed -e :a -e '/[^[:blank:]]/,$!d; /^[[:space:]]*$/{ $d; N; ba' -e '}'
foo
foo
prompt$
using awk:
awk '{a[NR]=$0;if($0 && !s)s=NR;}
END{e=NR;
for(i=NR;i>1;i--)
if(a[i]){ e=i; break; }
for(i=s;i<=e;i++)
print a[i];}' yourFile
This can be solved easily with the -z option of GNU sed:
sed -rz 's/^\n+//; s/\n+$/\n/g' file
Hello
Welcome to
Unix and Linux
For an efficient non-recursive version of the trailing newlines strip (including "white" characters) I've developed this sed script.
sed -n '/^[[:space:]]*$/ !{x;/\n/{s/^\n//;p;s/.*//;};x;p;}; /^[[:space:]]*$/H'
It uses the hold buffer to store all blank lines and prints them only after it finds a non-blank line. Should someone want only the newlines, it's enough to get rid of the two [[:space:]]* parts:
sed -n '/^$/ !{x;/\n/{s/^\n//;p;s/.*//;};x;p;}; /^$/H'
I've tried a simple performance comparison with the well-known recursive script
sed -e :a -e '/^\n*$/{$d;N;};/\n$/ba'
on a 3MB file with 1MB of random blank lines around a random base64 text.
shuf -re 1 2 3 | tr -d "\n" | tr 123 " \t\n" | dd bs=1 count=1M > bigfile
base64 </dev/urandom | dd bs=1 count=1M >> bigfile
shuf -re 1 2 3 | tr -d "\n" | tr 123 " \t\n" | dd bs=1 count=1M >> bigfile
The streaming script took roughly 0.5 second to complete, the recursive didn't end after 15 minutes. Win :)
For completeness, the sed script for stripping leading lines already streams fine. Use whichever suits you best.
sed '/[^[:blank:]]/,$!d'
sed '/./,$!d'
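Using bash (this removes only the first newline it finds, so it handles a single leading empty line and assumes the file starts with one; $(<file) already strips the trailing newlines):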
$ filecontent=$(<file)
$ echo "${filecontent/$'\n'}"
In bash, using cat, wc, grep, sed, tail and head:
# number of first line that contains non-empty character
i=`grep -n "^[^\B*]" <your_file> | sed -e 's/:.*//' | head -1`
# number of the last one
j=`grep -n "^[^\B*]" <your_file> | sed -e 's/:.*//' | tail -1`
# overall number of lines:
k=`cat <your_file> | wc -l`
# how many empty lines do we have at the end of the file?
m=$(($k-$j))
# strip the last m lines, then print from line i onwards, and we are done 8-)
cat <your_file> | head -n-$m | tail -n+$i
Man, it's definitely worth learning a "real" programming language to avoid that ugliness!
@dogbane has a nice simple answer for removing leading empty lines. Here's a simple awk command which removes just the trailing lines. Use this with @dogbane's sed command to remove both leading and trailing blanks.
awk '{ LINES=LINES $0 "\n"; } /./ { printf "%s", LINES; LINES=""; }'
This is pretty simple in operation.
Add every line to a buffer as we read it.
For every line which contains a character, print the contents of the buffer and then clear it.
So the only things that get buffered and never displayed are any trailing blanks.
I used printf instead of print to avoid the automatic addition of a newline, since I'm using newlines to separate the lines in the buffer already.
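Putting the two pieces together might look like the sketch below, which pipes the sed command for leading blank lines into the buffering awk command; trim_blank_lines is an illustrative name and the variable is renamed buf for brevity:

```shell
# sed drops everything before the first non-empty line; awk buffers each
# line and flushes the buffer only when a non-empty line arrives, so
# trailing empty lines are never printed.
trim_blank_lines() {
    sed '/./,$!d' | awk '{ buf = buf $0 "\n" } /./ { printf "%s", buf; buf = "" }'
}

printf '\n\nfoo\n\nbar\n\n\n' | trim_blank_lines
```

Only truly empty lines are treated as blank here; lines containing spaces or tabs count as content.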
This AWK script will do the trick:
BEGIN {
    ne=0;
}
/^[[:space:]]*$/ {
    ne++;
}
/[^[:space:]]+/ {
    for(i=0; i < ne; i++)
        print "";
    ne=0;
    print
}
The idea is simple: empty lines do not get echoed immediately. Instead, we wait until we get a non-empty line, and only then do we echo out as many empty lines as were seen before it, followed by the new non-empty line.
perl -0pe 's/^\n+|\n+(\n)$/\1/gs'
Here's an awk version that removes trailing blank lines (both empty lines and lines consisting of nothing but white space).
It is memory efficient; it does not read the entire file into memory.
awk '/^[[:space:]]*$/ {b=b $0 "\n"; next;} {printf "%s",b; b=""; print;}'
The b variable buffers up the blank lines; they get printed when a non-blank line is encountered. When EOF is encountered, they don't get printed. That's how it works.
If using gnu awk, [[:space:]] can be replaced with \s. (See full list of gawk-specific Regexp Operators.)
If you want to remove only those trailing lines that are empty, see @AndyMortimer's answer.
A bash solution.
Note: Only useful if the file is small enough to be read into memory at once.
[[ $(<file) =~ ^$'\n'*(.*)$ ]] && echo "${BASH_REMATCH[1]}"
$(<file) reads the entire file and trims trailing newlines, because command substitution ($(....)) implicitly does that.
=~ is bash's regular-expression matching operator, and =~ ^$'\n'*(.*)$ optionally matches any leading newlines (greedily), and captures whatever comes after. Note the potentially confusing $'\n', which inserts a literal newline using ANSI C quoting, because escape sequence \n is not supported.
Note that this particular regex always matches, so the command after && is always executed.
Special array variable BASH_REMATCH contains the results of the most recent regex match, and array element [1] contains what the (first and only) parenthesized subexpression (capture group) captured, which is the input string with any leading newlines stripped. The net effect is that ${BASH_REMATCH[1]} contains the input file content with both leading and trailing newlines stripped.
Note that printing with echo adds a single trailing newline. If you want to avoid that, use echo -n instead (or use the more portable printf '%s').
I'd like to introduce another variant for gawk v4.1+
result=($(gawk '
BEGIN {
lines_count = 0;
empty_lines_in_head = 0;
empty_lines_in_tail = 0;
}
/[^[:space:]]/ {
found_not_empty_line = 1;
empty_lines_in_tail = 0;
}
/^[[:space:]]*?$/ {
if ( found_not_empty_line ) {
empty_lines_in_tail ++;
} else {
empty_lines_in_head ++;
}
}
{
lines_count ++;
}
END {
print (empty_lines_in_head " " empty_lines_in_tail " " lines_count);
}
' "$file"))
empty_lines_in_head=${result[0]}
empty_lines_in_tail=${result[1]}
lines_count=${result[2]}
if [ $empty_lines_in_head -gt 0 ] || [ $empty_lines_in_tail -gt 0 ]; then
echo "Removing whitespace from \"$file\""
eval "gawk -i inplace '
{
if ( NR > $empty_lines_in_head && NR <= $(($lines_count - $empty_lines_in_tail)) ) {
print
}
}
' \"$file\""
fi
Because I was writing a bash script anyway containing some functions, I found it convenient to write those:
function strip_leading_empty_lines()
{
    while read line; do
        if [ -n "$line" ]; then
            echo "$line"
            break
        fi
    done
    cat
}
function strip_trailing_empty_lines()
{
    acc=""
    while read line; do
        acc+="$line"$'\n'
        if [ -n "$line" ]; then
            echo -n "$acc"
            acc=""
        fi
    done
}

Resources