unix merging character and next line to one line - unix

$ cat TPSCIS1705291200.err
0301705293504895 000003330011868452100001742N
#ERROR - Can not find Account:3504895
04117052912404797-010000005947011868455100001410N
#ERROR - Can not find Account:12404797-010
Here I want to join each line that ends in the character N with the following line, which starts with #, so that both end up on one line.
The expected output should be something like:
0301705293504895 000003330011868452100001742N,#ERROR - Can not find Account:3504895
04117052912404797-010000005947011868455100001410N,#ERROR - Can not find Account:12404797-010
Please assist. I cannot find a good way to do this because the N and the # are on different lines.

sed reads its input one line at a time, so a plain substitution never sees the newline between two lines. One possible trick is to first translate the newlines to another character, then do the sed substitution.
In this code I use the tr command to replace the newlines with a form feed character ('\f'), then match that character with sed, and finally translate the form feeds back to newlines:
tr '\n' '\f' < myfile | sed -e "s/N$(printf '\f')#/N,#/g" | tr '\f' '\n'
Another dirty trick above is the command substitution $(printf '\f'): it places a literal form feed in the expression, because not every sed accepts \f as an escape inside a regular expression.
Working example (in a shell on macOS):
$ cat in
0301705293504895 000003330011868452100001742N
#ERROR - Can not find Account:3504895
04117052912404797-010000005947011868455100001410N
#ERROR - Can not find Account:12404797-010
$ tr '\n' '\f' < in | sed -e "s/N$(printf '\f')#/N,#/g" | tr '\f' '\n'
0301705293504895 000003330011868452100001742N,#ERROR - Can not find Account:3504895
04117052912404797-010000005947011868455100001410N,#ERROR - Can not find Account:12404797-010

It appears you are just looking to merge every other line:
awk 'NR%2 { printf "%s,", $0; next} 1' input

This might work for you (GNU sed):
sed -i ':a;/N$/N;s/\n#/,#/;ta;P;D' file
If the current line ends in N, append the following line; if that line begins with a #, replace the newline between them with a comma and repeat. Otherwise print the first line of the pattern space and start over. Note that -i edits the file in place.
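For systems without GNU sed, the same rule (join a line that ends in N with a following line that starts with #) can be sketched in awk. This is only an illustration under that assumption; the temporary file and the names held, have and merged are made up for the example, and the sample data is copied from the question.

```shell
# Sample data copied from the question, written to a throwaway file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
0301705293504895 000003330011868452100001742N
#ERROR - Can not find Account:3504895
04117052912404797-010000005947011868455100001410N
#ERROR - Can not find Account:12404797-010
EOF

# Hold one line back. When the held line ends in N and the current line
# starts with #, print both joined by a comma; otherwise flush the held line.
merged=$(awk '
  have && held ~ /N$/ && /^#/ { print held "," $0; have = 0; next }
  have                        { print held }
                              { held = $0; have = 1 }
  END                         { if (have) print held }
' "$tmp")

printf '%s\n' "$merged"
rm -f "$tmp"
```

Lines that do not fit the pattern pass through unchanged, so the sketch does not depend on the file strictly alternating between data and error lines.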

Related

Unix concatenate two line if shorter

I have 500k lines of fixed-length data, but in some lines there is a line break in the middle of the data.
E.g., every line should have the same length (27 characters in this sample):
ABCDEFGHIJKLMNOUPQRSTUVWXTZ
ABCDEFGHIJKLM<BR>
NOUPQRSTUVWXYZ
ABCDEFGHIJKLMNOUPQRSTUVWXTZ
Line 2 contains a line break. I want to remove the line break from line 2 and join it with the line below it.
E.g.
ABCDEFGHIJKLMNOUPQRSTUVWXTZ
ABCDEFGHIJKLMNOUPQRSTUVWXYZ
ABCDEFGHIJKLMNOUPQRSTUVWXTZ
I tried to use awk and sed, but the result is not correct.
If you have Perl on your system, you can simply do this.
$ perl -pe 's/<BR>\n//' your_file_name
It is a one-liner. You simply run it at your command line.
Or with awk:
awk '{ORS = sub(/<BR>/,"") ? "" : "\n"; print $0}' your_file_name
This might work for you (GNU sed):
sed 'N;s/<BR>\n//;P;D' file
or:
sed -z 's/<BR>\n//g' file
One, slightly off-the-wall, way of doing this is to:
remove all existing linefeeds
insert new linefeeds every 27 characters
That looks like this:
tr -d '\n' < YOURFILE | fold -w 27
ABCDEFGHIJKLMNOUPQRSTUVWXTZ
ABCDEFGHIJKLMNOUPQRSTUVWXYZ
ABCDEFGHIJKLMNOUPQRSTUVWXTZ
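If the break is a plain newline rather than a literal <BR> marker, another option is to accumulate lines until the expected width is reached. A minimal sketch, assuming a record width of 27 characters as in the sample; the names width, buf and fixed are made up for the example.

```shell
width=27
tmp=$(mktemp)
printf '%s\n' \
  ABCDEFGHIJKLMNOUPQRSTUVWXTZ \
  ABCDEFGHIJKLM \
  NOUPQRSTUVWXYZ \
  ABCDEFGHIJKLMNOUPQRSTUVWXTZ > "$tmp"

# Append each input line to a buffer and print the buffer once it has
# reached the expected record width.
fixed=$(awk -v w="$width" '
  { buf = buf $0 }
  length(buf) >= w { print buf; buf = "" }
  END { if (buf != "") print buf }
' "$tmp")

printf '%s\n' "$fixed"
rm -f "$tmp"
```

A record that is already longer than the width is printed as it is rather than being split again, so this only repairs lines that were broken into shorter pieces.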

Sed command printing out full line

I'm having trouble with sed in UNIX. I would like to print out every instance of the letter I, but instead this command prints out the full line every time it finds an I. Any suggestions?
For example: if the line is "There is something that I want", I would like it to print just I instead of the full line.
find | sed -nE '/\bI\b/p' file.txt
Because that is what you are telling it to do:
-n means don't print lines by default
/\bI\b/ means match a separate letter I
p means print the line that matched the previous pattern
Maybe this is what you want, matching everything but I and replacing it with nothing:
$ echo "xIxIxIx" | sed 's/[^I]//g'
III
You can use the complement option in tr.
echo "There is something that I want" | tr -cd 'I'
When you want to find I in different files, you can combine with find:
find . -type f -name file.txt -exec cat {} + | tr -cd 'I'
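If the goal is one match per output line rather than a single run of letters, grep's -o option (supported by GNU and BSD grep, though not required by POSIX) prints only the matching part, and -w restricts matches to whole words, much like \bI\b in the question. A small sketch with made-up input lines:

```shell
# Print every standalone "I" on its own line; the "I" in "It" is skipped
# because -w only accepts whole-word matches.
matches=$(printf '%s\n' 'There is something that I want' 'It seems I think I know' |
  grep -ow 'I')

printf '%s\n' "$matches"
```

Counting the instances is then a matter of piping the result to wc -l.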

Join lines depending on the line beginning

I have a file that, occasionally, has split lines. The split is signaled by the fact that the line starts with a space, empty line or a nonnumeric character. E.g.
40403813|7|Failed|No such file or directory|1
40403816|7|Hi,
The Conversion System could not be reached.|No such file or directory||1
40403818|7|Failed|No such file or directory|1
...
I'd like to join the split line back with the previous line (as shown below):
40403813|7|Failed|No such file or directory|1
40403816|7|Hi, The Conversion System could not be reached.|No such file or directory||1
40403818|7|Failed|No such file or directory|1
...
using a Unix command like sed or awk. I'm not clear on how to join a line with the preceding one.
Any suggestions?
awk to the rescue!
awk -v ORS='' 'NR>1 && /^[0-9]/{print "\n"} NF' file
This prints a newline only when the current line starts with a digit; otherwise the rows are appended to each other (you may want to add a space to ORS if the line break did not preserve the space).
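A variant of the same idea that inserts the space itself and ends the output with a newline might look like this. It is a sketch only, assuming that every complete record starts with a digit; the names out and joined are made up for the example, and the sample data is copied from the question.

```shell
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
40403813|7|Failed|No such file or directory|1
40403816|7|Hi,
The Conversion System could not be reached.|No such file or directory||1
40403818|7|Failed|No such file or directory|1
EOF

joined=$(awk '
  # Skip empty lines.
  NF == 0  { next }
  # A leading digit starts a new record: flush the previous one first.
  /^[0-9]/ { if (out != "") print out; out = $0; next }
  # Anything else is a continuation: trim leading blanks, join with a space.
           { sub(/^[ \t]+/, "")
             out = (out == "" ? $0 : out " " $0) }
  END      { if (out != "") print out }
' "$tmp")

printf '%s\n' "$joined"
rm -f "$tmp"
```

Because a record is only printed once the next one starts, a line split into any number of pieces is handled the same way as one split in two.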
Don't do anything based on the values of the strings in your fields as that could go wrong. You COULD get a wrapping line that starts with a digit, for example. Instead just print after every complete record of 5 fields:
$ awk -F'|' '{rec=rec $0; nf+=NF} nf>=5{print rec; nf=0; rec=""}' file
40403813|7|Failed|No such file or directory|1
40403816|7|Hi, The Conversion System could not be reached.|No such file or directory||1
40403818|7|Failed|No such file or directory|1
Try:
awk 'NF{printf("%s",$0 ~ /^[0-9]/ && NR>1?RS $0:$0)} END{print ""}' Input_file
OR
awk 'NF{printf("%s",/^[0-9]/ && NR>1?RS $0:$0)} END{print ""}' Input_file
This checks whether each line starts with a digit. If it does and it is not the first line, a newline (RS) is printed in front of it; otherwise the line is printed as is, directly after the previous text. The END block prints a final newline, which would otherwise be missing at the end of the output.
If you only ever have the line split into two, you can use this sed command:
sed 'N;s/\n\([^[:digit:]]\)/\1/;P;D' infile
This appends the next line to the pattern space, checks if the linebreak is followed by something other than a digit, and if so, removes the linebreak, prints the pattern space up to the first linebreak, then deletes the printed part.
If a single line can be broken across more than two lines, we have to loop over the substitution:
sed ':a;N;s/\n\([^[:digit:]]\)/\1/;ta;P;D' infile
This branches from ta to :a if a substitution took place.
To use with Mac OS sed, the label and branching command must be separate from the rest of the command:
sed -e ':a' -e 'N;s/\n\([^[:digit:]]\)/\1/;ta' -e 'P;D' infile
If the continuation lines always begin with a single space:
perl -0000 -lape 's/\n / /g' input
If the continuation lines can begin with an arbitrary amount of whitespace:
perl -0000 -lape 's/\n(\s+)/$1/g' input
It is probably more idiomatic to write:
perl -0777 -ape 's/\n / /g' input
You can also use sed together with tr, provided the file contains no \r characters:
tr "\n" "\r" < inputfile | sed 's/\r\([^0-9]\)/\1/g' | tr '\r' '\n'

Using grep to search DNA sequence files

I am trying to use Unix's grep to search for specific sequences within files. The files are usually very large (~1 GB) and consist of 'A's, 'T's, 'C's, and 'G's. These files also span many, many lines, with each line being a word of 60ish characters. The problem I am having is that when I search for a specific sequence within these files, grep returns results for patterns that occur on a single line, but not if the pattern spans a line (has a line break somewhere in the middle). For example:
Using
$ grep -i -n "GACGGCT" grep3.txt
To search the file grep3.txt (I put the target 'GACGGCT's in double stars)
GGGCTTCGA**GACGGCT**GACGGCTGCCGTGGAGTCT
CCAGACCTGGCCCTCCCTGGCAGGAGGAGCCTG**GA
CGGCT**AGGTGAGAGCCAGCTCCAAGGCCTCTGGGC
CACCAGGCCAGCTCAGGCCACCCCTTCCCCAGTCA
CCCCCCAAGAGGTGCCCCAGACAGAGCAGGGGCCA
GGCGCCCTGAGGC**GACGGCT**CTCAGCCTCCGCCCC
Returns
3:GGGCTTCGAGACGGCTGACGGCTGCCGTGGAGTCT
8:GGCGCCCTGAGGCGACGGCTCTCAGCCTCCGCCCC
So, my problem here is that grep does not find the GACGGCT that spans the end of line 2 and the beginning of line 3.
How can I use grep to find target sequences that may or may not include a linebreak at any point in the string? Or how can I tell grep to ignore linebreaks in the target string? Is there a simple way to do this?
pcregrep -nM "G[\n]?A[\n]?C[\n]?G[\n]?G[\n]?C[\n]?T" grep3.txt
1:GGGCTTCGAGACGGCTGACGGCTGCCGTGGAGTCT
2:CCAGACCTGGCCCTCCCTGGCAGGAGGAGCCTGGA
CGGCTAGGTGAGAGCCAGCTCCAAGGCCTCTGGGC
6:GGCGCCCTGAGGCGACGGCTCTCAGCCTCCGCCCC
I assume that each of your lines is 60 characters long. Then the command below should work:
tr '\n' ' ' < grep3.txt | sed -e 's/ //g' -e 's/.\{60\}/&^/g' | tr '^' '\n' | grep -i -n "GACGGCT"
Output:
1:GGGCTTCGA**GACGGCT**GACGGCTGCCGTGGAGTCTCCAGACCTGGCCCTCCCTGGC
2:AGGAGGAGCCTG**GACGGCT**AGGTGAGAGCCAGCTCCAAGGCCTCTGGGCCACCAGG
4:CCAGGCGCCCTGAGGC**GACGGCT**CTCAGCCTCCGCCCC
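Since a 60-column re-wrap only moves the line boundaries, a match can still straddle one of the new breaks. Another possibility is to drop the newlines entirely and let grep report each match on its own with -o. A sketch, assuming the file holds only sequence lines (no header lines) and that GNU grep is available, since -o and -b are not POSIX; -b prefixes each match with its byte offset in the newline-free stream. The sample data is the block from the question with the star markers removed.

```shell
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
GGGCTTCGAGACGGCTGACGGCTGCCGTGGAGTCT
CCAGACCTGGCCCTCCCTGGCAGGAGGAGCCTGGA
CGGCTAGGTGAGAGCCAGCTCCAAGGCCTCTGGGC
CACCAGGCCAGCTCAGGCCACCCCTTCCCCAGTCA
CCCCCCAAGAGGTGCCCCAGACAGAGCAGGGGCCA
GGCGCCCTGAGGCGACGGCTCTCAGCCTCCGCCCC
EOF

# Remove every newline, then print each (non-overlapping) match together
# with its byte offset in the joined sequence.
hits=$(tr -d '\n' < "$tmp" | grep -o -b -i 'GACGGCT')
count=$(printf '%s\n' "$hits" | wc -l | tr -d ' ')

printf '%s\n' "$hits"
printf 'matches: %s\n' "$count"
rm -f "$tmp"
```

Because the whole stream becomes a single line, grep may need memory proportional to the file size, which is worth keeping in mind for files of around 1 GB.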

UNIX: Replace Newline w/ Colon, Preserving Newline Before EOF

I have a text file ("INPUT.txt") of the format:
A<LF>
B<LF>
C<LF>
D<LF>
X<LF>
Y<LF>
Z<LF>
<EOF>
which I need to reformat to:
A:B:C:D:X:Y:Z<LF>
<EOF>
I know you can do this with 'sed'. There's a billion Google hits for doing this with 'sed'. But I'm trying to emphasize readability, simplicity, and using the correct tool for the correct job. 'sed' is a line editor that consumes and hides newlines. Probably not the right tool for this job!
I think the correct tool for this job would be 'tr'. I can replace all the newlines with colons with the command:
cat INPUT.txt | tr '\n' ':'
There's 99% of my work done. I have a problem now, though. By replacing all the newlines with colons, I not only get an extraneous colon at the end of the sequence, but I also lose the newline at the end of the input. It looks like this:
A:B:C:D:X:Y:Z:<EOF>
Now, I need to remove the colon from the end of the input. However, if I attempt to pass this processed input through 'sed' to remove the final colon (which would now, I think, be a proper use of 'sed'), I find myself with a second problem. The input is no longer terminated by a newline at all! 'sed' fails outright, for all commands, because it never finds the end of the first line of input!
It seems like appending a newline to the end of some input is a very, very common task, and considering I myself was just sorely tempted to write a program to do it in C (which would take about eight lines of code), I can't imagine there's not already a very simple way to do this with the tools already available on a Linux system.
This should do the job (cat and echo are unnecessary; the \n in the replacement requires GNU sed):
tr '\n' ':' < INPUT.txt | sed 's/:$/\n/'
Using only sed:
sed -n ':a; $ ! {N;ba}; s/\n/:/g;p' INPUT.txt
Bash without any externals:
string=($(<INPUT.txt))
string=${string[@]/%/:}
string=${string//: /:}
string=${string%*:}
Using a loop in sh:
colon=''
string=''
while read -r line
do
    string=$string$colon$line
    colon=':'
done < INPUT.txt
printf '%s\n' "$string"
Using AWK:
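awk '{a=a colon $0; colon=":"} END {print a}' INPUT.txt
Or (passing the data as an argument to printf rather than as its format string, so that a % in the input is harmless):
awk '{printf "%s%s", colon, $0; colon=":"} END {printf "\n"}' INPUT.txt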
Edit:
Here's another way in pure Bash:
string=($(<INPUT.txt))
saveIFS=$IFS
IFS=':'
newstring="${string[*]}"
IFS=$saveIFS
Edit 2:
Here's yet another way which does use echo (the negative count for head -c requires GNU head):
echo "$(tr '\n' ':' < INPUT.txt | head -c -1)"
Old question, but
paste -sd: INPUT.txt
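To see that paste produces exactly the requested format, including the single trailing newline, a quick check on a made-up seven-line file can help. In this sketch -s joins all lines of the file serially and -d ':' sets the delimiter (both options are specified by POSIX); the temporary file stands in for INPUT.txt, and the names joined and newlines are made up for the example.

```shell
tmp=$(mktemp)
printf '%s\n' A B C D X Y Z > "$tmp"

# Join all lines with colons; wc -l counts the newlines in the raw output.
joined=$(paste -s -d ':' "$tmp")
newlines=$(paste -s -d ':' "$tmp" | wc -l | tr -d ' ')

printf '%s\n' "$joined"
printf 'newline count: %s\n' "$newlines"
rm -f "$tmp"
```

The newline count is taken from the pipe rather than from the variable because command substitution strips trailing newlines.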
Here's yet another solution (assumes a character set where ':' is octal 72, e.g. ASCII):
perl -l72 -pe '$\="\n" if eof' INPUT.txt

Resources