split one line into multiple line using number of bytes - unix

I want to split single line into multiple line of 8 bytes each. And I am using the fold command and since this file contains the special characters the fold command does not work and it breaks in the middle of multibyte character.
File Content
あいbbえおかcc髙①こさし㈱㈱ちつて髙aabbc
Command Used
fold -b8 dummy_file.dat
Appreciate any help on this.

The problem here is that your text contains multi-bytes characters that will be broken by the fold command if we split them on 2 lines.
echo "あいbbえおかcc髙①こさし㈱㈱ちつて髙aabbc" | fold -b8
あいbb
えお��
�cc髙��
�こさ�
��㈱㈱
ちつ��
�髙aabb
c
If you want to have 8 characters per line you can use the following sed command:
echo "あいbbえおかcc髙①こさし㈱㈱ちつて髙aabbc" | sed 's/.\{8\}/&\n/g'
あいbbえおかc
c髙①こさし㈱㈱
ちつて髙aabb
c
that add a breakline after each occurrence of 8 characters.
If you do not want to display 8 characters but want to constraint each line to be at most 8 bytes without breaking the text content then you can use the python script:
import sys
def utf8len(s):
return len(s.encode('utf-8'))
entry = unicode(sys.stdin.read(),'utf-8')
tmp = ''
for c in entry:
if utf8len(tmp)+utf8len(c) > 8:
print tmp
tmp = c
elif utf8len(tmp)+utf8len(c) == 8:
print tmp,c
tmp = ''
else:
tmp += c
if tmp:
print tmp
output:
echo -n "あいbbえおかcc髙①こさし㈱㈱ちつて髙aabbc" | python max8bytes.py
あいb b
えお
かcc 髙
①こ
さし
㈱㈱
ちつ
て髙a a
bbc
Explanations:
You define a function that will count how many bytes you have per char.
You read char by char stdin and you avoid to have more than 8 bytes on the same line. If you do not want to have less than you can add some spaces char at the end of each line.

Related

How to replace a CRLF char from a variable length file in the middle of a string in Unix?

My sample file is variable length, without any field delimiters. Lines have a minimum of 18 chars length and the 'CRLF' is potentially (not always) between columns 11-15. How do I replace this with a space only when it has a new line char ('CRLF') in the middle (columns 11-15). I still want to keep true end of record.
Sample data:
Input:
1123xxsdfdsfsfdsfdssa
1234ddfxxyff
frrrdds
1123dfdffdfdxxxxxxxxxas
1234ydfyyyzm
knsaaass
1234asdafxxfrrrfrrrsaa
1123werwetrretttrretertre
Expected output:
1123xxsdfdsfsfdsfdssa
1234ddfxxyfff rrrdds
1123dfdffdfdxxxxxxxxxas
1234ydfyyyzm knsaaass
1234asdafxxfrrrfrrrsaa
1123werwetrretttrretertre
What I tried:
sed '/^.\{15\}$/!N;s/./ /11' filename
But above code just adding space, not removing 'CRLF'
Given your sample data, this seems to produce the desired output:
$ awk 'length($0) < 18 { getline x; $0 = $0 " " x} { print }' data
1123xxsdfdsfsfdsfdssa
1234ddfxxyff frrrdds
1123dfdffdfdxxxxxxxxxas
1234ydfyyyzm knsaaass
1234asdafxxfrrrfrrrsaa
1123werwetrretttrretertre
$
However, if the input contained CRLF line endings, things would not be so happy; it would be best to filter out the CR characters altogether (Unix files don't normally contain CR and certainly do not normally have CRLF line endings).
$ tr -d '\r' < data | awk 'length($0) < 18 { getline x; $0 = $0 " " x} { print }'
1123xxsdfdsfsfdsfdssa
1234ddfxxyff frrrdds
1123dfdffdfdxxxxxxxxxas
1234ydfyyyzm knsaaass
1234asdafxxfrrrfrrrsaa
1123werwetrretttrretertre
$
If you really need DOS-style CRLF input and output, you probably need to use a program such as utod or unix2dos (or some other similar tool) to convert from Unix line endings to DOS.

Removing blank spaces in specific column in pipe delimited file in AIX

Good morning. Long time reader, first time emailer so please be gentle.
I'm working on AIX 5.3 and have a 42 column pipe delimited file. There are telephone numbers in columns 15 & 16 (land|mobile) which may or may not contain spaces depending on who has keyed in the data.
I need to remove these space from columns 15 & 16 only ie
Column 15 | Column 16 **Currently**
01942 665432|07865346122
01942756423 |07855 333567
Column 15 | Column 16 **Needs to be**
01942665432|07865346122
01942756423|07855333567
I have a quick & dirty script which unfortunately is proving to be anything but quick because it's a while loop reading every single line, cutting the field on the pipe delimiter, assigning it to a variable, using sed on column 15 & 16 only to strip blank spaces then writing it out to a new file ie
cat $file | while read
output
do
.....
fourteen=$( echo $output | cut -d'|' -f14 )
fifteen=$( echo $output | cut -d'|' -f15 | sed 's/ //g' )
echo ".....$fourteen|$fifteen..." > $new_file
done
I know there must be a better way to do this, probably using AWK, but am open to any kind of suggestion anyone can offer as the script as it stands is taking half an hour plus to process 176,000 records.
Thanks in advance.
Yes, awk is better suited here
$ cat ip.txt
a|foo bar|01942 665432|07865346122|123
b|i j k |01942756423 |07855 333567|90870
$ awk 'BEGIN{FS=OFS="|"} {gsub(" ","",$3); gsub(" ","",$4)} 1' ip.txt
a|foo bar|01942665432|07865346122|123
b|i j k |01942756423|07855333567|90870
BEGIN{FS=OFS="|"} set | as input and output field separator
gsub(" ","",$3) replace all spaces with nothing only for column 3
gsub(" ","",$4) replace all spaces with nothing only for column 4
1 idiomatic way to print the input record (including any modification done )
Change 3 and 4 to whatever field you need
In case first line should not be affected, add a condition
awk 'BEGIN{FS=OFS="|"} NR>1{gsub(" ","",$3); gsub(" ","",$4)} 1' ip.txt

Grep to count occurrences of file A in file B

I have two files, file A may be in file B and I would like to count for each line in file A, how many times it occurs in file B. For example:
File A:
GAGGACAGACTACTAAAGCC
CTTGCCGCAGATTATCAGAG
CCAGCTTGATGTGTCCTGTG
TGATAGGCAGTGGAACACTG
File B:
NTCTTGAGGAAAGGACGAATCTGCGGAGGACAGACTACTAAAGCCGTTTGAGAGCTAGAACGAGCAAGTTAAGAGA
TCTTGAGGAAAGGACGAAACTCCGGAGGACAGACTACTAAAGCCGTTTTAGAGCTAGAAAGCGCAAGTTAAACGAC
NTCTTGAGGAAAGGACGAATCTGCGCTTGCCGCAGATTATCAGAGGTATGAGAGCTAGAACGAGCAAGTTAAGAGC
TCTTGAGGAAAGGACGAAAGTGCGCTTGCCGCAGATTATCAGAGGTTTTAGAGCTAGAAAGAGCAAGTTAAAATAA
GATCTAGTGGAAAGGACGATTCTCCGCTTGCCGCAGATTATCAGAGGTTGTAGAGCTAGAACTAGCAAGTGACAAG
ATCTTGAGGAAAGGACGAATCTGCGCTTGCCGCAGATTATCAGAGGTTTGAGAGCTAGAACTAGCAAGTTAATAGA
CGATCAAGTGGAAGGACGATTCTCCGTGATAGGCAGTGGAACACTGGATGTAGAGCTAGAAATAGCAAGTGAGCAG
ATCTAGAGGAAAGGACGAATCTCCGTGATAGGCAGTGGAACACTGGTATGAGAGCTAGAACTAGCAAGTTAATAGA
TCTTGAGGAAAGGACGAAACTCCGTGATAGGCAGTGGAACACTGGTTTTAGAGCTAGAAAGCGCAAGTTAAAAGAC
And the output should be File C:
2 GAGGACAGACTACTAAAGCC
4 CTTGCCGCAGATTATCAGAG
0 CCAGCTTGATGTGTCCTGTG
3 TGATAGGCAGTGGAACACTG
I would like to do this using grep and I've tried a few variations of -c,o,f but I can't seem to get the right output.
How can I achieve this?
Try this
for i in `cat a`; do echo "$i `grep $i -c b`"; done
In this case if line from file A occurred several times in one line of file B then this will be count as one occurrence. If you want to count such occurrences but without its overlapping use this
for i in `cat a`; do printf $i; grep $i -o b | wc -l; done
And maybe this variant would be quicker
cat b | grep "`cat a`" -o | sort | uniq -c
#!/usr/bin/perl
open A, "A"; # open file "A" to handle A
open B, "B"; # open file "B" to handle B
chomp(#keys = <A>); # read keys to array, strip line-feeds
#counts{#keys} = (0) x #keys; # initialize hash counts for keys
while(<B>){ # iterate file handle B line by line
foreach $k (#keys){ # iterate keys array
if (/$k/) { # if key matches line
$counts{$k}++; # increase count for key by one
}
}
}
print "$counts{$_} $_\n" for (keys %counts);
Linux command to compare files:
comm FileA FileB
comm produces three-column output. Column one contains lines unique to FileA, column two contains lines unique to FileB, and column three contains lines common to both files.

unix split FASTA using a loop, awk and split

I have a long list of data organised as below (INPUT).
I want to split the data up so that I get an output as below (desired OUTPUT).
The code below first identifies all the lines containing ">gi" and saves the linecount of those lines in an array called B.
Then, in a new file, it should replace those lines from array B with the shortened version of the text following the ">gi"
I figured the easiest way would be to split at "|", however this does not work (no separation happens with my code if i replace " " with "|")
My code is below and does split nicely after the " " if I replace the "|" by " " in the INPUT, however I get into trouble when I want to get the text between the [ ] brackets, which is NOT always there and not always only 2 words...:
B=$( grep -n ">gi" 1VAO_1DII_5fxe_all_hits_combined.txt | cut -d : -f 1)
awk <1VAO_1DII_5fxe_all_hits_combined.txt >seqIDs_1VAO_1DII_5fxe_all_hits_combined.txt -v lines="$B" '
BEGIN {split(lines, a, " "); for (i in a) change[a[i]]=1}
NR in change {$0 = ">" $4}
1
'
let me know if more explanations are needed!
INPUT:
>gi|9955361|pdb|1E0Y|A:1-560 Chain A, Structure Of The D170sT457E DOUBLE MUTANT OF VANILLYL- Alcohol Oxidase
MSKTQEFRPLTLPPKLSLSDFNEFIQDIIRIVGSENVEVISSKDQIVDGSYMKPTHTHDPHHVMDQDYFLASAIVA
>gi|557721169|dbj|GAD99964.1|:1-560 hypothetical protein NECHADRAFT_63237 [Byssochlamys spectabilis No. 5]
MSETMEFRPMVLPPNLLLSEFNGFIRETIRLVGCENVEVISSKDQIHDGSYMDPRHTHDPHHIMEQDYFLASAIVAPRNV
desired OUTPUT:
>1E0Y
MSKTQEFRPLTLPPKLSLSDFNEFIQDIIRIVGSENVEVISSKDQIVDGSYMKPTHTHDPHHVMDQDYFLASAIVAPRNV
>GAD99964.1 Byssochlamys spectabilis No. 5
MSETMEFRPMVLPPNLLLSEFNGFIRETIRLVGCENVEVISSKDQIHDGSYMDPRHTHDPHHIMEQDYFLASAIVA
This can be done in one step with awk (gnu awk):
awk -F'|' '/^>gi/{a=1;match($NF,/\[([^]]*)]/, b);print ">"$4" "b[1];next}a{print}!$0{a=0}' input > output
In a more readable way:
/^>gi/ { # when the line starts with ">gi"
a=1; # set flag "a" to 1
# extract the eventual part between brackets in the last field
match($NF,"\\[([^]]*)]", b);
print ">"$4" "b[1]; # display the line
next # jump to the next record
}
a { print } # when "a" (allowed block) display the line
!$0 { a=0 } # when the line is empty, set "a" to 0 to stop the display

Extract data before and after matching (BIG FILE )

I have got a big file ( arounf 80K lines )
my main goal is to find the patterns and pring for example 10 lines before and 10 lines after the pattern .
the pattern accures multiple times across the file .
using the grep command :
grep -i <my_pattern>* -B 10 -A 10 <my_file>
i get only some of the data , i think it must be something related to the buffer size ....
i need a command ( grep , sed , awk ) that will handle all the matching
and will print 10 line before and after the pattern ...
Example :
my patterns hides here :
a
b
c
pattern_234
c
b
a
a
b
c
pattern_567
c
b
a
this happens multiple times across the file .
running this command :
grep -i pattern_* -B 3 -A 3 <my_file>
will get he right output :
a
b
c
c
b
a
a
b
c
c
b
it works but not full time
if i have 80 patterns not all the 80 will be shown
awk to the rescue
awk -vn=4 # pass the argument of context line count
'{
for(i=1;i<=n;i++) # store the past n lines in an indexed array
p[i]=p[i+1];
p[n+1]=$0
}
/pattern/ # if pattern matched
{
c=n+1; # set the counter to after match line count
for(i=1;i<=n;i++) # print previously saved entries
print p[i]
}
c-->0' # print the lines after match until counter runs out
will print 4 lines before and 4 lines after the match of pattern, change the value of n as per your need.
if non-symmetric before/after you need two variables
awk -vb=2 -va=3 '{for(i=1;i<=b;i++) p[i]=p[i+1];p[b+1]=$0} /pattern/{c=a+1;for(i=1;i<=b;i++) print p[i]} c-->0'

Resources