Separate into different files after splitting the pattern - unix

I have the following pattern:
k0
lj33
lp90
ko00
j9
mn12
sh30
lp33
ji90
e3
nd32
jk90
hi43
df45
cv89
er43
I need different files containing
File1 File2 File3
k0 j9 e3
lj33 mn12 nd32
lp90 sh30 jk90
ko00 lp33 hi43
ji90 df45
cv89
er43
Any suggestions?

Do you mean: each file starts with a two-character string?
Try this command:
csplit input '/^..$/' '{*}'
Please ignore the first empty file xx00.

Assuming that you need the data to be split whenever you reach a two-character string:
awk '{ if (length($0) == 2) filename = $0; print > filename }' your_file
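The awk answer can be sanity-checked end to end. Note that it names each output file after its two-character key (k0, j9, e3) rather than File1/File2/File3; a minimal sketch:

```shell
cd "$(mktemp -d)"
printf '%s\n' k0 lj33 lp90 ko00 j9 mn12 sh30 lp33 ji90 \
  e3 nd32 jk90 hi43 df45 cv89 er43 > your_file

# Each two-character line starts a new group and names its output file.
awk '{ if (length($0) == 2) filename = $0; print > filename }' your_file

cat k0    # the first group: k0, lj33, lp90, ko00
```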

Related

Cut specific columns and collapse with delimiter in Unix

Say I have 6 different columns in a text file (as shown below)
A1 B1 C1 D1 E1 F1
1 G PP GG HH GG
z T CC GG FF JJ
I would like to extract the first, second and fourth columns collapsed together with underscores (as A1_B1_D1), followed by the third column, separated by a tab.
So the result would be:
A1_B1_D1 C1
1_G_GG PP
z_T_GG CC
I tried
cut -f 1,2,4 -d$'\t' 3, but it is just not what I want.
If you need to maintain your column alignment, you can check the length of the combination of fields 1, 2 and 4 and add one or two tab characters as necessary:
awk '{
printf (length($1"_"$2"_"$4) >= 8) ? "%s_%s_%s\t%s\n" : "%s_%s_%s\t\t%s\n",
$1,$2,$4,$3
}' file
Example Output
A1_B1_D1 C1
1_G_GG PP
z_T_GG CC
Could you please try the following:
awk '
BEGIN{
OFS="\t"
}
{
print $1"_"$2"_"$4,$3
}
' Input_file
I've tried RavinderSingh13's code and it has the same output as mine, but I don't quite know the difference; anyway, here it is:
awk -F ' ' '{print $1"_"$2"_"$4"\t"$3}' /path/to/file
This might work for you (GNU sed):
sed -E 's/^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+.*/\1_\2_\4\t\3/' file
Use pattern matching and back references.
\S+ means one or more non-white space characters.
\s+ means one or more white space characters.
\t represents a tab.
Another awk and using column -t for formatting.
$ cat cols_345.txt
A1 B1 C1 D1 E1 F1
1 G PP GG HH GG
z T CC GG FF JJ
$ awk -v OFS="_" '{ $3="\t"$3; print $1,$2,$4 $3 } ' cols_345.txt | column -t
A1_B1_D1 C1
1_G_GG PP
z_T_GG CC
$
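If you prefer to stay with cut, as in the question, a hedged alternative is to select the columns twice and glue the two streams back together with paste (this assumes the input really is tab-separated):

```shell
cd "$(mktemp -d)"
printf 'A1\tB1\tC1\tD1\tE1\tF1\n1\tG\tPP\tGG\tHH\tGG\nz\tT\tCC\tGG\tFF\tJJ\n' > file

# Build the underscore-joined key with cut + tr, keep field 3 separately,
# then rejoin the two column sets with paste (tab is paste's default glue).
cut -f1,2,4 file | tr '\t' '_' > left
cut -f3 file > right
paste left right
```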

Selecting specific rows of a tab-delimited file using bash (linux)

I have a directory lot of txt tab-delimited files with several rows and columns, e.g.
File1
Id Sample Time ... Variant[Column16] ...
1 s1 t0 c.B481A:p.G861S
2 s2 t2 c.C221C:p.D461W
3 s5 t1 c.G31T:p.G61R
File2
Id Sample Time ... Variant[Column16] ...
1 s1 t0 c.B481A:p.G861S
2 s2 t2 c.C21C:p.D61W
3 s5 t1 c.G1T:p.G1R
and what I am looking for is to create a new file with:
all the different variants (unique)
the number of times each variant is repeated
and the file location
i.e.:
NewFile
Variant Nº of repeated Location
c.B481A:p.G861S 2 File1,File2
c.C221C:p.D461W 1 File1
c.G31T:p.G61R 1 File1
c.C21C:p.D61W 1 File2
c.G1T:p.G1R 1 File2
I think a basic bash script using awk, sort and uniq will work, but I do not know where to start. Or if RStudio or Python (3) is easier, I could try.
Thanks!!
Pure bash. Requires version 4.0+
# two associative arrays
declare -A files
declare -A count
# use a glob pattern that matches your files
for f in File{1,2}; do
  {
    read header
    while read -ra fields; do
      variant=${fields[3]}      # use index "15" for 16th column
      (( count[$variant] += 1 ))
      files[$variant]+=",$f"
    done
  } < "$f"
done
for variant in "${!count[@]}"; do
  printf "%s\t%d\t%s\n" "$variant" "${count[$variant]}" "${files[$variant]#,}"
done
outputs
c.B481A:p.G861S 2 File1,File2
c.G1T:p.G1R 1 File2
c.C221C:p.D461W 1 File1
c.G31T:p.G61R 1 File1
c.C21C:p.D61W 1 File2
The order of the output lines is indeterminate: associative arrays have no particular ordering.
Pure bash would be hard I think but everyone has some awk lying around :D
awk 'FNR==1{next}
{
  ++n[$16]
  if ($16 in a) {
    a[$16] = a[$16] "," ARGV[ARGIND]
  } else {
    a[$16] = ARGV[ARGIND]
  }
}
END{
  printf("%-24s %6s %s\n", "Variant", "Nº", "Location")
  for (v in n) printf("%-24s %6d %s\n", v, n[v], a[v])
}' *
(Note that ARGIND is a GNU awk extension; in other awks you can use FILENAME instead.)

Using Awk how to merge fields between files, F2 of file1 plus last 8char of F2 in file 2

I have two files, file1 and file2. I need to replace the F2 value of file1 by merging F2 of file1 with the last 8 characters of F2 in file2.
File 1 :
123456|AAAAAAA|BBBBBB|CCCCCCC
444444|kkkkkkk|rrrrrr|NNNNNNN
File 2:
AAAAAAA|DDDDDD12345678
kkkkkkk|987654321aaaaa
Expected Output
123456|AAAAAAA12345678|BBBBBB|CCCCCCC
444444|kkkkkkk321aaaaa|rrrrrr|NNNNNNN
I have tried the awk command below, but I am not sure how to fetch the last 8 characters of F2 from file2:
# awk -F"|" 'NR==FNR{a[$1]=$2} NR>FNR{$2=$2a[$2];print}' OFS='|' File2 File1
123456|AAAAAAADDDDDD12345678|BBBBBB|CCCCCCC
444444|kkkkkkk987654321aaaaa|rrrrrr|NNNNNNN
In order to get the last 8 characters of a[$2], you need to use substr:
substr(a[$2],length(a[$2])-7)
The above takes the substring of a[$2] starting at position length(a[$2])-7.
With that one change, your code produces your desired output:
$ awk -F"|" 'NR==FNR{a[$1]=$2} NR>FNR{$2=$2 substr(a[$2],length(a[$2])-7);print}' OFS='|' File2 File1
123456|AAAAAAA12345678|BBBBBB|CCCCCCC
444444|kkkkkkk321aaaaa|rrrrrr|NNNNNNN
As Ghoti points out in the comments, the more usual awk style is to use next so as to avoid the need for the second condition, NR>FNR, as follows:
awk -F"|" 'NR==FNR{a[$1]=$2;next} {$2=$2 substr(a[$2],length(a[$2])-7);print}' OFS='|' File2 File1
When awk encounters next, it skips the rest of the commands and starts over on the next line.
As awk programmers often value conciseness over clarity, it is common to see the print statement replaced with a 1:
awk -F"|" 'NR==FNR{a[$1]=$2;next} {$2=$2 substr(a[$2],length(a[$2])-7)} 1' OFS='|' File2 File1
In this case, 1 is a condition and it always evaluates to true. Since no action is associated with that condition, the default action is executed, which is print.
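The substr idiom the answer relies on is easy to check in isolation:

```shell
# substr(s, length(s) - 7) starts 7 characters before the end of s,
# i.e. it keeps the last 8 characters, checked here against File 2's value:
awk 'BEGIN { s = "DDDDDD12345678"; print substr(s, length(s) - 7) }'
# prints 12345678
```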

all matches per id on one line to one match per id per line

I have a tab delimited text file like so:
Gene1 ID:454,ID:575,ID:44449
Gene2 ID:4344,ID:5626,ID:4
Gene3 ID:244
And I'd like to get this into long form, e.g.
Gene1 ID:454
Gene1 ID:575
Gene1 ID:44449
Gene2 ID:4344
Gene2 ID:5626
Gene2 ID:4
Gene3 ID:244
I thought I could do this with sed, going line by line, replacing each comma with the first string up to the space (GeneX) plus the element before the comma, and then adding a newline, but I wasn't making much progress. And in some cases there is only one match (no comma), which complicates the parsing.
Is sed even the right way to go with this?
Perl to the rescue:
perl -ane '
  @ids = split /,/, $F[1];
  print "$F[0]\t$_\n" for @ids;
' < input.txt > output.txt
-n reads the file line by line
-a splits each line on whitespace into the @F array
split creates an array from a string - here, it splits the second ($F[1]) field on commas
Using awk.
awk -F , '{
  # Pull off the Gene## string.
  g = substr($1, 1, index($1, " "))
  # Set the output field separator to a newline followed by the gene string.
  OFS = "\n" g
  # Force awk to recombine the current line with the new value of OFS.
  # (Canonically this recombination is written $0=$0, I believe, but that
  # did not work for me here and I do not know why.)
  $1 = $1
  print
}' input.txt > output.txt
This might work for you (GNU sed):
sed -r 's/^((\S+\s)[^,]*),/\1\n\2/;P;D' file
This replaces the first , with the preceding tokens followed by a newline and then the first token and its following whitespace. The first line is then printed and deleted, and the procedure is repeated until no further ,'s are substituted.
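An awk alternative for the same task, using split() on the second field (assuming the gene and the ID list are tab-separated as described):

```shell
cd "$(mktemp -d)"
printf 'Gene1\tID:454,ID:575,ID:44449\nGene2\tID:4344,ID:5626,ID:4\nGene3\tID:244\n' > input.txt

# split() breaks the second field on commas; print one "gene<TAB>id" pair
# per element. A line with no comma still gives n == 1, so it passes through.
awk -F'\t' '{
  n = split($2, ids, ",")
  for (i = 1; i <= n; i++) print $1 "\t" ids[i]
}' input.txt > output.txt

cat output.txt
```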

Conditionally Merging two lines into one line

How can I merge two lines if they meet specific criteria in a Unix terminal?
I have data like:
A1
B1
A2
B2
A3
A4
A5
B5
And I want to merge them like this:
A1, B1
A2, B2
A3,
A4,
A5, B5
Real data looks like this:
"224222"
<Frequency freq="0.136" allele="T" sampleSize="5008"/>
"224223"
<Frequency freq="0.3864" allele="T" sampleSize="5008"/>
"224224"
"224225"
<Frequency freq="0.3894" allele="G" sampleSize="5008"/>
"1801179"
"1861759"
I actually tried to add dummy delimiter text before the "A" data to separate them, but I couldn't achieve it.
Using sed
sed 's/$/, /;N;/\n<Freq/{s/\n//};P;D' <file>
Explanation:
s/$/, / - Append a comma to the current line
N - Get the next line
/\n<Freq/{s/\n//} - If the second line contains <Freq, delete the newline
P - Print first portion of pattern space
D - Delete first portion of pattern space
It can be done using awk getline:
awk '{ if (condition) { if ((getline var) > 0) print $0", "var; else print $0 } else print $0 }' <file>
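For the real data shown in the question, one way to fill in the condition placeholder is to treat lines starting with a double quote as keys; a sketch:

```shell
cd "$(mktemp -d)"
cat > file <<'EOF'
"224222"
<Frequency freq="0.136" allele="T" sampleSize="5008"/>
"224223"
<Frequency freq="0.3864" allele="T" sampleSize="5008"/>
"224224"
"224225"
<Frequency freq="0.3894" allele="G" sampleSize="5008"/>
"1801179"
"1861759"
EOF

# Hold each key line; when the next line is another key, flush the held one
# with a bare trailing comma, otherwise join the two with ", ".
awk '/^"/ { if (held != "") print held ","
            held = $0; next }
     { print held ", " $0; held = "" }
     END { if (held != "") print held "," }' file > merged

cat merged
```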
