Syncsort Join two files - syncsort-dmx-h

I am trying to join two files on a key and add a new column to the result based on a condition, but I cannot figure out why the condition in my command is not working. Please help me understand the correct way of doing this.
a.txt
3~Y
4~0
1~Y
2~N
b.txt
4~44~444
3~33~333
2~22~222
1~11~111
Syncsort Command
/INFILE a.txt ALIAS DOC '~' 3000
/JOINKEYS BA1
/FIELD DOC_Rest1 2:1 - 2:
/INFILE b.txt ALIAS FINACT '~' 3000
/JOINKEYS BA2
/FIELD FINACT_DID 2:1 - 2:
/FIELD FINACT_PILOT 3:1 - 3:
/FIELDS BA1 1:1 - 1:, BA2 1:1 - 1:
/COND CHKINCCALS (DOC_Rest1 = 'Y')
/DERIVEDFIELD endofrecord '\n'
/DERIVEDFIELD TYPECAL
If CHKINCCALS then FINACT_DID
Else FINACT_PILOT
/OUTFILE c.txt OVERWRITE /REFORMAT LEFTSIDE:BA1,LEFTSIDE:DOC_Rest1, TYPECAL, endofrecord
/END
Expected result
1~Y~11~
2~N~222~
3~Y~33~
4~0~444~
Actual result
1~Y~Y~
2~N~~
3~Y~Y~
4~0~~

After some guesswork, I was able to figure out the correct way: apparently the RIGHTSIDE: qualifier is needed so the derived field resolves FINACT_DID and FINACT_PILOT against the right-side (b.txt) record. I am still not sure exactly how it works, though.
/INFILE a.txt ALIAS DOC '~' 3000
/JOINKEYS BA1
/FIELD DOC_Rest1 2:1 - 2:
/INFILE b.txt ALIAS FINACT '~' 3000
/JOINKEYS BA2
/FIELD FINACT_DID 2:1 - 2:
/FIELD FINACT_PILOT 3:1 - 3:
/FIELDS BA1 1:1 - 1:, BA2 1:1 - 1:
/COND CHKINCCALS (DOC_Rest1 = 'Y')
/DERIVEDFIELD endofrecord '\n'
/DERIVEDFIELD TYPECAL
If CHKINCCALS then RIGHTSIDE:FINACT_DID
Else RIGHTSIDE:FINACT_PILOT
/OUTFILE c.txt OVERWRITE /REFORMAT LEFTSIDE:BA1,LEFTSIDE:DOC_Rest1, TYPECAL, endofrecord
/END
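
For readers more at home with Unix tools, here is a minimal sketch of the same inner join and conditional column in awk (not Syncsort; the field positions are assumed from the sample files above):

awk -F'~' '
NR==FNR { flag[$1] = $2; next }        # first pass: remember the Y/N flag from a.txt by key
$1 in flag {                           # second pass over b.txt: inner join on the key
  val = (flag[$1] == "Y") ? $2 : $3    # DID when the flag is Y, otherwise PILOT
  print $1 "~" flag[$1] "~" val "~"
}' a.txt b.txt | sort

Piped through sort, this reproduces the expected result above.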

Related

Cut specific columns and collapse with delimiter in Unix

Say I have 6 different columns in a text file (as shown below)
A1 B1 C1 D1 E1 F1
1 G PP GG HH GG
z T CC GG FF JJ
I would like to extract the first, second and fourth columns collapsed together as A1_B1_D1, with the third column separated by a tab.
So the result would be:
A1_B1_D1 C1
1_G_GG PP
z_T_GG CC
I tried
cut -f 1,2,4 -d$'\t' file, but it is just not what I want.
If you need to maintain your column alignment, you can check the length of the combination of fields 1, 2 and 4 and add one or two tab characters as necessary:
awk '{
  printf (length($1"_"$2"_"$4) >= 8) ? "%s_%s_%s\t%s\n" : "%s_%s_%s\t\t%s\n",
         $1, $2, $4, $3
}' file
Example Output
A1_B1_D1 C1
1_G_GG PP
z_T_GG CC
Could you please try the following.
awk '
BEGIN {
  OFS="\t"
}
{
  print $1"_"$2"_"$4, $3
}
' Input_file
I've tried RavinderSingh13's code and it has the same output as mine, but I don't quite know the difference; anyway, here it is:
awk -F ' ' '{print $1"_"$2"_"$4"\t"$3}' /path/to/file
This might work for you (GNU sed):
sed -E 's/^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+.*/\1_\2_\4\t\3/' file
Use pattern matching and back references.
\S+ means one or more non-white space characters.
\s+ means one or more white space characters.
\t represents a tab.
Another awk and using column -t for formatting.
$ cat cols_345.txt
A1 B1 C1 D1 E1 F1
1 G PP GG HH GG
z T CC GG FF JJ
$ awk -v OFS="_" '{ $3="\t"$3; print $1,$2,$4 $3 } ' cols_345.txt | column -t
A1_B1_D1 C1
1_G_GG PP
z_T_GG CC
$

Selecting specific rows of a tab-delimited file using bash (linux)

I have a directory with a lot of tab-delimited txt files with several rows and columns, e.g.
File1
Id Sample Time ... Variant[Column16] ...
1 s1 t0 c.B481A:p.G861S
2 s2 t2 c.C221C:p.D461W
3 s5 t1 c.G31T:p.G61R
File2
Id Sample Time ... Variant[Column16] ...
1 s1 t0 c.B481A:p.G861S
2 s2 t2 c.C21C:p.D61W
3 s5 t1 c.G1T:p.G1R
and what I am looking for is to create a new file with:
all the different variants, deduplicated
the number of times each variant is repeated
and the file locations
i.e.:
NewFile
Variant Nº of repeated Location
c.B481A:p.G861S 2 File1,File2
c.C221C:p.D461W 1 File1
c.G31T:p.G61R 1 File1
c.C21C:p.D61W 1 File2
c.G1T:p.G1R 1 File2
I think a basic bash script using awk, sort and uniq will work, but I do not know where to start. Or if RStudio or Python(3) is easier, I could try.
Thanks!!
Pure bash. Requires version 4.0+
# two associative arrays
declare -A files
declare -A count
# use a glob pattern that matches your files
for f in File{1,2}; do
  {
    read header
    while read -ra fields; do
      variant=${fields[3]}    # use index "15" for the 16th column
      (( count[$variant] += 1 ))
      files[$variant]+=",$f"
    done
  } < "$f"
done

for variant in "${!count[@]}"; do
  printf "%s\t%d\t%s\n" "$variant" "${count[$variant]}" "${files[$variant]#,}"
done
outputs
c.B481A:p.G861S 2 File1,File2
c.G1T:p.G1R 1 File2
c.C221C:p.D461W 1 File1
c.G31T:p.G61R 1 File1
c.C21C:p.D61W 1 File2
The order of the output lines is indeterminate: associative arrays have no particular ordering.
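If you need a deterministic order, pipe the final loop through sort, for example descending by count:

for variant in "${!count[@]}"; do
  printf "%s\t%d\t%s\n" "$variant" "${count[$variant]}" "${files[$variant]#,}"
done | sort -t$'\t' -k2,2nr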
Pure bash would be hard, I think, but everyone has some awk lying around :D
awk 'FNR==1 { next }
{
  ++n[$16]
  if ($16 in a) {
    a[$16] = a[$16] "," ARGV[ARGIND]
  } else {
    a[$16] = ARGV[ARGIND]
  }
}
END {
  printf("%-24s %6s %s\n", "Variant", "Nº", "Location")
  for (v in n) printf("%-24s %6d %s\n", v, n[v], a[v])
}' *
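
Note that ARGV[ARGIND] is gawk-specific. A hedged sketch of the same idea in portable awk, using FILENAME instead and also recording each file only once per variant:

awk 'FNR==1 { next }
{
  n[$16]++
  if (!(($16, FILENAME) in seen)) {    # record each file at most once per variant
    seen[$16, FILENAME] = 1
    a[$16] = ($16 in a) ? a[$16] "," FILENAME : FILENAME
  }
}
END {
  printf("%-24s %6s %s\n", "Variant", "Nº", "Location")
  for (v in n) printf("%-24s %6d %s\n", v, n[v], a[v])
}' File1 File2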

deleting repetitive columns in unix

I would like to delete multiple repetitive columns from a huge file (about 1 million).
The columns that I want to delete all have the same column name, A, while the others have different, unique names. Say:
A B2 A B3
1.1 AA 1.2 AA
2.1 AB 4.3 CT
2.2 AC 6.4 GT
so column headers are A, B2, A, B3,... .
How could I delete the columns named A from the data?
Another in awk:
$ awk '
NR==1 {
  split($0,a)
  for (i in a)
    if (a[i]=="A")
      delete a[i]
}
{
  for (i=1; i<=NF; i++)
    printf "%s", (i in a ? $i OFS : "")
  printf ORS
}' file
B2 B3
AA AA
AB CT
AC GT
I'm not sure I'm understanding your question correctly, but here is a (GNU) awk solution to delete all duplicate columns (keeping only the first occurrence):
#!/usr/bin/awk -f
NR==1 {
  seen[$1] = 1
  cols[0] = 1
  for (i=2; i<=NF; i++) {
    if (!($i in seen)) {
      seen[$i] = 1
      cols[length(cols)] = i
    }
  }
}
{
  for (i=0; i<length(cols); i++)
    printf "%s ", $(cols[i])    # print the field as data, not as a format string
  printf "\n"
}
For the first line (NR==1), we find all non-duplicate columns (preserving the order), and for all the other lines, we just print out the columns (fields) we selected before (cols array holds column/field indexes we wish to keep).
$ ./filter.awk file
A B2 B3
1.1 AA AA
2.1 AB CT
2.2 AC GT
cut -d' ' -f $(head -1 filename|tr ' ' '\n'|awk '{if(!seen[$0]++) print NR}'|paste -s -d ',') filename
this will work like a charm.
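To see what it does, run the command substitution on its own; with the sample header A B2 A B3 it keeps only the first occurrence of each column name:

$ head -1 filename | tr ' ' '\n' | awk '{if(!seen[$0]++) print NR}' | paste -s -d ','
1,2,4
$ cut -d' ' -f 1,2,4 filename
A B2 B3
1.1 AA AA
2.1 AB CT
2.2 AC GT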
The question is solved by James Brown's code. I added
#!/usr/bin/awk -f
to the first line of his code and corrected a tiny typo at the end of the code (an extra ' deleted).
I am sorry, I did not have time to try all the other suggestions.
With my best wishes

How to convert multiple lines into fixed column lengths

To join rows into a single delimited line, it's easy:
cat input.txt | tr "\n" " "
But I have a long file with 84046468 lines. I wish to convert this into a file with 1910147 rows and 44 tab-delimited columns. The first column is a text string such as chrXX_12345_+ and the other 43 columns are numerical strings. Is there a way to perform this transformation?
There are NAs present, so I guess sed, substituting "\t" for "\n" only when the preceding string is a number, does not work.
sample input.txt
chr10_1000103_+
0.932203
0.956522
1
0.972973
1
0.941176
1
0.923077
1
1
0.909091
0.9
1
0.916667
0.8
1
1
0.941176
0.904762
1
1
1
0.979592
0.93617
0.934783
1
0.941176
1
1
0.928571
NA
1
1
1
0.941176
1
0.875
0.972973
1
1
NA
0.823529
0.51366
chr10_1000104_-
0.952381
1
1
0.973684
sample output.txt
chr10_1000103_+ 0.932203 (numbers all tab-delimited)
chr10_1000104_- etc
(sorry, a lot of numbers to type manually)
sed '
# add a marker
s/^/M/
:Next
# bump a counter
s/^/i/
# test counter
/^\(i\)\{44\}/ !{
$ !{
# not 44 lines yet and not end of file: append the next line
N
# loop
b Next
}
}
# remove marker and counter
s/^i*M//
# replace each newline with a tab
s/\n/\t/g' YourFile
Note: some sed implementations limit the \{n\} repetition count to 255, so 44 is fine.
Here's the right approach using 4 columns instead of 44:
$ cat file
chr10_1000103_+
0.932203
0.956522
1
chr10_1000104_-
0.952381
1
1
$ awk '{printf "%s%s", $0, (NR%4?"\t":"\n")}' file
chr10_1000103_+ 0.932203 0.956522 1
chr10_1000104_- 0.952381 1 1
Just change 4 to 44 for your real input.
If you are seeing control-Ms in your output it's because they are present in your input, so use dos2unix or similar to remove them before running the tool, or with GNU awk you could just set -v RS='\r\n'.
When posting questions it's important to make it as clear, simple, and brief as possible so that as many people as possible will be interested in helping you.
BTW, cat input.txt | tr "\n" " " is a UUOC and should just be tr "\n" " " < input.txt
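As an aside, coreutils paste can do the same grouping: each - argument consumes one line of stdin per output row, and the default delimiter is already a tab. A sketch for the 4-column example, and (assuming bash, for the brace expansion) the 44-column case:

paste - - - - < file
paste $(printf -- '- %.0s' {1..44}) < input.txt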
Not the best solution, but should work:
line="nonempty"; while [ ! -z "$line" ]; do for i in $(seq 44); do read line; echo -n "$line "; done; echo; done < input.txt
If there is an empty line in the file, it will terminate. For a more permanent solution I'd try perl.
edit:
If you are concerned with efficiency, just use awk.
awk '{ printf "%s\t", $1 } NR%44==0{ print "" }' < input.txt
You may want to strip the trailing tab character with | sed 's/\t$//' or make the awk script more complicated.
This might work for you (GNU sed):
sed '/^chr/!{H;$!d};x;s/\n/\t/gp;d' file
If a line does not begin with chr, append it to the hold space and then delete it, unless it is the last line. If the line does start with chr, or it is the last line, swap to the hold space, replace all newlines with tabs, and print the result.
N.B. the start of the next line will be left untouched in the pattern space which becomes the new hold space.

Using grep -f and -w together

I have two files like this:
abc.txt
a
b
z
1
10
and abcd.txt
a
b
c
d
1
10
100
1000
I would like:
a
b
1
10
I would like to use grep -w -f abc.txt abcd.txt to search through every line of abc.txt and print lines which match an entire word. If I just use grep -f, I also get the line 100, since the pattern '10' matches '100'. But grep -w -f abc.txt abcd.txt produces:
a
b
1
and doesn't print 10. So, I guess: what is the best way to match every line of abc.txt against entire lines of abcd.txt?
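
For the record, whole-line matching is what grep's -x flag is for; combined with -F (treat the patterns as fixed strings, so 10 cannot match as a prefix of 100) it gives the desired output:

$ grep -xFf abc.txt abcd.txt
a
b
1
10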
