Doing a complex join on files - unix

I have files (~1k) that look (basically) like this:
NAME1.txt
NAME ATTR VALUE
NAME1 x 1
NAME1 y 2
...
NAME2.txt
NAME ATTR VALUE
NAME2 x 19
NAME2 y 23
...
The ATTR column is the same in every file, and the NAME column is just some version of the filename. I would like to combine them into one file that looks like:
All_data.txt
ATTR NAME1_VALUE NAME2_VALUE NAME3_VALUE ...
x 1 19 ...
y 2 23 ...
...
Is there a simple way to do this with just command-line utilities, or will I have to resort to writing a script?
Thanks

You need to write a script.
gawk is the obvious candidate.
You could build up an associative array as the files are read, using FILENAME as the key and
ATTR " " VALUE
as the value.
Then create your output in an END block.
gawk can process all the txt files together if you pass *.txt as the filename argument.
It's a bit optimistic to expect there to be a ready-made command to do exactly what you want.
Very few commands join data horizontally.
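For example, here is a minimal gawk sketch along those lines; it assumes every file has a one-line header, that each file contains the same ATTR rows, and that the filenames look like NAME1.txt (adjust to your real data):
gawk '
    FNR == 1 { next }                          # skip each file's header line
    {
        name = FILENAME
        sub(/\.txt$/, "", name)                # column name derived from the file name
        if (!(name in seen)) { seen[name] = 1; files[++nf] = name }
        if (!($2 in seenA))  { seenA[$2] = 1;  attrs[++na] = $2 }
        value[$2, name] = $3                   # VALUE keyed by (ATTR, file)
    }
    END {
        printf "ATTR"
        for (i = 1; i <= nf; i++) printf " %s_VALUE", files[i]
        printf "\n"
        for (j = 1; j <= na; j++) {
            printf "%s", attrs[j]
            for (i = 1; i <= nf; i++) printf " %s", value[attrs[j], files[i]]
            printf "\n"
        }
    }
' *.txt > All_data.txt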

Related

csplit in zsh: splitting file based on pattern

I would like to split the following file based on the pattern ABC:
ABC
4
5
6
ABC
1
2
3
ABC
1
2
3
4
ABC
8
2
3
to get file1:
ABC
4
5
6
file2:
ABC
1
2
3
etc.
Looking at the docs in man csplit: csplit my_file /regex/ {num}.
I can split this file using csplit my_file '/^ABC$/' {2}, but this requires me to put in a number for {num}. When I try to match with {*}, which is supposed to repeat the pattern as many times as possible, I get the error:
csplit: *}: bad repetition count
I am using zsh.
To split a file on a pattern like this, I would turn to awk:
awk 'BEGIN { i=0; }
/^ABC/ { ++i; }
{ print >> ("file" i) }' < input
This reads lines from the file named input; before reading any lines, the BEGIN section explicitly initializes an "i" variable to zero; variables in awk default to zero, but it never hurts to be explicit. The "i" variable is our index to the serial filenames.
Subsequently, each line that starts with "ABC" will increment this "i" variable.
Any and every line in the file will then be printed (in append mode) to the file name that's generated from the text "file" and the current value of the "i" variable.
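Run against the sample input above, that produces file1 through file4; file1, for instance, contains:
ABC
4
5
6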

Selecting specific rows of a tab-delimited file using bash (linux)

I have a directory with a lot of tab-delimited txt files with several rows and columns, e.g.
File1
Id Sample Time ... Variant[Column16] ...
1 s1 t0 c.B481A:p.G861S
2 s2 t2 c.C221C:p.D461W
3 s5 t1 c.G31T:p.G61R
File2
Id Sample Time ... Variant[Column16] ...
1 s1 t0 c.B481A:p.G861S
2 s2 t2 c.C21C:p.D61W
3 s5 t1 c.G1T:p.G1R
and what I am looking for is to create a new file with:
all the different variants, deduplicated
the number of times each variant is repeated
and the file location
i.e.:
NewFile
Variant Nº of repeats Location
c.B481A:p.G861S 2 File1,File2
c.C221C:p.D461W 1 File1
c.G31T:p.G61R 1 File1
c.C21C:p.D61W 1 File2
c.G1T:p.G1R 1 File2
I think a basic bash script with awk, sort and uniq would work, but I do not know where to start. Or if RStudio or Python(3) is easier, I could try that.
Thanks!!
Pure bash. Requires version 4.0+
# two associative arrays
declare -A files
declare -A count

# use a glob pattern that matches your files
for f in File{1,2}; do
    {
        read -r header
        while read -ra fields; do
            variant=${fields[3]}    # use index "15" for the 16th column
            (( count[$variant] += 1 ))
            files[$variant]+=",$f"
        done
    } < "$f"
done

for variant in "${!count[@]}"; do
    printf "%s\t%d\t%s\n" "$variant" "${count[$variant]}" "${files[$variant]#,}"
done
outputs
c.B481A:p.G861S 2 File1,File2
c.G1T:p.G1R 1 File2
c.C221C:p.D461W 1 File1
c.G31T:p.G61R 1 File1
c.C21C:p.D61W 1 File2
The order of the output lines is indeterminate: associative arrays have no particular ordering.
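If a fixed order matters, one simple tweak (just a suggestion, not part of the script above) is to pipe the final loop through sort:
for variant in "${!count[@]}"; do
    printf "%s\t%d\t%s\n" "$variant" "${count[$variant]}" "${files[$variant]#,}"
done | sort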
Pure bash would be hard I think but everyone has some awk lying around :D
awk 'FNR==1 { next }                      # skip each file's header line
    {
        ++n[$16]                          # count occurrences of this variant
        if ($16 in a) {
            a[$16] = a[$16] "," ARGV[ARGIND]
        } else {
            a[$16] = ARGV[ARGIND]         # ARGIND (index of the current file) is a gawk extension
        }
    }
    END {
        printf("%-24s %6s %s\n", "Variant", "Nº", "Location")
        for (v in n) printf("%-24s %6d %s\n", v, n[v], a[v])
    }' *

Attach foldername to first column of file

I have a list of files that have identical filenames but are in different subfolders. The values in the files are separated with a tab.
I would like to attach an additional first column with the folder name to each of the "test.txt" files, and then merge them into one file at the end (they all have the same column header).
The most important part, though, is the merging.
I have tried so many commands now that did not work, so I guess I am missing an essential step with awk...
Current structure is:
mainfolder
|_>Folder1
|   |_>test.txt
|_>Folder2
|   |_>test.txt
.
.
.
This is where I would like to get to, per file, before merging all of them:
#Name Count FragCount Type Left LeftB Right RightB Support FRPM LeftBD LeftBE RightBD RightBE annots
RFP1A 13 10 REF RFP1A_ins chr3:3124352:+ RFP1A_ins chr3:5234143:+ confirmed 0.86 TA 1.454 AC 1.564 ["INTRACHROM."]
#Samplename #Name Count FragCount Type Left LeftB Right RightB Support FRPM LeftBD LeftBE RightBD RightBE annots
Sample1 RFP1A 13 10 REF RFP1A_ins chr3:3124352:+ RFP1A_ins chr3:5234143:+ confirmed 0.86 TA 1.454 AC 1.564 ["INTRACHROM."]
Thanks so much!!
D
I believe this might do the trick:
$ cd mainfolder
$ awk '(NR==1){sub("#","#Samplename\t"); print} # print header
(FNR==1){next} # skip header
{print substr(FILENAME,1,match(FILENAME,"/")-1)"\t"$0 } # add directory
' */test.txt > /path/to/newfile.txt

How to read the comma separated values

We have a fixed-width file:
Col1 length 10
Col2 length 10
Col3 length 30
Col4 length 40
Sample record
ABC 123 xyz. 5171-5261,51617
ABC. 1234. Xxy. 81651-61761
Col4 can have any number of comma-separated values,
one or more, within its 40-character width. If a record has only one value, there is no change in the output file.
If more than one value is there, i.e. comma separated (5171-5261,51617),
the output file should have multiple records.
From that 1 record:
ABC. 123. Xyz. 5171-5261
ABC 123. Xyz. 51617
What is the most efficient way to do this?
As of now we are trying while and for loops, but it is taking very long to execute since we do this splitting by reading each record.
The output file can be comma separated or fixed width.
awk is your friend here.
A single line of awk (GNU awk, since FIELDWIDTHS is a gawk feature) will achieve what you need:
awk -v FIELDWIDTHS="10 10 30 40" '{ if (match($4,",")) { split($4,array,","); for (i in array) { print $1,$2,$3,array[i]; }; } else { print $1,$2,$3,$4 }; }' samp.dat
For ease of reading the code is:
{
    if (match($4,",")) {
        split($4,array,",")
        for (i in array) {
            print $1,$2,$3,array[i]
        }
    } else {
        print $1,$2,$3,$4
    }
}
Testing with the sample data you supplied gives:
ABC 123 xyz. 5171-5261
ABC 123 xyz. 51617
ABC. 1234. Xxy. 81651-61761
How it works:
awk reads your file one line at a time.
The FIELDWIDTHS directive allows us to reference each column as $1,$2...
Now that we have our columns we can look for a comma in the fourth field with match($4,",").
If we find one we make an array of the values in the fourth field that are separated by commas with split($4,array,",").
Then we loop through this array and print multiple lines of output, one for each element of the array.
If the fourth field has no comma the else clause prints a single line.
This process repeats for each line in your fixed width file.
NOTE:
awk associative arrays are not guaranteed to preserve the order of your data.
This means that your output might come out as
ABC 123 xyz. 51617
ABC 123 xyz. 5171-5261
ABC. 1234. Xxy. 81651-61761
i.e. 5171-5261,51617 in the input data produced a line from the second value before the first.
If the ordering is important to you then you can use the code below that makes a csv from your input data first, then produces the output preserving the order.
awk -v FIELDWIDTHS="10 10 30 40" '{print $1,$2,$3,$4}' OFS=',' samp.dat > samp.csv
awk -F',' '{ for (i=4; i<=NF; i++) { print $1,$2,$3,$i } }' samp.csv

sh - Split File to Multiple Files

I need a Unix (AIX) script to split a file into multiple files, basically one file per line, where the content of the file is like:
COL_1 ROW 1 1 1
COL_2 ROW 2 2 2
COL_3 ROW 3 3 3
... and the name of each file is the 1st column, and the content of the file is the rest of the line, something like:
Name: COL_1.log
content:
ROW 1 1 1
Thanks in advance,
Tiago
Use a while loop and read each line:
while read -r COL REST; do
    echo "$REST" > "$COL.log"
done < file
COL will contain the first word of each line
REST will contain the rest of the line
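For example, feeding it just the first sample line shows how read splits it (a quick illustration, not part of the original answer):
printf 'COL_1 ROW 1 1 1\n' | while read -r COL REST; do
    echo "file: $COL.log  content: $REST"
done
# prints: file: COL_1.log  content: ROW 1 1 1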
