Attach foldername to first column of file - unix

I have a list of files that all have the same filename but live in different subfolders. The values in the files are separated by tabs.
I would like to prepend to each "test.txt" an additional first column containing its folder name, and then merge them all into one file (they all have the same column header).
The most important part, though, is the merging.
I have tried so many commands now that did not work, so I guess I am missing an essential step with awk...
Current structure is:
mainfolder
|-> Folder1
|     |-> test.txt
|-> Folder2
|     |-> test.txt
...
This is where I would like to get to per file before merging all of them.
Before:
#Name Count FragCount Type Left LeftB Right RightB Support FRPM LeftBD LeftBE RightBD RightBE annots
RFP1A 13 10 REF RFP1A_ins chr3:3124352:+ RFP1A_ins chr3:5234143:+ confirmed 0.86 TA 1.454 AC 1.564 ["INTRACHROM."]
After:
#Samplename #Name Count FragCount Type Left LeftB Right RightB Support FRPM LeftBD LeftBE RightBD RightBE annots
Sample1 RFP1A 13 10 REF RFP1A_ins chr3:3124352:+ RFP1A_ins chr3:5234143:+ confirmed 0.86 TA 1.454 AC 1.564 ["INTRACHROM."]
Thanks so much!!
D

I believe this might do the trick:
$ cd mainfolder
$ awk '(NR==1){sub("#","#Samplename\t"); print}   # print the very first header, with the new column
       (FNR==1){next}                             # skip the header of every file
       {print substr(FILENAME,1,match(FILENAME,"/")-1)"\t"$0}   # prepend the directory name
      ' */test.txt > /path/to/newfile.txt
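The key is the NR/FNR distinction: NR==1 holds only for the very first line across all inputs, so the header is printed exactly once (with #Samplename prepended by sub()), while FNR==1 holds at the top of every file, so all remaining headers are skipped. Because the files are given as */test.txt, FILENAME looks like Folder1/test.txt, and the substr/match pair extracts everything before the first "/", i.e. the folder name.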

Related

Selecting specific rows of a tab-delimited file using bash (linux)

I have a directory with a lot of tab-delimited txt files with several rows and columns, e.g.
File1
Id Sample Time ... Variant[Column16] ...
1 s1 t0 c.B481A:p.G861S
2 s2 t2 c.C221C:p.D461W
3 s5 t1 c.G31T:p.G61R
File2
Id Sample Time ... Variant[Column16] ...
1 s1 t0 c.B481A:p.G861S
2 s2 t2 c.C21C:p.D61W
3 s5 t1 c.G1T:p.G1R
and what I am looking for is to create a new file with:
all the different variants (unique)
the number of times each variant is repeated
and the file location(s)
i.e.:
NewFile
Variant Nº of repeats Location
c.B481A:p.G861S 2 File1,File2
c.C221C:p.D461W 1 File1
c.G31T:p.G61R 1 File1
c.C21C:p.D61W 1 File2
c.G1T:p.G1R 1 File2
I think a basic bash script using awk, sort, and uniq will work, but I do not know where to start. If RStudio or Python (3) would be easier, I could try that instead.
Thanks!!
Pure bash. Requires version 4.0+ (for associative arrays).
# two associative arrays
declare -A files
declare -A count

# use a glob pattern that matches your files
for f in File{1,2}; do
    {
        read -r header
        while read -ra fields; do
            variant=${fields[3]}    # use index "15" for the 16th column
            (( count[$variant] += 1 ))
            files[$variant]+=",$f"
        done
    } < "$f"
done

for variant in "${!count[@]}"; do
    printf "%s\t%d\t%s\n" "$variant" "${count[$variant]}" "${files[$variant]#,}"
done
outputs
c.B481A:p.G861S 2 File1,File2
c.G1T:p.G1R 1 File2
c.C221C:p.D461W 1 File1
c.G31T:p.G61R 1 File1
c.C21C:p.D61W 1 File2
The order of the output lines is indeterminate: associative arrays have no particular ordering.
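If you need a deterministic order, one option (not part of the original answer) is to pipe the output loop through sort, e.g. by descending count:

for variant in "${!count[@]}"; do
    printf "%s\t%d\t%s\n" "$variant" "${count[$variant]}" "${files[$variant]#,}"
done | sort -t$'\t' -k2,2nr    # sort on the count column, numeric, descending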
Pure bash would be hard, I think, but everyone has some awk lying around :D
awk 'FNR==1{next}
     {
         ++n[$16]
         if ($16 in a) {
             a[$16] = a[$16] "," ARGV[ARGIND]
         } else {
             a[$16] = ARGV[ARGIND]
         }
     }
     END {
         printf("%-24s %6s %s\n", "Variant", "Nº", "Location")
         for (v in n) printf("%-24s %6d %s\n", v, n[v], a[v])
     }' *
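Note that ARGIND is a GNU awk (gawk) extension. With a POSIX awk you can use FILENAME instead of ARGV[ARGIND]; it holds the name of the input file currently being read, which is the same thing here.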

Combining many files with matching fields in particular column to a single file

So I have 128 files with two columns.
I want to match them by the values in the first column and add the values in the second column from each file to a single file.
I was able to kind of find a solution here:
From: https://unix.stackexchange.com/questions/159961/merging-2-files-with-based-on-field-match
awk 'FNR==NR{a[$1]=$2;next} ($1 in a) {print $1,a[$1],$2}' file2 file1
It does what I want; however, I need this to go through every file in the folder.
Is there a way to make this command loop through all the files in the folder, or is there a better method altogether?
Example:
Input
File 1:
gene_id normalized_count
A1BG|1 42.3332
A1CF|29974 165.6696
A2BP1|54715 0.0000
A2LD1|87769 138.1270
A2ML1|144568 2.7612
A2M|2 7310.6121
A4GALT|53947 348.3663
A4GNT|51146 0.0000
File 2:
gene_id normalized_count
A1BG|1 18.2019
A1CF|29974 129.6194
A2BP1|54715 2.2063
A2LD1|87769 65.3116
A2ML1|144568 0.0000
A2M|2 3415.8632
A4GALT|53947 83.2874
A4GNT|51146 0.0000
File 3:
gene_id normalized_count
A1BG|1 8.6285
A1CF|29974 97.6385
A2BP1|54715 0.0000
A2LD1|87769 200.5540
A2ML1|144568 0.0000
A2M|2 984.0736
A4GALT|53947 24.0690
A4GNT|51146 0.4541
Desired output
gene_id normalized_count
A1BG|1 42.3332 18.2019 8.6285
A1CF|29974 165.6696 129.6194 97.6385
A2BP1|54715 0 2.2063 0
A2LD1|87769 138.127 65.3116 200.554
A2ML1|144568 2.7612 0 0
A2M|2 7310.6121 3415.8632 984.0736
A4GALT|53947 348.3663 83.2874 24.069
A4GNT|51146 0 0 0.4541
For the desired output I don't care how the column labels end up looking.
Again my problem is that I have to do this for hundreds of files at once to produce one file.
Here are some other similar problems with solutions
https://unix.stackexchange.com/questions/122919/merge-2-files-based-on-all-values-of-the-first-column-of-the-first-file
https://unix.stackexchange.com/questions/113879/how-to-merge-two-files-with-different-number-of-rows-in-shell
But they only had to do this for a few files.
Edit: both Nathan's and joepd's solutions worked and produced similar output.
Thank you!
Nathan's solution produces space-delimited output.
joepd's produces output that keeps the header (with its original tab separation); the first column is separated by two spaces and the rest is space-delimited.
You will need gawk for this:
gawk '{a[$1]+=$2}; END{ for (i in a) print i, a[i]}' files*
If this does not work for you, please specify input and output.
EDIT
After your specification it becomes clear that you want to concatenate the strings. How about this?
awk '
    NR==1  {title = $0}
    FNR!=1 {a[$1] = a[$1] " " $2}
    END {
        print title
        for (i in a)
            print i, a[i]
    }
' files*
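One caveat (not in the original answer): for (i in a) visits keys in unspecified order, so the gene rows may come out shuffled. In gawk you can force sorted traversal by adding PROCINFO["sorted_in"] = "@ind_str_asc" at the top of the END block.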
This should produce the output you want, with one more column in the output for each file in the input:
awk 'FNR>1{a[$1]=a[$1] " " $2}; END{ for (i in a) print i a[i]}' File*
It's structured like @joepd's first answer, except that it string-concatenates the values instead of numerically summing them.
FNR>1 is used to ignore the header line in each file.

Finding common elements from one file in a column of another file and output the entire row of the latter

I need to extract all hits from one list (List.txt) that can be found in one of the columns of another file (here Data.txt) into a third (output.txt).
Data.txt (tab delimited)
some_data more_data other_data here yet_more_data etc
A B 2 Gee;Whiz;Hello 13 12
A B 2 Gee;Whizz;Hi 56 32
E 4 Btm;Lol 16 2
T 3 Whizz 13 3
List.txt
Gee
Whiz
Lol
Ideally output.txt looks like
some_data more_data other_data here yet_more_data etc
A B 2 Gee;Whiz;Hello 13 12
A B 2 Gee;Whizz;Hi 56 32
E 4 Btm;Lol 16 2
So I tried a shell script
for ids in $(cat List.txt)
do
    grep "$ids" Data.txt >> output.txt
done
(except that I actually typed out (cut and paste, actually) every word from List.txt in said script).
Unfortunately it gave me an output.txt that included the last line; I assume this is because 'Whizz' contains 'Whiz'.
I also tried cat Data.txt | egrep -F "List.txt", which resulted in grep: conflicting matchers specified -- I suppose that was too naive of me. The actual files: List.txt contains a sorted list of 985 words, and Data.txt has 115576 rows with 17 columns.
Some help/guidance would be much appreciated, thanks.
Try something like this:
for ids in $(cat List.txt)
do
    grep "[TAB;]$ids[TAB;]" Data.txt >> output.txt
done
But it has two drawbacks:
"Data.txt" is scanned multiple times
You can get one line multiple times.
If that is a problem, try a two-step version:
cat List.txt | sed -e "s/.*/[TAB;]&[TAB;]/" > List_mod.txt
grep -f List_mod.txt Data.txt > output.txt
Note:
A TAB character can be inserted on the command line with Ctrl-V followed by the Tab key, and as a literal Tab character in an editor. Check that your editor does not convert tabs into a series of spaces.
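Alternatively, bash can produce the literal tab for you with $'\t' quoting, so nothing has to be typed by hand. A sketch of the same two-step approach:
tab=$'\t'                                          # a real tab character
sed "s/.*/[${tab};]&[${tab};]/" List.txt > List_mod.txt
grep -f List_mod.txt Data.txt > output.txt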
The UNIX tool for general text processing is "awk":
awk '
    NR==FNR { list[$0]; next }
    FNR==1  { print; next }    # pass the Data.txt header row through (List.txt is consumed by the rule above)
    {
        for (word in list) {
            if ($0 ~ ("[\t;]" word "[\t;]")) {
                print
                next
            }
        }
    }
' List.txt Data.txt > output.txt
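As an aside, the egrep -F attempt was close: -f (lowercase) is the option that reads patterns from a file, and -w restricts matches to whole words, which sidesteps the Whiz/Whizz problem:
grep -wFf List.txt Data.txt > output.txt
Two caveats: it matches the words anywhere in the line, not just in the one column, and it does not pass the header row through.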

TCSH/CSH | assign variable with command's result

I wrote this code:
1  #!/bin/tcsh
2
3  set myFiles = `ls`
4  set i = 1;
5  echo "argc is $#argv"
6  while ($i <= $#argv)
7      $myFiles = `echo $myFiles | tr "$argv[$i]" " "`
8      echo "argv now is $argv[$i]"
9      echo "my files are : $myFiles"
10     @ i++;
11 end
12 echo "my files post process are $myFiles"
13 foreach name ($myFiles)
14     set temp = `cat $name`
15     echo "temp is : $temp"
16     unset temp
17 end
This script should get a list of the file names in the current folder and print the contents of the files that are not specified as arguments.
I.e.: the folder has the files A B C D E,
and the input is A B C,
so the contents of D and E will be printed.
Now the logic is right, but I have some syntax issues in line 7 (the tr).
I've tried sed as well, but I get "permission denied" in the console for some reason, and I really don't know how to fix it.
So the help I need is actually syntactic: assigning a variable with a command's output, while including other variables within that command.
Hope that's alright...
THANKS!
Please note that tr will replace all matching characters, so if your input includes "A", it will replace all "A" with " " in all file names returned by ls.
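(As for the "permission denied": a tcsh assignment needs the set keyword. Writing $myFiles = ... on line 7 makes the shell expand $myFiles and then try to execute its first word, a file name, as a command, which fails with that error when the file is not executable.)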
There is a much cleaner solution. You want to find all files, exclude the files matching the input, and print what is left. Here you go:
#!/bin/tcsh
set exclude_names = ""
# if any argument is passed in, add it as "! -name $arg"
foreach arg ( $* )
    set exclude_names = "$exclude_names ! -name $arg"
end
# Find all files in the current dir, excluding the input,
# then print out the filtered set of files
find . -maxdepth 1 -type f $exclude_names -exec cat {} +
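For the example above (files A B C D E, invoked as, say, ./script.tcsh A B C), the generated command is effectively
find . -maxdepth 1 -type f ! -name A ! -name B ! -name C -exec cat {} +
which prints the contents of D and E.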

Doing a complex join on files

I have files (~1k) that look (basically) like this:
NAME1.txt
NAME ATTR VALUE
NAME1 x 1
NAME1 y 2
...
NAME2.txt
NAME ATTR VALUE
NAME2 x 19
NAME2 y 23
...
The ATTR column is the same in every file, and the NAME column is just some version of the filename. I would like to combine them into one file that looks like:
All_data.txt
ATTR NAME1_VALUE NAME2_VALUE NAME3_VALUE ...
x 1 19 ...
y 2 23 ...
...
Is there a simple way to do this with just command-line utilities, or will I have to resort to writing a script?
Thanks
You need to write a script.
gawk is the obvious candidate.
You could build an associative array while reading the files, using ATTR as the key and appending each file's VALUE to the entry.
Then create your output in an END block.
gawk can process all the txt files together if you pass *txt as the filename argument.
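A minimal gawk sketch of that approach (my own illustration, assuming every file lists the same ATTRs; the NAME1_VALUE-style column labels are left out for brevity):
gawk '
    FNR==1 { next }                  # skip each file's NAME ATTR VALUE header
    { vals[$2] = vals[$2] " " $3 }   # key by ATTR, append this file's VALUE
    END { for (attr in vals) print attr vals[attr] }
' *txt > All_data.txt
As with any for (i in ...) loop, the row order is unspecified.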
It's a bit optimistic to expect there to be a ready-made command that does exactly what you want.
Very few commands join data horizontally.
