I am pretty new to Perl, and I am merging some datasets with the following code. The data is set up like this: the first row specifies the sample names, the first column specifies the gene names, and the counts fill the second, third, ... columns. I've got 2 big datasets that I'm merging together, and I have been using the Perl script below, running it in Terminal like this:
$ cd /path/to/file
$ perl /path/to/script.pl dataset1.txt dataset2.txt merged.txt
The Perl script is as follows:
use strict;

my $file1=$ARGV[0];
my $file2=$ARGV[1];
my $out=$ARGV[2];
my %hash=();

open(RF,"$file1") or die $!;
while(my $line=<RF>){
    chomp($line);
    my @arr=split(/\t/,$line);
    my $gene=shift(@arr);
    $hash{$gene}=join("\t",@arr);
}
close(RF);

open(RF,"$file2") or die $!;
open(WF,">$out") or die $!;
while(my $line=<RF>){
    chomp($line);
    my @arr=split(/\t/,$line);
    my $gene=shift(@arr);
    if(exists $hash{$gene}){
        print WF $gene . "\t" . $hash{$gene} . "\t" . join("\t",@arr) . "\n";
    }
}
close(WF);
close(RF);
With the above code I am supposed to get a merged table, with duplicate rows removed and the second file's columns (Sample A to Sample Z) joined onto the first file's columns (Sample 1 to Sample 100), so it should look like this, separated by tabs:
Gene Name Sample 1 Sample 2 ..... Sample A Sample B...
TP53 2.345 2.234 4.32 4.53
The problem is that my merged files come back with the two datasets merged, but with the second dataset on the next row instead of the same row. It recognises, sorts, and merges the counts, but puts them onto the next row. Is there something wrong with my code or my input?
Thank you for all of your help!!
The double line issue might be because of foreign line endings in your input file. You can check this with a command such as:
$ perl -MData::Dumper -ne'$Data::Dumper::Useqq=1; print Dumper $_' file1.txt
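If the dump shows lines ending in "\r\n", one of your files has Windows line endings; after chomp the stray \r stays glued to the last count, which is what makes the second dataset appear to start on its own row. A minimal in-place fix (perl's -i with a .bak backup; check the result before deleting the backups):
$ perl -i.bak -pe 's/\r\n?/\n/g' dataset1.txt dataset2.txt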
There are more issues with your code, as follows.
What you seem to be doing is joining lines based on the name in column 1. You should be aware that this match is done case-sensitively, so it will differentiate between for example tp53 and TP53, or Gene name and Gene Name, or something as subtle as TP53 and TP53 (an extra space). That can be both good and bad, but be prepared for edge cases.
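If you do want a forgiving match, a minimal sketch (hypothetical, not part of your script) is to normalize the key the same way on both the store side and the lookup side:
my $key = lc $gene;      # fold case, so TP53 and tp53 collide
$key =~ s/^\s+|\s+$//g;  # trim stray leading/trailing whitespace
$hash{$key} = join("\t", @arr);
The same two lines would then have to run on $gene before the exists check in the second loop.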
You are expecting 3 arguments to your program (two input files and an output file), but this is a quite un-Perlish way to go about it. I would use the diamond operator for input files, and then redirect output with shell commands, such as:
$ perl foo.pl file1 file2 > merged.txt
This will give you the flexibility of adding more files to merge, for example, and gives you the option to test the merge without committing to a file.
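For example, to eyeball a three-file merge without committing to an output file, you could run:
$ perl foo.pl file1 file2 file3 | head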
You are using the two-argument form of open, without specifying an open mode (e.g. "<"). That is very dangerous and leaves you open to code injection. For example, someone could pass "| rm -rf /" as the first argument to your program and delete your whole hard drive (or as much of it as their permissions allow). To prevent this, use the three-argument open and specify a hard-coded open mode.
Open calls in Perl should also use a lexical file handle, e.g. my $fh, not a global one. It should look like this:
open my $fh, "<", $input1 or die $!;
open my $fh_out, ">", $output or die $!;
But since we are using the diamond operator, Perl handles that for us automagically.
You also do not need to separate the reading of the files into two loops, since you are doing basically the same thing in both. There is also no need to first split the lines and then join them back together.
I wrote this as a sample of how it can be done:
use strict;
use warnings;

my %data;
while (<DATA>) {
    chomp;
    my ($name, $line) = /^([^\t]+)(.+)/; # a regex match instead of split
    $data{$name} .= $line;               # merge lines by string concatenation
}
for my $name (sort keys %data) {
    print $name . $data{$name} . "\n";
}
__DATA__
Gene Name Sample 1 Sample 2 Sample 3 Sample 4
TP53 2.345 2.234 4.32 4.53
TP54 2.345 2.234 4.32 4.53
TP55 2.345 2.234 4.32 4.53
Gene Name Sample A Sample B Sample C Sample D
TP53 2.345 2.234 4.32 2.53
TP54 2.212 1.234 3.32 6.53
TP55 1.345 2.114 7.32 5.53
On my system it gives the output:
Gene Name Sample 1 Sample 2 Sample 3 Sample 4 Sample A Sample B Sample C Sample D
TP53 2.345 2.234 4.32 4.53 2.345 2.234 4.32 2.53
TP54 2.345 2.234 4.32 4.53 2.212 1.234 3.32 6.53
TP55 2.345 2.234 4.32 4.53 1.345 2.114 7.32 5.53
This will output the lines in alphabetical order. If you want to preserve the order of the files, you can collect the names in an array while reading the file, and use that when printing. Arrays preserve the order, hash keys do not.
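A minimal sketch of that order-preserving variant (the same merging logic as above, plus an @order array, reading the files named on the command line through the diamond operator):
use strict;
use warnings;

my (%data, @order);
while (<>) {
    chomp;
    my ($name, $line) = /^([^\t]+)(.+)/;
    push @order, $name unless exists $data{$name}; # remember first-seen order
    $data{$name} .= $line;
}
print $_ . $data{$_} . "\n" for @order;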
I need to merge several files, removing redundant lines among files, while keeping redundant lines within files. A schematic representation of my files is the following:
File1.txt
1
2
3
3
4
5
6
File2.txt
6
7
8
8
9
File3.txt
9
10
10
11
The desired output would be:
1
2
3
3
4
5
6
7
8
8
9
10
10
11
I would prefer to get a solution either in awk, or in bash, or in R. I searched the web for solutions and, though there were plenty of them (please find some examples below), they all removed duplicated lines regardless of whether those lines were located within or across files.
Thanks in advance.
Arturo
Examples of previous solutions removing redundant lines both within and outside files:
https://unix.stackexchange.com/questions/50103/merge-two-lists-while-removing-duplicates
https://unix.stackexchange.com/questions/457320/combine-text-files-and-delete-duplicate-lines
https://unix.stackexchange.com/questions/350520/awk-combine-two-big-files-and-remove-duplicated-lines
https://unix.stackexchange.com/questions/257467/merging-2-files-and-keeping-the-one-duplicate
With your shown samples, could you please try the following. This will NOT remove redundant lines within files, but will remove them file-wise.
awk '
FNR==1{
  for(key in current){
    total[key]
  }
  delete current
}
!($0 in total)
{
  current[$0]
}
' file1.txt file2.txt file3.txt
Explanation: a detailed explanation of the above.
awk '                            ##Start the awk program here.
FNR==1{                          ##If this is the first line of a file, then do the following.
  for(key in current){           ##Traverse the current array (lines seen in the previous file).
    total[key]                   ##Place each index of the current array into total (for all files).
  }
  delete current                 ##Delete the current array here.
}
!($0 in total)                   ##If the current line is NOT present in total, then print it.
{
  current[$0]                    ##Place the current line into the current array.
}
' file1.txt file2.txt file3.txt  ##Mention the Input_file names here.
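For comparison, here is a perl one-liner with the same file-wise logic (a sketch, only checked against the samples shown): within-file duplicates survive because only %total, which is topped up at each end-of-file, is consulted before printing:
$ perl -ne 'print unless exists $total{$_}; $cur{$_}++; if (eof) { @total{keys %cur} = (); %cur = () }' file1.txt file2.txt file3.txt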
Here's a trick building on https://stackoverflow.com/a/15385080/3358272, using diff and its output format. There is likely a presumption of sorted input here; untested.
out=$(mktemp -p .)
tmpout=$(mktemp -p .)
trap 'rm -f "${out}" "${tmpout}"' EXIT
for F in "${@}" ; do
{ cat "${out}" ;
diff --changed-group-format='%>' --unchanged-group-format='' "${out}" "${F}" ;
} > "${tmpout}"
mv "${tmpout}" "${out}"
done
cat "${out}"
Output:
$ ./question.sh F*
1
2
3
3
4
5
6
7
8
8
9
10
10
11
$ diff <(./question.sh F*) Output.txt
(Per markp-fuso's comment, if File3.txt had two 9s, this would preserve both.)
I have a list of files that all have the identical filename but live in different subfolders. The values in the files are separated with a tab.
I would like to attach to each "test.txt" an additional first column containing the folder name, and then merge them all into one file at the end (they all have the same column header).
The most important command, though, would be the merging.
I have tried so many commands now that did not work, so I guess I am missing an essential step with awk...
Current structure is:
mainfolder
|-> Folder1
|   |-> test.txt
|-> Folder2
|   |-> test.txt
.
.
.
This is where I would like to get to per file, before merging all of them:
#Name Count FragCount Type Left LeftB Right RightB Support FRPM LeftBD LeftBE RightBD RightBE annots
RFP1A 13 10 REF RFP1A_ins chr3:3124352:+ RFP1A_ins chr3:5234143:+ confirmed 0.86 TA 1.454 AC 1.564 ["INTRACHROM."]
#Samplename #Name Count FragCount Type Left LeftB Right RightB Support FRPM LeftBD LeftBE RightBD RightBE annots
Sample1 RFP1A 13 10 REF RFP1A_ins chr3:3124352:+ RFP1A_ins chr3:5234143:+ confirmed 0.86 TA 1.454 AC 1.564 ["INTRACHROM."]
Thanks so much!!
D
I believe this might do the trick:
$ cd mainfolder
$ awk '(NR==1){sub("#","#Samplename\t"); print} # print header
(FNR==1){next} # skip header
{print substr(FILENAME,1,match(FILENAME,"/")-1)"\t"$0 } # add directory
' */test.txt > /path/to/newfile.txt
So I have 128 files with two columns.
I want to match them by the values in the first column and add the values in the second column from each file to a single file.
I was able to kind of find a solution here:
From: https://unix.stackexchange.com/questions/159961/merging-2-files-with-based-on-field-match
awk 'FNR==NR{a[$1]=$2;next} ($1 in a) {print $1,a[$1],$2}' file2 file1
It does what I want; however, I need this to go through every file in the folder.
Is there a way to make this command loop through all the files in the folder, or is there a better method altogether?
Example:
Input
File 1:
gene_id normalized_count
A1BG|1 42.3332
A1CF|29974 165.6696
A2BP1|54715 0.0000
A2LD1|87769 138.1270
A2ML1|144568 2.7612
A2M|2 7310.6121
A4GALT|53947 348.3663
A4GNT|51146 0.0000
File 2:
gene_id normalized_count
A1BG|1 18.2019
A1CF|29974 129.6194
A2BP1|54715 2.2063
A2LD1|87769 65.3116
A2ML1|144568 0.0000
A2M|2 3415.8632
A4GALT|53947 83.2874
A4GNT|51146 0.0000
File 3:
gene_id normalized_count
A1BG|1 8.6285
A1CF|29974 97.6385
A2BP1|54715 0.0000
A2LD1|87769 200.5540
A2ML1|144568 0.0000
A2M|2 984.0736
A4GALT|53947 24.0690
A4GNT|51146 0.4541
Desired output
gene_id normalized_count
A1BG|1 42.3332 18.2019 8.6285
A1CF|29974 165.6696 129.6194 97.6385
A2BP1|54715 0 2.2063 0
A2LD1|87769 138.127 65.3116 200.554
A2ML1|144568 2.7612 0 0
A2M|2 7310.6121 3415.8632 984.0736
A4GALT|53947 348.3663 83.2874 24.069
A4GNT|51146 0 0 0.4541
For the desired output I don't care how the column labels end up looking.
Again my problem is that I have to do this for hundreds of files at once to produce one file.
Here are some other similar problems with solutions
https://unix.stackexchange.com/questions/122919/merge-2-files-based-on-all-values-of-the-first-column-of-the-first-file
https://unix.stackexchange.com/questions/113879/how-to-merge-two-files-with-different-number-of-rows-in-shell
But they only had to do this for a few files.
Edit: both Nathan's and joepd's solutions worked and produced similar output.
Thank you!
Nathan's solution will produce space-delimited output.
joepd's will produce output that has the header (with the original tab separation), the first column separated by two spaces, and the rest space-delimited.
You will need gawk for this:
gawk '{a[$1]+=$2}; END{ for (i in a) print i, a[i]}' files*
If this does not work for you, please specify input and output.
EDIT
After your specification it becomes clear that you want to concatenate the strings. How about this?
awk '
NR==1 {title=$0}
FNR!=1 {a[$1] = a[$1]" "$2}
END {
print title
for (i in a)
print i, a[i]
}
' files*
This should produce the output you want with one more column in the output for each file in the input:
awk 'FNR>1{a[$1]=a[$1] " " $2}; END{ for (i in a) print i a[i]}' File*
It's structured like @joepd's answer, which numerically sums the inputs instead of string-concatenating them.
FNR>1 is used to ignore the header line in each file.
I needed to extract all hits from one list (List.txt) that can be found in one of the columns of another file (here Data.txt) into a third file (output.txt).
Data.txt (tab delimited)
some_data more_data other_data here yet_more_data etc
A B 2 Gee;Whiz;Hello 13 12
A B 2 Gee;Whizz;Hi 56 32
E 4 Btm;Lol 16 2
T 3 Whizz 13 3
List.txt
Gee
Whiz
Lol
Ideally output.txt looks like
some_data more_data other_data here yet_more_data etc
A B 2 Gee;Whiz;Hello 13 12
A B 2 Gee;Whizz;Hi 56 32
E 4 Btm;Lol 16 2
So I tried a shell script
for ids in List.txt
do
grep $ids Data.txt >> output.txt
done
except I typed out everything (cut and paste actually) in List.txt in said script.
Unfortunately it gave me an output.txt that included the last line as well, I assume because 'Whizz' contains 'Whiz'.
I also tried cat Data.txt | egrep -F "List.txt", which resulted in grep: conflicting matchers specified -- I suppose that was too naive of me. The actual files: List.txt contains a sorted list of 985 words; Data.txt has 115576 rows with 17 columns.
Some help/guidance would be much appreciated thanks.
Try something like this:
for ids in $(cat List.txt)
do
  grep "[TAB;]$ids[TAB;]" Data.txt >> output.txt
done
But it has two drawbacks:
"Data.txt" is scanned multiple times
You can get one line multiple times.
If that is a problem, try the two-step version:
cat List.txt | sed -e "s/.*/[TAB;]&[TAB;]/" > List_mod.txt
grep -f List_mod.txt Data.txt > output.txt
Note:
A TAB character can be inserted on the command line with Ctrl-V followed by the Tab key, and in an editor with the Tab key itself. You have to check that your editor does not change tabs into a series of spaces.
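If typing a literal tab is fiddly, a perl one-liner can generate List_mod.txt with real tab characters already embedded (equivalent to the sed step above, since "\t" interpolates to an actual tab):
$ perl -ne 'chomp; print "[\t;]" . $_ . "[\t;]\n"' List.txt > List_mod.txt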
The UNIX tool for general text processing is "awk":
awk '
NR==FNR { list[$0]; next }
{
for (word in list) {
if ($0 ~ "[\t;]" word "[\t;]") {
print
next
}
}
}
' List.txt Data.txt > output.txt
Is it possible to read a formatted file into SQLite? I have a file with rows like this (two rows below):
1921.300 . . < 0.030 . . . . . 550 1.6 1 Mrr1922 Jm 5
1973.220 158. 3. 0.240 0.002 . . 1.5 0.5 620 5.1 1 Lab1974 S 4
and a description like this:
term columns format description
date 008-017 f10.5 Observation date, in years.
tflag 019-019 a1 Flag for theta (position angle) measure.
.....................
etc.
I need to read this file into my SQLite table.
I think what you probably want to do is convert your file to comma- or tab-separated text (say, by loading it into Excel and then exporting). Then in sqlite3 (.import takes the file name plus a table name; a recent sqlite3 will create the table from the first row if it does not already exist):
.mode tabs
.import myfile.txt mytable
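If you would rather skip the spreadsheet step, the fixed-width columns can be cut apart on the command line first. This is only a sketch covering the two fields documented above (date in columns 8-17, tflag in column 19; substr offsets are 0-based, so extend the list using the rest of the format description):
$ perl -lne 'print join "\t", substr($_, 7, 10), substr($_, 18, 1)' myfile.txt > myfile.tsv
Then import myfile.tsv with the same .mode tabs and .import commands.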