How to read a formatted file in SQLite?

Is it possible to read a formatted (fixed-width) file into SQLite? I have a file with rows like this (two rows shown below):
1921.300 . . < 0.030 . . . . . 550 1.6 1 Mrr1922 Jm 5
1973.220 158. 3. 0.240 0.002 . . 1.5 0.5 620 5.1 1 Lab1974 S 4
and description like this:
term columns format description
date 008-017 f10.5 Observation date, in years.
tflag 019-019 a1 Flag for theta (position angle) measure.
.....................
etc.
I need to read this file into my SQLite table.

I think what you probably want to do is convert your file to comma- or tab-separated text (say, by loading it into Excel and then exporting). Then, in sqlite3:
.mode tabs
.import myfile.txt mytable
(.import takes the file name and the target table name; mytable is a placeholder here. Recent sqlite3 versions create the table from the first row if it does not already exist; otherwise run CREATE TABLE first.)
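If you'd rather not round-trip through Excel, awk can do the fixed-width-to-tab conversion directly. A minimal sketch, assuming the column layout from the description above (date in columns 8-17, tflag in column 19) and a hypothetical input file name data.txt; you would add one substr per remaining field from the format description:
awk '{ print substr($0, 8, 10) "\t" substr($0, 19, 1) }' data.txt > myfile.txt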

Related

Merging columns, but it goes onto a new line, in Perl code

I am pretty new to Perl, and I am merging some datasets with the following code. The data is set up as such: the first row specifies the sample names, followed by the counts in the second, third, ... columns; the first column specifies the gene names. I've got 2 big datasets that I'm merging together, and I have been running the following Perl script from the Terminal, specifying the path to the script:
$ cd /path/to/file
$ perl /path/to/file dataset1.txt dataset2.txt merged.txt
The Perl script is as follows:
use strict;
my $file1 = $ARGV[0];
my $file2 = $ARGV[1];
my $out   = $ARGV[2];
my %hash  = ();
open(RF, "$file1") or die $!;
while (my $line = <RF>) {
    chomp($line);
    my @arr = split(/\t/, $line);
    my $gene = shift(@arr);
    $hash{$gene} = join("\t", @arr);
}
close(RF);
open(RF, "$file2") or die $!;
open(WF, ">$out") or die $!;
while (my $line = <RF>) {
    chomp($line);
    my @arr = split(/\t/, $line);
    my $gene = shift(@arr);
    if (exists $hash{$gene}) {
        print WF $gene . "\t" . $hash{$gene} . "\t" . join("\t", @arr) . "\n";
    }
}
close(WF);
close(RF);
With the above code I am supposed to get a merged table, with the duplicate rows deleted, and the second text file's columns (Sample A to Sample Z) merged onto the first text file's columns (Sample 1 to Sample 100), so it should look like this, separated by tabs:
Gene Name Sample 1 Sample 2 ..... Sample A Sample B...
TP53 2.345 2.234 4.32 4.53
The problem is that my merged file comes back with the two datasets merged, but with the second dataset on the next row instead of the same row. It will recognise, sort, and merge the counts, but onto the next row. Is there something wrong with my code or my input?
Thank you for all of your help!!
The double line issue might be because of foreign line endings in your input file. You can check this with a command such as:
$ perl -MData::Dumper -ne'$Data::Dumper::Useqq=1; print Dumper $_' file1.txt
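If the dump shows lines ending in \r\n, the file has Windows line endings: chomp only removes the \n, so a stray \r ends up stored in the middle of your merged output, which is exactly what makes the second dataset appear to start on a new row. A hedged sketch of a fix, applied inside both of your read loops:
$line =~ s/\r?\n\z//;   # instead of chomp($line): also strips a Windows \r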
There are more issues with your code, as follows.
What you seem to be doing is joining lines based on the name in column 1. You should be aware that this match is case-sensitive, so it will differentiate between, for example, tp53 and TP53, or Gene name and Gene Name, or something as subtle as TP53 and "TP53 " (with a trailing space). That can be both good and bad, but be prepared for edge cases.
You are expecting 3 arguments to your program, input files and output, but this is a quite un-Perlish way to go about it. I would use the diamond operator for input files, and then redirect output with shell commands, such as:
$ perl foo.pl file1 file2 > merged.txt
This will give you the flexibility of adding more files to merge, for example, and gives you the option to test the merge without committing to a file.
You are using the 2-argument form of open, without specifying an open mode (e.g. "<"). That is very dangerous and leaves you open to code injection. For example, someone could pass "| rm -rf /" as the first argument to your program and delete your whole hard drive (or as much of it as their permissions allow). To prevent this, use the 3-argument open and specify a hard-coded open mode.
Open commands in Perl should also use a lexical file handle, e.g. my $fh, not a global one. It should look like this:
open my $fh, "<", $input1 or die $!;
open my $fh_out, ">", $output or die $!;
But since we are using the diamond operator, Perl handles that for us automagically.
You also do not need to separate the reading of the files into two loops, since you are basically doing the same thing in both. There is also no need to first split the lines and then join them back together.
I wrote this as a sample of how it can be done:
use strict;
use warnings;

my %data;
while (<DATA>) {
    chomp;
    my ($name, $line) = /^([^\t]+)(.+)/;   # using a regex match, avoiding split
    $data{$name} .= $line;                 # merge lines using concatenation
}
for my $name (sort keys %data) {
    print $name . $data{$name} . "\n";
}
__DATA__
Gene Name Sample 1 Sample 2 Sample 3 Sample 4
TP53 2.345 2.234 4.32 4.53
TP54 2.345 2.234 4.32 4.53
TP55 2.345 2.234 4.32 4.53
Gene Name Sample A Sample B Sample C Sample D
TP53 2.345 2.234 4.32 2.53
TP54 2.212 1.234 3.32 6.53
TP55 1.345 2.114 7.32 5.53
On my system it gives the output:
Gene Name Sample 1 Sample 2 Sample 3 Sample 4 Sample A Sample B Sample C Sample D
TP53 2.345 2.234 4.32 4.53 2.345 2.234 4.32 2.53
TP54 2.345 2.234 4.32 4.53 2.212 1.234 3.32 6.53
TP55 2.345 2.234 4.32 4.53 1.345 2.114 7.32 5.53
This will output the lines in alphabetical order. If you want to preserve the order of the files, you can collect the names in an array while reading the file, and use that when printing. Arrays preserve the order, hash keys do not.
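A minimal sketch of that order-preserving variant, changing only the relevant parts of the sample above:
my (%data, @order);
while (<DATA>) {
    chomp;
    my ($name, $line) = /^([^\t]+)(.+)/;
    push @order, $name unless exists $data{$name};   # remember first-seen order
    $data{$name} .= $line;
}
print $_ . $data{$_} . "\n" for @order;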

Attach folder name to first column of file

I have a list of files that have identical filenames but sit in different subfolders. The values in the files are separated with a tab.
I would like to attach to each of the "test.txt" files an additional first column containing the folder name, and then merge them into one file at the end (they all have the same column header).
The most important command, though, would be the merging.
I have tried so many commands now that did not work, so I guess I am missing an essential step with awk...
Current structure is:
mainfolder
|-> Folder1
|   |-> test.txt
|-> Folder2
|   |-> test.txt
.
.
.
This is where I would like to get to per file before merging them all. From:
#Name Count FragCount Type Left LeftB Right RightB Support FRPM LeftBD LeftBE RightBD RightBE annots
RFP1A 13 10 REF RFP1A_ins chr3:3124352:+ RFP1A_ins chr3:5234143:+ confirmed 0.86 TA 1.454 AC 1.564 ["INTRACHROM."]
to this:
#Samplename #Name Count FragCount Type Left LeftB Right RightB Support FRPM LeftBD LeftBE RightBD RightBE annots
Sample1 RFP1A 13 10 REF RFP1A_ins chr3:3124352:+ RFP1A_ins chr3:5234143:+ confirmed 0.86 TA 1.454 AC 1.564 ["INTRACHROM."]
Thanks so much!!
D
I believe this might do the trick:
$ cd mainfolder
$ awk '(NR==1){sub("#","#Samplename\t"); print} # print header
(FNR==1){next} # skip header
{print substr(FILENAME,1,match(FILENAME,"/")-1)"\t"$0 } # add directory
' */test.txt > /path/to/newfile.txt

sh - Split File to Multiple Files

I need a Unix (AIX) script to split a file into multiple files, basically one file per line, where the content of the file looks like this:
COL_1 ROW 1 1 1
COL_2 ROW 2 2 2
COL_3 ROW 3 3 3
... and the name of each file is the first column, and the content of the file is the rest of the line, something like:
Name: COL_1.log
content:
ROW 1 1 1
Thanks in advance,
Tiago
Using a while loop and reading each line:
while read -r COL REST; do
    echo "$REST" > "$COL.log"
done < file
COL will contain the first word of each line
REST will contain the rest of the line
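For a large file, a single awk pass does the same job and avoids the per-line overhead of the shell loop; a minimal sketch using POSIX awk, so it should also work on AIX:
awk '{ out = $1 ".log"; sub(/^[^ \t]+[ \t]+/, ""); print > out; close(out) }' file
The close() call matters when there are many distinct names, since awk otherwise keeps every output file open.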

Invalid values in an array on executing for loop in R

I am new to R and stuck on a very naive thing. I am getting NA values in the count array after executing the following code:
i = 1
j = 2
l = 1
count = 0
while (j < length(positions)) {
    a = positions[i]
    b = positions[j]
    for (k in a:b) {
        if (y$feature[k] == x$feature[l]) {
            count[l] = count[l] + 1
        }
    }
    i = i + 2
    j = j + 2
    l = l + 1
}
For reference, the y and x data frames are as follows:
y data frame
positions id feature
1 1 45128
2 1 28901
3 1 48902
. .
. .
. .
. .
2344 1 45579
2345 2 37689
2346 2 45547
. .
. .
5677 2 12339
5678 3 98034
5679
.
.
x data frame:
id feature
1 28901
2 23498
3 98906
. .
. .
. .
I have inserted into the positions array the points where each new id starts and where it ends.
positions is an array consisting of [1,2344,2345,5677,5678,7390,7391,...]. I step through it as I loop, with i being 1, 3, 5, ... and j being 2, 4, 6, ... If y$feature and x$feature match, I increment count[l].
So the first feature of x is compared with all features in y with id = 1, the second feature in x is compared with all features in y with id = 2, and so on. When they match, count[l] is incremented. i and j are incremented by two each time so that they start at the correct positions. But I only get a valid answer for count[1]; all the remaining values are NA.
Please tell me why this happens and a valid way to do this using loops.
It's because you are trying to add 1 to a nonexistent value, count[l]. You start out with count <- 0, so count has length one. There is no count[2], so a reference to count[2] returns NA. Then (when l = 2 in your loop), NA + 1 returns NA.
If you initialize count<-rep(0,length(positions)) this particular problem will go away.
Meanwhile, you can vectorize your operations quite a lot. I believe you can replace the k-loop with
count[l] <- sum(y$feature[a:b]==x$feature[l])
for one example.
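Going further, the whole while-loop can be collapsed into one vectorized call. A minimal sketch, assuming positions alternates start/end pairs per id as described and x$feature has one entry per id:
starts <- positions[seq(1, length(positions), by = 2)]
ends   <- positions[seq(2, length(positions), by = 2)]
count  <- mapply(function(a, b, f) sum(y$feature[a:b] == f),
                 starts, ends, x$feature[seq_along(starts)])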

How to replace a word starting with certain characters on certain lines?

I am trying to use the sed command to replace/remove the rs numbers in my file.
I have a VCF file:
##reference=file:/hs37d5.fasta
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SC_PCHD5235298
1 10234 rs145599635 C T 153.34 .
I would like to replace the rs* entries ONLY on the lines which do not start with #.
For example, I would like to replace rs145599635 with a dot, and I want it to ignore the headers, i.e. the lines starting with a #.
I tried
sed "/^[^#]/s/rs.*/./g" test.vcf
but it deletes everything after the rs.
You can try this:
sed -i 's/\(^[^#].*\)rs[0-9]\+\( .*\)/\1.\2/' test.vcf
Alternatively, altering your own command: your rs.* is greedy and matches everything to the end of the line, which is why it deleted the rest of each line. Restrict the match to the rs number itself and write it like this:
sed -i "/^[^#]/s/rs[0-9]\+/./g" test.vcf
My test.vcf file looks like this; I think your file looks like this too.
##reference=file:/hs37d5.fasta
#rs145599635 C T 153.34 .
#1 10234 rs145599635 C T 153.34 .
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SC_PCHD5235298
1 10234 rs145599635 C T 153.34 .
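Running the second command over that file should leave every # line untouched and change only the final data line, roughly:
##reference=file:/hs37d5.fasta
#rs145599635 C T 153.34 .
#1 10234 rs145599635 C T 153.34 .
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SC_PCHD5235298
1 10234 . C T 153.34 .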
I hope this will help you.
