I have a file of the following form (Input_fasta.txt):
>tr|A0A089QH62|A0A089QH62_MYCTU Histidine kinase OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=LH57_00865 PE=4 SV=1
MTATASGIAATAPNCGEASINDVPIAESERRYLGARSASEYGQEIPLW
>tr|I6WXB4|I6WXB4_MYCTU 30S ribosomal protein S6 OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=rpsF PE=3 SV=1
MRPYEIMVILDPTLDERTVAPSLETFLNVVRKDGGKVEKVDIWGKRRLAYEIAKHAEGIY
VVIDVKAAPATVSELDRQLSLNESVLRTKVMRTDKH
>tr|A0A089SBT4|A0A089SBT4_MYCTU Glycosyl transferase OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=LH57_19775 PE=4 SV=1
MDTETHYSDVWVVIPAFNEAAVIGKVVTDVRSVFDHVVCVDDGSTDGTGDIARRSGAHLV
RHPINLGQGAAIQTGIEYARKQPGAQVFATFDGDGQHRVKDVAAMVDRLGAGDVDVVIGT
RFGRPVGKASASRPPLMKRIVLQTGARLSRRGRRLGLTDTNNGLRVFNKTVADGLNITMS
GMSHATEFIMLIAENHWRVAEEPVEVLYTEYSKSKGQPLLNGVNIIFDGFLRGRMPR
>tr|A0A089QKT1|A0A089QKT1_MYCTU TetR family transcriptional regulator OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=LH57_00800 PE=4 SV=1
MSLTAGRGPGRPPAAKADETRKRILHAARQVFSERGYDGATFQEIAVRADLTRPAINHYF
ANKRVLYQEVVEQTHELVIVAGIERARREPTLMGRLAVVVDFAMEADAQYPASTAFLATT
VLESQRHPELSRTENDAVRATREFLVWAVNDAIERGELAADVDVSSLAETLLVVLCGVGF
YIGFVGSYQRMATITDSFQQLLAGTLWRPPT
>tr|I6YAB3|I6YAB3_MYCTU Iron ABC transporter permease OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=LH57_07380 PE=4 SV=1
MARGLQGVMLRSFGARDHTATVIETISIAPHFVRVRMVSPTLFQDAEAEPAAWLRFWFPD
PNGSNTEFQRAYTISEADPAAGRFAVDVVLHDPAGPASSWARTVKPGATIAVMSLMGSSR
FDVPEEQPAGYLLIGDSASIPGMNGIIETVPNDVPIEMYLEQHDDNDTLIPLAKHPRLRV
RWVMRRDEKSLAEAIENRDWSDWYAWATPEAAALKCVRVRLRDEFGFPKSEIHAQAYWNA
GRAMGTHRATEPAATEPEVGAAPQPESAVPAPARGSWRAQAASRLLAPLKLPLVLSGVLA
ALVTLAQLAPFVLLVELSRLLVSGAGAHRLFTVGFAAVGLLGTGALLAAALTLWLHVIDA
RFARALRLRLLSKLSRLPLGWFTSRGSGSIKKLVTDDTLALHYLVTHAVPDAVAAVVAPV
GVLVYLFVVDWRVALVLFGPVLVYLTITSSLTIQSGPRIVQAQRWAEKMNGEAGSYLEGQ
PVIRVFGAASSSFRRRLDEYIGFLVAWQRPLAGKKTLMDLATRPATFLWLIAATGTLLVA
THRMDPVNLLPFMFLGTTFGARLLGIAYGLGGLRTGLLAARHLQVTLDETELAVREHPRE
PLDGEAPATVVFDHVTFGYRPGVPVIQDVSLTLRPGTVTALVGPSGSGKSTLATLLARFH
DVERGAIRVGGQDIRSLAADELYTRVGFVLQEAQLVHGTAAENIALAVPDAPAEQVQVAA
REAQIHDRVLRLPDGYDTVLGANSGLSGGERQRLTIARAILGDTPVLILDEATAFADPES
EYLVQQALNRLTRDRTVLVIAHRLHTITRADQIVVLDHGRIVERGTHEELLAAGGRYCRL
WDTGQGSRVAVAAAQDGTR
>tr|L0T545|L0T545_MYCTU PPE family protein PPE7 OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=PPE7 PE=4 SV=1
MSVCVIYIPFKGCVKHVSVTIPITTEHLGPYEIDASTINPDQPIDTAFTQTLDFAGSGTV
GAFPFGFGWQQSPGFFNSTTTPSSGFFNSGAGGASGFLNDAAAAVSGLGNVFTETSGFFN
AGGVGIRASKTSATCCRAGRT
and another file containing the patterns (Pattern.txt):
I6WXB4
I6WXC3
I6WXK8
I need an output like
>tr|I6WXB4|I6WXB4_MYCTU 30S ribosomal protein S6 OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=rpsF PE=3 SV=1
MRPYEIMVILDPTLDERTVAPSLETFLNVVRKDGGKVEKVDIWGKRRLAYEIAKHAEGIY
VVIDVKAAPATVSELDRQLSLNESVLRTKVMRTDKH
What I have done so far is:
grep -f Pattern.txt Input_fasta.txt
How can I extend the output to the following lines, up to the next ">" after the match?
I tried:
awk '/I6WXB4/{copy=1;next} />/{copy=0;next} copy' Input_fasta.txt
which gave this output:
MRPYEIMVILDPTLDERTVAPSLETFLNVVRKDGGKVEKVDIWGKRRLAYEIAKHAEGIY
VVIDVKAAPATVSELDRQLSLNESVLRTKVMRTDKH
but the header is missing.
In awk:
$ awk 'NR==FNR{a[$0]; next} $2 in a' pattern.txt FS="|" RS=">" input_fasta.txt
tr|I6WXB4|I6WXB4_MYCTU 30S ribosomal protein S6 OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=rpsF PE=3 SV=1
MRPYEIMVILDPTLDERTVAPSLETFLNVVRKDGGKVEKVDIWGKRRLAYEIAKHAEGIY
VVIDVKAAPATVSELDRQLSLNESVLRTKVMRTDKH
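For comparison, the same record-wise filtering can be sketched in plain Python. This is only a minimal sketch, assuming the accession is the second "|"-separated field of each header, as in the sample above; the function name filter_fasta is hypothetical. Unlike the awk output shown, it keeps the leading ">" of the header, and it ignores blank lines in the pattern file.

```python
def filter_fasta(fasta_lines, accessions):
    """Yield the lines of every FASTA record whose accession is wanted.

    Assumes UniProt-style headers such as '>tr|I6WXB4|I6WXB4_MYCTU ...',
    where the accession is the second '|'-separated field.
    """
    wanted = {a.strip() for a in accessions if a.strip()}  # skip blank lines
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # A new record starts: decide whether to keep it.
            fields = line[1:].split("|")
            keep = len(fields) > 1 and fields[1] in wanted
        if keep:
            yield line


if __name__ == "__main__":
    import sys

    # Usage (file names as in the question):
    #   python filter_fasta.py Pattern.txt Input_fasta.txt
    if len(sys.argv) == 3:
        with open(sys.argv[1]) as pats, open(sys.argv[2]) as fasta:
            sys.stdout.writelines(filter_fasta(fasta, pats))
```

Matching on the exact accession field, rather than on a substring of the header, avoids accidental hits when one ID is a prefix of another.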
Here’s a simple Python solution, using BioPython:
import sys
import re
from Bio import SeqIO

with open('pattern.txt', 'r') as f:
    patterns = '|'.join([re.escape(pattern.strip()) for pattern in f])

for record in SeqIO.parse('test.fa', 'fasta'):
    if re.search(patterns, record.id):
        SeqIO.write(record, sys.stdout, 'fasta')
Note that this requires a well-behaved pattern.txt file, i.e. one that doesn’t contain any empty lines: an empty line adds an empty alternative to the regular expression, which then matches every record.
bash & sed solution:
while read -r pattern
do
    if [ -n "$pattern" ] ; then
        # plain | is a literal in basic regular expressions; \| would be alternation in GNU sed
        sed -n "/|$pattern|/{:loop;p;n;/>/q;bloop;}" input.txt
    fi
done < patternfile.txt
This loops over the pattern file (blank lines are skipped). When a header matches the pattern, it prints that line and the following ones until the end of the file or until it finds the next ">".
I have launched a query with amino acid sequences on "KAAS - KEGG Automatic Annotation Server".
I then downloaded the results file, called "myfile.keg". A small example file that shows what it looks like can be downloaded at: https://www.dropbox.com/s/ixf0091z5q3cx9z/myfile.keg?dl=0
+D KO
#<h2><img src="/Fig/bget/kegg3.gif" align="middle" border=0> KEGG Orthology (KO)</h2> 75prot_protdiff_GD_5h
!
A<b>Metabolism</b>
B
B <b>Carbohydrate metabolism</b>
C 00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
D MYGENEACCESSION01; K01623 ALDO; fructose-bisphosphate aldolase, class I [EC:4.1.2.13]
C 00020 Citrate cycle (TCA cycle) [PATH:ko00020]
C 00030 Pentose phosphate pathway [PATH:ko00030]
D MYGENEACCESSION02; K01623 ALDO; fructose-bisphosphate aldolase, class I [EC:4.1.2.13]
C 00040 Pentose and glucuronate interconversions [PATH:ko00040]
C 00051 Fructose and mannose metabolism [PATH:ko00051]
D MYGENEACCESSION03; K17497 PMM; phosphomannomutase [EC:5.4.2.8]
D MYGENEACCESSION04; K01623 ALDO; fructose-bisphosphate aldolase, class I [EC:4.1.2.13]
C 00052 Galactose metabolism [PATH:ko00052]
C 00053 Ascorbate and aldarate metabolism [PATH:ko00053]
C 00500 Starch and sucrose metabolism [PATH:ko00500]
C 00520 Amino sugar and nucleotide sugar metabolism [PATH:ko00520]
D MYGENEACCESSION05; K01183 E3.2.1.14; chitinase [EC:3.2.1.14]
C 00620 Pyruvate metabolism [PATH:ko00620]
C 00630 Glyoxylate and dicarboxylate metabolism [PATH:ko00630]
C 00640 Propanoate metabolism [PATH:ko00640]
C 00650 Butanoate metabolism [PATH:ko00650]
C 00660 C5-Branched dibasic acid metabolism [PATH:ko00660]
C 00562 Inositol phosphate metabolism [PATH:ko00562]
B
!
#<hr>
#<b>[ KO | BRITE | KEGG2 | KEGG ]</b><br>
#Last updated: May 18, 2018
#<br><br>» All categories
(I open it with Notepad++)
In this file, you can see the different functional categories from KEGG for each of my genes, which are referred to as "MYGENEACCESSION01" (or "02", "03", etc.).
I want to extract and organize all the info from this myfile.keg into a new file (e.g., Excel) that looks something like this: https://www.dropbox.com/s/xq4714ngesap9dx/annotation.xlsx?dl=0
CSV version here:
accession,kegg.first.level,kegg.second.level,kegg.third.level,kegg.fourth.level,path ,KO
MYGENEACCESSION01,metabolism,carbohydrate metabolism,glycolisis / Gluconeogenesis,"ALDO; fructose-bisphosphate aldolase, class I [EC:4.1.2.13]",PATH:ko00010,K01623
MYGENEACCESSION02,metabolism,carbohydrate metabolism,Pentose phosphate pathway ,"ALDO; fructose-bisphosphate aldolase, class I [EC:4.1.2.13]",PATH:ko00030,K01623
MYGENEACCESSION03,metabolism,carbohydrate metabolism,Fructose and mannose metabolism, PMM; phosphomannomutase [EC:5.4.2.8],PATH:ko00051,K17497
MYGENEACCESSION04,metabolism,carbohydrate metabolism,Fructose and mannose metabolism,"ALDO; fructose-bisphosphate aldolase, class I [EC:4.1.2.13]",PATH:ko00051,K01623
MYGENEACCESSION05,metabolism,carbohydrate metabolism,Amino sugar and nucleotide sugar metabolism,chitinase [EC:3.2.1.14],PATH:ko00520,K01183
I have done it manually, but it is very tedious and I have a much larger dataset than the provided example.
Any idea how to do this automatically with R or another program? (Do you think an R script could do the job?)
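One possible approach, sketched here in Python rather than R: walk the file line by line, remember the most recent A, B and C level, and emit one row per D line. This is a minimal sketch that assumes the layout of the example above (a one-letter level prefix, a "[PATH:...]" suffix on C lines, and "accession; Kxxxxx description" on D lines). The name parse_keg and the regular expressions are illustrative, and the amount of whitespace in a real KAAS file may differ, so any run of whitespace is accepted. Category names keep the capitalization found in the file.

```python
import csv
import re
import sys

TAG = re.compile(r"<[^>]+>")  # strips <b>...</b> markup
C_LINE = re.compile(r"^C\s+\d+\s+(.*?)\s*(?:\[(PATH:[^\]]+)\])?\s*$")
D_LINE = re.compile(r"^D\s+(\S+);\s+(K\d+)\s+(.*\S)\s*$")

HEADER = ["accession", "kegg.first.level", "kegg.second.level",
          "kegg.third.level", "kegg.fourth.level", "path", "KO"]


def parse_keg(lines):
    """Return one row per gene (D line), carrying the enclosing A/B/C levels."""
    rows = []
    first = second = third = path = ""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("A"):
            first = TAG.sub("", line[1:]).strip()
        elif line.startswith("B"):
            name = TAG.sub("", line[1:]).strip()
            if name:  # a bare "B" line is only a separator
                second = name
        elif line.startswith("C"):
            match = C_LINE.match(line)
            if match:
                third, path = match.group(1), match.group(2) or ""
        elif line.startswith("D"):
            match = D_LINE.match(line)
            if match:
                accession, ko, description = match.groups()
                rows.append([accession, first, second, third,
                             description, path, ko])
    return rows


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python parse_keg.py myfile.keg > annotation.csv
    with open(sys.argv[1], encoding="utf-8") as handle:
        table = parse_keg(handle)
    writer = csv.writer(sys.stdout)
    writer.writerow(HEADER)
    writer.writerows(table)
```

The csv module quotes fields that contain commas, so the resulting file should open directly in Excel.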
I am a newbie to text mining and R coding.
I have 200 gene names that are mixed strings. I want to split them and paste the plain words (e.g., cadherins, orphan receptors) into one column and the numbers (e.g., 2/3) and number+letter tokens (e.g., 7D, 7TM) into another column.
I used strsplit to split the words. Any suggestion on how to parse them would be helpful.
example:
> Genes <- c("7D cadherins", "7TM orphan receptors", "7TM orphan receptors RNA18S", "28S ribosomal RNAs RNA28S", "45S pre-ribosomal RNAs RNA45S", "5.8S ribosomal RNAs", "Actin related protein 2/3 complex")
Expected result(2nd and 3rd column):
7D cadherins cadherins 7D
7TM orphan receptors orphan receptors 7TM
18S ribosomal RNAs RNA18S ribosomal RNAs RNA18S 18S RNA18S
28S ribosomal RNAs RNA28S ribosomal RNAs RNA28S 28S RNA28S
45S pre-ribosomal RNAs RNA45S pre-ribosomal RNAs 45S RNA45S
5.8S ribosomal RNAs ribosomal RNAs 5.8S
Actin related protein 2/3 complex Actin related protein complex 2/3
Use strsplit to split the names, grep to detect words with or without digits, and paste to collapse the words. Put everything in a function to avoid repetition:
wordS <- function(x, invert = TRUE) {
  clean <- gsub('[[:space:]]+', ' ', x)  # to remove extra spaces
  split <- strsplit(clean, ' ')
  detec <- lapply(split, function(y) grep('[0-9]', y, invert = invert, value = TRUE))
  words <- sapply(detec, paste, collapse = ' ')
  return(words)
}
data.frame(
  Gene = Genes,
  column2 = wordS(Genes),
  column3 = wordS(Genes, invert = FALSE)
)
  Gene                              column2                        column3
1 7D cadherins                      cadherins                      7D
2 7TM orphan receptors              orphan receptors               7TM
3 7TM orphan receptors RNA18S       orphan receptors               7TM RNA18S
4 28S ribosomal RNAs RNA28S         ribosomal RNAs                 28S RNA28S
5 45S pre-ribosomal RNAs RNA45S     pre-ribosomal RNAs             45S RNA45S
6 5.8S ribosomal RNAs               ribosomal RNAs                 5.8S
7 Actin related protein 2/3 complex Actin related protein complex  2/3
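For readers who prefer Python, the same idea (split on whitespace, test each word for a digit, re-join) can be sketched as follows. The function name split_words is hypothetical; it is a rough counterpart of the R function above, not a translation of any library routine.

```python
import re


def split_words(name, with_digits):
    """Return the words of `name` that do (or do not) contain a digit."""
    words = name.split()  # split on any run of whitespace
    kept = [w for w in words
            if bool(re.search(r"[0-9]", w)) == with_digits]
    return " ".join(kept)


genes = ["7D cadherins",
         "7TM orphan receptors RNA18S",
         "Actin related protein 2/3 complex"]
for gene in genes:
    # Columns: original name | words without digits | words with digits
    print(gene, "|", split_words(gene, False), "|", split_words(gene, True))
```

As in the R version, a token such as "RNA18S" lands in the digit column because it contains a number.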
I am using Parsey McParseface and SyntaxNet to parse some text. I wish to extract the positional data of words along with the parse tree.
Currently the output is:
echo 'Alice brought the pizza to Alice.' | syntaxnet/demo.sh
Input: Alice brought the pizza to Alice .
Parse:
brought VBD ROOT
+-- Alice NNP nsubj
+-- pizza NN dobj
| +-- the DT det
+-- to IN prep
| +-- Alice NNP pobj
+-- . . punct
How I need it to be:
Input: Alice brought the pizza to Alice .
Parse:
brought VBD ROOT 2
+-- Alice NNP nsubj 1
+-- pizza NN dobj 4
| +-- the DT det 3
+-- to IN prep 5
| +-- Alice NNP pobj 6
+-- . . punct 7
or similar. (This will be particularly useful when there are many occurrences of the same word.)
Thank you
You can edit conll2tree.py
https://github.com/tensorflow/models/blob/master/syntaxnet/syntaxnet/conll2tree.py
Changing token_str to
token_str = ['%s %d %s %s' % (token.word, tind,
                              token.tag, token.label)
             for tind, token in enumerate(sentence.token, 1)]
should do it.
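The key piece is enumerate(..., 1), which pairs each token with its 1-based position in the sentence. A standalone illustration, using a hypothetical Token namedtuple as a stand-in for the real token objects (only the word, tag and label attributes used above are modelled):

```python
from collections import namedtuple

# Hypothetical stand-in for the token objects handled in conll2tree.py.
Token = namedtuple("Token", ["word", "tag", "label"])


def token_strings(tokens):
    """Label each token with its 1-based position in the sentence."""
    return ["%s %d %s %s" % (tok.word, index, tok.tag, tok.label)
            for index, tok in enumerate(tokens, 1)]


sentence = [
    Token("Alice", "NNP", "nsubj"),
    Token("brought", "VBD", "ROOT"),
    Token("the", "DT", "det"),
    Token("pizza", "NN", "dobj"),
    Token("to", "IN", "prep"),
    Token("Alice", "NNP", "pobj"),
    Token(".", ".", "punct"),
]
for line in token_strings(sentence):
    print(line)
```

With this format the index follows the word rather than ending the line, so the two occurrences of "Alice" can be told apart by their positions.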
Say I have 3 or more files. Is there any way I can combine these files into a single document? Example below.
File1:
abc123
File2:
2468, def
File3:
zyx987
I want the outcome to be
CombinedFile:
abc123 2468, def zyx987
There are different ways:
I tested with f1, f2, f3. If the name follows the pattern fXX, it can be done like this:
$ paste f*
abc123 2468, def zyx987
$ paste -d' ' f* #set space as delimiter
abc123 2468, def zyx987
$ cat f*
abc123
2468, def
zyx987
If you want the output to be a file, just add > result
$ cat f* > result
$ cat result
abc123
2468, def
zyx987
Here is another way using pr.
pr -mts' ' f{1,2,3}
$ head f*
==> f1 <==
abc123
==> f2 <==
2468, def
==> f3 <==
zyx987
$ pr -mts' ' f{1,2,3}
abc123 2468, def zyx987
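If a scripted version is preferred, the line-by-line merge that paste performs can be sketched in Python. The function names paste and paste_files are hypothetical; like the paste command, the sketch pads missing lines with empty strings when the files have different lengths.

```python
from itertools import zip_longest


def paste(line_groups, delimiter=" "):
    """Join the n-th line of every input, similar to `paste -d' '`."""
    merged = []
    for row in zip_longest(*line_groups, fillvalue=""):
        merged.append(delimiter.join(line.rstrip("\n") for line in row))
    return merged


def paste_files(paths, delimiter=" "):
    """Read each file in `paths` and merge them line by line."""
    groups = []
    for path in paths:
        with open(path) as handle:
            groups.append(handle.readlines())
    return paste(groups, delimiter)


# In-memory example with the contents of f1, f2 and f3 from above.
for line in paste([["abc123\n"], ["2468, def\n"], ["zyx987\n"]]):
    print(line)
```

To write a combined file, the returned lines could be joined with "\n" and written out with an ordinary open(..., "w").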
I want to find all of the punctuation marks used in my .txt file and count the number of occurrences of each one. How would I go about doing this? I am new at this but I am trying to learn! This is not homework! I have been researching grep and sed so far.
$ perl -CSD -nE '$seen{$1}++ while /(\pP)/g; END { say "$_ $seen{$_}" for keys %seen }' sometextfile.utf8
As in
$ perl -CSD -nE '$seen{$1}++ while /(\pP)/g; END { say "$_ $seen{$_}" for keys %seen }' programming_perl_4th_edition.pod | sort -k2rn
, 21761
. 19578
; 10986
( 8856
) 8853
- 7606
: 7420
" 7300
_ 5305
’ 4906
/ 4528
{ 2966
} 2947
\ 2258
# 2121
# 2070
* 1991
' 1715
“ 1406
” 1404
[ 1007
] 1003
% 881
! 838
? 824
& 555
— 330
‑ 72
– 41
‹ 16
› 16
‐ 10
⁂ 10
… 8
· 3
「 2
」 2
« 1
» 1
‒ 1
― 1
‘ 1
• 1
‥ 1
⁃ 1
・ 1
If you want not just punctuation but punctuation and symbols, use [\pP\pS] in your pattern. Don’t use old-style POSIX classes whatever you do, though.
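A rough Python counterpart, as a sketch only: Unicode general categories beginning with "P" correspond to \pP and those beginning with "S" to \pS, so the count can be done with the standard unicodedata module. The function name count_punctuation and the sample text are illustrative.

```python
import unicodedata
from collections import Counter


def count_punctuation(text, include_symbols=False):
    """Count characters whose Unicode category is P* (optionally also S*)."""
    prefixes = ("P", "S") if include_symbols else ("P",)
    return Counter(ch for ch in text
                   if unicodedata.category(ch).startswith(prefixes))


sample = 'Wait... what?! "Yes," she said.'
for mark, count in count_punctuation(sample).most_common():
    print(mark, count)
```

For a file, read it with an explicit encoding first, e.g. open(path, encoding="utf-8").read(), and pass the resulting string to the function.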
Use sed, tr, sort and uniq (and no perl):
sed -E 's/[^[:punct:]]//g;s/(.)/\1x/g' myfile.txt | tr 'x' '\n' | sort | uniq -c
I did it this way (sed + tr) so that it works on both Unix and Mac. Mac needs an embedded linefeed in the sed command, but Unix can use \n. This way it works everywhere.
This will work on non-mac unix:
sed -E 's/[^[:punct:]]//g;s/(.)/\1\n/g' myfile.txt | sort | uniq -c