pyparsing: Grouping guidelines - pyparsing

pyparsing: The below is the code i put up which can parse a nested function call , a logical function call or a hybrid call which nests both the function and a logical function call. The dump() data adds too many unnecessary levels of braces because of grouping. Removing the Group() results in a wrong output. Is there a guideline to use Group(parsers)?
Also the Pyparsing document does'nt detail on how to walk the tree created and not much of data is available out there. Please point me to a link/guide which helps me write the tree walker for recursively parsed data for my test cases.
I will be translating this parsed data to a valid tcl code.
from pyparsing import *
from pyparsing import OneOrMore, Optional, Word, delimitedList, Suppress
# parse action -maker; # from Paul's example
def makeLRlike(numterms):
if numterms is None:
# None operator can only by binary op
initlen = 2
incr = 1
else:
initlen = {0:1,1:2,2:3,3:5}[numterms]
incr = {0:1,1:1,2:2,3:4}[numterms]
# define parse action for this number of terms,
# to convert flat list of tokens into nested list
def pa(s,l,t):
t = t[0]
if len(t) > initlen:
ret = ParseResults(t[:initlen])
i = initlen
while i < len(t):
ret = ParseResults([ret] + t[i:i+incr])
i += incr
return ParseResults([ret])
return pa
line = Forward()
fcall = Forward().setResultsName("fcall")
flogical = Forward()
lparen = Literal("(").suppress()
rparen = Literal(")").suppress()
arg = Word(alphas,alphanums+"_"+"."+"+"+"-"+"*"+"/")
args = delimitedList(arg).setResultsName("arg")
fargs = delimitedList(OneOrMore(flogical) | OneOrMore(fcall) |
OneOrMore(arg))
fname = Word(alphas,alphanums+"_")
fcall << Group(fname.setResultsName('func') + Group(lparen +
Optional(fargs) + rparen).setResultsName('fargs'))
flogic = Keyword("or") | Keyword("and") | Keyword("not")
logicalArg = delimitedList(Group(fcall.setResultsName("fcall")) |
Group(arg.setResultsName("arg")))
#logicalArg.setDebug()
flogical << Group(logicalArg.setResultsName('larg1') +
flogic.setResultsName('flogic') + logicalArg.setResultsName('larg2'))
#logical = operatorPrecedence(flogical, [(not, 1, opAssoc.RIGHT,
makeLRlike(2)),
# (and, 2, opAssoc.LEFT,
makeLRlike(2)),
# (or , 2, opAssoc.LEFT,
makeLRlike(2))])
line = flogical | fcall #change to logical if operatorPrecedence is used
# Works fine
print line.parseString("f(x, y)").dump()
print line.parseString("f(h())").dump()
print line.parseString("a and b").dump()
print line.parseString("f(a and b)").dump()
print line.parseString("f(g(x))").dump()
print line.parseString("f(a and b) or h(b not c)").dump()
print line.parseString("f(g(x), y)").dump()
print line.parseString("g(f1(x), a, b, f2(x,y, k(x,y)))").dump()
print line.parseString("f(a not c) and g(f1(x), a, b, f2(x,y,
k(x,y)))").dump()
#Does'nt work fine yet;
#try changing flogical assignment to logicalArg | flogic
#print line.parseString("a or b or c").dump()
#print line.parseString("f(a or b(x) or c)").dump()

Related

ellipsis ... as function in substitute?

I'm having trouble understanding how/why parentheses work where they otherwise should not work®.
f = function(...) substitute(...()); f(a, b)
[[1]]
a
[[2]]
b
# but, substitute returns ..1
f2 = function(...) substitute(...); f2(a, b)
a
Normally an error is thrown, could not find function "..." or '...' used in an incorrect context, for example when calling (\(...) ...())(5).
What I've tried
I have looked at the source code of substitute to find out why this doesn't happen here. R Internals 1.1.1 and 1.5.2 says ... is of SEXPTYPE DOTSXP, a pairlist of promises. These promises are what is extracted by substitute.
# \-substitute #R
# \-do_substitute #C
# \-substituteList #C recursive
# \-substitute #C
Going line-by-line, I am stuck at substituteList, in which h is the current element of ... being processed. This happens recursively at line 2832 if (TYPEOF(h) == DOTSXP) h = substituteList(h, R_NilValue);. I haven't found exception handling of a ...() case in the source code, so I suspect something before this has happened.
In ?substitute we find substitute works on a purely lexical basis. Does it mean ...() is a parser trick?
parse(text = "(\\(...) substitute(...()))(a, b)") |> getParseData() |> subset(text == "...", select = c(7, 9))
#> token text
#> 4 SYMBOL_FORMALS ...
#> 10 SYMBOL_FUNCTION_CALL ...
The second ellipsis is recognized during lexical analysis as the name of a function call. It doesn't have its own token like |> does. The output is a pairlist ( typeof(f(a, b)) ), which in this case is the same as a regular list (?). I guess it is not a parser trick. But whatever it is, it has been around for a while!
Question:
How does ...() work?
Note: When referring to documentation and source code, I provide links to an unofficial GitHub mirror of R's official Subversion repository. The links are bound to commit 97b6424 in the GitHub repo, which maps to revision 81461 in the Subversion repo (the latest at the time of this edit).
substitute is a "special" whose arguments are not evaluated (doc).
typeof(substitute)
[1] "special"
That means that the return value of substitute may not agree with parser logic, depending on how the unevaluated arguments are processed internally.
In general, substitute receives the call ...(<exprs>) as a LANGSXP of the form (pseudocode) pairlist(R_DotsSymbol, <exprs>) (doc). The context of the substitute call determines how the SYMSXP R_DotsSymbol is processed. Specifically, if substitute was called inside of a function with ... as a formal argument and rho as its execution environment, then the result of
findVarInFrame3(rho, R_DotsSymbol, TRUE)
in the body of C utility substituteList (source) is either a DOTSXP or R_MissingArg—the latter if and only if f was called without arguments (doc). In other contexts, the result is R_UnboundValue or (exceptionally) some other SEXP—the latter if and only if a value is bound to the name ... in rho. Each of these cases is handled specially by substituteList.
The multiplicity in the processing of R_DotsSymbol is the reason why these R statements give different results:
f0 <- function() substitute(...(n = 1)); f0()
## ...(n = 1)
f1 <- function(...) substitute(...(n = 1)); f1()
## $n
## [1] 1
g0 <- function() {... <- quote(x); substitute(...(n = 1))}; g0()
## Error in g0() : '...' used in an incorrect context
g1 <- function(...) {... <- quote(x); substitute(...(n = 1))}; g1()
## Error in g1() : '...' used in an incorrect context
h0 <- function() {... <- NULL; substitute(...(n = 1))}; h0()
## $n
## [1] 1
h1 <- function(...) {... <- NULL; substitute(...(n = 1))}; h1()
## $n
## [1] 1
Given how ...(n = 1) is parsed, you might have expected f1 to return call("...", n = 1), both g0 and g1 to return call("x", n = 1), and both h0 and h1 to throw an error, but that is not the case for the above, mostly undocumented reasons.
Internals
When called inside of the R function f,
f <- function(...) substitute(...(<exprs>))
substitute evaluates a call to the C utility do_substitute—you can learn this by looking here—in which argList gets a LISTSXP of the form pairlist(x, R_MissingArg), where x is a LANGSXP of the form pairlist(R_DotsSymbol, <exprs>) (source).
If you follow the body of do_substitute, then you will find that the value of t passed to substituteList from do_substitute is a LISTSXP of the form pairlist(copy_of_x) (source).
It follows that the while loop inside of the substituteList call (source) has exactly one iteration and that the statement CAR(el) == R_DotsSymbol in the body of the loop (source) is false in that iteration.
In the false branch of the conditional (source), h gets the value
pairlist(substituteList(copy_of_x, env)). The loop exits and substituteList returns h to do_substitute, which in turn returns CAR(h) to R (source 1, 2, 3).
Hence the return value of substitute is substituteList(copy_of_x, env), and it remains to deduce the identity of this SEXP. Inside of this call to substituteList, the while loop has 1+m iterations, where m is the number of <exprs>. In the first iteration, the statement CAR(el) == R_DotsSymbol in the body of the loop is true.
In the true branch of the conditional (source), h is either a DOTSXP or R_MissingArg, because f has ... as a formal argument (doc). Continuing, you will find that substituteList returns:
R_NilValue if h was R_MissingArg in the first while iteration and m = 0,
or, otherwise,
a LISTSXP listing the expressions in h (if h was a DOTSXP in the first while iteration) followed by <exprs> (if m > 1), all unevaluated and without substitutions, because the execution environment of f is empty at the time of the substitute call.
Indeed:
f <- function(...) substitute(...())
is.null(f())
## [1] TRUE
f <- function(...) substitute(...(n = 1))
identical(f(a = sin(x), b = zzz), pairlist(a = quote(sin(x)), b = quote(zzz), n = 1))
## [1] TRUE
Misc
FWIW, it helped me to recompile R after adding some print statements to coerce.c. For example, I added the following before UNPROTECT(3); in the body of do_substitute (source):
Rprintf("CAR(t) == R_DotsSymbol? %d\n",
CAR(t) == R_DotsSymbol);
if (TYPEOF(CAR(t)) == LISTSXP || TYPEOF(CAR(t)) == LANGSXP) {
Rprintf("TYPEOF(CAR(t)) = %s, length(CAR(t)) = %d\n",
type2char(TYPEOF(CAR(t))), length(CAR(t)));
Rprintf("CAR(CAR(t)) = R_DotsSymbol? %d\n",
CAR(CAR(t)) == R_DotsSymbol);
Rprintf("TYPEOF(CDR(CAR(t))) = %s, length(CDR(CAR(t))) = %d\n",
type2char(TYPEOF(CDR(CAR(t)))), length(CDR(CAR(t))));
}
if (TYPEOF(s) == LISTSXP || TYPEOF(s) == LANGSXP) {
Rprintf("TYPEOF(s) = %s, length(s) = %d\n",
type2char(TYPEOF(s)), length(s));
Rprintf("TYPEOF(CAR(s)) = %s, length(CAR(s)) = %d\n",
type2char(TYPEOF(CAR(s))), length(CAR(s)));
}
which helped me confirm what was going into and coming out of the substituteList call on the previous line:
f <- function(...) substitute(...(n = 1))
invisible(f(hello, world, hello(world)))
CAR(t) == R_DotsSymbol? 0
TYPEOF(CAR(t)) = language, length(CAR(t)) = 2
CAR(CAR(t)) = R_DotsSymbol? 1
TYPEOF(CDR(CAR(t))) = pairlist, length(CDR(CAR(t))) = 1
TYPEOF(s) = pairlist, length(s) = 1
TYPEOF(CAR(s)) = pairlist, length(CAR(s)) = 4
invisible(substitute(...()))
CAR(t) == R_DotsSymbol? 0
TYPEOF(CAR(t)) = language, length(CAR(t)) = 1
CAR(CAR(t)) = R_DotsSymbol? 1
TYPEOF(CDR(CAR(t))) = NULL, length(CDR(CAR(t))) = 0
TYPEOF(s) = pairlist, length(s) = 1
TYPEOF(CAR(s)) = language, length(CAR(s)) = 1
Obviously, compiling R with debugging symbols and running R under a debugger helps, too.
Another puzzle
Just noticed this oddity:
g <- function(...) substitute(...(n = 1), new.env())
gab <- g(a = sin(x), b = zzz)
typeof(gab)
## [1] "language"
gab
## ...(n = 1)
Someone here can do another deep dive to find out why the result is a LANGSXP rather than a LISTSXP when you supply env different from environment() (including env = NULL).

LLVMLITE Hello World Example Produces Wrong Output

The problem I am faced with regards to llvmlite is producing a simple hello world example.
I am unable to display the string global variable value in the function I created.
It appears to always print out the number one.
I have already tried to return the stringtype which produced a error.
from llvmlite import ir
i64 = ir.IntType(64)
i8 = ir.IntType(16)
hellostr = 'hello, world!'
stringtype = ir.ArrayType(i64, len(hellostr))
module = ir.Module( name="m_hello_example" )
hello = ir.GlobalVariable(module, stringtype, '.str4')
fn_int_to_int_type = ir.FunctionType(i64, [stringtype.as_pointer()] )
fn_hel = ir.Function( module, fn_int_to_int_type, name="fn_hel" )
fn_hel_block = fn_hel.append_basic_block( name="fn_hel_entry" )
builder = ir.IRBuilder(fn_hel_block )
# zero = builder.constant(i64, 0)
# const_1 = ir.Constant(stringtype,1);
# builder.ret(const_1)
const_1 = ir.Constant(i64,1);
# print(const_1)
builder.ret(const_1)
print( module )
I was expecting the output to print out the string 'hello, world!'.
Any help would be much appreciated.
Thanks.
It ended up that I was able to solve my problem with the following code:
import llvmlite.ir as ir
import llvmlite.binding as llvm
from ctypes import CFUNCTYPE
def main():
m = ir.Module()
func_ty = ir.FunctionType(ir.VoidType(), []) #defining printer function as type void
func = ir.Function(m, func_ty, name="printer") #define function as printer
builder = ir.IRBuilder(func.append_basic_block('entry')) #defining the entry point of the function printer
fmt = "%s\n\0" #in function printf allows for inserting arg in, next global_fmt statements allow for creating #"fstr" assignment
c_fmt = ir.Constant(ir.ArrayType(ir.IntType(8), len(fmt)),
bytearray(fmt.encode("utf8")))
global_fmt = ir.GlobalVariable(m, c_fmt.type, name="fstr")
global_fmt.linkage = 'internal'
global_fmt.global_constant = True
global_fmt.initializer = c_fmt
arg = "Hello, World!\0" #args will be passed into printf function.
c_str_val = ir.Constant(ir.ArrayType(ir.IntType(8), len(arg)),
bytearray(arg.encode("utf8"))) #creates the c_str_value as a constant
printf_ty = ir.FunctionType(ir.IntType(32), [], var_arg=True) #creation of the printf function begins here and specifies the passing of a argument
printf = ir.Function(m, printf_ty, name="printf")
c_str = builder.alloca(c_str_val.type) #creation of the allocation of the %".2" variable
builder.store(c_str_val, c_str) #store as defined on the next line below %".2"
voidptr_ty = ir.IntType(8).as_pointer()
fmt_arg = builder.bitcast(global_fmt, voidptr_ty) #creates the %".4" variable with the point pointing to the fstr
builder.call(printf, [fmt_arg, c_str]) #We are calling the prinf function with the fmt and arg and returning the value as defiend on the next line
builder.ret_void()
#Next lines are for calling llvm and returning the assembly.
llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()
print(str(m)) #PRINTING OUT THE ASSEMBLY
llvm_module = llvm.parse_assembly(str(m)) #Parsing teh assembly
tm = llvm.Target.from_default_triple().create_target_machine() #creating the target machine
with llvm.create_mcjit_compiler(llvm_module, tm) as ee:
ee.finalize_object() #Making sure all modules owned by the execution engine are fully processed and usable for execution
fptr = ee.get_function_address("printer") #fptr will reference the printer function
py_func = CFUNCTYPE(None)(fptr)
py_func() #run the function printer
if __name__ == "__main__":
main()
It appears that I didn't correctly assign the variable and hence why I wasn't returning anything.

How to use the "6" char code in overloading insertion function on Scilab?

In the Help Documentation of Scilab 6.0.2, I can read the following instruction on the Overloading entry, regarding the last operation code "iext" showed in this entry's table:
"The 6 char code may be used for some complex insertion algorithm like x.b(2) = 33 where b field is not defined in the structure x. The insertion is automatically decomposed into temp = x.b; temp(2) = 33; x.b = temp. The 6 char code is used for the first step of this algorithm. The 6 overloading function is very similar to the e's one."
But I can't find a complete example on how to use this "char 6 code" to overload a function. I'm trying to use it, without success. Does anyone have an example on how to do this?
The code bellow creates a normal "mlist" as a example. Which needs overloading functions
A = rand(5,3)
names = ["colA" "colB" "colC"]
units = ["ft" "in" "lb"]
M = mlist(["Mlog" "names" "units" names],names,units,A(:,1),A(:,2),A(:,3))
Following are the overload functions:
//define display
function %Mlog_p(M)
n = size(M.names,"*")
formatStr = strcat(repmat("%10s ",1,n)) + "\n"
formatNum = strcat(repmat("%0.10f ",1,n)) + "\n"
mprintf(formatStr,M.names)
mprintf(formatStr,M.units)
disp([M(M.names(1)),M(M.names(2)),M(M.names(3))])
end
//define extraction operation
function [Mat]=%Mlog_e(varargin)
M = varargin($)
cols = [1:size(M.names,"*")] // This will also work
cols = cols(varargin($-1)) // when varargin($-1) = 1:1:$
Mat = []
if length(varargin)==3 then
for i = M.names(cols)
Mat = [Mat M(i)(varargin(1))]
end
else
for i=1:size(M.names(cols),"*")
Mat(i).name = M.names(cols(i))
Mat(i).unit = M.units(cols(i))
Mat(i).data = M(:,cols(i))
end
end
endfunction
//define insertion operations (a regular matrix into a Mlog matrix)
function ML=%s_i_Mlog(i,j,V,M)
names = M.names
units = M.units
A = M(:,:) // uses function above
A(i,j) = V
ML = mlist(["Mlog" "names" "units" names],names,units,A(:,1),A(:,2),A(:,3))
endfunction
//insertion operation with structures (the subject of the question)
function temp = %Mlog_6(j,M)
temp = M(j) // uses function %Mlog_e
endfunction
function M = %st_i_Mlog(j,st,M)
A = M(:,:) // uses function %Mlog_e
M.names(j) = st.name // uses function above
M.units(j) = st.unit // uses function above
A(:,j) = st.data // uses function above
names = M.names
units = M.units
M = mlist(["Mlog" "names" "units" names],names,units,A(:,1),A(:,2),A(:,3))
endfunction
The first overload (displays mlist) will show the matrix in the form of the following table:
--> M
M =
colA colB colC
ft in lb
0.4720517 0.6719395 0.5628382
0.0623731 0.1360619 0.5531093
0.0854401 0.2119744 0.0768984
0.0134564 0.4015942 0.5360758
0.3543002 0.4036219 0.0900212
The next overloads (extraction and insertion) Will allow the table to be access as a simple matrix M(i,j).
The extraction function Will also allow M to be access by column, which returns a structure, for instance:
--> M(2)
ans =
name: "colB"
unit: "in"
data: [5x1 constant]
The last two functions are the overloads mentioned in the question. They allow the column metadata to be changed in a structure form.
--> M(2).name = "length"
M =
colA length colC
ft in lb
0.4720517 0.6719395 0.5628382
0.0623731 0.1360619 0.5531093
0.0854401 0.2119744 0.0768984
0.0134564 0.4015942 0.5360758
0.3543002 0.4036219 0.0900212

Pyparsing: ParseAction not called

On a simple grammar I am in the bad situation that one of my ParseActions is not called.
For me this is strange as parseActions of a base symbol ("logic_oper") and a derived symbol ("cmd_line") are called correctly. Just "pa_logic_cmd" is not called. You can see this on the output which is included at the end of the code.
As there is no exception on parsing the input string, I am assuming that the grammar is (basically) correct.
import io, sys
import pyparsing as pp
def diag(msg, t):
print("%s: %s" % (msg , str(t)) )
def pa_logic_oper(t): diag('logic_oper', t)
def pa_operand(t): diag('operand', t)
def pa_ident(t): diag('ident', t)
def pa_logic_cmd(t): diag('>>>>>> logic_cmd', t)
def pa_cmd_line(t): diag('cmd_line', t)
def make_grammar():
semi = pp.Literal(';')
ident = pp.Word(pp.alphas, pp.alphanums).setParseAction(pa_ident)
operand = (ident).setParseAction(pa_operand)
op_and = pp.Keyword('A')
op_or = pp.Keyword('O')
logic_oper = (( op_and | op_or) + pp.Optional(operand))
logic_oper.setParseAction(pa_logic_oper)
logic_cmd = logic_oper + pp.Suppress(semi)
logic_cmd.setParseAction(pa_logic_cmd)
cmd_line = (logic_cmd)
cmd_line.setParseAction(pa_cmd_line)
grammar = pp.OneOrMore(cmd_line) + pp.StringEnd()
return grammar
if __name__ == "__main__":
inp_str = '''
A param1;
O param2;
A ;
'''
grammar = make_grammar()
print( "pp-version:" + pp.__version__)
parse_res = grammar.parseString( inp_str )
'''USAGE/Output: python test_4.py
pp-version:2.0.3
operand: ['param1']
logic_oper: ['A', 'param1']
cmd_line: ['A', 'param1']
operand: ['param2']
logic_oper: ['O', 'param2']
cmd_line: ['O', 'param2']
logic_oper: ['A']
cmd_line: ['A']
'''
Can anybody give me a hint on this parseAction problem?
Thanks,
The problem is here:
cmd_line = (logic_cmd)
cmd_line.setParseAction(pa_cmd_line)
The first line assigns cmd_line to be the same expression as logic_cmd. You can verify by adding this line:
print("???", cmd_line is logic_cmd)
Then the second line calls setParseAction, which overwrites the parse action of logic_cmd, so the pa_logic_cmd will never get called.
Remove the second line, since you are already testing the calling of the parse action with pa_logic_cmd. You could change to using the addParseAction method instead, but to my mind that is an invalid test (adding 2 parse actions to the same pyparsing expression object).
Or, change the definition of cmd_line to:
cmd_line = pp.Group(logic_cmd)
Now you will have wrapped logic_cmd inside another expression, and you can then independently set and test the running of parse actions on the two different expressions.

alignment of sequences

I want to do pairwise alignment with uniprot and pdb sequences. I have an input file containing uniprot and pdb IDs like this.
pdb id uniprot id
1dbh Q07889
1e43 P00692
1f1s Q53591
first, I need to read each line in an input file
2) retrieve the pdb and uniprot sequences from pdb.fasta and uniprot.fasta files
3) Do alignment and calculate sequence identity.
Usually, I use the following program for pairwise alignment and seq.identity calculation.
library("seqinr")
seq1 <- "MDEKRRAQHNEVERRRRDKINNWIVQLSKIIPDSSMESTKSGQSKGGILSKASDYIQELRQSNHR"
seq2<- "MKGQQKTAETEEGTVQIQEGAVATGEDPTSVAIASIQSAATFPDPNVKYVFRTENGGQVM"
library(Biostrings)
globalAlign<- pairwiseAlignment(seq1, seq2)
pid(globalAlign, type = "PID3")
I need to print the output like this
pdbid uniprotid seq.identity
1dbh Q07889 99
1e43 P00692 80
1f1s Q53591 56
How can I change the above code ? your help would be appreciated!
'
This code is hopefully what your looking for:
class test():
def get_seq(self, pdb,fasta_file): # Get sequences
from Bio.PDB.PDBParser import PDBParser
from Bio import SeqIO
aa = {'ARG':'R','HIS':'H','LYS':'K','ASP':'D','GLU':'E','SER':'S','THR':'T','ASN':'N','GLN':'Q','CYS':'C','SEC':'U','GLY':'G','PRO':'P','ALA':'A','ILE':'I','LEU':'L','MET':'M','PHE':'F','TRP':'W','TYR':'Y','VAL':'V'}
p=PDBParser(PERMISSIVE=1)
structure_id="%s" % pdb[:-4]
structure=p.get_structure(structure_id, pdb)
residues = structure.get_residues()
seq_pdb = ''
for res in residues:
res = res.get_resname()
if res in aa:
seq_pdb = seq_pdb+aa[res]
handle = open(fasta_file, "rU")
for record in SeqIO.parse(handle, "fasta") :
seq_fasta = record.seq
handle.close()
self.seq_aln(seq_pdb,seq_fasta)
def seq_aln(self,seq1,seq2): # Align the sequences
from Bio import pairwise2
from Bio.SubsMat import MatrixInfo as matlist
matrix = matlist.blosum62
gap_open = -10
gap_extend = -0.5
alns = pairwise2.align.globalds(seq1, seq2, matrix, gap_open, gap_extend)
top_aln = alns[0]
aln_seq1, aln_seq2, score, begin, end = top_aln
with open('aln.fasta', 'w') as outfile:
outfile.write('> PDB_seq\n'+str(aln_seq1)+'\n> Uniprot_seq\n'+str(aln_seq2))
print aln_seq1+'\n'+aln_seq2
self.seq_id('aln.fasta')
def seq_id(self,aln_fasta): # Get sequence ID
import string
from Bio import AlignIO
input_handle = open("aln.fasta", "rU")
alignment = AlignIO.read(input_handle, "fasta")
j=0 # counts positions in first sequence
i=0 # counts identity hits
for record in alignment:
#print record
for amino_acid in record.seq:
if amino_acid == '-':
pass
else:
if amino_acid == alignment[0].seq[j]:
i += 1
j += 1
j = 0
seq = str(record.seq)
gap_strip = seq.replace('-', '')
percent = 100*i/len(gap_strip)
print record.id+' '+str(percent)
i=0
a = test()
a.get_seq('1DBH.pdb','Q07889.fasta')
This outputs:
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------EQTYYDLVKAF-AEIRQYIRELNLIIKVFREPFVSNSKLFSANDVENIFSRIVDIHELSVKLLGHIEDTVE-TDEGSPHPLVGSCFEDLAEELAFDPYESYARDILRPGFHDRFLSQLSKPGAALYLQSIGEGFKEAVQYVLPRLLLAPVYHCLHYFELLKQLEEKSEDQEDKECLKQAITALLNVQSG-EKICSKSLAKRRLSESA-------------AIKK-NEIQKNIDGWEGKDIGQCCNEFI-EGTLTRVGAKHERHIFLFDGL-ICCKSNHGQPRLPGASNAEYRLKEKFF-RKVQINDKDDTNEYKHAFEIILKDENSVIFSAKSAEEKNNW-AALISLQYRSTL---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
MQAQQLPYEFFSEENAPKWRGLLVPALKKVQGQVHPTLESNDDALQYVEELILQLLNMLCQAQPRSASDVEERVQKSFPHPIDKWAIADAQSAIEKRKRRNPLSLPVEKIHPLLKEVLGYKIDHQVSVYIVAVLEYISADILKLVGNYVRNIRHYEITKQDIKVAMCADKVLMDMFHQDVEDINILSLTDEEPSTSGEQTYYDLVKAFMAEIRQYIRELNLIIKVFREPFVSNSKLFSANDVENIFSRIVDIHELSVKLLGHIEDTVEMTDEGSPHPLVGSCFEDLAEELAFDPYESYARDILRPGFHDRFLSQLSKPGAALYLQSIGEGFKEAVQYVLPRLLLAPVYHCLHYFELLKQLEEKSEDQEDKECLKQAITALLNVQSGMEKICSKSLAKRRLSESACRFYSQQMKGKQLAIKKMNEIQKNIDGWEGKDIGQCCNEFIMEGTLTRVGAKHERHIFLFDGLMICCKSNHGQPRLPGASNAEYRLKEKFFMRKVQINDKDDTNEYKHAFEIILKDENSVIFSAKSAEEKNNWMAALISLQYRSTLERMLDVTMLQEEKEEQMRLPSADVYRFAEPDSEENIIFEENMQPKAGIPIIKAGTVIKLIERLTYHMYADPNFVRTFLTTYRSFCKPQELLSLIIERFEIPEPEPTEADRIAIENGDQPLSAELKRFRKEYIQPVQLRVLNVCRHWVEHHFYDFERDAYLLQRMEEFIGTVRGKAMKKWVESITKIIQRKKIARDNGPGHNITFQSSPPTVEWHISRPGHIETFDLLTLHPIEIARQLTLLESDLYRAVQPSELVGSVWTKEDKEINSPNLLKMIRHTTNLTLWFEKCIVETENLEERVAVVSRIIEILQVFQELNNFNGVLEVVSAMNSSPVYRLDHTFEQIPSRQKKILEEAHELSEDHYKKYLAKLRSINPPCVPFFGIYLTNILKTEEGNPEVLKRHGKELINFSKRRKVAEITGEIQQYQNQPYCLRVESDIKRFFENLNPMGNSMEKEFTDYLFNKSLEIEPRNPKPLPRFPKKYSYPLKSPGVRPSNPRPGTMRHPTPLQQEPRKISYSRIPESETESTASAPNSPRTPLTPPPASGASSTTDVCSVFDSDHSSPFHSSNDTVFIQVTLPHGPRSASVSSISLTKGTDEVPVPPPVPPRRRPESAPAESSPSKIMSKHLDSPPAIPPRQPTSKAYSPRYSISDRTSISDPPESPPLLPPREPVRTPDVFSSSPLHLQPPPLGKKSDHGNAFFPNSPSPFTPPPPQTPSPHGTRRHLPSPPLTQEVDLHSIAGPPVPPRQSTSQHIPKLPPKTYKREHTHPSMHRDGPPLLENAHSS
PDB_seq 100 # pdb to itself would obviously have 100% identity
Uniprot_seq 24 # pdb sequence has 24% identity to the uniprot sequence
For this to work on you input file, you need to put my a.get_seq() in a for loop with the inputs from your text file.
EDIT:
Replace the seq_id function with this one:
def seq_id(self,aln_fasta):
import string
from Bio import AlignIO
from Bio import SeqIO
record_iterator = SeqIO.parse(aln_fasta, "fasta")
first_record = record_iterator.next()
print '%s has a length of %d' % (first_record.id, len(str(first_record.seq).replace('-','')))
second_record = record_iterator.next()
print '%s has a length of %d' % (second_record.id, len(str(second_record.seq).replace('-','')))
lengths = [len(str(first_record.seq).replace('-','')), len(str(second_record.seq).replace('-',''))]
if lengths.index(min(lengths)) == 0: # If both sequences have the same length the PDB sequence will be taken as the shortest
print 'PDB sequence has the shortest length'
else:
print 'Uniport sequence has the shortes length'
idenities = 0
for i,v in enumerate(first_record.seq):
if v == '-':
pass
#print i,v, second_record.seq[i]
if v == second_record.seq[i]:
idenities +=1
#print i,v, second_record.seq[i], idenities
print 'Sequence Idenity = %.2f percent' % (100.0*(idenities/min(lengths)))
to pass the arguments to the class use:
with open('input_file.txt', 'r') as infile:
next(infile)
next(infile) # Going by your input file
for line in infile:
line = line.split()
a.get_seq(segs[0]+'.pdb',segs[1]+'.fasta')
It might be something like this; a repeatable example (e.g., with short files posted on-line) would help...
library(Biostrings)
pdb = readAAStringSet("pdb.fasta")
uniprot = readAAStringSet("uniprot.fasta")
to input all sequences into two objects. pairwiseAlignment accepts a vector as first (query) argument, so if you were wanting to align all pdb against all uniprot pre-allocate a result matrix
pids = matrix(numeric(), length(uniprot), length(pdb),
dimnames=list(names(uniprot), names(pdb)))
and then do the calculations
for (i in seq_along(uniprot)) {
globalAlignment = pairwiseAlignment(pdb, uniprot[i])
pids[i,] = pid(globalAlignment)
}

Resources