Dictionary key from pdb file - dictionary

I'm trying to go through a .pdb file, calculate distance between alpha carbon atoms from different residues on chains A and B of a protein complex, then store the distance in a dictionary, together with the chain identifier and residue number.
For example, if the first alpha carbon ("CA") is found on residue 100 on chain A and the one it binds to is on residue 123 on chain B I would want my dictionary to look something like d={(A, 100):[B, 123, distance_between_atoms]}
from Bio.PDB.PDBParser import PDBParser
parser=PDBParser()
struct = parser.get_structure("1trk", "1trk.pdb")
def getAlphaCarbons(chain):
    vec = []
    for residue in chain:
        for atom in residue:
            if atom.get_name() == "CA":
                vec = vec + [atom.get_vector()]
    return vec
def dist(a,b):
    return (a-b).norm()
chainA = struct[0]['A']
chainB = struct[0]['B']
vecA = getAlphaCarbons(chainA)
vecB = getAlphaCarbons(chainB)
t={}
model=struct[0]
for model in struct:
    for chain in model:
        for residue in chain:
            for a in vecA:
                for b in vecB:
                    if dist(a,b)<=8:
                        t={(chain,residue):[(a, b, dist(a, b))]}
                        break
print t
It's been running the programme for ages and I had to abort the run (have I made an infinite loop somewhere??)
I was also trying to do this:
t = {i:[((a, b, dist(a,b)) for a in vecA) for b in vecB if dist(a, b) <= 8] for i in chainA}
print t
But it's printing info about residues in the following format:
<Residue PHE het= resseq=591 icode= >: []
It's not printing anything related to the distance.
Thanks a lot, I hope everything is clear.

Would strongly suggest using C libraries while calculating distances. I use mdtraj for this sort of thing and it works much quicker than all the for loops in BioPython.
To get all pairs of alpha-Carbons:
import itertools
import mdtraj as md
def get_CA_pairs(pdbfile):
    # returns all pairs of alpha-carbon atom indices
    traj = md.load_pdb(pdbfile)
    topology = traj.topology
    CA_index = [atom.index for atom in topology.atoms if atom.name == 'CA']
    pairs = list(itertools.combinations(CA_index, 2))
    return pairs
Then, for quick computation of distances:
def get_distances(pdbfile):
    # returns a dictionary mapping each CA-CA atom-index pair to its distance
    traj = md.load_pdb(pdbfile)
    pairs = get_CA_pairs(pdbfile)
    # compute_distances returns an array of shape (n_frames, n_pairs);
    # mdtraj reports distances in nanometers
    dist = md.compute_distances(traj, pairs)
    # make the dictionary you desire, here from the first frame
    return dict(zip(pairs, dist[0]))
This includes all alpha-Carbons. There should be a chain identifier too in mdtraj to select CA's from each chain.
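For the dictionary layout asked for in the question, here is a minimal pure-Python sketch of the bookkeeping only. It assumes the alpha carbons have already been extracted into lists of (chain_id, residue_number, (x, y, z)) tuples. That input format, the function name closest_contacts and the coordinates are made up for illustration and are not part of BioPython or mdtraj.

```python
import math

def closest_contacts(atoms_a, atoms_b, cutoff=8.0):
    """Map each (chain, resnum) in atoms_a to its nearest partner in atoms_b.

    atoms_a / atoms_b: lists of (chain_id, residue_number, (x, y, z)) tuples,
    a hypothetical format standing in for the CA atoms of each chain.
    Only partners within `cutoff` are stored.
    """
    contacts = {}
    for chain_a, res_a, xyz_a in atoms_a:
        best = None
        for chain_b, res_b, xyz_b in atoms_b:
            d = math.dist(xyz_a, xyz_b)  # Euclidean distance (Python 3.8+)
            if d <= cutoff and (best is None or d < best[2]):
                best = [chain_b, res_b, d]
        if best is not None:
            contacts[(chain_a, res_a)] = best
    return contacts

# Made-up coordinates, for illustration only
chain_a = [("A", 100, (0.0, 0.0, 0.0)), ("A", 101, (50.0, 0.0, 0.0))]
chain_b = [("B", 123, (3.0, 4.0, 0.0)), ("B", 124, (0.0, 0.0, 7.0))]
contacts = closest_contacts(chain_a, chain_b)
```

Each key is written once per residue on chain A, so the dictionary is updated rather than reassigned on every hit. Residues with no partner inside the cutoff are simply left out.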


All path *lengths* from source to target in Directed Acyclic Graph

I have a graph with an adjacency matrix shape (adj_mat.shape = (4000, 4000)). My current problem involves finding the list of path lengths (the sequence of nodes is not so important) that traverses from the source (row = 0 ) to the target (col = trans_mat.shape[0] -1).
I am not interested in finding the path sequences; I am only interested in propagating the path length. As a result, this is different from finding all simple paths - which would be too slow (ie. find all paths from source to target; then score each path). Is there a performant way to do this quickly?
DFS is suggested as one possible strategy (noted here). My current implementation (below) is simply not optimal:
import networkx as nx
import numpy as np
# create graph
G = nx.from_numpy_matrix(adj_mat, create_using=nx.DiGraph())
# initialize nodes
for node in G.nodes:
    G.nodes[node]['cprob'] = []
# set starting node value
G.nodes[0]['cprob'] = [0]
def propagate_prob(G, node):
    # find incoming edges to node
    predecessors = list(G.predecessors(node))
    curr_node_arr = []
    for prev_node in predecessors:
        # get incoming edge weight
        edge_weight = G.get_edge_data(prev_node, node)['weight']
        # get predecessor node value
        if len(G.nodes[prev_node]['cprob']) == 0:
            G.nodes[prev_node]['cprob'] = propagate_prob(G, prev_node)
        prev_node_arr = G.nodes[prev_node]['cprob']
        # add incoming edge weight to prev_node arr
        curr_node_arr = np.concatenate([curr_node_arr, np.array(edge_weight) + np.array(prev_node_arr)])
    # update current node array
    G.nodes[node]['cprob'] = curr_node_arr
    return G.nodes[node]['cprob']
# calculate all path lengths from source to sink
# (the sink is the last node, i.e. index adj_mat.shape[0] - 1)
part_func = propagate_prob(G, adj_mat.shape[0] - 1)
I don't have a large example at hand (e.g. >300 nodes), but I found a non-recursive solution:
import networkx as nx
g = nx.DiGraph()
nx.add_path(g, range(7))
g.add_edge(0, 3)
g.add_edge(0, 5)
g.add_edge(1, 4)
g.add_edge(3, 6)
# first step retrieve topological sorting
sorted_nodes = nx.algorithms.topological_sort(g)
start = 0
target = 6
path_lengths = {start: [0]}
for node in sorted_nodes:
    if node == target:
        print(path_lengths[node])
        break
    if node not in path_lengths or g.out_degree(node) == 0:
        continue
    new_path_length = path_lengths[node]
    new_path_length = [i + 1 for i in new_path_length]
    for successor in g.successors(node):
        if successor in path_lengths:
            path_lengths[successor].extend(new_path_length)
        else:
            path_lengths[successor] = new_path_length.copy()
    if node != target:
        del path_lengths[node]
Output: [2, 4, 2, 4, 4, 6]
If you are only interested in the number of paths with different length, e.g. {2:2, 4:3, 6:1} for above example, you could even reduce the lists to dicts.
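The reduction from lists to dicts could look roughly like the sketch below. To keep it self-contained it does not use networkx: the graph is a plain dict of successor lists (an assumed input format), the topological order comes from a hand-written Kahn's algorithm, and every edge counts as length 1, as in the example above. The function name path_length_counts is made up for illustration.

```python
from collections import Counter, deque

def path_length_counts(successors, start, target):
    """Count the paths from start to target in a DAG, grouped by length.

    successors: dict mapping node -> list of successor nodes (assumed format).
    Returns a Counter {path_length: number_of_paths}; every edge has length 1.
    """
    # Kahn's algorithm: compute a topological order of all nodes
    nodes = set(successors)
    for targets in successors.values():
        nodes.update(targets)
    indegree = {node: 0 for node in nodes}
    for targets in successors.values():
        for t in targets:
            indegree[t] += 1
    queue = deque(node for node in nodes if indegree[node] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for t in successors.get(node, ()):
            indegree[t] -= 1
            if indegree[t] == 0:
                queue.append(t)

    # Propagate {length: count} instead of one list entry per path
    counts = {start: Counter({0: 1})}
    for node in order:
        if node not in counts:
            continue  # not reachable from start
        if node == target:
            break     # all predecessors of target are already processed
        for t in successors.get(node, ()):
            bucket = counts.setdefault(t, Counter())
            for length, n_paths in counts[node].items():
                bucket[length + 1] += n_paths
    return counts.get(target, Counter())

# Same edges as the networkx example above
edges = {0: [1, 3, 5], 1: [2, 4], 2: [3], 3: [4, 6], 4: [5], 5: [6], 6: []}
result = path_length_counts(edges, 0, 6)
```

The memory per node is then bounded by the number of distinct path lengths rather than by the number of paths, which is what makes the list version blow up on larger graphs. With real-valued edge weights the number of distinct lengths can still grow quickly, so this helps most when many paths share the same length.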
Background
Some explanation of what I'm doing (which I hope works for larger examples as well). The first step is to retrieve the topological sorting. Why? It tells me in which "direction" the edges flow, so I can simply process the nodes in that order without "missing any edge" and without any "backtracking" like in a recursive variant. Afterwards, I initialise the start node with a list containing the current path length ([0]). This list is copied to all successors, while updating the path length (all elements +1). The goal is that in each iteration the path lengths from the starting node to all processed nodes are calculated and stored in the dict path_lengths. The loop stops after reaching the target node.
With igraph I can calculate up to 300 nodes in ~1 second. I also found that accessing the adjacency matrix directly (rather than calling igraph functions to retrieve edges/vertices) saves time. The two key bottlenecks are 1) appending to a long list efficiently (while also limiting memory use) and 2) finding a way to parallelize. The run time grows exponentially past ~300 nodes, so I would love to see if someone has a faster solution (that also fits into memory).
import igraph
import numpy as np
# create graph from adjacency matrix
G = igraph.Graph.Adjacency((trans_mat_pad > 0).tolist())
# add edge weights
G.es['weight'] = trans_mat_pad[trans_mat_pad.nonzero()]
# initialize nodes
for node in range(trans_mat_pad.shape[0]):
    G.vs[node]['cprob'] = []
# set starting node value
G.vs[0]['cprob'] = [0]
def propagate_prob(G, node, trans_mat_pad):
    # find incoming edges to node
    predecessors = trans_mat_pad[:, node].nonzero()[0] # G.get_adjlist(mode='IN')[node]
    curr_node_arr = []
    for prev_node in predecessors:
        # get incoming edge weight
        edge_weight = trans_mat_pad[prev_node, node] # G.es[prev_node]['weight']
        # get predecessor node value
        if len(G.vs[prev_node]['cprob']) == 0:
            curr_node_arr = np.concatenate([curr_node_arr, np.array(edge_weight) + propagate_prob(G, prev_node, trans_mat_pad)])
        else:
            curr_node_arr = np.concatenate([curr_node_arr, np.array(edge_weight) + np.array(G.vs[prev_node]['cprob'])])
    ## NB: If memory constraint, uncomment below
    # set max size
    # if len(curr_node_arr) > 100:
    #     curr_node_arr = np.sort(curr_node_arr)[:100]
    # update current node array
    G.vs[node]['cprob'] = curr_node_arr
    return G.vs[node]['cprob']
# calculate path lengths
path_len = propagate_prob(G, trans_mat_pad.shape[0]-1, trans_mat_pad)

Best way to count downstream with edge data

I have a NetworkX problem. I create a digraph from a pandas DataFrame, with data set on the edges. I now need to count the number of unique sources among a node's descendants, which requires accessing the edge attribute.
This is my code and it works for one node but I need to pass a lot of nodes to this and get unique counts.
graph = nx.from_pandas_edgelist(df, source="source", target="target",
                                edge_attr=["domain", "category"], create_using=nx.DiGraph)
downstream_nodes = list(nx.descendants(graph, node))
downstream_nodes.append(node)
subgraph = graph.subgraph(downstream_nodes).copy()
domain_sources = {}
for s, t, v in subgraph.edges(data=True):
    if v["domain"] in domain_sources:
        domain_sources[v["domain"]].append(s)
    else:
        domain_sources[v["domain"]] = [s]
down_count = {}
for k, v in domain_sources.items():
    down_count[k] = len(list(set(v)))
It works but, again, for one node the time is not a big deal but I'm feeding this routine at least 40 to 50 nodes. Is this the best way? Is there something else I can do that can group by an edge attribute and uniquely count the nodes?
Two possible enhancements:
Remove copy from the line creating the subgraph. You are not changing anything, so the copy is redundant.
Use a defaultdict whose values are sets, so duplicate sources are dropped as they are added. Read more here.
from collections import defaultdict
import networkx as nx
# missing part of df creation
graph = nx.from_pandas_edgelist(df, source="source", target="target",
                                edge_attr=["domain", "category"], create_using=nx.DiGraph)
downstream_nodes = list(nx.descendants(graph, node))
downstream_nodes.append(node)
subgraph = graph.subgraph(downstream_nodes)
domain_sources = defaultdict(set)
for s, t, v in subgraph.edges(data=True):
    domain_sources[v["domain"]].add(s)
down_count = {}
for k, v in domain_sources.items():
    # v is already a set, so its length is the unique count
    down_count[k] = len(v)

Python 3.6 user-defined board size win check with 2 variables

I have somewhat a general question for more experienced programmers. I'm somewhat new to programming, but still enjoy it quite a bit. I've been working with Python, and decided to try to program a tic tac toe game, but with variable board size that can be decided by the user (all the way up to a 26x26 board). Here's what I've got so far:
print("""Welcome to tic tac toe!
You will begin by determining who goes first;
Player 2 will decide on the board size, from 3 to 26.
Depending on the size of the board, you will have to
get a certain amount of your symbol (X/O) in a row to win.
To place symbols on the board, input their coordinates;
letter first, then number (e.g. a2, g10, or f18).
That's it for the rules. Good luck!\n""")
while True:
    ready = input("Are you ready? When you are, input 'yes'.")
    if ready.lower() == 'yes': break
def printboard(n, board):
    print() #print board in ranks of n length; n given later
    boardbyrnk = [board[ind:ind+n] for ind in range(0,n**2,n)]
    for rank in range(n):
        rn = f"{n-rank:02d}" #pads with a 0 if rank number < 10
        print(f"{rn}|{'|'.join(boardbyrnk[rank])}|") #with rank#'s
    print(" ",end="") #files at bottom of board
    for file in range(97,n+97): print(" "+chr(file), end="")
    print()
def sqindex(prompt, n, board, syms): #takes input & returns index
    #ss is a list/array of all possible square names
    ss = [chr(r+97)+str(f+1) for r in range(n) for f in range(n)]
    while True: #all bugs will cause input to be taken for same turn
        sq = input(prompt)
        if not(sq in ss): print("Square doesn't exist!"); continue
        #the index is found by multiplying rank and adding file #'s
        index = n*(n-int(sq[1:])) + ord(sq[0])-97
        if board[index] in syms: #ensure it contains ' '
            print("The square must be empty!"); continue
        return index
def checkwin(n, w, board, sm): #TODO
    #check rows, columns and diagonals in terms of n and w;
    #presumably return True if each case is met
    return False
ps = ["Player 1", "Player 2"]; syms = ['X', 'O']
#determines number of symbols in a row needed to win later on
c = {3:[3,3],4:[4,6],5:[7,13],6:[14,18],7:[19,24],8:[25,26]}
goagain = True
while goagain:
    #decide on board size
    while True:
        try: n=int(input(f"\n{ps[1]}, how long is the board side? "))
        except ValueError: print("Has to be an integer!"); continue
        if not(2<n<27): print("Has to be from 3 to 26."); continue
        break
    board = (n**2)*" " #can be rewritten around a square's index
    for num in c:
        if c[num][0] <= n <= c[num][1]: w = num; break
    print(f"You'll have to get {w} symbols in a row to win.")
    for tn in range(n**2): #tn%2 = 0 or 1, determining turn order
        printboard(n, board)
        pt = ps[tn%2]
        sm = syms[tn%2]
        prompt = f"{pt}, where do you place your {sm}? "
        idx = sqindex(prompt, n, board, syms)
        #the index found in the function is used to split the board string
        board = board[:idx] + sm + board[idx+1:]
        if checkwin(n, w, board, sm):
            printboard(n, board); print('\n' + pt + ' wins!!\n\n')
            break
        if board.count(" ") == 0:
            printboard(n, board); print("\nIt's a draw!")
    while True: #replay y/n; board size can be redetermined
        rstorq = input("Will you play again? Input 'yes' or 'no': ")
        if rstorq in ['yes', 'no']:
            if rstorq == 'no': goagain = False
            break
        else: print("Please only input lowercase 'yes' or 'no'.")
print("Thanks for playing!")
So my question to those who know what they're doing is how they would recommend determining whether the current player has won (obviously I have to check in terms of w for all cases, but how to program it well?). It's the only part of the program that doesn't work yet. Thanks!
You can get the size of the board from the board variable (assuming a square board).
def winning_line(line, symbol):
    return all(cell == symbol for cell in line)
def diag(board):
    return (board[i][i] for i in range(len(board)))
def checkwin(board, symbol):
    if any(winning_line(row, symbol) for row in board):
        return True
    transpose = list(zip(*board))
    if any(winning_line(column, symbol) for column in transpose):
        return True
    # the main diagonal of the row-reversed board is the anti-diagonal
    # (the transpose has the same main diagonal as the board itself)
    return any(winning_line(diag(layout), symbol) for layout in (board, board[::-1]))
zip(*board) is a nice way to get the transpose of your board. If you imagine your original board list as a list of rows, the transpose will be a list of columns.
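The functions above check for a completely filled line on a board stored as a list of rows. For the question's setup, where the board is a flat string of n*n characters and only w symbols in a row are needed, one possible sketch is below. It keeps the signature of the question's checkwin stub and assumes the row-major layout used by printboard; the direction list and the bounds check are my own choices, not something taken from the question.

```python
def checkwin(n, w, board, sm):
    """Return True if symbol sm appears w times in a row on an n x n board.

    board is a flat, row-major string of length n*n, as in the question.
    """
    # right, down, down-right diagonal, down-left diagonal
    directions = [(0, 1), (1, 0), (1, 1), (1, -1)]
    for row in range(n):
        for col in range(n):
            for d_row, d_col in directions:
                end_row = row + (w - 1) * d_row
                end_col = col + (w - 1) * d_col
                # skip runs that would leave the board (also prevents
                # a run from wrapping around the end of a row)
                if not (0 <= end_row < n and 0 <= end_col < n):
                    continue
                if all(board[(row + k * d_row) * n + (col + k * d_col)] == sm
                       for k in range(w)):
                    return True
    return False
```

This scans every cell as a possible start of a run, which is easily fast enough for boards up to 26x26. A cheaper variant would only check the lines passing through the square that was just played.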

How to call numerical results to integrate a ODE using Runge-Kutta-4 in Python 3?

I'm trying to solve (for m_0) numerically the following ordinary differential equation:
dm0/dx=(((1-x)*(x*(2-x))**(1.5))/(k+x)**2)*(((x*(2-x))/3.0)*(dw/dx)**2 + ((8*(k+1))/(3*(k+x)))*w**2)
The values of w and dw/dx have already been found numerically using 4th-order Runge-Kutta, and k is a fixed factor. I wrote code that reads the values of w and dw/dx from external files, stores them in arrays, uses the arrays in the function, and then runs the integration. The outcome is not what I expected :(, and I don't know what is wrong. If anyone could give me a hand, it would be highly appreciated. Thank you!
from math import sqrt
from numpy import array,zeros,loadtxt
from printSoln import *
from run_kut4 import *
m = 1.15 # Just a constant.
k = 3.0*sqrt(1.0-(1.0/m))-1.0 # k in terms of m.
omegas = loadtxt("omega.txt",float) # Import values of w
domegas = loadtxt("domega.txt",float) # Import values of dw/dx
w = [] # Defines the array w to store the values w^2
s = 0.0
for s in omegas:
    w.append(s**2) # Calculates the values w**2
omeg = array(w,float) # Array to store the value of w**2
dw = [] # Defines the array dw to store the values dw**2
t = 0.0
for t in domegas:
    dw.append(t**2) # Calculates the values for dw**2
domeg = array(dw,float) # Array to store the values of dw**2
x = 1.0e-12 # Starting point of integration
xStop = (2.0 - k)/3.0 # Final point of integration
def F(x,y): # Define function to be integrated
    F = zeros(1)
    for i in domeg: # Loop to call w^2, (dw/dx)^2
        for j in omeg:
            F[0] = (((1.0-x)*(x*(2.0-x))**(1.5))/(k+x)**2)*((1.0/3.0)*x* (2.0-x)*domeg[i] + (8.0*(k+1.0)*omeg[j])/(3.0*(k+x)))
    return F
y = array([((32.0*sqrt(2.0)*(k+1.0)*(x**2.5))/(15.0*(k**3)))]) # Initial condition for m_{0}
h = 1.0e-5 # Integration step
freq = 0 # Prints only initial and final values
X,Y = integrate(F,x,y,xStop,h) # Calls Runge-Kutta 4
printSoln(X,Y,freq) # Prints solution
Interpreting your verbal description, there is an ODE for omega, w'=F(x,w), and a coupled ODE for m0, m'=G(x,m,w,w'). The almost always optimal way to solve this is to treat it as a system of ODEs,
def ODEfunc(x,y):
    w,m = y
    dw = F(x,w)
    dm = G(x,m,w,dw)
    return np.array([dw, dm])
which you can then insert in the ODE solver of your choice, e.g., the fictitious
ODEintegrate(ODEfunc, xsamples, y0)
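To make the coupled-system idea concrete, here is a small self-contained sketch with a hand-written classical RK4 stepper. The right-hand sides F and G are toy stand-ins chosen so the exact solution is known (w' = w and m' = w + w', so w = e^x and m = 2(e^x - 1) from w(0) = 1, m(0) = 0). They are not the equations from the question, and rk4_system is a hypothetical helper, not the run_kut4 module used there.

```python
import math

def rk4_system(f, x0, y0, x_stop, h):
    """Classical 4th-order Runge-Kutta for y' = f(x, y), y a list of floats."""
    x, y = x0, list(y0)
    n_steps = round((x_stop - x0) / h)
    for _ in range(n_steps):
        k1 = f(x, y)
        k2 = f(x + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
        k3 = f(x + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
        k4 = f(x + h, [yi + h * ki for yi, ki in zip(y, k3)])
        y = [yi + h / 6 * (a + 2 * b + 2 * c + d)
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
        x += h
    return x, y

# Toy right-hand sides standing in for the real F and G (assumptions)
def F(x, w):
    return w            # w' = w

def G(x, m, w, dw):
    return w + dw       # m' = w + w'

def ode_func(x, y):
    # both unknowns advance together, so dw is always evaluated
    # at exactly the x where the m equation needs it
    w, m = y
    dw = F(x, w)
    dm = G(x, m, w, dw)
    return [dw, dm]

x_end, (w_end, m_end) = rk4_system(ode_func, 0.0, [1.0, 0.0], 1.0, 0.01)
```

The point of solving both equations in one system is that RK4 evaluates the right-hand side at intermediate points (x + h/2), where tabulated values of w and dw/dx from a previous run are not available without interpolation.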

How to generate a n-dimensional "identity matrix"?

I'm building a demonstration any-dimensional Vector class to show some functional programming in Python.
class Vector():
    def __init__(self, *coords):
        self.coords = coords
    def __add__(this, that):
        return Vector(*[(x+y) for x,y in zip(this.coords, that.coords)])
    #...
While trying to come up with an example of a static @classmethod for this class, I decided it'd be nice to have a class method giving me an n-dimensional base of vectors for any n. That is:
>>> Vector.get_base(dimensions = 2)
[Vector(1,0), Vector(0,1)]
>>> Vector.get_base(3)
[Vector(1,0,0), Vector(0,1,0), Vector(0,0,1)]
>>> Vector.get_base(1)
[Vector(1)]
However, I'm having a huge brain fart and am stumbling on the problem of how to "properly" generate those lists.
What I can think up right now is a declarative solution:
def get_base(dimensions):
    arrays = []
    zeros = [0] * dimensions
    for i in range(dimensions):
        item = list(zeros)  # copy, so each vector gets its own list
        item[i] = 1
        arrays.append(Vector(*item))
    return arrays
There has to be a better way! How can I rewrite this function in a hopefully more concise or Pythonic functional style?
Well, you could do this:
def get_base(dimensions):
    return [Vector(*coords) for coords in
            [[(0,1)[i==j] for i in range(dimensions)] for j in range(dimensions)]]
but I would break it down a little:
def get_base(dimensions):
    arrays = [[(0,1)[i==j] for i in range(dimensions)] for j in range(dimensions)]
    return [Vector(*coords) for coords in arrays]
Which is a little better. Remember, not everything has to be a one-liner.
How about this (note that going through a set means the vectors come out in no particular order):
>>> import itertools
>>> def get_base(dimensions):
...     for points in set(itertools.permutations([0] * (dimensions - 1) + [1], dimensions)):
...         yield Vector(*points)
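Since the question was after a @classmethod, here is one way the pieces could fit together. This is only a sketch: the __repr__ and __eq__ methods are additions of mine so the result can be printed and compared, and int(i == j) is just an alternative spelling of the (0,1)[i==j] trick above.

```python
class Vector:
    def __init__(self, *coords):
        self.coords = coords

    def __repr__(self):
        return f"Vector({', '.join(map(str, self.coords))})"

    def __eq__(self, other):
        return isinstance(other, Vector) and self.coords == other.coords

    @classmethod
    def get_base(cls, dimensions):
        """Return the standard basis: vector j has a 1 at position j, 0 elsewhere."""
        return [cls(*(int(i == j) for i in range(dimensions)))
                for j in range(dimensions)]
```

Using cls instead of the hard-coded class name means a subclass of Vector gets a base made of its own type, and the vectors come out in a fixed order.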
