Networkx read from pandas creates 2 nodes for same source - graph

I'm trying to read in from a pandas dataframe using from_pandas_edgelist with the following code:
input = df_from_string("""
        source, target, size
            abc, xyz, 0.25
            abc, def, 0.35
            xyz, ghi, 0.40
 """)
G = nx.from_pandas_edgelist(input, source='source', target='target', edge_attr='size', create_using=nx.DiGraph())
nx.draw(G, with_labels=True)
plt.show()
The result I want is:
abc -> xyz -> ghi
However, currently xyz comes out as two separate nodes:
abc -> xyz
xyz -> ghi

Don't use names such as input for variables. This shadows Python's built-in input function.
Check whether the dataframe is loaded correctly: df_from_string doesn't seem to be a pandas function, so if you defined it yourself, either post it in your question for debugging, or verify that it does what it should do. Note that stray whitespace in the parsed values is enough to create two distinct nodes, as the check below shows.
Finally, if you post a problem with code, make sure that the code you post reproduces that problem and doesn't throw other errors.
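For example, a quick inspection along these lines (hypothetical, since df_from_string isn't shown in the question; the variable name df stands in for its result):
print(df.columns.tolist())     # look for stray spaces in the column names
print(df['source'].unique())   # e.g. ['abc', '    abc'] would explain duplicate nodes
print(df['target'].unique())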
Meanwhile, if you generate the dataframe in a more standard approach:
import networkx as nx, pandas as pd, matplotlib.pyplot as plt
df = pd.DataFrame(
    {
        'source': ['abc', 'abc', 'xyz'],
        'target': ['xyz', 'def', 'ghi'],
        'size': [0.25, 0.35, 0.4],
    }
)
generating the graph works just fine:
G = nx.from_pandas_edgelist(df, source='source', target='target', edge_attr='size', create_using=nx.DiGraph())
nx.draw(G, with_labels=True)
plt.show()
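To confirm that there is a single xyz node and that the size attribute survived, printing the edge list is a quick sanity check (the output shown is what this construction should produce, not copied from a run):
print(G.edges(data=True))
# [('abc', 'xyz', {'size': 0.25}), ('abc', 'def', {'size': 0.35}), ('xyz', 'ghi', {'size': 0.4})]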

Related

Call Python from Julia

I am new to Julia and I have a Python function that I want to use in Julia. Basically, the function accepts a dataframe (passed as a numpy ndarray), a filter value, and a list of column indices (from the array), and runs a logistic regression using the statsmodels package in Python. So far I have tried this:
using PyCall
py"""
import pandas as pd
import numpy as np
import random
import statsmodels.api as sm
import itertools

def reg_frac(state, ind_vars):
    rows = 2000
    total_rows = rows * 13
    # Build a random example dataset and convert it to a numpy array.
    data = pd.DataFrame({
        'state': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm'] * rows,
        'y_var': [random.uniform(0, 1) for i in range(total_rows)],
        'school': [random.uniform(0, 10) for i in range(total_rows)],
        'church': [random.uniform(11, 20) for i in range(total_rows)]}).to_numpy()
    try:
        # Filter rows by state, pick the requested columns, and fit a logit model.
        X = sm.add_constant(np.array(data[(data[:, 0] == state)][:, ind_vars], dtype=float))
        y = np.array(data[(data[:, 0] == state), 1], dtype=float)
        model = sm.Logit(y, X).fit(cov_type='HC0', disp=False)
        rmse = np.sqrt(np.square(np.subtract(y, model.predict(X))).mean())
    except:
        rmse = np.nan
    return [state, ind_vars, rmse]
"""
reg_frac(state, ind_vars) = (py"reg_frac"(state::Char, ind_vars::Array{Any}))
However, when I run this, the RMSE comes back as NaN, which I don't expect. The call itself seems to work, so I must be missing something.
reg_frac('b', Any[i for i in 2:3])
0.000244 seconds (249 allocations: 7.953 KiB)

3-element Array{Any,1}:
 'b'
 [2, 3]
 NaN
Any help is appreciated.
In the Python code you compare against strs, while in the Julia code you pass a Char; they are not the same.
Python:
>>> type('a')
<class 'str'>
Julia:
julia> typeof('a')
Char
Hence your comparisons do not work.
Your function could look like this:
reg_frac(state, ind_vars) = (py"reg_frac"(state::String, ind_vars::Array{Any}))
And now:
julia> reg_frac("b", Any[i for i in 2:3])
3-element Array{Any,1}:
"b"
[2, 3]
0.2853707270515166
However, I recommend using Vector{Float64}, which PyCall converts in-flight into a numpy vector, rather than Vector{Any}; so it looks like your code could still be improved (depending on what you are actually planning to do).

Dictionary key from pdb file

I'm trying to go through a .pdb file, calculate distance between alpha carbon atoms from different residues on chains A and B of a protein complex, then store the distance in a dictionary, together with the chain identifier and residue number.
For example, if the first alpha carbon ("CA") is found on residue 100 on chain A and the one it binds to is on residue 123 on chain B I would want my dictionary to look something like d={(A, 100):[B, 123, distance_between_atoms]}
from Bio.PDB.PDBParser import PDBParser

parser = PDBParser()
struct = parser.get_structure("1trk", "1trk.pdb")

def getAlphaCarbons(chain):
    vec = []
    for residue in chain:
        for atom in residue:
            if atom.get_name() == "CA":
                vec = vec + [atom.get_vector()]
    return vec

def dist(a, b):
    return (a - b).norm()

chainA = struct[0]['A']
chainB = struct[0]['B']

vecA = getAlphaCarbons(chainA)
vecB = getAlphaCarbons(chainB)

t = {}
model = struct[0]
for model in struct:
    for chain in model:
        for residue in chain:
            for a in vecA:
                for b in vecB:
                    if dist(a, b) <= 8:
                        t = {(chain, residue): [(a, b, dist(a, b))]}
                        break
print t
The programme has been running for ages and I had to abort the run (have I made an infinite loop somewhere?).
I was also trying to do this:
t = {i:[((a, b, dist(a,b)) for a in vecA) for b in vecB if dist(a, b) <= 8] for i in chainA}
print t
But it's printing info about residues in the following format:
<Residue PHE het= resseq=591 icode= >: []
It's not printing anything related to the distance.
Thanks a lot, I hope everything is clear.
I would strongly suggest using C-backed libraries for calculating the distances. I use mdtraj for this sort of thing and it works much quicker than all the for-loops in Biopython.
To get all pairs of alpha-carbons:
import itertools
import mdtraj as md

def get_CA_pairs(pdbfile):
    traj = md.load_pdb(pdbfile)
    topology = traj.topology
    CA_index = [atom.index for atom in topology.atoms if atom.name == 'CA']
    pairs = list(itertools.combinations(CA_index, 2))
    return pairs
Then, for quick computation of distances:
def get_distances(pdbfile):
    # Returns a dict mapping (CA_index_1, CA_index_2) -> CA-CA distance.
    # Note: mdtraj reports distances in nanometres.
    traj = md.load_pdb(pdbfile)
    pairs = get_CA_pairs(pdbfile)
    distances = md.compute_distances(traj, pairs)[0]  # first (and only) frame
    return dict(zip(pairs, distances))
This includes all alpha-carbons. mdtraj's topology also carries chain information, so you can select the CAs from each chain, as sketched below.
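A sketch of the per-chain selection using mdtraj's atom-selection language (the chainid values are assumptions; inspect traj.topology.chains for your file), paired across chains to mirror the A-to-B comparison in the question:

import itertools
import mdtraj as md

traj = md.load_pdb("1trk.pdb")
top = traj.topology

ca_A = top.select("name CA and chainid 0")  # assuming chain A is chainid 0
ca_B = top.select("name CA and chainid 1")  # assuming chain B is chainid 1

pairs = list(itertools.product(ca_A, ca_B))
distances = md.compute_distances(traj, pairs)[0] * 10.0  # nm -> Angstrom

# Keep only contacts within the 8 Angstrom cutoff from the question.
close = {(int(i), int(j)): d for (i, j), d in zip(pairs, distances) if d <= 8.0}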

Tensorflow : how to insert custom input to existing graph?

I have downloaded a tensorflow GraphDef that implements a VGG16 ConvNet, which I use doing this :
Pl['images'] = tf.placeholder(tf.float32,
                              [None, 448, 448, 3],
                              name="images")  # batch x width x height x channels

with open("tensorflow-vgg16/vgg16.tfmodel", mode='rb') as f:
    fileContent = f.read()

graph_def = tf.GraphDef()
graph_def.ParseFromString(fileContent)
tf.import_graph_def(graph_def, input_map={"images": Pl['images']})
Besides, I have image features that are homogeneous to the output of "import/pool5/".
How can I tell my graph that I don't want to use its input "images", but the tensor "import/pool5/" as input?
Thanks!
EDIT
OK I realize I haven't been very clear. Here is the situation:
I am trying to use this implementation of ROI pooling, using a pre-trained VGG16, which I have in the GraphDef format. So here is what I do:
First of all, I load the model:
tf.reset_default_graph()
with open("tensorflow-vgg16/vgg16.tfmodel", mode='rb') as f:
    fileContent = f.read()

graph_def = tf.GraphDef()
graph_def.ParseFromString(fileContent)
graph = tf.get_default_graph()
Then, I create my placeholders:
images = tf.placeholder(tf.float32,
                        [None, 448, 448, 3],
                        name="images")  # batch x width x height x channels
boxes = tf.placeholder(tf.float32,
                       [None, 5],  # 5 = [batch_id, x1, y1, x2, y2]
                       name="boxes")
And I define the output of the first part of the graph to be conv5_3/Relu:
tf.import_graph_def(graph_def,
                    input_map={'images': images})
out_tensor = graph.get_tensor_by_name("import/conv5_3/Relu:0")
So, out_tensor is of shape [None,14,14,512]
Then, I do the ROI pooling:
[out_pool, argmax] = module.roi_pool(out_tensor,
                                     boxes,
                                     7, 7, 1.0/1)
With out_pool.shape = N_Boxes_in_batch x 7 x 7 x 512, which is homogeneous to pool5. I would then like to feed out_pool as input to the op that comes just after pool5, so it would look like:
tf.import_graph_def(graph.as_graph_def(),
                    input_map={'import/pool5': out_pool})
But it doesn't work; I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-89-527398d7344b> in <module>()
5
6 tf.import_graph_def(graph.as_graph_def(),
----> 7 input_map={'import/pool5':out_pool})
8
9 final_out = graph.get_tensor_by_name("import/Relu_1:0")
/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/importer.py in import_graph_def(graph_def, input_map, return_elements, name, op_dict)
333 # NOTE(mrry): If the graph contains a cycle, the full shape information
334 # may not be available for this op's inputs.
--> 335 ops.set_shapes_for_outputs(op)
336
337 # Apply device functions for this op.
/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/ops.py in set_shapes_for_outputs(op)
1610 raise RuntimeError("No shape function registered for standard op: %s"
1611 % op.type)
-> 1612 shapes = shape_func(op)
1613 if len(op.outputs) != len(shapes):
1614 raise RuntimeError(
/home/hbenyounes/vqa/roi_pooling_op_grad.py in _roi_pool_shape(op)
13 channels = dims_data[3]
14 print(op.inputs[1].name, op.inputs[1].get_shape())
---> 15 dims_rois = op.inputs[1].get_shape().as_list()
16 num_rois = dims_rois[0]
17
/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/tensor_shape.py in as_list(self)
745 A list of integers or None for each dimension.
746 """
--> 747 return [dim.value for dim in self._dims]
748
749 def as_proto(self):
TypeError: 'NoneType' object is not iterable
Any clue?
It is usually very convenient to use tf.train.export_meta_graph to store the whole MetaGraph. Then, upon restoring, you can use tf.train.import_meta_graph, because it turns out that it passes all additional arguments to the underlying import_scoped_meta_graph, which has the input_map argument and uses it when it gets to its own invocation of import_graph_def.
It is not documented, and took me way too much time to find, but it works!
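A minimal sketch of that idea (TF 1.x API; the toy graph and tensor names are made up for illustration):

import tensorflow as tf

# Build a toy graph and capture it as a MetaGraphDef.
x = tf.placeholder(tf.float32, shape=[None, 7, 7, 512], name='x')
y = tf.nn.relu(x, name='y')
meta_graph = tf.train.export_meta_graph()

# Re-import it, rewiring the placeholder to a new tensor.
tf.reset_default_graph()
new_input = tf.placeholder(tf.float32, shape=[None, 7, 7, 512], name='new_input')
# import_meta_graph forwards input_map down to import_graph_def.
tf.train.import_meta_graph(meta_graph, input_map={'x:0': new_input})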
What I would do is something along these lines:
-First, retrieve the names of the tensors representing the weights and biases of the 3 fully connected layers coming after pool5 in VGG16.
To do that I would inspect [n.name for n in graph.as_graph_def().node].
(They probably look something like import/local1/weight:0, import/local1/bias:0, etc.)
-Put them in a Python list:
weights_names = ["import/local1/weight:0", "import/local2/weight:0", "import/local3/weight:0"]
biases_names = ["import/local1/bias:0", "import/local2/bias:0", "import/local3/bias:0"]
-Define a function that looks something like this:
def pool5_tofcX(input_tensor, layer_number=3):
    flatten = tf.reshape(input_tensor, (-1, 7*7*512))
    tmp = flatten
    for i in xrange(layer_number):
        # Reuse the pretrained weights and biases from the imported graph.
        tmp = tf.matmul(tmp, graph.get_tensor_by_name(weights_names[i]))
        tmp = tf.nn.bias_add(tmp, graph.get_tensor_by_name(biases_names[i]))
        tmp = tf.nn.relu(tmp)
    return tmp
Then define the tensor using the function:
wanted_output = pool5_tofcX(out_pool)
Then you are done !
Jonan Georgiev provided an excellent answer here. The same approach was also described with little fanfare at the end of this git issue: https://github.com/tensorflow/tensorflow/issues/3389
Below is a copy/paste runnable example of using this approach to switch out a placeholder for a tf.data.Dataset get_next tensor.
import tensorflow as tf
my_placeholder = tf.placeholder(dtype=tf.float32, shape=1, name='my_placeholder')
my_op = tf.square(my_placeholder, name='my_op')
# Save the graph to memory
graph_def = tf.get_default_graph().as_graph_def()
print('----- my_op before any remapping -----')
print([n for n in graph_def.node if n.name == 'my_op'])
tf.reset_default_graph()
ds = tf.data.Dataset.from_tensors(1.0)
next_tensor = tf.data.make_one_shot_iterator(ds).get_next(name='my_next_tensor')
# Restore the graph with a custom input mapping
tf.graph_util.import_graph_def(graph_def, input_map={'my_placeholder': next_tensor}, name='')
print('----- my_op after remapping -----')
print([n for n in tf.get_default_graph().as_graph_def().node if n.name == 'my_op'])
Output, where we can clearly see that the input to the square operation has changed:
----- my_op before any remapping -----
[name: "my_op"
op: "Square"
input: "my_placeholder"
attr {
  key: "T"
  value {
    type: DT_FLOAT
  }
}
]
----- my_op after remapping -----
[name: "my_op"
op: "Square"
input: "my_next_tensor"
attr {
  key: "T"
  value {
    type: DT_FLOAT
  }
}
]
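To watch the remapped op actually execute, something like this could be appended to the script above (a sketch; the printed value is the expected one, since the dataset yields 1.0):

with tf.Session() as sess:
    result = sess.run('my_op:0')  # now fed by the dataset tensor, no feed_dict needed
    print(result)  # expected: 1.0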

Plotting with date times and matplotlib

So, I'm using a function from this website to try to make stick plots of some netCDF4 data. There is an excerpt of my code below. I got my data from here.
The stick_plot(time,u,v) function is EXACTLY as it appears in the website I linked which is why I did not show a copy of that function below.
When I run my code I get the following error. Any idea on how to get around this?
AttributeError: 'numpy.float64' object has no attribute 'toordinal'
The description of time from the netCDF4 file:
<type 'netCDF4._netCDF4.Variable'>
float64 time(time)
long_name: time
standard_name: time
units: days since 1900-01-01 00:00:00Z
axis: T
ancillary_variables: time_quality_flag
data_min: 2447443.375
data_max: 2448005.16667
unlimited dimensions:
current shape = (13484,)
filling off
Here is an excerpt of my code:
imports:
import matplotlib.pyplot as plot
import numpy as np
from netCDF4 import Dataset
import os
from matplotlib.dates import date2num
from datetime import datetime
trying to generate the plots:
path = '/Users/Kyle/Documents/Summer_Research/east_coast_currents/'
currents = [x for x in os.listdir('%s' % (path)) if '.DS' not in x]
for datum in currents:
    working_data = Dataset('%s' % (path + datum), 'r', format='NETCDF4')
    u = working_data.variables['u'][:][:100]
    v = working_data.variables['v'][:][:100]
    time = working_data.variables['time'][:][:100]

    q = stick_plot(time, u, v)

    ref = 1
    qk = plot.quiverkey(q, 0.1, 0.85, ref,
                        "%s N m$^{-2}$" % ref,
                        labelpos='N', coordinates='axes')
    _ = plot.xticks(rotation=70)
Joe Kington answered my question. The netCDF4 file reads the times in as datetime objects, so all I had to do was replace date2num(time) with time, which fixed everything.
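In other words, inside the stick_plot function from the linked site, the change amounts to something like this (a sketch; the function itself is not reproduced here):

# before: dates = date2num(time)  # fails: time already holds datetime objects
dates = time  # the netCDF4 variable is read in as datetimes already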

Plotting labeled intervals in matplotlib/gnuplot

I have a data sample which looks like this:
a 10:15:22 10:15:30 OK
b 10:15:23 10:15:28 OK
c 10:16:00 10:17:10 FAILED
b 10:16:30 10:16:50 OK
What I want is to plot the above data in the following way:
captions ^
|
c | *------*
b | *---* *--*
a | *--*
|___________________
time >
With the color of lines depending on the OK/FAILED status of the data point. Labels (a/b/c/...) may or may not repeat.
As I've gathered from the documentation for gnuplot and matplotlib, this type of plot should be easier to do in the latter, as it's not a standard plot and would require some preprocessing.
The question is:
Is there a standard way to do plots like this in any of the tools?
If not, how should I go about plotting this data (pointers to relevant tools/documentation/functions/examples which do something-kinda-like the thing described here)?
Updated: Now includes handling the data sample and uses mpl dates functionality.
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter, MinuteLocator, SecondLocator
import numpy as np
from StringIO import StringIO
import datetime as dt

### The example data
a = StringIO("""a 10:15:22 10:15:30 OK
b 10:15:23 10:15:28 OK
c 10:16:00 10:17:10 FAILED
b 10:16:30 10:16:50 OK
""")

# Converts str into a datetime object.
conv = lambda s: dt.datetime.strptime(s, '%H:%M:%S')

# Use numpy to read the data in.
data = np.genfromtxt(a, converters={1: conv, 2: conv},
                     names=['caption', 'start', 'stop', 'state'], dtype=None)
cap, start, stop = data['caption'], data['start'], data['stop']

# Check the status, because we paint all lines with the same color together.
is_ok = (data['state'] == 'OK')
not_ok = np.logical_not(is_ok)

# Get unique captions, their indices, and the inverse mapping.
captions, unique_idx, caption_inv = np.unique(cap, 1, 1)

# Build y values from the number of unique captions.
y = (caption_inv + 1) / float(len(captions) + 1)

# Plot function
def timelines(y, xstart, xstop, color='b'):
    """Plot timelines at y from xstart to xstop with given color."""
    plt.hlines(y, xstart, xstop, color, lw=4)
    plt.vlines(xstart, y+0.03, y-0.03, color, lw=2)
    plt.vlines(xstop, y+0.03, y-0.03, color, lw=2)

# Plot OK timelines in black.
timelines(y[is_ok], start[is_ok], stop[is_ok], 'k')
# Plot failed timelines in red.
timelines(y[not_ok], start[not_ok], stop[not_ok], 'r')

# Set up the plot: date axis, ticks, limits, labels.
ax = plt.gca()
ax.xaxis_date()
myFmt = DateFormatter('%H:%M:%S')
ax.xaxis.set_major_formatter(myFmt)
ax.xaxis.set_major_locator(SecondLocator(interval=20))  # used to be SecondLocator(0, interval=20)

# To adjust the xlimits a timedelta is needed.
delta = (stop.max() - start.min()) / 10

plt.yticks(y[unique_idx], captions)
plt.ylim(0, 1)
plt.xlim(start.min() - delta, stop.max() + delta)
plt.xlabel('Time')
plt.show()
The answer from @tillsten no longer works under Python 3, so I made some modifications; I hope it helps.
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter, MinuteLocator, SecondLocator
import numpy as np
import pandas as pd
import datetime as dt
import io

### The example data
a = io.StringIO("""
caption start stop state
a 10:15:22 10:15:30 OK
b 10:15:23 10:15:28 OK
c 10:16:00 10:17:10 FAILED
b 10:16:30 10:16:50 OK""")

data = pd.read_table(a, delimiter=" ")
data["start"] = pd.to_datetime(data["start"])
data["stop"] = pd.to_datetime(data["stop"])
cap, start, stop = data['caption'], data['start'], data['stop']

# Check the status, because we paint all lines with the same color together.
is_ok = (data['state'] == 'OK')
not_ok = np.logical_not(is_ok)

# Get unique captions, their indices, and the inverse mapping.
captions, unique_idx, caption_inv = np.unique(cap, 1, 1)

# Build y values from the number of unique captions.
y = (caption_inv + 1) / float(len(captions) + 1)

# Plot function
def timelines(y, xstart, xstop, color='b'):
    """Plot timelines at y from xstart to xstop with given color."""
    plt.hlines(y, xstart, xstop, color, lw=4)
    plt.vlines(xstart, y+0.03, y-0.03, color, lw=2)
    plt.vlines(xstop, y+0.03, y-0.03, color, lw=2)

# Plot OK timelines in black.
timelines(y[is_ok], start[is_ok], stop[is_ok], 'k')
# Plot failed timelines in red.
timelines(y[not_ok], start[not_ok], stop[not_ok], 'r')

# Set up the plot: date axis, ticks, limits, labels.
ax = plt.gca()
ax.xaxis_date()
myFmt = DateFormatter('%H:%M:%S')
ax.xaxis.set_major_formatter(myFmt)
ax.xaxis.set_major_locator(SecondLocator(interval=20))  # used to be SecondLocator(0, interval=20)

# To adjust the xlimits a timedelta is needed.
delta = (stop.max() - start.min()) / 10

plt.yticks(y[unique_idx], captions)
plt.ylim(0, 1)
plt.xlim(start.min() - delta, stop.max() + delta)
plt.xlabel('Time')
plt.show()
gnuplot 5.2 version, creating a unique key list
The main difference to @CiroSantilli's solution is that a list of unique keys is created automatically from column 1 and the index can be accessed via the defined function Lookup(). The referenced gnuplot demo already uses a list of unique items; in the OP's case, however, there are duplicates.
gnuplot has no facility for creating such a list of unique items out of the box, so you have to implement it yourself.
The code requires gnuplot >=5.2. It is probably difficult to get a solution that works under gnuplot 4.4 (the time of the OP's question) because a few useful features had not been implemented yet: do for-loops, summation, datablocks, ... (a version for gnuplot 4.6 might be possible with some workarounds).
Edit: the earlier version used with vectors and linewidth 20 to plot the bars; however, linewidth 20 also extends in the x-direction, which is not desired here. Therefore, with boxxyerror is used now.
Yes, it can be done shorter and clearer.
Script:
### Time chart with gnuplot (requires gnuplot>=5.2)
reset session
$Data <<EOD
# category start end status
"event 1" 10:15:22 10:15:30 OK
"event 2" 10:15:23 10:15:28 OK
pause 10:16:00 10:17:10 FAILED
"something else" 10:16:30 10:17:50 OK
unknown 10:17:30 10:18:50 OK
"event 3" 10:18:30 10:19:50 FAILED
pause 10:19:30 10:20:50 OK
"event 1" 10:17:30 10:19:20 FAILED
EOD
# create list of unique items
uniqueList = ''
item(col) = ' "'.strcol(col).'"'
isInList(list,col) = strstrt(uniqueList,item(col)) # returns a number >0 if found
addToList(list,col) = list.item(col)
stats $Data u (!isInList(uniqueList,1) ? uniqueList = addToList(uniqueList,1) : 0) nooutput
timeCenter(col1,col2) = (timecolumn(col1,myTimeFmt)+timecolumn(col2,myTimeFmt))*0.5
timeDeltaT(col1,col2) = (timecolumn(col1,myTimeFmt)-timecolumn(col2,myTimeFmt))*0.5
Lookup(col) = int(sum [i=1:words(uniqueList)] (strcol(col) eq word(uniqueList,i)) ? i : 0)
myColor(col) = strcol(col) eq "OK" ? 0x00cc00 : 0xff0000
myBoxWidth = 0.6
myTimeFmt = "%H:%M:%S"
set format x "%M:%S" timedate
set yrange [0.5:words(uniqueList)+0.5]
set grid x,y
plot $Data u (timeCenter(2,3)):(Lookup(1)):(timeDeltaT(2,3)):(0.5*myBoxWidth): \
(myColor(4)):ytic(1) w boxxyerror fill solid 1.0 lc rgb var notitle
### end of script
Result: (image omitted; the script draws one row per unique label, with green bars for OK and red bars for FAILED)
gnuplot with vector solution
Minimized from: http://gnuplot.sourceforge.net/demo_5.2/gantt.html
main.gnuplot
#!/usr/bin/env gnuplot
$DATA << EOD
1 1 5
1 11 13
2 3 10
3 4 8
4 7 13
5 6 15
EOD
set terminal png size 512,512
set output "main.png"
set xrange [-1:]
set yrange [0:]
unset key
set border 3
set xtics nomirror
set ytics nomirror
set style arrow 1 nohead linewidth 3
plot $DATA using 2 : 1 : ($3-$2) : (0.0) with vector as 1, \
$DATA using 2 : 1 : 1 with labels right offset -2
GitHub upstream.
Output: (image omitted; horizontal bars drawn with vectors, labeled at their left ends)
You can remove the labels by removing the second plot command line; I added them because in many applications they make it easier to identify the intervals.
The Gantt example I linked to shows how to handle date formats instead of integers.
Tested in gnuplot 5.2 patchlevel 2, Ubuntu 18.04.
