errors when hashing data in Python - jupyter-notebook

I'm hoping somebody could please assist. When running the following code (below) in Jupyter notebook, I get an error
dummydata["ID_NUMBER"] = dummydata["ID_NUMBER"].to_string()
def clean_dummydata(dummydata,cols):
for col_name in cols:
keys = {cats: i for i,cats in str(hash(dummydata[col_name].unique()))}
dummydata[col_name] = dummydata[col_name].apply(lambda x: keys[x])
return dummydata
cols = ['ID_NUMBER']
dummydata = clean_dummydata(dummydata,cols)
dummydata.to_csv('anon_dummydata.csv')
This is the error:
TypeError Traceback (most recent call
last) ~\AppData\Local\Temp/ipykernel_3140/2100616149.py in
7
8 cols = ['ID_NUMBER']
----> 9 dummydata = clean_dummydata(dummydata,cols)
10 dummydata.to_csv('anon_dummydata.csv')
~\AppData\Local\Temp/ipykernel_3140/2100616149.py in
clean_dummydata(dummydata, cols)
2 def clean_dummydata(dummydata,cols):
3 for col_name in cols:
----> 4 keys = {cats: i for i,cats in str(hash(dummydata[col_name].unique()))}
5 dummydata[col_name] = dummydata[col_name].apply(lambda x: keys[x])
6 return dummydata
TypeError: unhashable type: 'numpy.ndarray'

Mutable types like NumPy arrays and lists are not hashable because they could change and break the lookup based on the hashing algorithm.
So, you can use hash only with immutable datatypes like a tuple. So, you can convert your numpy array into a tuple and then hash it, for eg:
import numpy as np
z = np.array(['one', 'two', 'three'])
tuple_z = tuple(z)
hash_z = hash(z)
and it should run perfectly.

Related

Associate letters with its value and sort output in python

Please help on this. I have an input like this:
a = """A|9578
C|547
A|459
B|612
D|53
B|6345
A|957498
C|2910"""
I want to print in sorted way the numbers related with each letter like this:
A_0|459
A_1|957498
A_2|9578
C_0|2910
C_1|547
B_0|612
B_1|6345
D_0|53
So far I was able to store in array b the letters and numbers, but I'm stuck when I try to create dictionary-like array to join a single letter with its values, I get this error.
b = [i.split('|') for i in a.split('\n')]
c = dict()
d = [c[i].append(j) for i,j in b]
>>> d = [c[i].append(j) for i,j in b]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <listcomp>
TypeError: list indices must be integers or slices, not str
I'm working on python 3.6 just in case. Thanks in advance.
We'll split the string into pairs, sort those pairs, then use groupby and enumerate to come up with the indices.
from itertools import groupby
from operator import itemgetter
def process(a):
pairs = sorted(x.split('|') for x in a.split())
groups = groupby(pairs, key=itemgetter(0))
for _, g in groups:
for index, (letter, number) in enumerate(g):
yield '{}_{}|{}'.format(letter, index, number)
for i in process(a): print(i)
gives us
A_0|459
A_1|957498
A_2|9578
B_0|612
B_1|6345
C_0|2910
C_1|547
D_0|53

Add values from dictionary

I have created a dictionary, and I want to add the values from each item in the dictionary. Here is what I have so far:
import sys
import re
import collections
import numpy as np
dict = {'x':['x',0], 'y':['y',0], 'z':['z',0]}
I am then using an input file to count the instances of x, y, and z and add them to the dictionary:
with open(input_file, 'r', encoding='utf-8') as f:
for line in f:
words = line.split()
count_lines += 1
num_lines += 1
num_words += len(words)
title = line
for key in dict:
title = line
if re.search(key, title):
trusted[key][1]+=1
title = re.sub(key,dict[key][0],title)
while re.search(key, title):
dict[key][1]+=1
title = re.sub(key,dict[key][0],title)
dict_values = dict.values()
output_file.write('Dict:', sum(dict_values[1:-1]))
Then my dictionary would be, for example: dict = {'x':['x',6], 'y':['y',10], 'z':['z',8]}, and I want to add 6, 10, and 8 together.
I have tried this with and without the string split, I have tried assigning the sum equation to a variable and writing the variable, etc. I am continuously getting "TypeError: unsupported operand type(s) for +: 'int' and 'list'" and "TypeError: 'dict_values' object is not subscriptable" error messages.

Tensorflow : how to insert custom input to existing graph?

I have downloaded a tensorflow GraphDef that implements a VGG16 ConvNet, which I use doing this :
Pl['images'] = tf.placeholder(tf.float32,
[None, 448, 448, 3],
name="images") #batch x width x height x channels
with open("tensorflow-vgg16/vgg16.tfmodel", mode='rb') as f:
fileContent = f.read()
graph_def = tf.GraphDef()
graph_def.ParseFromString(fileContent)
tf.import_graph_def(graph_def, input_map={"images": Pl['images']})
Besides, I have image features that are homogeneous to the output of the "import/pool5/".
How can I tell my graph that don't want to use his input "images", but the tensor "import/pool5/" as input ?
Thank's !
EDIT
OK I realize I haven't been very clear. Here is the situation:
I am trying to use this implementation of ROI pooling, using a pre-trained VGG16, which I have in the GraphDef format. So here is what I do:
First of all, I load the model:
tf.reset_default_graph()
with open("tensorflow-vgg16/vgg16.tfmodel",
mode='rb') as f:
fileContent = f.read()
graph_def = tf.GraphDef()
graph_def.ParseFromString(fileContent)
graph = tf.get_default_graph()
Then, I create my placeholders
images = tf.placeholder(tf.float32,
[None, 448, 448, 3],
name="images") #batch x width x height x channels
boxes = tf.placeholder(tf.float32,
[None,5], # 5 = [batch_id,x1,y1,x2,y2]
name = "boxes")
And I define the output of the first part of the graph to be conv5_3/Relu
tf.import_graph_def(graph_def,
input_map={'images':images})
out_tensor = graph.get_tensor_by_name("import/conv5_3/Relu:0")
So, out_tensor is of shape [None,14,14,512]
Then, I do the ROI pooling:
[out_pool,argmax] = module.roi_pool(out_tensor,
boxes,
7,7,1.0/1)
With out_pool.shape = N_Boxes_in_batch x 7 x 7 x 512, which is homogeneous to pool5. I would then like to feed out_pool as an input to the op that comes just after pool5, so it would look like
tf.import_graph_def(graph.as_graph_def(),
input_map={'import/pool5':out_pool})
But it doesn't work, I have this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-89-527398d7344b> in <module>()
5
6 tf.import_graph_def(graph.as_graph_def(),
----> 7 input_map={'import/pool5':out_pool})
8
9 final_out = graph.get_tensor_by_name("import/Relu_1:0")
/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/importer.py in import_graph_def(graph_def, input_map, return_elements, name, op_dict)
333 # NOTE(mrry): If the graph contains a cycle, the full shape information
334 # may not be available for this op's inputs.
--> 335 ops.set_shapes_for_outputs(op)
336
337 # Apply device functions for this op.
/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/ops.py in set_shapes_for_outputs(op)
1610 raise RuntimeError("No shape function registered for standard op: %s"
1611 % op.type)
-> 1612 shapes = shape_func(op)
1613 if len(op.outputs) != len(shapes):
1614 raise RuntimeError(
/home/hbenyounes/vqa/roi_pooling_op_grad.py in _roi_pool_shape(op)
13 channels = dims_data[3]
14 print(op.inputs[1].name, op.inputs[1].get_shape())
---> 15 dims_rois = op.inputs[1].get_shape().as_list()
16 num_rois = dims_rois[0]
17
/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/tensor_shape.py in as_list(self)
745 A list of integers or None for each dimension.
746 """
--> 747 return [dim.value for dim in self._dims]
748
749 def as_proto(self):
TypeError: 'NoneType' object is not iterable
Any clue ?
It is usually very convenient to use tf.train.export_meta_graph to store the whole MetaGraph. Then, upon restoring you can use tf.train.import_meta_graph, because it turns out that it passes all additional arguments to the underlying import_scoped_meta_graph which has the input_map argument and utilizes it when it gets to it's own invocation of import_graph_def.
It is not documented, and took me waaaay toooo much time to find it, but it works!
What I would do is something along those lines:
-First retrieve the names of the tensors representing the weights and biases of the 3 fully connected layers coming after pool5 in VGG16.
To do that I would inspect [n.name for n in graph.as_graph_def().node].
(They probably look something like import/locali/weight:0, import/locali/bias:0, etc.)
-Put them in a python list:
weights_names=["import/local1/weight:0" ,"import/local2/weight:0" ,"import/local3/weight:0"]
biases_names=["import/local1/bias:0" ,"import/local2/bias:0" ,"import/local3/bias:0"]
-Define a function that look something like:
def pool5_tofcX(input_tensor, layer_number=3):
flatten=tf.reshape(input_tensor,(-1,7*7*512))
tmp=flatten
for i in xrange(layer_number):
tmp=tf.matmul(tmp, graph.get_tensor_by_name(weights_name[i]))
tmp=tf.nn.bias_add(tmp, graph.get_tensor_by_name(biases_name[i]))
tmp=tf.nn.relu(tmp)
return tmp
Then define the tensor using the function:
wanted_output=pool5_tofcX(out_pool)
Then you are done !
Jonan Georgiev provided an excellent answer here. The same approach was also described with little fanfare at the end of this git issue: https://github.com/tensorflow/tensorflow/issues/3389
Below is a copy/paste runnable example of using this approach to switch out a placeholder for a tf.data.Dataset get_next tensor.
import tensorflow as tf
my_placeholder = tf.placeholder(dtype=tf.float32, shape=1, name='my_placeholder')
my_op = tf.square(my_placeholder, name='my_op')
# Save the graph to memory
graph_def = tf.get_default_graph().as_graph_def()
print('----- my_op before any remapping -----')
print([n for n in graph_def.node if n.name == 'my_op'])
tf.reset_default_graph()
ds = tf.data.Dataset.from_tensors(1.0)
next_tensor = tf.data.make_one_shot_iterator(ds).get_next(name='my_next_tensor')
# Restore the graph with a custom input mapping
tf.graph_util.import_graph_def(graph_def, input_map={'my_placeholder': next_tensor}, name='')
print('----- my_op after remapping -----')
print([n for n in tf.get_default_graph().as_graph_def().node if n.name == 'my_op'])
Output, where we can clearly see that the input to the square operation has changed.
----- my_op before any remapping -----
[name: "my_op"
op: "Square"
input: "my_placeholder"
attr {
key: "T"
value {
type: DT_FLOAT
}
}
]
----- my_op after remapping -----
[name: "my_op"
op: "Square"
input: "my_next_tensor"
attr {
key: "T"
value {
type: DT_FLOAT
}
}
]

rpy2 to subset RS4 object (expressionSet)

I'm building an ExpressionSet class using rpy2, following the relevant tutorial as a guide. One of the most common things I do with the Eset object is subsetting, which in native R is as straightforward as
eset2<-eset1[1:10,1:5] # first ten features, first five samples
which returns a new ExpressionSet object with subsets of both the expression and phenotype data, using the given indices. Rpy2's RS4 object doesn't seem to allow direct subsetting, or have rx/rx2 attributes unlike e.g. RS3 vectors. I tried, with ~50% success, adding a '_subset' function (below) that creates subsets of these two datasets separately and assigns them back to Eset, but is there a more straightforward way that I'm missing?
from rpy2 import (robjects, rinterface)
from rpy2.robjects import (r, pandas2ri, Formula)
from rpy2.robjects.packages import (importr,)
from rpy2.robjects.methods import (RS4,)
class ExpressionSet(RS4):
# funcs to get the attributes
def _assay_get(self): # returns an environment, use ['exprs'] key to access
return self.slots["assayData"]
def _pdata_get(self): # returns an RS4 object, use .slots("data") to access
return self.slots["phenoData"]
def _feats_get(self): # returns an RS4 object, use .slots("data") to access
return self.slots["featureData"]
def _annot_get(self): # slots returns a tuple, just pick 1st (only) element
return self.slots["annotation"][0]
def _class_get(self): # slots returns a tuple, just pick 1st (only) element
return self.slots["class"][0]
# funcs to set the attributes
def _assay_set(self, value):
self.slots["assayData"] = value
def _pdata_set(self, value):
self.slots["phenoData"] = value
def _feats_set(self,value):
self.slots["featureData"] = value
def _annot_set(self, value):
self.slots["annotation"] = value
def _class_set(self, value):
self.slots["class"] = value
# funcs to work with the above to get/set the data
def _exprs_get(self):
return self.assay["exprs"]
def _pheno_get(self):
pdata = self.pData
return pdata.slots["data"]
def _exprs_set(self, value):
assay = self.assay
assay["exprs"] = value
def _pheno_set(self, value):
pdata = self.pData
pdata.slots["data"] = value
assay = property(_assay_get, _assay_set, None, "R attribute 'assayData'")
pData = property(_pdata_get, _pdata_set, None, "R attribute 'phenoData'")
fData = property(_feats_get, _feats_set, None, "R attribute 'featureData'")
annot = property(_annot_get, _annot_set, None, "R attribute 'annotation'")
exprs = property(_exprs_get, _exprs_set, None, "R attribute 'exprs'")
pheno = property(_pheno_get, _pheno_set, None, "R attribute 'pheno")
def _subset(self, features=None, samples=None):
features = features if features else self.exprs.rownames
samples = samples if samples else self.exprs.colnames
fx = robjects.BoolVector([f in features for f in self.exprs.rownames])
sx = robjects.BoolVector([s in samples for s in self.exprs.colnames])
self.pheno = self.pheno.rx(sx, self.pheno.colnames)
self.exprs = self.exprs.rx(fx,sx) # can't assign back to exprs this way
When doing
eset2<-eset1[1:10,1:5]
in R, the R S4 method "[" with the signature ("ExpressionSet") is fetched and run using the parameter values you provided.
The documentation is suggesting the use of getmethod (see http://rpy2.readthedocs.org/en/version_2.7.x/generated_rst/s4class.html#methods ) to facilitate the task of fetching the relevant S4 method, but its behaviour seems to have changed after the documentation was written (resolution of the dispatch through inheritance is no longer done).
The following should do it though:
from rpy2.robjects.packages import importr
methods = importr('methods')
r_subset_expressionset = methods.selectMethod("[", "ExpressionSet")
with thanks to #lgautier's answer, here's a snippet of my above code, modified to allow subsetting of the RS4 object:
from multipledispatch import dispatch
#dispatch(RS4)
def eset_subset(eset, features=None, samples=None):
"""
subset an RS4 eset object
"""
features = features if features else eset.exprs.rownames
samples = samples if samples else eset.exprs.colnames
fx = robjects.BoolVector([f in features for f in eset.exprs.rownames])
sx = robjects.BoolVector([s in samples for s in eset.exprs.colnames])
esub=methods.selectMethod("[", signature="ExpressionSet")(eset, fx,sx)
return esub

How do you use matrices in Nimrod?

I found this project on GitHub; it was the only search term returned for "nimrod matrix". I took the bare bones of it and changed it a little bit so that it compiled without errors, and then I added the last two lines to build a simple matrix, and then output a value, but the "getter" function isn't working for some reason. I adapted the instructions for adding properties found here, but something isn't right.
Here is my code so far. I'd like to use the GNU Scientific Library from within Nimrod, and I figured that this was the first logical step.
type
TMatrix*[T] = object
transposed: bool
dataRows: int
dataCols: int
data: seq[T]
proc index[T](x: TMatrix[T], r,c: int): int {.inline.} =
if r<0 or r>(x.rows()-1):
raise newException(EInvalidIndex, "matrix index out of range")
if c<0 or c>(x.cols()-1):
raise newException(EInvalidIndex, "matrix index out of range")
result = if x.transposed: c*x.dataCols+r else: r*x.dataCols+c
proc rows*[T](x: TMatrix[T]): int {.inline.} =
## Returns the number of rows in the matrix `x`.
result = if x.transposed: x.dataCols else: x.dataRows
proc cols*[T](x: TMatrix[T]): int {.inline.} =
## Returns the number of columns in the matrix `x`.
result = if x.transposed: x.dataRows else: x.dataCols
proc matrix*[T](rows, cols: int, d: openarray[T]): TMatrix[T] =
## Constructor. Initializes the matrix by allocating memory
## for the data and setting the number of rows and columns
## and sets the data to the values specified in `d`.
result.dataRows = rows
result.dataCols = cols
newSeq(result.data, rows*cols)
if len(d)>0:
if len(d)<(rows*cols):
raise newException(EInvalidIndex, "insufficient data supplied in matrix constructor")
for i in countup(0,rows*cols-1):
result.data[i] = d[i]
proc `[][]`*[T](x: TMatrix[T], r,c: int): T =
## Element access. Returns the element at row `r` column `c`.
result = x.data[x.index(r,c)]
proc `[][]=`*[T](x: var TMatrix[T], r,c: int, a: T) =
## Sets the value of the element at row `r` column `c` to
## the value supplied in `a`.
x.data[x.index(r,c)] = a
var m = matrix( 2, 2, [1,2,3,4] )
echo( $m[0][0] )
This is the error I get:
c:\program files (x86)\nimrod\config\nimrod.cfg(36, 11) Hint: added path: 'C:\Users\H127\.babel\libs\' [Path]
Hint: used config file 'C:\Program Files (x86)\Nimrod\config\nimrod.cfg' [Conf]
Hint: system [Processing]
Hint: mat [Processing]
mat.nim(48, 9) Error: type mismatch: got (TMatrix[int], int literal(0))
but expected one of:
system.[](a: array[Idx, T], x: TSlice[Idx]): seq[T]
system.[](a: array[Idx, T], x: TSlice[int]): seq[T]
system.[](s: string, x: TSlice[int]): string
system.[](s: seq[T], x: TSlice[int]): seq[T]
Thanks you guys!
I'd like to first point out that the matrix library you refer to is three years old. For a programming language in development that's a lot of time due to changes, and it doesn't compile any more with the current Nimrod git version:
$ nimrod c matrix
...
private/tmp/n/matrix/matrix.nim(97, 8) Error: ']' expected
It fails on the double array accessor, which seems to have changed syntax. I guess your attempt to create a double [][] accessor is problematic, it could be ambiguous: are you accessing the double array accessor of the object or are you accessing the nested array returned by the first brackets? I had to change the proc to the following:
proc `[]`*[T](x: TMatrix[T], r,c: int): T =
After that change you also need to change the way to access the matrix. Here's what I got:
for x in 0 .. <2:
for y in 0 .. <2:
echo "x: ", x, " y: ", y, " = ", m[x,y]
Basically, instead of specifying two bracket accesses you pass all the parameters inside a single bracket. That code generates:
x: 0 y: 0 = 1
x: 0 y: 1 = 2
x: 1 y: 0 = 3
x: 1 y: 1 = 4
With regards to finding software for Nimrod, I would like to recommend you using Nimble, Nimrod's package manager. Once you have it installed you can search available and maintained packages. The command nimble search math shows two potential packages: linagl and extmath. Not sure if they are what you are looking for, but at least they seem more fresh.

Resources