Retraining Spacy Dependency Model Fails - python-3.6

When I try to retrain the spaCy English model, following what I found in the examples, it fails:
Python 3.6.2 (v3.6.2:5fd33b5926, Jul 16 2017, 20:11:06)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import spacy
>>> from spacy.tokens import Doc
>>> from spacy.gold import GoldParse
>>>
>>> nlp = spacy.load('en')
>>> doc = Doc(nlp.vocab, words=['Who', 'is', 'Shaka', 'Khan', '?'])
>>> gold = GoldParse(doc, [(1, 'nsubj'), (1, 'ROOT'), (3, 'compound'), (1, 'dobj'), (1, 'punct')])
>>> nlp.parser(doc)
>>> gold
<spacy.gold.GoldParse object at 0x114008a58>
>>> nlp.parser.update( doc, gold )
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "spacy/syntax/parser.pyx", line 320, in spacy.syntax.parser.Parser.update (spacy/syntax/parser.cpp:10286)
File "spacy/syntax/arc_eager.pyx", line 357, in spacy.syntax.arc_eager.ArcEager.preprocess_gold (spacy/syntax/arc_eager.cpp:7888)
AttributeError: 'NoneType' object has no attribute 'upper'
What am I doing wrong here? Any help is appreciated.

As Ghislain PUTOIS (#ghpu) answered me in the spaCy support chatroom, the documentation seems to be slightly outdated. See instead https://github.com/explosion/spaCy/blob/master/examples/training/train_parser.py, where the gold heads and deps are now passed separately.
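As a minimal sketch of what "separated" means here, the (head, label) pairs from the question can be split into two parallel lists. The keyword names heads and deps in the commented-out call are an assumption based on the linked example, so check them against the installed spaCy version.

```python
# (head index, dependency label) pairs, exactly as written in the question.
annotations = [(1, 'nsubj'), (1, 'ROOT'), (3, 'compound'), (1, 'dobj'), (1, 'punct')]

# Split the pairs into two parallel lists, one entry per token.
heads = [head for head, _ in annotations]
deps = [label for _, label in annotations]

print(heads)
print(deps)

# Assumed call shape, following the linked train_parser.py example:
# gold = GoldParse(doc, heads=heads, deps=deps)
```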

Related

Calling Julia from Streamlit App using PyJulia

I'm trying to use a Julia function from a Streamlit app. I created a toy example to test the interaction: it simply returns a matrix from a Julia function, with a single parameter specifying the value of the diagonal elements.
I will also note at the outset that both julia_import_method = "api_compiled_false" and julia_import_method = "main_include" work when importing the function in the Spyder IDE (rather than launching the Streamlit app from the command line via streamlit run streamlit_julia_test.py).
My project directory looks like:
├── my_project_directory
│   ├── julia_test.jl
│   └── streamlit_julia_test.py
The julia function is in julia_test.jl and just simply returns a matrix with diagonals specified by the v parameter:
function get_matrix_from_julia(v::Int)
    m=[v 0
       0 v]
    return m
end
The streamlit app is streamlit_julia_test.py and is defined as:
import os
from io import BytesIO
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import streamlit as st
# options:
# main_include
# api_compiled_false
# dont_import_julia
julia_import_method = "api_compiled_false"
if julia_import_method == "main_include":
    # works in Spyder IDE
    import julia
    from julia import Main
    Main.include("julia_test.jl")
elif julia_import_method == "api_compiled_false":
    # works in Spyder IDE
    from julia.api import Julia
    jl = Julia(compiled_modules=False)
    this_dir = os.getcwd()
    julia_test_path = """include(\""""+ this_dir + """/julia_test.jl\"""" +")"""
    print(julia_test_path)
    jl.eval(julia_test_path)
    get_matrix_from_julia = jl.eval("get_matrix_from_julia")
elif julia_import_method == "dont_import_julia":
    print("Not importing ")
else:
    raise ValueError("Not handling this case:" + julia_import_method)
st.header('Using Julia in Streamlit App Example')
st.text("Using Method:" + julia_import_method)
matrix_element = st.selectbox('Set Matrix Diagonal to:', [1,2,3])
matrix_numpy = np.array([[matrix_element,0],[0,matrix_element]])
col1, col2 = st.columns([4,4])
with col1:
    fig, ax = plt.subplots(figsize=(5,5))
    sns.heatmap(matrix_numpy, ax = ax, cmap="Blues",annot=True)
    ax.set_title('Matrix Using Python Numpy')
    buf = BytesIO()
    fig.savefig(buf, format="png")
    st.image(buf)
with col2:
    if julia_import_method == "dont_import_julia":
        matrix_julia = matrix_numpy
    else:
        matrix_julia = get_matrix_from_julia(matrix_element)
    fig, ax = plt.subplots(figsize=(5,5))
    sns.heatmap(matrix_julia, ax = ax, cmap="Blues",annot=True)
    ax.set_title('Matrix from External Julia Script')
    buf = BytesIO()
    fig.savefig(buf, format="png")
    st.image(buf)
If the app were working correctly, it would look like this (which can be reproduced by setting the julia_import_method = "dont_import_julia" on line 13):
Testing
When I try julia_import_method = "main_include", I get the well known error:
julia.core.UnsupportedPythonError: It seems your Julia and PyJulia setup are not supported.
Julia executable:
julia
Python interpreter and libpython used by PyCall.jl:
/Users/myusername/opt/anaconda3/bin/python3
/Users/myusername/opt/anaconda3/lib/libpython3.9.dylib
Python interpreter used to import PyJulia and its libpython.
/Users/myusername/opt/anaconda3/bin/python
/Users/myusername/opt/anaconda3/lib/libpython3.9.dylib
Your Python interpreter "/Users/myusername/opt/anaconda3/bin/python"
is statically linked to libpython. Currently, PyJulia does not fully
support such Python interpreter.
The easiest workaround is to pass `compiled_modules=False` to `Julia`
constructor. To do so, first *reboot* your Python REPL (if this happened
inside an interactive session) and then evaluate:
>>> from julia.api import Julia
>>> jl = Julia(compiled_modules=False)
Another workaround is to run your Python script with `python-jl`
command bundled in PyJulia. You can simply do:
$ python-jl PATH/TO/YOUR/SCRIPT.py
See `python-jl --help` for more information.
For more information, see:
https://pyjulia.readthedocs.io/en/latest/troubleshooting.html
As suggested, when I set julia_import_method = "api_compiled_false", I get a seg fault:
include("my_project_directory/julia_test.jl")
2022-04-03 10:23:13.406 Traceback (most recent call last):
File "/Users/myusername/opt/anaconda3/lib/python3.9/site-packages/streamlit/script_runner.py", line 430, in _run_script
exec(code, module.__dict__)
File "my_project_directory/streamlit_julia_test.py", line 25, in <module>
jl = Julia(compiled_modules=False)
File "/Users/myusername/.local/lib/python3.9/site-packages/julia/core.py", line 502, in __init__
if not self.api.was_initialized: # = jl_is_initialized()
File "/Users/myusername/.local/lib/python3.9/site-packages/julia/libjulia.py", line 114, in __getattr__
return getattr(self.libjulia, name)
File "/Users/myusername/opt/anaconda3/lib/python3.9/ctypes/__init__.py", line 395, in __getattr__
func = self.__getitem__(name)
File "/Users/myusername/opt/anaconda3/lib/python3.9/ctypes/__init__.py", line 400, in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
AttributeError: dlsym(0x21b8e5840, was_initialized): symbol not found
signal (11): Segmentation fault: 11
in expression starting at none:0
Allocations: 35278345 (Pool: 35267101; Big: 11244); GC: 36
zsh: segmentation fault streamlit run streamlit_julia_test.py
I've also tried the alternative recommendation provided in the PyJulia response message regarding the use of:
python-jl my_project_directory/streamlit_julia_test.py
But I get this error when running the python-jl command:
INTEL MKL ERROR: dlopen(/Users/myusername/opt/anaconda3/lib/libmkl_intel_thread.1.dylib, 0x0009): Library not loaded: @rpath/libiomp5.dylib
Referenced from: /Users/myusername/opt/anaconda3/lib/libmkl_intel_thread.1.dylib
Reason: tried: '/Applications/Julia-1.7.app/Contents/Resources/julia/bin/../lib/libiomp5.dylib' (no such file), '/usr/local/lib/libiomp5.dylib' (no such file), '/usr/lib/libiomp5.dylib' (no such file).
Intel MKL FATAL ERROR: Cannot load libmkl_intel_thread.1.dylib.
So I'm stuck, thanks in advance for a modified reproducible example or instructions for the following system specs!
System specs:
Mac OS Monterey 12.2.1 (Chip - Mac M1 Pro)
Python 3.9.7
Julia 1.7.2
PyJulia 0.5.8.dev
Streamlit 1.7.0
yes we can use it streamlit_julia_test.py NumPy for instance
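As a side note, the include expression that the script assembles from triple-quoted fragments can be built more readably. This sketch only constructs the string and does not address the segmentation fault; the helper name build_include_expr is hypothetical.

```python
import os

def build_include_expr(directory, filename="julia_test.jl"):
    """Return a Julia include(...) expression for a file in `directory`."""
    path = os.path.join(directory, filename)
    # Escape backslashes and double quotes for the Julia string literal.
    # A path containing `$` would need further escaping (not handled here).
    escaped = path.replace("\\", "\\\\").replace('"', '\\"')
    return f'include("{escaped}")'

expr = build_include_expr(os.getcwd())
print(expr)
```

The resulting string could then be passed to jl.eval in place of julia_test_path.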

Python3.8 socketpair failed (s.o. cygwin)

Under cygwin console:
$ python3.8
Python 3.8.7 (default, Jan 26 2021, 07:37:32)
[GCC 10.2.0] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket as s
>>> s.socketpair()
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.8/socket.py", line 571, in socketpair
a, b = _socket.socketpair(family, type, proto)
SystemError: <built-in function socketpair> returned NULL without setting an error
but...
$ python2.7
Python 2.7.18 (default, Jan 2 2021, 09:22:32)
[GCC 10.2.0] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket as s
>>> s.socketpair()
(<socket object, fd=3, family=1, type=1, protocol=0>, <socket object, fd=4, family=1, type=1, protocol=0>)
I don't know where to look! :((
THX
The new versions implement a workaround similar to your idea
https://sourceware.org/pipermail/cygwin/2021-February/247684.html
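Until an updated package is installed, a connected pair of sockets can be emulated over the loopback interface. The following is only a sketch of that generic fallback, not the Cygwin patch referenced above; the function name loopback_socketpair is hypothetical, and it assumes TCP connections to 127.0.0.1 are permitted.

```python
import socket

def loopback_socketpair():
    """Return two connected TCP sockets joined over 127.0.0.1."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        listener.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        listener.listen(1)
        client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            client.connect(listener.getsockname())
            server, _ = listener.accept()
        except OSError:
            client.close()
            raise
    finally:
        listener.close()                  # the listener is no longer needed
    return server, client

a, b = loopback_socketpair()
a.sendall(b"ping")
reply = b.recv(4)
print(reply)
a.close()
b.close()
```

Unlike a real socketpair(), the two ends are AF_INET sockets rather than AF_UNIX ones, so code that checks the address family will see a difference.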

run the same method of a list of instances in pathos.multiprocessing

I am working on a traveling salesman problem. Given that all agents traverse the same graph to find their own paths separately, I am trying to parallelize the agents' path-finding. In each iteration, all agents start from a start node and find their paths; the paths are then collected to find the best path of the current iteration.
I am using pathos.multiprocessing.
The Agent class has a traverse method:
class Agent:
    def find_a_path(self, graph):
        # here is the logic to find a path by traversing the graph
        return found_path
I create a helper function to wrap the method:
def do_agent_find_a_path(agent, graph):
    return agent.find_a_path(graph)
Then I create a pool and call amap, passing the helper function, a list of agent instances and the same graph:
pool = ProcessPool(nodes = 10)
res = pool.amap(do_agent_find_a_path, agents, [graph] * len(agents))
But the processes are created in sequence and it runs very slowly. I'd like some guidance on a correct/decent way to leverage pathos in this situation.
thank you!
UPDATE:
I am using pathos 0.2.3 on ubuntu,
Name: pathos
Version: 0.2.3
Summary: parallel graph management and execution in heterogeneous computing
Home-page: https://pypi.org/project/pathos
Author: Mike McKerns
I get the following error with the ThreadPool sample code:
>import pathos
>pathos.pools.ThreadPool().iumap(lambda x:x*x, [1,2,3,4])
Traceback (most recent call last):
File "/opt/anaconda/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-5-f8f5e7774646>", line 1, in <module>
pathos.pools.ThreadPool().iumap(lambda x:x*x, [1,2,3,4])
AttributeError: 'ThreadPool' object has no attribute 'iumap'
I'm the pathos author. I'm not sure how long your method takes to run, but from your comments, I'm going to assume not very long. If the method is "fast", I'd suggest you use a ThreadPool instead. Also, if you don't need to preserve the order of the results, the fastest map is typically uimap (unordered, iterative map).
>>> class Agent:
...     def basepath(self, dirname):
...         import os
...         return os.path.basename(dirname)
...     def slowpath(self, dirname):
...         import time
...         time.sleep(.2)
...         return self.basepath(dirname)
...
>>> a = Agent()
>>> import pathos.pools as pp
>>> dirs = ['/tmp/foo', '/var/path/bar', '/root/bin/bash', '/tmp/foo/bar']
>>> import time
>>> p = pp.ProcessPool()
>>> go = time.time(); tuple(p.uimap(a.basepath, dirs)); print(time.time()-go)
('foo', 'bar', 'bash', 'bar')
0.006751060485839844
>>> p.close(); p.join(); p.clear()
>>> t = pp.ThreadPool(4)
>>> go = time.time(); tuple(t.uimap(a.basepath, dirs)); print(time.time()-go)
('foo', 'bar', 'bash', 'bar')
0.0005156993865966797
>>> t.close(); t.join(); t.clear()
and, just to compare against something that takes a bit longer...
>>> t = pp.ThreadPool(4)
>>> go = time.time(); tuple(t.uimap(a.slowpath, dirs)); print(time.time()-go)
('bar', 'bash', 'bar', 'foo')
0.2055649757385254
>>> t.close(); t.join(); t.clear()
>>> p = pp.ProcessPool()
>>> go = time.time(); tuple(p.uimap(a.slowpath, dirs)); print(time.time()-go)
('foo', 'bar', 'bash', 'bar')
0.2084510326385498
>>> p.close(); p.join(); p.clear()
>>>
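The same pattern, running one method over a list of instances in a thread pool and collecting results in completion order, can be sketched with the standard library alone. Note that this swaps pathos for concurrent.futures and is not the pathos API; the Agent body, the graph, and the name find_all_paths are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

class Agent:
    """Hypothetical stand-in for the Agent class in the question."""
    def __init__(self, name):
        self.name = name

    def find_a_path(self, graph):
        # Placeholder logic: "visit" the nodes in sorted order.
        return (self.name, sorted(graph))

def find_all_paths(agents, graph, workers=4):
    """Run find_a_path for every agent; results arrive in completion order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(agent.find_a_path, graph) for agent in agents]
        return [future.result() for future in as_completed(futures)]

graph = {"a": ["b"], "b": ["c"], "c": ["a"]}
agents = [Agent(i) for i in range(5)]
paths = find_all_paths(agents, graph)
print(paths)
```

Because results come back unordered, each result carries the agent's name so the best path can still be attributed to its agent.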

Module 'rpy2.robjects.pandas2ri' has no attribute 'ri2py'

I'm trying to convert an R data frame to a Python pandas DataFrame.
I use the following code:
from rpy2.robjects import pandas2ri
pandas2ri.activate()
r_dataframe = r_function(my_dataframe['Numbers'])
print(r_dataframe)
python_dataframe = pandas2ri.ri2py(r_dataframe)
The above code works well in Jupyter Notebook (Anaconda). But if I run this code through a my_program.py file through the terminal, I get an error:
:~$ python3 my_program.py
Traceback (most recent call last):
File "my_program.py", line 223, in <module>
python_dataframe = pandas2ri.ri2py(r_dataframe)
AttributeError: module 'rpy2.robjects.pandas2ri' has no attribute 'ri2py'
The line print(r_dataframe) shows the correct result in the terminal.
If I try to use code print(dir(pandas2ri)) in Jupyter Notebook I get ('ri2py'):
['DataFrame', 'FactorVector', 'FloatSexpVector', 'INTSXP', 'ISOdatetime', 'IntSexpVector', 'IntVector', 'ListSexpVector', 'ListVector', 'OrderedDict', 'POSIXct', 'PandasDataFrame', 'PandasIndex', 'PandasSeries', 'SexpVector', 'StrSexpVector', 'StrVector', 'Vector', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'activate', 'as_vector', 'conversion', 'converter', 'datetime', 'deactivate', 'dt_O_type', 'dt_datetime64ns_type', 'get_timezone', 'numpy', 'numpy2ri', 'original_converter', 'os', 'pandas', 'py2ri', 'py2ri_categoryseries', 'py2ri_pandasdataframe', 'py2ri_pandasindex', 'py2ri_pandasseries', 'py2ro', 'pytz', 'recarray', 'ri2py', 'ri2py_dataframe', 'ri2py_floatvector', 'ri2py_intvector', 'ri2py_listvector', 'ri2py_vector', 'ri2ro', 'rinterface', 'ro', 'warnings']
And if I try to use the same code print(dir(pandas2ri)) in Terminal I get ('rpy2py'):
['DataFrame', 'FactorVector', 'FloatSexpVector', 'ISOdatetime', 'IntSexpVector', 'IntVector', 'ListSexpVector', 'OrderedDict', 'POSIXct', 'PandasDataFrame', 'PandasIndex', 'PandasSeries', 'Sexp', 'SexpVector', 'StrSexpVector', 'StrVector', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'activate', 'as_vector', 'conversion', 'converter', 'datetime', 'deactivate', 'default_timezone', 'dt_O_type', 'get_timezone', 'is_datetime64_any_dtype', 'numpy', 'numpy2ri', 'original_converter', 'pandas', 'py2rpy', 'py2rpy_categoryseries', 'py2rpy_pandasdataframe', 'py2rpy_pandasindex', 'py2rpy_pandasseries', 'pytz', 'ri2py_vector', 'rinterface', 'rpy2py', 'rpy2py_dataframe', 'rpy2py_floatvector', 'rpy2py_intvector', 'rpy2py_listvector', 'tzlocal', 'warnings']
It turns out the developers have changed the name of the functions.
Since no one bothered to write down the way to do it with newer versions of rpy2:
Conversion is done using a localconverter block which automatically converts from pandas dataframe to r dataframe and back.
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter
pd_df = pd.DataFrame({'int_values': [1,2,3],
                      'str_values': ['abc', 'def', 'ghi']})
base = importr('base')
with localconverter(ro.default_converter + pandas2ri.converter):
    df_summary = base.summary(pd_df)
You are likely using documentation/code written for a different version of rpy2 than what you have installed.
If using the latest release, consider checking the documentation for it:
https://rpy2.github.io/doc/v3.0.x/html/generated_rst/pandas.html
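If the same script has to run under both installations, one option is to look the conversion function up by name, trying the newer rpy2py first and falling back to ri2py. This is a sketch under that assumption only: the helper r_to_pandas_converter is hypothetical, and the two SimpleNamespace objects merely mimic the dir() listings from the question so the example runs without rpy2.

```python
from types import SimpleNamespace

def r_to_pandas_converter(module):
    """Return the R-to-pandas conversion function exposed by `module`."""
    for name in ("rpy2py", "ri2py"):  # newer name first, then the older one
        func = getattr(module, name, None)
        if func is not None:
            return func
    raise AttributeError("no known R-to-pandas conversion function found")

# Stand-in objects mimicking the two dir() listings from the question.
old_style = SimpleNamespace(ri2py=lambda obj: ("old", obj))
new_style = SimpleNamespace(rpy2py=lambda obj: ("new", obj))

print(r_to_pandas_converter(old_style)("df"))
print(r_to_pandas_converter(new_style)("df"))
```

With rpy2 installed, the intended use would be r_to_pandas_converter(pandas2ri)(r_dataframe); whether the module-level function behaves identically across releases is not confirmed here, so the localconverter approach from the documentation remains the safer route.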
For anyone having issues with localconverter, here is an alternative way to convert a df of type rpy2.robjects.vectors.ListVector to a pandas DataFrame. This solution drops the column names.
pd.DataFrame(np.array(df).reshape((nrows, ncols)))

html5lib: TypeError: __init__() got an unexpected keyword argument 'encoding'

I'm trying to install html5lib. At first I tried to install the latest version (8 or 9 nines), but it conflicted with my BeautifulSoup, so I decided to try an older version (0.9999999, seven nines). I installed it, but when I try to use it:
>>> with urlopen("http://example.com/") as f:
        document = html5lib.parse(f, encoding=f.info().get_content_charset())
I get an error:
Traceback (most recent call last):
File "<pyshell#11>", line 2, in <module>
document = html5lib.parse(f, encoding=f.info().get_content_charset())
File "C:\Python\Python35-32\lib\site-packages\html5lib\html5parser.py", line 35, in parse
return p.parse(doc, **kwargs)
File "C:\Python\Python35-32\lib\site-packages\html5lib\html5parser.py", line 235, in parse
self._parse(stream, False, None, *args, **kwargs)
File "C:\Python\Python35-32\lib\site-packages\html5lib\html5parser.py", line 85, in _parse
self.tokenizer = _tokenizer.HTMLTokenizer(stream, parser=self, **kwargs)
File "C:\Python\Python35-32\lib\site-packages\html5lib\_tokenizer.py", line 36, in __init__
self.stream = HTMLInputStream(stream, **kwargs)
File "C:\Python\Python35-32\lib\site-packages\html5lib\_inputstream.py", line 151, in HTMLInputStream
return HTMLBinaryInputStream(source, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'encoding'
What is wrong and what should I do?
It looks like something was broken in the latest versions of html5lib with regard to bs4: html5lib.treebuilders._base is no longer there. Using bs4 4.4.1, the latest compatible version seems to be the one with 7 nines; once you install it as below, it works fine:
pip3 install -U html5lib=="0.9999999"
Tested using bs4 4.4.1:
In [1]: import bs4
In [2]: bs4.__version__
Out[2]: '4.4.1'
In [3]: import html5lib
In [4]: html5lib.__version__
Out[4]: '0.9999999'
In [5]: from urllib.request import urlopen
In [6]: with urlopen("http://example.com/") as f:
...:     document = html5lib.parse(f, encoding=f.info().get_content_charset())
...:
In [7]:
You can see the change in the commit "Rename treebuilders._base to .base to reflect public status", where the name was changed.
The error you see occurs because you are still using the newest version: in html5lib/_inputstream.py, HTMLBinaryInputStream has no encoding argument:
class HTMLBinaryInputStream(HTMLUnicodeInputStream):
    """Provides a unicode stream of characters to the HTMLTokenizer.
    This class takes care of character encoding and removing or replacing
    incorrect byte-sequences and also provides column and line tracking.
    """
    def __init__(self, source, override_encoding=None, transport_encoding=None,
                 same_origin_parent_encoding=None, likely_encoding=None,
                 default_encoding="windows-1252", useChardet=True):
Setting override_encoding=f.info().get_content_charset() should do the trick.
Also upgrading to the latest version of bs4 works fine with the latest version of html5lib:
In [16]: bs4.__version__
Out[16]: '4.5.1'
In [17]: html5lib.__version__
Out[17]: '0.999999999'
In [18]: with urlopen("http://example.com/") as f:
....:     document = html5lib.parse(f, override_encoding=f.info().get_content_charset())
....:
In [19]:
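For code that must run against either html5lib release, the keyword could be chosen by inspecting the stream constructor's signature. This is a sketch of that idea only: encoding_kwargs is a hypothetical helper, OldStream and NewStream are stand-ins for the two constructor shapes discussed above, and the assumption that the older release accepts encoding comes from the question's working session rather than from checking the library source.

```python
import inspect

def encoding_kwargs(stream_cls, charset):
    """Pick the encoding keyword that stream_cls.__init__ accepts."""
    params = inspect.signature(stream_cls.__init__).parameters
    for name in ("override_encoding", "encoding"):
        if name in params:
            return {name: charset}
    return {}  # neither keyword is supported; pass nothing

# Hypothetical stand-ins for the two constructor signatures.
class OldStream:
    def __init__(self, source, encoding=None):
        self.encoding = encoding

class NewStream:
    def __init__(self, source, override_encoding=None, transport_encoding=None):
        self.encoding = override_encoding

print(encoding_kwargs(OldStream, "utf-8"))
print(encoding_kwargs(NewStream, "utf-8"))
```

The returned dictionary would then be unpacked into the parse call, for example html5lib.parse(f, **kwargs), with the real stream class from the installed html5lib in place of the stand-ins.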
