I'm trying to convert an R dataframe to a Python pandas DataFrame.
I use the following code:
from rpy2.robjects import pandas2ri
pandas2ri.activate()
r_dataframe = r_function(my_dataframe['Numbers'])  # r_function and my_dataframe are defined earlier in my script
print(r_dataframe)
python_dataframe = pandas2ri.ri2py(r_dataframe)
The above code works well in Jupyter Notebook (Anaconda). But if I run the same code in a my_program.py file from the terminal, I get an error:
:~$ python3 my_program.py
Traceback (most recent call last):
  File "my_program.py", line 223, in <module>
    python_dataframe = pandas2ri.ri2py(r_dataframe)
AttributeError: module 'rpy2.robjects.pandas2ri' has no attribute 'ri2py'
The line print(r_dataframe) shows the correct result in the terminal.
If I run print(dir(pandas2ri)) in Jupyter Notebook, the listing includes 'ri2py':
['DataFrame', 'FactorVector', 'FloatSexpVector', 'INTSXP', 'ISOdatetime', 'IntSexpVector', 'IntVector', 'ListSexpVector', 'ListVector', 'OrderedDict', 'POSIXct', 'PandasDataFrame', 'PandasIndex', 'PandasSeries', 'SexpVector', 'StrSexpVector', 'StrVector', 'Vector', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'activate', 'as_vector', 'conversion', 'converter', 'datetime', 'deactivate', 'dt_O_type', 'dt_datetime64ns_type', 'get_timezone', 'numpy', 'numpy2ri', 'original_converter', 'os', 'pandas', 'py2ri', 'py2ri_categoryseries', 'py2ri_pandasdataframe', 'py2ri_pandasindex', 'py2ri_pandasseries', 'py2ro', 'pytz', 'recarray', 'ri2py', 'ri2py_dataframe', 'ri2py_floatvector', 'ri2py_intvector', 'ri2py_listvector', 'ri2py_vector', 'ri2ro', 'rinterface', 'ro', 'warnings']
And if I run the same print(dir(pandas2ri)) in the terminal, the listing includes 'rpy2py' instead:
['DataFrame', 'FactorVector', 'FloatSexpVector', 'ISOdatetime', 'IntSexpVector', 'IntVector', 'ListSexpVector', 'OrderedDict', 'POSIXct', 'PandasDataFrame', 'PandasIndex', 'PandasSeries', 'Sexp', 'SexpVector', 'StrSexpVector', 'StrVector', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'activate', 'as_vector', 'conversion', 'converter', 'datetime', 'deactivate', 'default_timezone', 'dt_O_type', 'get_timezone', 'is_datetime64_any_dtype', 'numpy', 'numpy2ri', 'original_converter', 'pandas', 'py2rpy', 'py2rpy_categoryseries', 'py2rpy_pandasdataframe', 'py2rpy_pandasindex', 'py2rpy_pandasseries', 'pytz', 'ri2py_vector', 'rinterface', 'rpy2py', 'rpy2py_dataframe', 'rpy2py_floatvector', 'rpy2py_intvector', 'rpy2py_listvector', 'tzlocal', 'warnings']
It turns out the developers renamed the functions: as the two dir() listings show, in rpy2 3.x ri2py became rpy2py and py2ri became py2rpy.
Since no one has bothered to write down how to do this with newer versions of rpy2:
Conversion is done using a localconverter block, which automatically converts from a pandas dataframe to an R dataframe and back.
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

pd_df = pd.DataFrame({'int_values': [1, 2, 3],
                      'str_values': ['abc', 'def', 'ghi']})
base = importr('base')

with localconverter(ro.default_converter + pandas2ri.converter):
    df_summary = base.summary(pd_df)
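If you need the converted object itself (the original question was converting an R dataframe to pandas), the same converter can be invoked explicitly inside the block. A minimal sketch assuming rpy2 3.x, where ro.conversion.py2rpy and ro.conversion.rpy2py are the renamed entry points visible in the dir() listing above:

import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

pd_df = pd.DataFrame({'int_values': [1, 2, 3]})

with localconverter(ro.default_converter + pandas2ri.converter):
    r_from_pd_df = ro.conversion.py2rpy(pd_df)          # pandas -> R data.frame
    pd_from_r_df = ro.conversion.rpy2py(r_from_pd_df)   # R data.frame -> pandas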
You are likely using documentation/code written for a different version of rpy2 than what you have installed.
If using the latest release, consider checking the documentation for it:
https://rpy2.github.io/doc/v3.0.x/html/generated_rst/pandas.html
For anyone having issues with localconverter, here is an alternative way to convert a df of type rpy2.robjects.vectors.ListVector to a pandas dataframe. Note that this solution drops the column names:
pd.DataFrame(np.array(df).reshape((nrows, ncols)))  # note the capital 'F' in DataFrame; nrows/ncols are your dataframe's dimensions
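If you want to keep the column names, a variant that builds the dataframe column by column may work instead. This is a sketch assuming df is a ListVector whose elements are the columns and whose names attribute holds the column names:

import numpy as np
import pandas as pd

# df: rpy2.robjects.vectors.ListVector, one element per column
pd_df = pd.DataFrame({name: np.asarray(col) for name, col in zip(df.names, df)})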
I'm trying to use a Julia function from a Streamlit app. I created a toy example to test the interaction: simply returning a matrix from a Julia function, based on a single parameter that specifies the value of the diagonal elements.
I'll also note at the outset that both julia_import_method = "api_compiled_false" and julia_import_method = "main_include" work when importing the function in the Spyder IDE (rather than launching the Streamlit app from the command line via streamlit run streamlit_julia_test.py).
My project directory looks like:
├── my_project_directory
│   ├── julia_test.jl
│   └── streamlit_julia_test.py
The Julia function is in julia_test.jl and simply returns a matrix with diagonal entries specified by the v parameter:
function get_matrix_from_julia(v::Int)
    m = [v 0
         0 v]
    return m
end
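For reference, outside of Streamlit the function can be exercised directly from Python via PyJulia's Main module (this is essentially the main_include path, which works in Spyder as noted above):

from julia import Main

Main.include("julia_test.jl")
m = Main.get_matrix_from_julia(3)
print(m)  # 2x2 matrix with 3 on the diagonal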
The streamlit app is streamlit_julia_test.py and is defined as:
import os
from io import BytesIO

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import streamlit as st

# options:
#   main_include
#   api_compiled_false
#   dont_import_julia
julia_import_method = "api_compiled_false"

if julia_import_method == "main_include":
    # works in Spyder IDE
    import julia
    from julia import Main
    Main.include("julia_test.jl")
elif julia_import_method == "api_compiled_false":
    # works in Spyder IDE
    from julia.api import Julia
    jl = Julia(compiled_modules=False)
    this_dir = os.getcwd()
    julia_test_path = """include(\""""+ this_dir + """/julia_test.jl\"""" +")"""
    print(julia_test_path)
    jl.eval(julia_test_path)
    get_matrix_from_julia = jl.eval("get_matrix_from_julia")
elif julia_import_method == "dont_import_julia":
    print("Not importing ")
else:
    # raise added so unhandled options fail loudly instead of being silently discarded
    raise ValueError("Not handling this case:" + julia_import_method)

st.header('Using Julia in Streamlit App Example')
st.text("Using Method:" + julia_import_method)

matrix_element = st.selectbox('Set Matrix Diagonal to:', [1, 2, 3])
matrix_numpy = np.array([[matrix_element, 0], [0, matrix_element]])

col1, col2 = st.columns([4, 4])

with col1:
    fig, ax = plt.subplots(figsize=(5, 5))
    sns.heatmap(matrix_numpy, ax=ax, cmap="Blues", annot=True)
    ax.set_title('Matrix Using Python Numpy')
    buf = BytesIO()
    fig.savefig(buf, format="png")
    st.image(buf)

with col2:
    if julia_import_method == "dont_import_julia":
        matrix_julia = matrix_numpy
    else:
        matrix_julia = get_matrix_from_julia(matrix_element)
    fig, ax = plt.subplots(figsize=(5, 5))
    sns.heatmap(matrix_julia, ax=ax, cmap="Blues", annot=True)
    ax.set_title('Matrix from External Julia Script')
    buf = BytesIO()
    fig.savefig(buf, format="png")
    st.image(buf)
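As an aside, the triple-quoted concatenation used to build the include call can be written more simply with an f-string. A sketch, equivalent to the version above as long as the path contains no double quotes:

julia_test_path = f'include("{this_dir}/julia_test.jl")'
jl.eval(julia_test_path)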
If the app were working correctly, it would look like this (which can be reproduced by setting julia_import_method = "dont_import_julia" on line 13):
Testing
When I try julia_import_method = "main_include", I get the well-known error:
julia.core.UnsupportedPythonError: It seems your Julia and PyJulia setup are not supported.
Julia executable:
julia
Python interpreter and libpython used by PyCall.jl:
/Users/myusername/opt/anaconda3/bin/python3
/Users/myusername/opt/anaconda3/lib/libpython3.9.dylib
Python interpreter used to import PyJulia and its libpython.
/Users/myusername/opt/anaconda3/bin/python
/Users/myusername/opt/anaconda3/lib/libpython3.9.dylib
Your Python interpreter "/Users/myusername/opt/anaconda3/bin/python"
is statically linked to libpython. Currently, PyJulia does not fully
support such Python interpreter.
The easiest workaround is to pass `compiled_modules=False` to `Julia`
constructor. To do so, first *reboot* your Python REPL (if this happened
inside an interactive session) and then evaluate:
>>> from julia.api import Julia
>>> jl = Julia(compiled_modules=False)
Another workaround is to run your Python script with `python-jl`
command bundled in PyJulia. You can simply do:
$ python-jl PATH/TO/YOUR/SCRIPT.py
See `python-jl --help` for more information.
For more information, see:
https://pyjulia.readthedocs.io/en/latest/troubleshooting.html
As suggested, when I set julia_import_method = "api_compiled_false", I get a segfault:
include("my_project_directory/julia_test.jl")
2022-04-03 10:23:13.406 Traceback (most recent call last):
  File "/Users/myusername/opt/anaconda3/lib/python3.9/site-packages/streamlit/script_runner.py", line 430, in _run_script
    exec(code, module.__dict__)
  File "my_project_directory/streamlit_julia_test.py", line 25, in <module>
    jl = Julia(compiled_modules=False)
  File "/Users/myusername/.local/lib/python3.9/site-packages/julia/core.py", line 502, in __init__
    if not self.api.was_initialized:  # = jl_is_initialized()
  File "/Users/myusername/.local/lib/python3.9/site-packages/julia/libjulia.py", line 114, in __getattr__
    return getattr(self.libjulia, name)
  File "/Users/myusername/opt/anaconda3/lib/python3.9/ctypes/__init__.py", line 395, in __getattr__
    func = self.__getitem__(name)
  File "/Users/myuserame/opt/anaconda3/lib/python3.9/ctypes/__init__.py", line 400, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: dlsym(0x21b8e5840, was_initialized): symbol not found

signal (11): Segmentation fault: 11
in expression starting at none:0
Allocations: 35278345 (Pool: 35267101; Big: 11244); GC: 36
zsh: segmentation fault  streamlit run streamlit_julia_test.py
I've also tried the alternative recommendation provided in the PyJulia response message regarding the use of:
python-jl my_project_directory/streamlit_julia_test.py
But I get this error when running the python-jl command:
INTEL MKL ERROR: dlopen(/Users/myusername/opt/anaconda3/lib/libmkl_intel_thread.1.dylib, 0x0009): Library not loaded: #rpath/libiomp5.dylib
Referenced from: /Users/myusername/opt/anaconda3/lib/libmkl_intel_thread.1.dylib
Reason: tried: '/Applications/Julia-1.7.app/Contents/Resources/julia/bin/../lib/libiomp5.dylib' (no such file), '/usr/local/lib/libiomp5.dylib' (no such file), '/usr/lib/libiomp5.dylib' (no such file).
Intel MKL FATAL ERROR: Cannot load libmkl_intel_thread.1.dylib.
So I'm stuck. Thanks in advance for a modified reproducible example or instructions for the following system specs!
System specs:
Mac OS Monterey 12.2.1 (Chip - Mac M1 Pro)
Python 3.9.7
Julia 1.7.2
PyJulia 0.5.8.dev
Streamlit 1.7.0
I'm working through the Dagster tutorial and got stuck at the Multiple and Conditional Outputs step.
In the solid definitions, it asks you to declare (among other things):
output_defs=[
    OutputDefinition(
        name="hot_cereals", dagster_type=DataFrame, is_required=False
    ),
    OutputDefinition(
        name="cold_cereals", dagster_type=DataFrame, is_required=False
    ),
],
But there's no information about where DataFrame comes from.
First I tried pandas.DataFrame, but I got the error {dagster_type} is not a valid dagster type when submitting it via $ dagit -f multiple_outputs.py.
Then I installed dagster_pyspark and tried dagster_pyspark.DataFrame. This time I managed to submit the DAG to the UI. However, when I run it from the UI, I get the following error:
dagster.core.errors.DagsterTypeCheckDidNotPass: Type check failed for step output hot_cereals of type PySparkDataFrame.
File "/Users/bambrozio/.local/share/virtualenvs/dagster-tutorial/lib/python3.7/site-packages/dagster/core/execution/plan/execute_plan.py", line 210, in _dagster_event_sequence_for_step
for step_event in check.generator(step_events):
File "/Users/bambrozio/.local/share/virtualenvs/dagster-tutorial/lib/python3.7/site-packages/dagster/core/execution/plan/execute_step.py", line 273, in core_dagster_event_sequence_for_step
for evt in _create_step_events_for_output(step_context, user_event):
File "/Users/bambrozio/.local/share/virtualenvs/dagster-tutorial/lib/python3.7/site-packages/dagster/core/execution/plan/execute_step.py", line 298, in _create_step_events_for_output
for output_event in _type_checked_step_output_event_sequence(step_context, output):
File "/Users/bambrozio/.local/share/virtualenvs/dagster-tutorial/lib/python3.7/site-packages/dagster/core/execution/plan/execute_step.py", line 221, in _type_checked_step_output_event_sequence
dagster_type=step_output.dagster_type,
Does anyone know how to fix it? Thanks for the help!
As Arthur pointed out, the full tutorial code is available on Dagster's github.
However, you do not need dagster_pandas, rather, the key lines missing from your code are:
import typing
from dagster import PythonObjectDagsterType

if typing.TYPE_CHECKING:
    DataFrame = list
else:
    DataFrame = PythonObjectDagsterType(list, name="DataFrame")  # type: Any
The reason for the above structure is to achieve MyPy compliance, see the Types & Expectations section of the tutorial.
See also the documentation on Dagster types.
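For context, once DataFrame is defined this way it can be plugged straight into the output_defs from the question. A sketch, assuming the tutorial-era solid API:

from dagster import OutputDefinition, PythonObjectDagsterType

DataFrame = PythonObjectDagsterType(list, name="DataFrame")

output_defs = [
    OutputDefinition(name="hot_cereals", dagster_type=DataFrame, is_required=False),
    OutputDefinition(name="cold_cereals", dagster_type=DataFrame, is_required=False),
]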
I was stuck here too, but luckily I found the updated source code.
The docs have been updated so that the OutputDefinition is declared without a dagster_type.
Update your code (the part before the sorting solid and the pipeline definition) as below:
import csv
import os

from dagster import (
    Bool,
    Field,
    Output,
    OutputDefinition,
    execute_pipeline,
    pipeline,
    solid,
)


@solid
def read_csv(context, csv_path):
    lines = []
    csv_path = os.path.join(os.path.dirname(__file__), csv_path)
    with open(csv_path, "r") as fd:
        for row in csv.DictReader(fd):
            row["calories"] = int(row["calories"])
            lines.append(row)
    context.log.info("Read {n_lines} lines".format(n_lines=len(lines)))
    return lines


@solid(
    config_schema={
        "process_hot": Field(Bool, is_required=False, default_value=True),
        "process_cold": Field(Bool, is_required=False, default_value=True),
    },
    output_defs=[
        OutputDefinition(name="hot_cereals", is_required=False),
        OutputDefinition(name="cold_cereals", is_required=False),
    ],
)
def split_cereals(context, cereals):
    if context.solid_config["process_hot"]:
        hot_cereals = [cereal for cereal in cereals if cereal["type"] == "H"]
        yield Output(hot_cereals, "hot_cereals")
    if context.solid_config["process_cold"]:
        cold_cereals = [cereal for cereal in cereals if cereal["type"] == "C"]
        yield Output(cold_cereals, "cold_cereals")
You can also find the complete code here.
First, try installing the dagster-pandas integration:
pip install dagster_pandas
Then do:
from dagster_pandas import DataFrame
You can find the code from the tutorial here.
I have a problem which I have reduced to a small program. The program tries to write a pandas dataframe to an Excel file. This works fine in PyCharm, but when I produce an exe using PyInstaller, it errors at runtime:
.\pyinst_excel.exe
Traceback (most recent call last):
  File "pyinst_excel.py", line 31, in <module>
  File "pyinst_excel.py", line 25, in add_generated_files_row
  File "site-packages\pandas\io\excel\_openpyxl.py", line 18, in __init__
ModuleNotFoundError: No module named 'openpyxl'
[11408] Failed to execute script pyinst_excel
Now I have tried explicitly importing openpyxl, which didn't work, and I have also used --hidden-import:
call venv\Scripts\activate
pyinstaller --onefile --hidden-import _openpyxl --hidden-import openpyxl --clean pyinst_excel.py
The generated spec file looks like this:
# -*- mode: python ; coding: utf-8 -*-

block_cipher = None

a = Analysis(['pyinst_excel.py'],
             pathex=['D:\\Development\\OEM-SR-Rules'],
             binaries=[],
             datas=[],
             hiddenimports=['_openpyxl', 'openpyxl'],
             hookspath=[],
             runtime_hooks=[],
             excludes=[],
             win_no_prefer_redirects=False,
             win_private_assemblies=False,
             cipher=block_cipher,
             noarchive=False)
pyz = PYZ(a.pure, a.zipped_data,
          cipher=block_cipher)
exe = EXE(pyz,
          a.scripts,
          a.binaries,
          a.zipfiles,
          a.datas,
          [],
          name='pyinst_excel',
          debug=False,
          bootloader_ignore_signals=False,
          strip=False,
          upx=True,
          upx_exclude=[],
          runtime_tmpdir=None,
          console=True)
However this doesn't fix the problem. From the build/warnings file I see:
...
missing module named openpyxl - imported by pandas.io.excel._openpyxl (delayed, conditional), D:\Development\OEM-SR-Rules\pyinst_excel.py (top-level)
...
missing module named 'openpyxl.styles' - imported by pandas.io.excel._openpyxl (delayed)
missing module named 'openpyxl.style' - imported by pandas.io.excel._openpyxl (delayed)
Program code:
import argparse
from operator import attrgetter
import os
import sys
import json
import pandas as pd
import numpy as np
from pathlib import Path
import glob
import re
import time
from sys import exit
from shutil import copyfile

REPORT_DATE_FMT = '%d-%m-%Y %H.%M'


def add_generated_files_row(generated_excel, sequence: int, date_generated):
    # gen_files_df = pd.read_excel(generated_files_excel, index_col=0)
    gen_files_df = pd.DataFrame(columns=['Sequence', 'Date Generated'])
    new_row = {'Sequence': sequence,
               'Date Generated': date_generated}
    gen_files_df = gen_files_df.append(new_row, ignore_index=True, sort=False)
    writer = pd.ExcelWriter(generated_excel)
    gen_files_df.to_excel(writer)
    writer.save()  # without save() the workbook is never flushed to disk


generated_files_excel = Path(r'D:\temp\test.xlsx')  # raw string so '\t' isn't parsed as a tab
date_generated = time.strftime(f"{REPORT_DATE_FMT}")
add_generated_files_row(generated_files_excel, 1, date_generated)
I am using Python 3.8 and PyInstaller 3.6.
I am out of ideas. Any help would be much appreciated.
Thanks,
Clive
The problem turned out to be caused by PyInstaller being run from a different environment.
To solve this I had to do:
venv\Scripts\activate.bat
(venv) D:\Development\OEM-SR-Rules>pip install PyInstaller
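A quick way to guard against this in general is to invoke PyInstaller through the environment's own interpreter; python -m PyInstaller is a documented invocation that always uses whatever is installed in the active environment:

venv\Scripts\activate.bat
(venv) D:\Development\OEM-SR-Rules>python -m PyInstaller --onefile --clean pyinst_excel.py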
I'm wondering if jupyter_client is able to return code sent in to the execute function as HTML somehow?
I'm also wondering if I can do the same with stdout and stderr, as well as markdown?
If jupyter_client cannot do this, is there a jupyter library that does?
Adapting the solution from here might help. This adaptation takes the request 1+1 (msg_id = c.execute('1+1')) and returns the result formatted as bold red HTML text via display(HTML('<div style="color:Red;"><b>' + res + '</b></div>')), using IPython's display module. The kernel status checks have been commented out but left in for reference.
from subprocess import PIPE
from jupyter_client import KernelManager
from IPython.display import display, HTML
from queue import Empty

km = KernelManager(kernel_name='python3')
km.start_kernel()
# print(km.is_alive())
try:
    c = km.client()
    msg_id = c.execute('1+1')
    state = 'busy'
    data = {}
    while state != 'idle' and c.is_alive():
        try:
            msg = c.get_iopub_msg(timeout=1)
            if not 'content' in msg:
                continue
            content = msg['content']
            if 'data' in content:
                data = content['data']
            if 'execution_state' in content:
                state = content['execution_state']
        except Empty:
            pass
    res = data['text/plain']
    # print(data)
    display(HTML('<div style="color:Red;"><b>' + res + '</b></div>'))
except KeyboardInterrupt:
    pass
finally:
    km.shutdown_kernel()
    # print(km.is_alive())
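The question also asks about stdout and stderr. In the Jupyter messaging protocol these arrive on the same iopub channel as 'stream' messages, whose content dict carries a name ('stdout' or 'stderr') and the text. A sketch of a branch that could be added inside the while loop above, right after content is extracted (the styling is just an illustration):

if msg['msg_type'] == 'stream':
    # content['name'] is 'stdout' or 'stderr'; content['text'] is the captured output
    color = 'Red' if content['name'] == 'stderr' else 'Black'
    display(HTML('<pre style="color:%s;">%s</pre>' % (color, content['text'])))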
Also see here for more info.
I have a graph running on DataStax Enterprise Graph (version 5.1), backed by Cassandra storage.
I'm trying to run a query that returns both the vertex ID and its properties.
In Gremlin Console I can do this:
gremlin> g.V(1).project("v", "properties").by().by(valueMap())
==>[v:v[1],properties:[name:[marko],age:[29]]]
How can I translate the valueMap call while still using the Python GraphTraversal API? I know I can run a direct query via session execution, like this:
session.execute_graph("g.V().has(\"Node_Name\",\"A\").project(\"v\", \"properties\").by().by(valueMap())",{"name":graph_name})
Below is my setup code.
from dse.cluster import Cluster, EXEC_PROFILE_GRAPH_DEFAULT
from dse_graph import DseGraph
from dse.cluster import GraphExecutionProfile, EXEC_PROFILE_GRAPH_SYSTEM_DEFAULT
from dse.graph import GraphOptions
from gremlin_python.process.traversal import T
from gremlin_python.process.traversal import Order
from gremlin_python.process.traversal import Cardinality
from gremlin_python.process.traversal import Column
from gremlin_python.process.traversal import Direction
from gremlin_python.process.traversal import Operator
from gremlin_python.process.traversal import P
from gremlin_python.process.traversal import Pop
from gremlin_python.process.traversal import Scope
from gremlin_python.process.traversal import Barrier
graph_name = "TEST"
graph_ip = ["127.0.0.1"]
graph_port = 9042
schema = """
schema.edgeLabel("Group").create();
schema.propertyKey("Version").Text().create();
schema.edgeLabel("Group").properties("Version").add()
schema.vertexLabel("Example").create();
schema.edgeLabel("Group").connection("Example", "Example").add()
schema.propertyKey("Node_Name").Text().create();
schema.vertexLabel("Example").properties("Node_Name").add()
schema.vertexLabel("Example").index("exampleByName").secondary().by("Node_Name").add();
"""
profile = GraphExecutionProfile(
    graph_options=GraphOptions(graph_name=graph_name))
client = Cluster(
    contact_points=graph_ip, port=graph_port,
    execution_profiles={EXEC_PROFILE_GRAPH_DEFAULT: profile}
)
graph_name = graph_name
session = client.connect()
graph = DseGraph.traversal_source(session)

# force the schema to be clean
session.execute_graph(
    "system.graph(name).ifExists().drop();",
    {'name': graph_name},
    execution_profile=EXEC_PROFILE_GRAPH_SYSTEM_DEFAULT
)
session.execute_graph(
    "system.graph(name).ifNotExists().create();",
    {'name': graph_name},
    execution_profile=EXEC_PROFILE_GRAPH_SYSTEM_DEFAULT
)
session.execute_graph(schema)
session.shutdown()
session = client.connect()
graph = DseGraph.traversal_source(session)
Update:
I guess I have not made the problem clear. This is in Python, not in the Gremlin console. Running code like graph.V().has("Node_Name","A").project("v","properties").by().by(valueMap()).toList()
gives the following result. How can I execute the Gremlin query while staying at the GLV level, without dropping down to a text-serialized query sent to the Gremlin server?
Traceback (most recent call last):
  File "graph_test.py", line 79, in <module>
    graph.V().has("Node_Name","A").project("v", "properties").by().by(valueMap()).toList()
NameError: name 'valueMap' is not defined
I may not fully understand your question, but it seems like you already have most of the answer. This last line of code:
graph = DseGraph.traversal_source(session)
should probably be written as:
g = DseGraph.traversal_source(session)
The return value of traversal_source(session) is a TraversalSource and not a Graph instance and by convention TinkerPop tends to refer to such a variable as g. Once you have a TraversalSource, then you can just write your Gremlin.
g = DseGraph.traversal_source(session)
g.V().has("Node_Name","A").project("v", "properties").by().by(valueMap()).toList()
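As for the NameError in the update: valueMap is simply not imported on the Python side. In gremlin_python the anonymous traversal steps live on the __ class, so (assuming the standard gremlin_python layout) the call can be spelled:

from gremlin_python.process.graph_traversal import __

g = DseGraph.traversal_source(session)
g.V().has("Node_Name", "A").project("v", "properties").by().by(__.valueMap()).toList()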