Dagster: Multiple and Conditional Outputs (Type check failed for step output xxx PySparkDataFrame)

I'm working through the Dagster tutorial and got stuck at the Multiple and Conditional Outputs step.
In the solid definitions, it asks you to declare (among other things):
output_defs=[
    OutputDefinition(
        name="hot_cereals", dagster_type=DataFrame, is_required=False
    ),
    OutputDefinition(
        name="cold_cereals", dagster_type=DataFrame, is_required=False
    ),
],
But there's no information about where DataFrame comes from.
First I tried pandas.DataFrame, but I got the error: {dagster_type} is not a valid dagster type. It happens when I try to submit it via $ dagit -f multiple_outputs.py.
Then I installed dagster_pyspark and tried dagster_pyspark.DataFrame. This time I managed to submit the DAG to the UI. However, when I ran it from the UI, I got the following error:
dagster.core.errors.DagsterTypeCheckDidNotPass: Type check failed for step output hot_cereals of type PySparkDataFrame.
File "/Users/bambrozio/.local/share/virtualenvs/dagster-tutorial/lib/python3.7/site-packages/dagster/core/execution/plan/execute_plan.py", line 210, in _dagster_event_sequence_for_step
for step_event in check.generator(step_events):
File "/Users/bambrozio/.local/share/virtualenvs/dagster-tutorial/lib/python3.7/site-packages/dagster/core/execution/plan/execute_step.py", line 273, in core_dagster_event_sequence_for_step
for evt in _create_step_events_for_output(step_context, user_event):
File "/Users/bambrozio/.local/share/virtualenvs/dagster-tutorial/lib/python3.7/site-packages/dagster/core/execution/plan/execute_step.py", line 298, in _create_step_events_for_output
for output_event in _type_checked_step_output_event_sequence(step_context, output):
File "/Users/bambrozio/.local/share/virtualenvs/dagster-tutorial/lib/python3.7/site-packages/dagster/core/execution/plan/execute_step.py", line 221, in _type_checked_step_output_event_sequence
dagster_type=step_output.dagster_type,
Does anyone know how to fix it? Thanks for the help!

As Arthur pointed out, the full tutorial code is available on Dagster's GitHub.
However, you do not need dagster_pandas; rather, the key lines missing from your code are:
import typing

from dagster import PythonObjectDagsterType

if typing.TYPE_CHECKING:
    DataFrame = list
else:
    DataFrame = PythonObjectDagsterType(list, name="DataFrame")  # type: Any
The reason for the above structure is to achieve MyPy compliance; see the Types & Expectations section of the tutorial.
See also the documentation on Dagster types.
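Putting it together with the solid from the question, a minimal sketch (solid body abbreviated) looks like:

import typing

from dagster import Output, OutputDefinition, PythonObjectDagsterType, solid

if typing.TYPE_CHECKING:
    DataFrame = list
else:
    DataFrame = PythonObjectDagsterType(list, name="DataFrame")  # type: Any


@solid(
    output_defs=[
        OutputDefinition(name="hot_cereals", dagster_type=DataFrame, is_required=False),
        OutputDefinition(name="cold_cereals", dagster_type=DataFrame, is_required=False),
    ]
)
def split_cereals(context, cereals):
    # Each Output is type-checked against DataFrame (i.e. a list) when yielded.
    yield Output([c for c in cereals if c["type"] == "H"], "hot_cereals")
    yield Output([c for c in cereals if c["type"] == "C"], "cold_cereals")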

I was stuck here, too, but luckily I found the updated source code.
They have updated the docs so that the OutputDefinition is defined beforehand.
Update your code before the sorting solids and the pipeline definition, like below:
import csv
import os

from dagster import (
    Bool,
    Field,
    Output,
    OutputDefinition,
    execute_pipeline,
    pipeline,
    solid,
)


@solid
def read_csv(context, csv_path):
    lines = []
    csv_path = os.path.join(os.path.dirname(__file__), csv_path)
    with open(csv_path, "r") as fd:
        for row in csv.DictReader(fd):
            row["calories"] = int(row["calories"])
            lines.append(row)
    context.log.info("Read {n_lines} lines".format(n_lines=len(lines)))
    return lines


@solid(
    config_schema={
        "process_hot": Field(Bool, is_required=False, default_value=True),
        "process_cold": Field(Bool, is_required=False, default_value=True),
    },
    output_defs=[
        OutputDefinition(name="hot_cereals", is_required=False),
        OutputDefinition(name="cold_cereals", is_required=False),
    ],
)
def split_cereals(context, cereals):
    if context.solid_config["process_hot"]:
        hot_cereals = [cereal for cereal in cereals if cereal["type"] == "H"]
        yield Output(hot_cereals, "hot_cereals")
    if context.solid_config["process_cold"]:
        cold_cereals = [cereal for cereal in cereals if cereal["type"] == "C"]
        yield Output(cold_cereals, "cold_cereals")
You can also find the full code here.
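For reference, here is a sketch of how the tutorial wires these solids together. The sort solids and run config follow the tutorial's pattern, but check them against the linked source; note also that older Dagster versions call the last kwarg environment_dict instead of run_config.

@solid
def sort_hot_cereals_by_calories(context, cereals):
    # Sketch of the tutorial's sorting solid for the hot branch.
    sorted_cereals = sorted(cereals, key=lambda cereal: cereal["calories"])
    context.log.info("Least caloric hot cereal: " + sorted_cereals[0]["name"])


@solid
def sort_cold_cereals_by_calories(context, cereals):
    sorted_cereals = sorted(cereals, key=lambda cereal: cereal["calories"])
    context.log.info("Least caloric cold cereal: " + sorted_cereals[0]["name"])


@pipeline
def multiple_outputs_pipeline():
    # Conditional outputs: each branch only runs if split_cereals yields it.
    hot_cereals, cold_cereals = split_cereals(read_csv())
    sort_hot_cereals_by_calories(hot_cereals)
    sort_cold_cereals_by_calories(cold_cereals)


if __name__ == "__main__":
    run_config = {
        "solids": {
            "read_csv": {"inputs": {"csv_path": {"value": "cereal.csv"}}},
            "split_cereals": {"config": {"process_hot": True, "process_cold": True}},
        }
    }
    execute_pipeline(multiple_outputs_pipeline, run_config=run_config)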

First, install the Dagster pandas integration:
pip install dagster_pandas
Then do:
from dagster_pandas import DataFrame
You can find the code from the tutorial here.
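One caveat, and likely the cause of the original PySparkDataFrame failure: these integration types check the actual runtime value, so a solid that yields plain Python lists will fail a pandas (or Spark) DataFrame type check. A minimal sketch with dagster_pandas:

import pandas as pd

from dagster import Output, OutputDefinition, solid
from dagster_pandas import DataFrame


@solid(
    output_defs=[
        OutputDefinition(name="hot_cereals", dagster_type=DataFrame, is_required=False),
    ]
)
def split_cereals(context, cereals):
    # The type check passes only because the yielded value is a pandas DataFrame.
    df = pd.DataFrame(cereals)
    yield Output(df[df["type"] == "H"], "hot_cereals")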

Related

How to use rules_webtesting?

I want to use https://github.com/bazelbuild/rules_webtesting. I am using Bazel 5.2.0.
The whole project can be found here.
My WORKSPACE.bazel file looks like this:
load("#bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")
http_archive(
name = "io_bazel_rules_webtesting",
sha256 = "3ef3bb22852546693c94e9b0b02c2570e74abab6f800fd58e0cbe79492e49c1b",
urls = [
"https://github.com/bazelbuild/rules_webtesting/archive/581b1557e382f93419da6a03b91a45c2ac9a9ec8/rules_webtesting.tar.gz",
],
)
load("#io_bazel_rules_webtesting//web:repositories.bzl", "web_test_repositories")
web_test_repositories()
My BUILD.bazel file looks like this:
load("#io_bazel_rules_webtesting//web:py.bzl", "py_web_test_suite")
py_web_test_suite(
name = "browser_test",
srcs = ["browser_test.py"],
browsers = [
"#io_bazel_rules_webtesting//browsers:chromium-local",
],
local = True,
deps = ["#io_bazel_rules_webtesting//testing/web"],
)
browser_test.py looks like this:
import unittest

from testing.web import webtest


class BrowserTest(unittest.TestCase):
    def setUp(self):
        self.driver = webtest.new_webdriver_session()

    def tearDown(self):
        try:
            self.driver.quit()
        finally:
            self.driver = None

    # Your tests here


if __name__ == "__main__":
    unittest.main()
When I try to do a bazel build //... I get (under Ubuntu 20.04 and macOS):
INFO: Invocation ID: 74c03efd-9caa-4174-9fda-42f7ff37e38b
ERROR: error loading package '': Every .bzl file must have a corresponding package, but '@io_bazel_rules_webtesting//web:repositories.bzl' does not have one. Please create a BUILD file in the same or any parent directory. Note that this BUILD file does not need to do anything except exist.
INFO: Elapsed time: 0.038s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)
The error message does not make sense to me, since there is a BUILD file in
https://github.com/bazelbuild/rules_webtesting/blob/581b1557e382f93419da6a03b91a45c2ac9a9ec8/BUILD.bazel
and https://github.com/bazelbuild/rules_webtesting/blob/581b1557e382f93419da6a03b91a45c2ac9a9ec8/web/BUILD.bazel.
I also tried a different version of Bazel, but with the same result.
Any ideas on how to get this working?
You need to add strip_prefix = "rules_webtesting-581b1557e382f93419da6a03b91a45c2ac9a9ec8" to your http_archive call.
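Concretely, the http_archive from the question becomes:

http_archive(
    name = "io_bazel_rules_webtesting",
    sha256 = "3ef3bb22852546693c94e9b0b02c2570e74abab6f800fd58e0cbe79492e49c1b",
    strip_prefix = "rules_webtesting-581b1557e382f93419da6a03b91a45c2ac9a9ec8",
    urls = [
        "https://github.com/bazelbuild/rules_webtesting/archive/581b1557e382f93419da6a03b91a45c2ac9a9ec8/rules_webtesting.tar.gz",
    ],
)

GitHub commit archives wrap everything in a rules_webtesting-<commit>/ top-level directory; without strip_prefix, that wrapper directory becomes the repository root, so @io_bazel_rules_webtesting//web:repositories.bzl (and its BUILD file) cannot be found, which is exactly what the error message complains about.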
For debugging, you can look in the folder where Bazel extracts it: bazel-out/../../../external/io_bazel_rules_webtesting. @io_bazel_rules_webtesting//web translates to bazel-out/../../../external/io_bazel_rules_webtesting/web, so if that folder doesn't exist things won't work.

Is it possible to run a custom OpenAI gym environment entirely from within Jupyter Notebook

Long story short: I have been given some Python code for a custom openAI gym environment. I can successfully run the code via ExperimentGrid from the command line but would like to be able to run the entire experiment from within Jupyter notebook, rather than calling scripts. This would be more convenient for some experiments that I will be doing farther down the road.
My question: Is it possible to execute an experiment on a custom OpenAI gym environment entirely from within Jupyter Notebook and if so, how? I've seen plenty of examples of people executing gym's standard environments (like SpaceInvaders-v0 or CartPole-v0) from Jupyter but even then, they are calling the environment with
env=gym.make('SpaceInvaders-v0')
and essentially executing that environment's script behind the scenes.
Below is a basic description of how my code is set up to run from the command line, and the errors that I'm getting in Jupyter.
Any advice would be appreciated. I am admittedly rather new to Gym, Python and Linux.
My basic environment code is structured like this in, say, envs/mygames/Custom_Env.py:
various import statements (numpy, gym, pyglet, copy)
class Entity()
class State()
class The_Custom_Env(core.Env) # This is the main environment class
class Shell_Class # This class calls The_Custom_Env and provides some arguments
In mygames/__init__.py, I import the Shell_Class:
from gym.envs.mygames.Custom_Env import Shell_Class
In envs/__init__.py, I register the environment:
register(
    id='TEST-v0',
    entry_point='gym.envs.mygames:Shell_Class',
    max_episode_steps=200,
    reward_threshold=25.0,
)
Finally, if I execute a script containing this code from the command line, the experiment works without issue:
from spinup.utils.run_utils import ExperimentGrid
from spinup import ppo_pytorch
import torch

if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('--cpu', type=int, default=4)
    parser.add_argument('--num_runs', type=int, default=1)
    args = parser.parse_args()

    eg = ExperimentGrid(name='super-cool-test')
    eg.add('env_name', 'TEST-v0', '', True)
    eg.add('seed', [10*i for i in range(args.num_runs)])
    eg.add('epochs', [10])
    eg.add('steps_per_epoch', 4000)
    eg.add('ac_kwargs:hidden_sizes', [(32, 32)], 'hid')
    eg.add('ac_kwargs:activation', [torch.nn.ReLU], '')
    eg.add('pi_lr', [0.001])
    eg.add('clip_ratio', 0.3)
    eg.run(ppo_pytorch, num_cpu=args.cpu)
My Jupyter Attempt
I put all of the code from Custom_Env.py in cell #1.
I then registered the environment in cell #2:
gym.register(
    id='TEST-v1',
    entry_point='__main__:Shell_Class',
    max_episode_steps=200,
    reward_threshold=25.0,
)
This is based on this Q/A: Register gym environment that is defined inside a jupyter notebook cell. I then make the environment in cell #3:
gym.make('TEST-v1')
and get this non-descriptive output:
<TimeLimit<Shell_Class< TEST-v1 >>>
In cell #4, I tried to execute ExperimentGrid code directly within Jupyter like so:
from spinup.utils.run_utils import ExperimentGrid
from spinup import ppo_pytorch
import torch
num_runs=1
cpu=4
env_name='TEST-v1'
eg = ExperimentGrid(name='Jupyter-test')
eg.add('env_name', env_name, '', True)
eg.add('seed', [10*i for i in range(num_runs)])
eg.add('epochs', 500)
eg.add('steps_per_epoch', 4000)
eg.add('ac_kwargs:hidden_sizes', [(32, 32)], 'hid')
eg.add('ac_kwargs:activation', [torch.nn.ReLU], '')
eg.add('pi_lr', 0.001)
eg.add('clip_ratio', 0.3)
eg.run(ppo_pytorch, num_cpu=cpu)
The experiment starts up as usual but then runs into some kind of error:
> ================================================================================
ExperimentGrid [Jupyter-test] runs over parameters:
env_name []
TEST-v1
seed [see]
0
epochs [epo]
500
steps_per_epoch [ste]
4000
ac_kwargs:hidden_sizes [hid]
(32, 32)
ac_kwargs:activation []
ReLU
pi_lr [pi]
0.001
clip_ratio [cli]
0.3
Variants, counting seeds: 1
Variants, not counting seeds: 1
================================================================================
Preparing to run the following experiments...
Jupyter-test_test-v1
================================================================================
Launch delayed to give you a few seconds to review your experiments.
To customize or disable this behavior, change WAIT_BEFORE_LAUNCH in
spinup/user_config.py.
================================================================================
Running experiment:
Jupyter-test_test-v1
with kwargs:
{
"ac_kwargs": {
"activation": "ReLU",
"hidden_sizes": [
32,
32
]
},
"clip_ratio": 0.3,
"env_name": "TEST-v1",
"epochs": 500,
"pi_lr": 0.001,
"seed": 0,
"steps_per_epoch": 4000
}
================================================================================
There appears to have been an error in your experiment.
Check the traceback above to see what actually went wrong. The
traceback below, included for completeness (but probably not useful
for diagnosing the error), shows the stack leading up to the
experiment launch.
================================================================================
---------------------------------------------------------------------------
CalledProcessError Traceback (most recent call last)
<ipython-input-14-de843fd528cf> in <module>
15 eg.add('pi_lr', 0.001)
16 eg.add('clip_ratio', 0.3)
---> 17 eg.run(ppo_pytorch, num_cpu=cpu)
~/Downloads/spinningup/spinup/utils/run_utils.py in run(self, thunk, num_cpu, data_dir, datestamp)
544
545 call_experiment(exp_name, thunk_, num_cpu=num_cpu,
--> 546 data_dir=data_dir, datestamp=datestamp, **var)
547
548
~/Downloads/spinningup/spinup/utils/run_utils.py in call_experiment(exp_name, thunk, seed, num_cpu, data_dir, datestamp, **kwargs)
169 cmd = [sys.executable if sys.executable else 'python', entrypoint, encoded_thunk]
170 try:
--> 171 subprocess.check_call(cmd, env=os.environ)
172 except CalledProcessError:
173 err_msg = '\n'*3 + '='*DIV_LINE_WIDTH + '\n' + dedent("""
~/anaconda3/envs/spinningup/lib/python3.6/subprocess.py in check_call(*popenargs, **kwargs)
309 if cmd is None:
310 cmd = popenargs[0]
--> 311 raise CalledProcessError(retcode, cmd)
312 return 0
313

When using jupyter_client how do I get data in HTML?

I'm wondering if jupyter_client is able to return the output of code sent to the execute function as HTML somehow?
I'm also wondering if I can do the same with stdout and stderr, as well as Markdown?
If jupyter_client cannot do this, is there a jupyter library that does?
Adapting the solution from here might help. This adaptation sends the request 1+1 via msg_id = c.execute('1+1') and displays the result as bold red HTML text with display(HTML('<div style="color:Red;"><b>' + res + '</b></div>')), using IPython's display module. The kernel liveness checks (km.is_alive()) have been commented out but left for reference.
from subprocess import PIPE
from jupyter_client import KernelManager
from IPython.display import display, HTML
from queue import Empty

km = KernelManager(kernel_name='python3')
km.start_kernel()
# print(km.is_alive())
try:
    c = km.client()
    msg_id = c.execute('1+1')
    state = 'busy'
    data = {}
    while state != 'idle' and c.is_alive():
        try:
            msg = c.get_iopub_msg(timeout=1)
            if not 'content' in msg:
                continue
            content = msg['content']
            if 'data' in content:
                data = content['data']
            if 'execution_state' in content:
                state = content['execution_state']
        except Empty:
            pass
    res = data['text/plain']
    # print(data)
    display(HTML('<div style="color:Red;"><b>' + res + '</b></div>'))
except KeyboardInterrupt:
    pass
finally:
    km.shutdown_kernel()
    # print(km.is_alive())
# print(km.is_alive())
Also see here for more info.
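Regarding the stdout/stderr part of the question: those arrive on the same IOPub channel as 'stream' messages (per the Jupyter messaging protocol), so the loop above can be extended to wrap them in HTML too. A self-contained sketch, with the executed code line chosen just for illustration:

from queue import Empty

from IPython.display import HTML, display
from jupyter_client import KernelManager

km = KernelManager(kernel_name='python3')
km.start_kernel()
c = km.client()
c.start_channels()  # make sure the ZMQ channels are connected
try:
    c.execute("import sys; print('to stdout'); print('to stderr', file=sys.stderr)")
    state = 'busy'
    while state != 'idle' and c.is_alive():
        try:
            msg = c.get_iopub_msg(timeout=1)
        except Empty:
            continue
        content = msg.get('content', {})
        if msg['msg_type'] == 'stream':
            # content['name'] is 'stdout' or 'stderr'; content['text'] is the raw text
            color = 'Red' if content['name'] == 'stderr' else 'Black'
            display(HTML('<pre style="color:%s;">%s</pre>' % (color, content['text'])))
        if 'execution_state' in content:
            state = content['execution_state']
finally:
    c.stop_channels()
    km.shutdown_kernel()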

Argument "xyz" to "ABC" has incompatible type "Tuple[None, ...]"; expected "Tuple[None]"

As an experiment, I wanted to add type annotations to my project and test it with mypy --strict. Consider the following code and the error message below:
#!/usr/bin/env python
import typing as T
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    choices: T.Tuple[None]


def gen_question() -> Question:
    choices = [None]
    return Question(choices=tuple(choices))


if __name__ == '__main__':
    gen_question()
Here's the error message:
test.py:18: error: Argument "choices" to "Question" has incompatible type "Tuple[None, ...]"; expected "Tuple[None]"
Is there something I'm doing wrong, or is that a bug? How can I solve the problem?
It appears that, in the case of typing.Tuple, according to the documentation, specifying a variable-length tuple requires adding , ..., as in the following:
choices: T.Tuple[None, ...]
Note that this doesn't seem to apply to lists.
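A small self-contained illustration of the difference (the variable names are just for the example):

import typing as T

one: T.Tuple[None] = (None,)             # exactly one element
many: T.Tuple[None, ...] = (None, None)  # any number of elements

# Lists don't fix their length in the type, so no ellipsis is needed:
items: T.List[None] = [None, None, None]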

How to get sys.exc_traceback from IPython shell.run_code?

My app interfaces with the IPython Qt shell with code something like this:
from IPython.core.interactiveshell import ExecutionResult

shell = self.kernelApp.shell  # ZMQInteractiveShell
code = compile(script, file_name, 'exec')
result = ExecutionResult()
shell.run_code(code, result=result)
if result:
    self.show_result(result)
The problem is: how can show_result show the traceback resulting from exceptions in code?
Neither the error_before_exec nor the error_in_exec ivar of ExecutionResult seems to give a reference to the traceback. Similarly, neither sys nor shell.user_ns.namespace.get('sys') has an exc_traceback attribute.
Any ideas? Thanks!
Edward
IPython/core/interactiveshell.py contains InteractiveShell._showtraceback:
def _showtraceback(self, etype, evalue, stb):
    """Actually show a traceback. Subclasses may override..."""
    print(self.InteractiveTB.stb2text(stb), file=io.stdout)
The solution is to monkey-patch InteractiveShell._showtraceback so that it writes to sys.stderr (which shows up in the Qt console):
from __future__ import print_function

import sys

...

shell = self.kernelApp.shell  # ZMQInteractiveShell
code = compile(script, file_name, 'exec')

def show_traceback(etype, evalue, stb, shell=shell):
    print(shell.InteractiveTB.stb2text(stb), file=sys.stderr)
    sys.stderr.flush()  # <==== Oh, so important

old_show = getattr(shell, '_showtraceback', None)
shell._showtraceback = show_traceback
shell.run_code(code)
if old_show:
    shell._showtraceback = old_show
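A variation, if you would rather hand the traceback text to show_result than print it immediately; the captured list is just illustrative glue, not IPython API:

captured = []  # illustrative holder for formatted traceback strings

def capture_traceback(etype, evalue, stb, shell=shell):
    captured.append(shell.InteractiveTB.stb2text(stb))

shell._showtraceback = capture_traceback
shell.run_code(code)
if captured:
    self.show_result(captured[-1])  # pass the text to your own display code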
Note: there is no need to pass an ExecutionResult object to shell.run_code().
EKR
