Producing files in dagster without caring about the filename

In the dagster tutorial, in the Materializations section, we choose a filename (sorted_cereals_csv_path) for our intermediate output, and then yield it as a materialization:
@solid
def sort_by_calories(context, cereals):
    # Sort the data (removed for brevity)
    sorted_cereals_csv_path = os.path.abspath(
        'calories_sorted_{run_id}.csv'.format(run_id=context.run_id)
    )
    with open(sorted_cereals_csv_path, 'w') as fd:
        writer = csv.DictWriter(fd, fieldnames)
        writer.writeheader()
        writer.writerows(sorted_cereals)
    yield Materialization(
        label='sorted_cereals_csv',
        description='Cereals data frame sorted by caloric content',
        metadata_entries=[
            EventMetadataEntry.path(
                sorted_cereals_csv_path, 'sorted_cereals_csv_path'
            )
        ],
    )
    yield Output(None)
However, this relies on the local filesystem being usable (which may not be true), the file will likely be overwritten by later runs (which is not what I want), and it also forces us to come up with a filename that will never be used.
What I'd like to do in most of my solids is just say "here is a file object, please store it for me", without concerning myself with where it's going to be stored. Can I materialize a file without considering all these things? Should I use Python's tempfile facility for this?
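For illustration, roughly what I mean, as a sketch only (reusing csv, fieldnames and sorted_cereals from the snippet above):

import tempfile

# Sketch: let tempfile pick the location so the solid never invents a name.
with tempfile.NamedTemporaryFile(
    mode='w', suffix='.csv', delete=False
) as fd:
    writer = csv.DictWriter(fd, fieldnames)
    writer.writeheader()
    writer.writerows(sorted_cereals)
# fd.name was chosen for me, but it is still a local path, and nothing
# ties it to the run so that it can be found again later.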

Actually it seems this is answered in the output_materialization example.
You basically define a type:
@usable_as_dagster_type(
    name='LessSimpleDataFrame',
    description='A more sophisticated data frame that type checks its structure.',
    input_hydration_config=less_simple_data_frame_input_hydration_config,
    output_materialization_config=less_simple_data_frame_output_materialization_config,
)
class LessSimpleDataFrame(list):
    pass
This type has an output_materialization strategy that reads the config:
def less_simple_data_frame_output_materialization_config(
    context, config, value
):
    csv_path = os.path.abspath(config['csv']['path'])
    # Save data to this path
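For completeness, here is a sketch of what the elided body might look like. This is an assumption based on the solid shown earlier (in particular, that value is a non-empty list of dicts and that the function reports a Materialization), not the exact code from the example:

def less_simple_data_frame_output_materialization_config(
    context, config, value
):
    csv_path = os.path.abspath(config['csv']['path'])
    # Assumed body: write the rows to the configured path...
    with open(csv_path, 'w') as fd:
        writer = csv.DictWriter(fd, fieldnames=list(value[0].keys()))
        writer.writeheader()
        writer.writerows(value)
    # ...and report the materialization, mirroring the solid above.
    return Materialization(
        label='less_simple_data_frame',
        metadata_entries=[
            EventMetadataEntry.path(csv_path, 'csv_path')
        ],
    )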
And you specify this path in the config:
execute_pipeline(
    output_materialization_pipeline,
    {
        'solids': {
            'sort_by_calories': {
                'outputs': [
                    {'result': {'csv': {'path': 'cereal_out.csv'}}}
                ],
            }
        }
    },
)
You still have to come up with a filename for each intermediate output, but you can do it in the config, which can differ per run, instead of defining it in the pipeline itself.
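For example, here is a sketch of per-run naming (the uuid-based scheme is just an illustration, not part of the dagster example):

import uuid

# Sketch: build the run config dynamically so each run gets its own file,
# without baking any filename into the pipeline definition itself.
out_path = 'cereal_out_{}.csv'.format(uuid.uuid4().hex)
execute_pipeline(
    output_materialization_pipeline,
    {
        'solids': {
            'sort_by_calories': {
                'outputs': [
                    {'result': {'csv': {'path': out_path}}}
                ],
            }
        }
    },
)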

Related

Google Earth Engine download problems, is this caused by immutable server side objects?

I have a function that will download an image collection as a TFRecord or a GeoTIFF.
Here's the function:
def download_image_collection_to_drive(collection, aois, bands, limit, export_format):
    if collection.size().lt(ee.Number(limit)):
        bands = [band for band in bands if band not in ['SCL', 'QA60']]
        for aoi in aois:
            cluster = aoi.get('cluster').getInfo()
            geom = aoi.bounds().getInfo()['geometry']['coordinates']
            aoi_collection = collection.filterMetadata('cluster', 'equals', cluster)
            for ts in range(1, 11):
                print(ts)
                ts_collection = aoi_collection.filterMetadata('interval', 'equals', ts)
                if ts_collection.size().eq(ee.Number(1)):
                    image = ts_collection.first()
                    p_id = image.get("PRODUCT_ID").getInfo()
                    description = f'{cluster}_{ts}_{p_id}'
                    task_config = {
                        'fileFormat': export_format,
                        'image': image.select(bands),
                        'region': geom,
                        'description': description,
                        'scale': 10,
                        'folder': 'output'
                    }
                    if export_format == 'TFRecord':
                        task_config['formatOptions'] = {'patchDimensions': [256, 256], 'kernelSize': [3, 3]}
                    task = ee.batch.Export.image.toDrive(**task_config)
                    task.start()
                else:
                    logger.warning(f'no image for interval {ts}')
    else:
        logger.warning(f'collection over {limit} aborting drive download')
It seems that whenever it gets to the second aoi it fails. I'm confused by this: if ts_collection.size().eq(ee.Number(1)) confirms there is an image there, it should manage to get the product ID from it.
  line 24, in download_image_collection_to_drive
    p_id = image.get("PRODUCT_ID").getInfo()
  File "/lib/python3.7/site-packages/ee/computedobject.py", line 95, in getInfo
    return data.computeValue(self)
  File "/lib/python3.7/site-packages/ee/data.py", line 717, in computeValue
    prettyPrint=False))['result']
  File "/lib/python3.7/site-packages/ee/data.py", line 340, in _execute_cloud_call
    raise _translate_cloud_exception(e)
ee.ee_exception.EEException: Element.get: Parameter 'object' is required.
Am I falling foul of immutable server-side objects somewhere?
This is a server-side value problem, yes, but immutability has nothing to do with it: your if statement isn't working as you intend.
ts_collection.size().eq(ee.Number(1)) is a server-side value; you've described a comparison that hasn't happened yet. That means a local operation like a Python if statement cannot take the comparison's outcome into account, and will just treat the object as a true value.
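To see the pitfall in isolation, here is a toy snippet (not from the question's code; it assumes the ee library is imported and initialized):

# A server-side comparison: this object *describes* 0 == 1; it is not False.
comparison = ee.Number(0).eq(ee.Number(1))

if comparison:  # any non-None Python object is truthy, so this always runs
    print('always printed, even though 0 != 1')

if comparison.getInfo():  # fetches the actual value (0 here), which is falsy
    print('never printed')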
Using getInfo would be a quick fix:
if ts_collection.size().eq(ee.Number(1)).getInfo():
but it would be more efficient to avoid using getInfo more than needed by fetching the entire collection's info just once, which includes the image info.
...
ts_collection_info = ts_collection.getInfo()
if ts_collection_info['features']:  # Are there any images in the collection?
    image = ts_collection.first()
    image_info = ts_collection_info['features'][0]  # client-side image info, already downloaded
    p_id = image_info['properties']['PRODUCT_ID']  # get the ID from the client-side info
    ...
This way, you only make two requests per ts: one to check for the match, and one to start the export.
Note that I haven't actually run this Python code, and there might be some small mistakes; if it gives you any trouble, print(ts_collection_info) and examine the structure you actually received to figure out how to interpret it.

Difference between two files view in HTML using Java or any jar

I want to write a script which compares two files in Java and shows their differences on an HTML page (side by side). Can someone help me out with how to write it (where to start)? I am pulling my hair out over this....
I want to use this script in a Beanshell PostProcessor so that I can compare the standard output files with the result files easily.
I don't think you should be asking people to write code for you here; consider hiring a freelancer instead.
Alternatively, you can use the following approach:
Add a JSR223 Assertion as a child of the request which you would like to fail if the files aren't equal
Put the following code into "Script" area:
def file1 = new File('/path/to/file1')
def file2 = new File('/path/to/file2')

def file1Lines = file1.readLines('UTF-8')
def file2Lines = file2.readLines('UTF-8')

if (file1Lines.size() != file2Lines.size()) {
    AssertionResult.setFailure(true)
    AssertionResult.setFailureMessage('Files size is different, omitting line-by-line compare')
} else {
    def differences = new StringBuilder()
    file1Lines.eachWithIndex { String file1Line, int number ->
        String file2Line = file2Lines.get(number)
        if (!file1Line.equals(file2Line)) {
            differences.append('Difference # ').append(number).append('. Expected: ')
                       .append(file1Line).append('. Actual: ' + file2Line)
            differences.append(System.getProperty('line.separator'))
        }
    }
    if (differences.toString().length() > 0) {
        AssertionResult.setFailure(true)
        AssertionResult.setFailureMessage(differences.toString())
    }
}
If there are differences in the files' content, you will see them listed one by one in the JSR223 Assertion result.
See Scripting JMeter Assertions in Groovy - A Tutorial for more details.

using dictionaries in swi-prolog

I'm working on a simple web service in Prolog and wanted to respond to my users with data formatted as JSON. A nice facility is reply_json_dict/1, which takes a dictionary and converts it into an HTTP response with a well-formatted JSON body.
My trouble is that building the response dictionary itself seems a little cumbersome. For example, when I return some data, I have a data id but may or may not have data properties (possibly an unbound variable). At the moment I do the following:
OutDict0 = _{ id : DataId },
( nonvar(Props) -> OutDict1 = OutDict0.put(_{ attributes : Props }) ; OutDict1 = OutDict0 ),
reply_json_dict(OutDict1)
This works fine, so the output is { "id" : "III" } or { "id" : "III", "attributes" : "AAA" } depending on whether or not Props is bound, but... I'm looking for an easier approach, primarily because if I need to add more optional key/value pairs, I end up with multiple implications like:
OutDict0 = _{ id : DataId },
( nonvar(Props) -> OutDict1 = OutDict0.put(_{ attributes : Props }) ; OutDict1 = OutDict0 ),
( nonvar(Time) -> OutDict2 = OutDict1.put(_{ time : Time }) ; OutDict2 = OutDict1 ),
( nonvar(UserName) -> OutDict3 = OutDict2.put(_{ userName : UserName }) ; OutDict3 = OutDict2 ),
reply_json_dict(OutDict3)
And that seems just wrong. Is there a simpler way?
Cheers,
Jacek
Instead of messing with dictionaries, my recommendation in this case is to use a different predicate to emit JSON.
For example, consider json_write/2, which lets you emit JSON, also on current output as the HTTP libraries require.
Suppose your representation of data fields is the common Name(Value) notation that is used throughout the HTTP libraries for option processing:
Fields0 = [attributes(Props),time(Time),userName(UserName)],
Using the meta-predicate include/3, your whole example becomes:
main :-
    Fields0 = [id(DataId),attributes(Props),time(Time),userName(UserName)],
    include(ground, Fields0, Fields),
    json_write(current_output, json(Fields)).
You can try it out yourself, by plugging in suitable values for the individual elements that are singleton variables in the snippet above.
For example, we can (arbitrarily) use:
Fields0 = [id(i9),attributes(_),time('12:00'),userName(_)],
yielding:
?- main.
{"id":"i9", "time":"12:00"}
true.
You only need to emit the suitable Content-Type header, and you have the same output that reply_json_dict/1 would have given you.
You can do it in one step if you use a list to represent all values that need to go into the dict.
?- Props = [a,b,c], get_time(Time),
D0 = _{id:001},
include(ground, [props:Props,time:Time,user:UserName], Fs),
D = D0.put(Fs).
D0 = _17726{id:1},
Fs = [props:[a, b, c], time:1477557597.205908],
D = _17726{id:1, props:[a, b, c], time:1477557597.205908}.
This borrows the idea in mat's answer to use include(ground).
Many thanks, mat and Boris, for the suggestions! I ended up with a combination of your ideas:
dict_filter_vars(DictIn, DictOut) :-
    findall(Key=Value, (get_dict(Key, DictIn, Value), nonvar(Value)), Pairs),
    dict_create(DictOut, _, Pairs).
I can then use it as simply as this:
DictWithVars = _{ id : DataId, attributes : Props, time : Time, userName : UserName },
dict_filter_vars(DictWithVars, DictOut),
reply_json_dict(DictOut)

How to save a record that has_many other records in EmberJs

Please can you help me.
I have a model A:
A = DS.Model.extend
  title: DS.attr('string')
  bs: DS.hasMany('b', {async: true})

`export default A`
and model B:
B = DS.Model.extend
  title: DS.attr('string')
  as: DS.hasMany('a', {async: true})

`export default B`
I cannot seem to save A with some Bs.
I tried different things I found on SO and around the internet,
but the best I accomplished was to get A saved without its Bs.
someB = ... # already exists, loaded from the server

a = @store.createRecord 'a', {
  title: 'sth'
}

a.save().then((a) ->
  a.get('bs').then((bs) ->
    bs.pushObject(someB)
    a.save()
  )
  # I tried with a.save() here as well
)
So A gets saved, but I also want to save A with its bs, so that my server receives a PUT/PATCH on a with {bs: [someID]}.
I have succeeded in making it work, but it is hackish, so if someone knows a better solution please help.
a.save().then((a) =>
  a.get('bs').then((bs) =>
    bs.pushObjects(someBs)
    a.save()
  ).then((a) =>
    a.save()
  )
)
As you can see, there is one save too many, but this is the only way it worked. The first save of a sends bs: nil to the server; the second one sends bs: [someBID, someOtherBID, ...].

Filtering tab completion in input task implementation

I'm currently implementing an SBT plugin for Gatling.
One of its features will be to open the last generated report in a new browser tab from SBT.
As each run can have a different "simulation ID" (basically a simple string), I'd like to offer tab completion on simulation IDs.
An example:
Running the Gatling SBT plugin will produce several folders (named from the simulation ID plus the date of report generation) in target/gatling, for example mysim-20140204234534, myothersim-20140203124534 and yetanothersim-20140204234534.
Let's call the task lastReport.
If someone starts typing lastReport my, I'd like tab completion to suggest only mysim and myothersim.
Getting the simulation ID is a breeze, but how can I help the parser and filter the suggestions so that only existing simulation IDs are suggested?
To sum up, I'd like to do what testOnly does, in a way: I only want to suggest things that make sense in my context.
Thanks in advance for your answers,
Pierre
Edit: As I got a bit stuck after my latest tries, here is the code of my input task, in its current state:
package io.gatling.sbt

import sbt._
import sbt.complete.{ DefaultParsers, Parser }

import io.gatling.sbt.Utils._

object GatlingTasks {

  val lastReport = inputKey[Unit]("Open last report in browser")

  val allSimulationIds = taskKey[Set[String]]("List of simulation ids found in reports folder")

  val allReports = taskKey[List[Report]]("List of all reports by simulation id and timestamp")

  def findAllReports(reportsFolder: File): List[Report] = {
    val allDirectories = (reportsFolder ** (DirectoryFilter && new PatternFilter(reportFolderRegex.pattern))).get
    allDirectories.map(file => (file, reportFolderRegex.findFirstMatchIn(file.getPath).get)).map {
      case (file, regexMatch) => Report(file, regexMatch.group(1), regexMatch.group(2))
    }.toList
  }

  def findAllSimulationIds(allReports: Seq[Report]): Set[String] =
    allReports.map(_.simulationId).distinct.toSet

  def openLastReport(allReports: List[Report], allSimulationIds: Set[String]): Unit = {

    def simulationIdParser(allSimulationIds: Set[String]): Parser[Option[String]] =
      DefaultParsers.ID.examples(allSimulationIds, check = true).?

    def filterReportsIfSimulationIdSelected(allReports: List[Report], simulationId: Option[String]): List[Report] =
      simulationId match {
        case Some(id) => allReports.filter(_.simulationId == id)
        case None     => allReports
      }

    Def.inputTaskDyn {
      val selectedSimulationId = simulationIdParser(allSimulationIds).parsed
      val filteredReports = filterReportsIfSimulationIdSelected(allReports, selectedSimulationId)
      val reportsSortedByDate = filteredReports.sorted.map(_.path)
      Def.task(reportsSortedByDate.headOption.foreach(file => openInBrowser((file / "index.html").toURI)))
    }
  }
}
Of course, openLastReport is called using the results of the allReports and allSimulationIds tasks.
I think I'm close to a functioning input task, but I'm still missing something...
Def.inputTaskDyn returns a value of type InputTask[T] and doesn't perform any side effects. The result needs to be bound to an InputKey, like lastReport. The return type of openLastReport is Unit, which means that openLastReport will construct a value that will be discarded, effectively doing nothing useful. Instead, have:
def openLastReport(...): InputTask[...] = ...
lastReport := openLastReport(...).evaluated
(Or, the implementation of openLastReport can be inlined into the right hand side of :=)
You probably don't need inputTaskDyn, but just inputTask. You only need inputTaskDyn if you need to return a task. Otherwise, use inputTask and drop the Def.task.
