Can I use the output of one DynamicResource after a map() for multiple solids? - dagster

I am doing something similar to the dynamic mapping and collect example in the documentation. The example lists files in a directory, maps each to a solid which computes the file size, and then collects the output for a summary of the overall size.
However, I would like to run multiple solids in parallel on each file. So to continue with the example: I would list files in a directory; then map so that for each file I would compute the size, check the file permissions, and compute the md5sum, all in parallel; and finally collect the output.
I can run these in sequence on each file with something like:
file_results = (
    list_files()
    .map(compute_size)
    .map(check_permissions)
    .map(compute_md5sum)
)
summarize(file_results.collect())
But if these aren't actually serial dependencies, it would be nice to parallelize the work on each file. Is there some syntax like this:
file_results = list_files().map(
    compute_md5sum(check_permissions(compute_size)))
summarize(file_results.collect())

If I understand correctly, something like this should accomplish what you are looking for:
def _process_file(file):
    size = compute_size(file)
    perms = check_permissions(file)
    md5 = compute_md5sum(file)
    return summarize_file(size, perms, md5)

file_results = list_files().map(_process_file)
summarize(file_results.collect())
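For reference, here is a minimal self-contained sketch of that pattern using Dagster's current op/job API (the successor of solids/pipelines). Everything in it is a placeholder invented for illustration: the op bodies, the hard-coded file list, and the name file_job are not from the question.

from dagster import DynamicOut, DynamicOutput, job, op

@op(out=DynamicOut())
def list_files():
    # Placeholder: emit one DynamicOutput per file
    for i, path in enumerate(["a.txt", "b.txt"]):
        yield DynamicOutput(path, mapping_key=f"file_{i}")

@op
def compute_size(path: str) -> int:
    return len(path)  # placeholder for a real size computation

@op
def check_permissions(path: str) -> str:
    return "rw"  # placeholder

@op
def compute_md5sum(path: str) -> str:
    return "d41d8cd9"  # placeholder

@op
def summarize_file(size: int, perms: str, md5: str) -> dict:
    return {"size": size, "perms": perms, "md5": md5}

@op
def summarize(per_file):
    return len(per_file)

@job
def file_job():
    def _process_file(file):
        # These three ops have no dependencies on each other,
        # so they form a parallel fan-out for each mapped file.
        return summarize_file(
            compute_size(file),
            check_permissions(file),
            compute_md5sum(file),
        )

    summarize(list_files().map(_process_file).collect())

With a multiprocess executor, the three per-file ops should then be able to run concurrently rather than in sequence.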

Related

What is the best way to use expand() with one unknown variable in Snakemake?

I am currently using Snakemake for a bioinformatics project. Given a human reference genome (hg19) and a bam file, I want to be able to specify that there will be multiple output files with the same name but different extensions. Here is my code:
rule gridss_preprocess:
    input:
        ref=config['ref'],
        bam=config['bamdir'] + "{sample}.dedup.downsampled.bam",
        bai=config['bamdir'] + "{sample}.dedup.downsampled.bam.bai"
    output:
        expand(config['bamdir'] + "{sample}.dedup.downsampled.bam{ext}", ext=config['workreq'], sample="{sample}")
Currently config['workreq'] is a list of extensions that start with "."
For example, I want to be able to use expand to indicate the following files
S1.dedup.downsampled.bam.cigar_metrics
S1.dedup.downsampled.bam.computesamtags.changes.tsv
S1.dedup.downsampled.bam.coverage.blacklist.bed
S1.dedup.downsampled.bam.idsv_metrics
I want to be able to do this for multiple sample files, S_. Currently I am not getting an error when I try to do a dry run. However, I am not sure if this will run properly.
Am I doing this right?
expand() defines a list of files. If you use two parameters, the Cartesian product will be used, so your rule will define as output ALL files with your extension list for ALL samples. Since you define a wildcard in your input, I think what you want is all files with your extensions for ONE sample, with the rule executed as many times as there are samples.
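To illustrate that Cartesian product with made-up sample names and extensions (not from the question):

# expand("{s}.bam{e}", s=["S1", "S2"], e=[".idx", ".md5"]) yields the four combinations
# "S1.bam.idx", "S1.bam.md5", "S2.bam.idx", "S2.bam.md5" (one entry per sample/extension pair)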
You're mixing up wildcards and placeholders for the expand() function. You can define a wildcard inside an expand() by doubling the braces:
rule all:
    input:
        expand(config['bamdir'] + "{sample}.dedup.downsampled.bam{ext}", ext=config['workreq'], sample=SAMPLELIST)

rule gridss_preprocess:
    input:
        ref=config['ref'],
        bam=config['bamdir'] + "{sample}.dedup.downsampled.bam",
        bai=config['bamdir'] + "{sample}.dedup.downsampled.bam.bai"
    output:
        expand(config['bamdir'] + "{{sample}}.dedup.downsampled.bam{ext}", ext=config['workreq'])
This expand() call will expand into the list
{sample}.dedup.downsampled.bam.cigar_metrics
{sample}.dedup.downsampled.bam.computesamtags.changes.tsv
{sample}.dedup.downsampled.bam.coverage.blacklist.bed
{sample}.dedup.downsampled.bam.idsv_metrics
and thus define the wildcard sample to match the files in the input.
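SAMPLELIST in rule all is assumed to hold the sample names. As a minimal sketch of one way to define it near the top of the Snakefile, assuming the BAM files already exist on disk and config['bamdir'] ends with a path separator as in the rules above:

# Sketch only: derive the sample names from the BAM files already present
SAMPLELIST, = glob_wildcards(config['bamdir'] + "{sample}.dedup.downsampled.bam")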

scp_download to download multiple files based on a pattern?

I need to download many files from a server (specifically Tectia), ideally using the ssh package. These files all follow a predictable pattern across multiple subfolders. The filepath is formatted like this:
/directory/subfolder/A001/abcde001.csv
Where A001 counts up alongside the last 3 digits of the filename (/A002/abcde002.csv and so on)
The vignette for scp_download states that the files parameter may contain wildcards, so I have tried something like
scp_download(session, "/directory/subfolder/A.*/abcde.*[.]csv", to=tempdir())
and
scp_download(session, "directory/subfolder/A\\d{3}/abcde\\d{3}[.]csv", to=tempdir())
but no matter which combination of patterns or wildcards I try (and I can't think of many), I only get something like
Warning: SSH warning: scp: /directory/subfolder/A\d{3}/abcde\d{3}[.]csv: No such file or directory
What I'm hoping to do is either find a way to do pattern matching here, or find a way to store the Tectia directories as a string to be read by scp_download. I've made sure that my session is connected properly, and the download works when I don't attempt to pattern match.
I had the same problem. The problem is that when you use * in your pattern, it gets escaped when it is sent to the server. However, when you request a specific file name like /directory/subfolder/A001/abcde001.csv, it works fine.
Finally, I changed my code based on the steps below:
1. Get the list of files using the ls command with the ssh_exec_wait function and store it in a variable.
2. Download the files in that variable separately.
session <- ssh_connect("username@ip", passwd="password")
files<-capture.output(ssh_exec_wait(session, command = 'ls /directory/subfolder/A001/*'))
dnc1<- scp_download(session, files[1], to = paste0(getwd(),"/data/"))
dnc2<- scp_download(session, files[2], to = paste0(getwd(),"/data/"))
dnc3<- scp_download(session, files[3], to = paste0(getwd(),"/data/"))
The last three commands can be done in a loop, since there could be hundreds or thousands of files.

Looping through files in an array and executing a command in IDL

I have an array with several files in it, and I want to loop through these files. For each file I want to run a command.
result = ['rtlvis_20190518_13.35.48_00001.bin', 'rtlvis_20190518_13.35.48_00002.bin', 'rtlvis_20190518_13.35.48_00003.bin', 'rtlvis_20190518_13.35.48_00004.bin', 'rtlvis_20190518_13.35.48_00005.bin']
Something like: for each file in result, run the command read_rtlvis_v12,a,c,t,g,FILE="file",/CFILEONLY
where file is each one of the files in result.
I've tried the following:
FOREACH file, result do begin read_rtlvis_v12,a,c,t,g,FILE="file",/CFILEONLY
The error I get is inside the read_rtlvis_v12 code. But my question is: is this the right way to do a loop with this kind of command?
Am I setting FILE="file" correctly, where file is each one of the files in result?
Do not use quotes around file: FILE="file" passes the literal string "file" rather than the loop variable. Use:
foreach file, result do read_rtlvis_v12, a, c, t, g, FILE=file, /CFILEONLY

Tensorflow: How to convert .meta, .data and .index model files into one graph.pb file

In TensorFlow, training from scratch produced the following six files:
events.out.tfevents.1503494436.06L7-BRM738
model.ckpt-22480.meta
checkpoint
model.ckpt-22480.data-00000-of-00001
model.ckpt-22480.index
graph.pbtxt
I would like to convert them (or only the ones needed) into a single graph.pb file so I can transfer it to my Android application.
I tried the freeze_graph.py script, but it requires an existing input .pb file as input, which I do not have (I only have the six files mentioned above). How do I proceed to get a single frozen_graph.pb file? I saw several threads, but none of them worked for me.
You can use this simple script to do that. But you must specify the names of the output nodes.
import tensorflow as tf

meta_path = 'model.ckpt-22480.meta'  # Your .meta file
output_node_names = ['output']       # Output node names (without the ':0' tensor suffix)

with tf.Session() as sess:
    # Restore the graph
    saver = tf.train.import_meta_graph(meta_path)

    # Load weights from the directory that contains the checkpoint files
    saver.restore(sess, tf.train.latest_checkpoint('path/to/checkpoint/dir'))

    # Freeze the graph
    frozen_graph_def = tf.graph_util.convert_variables_to_constants(
        sess,
        sess.graph_def,
        output_node_names)

    # Save the frozen graph
    with open('output_graph.pb', 'wb') as f:
        f.write(frozen_graph_def.SerializeToString())
If you don't know the name of the output node or nodes, there are two ways:
1. You can explore the graph and find the name with Netron or with the console summarize_graph utility.
2. You can use all the nodes as outputs, as shown below.
output_node_names = [n.name for n in tf.get_default_graph().as_graph_def().node]
(Note that you have to put this line just before the convert_variables_to_constants call.)
But I think that's an unusual situation, because if you don't know the output node, you cannot actually use the graph.
As it may be helpful for others, I am also answering here, following up on the answer on GitHub ;-).
I think you can try something like this (with the freeze_graph script in tensorflow/python/tools):
python freeze_graph.py \
    --input_graph=/path/to/graph.pbtxt \
    --input_checkpoint=/path/to/model.ckpt-22480 \
    --input_binary=false \
    --output_graph=/path/to/frozen_graph.pb \
    --output_node_names="the nodes that you want to output e.g. InceptionV3/Predictions/Reshape_1 for Inception V3"
The important flag here is --input_binary=false, as the file graph.pbtxt is in text format; I think it corresponds to the required graph.pb, which is the equivalent in binary format.
Concerning output_node_names, that part is still confusing for me, but you can use the summarize_graph tool in TensorFlow, which can take either the .pb or the .pbtxt as input.
Regards,
Steph
I tried the freeze_graph.py script, but the output_node_names parameter was totally confusing and the job failed.
So I tried the other one: export_inference_graph.py.
And it worked as expected!
python -u /tfPath/models/object_detection/export_inference_graph.py \
--input_type=image_tensor \
--pipeline_config_path=/your/config/path/ssd_mobilenet_v1_pets.config \
--trained_checkpoint_prefix=/your/checkpoint/path/model.ckpt-50000 \
--output_directory=/output/path
The tensorflow installation package I used is from here:
https://github.com/tensorflow/models
First, use the following code to generate the graph definition file (graph.pbtxt).
import tensorflow as tf

with tf.Session() as sess:
    # Restore the graph structure from the .meta file
    _ = tf.train.import_meta_graph(args.input)

    # Save the graph definition as text (graph.pbtxt)
    g = sess.graph
    gdef = g.as_graph_def()
    tf.train.write_graph(gdef, ".", args.output, True)
Then, use summarize_graph to get the output node names.
Finally, use
python freeze_graph.py \
    --input_graph=/path/to/graph.pbtxt \
    --input_checkpoint=/path/to/model.ckpt-22480 \
    --input_binary=false \
    --output_graph=/path/to/frozen_graph.pb \
    --output_node_names="the nodes that you want to output e.g. InceptionV3/Predictions/Reshape_1 for Inception V3"
to generate the frozen graph.
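As a quick sanity check (not part of the answers above), you can load the resulting frozen graph and list a few of its ops. A minimal sketch, assuming TF 1.x as in the code above and an output file named frozen_graph.pb:

import tensorflow as tf

# Read the frozen GraphDef back from disk
with open('frozen_graph.pb', 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

# Import it into a fresh graph and print the first few node names
with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='')
    for op in graph.get_operations()[:10]:
        print(op.name)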

How to create a new output file in R if a file with that name already exists?

I am trying to run an R script, using Windows Task Scheduler, every two hours. The script gathers some tweets through the Twitter API and runs a sentiment analysis that produces two graphs and saves them in a directory. The problem is that when the script runs again, it replaces the already existing files with those names in the directory.
As an example, when I used the pdf("file") function, it ran fine the first time because no file with that name already existed in the directory. The problem is that I want the R script to run every two hours, so I need a solution that creates a new file in the directory instead of replacing the existing one, just like what happens when a file is downloaded multiple times in Google Chrome.
I'd just time-stamp the file name.
> library(lubridate)   # now() comes from lubridate
> filename = paste("output-", now(), sep="")
> filename
[1] "output-2014-08-21 16:02:45"
Use any of the standard date formatting functions to customise to taste - maybe you don't want spaces and colons in your file names:
> filename = paste("output-",format(Sys.time(), "%a-%b-%d-%H-%M-%S-%Y"),sep="")
> filename
[1] "output-Thu-Aug-21-16-03-30-2014"
If you want the behaviour of adding a number to the file name, then something like this:
serialNext = function(prefix){
  if(!file.exists(prefix)){return(prefix)}
  i = 1
  repeat {
    f = paste(prefix, i, sep=".")
    if(!file.exists(f)){return(f)}
    i = i + 1
  }
}
Usage. First, "foo" doesn't exist, so it returns "foo":
> serialNext("foo")
[1] "foo"
Write a file called "foo":
> cat("fnord",file="foo")
Now it returns "foo.1":
> serialNext("foo")
[1] "foo.1"
Create that, then it returns "foo.2" and so on...
> cat("fnord",file="foo.1")
> serialNext("foo")
[1] "foo.2"
This kind of thing can break if more than one process might be writing a new file though - if both processes check at the same time there's a window of opportunity where both processes don't see "foo.2" and think they can both create it. The same thing will happen with timestamps if you have two processes trying to write new files at the same time.
Both of these issues can be resolved by generating a random UUID and pasting that onto the filename; otherwise you need something that's atomic at the operating-system level.
But for a twice-hourly job I reckon a timestamp down to minutes is probably enough.
See ?files for file manipulation functions. You can check if file exists with file.exists, and then either rename the existing file, or create a different name for the new one.
