How to specify wildcards in a filename for an Amazon EMR job

If I run an EMR job and specify wildcards in the directory path, it all works fine.
e.g.: s3n://mybucket/*/*/fileName.gz -- picks up all files named fileName.gz under subdirectories of mybucket.
However, when I specify wildcards in the fileName, the EMR logs show an error that no match was found. It seems to treat the '*' character as a literal part of the fileName instead of as a wildcard.
e.g.: s3n://mybucket/Dir1/fileName.*.gz
gives back an error that no matches were found for fileName.*.gz in that directory.
How do we specify wildcards in the filename for an Amazon EMR job?

Just went through this myself. It is very useful to pass non-globbed wildcard expressions from the start script to Spark/PySpark, because the distribution mechanism inside the Spark program can be efficient when presented with something like this; note the globbing at both the directory level and the filename level:
df = spark.read.json('s3://my-bucket/archive/*/2014/7/G.*.json.bz2')
Not to mention of course that almost all the time you want globbing to occur on the remote resource, not your local launch environment.
The trick is to ensure that the initial shell variable does not get globbed when created and is also protected when presented to aws emr add-steps. Here is a simple launch script that assumes a cluster has already been created. To show it can be done, we also escape the newlines to make the args easier to see. Be careful, however, NOT to re-introduce extra whitespace when doing this!
# Use single quotes to stop globbing at the var level:
DATA_URI='s3://my-bucket/archive/*/2014/7/G.*.json.bz2'
# DO NOT add trailing slash to the output_uri. S3 will
# automatically create subdirs under that. e.g.
# --output_uri s3://$SRC_BUCKET/V4_t
# will be created and populated with many part-0000-... files.
# If you are not renaming or deleting the output_uri for each run,
# make sure your spark program uses overwrite mode for dataframe output e.g.
# dfx.write.mode("overwrite").json(output_uri)
# Careful to protect the DATA_URI arg by wrapping it in single quotes:
aws emr add-steps \
--cluster-id j-3CDMYEF3NJGHR \
--steps Type=Spark,\
Name="myAnalytics",\
ActionOnFailure=CONTINUE,\
Args=[\
s3://$SRC_BUCKET/blunders.py,\
--game_data,\'$DATA_URI\',\
--output_uri,s3://$SRC_BUCKET/V4_t]
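For completeness, here is a minimal sketch of what the Spark side of this might look like. The argument names match the step Args above, but everything else (the argparse usage, the transformations, the output format) is an assumption for illustration, not the actual blunders.py:
import argparse
from pyspark.sql import SparkSession

# Hypothetical skeleton of blunders.py (names assumed, not from the original post).
parser = argparse.ArgumentParser()
parser.add_argument("--game_data")   # receives the un-globbed S3 URI, e.g. s3://my-bucket/archive/*/2014/7/G.*.json.bz2
parser.add_argument("--output_uri")
args = parser.parse_args()

spark = SparkSession.builder.appName("myAnalytics").getOrCreate()

# Spark/Hadoop expands the glob against S3 when the data is read.
df = spark.read.json(args.game_data)

# ... analysis / transformations would go here ...

# Overwrite so repeated runs into the same output_uri do not fail.
df.write.mode("overwrite").json(args.output_uri)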

Related

Snakemake: wildcards do not expand in script line of rule

I am running a pipeline and was trying to optimize it by declaring the paths in a config file (config.yaml). The config.yaml file contains the path to the scripts that run inside the pipeline, but when I expand the wildcard for the path, the pipeline does not run the script. The script itself runs fine.
To explain my problem:
rule with_script:
    input: someinput
    output: someoutput
    script: expand("{script_path}/scriptfile", script_path = config[scriptpath])
input, output or rule all do not contain the script's path wildcard, so here is the first time I'm declaring it. The config.yaml line that contains the path looks like this:
scriptpath: /path/to/the/script
Is there a way to keep the wildcard and the config file path (to make it easier for others to make changes if needed) and have the script work? As it is, Snakemake doesn't even enter the script file. Or maybe it is possible to declare global wildcards outside of rule all?
Thank you for your help!
P.S.: I'm sorry if this question has already been answered, but I couldn't find anything to help me with this.
You cannot use a function like expand() in the script section. Snakemake expects a plain path to your script.
As the documentation states:
The script path is always relative to the Snakefile containing the directive (in contrast to the input and output file paths, which are relative to the working directory). It is recommended to put all scripts into a subfolder "scripts"
If you need to define different paths to your scripts, you can always do it in python outside of your rules. Don't forget, all python code outside of rules is executed before building the DAG. Thus, you can define all variables you want and use them in your rules.
SCRIPTSPATH = config["scriptpath"]
rule with_script:
    input: someinput
    output: someoutput
    script: "{SCRIPTSPATH}/scriptfile"
Note:
Do not mix wildcards and "variables". In an expand function such as
expand("{script_path}/scriptfile", script_path = config[scriptpath])
{script_path} is not a wildcard but just a placeholder for the values given in the second parameter of the function.
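To illustrate the placeholder behaviour, here is a small generic example (the template and the sample values are made up, not taken from the question): expand() simply substitutes the keyword-argument values into the placeholder and returns the resulting list of paths.
from snakemake.io import expand

# {sample} is a placeholder filled from the keyword argument below,
# not a wildcard inferred by Snakemake from inputs/outputs.
print(expand("results/{sample}.txt", sample=["a", "b"]))
# ['results/a.txt', 'results/b.txt']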

scp_download to download multiple files based on a pattern?

I need to download many files from a server (specifically Tectia), ideally using the ssh package. These files all follow a predictable pattern across multiple subfolders. The filepath is formatted like this:
/directory/subfolder/A001/abcde001.csv
Where A001 counts up alongside the last 3 digits of the filename (/A002/abcde002.csv and so on)
In the vignette for scp_download it states that the files parameter may contain wildcards so I have tried to do something like
scp_download(session, "/directory/subfolder/A.*/abcde.*[.]csv", to=tempdir())
and
scp_download(session, "directory/subfolder/A\\d{3}/abcde\\d{3}[.]csv", to=tempdir())
but no matter which combination of patterns or wildcards I can think of (which isn't many) I only get something like
Warning: SSH warning: scp: /directory/subfolder/A\d{3}/abcde\d{3}[.]csv: No such file or directory
What I'm hoping to do is either find a way to do pattern matching here, or find a way to store Tectia directory listings as a string to be read by scp_download. I've made sure that my session is connected properly, and downloads work when I don't attempt to pattern match.
I had the same problem. The problem is that when you use * in your pattern, it gets escaped when you send it to the server. However, when you request a specific file name like /directory/subfolder/A001/abcde001.csv, it works fine.
In the end I changed my code to follow these steps:
Get the list of files/folders using the ls command with the ssh_exec_wait function and store it in a variable.
Download the files in that variable separately.
session <- ssh_connect("username@ip", passwd = "password")
files<-capture.output(ssh_exec_wait(session, command = 'ls /directory/subfolder/A001/*'))
dnc1<- scp_download(session, files[1], to = paste0(getwd(),"/data/"))
dnc2<- scp_download(session, files[2], to = paste0(getwd(),"/data/"))
dnc3<- scp_download(session, files[3], to = paste0(getwd(),"/data/"))
The last three commands can be done in a loop, as there could be hundreds or thousands of files.
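The same list-then-filter-then-download idea, sketched in Python with paramiko purely to illustrate the approach (host, credentials and paths are placeholders; this is not part of the original R answer): expand the pattern client-side against a remote directory listing, then fetch each match individually.
import fnmatch
import os
import paramiko

# Placeholder connection details.
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("ip", username="username", password="password")

sftp = client.open_sftp()
base = "/directory/subfolder"
os.makedirs("data", exist_ok=True)

# List each A### folder, match file names client-side, download one by one.
for folder in sftp.listdir(base):
    if not fnmatch.fnmatch(folder, "A[0-9][0-9][0-9]"):
        continue
    for name in sftp.listdir(f"{base}/{folder}"):
        if fnmatch.fnmatch(name, "abcde[0-9][0-9][0-9].csv"):
            sftp.get(f"{base}/{folder}/{name}", os.path.join("data", name))

sftp.close()
client.close()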

Pass a path argument having brackets to a MS DOS Batch file

I need to achieve the below using a Robot Framework script:
c:\>runbatch "C:\Program Files (x86)\tool\bin\test.exe" C:\tool\get.ini
where runbatch is an MS-DOS batch file, and "C:\Program Files (x86)\tool\bin\test.exe" and C:\tool\get.ini are parameters to the batch file. The first argument contains the path of a tool that has "(" and ")" in its path.
So in my Robot script I have a variable like below:
${tool_path} "C:\\Program Files (x86)\\tool\\bin\\test.exe"
${tool_ini} "C:\tool"
And invoke like below:
${RC}= Run Process ${CURDIR}/../scripts/runbatch.bat ${tool_path} ${tool_ini}\\get.ini
The execution fails, but note that when I run the batch file standalone from the command line with the same parameters, it works fine.
In the batch file I added statements to just log the arguments, and I found that they are completely distorted: the tool_path value becomes ("\"C:\Program) and the second argument becomes (Files ). How can I fix the issue in the Robot script so that a path containing parentheses is passed through unmodified?
You need to also escape the backslashes in ${tool_ini} - make its value c:\\tool; that's not the culprit though, just something else to change.
Remove the double quotes in the arguments' values - Run Process does not need them in the way you are calling it, with a keyword argument per script argument. E.g.:
${tool_path} C:\\Program Files (x86)\\tool\\bin\\test.exe
${tool_ini} C:\\tool
${RC}= Run Process ${CURDIR}/../scripts/runbatch.bat ${tool_path} ${tool_ini}\\get.ini
The way you've put them, they have become a part of the value itself.
Alternatively, keeping the double quotes there, you can call the script with all arguments in the call line:
${tool_path} "C:\\Program Files (x86)\\tool\\bin\\test.exe"
${tool_ini} C:\\tool
${RC}= Run Process ${CURDIR}/../scripts/runbatch.bat ${tool_path} "${tool_ini}\\get.ini"
(the second one doesn't really need quotes, but I've added them for consistency)
By the way, not really an issue, yet - the script path uses slashes (/), which is a bit unorthodox for Windows. Contrary to popular belief, the OS supports this path delimiter pretty much the same way as it supports backslashes (\); it's just not widely used and looks a bit out of place.

How to execute a command with multiple parameters from a file in gradle?

I have the below command in a .txt file:
java -jar /path/to/something.jar --classpath="/path/to/something/other.jar" --url="something:#127.0.0.1:1234:TEST12" --driver=some.driver update
As can be seen, multiple parameters with different syntaxes (with -, --, and with and without "") are used.
I tried the following code:
task test(type: Exec) {
workingDir '/path/to/working/dir'
String commandFromFile = new File('/path/to/file/with/command' + 'filewithcommand.txt').getText('UTF-8')
commandLine commandFromFile
}
On Windows platforms this code works, but on Unix it doesn't.
As you can see in the documentation of the Exec task, you should split up your command into its parts. So doing commandLine commandFromFile.split(' ') should work if you do not have spaces in your arguments. If you do, you need a more sophisticated way of splitting the command that takes quotes into account.
Or you change the format of your command file so that it has one argument per line and you use .readLines('UTF-8') instead of .getText('UTF-8').
I'm not 100% sure about the following, but it could be that you also have to remove the quoting around arguments even if they contain spaces, as you give the arguments as single entities to the commandLine call and thus need no quoting to escape spaces here. Depending on the OS and the tool you call, it could even break the command if there are quotes that it cannot handle.
Alternatively, but that is the worse method imho, you can also do something like
if (windows) {
commandLine 'cmd', '/c', commandFromFile
} else {
commandLine 'sh', '-c', commandFromFile
}
where the command processor then does the splitting and so on. There you need the quotes and such, of course. The windows variable in this example needs to be determined, e.g. from system properties.

HISTIGNORE not working in zsh

I have added
export HISTIGNORE="ls:cd:pwd:exit:cd .."
to my .zshrc file.
I deleted .zsh_history and restarted the terminal, but it still won't ignore those commands.
The zsh shell doesn't use the HISTIGNORE environment variable. Instead, it has a HISTORY_IGNORE parameter.
From the zshparam manual:
HISTORY_IGNORE
If set, is treated as a pattern at the time history files are written. Any potential history entry that matches the pattern is skipped. For example, if the value is fc * then commands that invoke the interactive history editor are never written to the history file.
Note that HISTORY_IGNORE defines a single pattern: to specify alternatives use the (first|second|...) syntax.
So in your case, you would want to do
HISTORY_IGNORE="(ls|cd|pwd|exit|cd ..)"
or something similar.
Notice that this affects only history written to the history file, not the history in the currently active shell session, as far as I can see.
