Azure ML - Read only error when adding file - azure-machine-learning-studio

So I am creating a simple pipeline to learn AML. I am using OutputFileDatasetConfig to create folders that contain the train and test files.
In the first step I am splitting the baseline dataset into train X and Y and test X and Y and writing them to this location - this is working.
In the second step, I am taking the train data and transforming it, creating a transformed data file. When trying to write it, using exactly the same method as before, I get an error:
"User program failed with OSError: [Errno 30] Read-only file system: '
Snippets from the second step
import argparse
import os
import numpy as np
from azureml.core import Run  # implied by run.log below

run = Run.get_context()

parser = argparse.ArgumentParser()
parser.add_argument('--train_folder', dest='train_folder', required=True)
args = parser.parse_args()

X_train_path = os.path.join(args.train_folder, "X_train.txt")
X_train = np.loadtxt(X_train_path, delimiter=",")

## transformation code here returning X_train_transformed numpy array

run.log('X_train_transf', X_train_transformed.shape)

#this works
np.savetxt(
    os.path.join(args.train_folder, "X_train_transf.txt"),
    X_train_transformed,
    delimiter=",")
I tried different OutputFileDatasetConfig modes, but that does not seem to be the issue.
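Roughly, the folder is declared once and passed to both steps like this (a simplified sketch; the script names, compute target, and dataset name below are placeholders, not the exact values from my pipeline):

from azureml.core import Workspace
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

# folder that step 1 writes the train/test files into
train_folder = OutputFileDatasetConfig(name="train_folder")

split_step = PythonScriptStep(
    name="split",
    script_name="split_step.py",          # placeholder script name
    arguments=["--train_folder", train_folder],
    compute_target="cpu-cluster",         # placeholder compute target
)

transform_step = PythonScriptStep(
    name="transform",
    script_name="transform_step.py",      # placeholder script name
    arguments=["--train_folder", train_folder.as_input(name="train_in")],
    compute_target="cpu-cluster",
)

pipeline = Pipeline(workspace=ws, steps=[split_step, transform_step])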

Error - "User program failed with OSError: [Errno 30] Read-only file
system: '
As error suggests, file has Read only permission, you do not have Write permission for same file.
Try
chmod -R 777 /path/to/your/folder
Or
chmod -R 755 /path/to/your/folder
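If the permission change has to happen from inside the step script itself, the same idea can be expressed in Python. A minimal sketch, reusing args.train_folder from the snippet above (whether this actually succeeds depends on how the folder is mounted):

import subprocess

# Try to make the target folder writable before saving, mirroring the
# chmod suggestion above; this may still fail if the underlying file
# system is mounted read-only rather than merely lacking permissions.
subprocess.run(["chmod", "-R", "777", args.train_folder], check=True)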

Related

Snakemake wildcards: Using wildcarded files from directory output

I'm new to Snakemake and am trying to use specific files in a rule, from the directory() output of another rule that clones a git repo.
Currently, this gives me an error Wildcards in input files cannot be determined from output files: 'json_file', and I don't understand why. I have previously worked through the tutorial at https://carpentries-incubator.github.io/workflows-snakemake/index.html.
The difference between my workflow and the tutorial workflow is that I want to create the data I use later in the first step, whereas in the tutorial, the data was already there.
Workflow description in plain text:
Clone a git repository to path {path}
Run a script {script} on every single JSON file in the directory {path}/parsed/ in parallel to produce the aggregate result {result}
GIT_PATH = config['git_local_path']  # git/
PARSED_JSON_PATH = f'{GIT_PATH}parsed/'
GIT_URL = config['git_url']

# A single parsed JSON file
PARSED_JSON_FILE = f'{PARSED_JSON_PATH}{{json_file}}.json'
# Build a list of parsed JSON file names
PARSED_JSON_FILE_NAMES = glob_wildcards(PARSED_JSON_FILE).json_file
# All parsed JSON files
ALL_PARSED_JSONS = expand(PARSED_JSON_FILE, json_file=PARSED_JSON_FILE_NAMES)

rule all:
    input: 'result.json'

rule clone_git:
    output: directory(GIT_PATH)
    threads: 1
    conda: f'{ENVS_DIR}git.yml'
    shell: f'git clone --depth 1 {GIT_URL} {{output}}'

rule extract_json:
    input:
        cmd='scripts/extract_json.py',
        json_file=PARSED_JSON_FILE
    output: 'result.json'
    threads: 50
    shell: 'python {input.cmd} {input.json_file} {output}'
Running only clone_git works fine (if I set an all input of GIT_PATH).
Why do I get the error message? Is this because the JSON files don't exist when the workflow is started?
Also - I don't know if this matters - this is a subworkflow used with module.
What you need seems to be a checkpoint rule, which is executed first; only then does Snakemake determine which .json files are present and run your extract/aggregate functions. Here's an adapted example:
I'm struggling to fully understand the file and folder structure you get after cloning your git repo, so I have fallen back to the Snakemake best practice of using resources for downloaded files and results for created files.
You'll need to re-adjust those paths to match your case:
GIT_PATH = config["git_local_path"] # git/
GIT_URL = config["git_url"]
checkpoint clone_git:
output:
git=directory(GIT_PATH),
threads: 1
conda:
f"{ENVS_DIR}git.yml"
shell:
f"git clone --depth 1 {GIT_URL} {{output.git}}"
rule extract_json:
input:
cmd="scripts/extract_json.py",
json_file="resources/{file_name}.json",
output:
"results/parsed_files/{file_name}.json",
shell:
"python {input.cmd} {input.json_file} {output}"
def get_all_json_file_names(wildcards):
git_dir = checkpoints.clone_git.get(**wildcards).output["git"]
file_names = glob_wildcards(
"resources/{file_name}.json"
).file_name
return expand(
"results/parsed_files/{file_name}.json",
file_name=file_names,
)
# Rule has checkpoint dependency: Only after the checkpoint is executed
# the rule is executed which then evaluates the function to determine all
# json files downloaded from the git repo
rule aggregate:
input:
get_all_json_file_names
output:
"result.json",
default_target: True
shell:
# TODO: Action which combines all JSON files
edit: Moved the expand(...) from rule aggregate into get_all_json_file_names.
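The aggregation action itself is left as a TODO above. As a hedged illustration only (the helper's name, location, and the choice to merge everything into one JSON array are assumptions, not part of the original workflow), the rule's shell directive could call a small Python helper along these lines:

# scripts/combine_json.py (hypothetical helper for the TODO above)
import json
import sys

def main(out_path, in_paths):
    combined = []
    for path in in_paths:
        with open(path) as fh:
            combined.append(json.load(fh))  # one parsed object per input file
    with open(out_path, "w") as fh:
        json.dump(combined, fh, indent=2)

if __name__ == "__main__":
    # usage: python scripts/combine_json.py result.json results/parsed_files/*.json
    main(sys.argv[1], sys.argv[2:])

with the aggregate rule then using something like shell: "python scripts/combine_json.py {output} {input}".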

define SAMPLE for different dir name and sample name in snakemake code

I have written a Snakemake workflow to run bwa_map. The fastq files are in differently named folders and have different sample names (paired end). It fails with the error 'SAMPLES' is not defined. Please help.
Error:
$ snakemake --snakefile rnaseq.smk mapped_reads/EZ-123-B_IGO_08138_J_2_S101_R2_001.bam -np
NameError in line 2 of /Users/singhh5/Desktop/tutorial/rnaseq.smk:
name 'SAMPLES' is not defined
File "/Users/singhh5/Desktop/tutorial/rnaseq.smk", line 2, in
#SAMPLE DIRECTORY
fastq
    Sample_EZ-123-B_IGO_08138_J_2
        EZ-123-B_IGO_08138_J_2_S101_R1_001.fastq.gz
        EZ-123-B_IGO_08138_J_2_S101_R2_001.fastq.gz
    Sample_EZ-123-B_IGO_08138_J_4
        EZ-124-B_IGO_08138_J_4_S29_R1_001.fastq.gz
        EZ-124-B_IGO_08138_J_4_S29_R2_001.fastq.gz
#My Code
expand("~/Desktop/{sample}/{rep}.fastq.gz", sample=SAMPLES)

rule bwa_map:
    input:
        "data/genome.fa",
        "fastq/{sample}/{rep}.fastq"
    conda:
        "env.yaml"
    output:
        "mapped_reads/{rep}.bam"
    threads: 8
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
The specific error you are seeing is because the variable SAMPLES isn't set to anything before you use it in expand.
Some other issues you may run into:
Output file is missing the {sample} wildcard.
The value of threads isn't passed into bwa or samtools
You should place your expand into the input directive of the first rule in your snakefile, typically called all to properly request the files from bwa_map.
You aren't pairing your reads (R1 and R2) in bwa.
You should look around stackoverflow or some github projects for similar rules to give you inspiration on how to do this mapping.
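As a hedged sketch only (the file-name pattern below is an assumption based on your directory listing, so adjust it to your actual layout), the points above could come together roughly like this:

# Discover sample directories and read-pair prefixes from the fastq/ tree
SAMPLES, REPS = glob_wildcards("fastq/{sample}/{rep}_R1_001.fastq.gz")

rule all:
    input:
        expand("mapped_reads/{sample}/{rep}.bam", zip, sample=SAMPLES, rep=REPS)

rule bwa_map:
    input:
        ref="data/genome.fa",
        r1="fastq/{sample}/{rep}_R1_001.fastq.gz",
        r2="fastq/{sample}/{rep}_R2_001.fastq.gz",
    output:
        "mapped_reads/{sample}/{rep}.bam"
    conda:
        "env.yaml"
    threads: 8
    shell:
        "bwa mem -t {threads} {input.ref} {input.r1} {input.r2} "
        "| samtools view -Sb - > {output}"

The zip in expand keeps each sample paired with its own replicate prefix instead of taking the full cross-product of the two wildcard lists.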

Informatica IPC - UNIX script fail

I have created a Unix script to be executed after the session finishes.
The script basically counts the lines of a specific file and then creates a trailer with this specific structure:
T000014800000000000000000000000000000
T - for trailer
0000148 - number of lines
00000000000000000000000000000 - filler
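Just to make the layout concrete, here is the same record built in a couple of lines (illustration only, in Python; the real script below is plain sh):

# trailer = 'T' + line count left-padded to 7 digits + 29-character zero filler
count = 148
trailer = "T" + str(count).zfill(7) + "0" * 29
print(trailer)  # T000014800000000000000000000000000000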
I have tested the script on a Mac. I know the environments are totally different, but I want to know what needs to be changed in order to execute this script successfully in IPC.
After execution I get the following error message:
The shell command failed with exit code 126.
I invoke the script as follows:
sh -c "$PMRootDir/scripts/exec_trailer_unix.sh $PMRootDir/TgtFiles"
#! /bin/sh
TgtFiles=$1
TgtFilesBody=$TgtFiles/body.txt
TgtFilesTrailer=$TgtFiles/trailer.txt
string1=$(sed -n '$=' $TgtFilesBody)
pad=$(printf '%0.1s' "0"{1..8})
padlength=8
string2='T'
string3=$(printf '%s%*.*s%s\n' "$string2" 0 $((padlength - ${#string1} - ${#string2} )) "$pad" "$string1")
string4='00000000000000000000000000000'
string5=$(printf '%s%*.*s%s\n' "$string3" 0 $((${#string3} - ${#string4} )) "$string4")
echo $string5 > $TgtFilesTrailer
Any idea would be great.
Thanks in advance.
Please check the below points.
It looks like a permission issue. Please log in as the Informatica user (the user that runs the infa daemon) and run this command. You should be able to see the errors.
sh -c "$PMRootDir/scripts/exec_trailer_unix.sh $PMRootDir/TgtFiles"
Sometimes the server variable $PMRootDir doesn't get interpreted in UNIX and can result in a null value. Use echo $PMRootDir to check that it is set after logging into UNIX as the above user.
You can create the trailer file easily using Infa itself.
Just add an Aggregator transformation right before the actual target (group by a dummy field to calculate count(*)). Then add an Expression transformation to create those strings, and then the trailer file target. Just 3 more transformations.
           |--> AGG --> EXP --> Trailer Target file
Final Tr --|
           |--> Final Target

Using rtools grep/pipe combination through a system call

I have a file called goodfile. Let's say the contents are
badline
goodline
badline
goodline
badline
badline
On a Windows machine I want to filter this file to get only the "goodline"s before reading it, to save on memory costs. Thankfully, the rtools installation comes with grep, which should allow me to do that. I should be able to do
if(!pkgbuild::has_rtools()){
  stop('install rtools')
}
rtoolsPath = pkgbuild::rtools_path()
grep = file.path(rtoolsPath, 'grep.exe')
command = paste(grep, "goodline goodfile")
system(command)
and get
goodline
goodline
However when I try to pipe the output to a file by doing
command = paste(grep, "goodline goodfile > betterfile")
system(command)
I get
goodfile:goodline
goodfile:goodline
/usr/bin/grep: >: No such file or directory
/usr/bin/grep: betterfile: No such file or directory
I get this error message and "betterfile" is not generated.
If I take the same command and run it on my command line, it just works; if I do the same system call with regular grep in R on a Linux machine, it also works, so I can't see what the problem is.
I was able to find an alternative way to get the file by doing
system2(grep,
args = c('goodline','goodfile'),stderr = 'betterfile',stdout = 'betterfile')
but still curious why the pipe doesn't work

Unix SQLLDR script gives 'Unexpected End of File' error

All, I am running the following script to load data onto the Oracle server from a Unix box using sqlldr. Earlier it gave me an error saying sqlldr: command not found. I added "SQLPLUS < EOF", but it still gives me an "unexpected end of file" syntax error on line 12, even though there are only 11 lines of code. What seems to be the problem, according to you?
#!/bin/bash
FILES='ls *.txt'
CTL='/blah/blah1/blah2/name/filename.ctl'
for f in $FILES
do
    cat $CTL | sed "s/:FILE/$f/g" >$f.ctl
    sqlplus ID/'PASSWORD'#SERVERNAME << EOF
sqlldr SCHEMA_NAME/SCHEMA_PASSWORD control=$f.ctl data=$f
EOF
done
sqlplus will never know what to do with the command sqlldr. They are two complementary cmd-line utilities for interfacing with Oracle DB.
Note that NO sqlplus or EOF etc. is required to load data into a schema:
#!/bin/bash
# you don't want this: FILES='ls *.txt'
CTL_PATH='/blah/blah1/blah2/name'
CTL_FILE="$CTL_PATH/filename.ctl"
SCHEMA_NM=SCHEMA_NAME
SCHEMA_PSWD=SCHEMA_PASSWORD

for f in *.txt
do
    # don't need cat!  cat $CTL | sed "s/:FILE/$f/g" >"$f".ctl
    sed "s/:FILE/$f/g" "$CTL_FILE" > "$CTL_PATH/$f.ctl"
    # myBad: sqlldr "$SCHEMA_NAME/$SCHEMA_PASSWORD" control="$CTL_PATH/$f.ctl" data="$f"
    # SERVER_NAME below is your DB connect identifier
    sqlldr $SCHEMA_NM/$SCHEMA_PSWD#$SERVER_NAME control="$CTL_PATH/$f.ctl" data="$f" rows=10000 direct=true errors=999
done
Without getting too philosophical, using assignments like FILES=$(ls *.txt) is a bad habit to get into. By contrast, for f in *.txt will deal correctly with files that have odd characters in them (like spaces or other syntax-breaking values). BUT the other habit you do want to get into is to quote all variable references (like $f) with double quotes: "$f", OK? ;-) This is the other side of protection for files with spaces etc. embedded in them.
In the edit update, I've turned your CTL_PATH and CTL_FILE into variables. I think I understand your intent: you have one standard CTL_FILE that you pass through sed to create a table-specific .ctl file (a good approach, in my experience). Note that you don't need to use cat to send a file to sed, but your use of redirection (> $f.ctl) to create the altered file is very shell-like too.
In the 2nd edit update, I looked here on S.O., found an example sqlldr command line with the correct syntax, and modified it to work with your variable names.
To finish up,
A. Are you sure the Oracle Client package is installed on the machine
that you are running your script on?
B. Is the /path/to/oracle/client/tools/bin included in your working
$PATH?
C. Try which sqlldr. If you don't get anything, either it's not
installed or it's not in the path.
D. If not installed, you'll have to get it installed.
E. Once installed, note the directory that contains the sqlldr cmd.
find / -name 'sqlldr*' will take a long time to run, but it will
print out the path you want to use.
F. Take the "path" part of what is returned (like
/opt/oracle/11.2/client/bin/, but not the sqlldr at the end), and
edit the script at the 2nd line with
(Txt added to appease the S.O. Formatter ;-) )
export ORCL_PATH="/path/you/found/to/oracle/client"
export PATH="$ORCL_PATH:$PATH"
These steps should solve any remaining issues. If this doesn't work, see if there is someone where you work that understands your local computing environment that can help explain any missing or different steps.
IHTH
