Airflow RedshiftSQLOperator executes different code than the rendered SQL - airflow

I am using Airflow version 2.4.3. I have the following code:
sql_string = """
COPY {}
FROM 's3://{}/{{{{ ds }}}}/{}'
IAM_ROLE '{}'
JSON '{}'
""".format(
    "source_layer.broker_account_nav",
    ib_bucket,
    s3_key_file,
    RS_ROLE,
    "s3://{}/dags/interactive_brokers/json/{}_paths.json".format(
        airflow_bucket, "broker_account_nav"
    ),
)

load_data_nav = RedshiftSQLOperator(
    task_id="load_data_nav",
    dag=dag,
    sql=sql_string,
    redshift_conn_id="redshift_connection",
)
It renders as follows (verified via Rendered Template):
COPY source_layer.accounts
FROM 's3://datalake/2023-02-13/accounts/accounts_nd.json'
IAM_ROLE 'arn:aws:iam::111111:role/redshift_custom_role'
JSON 's3://airflow/dags/brokers/json/accounts_paths.json'
But when the task executes, it runs the following SQL (which I can inspect via Task Instance Details) and fails:
COPY source_layer.accounts
FROM 's3://datalake/{{ ds }}/accounts/accounts_nd.json'
IAM_ROLE 'arn:aws:iam::111111:role/redshift_custom_role'
JSON 's3://airflow/dags/brokers/json/accounts_paths.json'
Please advise on how to resolve this problem.

Related

path not being detected by Nextflow

I'm new to nf-core/Nextflow and, needless to say, the documentation does not always reflect what is actually implemented. I'm defining the basic pipeline below:
nextflow.enable.dsl=2

process RUNBLAST {
    input:
    val thr
    path query
    path db
    path output

    output:
    path output

    script:
    """
    blastn -query ${query} -db ${db} -out ${output} -num_threads ${thr}
    """
}

workflow {
    //println "I want to BLAST $params.query to $params.dbDir/$params.dbName using $params.threads CPUs and output it to $params.outdir"
    RUNBLAST(params.threads, params.query, params.dbDir, params.output)
}
Then I'm executing the pipeline with
nextflow run main.nf --query test2.fa --dbDir blast/blastDB
Then I get the following error:
N E X T F L O W ~ version 22.10.6
Launching `main.nf` [dreamy_hugle] DSL2 - revision: c388cf8f31
Error executing process > 'RUNBLAST'
Caused by:
Not a valid path value: 'test2.fa'
Tip: you can replicate the issue by changing to the process work dir and entering the command bash .command.run
I know test2.fa exists in the current directory:
(nfcore) MN:nf-core-basicblast jraygozagaray$ ls
CHANGELOG.md conf other.nf
CITATIONS.md docs pyproject.toml
CODE_OF_CONDUCT.md lib subworkflows
LICENSE main.nf test.fa
README.md modules test2.fa
assets modules.json work
bin nextflow.config workflows
blast nextflow_schema.json
I also tried with "file" instead of path, but that is deprecated and raises other kinds of errors.
It would be helpful to know how to fix this so I can get started with the pipeline building process.
Shouldn't Nextflow copy the file into the execution path?
Thanks
You get the above error because params.query is not actually a path value. It's probably just a simple String or GString. The solution is to instead supply a file object, for example:
workflow {
    query = file(params.query)
    BLAST( query, ... )
}
Note that a value channel is implicitly created by a process when it is invoked with a simple value, like the above file object. If you need to be able to BLAST multiple query files, you'll instead need a queue channel, which can be created using the fromPath factory method, for example:
params.query = "${baseDir}/data/*.fa"
params.db = "${baseDir}/blastdb/nt"
params.outdir = './results'

db_name = file(params.db).name
db_path = file(params.db).parent

process BLAST {

    publishDir(
        path: "${params.outdir}/blast",
        mode: 'copy',
    )

    input:
    tuple val(query_id), path(query)
    path db

    output:
    tuple val(query_id), path("${query_id}.out")

    """
    blastn \\
        -num_threads ${task.cpus} \\
        -query "${query}" \\
        -db "${db}/${db_name}" \\
        -out "${query_id}.out"
    """
}

workflow {

    Channel
        .fromPath( params.query )
        .map { file -> tuple(file.baseName, file) }
        .set { query_ch }

    BLAST( query_ch, db_path )
}
Note that the usual way to specify the number of threads/CPUs is the cpus directive, which can be configured using a process selector in your nextflow.config. For example:
process {
    withName: BLAST {
        cpus = 4
    }
}

How to use rules_webtesting?

I want to use https://github.com/bazelbuild/rules_webtesting. I am using Bazel 5.2.0.
The whole project can be found here.
My WORKSPACE.bazel file looks like this:
load("#bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")
http_archive(
name = "io_bazel_rules_webtesting",
sha256 = "3ef3bb22852546693c94e9b0b02c2570e74abab6f800fd58e0cbe79492e49c1b",
urls = [
"https://github.com/bazelbuild/rules_webtesting/archive/581b1557e382f93419da6a03b91a45c2ac9a9ec8/rules_webtesting.tar.gz",
],
)
load("#io_bazel_rules_webtesting//web:repositories.bzl", "web_test_repositories")
web_test_repositories()
My BUILD.bazel file looks like this:
load("#io_bazel_rules_webtesting//web:py.bzl", "py_web_test_suite")
py_web_test_suite(
name = "browser_test",
srcs = ["browser_test.py"],
browsers = [
"#io_bazel_rules_webtesting//browsers:chromium-local",
],
local = True,
deps = ["#io_bazel_rules_webtesting//testing/web"],
)
browser_test.py looks like this:
import unittest
from testing.web import webtest


class BrowserTest(unittest.TestCase):
    def setUp(self):
        self.driver = webtest.new_webdriver_session()

    def tearDown(self):
        try:
            self.driver.quit()
        finally:
            self.driver = None

    # Your tests here


if __name__ == "__main__":
    unittest.main()
When I try to run bazel build //... I get (under Ubuntu 20.04 and macOS):
INFO: Invocation ID: 74c03efd-9caa-4174-9fda-42f7ff37e38b
ERROR: error loading package '': Every .bzl file must have a corresponding package, but '@io_bazel_rules_webtesting//web:repositories.bzl' does not have one. Please create a BUILD file in the same or any parent directory. Note that this BUILD file does not need to do anything except exist.
INFO: Elapsed time: 0.038s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)
The error message does not make sense to me, since there is a BUILD file in
https://github.com/bazelbuild/rules_webtesting/blob/581b1557e382f93419da6a03b91a45c2ac9a9ec8/BUILD.bazel
and https://github.com/bazelbuild/rules_webtesting/blob/581b1557e382f93419da6a03b91a45c2ac9a9ec8/web/BUILD.bazel.
I also tried a different version of Bazel - but with the same result.
Any ideas on how to get this working?
You need to add strip_prefix = "rules_webtesting-581b1557e382f93419da6a03b91a45c2ac9a9ec8" to your http_archive call.
For debugging, you can look in the folder where Bazel extracts the archive: bazel-out/../../../external/io_bazel_rules_webtesting. @io_bazel_rules_webtesting//web translates to bazel-out/../../../external/io_bazel_rules_webtesting/web, so if that folder doesn't exist things won't work.
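For example, the http_archive call from the question would then look like this (same name, sha256, and URL as in the question, with only the strip_prefix line added):

http_archive(
    name = "io_bazel_rules_webtesting",
    sha256 = "3ef3bb22852546693c94e9b0b02c2570e74abab6f800fd58e0cbe79492e49c1b",
    strip_prefix = "rules_webtesting-581b1557e382f93419da6a03b91a45c2ac9a9ec8",
    urls = [
        "https://github.com/bazelbuild/rules_webtesting/archive/581b1557e382f93419da6a03b91a45c2ac9a9ec8/rules_webtesting.tar.gz",
    ],
)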

Airflow 2.0 - running locally keeps running the function

I have the below task that keeps running. I know this because it runs a query in Snowflake and I keep getting the DUO push notification. Every. 5. Seconds! What can I do to stop this and have it run only when the DAG runs?
This is the task:
create_foreign_keys = SnowflakeQueryOperator(
    dag=dag,
    task_id='check_and_run_foreign_key_query',
    sql=SnowHook().run_fk_alter_statements(schema, query),
    trigger_rule=TriggerRule.ALL_DONE
)
This is the method being called in the sql part:
def run_fk_alter_statements(self, schema, additional_fk):
    fk_query_path = "/fkeys.sql"
    fd = open(f'{fk_query_path}', 'r')
    query = fd.read()
    fd.close()

    additions = []
    for fk in additional_fk:
        additions.append(f""" or (t2.table_name = '{fk['table_name']}' and t2.column_name = '{fk['column_name']}'
        and t1.table_name = '{fk['ref_table_name']}' and t1.column_name = '{fk['ref_column_name']}')\n""".upper())

    raw_out = self.execute_query(query.format(schema=schema, fks=''.join(additions)), fetch_all=True)

    query_jobs = []
    for raw_query in raw_out:
        query_jobs.append(raw_query[0])

    return query_jobs
The sql=SnowHook().run_fk_alter_statements(schema, query) call in your instantiation of the SnowflakeQueryOperator is top-level code, so it executes every time the DAG file is parsed by the Scheduler. You need to have that function called within an operator's execute() method instead.
You could add a TaskFlow function/PythonOperator task that pushes the output from run_fk_alter_statements() to XCom, and then have the SnowflakeQueryOperator use that XCom to execute the generated SQL.
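A minimal sketch of that approach, assuming the SnowHook, schema, and query objects from the question (the build_fk_statements task id is illustrative; depending on how the operator handles a list of statements, you may also need render_template_as_native_obj=True on the DAG):

from airflow.operators.python import PythonOperator

def _build_fk_statements(**context):
    # Runs only when this task executes, not every time the Scheduler parses the DAG file.
    return SnowHook().run_fk_alter_statements(schema, query)

build_fk_statements = PythonOperator(
    task_id='build_fk_statements',
    python_callable=_build_fk_statements,
    dag=dag,
)

create_foreign_keys = SnowflakeQueryOperator(
    dag=dag,
    task_id='check_and_run_foreign_key_query',
    # The return value above is pushed to XCom automatically; pull it here at execution time.
    sql="{{ ti.xcom_pull(task_ids='build_fk_statements') }}",
    trigger_rule=TriggerRule.ALL_DONE,
)

build_fk_statements >> create_foreign_keys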

Custom command result

When invoking a custom command, I noticed that only the logs are displayed. For example, if my custom command script contains a return statement such as return "great custom command", I can't find it in the result, either with the Java API client or with shell execution.
What can I do to retrieve that result at the end of an execution?
Thanks.
Command definition in service description file:
customCommands ([
    "getText" : "getText.groovy"
])
getText.groovy file content:
def text = "great custom command"
println "trying to get a text"
return text
Assuming that your service file contains the following:
customCommands ([
    "printA" : {
        println "111111"
        return "222222"
    },
    "printB" : "fileB.groovy"
])
And fileB.groovy contains the following code:
println "AAAAAA"
return "BBBBBB"
Then if you run the following command: invoke yourService printA
You will get this:
Invocation results:
1: OK from instance #1..., Result: 222222
invocation completed successfully.
And if you run the following command: invoke yourService printB
You will get this:
Invocation results:
1: OK from instance #1..., Result: AAAAAA
invocation completed successfully.
So if your custom command's implementation is a Groovy closure, its result is the closure's return value.
And if your custom command's implementation is an external Groovy file, its result is the output of its last statement.
HTH,
Tamir.

How to have buildbot running a server and retrieving output after tests

I'd like to run integration tests against a running server and retrieve the server output to check it later.
Here is a cheap way to do so with a local slave (otherwise you'll probably need an extra FileUpload step):
class StopServer(ShellCommand):
    def __init__(self):
        ShellCommand.__init__(self, command=['pkill', '-f', 'my-server-name'],
                              workdir='build/python',
                              description='Stopping test server')

    def createSummary(self, log):
        buildername = self.getProperty("buildername")
        f = '/home/buildbot/slave/%s/build/python/nohup.out' % buildername
        output = open(f, "r").read()
        self.addCompleteLog('server output', output)


class StartServer(ShellCommand):
    def __init__(self):
        ShellCommand.__init__(self, command=['./start-test-server.sh'],
                              workdir='build/python', haltOnFailure=True,
                              description='Starting test server')
The shell script is just a nohup with stderr and stdout redirected.
