Pass pyfiles and arguments to DataProcPySparkOperator - airflow

I am trying to pass arguments and zipped pyfiles to a temporary Dataproc cluster in Composer:
spark_args = {
    'conn_id': 'spark_default',
    'num_executors': 2,
    'executor_cores': 2,
    'executor_memory': '2G',
    'driver_memory': '2G',
}
task = dataproc_operator.DataProcPySparkOperator(
    task_id='spark_preprocess_{}'.format(name),
    project_id=PROJECT_ID,
    cluster_name=CLUSTER_NAME,
    region='europe-west4',
    main='gs://my-bucket/dist/main.py',
    pyfiles='gs://my-bucket/dist/jobs.zip',
    dataproc_pyspark_properties=spark_args,
    arguments=['--name', 'test', '--date', self.date_exec],
    dag=subdag
)
But I get the following error. Any idea how to correctly format the arguments?
Invalid value at 'job.pyspark_job.properties[1].value' (TYPE_STRING)

As pointed out in the comment, the issue is that spark_args has non-string values, but per the error message it should contain only strings:
Invalid value at 'job.pyspark_job.properties[1].value' (TYPE_STRING)
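A minimal sketch of that fix, assuming the rest of the operator call stays unchanged: cast every value to a string before passing the dict as dataproc_pyspark_properties. The name spark_properties is introduced here for illustration only, and the sketch does not check whether the keys are property names Dataproc recognizes.

```python
spark_args = {
    'conn_id': 'spark_default',
    'num_executors': 2,
    'executor_cores': 2,
    'executor_memory': '2G',
    'driver_memory': '2G',
}

# The error asks for TYPE_STRING values, so convert every value to str.
# spark_properties is a hypothetical name, not part of the original code.
spark_properties = {key: str(value) for key, value in spark_args.items()}

print(spark_properties)
```

The resulting dict could then be passed as dataproc_pyspark_properties=spark_properties in place of spark_args.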


Airflow: AirflowException: Issues in reading JSON template variable

Requirement: I am trying to avoid using Variable.get() and instead use the Jinja template {{var.json.variable}}.
I have defined the variables in JSON format, as in the example below, and stored them in the secret manager as snflk_json
snflk_json
{
    "snwflke_acct_request_memory": "4000Mi",
    "snwflke_acct_limit_memory": "4000Mi",
    "schedule_interval_snwflke_acct": "0 12 * * *",
    "LIST": [
        "ABC.DEV", "CDD.PROD"
    ]
}
Issue 1: Unable to retrieve schedule interval from the JSON variable
Error : Invalid timetable expression: Exactly 5 or 6 columns has to be specified for iterator expression.
Tried to use in the dag as below
schedule_interval = '{{var.json.snflk_json.schedule_interval_snwflke_acct}}',
Issue 2:
I am trying to loop over LIST to create one task per entry. I tried the following, but in vain:
with DAG(
    dag_id = dag_id,
    default_args = default_args,
    schedule_interval = '{{var.json.usage_snwflk_acct_admin_config.schedule_interval_snwflke_acct}}' ,
    dagrun_timeout = timedelta(hours=3),
    max_active_runs = 1,
    catchup = False,
    params = {},
    tags=tags
) as dag:
    shares = '{{var.json.snflk_json.LIST}}'
    for s in shares:
        sf_tasks = SnowflakeOperator(
            task_id=f"{s}" ,
            snowflake_conn_id= snowflake_conn_id,
            sql=sqls,
            params={"sf_env": s},
        )
Error
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/baseoperator.py", line 754, in __init__
validate_key(task_id)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/helpers.py", line 63, in validate_key
raise AirflowException(
airflow.exceptions.AirflowException: The key '{' has to be made of alphanumeric characters, dashes, dots and underscores exclusively
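Airflow parses the DAG file every few seconds (30 by default), and Jinja templates are not rendered at parse time; they are only rendered in an operator's templated fields when a task runs. So the for loop actually iterates over the literal string {{var.json.snflk_json.LIST}} character by character, and the first task_id it tries is '{', which is why you get that error. The same applies to schedule_interval: it is a DAG argument, not a templated field, so the literal template string is handed to the cron parser.
You should use Dynamic Task Mapping (available from version 2.3), or put the code in a Python task that creates and executes the new tasks.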
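The failure can be reproduced without Airflow. The sketch below only illustrates what the loop sees at parse time and what it would see if the JSON were already parsed; snflk_json_raw is a hypothetical stand-in for the stored variable, not a call into Airflow.

```python
import json

# At parse time the template is still a plain string, so the loop
# walks over its characters instead of the list entries.
shares_template = '{{var.json.snflk_json.LIST}}'
parse_time_ids = [s for s in shares_template]
print(parse_time_ids[0])  # prints "{", the key rejected by validate_key

# Hypothetical stand-in for the stored variable (assumption for this sketch).
snflk_json_raw = '{"LIST": ["ABC.DEV", "CDD.PROD"]}'
shares = json.loads(snflk_json_raw)["LIST"]
task_ids = [f"{s}" for s in shares]
print(task_ids)
```

Once the value is a real Python list at parse time, each entry is made only of letters and dots, which the task_id validation in the traceback accepts.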

R: Error handling with tryCatchLog - create a customizable result code for the console and write a detailed traceback to log file

I have code that includes several initial checks of different parameter values. The code is part of a larger project involving several R scripts as well as calls from other environments. If a parameter value does not pass one of the checks, I want to
Generate a customizable result code
Skip the remaining code (which is not going to work anyhow if the parameters are wrong)
Create a log entry with the line where the error was thrown (which tells me which test was not passed by the parameters)
Print my customizable result code to the console (without a more detailed explanation / trace back from the error)
Otherwise, the remaining code should be run. If there are other errors (not thrown by me), I also need error handling that results in a customizable general result code (signalling that there was an error, but that it was not one thrown by me) and a more detailed log.
The result codes are part of the communication with a larger environment and just distinguishes between wrong parameter values (i.e., errors thrown by me) and other internal problems (that might occur later in the script).
I would like to use tryCatchLog because it allows me to log a detailed traceback including the script name (I am sourcing my own code) and the line number. I have not figured out, however, how to generate my own error code (currently I am doing this via the base function stop()) and pass this along using tryCatchLog (while also writing a log).
Example
In the following example, my parameter_check() throws an error via stop() with my result code "400". Using tryCatchLog I can catch the error and get a detailed error message including a traceback. However, I want to separate my own error code (just "400"), which should be printed to the console, from the more detailed error message, which should go to a log file.
library(tryCatchLog)
parameter_check <- function(error) {
  if (error) {
    stop("400")
    print("This line should not appear")
  }
}
print("Beginning")
tryCatchLog(parameter_check(error = TRUE),
  error = function(e) {print(e)}
)
print("End")
Currently, the result is:
[1] "Beginn"
ERROR [2021-12-08 11:43:38] 400
Compact call stack:
1 tryCatchLog(parameter_check(0), error = function(e) {
2 #3: stop("400")
Full call stack:
1 tryCatchLog(parameter_check(0), error = function(e) {
print(e)
2 tryCatch(withCallingHandlers(expr, condition =
cond.handler), ..., finall
3 tryCatchList(expr, classes, parentenv, handlers)
4 tryCatchOne(expr, names, parentenv, handlers[[1]])
5 doTryCatch(return(expr), name, parentenv, handler)
6 withCallingHandlers(expr, condition = cond.handler)
7 parameter_check(0)
8 #3: stop("400")
9 .handleSimpleError(function (c)
{
if (inherits(c, "condition")
<simpleError in parameter_check(0): 400>
I would like to get my own result code ("400") so that I can print it to the console while logging the complete error message in a file. Is there a way of doing it without writing code parsing the error message, etc.?
Solution with tryCatch
Based on the hint by R Yoda and this answer, here is a solution with tryCatch and calling handlers.
### Parameters
log_file_location <- "./logs/log.txt"
### Defining functions
parameter_check_1 <- function(error) {
  if (error) {
    stop("400")
  }
}
parameter_check_2 <- function(error) {
  if (error) {
    stop("400")
  }
}
write_to_log <- function(file_location, message) {
  if (file.exists(file_location)) {
    write(message, file_location, append = TRUE)
  } else {
    write(message, file_location, append = FALSE)
  }
}
parameter_check <- function() {
  print("Beginning of parameter check")
  print("First check")
  parameter_check_1(error = TRUE)
  print("Second check")
  parameter_check_2(error = FALSE)
  print("End of parameter check")
}
main <- function() {
  print("Beginning of main function")
  log(-1) # throws warning
  log("error") # throws error
  print("End of main function")
}
### Setting parameters
result_code_no_error <- "200"
result_code_bad_request <- "400"
result_code_internal_error <- "500"
# initial value for result_code
result_code <- result_code_no_error
print("Beginning of program")
### Execute parameter check with tryCatch and calling handlers
# Error in parameter checking functions should result in result_code_bad_request
tryCatch(
  withCallingHandlers(
    parameter_check(),
    error = function(condition) {},
    warning = function(condition) {
      write_to_log(log_file_location, condition$message)
      invokeRestart("muffleWarning")
    }
  ),
  error = function(condition) {
    write_to_log(log_file_location, condition$message)
    result_code <<- result_code_bad_request
  }
)
### Execute main section with tryCatch and calling handlers
# Error in main section should result in result_code_internal_error
# main section should only be executed if there is no error (internal or bad request) in the previous section
if (result_code == result_code_no_error) {
  tryCatch(
    withCallingHandlers(
      main(),
      error = function(condition) {},
      warning = function(condition) {
        write_to_log(log_file_location, condition$message)
        invokeRestart("muffleWarning")
      }
    ),
    error = function(condition) {
      write_to_log(log_file_location, condition$message)
      result_code <<- result_code_internal_error
    }
  )
}
print("End of program")
print(result_code)
As explained in the vignette for tryCatchLog, this has the disadvantage of not logging the precise location of the error. I am not passing on the error message from stop("400") because all parameter checking functions are now in one function call, but this could be done using condition$message.
The solution is (totally independent of using tryCatchLog or standard R tryCatch):
...
error = function(e) {print(e$message)}
..
Background (how R errors work): They create an object of type (error) condition:
e <- simpleError("400") # same "condition" object as created by stop("400")
str(e)
# List of 2
# $ message: chr "400"
# $ call : NULL
# - attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
print(e$message)
[1] "400"

Error while using write_xes function : Error in defaultvalues[[datatype]] : invalid subscript type 'list'

I want to export an eventlog object, built in R with the eventlog function of the bupaR package, as an xes file. For that I am using the function write_xes() from the package xesreadR, but the function gives this error:
Error in defaultvalues[[datatype]] : invalid subscript type 'list'
>class(log)
output:
[1] "eventlog" "tbl_df" "tbl" "data.frame"
write_xes(log,"myxes.xes")
According to the documentation it should save the log to the given file. But instead it produces the error:
ERROR : Error in defaultvalues[[datatype]] : invalid subscript type 'list'
I have tried multiple things to troubleshoot this problem but haven't come up with a solution. Can somebody help me solve this error? Thank you!
Your function is defined as follows:
write_xes ( eventlog, case_attributes = NULL, file = file.choose())
Thus, writing
write_xes(log,"myxes.xes")
means
write_xes(eventlog = log, case_attributes = "myxes.xes")
Instead, you should write
write_xes(eventlog = log, file = "myxes.xes")

Argument "xyz" to "ABC" has incompatible type "Tuple[None, ...]"; expected "Tuple[None]"

As an experiment, I wanted to add type annotations to my project and test it with mypy --strict. Consider the following code and the error message below:
#!/usr/bin/env python
import typing as T
from dataclasses import dataclass
@dataclass(frozen=True)
class Question:
    choices: T.Tuple[None]
def gen_question() -> Question:
    choices = [None]
    return Question(choices=tuple(choices))
if __name__ == '__main__':
    gen_question()
Here's the error message:
test.py:18: error: Argument "choices" to "Question" has incompatible type "Tuple[None, ...]"; expected "Tuple[None]"
Is there something I'm doing wrong, or is that a bug? How can I solve the problem?
It appears that, according to the documentation for typing.Tuple, T.Tuple[None] describes a tuple of exactly one element. To specify a variable-length tuple, which is what tuple(choices) produces from a list, I need to add , ... as in the following:
choices: T.Tuple[None, ...]
Note that this doesn't apply to lists: T.List[None] already describes a list of any length.
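For completeness, here is the example from the question with only that annotation changed. This is a sketch of the fix as described above; it should type-check under mypy --strict, since tuple() applied to a list is inferred as a variable-length tuple.

```python
import typing as T
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    # Variable-length tuple whose elements are all None.
    choices: T.Tuple[None, ...]


def gen_question() -> Question:
    choices = [None]
    # tuple(list) has an unknown length to the type checker,
    # which matches Tuple[None, ...] but not Tuple[None].
    return Question(choices=tuple(choices))


if __name__ == '__main__':
    print(gen_question())
```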

safely get variable from environment

When I execute:
my_env <- new.env(parent = emptyenv())
test <- purrr::safely(get("meta", envir = my_env))
I get the following error:
Error in get("meta") : object 'meta' not found
The error is correct in the sense that the meta variable is not defined in the environment, but my line of thinking was that safely would return a NULL in that case.
I can get around the error by checking with exists first, but I was curious about why safely fails. Am I wrong in thinking of safely as the equivalent of a try-catch?
You are misinterpreting the actions of the safely function. It was actually succeeding. If you had examined the value of test, you should have seen:
> test
[1] "Error in get(\"meta\", env = my_env) : object 'meta' not found\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in get("meta", env = my_env): object 'meta' not found>
To suppress error messages from being visible at the console you can either turn off reporting with options(show.error.messages = FALSE) or you can redirect the destination of stderr().
