What does "rerun with upstream in pipeline" do? - pipeline

I am defining a pipeline in Data Factory and had some errors that I have since corrected.
The first activity calls a U-SQL script to do some aggregation. I have changed the script plenty of times, but the error is still:
[{"errorId":"E_CSC_USER_SYNTAXERROR","severity":"Error","component":"CSC","source":"USER","message":"syntax
error. Final statement did not end with a semicolon","details":"at
token 'usql', line 4\r\nnear the ###:\r\n**************\r\nCLARE
#lineitemsfile string =
\"/datalakerepo/input/2016/01/01lineitems.txt\";\nDECLARE #ordersfile
string = \"/datalakerepo/input/2016/01/01orders.txt\";\nsales.usql ###
\n","description":"Invalid syntax found in the
script.","resolution":"Correct the script syntax, using expected
token(s) as a
guide.","helpLink":"","filePath":"","lineNumber":4,"startOffset":228,"endOffset":232}].
It seems like not all of the U-SQL script is being read by Data Factory, so I thought that "rerun with upstream in pipeline" might have something to do with this, like clearing a cache from the previous script.
Does anyone know what "rerun with upstream in pipeline" does?
Many thanks!

"Rerun with upstream in pipeline" basically means "recalculate with all dependencies". For example, if one has pipeline1 -> dataset1 -> pipeline2 and tries to rerun pipeline2 with dependecies, then pipeline1 and pipeline2 will be both executed. I believe it works same with several chained activities within single pipeline.


LocustIO: How to do batch request

I started to use LocustIO for load testing a 3rd party API which provides a way to do batch requests (http://docs.oasis-open.org/odata/odata/v4.01/odata-v4.01-part1-protocol.html#sec_BatchRequests).
How can this be done using LocustIO?
I tried with the following:
def batch(self):
    response = self.client.request(method="POST", url="/$batch", auth=("ABC", "DEF"), headers={"ContentType": "multipart/mixed; boundary=batch_36522ad7-fc75-4b56-8c71-56071383e77b"}, data="Content-Type: application/http\nContent-Transfer-Encoding: binary\n\nGET putyoururlhere HTTP/1.1\nAccept: application/json\n\n\n")
Auth is something I need for authentication to the API, but that's not the point of the question, and "putyoururlhere" should be replaced with the actual URL. Either way, it gives errors when executing the test, so I must be doing something wrong.
Does anyone with experience know how to do this?
Kind regards!
The data parameter should be your POST body (only); you can't put additional headers in it the way you did. You probably just want to add them as additional entries in the dict you pass as headers.
See the documentation for the Python requests library for more details: https://requests.readthedocs.io/en/master/
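As a rough illustration, here is a minimal sketch of what that might look like (assuming a recent Locust version where tasks live on an HttpUser; the boundary string, credentials, and "putyoururlhere" are placeholders carried over from the question):

from locust import HttpUser, task

class BatchUser(HttpUser):
    @task
    def batch(self):
        boundary = "batch_36522ad7-fc75-4b56-8c71-56071383e77b"  # placeholder boundary
        # Only the batched sub-requests go into the body; the real HTTP headers
        # are passed via the headers dict.
        body = (
            f"--{boundary}\r\n"
            "Content-Type: application/http\r\n"
            "Content-Transfer-Encoding: binary\r\n"
            "\r\n"
            "GET putyoururlhere HTTP/1.1\r\n"
            "Accept: application/json\r\n"
            "\r\n"
            f"--{boundary}--\r\n"
        )
        self.client.post(
            "/$batch",
            data=body,
            auth=("ABC", "DEF"),  # placeholder credentials from the question
            headers={"Content-Type": f"multipart/mixed; boundary={boundary}"},
        )

Whether the service accepts this exact multipart layout depends on the API; the point is simply that data carries the body while headers carries the HTTP headers.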

How can I log SQL execution results in Airflow?

I use Airflow Python operators to execute SQL queries against a Redshift/Postgres database. In order to debug, I'd like the DAG to return the results of the SQL execution, similar to what you would see if executing locally in a console.
I'm using psycopg2 to create a connection/cursor and execute the SQL. Having this logged would be extremely helpful to confirm the parsed parameterized SQL, and to confirm that data was actually inserted (I have painfully experienced issues where differences in environments caused unexpected behavior).
I do not have deep knowledge of Airflow or the low-level workings of the Python DB-API, but the psycopg2 documentation does seem to refer to some methods and connection configurations that may allow this.
I find it very perplexing that this is difficult to do, as I'd imagine it would be a primary use case of running ETLs on this platform. I've heard suggestions to simply create additional tasks that query the table before and after, but this seems clunky and ineffective.
Could anyone please explain how this may be possible, and if not, explain why? Alternate methods of achieving similar results welcome. Thanks!
So far I have tried the connection.status_message() method, but it only seems to return the first line of the SQL and not the results. I have also attempted to create a logging connection, which produces the SQL, but not the console results:
import logging
import sys

import psycopg2 as pg
from psycopg2.extras import LoggingConnection

conn = pg.connect(
    connection_factory=LoggingConnection,
    ...
)
conn.autocommit = True

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)
logger.addHandler(logging.StreamHandler(sys.stdout))
conn.initialize(logger)

cur = conn.cursor()
sql = """
INSERT INTO mytable (
    SELECT *
    FROM other_table
);
"""
cur.execute(sql)
I'd like the logger to return something like:
sql> INSERT INTO mytable (
SELECT ...
[2019-07-25 23:00:54] 912 rows affected in 4 s 442 ms
Let's assume you are writing an operator that uses a Postgres hook to do something in SQL.
Anything printed inside an operator is logged.
So, if you want to log the statement, just print the statement in your operator.
print(sql)
If you want to log the result, fetch the result and print the result.
E.g.
result = cur.fetchall()
for row in result:
    print(row)
Alternatively you can use self.log.info in place of print, where self refers to the operator instance.
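Putting that together, here is a minimal sketch (assuming an Airflow 1.x-style import path and a PythonOperator calling this function; the connection id and the SQL statement are placeholders):

import logging

from airflow.hooks.postgres_hook import PostgresHook  # import path differs in newer Airflow versions


def run_and_log_sql(**context):
    # "redshift_default" and the SQL below are placeholders.
    hook = PostgresHook(postgres_conn_id="redshift_default")
    sql = "SELECT * FROM other_table LIMIT 10;"

    conn = hook.get_conn()
    cur = conn.cursor()

    logging.info("sql> %s", sql)       # log the statement itself
    cur.execute(sql)
    logging.info(cur.statusmessage)    # command tag, e.g. "SELECT 10" or "INSERT 0 912"
    for row in cur.fetchall():         # log the actual result rows
        logging.info(row)
    conn.commit()

Wired into a PythonOperator (python_callable=run_and_log_sql), everything logged here shows up in the task's log in the Airflow UI.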
Ok, so after some trial and error I've found a method that works for my setup and objective. To recap, my goal is to run ETLs via Python scripts, orchestrated in Airflow. Referring to the documentation for statusmessage:
Read-only attribute containing the message returned by the last command:
The key is to manage logging in context with transactions executed on the server. In order for me to do this, I had to specifically set con.autocommit = False, and wrap SQL blocks with BEGIN TRANSACTION; and END TRANSACTION;. If you insert cur.statusmessage directly following a statement that deletes or inserts, you will get a response such as 'INSERT 0 92380'.
This still isn't as verbose as I would prefer, but it is much better than nothing, and is very useful for troubleshooting ETL issues within Airflow logs.
Side notes:
- When autocommit is set to False, you must explicitly commit transactions.
- It may not be necessary to state transaction begin/end in your SQL. It may depend on your DB version.
con = psy.connect(...)
con.autocommit = False
cur = con.cursor()
try:
    cur.execute([some_sql])
    logging.info(f"Cursor statusmessage: {cur.statusmessage}")
    con.commit()  # autocommit is off, so the transaction must be committed explicitly
except Exception:
    con.rollback()
finally:
    con.close()
There is some buried functionality within psycopg2 that I'm sure can be utilized, but the documentation is pretty thin and there are no clear examples. Suggestions on how to utilize things such as log objects, or using the connection PID to somehow retrieve additional information, would be welcome.

An output problem with an nginx module - how do I fix it?

I'm new to nginx and I'm trying to develop a simple nginx module, a handler module specifically. Although it's not what I really want to do, I'm trying to finish this task first. I'm trying to get the socket fd when a browser (or a client) connects to nginx, and I have gotten it successfully. However, when I try to output something using dup2(), nginx always hangs and outputs nothing. Sometimes I get output after a long time, and once I stop nginx with nginx -s stop, the output appears immediately.
Like this:
reach http://100.100.60.199/nc?search=123456
get
search=123456 HTTP/1.1
HOST
output
I have read some blogs about nginx modules and found that a handler module has its own pattern (to my understanding). For example, the output should be an ngx_chain_t, and I should construct that chain instead of using dup2 like in regular C code. So I wonder if it's feasible to get output with the function below.
Here is my handler function:
static ngx_int_t ngx_http_nc_handler(ngx_http_request_t *r){
    //ngx_int_t rc;
    ngx_socket_t connfd = r->connection->fd;
    int nZero = 0;
    //if(setsockopt(connfd,SOL_SOCKET,SO_SNDBUF,(const void*)&nZero,sizeof(nZero))==0)
    if(setsockopt(connfd, IPPROTO_TCP, TCP_NODELAY, (const void*)&nZero, sizeof(int)) == 0){
        setbuf(stdout, NULL);
        setbuf(stdin, NULL);
        setbuf(stderr, NULL);
        dup2(connfd, STDOUT_FILENO);
        dup2(connfd, STDERR_FILENO);
        dup2(connfd, STDIN_FILENO);
        printf("%s\n", r->args.data);
        //close(connfd);
    }
    return NGX_OK;
}
So I wonder if it's feasible: how can I get things right using the method above, or can anybody just say it's impossible and that constructing a chain is the only way?
I finally solved this problem by trying to understand how exactly nginx works. In short, all I needed to do was add an HTTP header to the output. But it's not as easy as what I described.

tSQLt - How to output a custom failure or success message?

We are using the tSQLt framework and have the below code in the test.
IF @count > 0
    EXEC tSQLt.Fail;
ELSE
    EXEC tSQLt.AssertEquals 1, 1;
I am interested to know how we can display a custom test success or failure message when this test gets executed?
tSQLt.fail takes up to 10 parameters that all get concatenated into a custom failure message.
You also do not need the call to tSQLt.AssertEquals as it, in your case, literally does nothing.
BTW, asserting a count is in almost all cases a bad idea, as it does not really tell you anything about the result. If you get the correct count back, you could still have wrong data. And if you get the incorrect count, you don't have any additional info on what went wrong.
Have a look at tSQLt.AssertEqualsTable or tSQLt.AssertEmptyTable instead.

Apache Camel using SEDA

I want to have a behavior like this:
Camel reads a file from a directory, splits it into chunks (using streaming), sends each chunk to a seda queue for concurrent processing, and after the processing is done, a report generator is invoked.
This is my camel route:
from("file://c:/mydir?move=.done")
.to("bean:firstBean")
.split(ExpressionBuilder.beanExpression("splitterBean", "split"))
.streaming()
.to("seda:processIt")
.end()
.to("bean:reportGenerator");
from("seda:processIt")
.to("bean:firstProcessingBean")
.to("bean:secondProcessingBean");
When I run this, the reportGenerator bean is run concurrently with the seda processing.
How to make it run once after the whole seda processing is done?
The splitter has built-in parallel processing, so you can do this more easily as follows:
from("file://c:/mydir?move=.done")
.to("bean:firstBean")
.split(ExpressionBuilder.beanExpression("splitterBean", "split"))
.streaming().parallelProcessing()
.to("bean:firstProcessingBean")
.to("bean:secondProcessingBean");
.end()
.to("bean:reportGenerator");
You can see more details about the parallel option at the Camel splitter page: http://camel.apache.org/splitter
I think you can use the delayer pattern of Camel on the second route to achieve the purpose.
delay(long) takes an argument that indicates time in milliseconds. You can read more about this pattern here.
For example:
from("seda:processIt")
    .delay(2000)                       // delays this route by 2 seconds
    .to("bean:firstProcessingBean");
I'd suggest the use of startupOrder to configure the startup order of routes, though.
The official documentation provides good details on the topic. Kindly read it here.
Point to note - " The routes with the lowest startupOrder is started first. All startupOrder defined must be unique among all routes in your CamelContext."
So, I'd suggest something like this -
from("endpoint1").startupOrder(1)
.to("endpoint2");
from("endpoint2").startupOrder(2)
.to("endpoint3");
Hope that helps.
PS: I'm new to Apache Camel and to Stack Overflow as well. Kindly pardon any mistake that might have occurred.
