Airflow DataprocClusterCreateOperator

In the Airflow DataprocClusterCreateOperator settings:
Is there a way to set the primary disk type for the master and worker nodes to pd-ssd?
The default setting is standard.
I looked through the documentation but couldn't find any such parameter.

Unfortunately, there is no option to change the disk type in DataprocClusterCreateOperator.
In the Google API it is available if you pass diskConfig in the cluster definition: https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters#diskconfig
I will try to add this feature; it should be available in Airflow 1.10.1 or Airflow 2.0.
For now, you can create an Airflow plugin that modifies the current DataprocClusterCreateOperator.
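Until such a parameter exists, a plugin can subclass the operator and splice a diskConfig block into the cluster spec it builds. The helper below is a minimal sketch of just the spec manipulation; the operator method that builds the spec (e.g. _build_cluster_data in 1.10.x sources) and the bootDiskType field name from the Dataproc v1 DiskConfig are assumptions you should verify against your Airflow and API versions:

```python
# Sketch: inject a diskConfig into a Dataproc cluster spec (a plain dict,
# as built internally by DataprocClusterCreateOperator). In a plugin you
# would subclass the operator, call its spec-building method (assumed to
# be _build_cluster_data in 1.10.x -- verify the name), and run the result
# through this helper before the create-cluster API call.

def add_disk_config(cluster_data, disk_type="pd-ssd"):
    """Set the primary disk type for both master and worker nodes.

    'bootDiskType' follows the Dataproc v1 DiskConfig REST field name;
    treat it as an assumption to check against the API reference.
    """
    for role in ("masterConfig", "workerConfig"):
        group = cluster_data.setdefault("config", {}).setdefault(role, {})
        group.setdefault("diskConfig", {})["bootDiskType"] = disk_type
    return cluster_data
```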

There seem to be two related fields:
master_machine_type: Compute Engine machine type to use for the master node
worker_machine_type: Compute Engine machine type to use for the worker nodes
I found these simply by looking at the source code here (this links to the latest version; since no version was specified, I assumed the latest):
https://airflow.readthedocs.io/en/latest/_modules/airflow/contrib/operators/dataproc_operator.html

Related

Datadog cannot find check in "catalog" when implementing integration

I've been trying to implement a Datadog integration, more specifically Airflow's. I've been following this piece of documentation using the containerized approach (I've tried using pod annotations and adding the confd parameters to the agent's Helm values). I've only made progress by adding the airflow.yaml config to the confd section of the cluster agent. However, I get stuck when I try to validate the integration as specified in the documentation by running datadog-cluster-agent status. Under the "Running Checks" section, I can see the following:
airflow
-------
Core Check Loader:
Check airflow not found in Catalog
On top of being extremely generic, this error message mentions a "Catalog" that is not referenced anywhere else in the Datadog documentation. It gives no hint about what could possibly be wrong with the integration. Has anyone had the same problem and knows how to solve it, or at least how I can get more info/details/verbosity to debug this issue?
You may need to add cluster_check: true to your airflow.yaml confd configuration.
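For reference, a minimal shape of that entry in the cluster agent's Helm values might look like the fragment below; the exact chart keys and the webserver URL depend on your chart version and deployment, so treat them as placeholders:

```yaml
clusterAgent:
  confd:
    airflow.yaml: |-
      cluster_check: true    # lets the cluster agent schedule this as a cluster check
      init_config:
      instances:
          # hypothetical in-cluster service URL -- replace with your webserver's
        - url: http://airflow-webserver.airflow.svc.cluster.local:8080
```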

Need to create custom s3KeySensor

I'm using airflow_docker and it does not seem to come with an S3KeySensor. It does come with a Sensor class. How can I create my own custom S3KeySensor? Do I just have to inherit from Sensor and override the poke method? Can I literally just copy the source code from the S3KeySensor? Here's the source code for the S3KeySensor.
The reason I am using airflow_docker is that it runs in a container and I can pass AWS role credentials to the task container, so that it has the proper and exact permissions for an action rather than using the worker container's role permissions.
I would recommend upgrading to the latest version of Airflow, or at least starting at Airflow 2.0.
In Airflow 2.0, a majority of the operators, sensors, and hooks that are not considered core functionality are categorized and separated into providers. Amazon has a provider, apache-airflow-providers-amazon, that includes the S3KeySensor.
You can also install backport providers so the one you would need is apache-airflow-backport-providers-amazon if you want to stay on Airflow 1.10.5 (based on the links you sent).
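To the original question: yes, inheriting from the Sensor class and overriding poke is exactly the pattern. Below is a self-contained sketch; the Sensor base here is a stand-in for airflow_docker's (whose real constructor signature you should check), and the injectable client exists only so the example runs without AWS access:

```python
class Sensor:
    """Stand-in for airflow_docker's Sensor base class (an assumption --
    check the real class for its constructor arguments)."""
    def __init__(self, **kwargs):
        pass


class CustomS3KeySensor(Sensor):
    """Pokes S3 until the given key exists."""

    def __init__(self, bucket_name, key, s3_client=None, **kwargs):
        super().__init__(**kwargs)
        self.bucket_name = bucket_name
        self.key = key
        self._s3 = s3_client  # injectable for testing; defaults to boto3 below

    def _client(self):
        if self._s3 is None:
            # boto3.client("s3") picks up the role creds passed to the
            # task container, which is the point of using airflow_docker.
            import boto3
            self._s3 = boto3.client("s3")
        return self._s3

    def poke(self, context):
        # HeadObject is the cheapest way to test for a key's existence;
        # botocore raises ClientError (404) when the key is absent.
        try:
            self._client().head_object(Bucket=self.bucket_name, Key=self.key)
            return True
        except Exception:
            return False
```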

How to check if there are sufficient resources to create an instance?

I want to check whether there are sufficient resources available to create an instance of a specific flavor, without actually creating the instance. I tried stack --dry-run, but it does not check whether resources are available.
I also went through the CLI and REST API docs, but found no solution other than checking the available resources on each hypervisor and calculating it manually (comparing them with the resources required by the flavor). Isn't there an easier solution, like a CLI command that would give me a yes/no answer?
Thank you.
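For what it's worth, the manual calculation described above can be scripted. This sketch uses the resource fields reported by the Nova hypervisors API (as surfaced by openstack hypervisor stats show); the field names are assumptions to verify against your deployment, and it ignores overcommit ratios, so treat the answer as approximate:

```python
def flavor_fits(hypervisors, flavor):
    """Return True if any single hypervisor has enough free vCPUs, RAM,
    and disk for the flavor.

    hypervisors: list of dicts with Nova-style fields (vcpus, vcpus_used,
    memory_mb, memory_mb_used, local_gb, local_gb_used -- assumed names).
    flavor: dict with vcpus, ram (MB), disk (GB).
    Note: raw free capacity only; real schedulers also apply overcommit
    ratios and filters, so this is a conservative approximation.
    """
    for h in hypervisors:
        if (h["vcpus"] - h["vcpus_used"] >= flavor["vcpus"]
                and h["memory_mb"] - h["memory_mb_used"] >= flavor["ram"]
                and h["local_gb"] - h["local_gb_used"] >= flavor["disk"]):
            return True
    return False
```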

Presto Interpreter in Zeppelin on EMR

Is it possible to add a Presto interpreter to Zeppelin on AWS EMR 4.3, and if so, could someone please post the instructions? I have Presto-Sandbox and Zeppelin-Sandbox running on EMR.
There's no official Presto interpreter for Zeppelin, and the conclusion of the Jira ticket raised about it is that one isn't necessary, because you can just use the jdbc interpreter:
https://issues.apache.org/jira/browse/ZEPPELIN-27
I'm running a later EMR with Presto & Zeppelin, and the default set of interpreters doesn't include jdbc, but it can be installed by SSHing to the master node and running
sudo /usr/lib/zeppelin/bin/install-interpreter.sh --name jdbc
Even better is to use that as a bootstrap script.
Then you can add a new interpreter in Zeppelin.
Click the login-name drop down in the top right of Zeppelin
Click Interpreter
Click +Create
Give it a name like presto; you then use %presto as a directive on the first line of a paragraph in Zeppelin, or set it as the default interpreter.
The settings you need here are:
default.driver com.facebook.presto.jdbc.PrestoDriver
default.url jdbc:presto://<YOUR EMR CLUSTER MASTER DNS>:8889
default.user hadoop
Note that no password is provided, because the EMR environment should be using IAM roles, .ppk keys, etc. for authentication.
You will also need a dependency for the Presto JDBC driver jar. There are multiple ways to add dependencies in Zeppelin, but one easy way is via a Maven groupId:artifactId:version reference in the interpreter settings under Dependencies,
e.g.
under artifact
com.facebook.presto:presto-jdbc:0.170
Note that version 0.170 corresponds to the version of Presto currently deployed on EMR, which will change in the future. You can see in the AWS EMR settings which version is deployed to your cluster.
You can also get Zeppelin to connect directly to a catalog, or a catalog & schema, by appending them to the default.url setting,
as per the Presto docs for the JDBC driver:
https://prestodb.io/docs/current/installation/jdbc.html
As an example, using Presto with a Hive metastore containing a database called datakeep:
jdbc:presto://<YOUR EMR CLUSTER MASTER DNS>:8889/hive
OR
jdbc:presto://<YOUR EMR CLUSTER MASTER DNS>:8889/hive/datakeep
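The URL variants above (and the ?user= workaround described further down for newer drivers) can be composed with a small helper; the function name is illustrative, but the URL scheme follows the Presto JDBC docs:

```python
def presto_jdbc_url(host, port=8889, catalog=None, schema=None, user=None):
    """Build a default.url value for the Zeppelin jdbc interpreter.

    Follows the Presto JDBC scheme jdbc:presto://host:port[/catalog[/schema]],
    optionally with ?user= appended as a URL parameter.
    A schema can only be given together with a catalog.
    """
    if schema and not catalog:
        raise ValueError("a schema requires a catalog")
    url = "jdbc:presto://{}:{}".format(host, port)
    if catalog:
        url += "/" + catalog
        if schema:
            url += "/" + schema
    if user:
        url += "?user={}".format(user)
    return url
```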
UPDATE Feb 2018
EMR 5.11.1 is using Presto 0.187, and there is an issue in the way the Zeppelin interpreter passes properties to the Presto driver, causing an error like Unrecognized connection property 'url'.
Currently the only solutions appear to be using an older version in the artifact, or manually uploading a patched Presto driver.
See https://github.com/prestodb/presto/issues/9254 and https://issues.apache.org/jira/browse/ZEPPELIN-2891
In my case, using an older driver reference (apparently it must be older than 0.180), e.g. com.facebook.presto:presto-jdbc:0.179, did not work; Zeppelin gave me an error about being unable to download dependencies. An odd error, probably because Zeppelin's local Maven repo does not contain that version; I'm not sure, and I gave up on that approach.
I can confirm that patching the driver works.
(Assuming you have java & maven installed)
Clone the presto github repo
Checkout the release tag e.g. git checkout 0.187
Make the edits as per that patch https://groups.google.com/group/presto-users/attach/1231343dbdd09/presto-jdbc.diff?part=0.1&authuser=0
Build the jar using mvn clean package
Copy the jar to the Zeppelin machine, somewhere the zeppelin user has permission to read.
In the interpreter, under the Dependencies - Artifacts section, instead of a maven reference use the absolute path to that jar file.
There appears to be an issue passing the user to the presto driver, so just add it to the "default.url" jdbc connection string as a url parameter, e.g.
jdbc:presto://<YOUR EMR CLUSTER MASTER DNS>:8889?user=hadoop
Up and running. Meanwhile, it might be worth considering Athena as an alternative to Presto, given it's serverless and is effectively a fork of Presto. It is limited to external Hive tables only, and they must be created in Athena's own catalog (or now in the AWS Glue catalog, also restricted to external tables).
Chris Kang has a good post on doing that in spark-shell: http://theckang.com/2016/spark-with-presto/. I don't see why you wouldn't be able to do that in Zeppelin. Another helpful post is on making sure you have the right Java version on EMR: http://queirozf.com/entries/update-java-to-jdk-8-on-amazon-elastic-mapreduce. The current Presto version as of this writing only runs on Java 8. I hope this sets you in the right direction.

How to run multiple Meteor servers on different ports

How can Meteor run on multiple ports? For example, if one Meteor app runs on 3000, I need another Meteor app to run from the same terminal. Please help me.
You can use the --port parameter:
`meteor run --port 3030`
To learn more about command line parameters, run meteor help <command>, e.g. meteor help run.
I see you've tagged your question meteor-up. If you're actually using mup, check out the env parameter in the config file.
I think the OP was referring to the exceptions caused by locks on the Mongo DB. I have only been on this platform for the last week, and am learning as quickly as I can. But when I tried running my application from the same project directory as two different users on two different ports, I got an exception about MongoDB:
Error: EBUSY, unlink 'D:\test\.meteor\local\db\mongod.lock'
The root of the issue isn't running on different ports - it is the files shared between the two instances, specifically the database.
I don't think any of the answers here actually solved that, and neither can I yet.
I see two options.
First -
I am going to experiment with links, to see if I can get the two users to use a different folder for the .meteor\local tree, so both of us can work on the same code at the same time without impacting each other when testing.
But I doubt that is what the OP was referring to either (different users, same app)...
Second - I will try to identify whether I can inject into run-mongo.js some notion of the URL/port number I am running on, so that the mongodb.lock (and the db, of course) are named something like mongodb.lock-3000.
I don't like the second option because it puts me on my own version of the standard scripts.
No - by default Meteor uses port 3000 for the app (or whatever port you pass at startup), and the next port up (+1) for Mongo.
That is, the next application should be run two ports away - e.g. at 3002, or two ports below, at 2998 - so the Mongo ports don't collide.
A very simple check (Mac, Linux):
ps | grep meteor
