I have a file with DB connection properties (including passwords) in HDFS that needs to be accessible to all Oozie jobs.
I am looking for a strategy to use this file from Oozie actions in order to connect to the DB.
I wonder whether creating a Hive table to load that file and querying it via an Oozie Hive action is a good strategy to consider.
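For context, here is a minimal sketch of what reading that file from inside an action could look like, assuming a hypothetical HDFS path and a plain key=value properties format (both are illustrative, not from the question):

```python
import subprocess

PROPS_PATH = "/apps/shared/db-connection.properties"  # hypothetical HDFS path


def load_db_props(hdfs_path=PROPS_PATH):
    """Read a key=value properties file straight out of HDFS via the hdfs CLI."""
    out = subprocess.run(
        ["hdfs", "dfs", "-cat", hdfs_path],
        check=True, capture_output=True, text=True,
    ).stdout
    props = {}
    for line in out.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props


if __name__ == "__main__":
    props = load_db_props()
    # e.g. props["jdbc.url"], props["jdbc.user"], props["jdbc.password"]
    print(sorted(props))
```

A script like this could run inside an Oozie shell action, which avoids materialising the passwords in a queryable Hive table at all.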
I would like to use dbt in an MWAA (Managed Workflows for Apache Airflow) environment. To achieve this I need to install dbt in the managed environment and from there run the dbt commands via the Airflow operators or the CLI (BashOperator).
My problem with this solution is that I need to store the dbt profile file(s), which contain the target/source database credentials, in S3. Otherwise the file is not deployed to the Airflow worker nodes and hence cannot be used by dbt.
Is there any other option? I feel this is a big security risk and it also undermines the use of Airflow (because I would like to use its built-in password manager).
My ideas:
1. Create the profile file on the fly in the Airflow DAG as a task and write it out to the local filesystem. I do not think this is a feasible workaround, because there is no guarantee that the dbt task will run on the same worker node on which my code created the file.
2. Move the profile file to S3 manually (excluding it from CI/CD). Again, I see a security risk, as I am storing credentials on S3.
3. Create a custom operator which builds the profile file on the same machine the command will run on. A maintenance nightmare.
4. Use MWAA environment variables (https://docs.aws.amazon.com/mwaa/latest/userguide/configuring-env-variables.html) and combine them with dbt's env_var function (https://docs.getdbt.com/reference/dbt-jinja-functions/env_var). Storing credentials in system-wide environment variables this way feels awkward.
Any good ideas or best practices?
@PeterRing, in our case we use dbt Cloud. Once the connection is set up in the Airflow UI, you call dbt Cloud job IDs to trigger the job (and then use a sensor to monitor it until it completes).
If you can't use dbt Cloud, perhaps you can use AWS Secrets Manager to store your DB profile/creds: Configuring an Apache Airflow connection using a Secrets Manager secret
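If a profiles.yml still has to exist on the worker, one pattern that fits the Secrets Manager suggestion is to render it just before invoking dbt, inside the same task, so nothing is stored in S3 and the worker-affinity worry from idea 1 disappears. A minimal sketch, assuming a hypothetical secret named dbt/warehouse-creds containing a JSON blob of connection fields and a Postgres target (secret name, profile name, and adapter settings are all illustrative):

```python
import json
import subprocess
import tempfile

import boto3
import yaml  # PyYAML


def run_dbt_with_secret(**_):
    # Fetch DB credentials from AWS Secrets Manager.
    # "dbt/warehouse-creds" is a hypothetical secret name.
    secret = boto3.client("secretsmanager").get_secret_value(
        SecretId="dbt/warehouse-creds"
    )
    creds = json.loads(secret["SecretString"])

    # Build a dbt profile in memory; the keys depend on your adapter.
    profile = {
        "my_profile": {
            "target": "prod",
            "outputs": {
                "prod": {
                    "type": "postgres",
                    "host": creds["host"],
                    "user": creds["user"],
                    "password": creds["password"],
                    "dbname": creds["dbname"],
                    "schema": "analytics",
                    "port": 5432,
                }
            },
        }
    }

    # Write profiles.yml to a temp dir on the *same* worker that runs dbt,
    # so the credentials only exist in memory and in a short-lived temp file.
    with tempfile.TemporaryDirectory() as profiles_dir:
        with open(f"{profiles_dir}/profiles.yml", "w") as f:
            yaml.safe_dump(profile, f)
        subprocess.run(["dbt", "run", "--profiles-dir", profiles_dir], check=True)
```

The function can be wrapped in a PythonOperator (or the same logic can sit at the top of a BashOperator command), so the profile is created and consumed within a single task.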
I am trying to automate Druid batch ingestion using Airflow. My data pipeline creates an EMR cluster on demand and shuts it down once Druid indexing is completed. But for Druid we need to have the Hadoop configuration files in the Druid server folder (ref). This is blocking me from using dynamic EMR clusters. Can we override the Hadoop connection details in the job configuration, or is there a way to let multiple indexing jobs use different EMR clusters?
I tried overriding the Hadoop configuration parameters from core-site.xml, yarn-site.xml, mapred-site.xml and hdfs-site.xml as job properties in the Druid indexing job. It worked; in that case there is no need to copy those files to the Druid server.
I used the Python program below to convert the properties from the XML files into JSON key-value pairs. You can do the same for all the files and pass everything as the indexing job payload (see the sketch after the code). This can be automated with Airflow after creating the different EMR clusters.
```python
import json
import os

import xmltodict

path = 'mypath'
file = 'yarn-site.xml'

with open(os.path.join(path, file)) as xml_file:
    data_dict = xmltodict.parse(xml_file.read())

# Flatten the <property><name>/<value> pairs into a plain dict.
druid_dict = {prop.get('name'): prop.get('value')
              for prop in data_dict.get('configuration').get('property')}
print(json.dumps(druid_dict))
```
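To pass everything as one payload, the same conversion can be run over all four files and merged into the jobProperties section of the Hadoop indexing spec. A minimal sketch under the same assumptions (the config directory and the spec skeleton are illustrative; dataSchema and ioConfig are omitted):

```python
import json
import os

import xmltodict

CONF_DIR = '/etc/hadoop/conf'  # wherever the EMR client configs were fetched to
CONF_FILES = ['core-site.xml', 'yarn-site.xml', 'mapred-site.xml', 'hdfs-site.xml']


def xml_to_props(xml_path):
    """Flatten one Hadoop *-site.xml file into a {name: value} dict."""
    with open(xml_path) as f:
        data = xmltodict.parse(f.read())
    return {p.get('name'): p.get('value')
            for p in data.get('configuration').get('property')}


job_properties = {}
for conf_file in CONF_FILES:
    job_properties.update(xml_to_props(os.path.join(CONF_DIR, conf_file)))

# Drop the merged properties into the index_hadoop task's tuningConfig.
index_spec = {
    "type": "index_hadoop",
    "spec": {
        # "dataSchema": {...}, "ioConfig": {...}  -- omitted for brevity
        "tuningConfig": {
            "type": "hadoop",
            "jobProperties": job_properties,
        },
    },
}
print(json.dumps(index_spec, indent=2))
```

An Airflow task can build this spec right after the EMR cluster comes up and submit it to the Druid overlord, so each run points at its own cluster.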
In researching how this might be done, I found hadoopDependencyCoordinates property here: https://druid.apache.org/docs/0.22.1/ingestion/hadoop.html#task-syntax
which seems relevant.
I am following this document: https://is.docs.wso2.com/en/5.9.0/setup/changing-datasource-bpsds/
deployment.toml configurations:
[bps_database.config]
url = "jdbc:mysql://localhost:3306/IAMtest?useSSL=false"
username = "root"
password = "root"
driver = "com.mysql.jdbc.Driver"
Executing database scripts.
Navigate to <IS-HOME>/dbscripts. Execute the scripts in the following files, against the database created.
<IS-HOME>/dbscripts/bps/bpel/create/mysql.sql
<IS-HOME>/dbscripts/bps/bpel/drop/mysql-drop.sql
<IS-HOME>/dbscripts/bps/bpel/truncate/mysql-truncate.sql
Now, create/mysql.sql creates the tables, and the other two files are responsible for dropping and truncating the same tables. What do I do?
Can anyone also tell me the use case of the BPS datasource?
Please help.
You should only change your BPS database if you have a requirement to use the workflow feature [1] in the WSO2 Identity Server. It is mentioned in this documentation: https://is.docs.wso2.com/en/5.9.0/setup/changing-to-mysql/
The document is supposed to mention only the relevant DB script, but it seems misleading, as it asks you to execute all three scripts. If you are using the workflow feature, just use the
/dbscripts/bps/bpel/create/mysql.sql
script to create the tables in your MySQL database.
[1]. https://is.docs.wso2.com/en/5.9.0/learn/workflow-management/
I have an EMR cluster and intend to do CRUD operations on DynamoDB as part of my reducer.
Note that I am not using Hive or Spark, just plain Apache Hadoop. Is there any documentation on how to connect to DynamoDB from my EMR cluster?
emr-dynamodb-connector is an open-source library that includes Hadoop classes such as DynamoDBInputFormat and DefaultDynamoDBRecordReader for reading data from DynamoDB (with parallel scans and read-rate control), and DynamoDBOutputFormat and DefaultDynamoDBRecordWriter for writing to DynamoDB (using batch writes) with write-rate control to avoid throttling.
I don't think there is any more AWS documentation on this other than the README of that open-source library.
All EMR clusters should have a pre-built package of this library (except emr-dynamodb-tools), usually at /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar, already on the classpath of EMR Hadoop. So you can just use the Hadoop InputFormat and OutputFormat implementations from this JAR in your MR application by setting the required configs (including the DynamoDB configs) in the job configuration.
I am trying to copy files from my edge node to HDFS using Oozie. Many have suggested setting up passwordless SSH to get this done.
I am unable to log in as the oozie user since it is a service user.
Is there any way other than passwordless SSH?
Thanks in advance.
Other than passwordless SSH there are two more options:
1. My preferred option: use the JSch Java library and create a Java application which accepts the shell script to be executed as an argument. Using JSch, it performs SSH to the configured edge node and executes the shell script there. In JSch you can configure the edge node username and password; use a 'JCEKS' file to store the password. (A Python sketch of the same pattern follows this list.)
Then add a Java action in Oozie to run the Java application created with JSch.
2. Use the "/usr/bin/expect" utility to create a shell script which performs SSH to the edge node and then runs the configured shell script. More details are here: Use expect in bash script to provide password to SSH command
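Not part of the original answer, but to illustrate the option 1 pattern in Python rather than Java: a minimal sketch using paramiko (a Python analogue of JSch), with hypothetical host, user, and script paths. The password handling is simplified here; in practice you would still pull it from a protected store (such as a JCEKS/credential provider or a secrets manager) rather than hard-coding it.

```python
import paramiko


def run_script_on_edge_node(script_path,
                            host="edgenode.example.com",    # hypothetical host
                            user="etl_user",                # hypothetical user
                            password="not-a-real-password"):
    """SSH to the edge node and execute the given shell script (illustrative)."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(hostname=host, username=user, password=password)
    try:
        # e.g. script_path = "/home/etl_user/scripts/copy_to_hdfs.sh"
        _, stdout, stderr = client.exec_command(f"bash {script_path}")
        exit_status = stdout.channel.recv_exit_status()  # block until the script finishes
        print(stdout.read().decode(), stderr.read().decode())
        if exit_status != 0:
            raise RuntimeError(f"{script_path} exited with status {exit_status}")
    finally:
        client.close()
```

Run from an Oozie shell action, this gives the same "execute a script on the edge node" behaviour without requiring passwordless SSH.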