I am trying to create tasks dynamically based on the response of a database call, but when I do this the run option simply doesn't appear in Airflow, so I can't run the DAG.
Here is the code:
from airflow.operators import python_operator
from airflow.operators.dummy_operator import DummyOperator
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.hooks.bigquery_hook import BigQueryHook

tables = ['a', 'b', 'c']  # This works
# tables = get_tables()  # This never works

check_x = python_operator.PythonOperator(
    task_id="verify_loaded",
    python_callable=lambda: verify_loaded(tables)
)

bridge = DummyOperator(
    task_id='bridge'
)

check_x >> bridge

for vname in tables:
    sql = "SELECT * FROM `asd.temp.{table}` LIMIT 5".format(table=vname)
    log.info(vname)
    materialize__bq = BigQueryOperator(
        sql=sql,
        destination_dataset_table="asd.temp." + table_prefix + vname,
        task_id="materialize_" + vname,
        bigquery_conn_id="bigquery_default",
        google_cloud_storage_conn_id="google_cloud_default",
        use_legacy_sql=False,
        write_disposition="WRITE_TRUNCATE",
        create_disposition="CREATE_IF_NEEDED",
        query_params={},
        allow_large_results=True
    )
    bridge >> materialize__bq

def get_tables():
    bq_hook = BigQueryHook(bigquery_conn_id='bigquery_default', delegate_to=None, use_legacy_sql=False)
    my_query = "SELECT table_id FROM `{project}.{dataset}.{table}` LIMIT 3;".format(
        project=project, dataset=dataset, table='__TABLES__')
    df = bq_hook.get_pandas_df(sql=my_query, dialect='standard')
    return df['table_id'].tolist()
I am trying to make the commented-out part work, but with no luck. The get_tables() function fetches table names from BigQuery, and I wanted to build the tasks dynamically this way. When I do this, I don't get the option to run and the DAG looks broken. Any help? I've been trying for a long time.
To understand the problem, we must look at the Cloud Composer architecture:
https://cloud.google.com/composer/docs/concepts/overview
The scheduler runs in GKE using the service account configured when you created the Composer instance.
The web UI runs in a tenant project on App Engine, using a different service account. The resources of this tenant project are hidden (you don't see the App Engine application, the Cloud SQL instance or the service account among the project resources).
When the web UI parses the DAG file, it tries to access BigQuery using the connection 'bigquery_default'.
Check the Airflow GCP hook's _get_credentials source code:
https://github.com/apache/airflow/blob/1.10.2/airflow/contrib/hooks/gcp_api_base_hook.py#L74
If you have not configured the connection in the Airflow admin, it falls back to the google.auth.default method, connecting to BigQuery with the tenant project's service account. That service account doesn't have permission to access BigQuery, so it gets an unauthorized error and the web UI is not able to generate the DAG. If you check in Stackdriver, you will probably find the BigQuery error.
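As a rough illustration (not code from Composer or Airflow itself), this is what that default-credential fallback amounts to; run in the web UI's environment, the resolved project would be the tenant project rather than yours:

import google.auth

# Application Default Credentials: what the hook falls back to when the
# bigquery_default connection has no keyfile configured.
credentials, project = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
print(project)  # in the web UI's environment this would be the tenant project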
On the other hand, the Airflow scheduler uses the service account configured at Composer creation, which has the right permissions, so it parses the DAG correctly.
If you execute the code in a local Airflow instance, the web UI and the scheduler use the same service account, so it works as expected in both cases.
The easiest solution is to add a Keyfile Path or Keyfile JSON to the bigquery_default connection, so the web UI does not use the default service account.
If you have security concerns with this solution (the service account credentials become available to anyone with access to Composer), another option is to restructure the code so that everything runs inside a PythonOperator. That PythonOperator calls get_tables() and then loops over the results, executing the BigQuery commands with a BigQueryHook instead of a BigQueryOperator. The drawback of this approach is that you get a single task instead of one task per table; a sketch follows below.
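A minimal sketch of that restructuring, assuming Airflow 1.10's contrib BigQueryHook and reusing get_tables() and table_prefix from the question (task and variable names here are illustrative, not a drop-in implementation):

from airflow.contrib.hooks.bigquery_hook import BigQueryHook
from airflow.operators.python_operator import PythonOperator

def materialize_all_tables():
    # Runs at task execution time under the scheduler/worker service account,
    # so no BigQuery call happens while the web UI parses the DAG file.
    tables = get_tables()
    cursor = BigQueryHook(bigquery_conn_id="bigquery_default",
                          use_legacy_sql=False).get_conn().cursor()
    for vname in tables:
        cursor.run_query(
            sql="SELECT * FROM `asd.temp.{table}` LIMIT 5".format(table=vname),
            destination_dataset_table="asd.temp." + table_prefix + vname,
            write_disposition="WRITE_TRUNCATE",
            create_disposition="CREATE_IF_NEEDED",
            allow_large_results=True,
            use_legacy_sql=False,
        )

materialize_all = PythonOperator(
    task_id="materialize_all_tables",
    python_callable=materialize_all_tables,
)

check_x >> bridge >> materialize_all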
I am trying to access my own data tables stored in Google BigQuery from a Google Colab notebook (with an R runtime) by running the following code:
# install.packages("bigrquery")
library("bigrquery")
bq_auth(path = "mykeyfile.json")
projectid = "work-366734"
sql <- "SELECT * FROM `Output.prepared_data`"
Running
tb <- bq_project_query(projectid, sql)
results in the following access denied error:
Access Denied: BigQuery BigQuery: Permission denied while globbing file pattern. [accessDenied]
For clarification, I already created a service account (under Google Cloud IAM and admin), gave it the roles ‘BigQuery Admin’ and ‘BigQuery Data Owner’, and extracted the above-mentioned JSON key file ‘mykeyfile.json’ (as suggested here).
Additionally, I added the service account as a principal on the dataset (BigQuery – Sharing – Permissions – Add Principal), but the same error still shows up.
Of course, I have already reset/deleted and reinitialized the runtime.
Am I missing additional permissions somewhere else?
Thanks!
Not sure if it is relevant, but I'll add it just in case: I also tried the authentication process via
bq_auth(use_oob = TRUE, cache = FALSE)
which opens an additional window where I have to allow access (using my Google account, which is also the data owner) and enter an authorization code. While this step works, bq_project_query(projectid, sql) still gives the same Access Denied error.
Authorizing access to Google BigQuery using Python with the following commands works flawlessly (using the same account/credentials):
from google.colab import auth
from google.cloud import bigquery

auth.authenticate_user()
project_id = "work-366734"
client = bigquery.Client(project=project_id)
df = client.query('''
    SELECT *
    FROM `work-366734.Output.prepared_data`
''').to_dataframe()
I would like to write the Airflow logs to S3. According to the docs, these are the parameters that we need to set:
remote_logging = True
remote_base_log_folder =
remote_log_conn_id =
If Airflow is running in AWS, why do I have to pass the AWS keys? Shouldn't the boto3 API be able to read/write to S3 if the correct permissions are set on the IAM role attached to the instance?
Fair point, but I think it allows for more flexibility if Airflow is not running on AWS, or if you want to use a specific set of credentials rather than give the entire instance access. It might also have been an easier implementation, because the underlying code for writing logs into S3 uses the S3Hook (https://github.com/apache/airflow/blob/1.10.9/airflow/utils/log/s3_task_handler.py#L47), which requires a connection id.
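If the goal is simply to rely on the instance's IAM role, one approach (a hedged sketch, not official guidance) is to keep the remote_log_conn_id connection but leave its access key and secret empty; the S3Hook then falls back to boto3's default credential chain, which includes the EC2 instance profile. The connection id and bucket below are hypothetical:

from airflow.hooks.S3_hook import S3Hook

# 'MyS3Conn' is a hypothetical connection with no key/secret stored; boto3 then
# resolves credentials from its default chain (env vars, instance profile, etc.).
hook = S3Hook(aws_conn_id="MyS3Conn")
hook.load_string(
    string_data="remote logging smoke test",
    key="airflow/logs/smoke-test.txt",
    bucket_name="my-airflow-log-bucket",  # illustrative bucket name
    replace=True,
)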
I am following this document: https://is.docs.wso2.com/en/5.9.0/setup/changing-datasource-bpsds/
deployment.toml configurations:
[bps_database.config]
url = "jdbc:mysql://localhost:3306/IAMtest?useSSL=false"
username = "root"
password = "root"
driver = "com.mysql.jdbc.Driver"
Executing database scripts.
Navigate to <IS-HOME>/dbscripts. Execute the scripts in the following files, against the database created.
<IS-HOME>/dbscripts/bps/bpel/create/mysql.sql
<IS-HOME>/dbscripts/bps/bpel/drop/mysql-drop.sql
<IS-HOME>/dbscripts/bps/bpel/truncate/mysql-truncate.sql
Now, create/mysql.sql creates the tables, and the other two files are responsible for dropping and truncating those same tables. What do I do?
Can anyone also tell me the use case of the BPS datasource?
Please help.
You only need to change the BPS database if you have a requirement to use the workflow feature [1] in WSO2 Identity Server. It is mentioned in this documentation: https://is.docs.wso2.com/en/5.9.0/setup/changing-to-mysql/
The document is supposed to mention only the relevant DB script, but it is misleading, as it asks you to execute all three scripts. If you are using the workflow feature, just run the
/dbscripts/bps/bpel/create/mysql.sql
script to create the tables in your MySQL database.
[1]. https://is.docs.wso2.com/en/5.9.0/learn/workflow-management/
I want to create a Shiny application which uses bigrquery to connect to the BigQuery API and run a query.
I use the following code to execute the query:
library(bigrquery)
project <- "PROJECT_ID" # put your project ID here
sql <- 'QUERY '
test <- query_exec(sql, project = project)
But before this, there is an authentication process in the bigrquery package, like:
google <- oauth_endpoint(NULL, "auth", "token",
base_url = "https://accounts.google.com/o/oauth2")
bigqr <- oauth_app("google",
"465736758727.apps.googleusercontent.com",
"fJbIIyoIag0oA6p114lwsV2r")
cred <- oauth2.0_token(google, bigqr,
scope = c(
"https://www.googleapis.com/auth/bigquery",
"https://www.googleapis.com/auth/cloud-platform"))
How can I integrate the auth process into my application so that
the process needs no interaction, or
the process works with a given app key and secret (where do I get them?), or
the auth process opens up in another browser window?
Regards
One suggestion I have, which is similar to an answer I provided on a question about server-side access to Google Analytics data, is to use a Google Service Account. The googleAuthR package by Mark Edmondson, available through CRAN, provides functionality to perform server-side authentication in R using a Google Service Account. Another package by the same author called bigQueryR, also on CRAN, integrates with googleAuthR and uses the resulting authentication token to execute queries to Google BigQuery.
To achieve this:
Create a service account for your Google API project.
Download the JSON file containing the private key of the service account.
Grant the service account access to your Google BigQuery project, in the same way as you would for any other user. This is done via the Google API console IAM screen where you set permissions for your project.
Supply the location of the private key JSON file as an argument when authenticating with googleAuthR (see the example below):
The following example R script, based on an example from the bigrquery package, references the JSON file containing the private key and performs a basic Google BigQuery query. Remember to set the json_file argument to the appropriate file path and the project argument to your Google BigQuery project:
library(googleAuthR)
library(bigQueryR)
gar_auth_service(
json_file = "API Project-xxxxxxxxxxxx.json",
scope = "https://www.googleapis.com/auth/bigquery"
)
project <- "project_id" # put your project ID here
sql <- "SELECT year, month, day, weight_pounds
FROM [publicdata:samples.natality] LIMIT 5"
bqr_query(projectId = project, query = sql, datasetId = "samples")
I have CloudStack 4.2.1 here and would like my VMs to boot and shut down at scheduled times.
Hence I was wondering if I could integrate CloudMonkey with crontab.
First, by creating a CloudMonkey script or API call, then using crontab to run it at a specific time.
However, I have problems creating a CloudMonkey script/API call.
I have googled and found this link
http://dlafferty.blogspot.sg/2013/07/using-cloudmonkey-to-automate.html
and came up with
apiresult=cloudmonkey api stop virtualmachine id="'e10bdf21-2d5c-4277-9d8d-791b82b9e3be'"
Unfortunately, when I entered this command, nothing happened. If anyone has an alternative suggestion, or if my API call command is wrong, please correct me and help.
Thank you.
CloudMonkey requires some setup before it works (e.g. setting your API key).
Check [1] for the CloudMonkey documentation and follow the Usage section to set up your environment.
Once your setup is complete and you can interact with CloudStack via CloudMonkey, you should take into account that the VM ids might change, so before you issue a command for a VM, you should first find the correct id by listing the VMs and picking the right one.
Also, if you run into trouble, post the relevant log from the CloudStack management server (typically in /var/log/cloudstack/management/management-server.log).
[1] - https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+cloudmonkey+CLI
Edit: If you have a working connection via CloudMonkey to CloudStack, you need to configure CloudMonkey in the same way in your shell script. For instance, when you configured CloudMonkey you probably set a host, a port, and your API and secret keys. So for your script to work, you need to provide the same configuration to CloudMonkey prior to issuing the commands. My best guess is to use the -c option and provide a config file to set all the relevant parameters (e.g. API and secret key): cloudmonkey -c CONFIG_FILE ....
Edit 2: You don't actually need to re-configure CloudMonkey in your script, because it will remember your config from the interactive session. I would still advise you to do it, because it makes your script more reliable. I've just made an example script like this:
#! /bin/bash
result=$(cloudmonkey list users)
echo $result
Result:
> ./tmp.sh
count = 1 user: id = 678e3a24-082c-11e4-86de-acbdb2423647 account = admin accountid = 678dffe6-082c-11e4-86de-acbdb2423647 accounttype = 1 apikey = T6sDBIpytyJ4_PMgNXYi8YgjMtwTiiDjijbXNB1J78EAZq2foKhCoGKjgJnej5tMaHM0LUvejgTddkhVU63wdw created = 2014-07-10T16:19:13+0200 domain = ROOT domainid = 678dd7b4-082c-11e4-86de-acbdb2423647 email = admin#mailprovider.com firstname = Admin iscallerchilddomain = False isdefault = True lastname = User secretkey = dzOPRecI5vvEVK7Vie2D0tDsQGXunUnpIAczbXnPI3sfMwQ-upWL_bPOisEYg4C-nXi-ldQno2KVZbVR-5NmVw state = enabled username = admin
Maybe you forgot to echo the result?