Connect Airflow to Google Data Fusion - airflow

I'd like to write a Python script which manages my Google Data Fusion pipelines and instances (creates new ones, deletes them, starts them, etc.). For that purpose I use Airflow installed as a library. I've read some tutorials and documentation but I still can't make that script connect to the Data Fusion instance. I've tried to use the following connection string:
export AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT='google-cloud-platform://?extra__google_cloud_platform__key_path=%2Fkeys%2Fkey.json&extra__google_cloud_platform__scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform&extra__google_cloud_platform__project=airflow&extra__google_cloud_platform__num_retries=5'
with my JSON key file and project ID, but it still doesn't work. Can you give me an example of creating that connection?

You can find an example python script here:
https://airflow.readthedocs.io/en/latest/_modules/airflow/providers/google/cloud/example_dags/example_datafusion.html
This page provides a breakdown of each Data Fusion operator, if you would like to learn more about them:
https://airflow.readthedocs.io/en/latest/howto/operator/gcp/datafusion.html
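If it helps, here is a rough sketch of how those pieces usually fit together: the connection is supplied through the AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT environment variable (as you already do) and the Data Fusion operators from the Google provider pick it up through their gcp_conn_id argument, which defaults to google_cloud_default. Note that the project in the connection extras should be the GCP project that hosts your Data Fusion instance, not the literal string "airflow". All instance, pipeline, location and project names below are placeholders.

import os
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.datafusion import (
    CloudDataFusionStartPipelineOperator,
)

# Normally exported in the shell before Airflow starts; shown here only for completeness.
os.environ["AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT"] = (
    "google-cloud-platform://?"
    "extra__google_cloud_platform__key_path=%2Fkeys%2Fkey.json&"
    "extra__google_cloud_platform__project=my-gcp-project"  # project hosting the Data Fusion instance
)

with DAG("datafusion_example", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    start_pipeline = CloudDataFusionStartPipelineOperator(
        task_id="start_pipeline",
        location="europe-west1",        # region of the Data Fusion instance
        instance_name="my-instance",    # placeholder instance name
        pipeline_name="my-pipeline",    # placeholder pipeline name
        gcp_conn_id="google_cloud_default",
    )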

Related

ADX's built in export command vs ADFv2 copy activity performance

Kusto (ADX) has the .export command to export data from Kusto to other data stores (Azure Storage is of particular interest to me). Here I am not referring to continuous export but rather just the export command, which can also be used on demand. I think a couple of years back this was the only (out-of-the-box) option to export data out of Kusto. Now I see that ADFv2 (Azure Data Factory v2) copy activity also supports ADX (Kusto) as one of its sources, and using this we are able to copy data from Kusto to Azure Storage. But which method is faster? Is there any recommendation? One could also orchestrate the .export command from ADFv2's ADX command activity, which is what we were doing earlier -- but now I see that the copy activity in ADFv2 has started supporting ADX as a source out of the box, which is much easier to work with. But we only want to use the method that provides the best performance (i.e. the faster one).
Please see the comparison between the copy activity and the .export command here, specifically this tip:
The reason for it is that the ".export" command performs better, as by default it is executed in parallel, as well as providing more customization options.
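If you do end up orchestrating .export yourself, here is a rough sketch of issuing it programmatically with the azure-kusto-data Python client; the cluster, database, table, storage account and credentials are all placeholders, and the exact with(...) options will depend on your scenario.

from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Placeholder cluster and AAD application credentials.
kcsb = KustoConnectionStringBuilder.with_aad_application_key_authentication(
    "https://mycluster.westeurope.kusto.windows.net",
    "aad-app-id", "aad-app-key", "aad-tenant-id",
)
client = KustoClient(kcsb)

# The export runs in parallel on the cluster and writes compressed CSV blobs to Azure Storage.
export_command = """.export async compressed to csv (
    h@"https://mystorageaccount.blob.core.windows.net/exports;<storage-account-key>"
) with (namePrefix="daily_export", includeHeaders="all")
<| MyTable"""

# Control commands (starting with a dot) go through execute_mgmt.
# With "async" the response contains an operation id you can poll via .show operations.
response = client.execute_mgmt("MyDatabase", export_command)
print(response.primary_results[0])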

Creating a one off cron job using firebase and app engine

I'm creating an app where a user can create a piece of data that could be presented in the ui at a later date. I'm trying to find a way to create cron entries dynamically using either java code (for android devices) or node.js code (firebase cloud function generates a cron job). I haven't been able to find a way to do it dynamically and based on what I read it may not be possible. Does anyone out there know a way?
Presently the only way to create GAE cron jobs is via deployment of a cron configuration file (by itself or, in some cases, together with the app code), which today can only be done via the CLI tools (from the GAE or Cloud SDKs).
I'm unsure if you'd consider programmatically invoking such CLI tools to qualify as 'creating cron entries dynamically'. If you do, generating the cron config file and scripting the desired CLI-based deployment would be an approach.
But creating jobs via an API is still a feature request; see How to schedule repeated jobs or tasks from user parameters in Google App Engine?. It also contains another potentially acceptable approach (along the same lines as the one mentioned in #ceejayoz's comment).
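For the "generate the config file and script the CLI-based deployment" idea, here is a minimal sketch, assuming gcloud is installed and already authenticated for the target project; the cron entry fields are placeholders. Keep in mind that deploying cron.yaml replaces the app's entire cron table, so the generated file must contain every job you want to keep.

import subprocess

# Hypothetical cron entry built from user parameters.
cron_yaml = """cron:
- description: nightly data refresh
  url: /tasks/refresh
  schedule: every 24 hours
"""

# Write the generated config next to the app.
with open("cron.yaml", "w") as f:
    f.write(cron_yaml)

# Deploy only the cron configuration.
subprocess.check_call(["gcloud", "app", "deploy", "cron.yaml", "--quiet"])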

Amazon Web Services - how to run a script daily

I have an R script that I run every day that scrapes data from a couple of different websites and then writes the scraped data to a couple of different CSV files. Each day, at a specific time (that changes daily), I open RStudio, open the file, and run the script. I check that it runs correctly each time, and then I save the output to a CSV file. It is often a pain to have to do this every day (it takes ~10-15 minutes a day). I would love it if somehow I could have this script run automatically at a pre-defined specific time, and a buddy of mine said AWS is capable of doing this?
Is this true? If so, what is the specific feature / aspect of AWS that is able to do this, this way I can look more into it?
Thanks!
Two options come to mind:
Host an EC2 instance with R on it and configure a cron job to execute your R script regularly.
One easy way to get started: use this AMI.
To execute the script, R offers a CLI, Rscript. See e.g. here for how to set this up.
Go serverless: AWS Lambda is a hosted microservice. Currently R is not natively supported, but on the official AWS Blog here they offer a step-by-step guide on how to run R. Basically you execute R from Python using the rpy2 package.
Once you have this set up, schedule the function via CloudWatch Events (~ a hosted cron job). Here you can find a step-by-step guide on how to do that.
One more thing: you say that your script outputs CSV files. To save them properly you will need to put them in a file storage like AWS S3. You can do this in R via the aws.s3 package. Another option would be to use the AWS SDK for Python, which is preinstalled in the Lambda function. You could e.g. write a CSV file to the /tmp/ directory and, after the R script is done, move the file to S3 via boto3's S3 upload_file function (see the sketch after this answer).
IMHO the first option is easier to set up but the second one is more robust.
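To make the second option more concrete, here is a rough sketch of such a Lambda handler, assuming the R runtime and rpy2 are packaged with the function as in the linked AWS blog post; the script path, bucket and key are placeholders.

import boto3
import rpy2.robjects as robjects

s3 = boto3.client("s3")

def handler(event, context):
    # Run the scraping script; it is expected to write /tmp/output.csv.
    robjects.r.source("/var/task/scrape.R")   # placeholder path to the bundled R script

    # Move the result from the Lambda's temp storage to S3.
    s3.upload_file("/tmp/output.csv", "my-results-bucket", "daily/output.csv")

    return {"status": "done"}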
It's a bit counterintuitive, but you'd use CloudWatch with an event rule to run periodically. It can run a Lambda or send a message to an SNS topic or SQS queue. The challenge you'll have is that Lambda doesn't support R, so you'd either have to have a Lambda kick off something else or have something waiting on the SNS topic or SQS queue to run the script for you. It isn't a perfect solution, as there are potentially quite a few moving parts.
#stdunbar is right about using CloudWatch Events to trigger a Lambda function. You can set a fixed-rate frequency for the trigger or use a cron expression. But as he mentioned, Lambda does not natively support R.
This may help you to use R with Lambda: R Statistics ready to run in AWS Lambda and x86_64 Linux VMs
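For completeness, here is a hedged boto3 sketch of wiring up such a CloudWatch Events (EventBridge) schedule for a Lambda function; the function name, ARN and schedule below are placeholders.

import boto3

events = boto3.client("events")
lam = boto3.client("lambda")

# Fire the rule once per day at 06:00 UTC.
rule = events.put_rule(
    Name="daily-r-job",
    ScheduleExpression="cron(0 6 * * ? *)",
)

# Allow CloudWatch Events to invoke the function.
lam.add_permission(
    FunctionName="run-r-script",                      # placeholder function name
    StatementId="allow-cloudwatch-events",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)

# Point the rule at the function.
events.put_targets(
    Rule="daily-r-job",
    Targets=[{"Id": "daily-r-job-target",
              "Arn": "arn:aws:lambda:us-east-1:123456789012:function:run-r-script"}],  # placeholder ARN
)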
If you are running Windows, one of the easier solutions is to write a .BAT script to run your R script and then use Windows Task Scheduler to run it as desired.
To call your R script from your batch file, use the following syntax:
"C:\Program Files\R\R-3.2.4\bin\Rscript.exe" C:\rscripts\hello.R
Just verify that the path to the Rscript executable and the path to your R script are correct.
1. Dockerize your script (write a Dockerfile, build an image).
2. Push the image to AWS ECR.
3. Create an AWS ECS cluster and an AWS ECS task definition within the cluster that will run the image from AWS ECR every time it's spun up.
4. Use EventBridge to create a time-based trigger that will run the AWS ECS task definition (a boto3 sketch of this step follows the list).
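For step 4, a rough boto3 sketch, assuming the cluster, task definition, subnet and IAM role already exist (every name and ARN below is a placeholder):

import boto3

events = boto3.client("events")

# Rule that fires every day at 06:00 UTC.
events.put_rule(
    Name="daily-r-scrape",
    ScheduleExpression="cron(0 6 * * ? *)",
)

# Point the rule at the ECS task definition.
events.put_targets(
    Rule="daily-r-scrape",
    Targets=[{
        "Id": "run-scrape-task",
        "Arn": "arn:aws:ecs:us-east-1:123456789012:cluster/my-cluster",        # placeholder cluster ARN
        "RoleArn": "arn:aws:iam::123456789012:role/ecsEventsRole",             # role allowed to run the task
        "EcsParameters": {
            "TaskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/r-scraper",
            "LaunchType": "FARGATE",
            "NetworkConfiguration": {
                "awsvpcConfiguration": {
                    "Subnets": ["subnet-0123456789abcdef0"],                   # placeholder subnet
                    "AssignPublicIp": "ENABLED",
                },
            },
        },
    }],
)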
I recently gave a seminar walking through this at the Why R? 2022 conference.
You can check out the video here: https://www.youtube.com/watch?v=dgkm0QkWXag
And the GitHub repo here: https://github.com/mrismailt/why-r-2022-serverless-r-in-the-cloud

Installing Google Cloud Datastore gcd

I'm trying to wrap my head around working with Google Cloud Datastore but am having trouble getting started. I've downloaded the zip of the gcd tool (v1beta2) as described here, which, when unpacked, is comprised of three files: gcd.sh, gcd.cmd, and CloudDatastore.jar. Unfortunately, there are no further instructions on what to do next - where to install it, what path variables or permissions to set, etc. Can someone fill me in?
TIA - Joe
Typical usage looks something like:
# create a dataset
gcd.sh create my-project
# start the local datastore
gcd.sh start my-project
Then, if you're using the Java or Python protocol buffers library, you set a couple of environment variables to instruct the client to use the local datastore:
export DATASTORE_HOST=http://localhost:8080
export DATASTORE_DATASET=my-project
You can find more details about the gcd tool (including instructions for managing indexes) here.

Is there a way to back up my App Services / Usergrid data

App Services is a great place to store data, but now that I have a lot of critical info in there I realized there isn't a way to create a backup or roll back to an earlier state (in case I did something stupid like -X DELETE /users).
Any way to back up this data either online or offline?
Apart from API access to fetch records x by x and store them locally, there is no solution at the moment. The team is planning an S3 integration (export data to S3), but no completion date is defined for that yet.
Looks like the only way is to query the data using e.g. cURL and save the results to a local file. I don't believe there is a way to export natively.
http://apigee.com/docs/app-services/content/working-queries
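As a rough sketch of that approach in Python rather than cURL: page through a collection with the cursor that App Services returns and dump the entities to a local JSON file. The org, app, collection and token below are placeholders, and the page size limit may need adjusting.

import json
import requests

BASE = "https://api.usergrid.com/my-org/my-app"      # placeholder org/app
TOKEN = "YOUR_ACCESS_TOKEN"                           # placeholder token

def export_collection(collection, out_path):
    entities, cursor = [], None
    while True:
        params = {"limit": 1000, "access_token": TOKEN}
        if cursor:
            params["cursor"] = cursor
        page = requests.get(f"{BASE}/{collection}", params=params).json()
        entities.extend(page.get("entities", []))
        cursor = page.get("cursor")
        if not cursor:                                # no cursor means last page
            break
    with open(out_path, "w") as f:
        json.dump(entities, f, indent=2)

export_collection("users", "users-backup.json")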
From the 2014/2015 Usergrid versions it is possible to make exports and imports using the "Usergrid tools".
This page explains how to install them:
https://github.com/apache/incubator-usergrid/tree/master/stack/tools
Basically, once you run
$ java -jar usergrid-tools.jar export
it will export your data as JSON files into an export directory.
There are several export and import tools available; the best way to see them is to visit this page:
https://github.com/apache/incubator-usergrid/tree/6d962b7fe1cd5b47896ca16c0d0b9a297df45a54/stack/tools/src/main/java/org/apache/usergrid/tools
