Installing Google Cloud Datastore gcd - google-cloud-datastore

I'm trying to wrap my head around working with Google Cloud Datastore but am having trouble getting started. I've downloaded the zip of the gcd tool (v1beta2) as described here, which, when unpacked, is comprised of three files: gcd.sh, gcd.cmd, and CloudDatastore.jar. Unfortunately, there are no further instructions on what to do next - where to install it, what path variables or permissions to set, etc. Can someone fill me in?
TIA - Joe

Typical usage looks something like:
# create a dataset
gcd.sh create my-project
# start the local datastore
gcd.sh start my-project
Then, if you're using the Java or Python protocol buffers library, you set a couple of environment variables to instruct the client to use the local datastore:
export DATASTORE_HOST=http://localhost:8080
export DATASTORE_DATASET=my-project
You can find more details about the gcd tool (including instructions for managing indexes) here.
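For a Python client, the same two variables can also be set from inside the process. This is only a minimal sketch, assuming the client library reads DATASTORE_HOST and DATASTORE_DATASET from the environment as described above; the helper name use_local_datastore is made up for illustration.

```python
import os


def use_local_datastore(dataset, host="http://localhost:8080"):
    """Point the client at the local gcd server via environment variables."""
    # Same variables as the export lines above. Set them before importing the
    # Datastore client, in case the library reads them at import time.
    os.environ["DATASTORE_HOST"] = host
    os.environ["DATASTORE_DATASET"] = dataset
    return {
        name: os.environ[name]
        for name in ("DATASTORE_HOST", "DATASTORE_DATASET")
    }


if __name__ == "__main__":
    print(use_local_datastore("my-project"))
```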

Related

Connect airflow to google fusion

I'd like to write a Python script that manages my Google Data Fusion pipelines and instances (creates new ones, deletes them, starts them, etc.). For that purpose I use Airflow installed as a library. I've read some tutorials and documentation, but I still can't make that script connect to a Data Fusion instance. I've tried to use the following string:
export AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT='google-cloud-platform://?extra__google_cloud_platform__key_path=%2Fkeys%2Fkey.json&extra__google_cloud_platform__scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform&extra__google_cloud_platform__project=airflow&extra__google_cloud_platform__num_retries=5'
with my JSON key file and project ID, but it still doesn't work. Can you give me an example of creating that connection?
You can find an example python script here:
https://airflow.readthedocs.io/en/latest/_modules/airflow/providers/google/cloud/example_dags/example_datafusion.html
This page provides a breakdown for each Data Fusion Operator if you would like to learn more about them:
https://airflow.readthedocs.io/en/latest/howto/operator/gcp/datafusion.html
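If the percent-encoding in the connection string is a suspect, it may help to build the value in code rather than by hand. The sketch below only reproduces the URI format used in the question; the function name build_gcp_conn_uri is hypothetical, and it does not by itself guarantee that the connection will work.

```python
from urllib.parse import urlencode


def build_gcp_conn_uri(key_path, project, scope, num_retries=5):
    """Return a google-cloud-platform:// URI with percent-encoded extras."""
    extras = {
        "extra__google_cloud_platform__key_path": key_path,
        "extra__google_cloud_platform__scope": scope,
        "extra__google_cloud_platform__project": project,
        "extra__google_cloud_platform__num_retries": num_retries,
    }
    # urlencode percent-encodes "/" and ":" inside the values.
    return "google-cloud-platform://?" + urlencode(extras)


if __name__ == "__main__":
    # Export the printed value as AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT.
    print(
        build_gcp_conn_uri(
            key_path="/keys/key.json",
            project="airflow",
            scope="https://www.googleapis.com/auth/cloud-platform",
        )
    )
```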

Google Cloud Functions - Custom library used in many function

I have 2 functions on Google Cloud Functions, using python, that use the same library.
The file organization I have is:
/libs/libCommon.py
/funcA/main.py
/funcB/main.py
Both function A and function B use libCommon.
Through the docs I only see ways for including subdirectories.
There is no clear way to include a parent directory.
What's the best way to organize the code?
Thanks
You can't share code between functions. However, you have several options:
Create a package, publish it on PyPI, and add it as a dependency in your requirements.txt file. Drawback: PyPI is public unless you pay for a subscription.
Create a deployment script that copies the shared source to where it needs to be, and then runs the gcloud command. I'm not a fan of scripting, especially if your project becomes complex.
Use Cloud Run instead of Functions. You can create 2 different containers or only one with 2 entry points. Cloud Run has many advantages.
If your requests can be processed in parallel on the same instance, you can save money.
If not, set the concurrency param to 1 (same behavior as a function).
Your code can be shared between several endpoints.
Your code is portable.
Your service always has 1 vCPU for processing, and memory is customizable. You can also save money compared to Functions.
Of course, I'm a Cloud Run fan, but I think it's the best solution. Storage events are not an issue: set up a notification to publish storage events to Pub/Sub, and then set up a push subscription to your service.
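For the deployment-script option, a rough sketch of the copy step could look like the following. It assumes the layout from the question (libs/, funcA/, funcB/ under one project root); the helper name stage_shared_code is invented here, and the actual gcloud deploy command is left out.

```python
import shutil
from pathlib import Path


def stage_shared_code(project_root, function_dirs, shared_dir="libs"):
    """Copy the shared directory into each function directory before deploying."""
    root = Path(project_root)
    staged = []
    for name in function_dirs:
        target = root / name / shared_dir
        # Replace any stale copy so each deploy ships the current library.
        if target.exists():
            shutil.rmtree(target)
        shutil.copytree(root / shared_dir, target)
        staged.append(target)
    return staged


# Example call, run from the project root before deploying each function:
# stage_shared_code(".", ["funcA", "funcB"])
```

After staging, each main.py should be able to import the library from its own directory (for example, from libs import libCommon), since the copy now sits next to it.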

Travis and Firebase: deploy only changed functions

I'm using Travis to automatically deploy my Firebase hosted website and cloud functions as I push to GitHub, as detailed here. However, even for my small website with a limited amount of cloud functions, deploying all of the functions takes quite a long time. Were I deploying manually, I would be able to use --only to specify precisely those functions that I actually changed. Is there a way to make this information available to Travis, so that only the necessary functions are rebuilt?
https://m.youtube.com/watch?v=iyGHW4UQ_Ts
min 30 and following
This guy solves the problem by copying all functions to a cloud bucket and then diffing every file. This works well if all your logic is in one file, but that is not what you want for larger projects. For my own project, I used webpack to create one file per function that includes its imports. Then I generate an MD5 hash for that file and save it to a functions-lock.json. On the next run, I can easily check against the old hash value and deploy only the changed functions. The CI should manage the state of the lock file by uploading it to the cloud or doing some git magic.
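The hash-and-lock-file idea could be sketched roughly as below. This is not the exact script from that project: it assumes webpack has written one .js bundle per function into a single directory, and the function names changed_functions and write_lock are illustrative.

```python
import hashlib
import json
from pathlib import Path


def changed_functions(bundle_dir, lock_path):
    """Return (names of bundles whose MD5 changed, the new hash table)."""
    lock_file = Path(lock_path)
    old = json.loads(lock_file.read_text()) if lock_file.exists() else {}
    new = {
        bundle.stem: hashlib.md5(bundle.read_bytes()).hexdigest()
        for bundle in sorted(Path(bundle_dir).glob("*.js"))
    }
    # A function needs redeploying when it is new or its bundle hash differs.
    changed = [name for name, digest in new.items() if old.get(name) != digest]
    return changed, new


def write_lock(lock_path, hashes):
    """Persist the hashes so the next CI run can compare against them."""
    Path(lock_path).write_text(json.dumps(hashes, indent=2, sort_keys=True))
```

The CI job would call changed_functions, deploy only the returned names, and then call write_lock once the deploy has succeeded, so a failed deploy is retried on the next run.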
Unfortunately this isn't going to be simple to do -- the Firebase CLI deploys all of your functions because it's next-to-impossible to just analyze the code and figure out which functions are impacted (since you can require other files, you might have updated dependencies but no files changed, etc.).
One thing I can think of that might be a hack would be to have named branches for functions or groups of functions. Then you could git push to the branch of the specific function you want to deploy, and have a script that uses the branch name as a signal to pass the --only functions:<fnName> to the firebase deploy command. That's not the most glamorous solution, but, depending on how much this bugs you, it might help.
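A minimal sketch of the branch-name hack, assuming a made-up convention where branches named deploy/<fnName> deploy a single function and every other branch deploys everything. TRAVIS_BRANCH is the variable Travis sets for the current branch; the deploy/ prefix and the function name deploy_command are hypothetical.

```python
import os


def deploy_command(branch, prefix="deploy/"):
    """Map a branch such as 'deploy/sendEmail' to a firebase deploy command."""
    command = ["firebase", "deploy"]
    if branch.startswith(prefix) and len(branch) > len(prefix):
        # Deploy only the function named after the prefix.
        command += ["--only", "functions:" + branch[len(prefix):]]
    return command


if __name__ == "__main__":
    # Prints the command; a CI step could pass the list to subprocess.run.
    print(" ".join(deploy_command(os.environ.get("TRAVIS_BRANCH", "master"))))
```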
So this is a bit late but the long deployment times have bothered us for a while now.
Our solution is based on CircleCI but it should be possible to adapt.
First we get all changed files in the last merged PR for our branch with
git log -m -1 --name-only --pretty="format:" ${process.env.CIRCLE_SHA1}
CIRCLE_SHA1 is the SHA of the last merge commit, i.e. featurebranch -> master.
Then we get all the function filenames from our /functions/ directory and use madge to generate an array of all the dependencies those functions have.
Next we go through all the changed files that we got from git and check whether their filename is part of the dependency array for a specific cloud function; if so, we add the cloud function to another array.
Once this is done, we have an array of all the cloud functions affected by the changed files, which we can then map to their actual cloud function names for deployment.
Now, instead of always deploying 75 cloud functions, which takes 45 minutes, we deploy maybe 20.
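The matching step can be sketched as follows. This is an illustration rather than the actual CircleCI script: it assumes the changed files from git and the per-function dependency lists from madge have already been collected, and the file and function names in the example are invented.

```python
def affected_functions(changed_files, dependencies):
    """Return the functions whose dependency list contains a changed file."""
    changed = set(changed_files)
    return sorted(
        name for name, files in dependencies.items() if changed & set(files)
    )


def only_argument(function_names):
    """Build the value passed to firebase deploy --only."""
    return ",".join("functions:" + name for name in function_names)


if __name__ == "__main__":
    # Hypothetical data standing in for the git and madge output.
    dependencies = {
        "sendEmail": ["functions/sendEmail.js", "functions/lib/mail.js"],
        "resizeImage": ["functions/resizeImage.js", "functions/lib/images.js"],
    }
    changed = ["functions/lib/mail.js", "README.md"]
    print(only_argument(affected_functions(changed, dependencies)))
```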

Amazon Web Services - how to run a script daily

I have an R script that I run every day that scrapes data from a couple of different websites and then writes the scraped data to a couple of different CSV files. Each day, at a specific time (that changes daily), I open RStudio, open the file, and run the script. I check that it runs correctly each time, and then I save the output to a CSV file. It is often a pain to have to do this every day (it takes ~10-15 minutes a day). I would love it if I could somehow have this script run automatically at a predefined time, and a buddy of mine said AWS is capable of doing this?
Is this true? If so, what is the specific feature or aspect of AWS that is able to do this, so I can look into it further?
Thanks!
Two options come to mind thinking about this:
Host an EC2 instance with R on it and configure a cron job to execute your R script regularly.
One easy way to get started: Use this AMI.
To execute the script, R offers a CLI, Rscript. See e.g. here on how to set this up.
Go serverless: AWS Lambda is a hosted microservice. Currently R is not natively supported, but on the official AWS Blog here they offer a step-by-step guide on how to run R. Basically you execute R from Python using the rpy2 package.
Once you have this set up, schedule the function via CloudWatch Events (roughly a hosted cron job). Here you can find a step-by-step guide on how to do that.
One more thing: you say that your script outputs CSV files. To save them properly you will need to put them in file storage such as AWS S3. You can do this in R via the aws.s3 package. Another option would be to use the AWS SDK for Python, which is preinstalled in the Lambda function. You could e.g. write a CSV file to the /tmp/ directory and, after the R script is done, move the file to S3 via boto3's S3 upload_file function.
IMHO the first option is easier to set up, but the second one is more robust.
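The upload step mentioned above might look something like this sketch. It assumes the R script has already written its CSV files to /tmp; the bucket name my-scraper-output, the daily/ prefix, and the helper name upload_csv_files are placeholders.

```python
from pathlib import Path


def upload_csv_files(s3_client, directory, bucket, prefix=""):
    """Upload every CSV in directory to bucket and return the object keys."""
    keys = []
    for csv_path in sorted(Path(directory).glob("*.csv")):
        key = prefix + csv_path.name
        # upload_file(Filename, Bucket, Key) is the boto3 S3 client call.
        s3_client.upload_file(str(csv_path), bucket, key)
        keys.append(key)
    return keys


def handler(event, context):
    """Lambda entry point: call after the R script has written to /tmp."""
    import boto3  # provided by the Lambda Python runtime

    # Hypothetical bucket and prefix; replace with your own.
    return upload_csv_files(
        boto3.client("s3"), "/tmp", "my-scraper-output", "daily/"
    )
```

Passing the client in as an argument keeps the upload logic easy to try locally with a stand-in object before deploying.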
It's a bit counterintuitive, but you'd use CloudWatch with an event rule to run periodically. It can run a Lambda or send a message to an SNS topic or SQS queue. The challenge you'll have is that Lambda doesn't support R, so you'd either have to have a Lambda kick off something else or have something waiting on the SNS topic or SQS queue to run the script for you. It isn't a perfect solution, as there are potentially quite a few moving parts.
@stdunbar is right about using CloudWatch Events to trigger a Lambda function. You can set a frequency for the trigger or use a cron expression. But as he mentioned, Lambda does not natively support R.
This may help you to use R with Lambda: R Statistics ready to run in AWS Lambda and x86_64 Linux VMs
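To make the scheduling part concrete, here is a small sketch of creating a daily rule with boto3. It assumes events_client is boto3.client("events"); the helper names and the rule name are made up, and attaching the Lambda as a target (put_targets, plus permission for the rule to invoke it) is left out.

```python
def daily_cron(hour, minute=0):
    """Return an AWS schedule expression for a daily run at hour:minute UTC."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    # Fields: minutes hours day-of-month month day-of-week year.
    return "cron({} {} * * ? *)".format(minute, hour)


def create_daily_rule(events_client, name, hour, minute=0):
    """Create or update a scheduled rule and return the API response."""
    return events_client.put_rule(
        Name=name,
        ScheduleExpression=daily_cron(hour, minute),
        State="ENABLED",
    )


# Example call (hypothetical rule name):
# create_daily_rule(boto3.client("events"), "daily-r-script", hour=6, minute=30)
```

Because put_rule updates an existing rule with the same name, a run time that changes from day to day could be handled by calling it again with the new hour and minute.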
If you are running Windows, one of the easier solutions is to write a .BAT script to run your R script and then use Windows Task Scheduler to run it as desired.
To call your R script from your batch file, use the following syntax:
"C:\Program Files\R\R-3.2.4\bin\Rscript.exe" C:\rscripts\hello.R
Just verify that the path to the Rscript application and the path to your R code are correct.
Dockerize your script (write a Dockerfile, build an image)
Push the image to AWS ECR
Create an AWS ECS cluster and AWS ECS task definition within the cluster that will run the image from AWS ECR every time it's spun-up
Use EventBridge to create a time-based trigger that will run the AWS ECS task definition
I recently gave a seminar walking through this at the Why R? 2022 conference.
You can check out the video here: https://www.youtube.com/watch?v=dgkm0QkWXag
And the GitHub repo here: https://github.com/mrismailt/why-r-2022-serverless-r-in-the-cloud

Is there a way to back up my App Services / Usergrid data

App Services is a great place to store data, but now that I have a lot of critical info in there, I realized there isn't a way to create a backup or roll back to an earlier state (in case I did something stupid like -X DELETE /users).
Any way to back up this data either online or offline?
Apart from API access to fetch records x by x and storing locally, there is no solution at the moment. Team is planning an S3 integration (export data to S3) but no completion date is defined for that yet.
Looks like the only way is to query the data using e.g. curl and save the results to a local file. I don't believe there is a way to export natively.
http://apigee.com/docs/app-services/content/working-queries
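A scripted version of the "query and save locally" approach could look roughly like this. It is a sketch under assumptions, not a documented recipe: it assumes query responses carry an "entities" list plus a "cursor" value while more pages remain, and that an access_token query parameter is accepted. The base URL, collection name, and token are placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_page(url):
    """GET a URL and decode the JSON body."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def export_collection(base_url, collection, token, fetch=fetch_page, limit=100):
    """Page through a collection and return every entity as a list."""
    entities, cursor = [], None
    while True:
        params = {"limit": limit, "access_token": token}
        if cursor:
            params["cursor"] = cursor
        page = fetch("{}/{}?{}".format(base_url, collection, urlencode(params)))
        entities.extend(page.get("entities", []))
        cursor = page.get("cursor")
        # Assumption: no cursor in the response means the last page was read.
        if not cursor:
            return entities
```

The returned list can then be written to a local file with json.dump, once per collection, to serve as an offline backup.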
Since the 2014/2015 Usergrid versions, it is possible to make exports and imports using "Usergrid tools".
This page explains how to install them:
https://github.com/apache/incubator-usergrid/tree/master/stack/tools
Basically, you run
$ java -jar usergrid-tools.jar export
and this will export your data as JSON files in an export directory.
There are several export and import tools available; the best way to see them is to visit this page:
https://github.com/apache/incubator-usergrid/tree/6d962b7fe1cd5b47896ca16c0d0b9a297df45a54/stack/tools/src/main/java/org/apache/usergrid/tools

Resources