Is there an established means of using AzureStor and arrow together in R?

In the arrow R guide there's info about using S3 buckets, but nothing about using Azure cloud storage. There's a separate package, AzureStor, which connects to Azure Storage but uses a different syntax, so they don't (seemingly) work together.
Is there an existing adaptation, or an easy way to adapt the AzureStor syntax to a FileSystem class that arrow can use?

I am not sure, but would Azure File Storage help?
Something like mentioned in the link below:
https://www.educba.com/azure-file-storage/
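In the meantime, a practical workaround (not a native arrow FileSystem integration) is to stage the object locally with AzureStor and hand it to arrow. A minimal sketch, where the account URL, key, container, and blob path are all hypothetical:
library(AzureStor)
library(arrow)

# Hypothetical storage account, access key, container and blob path
endp <- storage_endpoint("https://myaccount.blob.core.windows.net", key = "my_access_key")
cont <- storage_container(endp, "mycontainer")

# Stage the blob in a temp file, then read it with arrow
tmp <- tempfile(fileext = ".parquet")
storage_download(cont, src = "data/myfile.parquet", dest = tmp)
df <- read_parquet(tmp)
Setting dest = NULL in storage_download() should return the contents as a raw vector, which read_parquet() also accepts, so the temp file can be skipped for smaller objects.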

Related

How to connect to HDFS from R and read/write parquets using arrow?

I have a couple of parquet files in HDFS that I'd like to read into R, and some data in R that I'd like to write to HDFS in parquet format. I'd like to use the arrow library, because I believe it's the R equivalent of pyarrow, and pyarrow is awesome.
The problem is, nowhere in the R arrow docs can I find information about working with HDFS, and in general there isn't much information about how to use the library properly.
I am basically looking for the R equivalent of:
from pyarrow import fs
filesystem = fs.HadoopFileSystem(host = 'my_host', port = 0, kerb_ticket = 'my_ticket')
Disclosure:
I know how to use odbc to read and write my data. While reading is fine (though slow), inserting larger amounts of data into Impala/Hive this way is downright awful (slow, often fails, and Impala isn't really built to ingest data this way).
I know I could probably use pyarrow to work with hdfs, but would like to avoid installing python in my docker image just for this purpose.
The bindings for this are not currently implemented in R; there is an open ticket on the project JIRA, which at the time of writing is still marked "Unresolved": https://issues.apache.org/jira/browse/ARROW-6981. I'll comment on the JIRA ticket to mention that there is user interest in implementing these bindings.
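Until those bindings land, the odbc route mentioned in the question is the usual stopgap for reads. A minimal sketch, assuming a hypothetical Impala DSN and table (and keeping in mind the asker's caveat that bulk inserts this way are painful):
library(DBI)
library(odbc)

# Hypothetical DSN; any working Impala/Hive ODBC connection will do
con <- dbConnect(odbc(), dsn = "Impala")

# Reads work (if slowly); the table name here is made up
df <- dbGetQuery(con, "SELECT * FROM my_db.my_table LIMIT 1000")

dbDisconnect(con)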

Connect airflow to google fusion

I'd like to write a Python script that manages my Google Data Fusion pipelines and instances (creates new ones, deletes them, starts them, etc.). For that purpose I use Airflow installed as a library. I've read some tutorials and documentation, but I still can't make the script connect to the Data Fusion instance. I've tried the following string:
export AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT='google-cloud-platform://?extra__google_cloud_platform__key_path=%2Fkeys%2Fkey.json&extra__google_cloud_platform__scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform&extra__google_cloud_platform__project=airflow&extra__google_cloud_platform__num_retries=5'
with my JSON key file and project ID, but it still doesn't work. Can you give me an example of creating that connection?
You can find an example python script here:
https://airflow.readthedocs.io/en/latest/_modules/airflow/providers/google/cloud/example_dags/example_datafusion.html
This page provides a breakdown of each Data Fusion operator if you would like to learn more about them:
https://airflow.readthedocs.io/en/latest/howto/operator/gcp/datafusion.html

Is it possible to work locally by training a TensorFlow model using data from Google Cloud Storage without uploading it?

I am trying to use, locally in R, a TensorFlow model with tfdatasets and cloudML, using training data available in Google Cloud Storage without uploading it. As far as I know, the package tfdatasets should use gs:// URLs directly with gs_data_dir().
If I specify in TSScript.R:
data_dir <- gs_data_dir("gs://my-gcp-project/data/")
When I run cloud_train("TSScript.R") I get the error:
Error: 'gs://my-gpc-project/data/train.data.csv' does not exist in current working directory ('/root/.local/lib/python2.7/site-packages/cloudml-model')
Here are my questions:
Is it somehow possible to do this, and am I just making a mistake in my script?
If not, do I need to install R in the cloud and work from there directly?
Would it be possible to train with data from Bigtable without uploading it locally?
Thanks
For 1) I think you might be looking for tf.gfile() (see the sketch after this answer)
https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile
Example of use: https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/census/keras/trainer/model.py#L154
Hope this helps!
For 2) If you want to do this, you should look at Custom Containers. https://cloud.google.com/ml-engine/docs/custom-containers-training
For 3) I'm not familiar with Bigtable, but my guess is you would have to query the data you need and manually pull it locally. I don't think tf.gfile supports Bigtable, only GCS.
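For the tf.gfile suggestion in 1), here is a rough sketch of how it can be reached from R via the tensorflow package's tf$ object, assuming TensorFlow is installed with GCS support; the gs:// path is the hypothetical one from the question:
library(tensorflow)

# Hypothetical GCS path from the question
path <- "gs://my-gcp-project/data/train.data.csv"

# tf$io$gfile exposes tf.io.gfile; GFile can open gs:// paths directly
f <- tf$io$gfile$GFile(path, "r")
contents <- f$read()
f$close()

# Parse the CSV text that was streamed from GCS
train_data <- read.csv(text = contents)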

Google Cloud Functions & ImageMagick: can't deal with PDF

I am trying to convert the first page of a PDF uploaded to Storage to a JPG so that I can generate a thumbnail and display it to my users. I use ImageMagick for that. The issue is that Google Cloud Functions instances don't seem to have Ghostscript (gs), which appears to be a dependency for manipulating PDFs.
Is there a way to have it available in some way?
(FYI, I am able to convert properly on my local machine with both ImageMagick and Ghostscript installed, so I know the command I am using is good.)
AWS Lambda instances have Ghostscript installed, by the way.
Thanks
Actually, Ghostscript was deprecated from App Engine as well. I think your best option is to use pdf.js deployed with your Cloud Function. I have not tried it myself, but it looks like the only way forward with the current state of Cloud Functions. Another option is to deploy a GCE instance with Ghostscript and send a request from the Cloud Function to convert the PDF page for you.

Installing Google Cloud Datastore gcd

I'm trying to wrap my head around working with Google Cloud Datastore but am having trouble getting started. I've downloaded the zip of the gcd tool (v1beta2) as described here, which, when unpacked, consists of three files: gcd.sh, gcd.cmd, and CloudDatastore.jar. Unfortunately, there are no further instructions on what to do next - where to install it, what path variables or permissions to set, etc. Can someone fill me in?
TIA - Joe
Typical usage looks something like:
# create a dataset
gcd.sh create my-project
# start the local datastore
gcd.sh start my-project
Then, if you're using the Java or Python protocol buffers library, you set a couple of environment variables to instruct the client to use the local datastore:
export DATASTORE_HOST=http://localhost:8080
export DATASTORE_DATASET=my-project
You can find more details about the gcd tool (including instructions for managing indexes) here.
