How to make Azure batch see my data files? - r

I am new to Azure batch. I am trying to use R in parallel with Azure batch in rstudio to run code on a cluster. I am able to successfully start the cluster and get the example code to work properly. When I try to run my own code I am getting an error that says the cluster nodes cannot find my data files. Do I have to change my working directory to Azure batch somehow?
Any information on how to do this is much appreciated.

I have figured out how to get Azure batch to see my data files. Not sure if this is the most efficient way, but here is what I did.
1) Download a program called Microsoft Azure Storage Explorer, which runs on my local computer.
2) Connect to my Azure storage using the storage name and primary storage key found in the Azure portal.
3) In Microsoft Azure Storage Explorer, find Blob containers, right-click, and create a new container.
4) Upload the data files to that new container.
5) Right-click on a data file and choose copy URL.
6) Paste the URL into R like this: model_Data <- read.csv(paste('https://<STORAGE NAME HERE>.blob.core.windows.net/$root/k', k, '%20data%20file.csv', sep = ''), header = TRUE)
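The last step can be wrapped in a small helper so the URL is not assembled by hand each time. This is only a sketch: the function names blob_url and read_k_data and the account name "mystorage" are made up for illustration, and reading the file this way assumes the blob URL is readable without extra credentials (for example a container with public read access, or a URL that already carries a SAS token).

```r
# Build the URL of a blob. The blob name is percent-encoded so that
# spaces become %20, as in the URL copied from Storage Explorer.
blob_url <- function(storage_account, container, blob_name) {
  paste0(
    "https://", storage_account, ".blob.core.windows.net/",
    container, "/", utils::URLencode(blob_name)
  )
}

# Hypothetical wrapper: read the k-th data file straight from Blob storage.
# Assumes the files are named "k<k> data file.csv" as in the answer above.
read_k_data <- function(k, storage_account, container = "$root") {
  url <- blob_url(storage_account, container, paste0("k", k, " data file.csv"))
  read.csv(url, header = TRUE)
}

# Only builds the URL; no network access is needed for this line.
print(blob_url("mystorage", "$root", "k1 data file.csv"))
```

Inside the parallel loop, each worker would then call something like read_k_data(k, "<STORAGE NAME HERE>") instead of reading from the local working directory.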

Related

How to access remote file in GGIR package in R?

I am using the GGIR package for accelerometer data analysis. My data is in a OneDrive folder, which takes a long time to download. Is there a way I can access the OneDrive files directly without downloading them to my local machine?
My guess would be that this is not possible. If you're working with Azure, there are tools available to connect to OneDrive and download/upload the data, which is then processed on a separate instance. I'm guessing the same applies to your local machine, but I'm not familiar enough with Microsoft's services to be sure.
For example:
By using Azure Logic Apps and the OneDrive connector, you can create automated tasks and workflows to manage your files, including upload, get, delete files, and more. With OneDrive, you can perform these tasks:
Build your workflow by storing files in OneDrive, or update existing files in OneDrive.
Use triggers to start your workflow when a file is created or updated within your OneDrive.
Use actions to create a file, delete a file, and more. For example, when a new Office 365 email is received with an attachment (a trigger), create a new file in OneDrive (an action).
https://learn.microsoft.com/en-us/azure/connectors/connectors-create-api-onedrive

write.csv and read.csv in Shiny App shared on shinyapps.io

I have created an app that I want to share on shinyapps.io.
Within the code for the app I use the functions load, write.csv, and read.csv, which read and write files in folders called outputs and data. My app works fine when I run it locally, but when I deploy it I get the error:
cannot open compressed file 'data\Catchments.RData', probable reason 'No such file or directory'
I tried using a folder called www to store these but still had error messages. Is there a way to use these functions when sharing an app on shinyapps.io?
There's no possibility of using directories on shinyapps.io. An easy fix is to place an upload button inside the app, perform all the manipulations you need, and finally download the result with a download button. Getting the data from a remote server is also a good option.
As explained in this article:
"Local vs remote storage
Before diving into the different storage methods, one important distinction to understand is local storage vs remote storage.
Local storage means saving a file on the same machine that is running the Shiny application. Functions like write.csv(), write.table(), and saveRDS() implement local storage because they will save a file on the machine running the app. Local storage is generally faster than remote storage, but it should only be used if you always have access to the machine that saves the files.
Remote storage means saving data on another server, usually a reliable hosted server such as Dropbox, Amazon, or a hosted database. One big advantage of using hosted remote storage solutions is that they are much more reliable and can generally be more trusted to keep your data alive and not corrupted.
When going through the different storage type options below, keep in mind that if your Shiny app is hosted on shinyapps.io, you will have to use a remote storage method for the time being. In the meantime, using local storage is only an option if you’re hosting your own Shiny Server. If you want to host your own server, here is a guide that describes in detail how to set up your own Shiny Server."
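The upload/download approach suggested in the answer could look roughly like the sketch below. It assumes the shiny package is installed (the app is only built if it is), and the helper add_row_total is a hypothetical stand-in for whatever manipulation the real app performs. Nothing is written to the server's own folders: the input comes from the user's upload and the result goes back through the download button.

```r
# Hypothetical processing step: add a column with the sum of all numeric columns.
add_row_total <- function(df) {
  numeric_cols <- vapply(df, is.numeric, logical(1))
  df$row_total <- unname(rowSums(df[numeric_cols]))
  df
}

# Build the app only when shiny is available; run it with shiny::runApp(app).
if (requireNamespace("shiny", quietly = TRUE)) {
  ui <- shiny::fluidPage(
    shiny::fileInput("upload", "Upload a CSV file", accept = ".csv"),
    shiny::downloadButton("download", "Download result")
  )

  server <- function(input, output, session) {
    # Read the uploaded file from the temporary path Shiny provides.
    uploaded <- shiny::reactive({
      shiny::req(input$upload)
      read.csv(input$upload$datapath)
    })

    # Write the processed data to the file Shiny hands to the browser.
    output$download <- shiny::downloadHandler(
      filename = function() "result.csv",
      content = function(file) {
        write.csv(add_row_total(uploaded()), file, row.names = FALSE)
      }
    )
  }

  app <- shiny::shinyApp(ui, server)
}
```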

Generating ZIP files in azure blob storage

What is the best method to zip large files stored in Azure Blob storage and deliver them to the user as an archive file (zip/rar)?
Would using Azure Batch help?
Currently we implement this in the traditional way: we read the streams, generate the zip file, and return the result, but this takes a lot of server resources and a lot of time for users.
I am asking about the best technical approach and technologies (preferably using Microsoft technologies).
There are a few ways you can do this **from an Azure Batch only point of view**. (For the initial part, the user code should own whatever zip API it uses to zip the files; once the archive is in blob storage and you want to use it on the nodes, the options below apply.)
For the initial part of your question, I found this, which could come in handy: https://microsoft.github.io/AzureTipsAndTricks/blog/tip141.html (this is mainly to give you an idea; you know your requirements better and need to design your solution accordingly).
In options 1 and 3 below, you need to make sure your user code handles unzipping or unpacking the archive. Option 2 is the Batch built-in feature for *.zip files at both pool and task level.
Option 1: You could add your *.rar or *.zip file as an Azure Batch resource file and then unzip it in the start task, once the resource file is downloaded. See: Azure Batch Pool Start up task to download resource file from Blob FileShare
Option 2: The best option, if you have zip rather than rar files in play, is the feature named Azure Batch application packages: https://learn.microsoft.com/en-us/azure/batch/batch-application-packages
The application packages feature of Azure Batch provides easy
management of task applications and their deployment to the compute
nodes in your pool. With application packages, you can upload and
manage multiple versions of the applications your tasks run, including
their supporting files. You can then automatically deploy one or more
of these applications to the compute nodes in your pool.
https://learn.microsoft.com/en-us/azure/batch/batch-application-packages#application-packages
An application package is a .zip file that contains the application binaries and supporting files that are required for your
tasks to run the application. Each application package represents a
specific version of the application.
With regard to size: refer to the maximum allowed blob size linked in the document above.
Option 3: (Not sure whether this fits your scenario.) This is a long shot for your specific case, but you could also mount blob storage as a virtual drive when the node joins the pool, via the mount feature in Azure Batch. You would then need code in the start task (or similar) to unzip from the mounted location.
Hope this helps :)

Accessing files from Google cloud storage in RStudio

I have been trying to create a connection between Google Cloud Storage and an RStudio server (one I spun up in Google Cloud), so that I can access the files in R to run some analysis on them.
I have found three different ways to do it on the web, but I have not found much clarity about any of them so far.
1) Access the file by using the public URL specific to the file [this is not an option for me].
2) Mount the Google Cloud Storage bucket as a disk on the RStudio server and access it like any other files on the server [I saw someone post about this method but could not find any guides or materials that show how it's done].
3) Use the googleCloudStorageR package to get full access to the Cloud Storage bucket.
Option 3 looks like the standard way to do it, but I get the following error when I run the gcs_auth() command:
Error in gar_auto_auth(required_scopes, new_user = new_user, no_auto =
no_auto, : Cannot authenticate -
options(googleAuthR.scopes.selected) needs to be set to
include https://www.googleapis.com/auth/devstorage.full_control or
https://www.googleapis.com/auth/devstorage.read_write or
https://www.googleapis.com/auth/cloud-platform
The guide on how to connect using this package is found at
https://github.com/cloudyr/googleCloudStorageR
but it says it requires a service-auth.json file to set the environment variables, along with other keys and secret keys, and it does not really specify what these are.
If someone could help me understand how this is actually set up, or point me to a good guide on setting up the environment, I would be very grateful.
Thank you.
Before using any Google Cloud service, you have to attach a payment card to your account.
I am assuming you have already created an account. Go to the Console, create a project if you have not done so yet, then open APIs & Services > Credentials from the sidebar.
Then:
1) Create a service account key and save the file as JSON. You can only download it once.
2) Create an OAuth 2.0 client ID: give the app a name, select the type "Web application", and download the JSON file.
For storage, find Storage in the sidebar and click on it.
Create a bucket and give it a name.
I have added a single image to the bucket; you can add your own files for the code below.
Let's look at how to download this image from storage. For other operations you can follow the link that you have given.
First create an environment file named .Renviron and save it in the working directory, so that the JSON files are picked up automatically.
In the .Renviron file, add the two downloaded JSON files like this:
GCS_AUTH_FILE="serviceaccount.json"
GAR_CLIENT_WEB_JSON="Oauthclient.json"
# R part
library(googleCloudStorageR)
library(googleAuthR)
gcs_auth() # authenticate
# set the scopes
gar_set_client(scopes = c("https://www.googleapis.com/auth/devstorage.read_write",
                          "https://www.googleapis.com/auth/cloud-platform"))
gcs_get_bucket("your_bucket_name") # name of the bucket that you have created
gcs_global_bucket("your_bucket_name") # set it as the global bucket
gcs_get_global_bucket() # check that the global bucket is set; you should get your bucket name
objects <- gcs_list_objects() # list the objects in the bucket
names(objects)
gcs_get_object(objects$name[[1]], saveToDisk = "abc.jpeg") # save the first object to disk
**Note:** if the JSON files are not loaded, restart the session using .rs.restartR()
and check them using
Sys.getenv("GCS_AUTH_FILE")
Sys.getenv("GAR_CLIENT_WEB_JSON")
# it should show the file names
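The error in the question asks for options(googleAuthR.scopes.selected) to be set, so one thing worth trying is to set that option before calling gcs_auth(). The following is a sketch under assumptions, not a verified fix: it assumes GCS_AUTH_FILE points at the service account JSON as in the .Renviron above, that gcs_auth() is given the key through its json_file argument, and "your_bucket_name" is a placeholder. The Cloud Storage calls only run if the package and the key file are actually present.

```r
# Scope copied from the error message; set it before authenticating.
options(
  googleAuthR.scopes.selected =
    "https://www.googleapis.com/auth/devstorage.read_write"
)

# Path to the service account key, as configured in .Renviron.
key_file <- Sys.getenv("GCS_AUTH_FILE")

# Guarded so the script does nothing harmful when the package or key is missing.
if (requireNamespace("googleCloudStorageR", quietly = TRUE) && file.exists(key_file)) {
  googleCloudStorageR::gcs_auth(json_file = key_file)
  # "your_bucket_name" is a placeholder for the bucket created earlier.
  print(googleCloudStorageR::gcs_list_objects(bucket = "your_bucket_name"))
}
```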
You probably want the FUSE adapter. It lets you mount your GCS bucket as a directory on your server.
1) Install gcsfuse on the R server.
2) Create a mnt directory.
3) Run gcsfuse your-bucket /path/to/mnt
Be aware, though, that read/write performance isn't great via FUSE.
Full documentation:
https://cloud.google.com/storage/docs/gcs-fuse
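Once the bucket is mounted, R can treat it like any other directory. Here is a minimal sketch, assuming the bucket contains CSV files. The function name read_mounted_csvs is made up for illustration, and a temporary directory stands in for /path/to/mnt so the example runs without a real mount.

```r
# Read every CSV file found in a (mounted) directory into a named list.
read_mounted_csvs <- function(mount_dir, pattern = "\\.csv$") {
  files <- list.files(mount_dir, pattern = pattern, full.names = TRUE)
  data <- lapply(files, read.csv)
  names(data) <- basename(files)
  data
}

# Demonstration: a temporary directory plays the role of /path/to/mnt.
demo_dir <- file.path(tempdir(), "mnt_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(
  data.frame(x = 1:3, y = c(2, 4, 6)),
  file.path(demo_dir, "example.csv"),
  row.names = FALSE
)

tables <- read_mounted_csvs(demo_dir)
print(names(tables))
```

With a real mount you would call read_mounted_csvs("/path/to/mnt"). Given the performance caveat above, it may be worth copying large files to local disk once instead of re-reading them through FUSE.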

Where to store binary format database when uploading to Azure in order to reduce package size?

We are using the city database GeoLiteCity in our C# web site hosted on Azure.
As they recommend, we use the binary: GeoLiteCity.dat.
This file is included in our project.
The problem is that the file weighs roughly 28 MB.
As GeoLiteCity.dat is present both in the source code and in the bin folder, every time we deploy to Azure, the package size is huge.
Is there a way for us to reduce the size of the package while still using the same database?
We already use blobs for our static content, as advised in this question, but is it possible to do the same with a .dat file?
EDIT: It's a read only database.
Since this is a read-only database, have you considered using Azure Drive? That way, you could have a single virtual hard drive stored in blob storage and mount it as read-only in every instance that needs it.

Resources