I am trying to set up a Google Colab notebook that runs R and can read a GCS bucket from a GCP project. I am using the googleCloudStorageR package. To authenticate and read the bucket, the initial Colab notebook runs the following shell commands:
!gcloud auth login
!gcloud config set project project_name
!gcloud sql instances describe project_name
How can I run the above commands in R using the googleCloudStorageR package? In the documentation for the package, they mention using the gcs_auth function, which reads an authentication JSON file. However, since I will be accessing the buckets through a Colab notebook running R, I do not want to use an authentication file and instead want to authenticate and connect to GCP storage in real time from the Colab notebook. Thank you!
Figured this out. In a Colab notebook, run the following code snippet:
install.packages("httr")
install.packages("R.utils")
install.packages("googleCloudStorageR")
if (file.exists("/usr/local/lib/python3.6/dist-packages/google/colab/_ipython.py")) {
library(R.utils)
library(httr)
reassignInPackage("is_interactive", pkgName = "httr", function() return(TRUE))
}
library(googleCloudStorageR)
options(
rlang_interactive = TRUE,
gargle_oauth_email = "email_address",
gargle_oauth_cache = TRUE
)
token <- gargle::token_fetch(scopes = "https://www.googleapis.com/auth/cloud-platform")
googleAuthR::gar_auth(token = token)
There is an issue with the gargle authentication flow that the googleCloudStorageR package relies on. A workaround similar to the one described here (https://github.com/r-lib/gargle/issues/140) is to generate a token for the cloud-platform scope, which gives us a token object that we then pass to the gar_auth function.
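Once the token is set via gar_auth, the usual googleCloudStorageR calls should work against the bucket. A minimal sketch, where the project and bucket names are placeholders:
# placeholder project and bucket names -- replace with your own
buckets <- gcs_list_buckets(projectId = "project_name")
gcs_global_bucket("my-bucket-name")
# list the objects in the bucket and read the first CSV into a data frame
objs <- gcs_list_objects()
df <- gcs_get_object(objs$name[[1]])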
I'm using the gmailr package for sending emails from an R script.
Locally it's all working fine, but when I try to run it during a Docker build step in Google Cloud, I'm getting an error.
I implemented it in the way described here.
So basically, locally the part of my code for sending emails looks like this:
gm_auth_configure(path = "credentials.json")
gm_auth(email = TRUE, cache = "secret")
gm_send_message(buy_email)
Please note that I renamed the .secret folder to secret, because I want to deploy my script with Docker on gcloud and didn't want any unexpected errors due to the dot in the folder name.
This is the code, which I'm now trying to run in the cloud:
setwd("/home/rstudio/")
gm_auth_configure(path = "credentials.json")
options(
gargle_oauth_cache = "secret",
gargle_oauth_email = "email.which.was.used.to.get.secret#gmail.com"
)
gm_auth(email = "email.which.was.used.to.get.secret#gmail.com")
When running this code in a docker build process, I'm receiving the following error:
Error in gmailr_POST(c("messages", "send"), user_id, class = "gmail_message", :
Gmail API error: 403
Request had insufficient authentication scopes.
Calls: gm_send_message -> gmailr_POST -> gmailr_query
I can reproduce the error locally when I do not tick the box granting Gmail access on the Google consent screen.
Therefore my first assumption is that the secret folder is not being copied correctly during the Docker build, so the script tries to authenticate again; in a non-interactive session that box cannot be ticked and the error is thrown.
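To check that assumption, something like the following sketch could go at the top of run_script.R to see what actually ends up inside the container (gargle_oauth_sitrep() reports any cached tokens it can find):
# sanity check inside the container: is the cache folder there,
# and does gargle see a cached token in it?
print(list.files("/home/rstudio/secret", all.files = TRUE))
print(gargle::gargle_oauth_sitrep(cache = "/home/rstudio/secret"))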
This is the part of the Dockerfile.txt where I add the files and run the script:
#2 ADD FILES TO LOCAL
COPY . /home/rstudio/
WORKDIR /home/rstudio
#3 RUN R SCRIPT
CMD Rscript /home/rstudio/run_script.R
and this is the folder that contains all the files and folders being pushed to the cloud.
My second assumption is that I have to somehow specify the Google platform scope for my Docker image, but unfortunately I'm not sure where to do that.
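A sketch of what I mean, assuming gm_auth() accepts a scopes argument as in recent gmailr versions:
# request the Gmail scope explicitly instead of relying on the default
gm_auth(
  email = "email.which.was.used.to.get.secret#gmail.com",
  cache = "secret",
  scopes = "full"
)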
I'd really appreciate any help! Thanks in advance!
For anyone experiencing the same problem, I was finally able to find a solution.
The problem is that gargle picks up GCE (Google Compute Engine) credentials instead of going through the normal user OAuth flow.
To temporarily disable GCE auth, I'm using the following piece of code now:
library(gargle)
cred_funs_clear()
cred_funs_add(credentials_user_oauth2 = credentials_user_oauth2)
gm_auth_configure(path = "credentials.json")
options(
gargle_oauth_cache = "secret",
gargle_oauth_email = "sp500tr.cloud#gmail.com"
)
gm_auth(email = "email.which.was.used.for.credentials.com")
cred_funs_set_default()
For further references see also here.
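With authentication working non-interactively, the rest of the run is just building and sending the message as in the local version. A sketch with placeholder addresses:
# build the email and send it, exactly as in the local script
buy_email <- gm_mime()
buy_email <- gm_to(buy_email, "recipient#gmail.com")
buy_email <- gm_from(buy_email, "sp500tr.cloud#gmail.com")
buy_email <- gm_subject(buy_email, "Buy signal")
buy_email <- gm_text_body(buy_email, "Automated message from the Docker container.")
gm_send_message(buy_email)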
I am trying to use pygsheets from within a Jupyter notebook and I cannot get it to work, while the same piece of code works fine from within IPython.
from pathlib import Path
import pygsheets
creds = Path(r"/path/to/client_secret.json")
gc = pygsheets.authorize(client_secret=creds)
book = gc.open_by_key("__key__of__sheet__")
wks = book.worksheet_by_title("Sheet1")
wks.clear(start="A2")
When called from within IPython everything works fine, whereas from within a Jupyter notebook I get
RefreshError: ('invalid_grant: Token has been expired or revoked.', {'error': 'invalid_grant', 'error_description': 'Token has been expired or revoked.'})
I run both pieces from within the same conda environment. Any suggestions on how to narrow down the problem (and solutions) are very welcome!
It turns out that the current working directory of my Jupyter notebook was not the same as that of my plain IPython session. pygsheets stores the token used for authentication with Google in the current working directory, and if the .json token file in that directory is invalid, authentication fails.
You can add the parameter credentials_directory=... to specify the directory containing a previously validated token file.
The solution I ended up with was
gc = pygsheets.authorize(client_secret=creds, credentials_directory=creds.parent)
That way token and credential files are in the same directory.
I have recently started working with Databricks and Azure.
I have Microsoft Azure Storage Explorer. I ran a jar program on Databricks
which outputs many csv files in Azure Storage Explorer under the path
..../myfolder/subfolder/output/old/p/
The usual thing I do is to go to the folder p and download all the csv files
by right-clicking the p folder and clicking download to my local drive,
and then read these csv files in R to do any analysis.
My issue is that sometimes my runs generate more than 10000 csv files,
and downloading them to the local drive takes a lot of time.
I wondered if there is a tutorial or R package which would help me read
the csv files from the path above without downloading them. For example,
is there any way I can set
..../myfolder/subfolder/output/old/p/
as my working directory and process all the files in the same way I do locally.
EDIT:
the full url to the path looks something like this:
https://temp.blob.core.windows.net/myfolder/subfolder/output/old/p/
According to the official Azure Databricks document CSV Files, you can read a csv file directly in R in an Azure Databricks notebook, as shown in the R example of the section Read CSV files notebook example.
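For reference, that official approach boils down to something like this in a Databricks R notebook; a sketch where the mount path is a placeholder for wherever your blob container is mounted:
library(SparkR)
# read every csv under the folder into one Spark DataFrame,
# then pull it into a local R data.frame if it is small enough
sdf <- read.df("/mnt/mycontainer/myfolder/subfolder/output/old/p/",
               source = "csv", header = "true", inferSchema = "true")
df <- collect(sdf)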
Alternatively, I used the R package reticulate and the Python package azure-storage-blob to read a csv file directly from a blob URL with a SAS token for Azure Blob Storage.
Here are my steps:
1. I created an R notebook in my Azure Databricks workspace.
2. Install the R package reticulate via install.packages("reticulate").
3. Install the Python package azure-storage-blob with the code below (note that step 4 uses the legacy BaseBlobService API, so you may need a pre-v12 release of azure-storage-blob, such as 2.1.0).
%sh
pip install azure-storage-blob
4. Run the Python script below to generate a container-level SAS token and use it to build a list of blob URLs with that SAS token.
library(reticulate)
py_run_string("
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.blob import BlobPermissions
from datetime import datetime, timedelta
account_name = '<your storage account name>'
account_key = '<your storage account key>'
container_name = '<your container name>'
blob_service = BaseBlobService(
account_name=account_name,
account_key=account_key
)
sas_token = blob_service.generate_container_shared_access_signature(container_name, permission=BlobPermissions.READ, expiry=datetime.utcnow() + timedelta(hours=1))
blob_names = blob_service.list_blob_names(container_name, prefix = 'myfolder/')
blob_urls_with_sas = ['https://'+account_name+'.blob.core.windows.net/'+container_name+'/'+blob_name+'?'+sas_token for blob_name in blob_names]
")
blob_urls_with_sas <- py$blob_urls_with_sas
5. Now I can use different ways in R to read a csv file from a blob URL with the SAS token, such as below.
5.1. df <- read.csv(blob_urls_with_sas[[1]])
5.2. Using R package data.table
install.packages("data.table")
library(data.table)
df <- fread(blob_urls_with_sas[[1]])
5.3. Using R package readr
install.packages("readr")
library(readr)
df <- read_csv(blob_urls_with_sas[[1]])
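Since your runs can produce thousands of csv files, the same list of URLs can also be read in a loop and combined. A sketch using data.table, assuming all files share the same columns:
library(data.table)
# read every blob URL and stack the results into one table
dt_list <- lapply(blob_urls_with_sas, fread)
all_data <- rbindlist(dt_list)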
Note: for the reticulate library, please refer to the RStudio article Calling Python from R.
Hope it helps.
How can I read a csv directly from Amazon S3 in RStudio? I can't just use read_csv; if I put
read_csv(url("s3a://abc/rerer.txt"))
I get
Error in url("s3a://abc/rerer.txt") :
URL scheme unsupported by this method
I don't want to first move the file locally. I tried using functions like get_bucket in the aws.s3 library, but that is not in a human-readable format.
I recommend the package aws.s3 from the CloudyR project.
To install this package:
# stable version
install.packages("aws.s3", repos = c("cloudyr" = "http://cloudyr.github.io/drat"))
# on windows you may need:
install.packages("aws.s3", repos = c("cloudyr" = "http://cloudyr.github.io/drat"), INSTALL_opts = "--no-multiarch")
Once installed you can read the file in just like this:
library("aws.s3")
r <- aws.s3::get_object(bucket = "bucket", object = "object.csv")
As #Thomas mentioned in a comment, if you know the file type you can use the s3read_using() function in combination with fread or read_csv or whatever R function you normally use. This saves a parsing step after you've retrieved the data.
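For example, the raw vector returned by get_object() can be parsed directly, or retrieval and parsing can be combined with s3read_using(); the bucket and object names below are placeholders:
library(aws.s3)
# parse the raw vector returned by get_object()
r <- get_object(object = "object.csv", bucket = "bucket")
df <- read.csv(text = rawToChar(r))
# or retrieve and parse in one step
df <- s3read_using(FUN = read.csv, object = "object.csv", bucket = "bucket")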
If your credentials are already environment variables, almost no setup is required. Otherwise, you can add them like this:
Sys.setenv("AWS_ACCESS_KEY_ID" = "mykey",
"AWS_SECRET_ACCESS_KEY" = "mysecretkey",
"AWS_DEFAULT_REGION" = "us-east-1",
"AWS_SESSION_TOKEN" = "mytoken")
There's also support for multiple AWS accounts. You can find the CloudyR project and its docs here:
https://github.com/cloudyr
Specifically, the AWS S3 API Client pages of CloudyR are found here:
https://github.com/cloudyr/aws.s3
I am trying to install a sample package from my github repo:
https://github.com/jpmarindiaz/samplepkg
I can install it when the repo is public using any of the following commands through the R interpreter:
install_github("jpmarindiaz/rdali")
install_github("rdali",user="jpmarindiaz")
install_github("jpmarindiaz/rdali",auth_user="jpmarindiaz")
But when the git repository is private I get an Error:
Installing github repo samplepkg/master from jpmarindiaz
Downloading samplepkg.zip from
https://github.com/jpmarindiaz/samplepkg/archive/master.zip
Error: client error: (406) Not Acceptable
I haven't figured out how the authentication works when the repo is private, any hints?
Have you tried setting a personal access token (PAT) and passing it along as the value of the auth_token argument of install_github()?
See ?install_github way down at the bottom (Package devtools version 1.5.0.99).
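For example, something along these lines; the token string is a placeholder, and install_github() will also pick up a GITHUB_PAT environment variable if one is set:
# pass the token explicitly
devtools::install_github("jpmarindiaz/samplepkg", auth_token = "ghp_yourtokenhere")
# or set it once as an environment variable and let install_github() find it
Sys.setenv(GITHUB_PAT = "ghp_yourtokenhere")
devtools::install_github("jpmarindiaz/samplepkg")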
Create an access token in:
https://github.com/settings/tokens
Check the branch name and pass it to ref
devtools::install_github("user/repo",
                         ref = "main",
                         auth_token = "tokenstring")
A more modern solution to this problem is to set your credentials in R using the usethis and credentials packages.
# set config
usethis::use_git_config(user.name = "YourName", user.email = "your#mail.com")
# go to the GitHub page to generate a token
usethis::create_github_token()
# paste your PAT into the pop-up that follows...
credentials::set_github_pat()
# now remotes::install_github() will work
remotes::install_github("username/privaterepo")
More help at https://happygitwithr.com/common-remote-setups.html#common-remote-setups