AWS s3 r studio - r

How can I read a csv directly from Amazon s3 from r studio. I can't just use read_csv, If I put,
read_csv(url("s3a://abc/rerer.txt"))
I get
Error in url("s3a://abc/rerer.txt") :
URL scheme unsupported by this method
I don't want to first move the file locally. I tried using functions like get_bucket in AWS s3 library but thats not in human readable format

I recommend the package aws.s3 from the CloudyR project.
To install this package:
# stable version
install.packages("aws.s3", repos = c("cloudyr" = "http://cloudyr.github.io/drat"))
# on windows you may need:
install.packages("aws.s3", repos = c("cloudyr" = "http://cloudyr.github.io/drat"), INSTALL_opts = "--no-multiarch")
Once installed you can read the file in just like this:
library("aws.s3")
r = aws.s3::get_object(bucket = "bucket",object = "object.csv")
As #Thomas mentioned in a comment if you know the file type you can use the read_using() function in combination with fread or read_csv or whatever R function you normally use. This saves a parsing step after you've retrieved the datum.
If your credentials are already environment variables, almost no setup is required. Otherwise, you can add them like this:
Sys.setenv("AWS_ACCESS_KEY_ID" = "mykey",
"AWS_SECRET_ACCESS_KEY" = "mysecretkey",
"AWS_DEFAULT_REGION" = "us-east-1",
"AWS_SESSION_TOKEN" = "mytoken")
There's also support for multiple AWS accounts. You can find the CloudyR project and its docs here:
https://github.com/cloudyr
Specifically, the AWS S3 API Client pages of CloudyR are found here:
https://github.com/cloudyr/aws.s3

Related

Reading csv files from microsoft Azure using R

I have recently started working with databricks and azure.
I have microsoft azure storage explorer. I ran a jar program on databricks
which outputs many csv files in the azure storgae explorer in the path
..../myfolder/subfolder/output/old/p/
The usual thing I do is to go the folder p and download all the csv files
by right clicking the p folder and click download on my local drive
and these csv files in R to do any analysis.
My issue is that sometimes my runs could generate more than 10000 csv files
whose downloading to the local drive takes lot of time.
I wondered if there is a tutorial/R package which helps me to read in
the csv files from the path above without downloading them. For e.g.
is there any way I can set
..../myfolder/subfolder/output/old/p/
as my working directory and process all the files in the same way I do.
EDIT:
the full url to the path looks something like this:
https://temp.blob.core.windows.net/myfolder/subfolder/output/old/p/
According to the offical document CSV Files of Azure Databricks, you can directly read a csv file in R of a notebook of Azure Databricks as the R example of the section Read CSV files notebook example said, as the figure below.
Alternatively, I used R package reticulate and Python package azure-storage-blob to directly read a csv file from a blob url with sas token of Azure Blob Storage.
Here is my steps as below.
I created a R notebook in Azure Databricks workspace.
To install R package reticulate via code install.packages("reticulate").
To install Python package azure-storage-blob as the code below.
%sh
pip install azure-storage-blob
To run Python script to generate a sas token of container level and to use it to get a list of blob urls with sas token, please see the code below.
library(reticulate)
py_run_string("
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.blob import BlobPermissions
from datetime import datetime, timedelta
account_name = '<your storage account name>'
account_key = '<your storage account key>'
container_name = '<your container name>'
blob_service = BaseBlobService(
account_name=account_name,
account_key=account_key
)
sas_token = blob_service.generate_container_shared_access_signature(container_name, permission=BlobPermissions.READ, expiry=datetime.utcnow() + timedelta(hours=1))
blob_names = blob_service.list_blob_names(container_name, prefix = 'myfolder/')
blob_urls_with_sas = ['https://'+account_name+'.blob.core.windows.net/'+container_name+'/'+blob_name+'?'+sas_token for blob_name in blob_names]
")
blob_urls_with_sas <- py$blob_urls_with_sas
Now, I can use different ways in R to read a csv file from the blob url with sas token, such as below.
5.1. df <- read.csv(blob_urls_with_sas[[1]])
5.2. Using R package data.table
install.packages("data.table")
library(data.table)
df <- fread(blob_urls_with_sas[[1]])
5.3. Using R package readr
install.packages("readr")
library(readr)
df <- read_csv(blob_urls_with_sas[[1]])
Note: for reticulate library, please refer to the RStudio article Calling Python from R.
Hope it helps.
Update for your quick question:

RStudio Connect, Packrat, and custom packages in local repos

We have recently got RStudio Connect in my office. For our work, we have made custom packages, which we have updated amongst ourselves by opening the project and build+reloading.
I understand the only way I can get our custom packages to work within apps with RSConnect is to get up a local repo and set our options(repos) to include this.
Currently I have the following:
library(drat)
RepoAddress <- "C:/<RepoPath>" # High level path
drat::insertPackage(<sourcePackagePath>, repodir = RepoAddress)
# Add this new repo to Rs knowledge of repos.
options(repos = c(options("repos")$repos,LocalCurrent = paste0("file:",RepoAddress)))
# Install <PackageName> from the local repo :)
install.packages("<PackageName>")
Currently this works nicely and I can install my custom package from the local repo. This indicates to me that the local repo is set up correctly.
As an additional aside, I have changed the DESCRIPTION file to have an extra line saying repository:LocalCurrent.
However when I try to deploy a Shiny app or Rmd which references , I get the following error on my deploy:
Error in findLocalRepoForPkg(pkg, repos, fatal = fatal) :
No package '<PackageName> 'found in local repositories specified
I understand this is a problem with packrat being unable to find my local repos during the deploy process (I believe at a stage where it uses packrat::snapshot()).This is confusing since I would have thought packrat would use my option("repos") repos similar to install.packages. If I follow through the functions, I can see the particular point of failure is packrat:::findLocalRepoForPkg("<PackageName", repos = packrat::get_opts("local.repos")), which fails even after I define packrat::set_opts("local.repos" = c(CurrentRepo2 = paste0("file:",RepoAddress)))
If I drill into packrat:::findLocalRepoForPkg, it fails because it can't find a file/folder called: "C://". I would have thought this is guaranteed to fail, because repos follow the C://bin/windows/contrib/3.3/ structure. At no point would a repo have the structure it's looking for?
I think this last part is showing I'm materially misunderstanding something. Any guidance on configuring my repo so packrat can understand it would be great.
One should always check what options RStudio connect supports at the moment:
https://docs.rstudio.com/connect/admin/r/package-management/#private-packages
Personally I dislike all options for including local/private packages, as it defeats the purpose of having a nice easy target for deploying shiny apps. In many cases, I can't just set up local repositories in the organization because, I do not have clearance for that. It is also inconvenient that I have to email IT-support to make them manually install new packages. Overall I think RS connect is great product because it is simple, but when it comes to local packages it is really not.
I found a nice alternative/Hack to Rstudio official recommendations. I suppose thise would also work with shinyapps.io, but I have not tried. The solution goes like:
add to global.R if(!require(local_package) ) devtools::load_all("./local_package")
Write a script that copies all your source files, such that you get a shinyapp with a source directory for a local package inside, you could call the directory for ./inst/shinyconnect/ or what ever and local package would be copied to ./inst/shinyconnect/local_package
manifest.
add script ./shinyconnect/packrat_sees_these_dependencies.R to shiny folder, this will be picked up by packrat-manifest
Hack rsconnet/packrat to ignore specifically named packages when building
(1)
#start of global.R...
#load more packages for shiny
library(devtools) #need for load_all
library(shiny)
library(htmltools) #or what ever you need
#load already built local_package or for shiny connection pseudo-build on-the-fly and load
if(!require(local_package)) {
#if local_package here, just build on 2 sec with devtools::load_all()
if(file.exists("./DESCRIPTION")) load_all(".") #for local test on PC/Mac, where the shinyapp is inside the local_package
if(file.exists("./local_package/DESCRIPTION")) load_all("./local_package/") #for shiny conenct where local_package is inside shinyapp
}
library(local_package) #now local_package must load
(3)
make script loading all the dependencies of your local package. Packrat will see this. The script will never be actually be executed. Place it at ./shinyconnect/packrat_sees_these_dependencies.R
#these codelines will be recognized by packrat and package will be added to manifest
library(randomForest)
library(MASS)
library(whateverpackageyouneed)
(4) During deployment, manifest generator (packrat) will ignore the existence of any named local_package. This is an option in packrat, but rsconnect does not expose this option. A hack is to load rsconnect to memory and and modify the sub-sub-sub-function performPackratSnapshot() to allow this. In script below, I do that and deploy a shiny app.
library(rsconnect)
orig_fun = getFromNamespace("performPackratSnapshot", pos="package:rsconnect")
#packages you want include manually, and packrat to ignore
ignored_packages = c("local_package")
#highjack rsconnect
local({
assignInNamespace("performPackratSnapshot",value = function (bundleDir, verbose = FALSE) {
owd <- getwd()
on.exit(setwd(owd), add = TRUE)
setwd(bundleDir)
srp <- packrat::opts$snapshot.recommended.packages()
packrat::opts$snapshot.recommended.packages(TRUE, persist = FALSE)
packrat::opts$ignored.packages(get("ignored_packages",envir = .GlobalEnv)) #ignoreing packages mentioned here
print("ignoring following packages")
print(get("ignored_packages",envir = .GlobalEnv))
on.exit(packrat::opts$snapshot.recommended.packages(srp,persist = FALSE), add = TRUE)
packages <- c("BiocManager", "BiocInstaller")
for (package in packages) {
if (length(find.package(package, quiet = TRUE))) {
requireNamespace(package, quietly = TRUE)
break
}
}
suppressMessages(packrat::.snapshotImpl(project = bundleDir,
snapshot.sources = FALSE, fallback.ok = TRUE, verbose = FALSE,
implicit.packrat.dependency = FALSE))
TRUE
},
pos = "package:rsconnect"
)},
envir = as.environment("package:rsconnect")
)
new_fun = getFromNamespace("performPackratSnapshot", pos="package:rsconnect")
rsconnect::deployApp(appDir="./inst/shinyconnect/",appName ="shinyapp_name",logLevel = "verbose",forceUpdate = TRUE)
The problem is one of nomenclature.
I have set up a repo in the CRAN sense. It works fine and is OK. When packrat references a local repo, it is referring to a local git-style repo.
This solves why findlocalrepoforpkg doesn't look like it will work - it is designed to work with a different kind of repo.
Feel free to also reach out to support#rstudio.com
I believe the local package code path is triggered in packrat because of the missing Repository: value line in the Description file of the package. You mentioned you added this line, could you try the case-sensitive version?
That said, RStudio Connect will not be able to install the package from the RepoAddress as you've specified it (hardcoded on the Windows share). We'd recommend hosting your repo over https from a server that both your Dev environment and RStudio Connect have access to. To make this type of repo setup much easier we just released RStudio Package Manager which you (and IT!) may find a compelling alternative to manually managing releases of your internal packages via drat.

Issue with R - Shiny command runGitHub() [duplicate]

I am trying to install a sample package from my github repo:
https://github.com/jpmarindiaz/samplepkg
I can install it when the repo is public using any of the following commands through the R interpreter:
install_github("jpmarindiaz/rdali")
install_github("rdali",user="jpmarindiaz")
install_github("jpmarindiaz/rdali",auth_user="jpmarindiaz")
But when the git repository is private I get an Error:
Installing github repo samplepkg/master from jpmarindiaz
Downloading samplepkg.zip from
https://github.com/jpmarindiaz/samplepkg/archive/master.zip
Error: client error: (406) Not Acceptable
I haven't figured out how the authentication works when the repo is private, any hints?
Have you tried setting a personal access token (PAT) and passing it along as the value of the auth_token argument of install_github()?
See ?install_github way down at the bottom (Package devtools version 1.5.0.99).
Create an access token in:
https://github.com/settings/tokens
Check the branch name and pass it to ref
devtools::install_github("user/repo"
,ref="main"
,auth_token = "tokenstring"
)
A more modern solution to this problem is to set your credentials in R using the usethis and credentials packages.
#set config
usethis::use_git_config(user.name = "YourName", user.email = "your#mail.com")
#Go to github page to generate token
usethis::create_github_token()
#paste your PAT into pop-up that follows...
credentials::set_github_pat()
#now remotes::install_github() will work
remotes::install_github("username/privaterepo")
More help at https://happygitwithr.com/common-remote-setups.html#common-remote-setups

How to setup a local repository for R package?

I want to setup local repository for R package, I'd like the repository works like sonatype nexus(it can proxy the central repository, and cache the artifacts after downloading the artifact from central repository).
Currently nexus does not support R repository format, so it doesn't suite what I needed to do.
Is that any existing solution for creating this repository? I don't want to create a CRAN mirror, which is too heavy for me.
First, you'll want to make sure you have the following path and its directories in your system: "/R/src/contrib". If you don't have this path and these directories, you'll need to create them. All of your R packages files will be stored in the "contrib" directory.
Once you've added package files to the "contrib" directory, you can use the setRepositories function from the utils package to create the repository. I'd recommend adding the following code to your .Rprofile for a local repository:
utils::setRepositories(ind = 0, addURLs = c(WORK = "file://<your higher directories>/R"))
After editing your .Rprofile, restart R.
ind = 0 will indicate that you only want the local repository. Additional repositories can be included in the addURLs = option and are comma separated within the character vector.
Next, create the repository index with the following code:
tools::write_PACKAGES("/<your higher directories>/R/src/contrib", verbose = TRUE)
This will generate the PACKAGE files that serve as the repository index.
To see what packages are in your repository, run the following code and take a look at the resulting data frame: my_packages <- available.packages()
At this point, you can install packages from the repo without referencing the entire path to the package installation file. For example, to install the dplyr package, you could run the following:
install.packages("dplyr", dependencies = TRUE)
If you want to take it a step further and manage a changing repository, you could install and use the miniCRAN package. Otherwise, you'll need to execute the write_PACKAGES function whenever your repository changes.
After installing the miniCRAN package, you can execute the following code to create the miniCRAN repo:
my_packages <- available.packages()
miniCRAN::makeRepo(
pkgs = my_packages[,1,
path = "/<your higher directories>/R",
repos = getOption("repos"),
type = "source",
Rversion = R.version,
download = TRUE,
writePACKAGES = TRUE,
quiet = FALSE
)
You only need to execute the code above once for each repo.
Then, check to make sure each miniCRAN repo has been created. You only need to do this once for each repo:
pkgAvail(
repos = getOption("repos"),
type = "source",
Rversion = R.version,
quiet = FALSE
)
Whenever new package files are placed into the local repo you can update the local repo's index as follows:
miniCRAN::updateRepoIndex("/<your higher directories>/R/")
Finally, as an optional step to see if the new package is in the index, create a data frame of the available packages and search the data frame:
my_packages <- available.packages(repos = "file://<your higher directories>/R/")
This approach has worked pretty well for me, but perhaps others have comments and suggestions that could improve upon it.

Create an remote directory using SFTP / RCurl

Is it possible to create a directory on an SFTP site using the RCurl package? I found the sftp_create_dirs function, but I could not find an example how to use it.
I tried to set the ftp.create.missing.dirs option to TRUE, as in
library(RCurl)
opts <- list(ftp.create.missing.dirs=TRUE,
ssh.public.keyfile = mypubkey, ssh.private.keyfile = myprivatekey)
ftpUpload("myfile.txt", "sftp://me#www.host.org/newdir/myfile.txt", .opts=opts)
This works if newdir exists, but fails if it does not exists.
Any hint appreciated!

Resources