s3sync() Exclude Directory in R

I'm trying to pull down all files in a given bucket, except those in a specific directory, using R.
In the AWS CLI, I can use...
aws s3 sync s3://my_bucket/my_prefix ./my_destination --exclude="*bad_directory*"
In aws.s3::s3sync(), I'd like to do something like...
aws.s3::s3sync(path='./my_destination', bucket='my_bucket', prefix='my_prefix', direction='download', exclude='*bad_directory*')
...but exclude is not a supported argument.
Is this possible using aws.s3 (or paws for that matter)?
Please don't recommend using the AWS CLI - there are reasons that approach doesn't make sense for my purpose.
Thank you!!

Here's what I came up with to solve this...
library(paws)
library(aws.s3)

s3 <- paws::s3()

# List the objects under the prefix, then drop any keys inside the bad directory
contents <- s3$list_objects(Bucket = 'my_bucket', Prefix = 'my_prefix/')$Contents
keys <- unlist(sapply(contents, FUN = function(x) {
  if (!grepl('/bad_directory/', x$Key, fixed = TRUE)) {
    x$Key
  }
}))

# Recreate the directory structure locally and download each remaining object
for (i in keys) {
  dir.create(dirname(i), showWarnings = FALSE, recursive = TRUE)
  aws.s3::save_object(
    object = i,
    bucket = 'my_bucket',
    file = i
  )
}
Still open to more efficient implementations - thanks!
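One thing to watch: list_objects() returns at most 1,000 keys per call. Below is a sketch of a paws-only variant that pages through the listing with list_objects_v2() and downloads with get_object(), using the same placeholder bucket and prefix names as above; treat it as a sketch rather than tested code.
library(paws)

s3 <- paws::s3()
token <- NULL
repeat {
  # Page through the listing; ContinuationToken is only sent after the first call
  args <- list(Bucket = 'my_bucket', Prefix = 'my_prefix/')
  if (!is.null(token)) args$ContinuationToken <- token
  resp <- do.call(s3$list_objects_v2, args)

  keys <- vapply(resp$Contents, function(x) x$Key, character(1))
  keys <- keys[!grepl('/bad_directory/', keys, fixed = TRUE)]
  keys <- keys[!endsWith(keys, '/')]  # skip "folder" placeholder keys

  for (k in keys) {
    dir.create(dirname(k), showWarnings = FALSE, recursive = TRUE)
    obj <- s3$get_object(Bucket = 'my_bucket', Key = k)
    writeBin(obj$Body, k)  # Body comes back as a raw vector
  }

  if (!isTRUE(resp$IsTruncated)) break
  token <- resp$NextContinuationToken
}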

Related

Textract in R (paws) without S3Object

When using Textract from the paws package in R, the start_document_analysis call requires an S3Object in DocumentLocation.
textract$start_document_analysis(
  DocumentLocation = list(
    S3Object = list(Bucket = bucket, Name = file)
  )
)
Is it possible to use DocumentLocation without an S3Object? I would prefer to just provide the path to a local PDF.
The start_document_analysis API only supports an S3 object as input, not a base64-encoded string like the analyze_document API (see also the CLI docs at https://docs.aws.amazon.com/cli/latest/reference/textract/start-document-analysis.html).
So unfortunately you have to use S3 as a place to (temporarily) store your data. Of course you can write your own logic to do that :). A great tutorial on that can be found at
https://www.gormanalysis.com/blog/connecting-to-aws-s3-with-r/
Since you have already set up credentials etc., you can skip a lot of the steps and start at step 3, for example.
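A minimal sketch of that upload-then-analyze flow in paws, assuming a bucket you can write to (the bucket, key, and file names below are placeholders, the FeatureTypes are just an example, and the polling loop is deliberately simplified):
library(paws)

s3       <- paws::s3()
textract <- paws::textract()

bucket <- "my-textract-bucket"    # placeholder bucket you can write to
key    <- "uploads/document.pdf"  # placeholder key
pdf    <- "document.pdf"          # local PDF to analyse

# Upload the local PDF as raw bytes
s3$put_object(
  Bucket = bucket,
  Key = key,
  Body = readBin(pdf, "raw", n = file.size(pdf))
)

# Start the asynchronous analysis against the uploaded object
job <- textract$start_document_analysis(
  DocumentLocation = list(S3Object = list(Bucket = bucket, Name = key)),
  FeatureTypes = list("TABLES", "FORMS")
)

# Poll until the job leaves IN_PROGRESS; res then holds the first page of results
repeat {
  res <- textract$get_document_analysis(JobId = job$JobId)
  if (res$JobStatus != "IN_PROGRESS") break
  Sys.sleep(5)
}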

How do you load an existing S3 bucket in SageMaker using R programming?

I know how to use Python to load an existing S3 bucket in SageMaker. Something like this:
role = get_execution_role()
region = boto3.Session().region_name
bucket='existing S3 Bucket'
data_key = 'Data file in the existing s3 bucket'
data_location = 's3://{}/{}'.format(bucket, data_key)
How can one recreate this using R in SageMaker? All I see in the available documentation is how to create a new bucket, but none of it mentions how to use an existing S3 bucket. Help would be appreciated.
Link to the documentation for R in SageMaker:
https://aws.amazon.com/blogs/machine-learning/using-r-with-amazon-sagemaker/
Thanks for using Amazon SageMaker!
You can use the SageMaker Session helper methods for listing and reading files from S3. Please check out this sample notebook if you need examples of using a SageMaker Session: using_r_with_amazon_sagemaker.ipynb.
Thanks,
Neelam
You can use the SageMaker Python SDK via reticulate, for example:
library(reticulate)
sagemaker <- import("sagemaker")
uri <- "s3://my-bucket/my-prefix"
files <- sagemaker$s3$S3Downloader$list(uri)
csv <- sagemaker$s3$S3Downloader$read_file(files[1])
df <- read.csv(text=csv)
You can use packages from the cloudyr project:
library(aws.s3)
library(aws.ec2metadata) # will mutate environment to add credentials
library(readr)
df <- s3read_using(FUN = read_csv, bucket = "my-bucket", object = "my-key.csv")

Using ad-hoc wrapper functions without packaging and without showing them in the global environment in R (RStudio)

Not sure if this is even possible.
I use Rstudio and appreciate having an overview of the objects I'm working with in the Global Environment pane.
However, at the same time, I have some 15 or so simple wrapper functions that are specific to my project, e.g. around various reading and writing functions, so that they automate some file management tasks and follow my preferred folder structure; unfortunately, they also clutter that Global Environment view.
I guess I could put them all in a package but I'm quite sure I will not publish it and may not even need many of them beyond this one project.
Is there anything short of bundling them into a package for this kind of three-line function?
Thank you!
You could always put them into a list:
helper_functions <- list(f1 = function1,
                         f2 = function2)
Then you can call them by helper_functions$f2().
Example:
plus_one <- function(n) {
  return(n + 1)
}

plus_two <- function(n) {
  return(n + 2)
}

plus <- list(one = plus_one,
             two = plus_two)

plus$two(2)
# 4
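Another option, sketched here under the assumption that the helpers sit in a file such as helpers.R (a hypothetical name), is to source them into their own environment and attach() that environment; the functions are then callable by name but never appear in the Global Environment pane:
helpers <- new.env()
sys.source("helpers.R", envir = helpers)   # define the functions inside `helpers`
attach(helpers, name = "project_helpers")  # copies them onto the search path
rm(helpers)                                # nothing left in the global environment

# plus_one(1) etc. now resolve via the search path
# detach("project_helpers") removes them again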

Moving data from local directory to AWS

I'm very new to R, so be gentle. I've been tasked with making some amendments to a pre-existing project.
I have some code:
#SHINY_ROOT <- getwd()
#ARCHIVE_FILEPATH <- file.path(SHINY_ROOT, 'Data', 'archived_pqs.csv')
I want to move 'archived_pqs.csv' into S3 (Amazon Web Services), preferably while making as few changes to the rest of the code as possible.
My first thought was that I could do this:
ARCHIVE_FILEPATH <- s3tools::s3_path_to_full_df("alpha-pq-tool-data/Data/archived_pqs.csv")
Where 'alpha-pq-tool-data' is the S3 bucket.
I've tested this and it does indeed pull in the dataframe:
df <- s3tools::s3_path_to_full_df("alpha-pq-tool-data/Data/archived_pqs.csv")
The issue is that when I run other functions that go as follows:
if (file.exists(ARCHIVE_FILEPATH)) {
  date <- last_answer_date()
}
I get this error:
Error in file.exists(ARCHIVE_FILEPATH) : invalid 'file' argument
Called from: file.exists(ARCHIVE_FILEPATH)
Is there any easy way of doing this while making minimal changes? Can I no longer use the file.exists function because the data is in S3?
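For what it's worth, the error happens because s3_path_to_full_df() returns a data frame rather than a file path, so file.exists() receives an invalid 'file' argument. A minimal sketch of one workaround, assuming the aws.s3 package and the bucket/key shown above, is to keep the S3 location as strings and swap the existence check for an S3 one:
library(aws.s3)

ARCHIVE_BUCKET <- "alpha-pq-tool-data"
ARCHIVE_KEY    <- "Data/archived_pqs.csv"

# Check the object in S3 instead of on the local filesystem
if (aws.s3::object_exists(object = ARCHIVE_KEY, bucket = ARCHIVE_BUCKET)) {
  date <- last_answer_date()
}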

Source an R file from a private GitLab server with basic auth

I would like to source an .R file from a private GitLab server. I need to use basic authentication with user/password.
I tried this kind of instruction without success:
httr::GET("http://vpsxxxx.ovh.net/root/project/raw/9f8a404b5b33c216d366d80b7d48e34577598069/R/script.R",
          authenticate("user", "password", type = "basic"))
Any idea?
Regards
Edit: I found this way... but it downloads the whole project...
bundle <- tempfile()
git2r::clone("http://vpsxxx.ovh.net/root/projet.git",
             bundle, credentials = git2r::cred_user_pass("user", "password"))
source(file.path(bundle, "R", "script.R"))
You can use the GitLab API to get a file from a repository; gitlabr can help you do that. The current version, 0.9, is compatible with API v3 and v4.
This should work (it works on my end on a private GitLab with API v3):
library(gitlabr)

my_gitlab <- gl_connection("https://private-gitlab.com",
                           login = "username",
                           password = "password",
                           api_version = "v4") # by default; put "v3" here if needed

my_file <- my_gitlab(gl_get_file, project = "project_name", file_path = "path/to/file")
This will get you a character version of your file. You can also get back a raw version to deal with it in another way.
raw <- gl_get_file(project = "project_name",
                   file_path = "file/to/path",
                   gitlab_con = my_gitlab,
                   to_char = FALSE)

temp_file <- tempfile()
writeBin(raw, temp_file)
You can now source the code
source(temp_file)
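As a small aside (a sketch, assuming my_file is the character version returned by gl_get_file above), you could also skip the temp file and source it through a text connection:
source(textConnection(my_file))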
It is one solution among others. I did not manage to source the file without using the API.
Note that:
* You can use an access token instead of username and password.
* You can use gitlabr in several ways; this is documented in the vignette. I used two different ways here.
* Version 1.0 will not be compatible with the v3 API, but I think you use v4.
Feel free to get back to me so that I can update this post if you need a clearer answer.
