I'm deploying an app to shinyapps.io using data I'm grabbing from S3 and I want to make sure my AWS keys are safe. Currently within the app.R code I'm setting environment variables and then querying S3 to get the data.
Is there a way to create a file that obscures the keys and deploy it to shinyapps.io along with my app.R file?
Sys.setenv("AWS_ACCESS_KEY_ID" = "XXXXXXXX",
           "AWS_SECRET_ACCESS_KEY" = "XXXXXXXXX",
           "AWS_DEFAULT_REGION" = "us-east-2")
inventory = aws.s3::s3read_using(read.csv, object = "s3://bucket/file.csv")
I'll also add that I'm on the free plan so user authentication is not available otherwise I wouldn't fuss about my keys being visible.
I recommend the following solution and the reasons behind it:
Firstly, create a file named .Renviron (just create it with a text editor, like the one in RStudio). Since the file name starts with a dot, the file will be hidden (on Mac/Linux, for example). Type the following:
AWS_ACCESS_KEY_ID = "your_access_key_id"
AWS_SECRET_ACCESS_KEY = "your_secret_access_key"
AWS_DEFAULT_REGION = "us-east-2"
Secondly, if you are using git it is advisable to add the following lines to your .gitignore file (to avoid sharing that file through version control):
# R Environment Variables
.Renviron
Finally, you can retrieve the values stored in .Renviron to connect to your databases, S3 buckets and so on:
library(aws.s3)
bucketlist(key = Sys.getenv("AWS_ACCESS_KEY_ID"),
           secret = Sys.getenv("AWS_SECRET_ACCESS_KEY"))
That way your keys are "obscured": Sys.getenv() retrieves them from .Renviron at run time, so they never appear in your code.
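Applied to the code from the question, a minimal sketch (the bucket/object name is taken from the question; I'm assuming your aws.s3 version forwards the opts list to the request, and aws.s3 can also pick up the AWS_* environment variables directly, so passing them explicitly may even be optional):
library(aws.s3)

# No keys hard-coded: they come from .Renviron via Sys.getenv()
inventory <- s3read_using(
  FUN    = read.csv,
  object = "s3://bucket/file.csv",
  opts   = list(key    = Sys.getenv("AWS_ACCESS_KEY_ID"),
                secret = Sys.getenv("AWS_SECRET_ACCESS_KEY"),
                region = Sys.getenv("AWS_DEFAULT_REGION"))
)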
Perhaps this solution is too basic, but you can simply create a .txt file with the keys in it, one per line. Then you can use scan() to read that file.
Something like:
Sys.setenv("AWS_ACCESS_KEY_ID" = scan("file.txt",what="character")[1],
"AWS_SECRET_ACCESS_KEY" = scan("file.txt",what="character")[2],
"AWS_DEFAULT_REGION" = "us-east-2")
It is similar to the first solution in the "managing secrets" link in the comments, except that we use a simple text format instead of JSON.
Related
I've had success using GoogleDriveToGCSOperator to copy a file from Drive to GCS.
But what I really need to do is, given a Drive folder id, copy all files and subdirectories of that Drive folder to GCS.
Is there an Airflow operator that does this?
I've googled and googled and had no success. I'm assuming there's some solution for this, as I'm sure I'm not the only one needing this.
I've had success doing this with a Colab notebook but am now hoping to schedule something in Airflow to achieve the same task. I'm not sure whether the drive mount and PyDrive facilities in Colab are directly transferable to Airflow, or whether there's a better Airflow solution for this.
Thanks
Actually, there is no operator that copies a whole Google Drive folder to GCS, but you can develop a new one.
If you read the source code of the official GoogleDriveToGCSOperator, you can see that it uses GoogleDriveHook to download the file from Google Drive, and GCSHook to create a new file in GCS.
So you need to list the files in Google Drive, and copy them in a loop.
The problem is that you cannot list the files in Google Drive using GoogleDriveHook, so you have to call the Google Drive API to list the files. Here you can find an example.
Once you have the list of files, you can create a new operator by modifying the GoogleDriveToGCSOperator execute method (and, of course, the __init__ method arguments based on your needs):
def execute(self, context: 'Context'):
    files_list = ...  # read from the Drive API (see above)
    gdrive_hook = GoogleDriveHook(
        gcp_conn_id=self.gcp_conn_id,
        delegate_to=self.delegate_to,
        impersonation_chain=self.impersonation_chain,
    )
    gcs_hook = GCSHook(
        gcp_conn_id=self.gcp_conn_id,
        delegate_to=self.delegate_to,
        impersonation_chain=self.impersonation_chain,
    )
    for file_name in files_list:
        ...  # get the gdrive file name without prefix
        ...  # choose a prefix for the GCS objects
        file_metadata = gdrive_hook.get_file_id(
            folder_id=self.folder_id, file_name=file_name, drive_id=self.drive_id
        )
        with gcs_hook.provide_file_and_upload(
            bucket_name=self.bucket_name,
            object_name=<GCS prefix>/<gdrive file name without prefix>,
        ) as file:
            gdrive_hook.download_file(file_id=file_metadata["id"], file_handle=file)
I am trying to add a txt file that contains, let's say, 50 projects, with paths outside of the package. I am attempting to use this file to build a Shiny app using the golem framework.
My problem is that, as much as I read about golem Shiny apps, I do not understand where to add these txt files so that I can then use them in my Shiny application. NOTE: I want to work with the golem framework, so the answer should be aligned with this request.
This is a txt file.
nameproj technology pathwork LinkPublic Access
Inside I have 50 projects with paths and links that will be used to retrieve the data for the app.
L3_baseline pooled /projects/gb/gb_screening/analyses_gb/L3_baseline/ kkwf800, kkwf900, etc..
Then I create paths to the data like this:
path_to_data1 = "data/data1.txt"
path_to_data2 = "data/data2.txt"
Then, I create helper functions. These helper functions will be used in the app_server and app_ui modules. Something like the below:
make_path <- function(pathwork, type, ex, subfolder = "") {
  # build .../proj<type>/<ex>/<subfolder>/ under the project working path
  path <- paste0(pathwork, "/proj", type, "/", ex, "/", subfolder, "/")
  return(path)
}

getfiles <- function(screennames, types, pathwork) {
  # collect the File.tsv path for every screen that actually exists on disk
  files <- data.frame()
  for (ind in 1:length(screennames)) {
    hitfile <- file.path(make_path(pathwork, types[ind], screennames[ind], "analysis"), "File.tsv")
    if (file.exists(hitfile)) {
      files <- rbind(files,
                     data.frame(filename = hitfile,
                                screen = paste0(screennames[ind], "-", types[ind])))
    }
  }
  return(files)
}
Can someone direct me to:
how to actually add the txt files containing paths to external data and projects within the golem framework
a clear example of where these files are added within a golem project
NOTES: My datasets all live on private servers within my company, so all these paths point to those servers, and I have no issues accessing these datasets.
I have solved the issue by simply adding a sourced file containing only the paths above and running the app. It seems to be working.
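If you prefer the conventional golem/package layout instead of sourcing a file, here is a minimal sketch. It assumes you ship the file in inst/extdata/ and that app_sys() is the helper golem generates in R/app_config.R; the file name projects.txt and the tab-separated format are assumptions, so adjust to your actual file.
# inst/extdata/projects.txt travels inside the package when it is built/deployed;
# app_sys() is golem's wrapper around system.file() for the current package.
read_projects <- function() {
  utils::read.table(
    app_sys("extdata", "projects.txt"),
    header = TRUE, sep = "\t", stringsAsFactors = FALSE
  )
}

# then, inside app_server():
# projects <- read_projects()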
I have 3 R scripts:
data1.r
data2.r
graph1.r
The two data scripts run some math and generate 2 separate data files, which I save in my working directory. I then load these two files in graph1.r and use them to plot the data.
How can I organise and create an R project which has:
these two data scripts, data1.r and data2.r
another file which calls these files (graph1.r)
Output of graph1.r
I would then like to share all of this on GitHub (I know how to do this part).
Edit -
Here is the data1 script
df1 <- data.frame(x = seq(1,100,1), y=rnorm(100))
save(df1, file = "data1.Rda")
Here is the data2 script
df2 <- data.frame(x = seq(1,100,1), y=rnorm(100))
save(df2, file = "data2.Rda")
Here is the graph1 script
load(file = "data1.Rda")
load(file = "data2.Rda")
library(ggplot2)
ggplot() + geom_point(data = df1, aes(x = x, y = y)) + geom_point(data = df2, aes(x = x, y = y))
Question worded differently -
How would the above need to be executed inside a project?
I have looked at the following tutorials -
https://r4ds.had.co.nz/workflow-projects.html
https://martinctc.github.io/blog/rstudio-projects-and-working-directories-a-beginner's-guide/
https://swcarpentry.github.io/r-novice-gapminder/02-project-intro/
https://www.tidyverse.org/blog/2017/12/workflow-vs-script/
https://chrisvoncsefalvay.com/2018/08/09/structuring-r-projects/
https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects
I have broken my answer into three parts:
The question in your title
The reworded question in your text
What I, based on your comments, believe you are actually asking
How to transfer my files to R Projects, and then to GitHub?
From RStudio, just create a new project and move your files to this folder. You can then initialize this folder with git using git init.
How would [my included code] need to be executed inside a project?
You don't need to change anything in your example code. If you just place your files in a project folder they will run just fine.
An R project mainly takes care of the following for you:
Working directory (it's always set to the project folder)
File paths (all paths are relative to the project root folder)
Settings (you can set project specific settings)
Further, many external packages are meant to work with projects, making many tasks easier for you. A project is also a very good starting point for sharing your code with Git.
What would be a good workflow for working with multiple scripts in an R project?
One common way of organizing multiple scripts is to make a new script calling the other scripts in order. Typically, I number the scripts so it's easy to see the order to call them. For example, here I would create 00_main.R and include the code:
source("01_data.R")
source("02_data.R")
source("03_graph.R")
Note that I've renamed your scripts to make the order clear.
In your code, you do not need to save the data to pass it between the scripts. The above code would run just fine if you delete the save() and load() parts of your code. The objects created by the scripts would still be in your global environment, ready for the next script to use them.
If you do need to save your data, I would save it to a folder named data/. The output from your plot I would probably save to outputs/ or plots/.
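For illustration, a minimal sketch of what 00_main.R could look like with those folders (the data/ and plots/ folder names are just the suggestion above, and writing the plot from 03_graph.R with ggsave() is an assumption about what you want to keep):
# 00_main.R -- run the whole pipeline from the project root
dir.create("data",  showWarnings = FALSE)   # no-op if the folder already exists
dir.create("plots", showWarnings = FALSE)

source("01_data.R")    # e.g. save(df1, file = "data/data1.Rda")
source("02_data.R")    # e.g. save(df2, file = "data/data2.Rda")
source("03_graph.R")   # loads the .Rda files, builds the plot,
                       # and could end with ggsave("plots/graph1.png")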
When you get used to working with R, the next step to organize your code is probably to create a package instead of using only a project. You can find all the information you need in this book.
I'm using a .yaml file as my config file for some R code. When I was the only R developer this didn't pose a problem, but now we're trying to bring multiple people on board. We don't want to keep rewriting the config file because that's slow, and we don't want to create individual config files because we keep adding new elements to them so we'll likely end up running different code.
To access the .yaml file, I typically run the code below in R
config = yaml::read_yaml('base/config.yaml')
and the text in config.yaml starts with
experiments:
  # Config paths
  first: 'C:/Users/BNye/OneDrive/Science/experiments/first'
  second: 'C:/Users/BNye/OneDrive/Science/experiments/second'
  third: ...
and so on. What I'd like to do is swap that out for something like
'C:/Users/{username}/OneDrive/Science/experiments/first'
and have it return 'C:/Users/BNye/OneDrive/Science/experiments/first' when I run config$experiments$first in R. But that just returns the same line of text ('C:/Users/{username}/OneDrive/Science/experiments/first') right back to me. Using setwd(paste0("C:/Users/", Sys.info()[6], "/OneDrive/Science/experiments"))
in R worked fine, so the real hang-up is the config file. How should I code this?
Using this query as a reference, I found a way to get R to parse the config file entries as R expressions. My config files now read:
experiments:
  # Config paths
  first: !expr 'paste0("C:/Users/", Sys.info()[6], "/OneDrive/Science/experiments/first")'
  second: !expr 'paste0("C:/Users/", Sys.info()[6], "/OneDrive/Science/experiments/second")'
  third: ...
Then, when I read it in R, I use config = yaml::read_yaml('base/config.yaml', eval.expr=TRUE) and it works perfectly. It feels like a fragile solution, but so far it's holding up.
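For completeness, a quick check of what this returns (the path and username are of course machine-specific):
config <- yaml::read_yaml("base/config.yaml", eval.expr = TRUE)
# The !expr nodes are evaluated while the file is read,
# so this is already a plain, user-specific string:
config$experiments$first
# e.g. "C:/Users/BNye/OneDrive/Science/experiments/first" on the original machine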
Someone, please guide me. Suppose I choose the location of a data file using file.choose() and load the dataset after that. Also suppose I have sent the script and data set to a friend of mine by e-mail. When my friend downloads the files and runs the R script, he has to choose the location of the file to run the script. I want to know an automated way to load the data set when the files are moved to another computer.
First, consider having a "project" directory where you have a directory for scripts and one for data. There's a 📦 called rprojroot that has filesystem helpers which will aid you in writing system independent code and will work well if you have a "project" directory. RStudio has a concept of projects & project directories which makes this even easier.
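A minimal sketch of the rprojroot idea, assuming the project folder contains an RStudio *.Rproj file and a data/ subfolder holding the data file (the file names here are illustrative):
library(rprojroot)

# Find the project root from anywhere inside the project tree,
# using the presence of a *.Rproj file as the criterion.
root <- find_root(is_rstudio_project)

# Build a path that works on any machine that has the project folder.
dat <- read.csv(file.path(root, "data", "mydata.csv"))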
Second, consider using a public or private GitHub for this work (scripts & data). If the data is sensitive, make it a private repo and grant access as you need. If it's not, then it's even easier to share. You'll get data and code version control this way as well.
Third --- as a GitHub alternative --- consider using Keybase shared directories or git spaces. You can grant/remove access to specific individuals and they remain private and secure as well as easy to use.
These solutions will work on any computer without changing the script.
1) use current dir If you assume the data and script are in the same directory then this will work on any computer provided the user first does a setwd("/my/dir") or starts R in that directory. One invokes the script using source("myscript.R") and the script reads the data using read.table("mydata.dat"). This approach is the simplest, particularly if the script is only going to be used once or a few times and then never used again.
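In other words (the directory and file names are the placeholders used above):
setwd("/my/dir")       # or start R in this directory instead
source("myscript.R")   # myscript.R itself calls read.table("mydata.dat")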
2) use R options A slightly more general approach is to assume that R option DATADIR (pick any name you like) contains that directory or the current directory if not defined. In the script write:
datadir <- getOption("DATADIR", ".") # use DATADIR or . if DATADIR not defined
read.table(file.path(datadir, "mydata.dat"))
Then the user can define DATADIR in their R session or in their .Rprofile:
options(DATADIR = "/my/dir")
or not define it at all, and instead setwd() to that directory in their R session prior to running the script, or start R in that directory.
This might be better than (1) if the script is going to be used over a long period of time and moved around without the data. If you put the options statement in your .Rprofile then it will help remind you where the data is if you don't use the script for a long time and lose track of its location.
3) include data in script If the script always uses the same data and it is not too large you could include the data in the script. Use dput(DF) where DF is the data frame in order to get the R code corresponding to DF and then just paste that into your script. Here is such a sample script where we used the output of dput(BOD):
DF <- structure(list(Time = c(1, 2, 3, 4, 5, 7), demand = c(8.3, 10.3,
19, 16, 15.6, 19.8)), .Names = c("Time", "demand"), row.names = c(NA,
-6L), class = "data.frame", reference = "A1.4, p. 270")
plot(demand ~ Time, DF)
Of course if you always use the same data you could create a package and include the script and the data.
4) config package You could use the config package to define a configuration file for your script. That still leaves the question of how to find the configuration file, but config can search the current directory and all ancestors (parent dir, grandparent dir, etc.) for the config file, so specifying its location may not be needed.
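A minimal sketch, assuming a config.yml somewhere at or above the working directory (the datadir entry is illustrative):
# config.yml (in the current directory or any ancestor of it):
# default:
#   datadir: "/my/dir"

cfg <- config::get()   # looks for config.yml here and in parent directories
DF  <- read.table(file.path(cfg$datadir, "mydata.dat"))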