I've had success using GoogleDriveToGCSOperator to copy a file from drive to gcs.
But what I really need is, given a Drive folder ID, to copy all files and subdirectories of that folder to GCS.
Is there an operator that does this using airflow?
I've googled and googled and had no success. I'm assuming there's some solution for this as I'm sure I'm not the only one needing this.
I've had success doing this with a Colab notebook, but I'm now hoping to schedule the same task in Airflow. I'm not sure whether the Drive mount and PyDrive facilities in Colab transfer directly to Airflow, or whether there's a better Airflow-native solution for this.
Thanks
Actually, there is no operator that copies a whole Google Drive folder to GCS, but you can develop a new one.
If you read the source code of the official GoogleDriveToGCSOperator, you can see that it uses GoogleDriveHook to download the file from Google Drive and GCSHook to create a new object in GCS.
So you need to list the files in the Drive folder and copy them in a loop.
The problem is that you cannot list the files in Google Drive using GoogleDriveHook, so you have to call the Google Drive API to list the files. Here you can find an example.
Once you have the list of files, you can create a new operator by modifying the GoogleDriveToGCSOperator execute method (and, of course, the __init__ method arguments based on your needs):
def execute(self, context: 'Context'):
    files_list = ...  # file names read from the Drive API (see the example linked above)
    gdrive_hook = GoogleDriveHook(
        gcp_conn_id=self.gcp_conn_id,
        delegate_to=self.delegate_to,
        impersonation_chain=self.impersonation_chain,
    )
    gcs_hook = GCSHook(
        gcp_conn_id=self.gcp_conn_id,
        delegate_to=self.delegate_to,
        impersonation_chain=self.impersonation_chain,
    )
    for file_name in files_list:
        file_name = ...   # get the Drive file name without its path prefix
        gcs_prefix = ...  # choose a prefix for the GCS objects
        file_metadata = gdrive_hook.get_file_id(
            folder_id=self.folder_id, file_name=file_name, drive_id=self.drive_id
        )
        with gcs_hook.provide_file_and_upload(
            bucket_name=self.bucket_name,
            object_name=f"{gcs_prefix}/{file_name}",
        ) as file:
            gdrive_hook.download_file(file_id=file_metadata["id"], file_handle=file)
Related
I'm deploying an app to shinyapps.io using data I'm grabbing from S3 and I want to make sure my AWS keys are safe. Currently within the app.R code I'm setting environment variables and then querying S3 to get the data.
Is there a way to create a file that obscures the keys and deploy it to shinyapps.io along with my app.R file?
Sys.setenv("AWS_ACCESS_KEY_ID" = "XXXXXXXX",
"AWS_SECRET_ACCESS_KEY" = "XXXXXXXXX",
"AWS_DEFAULT_REGION" = "us-east-2")
inventory =aws.s3::s3read_using(read.csv, object = "s3://bucket/file.csv")
I'll also add that I'm on the free plan so user authentication is not available otherwise I wouldn't fuss about my keys being visible.
I recommend the following solution; here are the reasons behind it:
Firstly, create a file named .Renviron (just create it with a text editor, like the one in RStudio). Since the file name starts with a dot, the file will be hidden (on Mac/Linux, for example). Type the following:
AWS_ACCESS_KEY_ID = "your_access_key_id"
AWS_SECRET_ACCESS_KEY = "your_secret_access_key"
AWS_DEFAULT_REGION = "us-east-2"
Secondly, if you are using git, it is advisable to add the following to your .gitignore file (so that the file is not shared through version control):
# R Environment Variables
.Renviron
Finally you can retrieve the values stored in .Renviron to connect to your databases, S3 buckets and so on:
library(aws.s3)
bucketlist(key = Sys.getenv("AWS_ACCESS_KEY_ID"),
           secret = Sys.getenv("AWS_SECRET_ACCESS_KEY"))
That way your keys are "obscured": they are retrieved from .Renviron by Sys.getenv(), so they never appear in your code.
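To tie this back to the original snippet, here is an untested sketch of reading the same object with the keys pulled from .Renviron (the bucket/file.csv path is the placeholder from the question; opts just forwards the credentials to aws.s3's underlying request):

library(aws.s3)

# keys come from .Renviron via Sys.getenv(), so they never appear in app.R
inventory <- s3read_using(read.csv,
                          object = "s3://bucket/file.csv",
                          opts = list(key    = Sys.getenv("AWS_ACCESS_KEY_ID"),
                                      secret = Sys.getenv("AWS_SECRET_ACCESS_KEY"),
                                      region = Sys.getenv("AWS_DEFAULT_REGION")))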
Perhaps this solution is too basic, but you can simply create a .txt file with the keys in it, one per line. Then you can use scan() to read that file.
Something like:
Sys.setenv("AWS_ACCESS_KEY_ID" = scan("file.txt",what="character")[1],
"AWS_SECRET_ACCESS_KEY" = scan("file.txt",what="character")[2],
"AWS_DEFAULT_REGION" = "us-east-2")
It is similar to the first solution in the "managing secrets" link in the comments, except that we use a simple text format instead of JSON.
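Since the snippet above reads the file twice, a slightly tidier variant reads it once (assuming the file holds the two keys, one per line, in that order):

keys <- scan("file.txt", what = "character")   # line 1: access key id, line 2: secret access key
Sys.setenv("AWS_ACCESS_KEY_ID"     = keys[1],
           "AWS_SECRET_ACCESS_KEY" = keys[2],
           "AWS_DEFAULT_REGION"    = "us-east-2")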
I would like to import an external dataset using read.table() (or any other function for reading files) and then randomize or sample from it. The file is stored in a subfolder within the parent folder that contains the exercise *.Rmd files. I am working within an RStudio project. I tried placing the dataset at different levels of the folder structure. Using relative paths did not work, but absolute paths did.
My folder structure is:
$home/project_name/exercises # It contains the RMD files
$home/project_name/exercises/data # It contains data files that I want to process
$home/project_name/datasets # this folder could eventually contain the dataset I want to process
To make this code more portable, I would like to know how to manage relative paths within the *.Rmd files for the knitting process.
The exercises are copied to a temporary directory and processed there. Hence, the easiest option is to copy these files to the temporary directory using include_supplement("file.csv"). By default this assumes that the file.csv is in the same directory that the exercise itself resides in. If it is in a subdirectory you can use include_supplement("file.csv", recursive = TRUE) and then subdirectories are searched recursively for file.csv.
After using include_supplement(), the copied file is available locally and can be processed with read.table() or also included in the exercise as a supplementary file. See http://www.R-exams.org/templates/Rlogo/ for a worked example. However, note that the Rlogo template explicitly specifies the directory from which the file should be copied. This is not necessary if that directory is the same as the exercise or a subdirectory.
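For example, if file.csv lives in the data/ subdirectory next to the exercise, the top of the exercise .Rmd could look roughly like this (the file name and read.table() arguments are just placeholders):

library("exams")

# copy data/file.csv from the exercise's folder (searched recursively) into the
# temporary directory where the exercise is being knitted
include_supplement("file.csv", recursive = TRUE)

dat <- read.table("file.csv", header = TRUE, sep = ",")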
I want to know how I can import a .txt file in R while avoiding hard-coding the path to my file. I usually import via "Import Dataset" and then select "From Text (base)", but when I write file.exists("myfilename.txt") in the program it returns FALSE. How can I do it correctly?
When you run file.exists("myfilename.txt"), R will search your current working directory for a file called myfilename.txt. If you have a file called myfilename.txt that you imported from some other directory, then file.exists("myfilename.txt") will return FALSE.
Solution 1:
Put your R script and the myfilename.txt file in the same folder
Change your working directory to that folder, either using the session menu or using setwd("path/to/folder")
file.exists("myfilename.txt") should now return TRUE
You can read your table with read.delim("myfilename.txt")
Solution 2:
Create an Rstudio project
Place your R script and the myfilename.txt file in the project folder.
Every time you open the project, your working directory will point to the project folder.
file.exists("myfilename.txt") is TRUE
You can read your table using read.delim("myfilename.txt").
Solution 3:
Leave myfilename.txt where it is and read it by providing the absolute path, for example: read.delim("C:/Users/Jiakai/Documents/myfilename.txt")
In this case file.exists("myfilename.txt") is FALSE and file.exists("C:/Users/Jiakai/Documents/myfilename.txt") is TRUE.
If you want file.exists("myfilename.txt") to return TRUE, change your working directory to "C:/Users/Jiakai/Documents".
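For instance, Solution 1 above boils down to a few lines (the folder path is just an illustration):

setwd("C:/Users/Jiakai/Documents")   # the folder containing both the script and the file
file.exists("myfilename.txt")        # should now return TRUE
mydata <- read.delim("myfilename.txt")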
To import a txt file, you have several options. The two best options are
readr::read_delim("path/tomyfile/myfilename.txt", delim = "\t")
or
data.table::fread("path/tomyfile/myfilename.txt", sep = "\t")
They are preferable to the base R read.delim function, which is slower.
You can provide absolute paths, or relative paths if you know your working directory.
Edit
If you don't know your working directory, you can run
getwd()
If file.exists() does not find your file, you need to either change your working directory or adjust the path in your import and file.exists() calls.
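For example (reusing the placeholder path from the snippets above):

getwd()                                         # where R is currently looking
file.exists("path/tomyfile/myfilename.txt")     # check the path before importing
dat <- readr::read_delim("path/tomyfile/myfilename.txt", delim = "\t")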
I use the googlesheets package. The default directory for spreadsheets is the root of Google Drive. I guess that I can specify the directory - like for a "normal" directory path - but I don't know how to do that.
gs_new(title = "MyData") # export to the root
gs_new(title = "Something/MyData") # export to the specified directory
I'm also interested in this question. I will try the following to see if it works. If not, I may try to use the 'googledrive' package on top of, or as a replacement for, the 'googlesheets' package to create sheets within a folder hierarchy. That way I can loop through a list of subfolders, creating files inside them, until all subfolders have their new files created.
So here's my thinking... When I have time to test this out, I'll let you know!
for (path in file_paths) {
  setwd(path)
  for (file in files) {
    gs_new(file)
  }
}
Of course, store your parent folder path as a string and use list.files(parent, full.names = TRUE). Then, if you have any subfolders (assuming they're already created), it returns a vector of paths you can loop through. If you just want to create one workbook in one location, simply setting the working directory might work. Again, I'll need to test this with multiple methods.
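If the 'googledrive' route mentioned above pans out, a minimal sketch could look like this (untested; the folder and sheet names are just placeholders, and drive_mkdir() will happily create a second folder with the same name if one already exists):

library("googlesheets")
library("googledrive")

gs_new(title = "MyData")                         # googlesheets always creates the sheet at the Drive root
drive_mkdir("Something")                         # create the target folder (skip if it already exists)
drive_mv(file = "MyData", path = "Something/")   # then move the new sheet into that folder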
In order to ease the manual copying of large file amounts, I often use FreeFileSync. I noticed that it preserves the original file information such as when a file was created, last modified etc.
Now I need to regularly copy tons of files in batch mode and I'd like to do it in R. So I wondered if R is capable of preserving that information as well. AFAIU, file.rename() and file.copy() alter the file information, e.g. the times are set to the time the files were actually copied.
Is there any way I can restore the original file information after the files have been copied?
Robocopy via system2() can keep the timestamps.
> cmdArgs<- paste( normalizePath( file.path(getwd()), winslash="/"),
normalizePath( file.path(getwd(), "bkup"), winslash="/" ),
"*.txt",
"/copy:DAT /V" )
> system2( "robocopy.exe", args=cmdArgs )
Robocopy has a slew of switches for all different types of use cases and can accept a 'job' file for the parameters and file names. R's ability to shell out via system() could also be used to run an elevated session (perhaps most easily via a PowerShell script that calls Robocopy) so that all of the auditing info (permissions and such) is retained as well.
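For repeated batch runs, the call above can be wrapped into a small helper (Windows only, since it shells out to Robocopy; the paths and pattern are placeholders):

copy_keep_times <- function(src, dst, pattern = "*.txt") {
  # /copy:DAT copies Data, Attributes and Timestamps; /V gives verbose output
  args <- paste(normalizePath(src, winslash = "/"),
                normalizePath(dst, winslash = "/"),
                pattern,
                "/copy:DAT /V")
  system2("robocopy.exe", args = args)
}

copy_keep_times(getwd(), file.path(getwd(), "bkup"))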