Using R and "aws.s3", how to push a directory to the cloud

I have a directory with subdirectories and many files that need to be pushed to Amazon S3. I am using R.
Is there a clean/easy way to say "push this directory and everything in it up to S3"? I am hoping to avoid pushing things up one at a time and manually rebuilding the directory structure.

If you pass file names to put_object() using their full paths and then use those paths as the object keys, you can implicitly recreate the directory structure. Basically like this (though you may want to adjust the file names in some way when using them as object keys):
library("aws.s3")
lapply(dir(full.names = TRUE, recursive = TRUE), function(filename) {
put_object(file = filename, object = filename, bucket = "mybucket")
})
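For instance, if you want to strip the leading "./" that dir(full.names = TRUE) prepends to each path before using it as the object key, a small variation (a sketch, not tested against a live bucket) could be:
lapply(dir(full.names = TRUE, recursive = TRUE), function(filename) {
# drop the leading "./" so the S3 keys mirror the directory tree cleanly
key <- sub("^\\./", "", filename)
put_object(file = filename, object = key, bucket = "mybucket")
})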
There is also an experimental function s3sync() that should do this for a complete file tree (but it isn't widely tested):
s3sync()
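A sketch of how a call might look; the argument names below (path, bucket, direction) are assumptions that have varied across aws.s3 versions, so check ?s3sync:
# sync the current directory tree up to the bucket (experimental function)
s3sync(path = ".", bucket = "mybucket", direction = "upload")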

Related

Creating objects from all .xlsx documents in working directory

I am trying to create objects from all files in the working directory, each named after its original file. I tried the following approach, but couldn't solve the problems that appeared.
# - SETTING WD
getwd()
setwd("PATH TO THE FILE")
library(readxl)
# - CREATING OBJECTS
file_objects <- list.files()
xlsx_objects <- unlist(grep(".xlsx",file_objects,value = T))
for (i in xlsx_objects) {
xlsx_objects[i] <- read_xlsx(xlsx_objects[i], header = T)
}
I tried pasting the [i] item from xlsx_objects together with the path to the working directory, but that only created a list of the file names in the working directory.
I also found information that read.csv can read only one file at a time, but I assumed a for loop would take care of that, since it reads one file per iteration.
Using lapply (as described on this forum) I was able to get the data into the environment, but the header argument didn't work, and I lost the names of my documents in an object that doesn't have the desired structure. What I am looking for is to have these files in separate objects without calling every document explicitly.
IIUC, you could do something like:
library(purrr)
files = list.files("PATH TO THE FILE", full.names = TRUE, pattern = '\\.xlsx$')
list_files = map(files, readxl::read_excel)
(You can't use read.csv to read Excel files.)
Also, I recommend reading about R Projects so you never have to use setwd() again; relying on setwd() makes your code harder to reproduce down the line.
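If you really want each file as a separate, named object in the environment (as the question asks), a short continuation of the code above should do it; file_path_sans_ext(), basename() and list2env() are base/tools functions:
# name each list element after its source file (without path or extension)
names(list_files) <- tools::file_path_sans_ext(basename(files))
# optionally turn the named list into separate objects in the global environment
list2env(list_files, envir = .GlobalEnv)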

Running the same script in subfolders

In my main folder I have many subfolders like AA, BB, CC, DD, etc., and every folder contains a common script named run_script.R. I want to run this script in every folder; the number of folders can vary.
It works, but only in the first folder; I want it to run in every folder.
Also, when I use setwd(folder) I get this error:
Error in setwd(folder) : cannot change working directory
data_folder <- "C:/Users/mosho/Desktop/New folder (2)/"
allfolders <- data.frame(Folders = list.dirs(path = data_folder, recursive = F, full.names = F))
r_scripts <- "run_script.R"
for (folder in allfolders$Folders) {
#setwd(folder)
message(folder)
source(paste0(data_folder,folder,"/",r_scripts))
}
You are on the right path; I made some minor tweaks to your script that resolve the issue. The points missing in your script are:
allfolders contains only the folder names, not the full explicit paths. To set the working directory you need to give the explicit path; calling setwd() with only the folder name results in an error unless your current working directory already contains that folder. In any case, it is best practice to work with full path names.
Also, setting up allfolders as a simple character vector to iterate over makes life a lot easier than a data frame.
Below is my worked example.
I created some dummy folders (DIC01, DIC02, DIC03, ...) under the path "C:\Users\XXXXXX\Documents\TEST MAIN" and placed run_script.R inside each one. This run_script.R contains the simple code print("Hello World !!").
Next I set the initial working directory to the path where all the folders are present, i.e. "C:\Users\XXXXXX\Documents\TEST MAIN". Then I listed the directories within this path as a character vector instead of a data frame. Finally, a for loop iterates over the folder names; inside the loop we reset the working directory to each folder and source the R script.
data_folder <- "C:\\Users\\XXXXXX\\Documents\\TEST MAIN"
setwd(data_folder)
allfolders <- list.dirs(path = data_folder, recursive = F, full.names = F)
r_scripts <- "run_script.R"
for (folder in allfolders) {
print(folder)
setwd(paste0(data_folder,"\\",folder))
source(paste0(data_folder,"\\",folder,"\\",r_scripts))
}
The result I get after execution is the name of each directory followed by the output of its script.
I hope this resolves your problem. If it does, please upvote the answer and let me know.
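As an aside, a variant that avoids setwd() entirely is to let source() switch into each script's folder via chdir = TRUE; a sketch using the same paths as above:
data_folder <- "C:\\Users\\XXXXXX\\Documents\\TEST MAIN"
allfolders <- list.dirs(path = data_folder, recursive = FALSE, full.names = TRUE)
for (folder in allfolders) {
message(folder)
# chdir = TRUE runs the script with its own folder as the working directory
# and restores the previous working directory afterwards
source(file.path(folder, "run_script.R"), chdir = TRUE)
}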

Is there a way of reading shapefiles directly into R from an online source?

I am trying to find a way of loading shapefiles (.shp) from an online repository/folder/url directly into my global environment in R, for the purpose of making plots in ggplot2 using geom_sf. In the first instance I'm using my Google Drive to store these files but I'd ideally like to find a solution that works with any folder with a valid url and appropriate access rights.
So far I have tried a few options, the first two involving zipping the source folder on Google Drive where the shapefiles are stored and then downloading and unzipping it in some way. I have included reproducible examples using a small test shapefile:
Using utils::download.file() to retrieve the compressed folder and unzipping using either base::system('unzip..') or zip::unzip() (loosely following this thread: Downloading County Shapefile from ONS):
# Create destination data folder (if there isn't one)
if(!dir.exists('data')) dir.create('data')
# Download the zipped file/folder
download.file("https://drive.google.com/file/d/1BYTCT_VL8EummlAsH1xWCd5rC4bZHDMh/view?usp=sharing", destfile = "data/test_shp.zip")
# Unzip folder using unzip (fails)
unzip(zipfile = "data/test_shp.zip", exdir = "data/test_shp", junkpaths = TRUE)
# Unzip folder using system (also fails)
system("unzip data/test_shp.zip")
If you can't run the above code then FYI the 2 error messages are:
Warning message:
In unzip(zipfile = "data/test_shp.zip", exdir = "data/test_shp", :
error 1 in extracting from zip file
AND
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of data/test_shp.zip or
data/test_shp.zip.zip, and cannot find data/test_shp.zip.ZIP, period.
Worth noting here that I can't even manually unzip this folder outside R, so I think something is going wrong at the download.file() step.
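One explanation I have seen suggested (untested here) is that the ".../view?usp=sharing" link returns an HTML preview page rather than the file itself, and that download.file() needs mode = "wb" for binary files on Windows. A hedged sketch using Drive's direct-download endpoint; the uc?export=download URL pattern is an assumption and may not work for large files that trigger a confirmation page:
# file id taken from the sharing link above
file_id <- "1BYTCT_VL8EummlAsH1xWCd5rC4bZHDMh"
download.file(paste0("https://drive.google.com/uc?export=download&id=", file_id),
destfile = "data/test_shp.zip", mode = "wb")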
Using the googledrive package:
# Create destination data folder (if there isn't one)
if(!dir.exists('data')) dir.create('data')
# Specify googledrive url:
test_shp = drive_get(as_id("https://drive.google.com/file/d/1BYTCT_VL8EummlAsH1xWCd5rC4bZHDMh/view?usp=sharing"))
# Download zipped folder
drive_download(test_shp, path = "data/test_shp.zip")
# Unzip folder
zip::unzip(zipfile = "data/test_shp.zip", exdir = "data/test_shp", junkpaths = TRUE)
# Load test.shp
test_shp <- read_sf("data/test_shp/test.shp")
And that works!
...Except it's still a hacky workaround which requires me to zip, download, unzip and then use a separate function (such as sf::read_sf or st_read) to read the data into my global environment. And, as it uses the googledrive package, it's only going to work for files stored in that system (not OneDrive, Dropbox and other URLs).
I've also tried sf::read_sf, st_read and fastshp::read.shp directly on the folder url but those approaches all fail as one might expect.
So, my question: is there a workflow for reading shapefiles stored online directly into R or should I stop looking? If there is not, but there is a way of expanding my above solution (2) beyond googledrive, I'd appreciate any tips on that too!
Note: I should also add that I have deliberately ignored any option requiring the package rgdal due to its imminent permanent retirement, and so am looking for options that are at least somewhat future-proof (I understand all packages drop off the map at some point). Thanks in advance!
I ran into a similar problem recently, having to read in shapefiles directly from Dropbox into R.
As a result, this solution only applies to the Dropbox case.
The first thing you will need to do is create a refreshable token for Dropbox using rdrop2, given recent changes from Dropbox that limit single token use to 4 hours. You can follow this SO post.
Once you have set up your refreshable token, identify all the files in your spatial data folder on Dropbox using:
library(rdrop2); library(dplyr); library(stringr)
shp_files_on_db <- drop_dir("Dropbox path/to your/spatial data/", dtoken = refreshable_token) %>%
filter(str_detect(name, "adm2"))
My 'spatial data' folder contained two sets of shapefiles, adm1 and adm2; I used the above code to choose only those associated with adm2.
Then create a vector of the names of the shp, csv, shx, dbf and cpg files in the 'spatial data' folder, as follows:
shp_filenames<- shp_files_on_db$name
I chose to read the shapefiles into a temporary directory, avoiding the need to store the files on my disk permanently; this is also useful in a Shiny implementation. I create this temporary directory as follows:
# create a new directory under tempdir
dir.create(dir1 <- file.path(tempdir(), "testdir"))
#If needed later on, you can delete this temporary directory
unlink(dir1, recursive = T)
#And test that it no longer exists
dir.exists(dir1)
Now download the Dropbox files to this temporary directory:
for (i in seq_along(shp_filenames)) {
drop_download(paste0("Dropbox path/to your/spatial data/",shp_filenames[i]),
dtoken = refreshable_token,
local_path = dir1)
}
And finally, read in your shapefile as follows:
#path to the shapefile in the temporary directory
path1_shp<- paste0(dir1, "/myfile_adm2.shp")
#reading in the shapefile using the sf package - a recommended replacement for rgdal
shp1a <- st_read(path1_shp)
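As a general aside beyond the Dropbox workflow above: when the zipped shapefile sits at a plain, directly downloadable URL, sf (via GDAL's virtual file systems) can often read it without any manual download or unzip step. A sketch with a placeholder URL, assuming the server returns the zip itself rather than an HTML page:
library(sf)
# /vsicurl/ streams the file over HTTP(S); /vsizip/ looks inside the zip archive
url <- "https://example.com/path/to/test_shp.zip"
test_shp <- read_sf(paste0("/vsizip//vsicurl/", url))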

How to work with multiple data sources within a working directory/RProject in RStudio

I am trying to get to grips with the idea that you shouldn't change the working directory, and good transferable/transparent working practices also mean it's worth using the RProjects option. However, this has caused me an issue when I have to draw upon different datasets within a single project. I am currently struggling to map a dataset from a file that sits within a subfolder of my working directory.
I use the following code, which works if I have my files stored in the working directory:
import <- list.files(pattern = ".csv", recursive = TRUE) %>%
map_df(~read_csv(., col_types = cols(.default = 'd', 'UTCDateTime' = 'c', 'mac_address' = 'c', 'firmware_ver' = 'c', 'hardware' = 'c')))
To keep my project file manageable and due to having to work with multiple sets of .csv files, I cannot have them all in the working directory.
If I then specify the path I want by using path = 'insert absolute/relative path here' within list.files(), I get an error:
'file1/20190913.csv' does not exist in the current working directory ('insert working directory here').
Any advice on working with files in subfolders would be appreciated.
Your list.files() call returns just the names of the files found in the path you give, e.g. myfile.csv, and read_csv() then looks for those files in the current working directory, where they are not.
If you add the option full.names = TRUE to list.files(), it will return the full paths to the files, and read_csv() will find them.
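A minimal sketch of how the original pipeline might look with full.names = TRUE, assuming the CSV files live in a subfolder named file1/ under the project root (the folder name is taken from the error message above and is only illustrative):
library(purrr)
library(readr)
library(dplyr)
import <- list.files(path = "file1", pattern = "\\.csv$",
recursive = TRUE, full.names = TRUE) %>%
map_df(~ read_csv(.x,
col_types = cols(.default = 'd', 'UTCDateTime' = 'c',
'mac_address' = 'c', 'firmware_ver' = 'c',
'hardware' = 'c')))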

R extension: write local data

I am creating a package and would like to store settings data locally, since it is unique for each user of the package and so that the setting does not have to be set each time the package is loaded.
How can I do this in the best way?
You could keep the necessary settings in an object and save it with saveRDS() whenever a change is made, or when the user is leaving or explicitly asks to save.
saveRDS() stores the R object as-is under a file name at the specified path.
saveRDS(<obj>, "path/to/filename.rds")
And you can load it the next time the package starts using readRDS().
The nice thing about readRDS() is that it returns the object, so you can assign it to whatever name you like; unlike load(), it does not restore the object under its old name, so nothing extra ends up polluting your namespace.
newly.assigned.name <- readRDS("path/to/filename.rds")
Where to store
Windows
Maybe here:
You can use the %systemdrive% and %homepath% environment variables to accomplish this.
The two variables, when concatenated, give you the desired user's home directory path, as below:
Running echo %systemdrive% on the command prompt gives:
C:
Running echo %homepath% on the command prompt gives:
\Users\
When used together it becomes:
C:\Users\
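From inside R, the same information can be read from environment variables; a sketch (SYSTEMDRIVE, HOMEPATH and USERPROFILE are Windows-specific variables):
# mirrors the %systemdrive%%homepath% recipe above (Windows only)
home <- paste0(Sys.getenv("SYSTEMDRIVE"), Sys.getenv("HOMEPATH"))
# USERPROFILE usually holds the same path directly
home <- Sys.getenv("USERPROFILE")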
Linux/OS X
Either in the user's package installation location:
path.to.package <- find.package("name.of.your.package",
lib.loc = NULL, quiet = FALSE,
verbose = getOption("verbose"))
# then construct the path to the final destination with
destination.folder.path <- file.path(path.to.package,
"subfoldername", "filename")
# Use file.path() to construct such paths: it joins the components with a
# separator that works on Unix-derived systems (Linux/Mac OS X) as well as Windows.
Or use the user's $HOME directory and store the settings there in a file whose name begins with ".", the Unix convention (Linux/Mac OS X) for files that hold a program's configuration, e.g. ".your-packages-name.rds".
If anybody has a better solution, please help!
