How to automate the process of unzip steps in RStudio - directory

I have downloaded the transportation history data. The data for each year contain the same numbers of files with exactly same name. Each year's data was zipped in a single files. I am trying to automate the process of unzipping.
for example: I have three zip files named (2014.zip, 2013.zip, 2012.zip) and each zip file contains three files(car.csv, truck.csv, train.csv). What I want is to unzip these files in their corresponding folders which will be created on the fly. How can I automate this process in RStudio? Thanks.

lapply(filenames, function(x)){
foldername<-substr(filename, 1, nchar(filename)-4)
if (file.exists(x)==FALSE){
download.file(url, x)
}
if (file.exists(foldername)==FALSE){
dir.create(foldername)
}
unzip(x)
for (file in list.files(pattern="*.dbf")){
file.copy(file,foldername)
file.remove(file)
}}

Related

Is there a way of reading shapefiles directly into R from an online source?

I am trying to find a way of loading shapefiles (.shp) from an online repository/folder/url directly into my global environment in R, for the purpose of making plots in ggplot2 using geom_sf. In the first instance I'm using my Google Drive to store these files but I'd ideally like to find a solution that works with any folder with a valid url and appropriate access rights.
So far I have tried a few options, the first 2 involving zipping the source folder on Google Drive where the shapefiles are stored and then downloading and unzipping in some way. Have included reproducable examples using a small test shapefile:
Using utils::download.file() to retrieve the compressed folder and unzipping using either base::system('unzip..') or zip::unzip() (loosely following this thread: Downloading County Shapefile from ONS):
# Create destination data folder (if there isn't one)
if(!dir.exists('data')) dir.create('data')
# Download the zipped file/folder
download.file("https://drive.google.com/file/d/1BYTCT_VL8EummlAsH1xWCd5rC4bZHDMh/view?usp=sharing", destfile = "data/test_shp.zip")
# Unzip folder using unzip (fails)
unzip(zipfile = "data/test_shp.zip", exdir = "data/test_shp", junkpaths = TRUE)
# Unzip folder using system (also fails)
system("unzip data/test_shp.zip")
If you can't run the above code then FYI the 2 error messages are:
Warning message:
In unzip(zipfile = "data/test_shp.zip", exdir = "data/test_shp", :
error 1 in extracting from zip file
AND
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of data/test_shp.zip or
data/test_shp.zip.zip, and cannot find data/test_shp.zip.ZIP, period.
Worth noting here that I can't even manually unzip this folder outside R so I think there's something going wrong with the download.file() step.
Using the googledrive package:
# Create destination data folder (if there isn't one)
if(!dir.exists('data')) dir.create('data')
# Specify googledrive url:
test_shp = drive_get(as_id("https://drive.google.com/file/d/1BYTCT_VL8EummlAsH1xWCd5rC4bZHDMh/view?usp=sharing"))
# Download zipped folder
drive_download(test_shp, path = "data/test_shp.zip")
# Unzip folder
zip::unzip(zipfile = "data/test_shp.zip", exdir = "data/test_shp", junkpaths = TRUE)
# Load test.shp
test_shp <- read_sf("data/test_shp/test.shp")
And that works!
...Except it's still a hacky workaround, which requires me to zip, download, unzip and then use a separate function (such as sf::read_sf or st_read) to read in the data into my global environment. And, as it's using the googledrive package it's only going to work for files stored in this system (not OneDrive, DropBox and other urls).
I've also tried sf::read_sf, st_read and fastshp::read.shp directly on the folder url but those approaches all fail as one might expect.
So, my question: is there a workflow for reading shapefiles stored online directly into R or should I stop looking? If there is not, but there is a way of expanding my above solution (2) beyond googledrive, I'd appreciate any tips on that too!
Note: I should also add that I have deliberately ignored any option requiring the package rgdal due to its imminient permanent retirement and so am looking for options that are at least somewhat future-proof (I understand all packages drop off the map at some point). Thanks in advance!
I ran into a similar problem recently, having to read in shapefiles directly from Dropbox into R.
As a result, this solution only applies for the case of Dropbox.
The first thing you will need to do is create a refreshable token for Dropbox using rdrop2, given recent changes from Dropbox that limit single token use to 4 hours. You can follow this SO post.
Once you have set up your refreshable token, identify all the files in your spatial data folder on Dropbox using:
shp_files_on_db<- drop_dir("Dropbox path/to your/spatial data/", dtoken = refreshable_token) %>%
filter(str_detect(name, "adm2"))
My 'spatial data' folder contained two sets of shapefiles – adm1 and adm 2. I used the above code to choose only those associated with adm2.
Then create a vector of the names of the shp, csv, shx, dbf, cpg files in the 'spatial data' folder, as follows:
shp_filenames<- shp_files_on_db$name
I choose to read in shapefiles into a temporary directory, avoiding the need to have to store the files on my disk – also useful in a Shiny implementation. I create this temporary directory as follows:
# create a new directory under tempdir
dir.create(dir1 <- file.path(tempdir(), "testdir"))
#If needed later on, you can delete this temporary directory
unlink(dir1, recursive = T)
#And test that it no longer exists
dir.exists(dir1)
Now download the Dropbox files to this temporary directory:
for (i in 1: length(shp_filenames)){
drop_download(paste0("Dropbox path/to your/spatial data/",shp_filenames[i]),
dtoken = refreshable_token,
local_path = dir1)
}
And finally, read in your shapefile as follows:
#path to the shapefile in the temporary directory
path1_shp<- paste0(dir1, "/myfile_adm2.shp")
#reading in the shapefile using the sf package - a recommended replacement for rgdal
shp1a <- st_read(path1_shp)

How to select specific files according to a spreadsheet criteria and then copy from directory to another directory in R?

I have a task that requires me to use a specific column in a CSV spreadsheet that stores the file names, for example:
File Name
CA-001
WV-001
ma-001
My task is to move some files from folder 'source' to folder 'target'.
And I'm using this csv spreadsheet as a crosswalk to select any files with names that match with what's in the column 'File Name'. Then I'm asking R to copy from the source folder that contains not only these files but also other files that are not in this list(eg: CO-001, SC-001...). If it's helpful, all of the files are PDFs, so we don't worry about file type. I want only the files that have names match with what's in the csv spreadsheet. How can I do this?
I have some sample code below, but it still didn't execute successfully.
source <- "C:/Users/53038/MovePDF/Test_From"
target <- "C:/Users/53038/MovePDF/Test_To"
all.files <- list.files(path = source)
csvfile <- read.csv('C:/Users/53038/MovePDF/Master.csv')
toCopy <- all.files[all.files %in% csvfile$Move]
file.copy(toCopy, target)
Thank you!
With the provided code, the selection of patterns you want to match will be in csvfile$File.Name.
I'm assuming the source directory is potentially very large. Instead of performing slow regular expressions to match substrings (while we know the exact filename), and/or getting a complete file listing (which is also slow), I will only seek if the exactly wanted filenames exist before copying them:
source <- "C:/Users/53038/MovePDF/Test_From"
target <- "C:/Users/53038/MovePDF/Test_To"
csvfile <- read.csv('C:/Users/53038/MovePDF/Master.csv')
# add .pdf suffix
toCopy <- paste0(csvfile$File.Name,'.pdf')
# add source directory path
toCopy <- file.path(source, toCopy)
# optional: extract only the existing files from toCopy. You can skip this step if you're sure they exist and/or you don't mind receiving errors
toCopy <- toCopy[file.exists(toCopy)]
# make it so
file.copy(toCopy, target, overwrite = T)
I would preferably keep the .pdf extension in the filename at all times, so also in the source CSV. There would be an issue on case-sensitive filesystems (almost all Linux installations, rarely macOS or Windows) if the extension is .PDF, .Pdf, etc.

How can I write multiple csv files in a specific directory and then merge them into a single csv?

I am trying to deal with extracting a subset from multiple .grb2 files in the same file path, and write them in separate csv files. I am using the following code which does the job and stores the csv files in the same directory as the .grb2 files.
path <- "file path"
input.file.names <- dir(path, pattern =".grb2")
output.file.names <-
paste0(tools::file_path_sans_ext(input.file.names),".csv")
for(i in 1:length(input.file.names)){
GRIB <- brick(input.file.names[i])
GRIB <- as.array(GRIB)
tmp2m.6hr <- GRIB[46,13,c(1:20)]
str(tmp2m.6hr)
tmp2m.data <- data.frame(tmp2m.6hr)
write.csv(tmp2m.data,output.file.names[i])
}
My first question is this: how can I store the csv files in a different directory than the .grb2 files?
My .grb2 files, and thus the resulting csv files, end in four different types, i.e. 00.grb2, 06.grb2, 12.grb2, 18.grb2. The resulting csv files have the following form:
enter image description here
My second question is: how can I merge all my 00.csv, 06.csv, 12.csv, 18.csv files (each category in the same column) in a single csv file in a directory of my choice with the following headrs: 00_tmp2m.6hr, 06_tmp2m.6hr, 12_tmp2m.6hr, 18_tmp2m.6hr, and also create a fifth column with the average of the other four? The result that I want is the following:
enter image description here
As I m not an experienced user this is too complicated for me. I would very much apreciate any assistance with this.
For your fist question, you might try specifying the path using a relative reference to the folder, as in write.csv(paste0("./myfolder/", output.file.names[i])).
Your second question might be easier if you read the data and then write your results as a new file. you might also want to take a look at the optional parameters of write.csv(append = FALSE, ...).
Also, you might get a better answer by creating a minimal example.

renaming files based on names of other files in R

Right now I have two folders, each with the same number .dat files. In folder 1, I have my original data files and in folder 2 I have files where I have deleted some data from folder 1 files. I'm trying to see how missing data can change my average, so I am randomly deleting some data then performing the same statistics to see how much it differs from the original data set So in folder one I have files that have names
kn_2014_01_09_0600.dat
kn_2014_01_09_0700.dat
kn_2014_01_09_0800.dat
and so on
After I read them in to R and work them in the way I need I write the files to a new folder where they now have the names
1.dat
2.dat
3.dat
and so on
How can I change the names in my output folder to match the original names?
So my output file 1.dat should be kn_2014_01_09_0600.dat and my 2.dat file should be kn_2014_01_09_0700.dat and so on
right now I have
nfiles <- list.files('new file location')
ofiles <- list.files('original file location')
lapply(nfiles,function(i){file.rename(from=i,to= )})
I don't know what to put for the to argument? I thought it might be something like
lapply(nfiles,function(i){file.rename(from=i,to=ofiles[i])})
or
lapply(nfiles,function(i){file.rename(from=i,to=ofiles[1:length(ofiles)})
but neither of those worked. Any suggestions?

How to use R to Iterate through Subfolders and bind CSV files of the same ID?

I am stuck. I need a way to iterate through a bunch of subfolders in a directory, pull out 4 .csv files , bind the contents of those 4 .csv files, then write out the new .csv to a new directory using the name of the initial subfolder as the name of the new .csv.
I know R could do this. But I am stuck at how to iterate across the subfolders and bind the csv files together. My obstacle is that each subfolder contains the same 4 .csv files using the same 8-digit id. For example, subfolder A contains 09061234.csv, 09061345.csv, 09061456.csv, and 09061560.csv. subfolder B contains 9061234.csv, 09061345.csv, 09061456.csv, and 09061560.csv. (...). There are 42 subfolders, and hence 168 csv files with the same names. I want to compact the files down to 42.
I can use list.files to retrieve all the subfolders. But then what?
##Get Files from directory
TF = "H:/working/TC/TMS/Counts/June09"
##List Sub folders
SF <- list.files(TF)
##List of File names inside folders
FN <- list.files(SF)
#Returns list of 168 filenames
###?????###
#How to iterate through each subfolder, read each 8-digit integer id file,
#bind them all together into one single csv,
#Then write to new directory using
#the name of the subfolder as the name of the new csv?
There is probably a way to do this easily but I am a noob with R. Something involving functions, paste and write.table perhaps? Any hints/help/suggestions is greatly appreciated. Thanks!
You can use recursive=T option for list.files,
lapply(c('1234' ,'1345','1456','1560'),function(x){
sources.files <- list.files(path=TF,
recursive=T,
pattern=paste('*09061*',x,'*.csv',sep='')
,full.names=T)
## ou read all files with the id and bind them
dat <- do.call(rbind,lapply(sources.files,read.csv))
### write the file for the
write(dat,paste('agg',x,'.csv',sep='')
}
After some tweaking of agstudy's code, I came up with the solution I was ultimately after. There were a couple of missing pieces that are more due to the nature of my specific problem, so I am leaving agstudy's answer as "accepted".
Turns out a function really wasn't needed. At least not for now. If I need to perform this same task again, I will create a function out of it. For now, I can solve this particular problem without it.
Also, for my instance, I needed a conditional "if" statement to handle any non-csv files that may have lived in the subfolders. By adding an if statement, R throws warnings and skips any files that are not comma-separated.
Code:
##Define directory path##
TF = "H:/working/TC/TMS/Counts/June09"
##List of subfolder files where file name starts with "0906"##
SF <- list.files(TF,recursive=T, pattern=paste("*09061*",x,'*.csv',sep=""))
##Define the list of files to search for##
x <- (c('1234' ,'1345','1456','1560')
##Create a conditional to skip over the non-csv files in each folder##
if (is.integer(x)){
sources.files <- list.files(TF, recursive=T,full.names=T)}
dat <- do.call(rbind,lapply(sources.files,read.csv))
#the warnings thrown are ok--these are generated due to the fact that some of the folders contain .xls files
write.table(dat,file="H:/working/TC/TMS/June09Output/June09Batched.csv",row.names=FALSE,sep=",")

Resources