Recursively FTP download, then extract gz files - R

I have a multiple-step file download process I would like to do within R. I have got the middle step, but not the first and third...
# STEP 1 Recursively find all the files at an ftp site
# ftp://prism.oregonstate.edu//pub/prism/pacisl/grids
all_paths <- #### a recursive listing of the ftp path contents??? ####
# STEP 2 Choose all the ones whose filename starts with "hi"
all_files <- sapply(sapply(strsplit(all_paths, "/"), rev), "[", 1)
hawaii_log <- substr(all_files, 1, 2) == "hi"
hi_paths <- all_paths[hawaii_log]
hi_files <- all_files[hawaii_log]
# STEP 3 Download & extract from gz format into a single directory
mapply(download.file, url = hi_paths, destfile = hi_files)
## and now how to extract from gz format?

For part 1, RCurl might be helpful. The getURL function retrieves one or more URLs; dirlistonly lists the contents of the directory without retrieving the files. The rest of the function builds the next level of URLs:
library(RCurl)

getContent <- function(dirs) {
    urls <- paste(dirs, "/", sep = "")
    fls <- strsplit(getURL(urls, dirlistonly = TRUE), "\r?\n")
    ok <- sapply(fls, length) > 0
    unlist(mapply(paste, urls[ok], fls[ok], sep = "", SIMPLIFY = FALSE),
           use.names = FALSE)
}
So starting with
dirs <- "ftp://prism.oregonstate.edu//pub/prism/pacisl/grids"
we can invoke this function and look for things that look like directories, continuing until done
fls <- character()
while (length(dirs)) {
    message(length(dirs))
    urls <- getContent(dirs)
    isgz <- grepl("gz$", urls)
    fls <- append(fls, urls[isgz])
    dirs <- urls[!isgz]
}
We could then use getURL again, this time on fls (or elements of fls, in a loop), to retrieve the actual files. Or, perhaps better, open a URL connection and use gzcon to decompress and process the file on the fly, along the lines of
con <- gzcon(url(fls[1], "r"))
meta <- readLines(con, 7)
data <- scan(con, integer())
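If the goal is instead to land the decompressed files in a single output directory on disk (step 3 of the original question), a minimal base-R sketch could look like the following; it reuses hi_paths / hi_files from the question, and the output directory name "hawaii" is my own choice:
dir.create("hawaii", showWarnings = FALSE)
for (i in seq_along(hi_paths)) {
    gz_path <- file.path("hawaii", hi_files[i])
    download.file(hi_paths[i], destfile = gz_path, mode = "wb")
    out_path <- sub("\\.gz$", "", gz_path)   # drop the .gz suffix for the plain copy
    con_in  <- gzfile(gz_path, "rb")         # gzfile() decompresses as it reads
    con_out <- file(out_path, "wb")
    while (length(chunk <- readBin(con_in, raw(), 1e6)) > 0)
        writeBin(chunk, con_out)
    close(con_in); close(con_out)
}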

I can read the contents of the ftp page if I start R with the internet2 option. I.e.
C:\Program Files\R\R-2.12\bin\x64\Rgui.exe --internet2
(The shortcut used to start R on Windows can be modified to add the --internet2 argument: right-click / Properties / Target, or just run the command above at a prompt. The equivalent is obvious on GNU/Linux.)
The text on that page can be read like this:
download.file("ftp://prism.oregonstate.edu//pub/prism/pacisl/grids", "f.txt")
txt <- readLines("f.txt")
It's a little more work to parse out the Directory listings, then read them recursively for the underlying files.
## (something like)
dirlines <- txt[grep("Directory <A HREF=", txt)]
## split and extract text after "grids/"
split1 <- sapply(strsplit(dirlines, "grids/"), function(x) rev(x)[1])
## split again and take the text before the first "/"
sapply(strsplit(split1, "/"), function(x) x[1])
[1] "dem" "ppt" "tdmean" "tmax" "tmin"
It's about here that this stops seeming very attractive and gets a bit laborious, so I would actually recommend a different option. There is no doubt a better solution, perhaps with RCurl, but I would recommend learning to use an FTP client, for you and your users. Command-line ftp, anonymous logins, and mget all work pretty easily.
The internet2 option was explained for a similar ftp site here:
https://stat.ethz.ch/pipermail/r-help/2009-January/184647.html

library(RCurl)  # for getURL()

ftp.root     <- ...   # where the files are (ftp directory URL)
dropbox.root <- ...   # where to put the files (local directory)
#=====================================================================
# Function that downloads files from URL
#=====================================================================
fdownload <- function(sourcelink) {

    targetlink <- paste(dropbox.root, substr(sourcelink, nchar(ftp.root) + 1,
                        nchar(sourcelink)), sep = '')

    # list of contents
    filenames <- getURL(sourcelink, ftp.use.epsv = FALSE, dirlistonly = TRUE)
    filenames <- strsplit(filenames, "\n")
    filenames <- unlist(filenames)
    files <- filenames[grep('\\.', filenames)]
    dirs <- setdiff(filenames, files)
    if (length(dirs) != 0) {
        dirs <- paste(sourcelink, dirs, '/', sep = '')
    }

    # files
    for (filename in files) {
        sourcefile <- paste(sourcelink, filename, sep = '')
        targetfile <- paste(targetlink, filename, sep = '')
        download.file(sourcefile, targetfile)
    }

    # subfolders
    for (dirname in dirs) {
        fdownload(dirname)
    }
}
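A hypothetical invocation for the PRISM site from the first question; the local dropbox.root path here is made up:
ftp.root     <- "ftp://prism.oregonstate.edu//pub/prism/pacisl/grids/"
dropbox.root <- "C:/prism_grids/"   # hypothetical local target directory
fdownload(ftp.root)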

Related

Downloading txt files from multiple directories using R

I've been trying to download .txt files from https://ds.data.jma.go.jp/gmd/goos/data/pub/JMA-product/cobe2_sst_glb_M/
So far, I've managed to download the complete set for 1850 by using the code from "Download all the files (.zip and .txt) from a webpage using R", which, for my case, is:
page <- "https://ds.data.jma.go.jp/gmd/goos/data/pub/JMA-product/cobe2_sst_glb_M/1850/"
a <- readLines(
page
)
loc.txt <- grep(
".txt",
a
)
#------------------------------------
convfn <- function(line, marker, page){
i <- unlist(gregexpr(pattern ='href="', line)) + 6
i2<- unlist(gregexpr(pattern =,marker, line)) + 3
#target file
.destfile <- substring(line, i[1], i2[1])
#target url
.url <- paste(page, .destfile, sep = "/")
#print targets
cat(.url, '\n', .destfile, '\n')
#the workhorse function
download.file(url=.url, destfile=.destfile)
}
#------------------------------------
print(
getwd()
)
sapply(a[loc.txt],
FUN = convfn,
marker = '.txt"',
page = page)
I would like to know how to write a function that will allow me to automate this for the years 1850 to 2022, since doing it by hand would be long and repetitive (over 170 years). My idea is stuck on the line:
page <- paste0("https://ds.data.jma.go.jp/gmd/goos/data/pub/JMA-product/cobe2_sst_glb_M/", c(seq(1850, 2022, by = 1)), "/")
but I do not know how to turn it into a working function.
Please help, thank you and keep safe
Best regards,
Raven
I'd be inclined to do this differently: better to use XML/XPath to extract the file links.
library(httr)   # for GET(...)
library(XML)    # for htmlParse(...)

base.url <- 'https://ds.data.jma.go.jp/gmd/goos/data/pub/JMA-product/cobe2_sst_glb_M'

get.docs <- function(year) {
    url  <- paste(base.url, year, sep = '/')
    html <- htmlParse(content(GET(url), type = 'text'))
    file.names <- html['//td/a/@href'][-1]   # first href is to the parent directory; remove it
    ##
    # uncomment the next line to actually save the files
    #
    # mapply(download.file, paste(url, file.names, sep='/'), file.names)
    print(sprintf('Downloaded from: %s to: %s', paste(url, file.names, sep = '/'), file.names))
}
lapply(1850:2022, get.docs)
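If you would rather keep the readLines/convfn approach from the question, a sketch of the per-year wrapper could look like this (download.year is a name I've made up; convfn is the function defined in the question):
download.year <- function(year) {
    page <- paste0("https://ds.data.jma.go.jp/gmd/goos/data/pub/JMA-product/cobe2_sst_glb_M/",
                   year, "/")
    a <- readLines(page)
    loc.txt <- grep(".txt", a)
    sapply(a[loc.txt], FUN = convfn, marker = '.txt"', page = page)
}
lapply(1850:2022, download.year)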

How to get the directory of the executing script in R? [duplicate]

This question already has answers here:
Determine path of the executing script
(30 answers)
Closed 4 years ago.
I have a function that looks like this:
read_data <- function(filename, header) {
path <- paste("./output/", filename, sep = "")
if (file.exists(path)) {
data <- read.csv(file = path, header = header, sep = ",")
}
# Partially removed for brevity.
}
What I want to achieve is that, given a filename, I want to search whether that filename is available inside the output subdirectory, which is a subdirectory of the directory where my script is located, and if it is available, I want to read that file. The problem is that, as far as I know, the file argument of read.csv requires a full path to the file. So I somehow need to get the directory where my script is located, so I can concatenate it with the rest of the subdirectory and filename. I can get the current working directory with getwd(), but that's not quite the same thing, as my working directory seems to always be fixed, whereas the script can be located anywhere on the computer. Any ideas how to get the directory of the script, and concatenate it with the output subdirectory and the provided filename in R?
If you want to determine the directory of the executing script, this might be a dup: Rscript: Determine path of the executing script
initial.options <- commandArgs(trailingOnly = FALSE)
file.arg.name <- "--file="
script.name <- sub(file.arg.name, "", initial.options[grep(file.arg.name, initial.options)])
script.dirname <- dirname(script.name)
print(script.dirname)
> f <- "/path/to/my/script.R"
> f
[1] "/path/to/my/script.R"
> basename(f)
[1] "script.R"
> dirname(f)
[1] "/path/to/my"
> dirname(dirname(f))
[1] "/path/to"
> file.path(dirname(f), "output")
[1] "/path/to/my/output"
> file.path(dirname(f), "output", "data.csv")
[1] "/path/to/my/output/data.csv"
Adding this as an answer for you Tinker.
Considering you're just trying to read in a file, you can do it this way:
## So what does this do?
# The path is where the files exist
# The pattern is some identifiable portion of the file name, which list.files() will bring back
# You need the full name so that R knows where to read from; this way you don't have to set a new working directory.
data <- if (file.exists(list.files(path = "./output/", pattern = "filename", full.names = TRUE))) {
    read.csv(list.files(path = "./output/", pattern = "filename", full.names = TRUE))
}
# Let's imagine you have a number of files to read in
# Generate a list of filenames
filename <- list("file1", "file2", "file3", "filen")
data <- lapply(filename, function(x) {
    if (file.exists(list.files(path = "./output/", pattern = x, full.names = TRUE))) {
        read.csv(list.files(path = "./output/", pattern = x, full.names = TRUE))
    }
})
# each element of the list is one of your data files
data[[1]]
data[[2]]
data[[n]]
I'm not sure what you're declaring with header, as CSVs are assumed to have a header by default; additionally, a CSV is comma-separated, so declaring the sep character is also redundant.
Collecting resources from multiple questions on SO, I came up with the following solution, which seems to work with multiple calling conventions:
library(base)
library(rstudioapi)
get_directory <- function() {
    args <- commandArgs(trailingOnly = FALSE)
    file <- "--file="
    rstudio <- "RStudio"
    match <- grep(rstudio, args)
    if (length(match) > 0) {
        return(dirname(rstudioapi::getSourceEditorContext()$path))
    } else {
        match <- grep(file, args)
        if (length(match) > 0) {
            return(dirname(normalizePath(sub(file, "", args[match]))))
        } else {
            return(dirname(normalizePath(sys.frames()[[1]]$ofile)))
        }
    }
}
Which later I can use as:
path <- paste(get_directory(), "/output/", filename, sep = "")
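An equivalent way to build that path, letting file.path insert the separators, would be:
path <- file.path(get_directory(), "output", filename)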

Add a suffix to filenames based on subfolder names within a directory in R

I have a number of (sub)folders stored within a directory folder. Each subfolder contains 5-35 .jpg aerial photograph files that are named by flightline name and number (e.g. bej-3-83). I would like to add a suffix to each of these files based upon the subfolder they are stored in. For example, if 'bej-3-83' is stored within the 'T13N_10W' subfolder, I would like my R script to rename 'bej-3-83' as 'bej-3-83-T13N_10W', and so forth for each file stored within each subfolder.
I can partially accomplish this process albeit still with more manual input than I'd like using this script:
folder = "C:\\...\\T23N_R14W"
files <- list.files(folder,pattern = "\\.jpg$",full.names = T)
files
sapply(files,FUN=function(eachPath){
file.rename(from=eachPath,to= sub(pattern="_clip", paste0("_T23N_R14W"),eachPath))
})
But as you can see this script uses a manual paste input of the subfolder name which isn't useful when you're trying to create a script that does what I need in one fell swoop.
I'm seeing similar questions and answers which utilize 'pushd' and 'popd', and I've attached two of those threads below as links. I'm trying to read as much as I can on these commands, but so far the process of making them work has me stuck.
How to rename files in folders to foldername using batch file
Rename Files Based On Folder Name
Sincerely,
Henry
You might have to change dir_separator to "\\" on Windows:
make_filename <- function(file_path) {
    s <- unlist(strsplit(file_path, dir_separator))
    fname <- gsub('\\.jpg$', '', s[length(s)])
    parent_dir <- s[(length(s) - 1)]
    new_fname <- paste0(parent_dir, "_", fname, '.jpg')
    path <- paste(s[-length(s)], collapse = dir_separator)
    return(paste(path, new_fname, sep = dir_separator))
}

folder <- './data'
dir_separator <- '/'
files <- paste0(folder, dir_separator, list.files(folder, recursive = TRUE))
sapply(files, function(x) file.rename(from = x, to = make_filename(x)))
A recursive approach: pass the path to the root folder containing your files, and the extension of the files you want to rename, to rename_batch. Defaults are the working directory and 'jpeg'.
library(stringr)

# An auxiliary function
rename_file <- function(str, extra) {
    file_name <- tools::file_path_sans_ext(str)
    file_ext  <- tools::file_ext(str)
    return(paste0(file_name, '-', extra, '.', file_ext))
}

rename_batch <- function(path = "./", extension = 'jpeg') {
    # Separate files from folders
    l <- list.files(path)
    files <- l[grepl(paste0("\\.", extension), l)]
    folders <- list.dirs(path, full.names = FALSE, recursive = FALSE)
    present_folder <- stringr::str_extract(path, '(?<=/)([^/]+)$')

    # Check if there is a / at the end of path and remove it
    # for consistency
    path_len <- nchar(path)
    last <- substr(path, path_len, path_len)
    if (last == '/') {
        path <- substr(path, 1, path_len - 1)
    }

    if (length(files) > 0) {
        file_update <- paste0(path, '/', files)
        file.rename(file_update, rename_file(file_update, present_folder))
    }

    if (length(folders) > 0) {
        for (i in paste0(path, '/', folders)) {
            cat('Renaming in:', i, '\n')
            rename_batch(i, extension)   # pass the extension down the recursion
        }
    }
}
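A hypothetical call for the layout in the question, where the aerial photos are .jpg files stored under township subfolders (the root path here is made up):
rename_batch("C:/aerials", extension = "jpg")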

How to find all files sourcing a particular file?

Suppose I have file_a.R. It is sourced via R's base source function by some other files file_b.R, file_c.R, which are located in the same folder or sub folder. Is there an easy way to get the paths of file_b.R and file_c.R given the path of file_a.R?
EDIT:
If you want to get all the links between your R files and the files sourced within them, you can use something like this:
library(stringr)

# Get all R file paths in the working directory and subdirectories
filelist <- lapply(list.files(pattern = "[.]R$", recursive = TRUE), print)

# Extract one file's sources
getSources <- function(file, pattern) {
    # Store all the file's lines in a character vector
    lines <- readLines(file, warn = FALSE)
    # Extract R file names starting with "pattern" in all lines containing "source"
    sources <- lapply(lines, function(x) {
        if (length(grep("source", x)) > 0) {
            str_extract(x, paste0(pattern, ".*[.]R"))
        } else {
            NA
        }
    })
    # Remove NA (lines without source)
    sources <- sources[!is.na(sources)]
    # Return a list
    list(path = file,
         pattern = pattern,
         sources = unlist(sources))
}

# Example
corresp <- lapply(X = filelist, FUN = getSources, pattern = "file")
It will return a list with:
$path: the R file's path
$pattern: the pattern used to match sources
$sources: the names of the sourced files
And you'll be able to see what is sourced where, including file_a.R.
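To answer the original question directly (which files source file_a.R), you can then filter that list; the Filter step below is my addition, not part of the answer above:
sourcing_file_a <- Filter(function(x) any(grepl("file_a", x$sources)), corresp)
sapply(sourcing_file_a, `[[`, "path")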

Automate zip file reading in R

I need to automate R to read a csv datafile that's into a zip file.
For example, I would type:
read.zip(file = "myfile.zip")
And internally, what would be done is:
Unzip myfile.zip to a temporary folder
Read the only file contained on it using read.csv
If there is more than one file in the zip file, an error is thrown.
My problem is to get the name of the file contained in the zip file, in order to provide it to the read.csv command. Does anyone know how to do it?
UPDATE
Here's the function I wrote based on @Paul's answer:
read.zip <- function(zipfile, row.names = NULL, dec = ".") {
    # Create a name for the dir where we'll unzip
    zipdir <- tempfile()
    # Create the dir using that name
    dir.create(zipdir)
    # Unzip the file into the dir
    unzip(zipfile, exdir = zipdir)
    # Get the files in the dir
    files <- list.files(zipdir)
    # Throw an error if there's more than one
    if (length(files) > 1) stop("More than one data file inside zip")
    # Get the full name of the file
    file <- paste(zipdir, files[1], sep = "/")
    # Read the file
    read.csv(file, row.names = row.names, dec = dec)
}
Since I'll be working with more files inside the tempdir(), I created a new dir inside it, so I don't get confused with the files. I hope it may be useful!
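Called like the hypothetical read.zip from the top of the question, that gives, for example:
mydata <- read.zip("myfile.zip")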
Another solution using unz:
read.zip <- function(file, ...) {
    zipFileInfo <- unzip(file, list = TRUE)
    if (nrow(zipFileInfo) > 1)
        stop("More than one data file inside zip")
    else
        read.csv(unz(file, as.character(zipFileInfo$Name)), ...)
}
You can use unzip to unzip the file; I just mention this as it is not clear from your question whether you knew that. As for reading the file: once you've extracted it to a temporary dir (?tempdir), just use list.files to find the files that were dumped into the temporary directory. In your case this is just one file, the file you need. Reading it with read.csv is then quite straightforward:
l = list.files(temp_path)
read.csv(l[1])
assuming your tempdir location is stored in temp_path.
I found this thread as I was trying to automate reading multiple csv files from a zip. I adapted the solution to the broader case. I haven't tested it for weird filenames or the like, but this is what worked for me so I thought I'd share:
read.csv.zip <- function(zipfile, ...) {
    # Create a name for the dir where we'll unzip
    zipdir <- tempfile()
    # Create the dir using that name
    dir.create(zipdir)
    # Unzip the file into the dir
    unzip(zipfile, exdir = zipdir)
    # Get a list of csv files in the dir
    files <- list.files(zipdir)
    files <- files[grep("\\.csv$", files)]
    # Create a list of the imported csv files
    csv.data <- sapply(files, function(f) {
        fp <- file.path(zipdir, f)
        return(read.csv(fp, ...))
    })
    return(csv.data)
}
If you have zcat installed on your system (which is the case for Linux, macOS, and Cygwin) you could also use:
zipfile<-"test.zip"
myData <- read.delim(pipe(paste("zcat", zipfile)))
This solution also has the advantage that no temporary files are created.
Here is an approach I am using that is based heavily on @Corned Beef Hash Map's answer. Here are some of the changes I made:
My approach makes use of the data.table package's fread(), which can be fast (generally, if it's zipped, sizes might be large, so you stand to gain a lot of speed here!).
I also adjusted the output format so that it is a named list, where each element of the list is named after the file. For me, this was a very useful addition.
Instead of using regular expressions to sift through the files grabbed by list.files, I make use of list.files()'s pattern argument.
Finally, by relying on fread() and by making pattern an argument to which you could supply something like "" or NULL or ".", you can use this to read in many types of data files; in fact, you can read in multiple types at once (if your .zip contains .csv and .txt files and you want both, for example). If there are only some types of files you want, you can specify the pattern to match only those, too.
Here is the actual function:
library(data.table)   # for fread()

read.csv.zip <- function(zipfile, pattern = "\\.csv$", ...) {
    # Create a name for the dir where we'll unzip
    zipdir <- tempfile()
    # Create the dir using that name
    dir.create(zipdir)
    # Unzip the file into the dir
    unzip(zipfile, exdir = zipdir)
    # Get a list of csv files in the dir
    files <- list.files(zipdir, recursive = TRUE, pattern = pattern)
    # Create a list of the imported csv files
    csv.data <- sapply(files,
                       function(f) {
                           fp <- file.path(zipdir, f)
                           dat <- fread(fp, ...)
                           return(dat)
                       },
                       simplify = FALSE)   # keep a plain list even when the tables could be simplified
    # Use csv names to name list elements
    names(csv.data) <- basename(files)
    # Return data
    return(csv.data)
}
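For example, to pull both .csv and .txt files out of a hypothetical archive in one call:
dat.list <- read.csv.zip("myfile.zip", pattern = "\\.(csv|txt)$")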
The following refines the above answers. FUN could be read.csv, cat, or anything you like, provided its first argument accepts a file path. E.g.
head(read.zip.url("http://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/Downloads/ICD-9-CM-v32-master-descriptions.zip", filename = "CMS32_DESC_LONG_DX.txt"))
read.zip.url <- function(url, filename = NULL, FUN = readLines, ...) {
    zipfile <- tempfile()
    download.file(url = url, destfile = zipfile, quiet = TRUE)
    zipdir <- tempfile()
    dir.create(zipdir)
    unzip(zipfile, exdir = zipdir)  # no files argument, so extract everything
    files <- list.files(zipdir)
    if (is.null(filename)) {
        if (length(files) == 1) {
            filename <- files
        } else {
            stop("multiple files in zip, but no filename specified: ",
                 paste(files, collapse = ", "))
        }
    } else { # filename specified
        stopifnot(length(filename) == 1)
        stopifnot(filename %in% files)
    }
    do.call(FUN, args = c(list(file.path(zipdir, filename)), list(...)))
}
Another approach that uses fread from the data.table package
library(data.table)   # for fread()

fread.zip <- function(zipfile, ...) {
    # Function reads data from a zipped csv file
    # Uses fread from the data.table package

    ## Create the temporary directory, or flush any CSVs if it exists already
    if (!file.exists(tempdir())) {
        dir.create(tempdir())
    } else {
        file.remove(list.files(tempdir(), full.names = TRUE, pattern = "\\.csv$"))
    }

    ## Unzip the file into the dir
    unzip(zipfile, exdir = tempdir())

    ## Get the path to the file
    file <- list.files(tempdir(), pattern = "\\.csv$", full.names = TRUE)

    ## Throw an error if there's more than one
    if (length(file) > 1) stop("More than one data file inside zip")

    ## Read the file
    fread(file,
          na.strings = c(""),  # read empty strings as NA
          ...)
}
Based on the answer/update by @joão-daniel:
# unzipped file location
outDir <- "~/Documents/unzipFolder"

# get all the zip files
zipF <- list.files(path = "~/Documents/", pattern = "*.zip", full.names = TRUE)

# unzip all your files
purrr::map(.x = zipF, .f = unzip, exdir = outDir)
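A possible follow-up, reading every extracted CSV from outDir into one named list (read.csv here is just an example reader):
csvs <- list.files(outDir, pattern = "\\.csv$", full.names = TRUE)
data.list <- setNames(lapply(csvs, read.csv), basename(csvs))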
I just wrote a function based on the read.zip above that may help...
read.zip <- function(zipfile, internalfile = NA, read.function = read.delim, verbose = TRUE, ...) {
    # function based on http://stackoverflow.com/questions/8986818/automate-zip-file-reading-in-r
    # check the files within the zip
    unzfiles <- unzip(zipfile, list = TRUE)
    if (is.na(internalfile) || is.numeric(internalfile)) {
        internalfile <- unzfiles$Name[ifelse(is.na(internalfile), 1, internalfile[1])]
    }
    # Create a name for the dir where we'll unzip
    zipdir <- tempfile()
    # Create the dir using that name
    if (verbose) cat("Directory created:", zipdir, "\n")
    dir.create(zipdir)
    # Unzip the file into the dir
    if (verbose) cat("Unzipping file:", internalfile, "...")
    unzip(zipfile, files = internalfile, exdir = zipdir)
    if (verbose) cat("Done!\n")
    # Get the full name of the file
    file <- paste(zipdir, internalfile, sep = "/")
    if (verbose) {
        on.exit({
            cat("Done!\nRemoving temporary files:", file, ".\n")
            file.remove(file)
            file.remove(zipdir)
        })
    } else {
        on.exit({
            file.remove(file)
            file.remove(zipdir)
        })
    }
    # Read the file
    if (verbose) cat("Reading File...")
    read.function(file, ...)
}
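For example, to read the second file inside an archive with read.csv (the archive name here is made up):
df <- read.zip("myfile.zip", internalfile = 2, read.function = read.csv)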
