Dropbox folder extraction in R

I'm trying to extract some functions stored on dropbox (in a folder).
It all goes well until I try to untar the file. Here's an example:
library("R.utils")
temp <- tempfile()
temp<-paste(temp,".gz",sep="")
download.file("http://www.dropbox.com/sh/dzgrfdd18dljpj5/OyoTBMj8-v?dl=1",temp)
untar(temp,compressed="gzip",exdir=dirname(temp))
Here I get an error:
Error in rawToChar(block[seq_len(ns)]) :
embedded nul in string: 'PK\003\004\024\0\b\b\b....
Ideally I would then load all the functions in folder like this:
sourceDirectory(dirname(temp))
...but I need to be able to untar them first. I can open the archive in Windows, but in R I get the error above. Can anyone help? I've tried unzip, but that only works with smaller folders downloaded from Dropbox (such as the one above); bigger ones only work in gzip format (at least in my experience).

# use the httr package
library(httr)
# define your desired file
u <- "http://www.dropbox.com/sh/dzgrfdd18dljpj5/OyoTBMj8-v?dl=1"
# save the file to a .zip
tf <- paste0( tempfile() , '.zip' )
# create a temporary directory
td <- tempdir()
# get the file
fc <- GET(u)
# write the content of the download to a binary file
writeBin(content(fc, "raw"), tf)
# unzip it.
unzip( tf , exdir = td )
# locate all files in this directory
af <- list.files( td , recursive = TRUE )
# subset the files to the ones ending with R
R.files <- af[ substr( af , nchar( af ) , nchar( af ) ) == 'R' ]
# set your working directory
setwd( td )
# source 'em
for ( i in R.files ) source( i )
# see that they're loaded
ls()

You may need to use the option mode = "wb" in download.file() so that the archive is written as a binary file (this matters on Windows).
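A minimal sketch of that idea, reusing the URL from the question: the PK\003\004 bytes in the error message are the zip signature, so the "?dl=1" download is treated here as a zip and handled with unzip() rather than untar(); the "dropbox_functions" subdirectory name is just an arbitrary choice.
u <- "http://www.dropbox.com/sh/dzgrfdd18dljpj5/OyoTBMj8-v?dl=1"
tf <- paste0(tempfile(), ".zip")
td <- file.path(tempdir(), "dropbox_functions")
dir.create(td, showWarnings = FALSE)
# mode = "wb" keeps the binary download intact on Windows
download.file(u, tf, mode = "wb")
# the archive is really a zip, so use unzip() instead of untar()
unzip(tf, exdir = td)
# source every extracted .R file
library(R.utils)
sourceDirectory(td)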

Related

Is there a way to load in multiple .txt files and list them after so the encoding can be changed?

I am trying to load multiple .txt files into a corpus. I've set up the working directory and then have the following to load the files:
filenames <- list.files(getwd(),pattern="*.txt", full.names=FALSE)
The problem is, some of the text file names have special characters (they are people's names), and I can't find a way to change the encoding to UTF-8 with list.files(), and I'm not sure how to load in many .txt files without list.files(). I also can't remove the special characters in this case.
Any suggestions? Thanks in advance!
Edit: Working in Windows
The pattern argument won't work if the encoding is wrong. Use list.files() without pattern=... and you can at least get character strings from the mis-encoded filenames that you can then work with and possibly fix in R.
Here is a minimal demonstrating example (it needs the convmv system command to set up the test case):
dir.create( wd <- tempfile() )
setwd(wd)
convmv <- Sys.which("convmv")
if( convmv == "" )
    stop("Need the convmv available to continue")
f1 <- "æøå.txt"
cat( "foo\n", file=f1 )
system2( convmv, args=c("-f", "utf8", "-t", "latin1", "--notest", f1) )
f2 <- "ÆØÅ.txt"
cat( "bar\n", file=f2 )
plain.list.files <- list.files()
stopifnot( length( plain.list.files ) == 2 )
with.pattern.list.files <- list.files( pattern="\\.txt" )
stopifnot( length( with.pattern.list.files ) == 1 )
Fixing the character set can be done, but I'm not sure if you're asking about that at this point.
EDIT: Actually working with or fixing these filenames:
Now that you can read the filenames, however mangled they may be, the following might help if you know they are, for example, latin1. Ironically, detect_str_enc doesn't get it right here (and I found no good alternative), but if you know that any filename that isn't ASCII or UTF-8 will be latin1, then this might be a working fix for you:
library(uchardet)
hard.coded.encoding <- "latin1"
nice.filenames <- sapply( plain.list.files, function(fname) {
    if( !detect_str_enc(fname) %in% c("ASCII","UTF-8") ) {
        Encoding(fname) <- hard.coded.encoding
    }
    return( fname )
})
## Now it's presumably safe to look for our pattern:
i.txt <- grepl( "\\.txt$", nice.filenames )
## And we can now work with the files and present them nicely:
file.data <- lapply( plain.list.files[i.txt], function(fname) {
    ## Do what you want to do with the file here:
    readLines( fname )
})
names(file.data) <- nice.filenames[i.txt]

invalid path argument in R

I tried to execute this function to find the correlation between two columns in the CSV files in the "specdata" directory by calling corr("specdata") at the command line. But it shows "Error in list.files(directory, full.names = TRUE) : invalid 'path' argument". I checked the current working directory and it was correct. Any ideas about the reason for the error?
corr <- function(directory, threshold = 0) {
    files_all <- list.files(directory, full.names = TRUE)
    v <- vector(mode = "numeric", length = 0)
    for (i in 1:length(files_all)) {
        individual <- read.csv(files_all[i], header = TRUE)
        nobs <- sum(complete.cases(individual))
        if (nobs > threshold) {
            xSulfate <- individual[which(!is.na(individual$sulfate)), ]
            yPollutant <- xSulfate[which(!is.na(xSulfate$nitrate)), ]
            v <- c(v, cor(yPollutant$sulfate, yPollutant$nitrate))
        }
    }
    return(v)
}
The corr() function in the Johns Hopkins University R Programming course assignment makes the assumption that the specdata directory is a subdirectory of the current R working directory.
One way to retrieve the list of files from this subdirectory is to retrieve the current directory with getwd() and use this to build the full path to use as an argument to list.files().
## obtain a list of all the CSV files in directory
theList <- list.files(paste0(getwd(),"/",directory))
Another approach is to combine ./ with the directory argument.
theList <- list.files(paste0("./",directory))
Did I unzip specdata correctly?
Students are given instructions to download the data file required for the assignment, specdata.zip, and unzip it in the current R working directory. The zip file contains a directory, and the unzip process creates the directory and stores 332 pollution sensor data files in the subdirectory.
We can confirm the current R directory and whether specdata is a subdirectory from the working directory as follows.
# first, get working directory
getwd()
# second, confirm specdata is a subdirectory of this directory
dir.exists("./specdata")
...and the output:
> # first, get working directory
> getwd()
[1] "/Users/lgreski/gitrepos/datascience"
> # second, confirm specdata is a subdirectory of this directory
> dir.exists("./specdata")
[1] TRUE
>

How to use read.csv2.sql to read zip file without unzipping it?

I am trying to read a zip file without unzipping it in my directory while utilizing read.csv2.sql for specific row filtering.
The zip file can be downloaded here:
I have tried passing a file connection to read.csv2.sql, but it seems that it does not accept a connection as the "file" argument.
I have already installed the sqldf package on my machine.
Here is my R code for the issue described:
### Name the download file
zipFile <- "Dataset.zip"
### Download it
download.file("https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip",zipFile,mode="wb")
## Set up zip file directory
zip_dir <- paste0(workingDirectory,"/Dataset.zip")
### Establish link to "household_power_consumption.txt" inside zip file
data_file <- unz(zip_dir,"household_power_consumption.txt")
### Read file into loaded_df
loaded_df <- read.csv2.sql(data_file , sql="SELECT * FROM file WHERE Date='01/02/2007' OR Date='02/02/2007'",header=TRUE)
### Error Msg
### -Error in file(file) : invalid 'description' argument
This does not use read.csv2.sql, but since there are only ~2 million records in the file it should be possible to just download it, read it in using read.csv2, and then subset it in R.
# download file creating zipfile
u <-"https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip"
zipfile <- sub(".*%2F", "", u)
download.file(u, zipfile)
# extract fname from zipfile, read it into DF0 and subset it to DF
fname <- sub(".zip", ".txt", zipfile)
DF0 <- read.csv2(unz(zipfile, fname))
DF0$Date <- as.Date(DF0$Date, format = "%d/%m/%Y")
DF <- subset(DF0, Date == '2007-02-01' | Date == '2007-02-02')
# can optionally free up memory used by DF0
# rm(DF0)
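If you specifically want read.csv2.sql's SQL filtering, one possible workaround (a sketch, not part of the answer above) is to extract just the text file to a temporary directory first, because read.csv2.sql expects a file path rather than a connection. It reuses zipfile and fname from the code above.
library(sqldf)
td <- tempdir()
unzip(zipfile, files = fname, exdir = td)
# the WHERE clause compares raw strings, so the date literals must match
# exactly how dates are written in the file (e.g. "1/2/2007" vs "01/02/2007")
loaded_df <- read.csv2.sql(file.path(td, fname),
                           sql = "SELECT * FROM file WHERE Date = '1/2/2007' OR Date = '2/2/2007'",
                           header = TRUE)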

Automate zip file reading in R

I need to automate R to read a csv datafile that's into a zip file.
For example, I would type:
read.zip(file = "myfile.zip")
And internally, what would be done is:
Unzip myfile.zip to a temporary folder
Read the only file contained on it using read.csv
If there is more than one file in the zip file, an error is thrown.
My problem is getting the name of the file contained in the zip file, in order to provide it to the read.csv command. Does anyone know how to do it?
UPDATE
Here's the function I wrote based on @Paul's answer:
read.zip <- function(zipfile, row.names=NULL, dec=".") {
    # Create a name for the dir where we'll unzip
    zipdir <- tempfile()
    # Create the dir using that name
    dir.create(zipdir)
    # Unzip the file into the dir
    unzip(zipfile, exdir=zipdir)
    # Get the files in the dir
    files <- list.files(zipdir)
    # Throw an error if there's more than one
    if(length(files)>1) stop("More than one data file inside zip")
    # Get the full name of the file
    file <- paste(zipdir, files[1], sep="/")
    # Read the file
    read.csv(file, row.names=row.names, dec=dec)
}
Since I'll be working with more files inside the tempdir(), I created a new dir inside it, so I don't get confused with the files. I hope it may be useful!
Another solution using unz:
read.zip <- function(file, ...) {
    zipFileInfo <- unzip(file, list=TRUE)
    if(nrow(zipFileInfo) > 1)
        stop("More than one data file inside zip")
    else
        read.csv(unz(file, as.character(zipFileInfo$Name)), ...)
}
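A hypothetical call (the archive name is a placeholder); any extra arguments are passed straight through to read.csv():
dat <- read.zip("myfile.zip", stringsAsFactors = FALSE)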
You can use unzip to unzip the file; I just mention this as it is not clear from your question whether you knew that. As for reading the file: once you've extracted it to a temporary dir (?tempdir), just use list.files to find the files that were dumped into the temporary directory. In your case this is just one file, the file you need. Reading it using read.csv is then quite straightforward:
l <- list.files(temp_path, full.names = TRUE)
read.csv(l[1])
assuming your tempdir location is stored in temp_path.
I found this thread as I was trying to automate reading multiple csv files from a zip. I adapted the solution to the broader case. I haven't tested it for weird filenames or the like, but this is what worked for me so I thought I'd share:
read.csv.zip <- function(zipfile, ...) {
    # Create a name for the dir where we'll unzip
    zipdir <- tempfile()
    # Create the dir using that name
    dir.create(zipdir)
    # Unzip the file into the dir
    unzip(zipfile, exdir=zipdir)
    # Get a list of csv files in the dir
    files <- list.files(zipdir)
    files <- files[grep("\\.csv$", files)]
    # Create a list of the imported csv files
    csv.data <- sapply(files, function(f) {
        fp <- file.path(zipdir, f)
        return(read.csv(fp, ...))
    })
    return(csv.data)
}
If you have zcat installed on your system (which is the case for Linux, macOS, and Cygwin) you could also use:
zipfile <- "test.zip"
myData <- read.delim(pipe(paste("zcat", zipfile)))
This solution also has the advantage that no temporary files are created.
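If zcat is not available, base R's unz() connection gives a similar no-temporary-file approach (a sketch; it reads the first entry of the archive, whatever its name):
zipfile <- "test.zip"
inner <- unzip(zipfile, list = TRUE)$Name[1]  # name of the first entry inside the zip
myData <- read.delim(unz(zipfile, inner))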
Here is an approach I am using that is based heavily on @Corned Beef Hash Map's answer. Here are some of the changes I made:
My approach makes use of the data.table package's fread(), which
can be fast (generally, if it's zipped, sizes might be large, so you
stand to gain a lot of speed here!).
I also adjusted the output format so that it is a named list, where
each element of the list is named after the file. For me, this was a
very useful addition.
Instead of using regular expressions to sift through the files
grabbed by list.files, I make use of list.files()'s pattern
argument.
Finally, by relying on fread() and by making pattern an
argument to which you could supply something like "" or NULL or
".", you can use this to read in many types of data files; in fact,
you can read in multiple types at once (if your .zip contains
.csv and .txt files and you want both, e.g.). If there are only some types of
files you want, you can specify the pattern to only use those, too.
Here is the actual function:
# fread() comes from the data.table package
library(data.table)
read.csv.zip <- function(zipfile, pattern="\\.csv$", ...){
    # Create a name for the dir where we'll unzip
    zipdir <- tempfile()
    # Create the dir using that name
    dir.create(zipdir)
    # Unzip the file into the dir
    unzip(zipfile, exdir=zipdir)
    # Get a list of csv files in the dir
    files <- list.files(zipdir, recursive=TRUE, pattern=pattern)
    # Create a list of the imported csv files
    csv.data <- sapply(files,
                       function(f){
                           fp <- file.path(zipdir, f)
                           dat <- fread(fp, ...)
                           return(dat)
                       })
    # Use csv names to name list elements
    names(csv.data) <- basename(files)
    # Return data
    return(csv.data)
}
The following refines the above answers. FUN could be read.csv, cat, or anything you like, provided its first argument accepts a file path. E.g.
head(read.zip.url("http://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/Downloads/ICD-9-CM-v32-master-descriptions.zip", filename = "CMS32_DESC_LONG_DX.txt"))
read.zip.url <- function(url, filename = NULL, FUN = readLines, ...) {
    zipfile <- tempfile()
    download.file(url = url, destfile = zipfile, quiet = TRUE)
    zipdir <- tempfile()
    dir.create(zipdir)
    unzip(zipfile, exdir = zipdir) # files="" so extract all
    files <- list.files(zipdir)
    if (is.null(filename)) {
        if (length(files) == 1) {
            filename <- files
        } else {
            stop("multiple files in zip, but no filename specified: ", paste(files, collapse = ", "))
        }
    } else { # filename specified
        stopifnot(length(filename) == 1)
        stopifnot(filename %in% files)
    }
    do.call(FUN, args = c(list(file.path(zipdir, filename)), list(...)))
}
Another approach that uses fread from the data.table package
# fread() comes from the data.table package
library(data.table)
fread.zip <- function(zipfile, ...) {
    # Function reads data from a zipped csv file
    # Uses fread from the data.table package
    ## Create the temporary directory or flush CSVs if it exists already
    if (!file.exists(tempdir())) {
        dir.create(tempdir())
    } else {
        file.remove(list.files(tempdir(), full.names = TRUE, pattern = "\\.csv$"))
    }
    ## Unzip the file into the dir
    unzip(zipfile, exdir = tempdir())
    ## Get path to file
    file <- list.files(tempdir(), pattern = "\\.csv$", full.names = TRUE)
    ## Throw an error if there's more than one
    if (length(file) > 1) stop("More than one data file inside zip")
    ## Read the file
    fread(file,
          na.strings = c(""), # read empty strings as NA
          ...)
}
Based on the answer/update by @joão-daniel
# unzipped file location
outDir <- "~/Documents/unzipFolder"
# get all the zip files
zipF <- list.files(path = "~/Documents/", pattern = "*.zip", full.names = TRUE)
# unzip all your files
purrr::map(.x = zipF, .f = unzip, exdir = outDir)
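Once everything is extracted, the same purrr pattern can be used to read the results back in (a sketch, assuming the archives contained CSV files):
csvF <- list.files(path = outDir, pattern = "\\.csv$", full.names = TRUE, recursive = TRUE)
allData <- purrr::map(csvF, read.csv)
names(allData) <- basename(csvF)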
I just wrote a function based on the top read.zip answer that may help...
read.zip <- function(zipfile, internalfile=NA, read.function=read.delim, verbose=TRUE, ...) {
    # function based on http://stackoverflow.com/questions/8986818/automate-zip-file-reading-in-r
    # check the files within zip
    unzfiles <- unzip(zipfile, list=TRUE)
    if (is.na(internalfile) || is.numeric(internalfile)) {
        internalfile <- unzfiles$Name[ifelse(is.na(internalfile), 1, internalfile[1])]
    }
    # Create a name for the dir where we'll unzip
    zipdir <- tempfile()
    # Create the dir using that name
    if (verbose) cat("Directory created:", zipdir, "\n")
    dir.create(zipdir)
    # Unzip the file into the dir
    if (verbose) cat("Unzipping file:", internalfile, "...")
    unzip(zipfile, files=internalfile, exdir=zipdir)
    if (verbose) cat("Done!\n")
    # Get the full name of the file
    file <- paste(zipdir, internalfile, sep="/")
    if (verbose)
        on.exit({
            cat("Done!\nRemoving temporary files:", file, ".\n")
            file.remove(file)
            file.remove(zipdir)
        })
    else
        on.exit({file.remove(file); file.remove(zipdir)})
    # Read the file
    if (verbose) cat("Reading File...")
    read.function(file, ...)
}

Recursively ftp download, then extract gz files

I have a multiple-step file download process I would like to do within R. I have got the middle step, but not the first and third...
# STEP 1 Recursively find all the files at an ftp site
# ftp://prism.oregonstate.edu//pub/prism/pacisl/grids
all_paths <- #### a recursive listing of the ftp path contents??? ####
# STEP 2 Choose all the ones whose filename starts with "hi"
all_files <- sapply(sapply(strsplit(all_paths, "/"), rev), "[", 1)
hawaii_log <- substr(all_files, 1, 2) == "hi"
hi_paths <- all_paths[hawaii_log]
hi_files <- all_files[hawaii_log]
# STEP 3 Download & extract from gz format into a single directory
mapply(download.file, url = hi_paths, destfile = hi_files)
## and now how to extract from gz format?
For part 1, RCurl might be helpful. The getURL function retrieves one or more URLs; dirlistonly lists the contents of the directory without retrieving the files. The rest of the function builds the URLs for the next level down:
library(RCurl)
getContent <- function(dirs) {
    urls <- paste(dirs, "/", sep="")
    fls <- strsplit(getURL(urls, dirlistonly=TRUE), "\r?\n")
    ok <- sapply(fls, length) > 0
    unlist(mapply(paste, urls[ok], fls[ok], sep="", SIMPLIFY=FALSE),
           use.names=FALSE)
}
So starting with
dirs <- "ftp://prism.oregonstate.edu//pub/prism/pacisl/grids"
we can invoke this function and look for things that look like directories, continuing until done
fls <- character()
while (length(dirs)) {
    message(length(dirs))
    urls <- getContent(dirs)
    isgz <- grepl("gz$", urls)
    fls <- append(fls, urls[isgz])
    dirs <- urls[!isgz]
}
we could then use getURL again, but this time on fls (or on elements of fls, in a loop) to retrieve the actual files. Or, maybe better, open a URL connection and use gzcon to decompress and process the file on the fly, along the lines of
con <- gzcon(url(fls[1], "r"))
meta <- readLines(con, 7)
data <- scan(con, integer())
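If, for step 3, you would rather have the decompressed files written into a single directory than read them straight from the connection, a minimal sketch (assuming the R.utils package; the "grids" destination directory is just a placeholder) could be:
library(R.utils)
dir.create("grids", showWarnings = FALSE)
for (u in fls) {
    gz <- file.path("grids", basename(u))
    download.file(u, gz, mode = "wb")
    # writes the uncompressed file next to the .gz and removes the .gz by default
    gunzip(gz, overwrite = TRUE)
}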
I can read the contents of the ftp page if I start R with the internet2 option. I.e.
C:\Program Files\R\R-2.12\bin\x64\Rgui.exe --internet2
(The shortcut used to start R on Windows can be modified to add the --internet2 argument: right-click / Properties / Target, or just run that at the command line. This isn't needed on GNU/Linux.)
The text on that page can be read like this:
download.file("ftp://prism.oregonstate.edu//pub/prism/pacisl/grids", "f.txt")
txt <- readLines("f.txt")
It's a little more work to parse out the Directory listings, then read them recursively for the underlying files.
## (something like)
dirlines <- txt[grep("Directory <A HREF=", txt)]
## split and extract text after "grids/"
split1 <- sapply(strsplit(dirlines, "grids/"), function(x) rev(x)[1])
## split and extract remaining text after "/"
sapply(strsplit(split1, "/"), function(x) x[1])
[1] "dem" "ppt" "tdmean" "tmax" "tmin"
It's about here that this stops seeming very attractive and gets a bit laborious, so I would actually recommend a different option. There is no doubt a better solution, perhaps with RCurl, but I would recommend learning to use an FTP client for you and your users. Command-line ftp, anonymous logins, and mget all work pretty easily.
The internet2 option was explained for a similar ftp site here:
https://stat.ethz.ch/pipermail/r-help/2009-January/184647.html
ftp.root <- ...      # where the files are
dropbox.root <- ...  # where to put the files
#=====================================================================
# Function that downloads files from URL
#=====================================================================
fdownload <- function(sourcelink) {
    targetlink <- paste(dropbox.root, substr(sourcelink, nchar(ftp.root)+1,
                                             nchar(sourcelink)), sep = '')
    # list of contents
    filenames <- getURL(sourcelink, ftp.use.epsv = FALSE, dirlistonly = TRUE)
    filenames <- strsplit(filenames, "\n")
    filenames <- unlist(filenames)
    files <- filenames[grep('\\.', filenames)]
    dirs <- setdiff(filenames, files)
    if (length(dirs) != 0) {
        dirs <- paste(sourcelink, dirs, '/', sep = '')
    }
    # files
    for (filename in files) {
        sourcefile <- paste(sourcelink, filename, sep = '')
        targetfile <- paste(targetlink, filename, sep = '')
        download.file(sourcefile, targetfile)
    }
    # subfolders
    for (dirname in dirs) {
        fdownload(dirname)
    }
}
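A hypothetical invocation (the FTP path comes from the question; the local destination is an assumption, and nested target folders may need to be created before downloading into them):
library(RCurl)  # provides getURL()
ftp.root     <- "ftp://prism.oregonstate.edu//pub/prism/pacisl/grids/"
dropbox.root <- "~/prism-grids/"
dir.create(dropbox.root, showWarnings = FALSE)
fdownload(ftp.root)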
