I am currently working through Coursera's R Programming course and have hit a bit of a snag with this assignment. I have been getting various errors (that I'm not totally sure I've nailed down), but this is a new one, and no matter what I do I can't seem to shake it.
Whenever I run the below code it comes back with
Error in file(file, "rt") : cannot open the connection
pollutantmean <- function(directory, pollutant, id) {
    files <- list.files(path = directory, "/", full.names = TRUE)
    dat <- data.frame()
    dat <- sapply(file = directory, "/", read.csv)
    mean(dat["pollutant"], na.rm = TRUE)
}
I have tried numerous different solutions posted here on SO for this issue, but none of them has worked. I made sure that I am running the code after setting the working directory to the folder with all of the CSV files, and I can see all of the files in the file pane. I have also moved that working directory around a few times, since some of the suggestions were to put it on the desktop, etc., but none of that has worked. I am currently running RStudio as an admin, but that does not seem to have done anything, and I have also modified the permissions on the specdata folder to ensure there are no weird restrictions there. Any help is appreciated.
Here are two possible implementations:
# list all files in "directory", read them, combine and then take mean of "pollutant" column
pollutantmean_1 <- function (directory){
files <- list.files(path = directory, full.names = TRUE)
dat <- lapply(file = directory, read.csv)
dat <- data.table::rbindlist(dat) |> as.data.frame()
mean(dat[, 'pollutant' ], na.rm = TRUE)
}
# list all files in "directory", read them, take the mean of "pollutant" column for each file and return them
pollutantmean_2 <- function (directory){
files <- list.files(path = directory, full.names = TRUE)
dat <- lapply(file = directory, read.csv)
pollutant_means <- sapply(dat, function(x) mean(x[ , 'pollutant' ], na.rm = TRUE))
names(pollutant_means) <- basename(files)
pollutant_means
}
I am trying to read multiple Excel files located in different folders using R.
Here is my solution:
setwd("D:/data")
filename <- list.files(getwd(),full.names = TRUE)
# Four folders "epdata1" "epdata2" "epdata3" "epdata4" were inside the folder "data"
dataname <- list.files(filename,pattern="*.xlsx$",full.names = TRUE)
# Every folder in the folder "data" contains five excel files
datalist <- lapply(dataname,read_xlsx)
Error: `path` does not exist:'D:/data/epidata1/出院舱随访1.xlsx'
But read_xlsx runs successfully on its own:
read_xlsx("D:/data/epidata1/出院舱随访1.xlsx")
All the files are present under the "data" folder, so why does R fail to read those Excel files?
Your help will be much appreciated!
I don't see any reason why your code shouldn't work. Make sure your folder names are correct: in your comments you write "epdata1", but your error says "epidata1".
I tried it with some CSV and mixed xlsx files.
This is, again, what I would come up with to find the error/typo:
library(readxl)

pp <- function(...) { print(paste(...)) }

main <- function() {
    # find / set up the main data folder
    # You may change this to your needs
    main_dir <- paste0(getwd(), "/data/")
    pp("working directory:", main_dir)
    pp("Found following folders:")
    pp(list.files(main_dir, full.names = FALSE))
    data_folders <- list.files(main_dir, full.names = TRUE)
    pp("Found these files in folders:", list.files(data_folders, full.names = TRUE))
    pp("Filtering *.xlsx files:", list.files(data_folders, pattern = "\\.xlsx$", full.names = TRUE))
    files <- list.files(data_folders, pattern = "\\.xlsx$", full.names = TRUE)
    datalist <- lapply(files, read_xlsx)
    print(datalist)
}
main()
I tried to execute this function to find the correlation between two elements in the data files of the directory "specdata", by passing corr("specdata") at the command line. But it shows "Error in list.files(directory, full.names = TRUE) : invalid 'path' argument". I checked the current working directory and it was correct. Any ideas about the reason for this error?
corr <- function(directory, threshold = 0) {
    files_all <- list.files(directory, full.names = TRUE)
    v <- vector(mode = "numeric", length = 0)
    for (i in 1:length(files_all)) {
        individual <- read.csv(files_all[i], header = TRUE)
        nobs <- sum(complete.cases(individual))
        if (nobs > threshold) {
            xSulfate <- individual[which(!is.na(individual$sulfate)), ]
            yPollutant <- xSulfate[which(!is.na(xSulfate$nitrate)), ]
            v <- c(v, cor(yPollutant$sulfate, yPollutant$nitrate))
        }
    }
    return(v)
}
The corr() function in the Johns Hopkins University R Programming course assignment makes the assumption that the specdata directory is a subdirectory of the current R working directory.
One way to retrieve the list of files from this subdirectory is to retrieve the current directory with getwd() and use this to build the full path to use as an argument to list.files().
## obtain a list of all the CSV files in directory
theList <- list.files(paste0(getwd(),"/",directory))
Another approach is to combine ./ with the directory argument.
theList <- list.files(paste0("./",directory))
Did I unzip specdata correctly?
Students are given instructions to download the data file required for the assignment, specdata.zip and unzip it in the current R working directory. The zip file contains a directory, and the unzip process creates the directory and stores 332 pollution sensor data files in the subdirectory.
We can confirm the current R directory and whether specdata is a subdirectory from the working directory as follows.
# first, get working directory
getwd()
# second, confirm specdata is a subdirectory of this directory
dir.exists("./specdata")
...and the output:
> # first, get working directory
> getwd()
[1] "/Users/lgreski/gitrepos/datascience"
> # second, confirm specdata is a subdirectory of this directory
> dir.exists("./specdata")
[1] TRUE
>
I have a directory with a list of folders, each of which contains a folder named "ABC". Each "ABC" folder holds '.xlsm' files. I want to use R code to read the '.xlsm' files in every "ABC" folder across these different subdirectories.
Thank you for your help
If you already know the paths to each file, then simply use read_excel from the readxl package:
library(readxl)
mydata <- read_excel("ABC/myfile.xlsm")
If you first need to get the paths to each file, you can use a system command (I'm on Ubuntu 18.04) to find all of the paths and store them in a vector. You can then import them one at a time:
myshellcommand <- "find /path/to/top/directory -path '*/ABC/*' -type d"
mypaths <- system(command = myshellcommand, intern = TRUE)
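From there, a short sketch of my own (assuming mypaths now holds the directories found above, and readxl is loaded as earlier) to list and read every .xlsm file in them:
# list the .xlsm files inside each found directory, then read them all
myfiles <- unlist(lapply(mypaths, list.files,
                         pattern = "\\.xlsm$", full.names = TRUE))
mydata_list <- lapply(myfiles, read_excel)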
Because of your directory requirements, one method for finding all of the files can be a double list.files:
ld <- list.files(pattern="^ABC$", include.dirs=TRUE, recursive=TRUE, full.names=TRUE)
lf <- list.files(ld, pattern="\\.xlsm$", ignore.case=TRUE, recursive=TRUE, full.names=TRUE)
To read them all into a list (good ref for dealing with a list-of-frames: http://stackoverflow.com/a/24376207/3358272):
lstdf <- sapply(lf, read_excel, simplify=FALSE)
This defaults to opening the first sheet in each workbook. Other options in readxl::read_excel that might be useful: sheet=, range=, skip=, n_max=.
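If the workbooks happen to share the same columns, one way to collapse the named list into a single frame (my addition, not part of the answer) is dplyr::bind_rows():
# stack the named list into one data frame; the list names (file paths)
# become an identifier column
library(dplyr)
combined <- bind_rows(lstdf, .id = "source_file")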
Given a list of *.xlsm files in your working directory you can do the following:
list.files(
path = getwd(),
pattern = glob2rx(pattern = "*.xlsm"),
full.names = TRUE,
recursive = TRUE
) -> files_to_read
lst_dta <- lapply(
X = files_to_read,
FUN = function(x) {
cat("Reading:", x, fill = TRUE)
openxlsx::read.xlsx(xlsxFile = x)
}
)
Results
Given two files, one with columns A, B and the other with columns C, D, the generated list corresponds to:
>> lst_dta
[[1]]
C D
1 3 4
[[2]]
A B
1 1 2
Notes
This will read all .xlsm files found in the directory tree starting from getwd().
openxlsx is efficient due to its use of Rcpp. If you are going to be handling a substantial number of MS Excel files, this package is worth exploring, IMHO.
Edit
As pointed out by @r2evans in the comments, you may want to read *.xlsm files that reside only within an ABC folder, ignoring *.xlsm files outside it. You could filter your files vector in the following manner:
grep(pattern = "ABC", x = files_to_read, value = TRUE)
Unlikely as it is, if you have *.xlsm files whose names contain the string ABC but that are saved outside an ABC folder, you may get extra matches.
I need to automate R to read a csv datafile that's into a zip file.
For example, I would type:
read.zip(file = "myfile.zip")
And internally, what would be done is:
Unzip myfile.zip to a temporary folder
Read the only file contained in it using read.csv
If there is more than one file in the zip file, an error is thrown.
My problem is getting the name of the file contained in the zip file, in order to provide it to the read.csv command. Does anyone know how to do it?
UPDATE
Here's the function I wrote based on @Paul's answer:
read.zip <- function(zipfile, row.names = NULL, dec = ".") {
    # Create a name for the dir where we'll unzip
    zipdir <- tempfile()
    # Create the dir using that name
    dir.create(zipdir)
    # Unzip the file into the dir
    unzip(zipfile, exdir = zipdir)
    # Get the files in the dir
    files <- list.files(zipdir)
    # Throw an error if there's more than one
    if (length(files) > 1) stop("More than one data file inside zip")
    # Get the full name of the file
    file <- paste(zipdir, files[1], sep = "/")
    # Read the file, passing the arguments by name so read.csv
    # does not match them positionally
    read.csv(file, row.names = row.names, dec = dec)
}
Since I'll be working with more files inside the tempdir(), I created a new dir inside it, so I don't get confused with the files. I hope it may be useful!
Another solution using unz:
read.zip <- function(file, ...) {
    zipFileInfo <- unzip(file, list = TRUE)
    if (nrow(zipFileInfo) > 1)
        stop("More than one data file inside zip")
    else
        read.csv(unz(file, as.character(zipFileInfo$Name)), ...)
}
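A hypothetical call, assuming myfile.zip contains exactly one CSV; extra arguments pass through to read.csv:
dat <- read.zip("myfile.zip", stringsAsFactors = FALSE)
head(dat)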
You can use unzip to unzip the file; I just mention this as it is not clear from your question whether you knew that. In regard to reading the file: once you've extracted the file to a temporary dir (?tempdir), just use list.files to find the files that were dumped into the temporary directory. In your case this is just one file, the file you need. Reading it using read.csv is then quite straightforward:
l <- list.files(temp_path, full.names = TRUE)
read.csv(l[1])
assuming your tempdir location is stored in temp_path.
I found this thread as I was trying to automate reading multiple csv files from a zip. I adapted the solution to the broader case. I haven't tested it for weird filenames or the like, but this is what worked for me so I thought I'd share:
read.csv.zip <- function(zipfile, ...) {
    # Create a name for the dir where we'll unzip
    zipdir <- tempfile()
    # Create the dir using that name
    dir.create(zipdir)
    # Unzip the file into the dir
    unzip(zipfile, exdir = zipdir)
    # Get a list of csv files in the dir
    files <- list.files(zipdir)
    files <- files[grep("\\.csv$", files)]
    # Create a list of the imported csv files; simplify = FALSE keeps
    # the result a named list rather than letting sapply collapse it
    csv.data <- sapply(files, function(f) {
        fp <- file.path(zipdir, f)
        return(read.csv(fp, ...))
    }, simplify = FALSE)
    return(csv.data)
}
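A hypothetical call, assuming archive.zip holds one or more CSVs:
# returns a list with one element per CSV found in the archive
csv.data <- read.csv.zip("archive.zip", stringsAsFactors = FALSE)
names(csv.data)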
If you have zcat installed on your system (which is the case for Linux, macOS, and Cygwin), you could also use:
zipfile <- "test.zip"
myData <- read.delim(pipe(paste("zcat", zipfile)))
This solution also has the advantage that no temporary files are created.
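On systems where zcat only accepts gzip streams rather than .zip archives, the same temporary-file-free pipe can be built with unzip -p (a variant I am adding, not part of the original answer):
# unzip -p streams the archive's members to stdout instead of extracting
zipfile <- "test.zip"
myData <- read.delim(pipe(paste("unzip -p", zipfile)))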
Here is an approach I am using that is based heavily on @Corned Beef Hash Map's answer. Here are some of the changes I made:
My approach makes use of the data.table package's fread(), which can be fast (generally, if it's zipped, sizes might be large, so you stand to gain a lot of speed here!).
I also adjusted the output format so that it is a named list, where each element of the list is named after the file. For me, this was a very useful addition.
Instead of using regular expressions to sift through the files grabbed by list.files, I make use of list.files()'s pattern argument.
Finally, by relying on fread() and by making pattern an argument to which you could supply something like "" or NULL or ".", you can use this to read in many types of data files; in fact, you can read in multiple types at once (if your .zip contains .csv and .txt files and you want both, e.g.). If there are only some types of files you want, you can specify the pattern to only use those, too.
Here is the actual function:
library(data.table)

read.csv.zip <- function(zipfile, pattern = "\\.csv$", ...) {
    # Create a name for the dir where we'll unzip
    zipdir <- tempfile()
    # Create the dir using that name
    dir.create(zipdir)
    # Unzip the file into the dir
    unzip(zipfile, exdir = zipdir)
    # Get a list of files in the dir that match the pattern
    files <- list.files(zipdir, recursive = TRUE, pattern = pattern)
    # Create a list of the imported files; simplify = FALSE keeps
    # the result a list even when sapply could collapse it
    csv.data <- sapply(files,
                       function(f) {
                           fp <- file.path(zipdir, f)
                           dat <- fread(fp, ...)
                           return(dat)
                       },
                       simplify = FALSE
    )
    # Use the file names to name the list elements
    names(csv.data) <- basename(files)
    # Return data
    return(csv.data)
}
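A hypothetical call, reading both .csv and .txt members by widening the pattern:
# read every .csv and .txt file inside the archive into a named list
all_data <- read.csv.zip("archive.zip", pattern = "\\.(csv|txt)$")
names(all_data)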
The following refines the above answers. FUN could be read.csv, cat, or anything you like, provided its first argument will accept a file path. E.g.
head(read.zip.url("http://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/Downloads/ICD-9-CM-v32-master-descriptions.zip", filename = "CMS32_DESC_LONG_DX.txt"))
read.zip.url <- function(url, filename = NULL, FUN = readLines, ...) {
    zipfile <- tempfile()
    download.file(url = url, destfile = zipfile, quiet = TRUE)
    zipdir <- tempfile()
    dir.create(zipdir)
    unzip(zipfile, exdir = zipdir)  # extract all files
    files <- list.files(zipdir)
    if (is.null(filename)) {
        if (length(files) == 1) {
            filename <- files
        } else {
            stop("multiple files in zip, but no filename specified: ",
                 paste(files, collapse = ", "))
        }
    } else {  # filename specified
        stopifnot(length(filename) == 1)
        stopifnot(filename %in% files)
    }
    do.call(FUN, args = c(list(file.path(zipdir, filename)), list(...)))
}
Another approach uses fread from the data.table package:
library(data.table)

fread.zip <- function(zipfile, ...) {
    # Function reads data from a zipped csv file
    # Uses fread from the data.table package

    ## Create the temporary directory or flush CSVs if it exists already
    if (!file.exists(tempdir())) {
        dir.create(tempdir())
    } else {
        file.remove(list.files(tempdir(), full.names = TRUE, pattern = "\\.csv$"))
    }

    ## Unzip the file into the dir
    unzip(zipfile, exdir = tempdir())

    ## Get the path to the file
    file <- list.files(tempdir(), pattern = "\\.csv$", full.names = TRUE)

    ## Throw an error if there's more than one
    if (length(file) > 1) stop("More than one data file inside zip")

    ## Read the file
    fread(file,
          na.strings = c(""),  # read empty strings as NA
          ...
    )
}
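A hypothetical call, assuming data.zip contains exactly one CSV:
dt <- fread.zip("data.zip")
str(dt)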
Based on the answer/update by @joão-daniel:
# unzipped file location
outDir <- "~/Documents/unzipFolder"

# get all the zip files
zipF <- list.files(path = "~/Documents/", pattern = "\\.zip$", full.names = TRUE)

# unzip all your files
purrr::map(.x = zipF, .f = unzip, exdir = outDir)
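A short follow-up sketch (my addition, assuming the archives contained CSV files) that reads everything that landed in outDir:
# read every extracted CSV from outDir, recursing into subfolders
csv_files <- list.files(outDir, pattern = "\\.csv$",
                        full.names = TRUE, recursive = TRUE)
dat_list <- lapply(csv_files, read.csv)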
I just wrote a function based on the top read.zip answer that may help...
read.zip <- function(zipfile, internalfile = NA, read.function = read.delim,
                     verbose = TRUE, ...) {
    # function based on http://stackoverflow.com/questions/8986818/automate-zip-file-reading-in-r
    # check the files within the zip
    unzfiles <- unzip(zipfile, list = TRUE)
    if (is.na(internalfile) || is.numeric(internalfile)) {
        internalfile <- unzfiles$Name[ifelse(is.na(internalfile), 1, internalfile[1])]
    }
    # Create a name for the dir where we'll unzip
    zipdir <- tempfile()
    # Create the dir using that name
    if (verbose) cat("Directory created:", zipdir, "\n")
    dir.create(zipdir)
    # Unzip the file into the dir
    if (verbose) cat("Unzipping file:", internalfile, "...")
    unzip(zipfile, files = internalfile, exdir = zipdir)
    if (verbose) cat("Done!\n")
    # Get the full name of the file
    file <- file.path(zipdir, internalfile)
    if (verbose)
        on.exit({
            cat("Done!\nRemoving temporary files:", file, "\n")
            file.remove(file)
            unlink(zipdir, recursive = TRUE)
        })
    else
        on.exit({
            file.remove(file)
            unlink(zipdir, recursive = TRUE)
        })
    # Read the file
    if (verbose) cat("Reading file...")
    read.function(file, ...)
}
}