Reading first 10 folders in directory

I have been working on a dataset of folders and subfolders (folder -> subfolder -> file).
I have trouble reading the first 10 folders of data. I have used the code below, but it doesn't work. Please help:
> for(i in seq_along(my_folders)){
+ my_data[[[i]]] = list.files(path = "~/dataset1", recursive = TRUE)
Below is the problem I have reading a txt file in a subfolder:
> for(i in 1:13){
+ current_dir = dirs[i]
+ lines = readLines(mydata[[i]])}
This gives error: Error in file(con, "r") : invalid 'description' argument
But outside of the loop this works:
> lines <- readLines(my_data[[1]])

What do you think of this:
dirs = list.dirs(recursive = FALSE) # list all directories/folders
mydata = list() # create an empty list
for (i in 1:10) { # only take the first 10 directories
  current_dir = dirs[i]
  mydata[[i]] = list.files(path = file.path("~/dataset1", current_dir), recursive = TRUE)
}
You only have to adapt this to your folder structure.
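If you also want to read the files afterwards (the second problem in the question), it can help to list them with full.names = TRUE so that readLines() gets complete paths rather than names relative to some other directory. A minimal sketch, assuming the same ~/dataset1 layout with at least 10 subfolders:
dirs <- list.dirs("~/dataset1", recursive = FALSE)
mydata <- list()
for (i in 1:10) {
  # full.names = TRUE returns complete paths, which readLines() needs
  files <- list.files(dirs[i], recursive = TRUE, full.names = TRUE)
  # read every file in this directory into a list of character vectors
  mydata[[i]] <- lapply(files, readLines)
}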

@sequoia's answer works, but in R you can take advantage of concise functional programming, which @langtang's answer gets at with lapply(). Try this one-liner:
library(tidyverse)
library(fs)
d <- dir_ls("path/to/folders", recurse = TRUE, type = "file") %>% map(read_lines)

Use dir to get a vector of file names, for example all .txt files in folder "f" and all its subfolders:
files = dir("f", pattern = "\\.txt$", full.names = TRUE, recursive = TRUE)
files
[1] "f/f1/f1_1/f1_1.txt"
[2] "f/f1/f1_2/f1_2.txt"
[3] "f/f2/f2_1/f2_1.txt"
[4] "f/f2/f2_2/f2_2.txt"
Then read them using readLines
lapply(files, readLines)

Related

Issues with a loop combining different data types in one column

I have more than 1000 csv files. I would like to combine them into a single file after running some processes, so I used a for loop as follows:
> setwd("C:/....") files <- dir(".", pattern = ".csv$") # Get the names
> of the all csv files in the current directory.
>
> for (i in 1:length(files)) { obj_name <- files %>% str_sub(end = -5)
> assign(obj_name[i], read_csv(files[i])) }
Until here, it works well.
I tried to concatenate the imported files into a list to manipulate them all at once, as follows:
command <- paste0("RawList <- list(", paste(obj_name, collapse = ","), ")")
eval(parse(text = command))

rm(i, obj_name, command, list = ls(pattern = "^g20"))
Ref_com_list = list()
Up to here, it's still okay. But ...
for (i in 1:length(RawList)) {
  df <- RawList[[i]] %>%
    pivot_longer(cols = -A, names_to = "B", values_to = "C") %>%
    mutate(time_sec = paste(YMD[i], B) %>% ymd_hms()) %>%
    mutate(minute = format(as.POSIXct(B, format = "%H:%M:%S"), "%M"))

  ...(some calculation)

  Ref_com_list[[i]] <- file_all
}

Ref_com_all <- do.call(rbind, Ref_com_list)
At that point, I got the following error:
Error: Can't combine `A` <double> and `B` <datetime<UTC>>.
Run `rlang::last_error()` to see where the error occurred.
If I run an individual file, it works well, but in the for loop the error shows up.
Could anyone tell me what the problem is?
Thanks a lot in advance.
There is substantial scope for improvement in your code. Broadly speaking, if you are working in the tidyverse you can pass multiple files to read_csv directly. Example:
# Generate some sample files
tmp_dir <- fs::path_temp("some_csv_files")
fs::dir_create(tmp_dir)
for (i in 1:100) {
  readr::write_csv(mtcars, fs::file_temp(pattern = "cars",
                                         tmp_dir = tmp_dir, ext = ".csv"))
}

# Actual file reading
dta_cars <- readr::read_csv(
  file = fs::dir_ls(path = tmp_dir, glob = "*.csv"),
  id = "file_path"
)
If you want to keep information on the file of origin, using id = "file_path" in read_csv will store the path details in a column. This is arguably more efficient and less error-prone than:
for (i in 1:length(files)) {
  obj_name <- files %>% str_sub(end = -5)
  assign(obj_name[i], read_csv(files[i]))
}
This is much cleaner and will be faster than growing an object in a loop. Afterwards you can proceed with your transformations:
dta_cars %>% ...
try:
library(data.table)
files <- list.files(path = '.', full.names=T, pattern='csv')
files_open <- lapply(files, function(x) fread(x, ...)) # ... for arguments like sep, dec, etc...
big_file <- rbindlist(files_open)
fwrite(big_file, ...) # ... for arguments like sep, dec, path to save data, etc...
Now I've found out why it happened: there was another file with a different name but the same file type, so the code read all the files and produced the error.
I am sorry I made you all confused.
Thank you so much!
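For future readers: one way to guard against stray files of the same type is a stricter pattern in dir(). A sketch, assuming (as the ls(pattern = "^g20") call above suggests) that the wanted file names start with g20:
# only csv files whose names start with g20
files <- dir(".", pattern = "^g20.*\\.csv$")
length(files) # sanity check: should match the expected file count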

Specifying pathname in map_dfr

The structure of my directory is as follows:
Extant_Data -> Data -> Raw -> course_enrollment
                           -> frpm
I have a few different functions to read in some text files and Excel files respectively.
read_fun = function(path){
  test = read.delim(path, sep="\t", header=TRUE, fill = TRUE, colClasses = c(rep("character",23)))
  test
}
read_fun_frpm= function(path){
  test = read_excel(path, sheet = 2, col_names = frpm_names)
}
I feed this into map_dfr so that the function reads in each of the files and rowbinds them.
allfiles = list.files(path = "Extant_Data/Data/Raw/course_enrollment",
pattern = "CourseEnrollment.txt",
full.names=FALSE,
recursive = T)
# Rowbind all the course enrollment data
# !!! BUT I HAVE set the working directory to a subdirectory so that it finds those files
setwd("/Extant_Data/Data/Raw/course_enrollment")
course_combined <- map_dfr(allfiles,read_fun)
allfiles = list.files(path = "Extant_Data/Data/Raw/frpm/post12",
pattern = "frpm*",
full.names=FALSE,
recursive = T)
# Rowbind all the course enrollment data
# !!!I have to change the directory AGAIN
setwd(""Extant_Data/Data/Raw/frpm/post12")
frpm_combined <- map_dfr(allfiles,read_fun_frpm)
As mentioned in the comments, I have to keep changing the working directory so that map_dfr can locate the files. I don't think this is best practice; how might I work around it so I don't have to keep changing the directory? Any suggestions appreciated. Sorry, it's hard to provide a reproducible example.
Note: This throws an error.
frpm_combined <- map_dfr(allfiles,read_fun_frpm('Extant_Data/Data/Raw/frpm/post12'))
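One possible workaround (an untested sketch, using the paths from the question) is to have list.files() return full paths, so no setwd() is needed at all:
library(purrr)
# full.names = TRUE returns paths relative to the project root,
# so read_fun can open the files without changing directory
allfiles <- list.files(path = "Extant_Data/Data/Raw/course_enrollment",
                       pattern = "CourseEnrollment.txt",
                       full.names = TRUE,
                       recursive = TRUE)
course_combined <- map_dfr(allfiles, read_fun)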

How to get the directory of the executing script in R? [duplicate]

This question already has answers here:
Determine path of the executing script
(30 answers)
Closed 4 years ago.
I have a function that looks like this:
read_data <- function(filename, header) {
path <- paste("./output/", filename, sep = "")
if (file.exists(path)) {
data <- read.csv(file = path, header = header, sep = ",")
}
# Partially removed for brevity.
}
What I want to achieve is that, given a filename, I want to check whether that file is available inside the output subdirectory, which is a subdirectory of the directory where my script is located, and if it is, read it. The problem is that, as far as I know, read.csv's file argument requires a full path to the file. So I somehow need to get the directory where my script is located, so I can concatenate it with the rest of the subdirectory and filename. I can get the current working directory with getwd(), but that's not quite the same thing: my working directory always seems to be fixed, whereas the script can be located anywhere on the computer. Any ideas how to get the directory of the script and concatenate it with the output subdirectory and the provided filename in R?
If you want to determine the directory of the executing script, this might be a dup: Rscript: Determine path of the executing script
initial.options <- commandArgs(trailingOnly = FALSE)
file.arg.name <- "--file="
script.name <- sub(file.arg.name, "", initial.options[grep(file.arg.name, initial.options)])
script.dirname <- dirname(script.name)
print(script.dirname)
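From there, building the path the question asks about is straightforward (a sketch, assuming the output folder sits next to the script and filename is the argument from the question's function):
path <- file.path(script.dirname, "output", filename)
if (file.exists(path)) {
  data <- read.csv(path, header = TRUE)
}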
> f <- "/path/to/my/script.R"
> f
[1] "/path/to/my/script.R"
> basename(f)
[1] "script.R"
> dirname(f)
[1] "/path/to/my"
> dirname(dirname(f))
[1] "/path/to"
> file.path(dirname(f), "output")
[1] "/path/to/my/output"
> file.path(dirname(f), "output", "data.csv")
[1] "/path/to/my/output/data.csv"
Adding this as an answer for you, Tinker.
Considering you're just trying to read in a file, you can do it this way:
## So what does this do?
# The path is where the files exist
# The pattern is some identifiable portion of the file name, which list.files() will bring back
# You need the full names so that R knows where to read from; this way you don't have to set a new working directory.
data <- if (file.exists(list.files(path = "./output/", pattern = "filename", full.names = TRUE))) {
  read.csv(list.files(path = "./output/", pattern = "filename", full.names = TRUE))
}
# Let's imagine you have a number of files to read in
# Generate a list of filenames
filename <- list("file1","file2","file3","filen")
data <- lapply(filename, function(x) {
  if (file.exists(list.files(path = "./output/", pattern = x, full.names = TRUE))) {
    read.csv(list.files(path = "./output/", pattern = x, full.names = TRUE))
  }
})
# each element of the list is one of your data files
data[[1]]
data[[2]]
data[[n]]
I'm not sure what you're declaring with header, since csv files are assumed to have a header inherently. Additionally, a csv is comma-separated, so declaring the sep character is also redundant.
Collecting resources from multiple questions on SO, I came up with the following solution, which seems to work with multiple calling conventions:
library(base)
library(rstudioapi)
get_directory <- function() {
args <- commandArgs(trailingOnly = FALSE)
file <- "--file="
rstudio <- "RStudio"
match <- grep(rstudio, args)
if (length(match) > 0) {
return(dirname(rstudioapi::getSourceEditorContext()$path))
} else {
match <- grep(file, args)
if (length(match) > 0) {
return(dirname(normalizePath(sub(file, "", args[match]))))
} else {
return(dirname(normalizePath(sys.frames()[[1]]$ofile)))
}
}
}
Which later I can use as:
path <- paste(get_directory(), "/output/", filename, sep = "")

R rename files keeping part of original name

I'm trying to rename all files in a folder (about 7,000 files) with just a portion of their original name.
The initial fips code is a 4- or 5-digit code that identifies counties and is different for every file in the folder. The rest of the original file name is the state_county_lat_lon of every file.
For example:
Original name:
"5081_Illinois_Jefferson_-88.9255_38.3024_-88.75_38.25.wth"
"7083_Illinois_Jersey_-90.3424_39.0953_-90.25_39.25.wth"
"11085_Illinois_Jo_Daviess_-90.196_42.3686_-90.25_42.25.wth"
"13087_Illinois_Johnson_-88.8788_37.4559_-88.75_37.25.wth"
"17089_Illinois_Kane_-88.4342_41.9418_-88.25_41.75.wth"
And I need it to rename with just the initial code (fips):
"5081.wth"
"7083.wth"
"11085.wth"
"13087.wth"
"17089.wth"
I've tried using the list.files and file.rename functions, but I don't know how to pick the code out of the full name. Some kind of "wildcard" could work, but I don't know how to apply one properly, because the names all follow the same pattern but differ in content.
This is what I've tried this far:
setwd("C:/Users/xxx")
Files <- list.files(path = "C:/Users/xxx", pattern = "fips_*.wth" all.files = TRUE)
newName <- paste("fips",".wth", sep = "")
for (x in length(Files)) {
file.rename(nFiles,newName)}
I've also tried with the "sub" function as follows:
setwd("C:/Users/xxxx")
Files <- list.files(path = "C:/Users/xxxx", all.files = TRUE)
for (x in length(Files)) {
sub("_*", ".wth", Files)}
but get Error in as.character(x) :
cannot coerce type 'closure' to vector of type 'character'
OR
setwd("C:/Users/xxxx")
Files <- list.files(path = "C:/Users/xxxx", all.files = TRUE)
for (x in length(Files)) {
sub("^(\\d+)_.*", "\\1.wth", file)}
Which runs without errors but does nothing to the file names.
I could use any help.
Thanks
Here is my example.
Prepare the data to use:
dir.create("test_dir")
data_sets <- c("5081_Illinois_Jefferson_-88.9255_38.3024_-88.75_38.25.wth",
"7083_Illinois_Jersey_-90.3424_39.0953_-90.25_39.25.wth",
"11085_Illinois_Jo_Daviess_-90.196_42.3686_-90.25_42.25.wth",
"13087_Illinois_Johnson_-88.8788_37.4559_-88.75_37.25.wth",
"17089_Illinois_Kane_-88.4342_41.9418_-88.25_41.75.wth")
setwd("test_dir")
file.create(data_sets)
Rename the files:
Files <- list.files(all.files = TRUE, pattern = ".wth")
newName <- sub("^(\\d+)_.*", "\\1.wth", Files)
file.rename(Files, newName)
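With roughly 7,000 files it may be worth a safety check before renaming, since a duplicated fips prefix would make two files collide on the same target name. A small sketch:
# abort rather than let two files race for the same new name
if (any(duplicated(newName))) stop("Duplicate target names; check the fips codes")
file.rename(Files, newName)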

Automate zip file reading in R

I need to automate R to read a csv datafile that's inside a zip file.
For example, I would type:
read.zip(file = "myfile.zip")
And internally, what would be done is:
Unzip myfile.zip to a temporary folder
Read the only file contained in it using read.csv
If there is more than one file in the zip file, an error is thrown.
My problem is getting the name of the file contained in the zip file, in order to provide it to the read.csv command. Does anyone know how to do it?
UPDATE
Here's the function I wrote based on @Paul's answer:
read.zip <- function(zipfile, row.names=NULL, dec=".") {
  # Create a name for the dir where we'll unzip
  zipdir <- tempfile()
  # Create the dir using that name
  dir.create(zipdir)
  # Unzip the file into the dir
  unzip(zipfile, exdir=zipdir)
  # Get the files in the dir
  files <- list.files(zipdir)
  # Throw an error if there's more than one
  if(length(files)>1) stop("More than one data file inside zip")
  # Get the full name of the file
  file <- paste(zipdir, files[1], sep="/")
  # Read the file (name the arguments so they aren't matched
  # positionally to read.csv's header and sep)
  read.csv(file, row.names = row.names, dec = dec)
}
Since I'll be working with more files inside the tempdir(), I created a new dir inside it, so I don't get confused with the files. I hope it may be useful!
Another solution using unz:
read.zip <- function(file, ...) {
  zipFileInfo <- unzip(file, list=TRUE)
  if(nrow(zipFileInfo) > 1)
    stop("More than one data file inside zip")
  else
    read.csv(unz(file, as.character(zipFileInfo$Name)), ...)
}
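Usage might then look like this (the extra arguments are passed through ... to read.csv; the file name is the one from the question):
df <- read.zip("myfile.zip", stringsAsFactors = FALSE)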
You can use unzip to unzip the file. I just mention this as it is not clear from your question whether you knew that. As for reading the file: once you've extracted it to a temporary dir (?tempdir), just use list.files to find the files that were dumped into the temporary directory. In your case this is just one file, the file you need. Reading it using read.csv is then quite straightforward:
l = list.files(temp_path)
read.csv(l[1])
assuming your tempdir location is stored in temp_path.
I found this thread as I was trying to automate reading multiple csv files from a zip. I adapted the solution to the broader case. I haven't tested it for weird filenames or the like, but this is what worked for me so I thought I'd share:
read.csv.zip <- function(zipfile, ...) {
  # Create a name for the dir where we'll unzip
  zipdir <- tempfile()
  # Create the dir using that name
  dir.create(zipdir)
  # Unzip the file into the dir
  unzip(zipfile, exdir=zipdir)
  # Get a list of csv files in the dir
  files <- list.files(zipdir)
  files <- files[grep("\\.csv$", files)]
  # Create a list of the imported csv files
  csv.data <- sapply(files, function(f) {
    fp <- file.path(zipdir, f)
    return(read.csv(fp, ...))
  })
  return(csv.data)
}
If you have zcat installed on your system (which is the case for Linux, macOS, and Cygwin) you could also use:
zipfile<-"test.zip"
myData <- read.delim(pipe(paste("zcat", zipfile)))
This solution also has the advantage that no temporary files are created.
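Note that zcat is really meant for gzip streams; for a genuine .zip archive the same no-temp-file idea should work with unzip -p, which writes the extracted entry to stdout (a sketch, assuming the command-line unzip tool is installed):
zipfile <- "test.zip"
# unzip -p streams the archive contents to stdout, so no temp files here either
myData <- read.delim(pipe(paste("unzip -p", zipfile)))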
Here is an approach I am using that is based heavily on @Corned Beef Hash Map's answer. Here are some of the changes I made:
- My approach makes use of the data.table package's fread(), which can be fast (generally, if it's zipped, sizes might be large, so you stand to gain a lot of speed here!).
- I also adjusted the output format so that it is a named list, where each element of the list is named after the file. For me, this was a very useful addition.
- Instead of using regular expressions to sift through the files grabbed by list.files, I make use of list.files()'s pattern argument.
- Finally, by relying on fread() and by making pattern an argument to which you could supply something like "" or NULL or ".", you can use this to read in many types of data files; in fact, you can read in multiple types at once (if your .zip contains .csv and .txt files and you want both, e.g.). If there are only some types of files you want, you can specify the pattern to only use those, too.
Here is the actual function:
read.csv.zip <- function(zipfile, pattern="\\.csv$", ...){
  # fread() comes from data.table
  library(data.table)
  # Create a name for the dir where we'll unzip
  zipdir <- tempfile()
  # Create the dir using that name
  dir.create(zipdir)
  # Unzip the file into the dir
  unzip(zipfile, exdir=zipdir)
  # Get a list of files in the dir matching the pattern
  files <- list.files(zipdir, recursive = TRUE, pattern = pattern)
  # Create a list of the imported files (simplify = FALSE keeps
  # sapply from collapsing the result into a matrix)
  csv.data <- sapply(files,
                     function(f){
                       fp <- file.path(zipdir, f)
                       dat <- fread(fp, ...)
                       return(dat)
                     },
                     simplify = FALSE)
  # Use file names to name the list elements
  names(csv.data) <- basename(files)
  # Return data
  return(csv.data)
}
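For example, to pull both the .csv and .txt files out of one archive (archive name hypothetical):
dats <- read.csv.zip("archive.zip", pattern = "\\.csv$|\\.txt$")
names(dats) # one element per file, named after the file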
The following refines the above answers. FUN could be read.csv, cat, or anything you like, provided the first argument will accept a file path. E.g.
head(read.zip.url("http://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/Downloads/ICD-9-CM-v32-master-descriptions.zip", filename = "CMS32_DESC_LONG_DX.txt"))
read.zip.url <- function(url, filename = NULL, FUN = readLines, ...) {
  zipfile <- tempfile()
  download.file(url = url, destfile = zipfile, quiet = TRUE)
  zipdir <- tempfile()
  dir.create(zipdir)
  unzip(zipfile, exdir = zipdir) # files="" so extract all
  files <- list.files(zipdir)
  if (is.null(filename)) {
    if (length(files) == 1) {
      filename <- files
    } else {
      stop("multiple files in zip, but no filename specified: ", paste(files, collapse = ", "))
    }
  } else { # filename specified
    stopifnot(length(filename) == 1)
    stopifnot(filename %in% files)
  }
  do.call(FUN, args = c(list(file.path(zipdir, filename)), list(...)))
}
Another approach that uses fread from the data.table package:
fread.zip <- function(zipfile, ...) {
  # Function reads data from a zipped csv file
  # Uses fread from the data.table package
  ## Create the temporary directory or flush CSVs if it exists already
  if (!file.exists(tempdir())) {
    dir.create(tempdir())
  } else {
    file.remove(list.files(tempdir(), full.names = TRUE, pattern = "\\.csv$"))
  }
  ## Unzip the file into the dir
  unzip(zipfile, exdir = tempdir())
  ## Get path to file
  file <- list.files(tempdir(), pattern = "\\.csv$", full.names = TRUE)
  ## Throw an error if there's more than one
  if (length(file) > 1) stop("More than one data file inside zip")
  ## Read the file
  fread(file,
        na.strings = c(""), # read empty strings as NA
        ...)
}
Based on the answer/update by @joão-daniel:
# unzipped file location
outDir <- "~/Documents/unzipFolder"
# get all the zip files
zipF <- list.files(path = "~/Documents/", pattern = "\\.zip$", full.names = TRUE)
# unzip all your files
purrr::map(.x = zipF, .f = unzip, exdir = outDir)
I just wrote a function based on the top read.zip answer that may help...
read.zip <- function(zipfile, internalfile=NA, read.function=read.delim, verbose=TRUE, ...) {
  # function based on http://stackoverflow.com/questions/8986818/automate-zip-file-reading-in-r
  # check the files within the zip
  unzfiles <- unzip(zipfile, list=TRUE)
  if (is.na(internalfile) || is.numeric(internalfile)) {
    internalfile <- unzfiles$Name[ifelse(is.na(internalfile), 1, internalfile[1])]
  }
  # Create a name for the dir where we'll unzip
  zipdir <- tempfile()
  # Create the dir using that name
  dir.create(zipdir)
  if (verbose) cat("Directory created:", zipdir, "\n")
  # Unzip the file into the dir
  if (verbose) cat("Unzipping file:", internalfile, "...")
  unzip(zipfile, files=internalfile, exdir=zipdir)
  if (verbose) cat("Done!\n")
  # Get the full name of the file
  file <- paste(zipdir, internalfile, sep="/")
  if (verbose)
    on.exit({
      cat("Done!\nRemoving temporary files:", file, ".\n")
      file.remove(file)
      unlink(zipdir, recursive=TRUE)
    })
  else
    on.exit({file.remove(file); unlink(zipdir, recursive=TRUE)})
  # Read the file
  if (verbose) cat("Reading file...")
  read.function(file, ...)
}
