I am reading a lot of .csv files inside a .zip file with the following code
for (i in unzip("data.zip", list = TRUE)$Name) {
  read.csv(unz("data.zip", i))
}
The problem is that some of the .csv files are empty, which leads to a "no lines available in input" error that interrupts the loop. How can I skip those empty files?
Try this
flist <- unzip("data.zip", list=TRUE)
Now flist$Length gives you the uncompressed size of each file in bytes, so e.g.
keep <- flist$Length > 100 # or some other value that indicates the file has no data
Now you can read the nonempty ones and save them to a list:
AllFiles <- lapply(flist$Name[keep], function(f) read.csv(unz("data.zip", f)))
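If some entries might be malformed rather than merely empty, another option (a sketch, not part of the answer above) is to wrap each read in tryCatch so any unreadable entry is skipped:

flist <- unzip("data.zip", list = TRUE)

# Read every entry, returning NULL for any that fail to parse (e.g. empty files)
AllFiles <- lapply(flist$Name, function(f) {
  tryCatch(read.csv(unz("data.zip", f)), error = function(e) NULL)
})
names(AllFiles) <- flist$Name
AllFiles <- Filter(Negate(is.null), AllFiles)  # drop the skipped entries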
Use the read_csv function to read each of the files in the files object with the code below:
path <- system.file("extdata", package = "dslabs")
files <- list.files(path)
files
I tried the code below but I get a vroom error. Please help.
for (f in files) {
  read_csv(f)
}
First, in list.files you should set full.names=TRUE to include the whole path. Next, if you look into files, there are also .xls and .pdf files included. You may want to filter just for .csv files, which can easily be done using grep.
files <- list.files(path, full.names=TRUE)
files <- grep('\\.csv$', files, value=TRUE)
However, even then readr::read_csv complains about column issues.
lst <- readr::read_csv(files)
# Error: Files must all have 2 columns:
# * File 2 has 57 columns
To avoid editing the columns by hand, I recommend using rio::import_list instead, which gives just a warning that a column name was guessed and can be changed if needed. You may even include the .xls files in the grep.
files <- grep('\\.csv$|\\.xls', files, value=TRUE)
lst <- rio::import_list(files)
Note that rio::import_list (as well as readr::read_csv) is vectorized, so you won't need a loop.
Data:
path <- system.file("extdata", package="dslabs")
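If you prefer to stay with readr, a loop-free sketch (not from the answer above) is to read each file into its own list element, which sidesteps the column-count error because nothing gets stacked:

library(readr)

path  <- system.file("extdata", package = "dslabs")
files <- list.files(path, pattern = '\\.csv$', full.names = TRUE)

lst <- lapply(files, read_csv)  # one data frame per file
names(lst) <- basename(files)   # keep track of the source file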
I have around five files in my directory that I want to read in R. Each file has a name pattern "filex.html", where x = 1, 2, 3 and so on. However, a few files are missing. I wanted to create a loop to read all the files, and whenever a file is non-existent, the loop should jump to the next file in the sequence. However, my loop stops whenever it encounters the first non-existent file.
Following is the loop.
ids = c(1:10)
for (i in ids) {
  myurl = paste("mypage", i, ".html")
  myurl = gsub(" ", "", myurl)
  pointer = read_html(myurl)
  if (is_null(pointer)) {
    next
  }
}
This is the error.
Error: 'mypage3.html' does not exist in current working directory ('E:/My_projects/mydb').
How can I make my loop skip over the non-existing files?
Instead of looping over your ids vector, which may include non-existent files, lapply over the actual files, obtained from list.files().
You can use a pattern to only get the html files with list.files(pattern = "\\.html$").
Here is an example
library(rvest)

html_files <- list.files(pattern = "\\.html$")
lapply(html_files, function(x) {
  read_html(x)
})
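Alternatively, if you want to keep the numeric loop, a minimal sketch (assuming rvest's read_html, as in the question) is to test file.exists() before reading, so missing files are skipped instead of raising an error:

library(rvest)

ids <- 1:10
pages <- list()
for (i in ids) {
  myurl <- paste0("mypage", i, ".html")
  if (!file.exists(myurl)) next   # skip missing files
  pages[[myurl]] <- read_html(myurl)
}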
I have several files in a folder. They all have the same layout and I have extracted the information I want from them.
So now, for each file, I want to write a .csv file and name it after the original input file and add "_output" to it.
However, I don't want to repeat this process manually for each file. I want to loop over them. I looked for help online and found lots of great tips, including many in here.
Here's what I tried:
#Set directory
dir = setwd("D:/FRhData/elb") #set directory
filelist = list.files(dir) #save file names into filelist
myfile = matrix()

#Read files into R
for (i in 1:length(filelist)) {
  myfile[i] = readLines(filelist[i])
  # *code with all calculations*
  write.csv(x = finalDF, file = paste(filename[i], "_output.csv"))
}
Unfortunately, it didn't work out. Here's the error message I get:
Error in as.character(x) :
cannot coerce type 'closure' to vector of type 'character'
In addition: Warning message:
In myfile[i] <- readLines(filelist[i]) :
number of items to replace is not a multiple of replacement length
And 'report2016-03.txt' is the name of the first file the code should be executed on.
Does anyone know what I should do to correct this mistake - or any other possible mistakes you can foresee?
Thanks a lot.
======================================================================
Here's some of the resources I used:
https://www.r-bloggers.com/looping-through-files/
How to iterate over file names in a R script?
Looping through files in R
Loop in R loading files
How to loop through a folder of CSV files in R
This worked for me. I used a vector instead of a matrix, took out the readLines() call and used paste0 since there was no separator.
dir = setwd("C:/R_projects") #set directory
filelist = list.files(dir) #save file names into filelist
myfile = vector()
finalDF <- data.frame(a=3, b=2)
#Read files into R
for ( i in 1:length(filelist)){
myfile[i] = filelist[i]
write.csv(x = finalDF, file = paste0(myfile[i] ,"_output.csv"))
}
list.files(dir)
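Putting the pieces together, here is a fuller sketch (the finalDF line below is a placeholder for the real calculations, which the question doesn't show). It reads each file, computes, and names each output after its input with "_output" appended, using tools::file_path_sans_ext so the .txt extension doesn't end up in the middle of the output name:

setwd("D:/FRhData/elb")
filelist <- list.files(pattern = "\\.txt$")

for (f in filelist) {
  lines <- readLines(f)                           # contents of one input file
  finalDF <- data.frame(n_lines = length(lines))  # placeholder for the calculations
  outname <- paste0(tools::file_path_sans_ext(f), "_output.csv")
  write.csv(finalDF, file = outname, row.names = FALSE)
}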
I'm trying to do something I think should be straightforward enough, but so far I've been unable to figure it out (not surprisingly, I'm a noob)...
I would like to be able to prompt a user for input file(s) in R. I've successfully used file.choose() to get a single file, but I would like to have the option of selecting more than one file at a time.
I'm trying to write a program that sucks in daily data files with the same header and appends them into one large monthly file. I can do it in the console by importing the files individually and then using rbind(file1, file2, ...), but I need a script to automate the process. The number of files to append will not necessarily be constant between runs.
Thanks
Update: Here the code I came up that works for me, maybe it will be helpful to someone else as well
library(tcltk)

File.names <- tk_choose.files()  # prompts user for files to be combined
Num.Files <- NROW(File.names)    # number of files selected by user

# Create one large file by combining all files
Combined.file <- read.delim(File.names[1], header = TRUE, skip = 2)  # read in first file of the list selected by user
for (i in 2:Num.Files) {
  temp <- read.delim(File.names[i], header = TRUE, skip = 2)  # read in the next file
  Combined.file <- rbind(Combined.file, temp)                 # append it to the combined file
}

output.dir <- dirname(File.names[1])  # directory of the files that were selected
setwd(output.dir)                     # change directory so the output file is in the same directory as the input files
output <- readline(prompt = "Output Filename: ")  # prompt user for output file name
outfile.name <- paste(output, ".txt", sep = "")
write.table(Combined.file, file = outfile.name, sep = "\t",
            col.names = TRUE, row.names = FALSE)  # write tab-delimited text file in the same dir as the original files
Have you tried ?choose.files
Use a Windows file dialog to choose a list of zero or more files interactively.
If you are willing to type each file name, why not just loop over all the files like this:
filenames <- c("file1", "file2", "file3")
filecontents <- lapply(filenames, function(fname) {<insert code for reading file here>})
bigfile <- do.call(rbind, filecontents)
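For instance, with the read.delim call from the question's update (header row present, first two lines skipped):

filenames <- c("file1", "file2", "file3")
filecontents <- lapply(filenames, function(fname) {
  read.delim(fname, header = TRUE, skip = 2)
})
bigfile <- do.call(rbind, filecontents)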
If your code must be interactive, you can use the readline function in a loop that will stop asking for more files when the user inputs an empty line:
getFilenames <- function() {
  filenames <- list()
  x <- readline("Filename: ")
  while (x != "") {
    filenames <- append(filenames, x)
    x <- readline("Filename: ")
  }
  filenames
}
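The returned names then slot straight into the lapply/rbind pattern above, e.g. filenames <- getFilenames() followed by the do.call(rbind, filecontents) step.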
I have an R script that loads multiple text files in a directory and saves the data as compressed .rda. It looks like this:
#!/usr/bin/Rscript --vanilla
args <- commandArgs(TRUE)
## args[1] is the folder name
outname <- paste(args[1], ".rda", sep="")
files <- list.files(path=args[1], pattern="\\.txt$", full.names=TRUE)
tmp <- list()
if (file.exists(outname)) {
  message("found ", outname)
  load(outname)
  tmp <- get(args[1])                  # previously read stuff
  files <- setdiff(files, names(tmp))
}
if (length(files) == 0) {              # setdiff returns character(0), never NULL
  message("no new files")
} else {
  ## read the files into a list of matrices
  results <- plyr::llply(files, read.table, .progress="text")
  names(results) <- files
  assign(args[1], c(tmp, results))
  message("now saving... ", args[1])
  save(list=args[1], file=outname)
}
message("all done!")
The files are quite large (15 MB each, typically 50 of them), so running this script takes up to a few minutes, a substantial part of which is spent writing the .rda results.
I often update the directory with new data files, so I would like to append them to the previously saved and compressed data. That is what I do above by checking whether an output file with that name already exists. The last step, saving the .rda file, is still pretty slow.
Is there a smarter way to go about this in some package, keeping track of which files have been read, and saving this faster?
I saw that knitr uses tools:::makeLazyLoadDB to save its cached computations, but this function is not documented so I'm not sure where it makes sense to use it.
For intermediate files that I need to read (or write) often, I use
save(..., compress = FALSE)
which speeds up things considerably.
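A different layout that avoids re-saving everything (just a sketch, not the answer's approach): cache each input file as its own .rds, so adding new files never rewrites the old results, and compress = FALSE applies there too.

#!/usr/bin/Rscript --vanilla
args  <- commandArgs(TRUE)
dir   <- args[1]
cache <- file.path(dir, "cache")
dir.create(cache, showWarnings = FALSE)

files <- list.files(path = dir, pattern = "\\.txt$", full.names = TRUE)
for (f in files) {
  rds <- file.path(cache, paste0(basename(f), ".rds"))
  if (!file.exists(rds))                         # only parse new files
    saveRDS(read.table(f), rds, compress = FALSE)
}
results <- lapply(list.files(cache, full.names = TRUE), readRDS)  # load everything back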