tryCatch - withCallingHandlers - recover from error - r

I have a csv file (approx. 1000 lines) with some sample data. While reading the csv with read.table,
read.table(csv_File, header = FALSE, sep = ",", na.strings = '')
I was getting an error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 515 did not have 5 elements
Is there any way, using tryCatch and withCallingHandlers, to print this error message and continue with the rest of the file?
All I am expecting is to get the error message/stack trace in case of errors and to process the rest of the lines in the csv.

No, as far as I know there's no way to get read.table to skip lines that contain errors. What you should do is use the count.fields function to find how many fields are in each line of your file, then read the whole file, delete the bad lines, and read again. For example:
fields <- count.fields(csv_File, sep = ",")
bad <- fields != 5
lines <- readLines(csv_File)
# At this point you could display the bad lines or
# give some other information about them.
# Then delete them and read again:
lines <- lines[!bad]
f <- tempfile()
writeLines(lines, f)
read.table(f, header = FALSE, sep=",", na.strings = '')
unlink(f)
EDITED to add:
I should mention that the readr package does a better job when files contain problems. If you use
library(readr)
read_csv(csv_File, col_names = FALSE)
it will produce a "tibble" instead of a data frame, but otherwise should do what you want. Each line that has problems will be reported, and the overall problems will be kept with the dataset in case you want to examine them later.

Related

Large file processing - error using chunked::read_csv_chunked with dplyr::filter

When using the function chunked::read_csv_chunked and dplyr::filter in a pipe, I get an error every time the filter returns an empty dataset on any of the chunks. In other words, this occurs when all the rows from a given chunk of the dataset are filtered out.
Here is a modified example, drawn from the package chunked help file:
library(chunked); library(dplyr)
# create csv file for demo purpose
in_file <- file.path(tempdir(), "in.csv")
write.csv(women, in_file, row.names = FALSE, quote = FALSE)
# reading chunkwise and filtering
women_chunked <-
  read_chunkwise(in_file, chunk_size = 3) %>%  # read only a few lines for the purpose of this example
  filter(height > 150)                         # this filters out most rows of the dataset, so for instance
                                               # the first chunk (first 3 rows) should return an empty table
# Trying to read the output returns an error message
women_chunked
# >Error in UseMethod("groups") :
# >no applicable method for 'groups' applied to an object of class "NULL"
# As does of course trying to write the output to a file
out_file <- file.path(tempdir(), "processed.csv")
women_chunked %>%
  write_chunkwise(file = out_file)
# >Error in read.table(con, nrows = nrows, sep = sep, dec = dec, header = header, :
# >first five rows are empty: giving up
I am working on many csv files, each with 50 million rows, and will thus often end up in a similar situation where the filtering returns an empty table for at least some chunks.
I couldn't find a solution or any post related to this problem. Any suggestions?
I do not think the sessionInfo output is useful in this case, but please let me know if I should post it anyway. Thanks a lot for any help!
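One possible workaround, offered only as a sketch (it uses readr's chunked reader instead of the chunked package, so read_csv_chunked and DataFrameCallback below come from readr and this is untested against the exact setup above): filter inside a callback and stack the results, so an empty chunk simply contributes zero rows instead of breaking the pipeline.
library(readr); library(dplyr)
cb <- DataFrameCallback$new(function(chunk, pos) filter(chunk, height > 150))
women_filtered <- read_csv_chunked(in_file, callback = cb, chunk_size = 3)
write_csv(women_filtered, file.path(tempdir(), "processed.csv"))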

How to import multiple csv files into R without getting duplicate row names error

I've seen the multiple answers to similar questions where people get the duplicate 'row.names' are not allowed error when importing one csv file into R, but I haven't seen a question about importing multiple csv files into one data frame. Essentially, I'm trying to import 104 files from the same directory and I get duplicate 'row.names' are not allowed. I would be able to solve the problem if I were only importing one file, as the code is extremely simple, but when it comes to multiple files I struggle. I've tried a number of different ways of importing the data properly; here are a couple of them:
setwd("path")
loaddata <- function(file ="directory") {
files <- dir("directory", pattern = '\\.csv', full.names = TRUE)
tables <- lapply(files, read.csv)
dplyr::bind_rows
}
data <- loaddata("PhaseReports")
Error:
Error in read.table(file = file, header = header, sep = sep, quote =
quote, duplicate 'row.names' are not allowed
Another attempt:
path <- "path"
files <- list.files(path=path, pattern="*.csv")
for(file in files)
{
perpos <- which(strsplit(file, "")[[1]]==".")
assign(
gsub(" ","",substr(file, 1, perpos-1)),
read.csv(paste(path,file,sep="")))
}
Error:
Error in read.table(file = file, header = header, sep = sep, quote =
quote, duplicate 'row.names' are not allowed
EDIT: For the second method, when I try read.csv(paste(path, file, sep = ""), row.names = NULL), it changes the title of my first column to row.names and shifts the data one column to the right. I tried putting
colnames(rec) <- c(colnames(rec)[-1],"x")
rec$x <- NULL
after the last line, and I get this error:
Error in `colnames<-`(`*tmp*`, value = "x") :
attempt to set 'colnames' on an object with less than two dimensions
If there is a much easier way to import multiple csv files into R and I'm overcomplicating things, don't be afraid to let me know.
I know this is a combination of two questions which have been answered plenty of times on Stack Overflow, but I didn't see anyone ask this specific question. Thanks in advance!
EDIT 2:
All of the individual files contain data like this:
Half,Play,Type,Time
1,1,Start,00:00:0
1,2,,0:23:5
1,3,pass,00:03:76
2,4,start,00:04:76
2,5,pass,00:06:92
2,6,end,00:08:00
Although this may not solve your problem, you could try skipping the headers while reading the files and adding them afterwards. So, something like this (in some of your approaches):
read.csv("Your files/file/paste", header = F, skip = 1)
This will skip the header and hopefully will help with the duplicate row names. The full code to do it could be:
my_files <- dir("Your path/folder etc", pattern = '\\.csv', full.names = TRUE)
result <- do.call(rbind, lapply(my_files, read.csv, header = F, skip = 1))
names(result) <- c("Half","Play","Type","Time")
You can add the header afterwards (the names(result) line does that).
If you still have problems, I would suggest creating a loop like this:
for (i in my_files) {
  print(i)
  read.csv(i)
}
Then see which file name is printed last before you get an error; that is the file you should investigate. You could check whether a row has more than 3 commas, because I think that will be the problem. Hope it helps!
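Since the original question was about tryCatch, here is a minimal sketch of that same diagnostic loop written so it keeps going after a failure and reports which file caused it (my_files as above; the message text is just illustrative):
for (i in my_files) {
  tryCatch(
    read.csv(i),                      # attempt the read
    error = function(e) message("Problem in ", i, ": ", conditionMessage(e))
  )
}
This way every problem file is listed in one pass instead of the loop stopping at the first error.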

Write col names while writing csv files in R

What is the proper way to add column names to the header of a csv table generated by the write.table command?
For example, write.table(x, file, col.names = c("ABC","ERF")) throws an error saying invalid 'col.names' specification. Is there a way to get around the error while still using write.table?
Edit:
I am in the middle of writing a large piece of code, so exact data replication is not possible; however, this is what I have done:
write.table(paste("A","B"), file="AB.csv", col.names=c("A1","B1")), and I am still getting this error: Error in write.table(paste("A","B"), file="AB.csv", col.names=c("A", : invalid 'col.names' specification.
Is this what you expect? This is what I tried at my end:
df <- data.frame(condition_1sec=1)
df1 <- data.frame(susp=0)
write.table(c(df,df1),file="table.csv",col.names = c("A","B"),sep = ",",row.names = F)
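For what it's worth, the error in the original call is most likely a length mismatch: paste("A","B") is the single value "A B", so the table has one column, while col.names supplies two names, and when col.names is a character vector its length has to match the number of columns. A minimal sketch with made-up data:
df <- data.frame(first = 1:2, second = 3:4)        # two columns
write.table(df, file = "AB.csv", sep = ",", row.names = FALSE,
            col.names = c("ABC", "ERF"))           # two names for two columns: works
# write.table(df["first"], file = "AB.csv", col.names = c("ABC", "ERF"))
# one column but two names: reproduces the "invalid 'col.names' specification" error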

read.table quits after encountering special symbol

I am trying to read.table for a tab-delimited file using the following command:
df <- read.table("input.txt", header=FALSE, sep="\t", quote="", comment.char="",
encoding="utf-8")
There should be 30 million rows. However, after read.table() returns, df contains only ~6 million rows. And there is a warning message like this:
Warning message:
In read.table("input.txt", header = FALSE, sep = "\t", quote = "", :
incomplete final line found by readTableHeader on 'input.txt'
I believe read.table quits after encountering a special symbol (ASCII code 1A, Substitute) in one of the string columns. In the input file, the only special character should be the tab, because it is used to separate columns. Is there any way to ask read.table to treat every other character as not special?
If you have 30 million rows, I would use fread rather than read.table; it is faster. Learn more about it here: http://www.inside-r.org/packages/cran/data.table/docs/fread
fread(input, sep="auto", encoding = "UTF-8" )
Regarding your issue with read.table, I think the solutions here should solve it:
'Incomplete final line' warning when trying to read a .csv file into R
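If the embedded 0x1A (SUB) character is indeed what stops the read, another possibility (a sketch only, not tested on your data) is to strip that byte before parsing, so read.table never sees it:
# read the raw bytes, drop every 0x1A byte, write a cleaned copy, then parse as before
raw_bytes <- readBin("input.txt", what = "raw", n = file.info("input.txt")$size)
clean_bytes <- raw_bytes[raw_bytes != as.raw(0x1A)]
tmp <- tempfile(fileext = ".txt")
writeBin(clean_bytes, tmp)
df <- read.table(tmp, header = FALSE, sep = "\t", quote = "", comment.char = "",
                 encoding = "utf-8")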

R error handling using read.table and defined colClasses having corrupt lines in CSV file

I have a big .csv file to read in. Unfortunately, some lines are corrupt, meaning that something is wrong with the formatting, such as the number 0.-02 instead of -0.02. Sometimes even the line break (\n) is missing, so that two lines merge into one.
I want to read the .csv file with read.table and define all colClasses to the format that I expect the file to have (except of course for the corrupt lines). This is a minimal example:
colNames <- c("date", "parA", "parB")
colClasses <- c("character", "numeric", "numeric")
inputText <- "2015-01-01;123;-0.01\n
2015-01-02;421;-0.022015-01-03;433;-0.04\n
2015-01-04;321;-0.03\n
2015-01-05;230;-0.05\n
2015-01-06;313;0.-02"
con <- textConnection(inputText, "r")
mydata <- read.table(con, sep=";", fill = T, colClasses = colClasses)
At the first corrupt line, read.table stops with the error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected 'a real', got '-0.022015-01-03'
With this error message I have no idea on which line of the input the error occurred. Hence my only option is to copy the string -0.022015-01-03 and search for it in the file. But this is really annoying if you have to do it for a lot of lines, and you always have to re-execute read.table until it hits the next corrupt line.
So my questions are:
1. Is there a way to get read.table to tell me the line where the error occurred (and maybe save it for further processing)?
2. Is there a way to get read.table to just skip lines with improper formatting instead of stopping at an error?
3. Did anyone figure out a way to display such lines for manual correction during the read process? I mean, maybe display the whole corrupt line in plain csv format for manual correction (perhaps including the line before and after), and then continue the read-in process including the corrected lines.
What I have tried so far is to read everything with colClasses = "character" to avoid format checking in the first place, then do the format checking while converting every column to the right type, and finally which() all lines where the format could not be converted or the result is NA and delete them.
I have a solution, but it is very slow
With ideas I got from some of the comments, the thing I tried next was to read the input line by line with readLines and pass the result to read.table via the text argument. If read.table fails, the line is presented to the user via edit() for correction and re-submission. Here is my code:
con <- textConnection(inputText, "r")
mydata <- data.frame()
while(length(text <- readLines(con, n = 1)) > 0){
  correction = T
  while(correction) {
    err <- tryCatch(part <- read.table(text = text, sep = ";", fill = T,
                                       col.names = colNames,
                                       colClasses = colClasses),
                    error = function(e) e)
    if(inherits(err, "error")){
      # try to correct this line
      message(err, "\n")
      text <- edit(text)
    }else{
      correction = F
    }
  }
  mydata <- rbind(mydata, part)
}
If the user makes the corrections right, this returns:
> mydata
date parA parB
1 2015-01-01 123 -0.01
2 2015-01-02 421 -0.02
3 2015-01-03 433 -0.04
4 2015-01-04 321 -0.03
5 2015-01-05 230 -0.05
6 2015-01-06 313 -0.02
The input text had 5 lines, since one linefeed was missing. The corrected output has 6 lines and the 0.-02 is corrected to -0.02.
What I would still change in this solution is to present all corrupt lines together for correction after everything has been read in. This way the user can run the script and do all corrections at once after it has finished. But for a minimal example this should be enough.
The really bad thing about this solution is that it is really slow, too slow to handle big datasets. Hence I would still like another solution using more standard methods or perhaps a dedicated package.
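A faster variant along the same lines might look like this (only a sketch, reusing colNames, colClasses and inputText from above): flag lines with the wrong field count via count.fields, read the remaining lines as character in a single read.table call, flag the rows whose numeric conversion fails, and collect all flagged lines so they can be corrected together at the end.
all_lines <- readLines(textConnection(inputText))
all_lines <- all_lines[nzchar(trimws(all_lines))]        # drop blank lines
n_fields <- count.fields(textConnection(all_lines), sep = ";")
wrong_count <- n_fields != length(colNames)              # merged or broken lines
good <- read.table(text = all_lines[!wrong_count], sep = ";",
                   col.names = colNames, colClasses = "character")
parA <- suppressWarnings(as.numeric(good$parA))
parB <- suppressWarnings(as.numeric(good$parB))
bad_format <- is.na(parA) | is.na(parB)                  # e.g. the 0.-02 line
corrupt <- c(all_lines[wrong_count], all_lines[!wrong_count][bad_format])
mydata <- data.frame(date = good$date[!bad_format],
                     parA = parA[!bad_format],
                     parB = parB[!bad_format],
                     stringsAsFactors = FALSE)
# 'corrupt' now holds every offending line, ready for manual correction in one go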
