More problems with "incomplete final line" - r

This problem is similar to that seen here.
I have a large number of large CSVs which I am loading and parsing serially through a function. Many of these CSVs present no problem, but there are several which are causing problems when I try to load them with read.csv().
I have uploaded one of these files to a public Dropbox folder here (note that the file is around 10.4MB).
When I try to read.csv() that file, I get the warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on ...
And I cannot isolate the problem, despite scouring StackOverflow and R-help for solutions. Maddeningly, when I run
Import <- read.csv("http://dl.dropbox.com/u/83576/Candidate%20Mentions.csv")
using the Dropbox URL instead of my local path, it loads, but when I then save that very data frame and try to reload it thus:
write.csv(Import, "Test_File.csv", row.names = F)
TestImport <- read.csv("Test_File.csv")
I get the "incomplete final line" warning again.
So, I am wondering why the Dropbox-loaded version works, while the local version does not, and how I can make my local versions work -- since I have somewhere around 400 of these files (and more every day), I can't use a solution that can't be automated in some way.
In a related problem, perhaps deserving of its own question, it appears that some "special characters" break the read.csv() process, and prevent the loading of the entire file. For example, one CSV which has 14,760 rows only loads 3,264 rows. The 3,264th row includes this eloquent Tweet:
"RT #akiron3: ácÎå23BkªÐÞ'q(#BarackObama )nĤÿükTPP ÍþnĤüÈ’áY‹ªÐÞĤÿüŽ
\&’ŸõWˆFSnĤ©’FhÎåšBkêÕ„kĤüÈLáUŒ~YÒhttp://t.co/ABNnWfTN
“jg)(WˆF"
Again, given the serialized loading of several hundred files, how can I (a) identify what is causing this break in the read.csv() process, and (b) fix the problem with code, rather than by hand?
Thanks so much for your help.

1)
suppressWarnings(TestImport <- read.csv("Test_File.csv") )
2) Unmatched quotes are the most common cause of apparent premature closure. You could try adding all of these:
quote="", na,strings="", comment.char=""

Related

R: "No line available in input" Error when reading multiple csv files in from a directory

I'm having trouble reading multiple .csv files in from a directory. It's odd because I read in files from two other directories using the same code with no issue immediately prior to running this code chunk.
setwd("C:\\Users\\User\\Documents\\College\\MLMLMasters\\Thesis\\TaggingEffectsData\\DiveStat")
my_dive <- list.files(pattern="*.csv")
my_dive
head(my_dive)
if(!require(plyr)){install.packages("plyr")}
DB = do.call(rbind.fill, lapply(my_dive, function(x) read.csv(x, stringsAsFactors = FALSE)))
DB
detach("package:plyr") ### I run this after I have finished creating all the dataframes because I sometimes have issues with plyr and dplyr not playing nice
if(!require(dplyr)){install.packages("dplyr")}
Then it throws this error:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
no lines available in input
This doesn't make any sense, because list.files() clearly works; when I run head(my_dive) I get this output:
head(my_dive)
[1] "2004001_R881_TV3.csv" "2004002_R 57_TV3.csv" "2004002_R57_TV3.csv" "2004003_W1095_TV3.csv"
[5] "2004004_99AB_TV3.csv" "2004005_O176_TV3.csv"
Plus the Environment clearly shows that my list is populated with all 614 files as I would expect it to be.
All of the csv file sets have identical file names but different data, so they have to be read in as separate data frames from separate directories (not my decision; that's just how this dataset was organized). For that reason I can't figure out why this set of files is giving me grief when the other two sets read in just fine. The only differences should be the working directory and the names of the lists and data frames. I thought it might be something within the actual directory, but I checked and there are only .csv files in it, and list.files() works fine. I saw a previous question similar to mine, but that poster didn't initially use the pattern = "*.csv" argument, and that was the cause of the error; I always use this argument, so that seems unlikely to be the cause here.
I'm not sure how to go about making this reproducible, but I appreciate any help offered.
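One common cause of "no lines available in input" is an empty (zero-byte) file hiding in the directory. A quick diagnostic, not from the original thread and assuming the working directory is still set as above:
# List any zero-byte files among those matched by list.files();
# read.csv() has nothing to parse in such files and raises this error.
empty <- my_dive[file.size(my_dive) == 0]
empty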

read_tsv stalls: is this an encoding issue?

I apologize in advance for the lack of specificity of this post but I can't provide a reproducible example in this case. I'm trying to read a tab-separated data file with R readr's read_tsv. The data is from a confidential source so I can't share it, even just the problematic part. read_tsv stalls around 20% of reading progress and unless I kill R quickly, my RAM usage starts blowing up to the point that my computer freezes (I'm on Ubuntu 18.04). Specifically, I'm running:
read_tsv(file = path_to_file,
skip = 10e6,
n_max = 1e5)
I'm skipping lines and setting n_max to vaguely isolate where the problem is and run faster tests. I also tried setting read_tsv's locale to locale(encoding = 'latin1') without success. I tried inspecting this problematic part by reading it with readr's read_lines:
read_lines(file = path_to_file,
skip = 10e6,
n_max = 1e5)
There's no reading problem there: I'm getting a list of character strings. I ran validUTF8 on all of them and they all seem valid. I just have no idea what type of problem could cause read_tsv to stall. Any ideas?
I solved the problem. It seems to have come from inappropriate handling of quoting characters with read_tsv's default quote option. Setting quote = "" instead made it work smoothly.
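For reference, a sketch of the call with that fix applied; path_to_file is the placeholder used above, and skip/n_max are just the test settings from the question:
library(readr)
# quote = "" disables quote handling, so stray quote characters are
# treated as ordinary text rather than opening an unterminated field
read_tsv(file = path_to_file,
         quote = "",
         skip = 10e6,
         n_max = 1e5)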

R save() not producing any output but no error

I am brand new to R and I am trying to run some existing code that should clean up an input .csv then save the cleaned data to a different location as a .RData file. This code has run fine for the previous owner.
The code seems to be pulling the .csv and cleaning it just fine. It also looks like the save is running (there are no errors) but there is no output in the specified location. I thought maybe R was having a difficult time finding the location, but it's pulling the input data okay and the destination is just a sub folder.
After a full day of extensive Googling, I can't find anything related to a save just not working.
Example code below:
save(data, file = "C:\\Users\\my_name\\Documents\\Project\\Data.RData", sep="")
Hard to believe you don't see any errors - unless something has switched errors off:
> data = 1:10
> save(data, file="output.RData", sep="")
Error in FUN(X[[i]], ...) : invalid first argument
It's a misleading error: the problem is actually the third argument, which doesn't do anything here. Remove it and it works:
> save(data, file="output.RData")
>
sep is an argument used when writing CSV files, to separate columns; save() writes binary data, which doesn't have rows and columns to separate.
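To make the distinction concrete, a small sketch (file names are illustrative):
data <- 1:10
save(data, file = "Data.RData")        # binary .RData; restore with load("Data.RData")
write.table(data, file = "Data.csv",   # text output, where sep does apply
            sep = ",", row.names = FALSE)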

Inconsistent results between fread() and read.table() for .tsv file in R

My question is in response to two issues I encountered in reading a published .tsv file that contains campaign finance data.
First, the file has a null character that terminates input and throws the error 'embedded nul in string: 'NAVARRO b\0\023 POWERS' when using data.table::fread(). I understand that there are a number of potential solutions to this problem but I was hoping to find something within R. Having seen the skipNul option in read.table(), I decided to give it a shot.
That brings me to the second issue: read.table() with reasonable settings (comment.char = "", quote = "", fill = T) does not throw an error, but it also does not read the same number of rows that data.table::fread() identified (~100k rows with read.table() vs. ~8M rows with data.table::fread()). The fread() answer seems more correct, since the file is ~1.5GB and data.table::fread() identifies valid data when reading rows leading up to where the error seems to be.
Here is a link to the code and output for the issue.
Any ideas on why read.table() is returning such different results? fread() operates by guessing characteristics of the input file but it doesn't seem to be guessing any exotic options that I didn't use in read.table().
Thanks for your help!
NOTE
I do not know anything about the file in question other than the source and what information it contains. The source is the California Secretary of State, by the way. At any rate, the file is too large to open in Excel or Notepad, so I haven't been able to visually examine it beyond looking at a handful of rows in R.
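For reference, the read.table() attempt described above would look roughly like this (the file name is taken from the pandas script below; header = TRUE is an assumption):
tbl <- read.table("RCPT_CD.TSV", sep = "\t", header = TRUE,
                  comment.char = "", quote = "", fill = TRUE,
                  skipNul = TRUE, stringsAsFactors = FALSE)
nrow(tbl)  # ~100k rows here vs. ~8M reported by data.table::fread()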
I couldn't figure out an R way to deal with the issue but I was able to use a python script that relies on pandas:
import pandas as pd
import os

os.chdir(path = "C:/Users/taylor/Dropbox/WildPolicy/Data/Campaign finance/CalAccess/DATA")

# read the TSV in 500k-row chunks, skipping rows pandas cannot parse
receipts_chunked = pd.read_table("RCPT_CD.TSV", sep = "\t", error_bad_lines = False,
                                 low_memory = False, chunksize = 5e5)

chunk_num = 0
for chunk in receipts_chunked:
    chunk_num = chunk_num + 1
    file_name = "receipts_chunked_" + str(chunk_num) + ".csv"
    print("Output file:", file_name)
    chunk.to_csv(file_name, sep = ",", index = False)
The problem with this route is that, with 'error_bad_lines = False', problem rows are simply skipped instead of erroring out. There are only a handful of error cases (out of ~8 million rows) but this is still suboptimal obviously.

Can scan (or any import function) return partial results after it bumps into errors?

Is there anything I can do to get partial results after bumping into errors in a big file? I am using the following command to import data from files. It is the fastest way I know, but it's not robust: a small error can easily ruin everything. I hope there is at least a way for scan() (or any reader) to quickly report which row/line has the error, or to return the partial results it has read (then I would have an idea where the error is). Then I could skip enough lines to recover over 99% of the good data.
rawData = scan(file = "rawData.csv", what = scanformat, sep = ",", skip = 1, quiet = TRUE, fill = TRUE, na.strings = c("-", "NA", "Na","N"))
All the data-import tutorials I found seem to assume the files are in good shape; I didn't find any useful hints for dealing with dirty files.
I will sincerely appreciate any hint or suggestion! This has been really frustrating.
Idea 1: Open a file connection (with the file() function), then scan() line by line (with nlines = 1). Wrap each scan() call in try() so you can recover after reading a bad line.
Idea 2: Use readLines() to read the file as raw lines, then use strsplit() to parse them. You can analyse this output to find bad lines and remove them.
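A rough sketch blending Ideas 1 and 2 (scanformat and the na.strings values come from the scan() call in the question; everything else is illustrative):
con <- file("rawData.csv", open = "r")
invisible(readLines(con, n = 1))              # skip the header line, as skip = 1 did
rows <- list()
bad <- integer(0)
i <- 0
repeat {
  line <- readLines(con, n = 1)
  if (length(line) == 0) break                # end of file
  i <- i + 1
  parsed <- try(scan(text = line, what = scanformat, sep = ",",
                     quiet = TRUE, fill = TRUE,
                     na.strings = c("-", "NA", "Na", "N")),
                silent = TRUE)
  if (inherits(parsed, "try-error")) bad <- c(bad, i)   # remember the bad line
  else rows[[length(rows) + 1]] <- parsed
}
close(con)
bad   # data lines (after the header) that failed to parse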
The count.fields function will preprocess a table-like file and tell you how many fields it found on each line (in the sense that read.table() looks for fields). This is often a quick way to identify problem lines, because they will show a different number of fields from what is expected (or just different from the majority of other lines).
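For example, applied to the file from the question (treating the median field count as the expected one):
nf <- count.fields("rawData.csv", sep = ",", skip = 1)
table(nf)                        # how many lines have each field count
suspect <- which(nf != median(nf))
suspect + 1                      # +1 to convert back to file line numbers (header was skipped)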
