Why is my txt file not being completely read with read.delim?

I am trying to read a large text file (~3 GB) into R, but unfortunately I am not able to fully load it. What happens is that I'm missing a lot of rows: I get a data frame of ~700 thousand rows, while I know the file has at least 4-5 million.
The code I was initially using was as follows:
df<-read.delim("file.txt",quote = "",comment.char = "")
However, besides noticing that R wasn't loading all the rows, I was also receiving this warning:
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
embedded nul(s) found in input
I searched online and found that adding the skipNul = TRUE argument could solve this. When I included it in the read.delim call, the warning stopped showing, but the data frame is still missing a lot of rows: I get the same number of rows as before.
I have loaded files of similar size in the past, so I'm not sure why this is happening.
If someone has any idea what might be causing the problem, I would be very thankful.
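One possibility (hedged, since I can't see the file): even with skipNul = TRUE, other control bytes, such as a stray 0x1A, which Windows treats as end-of-file in text mode, can silently stop the read. A sketch that strips NUL and SUB bytes before parsing, using the questioner's file name:

```r
# Sketch: remove NUL (0x00) and SUB (0x1A) bytes, then parse the cleaned copy.
# Note: readBin loads the whole ~3 GB file into memory; with less RAM,
# loop over the file in fixed-size raw chunks instead.
sz <- file.size("file.txt")
raw_bytes <- readBin("file.txt", what = "raw", n = sz)
clean <- raw_bytes[raw_bytes != as.raw(0) & raw_bytes != as.raw(26)]
writeBin(clean, "file_clean.txt")

df <- read.delim("file_clean.txt", quote = "", comment.char = "")

# Sanity check: count physical lines independently of the parser
length(readLines("file_clean.txt", warn = FALSE))
```

If the line count from readLines() already disagrees with the expected 4-5 million, the rows are being lost before parsing and the problem is in the file itself rather than in read.delim's settings.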

Related

Trouble loading xls file into R with R Markdown

I'm trying to upload a GSS data set into R Markdown for creating a lecture presentation.
Each time I do, I get errors that I do not understand. Any help would be appreciated.
read.csv("directory/GSS2018.xls", headers = TRUE)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
unused argument (headers = TRUE)
My .xls has headers, so I'm not sure why it is saying the argument is "unused". Even still, I tried taking out the headers option and received this:
read.csv("~directory/GSS2018.xls")
line 1 appears to contain embedded nulls
line 2 appears to contain embedded nulls
line 3 appears to contain embedded nulls
line 4 appears to contain embedded nulls
line 5 appears to contain embedded nulls
Error in make.names(col.names, unique = TRUE) :
invalid multibyte string at '<1a>'
I can't quite work out what this error is telling me, nor how to fix it. I can import my data just fine using the "Import Dataset" button in the Environment pane of RStudio - but when I put that code into R Markdown, it shows all these errors.
Any help is appreciated!
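Two things seem to be going on here. The first error is just that the argument is spelled header, not headers. The deeper problem is that read.csv is a plain-text parser, so pointing it at a binary .xls file produces exactly this kind of embedded-nul and multibyte noise. A hedged sketch using the readxl package, reusing the questioner's (hypothetical) path:

```r
# .xls is a binary spreadsheet format; use a spreadsheet reader, not read.csv.
library(readxl)

gss <- read_excel("directory/GSS2018.xls")  # first row is used as column names by default
```

The RStudio "Import Dataset" button very likely generates a readxl (or similar) call behind the scenes, which would explain why the GUI import works while read.csv does not.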

Importing multiple csv files with lapply

When I need to import multiple csv files, I use:
Cluster <- lapply(dir(),read.csv)
Previously setting the working directory, of course, but somehow today it stopped working and returned this error message:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
no lines available in input
The only unusual thing I did was setting the Java directory manually so that rJava could be loaded.
Any idea what happened?
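"no lines available in input" means read.csv was handed an empty input, which can happen when dir() picks up a zero-byte file or a non-CSV file (for instance, something written to the directory while configuring Java). A defensive sketch, assuming the working directory is already set:

```r
# Read only non-empty files that actually end in .csv
files <- dir(pattern = "\\.csv$")
files <- files[file.size(files) > 0]
Cluster <- lapply(files, read.csv)
```

Comparing dir() with dir(pattern = "\\.csv$") should reveal whether a stray file is the culprit.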

Using read.csv when a data entry is a space (not blank!)

I am having a problem with using read.csv in R. I am trying to import a file that has been saved as a .csv file in Excel. Missing values are blank, but I have a single entry in one column which looks blank, but is in fact a space. Using the standard command that I have been using for similar files produces this error:
raw.data <- read.csv("DATA_FILE.csv", header=TRUE, na.strings="", encoding="latin1")
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) :
invalid multibyte string at ' floo'
I have tried a few variations, adding arguments to the read.csv() command such as na.strings=c(""," ") and strip.white=TRUE, but these result in the exact same error.
It is a similar error to what you get when you use the wrong encoding option, but I am pretty sure that shouldn't be a problem here. I have of course tried manually removing the space (in Excel), and this works, but since I'm trying to write generic code for a Shiny tool, this is not really optimal.
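One variation worth trying (a sketch, since the offending byte isn't visible here): in read.csv, the encoding argument only tags strings after they are read, whereas fileEncoding re-encodes the connection itself, which is usually what Excel-exported CSVs on Windows (latin1/CP1252) need:

```r
raw.data <- read.csv("DATA_FILE.csv", header = TRUE,
                     fileEncoding = "latin1",   # re-encode the connection, not just tag strings
                     na.strings = c("", " "),   # treat lone spaces as NA too
                     strip.white = TRUE)
```

If the error persists, the "invalid multibyte string" is likely coming from a non-latin1 byte elsewhere in the file rather than from the space entry itself.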

Inconsistent results between fread() and read.table() for .tsv file in R

My question is in response to two issues I encountered in reading a published .tsv file containing campaign finance data.
First, the file has a null character that terminates input and throws the error 'embedded nul in string: 'NAVARRO b\0\023 POWERS' when using data.table::fread(). I understand that there are a number of potential solutions to this problem but I was hoping to find something within R. Having seen the skipNul option in read.table(), I decided to give it a shot.
That brings me to the second issue: read.table() with reasonable settings (comment.char = "", quote = "", fill = T) does not throw an error, but it also does not read the same number of rows that data.table::fread() identified (~100k rows with read.table() vs. ~8M rows with fread()). The fread() count seems more plausible, as the file is ~1.5GB and fread() finds valid data in the rows leading up to where the error occurs.
Here is a link to the code and output for the issue.
Any ideas on why read.table() is returning such different results? fread() operates by guessing characteristics of the input file but it doesn't seem to be guessing any exotic options that I didn't use in read.table().
Thanks for your help!
NOTE
I do not know anything about the file in question other than the source (the California Secretary of State, by the way) and what information it contains. At any rate, the file is too large to open in Excel or Notepad, so I haven't been able to examine it visually beyond looking at a handful of rows in R.
I couldn't figure out an R way to deal with the issue but I was able to use a python script that relies on pandas:
import pandas as pd
import os
os.chdir(path = "C:/Users/taylor/Dropbox/WildPolicy/Data/Campaign finance/CalAccess/DATA")
receipts_chunked = pd.read_table("RCPT_CD.TSV", sep = "\t", error_bad_lines = False, low_memory = False, chunksize = 5e5)
chunk_num = 0
for chunk in receipts_chunked:
    chunk_num = chunk_num + 1
    file_name = "receipts_chunked_" + str(chunk_num) + ".csv"
    print("Output file:", file_name)
    chunk.to_csv(file_name, sep = ",", index = False)
The problem with this route is that, with 'error_bad_lines = False', problem rows are simply skipped instead of erroring out. There are only a handful of error cases (out of ~8 million rows) but this is still suboptimal obviously.
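An R-only route that might avoid pandas entirely (an untested sketch against this particular file): readLines(skipNul = TRUE) drops nul bytes line by line, so the file can be streamed through a connection in chunks and each chunk handed to read.table(text = ...):

```r
# Stream the ~1.5 GB TSV in 500k-line chunks, skipping embedded nuls.
con <- file("RCPT_CD.TSV", open = "r")
header <- readLines(con, n = 1, skipNul = TRUE)
chunks <- list()
repeat {
  lines <- readLines(con, n = 5e5, skipNul = TRUE, warn = FALSE)
  if (length(lines) == 0) break
  chunks[[length(chunks) + 1]] <- read.table(
    text = c(header, lines), sep = "\t", header = TRUE,
    quote = "", comment.char = "", fill = TRUE,
    stringsAsFactors = FALSE)
}
close(con)
receipts <- do.call(rbind, chunks)
```

Unlike the error_bad_lines = False route in pandas, this keeps every line; malformed rows are padded by fill = TRUE rather than dropped, so they can be inspected afterwards.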

More problems with "incomplete final line"

This problem is similar to that seen here.
I have a large number of large CSVs which I am loading and parsing serially through a function. Many of these CSVs present no problem, but there are several which are causing problems when I try to load them with read.csv().
I have uploaded one of these files to a public Dropbox folder here (note that the file is around 10.4MB).
When I try to read.csv() that file, I get this warning:
In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on ...
And I cannot isolate the problem, despite scouring StackOverflow and R-help for solutions. Maddeningly, when I run
Import <- read.csv("http://dl.dropbox.com/u/83576/Candidate%20Mentions.csv")
using the Dropbox URL instead of my local path, it loads, but when I then save that very data frame and try to reload it thus:
write.csv(Import, "Test_File.csv", row.names = F)
TestImport <- read.csv("Test_File.csv")
I get the "incomplete final line" warning again.
So, I am wondering why the Dropbox-loaded version works, while the local version does not, and how I can make my local versions work -- since I have somewhere around 400 of these files (and more every day), I can't use a solution that can't be automated in some way.
In a related problem, perhaps deserving of its own question, it appears that some "special characters" break the read.csv() process, and prevent the loading of the entire file. For example, one CSV which has 14,760 rows only loads 3,264 rows. The 3,264th row includes this eloquent Tweet:
"RT #akiron3: ácÎå23BkªÐÞ'q(#BarackObama )nĤÿükTPP ÍþnĤüÈ’áY‹ªÐÞĤÿüŽ
\&’ŸõWˆFSnĤ©’FhÎåšBkêÕ„kĤüÈLáUŒ~YÒhttp://t.co/ABNnWfTN
“jg)(WˆF"
Again, given the serialized loading of several hundred files, how can I (a) identify what is causing this break in the read.csv() process, and (b) fix the problem with code, rather than by hand?
Thanks so much for your help.
1)
suppressWarnings(TestImport <- read.csv("Test_File.csv"))
2) Unmatched quotes are the most common cause of apparent premature closure. You could try adding all of these:
quote = "", na.strings = "", comment.char = ""
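Putting the two suggestions together (a sketch): suppress the harmless final-line warning and turn off quote and comment interpretation, so stray quote characters in tweets can't pair up across rows and swallow thousands of records:

```r
TestImport <- suppressWarnings(
  read.csv("Test_File.csv",
           quote = "",         # don't pair stray quotes across rows
           na.strings = "",
           comment.char = "")
)
```

Since the "incomplete final line" warning only means the file lacks a trailing newline, the suppressWarnings() wrapper is safe to automate across all 400 files.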
