R read_delim recognizing pipe delimiter inconsistently

I have run into a problem while importing a pipe-delimited file. The file is consistently delimited, but something is preventing R from recognizing some of the delimiters while parsing: R reads in 10 columns when there should be 11, even though the appropriate number of pipes is in place.
A very small sample of the data can be found here: https://drive.google.com/file/d/1ek6-H5EWKCaPfDTfB2muqYBjJz1fM3pf/view
library(readr)
dat <- read_delim("~/Desktop/foo.txt", delim = "|", col_names = TRUE)
I've tried playing around with how R treats the quotes: quote = "\"" did nothing to help, and ignoring the quotes with quote = "" made an even bigger mess of the import.
Any thoughts on how to fix the problem?

Feel free to use fread() from the data.table package, as below.
library(data.table)
FOO3 <- fread("~/Downloads/foo.txt", sep = "|", fill = TRUE)
This gave me a clean import of the dataset.
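Before switching readers, it can also help to confirm where the parsing diverges; counting fields per line with and without quote handling usually exposes the rows where an unbalanced quote swallows a pipe. A minimal sketch, assuming the path from the question:
# Count how many fields each line yields with and without quote handling.
# Rows where the two counts differ (or come out NA) are the ones where a
# stray quote is hiding a "|" delimiter.
path <- "~/Desktop/foo.txt"
with_quotes    <- count.fields(path, sep = "|", quote = "\"", comment.char = "")
without_quotes <- count.fields(path, sep = "|", quote = "",   comment.char = "")
which(with_quotes != without_quotes | is.na(with_quotes))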

Related

Importing a huge csv file while fread doesn't work in R

I want to import a big csv file into R (approximately 14 million rows and 13 columns), so I tried to use fread with the following code:
my_data <- fread(my_file,
                 sep = ";",
                 header = TRUE,
                 na.strings = c("", " ", "NA"),
                 quote = "",
                 fill = TRUE,
                 check.names = FALSE,
                 stringsAsFactors = FALSE)
However, I got the following error:
Error in fread(path_alertes_profil, sep = ";", header = TRUE, na.strings = c("", :
Expecting 13 cols, but line 18533 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=';' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
Therefore I tried to import my file with the read_delim function from the readr package, using the same parameters. It seemed to work, since my file appeared in the global environment (I'm working in RStudio). However, it only got 741629 rows instead of the 14+ million.
How can I solve this problem? (I tried to find a solution for the fread() error but didn't find any useful resource.)
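One way to narrow this down without loading the whole file is to inspect the raw lines around the row fread complains about and count the separators; a minimal sketch, assuming my_file from the code above (the error's line number may be offset by the header):
# Pull the lines around row 18533 that fread complains about and count the
# ";" characters on each; a line with an unexpected count usually points to
# an unescaped quote or an embedded newline in a field.
raw_lines <- readLines(my_file, n = 18540)
window <- raw_lines[18528:18538]
lengths(gregexpr(";", window, fixed = TRUE))  # 13 columns should show 12 separators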

Is there a sed type package in R for removing embedded NULs?

I am processing the US Weather Service Storm Data, which has one large CSV data file for each year from 1950 onwards. The 1999 file contains several rows with very large freeform text fields which contain embedded NUL characters, in an otherwise vanilla ASCII database. (The offending file is at ftp://ftp.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1999_c20140915.csv.gz).
R cannot handle corrupted string data without errors, and this includes base R data.frame functions as well as the data.table, stringr, and stringi packages (all tried).
I can clean the files of NULs with sed, but I would prefer not to use external programs, as this is for an R markdown type report with embedded code.
Suggestions?
Maybe this could be of help:
in.file <- file(description = "StormEvents_details-ftp_v1.0_d1999_c20140915.csv",
                open = "r")
writeLines(iconv(readLines(in.file), to = "ASCII"),
           con = "StormEvents_ascii.csv")
I was able to read the csv file without errors with this call to read.table:
options(stringsAsFactors = FALSE)
StormEvents <- read.table("StormEvents_ascii.csv", header = TRUE,
                          sep = ",", fill = TRUE, quote = '"')
Obviously you'd need to change the class of several columns, since all are considered character as it is.
Just for posterity - you can use binary reads (readBin()) and replace the NULs with anything else - see
Removing "NUL" characters (within R)
An update for May 2020: the tidyverse and data.table both still choke on NUL characters within files; however, the base::read.*() family and readLines() will gracefully skip them with the skipNul = TRUE option. You can read a file in, skipping over the NUL characters, and then write it back out again.
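For concreteness, a minimal sketch of that skipNul round trip (the output file name is just a placeholder):
# Read the file while silently dropping embedded NUL bytes, then write a
# clean copy that fread()/readr can parse normally.
clean <- readLines("StormEvents_details-ftp_v1.0_d1999_c20140915.csv",
                   skipNul = TRUE)
writeLines(clean, "StormEvents_noNUL.csv")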

Inconsistent results between fread() and read.table() for .tsv file in R

My question is in response to two issues I encountered while reading a published .tsv file that contains campaign finance data.
First, the file has a null character that terminates input and throws the error 'embedded nul in string: 'NAVARRO b\0\023 POWERS' when using data.table::fread(). I understand that there are a number of potential solutions to this problem but I was hoping to find something within R. Having seen the skipNul option in read.table(), I decided to give it a shot.
That brings me to the second issue: read.table() with reasonable settings (comment.char = "", quote = "", fill = TRUE) does not throw an error, but it also does not read the same number of rows that data.table::fread() identified (~100k rows with read.table() vs. ~8M rows with fread()). The fread() count seems more plausible, since the file is ~1.5 GB and fread() identifies valid data when reading the rows leading up to where the error seems to occur.
Here is a link to the code and output for the issue.
Any ideas on why read.table() is returning such different results? fread() operates by guessing characteristics of the input file but it doesn't seem to be guessing any exotic options that I didn't use in read.table().
Thanks for your help!
NOTE
I do not know anything about the file in question other than the source and what information it contains. The source is the California Secretary of State, by the way. At any rate, the file is too large to open in Excel or Notepad, so I haven't been able to examine it visually beyond looking at a handful of rows in R.
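A quick sanity check on which row count to trust is to count physical lines independently of either parser; a minimal sketch (the RCPT_CD.TSV name is taken from the script below, and skipNul avoids the embedded-NUL error):
# Count raw lines without parsing any fields, so the result does not depend
# on quote or separator handling. Subtract 1 for the header row.
n_lines <- length(readLines("RCPT_CD.TSV", skipNul = TRUE))
n_lines  # compare against ~100k (read.table) and ~8M (fread)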
I couldn't figure out an R way to deal with the issue, but I was able to use a Python script that relies on pandas:
import pandas as pd
import os

os.chdir(path = "C:/Users/taylor/Dropbox/WildPolicy/Data/Campaign finance/CalAccess/DATA")

receipts_chunked = pd.read_table("RCPT_CD.TSV", sep = "\t", error_bad_lines = False,
                                 low_memory = False, chunksize = 5e5)

chunk_num = 0
for chunk in receipts_chunked:
    chunk_num = chunk_num + 1
    file_name = "receipts_chunked_" + str(chunk_num) + ".csv"
    print("Output file:", file_name)
    chunk.to_csv(file_name, sep = ",", index = False)
The problem with this route is that, with 'error_bad_lines = False', problem rows are simply skipped instead of erroring out. There are only a handful of error cases (out of ~8 million rows) but this is still suboptimal obviously.
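For an R-only route, the readBin() suggestion from the storm-data question above can be applied here too: read the file as raw bytes, drop the NULs, write a clean copy, and then let fread() parse it. A minimal sketch (the cleaned file name is a placeholder, and the whole file is held in memory, so this needs a few GB of RAM):
# Strip NUL (0x00) bytes from the raw file, then parse the cleaned copy.
raw_bytes   <- readBin("RCPT_CD.TSV", what = "raw",
                       n = file.info("RCPT_CD.TSV")$size)
clean_bytes <- raw_bytes[raw_bytes != as.raw(0)]
writeBin(clean_bytes, "RCPT_CD_clean.TSV")
receipts <- data.table::fread("RCPT_CD_clean.TSV", sep = "\t", quote = "")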

Getting rid of BOM between SAS and R

I used SAS to save a tab-delimited text file with UTF-8 encoding on a Windows machine. Then I tried to open it in R:
read.table(myfile, header = TRUE, sep = "\t")
To my surprise, the data was totally messed up, but only in a sneaky way. Number values changed randomly, but the overall layout looked normal, so it took me a while to notice the problem, which I'm assuming now is the BOM.
This is not a new issue of course; they address it briefly here, and recommend using
read.table(myfile, fileEncoding = "UTF-8", header = TRUE, sep = "\t")
However, this made no improvement! My only solution was to suppress the header, with or without the fileEncoding argument:
read.table(myfile, fileEncoding = "UTF-8", header = FALSE, sep = "\t")
read.table(myfile, header = FALSE, sep = "\t")
In either case, I have to do some funny business to replace the column names with the first row, but only after I remove some version of the BOM that appears at the beginning of the first column name (<U+FEFF> if I use fileEncoding and ï»¿ if I don't use fileEncoding).
Isn't there a simple way to just remove the BOM and use read.table without any special arguments?
Update for @Joe:
The SAS that I used:
FILENAME myfile 'C:\Documents ... file.txt' encoding="utf-8";
proc export data=lib.sastable
outfile=myfile
dbms=tab replace;
putnames=yes;
run;
Update on further weirdness: Using fileEncoding="UTF-8-BOM" as @Joe suggested in his solution below seems to remove the BOM. However, it did not fix my original motivating problem, which is the corruption in the data: the header row is fine, but weirdly the last few digits of the first column of numbers get messed up. I'll give Joe credit for his answer; maybe my problem is not actually a BOM issue?
Hack solution: Use fileEncoding="UTF-8-BOM" AND also include the argument colClasses = "character". No idea why this works to fix the data corruption issue -- could be the topic of a future question.
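For reference, a minimal sketch of that hack, with type.convert() used afterwards to restore the numeric columns (the call mirrors the question's read.table() usage, so myfile is assumed to hold the path):
# Read every column as character so no numeric parsing happens during import,
# then let type.convert() re-derive the column classes.
dat <- read.table(myfile, fileEncoding = "UTF-8-BOM", header = TRUE,
                  sep = "\t", colClasses = "character")
dat[] <- lapply(dat, type.convert, as.is = TRUE)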
As per your link, it looks like it works for me with:
read.table('c:\\temp\\testfile.txt', fileEncoding = 'UTF-8-BOM', header = TRUE, sep = '\t')
note the -BOM in the file encoding.
This is in 2.1 Variations on read.table in the R documentation. Under 12 Encoding, see "Under UNIX you might need...", which apparently applies even on Windows now (for me, at least).
Or you can use the SAS system option NOBOMFILE to write a UTF-8 file without the BOM.

reading big data file using cbc.read.table

I'm trying to read a giant data frame with cbc.read.table():
my.df <- cbc.read.table("df.csv", sep = ";", header = FALSE)
This is what I get:
Error in cbc.read.table("2012Q2.csv", sep = "|", header = F) :
No rows to read
The working directory is set correctly. In principle it works using read.table(), except that read.table() doesn't read in all of the lines (about two million).
Has anybody an idea what I can do about this?
SOLUTION:
Hi again, the following thread helped me out:
R: Why does read.table stop reading a file?
The problem was caused by quotation marks, probably because some of them were not closed. I simply used an editor and deleted all double and single quotation marks as well as all hash marks, and it's working now.
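The same clean-up can be done from R rather than an external editor; a minimal sketch, assuming the file fits in memory and that none of the quotes or hash marks carry meaning in the data:
# Drop double quotes, single quotes, and hash marks line by line, then write
# a cleaned copy for cbc.read.table() to parse.
lines <- readLines("df.csv")
lines <- gsub("[\"'#]", "", lines)
writeLines(lines, "df_clean.csv")
my.df <- cbc.read.table("df_clean.csv", sep = ";", header = FALSE)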
@Anthony: Thanks for your question. I noticed that the problem did not occur in the first three lines, which is why I got the idea that it's an issue with the file. Thanks!
Paul
