read.csv - unknown character and embedded quotes - r

I have a .csv that causes different problems with read.table() and fread().
There is an unknown character that causes read.table() to stop (reminiscent of "read.csv stops reading at row 523924 even though the file has 799992 rows"). Excel, Notepad, and SAS System Viewer render it like a rightwards arrow (although if I use Excel's Insert Symbol to insert U+2192 it appears different); emacs renders it as ^Z.
fread() gets past the unknown character (bringing it in as \032), but there is another issue that prevents this from being the solution to my problem: the data set uses quotation marks as an abbreviation for inches, so there are embedded (and even mismatched) quotes.
Does anyone have any suggestions short of modifying the original .csv file, e.g., by globally replacing the strange arrow?
Thanks in advance!

In the case of Paul's file, I was able to read it (after some experimentation) using fread() with the cmd = "unzip -cq" and quote = "" parameters, without errors or warnings. I suppose this might work for Kristian's file as well.
On Windows, it may be necessary to install Rtools beforehand.
library(data.table) # development version 1.14.1 used
download.file("https://www.irs.gov/pub/irs-utl/extrfoia-master-dt2021-07-02.zip",
"extrfoia-master-dt2021-07-02.zip")
txt1 <- fread(cmd = "unzip -cq extrfoia-master-dt2021-07-02.zip", quote = "")
Caveat: this will download a file of about 38 MB.
According to the unzip man page, the -c option automatically performs ASCII-EBCDIC conversion.
The quote = "" was required because in at least one case a data field contained double quotes within the text.
I have also tried the -p option of unzip which extracts the data without conversion. Then, we can see that there is \032 embedded in the string.
txt2 <- fread(cmd = "unzip -p extrfoia-master-dt2021-07-02.zip", quote = "")
txt2[47096, 1:2]
CUST-ID LEGAL-NAME
1: 1253096 JOHN A. GIANNAKOP\032OULOS
The \032 does not appear in the converted version
txt1[47096, 1:2]
CUST-ID LEGAL-NAME
1: 1253096 JOHN A. GIANNAKOPOULOS
We can search for all occurrences of \032 in all character fields by
melt(txt2, id.vars = "CUST-ID", measure.vars = txt2[, names(.SD), .SDcols = is.character])[
  value %flike% "\032"][order(`CUST-ID`)]
CUST-ID variable value
1: 1253096 LEGAL-NAME JOHN A. GIANNAKOP\032OULOS
2: 2050751 DBA-NAME colbert ball tax tele\032hone rd
3: 2082166 LEGAL-NAME JUAN DE J. MORALES C\032TALA
4: 2273606 LEGAL-NAME INTRINSIC DM\032 INC.
5: 2300016 MAIL-ADDR1 PO BOX \03209
6: 2346154 LEGAL-NAME JOEL I GONZ\032LEZ-APONTE CPA
7: 2384445 LEGAL-NAME NUMBERS CAF\032 PLLC
8: 2518214 MAIL-ADDR1 556 W 800 N N\03211
9: 2518214 BUSN-ADDR1 556 W 800 N N\03211
10: 13718109 DBA-NAME World Harvest Financial Grou\032
11: 13775763 LEGAL-NAME Fiscally Responsible Consulting LLC\032
12: 13775763 DBA-NAME Fiscally Responsible Consulting LLC\032
This may help to identify the records of the file to fix manually.
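When the affected records are too numerous to fix by hand, a small preprocessing step can strip the \032 (SUB, 0x1A) bytes before reading. A minimal sketch (clean_sub_chars is a made-up helper; whether deleting the byte, rather than replacing it with a best-guess letter, is acceptable depends on the data):

```r
# Remove embedded SUB (0x1A, "\032") characters from a text file.
# Note: on Windows you may want to open connections in binary mode,
# since ^Z has historically been treated as EOF in text mode there.
clean_sub_chars <- function(infile, outfile, replacement = "") {
  lines <- readLines(infile, warn = FALSE)
  fixed <- gsub("\032", replacement, lines, fixed = TRUE)
  writeLines(fixed, outfile)
  invisible(sum(lines != fixed))  # number of lines that were changed
}

# Demonstration on a small in-memory example:
tmp_in  <- tempfile(fileext = ".csv")
tmp_out <- tempfile(fileext = ".csv")
writeLines(c("CUST-ID,LEGAL-NAME",
             "1253096,JOHN A. GIANNAKOP\032OULOS"), tmp_in)
clean_sub_chars(tmp_in, tmp_out)
readLines(tmp_out)[2]  # "1253096,JOHN A. GIANNAKOPOULOS"
```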

I hit this problem today, so it's still there in R 4.0.5.
The data I'm using is public, from the Internal Revenue Service. Somehow the unrecognized characters become "^Z" in the database. As far as I can tell, "^Z" gets inadvertently created when people enter characters that are not recognized by the original program that receives them. The IRS distributes a CSV file from the database.
In the example file I'm dealing with, there are 13 rows (out of 360,000) that have the ^Z in various spots. Manually deleting them one at a time lets read.table get a little further. I found no encoding setting in R that made a difference on this problem.
I found 2 solutions.
1. Get rid of the "^Z" symbol with text tools before using read.csv.
2. Switch to Python. The pandas function read_csv, with encoding set to "utf-8", correctly obtains all rows. However, in the resulting pandas.DataFrame, the unrecognized character remains in the data; it renders as an empty square.
If you want an example to explore, here's the address: https://www.irs.gov/pub/irs-utl/extrfoia-master-dt2021-07-02.zip. The first "^Z" you find is line 47096.
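An R-only variant of solution 1 (a sketch, not tested against the full IRS file; read_csv_skip_sub is a made-up helper) is to slurp the file in binary mode, so that the ^Z byte cannot derail the reader, drop the 0x1A bytes, and hand the cleaned text to read.csv():

```r
# Read a CSV while silently dropping embedded SUB (0x1A, ^Z) bytes.
read_csv_skip_sub <- function(path, ...) {
  raw_bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  raw_bytes <- raw_bytes[raw_bytes != as.raw(0x1A)]  # drop ^Z / \032
  read.csv(text = rawToChar(raw_bytes), ...)
}

# Demonstration:
tmp <- tempfile(fileext = ".csv")
writeBin(charToRaw("a,b\n1,JOHN A. GIANNAKOP\032OULOS\n"), tmp)
df <- read_csv_skip_sub(tmp, stringsAsFactors = FALSE)
df$b  # "JOHN A. GIANNAKOPOULOS"
```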

Related

Removing "NUL" characters (within R)

I've got a strange text file with a bunch of NUL characters in it (actually about 10 such files), and I'd like to programmatically replace them from within R. Here is a link to one of the files.
With the aid of this question I've finally figured out a better-than-ad-hoc way of going into each file and find-and-replacing the nuisance characters. It turns out that each pair of them should correspond to one space ([NUL][NUL] -> [space]) to maintain the intended line width of the file (which is crucial for reading it as fixed-width further down the road).
However, for robustness' sake, I'd prefer a more automatable approach, ideally (for organization's sake) something I could add at the beginning of an R script I'm writing to clean up the files. This question looked promising, but the accepted answer is insufficient: readLines throws an error whenever I try to use it on these files (unless I set skipNul = TRUE).
Is there any way to get the lines of this file into R so I could use gsub or whatever else to fix this issue without resorting to external programs?
You want to read the file as binary; then you can substitute the NULs, e.g. replacing them with spaces:
r = readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
r[r==as.raw(0)] = as.raw(0x20) ## replace with 0x20 = <space>
writeBin(r, "00staff.txt")
str(readLines("00staff.txt"))
# chr [1:155432] "000540952Anderson Shelley J FW1949 2000R000000000000119460007620 3 0007000704002097907KGKG1616"| __truncated__ ...
You could also substitute the NULs with a really rare character (such as "\01") and work on the string in place, e.g., let's say if you want to replace two NULs ("\00\00") with one space:
r = readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
r[r==as.raw(0)] = as.raw(1)
a = gsub("\01\01", " ", rawToChar(r), fixed=TRUE)
s = strsplit(a, "\n", TRUE)[[1]]
str(s)
# chr [1:155432] "000540952Anderson Shelley J FW1949 2000R000000000000119460007620 3 0007000704002097907KGKG1616"| __truncated__

read.table ignores some \t and \n in a table

I've encountered a strange problem when reading in an annotation file from an RNAseq experiment.
I am trying to read in the tab-separated file (http://we.tl/qCjv4N3LF2) and then search the annotations (in the fourth column) for the pattern "bahd", to find entries like "bahd acyltransferase dcr" and then display all the IDs (first column) that belong to these entries. The code is:
ShootAnnot<-read.table("annotation1.txt",sep="\t")
matches<-grep("bahd",ShootAnnot[,4],ignore.case=TRUE)
ShootAnnot[matches,1]
Weirdly, I noticed this did not find all the gene annotations that I know are there: only 9 of the 12 matches in the file. When I scanned the table for the missing entries, I found one line in the file where R seems to have failed to interpret the separators "\t" and "\n" for a stretch.
look at line 4825 in the dataset:
ShootAnnot[4825,]
For some reason, the sixth cell in that line contains a big chunk of data: many complete lines, with the appropriate "\t" and "\n" cell and line separators, all packed into one cell. After that point, cells and lines are separated correctly again.
I have got a bunch of these files, so I would like to make sure I can resolve any issues like that automatically. Any ideas what might be causing this?
Thanks!
I'm not sure why it goes haywire (maybe a DOS CR/LF issue), but the file is pretty big, and if you plug it into data.table you will get a pretty decent speedup just from reading the data.
library(data.table)
ShootAnnot <- fread("~/Downloads/annotation1.txt")
ShootAnnot[like(Blast2GO_GO_Description,"bahd"), "#ID", with=FALSE]
which will give you
#ID
1: c112902_g1_i1_m.105401
2: c11459_g1_i1_m.4290
3: c11459_g2_i1_m.4292
4: c186946_g1_i1_m.110882
5: c24956_g1_i1_m.8768
6: c265515_g1_i1_m.117383
7: c28096_g1_i1_m.10253
8: c37936_g1_i1_m.14867
9: c40683_g1_i1_m.17292
10: c54651_g1_i1_m.34709
11: c54651_g2_i1_m.34711
12: c921_g1_i1_m.351
(you don't have any non-lower-case "bahd"'s in your file)
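If you would rather pinpoint the broken physical lines than switch readers, count.fields() can flag rows whose field count deviates from the rest (with quoting disabled, so a stray quote cannot swallow several lines at once). A sketch, with flag_bad_lines a made-up helper and tab separation assumed:

```r
# Return the line numbers whose field count differs from the most common one.
flag_bad_lines <- function(path, sep = "\t") {
  n <- count.fields(path, sep = sep, quote = "", comment.char = "")
  expected <- as.integer(names(which.max(table(n))))
  which(n != expected)
}

# Demonstration: the fourth line has an extra field.
tmp <- tempfile()
writeLines(c("id\tdesc", "g1\tok", "g2\talso ok", "g3\toops\textra"), tmp)
flag_bad_lines(tmp)  # 4
```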

how to resolve read.fwf run time error: invalid multibyte string in R

I'm getting the following when I try to read in a fixed width text file using read.fwf.
Here is the output:
invalid multibyte string at 'ETE<52> O 19950207 19031103 537014290 7950 WILLOWS RD
Here are the most relevant lines of code
fieldWidths <- c(10,50,30,40,6,8,8,9,35,30,9,2)
colNames <- c("certNum", "lastN", "firstN", "middleN", "suffix", "daDeath", "daBirth", "namesSSN", "namesResStr", "namesResCity", "namesResZip", "namesStCode")
dmhpNameDF <- read.fwf(fileName, widths = fieldWidths, col.names=colNames, sep="", comment.char="", quote="", fileEncoding="WINDOWS-1258", encoding="WINDOWS-1258")
I'm running R 3.1.1 on Mac OSX 10.9.4
As you can see, I've experimented with specifying alternative encodings, I've tried latin1 and UTF-8 as well as WINDOWS-1250 through 1258
When I read this file into Excel, Word, or TextEdit, everything looks good in general. Using the text in the error message, I can identify the offending line as row 5496; on inspection, the offending character shows up as an italic-looking letter 'f'. Searching for that character reveals about 4 instances of it in this file. I have many such files to process, so going through them one by one to delete the offending character is not a good solution.
So far, the offending character has always shown up in a name field, which is fine for me because I don't actually need the name data from this file. If it were a numeric field that was corrupted, I'd have to toss out the row.
Since Word and Excel can read the file (apparently substituting an italic 'f' for the offending character), surely there must be a way to read it into R, but I've not figured out a solution. I have searched through the many questions about "invalid multibyte string" but have not found anything that resolved my problem.
My goal is to be able to read in the data either ignoring this "character error" or substituting the offending character with something else.
Unfortunately the file in question contains sensitive information so I can not post a copy of it for people to play with.
Thanks
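One avenue that might be worth trying (a sketch only, since the actual file cannot be shared for testing; read_fwf_clean is a made-up helper): read the file as raw bytes, use iconv() with its sub argument to replace byte sequences that are not valid in the assumed encoding, and then pass the sanitized text to read.fwf():

```r
read_fwf_clean <- function(path, widths, ..., bad_char = "?") {
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  # Treating the input as UTF-8 here is an assumption; sub = replaces
  # any byte sequences that are invalid in that encoding.
  txt <- iconv(rawToChar(bytes), from = "UTF-8", to = "UTF-8", sub = bad_char)
  read.fwf(textConnection(strsplit(txt, "\n", fixed = TRUE)[[1]]),
           widths = widths, ...)
}

# Demonstration with one invalid byte (0xC6) in a 5+4 fixed-width record:
tmp <- tempfile()
writeBin(c(charToRaw("ALPH"), as.raw(0xC6), charToRaw(" 123\n")), tmp)
df <- read_fwf_clean(tmp, widths = c(5, 4), col.names = c("name", "num"))
df  # name = "ALPH?", num = 123
```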

un-quote an R string?

TL;DR
I have a snippet of text
str <- '"foo\\dar embedded \\\"quote\\\""'
# cat(str, '\n') # gives
# "foo\dar embedded \"quote\""
# i.e. as if the above had been written to a CSV with quoting turned on.
I want to end up with the string:
str <- 'foo\\dar embedded "quote"'
# cat(str, '\n') # gives
# foo\dar embedded "quote"
essentially removing one "layer" of quoting. How may I do this?
(Initial attempt: eval(parse(text = str)), which works unless the string contains something like \\dar, in which case you get the error "'\d' is an unrecognized escape in character string ...".)
Gory details (optional)
The reason my strings are quoted once-too-many times is I kludged some data processing -- I wrote str (well, a dataframe in my case) to a table with quoting enabled, but forgot that many of the columns in my dataframe had embedded newlines with embedded quotes (i.e. forgot to escape/remove them).
It turns out that when I read.table a file with multiple columns in the same row that have embedded newlines and embedded quotes (or something like that), the function fails (fair enough).
I had since closed my R session, so my only access to my data was through the munged CSV. So I wrote some spaghetti code to simply readLines my CSV and split everything up to reconstruct my dataframe. However, since all my character columns were quoted in the CSV, a few columns in the restored dataframe are still quoted, and those are the ones I want to unquote.
Messy, I know. I'll remember to save an original version of the data next time (save, saveRDS).
For those interested, the header row and three rows of my CSV are shown below (all the characters are ASCII)
"quote";"id";"date";"author";"context"
"< mwk> I tried to fix the bug I mentioned, but I accidentally ascended the character I started for testing... hoped she'd die soon and I could get to coding, but alas I was wrong";"< mwk> I tried to fix the bug I mentioned, but I accidentally ascended the character I started for testing... hoped she'd die soon and I could get to coding, but alas I was wrong";"February 28, 2013";"nhqdb";"nhqdb"
"< intx14> \"A gush of water hits the air elemental on the central core!\"
< intx14> What is this, a weather forecast?";"< intx14> \"A gush of water hits the air elemental on the central core!\"
< intx14> What is this, a weather forecast?";"February 28, 2013";"nhqdb";"nhqdb"
"< bcode> n - a spherical amulet. You are lucky! Full moon tonight.
< bcode> That must be a sign - I'll put it on! What could possibly go wrong...
< oracle\devnull> DIED : bcode2 (Wiz-Elf-Mal-Cha) 0 points, killed by strangulation on pcs1.nethack.devnull.net";"< bcode> n - a spherical amulet. You are lucky! Full moon tonight.
< bcode> That must be a sign - I'll put it on! What could possibly go wrong...
< oracle\devnull> DIED : bcode2 (Wiz-Elf-Mal-Cha) 0 points, killed by strangulation on pcs1.nethack.devnull.net";"February 28, 2013";"nhqdb";"nhqdb"
The first two columns of each row are the same, being the quote (the first row has no embedded newlines in the quote; the second and third do). Separator is ';'.
> read.table('test.csv', sep=';', header=T)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 5 elements
# same error with allowEscapes=TRUE
Use regular expressions:
str <- gsub('^"|"$', '', gsub('\\\"', '"', str, fixed = TRUE))
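Checking this against the example string from the question:

```r
str <- '"foo\\dar embedded \\\"quote\\\""'
# Strip the escaped inner quotes first, then the outer quote pair.
unquoted <- gsub('^"|"$', '', gsub('\\\"', '"', str, fixed = TRUE))
cat(unquoted, '\n')
# foo\dar embedded "quote"
```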
[EDIT 3: the OP has posted three separate versions of this - two of them irreproducible, interspersed with complaining. Due to this timewasting behavior and several people downvoting, I'm leaving the original answer to version 2 of the question.]
EDIT 1: My solution to the second version of the OP's question was this:
txt <- read.csv('escaped.csv', header=T, allowEscapes=T, sep=';')
EDIT 2: We now get a third version. Finally some reproducible code after 36 minutes asking and waiting. Due to the behavior of the OP and other posters I'm not inclined to waste more time on this. I'm going to complain about both of your behavior on MSO. Downvote yourselves silly.
ORIGINAL:
gsub is the ugly way.
Use read.csv(..., allowEscapes=TRUE, quote=..., encoding=...) arguments. See the manpage, section on Encoding
If you want actual code, you need to give us a full line or two of your CSV file.
See also SO: "How to detect the right encoding for read.csv?"
Quoting the relevant part of your question:
The reason my strings are quoted once-too-many times is I kludged some
data processing -- I wrote str (well, a dataframe in my case) to a
table with quoting enabled, but forgot that many of the columns in my
dataframe had embedded newlines within quotes (i.e. forgot to
escape/remove them).
It turns out that when I read.table a file with multiple columns in
the same row that have embedded newlines within quotes, the function
fails (fair enough).

R read.table csv with classic-mac line endings

I have a comma-separated value file that looks like this when I open it in vim:
12,31,50,,12^M34,23,45,2,12^M12,31,50,,12^M34,23,45,2,12^M
and so forth. I believe this means my CSV uses CR-only (classic mac) line endings. R's read.table() function ostensibly requires LF line endings, or some variant thereof.
I know I can preprocess the file, and that's probably what I'll do.
That solution aside: is there a way to import CR files directly into R? For instance, write.table() has an "eol" parameter one can use to specify the line ending of outputs -- but I don't see a similar parameter for read.table() (cf. http://stat.ethz.ch/R-manual/R-patched/library/utils/html/read.table.html).
R will not recognize "^M" as anything useful. (I suppose it's possible that vim is just showing you a Ctrl-M as that character.) If that were in a text-connection stream, R would think it's not a valid escaped character, since "^" is not used for that purpose. You might need to do the pre-processing, unless you want to pass it through scan() and substitute using gsub():
subbed <- gsub("\\^M", "\n", scan(textConnection("12,31,50,,12^M34,23,45,2,12^M12,31,50,,12^M34,23,45,2,12^M"), what="character"))
Read 1 item
> read.table(text=subbed, sep=",")
V1 V2 V3 V4 V5
1 12 31 50 NA 12
2 34 23 45 2 12
3 12 31 50 NA 12
4 34 23 45 2 12
I suppose it's possible that you may need to use "\\m" as the pattern argument to gsub.
A further note: the help page for scan says: "Whatever mode the connection is opened in, any of LF, CRLF or CR will be accepted as the EOL marker for a line and so will match sep = '\n'." So the line endings (if that's what they are) should have been recognized, since read.table is based on scan. You should look at ?Quotes for information on escape characters.
If this vim tutorial is to be believed those may be DOS-related characters since it offers this advice:
Strip DOS ctrl-M's:
:1,$ s/{ctrl-V}{ctrl-M}//
There is an R-native solution that requires no preprocessing or external hacks: use the encoding argument of the read.table function and set it to "latin1" for the classic Mac character encoding.
For example, say your file in Mac (^M for return) format is saved as test.csv; load it as follows:
test <- read.table("./test.csv", sep=",", encoding="latin1")
To see what options you can pass to the encoding argument, type ?Encoding into the R interpreter; you will see that "latin1", "UTF-8", "bytes" and "unknown" are the supported encodings.
This is the best & cleanest way to do this.
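For completeness, another approach that needs no external preprocessing (a sketch, not from the original answers; read_cr_csv is a made-up helper) is to normalize the line endings in memory and pass the result to read.table() via text =:

```r
# Read a CSV with CR-only (or CRLF) line endings by normalizing to LF first.
read_cr_csv <- function(path, ...) {
  txt <- readChar(path, file.info(path)$size, useBytes = TRUE)
  txt <- gsub("\r\n?", "\n", txt)  # CRLF or lone CR -> LF
  read.table(text = txt, sep = ",", ...)
}

# Demonstration with classic-Mac (CR-only) line endings:
tmp <- tempfile()
writeChar("12,31,50,,12\r34,23,45,2,12\r", tmp, eos = NULL)
df <- read_cr_csv(tmp)
nrow(df)  # 2
```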
