data.table::fread and Unbalanced "

When I try to read a csv file using data.table::fread(fn, sep='\t', header=T), I get an "Unbalanced " observed on this line" error. The data has 3 integer variables and 1 string variable. The strings in the csv file are not enclosed in ", and yes, some lines contain " within the string variable, and those " characters are not paired.
Is it possible to make fread ignore the unpaired " in the variable and continue reading the data? Thanks.
Here is the sample data (just one record):
N_ID VISIT_DATE REQ_URL REQType
175931 2013-3-8 23:40:30 http://aaa.com/rest/api2.do?api=getSetMobileSession&data={"imei":"60893ZTE-CN13cd","appkey":"android_client","content":"Z0JiRA0qPFtWM3BYVltmcx5MWF9ZS0YLdW1ydXoqPycuJS8idXdlY3R0TGBtU 1

UPDATE: Now implemented in v1.8.11
From NEWS :
fread now accepts quotes (both ' and ") in the middle of fields,
whether the field starts with " or not, rather than the 'unbalanced
quotes' error, #2694. Thanks to baidao for reporting. It was known and
documented at the top of ?fread (text now removed). If a field starts
with " it must end with " (necessary if the field separator itself is in the
field contents). Embedded quotes can be in column names too. Newlines (\n)
still can't be in quoted fields or quoted column names, yet.
Yes, as @agstudy said, embedded quotes are a known, documented limitation that has not been addressed yet, since fread is new. Strictly speaking, though, I suppose these ones aren't embedded, because the string in your example doesn't start with a quote.
Anyway, I've filed this as a bug report so it doesn't get forgotten. It is to be done in the next release. Thanks for highlighting.
#2694 : Strings including quotes but not starting with quote in fread
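For anyone stuck on an older version, a base R fallback is to switch quote handling off altogether, which is safe here because the fields are tab-separated and never quoted. This is only a minimal sketch: the records below are shortened, made-up variants of the sample (the second row is hypothetical), and it uses read.table rather than fread.

```r
# Two shortened, hypothetical records: tab-separated, with unpaired double quotes.
txt <- paste(
  'N_ID\tVISIT_DATE\tREQ_URL\tREQType',
  '175931\t2013-3-8 23:40:30\thttp://aaa.com/rest/api2.do?data={"imei":"60893ZTE\t1',
  '175932\t2013-3-9 00:01:02\thttp://aaa.com/rest/api2.do?data={"appkey\t2',
  sep = "\n"
)

# quote = "" makes read.table treat " and ' as ordinary characters;
# comment.char = "" stops a stray # from truncating a line.
dat <- read.table(text = txt, sep = "\t", header = TRUE, quote = "",
                  comment.char = "", stringsAsFactors = FALSE)

print(dim(dat))
print(dat$REQ_URL)
```

The unpaired quotes stay in REQ_URL as literal characters, so nothing is lost or repaired; the parser just stops interpreting them.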

Related

R .csv not read in correctly because there are double quotes in the text

I have a .csv file that contains all text fields. However, some of the text fields contain an unescaped double quote character, eg:
"ID","Text","Optional text","Date"
"1","Today is going to be a good day","","2013-02-03"
"2","And I am inspired by the quote "every dog must have it's day"","Hi","2013-01-01"
"3","Did not the bard say All the World's a stage" this quote is so true","Terrible","2013-05-05"
Lines 1 and 2 are fine but 3 doesn't read in correctly. At the moment I am manually going through the file in Notepad++ to try and remove such quotes. Ideally I'd like R to be able to handle this but I think that the unescaped nature of the unmatched double quote makes such an expectation unreasonable.
In Notepad++ I am trying to build a regular expression to identify double quotes that are not preceded or succeeded by a comma. The logic is that a valid double quote will be at the start or end of a field and this is signified by an adjacent comma. This might help to identify the majority of my cases, which I can then deal with.
Just to say that I have about 3.4 million records and about 0.1% appear to be problematic.
EDIT:
fread from data.table has been suggested as an alternative, but use of fread is even less successful:
1: In fread(paste(infilename, "1", ".csv", sep = "")) :
Stopped early on line 21. Expected 18 fields but found 9. Consider fill=TRUE and comment.char=. First discarded non-empty line
Neither of the suggested options works. I think this is because the "Text" field can also contain CRLF characters. read.csv appears to just ignore these (good), whilst fread takes exception. Sorry that I cannot make the actual text available, but here is some more comprehensive test data that has both the unmatched double quote (which read.csv has issues with) and the CRLF (which fread has issues with).
"ID","Text","Optional text","Date"
"1","Today is going to be a good day","","2013-02-03"
"2","And I am inspired by the quote "every dog must have it's day"","Hi","2013-01-01"
"3","An issue with this line is that it contains a CRLF here
which is not usual.","Again an unusual CRLF
is present in these data","2013-02-02"
"4","Did not the bard say All the World's a stage" this quote is so true","Terrible","2013-05-05"
Help with the regex in Notepad++ would be great.
One option could be to use a conditional replacement in Notepad++.
First match every well-formed field: a double quote that comes right after a comma or at the start of the line,
then any characters other than a double quote, up to a closing double quote that is followed by a comma or the end of the line. Those fields are OK and are left alone. The other side of the alternation captures what you want to replace: a double quote that has no comma directly before or after it.
Find what:
(?:^|,)"[^"\n]*"(?=$|,)|(?<!,)(")(?!,)
Replace with:
A conditional replacement. If group 1, then replace with empty, else replace with the match.
(?{1}:$0)
Explanation
(?:^|,) Match either a comma or assert the start of the string
"[^"\n]*" Match the double quotes when there is no double quote in between
(?=$|,) Assert what is on the right is either the end of the string or a comma
| OR
(?<!,)(")(?!,) Capture a double quote in group 1 while asserting that what is on the left and on the right is not a comma
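If you would rather stay in R than clean the file in Notepad++, roughly the same idea can be sketched with gsub and perl = TRUE. Note that this is a simplified variant, not the conditional replacement above: it deletes any double quote that is neither at the start or end of a line nor next to a comma. It works line by line, so it assumes no field contains a CRLF, and it drops the stray quotes rather than escaping them.

```r
# Sample lines from the question (apostrophes escaped for the R string literals).
lines <- c(
  '"ID","Text","Optional text","Date"',
  '"1","Today is going to be a good day","","2013-02-03"',
  '"2","And I am inspired by the quote "every dog must have it\'s day"","Hi","2013-01-01"',
  '"3","Did not the bard say All the World\'s a stage" this quote is so true","Terrible","2013-05-05"'
)

# Remove a " unless it is at the line start, at the line end,
# directly after a comma, or directly before a comma.
stray_quote <- '(?<!^)(?<!,)"(?!,)(?!$)'
cleaned <- gsub(stray_quote, "", lines, perl = TRUE)

dat <- read.csv(text = cleaned, stringsAsFactors = FALSE)
print(dat$Text)
```

A stray quote that happens to sit next to a comma inside the text would still slip through, so it is worth checking the row count afterwards.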
Seems to work rather well with data.table::fread:
fread("E:/temp/test.txt")
# ID Text Optional text "Date"
#1: 1 Today is going to be a good day 2013-02-03
#2: 2 And I am inspired by the quote "every dog must have it's day" Hi 2013-01-01
#3: 3 Did not the bard say "All the World's a stage" this quote is so true Terrible 2013-05-05
#Warning message:
#In fread("E:/temp/test.txt") :
# Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.

Dealing with quotation marks in a quote-surrounded string

Take this CSV file:
ID,NAME,VALUE
1,Blah,100
2,"Has space",200
3,"Ends with quotes"",300
4,""Surrounded with quotes"",300
It loads just fine in most statistical programs (R, SAS, etc.) but in Excel the third row is misinterpreted because it has two quotation marks. Escaping the last quote as \" will also not work in Excel. The only way I have found so far is to replace the one double quote with two double quotes:
ID,NAME,VALUE
1,Blah,100
2,"Has space",200
3,"Ends with quotes""",300
4,"""Surrounded with quotes""",300
But that would render the file completely useless for all other programs (R, SAS, etc.)
Is there a way to format the CSV file where strings can begin or end with the same characters as that used to surround them, such that it would work in Excel as well as commonly used statistical software?
Your second representation is the normal way to generate a CSV file and so should be easy to work with in any software. See the RFC 4180 specifications. https://www.ietf.org/rfc/rfc4180.txt
So your second example represents this data:
Obs id name value
1 1 Blah 100
2 2 Has space 200
3 3 Ends with quotes" 300
4 4 "Surrounded with quotes" 300
If you want to represent it as a delimited file where none of the values are allowed to contain the delimiter (in other words, NOT as a standard CSV file), then it would look like:
id,name,value
1,Blah,100
2,Has space,200
3,Ends with quotes",300
4,"Surrounded with quotes",300
But if you want to allow the values to contain the delimiter, then you need some way to distinguish embedded delimiters from real delimiters. So the standard forces values that contain the delimiter to be quoted. But once you do that, you also need to add quotes around fields that contain the quote character itself (and double the embedded quotes) to avoid making the file ambiguous. For example, the quotes in the 4th observation in your first file look like optional quotes around a value instead of part of the value.
Many programs try to handle ambiguous situations. For example, SAS does not allow values to contain embedded line breaks, so you will always get four observations with your first example file.
But Excel allows the end-of-line character(s) to be embedded inside quoted values. So in your original file, the value of the second field in the third observation looks like what you would start to get if you added quotes around this value:
Ends with quotes",300
4,"Surrounded with quotes",300
So instead of 4 complete observations of three field values each, there are only three observations, and the last observation has only two field values.
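To check that the doubled-quote form is not useless outside Excel, here is a small sketch that feeds the second representation to base R's read.csv. The data is taken from the question; only the variable names are mine.

```r
# The RFC 4180 style file from the question: embedded quotes are doubled.
csv_text <- paste(
  'ID,NAME,VALUE',
  '1,Blah,100',
  '2,"Has space",200',
  '3,"Ends with quotes""",300',
  '4,"""Surrounded with quotes""",300',
  sep = "\n"
)

# With a non-default sep, read.csv treats "" inside a quoted field
# as one literal double quote.
dat <- read.csv(text = csv_text, stringsAsFactors = FALSE)
print(dat$NAME)
```

The NAME column should come back with the literal quotes restored, matching the table of observations shown above.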
This is caused by the fact that the escape character for " in Excel is "" (see: Escaping quotes and delimiters in CSV files with Excel).
A quick and simple workaround that comes to mind in R is to first read the content of the csv with readLines, then replace each doubled (escaped) double quote with a single double quote, and then call read.table:
read.table(
text = gsub(pattern = "\"\"", "\"", readLines("data.csv")),
sep = ",",
header = TRUE
)

Read csv file in R with double quotes

Suppose I have a csv file looks like this:
Type,ID,NAME,CONTENT,RESPONSE,GRADE,SOURCE
A,3,"","I have comma, ha!",I have open double quotes",A,""
desired output should be:
df <- data.frame(Type='A',ID=3, NAME=NA, CONTENT='I have comma, ha!',
RESPONSE='I have open double quotes\"', GRADE='A', SOURCE=NA)
df
Type ID NAME CONTENT RESPONSE GRADE SOURCE
1 A 3 NA I have comma, ha! I have open double quotes" A NA
I tried to use read.csv. The data provider uses quotes to protect commas inside a string, but they forgot to escape double quotes in strings that have no comma, so whether or not I disable quote in read.csv, I won't get the desired output.
How can I do this in R? Other package solutions are also welcome.
fread from data.table handles this just fine:
library(data.table)
fread('Type,ID,NAME,CONTENT,RESPONSE,GRADE,SOURCE
A,3,"","I have comma, ha!",I have open double quotes",A,""')
# Type ID NAME CONTENT RESPONSE GRADE SOURCE
#1: A 3 I have comma, ha! I have open double quotes" A
I'm not too sure about the structure of CSV files, but you said the author had escaped the comma in the text under content.
This works to read the text as is with the " at the end.
read.csv2("Test.csv", header = T,sep = ",", quote="")
This is not valid CSV, so you'll have to do your own parsing. But, assuming the convention is as follows, you can toggle scan's quote argument field by field and still take advantage of most of its abilities:
If the field starts with a quote, it is quoted.
If the field does not start with a quote, it is raw.
next_field <- function(stream) {
  # Peek at the next character, then rewind so scan sees the whole field
  p <- seek(stream)
  d <- readChar(stream, 1)
  seek(stream, p)
  if (d == "\"")
    field <- scan(stream, "", 1, sep = ",", quote = "\"", blank = FALSE)
  else
    field <- scan(stream, "", 1, sep = ",", quote = "", blank = FALSE)
  return(field)
}
Assuming the above convention, this is sufficient to parse as follows:
s<-file("example.csv",open="rt")
header<-readLines(s,1)
header<-scan(what="",text=header,sep=",")
line<-replicate(length(header),next_field(s))
setNames(as.data.frame(lapply(line,type.convert)),header)
Type ID NAME CONTENT RESPONSE GRADE SOURCE
1 A 3 NA I have comma, ha! I have open double quotes" A NA
However, in practice you might want to first write back the fields, quoting each, to another file, so you can just read.csv on the corrected format.
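As a rough sketch of that last step, assuming the fields have already been parsed into a data frame (the `parsed` object and the temporary file here are hypothetical stand-ins), write.csv quotes every character field and doubles embedded quotes by default, so a plain read.csv can read the result back:

```r
# Hypothetical result of the custom parsing step above.
parsed <- data.frame(
  Type = "A", ID = 3L, NAME = NA,
  CONTENT = "I have comma, ha!",
  RESPONSE = 'I have open double quotes"',
  GRADE = "A", SOURCE = NA,
  stringsAsFactors = FALSE
)

# write.csv quotes character columns and doubles any embedded ",
# producing a file that follows the usual CSV convention.
tmp <- tempfile(fileext = ".csv")
write.csv(parsed, tmp, row.names = FALSE)

fixed <- read.csv(tmp, stringsAsFactors = FALSE)
print(fixed$RESPONSE)
```

After that one-off rewrite, any later read of the corrected file needs no custom parser.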

Treating "#" as a regular character when reading data

I'm almost certain this has been asked before, but due to a certain social media app I'm drowning in unrelated search results.
The data set that I'm importing contains actual "#" characters, as in Apartment #404, and I'd like to preserve them if possible, but R thinks it's an end of line or something. At first the import would bomb out on the first occurrence; then I set fill=TRUE, and now it just ignores the rest of the line after the "#".
How does one instruct R to treat #'s as regular characters?
If you are not using "#" as a comment symbol in your data, you can use
read.table(..., comment.char="")
That should treat "#" like any other character.
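A small illustration with made-up data (the names and addresses are hypothetical), comparing the default behaviour with comment.char = "":

```r
# Hypothetical comma-separated data containing literal # characters.
txt <- "name,address\nAnn,Apartment #404\nBob,Suite #12"

# Default: comment.char = "#", so everything from # onward is dropped.
truncated <- read.table(text = txt, sep = ",", header = TRUE,
                        stringsAsFactors = FALSE)

# comment.char = "" turns comment handling off, so # is kept.
kept <- read.table(text = txt, sep = ",", header = TRUE,
                   comment.char = "", stringsAsFactors = FALSE)

print(truncated$address)
print(kept$address)
```

Note that read.csv already sets comment.char = "" by default, so the problem mostly shows up with read.table and read.delim-style calls that override it.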

Copy to without quotes

I have a large dataset in dbf file and would like to export it to the csv type file.
Thanks to SO already managed to do it smoothly.
However, when I try to import it into R (the environment I work in), it merges some fields together, making some rows much longer than they should be and consequently breaking the whole database. In the end, whenever I import the exported csv file I get only half of the db.
I think the main problem is with quotes in the character fields, but specifying quote="" in R didn't help (and it usually does).
I've searched for any question on how to deal with quotes when exporting from Visual FoxPro, but couldn't find an answer. I wanted to test this, but my computer throws an error stating that I don't have enough memory to complete the operation (probably due to the large db).
Any help will be highly appreciated. I've been stuck on this problem of exporting from the dbf into R for long enough; I've searched everywhere I could and am desperately looking for a simple way to import a large dbf into my R environment without any bugs.
In R: I checked the imported file, and indeed most columns have much longer nchars than they should, while the number of rows is halved. Reading the db with read.csv("file.csv", quote="") didn't help. Reading with data.table::fread() returns the error
Expected sep (',') but '0' ends field 88 on line 77980:
But according to verbose=T, this function reads the right number of rows (read.csv imports only about 1.5 million rows):
Count of eol after first data row: 2811729 Subtracted 1 for last eol
and any trailing empty lines, leaving 2811728 data rows
When exporting to TYPE DELIMITED, you have some control on the VFP side over how the export formats the output file.
To change the field delimiter (the character that encloses character fields) from quotes to, say, a pipe character, you can do:
copy to myfile.csv type delimited with "|"
so that will produce something like:
|A001|,|Company 1 Ltd.|,|"Moorfields"|
You can also change the separator from a comma to another character:
copy to myfile.csv type delimited with "|" with character "#"
giving
|A001|#|Company 1 Ltd.|#|"Moorfields"|
That may help in parsing on the R side.
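On the R side, a file exported that way could be read roughly as follows. This is only a sketch: the first row is the sample output above, the second row is made up to show an embedded comma, and it uses base read.table rather than fread.

```r
# First row is the sample export line; the second row is hypothetical.
txt <- c(
  '|A001|#|Company 1 Ltd.|#|"Moorfields"|',
  '|A002|#|Company 2, Inc.|#|Main Street|'
)

# sep = "#" is the separator chosen in the export, quote = "|" the
# enclosing delimiter; comment.char = "" stops # being read as a comment.
dat <- read.table(text = txt, sep = "#", quote = "|", comment.char = "",
                  header = FALSE, stringsAsFactors = FALSE)

print(dat)
```

Because " is no longer a quoting character, the double quotes around Moorfields come through as ordinary text, and the comma in the second row does not split the field.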
There are three ways to delimit a string in VFP: single quotes, double quotes, or square brackets. So to strip quotes out of character fields myfield1 and myfield2 in your DBF file, you could do this in the Command Window:
close all
use myfile
copy to mybackupfile
select myfile
replace all myfield1 with chrtran(myfield1,["'],"")
replace all myfield2 with chrtran(myfield2,["'],"")
and repeat for other fields and tables.
You might have to write code to do the export, rather than simply using the COPY TO ... DELIMITED command.
SELECT thedbf
mfld_cnt = AFIELDS(mflds)
fh = FOPEN(m.filename, 1)
SCAN
    FOR aa = 1 TO mfld_cnt
        mcurfld = 'thedbf.' + mflds[aa, 1]
        mvalue = &mcurfld
        ** Or you can use:
        mvalue = EVAL(mcurfld)
        ** manipulate the contents of mvalue, possibly based on the field type
        DO CASE
        CASE mflds[aa, 2] = 'D'
            mvalue = DTOC(mvalue)
        CASE mflds[aa, 2] $ 'CM'
            ** Replace characters that are giving you problems in R
            mvalue = STRTRAN(mvalue, ["], '')
        OTHERWISE
            ** Etc.
        ENDCASE
        = FWRITE(fh, mvalue)
        IF aa # mfld_cnt
            = FWRITE(fh, [,])
        ENDIF
    ENDFOR
    = FWRITE(fh, CHR(13) + CHR(10))
ENDSCAN
= FCLOSE(fh)
Note that I'm using [ ] characters to delimit strings that include commas and quotation marks. That helps readability.
*create a comma delimited file with no quotes around the character fields
copy to TYPE DELIMITED WITH "" (2 double quotes)

Resources