Dealing with quotation marks in a quote-surrounded string - R

Take this CSV file:
ID,NAME,VALUE
1,Blah,100
2,"Has space",200
3,"Ends with quotes"",300
4,""Surrounded with quotes"",300
It loads just fine in most statistical programs (R, SAS, etc.), but in Excel the third row is misinterpreted because its value ends with two quotation marks. Escaping the last quote as \" also does not work in Excel. The only way I have found so far is to replace the single double quote with two double quotes:
ID,NAME,VALUE
1,Blah,100
2,"Has space",200
3,"Ends with quotes""",300
4,"""Surrounded with quotes""",300
But that would render the file completely useless for all other programs (R, SAS, etc.).
Is there a way to format the CSV file where strings can begin or end with the same characters as that used to surround them, such that it would work in Excel as well as commonly used statistical software?

Your second representation is the normal way to generate a CSV file, so it should be easy to work with in any software. See the RFC 4180 specification: https://www.ietf.org/rfc/rfc4180.txt
So your second example represents this data:
Obs id name value
1 1 Blah 100
2 2 Has space 200
3 3 Ends with quotes" 300
4 4 "Surrounded with quotes" 300
If you want to represent it as a delimited file where none of the values are allowed to contain the delimiter (in other words, NOT as a standard CSV file), then it would look like:
id,name,value
1,Blah,100
2,Has space,200
3,Ends with quotes",300
4,"Surrounded with quotes",300
But if you want to allow the values to contain the delimiter then you need some way to distinguish embedded delimiters from real delimiters. So the standard forces values that contain the delimiter to be quoted. But once you do that, you also need to add quotes around fields that contain the quote character itself (and double the embedded quotes) to avoid producing an ambiguous file. For example, the quotes in the 4th observation in your first file look like optional quotes around a value rather than part of the value.
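R's write.csv follows this convention: character fields are quoted and any embedded quotes are doubled. A minimal sketch (the data frame is just an illustration):
df <- data.frame(id = 4, name = "\"Surrounded with quotes\"", value = 300)
write.csv(df, row.names = FALSE)
# "id","name","value"
# 4,"""Surrounded with quotes""",300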
Many programs try to handle ambiguous situations. For example, SAS does not allow values to contain embedded line breaks, so you will always get four observations from your first example file.
But Excel allows the end-of-line character(s) to be embedded inside quoted values. So in your original file, Excel reads the second field of the third observation as the start of a quoted value that continues past the line break and swallows the following text:
Ends with quotes",300
4,"Surrounded with quotes",300
So instead of four complete observations with three field values each, there are only three observations, and the last observation has only two field values.

This is caused by the fact that the escape character for " in Excel-style CSV is "" (see: Escaping quotes and delimiters in CSV files with Excel).
A quick and simple workaround that comes to mind in R is to first read the content of the csv with readLines, then replace the doubled (escaped) double quotes with single double quotes, and then parse with read.table:
read.table(
  text = gsub(pattern = "\"\"", replacement = "\"", readLines("data.csv")),
  sep = ",",
  header = TRUE
)
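One caveat: the global replacement also turns an empty quoted field ("") into a single stray quote, e.g. 2,"",200 becomes 2,",200, so this shortcut is only safe when the file contains no empty quoted fields.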

Related

R .csv not read in correctly because there are double quotes in the text

I have a .csv file that contains all text fields. However, some of the text fields contain an unescaped double quote character, e.g.:
"ID","Text","Optional text","Date"
"1","Today is going to be a good day","","2013-02-03"
"2","And I am inspired by the quote "every dog must have it's day"","Hi","2013-01-01"
"3","Did not the bard say All the World's a stage" this quote is so true","Terrible","2013-05-05"
Lines 1 and 2 are fine but 3 doesn't read in correctly. At the moment I am manually going through the file in Notepad++ to try and remove such quotes. Ideally I'd like R to be able to handle this but I think that the unescaped nature of the unmatched double quote makes such an expectation unreasonable.
In Notepad++ I am trying to build a regular expression to identify double quotes that are not preceded or succeeded by a comma. The logic is that a valid double quote will be at the start or end of a field and this is signified by an adjacent comma. This might help to identify the majority of my cases, which I can then deal with.
Just to say that I have about 3.4 million records and about 0.1% appear to be problematic.
EDIT:
fread from data.table has been suggested as an alternative, but use of fread is even less successful:
1: In fread(paste(infilename, "1", ".csv", sep = "")) :
Stopped early on line 21. Expected 18 fields but found 9. Consider fill=TRUE and comment.char=. First discarded non-empty line
Neither of the suggested options works. I think this is because the "Text" field can also contain CRLF characters. read.csv appears to just ignore these (good) whilst fread takes exception. Sorry that I cannot make the actual text available, but here is some more comprehensive test data that has both the unmatched double quote (which read.csv has issues with) and CRLF (which fread has issues with).
"ID","Text","Optional text","Date"
"1","Today is going to be a good day","","2013-02-03"
"2","And I am inspired by the quote "every dog must have it's day"","Hi","2013-01-01"
"3","An issue with this line is that it contains a CRLF here
which is not usual.","Again an unusual CRLF
is present in these data","2013-02-02"
"4","Did not the bard say All the World's a stage" this quote is so true","Terrible","2013-05-05"
Help with the regex in Notepad++ would be great.
Perhaps one option could be to use a conditional replacement in Notepad++.
You could match all the strings that start with a double quote preceded by a comma or by the start of the line, then match anything that is not a double quote until you encounter the next double quote followed by a comma or by the end of the line. Those parts are OK, so for the other alternative, the part you want to capture and replace, match a double quote that is not adjacent to a comma.
Find what:
(?:^|,)"[^"\n]*"(?=$|,)|(?<!,)(")(?!,)
Replace with:
A conditional replacement. If group 1, then replace with empty, else replace with the match.
(?{1}:$0)
Explanation
(?:^|,) Match either a comma or assert the start of the string
"[^"\n]*" Match the double quotes when there is no double quote in between
(?=$|,) Assert what is on the right is either the end of the string or a comma
| OR
(?<!,)(")(?!,) Capture a double quote in group 1 while asserting that what is on the left and on the right is not a comma
After that replacement, the file seems to work rather well with data.table::fread:
fread("E:/temp/test.txt")
# ID Text Optional text "Date"
#1: 1 Today is going to be a good day 2013-02-03
#2: 2 And I am inspired by the quote "every dog must have it's day" Hi 2013-01-01
#3: 3 Did not the bard say "All the World's a stage" this quote is so true Terrible 2013-05-05
#Warning message:
#In fread("E:/temp/test.txt") :
# Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
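If you'd rather do the cleanup in R than in Notepad++, the same pattern can be combined with PCRE's backtracking verbs, which R supports via perl = TRUE (a sketch under that assumption; (*SKIP)(*F) makes the engine skip over the validly quoted fields so only the stray quotes are deleted, and the file names are illustrative):
lines <- readLines("test.txt")
# Skip well-formed quoted fields, then delete any quote not adjacent to a comma
cleaned <- gsub('(?:^|,)"[^"]*"(?=$|,)(*SKIP)(*F)|(?<!,)"(?!,)', "", lines, perl = TRUE)
writeLines(cleaned, "test_clean.txt")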

Split CSV type file in R using strsplit

I am trying to split a string that would eventually be taken out of a CSV file using readLines(). (I know read.csv() works better, but the CSV file can have a different number of columns for each row. For example, the 1st row might have 2 columns, the 2nd row 4, and the 3rd row 2.)
Say, the string I am going to parse looks like this:
2011-05-04, "weqr, wrqw", "qweqrw", 12
Eventually, I want it to be split into four parts, meaning I am splitting on commas but only when the comma is outside the quotation marks.
A quick Google search gives a Java solution which takes advantage of the regular expression ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)".
But doing something like a <- strsplit(x, ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)") generates an error: invalid regular expression.
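The error occurs because strsplit, like most base R regex functions, defaults to POSIX extended regular expressions, which do not support lookaheads. Passing perl = TRUE switches to the PCRE engine, where the pattern works as intended (a minimal sketch):
x <- '2011-05-04, "weqr, wrqw", "qweqrw", 12'
strsplit(x, ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", perl = TRUE)
# [[1]]
# [1] "2011-05-04"      " \"weqr, wrqw\"" " \"qweqrw\""     " 12"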

Treating "#" as a regular character when reading data

I'm almost certain this has been asked before, but thanks to a certain social media app I'm drowning in unrelated search results.
The data set that I'm importing contains actual "#" characters, as in Apartment #404, and I'd like to preserve them if possible, but R treats them as an end-of-line marker or something. At first it would bomb out on the first occurrence; then I set fill=TRUE and now it just ignores the rest of the line after that.
How does one instruct R to treat "#" as a regular character?
If you are not using "#" as a comment symbol in your data, you can use
read.table(..., comment.char="")
That should treat "#" like any other character.
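For example (a sketch; "apartments.csv" is an assumed file name):
# read.table defaults to comment.char = "#", so override it
df <- read.table("apartments.csv", sep = ",", header = TRUE, comment.char = "")
# read.csv already defaults to comment.char = "", so it works as-is
df <- read.csv("apartments.csv")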

Reading a csv file with embedded quotes into R

I have to work with a .csv file that comes like this:
"IDEA ID,""IDEA TITLE"",""VOTE VALUE"""
"56144,""Net Present Value PLUS (NPV+)"",1"
"56144,""Net Present Value PLUS (NPV+)"",1"
If I use read.csv, I obtain a data frame with one variable. What I need is a data frame with three columns, where columns are separated by commas. How can I handle the quotes at the beginning of the line and the end of the line?
I don't think there's going to be an easy way to do this without stripping the initial and terminal quotation marks first. If you have sed on your system (Unix [Linux/MacOS] or Windows+Cygwin?) then
read.csv(pipe("sed -e 's/^\"//' -e 's/\"$//' qtest.csv"))
should work. Otherwise
read.csv(text=gsub("(^\"|\"$)","",readLines("qtest.csv")))
is a little less efficient for big files (you have to read in the whole thing before processing it), but should work anywhere.
(There may be a way to do the regular expression for sed in the same, more-compact form using parentheses that the second example uses, but I got tired of trying to sort out where all the backslashes belonged.)
I suggest both removing the initial/terminal quotes and turning the back-to-back double quotes into single double quotes. The latter is crucial in case some of the strings contain commas themselves, as in
"1,""A mostly harmless string"",11"
"2,""Another mostly harmless string"",12"
"3,""These, commas, cause, trouble"",13"
Removing only the initial/terminal quotes while keeping the back-to-back quotes leads the read.csv() function to produce 6 variables, as it interprets all commas in the last row as value separators. So the complete code might look like this:
data.text <- readLines("fullofquotes.csv") # Reads data from file into a character vector.
data.text <- gsub("^\"|\"$", "", data.text) # Removes initial/terminal quotes.
data.text <- gsub("\"\"", "\"", data.text) # Replaces "" by ".
data <- read.csv(text=data.text, header=FALSE)
Or, of course, all in a single line
data <- read.csv(text = gsub("\"\"", "\"", gsub("^\"|\"$", "", readLines("fullofquotes.csv"))), header = FALSE)

Copy to without quotes

I have a large dataset in dbf file and would like to export it to the csv type file.
Thanks to SO already managed to do it smoothly.
However, when I try to import it into R (the environment I work in), some characters get combined, making some rows much longer than they should be and consequently breaking the whole database. In the end, whenever I import the exported csv file I get only half of the db.
I think the main problem is with quotes in the string fields, but specifying quote="" in R didn't help (and it usually does).
I've searched for any question on how to deal with quotes when exporting from Visual FoxPro, but couldn't find the answer. I wanted to test this, but my computer throws an error stating that I don't have enough memory to complete the operation (probably due to the large db).
Any help will be highly appreciated. I've been stuck on this problem of exporting from the dbf into R for long enough; I've searched everything I could and am desperately looking for a simple way to import a large dbf into my R environment without any bugs.
(In R: I checked whether the imported file has problems, and indeed most columns have far more characters (nchar) than they should, while the number of rows halved. Reading the db with read.csv("file.csv", quote="") didn't help. Reading with data.table::fread() returns an error
Expected sep (',') but '0' ends field 88 on line 77980:
But according to verbose=T this function reads the right number of rows (read.csv imports only about 1.5 million rows):
Count of eol after first data row: 2811729 Subtracted 1 for last eol
and any trailing empty lines, leaving 2811728 data rows
When exporting with COPY TO ... TYPE DELIMITED you have some control on the VFP side as to how the export formats the output file.
To change the field delimiter from quotes to, say, a pipe character you can do:
copy to myfile.csv type delimited with "|"
so that will produce something like:
|A001|,|Company 1 Ltd.|,|"Moorfields"|
You can also change the separator from a comma to another character:
copy to myfile.csv type delimited with "|" with character "#"
giving
|A001|#|Company 1 Ltd.|#|"Moorfields"|
That may help in parsing on the R side.
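On the R side such a file can then be read by declaring the pipe as the quoting character (a sketch; the file name and the absence of a header row are assumptions):
df <- read.table("myfile.csv", sep = ",", quote = "|", header = FALSE, stringsAsFactors = FALSE)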
There are three ways to delimit a string in VFP: the normal single and double quote characters, and square brackets. So to strip quotes out of character fields myfield1 and myfield2 in your DBF file you could do this in the Command Window:
close all
use myfile
copy to mybackupfile
select myfile
replace all myfield1 with chrtran(myfield1,["'],"")
replace all myfield2 with chrtran(myfield2,["'],"")
and repeat for other fields and tables.
You might have to write code to do the export, rather than simply using the COPY TO ... DELIMITED command.
SELECT thedbf
mfld_cnt = AFIELDS(mflds)
fh = FOPEN(m.filename, 1)
SCAN
    FOR aa = 1 TO mfld_cnt
        mcurfld = 'thedbf.' + mflds[aa, 1]
        mvalue = &mcurfld
        ** Or you can use:
        mvalue = EVAL(mcurfld)
        ** Manipulate the contents of mvalue, possibly based on the field type
        DO CASE
        CASE mflds[aa, 2] = 'D'
            mvalue = DTOC(mvalue)
        CASE mflds[aa, 2] $ 'CM'
            ** Replace characters that are giving you problems in R
            mvalue = STRTRAN(mvalue, ["], '')
        OTHERWISE
            ** Etc.
        ENDCASE
        = FWRITE(fh, mvalue)
        IF aa # mfld_cnt
            = FWRITE(fh, [,])
        ENDIF
    ENDFOR
    = FWRITE(fh, CHR(13) + CHR(10))
ENDSCAN
= FCLOSE(fh)
Note that I'm using [ ] characters to delimit strings that include commas and quotation marks. That helps readability.
* create a comma delimited file with no quotes around the character fields
copy to myfile.csv TYPE DELIMITED WITH ""   && WITH "" is two double quotes
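Since the resulting file contains no quoting at all, read it back into R with quoting disabled so that any quote characters left in the data are treated literally (a sketch; the file name is illustrative):
df <- read.csv("myfile.csv", header = FALSE, quote = "")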
