I have 2,000+ tables, some with hundreds of rows, that I'm downloading from a web service (of botanical names) and saving to disk for further inspection.
Since some text fields contain carriage returns, I decided to quote everything. But some fields contain " characters and others contain ' characters, so neither can be used for quoting (I could try to escape them, but some are already escaped, and that would quickly become a mess; I thought it would be easier to use a different quote character). I tried %, only to find that some fields use that character too. So I need something else. I tried ¨ ☺ π and 人, but nothing seems to work! All of them display correctly on screen (RKWard on Ubuntu 14.04), and all are saved correctly with write.table, but NONE can be read back with read.table or read.csv. I'm using UTF-8 as fileEncoding. I get the message "invalid multibyte string", even for ☺ (which is character 1 in the old code page 437, but a multibyte character in UTF-8).
Sys.getlocale(category="LC_ALL")
gives
"LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=pt_BR.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=pt_BR.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=pt_BR.UTF-8;LC_NAME=pt_BR.UTF-8;LC_ADDRESS=pt_BR.UTF-8;LC_TELEPHONE=pt_BR.UTF-8;LC_MEASUREMENT=pt_BR.UTF-8;LC_IDENTIFICATION=pt_BR.UTF-8"
I have tried changing the locale to Chinese in order to use 人 (which shouldn't be needed, I guess, since it displays and saves correctly), but that didn't work either. I get
OS reports request to set locale to "chinese" cannot be honored
OS reports request to set locale to "Chinese" cannot be honored
OS reports request to set locale to "zh_CN.utf-8" cannot be honored
Now the strangest part: if the Chinese characters are in the body of the data, they're read without a problem. It seems they just can't be used as quotes!
Any ideas? Thanks in advance.
I'm not sure this is the solution you're looking for, but if I understood correctly you have CR/LF characters in your text which make it hard to read the data as a table. If those breaks are stored as the literal two-character sequences \r, \n and \r\n (rather than real control characters), you can read the file with readLines and then parse the result as a table. For example, consider the file crlf.txt:
col1 col2 col3 col4 col5
1 \n 3 \r 5
a \r\n 3 2 2
You can use
> readLines("crlf.txt")
[1] "col1 col2 col3 col4 col5" "1 \\n 3 \\r 5 "
[3] "a \\r\\n 3 2 2"
And then:
> read.table(text=readLines("crlf.txt"), header = T)
col1 col2 col3 col4 col5
1 1 \\n 3 \\r 5
2 a \\r\\n 3 2 2
Obviously the line breaks are now escaped when printed, otherwise they would actually break the lines.
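If you later want real line breaks back inside the fields, a possible follow-up (a sketch; note that gsub coerces the affected columns to character):
df <- read.table(text = readLines("crlf.txt"), header = TRUE)
## turn the literal two-character sequences back into real control characters
df[] <- lapply(df, function(col) gsub("\\\\r\\\\n|\\\\r|\\\\n", "\n",
                                      as.character(col)))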
See ?scan (scan is used by read.table):
quote: the set of quoting characters as a single character string or ‘NULL’. In a multibyte locale the quoting characters must be ASCII (single-byte).
The easiest option would be to replace all your embedded new lines with another string prior to importing the file, and then reintroduce the new lines later using gsub.
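A minimal sketch of that idea (the data frame name tbl and the <NL> placeholder are hypothetical; it assumes the text columns are character and contain no tab characters):
## replace embedded line breaks with a placeholder before writing...
is_chr <- vapply(tbl, is.character, logical(1))
tbl[is_chr] <- lapply(tbl[is_chr], function(col) gsub("\r?\n", "<NL>", col))
write.table(tbl, "tbl.txt", sep = "\t", quote = FALSE, row.names = FALSE,
            fileEncoding = "UTF-8")
## ...and reintroduce them after reading the file back
tbl2 <- read.table("tbl.txt", sep = "\t", header = TRUE, quote = "",
                   stringsAsFactors = FALSE, fileEncoding = "UTF-8")
tbl2[is_chr] <- lapply(tbl2[is_chr],
                       function(col) gsub("<NL>", "\n", col, fixed = TRUE))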
I have a .csv file that contains all text fields. However, some of the text fields contain an unescaped double quote character, eg:
"ID","Text","Optional text","Date"
"1","Today is going to be a good day","","2013-02-03"
"2","And I am inspired by the quote "every dog must have it's day"","Hi","2013-01-01"
"3","Did not the bard say All the World's a stage" this quote is so true","Terrible","2013-05-05"
Lines 1 and 2 are fine but 3 doesn't read in correctly. At the moment I am manually going through the file in Notepad++ to try and remove such quotes. Ideally I'd like R to be able to handle this but I think that the unescaped nature of the unmatched double quote makes such an expectation unreasonable.
In Notepad++ I am trying to build a regular expression to identify double quotes that are not preceded or succeeded by a comma. The logic is that a valid double quote will be at the start or end of a field and this is signified by an adjacent comma. This might help to identify the majority of my cases, which I can then deal with.
Just to say that I have about 3.4 million records and about 0.1% appear to be problematic.
EDIT:
fread from data.table has been suggested as an alternative, but use of fread is even less successful:
1: In fread(paste(infilename, "1", ".csv", sep = "")) :
Stopped early on line 21. Expected 18 fields but found 9. Consider fill=TRUE and comment.char=. First discarded non-empty line
Neither of the suggested options works. I think this is because the "Text" field can also contain CRLF characters. read.csv appears to just ignore these (good), whilst fread takes exception. Sorry that I cannot make the actual text available, but here is some more comprehensive test data, which has both the unmatched double quote (which read.csv has issues with) and CRLF (which fread has issues with).
"ID","Text","Optional text","Date"
"1","Today is going to be a good day","","2013-02-03"
"2","And I am inspired by the quote "every dog must have it's day"","Hi","2013-01-01"
"3","An issue with this line is that it contains a CRLF here
which is not usual.","Again an unusual CRLF
is present in these data","2013-02-02"
"4","Did not the bard say All the World's a stage" this quote is so true","Terrible","2013-05-05"
Help with the regex in Notepad++ would be great.
Perhaps one option could be to use a conditional replacement in Notepad++.
You could first match every properly quoted field: a double quote preceded by a comma or the start of the line, then any run of characters that are not double quotes, up to a closing double quote followed by a comma or the end of the line.
Those fields are fine, so for the other side of the alternation, capture the stray double quotes (the ones not adjacent to a comma) so they can be replaced.
Find what:
(?:^|,)"[^"\n]*"(?=$|,)|(?<!,)(")(?!,)
Replace with:
A conditional replacement. If group 1, then replace with empty, else replace with the match.
(?{1}:$0)
Explanation
(?:^|,) Match either a comma or assert the start of the string
"[^"\n]*" Match an opening double quote, then any characters that are not a double quote (or newline), then a closing double quote
(?=$|,) Assert that what is on the right is either the end of the string or a comma
| OR
(?<!,)(")(?!,) Capture a double quote in group 1 while asserting that neither the character on the left nor the one on the right is a comma
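The same cleanup can also be done inside R rather than Notepad++. Base R's gsub has no conditional replacement, but with perl = TRUE it uses PCRE, whose (*SKIP)(*FAIL) verbs let the first alternative consume the valid quoted fields so only the stray quotes are deleted. A sketch (file names hypothetical; like the Notepad++ pass, it works line by line, so fields with embedded CRLF need separate handling):
lines <- readLines("test.csv")
## first branch matches valid quoted fields and skips them;
## second branch matches the stray quotes, which are removed
fixed <- gsub('(?:^|,)"[^"\n]*"(?=$|,)(*SKIP)(*FAIL)|(?<!,)"(?!,)',
              "", lines, perl = TRUE)
writeLines(fixed, "test_fixed.csv")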
Seems to work rather well with data.table::fread:
fread("E:/temp/test.txt")
# ID Text Optional text "Date"
#1: 1 Today is going to be a good day 2013-02-03
#2: 2 And I am inspired by the quote "every dog must have it's day" Hi 2013-01-01
#3: 3 Did not the bard say "All the World's a stage" this quote is so true Terrible 2013-05-05
#Warning message:
#In fread("E:/temp/test.txt") :
# Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
Suppose I have a csv file that looks like this:
Type,ID,NAME,CONTENT,RESPONSE,GRADE,SOURCE
A,3,"","I have comma, ha!",I have open double quotes",A,""
desired output should be:
df <- data.frame(Type='A', ID=3, NAME=NA, CONTENT='I have comma, ha!',
                 RESPONSE='I have open double quotes\"', GRADE='A', SOURCE=NA)
df
Type ID NAME CONTENT RESPONSE GRADE SOURCE
1 A 3 NA I have comma, ha! I have open double quotes" A NA
I tried read.csv, since the data provider uses quoting to protect commas inside strings, but they forgot to escape double quotes in strings that contain no comma, so whether or not I disable quoting in read.csv I can't get the desired output.
How can I do this in R? Other package solutions are also welcome.
fread from data.table handles this just fine:
library(data.table)
fread('Type,ID,NAME,CONTENT,RESPONSE,GRADE,SOURCE
A,3,"","I have comma, ha!",I have open double quotes",A,""')
# Type ID NAME CONTENT RESPONSE GRADE SOURCE
#1: A 3 I have comma, ha! I have open double quotes" A
I'm not too sure about the structure of CSV files in general, but you said the provider quoted the fields to protect the commas in the text under CONTENT.
This reads the text as is, keeping the " at the end:
read.csv2("Test.csv", header = TRUE, sep = ",", quote = "")
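A minor variant: read.csv does the same with quoting disabled, and avoids read.csv2's dec = "," default for decimal points:
read.csv("Test.csv", header = TRUE, quote = "")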
This is not valid CSV, so you'll have to do your own parsing. But, assuming the convention below holds, you can just toggle scan's quote argument to take advantage of most of its abilities:
If the field starts with a quote, it is quoted.
If the field does not start with a quote, it is raw
next_field <- function(stream) {
  ## peek at the first character of the field without consuming it
  p <- seek(stream)
  d <- readChar(stream, 1)
  seek(stream, p)
  ## quoted field: let scan() honour the quote; raw field: disable quoting
  if (d == "\"")
    field <- scan(stream, what = "", nmax = 1, sep = ",", quote = "\"",
                  blank.lines.skip = FALSE)
  else
    field <- scan(stream, what = "", nmax = 1, sep = ",", quote = "",
                  blank.lines.skip = FALSE)
  return(field)
}
Assuming the above convention, this is sufficient to parse the file as follows:
s <- file("example.csv", open = "rt")
header <- readLines(s, 1)
header <- scan(what = "", text = header, sep = ",")
line <- replicate(length(header), next_field(s))
setNames(as.data.frame(lapply(line, type.convert)), header)
Type ID NAME CONTENT RESPONSE GRADE SOURCE
1 A 3 NA I have comma, ha! I have open double quotes" A NA
However, in practice you might want to first write back the fields, quoting each, to another file, so you can just read.csv on the corrected format.
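A minimal sketch of that round trip (the output file name is hypothetical): write.csv quotes character fields and doubles embedded quotes by default, so the rewritten file is regular CSV.
df <- setNames(as.data.frame(lapply(line, type.convert)), header)
write.csv(df, "example_fixed.csv", row.names = FALSE)
fixed <- read.csv("example_fixed.csv", stringsAsFactors = FALSE)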
When I try to read a csv file using data.table::fread(fn, sep='\t', header=T), it gives an 'Unbalanced " observed on this line' error. The data has 3 integer variables and 1 string variable. The strings in the csv file are not enclosed in ", and yes, there are some lines where the string variable contains " characters that are not in pairs.
I am wondering: is it possible to make fread just ignore the unpaired " in that variable and continue reading the data? Thanks.
Here is the sample data (just one record):
N_ID VISIT_DATE REQ_URL REQType
175931 2013-3-8 23:40:30 http://aaa.com/rest/api2.do?api=getSetMobileSession&data={"imei":"60893ZTE-CN13cd","appkey":"android_client","content":"Z0JiRA0qPFtWM3BYVltmcx5MWF9ZS0YLdW1ydXoqPycuJS8idXdlY3R0TGBtU 1
UPDATE: Now implemented in v1.8.11
From NEWS :
fread now accepts quotes (both ' and ") in the middle of fields,
whether the field starts with " or not, rather than the 'unbalanced
quotes' error, #2694. Thanks to baidao for reporting. It was known and
documented at the top of ?fread (text now removed). If a field starts
with " it must end with " (necessary if the field separator itself is in the
field contents). Embedded quotes can be in column names too. Newlines (\n)
still can't be in quoted fields or quoted column names, yet.
Yes, as @agstudy said, embedded quotes are a known, documented problem, not yet implemented since fread is new. Strictly speaking, I suppose these aren't embedded quotes, though, because the string in your example doesn't start with a quote.
Anyway, I've filed this as a bug report so it doesn't get forgotten. To be done in the next release. Thanks for highlighting.
#2694 : Strings including quotes but not starting with quote in fread
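A quick way to see the new behaviour, using toy data along the same lines rather than the original file (column names a/b are hypothetical):
library(data.table)  # v1.8.11 or later
dt <- fread('a\tb\nx\tdata={"imei":"60893ZTE-CN13cd"}\n')
dt$b
## expected: 'data={"imei":"60893ZTE-CN13cd"}' kept verbatim, no error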
I'm working on reading transcripts of dialogue into R. However, I run into a bump with special characters like curly quotes, en and em dashes, etc. Typically I replace these special characters in a Microsoft product first, using find-and-replace. Usually I replace them with plain-text equivalents, but on some occasions I want to replace them with other characters (i.e. I replace “ ” with { }). This is tedious and not always thorough. If I could read the transcripts into R as is, and then use Encoding to switch them to a recognizable Unicode representation, I could gsub them out and replace them with plain-text versions. However, the file is read in in some way I don't understand.
Here's an xlsx of what my data may look like:
http://dl.dropbox.com/u/61803503/test.xlsx
This is what is in the .xlsx file
text num
“ ” curly quotes 1
en dash (–) and the em dash (—) 2
‘ ’ curly apostrophe-ugg 3
… ellipsis are uck in R 4
This can be read into R with:
URL <- "http://dl.dropbox.com/u/61803503/test.xlsx"
library(gdata)
z <- read.xls(URL, stringsAsFactors = FALSE)
The result is:
text num
1 “ †curly quotes 1
2 en dash (–) and the em dash (—) 2
3 ‘ ’ curly apostrophe-ugg 3
4 … ellipsis are uck in R 4
So I tried to convert to Unicode using iconv:
iconv(z[, 1], "latin1", "UTF-8")
This gives:
[1] "â\u0080\u009c â\u0080\u009d curly quotes" "en dash (â\u0080\u0093) and the em dash (â\u0080\u0094)"
[3] "â\u0080\u0098 â\u0080\u0099 curly apostrophe-ugg" "â\u0080¦ ellipsis are uck in R"
This makes gsub-ing less useful.
What can I do to convert these special characters to distinguishable Unicode so I can gsub them out appropriately? To be more explicit, I was hoping to have z[1, 1] read:
\u201C \u201D curly quotes
To make my desired outcome even clearer: I will web-scrape tables from a page like Wikipedia's http://en.wikipedia.org/wiki/Quotation_mark_glyphs and use the Unicode reference chart to replace characters appropriately. So I need the characters to be in Unicode, or some standard format that I can systematically go through to replace the characters. Maybe it already is and I'm missing it.
PS: I don't save the files as .csv or plain text because the special characters get replaced with ?, hence the use of read.xls. I'm not attached to any particular method of reading in the file (i.e. read.xls) if you've got a better alternative.
Maybe this will help (I'll have access to a Windows machine tomorrow and can probably play with it more at that point if SO doesn't get you the answer first).
On my Linux system, when I do the following:
iconv(z$text, "", "cp1252")
I get:
[1] "\x93 \x94 curly quotes" "en dash (\x96) and the em dash (\x97)"
[3] "\x91 \x92 curly apostrophe-ugg" "\x85 ellipsis are uck in R"
This is not UTF-8; these are the CP1252 bytes shown as hex escapes. Still, if you are able to get to this point as well, then you should be able to use gsub the way you intend to.
See this page (reserved section in particular) for conversions.
Update
You can also try converting to an encoding that doesn't have those characters, like ASCII, and setting sub to "byte". On my machine, that gives me:
iconv(z$text, "", "ASCII", "byte")
# [1] "<e2><80><9c> <e2><80><9d> curly quotes"
# [2] "en dash (<e2><80><93>) and the em dash (<e2><80><94>)"
# [3] "<e2><80><98> <e2><80><99> curly apostrophe-ugg"
# [4] "<e2><80><a6> ellipsis are uck in R"
It's ugly, but UTF-8 (e2, 80, 9c) is a left curly quote (each character, I believe, is shown as a set of three byte values in angle brackets). You can find conversions at this site where you can search by punctuation mark name.
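Once the text is valid UTF-8, the gsub step itself can be written with \u escapes, so the literal characters never need to appear in your script. A small sketch:
x <- "\u201Cquoted\u201D text with an en dash \u2013 and ellipsis \u2026"
x <- gsub("\u201C", "{", x, fixed = TRUE)   # left curly quote
x <- gsub("\u201D", "}", x, fixed = TRUE)   # right curly quote
x <- gsub("\u2013", "-", x, fixed = TRUE)   # en dash
x <- gsub("\u2026", "...", x, fixed = TRUE) # ellipsis
x
## [1] "{quoted} text with an en dash - and ellipsis ..."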
Try
> iconv(z, "UTF-8", "UTF-8")
[1] "c(\"“—” curly quotes\", \"en dash (–) and the em dash (—)\", \"‘—’ curly apostrophe-ugg\", \"… ellipsis are uck in R\")"
[2] "c(1, 2, 3, 4)"
Windows is very problematic with encodings. Maybe you can look at http://www.vmware.com/products/player/ and run Linux.
This works on my Windows box. Initial input was as you had. You may have a different experience.
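As an aside, iconv() applied to a whole data frame deparses each column into the c(...) strings shown above; applying it one column at a time keeps the data frame intact:
z$text <- iconv(z$text, "UTF-8", "UTF-8")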
I have a comma-separated value file that looks like this when I open it in vim:
12,31,50,,12^M34,23,45,2,12^M12,31,50,,12^M34,23,45,2,12^M
and so forth. I believe this means my CSV uses CR-only (classic mac) line endings. R's read.table() function ostensibly requires LF line endings, or some variant thereof.
I know I can preprocess the file, and that's probably what I'll do.
That solution aside: is there a way to import CR files directly into R? For instance, write.table() has an "eol" parameter one can use to specify the line ending of outputs -- but I don't see a similar parameter for read.table() (cf. http://stat.ethz.ch/R-manual/R-patched/library/utils/html/read.table.html).
R will not recognize "^M" as anything useful. (I suppose it's possible that vim is just showing you a ctrl-M control character that way.) If that were in a text connection stream, R would think it's not a valid escaped character, since "^" is not used for that purpose. You might need to do the pre-processing, unless you want to pass it through scan() and substitute using gsub():
subbed <- gsub("\\^M", "\n", scan(textConnection("12,31,50,,12^M34,23,45,2,12^M12,31,50,,12^M34,23,45,2,12^M"), what="character"))
Read 1 item
> read.table(text=subbed, sep=",")
V1 V2 V3 V4 V5
1 12 31 50 NA 12
2 34 23 45 2 12
3 12 31 50 NA 12
4 34 23 45 2 12
I suppose it's possible that you may need to use "\r" as the pattern argument to gsub instead, if the ^M turns out to be vim's display of a real carriage-return character.
A further note: The help page for scan says: "Whatever mode the connection is opened in, any of LF, CRLF or CR will be accepted as the EOL marker for a line and so will match sep = "\n"." So the carriage returns (if that's what they are) should have been recognized, since read.table is based on scan. You should look at ?Quotes for information on escape characters.
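That documented behaviour is easy to check by writing a CR-only file (a two-row version of the example; file name hypothetical) and reading it straight back:
writeChar("12,31,50,,12\r34,23,45,2,12\r", "cr_test.csv", eos = NULL)
read.table("cr_test.csv", sep = ",")
##   V1 V2 V3 V4 V5
## 1 12 31 50 NA 12
## 2 34 23 45  2 12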
If this vim tutorial is to be believed, those may be DOS-related characters, since it offers this advice:
Strip DOS ctrl-M's:
:1,$ s/{ctrl-V}{ctrl-M}//
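If you'd rather do that strip inside R than in vim, the same substitution works on the raw file contents (file name hypothetical):
raw <- readChar("mac_file.csv", file.info("mac_file.csv")$size, useBytes = TRUE)
read.table(text = gsub("\r", "\n", raw, fixed = TRUE), sep = ",")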
There is an R native solution that requires no preprocessing or external hacks. You should use the encoding input argument to the read.table function and set it equal to "latin1" for Mac character encoding.
For example, say your file in Mac (^M for return) format is saved as test.csv, load as follows:
test <- read.table("./test.csv", sep=",", encoding="latin1")
To see what options you can pass to the encoding argument, type ?Encoding into the R interpreter and you will see that "latin1", "UTF-8", "bytes" and "unknown" are the supported encodings.
This is the best & cleanest way to do this.