Reading a badly formatted csv with uneven quotes and separators in fields - r

I have a badly formatted csv file (I did not make it) that includes both separators and broken quotes in some fields. I would like to read this into R.
Three lines of the table look something like this:
| ids  | info            | text                                   |
| id 1 | extra_info;1998 | text text text                         |
| id 2 | extra_info2     | text with broken dialogues quotes "hi! |
# the same table as an R string could be
string <- "ids;info;text\n\"id 1\";\"extra_info;1998\";\"text text text\"\n\"id 2\";extra_info2;\"text with broken dialogues quotes \"hi!\" \n"
With " quotes surrounding any field with more than one word as is common in csv-s, and semicolon ; used as a separator. Unfortunately the way it was built, the last column (and it is always last), can contain a random number of semicolons or quotes within a text bulk, and these quotes are not always escaped.
I'm looking for a way to read this file. So far I have come up with a really complicated workflow: because the text column is always last, I use a regex (from here) to replace the first N separators with another, less-used separator at the beginning of each line. However, this still fails when a line contains an uneven number of quotes.
I'm thinking there must be an easier way to do this, as badly formed CSVs should be a recurring problem here. Thanks.

data.table::fread works wonders:
library(data.table)
test <- fread("test.csv")
# Remove the extraneous columns that the stray quotes/separators produce
test$V1 <- NULL
test$V5 <- NULL
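For reference, a minimal sketch of trying it directly on the string from the question (assuming a data.table version recent enough to have fread's text argument; the exact quote/fill settings may need tuning on the real file):
library(data.table)
string <- "ids;info;text\n\"id 1\";\"extra_info;1998\";\"text text text\"\n\"id 2\";extra_info2;\"text with broken dialogues quotes \"hi!\" \n"
# fread falls back on heuristics when quoting is inconsistent; fill = TRUE
# tolerates rows that still end up with too many or too few fields, and
# quote = "" would disable quote handling entirely as a last resort.
test <- fread(text = string, sep = ";", fill = TRUE)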

Related

Using R, how does one extract multiple URLs/pattern matches from a string in a dataset, and then place each URL in its own adjacent column?

I have a (large) dataset that initially consists of an identifier and associated text (in raw HTML). Oftentimes the text will include one or more embedded links. Here's a sample dataset:
id text
1  <p>I love dogs!</p>
2  <p>My <strong>favorite</strong> dog is <a href="doge.com">this kind</a>.</p>
3  <p>I've had both <a href="labs.com">Labs</a> and <a href="huskies.com">Huskies</a> in my life.</p>
What I'd like as output (with the text column included in the same spot, but I removed it for visibility here) is:
id link1 link2
1
2 doge.com
3 labs.com huskies.com
I've tried using str_extract_all() paired with <a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1, but even when I double escape the backslashes I either get an "unexpected" error OR it keeps asking me for more and I have to Escape out. I feel like this method is the one I want and SHOULD work, but I can't seem to get the regex to play nicely. Here are my results so far:
> str_extract_all(text, "<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1")
Error: '\s' is an unrecognized escape in character string starting ""<a\s"
> str_extract_all(text, perl(<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1))
Error: unexpected '<' in "str_extract_all(text, perl(<"
> str_extract_all(text, "<a\\s+(?:[^>]*?\\s+)?href=(["'])(.*?)\\1")
+
> str_extract_all(text, perl(<a\\s+(?:[^>]*?\\s+)?href=(["'])(.*?)\\1))
Error: unexpected '<' in "str_extract_all(text, perl(<"
I've also tried parseURI from the XML package and for whatever reason it crashes my R session.
The other solutions I've found to date either only deal with single links, or return items in a list or vector altogether. I want to keep things separated by their identifier and in a dataset.
If needed, I could tolerate generating a separate dataset and merging them together, but there will be cases where there are no links, so I'd want to avoid any pitfalls of rows being deleted due to not having a value in any of the link columns.
R treats an unescaped double quote inside a double-quoted string as the end of that string, so in your example above R considers the string still open:
str_extract_all(text, "<a\\s+(?:[^>]*?\\s+)?href=(["'])(.*?)\\1")
R is still looking for the end of the string, since the inner quote was never escaped. R has special cases in which a single \ can be used for escaping (e.g. \n for a newline). \' escapes a single quote and \" escapes a double quote inside an R string, so the pattern works once the quotes in the character class are escaped:
str_extract_all(text, "<a\\s+(?:[^>]*?\\s+)?href=([\"'])(.*?)\\1")
"\ itself is a special character that needs escape, e.g. \\d. Do not
confuse these regular expressions with R escape sequences such as
\t."
or in your case \"
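To get from the matches to one row per id with link1, link2, ... columns, here is a minimal sketch (the <a> tags in the sample vector are an assumption, reconstructed from the expected output; str_match_all keeps the capture groups, and the second group holds the URL):
library(stringr)
text <- c(
  "<p>I love dogs!</p>",
  "<p>My <strong>favorite</strong> dog is <a href=\"doge.com\">this kind</a>.</p>",
  "<p>I've had both <a href=\"labs.com\">Labs</a> and <a href=\"huskies.com\">Huskies</a> in my life.</p>"
)
# Column 1 of each result matrix is the full match; column 3 is the
# second capture group, i.e. the URL itself.
m <- str_match_all(text, "<a\\s+(?:[^>]*?\\s+)?href=([\"'])(.*?)\\1")
links <- lapply(m, function(x) x[, 3])
# Pad every id to the same number of links so rows without links survive.
n <- max(lengths(links))
out <- do.call(rbind, lapply(links, function(x) c(x, rep(NA, n - length(x)))))
colnames(out) <- paste0("link", seq_len(n))
data.frame(id = seq_along(text), out)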

Dealing with quotation marks in a quote-surrounded string

Take this CSV file:
ID,NAME,VALUE
1,Blah,100
2,"Has space",200
3,"Ends with quotes"",300
4,""Surrounded with quotes"",300
It loads just fine in most statistical programs (R, SAS, etc.), but in Excel the third row is misinterpreted because it has two quotation marks in a row. Escaping the last quote as \" will also not work in Excel. The only way I have found so far is to replace the single double quote with two double quotes:
ID,NAME,VALUE
1,Blah,100
2,"Has space",200
3,"Ends with quotes""",300
4,"""Surrounded with quotes""",300
But that would render the file completely useless for all other programs (R, SAS, etc.).
Is there a way to format the CSV file where strings can begin or end with the same characters as those used to surround them, such that it works in Excel as well as in commonly used statistical software?
Your second representation is the normal way to generate a CSV file, so it should be easy to work with in any software. See the RFC 4180 specification: https://www.ietf.org/rfc/rfc4180.txt
So your second example represents this data:
Obs  id  name                      value
1    1   Blah                      100
2    2   Has space                 200
3    3   Ends with quotes"         300
4    4   "Surrounded with quotes"  300
If you want to represent it as a delimited file where none of the values are allowed to contain the delimiter (in other words, NOT as a standard CSV file), then it would look like:
id,name,value
1,Blah,100
2,Has space,200
3,Ends with quotes",300
4,"Surrounded with quotes",300
But if you want to allow the values to contain the delimiter, then you need some way to distinguish embedded delimiters from real delimiters. So the standard forces values that contain the delimiter to be quoted. But once you do that, you also need to add quotes around fields that contain the quote character itself (and double the embedded quotes) to avoid producing an ambiguous file. For example, the quotes in the 4th observation in your first file look like optional quotes around a value instead of part of the value.
Many programs try to handle ambiguous situations gracefully. For example, SAS does not allow values to contain embedded line breaks, so you will always get four observations from your first example file.
But Excel allows end-of-line character(s) to be embedded inside quoted values. So in your original file, the value of the second field in the third observation looks to Excel like the start of a quoted value that swallows everything up to the next lone quote character:
Ends with quotes",300
4,"Surrounded with quotes",300
So instead of four complete observations with three field values each, Excel sees only three observations, and the last one has only two field values.
This is caused by the fact that the escape character for " in Excel is "" (see: Escaping quotes and delimiters in CSV files with Excel).
A quick and simple workaround that comes to mind in R is to first read the content of the csv with readLines, then replace the doubled (escaped) double quotes with single double quotes, and then use read.table:
read.table(
  text = gsub(pattern = "\"\"", replacement = "\"", readLines("data.csv")),
  sep = ",",
  header = TRUE
)

How to skip the split functionality for same value of split pattern in fn:tokenize

I am trying to split a delimited string (separated by '|' or ','). I used fn:tokenize to implement this. Consider the example below, in which I have four columns of text and the third column's value happens to be the same as the split pattern.
fn:tokenize("column1|column2|||column4", "|")
The above code gives me five values, two of which are empty:
column1
column2
column4
I also tried adding quotes around the column3 value, which also did not give the expected result.
In MarkLogic 9 you can define your own custom tokenizer.
Apart from fn:tokenize splitting on regular expressions (which is why | must be escaped), this seems like a horrible data format. Setting aside the issues indicated by Michael Kay, and assuming that || always indicates a new field starting with | and that there are never empty columns, you can apply a simple hack: replace the pipe symbols with another character and convert back afterwards. This requires finding some character in the Unicode range that is not allowed in your data set, though.
for $token in fn:tokenize(fn:replace("column1|||||column4", "\|\|", "|_"), "\|")
return fn:replace($token, "_", "|")
Result:
column1
|
|
column4
If the assumptions I made do not apply to your use case, you will have to determine another set of similarly strict assumptions to be able to parse your contents.
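For comparison, the same placeholder trick sketched in base R, under the same assumption that the placeholder character never occurs in the data:
x <- "column1|||||column4"
tokens <- strsplit(gsub("\\|\\|", "|_", x), "\\|")[[1]]
gsub("_", "|", tokens, fixed = TRUE)
# "column1" "|" "|" "column4"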

Reading a csv file with embedded quotes into R

I have to work with a .csv file that comes like this:
"IDEA ID,""IDEA TITLE"",""VOTE VALUE"""
"56144,""Net Present Value PLUS (NPV+)"",1"
"56144,""Net Present Value PLUS (NPV+)"",1"
If I use read.csv, I obtain a data frame with one variable. What I need is a data frame with three columns, where columns are separated by commas. How can I handle the quotes at the beginning of the line and the end of the line?
I don't think there's going to be an easy way to do this without stripping the initial and terminal quotation marks first. If you have sed on your system (Unix [Linux/MacOS] or Windows+Cygwin?) then
read.csv(pipe("sed -e 's/^\"//' -e 's/\"$//' qtest.csv"))
should work. Otherwise
read.csv(text=gsub("(^\"|\"$)","",readLines("qtest.csv")))
is a little less efficient for big files (you have to read in the whole thing before processing it), but should work anywhere.
(There may be a way to do the regular expression for sed in the same, more-compact form using parentheses that the second example uses, but I got tired of trying to sort out where all the backslashes belonged.)
I suggest both removing the initial/terminal quotes and turning the back-to-back double quotes into single double quotes. The latter is crucial in case some of the strings contain commas themselves, as in
"1,""A mostly harmless string"",11"
"2,""Another mostly harmless string"",12"
"3,""These, commas, cause, trouble"",13"
Removing only the initial/terminal quotes while keeping the back-to-back quotes leads the read.csv() function to produce 6 variables, as it interprets all commas in the last row as value separators. So the complete code might look like this:
data.text <- readLines("fullofquotes.csv") # Reads data from file into a character vector.
data.text <- gsub("^\"|\"$", "", data.text) # Removes initial/terminal quotes.
data.text <- gsub("\"\"", "\"", data.text) # Replaces "" by ".
data <- read.csv(text=data.text, header=FALSE)
Or, of course, all in a single line
data <- read.csv(text=gsub("\"\"", "\"", gsub("^\"|\"$", "", readLines("fullofquotes.csv"))), header=FALSE)
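Applied to the three sample lines from the question (inlined here as a character vector so the sketch is self-contained):
lines <- c(
  '"IDEA ID,""IDEA TITLE"",""VOTE VALUE"""',
  '"56144,""Net Present Value PLUS (NPV+)"",1"',
  '"56144,""Net Present Value PLUS (NPV+)"",1"'
)
txt <- gsub("\"\"", "\"", gsub("^\"|\"$", "", lines))
# txt[1] is now:  IDEA ID,"IDEA TITLE","VOTE VALUE"
read.csv(text = txt)  # the first line becomes the header; three columns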

Copy to without quotes

I have a large dataset in a dbf file and would like to export it to a csv-type file.
Thanks to SO I already managed to do it smoothly.
However, when I try to import it into R (the environment I work in), it combines some characters together, making some rows much longer than they should be and consequently breaking the whole database. In the end, whenever I import the exported csv file I get only half of the db.
I think the main problem is with quotes in character fields, but specifying quote="" in R didn't help (and it usually does).
I've searched for any question on how to deal with quotes when exporting from Visual FoxPro, but couldn't find the answer. I wanted to test this, but my computer throws an error stating that I don't have enough memory to complete the operation (probably due to the large db).
Any help will be highly appreciated. I've been stuck on this problem of exporting from the dbf into R for long enough, have searched everything I could, and am desperately looking for a simple way to import a large dbf into my R environment without any bugs.
(In R: I checked whether the imported file has problems, and indeed most columns have much longer nchars than they should, while the number of rows halved. Reading the db with read.csv("file.csv", quote="") didn't help. Reading with data.table::fread() returns the error
Expected sep (',') but '0' ends field 88 on line 77980:
but according to verbose=T this function reads the right number of rows, whereas read.csv imports only about 1.5 million:
Count of eol after first data row: 2811729 Subtracted 1 for last eol
and any trailing empty lines, leaving 2811728 data rows
When exporting with TYPE DELIMITED you have some control on the VFP side over how the export formats the output file.
To change the field delimiter from quotes to, say, a pipe character you can do:
copy to myfile.csv type delimited with "|"
so that will produce something like:
|A001|,|Company 1 Ltd.|,|"Moorfields"|
You can also change the separator from a comma to another character:
copy to myfile.csv type delimited with "|" with character "#"
giving
|A001|#|Company 1 Ltd.|#|"Moorfields"|
That may help in parsing on the R side.
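On the R side, a file written with those options can then be read by telling read.table about the custom characters (a sketch, assuming the second variant above):
read.table("myfile.csv", sep = "#", quote = "|", stringsAsFactors = FALSE)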
There are three ways to delimit a string in VFP: the normal single and double quote characters, and square brackets. So to strip quotes out of character fields myfield1 and myfield2 in your DBF file you could do this in the Command Window:
close all
use myfile
copy to mybackupfile
select myfile
replace all myfield1 with chrtran(myfield1, ["'], "")  && strip both " and ' characters
replace all myfield2 with chrtran(myfield2, ["'], "")
and repeat for other fields and tables.
You might have to write code to do the export, rather than simply using the COPY TO ... DELIMITED command.
SELECT thedbf
mfld_cnt = AFIELDS(mflds)
fh = FOPEN(m.filename, 1)
SCAN
    FOR aa = 1 TO mfld_cnt
        mcurfld = 'thedbf.' + mflds[aa, 1]
        mvalue = &mcurfld
        ** Or you can use:
        mvalue = EVAL(mcurfld)
        ** Manipulate the contents of mvalue, possibly based on the field type
        DO CASE
        CASE mflds[aa, 2] = 'D'
            mvalue = DTOC(mvalue)
        CASE mflds[aa, 2] $ 'CM'
            ** Replace characters that are giving you problems in R
            mvalue = STRTRAN(mvalue, ["], '')
        OTHERWISE
            ** Etc.
        ENDCASE
        = FWRITE(fh, mvalue)
        IF aa # mfld_cnt
            = FWRITE(fh, [,])
        ENDIF
    ENDFOR
    = FWRITE(fh, CHR(13) + CHR(10))
ENDSCAN
= FCLOSE(fh)
Note that I'm using [ ] characters to delimit strings that include commas and quotation marks. That helps readability.
* Create a comma-delimited file with no quotes around the character fields
COPY TO myfile.csv TYPE DELIMITED WITH ""  && WITH "" is two double quotes, i.e. an empty field delimiter
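With the quotes suppressed at the source, the import on the R side becomes straightforward (a sketch; header = FALSE because COPY TO writes no header row, and the usual caveat applies that field values must not themselves contain commas):
read.csv("myfile.csv", header = FALSE, quote = "", stringsAsFactors = FALSE)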
