How to detect comma vs semicolon separation in a file before reading it? [duplicate] - r

I have to read in a lot of CSV files automatically. Some have a comma as a delimiter, then I use the command read.csv().
Some have a semicolon as a delimiter, then I use read.csv2().
I want to write a piece of code that recognizes if the CSV file has a comma or a semicolon as a a delimiter (before I read it) so that I don´t have to change the code every time.
My approach would be something like this:
try to read.csv("xyz")
if error
read.csv2("xyz")
Is something like that possible? Has somebody done this before?
How can I check if there was an error without actually seeing it?

Here are a few approaches assuming that the only difference among the format of the files is whether the separator is semicolon and the decimal is a comma or the separator is a comma and the decimal is a point.
1) fread As mentioned in the comments fread in data.table package will automatically detect the separator for common separators and then read the file in using the separator it detected. This can also handle certain other changes in format such as automatically detecting whether the file has a header.
2) grepl Look at the first line and see if it has a comma or semicolon and then re-read the file:
L <- readLines("myfile", n = 1)
if (grepl(";", L)) read.csv2("myfile") else read.csv("myfile")
3) count.fields We can assume semicolon and then count the fields in the first line. If there is one field then it is comma separated and if not then it is semicolon separated.
L <- readLines("myfile", n = 1)
numfields <- count.fields(textConnection(L), sep = ";")
if (numfields == 1) read.csv("myfile") else read.csv2("myfile")
Update Added (3) and made improvements to all three.

A word of caution. read.csv2() is designed to handle commas as decimal point and semicolons as separators (default values). If by any chance, your csv files have semicolons as separators AND points as decimal point, you may get problems because of dec = "," setting. If this is the case and you indeed have separator as the ONLY difference between the files, it is better to change the "sep" option directly using read.table()

Related

read.csv Fails Due to Rows With A Trailing Comma

I am reading from an API into a CSV file.
I then use R to perform calculations on that data. I am using read.csv to read the data into R.
In a few cases, the last column of a row has a blank value so the row ends in a comma.
This causes read.csv to fail.
Short of writing a script to fix the file, is there any way to read the CSV with a row or rows ending with a trailing comma?
I see what I did wrong. Some of my CSV fields are enclosed in double quotes, however I failed to define a quote character in my read.csv statement.
Here is my corrected statement:
MyData <<- read.csv(file=“myfile.csv”, header=TRUE, stringsAsFactors=FALSE, sep=“,”, quote=“\””)
Note that the quote parameter is escaped with a backslash.
Thanks to all.

How to check if CSV file has a comma or a semicolon as separator?

I have to read in a lot of CSV files automatically. Some have a comma as a delimiter, then I use the command read.csv().
Some have a semicolon as a delimiter, then I use read.csv2().
I want to write a piece of code that recognizes if the CSV file has a comma or a semicolon as a a delimiter (before I read it) so that I don´t have to change the code every time.
My approach would be something like this:
try to read.csv("xyz")
if error
read.csv2("xyz")
Is something like that possible? Has somebody done this before?
How can I check if there was an error without actually seeing it?
Here are a few approaches assuming that the only difference among the format of the files is whether the separator is semicolon and the decimal is a comma or the separator is a comma and the decimal is a point.
1) fread As mentioned in the comments fread in data.table package will automatically detect the separator for common separators and then read the file in using the separator it detected. This can also handle certain other changes in format such as automatically detecting whether the file has a header.
2) grepl Look at the first line and see if it has a comma or semicolon and then re-read the file:
L <- readLines("myfile", n = 1)
if (grepl(";", L)) read.csv2("myfile") else read.csv("myfile")
3) count.fields We can assume semicolon and then count the fields in the first line. If there is one field then it is comma separated and if not then it is semicolon separated.
L <- readLines("myfile", n = 1)
numfields <- count.fields(textConnection(L), sep = ";")
if (numfields == 1) read.csv("myfile") else read.csv2("myfile")
Update Added (3) and made improvements to all three.
A word of caution. read.csv2() is designed to handle commas as decimal point and semicolons as separators (default values). If by any chance, your csv files have semicolons as separators AND points as decimal point, you may get problems because of dec = "," setting. If this is the case and you indeed have separator as the ONLY difference between the files, it is better to change the "sep" option directly using read.table()

Reading a csv file with embedded quotes into R

I have to work with a .csv file that comes like this:
"IDEA ID,""IDEA TITLE"",""VOTE VALUE"""
"56144,""Net Present Value PLUS (NPV+)"",1"
"56144,""Net Present Value PLUS (NPV+)"",1"
If I use read.csv, I obtain a data frame with one variable. What I need is a data frame with three columns, where columns are separated by commas. How can I handle the quotes at the beginning of the line and the end of the line?
I don't think there's going to be an easy way to do this without stripping the initial and terminal quotation marks first. If you have sed on your system (Unix [Linux/MacOS] or Windows+Cygwin?) then
read.csv(pipe("sed -e 's/^\"//' -e 's/\"$//' qtest.csv"))
should work. Otherwise
read.csv(text=gsub("(^\"|\"$)","",readLines("qtest.csv")))
is a little less efficient for big files (you have to read in the whole thing before processing it), but should work anywhere.
(There may be a way to do the regular expression for sed in the same, more-compact form using parentheses that the second example uses, but I got tired of trying to sort out where all the backslashes belonged.)
I suggest both removing the initial/terminal quotes and turning the back-to-back double quotes into single double quotes. The latter is crucial in case some of the strings contain commas themselves, as in
"1,""A mostly harmless string"",11"
"2,""Another mostly harmless string"",12"
"3,""These, commas, cause, trouble"",13"
Removing only the initial/terminal quotes while keeping the back-to-back quote leads the read.csv() function to produce 6 variables, as it interprets all commas in the last row as value separators. So the complete code might look like this:
data.text <- readLines("fullofquotes.csv") # Reads data from file into a character vector.
data.text <- gsub("^\"|\"$", "", data.text) # Removes initial/terminal quotes.
data.text <- gsub("\"\"", "\"", data.text) # Replaces "" by ".
data <- read.csv(text=data.text, header=FALSE)
Or, of course, all in a single line
data <- read.csv(text=gsub("\"\"", "\"", gsub("^\"|\"$", "", readLines("fullofquotes.csv", header=FALSE))))

Copy to without quotes

I have a large dataset in dbf file and would like to export it to the csv type file.
Thanks to SO already managed to do it smoothly.
However, when I try to import it into R (the environment I work) it combines some characters together, making some rows much longer than they should be, consequently breaking the whole database. In the end, whenever I import the exported csv file I get only half of the db.
Think the main problem is with quotes in string characters, but specifying quote="" in R didn't help (and it helps usually).
I've search for any question on how to deal with quotes when exporting in visual foxpro, but couldn't find the answer. Wanted to test this but my computer catches error stating that I don't have enough memory to complete my operation (probably due to the large db).
Any helps will be highly appreciated. I'm stuck with this problem on exporting from the dbf into R for long enough, searched everything I could and desperately looking for a simple solution on how to import large dbf to my R environment without any bugs.
(In R: Checked whether have problems with imported file and indeed most of columns have much longer nchars than there should be, while the number of rows halved. Read the db with read.csv("file.csv", quote="") -> didn't help. Reading with data.table::fread() returns error
Expected sep (',') but '0' ends field 88 on line 77980:
But according to verbose=T this function reads right number of rows (read.csv imports only about 1,5 mln rows)
Count of eol after first data row: 2811729 Subtracted 1 for last eol
and any trailing empty lines, leaving 2811728 data rows
When exporting to TYPE DELIMITED You have some control on the VFP side as to how the export formats the output file.
To change the field separator from quotes to say a pipe character you can do:
copy to myfile.csv type delimited with "|"
so that will produce something like:
|A001|,|Company 1 Ltd.|,|"Moorfields"|
You can also change the separator from a comma to another character:
copy to myfile.csv type delimited with "|" with character "#"
giving
|A001|#|Company 1 Ltd.|#|"Moorfields"|
That may help in parsing on the R side.
There are three ways to delimit a string in VFP - using the normal single and double quote characters. So to strip quotes out of character fields myfield1 and myfield2 in your DBF file you could do this in the Command Window:
close all
use myfile
copy to mybackupfile
select myfile
replace all myfield1 with chrtran(myfield1,["'],"")
replace all myfield2 with chrtran(myfield2,["'],"")
and repeat for other fields and tables.
You might have to write code to do the export, rather than simply using the COPY TO ... DELIMITED command.
SELECT thedbf
mfld_cnt = AFIELDS(mflds)
fh = FOPEN(m.filename, 1)
SCAN
FOR aa = 1 TO mfld_cnt
mcurfld = 'thedbf.' + mflds[aa, 1]
mvalue = &mcurfld
** Or you can use:
mvalue = EVAL(mcurfld)
** manipulate the contents of mvalue, possibly based on the field type
DO CASE
CASE mflds[aa, 2] = 'D'
mvalue = DTOC(mvalue)
CASE mflds[aa, 2] $ 'CM'
** Replace characters that are giving you problems in R
mvalue = STRTRAN(mvalue, ["], '')
OTHERWISE
** Etc.
ENDCASE
= FWRITE(fh, mvalue)
IF aa # mfld_cnt
= FWRITE(fh, [,])
ENDIF
ENDFOR
= FWRITE(fh, CHR(13) + CHR(10))
ENDSCAN
= FCLOSE(fh)
Note that I'm using [ ] characters to delimit strings that include commas and quotation marks. That helps readability.
*create a comma delimited file with no quotes around the character fields
copy to TYPE DELIMITED WITH "" (2 double quotes)

R - read.table imports half of the dataset - no errors nor warnings

I have a csv file with ~200 columns and ~170K rows. The data has been extensively groomed and I know that it is well-formed. When read.table completes, I see that approximately half of the rows have been imported. There are no warnings nor errors. I set options( warn = 2 ). I'm using 64-bit latest version and I increased the memory limit to 10gig. Scratching my head here...no idea how to proceed debugging this.
Edit
When I said half the file, I don't mean the first half. The last observation read is towards the end of the file....so its seemingly random.
You may have a comment character (#) in the file (try setting the option comment.char = "" in read.table). Also, check that the quote option is set correctly.
I've had this problem before how I approached it was to read in a set number of lines at a time and then combine after the fact.
df1 <- read.csv(..., nrows=85000)
df2 <- read.csv(..., skip=84999, nrows=85000)
colnames(df1) <- colnames(df2)
df <- rbind(df1,df2)
rm(df1,df2)
I had a similar problem when reading in a large txt file which had a "|" separator. Scattered about the txt file were some text blocks that contained a quote (") which caused the read.xxx function to stop at the prior record without throwing an error. Note that the text blocks mentioned were not encased in double quotes; rather, they just contained one double quote character here and there (") which tripped it up.
I did a global search and replace on the txt file, replacing the double quote (") with a single quote ('), solving the problem (all rows were then read in without aborting).

Resources