I have a tab-delimited file that I am trying to read into R. String variables are denoted with double quotes, however, there are additional double quotes inside some of the strings themselves:
There are also coordinates in some of the string variables that produce the same issue of quotes inside the string:
When trying to read into R, it provide the following error:
"Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 237 did not have 55 elements."
How can I read these files into R without having it split up the string? Alternatively, is there a way to save the source data differently so that this issue does not occur when reading into R?
Related
I've been trying to read in a pipe delimited csv file containing 96 variables about some volunteer water quality data. Randomly within the file, there's single and double quotation marks as well as semi-colons, dashes, slashes, and likely other special characters
Name: Jonathan "Joe" Smith; Jerry; Emily; etc.
From the output of several variables (such as IsNewVolunteer), it seems that r is having issues reading in the data. IsNewVolunteer should always be Y or N, but numbers are appearing and when I queried those lines it appears that the data is getting shifted. Variables that are clearly not names are in the Firstname and lastname column.
The original data format makes it a little difficult to see and troubleshoot, especially due to extra variables. I would find a way to remove them, but the goal of the work with R is to provide code that will be able to run on a dataset that is frequently updated.
I've tried
read.table("dnrvisualstream.csv",sep="|",stringsAsFactors = FALSE,quote="")
But that produces the following error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 132 did not have 94 elements
However, there's nothing out of the ordinary that I've noticed about line 132. I've had more success with
read.csv("dnrvisualstream.csv",sep="|",stringsAsFactors = FALSE,quote="")
but that still produces offsets and errors as discussed above. Is there something I'm doing incorrectly? Any information would be helpful.
I think it's one of two issues:
Encoding is either UTF-8 or UTF-16:
Try this...
read.csv("dnrvisualstream.csv", sep = "|", stringsAsFactors = FALSE, quote = "", encoding = UTF-8)
or this...
read.csv("dnrvisualstream.csv", sep = "|", stringsAsFactors = FALSE, quote = "", encoding = UTF-16)
Too many separators:
If this doesn't work, right-click on your .csv file and open it in a text editor. You have multiple |||| separators in rows 2,3,4, and 21,22 that are visible in your screenshot. Press CTRL+H to find and replace:
Find: ||||
Replace: |
Save the new file and try to open in R again.
I have the following code:
myTable[i,] = strsplit(line, split=";")[[1]]
write.csv(myTable[-1,], file="episodes_cleared.csv", sep=";", row.names=FALSE, quote=FALSE)
Unfortunately, the separator still is ',':
iEpisodeId,iPatientId,sTitle,sICPc,dStart,dEnd,bProblem
Running the code gives me:
Warning messages:
1: In write.csv(myTable[-1, ], file = "episodes_cleared.csv", sep = ";", : attempt to set 'sep' ignored
2: In write.csv(myTable[-1, ], file = "episodes_cleared.csv", sep = ";", :
attempt to set 'sep' ignored
What am I doing wrong?
First, you should provide a reproducible example.
However, if you use write.csv2 it defaults to using a semicolon as the separator.
The documentation https://stat.ethz.ch/R-manual/R-devel/library/utils/html/write.table.html says:
write.csv uses "." for the decimal point and a comma for the separator.
write.csv2 uses a comma for the decimal point and a semicolon for the separator, the Excel convention for CSV files in some Western European locales.
These wrappers are deliberately inflexible: they are designed to ensure that the correct conventions are used to write a valid file. Attempts to change append, col.names, sep, dec or qmethod are ignored, with a warning.
So it not only defaults to this seperator, it enforces it. Use
write.table(data,file="data.csv",sep=",",dec = " ")
You can also use fwrite, fast CSV writer, with option sep = ";".
I have this similar problem: read.csv warning 'EOF within quoted string' prevents complete reading of file
That is, when I load a csv R says:
Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
EOF within quoted string
I can get rid of this error by applying: quotes="" to read.csv
But the main problem still exists, only 22111 rows of 689233 in total are read into R. I would like to try removing all special characters from the csv to see if this clears the problem.
Related I found this: How to remove specific special characters in R
But is there a way to do it in read.csv, that is in the phase when I'm reading in the file?
Did you try fread from data.table? It can optimize the task and likely deal with some common issues. As you haven't provide any piece of data, I'm giving a silly example:
> fread('col1,col2\n5,"4\n3"')
col1 col2
1: 5 4\n3
It was indeed a special charcter. There was a → (arrow, hexadecimal value 0x1A) on line 22,112.
After deleting the arrow I get the data to load normally!
Solution of datatable expord csv with special chahracters
Find charset from
https://cdn.datatables.net/buttons/1.1.2/js/buttons.html5.js
or
https://cdn.datatables.net/buttons/1.1.2/js/buttons.html5.min.js
and change it to 'UTF-8-BOM'from 'UTF-8'
I have seen in several cases that while read.table() is not able to read a tab delimited file (for example the annotation table of a microarray) returning the following error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line xxx did not have yyy elements
read.csv() works perfectly on the same file with no errors. I think also the speed of read.csv() is also higher than read.table().
Even more: read.table() is doing very crazy reading a file of me. It makes this error while reading line 100, but when I copy and paste lines 90 to 110 just after the head of the same file, it still makes error of line 100+21 (new lines copied at the beginning). If there is any problem with that line, why doesn't it report that error while reading the pasted line at the beginning? I confirm that read.csv() reads the same file with no error.
Do you have any idea of why read.table() is unable to read the same files that read.csv() works on it? Also is there any reason to use read.table() in any cases?
read.csv is a fairly thin wrapper around read.table; I would be quite surprised if you couldn't exactly replicate the behaviour of read.csv by supplying the correct arguments to read.table. However, some of those arguments (such as the way that quotation marks or comment characters are handled) could well change the speed and behaviour of the function.
In particular, this is the full definition of read.csv:
function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
fill = TRUE, comment.char = "", ...) {
read.table(file = file, header = header, sep = sep, quote = quote,
dec = dec, fill = fill, comment.char = comment.char, ...)
}
so as stated it's just read.table with a particular set of options.
As #Chase states in the comments below, the help page for read.table() says just as much under Details:
read.csv and read.csv2 are identical to read.table except for the defaults. They are intended for reading ‘comma separated value’ files (‘.csv’) or (read.csv2) the variant used in countries that use a comma as decimal point and a semicolon as field separator.
Don't use read.table to read tab-delimited files, use read.delim. (It is just a thin wrapper around read.table but it sets the options to appropriate values)
read_table() does fail sometime on tab sep'ed file and setting sep='\s+' may help assuming item in your table have no space
I am having difficulty getting R to read a .txt or .csv file that contains apostrophes.
Some of my columns contain descriptive text, such as "Attends to customers' needs" or "Sheriff's deputy". My file opens correctly in Excel (that is, all the data appear in the correct cells; there are 3 columns and about 8000 rows, and there is no missing data). But when I ask R to read the file, this is what happens:
data <-read.table("datafile.csv", sep=",", header=TRUE)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 520 did not have 3 elements
(Line 520 is the first line that contains an apostrophe.)
If I go into the .txt or .csv file and manually remove all the apostrophes, then R reads the file correctly. However, I'd rather keep the apostrophes if I can.
I am new to R and would be grateful for any help.
By default, read.table sees single and double quotes as quoting characters. You need to add quote="\"" to your read.table call. Or, you could just use read.csv, which only sees double quotes as quoting characters by default.
Thoroughly studying the options in ?read.table will pay off in the long run. The default values for quoting characters is quote = "\"'", which is really only two characters after R parses that expression, single-quote and double-quote. You can remove them both from consideration using quotes=NA. It's sometimes necessary to also remove the 'comment.char' defaulting to "#", and it may be helpful to change 'as.is' to TRUE to prevent strings from getting converted to factors.
Setting the parameter quote="\\" in read.table should do the trick.