Reading CSV with multi-line columns in R - r

A dataset I am trying to read contains a lot of multi-line text in one column. read.csv("the_ill_formated_file.csv") reads some of it, with columns mixed up in a number of rows, and then gives this warning:
Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
EOF within quoted string
fread("the_ill_formated_file.csv") is unable to read it at all and throws this error:
Error in fread("the_ill_formated_file.csv") :
Internal error. No eol2 immediately before line 30, 'p' instead
In addition: Warning message:
In fread("the_ill_formated_file.csv") :
Detected eol as \n\r, a highly unusual line ending. According to Wikipedia the Acorn BBC used this. If it is intended that the first column on the next row is a character column where the first character of the field value is \r (why?) then the first column should start with a quote (i.e. 'protected'). Proceeding with attempt to read the file.
The following is a snippet of how the file is formatted:
"comment_id", "comment", "post_date", "reply_count", "reply_ids"
1001, "This comment is multi-line with
space between each line!
Quite a fancy format this one", "2015-08-16" , 3, "{1,2,3}"
1002, "This second row is all on a single line, which is the usual format read.csv/fread in R will expect it", "2015-08-17" , 0, "{}"
I got the same mixed-up columns when I opened it in Excel.
Thanks in advance for the assistance.
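For reference, here is a minimal sketch of how base R treats a quoted field that spans several lines when the file is otherwise well formed. It assumes there is no space between a comma and the opening quote and that line endings are plain \n, which is not what the snippet above shows. The temporary file and the name `comments` are made up for illustration.

```r
# Write a small, well-formed CSV whose "comment" field spans three lines.
# Assumption: no space after the commas, ordinary "\n" line endings.
path <- tempfile(fileext = ".csv")
writeLines(c(
  '"comment_id","comment","post_date","reply_count","reply_ids"',
  '1001,"This comment is multi-line with',
  'space between each line!',
  'Quite a fancy format this one","2015-08-16",3,"{1,2,3}"',
  '1002,"This second row is all on a single line","2015-08-17",0,"{}"'
), path)

# read.csv() keeps reading a quoted field across line breaks,
# so the three physical lines of row 1001 end up in one cell.
comments <- read.csv(path, stringsAsFactors = FALSE)

nrow(comments)                  # number of records, not physical lines
cat(comments$comment[1], "\n")  # the embedded line breaks are preserved
```

If the real file only differs by odd line endings or stray spaces, normalising those first may be enough to get to this state, but that is a guess without seeing the raw bytes.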

Related

Expected a real got "1,000.00"

I am trying to use scan() to read in a very large data set that has variables beginning on different rows. One column contains numbers, some of which are simply 20, others are 1,000.00. My code to read this data in looks like:
largedata<-scan(paste0(folder,"df.txt"),
what=list("","","",0,"","","",""),
skip=20,sep="\t",quote="",dec=".")
But my larger numbers, in the thousands, are not cooperating. I included the dec="." to try to eliminate the problem, but I am still getting this error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected 'a real', got '1,000.00'
If I just make my numeric column to be character with all the others, my data set reads in fine, but when I convert to data.frame, everything turns into a factor, and when converting the factor to numeric, my numeric values of 1,000.00 and above all become NA. Is there a way that I can read in numbers that have both commas and decimals in this format? I cannot think of another way to read in my data other than using scan().
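One possible approach, sketched under the assumption that the column has already been read as character (or has turned into a factor): remove the thousands separator with gsub() and only then convert. The vector `raw_values` below is hypothetical stand-in data, not the real file.

```r
# Hypothetical values standing in for the column that was read as character.
raw_values <- c("20", "1,000.00", "12,345.5")

# Drop the thousands separator, then convert.
# fixed = TRUE makes gsub() treat "," as a literal character.
amounts <- as.numeric(gsub(",", "", raw_values, fixed = TRUE))

# If the column has already become a factor, go through as.character() first;
# calling as.numeric() directly on a factor returns its internal level codes.
as_factor <- factor(raw_values)
from_factor <- as.numeric(gsub(",", "", as.character(as_factor), fixed = TRUE))

print(amounts)
```

The same gsub() call can be applied to the relevant element of the list returned by scan() before building the data.frame.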

Why does read.csv() sometimes get errors when specifying colClasses as "character"?

I am using read.csv() to make a data.table. When importing the columns, I need them to be imported as either 'character' or 'numeric'.
I'm using the following code (simplified for brevity):
dataCols <- c(a="character", b="character", c="numeric", d="character")
data <- data.table(read.csv(file="data.csv", row.names=1, stringsAsFactors=F, colClasses=dataCols))
For ease, I would like to have the dataCols vector be a list of all possible columns as I'm reading a number of csv files which represent the data at various parts of a process (which my code is meant to be checking for equality).
If I use the above code to read a csv file which has all the columns a, b, c and d it reads okay. If, however, I try to read a csv which only has columns a-c, I get the following error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected 'a real', got '"abc"'
where "abc" is the contents of row 1 in column b.
I'm telling it to read the column as a character, and it's getting a character, but it's giving me an error. Why is this?
Frustratingly, when I was doing this with a different thing the other day and put extra colClasses in, it just gave me a warning that said 'there are more colclasses listed than exist in your csv'.
I'm completely at a loss as to why these errors are a) different and b), in the case of the problem I described above, even appearing in the first place.
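Not a diagnosis of the error, but a sketch of one way to keep a single "all possible columns" vector: read the header first and pass only the entries of dataCols that actually exist in the file. It assumes a file without the row.names=1 column, and the temporary file and the names `header`, `present` and `dat` are invented for the example.

```r
# Hypothetical file that only has columns a, b and c.
path <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "x1,abc,1.5", "x2,def,2"), path)

# The master list of every column that might appear in any of the files.
dataCols <- c(a = "character", b = "character", c = "numeric", d = "character")

# Read a single row just to learn which columns this file contains.
header <- names(read.csv(path, nrows = 1, stringsAsFactors = FALSE))

# Keep only the classes whose names occur in the header.
present <- dataCols[intersect(names(dataCols), header)]

# A named colClasses vector is matched against the column names.
dat <- read.csv(path, colClasses = present, stringsAsFactors = FALSE)

sapply(dat, class)
```

This way the same dataCols vector can be reused for every file, and a column missing from one file is simply ignored.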

How to remove special characters while loading a csv in R?

I have this similar problem: read.csv warning 'EOF within quoted string' prevents complete reading of file
That is, when I load a csv R says:
Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
EOF within quoted string
I can get rid of this warning by passing quote="" to read.csv.
But the main problem still exists: only 22111 of the 689233 rows are read into R. I would like to try removing all special characters from the csv to see if this clears the problem.
I found this related question: How to remove specific special characters in R
But is there a way to do it in read.csv, that is in the phase when I'm reading in the file?
Did you try fread from data.table? It can optimize the task and likely deals with some common issues. As you haven't provided any data, here is a silly example:
> fread('col1,col2\n5,"4\n3"')
col1 col2
1: 5 4\n3
It was indeed a special character. There was a → (arrow, hexadecimal value 0x1A) on line 22,112.
After deleting the arrow I get the data to load normally!
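If editing the file by hand is not practical, something along these lines might work: read the file as raw bytes, drop the 0x1A byte, and hand the cleaned text to read.csv(). This is only a sketch on a made-up three-row file; `bytes`, `clean` and `dat` are names chosen for the example, and it assumes the file fits in memory.

```r
# Build a hypothetical CSV that contains a 0x1A byte in the middle of a field.
path <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("id,note\n1,ok\n2,bad"),
           as.raw(0x1a),
           charToRaw("value\n3,fine\n")),
         path)

# Read the whole file as raw bytes so nothing is interpreted on the way in.
bytes <- readBin(path, what = "raw", n = file.info(path)$size)

# Drop every 0x1A byte, leaving all other bytes untouched.
clean <- bytes[bytes != as.raw(0x1a)]

# Parse the cleaned text instead of the file on disk.
dat <- read.csv(text = rawToChar(clean), stringsAsFactors = FALSE)

nrow(dat)
```

Working on raw bytes avoids depending on how the platform's text mode treats that character.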
Solution for DataTables CSV export with special characters:
Find charset in
https://cdn.datatables.net/buttons/1.1.2/js/buttons.html5.js
or
https://cdn.datatables.net/buttons/1.1.2/js/buttons.html5.min.js
and change it from 'UTF-8' to 'UTF-8-BOM'.

Error reading .csv file from website into R

I tried to read a .csv file from a website into R as follows:
poll = read.csv("http://www.aec.gov.au/About_AEC/cea-notices/files/2013/prdelms.gaz.statics.130901.09.00.02.csv")
But then I got the warning message:
Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
EOF within quoted string
Then I searched previous Stack Overflow questions and changed my code to:
poll = read.csv("http://www.aec.gov.au/About_AEC/cea-notices/files/2013/prdelms.gaz.statics.130901.09.00.02.csv", quote="")
This seemed to solve the problem: I got no warning, and got 8855 * 26 data. My question is:
What did the original problem mean, and why did the second code fix it?
Thank you!
Your file contains a symbol ", but this symbol is normally interpreted as a quote. This broke the line that contains the symbol. You have to disable the use of this symbol as a quote.
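To illustrate the point on a tiny made-up file (not the AEC data): with quote="" the " character is treated as ordinary text, so every line is split on commas only. The file contents and the names `fields` and `no_quotes` are assumptions for the example.

```r
# Hypothetical file in which one field contains a lone double quote.
path <- tempfile(fileext = ".csv")
writeLines(c("id,name",
             '1,5" pipe',
             "2,plain",
             "3,other"),
           path)

# With quoting disabled, every physical line has the same number of fields.
fields <- count.fields(path, sep = ",", quote = "")

# quote = "" tells read.csv() not to treat any character as a quote.
no_quotes <- read.csv(path, quote = "", stringsAsFactors = FALSE)

no_quotes$name
```

The trade-off is that genuinely quoted fields containing commas would no longer be protected, so this only suits files where the quote characters are just data.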

Error in reading a massive file

I have a very large file in which one column holds text messages containing "" and \n. That column is not read properly, just because the messages contain "" and "\n". I have used the following:
dat = read.csv("abc", header=F, sep=",", quote ="\"'", stringsAsFactors=FALSE, allowEscapes=T, flush=T, comment.char="")
read.csv reads the file incorrectly, and reading it with read.table gives an error:
dat = read.table("abc", header=F, sep=",", quote="\"'", stringsAsFactors=FALSE, allowEscapes=T, flush=T, comment.char="")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 38 did not have 20 elements
So my rows break in the message column. I saved my file with eol='\r\r\n' and quote=T, but while reading I didn't find any parameter to read it back in the same format.
I saved the file with:
write.table(z,file="abc",append=F,quote=T,sep=",",eol="\r\r\r\r\r\n",row.names=F,col.names=F)
In this example,
"In case you know,give some hint
lot of text.....
.................
---------------------------------------------------------------------------
\"thank you very much for your time
and your effort\"
---------------------------------------------------------------------------"
it breaks after
"In case you know,give some hint
lot of text.....
.................
---------------------------------------------------------------------------
\"thank you very much for your time
While reading, how can I use eol to retrieve the complete text message in the same column? I am not able to read the written file back, though the file uploaded successfully into MySQL with a loading script. Any help in this direction would be appreciated.
thanks.
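A sketch of one way to get a round trip, which swaps the custom eol for CSV-style doubled quotes (qmethod="double") so that embedded quotes and line breaks stay inside one quoted field. The data frame `z` here is a made-up two-row stand-in, and the column names are invented.

```r
# Hypothetical data: one message contains line breaks and double quotes.
z <- data.frame(
  id = 1:2,
  message = c('In case you know, give some hint\n"thank you very much"\nend',
              "plain message"),
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".csv")

# qmethod = "double" writes an embedded quote as "" instead of \",
# which is the form read.csv() understands inside a quoted field.
write.table(z, file = path, sep = ",", quote = TRUE, qmethod = "double",
            row.names = FALSE, col.names = FALSE)

# Read it back; the default quote character and line ending are enough.
back <- read.csv(path, header = FALSE, stringsAsFactors = FALSE,
                 col.names = c("id", "message"))

identical(back$message, z$message)
```

write.csv() uses the doubled-quote form by default, so it could be used in place of the write.table() call if the header row is acceptable.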
