How to remove special characters while loading a csv in R?

I have a problem similar to this one: read.csv warning 'EOF within quoted string' prevents complete reading of file
That is, when I load a csv, R says:
Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
EOF within quoted string
I can get rid of this warning by passing quote="" to read.csv.
But the main problem remains: only 22111 of the 689233 rows are read into R. I would like to try removing all special characters from the csv to see if this clears the problem.
Related, I found this: How to remove specific special characters in R
But is there a way to do it in read.csv, that is, while the file is being read in?

Did you try fread from data.table? It is optimized for this task and likely deals with some common issues. As you haven't provided any sample data, I'm giving a silly example:
> fread('col1,col2\n5,"4\n3"')
col1 col2
1: 5 4\n3

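It was indeed a special character. There was a → (arrow, hexadecimal value 0x1A) on line 22,112. 0x1A is the ASCII SUB (Ctrl-Z) control character, which can be treated as an end-of-file marker when a file is read in text mode on Windows; that would explain why reading stopped at that row.
After deleting the arrow, the data loads normally!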
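If deleting the character by hand is not practical, one option is to strip it before parsing. The following is only a sketch under assumptions: the demo file, its column names, and its contents are made up for illustration, and it assumes the file fits in memory. It reads the file as raw bytes, drops every 0x1A byte, and passes the cleaned text to read.csv.

```r
# Build a small demo file with a SUB byte (0x1A) inside a field.
# The contents are hypothetical; replace `path` with your own file.
path <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("id,name\n1,foo"),
           as.raw(0x1a),
           charToRaw("bar\n2,baz\n")),
         path)

# Read the whole file as raw bytes so that no byte is interpreted as EOF.
bytes <- readBin(path, what = "raw", n = file.size(path))

# Drop every 0x1A byte, then parse the remaining text as CSV.
clean <- bytes[bytes != as.raw(0x1a)]
df <- read.csv(text = rawToChar(clean), stringsAsFactors = FALSE)
```

Reading in binary mode sidesteps any text-mode handling of the control character, at the cost of holding the whole file in memory once.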

Solution for DataTables CSV export with special characters:
Find charset in
https://cdn.datatables.net/buttons/1.1.2/js/buttons.html5.js
or
https://cdn.datatables.net/buttons/1.1.2/js/buttons.html5.min.js
and change it from 'UTF-8' to 'UTF-8-BOM'.

Related

Data import error: importing csv into R [duplicate]

I am using read.table to read a data file and got the following error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected 'a real', got 'true'
I know that means there's some error in my data file; the problem is how to find where it is. The error message does not say which row has the issue, so it's hard for me to find it. Alternatively, how can I skip these rows?
Here's my R code:
data <- read.csv("/home/jianfezhang/prod/conversion_yaap/data/part-r-00000",
                 sep = "\t",
                 col.names = c("site",
                               "treatment",
                               "mode",
                               "segment",
                               "source",
                               "itemId",
                               "leaf_categ_id",
                               "condition_id",
                               "auct_type_code",
                               "start_price_lstg_curncy",
                               "bin_price_lstg_curncy",
                               "start_price_variance",
                               "start_price_mean",
                               "start_price_media",
                               "bin_price_variance",
                               "bin_price_mean",
                               "bin_price_media",
                               "is_sold"),
                 colClasses = c(rep("factor", 5), "numeric", rep("factor", 3), rep("numeric", 8), "factor")
)

The error you get is caused by the colClasses argument: some values in the file do not match the data types you specified.
Most of the time when I encounter something like this, I have simply miscounted in the colClasses argument, e.g. it should maybe be
colClasses=c(rep("factor",5),"numeric", rep("factor",4), rep("numeric",7),"factor")
instead of the values you specified.
You can check this by carefully comparing the first lines of your file with the data types you specified.
If this does not do the trick, you probably have a value of the wrong type where you do not expect it. A simple yet slow approach is to remove the colClasses argument and first read the whole file without specific options. Add stringsAsFactors=FALSE so that text columns stay character instead of becoming factors, or pass colClasses="character" to read every column as text. This should work.
Then you can convert each column one by one, like
data$itemId <- as.numeric(data$itemId)
and then check the result for NA values, which is easily done with summary(data$itemId). If you get NA values, you can call which(is.na(data$itemId)) to get the row numbers and check in your original file whether each NA is in fact valid or whether you have a data problem there.
Most of the time you will be able to narrow down your problem this way.
If your file has a lot of columns, however, this quickly becomes a lot of work....
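The column-by-column check described above might look roughly like this. It is a minimal sketch on made-up data: the three demo lines and the reduced set of column names are assumptions standing in for the real file.

```r
# Hypothetical tab-separated sample; the second row has "true" where a
# number is expected, mimicking the scan() error from the question.
lines <- c("a\t1.5\tx",
           "b\ttrue\ty",
           "c\t3\tz")

# Read every column as text so that nothing fails during import.
chr_data <- read.table(text = lines, sep = "\t",
                       colClasses = "character",
                       col.names = c("site", "itemId", "mode"))

# Attempt the conversion; values that are not numbers become NA.
converted <- suppressWarnings(as.numeric(chr_data$itemId))

# Rows that held some text but failed to convert are the suspects.
bad_rows <- which(is.na(converted) & !is.na(chr_data$itemId))
chr_data[bad_rows, ]
```

The same check can be wrapped in a loop over the columns that are supposed to be numeric, which takes some of the tedium out of wide files.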

R stops reading a table when coming across "#"

I'm new to R and I'm trying to read a tsv file where sometimes there is a "#" in the table. R just stopped reading when coming across the "#" and gave me the error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 6227 did not have 6 elements
I looked at that line in the file and I found the "#". The data looks like this:
CM School Supply #1 Upland CA 3 8 Shopping
When I delete it, R can continue reading the table, but I have more "#"s in the file...
Which arguments should I set in read.table()? I searched everywhere for a solution but failed... I hope someone here can help me out. Thanks!

You can completely turn off read.table()'s interpretation of comment characters (by default set to "#") by setting comment.char="" in your call to read.table().
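As a small illustration, here is a sketch using the sample row from the question. The header line and its column names are invented for the demo, and the fields are assumed to be tab-separated.

```r
# Hypothetical header plus the row from the question, tab-separated.
lines <- c("name\tcity\tstate\trating\treviews\tcategory",
           "CM School Supply #1\tUpland\tCA\t3\t8\tShopping")

# comment.char = "" stops read.table from cutting the line off at "#".
df <- read.table(text = lines, sep = "\t", header = TRUE,
                 comment.char = "", stringsAsFactors = FALSE)
```

With the default comment.char = "#", everything after "CM School Supply " on that line would be discarded, which is what produces the "did not have 6 elements" error.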

How to read a tsv file in R where some elements contain \t?

A question close to mine was asked and answered here.
My problem is fairly simple: I need to import a .tsv file into R, but I cannot because some elements contain a \t, so I receive an error like:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 34 did not have 6 elements
One way to proceed would be to use gsub to replace the \ts. But the file is quite big, around 11GB, and this pre-processing would probably be too much for my machine. Any ideas for a possible shortcut?
Some context: in the end I need to import the whole dataset into a SQL database. I could do that without this conversion, but at that point I would have the same problem.
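This does not repair the embedded tabs, but it may help to find out first how many lines are affected. A minimal sketch, assuming made-up demo data and an expected width of 3 fields (6 in the question): count.fields() from utils reports the number of fields on each line without building a data frame.

```r
# Hypothetical demo file; the third line has a stray tab inside a field.
path <- tempfile(fileext = ".tsv")
writeLines(c("a\tb\tc",
             "d\te\tf",
             "g\th\textra\ti"),
           path)

expected <- 3  # number of fields every line should have (assumed)

# Count fields per line, with quoting and comment handling switched off
# so that only the tab separator matters.
n_fields <- count.fields(path, sep = "\t", quote = "", comment.char = "")

# Line numbers whose field count differs from the expected width.
bad_lines <- which(n_fields != expected)
```

If only a handful of lines turn up, fixing them individually may be cheaper than running gsub over the whole file.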

How to read a .csv file containing apostrophes into R?

I am having difficulty getting R to read a .txt or .csv file that contains apostrophes.
Some of my columns contain descriptive text, such as "Attends to customers' needs" or "Sheriff's deputy". My file opens correctly in Excel (that is, all the data appear in the correct cells; there are 3 columns and about 8000 rows, and there is no missing data). But when I ask R to read the file, this is what happens:
data <-read.table("datafile.csv", sep=",", header=TRUE)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 520 did not have 3 elements
(Line 520 is the first line that contains an apostrophe.)
If I go into the .txt or .csv file and manually remove all the apostrophes, then R reads the file correctly. However, I'd rather keep the apostrophes if I can.
I am new to R and would be grateful for any help.

By default, read.table sees single and double quotes as quoting characters. You need to add quote="\"" to your read.table call. Or, you could just use read.csv, which only sees double quotes as quoting characters by default.
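A quick sketch of the first suggestion, on invented data (the column names and rows are assumptions for illustration only):

```r
# Hypothetical CSV lines with apostrophes inside double-quoted fields.
lines <- c('title,description,score',
           '"Clerk","Attends to customers\' needs",3',
           '"Deputy","Sheriff\'s deputy",4')

# Only the double quote is treated as a quoting character, so the
# apostrophes are read as ordinary text.
df <- read.table(text = lines, sep = ",", header = TRUE,
                 quote = "\"", stringsAsFactors = FALSE)
```

Double-quoted fields are still unwrapped as usual; only the special meaning of the single quote is switched off.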

Thoroughly studying the options in ?read.table will pay off in the long run. The default value for the quoting characters is quote = "\"'", which is really only two characters after R parses that expression: single quote and double quote. You can remove them both from consideration using quote="". It's sometimes necessary to also disable 'comment.char', which defaults to "#", and it may be helpful to change 'as.is' to TRUE to prevent strings from getting converted to factors.

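Setting the parameter quote="\\" in read.table may also do the trick, since it replaces the default quoting characters with a backslash, so apostrophes are no longer treated as quotes. Note that double quotes stop acting as quotes too.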
