Expected a real got "1,000.00" - r

I am trying to use scan() to read in a very large data set that has variables beginning on different rows. One column contains numbers, some of which are simply 20, others are 1,000.00. My code to read this data in looks like:
largedata<-scan(paste0(folder,df.txt),
what=list("","","",0,"","","",""),
skip=20,sep="\t",quote="",dec=".")
But my larger numbers, in the thousands, are not cooperating. I included the dec="." to try to eliminate the problem, but I am still getting this error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected 'a real', got '1,000.00'
If I just make my numeric column to be character with all the others, my data set reads in fine, but when I convert to data.frame, everything turns into a factor, and when converting the factor to numeric, my numeric values of 1,000.00 and above all become NA. Is there a way that I can read in numbers that have both commas and decimals in this format? I cannot think of another way to read in my data other than using scan().
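One possible approach, offered as a sketch rather than an accepted answer: keep reading the troublesome column as character, strip the thousands separators with gsub(), and only then call as.numeric(). The two-line sample and the names txt, raw and amount are made up for illustration; in the real call, the file path, skip value and eight-column what list from the question would apply.

```r
# Hypothetical tab-separated sample standing in for the real file.
txt <- c("a\t20", "b\t1,000.00")

# Read every field as character so scan() does not choke on the commas.
raw <- scan(text = txt, what = list("", ""), sep = "\t", quote = "", quiet = TRUE)

# Remove the thousands separators, then convert the cleaned strings.
amount <- as.numeric(gsub(",", "", raw[[2]], fixed = TRUE))
amount
```

Because the values are still character vectors at the point of conversion, no factor is involved, which avoids the NA problem described above.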

Related

Data import error: importing csv into R [duplicate]

I am using read.table to read a data file and got the following error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected 'a real', got 'true'
I know that means there is some error in my data file; the problem is how to find where it is. The error message does not say which row has the issue, so it is hard to find. Alternatively, how can I skip these rows?
Here's my R code:
data<-read.csv("/home/jianfezhang/prod/conversion_yaap/data/part-r-00000",
sep="\t",
col.names=c("site",
"treatment",
"mode",
"segment",
"source",
"itemId",
"leaf_categ_id",
"condition_id",
"auct_type_code",
"start_price_lstg_curncy",
"bin_price_lstg_curncy",
"start_price_variance",
"start_price_mean",
"start_price_media",
"bin_price_variance",
"bin_price_mean",
"bin_price_media",
"is_sold"),
colClasses=c(rep("factor",5),"numeric",rep("factor",3),rep("numeric",8),"factor")
);
The error you get is caused by the colClasses argument: some values in the file do not match the data types you specified.
Most of the time when I encounter something like this, I have simply miscounted in the colClasses argument; for example, it might need to be
colClasses=c(rep("factor",5),"numeric", rep("factor",4), rep("numeric",7),"factor")
instead of the values you specified.
That may be simply checked by carefully comparing the contents of the first lines of your file with the datatypes you specified.
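A minimal sketch of that comparison, assuming a small made-up sample in place of the real file (the names sample_lines, peek, declared and check are hypothetical): read the first few rows as plain text and line them up against the classes you intend to declare.

```r
# Hypothetical first lines of a tab-separated file.
sample_lines <- c("US\tA\t101\t9.99", "UK\tB\t102\t12.50")

# Read a few rows with every column as character, so nothing can fail to parse.
peek <- read.delim(text = sample_lines, header = FALSE, nrows = 5,
                   colClasses = "character")

# The classes you plan to pass as colClasses (assumed here for illustration).
declared <- c("factor", "factor", "numeric", "numeric")

# A length mismatch already points to a counting problem.
stopifnot(length(declared) == ncol(peek))

# Side-by-side view: column, declared class, and the first value actually found.
check <- data.frame(column = names(peek),
                    declared = declared,
                    first_value = unname(unlist(peek[1, ])),
                    stringsAsFactors = FALSE)
check
```

With a real file, text = sample_lines would be replaced by the file path.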
If that does not do the trick, some value probably has a data type you do not expect. A simple, if slow, approach is to remove the colClasses argument and first read the whole file without type constraints. Add stringsAsFactors=FALSE so that text columns stay character instead of becoming factors, or pass colClasses="character" to read every column as text. That read should succeed.
Then you may try to convert each column one by one, like
data$itemId <- as.numeric(data$itemId)
and then check the result for NA values, easily done by summary(data$itemId). If you got NA values, you can call which(is.na(data$itemId)) to get the row number and check your original file whether the NA in fact is valid or if you have some data problems there.
Most of the time you will be able to narrow down your problem this way.
If your file has a lot of columns, however, this quickly becomes a lot of work.
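To cut down on the manual work, the column-by-column check can be looped. This is only a sketch under assumptions: the three-row sample, the two column names and the numeric_cols vector are invented for illustration, and every column is read as character first.

```r
# Hypothetical sample; the second row has "true" where a number is expected.
txt <- "1\t10.5\n2\ttrue\n3\t7"

data <- read.csv(text = txt, sep = "\t", header = FALSE,
                 col.names = c("site", "itemId"),
                 colClasses = "character")

# Columns that are supposed to be numeric (assumed for this example).
numeric_cols <- c("itemId")

# For each such column, find rows that hold a value but fail to convert.
bad_rows <- lapply(data[numeric_cols], function(x) {
  converted <- suppressWarnings(as.numeric(x))
  which(is.na(converted) & !is.na(x))
})

bad_rows$itemId          # row numbers to inspect in the original file
data[bad_rows$itemId, ]  # the offending records themselves
```

Rows listed in bad_rows can then be fixed in the source file or dropped before the final conversion.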

Why does read.csv() sometimes get errors when specifying colClasses as "character"?

I am using read.csv() to make a data.table. When importing the columns, I need them to be imported as either 'character' or 'numeric'.
I'm using the following code (simplified for brevity):
dataCols <- c(a="character", b="character", c="numeric", d="character")
data <- data.table(read.csv(file="data.csv", row.names=1, stringsAsFactors=F, colClasses=dataCols))
For ease, I would like to have the dataCols vector be a list of all possible columns as I'm reading a number of csv files which represent the data at various parts of a process (which my code is meant to be checking for equality).
If I use the above code to read a csv file which has all the columns a, b, c and d it reads okay. If, however, I try to read a csv which only has columns a-c, I get the following error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected 'a real', got '"abc"'
where "abc" is the contents of row 1 in column b.
I'm telling it to read the column as a character, and it's getting a character, but it's giving me an error. Why is this?
Frustratingly, when I was doing this with a different file the other day and put extra entries in colClasses, it just gave me a warning along the lines of 'there are more colClasses listed than exist in your csv'.
I'm completely at a loss as to why these errors are a) different and b), in the case of the problem I described above, appearing in the first place.
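One thing worth knowing here: ?read.table states that colClasses is specified per column and so includes the column of row names, if any. Whatever the exact cause in this case, a way to sidestep the mismatch is to read the header first and build a positional colClasses vector only for the columns the file actually has. The sketch below assumes a made-up file with a row-name column called id; txt, header and classes are hypothetical names, and with a real file text = txt would become file = "data.csv".

```r
dataCols <- c(a = "character", b = "character", c = "numeric", d = "character")

# Hypothetical file contents: a row-name column plus a, b and c (no d).
txt <- 'id,a,b,c\nr1,"x","abc",1.5\nr2,"y","def",2'

# Read only the header line to learn which columns are present.
header <- scan(text = txt, what = "", sep = ",", nlines = 1, quiet = TRUE)

# One class per column in file order; columns not listed in dataCols
# (here the row-name column) become NA, which lets R guess their type.
classes <- unname(dataCols[header])

data <- read.csv(text = txt, row.names = 1, stringsAsFactors = FALSE,
                 colClasses = classes)
str(data)
```

This keeps dataCols as the single list of all possible columns while never handing read.csv() a class for a column that is absent.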

How to read a .csv file containing apostrophes into R?

I am having difficulty getting R to read a .txt or .csv file that contains apostrophes.
Some of my columns contain descriptive text, such as "Attends to customers' needs" or "Sheriff's deputy". My file opens correctly in Excel (that is, all the data appear in the correct cells; there are 3 columns and about 8000 rows, and there is no missing data). But when I ask R to read the file, this is what happens:
data <-read.table("datafile.csv", sep=",", header=TRUE)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 520 did not have 3 elements
(Line 520 is the first line that contains an apostrophe.)
If I go into the .txt or .csv file and manually remove all the apostrophes, then R reads the file correctly. However, I'd rather keep the apostrophes if I can.
I am new to R and would be grateful for any help.
By default, read.table sees single and double quotes as quoting characters. You need to add quote="\"" to your read.table call. Or, you could just use read.csv, which only sees double quotes as quoting characters by default.
Thoroughly studying the options in ?read.table will pay off in the long run. The default value for the quoting characters is quote = "\"'", which is really only two characters after R parses that expression: single quote and double quote. You can remove both from consideration with quote="". It is sometimes also necessary to disable 'comment.char', which defaults to "#", and it may be helpful to set 'as.is' to TRUE to prevent strings from being converted to factors.
Setting the parameter quote="\\" in read.table should do the trick.
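As a small illustration of the quote argument, here is a sketch on made-up data (the three column names and two rows are invented; the real file would be passed by name instead of text = txt):

```r
# Hypothetical comma-separated sample containing apostrophes.
txt <- "id,job,note\n1,Sheriff's deputy,ok\n2,Clerk,Attends to customers' needs"

# Treat only the double quote as a quoting character, so apostrophes are plain text.
data <- read.table(text = txt, sep = ",", header = TRUE, quote = "\"",
                   stringsAsFactors = FALSE)
data
```

With the default quote setting, the apostrophe in the first data row would open a quoted string that runs on into the next line, which is what produces the "did not have 3 elements" error.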

Data type error with scan

I have a .dat file of time series data for options so it includes trade date and expiration date, in addition to the price data for which I want to do timeseries analysis in R.
I am new to R, so I have been following some of the examples online. In my attempt to load the data as a data frame, I tried scan(), but I get the following error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected 'a real', got '2010-Aug-09,2011-Aug-19,C00026000,0.23985,5.53,0.999999,0.00712328'
I understand it is expecting a real, but I need to read in the dates and the option ticker to make sense of the time series. Can someone give me some guidance on how to go about it? Thanks.
Scan requires that you specify the contents of the data; by default it assumes that you are just reading in numbers (which you aren't).
As per Joran's comment, read.csv (or read.table) is much more user friendly for reading in data frames from file. Use that instead.
I'll reiterate that scan is a pretty low-level function and in almost every case you're better off using read.table or read.csv.
But to get scan to work on what I am inferring is in your .dat file, you need to tell it (at least) what the field separator is and what the data types are. So something like:
# scan() takes each field's type from the type of the matching element of 'what': "" means character, 0 means numeric
scan('temp.dat',sep=',',what=list("","","",0,0,0,0))
would do the trick.
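For comparison, here is a sketch of the read.csv route recommended above, using the single line from the error message as sample data. The column names are hypothetical, and the file is assumed to have no header row.

```r
# The one record quoted in the error message, used as stand-in data.
txt <- "2010-Aug-09,2011-Aug-19,C00026000,0.23985,5.53,0.999999,0.00712328"

opts <- read.csv(text = txt, header = FALSE,
                 col.names = c("trade_date", "expiry_date", "ticker",
                               "x1", "x2", "x3", "x4"),  # made-up names
                 colClasses = c(rep("character", 3), rep("numeric", 4)))

# Convert the date strings; %b matches abbreviated month names in the
# current locale, so this step assumes an English locale.
opts$trade_date  <- as.Date(opts$trade_date,  format = "%Y-%b-%d")
opts$expiry_date <- as.Date(opts$expiry_date, format = "%Y-%b-%d")

str(opts)
```

The result is a data frame directly, with the dates and ticker kept alongside the numeric columns.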
