R read.csv colClasses scan() expected 'a logical', got '"FALSE"'

I'm reading a 25-column csv using read.csv with colClasses specified. The penultimate column is logical, so I specified "logical" in colClasses and got the error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'a logical', got '"FALSE"'
If I read in the file without colClasses - read.csv(file = i) - it works fine AND the column is imported as logical! Similarly if I use colClasses but label that column's class as NA. The column's unique values are NA, FALSE and TRUE. Opening the file in LibreOffice Calc, the entries are unquoted, i.e. FALSE and not "FALSE".
Can anyone think why this is happening? It seems highly unintuitive / buglike - though experience has taught me it's 99% likely to be me doing something stupid!
Thanks
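For reference, the workaround mentioned above can be spelled out; a minimal sketch, where i is the file path from the question and the logical column is assumed to be column 24 of the 25:

cc <- rep(NA, 25)                  # NA entries mean "let read.csv guess"
df <- read.csv(file = i, colClasses = cc)

# or: read the penultimate column as character and convert it afterwards
cc[24] <- "character"
df <- read.csv(file = i, colClasses = cc)
df[[24]] <- as.logical(df[[24]])   # "TRUE"/"FALSE" become TRUE/FALSE, "" becomes NA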


Why is read.csv getting wrong classes?

I have to read a big .csv file and read.csv is taking a while. I read that I should use read.csv to read just a few rows, get the column classes from those, and then read the whole file with the classes specified. I tried to do that:
library(magrittr)  # for %>%

read.csv(full_path_astro_data,
         header = TRUE,
         sep = ",",
         comment.char = "",
         nrows = 100,
         stringsAsFactors = FALSE) %>%
  sapply(class) -> col.classes

df_astro_data <- read.csv(full_path_astro_data,
                          header = TRUE,
                          sep = ",",
                          colClasses = col.classes,
                          comment.char = "",
                          nrows = 47000,
                          stringsAsFactors = FALSE)
But then I got an error message:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'an integer', got '0.0776562500000022'
It looks like a column that contains numeric (double) data was incorrectly classified as integer, probably because some numeric columns have many zeros at the beginning of the file. I tried increasing the number of rows read by the first read.csv call, but that did not work. One solution I found was to do
col.classes %>%
  sapply(function(x) ifelse(x == "integer", "numeric", x)) -> col.classes
With this the file is read much faster than without specifying column classes. Still, it would be best if all columns were classified correctly.
Any insights?
Thanks
I suspect you are correct: in your row sample some columns contain only integer-looking values, but outside the sample they contain non-integers. This is a common problem with large files. You need to either increase your row sample size or explicitly specify the column type for the columns where you see this happening.
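For example, to override a single guessed class by name (a sketch; the column name here is hypothetical):

col.classes["flux"] <- "numeric"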
It should be noted that readr's read_csv does this row sampling automatically. From the docs: "all column types will be imputed from the first 1000 rows on the input. This is convenient (and fast), but not robust. If the imputation fails, you'll need to supply the correct types yourself." You can do that like this:
library(readr)

read_csv(YourPathName,
         col_types = cols(YourProblemColumn1 = col_double(),
                          YourProblemColumn2 = col_double()))
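readr can also simply take a bigger sample before guessing; guess_max is a standard read_csv argument, though the sample size below is an arbitrary choice:

library(readr)
# guess column types from the first 50,000 rows instead of the default 1,000
df_astro_data <- read_csv(YourPathName, guess_max = 50000)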

Set column types for csv with read.csv.ffdf

I am using a payments dataset from Austin Texas Open Data. I am trying to load the data with the following code:
library(ff)
asd <- read.table.ffdf(file = "~/Downloads/Fiscal_Year_2010_eCheckbook_Payments.csv",
                       first.rows = 100, next.rows = 50,
                       FUN = "read.csv", VERBOSE = TRUE)
This shows me the following error:
read.table.ffdf 301..Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'an integer', got '7AHM'
This happens on the 339th line of the csv file, at the 5th column of the dataset. I think the reason is that the earlier values of the 5th column all look like integers, so the column's type was guessed as integer, whereas this value is a string. The actual type of the column should be string.
So I wanted to know if there is a way I could set the types of the columns myself.
Below I am providing the types for all the columns in a vector:
c("character","integer","integer","character","character", "character","character","character","character","character","integer","character","character","character","character","character","character","character","integer","character","character","character","character","character","integer","integer","integer","character","character","character","character","double","character","integer")
You can also find the type of each column from the description of the dataset.
Please also keep in mind that I am very new to this library. Practically just found out about it today.
Maybe you need to transform your data types. The following is just an example that may help you:
data <- transform(
  data,
  age = as.integer(age),
  sex = as.factor(sex),
  cp = as.factor(cp),
  trestbps = as.integer(trestbps),
  choi = as.integer(choi),
  fbs = as.factor(fbs),
  restecg = as.factor(restecg),
  thalach = as.integer(thalach),
  exang = as.factor(exang),
  oldpeak = as.numeric(oldpeak),
  slope = as.factor(slope),
  ca = as.factor(ca),
  thai = as.factor(thai),
  num = as.factor(num)
)
sapply(data, class)
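That converts columns after they have been read, though. To make read.csv.ffdf use the right types from the start, the types vector from the question can be passed through colClasses, which is forwarded to read.csv. A sketch, with one caveat: ffdf cannot store character vectors, so "character" is mapped to "factor" (and "double" to "numeric") first:

library(ff)
col_types <- c("character","integer","integer","character","character",
               "character","character","character","character","character",
               "integer","character","character","character","character",
               "character","character","character","integer","character",
               "character","character","character","character","integer",
               "integer","integer","character","character","character",
               "character","double","character","integer")
col_types[col_types == "character"] <- "factor"   # ffdf has no character vmode
col_types[col_types == "double"]    <- "numeric"
asd <- read.csv.ffdf(file = "~/Downloads/Fiscal_Year_2010_eCheckbook_Payments.csv",
                     colClasses = col_types,
                     first.rows = 100, next.rows = 50, VERBOSE = TRUE)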

In Scan EOF error while reading CSV file

I am trying to read a CSV file into R. I tried:
data <- read.csv(file="train.csv")
Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
EOF within quoted string
But, this reads in only a small percentage of the total observations. Then I tried removing quotes:
data <- read.csv(file="train.csv",quote = "",sep = ",",header = TRUE)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
Since the data is text, it seems there is some issue with the delimiter.
It is difficult to share the entire data set as it is huge. I tried going to the line where the error comes, but there seems to be no non printable character. I also tried other readers like fread(), but to no avail.
I have encountered this before; it can be very tricky. Try a specialized CSV reader:
library(readr)
data <- read_csv(file="train.csv")
This should do it.
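If read_csv also complains, one way to locate the offending row is to look for lines with an odd number of quote characters, since "EOF within quoted string" usually means an unbalanced quote somewhere. A quick base-R diagnostic sketch:

lines <- readLines("train.csv")
# count double-quote characters per line; well-formed rows without
# embedded newlines should contain an even number of quotes
nquotes <- nchar(lines) - nchar(gsub('"', '', lines, fixed = TRUE))
which(nquotes %% 2 == 1)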

R workarounds: fread versus read.table

I would like to understand why this difference exists between read.table and fread. Maybe you'll know a workaround to make fread work. I have two lines of code that perform the same task: read a file. fread is faster and more efficient than read.table, but read.table produces no errors on the same data set.
SUCCESSFUL READ.TABLE approach
table <- read.table("A.txt",header=FALSE,sep = "|", fill=TRUE, quote="", stringsAsFactors=FALSE)
FREAD approach
table <- fread("A.txt",header=FALSE,sep = "|")
fread returns the classic error, which I explored:
Expected sep ('|') but new line or EOF ends field 44 on line 57193 when reading data
Initially, read.table returned what I think is a similar error when fill=TRUE was not included and would not read the file.
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 7 did not have 45 elements
I am thinking that the errors might be similar in some way. According to the documentation, fill allows the following: "If TRUE then in case the rows have unequal length, blank fields are implicitly added."
Is there a workaround, similar to fill=TRUE, that might address the fread problem?
ANSWER FROM MATT DOWLE: Fill option for fread
UPDATE : Very unlikely to be done. fread is optimized for regular delimited files (where each row has the same number of columns). However, irregular files could be read into list columns (each cell itself a vector) when sep2 is implemented; not filled in separate columns as read.csv can do.
This answer highlights how data.table can now fill using fread.
https://stackoverflow.com/a/34197074/1569064
fread(input, sep="auto", sep2="auto", nrows=-1L, header="auto", na.strings="NA",
stringsAsFactors=FALSE, verbose=getOption("datatable.verbose"), autostart=1L,
skip=0L, select=NULL, drop=NULL, colClasses=NULL,
integer64=getOption("datatable.integer64"), # default: "integer64"
dec=if (sep!=".") "." else ",", col.names,
check.names=FALSE, encoding="unknown", quote="\"",
strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL,
showProgress=getOption("datatable.showProgress"), # default: TRUE
data.table=getOption("datatable.fread.datatable") # default: TRUE
)
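With fill now part of the signature, the workaround from the linked answer amounts to this (a sketch, assuming a data.table version recent enough to have fill):

library(data.table)
# fill=TRUE pads short rows with blanks, mirroring read.table's fill=TRUE
table <- fread("A.txt", header = FALSE, sep = "|", fill = TRUE)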

Reading csv files

I'm having trouble reading .csv files into R, e.g.
df1991 <- read.csv("http://dl.dropbox.com/s/vwdw2tsmgiiuxfa/1991.csv")
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
fishdata <- read.csv("http://dl.dropbox.com/s/pin16l691p6j4ll/fishdata.csv", row.names=NULL)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
I've tried all sorts of variations of the header & row.names arguments.
I want to import the .csv files from dropbox for convenience, I have done so in the past without trouble. Any suggestions?
The CSV itself looks acceptable, so perhaps it is your default settings. Could it be a locale issue (a comma being interpreted as a decimal separator)?
Could it be that the error message should be the other way around: that there are more column names than columns?
Grasping at this straw: the first column of data might be interpreted as row labels, for which no column name is required. read.csv would then expect all the given column names to relate to the columns of data after the first one, hence more column names than columns. This could be resolved by something like a row.names=1 import parameter, as in the sketch below.
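A sketch of that suggestion, assuming the first column really does hold row labels:

df1991 <- read.csv("http://dl.dropbox.com/s/vwdw2tsmgiiuxfa/1991.csv", row.names = 1)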
