R workarounds: fread versus read.table

I would like to understand why this difference exists between read.table and fread, and maybe you'll know a workaround to make fread work. I have two lines of code that accomplish the same goal: to read a file. fread is faster and more efficient than read.table, but read.table produces no errors on the same data set.
SUCCESSFUL READ.TABLE approach
table <- read.table("A.txt",header=FALSE,sep = "|", fill=TRUE, quote="", stringsAsFactors=FALSE)
FREAD approach
table <- fread("A.txt",header=FALSE,sep = "|")
fread returns the classic error, which I have explored:
Expected sep ('|') but new line or EOF ends field 44 on line 57193 when reading data
Initially, when fill=TRUE was not included, read.table returned what I think is a similar error and would not read the file.
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 7 did not have 45 elements
I am thinking that the errors might be similar in some way. According to the documentation, fill does the following: "If TRUE then in case the rows have unequal length, blank fields are implicitly added."
Is there a workaround, similar to fill=TRUE, that might address the fread problem?

ANSWER FROM MATT DOWLE: Fill option for fread
UPDATE: Very unlikely to be done. fread is optimized for regular delimited files (where each row has the same number of columns). However, irregular files could be read into list columns (each cell itself a vector) once sep2 is implemented, rather than filled into separate columns as read.csv can do.

This answer highlights that data.table's fread can now fill short rows:
https://stackoverflow.com/a/34197074/1569064
fread(input, sep="auto", sep2="auto", nrows=-1L, header="auto", na.strings="NA",
stringsAsFactors=FALSE, verbose=getOption("datatable.verbose"), autostart=1L,
skip=0L, select=NULL, drop=NULL, colClasses=NULL,
integer64=getOption("datatable.integer64"), # default: "integer64"
dec=if (sep!=".") "." else ",", col.names,
check.names=FALSE, encoding="unknown", quote="\"",
strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL,
showProgress=getOption("datatable.showProgress"), # default: TRUE
data.table=getOption("datatable.fread.datatable") # default: TRUE
)
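With fill now available in fread, a minimal sketch for the original file (assuming the same A.txt and pipe separator from the question) would be:
library(data.table)
# fill=TRUE pads rows that have fewer fields than expected,
# much like read.table's fill=TRUE
table <- fread("A.txt", header=FALSE, sep="|", fill=TRUE)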

Related

Why is read.csv getting wrong classes?

I have to read a big .csv file and read.csv is taking a while. I read that I should use read.csv to read a few rows, get the column classes, and then read the whole file. I tried to do that:
library(magrittr)  # provides the %>% pipe used below
read.csv(full_path_astro_data,
         header=TRUE,
         sep=",",
         comment.char="",
         nrows=100,
         stringsAsFactors=FALSE) %>%
  sapply(class) -> col.classes
df_astro_data <- read.csv(full_path_astro_data,
                          header=TRUE,
                          sep=",",
                          colClasses=col.classes,
                          comment.char="",
                          nrows=47000,
                          stringsAsFactors=FALSE)
But then I got an error message:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'an integer', got '0.0776562500000022'
It looks like a column that contains numeric (double?) data was incorrectly classified as integer. This could be because some numeric columns contain only whole numbers (many zeros) in their first rows, so the sample looks integer. I tried to increase the number of rows in the first read.csv command, but that did not work. One workaround I found was:
col.classes %>%
sapply(function(x) ifelse(x=="integer", "numeric", x)) -> col.classes
With this the file is read much faster than without specifying column classes. Still, it would be best if all columns were classified correctly.
Any insights?
Thanks
I suspect you are correct: in your row sample some columns contain only integers, but outside your row sample they contain non-integers. This is a common problem with large files. You need either to increase your row sample size or to explicitly specify the column type for the columns where you see this happening.
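For base read.csv, that means editing the guessed classes before the full read; a minimal sketch, where "flux" and "magnitude" are placeholder names for whichever columns turn out to be misclassified:
# col.classes is the named character vector from the 100-row sample read above
col.classes[c("flux", "magnitude")] <- "numeric"   # hypothetical problem columns
df_astro_data <- read.csv(full_path_astro_data,
                          colClasses=col.classes,
                          stringsAsFactors=FALSE)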
It should be noted that readr's read_csv does this row sampling automatically. From the docs: "all column types will be imputed from the first 1000 rows on the input. This is convenient (and fast), but not robust. If the imputation fails, you'll need to supply the correct types yourself." You can do that like this:
read_csv( YourPathName,
col_types = cols(YourProblemColumn1 = col_double(),
YourProblemColumn2 = col_double())
)
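Another readr-side option (my addition, not part of the original answer) is to raise guess_max, which increases the number of rows used for type guessing:
library(readr)
# guess column types from the first 100000 rows instead of the default
df <- read_csv(YourPathName, guess_max = 100000)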

R read.csv colClasses scan() expected 'a logical', got '"FALSE"'

I'm reading a 25-column csv using read.csv with colClasses specified. The penultimate column is logical, so I specified "logical" in colClasses and got the error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'a logical', got '"FALSE"'
If I read in the file without colClasses - read.csv(file = i) - it works fine AND the column is imported as logical! The same happens if I use colClasses but label that column as NA. The column's unique values are NA, FALSE and TRUE. Opening the file in LibreOffice Calc, the entries are unquoted, i.e. FALSE and not "FALSE".
Can anyone think why this is happening? It seems highly unintuitive / buglike - though experience has taught me it's 99% likely to be me doing something stupid!
Thanks
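For reference, a minimal sketch of the kind of call described above (the file name and the classes of the other 24 columns are placeholders; only the penultimate "logical" matters here):
# 25 columns: the 24th (penultimate) is declared logical
classes <- c(rep("character", 23), "logical", "character")
df <- read.csv("myfile.csv", colClasses = classes)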

read.table quits after encountering special symbol

I am trying to read.table for a tab-delimited file using the following command:
df <- read.table("input.txt", header=FALSE, sep="\t", quote="", comment.char="",
encoding="utf-8")
There should be 30 million rows. However, after read.table() returns, df contains only ~6 million rows. And there is a warning message like this:
Warning message:
In read.table("input.txt", header = FALSE, sep = "\t", quote = "", :
incomplete final line found by readTableHeader on 'input.txt'
I believe read.table quits after encountering a special symbol (ASCII code 1A, Substitute) in one of the string columns. In the input file, the only intended special character is the tab, because it is used to separate columns. Is there any way to ask read.table to treat every other character as not special?
If you have 30 million rows, I would use fread rather than read.table. It is faster. Learn more about it here: http://www.inside-r.org/packages/cran/data.table/docs/fread
fread(input, sep="auto", encoding = "UTF-8" )
Regarding your issue with read.table, I think the solutions here should solve it:
'Incomplete final line' warning when trying to read a .csv file into R
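A minimal fread sketch for the file in the question (assuming input.txt is tab-delimited with no header, as described):
library(data.table)
# mirror the original read.table call: tab-separated, no header, quoting disabled
df <- fread("input.txt", sep="\t", header=FALSE, quote="", encoding="UTF-8")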

read txt file with some data missing

I realize similar questions have already been posed, but none of the answers provided solves my problem, and frustration is beginning to set in. The problem is the following: I have 27 identically shaped time series files (date, Open, High, Low, Last) in txt format, and I want to import each .txt file into R so that the first line read is the first one with all five fields. The example below shows that, while the data in the text file starts on 1984-01-03, I would like the file to be read from 1990-11-05 onward (since Open is missing for earlier dates), with the first column of dates saved as row names and the other four columns saved as numeric with the obvious name for each column.
Open High Low Last
1984-01-03 1001.40 997.50 997.50
1984-01-04 999.50 993.30 998.60
1990-11-05 2038.00 2050.20 2038.00 2050.10
1990-11-06 2055.00 2071.00 2052.20 2069.80
Given that this is a common problem, I have tried the following code:
ftse <- read.table("FTSE.txt", sep="", quote="", dec=".", as.is=TRUE,
blank.lines.skip=TRUE, strip.white=TRUE,na.strings=c("","NA"),
row.names=1, col.names=c("Open","High","Low","Last"))
I have tried all sorts of combinations, also specifying colClasses, header=TRUE and other arguments (with fill=TRUE the data is actually read, but that is exactly what I don't want), but I always get the following error (or with a different line number in the message):
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1731 did not have 4 elements
Line 1731 is the one corresponding to the date 1984-01-03. I am kindly asking for help, since I cannot afford to lose any more time on such issues, so please provide suggestions on how I can fix this. Thank you in advance.
I don't know what the general solution might be, but a combination of readLines and read.fwf might work in your case:
ftse.lines <- readLines("FTSE.txt")
ftse.lines <- ftse.lines[ftse.lines != ""] # skip empty lines
ftse <- read.fwf(textConnection(ftse.lines), widths=c(11,8,8,8,8), skip=1, row.names=1)
names(ftse) <- c("Open", "Hi", "Lo", "Last")
You may need to modify some parts but it works with your example.
The following (using just read.fwf) also works:
ftse <- read.fwf("FTSE.txt", widths=c(11,8,8,8,8), col.names=c("blah", "Open", "Hi", "Lo", "Last"), skip=1)
And then try to convert the first col to rownames if that's really needed.
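A minimal sketch of that clean-up step, assuming the read.fwf call above and that the blank Open fields come through as NA (the column name "blah" is the placeholder used above):
rownames(ftse) <- trimws(ftse$blah)    # use the date column as row names
ftse$blah <- NULL                      # drop the now-redundant date column
ftse <- ftse[complete.cases(ftse), ]   # keep only rows with all four prices present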

Read.table can't get all data lines

There is a strange thing when I use read.table to get data.
data=read.table('/home/tiger/nasdaqlisted.txt',head=T,sep='|')
dim(data)
[1] 750 6
In fact, there are 2454 lines in the file. What's wrong?
http://freeuploadfiles.com/bb3cwypih2d2
I think the issue comes from the fact that some of the names contain the quote character ' (in names such as Angie's List, Inc.).
Since the default quote argument in read.table is "\"'", it needs to be changed for your data to be read correctly:
read.table("path/to/file", header=TRUE, sep="|", quote="")
As per @mrdwab's suggestion, read.delim, which has "\"" as its default quote argument, will work without needing any change:
read.delim("path/to/file", header=TRUE, sep="|")
