read.table quits after encountering special symbol - r

I am trying to read a tab-delimited file with read.table, using the following command:
df <- read.table("input.txt", header=FALSE, sep="\t", quote="", comment.char="",
encoding="utf-8")
There should be 30 million rows. However, after read.table() returns, df contains only ~6 million rows. And there is a warning message like this:
Warning message:
In read.table("input.txt", header = FALSE, sep = "\t", quote = "", :
incomplete final line found by readTableHeader on 'input.txt'
I believe read.table quits after encountering a special symbol (ASCII code 0x1A, Substitute) in one of the string columns. In the input file, the only character that should be special is the tab, because it is used to separate columns. Is there any way to ask read.table to treat every other character as not special?

With 30 million rows, I would use fread from the data.table package rather than read.table; it is faster. Learn more about it here: http://www.inside-r.org/packages/cran/data.table/docs/fread
fread(input, sep="auto", encoding = "UTF-8" )
Regarding your issue with read.table, I think the solutions here should solve it:
'Incomplete final line' warning when trying to read a .csv file into R
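If the 0x1A byte really is what stops the read (an assumption; on Windows this byte can be treated as an end-of-file marker when a file is read in text mode), one possible workaround is to strip it from a copy of the file before calling read.table. The sketch below uses only base R and a small made-up demo file; the file contents and the variable names `demo`, `bytes` and `clean` are illustrative, not taken from the question.

```r
# Build a tiny tab-delimited demo file with a 0x1A byte inside a string column.
demo <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("a\tfoo"), as.raw(0x1a), charToRaw("bar\nb\tbaz\n")), demo)

# Read the file as raw bytes, drop every 0x1A byte, and write a cleaned copy.
bytes <- readBin(demo, what = "raw", n = file.size(demo))
clean <- tempfile(fileext = ".txt")
writeBin(bytes[bytes != as.raw(0x1a)], clean)

# Read the cleaned copy with the same options as in the question.
df <- read.table(clean, header = FALSE, sep = "\t", quote = "",
                 comment.char = "", stringsAsFactors = FALSE)
unlink(c(demo, clean))
```

This loads the whole file into memory as raw bytes, so for a very large input it may be preferable to process the file in chunks.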

Related

tryCatch - withCallingHandlers - recover from error

I have a csv file (approx. 1000 lines) with some sample data. While reading the csv with read.table
read.table(csv_File,header = FALSE, sep=",",na.strings = '')
I was getting an error,
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 515 did not have 5 elements
Is there any way, by using tryCatch and withCallingHandlers, to print this error message and continue with the rest of the file?
All I want is to get the error messages / stack trace when errors occur and still process the remaining lines of the csv.
No, as far as I know there's no way to get read.table to skip lines that contain errors. What you should do is use the count.fields function to find how many fields are in each line of your file, then read the whole file, delete the bad lines, and read again. For example:
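# blank.lines.skip = FALSE keeps the counts aligned with readLines() below
fields <- count.fields(csv_File, sep = ",", blank.lines.skip = FALSE)
# NA counts (lines inside a multi-line quoted field) are treated as bad too
bad <- is.na(fields) | fields != 5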
lines <- readLines(csv_File)
# At this point you could display the bad lines or
# give some other information about them.
# Then delete them and read again:
lines <- lines[!bad]
f <- tempfile()
writeLines(lines, f)
read.table(f, header = FALSE, sep=",", na.strings = '')
unlink(f)
EDITED to add:
I should mention that the readr package does a better job when files contain problems. If you use
library(readr)
read_csv(csv_File, col_names = FALSE)
it will produce a "tibble" instead of a data frame, but otherwise should do what you want. Each line that has problems will be reported, and the overall problems will be kept with the dataset in case you want to examine them later.
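To illustrate why tryCatch on its own does not help here, the following sketch wraps read.table in tryCatch on a small made-up file (the three demo lines are an assumption for illustration). The handler can report the message, but the read is abandoned as a whole rather than resumed after the bad line.

```r
# Demo file: the second line has 3 fields instead of 5.
demo <- tempfile(fileext = ".csv")
writeLines(c("1,2,3,4,5", "1,2,3", "6,7,8,9,10"), demo)

# tryCatch reports the error, but read.table returns nothing for the good lines.
result <- tryCatch(
  read.table(demo, header = FALSE, sep = ",", na.strings = ""),
  error = function(e) {
    message("read.table failed: ", conditionMessage(e))
    NULL
  }
)
unlink(demo)
```

Here `result` ends up as NULL, which is why filtering the bad lines first (or using a more tolerant reader) is needed to keep the rest of the file.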

How can I open file with Fread by skipping problematic rows

Basically, I am trying to read a csv with fread from library(data.table), but it gives me an error. I understand that it gets stuck around line 342637, but I cannot figure out how to read the csv or skip this problematic line. I have tried all the options I found online and am still stuck in the same place. Since the data is huge, I can't check what is wrong around line 342637. Is there any other way to read this csv file?
data.table ver: 1.10.4.3
user <- fread("user.csv", stringsAsFactors = FALSE, encoding = "UTF-8")
Read 13.1% of 1837283 rows
Error in fread("user.csv", stringsAsFactors = FALSE, encoding = "UTF-8") :
Expecting 77 cols, but line 342637 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
user <- fread("user.csv", stringsAsFactors = FALSE, encoding = "UTF-8", fill=TRUE)
Read 13.6% of 1837284 rows
Error in fread("user.csv", stringsAsFactors = FALSE, encoding = "UTF-8", :
Expecting 77 cols, but line 342637 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
user <- fread("user.csv", stringsAsFactors = FALSE, encoding = "UTF-8", sep=",")
Read 13.6% of 1837283 rows
Error in fread("user.csv", stringsAsFactors = FALSE, encoding = "UTF-8", :
Expecting 77 cols, but line 342637 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
user <- fread( "user.csv", stringsAsFactors = FALSE, encoding = "UTF-8", sep=",", fill=TRUE, blank.lines.skip=TRUE)
Read 14.2% of 1837284 rows
Error in fread("user.csv", stringsAsFactors = FALSE, encoding = "UTF-8", :
Expecting 77 cols, but line 342637 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
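One option would be to make two fread() calls, one for the rows before the problematic line and one for the rows after it, and then bind the results:
# nrows counts data rows; adjust by one if the reported line number includes the header line
user_start <- fread('user.csv', nrows = 342636)
# header = FALSE because the header line is skipped; reuse the names from the first part
user_end <- fread('user.csv', skip = 342637, header = FALSE, col.names = names(user_start))
user <- rbindlist(list(user_start, user_end))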
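Before skipping the line, it can help to look at it. The sketch below is a base R helper for printing a few lines around a given line number without loading the whole file; `peek_lines` is a hypothetical name introduced here, and the ten-line demo file stands in for user.csv.

```r
# Return the lines from (line - context) to (line + context) of a text file.
peek_lines <- function(path, line, context = 2L) {
  first <- max(1L, line - context)
  con <- file(path, open = "r")
  on.exit(close(con))
  # Read and discard everything before the window.
  if (first > 1L) readLines(con, n = first - 1L)
  readLines(con, n = line - first + 1L + context)
}

# Demo on a small file whose lines are "1" .. "10".
demo <- tempfile(fileext = ".csv")
writeLines(as.character(1:10), demo)
window <- peek_lines(demo, line = 5L, context = 1L)
unlink(demo)
```

For the file in the question, a call such as peek_lines("user.csv", 342637) would show the reported line and its neighbours, which often reveals a stray quote or an embedded newline.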

How to convert a factor type into a numeric type in R after reading a csv file?

After reading a csv file
data<-read.table(paste0('C:/Users/data/','30092017ARB.csv'),header=TRUE, sep=";")
I get factor as the type for nearly all of the numeric variables, especially the last column.
I tried all the suggestions here. However, every one of them gives me a warning:
Warning message:
NAs introduced by coercion
Someone even mentioned in this post:
"Every answer in this post failed to generate results for me , NAs were getting generated."
Any idea how I can solve this problem?
Addendum: in the following picture you can see one possible approach suggested here.
However, I always get the same NA.
The percent sign is clearly the problem. Replace the "%" by the empty string, "", and then convert to numeric.
data[[3]] <- sub("%", "", data[[3]])
data[[3]] <- as.numeric(data[[3]])
You can do this in one line of code,
data[[3]] <- as.numeric(sub("%", "", data[[3]]))
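As a small self-contained illustration of the conversion above, the sketch below uses a made-up factor `x` in place of data[[3]]. It also shows two pitfalls: converting a string that still contains "%" yields NA, and calling as.numeric directly on a factor returns the internal level codes rather than the values.

```r
# Made-up column of percentages, stored as a factor as in the question.
x <- factor(c("12.5%", "3%", "40%"))

# A string that still contains "%" cannot be parsed as a number.
still_na <- is.na(suppressWarnings(as.numeric("12.5%")))

# as.numeric() on a factor gives level codes, not the printed values.
codes <- as.numeric(x)

# Remove the percent sign first (sub() coerces the factor to character).
values <- as.numeric(sub("%", "", x, fixed = TRUE))
```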
Also, two notes on reading the data in.
First, some files use the semicolon as a column separator. This is common in countries where the decimal separator is the comma. That is why R has two functions to read files in the CSV format.
These functions are both calls to read.table with some defaults changed.
read.csv - Sets arguments header = TRUE and sep = ",".
read.csv2 - Sets arguments header = TRUE, sep = ";" and dec = ",".
For a full explanation see read.table or at an R prompt run help("read.table").
Second, you can avoid factor problems if you use argument stringsAsFactors = FALSE from the start, when reading in the data.
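The difference between the two notes can be seen on a tiny inline example. The sketch below assumes a made-up semicolon-separated text with decimal commas; read.table with only sep = ";" leaves the numbers as strings, while read.csv2 parses them as numeric.

```r
# Made-up data: semicolon separator, comma as decimal mark.
txt <- "a;b\n1,5;2,25\n3,0;4,75\n"

# Only the separator is set, so "1,5" is not recognised as a number.
d1 <- read.table(text = txt, header = TRUE, sep = ";", stringsAsFactors = FALSE)

# read.csv2 also sets dec = ",", so the columns come back numeric.
d2 <- read.csv2(text = txt, stringsAsFactors = FALSE)
```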

ff package read a text file wrong

I would like to read a large text file separated by "|", so I used the code below.
sampleData <- read.table(file = '2013_4MM01_7-11_CD.txt',header =TRUE, sep = '|', nrows=10)
pos<- read.table.ffdf(file="2013_4MM01_7-11_CD.txt", header=TRUE, VERBOSE=TRUE,
FUN='read.table', sep = '|',
first.rows=10000, next.rows=50000, colClasses=classes)
I used the ff package and checked my data after the code finished running. My data has some long numeric values such as "201304012371090245546", and the object read from the file was wrong: my ffdf object contained many duplicated rows and even a number that was not in the original text file. I verified this with SAS. Any advice would be appreciated.
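One possible cause, offered only as an assumption because the question does not show how `classes` was built, is that a 21-digit identifier cannot be represented exactly as a double, so distinct IDs may collapse to the same value or turn into numbers that never appeared in the file. The sketch below demonstrates the effect with plain read.table on a made-up two-column input; whether the same column class is usable with read.table.ffdf would need to be checked against the ff documentation.

```r
# The long value quoted in the question.
id <- "201304012371090245546"

# Round trip through double: the printed digits no longer match the input.
as_double <- sprintf("%.0f", as.numeric(id))

# Reading the column as character keeps the identifier intact.
txt <- paste0("id|value\n", id, "|1\n")
df <- read.table(text = txt, header = TRUE, sep = "|",
                 colClasses = c("character", "integer"))
```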

R workarounds: fread versus read.table

I would like to understand why this difference exists between read.table and fread. Maybe you know a workaround to make fread work. I have two lines of code with the same goal: to read a file. fread is faster and more efficient than read.table, but read.table produces fewer errors on the same data set.
SUCCESSFUL READ.TABLE approach
table <- read.table("A.txt",header=FALSE,sep = "|", fill=TRUE, quote="", stringsAsFactors=FALSE)
FREAD approach
table <- fread("A.txt",header=FALSE,sep = "|")
FREAD returns the classic error, which I explored,
Expected sep ('|') but new line or EOF ends field 44 on line 57193 when reading data
Initially, read.table returned what I think is a similar error when fill=TRUE was not included and would not read the file.
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 7 did not have 45 elements
I think the errors might be related. According to the documentation, fill does the following: if TRUE, then in case the rows have unequal length, blank fields are implicitly added.
Is there a workaround similar to fill=TRUE that might address the fread problem?
ANSWER FROM MATT DOWLE: Fill option for fread
UPDATE : Very unlikely to be done. fread is optimized for regular delimited files (where each row has the same number of columns). However, irregular files could be read into list columns (each cell itself a vector) when sep2 is implemented; not filled in separate columns as read.csv can do.
This answer highlights how data.table can now fill using fread.
https://stackoverflow.com/a/34197074/1569064
fread(input, sep="auto", sep2="auto", nrows=-1L, header="auto", na.strings="NA",
stringsAsFactors=FALSE, verbose=getOption("datatable.verbose"), autostart=1L,
skip=0L, select=NULL, drop=NULL, colClasses=NULL,
integer64=getOption("datatable.integer64"), # default: "integer64"
dec=if (sep!=".") "." else ",", col.names,
check.names=FALSE, encoding="unknown", quote="\"",
strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL,
showProgress=getOption("datatable.showProgress"), # default: TRUE
data.table=getOption("datatable.fread.datatable") # default: TRUE
)
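To make the effect of fill concrete, here is a minimal base R sketch on a made-up three-line input (the text and the names `strict` and `filled` are illustrative). Without fill the short row triggers the "did not have ... elements" error from the question; with fill = TRUE the missing field is padded.

```r
# Made-up input: the second row has 2 fields instead of 3.
txt <- "a|b|c\nd|e\nf|g|h\n"

# Without fill, read.table stops with an error; keep its message instead.
strict <- tryCatch(
  read.table(text = txt, sep = "|", quote = "", stringsAsFactors = FALSE),
  error = function(e) conditionMessage(e)
)

# With fill = TRUE, the short row is padded to 3 columns.
filled <- read.table(text = txt, sep = "|", quote = "", fill = TRUE,
                     stringsAsFactors = FALSE)

# data.table analogue (not run here), using the fill argument from the signature above:
# fread("A.txt", header = FALSE, sep = "|", fill = TRUE)
```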
