How can I open a file with fread by skipping problematic rows - r

Basically, I am trying to read a CSV with library(data.table)'s fread, but it gives me an error. I understand it gets stuck around line 342637, but I cannot figure out how to read the CSV or skip this problematic line. I have tried all the options I found online but am still stuck in the same place. Since the data is huge, I can't check what is wrong around line 342637. Is there any other way to read this CSV file?
data.table version: 1.10.4.3
user <- fread("user.csv", stringsAsFactors = FALSE, encoding = "UTF-8")
Read 13.1% of 1837283 rows
Error in fread("user.csv", stringsAsFactors = FALSE, encoding = "UTF-8") :
Expecting 77 cols, but line 342637 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
user <- fread("user.csv", stringsAsFactors = FALSE, encoding = "UTF-8", fill=TRUE)
Read 13.6% of 1837284 rows
Error in fread("user.csv", stringsAsFactors = FALSE, encoding = "UTF-8", :
Expecting 77 cols, but line 342637 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
user <- fread("user.csv", stringsAsFactors = FALSE, encoding = "UTF-8", sep=",")
Read 13.6% of 1837283 rows
Error in fread("user.csv", stringsAsFactors = FALSE, encoding = "UTF-8", :
Expecting 77 cols, but line 342637 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
user <- fread( "user.csv", stringsAsFactors = FALSE, encoding = "UTF-8", sep=",", fill=TRUE, blank.lines.skip=TRUE)
Read 14.2% of 1837284 rows
Error in fread("user.csv", stringsAsFactors = FALSE, encoding = "UTF-8", :
Expecting 77 cols, but line 342637 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.

One option would be to do two fread() calls: one for the first 342636 rows and one for the rest of the rows:
user_start <- fread('user.csv', nrows = 342636)
user_end <- fread('user.csv', skip = 342637)
user <- rbindlist(list(user_start, user_end))
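Note that the second call reads past the header, so its columns come back with default names (V1, V2, ...); depending on your data.table version you may need to reapply names(user_start) before rbindlist(). If you also want to see what is actually on the offending line, a minimal sketch like the following reads just that line and its neighbours without loading the whole file:
# Read only the lines around the reported failure point; useful when the
# file is too large to open in an editor.
con <- file("user.csv", open = "r")
invisible(readLines(con, n = 342636))  # discard everything before the bad line
bad <- readLines(con, n = 3)           # the offending line plus two after it
close(con)
cat(bad, sep = "\n")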

Related

Reliably importing CSV columns as "double"

I am trying to import multiple CSV files in a for loop. Iteratively trying to solve the errors the code produced, I arrived at the code below.
for (E in EDCODES) {
  Filename <- paste("$. Data/2. Liabilities/", E, sep = "")
  Framename <- gsub("\\..*", "", E)
  assign(Framename,
         read.csv(Filename,
                  header = TRUE,
                  sep = ",",
                  stringsAsFactors = FALSE,
                  na.strings = c("\"ND", "ND,5", "5\""),
                  colClasses = c("BAA35" = "double"),
                  encoding = "UTF-8",
                  quote = ""))
}
First I realized that the code does not always recognize the most important column, "BAA35", as numeric, so I added the colClasses argument. Then I realized that the data has multiple versions of "NA", so I added the na.strings argument. The most common NA value is "ND, 5", which contains the separator ",". So if I add the na.strings argument as defined above, I get a lot of "EOF within quoted string" warnings. The others are also versions of "ND, [NUMBER]" or "ND, 4, [YYYY-MM]".
If I then try to treat that issue with the most common recommendation I could find, adding quote = "", I just end up with a "more columns than column names" issue.
The data has 78 columns, so I don't believe posting it here would display in a usable way.
Can somebody recommend a solution for how I can reliably import this column as a numeric value and have R recognize the NAs in the data correctly?
I think the issue might be that the na.strings contain commas, so in some cases "ND,5" is read as one column with "ND" and another with "5", while in other cases it is seen as the na.string. Is there any way to tell R not to split "ND,5" into two columns?
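No answer is recorded for this question here, but one possible workaround (a sketch, assuming the ND codes always occupy a complete double-quoted field) is to pre-clean the raw text before parsing, so the parser never sees the embedded commas:
# Blank out the quoted "ND..." codes before parsing; the regex assumes each
# code is a whole double-quoted field. The pattern is illustrative only.
raw <- readLines(Filename, encoding = "UTF-8")
raw <- gsub('"ND[^"]*"', "", raw)
df <- read.csv(text = raw, header = TRUE, stringsAsFactors = FALSE,
               colClasses = c("BAA35" = "double"))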

Problems with displaying .txt file (delimiters)

I have a problem with one task where I have to load some data set, and I have to make sure that missing values are read in properly and that column names are unambiguous.
The format of the .txt file is shown in a screenshot (not reproduced here). At the end, the data set should contain only the country column and the median age.
I tried using read.delim, precisely this chunk:
rawdata <- read.delim("rawdata_343.txt", sep = "", stringsAsFactors = FALSE, header = TRUE)
When I run it, the output is wrong (screenshot not reproduced here): if a country name has multiple words (Turks and Caicos Islands), every word is assigned to a separate column, which confuses me.
Since I am still a beginner in R, any suggestion would be very helpful for me. Thanks!
Three points to note about your input file: (1) the first two lines at the top are not tabular and should be skipped with skip = 2, (2) your column separators are tabs, which should be specified with sep = "\t", and (3) you have no headers, so header = FALSE. Your command should be:
rawdata <- read.delim("rawdata_343.txt", sep = "\t", stringsAsFactors = FALSE, header = FALSE, skip = 2)
UPDATE: A fourth point is that the first column includes row numbers, so row.names = 1. This also addresses the follow-up comment.
rawdata <- read.delim("rawdata_343.txt", sep = "\t", stringsAsFactors = FALSE, header = FALSE, skip = 2, row.names = 1)
It looks like the delimiter you are specifying in the sep = argument tells R to treat spaces as the column delimiter. Looking at your data as a .txt file, there is no apparent delimiter (like the commas you would find in a typical .csv). If you can put the data in tabular form in something like a .csv or .xlsx file, R is much better at reading it as expected. As it is, you may struggle to get the .txt format to read in a tabular fashion, which is what I assume you want.
P.S. You can use read.csv() if you do end up putting the data in that format.

tryCatch - withCallingHandlers - recover from error

I have a csv file (approx. 1000 lines) with some sample data. While reading the csv with read.table
read.table(csv_File, header = FALSE, sep = ",", na.strings = '')
I was getting an error,
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 515 did not have 5 elements
Is there any way, using tryCatch and withCallingHandlers, to print this error message and continue with the rest of the file?
All I am expecting is to get error messages / a stack trace in case of errors and to process the rest of the lines in the csv.
No, as far as I know there's no way to get read.table to skip lines that contain errors. What you should do is use the count.fields function to find how many fields are in each line of your file, then read the whole file, delete the bad lines, and read again. For example:
fields <- count.fields(csv_File, sep = ",")
bad <- fields != 5
lines <- readLines(csv_File)
# At this point you could display the bad lines or
# give some other information about them.
# Then delete them and read again:
lines <- lines[!bad]
f <- tempfile()
writeLines(lines, f)
read.table(f, header = FALSE, sep=",", na.strings = '')
unlink(f)
EDITED to add:
I should mention that the readr package does a better job when files contain problems. If you use
library(readr)
read_csv(csv_File, col_names = FALSE)
it will produce a "tibble" instead of a data frame, but otherwise should do what you want. Each line that has problems will be reported, and the overall problems will be kept with the dataset in case you want to examine them later.
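For instance (a sketch; problems() is readr's accessor for the parsing issues attached to the result):
library(readr)
df <- read_csv(csv_File, col_names = FALSE)
# Rows that failed to parse are reported as a data frame giving the row,
# column, expected format, and actual content for each problem.
problems(df)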

How to solve the "EOF within quoted string" in a web scraping function?

So I'm using this web scraping function bold_seqspec, which returns data on specimens of several species based on a vector of taxonomic groups given in the "taxon" argument, like this:
df<-bold_seqspec(taxon=c("group1","group2","group3"), format = "tsv")
But recently, in some cases, I'm getting the following message and subsequently losing information when I use it:
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
I've gotten this before in read.delim but I solved it with this:
df<-read.delim("file.txt",quote = "",comment.char = "")
Reproducible example:
install.packages("bold")
library(bold)
df<-bold_seqspec(taxon=c("Cnidaria","Hippocampus"), format = "tsv", marker="COI-5P")
The problem is that the function I'm using for data mining (bold_seqspec) doesn't have the quote and comment.char arguments.
It seems that the tsv output contains both double quotes and single quotes, and this is why the parsing goes wrong (for more information on the matter - it also concerns a biological data set - see EOF-within-quoted-string / difference between read.delim() and read.table()).
Hence, if you can't set the quoting options, I'd suggest a workaround: set format = "xml" and convert the XML to a data frame in a further step with the libraries XML (or xml2?) and dplyr.
library(bold)
library(XML)
library(dplyr)
xml <- bold_seqspec(taxon = c("Cnidaria", "Hippocampus"), format = "xml", marker = "COI-5P")
df <- xmlToDataFrame(xml, stringsAsFactors = FALSE) %>%
  mutate_all(~ type.convert(., as.is = TRUE))
Hope this helps.

read.table quits after encountering special symbol

I am trying to use read.table for a tab-delimited file with the following command:
df <- read.table("input.txt", header = FALSE, sep = "\t", quote = "",
                 comment.char = "", encoding = "utf-8")
There should be 30 million rows. However, after read.table() returns, df contains only ~6 million rows. And there is a warning message like this:
Warning message:
In read.table("input.txt", header = FALSE, sep = "\t", quote = "", :
incomplete final line found by readTableHeader on 'input.txt'
I believe read.table quits after encountering a special symbol (ASCII code 1A, Substitute) in one of the string columns. In the input file, the only special character should be the tab, because it is used to separate columns. Is there any way to ask read.table to treat every other character as not special?
If you have 30 million rows, I would use fread rather than read.table; it is faster. Learn more about it here: http://www.inside-r.org/packages/cran/data.table/docs/fread
fread(input, sep = "auto", encoding = "UTF-8")
Regarding your issue with read.table, I think the solutions here should solve it: 'Incomplete final line' warning when trying to read a .csv file into R
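If the 0x1A byte really is the culprit, one possible workaround (a sketch, not from the original thread) is to read the file as raw bytes, drop the SUB characters, and then parse the cleaned text:
# Read the file as raw bytes so no control character can cut the read short,
# remove the 0x1A (SUB) bytes, then parse the cleaned text as before.
bytes <- readBin("input.txt", what = "raw", n = file.size("input.txt"))
txt <- rawToChar(bytes[bytes != as.raw(0x1a)])
df <- read.table(text = txt, header = FALSE, sep = "\t",
                 quote = "", comment.char = "")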
