fread not reading all records and no warning message - r

I'm trying to load some data using fread. While loading, it shows the correct number of records, but once it has finished loading, the number of records is noticeably lower.
Surprisingly, it doesn't show any warnings. Can someone please advise?
Thanks

One common reason is unclean data with unterminated quotation marks.
E.g., if you have data like this:
number_column,text_column
1,text data 1
2,"text with single quote here
3,text data 3
EVERYTHING after the unmatched quote will be included in text_column on the 2nd line. This is actually the correct interpretation; it's just that your CSV/TSV file is broken.
The easiest solution is to pass quote="" as a parameter, so that fields are read as if quoting were disabled. The real solution, though, is to go through your TSV/CSV file and fix all the issues manually, since the parser cannot know exactly what you want if the file is broken.
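A minimal sketch of the quote="" workaround, assuming the data.table package is installed. The sample rows mirror the broken file above and are written to a temporary file purely for illustration; comparing the row count against the raw line count is one possible way to notice silently missing records.

```r
library(data.table)

# Hypothetical sample reproducing the broken file described above.
broken <- c(
  "number_column,text_column",
  "1,text data 1",
  "2,\"text with single quote here",
  "3,text data 3"
)
tmp <- tempfile(fileext = ".csv")
writeLines(broken, tmp)

# quote = "" disables quote handling, so the stray quote stays in the field.
dt <- fread(tmp, quote = "")
print(dt)

# Sanity check: records read versus raw lines in the file (minus the header).
n_lines <- length(readLines(tmp)) - 1L
if (nrow(dt) != n_lines) {
  warning("fread returned fewer records than the file has lines")
}
```

If the two counts differ on your real file, the first row where the data looks wrong is usually close to the offending quote.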

Related

data.table v.1.11.0+ no longer freads data file that was fread by v.1.10.4-3

I've encountered a possible bug in the new version of data.table. I have a 2GB .csv file with c. 3 million rows and 67 columns. I can use fread() to read it all fine from data.table v.1.10.4-3, but v.1.11.0+ terminates at a row somewhere down the middle. The base read.csv() also hits the same problem. I really like data.table and want to create a bug report on Github, but obviously I can't upload the 2GB data file anywhere.
I need a way of splicing maybe ~10 rows around the problematic point (the row number is known) in order to create a portable reproducible example. Any ideas how I can do that without reading in the .csv file?
Also, is there a program I can use to open the raw file to look at the problematic point and see what causes the issue? Notepad/Excel won't open a file this big.
EDIT: the verbose output.
EDIT2: this is the problematic line. It shows that what is supposed to be one line is somehow split into 3 lines. I can only assume it is due to an export bug in the ancient software (SAP Business Objects) that was used to create the CSV. It is unsurprising that it causes an issue. However, it is surprising that data.table v.1.10.4-3 was able to handle it in a smart way and read it correctly, whereas v.1.11.0+ could not. Could it have something to do with encoding or hidden control characters?
EDIT3: proof that this is what really happens.
Thanks for including the output. It shows that fread is issuing a warning. Did you miss this warning before?
Warning message:
In fread("Data/FP17s with TCD in March 2018.csv", na.strings = c("#EMPTY", :
Stopped early on line 138986. Expected 67 fields but found 22. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<916439/0001,Q69,GDS Contract,MR A SYED,916439,Mr,SYED A Mr,A,SYED,58955,3718.00,Nine Mile Ride Dental Practice,Dental Surgery,193 Nine Mile Ride,Finchampstead,WOKINGHAM,RG40 4JD,2181233168.00,TORIN,FASTNEDGE,1 ANCHORITE CLOSE,>>
This is very helpful, surely. It tells you the line number: 138986. It says that this line has 22 fields but 67 are expected. Could the warning be better by stating why it is expecting 67 fields at that point (e.g. by saying there are 67 column names and it has seen 67 columns up to that point)? It gives you a hint of what to try (fill=TRUE), which would fill that too-short line with NA in columns 23:67. Then it includes the data from the line, too.
Does it work with fill=TRUE, as the warning message suggests?
You say it worked in 1.10.4-3 but I suspect it's more likely it stopped early there too, but without warning. If so, that was a bug not to warn, now fixed.
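To illustrate what fill=TRUE does, here is a small sketch on a made-up three-column file in which one line is too short. The file contents and column names are assumptions for demonstration only; the real file has 67 columns.

```r
library(data.table)

# Hypothetical file: the second data line has only two of the three fields.
short <- c(
  "a,b,c",
  "1,2,3",
  "4,5",
  "6,7,8"
)
tmp <- tempfile(fileext = ".csv")
writeLines(short, tmp)

# fill = TRUE pads the missing trailing field with NA instead of stopping early.
dt <- fread(tmp, fill = TRUE)
print(dt)

# Rows that were padded can be located afterwards by looking for NA
# in the last column.
padded_rows <- which(is.na(dt$c))
```

Note that fill=TRUE only pads short lines; it does not stitch a record back together if one logical line was split across several physical lines, so the result still needs checking.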
Using Powershell on Windows:
Get-Content YourFile.csv | Select -Index (0,19,20,21,22) > OutputFileName.csv
would dump the header and lines 20-23 into a new file.
Use a combination of skip and nrows:
You mentioned that you have no problem reading the file with v.1.10.4-3, right? So use that version to skip most of the .csv and set nrows to the number of rows you want. Once you have that data.table, you can write that portion of the file out and you have a portable reproducible example.
For example:
DT <- fread("my_file.csv", skip = 138981, nrows = 10)
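If the goal is to copy the raw text around the problematic line without parsing it at all, a base R sketch along these lines may help. The helper name extract_lines and the demo file are hypothetical; the function reads from an open connection, discards everything before the region of interest, and keeps the header plus n raw lines.

```r
# Hypothetical helper: return the header plus `n` raw lines starting at
# physical line number `first` (1-based, where line 1 is the header).
extract_lines <- function(path, first, n = 10, out = NULL) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1)
  # Read and discard lines 2 .. first - 1 so the connection sits at `first`.
  if (first > 2) invisible(readLines(con, n = first - 2))
  chunk <- readLines(con, n = n)
  result <- c(header, chunk)
  if (!is.null(out)) writeLines(result, out)
  result
}

# Demo on a small made-up file: a header followed by lines "r2" .. "r20".
demo_path <- tempfile(fileext = ".csv")
writeLines(c("h", paste0("r", 2:20)), demo_path)
snippet <- extract_lines(demo_path, first = 5, n = 3)
print(snippet)
```

For the file in the question, something like extract_lines("my_file.csv", first = 138981, n = 10, out = "snippet.csv") would be the analogous call; the discarded lines are still read once, but nothing is parsed into columns.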

Accessing large spreadsheets written from R

I am using the following R script to write the data.table to an Excel-readable file in my set directory. However, the file is several GB in size, as there are 50 million+ rows in total. Hence, upon opening the file, I just see a blank grey screen and nothing else.
How can I see the contents in the file?
The first line is just for illustration purpose.
Final1 <- rep(iris, times = 1000000)
fwrite(Final1,"data2.csv")
You mention that this is part of the report. I would be willing to bet good money that whoever will be reading this report will not check all or most of the values by hand. In which case, you don't need a format that is easily browsable, e.g. xlsx or even csv. If this indeed is the case, you might want to try a (relational) database. If you do not have anything centralized, you might want to give SQLite a try. You save everything into one file which acts as a database. There are packages that handle this interaction in R. You can try with sqldf or RSQLite.
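A minimal sketch of the SQLite route, assuming the DBI and RSQLite packages are installed. The table name "measurements" and the temporary database path are placeholders, and the built-in iris data stands in for the real 50 million+ row table.

```r
library(DBI)

# Hypothetical database file; in practice this would live next to the report.
db_path <- tempfile(fileext = ".sqlite")
con <- dbConnect(RSQLite::SQLite(), db_path)

# Store the data once; iris is a stand-in for the real table.
dbWriteTable(con, "measurements", iris)

# Inspect only what is needed instead of opening the whole file.
n <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM measurements")$n
head_rows <- dbGetQuery(con, "SELECT * FROM measurements LIMIT 5")
print(head_rows)

dbDisconnect(con)
```

The point of the design is that a reader can pull a handful of rows or a summary with a query, rather than loading gigabytes into a spreadsheet program.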

Fread unusual line ending causing error

I am attempting to download a large database of NYC taxi data, publicly available at the NYC TLC website.
library(data.table)
feb14 <- fread('https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-02.csv', header = T)
Executing the above code successfully downloads the data (which takes a few minutes), but then fails to parse due to an internal error. I have tried removing header = T as well.
Is there a workaround in order to deal with the "unusual line endings" in fread ?
Error in fread("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-02.csv", :
Internal error. No eol2 immediately before line 3 after sep detection.
In addition: Warning message:
In fread("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-02.csv", :
Detected eol as \n\r, a highly unusual line ending. According to Wikipedia the Acorn BBC used this. If it is intended that the first column on the next row is a character column where the first character of the field value is \r (why?) then the first column should start with a quote (i.e. 'protected'). Proceeding with attempt to read the file.
It seems that the issue might be caused by the presence of a blank line between the header and the data in the original .csv file. Deleting that line from the .csv using Notepad++ seemed to fix it for me.
Sometimes other options like read.csv/read.table behave differently, so you can always try those. (Maybe the source code tells why; I haven't looked into that.)
Another option is to use readLines() to read in such a file. As far as I know, no parsing or formatting is done here, so this is the most basic way to read a file.
Finally, a quick fix: use the option 'skip = ...' in fread, or control the end by setting 'nrows = ...'.
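Combining the two suggestions above (the blank line after the header and readLines()), one possible sketch is to read the raw lines, drop the empty ones and any stray carriage returns, and hand the rest to fread via its text argument. The sample data and column names are made up, and this assumes a data.table version that supports fread(text = ...).

```r
library(data.table)

# Hypothetical file with a blank line between the header and the data.
raw <- c("id,fare", "", "1,10.5", "2,7.0")
tmp <- tempfile(fileext = ".csv")
writeLines(raw, tmp)

lines <- readLines(tmp)
lines <- gsub("\r", "", lines, fixed = TRUE)   # strip stray carriage returns
lines <- lines[nzchar(trimws(lines))]          # drop blank lines

# fread accepts the cleaned lines directly through `text`.
dt <- fread(text = lines)
print(dt)
```

For a multi-gigabyte file this holds all lines in memory at once, so it is better suited to diagnosing the problem on a sample than to routine loading.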
There is something fishy with fread. data.table is the faster, more performance-oriented option for reading large files; however, in this case the behavior is not optimal. You may want to raise this issue on GitHub.
I am able to reproduce the issue on the downloaded file even with nrows = 5 or even with nrows = 1, but only if I stick to the original file. If I copy and paste the first few rows and then try, the issue is gone. The issue also goes away if I read directly from the web with a small nrows. This is not even an encoding issue, hence my recommendation to raise an issue.
I tried reading the file using read.csv and 100,000 rows without an issue and under 6 seconds.
feb14_2 <- read.csv("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-02.csv", header = T, nrows = 100000)
header = T is redundant in both calls: fread detects the header automatically by default, and header = TRUE is already the default for read.csv.

Track the exact place of a not encoded character in an R script file

This is more of a tip question, one that can save lots of time in many cases. I have a script.R file which I try to save, and I get the error:
Not all of the characters in ~/folder/script.R could be encoded using ASCII. To save using a different encoding, choose "File | Save with Encoding..." from the main menu.
I have been working on this file for months, and today, while editing my code like crazy, I got this error for the first time, so obviously I inserted a character that cannot be encoded while I was working today.
My question is: can I track down this specific character and find exactly where in the document it is?
There are about 1000 lines in my code and it's almost impossible to manually search it.
Use tools::showNonASCIIfile() to spot the non-ascii.
Let me suggest two slight improvements to this.
Process:
Save your file using a different encoding (e.g. UTF-8).
Set a variable 'f' to the name of that file, something like f <- "yourpath\\yourfile.R"
Then use tools::showNonASCIIfile(f) to display the faulty characters.
Something to check:
I have a Markdown file which I run to output to Word document (not important).
Some of the packages I load at initialisation mask previously defined functions. I have found that the resulting warning messages sometimes contain non-ASCII characters, and this seems to have caused this message for me. Some fault put all that output at the end of the file, and I had to delete it anyway!
Check for characters coming back from warnings!
Cheers
Expanding the accepted answer with this answer to another question, to check for offending characters in the script currently open in RStudio, you can use this:
tools::showNonASCIIfile(rstudioapi::getSourceEditorContext()$path)
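The approach from the answers above can be tried on a throwaway file first. In this sketch the script contents are invented, and the non-ASCII character is written with a \u escape so the example itself stays ASCII. Besides calling tools::showNonASCIIfile(), it collects the offending line numbers with iconv(), which returns NA for strings it cannot convert to ASCII.

```r
# Hypothetical three-line script; line 2 contains a non-ASCII character.
script <- c("x <- 1", "label <- \"caf\u00e9\"", "y <- x + 1")
tmp <- tempfile(fileext = ".R")
writeLines(enc2utf8(script), tmp, useBytes = TRUE)  # write UTF-8 bytes as-is

# Prints each offending line together with its line number.
tools::showNonASCIIfile(tmp)

# Same information as a vector of line numbers, for use in code.
lines <- readLines(tmp, encoding = "UTF-8")
bad <- which(is.na(iconv(lines, from = "UTF-8", to = "ASCII")))
print(bad)
```

With the line numbers in hand, jumping to the exact spot in a 1000-line script becomes a matter of a "go to line" command in the editor.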

Using R, import data from web

I have just started using R, so this may be a very dumb question. I am trying to import the data using:
emdata=read.csv(file="http://lottery.merseyworld.com/cgi-bin/lottery?days=19&Machine=Z&Ballset=0&order=1&show=1&year=0&display=CSV",header=TRUE)
My problem is that it reads the csv file into a single column instead of splitting it into however many columns of data there are. (By the way, I chose the lottery data simply because it is publicly available to download; I'm using it as an exercise to understand what I can and can't do in R.) Would someone mind helping out, please, even though this is trivial?
Hm, that's kind of obnoxious for a page purporting to be in csv format. You can skip the first 5 lines, which will cause R to read (most of) the rest of the file correctly.
emdata=read.csv(file=...., header=TRUE, skip=5)
I got the number of lines to skip by looking at the source. You'll still have to remove the cruft in the middle and end, and then clean up the columns (they'll all be factors because of the embedded text).
It would be much easier to save the page to your hard disk, edit it to remove all the useless bits, then import it.
... to answer your REAL question, yes, you can import data directly from the web. In general, wherever you would read a file, you can substitute a fully qualified URL -- R is smart enough to do the Right Thing[tm]. This specific URL just happens to be particularly messy.
You could read text from the given url, filter out the obnoxious lines and then read the result as CSV like so:
lines <- readLines(url("http://lottery.merseyworld.com/cgi-bin/lottery?days=19&Machine=Z&Ballset=0&order=1&show=1&year=0&display=CSV"))
read.csv(text=lines[grep("([^,]*,){5,}", lines)])
The above regular expression matches any lines containing at least five commas.
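The same filter can be tested offline on a few made-up lines before pointing it at the real URL. The sample text below is invented to resemble a page that mixes banner text and markup with CSV rows; only lines with at least five commas survive the filter.

```r
# Hypothetical page content: CSV rows surrounded by non-CSV clutter.
raw <- c(
  "Some banner text",
  "No,Day,DD,MMM,YYYY,N1",
  "1,Sat,19,Nov,1994,30",
  "<hr>",
  "2,Sat,26,Nov,1994,6",
  "Total, draws"
)

# Keep only lines that contain at least five commas.
keep <- grepl("([^,]*,){5,}", raw)
emdata <- read.csv(text = raw[keep])
print(emdata)
```

A threshold based on the comma count is a heuristic: it works when the clutter lines contain fewer commas than the data rows, so the number may need adjusting for other pages.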

Resources