Deal with escaped commas in CSV file? - r

I'm reading a file into R using fread, as follows:
test.set = fread("file.csv", header=FALSE, fill=TRUE, blank.lines.skip=TRUE)
My CSV consists of 6 columns. An example of a row in this file is:
"2014-07-03 11:25:56","61073a09d113d3d3a2af6474c92e7d1e2f7e2855","Securenet Systems Radio Playlist Update","Your Love","Fred Hammond & Radical for Christ","50fcfb08424fe1e2c653a87a64ee92d7"
However, certain rows are formatted in a particular way when there is a comma inside one of the cells. For instance,
"2014-07-03 11:25:59","37780f2e40f3af8752e0d66d50c9363279c55be6","Spotify","\"Hello\", He Lied","Red Box","b226ff30a0b83006e5e06582fbb0afd3"
produces an error of the sort
Expecting 6 cols, but line 5395818 contains text after processing all
cols. Try again with fill=TRUE. Another reason could be that fread's
logic in distinguishing one or more fields having embedded sep=','
and/or (unescaped) '\n' characters within unbalanced unescaped quotes
has failed. If quote='' doesn't help, please file an issue to figure
out if the logic could be improved.
As you can see, the value that is causing the error is "\"Hello\", He Lied", which I want to be read by fread as "Hello, He Lied". I'm not sure how to account for this, though - I've tried using fill=TRUE and quote="" as suggested, but the error still keeps coming up. It's probably just a matter of finding the right parameter(s) for fread; anyone know what those might be?

In read.table() from base R this issue is solvable, using the approach from "Import data into R with an unknown number of columns?".
In fread from data.table this is not currently possible.
Issue logged for this: https://github.com/Rdatatable/data.table/issues/2669
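That said, one possible workaround (a rough sketch, assuming a data.table version that supports the fread(text = ...) argument) is to pre-process the raw lines so the backslash-escaped quotes become the standard doubled quotes fread understands, and then parse the cleaned text:
library(data.table)

raw <- readLines("file.csv")
# Turn \" (backslash-escaped quote) into "" (the CSV doubling convention
# that fread understands).
fixed <- gsub('\\"', '""', raw, fixed = TRUE)

test.set <- fread(text = paste(fixed, collapse = "\n"),
                  header = FALSE, fill = TRUE, blank.lines.skip = TRUE)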

Related

read_csv does not work separate commas and not capture separate rows

I am trying to parse a text log file like the one below. I can use the default read.csv to parse this file:
test <- read.csv("test.txt", header=FALSE)
It separates the file at all the commas; although the result is not perfectly arranged in a data frame, further manipulation can improve it.
However, I cannot seem to do the same with the readr package:
test <- read_csv("test.txt", col_names=FALSE)
All the observations end up in one row, with no separation at the commas.
I am learning this package so any help would be great.
{"dev_id":"f8:f0:05:xx:db:xx","data":[{"dist":[7270,7269,7269,7275,7270,7271,7265,7270,7274,7267,7271,7271,7266,7263,7268,7271,7266,7265,7270,7268,7264,7270,7261,7260]},{"temp":0},{"hum":0},{"vin":448}],"time":4485318,"transmit_time":4495658,"version":"1.0"}
{"dev_id":"f8:xx:05:xx:d9:xx","data":[{"dist":[6869,6868,6867,6871,6866,6867,6863,6865,6868,6869,6868,6860,6865,6866,6870,6861,6865,6868,6866,6864,6866,6866,6865,6872]},{"temp":0},{"hum":0},{"vin":449}],"time":4405316,"transmit_time":4413715,"version":"1.0"}
{"dev_id":"xx:f0:05:e8:da:xx","data":[{"dist":[5775,5775,5777,5772,5777,5770,5779,5773,5776,5777,5772,5768,5782,5772,5765,5770,5770,5767,5767,5777,5766,5763,5773,5776]},{"temp":0},{"hum":0},{"vin":447}],"time":4461316,"transmit_time":4473307,"version":"1.0"}
{"dev_id":"xx:f0:xx:e8:xx:0a","data":[{"dist":[4358,4361,4355,4358,4359,4359,4361,4358,4359,4360,4360,4361,4361,4359,4359,4356,4357,4361,4359,4360,4358,4358,4362,4359]},{"temp":0},{"hum":0},{"vin":424}],"time":5190320,"transmit_time":5198748,"version":"1.0"}
Thanks to @Dave2e for pointing out that this file is in JSON format; I found the way to parse it using ndjson::stream_in().
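A minimal sketch of that approach, assuming the ndjson package is installed and the log is newline-delimited JSON (one object per line):
library(ndjson)

# stream_in() parses each line as a JSON object and flattens the result
# into a data.table, one row per input line.
test <- ndjson::stream_in("test.txt")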

data.table v.1.11.0+ no longer freads data file that was fread by v.1.10.4-3

I've encountered a possible bug in the new version of data.table. I have a 2GB .csv file with c. 3 million rows and 67 columns. I can use fread() to read it all fine from data.table v.1.10.4-3, but v.1.11.0+ terminates at a row somewhere down the middle. The base read.csv() also hits the same problem. I really like data.table and want to create a bug report on Github, but obviously I can't upload the 2GB data file anywhere.
I need a way of extracting maybe ~10 rows around the problematic point (the row number is known) in order to create a portable reproducible example. Any ideas how I can do that without reading the whole .csv file into R?
Also, is there a program I can use to open the raw file to look at the problematic point and see what causes the issue? Notepad/Excel won't open a file this big.
EDIT: the verbose output.
EDIT2: this is the problematic line. It shows that what is supposed to be one line is somehow split into 3 lines. I can only assume it is due to an export bug in an ancient piece of software (SAP Business Objects) that was used to create the CSV. It is unsurprising that it causes an issue. However, it is surprising that data.table v.1.10.4-3 was able to handle it in a smart way and read it correctly, whereas v.1.11.0+ could not. Could it have something to do with encoding or hidden technical characters?
EDIT3: proof that this is what really happens.
Thanks for including the output. It shows that fread is issuing a warning. Did you miss this warning before?
Warning message:
In fread("Data/FP17s with TCD in March 2018.csv", na.strings = c("#EMPTY", :
Stopped early on line 138986. Expected 67 fields but found 22. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<916439/0001,Q69,GDS Contract,MR A SYED,916439,Mr,SYED A Mr,A,SYED,58955,3718.00,Nine Mile Ride Dental Practice,Dental Surgery,193 Nine Mile Ride,Finchampstead,WOKINGHAM,RG40 4JD,2181233168.00,TORIN,FASTNEDGE,1 ANCHORITE CLOSE,>>
This is very helpful, surely. It tells you the line number: 138986. It says that this line has 22 fields but it expects 67. Could the warning be better by stating why it is expecting 67 fields at that point (e.g. by saying there are 67 column names and it has seen 67 columns up to that point)? It gives you a hint of what to try (fill=TRUE), which would fill that too-short line with NA in columns 23:67. Then it includes the data from the line, too.
Does it work with fill=TRUE, as the warning message suggests?
You say it worked in 1.10.4-3 but I suspect it's more likely it stopped early there too, but without warning. If so, that was a bug not to warn, now fixed.
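For reference, a sketch of that suggestion (the full na.strings vector from the original call is truncated in the warning, so it is left out here):
library(data.table)

# fill=TRUE pads the too-short line 138986 with NA in columns 23:67
# instead of stopping early.
DT <- fread("Data/FP17s with TCD in March 2018.csv", fill = TRUE)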
Using Powershell on Windows:
Get-Content YourFile.csv | Select -Index (0,19,20,21,22) > OutputFileName.csv
would dump the header and lines 20-23 into a new file.
Use a combination of skip and nrows:
You mentioned that you have no problem reading the file with v.1.10.4-3, right? So use that version to skip most of the .csv and set nrows to the number of rows you want. Once you have that data.table, you can write that portion of the file out and you have a portable reproducible example.
For example:
DT <- fread("my_file.csv", skip = 138981, nrows = 10)
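The extracted rows can then be written back out so the snippet can be shared (output file name is illustrative):
fwrite(DT, "repro_example.csv")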

Reading large csv file in R

I have a number of CSV files of different sizes, but all somewhat big. Using read.csv to read them into R takes longer than I've been patient to wait so far (several hours). I managed to read the biggest file (2.6 GB) very fast (less than a minute) with data.table's fread.
My problem occurs when I try to read a file of half the size. I get the following error message:
Error in fread("C:/Users/Jesper/OneDrive/UdbudsVagten/BBR/CO11700T.csv",:
Expecting 21 cols, but line 2557 contains text after processing all
cols. It is very likely that this is due to one or more fields having
embedded sep=';' and/or (unescaped) '\n' characters within unbalanced
unescaped quotes.
fread cannot handle such ambiguous cases and those
lines may not have been read in as expected. Please read the section
on quotes in ?fread.
Through research I've found suggestions to add quote = "" to the code, but it doesn't help me. I've tried using the bigmemory package, but R crashes when I try. I'm on a 64-bit system with 8 GB of RAM.
I know there are quite a few threads on this subject, but I haven't been able to solve the problem with any of the solutions. I would really like to use fread (given my good experience with the bigger file), and it seems like there should be some way to make it work - just can't figure it out.
Solved this by installing SlickEdit and using it to edit the lines that caused the trouble. A few characters like ampersands, quotation marks, and apostrophes were consistently encoded to include a semicolon - e.g. &amp; instead of just &. As semicolon was the separator in the text document, this caused the problem in reading with fread.
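A rough alternative sketch done in R rather than in an external editor, assuming the stray semicolons come from HTML entities such as &amp; and &apos; (entities that decode to a double quote would need extra care, since a bare quote can unbalance fread's quote handling):
library(data.table)

raw <- readLines("CO11700T.csv")
# Decode the entities so their trailing ';' is no longer mistaken for a
# field separator.
raw <- gsub("&amp;",  "&", raw, fixed = TRUE)
raw <- gsub("&apos;", "'", raw, fixed = TRUE)

dt <- fread(text = paste(raw, collapse = "\n"), sep = ";")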

how to have fread perform like read.delim

I've got a large tab-delimited data table that I am trying to read into R using the data.table package fread function. However, fread encounters an error. If I use read.delim, the table is read in properly, but I can't figure out how to configure fread such that it handles the data properly.
In an attempt to find a solution, I've installed the development version of data.table, so I am currently running data.table v1.9.7, under R v3.2.2, running on Ubuntu 15.10.
I've isolated the problem to a few lines from my large table, and you can download it here.
When I used fread:
> fread('problemRows.txt')
Error in fread("problemRows.txt") :
Expecting 8 cols, but line 3 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.
I tried using the parameters used by read.delim:
fread('problemRows.txt', sep="\t", quote="\"")
but I get the same error.
Any thoughts on how to get this to read in properly? I'm not sure what exactly the problem is.
Thanks!
With this recent commit c1b7cda, fread's quote logic got a bit cleverer in handling such tricky cases. With this:
require(data.table) # v1.9.7+
fread("my_file.txt")
should just work. The error message is now more informative as well when it is unable to handle such cases. See #1462.
As explained in the comments, specifying the quote argument did the trick:
fread("my_file.txt", quote="")

'Incomplete final line' warning when trying to read a .csv file into R

I'm trying to read a .csv file into R, and upon using this command:
pheasant<-read.table(file.choose(),header=TRUE,sep=",")
I get this warning message:
"incomplete final line found by readTableHeader on 'C:\Documents and Settings..."
There are a couple of things I thought may have caused this warning, but unfortunately I don't know enough about R to diagnose the problem myself so I thought I'd post here in the hope someone else can diagnose it for me!
the .csv file was originally an Excel file, which I saved into .csv format
the file comprises three columns of data
each data column is of a differing length, i.e. there are a different number of values in each column
I want to compare the means (using t-test or equivalent depending on normal / not normal distribution) of two of the columns at a time, so for example, t-test between column 1 values and column 2 values, then a t-test of column 1 and column 3 values, etc.
Any help or suggestions would be seriously appreciated!
The message indicates that the last line of the file doesn't end with an End Of Line (EOL) character (linefeed (\n) or carriage return+linefeed (\r\n)). The original intention of this message was to warn you that the file may be incomplete; most datafiles have an EOL character as the very last character in the file.
The remedy is simple:
Open the file
Navigate to the very last line of the file
Place the cursor at the end of that line
Press return
Save the file
The problem is easy to resolve;
it's because the last line MUST be empty.
Say, if your content is
line 1,
line2
change it to
line 1,
line2
(empty line here)
Today I ran into this kind of problem when I was trying to use R to read a JSON file, using the command below:
json_data<-fromJSON(paste(readLines("json01.json"), collapse=""))
and I resolved it with the method above.
Are you really sure that you selected the .csv file and not the .xls file? I can only reproduce the error if I try to read in an .xls file. If I try to read in a .csv file or any other text file, it's impossible to recreate the error you get.
> Data <- read.table("test.csv",header=T,sep=",")
> Data <- read.table("test.xlsx",header=T,sep=",")
Warning message:
In read.table("test.xlsx", header = T, sep = ",") :
incomplete final line found by readTableHeader on 'test.xlsx'
readTableHead is the C function that gives the error. It tries to read the first n lines (by default the first 5) to determine the type of the data. The rest of the data is read in using scan(). So the problem is the format of the file.
One way of finding out is to set the working directory to the directory where the file is. That way you see the extension of the file you read in. I know that on Windows extensions are hidden by default, so you might believe it's a csv while it isn't.
The next thing you should do is open the file in Notepad or WordPad (or another editor) and check that the format is equivalent to my file test.csv:
Test1,Test2,Test3
1,1,1
2,2,2
3,3,3
4,4,
5,5,
,6,
This file will give you the following data frame:
> read.table(testfile,header=T,sep=",")
Test1 Test2 Test3
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 NA
5 5 5 NA
6 NA 6 NA
The CSV format saved by Excel separates all cells with a comma. Empty cells just don't have a value. read.table() can easily deal with this and recognizes empty cells just fine.
Use readLines() (with warn = FALSE) to read the file into a character vector first.
After that, use the text = option to read the vector into a data frame with read.table():
pheasant <- read.table(
  text = readLines(file.choose(), warn = FALSE),
  header = TRUE,
  sep = ","
)
I realized that several answers have been provided, but no real fix yet.
The reason, as mentioned above, is that an End of Line (EOL) character is missing at the end of the CSV file.
While the real fix should come from Microsoft, the workaround is to open the CSV file with a text editor and add a line at the end of the file (i.e. press the return key).
I use the Atom software as a text/code editor, but virtually any basic text editor would do.
In the meanwhile, please report the bug to Microsoft.
Question: it seems to me that this is an Office 2016 problem. Does anyone have the issue on a PC?
I solved this problem by changing the encoding in the read.table argument from fileEncoding = "UTF-16" to fileEncoding = "UTF-8".
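For illustration, the changed call looks like this (the file name is just a placeholder):
pheasant <- read.table("pheasant.csv", header = TRUE, sep = ",",
                       fileEncoding = "UTF-8")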
I received the same message. My fix: I deleted all the additional sheets (tabs) in the original file, eliminated non-numeric characters, resaved the file as comma-delimited, and loaded it in R v2.15.0 using the standard call:
filename<-read.csv("filename",header=TRUE)
As an additional safeguard, I closed the software and reopened before I loaded the csv.
In various European locales, as the comma character serves as the decimal point, the read.csv2 function should be used instead.
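A minimal sketch (file name is a placeholder): read.csv2() assumes ';' as the field separator and ',' as the decimal mark, which matches the CSV files Excel writes in many European locales.
pheasant <- read.csv2("pheasant.csv", header = TRUE)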
I got this problem once when I had a single quote as part of the header. When I removed it (i.e. renamed the respective column header from Jimmy's data to Jimmys data), the function returned no warnings.
In my case, the problem really was the final line. The issue was fixed by adding a blank row at the bottom of the CSV file.
FROM
cola,colb,colc
1,2,3
4,5,6
7,8,9
INTO
cola,colb,colc
1,2,3
4,5,6
7,8,9
(empty line here)
Take a closer look: the fixed version has an extra blank line at the very end. Just add that blank line and it will fix the issue.
NOTE
It seems that R's CSV parser is looking for that very last newline character as the line separator. This is better known to programmers as the \n or \r\n characters.
The problem that you're describing occurred for me when I renamed a .xlsx as .csv.
What fixed it for me was going "Save As" and then saving it as a .csv again.
To fix this issue through R itself, I just used read.xlsx(...) instead of read.csv(). It works like a charm, and you do not even have to rename the file; renaming an .xlsx to .csv is not a viable solution.
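For example, with the openxlsx package (an assumption here; the xlsx package also provides a read.xlsx() with a similar interface):
library(openxlsx)

Data <- read.xlsx("test.xlsx", sheet = 1)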
Open the file in TextWrangler or Notepad++ and show the formatting, e.g. in TextWrangler you use Show Invisibles. That way you can see the newline and tab characters.
Often Excel will add all sorts of tabs in the wrong places and no final newline character, but you need to show the symbols to see this.
My workaround was to open the CSV file in a text editor, remove the extra commas after the last value, and then save the file. For example, for the following file:
Test1,Test2,Test3
1,1,1
2,2,2
3,3,3
4,4,
5,5,
,6,,
Remove the commas after 6, then save the file.
I've experienced a similar problem; however, this appears to be a generic warning and may not in fact be related to the line-end character. In my case it was giving this error because the file I was using contained Cyrillic characters; once I replaced them with Latin characters, the error disappeared.
I tried different solutions, such as using a text editor to insert a new line and get the End Of Line character as recommended in the top answer above. None of these worked, unfortunately.
The solution that did finally work for me was very simple: I copy-pasted the content of a CSV file into a new blank CSV file, saved it, and the problem was gone.
There is quite a simple solution (if it is indeed the final line that is causing trouble) where you don't need to open the file before reading it:
cat("\n", file = "your/File/Dir", append = TRUE)
Found this solution here.

Resources