I'm trying to read a giant data frame with cbc.read.table (from the colbycol package):
my.df <- cbc.read.table("df.csv", sep = ";", header = F)
This is what I get:
Error in cbc.read.table("2012Q2.csv", sep = "|", header = F) :
No rows to read
The working directory is set correctly. In principle it works using read.table, except that it doesn't read in all lines (there are about two million).
Does anybody have an idea what I can do about this?
SOLUTION:
Hi again, the following thread helped me out:
R: Why does read.table stop reading a file?
The problem was caused by quotation marks, probably because some of them were not closed. I simply used an editor and deleted all double and single quotation marks as well as all hash marks. It's working now.
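For anyone who cannot edit the files by hand, the same effect can usually be had by disabling quote and comment handling in the reader itself. A sketch, using the read.table fallback and the file and separator from the question:
# quote = "" stops unbalanced quotes from swallowing lines;
# comment.char = "" stops stray "#" characters from truncating rows
my.df <- read.table("df.csv", sep = ";", header = FALSE,
                    quote = "", comment.char = "")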
@Anthony: Thanks for your question. I noticed that the problem did not occur in the first three lines, which is why I got the idea that it's an issue with the file. Thanks!
Paul
Related
I faced an issue importing data from a CSV file into R.
Some basic information on the file: it has 1941 rows and 78 columns.
When I import data using the following command
data = read.csv("data.csv", header = T, sep = ";")
I get only 824 rows.
But when I convert the file into the xlsx format and then import the xlsx file using this command
library(readxl)
data = read_excel("data.xlsx")
everything is ok.
I cannot fix the problem because I don't know where it is.
Can you help me please?
P.S.
Unfortunately I cannot share the file with you, because it is top secret.
The solution to the problem is to add the parameter quote = "" to the code, like this:
data = read.csv("data.csv", header = T, sep = ";", quote = "")
That's it.
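The row loss usually means there is an unbalanced quote character somewhere in the data: read.csv keeps reading until the quote closes, silently merging many physical lines into one record. With quote = "" every quote is treated as ordinary text. A quick check to confirm the fix, a sketch using the file name from the question:
# raw data lines (minus the header) should now match the rows parsed
length(readLines("data.csv")) - 1
nrow(read.csv("data.csv", header = TRUE, sep = ";", quote = ""))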
Post the error/warning message if any.
When you open your data, see if you have problematic characters inside columns, like tabs, commas, new lines, etc.
I would suggest reading it line by line as a text file to check the issue (see the sketch below).
Without looking at what in the data is causing the problem, I guess no one could give you a solution.
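For example, count.fields() reports how many fields each line splits into, which usually pinpoints the offending rows. A sketch, assuming the ";"-separated data.csv from the question:
# count fields per line with quote handling switched off
n_fields <- count.fields("data.csv", sep = ";", quote = "")
# lines whose field count differs from the header line are the likely culprits
which(n_fields != n_fields[1])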
I have run into some problems while importing a pipe-delimited file. The file is consistently delimited, but something is getting in the way of R reading some of the delimiters while parsing. R reads in 10 columns when there should be 11, even though the appropriate number of pipes is in place.
A very small sample of the data can be found here: https://drive.google.com/file/d/1ek6-H5EWKCaPfDTfB2muqYBjJz1fM3pf/view
library(readr)
dat <- read_delim("~/Desktop/foo.txt", delim = "|", col_names = TRUE)
I've tried playing around with how R treats the quotes... quote = "\"" did nothing to help, and ignoring the quotes with quote = "" made an even bigger mess of the import.
Any thoughts on how to fix the problem?
Feel free to use fread() from the data.table package, as below.
library(data.table)
FOO3 <- fread("~/Downloads/foo.txt", sep = "|", fill = TRUE)
Below is the imported dataset I got (screenshot not reproduced here).
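If you want to check that fill = TRUE padded the right rows rather than silently mis-aligning them, something along these lines helps. A sketch, reusing the path and object from the answer above:
# distribution of field counts per line; anything other than 11 marks a short or long row
table(count.fields("~/Downloads/foo.txt", sep = "|", quote = ""))
# confirm the parsed result has the expected 11 columns
ncol(FOO3)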
The program I am exporting my data from (PowerBI) saves the data as a .csv file, but the first line of the file is sep=, and then the second line of the file has the header (column names).
Sample fake .csv file:
sep=,
Initiative,Actual to Estimate (revised),Hours Logged,Revised Estimate,InitiativeType,Client
FakeInitiative1 ,35 %,320.08,911,Platform,FakeClient1
FakeInitiative2,40 %,161.50,400,Platform,FakeClient2
I'm using this command to read the file:
initData <- read.csv("initData.csv",
row.names=NULL,
header=T,
stringsAsFactors = F)
but I keep getting an error saying there is the wrong number of columns (because it thinks the first line tells it the number of columns).
If I use header=F instead, it loads, but when I then do names(initData) <- initData[2,], the names have spaces and illegal characters and that breaks the rest of my program. Obnoxious.
Does anyone know how to tell R to ignore that first line? I can go into the .csv file in a text editor and delete the first line manually before I load it each time (if I do that, everything works fine), but I have to export a bunch of files and doing this by hand is a bit stupid and tedious.
Any help would be much appreciated.
There are many ways to do that. Here's one:
all_content = readLines("initData.csv")
skip_first_line = all_content[-1]
initData <- read.csv(textConnection(skip_first_line),
row.names=NULL,
header=T,
stringsAsFactors = F)
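Another option is to let read.csv() skip the offending line itself via its skip argument. A sketch, using the same file name:
# skip = 1 drops the "sep=," line; the header is then taken from the next line
initData <- read.csv("initData.csv",
                     skip = 1,
                     row.names = NULL,
                     header = T,
                     stringsAsFactors = F)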
Your file could also be in a UTF-16 encoding; see hrbrmstr's answer on how to read a UTF-16 file.
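In base R the gist would be something like the following. This is only a sketch, and it assumes the export really is UTF-16 little-endian (worth confirming by looking at the first bytes of the file):
# fileEncoding converts from UTF-16LE while reading; skip = 1 still drops the "sep=," line
initData <- read.csv("initData.csv",
                     fileEncoding = "UTF-16LE",
                     skip = 1,
                     header = T,
                     stringsAsFactors = F)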
I used SAS to save a tab-delimited text file with UTF-8 encoding on a Windows machine. Then I tried to open this in R:
read.table(myfile, header = TRUE, sep = "\t")
To my surprise, the data was totally messed up, but only in a sneaky way. Number values changed randomly, but the overall layout looked normal, so it took me a while to notice the problem, which I'm assuming now is the BOM.
This is not a new issue of course; they address it briefly here, and recommend using
read.table(myfile, fileEncoding = "UTF-8", header = TRUE, sep = "\t")
However, this made no improvement! My only solution was to suppress the header, with or without the fileEncoding argument:
read.table(myfile, fileEncoding = "UTF-8", header = FALSE, sep = "\t")
read.table(myfile, header = FALSE, sep = "\t")
In either case, I have to do some funny business to replace the column names with the first row, but only after I remove some version of the BOM that appears at the beginning of the first column name (<U+FEFF> if I use fileEncoding and ï»¿ if I don't use fileEncoding).
Isn't there a simple way to just remove the BOM and use read.table without any special arguments?
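(For reference, the funny business I mentioned is roughly this, just a sketch of what I'm doing now:)
# read without a header, strip whichever form of the BOM leaked into the first
# cell (<U+FEFF> with fileEncoding, the raw bytes without it), then promote the
# first row to column names
dat <- read.table(myfile, header = FALSE, sep = "\t", stringsAsFactors = FALSE)
dat[1, 1] <- sub("^(\ufeff|ï»¿)", "", dat[1, 1])
names(dat) <- as.character(unlist(dat[1, ]))
dat <- dat[-1, ]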
Update for @Joe:
The SAS that I used:
FILENAME myfile 'C:\Documents ... file.txt' encoding="utf-8";
proc export data=lib.sastable
outfile=myfile
dbms=tab replace;
putnames=yes;
run;
Update on further weirdness: Using fileEncoding="UTF-8-BOM" as @Joe suggested in his solution below seems to remove the BOM. However, it did not fix my original motivating problem, which is the corruption in the data; the header row is fine, but weirdly the last few digits of the first column of numbers get messed up. I'll give Joe credit for his answer -- maybe my problem is not actually a BOM issue?
Hack solution: Use fileEncoding="UTF-8-BOM" AND also include the argument colClasses = "character". No idea why this works to fix the data corruption issue -- could be the topic of a future question.
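In code, the hack is simply this (again, just a sketch):
# UTF-8-BOM strips the byte-order mark; colClasses = "character" sidesteps the
# numeric corruption, for reasons I don't understand
dat <- read.table(myfile,
                  fileEncoding = "UTF-8-BOM",
                  header = TRUE,
                  sep = "\t",
                  colClasses = "character")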
As per your link, it looks like it works for me with:
read.table('c:\\temp\\testfile.txt', fileEncoding = 'UTF-8-BOM', header = TRUE, sep = '\t')
Note the -BOM in the file encoding.
This is in 2.1 Variations on read.table in the R documentation. Under 12 Encoding, see "Under UNIX you might need...", which apparently applies even on Windows now (for me, at least).
Or you can use the SAS system option NOBOMFILE to write a UTF-8 file without the BOM.
This problem is similar to that seen here.
I have a large number of large CSVs which I am loading and parsing serially through a function. Many of these CSVs present no problem, but there are several which are causing problems when I try to load them with read.csv().
I have uploaded one of these files to a public Dropbox folder here (note that the file is around 10.4MB).
When I try to read.csv() that file, I get the following warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on ...
And I cannot isolate the problem, despite scouring Stack Overflow and R-help for solutions. Maddeningly, when I run
Import <- read.csv("http://dl.dropbox.com/u/83576/Candidate%20Mentions.csv")
using the Dropbox URL instead of my local path, it loads, but when I then save that very data frame and try to reload it thus:
write.csv(Import, "Test_File.csv", row.names = F)
TestImport <- read.csv("Test_File.csv")
I get the "incomplete final line" warning again.
So, I am wondering why the Dropbox-loaded version works, while the local version does not, and how I can make my local versions work -- since I have somewhere around 400 of these files (and more every day), I can't use a solution that can't be automated in some way.
In a related problem, perhaps deserving of its own question, it appears that some "special characters" break the read.csv() process, and prevent the loading of the entire file. For example, one CSV which has 14,760 rows only loads 3,264 rows. The 3,264th row includes this eloquent Tweet:
"RT #akiron3: ácÎå23BkªÐÞ'q(#BarackObama )nĤÿükTPP ÍþnĤüÈ’áY‹ªÐÞĤÿüŽ
\&’ŸõWˆFSnĤ©’FhÎåšBkêÕ„kĤüÈLáUŒ~YÒhttp://t.co/ABNnWfTN
“jg)(WˆF"
Again, given the serialized loading of several hundred files, how can I (a) identify what is causing this break in the read.csv() process, and (b) fix the problem with code, rather than by hand?
Thanks so much for your help.
1)
suppressWarnings(TestImport <- read.csv("Test_File.csv"))
2) Unmatched quotes are the most common cause of apparent premature closure. You could try adding all of these arguments:
quote = "", na.strings = "", comment.char = ""
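Put together, a sketch of the full call (file name taken from the question, adjust as needed):
# quote = "" and comment.char = "" disable quote/comment handling so stray characters
# cannot swallow the rest of a row or file; na.strings = "" maps empty fields to NA
TestImport <- read.csv("Test_File.csv",
                       quote = "",
                       na.strings = "",
                       comment.char = "")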