Troubles with cbc.read.table function in R [duplicate] - r

Possible Duplicate:
Some issues trying to read a file with cbc.read.table function in R + using filter while reading files
a) I'm trying to read a relatively large .txt file with the cbc.read.table function from the colbycol package in R. From what I've read, this package makes the job easier when we have large files (more than a GB to read into R) and we don't need all of the columns/variables for our analysis. I also read that cbc.read.table supports the same parameters as read.table. However, if I pass the nrows parameter (in order to get a preview of my file in R), I get the following error:
#My line code. I'm just reading columns 5,6,7,8 out of 27
i.can <- cbc.read.table("xxx.txt", header = T, sep = "\t", just.read = 5:8, nrows = 20)
#error message
Error in read.table(file, nrows = 50, sep = sep, header = header, ...) :
formal argument "nrows" matched by multiple actual arguments
So, my question is: could you tell me how I can solve this problem?
b) After that, I tried to read all instances with the following code:
i.can.b <- cbc.read.table("xxx.txt", header = T, sep = "\t", just.read = 4:8) # done perfectly
my.df <- as.data.frame(i.can.b) # getting error in this line
Error in readSingleKey(con, map, key) : unable to obtain value for key 'Company' #Company is a string column in my data set
So, my question is again: How can I solve this?
c) Do you know a way to filter rows by condition while reading a file?

In reply to a):
cbc.read.table() reads in the data in 50 row chunks:
tmp.data <- read.table(file, nrows = 50, sep = sep, header = header,
                       ...)
Since the function already assigns nrows the value 50 internally, passing the nrows that you specify means read.table() receives two nrows arguments, which produces the error. To me, this looks like a bug. To work around it, you can either modify cbc.read.table() to handle a user-specified nrows or to accept something like a max.rows argument (and perhaps send the change to the maintainer as a potential patch). Alternatively, you can use the sample.pct argument, which specifies the proportion of rows to read: if the file contains 100 rows and you only want 50, use sample.pct = 0.5.
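If all you need is a preview, base read.table() accepts nrows directly, and the sample.pct route works within cbc.read.table() itself. A minimal sketch, assuming the same file and columns as above (the 1% sample proportion is illustrative):
# Preview the first 20 rows (columns 5-8) with base read.table(), which does accept nrows
preview <- read.table("xxx.txt", header = TRUE, sep = "\t", nrows = 20)
preview[, 5:8]
# Or read a random subset of rows with cbc.read.table()'s own sample.pct argument
i.can <- cbc.read.table("xxx.txt", header = TRUE, sep = "\t",
                        just.read = 5:8, sample.pct = 0.01)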
In reply to b):
Not sure what that error means. It is hard to diagnose without a reproducible example. Do you get the same error if you read in a smaller file?
In reply to c):
I generally prefer storing very large character data in a relational database, such as MySQL. It might be easier in your case to use the RSQLite package, which embeds an SQLite engine within R. SQL SELECT queries can then be used to retrieve conditional subsets of the data. Other packages for larger-than-memory data are listed under "Large memory and out-of-memory data" in the High-Performance Computing task view: http://cran.r-project.org/web/views/HighPerformanceComputing.html
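A minimal sketch of the RSQLite route, assuming dbWriteTable()'s ability to import a delimited text file directly; the database name, table name, and WHERE clause are illustrative (Company is the column mentioned in your error, 'ACME' is a made-up value):
library(RSQLite)
con <- dbConnect(SQLite(), dbname = "mydata.sqlite")
# Import the tab-delimited file into an SQLite table (done once)
dbWriteTable(con, name = "mytable", value = "xxx.txt",
             sep = "\t", header = TRUE, overwrite = TRUE)
# Retrieve only the rows/columns you need
subset_df <- dbGetQuery(con, "SELECT * FROM mytable WHERE Company = 'ACME'")
dbDisconnect(con)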

Related

Error encountered with using rbindlist: column 25 of result is determined to be integer64 but maxType == 'Character' !=REALSXP

I used the following function to merge all .csv files in my directory into one dataframe:
multmerge = function(mypath){
  filenames = list.files(path = mypath, full.names = TRUE)
  rbindlist(lapply(filenames, fread), fill = TRUE)
}
dataframe = multmerge(path)
This code produces this error:
Error in rbindlist(lapply(filenames, fread), fill = TRUE) : Internal error: column 25 of result is determined to be integer64 but maxType=='character' != REALSXP
The code has worked on the same csv files before...I'm not sure what's changed and what the error message means.
Looking at the documentation for fread, I just noticed there is an integer64 option. Are you dealing with integers greater than 2^31?
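If that is a possibility, a quick check is to read one of the files with 64-bit integers forced to character and inspect the column types; a sketch, assuming mypath is the directory used in your function:
library(data.table)
# Read a single file, treating any int64 columns as character, and inspect the result
one_file <- fread(list.files(path = mypath, full.names = TRUE)[1], integer64 = "character")
str(one_file)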
EDIT: I added a tryCatch that prints a formatted message to the console indicating which files cause an error, along with the actual error message. However, for rbindlist to still run over the normal files, the error handler needs to return a dummy list; this produces an extra column called ERROR, which is NA in every row except the bottom one(s), where it holds the name(s) of the problem file(s).
I suggest that after you run this code through once, you delete the ERROR column and the extra row(s) from the data.table and save the combined file as a .csv. I would then move all the files that combined properly into a different folder, keeping only the current combined file and the ones that didn't load properly in the path. Then rerun the function, this time with colClasses specified. I combined everything into one script so it's hopefully less confusing:
# First run, without colClasses
multmerge = function(mypath){
  filenames = list.files(path = mypath, full.names = TRUE)
  rbindlist(lapply(filenames, function(i) tryCatch(fread(i),
    error = function(e) {
      cat("\nError reading in file:", i, "\t")  # Identifies problem files by name
      message(e)                                # Prints error message without stopping the loop
      list(ERROR = i)                           # Adds a placeholder column so rbindlist will execute
    })),                                        # End of tryCatch and lapply
    fill = TRUE)                                # rbindlist arguments
}                                               # End of function
# You should get the original error message and identify the filename.
dataframe = multmerge(path)
# Delete the placeholder column and the extra rows.
# You will get as many extra rows as you have problem files -
# most likely just the one with the column 25 issue, or any others with the same problem.
# Note: the out-of-bounds error message will probably go away with the colClasses argument left out.
# Save this cleaned file with something like: fwrite(dataframe, "CurrentCombinedData.csv")
# Move all files except the problem file(s) into a new folder,
# so that only the big combined file and the problem file(s) remain in your path.
# Rerun the function, but add the colClasses argument this time.
# Second run, to accommodate the problem file(s) - we know it is the column 25 error this time,
# but in the future you may have to adapt this by adding the appropriate column.
multmerge = function(mypath){
  filenames = list.files(path = mypath, full.names = TRUE)
  rbindlist(lapply(filenames, function(i) tryCatch(fread(i, colClasses = list(character = c(25))),
    error = function(e) {
      cat("\nError reading in file:", i, "\t")  # Identifies problem files by name
      message(e)                                # Prints error message without stopping the loop
      list(ERROR = i)                           # Adds a placeholder column so rbindlist will execute
    })),                                        # End of tryCatch and lapply
    fill = TRUE)                                # rbindlist arguments
}                                               # End of function
dataframe2 = multmerge(path)
Now we know the source of the error is column 25, which we can specify in colClasses. If you run the code and get the same error message for a different column, simply add that column's number after the 25. Once you have the data read in, I would check what is going on in that column (and in any others you had to add). Maybe there was a data-entry error in one of the files, or an NA value encoded differently. That's why I suggest converting the column to character first: you lose less information that way than by converting to numeric first.
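One way to spot the offending values is to pull out the entries in that column that do not parse as numbers; a small sketch, reusing the dataframe2 object and column 25 from the example above:
# Values in column 25 that fail to parse as numeric (likely the culprits)
col25 <- as.character(dataframe2[[25]])
unique(col25[!is.na(col25) & is.na(suppressWarnings(as.numeric(col25)))])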
Once you have no errors, always write the cleaned, combined data.table to a csv in your folder, and always move the individual files that have been combined into the other folder. That way, when you add new files, you will only be combining the big file and a few others, which makes it easier to see what is going on in the future. Just keep notes on which files gave you trouble and which columns. Does that make sense?
Because files are often idiosyncratic, you will have to be flexible, but this workflow should make it easy to identify problem files and add whatever the fread call needs to make them work. Basically, archive the files that have been processed, keep track of exceptions like the column 25 one, and keep the most current combined file together with the not-yet-processed files in the active path. Hope that helps and good luck!

sqldf returns zero observations

I have a number of large data files (.csv) on my local drive that I need to read in R, filter rows/columns, and then combine. Each file has about 33,000 rows and 575 columns.
I read this post: Quickly reading very large tables as dataframes and decided to use "sqldf".
This is the short version of my code:
Housing <- file("file location on my disk")
Housing_filtered <- sqldf('SELECT Var1 FROM Housing', file.format = list(eol = "/n"))  # I am using Windows
I see "Housing_filtered" data.frame is created with Var1, but zero observations. This is my very first experience with sqldf. I am not sure why zero observations are returned.
I also used "read.csv.sql" and still I see zero observations.
Housing_filtered <- read.csv.sql(file = "file location on my disk",
                                 sql = "select Var01 from file",
                                 eol = "/n",
                                 header = TRUE, sep = ",")
You never really imported the file as a data.frame like you think.
You've opened a connection to a file. You mentioned that it is a CSV. Your code should look something like this if it is a normal CSV file:
Housing <- read.csv("my_file.csv")
Housing_filtered <- sqldf('SELECT Var1 FROM Housing')
If there's something non-standard about this CSV file please mention what it is and how it was created.
Also, to another point made in the comments: if you do for some reason need to specify the line breaks manually, use \n where you were using /n. That change is not what is causing any remaining error; rather, you are getting past one problem and on to another, probably due to improperly handled missing data, spaces, commas inside text fields, etc.
If there are still data errors can you please use R code to create a small file that is reflective of the relevant characteristics of your data and which produces the same error when you import it? This may help.
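If the real goal is to avoid ever holding the full file in memory, read.csv.sql() from sqldf can also filter while reading; a sketch, assuming a standard comma-separated file with a header row (the file name and WHERE clause are illustrative):
library(sqldf)
# Inside the SQL statement, the file is referred to as "file"
Housing_filtered <- read.csv.sql("my_file.csv",
                                 sql = "SELECT Var1 FROM file WHERE Var1 <> ''",
                                 header = TRUE, sep = ",")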

Reading large csv file with missing data using bigmemory package in R

I am using large datasets for my research (4.72GB) and I discovered "bigmemory" package in R that supposedly handles large datasets (up to the range of 10GB). However, when I use read.big.matrix to read a csv file, I get the following error:
> x <- read.big.matrix("x.csv", type = "integer", header = TRUE, backingfile = "file.bin", descriptorfile = "file.desc")
Error in read.big.matrix("x.csv", type = "integer", header = TRUE, :
  Dimension mismatch between header row and first data row.
I think the issue is that the csv file is not full, i.e., it is missing values in several cells. I tried removing header = TRUE but then R aborts and restarts the session.
Does anyone have experience with reading large csv files with missing data using read.big.matrix?
It may not solve your problem directly, but you might find a package of mine, filematrix, useful. The relevant function is fm.create.from.text.file.
Please let me know if it works for your data file.
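For reference, a minimal sketch of how I imagine it being called; the argument names and the output base name here are assumptions from memory, so please check them against the package documentation:
library(filematrix)
# Convert the text file into an on-disk filematrix, then open it without loading it into RAM
fm.create.from.text.file(textfilename = "x.csv", filenamebase = "x_fm", delimiter = ",")
fm <- fm.open("x_fm")
dim(fm)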
Did you check the bigmemory PDF at https://cran.r-project.org/web/packages/bigmemory/bigmemory.pdf?
This case is clearly described right there:
write.big.matrix(x, 'IrisData.txt', col.names=TRUE, row.names=TRUE)
y <- read.big.matrix("IrisData.txt", header=TRUE, has.row.names=TRUE)
# The following would fail with a dimension mismatch:
if (FALSE) y <- read.big.matrix("IrisData.txt", header=TRUE)
Basically, the error means there is a column in the CSV file holding row names. If you don't pass has.row.names = TRUE, bigmemory treats the row names as a separate data column, so the header row (which has no entry for it) no longer matches the first data row, hence the dimension mismatch.
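Applied to the original call, that would look something like this (a sketch; adjust the type and the backing-file names to your setup):
x <- read.big.matrix("x.csv", type = "integer", header = TRUE,
                     has.row.names = TRUE,
                     backingfile = "file.bin", descriptorfile = "file.desc")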
I personally found the data.table package more useful for dealing with large datasets, but YMMV.

Inconsistent results between fread() and read.table() for .tsv file in R

My question concerns two issues I encountered while reading a published .tsv file that contains campaign finance data.
First, the file has a null character that terminates input and throws the error 'embedded nul in string: 'NAVARRO b\0\023 POWERS' when using data.table::fread(). I understand that there are a number of potential solutions to this problem but I was hoping to find something within R. Having seen the skipNul option in read.table(), I decided to give it a shot.
That brings me to the second issue: read.table() with reasonable settings (comment.char = "", quote = "", fill = T) does not throw an error, but it also does not find the same number of rows that data.table::fread() does (~100k rows with read.table() vs. ~8M rows with data.table::fread()). The fread() count seems more plausible, since the file is ~1.5GB and fread() reads valid data in the rows leading up to where the error seems to occur.
Here is a link to the code and output for the issue.
Any ideas on why read.table() is returning such different results? fread() operates by guessing characteristics of the input file but it doesn't seem to be guessing any exotic options that I didn't use in read.table().
Thanks for your help!
NOTE
I do not know anything about the file in question other than its source and what information it contains. The source is the California Secretary of State, by the way. At any rate, the file is too large to open in Excel or Notepad, so I haven't been able to examine it visually beyond looking at a handful of rows in R.
I couldn't figure out an R way to deal with the issue but I was able to use a python script that relies on pandas:
import pandas as pd
import os

os.chdir(path = "C:/Users/taylor/Dropbox/WildPolicy/Data/Campaign finance/CalAccess/DATA")

receipts_chunked = pd.read_table("RCPT_CD.TSV", sep = "\t", error_bad_lines = False,
                                 low_memory = False, chunksize = 5e5)

chunk_num = 0
for chunk in receipts_chunked:
    chunk_num = chunk_num + 1
    file_name = "receipts_chunked_" + str(chunk_num) + ".csv"
    print("Output file:", file_name)
    chunk.to_csv(file_name, sep = ",", index = False)
The problem with this route is that, with error_bad_lines = False, problem rows are simply skipped instead of raising an error. There are only a handful of error cases (out of ~8 million rows), but this is obviously still suboptimal.
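If you would rather stay within R, one possible workaround (a sketch, not tested on this particular file) is to strip the NUL bytes yourself before handing the text to fread(); note that this needs enough RAM to hold the raw file, and the file name follows the pandas example above:
# Read the raw bytes, drop embedded NULs, then parse the cleaned text with fread()
raw_bytes <- readBin("RCPT_CD.TSV", what = "raw", n = file.info("RCPT_CD.TSV")$size)
clean_txt <- rawToChar(raw_bytes[raw_bytes != as.raw(0)])
receipts <- data.table::fread(clean_txt, sep = "\t")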

Import text file using ff package

I have a text file of 4.5 million rows and 90 columns to import into R. Using read.table I get the "cannot allocate vector of size..." error message, so I am trying to import it with the ff package before subsetting the data to extract the observations that interest me (see my previous question for more details: Add selection crteria to read.table).
So, I use the following code to import:
test<-read.csv2.ffdf("FD_INDCVIZC_2010.txt", header=T)
but this returns the following error message :
Error in read.table.ffdf(FUN = "read.csv2", ...) :
only ffdf objects can be used for appending (and skipping the first.row chunk)
What am I doing wrong?
Here are the first 5 rows of the text file:
CANTVILLE.NUMMI.AEMMR.AGED.AGER20.AGEREV.AGEREVQ.ANAI.ANEMR.APAF.ARM.ASCEN.BAIN.BATI.CATIRIS.CATL.CATPC.CHAU.CHFL.CHOS.CLIM.CMBL.COUPLE.CS1.CUIS.DEPT.DEROU.DIPL.DNAI.EAU.EGOUL.ELEC.EMPL.ETUD.GARL.HLML.ILETUD.ILT.IMMI.INAI.INATC.INFAM.INPER.INPERF.IPO ...
1 1601;1;8;052;54;051;050;1956;03;1;ZZZZZ;2;Z;Z;Z;1;0;Z;4;Z;Z;6;1;1;Z;16;Z;03;16;Z;Z;Z;21;2;2;2;Z;1;2;1;1;1;4;4;4,02306147485403;ZZZZZZZZZ;1;1;1;4;M;22;32;AZ;AZ;00;04;2;2;0;1;2;4;1;00;Z;54;2;ZZ;1;32;2;10;2;11;111;11;11;1;2;ZZZZZZ;1;2;1;4;41;2;Z
2 1601;1;8;012;14;011;010;1996;03;3;ZZZZZ;2;Z;Z;Z;1;0;Z;4;Z;Z;6;2;8;Z;16;Z;ZZ;16;Z;Z;Z;ZZ;1;2;2;2;Z;2;1;1;1;4;4;4,02306147485403;ZZZZZZZZZ;3;3;3;1;M;11;11;ZZ;ZZ;00;04;2;2;0;1;2;4;1;14;Z;54;2;ZZ;1;32;Z;10;2;23;230;11;11;Z;Z;ZZZZZZ;1;2;1;4;41;2;Z
3 1601;1;8;006;05;005;005;2002;03;3;ZZZZZ;2;Z;Z;Z;1;0;Z;4;Z;Z;6;2;8;Z;16;Z;ZZ;16;Z;Z;Z;ZZ;1;2;2;2;Z;2;1;1;1;4;4;4,02306147485403;ZZZZZZZZZ;3;3;3;1;M;11;11;ZZ;ZZ;00;04;2;2;0;1;2;4;1;14;Z;54;2;ZZ;1;32;Z;10;2;23;230;11;11;Z;Z;ZZZZZZ;1;2;1;4;41;2;Z
4 1601;1;8;047;54;046;045;1961;03;2;ZZZZZ;2;Z;Z;Z;1;0;Z;4;Z;Z;6;1;6;Z;16;Z;14;974;Z;Z;Z;16;2;2;2;Z;2;2;4;1;1;4;4;4,02306147485403;ZZZZZZZZZ;2;2;2;1;M;22;32;MN;GU;14;04;2;2;0;1;2;4;1;14;Z;54;2;ZZ;2;32;1;10;2;11;111;11;11;1;4;ZZZZZZ;1;2;1;4;41;2;Z
5 1601;2;9;053;54;052;050;1958;02;1;ZZZZZ;2;Z;Z;Z;1;0;Z;2;Z;Z;2;1;2;Z;16;Z;12;87;Z;Z;Z;22;2;1;2;Z;1;2;3;1;1;2;2;4,21707670353782;ZZZZZZZZZ;1;1;1;2;M;21;40;GZ;GU;00;07;0;0;0;0;0;2;1;00;Z;54;2;ZZ;1;30;2;10;3;11;111;ZZ;ZZ;1;1;ZZZZZZ;2;2;1;4;42;1;Z
I encountered a similar problem when reading a csv into ff objects. Using
read.csv2.ffdf(file = "FD_INDCVIZC_2010.txt")
instead of the implicit call
read.csv2.ffdf("FD_INDCVIZC_2010.txt")
got rid of the error. Explicitly naming the file argument seems to be required by the ff read functions.
You could try the following code:
read.csv2.ffdf("FD_INDCVIZC_2010.txt",
sep = "\t",
VERBOSE = TRUE,
first.rows = 100000,
next.rows = 200000,
header=T)
I am assuming that since it's a txt file, it is tab-delimited. Sorry, I only came across the question just now. With the VERBOSE option you can actually see how much time each block of data takes to be read. Hope this helps.
If possible, try to filter the data at the OS level, that is, before it is loaded into R. The simplest way to do this from R is to use a combination of pipe() and the grep command:
textpipe <- pipe('grep XXXX file.name')
mutable <- read.table(textpipe)
You can use grep, awk, sed, and basically all the machinery of the Unix command-line tools to apply the necessary selection criteria and edit the csv files before they are imported into R. This works very fast, and by this procedure you strip unnecessary data before R begins to read it from the pipe.
This works well under Linux and Mac; on Windows you may need to install Cygwin or use some other Windows-specific utilities.
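For example, a sketch that keeps only the rows matching a condition and only a few of the fields before R parses anything; the field numbers, the test value, and the semicolon separator are illustrative, based on the sample rows shown above:
# Keep rows whose first field is 1601 and print only fields 5-8, semicolon-separated
cmd <- "awk -F';' '$1 == \"1601\" {print $5\";\"$6\";\"$7\";\"$8}' FD_INDCVIZC_2010.txt"
subset_df <- read.table(pipe(cmd), sep = ";")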
Perhaps you could try the following code (note sep, not seq, for the field separator):
read.table.ffdf(x = NULL, file = 'your/file/path', sep = ';')
