Split big data in R - r

I have a big data file (~1GB) and I want to split it into smaller ones. I have R in hand and plan to use it.
Loading the whole into memory cannot be done as I would get the "cannot allocate memory for vector of xxx" error message.
Then I want to use the read.table() function with the parameters skip and nrows to read only parts of the file in. Then save out to individual files.
To do this, I'd like to know the number of lines in the big file first so I can workout how many rows should I set to individual files and how many files should I split into.
My question is: how can I get the number of lines from the big data file without fully loading it into R?
Suppose I can only use R. So cannot use any other programming languages.
Thank you.

Counting the lines should be pretty easy -- check this tutorial http://www.exegetic.biz/blog/2013/11/iterators-in-r/ (the "iterating through lines part).
The gist is to use ireadLines to open an iterator over the file

For Windows, something like this should work
fname <- "blah.R" # example file
res <- system(paste("find /v /c \"\"", fname), intern=T)[[2]]
regmatches(res, gregexpr("[0-9]+$", res))[[1]]
# [1] "39"

Related

How can I load a large (3.96 gb) .tsv file in R studio

I want to load a 3.96 gigabyte tab separated value file to R and I have 8 ram in my system. How can I load this file to R to do some manipulation on it.
I tried library(data.table) to load my data
but I´ve got this error message (Error: cannot allocate vector of size 965.7 Mb)
I also tried fread with this code but it was not working either: it took a lot of time and at last it showed an error.
as.data.frame(fread(file name))
If I were you, I probably would
1) try your fread code once more without the typo (closing parenthesis was initially missing):
as.data.frame(fread(file name))
2) try to read the file in parts by specifying number of rows to read. This can be done in read.csv and fread with nrow arguments. By reading a small number of rows one could check and confirm that the file is actually readable before doing anything else. Sometimes files are malformed, there could be some special characters, wrong end-of-line characters, escaping or something else which needs to be addressed first.
3) have a look at bigmemory package which have read.big.matrix function. Also ff package has the desired functionalities.
Alternatively, I probably would also try to think "outside the box": do I need all of the data in the file? If not, I could preprocess the file for example with cut or awk to remove unnecessary columns. Do I absolutely need to read it as one file and have all data simultaneously in memory? If not, I could split the file or maybe use readLines..
ps. This topic is covered quite nicely in this post.
pps. Thanks to #Yuriy Barvinchenko for comment on fread
You are reading the data (which puts it in memory) and then storing it as a data.frame (which makes another copy). Instead, read it directly into a data.frame with
fread(file name, data.table=FALSE)
Also, it wouldn't hurt to run garbage collection.
gc()
From my experience and in addition to #Oka answer:
fread() have nrows= argument, so you can read first 10 lines.
If you found out that you don't need all lines and/or all columns, so you can set condition and list of fields just after fread()[]
You can use data.table as dataframe in many cases, so you can try to read without as.data.frame()
This way I worked with 5GB csv file.

Massive text file to be read into R

so the story is i have a 30 gig txt file that need to be read into R, it contains two cols and about 2 billion rows of integers! I dont want to load the whole thing in one go, sizeable chunks will suffice.
I've tried using read.table with arguments like nrow = 10000000 and skip = "stupidly_large_number"
but i get the following error when i get far through the file
Error in readLines(file, skip):
cannot allocate vector of length 1800000000
Please help me get at the data and thanks in advance!
it seems to me that you may need to split the text file into manageable chunks first before trying to process them. The unix split command should do the trick, but I don't know if you're on a platform that command exists on.

Excel data organized in multiple nested rows, can R read it?

Please see the picture. I've started using R, and know how/that it can read files from Excel, but can it read something formatted like this?
http://www.flickr.com/photos/68814612#N05/8632809494/
(my apologies, upload was not working for me)
Elaborating on some of what's in the comments:
If you load the file into Excel, you can save it as a fixed-width or comma-delimited text file. Either should be easy to read into R.
The following may be obvious to you already.
(First, a question: Are you sure that you can't get the data in a format that has one set of data per line? Is it possible that the file you're getting was generated from a different file format that is more conducive to loading the data into R?)
Whether you should start rearranging the data in R or instead manipulate the raw text depends on what comes naturally to you (or to people you have around who can help). For me, personally, I would rearrange the text file outside of R before loading it into R. That's what's easiest for me. Perl is a great language for this purpose, but you could also do it with Unix shell scripts if that's accessible to you, or using a powerful editor such as Vim or Emacs. If you have no preference, I'd suggest Perl. If you have any significant programming experience, you'll be able to learn what you need. On the other hand, you're already loading it into R, so maybe it would be better to process the data there.
For example, you could execute a loop that goes the text file line by line and does something like this:
while (still have lines to read) {
read first header line into an vector if this is the first time through the loop
otherwise, read it and throw it away
read data line 1 into an vector
read second header line into vector if this is the first time
otherwise, read it and throw it away
read data line 2 into an vector
read third header line into vector if this is the first time
otherwise, read it and throw it away
read data line 3 into an vector
if this is first time through, concatenate the header vectors; store as next row
in something (a file, a matrix, a dataframe, etc.)
concatenate the data vectors you've been saving, and store as next row in same thing
}
write out the whole 2D data structure
Or if the headers will never change, then you could just embed them literally into the script before the loop, and throw them out no matter what. That will make the code cleaner. Or read the first few lines of the file separately to get the headers, and then have a separate script to read the data and add it to the file with the headers in it. (The headers will probably be useful in R, so I would suggest preserving them at the top of the text file.)

Open large files with R

I want to process a file (1.9GB) that contains 100.000.000 datasets in R.
Actually I only want to have every 1000th dataset.
Each dataset contains 3 Columns, separated by a tab.
I tried: data <- read.delim("file.txt"), but R Was not able to manage all datasets at once.
Can I tell R directly to load only every 1000th dataset from the file?
After reading the file I want to bin the data of column 2.
Is it possible to directly bin the number written in column 2?
Is it possible the read the file line by line, without loading the whole file into the memory?
Thanks for your help.
Sven
You should pre-process the file using another tool before reading into R.
To write every 1000th line to a new file, you can use sed, like this:
sed -n '0~1000p' infile > outfile
Then read the new file into R:
datasets <- read.table("outfile", sep = "\t", header = F)
You may want to look at the manual devoted to R Data Import/Export.
Naive approaches always load all the data. You don't want that. You may want another script which reads line-by-line (written in awk, perl, python, C, ...) and emits only every N-th line. You can then read the output from that program directly in R via a pipe -- see the help on Connections.
In general, very large memory setups require some understanding of R. Be patient, you will get this to work but once again, a naive approach requires lots of RAM and a 64-bit operating system.
Maybe package colbycol could be usefull to you.

Reading in only part of a Stata .DTA file in R

I apologize in advance if this has a simple answer somewhere. It seems like the kind of thing that would, but I can't seem to locate it in the help files, by searching SO, or by Googling.
I'm working with some datasets that are several GB right now. It's enough to fit in memory on one of the cluster nodes I have access to, but takes quite a bit of time to load. For many debugging/programming activities with this data, I don't need the entire file loaded, just the first few thousand observations to have a dataset on which to test code. I can of course just read the whole file in and subset, but I was wondering if there's a way to tell read.dta() to only read in the first N rows? This would of course be much faster.
I could also use a proper format like .csv and then use read.csv()'s nrows argument, but then I'd lose the factor labels in the Stata dataset (and have to recreate quite a few GB of data from someone else's code that's feeding in to this project. So a direct solution on .dta files is preferred.
Stata's binary files are written row-by-row, so you could change the R_LoadStataData function in stataread.c to limit the number of rows read in. However, this will only work if you do not need the value labels because they are written at the end of the file and would require you to read the entire file--which wouldn't save any time.
That's going to be a difficult one, as the do_readStata function under the hood is compiled code, only capable of taking in the whole file. I believe that in general binary files are hard to read line by line, and .dta is a binary format. Also the native binary format of R doesn't allow to select a number of lines from the dataset while reading in.
In my humble opinion, you can better just create a set of test files from within Stata ( eg the Stata code sample 1000, count will give you a sample of 1000 observations from the loaded dataset), and work with them. And if you have no access to Stata, someone else in the project should be able to do that for you.
To follow up on Joris Meys: For this kind of thing, I use a "test" data set and the "real" data set, each in separate folders. I keep a macro at the top of the .do file (with if/then statements below) to (1) take a sample of the data and (2) point input/output to the right folder containing one or the other. I probably do it different for every project, but something like this:
data creation .do file
blah blah blah
save using data/myfile.dta
save if uniform()<.05 using test_data/myfile.dta // or bsample, then save for panel data
analysis .do file
local test = "test_"
// when you're ready to run the file with all the data, use the following
// local test = ""
use `test'data/myfile.dta
blah blah blah
outreg2 ... using `test'output/mytable.txt

Resources