How can I load a large (3.96 GB) .tsv file in RStudio?

I want to load a 3.96 GB tab-separated values file into R, and my system has 8 GB of RAM. How can I load this file into R to do some manipulation on it?
I tried library(data.table) to load my data,
but I got this error message: Error: cannot allocate vector of size 965.7 Mb
I also tried fread with the code below, but it did not work either: it took a long time and eventually showed an error.
as.data.frame(fread("file name"))

If I were you, I would probably:
1) try your fread code once more without the typo (the closing parenthesis was initially missing):
as.data.frame(fread("file name"))
2) try to read the file in parts by specifying the number of rows to read. This can be done in read.csv and fread with the nrows argument. Reading a small number of rows lets you confirm that the file is actually readable before doing anything else. Files are sometimes malformed: special characters, wrong end-of-line characters, escaping problems or something else may need to be addressed first.
3) have a look at the bigmemory package, which has a read.big.matrix function. The ff package offers similar functionality.
Alternatively, I would also try to think "outside the box": do I need all of the data in the file? If not, I could preprocess the file, for example with cut or awk, to remove unnecessary columns. Do I absolutely need to read it as one file and have all the data in memory simultaneously? If not, I could split the file or maybe use readLines.
ps. This topic is covered quite nicely in this post.
pps. Thanks to @Yuriy Barvinchenko for the comment on fread.
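The row-limited check from step 2 could look roughly like the sketch below. It uses only base R; the tiny temporary file and its column names are made up for illustration and stand in for the real 3.96 GB file.

```r
# Hypothetical stand-in for the real file: a tiny TSV written to a temp path.
path <- tempfile(fileext = ".tsv")
writeLines(c("id\tname\tvalue",
             "1\talpha\t0.5",
             "2\tbeta\t1.5",
             "3\tgamma\t2.5"), path)

# Parse only the first two data rows to confirm the file reads as expected.
preview <- read.delim(path, nrows = 2, stringsAsFactors = FALSE)
str(preview)

# Also look at the raw text of the first lines; stray quotes or odd
# separators are easier to spot here than in a parsed data frame.
raw_head <- readLines(path, n = 3)
print(raw_head)
```

If the preview has the expected number of columns and sensible types, the full read is more likely to succeed; if not, the raw lines usually show what needs fixing first.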

You are reading the data (which puts it in memory) and then storing it as a data.frame (which makes another copy). Instead, read it directly into a data.frame with
fread("file name", data.table = FALSE)
Also, it wouldn't hurt to run garbage collection.
gc()

From my experience, and in addition to @Oka's answer:
fread() has an nrows= argument, so you can read just the first 10 lines.
If you find that you don't need all rows and/or all columns, you can set a row condition and a list of fields in the brackets right after the call, as in fread()[].
A data.table can be used as a data.frame in many cases, so you can try reading without as.data.frame().
This is how I worked with a 5 GB csv file.
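As a rough illustration of these points, here is a small sketch using data.table. The file and its column names (id, group, value, note) are invented for the example. The select argument is used as one way to restrict columns at read time, which is an addition to what the answer describes.

```r
library(data.table)

# Hypothetical small file standing in for the large one.
path <- tempfile(fileext = ".tsv")
writeLines(c("id\tgroup\tvalue\tnote",
             "1\ta\t10\tx",
             "2\tb\t20\ty",
             "3\ta\t30\tz"), path)

# Peek at the first rows only.
head_dt <- fread(path, sep = "\t", nrows = 2)

# Read only the columns that are needed.
small_dt <- fread(path, sep = "\t", select = c("id", "group", "value"))

# Row condition and field list in brackets right after the call.
# Note that the whole file is still parsed before the filter is applied.
a_only <- fread(path, sep = "\t")[group == "a", .(id, value)]

# A data.table is also a data.frame, so as.data.frame() is often unnecessary.
is.data.frame(a_only)
```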

Related

Reading large numeric TSV file into memory in R

I am trying to read a file representing a numeric matrix with 4.5e5 rows and 2e3 columns. The first line is the header with ncol+1 words, and each row begins with a row name. In txt format it is around 17 GB in size.
I tried using:
read.table(fname, header=TRUE)
but the operation ate all 64 GB of available RAM. I assume it loaded the data into the wrong structure.
People usually discuss speed; is there a way to import it so that it fits in memory? Performance is not a primary concern.
EDIT: I managed to read it with read.table:
colclasses = c("character",rep("numeric",2000))
betas = read.table(beta_fname, header=TRUE, colClasses=colclasses, row.names=1)
But documentation still recommends "scan" for memory usage. What would be the "scan" alternative?
There are several things you might try. Searching for advice on reading large files will likely point you to fread in data.table. read_delim_chunked from readr might also help. You can also break the file into smaller pieces, read each one in and write it out as an RDS file. When that is complete, you might be able to read the RDS files back in and combine them using less space.
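The "smaller pieces, then RDS" idea could be sketched in base R as follows. This is only one possible approach, not a tested recipe for the 17 GB file: the sample file, the chunk size of 2 rows and the output file names are all assumptions made for the example.

```r
# Hypothetical small file standing in for the big matrix:
# a header line, then a row name followed by numeric values on each line.
path <- tempfile(fileext = ".txt")
writeLines(c("probe\ts1\ts2\ts3",
             "r1\t0.1\t0.2\t0.3",
             "r2\t0.4\t0.5\t0.6",
             "r3\t0.7\t0.8\t0.9",
             "r4\t1.0\t1.1\t1.2",
             "r5\t1.3\t1.4\t1.5"), path)

chunk_rows <- 2                      # illustrative; far larger in practice
out_dir <- tempfile("chunks_")
dir.create(out_dir)

con <- file(path, open = "r")
col_names <- strsplit(readLines(con, n = 1), "\t", fixed = TRUE)[[1]]
col_classes <- c("character", rep("numeric", length(col_names) - 1))

i <- 0
repeat {
  lines <- readLines(con, n = chunk_rows)   # next block of raw lines
  if (length(lines) == 0) break             # end of file reached
  i <- i + 1
  chunk <- read.table(text = lines, sep = "\t", header = FALSE,
                      col.names = col_names, colClasses = col_classes,
                      row.names = 1)
  # Store each piece as a compact numeric matrix.
  saveRDS(as.matrix(chunk), file.path(out_dir, sprintf("chunk_%04d.rds", i)))
}
close(con)

# Read the pieces back and stack them into one matrix.
rds_files <- sort(list.files(out_dir, full.names = TRUE))
betas <- do.call(rbind, lapply(rds_files, readRDS))
dim(betas)
```

Only one chunk of parsed text is held in memory at a time during the first pass, and the final object is a plain numeric matrix rather than a data frame.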

data.table v.1.11.0+ no longer freads data file that was fread by v.1.10.4-3

I've encountered a possible bug in the new version of data.table. I have a 2GB .csv file with c. 3 million rows and 67 columns. I can use fread() to read it all fine from data.table v.1.10.4-3, but v.1.11.0+ terminates at a row somewhere down the middle. The base read.csv() also hits the same problem. I really like data.table and want to create a bug report on Github, but obviously I can't upload the 2GB data file anywhere.
I need a way of splicing maybe ~10 rows around the problematic point (the row number is known) in order to create a portable reproducible example. Any ideas how I can do that without reading in the .csv file?
Also, is there a program I can use to open the raw file to look at the problematic point and see what causes the issue? Notepad/Excel won't open a file this big.
EDIT: the verbose output.
EDIT2: this is the problematic line. It shows that what is supposed to be one line is somehow split into 3 lines. I can only assume it is due to an export bug in the ancient software (SAP Business Objects) that was used to create the CSV. It is unsurprising that this causes an issue. However, it is surprising that data.table v.1.10.4-3 was able to handle it in a smart way and read it correctly, whereas v.1.11.0+ could not. Could it have something to do with encoding or hidden technical characters?
EDIT3: proof that this is what really happens.
Thanks for including the output. It shows that fread is issuing a warning. Did you miss this warning before?
Warning message:
In fread("Data/FP17s with TCD in March 2018.csv", na.strings = c("#EMPTY", :
Stopped early on line 138986. Expected 67 fields but found 22. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<916439/0001,Q69,GDS Contract,MR A SYED,916439,Mr,SYED A Mr,A,SYED,58955,3718.00,Nine Mile Ride Dental Practice,Dental Surgery,193 Nine Mile Ride,Finchampstead,WOKINGHAM,RG40 4JD,2181233168.00,TORIN,FASTNEDGE,1 ANCHORITE CLOSE,>>
This is very helpful, surely. It tells you the line number: 138986. It says that this line is 22 fields but it expects 67. Could the warning be better by stating why it is expecting 67 fields at that point (e.g. by saying there are 67 column names and it has seen 67 columns up to that point?) It gives you a hint of what to try (fill=TRUE) which would fill that too-short line with NA in columns 23:67. Then it includes the data from the line, too.
Does it work with fill=TRUE, as the warning message suggests?
You say it worked in 1.10.4-3 but I suspect it's more likely it stopped early there too, but without warning. If so, that was a bug not to warn, now fixed.
Using Powershell on Windows:
Get-Content YourFile.csv | Select -Index (0,19,20,21,22) > OutputFileName.csv
would dump the header and lines 20-23 into a new file.
Use a combination of skip and nrows:
You mentioned that you have no problem reading the file with v.1.10.4-3, right? So use that version to skip most of the .csv and set nrows to the number of rows you want. Once you have that data.table, you can write out that portion of the file, and you have a portable reproducible example.
For example:
DT <- fread("my_file.csv", skip = 138981, nrows = 10)
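If parsing the file is itself the problem, the raw lines around a known line number can also be copied out without parsing anything. The sketch below is a base R take on the same idea as the PowerShell one-liner. The helper name extract_lines and the sample file are hypothetical, and it assumes from is at least 2 (line 1 is the header).

```r
# Copy lines `from`..`to` of a text file, plus the header line, to `out`.
# Lines are read as raw text in blocks, so the file is never parsed as CSV
# and never held in memory as a whole.
extract_lines <- function(path, from, to, out, block = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1)
  current <- 1L                        # number of lines consumed so far
  while (current < from - 1L) {        # skip ahead, discarding the text
    skipped <- readLines(con, n = min(block, from - 1L - current))
    if (length(skipped) == 0) break    # file is shorter than expected
    current <- current + length(skipped)
  }
  wanted <- readLines(con, n = to - from + 1L)
  writeLines(c(header, wanted), out)
  invisible(wanted)
}

# Hypothetical demo file: a header plus 20 data lines.
path <- tempfile(fileext = ".csv")
writeLines(c("id,value", sprintf("%d,%d", 1:20, (1:20) * 10L)), path)

out <- tempfile(fileext = ".csv")
extract_lines(path, from = 10, to = 12, out = out)
readLines(out)
```

The resulting small file keeps the problematic lines byte-for-byte as text, which is usually what a bug report needs.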

Reading large csv file in R

I have a number of csv files of different sizes, all fairly large. Using read.csv to read them into R takes longer than I have had the patience to wait so far (several hours). I managed to read the biggest file (2.6 GB) very quickly (in less than a minute) with data.table's fread.
My problem occurs when I try to read a file of half the size. I get the following error message:
Error in fread("C:/Users/Jesper/OneDrive/UdbudsVagten/BBR/CO11700T.csv",:
Expecting 21 cols, but line 2557 contains text after processing all
cols. It is very likely that this is due to one or more fields having
embedded sep=';' and/or (unescaped) '\n' characters within unbalanced
unescaped quotes.
fread cannot handle such ambiguous cases and those
lines may not have been read in as expected. Please read the section
on quotes in ?fread.
Through research I found suggestions to add quote = "" to the code, but that doesn't help. I also tried the bigmemory package, but R crashes when I do. I'm on a 64-bit system with 8 GB of RAM.
I know there are quite a few threads on this subject, but I haven't been able to solve the problem with any of the solutions. I would really like to use fread (given my good experience with the bigger file), and it seems like there should be some way to make it work - just can't figure it out.
Solved this by installing SlickEdit and using it to edit the lines that caused the trouble. A few characters such as ampersands, quotation marks and apostrophes were consistently encoded to include a semicolon, e.g. &amp; instead of just &. Since the semicolon was the separator in the text document, this caused the problem when reading with fread.
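Lines like these can also be located from within R before editing them. The sketch below uses base R's count.fields on a made-up semicolon-separated file in which one field contains &amp;. It is an alternative way to find the offending lines, not what the author did.

```r
# Hypothetical semicolon-separated file; the third line contains "&amp;",
# whose trailing semicolon acts as an extra separator.
path <- tempfile(fileext = ".csv")
writeLines(c("id;name;city",
             "1;Smith;Aarhus",
             "2;Jensen &amp; Co;Odense",
             "3;Hansen;Aalborg"), path)

# Count the fields on every line, treating quotes as ordinary characters.
n_fields <- count.fields(path, sep = ";", quote = "", comment.char = "")

expected <- n_fields[1]                   # field count of the header line
bad_lines <- which(n_fields != expected)  # line numbers that do not match
bad_lines

# Show the raw text of the offending lines.
if (length(bad_lines) > 0) {
  print(readLines(path, n = max(bad_lines))[bad_lines])
}
```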

Split big data in R

I have a big data file (~1 GB) and I want to split it into smaller ones. I have R at hand and plan to use it.
Loading the whole file into memory is not possible, as I get the "cannot allocate memory for vector of xxx" error message.
So I want to use the read.table() function with the skip and nrows parameters to read in only parts of the file, and then save those out to individual files.
To do this, I'd like to know the number of lines in the big file first, so I can work out how many rows each individual file should get and how many files to split into.
My question is: how can I get the number of lines in the big data file without fully loading it into R?
Suppose I can only use R, so I cannot use any other programming language.
Thank you.
Counting the lines should be pretty easy: check this tutorial http://www.exegetic.biz/blog/2013/11/iterators-in-r/ (the "iterating through lines" part).
The gist is to use ireadLines from the iterators package to open an iterator over the file.
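For Windows, something like this should work:
fname <- "blah.R" # example file
# Run the Windows find command and keep the second line of its output
res <- system(paste("find /v /c \"\"", fname), intern = TRUE)[[2]]
# Extract the number at the end of that line
regmatches(res, gregexpr("[0-9]+$", res))[[1]]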
# [1] "39"
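A platform-independent variant that stays inside base R could read the file in fixed-size blocks and only keep a running count. This is a minimal sketch; the helper name count_lines, the block size and the demo file are assumptions for illustration.

```r
# Count lines by reading fixed-size blocks from an open connection,
# so only one block of text is in memory at any time.
count_lines <- function(path, block = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0
  repeat {
    n <- length(readLines(con, n = block))
    if (n == 0) break
    total <- total + n
  }
  total
}

# Hypothetical demo file with 2500 lines.
path <- tempfile()
writeLines(as.character(1:2500), path)

n_lines <- count_lines(path, block = 1000L)
n_lines

# Number of output files needed at 1000 rows per file.
n_files <- ceiling(n_lines / 1000)
n_files
```

With the line count known, skip and nrows in read.table() can be set per piece as described in the question.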

Large csv file fails to fully read in to R data.frame

I am trying to load a fairly large csv file into R. It has about 50 columns and 2 million rows.
My code is pretty basic, and I have used it to open files before but none this large.
mydata <- read.csv('file.csv', header = FALSE, sep=",", stringsAsFactors = FALSE)
The result is that it reads in the data but stops after 1080000 rows or so. This is roughly where Excel stops as well. Is there a way to get R to read the whole file in? Why is it stopping around halfway?
Update: (11/30/14)
After speaking with the provider of the data, it was discovered that there may have been a corruption issue with the file. A new file was provided, which is also smaller and loads into R easily.
Since read.csv() read up to 1080000 rows, fread from library(data.table) should read the file with ease. If not, there are two other options: try library(h2o), or use fread's select option to read only the required columns (or read the file in two halves, do some cleaning and merge them back together).
You can try using read.table and include the parameter colClasses to specify the type of the individual columns.
With your current code, R first reads all data as strings and then checks for each column whether it can be converted, e.g. to a numeric type. This needs more memory than reading the values as numeric right away. colClasses also lets you skip columns you might not need.
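A small sketch of the colClasses idea, using a made-up four-column file without a header (matching header = FALSE in the question). The column types chosen here are assumptions for the example.

```r
# Hypothetical csv without a header row.
path <- tempfile(fileext = ".csv")
writeLines(c("1,alpha,0.5,2014-11-30",
             "2,beta,1.5,2014-12-01",
             "3,gamma,2.5,2014-12-02"), path)

# One entry per column; the string "NULL" skips that column entirely.
col_classes <- c("integer", "character", "numeric", "NULL")

mydata <- read.csv(path, header = FALSE, colClasses = col_classes)
str(mydata)
```

Here the fourth column is never stored, and the remaining columns are created with their final types instead of being converted after the read.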
