Reading a huge CSV in R

I have a huge CSV file, around 9 GB in size, and 16 GB of RAM. I followed the advice from the page and implemented it below.
If you get the error that R cannot allocate a vector of length x, close R and add the following line to the "Target" field:
--max-vsize=500M
I am still getting the error and warnings below. How can I read this 9 GB file into R? I am using 64-bit R 3.3.1 and running the command below in RStudio 0.99.903, on Windows Server 2012 R2 Standard (64-bit).
> memory.limit()
[1] 16383
> answer=read.csv("C:/Users/a-vs/results_20160291.csv")
Error: cannot allocate vector of size 500.0 Mb
In addition: There were 12 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In scan(file = file, what = what, sep = sep, quote = quote, ... :
Reached total allocation of 16383Mb: see help(memory.size)
(warnings 2 through 12 are identical)
------------------- Update 1
My first try, based on a suggested answer:
> thefile=fread("C:/Users/a-vs/results_20160291.csv", header = T)
Read 44099243 rows and 36 (of 36) columns from 9.399 GB file in 00:13:34
Warning messages:
1: In fread("C:/Users/a-vsingh/results_tendo_20160201_20160215.csv", :
Reached total allocation of 16383Mb: see help(memory.size)
2: In fread("C:/Users/a-vsingh/results_tendo_20160201_20160215.csv", :
Reached total allocation of 16383Mb: see help(memory.size)
------------------- Update 2
My second try, based on another suggested answer:
thefile2 <- read.csv.ffdf(file="C:/Users/a-vs/results_20160291.csv", header=TRUE, VERBOSE=TRUE,
+ first.rows=-1, next.rows=50000, colClasses=NA)
read.table.ffdf 1..
Error: cannot allocate vector of size 125.0 Mb
In addition: There were 14 warnings (use warnings() to see them)
How can I read this file into a single object so that I can analyze the entire dataset in one go?
------------------- Update 3
We bought an expensive machine with 10 cores and 256 GB of RAM. It is not the most efficient solution, but it will work, at least for the near future. I looked at the answers below and I don't think they solve my problem :( I appreciate them nonetheless. I want to perform market basket analysis, and I don't think there is any way around keeping my data in RAM.

Make sure you're using 64-bit R, not just 64-bit Windows, so that you can increase your RAM allocation to all 16 GB.
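A quick way to confirm which build is actually running (not from the thread, but a standard check): the pointer size is 8 bytes on 64-bit R and 4 on 32-bit R.

```r
# 8 on 64-bit R, 4 on 32-bit R
.Machine$sizeof.pointer
```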
In addition, you can read in the file in chunks:
file_in <- file("in.csv","r")
chunk_size <- 100000 # choose the best size for you
x <- readLines(file_in, n=chunk_size)
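The snippet above reads only a single chunk; a full pass loops until readLines returns nothing. A minimal sketch (process_chunk is a hypothetical placeholder for your own aggregation code):

```r
file_in <- file("in.csv", "r")
chunk_size <- 100000
header <- readLines(file_in, n = 1)        # keep the header line separately
repeat {
  x <- readLines(file_in, n = chunk_size)
  if (length(x) == 0) break                # end of file reached
  chunk <- read.csv(text = c(header, x), stringsAsFactors = FALSE)
  # process_chunk(chunk)                   # hypothetical: aggregate/save results here
}
close(file_in)
```

Only one chunk is ever held in memory at a time, at the cost of re-parsing the header for each chunk.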
You can use data.table to handle reading and manipulating large files more efficiently:
require(data.table)
x <- fread("in.csv", header = TRUE)
If needed, you can leverage storage memory with ff:
library("ff")
x <- read.csv.ffdf(file="file.csv", header=TRUE, VERBOSE=TRUE,
first.rows=10000, next.rows=50000, colClasses=NA)

You might want to consider leveraging some on-disk processing and not have that entire object in R's memory. One option would be to store the data in a proper database then have R access that. dplyr is able to deal with a remote source (it actually writes the SQL statements to query the database). I've just tested this with a small example (a mere 17,500 rows), but hopefully it scales up to your requirements.
Install SQLite
https://www.sqlite.org/download.html
Enter the data into a new SQLite database
Save the following in a new file named import.sql
CREATE TABLE tableName (COL1, COL2, COL3, COL4);
.separator ,
.import YOURDATA.csv tableName
Yes, you'll need to specify the column names yourself (I believe), but you can also specify their types here if you wish. A bare `.separator ,` won't cope with commas inside quoted fields; sqlite3's `.mode csv` handles that quoting if you need it.
Import the data into the SQLite database via the command line
sqlite3.exe BIGDATA.sqlite3 < import.sql
Point dplyr to the SQLite database
As we're using SQLite, all of the dependencies are handled by dplyr already.
library(dplyr)
my_db <- src_sqlite("/PATH/TO/YOUR/DB/BIGDATA.sqlite3", create = FALSE)
my_tbl <- tbl(my_db, "tableName")
Do your exploratory analysis
dplyr will write the SQLite commands needed to query this data source. It will otherwise behave like a local table. The big exception will be that you can't query the number of rows.
my_tbl %>% group_by(COL2) %>% summarise(meanVal = mean(COL3))
#> Source: query [?? x 2]
#> Database: sqlite 3.8.6 [/PATH/TO/YOUR/DB/BIGDATA.sqlite3]
#>
#> COL2 meanVal
#> <chr> <dbl>
#> 1 1979 15.26476
#> 2 1980 16.09677
#> 3 1981 15.83936
#> 4 1982 14.47380
#> 5 1983 15.36479
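Although nrow() won't work on a remote table, a row count can still be computed inside the database itself; a sketch against the same hypothetical table:

```r
library(dplyr)
# n() is translated to SQL COUNT(*); collect() pulls the one-row result into R
my_tbl %>% summarise(n = n()) %>% collect()
```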

This may not be possible on your computer. In certain cases, data.table takes up more space than its .csv counterpart.
DT <- data.table(x = sample(1:2, 10000000, replace = TRUE))
write.csv(DT, "test.csv", row.names = FALSE) # 29 MB file
DT <- fread("test.csv") # note: fread has no row.names argument
object.size(DT)
> 40001072 bytes # 40 MB
Two orders of magnitude larger:
DT <- data.table(x = sample(1:2, 1000000000, replace = TRUE))
write.csv(DT, "test.csv", row.names = FALSE) # 2.92 GB file
DT <- fread("test.csv")
object.size(DT)
> 4000001072 bytes # 4.00 GB
There is natural overhead to storing an object in R. Based on these numbers, there is roughly a 1.33x factor (R : csv) when reading files; however, this varies with the data. For example:
x = sample(1:10000000, 10000000, replace = TRUE) gives a factor of roughly 2x (R : csv).
x = sample(c("foofoofoo","barbarbar"), 10000000, replace = TRUE) gives a factor of 0.5x (R : csv).
Based on the maximum, your 9 GB file could take 18 GB or more of memory to store in R. Judging from your error message, it is far more likely that you are hitting hard memory constraints than an allocation issue. Therefore, just reading your file in chunks and consolidating would not work; you would also need to partition your analysis and workflow. Another alternative is to use an on-disk tool such as a SQL database.
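One cheap way to check this before committing to a full read is to load a small sample and extrapolate its per-row footprint (a sketch; the row total is taken from the fread log in the question, and the file path is hypothetical):

```r
library(data.table)
sample_dt <- fread("results_20160291.csv", nrows = 100000)   # small sample only
bytes_per_row <- as.numeric(object.size(sample_dt)) / nrow(sample_dt)
total_rows <- 44099243                                       # from the fread log above
est_gb <- bytes_per_row * total_rows / 1024^3                # rough in-memory estimate
est_gb
```

This is only an estimate: column types inferred from the sample may differ from those in the full file.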

This would be horrible practice, but depending on how you need to process the data, it shouldn't be too bad. You can change the maximum memory R is allowed to use by calling memory.limit(new), where new is an integer giving R's new memory limit in MB. When you hit the hardware constraint, Windows will start paging memory onto the hard drive (not the worst thing in the world, but it will severely slow down your processing).
If you are running this on a server version of Windows, paging will possibly (likely) work differently than on regular Windows 10. I believe it should be faster, as the server OS should be optimized for this sort of thing.
Try starting off with something along the lines of 32 GB (or memory.limit(memory.limit()*2)), and if the usage comes out MUCH larger than that, I would say the program will end up being too slow once the data is loaded into memory. At that point I would recommend buying more RAM or finding a way to process the data in parts.
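In code, the suggestion amounts to something like this (Windows-only; memory.limit is not available on other platforms):

```r
memory.limit()               # current cap in MB
memory.limit(size = 32768)   # raise the cap to ~32 GB; usage beyond physical RAM pages to disk
```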

You could try splitting your processing over the table. Instead of operating on the whole thing, put the whole operation inside a for loop and do it 16, 32, 64, or however many times you need to. Any values you need for later computation can be saved. This isn't as fast as the other approaches, but it will definitely return.
x <- number_of_rows_in_file / CHUNK_SIZE
con <- file("in.csv", "r")
for (i in seq(from = 1, to = x, by = 1)) {
  chunk <- read.csv(con, nrows = CHUNK_SIZE, header = (i == 1))
  # process chunk here and keep only what you need
}
close(con)
Hope that helps.

Related

Character string limit reached early using fread from data.table, CSV lines look fine

I'm reading a moderately big CSV with fread but it errors out with "R character strings are limited to 2^31-1 bytes". readLines works fine, however. Pinning down the faulty line (2956), I am not sure what's going on: it doesn't seem longer than the next one, for example.
Any thoughts about what is going on and how to overcome this, preferably checking the CSV in advance to avoid an fread error?
library(data.table)
options(timeout=10000)
download.file("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-03.csv",
destfile = "trip_data.csv", mode = "wb")
dt = fread("trip_data.csv")
#> Error in fread("trip_data.csv"): R character strings are limited to 2^31-1 bytes
lines = readLines("trip_data.csv")
dt2955 = fread("trip_data.csv", nrows = 2955)
#> Warning in fread("trip_data.csv", nrows = 2955): Previous fread() session was
#> not cleaned up properly. Cleaned up ok at the beginning of this fread() call.
dt2956 = fread("trip_data.csv", nrows = 2956)
#> Error in fread("trip_data.csv", nrows = 2956): R character strings are limited to 2^31-1 bytes
lines[2955]
#> [1] "CMT,2010-03-07 18:37:05,2010-03-07 18:41:51,1,1,-73.984211000000002,40.743720000000003,1,0,-73.974515999999994,40.748331,Cre,4.9000000000000004,0,0.5,1.0800000000000001,0,6.4800000000000004"
lines[2956]
#> [1] "CMT,2010-03-07 22:59:01,2010-03-07 23:01:04,1,0.59999999999999998,-73.992887999999994,40.703017000000003,1,0,-73.992887999999994,40.703017000000003,Cre,3.7000000000000002,0.5,0.5,2,0,6.7000000000000002"
Created on 2022-02-12 by the reprex package (v2.0.1)
When trying to read part of the file (around 100k rows) I got:
Warning message:
In fread("trip_data.csv", skip = i * 500, nrows = 500) :
Stopped early on line 2958. Expected 18 fields but found 19. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<CMT,2010-03-07 03:46:42,2010-03-07 03:58:31,1,3.6000000000000001,-73.961027000000001,40.796674000000003,1,,,-73.937324000000004,40.839283000000002,Cas,10.9,0.5,0.5,0,0,11.9>>
>
After removing that line I was able to read at least 100k rows:
da <- data.table(check.names = FALSE)
for (i in 0:200) {
  print(i * 500)
  dt <- fread("trip_data.csv", skip = i * 500, nrows = 500, fill = TRUE)
  da <- rbind(da, dt, use.names = FALSE)
}
str(da)
Classes ‘data.table’ and 'data.frame': 101000 obs. of 18 variables:
$ vendor_id : chr "" "CMT" "CMT" "CMT" ...
$ pickup_datetime : POSIXct, format: NA "2010-03-22 17:05:03" "2010-03-22 19:24:29" ...
$ dropoff_datetime : POSIXct, format: NA "2010-03-22 17:22:51" "2010-03-22 19:40:13" ...
$ passenger_count : int NA 1 1 1 3 1 1 1 1 1 ...
[...]
Then you can read it line by line, checking length of the list, and binding it to data table.
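The line-by-line idea can be sketched roughly as follows (not from the thread; it keeps only rows whose field count matches the header, which is slow for millions of rows but memory-predictable):

```r
library(data.table)
con <- file("trip_data.csv", "r")
header <- strsplit(readLines(con, n = 1), ",", fixed = TRUE)[[1]]
good <- list()
while (length(line <- readLines(con, n = 1)) > 0) {
  fields <- strsplit(line, ",", fixed = TRUE)[[1]]
  if (length(fields) == length(header)) {    # drop malformed rows
    good[[length(good) + 1]] <- as.list(fields)
  }
}
close(con)
da <- rbindlist(good)
setnames(da, header)
```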
Regards,
Grzegorz

Making Demogdata in R

I am new to R and I am trying to fit mortality data from the HMD. However, every time I make a correction, I get an error message about read.demogdata, or more specifically about creating demogdata. I cannot create my own demogdata. Here is my code, but I cannot move forward:
library(demography)
library(forecast)
aus<-read.demogdata("AUS.Mx_1x1.txt", type="mortality", label="AUS", skip=2)
tra.data<-window(aus, ages=0:16, years=1921:1986)
test.data<-window(aus, ages=0:16, years=1987:2016)
lcmodel<-lca(tra.data)
Error in lca(tra.data) : Not demography data
Where did I make a mistake? How can I overcome this?
Many thanks in advance.
Also this:
aus2<-read.demogdata("Mx_1x1.txt", "Exposures_1x1.txt",type="mortality",label="AUS")
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 3884 did not have 3 elements
In matrix(tmp1[, i + 2], nrow = m, ncol = n) :

Read in large text file in chunks

I'm working with limited RAM (AWS free-tier EC2 server, 1 GB).
I have a relatively large txt file, "vectors.txt" (800 MB), that I'm trying to read into R. Having tried various methods, I have failed to read this file into memory.
So I was researching ways of reading it in chunks. I know that the dim of the resulting data frame should be 300K x 300. If I were able to read the file in, say, 10K lines at a time and then save each chunk as an RDS file, I would be able to loop over the results and get what I need, albeit just a little slower and with less convenience than having the whole thing in memory.
To reproduce:
# Get data
url <- 'https://github.com/eyaler/word2vec-slim/blob/master/GoogleNews-vectors-negative300-SLIM.bin.gz?raw=true'
file <- "GoogleNews-vectors-negative300-SLIM.bin.gz"
download.file(url, file) # takes a few minutes
R.utils::gunzip(file)
# word2vec r library
library(rword2vec)
w2v_gnews <- "GoogleNews-vectors-negative300-SLIM.bin"
bin_to_txt(w2v_gnews,"vector.txt")
So far so good. Here's where I struggle:
word_vectors = as.data.frame(read.table("vector.txt",skip = 1, nrows = 10))
Returns a "cannot allocate a vector of size [size]" error message.
Tried alternatives:
word_vectors <- ff::read.table.ffdf(file = "vector.txt", header = TRUE)
Same, not enough memory
word_vectors <- readr::read_tsv_chunked("vector.txt",
callback = function(x, i) saveRDS(x, i),
chunk_size = 10000)
Resulted in:
Parsed with column specification:
cols(
`299567 300` = col_character()
)
|=========================================================================================| 100% 817 MB
Error in read_tokens_chunked_(data, callback, chunk_size, tokenizer, col_specs, :
Evaluation error: bad 'file' argument.
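The "bad 'file' argument" likely comes from the callback itself: read_tsv_chunked passes the chunk's starting position as the second argument, an integer, and saveRDS expects a file path there. A hedged fix is to build a filename from that position (a sketch, untested against this particular file):

```r
library(readr)
read_tsv_chunked("vector.txt",
                 callback = function(x, pos) {
                   # pos is the chunk's starting row, not a path; build a path from it
                   saveRDS(x, file = sprintf("chunk_%09d.rds", pos))
                 },
                 chunk_size = 10000)
```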
Is there any other way to turn vectors.txt into a data frame? Maybe by breaking it into pieces and reading in each piece, saving as a data frame and then to rds? Or any other alternatives?
EDIT:
From Jonathan's answer below, tried:
library(rword2vec)
library(RSQLite)
# Download pre trained Google News word2vec model (Slimmed down version)
# https://github.com/eyaler/word2vec-slim
url <- 'https://github.com/eyaler/word2vec-slim/blob/master/GoogleNews-vectors-negative300-SLIM.bin.gz?raw=true'
file <- "GoogleNews-vectors-negative300-SLIM.bin.gz"
download.file(url, file) # takes a few minutes
R.utils::gunzip(file)
w2v_gnews <- "GoogleNews-vectors-negative300-SLIM.bin"
bin_to_txt(w2v_gnews,"vector.txt")
# from https://privefl.github.io/bigreadr/articles/csv2sqlite.html
csv2sqlite <- function(tsv,
every_nlines,
table_name,
dbname = sub("\\.txt$", ".sqlite", tsv),
...) {
# Prepare reading
con <- RSQLite::dbConnect(RSQLite::SQLite(), dbname)
init <- TRUE
fill_sqlite <- function(df) {
if (init) {
RSQLite::dbCreateTable(con, table_name, df)
init <<- FALSE
}
RSQLite::dbAppendTable(con, table_name, df)
NULL
}
# Read and fill by parts
bigreadr::big_fread1(tsv, every_nlines,
.transform = fill_sqlite,
.combine = unlist,
... = ...)
# Returns
con
}
vectors_data <- csv2sqlite("vector.txt", every_nlines = 1e6, table_name = "vectors")
Resulted in:
Splitting: 12.4 seconds.
Error: nThread >= 1L is not TRUE
Another option would be to do the processing on-disk, e.g. using an SQLite file and dplyr's database functionality. Here's one option: https://stackoverflow.com/a/38651229/4168169
To get the CSV into SQLite you can also use the bigreadr package which has an article on doing just this: https://privefl.github.io/bigreadr/articles/csv2sqlite.html

R freezes and my computer too

I don't know whether my question is an easy one to answer, but let's ask it. I'm using R for corpus linguistics and I want to build a concordance from a regular expression, using "exact.matches" (cf. St. Th. Gries). The problem is that when I run the script, R freezes for a long time and my computer freezes too, so I have to restart everything with the power button.
What I want to analyse is a collection of 100 texts (in txt format). The whole bundle is 17,254,537 tokens in total, but I have also tried running the code on 20 files at a time. Same problem: everything freezes. Here is the code:
rm(list=ls(all=T))
setwd("C:/Users/Christophe/Documents/Doctorat_ULg/Corpora/Dutch/Gutenberg_corpus_NL")
source("C:/_qclwr/_scripts/_scripts_code-exerciseboxes_chapters_3-5/exact_matches_new.R")
corpus.files.1<-choose.files() # to load the first 58 text files
corpus.files.2<-choose.files() # to load the 42 other files
whole.corpus.file<-c(corpus.files.1, corpus.files.2) # to concatenate everything into one vector
all.matches.verbs<-vector()
for (i in whole.corpus.file) {
  current.corpus.file <- scan(i, what = "char", sep = "\n", quiet = T)
  current.matches.verbs <- exact.matches("aan<prep>", current.corpus.file, case.sens = F, pcre = T)
  if (length(current.matches.verbs) == 0) { next }
  all.matches.verbs <- append(all.matches.verbs, current.matches.verbs)
}
Is there an easy way to solve this problem? It seems it is a problem of memory. I typed the following, if it can help:
> memory.size()
[1] 35.02
> memory.limit()
[1] 3976
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells   558406 29.9     818163 43.7   741108 39.6
Vcells  1039743  8.0    1757946 13.5  1300290 10.0
I thank you in advance for your precious help.
Best,
CBechet.
There is an alternative to the for-loop:
rm(list=ls(all=T))
setwd("C:/Users/Christophe/Documents/Doctorat_ULg/Corpora/Dutch/Gutenberg_corpus_NL")
source("C:/_qclwr/_scripts/_scripts_code-exerciseboxes_chapters_3-5/exact_matches_new.R")
corpus.files.1<-choose.files() # loads the first set of corpus files
corpus.files.2<-choose.files() # loads the second set of corpus files
whole.corpus.file<-c(corpus.files.1, corpus.files.2) # concatenate all the corpus files into one vector
whole.text <-unlist(lapply(whole.corpus.file, function(x) scan(x, what="char", sep="\n", quiet=T))) # reads the content of the files in the vector
And the data is still too big (and I'm not using a for-loop):
Error: cannot allocate vector of size 4.3 Mb
In addition: Warning messages:
1: In substr(lines, if (characters.around != 0) starts - characters.around else 1, :
Reached total allocation of 3976Mb: see help(memory.size)
(warnings 2 through 4 are identical)
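A lower-memory variant of the same idea extracts the matches per file and discards each file's text as soon as it has been scanned, instead of holding the whole corpus in one vector (a sketch; exact.matches is the function sourced above):

```r
all.matches.verbs <- unlist(lapply(whole.corpus.file, function(f) {
  txt <- scan(f, what = "char", sep = "\n", quiet = TRUE)  # one file at a time
  exact.matches("aan<prep>", txt, case.sens = FALSE, pcre = TRUE)
}))
```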

fread skip and autostart issue

I have the following code:
raw_test <- fread("avito_test.tsv", nrows = intNrows, skip = intSkip)
Which produces the following error:
Error in fread("avito_test.tsv", nrows = intNrows, skip = intSkip, autostart = (intSkip + :
Expected sep (',') but new line, EOF (or other non printing character) ends field 14 on line 1003 when detecting types: 10066652 ТранÑпорт Ðвтомобили Ñ Ð¿Ñ€Ð¾Ð±ÐµÐ³Ð¾Ð¼ Nissan R Nessa, 1998 Ð¢Ð°Ñ€Ð°Ð½Ñ‚Ð°Ñ Ð² отличном ÑоÑтоÑнии. на прошлой неделе возили на тех. ОбÑлуживание. Ð’ дорожных неприÑтноÑÑ‚ÑÑ… не был учаÑтником. Детали кузова без коцок и терок. ПредназначалаÑÑŒ Ð´Ð»Ñ Ð¿Ð¾ÐµÐ·Ð´Ð¾Ðº на природу, Отдам только в добрые руки. Ð’ Ñалон не поÑтавлю не звоните "{""Марка"":""Nissan"", ""Модель"":""R Nessa"", ""Год выпуÑка"":""1998"", ""Пробег"":""180 000 - 189 999"", ""Тип кузова"":""МинивÑн"", ""Цвет"":""Оранжевый"", ""Объём двигателÑ"":""2.4"", ""Коробка передач"":""МеханичеÑкаÑ
I have tried changing it to this:
raw_test <- fread("avito_test.tsv", nrows = intNrows, skip = intSkip, autostart = (intSkip + 2))
Which is based on what I read on a similar question skip and autostart in fread
However, it produces a similar error as above.
How can I skip the first 1000 rows, and read the next thousand? My expected output is 1000 rows total, skipping the first thousand from my CSV file, and reading the second thousand.
Note: Reading the file with raw_test <- fread("avito_test.tsv", nrows = 1000, skip = -1) works well for getting me only the first thousand, but I am trying to get only the second thousand.
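One possible workaround (an assumption, not from the thread): when skip lands mid-file, fread can no longer see the header, so read the column names once and reapply them. This also assumes no embedded newlines inside quoted fields, which the error above suggests may not hold for this file:

```r
library(data.table)
hdr <- names(fread("avito_test.tsv", nrows = 0))    # header only, no data rows
second_k <- fread("avito_test.tsv", skip = 1001, nrows = 1000,
                  header = FALSE, col.names = hdr)  # data rows 1001-2000
```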
Edit: The data is publicly available at http://www.kaggle.com/c/avito-prohibited-content/data
Edit: Environment and package info:
> packageVersion("data.table")
[1] ‘1.9.3’
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)