How can I get the current percent CPU usage in R? Ideally, it would work for both Unix and Windows platforms.
On the Windows platform, I used the following code:
a <- system("wmic cpu get loadpercentage", intern = TRUE)
as.numeric(gsub("\\D", "", a[2]))
Is there a better way (or a function in a package) to get the current CPU usage that works on both Unix and Windows platforms?
Following how to get current cpu and ram usage in python? and using the "reticulate" package:
library(reticulate)
aa<-reticulate::import("psutil")
aa$cpu_percent()
The function returns the current CPU usage as a percentage, as shown below (6% in use at the time).
But this approach requires Python to be installed on the machine.
The question Is there an R function to retrieve CPU and RAM information? asks for hardware information (not the current CPU usage percentage), as follows; this is not even close to my question:
> system("lscpu | grep 'Model name:'")
Model name: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
> system("lsmem | grep 'Total online memory'")
Total online memory: 16G
> library(benchmarkme)
> get_cpu()
$vendor_id
[1] "GenuineIntel"
$model_name
[1] "Intel(R) Core(TM) i5-7400 CPU # 3.00GHz"
$no_of_cores
[1] 4
> get_ram()
34.3 GB
So the answer to that question, the two functions get_ram() and get_cpu(), returns total available RAM and CPU information, not the current percentage of RAM and CPU in use. That is, get_ram() returns 32 GB, not the 6 percent that is currently used.
I think the accepted answer to R: how to check how many cores/CPU usage available does not calculate the current CPU usage percentage.
On Windows (from the accepted answer to R: how to check how many cores/CPU usage available):
a <- system("wmic path Win32_PerfFormattedData_PerfProc_Process get Name,PercentProcessorTime", intern = TRUE)
df <- do.call(rbind, lapply(strsplit(a, " "), function(x) {x <- x[x != ""];data.frame(process = x[1], cpu = x[2])}))
df[grepl("Rgui|rstudio", df$process),]
# process cpu
# 105 Rgui 0
# 108 rstudio 0
And the data.frame 'df' is:
I cannot find any way to calculate the current CPU usage percentage based on that answer. Perhaps I misunderstood something, so if it can be done based on that answer, R: how to check how many cores/CPU usage available, please show me how in a comment.
I tried to extract the current CPU usage percentage based on R: how to check how many cores/CPU usage available. When I look at the result of
df <- do.call(rbind, lapply(strsplit(a, " "), function(x) {x <- x[x != ""];data.frame(process = x[1], cpu = x[2])}))
I find two rows, Idle and _Total as follows:
df1<-df %>% filter(process %in% c("Idle","_Total"))
df1
So 1 - Idle/_Total should be the current CPU usage percentage. I calculated it as follows:
library(dplyr)

for (i in 1:1000) {
  a <- system("wmic path Win32_PerfFormattedData_PerfProc_Process get Name,PercentProcessorTime", intern = TRUE)
  df1 <- do.call(rbind, lapply(strsplit(a, " "), function(x) {
    x <- x[x != ""]
    data.frame(process = x[1], cpu = x[2])
  }))
  df1 <- df1 %>% filter(process %in% c("Idle", "_Total"))
  df1 <- df1 %>% mutate(cpu = as.numeric(cpu))
  Idle <- df1 %>% filter(process == "Idle")
  Total <- df1 %>% filter(process == "_Total")
  message(1 - Idle$cpu / Total$cpu)
}
and the result (a stream of printed values) makes no sense!
When I look at the Python approach (and an answer should work like it), it easily calculates the current CPU usage:
First, install the psutil module with PowerShell:
pip install psutil
and then use it in R as follows:
> library(reticulate)
> aa<-reticulate::import("psutil")
> aa$cpu_percent()
[1] 9.2
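For what it's worth, the same idea can be sketched in plain R by shelling out to OS tools. This is only a rough sketch under stated assumptions: it relies on wmic being available on Windows and /proc/stat on Linux, and it does not cover macOS.
# Minimal cross-platform sketch (assumptions: 'wmic' exists on Windows,
# '/proc/stat' exists on Linux; macOS is not handled).
current_cpu_percent <- function(interval = 1) {
  if (.Platform$OS.type == "windows") {
    out <- system("wmic cpu get loadpercentage", intern = TRUE)
    as.numeric(gsub("\\D", "", out[2]))
  } else {
    # On Linux, CPU usage is the busy share of the time counters in /proc/stat,
    # measured between two snapshots taken 'interval' seconds apart.
    read_stat <- function() {
      as.numeric(strsplit(readLines("/proc/stat", n = 1), "\\s+")[[1]][-1])
    }
    s1 <- read_stat(); Sys.sleep(interval); s2 <- read_stat()
    d <- s2 - s1
    idle <- d[4] + d[5]               # idle + iowait columns
    100 * (sum(d) - idle) / sum(d)
  }
}

current_cpu_percent()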
I've written a function to download multiple files from NOAA's database. First, I have sites, a list of site IDs that I want to download from the website. It looks like this:
> head(sites)
[[1]]
[1] "9212"
[[2]]
[1] "10158"
[[3]]
[1] "11098"
> length(sites)
[1] 2504
My function is shown below.
library(httr)

tested <- lapply(seq_along(sites), function(x) {
  no <- sites[[x]]
  data <- GET(paste0('https://www.ncdc.noaa.gov/paleo-search/data/search.json?xmlId=', no))
  v <- content(data)
  check <- GET(v$statusUrl)
  j <- content(check)
  URL <- j$archive
  download.file(URL, destfile = paste0('./tree_ring/', no, '.zip'))
})
The weird issue is that it works for the first three sites (downloads them properly), but then it stops and throws the following error:
Error in charToRaw(URL) : argument must be a character vector of length 1
I've tried manually downloading the 4th and 5th sites (using the same code as above, but not within the function) and it works fine. What could be going on here?
EDIT 1: Showing more site IDs as requested
> dput(sites[1:6])
list("9212", "10158", "11098", "15757", "15777", "15781")
I converted your code to a for loop so I could see the most recent values of all your variables when things fail.
The failures aren't consistently on the 4th site. Running your code a few times, it sometimes fails on the 2nd, 3rd, or 4th. When it fails, if I look at j, I see this:
$message
[1] "finalizing archive"
$status
[1] "working"
If I re-run check=GET(v$statusUrl); j<-content(check) a few seconds later, then I see
$archive
[1] "https://www.ncdc.noaa.gov/web-content/paleo/bundle/1986420067_2020-04-23.zip"
$status
[1] "complete"
So, I think it takes the server a little bit of time to prepare the file for download, and sometimes R asks for the file before it's ready, which causes an error. A simple fix might look like this:
check_status <- function(v) {
  check <- GET(v$statusUrl)
  content(check)
}

for (x in seq_along(sites)) {
  no <- sites[[x]]
  data <- GET(paste0('https://www.ncdc.noaa.gov/paleo-search/data/search.json?xmlId=', no))
  v <- content(data)
  try_counter <- 0
  j <- check_status(v)
  while (j$status != "complete" & try_counter < 100) {
    Sys.sleep(0.1)
    j <- check_status(v)
    try_counter <- try_counter + 1   # count attempts so the loop can give up
  }
  URL <- j$archive
  download.file(URL, destfile = paste0(no, '.zip'))
}
If the status isn't ready, this version will wait 0.1 seconds before checking again, up to 10 seconds.
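If you also want the loop to survive a site whose archive never becomes ready, a hedged variation of the same loop wraps each site in tryCatch() and records failures instead of stopping; failed_sites is an illustrative name, not something from the original code.
failed_sites <- character(0)
for (x in seq_along(sites)) {
  no <- sites[[x]]
  result <- tryCatch({
    data <- GET(paste0('https://www.ncdc.noaa.gov/paleo-search/data/search.json?xmlId=', no))
    v <- content(data)
    try_counter <- 0
    j <- check_status(v)
    while (j$status != "complete" & try_counter < 100) {
      Sys.sleep(0.1)
      j <- check_status(v)
      try_counter <- try_counter + 1
    }
    if (j$status != "complete") stop("archive never became ready")
    download.file(j$archive, destfile = paste0(no, '.zip'))
    "ok"
  }, error = function(e) conditionMessage(e))
  if (!identical(result, "ok")) failed_sites <- c(failed_sites, no)  # record and move on
}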
My memory.limit() is 3583. I have a 64-bit machine with 8 GB of RAM at home, and I remotely accessed my computer in the office and found it also has 8 GB of RAM. So I can't run the R code below successfully. Should I reset the memory limit? Some consider that a dangerous approach, so could anyone tell me how to solve this problem? Thanks in advance!
loop<-1000;T<-45
bbb<-list()
for(i in 1:loop)
{
bbb[[i]]<-list()
bbb[[i]][[1]]<-matrix(rep(1,loop*(T-1)),loop,T-1)
bbb[[i]][[2]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[3]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[4]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[5]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[6]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[7]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[8]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[9]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[10]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[11]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[12]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[13]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[14]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[15]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[16]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[17]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[18]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[19]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[20]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[21]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[22]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[23]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[24]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[25]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[26]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[27]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[28]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[29]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[30]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[31]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[32]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[33]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
}
I suppose it depends on what you're doing with the matrix list, but maybe you could break your task into smaller chunks? Or you can try using lapply, which runs much faster on my machine but ultimately creates an object of exactly the same size. I think lapply has some memory-saving advantages when repeating data.
If this doesn't work, try looking into the Matrix package and sparse matrices (see the sketch after the example below).
create_bbb <- function(loop = 1000, T = 45){
inner.list <- lapply(1:33, FUN = function(x){
if(x == 1) fill <- 1
else fill <- 0
return(matrix(rep(fill, loop * (T-1)), loop, T-1))
})
bbb <- lapply(1:loop, function(.) inner.list)
return(bbb)
}
bbb_test <- create_bbb()
# Check
all.equal(bbb, bbb_test)
# TRUE
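As a minimal sketch of the sparse-matrix idea mentioned above (assuming the 32 zero matrices stay mostly zero in your later processing), the Matrix package stores only non-zero entries, so the initial structure is far smaller:
library(Matrix)

create_bbb_sparse <- function(loop = 1000, T = 45){
  inner.list <- lapply(1:33, FUN = function(x){
    if(x == 1) matrix(1, loop, T - 1)             # the all-ones matrix stays dense
    else Matrix(0, loop, T - 1, sparse = TRUE)    # all-zero matrices stored sparsely
  })
  lapply(1:loop, function(.) inner.list)
}

bbb_sparse <- create_bbb_sparse()
format(object.size(bbb_sparse), units = "Mb")     # much smaller than the dense list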
I'm trying to create a word cloud from a 300 MB .csv file of text, but it's taking hours on a decent laptop with 16 GB of RAM. I'm not sure how long this should typically take... but here's my code:
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
dfTemplate <- read.csv("CleanedDescMay.csv", header=TRUE, stringsAsFactors = FALSE)
template <- dfTemplate
template <- Corpus(VectorSource(template))
template <- tm_map(template, removeWords, stopwords("english"))
template <- tm_map(template, stripWhitespace)
template <- tm_map(template, removePunctuation)
dtm <- TermDocumentMatrix(template)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word = names(v), freq=v)
head(d, 10)
par(bg="grey30")
png(file="WordCloudDesc1.png", width=1000, height=700, bg="grey30")
wordcloud(d$word, d$freq, col=terrain.colors(length(d$word), alpha=0.9), random.order=FALSE, rot.per = 0.3, max.words=500)
title(main = "Top Template Words", font.main=1, col.main="cornsilk3", cex.main=1.5)
dev.off()
Any advice is appreciated!
Step 1: Profile
Have you tried profiling your full workflow yet with a small subset to figure out which steps are taking the most time? See Profiling with RStudio.
If not, that should be your first step.
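If you prefer to stay in base R rather than the RStudio profiler, a rough sketch with Rprof() on a hypothetical 1,000-row subset (reusing the object names from your post) would look like this:
dfSmall <- dfTemplate[1:1000, , drop = FALSE]   # small subset keeps the profile quick

Rprof("profile.out")
template <- Corpus(VectorSource(dfSmall))
template <- tm_map(template, removeWords, stopwords("english"))
template <- tm_map(template, stripWhitespace)
template <- tm_map(template, removePunctuation)
dtm <- TermDocumentMatrix(template)
Rprof(NULL)

summaryRprof("profile.out")$by.self             # which calls ate the time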
If the tm_map() functions are taking a long time:
If I recall correctly, I found working with stringi to be faster than the dedicated corpus tools.
My workflow wound up looking like the following for the pre-cleaning steps. This could definitely be optimized further -- magrittr pipes (%>%) do add some processing time, but I feel that's an acceptable trade-off for the sanity of not having dozens of nested parentheses.
library(data.table)
library(stringi)
library(parallel)
library(magrittr)   # provides the %>% pipe used below

## This function handles the processing pipeline
textCleaner <- function(InputText, StopWords, Words, NewWords){
  InputText %>%
    stri_enc_toascii(.) %>%
    toupper(.) %>%
    stri_replace_all_regex(.,"[[:cntrl:]]"," ") %>%
    stri_replace_all_regex(.,"[[:punct:]]"," ") %>%
    stri_replace_all_regex(.,"[[:space:]]+"," ") %>% ## Replace multiple spaces with a single space
    stri_replace_all_regex(.,"^[[:space:]]+|[[:space:]]+$","") %>% ## Remove leading and trailing spaces
    stri_replace_all_regex(.,"\\b"%s+%StopWords%s+%"\\b","",vectorize_all = FALSE) %>% ## Stopwords
    stri_replace_all_regex(.,"\\b"%s+%Words%s+%"\\b",NewWords,vectorize_all = FALSE) ## Replacements
}
## Replacement Words, I would normally read in a .CSV file
Replace <- data.table(Old = c("LOREM","IPSUM","DOLOR","SIT"),
New = c("I","DONT","KNOW","LATIN"))
## These need to be defined globally
GlobalStopWords <- c("AT","UT","IN","ET","A")
GlobalOldWords <- Replace[["Old"]]
GlobalNewWords <- Replace[["New"]]
## Generate some sample text
DT <- data.table(Text = stringi::stri_rand_lipsum(500000))
## Running Single Threaded
system.time({
DT[,CleanedText := textCleaner(Text, GlobalStopWords,GlobalOldWords, GlobalNewWords)]
})
# user system elapsed
# 66.969 0.747 67.802
The process of cleaning text is embarrassingly parallel, so in theory you should be able to get some big time savings with multiple cores. I used to run this pipeline in parallel, but looking back at it today, it turns out that the communication overhead makes this take twice as long with 8 cores as it does single-threaded. I'm not sure whether the same was true for my original use case, but I guess this may simply serve as a good example of why trying to parallelize instead of optimize can lead to more trouble than value.
## This function handles the cluster creation
## and exporting libraries, functions, and objects
parallelCleaner <- function(Text, NCores){
cl <- parallel::makeCluster(NCores)
clusterEvalQ(cl, library(magrittr))
clusterEvalQ(cl, library(stringi))
clusterExport(cl, list("textCleaner",
"GlobalStopWords",
"GlobalOldWords",
"GlobalNewWords"))
Text <- as.character(unlist(parallel::parLapply(cl, Text,
fun = function(x) textCleaner(x,
GlobalStopWords,
GlobalOldWords,
GlobalNewWords))))
parallel::stopCluster(cl)
return(Text)
}
## Run it Parallel
system.time({
DT[,CleanedText := parallelCleaner(Text = Text,
NCores = 8)]
})
# user system elapsed
# 6.700 5.099 131.429
If the TermDocumentMatrix(template) is the chief offender:
Update: I mention below that Drew Schmidt and Christian Heckendorf also submitted an R package named ngram to CRAN recently that might be worth checking out: ngram Github Repository. It turns out I should have just tried it before explaining the really cumbersome process of building a command line tool from source -- it would have saved me a lot of time had it been around 18 months ago!
It is a good deal more memory intensive and not quite as fast -- my memory usage peaked around 31 GB, so that may or may not be a deal-breaker for you. All things considered, this seems like a really good option.
For the 500,000 paragraph case, ngrams clocks in at around 7 minutes of runtime:
#install.packages("ngram")
library(ngram)
library(data.table)
system.time({
ng1 <- ngram::ngram(DT[["CleanedText"]],n = 1)
ng2 <- ngram::ngram(DT[["CleanedText"]],n = 2)
ng3 <- ngram::ngram(DT[["CleanedText"]],n = 3)
pt1 <- setDT(ngram::get.phrasetable(ng1))
pt1[,Ngrams := 1L]
pt2 <- setDT(ngram::get.phrasetable(ng2))
pt2[,Ngrams := 2L]
pt3 <- setDT(ngram::get.phrasetable(ng3))
pt3[,Ngrams := 3L]
pt <- rbindlist(list(pt1,pt2,pt3))
})
# user system elapsed
# 411.671 12.177 424.616
pt[Ngrams == 2][order(-freq)][1:5]
# ngrams freq prop Ngrams
# 1: SED SED 75096 0.0018013693 2
# 2: AC SED 33390 0.0008009444 2
# 3: SED AC 33134 0.0007948036 2
# 4: SED EU 30379 0.0007287179 2
# 5: EU SED 30149 0.0007232007 2
You can try using a more efficient ngram generator. I use a command line tool called ngrams (available on GitHub here) by Zheyuan Yu, a partial implementation of Dr. Vlado Keselj's Text-Ngrams 1.6, to take pre-processed text files off disk and generate a .csv output with ngram frequencies.
You'll need to build it from source yourself using make and then interface with it using system() calls from R, but I found it to run orders of magnitude faster while using a tiny fraction of the memory. Using it, I was able to generate 5-grams from ~700 MB of text input in well under an hour; the CSV result with all the output was a 2.9 GB file with 93 million rows.
Continuing the example above, I have a folder in my working directory, ngrams-master, that contains the ngrams executable created with make.
writeLines(DT[["CleanedText"]],con = "ExampleText.txt")
system2(command = "ngrams-master/ngrams",args = "--type=word --n = 3 --in ExampleText.txt", stdout = "ExampleGrams.csv")
# ngrams have been generated, start outputing.
# Subtotal: 165 seconds for generating ngrams.
# Subtotal: 12 seconds for outputing ngrams.
# Total 177 seconds.
Grams <- fread("ExampleGrams.csv")
# Read 5917978 rows and 3 (of 3) columns from 0.160 GB file in 00:00:06
Grams[Ngrams == 3 & Frequency > 10][sample(.N,5)]
# Ngrams Frequency Token
# 1: 3 11 INTERDUM_NEC_RIDICULUS
# 2: 3 18 MAURIS_PORTTITOR_ERAT
# 3: 3 14 SOCIIS_AMET_JUSTO
# 4: 3 23 EGET_TURPIS_FERMENTUM
# 5: 3 14 VENENATIS_LIGULA_NISL
I think I may have made a couple of tweaks to get the output format how I wanted it. If you're interested, I can try to find the changes I made to generate .csv outputs that differ from the default and upload them to Github. (I did that project before I was familiar with the platform, so I don't have a good record of the changes I made -- live and learn.)
Update 2: I created a fork on Github, msummersgill/ngrams that reflects the slight tweaks I made to output results in a .CSV format. If someone was so inclined, I have a hunch that this could be wrapped up in a Rcpp based package that would be acceptable for CRAN submission -- any takers? I honestly have no clue how Ternary Search Trees work, but they seem to be significantly more memory efficient and faster than any other N-gram implementation currently available in R.
Drew Schmidt and Christian Heckendorf also submitted an R package named ngram to CRAN, I haven't used it personally but it might be worth checking out as well: ngram Github Repository.
The Whole Shebang:
Using the same pipeline described above but with a size closer to what you're dealing with (ExampleText.txt comes out to ~274MB):
DT <- data.table(Text = stringi::stri_rand_lipsum(500000))
system.time({
DT[,CleanedText := textCleaner(Text, GlobalStopWords,GlobalOldWords, GlobalNewWords)]
})
# user system elapsed
# 66.969 0.747 67.802
writeLines(DT[["CleanedText"]],con = "ExampleText.txt")
system2(command = "ngrams-master/ngrams",args = "--type=word --n = 3 --in ExampleText.txt", stdout = "ExampleGrams.csv")
# ngrams have been generated, start outputing.
# Subtotal: 165 seconds for generating ngrams.
# Subtotal: 12 seconds for outputing ngrams.
# Total 177 seconds.
Grams <- fread("ExampleGrams.csv")
# Read 5917978 rows and 3 (of 3) columns from 0.160 GB file in 00:00:06
Grams[Ngrams == 3 & Frequency > 10][sample(.N,5)]
# Ngrams Frequency Token
# 1: 3 11 INTERDUM_NEC_RIDICULUS
# 2: 3 18 MAURIS_PORTTITOR_ERAT
# 3: 3 14 SOCIIS_AMET_JUSTO
# 4: 3 23 EGET_TURPIS_FERMENTUM
# 5: 3 14 VENENATIS_LIGULA_NISL
While the example may not be a perfect representation due to the limited vocabulary generated by stringi::stri_rand_lipsum(), the total run time of ~4.2 minutes using less than 8 GB of RAM on 500,000 paragraphs has been fast enough for the corpuses (corpi?) I've had to tackle in the past.
If wordcloud() is the source of the slowdown:
I'm not familiar with this function, but @Gregor's comment on your original post seems like it would take care of this issue.
library(wordcloud)
GramSubset <- Grams[Ngrams == 2][1:500]
par(bg="gray50")
wordcloud(GramSubset[["Token"]], GramSubset[["Frequency"]], colors = GramSubset[["Frequency"]],
          rot.per = 0.3, font.main = 1, col.main = "cornsilk3", cex.main = 1.5)
I have a huge CSV file. Its size is around 9 GB and I have 16 GB of RAM. I followed the advice from this page and implemented it below.
If you get the error that R cannot allocate a vector of length x, close out of R and add the following line to the "Target" field:
--max-vsize=500M
I am still getting the error and warnings below. How should I read this 9 GB file into R? I have 64-bit R 3.3.1 and I am running the command below in RStudio 0.99.903. I have Windows Server 2012 R2 Standard, 64-bit OS.
> memory.limit()
[1] 16383
> answer=read.csv("C:/Users/a-vs/results_20160291.csv")
Error: cannot allocate vector of size 500.0 Mb
In addition: There were 12 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In scan(file = file, what = what, sep = sep, quote = quote, ... :
Reached total allocation of 16383Mb: see help(memory.size)
2: In scan(file = file, what = what, sep = sep, quote = quote, ... :
Reached total allocation of 16383Mb: see help(memory.size)
3: In scan(file = file, what = what, sep = sep, quote = quote, ... :
Reached total allocation of 16383Mb: see help(memory.size)
4: In scan(file = file, what = what, sep = sep, quote = quote, ... :
Reached total allocation of 16383Mb: see help(memory.size)
5: In scan(file = file, what = what, sep = sep, quote = quote, ... :
Reached total allocation of 16383Mb: see help(memory.size)
6: In scan(file = file, what = what, sep = sep, quote = quote, ... :
Reached total allocation of 16383Mb: see help(memory.size)
7: In scan(file = file, what = what, sep = sep, quote = quote, ... :
Reached total allocation of 16383Mb: see help(memory.size)
8: In scan(file = file, what = what, sep = sep, quote = quote, ... :
Reached total allocation of 16383Mb: see help(memory.size)
9: In scan(file = file, what = what, sep = sep, quote = quote, ... :
Reached total allocation of 16383Mb: see help(memory.size)
10: In scan(file = file, what = what, sep = sep, quote = quote, ... :
Reached total allocation of 16383Mb: see help(memory.size)
11: In scan(file = file, what = what, sep = sep, quote = quote, ... :
Reached total allocation of 16383Mb: see help(memory.size)
12: In scan(file = file, what = what, sep = sep, quote = quote, ... :
Reached total allocation of 16383Mb: see help(memory.size)
------------------- Update1
My 1st try based upon suggested answer
> thefile=fread("C:/Users/a-vs/results_20160291.csv", header = T)
Read 44099243 rows and 36 (of 36) columns from 9.399 GB file in 00:13:34
Warning messages:
1: In fread("C:/Users/a-vsingh/results_tendo_20160201_20160215.csv", :
Reached total allocation of 16383Mb: see help(memory.size)
2: In fread("C:/Users/a-vsingh/results_tendo_20160201_20160215.csv", :
Reached total allocation of 16383Mb: see help(memory.size)
------------------- Update2
my 2nd try based upon suggested answer is as below
thefile2 <- read.csv.ffdf(file="C:/Users/a-vs/results_20160291.csv", header=TRUE, VERBOSE=TRUE,
+ first.rows=-1, next.rows=50000, colClasses=NA)
read.table.ffdf 1..
Error: cannot allocate vector of size 125.0 Mb
In addition: There were 14 warnings (use warnings() to see them)
How could I read this file into a single object so that I can analyze the entire data set in one go?
------------------- Update3
We bought an expensive machine. It has 10 cores and 256 GB of RAM. That is not the most efficient solution, but it works at least for the near future. I looked at the answers below and I don't think they solve my problem :( I appreciate these answers. I want to perform market basket analysis, and I don't think there is any other way around it than keeping my data in RAM.
Make sure you're using 64-bit R, not just 64-bit Windows, so that you can increase your RAM allocation to all 16 GB.
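A quick sanity check (just a sketch, not required):
.Machine$sizeof.pointer   # 8 on 64-bit R, 4 on 32-bit R
R.version$arch            # e.g. "x86_64" for 64-bit builds
memory.limit()            # Windows only: current limit in MB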
In addition, you can read in the file in chunks:
file_in <- file("in.csv","r")
chunk_size <- 100000 # choose the best size for you
x <- readLines(file_in, n=chunk_size)
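To actually work through the whole file rather than just the first chunk, one possible sketch (the per-chunk step is a placeholder to fill in with your own processing) is:
file_in <- file("in.csv", "r")
header  <- readLines(file_in, n = 1)                # keep the header row
results <- list()
repeat {
  x <- readLines(file_in, n = chunk_size)
  if (length(x) == 0) break                         # reached end of file
  chunk <- read.csv(text = c(header, x), stringsAsFactors = FALSE)
  results[[length(results) + 1]] <- nrow(chunk)     # placeholder: summarise 'chunk' here
}
close(file_in)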
You can use data.table to handle reading and manipulating large files more efficiently:
require(data.table)
fread("in.csv", header = T)
If needed, you can leverage storage memory with ff:
library("ff")
x <- read.csv.ffdf(file="file.csv", header=TRUE, VERBOSE=TRUE,
first.rows=10000, next.rows=50000, colClasses=NA)
You might want to consider leveraging some on-disk processing and not have that entire object in R's memory. One option would be to store the data in a proper database then have R access that. dplyr is able to deal with a remote source (it actually writes the SQL statements to query the database). I've just tested this with a small example (a mere 17,500 rows), but hopefully it scales up to your requirements.
Install SQLite
https://www.sqlite.org/download.html
Enter the data into a new SQLite database
Save the following in a new file named import.sql
CREATE TABLE tableName (COL1, COL2, COL3, COL4);
.separator ,
.import YOURDATA.csv tableName
Yes, you'll need to specify the column names yourself (I believe) but you can specify their types here too if you wish. This won't work if you have commas anywhere in your names/data, of course.
Import the data into the SQLite database via the command line
sqlite3.exe BIGDATA.sqlite3 < import.sql
Point dplyr to the SQLite database
As we're using SQLite, all of the dependencies are handled by dplyr already.
library(dplyr)
my_db <- src_sqlite("/PATH/TO/YOUR/DB/BIGDATA.sqlite3", create = FALSE)
my_tbl <- tbl(my_db, "tableName")
Do your exploratory analysis
dplyr will write the SQLite commands needed to query this data source. It will otherwise behave like a local table. The big exception will be that you can't query the number of rows.
my_tbl %>% group_by(COL2) %>% summarise(meanVal = mean(COL3))
#> Source: query [?? x 2]
#> Database: sqlite 3.8.6 [/PATH/TO/YOUR/DB/BIGDATA.sqlite3]
#>
#> COL2 meanVal
#> <chr> <dbl>
#> 1 1979 15.26476
#> 2 1980 16.09677
#> 3 1981 15.83936
#> 4 1982 14.47380
#> 5 1983 15.36479
This may not be possible on your computer. In certain cases, data.table takes up more space than its .csv counterpart.
library(data.table)

DT <- data.table(x = sample(1:2, 10000000, replace = TRUE))
write.csv(DT, "test.csv", row.names = FALSE)  # 29 MB file
DT <- fread("test.csv")
object.size(DT)
# 40001072 bytes  (40 MB)

Two orders of magnitude larger:

DT <- data.table(x = sample(1:2, 1000000000, replace = TRUE))
write.csv(DT, "test.csv", row.names = FALSE)  # 2.92 GB file
DT <- fread("test.csv")
object.size(DT)
# 4000001072 bytes  (4.00 GB)
There is natural overhead to storing an object in R. Based on these numbers, there is roughly a 1.33x factor when reading files; however, this varies based on the data. For example:
x = sample(1:10000000, 10000000, replace = T) gives a factor of roughly 2x (R:csv).
x = sample(c("foofoofoo","barbarbar"), 10000000, replace = T) gives a factor of 0.5x (R:csv).
Based on the maximum, your 9 GB file would take a potential 18 GB of memory to store in R, if not more. Based on your error message, it is far more likely that you are hitting hard memory constraints than an allocation issue. Therefore, just reading your file in chunks and consolidating would not work - you would also need to partition your analysis and workflow. Another alternative is to use a tool like SQL.
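If you want to estimate that factor for your own data before committing to a workflow, a quick illustrative comparison on a small sample might look like this (size_test.csv is just a throwaway file name):
library(data.table)
sample_dt <- data.table(x = sample(c("foofoofoo", "barbarbar"), 1e6, replace = TRUE))
fwrite(sample_dt, "size_test.csv")
file.size("size_test.csv") / 2^20            # MB on disk
as.numeric(object.size(sample_dt)) / 2^20    # MB once loaded into R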
This would be horrible practice, but depending on how you need to process this data, it shouldn't be too bad. You can change the maximum memory that R is allowed to use by calling memory.limit(new), where new is an integer giving R's new memory limit in MB. What will happen is that when you hit the hardware constraint, Windows will start paging memory onto the hard drive (not the worst thing in the world, but it will severely slow down your processing).
If you are running this on a server version of Windows, paging will possibly (likely) work differently than on regular Windows 10. I believe it should be faster, as the server OS should be optimized for this stuff.
Try starting off with something along the lines of 32 GB (or memory.limit(memory.limit()*2)), and if it comes out MUCH larger than that, I would say the program will end up being too slow once it is loaded into memory. At that point I would recommend buying some more RAM or finding a way to process the data in parts.
You could try splitting your processing over the table. Instead of operating on the whole thing, put the whole operation inside a for loop and do it in 16, 32, 64, or however many chunks you need. Any values you need for later computation can be saved. This isn't as fast as the approaches in other answers, but it will definitely return.
chunks <- ceiling(number_of_rows_in_file / CHUNK_SIZE)  # number of passes needed
con <- file("in.csv", "r")          # an open connection keeps its place between reads
for (i in seq_len(chunks)) {
  chunk <- read.csv(con, nrows = CHUNK_SIZE, header = (i == 1))  # later chunks get default V1, V2, ... names
  # compute and save whatever you need from 'chunk' here
}
close(con)
Hope that helps.