How to get current percent of CPU usage in R?

How can I get the current percent CPU usage in R? Ideally, it would work for both Unix and Windows platforms.
On the Windows platform, I used the following code:
a <- system("wmic cpu get loadpercentage", intern = TRUE)
as.numeric(gsub("\\D", "", a[2]))
Is there a better way (or a function in a package) to get the current CPU usage that works on both Unix and Windows platforms?
Following how to get current cpu and ram usage in python? and using the "reticulate" package:
library(reticulate)
aa<-reticulate::import("psutil")
aa$cpu_percent()
The function returns the current CPU usage as a percentage (6% at the time I ran it).
But this approach needs Python to be installed on the machine.
The question
Is there an R function to retrieve CPU and RAM information?
asks for hardware information, not the current percent of CPU usage (which is not what I am asking), as follows:
> system("lscpu | grep 'Model name:'")
Model name: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
> system("lsmem | grep 'Total online memory'")
Total online memory: 16G
> library(benchmarkme)
> get_cpu()
$vendor_id
[1] "GenuineIntel"
$model_name
[1] "Intel(R) Core(TM) i5-7400 CPU # 3.00GHz"
$no_of_cores
[1] 4
> get_ram()
34.3 GB
So the two functions in that answer, get_ram() and get_cpu(), return the total installed RAM and the CPU description, not the current percent usage of RAM and CPU. That is, get_ram() returns 32 GB, not the 6 percent that is in use right now!
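For the RAM part, this Windows-only sketch illustrates what I mean by "percent used" (it reuses the standard wmic FreePhysicalMemory and TotalVisibleMemorySize properties that also appear in answers further down this page; ram_percent_used is just a name I made up):
ram_percent_used <- function() {
  # Both values are reported by WMI in kilobytes, so the ratio is unit-free
  free  <- system("wmic OS get FreePhysicalMemory /Value", intern = TRUE)
  total <- system("wmic OS get TotalVisibleMemorySize /Value", intern = TRUE)
  free  <- as.numeric(gsub("\\D", "", free[grepl("FreePhysicalMemory", free)]))
  total <- as.numeric(gsub("\\D", "", total[grepl("TotalVisibleMemorySize", total)]))
  100 * (1 - free / total)   # percent of physical memory currently in use
}
ram_percent_used()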
I also think the accepted answer to R: how to check how many cores/CPU usage available does not calculate the current percent of CPU usage.
On Windows (from that accepted answer):
a <- system("wmic path Win32_PerfFormattedData_PerfProc_Process get Name,PercentProcessorTime", intern = TRUE)
df <- do.call(rbind, lapply(strsplit(a, " "), function(x) {x <- x[x != ""];data.frame(process = x[1], cpu = x[2])}))
df[grepl("Rgui|rstudio", df$process),]
# process cpu
# 105 Rgui 0
# 108 rstudio 0
The resulting data.frame df has one row per running process with its PercentProcessorTime.
I cannot find any way to calculate the current overall percent of CPU usage based on that answer. Perhaps I misunderstood something, so if that answer can give the current percent of CPU usage, please show me how in a comment.
I tried to extract the current percent of CPU usage based on R: how to check how many cores/CPU usage available. When I look at the result of
df <- do.call(rbind, lapply(strsplit(a, " "), function(x) {x <- x[x != ""];data.frame(process = x[1], cpu = x[2])}))
I find two rows, Idle and _Total as follows:
df1<-df %>% filter(process %in% c("Idle","_Total"))
df1
So 1 - Idle/_Total should be the current percent of CPU usage. I calculated it as follows:
for(i in 1:1000){
a <- system("wmic path Win32_PerfFormattedData_PerfProc_Process get Name,PercentProcessorTime", intern = TRUE)
df1 <- do.call(rbind, lapply(strsplit(a, " "), function(x) {x <- x[x != ""];data.frame(process = x[1], cpu = x[2])}))
df1<-df1 %>% filter(process %in% c("Idle","_Total"))
df1<-df1 %>% mutate(cpu=as.numeric(df1$cpu))
Idle<-df1 %>% filter(process=="Idle")
Total<-df1 %>% filter(process=="_Total")
message(1-Idle$cpu/Total$cpu)
}
and the printed results make no sense to me!
When I look at the Python approach (which is what an answer here should look like), it easily calculates the current CPU usage.
First, install the psutil module from PowerShell:
pip install psutil
and then use it in R as follows:
> library(reticulate)
> aa<-reticulate::import("psutil")
> aa$cpu_percent()
[1] 9.2
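To make concrete the kind of answer I am hoping for, here is a rough sketch of a cross-platform helper (not a packaged solution; cpu_percent_now is a made-up name, the Windows branch reuses my wmic call above, and the Linux branch diffs the counters in /proc/stat over a short interval):
cpu_percent_now <- function(interval = 0.5) {
  if (Sys.info()[["sysname"]] == "Windows") {
    # Second line of the wmic output holds the load percentage
    a <- system("wmic cpu get loadpercentage", intern = TRUE)
    as.numeric(gsub("\\D", "", a[2]))
  } else if (file.exists("/proc/stat")) {
    # First line of /proc/stat: "cpu user nice system idle iowait ..." (cumulative ticks)
    read_cpu <- function() {
      as.numeric(strsplit(readLines("/proc/stat", n = 1), " +")[[1]][-1])
    }
    t1 <- read_cpu(); Sys.sleep(interval); t2 <- read_cpu()
    d <- t2 - t1
    100 * (1 - d[4] / sum(d))   # 4th field is the idle counter
  } else {
    stop("No method implemented for this platform")
  }
}
cpu_percent_now()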

Related

Memory problem in R "Error:cannot allocate vector of size 344 kb"

My memory.limit() is 3583. I have a 64-bit machine with 8 GB of RAM at home, and the office machine I remote into also has 8 GB of RAM.
I can't run the R code below successfully; should I reset the memory limit? Some consider that a dangerous approach, so could anyone tell me how to solve this problem? Thanks in advance!
loop<-1000;T<-45
bbb<-list()
for(i in 1:loop)
{
bbb[[i]]<-list()
bbb[[i]][[1]]<-matrix(rep(1,loop*(T-1)),loop,T-1)
bbb[[i]][[2]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[3]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[4]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[5]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[6]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[7]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[8]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[9]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[10]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[11]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[12]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[13]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[14]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[15]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[16]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[17]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[18]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[19]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[20]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[21]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[22]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[23]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[24]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[25]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[26]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[27]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[28]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[29]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[30]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[31]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[32]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
bbb[[i]][[33]]<-matrix(rep(0,loop*(T-1)),loop,T-1)
}
I suppose it depends on what you're doing with the matrix list, but maybe you could break your task into smaller chunks? Or you can try using lapply, which runs much faster on my machine but ultimately creates an object of exactly the same size. I think lapply has some memory-saving advantages when repeating data.
If this doesn't work, try looking into the Matrix package and sparse matrices (a rough sketch follows the code below).
create_bbb <- function(loop = 1000, T = 45){
  inner.list <- lapply(1:33, FUN = function(x){
    if (x == 1) fill <- 1
    else fill <- 0
    return(matrix(rep(fill, loop * (T - 1)), loop, T - 1))
  })
  bbb <- lapply(1:loop, function(.) inner.list)
  return(bbb)
}
bbb_test <- create_bbb()
# Check
all.equal(bbb, bbb_test)
# TRUE
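To illustrate the sparse-matrix suggestion, here is a minimal sketch; whether it helps depends on how many non-zero entries the matrices eventually hold, and the sizes in the comments are approximate:
library(Matrix)

loop <- 1000; T <- 45
dense  <- matrix(0, loop, T - 1)                  # ordinary dense matrix of zeros
sparse <- Matrix(0, loop, T - 1, sparse = TRUE)   # stored as a sparse zero matrix

object.size(dense)   # roughly 350 KB per matrix (1000 x 44 doubles)
object.size(sparse)  # only a few KB while it remains mostly zero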

poor performance of workers in foreach package of R

I am working with the foreach and doParallel packages on Windows, but CPU usage is less than 10% while the foreach loop runs. This is the code that I use, with a small example.
library(doParallel)
library(foreach)
library(dplyr)
library(Matrix)
cl <- detectCores() - 1
registerDoParallel(cl)
n_max=1300000
df=data.frame(fromID=sample(c(1:1300000),2000,replace=TRUE),
toID=sample(c(1:1300000),2000,replace=TRUE),
group=sample(c(1:10),2000,replace=TRUE))
As=foreach (i=1:10,.packages=c('dplyr','Matrix'))%dopar%{
databygroup=filter(df,group==i)
sparseMatrix(i=databygroup$fromID,j=databygroup$toID,x=1,dims=c(n_max,n_max))
}
stopImplicitCluster()
Before running the foreach loop, I get this output showing how many workers are active:
> cat(sprintf('%s backend is registered\n',
+ if(getDoParRegistered()) 'A' else 'No'))
A backend is registered
> cat(sprintf('Running with %d worker(s)\n', getDoParWorkers()))
Running with 35 worker(s)
> (name <- getDoParName())
[1] "doParallelSNOW"
> (ver <- getDoParVersion())
[1] "1.0.11"
> if (getDoParRegistered())
+ cat(sprintf('Currently using %s [%s]\n', name, ver))
Currently using doParallelSNOW [1.0.11]
The message that I received, for several connections, is:
"In if (.Internal(exists(package, .Internal(getNamespaceRegistry()), ... :
closing unused connection 70..."
And after calling the stopImplicitCluster() function, the number of workers is the same, so I am not able to close the workers.
stopCluster(cl) doesn't work:
> cat(sprintf('Running with %d worker(s)\n', getDoParWorkers()))
Running with 2 worker(s)
> (name <- getDoParName())
[1] "doParallelSNOW"
> (ver <- getDoParVersion())
[1] "1.0.11"
> if (getDoParRegistered())
+ cat(sprintf('Currently using %s [%s]\n', name, ver))
Currently using doParallelSNOW [1.0.11]
> stopCluster(cl)
> cat(sprintf('Running with %d worker(s)\n', getDoParWorkers()))
Running with 2 worker(s)
> stopCluster(cl)
Error in summary.connection(connection) : invalid connection
I don't know why the parallelization is not working.
Thank you for your time
The code is correct; the problem is that the computer spends a lot of time on overhead between tasks (starting workers and shipping data back and forth), which is why the observed CPU usage is so low.
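As a side note, a sketch of the explicit-cluster pattern may help (my suggestion, not something required by the answer above): creating the cluster object yourself gives stopCluster(cl) something to stop, and timing %do% against %dopar% shows whether the per-task overhead outweighs the gain for a small workload.
library(doParallel)
library(foreach)

cl <- makeCluster(detectCores() - 1)   # explicit cluster object
registerDoParallel(cl)

## Trivial payload: overhead dominates, so %dopar% is usually slower here
system.time(foreach(i = 1:10) %do%    sqrt(i))
system.time(foreach(i = 1:10) %dopar% sqrt(i))

stopCluster(cl)   # works because cl is a cluster object, not a core count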

Slow wordcloud in R

Trying to create a word cloud from a 300MB .csv file with text, but it's taking hours on a decent laptop with 16GB of RAM. Not sure how long this should typically take... but here's my code:
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
dfTemplate <- read.csv("CleanedDescMay.csv", header=TRUE, stringsAsFactors = FALSE)
template <- dfTemplate
template <- Corpus(VectorSource(template))
template <- tm_map(template, removeWords, stopwords("english"))
template <- tm_map(template, stripWhitespace)
template <- tm_map(template, removePunctuation)
dtm <- TermDocumentMatrix(template)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word = names(v), freq=v)
head(d, 10)
par(bg="grey30")
png(file="WordCloudDesc1.png", width=1000, height=700, bg="grey30")
wordcloud(d$word, d$freq, col=terrain.colors(length(d$word), alpha=0.9), random.order=FALSE, rot.per = 0.3, max.words=500)
title(main = "Top Template Words", font.main=1, col.main="cornsilk3", cex.main=1.5)
dev.off()
Any advice is appreciated!
Step 1: Profile
Have you tried profiling your full workflow yet with a small subset to figure out which steps are taking the most time? Profiling with RStudio here
If not, that should be your first step.
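For example, a minimal sketch using the profvis package on a small subset of the file (profvis is just one convenient front end to R's profiler; the 5000-line subset is an arbitrary choice):
# install.packages("profvis")
library(profvis)
library(tm)

profvis({
  # Profile only a rough subset of the input to find the slow steps
  template <- Corpus(VectorSource(readLines("CleanedDescMay.csv", n = 5000)))
  template <- tm_map(template, removeWords, stopwords("english"))
  template <- tm_map(template, stripWhitespace)
  template <- tm_map(template, removePunctuation)
  dtm <- TermDocumentMatrix(template)
  m   <- as.matrix(dtm)
})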
If the tm_map() functions are taking a long time:
If I recall correctly, I found working with stringi to be faster than the dedicated corpus tools.
My workflow wound up looking like the following for the pre-cleaning steps. This could definitely be optimized further -- magrittr pipes %>% do add some processing time, but I feel like that's an acceptable trade-off for the sanity of not having dozens of nested parentheses.
library(data.table)
library(stringi)
library(parallel)
library(magrittr)  # for the %>% pipe used below
## This function handles the processing pipeline
textCleaner <- function(InputText, StopWords, Words, NewWords){
InputText %>%
stri_enc_toascii(.) %>%
toupper(.) %>%
stri_replace_all_regex(.,"[[:cntrl:]]"," ") %>%
stri_replace_all_regex(.,"[[:punct:]]"," ") %>%
stri_replace_all_regex(.,"[[:space:]]+"," ") %>% ## Replaces multiple spaces with
stri_replace_all_regex(.,"^[[:space:]]+|[[:space:]]+$","") %>% ## Remove leading and trailing spaces
stri_replace_all_regex(.,"\\b"%s+%StopWords%s+%"\\b","",vectorize_all = FALSE) %>% ## Stopwords
stri_replace_all_regex(.,"\\b"%s+%Words%s+%"\\b",NewWords,vectorize_all = FALSE) ## Replacements
}
## Replacement Words, I would normally read in a .CSV file
Replace <- data.table(Old = c("LOREM","IPSUM","DOLOR","SIT"),
New = c("I","DONT","KNOW","LATIN"))
## These need to be defined globally
GlobalStopWords <- c("AT","UT","IN","ET","A")
GlobalOldWords <- Replace[["Old"]]
GlobalNewWords <- Replace[["New"]]
## Generate some sample text
DT <- data.table(Text = stringi::stri_rand_lipsum(500000))
## Running Single Threaded
system.time({
DT[,CleanedText := textCleaner(Text, GlobalStopWords,GlobalOldWords, GlobalNewWords)]
})
# user system elapsed
# 66.969 0.747 67.802
The process of cleaning text is embarrassingly parallel, so in theory you should be able to get some big time savings with multiple cores.
I used to run this pipeline in parallel, but looking back at it today, it turns out that the communication overhead makes this take twice as long with 8 cores as it does single threaded. I'm not sure if this was the same for my original use case, but I guess this may simply serve as a good example of why trying to parallelize instead of optimize can lead to more trouble than value.
## This function handles the cluster creation
## and exporting libraries, functions, and objects
parallelCleaner <- function(Text, NCores){
cl <- parallel::makeCluster(NCores)
clusterEvalQ(cl, library(magrittr))
clusterEvalQ(cl, library(stringi))
clusterExport(cl, list("textCleaner",
"GlobalStopWords",
"GlobalOldWords",
"GlobalNewWords"))
Text <- as.character(unlist(parallel::parLapply(cl, Text,
fun = function(x) textCleaner(x,
GlobalStopWords,
GlobalOldWords,
GlobalNewWords))))
parallel::stopCluster(cl)
return(Text)
}
## Run it Parallel
system.time({
DT[,CleanedText := parallelCleaner(Text = Text,
NCores = 8)]
})
# user system elapsed
# 6.700 5.099 131.429
If the TermDocumentMatrix(template) is the chief offender:
Update: Drew Schmidt and Christian Heckendorf also recently submitted an R package named ngram to CRAN that might be worth checking out: ngram Github Repository. It turns out I should have just tried it before explaining the really cumbersome process of building a command line tool from source -- it would have saved me a lot of time had it been around 18 months ago!
It is a good deal more memory intensive and not quite as fast -- my memory usage peaked around 31 GB so that may or may not be a deal-breaker for you. All things considered, this seems like a really good option.
For the 500,000 paragraph case, ngrams clocks in at around 7 minutes of runtime:
#install.packages("ngram")
library(ngram)
library(data.table)
system.time({
ng1 <- ngram::ngram(DT[["CleanedText"]],n = 1)
ng2 <- ngram::ngram(DT[["CleanedText"]],n = 2)
ng3 <- ngram::ngram(DT[["CleanedText"]],n = 3)
pt1 <- setDT(ngram::get.phrasetable(ng1))
pt1[,Ngrams := 1L]
pt2 <- setDT(ngram::get.phrasetable(ng2))
pt2[,Ngrams := 2L]
pt3 <- setDT(ngram::get.phrasetable(ng3))
pt3[,Ngrams := 3L]
pt <- rbindlist(list(pt1,pt2,pt3))
})
# user system elapsed
# 411.671 12.177 424.616
pt[Ngrams == 2][order(-freq)][1:5]
# ngrams freq prop Ngrams
# 1: SED SED 75096 0.0018013693 2
# 2: AC SED 33390 0.0008009444 2
# 3: SED AC 33134 0.0007948036 2
# 4: SED EU 30379 0.0007287179 2
# 5: EU SED 30149 0.0007232007 2
You can try using a more efficient ngram generator. I use a command line tool called ngrams (available on github here) by Zheyuan Yu, a partial implementation of Dr. Vlado Keselj's Text-Ngrams 1.6, which takes pre-processed text files off disk and generates a .csv output with ngram frequencies.
You'll need to build it from source yourself using make and then interface with it using system() calls from R, but I found it to run orders of magnitude faster while using a tiny fraction of the memory. Using it, I was able to generate 5-grams from ~700MB of text input in well under an hour; the CSV result with all the output was a 2.9 GB file with 93 million rows.
Continuing the example above, I have a folder, ngrams-master, in my working directory that contains the ngrams executable created with make.
writeLines(DT[["CleanedText"]],con = "ExampleText.txt")
system2(command = "ngrams-master/ngrams",args = "--type=word --n = 3 --in ExampleText.txt", stdout = "ExampleGrams.csv")
# ngrams have been generated, start outputing.
# Subtotal: 165 seconds for generating ngrams.
# Subtotal: 12 seconds for outputing ngrams.
# Total 177 seconds.
Grams <- fread("ExampleGrams.csv")
# Read 5917978 rows and 3 (of 3) columns from 0.160 GB file in 00:00:06
Grams[Ngrams == 3 & Frequency > 10][sample(.N,5)]
# Ngrams Frequency Token
# 1: 3 11 INTERDUM_NEC_RIDICULUS
# 2: 3 18 MAURIS_PORTTITOR_ERAT
# 3: 3 14 SOCIIS_AMET_JUSTO
# 4: 3 23 EGET_TURPIS_FERMENTUM
# 5: 3 14 VENENATIS_LIGULA_NISL
I think I may have made a couple of tweaks to get the output format how I wanted it. If you're interested, I can try to find the changes I made to generate .csv outputs that differ from the default and upload them to Github. (I did that project before I was familiar with the platform, so I don't have a good record of the changes I made; live and learn.)
Update 2: I created a fork on Github, msummersgill/ngrams that reflects the slight tweaks I made to output results in a .CSV format. If someone was so inclined, I have a hunch that this could be wrapped up in a Rcpp based package that would be acceptable for CRAN submission -- any takers? I honestly have no clue how Ternary Search Trees work, but they seem to be significantly more memory efficient and faster than any other N-gram implementation currently available in R.
Drew Schmidt and Christian Heckendorf also submitted an R package named ngram to CRAN, I haven't used it personally but it might be worth checking out as well: ngram Github Repository.
The Whole Shebang:
Using the same pipeline described above but with a size closer to what you're dealing with (ExampleText.txt comes out to ~274MB):
DT <- data.table(Text = stringi::stri_rand_lipsum(500000))
system.time({
DT[,CleanedText := textCleaner(Text, GlobalStopWords,GlobalOldWords, GlobalNewWords)]
})
# user system elapsed
# 66.969 0.747 67.802
writeLines(DT[["CleanedText"]],con = "ExampleText.txt")
system2(command = "ngrams-master/ngrams",args = "--type=word --n = 3 --in ExampleText.txt", stdout = "ExampleGrams.csv")
# ngrams have been generated, start outputing.
# Subtotal: 165 seconds for generating ngrams.
# Subtotal: 12 seconds for outputing ngrams.
# Total 177 seconds.
Grams <- fread("ExampleGrams.csv")
# Read 5917978 rows and 3 (of 3) columns from 0.160 GB file in 00:00:06
Grams[Ngrams == 3 & Frequency > 10][sample(.N,5)]
# Ngrams Frequency Token
# 1: 3 11 INTERDUM_NEC_RIDICULUS
# 2: 3 18 MAURIS_PORTTITOR_ERAT
# 3: 3 14 SOCIIS_AMET_JUSTO
# 4: 3 23 EGET_TURPIS_FERMENTUM
# 5: 3 14 VENENATIS_LIGULA_NISL
While the example may not be a perfect representation due to the limited vocabulary generated by stringi::stri_rand_lipsum(), the total run time of ~4.2 minutes using less than 8 GB of RAM on 500,000 paragraphs has been fast enough for the corpuses (corpi?) I've had to tackle in the past.
If wordcloud() is the source of the slowdown:
I'm not familiar with this function, but @Gregor's comment on your original post seems like it would take care of this issue.
library(wordcloud)
GramSubset <- Grams[Ngrams == 2][1:500]
par(bg="gray50")
wordcloud(GramSubset[["Token"]],GramSubset[["Frequency"]],color = GramSubset[["Frequency"]],
rot.per = 0.3,font.main=1, col.main="cornsilk3", cex.main=1.5)

How would one check the system memory available using R on a Windows machine?

I am running a multi-threaded R program but am having trouble with some nodes crashing due to the host system running out of memory. Is there a way for each node to check the available memory for the entire system before continuing to run? (machine is running Windows Server 2012 R2)
Maybe one of the below will help (I am also on Windows Server 2012 R2):
Maybe this would be the most useful:
> system('systeminfo')
#the output is too big to show but you can save into a list and choose the rows you want
Or just use one of the below which are specific about the memory
> system('wmic MemoryChip get BankLabel, Capacity, MemoryType, TypeDetail, Speed')
BankLabel Capacity MemoryType Speed TypeDetail
RAM slot #0 8589934592 2 512
RAM slot #1 4294967296 2 512
Free available memory:
> system('wmic OS get FreePhysicalMemory /Value')
FreePhysicalMemory=8044340
Total available Memory
> system('wmic OS get TotalVisibleMemorySize /Value')
TotalVisibleMemorySize=12582456
Basically you can even run any other cmd command you want that you know it could help you through the system function. R will show the output on the screen and you can then save into a data.frame and use as you want.
Just for the sake of completeness, I added support for Linux to Stefan's answer above. Tested on Ubuntu 16.
getFreeMemoryKB <- function() {
  osName <- Sys.info()[["sysname"]]
  if (osName == "Windows") {
    x <- system2("wmic", args = "OS get FreePhysicalMemory /Value", stdout = TRUE)
    x <- x[grepl("FreePhysicalMemory", x)]
    x <- gsub("FreePhysicalMemory=", "", x, fixed = TRUE)
    x <- gsub("\r", "", x, fixed = TRUE)
    return(as.integer(x))
  } else if (osName == "Linux") {
    x <- system2("free", args = "-k", stdout = TRUE)
    x <- strsplit(x[2], " +")[[1]][4]
    return(as.integer(x))
  } else {
    stop("Only supported on Windows and Linux")
  }
}
I wrapped LyzandeR's answer above in a function that returns the free physical memory in kilobytes (1024-byte units). Tested on Windows 7.
get_free_ram <- function(){
  if (Sys.info()[["sysname"]] == "Windows") {
    x <- system2("wmic", args = "OS get FreePhysicalMemory /Value", stdout = TRUE)
    x <- x[grepl("FreePhysicalMemory", x)]
    x <- gsub("FreePhysicalMemory=", "", x, fixed = TRUE)
    x <- gsub("\r", "", x, fixed = TRUE)
    as.integer(x)
  } else {
    stop("Only supported on Windows OS")
  }
}
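Tying this back to the original question, here is a sketch of how a node could use one of these helpers to pause (or give up) when system memory runs low; wait_for_ram() and the 2 GB threshold are my own arbitrary choices:
## Sketch: pause before a memory-hungry step until at least ~2 GB is free.
## Uses get_free_ram() from the answer above (returns KB); threshold is arbitrary.
wait_for_ram <- function(min_free_kb = 2 * 1024^2, retries = 10, sleep_sec = 30) {
  for (i in seq_len(retries)) {
    if (get_free_ram() >= min_free_kb) return(invisible(TRUE))
    message("Low memory (", get_free_ram(), " KB free), waiting...")
    Sys.sleep(sleep_sec)
  }
  stop("System memory stayed below threshold; aborting this node's task")
}

wait_for_ram()
# ... proceed with the memory-intensive computation ...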

Getting a random internal selfref error in data.table for R

I love data.table, it's fast and intuitive, what could be better?
Alas, here's my problem: when referring to a data.table within a foreach() loop (using the doMC implementation) I will occasionally get the following error:
EXAMPLE IN APPENDIX
Error in { :
Internal error: .internal.selfref prot is not itself an extptr
One of the annoying problems here is that I can't get it to reproduce with any consistency, but it will happen during some long (several hrs) tasks, so I want to make sure it never happens, if possible.
Since I refer to the same data.table, DT, in each loop, I tried running the following at the beginning of each loop:
setattr(DT,".internal.selfref",NULL)
...to remove the invalid/corrupted self ref attribute. This works and the internal selfref error no longer occurs. It's a workaround, though.
Any ideas for addressing the root problem?
Many thanks for any help!
Eric
Appendix: Abbreviated R Session Info to confirm latest versions:
R version 2.15.3 (2013-03-01)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
other attached packages:
[1] data.table_1.8.8 doMC_1.3.0
Example using simulated data -- you may have to run the history() function many times (like, hundreds) to get the error:
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Load packages and Prepare Data
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
require(data.table)
##this is the package we use for multicore
require(doMC)
##register n-2 of your machine's cores
registerDoMC(multicore:::detectCores()-2)
## Build simulated data
value.a <- runif(500,0,1)
value.b <- 1-value.a
value <- c(value.a,value.b)
answer.opt <- c(rep("a",500),rep("b",500))
answer.id <- rep( 6000:6499 , 2)
question.id <- rep( sample(c(1001,1010,1041,1121,1124),500,replace=TRUE) ,2)
date <- rep( (Sys.Date() - sample.int(150, size=500, replace=TRUE)) , 2)
user.id <- rep( sample(250:350, size=500, replace=TRUE) ,2)
condition <- substr(as.character(user.id),1,1)
condition[which(condition=="2")] <- "x"
condition[which(condition=="3")] <- "y"
##Put everything in a data.table
DT.full <- data.table(user.id = user.id,
answer.opt = answer.opt,
question.id = question.id,
date = date,
answer.id = answer.id,
condition = condition,
value = value)
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Daily Aggregation Function
##
##a basic function that aggregates all the values from
##all users for every question on a given day:
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
each.day <- function(val.date){
DT <- DT.full[ date < val.date ]
#count the number of updates per user (for weighting)
setkey(DT, question.id, user.id)
DT <- DT[ DT[answer.opt=="a",length(value),by="question.id,user.id"] ]
setnames(DT, "V1", "freq")
#retain only the most recent value from each user on each question
setkey(DT, question.id, user.id, answer.id)
DT <- DT[ DT[ ,answer.id == max(answer.id), by="question.id,user.id", ][[3]] ]
#now get a weighted mean (with freq) of the value for each question
records <- lapply(unique(DT$question.id), function(q.id) {
DT <- DT[ question.id == q.id ]
probs <- DT[ ,weighted.mean(value,freq), by="answer.opt" ]
return(data.table(q.id = rep(q.id,nrow(probs)),
ans.opt = probs$answer.opt,
date = rep(val.date,nrow(probs)),
value = probs$V1))
})
return(do.call("rbind",records))
}
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## foreach History Function
##
##to aggregate accross many days quickly
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
history <- function(start, end){
#define a sequence of dates
date.seq <- seq(as.Date(start),as.Date(end),by="day")
#now run a foreach to get the history for each date
hist <- foreach(day = date.seq, .combine = "rbind") %dopar% {
#setattr(DT,".internal.selfref",NULL) #resolves occasional internal selfref error
each.day(val.date = day)
}
}
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Examples
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
##aggregate only one day
each.day(val.date = "2012-12-13")
##generate a history
hist.example <- history (start = "2012-11-01", end = Sys.Date())
Thanks for reporting and all the help in finding it! Now fixed in v1.8.11. From NEWS :
In long running computations where data.table is called many times
repetitively, the following error could sometimes occur, #2647 :
Internal error: .internal.selfref prot is not itself an extptr
Fixed. Thanks to theEricStone, StevieP and JasonB for (difficult) reproducible examples.
Possibly related is a memory leak in grouping, which is also now fixed.
Long outstanding (usually small) memory leak in grouping fixed, #2648. When the last group is smaller than the largest group, the difference in those sizes was not being released. Also in non-trivial aggregations where each group returns a different number of rows. Most users run a grouping query once and will never have noticed these, but anyone looping calls to grouping (such as when running in parallel, or benchmarking) may have suffered. Tests added.
Thanks to many including vc273 and Y T.
Memory leak in data.table grouped assignment by reference
Slow memory leak in data.table when returning named lists in j (trying to reshape a data.table)
A similar problem has been plaguing me for months. Perhaps we can see a pattern by putting our experiences together.
I've been waiting to post until I could create a reproducible example. Not possible thus far.
The bug doesn't happen in the same code location. In the past I've often been able to avoid the error by merely rerunning the exact same code. Other times I've reformulated an expression and rerun with success. In any case, I'm pretty sure that these errors are truly internal to data.table.
I've saved the last 4 error messages in attempt to detect a pattern (pasted below).
---------------------------------------------------
[1] "err msg: location 1"
Error in selfrefok(x) :
Internal error: .internal.selfref prot is not itself an extptr
Calls: my.fun1 ... $<- -> $<-.data.table -> [<-.data.table -> selfrefok
Execution halted
---------------------------------------------------
[1] "err msg: location 1"
Error in alloc.col(newx) :
Internal error: .internal.selfref prot is not itself an extptr
Calls: my.fun1 -> $<- -> $<-.data.table -> copy -> alloc.col
Execution halted
---------------------------------------------------
[1] "err msg: location 2"
Error in shallow(x) :
Internal error: .internal.selfref prot is not itself an extptr
Calls: print ... do.call -> lapply -> as.list -> as.list.data.table -> shallow
Execution halted
---------------------------------------------------
[1] "err msg: location 3"
Error in shallow(x) :
Internal error: .internal.selfref prot is not itself an extptr
Calls: calc.book.summ ... .rbind.data.table -> as.list -> as.list.data.table -> shallow
Execution halted
Another similarity to the above example: I'm passing data.tables around among parallel threads, so they are being serialized/unserialized.
I will try the 'setattr' fix mentioned above.
hope this helps and thanks, jason
Here is a simplification of one of the code segments that seems to generate this error about 1 out of every 50-100k times it is run.
Thanks @MatthewDowle, btw; data.table has been most useful. Here is one stripped-down bit of code:
require(data.table)
require(xts)
book <- data.frame(name='',
s=0,
Value=0.0,
x=0.0,
Qty=0)[0, ]
for (thing in list(1,2,3,4,5)) {
tmp <- xts(1:5, order.by= make.index.unique(rep(Sys.time(), 5)))
colnames(tmp) <- 'A'
tmp <- cbind(coredata(tmp[nrow(tmp), 'A']),
coredata(colSums(tmp[, 'A'])),
coredata(tmp[nrow(tmp), 'A']))
book <- rbind(book,
data.table(name='ALPHA',
s=0*NA,
Value=tmp[1],
x=tmp[2],
Qty=tmp[3]))
}
something like this seems to be the cause of this error:
Error in shallow(x) :
Internal error: .internal.selfref prot is not itself an extptr
Calls: my.function ... .rbind.data.table -> as.list -> as.list.data.table -> shallow
Execution halted
For the sake of reproducing the error, I have a script for you guys to pore over and figure out where this bug is coming from. The error reads:
Error in { :
task 96 failed - "Internal error: .internal.selfref prot is not itself an extptr"
Calls: apply ... system.time -> apply -> FUN -> %dopar% -> <Anonymous>
Execution halted
and I'm using doParallel to register my backend for foreach.
Context: I'm testing out classifiers on the MNIST hand-written digit dataset. You can get the data from me via
wget -nc https://www.dropbox.com/s/xr4i8gy11ed8bsh/digit_id_data_and_benchmarks.zip
just be sure to modify the script (above) so that it correctly points to load_data.R and load_data.R correctly points to the MNIST data -- though it may be easier for you to just clone my repo, hop on the random_gov branch, and then run dt_centric_random_gov.R.
Sorry I couldn't make a more minimal reproducible example, but like @JasonB's answer, this error doesn't seem to pop up until you do a ton of calculations.
edit: I re-ran my script using the suggested work-around above and it seemed to go off without a hitch.
