Dump heap in SBCL - common-lisp

My program execution was aborted with the following diagnostics:
Heap exhausted during garbage collection: 0 bytes available, 16 requested.
Gen Boxed Code Raw LgBox LgCode LgRaw Pin Alloc Waste Trig WP GCs Mem-age
3 21843 1 47 0 0 0 59 716955392 368896 2000000 21891 0 1.0481
4 0 0 0 0 0 0 0 0 0 2000000 0 0 0.0000
5 0 0 0 0 0 0 0 0 0 2000000 0 0 0.0000
6 491 2 223 55 0 10 0 24917312 674496 2000000 781 0 0.0000
7 10080 0 15 0 0 0 0 330663696 129264 2000000 10095 0 0.0000
Total bytes allocated = 1072536400
Dynamic-space-size bytes = 1073741824
GC control variables:
*GC-INHIBIT* = true
*GC-PENDING* = true
*STOP-FOR-GC-PENDING* = false
fatal error encountered in SBCL pid 88102(tid 0x7fff9e07c380):
Heap exhausted, game over.
Welcome to LDB, a low-level debugger for the Lisp runtime environment.
ldb>
Is there a way to find where all of the memory was consumed?
The program itself is here: https://github.com/hemml/gridgen2

SBCL's (room t) will give you quite a bit more information if you can run it before you run out of heap. I'm not familiar with LDB or whether it can execute room. However, you could wrap a call to (room t) in a function that redirects its output to a file, and add that function to the *after-gc-hooks* list so you can watch the (extremely verbose) growth pattern over time.
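A minimal sketch of that idea, assuming SBCL's sb-ext:*after-gc-hooks* and an illustrative log-file path:

;; Append (room t) output to a file after every GC so the growth pattern
;; can be inspected later; the path is only an example.
(defun log-room-after-gc ()
  (with-open-file (out "/tmp/room-log.txt"
                       :direction :output
                       :if-exists :append
                       :if-does-not-exist :create)
    (let ((*standard-output* out))
      (room t)
      (terpri out))))

(push #'log-room-after-gc sb-ext:*after-gc-hooks*)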

Why do I get an error message: "non-conformable arguments", even though I have no NA values in my matrix?

I am trying to run the R code from Network-Analysis on Attitudes: A Brief Tutorial.
You can find it here.
First we load the cognitive attitudes:
library(foreign)  # read.dta() comes from the foreign package
unzip('ANES2012.zip')
ANES2012 <- read.dta('anes_timeseries_2012_Stata12.dta')  # loads the data into the object ANES2012
#########################
# Recode variables
# Items regarding Obama
ObamaCog <- data.frame(Mor = as.numeric(ANES2012$ctrait_dpcmoral),  # this creates a data frame containing the items tapping beliefs
                       Led = as.numeric(ANES2012$ctrait_dpclead),
                       Car = as.numeric(ANES2012$ctrait_dpccare),
                       Kno = as.numeric(ANES2012$ctrait_dpcknow),
                       Int = as.numeric(ANES2012$ctrait_dpcint),
                       Hns = as.numeric(ANES2012$ctrait_dpchonst))
ObamaCog[ObamaCog < 3] <- NA  # values below 3 represent missing values
I had to change the code a little bit, as the .binarize function didn't work (I couldn't load a package, "cmprsk", that it needed). So I installed the biclust package and was able to binarize the data:
ObamaCog <- binarize(ObamaCog, threshold = 5)
Then we did the same for the affective attitudes:
ObamaAff <- data.frame(Ang = as.numeric(ANES2012$candaff_angdpc),  # this creates a data frame containing the items tapping feelings
                       Hop = as.numeric(ANES2012$candaff_hpdpc),
                       Afr = as.numeric(ANES2012$candaff_afrdpc),
                       Prd = as.numeric(ANES2012$candaff_prddpc))
ObamaAff[ObamaAff < 3] <- NA  # values below 3 represent missing values
ObamaAff <- binarize(ObamaAff, 4)  # (not) endorsing the feelings is encoded as 1 (0)
And created one Obama-matrix out of it:
Obama <- data.frame(ObamaCog,ObamaAff)
Then we omit the NA values:
Obama <- na.omit(Obama)
And I checked:
write.csv(Obama, file = "Obama-Excel1")
There are no more NA values in my matrix.
And I think it fits the required structure: nobs x nvars
Mor Led Car Kno Int Hns Ang Hop Afr Prd
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0
...
60 0 0 0 0 0 0 0 0 0 0
61 1 1 1 1 1 1 0 0 0 0
62 0 0 0 0 0 0 0 0 0 0
63 0 0 0 0 0 0 0 0 0 0
65 0 1 1 0 0 1 0 0 0 0
66 1 1 1 1 1 1 0 0 0 0
67 0 0 0 0 0 0 0 0 0 0
This continues up to row 5914. Rows that contained an NA value before are now missing entirely (for example, row 64).
If I then try to run the IsingFit function:
ObamaFit <- IsingFit(Obama)
it doesn't work; I get the error message:
Error in y %*% rep(1, nc) : non-conformable arguments
I am a beginner in R and assumed that "non-conformable arguments" referred to NA values, but this doesn't seem to be the case. Can anyone tell me what the error message means and how I might solve the problem, so I can use the IsingFit function?
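For reference, "non-conformable arguments" is the error %*% raises when the operand dimensions do not match; it is unrelated to NA values. A tiny illustration with made-up data (not the ANES data):

y <- matrix(1:6, nrow = 2)   # a 2 x 3 matrix
y %*% rep(1, 3)              # works: ncol(y) matches length(rep(1, 3)), giving row sums
# y %*% rep(1, 4)            # Error in y %*% rep(1, 4) : non-conformable arguments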

Finding "local maximas" but ignore value less than 20% of highest one

I'm trying to create a function that finds the local maxima in every row of my data, but ignores a maximum if it is not at least 20% of the highest maximum in that row.
The function I use to find the local maxima:
which(diff(sign(diff(GeneName))) == -2) + 1
but I would like to modify it so that a maximum is picked only if it is at least 20% of the highest value.
That's my data:
Name Mo Tue Wen Thu Fr Sat Sun
Mark 0 32 53 11 0 33 52
Ettin 22 51 31 0 0 1 0
Gerard 36 0 13 0 111 33 0
Marcus 0 44 31 10 0 2 0
This is the output I get with my function:
Name Mo Tue Wen Thu Fr Sat Sun
Mark 0 0 1 0 0 0 1 ## Two local maxima
Ettin 0 1 0 0 0 1 0 ## Two local maxima (should be one!)
Gerard 1 0 1 0 1 0 0 ## Three local maxima (should be two!)
Marcus 0 1 0 0 0 1 0 ## Two local maxima (should be one!)
For three rows the output isn't correct, because the values in cells (Ettin, Sat), (Gerard, Wen), and (Marcus, Sat) are not even close to 20% of the highest value in their rows.
This is what I expect to get with the new function:
Name Mo Tue Wen Thu Fr Sat Sun
Mark 0 0 1 0 0 0 1
Ettin 0 1 0 0 0 0 0
Gerard 1 0 0 0 1 0 0
Marcus 0 1 0 0 0 0 0
Is it possible to write such a function?
# j indexes the rows and i the columns of master; mas_max holds the 0/1 result
if (master[j, i] > master[j, i - 1]) {
  if (master[j, i] > 0.2 * max(master[j, ])) {
    mas_max[j, i] <- 1      # set the maximum
    mas_max[j, i - 1] <- 0  # remove the potential maximum before it
  }
}
That's the body of a loop I created, but it is not the best way to get the desired results.
If your local maxima are at
ind <- which(diff(sign(diff(GeneName))) == -2) + 1
then you can get the indices of the maxima that are at least 20% of the highest one with
ind[GeneName[ind] >= 0.2 * max(GeneName[ind])]
Also, note that == -2 won't spot local maxima that are part of a plateau; for instance it won't spot the peak in c(0, 10, 10, 0). Not sure if that's an issue, but I thought it best to point it out.
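A small worked example of those two lines on a made-up vector (not the asker's data):

GeneName <- c(0, 10, 2, 50, 3, 1, 4, 2)
ind <- which(diff(sign(diff(GeneName))) == -2) + 1   # interior peaks at positions 2, 4 and 7
ind[GeneName[ind] >= 0.2 * max(GeneName[ind])]       # keeps 2 and 4; drops 7 because 4 < 0.2 * 50

To apply this per row, the same two lines can be wrapped in a small function and run row-wise with apply().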

While printing to the console or saving big output data in R using sink(), some of the last rows get omitted

I have an array Y_obs of dimension 200 x 353 x 5 which came as output in R (I mostly use RStudio on Ubuntu 13.10).
The problem: print(Y_obs) does not display the whole array in the console. It shows the following:
[22,] 0 0 0 0 0 0
[23,] 0 0 0 0 0 0
[24,] 0 0 0 0 0 0
[25,] 0 0 0 0 0 0
[26,] 0 0 0 0 0 0
[27,] 0 0 0 0 0 0
[28,] 0 0 0 0 0 0
[ reached getOption("max.print") -- omitted 172 row(s) and 4 matrix slice(s) ]
Then I tried to sink my array Y_obs using the following commands:
sink('CR.csv')
Y_obs
sink()
Even then, it shows the same truncated output as the console, saving the incomplete data to the .csv file and omitting the last 172 rows and 4 matrix slices.
When I tried the same thing in the R terminal, it showed:
[81,] 0 0 0 0 0 0
[82,] 0 0 0 0 0 0
[83,] 0 0 0 0 0 0
[ reached getOption("max.print") -- omitted 117 row(s) and 3 matrix slice(s) ]
My question is: how do I save the full array Y_obs to a specified .csv file?
?options  # the help page for the options() function should appear
bigger <- options()$max.print + 200  # or add something much larger
options(max.print = bigger)  # apparently RStudio sets max.print very low
print(Y_obs)  # the default in a plain R installation is 99999
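Given the dimensions in the question, a sketch of the full workflow (reusing the CR.csv file name from the question):

# Y_obs has 200 * 353 * 5 = 353000 elements, so max.print must be at least
# that large before print() will show, and sink() will capture, all of it
options(max.print = 200 * 353 * 5)
sink('CR.csv')
print(Y_obs)
sink()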

Creating a sparse matrix from a TermDocumentMatrix

I've created a TermDocumentMatrix from the tm library in R. It looks something like this:
> inspect(freq.terms)
A document-term matrix (19 documents, 214 terms)
Non-/sparse entries: 256/3810
Sparsity : 94%
Maximal term length: 19
Weighting : term frequency (tf)
Terms
Docs abundant acid active adhesion aeropyrum alternative
1 0 0 1 0 0 0
2 0 0 0 0 0 0
3 0 0 0 1 0 0
4 0 0 0 0 0 0
5 0 0 0 0 0 0
6 0 1 0 0 0 0
7 0 0 0 0 0 0
8 0 0 0 0 0 0
9 0 0 0 0 0 0
10 0 0 0 0 1 0
11 0 0 1 0 0 0
12 0 0 0 0 0 0
13 0 0 0 0 0 0
14 0 0 0 0 0 0
15 1 0 0 0 0 0
16 0 0 0 0 0 0
17 0 0 0 0 0 0
18 0 0 0 0 0 0
19 0 0 0 0 0 1
This is just a small sample of the matrix; there are actually 214 terms that I'm working with. On a small scale, this is fine. If I want to convert my TermDocumentMatrix into an ordinary matrix, I'd do:
data.matrix <- as.matrix(freq.terms)
However the data that I've displayed above is just a subset of my overall data, which has probably at least 10,000 terms. When I try to create a TDM from the overall data, I run into an error:
> Error cannot allocate vector of size n Kb
So from here, I'm looking into alternative, more memory-efficient ways of handling my tdm.
I tried turning my tdm into a sparse matrix from the Matrix library but ran into the same problem.
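The conversion that sentence refers to can at least be written without going through as.matrix(): a tm TermDocumentMatrix is stored as a slam simple triplet matrix, so its (i, j, v) components can be fed straight to Matrix::sparseMatrix. A sketch, using the freq.terms object from above (the name sparse.tdm is mine); whether it helps will depend on where the earlier attempt failed:

library(Matrix)
# build a sparse matrix directly from the triplet representation, never densifying
sparse.tdm <- sparseMatrix(i = freq.terms$i,
                           j = freq.terms$j,
                           x = freq.terms$v,
                           dims = c(freq.terms$nrow, freq.terms$ncol),
                           dimnames = freq.terms$dimnames)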
What are my alternatives at this point? I feel like I should be investigating one of:
bigmemory/ff packages as talked about here (although the bigmemory package doesn't seem available for Windows at the moment)
the irlba package for computing a partial SVD of my tdm as mentioned here
I've experimented with functions from both libraries but can't seem to arrive at anything substantial. Does anyone know what the best way forward is? I've spent so long fiddling around with this that I thought I'd ask people who have much more experience than myself working with large datasets before I waste even more time going in the wrong direction.
EDIT: changed 10,00 to 10,000. Thanks @nograpes.
The package qdap seems to be able to handle a problem this large. The first part is recreating a data set that matches the OP's problem followed by the solution. As of qdap version 1.1.0 there is compatibility with the tm package:
library(qdapDictionaries)
FUN <- function() {
  paste(sample(DICTIONARY[, 1], sample(seq(100, 10000, by = 1000), 1, TRUE)), collapse = " ")
}
library(qdap)
mycorpus <- tm::Corpus(tm::VectorSource(lapply(paste0("doc", 1:15), function(i) FUN())))
This gives a similar corpus...
Now the qdap approach. You first have to convert the Corpus to a data frame (tm_corpus2df) and then use the tdm function to create a TermDocumentMatrix.
out <- with(tm_corpus2df(mycorpus), tdm(text, docs))
tm::inspect(out)
## A term-document matrix (19914 terms, 15 documents)
##
## Non-/sparse entries: 80235/218475
## Sparsity : 73%
## Maximal term length: 19
## Weighting : term frequency (tf)

Arules Package--Trio error

I'm working with a large binary data matrix, 4547 x 5415, for association rule mining. As usual, each row is a transaction and each column is an item. Whenever I call the arules package it yields an arcane error message referencing the trio library. Does anyone have experience with this type of error?
i[1:10,1:10]
101402 101403 101404 101405 101406 101411 101412 101413 101414 101415
[1,] 0 0 0 1 0 0 1 0 0 0
[2,] 0 1 0 0 0 0 1 0 0 0
[3,] 0 0 0 0 0 0 1 0 0 0
[4,] 0 0 0 1 0 0 0 0 0 1
[5,] 0 0 0 1 0 0 0 0 0 1
[6,] 0 1 0 0 0 1 0 0 0 0
[7,] 0 0 0 0 0 0 1 0 0 0
[8,] 0 0 1 0 0 0 0 0 0 1
[9,] 0 0 0 0 0 1 0 0 0 0
[10,] 0 0 0 0 1 0 1 0 0 0
rules <- apriori(i, parameter=list(support=0.001, confidence=0.5))
parameter specification:
confidence minval smax arem aval originalSupport support minlen maxlen target
0.5 0.1 1 none FALSE TRUE 0.001 1 10 rules
ext
FALSE
algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[5415 item(s), 4547 transaction(s)] done [0.47s].
sorting and recoding items ... [4908 item(s)] done [0.18s].
creating transaction tree ... done [0.01s].
checking subsets of size 1 2
Error in apriori(i, parameter = list(support = 0.001, confidence = 0.5)) :
  internal error in trio library
Reproducible example:
y <- matrix(nrow=4547, ncol=5415)
y <- apply(y, c(1,2), function(x) sample(c(0,1),1))
rules <- apriori(y, parameter=list(support=0.001, confidence=0.5))
The problem is that there is a bug in the error handling in the arules package. You run out of memory, and when the apriori code tries to create the appropriate error message it instead makes an invalid call to printf, which on Windows is handled by the trio library. So, in short, you should really be getting an out-of-memory error.
This problem will be resolved in arules version 1.1-4.
To avoid running out of memory you need to increase the support and/or restrict the number of items in the itemsets (maxlen in the parameter list); see the sketch below.
-Michael
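A minimal sketch of that suggestion, using the matrix i from the question; the specific values are only illustrative:

# a higher minimum support and a cap on itemset length both shrink the search space
rules <- apriori(i, parameter = list(support = 0.01,     # raised from 0.001
                                     confidence = 0.5,
                                     maxlen = 3))        # limit itemset size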
