Mclust freezes with small dataset - r

I am trying to use the Mclust() function from the R-package mclust on a dataset with 500 observations and 2 variables, and I want to identify 2 clusters.
> head(data)
x y
1 0.9929185 -1.9662945
2 8.2259360 -0.7240049
3 3.3866952 -1.8054764
4 -0.5161490 -2.3096992
5 1.8931073 -1.8928091
6 4.0833228 -1.9045669
> Mclust(data, G = 2)
fitting ...
|=============================================================== | 67%
This should produce an output relatively quickly, but freezes at 67%.
I ran this function multiple times over different datasets, and had no problems whatsoever. It even works if I only include observations up to row 498, but fails as soon as row 499+ is included.
498 -1.710175250 -1.612248596
499 -5.666497204 5.565422240
500 -3.649579976 1.552779499
I have uploaded the whole dataset in my GitHub repository: https://github.com/fstermann/bthesis/tree/main/MclustFreeze
I would greatly appreciate if anyone has an idea why this is happing with this specific dataset.
> sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mclust_5.4.7
loaded via a namespace (and not attached):
[1] compiler_4.0.5 tools_4.0.5

Related

Addition of NA and expression that evaluates to NaN return different results depending on order, violation of the commutative property?

I am investigating corner cases of numeric operations in R. I came across the following particular case involving zero divided by zero:
(0/0)+NA
#> [1] NaN
NA+(0/0)
#> [1] NA
Created on 2021-07-10 by the reprex package (v2.0.0)
Session info
sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur 10.16
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> [1] digest_0.6.27 withr_2.4.2 magrittr_2.0.1 reprex_2.0.0
#> [5] evaluate_0.14 highr_0.9 stringi_1.6.2 rlang_0.4.11
#> [9] cli_3.0.0 rstudioapi_0.13 fs_1.5.0 rmarkdown_2.9
#> [13] tools_4.1.0 stringr_1.4.0 glue_1.4.2 xfun_0.23
#> [17] yaml_2.2.1 compiler_4.1.0 htmltools_0.5.1.1 knitr_1.33
This clearly violates the commutative property of addition. I have two questions:
Is there an explanation of this behavior based on the R language definition?
Are there other examples of the violation of the commutative property of addition (including in other languages) that don't involve side effects in the addend sub-expressions?
Noting that
0/0
#[1] NaN
a more general example of the behavior of + in the question is the following:
NA + NaN
#[1] NA
NaN + NA
#[1] NaN
This is in a r-devel thread and R Core Team member Tomas Kalibera answers the following (my emphasis and link).
Yes, the performance overhead of fixing this at R level would be too
large and it would complicate the code significantly. The result of
binary operations involving NA and NaN is hardware dependent (the
propagation of NaN payload) - on some hardware, it actually works the
way we would like - NA is returned - but on some hardware you get NaN or
sometimes NA and sometimes NaN. Also there are C compiler optimizations
re-ordering code, as mentioned in ?NaN. Then there are also external
numerical libraries that do not distinguish NA from NaN (NA is an R
concept). So I am afraid this is unfixable. The disclaimer mentioned by
Duncan is in ?NaN/?NA, which I think is ok - there are so many numerical
functions through which one might run into these problems that it would
be infeasible to document them all. Some functions in fact will preserve
NA, and we would not let NA turn into NaN unnecessarily, but the
disclaimer says it is something not to depend on.
According to ?NA, this could be because of NaN resulted from 0/0
Numerical computations using NA will normally result in NA: a possible exception is where NaN is also involved, in which case either might result (which may depend on the R platform). However, this is not guaranteed and future CPUs and/or compilers may behave differently. Dynamic binary translation may also impact this behavior (with valgrind, computations using NA may result in NaN even when no NaN is involved).

Error when performing an NA replacement in R 4.0

With R 3.6 I can perform the following NA replacement
> d <- zoo(data.frame(a = NA, b = 1), Sys.Date())
> d[is.na(d)] <- 1
> d
a b
2021-03-03 1 1
With R 4.0 I get the following error:
> d <- zoo(data.frame(a = NA, b = 1), Sys.Date())
> d[is.na(d)] <- 1
Error in as.Date.default(e) :
do not know how to convert 'e' to class “Date”
Has some default behavior changed in R 4.0?
R 3.6 session info:
Microsoft Windows [Version 10.0.19041.804]
(c) 2020 Microsoft Corporation. All rights reserved.
C:\>R --no-site-file
R version 3.6.1 (2019-07-05) -- "Action of the Toes"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: i386-w64-mingw32/i386 (32-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(zoo)
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
Warning message:
package 'zoo' was built under R version 4.0.4
> d <- zoo(data.frame(a = NA, b = 1), Sys.Date())
> d[is.na(d)] <- 1
> d
a b
2021-03-03 1 1
R 4.0 session info:
Microsoft Windows [Version 10.0.19041.804]
(c) 2020 Microsoft Corporation. All rights reserved.
C:\>R --no-site-file
R version 4.0.4 (2021-02-15) -- "Lost Library Book"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: i386-w64-mingw32/i386 (32-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(zoo)
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
> d <- zoo(data.frame(a = NA, b = 1), Sys.Date())
> d[is.na(d)] <- 1
Error in as.Date.default(e) :
do not know how to convert 'e' to class "Date"
Session Info (3.6):
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] zoo_1.8-8
loaded via a namespace (and not attached):
[1] compiler_3.6.1 grid_3.6.1 lattice_0.20-38
Session Info (4.0):
> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] zoo_1.8-8
loaded via a namespace (and not attached):
[1] compiler_4.0.4 tools_4.0.4 grid_4.0.4 lattice_0.20-41
Thanks for raising this issue, it was a bug in the zoo package. In the [.zoo and [<-.zoo methods we checked whether the index i was a matrix via
if (all(class(i) == "matrix")) ...
This worked correctly in R 3.x.y because matrix objects just had the class "matrix". However, in R 4.0.0 matrix objects started additionally inheriting from "array". See: https://developer.R-project.org/Blog/public/2019/11/09/when-you-think-class.-think-again/.
In the zoo development version on R-Forge (https://R-Forge.R-project.org/R/?group_id=18) I have fixed the issue now by replacing the above code with
if (inherits(i, "matrix")) ...
So you can already install zoo 1.8-9 from R-Forge and your code will work again as intended. Alternatively, you can wait for that version to arrive on CRAN which will hopefully come out in the next days after reverse dependency checks. In the mean time you can work around the issue by using
coredata(d)[is.na(d)] <- 1
I'm having an issue with this as well!! here's more odd behavior/breadcrumbs as to what might be happening. Still not sure WHY but seems to be an indexing issue w/ zoo rather than is.na() specifically. Logical indexing works if the structure of the logical has the same rownames/indices as the zoo obj:
Printing out d[is.na(d)] (without assignment) results in an empty zoo object, suggesting the issue is w/ indexing
wrapping d in coredata() works
d <- zoo(data.frame(a = NA, b = 1), Sys.Date())
coredata(d)[is.na(d)] <- 1
d
a b
> 2021-03-05 1 1
The logical returned by is.na() will work if it is transformed to have the same rownames/indices as the zoo obj.
d <- zoo(data.frame(a = NA, b = 1), Sys.Date())
changes <- is.na(d) #storing logical in a variable
> d
a b
2021-03-05 NA 1
> changes #d[changes] won't work, so change rownames
a b
[1,] TRUE FALSE
> changes <- as.zoo(changes, index(d))
> changes
a b
2021-03-05 TRUE FALSE
> d[as.logical(changes)] #changing zoo back to a logical, returns something
a b
2021-03-05 NA 1
> d[as.logical(changes)] <- 1
a b
2021-03-05 NA 1
Now for the breadcrumbs...does anybody know what changes were made to R's date class in version 4? Zoo suggest it made some changes to merge.zoo "explicitly to work around the new behavior of c.Date() in R >= 4.1.0."
https://cran.r-project.org/web/packages/zoo/NEWS (see bullet point #2 at the top)
I've searched and searched and see no mention of those changes...
I'm guessing there were some changes to the zoo class to more strictly enforce date indexing...unsure...also can't quite seem to figure out to post an issue w/ zoo
more breadcrumbs....
apparently this functionality was working for zoo objects back in April of 2020 according to this thread https://github.com/joshuaulrich/xts/issues/331
R 4.0.1 came out later that month and newest zoo package, 1.8-8, was released 5/2020 so maybe running version 1.8-7 of this package could determine if it's a change w/ R or a change w/ zoo that's causing different behavior
#neilfws you are correct that the issue is due to the change in class response in R 4.0.
For now, the best option is to use either:
d <- na.fill(d, 1)
or
coredata(d)[is.na(d)] <- 1
The old use case of d[is.na(d)] <- 1 will require an update to the zoo package

Merging 2 large data sets in R

I am trying to merge two large data sets as i need to create a final trainset for my models to run
head(TrainWithAppevents_rel4)
event_id |device_id |gender |age |group| phone_brand |device_model| numbrand nummodel | app_id
6 6 1476664663289716480 M 19 M22- åŽä¸º Mate 7 29 919 4348659952760821248
and
head(app_labels)
app_id |label_id
1 7324884708820028416 251
The first dataset has unique rows now as i have worked on it to remove all duplicates
i want my final set to be having the below columns
event_id device_id gender age group phone_brand device_model numbrand nummodel app_id label_id
However when i try to merge using the below in R (R studio session)
TrainWithLabels=merge(x=TrainWithAppevents_rel4,y=app_labels,by="app_id",all.x = TRUE)
i get following error
**Error: cannot allocate vector of size 512.0 Mb**
Error varies if i run again but only in terms of size of vector
The sizes of my datasets are as below :
> dim(TrainWithAppevents_rel4)
[1] 4787796 10
> dim(app_labels)
[1] 459943 2
More information about the machine/R i use :
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
i use intel 2.6GHz/16GB RAM /64 Bit OS/Windows10/x64 -based processesor
i have tried the following :
-Reducing the dataset by removing duplicates and unwanted columns ,
all rows in the first dataset are unique now
-closing all other application on my laptop and then running the merge-Still fails
-executing gc() and then running merge
I have gone through similar questions on SO for R, however none of them offered a solution to move forward and not specific to merges failing on a 64 bit machine
Can anyone please help in either suggesting a solution or a workaround to move forward.
Please assume that this is the only machine where i can execute the code and running this R script on AWS via zepplin is not possible at the moment.

Is it possible to make a heatmap with 200k rows?

I would like to know if it is possible to make a heatmap with 200k rows? I have a matrix with genomic coordinates as rows and each column represents the presence or absence of that region in each patients. so I have 15 patient (columns). Now when I am trying heatmap.2 or pheatmap I get the memory allocation problem, how will I be able to use the entire matrix to generate a heatmap. The values of my matrix are jsut 0 and 1 and I want to draw some hypothesis based on the heatmap. How will I be able to use it. I tried once on my laptop but it does not help , so I tried in our linux cluster which is quite powerful. But still shows the below error. How can I resolve this problem. I also add the sessionInfo() . Any workaround is appreciated
data<-read.delim("path/H3_marks_map.txt",sep="\t", row.names=1)
pdf(file="path/map/H3_maps1.pdf")
pheatmap(data,scale="none")
Error: cannot allocate vector of size 170.5 Gb
dev.off()
sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=C LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RColorBrewer_1.1-2 pheatmap_1.0.7
loaded via a namespace (and not attached):
[1] colorspace_1.2-6 grid_3.1.2 gtable_0.1.2 munsell_0.4.2
[5] plyr_1.8.3 Rcpp_0.12.1 scales_0.3.0 tools_3.1.2

Knitr and data.table

I have an automated report that i produce using knitr. i'm running across the oddest problem. I wrote a function that sums the data by month for several locations. when i run this function in R i get the following result (which is correct):
###NAME MONTH VOL
###1: TOTAL 1 13.00872
###2: TOTAL 2 11.62527
###3: TOTAL 3 12.71313
###4: TOTAL 4 12.67269
###5: TOTAL 5 15.05127
###6: TOTAL 6 14.61002
###7: TOTAL 7 15.43827
###8: TOTAL 8 15.22400
###9: TOTAL 9 14.91259
###10: TOTAL 10 15.83505
###11: TOTAL 11 14.97242
###12: TOTAL 12 16.34950
when i run this same function (no changes) through knitr to produce the report i get the following result:
###NAME MONTH VOL
###1: TOTAL 1 14.00872
###2: TOTAL 2 13.62527
###3: TOTAL 3 15.71313
###4: TOTAL 4 16.11338
###5: TOTAL 5 17.61269
###6: TOTAL 6 18.46945
###7: TOTAL 7 20.18851
###8: TOTAL 8 21.04382
###9: TOTAL 9 21.72287
###10: TOTAL 10 23.54272
###11: TOTAL 11 23.72971
###12: TOTAL 12 26.03293
i also have another table where knitr just prints non-sense even though the table has actual values in it.
Here is my session info:
R version 3.1.2 (2014-10-31) Platform: x86_64-w64-mingw32/x64 (64-bit)
locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] lubridate_1.3.3 xtable_1.7-4 shape_1.4.2 reshape2_1.4.1 rgdal_0.9-2 raster_2.2-12
[7] sp_1.0-17 png_0.1-7 data.table_1.9.2
loaded via a namespace (and not attached): [1] digest_0.6.8 evaluate_0.5.5 formatR_1.1 grid_3.1.2 knitr_1.9 lattice_0.20-29 memoise_0.2.1
[8] packrat_0.4.3 plyr_1.8.1 Rcpp_0.11.5 stringr_0.6.2 tools_3.1.2
UPDATE
i pinpointed the problem on at least one of the tables that this error occurs. The problem was the setnames function and the key merge feature of data.table.
when the merge happens R recognizes duplicate column names using a ".1" notation (i.e., if table1 and table 2 both have columns names CHEM then TABLE = table1[table2] has columns named CHEM and CHEM.1) whereas knitr is transforming them into CHEM and i.CHEM. to fix this, i originally used the code setnames(TABLE,names(TABLE),c(New column names)). but this didn't recognize the names(TABLE) in the correct order so i was renaming the wrong columns. but this error only happened when it was passed through knitr. when i ran this code through R alone it worked properly. What is the diconnect between knitr and data.table?
I will work on getting an example code up but as it stands the code would need to be simplified to make posting an example helpful.

Resources