How to run mclust faster on a 50,000-record dataset - R

I am a beginner. I am trying to cluster a data frame (50,000 records) that has 2 features (x, y) using the mclust package. However, it feels like forever to run a command (e.g. Mclust(XXX.df) or densityMclust(XXX.df)).
Is there any way to execute the command faster? Example code would be helpful.
For your information, I'm using a 4-core processor with 6 GB RAM. It took me about 15 minutes or so to do the same analysis (clustering) with Weka, while in R the process has been running for over 1.5 hours. I do really want to use R for the analysis.

Dealing with large datasets in mclust is described in the Technical Report, subsection 11.1.
Briefly, functions Mclust and mclustBIC include a provision for using a subsample of the data in the hierarchical clustering phase before applying EM to the full data set, in order to extend the method to larger datasets.
Generic example:
library(mclust)
set.seed(1)
##
## Data generation
##
N <- 5e3
df <- data.frame(x=rnorm(N)+ifelse(runif(N)>0.5,5,0), y=rnorm(N,10,5))
##
## Full set
##
system.time(res <- Mclust(df))
# > user system elapsed
# > 66.432 0.124 67.439
##
## Subset for initial stage
##
M <- 1e3
system.time(res <- Mclust(df, initialization=list(subset=sample(1:nrow(df), size=M))))
# > user system elapsed
# > 19.513 0.020 19.546
"Subsetted" version runs approximately 3.5 times faster on my Dual Core (although Mclust uses only single core).
When N<-5e4 (as in your example) and M<-1e3 it took about 3.5 minutes for version with subset to complete.
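The question also mentions densityMclust; the same subsetting should apply there, assuming extra arguments are passed through to Mclust. A minimal sketch, untested, reusing df and M from above:
## Hedged sketch: densityMclust forwards its extra arguments to Mclust,
## so the same subset initialization should work for density estimation.
dres <- densityMclust(df, initialization = list(subset = sample(1:nrow(df), size = M)))
plot(dres, what = "density", data = df)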

Related

R - How to speed up panel data construction process using for-loop

I am planning to construct a panel dataset and, as a first step, I am trying to create a vector that repeats each of the 99,573 unique ids 25 times (the same 25 repetitions will be assigned to the years later). What I have so far is:
unique_id = c(1:99573)
panel = c()
for (i in 1:99573) {
  x = rep(unique_id[i], 25)
  panel = append(panel, x)
}
The problem is that the code above takes too much time. RStudio keeps processing and does not give me any output. Could there be any other way to speed up the process? Please share any ideas with me.
We don't need a loop here:
panel <- rep(unique_id, each = 25)
Benchmarks
system.time(panel <- rep(unique_id, each = 25))
# user system elapsed
# 0.046 0.002 0.047
length(panel)
#[1] 2489325
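If you later need to attach the years as well, the same vectorized idea extends. A sketch (the year values 1:25 below are placeholders, not taken from the question):
# Sketch: pair each id with a year in one vectorized step.
years <- 1:25
panel_df <- data.frame(
  id   = rep(unique_id, each = length(years)),
  year = rep(years, times = length(unique_id))
)
nrow(panel_df)
#[1] 2489325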

How to reduce the size of the data in R?

I have a CSV file with 600,000 rows and 1,339 columns, about 1.6 GB on disk. 1,337 of the columns are binary, taking either 1 or 0 values, and the other 2 columns are a numeric and a character variable.
I pulled the data in using the readr package with the following code:
VLU_All_Before_Wide <- read_csv("C:/Users/petas/Desktop/VLU_All_Before_Wide_Sample.csv")
When I checked the object size using the following code, it was about 3 GB.
> print(object.size(VLU_All_Before_Wide),units="Gb")
3.2 Gb
In the next step, using the code below, I want to create training and test sets for LASSO regression.
set.seed(1234)
train_rows <- sample(1:nrow(VLU_All_Before_Wide), .7*nrow(VLU_All_Before_Wide))
train_set <- VLU_All_Before_Wide[train_rows,]
test_set <- VLU_All_Before_Wide[-train_rows,]
yall_tra <- data.matrix(subset(train_set, select=VLU_Incidence))
xall_tra <- data.matrix(subset(train_set, select=-c(VLU_Incidence,Replicate)))
yall_tes <- data.matrix(subset(test_set, select=VLU_Incidence))
xall_tes <- data.matrix(subset(test_set, select=-c(VLU_Incidence,Replicate)))
When I started my R session the RAM usage was at ~3 GB, and by the time I had executed all the above code it was at 14 GB, leaving me with an error saying it cannot allocate a vector of size 4 GB. There was no other application running other than 3 Chrome windows. I removed the original dataset and the training and test datasets, but that only freed 0.7 to 1 GB of RAM.
rm(VLU_All_Before_Wide)
rm(test_set)
rm(train_set)
I would appreciate it if someone could point me to a way to reduce the size of the data.
Thanks
R struggles with huge datasets because it tries to load and keep all the data in RAM. You can use other R packages that are made to handle big datasets, such as bigmemory and ff. Check my answer here, which addresses a similar issue.
You can also choose to do some data processing and manipulation outside R and remove unnecessary columns and rows. But still, to handle big datasets, it's better to use the packages built for that purpose.
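As a rough illustration of the bigmemory route (a sketch, untested: the file path is the one from the question, the "short" storage type is my assumption, and the character column would have to be dropped or recoded first because a big.matrix stores a single type):
library(bigmemory)
# Sketch: file-backed matrix instead of an in-RAM data frame; only a small
# descriptor lives in memory, the data itself stays on disk.
VLU_big <- read.big.matrix("C:/Users/petas/Desktop/VLU_All_Before_Wide_Sample.csv",
                           header = TRUE,
                           type = "short",   # 2-byte integers are enough for 0/1 columns
                           backingfile = "VLU_All.bin",
                           descriptorfile = "VLU_All.desc")
dim(VLU_big)
For the LASSO step itself, another option is a sparse matrix from the Matrix package, since glmnet accepts sparse inputs and 1,337 of the columns are 0/1.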

Removing duplicates requires a transpose, but my dataframe is too large

I had asked a question here. I had a simple dataframe, for which I was attempting to remove duplicates. Very basic question.
Akrun gave a great answer, which was to use this line:
df[!duplicated(data.frame(t(apply(df[1:2], 1, sort)), df$location)),]
I went ahead and did this, which worked great on the dummy problem. But I have 3.5 million records that I'm trying to filter.
In an attempt to see where the bottleneck is, I broke the code into steps.
step1 <- apply(df1[1:2], 1, sort)
step2 <- t(step1)
step3 <- data.frame(step2, df1$location)
step4 <- !duplicated(step3)
final <- df1[step4, ,]
Step 1 took quite a long time, but it wasn't the worst offender.
Step 2, however, is clearly the culprit.
So I'm in the unfortunate situation where I'm looking for a way to transpose 3.5 million rows in R. (Or maybe not in R. Hopefully there is some way to do it somewhere).
Looking around, I saw a few ideas
install the WGCNA library, which has a transposeBigData function. Unfortunately this package is no longer being maintained, and I can't install all the dependencies.
write the data to a csv, then read it in line by line, and transpose each line one at a time. For me, even writing the file ran overnight without completing.
This is really strange. I just want to remove duplicates. For some reason, I have to transpose a dataframe in this process. But I can't transpose a dataframe this large.
So I need a better strategy for either removing duplicates, or for transposing. Does anyone have any ideas on this?
By the way, I'm using Ubuntu 14.04, with 15.6 GiB RAM, for which cat /proc/cpuinfo returns
model name : Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
cpu MHz    : 1200.000
cache size : 6144 KB
Thanks.
df <- data.frame(id1 = c(1,2,3,4,9), id2 = c(2,1,4,5,10), location=c('Alaska', 'Alaska', 'California', 'Kansas', 'Alaska'), comment=c('cold', 'freezing!', 'nice', 'boring', 'cold'))
A faster option would be using pmin/pmax with data.table; the pair (pmin(id1, id2), pmax(id1, id2)) gives each row an order-independent key, so the row-wise sort and transpose are no longer needed:
library(data.table)
setDT(df)[!duplicated(data.table(pmin(id1, id2), pmax(id1, id2)))]
# id1 id2 location comment
#1: 1 2 Alaska cold
#2: 3 4 California nice
#3: 4 5 Kansas boring
#4: 9 10 Alaska cold
If 'location' also needs to be included to find the unique rows:
setDT(df)[!duplicated(data.table(pmin(id1, id2), pmax(id1, id2), location))]
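An equivalent formulation (a sketch, not from the original answer) builds the normalized key columns once and lets unique() do the de-duplication:
# Sketch: same idea, with explicit key columns and unique(..., by = ...).
dt <- as.data.table(df)
dt[, `:=`(lo = pmin(id1, id2), hi = pmax(id1, id2))]
res <- unique(dt, by = c("lo", "hi", "location"))
res[, c("lo", "hi") := NULL]
res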
So after struggling with this for most of the weekend (grateful for plenty of selfless help from the illustrious @akrun), I realized that I would need to go about this in a completely different manner.
Since the dataframe was simply too large to process in memory, I ended up using a strategy where I pasted together a (string) key and column-bound it onto the dataframe. Next, I collapsed the key and sorted its characters. Then I could use which to get the indices of the rows that contained non-duplicate keys, and with that I could filter my dataframe.
# paste a string key onto the dataframe, then sort the characters of each key
df_with_key <- within(df, key <- paste(boxer1, boxer2, date, location, sep = ""))
strSort <- function(x) {
  sapply(lapply(strsplit(x, NULL), sort), paste, collapse = "")
}
df_with_key$key <- strSort(df_with_key$key)
# keep only the rows whose sorted key has not been seen before
idx <- which(!duplicated(df_with_key$key))
final_df <- df[idx, ]

ANOVA in R using summary data

Is it possible to run an ANOVA in R with only the means, standard deviations and n-values? Here is my data frame:
q2data.mean <- c(90,85,92,100,102,106)
q2data.sd <- c(9.035613,11.479667,9.760268,7.662572,9.830258,9.111457)
q2data.n <- c(9,9,9,9,9,9)
q2data.frame <- data.frame(q2data.mean, q2data.sd, q2data.n)
I am trying to find the mean square residual, so I want to take a look at the ANOVA table.
Any help would be really appreciated! :)
Here you go, using ind.oneway.second from the rpsychi package:
library(rpsychi)
with(q2data.frame, ind.oneway.second(q2data.mean,q2data.sd,q2data.n) )
#$anova.table
#                SS df     MS     F
#Between (A) 2923.5  5 584.70 6.413
#Within      4376.4 48  91.18
#Total       7299.9 53
# etc etc
Update: the rpsychi package was archived in March 2022, but the function is still available here: http://github.com/cran/rpsychi/blob/master/R/ind.oneway.second.R (hat-tip to @jrcalabrese in the comments)
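If you'd rather not depend on the archived package, the table can also be computed directly from the summary statistics; a small base R sketch using the same inputs:
# Sketch: one-way ANOVA table from group means, SDs and group sizes only.
k  <- length(q2data.mean)                 # number of groups
N  <- sum(q2data.n)                       # total sample size
gm <- sum(q2data.n * q2data.mean) / N     # grand mean
ss_between <- sum(q2data.n * (q2data.mean - gm)^2)
ss_within  <- sum((q2data.n - 1) * q2data.sd^2)
ms_between <- ss_between / (k - 1)
ms_within  <- ss_within / (N - k)         # this is the mean square residual
F_value    <- ms_between / ms_within
# SS between = 2923.5, SS within = 4376.4, MS within = 91.18, F = 6.41,
# matching the rpsychi table above.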
As an unrelated side note, your data could do with some renaming. q2data.frame is a data.frame; there's no need to put that in the name. Also, there's no need for the q2data prefix inside q2data.frame - surely mean would suffice. It just means you end up with complex code like:
q2data.frame$q2data.mean
when:
q2$mean
would give you all the info you need.

MPI parallelization using SNOW is slow

My foray into parallelization continues. I initially had difficulty installing Rmpi, but I got that going (I needed to sudo apt-get it). I should say that I'm running a machine with Ubuntu 10.10.
I ran the same simulation as in my previous question. Recall the system times for the unclustered and SNOW socket-cluster versions, respectively:
> system.time(CltSim(nSims=10000, size=100))
user system elapsed
0.476 0.008 0.484
> system.time(ParCltSim(cluster=cl, nSims=10000, size=100))
user system elapsed
0.028 0.004 0.375
Now, using an MPI cluster, I get a slowdown relative to not clustering:
> stopCluster(cl)
> cl <- getMPIcluster()
> system.time(ParCltSim(cluster=cl, nSims=10000, size=100))
user system elapsed
0.088 0.196 0.604
Not sure whether this is useful, but here's info on the cluster created:
> cl
[[1]]
$rank
[1] 1
$RECVTAG
[1] 33
$SENDTAG
[1] 22
$comm
[1] 1
attr(,"class")
[1] "MPInode"
[[2]]
$rank
[1] 2
$RECVTAG
[1] 33
$SENDTAG
[1] 22
$comm
[1] 1
attr(,"class")
[1] "MPInode"
attr(,"class")
[1] "spawnedMPIcluster" "MPIcluster" "cluster"
Any idea about what could be going on here? Thanks for your assistance as I try out these parallelization options.
Cordially,
Charlie
It's a bit the same as with your other question: the communication between the nodes in the cluster takes up more time than the actual function.
This can be illustrated by changing your functions:
library(snow)
cl <- makeCluster(2)

SnowSim <- function(cluster, nSims = 10, n) {
  parSapply(cluster, 1:nSims, function(x) {
    Sys.sleep(n)
    x
  })
}

library(foreach)
library(doSNOW)
registerDoSNOW(cl)

ForSim <- function(nSims = 10, n) {
  foreach(i = 1:nSims, .combine = c) %dopar% {
    Sys.sleep(n)
    i
  }
}
This way we can simulate a long-running and a short-running function with different numbers of simulations. Let's take two cases: one with a 1-second calculation and 10 iterations, and one with a 1 ms calculation and 10,000 iterations. Both should last 10 seconds when run serially:
> system.time(SnowSim(cl,10,1))
user system elapsed
0 0 5
> system.time(ForSim(10,1))
user system elapsed
0.03 0.00 5.03
> system.time(SnowSim(cl,10000,0.001))
user system elapsed
0.02 0.00 9.78
> system.time(ForSim(10000,0.001))
user system elapsed
10.04 0.00 19.81
Basically what you see is that for long-running functions with few simulations, the parallelized versions cleanly cut the calculation time in half, as expected.
Your simulations are of the second kind. There you see that the snow solution doesn't really make a difference any more, and that the foreach solution even takes twice as long. This is simply due to the overhead of communication to and between the nodes and the handling of the data that gets returned. The overhead of foreach is a lot bigger than that of snow, as shown in my answer to your previous question.
I didn't fire up my Ubuntu to try an MPI cluster, but it's basically the same story: there are subtle differences between the different cluster types in the time needed for communication, partly due to differences between the underlying packages.
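One common remedy for the many-tiny-tasks case (a sketch, not part of the original answer) is to hand each worker one block of iterations, so communication happens only once per node rather than once per task:
# Sketch: send one chunk of indices per node instead of 10000 tiny tasks.
ChunkedSnowSim <- function(cluster, nSims = 10000, n = 0.001) {
  chunks <- clusterSplit(cluster, 1:nSims)            # one block of indices per node
  res <- parLapply(cluster, chunks, function(idx, sleep) {
    sapply(idx, function(x) { Sys.sleep(sleep); x })  # plain serial loop on the worker
  }, sleep = n)
  unlist(res)
}
# system.time(ChunkedSnowSim(cl, 10000, 0.001))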
