Setting cluster id in MClust - r

When I cluster a dataset using Mclust, I use the following code:
i = 2
print(paste("Number of clusters =", i))
cluster_model1 <- Mclust(cc[2:6], G=i)
When I repeat the clustering, the cluster classification (id) in each iteration can stay the same or it can flip, from 1 to 2 or from 2 to 1. Is it possible to set the cluster id so that it does not change arbitrarily? I want to count how many times data from 10 imputed datasets belongs to cluster 1 or cluster 2, and I can calculate this only if the cluster id stays the same.
The dataset cc has this data:
head(cc[2:6])
ea pa sa en pn
1 1.0 1.0 1.0 2.2 1.6
2 3.2 2.4 1.0 3.2 1.8
3 1.2 1.0 1.0 2.0 1.0
4 1.6 1.2 1.2 1.0 1.2
5 3.6 1.0 1.6 4.0 2.6
6 1.6 1.0 1.4 1.4 1.2
When I cluster, the classification could be
head(cluster_model1$classification)
[1] 2 1 2 1 1 1
or
head(cluster_model1$classification)
[1] 1 2 1 2 2 2
While the clustering results are correct, is it possible to get the labels as 2 1 2 1 1 1 every time the clustering is done?
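One possible approach, as a rough sketch: the numbering of mixture components is arbitrary, so the ids can be renumbered after each fit according to a rule that does not depend on the run, for example the ordering of the cluster means of one variable. The helper name relabel_by_mean and the choice of ea as the reference variable are assumptions for illustration, not part of mclust. The sketch uses only base R and the six rows shown in the question.

```r
# Hypothetical helper (not part of mclust): renumber cluster ids so that they
# follow the ordering of the per-cluster mean of one reference variable.
relabel_by_mean <- function(classification, x, decreasing = FALSE) {
  # mean of x within each old cluster id, as a plain named vector
  centers <- vapply(split(x, classification), mean, numeric(1))
  # old ids sorted by their mean; the position in this vector is the new id
  old_ids <- names(centers)[order(centers, decreasing = decreasing)]
  match(as.character(classification), old_ids)
}

# First six values of cc$ea from the question, with both labelings shown there
ea    <- c(1.0, 3.2, 1.2, 1.6, 3.6, 1.6)
run_a <- c(2, 1, 2, 1, 1, 1)
run_b <- c(1, 2, 1, 2, 2, 2)

relabel_by_mean(run_a, ea, decreasing = TRUE)
relabel_by_mean(run_b, ea, decreasing = TRUE)
```

Both calls return 2 1 2 1 1 1. With a fitted model this would be called as relabel_by_mean(cluster_model1$classification, cc$ea, decreasing = TRUE). It only gives comparable ids across the imputed datasets if the ordering of the cluster means on the reference variable is the same in every dataset, which is worth checking.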

Related

Organize data to match quantities of table A to the quantities of table B, and vice versa

I have these two tables. I want each row of the two tables to match in quantity, with the difference carried down to the next row and matched against the following row.
Current Data:
Quant Symbol Price
20 B 1.5
15 B 1.8
31 B 1.9
14 B 2.2
20 B 2.3
10 B 2.5
and
Quant Symbol Price
20 S 2.6
10 S 2
35 S 1.8
15 S 1.6
10 S 1.5
I would like it to turn to this table.
Quant Symbol Price
20 B 1.5
10 B 1.8
5 B 1.8
30 B 1.9
1 B 1.9
14 B 2.2
10 B 2.3
10 B 2.3
and
Quant Symbol Price
20 S 2.6
10 S 2
5 S 1.8
30 S 1.8
1 S 1.6
14 S 1.6
10 S 1.5
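A possible way to get there, sketched in base R: take the running totals of both tables, pool them into one set of break points, and cut each table at every break point that falls within its own total. The names buy, sell and split_at_breaks are made up for this illustration, and the sketch assumes positive whole-number quantities.

```r
buy  <- data.frame(Quant = c(20, 15, 31, 14, 20, 10), Symbol = "B",
                   Price = c(1.5, 1.8, 1.9, 2.2, 2.3, 2.5))
sell <- data.frame(Quant = c(20, 10, 35, 15, 10), Symbol = "S",
                   Price = c(2.6, 2, 1.8, 1.6, 1.5))

# Hypothetical helper: cut a table at every pooled running total
split_at_breaks <- function(tbl, breaks) {
  ends <- cumsum(tbl$Quant)                   # running total after each row
  cuts <- breaks[breaks <= sum(tbl$Quant)]    # break points inside this table
  # original row that each piece comes from (piece ends in (ends[i-1], ends[i]])
  row_id <- findInterval(cuts, ends, left.open = TRUE) + 1
  data.frame(Quant  = diff(c(0, cuts)),       # size of each piece
             Symbol = tbl$Symbol[row_id],
             Price  = tbl$Price[row_id])
}

breaks     <- sort(unique(c(cumsum(buy$Quant), cumsum(sell$Quant))))
buy_split  <- split_at_breaks(buy, breaks)
sell_split <- split_at_breaks(sell, breaks)
```

For the S table this reproduces the desired rows. For the B table it reproduces the desired rows and also keeps the final unmatched 10 B 2.5 row, which the desired table in the question does not list; drop it if unmatched quantity should be discarded.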

Setting values to NA in one column based on conditions in another column

Here's a simplified mock dataframe:
df1 <- data.frame(amb = c(2.5,3.6,2.1,2.8,3.4,3.2,1.3,2.5,3.2),
                  warm = c(3.6,5.3,2.1,6.3,2.5,2.1,2.4,6.2,1.5),
                  sensor = c(1,1,1,2,2,2,3,3,3))
I'd like to set all values in the "amb" column to NA if they're in sensor 1, but retain the values in the "warm" column for sensor 1. Here's what I'd like the final output to look like:
amb warm sensor
NA 3.6 1
NA 5.3 1
NA 2.1 1
2.8 6.3 2
3.4 2.5 2
3.2 2.1 2
1.3 2.4 3
2.5 6.2 3
3.2 1.5 3
Using R version 4.0.2, Mac OS X 10.13.6
A possible solution, based on dplyr:
library(dplyr)
df1 %>%
  mutate(amb = ifelse(sensor == 1, NA, amb))
#> amb warm sensor
#> 1 NA 3.6 1
#> 2 NA 5.3 1
#> 3 NA 2.1 1
#> 4 2.8 6.3 2
#> 5 3.4 2.5 2
#> 6 3.2 2.1 2
#> 7 1.3 2.4 3
#> 8 2.5 6.2 3
#> 9 3.2 1.5 3
This seems best handled with the vectorized replacement function is.na<-:
is.na(df1$amb) <- df1$sensor %in% c(1) # that c() isn't needed
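To be more general and allow for floating-point rounding when testing for equality, compare the absolute difference against a small tolerance. The abs() matters, since without it every sensor value below 1 would also match, and a tolerance of 1e-16 is finer than double precision near 1, so it behaves like ==:
is.na(df1$amb) <- abs(df1$sensor - 1) < 1e-8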
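A small self-contained check of the tolerance idea, as a sketch: a comparison of this kind only tests "approximately equal to 1" when the difference is wrapped in abs(). The tolerance 1e-8 and the names tol, near_one and probe are assumptions chosen for illustration.

```r
df1 <- data.frame(amb = c(2.5, 3.6, 2.1, 2.8, 3.4, 3.2, 1.3, 2.5, 3.2),
                  warm = c(3.6, 5.3, 2.1, 6.3, 2.5, 2.1, 2.4, 6.2, 1.5),
                  sensor = c(1, 1, 1, 2, 2, 2, 3, 3, 3))

tol <- 1e-8                                # assumed tolerance, adjust to the data
near_one <- abs(df1$sensor - 1) < tol      # TRUE only for sensor values close to 1
is.na(df1$amb) <- near_one                 # set those amb values to NA

# Why abs() matters: a one-sided test also flags values below 1
probe <- c(0, 1, 2)
probe - 1 < tol         # TRUE TRUE FALSE
abs(probe - 1) < tol    # FALSE TRUE FALSE
```

The warm column is untouched, and only the three sensor 1 rows of amb become NA.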

R: convert data frame values from integer to double based on the column index

I have a data frame where each column corresponds to a variable and each value is a numerical category, e.g. 0, 1 or 2.
df <- data.frame(TKI=c(1,1,2,0,1),
                 Chemo=c(1,2,2,0,1),
                 Radio=c(1,1,2,0,1),
                 EGFR=c(1,2,2,0,1),
                 ALK=c(1,1,2,0,1))
df
TKI Chemo Radio EGFR ALK
1 1 1 1 1 1
2 1 2 1 2 1
3 2 2 2 2 2
4 0 0 0 0 0
5 1 1 1 1 1
I would like to convert each value to a double based on the column index, with the column index as the integer part and the category as the first decimal. For example, the table above would be transformed into this:
1.1 2.1 3.1 4.1 5.1
1.1 2.2 3.1 4.2 5.1
1.2 2.2 3.2 4.2 5.2
1.0 2.0 3.0 4.0 5.0
1.1 2.1 3.1 4.1 5.1
I also would like to obtain the list of columns and their indexes, such as this:
1 - TKI
2 - Chemo
3 - Radio
4 - EGFR
5 - ALK
How can I do this conversion?
Thank you for your time and help!
You can use col to get the column index and divide the dataframe by 10 to create a decimal value and add the two numbers together.
col(df) + df/10
# TKI Chemo Radio EGFR ALK
#1 1.1 2.1 3.1 4.1 5.1
#2 1.1 2.2 3.1 4.2 5.1
#3 1.2 2.2 3.2 4.2 5.2
#4 1.0 2.0 3.0 4.0 5.0
#5 1.1 2.1 3.1 4.1 5.1
To get the column names and their positions, you can do:
ref_df <- data.frame(index = seq_along(df),
                     names = names(df))
ref_df
# index names
#1 1 TKI
#2 2 Chemo
#3 3 Radio
#4 4 EGFR
#5 5 ALK

inner join with multiple conditions r data table

I am trying to do an inner join using data.table with multiple, fairly dynamic conditions, and I am getting tripped up on the syntax. First, I create two objects, x and x2, that I want to join.
library(data.table)
set.seed(1)
# generate data
x = data.table(CJ(t=1:10, d=1:3,p1s=seq(1,3,by=0.1),p1sLAST=seq(1,3,by=0.1)))
x[d==1,p1sLAST:=3]
x=x[p1s<=p1sLAST]
x2 = data.table(CJ(tprime=1:10, p1sLASTprm=seq(1,3,by=0.1)))
With the objects:
> x
t d p1s p1sLAST
1: 1 1 1.0 3.0
2: 1 1 1.0 3.0
3: 1 1 1.0 3.0
4: 1 1 1.0 3.0
5: 1 1 1.0 3.0
---
9026: 10 3 2.8 2.9
9027: 10 3 2.8 3.0
9028: 10 3 2.9 2.9
9029: 10 3 2.9 3.0
9030: 10 3 3.0 3.0
> x2
tprime p1sLASTprm
1: 1 1.0
2: 1 1.1
3: 1 1.2
4: 1 1.3
5: 1 1.4
---
206: 10 2.6
207: 10 2.7
208: 10 2.8
209: 10 2.9
210: 10 3.0
Now, I want to do these last three steps in a single inner join.
joined = x[,x2[],by=names(x)]
joined=joined[(p1sLASTprm==p1s & d!=3) | (d==3 & p1sLASTprm==3)]
joined=joined[tprime==t+1]
Resulting in the final output:
> joined
t d p1s p1sLAST tprime p1sLASTprm
1: 1 1 1.0 3.0 2 1.0
2: 1 1 1.1 3.0 2 1.1
3: 1 1 1.2 3.0 2 1.2
4: 1 1 1.3 3.0 2 1.3
5: 1 1 1.4 3.0 2 1.4
---
4343: 9 3 2.8 2.9 10 3.0
4344: 9 3 2.8 3.0 10 3.0
4345: 9 3 2.9 2.9 10 3.0
4346: 9 3 2.9 3.0 10 3.0
4347: 9 3 3.0 3.0 10 3.0
I do not think a single inner join can accomplish those three steps: the filter contains an | (OR), so the result is most likely a union of two separate joins.
A more memory-efficient approach could be:
ux <- unique(x)[, upt := t+1]
rbindlist(list(
    ux[d!=3][x2,
        c(mget(names(ux)), mget(names(x2))),
        on=c("p1s"="p1sLASTprm", "upt"="tprime"),
        nomatch=0L],
    ux[d==3][x2[p1sLASTprm==3],
        c(mget(names(ux)), mget(names(x2))),
        on=c("upt"="tprime"),
        nomatch=0L]
))

R: Calculating mean value for preceding rows and within groups

x<-c("A","B")
y<-c(1:10)
dat<-expand.grid(visit=y,site=x)
I would like to get a column that has the mean value for visit of the preceding rows within each site. The first visits will have no values.
An example of the returned data:
visit site mean
1 A
2 A 1
3 A 1.5
4 A 2
5 A 2.5
6 A 3
1 B etc..
Using y = 1:6 for this, to match the example in the question.
You can get the running averages with by and cumsum:
with(dat, by(visit, site, FUN=function(x) cumsum(x)/1:length(x)))
## site: A
## [1] 1.0 1.5 2.0 2.5 3.0 3.5
## -----------------------------------------------------------------------------------------------------
## site: B
## [1] 1.0 1.5 2.0 2.5 3.0 3.5
These are almost what you want. You want them shifted by one, and don't want the last entry. That's easy enough to do (if a bit odd of a requirement).
with(dat, by(visit, site, FUN=function(x) c(NA, head(cumsum(x)/1:length(x), -1))))
## site: A
## [1] NA 1.0 1.5 2.0 2.5 3.0
## -----------------------------------------------------------------------------------------------------
## site: B
## [1] NA 1.0 1.5 2.0 2.5 3.0
And you can easily present these in a single column with unlist:
dat$mean <- unlist(with(dat, by(visit, site, FUN=function(x) c(NA, head(cumsum(x)/1:length(x), -1)))))
dat
## visit site mean
## 1 1 A NA
## 2 2 A 1.0
## 3 3 A 1.5
## 4 4 A 2.0
## 5 5 A 2.5
## 6 6 A 3.0
## 7 1 B NA
## 8 2 B 1.0
## 9 3 B 1.5
## 10 4 B 2.0
## 11 5 B 2.5
## 12 6 B 3.0
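A more compact variant, offered as a sketch: ave() applies a function within each group and returns the values in the original row order, so it does not rely on the rows already being sorted by site the way unlist(by(...)) does. This uses y = 1:6 as above; seq_along(x) stands in for 1:length(x).

```r
dat <- expand.grid(visit = 1:6, site = c("A", "B"))

# Running mean of visit within each site, shifted down by one row so that
# each row holds the mean of the rows before it (NA for the first visit).
dat$mean <- ave(as.numeric(dat$visit), dat$site,
                FUN = function(x) c(NA, head(cumsum(x) / seq_along(x), -1)))
dat
```

The mean column comes out as NA, 1.0, 1.5, 2.0, 2.5, 3.0 for each site, the same as the table above.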
