Retrieve best number of clusters from NbClust - r

Many functions in R produce some sort of console output (NbClust() is one example). Is there any way of retrieving part of that output (e.g. a certain integer value) programmatically, rather than reading it off the screen? Any way of reading from the console?
Imagine the output would look like the following output from example code provided in the package manual:
[1] "Frey index : No clustering structure in this data set"
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 1 proposed 2 as the best number of clusters
* 2 proposed 4 as the best number of clusters
* 2 proposed 6 as the best number of clusters
* 1 proposed 7 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 4
*******************************************************************
How would I retrieve the value 4 from the last line of the above output?

It is better to work with objects rather than with console output. Any well-behaved function returns structured output that can be accessed with $ (for list components) or @ (for S4 slots); use str() to see the object's structure.
In your case, I think this should work:
length(unique(res$Best.partition))

Another option, relying on the fact that the fourth component of the result is Best.partition (so the largest cluster label is the number of clusters):
max(unlist(res[4]))
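For completeness, here is a minimal sketch of the whole workflow (the data set and clustering settings are placeholders, not from the original question). NbClust() stores its per-index recommendations in the Best.nc component, so the majority vote can be recomputed without reading the console:
library(NbClust)
res <- NbClust(iris[, 1:4], min.nc = 2, max.nc = 8, method = "kmeans")
# Number of clusters in the winning partition:
length(unique(res$Best.partition))
# Recompute the majority vote from the per-index recommendations:
votes <- res$Best.nc["Number_clusters", ]
as.integer(names(which.max(table(votes))))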

Related

Pavlidis Template Matching (PTM, DataVisEasy) R function with 3 levels

I need to perform a correlation analysis on a data frame constructed as follows:
Rows: features --> gene variants related to different levels of severity of the disease we are studying, in the form of a Boolean matrix
Columns: observations --> the list of patients
The discriminant of my analysis is thus the severity, marked as follows:
A: less severe than expected
B: as severe as expected
C: more severe than expected
Suppose I have many more features than observations and I want to use the PTM function with a three-level annotation (i.e. A, B, C) as the match template. The function requires the annotation.level.set.high parameter, but it's not clear to me how it works. For example, if I set annotation.level.set.high = 'A', does that mean I'm comparing A vs. B&C? Can I only compare two groups/classes even when I have multiple levels? My goal is to compare all levels with each other (i.e. A vs. B vs. C), but it is not clear to me how to achieve this comparison, or whether it is possible at all.
Thanks

Calculation of possible combinations if the number of possible properties and their possible states is unknown

I have an unknown number of properties, and each property has an unknown number of possible states. How can I calculate the number of possible combinations?
I'm struggling even to formulate this mathematically, which is why I can't translate it into code.
If all properties could have the same number of states, the number of possible combinations would be simply number_of_possible_combinations = number_possible_states^number_possible_properties.
However, that is not the case.
A coded example would be helpful, or a mathematical formula.
Just multiply the numbers of possible states of all properties; for example, three properties with
3 states
2 states
11 states
give a total of 3 * 2 * 11 = 66 possible combinations.
The case where the number of states is fixed is just a special case of this formula.
In mathematical terms, it is the product of the cardinalities of the sets of states: total = |S_1| * |S_2| * ... * |S_k|.
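In R this is a one-liner (the vector below just encodes the example above):
states <- c(3, 2, 11)  # number of possible states of each property
prod(states)
## [1] 66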

R: Rank cells in a list of matrices based on cell position

I have a list of matrices containing association measurements between GPS tracked animals. One matrix in the list is observed association rates, the others are association rates for randomized versions of the GPS tracking trajectories. For example, I currently have 99 permutations of randomized tracking trajectories resulting in a list of 99 animal association matrices, plus the observed association matrix. I am expecting that for the animals that belong to the same pack, the observed association rates will be higher than the randomized association rates. Accordingly, I would like to determine the rank of the observed rates compared to the randomized rates for each dyad (cell). Essentially, I am doing a rank-permutation test. However, since I am only really concerned with determining if the observed association data is greater than the randomized trajectory association data, any result just giving the rank of the observed cells is sufficient.
ls <- list(matrix(10:18,3,3), matrix(18:10,3,3))
I've seen that sapply can be used to get the ranks of particular cells. Could I do the following for all cells and take the final number in the resulting vector to get the rank of the cell at that position in the list (knowing the position of the observed data in the list of matrices, e.g. last)?
rank(sapply(ls, '[',1,1))
The ideal result would be a matrix of the same form as those in the list giving the rank of the observed data, although any similar solutions are welcome. Thanks in advance.
You can proceed that way, but there are cleaner and quicker methods to get what you want.
Here's some code that takes your ls and produces a 3x3 matrix with the following properties:
if the entry in ls[[1]] is greater than the corresponding entry of ls[[2]], record a 1
if the entry in ls[[1]] is less than the corresponding entry of ls[[2]], record a 2
if the entries are equal, record a 1.5
result <- 1 * (ls[[1]] > ls[[2]]) + 2 * (ls[[1]] < ls[[2]]) + 1.5 * (ls[[1]] == ls[[2]])
How it works: an expression like ls[[1]] > ls[[2]] pulls out the two matrices and compares them element-wise. The result is a TRUE/FALSE matrix, which R treats as 1/0 in arithmetic, so we can multiply it by whatever coefficient we want to represent each situation.
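The same idea generalizes to the full list of 100 matrices. A sketch, assuming (as in the question) that the observed matrix is the last element of the list: stack the list into a three-dimensional array and rank each cell across the third dimension:
arr <- simplify2array(ls)  # rows x cols x length(ls) array
obs.rank <- apply(arr, c(1, 2), function(v) rank(v)[length(v)])
obs.rank  # per-cell rank of the observed (last) matrix, ties averaged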

How to sum up and print the points of all exercises?

I know that in R/exams I can assign points to exercises, e.g. via expoints in the exercise's metadata. However, I don't know how to get the sum of the points across all exercises.
As a specific use case: consider a test that (by formal university requirements) must consist of, say, 90 points. So I need to track the number of points already included via the exercises of the test.
I'm not aware of a variable that tracks this score (if any does).
You are right, this information is not directly available; however, it can be extracted from the metainformation contained in the output of any exams2xyz() interface. As a simple illustration, consider:
library("exams")
set.seed(0)
exm <- exams2pdf(c("swisscapital.Rmd", "deriv.Rmd", "ttest.Rmd"),
                 n = 1, points = c(1, 17, 2))
Now exm is a list with only n = 1 exam, consisting of three exercises, each of which provides its metainformation (among other details). So you can extract the points of the second exercise in the first (and only) exam via:
exm[[1]][[2]]$metainfo$points
## [1] 17
So to get the points from all exercises in the first exam:
sapply(exm[[1]], function(y) y$metainfo$points)
## exercise1 exercise2 exercise3
## 1 17 2
Of course, here the points were explicitly set in exams2pdf() and were thus known. But the same approach can also be used if the points are set via the expoints tag in the individual exercises.
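Summing these then gives the total number of points of the exam, which is the figure that needs to match the 90-point requirement from the question (here just 20):
sum(sapply(exm[[1]], function(y) y$metainfo$points))
## [1] 20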

Item-based recommender system with R

I'm trying to build a simple item-based recommender system, and I decided to use R because of my limited programming experience.
Some issues remain, so I'll try to explain them as methodically as possible.
Log file
I start with a log file imported as a data frame containing many columns, among which: the customer ID, the item ID and the transaction date.
Here's an overview of the log file
'data.frame': 539673 obs. of 3 variables:
$ cid: int 1 1 1 1 2 2 3 4 ...
$ iid : int 1002 1345 1346 2435 6421 4356 1002 4212 ...
$ timestamp : int 1330232453 1330233859 13302349437 1330254065 1330436051
I managed to turn this log file into a matrix with the customers in rows, the products in columns, and the timestamp of the transaction (much easier to manipulate than the date) in the cell where a transaction occurred between the two.
So I end up with a matrix of 100000 rows and 3000 columns, which is pretty huge.
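For reference, a sketch of one way to build such a matrix from the data frame (df, cid, iid and timestamp are the names from the overview above; if a customer/item pair occurs several times, the last row wins):
m <- matrix(0, nrow = length(unique(df$cid)), ncol = length(unique(df$iid)),
            dimnames = list(sort(unique(df$cid)), sort(unique(df$iid))))
# index by (customer, item) name pairs and drop in the timestamps:
m[cbind(as.character(df$cid), as.character(df$iid))] <- df$timestamp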
Similarity matrix
From that point, I can create my item-based recommender system.
First, I binarize my matrix m in order to be able to compute the similarity:
mbin <- (m > 0) + 0
To compute the similarity, I use the cosine measure, implemented as a function:
getCosine <- function(x, y) {
  # cosine similarity between two vectors
  sum(x * y) / (sqrt(sum(x * x)) * sqrt(sum(y * y)))
}
After creating a matrix to receive the similarity values, I wrote two nested loops to fill it:
for (i in 1:ncol(mbin)) {
  for (j in 1:ncol(mbin)) {
    mbin.sim[i, j] <- getCosine(mbin[, i], mbin[, j])  # compare item columns
  }
}
This similarity matrix takes far too long to compute, which is why I now focus on retrieving one particular similarity only.
Note that I've picked an arbitrary column index n; I would also like to be able to select an item by its name.
n <- 5
for (j in 1:ncol(mbin)) {
  mbin.sim[n, j] <- getCosine(mbin[, n], mbin[, j])  # compare item columns
}
How can I achieve that?
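As a side note, a sketch of a vectorized alternative that avoids the loops entirely: for a binary matrix, crossprod() yields all item co-occurrence counts at once and the column norms give the denominators, so the full item-item cosine matrix comes out in one shot. If the columns of mbin carry the item IDs as names, an item can then be looked up by name:
cp <- crossprod(mbin)                 # t(mbin) %*% mbin: co-occurrence counts
norms <- sqrt(diag(cp))               # column (item) norms
mbin.sim <- cp / outer(norms, norms)  # cosine similarity between all items
mbin.sim[, "1002"]                    # similarities of one item, by name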
Building and applying the recommender
From this point I'm stuck, because I can't see how to build a simple recommender that takes one item and recommends k users for it.
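One very simple scoring scheme would be the following (a sketch only; recommend_users is a hypothetical helper, and it assumes mbin has customer IDs as row names): score every customer for a target item by summing the similarities between that item and the items the customer already bought, then return the k best customers:
recommend_users <- function(item, k = 10) {
  scores <- mbin %*% mbin.sim[, item]  # one aggregated score per customer
  rownames(mbin)[head(order(scores, decreasing = TRUE), k)]
}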
Testing
Moreover, to test the recommender, I should be able to go back in time and check whether, from a certain date onward, I can predict the right users.
To do that, I know I have to create a function that gives me the date of the nth transaction. Mathematically, this means that for a particular column I have to get the nth non-zero element of that column. So I tried this, but it only gives me the first one:
firstel <- function(x, n) {
  m <- head(x[x != 0], 1)
  m[n]
}
How can I do that? And moreover, how can I use this value in another function to discriminate between past and future events?
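Regarding the first part: head(x[x != 0], 1) already discards everything past the first non-zero value, which is why firstel() only ever returns the first element. Indexing into the sorted non-zero values fixes that. A sketch of the fix and a possible use (nthel, cutoff and mpast are hypothetical names):
nthel <- function(x, n) {
  sort(x[x != 0])[n]  # n-th smallest non-zero timestamp (NA if fewer than n)
}
# Possible use: treat everything after the n-th transaction of an item as
# "future" and blank it out before computing similarities:
cutoff <- nthel(m[, 5], 3)   # timestamp of the 3rd transaction for item 5
mpast <- m * (m <= cutoff)   # keeps only transactions up to the cutoff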
Sorry for this long post, but I really wanted to show that I have put real work into this and that I want to get past this step so I can begin the actual analysis afterwards.
NB: I'm deliberately avoiding heavy packages because of the huge amount of data.
