Make a simple clustering manually in R

I am trying to make a simple clustering manually (without using any clustering algorithm), based on the distance between the points. I used the Pearson correlation to calculate the distance:
c <- round(cor(t(df)), digits = 2)  # pairwise Pearson correlation between the rows of df
d <- as.dist(1 - c)                 # turn correlation into a distance (1 - r)
I want to cluster all points that have a correlation greater than a certain threshold, for example 0.7. How could I cluster these data points in R?
The first rows and columns of my data frame look like this (there are 188 rows and 31 columns in total):
| | A1 | A2 | A3 | A4 | A5 |
| --- | --- | --- | --- | --- | --- |
| U00 | 0 | 0 | 0 | 0 | 0 |
| U01 | 0 | 0 | 84 | 0 | 0 |
| U02 | 0 | 1 | 0 | 0 | 0 |
| U03 | 0 | 0 | 0 | 0 | 0 |
| U04 | 0 | 0 | 0 | 0 | 0 |
| U05 | 0 | 0 | 0 | 0 | 0 |
| U06 | 0 | 0 | 0 | 0 | 0 |
and the resulting distance matrix:
| | U00 | U01 | U02 | U03 | U04 | U05 |
| --- | --- | --- | --- | --- | --- | --- |
| U01 | 0.05 | | | | | |
| U02 | 1.04 | 1.05 | | | | |
| U03 | 1.04 | 1.04 | 0.92 | | | |
| U04 | 1.04 | 1.04 | 0.92 | 0.00 | | |
| U05 | 1.04 | 1.04 | 0.92 | 0.00 | 0.00 | |
| U06 | 1.04 | 1.04 | 0.92 | 0.00 | 0.00 | 0.00 |
At the end I would like to have an extra column in my data frame with the cluster number. Thank you in advance!

Things like this can be done using the igraph package:
library(igraph)

threshold <- 0.7

# Edges connect points whose absolute correlation exceeds the threshold;
# the connected components of that graph are the clusters. Note cor(t(df))
# correlates the rows (U00, U01, ...), matching the question's distance matrix.
graph_from_adjacency_matrix(abs(cor(t(df))) > threshold, mode = "undirected") %>%
  components() %>%
  membership() %>%
  split(names(.), .)
Note: I took the absolute correlation; you can just remove abs() if you only want positive correlations.
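To get the cluster number as an extra column of the original data frame, as the question asks, here is a minimal follow-up sketch (assuming the same df and threshold as above):
g <- graph_from_adjacency_matrix(abs(cor(t(df))) > threshold, mode = "undirected")
# components(g)$membership is a vector named by the points (U00, U01, ...);
# indexing it by row name aligns the cluster numbers with df
df$cluster <- components(g)$membership[rownames(df)]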

Related

Overlap two equal-dimension matrices in R

I have two matrices, A and B; both have the same dimensions and are binary. I want to overlay matrix A on matrix B.
Matrix A:
| Gene A | Gene B |
| -------- | ----------- |
| 0 | 1 |
| 0 | 1 |
Matrix B:
| Gene A | Gene B |
| -------- | ----------- |
| 1 | 0 |
| 0 | 0 |
Result:
Matrix C:
| Gene A | Gene B |
| -------- | ----------- |
| 1 | 1 |
| 0 | 1 |
The resultant matrix will also have the same dimension as the input.
How can this be done?
Please let me know.
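The result shown is an element-wise logical OR: a cell of C is 1 if it is 1 in A or in B. A minimal sketch, assuming A and B are numeric 0/1 matrices with identical dimensions:
A <- matrix(c(0, 0, 1, 1), nrow = 2, dimnames = list(NULL, c("Gene A", "Gene B")))
B <- matrix(c(1, 0, 0, 0), nrow = 2, dimnames = list(NULL, c("Gene A", "Gene B")))
# element-wise OR: "|" on numeric matrices yields a logical matrix,
# and multiplying by 1 coerces it back to 0/1
C <- (A | B) * 1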

How to remove empty cells and reduce columns

I have a table, that looks roughly like this:
| variable | observer1 | observer2 | observer3 | final |
| -------- | --------- | --------- | --------- | ----- |
| case1 | | | | |
| var1 | 1 | 1 | | |
| var2 | 3 | 3 | | |
| var3 | 4 | 5 | | 5 |
| case2 | | | | |
| var1 | 2 | | 2 | |
| var2 | 5 | | 5 | |
| var3 | 1 | | 1 | |
| case3 | | | | |
| var1 | | 2 | 3 | 2 |
| var2 | | 2 | 2 | |
| var3 | | 1 | 1 | |
| case4 | | | | |
| var1 | 1 | | 1 | |
| var2 | 5 | | 5 | |
| var3 | 3 | | 3 | |
There are three columns for the observers, but only two of them are filled for any given case.
First I want to compute the IRR (inter-rater reliability), so I need a table with two observer columns and no empty cells, like this:
| variable | observer1 | observer2 |
| -------- | --------- | --------- |
| case1 | | |
| var1 | 1 | 1 |
| var2 | 3 | 3 |
| var3 | 4 | 5 |
| case2 | | |
| var1 | 2 | 2 |
| var2 | 5 | 5 |
| var3 | 1 | 1 |
| case3 | | |
| var1 | 2 | 3 |
| var2 | 2 | 2 |
| var3 | 1 | 1 |
| case4 | | |
| var1 | 1 | 1 |
| var2 | 5 | 5 |
| var3 | 3 | 3 |
I am trying to use the tidyverse packages, but I'm not sure how. Some ifelse() magic may be easier.
Is there a clean and easy way to do something like this? Can anybody point me to the right function, or to a keyword to search for on Stack Overflow? I found a lot of methods to remove entirely empty columns or rows, but nothing for this case.
Edit: I removed the link to the original data. It was unnecessary. Thanks to Lamia for his working answer.
Out of your 3 columns observer1, observer2 and observer3, you sometimes have 2 non-NA values, 1 non-NA value, or 3 NA values.
If you want to merge your 3 columns, you could do:
res <- data.frame(df$coding,
                  t(apply(df[paste0("observer", 1:3)], 1,
                          function(x) x[!is.na(x)][1:2])))
For each row, the apply() call returns the two non-NA values if there are two, one non-NA value plus one NA if there is only one, and two NAs if the row has no data.
We then put this result in a dataframe with the first column (coding).
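Since the question mentions the tidyverse, here is a rough dplyr equivalent; this is only a sketch and assumes the column names shown in the question (variable, observer1 to observer3):
library(dplyr)
res <- df %>%
  rowwise() %>%
  mutate(
    # first and second non-NA rating in this row (NA when absent, so the
    # all-empty case header rows survive unchanged)
    rating1 = c(na.omit(c(observer1, observer2, observer3)), NA)[1],
    rating2 = c(na.omit(c(observer1, observer2, observer3)), NA, NA)[2]
  ) %>%
  ungroup() %>%
  select(variable, rating1, rating2)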

Split data based on grouping column

I'm trying to work out how, in Azure ML (and therefore R solutions are acceptable), to randomly split data based on a column, such that all records with any given value in that column wind up on one side of the split or the other. For example:
+------------+------+--------------------+------+
| Student ID | pass | some_other_feature | week |
+------------+------+--------------------+------+
| 1234 | 1 | Foo | 1 |
| 5678 | 0 | Bar | 1 |
| 9101112 | 1 | Quack | 1 |
| 13141516 | 1 | Meep | 1 |
| 1234 | 0 | Boop | 2 |
| 5678 | 0 | Baa | 2 |
| 9101112 | 0 | Bleat | 2 |
| 13141516 | 1 | Maaaa | 2 |
| 1234 | 0 | Foo | 3 |
| 5678 | 0 | Bar | 3 |
| 9101112 | 1 | Quack | 3 |
| 13141516 | 1 | Meep | 3 |
| 1234 | 1 | Boop | 4 |
| 5678 | 1 | Baa | 4 |
| 9101112 | 0 | Bleat | 4 |
| 13141516 | 1 | Maaaa | 4 |
+------------+------+--------------------+------+
Acceptable output from that if I chose, say, a 50/50 split and to be grouped based on the Student ID column would be two new datasets:
+------------+------+--------------------+------+
| Student ID | pass | some_other_feature | week |
+------------+------+--------------------+------+
| 1234 | 1 | Foo | 1 |
| 1234 | 0 | Boop | 2 |
| 1234 | 0 | Foo | 3 |
| 1234 | 1 | Boop | 4 |
| 9101112 | 1 | Quack | 1 |
| 9101112 | 0 | Bleat | 2 |
| 9101112 | 1 | Quack | 3 |
| 9101112 | 0 | Bleat | 4 |
+------------+------+--------------------+------+
and
+------------+------+--------------------+------+
| Student ID | pass | some_other_feature | week |
+------------+------+--------------------+------+
| 5678 | 0 | Bar | 1 |
| 5678 | 0 | Baa | 2 |
| 5678 | 0 | Bar | 3 |
| 5678 | 1 | Baa | 4 |
| 13141516 | 1 | Meep | 1 |
| 13141516 | 1 | Maaaa | 2 |
| 13141516 | 1 | Meep | 3 |
| 13141516 | 1 | Maaaa | 4 |
+------------+------+--------------------+------+
Now, from what I can tell, this is basically the opposite of a stratified split, which would take a random sample with every student represented on both sides.
I would prefer an Azure ML function that did this, but I think that's unlikely, so is there an R function or library that gives this kind of functionality? All I could find were questions about stratification, which obviously don't help me much.
You can use the following command:
library(dplyr)
# one random fold (1 or 2) per distinct student, looked up via each row's ID
data.fold <- mutate(df, fold = sample(rep_len(1:2, n_distinct(`Student ID`)))[as.integer(factor(`Student ID`))])
It returns the original dataframe with a new column that indicates the fold the student is in. If you want more folds, just adjust the 1:2 part.
I've tried the 'sample unique' way but it did not always work for me in the past.
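To materialize the two datasets from the fold column, a small follow-up using the same assumed df:
split1 <- filter(data.fold, fold == 1)  # all rows for students assigned to fold 1
split2 <- filter(data.fold, fold == 2)  # all rows for students assigned to fold 2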

How to subset a dataframe using a column from another dataframe in R?

I have two dataframes.
Dataframe1:
| | Cue | Ass_word | Condition | Freq | Cue_Ass_word |
| --- | --- | --- | --- | --- | --- |
| 1 | ACCENDERE | ACCENDINO | A | 1 | ACCENDERE_ACCENDINO |
| 2 | ACCENDERE | ALLETTARE | A | 0 | ACCENDERE_ALLETTARE |
| 3 | ACCENDERE | APRIRE | A | 1 | ACCENDERE_APRIRE |
| 4 | ACCENDERE | ASCENDERE | A | 1 | ACCENDERE_ASCENDERE |
| 5 | ACCENDERE | ATTIVARE | A | 0 | ACCENDERE_ATTIVARE |
| 6 | ACCENDERE | AUTO | A | 0 | ACCENDERE_AUTO |
| 7 | ACCENDERE | ACCENDINO | B | 2 | ACCENDERE_ACCENDINO |
| 8 | ACCENDERE | ALLETTARE | B | 3 | ACCENDERE_ALLETTARE |
| 9 | ACCENDERE | ACCENDINO | C | 2 | ACCENDERE_ACCENDINO |
| 10 | ACCENDERE | ALLETTARE | C | 0 | ACCENDERE_ALLETTARE |
Dataframe2:
| | Group.1 | x |
| --- | --- | --- |
| 1 | ACCENDERE_ACCENDINO | 5 |
| 13 | ACCENDERE_FUOCO | 22 |
| 16 | ACCENDERE_LUCE | 10 |
| 24 | ACCENDERE_SIGARETTA | 6 |
....
I want to exclude from Dataframe1 all the rows whose words (Cue_Ass_word) are not reported in the Group.1 column of Dataframe2.
In other words, how can I subset Dataframe1 using the strings reported in Dataframe2$Group.1?
It's not quite clear what you mean, but is this what you need?
# keep only the rows whose Cue_Ass_word appears in Dataframe2$Group.1
Dataframe1[Dataframe1$Cue_Ass_word %in% Dataframe2$Group.1, ]
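If you prefer dplyr, a semi_join does the same filtering; a minimal sketch (semi_join() keeps the rows of Dataframe1 that have a match in Dataframe2):
library(dplyr)
result <- semi_join(Dataframe1, Dataframe2, by = c("Cue_Ass_word" = "Group.1"))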

Query performance - 'Left join is null' vs 'Not exists select'

I have a question about a query that I want to execute, but I don't know which version performs best. I need to get all the words, excluding the words that have a relation in the wordfilter table.
Both queries produce the right output, but maybe there is a better solution. I have almost no knowledge of query plans; I'm trying to understand them now.
SELECT CONCAT(SPACE(1), UCASE(stocknews.word.word), SPACE(1)) AS word, stocknews.word.language
FROM stocknews.word
WHERE NOT EXISTS (SELECT word_id FROM stocknews.wordfilter WHERE stocknews.word.id = word_id)
AND user_id = 1
+----+--------------+------------+-------+---------------+---------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | extra |
+----+--------------+------------+-------+---------------+---------+---------+-------+------+-------------+
| 1 | PRIMARY | word | ref | user_id | user_id | 4 | const | 843 | Using where |
| 2 | MATERIALIZED | wordfilter | index | PRIMARY | PRIMARY | 756 | | 16 | Using index |
+----+--------------+------------+-------+---------------+---------+---------+-------+------+-------------+
Against
SELECT CONCAT(SPACE(1), UCASE(stocknews.word.word), SPACE(1)) AS word, stocknews.word.language
FROM stocknews.word
LEFT JOIN stocknews.wordfilter ON stocknews.word.id = stocknews.wordfilter.word_id
WHERE stocknews.wordfilter.word_id IS NULL AND user_id = 1
+----+-------------+------------+------+---------------+---------+---------+---------+------+--------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | extra |
+----+-------------+------------+------+---------------+---------+---------+---------+------+--------------------------------------+
| 1 | SIMPLE | word | ref | user_id | user_id | 4 | const | 843 | |
| 1 | SIMPLE | wordfilter | ref | PRIMARY | PRIMARY | 4 | word.id | 1 | Using where; Using index; Not exists |
+----+-------------+------------+------+---------------+---------+---------+---------+------+--------------------------------------+
Any help is welcome! An explanation would be nice.
Edit:
For query 1:
+----------------------------+-------+
| Variable_name | Value |
+----------------------------+-------+
| Handler_commit | 1 |
| Handler_delete | 0 |
| Handler_discover | 0 |
| Handler_external_lock | 0 |
| Handler_icp_attempts | 0 |
| Handler_icp_match | 0 |
| Handler_mrr_init | 0 |
| Handler_mrr_key_refills | 0 |
| Handler_mrr_rowid_refills | 0 |
| Handler_prepare | 0 |
| Handler_read_first | 1 |
| Handler_read_key | 1044 |
| Handler_read_last | 0 |
| Handler_read_next | 859 |
| Handler_read_prev | 0 |
| Handler_read_rnd | 0 |
| Handler_read_rnd_deleted | 0 |
| Handler_read_rnd_next | 0 |
| Handler_rollback | 0 |
| Handler_savepoint | 0 |
| Handler_savepoint_rollback | 0 |
| Handler_tmp_update | 0 |
| Handler_tmp_write | 215 |
| Handler_update | 0 |
| Handler_write | 0 |
+----------------------------+-------+
25 rows in set (0.00 sec)
For query 2:
+----------------------------+-------+
| Variable_name | Value |
+----------------------------+-------+
| Handler_commit | 1 |
| Handler_delete | 0 |
| Handler_discover | 0 |
| Handler_external_lock | 0 |
| Handler_icp_attempts | 0 |
| Handler_icp_match | 0 |
| Handler_mrr_init | 0 |
| Handler_mrr_key_refills | 0 |
| Handler_mrr_rowid_refills | 0 |
| Handler_prepare | 0 |
| Handler_read_first | 0 |
| Handler_read_key | 844 |
| Handler_read_last | 0 |
| Handler_read_next | 843 |
| Handler_read_prev | 0 |
| Handler_read_rnd | 0 |
| Handler_read_rnd_deleted | 0 |
| Handler_read_rnd_next | 0 |
| Handler_rollback | 0 |
| Handler_savepoint | 0 |
| Handler_savepoint_rollback | 0 |
| Handler_tmp_update | 0 |
| Handler_tmp_write | 0 |
| Handler_update | 0 |
| Handler_write | 0 |
+----------------------------+-------+
It seems to be a close race between the two formulations. (Some other example may show a clearer winner.)
From the Handler values: Query 1 did more read_key operations, plus some writing (which goes along with the MATERIALIZED step). The other numbers were about the same. So I conclude that Query 1 is slower, although possibly not enough slower to make much difference.
I vote for LEFT JOIN as the better query pattern (in this case).
