I have the results of two clusterings and I would like to create vectors so that all features that belong to a cluster are listed in a vector.
The following data frame results from a clustering algorithm. The "C" columns are the cluster assignments from the two different algorithms.
| A1 | A2 | A3 | A4 | A5 | C1 | C2 |
| -- | -- | -- | -- | -- | -- | -- |
| 0 | 0 | 0 | 15 | 0 | 1 | 1 |
| 0 | 20 | 34 | 0 | 0 | 2 | 2 |
| 33 | 0 | 0 | 7 | 0 | 1 | 1 |
| 0 | 0 | 0 | 0 | 85 | 3 | 2 |
| 0 | 0 | 0 | 0 | 94 | 3 | 2 |
| 0 | 12 | 57 | 0 | 0 | 2 | 2 |
I want to create one vector for each cluster so that at the end I have
c11 = ['A1','A4']
c12 = ['A2','A3']
c13 = ['A5']
c21 = ['A1','A4']
c22 = ['A2','A3', 'A5']
EDIT:
To be more specific, the code should create a vector for each cluster in this way: if the cluster has a value different from 0 for a feature in any of the cluster-specific rows, then add this feature to the vector.
For the second clustering, the first step looks at cluster C21 (rows 1 and 3); according to these rows, the features A1 and A4 can be positive in instances of that cluster. The second step looks at rows 2, 4, 5 and 6 for C22. There A2 and A3 can be positive (according to the 2nd and 6th rows) and A5 as well (according to the 4th and 5th rows).
EDIT2: The question has already been solved, but I still have a question about the special case where each instance can be assigned to multiple clusters and the clusters are not disjoint. In this case the clusters are given as a string.
| A1 | A2 | A3 | A4 | A5 | C |
| -- | -- | -- | -- | -- | ---- |
| 0 | 30 | 0 | 15 | 0 | "1,2"|
| 0 | 20 | 34 | 0 | 0 | "2" |
| 33 | 0 | 0 | 7 | 0 | "1" |
| 28 | 0 | 0 | 0 | 85 | "3,1"|
| 0 | 0 | 0 | 0 | 94 | "3" |
| 0 | 12 | 57 | 0 | 0 | "2,3"|
c1 = ['A2','A1','A4','A5']
c2 = ['A2','A4','A3']
c3 = ['A1','A5','A2','A3']
Create a list of column names for each row where the value is not equal to 0 by looping across the rows with apply and MARGIN = 1. Then use the columns 'C1' and 'C2' to split that list, loop over the outer list, unlist the inner list elements, and take the sorted unique values.
# feature names with non-zero values, one character vector per row
l1 <- apply(df1[1:5] != 0, 1, FUN = function(x) names(x)[x])
# group the per-row feature vectors by cluster column, then flatten, deduplicate and sort
lst1 <- lapply(split(l1, df1$C1), function(x) sort(unique(unlist(x))))
lst2 <- lapply(split(l1, df1$C2), function(x) sort(unique(unlist(x))))
Output:
> lst1
$`1`
[1] "A1" "A4"
$`2`
[1] "A2" "A3"
$`3`
[1] "A5"
> lst2
$`1`
[1] "A1" "A4"
$`2`
[1] "A2" "A3" "A5"
I am working with the following table:
Var1 is of the format location.transport_type. Therefore, A.land means location A, transport type land. Frequency is simply the number of times a location used the respective transport type. Location_ID and Transport_Type were created in Stata by splitting Var1.
| Var1 | Frequency | Location_ID | Transport_Type|
|---- |---- | ----- | ----- |
| A.land | 4 | A |land |
| A.air | 3 | A |air |
| A.sea | 2 | A |sea |
| B.sea | 5 | B |sea |
| B.other | 2 | B |other |
| B.land | 2 | B |land |
| C.land | 1 | C |land |
| C.air | 3 | C |air |
| C.other | 1 | C |other |
The goal is to find the distribution of the types of transports from each location A, B, and C.
I wish to create four variables: Proportion_land, Proportion_sea, Proportion_air, and Proportion_other.
For example, for location A I would want to create something like this:
| Location | Proportion_land | Proportion_sea | Proportion_air | Proportion_other |
|---- |---- |------ | ----- |----- |
| A | 4/9 | 2/9 | 3/9 | 0 |
It is a little unclear what you want here, as you don't provide any directly readable data or any exact code. But with some surgery, I get this version of your example data in Stata:
* Example generated by -dataex-. For more info, type help dataex
clear
input str7 Var1 byte Frequency str1 Location_ID str5 Transport_Type
"A.land" 4 "A" "land"
"A.air" 3 "A" "air"
"A.sea" 2 "A" "sea"
"B.sea" 5 "B" "sea"
"B.other" 2 "B" "other"
"B.land" 2 "B" "land"
"C.land" 1 "C" "land"
"C.air" 3 "C" "air"
"C.other" 1 "C" "other"
end
and then what you call for requires not so much new variables as a basic cross-tabulation:
. tab Location_ID Transport_Type [fw=Freq], row
+----------------+
| Key |
|----------------|
| frequency |
| row percentage |
+----------------+
Location_I | Transport_Type
D | air land other sea | Total
-----------+--------------------------------------------+----------
A | 3 4 0 2 | 9
| 33.33 44.44 0.00 22.22 | 100.00
-----------+--------------------------------------------+----------
B | 0 2 2 5 | 9
| 0.00 22.22 22.22 55.56 | 100.00
-----------+--------------------------------------------+----------
C | 3 1 1 0 | 5
| 60.00 20.00 20.00 0.00 | 100.00
-----------+--------------------------------------------+----------
Total | 6 7 3 7 | 23
| 26.09 30.43 13.04 30.43 | 100.00
If you do want the proportions as separate variables, the following builds them from the same data:
clear
input str7 Var1 byte Frequency str1 Location_ID str5 Transport_Type
"A.land" 4 "A" "land"
"A.air" 3 "A" "air"
"A.sea" 2 "A" "sea"
"B.sea" 5 "B" "sea"
"B.other" 2 "B" "other"
"B.land" 2 "B" "land"
"C.land" 1 "C" "land"
"C.air" 3 "C" "air"
"C.other" 1 "C" "other"
end
local types land sea air other
* Get frequency for each type
foreach type of local types {
gen `type' = Frequency if (Transport_Type == "`type'")
}
* Aggregate freq for total and each type on location level
collapse (sum) loc_total=Frequency `types' , by(Location_ID)
* Calculate proportion for each type
foreach type of local types {
gen Proportion_`type' = `type' / loc_total
}
I have two matrices, A and B; both have the same dimensions and are binary. I want to overlay matrix A onto matrix B.
Matrix A:
| Gene A | Gene B |
| -------- | ----------- |
| 0 | 1 |
| 0 | 1 |
Matrix B:
| Gene A | Gene B |
| -------- | ----------- |
| 1 | 0 |
| 0 | 0 |
Result:
Matrix C:
| Gene A | Gene B |
| -------- | ----------- |
| 1 | 1 |
| 0 | 1 |
The resulting matrix should also have the same dimensions as the inputs. How can this be done?
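Here is a minimal R sketch (not from the thread), assuming "overlap" means an element-wise logical OR, so a cell in C is 1 if it is 1 in A or in B. The example matrices mirror the ones above.
A <- matrix(c(0, 0, 1, 1), nrow = 2, dimnames = list(NULL, c("GeneA", "GeneB")))
B <- matrix(c(1, 0, 0, 0), nrow = 2, dimnames = list(NULL, c("GeneA", "GeneB")))
C <- (A | B) * 1   # element-wise OR, multiplied by 1 to get back a 0/1 matrix
C
#      GeneA GeneB
# [1,]     1     1
# [2,]     0     1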
In R, I've created a 3-dimensional table from a dataset. The three variables are all factors and are labelled H, O, and S. This is the code I used to simply create the table:
attach(df)
test <- table(H, O, S)
Flattening the table produces the output below. The two values of S were split into separate columns, labelled S1 and S2:
ftable(test)
+-----------+-----------+-----+-----+
| H | O | S1 | S2 |
+-----------+-----------+-----+-----+
| Isolation | Dead | 2 | 15 |
| | Sick | 64 | 20 |
| | Recovered | 153 | 379 |
| ICU | Dead | 0 | 15 |
| | Sick | 0 | 2 |
| | Recovered | 1 | 9 |
| Other | Dead | 7 | 133 |
| | Sick | 4 | 20 |
| | Recovered | 17 | 261 |
+-----------+-----------+-----+-----+
The goal is to use this table object, subset it, and produce a second table. Essentially, I want only "Isolation" and "ICU" from H, "Sick" and "Recovered" from O, and only S1, so it basically becomes the 2-dimensional table below:
+-----------+------+-----------+
| | Sick | Recovered |
+-----------+------+-----------+
| Isolation | 64 | 153 |
| ICU | 0 | 1 |
+-----------+------+-----------+
(where S = S1)
I know I could first subset the dataframe and then create the new table, but the goal is to subset the table object itself. I'm not sure how to retrieve certain values from each dimension and produce the reduced table.
Edit: ANSWER
I found a much simpler method: all I needed to do was index the desired levels in each dimension. The solution is below:
> test[1:2,2:3,1]
           O
H           Sick Recovered
  Isolation   64       153
  ICU          0         1
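A side note (my own sketch, not part of the answers here): the same subsetting also works with the level names instead of positions, assuming the dimnames match the labels shown in the question. This is less fragile if the level order ever changes.
test[c("Isolation", "ICU"), c("Sick", "Recovered"), "S1"]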
Subset the data before running table(), for example:
ftable(table(mtcars[, c("cyl", "gear", "vs")]))
# vs 0 1
# cyl gear
# 4 3 0 1
# 4 0 8
# 5 1 1
# 6 3 0 2
# 4 2 2
# 5 1 0
# 8 3 12 0
# 4 0 0
# 5 2 0
# subset then run table
ftable(table(mtcars[ mtcars$gear == 4, c("cyl", "gear", "vs")]))
# vs 0 1
# cyl gear
# 4 4 0 8
# 6 4 2 2
Currently, I'm working on a Mealy FSM that detects the 17-bit sequence 10100001010000001. Due to the length of the sequence, I'm having difficulty figuring out which state to return to when the input doesn't allow me to move on to the next state. Any suggestions?
Think about which part of the pattern is still matched when the expected bit does not arrive. The FSM for the 10100001010000001 Mealy machine is shown below in ASCII art (not sure how it will render here...).
s0-1->s1-0->s2-1->s3-0->s4-0->s5-0->s6-0->s7-1->s8-0->s9-1->s10-0->s11-0->s12-0->s13-0->s14-0->s15-0->s16-1->s17-0->s2
| | | | | | | | | | | | | | | | | |
0 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1 0 1
| | | | | | | | | | | | | | | | | |
s0 s1 s0 s1 s3 s1 s1 s0 s1 s0 s1 s3 s1 s1 s8 s1 s0 s1
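A quick way to derive these fallback transitions (my own sketch in R, not part of the answer above) is the KMP failure-function idea: from state s on input bit b, the next state is the length of the longest prefix of the pattern that is a suffix of the s already-matched bits followed by b.
pattern <- as.integer(strsplit("10100001010000001", "")[[1]])
n <- length(pattern)
next_state <- function(s, b) {
  seen <- c(pattern[seq_len(s)], b)          # bits observed in state s after reading b
  for (k in min(s + 1, n):1) {               # try the longest candidate prefix first
    if (all(tail(seen, k) == pattern[seq_len(k)])) return(k)
  }
  0L
}
# full transition table: one row per state s0..s17, one column per input bit
t(sapply(0:n, function(s) c(`0` = next_state(s, 0L), `1` = next_state(s, 1L))))
The rows of this table reproduce the forward and fallback edges of the diagram.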
I've got data looking like this:
A | B | C
--------------
f | 1 | 1420h
f | 1 | 1540h
f | 3 | 600h
g | 2 | 900h
g | 2 | 930h
h | 1 | 700h
h | 3 | 400h
Now I want to create a new column that counts the other rows in the data frame that meet certain conditions.
In this case, I would like to know, for each row, how often the same combination of A and B occurred within a range of 100 around C.
So the result with this data would be:
A | B | C | D
------------------
f | 1 | 1420 | 0
f | 1 | 1540 | 0
f | 3 | 600  | 0
g | 2 | 900 | 1
g | 2 | 930 | 1
h | 1 | 700 | 0
h | 3 | 400 | 0
I actually came up with a solution using nested for loops, but the time R needs to compute the results is far too long.
for (i in 1:nrow(df)) {
  # count the other rows with the same A and B whose C lies within +/- 100 of this row's C
  df[i, "D"] <- sum(sapply(1:nrow(df), function(p) {
    p != i &
      df[p, "A"] == df[i, "A"] &
      df[p, "B"] == df[i, "B"] &
      df[p, "C"] < df[i, "C"] + 100 &
      df[p, "C"] > df[i, "C"] - 100
  }))
}
Is there a better way?
Thanks a lot!
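One faster option (a sketch, assuming C is numeric): replace the inner loop with a single vectorized comparison per row, which avoids most of the R-level looping.
df$D <- sapply(seq_len(nrow(df)), function(i) {
  sum(df$A == df$A[i] &
      df$B == df$B[i] &
      abs(df$C - df$C[i]) < 100) - 1   # subtract 1 to exclude the row itself
})
With the example data this gives D = 0 for every row except the two g/2 rows, which each count the other.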