I have a data set involving 100 people and their diagnoses for 5 medical conditions. Any combination of conditions can occur, but I've set it up so that the probability of condition D depends on condition A, and E depends on B.
set.seed(14)
numpeople <- 100
diagnoses <- data.frame(A = rbinom(numpeople, 1, .15),
                        B = rbinom(numpeople, 1, .1),
                        C = rbinom(numpeople, 1, .2))
# Probability of a D diagnosis is .2, increased by .4 if the patient has A
diagnoses$D <- sapply(diagnoses$A, function(x) rbinom(1, 1, .4*x+.2))
# Probability of an E diagnosis is .1, increased by .7 if the patient has B
diagnoses$E <- sapply(diagnoses$B, function(x) rbinom(1, 1, .7*x+.1))
To make a co-occurrence matrix, where each cell is the number of people with both of the diagnoses in the row and column, I use matrix algebra:
diagnoses.dist <- t(as.matrix(diagnoses)) %*% as.matrix(diagnoses)
diag(diagnoses.dist) <- 0
diagnoses.dist
> diagnoses.dist
   A B C  D E
A  0 1 1 11 3
B  1 0 0  1 7
C  1 0 0  5 4
D 11 1 5  0 4
E  3 7 4  4 0
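As a quick sanity check, each off-diagonal cell can be reproduced by a direct count over people. For instance (a hedged one-liner I'm adding, not part of the original code), the A-D cell of 11 shown above is just:
sum(diagnoses$A & diagnoses$D)
# [1] 11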
Then I'd like to use a chord diagram to show the proportion of co-diagnoses for each diagnosis.
circos.clear()
circos.par(gap.after=10)
chordDiagram(diagnoses.dist, symmetric=TRUE)
By default, the size of the sector (the pie slice) allocated to each group is proportional to its number of links.
> colSums(diagnoses.dist) # Number of links related to each diagnosis
 A  B  C  D  E 
16  9 10 21 18 
Is it possible to set the sector width to illustrate the number of people with each diagnosis?
> colSums(diagnoses) # Number of people with each diagnosis
 A  B  C  D  E 
16  8 20 29 18 
This problem seems somewhat related to section 14.5 of the circlize book, but I'm not sure how to work out the math for the gap.after argument.
Based on section 2.3 of the circlize book, I tried setting the sector size using circos.initialize, but I think the chordDiagram function overrides this, because the scale on the outside is exactly the same.
circos.clear()
circos.par(gap.after=10)
circos.initialize(factors=names(diagnoses), x=colSums(diagnoses)/sum(diagnoses), xlim=c(0,1))
chordDiagram(diagnoses.dist, symmetric=TRUE)
I see a lot of options to fine-tune tracks in chordDiagram but not much for sectors. Is there a way this can be done?
In your case, the number of people in a category can sometimes be smaller than that category's total number of co-occurrences with other categories. For example, category B has 9 co-occurrences in total, but only 8 people.
If this is not a problem for you, you can put values on the diagonal of the matrix corresponding to the number of people who are in only one category. In the following example code, I just put random numbers on the diagonal to illustrate the idea:
diagnoses.dist <- t(as.matrix(diagnoses)) %*% as.matrix(diagnoses)
diag(diagnoses.dist) = sample(10, 5)
# since the matrix is symmetric, we set the upper triangle to zero.
# we don't use `symmetric = TRUE` here because the values on the diagonal
# are still used.
diagnoses.dist[upper.tri(diagnoses.dist)] = 0
par(mfrow = c(1, 2))
# here you can remove `self.link = 1` to see the difference
chordDiagram(diagnoses.dist, grid.col = 2:6, self.link = 1)
# If you don't want to see the "mountains"
visible = matrix(TRUE, nrow = nrow(diagnoses.dist), ncol = ncol(diagnoses.dist))
diag(visible) = FALSE
chordDiagram(diagnoses.dist, grid.col = 2:6, self.link = 1, link.visible = visible)
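If you would rather have the diagonal carry real information than random numbers, here is a minimal sketch, assuming "only in one category" is read as people with exactly one diagnosis (my reading of the intent, not something stated in the question):
# count, per diagnosis, the people who have that diagnosis and no other
only_one <- colSums(diagnoses[rowSums(diagnoses) == 1, ])
diag(diagnoses.dist) <- only_one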
PS: the link.visible option is only available in recent versions of circlize.
Consider a clustering problem, where the true class labels are known (say g).
Suppose p denotes the predicted cluster labels (obtained by any clustering approach).
So both g and p split the data set into groups, though the number of groups need not be the same in the two cases.
Among these two sets of groups, in some cases a group from g will be identical to a group from p, even though their labels may differ. I want to find the number of such groups, i.e. the number of cases where the clustering method detects a class perfectly.
I understand this is not a standard way to evaluate clustering (the Rand index, Dunn index, etc. are recommended), but I am interested in it. I also understand that this number will be very small for most real-life data, maybe even 0, but the data set I am currently working with has a large number of classes (around 1500), with at most 15 observations in any one class. So in this case this number is likely to be quite high.
Here is a reproducible example and my attempt (working) at the solution:
# true labels
g <- c(1, 1, 2, 2, 2, 1, 3, 3, 3, 4)
# predicted labels
p <- c(3, 3, 1, 1, 1, 3, 4, 4, 1, 2)
# correctly detected groups
n_correct <- 2 # (1st class and 3rd cluster), (4th class and 2nd cluster)
# attempt
distinct_class_labels <- unique(x = g)
counter <- 0
for (i in seq_along(distinct_class_labels)) {
  cluster_labels_of_obs_in_ith_class <- subset(x = p,
                                               subset = (g == distinct_class_labels[i]))
  unique_cluster_labels_of_obs_in_ith_class <- unique(x = cluster_labels_of_obs_in_ith_class)
  if (length(x = unique_cluster_labels_of_obs_in_ith_class) == 1) {
    class_labels_of_obs_in_this_cluster <- subset(x = g,
                                                  subset = (p == unique_cluster_labels_of_obs_in_ith_class))
    if (length(x = unique(x = class_labels_of_obs_in_this_cluster)) == 1) {
      counter <- (counter + 1)
    }
  }
}
counter
#> [1] 2
Created on 2019-05-22 by the reprex package (v0.3.0)
This works correctly, but it takes time (and I do not like this method). I suppose one could use dplyr::group_by with g and p separately and somehow compare the groups of the two objects. I guess there are better approaches, and I would highly appreciate such answers.
Thanks.
If you are also interested in the combinations of the correctly detected groups, you can try this:
library(tidyverse)
tibble(g = g, p = p) %>%
  distinct(g, p) %>%                  # unique combinations of g and p
  add_count(g, name = "g_count") %>%  # count how often each label occurs in g and p;
  add_count(p, name = "p_count") %>%  # an unambiguously assigned label has count 1
  filter(g_count == 1 & p_count == 1) %>%
  select(g, p)
# A tibble: 2 x 2
      g     p
  <dbl> <dbl>
1     1     3
2     4     2
The number of rows (you can use nrow()) gives you the number of correctly detected groups.
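If only the count is needed, the same pipeline can simply end in nrow() (a small usage sketch of the code above; nothing new is assumed):
tibble(g = g, p = p) %>%
  distinct(g, p) %>%
  add_count(g, name = "g_count") %>%
  add_count(p, name = "p_count") %>%
  filter(g_count == 1 & p_count == 1) %>%
  nrow()
#> [1] 2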
Convert g and p to factors with levels specified in order of their occurrence in each vector, and count the frequencies that match.
sum(table(factor(p, levels = unique(p))) == table(factor(g, levels = unique(g))))
#[1] 2
To understand, see
table(factor(p, levels = unique(p)))
# 3 1 4 2
# 3 4 2 1
table(factor(g, levels = unique(g)))
# 1 2 3 4
# 3 3 3 1
We can ignore the labels (the group labels are not the same anyway) and focus only on the frequencies. We can see that the first and fourth values have the same frequency, hence the count of 2.
If you want to find out which groups are similar, you can do
inds <- table(factor(p, levels = unique(p))) == table(factor(g, levels = unique(g)))
unique(p)[inds]
#[1] 3 2
unique(g)[inds]
#[1] 1 4
This says that group 3 in p is similar to group 1 in g, and likewise group 2 in p to group 4 in g.
Before solving it with table, I did it with split, although the underlying logic is the same:
sum(lengths(split(p, factor(p, levels = unique(p)))) ==
lengths(split(g, factor(g, levels = unique(g)))))
EDIT
If the two vectors can contain different sets of labels, we need to combine the levels so that all labels are included. For example,
g1 <- c(g, 5)
p1 <- c(p, 1)
sum(table(factor(p1, levels = unique(c(p1, g1)))) ==
table(factor(g1, levels = unique(c(g1, p1)))))
#[1] 2
I'm using the epiR package, as it does nice 2 by 2 contingency tables with odds ratios and population attributable fractions.
As is common, my data is coded
0 = No
1 = Yes
So when I do
table(var_1, var_2)
The output comes out with the 0 (No) row and column first.
For its input, though, epiR wants the top-left cell to be Exposed +VE / Outcome +VE - i.e. the top-left cell should be var_1 == 1 and var_2 == 1.
Currently I do this by recoding the zeroes to 2, or alternatively by converting to a factor and releveling. Both are slightly annoying for other analyses, as in general I want Outcome +VE to come after Outcome -VE.
So I wondered if there is an easy way (within table?) to flip the orientation of the table so that it essentially inverts the ordering of the rows/columns?
Hope the above makes sense - happy to provide clarification if not.
Edit: thanks for the suggestions below. Just for clarification, I want to be able to do this when calling table on existing data frame variables - i.e. when what I am doing is table(data$var_1, data$var_2) - ideally without having to create a whole new object.
The result of table() is a simple matrix, so you can just call the indices in reverse order.
xy <- table(data.frame(value = rbinom(100, size = 1, prob = 0.5),
                       variable = letters[1:2]))
xy
     variable
value  a  b
    0 20 22
    1 30 28
xy[2:1, 2:1]
     variable
value  b  a
    1 28 30
    0 22 20
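Applied to the asker's situation, the reversal works inline without creating a new object (data, var_1 and var_2 stand in for the asker's own names here):
table(data$var_1, data$var_2)[2:1, 2:1]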
Using factor levels:
# example data (adapted from Roman's answer; it is random, so counts vary run to run)
df1 <- data.frame(value = rbinom(100, size = 1, prob = 0.5),
                  variable = letters[1:2])
table(df1)
#      variable
# value  a  b
#     0 32 23
#     1 18 27

# convert to factor, specify levels
df1$value <- factor(df1$value, levels = c("1", "0"))
df1$variable <- factor(df1$variable, levels = c("b", "a"))
table(df1)
#      variable
# value  b  a
#     1 27 18
#     0 23 32
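If relabelling the factors is undesirable (the asker wants 0 before 1 in other analyses), a small helper keeps the flip local to the call site; flip() below is a hypothetical name, not an existing function:
# reverse both dimensions of any two-way table
flip <- function(tab) tab[rev(seq_len(nrow(tab))), rev(seq_len(ncol(tab)))]
flip(table(df1$value, df1$variable))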
df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
                  var1 = c('a', 'b', 'c', 'd', 'e'),
                  var2 = c(1, 1, 0, 0, 1))
ada = boosting(formula=var1~., data=df1)
Error in cbind(yval2, yprob, nodeprob) :
  el número de filas de las matrices debe coincidir (vea arg 2)
(In English: the number of rows of the matrices must match (see arg 2).)
Hi everyone, I'm trying to use the boosting function from the adabag package, but it throws an error saying that the number of rows of the matrices must match. This data is not the original, but it seems to throw the same error.
Could you help me?
Thank you.
You should not use ID as an explanatory variable.
Unfortunately your df1 dataset is too small, so it is not possible to tell whether ID is the source of your problem.
Below I generate a bigger data set:
library(adabag)
set.seed(1)
n <- 100
df1 <- data.frame(ID = 1:n,
                  var1 = sample(letters[1:5], n, replace = TRUE),
                  var2 = sample(c(0, 1), n, replace = TRUE))
head(df1)
#   ID var1 var2
# 1  1    b    1
# 2  2    b    0
# 3  3    c    0
# 4  4    e    1
# 5  5    b    1
# 6  6    e    0
ada <- boosting(var1~var2, data=df1)
ada.pred <- predict.boosting(ada, newdata=df1)
ada.pred$confusion
#                Observed Class
# Predicted Class  a  b  c  d  e
#               b  5 20  2  7 11
#               c  2  2 10  2  2
#               d  6  3  7 17  4
Pablo, if we take a closer look at your sample data, we will notice a property that makes it impossible for the classification algorithm to handle. Your dataset consists of five samples, each with a unique label for the dependent variable: a, b, c, d, e. The dataset has only one feature (the independent variable var2, since ID should be excluded from the feature list), which takes two classes: 0 and 1. This means several labels of the dependent variable correspond to the same class of the independent variable. When the algorithm tries to build a model, it runs into this property while fitting the trees and throws the error (number of rows of matrices must match (see arg 2)).
Marco's data, instead, has some healthy diversity: already in the six samples shown by head() there are only three labels (b, c, e) against two classes (0, 1), so labels repeat. The data set is diverse and reliable enough for the algorithm to handle.
So, in order to use adabag's boosting (which grows rpart trees under the hood), you should make your data more diverse and reliable. Good luck!
I'm trying to find a way to add constraints to a linear programme that force the solution to have a certain level of uniqueness. I'll try to explain what I mean. In the example below, the linear programme returns the maximum possible Score for a combination of 2 males and 1 female.
Looking at the Team/Grade/Rep columns, however, we can see that there is a lot of duplication from row to row. In fact, Shana and Jason are identical.
Name<-c("Jane","Brad","Harry","Shana","Debra","Jason")
Sex<-c("F","M","M","F","F","M")
Score<-c(25,50,36,40,39,62)
Team<-c("A","A","A","B","B","B")
Grade<-c(1,2,1,2,1,2)
Rep<-c("C","D","C","D","D","D")
df<-data.frame(Name,Sex,Score,Team,Grade,Rep)
df
   Name Sex Score Team Grade Rep
1  Jane   F    25    A     1   C
2  Brad   M    50    A     2   D
3 Harry   M    36    A     1   C
4 Shana   F    40    B     2   D
5 Debra   F    39    B     1   D
6 Jason   M    62    B     2   D
library(Rglpk)
num <- length(df$Name)
obj <- df$Score
var.types <- rep("B", num)
matrix <- rbind(as.numeric(df$Sex == "M"), as.numeric(df$Sex == "F"))
direction <- c("==", "==")
rhs <- c(2, 1)
sol <- Rglpk_solve_LP(obj = obj, mat = matrix, dir = direction, rhs = rhs, types = var.types, max = TRUE)
df[sol$solution==1,]
   Name Sex Score Team Grade Rep
2  Brad   M    50    A     2   D
4 Shana   F    40    B     2   D
6 Jason   M    62    B     2   D
What I am trying to work out is how to limit the level of similarity across those last three columns. For example, I would like no more than 2 of those columns to be the same across any two selected rows. This would mean that either the Shana row or the Jason row would be replaced in the model with an alternative.
I'm not sure if this is something that can easily be added to the Rglpk model? I'd appreciate any help that can be offered.
It sounds like you're asking how to prevent having a pair of individuals who are "too similar" from being returned by your optimization model. Once you have determined a rule for what makes a pair of people "too similar", you can simply add a constraint for each pair, limiting your solution to have no more than one of those two people.
For instance, if we use your rule of having no more than 2 columns the same, we could easily identify all pairs that we want to block:
pairs <- t(combn(nrow(df), 2))
(blocked <- pairs[rowSums(sapply(df[, c("Team", "Grade", "Rep")], function(x) {
  x[pairs[, 1]] == x[pairs[, 2]]
})) >= 3, ])
#      [,1] [,2]
# [1,]    1    3
# [2,]    4    6
We want to block the pairs Jane/Harry and Shana/Jason. This is easy to do with linear constraints:
library(Rglpk)
num <- length(df$Name)
obj <- df$Score
var.types <- rep("B", num)
matrix <- rbind(as.numeric(df$Sex == "M"), as.numeric(df$Sex == "F"),
                outer(blocked[, 1], seq_len(num), "==") + outer(blocked[, 2], seq_len(num), "=="))
direction <- rep(c("==", "<="), c(2, nrow(blocked)))
rhs <- c(2, 1, rep(1, nrow(blocked)))
sol <- Rglpk_solve_LP(obj = obj, mat = matrix, dir = direction, rhs = rhs, types = var.types, max = TRUE)
df[sol$solution==1,]
#    Name Sex Score Team Grade Rep
# 2  Brad   M    50    A     2   D
# 5 Debra   F    39    B     1   D
# 6 Jason   M    62    B     2   D
The approach of computing every pair to block is attractive because we could have a much more complicated rule for which pairs to block, since we don't need to encode the rule into the linear program. All we need to be able to do is to compute every pair that needs to be blocked.
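To illustrate that the blocking rule is decoupled from the model, here is a hedged sketch of a different, made-up rule - blocking any pair whose Scores differ by less than 5 - reusing the pairs matrix from above (blocked_alt is a hypothetical name):
# block pairs with near-identical scores instead of identical columns
blocked_alt <- pairs[abs(df$Score[pairs[, 1]] - df$Score[pairs[, 2]]) < 5, , drop = FALSE]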
For each group of rows having the same last 3 columns, we construct a constraint such that at most one of those rows may appear. If a is an indicator vector of the rows of such a group, then the constraint looks like this:
a'x <= 1
To do that, split the row numbers by the last 3 columns into a list s of vectors, each component of which is a vector of row numbers for rows having the same last 3 columns. Only keep the components having more than one row number, giving s1. In this case the first component of s1 is c(1, 3), referring to the Jane and Harry rows, and the second component is c(4, 6), referring to the Shana and Jason rows. In this particular data there are 2 rows in each group, but in other data there could be more than 2 rows in a group. excl has one row (constraint) for each element of s1.
The data in the question only has groups of size 2, but in general if there were k rows in some group, one would need choose(k, 2) constraint rows to ensure that only one of the k is chosen if this were done pairwise, whereas the approach here requires only one constraint row for the entire group. For example, if k = 10 then choose(10, 2) = 45, so this uses 1 constraint in place of 45.
Finally, rbind excl to matrix, giving matrix2, and adjust the other Rglpk_solve_LP arguments accordingly:
nr <- nrow(df)
s <- split(1:nr, df[4:6])
s1 <- s[lengths(s) > 1]
excl <- t(sapply(s1, "%in%", x = 1:nr)) + 0
matrix2 <- rbind(matrix, excl)
direction2 <- c(direction, rep("<=", nrow(excl)))
rhs2 <- c(rhs, rep(1, nrow(excl)))
sol2 <- Rglpk_solve_LP(obj = obj, mat = matrix2,
                       dir = direction2, rhs = rhs2, types = "B", max = TRUE)
df[sol2$solution == 1, ]
giving:
   Name Sex Score Team Grade Rep
2  Brad   M    50    A     2   D
5 Debra   F    39    B     1   D
6 Jason   M    62    B     2   D
I'm trying to create a tool in R that will calculate the atomic composition (i.e. the number of carbon, hydrogen, nitrogen and oxygen atoms) of a peptide chain input in single-letter amino acid code. For example, the peptide KGHLY consists of the amino acids lysine (K), glycine (G), histidine (H), leucine (L) and tyrosine (Y). Lysine is made of 6 carbon, 14 hydrogen, 2 nitrogen and 2 oxygen. Glycine is made of 2 carbon, 5 hydrogen, 1 nitrogen and 2 oxygen. And so on.
I would like the R code to either read the peptide string (KGHLY) from a data frame or take input from the keyboard using readline().
I am new to R and new to programming. I am able to make objects for each amino acid, e.g. G <- c(2, 5, 1, 2), or build a data frame containing all 20 amino acids and their respective atomic compositions.
The bit I am struggling with is that I don't know how to get R to index into a data frame in response to a string of letters. I have a feeling the solution is probably very simple, but so far I have not been able to find a function suited to this task.
There are two main components to take care of here: choosing how to store the basic data, and the algorithm that computes the result you want.
For the computation, it might be preferable to have your data stored in a matrix, due to the way R recycles the shorter vector when multiplying two vectors. This recycling also kicks in if you want to multiply a matrix by a vector, since a matrix is a vector with some additional attributes (namely, dimensions and dimension names). Consider the example below to see how it works.
test_matrix <- matrix(data = 1:12, nrow = 3)
test_vec <- c(3, 0, 1)
test_matrix
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
test_matrix * test_vec
     [,1] [,2] [,3] [,4]
[1,]    3   12   21   30
[2,]    0    0    0    0
[3,]    3    6    9   12
Based on this observation, it's possible to deduce that a solution where each amino acid has one row in a matrix might be a good way to store the lookup data; when we have a counting vector specifying the desired contribution from each row, it is sufficient to multiply our matrix by our counting vector and then sum the columns - the last part is solved using colSums.
colSums(test_matrix * test_vec)
[1] 6 18 30 42
It's in general a "pain" to store this kind of information in a matrix, since it can be a lot of work to update the information later on. However, I guess it's not that often that new amino acids need to be added, so that might not be an issue in this case.
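If hand-editing a matrix literal feels brittle, one hedged alternative (my suggestion, not part of the original answer) is to maintain the table as a data.frame - or read it from a file - and convert with as.matrix() just before computing. Shown here with only the glycine and leucine rows from the question:
# keep the composition table in a data.frame for easier maintenance
aa_df <- data.frame(C = c(2, 6), H = c(5, 13), N = c(1, 1), O = c(2, 2),
                    row.names = c("G", "L"))
aa_mat <- as.matrix(aa_df)  # same structure as the matrix built below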
So let's create a matrix for the five amino acids needed for the peptide you mentioned in your example. The numbers were found on Wikipedia, and hopefully I didn't mess up when I copied them. Just follow suit to add all the other amino acids too.
amino_acids <- rbind(
  G = c(C = 2, H = 5, N = 1, O = 2),
  L = c(C = 6, H = 13, N = 1, O = 2),
  H = c(C = 6, H = 9, N = 3, O = 2),
  K = c(C = 6, H = 14, N = 2, O = 2),
  Y = c(C = 9, H = 11, N = 1, O = 3))
amino_acids
  C  H N O
G 2  5 1 2
L 6 13 1 2
H 6  9 3 2
K 6 14 2 2
Y 9 11 1 3
This matrix contains the information we want, but it might be preferable to have the rows in lexicographic order, and it would be nice to ensure that we haven't added the same row twice by mistake. The code below takes care of both of these issues.
amino_acids <-
  amino_acids[sort(unique(rownames(amino_acids))), ]
amino_acids
  C  H N O
G 2  5 1 2
H 6  9 3 2
K 6 14 2 2
L 6 13 1 2
Y 9 11 1 3
The next part is to figure out how to deal with the peptides. This will be done by first using strsplit to split the string into separate characters, and then using table on the result to get the vector we want to multiply with the matrix.
peptide <- "KGHLY"
peptide_2 <- unlist(strsplit(x = peptide, split = ""))
peptide_2
[1] "K" "G" "H" "L" "Y"
Using table upon peptide_2 gives us
table(peptide_2)
peptide_2
G H K L Y
1 1 1 1 1
This can thus be used to define a vector that plays the role of test_vec from the first example. However, in general the resulting vector will contain fewer components than the matrix amino_acids has rows, so a restriction must be performed first in order to get the format we want for our computation.
Several options are available, and the simplest might be to use the names from the table to subset the required rows from amino_acids, so that the computation can proceed without any further fuss.
peptide_vec <- table(peptide_2)
colSums(amino_acids[names(peptide_vec), ] * as.vector(peptide_vec))
 C  H  N  O 
29 52  8 11 
This outlines one possible solution to the core of your problem, and it can be collected into a function that takes care of all the steps for us.
peptide_function <- function(peptide, amino_acids) {
  peptide_vec <- table(
    unlist(strsplit(x = peptide, split = "")))
  ## Compute the result and return it to the workflow.
  colSums(
    amino_acids[names(peptide_vec), ] *
      as.vector(peptide_vec))
}
And finally a test to see that we get the same answer as before.
peptide_function(peptide = "GHKLY",
                 amino_acids = amino_acids)
 C  H  N  O 
29 52  8 11 
What next? Well, that depends on how you have stored your peptides and what you would like to do with the result. If, for example, you have the peptides stored in a vector and would like the result stored in a matrix, then it might be possible to use vapply as shown below.
data_vector <- c("GHKLY", "GGLY", "HKLGL")
result <- t(vapply(
  X = data_vector,
  FUN = peptide_function,
  FUN.VALUE = numeric(4),
  amino_acids = amino_acids))
result
       C  H N  O
GHKLY 29 52 8 11
GGLY  19 34 4  9
HKLGL 26 54 8 10