Cluster observations based on multiple variables

I am looking for an R function to create clusters in my dataset based on two variables (I hope "cluster" is the right name for what I want to do). Any two observations with the same value for variable_1 OR variable_2 should be in the same cluster. In the brief example that follows, I am grouping the data frame df based on variable_1 and variable_2.
df <- data.frame(variable_1=c("a","a","b","b","c","c","d","d","e","e"),variable_2=c("g1","g2","g1","g3","g2","g4","g4","g6","g7","g8"),value=rnorm(10))
df$clusters <- some_function_to_create_clusters(df[,c("variable_1","variable_2")])
The result should look like:
df$clusters <- c("clu1","clu1","clu1","clu1","clu1","clu1","clu1","clu1","clu2","clu2")
df
Notice that the first cluster contains everyone with variable_1 equals "a", "b", "c", or "d": "a" and "b" are merged together because they share "g1" (lines 1 and 3); "a" and "c" are merged because they share "g2" (lines 2 and 5); and "c" and "d" are merged because they share "g4" (lines 6 and 7).
Finally, in the last cluster there are only observations with variable_1=="e", because they don't share variable_2 with anyone.
Just to clarify what I intend to do, I will explain my problem a little better. I'm pairing counties with nearby Touristic Attractions (TAs). Different counties are surrounded by different TAs, and there are many TAs around the same county. But these "touristic clusters" of counties and TAs are sparsely spread across the country. Notice that some far-apart counties might end up in the same cluster due to a "chain" effect of county-TA connections. So I want to find those "clusters" based on the id of the county and of the Touristic Attraction.
This seems simple, but I can't figure out how to implement it.
Thanks a lot

igraph solution
Disclaimer: I am completely new to igraph, so there's probably a better solution to this problem. However this seems to work.
With the igraph package we can cluster the data using the graph_from_data_frame() function, and then extract the clusters with components. You get the added advantage of being able to visualise the clusters.
library(igraph)
graph <- graph_from_data_frame(df[, 1:2], directed = FALSE)
cmp <- components(graph)$membership
# look up by vertex name (as.character guards against factor columns)
df$cluster <- unname(cmp[as.character(df$variable_1)])
plot(graph)
Wrapping it up into a function
If you wanted to wrap it up as a function, something like this works:
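find_clusters <- function(x, y) {
  # each (x, y) pair is an edge; connected components are the clusters
  edges <- data.frame(from = x, to = y)
  graph <- igraph::graph_from_data_frame(edges, directed = FALSE)
  cmp <- igraph::components(graph)$membership
  # index by vertex name, not by factor code, and drop the names
  unname(cmp[as.character(x)])
}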
Using the additional example you posted as a comment above, we thus have the following workflow:
library(dplyr)
df <- data.frame(
variable_1 = c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e", "f", "f"),
variable_2 = c( "g1", "g2", "g1", "g3", "g2", "g4", "g4", "g6", "g7", "g8", "g9", "g12"),
value = rnorm(12)
)
df %>%
mutate(cluster = find_clusters(variable_1, variable_2))
# variable_1 variable_2 value cluster
# 1 a g1 -0.03410073 1
# 2 a g2 0.51261548 1
# 3 b g1 0.06470451 1
# 4 b g3 -1.97228101 1
# 5 c g2 -0.39751063 1
# 6 c g4 0.17761619 1
# 7 d g4 -0.13771207 1
# 8 d g6 -0.72183017 1
# 9 e g7 0.09012701 2
# 10 e g8 0.45763593 2
# 11 f g9 -0.83172613 3
# 12 f g12 2.83480352 3
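To make the idea behind components() more concrete, here is a minimal base R sketch of the same grouping using a simple union-find. It is only an illustration under assumptions: the function name cluster_by_shared_values and the "x:"/"y:" prefixes are made up for this example, it is not how igraph is implemented, and it is not tuned for large data.

```r
# Hypothetical helper, base R only: label connected components with union-find.
cluster_by_shared_values <- function(x, y) {
  # Prefix the columns so equal labels in different columns stay distinct nodes
  nodes_x <- paste0("x:", x)
  nodes_y <- paste0("y:", y)
  nodes <- unique(c(nodes_x, nodes_y))
  ix <- match(nodes_x, nodes)
  iy <- match(nodes_y, nodes)
  parent <- seq_along(nodes)  # every node starts as its own root

  find_root <- function(i) {
    while (parent[[i]] != i) i <- parent[[i]]
    i
  }

  # Each row is an edge: merge the two components it touches
  for (k in seq_along(ix)) {
    a <- find_root(ix[k])
    b <- find_root(iy[k])
    if (a != b) parent[[max(a, b)]] <- min(a, b)
  }

  roots <- vapply(ix, find_root, integer(1))
  match(roots, unique(roots))  # relabel roots as 1, 2, 3, ...
}

v1 <- c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e")
v2 <- c("g1", "g2", "g1", "g3", "g2", "g4", "g4", "g6", "g7", "g8")
clusters <- cluster_by_shared_values(v1, v2)
clusters
# [1] 1 1 1 1 1 1 1 1 2 2
```

On the original ten-row example this should put "a" to "d" in one cluster and "e" in another, the same grouping the question asks for.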

So, I wrote a function to achieve what I need. It is ugly but it is working. If someone has a better/more efficient solution I would really appreciate it.
library(magrittr)  # provides %>%
find_clusters <- function(original_df) {
  find_clus <- original_df
  cluster_number <- 1
  find_clus$cluster <- "cl"
  for (i in 1:nrow(find_clus)) {
    # skip rows that already received a cluster label
    if (nchar(find_clus$cluster[i]) > 2) next
    # rows sharing a value with row i in any column
    aux <- lapply(original_df, function(x) { which(x == x[i]) }) %>% reshape2::melt()
    idx <- aux$value %>% unique() %>% sort()
    j <- 1
    # keep expanding idx until no new rows are reached
    while (j <= length(idx)) {
      aux <- lapply(original_df, function(x) { which(x == x[idx[j]]) }) %>% reshape2::melt()
      idx <- c(idx, aux$value) %>% unique() %>% sort()
      j <- j + 1
    }
    find_clus$cluster[idx] <- paste0("cl", sprintf("%04d", cluster_number))
    cluster_number <- cluster_number + 1
  }
  return(find_clus$cluster)
}
So, to find the clusters one should write:
find_clusters(df[,c(1,2)])

Related

Recoding values by row based on values of other columns

I have a data set containing the genetic information of two parents and 300+ offspring. I'm trying to change the row values of the offspring based on the values of the parents in that row such that:
P1 P2 o1 o2 o3
1 A T A T AT
2 C A CA A C
3 G C G G C
4 T C C TC CT
becomes:
P1 P2 o1 o2 o3
1 A T a b h
2 C A b b a
3 G C a a b
4 T C b b h
where 'a' in the offspring indicates that it's like P1, 'b' for P2, and 'h' for having both. I've split the parent columns from the offspring for ease (Parents and Test, respectively), but my loop doesn't work or changes the entire row to NA. I've just been trying to tackle recoding to 'a' and 'b' for now with the following code:
for (i in 1:nrow(Test)) {
if (Parents[i, 1] == "A") {
Test[Test[i, ] == "A"] <- "a"
} else
if (Parents[i, 2] =="A") {
Test[Test[i, ] == "A"] <-"b"
}
}
I'd appreciate any help, I'm desperately trying to avoid doing this by hand.

I wonder if your expected output is inconsistent with your rules; assuming that it is, try this:
dat
# P1 P2 o1 o2 o3
# 1 A T A T AT
# 2 C A CA A C
# 3 G C G G C
# 4 T C C TC CT
vecgrepl <- Vectorize(grepl)
dat[,3:5] <- lapply(dat[,3:5], function(Z)
sapply(paste0(+(vecgrepl(dat$P1, Z)), +(vecgrepl(dat$P2, Z))),
switch, "01"="b", "10"="a", "11"="h", "-")
)
dat
# P1 P2 o1 o2 o3
# 1 A T a b h
# 2 C A h b a
# 3 G C a a b
# 4 T C b h h
Breakdown:
grepl accepts a pattern of length 1 only, so we need to Vectorize it. There are other ways to do this with equivalent results.
vecgrepl(dat$P1, Z) should return (for each column Z) whether P1's letter is found in its value.
+(.) is a shortcut for converting FALSE/TRUE to 0/1, used for switch below; admittedly we could use FALSETRUE, TRUEFALSE,TRUETRUE and such, I thought this might appear cleaner.
switch is an easy way to emulate multiple if/then conditionals, looking at the three combinations. The trailing "-" is a default value, if none of "01", "10", or "11" is seen (effectively "00" here). (This could also be emulated with dplyr::case_when or data.table::fcase with little adjustment.)
Because switch is also length-1 only, I vectorize it using sapply(..., switch, ...) where the second ... are arguments sent to switch. Equivalent to sapply(paste0(...), function(x) switch(x, "01"="b", ..)).
dat[,3:5] <- lapply(dat[,3:5], ..) only does this for three of the columns; I could also have used dat[,-(1:2)] to do all columns except the first two.
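As a small illustration of the steps in the breakdown, the sketch below runs the Vectorize(grepl) and switch pieces on a tiny pair of vectors. The vectors parent1, parent2 and child are hypothetical values chosen for this example, not the data from the question.

```r
vecgrepl <- Vectorize(grepl)

parent1 <- c("A", "C")   # hypothetical P1 letters
parent2 <- c("T", "A")   # hypothetical P2 letters
child   <- c("AT", "A")  # hypothetical offspring values

in_p1 <- vecgrepl(parent1, child)  # is P1's letter found in the offspring value?
in_p2 <- vecgrepl(parent2, child)  # is P2's letter found in the offspring value?

# +(.) turns TRUE/FALSE into 1/0, so each pair becomes "10", "01", "11" or "00"
codes <- paste0(+in_p1, +in_p2)

# the unnamed last argument of switch is the default (used for "00")
recoded <- unname(sapply(codes, switch, "01" = "b", "10" = "a", "11" = "h", "-"))
recoded
# [1] "h" "b"
```

Note that grepl treats its first argument as a regular expression; for single letters like these that makes no difference, but fixed = TRUE may be safer if the parent values could contain special characters.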
Data
dat <- structure(list(P1 = c("A", "C", "G", "T"), P2 = c("T", "A", "C", "C"), o1 = c("A", "CA", "G", "C"), o2 = c("T", "A", "G", "TC"), o3 = c("AT", "C", "C", "CT")), class = "data.frame", row.names = c("1", "2", "3", "4"))

Counting unique occurrence of two values per ID, considering presence of third value

The title is not quite cool - I apologise that I was not able to summarise the question better. I am conceptually a bit lost and wondered if there is a better approach for the following:
What I have:
I have two columns, ID and eye. Eyes can be coded as "r", "l" or "b" (right/ left/ both eyes). It does not have to contain all values, and it can include NA.
What I want:
I want to count number of distinct eyes by ID. If "b" is occurring, "r|l" for the same ID should not be counted (because "right | left eye" is part of "both eyes").
Ideally base R only:
My approach uses base R only, and I would much prefer a base R solution, because this is intended for a package. (Actually, the core of this function is already part of a package, but I wonder if this can be improved).
Other solutions very welcome:
The final function is also to be applied on data frames with 10^6 rows and thousands of IDs, so should be fast computation by group. My solution seems already fairly fast, (I have not done a formal test though). I would therefore also think any dplyr::group_by solution would not be an option (at least in my approaches).
# sample data
set.seed(42)
id <- letters[sample(11, replace = TRUE)]
foo1 <- data.frame(id, eye = sample(c("r", "l", "b"), 11, replace = TRUE))
foo2 <- data.frame(id, eye = "r")
foo3 <- foo2
foo3$eye[1:5] <- NA
foo4 <- data.frame(id, eye = "b")
count_eyes <- function(x, pat_col, eye) {
# reduce to unique combinations of patient and eye, then count occurrence of
# "eye" by patient. Results in matrix of 0/1
eye_tab <- table(unique(x[, c(pat_col, eye)]))
# cases where "b" does not exist must also work (foo2 and foo3)
if(any(grepl("b", colnames(eye_tab)))){
# whenever "b" is present, "r" and "l" will be set to 0,
# so it will not be counted in the next step
# "r" and "l" might not occur
if(any(grepl("r|l", colnames(eye_tab)))){
eye_tab[, c("r","l")][eye_tab[, "b"] == 1] <- 0
}
}
# I chose the programmatic approach because the column names might not be present
# I add all 1 for each column. Because r is set to 0 previously, I have to
# add the count for b again to get the real number of right eyes.
n_b <- unname(colSums(eye_tab[, colnames(eye_tab) == "b", drop = FALSE]))
n_right <- sum(unname(colSums(eye_tab[, colnames(eye_tab) == "r", drop = FALSE])), n_b)
n_left <- sum(unname(colSums(eye_tab[, colnames(eye_tab) == "l", drop = FALSE])), n_b)
c(r = n_right, l = n_left)
}
expected result
lapply(mget(c("foo1", "foo2", "foo3", "foo4")), count_eyes, pat_col = "id", eye = "eye")
#> $foo1
#> r l
#> 7 6
#>
#> $foo2
#> r l
#> 8 0
#>
#> $foo3
#> r l
#> 6 0
#>
#> $foo4
#> r l
#> 8 8

The code could be shortened if we convert the column to factor with levels specified
count_eyes <- function(x, pat_col, eye) {
nm1 <- c('r', 'l')
x[[eye]] <- factor(x[[eye]], levels = c("b", nm1)) # convert the eye column to a factor with fixed levels
# reduce to unique combinations of patient and eye, then count occurrence of
# "eye" by patient. Results in matrix of 0/1
eye_tab <- table(unique(x[, c(pat_col, eye)]))
# cases where "b" does not exist must also work (foo2 and foo3)
if(any(grepl("b", colnames(eye_tab)))){
# whenever "b" is present, "r" and "l" will be set to 0,
# so it will not be counted in the next step
# "r" and "l" might not occur
if(any(grepl(paste(nm1, collapse="|"), colnames(eye_tab)))){
eye_tab[, nm1][eye_tab[, "b"] == 1] <- 0
}
}
out <- colSums(eye_tab)
out[nm1] + out['b']
}
Testing:
lapply(mget(paste0('foo', 1:4)), count_eyes, pat_col = "id", eye = "eye")
#$foo1
#r l
#7 6
#$foo2
#r l
#8 0
#$foo3
#r l
#6 0
#$foo4
#r l
#8 8
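The counting rule from the question ("b" counts for both eyes, each ID counted once per eye) can also be written without a contingency table. The following is a rough base R sketch under that reading of the rule; the function name count_eyes_sketch and the toy data frame are made up for this example, and it has not been benchmarked.

```r
# Hypothetical alternative: count IDs that have a right (or left) eye,
# where "b" contributes to both and NA contributes to neither.
count_eyes_sketch <- function(x, pat_col, eye) {
  ids  <- x[[pat_col]]
  eyes <- x[[eye]]
  c(r = length(unique(ids[eyes %in% c("r", "b")])),
    l = length(unique(ids[eyes %in% c("l", "b")])))
}

toy <- data.frame(
  id  = c("a", "a", "b", "c", "c", "d", "e"),
  eye = c("r", "b", "l", NA,  "r", "b", "r"),
  stringsAsFactors = FALSE
)
res <- count_eyes_sketch(toy, "id", "eye")
res
# r l
# 4 3
```

Because %in% never returns NA, missing eye codes simply drop out, and unique() on the filtered IDs avoids any per-group loop.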

Here's another approach with split and rowSums:
count_eyes <- function(x, pat_col, eye) {
  # split the eye codes by patient as plain vectors (not one-column data frames),
  # so that %in% compares the individual values
  per_id <- split(x[[eye]], x[[pat_col]])
  rowSums(sapply(per_id, function(y) {
    c(r = any(y %in% c("b", "r")),
      l = any(y %in% c("b", "l")))
  }))
}
lapply(mget(ls(pattern = "foo")), count_eyes, "id", "eye")
# $foo1
# r l
# 7 6
# $foo2
# r l
# 8 0
# $foo3
# r l
# 6 0
# $foo4
# r l
# 8 8

looping not working in R

I am trying to write a function to solve a small problem. I have two lists, and each list comprises n samples. Each sample has a variable number of bacterial identifiers (letters in the example; in my real data, identifiers like OTU1-OTUn; in both cases they are of type "character"). One list comprises samples from diet, and the other list samples from gut contents. For each sample in the gut list, I want to know how many bacteria from the diet are in the gut and how many bacteria in the gut do not come from the diet. This was easily done when working with phyloseq objects, where diet and gut are both phyloseq objects with n samples each.
Bacteria_from_diet <- length(intersect(taxa_names(gut), taxa_names(diet)))
Bacteria_not_diet <- length(taxa_names(diet)) - Bacteria_from_diet
However, this summarizes the result over all n samples of gut and diet, as if the data were collapsed across samples, and I need some measure of variation.
I have tried the following code in R:
diet<-list(DL1=c("A","B","C"),DL2=c("A","C","D"),DL3=c("B","D","E"),DL4=c("B","D","E"))
gut<-list(DL5=c("A","F","G"),DL6=c("B","F","H"),DL7=c("D","H","J"),DL8=c("A","G","F"))
gut_vs_diet <- function(a,b) ## a is diet and b is gut
{
xx<-10
gut = numeric(xx)
diet = numeric(xx)
all<-unlist(lapply(b,length)) ### get the number of elements of each element of list b
for(i in seq_along(b)){ #### loop over b (gut) to get:
diet<-length(intersect(b[[i]],a[[i]])) ### the number of elements of diet are present in gut
gut = all-diet ## the number of elements of gut that not come from diet
}
gutvsdiet = data.frame(all,gut,diet)
return(gutvsdiet)
}
When running the function I obtain this result, which is not correct:
gut_vs_diet(diet,gut)
all gut diet
DL5 3 3 0
DL6 3 3 0
DL7 3 3 0
DL8 3 3 0
In some cases, I was able to get some values in the diet column, but the function seemed to choose the diet sample at random.
I do not know where the mistake could be. Anyway, I would like to do this iteratively, that is, get the values for each sample of gut compared with all samples of diet. Alternatively, I could run replicate(10, gut_vs_diet(sample(diet), sample(gut))) to get random comparisons and avoid some kind of bias.
Thank you very much for your help
Manuel

Here is my version of your code:
diet <- list(DL1=c("A","B","C"), DL2=c("A","C","D"), DL3=c("B","D","E"), DL4=c("B","D","E"))
gut <- list(DL5=c("A","F","G"), DL6=c("B","F","H"), DL7=c("D","H","J"), DL8=c("A","G","F"))
gut_vs_diet <- function(a, b) ## a is diet and b is gut
{
all <- lengths(b) ### get the number of elements of each element of list b
diet <- mapply(function(ai, bi) length(intersect(ai, bi)), a, b)
# diet <- lengths(mapply(intersect, a, b)) ## a variant
data.frame(all, gut=all-diet, diet)
}
gut_vs_diet(diet,gut)
# > gut_vs_diet(diet,gut)
# all gut diet
# DL5 3 2 1
# DL6 3 3 0
# DL7 3 2 1
# DL8 3 3 0
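The question also asks for each gut sample to be compared with all diet samples, not only the one at the same position. One possible way to do that, sketched here with nested sapply calls, is to build a matrix of overlap counts; the name overlap is just chosen for this example.

```r
diet <- list(DL1 = c("A", "B", "C"), DL2 = c("A", "C", "D"),
             DL3 = c("B", "D", "E"), DL4 = c("B", "D", "E"))
gut  <- list(DL5 = c("A", "F", "G"), DL6 = c("B", "F", "H"),
             DL7 = c("D", "H", "J"), DL8 = c("A", "G", "F"))

# rows = gut samples, columns = diet samples,
# cell = number of identifiers shared by that gut/diet pair
overlap <- sapply(diet, function(d) {
  sapply(gut, function(g) length(intersect(g, d)))
})
overlap
#     DL1 DL2 DL3 DL4
# DL5   1   1   0   0
# DL6   1   0   1   1
# DL7   0   1   1   1
# DL8   1   1   0   0
```

Subtracting each row from lengths(gut) would then give, per pair, the number of gut identifiers not found in that diet sample, and row-wise summaries such as the mean or range give a measure of variation.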

As @jogo suggested in a comment, you can use mapply instead of your for-loop:
FOO <- function(x, y){
all <- lengths(y)
diet <- mapply(function(a, b){
length(intersect(b, a))
}, x, y)
gut <- all - diet
return(data.frame(all, gut, diet))
}
> FOO(diet, gut)
all gut diet
DL5 3 2 1
DL6 3 3 0
DL7 3 2 1
DL8 3 3 0

Just for completion, with a for loop it would look like this. Note that you need to subtract all[[i]] - diet and construct the dataframe inside the loop, otherwise you will just fill it with the last result of the loop, which is, data.frame(all = c(3,3,3,3), gut = 3, diet = 0)
diet <- list(DL1 = c("A", "B", "C"), DL2 = c("A", "C", "D"), DL3 = c("B", "D", "E"), DL4 = c("B", "D", "E"))
gut <- list(DL5 = c("A", "F", "G"), DL6 = c("B", "F", "H"), DL7 = c("D", "H", "J"), DL8 = c("A", "G", "F"))
gut_vs_diet <- function(a, b)
{
all <- lengths(b)
gutvsdiet <- NULL
for (i in seq_along(b)) {
diet <- length(intersect(b[[i]], a[[i]]))
gut <- all[[i]] - diet
resultForThisListElement <- c(all[[i]], gut, diet)
gutvsdiet <- rbind(gutvsdiet, resultForThisListElement)
}
colnames(gutvsdiet) <- c("all", "gut", "diet")
return(gutvsdiet)
}
gut_vs_diet(diet, gut)

How to create table with both forward and backward information for edges with igraph

I have created a table with igraph listing the data as follows :
where a, b, c, d, e are the vertices.
a and b are connected by mutual edges,
with weight values of 1 for a->b and 2 for b->a (there is no self-loop).
By the way I used the following code to create the above table:
library(igraph)
library(dplyr)
g <- data.frame(from = c("a", "b", "c", "d", "e"),
to = c("b", "a", "a", "b", "a"), weight = c(1:5)) %>%
igraph::graph_from_data_frame()
Now I hope to create another table listing both the forward and backward information between the edges, as well as the weight values like:
Does anyone know how to do this with igraph?

First you could get a list of the pairs of nodes that share an edge, regardless of direction
simplified <- as.undirected(g, mode="collapse")
pairs <- ends(simplified, E(simplified))
Then we can write a helper function that returns the weight of the edge from one node to another, or NA if that edge doesn't exist
get_edge_weight<- Vectorize(function(a, b) {
e <- E(g)[a %->% b]
if(length(e)==1) {
e$weight
} else {
NA
}
})
Then you can build your desired data.frame with
data.frame(from=pairs[,1], to=pairs[,2],
fwd=get_edge_weight(pairs[,1], pairs[,2]),
back=get_edge_weight(pairs[,2], pairs[,1])
)
# from to fwd back
# b a b 1 2
# c a c NA 3
# d b d NA 4
# e a e NA 5
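If the edge list is still available as a data frame, the same forward/backward table can be sketched in base R without going through the graph object. This is only an illustration: it assumes at most one edge per direction between any two vertices, and the names edges, lookup and pairs are chosen for this example.

```r
edges <- data.frame(from = c("a", "b", "c", "d", "e"),
                    to = c("b", "a", "a", "b", "a"),
                    weight = 1:5,
                    stringsAsFactors = FALSE)

# weight of each directed edge, addressable by "from to"
lookup <- setNames(edges$weight, paste(edges$from, edges$to))

# one row per unordered pair, with the alphabetically smaller vertex first
pairs <- unique(data.frame(from = pmin(edges$from, edges$to),
                           to = pmax(edges$from, edges$to),
                           stringsAsFactors = FALSE))

# missing directions are not in the lookup and therefore come back as NA
pairs$fwd  <- unname(lookup[paste(pairs$from, pairs$to)])
pairs$back <- unname(lookup[paste(pairs$to, pairs$from)])
pairs
```

For the five edges above, only a->b has a forward weight (1); the backward weights are 2, 3, 4 and 5, which agrees with the igraph-based table.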

trouble understanding count.multiple and simplify in igraph

I am working with network data and have come across an odd (or at least I didn't expect it) behavior with count.multiple in the igraph package in R.
library(igraph)
library(plyr)
df <- data.frame( sender = c( "a", "a", "a", "b", "b", "c","c","d" ),
receiver = c( "b", "b", "b", "c", "a", "d", "d", "a" ) )
What I want is to count up all of the edges and use the multiples as a weight.
when I do ddply(df, .(sender, receiver), "nrow") my results are:
sender receiver nrow
1 a b 3
2 b a 1
3 b c 1
4 c d 2
5 d a 1
Which is what I would expect.
However, I cannot reproduce this using igraph's count.multiple, which is what I expected to do this within igraph
df.graph <- graph.edgelist(as.matrix(df))
E(df.graph)$weight <- count.multiple(df.graph)
E(df.graph)$weight produces:
3 3 3 1 1 2 2 1
I then used the simplify command:
df.graph <- simplify(df.graph)
which produces
9 1 1 4 1
I get what is going on here, simplify is just adding the weights, but I don't understand why/when this would be used as opposed to what ddply is doing..?
Any thoughts?
Thanks!

The default behaviour of simplify is to add the weights of multiple edges.
To avoid double counting, you can set the initial weights to 1
g <- graph.edgelist(as.matrix(df))
E(g)$weight <- 1
g <- simplify( g )
E(g)$weight
or change the way they are aggregated.
g <- graph.edgelist(as.matrix(df))
E(g)$weight <- count.multiple(g)
g <- simplify( g, edge.attr.comb = list(weight=max, name="concat", "ignore") )
E(g)$weight
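To see where the 9 and 4 in the question come from, the aggregation can be mimicked in base R. The sketch below is an analogy only, not igraph's implementation: mult plays the role of the count.multiple output, and tapply with sum or max plays the role of the two ways of combining weights when duplicate edges are merged.

```r
df <- data.frame(sender   = c("a", "a", "a", "b", "b", "c", "c", "d"),
                 receiver = c("b", "b", "b", "c", "a", "d", "d", "a"),
                 stringsAsFactors = FALSE)

key <- paste(df$sender, df$receiver, sep = "->")

# multiplicity of each edge, repeated once per duplicate
mult <- as.vector(table(key)[key])
mult
# [1] 3 3 3 1 1 2 2 1

# summing the duplicates counts each multiplicity several times (3 + 3 + 3 = 9)
summed <- tapply(mult, key, sum)
# taking the max (or starting from weights of 1 and summing) gives the counts
maxed <- tapply(mult, key, max)
summed
maxed
```

Here summed gives 9, 1, 1, 4, 1 for a->b, b->a, b->c, c->d, d->a, while maxed gives 3, 1, 1, 2, 1, which is the same as the ddply counts.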
