looping not working in R - r

I am trying to write a function to solve a small problem. I have two lists, and each list comprises n samples. Each sample has a variable number of bacterial identifiers (letters in the example; identifiers like OTU1-OTUn in my real problem; in both cases they are of type “character”). One list comprises samples from diet, and the other list samples from gut contents. For each sample in the gut list, I want to know how many bacteria from the diet are in the gut and how many bacteria in the gut do not come from the diet. This was easily done when working with phyloseq objects, where diet and gut are both phyloseq objects with n samples each.
Bacteria_from_diet<-length(intersect(taxa_names(gut),taxa_names(diet)))
Bacteria_not_diet<-length(taxa_names(diet))- Bacteria_from_diet
However, this “summarizes” the result over the n samples of gut and diet, as if the data were collapsed across samples, and I need some measure of variation.
I have tried the following code in R:
diet<-list(DL1=c("A","B","C"),DL2=c("A","C","D"),DL3=c("B","D","E"),DL4=c("B","D","E"))
gut<-list(DL5=c("A","F","G"),DL6=c("B","F","H"),DL7=c("D","H","J"),DL8=c("A","G","F"))
gut_vs_diet <- function(a,b) ## a is diet and b is gut
{
xx<-10
gut = numeric(xx)
diet = numeric(xx)
all<-unlist(lapply(b,length)) ### get the number of elements of each element of list b
for(i in seq_along(b)){ #### loop over b (gut) to get:
diet<-length(intersect(b[[i]],a[[i]])) ### the number of elements of diet are present in gut
gut = all-diet ## the number of elements of gut that not come from diet
}
gutvsdiet = data.frame(all,gut,diet)
return(gutvsdiet)
}
When running the function I obtain this result, which is not correct:
gut_vs_diet(diet,gut)
all gut diet
DL5 3 3 0
DL6 3 3 0
DL7 3 3 0
DL8 3 3 0
In some cases I was able to get some values in the diet column, but the function seemed to choose the diet sample at random.
I do not know where the mistake could be. In any case, I would like to do this iteratively, that is, get the values for each gut sample compared with all diet samples. Alternatively, I could run replicate(10, gut_vs_diet(sample(diet), sample(gut))) to get random comparisons and avoid some kind of bias.
Thank you very much for your help
Manuel

Here is my version of your code:
diet <- list(DL1=c("A","B","C"), DL2=c("A","C","D"), DL3=c("B","D","E"), DL4=c("B","D","E"))
gut <- list(DL5=c("A","F","G"), DL6=c("B","F","H"), DL7=c("D","H","J"), DL8=c("A","G","F"))
gut_vs_diet <- function(a, b) ## a is diet and b is gut
{
all <- lengths(b) ### get the number of elements of each element of list b
diet <- mapply(function(ai, bi) length(intersect(ai, bi)), a, b)
# diet <- lengths(mapply(intersect, a, b)) ## a variant
data.frame(all, gut=all-diet, diet)
}
gut_vs_diet(diet,gut)
# > gut_vs_diet(diet,gut)
# all gut diet
# DL5 3 2 1
# DL6 3 3 0
# DL7 3 2 1
# DL8 3 3 0

As @jogo suggested in a comment, you can use mapply instead of your for-loop:
FOO <- function(x, y){
all <- lengths(y)
diet <- mapply(function(a, b){
length(intersect(b, a))
}, x, y)
gut <- all - diet
return(data.frame(all, gut, diet))
}
> FOO(diet, gut)
all gut diet
DL5 3 2 1
DL6 3 3 0
DL7 3 2 1
DL8 3 3 0

Just for completeness, with a for loop it would look like this. Note that you need to subtract with all[[i]] - diet and build up the result row by row inside the loop; otherwise you only keep the last result of the loop, which gives data.frame(all = c(3,3,3,3), gut = 3, diet = 0).
diet <- list(DL1 = c("A", "B", "C"), DL2 = c("A", "C", "D"), DL3 = c("B", "D", "E"), DL4 = c("B", "D", "E"))
gut <- list(DL5 = c("A", "F", "G"), DL6 = c("B", "F", "H"), DL7 = c("D", "H", "J"), DL8 = c("A", "G", "F"))
gut_vs_diet <- function(a, b)
{
all <- lengths(b)
gutvsdiet <- NULL
for (i in seq_along(b)) {
diet <- length(intersect(b[[i]], a[[i]]))
gut <- all[[i]] - diet
resultForThisListElement <- c(all[[i]], gut, diet)
gutvsdiet <- rbind(gutvsdiet, resultForThisListElement)
}
colnames(gutvsdiet) <- c("all", "gut", "diet")
return(gutvsdiet)
}
gut_vs_diet(diet, gut)
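The question also asks for each gut sample to be compared with every diet sample, not only the one in the same position. Here is a minimal sketch of that, assuming the diet and gut lists from the question; the names shared, diet_pool, from_any_diet and not_from_diet are made up for illustration.

```r
diet <- list(DL1 = c("A", "B", "C"), DL2 = c("A", "C", "D"), DL3 = c("B", "D", "E"), DL4 = c("B", "D", "E"))
gut <- list(DL5 = c("A", "F", "G"), DL6 = c("B", "F", "H"), DL7 = c("D", "H", "J"), DL8 = c("A", "G", "F"))

# shared[i, j]: number of taxa in gut sample i that also occur in diet sample j
shared <- sapply(diet, function(d) sapply(gut, function(g) length(intersect(g, d))))

# Pool all diet samples, then count per gut sample how many taxa occur anywhere in the diet
diet_pool <- unique(unlist(diet))
from_any_diet <- sapply(gut, function(g) sum(unique(g) %in% diet_pool))
not_from_diet <- lengths(lapply(gut, unique)) - from_any_diet

shared
data.frame(all = lengths(gut), gut = not_from_diet, diet = from_any_diet)
```

Row-wise summaries of shared, for example apply(shared, 1, sd), would then give a per-sample measure of variation across diet samples.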

Related

Recoding values by row based on values of other columns

I have a data set containing the genetic information of two parents and 300+ offspring. I'm trying to change the row values of the offspring based on the values of the parents in that row such that:
P1 P2 o1 o2 o3
1 A T A T AT
2 C A CA A C
3 G C G G C
4 T C C TC CT
becomes:
P1 P2 o1 o2 o3
1 A T a b h
2 C A b b a
3 G C a a b
4 T C b b h
where 'a' in the offspring indicates that it's like P1, 'b' for P2, and 'h' for having both. I've split the parent columns from the offspring for ease (Parents and Test, respectively), but my loop doesn't work or changes the entire row to NA. I've just been trying to tackle recoding to 'a' and 'b' for now with the following code:
for (i in 1:nrow(Test)) {
if (Parents[i, 1] == "A") {
Test[Test[i, ] == "A"] <- "a"
} else
if (Parents[i, 2] =="A") {
Test[Test[i, ] == "A"] <-"b"
}
}
I'd appreciate any help, I'm desperately trying to avoid doing this by hand.
I wonder if your expected output is inconsistent with your rules; assuming that it is, try this:
dat
# P1 P2 o1 o2 o3
# 1 A T A T AT
# 2 C A CA A C
# 3 G C G G C
# 4 T C C TC CT
vecgrepl <- Vectorize(grepl)
dat[,3:5] <- lapply(dat[,3:5], function(Z)
sapply(paste0(+(vecgrepl(dat$P1, Z)), +(vecgrepl(dat$P2, Z))),
switch, "01"="b", "10"="a", "11"="h", "-")
)
dat
# P1 P2 o1 o2 o3
# 1 A T a b h
# 2 C A h b a
# 3 G C a a b
# 4 T C b h h
Breakdown:
grepl accepts a pattern of length 1 only, so we need to Vectorize it. There are other ways to do this with equivalent results.
vecgrepl(dat$P1, Z) should return (for each column Z) whether P1's letter is found in its value.
+(.) is a shortcut for converting FALSE/TRUE to 0/1, used for switch below; admittedly we could use FALSETRUE, TRUEFALSE,TRUETRUE and such, I thought this might appear cleaner.
switch is an easy way to emulate multiple if/then conditionals, looking at the three combinations. The trailing "-" is a default value, if none of "01", "10", or "11" is seen (effectively "00" here). (This could also be emulated with dplyr::case_when or data.table::fcase with little adjustment.)
Because switch is also length-1 only, I vectorize it using sapply(..., switch, ...) where the second ... are arguments sent to switch. Equivalent to sapply(paste0(...), function(x) switch(x, "01"="b", ..)).
dat[,3:5] <- lapply(dat[,3:5], ..) only does this for three of the columns; could also have done dat[,-(1:2)] to do all columns except the first two.
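To make the breakdown concrete, here is a small sketch that runs the same steps on a single offspring column with two rows. The vectors p1, p2 and off are made-up stand-ins for dat$P1, dat$P2 and one offspring column.

```r
vecgrepl <- Vectorize(grepl)

p1  <- c("A", "C")    # parent 1 allele per row
p2  <- c("T", "A")    # parent 2 allele per row
off <- c("AT", "A")   # offspring genotype per row

in_p1 <- vecgrepl(p1, off)   # is the P1 letter found in the offspring value?
in_p2 <- vecgrepl(p2, off)   # is the P2 letter found in the offspring value?

# +(.) turns FALSE/TRUE into 0/1, so each row becomes "00", "01", "10" or "11"
codes <- paste0(+in_p1, +in_p2)

# switch handles one value at a time, so sapply applies it to each code
recoded <- unname(sapply(codes, switch, "01" = "b", "10" = "a", "11" = "h", "-"))
recoded
```

The first row contains both parental letters and becomes "h"; the second contains only the P2 letter and becomes "b".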
Data
dat <- structure(list(P1 = c("A", "C", "G", "T"), P2 = c("T", "A", "C", "C"), o1 = c("A", "CA", "G", "C"), o2 = c("T", "A", "G", "TC"), o3 = c("AT", "C", "C", "CT")), class = "data.frame", row.names = c("1", "2", "3", "4"))

Cluster observations based on multiple variables

I am looking for an R function to create clusters in my dataset based on two variables (I hope "cluster" is the right name for what I want to do). Any two observations with the same value for variable_1 OR variable_2 should be in the same cluster. In the brief example that follows, I am aggregating the data frame df based on variable_1 and variable_2.
df <- data.frame(variable_1=c("a","a","b","b","c","c","d","d","e","e"),variable_2=c("g1","g2","g1","g3","g2","g4","g4","g6","g7","g8"),value=rnorm(10))
df$clusters <- some_function_to_create_clusters(df[,c("variable_1","variable_2")])
The result should look like:
df$clusters <- c("clu1","clu1","clu1","clu1","clu1","clu1","clu1","clu1","clu2","clu2")
df
Notice that the first cluster contains everyone with variable_1 equals "a", "b", "c", or "d": "a" and "b" are merged together because they share "g1" (lines 1 and 3); "a" and "c" are merged because they share "g2" (lines 2 and 5); and "c" and "d" are merged because they share "g4" (lines 6 and 7).
Finally, in the last cluster there are only observations with variable_1=="e", because they don't share variable_2 with anyone.
Just to clarify what I intend to do, I will explain my problem setting a little better. I'm pairing counties with nearby Touristic Attractions (TAs). Different counties are surrounded by different TAs, and there are many TAs around the same county. But these "touristic clusters" of counties and TAs are sparsely spread across the country. Notice that some far-apart counties might be inside the same cluster due to a “chain” effect of county-TA connections. So I want to find those “clusters” based on the id of the county and of the Touristic Attraction.
This seems simple, but I can't figure out how to implement it.
Thanks a lot
igraph solution
Disclaimer: I am completely new to igraph, so there's probably a better solution to this problem. However this seems to work.
With the igraph package we can cluster the data using the graph_from_data_frame() function, and then extract the clusters with components. You get the added advantage of being able to visualise the clusters.
library(igraph)
graph <- graph_from_data_frame(df[, 1:2], directed = FALSE)
cmp <- components(graph)$membership
# look up membership by vertex name; as.character() guards against factor columns
df$cluster <- cmp[as.character(df$variable_1)]
plot(graph)
Wrapping it up into a function
If you wanted to wrap it up as a function, something like this works:
find_clusters <- function(x, y) {
edges <- data.frame(from = x, to = y)
graph <- igraph::graph_from_data_frame(edges, directed = FALSE)
cmp <- igraph::components(graph)$membership
# look up membership by vertex name; as.character() guards against factor input
return(cmp[as.character(x)])
}
Using the additional example you posted as a comment above, we thus have the following workflow:
library(dplyr)
df <- data.frame(
variable_1 = c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e", "f", "f"),
variable_2 = c( "g1", "g2", "g1", "g3", "g2", "g4", "g4", "g6", "g7", "g8", "g9", "g12"),
value = rnorm(12)
)
df %>%
mutate(cluster = find_clusters(variable_1, variable_2))
# variable_1 variable_2 value cluster
# 1 a g1 -0.03410073 1
# 2 a g2 0.51261548 1
# 3 b g1 0.06470451 1
# 4 b g3 -1.97228101 1
# 5 c g2 -0.39751063 1
# 6 c g4 0.17761619 1
# 7 d g4 -0.13771207 1
# 8 d g6 -0.72183017 1
# 9 e g7 0.09012701 2
# 10 e g8 0.45763593 2
# 11 f g9 -0.83172613 3
# 12 f g12 2.83480352 3
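If adding igraph as a dependency is not an option, the same connected-components idea can be sketched in base R by repeatedly giving each row the smallest label found among the rows that share either of its two values. This is only an illustrative sketch: find_clusters_base is a made-up name, and the approach is likely slower than igraph on large data.

```r
find_clusters_base <- function(v1, v2) {
  v1 <- as.character(v1)
  v2 <- as.character(v2)
  label <- seq_along(v1)   # start with every row in its own cluster
  repeat {
    # each row takes the smallest label among rows sharing its v1 or its v2
    new_label <- pmin(ave(label, v1, FUN = min), ave(label, v2, FUN = min))
    if (identical(new_label, label)) break   # nothing changed: labels are stable
    label <- new_label
  }
  match(label, unique(label))   # renumber clusters as 1, 2, 3, ...
}

variable_1 <- c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e", "f", "f")
variable_2 <- c("g1", "g2", "g1", "g3", "g2", "g4", "g4", "g6", "g7", "g8", "g9", "g12")
cl <- find_clusters_base(variable_1, variable_2)
cl
```

On the twelve-row example this gives the same grouping as the igraph version: the first eight rows in one cluster, then "e" and "f" in clusters of their own.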
So, I wrote a function to achieve what I need. It is ugly but it is working. If someone has a better/more efficient solution I would really appreciate it.
find_clusters <- function(original_df){
find_clus <- original_df
cluster_number <- 1
find_clus$cluster <- "cl"
i=1
for(i in 1:nrow(find_clus)){
if(nchar(find_clus$cluster[i])>2) next
aux <- lapply(original_df,function(x){ which(x==x[i])})%>% reshape2::melt()
idx <- aux$value %>%unique() %>%sort()
j = 1
while(j <= length(idx)){
aux <- lapply(original_df,function(x){ which(x==x[idx[j]])})%>% reshape2::melt()
idx <- c( idx, aux$value) %>%unique() %>% sort()
j <- j+1
}
find_clus$cluster[idx] <- paste0("cl",sprintf("%04d", cluster_number))
cluster_number<- cluster_number +1
}
return(find_clus$cluster)
}
So, to find the clusters one should write:
find_clusters(df[,c(1,2)])

permute dataframe but must have unique rows

Say I have a dataframe like this:
d <- data.frame(time = c(1,3,5,6,11,15,15,18,18,20), side = c("L", "R", "R", "L", "L", "L", "L", "R","R","R"), id = c(1,2,1,2,4,3,4,2,1,1), stringsAsFactors = F)
d
time side id
1 1 L 1
2 3 R 2
3 5 R 1
4 6 L 2
5 11 L 4
6 15 L 3
7 15 L 4
8 18 R 2
9 18 R 1
10 20 R 1
I wish to permute the id variable and keep the other two constant. However, importantly, in my final permutations I do not want to have the same id on the same side at the same time. For instance, there are two times/sides where this might occur. In the original data at time 15 and 18 there are two unique ids at the same side (left for time 15 and right for time 18). If I permute using sample there is a chance that the same id shows up at the same time/side combination.
For example,
set.seed(11)
data.frame(time=d$time, side=d$side, id=sample(d$id))
time side id
1 1 L 1
2 3 R 1
3 5 R 4
4 6 L 1
5 11 L 4
6 15 L 2
7 15 L 3
8 18 R 2
9 18 R 2
10 20 R 1
Here, id=2 appears on two rows at time 18 on side "R". This is not allowed in the permutation I need.
One solution would be to brute force this: e.g. if I needed 100 permutations, I could generate 500 and discard those that fail the criteria. However, in my real data I have hundreds of rows, and just using sample almost always leads to a failure. I wonder if there is a better algorithm for doing this? Perhaps a birth-death algorithm?
Setup:
library(tidyverse)
d <- data.frame(time = c(1,3,5,6,11,15,15,18,18,20), side = c("L", "R", "R", "L", "L", "L", "L", "R","R","R"), id = c(1,2,1,2,4,3,4,2,1,1), stringsAsFactors = F)
d <- rownames_to_column(d)
I keep the row names so that the rows can be put back in their original order at the end.
You need a function that takes a vector (like your id vector) and returns a sample of size n with the constraint that the values have to be different, as in the following (which assumes the sampling you want can actually take place, i.e. you haven't run out of items to sample). For convenience this also returns the "leftovers" that weren't sampled:
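samp_uniq_n <- function(vec, n) {
x <- vec
out <- rep(NA, n)
for(i in seq_len(n)) {
# Here would be a good place to make sure sampling is even possible.
ux <- unique(x)
# Weight each distinct value by its remaining count, in the same order as ux
# (table() sorts its names, so its order need not match unique()).
counts <- tabulate(match(x, ux))
# Sample an index: sample(ux, 1) would draw from 1:ux when only one numeric value is left.
out[i] <- ux[sample.int(length(ux), 1, prob = counts)]
x <- x[x != out[i]]
vec <- vec[-min(which(vec == out[i]))]
}
return(list(out=out, vec=vec))
}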
Now, we need to split the data into a list of rows that have the same time and side and start the sampling with the largest such:
id <- d$id
d_split <- d %>% select(-id) %>% split(., list(d$time, d$side), drop = TRUE)
d_split_desc <- d_split[order(-sapply(d_split, nrow))]
Then we can do the sampling itself:
for(i in seq_along(d_split_desc)) {
samp <- samp_uniq_n(id, nrow(d_split_desc[[i]]))
this_id <- samp$out
d_split_desc[[i]]$id <- this_id
id <- samp$vec
}
Finally, some cleanup:
d_permute <- do.call(rbind, d_split_desc) %>%
arrange(as.numeric(rowname)) %>%
select(-rowname)
Putting all this in a big function is an annoyance I'll leave to anyone who is interested.
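For anyone who does want a single function, here is a hedged base R sketch of the same idea: split the rows by time and side, fill the largest groups first, and draw distinct ids from the remaining pool. The names draw_distinct and permute_ids are made up for this sketch, and the greedy order is not guaranteed to succeed on every data set, so it stops with an error if it runs out of distinct ids.

```r
# Draw n distinct values from vec; values with more remaining copies are more likely.
draw_distinct <- function(vec, n) {
  out <- vec[0]
  for (i in seq_len(n)) {
    avail <- vec[!vec %in% out]                  # copies of values not yet drawn
    if (length(avail) == 0) stop("not enough distinct ids left")
    pick <- avail[sample.int(length(avail), 1)]  # index sampling avoids the sample(n) pitfall
    out <- c(out, pick)
    vec <- vec[-match(pick, vec)]                # remove one copy from the pool
  }
  list(out = out, vec = vec)
}

permute_ids <- function(d) {
  groups <- split(seq_len(nrow(d)), paste(d$time, d$side))
  groups <- groups[order(-lengths(groups))]      # largest time/side groups first
  pool <- d$id
  new_id <- d$id
  for (rows in groups) {
    s <- draw_distinct(pool, length(rows))
    new_id[rows] <- s$out
    pool <- s$vec
  }
  d$id <- new_id
  d
}

d <- data.frame(time = c(1, 3, 5, 6, 11, 15, 15, 18, 18, 20),
                side = c("L", "R", "R", "L", "L", "L", "L", "R", "R", "R"),
                id = c(1, 2, 1, 2, 4, 3, 4, 2, 1, 1), stringsAsFactors = FALSE)
set.seed(11)
p <- permute_ids(d)
any(duplicated(p[c("time", "side", "id")]))   # FALSE when the constraint holds
```

Because rows are filled in place by index, no row-name bookkeeping is needed to restore the original order.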

Using R to Assign Treatments to Groups

We have seven exposures and 24 groups. We would like to randomly assign five of the seven exposures to each group while also ensuring that we end up with a consistent count for each exposure, meaning that each exposure ends up being used about the same number of times. I have written some code that does this, but I cannot control how many times each exposure is shown. For example:
exposures <- c("A", "B", "C", "D", "E", "F", "G")
groups <- c(1:24)
table <- c()
for (i in 1:24){
draw <- sample(exposures, size=5, replace=F)
table <- rbind(table, draw)
}
table(table)
So the counts end up somewhat close but is there something I can do to ensure a minimum for each exposure? Thanks!
EDIT Also, we need each exposure to appear only once per group.
It's easier to think of it in terms of the two exposures that aren't used, rather than the five that are. Let's limit the number of times an exposure can be excluded:
draw_exc <- function(exposures,nexp,ng,max_excluded = 10){
nexc <- length(exposures)-nexp
exp_rem <- exposures
exc <- matrix(,ng,nexc)
for (i in 1:ng){
pool <- combn(exp_rem,nexc)
draw <- pool[,sample(1:ncol(pool), 1)]
exc[i,] <- draw
tab <- table(exc)
exp_rem <- setdiff(exp_rem, names(tab[tab > max_excluded]) )
}
exc
}
Here's an illustration:
set.seed(1)
exc <- draw_exc(exposures,5,24,10)
assignment <- apply(exc,1,function(x) setdiff(exposures,x))
table(exc)
# exc
# A B C D E F G
# 7 4 6 6 8 10 7
table(assignment)
# assignment
# A B C D E F G
# 17 20 18 18 16 14 17
So, with 24 groups, the maximum number of exclusions equals 24 minus the minimum number of appearances. This loop is not efficient, but it seems to get the job done.
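If the counts should be as even as possible rather than merely capped, one alternative is to always pick the least-used exposures and break ties at random. The sketch below assumes that is acceptable as a form of restricted randomization; assign_balanced is a made-up name and is not part of the answer above.

```r
assign_balanced <- function(exposures, n_per_group, n_groups) {
  counts <- setNames(integer(length(exposures)), exposures)
  out <- matrix(NA_character_, n_groups, n_per_group)
  for (g in seq_len(n_groups)) {
    # least-used exposures first; runif() breaks ties in random order
    pick <- order(counts, runif(length(counts)))[seq_len(n_per_group)]
    out[g, ] <- sample(exposures[pick])   # shuffle the order within the group
    counts[pick] <- counts[pick] + 1L
  }
  out
}

exposures <- c("A", "B", "C", "D", "E", "F", "G")
set.seed(1)
assignment <- assign_balanced(exposures, 5, 24)
tab <- table(assignment)
tab
```

With 24 groups of five, the 120 slots are spread over seven exposures, so every exposure appears 17 or 18 times, and no exposure repeats within a group because pick holds distinct positions.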

trouble understanding count.multiple and simplify in igraph

I am working with network data and have come across an odd (or at least I didn't expect it) behavior with count.multiple in the igraph package in R.
library(igraph)
library(plyr)
df <- data.frame( sender = c( "a", "a", "a", "b", "b", "c","c","d" ),
receiver = c( "b", "b", "b", "c", "a", "d", "d", "a" ) )
What I want is to count up all of the edges and use the multiples as a weight.
When I do ddply(df, .(sender, receiver), "nrow") my results are:
sender receiver nrow
1 a b 3
2 b a 1
3 b c 1
4 c d 2
5 d a 1
Which is what I would expect.
However, I cannot reproduce this using igraph's count.multiple, which is what I expected to do this within igraph
df.graph <- graph.edgelist(as.matrix(df))
E(df.graph)$weight <- count.multiple(df.graph)
E(df.graph)$weight produces:
3 3 3 1 1 2 2 1
I then used the simplify command:
df.graph <- simplify(df.graph)
which produces
9 1 1 4 1
I get what is going on here, simplify is just adding the weights, but I don't understand why/when this would be used as opposed to what ddply is doing..?
Any thoughts?
Thanks!
The default behaviour of simplify is to add the weights of multiple edges.
To avoid double counting, you can set the initial weights to 1
g <- graph.edgelist(as.matrix(df))
E(g)$weight <- 1
g <- simplify( g )
E(g)$weight
or change the way they are aggregated.
g <- graph.edgelist(as.matrix(df))
E(g)$weight <- count.multiple(g)
g <- simplify( g, edge.attr.comb = list(weight=max, name="concat", "ignore") )
E(g)$weight
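To see where the 9 and the 4 come from, the arithmetic can be reproduced in base R without igraph. In this sketch ave() only mimics the per-edge multiplicities that count.multiple returned in the question, and tapply() mimics the aggregation over duplicate edges; it is not how igraph works internally.

```r
df <- data.frame(sender   = c("a", "a", "a", "b", "b", "c", "c", "d"),
                 receiver = c("b", "b", "b", "c", "a", "d", "d", "a"))
edge <- paste(df$sender, df$receiver, sep = "->")

# every copy of an edge carries the multiplicity of that edge: 3 3 3 1 1 2 2 1
mult <- ave(rep(1, nrow(df)), edge, FUN = sum)

# summing those per edge counts each copy again: 3 + 3 + 3 = 9 and 2 + 2 = 4
summed <- tapply(mult, edge, sum)

# starting from weights of 1, the sum is the plain edge count
ones <- tapply(rep(1, nrow(df)), edge, sum)

# taking the max of the multiplicities gives the same counts
maxed <- tapply(mult, edge, max)

rbind(summed, ones, maxed)
```

The ones and maxed rows match the ddply counts, which is why either fix in the answer gives the expected weights.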
