I have a data set containing the genetic information of two parents and 300+ offspring. I'm trying to change the row values of the offspring based on the values of the parents in that row such that:
P1 P2 o1 o2 o3
1 A T A T AT
2 C A CA A C
3 G C G G C
4 T C C TC CT
becomes:
P1 P2 o1 o2 o3
1 A T a b h
2 C A b b a
3 G C a a b
4 T C b b h
where 'a' in the offspring indicates that it's like P1, 'b' for P2, and 'h' for having both. I've split the parent columns from the offspring for ease (Parents and Test, respectively), but my loop doesn't work or changes the entire row to NA. I've just been trying to tackle recoding to 'a' and 'b' for now with the following code:
for (i in 1:nrow(Test)) {
if (Parents[i, 1] == "A") {
Test[Test[i, ] == "A"] <- "a"
} else
if (Parents[i, 2] =="A") {
Test[Test[i, ] == "A"] <-"b"
}
}
I'd appreciate any help, I'm desperately trying to avoid doing this by hand.
I wonder if your expected output is inconsistent with your rules; assuming that it is, try this:
dat
# P1 P2 o1 o2 o3
# 1 A T A T AT
# 2 C A CA A C
# 3 G C G G C
# 4 T C C TC CT
vecgrepl <- Vectorize(grepl)
dat[,3:5] <- lapply(dat[,3:5], function(Z)
sapply(paste0(+(vecgrepl(dat$P1, Z)), +(vecgrepl(dat$P2, Z))),
switch, "01"="b", "10"="a", "11"="h", "-")
)
dat
# P1 P2 o1 o2 o3
# 1 A T a b h
# 2 C A h b a
# 3 G C a a b
# 4 T C b h h
Breakdown:
grepl accepts a pattern of length 1 only, so we need to Vectorize it. There are other ways to do this with equivalent results.
vecgrepl(dat$P1, Z) should return (for each column Z) whether P1's letter is found in its value.
+(.) is a shortcut for converting FALSE/TRUE to 0/1, used for switch below; admittedly we could use FALSETRUE, TRUEFALSE,TRUETRUE and such, I thought this might appear cleaner.
switch is an easy way to emulate multiple if/then conditionals, looking at the three combinations. The trailing "-" is a default value, if none of "01", "10", or "11" is seen (effectively "00" here). (This could also be emulated with dplyr::case_when or data.table::fcase with little adjustment.)
Because switch is also length-1 only, I vectorize it using sapply(..., switch, ...) where the second ... are arguments sent to switch. Equivalent to sapply(paste0(...), function(x) switch(x, "01"="b", ..)).
dat[,3:5] <- lapply(dat[,3:5], ..) only does this for three of the columns; I could also have used dat[,-(1:2)] to do all columns except the first two.
Data
dat <- structure(list(P1 = c("A", "C", "G", "T"), P2 = c("T", "A", "C", "C"), o1 = c("A", "CA", "G", "C"), o2 = c("T", "A", "G", "TC"), o3 = c("AT", "C", "C", "CT")), class = "data.frame", row.names = c("1", "2", "3", "4"))
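For comparison, here is a minimal base R sketch of the same recoding without Vectorize or switch. It rebuilds the same dat as above. recode_offspring is a hypothetical helper name, not part of the original answer, and fixed = TRUE is my own addition so that each parent letter is matched literally rather than as a regular expression.

```r
dat <- data.frame(
  P1 = c("A", "C", "G", "T"),
  P2 = c("T", "A", "C", "C"),
  o1 = c("A", "CA", "G", "C"),
  o2 = c("T", "A", "G", "TC"),
  o3 = c("AT", "C", "C", "CT"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: recode one offspring column against both parents
recode_offspring <- function(offspring, p1, p2) {
  # Row by row: does the offspring value contain the parent's letter?
  has_p1 <- mapply(grepl, p1, offspring, MoreArgs = list(fixed = TRUE))
  has_p2 <- mapply(grepl, p2, offspring, MoreArgs = list(fixed = TRUE))
  unname(ifelse(has_p1 & has_p2, "h",
         ifelse(has_p1, "a",
         ifelse(has_p2, "b", "-"))))
}

recoded <- dat
recoded[, -(1:2)] <- lapply(dat[, -(1:2)], recode_offspring,
                            p1 = dat$P1, p2 = dat$P2)
recoded
```

This should reproduce the table from the switch version, with "-" for any offspring value that contains neither parent's letter.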
I have several probability distribution functions defined using the pdqr package. Let's say they are A, B and C:
A <- as_d(function(x)dnorm(x, mean = 3, sd = 1))
B <- as_d(function(x)dnorm(x, mean = 6, sd = 1))
C <- as_d(function(x)dnorm(x, mean = 2, sd = 2))
I have a large data.frame with a character column, distr, that names the appropriate PDF for each case, let's say:
df <- data.frame(distr = c("A", "C", "A", "B", "B", "A", "C"))
I would like to generate the mean of each PDF per case. Individually this works like this for PDF A:
> pdqr::summ_mean(A)
[1] 3
Now I would like to generate the mean for each case based on the PDF set in distr. This means passing the PDF into pdqr::summ_mean(). I have tried the following, with the resulting errors:
> df$distr_mean <- summ_mean(df$distr)
Error: `f` is not pdqr-function. It should be function.
>
> df$distr_mean <- summ_mean(invoke_map(df$distr))
Error in A() : argument "x" is missing, with no default
>
> df$distr_mean <- df %>%
+ pull(distr) %>%
+ summ_mean()
Error: `f` is not pdqr-function. It should be function.
So, either it doesn't understand that a pdqr-function is being passed, or it needs an x-value, which doesn't make sense, since I want the mean over the entire distribution, not at a single x (passing a range like c(1:10) also doesn't work). Furthermore, as I understand it, any apply or do.call function only passes one single function, while I want to pass several different functions, given in a vector.
How to proceed?
One way to do this is to use the distr column as an argument to mget, which will return all the appropriate functions in a list. Just feed that list to summ_mean using sapply:
sapply(mget(df$distr), pdqr::summ_mean)
#> A C A B B A C
#> 3 2 3 6 6 3 2
Though inside mutate you'll need to tell mget in which environment the functions will be found:
df %>%
mutate(distr_mean = sapply(mget(distr, envir = .GlobalEnv), pdqr::summ_mean))
#> distr distr_mean
#> 1 A 3
#> 2 C 2
#> 3 A 3
#> 4 B 6
#> 5 B 6
#> 6 A 3
#> 7 C 2
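To illustrate the environment point in isolation, here is a small sketch that does not need pdqr. The registry environment and the plain density functions in it are hypothetical stand-ins of my own; the only thing shown is how mget resolves names in the environment you hand it.

```r
# Hypothetical registry environment holding the functions by name
registry <- new.env()
registry$A <- function(x) dnorm(x, mean = 3, sd = 1)
registry$C <- function(x) dnorm(x, mean = 2, sd = 2)

# Look the names up in that environment only; repeated names are allowed
fns <- mget(c("A", "C", "A"), envir = registry)
names(fns)

# A name that is not defined would normally raise an error;
# ifnotfound supplies a fallback value instead
missing_fn <- mget("Z", envir = registry, ifnotfound = list(NULL))
is.null(missing_fn$Z)
```

Being explicit about envir avoids depending on whichever frame mget happens to be called from.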
This may be easier to manage if you store your functions in a named list, rather than in the top level environment. From there, it's relatively easy to use sapply or lapply to calculate the mean for each function and then extract the results into df:
df <- data.frame(distr = c("A", "C", "A", "B", "B", "A", "C"))
pdfs <- list(
A = as_d(function(x)dnorm(x, mean = 3, sd = 1)),
B = as_d(function(x)dnorm(x, mean = 6, sd = 1)),
C = as_d(function(x)dnorm(x, mean = 2, sd = 2))
)
means <- sapply(pdfs, summ_mean)
df$distr_mean <- means[df$distr]
distr distr_mean
1 A 3
2 C 2
3 A 3
4 B 6
5 B 6
6 A 3
7 C 2
Or in one line:
df$distr_mean <- sapply(df$distr, \(x) pdqr::summ_mean(pdfs[[x]]))
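If pdqr is not at hand, the named-list lookup can still be tried out with plain density functions. In the sketch below, density_mean is a hypothetical stand-in for summ_mean that integrates x * f(x) numerically, so its results are only approximate; vapply is used so the result is guaranteed to be a numeric vector.

```r
# Plain density functions as stand-ins for the pdqr-functions
pdfs <- list(
  A = function(x) dnorm(x, mean = 3, sd = 1),
  B = function(x) dnorm(x, mean = 6, sd = 1),
  C = function(x) dnorm(x, mean = 2, sd = 2)
)

# Hypothetical stand-in for summ_mean: E[X] by numerical integration
density_mean <- function(d) {
  integrate(function(x) x * d(x), lower = -Inf, upper = Inf)$value
}

# Compute each distinct mean once, then look it up per row by name
means <- vapply(pdfs, density_mean, numeric(1))

df <- data.frame(distr = c("A", "C", "A", "B", "B", "A", "C"))
df$distr_mean <- unname(means[df$distr])
df
```

Computing the means once per distribution and indexing by name avoids recomputing the same summary for every row of a large data.frame.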
I am looking for an R function to create clusters in my dataset based on two variables (I hope "cluster" is the right name for what I want to do). Any two observations with the same value for variable_1 OR variable_2 should be in the same cluster. In the brief example that follows, I am aggregating the data frame df based on variable_1 and variable_2.
df <- data.frame(variable_1=c("a","a","b","b","c","c","d","d","e","e"),variable_2=c("g1","g2","g1","g3","g2","g4","g4","g6","g7","g8"),value=rnorm(10))
df$clusters <- some_function_to_create_clusters(df[,c("variable_1","variable_2")])
The result should look like:
df$clusters <- c("clu1","clu1","clu1","clu1","clu1","clu1","clu1","clu1","clu2","clu2")
df
Notice that the first cluster contains everyone with variable_1 equals "a", "b", "c", or "d": "a" and "b" are merged together because they share "g1" (lines 1 and 3); "a" and "c" are merged because they share "g2" (lines 2 and 5); and "c" and "d" are merged because they share "g4" (lines 6 and 7).
Finally, in the last cluster there are only observations with variable_1=="e", because they don't share variable_2 with anyone.
Just to clarify what I intend to do, I will explain my problem a little better. I'm pairing counties with nearby Touristic Attractions (TAs). Different counties are surrounded by different TAs, and there are many TAs around the same county. But these "touristic clusters" of counties and TAs are sparsely spread across the country. Notice that some far-away counties might be inside the same cluster due to a "chain" effect of county-TA connections. So I want to find those "clusters" based on the id of the county and of the Touristic Attraction.
This seems simple, but I can't figure out how to implement it.
Thanks a lot
igraph solution
Disclaimer: I am completely new to igraph, so there's probably a better solution to this problem. However this seems to work.
With the igraph package we can cluster the data using the graph_from_data_frame() function, and then extract the clusters with components. You get the added advantage of being able to visualise the clusters.
library(igraph)
graph <- graph_from_data_frame(df[, 1:2], directed = FALSE)
cmp <- components(graph)$membership
df$cluster <- cmp[df$variable_1]
plot(graph)
Wrapping it up into a function
If you wanted to wrap it up as a function, something like this works:
find_clusters <- function(x, y) {
edges <- data.frame(from = x, to = y)
graph <- igraph::graph_from_data_frame(edges, directed = FALSE)
cmp <- igraph::components(graph)$membership
return(cmp[as.character(x)])  # index by vertex name, even if x is a factor
}
Using the additional example you posted as a comment above, we thus have the following workflow:
library(dplyr)
df <- data.frame(
variable_1 = c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e", "f", "f"),
variable_2 = c( "g1", "g2", "g1", "g3", "g2", "g4", "g4", "g6", "g7", "g8", "g9", "g12"),
value = rnorm(12)
)
df %>%
mutate(cluster = find_clusters(variable_1, variable_2))
# variable_1 variable_2 value cluster
# 1 a g1 -0.03410073 1
# 2 a g2 0.51261548 1
# 3 b g1 0.06470451 1
# 4 b g3 -1.97228101 1
# 5 c g2 -0.39751063 1
# 6 c g4 0.17761619 1
# 7 d g4 -0.13771207 1
# 8 d g6 -0.72183017 1
# 9 e g7 0.09012701 2
# 10 e g8 0.45763593 2
# 11 f g9 -0.83172613 3
# 12 f g12 2.83480352 3
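If adding igraph as a dependency is not an option, the same connected-components idea can be sketched in base R with a small union-find. find_clusters_base is a hypothetical name of my own, and unlike graph_from_data_frame this sketch treats a label that appears in both columns as two different nodes, which I am assuming is what is wanted when the columns hold different kinds of ids.

```r
find_clusters_base <- function(x, y) {
  # Prefix the values so identical labels in x and y stay distinct nodes
  kx <- paste0("x:", x)
  ky <- paste0("y:", y)
  nodes <- unique(c(kx, ky))
  xi <- match(kx, nodes)
  yi <- match(ky, nodes)

  parent <- seq_along(nodes)  # every node starts as its own root
  find_root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }

  # Each row is an edge between its x node and its y node: merge their sets
  for (k in seq_along(xi)) {
    rx <- find_root(xi[k])
    ry <- find_root(yi[k])
    if (rx != ry) parent[ry] <- rx
  }

  roots <- vapply(xi, find_root, integer(1))
  match(roots, unique(roots))  # number clusters by first appearance
}

df <- data.frame(
  variable_1 = c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e", "f", "f"),
  variable_2 = c("g1", "g2", "g1", "g3", "g2", "g4", "g4", "g6", "g7", "g8", "g9", "g12")
)
df$cluster <- find_clusters_base(df$variable_1, df$variable_2)
df
```

There is no path compression or union by rank here, so for very large inputs the igraph version is probably the better choice.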
So, I wrote a function to achieve what I need. It is ugly but it is working. If someone has a better/more efficient solution I would really appreciate it.
find_clusters <- function(original_df){
find_clus <- original_df
cluster_number <- 1
find_clus$cluster <- "cl"
i=1
for(i in 1:nrow(find_clus)){
if(nchar(find_clus$cluster[i])>2) next
aux <- lapply(original_df,function(x){ which(x==x[i])})%>% reshape2::melt()
idx <- aux$value %>%unique() %>%sort()
j = 1
while(j <= length(idx)){
aux <- lapply(original_df,function(x){ which(x==x[idx[j]])})%>% reshape2::melt()
idx <- c( idx, aux$value) %>%unique() %>% sort()
j <- j+1
}
find_clus$cluster[idx] <- paste0("cl",sprintf("%04d", cluster_number))
cluster_number<- cluster_number +1
}
return(find_clus$cluster)
}
So, to find the clusters one should write:
find_clusters(df[,c(1,2)])
I am trying to write a function to solve a small problem. I have two lists, and each list comprises n samples. Each sample has a variable number of bacterial identifiers (letters in the example; in my real problem, identifiers like OTU1-OTUn; in both cases they are of type "character"). One list comprises samples from diet, and the other list samples from gut contents. For each sample in the gut list, I want to know how many bacteria from diet are in the gut and how many bacteria in the gut do not come from diet. This was easily done when working with phyloseq objects, where diet and gut are both phyloseq objects with n samples each.
Bacteria_from_diet<-length(intersect(taxa_names(gut),taxa_names(diet)))
Bacteria_not_diet<-length(taxa_names(diet))- Bacteria_from_diet
However, this "summarizes" the result over the n samples of gut and diet, as if I had collapsed the data across samples, and I need some measure of variation.
I have tried the following code in R:
diet<-list(DL1=c("A","B","C"),DL2=c("A","C","D"),DL3=c("B","D","E"),DL4=c("B","D","E"))
gut<-list(DL5=c("A","F","G"),DL6=c("B","F","H"),DL7=c("D","H","J"),DL8=c("A","G","F"))
gut_vs_diet <- function(a,b) ## a is diet and b is gut
{
xx<-10
gut = numeric(xx)
diet = numeric(xx)
all<-unlist(lapply(b,length)) ### get the number of elements of each element of list b
for(i in seq_along(b)){ #### loop over b (gut) to get:
diet<-length(intersect(b[[i]],a[[i]])) ### the number of elements of diet are present in gut
gut = all-diet ## the number of elements of gut that not come from diet
}
gutvsdiet = data.frame(all,gut,diet)
return(gutvsdiet)
}
When running the function I obtain this result, which is not correct:
gut_vs_diet(diet,gut)
all gut diet
DL5 3 3 0
DL6 3 3 0
DL7 3 3 0
DL8 3 3 0
In some cases, I was able to get some values in the diet column, but the function seemed to choose the diet sample at random.
I do not know where the mistake could be. Anyway, I would like to do this iteratively, I mean, get the values for each sample of gut compared with all samples of diet. Alternatively, I could run replicate(10,gut_vs_diet(sample(diet),sample(gut))) to get random comparisons and avoid some kind of bias.
Thank you very much for your help
Manuel
Here is my version of your code:
diet <- list(DL1=c("A","B","C"), DL2=c("A","C","D"), DL3=c("B","D","E"), DL4=c("B","D","E"))
gut <- list(DL5=c("A","F","G"), DL6=c("B","F","H"), DL7=c("D","H","J"), DL8=c("A","G","F"))
gut_vs_diet <- function(a, b) ## a is diet and b is gut
{
all <- lengths(b) ### get the number of elements of each element of list b
diet <- mapply(function(ai, bi) length(intersect(ai, bi)), a, b)
# diet <- lengths(mapply(intersect, a, b)) ## a variant
data.frame(all, gut=all-diet, diet)
}
gut_vs_diet(diet,gut)
# > gut_vs_diet(diet,gut)
# all gut diet
# DL5 3 2 1
# DL6 3 3 0
# DL7 3 2 1
# DL8 3 3 0
As @jogo suggested in a comment, you can use mapply instead of your for-loop:
FOO <- function(x, y){
all <- lengths(y)
diet <- mapply(function(a, b){
length(intersect(b, a))
}, x, y)
gut <- all - diet
return(data.frame(all, gut, diet))
}
> FOO(diet, gut)
all gut diet
DL5 3 2 1
DL6 3 3 0
DL7 3 2 1
DL8 3 3 0
Just for completeness, with a for loop it would look like this. Note that you need to subtract all[[i]] - diet and build up the result inside the loop; otherwise you will just fill it with the last result of the loop, which is data.frame(all = c(3,3,3,3), gut = 3, diet = 0).
diet <- list(DL1 = c("A", "B", "C"), DL2 = c("A", "C", "D"), DL3 = c("B", "D", "E"), DL4 = c("B", "D", "E"))
gut <- list(DL5 = c("A", "F", "G"), DL6 = c("B", "F", "H"), DL7 = c("D", "H", "J"), DL8 = c("A", "G", "F"))
gut_vs_diet <- function(a, b)
{
all <- lengths(b)
gutvsdiet <- NULL
for (i in seq_along(b)) {
diet <- length(intersect(b[[i]], a[[i]]))
gut <- all[[i]] - diet
resultForThisListElement <- c(all[[i]], gut, diet)
gutvsdiet <- rbind(gutvsdiet, resultForThisListElement)
}
colnames(gutvsdiet) <- c("all", "gut", "diet")
return(gutvsdiet)
}
gut_vs_diet(diet, gut)
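The question also asks for each gut sample compared with all diet samples, which the positional pairing above does not cover. A minimal sketch of that, assuming the same diet and gut lists; the names overlap and from_any_diet are hypothetical and of my own choosing:

```r
diet <- list(DL1 = c("A", "B", "C"), DL2 = c("A", "C", "D"),
             DL3 = c("B", "D", "E"), DL4 = c("B", "D", "E"))
gut  <- list(DL5 = c("A", "F", "G"), DL6 = c("B", "F", "H"),
             DL7 = c("D", "H", "J"), DL8 = c("A", "G", "F"))

# Rows are gut samples, columns are diet samples;
# each cell counts the identifiers the two samples share
overlap <- sapply(diet, function(d) {
  sapply(gut, function(g) length(intersect(g, d)))
})
overlap

# Per gut sample: identifiers found in at least one diet sample, and the rest
all_diet <- unique(unlist(diet))
from_any_diet <- sapply(gut, function(g) length(intersect(g, all_diet)))
data.frame(all = lengths(gut),
           gut = lengths(gut) - from_any_diet,
           diet = from_any_diet)
```

The spread of each row of overlap gives one possible measure of variation across diet samples, without resorting to random pairings.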
I have created a table with igraph listing the data as follows:
where a, b, c, d, e are the nodes.
a and b are connected by mutual edges,
with weight values of 1 for a->b and 2 for b->a (there are no self-loops).
By the way I used the following code to create the above table:
library(igraph)
library(dplyr)
g <- data.frame(from = c("a", "b", "c", "d", "e"),
to = c("b", "a", "a", "b", "a"), weight = c(1:5)) %>%
igraph::graph_from_data_frame()
Now I hope to create another table listing both the forward and backward information between the edges, as well as the weight values like:
Does anyone know how to do this with igraph?
First you could get a list of the pairs of nodes that share an edge regardless of direction
simplified <- as.undirected(g, mode="collapse")
pairs <- ends(simplified, E(simplified))
Then we can write a helper function to return a given edge weight between two nodes and, if the edge doesn't exist, return NA instead
get_edge_weight<- Vectorize(function(a, b) {
e <- E(g)[a %->% b]
if(length(e)==1) {
e$weight
} else {
NA
}
})
Then you can build your desired data.frame with
data.frame(from=pairs[,1], to=pairs[,2],
fwd=get_edge_weight(pairs[,1], pairs[,2]),
back=get_edge_weight(pairs[,2], pairs[,1])
)
# from to fwd back
# b a b 1 2
# c a c NA 3
# d b d NA 4
# e a e NA 5
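The same forward/backward table can also be sketched from the original edge list in base R, without going through the graph object. This assumes there is at most one edge per direction between any two nodes; edges, pairs and key are names of my own, not from the answer above.

```r
edges <- data.frame(
  from   = c("a", "b", "c", "d", "e"),
  to     = c("b", "a", "a", "b", "a"),
  weight = 1:5,
  stringsAsFactors = FALSE
)

# One row per unordered pair, with the alphabetically smaller node first
pairs <- unique(data.frame(
  from = pmin(edges$from, edges$to),
  to   = pmax(edges$from, edges$to),
  stringsAsFactors = FALSE
))

# Look up the weight of each direction; match() gives NA when it is absent
key <- paste(edges$from, edges$to, sep = "->")
pairs$fwd  <- edges$weight[match(paste(pairs$from, pairs$to, sep = "->"), key)]
pairs$back <- edges$weight[match(paste(pairs$to, pairs$from, sep = "->"), key)]
rownames(pairs) <- NULL
pairs
```

If the graph only exists as an igraph object, the edge list can be recovered first with igraph::as_data_frame(g, what = "edges").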