I am looking for an R function to create clusters in my dataset based on two variables (I hope "cluster" is the right name for what I want to do). Any two observations with the same value for variable_1 OR variable_2 should be in the same cluster. In the brief example that follows, I am grouping the data frame df based on variable_1 and variable_2.
df <- data.frame(variable_1=c("a","a","b","b","c","c","d","d","e","e"),variable_2=c("g1","g2","g1","g3","g2","g4","g4","g6","g7","g8"),value=rnorm(10))
df$clusters <- some_function_to_create_clusters(df[,c("variable_1","variable_2")])
The result should look like:
df$clusters <- c("clu1","clu1","clu1","clu1","clu1","clu1","clu1","clu1","clu2","clu2")
df
Notice that the first cluster contains every observation with variable_1 equal to "a", "b", "c", or "d": "a" and "b" are merged together because they share "g1" (lines 1 and 3); "a" and "c" are merged because they share "g2" (lines 2 and 5); and "c" and "d" are merged because they share "g4" (lines 6 and 7).
Finally, in the last cluster there are only observations with variable_1=="e", because they don't share variable_2 with anyone.
Just to clarify what I intend to do, I will explain my problem a little better. I'm pairing counties with nearby Touristic Attractions (TAs). Different counties are surrounded by different TAs, and there can be many TAs around the same county. But these "touristic clusters" of counties and TAs are sparsely spread across the country. Notice that some far-away counties might end up in the same cluster due to a "chain" effect of county-TA connections. So I want to find those "clusters" based on the id of the county and of the Touristic Attraction.
This seems simple, but I can't figure out how to implement it.
Thanks a lot
igraph solution
Disclaimer: I am completely new to igraph, so there's probably a better solution to this problem. However this seems to work.
With the igraph package we can cluster the data using the graph_from_data_frame() function, and then extract the clusters with components. You get the added advantage of being able to visualise the clusters.
library(igraph)
graph <- graph_from_data_frame(df[, 1:2], directed = FALSE)
cmp <- components(graph)$membership
df$cluster <- cmp[as.character(df$variable_1)]  # look up by vertex name, even if variable_1 is a factor
plot(graph)
Wrapping it up into a function
If you wanted to wrap it up as a function, something like this works:
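find_clusters <- function(x, y) {
  # as.character() so that factor columns are looked up by vertex name, not by integer code
  x <- as.character(x)
  edges <- data.frame(from = x, to = as.character(y))
  graph <- igraph::graph_from_data_frame(edges, directed = FALSE)
  cmp <- igraph::components(graph)$membership
  return(unname(cmp[x]))
}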
Using the additional example you posted as a comment above, we thus have the following workflow:
library(dplyr)
df <- data.frame(
variable_1 = c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e", "f", "f"),
variable_2 = c( "g1", "g2", "g1", "g3", "g2", "g4", "g4", "g6", "g7", "g8", "g9", "g12"),
value = rnorm(12)
)
df %>%
mutate(cluster = find_clusters(variable_1, variable_2))
# variable_1 variable_2 value cluster
# 1 a g1 -0.03410073 1
# 2 a g2 0.51261548 1
# 3 b g1 0.06470451 1
# 4 b g3 -1.97228101 1
# 5 c g2 -0.39751063 1
# 6 c g4 0.17761619 1
# 7 d g4 -0.13771207 1
# 8 d g6 -0.72183017 1
# 9 e g7 0.09012701 2
# 10 e g8 0.45763593 2
# 11 f g9 -0.83172613 3
# 12 f g12 2.83480352 3
So, I wrote a function to achieve what I need. It is ugly, but it works. It needs the %>% pipe (e.g. from magrittr or dplyr) and the reshape2 package. If someone has a better or more efficient solution, I would really appreciate it.
find_clusters <- function(original_df){
  find_clus <- original_df
  cluster_number <- 1
  find_clus$cluster <- "cl"
  i=1
  for(i in 1:nrow(find_clus)){
    if(nchar(find_clus$cluster[i])>2) next
    aux <- lapply(original_df,function(x){ which(x==x[i])})%>% reshape2::melt()
    idx <- aux$value %>%unique() %>%sort()
    j = 1
    while(j <= length(idx)){
      aux <- lapply(original_df,function(x){ which(x==x[idx[j]])})%>% reshape2::melt()
      idx <- c( idx, aux$value) %>%unique() %>% sort()
      j <- j+1
    }
    find_clus$cluster[idx] <- paste0("cl",sprintf("%04d", cluster_number))
    cluster_number<- cluster_number +1
  }
  return(find_clus$cluster)
}
So, to find the clusters one should write:
find_clusters(df[,c(1,2)])
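For readers who want to avoid both the nested loops and the igraph dependency, the same grouping can be sketched in base R by repeatedly giving every row the smallest label found among the rows that share its variable_1 or its variable_2 value. This is plain label propagation, not the igraph method from the other answer, and the name find_clusters_base is made up for this illustration. Because values are only compared within their own column, an id that happens to appear in both columns does not link rows by itself.

```r
# A minimal sketch, assuming x and y have the same length and no NA values.
# find_clusters_base is a hypothetical name, not part of any package.
find_clusters_base <- function(x, y) {
  x <- as.character(x)
  y <- as.character(y)
  label <- seq_along(x)            # start with one label per row
  repeat {
    old <- label
    # every row takes the smallest label within its x group, then within its y group
    label <- ave(label, x, FUN = min)
    label <- ave(label, y, FUN = min)
    if (identical(label, old)) break   # nothing changed: groups are stable
  }
  match(label, unique(label))      # renumber clusters as 1, 2, 3, ...
}

df <- data.frame(
  variable_1 = c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e"),
  variable_2 = c("g1", "g2", "g1", "g3", "g2", "g4", "g4", "g6", "g7", "g8")
)
df$cluster <- find_clusters_base(df$variable_1, df$variable_2)
```

Labels only ever decrease, so the loop ends once every group of rows sharing a value agrees on one label. For long chains this may need many passes, so the igraph version is probably the better choice for large data.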
I am trying to write a function to solve a small problem. I have two lists, and each list comprises n samples. Each sample has a variable number of bacterial identifiers (letters in the example; in my real data, identifiers like OTU1-OTUn; in both cases they are of type "character"). One list holds samples from diet and the other holds samples from gut contents. For each sample in the gut list, I want to know how many of its bacteria come from diet and how many do not. This was easily done when working with phyloseq objects, where diet and gut are both phyloseq objects with n samples each:
Bacteria_from_diet <- length(intersect(taxa_names(gut), taxa_names(diet)))
Bacteria_not_diet <- length(taxa_names(gut)) - Bacteria_from_diet
However, this summarizes the result over all n samples of gut and diet, as if the data were collapsed across samples, and I need some measure of variation between samples.
I have tried the following code in R:
diet<-list(DL1=c("A","B","C"),DL2=c("A","C","D"),DL3=c("B","D","E"),DL4=c("B","D","E"))
gut<-list(DL5=c("A","F","G"),DL6=c("B","F","H"),DL7=c("D","H","J"),DL8=c("A","G","F"))
gut_vs_diet <- function(a,b) ## a is diet and b is gut
{
xx<-10
gut = numeric(xx)
diet = numeric(xx)
all<-unlist(lapply(b,length)) ### get the number of elements of each element of list b
for(i in seq_along(b)){ #### loop over b (gut) to get:
diet<-length(intersect(b[[i]],a[[i]])) ### the number of elements of diet are present in gut
gut = all-diet ## the number of elements of gut that not come from diet
}
gutvsdiet = data.frame(all,gut,diet)
return(gutvsdiet)
}
When running the function I obtain this result, which is not correct:
gut_vs_diet(diet,gut)
all gut diet
DL5 3 3 0
DL6 3 3 0
DL7 3 3 0
DL8 3 3 0
In some cases I was able to get some values in the diet column, but the function seemed to pick the diet sample at random.
I do not know where the mistake could be. In any case, I would like to do this iteratively, that is, get the values for each gut sample compared with all diet samples. Alternatively, I could run replicate(10, gut_vs_diet(sample(diet), sample(gut))) to get random comparisons and avoid some kind of bias.
Thank you very much for your help
Manuel
Here is my version of your code:
diet <- list(DL1=c("A","B","C"), DL2=c("A","C","D"), DL3=c("B","D","E"), DL4=c("B","D","E"))
gut <- list(DL5=c("A","F","G"), DL6=c("B","F","H"), DL7=c("D","H","J"), DL8=c("A","G","F"))
gut_vs_diet <- function(a, b) ## a is diet and b is gut
{
all <- lengths(b) ### get the number of elements of each element of list b
diet <- mapply(function(ai, bi) length(intersect(ai, bi)), a, b)
# diet <- lengths(mapply(intersect, a, b)) ## a variant
data.frame(all, gut=all-diet, diet)
}
gut_vs_diet(diet,gut)
# > gut_vs_diet(diet,gut)
# all gut diet
# DL5 3 2 1
# DL6 3 3 0
# DL7 3 2 1
# DL8 3 3 0
As @jogo suggested in a comment, you can use mapply instead of your for loop:
FOO <- function(x, y){
all <- lengths(y)
diet <- mapply(function(a, b){
length(intersect(b, a))
}, x, y)
gut <- all - diet
return(data.frame(all, gut, diet))
}
> FOO(diet, gut)
all gut diet
DL5 3 2 1
DL6 3 3 0
DL7 3 2 1
DL8 3 3 0
Just for completeness, with a for loop it would look like this. Note that you need to compute all[[i]] - diet and collect each row inside the loop; otherwise only the result of the last iteration survives, which is data.frame(all = c(3,3,3,3), gut = 3, diet = 0).
diet <- list(DL1 = c("A", "B", "C"), DL2 = c("A", "C", "D"), DL3 = c("B", "D", "E"), DL4 = c("B", "D", "E"))
gut <- list(DL5 = c("A", "F", "G"), DL6 = c("B", "F", "H"), DL7 = c("D", "H", "J"), DL8 = c("A", "G", "F"))
gut_vs_diet <- function(a, b)
{
all <- lengths(b)
gutvsdiet <- NULL
for (i in seq_along(b)) {
diet <- length(intersect(b[[i]], a[[i]]))
gut <- all[[i]] - diet
resultForThisListElement <- c(all[[i]], gut, diet)
gutvsdiet <- rbind(gutvsdiet, resultForThisListElement)
}
colnames(gutvsdiet) <- c("all", "gut", "diet")
return(gutvsdiet)
}
gut_vs_diet(diet, gut)
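The answers above compare the i-th gut sample with the i-th diet sample only. The question also asks for each gut sample to be compared with all diet samples; a minimal sketch of that, under the assumption that a count of shared identifiers per pair is what is wanted, nests two sapply calls. The names shared, pooled_diet, from_diet and not_from_diet are chosen here for illustration.

```r
diet <- list(DL1 = c("A", "B", "C"), DL2 = c("A", "C", "D"),
             DL3 = c("B", "D", "E"), DL4 = c("B", "D", "E"))
gut  <- list(DL5 = c("A", "F", "G"), DL6 = c("B", "F", "H"),
             DL7 = c("D", "H", "J"), DL8 = c("A", "G", "F"))

# Rows are gut samples, columns are diet samples;
# each cell counts the identifiers the two samples have in common.
shared <- sapply(diet, function(d) {
  sapply(gut, function(g) length(intersect(g, d)))
})

# Alternative view: compare each gut sample with the pooled diet
# (the union of all diet samples).
pooled_diet   <- unique(unlist(diet))
from_diet     <- sapply(gut, function(g) length(intersect(g, pooled_diet)))
not_from_diet <- lengths(gut) - from_diet
```

Each row of shared then gives the spread of overlap for one gut sample across the diet samples, which can be summarised with, for example, apply(shared, 1, sd).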
Say I have a dataframe like this:
d <- data.frame(time = c(1,3,5,6,11,15,15,18,18,20), side = c("L", "R", "R", "L", "L", "L", "L", "R","R","R"), id = c(1,2,1,2,4,3,4,2,1,1), stringsAsFactors = F)
d
time side id
1 1 L 1
2 3 R 2
3 5 R 1
4 6 L 2
5 11 L 4
6 15 L 3
7 15 L 4
8 18 R 2
9 18 R 1
10 20 R 1
I wish to permute the id variable and keep the other two constant. However, importantly, in my final permutations I do not want to have the same id on the same side at the same time. For instance, there are two times/sides where this might occur. In the original data at time 15 and 18 there are two unique ids at the same side (left for time 15 and right for time 18). If I permute using sample there is a chance that the same id shows up at the same time/side combination.
For example,
set.seed(11)
data.frame(time=d$time, side=d$side, id=sample(d$id))
time side id
1 1 L 1
2 3 R 1
3 5 R 4
4 6 L 1
5 11 L 4
6 15 L 2
7 15 L 3
8 18 R 2
9 18 R 2
10 20 R 1
Here, id=2 appears on two rows at time 18 on side "R". This is not allowed in the permutation I need.
One solution would be to brute force this, e.g. if I needed 100 permutations, I could generate 500 and discard those that fail the criterion. However, in my real data I have hundreds of rows, and just using sample almost always leads to a failure. Is there a better algorithm for doing this? Perhaps a birth-death algorithm?
Setup:
library(tidyverse)
d <- data.frame(time = c(1,3,5,6,11,15,15,18,18,20), side = c("L", "R", "R", "L", "L", "L", "L", "R","R","R"), id = c(1,2,1,2,4,3,4,2,1,1), stringsAsFactors = F)
d <- rownames_to_column(d)
I keep the row names in a column so that I can put the rows back in their original order at the end.
You need a function that takes a vector (like your id vector) and returns a sample of size n with the constraint that the values have to be different, as in the following (which assumes the sampling you want can actually take place, i.e. you haven't run out of items to sample). For convenience this also returns the "leftovers" that weren't sampled:
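samp_uniq_n <- function(vec, n) {
  x <- vec
  out <- rep(NA, n)
  for (i in 1:n) {
    # Stop early if no distinct value is left to draw.
    if (length(x) == 0) stop("not enough distinct values left to sample")
    # table() sorts its categories, so sort the candidates to keep them aligned with probs.
    ux <- sort(unique(x))
    probs <- as.vector(prop.table(table(x)))
    # Draw an index: sample(ux, 1) would sample from 1:ux when a single number is left.
    out[i] <- ux[sample.int(length(ux), 1, prob = probs)]
    x <- x[x != out[i]]
    vec <- vec[-min(which(vec == out[i]))]
  }
  return(list(out = out, vec = vec))
}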
Now we need to split the data into a list of groups of rows that share the same time and side, and start the sampling with the largest group:
id <- d$id
d_split <- d %>% select(-id) %>% split(., list(d$time, d$side), drop = TRUE)
d_split_desc <- d_split[order(-sapply(d_split, nrow))]
Then we can do the sampling itself:
for(i in seq_along(d_split_desc)) {
samp <- samp_uniq_n(id, nrow(d_split_desc[[i]]))
this_id <- samp$out
d_split_desc[[i]]$id <- this_id
id <- samp$vec
}
Finally, some cleanup:
d_permute <- do.call(rbind, d_split_desc) %>%
arrange(as.numeric(rowname)) %>%
select(-rowname)
Putting all this in a big function is an annoyance I'll leave to anyone who is interested.
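Whichever sampling scheme is used, it may help to verify each permutation before keeping it, since a greedy draw is not guaranteed to succeed on every dataset. A minimal check, assuming the constraint is exactly "no repeated (time, side, id) combination", could look like this; is_valid_permutation is a name invented for this sketch.

```r
# Hypothetical helper: TRUE when no id appears twice at the same time and side.
is_valid_permutation <- function(time, side, id) {
  !any(duplicated(data.frame(time, side, id)))
}

d <- data.frame(time = c(1, 3, 5, 6, 11, 15, 15, 18, 18, 20),
                side = c("L", "R", "R", "L", "L", "L", "L", "R", "R", "R"),
                id   = c(1, 2, 1, 2, 4, 3, 4, 2, 1, 1),
                stringsAsFactors = FALSE)

is_valid_permutation(d$time, d$side, d$id)     # TRUE: the original data is valid

# The failing shuffle shown in the question (id 2 twice at time 18, side "R").
bad_id <- c(1, 1, 4, 1, 4, 2, 3, 2, 2, 1)
is_valid_permutation(d$time, d$side, bad_id)   # FALSE
```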
I am trying to use an associative array (a named vector) and ddply to add a column to a data frame. For example:
First, I have defined the association and a function that uses it to calculate the product of two elements of a row (property damage and multiplier) to get the actual damage in dollars. Here, "B" means billions, "m" or "M" means millions, etc.
validMultiplierLetter <- c("B", "h", "H", "k", "K", "m", "M")
Multiplier <- c(1000000000, 100, 100, 1000, 1000, 1000000, 1000000)
names(Multiplier) <- validMultiplierLetter
The function ploss (property loss) is:
ploss <- function(pd,pm) {
if (pm %in% validMultiplierLetter) pd*Multiplier[pm]
else 0
}
Here is a sample data frame with columns pd (property damage) and pm (multiplier), and the ddply code to create a pl (property loss) column, which is the product of the property damage and the value associated with the multiplier. Invalid multipliers (e.g., "+") are treated as 0.
tdf <- data.frame(pd = c(5, 10, 15, 20, 25), pm = c("B", "m", "K", "+", "h"))
library(plyr)  # provides ddply() and .()
tldf <- ddply(tdf, .(pd, pm), transform, pl = ploss(pd,pm))
I get the following output when I execute the code above. You can see that the wrong multiplier was used for the rows:
> tldf
pd pm pl
1 5 B 500
2 10 m 10000
3 15 K 15000
4 20 + 0
5 25 h 2500
Strangely, though, when I pass a constant, the lookup works correctly. But when I pass a variable whose value is the same as the constant, I get an incorrect result.
> Multiplier["B"]
B
1e+09
> tdf$pm[1]
[1] B
Levels: + B h K m
> Multiplier[tdf$pm[1]]
h
100
Any explanation of why this happens and how to fix it is greatly appreciated. Thanks.
The problem is that tdf$pm is a factor. When given a factor, [ indexes by the factor's underlying integer codes rather than by its character values:
x <- 10:15
names(x) <- LETTERS[1:6]
x
## A B C D E F
## 10 11 12 13 14 15
x[c('A','F')] # Lookup by name
## A F
## 10 15
x[factor(c('A','F'))] # Lookup by integer
## A B
## 10 11
This is fixed by using as.character around the factor, so that a character vector is presented to [:
x[as.character(factor(c('A','F')))]
## A F
## 10 15
For your problem, you can coerce to character in the transform function:
ddply(tdf, .(pd, pm), transform, pl = ploss(pd,as.character(pm)))
## pd pm pl
## 1 5 B 5.0e+09
## 2 10 m 1.0e+07
## 3 15 K 1.5e+04
## 4 20 + 0.0e+00
## 5 25 h 2.5e+03
In addition, you could vectorize your ploss function in the obvious way and do the job directly with transform:
ploss <- function(pd,pm) {
ifelse(pm %in% validMultiplierLetter, pd*Multiplier[pm], 0)
}
transform(tdf, pl=ploss(pd, as.character(pm)))
## pd pm pl
## 1 5 B 5.0e+09
## 2 10 m 1.0e+07
## 3 15 K 1.5e+04
## 4 20 + 0.0e+00
## 5 25 h 2.5e+03
And of course, the as.character coercion could be within the function ploss, so it isn't required in the transform call:
ploss <- function(pd,pm) {
ifelse(pm %in% validMultiplierLetter, pd*Multiplier[as.character(pm)], 0)
}
The problem I see is that, if you're using the default options of R before version 4.0.0, tdf$pm is a factor, not a character vector. You can check this with class(tdf$pm). What's happening here is that "B" is really a mask for 2 (following the order in the printout: Levels: + B h K m), so pm has the value 2 as far as [ is concerned, and Multiplier[2] is 100, as you've assigned.
When you call data.frame (or read.table), you need to add the argument stringsAsFactors = FALSE, or change the corresponding global option with the options function. Since R 4.0.0, stringsAsFactors = FALSE is the default.
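As a variant of the fixes above, the lookup can be made robust to both factors and unknown letters in one place. The sketch below is only an illustration: lookup_multiplier and its default argument are hypothetical names, and it treats any letter that is missing from the table as 0, as the question does for "+".

```r
multiplier <- c(B = 1e9, h = 100, H = 100, k = 1e3, K = 1e3, m = 1e6, M = 1e6)

# Hypothetical helper: look up multipliers by name,
# mapping letters that are not in the table to `default`.
lookup_multiplier <- function(letters_in, lookup = multiplier, default = 0) {
  # as.character() forces a lookup by name even when a factor is passed in
  out <- unname(lookup[as.character(letters_in)])
  out[is.na(out)] <- default
  out
}

pd <- c(5, 10, 15, 20, 25)
pm <- factor(c("B", "m", "K", "+", "h"))   # a factor on purpose
pl <- pd * lookup_multiplier(pm)
```

Here pl holds 5e9, 1e7, 15000, 0 and 2500, matching the corrected output shown in the first answer.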
I've built a predictive model that uses a large number (30 or so) of independent factor variables. As the dataset I'm using is much larger than the RAM of my machine, I have sampled it for both my training and test sets.
I am now looking to use the model to make predictions over the entire dataset. I'm pulling in the dataset 1 million rows at a time, and each time, I find new levels for some of my factor variables that were not in my training and test set, therefore preventing the model from making predictions.
As there are so many independent factor variables (and so many overall observations), correcting each case by hand is becoming a real pain.
One additional wrinkle to be aware of: there is no guarantee that the order of variables in the overall dataframe and the training/test sets are the same, as I do pre-processing on the data that changes their order.
As such, I'd like to write a function that:
1. Selects and sorts the columns of the new data based on the configuration of my sampled dataframe.
2. Loops through the sampled and new dataframes and designates all factor levels in the new dataframe that do not exist in their corresponding column in the sample dataframe as Other.
3. If a factor level exists in my sample but not in the new dataframe, adds the level (with no observations assigned to it) to its corresponding column in the new dataframe.
I've got #1 together, but don't know the best way to do #2 and #3. If it were any other language, I'd use for loops, but I know that's frowned upon in R.
Here's a reproducible example:
sampleData <- data.frame(abacus=factor(c("a","b","a","a","a")), montreal=factor(c("f","f","f","f","a")), boston=factor(c("z","y","z","z","q")))
dataset <- data.frame(florida=factor(c("e","q","z","d","b", "a")), montreal=factor(c("f","f","f","f","a", "a")), boston=factor(c("m","y","z","z","r", "f")), abacus=factor(c("a","b","z","a","a", "g")))
sampleData
abacus montreal boston
1 a f z
2 b f y
3 a f z
4 a f z
5 a a q
dataset
florida montreal boston abacus
1 e f m a
2 q f y b
3 z f z z
4 d f z a
5 b a r a
6 a a f g
sampleData <- sampleData[,order(names(sampleData))]
dataset <- dataset[,order(names(dataset))]
dataset <- dataset[,colnames(sampleData)]
Below is what I would want dataset to look like once this function is complete. (I don't really care about the final ordering of the columns in dataset; I just think a consistent order is necessary for the loop, or whatever you deem best, to work.) Notice that the column dataset$florida is omitted:
dataset
montreal boston abacus
1 f Other a
2 f y b
3 f z Other
4 f z a
5 a Other a
6 a Other Other
Also note that in dataset, the 'q' level for boston does not appear, although it does appear in sampleData. Therefore, the levels will differ if we omit 'q' from the factor in dataset, meaning that in 'dataset', we need boston to include the level q, but to have no actual observations assigned to it.
Last, note that as I'm doing this on 30 variables at a time, I need a programmatic solution and not one that reassigns factors by using explicit column names.
This seems like it might work.
From this function, the new levels returned for the boston column are Other y z q, even though there are no values for the level q. Regarding your comment in the original question, the only way I've found to effectively apply new factor levels is also with a for loop like you, and it's worked well for me so far.
A function, findOthers() :
findOthers <- function(newData) ## might want a second argument for sampleData
{
## take only those columns that are in 'sampleData'
dset <- newData[, names(sampleData)]
## change the 'dset' columns to character
dsetvals <- sapply(dset, as.character)
## collect the 'sampleData' levels as character
## (lapply always returns a list, even if all columns have the same number of levels)
samplevs <- lapply(sampleData, function(y) as.character(levels(y)))
## find the unmatched elements
others <- sapply(seq(ncol(dset)), function(i){
!(dsetvals[,i] %in% samplevs[[i]])
})
## change the unmatched elements to 'Other'
dsetvals[others] <- "Other"
## create new data frame
newDset <- data.frame(dsetvals)
## get the new levels for each column
newLevs <- lapply(seq(newDset), function(i){
Get <- c(as.character(newDset[[i]]), as.character(samplevs[[i]]))
ul <- unique(unlist(Get))
})
## set the new levels for each column
for(i in seq(newDset)) newDset[,i] <- factor(newDset[,i], newLevs[[i]])
## result
newDset
}
Your sample data :
sampleData <- data.frame(abacus=factor(c("a","b","a","a","a")),
montreal=factor(c("f","f","f","f","a")),
boston=factor(c("z","y","z","z","q")))
dataset <- data.frame(florida=factor(c("e","q","z","d","b", "a")),
montreal=factor(c("f","f","f","f","a", "a")),
boston=factor(c("m","y","z","z","r", "f")),
abacus=factor(c("a","b","z","a","a", "g")))
Call findOthers() and view the result with the new factor levels :
(new <- findOthers(newData = dataset))
# abacus montreal boston
# 1 a f Other
# 2 b f y
# 3 Other f z
# 4 a f z
# 5 a a Other
# 6 Other a Other
as.list(new)
# $abacus
# [1] a b Other a a Other
# Levels: a b Other
#
# $montreal
# [1] f f f f a a
# Levels: f a
#
# $boston
# [1] Other y z z Other Other
# Levels: Other y z q ## note the new level 'q', with no value in the column
To answer just the question you ask (rather than suggest what you might do instead): we have to convert each column to character, replace the values, and then re-factorise. Assigning into sampleData[] with lapply keeps the result a data frame.
sampleData[] <- lapply(sampleData, as.character)
sampleData[] <- lapply(sampleData, function(col) gsub("q", "other", col))
sampleData[] <- lapply(sampleData, as.factor)
This depends on "q" only inhabiting one column. Otherwise you just have to edit each column separately to get only the changes you want:
sampleData[] <- lapply(sampleData, as.character)
sampleData$boston <- gsub("q", "other", sampleData$boston)
sampleData[] <- lapply(sampleData, as.factor)
However, I think you should just filter these rows out of the train and test data: they are so few that they will make absolutely no difference to your model. Otherwise you're making it difficult for yourself.
summary(dataset)
dataset <- dataset[dataset$abacus!="z", ]
If you are avoiding this only because the dataset is very large, you may want to do the filtering with something like the dplyr package and its filter function.
Does this accomplish what you want?
# Select and sort the columns of dataset as in sampleData
sampleData <- sampleData[, order(names(sampleData))]
dataset <- dataset[, colnames(sampleData)]
f <- function(dataset, sampleData, col) {
# For a given column col, assign "Other" to all factor levels
# in dataset[col] that do not exist in sampleData[col].
# If a factor level exists in sampleData[col] but not in dataset[col],
# preserve it as a factor level.
v <- factor(dataset[, col], levels = c(levels(sampleData[, col]), "Other"))
v[is.na(v)] <- "Other"
v
}
# Apply f to all columns of dataset
l <- lapply(colnames(dataset), function(x) f(dataset, sampleData, x))
res <- data.frame(l) # Format into a data frame
colnames(res) <- colnames(dataset) # Assign the names of dataset
dataset <- res # Assign the result to dataset
You can test as follows
> dataset[, "boston"]
[1] Other y z z Other Other
Levels: q y z Other
> dataset[, "montreal"]
[1] f f f f a a
Levels: a f Other
> dataset[, "abacus"]
[1] a b Other a a Other
Levels: a b Other
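The three requirements from the question can also be folded into a single helper that takes the reference data as an argument instead of reading it from the global environment. The following is a sketch under stated assumptions, not a drop-in replacement for the answers above: align_levels is an invented name, every reference column is assumed to be a factor, and missing values in the new data are left as NA rather than turned into "Other".

```r
# Hypothetical helper: make new_data use the columns and factor levels of reference.
align_levels <- function(new_data, reference, other = "Other") {
  cols <- names(reference)                  # requirement 1: same columns, same order
  out <- lapply(cols, function(col) {
    ref_levels <- levels(reference[[col]])
    vals <- as.character(new_data[[col]])
    # requirement 2: unseen values become `other` (NA stays NA)
    vals[!is.na(vals) & !(vals %in% ref_levels)] <- other
    # requirement 3: keep every reference level, even if it is unused here
    factor(vals, levels = union(ref_levels, other))
  })
  names(out) <- cols
  as.data.frame(out)
}

sampleData <- data.frame(abacus   = factor(c("a", "b", "a", "a", "a")),
                         montreal = factor(c("f", "f", "f", "f", "a")),
                         boston   = factor(c("z", "y", "z", "z", "q")))
dataset <- data.frame(florida  = factor(c("e", "q", "z", "d", "b", "a")),
                      montreal = factor(c("f", "f", "f", "f", "a", "a")),
                      boston   = factor(c("m", "y", "z", "z", "r", "f")),
                      abacus   = factor(c("a", "b", "z", "a", "a", "g")))

res <- align_levels(dataset, sampleData)
```

Because the helper only needs the level sets of the reference columns, it can be applied to each chunk of one million rows as it is read in.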