After performing a cluster analysis to my dataset (a dataframe named data.matrix), I added a new column, named cluster, at the end (col 27) containing the cluster name that each instance belongs to.
What I want now, is a representative instance from each cluster. I tried to find the instance having the smallest euclidean distance from the cluster's centroid (and repeat the procedure for each one of my clusters)
This is what I did. Can you think of other -perhaps more elegant- ways? (assume numeric columns with no nulls).
clusters <- levels(data.matrix$cluster)
cluster_col = c(27)
for (j in 1:length(clusters)) {
# get the subset for cluster j
data = data.matrix[data.matrix$cluster == clusters[j],]
# remove the cluster column
data <- data[,-cluster_col]
# calculate the centroid
cent <- mean(data)
# copy data to data.matrix_cl, attaching a distance column at the end
data.matrix_cl <- cbind(data, dist = apply(data, 1, function(x) {sqrt(sum((x - cent)^2))}))
# get instances with min distance
candidates <- data.matrix_cl[data.matrix_cl$dist == min(data.matrix_cl$dist),]
# print their rownames
print(paste("Candidates for cluster ",j))
print(rownames(candidates))
}
At first I don't now if you distance formula is alright. I think there should be sqrt(sum((x-cent)^2)) or sum(abs(x-cent)). I assumed first.
Second thought is that just printing solution is not good idea. So I first compute, then print.
Third - I recommend using plyr but I give both (with and without plyr) solutions.
# Simulated data:
n <- 100
data.matrix <- cbind(
data.frame(matrix(runif(26*n), n, 26)),
cluster=sample(letters[1:6], n, replace=TRUE)
)
cluster_col <- which(names(data.matrix)=="cluster")
# With plyr:
require(plyr)
candidates <- dlply(data.matrix, "cluster", function(data) {
dists <- colSums(laply(data[, -cluster_col], function(x) (x-mean(x))^2))
rownames(data)[dists==min(dists)]
})
l_ply(names(candidates), function(c_name, c_list=candidates[[c_name]]) {
print(paste("Candidates for cluster ",c_name))
print(c_list)
})
# without plyr
candidates <- tapply(
1:nrow(data.matrix),
data.matrix$cluster,
function(id, data=data.matrix[id, ]) {
dists <- rowSums(sapply(data[, -cluster_col], function(x) (x-mean(x))^2))
rownames(data)[dists==min(dists)]
}
)
invisible(lapply(names(candidates), function(c_name, c_list=candidates[[c_name]]) {
print(paste("Candidates for cluster ",c_name))
print(c_list)
}))
Is the technique you're interested in 'k-means clustering'? If so, here's how the centroids are calculated at each iteration:
choose a k value (an integer that
specifies the number of clusters to
divide your data set);
random select k rows from your data
set, those are the centroids for the
1st iteration;
calculate the distance that each
data point is from each centroid;
each data point has a 'closest
centroid', that determines its
'group';
calculate the mean for each
group--those are the new centroids;
back to step 3 (stopping criterion
is usually based on comparison with
the respective centroid values in
successive loops, i.e., if they
values change not more than 0.01%,
then quit).
Those steps in code:
# toy data set
mx = matrix(runif60, 10, 99), nrow=12, ncol=5, byrow=F)
cndx = sample(nrow(mx), 2)
# the two centroids at iteration 1
cn1 = mx[cndx[1],]
cn2 = mx[cndx[2],]
# to calculate Pearson similarity
fnx1 = function(a){sqrt((cn1[1] - a[1])^2 + (cn1[2] - a[2])^2)}
fnx2 = function(a){sqrt((cn2[1] - a[1])^2 + (cn2[2] - a[2])^2)}
# calculate distance matrix
dx1 = apply(mx, 1, fnx1)
dx2 = apply(mx, 1, fnx2)
dx = matrix(c(dx1, dx2), nrow=2, ncol=12)
# index for extracting the new groups from the data set
ndx = apply(dx, 1, which.min)
group1 = mx[ndx==1,]
group2 = mx[ndx==2,]
# calculate the new centroids for the next iteration
new_cnt1 = apply(group1, 2, mean)
new_cnt2 = apply(group2, 2, mean)
Related
I am trying to identify the most probable group that an observation belongs to, for several thousand large datasets. It is possible that some of the data is incorrectly classified and I am trying to work out the most likely "true" value. I have tried to use knn3 from the caret package but the predictions take too long to compute. In researching alternatives I have came across the nn2 function from RANN package which performs a nearest neighbour search that is significantly faster than K-Nearest Neighbours.
library(RANN)
library(tidyverse)
iris.scaled <- iris %>%
mutate_if(is.numeric, scale)
iris.nn2 <- nn2(iris.scaled[1:4])
The result on the nn2 function is two lists, one of indices and one of distances. I want to use the indices table to work out the most likely grouping of each observation, however it returns the row number of the observation and not it's group. I need to replace this with the group it belongs to (in this case, the species column).
distance.index <- iris.nn2$nn.idx[,-1]
target = iris.scaled$Species
I have removed the first column as the first nearest neighbour is always the observation itself.
matrix(target[distance.index[,]], nrow = nrow(distance.index), ncol = ncol(distance.index))
This code gives me the output I want, but is there a tidier way of creating this table and then calculating the most common response for each row, with the speed of calculation being the key.
Your scaling can be a real bottleneck when you have more columns (tested on 200 x 22216 gene expression matrix). My version might not seem that impressive with the iris dataset, but on the larger dataset I get 1.3 sec vs. 32.8 sec execution time.
Using tabulate instead of table gives an additional improvement, which is dwarfed, however, by the matrix scaling.
I used a custom scale function here, but using base::scale on a matrix would already be a major improvement.
I also addressed the issue raised by M. Papenberg of "self" not being considered the nearest neighbor by setting those to NA.
invisible(lapply(c("tidyverse", "matrixStats", "RANN", "microbenchmark", "compiler"),
require, character.only=TRUE))
enableJIT(3)
# faster column scaling (modified from https://www.r-bloggers.com/author/strictlystat/)
colScale <- function(x, center = TRUE, scale = TRUE, rows = NULL, cols = NULL) {
if (!is.null(rows) && !is.null(cols)) {x <- x[rows, cols, drop = FALSE]
} else if (!is.null(rows)) {x <- x[rows, , drop = FALSE]
} else if (!is.null(cols)) x <- x[, cols, drop = FALSE]
cm <- colMeans(x, na.rm = TRUE)
if (scale) csd <- matrixStats::colSds(x, center = cm, na.rm = TRUE) else
csd <- rep(1, length = length(cm))
if (!center) cm <- rep(0, length = length(cm))
x <- t((t(x) - cm) / csd)
return(x)
}
# your posted version (mostly):
oldv <- function(){
iris.scaled <- iris %>%
mutate_if(is.numeric, scale)
iris.nn2 <- nn2(iris.scaled[1:4])
distance.index <- iris.nn2$nn.idx[,-1]
target = iris.scaled$Species
category_neighbours <- matrix(target[distance.index[,]], nrow = nrow(distance.index), ncol = ncol(distance.index))
class <- apply(category_neighbours, 1, function(x) {
x1 <- table(x)
names(x1)[which.max(x1)]})
cbind(iris, class)
}
## my version:
myv <- function(){
iris.scaled <- colScale(data.matrix(iris[, 1:(dim(iris)[2]-1)]))
iris.nn2 <- nn2(iris.scaled)
# set self neighbors to NA
iris.nn2$nn.idx[iris.nn2$nn.idx - seq_len(dim(iris.nn2$nn.idx)[1]) == 0] <- NA
# match up categories
category_neighbours <- matrix(iris$Species[iris.nn2$nn.idx[,]],
nrow = dim(iris.nn2$nn.idx)[1], ncol = dim(iris.nn2$nn.idx)[2])
# turn category_neighbours into numeric for tabulate
cn <- matrix(as.numeric(factor(category_neighbours, exclude=NULL)),
nrow = dim(iris.nn2$nn.idx)[1], ncol = dim(iris.nn2$nn.idx)[2])
cnl <- levels(factor(category_neighbours, exclude = NULL))
# tabulate frequencies and match up with factor levels
class <- apply(cn, 1, function(x) {
cnl[which.max(tabulate(x, nbins=length(cnl))[!is.na(cnl)])]})
cbind(iris, class)
}
microbenchmark(oldv(), myv(), times=100L)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> oldv() 11.015986 11.679337 12.806252 12.064935 12.745082 33.89201 100 b
#> myv() 2.430544 2.551342 3.020262 2.612714 2.691179 22.41435 100 a
I have the following data frame:
library(dplyr)
set.seed(42)
df <- data_frame(x = sample(seq(0, 1, 0.1), 5, replace = T), y = sample(seq(0, 1, 0.1), 5, replace = T), z= sample(seq(0, 1, 0.1), 5, replace = T) )
For each row in df, I would like to find out whether there is a row in df2 which is close to it ("neighbor") in all columns, where "close" means that it is not different by more than 0.1 in each column.
So for instance, a proper neighbor to the row (1, 0.5, 0.5) would be (0.9, 0.6, 0.4).
The second data set is
set.seed(42)
df2 <- data_frame(x = sample(seq(0, 1, 0.1), 10, replace = T), y = sample(seq(0, 1, 0.1), 10, replace = T), z= sample(seq(0, 1, 0.1), 10, replace = T) )
In this case there is no "neighbor", so Im supposed to get "FALSE" for all rows of df.
My actual data frames are much bigger than this (dozens of columns and hundreds of thousands of rows, so the naming has to be very general rather than "x", "y" and "z".
I have a sense that this can be done using mutate and funs, for example I tried this line:
df <- df %>% mutate_all(funs(close = (. <= df2(, .)+0.1) & (. >= df2(, .)-0.1))
But got an error.
Any ideas?
You can use package fuzzyjoin
library(fuzzyjoin)
# adding two rows that match
df2 <- rbind(df2,df[1:2,] +0.01)
df %>%
fuzzy_left_join(df2,match_fun= function(x,y) y<x+0.1 & y> x-0.1 ) %>%
mutate(found=!is.na(x.y)) %>%
select(-4:-6)
# # A tibble: 5 x 4
# x.x y.x z.x found
# <dbl> <dbl> <dbl> <lgl>
# 1 1 0.5 0.5 TRUE
# 2 1 0.8 0.7 TRUE
# 3 0.3 0.1 1 FALSE
# 4 0.9 0.7 0.2 FALSE
# 5 0.7 0.7 0.5 FALSE
find more info there: Joining/matching data frames in R
The machine learning approach to finding a close entry in a multi-dimensional dataset is Euclidian distance.
The general approach is to normalize all the attributes. Make the range for each column the same, zero to one or negative one to one. That equalizes the effect of the columns with large and small values. When more advanced approaches are used one would center the adjusted column values on zero. The test criteria is scaled the same.
The next step is to calculate the distance of each observation from its neighbors. If the data set is small or computing time is cheap, calculate the distance from every observation to every other. The Euclidian distance from observation1 (row1) to observation2 (row2) is sqrt((X1 - X2)^2 + sqrt((Y1 - Y2)^2 + ...). Choose your criteria and select.
In your case, the section criterion is simpler. Two observations are close if no attribute is more than 0.1 from the other observation. I assume that df and df2 have the same number of columns in the same order. I make the assumption that close observations are relatively rare. My approach tells me once we discover a pair is distant, discontinue investigation. If you have hundred of thousands of rows, you will likely exhaust memory if you try to calculate all the combinations at the same time.
~~~~~
You have a big problem. If your data sets df and df2 are one hundred thousand rows each, and four dozen columns, the machine needs to do 4.8e+11 comparisons. The scorecard at the end will have 1e+10 results (close or distant). I started with some subsetting to do comparisons with tearful results. R wanted matrices of the same size. The kluge I devised was unsuccessful. Therefore I regressed to the days of FORTRAN and did it with loops. With the loop approach, you could subset the problem and finish without smoking your machine.
From the sample data, I did the comparisons by hand, all 150 of them: nrow(df) * nrow(df2) * ncol(df). There were no close observations in the sample data by the definition you gave.
Here is how I intended to present the results before transferring the results to a new column in df.
dfclose <- matrix(TRUE, nrow = nrow(df), ncol = nrow(df2))
dfclose # Have a look
This matrix describes the distance from observation in df (rows in dfclose) to observation in df2 (colums in dfclose). If close, the entry is TRUE.
Here is the repository of the result of the distance measures:
dfdist <- matrix(0, nrow = nrow(df), ncol = nrow(df2))
dfdist # have a look; it's the same format, but with numbers
We start with the assumption that all observations in df aare close to df2.
The total distance is zero. To that we add the Manhattan Distance. When the total Manhattan distance is greater than .1, they are no longer close. We needn't evaluate any more.
closeCriterion <- function(origin, dest) {
manhattanDistance <- abs(origin-dest)
#print(paste("manhattanDistance =", manhattanDistance))
if (manhattanDistance < .1) ret <- 0 else ret <- 1
}
convertScore <- function(x) if (x>0) FALSE else TRUE
for (j in 1:ncol(df)) {
print(paste("col =",j))
for (i in 1:nrow(df)) {
print(paste("df row =",i))
for (k in 1:nrow(df2)) {
# print(paste("df2 row (and dflist column) =", k))
distantScore <- closeCriterion(df[i,j], df2[k,j])
#print(paste("df and dfdist row =", i, " df2 row (and dflist column) =", k, " distantScore = ", distantScore))
dfdist[i,k] <- dfdist[i,k] + distantScore
}
}
}
dfdist # have a look at the numerical results
dfclose <- matrix(lapply(dfdist, convertScore), ncol = nrow(df2))
I wanted to see what the process would look like at scale.
set.seed(42)
df <- matrix(rnorm(3000), ncol = 30)
set.seed(42)
df2 <-matrix(rnorm(5580), ncol = 30)
dfdist <- matrix(0, nrow = nrow(df), ncol = nrow(df2))
Then I ran the code block to see what would happen.
~ ~ ~
You might consider the problem definition. I ran the model several times, changing the criterion for closeness. If the entry in each of three dozen columns in df2 has a 90% chance of matching its correspondent in df, the row only has a 2.2% chance of matching. The example data is not such a good test case for the algorithm.
Best of luck
Here's one way to calculate that column without fuzzyjoin
library(tidyverse)
found <-
expand.grid(row.df = seq(nrow(df)),
row.df2 = seq(nrow(df2))) %>%
mutate(in.range = pmap_lgl(., ~ all(abs(df[.x,] - df2[.y,]) <= 0.1))) %>%
group_by(row.df) %>%
summarise_at('in.range', any) %>%
select(in.range)
I want to calculate the smallest geographical distance between each row and the column of two dataframes.
DF1 has a number of institutions, and DF2 has a number of events. Like, so:
#DF1 (institutions)
DF1 <- data.frame(latitude=c(41.49532, 36.26906, 40.06599),
longitude=c(-98.77298, -101.40585, -80.72291))
DF1$institution <- letters[seq( from = 1, to = nrow(DF1))]
#DF2 (events)
DF2 <- data.frame(latitude=c(32.05, 32.62, 30.23), longitude=c(-86.82,
-87.67, -88.02))
DF2$ID <- seq_len(nrow(DF1)
I want to return the event with the smallest distance to each institution in DF1 and add both the distance and ID from DF2 to DF1. While I know how to calculate the pairwise distance I am incapable of calculating all the distances from DF[1,] to DF2 and return the smallest value and so forth.
This is what I tried (and failed).
library(geosphere)
#Define a function
distanceCALC <- function(x, y) { distm(x = x, y = y,
fun = distHaversine)}
#Define vector of events
DF2_vec <- DF2[, c('longitude', 'latitude')]
#Define df to hold distances
shrtdist <- data.frame()
Now, my attempt was to feed distanceCALC with row1 of DF1 and the vectorized events.
#Loop through every row in DF1 and calculate all the distances to instutions a, b, c. Append to DF1 smallest distance + DF2$ID.
#This only gives me the pairwise distance
for (i in nrow(DF1)){
result <- distanceCALC(DF1[i,c('longitude', 'latitude')], DF2_vec)
}
#Somehow take shortest distance for each row*column distance matrix
shrtdist <- rbind(shrtdist, min(result[,], na.rm = T))
My guess is that the solution entails reshaping of the data and lapply. Also, the loop is very bad practice and much too slow given the number of observations.
Any help is greatly appreciated.
Here's a simple way to approach this using the outer function
squared_distance <- function(x, y ) (x - y)^2
lat <- outer(DF1$latitude, DF2$latitude, squared_distance)
long <- outer(DF1$longitude, DF2$longitude, squared_distance)
pairwise_dist <- sqrt(lat + long)
rownames(pairwise_dist) <- DF1$institution
colnames(pairwise_dist) <- DF2$ID
pairwise_dist
This gives you a matrix of the distances between each institution (rows) and event (column). To get the distance and event in df1, we can do
df1$min_dist <- apply(pairwise_dist, 1, min)
df1$min_inst <- apply(pairwise_dist, 1, min)
Note that the reason the second one works in this case is because the events are labeled by number. If your real data doesn't have that handy feature, we need to do
df1$min_inst <- colnames(pairwise_dist)[apply(pairwise_dist, 1, which.min)]
Update using alternative distance function
I haven't tested this, but I think this should work. Again, the output will be a matrix.
gcd.hf <- function(DF1, DF2) {
sin2.long <- sin(outer(DF1$longitude, DF2$longitude, "-") / 2)^2
sin2.lat <- outer(DF1$latitude, DF2$latitude, "-")
cos.lat <- outer(cos(DF1$latitude), cos(DF2$latitude), "*")
a <- sin2.long + sin2.lat * cos.lat # we do this cell-wise
cir <- 2 * asin(pmin(1, sqrt(a))) # I never assign anything to "c" since that's concatenate. Rename this variable as appropriate (I have no idea if it's related to the circumference or not.)
cir * 6371
}
pairwise_dist <- gcd.hf(DF1, DF2)
I am working on Spike Trains and my code to get a spike train like this:
for 20 trials is written below. The image is representational for 5 trials.
fr = 100
dt = 1/1000 #dt in milisecond
duration = 2 #no of duration in s
nBins = 2000 #10msSpikeTrain
nTrials = 20 #NumberOfSimulations
MyPoissonSpikeTrain = function(p, fr= 100) {
p = runif(nBins)
q = ifelse(p < fr*dt, 1, 0)
return(q)
}
set.seed(1)
SpikeMat <- t(replicate(nTrials, MyPoissonSpikeTrain()))
plot(x=-1,y=-1, xlab="time (s)", ylab="Trial",
main="Spike trains",
ylim=c(0.5, nTrials+1), xlim=c(0, duration))
for (i in 1: nTrials)
{
clip(x1 = 0, x2= duration, y1= (i-0.2), y2= (i+0.4))
abline(h=i, lwd= 1/4)
abline(v= dt*which( SpikeMat[i,]== 1))
}
Each trial has spikes occuring at random time points. Now what I am trying to work towards, is getting a random sample time point that works for all 20 trials and I want to get the vector consisting of length of the intervals this point falls into, for each trial. The code to get the time vector for the points where the spikes occur is,
A <- numeric()
for (i in 1: nTrials)
{
ISI <- function(i){
spike_times <- c(dt*which( SpikeMat[i, ]==1))
ISI1vec <- c(diff(spike_times))
A <- c(A, ISI1vec)
return(A)}
}
Then you call ISI(i) for whichever trial you wish to see the Interspike interval vector for. A visual representation of what I want is:
I want to get a vector that has the lengths of the interval where this points fall into, for each trial. I want to figure out it's distribution as well, but that's for later. Can anybody help me figure out how to code my way to this? Any help is appreciated, even if it's just about how to start/where to look.
Your data
set.seed(1)
SpikeMat <- t(replicate(nTrials, MyPoissonSpikeTrain()))
I suggest transforming your sparse matrix data into a list of indices where spikes occur
L <- lapply(seq_len(nrow(SpikeMat)), function(i) setNames(which(SpikeMat[i, ] == 1), seq_along(which(SpikeMat[i, ] == 1))))
Grab random timepoint
set.seed(1)
RT <- round(runif(1) * ncol(SpikeMat))
# 531
Result
distances contains the distances to the 2 nearest spikes - each element of the list is a named vector where the values are the distances (to RT) and their names are their positions in the vector. nearest_columns shows the original timepoint (column number) of each spike in SpikeMat.
bookend_values <- function(vec) {
lower_val <- head(sort(vec[sign(vec) == 1]), 1)
upper_val <- head(sort(abs(vec[sign(vec) == -1])), 1)
return(c(lower_val, upper_val))
}
distances <- lapply(L, function(i) bookend_values(RT-i))
nearest_columns <- lapply(seq_along(distances), function(i) L[[i]][names(distances[[i]])])
Note that the inter-spike interval of the two nearest spikes that bookend RT can be obtained with
sapply(distances, sum)
I have the following problem.
I have multiple subarrays (say 2) that I have populated with character labels (1, 2, 3, 4, 5). My algorithm selects labels at random based on occurrence probabilities.
How can I get R to instead select labels 1:3 for subarray 1 and 4:5 for subarray 2, say, without using subsetting (i.e., []). That is, I want a random subset of labels to be selected for each subarray, instead of all labels assigned to each subarray manually using [].
I know sample() should help.
Using subsetting (which I don't want) one would do
x <- 1:5
sample(x[1:3], size, prob = probs[1:3])
but this assigns labels 1:3 to ALL subarrays.
Would
sample(sample(x), size, replace = TRUE, prob = probs)
work?
Any ideas? Please let me know if this is unclear.
Here is a small example, which selects labels from 1:5 for each of 10 subarrays.
set.seed(1)
N <- 10
K <- 2
Hstar <- 5
probs <- rep(1/Hstar, Hstar)
perms <- 5
## Set up container(s) to hold the identity of each individual from each permutation ##
num.specs <- ceiling(N / K)
## Create an ID for each haplotype ##
haps <- 1:Hstar
## Assign individuals (N) to each subpopulation (K) ##
specs <- 1:num.specs
## Generate permutations, assume each permutation has N individuals, and sample those individuals' haplotypes from the probabilities ##
gen.perms <- function() {
sample(haps, size = num.specs, replace = TRUE, prob = probs) # I would like each subarray to contain a random subset of 1:5.
}
pop <- array(dim = c(perms, num.specs, K))
for (i in 1:K) {
pop[,, i] <- replicate(perms, gen.perms())
}
pop
Hopefully this helps.
I think what you actually want is something like that
num.specs <- 3
haps[sample(seq(haps),size = num.specs,replace = F)]
[1] 3 5 4
That is a random subset of your vector haps ?
Not quite what you want (returns list of matrices instead of 3D array) but this might help
lapply(split(1:5, cut(1:5, breaks=c(0, 2, 5))), function(i) matrix(sample(i, 25, replace=TRUE), ncol=5))
Use cut and split to partition your vector of character labels before sampling them. Here I split your character labels at the value 2. Also, rather than sampling 5 numbers 5 times, you can sample 25 numbers once, and convert to matrix.