Determine cluster membership in R

This is my vector before kmeans -
> sort(table(mydata))
mydata
23 7 9 4 10 3 5 8 2 1
1 3 3 4 5 6 6 6 7 9
km <- kmeans(mydata, centers = 10)
After kmeans -
> sort(table(km$cluster))
km$cluster
1 6 7 3 5 2 4 10 8 9
1 3 3 4 5 6 6 6 7 9
Clearly, all my 1s are stored in cluster 9, all my 2s are stored in cluster 8, and so on.
Can I use R to find which cluster a particular number belongs to? For example, which cluster are my 1s in?

The values for $cluster are returned in the same order as your original data.
mydata <- rep(c(23,7,9,4,10,3,5,8,2,1), c(1,3,3,4,5,6,6,6,7,9))
sort(table(mydata))
# mydata
# 23 7 9 4 10 3 5 8 2 1
# 1 3 3 4 5 6 6 6 7 9
km <- kmeans(mydata, centers = 10)
unique(cbind(value=mydata, clust=km$cluster))
# value clust
# [1,] 23 9
# [2,] 7 5
# [3,] 9 7
# [4,] 4 4
# [5,] 10 1
# [6,] 3 10
# [7,] 5 2
# [8,] 8 8
# [9,] 2 6
# [10,] 1 3
Here I've just re-joined the two with cbind and used unique to eliminate the duplicates, since your data is so discrete.
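Since the labels in km$cluster line up with mydata element by element, a cross-tabulation or a split gives the same mapping in a form that is easy to query. The following is only a sketch: the seed and the names xt and clusters_of are assumptions for illustration, not part of the original answer.

```r
mydata <- rep(c(23, 7, 9, 4, 10, 3, 5, 8, 2, 1), c(1, 3, 3, 4, 5, 6, 6, 6, 7, 9))
set.seed(42)  # assumption: fixed seed so the cluster labels are repeatable
km <- kmeans(mydata, centers = 10)

# Rows are data values, columns are cluster labels; a non-zero cell
# means that value has points in that cluster.
xt <- table(value = mydata, cluster = km$cluster)

# For each distinct value, the label(s) of the cluster(s) it was assigned to.
clusters_of <- lapply(split(km$cluster, mydata), unique)
clusters_of[["1"]]  # cluster label(s) holding the 1s
```

The list is keyed by the value as a character string, so the lookup for the 1s is clusters_of[["1"]].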

Extending MrFlick's answer (upvoted): in case you want the cluster number programmatically, you could also do this (using the magrittr package to get rid of all the nested parentheses):
library(magrittr)
data.point <- 5 # put the data point here
cluster.no <- c(mydata==data.point) %>% which %>% km$cluster[.] %>% unique
Examples:
library(magrittr)
set.seed(42) # for reproducibility
mydata <- rep(c(23,7,9,4,10,3,5,8,2,1), c(1,3,3,4,5,6,6,6,7,9))
km <- kmeans(mydata, centers = 10)
data.point <- 23
c(mydata==data.point) %>% which %>% km$cluster[.] %>% unique
# 8
data.point <- 10
c(mydata==data.point) %>% which %>% km$cluster[.] %>% unique
# 1
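If you would rather not depend on magrittr, the same lookup can be written as a small base R function. This is a minimal sketch; cluster_of is a hypothetical helper name, and the seed is only there to make the labels repeatable.

```r
mydata <- rep(c(23, 7, 9, 4, 10, 3, 5, 8, 2, 1), c(1, 3, 3, 4, 5, 6, 6, 6, 7, 9))
set.seed(42)  # assumption: fixed seed for reproducibility
km <- kmeans(mydata, centers = 10)

# Hypothetical helper: cluster label(s) that hold a given value.
# fit$cluster is in the same order as data, so a logical mask is enough.
cluster_of <- function(value, data, fit) {
  unique(fit$cluster[data == value])
}

cluster_of(23, mydata, km)
cluster_of(99, mydata, km)  # value not present in the data: integer(0)
```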

Related

If columns match, take value from another column and assign it to a new variable

I have a data set that contains an id variable and a pair id variable. Each observation is paired with another observation. I want to create a new variable that takes the id of each observation's pair partner and stores it as a new 'partner' variable.
Here is some example data:
id <- 1:10
pair_id <- rep(1:5,2)
df <- cbind(id, pair_id)
> df
id pair_id
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 4 4
[5,] 5 5
[6,] 6 1
[7,] 7 2
[8,] 8 3
[9,] 9 4
[10,] 10 5
As I said above, I want to add a variable holding the id of each observation's partner, where the partnership is identified by pair_id. For example, the observation with id == 1 has pair_id == 1, which makes its partner the observation with id == 6, since they share the same pair_id.
So the end result should look like this:
id pair_id partner_id
[1,] 1 1 6
[2,] 2 2 7
[3,] 3 3 8
[4,] 4 4 9
[5,] 5 5 10
[6,] 6 1 1
[7,] 7 2 2
[8,] 8 3 3
[9,] 9 4 4
[10,] 10 5 5
Thanks!
You can reverse the id values within each pair_id (this works because every pair_id has exactly two rows):
library(dplyr)
df %>% group_by(pair_id) %>% mutate(partner_id = rev(id))
# id pair_id partner_id
# <int> <int> <int>
# 1 1 1 6
# 2 2 2 7
# 3 3 3 8
# 4 4 4 9
# 5 5 5 10
# 6 6 1 1
# 7 7 2 2
# 8 8 3 3
# 9 9 4 4
#10 10 5 5
The equivalent in base R:
df$partner_id <- with(df, ave(id, pair_id, FUN = rev))
and in data.table:
library(data.table)
setDT(df)[, partner_id := rev(id), pair_id]
data
df <- data.frame(id, pair_id)
We can use
library(dplyr)
df %>%
group_by(pair_id) %>%
mutate(partner_id = id[n():1])
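The answers above rely on each pair_id having exactly two rows. Under that same assumption, another way to sketch it in base R is to subtract each id from the sum of the ids in its pair, which leaves the partner's id. The name pair_total is made up for this illustration.

```r
df <- data.frame(id = 1:10, pair_id = rep(1:5, 2))

# Guard: the trick only makes sense when every pair has exactly two members.
stopifnot(all(table(df$pair_id) == 2))

# Sum of the two ids in each pair, repeated for every row of that pair.
pair_total <- ave(df$id, df$pair_id, FUN = sum)

# Removing a row's own id from the pair total leaves its partner's id.
df$partner_id <- pair_total - df$id
df
```

Unlike rev(id), this does not depend on the order of the rows within a pair, but it does require numeric ids.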

Reduce columns of a matrix by a function in R

I have a matrix sort of like:
data <- round(runif(30)*10)
dimnames <- list(c("1","2","3","4","5"),c("1","2","3","2","3","2"))
values <- matrix(data, ncol=6, dimnames=dimnames)
# 1 2 3 2 3 2
# 1 5 4 9 6 7 8
# 2 6 9 9 1 2 5
# 3 1 2 5 3 10 1
# 4 6 5 1 8 6 4
# 5 6 4 5 9 4 4
Some of the column names are the same. I want to essentially reduce the columns in this matrix by taking the min of all values in the same row where the columns have the same name. For this particular matrix, the result would look like this:
# 1 2 3
# 1 5 4 7
# 2 6 1 2
# 3 1 1 5
# 4 6 4 1
# 5 6 4 4
The actual data set I'm using here has around 50,000 columns and 4,500 rows. None of the values are missing and the result will have around 40,000 columns. The way I tried to solve this was by melting the data then using group_by from dplyr before reshaping back to a matrix. The problem is that it takes forever to generate the data frame from the melt and I'd like to be able to iterate faster.
We can use rowMins from library(matrixStats)
library(matrixStats)
# group the column indices by column name, then take the row-wise minimum per group
res <- vapply(split(seq_len(ncol(values)), colnames(values)),
function(i) rowMins(values[, i, drop = FALSE]), numeric(nrow(values)))
res
# 1 2 3
#[1,] 5 4 7
#[2,] 6 1 2
#[3,] 1 1 5
#[4,] 6 4 1
#[5,] 6 4 4
row.names(res) <- row.names(values)
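If matrixStats is not available, the same grouping idea can be sketched in base R by folding pmin over the columns of each group. The matrix below is typed in from the printed example in the question so that the result is reproducible; groups and col_min are names chosen for this illustration.

```r
values <- matrix(
  c(5, 4, 9, 6,  7, 8,
    6, 9, 9, 1,  2, 5,
    1, 2, 5, 3, 10, 1,
    6, 5, 1, 8,  6, 4,
    6, 4, 5, 9,  4, 4),
  ncol = 6, byrow = TRUE,
  dimnames = list(as.character(1:5), c("1", "2", "3", "2", "3", "2"))
)

# Column indices grouped by column name.
groups <- split(seq_len(ncol(values)), colnames(values))

# Element-wise minimum across all columns of one group (vectorised over rows).
col_min <- function(i) Reduce(pmin, lapply(i, function(j) values[, j]))

res <- vapply(groups, col_min, numeric(nrow(values)))
res
```

Whether this is fast enough for roughly 50,000 columns has not been measured here; rowMins is likely the better choice when the package can be installed.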

sum if greater than in r

I have a dataframe (obs) with 145 rows and more than 1000 columns plus a numeric vector with 145 values (thr).
I would like to derive another vector (sumifs) with 145 elements where each element is the sum of the values of obs[n,] >= thr[n].
I thought I could run a for loop where a single row sum is calculated more or less like:
sumifs[n] <- if(obs[n,]>=thr[n],sum(obs[n,]))
but I didn't manage to make it work for the single row either.
I've looked at other questions where aggregate or the plyr package was suggested, but I didn't really find anything that fits.
A simplified example with only 15 rows and 3 columns follows:
r1 <- rep(1:5,3)
r2 <- rep(3:7,3)
r3 <- rep(2:6,3)
obs <- data.frame(r1,r2,r3)
thr <- c(2,2,3,3,4,4,5,5,2,2,3,3,4,4,5)
obs
r1 r2 r3
1 1 3 2
2 2 4 3
3 3 5 4
4 4 6 5
5 5 7 6
6 1 3 2
7 2 4 3
8 3 5 4
9 4 6 5
10 5 7 6
11 1 3 2
12 2 4 3
13 3 5 4
14 4 6 5
15 5 7 6
therefore, sumifs should be:
sumifs
5
9
12
15
18
0
0
5
15
18
3
7
9
15
18
#your data
DF <- as.data.frame(matrix(1:6, ncol = 2))
#turn into matrix
m <- as.matrix(DF)
#your threshold
thr <- c(3, 1, 7)
#compare (thr is recycled down each column, so thr[i] lines up with row i)
m >= thr
# V1 V2
#[1,] FALSE TRUE
#[2,] TRUE TRUE
#[3,] FALSE FALSE
#logical values are coerced to 0/1 in arithmetic
#thus we can just multiply the matrix by the comparison
m * (m >= thr)
# V1 V2
#[1,] 0 4
#[2,] 2 5
#[3,] 0 0
#and calculate the row sums
rowSums(m * (m >= thr))
#[1] 4 7 0
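Applied to the example from the question, the same approach can be sketched as follows. It assumes that thr has exactly one entry per row of obs, which is what makes the column-wise recycling line up with the rows.

```r
r1 <- rep(1:5, 3)
r2 <- rep(3:7, 3)
r3 <- rep(2:6, 3)
obs <- data.frame(r1, r2, r3)
thr <- c(2, 2, 3, 3, 4, 4, 5, 5, 2, 2, 3, 3, 4, 4, 5)

m <- as.matrix(obs)
stopifnot(length(thr) == nrow(m))  # one threshold per row

# Zero out every cell below its row threshold, then sum each row.
sumifs <- rowSums(m * (m >= thr))
sumifs
# 5 9 12 15 18 0 0 5 15 18 3 7 9 15 18
# (row 8 is 5 because its values are 3, 5, 4 and 5 >= 5)
```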

Finding unique combinations irrespective of position [duplicate]

I'm sure it's something simple, but I have a data frame
df <- data.frame(a = c(1, 2, 3),
b = c(2, 3, 1),
c = c(3, 1, 4))
And I want a new data frame that contains the unique combinations of values in the rows, irrespective of which column they're in. So in the case above I'd want
a b c
1 2 3
3 1 4
I've tried
unique(df[c('a', 'b', 'c')])
but it sees (1, 2, 3) as unique from (2, 3, 1), which I don't want.
Maybe something like this:
indx <- !duplicated(t(apply(df, 1, sort))) # finds non-duplicates in sorted rows
df[indx, ] # selects only the non-duplicates according to that index
# a b c
# 1 1 2 3
# 3 3 1 4
If your data.frame is quite big, speed may matter. You can find duplicated sets much faster with the following idea.
Assign each possible value in the rows a prime number and compute the product for each row. For example, for the given df we can take primenums = c(2,3,5,7) and get the products c(30,30,70). Duplicates in this vector of products then correspond to duplicated sets in our data.frame, because each product factors into primes in only one way. As multiplication is much faster than any kind of sorting, you gain efficiency.
The code is as follows.
require("numbers")
primenums <- Primes(100)[1:4]
dfmult <- apply(as.matrix(df), 1, function(z) prod(primenums[z]) )
my_indx <- !duplicated(dfmult)
df[my_indx,]
Here we initialize the vector primenums with the function Primes from the package numbers, but you can also build it manually.
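For the small df from the question, the idea can be sketched without the numbers package by typing the primes in by hand. The guard on 2^53 is an addition for this illustration: products are stored as doubles, which represent integers exactly only up to 2^53, so beyond that bound two different sets could in principle collide.

```r
df <- data.frame(a = c(1, 2, 3),
                 b = c(2, 3, 1),
                 c = c(3, 1, 4))

primenums <- c(2, 3, 5, 7)  # one prime per possible value 1..4

# Product of the primes assigned to the values in each row.
keys <- apply(as.matrix(df), 1, function(z) prod(primenums[z]))
keys
# 30 30 70

# Largest product any row could produce; it must stay exactly representable.
worst <- max(primenums)^ncol(df)
stopifnot(worst < 2^53)

df[!duplicated(keys), ]
```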
Take a look at the following example, which compares the efficiency of the approaches.
require("numbers")
# generate all unique combinations 10 out of 20
allcomb <- t(combn(20,10))
# make sample of 1 million rows
set.seed(789)
df <- allcomb[sample(nrow(allcomb), 1e6, T),]
# lets sort matrix to show we have duplicates
df <- df[do.call(order, lapply(1:ncol(df), function(i) df[, i])), ]
head(df, 10)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] 1 2 3 4 5 6 7 8 9 10
# [2,] 1 2 3 4 5 6 7 8 9 10
# [3,] 1 2 3 4 5 6 7 8 9 10
# [4,] 1 2 3 4 5 6 7 8 9 10
# [5,] 1 2 3 4 5 6 7 8 9 11
# [6,] 1 2 3 4 5 6 7 8 9 11
# [7,] 1 2 3 4 5 6 7 8 9 11
# [8,] 1 2 3 4 5 6 7 8 9 11
# [9,] 1 2 3 4 5 6 7 8 9 11
# [10,] 1 2 3 4 5 6 7 8 9 11
# to be fair, we need to permute the numbers within rows before searching for identical sets
df <- t(apply(df, 1, function(z) z[sample(10,10)] ))
df <- as.data.frame(df)
names(df) <- letters[1:10]
# what does it look like now?
head(df, 10)
# a b c d e f g h i j
# 1 2 3 7 9 10 1 4 8 5 6
# 2 4 2 6 3 8 10 9 1 5 7
# 3 4 2 6 8 5 1 10 7 3 9
# 4 6 8 5 4 2 1 10 9 7 3
# 5 11 2 7 6 8 1 9 4 5 3
# 6 9 6 3 11 4 2 8 7 5 1
# 7 5 2 3 11 1 8 6 9 7 4
# 8 3 9 7 1 2 5 4 8 11 6
# 9 6 2 8 3 4 1 11 5 9 7
# 10 4 6 3 9 7 2 1 5 11 8
# now let's shuffle the rows to make df more plausible
df <- df[sample(nrow(df), nrow(df)),]
Now that the data.frame is ready, we can test the different algorithms.
system.time(indx <- !duplicated(t(apply(df, 1, sort))) )
# user system elapsed
# 119.75 0.06 120.03
# doesn't impress, frankly speaking
library(sets)
system.time(indx <- !duplicated(apply(df, 1, as.set)) )
# user system elapsed
# 91.60 0.00 91.89
# better, but we want faster! =)
# now lets check out the method with prime numbers
primenums <- Primes(100)[1:20]
# [1] 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71
system.time({
dfmult <- apply(as.matrix(df), 1, function(z) prod(primenums[z]) )
my_indx <- !duplicated(dfmult) })
# user system elapsed
# 6.44 0.16 6.61
# not bad, is it? but let's compare results
identical(indx, my_indx)
# [1] TRUE
# So, if there is no difference, why wait more? ;)
There is one important assumption here: we use as.matrix(df), but what if the data.frame contains non-numeric variables? A more general solution is as follows:
system.time({
dfmult <- apply(
apply(df, 2, function(colmn) as.integer(factor(colmn,
levels = unique(c(as.matrix(df)))))),
1, function(z) prod(primenums[z]) )
my_indx <- !duplicated(dfmult) })
# user system elapsed
# 27.48 0.34 27.84
# distinctly slower, but still much faster than the previous methods
And what if we have very many columns or very many distinct values? In this case we can use sum(log()) instead of prod() (which is probably computed even faster for large numbers). Take a look at this.
pr <- Primes(5e7)
length(pr)
# [1] 3001134
system.time(N <- sum(log(pr)))
# user system elapsed
# 0.12 0.00 0.13
N
# [1] 49993718
It's hard to imagine a df with 3 million columns, but even then the sum stays manageable. This approach lets us handle a df of any size, with as many columns as our RAM can hold.
As an alternative approach, the package sets provides a fast way of checking for set equality:
library(sets)
df.sets <- apply(df, 1, as.set)
#[[1]]
#{1, 2, 3}
#[[2]]
#{1, 2, 3}
#[[3]]
#{1, 3, 4}
df[!duplicated(df.sets),]
# a b c
#1 1 2 3
#3 3 1 4

R: create a data frame out of a rolling window

Let's say I have a data frame with the following structure:
DF <- data.frame(x = 0:4, y = 5:9)
> DF
x y
1 0 5
2 1 6
3 2 7
4 3 8
5 4 9
What is the most efficient way to turn 'DF' into a data frame with the following structure:
w x y
1 0 5
1 1 6
2 1 6
2 2 7
3 2 7
3 3 8
4 3 8
4 4 9
Here w is the number of a length-2 window rolling through the data frame 'DF'. The length of the window should be arbitrary, i.e., a window of length 3 yields
w x y
1 0 5
1 1 6
1 2 7
2 1 6
2 2 7
2 3 8
3 2 7
3 3 8
3 4 9
I am a bit stumped by this problem, because the data frame can also contain an arbitrary number of columns, i.e. w,x,y,z etc.
/edit 2: I've realized edit 1 is a bit unreasonable, as xts doesn't seem to deal with multiple observations per data point
My approach would be to use the embed function. The first thing to do is to create a rolling sequence of indices into a vector. Take a data-frame:
df <- data.frame(x = 0:4, y = 5:9)
nr <- nrow(df)
w <- 3 # window size
i <- 1:nr # indices of the rows
iw <- embed(i,w)[, w:1] # matrix of rolling-window indices of length w
> iw
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 2 3 4
[3,] 3 4 5
wnum <- rep(1:nrow(iw),each=w) # window number
inds <- i[c(t(iw))] # the indices flattened, to use below
dfw <- sapply(df, '[', inds)
dfw <- transform(data.frame(dfw), w = wnum)
> dfw
x y w
1 0 5 1
2 1 6 1
3 2 7 1
4 1 6 2
5 2 7 2
6 3 8 2
7 2 7 3
8 3 8 3
9 4 9 3
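The same idea can be wrapped in a function that indexes whole rows, so it works for any number of columns. This is a sketch under the assumption that the windows advance one row at a time; rolling_windows is a hypothetical name and it builds the indices with outer rather than embed.

```r
rolling_windows <- function(df, w) {
  n <- nrow(df)
  stopifnot(w >= 1, w <= n)
  starts <- seq_len(n - w + 1)  # first row of each window
  # w x length(starts) matrix of row indices; flattening it column by
  # column lists the rows of window 1, then window 2, and so on.
  idx <- as.vector(outer(seq_len(w) - 1L, starts, `+`))
  out <- cbind(w = rep(starts, each = w), df[idx, , drop = FALSE])
  rownames(out) <- NULL
  out
}

DF <- data.frame(x = 0:4, y = 5:9)
res2 <- rolling_windows(DF, 2)
res3 <- rolling_windows(DF, 3)
res2
```

Here the window number comes first, matching the column order asked for in the question.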
