I am trying to implement the Affinity Propagation clustering algorithm in C++. As part of testing, I want to compare my results with well-established implementations of the algorithm in Matlab (Link) and in R (the apcluster package). Unfortunately, the clusterings do not agree.
To be more precise, the (test) data set is:
0.9411760 0.9702140
0.9607826 0.9744693
0.9754896 0.9574479
0.9852929 0.9489372
0.9950962 0.9234050
1.0000000 0.8936175
1.0000000 0.8723408
0.9852929 0.8595747
1.0000000 0.8893622
1.0000000 0.9191497
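For reproducibility, the data can be entered in R as a 10x2 matrix named data (the name used in the negDistMat() call below):
data <- matrix(c(0.9411760, 0.9702140,
                 0.9607826, 0.9744693,
                 0.9754896, 0.9574479,
                 0.9852929, 0.9489372,
                 0.9950962, 0.9234050,
                 1.0000000, 0.8936175,
                 1.0000000, 0.8723408,
                 0.9852929, 0.8595747,
                 1.0000000, 0.8893622,
                 1.0000000, 0.9191497),
               ncol = 2, byrow = TRUE)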
In R I typed:
S<-negDistMat(data)
A<-apcluster(S,maxits=1000,convits=100, lam=0.9,q=0.5)
and got:
> A@idx
2 2 2 5 5 9 9 9 9 5
In Matlab I just typed:
[idx,netsim,dpsim,expref]=apcluster(S,diag(S));
From the apcluster.m file implementing apcluster (line 77):
maxits=1000; convits=100; lam=0.9; plt=0; details=0; nonoise=0;
This explains the parameters used in R; in Matlab, these are the default values. Since I'm more comfortable with R when it comes to Affinity Propagation, I stuck with Matlab's defaults for the comparison, just to avoid messing something up unintentionally.
...but got:
>> idx'
ans =
3 3 3 3 5 9 9 9 9 5
In both cases the similarity matrices matched. What could I have missed?
Update:
I've also implemented the Matlab code proposed by Frey & Dueck in their original publication (you may notice that I omitted the noise term). Although I can replicate the indices produced by the Matlab implementation above, the availability and responsibility matrices differ in some values. The error is less than 0.01, but that is still significant.
Their code is:
function [idx,A,R]=frey(S)
N=size(S,1);
A=zeros(N,N);
R=zeros(N,N);
lam=0.9;                          % Set damping factor
for iter=1:122
    % Compute responsibilities
    Rold=R;
    AS=A+S;
    [Y,I]=max(AS,[],2);
    for i=1:N
        AS(i,I(i))=-realmax;
    end;
    [Y2,I2]=max(AS,[],2);
    R=S-repmat(Y,[1,N]);
    for i=1:N
        R(i,I(i))=S(i,I(i))-Y2(i);
    end;
    R=(1-lam)*R+lam*Rold;         % Dampen responsibilities
    % Compute availabilities
    Aold=A;
    Rp=max(R,0);
    for k=1:N
        Rp(k,k)=R(k,k);
    end;
    A=repmat(sum(Rp,1),[N,1])-Rp;
    dA=diag(A);
    A=min(A,0);
    for k=1:N
        A(k,k)=dA(k);
    end;
    A=(1-lam)*A+lam*Aold;         % Dampen availabilities
end;
E=R+A;                            % Pseudomarginals
I=find(diag(E)>0); K=length(I);   % Indices of exemplars
[tmp c]=max(S(:,I),[],2); c(I)=1:K; idx=I(c);   % Assignments
I have tried all your code and the problem is caused by the way you supply the input preference. In the first case (R), you specify q=0.5. This means that the input preference p is set to the median of off-diagonal similarities (in your example, this is -0.05129912). If I run the Matlab code as follows (I used Octave, but Matlab should give the same result), I get:
octave:7> [idx,netsim,dpsim,expref]=apcluster(S,-0.05129912);
octave:8> idx'
ans =
2 2 2 5 5 9 9 9 9 5
This is exactly the same as the R result. If I run your Matlab code (with diag(S) as the second argument) and if I run
apcluster(S, p=diag(S))
in R, which sets the input preference to 0 for all samples in both cases, I get 10 one-sample clusters in both cases. So the two results match again, although I could not reproduce your original Matlab result
3 3 3 3 5 9 9 9 9 5
I hope that makes the difference clear.
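For completeness, here is one way to compute that median off-diagonal preference explicitly and pass it to apcluster in R (just a sketch; setting q=0.5 already does this internally):
library(apcluster)
S <- negDistMat(data)
p <- median(S[row(S) != col(S)])   # median of the off-diagonal similarities
A <- apcluster(S, p=p, maxits=1000, convits=100, lam=0.9)
A@idx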
Cheers, UBod
I have the following formula:
Reg_Total <- In_Bigdata2 %>%
  lm(log(This_6) ~ This_1 + This_2 + This_3 + This_4 +
       This_5 + This_7 + This_8 +
       This_12 + This_13 + This_14 + This_15 + This_16 + This_17,
     This_18, data = .)
With that data, and with only the variable This_18 given as a subset, do you know why it gives me a perfect regression with an R-squared of 1?
OK, this was a good puzzle.
You have to dig a little bit to find out what the subset= argument does, as it gets passed to the model.frame() function inside lm(). From ?model.frame():
subset: a specification of the rows to be used: defaults to all rows.
This can be any valid indexing vector (see ‘[.data.frame’)
for the rows of ‘data’ or if that is not supplied, a data
frame made up of the variables used in ‘formula’.
(emphasis added). Usually people specify a logical expression for subset= (e.g. This_5>2) to restrict the regression to particular cases. If you put in an integer vector, lm()/model.frame() will select the rows corresponding to those integers.
So ... what lm()/model.frame() have done is to construct a data set for the linear model that consists of rows of the original data set indexed by This_18. In other words, since the first few elements of This_18 are (2,3,4,3,3,2, ...), the first row of the new data set will be row 2 of the original data set; the second row will be row 3; the third row will be row 4; the fourth row will be another copy of row 3; and so on ...
head(model.frame(This_6~.-This_18, data=dd, subset=This_18))
## This_6 This_1 This_2 This_3 This_4 This_5 This_7 This_8 This_9 This_10 ...
## 2 2 5 3 3 3 3 3 2 3 1 ...
## 3 3 3 3 3 3 3 3 4 4 4 ...
## 4 1 3 3 3 3 3 3 2 1 2 ...
## 3.1 3 3 3 3 3 3 3 4 4 4 ...
## 3.2 3 3 3 3 3 3 3 4 4 4 ...
## 2.1 2 5 3 3 3 3 3 2 3 1 ...
(you can also get this object by running model.frame(fitted_model)).
Therefore, since the only values of This_18 are the integers 1-6, you get a regression run on multiple copies of rows 1-6 of the original data set. It's not surprising that you get a perfect fit: there are only 6 unique response/predictor combinations, and with 13 predictors (plus an intercept) the model has more than enough parameters to fit 6 distinct points exactly.
The remaining question is ... what did you intend to do by using subset=This_18 ... ? "subset" refers to a subset of observations, not a subset of predictors.
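To see the contrast on a built-in data set (my own example, not your data):
# subset as a logical expression: keep only the 4-cylinder cars
lm(mpg ~ disp, data = mtcars, subset = cyl == 4)
# subset as an integer vector: rows 2, 3, 4 and 3 (again) of mtcars
lm(mpg ~ disp, data = mtcars, subset = c(2, 3, 4, 3))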
If you want to do best subset regression (i.e. find the subset of predictors that maximizes some criterion) there is not a single easy answer (and in fact there are some potential statistical pitfalls if you are interested in inference rather than prediction). Googling "R best subset regression" should help you, or searching for those keywords on Stack Overflow. (Or see the glmulti package, or the leaps package, or the stepAIC function in the MASS package, or the MuMIn package, or ...)
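For example, a sketch with the leaps package, using your variable names (I have not seen your data, so treat this as a template rather than a drop-in solution):
library(leaps)
# exhaustive search over subsets, using all remaining columns as candidate predictors
fit <- regsubsets(log(This_6) ~ . - This_6 - This_18, data = In_Bigdata2, nvmax = 13)
summary(fit)$which   # which predictors enter the best model of each size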
I am trying to make a model that will predict the group of a city according to its development level. I mean, the cities in the 1st group are the most developed cities and the ones in the 6th group are the least developed ones. I have 10 numerical variables in my data about each city.
First, I normalized them using max-min normalization. Then I generated the training and test sets. I have 81 cities; the dimensions of the training and test sets are 61x10 and 20x10, respectively, and I excluded the target variable from both. Then I made the corresponding training and test labels, with dimensions 61x1 and 20x1.
Then I ran the knn function like this:
knn(train = Data.training, test = Data.test, cl = Data.trainLabels , k = 3)
Its output is this:
[1] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
Levels: 1 2 3 4 5 6
But if I set the argument use.all to FALSE, I get the following output instead, and it changes every time I run the code:
[1] 1 4 2 2 2 3 5 4 3 5 5 6 5 6 5 6 4 5 2 2
Levels: 1 2 3 4 5 6
I can't figure out why my code gives the same prediction for every city in the first case, and what use.all has to do with it.
As explained in the knn documentation:
use.all controls handling of ties. If true, all distances equal to the kth largest are included. If false, a random selection of distances equal to the kth is chosen to use exactly k neighbours.
In your case, all points have the same distances, so they all win as 'best neighbour' (use.all = TRUE) or the algorithm picks k winners at random (use.all = FALSE).
The problem seems to be in how you trained the algorithm or in the data itself. Since you did not post a sample of your data, I cannot help with that, but I suggest that you re-check it. You can also compute a few distances by hand, to see what is going on.
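For example, something along these lines shows the distances from the first test point to all training points (using the object names from your call):
# distances from the first test point to every training point
d <- as.matrix(dist(rbind(Data.test[1, ], Data.training)))[1, -1]
sort(d)[1:5]   # if these are all equal, knn cannot pick 3 distinct nearest neighbours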
Also, check that you randomised your data before splitting it into training and testing sets. For example, say that the dataset is ordered by the label (the target variable). If you use the first 20 points to train the algorithm, it is likely that the algorithm will never see some of the labels during the training phase and therefore it will perform poorly on those during the testing phase.
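A minimal sketch of a randomised split, assuming the full data sits in a data frame called cities with the group label in a column called group (both names are my assumption):
set.seed(42)                                  # reproducible split
train_idx <- sample(nrow(cities), 61)         # 61 cities for training, 20 for testing
Data.training    <- cities[train_idx,  names(cities) != "group"]
Data.test        <- cities[-train_idx, names(cities) != "group"]
Data.trainLabels <- cities$group[train_idx]
Data.testLabels  <- cities$group[-train_idx]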
I'm using SVD with R and I'm able to reduce the dimensionality of my matrix by replacing the smallest singular values with 0. But when I recompose my matrix I still have the same number of features; I could not find how to effectively drop the most useless features of the source matrix in order to reduce its number of columns.
For example what I'm doing for the moment:
This is my source matrix A:
A B C D
1 7 6 1 6
2 4 8 2 4
3 2 3 2 3
4 2 3 1 3
If I do:
s = svd(A)
s$d[3:4] = 0 # Replacement of the 2 smallest singular values by 0
Aprime = s$u %*% diag(s$d) %*% t(s$v)   # this is A' below
I get A', which has the same dimensions (4x4), was reconstructed with only 2 "components", and is an approximation of A (containing a little less information, maybe less noise, etc.):
[,1] [,2] [,3] [,4]
1 6.871009 5.887558 1.1791440 6.215131
2 3.799792 7.779251 2.3862880 4.357163
3 2.289294 3.512959 0.9876354 2.386322
4 2.408818 3.181448 0.8417837 2.406172
What I want is a sub-matrix with fewer columns that still reproduces the distances between the rows, something like this (obtained using PCA; let's call it A''):
PC1 PC2
1 -3.588727 1.7125360
2 -2.065012 -2.2465708
3 2.838545 0.1377343 # The similarity between rows 3
4 2.815194 0.3963005 # and 4 in A is conserved in A''
Here is the code to get A'' with PCA:
p = prcomp(A)
Adoubleprime = p$x[, 1:2]   # this is A'' above
The final goal is to reduce the number of columns in order to speed up clustering algorithms on huge datasets.
Thank you in advance if someone can guide me :)
I would check out this chapter on dimensionality reduction or this Cross Validated question. The idea is that the entire data set can be reconstructed using less information. It's not like PCA, where you might choose to keep only 2 out of 10 principal components; when you do the kind of trimming you did above, you're really just taking out some of the "noise" in your data, and the data still has the same dimensions.
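That said, if you want a representation with fewer columns that approximately preserves the distances between rows (like your A''), you can project the rows onto the first k right singular vectors; a small sketch using your A (note that prcomp centres the columns first, so this will not match A'' exactly unless you centre A beforehand):
s <- svd(A)
k <- 2
# k-column representation of the rows: U_k * D_k (equivalently A %*% s$v[, 1:k]);
# pairwise row distances of this matrix approximate those of A
A_reduced <- s$u[, 1:k] %*% diag(s$d[1:k])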
I want the convolution of two functions defined on [0,Inf), say
f=function(x)
(1+0.5*cos(2*pi*x))*(x>=0)
and
g=function(x)
exp(-2*x)*(x>0)
Using R's integrate function, I can do this:
cfg=function(x)
integrate(function(y) f(y)*g(x-y),0,x)$value
From searching the web, it seems that there are more efficient (and more accurate) ways of doing this (say, using fft() or convolve()). Can anyone with such experience explain how, please?
Thanks!
The convolve or fft solutions give a discrete result, rather than a function as you have defined in cfg. They can give you the numeric values of cfg on some regular, discrete grid of inputs.
fft is for periodic functions (only) so that is not going to help. However, convolve has a mode of operation called "open", which emulates the operation that is being performed by cfg.
Note that with type="open", you must reverse the second sequence (see ?convolve, "Details"). You also have to use only the first half of the result. Here is a worked example of the convolution of c(2,3,5) with c(7,11,13), as performed by convolve(c(2,3,5), rev(c(7,11,13)), type='open'); the reversed sequence c(13,11,7) slides across c(2,3,5) and the overlapping products are summed:
  2  3  5      (first sequence)
 13 11  7      (second sequence, reversed)

 overlap 1:  2*7                 = 14
 overlap 2:  2*11 + 3*7          = 43
 overlap 3:  2*13 + 3*11 + 5*7   = 94
 overlap 4:  3*13 + 5*11         = 94
 overlap 5:  5*13                = 65
Note that evaluating the first three elements gives something close to the results of your integration. The last three would be used for the reverse convolution.
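You can check this small example directly in R:
convolve(c(2, 3, 5), rev(c(7, 11, 13)), type = "open")
## 14 43 94 94 65   (up to floating-point rounding)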
Here is a comparison with your functions. Your function, vectorized, plotted with
y <- seq(0,10,by=.01)
plot(y, Vectorize(cfg)(y), type='l')
And an application of convolve plotted with the following code. Note that there are 100 points per unit interval in y (spacing 0.01), so dividing by 100 amounts to multiplying by the grid spacing and turns the discrete sum into an approximation of the integral.
plot(y, convolve(f(y), rev(g(y)), type='open')[1:1001]/100, type='l')
These do not quite agree, but the convolution is much faster:
max(abs(Vectorize(cfg)(y) - convolve(f(y), rev(g(y)), type='open')[1:1001]/100))
## [1] 0.007474999
library(rbenchmark)
benchmark(Vectorize(cfg)(y), convolve(f(y), rev(g(y)), type='open')[1:1001]/100, columns=c('test', 'elapsed', 'relative'))
## test elapsed relative
## 2 convolve(f(y), rev(g(y)), type = "open")[1:1001]/100 0.056 1
## 1 Vectorize(cfg)(y) 5.824 104
I have a data set containing the following information:
Workload name
Configuration used
Measured performance
Here is a toy data set to illustrate my problem (the performance data does not make sense at all; I just selected different integers to make the example easy to follow. In reality the data would be floating-point values coming from performance measurements):
workload cfg perf
1 a 1 1
2 b 1 2
3 a 2 3
4 b 2 4
5 a 3 5
6 b 3 6
7 a 4 7
8 b 4 8
You can generate it using:
dframe <- data.frame(workload=rep(letters[1:2], 4),
                     cfg=unlist(lapply(seq_len(4),
                                       function(x) { return(c(x, x)) })),
                     perf=round(seq_len(8)))
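(The same data frame can be built a bit more directly, which may be easier to adapt to real measurements:)
dframe <- data.frame(workload = rep(c("a", "b"), 4),
                     cfg      = rep(1:4, each = 2),
                     perf     = 1:8)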
I am trying to compute the harmonic speedup for the different configurations. For that a base configuration is needed (cfg = 1 in this example). Then the harmonic speedup is computed as:
HS(cfg_i) = num_workloads / sum over wl_j of [ perf(cfg_base, wl_j) / perf(cfg_i, wl_j) ]
For instance, for configuration 2 it would be:
HS(cfg_2) = 2 / [perf(cfg_1, wl_1) / perf(cfg_2, wl_1) +
perf(cfg_1, wl_2) / perf(cfg_2, wl_2)]
I would like to compute harmonic speedup for every workload pair and configuration. By using the example data set, the result would be:
workload.pair cfg harmonic.speedup
1 a-b 1 2 / (1/1 + 2/2) = 1
2 a-b 2 2 / (1/3 + 2/4) = 2.4
3 a-b 3 2 / (1/5 + 2/6) = 3.75
4 a-b 4 2 / (1/7 + 2/8) = 5.09
I am struggling with aggregate and ddply in order to find a solution that does not use loops, but I have not been able to come up with a working solution. So, the basic problems that I am facing are:
how to handle the relationship between workloads and configuration. The results for a given workload pair (A-B), and a given configuration must be handled together (the first two performance measurements in the denominator of the harmonic speedup formula come from workload A, while the other two come from workload B)
for each workload pair and configuration, I need to "normalize" performance values with the values from configuration base (cfg 1 in the example)
I do not really know how to express that with some R function, such as aggregate or ddply (if it is possible, at all).
Does anyone know how this can be solved?
EDIT: I was somewhat afraid that using 1..8 as perf could lead to some confusion. I did that for the sake of simplicity, but the values do not need to be those ones (for instance, imagine initializing them like this: dframe$perf <- runif(8)). Both James's and Zach's answers got that part of my question wrong, so I thought it was better to clarify it here. Anyway, I have generalized both answers to deal with the case where the performance for configuration 1 is not (1, 2).
Try this:
library(plyr)
baseline <- dframe[dframe$cfg == 1,]$perf
hspeed <- function(x) length(x) / sum(baseline / x)
ddply(dframe,.(cfg),summarise,workload.pair=paste(workload,collapse="-"),
harmonic.speedup=hspeed(perf))
cfg workload.pair harmonic.speedup
1 1 a-b 1.000000
2 2 a-b 2.400000
3 3 a-b 3.750000
4 4 a-b 5.090909
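If you prefer to stay in base R, the same computation can be written without plyr (a sketch, assuming dframe as generated above and that the workload rows within each cfg appear in the same order as in the baseline):
baseline <- dframe$perf[dframe$cfg == 1]
sapply(split(dframe$perf, dframe$cfg),
       function(p) length(p) / sum(baseline / p))
##        1        2        3        4
## 1.000000 2.400000 3.750000 5.090909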
For problems like this, I like to "reshape" the data frame using the reshape2 package, giving a column for workload a and a column for workload b. It is then easy to compare the two columns using vectorised operations:
library(reshape2)
dframe <- dcast(dframe, cfg~workload, value.var='perf')
baseline <- dframe[dframe$cfg == 1, ]
dframe$harmonic.speedup <- 2/((baseline$a/dframe$a)+(baseline$b/dframe$b))
> dframe
cfg a b harmonic.speedup
1 1 1 2 1.000000
2 2 3 4 2.400000
3 3 5 6 3.750000
4 4 7 8 5.090909