R - Get a matrix with a reduced number of features with SVD

I'm using SVD in R and I'm able to reduce the dimensionality of my matrix by replacing the smallest singular values with 0. But when I recompose the matrix I still have the same number of features; I could not find how to effectively drop the most useless features of the source matrix in order to reduce its number of columns.
For example what I'm doing for the moment:
This is my source matrix A:
A B C D
1 7 6 1 6
2 4 8 2 4
3 2 3 2 3
4 2 3 1 3
If I do:
s = svd(A)
s$d[3:4] = 0 # Replacement of the 2 smallest singular values by 0
A_prime = s$u %*% diag(s$d) %*% t(s$v) # "A'" in the text; A' is not a valid R name
I get A', which has the same dimensions (4x4), was reconstructed with only 2 "components" and is an approximation of A (containing a little less information, maybe less noise, etc.):
[,1] [,2] [,3] [,4]
1 6.871009 5.887558 1.1791440 6.215131
2 3.799792 7.779251 2.3862880 4.357163
3 2.289294 3.512959 0.9876354 2.386322
4 2.408818 3.181448 0.8417837 2.406172
What I want is a submatrix with fewer columns that reproduces the distances between the rows, something like this (obtained using PCA; let's call it A''):
PC1 PC2
1 -3.588727 1.7125360
2 -2.065012 -2.2465708
3 2.838545 0.1377343 # The similarity between rows 3
4 2.815194 0.3963005 # and 4 in A is conserved in A''
Here is the code to get A'' with PCA:
p = prcomp(A)
A_pp = p$x[, 1:2] # "A''" in the text; again, A'' is not a valid R name
The final goal is to reduce the number of columns in order to speed up clustering algorithms on huge datasets.
Thank you in advance if someone can guide me :)

I would check out this chapter on dimensionality reduction or this Cross Validated question. The idea is that the entire data set can be reconstructed using less information. It's not like PCA, where you might choose to keep only 2 out of 10 principal components; when you do the kind of trimming you did above, you're really just taking out some of the "noise" in your data. The data still has the same dimension.
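That said, if the goal really is a matrix with fewer columns, you can keep only the first k columns of U scaled by the corresponding singular values instead of zeroing and recomposing. A minimal sketch (note that prcomp centers the data, so the values only match A'' if you center A before the SVD):
k = 2
Ac = scale(A, center = TRUE, scale = FALSE) # center the columns, as prcomp does
s = svd(Ac)
A_reduced = s$u[, 1:k] %*% diag(s$d[1:k]) # 4 x 2; equals the PCA scores up to sign
This preserves the row-to-row distances up to the dropped components, which is what matters for clustering.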

Related

kNN algorithm predicts only one group

I am trying to make a model that will predict the group of a city according to its development level. The cities in the 1st group are the most developed and the ones in the 6th group are the least developed. I have 10 numerical variables about each city in my data.
First, I normalized the variables using min-max normalization. Then I generated the training and test sets; I have 81 cities, and the dimensions of the training and test sets are 61x10 and 20x10, respectively. I excluded the target variable from both. Then I made label vectors for them, the training and test labels, with dimensions 61x1 and 20x1.
Then I run the knn function (from the class package) like this:
knn(train = Data.training, test = Data.test, cl = Data.trainLabels, k = 3)
Its output is this:
[1] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
Levels: 1 2 3 4 5 6
But if I set the argument use.all to FALSE, I get the following output, which changes every time I run the code:
[1] 1 4 2 2 2 3 5 4 3 5 5 6 5 6 5 6 4 5 2 2
Levels: 1 2 3 4 5 6
I can't find the reason why my code gives the same prediction in the first case, or what use.all has to do with it.
As explained in the knn documentation:
use.all controls handling of ties. If true, all distances equal to the kth largest are included. If false, a random selection of distances equal to the kth is chosen to use exactly k neighbours.
In your case, all points have the same distances, so they all win as 'best neighbour' (use.all = TRUE) or the algorithm picks k winners at random (use.all = FALSE).
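A toy reproduction of this behaviour (my own example, not from the question): when every training point is identical, all distances tie, so use.all = TRUE turns the prediction into a majority vote over the entire training set, which yields a constant prediction of the modal class.
library(class)
tr = matrix(0, nrow = 10, ncol = 2) # ten identical training points
lab = factor(c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3)) # class 3 is the majority
te = matrix(0, nrow = 4, ncol = 2) # four identical test points
knn(train = tr, test = te, cl = lab, k = 3) # always 3 3 3 3
knn(train = tr, test = te, cl = lab, k = 3, use.all = FALSE) # varies between runs
This is consistent with always predicting 6 if group 6 happens to be the most frequent label in your training set.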
The problem seems to be in how you trained the algorithm or in the data itself. Since you did not post a sample of your data, I cannot help with that, but I suggest that you re-check it. You can also compute a few distances by hand, to see what is going on.
Also, check that you randomised your data before splitting it into training and testing sets. For example, say that the dataset is ordered by the label (the target variable). If you use the first 20 points to train the algorithm, it is likely that the algorithm will never see some of the labels during the training phase and therefore it will perform poorly on those during the testing phase.
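For instance, a minimal sketch of a random split; the object name cities and the column layout here are assumptions, since the original data was not posted:
set.seed(42) # make the shuffle reproducible
idx = sample(nrow(cities)) # random permutation of the 81 row indices
Data.training = cities[idx[1:61], 1:10] # assuming the 10 predictors are columns 1:10
Data.test = cities[idx[62:81], 1:10]
Data.trainLabels = cities[idx[1:61], 11] # assuming the target is column 11
Data.testLabels = cities[idx[62:81], 11]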

R: Correct strings by distance measure (stringdistmatrix)

I am dealing with the problem that I need to count unique names of people in a string, but taking into consideration that there may be slight typos.
My thought was to treat strings with a distance below a certain threshold (e.g. a Levenshtein distance below 2) as equal. Right now I manage to calculate the string distances, but not to make any changes to my input string that would give me the correct number of unique names.
library(stringdist);library(stringr)
names<-"Michael, Liz, Miichael, Maria"
names_split<-strsplit(names, ", ")[[1]]
stringdistmatrix(names_split,names_split)
[,1] [,2] [,3] [,4]
[1,] 0 6 1 5
[2,] 6 0 7 4
[3,] 1 7 0 6
[4,] 5 4 6 0
(number_of_people<-str_count(names, ",")+1)
[1] 4
The correct value of number_of_people should be, of course, 3.
As I am only interested in the number of uniques names, I am not concerned if "Michael" becomes replaced by "Miichael" or the other way round.
One option is to try to cluster the names based on their distance matrix:
library(stringdist)
# create a 'dist' object (=lower triangular part of distance matrix)
d <- stringdistmatrix(names_split,method="osa")
# use hierarchical clustering to group nearest neighbors
hc <- hclust(d)
# visual inspection: y-axis labels the distance value
plot(hc)
# decide what distance value you find acceptable for grouping.
cutree(hc, h=3)
Depending on your actual data you will need to experiment with the distance type (qgrams/cosine may be useful, or the Jaro-Winkler distance in the case of names).
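To get the count of unique names the question asks for, you can count the distinct cluster labels; a small sketch continuing from the code above:
cl <- cutree(hc, h = 3) # the same cut as above
length(unique(cl)) # gives 3 for this example, as desired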

finding set of multinomial combinations

Let's say I have a vector of integers 1:6
w=1:6
I am attempting to obtain a matrix of 90 rows and 6 columns that contains the multinomial combinations from these 6 integers taken as 3 groups of size 2.
6!/(2!*2!*2!) = 720/8 = 90
So, columns 1 and 2 of the matrix would represent group 1, columns 3 and 4 would represent group 2 and columns 5 and 6 would represent group 3. Something like:
1 2 3 4 5 6
1 2 3 5 4 6
1 2 3 6 4 5
1 2 4 5 3 6
1 2 4 6 3 5
...
Ultimately, I would want to expand this to other multinomial combinations of limited size (because the numbers get large rather quickly) but I am having trouble getting things to work. I've found several functions that do binomial combinations (only 2 groups) but I could not locate any functions that do this when the number of groups is greater than 2.
I've tried two approaches to this:
Building up the matrix from scratch using for loops, and attempting things with the reshape package (thinking there might be something there for this with melt())
Working backwards from the permutation matrix (720 rows) by attempting to retain unique rows within groups and/or remove duplicated rows within groups
Neither worked for me.
The permutation matrix can be obtained with
library(gtools)
dat=permutations(6, 6, set=TRUE, repeats.allowed=FALSE)
I think working backwards from the full permutation matrix is a bit excessive, but I'm trying anything at this point.
Is there a package with a prebuilt function for this? Does anyone have any ideas on how I should proceed?
Here is how you can implement your "working backwards" approach:
gps <- list(1:2, 3:4, 5:6) # column indices of the three groups
get.col <- function(x, j) x[, j] # extract one group's columns
is.ordered <- function(x) !colSums(diff(t(x)) < 0) # TRUE where a group's columns are ascending
is.valid <- Reduce(`&`, Map(is.ordered, Map(get.col, list(dat), gps))) # all groups ascending
dat <- dat[is.valid, ] # keeps 720 / 2^3 = 90 rows
nrow(dat)
# [1] 90
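If you need this for other group sizes, the same filter generalizes. A hedged sketch (the function name multinom_groups is mine, not from a package, and it still enumerates all n! permutations first, so it only suits small n):
library(gtools)
multinom_groups <- function(n, sizes) {
  stopifnot(sum(sizes) == n)
  dat <- permutations(n, n) # all n! orderings of 1:n
  ends <- cumsum(sizes)
  gps <- Map(`:`, ends - sizes + 1, ends) # column indices of each group
  # keep rows where every group's columns are in ascending order
  keep <- Reduce(`&`, lapply(gps, function(j)
    !colSums(diff(t(dat[, j, drop = FALSE])) < 0)))
  dat[keep, , drop = FALSE]
}
nrow(multinom_groups(6, c(2, 2, 2))) # 90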

covariance matrix from a community list with grouping factors

I am still learning to use data.table (from the data.table package) and even after looking for help on the web and the help files, I am still struggling to do what I want.
I have a large data table with over 60 columns (the first three corresponding to factors and the remaining to response variables, in this case different species) and several rows corresponding to the different levels of the treatments and the species abundances. A very small version looks like this:
> TEST<-data.table(Time=c("0","0","0","7","7","7","12"),
Zone=c("1","1","0","1","0","0","1"),
quadrat=c(1,2,3,1,2,3,1),
Sp1=c(0,4,29,9,1,2,10),
Sp2=c(20,17,11,15,32,15,10),
Sp3=c(1,0,1,1,1,1,0))
> setkey(TEST, Time)
> TEST
Time Zone quadrat Sp1 Sp2 Sp3
1: 0 1 1 0 20 1
2: 0 1 2 4 17 0
3: 0 0 3 29 11 1
4: 12 1 1 10 10 0
5: 7 1 1 9 15 1
6: 7 0 2 1 32 1
7: 7 0 3 2 15 1
I need to calculate the sum of the covariances for each Zone x quadrat group. If I only had the species list for a given Zone x quadrat combination, then I could use the cov() function, but using cov() in the same way that I would use mean() or sum(), as in
Abundance = TEST[,lapply(.SD,mean),by="Zone,quadrat"]
does not work as I get the following error message:
Error in cov(value) : supply both 'x' and 'y' or a matrix-like 'x'
I understand why but I cannot figure out how to solve this.
What I exactly want is to be able to get, for each Zone x quadrat combination, the covariance matrix of all the species across all the sampling Time points. From each matrix, I then need to calculate the sum of the covariances of all pairs of species, so that then I can have a sum of covariance for each Zone x quadrat combination.
Any help would be greatly appreciated. Thanks.
From the help provided above by @Frank and some additional searching I did around the use of the upper.tri() function, the following code works:
Cov = TEST[, {cm <- cov(.SD); sum(cm[upper.tri(cm, diag = FALSE)])},
           by = 'Zone,quadrat', .SDcols = paste0('Sp', 1:3)]
In the initial version proposed, upper.tri() did not appear inside [ ], so it only extracted logical values from the covariance matrix. Setting diag = FALSE excludes the diagonal values before the upper triangle of the matrix is summed. In my case, I didn't care whether it was the upper or the lower triangle, but I'm sure that using lower.tri() would work equally well.
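As a quick sanity check (my own addition, not from the thread), the sum for a single Zone x quadrat group can be verified by hand:
sub = TEST[Zone == "1" & quadrat == 1, .(Sp1, Sp2, Sp3)]
cm = cov(sub)
sum(cm[upper.tri(cm)]) # should equal the V1 entry of Cov for Zone 1, quadrat 1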
I hope this helps other users who might encounter a similar issue.

Affinity Propagation results do not match

I am trying to implement the affinity propagation clustering algorithm in C++. As part of testing, I want to compare my results with well-established implementations of the algorithm in Matlab (link) and in R (package apcluster). Unfortunately, the clusterings do not agree.
To be more precise, the (test) data set is:
0.9411760 0.9702140
0.9607826 0.9744693
0.9754896 0.9574479
0.9852929 0.9489372
0.9950962 0.9234050
1.0000000 0.8936175
1.0000000 0.8723408
0.9852929 0.8595747
1.0000000 0.8893622
1.0000000 0.9191497
In R I typed:
library(apcluster)
S <- negDistMat(data)
A <- apcluster(S, maxits = 1000, convits = 100, lam = 0.9, q = 0.5)
and got:
> A@idx
2 2 2 5 5 9 9 9 9 5
2 2 2 5 5 9 9 9 9 5
In Matlab I just typed:
[idx,netsim,dpsim,expref]=apcluster(S,diag(S));
From the apcluster.m file implementing apcluster (line 77):
maxits=1000; convits=100; lam=0.9; plt=0; details=0; nonoise=0;
This explains the parameter values used in R; in Matlab they are the defaults. Since I'm more comfortable with R concerning affinity propagation, for comparison purposes I stuck with Matlab's defaults, just to avoid messing something up unintentionally.
...but got:
>> idx'
ans =
3 3 3 3 5 9 9 9 9 5
In both cases the similarity matrices matched. What could I have missed?
Update:
I've also implemented the Matlab code proposed by Frey & Dueck in their original publication (you may notice that I omitted the noise term), and although I can replicate the indexes produced by the former Matlab implementation, the availability and responsibility matrices differ in some values. The error is less than 0.01, but this is significant.
Their code is:
function [idx,A,R]=frey(S)
N=size(S,1);
A=zeros(N,N);
R=zeros(N,N);
lam=0.9; % Set damping factor
for iter=1:122
    % Compute responsibilities
    Rold=R;
    AS=A+S;
    [Y,I]=max(AS,[],2);
    for i=1:N
        AS(i,I(i))=-realmax;
    end;
    [Y2,I2]=max(AS,[],2);
    R=S-repmat(Y,[1,N]);
    for i=1:N
        R(i,I(i))=S(i,I(i))-Y2(i);
    end;
    R=(1-lam)*R+lam*Rold; % Dampen responsibilities
    % Compute availabilities
    Aold=A;
    Rp=max(R,0);
    for k=1:N
        Rp(k,k)=R(k,k);
    end;
    A=repmat(sum(Rp,1),[N,1])-Rp;
    dA=diag(A);
    A=min(A,0);
    for k=1:N
        A(k,k)=dA(k);
    end;
    A=(1-lam)*A+lam*Aold; % Dampen availabilities
end;
E=R+A; % Pseudomarginals
I=find(diag(E)>0); K=length(I); % Indices of exemplars
[tmp c]=max(S(:,I),[],2); c(I)=1:K; idx=I(c); % Assignments
I have tried all your code and the problem is caused by the way you supply the input preference. In the first case (R), you specify q=0.5. This means that the input preference p is set to the median of off-diagonal similarities (in your example, this is -0.05129912). If I run the Matlab code as follows (I used Octave, but Matlab should give the same result), I get:
octave:7> [idx,netsim,dpsim,expref]=apcluster(S,-0.05129912);
octave:8> idx'
ans =
2 2 2 5 5 9 9 9 9 5
This is exactly the same as the R result. If I run your Matlab code (with diag(S) being the second argument) and if I run
apcluster(S, p=diag(S))
in R (which sets the input preference to 0 for all samples in both cases), I get 10 one-sample clusters in both cases. So the two results match again, though I could not recover your Matlab result
3 3 3 3 5 9 9 9 9 5
I hope that makes the difference clear.
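As a footnote, the preference implied by q=0.5 can also be computed directly, so the identical value can be passed to Matlab; a small sketch, where data is the 10x2 matrix from the question:
library(apcluster)
S <- negDistMat(data)
p <- median(S[row(S) != col(S)]) # median off-diagonal similarity, about -0.0513 here
apcluster(S, p = p, maxits = 1000, convits = 100, lam = 0.9)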
Cheers, UBod
