PSO and K-means based Text Document Clustering in R

I want to perform text document clustering using PSO and K-means in R. The basic idea is that PSO first gives optimised values for the cluster centroids, and those optimised centroids are then used as the initial centroids for K-means to cluster the documents.
I am a newbie to Particle Swarm Optimization. I read a research paper on clustering based on PSO and K-means, but I did not find any working example of it. Any kind of help is much appreciated. Thanks in advance!
Below is the code showing what I have done so far.
# Import libraries
library(pdist)
library(hydroPSO)
# Create a matrix; suppose it is the document-term matrix we get after
# cleaning the corpus. (In my actual data I have 20 documents with 951
# terms, i.e., dim(dtm) = 20 x 951.)
# Note that seq(1, 20, 1) supplies only 20 values, so they are recycled
# to fill the 4 x 7 matrix below.
matri <- matrix(data = seq(1, 20, 1), nrow = 4, ncol = 7, byrow = TRUE)
matri
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    1    2    3    4    5    6    7
[2,]    8    9   10   11   12   13   14
[3,]   15   16   17   18   19   20    1
[4,]    2    3    4    5    6    7    8
# Initially select the first and second rows as the centroids
cj <- matri[1:2, ]
# Calculate the Euclidean distance of each data point from the centroids
vm <- as.data.frame(t(as.matrix(pdist(matri, cj))))
vm
        V1       V2       V3        V4
1  0.00000 18.52026 34.81379  2.645751
2 18.52026  0.00000 21.51744 15.874508
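As a quick check, vm[1, 3] should be the Euclidean distance between document 3 and the first centroid, which it is:
sqrt(sum((matri[3, ] - cj[1, ])^2))
# [1] 34.81379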
# Create binary matrix S in which 1 means instance Ii is allocated to
# cluster Cj, otherwise 0
S <- matrix(data = NA, nrow = nrow(vm), ncol = ncol(vm))
for (i in 1:nrow(vm)) {
  for (j in 1:ncol(vm)) {
    cd <- which.min(vm[, j])
    S[i, j] <- ifelse(cd == i, 1, 0)
  }
}
S
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    1
[2,]    0    1    1    0
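As a side note (my addition, not part of the original post), the double loop can be replaced by a vectorised equivalent:
# index of the closest centroid for each document (column of vm)
nearest <- unname(apply(vm, 2, which.min))
S2 <- 1 * outer(seq_len(nrow(vm)), nearest, "==")
all(S2 == S)
# [1] TRUE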
# Apply hydroPSO() to get optimised values of the centroids
set.seed(5486)
D <- 4 # dimension
lower <- rep(0, D)
upper <- rep(10, D)
m_s <- matrix(data = NA, nrow = nrow(S), ncol = ncol(matri))
# objective function to be minimised
Fn <- function(y) {
  for (j in 1:ncol(matri)) {
    for (i in 1:nrow(matri)) {
      for (k in 1:nrow(y)) {
        for (l in 1:ncol(y)) {
          m_s[k, ] <- colSums(matri[y[k, ] == 1, ]) / sum(y[k, ])
        }
      }
    }
  }
  sm <- sum(m_s) / nrow(S)
  return(sm)
}
hh1 <- hydroPSO(S, fn = Fn, lower = lower, upper = upper,
                control = list(write2disk = FALSE, npart = 3))
But the above hydroPSO() call is not working: it fails with Error in 1:nrow(y) : argument of length 0. I searched for this error but didn't find a solution that works for me.
I also made some changes to my objective function, and this time hydroPSO() ran, but I suspect not correctly. I am passing my initial centroid matrix, whose dimension is 2 x 7, as a parameter, but the function returns only 1 x 7 optimised values, and I do not understand why.
set.seed(5486)
D <- 7 # dimension
lower <- rep(0, D)
upper <- rep(10, D)
Fn <- function(x) {
  vm <- as.data.frame(t(as.matrix(pdist(matri, x))))
  S <- matrix(data = NA, nrow = nrow(vm), ncol = ncol(vm))
  for (i in 1:nrow(vm)) {
    for (j in 1:ncol(vm)) {
      cd <- which.min(vm[, j])
      S[i, j] <- ifelse(cd == i, 1, 0)
    }
  }
  m_s <- matrix(data = NA, nrow = nrow(S), ncol = ncol(matri))
  for (j in 1:ncol(matri)) {
    for (i in 1:nrow(matri)) {
      for (k in 1:nrow(S)) {
        for (l in 1:ncol(S)) {
          m_s[k, ] <- colSums(matri[S[k, ] == 1, ]) / sum(S[k, ])
        }
      }
    }
  }
  sm <- sum(m_s) / nrow(S)
  return(sm)
}
hh1 <- hydroPSO(cj, fn = Fn, lower = lower, upper = upper,
                control = list(write2disk = FALSE, npart = 2, K = 2))
Output of the above call:
## $par
## Param1 Param2 Param3 Param4 Param5 Param6 Param7
## 8.6996174 2.1952303 5.6903588 0.4471795 3.7103161 1.6605425 8.2717574
##
## $value
## [1] 61.5
##
## $best.particle
## [1] 1
##
## $counts
## function.calls iterations regroupings
## 2000 1000 0
##
## $convergence
## [1] 3
##
## $message
## [1] "Maximum number of iterations reached"
I guess I am passing parameters to hydroPSO() in the wrong way. Please point out where I am going wrong.
Thank you very much!

Instead of passing cj to hydroPSO(), in my second approach I passed as.vector(t(cj)), and it worked fine: I got 14 optimised values. The reason is that hydroPSO() optimises a flat parameter vector of length D, so each particle reaches fn as a plain numeric vector (which is also why nrow(y) was NULL in the first attempt and 1:nrow(y) failed); a 2 x 7 centroid matrix therefore has to be flattened to D = 14 values and reshaped back inside the objective function.
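For completeness, here is a sketch of the flattened-vector approach wired up to the K-means step described at the top. The quantisation-error fitness (mean distance of each document to its nearest centroid) and the kmeans() hand-off are my own illustrative choices, not from the original post:
# hydroPSO() optimises a flat numeric vector of length D, so each particle
# reaches fn as a plain vector: nrow(c(1, 2, 3)) is NULL, which is why the
# first attempt failed with "Error in 1:nrow(y) : argument of length 0".
k <- 2                  # number of clusters
D <- k * ncol(matri)    # 14 parameters = one flattened k x 7 centroid matrix
lower <- rep(0, D)
upper <- rep(20, D)     # bounds chosen to cover the data range
Fn_vec <- function(x) {
  cen <- matrix(x, nrow = k, byrow = TRUE)  # unflatten the particle
  d <- as.matrix(pdist(matri, cen))         # document-to-centroid distances
  mean(apply(d, 1, min))                    # quantisation error to minimise
}
set.seed(5486)
hh <- hydroPSO(as.vector(t(cj)), fn = Fn_vec, lower = lower, upper = upper,
               control = list(write2disk = FALSE, npart = 10))
# reshape the 14 optimised values back into a 2 x 7 centroid matrix
best_cen <- matrix(hh$par, nrow = k, byrow = TRUE)
# seed k-means with the PSO-optimised centroids
km <- kmeans(matri, centers = best_cen)
km$cluster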

Related

Function that only runs when a matrix has NAs in it

I have two matrices, one of which contains an NA value, and I want a function that only does its work when NAs are present in the data. So if I run the function, it should act on df2 but not df1. How would I do this?
df1 <- matrix(1:4, nrow = 2, ncol = 2)
df2 <- matrix(c(1, 2, 3, NA), nrow = 2, ncol = 2)
Here is a complete answer (assuming I understand what you are getting at). The function is set up to do something or not to the matrix, depending on whether it has NA values.
df1 <- matrix(1:4, nrow = 2, ncol = 2)
df2 <- matrix(c(1,2,3,NA), nrow = 2, ncol = 2)
myfunc <- function(m) {
  ret <- m
  if (all(!is.na(m))) {
    print("This matrix has no NAs")
  } else {
    print("This matrix has NAs")
  }
  return(ret)
}
myfunc(df1)
# [1] "This matrix has no NAs"
# [,1] [,2]
# [1,] 1 3
# [2,] 2 4
myfunc(df2)
# [1] "This matrix has NAs"
# [,1] [,2]
# [1,] 1 3
# [2,] 2 NA
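As a side note, anyNA(m) is a more direct (and usually faster) test than all(!is.na(m)); a minimal variant:
myfunc2 <- function(m) {
  if (anyNA(m)) {
    print("This matrix has NAs")
  } else {
    print("This matrix has no NAs")
  }
  m
}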

Creating many pseudo-random matrices at the same time in R? Comparing sign matching across points?

I can make one pseudo-random matrix with the following:
nc <- 14
nr <- 14
set.seed(111)
M <- matrix(sample(
  c(runif(58, min = -1, max = 0),
    runif(71, min = 0, max = 0),
    runif(nr * nc - 129, min = 0, max = +1))), nrow = nr, nc = nc)
The more important question: I need 1000 matrices with the same numbers of negative, positive and zero values; only their locations within the matrices should vary.
I can make the matrices one by one, but I want to do this task faster.
The less important question: once I have the 1000 matrices, I need to identify, for every position in the matrices, how many positive, negative or zero values occurred there. For example:
MATRIX_A:  [9, 1] = -0.2
MATRIX_B:  [9, 1] = -0.5
MATRIX_C:  [9, 1] =  0.1
MATRIX_D:  [9, 1] =  0.0
MATRIX_E:  [9, 1] =  0.9
What I need:
FINAL_MATRIX_positive:  [9, 1] = (2/5 * 100) = 40%, or 0.4, or a count of 2,
because at this position, 2 of the 5 matrices held a positive value. I also need the same for negative and zero values.
If it isn't possible to do this in R, I can compare them "manually" in Excel.
Thank you for your help!
Actually you are almost there!
You can try the code below: replicate generates the random matrix 1000 times, and Reduce collects the statistics for each position:
nc <- 14
nr <- 14
N <- 1000
lst <- replicate(
  N,
  matrix(sample(
    c(
      runif(58, min = -1, max = 0),
      runif(71, min = 0, max = 0),
      runif(nr * nc - 129, min = 0, max = +1)
    )
  ), nrow = nr, nc = nc),
  simplify = FALSE
)
pos <- Reduce(`+`, lapply(lst, function(M) M > 0)) / N
neg <- Reduce(`+`, lapply(lst, function(M) M < 0)) / N
zero <- Reduce(`+`, lapply(lst, function(M) M == 0)) / N
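For example, the proportion of positive values at row 1, column 9 across the N matrices is then simply:
pos[1, 9]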
I use a function for your simulation scheme:
my_sim <- function(n_neg = 58, n_0 = 71, n_pos = 67) {
  res <- c(runif(n_neg, min = -1, max = 0),
           rep(0, n_0),
           runif(n_pos, min = 0, max = +1))
  return(sample(res))
}
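A quick check that every draw keeps the required sign counts:
table(sign(my_sim()))
# -1   0   1
# 58  71  67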
Then, I simulate your matrices (I store them in a list):
N <- 1000
nr <- 14
nc <- nr
set.seed(111)
my_matrices <- list()
for (i in 1:N) {
  my_matrices[[i]] <- matrix(my_sim(), nrow = nr, ncol = nc)
}
Finally, I compute the proportion of positive numbers at the position row 1, column 9:
sum(sapply(my_matrices, function(x) x[1,9]) > 0)/N
# [1] 0.366
However, if you are interested in all the positions, these lines will do the job:
aux <- lapply(my_matrices, function(x) x > 0)
FINAL_MATRIX_positive <- 0
for (i in 1:N) {
  FINAL_MATRIX_positive <- FINAL_MATRIX_positive + aux[[i]]
}
FINAL_MATRIX_positive <- FINAL_MATRIX_positive / N
# row 1, column 9
FINAL_MATRIX_positive[1, 9]
# [1] 0.366
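The negative and zero proportions can be obtained the same way (a sketch along the same lines, reusing the stored list):
aux_neg <- lapply(my_matrices, function(x) x < 0)
aux_zero <- lapply(my_matrices, function(x) x == 0)
FINAL_MATRIX_negative <- Reduce(`+`, aux_neg) / N
FINAL_MATRIX_zero <- Reduce(`+`, aux_zero) / N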

How to randomise a matrix element for each iteration of a loop?

I'm working with the popbio package on a population model. It looks something like this:
library(popbio)
babies <- 0.3
kids <- 0.5
teens <- 0.75
adults <- 0.98
A <- c(0, 0, 0, 0, teens*0.5, adults*0.8,
       babies, 0, 0, 0, 0, 0,
       0, kids, 0, 0, 0, 0,
       0, 0, kids, 0, 0, 0,
       0, 0, 0, teens, 0, 0,
       0, 0, 0, 0, teens, adults)
A <- matrix(A, ncol = 6, byrow = TRUE)
N <- c(10, 10, 10, 10, 10, 10)
N <- matrix(N, ncol = 1)
model <- pop.projection(A, N, iterations = 10)
model
I'd like to know how I can randomise the input so that at each iteration, which represents a year in this case, I get a different input for the matrix elements. So, for instance, my model runs for 10 years, and I'd like the baby survival rate to change each year. babies <- rnorm(1, 0.3, 0.1) doesn't do it, because that still leaves me with a single value, just randomly selected.
Update: This is distinct from running 10 separate models with different initial random values. I'd like the update to occur within a single model run, which itself has 10 iterations in the pop.projection function.
Hope you can help.
I know this answer is very late, but here's one approach using expressions. First, use an expression to create the matrix.
vr <- list(babies = 0.3, kids = 0.5, teens = 0.75, adults = 0.98)
Ax <- expression(matrix(c(
  0, 0, 0, 0, teens*0.5, adults*0.8,
  babies, 0, 0, 0, 0, 0,
  0, kids, 0, 0, 0, 0,
  0, 0, kids, 0, 0, 0,
  0, 0, 0, teens, 0, 0,
  0, 0, 0, 0, teens, adults), ncol = 6, byrow = TRUE))
A1 <- eval(Ax, vr)
lambda(A1)
[1] 1.011821
Next, use an expression to create vital rates with rnorm or other functions.
vr2 <- expression( list( babies=rnorm(1,0.3,0.1), kids=0.5, teens=0.75, adults=0.98 ))
A2 <- eval(Ax, eval( vr2))
lambda(A2)
[1] 1.014586
Apply the expression to 100 matrices.
x <- sapply(1:100, function(x) lambda(eval(Ax, eval(vr2))))
quantile(x, c(.05,.95))
5% 95%
0.996523 1.025900
Finally, make two small changes to pop.projection by adding the vr option and a line to evaluate A at each time step.
pop.projection2 <- function(Ax, vr, n, iterations = 20) {
  x <- length(n)
  t <- iterations
  stage <- matrix(numeric(x * t), nrow = x)
  pop <- numeric(t)
  change <- numeric(t - 1)
  for (i in 1:t) {
    stage[, i] <- n
    pop[i] <- sum(n)
    if (i > 1) {
      change[i - 1] <- pop[i] / pop[i - 1]
    }
    ## evaluate Ax at each time step
    A <- eval(Ax, eval(vr))
    n <- A %*% n
  }
  colnames(stage) <- 0:(t - 1)
  w <- stage[, t]
  pop.proj <- list(lambda = pop[t] / pop[t - 1], stable.stage = w / sum(w),
                   stage.vectors = stage, pop.sizes = pop, pop.changes = change)
  pop.proj
}
n <- c(10, 10, 10, 10, 10, 10)
pop.projection2(Ax, vr2, n, 10)
$lambda
[1] 0.9874586
$stable.stage
[1] 0.33673579 0.11242588 0.08552367 0.02189786 0.02086656 0.42255023
$stage.vectors
0 1 2 3 4 5 6 7 8 9
[1,] 10 11.590000 16.375700 19.108186 20.2560223 20.5559445 20.5506251 20.5898222 20.7603581 20.713271
[2,] 10 4.147274 3.332772 4.443311 5.6693931 1.9018887 6.8455597 5.3879202 10.5214540 6.915534
[3,] 10 5.000000 2.073637 1.666386 2.2216556 2.8346965 0.9509443 3.4227799 2.6939601 5.260727
[4,] 10 5.000000 2.500000 1.036819 0.8331931 1.1108278 1.4173483 0.4754722 1.7113899 1.346980
[5,] 10 7.500000 3.750000 1.875000 0.7776139 0.6248948 0.8331209 1.0630112 0.3566041 1.283542
[6,] 10 17.300000 22.579000 24.939920 25.8473716 25.9136346 25.8640330 25.9715930 26.2494195 25.991884
$pop.sizes
[1] 60.00000 50.53727 50.61111 53.06962 55.60525 52.94189 56.46163 56.91060 62.29319 61.51194
$pop.changes
[1] 0.8422879 1.0014610 1.0485765 1.0477793 0.9521023 1.0664832 1.0079517 1.0945797 0.9874586
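As a consistency check (my addition, assuming popbio is loaded): with a deterministic vital-rate expression, pop.projection2() should reproduce the original pop.projection():
vr_fixed <- expression(list(babies = 0.3, kids = 0.5, teens = 0.75, adults = 0.98))
p1 <- pop.projection2(Ax, vr_fixed, n, iterations = 10)
p2 <- pop.projection(eval(Ax, eval(vr_fixed)), matrix(n, ncol = 1), iterations = 10)
all.equal(p1$lambda, p2$lambda)
# [1] TRUE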

Efficient way to compute standard deviation of nearest neighbours of each element in matrix

I would like to compute the standard deviation of the nearest neighbours (3x3 moving window) of each element in a matrix. I wrote some R code to implement it:
library(FNN)
df <- matrix(1:10000, nrow = 100, ncol = 100, byrow = TRUE)
df_ <- reshape2::melt(df)
df_index <- df_[, c(1,2)]
df_query <- df_index
neighbor_index <- knnx.index(df_index, df_query, k = 9, algorithm = 'kd_tree')
neighbor_coor<- apply(neighbor_index, 1, function(x) df_query[x, ])
neighbor_sd <- lapply(neighbor_coor, function(x) sd(df[x[, 1], x[, 2]]))
sd <- do.call(rbind, neighbor_sd)
But it is too slow. Could you give me some advice on speeding it up? Are there other ways to implement it?
As @romanlustrik proposed in his comment, we can use raster::focal() for this problem.
library(raster)
df <- matrix(1:10000, nrow = 100, ncol = 100, byrow = TRUE)
dfR <- raster(df)
dfSD <- as.matrix(focal(dfR, w = matrix(1,3,3), fun = sd))
where w is a matrix representing the nearest neighbors and their weighting within fun (in this case 3x3, i.e. the cell itself and its 8 neighbors). Thus, any neighborhood pattern is imaginable, as long as it can be represented by a matrix.
matrix(1,3,3)
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] 1 1 1
# [3,] 1 1 1
An example with only the 4 neighbors (excluding diagonals and the cell itself):
matrix(c(0,1,0,1,0,1,0,1,0), 3, 3)
# [,1] [,2] [,3]
# [1,] 0 1 0
# [2,] 1 0 1
# [3,] 0 1 0
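Going back to the 3x3 call above, a quick sanity check (my addition): the focal value at an interior cell should match the standard deviation of its 3x3 window computed directly:
dfSD[2, 2]
sd(df[1:3, 1:3])
# the two values should match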

Is there a way to simulate a dataset based on a model object in TAM?

So I've estimated a multidimensional IRT model using the TAM package, based on a dataset that I have.
Now that I have the TAM fit object, is there any way to use it to simulate a new dataset that "abides by the rules" of the model I estimated?
Here is something similar, but about lme fit objects:
https://stats.stackexchange.com/questions/11233/how-to-simulate-data-based-on-a-linear-mixed-model-fit-object-in-r
Thanks in advance,
KH
Edit
Since TAM version 1.10-0, this is possible using the function IRT.simulate (see the respective help file). Thanks again for the request.
library(TAM)
data(data.gpcm)
psych::describe(data.gpcm)
resp <- data.gpcm
# define three dimensions and different loadings of item categories
# on these dimensions in the B loading matrix
I <- 3 # 3 items
D <- 3 # 3 dimensions
# define loading matrix B
# 4 categories for each item (0, 1, 2, 3)
B <- array(0, dim = c(I, 4, D))
for (ii in 1:I) {
  B[ii, 1:4, 1] <- 0:3
  B[ii, 1, 2] <- 1
  B[ii, 4, 3] <- 1
}
dimnames(B)[[1]] <- colnames(resp)
B[1, , ]
## > B[1,,]
## [,1] [,2] [,3]
## [1,] 0 1 0
## [2,] 1 0 0
## [3,] 2 0 0
## [4,] 3 0 1
#-- test run
mod1 <- tam.mml(resp, B = B, control = list(snodes = 1000, maxiter = 5))
sim.dat <- IRT.simulate(mod1, nobs = 2000)
Old Solution
I wouldn't say it is impossible. However, for the time being, it is not easy, since it involves handling TAM internal functions and attributes of the estimation object. That is, there is no method yet that lets you extract the response probability function at prespecified trait points.
However, thanks to your request, we are working on exactly this very valuable feature and I'll give an update to this answer as soon as the method is on CRAN.
For now, let's extend the example from that request (implementing the ConQuest score command in TAM), which Alex also included on the manual page of the tam function as EXAMPLE 20.
data(data.gpcm)
psych::describe(data.gpcm)
resp <- data.gpcm
# define three dimensions and different loadings of item categories
# on these dimensions in the B loading matrix
I <- 3 # 3 items
D <- 3 # 3 dimensions
# define loading matrix B
# 4 categories for each item (0, 1, 2, 3)
B <- array(0, dim = c(I, 4, D))
for (ii in 1:I) {
  B[ii, 1:4, 1] <- 0:3
  B[ii, 1, 2] <- 1
  B[ii, 4, 3] <- 1
}
dimnames(B)[[1]] <- colnames(resp)
B[1, , ]
## > B[1,,]
## [,1] [,2] [,3]
## [1,] 0 1 0
## [2,] 1 0 0
## [3,] 2 0 0
## [4,] 3 0 1
#-- test run
mod1 <- tam.mml(resp, B = B, control = list(snodes = 1000, maxiter = 5))
Now for the part where we extract the attributes that are necessary for the computation of the response probabilities and generate new testees.
# Extract necessary item attributes
xsi <- mod1$xsi$xsi
A <- mod1$A
B <- mod1$B
maxK <- mod1$maxK
nI <- dim(A)[1]
iIndex <- 1:nI
AXsi <- matrix(0, nrow = nI, ncol = maxK)
# Simulate new testees (mvrnorm comes from the MASS package)
nnodes <- 2000
theta <- MASS::mvrnorm(n = nnodes, mod1$beta, mod1$variance)
The response probabilities can be obtained from a call to an internal function.
# Calculate response probablities and simulate
p <- TAM:::calc_prob.v5(iIndex, A, AXsi, B, xsi, theta, nnodes, maxK, recalc = TRUE)$rprobs
p[,,1] # response probability of testee 1 to each category 0, 1, 2, 3 for all three items
# [,1] [,2] [,3] [,4]
# [1,] 0.06738066 0.8111365 0.1043441 0.0171387
# [2,] 0.02545206 0.4895568 0.3182046 0.1667866
# [3,] 0.04503185 0.5105446 0.3429603 0.1014633
With this, simulate uniform draws for the success cut and compare them with the cumulative response probabilities.
sim.data <- matrix(runif(nnodes * nI), nrow = nnodes, ncol = nI)
for (pp in 1:nnodes) {
  cat.success.pp <- (sim.data[pp, ] > t(apply(p[, , pp], 1, cumsum)))
  sim.data[pp, ] <- c(cat.success.pp %*% rep(1, maxK))
}
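As a rough check (my addition), the simulated category frequencies can be compared with the observed ones:
table(sim.data[, 1]) # simulated categories for item 1
table(resp[, 1])     # observed categories for item 1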
Best,
Tom
