Implementation of SVM-RFE Algorithm in R

I'm using the R code implementing the SVM-RFE algorithm from this source, http://www.uccor.edu.ar/paginas/seminarios/Software/SVM_RFE_R_implementation.pdf, but I made a small modification so that the code uses the gnum library. The code is the following:
svmrfeFeatureRanking = function(x, y){
    n = ncol(x)
    survivingFeaturesIndexes = seq(1:n)
    featureRankedList = vector(length = n)
    rankedFeatureIndex = n
    while(length(survivingFeaturesIndexes) > 0){
        # train the support vector machine
        svmModel = SVM(x[, survivingFeaturesIndexes], y, C = 10, cache_size = 500, kernel = "linear")
        # compute the ranking criteria
        rankingCriteria = svmModel$w * svmModel$w
        # rank the features
        ranking = sort(rankingCriteria, index.return = TRUE)$ix
        # update the feature ranked list
        featureRankedList[rankedFeatureIndex] = survivingFeaturesIndexes[ranking[1]]
        rankedFeatureIndex = rankedFeatureIndex - 1
        # eliminate the feature with the smallest ranking criterion
        survivingFeaturesIndexes = survivingFeaturesIndexes[-ranking[1]]
    }
    return(featureRankedList)
}
That function receives a matrix as input for x and a factor as input for y. When I use the function on some data, I get the following error message in the last iterations:
Error in if (nrow(x) != length(y)) { : argument is of length zero
Debugging the code, I got this:
3 SVM.default(x[, survivingFeaturesIndexes], y, C = 10, cache_size = 500,
kernel = "linear")
2 SVM(x[, survivingFeaturesIndexes], y, C = 10, cache_size = 500,
kernel = "linear")
1 svmrfeFeatureRanking(sdatx, ym)
So, what is causing this error in the function?

It looks like your matrix is dropped to a plain vector when only one feature remains, so SVM() no longer receives a two-dimensional input. Try this:
svmModel = SVM(as.matrix(x[, survivingFeaturesIndexes]), y, C = 10, cache_size = 500, kernel = "linear")
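An alternative sketch, assuming it is the single-column subsetting that drops the matrix down to a vector: keep the dimensions with drop = FALSE instead of re-wrapping with as.matrix().
# keep the matrix structure even when only one surviving feature is left
svmModel = SVM(x[, survivingFeaturesIndexes, drop = FALSE], y, C = 10, cache_size = 500, kernel = "linear")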

Related

Repeat iteration in a for loop in R

I am trying to generate a for loop that will repeat a sequence of the following:
sample(x = 1:14, size = 10, replace = TRUE, prob = c(1/4,1/4,1/4,1/4)
I want it to repeat 5000 times. So far, I have included the above as the body of the loop and added
for (i in seq_along[1:5000]){
at the beginning but I am getting an error message saying
Error in seq_along[1:10000] : object of type 'builtin' is not subsettable
We need replicate:
out <- replicate(5000, sample(x = 1:14, size = 10, replace = TRUE,
                              prob = c(1/4, 1/4, 1/4, 1/4)), simplify = FALSE)
There are a few issues here.
1. @MartinGal noted the syntax issues with seq_along and the missing ). Note that you can use seq(n) or 1:n to define the number of loop iterations.
2. You are not storing the sampled vectors anywhere, so the for loop will run the code but you won't capture the output.
3. You have x = 1:14 but only 4 prob values, which suggests you intended x = 1:4 (either that, or you are 10 prob values short).
Here's one way to address these issues using a for loop.
n <- 5
s <- 10
xmax <- 4
p <- 1/4
out <- matrix(nrow = n, ncol = s, byrow = TRUE)
set.seed(1L)
for (i in seq(n)) {
    out[i, ] <- sample(x = seq(xmax), size = s, replace = TRUE, prob = rep(p, xmax))
}
As andrew reece notes in his comment, it looks like you want x = 1:4. Depending on what you want to do with your result, you could generate all of the realizations at once (since you are sampling with replacement) and then store them in a matrix with 5000 rows of 10 realizations per row. So:
x <- sample(1:4, size = 5000 * 10, replace = TRUE, prob = c(1/4,1/4,1/4,1/4))
result <- matrix(x, nrow = 5000)
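As a quick shape check (a sketch assuming the x = 1:4 correction above): the replicate() route returns a list of 5000 vectors, and row-binding it gives the same 5000 x 10 layout as the matrix approach.
# hedged sketch: compare the two approaches' output shapes
out_list <- replicate(5000, sample(x = 1:4, size = 10, replace = TRUE, prob = rep(1/4, 4)),
                      simplify = FALSE)
out_mat <- do.call(rbind, out_list)
dim(out_mat)  # 5000 10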

R: Trapezoidal integration over the third dimension of an array

EDIT: I changed my example at the end but I'm still stumped.
How can I apply a trapezoidal integration function so that each matrix point is integrated through the 3rd dimension of an array?
I am working with a large array of data (2160 x 4320 x 46), and I want to use trapezoidal integration over the third dimension of the array (each matrix point integrated down through the 46 layers of the third dimension). So, for example, points [1,1,1], [1,1,2], [1,1,3], etc. are the Y points passed to the integration function.
EDIT: I have been told that this function can be found in the CRAN package caTools by Jarek Tuszynski.
I have a custom function from my PI for the trapezoidal integration:
trapz = function(x, y){
    idx = 2:length(x)
    return(as.double((x[idx] - x[idx - 1]) %*% (y[idx] + y[idx - 1])) / 2)
}
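(For reference, a quick check on a toy input, not from the question: this computes the trapezoidal-rule area under y over the grid x.)
trapz(x = 0:2, y = c(0, 1, 4))  # (0 + 1)/2 + (1 + 4)/2 = 3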
And I have used apply functions on such arrays before like this:
data_output = apply(X = data, MARGIN = 1:2, FUN = function(k) mean(na.omit(k)))
But I can't figure out how to get the trapz function to work along the margins that I want.
Working with simple example code, if I create an array like this:
x.mat = matrix(1:100, nrow = 10, ncol = 10)
y.mat = matrix(1:100, nrow = 10, ncol = 10)
library(abind)
x.array = abind(x.mat,(x.mat+1),(x.mat+2),along = 3, force.array = T)
y.array = abind(y.mat,(y.mat+1),(y.mat+2),along = 3, force.array = T)
apply(MARGIN = 1:2, FUN = trapz, X = x.array, y = y.array)
The output is a matrix of the correct dimensions (10,10) but every number is a 4.
Please help me understand what I'm doing wrong.
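A likely explanation, sketched here rather than verified on the real data: apply() slices only its X argument, so the y = y.array passed through ... reaches trapz() as the whole array rather than the matching slice. With the toy example, x[idx] - x[idx-1] is always (1, 1) and y[idx] + y[idx-1] is taken from the start of the flattened y.array, giving (3, 5), so every cell comes out as 4. One way to keep the x and y profiles paired, assuming the goal is trapz(x.array[i, j, ], y.array[i, j, ]) at every grid point, is to loop over the indices:
# hedged sketch: explicitly pair the x and y profiles at each (i, j) grid point
nr <- dim(x.array)[1]
nc <- dim(x.array)[2]
out <- matrix(NA_real_, nrow = nr, ncol = nc)
for (i in seq_len(nr)) {
    for (j in seq_len(nc)) {
        out[i, j] <- trapz(x.array[i, j, ], y.array[i, j, ])
    }
}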

How to summarize email text using LDA in R

I am working on a complaints data analysis where I am adapting a text summarization technique to cut out unnecessary text and keep only the useful text.
I have used LDA (Latent Dirichlet Allocation) in R for text summarization, but I am not able to use it to its full potential.
library(textmineR)  # provides CreateTcm, FitLdaModel, CreateDtm, CalcHellingerDist
library(igraph)
library(iterators)
# create a TCM using skip-grams with a 10-word window
tcm <- CreateTcm(doc_vec = datacopy$Text, skipgram_window = 10,
                 verbose = FALSE, cpus = 2)
# fit an LDA model to get embeddings into probability space
embeddings <- FitLdaModel(dtm = tcm, k = 50, iterations = 300,
                          burnin = 180, alpha = 0.1, beta = 0.05, optimize_alpha = TRUE,
                          calc_likelihood = FALSE, calc_coherence = FALSE,
                          calc_r2 = FALSE, cpus = 2)
# summarizer function
summarizer <- function(doc, gamma) {
    # handle multiple docs at once
    if (length(doc) > 1)
        return(sapply(doc, function(d) try(summarizer(d, gamma))))
    # parse the document into sentences
    sent <- stringi::stri_split_boundaries(doc, type = "sentence")[[1]]
    names(sent) <- seq_along(sent)  # so we know index and order
    # embed the sentences in the model
    e <- CreateDtm(sent, ngram_window = c(1, 1), verbose = FALSE, cpus = 2)
    # remove any documents with 2 or fewer words
    # e <- e[rowSums(e) > 2, ]
    vocab <- intersect(colnames(e), colnames(gamma))
    e <- e / rowSums(e)
    e <- e[, vocab] %*% t(gamma[, vocab])
    e <- as.matrix(e)
    # get the pairwise distances between each embedded sentence
    e_dist <- CalcHellingerDist(e)
    # turn the distances into a similarity matrix
    g <- (1 - e_dist) * 100
    # we don't need sentences connected to themselves
    diag(g) <- 0
    # turn the similarities into a nearest-neighbour graph
    g <- apply(g, 1, function(x){
        x[x < sort(x, decreasing = TRUE)[3]] <- 0
        x
    })
    # taking the pointwise max makes the matrix symmetric again
    g <- pmax(g, t(g))
    g <- graph.adjacency(g, mode = "undirected", weighted = TRUE)
    # calculate eigenvector centrality
    ev <- evcent(g)
    # format the result: the three most central sentences, in original order
    result <- sent[names(ev$vector)[order(ev$vector, decreasing = TRUE)[1:3]]]
    result <- result[order(as.numeric(names(result)))]
    paste(result, collapse = " ")
}
docs <- datacopy$Text[1:10]
names(docs) <- datacopy$Reference[1:10]
sums <- summarizer(docs,gamma = embeddings$gamma)
sums
Errors:
Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...) :
  'x' must be an array of at least two dimensions
Error in if (nrow(adjmatrix) != ncol(adjmatrix)) { :
  argument is of length zero
Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...) :
  'x' must be an array of at least two dimensions
Error in if (nrow(adjmatrix) != ncol(adjmatrix)) { :
  argument is of length zero
Error in if (nrow(adjmatrix) != ncol(adjmatrix)) { :
  argument is of length zero
Actual text:
it is the council’s responsibility to deal with the loose manhole cover.
Could you provide an update on the next steps taken by the council.
** trailing mail text follows - about 50 lines of text **
Summarized text:
it is the council’s responsibility to deal with the loose manhole cover. I have read the email thread, please get in contact with the numbers provided by ABC
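Without the data this is hard to reproduce, but errors like these typically mean that, for some documents, the sentence-by-topic matrix e collapses to something without two dimensions before rowSums() and graph.adjacency() are called, for example when a document splits into only one or two usable sentences. A defensive sketch, with an arbitrary cutoff of three sentences, that could go inside summarizer() right after the sentence split:
# hedged sketch: skip the graph step for documents that are too short to rank
if (length(sent) < 3) {
    return(paste(trimws(sent), collapse = " "))  # nothing to rank; return the document as its own summary
}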

Using optimr to minimize a function

I am trying to run the optimr function from the optimr package. My code is:
library(optimr)
#____ function to compute the sum of squared errors (SSE)
SSE = function(par, T) {
    options(warn = -1)
    nr = dim(T)[1]   # number of redds (rows)
    S = numeric(nr)  # redd survival vector
    TC = par[1]
    bt = par[2]
    #--- temperature deviations dT = (Ti - TC)
    dT = sweep(T, 1, TC, FUN = '-')
    dT = ifelse(is.na(dT) | dT < 0, 0, dT)
    h = bt * dT
    for (i in 1:nr) S[i] = 1 - prod(exp(-h[i, ]))
    SSE = sum((S - mean(S))^2)  # same as var(S) * (nr - 1)
    options(warn = 0)
}
#___ data
Temp = matrix(round(rnorm(2000, 12, 1.5), 2), ncol = 20)  # fictitious data used
hist(Temp, col = 'grey60')
#____ minimize SSE via the optimizer
o = optimr(par = c(8, .01), fn = SSE, T = Temp, method = 'Nelder-Mead')
print(cbind(Tcrit = o$par[1], bt = o$par[2]), digits = 3)
The problem is that no matter what initial parameter values I give the command, it returns the same initial values. I know that if the starting values were on a flat surface those values would not change, but no matter what values I give it, the same ones get returned. Any suggestions?
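One thing worth checking, sketched below rather than verified against the original data: an R function returns the value of its last evaluated expression, and here that is options(warn = 0) rather than the computed sum of squares, so optimr never receives the numeric SSE from fn. A minimal revision that returns the objective explicitly:
SSE = function(par, T) {
    options(warn = -1)
    nr = dim(T)[1]   # number of redds (rows)
    S = numeric(nr)  # redd survival vector
    TC = par[1]
    bt = par[2]
    dT = sweep(T, 1, TC, FUN = '-')       # temperature deviations dT = (Ti - TC)
    dT = ifelse(is.na(dT) | dT < 0, 0, dT)
    h = bt * dT
    for (i in 1:nr) S[i] = 1 - prod(exp(-h[i, ]))
    sse = sum((S - mean(S))^2)            # same as var(S) * (nr - 1)
    options(warn = 0)
    return(sse)                           # make the objective value the return value
}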

Using lapply and the lm function together in R

I have a df as follows:
  t         r
1 0 100.00000
2 1 135.86780
3 2 149.97868
4 3 133.77316
5 4  97.08129
6 5  62.15988
7 6  50.19177
and so on...
I want to apply a rolling regression using lm(r~t).
However, I want to estimate one model per iteration, where the iterations move over a fixed time window t+k. Essentially, if k = 5, the first model should be estimated with t = 0, 1, ..., 5, the second with t = 1, 2, ..., 6, and so on.
In other words, it iterates from a starting point over a window of pre-specified length k and applies the lm function to each successive window.
I have tried using lapply like this:
mdls = lapply(df, function(x) lm(r[x,]~t))
However, I got the following error:
Error in r[x, ] : incorrect number of dimensions
If I remove the [x,], each iteration gives me the same model, in other words one estimated on all the observations.
If I use rollapply:
coefs = rollapply(df, 3, FUN = function(x) coef(lm(r ~ t, data = as.data.frame(x))),
                  by.column = FALSE, align = "right")
res = rollapply(df, 3, FUN = function(z) residuals(lm(r ~ t, data = as.data.frame(z))),
                by.column = FALSE, align = "right")
Where:
t = seq(0,15,1)
r = (100+50*sin(0.8*t))
df = as.data.frame(t,r)
I get 15 models, but they are all estimated over the entire dataset, providing the same intercepts and coefficients. This is strange as I managed to make rollapply work just before testing it in a new script. For some reason it does not work again, so I am perplexed as to whether R is playing tricks on me, or whether there is something wrong with my code.
How can I adjust these methods to make sure they iterate according to my wishes?
I enclose a possible solution. The idea is to pass the vector 1:nrow(df) to rollapply to indicate which rows we want to select in each window.
df = data.frame(t = 0:6, r = c(100.00000, 135.86780, 149.97868, 133.77316,
                               97.08129, 62.15988, 50.19177))
N = nrow(df)
require(zoo)
# coefficients
coefs <- rollapply(data = 1:N, width = 3, FUN = function(x) {
    r = df$r[x]
    t = df$t[x]
    out <- coef(lm(r ~ t))
    return(out)
})
# residuals
res <- rollapply(data = 1:N, width = 3, FUN = function(x) {
    r = df$r[x]
    t = df$t[x]
    out <- residuals(lm(r ~ t))
    return(out)
})
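A possible follow-up sketch along the same lines (not part of the answer above): the same index trick works with lapply() over window start positions, which is closer to the lapply() attempt in the question. The window width of 3 is kept from the answer; starts, mdls and coefs2 are illustrative names.
# hedged sketch: lapply over window start indices, fitting one lm per window
k <- 3
starts <- 1:(nrow(df) - k + 1)
mdls <- lapply(starts, function(s) lm(r ~ t, data = df[s:(s + k - 1), ]))
coefs2 <- t(sapply(mdls, coef))  # one row of (Intercept), t per window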
