SVM applied to images using the Julia language

I would like to write a program in Julia that uses a support vector machine (SVM) via Python's scikit-learn.
So I need a training set (X_train) and the corresponding labels (y_train).
As an example, I have a matrix X of size 500 x 3 (this is my full data).
From X I take a subset A, an Array of size 110 x 3, and a subset B, an Array of size 120 x 3.
The union of sets A and B will be my X_train training set.
My y_train set is obtained as follows:
Every row (vector) of A is associated with the number 0 and every row of B with the number 1.
So I have the following training set with its outputs:
T = {(a1, 0), (a2, 0), ..., (a110, 0), (b1, 1), (b2, 1), ..., (b120, 1)}.
The sets A and B are disjoint and come from an RGB image that I transformed into the 500 x 3 array X (one row per pixel).
After building T, I want to use Python's scikit-learn library from Julia, calling svc.fit() and probably svc.predict(), to create a model that predicts labels for the remaining pixels that were not used for training.
Could someone help me make this code?
What I cannot do is associate every element of A with 0 and every element of B with 1.
To concatenate the features I am using vcat(A, B) and defining X_train = vcat(A, B).
I've tried many things and I'm stuck at this point.
In summary:
I have a set X of vectors with 3 coordinates, and I would like to associate a numerical value between 0 and 1 with each vector of X. I took part of X as the training set X_train. This X_train is the union of two disjoint subsets A and B, where for each vector of A I know that y is 0 and for each vector of B I know that y is 1.
I would now like to train on X_train and obtain a function y that gives a value between 0 and 1 for every vector of X, via an SVM. I believe there are many other supervised approaches, but this one is enough for me to move forward.
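A minimal sketch of how this could look with the ScikitLearn.jl wrapper around Python's scikit-learn (assuming A and B are the 110 x 3 and 120 x 3 Float64 matrices above; the names are illustrative, not a verified solution):

using ScikitLearn                       # Julia wrapper around Python's scikit-learn
@sk_import svm: SVC                     # import sklearn.svm.SVC into Julia

X_train = vcat(A, B)                    # 230 x 3 feature matrix
y_train = vcat(zeros(Int, size(A, 1)),  # label 0 for every row of A
               ones(Int, size(B, 1)))   # label 1 for every row of B

model = SVC()                           # default RBF-kernel classifier
fit!(model, X_train, y_train)           # svc.fit() on the labelled pixels

y_pred = predict(model, X)              # svc.predict(): a 0/1 label for each of the 500 rows of X
# SVC(probability=true) together with predict_proba should give values in [0, 1]
# instead of hard 0/1 labels, if a graded output is needed.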

Related

Can I set a benchmark with several data sets in R?

This is just a preliminary thought, but say I have X datasets, and they have Y characteristics, e.g. artificial vs. real (non-artificial) data.
Is it possible to compute an index ranging from 0 to 1 based on the Y characteristics of the X datasets? I have not yet conceptualised what exactly 0 or 1 would mean.

How does SMOTE create new data from categorical data?

I have used SMOTE in R to create new data and it worked fine. When I researched further how exactly SMOTE works, I could not find an answer to how SMOTE handles categorical data.
The paper shows an example (page 10) with only numeric values, but I still do not understand how SMOTE creates new data from categorical example data.
This is the link to the paper:
https://arxiv.org/pdf/1106.1813.pdf
That is indeed an important thing to be aware of. In the paper you are referring to, Sections 6.1 and 6.2 describe possible procedures for the nominal-continuous and purely nominal cases. However, DMwR does not use anything like that.
If you look at the source code of SMOTE, you can see that the main work is done by DMwR:::smote.exs. I'll briefly explain the procedure.
The summary is that the order of factor levels matters, and that there currently seems to be a bug in the handling of factor variables which makes things work the opposite way. That is, if we want to find an observation close to one with factor level "A", then anything other than "A" is treated as "close" and observations with level "A" are treated as "distant". Hence, the more factor variables there are, the fewer levels they have, and the fewer continuous variables there are, the more drastic the effect of this bug should be.
So, unless I'm wrong, the function should not be used with factors.
As an example, let's consider the case perc.over = 600 with one continuous and one factor variable. We then arrive at smote.exs with the sub-data frame corresponding to the undersampled class (say, 50 rows) and proceed as follows.
Matrix T contains all columns except the class variable. Columns corresponding to continuous variables remain unchanged, while factors or characters are coerced into integers. This means that the order of factor levels is essential.
Next we generate 50 * 6 = 300 new observations. We do so by creating 6 new observations (n = 1, ..., 6) for each of the 50 existing ones (i = 1, ..., 50).
We scale the data with xd <- scale(T, T[i, ], ranges) so that xd contains deviations from the i-th observation. E.g., for i = 1 we may have
# [,1] [,2]
# [1,] 0.00000000 0.00
# [2,] -0.13333333 0.25
# [3,] -0.26666667 0.25
meaning that the continuous variable for i = 2, 3 is smaller than for i = 1, but that the factor levels of i = 2, 3 are "higher".
Then, by running for (a in nomatr) xd[, a] <- xd[, a] == 0, we ignore most of the information in the second column related to factor-level deviations: we set the deviation to 1 for those cases that have the same factor level as the i-th observation, and to 0 otherwise. (I believe it should be the opposite, meaning that it's a bug; I'm going to report it.)
Then we set dd <- drop(xd^2 %*% rep(1, ncol(xd))), which can be seen as a vector of squared distances of each observation from the i-th one, and kNNs <- order(dd)[2:(k + 1)] gives the indices of the k nearest neighbours. It is purposefully 2:(k + 1), since the first element should be i itself (its distance should be zero). However, due to the factor handling just described, the first element is not always i in this case, which again points to a bug.
Now we create the n-th new observation, similar to the i-th one. First we pick one of the nearest neighbours, neig <- sample(1:k, 1). Then difs <- T[kNNs[neig], ] - T[i, ] is the component-wise difference between this neighbour and the i-th observation, e.g.,
difs
# [1] -0.1 -3.0
meaning that the neighbour has lower values in terms of both variables.
The new case is constructed by running T[i, ] + runif(1) * difs, which is indeed a convex combination between the i-th observation and the neighbour. This line applies to the continuous variable(s) only. For the factors we have c(T[kNNs[neig], a], T[i, a])[1 + round(runif(1), 0)], which means that the new observation gets the same factor level as the i-th observation with 50% probability, and the same as the chosen neighbour with 50% probability. So, this is a kind of discrete interpolation.
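To make the steps above concrete, here is a small self-contained sketch of the interpolation in plain R (toy numbers, not DMwR's actual code; the matrix is called T only to mirror the text, even though that masks R's TRUE shorthand):

set.seed(1)
T <- cbind(x = c(2.0, 1.9, 1.6),       # one continuous variable
           f = c(1, 3, 3))             # one factor coerced to integer codes
i <- 1                                 # the observation we interpolate around
ranges <- apply(T, 2, max) - apply(T, 2, min)
xd <- scale(T, T[i, ], ranges)         # deviations from the i-th observation
nomatr <- 2                            # column index of the factor variable
xd[, nomatr] <- xd[, nomatr] == 0      # the questionable factor handling discussed above
dd <- drop(xd^2 %*% rep(1, ncol(xd)))  # squared distances to the i-th observation
kNNs <- order(dd)[2:3]                 # indices of the k = 2 nearest neighbours
neig <- sample(1:2, 1)                 # pick one neighbour at random
difs <- T[kNNs[neig], ] - T[i, ]       # component-wise difference
new_x <- T[i, 1] + runif(1) * difs[1]  # convex combination for the continuous column
new_f <- c(T[kNNs[neig], 2], T[i, 2])[1 + round(runif(1), 0)]  # level of i or of the neighbour
c(new_x, new_f)                        # the synthetic observation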

R: avoid loops using built-in functions

I am working in the R language.
I have two arrays denoted E and y. E is indexed as E[state] and y as y[time, sample]; thus E is 1-dimensional and y is 2-dimensional.
I have to compute the following 3-dimensional array, denoted P:
P[state, time, sample] = f(E[state], y[time, sample]), where f is some function. I want to do this without using loops; see the sketch after the sample code below.
Sample code is as follows.
# EM algorithm for an HMM
# y is the array containing the data
T          # length of the time series
nsamples   # number of observed sequences
nstates    # number of states in the HMM
P_forward <- array(0, dim = c(nstates, T, nsamples))  # forward probabilities in the EM algorithm
# element-wise definition of what I want to compute without loops:
# P_forward[state, time, sample] = 1 / abs(E[state] - y[time, sample])
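One loop-free possibility (a hedged sketch with toy sizes, not a drop-in replacement for the EM code above) is outer(), which applies a vectorised function to every combination of elements of its first and second arguments and returns an array whose dimensions are the concatenation of theirs:

Tlen <- 4; nsamples <- 2; nstates <- 3               # toy sizes (Tlen stands for the T above)
E <- rnorm(nstates)                                  # E[state]
y <- matrix(rnorm(Tlen * nsamples), Tlen, nsamples)  # y[time, sample]

# outer() pairs every E[state] with every y[time, sample] in one vectorised call
P_forward <- outer(E, y, function(e, yy) 1 / abs(e - yy))
dim(P_forward)   # 3 4 2, indexed as P_forward[state, time, sample]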

lm()$assign: what is it?

What is the assign attribute of a linear model fit? It's supposed to somehow provide the position of the response term, but in practice it seems to enumerate all coefficients in the model. It's my understanding that assign is a carryover from S and it's not supported by glm(). I need to extract the equivalent information for glm, but I don't understand what the implementation does for lm and can't seem to find the source code either. The help file for lm.fit says, unhelpfully:
non-null fits will have components assign, effects and (unless not requested) qr relating to the linear fit, for use by extractor functions such as summary and effects
You can find this in help("model.matrix"), since model.matrix() is what creates these values:
There is an attribute "assign", an integer vector with an entry for each column in the matrix giving the term in the formula which gave rise to the column. Value 0 corresponds to the intercept (if any), and positive values to terms in the order given by the term.labels attribute of the terms structure corresponding to object.
So, it maps the design matrix to the formula.
The numbers in $assign indicate the corresponding predictor variable. If your predictor is categorical with 3 levels, you will see the corresponding number (3 - 1) = 2 times in the $assign output. Example:
data(mpg, package = "ggplot2")
m = lm(cty ~ hwy + class,data = mpg)
m$assign
[1] 0 1 2 2 2 2 2 2
# Note how there are six 2's representing the indicator variables
# for the various 'class' levels (class has 7 levels).
Quantitative predictors have only one entry each (hwy in the example above), since they are represented by a single term in the model formula.
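For the glm case mentioned in the question, the same mapping should be recoverable from the design matrix rather than from the fit object itself; a hedged sketch continuing the mpg example:

g <- glm(cty ~ hwy + class, data = mpg)   # same formula, fitted with glm()
attr(model.matrix(g), "assign")           # the "assign" vector is an attribute of the design matrix
# expected to be the same mapping as for lm above: 0 1 2 2 2 2 2 2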

Does anyone know how to run 4-dimensional or larger problems using kde in the ks package in R?

I am trying to get the kde for a four-dimensional dataset using the kde function in the ks package, but have not been successful. I am running the following code:
kde(m, h=delta, gridsize = n.grid)
where m is an n x 4 matrix: n observations of 4 different variables. I have tried running this function with an n x 3 matrix and it works fine, returning a 3-dimensional array of kernel density estimates. When I run it with the 4-dimensional data matrix, however, it says I must supply the evaluation points (which is odd, since the documentation says that is only required for d > 4).
So I ended up creating a new evaluation-point matrix of size n.grid x 4, with n.grid equally spaced points from the original data matrix m. However, when I run this, it returns a 1-dimensional array of estimates instead of a 4-dimensional array.
Does anyone know how to run kde properly for dimensions greater than 3?
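I cannot verify this against your data, but one workaround sketch is to build the evaluation grid explicitly with expand.grid() and reshape the returned vector of estimates into a 4-dimensional array (all names and sizes below are illustrative):

library(ks)
set.seed(1)
m <- matrix(rnorm(400), ncol = 4)            # stand-in for the n x 4 data matrix
n.grid <- 10                                 # coarse grid: n.grid^4 evaluation points in total
grids <- lapply(1:4, function(j) seq(min(m[, j]), max(m[, j]), length.out = n.grid))
eval.pts <- as.matrix(expand.grid(grids))    # (n.grid^4) x 4 matrix of evaluation points

fhat <- kde(m, eval.points = eval.pts)       # with eval.points supplied, $estimate is a vector
P <- array(fhat$estimate, dim = rep(n.grid, 4))  # reshape it back to a 4-d grid array
dim(P)                                       # 10 10 10 10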
