I am working in the R language.
I have two arrays, denoted E and y. E is indexed as E[state], and y as y[time, sample]. Thus, E is 1-dimensional and y is 2-dimensional.
I have to calculate the following 3-dimensional array, denoted P:
P[state, time, sample] = f(E[state], y[time, sample]), where f is some function. I want to do this without using loops.
Sample code is as follows.
# EM algorithm for HMM
#y is array containing data
T        # length of time series
nsamples # number of sequences observed
nstates  # number of states in the HMM
# forward probabilities in the EM algorithm for the HMM
P_forward <- array(0, dim = c(nstates, T, nsamples))
# desired element-wise relation, to be computed without loops
P_forward[state, time, sample] = 1/abs(E[state] - y[time, sample])
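This is not part of the original post, but one loop-free way to get the whole array (a minimal sketch with made-up dimensions; E and y would come from the real data) is outer(), which applies a vectorized function to every combination of elements of its two arguments and returns an array whose dimensions are the concatenation of the input dimensions:
# made-up sizes purely for illustration
nstates <- 3
T <- 4
nsamples <- 2
E <- runif(nstates)
y <- matrix(runif(T * nsamples), nrow = T, ncol = nsamples)
# outer() evaluates the (vectorized) function on every pair (E[state], y[time, sample]),
# giving an nstates x T x nsamples array without explicit loops
P_forward <- outer(E, y, FUN = function(e, yts) 1/abs(e - yts))
dim(P_forward)                               # 3 4 2
P_forward[2, 3, 1] == 1/abs(E[2] - y[3, 1])  # TRUE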
I would like to write a program in Julia that uses a support vector machine (SVM) via Python's scikit-learn.
So I need a training set (X_train) and the corresponding labels (y_train).
As an example, I have a full data set X, a 500 x 3 array.
From X I take a subset A, a 110 x 3 array, and a subset B, a 120 x 3 array.
The union of A and B will be my training set X_train.
My y_train set is obtained as follows:
every row of A is associated with the label 0, and every row of B with the label 1.
So I have the following training (input/output) set:
T = {(a1, 0), (a2, 0), ..., (a110, 0), (b1, 1), (b2, 1), ..., (b120, 1)}.
The sets A and B are disjoint and were obtained from an RGB image that I reshaped into the 500 x 3 array X.
After building the set T, I should use Python's scikit-learn library from Julia, with the svc.fit() command and probably the svc.predict() command, to create a prediction model for the remaining pixels that were not used for training.
Could someone help me write this code?
I am not able to associate every row of A with 0 and every row of B with 1.
To concatenate I am using vcat(A, B) and defining X_train = vcat(A, B).
I've tried many things and I can't get past this point.
In summary:
I have a set X of vectors with 3 coordinates, and I would like to associate a numerical value between 0 and 1 with each vector of X. So I took part of X as the training set X_train. This X_train set is the union of two disjoint subsets A and B, where for each vector of A I know that y is 0 and for each vector of B I know that y is 1.
I would now like to train on X_train and obtain a function y that gives a value between 0 and 1 for every vector of X, via an SVM. I believe there are many other supervised approaches, but this one is enough for me to move on.
I'm a complete beginner with R and I need to perform regressions on some data sets. My problem is that I'm not sure how to rewrite a model into its mathematical formula.
Most confusing are interactions and the poly function.
Can they be understood as a product and a polynomial?
Example
Let's take the following model, where both a and b are vectors of numbers:
y ~ poly(a, 2):b
Can it be rewritten mathematically like this?
y = a*b + a^2 * b
Example 2
And when I get the following expression from the fit summary
poly(a, 2)2:b
is it equal to the following formula?
a^2 * b
Your question is two-fold:
what does poly do;
what does : do.
For the first question, I refer you to my answer https://stackoverflow.com/a/39051154/4891738 for a complete explanation of poly. Note that for most users it is sufficient to know that it generates a design matrix with degree number of columns, each of which is a basis function.
: is no mystery either. In your case, where b is also numeric, poly(a, 2):b will return
Xa <- poly(a, 2) # a matrix of two columns
X <- Xa * b # each column of Xa scaled row-wise by b
So your guess in the question is correct. But note that poly gives you an orthogonal polynomial basis, so it is not the same as I(a) and I(a^2). You can set raw = TRUE when calling poly to get the ordinary polynomial basis.
Xa has column names. poly(a,2)2 just means the 2nd column of Xa.
Note that when b is a factor, there will be a design matrix, say Xb, for b. Obviously this is a 0-1 binary matrix, as factor variables are coded as dummy variables. Then poly(a, 2):b forms a row-wise Kronecker product between Xa and Xb. This sounds tricky, but it is essentially just pair-wise multiplication between all columns of the two matrices. So if Xa has ka columns and Xb has kb columns, the resulting matrix has ka * kb columns. Such mixing is called an 'interaction'.
The resulting matrix also has column names. For example, poly(a, 2)2:b3 means the product of the 2nd column of Xa and the dummy column in Xb for the third level of b. I am not saying 'the 3rd column of Xb', as this would be false if b is contrasted. Usually a factor is contrasted, so if b has 5 levels, Xb will have 4 columns. The dummy column for the third level is then the 2nd column of Xb, assuming the first factor level is the reference level (and hence does not appear in Xb).
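Not part of the original answer, but a small sketch with made-up data that illustrates the numeric-b case: the columns of the model matrix for poly(a, 2):b are exactly the columns of poly(a, 2) scaled row-wise by b, and raw = TRUE switches to the ordinary basis a, a^2:
set.seed(1)
a <- runif(10)
b <- runif(10)
Xa <- poly(a, 2)                         # orthogonal basis with 2 columns
X1 <- model.matrix(~ poly(a, 2):b - 1)   # columns "poly(a, 2)1:b" and "poly(a, 2)2:b"
X2 <- cbind(Xa[, 1] * b, Xa[, 2] * b)    # manual row-wise scaling
max(abs(X1 - X2))                        # effectively 0: the same columns
head(poly(a, 2, raw = TRUE))             # ordinary monomials a and a^2 instead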
I have a collection of n coordinate points of the form (x, y, z). These are stored in an n x 3 matrix M.
Is there a built-in function in Julia to calculate the distance between each point and every other point? I'm working with a small number of points, so calculation time isn't too important.
My overall goal is to run a clustering algorithm, so if there is a clustering algorithm that doesn't require me to first calculate these distances, please suggest that too. An example of the data I would like to perform clustering on is below. Obviously I'd only need to do this for the z coordinate.
To calculate distances, use the Distances package.
Given a matrix X, you can calculate pairwise distances between its columns. This means that you should supply your input points (your n objects) as the columns of the matrix. (In your question you mention an n x 3 matrix, so you would have to transpose it with the transpose() function.)
Here is an example of how to use it:
julia> using Distances # install with Pkg.add("Distances")
julia> x = rand(3,2)
3x2 Array{Float64,2}:
0.27436 0.589142
0.234363 0.728687
0.265896 0.455243
julia> pairwise(Euclidean(), x, x)
2x2 Array{Float64,2}:
0.0 0.615871
0.615871 0.0
As you can see, the above returns the distance matrix between the columns of x. You can use other distance metrics if you need to; just check the docs for the package.
Just to complement the @niczky12 answer, there is a package in Julia called Clustering which, as the name says, allows you to perform clustering.
A sample run of the kmeans algorithm:
julia> using Clustering # Pkg.add("Clustering") if not installed
julia> X = rand(3, 100) # data, each column is a sample
julia> k = 10 # number of clusters
julia> r = kmeans(X, k)
julia> fieldnames(r)
8-element Array{Symbol,1}:
:centers
:assignments
:costs
:counts
:cweights
:totalcost
:iterations
:converged
The result is stored in the return value of kmeans (r), which contains the fields above. The two most interesting fields are probably r.centers, which contains the cluster centers detected by the algorithm, and r.assignments, which contains the cluster to which each of the 100 samples belongs.
There are several other clustering methods in the same package. Feel free to dive into the documentation and apply the one that best suits your needs.
In your case, as your data is an N x 3 matrix you only need to transpose it:
M = rand(100, 3)
kmeans(M', k)
I am trying to get the kernel density estimate for a four-dimensional dataset using the kde function in the ks package, but have not been successful. I am running the following code:
kde(m, h=delta, gridsize = n.grid)
where m is an n x 4 matrix. I have n observations of 4 different variables. I have tried running this function with an n x 3 matrix and it works great, returning a 3-dimensional array kernel density estimate. When I run it with the four-dimensional data matrix, however, it says I must supply the evaluation points (which is odd, since the documentation says I only need to do that for d > 4).
So I ended up creating a new evaluation-point matrix of size n.grid x 4, with n.grid equally spaced points taken from the original data matrix m. However, when I run this, it returns a 1-dimensional array of estimates instead of a 4-dimensional array.
Does anyone know how to run kde properly for dimensions greater than 3?
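Not from the original post, but a hedged sketch of one way this is commonly handled for d >= 4: supply eval.points yourself as a full Cartesian grid built with expand.grid, then reshape the returned vector of estimates into a 4-dimensional array. The names m, delta and n.grid follow the question; passing delta as a 4 x 4 bandwidth matrix via H (rather than h) is an assumption:
library(ks)
# m: the n x 4 data matrix from the question; delta: an assumed 4 x 4 bandwidth matrix
n.grid <- 21   # assumed number of grid points per dimension
# one vector of equally spaced evaluation points per dimension of m
grid.1d <- lapply(seq_len(ncol(m)),
                  function(j) seq(min(m[, j]), max(m[, j]), length.out = n.grid))
# full 4-D grid: n.grid^4 rows, 4 columns
eval.grid <- as.matrix(expand.grid(grid.1d))
# with explicit eval.points, kde() returns one estimate per row of the grid
fit <- kde(x = m, H = delta, eval.points = eval.grid)
# expand.grid varies its first column fastest, which matches array()'s filling order,
# so the flat vector of estimates can be reshaped into an n.grid^4 array
dens4d <- array(fit$estimate, dim = rep(n.grid, ncol(m)))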
Hints I got on a different question puzzled me quite a bit.
I got an exercise, actually part of a larger exercise:
Cluster some data, using hclust (done)
Given a totally new vector, find out to which of the clusters you got in 1 it is nearest.
According to the exercise, this should be doable in quite a short time.
However, after weeks I am still puzzled as to whether this can be done at all, since apparently all I really get from hclust is a tree, and not, as I assumed, a number of clusters.
Since I suppose I was unclear:
say, for instance, I feed hclust a matrix consisting of fifteen 1 x 5 vectors: 5 times (1 1 1 1 1), 5 times (2 2 2 2 2) and 5 times (3 3 3 3 3). This should give me three quite distinct clusters of size 5; anyone could easily find them by hand. Is there a command I can use to find out from the program that there are 3 such clusters in my hclust object, and what they contain?
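Not part of the original question, but for reference a minimal sketch of the toy example: cutree() is the function that turns an hclust tree into a fixed number of flat clusters (the answer below builds on the same idea).
# toy data from the description: three groups of five identical rows
M <- rbind(matrix(1, 5, 5), matrix(2, 5, 5), matrix(3, 5, 5))
hc <- hclust(dist(M))
# cutree() returns, for each of the 15 rows, the cluster (1, 2 or 3) it belongs to
cutree(hc, k = 3)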
You'll have to think about what the right metric is to define closeness to a cluster. Building on the example in the hclust documentation, here's a way to compute the mean of each cluster and then measure the distance between the new data point and the set of means.
# Leave out one state
A  <- USArrests
B  <- A[rownames(A) != "Kentucky", ]
KY <- A[rownames(A) == "Kentucky", ]
# Put the B data into 10 clusters
hc   <- hclust(dist(B), "ave")
memb <- cutree(hc, k = 10)
B$cluster <- memb[rownames(B) == names(memb)]
# Compute the averages over the clusters
M <- aggregate(. ~ cluster, data = B, FUN = mean)
M$cluster <- NULL
# Now add the hold-out state to the set of averages
M <- rbind(M, KY)
# Compute the distance between the clusters and the hold-out state.
# This is a pretty silly way to do this, but it works.
D <- as.matrix(dist(as.matrix(M), diag = TRUE, upper = TRUE))["Kentucky", ]
names(D) <- rownames(M)
KYclust <- which.min(D[-length(D)])
memb[memb == KYclust]
# Now cluster the full set of states and compare the results.
hc   <- hclust(dist(A), "ave")
memb <- cutree(hc, k = 10)
a <- memb[which(names(memb) == "Kentucky")]
memb[memb == a]
In contrast to k-means, clusters found by hclust can be of arbitrary shape.
The distance to the nearest cluster center is therefore not always meaningful.
Doing a 1-nearest-neighbor style assignment is probably better.
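Not from the original answers, but a minimal sketch of the 1-nearest-neighbour assignment suggested above, reusing the hold-out setup from the earlier answer; class::knn with k = 1 returns the cluster label of the single closest training point:
library(class)   # provides knn()
A  <- USArrests
B  <- A[rownames(A) != "Kentucky", ]
KY <- A[rownames(A) == "Kentucky", ]
hc   <- hclust(dist(B), "ave")
memb <- cutree(hc, k = 10)
# assign the hold-out state to the cluster of its single nearest neighbour in B
knn(train = B, test = KY, cl = factor(memb), k = 1)
Scaling the columns first (e.g. with scale()) may matter here, since Assault dominates the Euclidean distance in USArrests.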