apache spark understanding dense vector - vector

My question is based on code from the page.
My general understanding is that a sparse vector is used when most of the elements are 0, and a dense vector is used when very few elements are 0. A sparse vector is easy to compress.
Why do we have to define the vectors below as dense vectors? How does defining them as dense help, given that there are only 3 elements in each vector? Why can't we just refer to them as vectors?
# Prepare training data from a list of (label, features) tuples.
training = sqlContext.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])

Spark uses Breeze under the hood for high-performance linear algebra in Scala.
In Spark MLlib and ML, some algorithms depend on the org.apache.spark.mllib.linalg.Vector type, which is either dense or sparse.
There is no implicit conversion from a Scala Vector or Array to a dense Vector from MLlib.
Semantically speaking, dense vectors are equivalent to normal vectors. As you can see, you create them with the dense method of the MLlib Vectors factory, which returns a Vector of type org.apache.spark.mllib.linalg.Vector.
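
To make the dense/sparse distinction concrete, here is a minimal pure-Python sketch of the two storage layouts. It is only an illustration, not Spark's implementation; the helper names dense_to_sparse and sparse_to_dense are made up for this example.

```python
def dense_to_sparse(values):
    """Convert a dense list into (size, indices, nonzero_values)."""
    indices = [i for i, v in enumerate(values) if v != 0.0]
    nonzero_values = [values[i] for i in indices]
    return len(values), indices, nonzero_values


def sparse_to_dense(size, indices, nonzero_values):
    """Rebuild the dense list from its sparse description."""
    dense = [0.0] * size
    for i, v in zip(indices, nonzero_values):
        dense[i] = v
    return dense


features = [0.0, 1.1, 0.1]  # first row of the training data above
size, indices, nonzero_values = dense_to_sparse(features)

# Numbers stored: dense keeps every entry; sparse keeps indices plus values.
dense_cost = len(features)
sparse_cost = len(indices) + len(nonzero_values)
```

For a 3-element vector with a single zero, the sparse layout stores more numbers than the dense one, which is why dense is the natural choice here. In PySpark the corresponding constructors are Vectors.dense([0.0, 1.1, 0.1]) and Vectors.sparse(3, [1, 2], [1.1, 0.1]).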

Related

Is this dct (FFTW.jl) behavior in julia normal?

I'm trying to do some Compressed Sensing exercises in Julia, but I noticed that the discrete cosine transform (using FFTW.jl) of an identity matrix doesn't look like the result from other programming languages (e.g. Mathematica and Matlab).
For example in Julia
using Plots, FFTW, LinearAlgebra
n = 100
Psi = dct(Matrix(1.0I,n,n))
heatmap(Psi)
results in this matrix (which is essentially an identity matrix with some noise)
But in Matlab
imagesc(dct(eye(100,100),'Type',2))
this is the result (as expected)
Finally in Mathematica
MatrixPlot[N[FourierDCTMatrix[100, 2]], PlotLegends -> Automatic]
returns this
Why does Julia behave so differently?
And is this normal?
Matlab (and, I guess, Mathematica) computes the dct of each column of your matrix. FFTW performs a 2-dimensional dct when the input is two-dimensional. The same happens for fft.
If you want column-wise transformation, you can specify the dimension:
Psi1 = dct(Matrix(1.0I,n,n), 1); # along first dimension
heatmap(Psi1)
Notice that the direction of the y-axis is opposite for Plots.jl relative to Matlab.
(BTW, you can also just write I(n) or 1.0I(n) instead of Matrix(1.0I,n,n))
This is something that sets Julia apart from some other languages. It tends to treat matrices as matrices, and not as just a collection of vectors or a bunch of scalars. For example, exp(M) and log(M) for matrices do not operate elementwise, but calculate the matrix exponential and matrix logarithm according to their linear algebra definitions.
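
The difference between the column-wise and the 2-dimensional transform can be checked with a small pure-Python sketch. It assumes the orthonormal (unitary) DCT-II, which is the normalization FFTW.jl documents for dct; the helpers dct_matrix, matmul and transpose are written here just for illustration.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix C (n x n); C times x is the DCT of vector x."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * j + 1) * k / (2 * n))
                     for j in range(n)])
    return rows


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


n = 8
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
C = dct_matrix(n)

# Column-wise DCT (what dct(A, 1) does): C * A, so the identity maps to C itself.
columnwise = matmul(C, identity)

# 2-dimensional DCT (what dct(A) does): C * A * C^T.
two_dim = matmul(matmul(C, identity), transpose(C))
```

Because C is orthogonal under this normalization, C * I * C^T is again the identity up to rounding error, which would explain why the Julia plot looks like an identity matrix with some noise.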

SVD in LSI in the book Introduction to Information Retrieval

In Example 18.4 of the book Introduction to Information Retrieval, the term-document matrix is decomposed using SVD. My question is: why is Σ a 5*5 matrix in the example? Shouldn't it be a 5*6 matrix? Is the book wrong?
Here is the link to Chapter 18 of the book Introduction to Information Retrieval. Thanks!
The book is correct. A term-document matrix (of dimension TxD, with T terms and D documents) is split into a product of three matrices. The middle matrix (denoted \Sigma in the book) is the key matrix; its dimension is TxT (T=5 in the example).
Intuitively, you can think of this matrix as describing the relationship between terms. In the best case, all the column vectors of this matrix would be linearly independent, meaning that they form a basis of the term space and there is no dependence between the terms. In practice this rarely holds. You'll find that the rank of this matrix (say T') is typically a few orders of magnitude less than T, meaning that there are T-T' linearly dependent column vectors in this matrix.
One can then take a lower-order approximation of this matrix by considering only a T'xT' term matrix. In effect, you keep the principal eigenvalues of the matrix and project your vectors onto the corresponding eigenvectors (treated as a new basis) using rotation and scaling. That's precisely what spectral decomposition or PCA (or LSA) does.
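
The dimension bookkeeping can be written out as follows. This is a sketch that assumes terms index the rows (T = 5) and documents the columns (D = 6), and that the book prints the reduced form of the SVD; the symbol r is introduced here only for illustration.

```latex
C_{T \times D} \;=\; U_{T \times r}\, \Sigma_{r \times r}\, V^{\top}_{r \times D},
\qquad r = \min(T, D) = \min(5, 6) = 5,
\qquad \Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r),
\quad \sigma_1 \ge \dots \ge \sigma_r \ge 0 .
```

In the full form, \Sigma would be 5x6 with a final column of zeros. That column carries no information, so dropping it (together with the matching row of V^T) leaves the 5x5 matrix shown in the example.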

Perform sum of vectors in CUDA/thrust

So I'm trying to implement stochastic gradient descent in CUDA, and my idea is to parallelize it in a way similar to the one described in the paper Optimal Distributed Online Prediction Using Mini-Batches.
That implementation is aimed at MapReduce distributed environments, so I'm not sure if it's optimal when using GPUs.
In short the idea is: at each iteration, calculate the error gradients for each data point in a batch (map), take their average by summing (reducing) the gradients, and finally perform the gradient step, updating the weights according to the average gradient. The next iteration starts with the updated weights.
The thrust library lets me perform a reduction on a vector, for example to sum all the elements in a vector.
My question is: How can I sum/reduce an array of vectors in CUDA/thrust?
The input would be an array of vectors and the output would be a vector that is the sum of all the vectors in the array (or, ideally, their average).
Converting my comment into this answer:
Let's say each vector has length m and the array has size n.
An "array of vectors" is then the same as a matrix of size n x m.
If you change your storage format from this "array of vectors" to a single vector of size n * m, you can use thrust::reduce_by_key to sum each row of this matrix separately.
The sum_rows example shows how to do this.
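
The keyed-reduction idea can be sketched on the CPU in plain Python. This is only an analogy for thrust::reduce_by_key, not thrust code: the reduce_by_key function below is a hypothetical stand-in, and the sketch assumes the flat buffer is laid out so that each run of equal keys holds one component across all vectors.

```python
from itertools import groupby


def reduce_by_key(keys, values):
    """Sum each run of consecutive equal keys (CPU stand-in for a keyed reduction)."""
    out_keys, out_sums = [], []
    for key, group in groupby(zip(keys, values), key=lambda kv: kv[0]):
        out_keys.append(key)
        out_sums.append(sum(v for _, v in group))
    return out_keys, out_sums


# n = 4 gradient vectors of length m = 3 (made-up numbers).
vectors = [[1.0, 2.0, 3.0],
           [10.0, 20.0, 30.0],
           [100.0, 200.0, 300.0],
           [1000.0, 2000.0, 3000.0]]
n, m = len(vectors), len(vectors[0])

# Single flat buffer of size n * m: component j of every vector is contiguous.
flat = [vectors[i][j] for j in range(m) for i in range(n)]

# Key of each element = which component it belongs to.
keys = [idx // n for idx in range(n * m)]

_, total = reduce_by_key(keys, flat)
average = [s / n for s in total]
```

On the GPU the keys would not need to be materialized; they can be computed on the fly from the element index, which is the approach the sum_rows example takes.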

Apply a transformation matrix over time

I have an initial frame and a bounding box around some information. I have a transformation matrix T that I want to use to transform this bounding box.
I could easily apply the transformation and draw it in the output frame, but I would like to apply the transformation gradually over a sequence of x frames. Can anyone suggest a way to do this?
Building on @egor-n's comment, you could compute R = T^{1/x} and obtain the bounding box at frame i+1 from the one at frame i by
B_{i+1} = R * B_{i}
with B_{0} your initial bounding box. Depending on the precise form of T, we could discuss how to compute R.
There are methods for affine transforms: decompose the affine transform matrix into a product of translation, rotation, scaling and shear matrices, then linearly interpolate the parameters of each matrix (for example, the rotation angle for R, and so on).
But for a homography matrix there is no single solution, as described here, so one can only look for some "good" approximation (see the rather complex math in that article). Restricting the set of possible transforms could probably simplify the problem.
Here's something a little different you could try. Let M be the matrix representing the final transformation. You could try interpolating between I (the identity matrix, with 1's on the diagonal and 0's elsewhere) and M using the formula
M(t) = exp(t * ln(M))
where t is time from 0 to 1, M(0) = I, M(1) = M, exp is the exponential function for matrices given by the usual infinite series, and ln is the similar natural logarithm function for matrices given by the usual infinite series.
The correctness of the formula depends on the type of transformation represented by M and the type of transformations allowed in intermediate steps. The formula should work for rigid motions. For other types of transformations, various bad things might happen, including divergence of the logarithm series. Other formulas can be used in other cases; let me know if you're using transformations other than rigid motions and I can give some other formulas.
The exponential and logarithm functions may be available in a matrix library. If not, they can be easily implemented as partial sums of infinite series.
The above method should give the same result as some quaternion methods in the case of rotations. The quaternion methods are probably faster when they're available.
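
As a rough sketch of the partial-sum approach, the following pure-Python code interpolates a 2D rigid motion written as a 3x3 homogeneous matrix. The rotation angle, the translation and the number of series terms are arbitrary choices for illustration, and the logarithm series is only expected to converge when M is close enough to the identity (here a rotation of 0.5 rad).

```python
import math


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mat_scale(a, s):
    return [[s * x for x in row] for row in a]


def mat_exp(a, terms=40):
    """Partial sum of exp(A) = I + A + A^2/2! + ..."""
    result = identity(len(a))
    term = identity(len(a))
    for k in range(1, terms):
        term = mat_scale(mat_mul(term, a), 1.0 / k)
        result = mat_add(result, term)
    return result


def mat_log(m, terms=80):
    """Partial sum of ln(M) = X - X^2/2 + X^3/3 - ..., with X = M - I."""
    n = len(m)
    x = mat_add(m, mat_scale(identity(n), -1.0))
    result = [[0.0] * n for _ in range(n)]
    power = identity(n)
    for k in range(1, terms + 1):
        power = mat_mul(power, x)
        sign = 1.0 if k % 2 == 1 else -1.0
        result = mat_add(result, mat_scale(power, sign / k))
    return result


def interpolate(m, t):
    """M(t) = exp(t * ln(M)), with M(0) = I and M(1) = M."""
    return mat_exp(mat_scale(mat_log(m), t))


theta = 0.5  # rotation angle in radians (illustrative value)
M = [[math.cos(theta), -math.sin(theta), 3.0],
     [math.sin(theta), math.cos(theta), 1.0],
     [0.0, 0.0, 1.0]]

halfway = interpolate(M, 0.5)  # transform to apply at the middle frame
```

Applying halfway twice reproduces M, and its rotation part is a rotation by theta / 2, which is the behaviour you would want from a smooth interpolation of a rigid motion.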
UPDATE
I see you mention elsewhere that your transformation is a homography (perspectivity), so the method I suggested above for rigid motions won't work. Instead you could use a different, but related method outlined in ftp://ftp.cs.huji.ac.il/users/aristo/papers/SYGRAPH2005/sig05.pdf. It goes as follows: represent your transformation by a matrix in one higher dimension. Scale the matrix so that its determinant is equal to 1. Call the resulting matrix G. You want to interpolate from the identity matrix I to G, going through perspectivities.
In what follows, let M^T be the transpose of M. Let the function expp be defined by
expp(M) = exp(-M^T) * exp(M+M^T)
You need to find the inverse of that function at G; in other words you need to solve the equation
expp(M) = G
where G is your transformation matrix with determinant 1. Call the result M = logp(G). That equation can be solved by standard numerical techniques, or you can use Matlab or other math software. It's somewhat time-consuming and complicated to do, but you only have to do it once.
Then you calculate the series of transformations by
G(t) = expp(t * logp(G))
where t varies from 0 to 1 in steps of 1/k, where k is the number of frames you want.
You could parameterize the transform over some number of frames by introducing a parameter that runs from 0 to 1 (here the fraction t/T of frames elapsed).
Let t be the frame number
Let T be the total number of frames
Let P be the original location and orientation of the object
Let theta be the total rotation angle
and translation be the vector [x,y]'
The transform in 2D becomes:
T(P|t) = R(t)*P +(t*[x,y]')/T
where R(t) = {{Cos((theta*t)/T),-Sin((theta*t)/T)},{Sin((theta*t)/T),Cos((theta*t)/T)}}
So at frame t you apply the transform T(P|t) to the position of the object at frame 0; at t = 0 this is equivalent to no transform, and at t = T it is the full rotation and translation.
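
The formula above can be sketched in a few lines of Python. The function name transform_at_frame and the sample numbers are made up for illustration; the code assumes a rotation about the origin followed by the scaled translation, exactly as written in the formula.

```python
import math


def transform_at_frame(point, frame, total_frames, theta, translation):
    """Apply the fraction frame/total_frames of the rotation and translation."""
    fraction = frame / total_frames
    angle = theta * fraction
    c, s = math.cos(angle), math.sin(angle)
    px, py = point
    tx, ty = translation
    # R(t) * P + (t / T) * [x, y]'
    return (c * px - s * py + fraction * tx,
            s * px + c * py + fraction * ty)


corner = (1.0, 0.0)        # one corner of the bounding box at frame 0
total_frames = 10
theta = math.pi / 2        # total rotation
translation = (2.0, 3.0)   # total translation

path = [transform_at_frame(corner, t, total_frames, theta, translation)
        for t in range(total_frames + 1)]
```

Running the same function over all four corners of the bounding box gives the box for every frame.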

Can a very large (or very small) value in feature vector using SVC bias results? [scikit-learn]

I am trying to better understand how the values of my feature vector may influence the result. For example, let's say I have the following vector with the final value being the result (this is a classification problem using an SVC, for example):
0.713, -0.076, -0.921, 0.498, 2.526, 0.573, -1.117, 1.682, -1.918, 0.251, 0.376, 0.025291666666667, -200, 9, 1
You'll notice that most of the values are centered around 0; however, there is one value that is orders of magnitude larger in absolute value, -200.
I'm concerned that this value is skewing the prediction and is being weighted unfairly heavily compared to the rest simply because it is so different.
Is this something to be concerned about when creating a feature vector? Or will the statistical test I use to evaluate my vector control for this large (or small) value based on the training set I provide it with? Are there methods available in scikit-learn specifically that you would recommend to normalize the vector?
Thank you for your help!
Yes, it is something you should be concerned about. SVMs are heavily influenced by differences in feature scale, so you need a preprocessing step to reduce that effect. The most popular ones are:
Linearly rescale each feature dimension to the [0,1] or [-1,1] interval
Normalize each feature dimension so it has mean=0 and variance=1
Decorrelate values by the transformation sigma^(-1/2)*X, where sigma = cov(X) (the data covariance matrix)
Each can be easily performed using scikit-learn (although for the third one you will need scipy for the matrix square root and inversion).
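
The first two options can be sketched in pure Python to show what they compute per feature. This is an illustration only, with made-up values for the outlying feature; in scikit-learn the corresponding tools are MinMaxScaler and StandardScaler from sklearn.preprocessing.

```python
import math


def min_max_scale(column):
    """Linearly rescale one feature to [0, 1] (assumes the column is not constant)."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


def standardize(column):
    """Shift and scale one feature to mean 0 and variance 1 (population variance)."""
    mean = sum(column) / len(column)
    variance = sum((v - mean) ** 2 for v in column) / len(column)
    std = math.sqrt(variance)
    return [(v - mean) / std for v in column]


# Hypothetical values of the large-magnitude feature across four training samples.
column = [-200.0, -150.0, -250.0, -180.0]

scaled = min_max_scale(column)
standardized = standardize(column)
```

The scaling parameters (minimum, maximum, mean, standard deviation) should be computed on the training set and then reused unchanged on the test data.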
"I am trying to better understand how the values of my feature vector may influence the result."
Then here's the math for you. Let's take the linear kernel as a simple example. It takes a sample x and a support vector sv, and computes the dot product between them. A naive Python implementation of a dot product would be
def dot(x, sv):
    return sum(x_i * sv_i for x_i, sv_i in zip(x, sv))
Now if one of the features has a much more extreme range than all the others (either in x or in sv, or worse, in both), then the term corresponding to this feature will dominate the sum.
A similar situation arises with the polynomial and RBF kernels. The poly kernel is just a (shifted) power of the linear kernel:
def poly_kernel(x, sv, d, gamma):
    return (dot(x, sv) + gamma) ** d
and the RBF kernel is the exponential of the negative squared distance between x and sv, scaled by a constant:
import math
def rbf_kernel(x, sv, gamma):
    diff = [x_i - sv_i for x_i, sv_i in zip(x, sv)]
    return math.exp(-gamma * dot(diff, diff))
In each of these cases, if one feature has an extreme range, it will dominate the result and the other features will effectively be ignored, except to break ties.
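
To see the effect on the feature vector from the question, here is a small sketch that measures how much of the dot product of that vector with itself comes from a single feature. The label (the final 1) is left out; everything else is taken from the question.

```python
def dot(x, sv):
    return sum(x_i * sv_i for x_i, sv_i in zip(x, sv))


# Feature vector from the question, without the trailing class label.
x = [0.713, -0.076, -0.921, 0.498, 2.526, 0.573, -1.117, 1.682,
     -1.918, 0.251, 0.376, 0.025291666666667, -200, 9]

# Per-feature terms of dot(x, x).
contributions = [x_i * x_i for x_i in x]

# Fraction of the total that comes from the largest single term.
share = max(contributions) / dot(x, x)
dominant_index = contributions.index(max(contributions))
```

The -200 feature accounts for more than 99% of the sum, so the remaining thirteen features barely change the kernel value unless the data is rescaled first.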
scikit-learn tools to deal with this live in the sklearn.preprocessing module: MinMaxScaler, StandardScaler, Normalizer.
