generating completely new vector based on other vectors - vector

Assume I have four-vectors (v1,v2,v3,v4), and I want to create a new vector (vec_new) that is not close to any of those four-vectors. I was thinking about interpolation and extrapolation. Do you think they are suitable? Are they also apply for vector and generate a vector of let's say 300 dimensions? Another possible option would be the transformation matrix. But I am not sure if it fit my concern. I think averaging and concatenation are not the good ones as I might be close to some of those four-vectors.
based on my problem, Imagine I divided my vectors into two categories. I need to find a vector which belongs to non-of those categories.
Any other ideas?

Per my comment, I wouldn't expect the creation of synthetic "far away" examples to be useful for realistic goals.
Even things like word antonyms are not maximally cosine-dissimilar from each other, because among the realm of all word-meaning-possibilities, antonyms are quite similar to each other. For example, 'hot' and 'cold' are considered opposites, but are the same kind of word, describing the same temperature-property, and can often be drop-in replacements for each other in the same sentences. So while they may show an interesting contrast in word-vector space, the "direction of difference" isn't going to be through the origin -- as would create maximal cosine-dissimilarity.
And in classification contexts, even a simple 2-category classifier will need actual 'negative' examples. With only positive examples, the 'vector space' won't necessarily model anything about hypothesized-but-not-actually-present negative examples. (It's nearly impossible to divide the space into two categories without training examples showing the real "boundaries".)
Still, there's an easy way to make a vector that is maximally dissimilar to another single vector: negate it. That creates a vector that's in the exact opposite direction from the original, and thus will have a cosine-similarity of -1.0.
If you have a number of vectors against which you want to find a maximally-dissimilar vector, I suspect you can't do much better than negating the average of all the vectors. That is, average the vectors, then negate that average-vector, to find the vector that's pointing exactly-opposite the average.
Good luck!

Related

Array-processing: Eigenstructure of the Spatial Covariance Matrix

I've been staring at the following underlined statement from this book for hours, and I cannot for the life of me figure out how it can be right:
For some definitions:
is an r x r matrix (we may ignore its contents for this purpose).
A is an N x r matrix defined as the following matrix of column vectors, where each vector is N elements long:
First of all I'm convinced that when they write:
they really mean:
otherwise it simply would not make sense from the start. My confusion is when they say is a linear combination of the column vectors of A.
At first I thought maybe it just wasn't obvious to me, so I started doing the calculations as an exercise.
My calculation (please don't make me type all this into a text equation editor):
I THINK my calculation is correct, but there was a lot to keep track of, so...
I did not do the multiplication with because it's trivial, and it doesn't solve the problem.
How can the products of different elements (complex conjugated no less) of the vectors in A end up as a linear combination of the columns of A?
Am I forgetting something fundamental here? Maybe something to do with the fact that is an eigenvector of ...?

Checking for similarity of text in two text strings

I have two strings of text (typically two paragraphs). I am looking to check for the "similarity" between them, e.g. check if one paragraph is a plagiarised version of the other. Ideally I need a similarity score, as well as an indication of where the similarities are. I prefer to do this fully in R. Any suggestions please?
The difference of stings can be measured with the levenshtein distance (or concepts that build on top of that). The main idea is to quantify the "editiing distance" of strings: how many letters need to be included/excluded/changed, etc (depending on the algorithm more or less types of editing are allowed). A package in R for this task would be fuzzyjoin.
To look up the similarities you could cut both texts (original and suposed plagiate) in sentences and build the fuzzy joins on this - Then you can filter for best matches. The topic is a bit tricky so I recomend trying out different algorithms (jaccard distance, damerau levenshtein, etc). A start into the topic can be found here: https://cran.r-project.org/web/packages/fuzzyjoin/readme/README.html

Movement data analysis in R; Flights and temporal subsampling

I want to analyse angles in movement of animals. I have tracking data that has 10 recordings per second. The data per recording consists of the position (x,y) of the animal, the angle and distance relative to the previous recording and furthermore includes speed and acceleration.
I want to analyse the speed an animal has while making a particular angle, however since the temporal resolution of my data is so high, each turn consists of a number of minute angles.
I figured there are two possible ways to work around this problem for both of which I do not know how to achieve such a thing in R and help would be greatly appreciated.
The first: Reducing my temporal resolution by a certain factor. However, this brings the disadvantage of losing possibly important parts of the data. Despite this, how would I be able to automatically subsample for example every 3rd or 10th recording of my data set?
The second: By converting straight movement into so called 'flights'; rule based aggregation of steps in approximately the same direction, separated by acute turns (see the figure). A flight between two points ends when the perpendicular distance from the main direction of that flight is larger than x, a value that can be arbitrarily set. Does anyone have any idea how to do that with the xy coordinate positional data that I have?
It sounds like there are three potential things you might want help with: the algorithm, the math, or R syntax.
The algorithm you need may depend on the specifics of your data. For example, how much data do you have? What format is it in? Is it in 2D or 3D? One possibility is to iterate through your data set. With each new point, you need to check all the previous points to see if they fall within your desired column. If the data set is large, however, this might be really slow. Worst case scenario, all the data points are in a single flight segment, meaning you would check the first point the same number of times as you have data points, the second point one less, etc. The means n + (n-1) + (n-2) + ... + 1 = n(n-1)/2 operations. That's O(n^2); the operating time could have quadratic growth with respect to the size of your data set. Hence, you may need something more sophisticated.
The math to check whether a point is within your desired column of x is pretty straightforward, although maybe more sophisticated math could help inform a better algorithm. One approach would be to use vector arithmetic. To take an example, suppose you have points A, B, and C. Your goal is to see if B falls in a column of width x around the vector from A to C. To do this, find the vector v orthogonal to C, then look at whether the magnitude of the scalar projection of the vector from A to B onto v is less than x. There is lots of literature available for help with this sort of thing, here is one example.
I think this is where I might start (with a boolean function for an individual point), since it seems like an R function to determine this would be convenient. Then another function that takes a set of points and calculates the vector v and calls the first function for each point in the set. Then run some data and see how long it takes.
I'm afraid I won't be of much help with R syntax, although it is on my list of things I'd like to learn. I checked out the manual for R last night and it had plenty of useful examples. I believe this is very doable, even for an R novice like myself. It might be kind of slow if you have a big data set. However, with something that works, it might also be easier to acquire help from people with more knowledge and experience to optimize it.
Two quick clarifying points in case they are helpful:
The above suggestion is just to start with the data for a single animal, so when I talk about growth of data I'm talking about the average data sample size for a single animal. If that is slow, you'll probably need to fix that first. Then you'll need to potentially analyze/optimize an algorithm for processing multiple animals afterwards.
I'm implicitly assuming that the definition of flight segment is the largest subset of contiguous data points where no "sub" flight segment violates the column rule. That is to say, I think I could come up with an example where a set of points satisfies your rule of falling within a column of width x around the vector to the last point, but if you looked at the column of width x around the vector to the second to last point, one point wouldn't meet the criteria anymore. Depending on how you define the flight segment then (e.g. if you want it to be the largest possible set of points that meet your condition and don't care about what happens inside), you may need something different (e.g. work backwards instead of forwards).

Normalize a vector?

How do you normalize a M*N vector, such that the sum of all its elements is now equal to 1. I browsed online a little, and nothing seems to quite match what I need. Thanks!
You add up all the elements, then divide each element by the sum.
Obviously, the division (at least) needs to be in floating point. Since that indicates a floating point matrix, doing the summing while maintaining maximum accuracy will be non-trivial.
Just for example, if you have one large element, and a lot of small elements, you'll probably get a more accurate result from adding all the small elements together, then adding that sum to the large element, than if you added each small element to the large one individually.
Edit: I suppose I should add that the usual way to deal with this is called Kahan summation, after the high guru of numerical analysis, William Kahan.
i think you have to divide every vector component by the euklidean distance of the vector

n-gram sentence similarity with cosine similarity measurement

I have been working on a project about sentence similarity. I know it has been asked many times in SO, but I just want to know if my problem can be accomplished by the method I use by the way that I am doing it, or I should change my approach to the problem. Roughly speaking, the system is supposed to split all sentences of an article and find similar sentences among other articles that are fed to the system.
I am using cosine similarity with tf-idf weights and that is how I did it.
1- First, I split all the articles into sentences, then I generate trigrams for each sentence and sort them(should I?).
2- I compute the tf-idf weights of trigrams and create vectors for all sentences.
3- I calculate the dot product and magnitude of original sentence and of the sentence to be compared. Then calculate the cosine similarity.
However, the system does not work as I expected. Here, I have some questions in my mind.
As far as I have read about tf-idf weights, I guess they are more useful for finding similar "documents". Since I am working on sentences, I modified the algorithm a little by changing some variables of the formula of tf and idf definitions(instead of document I tried to come up with sentence based definition).
tf = number of occurrences of trigram in sentence / number of all trigrams in sentence
idf = number of all sentences in all articles / number of sentences where trigram appears
Do you think it is ok to use such a definition for this problem?
Another one is that I saw the normalization is mentioned many times when calculating the cosine similarity. I am guessing that this is important because the trigrams vectors might not be the same size(which they rarely are in my case). If a trigram vector is size of x and the other is x+1, then I treat the first vector as it was the size of x+1 with the last value is being 0. Is this what it is meant by normalization? If not, how do I do the normalization?
Besides these, if I have chosen the wrong algorithm what else can be used for such problem(preferably with n-gram approach)?
Thank you in advance.
I am not sure why you are sorting the trigrams for every sentence. All you need to care about when computing cosine similarity is that whether the same trigram occurred in the two sentences or not and with what frequencies. Conceptually speaking you define a fixed and common order among all possible trigrams. Remember the order has to be the same for all sentences. If the number of possible trigrams is N, then for each sentence you obtain a vector of dimensionality N. If a certain trigram does not occur, you set the corresponding value in the vector to zero. You dont really need to store the zeros, but have to take care of them when you define the dot product.
Having said that, trigrams are not a good choice as chances of a match are a lot sparser. For high k you will have better results from bags of k consecutive words, rather than k-grams. Note that the ordering does not matter inside a bag, its a set. You are using k=3 k-grams, but that seems to be on the high side, especially for sentences. Either drop down to bi-grams or use bags of different lengths, starting from 1. Preferably use both.
I am sure you have noticed that sentences that do not use the exact trigram has 0 similarity in your method. K-bag of words
will alleviate the situation somewhat but not solve it completely. Because now you need sentences to share actual words. Two sentences may be similar without using the same words. There are a couple of ways to fix this. Either use LSI(latent Semantic Indexing) or clustering of the words and use the cluster labels to define your cosine similarity.
In order to compute the cosine similarity between vectors x and y you compute the dot product and divide by the norms of x and y.
The 2-norm of the vector x can be computed as square root of the sum of the components squared. However you should also try your algorithm out without any normalization to compare. Usually it works fine, because you are already taking care of the relative sizes of the sentences when you compute the term frequencies (tf).
Hope this helps.

Resources