Gabor feature extraction - pattern-recognition

I am doing a project on Gabor feature extraction, and I am confused about what a "Gabor feature" actually means. I have made a feature matrix at different orientations and frequencies. Is that the Gabor feature, or does the term refer to features (statistical, geometric, spatial-domain, invariance, repeatability, etc.) computed from the images obtained after convolving the image with a Gabor filter bank at different orientations and frequencies?

Gabor filters behave very similarly to mammalian visual cortical cells, so they extract features at different orientations and different scales.
I too recently did some Gabor-filter-based feature extraction.
It looks hard initially, but it is easy to implement.
To make it easy for you to understand, I will give you a walkthrough.
Suppose you have an input image.
You then compute Gabor filters at 5 scales and 8 orientations (which I suppose you have already done), giving you a bank of 40 filters.
Now you need to convolve each filter with the image to get 40 (8 × 5 = 40) different representations (response matrices) of the same image, from which you build the feature vector for that image.
After convolution, you need to convert those response matrices into a feature vector.
The feature vector may consist of: local energy, mean amplitude, phase amplitude, or the orientation at which the local energy is maximal.
I worked on local energy and mean amplitude and got good enough results.
Local energy = the sum of the squared values of all entries of a response matrix.
Mean amplitude = the sum of the absolute values of all entries of a response matrix.
Thus at the end you will get two vectors of size [1x40] each.
You can append one to the other to create a [1x80] feature vector for one image, and thus an [nx80] matrix for n images for further training.
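A minimal base-R sketch of the walkthrough above is given below. The kernel formula, the parameter values (kernel size, wavelengths) and the naive convolution loop are my own illustrative assumptions, not a reference implementation.

# Build a real Gabor kernel for a given wavelength (scale) and orientation.
gabor_kernel <- function(lambda, theta, sigma = 0.56 * lambda, gamma = 0.5, size = 15) {
  half <- (size - 1) / 2
  xs <- matrix(rep(-half:half, size), nrow = size, byrow = TRUE)   # x varies across columns
  ys <- matrix(rep(-half:half, size), nrow = size, byrow = FALSE)  # y varies across rows
  xr <-  xs * cos(theta) + ys * sin(theta)
  yr <- -xs * sin(theta) + ys * cos(theta)
  exp(-(xr^2 + gamma^2 * yr^2) / (2 * sigma^2)) * cos(2 * pi * xr / lambda)
}

# Naive 2D convolution ("valid" region only), kept simple for clarity.
conv2d <- function(img, kern) {
  k <- nrow(kern)
  out <- matrix(0, nrow(img) - k + 1, ncol(img) - k + 1)
  for (i in seq_len(nrow(out)))
    for (j in seq_len(ncol(out)))
      out[i, j] <- sum(img[i:(i + k - 1), j:(j + k - 1)] * kern)
  out
}

set.seed(1)
img <- matrix(runif(64 * 64), 64, 64)   # stand-in for your image

scales       <- c(3, 5, 7, 9, 11)       # 5 wavelengths (scales), arbitrary values
orientations <- (0:7) * pi / 8          # 8 orientations

local_energy <- numeric(0)
mean_amp     <- numeric(0)
for (lam in scales) {
  for (th in orientations) {
    resp <- conv2d(img, gabor_kernel(lam, th))       # one of the 40 response matrices
    local_energy <- c(local_energy, sum(resp^2))     # local energy
    mean_amp     <- c(mean_amp, sum(abs(resp)))      # mean amplitude
  }
}

feature_vector <- c(local_energy, mean_amp)   # the [1 x 80] feature vector for one image
length(feature_vector)                        # 80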
However, to improve performance you can use log-Gabor filters (see this).
For more information on feature extraction with Gabor filters, see this paper.

Related

Creation of correlated marks. E.g. point sizes varying with inter-point distances

I recently dabbled a bit in point pattern analysis and wonder whether there is any standard way to create mark correlation structures that vary with the inter-point distance of the point locations. I understand how to simulate independent marks, as it is frequently mentioned, e.g.
library(spatstat)
data(finpines)
set.seed(0907)
marks(finpines) <- rnorm(npoints(finpines), 30, 5)
plot(finpines)
More generally speaking, assume we have a fair number of points, say n = 100, with coordinates x and y in an arbitrary observation window (e.g. a rectangle). Every point carries a characteristic, for example the size of the point as a continuous variable. We can also examine every pairwise distance between the points. Is there a way to introduce a correlation structure between the marks (of pairs of points) that depends on the inter-point distance between the point locations?
Furthermore, I am aware of the existence of mark analysing techniques like
fin <- markcorr(finpines, correction = "best")
plot(fin)
When it comes to interpretation, my lack of knowledge forces me to trust my colleagues (non-scientists). Besides, I looked at several references given in the documentation of the spatstat functions; in particular, I had a look at "Statistical Analysis and Modelling of Spatial Point Patterns", p. 347, where inhibition and mutual stimulation are explained as deviations from 1 (independence of marks) of the normalised mark correlation function.
I think the best bet is to use a random field model conditional on your locations. Unfortunately, the package RandomFields is not on CRAN at the moment, but hopefully it will return soon. It may be possible to install an old version of RandomFields from the archives if you want to get going immediately.
Thus, the procedure is (a minimal sketch follows the list):
Use spatstat to generate the random locations you like.
Extract the coordinates (either coords() or as.data.frame.ppp()).
Define a model in RandomFields (e.g. RMexp()).
Simulate the model on the given coordinates (RFsimulate()).
Convert back to a marked point pattern in spatstat (either ppp() or as.ppp()).
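To make the five steps concrete, here is a minimal sketch. Because RandomFields is currently off CRAN, it replaces RFsimulate() with a direct simulation of a Gaussian field with exponential covariance via the Cholesky factor of the covariance matrix; the mean, variance and scale values are arbitrary choices for illustration.

library(spatstat)

set.seed(0907)
X  <- rpoispp(100)                       # 1. random locations in the unit square
xy <- coords(X)                          # 2. extract the coordinates

# 3. exponential covariance model for the marks: C(d) = var * exp(-d / scale)
var_mark <- 25; scale_mark <- 0.1; mean_mark <- 30
D <- as.matrix(dist(xy))
C <- var_mark * exp(-D / scale_mark)

# 4. simulate the Gaussian random field at the given coordinates
R <- chol(C + diag(1e-8, nrow(C)))       # small jitter for numerical stability
field <- mean_mark + as.vector(t(R) %*% rnorm(nrow(C)))

# 5. attach the field values as marks: nearby points now carry correlated marks
marks(X) <- field
plot(X)
plot(markcorr(X, correction = "best"))   # inspect the induced mark correlation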

What is the difference between metric and non-metric MDS for a beginner?

I am fairly new to data science and would like to know, in simple words (as if teaching your grandmother), what the difference between metric and non-metric multidimensional scaling is.
I have been googling for 2 days and watching different videos, and I still haven't quite understood some of the terms people use to describe the difference. Maybe I am lacking some basic knowledge, but I don't know in which area, so if you have an idea of what I should have a firm understanding of before tackling this subject, I would appreciate the advice. Here is what I know:
Multidimensional scaling is a way of reducing dimensions so that data can be visualized or represented in a friendlier manner. I know that there are several approaches to this, such as metric and non-metric MDS, PCA and FA (maybe FA is a part of PCA, I'm not sure).
The example I am trying to apply this to is a data set of different cities and attributes related to these cities. For example, on a scale from 1 to 7 (1 lowest, 7 highest), this is the score of each city on the corresponding attribute:
|           | **Clean** | **Friendly** | **Expensive** | **Beautiful** |
|-----------|-----------|--------------|---------------|---------------|
| Berlin    | 4         | 2            | 5             | 6             |
| Geneva    | 6         | 3            | 7             | 7             |
| Paris     | 3         | 4            | 6             | 7             |
| Barcelona | 2         | 6            | 3             | 4             |
How do I know whether I should be using metric or non-metric MDS? Are there general rules of thumb or simple logic that I can use to decide without going deep into the technical process?
Thank you
Well, I might not be able to give you a specific answer, but a simple one would be this: in metric MDS the input matrix already contains distances (e.g. actual distances between cities), so the entries have meaning on their own, and the method creates a map of actual physical locations from those distances.
In non-metric MDS, the distances are just a representation of rankings (i.e. high, as in 7, or low, as in 1); they have no meaning on their own, but they are needed to create the map using Euclidean geometry, and the map then simply shows the similarity of the rankings as distances between coordinates on the map.
Metric MDS deals with an item x item input matrix whose entries represent Euclidean distance (a special case of metric MDS called classical MDS, which is equivalent to PCA) or any other distance between items.
Non-metric MDS deals with some distance-like measure (let's call it a dissimilarity) between items. There is no requirement for the dissimilarity to satisfy the formal properties of a distance/metric (see this wiki for the required properties). The only requirement is that it must be possible to order the dissimilarity values of all item x item pairs in non-decreasing order.
In your case, the item x attribute matrix contains ordinal data (scores on a 1-7 scale). Euclidean distance won't be appropriate here, but Pearson "distance" or cosine "distance" are usually used for such data, and since they are not proper distances, non-metric MDS should be chosen.
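As a rough illustration of the contrast (not tied to your data), here is a small R sketch using the built-in eurodist road distances between European cities; for your 1-7 ratings you would instead feed a rank-respecting dissimilarity, e.g. 1 - Pearson correlation between the rows of the city x attribute table, to the non-metric routine.

library(MASS)                          # provides isoMDS()

# Metric (classical) MDS: the input entries are treated as actual distances.
metric_fit <- cmdscale(eurodist, k = 2)

# Non-metric MDS: only the rank order of the dissimilarities is used.
nonmetric_fit <- isoMDS(eurodist, k = 2)

cities <- attr(eurodist, "Labels")
op <- par(mfrow = c(1, 2))
plot(metric_fit, type = "n", main = "Metric MDS")
text(metric_fit, labels = cities, cex = 0.7)
plot(nonmetric_fit$points, type = "n", main = "Non-metric MDS")
text(nonmetric_fit$points, labels = cities, cex = 0.7)
par(op)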

How to add zoom option for wordcloud in Shiny (with reproducible example)

Could you please help me add a zooming option for a word cloud?
Please find a reproducible example here:
http://shiny.rstudio.com/gallery/word-cloud.html
I tried to incorporate rbokeh and plotly but couldn't find a wordcloud-equivalent render function.
Additionally, I found ECharts2Shiny on GitHub:
https://github.com/XD-DENG/ECharts2Shiny/tree/8ac690a8039abc2334ec06f394ba97498b518e81
But incorporating ECharts is also not convenient for real zooming.
Thanks in advance,
Abi
Normalisation is required only if the predictors are not meant to be comparable on their original scales. There's no rule that says you must normalize.
PCA is a statistical method that gives you a new linear transformation. By itself, it loses nothing. All it does is to give you new principal components.
You lose information only if you choose a subset of those principal components.
Usually PCA includes centering the data as a pre-processing step.
PCA only re-expresses the data in its own axis system (the eigenvectors of the data's covariance matrix).
If you use all of the axes, you lose no information.
Yet usually we want to apply dimensionality reduction, i.e., intuitively, to describe the data with fewer coordinates.
This means projecting the data onto the subspace spanned by only some of the eigenvectors of the data.
If the number of vectors is chosen wisely, one can end up with a significant reduction in the dimensionality of the data with negligible loss of information.
The way to do so is to choose the eigenvectors whose eigenvalues sum to most of the data's power (variance).
PCA itself is invertible, so lossless.
But:
It is common to drop some components, which will cause a loss of information.
Numerical issues may cause a loss in precision.
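A small prcomp() sketch of this point (the data here are arbitrary random numbers): reconstructing from all components is lossless up to round-off, while keeping only a subset is not.

set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)

p <- prcomp(X)   # centering is on by default

# Reconstruct from the first k principal components, adding the centre back.
reconstruct <- function(p, k) {
  sweep(p$x[, 1:k, drop = FALSE] %*% t(p$rotation[, 1:k, drop = FALSE]),
        2, p$center, "+")
}

max(abs(reconstruct(p, 5) - X))   # ~1e-15: all components kept, only round-off error
max(abs(reconstruct(p, 2) - X))   # noticeably larger: information lost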

Document similarity selfplagiarism

I have thousands of small documents from 100 different authors. Using the quanteda package, I calculated the cosine similarity of each author's texts with one another. For example, author x has 100 texts, so I end up with a 100 x 100 similarity matrix; author y has 50 texts, so I end up with a 50 x 50 similarity matrix.
Now I want to compare these two authors. In other words, which author copies himself more? If I average the columns or rows and then average the resulting vector of means, I arrive at a single number, so I can compare these two means of means, but I am not sure whether this procedure is right. I hope I made myself clear.
I think the answer depends on what exactly your quantity of interest is. If it is a single summary of how similar an author's documents are to one another, then summarising the distribution of document similarities within each author is probably your best means of comparing this quantity between authors.
You could save and plot the cosine similarities across an author's documents as a density, for instance, in addition to your strategy of summarising this distribution using a mean. To capture the variance, I would also report the standard deviation of these similarities.
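A small sketch of that summary, assuming sim_x and sim_y already hold the 100 x 100 and 50 x 50 cosine-similarity matrices you computed with quanteda; toy stand-ins are generated here only so that the code runs on its own.

summarise_author <- function(sim) {
  s <- sim[upper.tri(sim)]        # pairwise similarities, diagonal excluded
  c(mean = mean(s), sd = sd(s))
}

# Toy stand-ins; replace with your own similarity matrices.
set.seed(1)
sim_x <- cor(matrix(runif(200 * 100), 200, 100))
sim_y <- cor(matrix(runif(200 *  50), 200,  50))

summarise_author(sim_x)
summarise_author(sim_y)

# Compare the full distributions rather than a single mean of means.
plot(density(sim_x[upper.tri(sim_x)]), main = "Within-author similarity")
lines(density(sim_y[upper.tri(sim_y)]), lty = 2)
legend("topright", c("author x", "author y"), lty = c(1, 2))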
I'd be cautious about calling cosine similarity within an author "self-plagiarism". Cosine similarity computes a measure of distance between vector representations of bags of words and is not viewed as a method for identifying plagiarism. In addition, the term "plagiarism" carries very pejorative connotations: it means the dishonest representation of someone else's ideas as your own. (I don't even believe that the term "self-plagiarism" makes sense at all, but then I have academic colleagues who disagree.)
Added:
Consider the textreuse package for R; it is designed for exactly the sort of text-reuse analysis you are looking for.
I don't think Levenshtein distance is what you are looking for. As the Wikipedia page points out, the LD between "kitten" and "sitting" is 3, but this says nothing substantive about their semantic relationship or about one being a "re-use" of the other. One could argue that LD based on words might show re-use, but that is not how most plagiarism-detection tools, e.g. http://turnitin.com, implement detection.

Fast way of doing k means clustering on binary vectors in c++

I want to cluster binary vectors (millions of them) into k clusters. I am using Hamming distance for finding the nearest neighbours to the initial clusters (which is very slow as well). I think k-means clustering does not really fit here. The problem is in calculating the mean of the nearest neighbours (which are binary vectors) to some initial cluster centre in order to update the centroid.
A second option is to use k-medoids, in which the new cluster centre is chosen from among the nearest neighbours (the one which is closest to all the other neighbours of that particular cluster centre). But finding that is another problem, because the number of nearest neighbours is also quite large.
Can someone please guide me?
It is possible to do k-means clustering with binary feature vectors. The TopSig paper I co-authored has the details. The centroids are calculated by taking the most frequently occurring bit in each dimension. The TopSig paper applied this to document clustering, where we had binary feature vectors created by random projection of sparse, high-dimensional bag-of-words feature vectors. There is a Java implementation at http://ktree.sf.net. We are currently working on a C++ version, but it is very early code which is still messy and probably contains bugs; you can find it at http://github.com/cmdevries/LMW-tree. If you have any questions, please feel free to contact me at chris#de-vries.id.au.
If you want to cluster a lot of binary vectors, there are also more scalable tree-based clustering algorithms: K-tree, TSVQ and EM-tree. For more details on these algorithms, see a paper I recently submitted for peer review (not yet published) relating to the EM-tree.
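To illustrate only the centroid rule described above (the most frequent bit in each dimension, with Hamming-distance assignment), here is a small R sketch on random toy vectors; it shows the idea and is not the fast C++ implementation the question asks about.

set.seed(1)
n <- 1000; d <- 64; k <- 5
X <- matrix(rbinom(n * d, 1, 0.5), n, d)       # binary data, one vector per row

centroids <- X[sample(n, k), , drop = FALSE]   # random initial centres
for (iter in 1:10) {
  # assign each vector to the centroid with the smallest Hamming distance
  H <- sapply(1:k, function(j) rowSums(X != matrix(centroids[j, ], n, d, byrow = TRUE)))
  cluster <- max.col(-H, ties.method = "first")
  # update each centroid to the most frequent bit in each dimension
  for (j in 1:k) {
    members <- X[cluster == j, , drop = FALSE]
    if (nrow(members) > 0) centroids[j, ] <- as.integer(colMeans(members) >= 0.5)
  }
}
table(cluster)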
Indeed, k-means is not very appropriate here, because the means won't be meaningful on binary data.
Why do you need exactly k clusters? Requiring this will likely mean that some vectors won't fit their clusters very well.
Some things you could look into for clustering: MinHash, locality-sensitive hashing.
