Similarity function for Mahout boolean user-based recommender

I am using Mahout to build a user-based recommendation system that operates on boolean data.
I use GenericBooleanPrefUserBasedRecommender and NearestNUserNeighborhood, and am now trying to decide on the most suitable user similarity function.
It was suggested to use either LogLikelihoodSimilarity or TanimotoCoefficientSimilarity. I tried both and am getting [subjectively evaluated] meaningful results in both cases. However, the RMSE score for the same data set is better for LogLikelihoodSimilarity. The number of "no recommendation" results is similar in both cases.
Can anyone recommend which of these similarity functions is more suitable for this case?

(I'm the developer.) If I were stranded on a desert island with just one similarity metric for data without ratings/prefs, it would be log-likelihood. I would generally expect it to be the better similarity metric.
The problem with the test you're doing is that, perhaps not at all obviously, it's not meaningful for this kind of recommender / data. RMSE is root-mean-square error, and it compares the actual vs. predicted rating for held-out test data. But you have no ratings: they're all "1.0". Taken literally, RMSE would always be 0!
Well, not quite: to have anything to rank on, these recommenders rank by some meaningful function of the similarities. But they are not estimating ratings / prefs at all, so RMSE means squat here.
The only metric you can really use is a precision/recall test in this case, I think. Even that is problematic. This and more fun topics are covered in a book which I will shamelessly promote: Mahout in Action
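For intuition about what the two candidate metrics actually compute, here is a toy R sketch (R to match the rest of this page; it mirrors, rather than reproduces, Mahout's TanimotoCoefficientSimilarity and LogLikelihoodSimilarity). u and v are boolean preference vectors over the same items:
u <- c(1, 1, 0, 1, 0, 0, 1, 0)
v <- c(1, 0, 0, 1, 0, 1, 1, 0)
# Tanimoto (Jaccard) coefficient: |intersection| / |union|
tanimoto <- sum(u & v) / sum(u | v)
# Dunning's log-likelihood ratio (G^2) over the 2x2 co-occurrence counts
llr <- function(u, v) {
  O <- matrix(c(sum(u & v), sum(u & !v), sum(!u & v), sum(!u & !v)), 2, byrow = TRUE)
  E <- outer(rowSums(O), colSums(O)) / sum(O)  # expected counts under independence
  2 * sum(ifelse(O > 0, O * log(O / E), 0))    # 0 * log(0) treated as 0
}
# map the raw ratio into [0, 1); this form is my assumption of what Mahout does
sim <- 1 - 1 / (1 + llr(u, v))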

Smoothing before FPCA

I am trying to use FPCA on my time series, and I know that I should do some smoothing before using FPCA. However, I don't know which smoothing method is appropriate.
Any resource is much appreciated!
Thanks!
Smoothing will depend on the data you have. In the FDA parametric approach (Ramsay & Silverman, 2005), the basis functions are your choice: in general, it is common to use the “fourier” basis for periodic data and the “bspline” basis for non-recurrent data. B-splines have very good local behaviour.
You can find more info about the implementation of different basis functions in "Functional Data Analysis with R and MATLAB" (Ramsay et al., 2009).
There is no specific rule for choosing the dimension of the basis, as it depends on several factors. I strongly recommend studying the least-squares error of the fit over the candidate dimensions and then choosing by the region of convenience. Some packages have implemented functions to calculate it, e.g. fda.usc::min.basis() (best minimum number of basis functions), and also by cross-validation, e.g. fda.usc::CV.S().
P-splines provide low approximation errors, are computationally simpler to implement, and are quite insensitive to the choice of knots. You can smooth your curves into a functional object like this (where argvals are the sampling points, y is the matrix of observed curves, and k is the basis dimension):
library(fda)
# order-4 (cubic) B-spline basis with k functions over the observation range
basis <- create.bspline.basis(rangeval = range(argvals), nbasis = k, norder = 4)
# project the raw curves onto the basis to get a functional data object
fdobj <- smooth.basis(argvals, y, basis)$fd
# optional roughness-penalty smoothing; increase lambda to smooth more
smooth.fdPar(fdobj, Lfdobj = NULL, lambda = 0)
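To choose the basis dimension by generalized cross-validation rather than by eye, a minimal sketch (assuming the same argvals and y as above) loops over candidate sizes and keeps the one with the smallest total GCV, which smooth.basis() reports per curve:
nb <- seq(6, 30, by = 2)  # candidate basis dimensions
gcv <- sapply(nb, function(k) {
  basis <- create.bspline.basis(range(argvals), nbasis = k, norder = 4)
  sum(smooth.basis(argvals, y, basis)$gcv)  # total GCV across curves
})
nb[which.min(gcv)]  # basis dimension minimizing GCV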

Posterior Predictive Checking

I'm relatively new to all this. I've performed an imputation on metabolomics data, and a colleague has queried the quality of my imputation (I performed predictive mean matching using MICE in R).
Having looked into this, there isn't any official way to assess an imputation beyond visually comparing the imputed and observed data. I've found some papers on using posterior predictive checking and a p-value comparing whether the complete data are more extreme than the observed, but as I'm new to this, attempting to write the code without guidance feels too challenging at this point.
Can anyone direct me to an example script, or perhaps a well-described command, in R that will enable me to perform sufficient checks on my imputation?
Thank you, K.
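One starting point (a sketch, not a recipe from this thread): mice itself ships lattice-based diagnostic plots that overlay observed and imputed values, which covers the visual check mentioned above. Using the nhanes data bundled with the package:
library(mice)
# pmm is predictive mean matching, as in the question; m = 5 imputations
imp <- mice(nhanes, method = "pmm", m = 5, seed = 1, printFlag = FALSE)
densityplot(imp)          # observed (blue) vs imputed (red) densities
stripplot(imp, pch = 20)  # individual imputed points per variable and imputation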

Statistical comparison of machine learning algorithms

I am working in machine learning and I am stuck on one thing.
I want to compare 4 machine learning techniques across 10 datasets. After performing the experiments I obtained Area Under Curve (AUC) values. I then applied an analysis-of-variance test, which shows there is a significant difference between the 4 machine learning techniques.
Now my problem is: which test will conclude that a particular algorithm performs well compared to the others? I want only one winner among the machine learning techniques.
A classifier's quality can be measured by the F-score, which summarizes the test's accuracy. Comparing these respective scores gives you a simple measure.
However, if you want to test whether the difference between the classifiers' accuracies is significant, you can try a Bayesian test or, if the classifiers are trained once, McNemar's test.
There are other possibilities, and the papers "On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach" and "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms" are probably worth reading.
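To make the McNemar option concrete, here is a minimal R sketch on simulated labels (the two prediction vectors are hypothetical stand-ins for your classifiers' outputs on one shared test set):
set.seed(1)
truth <- sample(c("pos", "neg"), 200, replace = TRUE)
predA <- ifelse(runif(200) < 0.80, truth, sample(c("pos", "neg"), 200, replace = TRUE))
predB <- ifelse(runif(200) < 0.70, truth, sample(c("pos", "neg"), 200, replace = TRUE))
# 2x2 table of correct/incorrect agreement; the discordant cells drive the test
tab <- table(A_correct = predA == truth, B_correct = predB == truth)
mcnemar.test(tab)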
If you are gathering performance metrics (ROC AUC, accuracy, sensitivity, specificity, ...) from identically resampled data sets, then you can perform statistical tests using paired comparisons. Most statistical software implements Tukey's range test (ANOVA): https://en.wikipedia.org/wiki/Tukey%27s_range_test. A formal treatment of this material is here: http://epub.ub.uni-muenchen.de/4134/1/tr030.pdf. This is the test I like to use for the purpose you describe, although there are others and people have varying opinions.
You will still have to choose how you sample based on your data: k-fold, repeated k-fold, bootstrap, leave-one-out, or repeated training/test splits. Bootstrap methods tend to give the tightest confidence intervals after leave-one-out, but leave-one-out might not be an option if your data set is huge.
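Concretely, with one AUC per (model, fold) pair from identical resamples, the paired comparison can be run in base R like this (simulated numbers stand in for real AUCs):
set.seed(1)
perf <- data.frame(
  auc   = runif(40, 0.70, 0.90),                        # hypothetical AUC values
  model = rep(c("rf", "svm", "glm", "gbm"), each = 10),
  fold  = factor(rep(1:10, times = 4))                  # 10 shared CV folds
)
fit <- aov(auc ~ model + fold, data = perf)  # blocking on fold pairs the comparison
TukeyHSD(fit, which = "model")               # all pairwise model differences with CIs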
That being said, you may also need to consider the problem domain. False positives may be an issue in classification, so you may need to consider other metrics to choose the best performer for the domain; AUC is not always the right metric for a specific domain. For instance, a credit card company may not want to deny transactions to legitimate customers, so it needs a very low false positive rate on fraud classification.
You may also want to consider implementation. If a logistic regression performs nearly as well, it may be a better choice than a more complicated random forest implementation. Are there legal implications to model use (Fair Credit Reporting Act, ...)?
A common-sense approach is to begin with something like a random forest or gradient-boosted trees to get an empirical sense of a performance ceiling, then build simpler models and use the simplest model that performs reasonably well compared to the ceiling.
Or you could combine all your models using something like LASSO... or some other model.

R - Fit many distributions to sample, visualize, and sort by g.o.f. test

Is there any package in R that allows fitting many pdfs at the same time to some sample data, plotting all the fits together with the sample histogram, and then sorting the fits by some goodness-of-fit criterion like Kolmogorov-Smirnov, Anderson-Darling, chi-squared, ...? Something similar to what the commercial software EasyFit does?
UPDATE
I've received valuable comments on my initial question. Specifically, the AIC stands out as a metric that allows comparing pdfs with differing numbers of parameters. However, the AIC also has limitations. It would therefore be interesting to come up with, or find, some sort of summary stating the pros and cons of all the g.o.f. tests for model selection. Many of these topics are familiar to statisticians but might not be to practitioners, who must make many g.o.f. decisions on a daily basis for practical problems, so such a summary would be very useful.
Any suggestions are welcome.
Thanks!
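As a sketch of the kind of workflow being asked about (the fitdistrplus package is my suggestion, not something from the thread): it fits several candidate distributions to the same sample, overlays the fitted densities on the histogram, and tabulates KS, AD, and chi-squared statistics alongside AIC and BIC:
library(fitdistrplus)
set.seed(1)
x <- rgamma(500, shape = 2, rate = 1)  # hypothetical sample
fits <- list(gamma   = fitdist(x, "gamma"),
             lnorm   = fitdist(x, "lnorm"),
             weibull = fitdist(x, "weibull"))
denscomp(fits)  # histogram with all fitted densities overlaid
gofstat(fits)   # KS / AD / Cramer-von Mises statistics plus AIC and BIC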

Genetic Algorithms Introduction

Starting off, let me clarify that I have seen this Genetic Algorithm Resource question and it does not answer my question.
I am doing a project in bioinformatics. I have to take data about the NMR spectrum of a cell (E. coli) and find out which molecules (metabolites) are present in the cell.
To do this I am going to be using genetic algorithms in the R language. I DO NOT have the time to go through huge books on genetic algorithms. Heck, I don't even have time to go through little books. (That is what the linked question does not answer.)
So I need to know of resources which will help me understand quickly what genetic algorithms do and how they do it. I have read the Wikipedia entry, this webpage, and also a couple of IEEE papers on the subject.
Any working code in R (or even C), or pointers to which R packages (if any) to use, would be helpful.
A brief (and opinionated) introduction to genetic algorithms is at http://www.burns-stat.com/pages/Tutor/genetic.html
A simple GA written in R is available at http://www.burns-stat.com/pages/Freecode/genopt.R The "documentation" is in 'S Poetry' http://www.burns-stat.com/pages/Spoetry/Spoetry.pdf and the code.
I assume from your question that you have some function F(metabolites) which yields a spectrum, but that you do not have the inverse function F⁻¹(spectrum) to get back the metabolites. The search space of metabolites is large, so rather than brute-force it you wish to try an approximate method (such as a genetic algorithm) that performs a more efficient randomized search.
In order to apply any such approximate method you will have to define a score function which compares the similarity between the target spectrum and a trial spectrum. The smoother this function is, the better the search will work. If it can only yield true/false, the search becomes purely random and you'd be better off with brute force.
Given F and your score (aka fitness) function, all you need to do is construct a population of possible metabolite combinations, run them all through F, score all the resulting spectra, and then use crossover and mutation to produce a new population that combines the best candidates. Choosing how to do the crossover and mutation is generally domain-specific, because you can speed up the process greatly by avoiding the creation of nonsense genomes. The best mutation rate will be very small but will also require tuning for your domain.
Without knowing your domain I can't say what a single member of your population should look like, but it could simply be a list of metabolites (which allows for ordering and duplicates, if that's interesting) or a string of boolean values over all possible metabolites (which has the advantage of being order-invariant and yielding obvious possibilities for crossover and mutation). The string has the disadvantage that it may be more costly to filter out nonsense genomes (for example, it may not make sense to have only 1 metabolite or over 1000). It's faster to avoid creating nonsense than to merely assign it low fitness.
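To make those pieces concrete, here is a minimal from-scratch GA in R over a boolean-string genome; the hidden target vector and bit-match score are toy stand-ins for your real spectrum comparison:
set.seed(42)
n_bits  <- 40
target  <- sample(c(0, 1), n_bits, replace = TRUE)   # unknown "true" metabolite set
fitness <- function(g) sum(g == target)              # smooth score, as discussed above
pop_size <- 50
pop <- matrix(sample(c(0, 1), pop_size * n_bits, replace = TRUE), nrow = pop_size)
for (gen in 1:200) {
  scores <- apply(pop, 1, fitness)
  if (max(scores) == n_bits) break                   # perfect match found
  # selection: draw parents with probability proportional to fitness
  parents <- pop[sample(pop_size, 2 * pop_size, replace = TRUE, prob = scores), ]
  # single-point crossover between consecutive parent pairs, then rare bit flips
  cut <- sample(1:(n_bits - 1), pop_size, replace = TRUE)
  pop <- t(sapply(1:pop_size, function(i) {
    child <- c(parents[2 * i - 1, 1:cut[i]], parents[2 * i, (cut[i] + 1):n_bits])
    flip  <- runif(n_bits) < 0.01                    # small mutation rate
    ifelse(flip, 1 - child, child)
  }))
}
max(apply(pop, 1, fitness))  # best score in the final population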
There are other approximate methods if you have F and your scoring function. The simplest is probably Simulated Annealing. Another I haven't tried is the Bees Algorithm, which appears to be multi-start simulated annealing with effort weighted by fitness (sort of a cross between SA and GA).
I've found the article "The Science of Computing: Genetic Algorithms" by Peter J. Denning (American Scientist, vol. 80, no. 1, pp. 12-14). That article is simple and useful if you want to understand what genetic algorithms do, and it's only 3 pages to read!
