Can I perform Generalized Iterative Scaling in R? - r

I'm looking to port our home-grown platform of various machine learning algorithms from C# to a more robust data mining platform such as R. While it's obvious R is great at many types of data mining tasks, it is not clear to me if it can be used for text classification.
Specifically, we extract a list of bigrams from the text and then classify it into one of 15 different categories, eg:
Bigram list: jewelry, books, watches, shoes, department store
-> Category: Shopping
We'd want to both train the models in R as well as hook up to a database to perform this on a larger scale.
Can it be done in R?

Hmm, I am rather starting to look into Machine Learning, but I might have a suggestion: have you considered Weka? There's a bunch of various algorithms around and there'S IS some documentation. Plus, there is an R package RWeka that makes use of the Weka jars.
EDIT:
There is also a nice, comprehensive read by Witten et al. : Data mining that contains an extensive description of Weka among other interesting things. Look into the API opportunities.

Related

How to do Language Modeling using HTK

I am in confusion on how to use HTK for Language Modeling.
I followed the tutorial example from the Voxforge site
http://www.voxforge.org/home/dev/acousticmodels/linux/create/htkjulius/tutorial
After training and testing I got around 78% accuracy. I did this for my native language.Now I have to use HTK for Language Modeling.
Is there any tutorial available for doing the same? Please help me.
Thanks
speech_tri
If I understand your question correctly, you are trying to change from a "grammar" to an "n-gram language model" approach. These two methods are alternative ways of specifying what combinations of words are permissible in the responses that a recognizer will return. Having followed the Voxforge process you will probably have a grammar in place.
A language model comes from the analysis of a corpus of text which defines the probabilities of words appearing together. The text corpus used can be very specialized. There are a number of analysis tools such as SRILM (http://www.speech.sri.com/projects/srilm/) and MITLM (https://github.com/mitlm/mitlm) which will read a corpus and produce a model.
Since you are using words from your native language you will need a unique corpus of text to analyze. One way to get a test corpus would be to artificially generate a number of sentences from your existing grammar and use that as the corpus. Then with the new language model in place, you just point the recognizer at it instead of the grammar and hope for the best.

How to compute Bayesian Network from microarray Gene Pix data using free software?

I have tried to use MeV26, Bayesia software and R for making Bayesian network from 26 Columns of gene expression microarray numbers (.csv file, 652 genes there). Does anybody experienced can advise what software and scripts to use and what books and tutorials are best for that task? Are there any Python or Ruby libraries for that?
Thank you
Software Tools:
The easiest way would be to use WEKA. Simply import your data into WEKA, select Bayesian/ Bayesian Network (BN) as your classifier option, learn a structure and look at your classification performance.
The second would be to use R with the bnlearn package.
In general, you can find BN libraries in all major languages. I am not familiar with Ruby but Python has this.
But again, I would advise first using WEKA, as it will give you results almost instantaneously, which you can use later to benchmark your more detailed results that you obtain by getting your hands dirty / coding in whatever language.
Reading:
Obviously, there are many articles published on BNs but you may not have access to them and I presume you don't want to pay to buy a book straight-away. MatLab BNT's developer K. Murphy has a nice introductory article. Furthermore the BNT's manual itself provides a nice, brief and hands-on (if you have MatLab) introduction to the use of BNs.

Comparing R to Matlab for Data Mining

Instead of starting to code in Matlab, I recently started learning R, mainly because it is open-source. I am currently working in data mining and machine learning field. I found many machine learning algorithms implemented in R, and I am still exploring different packages implemented in R.
I have quick question: how do you compare R to Matlab for data mining application, its popularity, pros and cons, industry and academic acceptance etc.? Which one would you choose and why?
I went through various comparisons for Matlab vs R against various metrics but I am specifically interested to get answer for its applicability in Data Mining and ML.
Since both language are pretty new for me I was just wondering if R would be a good choice or not.
I appreciate any kind of suggestions.
For the past three years or so, i have used R daily, and the largest portion of that daily use is spent on Machine Learning/Data Mining problems.
I was an exclusive Matlab user while in University; at the time i thought it was
an excellent set of tools/platform. I am sure it is today as well.
The Neural Network Toolbox, the Optimization Toolbox, Statistics Toolbox,
and Curve Fitting Toolbox are each highly desirable (if not essential)
for someone using MATLAB for ML/Data Mining work, yet they are all separate from
the base MATLAB environment--in other words, they have to be purchased separately.
My Top 5 list for Learning ML/Data Mining in R:
Mining Association Rules in R
This refers to a couple things: First, a group of R Package that all begin arules (available from CRAN); you can find the complete list (arules, aruluesViz, etc.) on the Project Homepage. Second, all of these packages are based on a data-mining technique known as Market-Basked Analysis and alternatively as Association Rules. In many respects, this family of algorithms is the essence of data-mining--exhaustively traverse large transaction databases and find above-average associations or correlations among the fields (variables or features) in those databases. In practice, you connect them to a data source and let them run overnight. The central R Package in the set mentioned above is called arules; On the CRAN Package page for arules, you will find links to a couple of excellent secondary sources (vignettes in R's lexicon) on the arules package and on Association Rules technique in general.
The standard reference, The Elements of Statistical
Learning by Hastie et al.
The most current edition of this book is available in digital form for free. Likewise, at the book's website (linked to just above) are all data sets used in ESL, available for free download. (As an aside, i have the free digital version; i also purchased the hardback version from BN.com; all of the color plots in the digital version are reproduced in the hardbound version.) ESL contains thorough introductions to at least one exemplar from most of the major
ML rubrics--e.g., neural metworks, SVM, KNN; unsupervised
techniques (LDA, PCA, MDS, SOM, clustering), numerous flavors of regression, CART,
Bayesian techniques, as well as model aggregation techniques (Boosting, Bagging)
and model tuning (regularization). Finally, get the R Package that accompanies the book from CRAN (which will save the trouble of having to download the enter the datasets).
CRAN Task View: Machine Learning
The +3,500 Packages available
for R are divided up by domain into about 30 package families or 'Task Views'. Machine Learning
is one of these families. The Machine Learning Task View contains about 50 or so
Packages. Some of these Packages are part of the core distribution, including e1071
(a sprawling ML package that includes working code for quite a few of
the usual ML categories.)
Revolution Analytics Blog
With particular focus on the posts tagged with Predictive Analytics
ML in R tutorial comprised of slide deck and R code by Josh Reich
A thorough study of the code would, by itself, be an excellent introduction to ML in R.
And one final resource that i think is excellent, but didn't make in the top 5:
A Guide to Getting Stared in Machine Learning [in R]
posted at the blog A Beautiful WWW
Please look at the CRAN Task Views and in particular at the CRAN Task View on Machine Learning and Statistical Learning which summarises this nicely.
Both Matlab and R are good if you are doing matrix-heavy operations. Because they can use highly optimized low-level code (BLAS libraries and such) for this.
However, there is more to data-mining than just crunching matrixes. A lot of people totally neglect the whole data organization aspect of data mining (as opposed to say, plain machine learning).
And once you get to data organization, R and Matlab are a pain. Try implementing an R*-tree in R or matlab to take an O(n^2) algorithm down to O(n log n) runtime. First of all, it totally goes against the way R and Matlab are designed (use bulk math operations wherever possible), secondly it will kill your performance. Interpreted R code for example seems to run at around 50% of the speed of the C code (try R built-in k-means vs. flexclus k-means); and the BLAS libraries are optimized to an insane level, exploiting cache sizes, data alignment, advanced CPU features. If you are adventurous, try implementing a manual matrix multiplication in R or Matlab, and benchmark it against the native one.
Don't get me wrong. There is a lot of stuff where R and matlab are just elegant and excellent for prototyping. You can solve a lot of things in just 10 lines of code, and get a decent performance out of it. Writing the same thing by hand would be hundreds of lines, and probably 10x slower. But sometimes you can optimize by a level of complexity, which for large data sets does beat the optimized matrix operations of R and matlab.
If you want to scale up to "Hadoop size" on the long run, you will have to think about data layout and organization, too, unless all you need is a linear scan over the data. But then, you could just be sampling, too!
Yesterday I found two new books about Data mining. These series of books entitled by ‘Data Mining’ address the need by presenting in-depth description of novel mining algorithms and many useful applications. In addition to understanding each section deeply, the two books present useful hints and strategies to solving problems in the following chapters.The progress of data mining technology and large public popularity establish a need for a comprehensive text on the subject. Books are: “New Fundamental Technologies in Data Mining” here http://www.intechopen.com/books/show/title/new-fundamental-technologies-in-data-mining & “Knowledge-Oriented Applications in Data Mining” here http://www.intechopen.com/books/show/title/knowledge-oriented-applications-in-data-mining These are open access books so you can download it for free or just read on online reading platform like I do. Cheers!
We should not forget the origin sources for these two software: scientific computation and also signal processing leads to Matlab but statistics leads to R.
I used matlab a lot in University since we have one installed on Unix and open to all students. However, the price for Matlab is too high especially compared to free R. If your major focus is not on matrix computation and signal processing, R should work well for your needs.
I think it also depends in which field of study you are. I know of people in coastal research that use a lot of Matlab. Using R in this group would make your life more difficult. If a colleague has solved a problem, you can't use it because he fixed it using Matlab.
I would also look at the capabilities of each when you are dealing with large amounts of data. I know that R can have problems with this, and might be restrictive if you are used to an iterative data mining process. For example looking at multiple models concurrently. I don't know if MATLAB has a data limitation.
I admit to favoring MATLAB for data mining problems, and I give some of my reasoning here:
Why MATLAB for Data Mining?
I will admit to only a passing familiarity with R/S-Plus, but I'll make the following observations:
R definitely has more of a statistical focus than MATLAB. I prefer building my own tools in MATLAB, so that I know exactly what they're doing, and I can customize them, but this is more of a necessity in MATLAB than it would be in R.
Code for new statistical techniques (spatial statistics, robust statistics, etc.) often appears early in S-Plus (I assume that this carries over to R, at least some).
Some years ago, I found the commercial version of R, S-Plus to have an extremely limited capacity for data. I cannot say what the state of R/S-Plus is today, but you may want to check if your data will fit into such tools comfortably.

R and SPSS difference

I will be analysing vast amount of network traffic related data shortly, and will pre-process the data in order to analyse it. I have found that R and SPSS are among the most popular tools for statistical analysis. I will also be generating quite a lot of graphs and charts. Therefore, I was wondering what is the basic difference between these two softwares.
I am not asking which one is better, but just wanted to know what are the difference in terms of workflow between the two (besides the fact that SPSS has a GUI). I will be mostly working with scripts in either case anyway so I wanted to know about the other differences.
Here is something that I posted to the R-help mailing list a while back, but I think that it gives a good high level overview of the general difference in R and SPSS:
When talking about user friendlyness
of computer software I like the
analogy of cars vs. busses:
Busses are very easy to use, you just
need to know which bus to get on,
where to get on, and where to get off
(and you need to pay your fare). Cars
on the other hand require much more
work, you need to have some type of
map or directions (even if the map is
in your head), you need to put gas in
every now and then, you need to know
the rules of the road (have some type
of drivers licence). The big advantage
of the car is that it can take you a
bunch of places that the bus does not
go and it is quicker for some trips
that would require transfering between
busses.
Using this analogy programs like SPSS
are busses, easy to use for the
standard things, but very frustrating
if you want to do something that is
not already preprogrammed.
R is a 4-wheel drive SUV (though
environmentally friendly) with a bike
on the back, a kayak on top, good
walking and running shoes in the
pasenger seat, and mountain climbing
and spelunking gear in the back.
R can take you anywhere you want to go
if you take time to leard how to use
the equipment, but that is going to
take longer than learning where the
bus stops are in SPSS.
There are GUIs for R that make it a bit easier to use, but also limit the functionality that can be used that easily. SPSS does have scripting which takes it beyond being a mere bus, but the general phylosophy of SPSS steers people towards the GUI rather than the scripts.
I work at a company that uses SPSS for the majority of our data analysis, and for a variety of reasons - I have started trying to use R for more and more of my own analysis. Some of the biggest differences I have run into include:
Output of tables - SPSS has basic tables, general tables, custom tables, etc that are all output to that nifty data viewer or whatever they call it. These can relatively easily be transported to Word Documents or Excel sheets for further analysis / presentation. The equivalent function in R involves learning LaTex or using a odfWeave or Lyx or something of that nature.
Labeling of data --> SPSS does a pretty good job with the variable labels and value labels. I haven't found a robust solution for R to accomplish this same task.
You mention that you are going to be scripting most of your work, and personally I find SPSS's scripting syntax absolutely horrendous, to the point that I've stopped working with SPSS whenever possible. R syntax seems much more logical and follows programming standards more closely AND there is a very active community to rely on should you run into trouble (SO for instance). I haven't found a good SPSS community to ask questions of when I run into problems.
Others have pointed out some of the big differences in terms of cost and functionality of the programs. If you have to collaborate with others, their comfort level with SPSS or R should play a factor as you don't want to be the only one in your group that can work on or edit a script that you wrote in the future.
If you are going to be learning R, this post on the stats exchange website has a bunch of great resources for learning R: https://stats.stackexchange.com/questions/138/resources-for-learning-r
The initial workflow for SPSS involves justifying writing a big fat cheque. R is freely available.
R has a single language for 'scripting', but don't think of it like that, R is really a programming language with great data manipulation, statistics, and graphics functionality built in. SPSS has 'Syntax', 'Scripts' and is also scriptable in Python.
Another biggie is that SPSS squeezes its data into a spreadsheety table structure. Dealing with other data structures is probably very hard, but comes naturally to R. I wouldn't know where to start handling network graph type data in SPSS, but there's a package to do it for R.
Also with R you can integrate your workflow with your reporting by using Sweave - you write a document with embedded bits of R code that generate plots or tables, run the file through the system and out comes the report as a PDF. Great for when you want to do a weekly report, or you do a body of work and then the boss gives you an updated data set. Re-run, read it over, its done.
But you know, your call...
Well, are you a decent programmer? If you are, then it's worthwhile to learn R. You can do more with your data, both in terms of manipulation and statistical modeling, than you can with SPSS, and your graphs will likely be better too. On the other hand, if you've never really programmed before, or find the idea of spending several months becoming a programmer intimidating, you'll probably get more value out of SPSS. The level of stuff that you can do with R without diving into its power as a full-fledged programming language probably doesn't justify the effort.
There's another option -- collaborate. Do you know someone you can work with on your project (you don't say whether it's academic or industry, but either way...), who knows R well?
There's an interesting (and reasonably fair) comparison between a number of stats tools here
http://anyall.org/blog/2009/02/comparison-of-data-analysis-packages-r-matlab-scipy-excel-sas-spss-stata/
I work with both in a company and can say the following:
If you have a large team of different people (not all data scientists), SPSS is useful because it is plain (relatively) to understand. For example, if users are going to run a model to get an output (sales estimates, etc), SPSS is clear and easy to use.
That said, I find R better in almost every other sense:
R is faster (although, sometimes debatable)
As stated previously, the syntax in SPSS is aweful (I can't stress this enough). On the other hand, R can be painful to learn, but there are tons of resources online and in the end it pays much more because of the different things you can do.
Again, like everyone else says, the sky is the limit with R. Tons of packages, resources and more importantly: indepedence to do as you please. In my organization we have some very high level functions that get a lot done. The hard part is creating them once, but then they perform complicated tasks that SPSS would tangle in a never ending web of canvas. This is specially true for things like loops.
It is often overlooked, but R also has plenty of features to cooperate between teams (github integration with RStudio, and easy package building with devtools).
Actually, if everyone in your organization knows R, all you need is to maintain a basic package on github to share everything. This of course is not the norm, which is why I think SPSS, although a worst product, still has a market.
I have not data for it, but from my experience I can tell you one thing:
SPSS is a lot slower than R. (And with a lot, I really mean a lot)
The magnitude of the difference is probably as big as the one between C++ and R.
For example, I never have to wait longer than a couple of seconds in R. Using SPSS and similar data, I had calculations that took longer than 10 minutes.
As an unrelated side note: In my eyes, in the recent discussion on the speed of R, this point was somehow overlooked (i.e., the comparison with SPSS). Furthermore, I am astonished how this discussion popped up for a while and silently disappeared again.
There are some great responses above, but I will try to provide my 2 cents. My department completely relies on SPSS for our work, but in recent months, I have been making a conscious effort to learn R; in part, for some of the reasons itemized above (speed, vast data structures, available packages, etc.)
That said, here are a few things I have picked up along the way:
Unless you have some experience programming, I think creating summary tables in CTABLES destroys any available option in R. To date, I am unaware package that can replicate what can be created using Custom Tables.
SPSS does appear to be slower when scripting, and yes, SPSS syntax is terrible. That said, I have found that scipts in SPSS can always be improved but using the EXECUTE command sparingly.
SPSS and R can interface with each other, although it appears that it's one way (only when using R inside of SPSS, not the other way around). That said, I have found this to be of little use other than if I want to use ggplot2 or for some other advanced data management techniques. (I despise SPSS macros).
I have long felt that "reporting" work created in SPSS is far inferior to other solutions. As mentioned above, if you can leverage LaTex and Sweave, you will be very happy with your efficient workflows.
I have been able to do some advanced analysis by leveraging OMS in SPSS. Almost everything can be routed to a new dataset, but I have found that most SPSS users don't use this functionality. Also, when looking at examples in R, it just feels "easier" than using OMS.
In short, I find myself using SPSS when I can't figure it out quickly in R, but I sincerely have every intention of getting away from SPSS and using R entirely at some point in the near future.
SPSS provides a GUI to easily integrate existing R programs or develop new ones. For more info, see the SPSS Community on IBM Developer Works.
#Henrik, I did the same task you have mentioned (C++ and R) on SPSS. And it turned out that SPSS is faster compared to R on this one. In my case SPSS is aprox. 7 times faster. I am surprised about it.
Here is a code I used in SPSS.
data list free
/x (f8.3).
begin data
1
end data.
comp n = 1e6.
comp t1 = $time.
loop #rep = 1 to 10.
comp x = 1.
loop #i=1 to n.
comp x = 1/(1+x).
end loop.
end loop.
comp t2 = $time.
comp elipsed = t2 - t1.
form elipsed (f8.2).
exe.
Check out this video why is good to combine SPSS and R...
Link
http://bluemixanalytics.wordpress.com/2014/08/29/7-good-reasons-to-combine-ibm-spss-analytics-and-r/
If you have a compatible copy of R installed, you can connect to it from IBM SPSS Modeler and carry out model building and model scoring using custom R algorithms that can be deployed in IBM SPSS Modeler. You must also have a copy of IBM SPSS Modeler - Essentials for R installed. IBM SPSS Modeler - Essentials for R provides you with tools you need to start developing custom R applications for use with IBM SPSS Modeler.
The truth is: both packages are useful if you do data analysis professionally. Sure, R / RStudio has more statistical methods implemented than SPSS. But SPSS is much easier to use and gives more information per each button click. And, therefore, it is faster to exploit whenever a particular analysis is implemented in both R and SPSS.
In the modern age, neither CPU nor memory is the most valuable resource. Researcher's time is the most valuable resource. Also, tables in SPSS are more visually pleasing, in my opinion.
In summary, R and SPSS complement each other well.

Where can I find useful R tutorials with various implementations?

I'm using R language and the manuals on the R site are really informative. However, I'd like to see some more examples and implementations with R which can help me develop my knowledge faster. Any suggestions?
Just to add some more
Programming in R
INTRODUCTION TO STATISTICAL MODELLING IN R
Linear algebra in R
The R Inferno
R by example
The R Clinic
Survey analysis in R
R & Bioconductor Manual
Rtips
Resources to help you learn and use R
General R Links
I'll mention a few that i think are excellent resources but that i haven't seen mentioned on SO. They are all free and freely available on the Web (links supplied).
Data Analysis Examples
A collection of individual examples from the UCLA Statistics Dept. which you can browse by major category (e.g., "Count Models", "Multivariate Analysis", "Power Analysis") then download examples with complete R code under any of these rubrics (e.g., under "Count Models" are "Poisson Regression", "Negative Binomial Regression", and so on).
Verzani: SimpleR: Using R for Introductory Statistics A little over 100 pages, and just outstanding. It's easy to follow but very dense. It is a few years old, still i've only found one deprecated function in this text. This is a resource for a brand new R user; it also happens to be an excellent statistics refresher. This text probably contains 20+ examples (with R code and explanation) directed to fundamental statistics (e.g., hypothesis testing, linear regression, simple simulation, and descriptive statistics).
Statistics with R (Vincent Zoonekynd) You can read it online or print it as a pdf. Printed it's well over 1000 pages. The author obviously got a lot of the information by reading the source code for the various functions he discusses--a lot of the information here, i haven't found in any other source. This resource contains large sections on Graphics, Basic Statistics, Regression, Time Series--all w/ small examples (R code + explanation). The final three sections contain the most exemplary code--very thorough application sections on Finance (which seems to be the author's professional field), Genetics, and Image Analysis.
All the packages on CRAN are open source, so you can download all the source code from there. I recommend starting there by looking at the packages you use regularly to see how they're implemented.
Beyond that, Rosetta Code has many R examples. And you may want to follow R-Bloggers.
Book like tutorials
Book like tutorials are usually spread in the form of PDF. Many of them are available on the R-project homepage here:
http://cran.r-project.org/other-docs.html#english
(This link includes many of the texts others have mentioned)
Article like tutorials
These are usually present inside blogs. The biggest list of R-bloggers I know of exists here:
http://www.r-bloggers.com/
And many of these bloggers posts (many of which are tutorials) are listed here:
http://www.r-bloggers.com/archive/
(although inside each blog there are usually more tutorials).

Resources