Where can I find useful R tutorials with various implementations?

I'm using R language and the manuals on the R site are really informative. However, I'd like to see some more examples and implementations with R which can help me develop my knowledge faster. Any suggestions?

Just to add some more:
Programming in R
INTRODUCTION TO STATISTICAL MODELLING IN R
Linear algebra in R
The R Inferno
R by example
The R Clinic
Survey analysis in R
R & Bioconductor Manual
Rtips
Resources to help you learn and use R
General R Links

I'll mention a few that I think are excellent resources but that I haven't seen mentioned on SO. They are all free and freely available on the Web (links supplied).
Data Analysis Examples
A collection of individual examples from the UCLA Statistics Dept., which you can browse by major category (e.g., "Count Models", "Multivariate Analysis", "Power Analysis") and then download with complete R code under any of these rubrics (e.g., under "Count Models" are "Poisson Regression", "Negative Binomial Regression", and so on).
Verzani: SimpleR: Using R for Introductory Statistics. A little over 100 pages, and just outstanding. It's easy to follow but very dense. It is a few years old, yet I've only found one deprecated function in this text. This is a resource for a brand-new R user; it also happens to be an excellent statistics refresher. This text probably contains 20+ examples (with R code and explanation) directed at fundamental statistics (e.g., hypothesis testing, linear regression, simple simulation, and descriptive statistics).
Statistics with R (Vincent Zoonekynd). You can read it online or print it as a PDF; printed, it's well over 1000 pages. The author obviously got a lot of the information by reading the source code of the various functions he discusses--a lot of the information here I haven't found in any other source. This resource contains large sections on Graphics, Basic Statistics, Regression, and Time Series--all with small examples (R code + explanation). The final three sections contain the most exemplary code--very thorough application sections on Finance (which seems to be the author's professional field), Genetics, and Image Analysis.

All the packages on CRAN are open source, so you can download all the source code from there. I recommend starting there by looking at the packages you use regularly to see how they're implemented.
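If you just want to peek at how a particular function is implemented, a couple of one-liners go a long way. A minimal sketch (the package and function names here are only examples, not specific recommendations):

getAnywhere("median.default")                 # print the source of a function, including non-exported methods
download.packages("xtable", destdir = ".",    # fetch a package's full source tarball from CRAN
                  type = "source")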
Beyond that, Rosetta Code has many R examples. And you may want to follow R-Bloggers.

Book-like tutorials
Book-like tutorials are usually distributed as PDFs. Many of them are available on the R-project homepage here:
http://cran.r-project.org/other-docs.html#english
(This link includes many of the texts others have mentioned)
Article-like tutorials
These are usually found on blogs. The biggest list of R bloggers I know of exists here:
http://www.r-bloggers.com/
And many of these bloggers' posts (many of which are tutorials) are listed here:
http://www.r-bloggers.com/archive/
(although inside each blog there are usually more tutorials).

Related

Importing quiz questions created using the R exams package into canvas

I have been using the R exams package to create exams for my introductory statistics course this semester. It is really a great tool! I've been able to create several questions from scratch and import them to Canvas without issue. However, there are some questions that give me problems when I try to import them (e.g., the anova and boxplot examples that are included in the package). I can successfully import if I use:
R> library("exams")
R> set.seed(1)
R> exams2canvas("anova.Rmd")
However, I sometimes run into problems when trying to create many versions of the same question:
R> library("exams")
R> exams2canvas("anova.Rmd", n=50)
TL;DR
The source of the problems is multiple-choice exercises with no correct alternative. These are not supported by learning management systems like Canvas or Moodle, and hence exercises for these systems must ensure at least one correct alternative and one wrong alternative.
Demo exercises
Some of the demo exercises in R/exams did not restrict the number of correct/wrong alternatives to a minimum of one, so from time to time it could happen that no alternative was correct. Up to version 2.3-6 of R/exams this affects the following exercises:
anova,
automaton,
boxplots,
cholesky,
relfreq,
scatterplot.
All of these have been adapted in version 2.4-0 (which was the development version of the package at the time of writing this answer).
Background
Multiple-choice exercises without any correct alternative are straightforward to handle when no partial credit is given and the entire answer pattern must be fully correct. However, when partial credit is used, no positive points can be obtained if there is no correct alternative.
When we created the demo exercises in R/exams, we adapted exercises from an environment where we did not use partial credit. But learning management systems like Moodle or Canvas expect at least one correct (and typically also one wrong) alternative in order to score an exercise correctly with partial credit.
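For anyone writing their own exercises, the fix boils down to redrawing the question data until the answer pattern contains at least one correct and at least one wrong alternative. A minimal sketch of that pattern inside an exercise's data-generation chunk (the sampling line is only a placeholder for the exercise's real logic):

repeat {
  sol <- runif(5) > 0.5            # placeholder: the exercise's own TRUE/FALSE answer pattern
  if (any(sol) && any(!sol)) break # require at least one correct and one wrong alternative
}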

How to make publishable tables and plots using R? [duplicate]

There are a range of tools available for creating publication quality tables using R, Sweave, and LaTeX.
In particular, there are helper functions like latex in the Hmisc package, and xtable in the xtable package. I've also often written my own code so that I could have complete control over table formatting (e.g., see this example).
However, when preparing publication quality tables a range of issues often arise:
how and when to apply numeric formatting
how to precisely control alignment of columns and cells
how to precisely control cell borders
how to convert variable labels to variable names
and so on
Beyond the high level issues of specifying the desired table format, there are issues of implementation.
When should a helper function such as xtable be used?
Which helper function should be used in a given situation?
How can the default output of helper functions be customised to particular requirements?
Question
It seems to me that the above issues are deserving of a detailed textbook-style introduction.
Are there any online or offline resources that provide a detailed overview of how to produce publication quality tables using R, Sweave, and LaTeX, and that address the issues discussed above?
Just to tie this up with a nice little bow: at the time of writing, the best existing tutorials on publication-quality tables and usage scenarios appear to be an amalgamation of these documents:
A Sweave example (source)
The Joy of Sweave: A Beginner's Guide to Reproducible Research with Sweave (source)
Latex and R via Sweave: An example document how to use Sweave (source)
Sweave = R · LaTeX2 (source)
The xtable gallery (source)
The Sweave Homepage
LaTeX documentation
Going beyond the scope of what currently exists, you may want to ask the author of The Joy of Sweave for a document on publication-quality tables specifically. It seems like he's gone above and beyond this problem in his research. In addition to the questions you've raised, this space specifically could use a style guide that, flatly, does not currently exist.
And, as mentioned in the question errata, this is a perfect example of a question for https://tex.stackexchange.com/. I encourage you to continue to ask specific questions there when you run into any difficulties in your current projects.
The stargazer package can create publication-quality tables - including ones using templates designed to resemble existing academic journals - from commonly used R statistical functions and packages (lm, glm, plm, svyglm, survival, pscl, AER, and others). It is also good for creating summary statistics tables, and can directly output data frame content as well.
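A minimal sketch of typical stargazer usage (the models, title, and label are illustrative, not taken from the question):

library(stargazer)
m1 <- lm(mpg ~ wt + hp, data = mtcars)
m2 <- lm(mpg ~ wt + hp + cyl, data = mtcars)
stargazer(m1, m2, type = "latex",         # LaTeX output for a Sweave/LaTeX document
          title = "Regression results",
          label = "tab:models")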
There is a tabular function in the tables package which addresses formatting, alignment, and label operations. The package has a vignette that is a good starting point.
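A minimal sketch adapted from that vignette (iris is used here only as a stand-in dataset):

library(tables)
tabular((Species + 1) ~ (n = 1) + Format(digits = 2) *
          (Sepal.Length + Sepal.Width) * (mean + sd),
        data = iris)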
xtable has worked fine for me so far.
In combination with siunitx, and when necessary, longtable, it can produce pretty effective tables, in my opinion. With packages like booktabs and caption, the aesthetics can be pleasing too.
I am not sure this level of detail was asked for by the OP, but for what it's worth, the basic implementation could be something along these lines: https://tex.stackexchange.com/questions/41067/caption-for-longtable-in-sweave/41183#41183 (my own answer to another question).
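For reference, a minimal xtable sketch in that spirit (the data, caption, and label are illustrative; booktabs = TRUE assumes a reasonably recent xtable version):

library(xtable)
xt <- xtable(head(mtcars[, 1:4]),
             caption = "Example table", label = "tab:cars")
print(xt, booktabs = TRUE)   # emit \toprule/\midrule/\bottomrule rules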
I highly recommend ConTeXt, which makes use of the TABLE package. There is a Table overview on contextgarden and an exhaustive manual.

Are there any guidelines for when reproducible code should be included into a publication?

Given the emphasis on reproducible science, I was wondering if my recent work warrants the inclusion of example code in the publication. The datasets that I am using are quite big, so it wouldn't necessarily make sense to publish those. However, the statistical methods that I apply within R are not generally known to my audience (although I would think that they should be).
I'm using empirical orthogonal function analysis (EOF) and generalized additive models (GAM) within my analysis. GAM, in particular, is widely used in ecological studies, but less so within the physical sciences - my work spans both disciplines.
I definitely refer to the R packages that I use, and it wouldn't really be difficult for a reviewer / reader to look for those references (and included examples) themselves. So, my question is, what situations are most appropriate for the inclusion of reproducible code in a publication?
Code is the most accurate representation of what you actually did. Therefore, in my view you should always aim to publish code alongside your article.
However, editor resistance to this is pretty strong. The fear is that if the reviewer had access to the code, then the journal looks pretty bad if a substantive coding mistake is later found. This is not a hypothetical fear, given the Levitt paper, etc.
Knuth has some strong views on literate programming that you should be able to cite as justification. If you can't convince the journal to accept your code as an integral piece of the publication, consider publishing it on your personal website (the approach taken by e.g. Raj Chetty for many of his papers) or publish it as an R package.
Finally, here's a note I wrote to my programming students:
Consider publishing your code. Doing so will act as a commitment device which will encourage good habits--habits that make your own work easier. Publishing your code also makes it easier for others to extend your analysis, which can result in more citations of your work. Releasing your code is good academic practice as well: it is the truest testament to your analysis. And offering your program to the world shows off the beautiful coding skills which you are about to acquire.
A basic tenet of science is reproducibility. So the answer would be to "include" the code required to conduct your analysis with every paper/publication that is based on data analysis.
I say "include" because you don't need to put the R code directly into the paper. Many if not most journals allow supplementary material, which is one option. Alternatively, supply your script to one of the many science data archiving sites (such as Figshare) and then (and here is the killer!) cite your own script using the DOI that Figshare assigns to your deposited script. If you can post the data too, all the better; Figshare doesn't really care too much about big data sets.
The above applies where you are using other packages and your R script does things like loading and formatting data, calling functions from other packages, and then plotting or displaying output/results. If you have developed new R code to implement a particular method, then I would say package the code as an R package and submit that to CRAN or R-Forge or something like that.
From your description, the former (deposit the analysis script in a repo) would be most appropriate.
We recently had a discussion at our research institute regarding reproducible research. The impetus came from the Nature editorial (http://arstechnica.com/science/2012/02/science-code-should-be-open-source-according-to-editorial/) which argued that all your code should be published. I wholeheartedly agree with this. Even though your dataset is very big, publishing the R code that you used to create your results makes it crystal clear what you did. Oftentimes the methods section of a paper does not contain sufficient detail to reproduce the results; the code is quite a help in that case.

Can I perform Generalized Iterative Scaling in R?

I'm looking to port our home-grown platform of various machine learning algorithms from C# to a more robust data mining platform such as R. While it's obvious R is great at many types of data mining tasks, it is not clear to me if it can be used for text classification.
Specifically, we extract a list of bigrams from the text and then classify it into one of 15 different categories, e.g.:
Bigram list: jewelry, books, watches, shoes, department store
-> Category: Shopping
We'd want to both train the models in R as well as hook up to a database to perform this on a larger scale.
Can it be done in R?
Hmm, I am only just starting to look into machine learning myself, but I might have a suggestion: have you considered Weka? There's a bunch of algorithms available, and there is some documentation. Plus, there is an R package, RWeka, that makes use of the Weka jars.
EDIT:
There is also a nice, comprehensive read by Witten et al., Data Mining, which contains an extensive description of Weka among other interesting things. Look into the API opportunities.
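A minimal sketch of calling a Weka learner from R through RWeka (iris stands in here for the bigram/category data described in the question):

library(RWeka)
fit <- J48(Species ~ ., data = iris)       # Weka's C4.5 decision tree via RWeka
table(predict(fit, iris), iris$Species)    # confusion matrix on the training data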

Comparing R to Matlab for Data Mining

Instead of starting to code in Matlab, I recently started learning R, mainly because it is open source. I am currently working in the data mining and machine learning field. I found many machine learning algorithms implemented in R, and I am still exploring the different packages.
I have a quick question: how do you compare R to Matlab for data mining applications, their popularity, pros and cons, industry and academic acceptance, etc.? Which one would you choose and why?
I went through various comparisons of Matlab vs. R against various metrics, but I am specifically interested in answers about their applicability to data mining and ML.
Since both languages are pretty new to me, I was just wondering if R would be a good choice or not.
I appreciate any kind of suggestions.
For the past three years or so, I have used R daily, and the largest portion of that daily use is spent on machine learning/data mining problems.
I was an exclusive Matlab user while at university; at the time I thought it was an excellent set of tools/platform. I am sure it is today as well. The Neural Network Toolbox, the Optimization Toolbox, the Statistics Toolbox, and the Curve Fitting Toolbox are each highly desirable (if not essential) for someone using MATLAB for ML/data mining work, yet they are all separate from the base MATLAB environment--in other words, they have to be purchased separately.
My Top 5 list for Learning ML/Data Mining in R:
Mining Association Rules in R
This refers to a couple of things: first, a group of R packages whose names all begin with arules (available from CRAN); you can find the complete list (arules, arulesViz, etc.) on the project homepage. Second, all of these packages are based on a data-mining technique known as Market-Basket Analysis, alternatively called Association Rules. In many respects, this family of algorithms is the essence of data mining--exhaustively traverse large transaction databases and find above-average associations or correlations among the fields (variables or features) in those databases. In practice, you connect them to a data source and let them run overnight. The central R package in the set mentioned above is called arules; on the CRAN package page for arules, you will find links to a couple of excellent secondary sources (vignettes, in R's lexicon) on the arules package and on the Association Rules technique in general.
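A minimal sketch of that workflow (the support/confidence thresholds are illustrative):

library(arules)
data(Groceries)                              # transaction data shipped with arules
rules <- apriori(Groceries,
                 parameter = list(supp = 0.01, conf = 0.5))
inspect(head(sort(rules, by = "lift"), 3))   # show the strongest rules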
The standard reference, The Elements of Statistical Learning by Hastie et al.
The most current edition of this book is available in digital form for free. Likewise, at the book's website (linked to just above), all data sets used in ESL are available for free download. (As an aside, I have the free digital version; I also purchased the hardback version from BN.com; all of the color plots in the digital version are reproduced in the hardbound version.) ESL contains thorough introductions to at least one exemplar from most of the major ML rubrics--e.g., neural networks, SVM, KNN; unsupervised techniques (LDA, PCA, MDS, SOM, clustering); numerous flavors of regression; CART; Bayesian techniques; as well as model aggregation techniques (boosting, bagging) and model tuning (regularization). Finally, get the R package that accompanies the book from CRAN (which will save the trouble of having to download and enter the datasets).
CRAN Task View: Machine Learning
The 3,500+ packages available for R are divided up by domain into about 30 package families or "Task Views". Machine Learning is one of these families. The Machine Learning Task View contains about 50 or so packages. Some of these packages are part of the core distribution, including e1071 (a sprawling ML package that includes working code for quite a few of the usual ML categories).
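As a taste of e1071, a minimal sketch fitting one of those usual ML categories, a support vector machine, with default settings (iris is used only as a convenient built-in dataset):

library(e1071)
fit <- svm(Species ~ ., data = iris)
table(predicted = predict(fit, iris), actual = iris$Species)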
Revolution Analytics Blog
With particular focus on the posts tagged with Predictive Analytics
ML in R tutorial, comprising a slide deck and R code, by Josh Reich
A thorough study of the code would, by itself, be an excellent introduction to ML in R.
And one final resource that I think is excellent, but didn't make it into the top 5:
A Guide to Getting Started in Machine Learning [in R]
posted at the blog A Beautiful WWW
Please look at the CRAN Task Views and in particular at the CRAN Task View on Machine Learning and Statistical Learning which summarises this nicely.
Both Matlab and R are good if you are doing matrix-heavy operations, because they can use highly optimized low-level code (BLAS libraries and such) for this.
However, there is more to data mining than just crunching matrices. A lot of people totally neglect the whole data organization aspect of data mining (as opposed to, say, plain machine learning).
And once you get to data organization, R and Matlab are a pain. Try implementing an R*-tree in R or Matlab to take an O(n^2) algorithm down to O(n log n) runtime. First of all, it totally goes against the way R and Matlab are designed (use bulk math operations wherever possible); secondly, it will kill your performance. Interpreted R code, for example, seems to run at around 50% of the speed of C code (try R's built-in k-means vs. flexclust's k-means), and the BLAS libraries are optimized to an insane level, exploiting cache sizes, data alignment, and advanced CPU features. If you are adventurous, try implementing a manual matrix multiplication in R or Matlab and benchmark it against the native one.
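A minimal sketch of that experiment (the matrix size is illustrative; the gap widens quickly as n grows):

n <- 200
A <- matrix(rnorm(n * n), n)
B <- matrix(rnorm(n * n), n)
naive_mult <- function(A, B) {       # triple loop in interpreted R
  C <- matrix(0, nrow(A), ncol(B))
  for (i in seq_len(nrow(A)))
    for (j in seq_len(ncol(B)))
      for (k in seq_len(ncol(A)))
        C[i, j] <- C[i, j] + A[i, k] * B[k, j]
  C
}
system.time(naive_mult(A, B))        # interpreted loops
system.time(A %*% B)                 # native, BLAS-backed multiply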
Don't get me wrong. There is a lot of stuff where R and Matlab are just elegant and excellent for prototyping. You can solve a lot of things in just 10 lines of code and get decent performance out of it. Writing the same thing by hand would take hundreds of lines and probably be 10x slower. But sometimes you can reduce the algorithmic complexity by a whole level, which for large data sets does beat the optimized matrix operations of R and Matlab.
If you want to scale up to "Hadoop size" in the long run, you will have to think about data layout and organization too, unless all you need is a linear scan over the data. But then again, you could just be sampling, too!
Yesterday I found two new books about data mining. This series of books, entitled "Data Mining", addresses the need by presenting in-depth descriptions of novel mining algorithms and many useful applications. In addition to covering each topic in depth, the two books present useful hints and strategies for solving problems in the subsequent chapters. The progress of data mining technology and its broad popularity establish a need for a comprehensive text on the subject. The books are "New Fundamental Technologies in Data Mining" (http://www.intechopen.com/books/show/title/new-fundamental-technologies-in-data-mining) and "Knowledge-Oriented Applications in Data Mining" (http://www.intechopen.com/books/show/title/knowledge-oriented-applications-in-data-mining). These are open access books, so you can download them for free or just read them on an online reading platform like I do. Cheers!
We should not forget the origins of these two pieces of software: scientific computation and signal processing lead to Matlab, while statistics leads to R.
I used Matlab a lot at university, since we had it installed on Unix and open to all students. However, the price of Matlab is too high, especially compared to free R. If your major focus is not matrix computation and signal processing, R should work well for your needs.
I think it also depends on your field of study. I know of people in coastal research who use a lot of Matlab. Using R in such a group would make your life more difficult: if a colleague has solved a problem, you can't reuse the solution because it was written in Matlab.
I would also look at the capabilities of each when you are dealing with large amounts of data. I know that R can have problems with this, and it might be restrictive if you are used to an iterative data mining process, for example looking at multiple models concurrently. I don't know whether MATLAB has a data limitation.
I admit to favoring MATLAB for data mining problems, and I give some of my reasoning here:
Why MATLAB for Data Mining?
I will admit to only a passing familiarity with R/S-Plus, but I'll make the following observations:
R definitely has more of a statistical focus than MATLAB. I prefer building my own tools in MATLAB, so that I know exactly what they're doing, and I can customize them, but this is more of a necessity in MATLAB than it would be in R.
Code for new statistical techniques (spatial statistics, robust statistics, etc.) often appears early in S-Plus (I assume this carries over to R, at least to some extent).
Some years ago, I found S-Plus (the commercial counterpart of R) to have an extremely limited capacity for data. I cannot say what the state of R/S-Plus is today, but you may want to check whether your data will fit into such tools comfortably.
