R provides a number of tests for various applications. Often the literature contains many methods for a given test; a test of a population proportion, for example, has at least half a dozen candidate procedures. Each of these has various properties, and they may make different assumptions.
The base R documentation does not always say precisely which method is used for a test. For example, ?prop.test does not mention whether the Wald method, the Wilson method, or some other method is used.
Is this information documented somewhere? How can I find out more about which methods are being used by a particular test in R?
One option is to view and dissect the source code: getAnywhere(prop.test)
While possibly tedious, it gives an unambiguous explanation for what is actually happening when you run a function.
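For example, a short sketch of ways to pull up the relevant source, using standard base R tools:

    getAnywhere(prop.test)   # prints the full source of prop.test, wherever it is defined
    stats::prop.test         # equivalent here, since prop.test is exported from stats
    body(prop.test)          # just the function body, without the formal arguments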
From what I understand, it is usually difficult to select the best possible clustering method for your data a priori, so we can use cluster validity measures to compare the results of different clustering algorithms and choose the one with the best validation scores.
I use an internal validation function from the R stats package on my clustering results (for the clustering methods, I used igraph's fast.greedy and walk.trap).
The outcome is a list of many validation scores.
In the list, the fast greedy method has better scores than walk trap on almost every validation measure; the one exception is entropy, where the walk trap method scores better.
Can I use this list of validation results as one of my reasons to explain to others why I chose the fast greedy method rather than the walk trap method?
Also, is there any way to validate a disconnected graph?
Short answer: NO!
You can't use an internal index to justify the choice of one algorithm over another. Why?
Because evaluation indexes were designed to evaluate clustering results, i.e., partitions and hierarchies. You can only use them to assess the quality of a clustering and therefore justify its choice over the other options. But again, you can't use them to justify choosing a particular algorithm to apply to a different dataset based on a single previous experiment.
For this task, several benchmarks are needed to determine which algorithms are generally better and should be tried first. Here is a paper about it: Community detection algorithms: a comparative analysis.
Edit: What I am saying is, your validation indexes may show that fast.greedy's solution is better than walk.trap's. However, they do not explain why you chose these algorithms instead of any others. Only your data, your assumptions, and your constraints could do that.
Also, is there any way to validate a disconnected graph?
Theoretically, any evaluation index can do this. Technically, some implementations don't handle disconnected components.
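As a rough sketch (not based on the poster's data), comparing the two algorithms on the same graph with an internal criterion such as modularity, and handling a disconnected graph component by component, might look like this:

    library(igraph)

    set.seed(1)
    g <- sample_gnp(100, 0.02)              # toy random graph, likely disconnected

    fg <- cluster_fast_greedy(g)            # fast greedy community detection
    wt <- cluster_walktrap(g)               # walktrap community detection
    modularity(fg)                          # internal score for the fast greedy partition
    modularity(wt)                          # internal score for the walktrap partition

    ## one option for a disconnected graph: validate each component separately
    comp <- components(g)
    sub  <- induced_subgraph(g, which(comp$membership == 1))
    modularity(cluster_fast_greedy(sub))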
I know that it is possible to export/import an h2o model that was previously trained.
My question is: is there a way to transform an h2o model into a non-h2o one (that just works in plain R)?
I mean that I don't want to launch the h2o environment (JVM), since I know that predicting with a trained model is simply a matter of multiplying matrices, applying an activation function, etc.
Of course it would be possible to extract weights manually etc., but I want to know if there is any better way to do it.
I do not see any previous posts on SA about this problem.
No.
Remember that R is just the client, sending API calls: the algorithms (those matrix multiplications, etc.) are all implemented in Java.
What they do offer is a POJO, which is what you are asking for, but in Java. (POJO stands for Plain Old Java Object.) If you call h2o.download_pojo() on one of your models you will see it is quite straightforward. It may even be possible to write a script to convert it to R code? (Though it might be better, if you were going to go to that trouble, to convert it to C++ code, and then use Rcpp!)
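A minimal sketch of the POJO download, assuming `model` is an H2O model you have already trained in the current session:

    library(h2o)
    h2o.init()
    ## ... train `model` with h2o.gbm(), h2o.deeplearning(), etc. ...
    h2o.download_pojo(model, path = getwd())   # writes the model as a .java source file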
Your other option, in the case of deep learning, is to export the weights and biases, implement your own activation function, and use them directly.
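A rough sketch of that manual route, assuming `model` is a trained h2o.deeplearning model with Tanh activation and `x` is an already-standardized input vector; the exact weight-matrix layout and preprocessing should be checked against the H2O documentation for your version:

    W1 <- as.matrix(h2o.weights(model, matrix_id = 1))   # weights into hidden layer 1
    b1 <- as.matrix(h2o.biases(model, vector_id = 1))    # biases of hidden layer 1
    h1 <- tanh(W1 %*% x + b1)                            # activations of hidden layer 1
    ## ... repeat with the remaining weight matrices, then the output layer ...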
But, personally, I've never found the Java side to be a bottleneck, either from the point of view of dev ops (install is easy) or computation (the Java code is well optimized).
I'm developing a major upgrade to my R package, and as part of the changes I want to start using S3 methods so I can use the generic plot, summary and print functions. But I'm not totally sure I understand why and when to use generic functions in general.
For example, I currently have a function called logLikSSM, which computes the log-likelihood of a state space model. Instead of using this function, I could make a function logLik.SSM or something like that, as there is a generic function logLik in R. The benefit of this would be that logLik is shorter to write than logLikSSM, but is there really any other point in this?
Similarly, there is a generic function called simulate in the stats package, so in theory I could use that instead of simulateSSM. But the description of the simulate function says it is used to "Simulate Responses", whereas my function actually simulates the hidden states, so it really doesn't fit the description. So in this case I probably shouldn't use the generic function, right?
I apologize if this question is too vague for here.
The advantages of creating methods for generics from the core of R include:
Ease of Use. Users of your package who are already familiar with those generics will have less to remember, making your package easier to use. They might even be able to do a certain amount without reading the documentation. If you come up with your own names, then they must discover and remember new names, which is an added cognitive burden.
Leverage Existing Functionality. Any other functions that make use of the generics you create methods for can then automatically use your methods as well; otherwise, they would have to be changed. For example, AIC uses logLik, so providing a logLik method is enough for AIC to work on your objects (see the sketch below).
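A minimal sketch of that point, using a made-up "SSM" object with a hypothetical $loglik field: once a logLik method exists, AIC works on the object with no extra code.

    ## hypothetical fitted object of class "SSM"
    fit <- structure(list(loglik = -123.4, npar = 5, n = 200), class = "SSM")

    logLik.SSM <- function(object, ...) {
      val <- object$loglik
      attr(val, "df")   <- object$npar    # number of estimated parameters
      attr(val, "nobs") <- object$n       # number of observations
      class(val) <- "logLik"
      val
    }

    logLik(fit)   # dispatches to logLik.SSM
    AIC(fit)      # works automatically because AIC() is built on logLik()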
A disadvantage is that the generic involves the extra level of dispatch and if logLik is in the inner loop of an optimization there could be an impact (although possibly not material). In that case you could check the performance of calling the generic vs. calling the method directly and use the latter if it makes a significant difference.
If your function has a completely different purpose than the generic in the core of R, then a method might be more confusing than helpful, so in that case you might not create a method but keep your own function name.
You might want to read the zoo Design manual (see link to zoo Design under Vignettes near the bottom of that page) which discusses the design ideas that went into the zoo package. These include the idea being discussed here.
EDIT: Added disadvantages.
Good question.
I'll split your Question into two parts; here's the first one:
i]s there really any other point in [making functions generic]?
Well, this pattern is usually invoked when the developer doesn't know the class of every object a user might pass in to the function under consideration.
Because of this uncertainty, this design pattern (called overloading in many other languages) is used; it requires R to evaluate the object's class and then dispatch the object to the appropriate method for that type.
The second part of your Question: [i]n this case I shouldn't use [the generic function] right?
To try to give you an answer useful beyond the detail of your Question, consider what happens to the original method when you call setGeneric, passing that method in.
The original function body is replaced with code that performs a top-level dispatch based on the type of the object passed in; the original body slides down one level and becomes the default method that the top-level (generic) function dispatches to.
showMethods() will let you see all of those methods which are called by the newly created dispatch function (generic function).
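A small sketch of that mechanism, reusing the simulateSSM name from the question with a toy body:

    simulateSSM <- function(object, nsim = 1, ...) {
      replicate(nsim, mean(object))       # stand-in for the real simulation code
    }

    setGeneric("simulateSSM")             # the existing body becomes the default method
    showMethods("simulateSSM")            # lists the methods the new generic dispatches to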
And now for one huge disadvantage:
Ease of MISUse:
Users of your package already familiar with those generics might do a certain amount without reading the documentation.
And therein lies the fallacy that components, reusable objects, services, etc are an easy panacea for all software challenges.
And why the overwhelming majority of software is buggy, bloated, and operates inconsistently with little hope of tech support being able to diagnose your problem.
There WAS a reason for static linking and small executables back in the day. But this generation of code-now, get-paid-now, debug-later-if-ever (before the layoffs/IPO come) has no memory of the days when code actually worked reliably and installation/integration didn't require $200/hr Big 4 consultants or hackers who spend a week trying to get some "simple" open source product installed and running productively.
But if you want to continue the tradition of writing ever shorter function/method names, be my guest.
I am a statistics graduate student who works a lot with R. I am familiar with OOP in other programming contexts. I even see its use in various statistical packages that define new classes for storing data.
At this stage in my graduate career, I am usually coding some algorithm for some class assignment--something that takes in raw data and gives some kind of output. I would like to make it easier to reuse code, and establish good coding habits, especially before I move on to more involved research. Please offer some advice on how to "think OOP" when doing statistical programming in R.
I would argue that you shouldn't. Try to think about R in terms of a workflow. There are some useful workflow suggestions on this page:
Workflow for statistical analysis and report writing
Another important consideration is line-by-line analysis vs. reproducible research. There's a good discussion here:
writing functions vs. line-by-line interpretation in an R workflow
Two aspects of OOP are data and the generics / methods that operate on data.
The data (especially the data that is the output of an analysis) often consists of structured and inter-related data frames or other objects, and one wishes to manage these in a coordinated fashion. Hence the OOP concept of classes, as a way to organize complex data.
Generics and the methods that implement them represent the common operations performed on data. Their utility comes when a collection of generics operate consistently across conceptually related classes. Perhaps a reasonable example is the output of lm / glm as classes, and the implementation of summary, anova, predict, residuals, etc. as generics and methods.
Many analyses follow familiar work flows; here one is a user of classes and methods, and gets the benefit of coordinated data plus familiar generics. Thinking 'OOP' might lead you to explore the methods on an object (e.g., methods(class = "lm")) rather than its structure, and might help you structure your work flows so they follow the well-defined channels of established classes and methods.
Implementing a novel statistical methodology, one might think about how to organize the results into a coherent, inter-related data structure represented as a new class, and write methods for that class corresponding to established methods on similar classes. Here one gets to represent the data internally in a way that is convenient for subsequent calculation rather than as a user might want to 'see' it (separating representation from interface). And it is easy for the user of your class (as Chambers says, frequently yourself) to use the new class in existing work flows.
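As a small sketch of that pattern, a hypothetical "myfit" class that bundles results and gets its own print and summary methods:

    myfit <- function(x, y) {
      slope <- cov(x, y) / var(x)                       # toy estimate
      structure(list(slope = slope, x = x, y = y), class = "myfit")
    }

    print.myfit <- function(x, ...) {
      cat("Slope estimate:", x$slope, "\n")
      invisible(x)
    }

    summary.myfit <- function(object, ...) {
      r <- object$y - object$slope * object$x
      cat("Residual SD:", sd(r), "\n")
      invisible(object)
    }

    fit <- myfit(rnorm(50), rnorm(50))
    fit                                 # auto-printing dispatches to print.myfit
    summary(fit)
    methods(class = "myfit")            # lists the methods defined for the class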
It's a useful question to ask 'why OOP' before 'how OOP'.
You may want to check these links out: first one, second one.
And if you want to see some serious OO code in R, read the manual page for ReferenceClasses (so-called R5 object orientation) and take a look at the Rook package, since it relies heavily on ReferenceClasses. BTW, Rook is a good example of reasonable usage of R5 in R coding. Previous experience with Java or C++ could be helpful, since R5 method dispatching differs from S3. Actually, S3 OO is very primitive, since the actual "class" is saved as an object attribute, so you can change it quite easily.
S3: <method>.<class>(<object>)
R5: <object>$<method>
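A small sketch contrasting the two dispatch styles (the class and method names here are made up):

    ## S3: the class is just an attribute; dispatch is <method>.<class>(<object>)
    obj <- structure(list(value = 1), class = "counter")
    print.counter <- function(x, ...) cat("counter at", x$value, "\n")
    print(obj)

    ## R5 (Reference Classes): methods live inside the object; call <object>$<method>()
    Counter <- setRefClass("Counter",
      fields  = list(value = "numeric"),
      methods = list(
        increment = function() { value <<- value + 1 },
        show      = function() cat("Counter at", value, "\n")
      )
    )
    c1 <- Counter$new(value = 0)
    c1$increment()
    c1                                  # prints via the show() method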
Anyway, if you can grab a copy, I recommend: "R in a Nutshell", chapter 10.
I have limited knowledge of how to use R effectively, but here is an article that allowed even me to walk through using R in an OO manner:
http://www.ibm.com/developerworks/linux/library/l-r3/index.html
I take exception to David Mertz's statement, "The methods package is still somewhat tentative from what I can tell, but some moderately tweaked version of it seems certain to continue in later R versions", mentioned in the link in BiggsTRC's answer. In my opinion, programming with classes and methods and using the methods package (S4) is the proper way to "think OOP" in R.
The last paragraph of chapter 9.2 "Programming with New Classes" (page 335) of John M. Chambers' "Software for Data Analysis" (2008) states:
"The amount of programming involved in using a new class may be much more than that involved in defining the class. You owe it to the users of your new classes to make that programming as effective as possible (even if you expect to be your own main user). So the fact that the programming style in this chapter and in Chapter 10 ["Methods and Generic Functions"] is somewhat different is not a coincidence. We're doing some more serious programming here."
Consider studying the methods package (S4).
Beyond some of the other good answers here (e.g. the R in a Nutshell chapter, etc), you should take a look at the core Bioconductor packages. BioC has always had a focus on strong OOP design using S4 classes.
We have a suite of converters that take complex data and transform it. Mostly the input is EDI and the output XML, or vice-versa, although there are other formats.
There are many inter-dependencies in the data. What methods or software are available that can generate complex input data like this?
Right now we use two methods: (1) a suite of sample files that we've built over the years, mostly from filed bugs and samples in documentation, and (2) generating pseudo-random test data. But the former only covers a fraction of the cases, and the latter has lots of compromises and only tests a subset of the fields.
Before going further down the path of implementing (reinventing?) a complex table-driven data generator, what options have you found successful?
Well, the answer is in your question. Unless you implement a complex table-driven data generator, you're doing things right with (1) and (2).
(1) covers the rule of "1 bug verified, 1 new test case".
And if the structure of the pseudo-random test data in (2) corresponds at all to real-life situations, it is fine.
(2) can always be improved, and it will improve mainly over time, as you think of new edge cases. The problem with random test data is that it can only be so random: beyond a certain point it becomes so difficult to compute the expected output from the random input that you basically have to rewrite the tested algorithm inside the test case.
So (2) will always match a fraction of the cases. If one day it matches all the cases, it will be in fact a new version of your algorithm.
I'd advise against using random data, as it can make it difficult if not impossible to reproduce a reported error (I know you said 'pseudo-random'; I'm just not sure what you mean by that exactly).
Operating over entire files of data would likely be considered functional or integration testing. I would suggest taking your set of files with known bugs and translating them into unit tests, or at least doing so for any future bugs you come across. Then you can also extend these unit tests to cover the other erroneous conditions for which you don't have any sample data. This will likely be easier than coming up with a whole new data file every time you think of a condition/rule violation you want to check for.
Make sure the parsing of the data format is kept separate from the interpretation of the data in that format. This will make unit testing as described above much easier.
If you definitely need generated data to drive your testing, you may want to consider getting a machine-readable description of the file format and writing a test data generator that analyzes the format and generates valid/invalid files based on it. This will also allow your test data to evolve as the file formats do.