Am I able to remove sparse terms WHILE creating a tm::TermDocumentMatrix object?
I tried:
TermDocumentMatrix(file.corp, control = list(removeSparseTerms=0.998))
but it does not work.
No, you cannot remove sparse terms like that with the TermDocumentMatrix function. If you check the help for that function with ?TermDocumentMatrix you'll see that the options for control are listed in the help for termFreq, and when you look at the help for that function with ?termFreq, you'll see that removeSparseTerms is not listed there. Although you have bounds which can do a related job.
If you just want a one-liner that combines TermDocumentMatrix and removeSparseTerms, you simply flip your line inside-out and that will work fine:
removeSparseTerms(TermDocumentMatrix(file.corp), 0.998)
I recommend you have a careful look at the documentation for the tm package, it's one of better examples of a well-documented contributed package. It might save you time waiting for someone to answer your questions here!
Related
I am learning R package SimInf to simulate data-driven stochastic epidemiological models. As I was reading the documentation I came across an unrecognized funcion Nn when defining a function for epicurves. Specifically, this line:
j <- sample(seq_len(Nn(model)), 1)
Values of model are integers. My guess is that Nn selects non-negative values, however my R does not recognize this function. From documentation it does not look like they pre-defined Nn either. Can someone please tell if they know what "Nn" is for? Thank you.
A way to go is always taking the package-name and triple-":" it, such that you can find nearly all functions inside the package. Maybe you are familiar with namespacing a function via packageName::functionFrompackageTocall. The packageName::: shows (nearly) all functions defined in this package. If you do this in R-Studio with SimInf:: and SimInf:::, you will see that the latter gives much more functions. But you can only find the functions SimInf:::Nd and SimInf:::Nc, not the Nn-function. Hence you will have to go to the github-sources of the package, in this case https://github.com/stewid/SimInf .Then search for Nn the whole repository. You will see that it seems like it is always an int, but this doesn't help you since you want to get ii as a function, not as a variable. Scrolling further down in the search-results, you will find the NEWS.md-file which mentions The 'Nn' function to determine the number of nodes in a model has been replaced with the S4 method 'n_nodes'. in the https://github.com/stewid/SimInf/blob/fd7eb4a29b82a4a97f64b528bb0e78e5474aa8a5/NEWS.md file under SimInf 8.0.0 (2020-09-13). Hence having a current version of SimInf installed, it shouldn't use the method Nn anymore. If you use it in your code, replace it by n_nodes. If you find it in current package code, you can email the package-maintainer that you found a bug in his code.
TLDR: Nn is an outdated version of n_nodes
I am currently translating some MATLAB scripts to R for Multivariate Data Analysis. Currently I am trying to generate the same data as the Coeffs.Linear and Coeffs.Const part of the fitdiscr function in MATLAB.
The code being used is:
fitcdiscr(data, groups, 'DiscrimType', 'linear');
The data consists of 3 groups.
Unfortunately the R function seems to do the LDA only for two LDs and MATLAB seems to always compare all groups in all constellations. Does anybody have an idea how I could obtain that data?
I suspect you mean information on the implementation of various MATLAB function, which would be doc <functionname> (doc fitcdiscr would yield this documentation page on fitcdscr) to get the documentation, and edit <functionname> to get the implementation, if it is not obscured by The MathWorks. If those two do not give you enough information, I'm afraid you're out of luck, since not all TMW codes are available non-obscured.
fitcdiscr is non-obscured, although very brief; it's just a wrapper for some other functions. Keep doing edit <functionname> and doc <functionname> and see how deep the rabbit hole takes you.
NB: there's no built-in function called fitdiscr, but the syntax you describe is that of fitcdiscr (note the c), so I used that as examples. If the actual function being called is named fitdiscr, it's custom-made and you'll have to spit through its file by edit fitdiscr and hope for the best.
I want to follow the tutorial found on this site, but despite being thorough in all other aspects, the author has not included information on what packages need to be used for the code to function.
As far as I understand one of them will be the PerformanceAnalytics package, yet my inexperienced eye is not sure about what else I will need to include.
The fapply function used in the code is one example that I cannot find.
fapply()
Error: could not find function "fapply"
library(sos)
findFn("fapply", sortby = "Function")
The findFN(...) function is great. It should open an internet browser window with the search results by itself at least it does for me.
The tutorial on Backtesting a Trading Strategy uses time series data as seen its part 1 and part 2. fapply is also used in part 2
As the data being collected and processed is time-series data, fapply() function belongs to far package which is used for Modelization for Functional AutoRegressive Processes.
I hope this helps.
I realize there are generic functions like plot, predict for a list of packages. I am wondering how could I get the R script of these generic functions for a specific package, like the lme4::predict. I tried lme4::predict, but it comes with error:
> lme4::predict
Error: 'predict' is not an exported object from 'namespace:lme4'
Since you state that my suggestion above was helpful I will tell you my process. I used my own co-authored package called pacman. This package was developed because we had a hard time remembering all the obscurely named functions to get information on and work with add on packages.
I used this to figure out what you wanted:
library(pacman)
p_funs(lme4, all=TRUE)
I set all = TRUE as predict is a method for a specific class (like print, summary and plot). Generally, these methods are not exported so p_funs won't return them unless you set all = TRUE. Then I scrolled down to the p section and found only a single predict method: predict.merMod
Next I realized it wasn't exported so :: won't show me the stuff and extra colon power is needed, hence: lme4:::predict.merMod
As pointed out by David and rawr above, some functions can be suborn little snippets (methods etc.), thus methods and getAnywhere are helpful.
Here is an example of that:
library(tm)
dissimilarity #The good stuff is hid
methods(dissimilarity) #I want the good stuff
getAnywhere("dissimilarity.DocumentTermMatrix")
Small end note
And of course you don't need pacman to look at functions for packages, it's what I use and helpful but it just wraps base R things. Use THE SOURCE to figure out exactly what.
I have used biglm in R and found it very useful. Now I need the same type of functionality in python. Any ideas? I have seen that patsy/statsmodels has an incremental mode, but have not been able to find any samples to copy/adapt. Any pointers would be much appreciated.
from a related answer of Nathaniel Smith on the statsmodels mailing list
My incremental LS code might be useful here, it's basically the same
problem:
https://github.com/njsmith/pyrerp/blob/master/pyrerp/incremental_ls.py#L330
The new X'X is the sum of the old X'Xs, then you have to re-do the
scaling and inversion to get the new vcov matrix for the estimates.
Should be doable so long as you know how many data points are in each
and the various sums-of-squares. (The code I linked has some extra
complexity because of handling a particular sort of heteroskedasticity
via FGLS, but it can pretty much be ignored.)
statsmodels doesn't have anything in this area yet.
There is an incremental OLS function in statsmodels, however that was written as helper function for cusum tests (in memory) and hasn't been used or checked for any other purpose:
http://statsmodels.sourceforge.net/devel/generated/statsmodels.stats.diagnostic.recursive_olsresiduals.html