I know there are various R packages for performing raking (i.e. calibration to external estimates, iterative proportional fitting, etc.) to construct survey weights. I wanted to find a package that would automatically collapse cells if a cell count fell below a certain value. Is there a package out there with such a feature? Or, if not raking exactly, a weighting package for a similar algorithm (e.g. GREG, entropy balancing) that would have such a feature for matching to targets? Thank you.
From my initial research, packages like "Ipfp: Multidimensional Iterative Proportional Fitting" didn't seem to have the feature I wanted.
I have built an SVM-RBF model in R using caret. Is there a way of plotting the decision boundary?
I know it is possible to do so with other R packages, but unfortunately I'm forced to use caret because it is the only package I found that lets me calculate variable importance.
Alternatively, can you suggest a package that can plot the decision boundaries AND also report variable importance?
Thank you very much
First of all, unlike some other methods, SVM does not produce feature importance. In your case, the importance score caret reports is calculated independently of the method itself: https://topepo.github.io/caret/variable-importance.html#model-independent-metrics
Second, the decision boundary (or hyperplane) you see in most textbook examples is based on a toy problem with only two or three features. If you have more than three features, it is not trivial to visualize this hyperplane.
I'm looking for advice on creating classification trees where each split is based on multiple variables. A bit of background: I'm helping design a vegetation classification system, and we're hoping to use a classification and regression tree algorithm to both classify new veg data and create (or at least help to create) visual keys which can be used in publications. The data I'm using is laid out as community data, with tree species as columns, and observations as rows, and the first column is a factor with classes. I'll also add that I'm very new to this type of analysis, and while I've tried to read about it as much as possible, it's quite likely that I've missed some simple but important aspects. My apologies.
Now the problem: R has excellent packages and great documentation for classification with univariate splits (e.g. rpart, partykit, C5.0). However, I would ideally like to be able to create classification trees where each split is based on multiple criteria, so instead of each split having one decision (e.g. "Percent cover of Species A > 6.67"), it would have multiple (e.g. "Percent cover of Species A > 6.67 AND Percent cover of Species B < 4.2"). I've had a lot of trouble finding packages that are capable of doing multivariate splits and creating trees. This answer: https://stats.stackexchange.com/questions/4356/does-rpart-use-multivariate-splits-by-default has been very useful, and I've tried all the packages suggested there for multivariate splitting. Prim does do multivariate splits, but doesn't seem to make trees; the partDSA package seems to be somewhat what I'm looking for, but it also only creates trees with one criterion per split; the optpart package also doesn't seem to be able to make classification trees. If anyone has advice on how I could go about making a classification tree based on a multivariate partitioning method, that would be super appreciated.
Also, this is my first question, and I am very open to suggestions about how to ask questions. I didn't feel that providing an example would be helpful in this case, but if necessary I easily can.
Many Thanks!
I am trying to do sentiment analysis of tweets. I want to classify the tweets by emotion (anger, disgust, fear, joy, sadness, surprise), which is generally done with RTextTools, but I can't figure out how to do it. It would be helpful if anyone could help.
Any way of doing it would help. I am not trying to achieve positive or negative categorization, which I have already done successfully.
A similar categorization can be done with the sentiment R package, but only the Bayes algorithm can be used. It would also be okay if I could apply other algorithms in the classify_emotion() function of the sentiment package.
You should check out the caret package (http://topepo.github.io/caret/index.html). What you are trying to do are two different classifications (one multi-class and one two-class problem). Represent each document as a term frequency vector and run a classification algorithm of your choice. SVMs usually work well with bag-of-words approaches.
You would need some training data of course, but there are data sets available. https://www.crowdflower.com/data-for-everyone/
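As a minimal sketch of the "term frequency vectors" step, here is one way to build a document-by-term count matrix in base R. The tweets and labels below are made up for illustration, and the tokenization (lowercasing, dropping everything except letters and spaces) is an assumption, not a recommendation for real tweet data.

```r
# Hypothetical toy data: four short "tweets" with hand-assigned emotion labels.
tweets <- c("I am so angry about this",
            "What a joyful, happy day!",
            "This is so sad",
            "happy happy joy")
labels <- factor(c("anger", "joy", "sadness", "joy"))

# Lowercase, strip everything except letters and spaces, split on whitespace.
tokens <- strsplit(tolower(gsub("[^[:alpha:] ]", "", tweets)), " +")

# Vocabulary = all distinct tokens; one column per term.
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary term occurs in each tweet.
tf <- t(vapply(
  tokens,
  function(tok) as.numeric(table(factor(tok, levels = vocab))),
  numeric(length(vocab))
))
colnames(tf) <- vocab

# With real training data, a matrix like `tf` could be handed to a classifier,
# for example (not run here, requires the caret and kernlab packages):
#   fit <- caret::train(x = tf, y = labels, method = "svmLinear")
```

Each row of `tf` is one tweet and each column one term, so the same matrix can serve both the multi-class emotion problem and the two-class polarity problem by swapping the label vector.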
I am comparing various predictive models on a binary classification task using the caret R package with respect to their predictive performance (liftChart) and prediction accuracy (calibration plot). I found the following issues:
1. Sometimes the lift function is very slow when the number of observations is large or there are several competing classifiers. In addition, I wonder whether it is possible to manually define the cuts of the calibration plot. I have a severely imbalanced data set (average probability is 5%) and the calibration plot function assumes evenly spaced cuts.
The lift plot does the calculation for every unique probability value (much like an ROC curve), which is why it is slow.
Neither of those options is available right now. You can add two issues to the GitHub page. I'm fairly swamped right now but those shouldn't be a big deal to change (you could always contribute solutions too).
Max
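Until such options exist, both calculations can be done by hand. The following is only a base R sketch, not caret's implementation: the simulated `prob` and `obs` vectors, the cut points, and the 100-point grid are all assumptions chosen for illustration. It bins predictions with unevenly spaced cuts for a calibration table, and evaluates a cumulative "percent of events found" curve on a fixed grid rather than at every unique probability.

```r
set.seed(42)
n    <- 5000
prob <- rbeta(n, 1, 19)      # hypothetical predicted probabilities, mean about 5%
obs  <- rbinom(n, 1, prob)   # hypothetical observed 0/1 outcomes

# Calibration table with user-chosen, unevenly spaced cut points
# (denser where most of the predictions actually fall).
cuts <- c(0, 0.01, 0.02, 0.05, 0.10, 0.20, 1)
bin  <- cut(prob, breaks = cuts, include.lowest = TRUE)
calib <- data.frame(
  n          = as.vector(table(bin)),
  mean_pred  = as.vector(tapply(prob, bin, mean)),
  event_rate = as.vector(tapply(obs, bin, mean)),
  row.names  = levels(bin)
)

# Cumulative gain evaluated at 100 grid points instead of every unique score:
# sort once by descending probability, then read off the running event count.
ord        <- order(prob, decreasing = TRUE)
cum_events <- cumsum(obs[ord])
k          <- round(n * (1:100) / 100)   # number of samples tested at each grid point
lift_tab <- data.frame(
  pct_tested = 100 * k / n,
  pct_found  = 100 * cum_events[k] / sum(obs)
)
```

Plotting `event_rate` against `mean_pred` from `calib` gives a calibration curve on the chosen cuts, and `pct_found` against `pct_tested` gives the gain curve. Sorting once and indexing the cumulative sum keeps the cost close to a single sort regardless of how many distinct probabilities there are.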
I'm working on a project now that's rather unlike anything I've done before. I have two tests with binary results that will be administered to the same sample, which is drawn from a clustered population (i.e., some subjects will be from the same family). I'd like to compare proportions of positive test results, but the clustering makes McNemar's test inappropriate so I've been reading up on alternative approaches. The two main routes seem to be 1) the clustering-adjusted McNemar alternatives by Rao and Scott (1992), Eliasziw and Donner (1991), and Obuchowski (1998), and 2) GEE.
Do you know of any implementations of the Rao-Obuchowski lineage in R (or, I suppose, SAS)? GEE is easy to find, but have you had a positive or negative experience with any particular packages? Is there another route to analyzing these data that I'm completely missing?
You could always just use a clustered bootstrap. Resample across families, which you believe are independent. That is, keep families together when you resample. Compute p2 - p1 for each bootstrap sample. After 1000 iterations or so, compute the 2.5% and 97.5% quantiles of those differences. This will give you a bootstrapped 95% confidence interval. Alternatively, compute the fraction of samples above zero, or whatever your hypothesis is. The procedure should have pretty good properties unless the number of families is small.
It's probably easiest to do this by hand in R rather than relying on any package.
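A minimal base R sketch of that procedure follows. The simulated data frame, its column names (`family`, `test1`, `test2`), and the helper `cluster_boot_diff()` are all hypothetical and only there to make the example self-contained.

```r
# Hypothetical data: one row per subject, a family id, and two binary test results.
set.seed(1)
n_fam      <- 60
fam_size   <- sample(1:4, n_fam, replace = TRUE)
dat        <- data.frame(family = rep(seq_len(n_fam), times = fam_size))
fam_effect <- rnorm(n_fam)[dat$family]   # shared family effect induces clustering
dat$test1  <- rbinom(nrow(dat), 1, plogis(-0.5 + fam_effect))
dat$test2  <- rbinom(nrow(dat), 1, plogis( 0.0 + fam_effect))

# Resample whole families with replacement and recompute p2 - p1 each time.
cluster_boot_diff <- function(dat, n_boot = 1000) {
  rows_by_family <- split(seq_len(nrow(dat)), dat$family)
  replicate(n_boot, {
    picked <- sample(length(rows_by_family), replace = TRUE)
    idx    <- unlist(rows_by_family[picked], use.names = FALSE)
    mean(dat$test2[idx]) - mean(dat$test1[idx])
  })
}

boot_diffs      <- cluster_boot_diff(dat)
ci              <- quantile(boot_diffs, c(0.025, 0.975))  # percentile 95% interval
frac_above_zero <- mean(boot_diffs > 0)
```

Because each draw takes all rows of a sampled family, both the within-subject pairing of the two tests and the within-family correlation are preserved in every bootstrap sample.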
Check out the survey package: it is designed to take into account correlations induced by clustered sampling.
Have you already checked the CorrBin package in R?
It is for the analysis of correlated binary data. There is a paper named "Using the CorrBin package for nonparametric analysis of correlated binary data" by Szabo. It includes the Rao-Scott test, a stochastic ordering test, and three versions of a GEE-based test.
The clust.bin.pair package for clustered binary matched-pair data was recently published to CRAN.
It contains implementations of Eliasziw and Donner (1991) and Obuchowski (1998), as well as two more recent tests in the same family: Durkalski (2003) and Yang (2010).