I am working on a data set with a lot of raw text that I am vectorizing and using in my matrix for a random forest regression. My question is: should I be treating each word as a factor or as numeric if the matrix is sparse? Which one speeds up the computation time?
My understanding is that R matrices coerce factors to characters, so you're better off using numeric.
I'm not terribly familiar with RandomForest -- I have a general idea of what it does, but I'm not sure about the guts of its R implementation. If you need to give it a design matrix (for instance, how ANOVAs or GLMs work when you implement them by hand), you can try using the model.matrix function.
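For illustration, a rough sketch of the numeric route with model.matrix and randomForest; the toy word-count columns and values here are made up:

    # toy "bag of words" counts -- numeric columns, not factors
    df <- data.frame(
      word_apple  = c(1, 0, 2),
      word_banana = c(0, 3, 1),
      y           = c(10.5, 3.2, 7.8)
    )

    # model.matrix builds a purely numeric design matrix,
    # expanding any factor columns into dummy columns along the way
    X <- model.matrix(y ~ . - 1, data = df)

    library(randomForest)
    fit <- randomForest(x = X, y = df$y)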
I'm working with a large, national survey that was collected using complex survey methods. As such, I'm needing to account for sample weights and other survey design features (e.g., sampling strata). I'm new to this methodology, so apologies if the answers here are obvious.
I've had success running path analysis models using the 'lavaan' package paired with the 'lavaan.survey' package. However, some of my models involve only a subset of the data (e.g., only female participants).
How can I adjust the sample weights to reflect the fact that I am only analyzing a subsample (e.g., females)?
The subset() function in the survey package handles subpopulations correctly, and since lavaan.survey uses the survey package to get the basic standard errors for the population covariance matrix, it should all flow through properly.
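A rough sketch of what that could look like; the design variables psu, stratum, wt and the female flag are placeholders for whatever your survey actually uses:

    library(survey)
    library(lavaan)
    library(lavaan.survey)

    # full-sample design (variable names here are made up)
    des <- svydesign(ids = ~psu, strata = ~stratum, weights = ~wt,
                     data = dat, nest = TRUE)

    # subset() keeps the full design structure but flags the
    # subpopulation, which is the correct way to analyze a subsample
    des_f <- subset(des, female == 1)

    # fit the path model in lavaan first, then re-estimate with the design
    model <- "y ~ m + x
              m ~ x"
    fit     <- sem(model, data = dat[dat$female == 1, ])
    fit_svy <- lavaan.survey(lavaan.fit = fit, survey.design = des_f)
    summary(fit_svy)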
I have responses from 22 companies to 22 questions/parameters, arranged in a 22x22 matrix. I applied a clustering technique, which gives me groups of companies with similar responses.
Now I would like to find correlations between the parameters and the companies' preferences. Which technique is most suitable for this in R?
Normally we would build a Bayesian network to find a graphical relationship between the parameters from data. Since this data set is very limited, how can I build a Bayesian network for it?
Any suggestions for analyzing this data would be appreciated.
Try looking at feature selection and feature importance in R; it's simple.
This could get you started: http://machinelearningmastery.com/feature-selection-with-the-caret-r-package/
Some useful packages: https://cran.r-project.org/web/packages/FSelector/FSelector.pdf and https://cran.r-project.org/web/packages/varSelRF/varSelRF.pdf
This is a good Stack Exchange question with good answers: https://stats.stackexchange.com/questions/56092/feature-selection-packages-in-r-which-do-both-regression-and-classification
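For example, a toy sketch with caret's varImp(); the data here is a random stand-in for your 22x22 matrix, and the cluster labels are assumed to come from your earlier clustering step:

    library(caret)

    # toy stand-in for the 22 x 22 matrix: rows = companies, cols = parameters
    set.seed(1)
    dat <- as.data.frame(matrix(runif(22 * 22), nrow = 22))
    names(dat) <- paste0("param", 1:22)
    dat$cluster <- factor(sample(c("A", "B", "C"), 22, replace = TRUE))

    # rank parameters by how well they separate the clusters
    ctrl <- trainControl(method = "cv", number = 5)
    fit  <- train(cluster ~ ., data = dat, method = "rf", trControl = ctrl)
    varImp(fit)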
I am trying to build predictive models from text data. I built a document-term matrix from the text data (unigrams and bigrams) and trained different types of models on it (SVM, random forest, nearest neighbor, etc.). All the techniques gave decent results, but I want to improve them. I tried tuning the models by changing parameters, but that doesn't seem to improve the performance much. What are the possible next steps for me?
This isn't really a programming question, but anyway:
If your goal is prediction, as opposed to text classification, usual methods are backoff models (Katz Backoff) and interpolation/smoothing, e.g. Kneser-Ney smoothing.
More complicated models like Random Forests are AFAIK not absolutely necessary and may pose problems if you need to make predictions quickly. If you are using an interpolation model, you can still tune the model parameters (lambda) using a held out portion of the data.
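For instance, a rough sketch of tuning lambda on held-out data for a simple unigram/bigram interpolation; uni_prob(), bi_prob() and held_out_tokens are placeholders for lookups you would build from your own counts:

    # linear interpolation of bigram and unigram probabilities
    interp_prob <- function(w, prev, lambda) {
      lambda * bi_prob(w, prev) + (1 - lambda) * uni_prob(w)
    }

    # log-likelihood of a held-out token sequence for a given lambda
    held_out_loglik <- function(lambda, tokens) {
      sum(log(mapply(function(w, prev) interp_prob(w, prev, lambda),
                     tokens[-1], tokens[-length(tokens)])))
    }

    # simple grid search for the best interpolation weight
    lambdas <- seq(0.05, 0.95, by = 0.05)
    scores  <- sapply(lambdas, held_out_loglik, tokens = held_out_tokens)
    best_lambda <- lambdas[which.max(scores)]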
Finally, I agree with NEO on the reading part and would recommend "Speech and Language Processing" by Jurafsky and Martin.
I have to calculate the probability distribution function of a random variable that is composed of (sum, division, product, exponentiation, etc.) other simple random variables. It is pretty complex, so I am more than happy to get a numerical solution.
While I thought this was a very standard thing to do, I was unable to find a framework for it. I'd preferably use R, but any major language will do.
What I would like therefore is a library that allowed me to:
i) create numerical random variables from classic distributions
ii) compose them by simple operations (+,-,*,/, exp,min, max,...)
Of course I could work with vectors and use convolutions and the like, but I wanted something more polished.
I am also aware that it is possible to use simulation to create the variables, compose them with the operations, and finally get the PDF from a histogram, but again, I would prefer a non-simulation approach.
Try the rv package. Note that if X is an exponential random variable with mean 1, then -log(X) has a standard Gumbel distribution.
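A minimal sketch of how that composition might look with rv (it builds simulation-based random variable objects, so strictly speaking it is Monte Carlo under the hood; the function names are worth double-checking against the package docs):

    library(rv)
    setnsims(10000)           # number of simulations backing each rv object

    x <- rvnorm(mean = 0, sd = 1)      # standard normal rv
    y <- rvgamma(shape = 2, rate = 1)  # gamma rv

    # compose with ordinary arithmetic; the result is itself an rv object
    z <- exp(x) / (1 + y)

    summary(z)    # simulation-based summary of the derived distribution
    rvhist(z)     # histogram approximation of its density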
In the official R docs, the term "variable" is used to describe two distinct things:
1. The name we give to any type of object with the <- operator or with assign().
For instance, we could say that in a <- data.frame(0), a is a variable, i.e. a symbol that links that particular data frame to it.
2. A vector or a factor, belonging or not to a structure like a matrix or a data frame, and containing units of data which, we assume, can take any of several or many values.
In this case it is akin to the statistical sense of the term, as in "random variable".
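A minimal snippet that shows both senses side by side (the data is made up):

    # "variable" in the programming sense: a name bound to an object
    df <- data.frame(height = c(150, 165, 180),
                     group  = factor(c("a", "b", "a")))

    # "variable" in the statistical sense: each column holds values
    # of a measured quantity
    df$height   # a numeric variable
    df$group    # a categorical variable (a factor)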
So my question is the following:
How do I help students understand the difference between programmatic and statistical usage of the term variable when teaching R?
(Thanks and credit to @Gregor, who formulated it better than I would have.)