I am trying to build predictive models from text data. I built a document-term matrix from the text (unigrams and bigrams) and trained different types of models on it (SVM, random forest, nearest neighbour, etc.). All the techniques gave decent results, but I want to improve them. I tried tuning the models by changing parameters, but that doesn't seem to improve the performance much. What are the possible next steps for me?
This isn't really a programming question, but anyway:
If your goal is prediction, as opposed to text classification, the usual methods are backoff models (e.g. Katz backoff) and interpolation/smoothing, e.g. Kneser-Ney smoothing.
More complicated models like random forests are, AFAIK, not absolutely necessary and may pose problems if you need to make predictions quickly. If you are using an interpolation model, you can still tune the model parameters (lambda) using a held-out portion of the data.
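For example, here is a minimal sketch of an interpolated unigram/bigram model in base R, with the interpolation weight (lambda) tuned on a held-out portion; train_tokens and heldout_tokens are placeholder names for your tokenised training and held-out text:

# Counts from the training portion only (placeholder token vectors)
unigram_counts <- table(train_tokens)
bigram_counts  <- table(paste(head(train_tokens, -1), tail(train_tokens, -1)))
N <- length(train_tokens)
V <- length(unigram_counts)

# Interpolated P(w2 | w1): lambda * bigram estimate + (1 - lambda) * unigram estimate
interp_prob <- function(w1, w2, lambda) {
  big  <- bigram_counts[paste(w1, w2)]
  uni1 <- unigram_counts[w1]
  uni2 <- unigram_counts[w2]
  p_bigram  <- if (!is.na(big) && !is.na(uni1)) big / uni1 else 0
  p_unigram <- if (!is.na(uni2)) uni2 / N else 1 / (N + V)  # crude fallback for unseen words
  lambda * p_bigram + (1 - lambda) * p_unigram
}

# Pick lambda by maximising the log-likelihood of the held-out portion
heldout_loglik <- function(lambda) {
  w1 <- head(heldout_tokens, -1); w2 <- tail(heldout_tokens, -1)
  sum(log(mapply(interp_prob, w1, w2, MoreArgs = list(lambda = lambda))))
}
lambdas     <- seq(0.1, 0.9, by = 0.1)
best_lambda <- lambdas[which.max(sapply(lambdas, heldout_loglik))]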
Finally, I agree with NEO on the reading part and would recommend "Speech and Language Processing" by Jurafsky and Martin.
One task of machine learning / data science is making predictions. But I also want to get more insight into the variables of my model.
To get more insight, I tried different methods:
Logistic regression (the summary output provides some insight into the influence of the different variables; see: Checking interpretation of GLM summary in R).
The xgb.plot.importance function applied to a boosted tree model (applied to the Titanic data set).
I also saw a great article on how to explain a boosted tree (see: https://medium.com/applied-data-science/new-r-package-the-xgboost-explainer-51dd7d1aa211), but unfortunately it is not working for me.
My question: are there other methods to give yourself (or, even better, the business) more insight into which variables have an influence on the target variable? And of course: is the influence positive or negative, and how big is it?
You could also try lasso regression (https://stats.stackexchange.com/questions/17251/what-is-the-lasso-in-regression-analysis), which essentially selects the variables that influence the response variable the most.
The glmnet package provides support for this type of regression.
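A minimal sketch of how that might look (df and target are placeholder names for your data frame and binary response):

library(glmnet)

x <- model.matrix(target ~ ., data = df)[, -1]            # numeric predictor matrix, intercept dropped
y <- df$target
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)  # alpha = 1 gives the lasso
coef(cv_fit, s = "lambda.min")                             # non-zero coefficients are the selected variables

The sign and magnitude of the surviving coefficients then give the direction and rough size of each variable's influence.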
I'm doing credit risk modelling and the data have a large number of features. I am using the Boruta package for feature selection, but it is too computationally expensive to run on the complete training dataset. What I'm trying to do is take a subset of the training data (say about 20-30%), run Boruta on that subset, and get the important features. But when I use random forest to train the model, I have to use the full dataset. My question is: is it right to select features on only a part of the training data but then build the model on the whole of the training data?
Since the question is more about methodology than code, I will give my two cents.
A single random sample of 20% of the population is good enough, I believe.
A step further would be to take 3-4 such random samples and use the intersection of the significant variables from all of them, which is an improvement on the above.
Use feature selection from multiple methods (xgboost, some caret feature selection methods) with a different random sample for each of them, and then take the common significant features; a sketch of the subsample-and-intersect idea follows below.
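A rough sketch, assuming train is your full training data frame with a factor target called default (both names are placeholders):

library(Boruta)
library(randomForest)

# Run Boruta on a few random 20% subsamples and keep the features confirmed every time
selected <- lapply(1:3, function(i) {
  sub <- train[sample(nrow(train), size = 0.2 * nrow(train)), ]
  getSelectedAttributes(Boruta(default ~ ., data = sub), withTentative = FALSE)
})
keep <- Reduce(intersect, selected)

# Fit the final model on the full training data, restricted to the selected features
rf_fit <- randomForest(x = train[, keep], y = train$default)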
I have a known configuration of nodes, weights, bias values, and an activation function (tanh) for a neural network. I'd like to build that neural network as some 'neural network' object in R by prescribing the parts, not by fitting a network. How can I do this? I see many options for fitting a neural network, but cannot find out how to build a network when I already know the components.
R does provide a startweights argument to initialize custom weights (see the StackOverflow thread on this), but I could not find anything on changing the bias values and the transfer function.
Either use MATLAB (which is not a good idea for an R expert) or, better, design a custom network based on the following fact:
An ANN is just a set of math operations on input and output vectors, where the operations adjust the weights based on an error term in a loop using simple back-propagation. Using only vectors and math operations in R, you can design a simple ANN with back-propagation training; since your weights are already known, the forward pass alone is enough.
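For example, a minimal sketch of a fixed (not fitted) one-hidden-layer tanh network in base R; the weight and bias values below are placeholders for the ones you already know:

# Known parameters: 2 inputs -> 3 hidden units (tanh) -> 1 linear output
W1 <- matrix(c(0.5, -0.3, 0.8, 0.1, -0.6, 0.4), nrow = 3, ncol = 2)  # hidden-layer weights
b1 <- c(0.1, -0.2, 0.05)                                             # hidden-layer biases
W2 <- matrix(c(1.2, -0.7, 0.3), nrow = 1)                            # output-layer weights
b2 <- 0.2                                                            # output-layer bias

predict_nn <- function(x) {          # x: numeric input vector of length 2
  h <- tanh(W1 %*% x + b1)           # hidden layer with tanh activation
  as.numeric(W2 %*% h + b2)          # linear output layer
}

predict_nn(c(0.4, 1.5))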
I have 22 companies' responses to 22 questions/parameters in a 22x22 matrix. I applied a clustering technique, which gives me different groups with similarities.
Now I would like to find correlations between the parameters and the companies' preferences. Which technique is most suitable in R?
Normally we would build a Bayesian network to find graphical relationships between the parameters from data. As this data set is very limited, how can I build a Bayesian network for it?
Any suggestions on how to analyse this data would be appreciated.
Try looking at feature selection and feature importance in R; it's simple.
This could guide you: http://machinelearningmastery.com/feature-selection-with-the-caret-r-package/
Some good packages: https://cran.r-project.org/web/packages/FSelector/FSelector.pdf and https://cran.r-project.org/web/packages/varSelRF/varSelRF.pdf
This is a good Stats SE question with good answers: https://stats.stackexchange.com/questions/56092/feature-selection-packages-in-r-which-do-both-regression-and-classification
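Along the lines of the caret link above, a small sketch (responses is a placeholder 22 x 22 data frame of the companies' answers and preference a placeholder target vector):

library(caret)

cor_mat  <- cor(responses)                            # parameter-to-parameter correlations
high_cor <- findCorrelation(cor_mat, cutoff = 0.75)   # indices of highly correlated (redundant) parameters

# Recursive feature elimination to rank the parameters against the preference target
ctrl    <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
rfe_fit <- rfe(responses, preference, sizes = c(3, 5, 10), rfeControl = ctrl)
predictors(rfe_fit)                                   # the parameters deemed most important

With only 22 rows the cross-validation estimates will be very noisy, so treat the ranking as indicative rather than definitive.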
My current linear model is: fit <- lm(ES ~ Area + Anear + Dist + DistSC + Elevation)
I have been asked to further this by:
Fit a linear model for ES using the five explanatory variables and
include up to quadratic terms and first order interactions (i.e. allow
Area^2 and Area*Elevation, but don't allow Area^3 or
Area*Elevation*Dist).
From my research I can add terms such as + I(Area^2) and + Area*Elevation, but this would make for a huge list.
Assuming I am understanding the question correctly, I would be adding 5 squared terms and 10 interaction terms, giving 20 terms in total with the 5 main effects. Or do I not need all of these?
Is that really the most efficient way of going about it?
EDIT:
Note that I am planning to carry out a stepwise regression between the null model and the full model afterwards. I seem to be having trouble with this when using poly.
Look at ?formula to further your education:
fit <- lm(ES ~ (Area + Anear + Dist + DistSC + Elevation)^2)
Those will not be squared terms but rather part of what you were asked to provide: all the 2-way interactions (and main effects). Formula "mathematics" is different from the regular use of powers. To add the squared terms in a manner that allows proper statistical interpretation, use poly:
fit <- lm(ES ~ (Area + Anear + Dist + DistSC + Elevation)^2 +
            poly(Area, 2) + poly(Anear, 2) + poly(Dist, 2) + poly(DistSC, 2) + poly(Elevation, 2))
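For the stepwise search mentioned in the question's edit, one possible sketch (dat is a placeholder for your data frame; raw I(...^2) terms are used here as an alternative that some find easier to combine with step() than poly()):

null_fit <- lm(ES ~ 1, data = dat)
full_fit <- lm(ES ~ (Area + Anear + Dist + DistSC + Elevation)^2 +
                 I(Area^2) + I(Anear^2) + I(Dist^2) + I(DistSC^2) + I(Elevation^2),
               data = dat)
step_fit <- step(null_fit,
                 scope = list(lower = formula(null_fit), upper = formula(full_fit)),
                 direction = "both")
summary(step_fit)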