Qualitative data analysis using data mining techniques [closed] - r

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I have 22 companies response about 22 questions/parameters in a 22x22 matrix. I applied clustering technique which gives me different groups with similarities.
Now I would like to find correlations between parameters and companies preferences. Which technique is more suitable in R?
Normally we build Bayesian network to find a graphical relationship between different parameters from data. As this data is very limited, how i can build Bayesian Network for it?
Any suggestion to analyze this data.

Try looking at Feature selection and Feature Importance in R, it's simple,
this could lead you: http://machinelearningmastery.com/feature-selection-with-the-caret-r-package/
Some packages are good: https://cran.r-project.org/web/packages/FSelector/FSelector.pdf
, https://cran.r-project.org/web/packages/varSelRF/varSelRF.pdf
this is good SE question with good answers: https://stats.stackexchange.com/questions/56092/feature-selection-packages-in-r-which-do-both-regression-and-classification

Related

Sampling weights for subpopulations in R [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
I'm working with a large, national survey that was collected using complex survey methods. As such, I'm needing to account for sample weights and other survey design features (e.g., sampling strata). I'm new to this methodology, so apologies if the answers here are obvious.
I've had success running path analysis models using the 'lavaan' package paired with the 'lavaan.survey' package. However, some of my models involve only a subset of the data (e.g., only female participants).
How can I adjust the sample weights to reflect the fact that I am only analyzing a subsample (e.g., females)?
The subset() function in the survey package handles subpopulations correctly, and since lavaan.survey uses the survey package to get the basic standard errors for the population covariance matrix, it should all flow through properly.

Build a neural network with known weights and bias [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Improve this question
I have a known configuration nodes, weights, bias values, and activation function (tanh) for a neural network. I'd like to build that neural network as some 'neural network' object in R by proscribing the parts, and not fitting a network. How can I do this? I see many options to fit a neural network, but cannot find out how to build a network when I already know the components.
R do provide startweights argument to initialize custom weights, see StackOverflow thread. I also won't see citations for changing bias and transfer function.
Either use MATLAB (which is not a good idea for a R expert) or better design custom network based on following fact:
ANN is just a set of maths operations on input vectors and output vectors, where math operations are adjustment of weights based on error term in a loop using simple back-propogation. Use vectors and maths operations ONLY in R to design a simple ANN with back-propogation training

Difference between h2o.ai and SparkMLlib from Machine Learning algorithm point of view [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 6 years ago.
Improve this question
Currently, I am doing survey on Machine Learning library using h2o.ai and SparkMLlib. I have identified that more number of ML algorithms are supported by h2o.ai library as compare to SparkMLlib, and partition of Spark data frame in to training and testing set seems to be difficult (need to convert spark data frame to R/h2o data frame which is also time/resource consuming).
What are the others advantages/disadvantages of using h2o.ai library over SparkMLib or vice-versa ? I am focusing h2o.ai and SparkMLlib into R based implementation (SparkR). So the dataframes for h2o (as.h2o) and SparkMLlib (as.DataFrame) are different.
Partially, I figure-out the answer using following links: http://datasocial.onsocialengine.com/post/4171645/spark-mllib-or-h2o
Detailed comparative analysis is provided here: https://github.com/szilard/benchm-ml
Slides of bench-marking results: https://speakerdeck.com/szilard/benchmarking-machine-learning-tools-for-scalability-speed-and-accuracy-la-ml-meetup-at-eharmony-june-2015
Video of bench-marking results: https://vimeopro.com/eharmony/talks/video/132838730
Technical report on Analysis of Machine Learning Library: https://github.com/chauhansaurabhb/Analysis-of-H2O-vs-SparkMLlib/blob/master/MLLibrary.pdf

Machine learning - Calculating the importance of a "value" in a variable [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
I’m analyzing a medical dataset containing 15 variables and 1.5 million data points. I would like to predict hospitalization and more importantly which type of medication may be responsible. The medicine-variable have around 700 types of drugs. Does anyone know how to calculate the importance of a "value" (type of drug in this case) in a variable for boosting? I need to know if ‘drug A’ is better for prediction than ‘drug B’ both in a variable called ‘medicine’.
The logistic regression model is able to give such information in terms of p-values for each drug, but I would like to use a more complex method. Of cause you can create a binary variable of each type of drug, but this gives 700 extra variables and does not seems to work very well. I’m currently using r. I really hope you can help me solve this problem. Thanks in advance! Kind regards Peter
see varImp() in library caret, which supports all the ML algorithms you referenced.

Text analysis : What after term-document matrix? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
I am trying to build predictive models from text data. I built document-term matrix from the text data (unigram and bigram) and built different types of models on that (like svm, random forest, nearest neighbor etc). All the techniques gave decent results, but I want to improve the results. I tried tuning the models by changing parameters, but that doesn't seem to improve the performance much. What are the possible next steps for me?
This isn't really a programming question, but anyway:
If your goal is prediction, as opposed to text classification, usual methods are backoff models (Katz Backoff) and interpolation/smoothing, e.g. Kneser-Ney smoothing.
More complicated models like Random Forests are AFAIK not absolutely necessary and may pose problems if you need to make predictions quickly. If you are using an interpolation model, you can still tune the model parameters (lambda) using a held out portion of the data.
Finally, I agree with NEO on the reading part and would recommend "Speech and Language Processing" by Jurafsky and Martin.

Resources