Several R packages have been developed that allow you to make assertions about your data, results of analysis, etc. However, I have never seen anyone compile a list of useful checks.
Are there any resources that have checklists or other lists of common checks?
For example, if you were analyzing survey data, you might want to sanity check the data as follows:
Impossible values: someone who lists their profession as doctor is 6 years old
Unlikely correlations: Education level has a negative correlation with Earnings
After doing a lot of joins, you want to verify the final data structure:
Lost observations: A data set begins with N = 100,000... after appending variables, does N still equal 100,000?
Unreasonable values within columns: Summaries of nulls, detection of outliers, distribution of most common values
Unreasonable cross-column relationships: A table with sales references salesperson, but the salesperson ID doesn't exist in the salesperson table
After developing predictions, you want to check if they make sense:
Unlikely predictions across groups: You average predicted probabilities of making a purchase by group and find that non-pet owners are more likely than pet owners to buy pet food
etc. etc.
Below are some R packages that would help incorporate such tests into R... if only we had a checklist of what those tests should be!
testthat
http://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf
https://github.com/hadley/testthat
RUnit
http://cran.r-project.org/web/packages/RUnit/vignettes/RUnit.pdf
svUnit
http://cran.r-project.org/web/packages/svUnit/vignettes/svUnit.pdf
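As a concrete starting point, here is a minimal sketch of how a few of the checks above could be written with testthat; the object names (survey_df, final_df, sales, salesperson) and their columns are hypothetical placeholders, not part of any existing checklist.

library(testthat)

# Impossible values: no six-year-old doctors
test_that("profession/age combinations are plausible", {
  doctors <- survey_df[survey_df$profession == "doctor", ]
  expect_true(all(doctors$age >= 18))
})

# Lost observations: row count preserved after appending variables
test_that("no observations lost after joins", {
  expect_equal(nrow(final_df), 100000)
})

# Cross-column relationships: every sale references an existing salesperson
test_that("salesperson IDs are valid foreign keys", {
  expect_true(all(sales$salesperson_id %in% salesperson$id))
})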
I'm working with a large, national survey that was collected using complex survey methods. As such, I need to account for sample weights and other survey design features (e.g., sampling strata). I'm new to this methodology, so apologies if the answers here are obvious.
I've had success running path analysis models using the 'lavaan' package paired with the 'lavaan.survey' package. However, some of my models involve only a subset of the data (e.g., only female participants).
How can I adjust the sample weights to reflect the fact that I am only analyzing a subsample (e.g., females)?
The subset() function in the survey package handles subpopulations correctly, and since lavaan.survey uses the survey package to get the basic standard errors for the population covariance matrix, it should all flow through properly.
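For what it's worth, a minimal sketch of that workflow; the design variable names (psu, stratum, wt, sex), the data frame dat, and the lavaan model string mod are all hypothetical placeholders.

library(survey)
library(lavaan)
library(lavaan.survey)

# Full-sample design; subsetting the design object (not the raw data frame)
# preserves the strata/PSU structure needed for correct standard errors.
des   <- svydesign(ids = ~psu, strata = ~stratum, weights = ~wt, data = dat)
des_f <- subset(des, sex == "female")

# Fit the path model in lavaan as usual, then re-estimate it with the design.
fit     <- sem(mod, data = dat[dat$sex == "female", ])
fit_svy <- lavaan.survey(lavaan.fit = fit, survey.design = des_f)
summary(fit_svy)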
I'm doing credit risk modelling and the data have a large number of features. I am using the Boruta package for feature selection, but it is too computationally expensive to run on the complete training dataset. What I'm trying to do is take a subset of the training data (say about 20-30%), run Boruta on that subset, and get the important features. But when I use random forest to train the model, I have to use the full dataset. My question is: is it valid to select features on only part of the training data and then build the model on the whole of the training data?
Since the question is logical in nature, I will give my two cents.
A single random sample of 20% of the training data should be good enough, I believe.
A step further would be to take 3-4 such random samples; taking the intersection of the significant variables from all of them improves on the single-sample approach.
You could also use feature selection from multiple methods (xgboost, some of caret's feature-selection methods), applying each to a different random sample, and then keep the features they agree on. A sketch of the basic subsample-then-select workflow follows.
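This is only a rough sketch of the idea; the data frame train_df and the outcome column default are hypothetical placeholders (for classification, default should be a factor).

library(Boruta)
library(randomForest)

set.seed(42)
idx <- sample(nrow(train_df), size = floor(0.25 * nrow(train_df)))

# Run Boruta on a ~25% random subsample only
bor  <- Boruta(default ~ ., data = train_df[idx, ])
keep <- getSelectedAttributes(bor, withTentative = FALSE)

# Fit the final random forest on the FULL training data,
# restricted to the selected features
rf <- randomForest(x = train_df[, keep], y = train_df$default)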
I'm analyzing a medical dataset containing 15 variables and 1.5 million data points. I would like to predict hospitalization and, more importantly, which type of medication may be responsible. The medicine variable has around 700 types of drugs. Does anyone know how to calculate the importance of a "value" (a type of drug, in this case) within a variable for boosting? I need to know whether 'drug A' predicts better than 'drug B', both being levels of a variable called 'medicine'.
A logistic regression model can give such information in terms of p-values for each drug, but I would like to use a more complex method. Of course you can create a binary variable for each type of drug, but this gives 700 extra variables and does not seem to work very well. I'm currently using R. I really hope you can help me solve this problem. Thanks in advance! Kind regards, Peter
See varImp() in the caret package, which supports all the ML algorithms you referenced.
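For example, a hedged sketch of that call; the data frame med_df, the outcome hospitalized, and the factor medicine are hypothetical placeholders. Note that through the formula interface caret dummy-codes the factor, so varImp() reports one importance score per drug level.

library(caret)

set.seed(1)
fit <- train(hospitalized ~ medicine, data = med_df,
             method = "gbm", verbose = FALSE)

# Importance of each dummy-coded drug level in the boosted model
varImp(fit)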
In the official R docs, the term "variable" is used to describe two distinct things:
The name we give to any type of object with the <- operator or with assign().
For instance, we could say that in a <- data.frame(0), a is a variable, i.e. a symbol that links that particular data frame to it.
A vector or a factor, whether or not it belongs to a structure like a matrix or a data frame, containing units of data which, we assume, can take any of several or many values.
In this case it's akin to the statistical sense of the term, as in "random variable".
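Building on the example above, a small illustration that shows both senses at once:

# `a` is a variable in the programming sense: a name bound to an object.
# `height` and `weight` are variables in the statistical sense: columns
# whose values vary across observations.
a <- data.frame(height = c(160, 172, 181),
                weight = c(55, 70, 82))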
So my question is the following:
How do I help students understand the difference between programmatic and statistical usage of the term variable when teaching R?
(thanks and credits to @Gregor, who formulated it better than I would have.)
I am using the edgeR and limma packages to analyse an RNA-seq count data table.
I only need a subset of the data file, so my question is: do I need to normalize my data across all the samples, or is it better to subset my data first and then normalize?
Thank you.
Regards Lisanne
I think it depends on what you want to prove/show. If you also want to take into account your "darkcounts", then you should normalize first, so that you also account for the fraction of cases in which your experiment fails. Here your total number of experiments (good and bad results) sums to one.
If you want to find the distribution of your "good events", then you should first produce your subset of good samples and normalize afterwards. In this case your number of good events sums to 1.
So once again, it depends on what you want to prove. As a physicist I would prefer the first method, since we do not remove bad data points.
Cheers TL
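In edgeR terms, a minimal sketch of the two orderings described above; the count matrix counts and the sample index keep_samples are hypothetical placeholders.

library(edgeR)

# (a) Compute normalization factors across ALL samples, then subset
y_all <- DGEList(counts = counts)
y_all <- calcNormFactors(y_all)
y_sub <- y_all[, keep_samples]

# (b) Subset first, then compute normalization factors within the subset only
y_b <- DGEList(counts = counts[, keep_samples])
y_b <- calcNormFactors(y_b)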