Power Analysis in R for Two-Way ANOVA [closed]

I am trying to calculate the necessary sample size for a 2x2 factorial design. I have two questions.
1) I am using the pwr package and its one-way ANOVA function to calculate the necessary sample size, along the following lines (the effect size f here is a placeholder; n is left out so the function solves for it):
# treating the 2x2 design as one-way with k = 4 cells; f = 0.25 is a placeholder
pwr.anova.test(k = 4, f = 0.25, sig.level = 0.05, power = 0.80)
However, I would like to use a two-way ANOVA, since it estimates group means more efficiently than a one-way ANOVA. I could not find a two-way ANOVA power function. Is there a package or routine in R to do this?
2) Moreover, am I safe in assuming that, since I am using a one-way ANOVA power calculation, the resulting sample size will be conservative (i.e. larger than necessary)?

In a 2x2 ANOVA involving Factor A, Factor B, and the A×B interaction, you will get a separate statistical power estimate for each of these three effects.
G*Power 3 provides free software and some clear tutorials for estimating the power of effects in factorial designs:
http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3/user-guide-by-design

After searching, I couldn't find any ready-made solution for this online.
What I would suggest (if you know how) is to program this as a simulation; a sketch is given below.
If you don't know how to do it, then write an SO question along the lines of "How can I write a simulation of a two-way ANOVA to perform a power analysis?" and see what people might help you with :)
Also, for a start on power calculation through simulation, you could review the code here:
http://www.rforge.net/doc/packages/NCStats/power.sim.html
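Here is a minimal sketch of such a simulation for a 2x2 between-subjects design. All the effect sizes, the residual SD, and the cell size below are hypothetical placeholders - substitute values from pilot data or the literature:
sim_power_2x2 <- function(n_per_cell, effect_A = 0.5, effect_B = 0.5,
                          effect_AB = 0.25, sd = 1, nsim = 1000,
                          alpha = 0.05) {
  pvals <- replicate(nsim, {
    d <- expand.grid(A = factor(c("a1", "a2")),
                     B = factor(c("b1", "b2")),
                     rep = seq_len(n_per_cell))
    # cell means built from the two main effects and the interaction
    mu <- effect_A * (d$A == "a2") + effect_B * (d$B == "b2") +
      effect_AB * (d$A == "a2") * (d$B == "b2")
    d$y <- rnorm(nrow(d), mean = mu, sd = sd)
    summary(aov(y ~ A * B, data = d))[[1]][["Pr(>F)"]][1:3]
  })
  # proportion of significant runs = estimated power for A, B, and A:B
  rowMeans(pvals < alpha)
}
sim_power_2x2(n_per_cell = 30)  # increase n_per_cell until power is adequate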
Notice what Jeromy wrote: that power analysis is for multiple outcomes.
Interesting subject - I'd love to follow up on it.
Best,
Tal

Related

DESeq2 design matrix including RIN as covariate in the formula [closed]

I have been following the latest DESeq2 pipeline to perform an RNA-seq analysis. My problem is that the RIN of the experimental samples is quite low compared to the control ones. I read a paper in which they performed an RNA-seq analysis with time-course RNA degradation and concluded that including the RIN value as a covariate can mitigate some of the effects of low RIN in samples.
My question is how I should construct the design in the DESeq2 object:
~conditions+rin
~conditions*rin
~conditions:rin
none of them... :)
I cannot find proper resources that explain how to construct these models (I am new to the field...) and I admit I have hit a wall with these kinds of things. I would also appreciate some links to good resources so I can understand which one is correct and why.
Thank you so much
This turned out to be too long to type in a comment.
It depends on your data.
First of all, counts ~ conditions:rin does not make sense in your case, because conditions is categorical; you cannot fit a model with only an interaction term.
I would go with counts ~ condition + rin. This assumes there is a condition effect and a linear effect of RIN, and that the dependence of counts on RIN is the same in both conditions. A minimal sketch of this model is given below.
As you mentioned, RIN in one of the conditions is quite low, but is there any reason to suspect that the relationship between RIN and counts differs between the two conditions? If you fit counts ~ condition * rin, you are assuming a condition effect plus a RIN effect that differs by condition, meaning a different slope if you plot counts vs. RIN. You would need to pull out a few genes and check whether this is true. Also, to fit this model you need quite a lot of samples to estimate the effects accurately. See whether both of these hold.
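For reference, a minimal sketch of the additive model in DESeq2; the object names and the condition levels here are hypothetical:
library(DESeq2)
# sample_info is a data frame with a factor 'condition' and a numeric 'rin'
dds <- DESeqDataSetFromMatrix(countData = counts_matrix,
                              colData   = sample_info,
                              design    = ~ rin + condition)  # variable of interest last
dds <- DESeq(dds)
res <- results(dds, name = "condition_treated_vs_control")  # assumes levels control/treated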

Machine Learning Suggestions [closed]

I have data on a lot of students who got selected by some colleges based on their marks. I am new to machine learning. Can I have some suggestions on how I can use Azure Machine Learning to predict which colleges students can get into based on their marks?
Try a multi-class logistic regression - also look at this https://gallery.cortanaanalytics.com/Experiment/da44bcd5dc2d4e059ebbaf94527d3d5b?fromlegacydomain=1
Apart from logistic regression, as @neerajkh suggested, I would also try
one-vs-all classifiers. This method tends to work very well in multiclass problems (I assume you have many inputs, the students' marks, and many outputs, the different colleges).
To implement the one-vs-all algorithm I would use Support Vector Machines (SVMs). It is one of the most powerful algorithms (until deep learning came onto the scene - but you don't need deep learning here). A minimal sketch is given below.
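Staying in R, here is a minimal one-vs-all sketch using the e1071 package. The data frame students (numeric mark columns plus a factor column college) and new_students are hypothetical:
library(e1071)

one_vs_all_svm <- function(data, target) {
  classes <- levels(data[[target]])
  feats <- data[, setdiff(names(data), target), drop = FALSE]
  # one binary SVM per college: this college vs. all the others
  models <- lapply(classes, function(cl) {
    y <- factor(data[[target]] == cl)  # levels FALSE / TRUE
    svm(x = feats, y = y, probability = TRUE)
  })
  setNames(models, classes)
}

predict_ova <- function(models, newdata) {
  # score each binary model; pick the college whose model gives the
  # highest probability for its positive ("TRUE") label
  scores <- sapply(models, function(m) {
    attr(predict(m, newdata, probability = TRUE), "probabilities")[, "TRUE"]
  })
  scores <- matrix(scores, ncol = length(models),
                   dimnames = list(NULL, names(models)))
  colnames(scores)[max.col(scores)]
}

models <- one_vs_all_svm(students, "college")
predict_ova(models, new_students)  # new_students must hold the same mark columns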
If you could consider changing frameworks, I would suggest using Python libraries; in Python it is very straightforward, and very fast, to compute the problem you are facing.
Use random forests and feed this ML algorithm to OneVsRestClassifier, which is a multi-class classifier.
Keeping in line with other posters' suggestions of using multi-class classification, you could use artificial neural networks (ANNs)/a multilayer perceptron to do this. Each output node could be a college and, because you would be using a sigmoid (logistic) transfer function, the output of each node could be viewed directly as the probability of that college accepting a particular student (when making predictions).
Why don't you try softmax regression?
In extremely simple terms, softmax takes an input and produces a probability distribution over your classes. So, in other words, based on some input (grades in this case), your model can output a probability distribution that represents the "chance" a given student has of being accepted to each college.
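Softmax regression is available in R as multinomial logistic regression via nnet::multinom. A minimal sketch; the data frame students and its columns are hypothetical:
library(nnet)

# fit one softmax model over all colleges at once
fit <- multinom(college ~ maths + physics + english, data = students)

# probability distribution over colleges for each new student
probs <- predict(fit, newdata = new_students, type = "probs")
head(probs)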
I know this is an old thread but I will go ahead and add my 2 cents too.
I would recommend adding a multi-class, multi-label classifier. This allows you to find more than one college for a student. Of course this is much easier to do with an ANN, but an ANN is much harder to configure (say, the number of nodes/hidden layers, or even the choice of activation function).
The easiest way to do this, as @Hoap Humanoid suggests, is to use a Support Vector Classifier.
To apply any of these methods it's a given that you have a suitably diverse data set. I can't say how many data points you need - you have to experiment - but the accuracy of the model depends on the number of data points and their diversity.
This is very subjective. Just applying any algorithm that classifies into categories won't be a good idea. Without performing exploratory data analysis and checking the following, you can't be sure you are doing sound predictive analytics, quite apart from handling missing values:
Quantitative and qualitative variables.
Univariate, bivariate, and multivariate distributions.
Each variable's relationship to your response (college) variable.
Outliers (univariate and multivariate).
Required variable transformations.
Whether the Y variable can be broken into chunks, for example by location: whether a candidate can be part of colleges in California or New York and, if California is more likely, then which college. In this way you could capture linear + non-linear relationships.
For base learners you can fit a softmax regression model or one-vs-all logistic regression (it does not really matter much which) and CART for non-linear relationships. I would also run k-NN and k-means to check for distinct groups within the data and then decide on the predictive learners.
I hope this makes sense!
The least-squares support vector machine (LS-SVM) is a powerful algorithm for this application. Visit http://www.esat.kuleuven.be/sista/lssvmlab/ for more information.

ensemble results from different classifiers in R [closed]

I have predictions for my data from different classifiers. I would like to ensemble their results in order to get a better final result. Is this possible in R?
Let's say:
SVMpredict <- c(1, 0, 0, 1, ...)
RFpredict <- c(1, 1, 0, 1, ...)
NNpredict <- c(0, 0, 0, 1, ...)
Is it possible to combine the results with some ensemble technique in R? How?
thanks
Edited:
I ran my classifiers on different samples (in my case, DNA chromosomes). On some samples SVM works better than the others, such as RF. I want a technique that ensembles the results while taking into account which classifier works better.
For example, if I take the average of the output probabilities and round them, all classifiers are treated as equally effective. But when SVM works better, the SVM results (86% accuracy) should carry, say, 60% of the weight, RF (72% accuracy) 25%, and NN (64% accuracy) 15%. (These numbers are just examples for clarification.)
Is there any way I can do that?
It depends on the structure of your classifiers' output.
If it is a {0,1} outcome, as you provided, you can simply average the predictions and round the result:
round((SVMpredict + RFpredict + NNpredict) / 3)
If you know the performance of the classifiers, a weighted mean is a good idea - favour the ones that perform better; a simple accuracy-weighted sketch is given below. A hardcore approach is to optimize the weights via the optim function.
If you know the class probabilities for each prediction, it is better to average those instead of letting the classifiers just vote (the {0,1} output case above).

Test for significance in a time series using R [closed]

Given a simplified example time series looking at a population by year:
Year<-c(2001,2002,2003,2004,2005,2006)
Pop<-c(1,4,7,9,20,21)
DF<-data.frame(Year,Pop)
What is the best method to test for significance of change between years, i.e. which years are significantly different from each other?
As #joran mentioned, this is really a statistics question rather than a programming question. You could try asking on http://stats.stackexchange.com to obtain more statistical expertise.
In brief, however, two approaches come to mind immediately:
If you fit a regression line to population vs. year and the slope is statistically significant, that indicates an overall trend in population over the years. Use lm() in R, like this: lmPop <- lm(Pop ~ Year, data = DF).
You could divide the time period into blocks (e.g. the first three years and the last three years), and assume that the population figures for the years in each block are all estimates of the mean population during that block of years. That would give you a mean and a standard deviation of the population for each block of years, which would let you do a t-test, like this: t.test(Pop[1:3], Pop[4:6]).
Both of these approaches suffer from some potential difficulties, and the validity of each would depend on the nature of the data you're examining. For the sample data, however, the first approach suggests that there is a trend over time at the 95% confidence level (p = 0.00214 for the slope coefficient), while the second approach suggests that the null hypothesis of no difference in means cannot be rejected at the 95% confidence level (p = 0.06332).
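Both approaches are runnable as-is on the sample data:
Year <- c(2001, 2002, 2003, 2004, 2005, 2006)
Pop  <- c(1, 4, 7, 9, 20, 21)
DF   <- data.frame(Year, Pop)

summary(lm(Pop ~ Year, data = DF))  # slope p-value: 0.00214
t.test(Pop[1:3], Pop[4:6])          # Welch t-test p-value: 0.06332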
They're all significantly different from each other. 1 is significantly different from 4, 4 is significantly different from 7 and so on.
Wait, that's not what you meant? Well, that's all the information you've given us. As a statistician, I can't work with anything more.
So now you tell us something else: "Are any of the values significantly different from a straight line, where the variation in the Pop values consists of independent, Normally distributed errors with mean 0 and the same variance?" or something like that.
Simply put, a bunch of numbers on its own cannot be the subject of a statistical analysis. Working with a statistician, you need to agree on a model for the data; then the statistical methods can answer questions about significance and uncertainty.
I think that's often the thing non-statisticians don't get. They go "here are my numbers, is this significant?" - which usually means typing them into SPSS and getting a p-value out.
[have flagged this Q for transfer to stats.stackexchange.com where it belongs]

Unsupervised Learning in R? Classify Matrices - what is the right package? [closed]

Recently I watched a lot of Stanford's hilarious Open Classroom video lectures. Particularly the part about unsupervised machine learning got my attention. Unfortunately it stops where it might get even more interesting.
Basically I am looking to classify discrete matrices with an unsupervised algorithm. Those matrices just contain discrete values from the same range. Let's say I have thousands of 20x15 matrices with values ranging from 1-3. I just started to read through the literature and I feel that image classification is way more complex (color histograms) and that my case is rather a simplification of what is done there.
I also looked at the Machine Learning and Cluster CRAN Task Views but do not know where to start with a practical example.
So my question is: which package / algorithm would be a good pick to start playing around and working on the problem in R?
EDIT:
I realized that I might have been too imprecise: my matrices contain discrete choice data - so mean-based clustering might(!) not be the right idea. I do understand what you said about vectors and observations, but I am hoping for some function that accepts matrices or data.frames, because I have several observations over time.
EDIT2:
I realize that a package/function introduction that focuses on unsupervised classification of categorical data is what would help me most right now.
... classify discrete matrices by an unsupervised algorithm
You must mean cluster them. Classification is commonly done by supervised algorithms.
I feel that image classification is way more complex (color histograms) and that my case is rather a simplification of what is done there
Without knowing what your matrices represent, it's hard to tell what kind of algorithm you need. But a starting point might be to flatten your 20x15 matrices to produce length-300 vectors; each element of such a vector would then be a feature (or variable) to base the clustering on. This is the way most ML packages, including the cluster package you link to, work: "In case of a matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable."
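As an illustration of the flattening idea, here is a minimal sketch; the list of matrices mats is simulated as a stand-in for your real data:
set.seed(1)
mats <- replicate(1000, matrix(sample(1:3, 20 * 15, replace = TRUE), 20, 15),
                  simplify = FALSE)

# flatten each 20x15 matrix into a length-300 feature vector
X <- t(sapply(mats, as.vector))

# k-means as a first try; k = 4 is an arbitrary choice for illustration
km <- kmeans(X, centers = 4)
table(km$cluster)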
So far I found daisy from the cluster package, specifically the argument "gower", which refers to Gower's similarity coefficient for handling mixed types of data. Gower's is a fairly old distance metric, but it's what I found for use with categorical data.
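A minimal sketch of that approach, treating each flattened cell as categorical (X as in the sketch above):
library(cluster)

Xf <- as.data.frame(lapply(as.data.frame(X), factor))  # cells as factors
d  <- daisy(Xf, metric = "gower")                      # Gower dissimilarities
hc <- hclust(d)                                        # hierarchical clustering
clusters <- cutree(hc, k = 4)                          # k = 4 is arbitrary
table(clusters)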
You might want to start from here: http://cran.r-project.org/web/views/MachineLearning.html
