Customized Fisher exact test in R

Beginner's question ahead!
(After spending much time, I could not find a straightforward solution.)
After trying all the relevant posts I still can't seem to find the answer, perhaps because my question is quite basic.
I want to run fisher.test on my data (whatever data, it doesn't really matter to me - mine is Rubin's children's TV workshop example from QR33 - http://www.stat.columbia.edu/~cook/qr33.pdf). It has to simulate a completely randomized experiment.
My assumption is that an RCT in this context means that all units have the same probability of being assigned to treatment (1/N). (Of course, correct me if I'm wrong - thanks.)
I was asked to create a customized function and my function has to include the following arguments:
Treatment observations (vector)
Control observations (vector)
A scalar representing the value, e.g., zero, of the sharp null hypothesis; and
The number of simulated experiments the function should run.
When digging into R's fisher.test I see that I can specify x, y and many other parameters, but I'm unsure about the following:
What's the meaning of y? (The help text - "a factor object; ignored if x is a matrix" - is not informative as to its statistical meaning.)
How do I specify my null hypothesis (i.e. if I don't want to use 0)? I see that there is a class "htest" with null.value, but how can I use it in the function?
Regarding the number of simulations, my plan is to run everything through a loop - sounds expensive - any ideas how to write it better?
Thanks for helping - this is not an easy task, I believe; hopefully it will be useful for many people.
Cheers,
NB - I found the following explanations unsatisfying:
https://www.r-bloggers.com/contingency-tables-%E2%80%93-fisher%E2%80%99s-exact-test/
https://stats.stackexchange.com/questions/252234/compute-a-fisher-exact-test-in-r
https://stats.stackexchange.com/questions/133441/computing-the-power-of-fishers-exact-test-in-r
https://stats.stackexchange.com/questions/147559/fisher-exact-test-on-paired-data

It's not completely clear to me that a Fisher test is necessarily the right thing for what you're trying to do (that would be a good question for stats.SE) but I'll address the R questions.
As is explained at the start of the section on "Details", R offers two ways to specify your data.
You can either (1) supply to the argument x a contingency table of counts (omitting y entirely), or (2) supply the observations on individuals as two vectors, x and y, each a factor indicating the row or column category of every observation (it doesn't matter which is which). [I'm not sure why it also doesn't let you specify x as a vector of counts and y as a data frame of factors, but it's easy enough to convert.]
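For concreteness, a minimal sketch of the two calling conventions (the counts here are made up):

# (1) a contingency table of counts supplied as x; y is omitted
tab <- matrix(c(8, 2, 3, 7), nrow = 2, byrow = TRUE,
              dimnames = list(group  = c("treated", "control"),
                              result = c("success", "failure")))
fisher.test(tab)

# (2) two factors, one entry per individual observation: x gives the row
#     category and y the column category
group  <- factor(rep(c("treated", "control"), times = c(10, 10)))
result <- factor(rep(c("success", "failure", "success", "failure"),
                     times = c(8, 2, 3, 7)))
fisher.test(group, result)  # same data, hence the same p-value as above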
With a Fisher test, the null hypothesis under which (conditionally on the margins) the observation categories become exchangeable is independence, but you can choose to make it one- or two-tailed (via the alternative argument).
I'm not sure I clearly understand the simulation aspect, but I almost never use a loop for simulations (not for efficiency, but for clarity and brevity). The function replicate is very good for doing simulations; I use it roughly daily, sometimes many times a day.
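To make that concrete, here is a minimal sketch of the kind of function the question describes - treatment vector, control vector, a scalar sharp null, and a number of simulated experiments - built around replicate(). The function name, the choice of the difference in means as the test statistic, and the assumption that re-randomization keeps the number of treated units fixed are all my own:

fisher_rand_test <- function(treated, control, tau0 = 0, nsim = 1000) {
  n_t <- length(treated)
  n_c <- length(control)
  # under the sharp null Y_i(1) = Y_i(0) + tau0, both potential outcomes are known
  y0  <- c(treated - tau0, control)
  y1  <- y0 + tau0
  obs <- mean(treated) - mean(control)      # observed difference in means
  sims <- replicate(nsim, {
    w <- sample(rep(c(1, 0), c(n_t, n_c)))  # a fresh completely randomized assignment
    mean(y1[w == 1]) - mean(y0[w == 0])
  })
  mean(abs(sims - tau0) >= abs(obs - tau0)) # two-sided randomization p-value
}

# usage, e.g.: fisher_rand_test(treated = rnorm(20, 1), control = rnorm(20), tau0 = 0, nsim = 5000)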

Related

Using PERMANOVA in R to analyse the effect of 3 independent variables on reef systems

I am trying to understand how to run a PERMANOVA using adonis2 in R to analyse some data that I have collected. I have been looking online but, as often happens, the explanations are a bit convoluted, so I am asking for your help, if you can help me. I have some fish and coral groups as columns, as well as 3 independent variables (reef age, depth, and material). (Snapshot of my dataset structure not shown.)
I think I have understood that p-values are not the only important bit of the output, and that the R2 values indicate how much each variable contributes to the model. Is there something wrong, or something I am missing, here?
Also, I think I understood that I should check for homogeneity of variance, but I have not understood whether I should check for it on each variable independently, or whether I should include them all in the same bit of code (which does not seem to work). Here is the code I am using to run the PERMANOVA (1), and the code I am trying to use to assess homogeneity of variance, which does not work (2).
(1) adonis2(species ~ Age + Material + Depth, data = data.var, by = "margin")
'species' is the subset of the dataset including all the species counts, while 'data.var' is the subset including the 3 independent variables. Also, what is the difference between using '+' and '*' in the formula? When I use '*' it gives me 'Error in qr.X(object$CCA$QR) : need larger value of 'ncol' as pivoting occurred'. What does this mean?
(2) variance.check <- betadisper(species.distance, data.var, type = "centroid", bias.adjust = FALSE)
'species.distance' is a distance matrix calculated through 'vegdist' using the Bray-Curtis method. I used 'data.var' to check variance on all 3 independent variables at once, but it does not work, while it works if I check them independently (3). Why is that?
(3) variance.check <- betadisper(species.distance, data$Depth, type = "centroid", bias.adjust = FALSE)
Thank you in advance for your responses, and for your help. It will really help me to get my head around it (and sorry for the many questions).
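For reference, a minimal sketch of the workflow described above (it reuses the species and data.var objects from the question, assumes the vegan package, and runs the dispersion check one factor at a time, as in (3)):

library(vegan)

species.distance <- vegdist(species, method = "bray")

# PERMANOVA with marginal tests for each term
adonis2(species.distance ~ Age + Material + Depth, data = data.var, by = "margin")

# betadisper() takes a single grouping factor, so check each variable separately
for (v in c("Age", "Material", "Depth")) {
  disp <- betadisper(species.distance, data.var[[v]], type = "centroid")
  print(anova(disp))  # test of homogeneity of multivariate dispersions
}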

Behaviour of dfmax in glmnet

(NB: This is a slightly modified version of a post I'd made on a different forum. I received no responses there, hence the post here. If this is not allowed, please let me know, will take down the question).
I am new to glmnet, so I do not yet fully understand what the various parameters do. I am trying to build a multinomial classifier which restricts the number of features used in the model. From reading the docs and some answers on this forum, I understand dfmax is the way to do it. I played around with it a bit; I have a couple of questions and would appreciate some help:
Setup
For a particular dataset, I want to restrict the number of features to 3; the original data has 126 features. Here's what I run:
library(glmnet)  # glmnet()
library(broom)   # tidy()

fit <- glmnet(data.matrix(X), data.matrix(y), family = 'multinomial', dfmax = 3)
d <- data.frame(tidy(fit))
This is the value of d:
My questions about the output:
1. I see multiple values of lambda in there; it looks like glmnet tries to fit lambdas that get the number of terms close to dfmax=3. So it's less like the LARS algorithm (where we move stagewise by adding variables and can stop at an exact number of variables) and more about finding the right lambdas for regularization that lead to the intended dfmax. Is this right?
2. I'm guessing alpha plays a role in how close we can get to dfmax. At alpha=1 we're doing the lasso, so it's easier to get close to dfmax, compared to alpha=0 where we're doing ridge. Is this understanding correct?
3. A "neighborhood" of dfmax is the best we can do, it'd seem. Or am I missing a parameter that gets me to the model with exactly dfmax variables? (FYI: alpha=1 doesn't seem to get me to the precise number of non-zero terms either, at least on this dataset.)
4. In the first solution (step=1) there are no variables used. Does this mean the relative odds equal a constant?
5. What does pmax do?
Thanks in advance!
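For what it's worth, a rough sketch of one way to pull the least-regularized model on the lambda path that still respects dfmax; the object names X and y and the limit of 3 are taken from the setup above, and fit$df is (as I understand it) glmnet's count of variables with a nonzero coefficient in any class:

library(glmnet)

fit <- glmnet(data.matrix(X), data.matrix(y), family = 'multinomial', dfmax = 3)

# keep the last lambda on the path whose number of nonzero variables is within the limit
ok <- which(fit$df <= 3)
lambda.sel <- fit$lambda[max(ok)]

coef(fit, s = lambda.sel)  # per-class coefficient vectors at that lambda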

How to take a Probability Proportional to Size (PPS) Unequal Probability sample using R?

I have very little programming experience, but I'm working on a statistics project and would like to generate an unequal probability sample where the inclusion probability of a unit is based on its size (PPS).
Basically, I have two datasets:
ds1 lists US states and the parameter I'm trying to estimate
ds2 has the population size of each state.
My questions:
I want to use R to select a random sample from the first dataset using inclusion probabilities based on the population of each state (second dataset).
Also is there any way to use R to calculate these Generalized Unequal Probability Estimator formulas?
Also just a note on the formulas: pi_i is inclusion probability and pi_ij is joint inclusion probability.
There is a package for this in R - pps - and the documentation is here.
Also, there is another package called survey with a bit of documentation here.
I'm not sure of the difference between the two and haven't used them myself. Hope this is what you're looking for.
Yes, that's called weighted sampling. Simply set the weight to the size of the state; strictly speaking you don't even need to normalize the weights by 1/sum(sizes), although it's always good practice to. There are tons of duplicate posts on SO showing how to do weighted sampling.
The only tiny complication is that you need to do a join() of the datasets ds1 and ds2. Show us what code you've tried if it's causing problems. I recommend you use either dplyr or data.table.
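A minimal sketch of that approach, assuming ds1 and ds2 share a state column and that the population column in ds2 is called population (both column names are made up here):

library(dplyr)

ds <- inner_join(ds1, ds2, by = "state")  # attach each state's population size

n <- 10  # desired sample size
# weighted sampling without replacement, probability proportional to population;
# this simple sample() call is only approximately PPS - the pps/survey packages
# mentioned above implement exact PPS designs
idx <- sample(nrow(ds), size = n, prob = ds$population)
my.sample <- ds[idx, ]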
Your second question should be asked as a separate question, and is off-topic on SO, or at least won't get a great response - it's best to ask statistical questions at the sister site CrossValidated.

LASSO coefficients equal to 0 using opt1D

I have a question about LASSO. It's driving me crazy because it is something that I cannot solve with my background alone; I'm a biologist.
Briefly, I ran LASSO using the R library "penalized". In particular I used the opt1D function with around 500 simulations on a numerical data.frame of around 30 columns, which are the biomarkers (gene expression) I want to test, and 3000 rows, which are people, of whom around 50 are tumours and all the others are normals.
Unfortunately, using L1 regularization, all - really all - coefficients across the 500 simulations are 0. If I check the L2 matrix of coefficients, they are close to 0. Now my point is that I cannot believe that all my biomarkers are unable to distinguish between normals and tumours.
I don't know if what I have done is all I can do to check the discriminatory potential of my molecules. Is there something else I can do to understand why they are all 0, and something else I can do to verify that they really are not able to stratify my cohort?
Did you consider fitting your data without penalization before using regularization? L1 regularization will naturally result in a significant number of zero coefficients.
As a side note, I would first run PCA/PCoA and see whether or not your genes separate the samples according to your class variable. This could save you some time and allow you to trim your data set to those genes that show the greatest differences across your class variable. Also, if you have relatively little experience with R, I would suggest using a linear modelling package such as limma, since it has excellent documentation and many examples that are easy to follow.
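A minimal sketch of that PCA check, assuming the 3000 x 30 expression data.frame is called expr and the tumour/normal labels are in a factor called status (both names are made up):

pca <- prcomp(expr, center = TRUE, scale. = TRUE)

# first two principal components, coloured by class
plot(pca$x[, 1:2],
     col  = ifelse(status == "Tumour", "red", "grey40"),
     pch  = 19, xlab = "PC1", ylab = "PC2")
legend("topright", legend = c("Tumour", "Normal"),
       col = c("red", "grey40"), pch = 19)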

Orthoplan in R (fractional factorial design)

I am trying to create a factorial design in R for a conjoint analysis experiment (like SPSS orthoplan).
Searching among past Stack Overflow questions, I found this previous answer:
How to create a fractional factorial design in R?
It is indeed a useful answer, but only in the case where your factors have numeric levels.
That's unfortunately not my case, because the factors I want to use are nominal variables, i.e. their levels are not numeric but factor type: for example, I have to deal with a factor indicating the color of a product, which can be green, yellow or red.
I've tried modifying the code proposed as an answer to the question How to create a fractional factorial design in R?
in such a way:
library(AlgDesign)  # gen.factorial() and optFederov() come from the AlgDesign package
f.design <- gen.factorial(levels.design, factors = "all")
but the result is neither balanced nor orthogonal. Moreover, you have to define the exact number of trials in the optFederov function. In that answer the suggested number of trials was:
nTrials=sum(levels.design)
but in order to have a balanced solution in a design with nominal factors, I expect it should at least be :
nTrials=prod(unique(levels.design))
There is a package, anyway, that could deal with such an issue - the package FrF2 by Prof. Ulrike Groemping - but it handles only dichotomous variables, and I cannot figure out how to use it to solve my problem.
After researching an answer by myself for a while, I can share what I have found:
yes, you can build orthogonal designs in R, in a similar fashion to SPSS orthoplan.
Just define a vector (here levels.design) containing the number of levels of each of your variables.
Then you have to call:
library(DoE.base)
fract.design <- oa.design(nlevels = levels.design)
The function will look up a library of orthogonal designs (specifically Kuhfeld, W., 2009, Orthogonal Arrays).
If there isn't a suitable orthogonal design available, the function will just return the full factorial design (and then you'll have no other choice in R but to call the optFederov function, as explained above in my question).
As an example try:
oa.design(nlevels=c(2,2,3))
oa.design(nlevels=c(2,2,4))
The first doesn't have an orthogonal-array solution (so you'll get back the full factorial), but the second one does: an orthogonal and balanced design with 8 cards.
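As a small follow-up sketch, since the original problem involved nominal factors: the numeric levels of the returned design can simply be relabelled afterwards. This assumes the default column names A, B, C and uses made-up labels for the 4-level factor:

library(DoE.base)

design <- oa.design(nlevels = c(2, 2, 4))
design <- as.data.frame(design)  # plain data frame of the 8 runs
design$C <- factor(design$C, labels = c("wood", "metal", "plastic", "glass"))
design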
