Major Edit:
I decided to rewrite this question since my original was poorly put. I will leave the original question below to maintain a record. Basically, I need to do Fisher's Test on tables as big as 4 x 5 with around 200 observations. It turns out that this is often a major computational challenge as explained here (I think, I can't follow it completely). As I use both R and Stata I will frame the question for both with some made-up data.
Stata:
tabi 1 13 3 27 46 \ 25 0 2 5 3 \ 22 2 0 3 0 \ 19 34 3 8 1 , exact(10)
You can increase exact() to 1000 max (but it will take maybe a day before returning an error).
R:
Job <- matrix(c(1,13,3,27,46, 25,0,2,5,3, 22,2,0,3,0, 19,34,3,8,1), 4, 5,
dimnames = list(income = c("< 15k", "15-25k", "25-40k", ">40k"),
satisfaction = c("VeryD", "LittleD", "ModerateS", "VeryS", "Ecstatic")))
fisher.test(Job)
For me, at least, it errors out on both programs. So the question is how to do this calculation on either Stata or R?
Original Question:
I have Stata and R to play with.
I have a dataset with various categorical variables, some of which have multiple categories.
Therefore I'd like to do Fisher's exact test with more than 2 x 2 categories
i.e. apply Fisher's to a 2 x 6 table or a 4 x 4 table.
Can this be done with either R or Stata?
Edit: whilst this can be done in Stata, it will not work for my dataset as I have too many categories. Stata goes through endless iterations and, even when left running for a day or more, does not produce a solution.
My question is really: can R do this, and can it do it quickly?
Have you studied the documentation of R function fisher.test? Quoting from help("fisher.test"):
For 2 by 2 cases, p-values are obtained directly using the (central or
non-central) hypergeometric distribution. Otherwise, computations are
based on a C version of the FORTRAN subroutine FEXACT which implements
the network developed by Mehta and Patel (1986) and improved by
Clarkson, Fan and Joe (1993).
This is an example given in the documentation:
Job <- matrix(c(1,2,1,0, 3,3,6,1, 10,10,14,9, 6,7,12,11), 4, 4,
dimnames = list(income = c("< 15k", "15-25k", "25-40k", "> 40k"),
satisfaction = c("VeryD", "LittleD", "ModerateS", "VeryS")))
fisher.test(Job)
# Fisher's Exact Test for Count Data
#
# data: Job
# p-value = 0.7827
# alternative hypothesis: two.sided
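For a table as large as the 4 x 5 one in the rewritten question, the exact network algorithm may run out of workspace. A minimal sketch of one possible workaround, assuming a Monte Carlo p-value is acceptable, uses the simulate.p.value and B arguments of fisher.test (the seed and the value of B here are arbitrary choices, not recommendations):

```r
# Table from the question (filled column-wise, as matrix() does by default)
Job <- matrix(c(1,13,3,27,46, 25,0,2,5,3, 22,2,0,3,0, 19,34,3,8,1), 4, 5,
              dimnames = list(income = c("< 15k", "15-25k", "25-40k", ">40k"),
                              satisfaction = c("VeryD", "LittleD", "ModerateS",
                                               "VeryS", "Ecstatic")))

set.seed(1)  # arbitrary seed, only to make the simulation reproducible
# Monte Carlo p-value based on B random tables with the observed margins
res <- fisher.test(Job, simulate.p.value = TRUE, B = 10000)
res$p.value
```

fisher.test also has a workspace argument that can be raised for the exact computation, though whether that finishes in reasonable time for a table like this is something to try rather than assume.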
As far as Stata is concerned, your original statement was totally incorrect. search fisher leads quickly to help tabulate twoway, and:
the help for the exact option explains that it may be applied to r x c as well as to 2 x 2 tables;
the very first example of Fisher's exact test in the same place underlines that Stata is not limited to 2 x 2 tables.
It's a minimal expectation anywhere on this site that you try to read basic documentation. Please!
I am trying to fit values in my algorithm so that I can predict next month's number. I am getting a "No data for variable" error even though I have clearly defined the objects that I am putting into the equation.
I've tried to place them in vectors so that it could use one vector as a training data set to predict the new values. Current script has worked for me for a different dataset but for some reason isn't working here.
The data is small so I was wondering if that has anything to do with it. The data is:
Month io obs Units Sold
12 in 1 114
1 in 2 29
2 in 3 105
3 in 4 30
4 in 5
I'm trying to predict Units Sold with the code below
matt<-TEST1
isdf<-matt[matt$month<=3,]
isdf<-na.omit(isdf)
osdf<-matt[matt$Units.Sold==4,]
lmfit<-lm(Units.Sold~obs+Month,data=isdf,na.action=na.omit)
predict(lmFit,osdf[1,1])
I am expecting to be able to place lmfit in predict and get an output.
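A minimal sketch of how the fit and prediction could be wired up, assuming the columns are named Month, obs and Units.Sold as in the table above. R is case-sensitive, so matt$month and lmFit are different objects from matt$Month and lmfit, and predict() expects a data frame in newdata whose columns match the predictors. The data frame below is rebuilt by hand from the posted rows, and using obs as the only predictor is an illustrative choice, not the original model:

```r
# Rebuild the posted data; the last month has no Units.Sold yet
matt <- data.frame(Month = c(12, 1, 2, 3, 4),
                   obs = 1:5,
                   Units.Sold = c(114, 29, 105, 30, NA))

isdf <- matt[!is.na(matt$Units.Sold), ]  # in-sample rows (known sales)
osdf <- matt[is.na(matt$Units.Sold), ]   # out-of-sample row to predict

# Fit on the known rows; obs is used as a simple time index
lmfit <- lm(Units.Sold ~ obs, data = isdf)

# newdata must be a data frame containing the predictor columns
pred <- predict(lmfit, newdata = osdf)
pred
```

With only four observations the fit is very rough, so the number is best read as a check that the code runs rather than as a forecast.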
I am a newbie in R programming and seek help in analyzing the Metabolomics data - 118 metabolites with 4 conditions (3 replicates per condition). I would like to know, for each metabolite, which condition(s) is significantly different from which. Here is part of my data
> head(mydata)
Conditions HMDB03331 HMDB00699 HMDB00606 HMDB00707 HMDB00725 HMDB00017 HMDB01173
1 DMSO_BASAL 0.001289121 0.001578235 0.001612297 0.0007772231 3.475837e-06 0.0001221674 0.02691318
2 DMSO_BASAL 0.001158363 0.001413287 0.001541713 0.0007278363 3.345166e-04 0.0001037669 0.03471329
3 DMSO_BASAL 0.001043537 0.002380287 0.001240891 0.0008595932 4.007387e-04 0.0002033625 0.07426482
4 DMSO_G30 0.001195253 0.002338346 0.002133992 0.0007924157 4.189224e-06 0.0002131131 0.05000778
5 DMSO_G30 0.001511538 0.002264779 0.002535853 0.0011580857 3.639661e-06 0.0001700157 0.02657079
6 DMSO_G30 0.001554804 0.001262859 0.002047611 0.0008419137 6.350990e-04 0.0000851638 0.04752020
This is what I have so far.
I learned the first line from this post
kwtest_pvl = apply(mydata[,-1], 2, function(x) kruskal.test(x,as.factor(mydata$Conditions))$p.value)
and this is where I loop through the metabolites that passed the KW test
tCol = colnames(mydata[,-1])[kwtest_pvl <= 0.05]
for (k in tCol){
output = posthoc.kruskal.dunn.test(mydata[,k],as.factor(mydata$Conditions),p.adjust.method = "BH")
}
I am not sure how to manage my output such that it is easier to manage for all the metabolites that passed KW test. Perhaps saving the output from each iteration appending to excel? I also tried dunn.test package since it has an option of table or list output. However, it still leaves me at the same point. Kinda stuck here.
Moreover, should I also apply some kind of p-value adjustment (e.g. FWER or FDR/BH) right after the KW test, before performing the post hoc test?
Any suggestion(s) would be greatly appreciated.
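One way to keep the post hoc output manageable is to collect each metabolite's result in a named list and bind everything into one long data frame. The sketch below is only an illustration under several assumptions: the data are made up (two hypothetical metabolites M1 and M2, four conditions, three replicates), base R's pairwise.wilcox.test stands in for posthoc.kruskal.dunn.test so that no extra package is needed, and adjusting the KW p-values across metabolites with BH is one common choice rather than the only defensible one:

```r
# Made-up data: M1 differs between conditions, M2 does not
mydata <- data.frame(
  Conditions = rep(c("DMSO_BASAL", "DMSO_G30", "DRUG_BASAL", "DRUG_G30"), each = 3),
  M1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
  M2 = c(1, 5, 9, 2, 6, 10, 3, 7, 11, 4, 8, 12)
)
grp <- factor(mydata$Conditions)

# Kruskal-Wallis p-value per metabolite, then BH adjustment across metabolites
kw_p   <- sapply(mydata[, -1], function(x) kruskal.test(x, grp)$p.value)
kw_adj <- p.adjust(kw_p, method = "BH")
keep   <- names(kw_adj)[kw_adj <= 0.05]

# Post hoc test per retained metabolite, stored as one long table each
results <- lapply(setNames(keep, keep), function(k) {
  ph  <- pairwise.wilcox.test(mydata[[k]], grp, p.adjust.method = "BH")
  out <- as.data.frame(as.table(ph$p.value))  # matrix of p-values -> long format
  names(out) <- c("group1", "group2", "p_adj")
  out <- out[!is.na(out$p_adj), ]             # drop the empty upper triangle
  cbind(metabolite = k, out)
})
all_results <- do.call(rbind, results)
all_results
```

The combined table can then be written out once with write.csv(all_results, "posthoc.csv") and opened in Excel, instead of appending inside the loop.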
Okay, let me be as clear as I can in my problem. I'm new to R, so your patience is appreciated.
I want to create a histogram using two different vectors. The first vector contains a list of models (products). These models are listed as either integers, strings, or NA. I'm not exactly sure how R is storing them (I assume they're kept as strings), or if that is a relevant issue. I also have a vector containing a list of incidents pertaining to that model. So for example, one row in the dataframe might be:
Model Incidents
XXX1991 7
How can I create a histogram where the number of incidents for each model is shown? So the histogram will look like
| =
| =
Frequency of | =
Incidents | = =
| = = =
| = = = = =
- - - - - -
Each different Model
Just to give a general idea.
I also need to be able to map everything out with standard deviation lines, so that it's easy to see which models are the least reliable. But that's not the main question here. I just don't want to do anything that will make me unable to use standard deviation in the future.
So far, all I really understand is how to make a histogram with the frequency marked, but for some reason, the x-axis is marked with numbers, not the models' names.
I don't really care if I have to download new packages to make this work, but I suspect that this already exists in basic R or ggplot2 and I'm just too dumb to figure it out.
Feel free to ask clarifying questions. Thanks.
EDIT: I forgot to mention, there are multiple rows of incidents listed under each model. So to add to my example earlier:
Model Incidents
XXX1991 7
XXX1991 1
XXX1991 19
3
5
XXX1002 9
XXX1002 4
etc . . .
I want to add up all the incidents for a model under one label.
I am assuming that you did not mean to leave the model blank in your example, so I filled in some values.
You can add up the number of incidents by model using aggregate then make the relevant plot using barplot.
## Example Data
data = read.table(text="Model Incidents
XXX1991 7
XXX1991 1
XXX1991 19
XXX1992 3
XXX1992 5
XXX1002 9
XXX1002 4",
header=TRUE)
TAB = aggregate(data$Incidents, list(data$Model), sum)
TAB
#   Group.1  x
# 1 XXX1002 13
# 2 XXX1991 27
# 3 XXX1992  8
barplot(TAB$x, names.arg=TAB$Group.1 )
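Since the question also mentions standard deviations, here is a small sketch, using the same example data, that keeps the per-model spread next to the totals. tapply is used instead of aggregate purely as a matter of taste, and the object names are made up for illustration:

```r
data <- read.table(text = "Model Incidents
XXX1991 7
XXX1991 1
XXX1991 19
XXX1992 3
XXX1992 5
XXX1002 9
XXX1002 4",
header = TRUE)

# Total and standard deviation of incidents per model (named vectors)
total <- tapply(data$Incidents, data$Model, sum)
sdev  <- tapply(data$Incidents, data$Model, sd)

# One row per model, ready for further ranking or plotting
summary_tab <- data.frame(Model = names(total),
                          Total = as.vector(total),
                          SD    = as.vector(sdev))
summary_tab

# A named vector gives the bar labels directly
barplot(total, xlab = "Model", ylab = "Total incidents")
```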
I have a seemingly easy question which however is troubling me a bit.
I have pairs of vectors made up of nominal attributes. They can be of different lengths, and sometimes some of the attributes in one might not be included in the other. See a and b as two potential examples.
a
1 mathematician
2 engineer
3 mathematician
4 mathematician
5 mathematician
6 engineer
7 mathematician
8 mathematician
9 mathematician
10 mathematician
11 mathematician
12 engineer
13 mathematician
14 mathematician
15 engineer
b
1 physicist
2 surgeon
3 physicist
4 surgeon
5 physicist
6 physicist
7 surgeon
8 surgeon
9 physicist
10 physicist
11 mathematician
Do you have in mind a measure (an index) that could summarize the dissimilarity between them? The type of measure I am looking for is something like the Euclidean distance, but for qualitative vectors.
One option I thought of is to actually compute the Euclidean distance among the categorical vectors earlier transformed into frequency vectors. In this way, they would become quantitative and would be of the same length. But my question is, do you find this a sound approach?
More generally, is there an R package that tackles these types of distances? Can you suggest other distances suitable for nominal variables?
Many thanks!
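The frequency-vector idea described in the question can be sketched as follows. The vectors are rebuilt with rep() from the counts in the listings above (order does not matter for frequencies), and the total variation distance is added only as one further illustrative option, not as a recommendation:

```r
a <- rep(c("mathematician", "engineer"), times = c(11, 4))
b <- rep(c("physicist", "surgeon", "mathematician"), times = c(6, 4, 1))

# Put both vectors on the same set of categories so the tables align
lev <- union(a, b)
pa <- prop.table(table(factor(a, levels = lev)))  # relative frequencies in a
pb <- prop.table(table(factor(b, levels = lev)))  # relative frequencies in b

# Euclidean distance between the two frequency vectors
euclid <- sqrt(sum((pa - pb)^2))

# Total variation distance: half the sum of absolute differences (0 to 1)
tv <- sum(abs(pa - pb)) / 2

c(euclidean = euclid, total_variation = tv)
```

Using proportions rather than raw counts keeps the result from depending on the two vectors having different lengths.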
I've only come across the unalikeability coefficient.
http://www.amstat.org/publications/jse/v15n2/kader.html
Weird name, intuitive approach, and incredibly simple implementation. For example:
> table(a)
a
engineer mathematician
4 11
> unalike(table(a))
[1] 0.391
> table(b)
b
mathematician physicist surgeon
1 6 4
> unalike(table(b))
[1] 0.562
It is clear just by eye-balling that b would be more dissimilar, and this coefficient gives a more quantitative measure.
There are some examples in the paper which I will calculate for you here:
> unalike(3,7)
[1] 0.42
> unalike(5,5)
[1] 0.5
> unalike(1,9)
[1] 0.18
The formula in this function is based on the paper I linked you to above:
unalike <- function(...) {
  props <- c(...)                            # category counts
  zzz <- 1 - sum((props / sum(props))^2)     # 1 minus sum of squared proportions
  zzz <- round(zzz, 3)
  return(zzz)
}
Let me know how your thing goes since this is a small side project for me as well.
I am not sure this is a programming question, because you do not know what you want to do yet, so we can't offer a solution. I think the main question here is what are you going to use this measure for, because you can measure dissimilarities in a lot of different ways, some will be good for what you want and some will not.
But trying to answer anyway, there is the utils::adist function and there is also a package called stringdist (these are the ones I have used before). But it seems that they are not quite what you want, based on your question, because they will measure the distance for each character string, and not for the whole matrix. But you could use them to have some ideas about how to measure the distance between the two vectors. For example, one measure could be how many changes you would have to make in vector a so it turns to vector b.
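For reference, a small sketch of what utils::adist returns when applied to the category labels from the question. It measures spelling (edit) distance between the labels themselves, which is probably not a meaningful dissimilarity for nominal categories, so this is only meant to show the shape of the output:

```r
la <- c("mathematician", "engineer")              # distinct labels in a
lb <- c("physicist", "surgeon", "mathematician")  # distinct labels in b

# Generalized Levenshtein distance between every pair of labels
d <- adist(la, lb)
dimnames(d) <- list(la, lb)
d
```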
Thank you for keeping this open.
One option, which appears to have become available after this discussion, is R's qualvar (Gombin) package. The package provides functions for each of Wilcox's (1967, 1973) Indices of Qualitative Variation. Included with the package is a useful vignette summarizing implementation and results. I have found in limited experience that index selection requires some brute-force testing with actual and simulated data.
I am trying to generate a Poisson Table in R for two events, one with mean 1.5 (lambda1) and the other with mean 1.25 (lambda2). I would like to generate the probabilities in both cases for x=0 to x=7+ (7 or more). This is probably quite simple but I can't seem to figure out how to do it! I've managed to create a data frame for the table but I don't really know how to input the parameters as I've never written a function before:
name <- c("0","1","2","3","4","5","6","7+")
zero <- mat.or.vec(8,1)
C <- data.frame(row.names=name,
"0"=zero,
"1"=zero,
"2"=zero,
"3"=zero,
"4"=zero,
"5"=zero,
"6"=zero,
"7+"=zero)
I am guessing I will need some "For" loops and will involve dpois(x,lambda1) at some point. Can somebody help please?
I'm assuming these events are independent. Here's one way to generate a table of the joint PMF.
First, here are the names you've defined, along with the lambdas:
name <- c("0","1","2","3","4","5","6","7+")
lambda1 <- 1.5
lambda2 <- 1.25
We can get the marginal probabilities for 0-6 by using dpois, and the marginal probability for 7+ using ppois and lower.tail=FALSE. Since ppois(q, lambda, lower.tail=FALSE) returns P(X > q), the 7+ bin needs q = 6:
p.x <- c(dpois(0:6, lambda1), ppois(6, lambda1, lower.tail=FALSE))
p.y <- c(dpois(0:6, lambda2), ppois(6, lambda2, lower.tail=FALSE))
An even better way might be to create a function that does this given any lambda.
Then you just take the outer product (really, the same thing you would do by hand, outside of R) and set the names:
p.xy <- outer(p.x, p.y)
rownames(p.xy) <- colnames(p.xy) <- name
Now you're done. The printout below was produced with ppois(7, ...), so its 7+ row and column hold P(X > 7); with ppois(6, ...) only those entries change:
0 1 2 3 4 5
0 6.392786e-02 7.990983e-02 4.994364e-02 2.080985e-02 6.503078e-03 1.625770e-03
1 9.589179e-02 1.198647e-01 7.491546e-02 3.121478e-02 9.754617e-03 2.438654e-03
2 7.191884e-02 8.989855e-02 5.618660e-02 2.341108e-02 7.315963e-03 1.828991e-03
3 3.595942e-02 4.494928e-02 2.809330e-02 1.170554e-02 3.657982e-03 9.144954e-04
4 1.348478e-02 1.685598e-02 1.053499e-02 4.389578e-03 1.371743e-03 3.429358e-04
5 4.045435e-03 5.056794e-03 3.160496e-03 1.316873e-03 4.115229e-04 1.028807e-04
6 1.011359e-03 1.264198e-03 7.901240e-04 3.292183e-04 1.028807e-04 2.572018e-05
7+ 4.858139e-05 6.072674e-05 3.795421e-05 1.581426e-05 4.941955e-06 1.235489e-06
6 7+
0 3.387020e-04 1.094781e-05
1 5.080530e-04 1.642171e-05
2 3.810397e-04 1.231628e-05
3 1.905199e-04 6.158140e-06
4 7.144495e-05 2.309303e-06
5 2.143349e-05 6.927908e-07
6 5.358371e-06 1.731977e-07
7+ 2.573935e-07 8.319685e-09
You could have also used a loop, as you originally suspected, but that's a more roundabout way to the same solution.
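The function-based version mentioned above could look like the following sketch. The name poisson_margin and its max_count argument are made up for illustration; the last bin uses ppois(max_count - 1, ..., lower.tail = FALSE) because that call returns P(X > q), so q = 6 gives the "7 or more" probability:

```r
# Marginal probabilities for 0, 1, ..., max_count - 1 and "max_count or more"
poisson_margin <- function(lambda, max_count = 7) {
  p <- c(dpois(0:(max_count - 1), lambda),
         ppois(max_count - 1, lambda, lower.tail = FALSE))
  names(p) <- c(0:(max_count - 1), paste0(max_count, "+"))
  p
}

# Joint table for two independent Poisson counts; outer() keeps the names
joint <- outer(poisson_margin(1.5), poisson_margin(1.25))

sum(joint)  # should be 1 up to floating-point error
```

Because each margin sums to 1, the joint table does too, which is a quick sanity check on the 7+ bin.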