Cluster analysis in R on large data set - r

I have a data set with rankings as the column names and about 15,000 contestants. My data looks like:
contestant
1
2
3
4
101
13
0
5
12
14
0
1
34
6
...
...
...
...
...
500
0
2
23
3
I've been working on doing cluster analysis on this dataset. The dendrograms are obviously not very helpful with this dataset--it produces a thick block line because of the large number of entries.
I'm wondering if there is a better way to do cluster analysis with this type of data. I've tried
fviz_cluster()
and similar commands, as well as went through multiple tutorials. Many tutorials guided me through making dendograms. The data all seems to be different than mine (comparing two variables, etc) and much smaller. Essentially, I'm asking which types of cluster analysis may work well with this type of data.

Related

Statistical test for amplicon sequencing data

I have identified the top 50 genera based on the relative abundance in 16S amplicon sequencing data. Now I want to perform a statistical test to compare the two conditions; salt and non-salt. But not sure how to do a normality check and then how to do statistical tests for these multiple genera in R. One more issue in this data; there are lots of zeros.
data = read.table("statisticalTest.salt_vs_non_salt.txt",header=T,sep='\t',check.names=F)
head(data)
salt salt salt non_salt non_salt
1 Burkholderia 0.097731239 0.244541624 0.001426243 0.116990216 0.013678457
2 Bradyrhizobium 0.193819936 0.135586384 0.094290501 0.048097711 0.020956642
3 Anaeromyxobacter 0.103890771 0.087270334 0.139477497 0.117391401 0.219336751
4 Geobacter 0.008452247 0.018362646 0.016118808 0.045958054 0.051796890
5 Microbacterium 0.000513294 0.003237345 0.019378792 0.000156017 0.000538076
6 Methylocystis 0.008794443 0.010794689 0.004957892 0.001760760 0.002265583
non_salt
1 0.27859036
2 0.09660323
3 0.20067454
4 0.03892350
5 0.00000000
6 0.01459201
Many thanks

Why is R adding empty factors to my data?

I have a simple data set in R -- 2 conditions called "COND", and within those conditions adults chose between one of 2 pictures, we call house or car. This variable is called "SAW"
I have 69 people, and 69 rows of data
FOR SOME Reason -- R is adding an empty factor to both, How do I get rid of it?
When I type table to see how many are in each-- this is the output
table(MazeData$SAW)
car house
2 9 59
table(MazeData$COND)
Apples No_Apples
2 35 33
Where the heck are these 2 mystery rows coming from? it wont let me make my simple box plots and bar plots or run t.test because of this error - can someone help? thanks!!

Test performing on counts

In R a dataset data1 that contains game and times. There are 6 games and times simply tells us how many time a game has been played in data1. So head(data1) gives us
game times
1 850
2 621
...
6 210
Similar for data2 we get
game times
1 744
2 989
...
6 711
And sum(data1$times) is a little higher than sum(data2$times). We have about 2000 users in data1 and about 1000 users in data2 but I do not think that information is relevant.
I want to compare the two datasets and see if there is a statistically difference and which game "causes" that difference.
What test should I use two compare these. I don't think Pearson's chisq.test is the right choice in this case, maybe wilcox.test is the right to chose ?

CART Methodology for data with mutually exhaustive rows

I am trying to use CART to analyse a data set whose each row is a segment, for example
Segment_ID | Attribute_1 | Attribute_2 | Attribute_3 | Attribute_4 | Target
1 2 3 100 3 0.1
2 0 6 150 5 0.3
3 0 3 200 6 0.56
4 1 4 103 4 0.23
Each segment has a certain population from the base data (irrelevant to my final use).
I want to condense, for example in the above case, the 4 segments into 2 big segments, based on the 4 attributes and on the target variable. I am currently dealing with 15k segments and want only 10 segments with each of the final segment based on target and also having a sensible attribute distribution.
Now, pardon my if I am wrong but CHAID on SPSS (if not using autogrow) will generally split the data into 70:30 ratio where it builds the tree on 70% of the data and tests on the remaining 30%. I can't use this approach since I need all my segments in the data to be included. I essentially want to club these segments into a a few big segments as explained before. My question is whether I can use CART (rpart in R) for the same. There is an explicit option 'subset' in the rpart function in R but I am not sure whether not mentioning it will ensure CART utilizing 100% of my data. I am relatively new to R and hence a very basic question.

Fisher test more than 2 groups

Major Edit:
I decided to rewrite this question since my original was poorly put. I will leave the original question below to maintain a record. Basically, I need to do Fisher's Test on tables as big as 4 x 5 with around 200 observations. It turns out that this is often a major computational challenge as explained here (I think, I can't follow it completely). As I use both R and Stata I will frame the question for both with some made-up data.
Stata:
tabi 1 13 3 27 46 \ 25 0 2 5 3 \ 22 2 0 3 0 \ 19 34 3 8 1 , exact(10)
You can increase exact() to 1000 max (but it will take maybe a day before returning an error).
R:
Job <- matrix(c(1,13,3,27,46, 25,0,2,5,3, 22,2,0,3,0, 19,34,3,8,1), 4, 5,
dimnames = list(income = c("< 15k", "15-25k", "25-40k", ">40k"),
satisfaction = c("VeryD", "LittleD", "ModerateS", "VeryS", "exstatic")))
fisher.test(Job)
For me, at least, it errors out on both programs. So the question is how to do this calculation on either Stata or R?
Original Question:
I have Stata and R to play with.
I have a dataset with various categorical variables, some of which have multiple categories.
Therefore I'd like to do Fisher's exact test with more than 2 x 2 categories
i.e. apply Fisher's to a 2 x 6 table or a 4 x 4 table.
Can this be done with either R or Stata ?
Edit: whilst this can be done in Stata - it will not work for my dataset as I have too many categories. Stata goes through endless iterations and even being left for a day or more does not produce a solution.
My question is really - can R do this, and can it do it quickly ?
Have you studied the documentation of R function fisher.test? Quoting from help("fisher.test"):
For 2 by 2 cases, p-values are obtained directly using the (central or
non-central) hypergeometric distribution. Otherwise, computations are
based on a C version of the FORTRAN subroutine FEXACT which implements
the network developed by Mehta and Patel (1986) and improved by
Clarkson, Fan and Joe (1993).
This is an example given in the documentation:
Job <- matrix(c(1,2,1,0, 3,3,6,1, 10,10,14,9, 6,7,12,11), 4, 4,
dimnames = list(income = c("< 15k", "15-25k", "25-40k", "> 40k"),
satisfaction = c("VeryD", "LittleD", "ModerateS", "VeryS")))
fisher.test(Job)
# Fisher's Exact Test for Count Data
#
# data: Job
# p-value = 0.7827
# alternative hypothesis: two.sided
As far as Stata is concerned, your original statement was totally incorrect. search fisher leads quickly to help tabulate twoway and
the help for the exact option explains that it may be applied to r x
c as well as to 2 x 2 tables
the very first example in the same place of Fisher's exact test underlines that Stata is not limited to 2 x 2 tables.
It's a minimal expectation anywhere on this site that you try to read basic documentation. Please!

Resources