Statistical test for amplicon sequencing data - r

I have identified the top 50 genera based on relative abundance in 16S amplicon sequencing data. Now I want to perform a statistical test comparing the two conditions, salt and non-salt, but I am not sure how to check for normality and then how to run statistical tests across these multiple genera in R. One more issue with this data: there are lots of zeros.
data = read.table("statisticalTest.salt_vs_non_salt.txt",header=T,sep='\t',check.names=F)
head(data)
                        salt        salt        salt    non_salt    non_salt   non_salt
1     Burkholderia 0.097731239 0.244541624 0.001426243 0.116990216 0.013678457 0.27859036
2   Bradyrhizobium 0.193819936 0.135586384 0.094290501 0.048097711 0.020956642 0.09660323
3 Anaeromyxobacter 0.103890771 0.087270334 0.139477497 0.117391401 0.219336751 0.20067454
4        Geobacter 0.008452247 0.018362646 0.016118808 0.045958054 0.051796890 0.03892350
5   Microbacterium 0.000513294 0.003237345 0.019378792 0.000156017 0.000538076 0.00000000
6    Methylocystis 0.008794443 0.010794689 0.004957892 0.001760760 0.002265583 0.01459201
Many thanks
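With only three replicates per condition and many zeros, a rank-based test per genus followed by a multiple-testing correction is a common route. A minimal sketch, assuming genera in rows and three replicate columns per condition (the data frame, column names, and values below are illustrative stand-ins for the real file):

```r
# Illustrative stand-in for the real table: 2 genera, 3 replicates per condition.
df <- data.frame(
  genus = c("Burkholderia", "Bradyrhizobium"),
  salt1 = c(0.0977, 0.1938), salt2 = c(0.2445, 0.1356), salt3 = c(0.0014, 0.0943),
  non1  = c(0.1170, 0.0481), non2  = c(0.0137, 0.0210), non3  = c(0.2786, 0.0966)
)
salt_cols <- c("salt1", "salt2", "salt3")
non_cols  <- c("non1", "non2", "non3")

res <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  x <- as.numeric(df[i, salt_cols])
  y <- as.numeric(df[i, non_cols])
  # Shapiro-Wilk is reported for completeness but is unreliable at n = 3;
  # the rank-based Wilcoxon test is the safer default and tolerates zeros.
  data.frame(genus     = df$genus[i],
             shapiro_p = shapiro.test(c(x, y))$p.value,
             wilcox_p  = wilcox.test(x, y, exact = FALSE)$p.value)
}))
# Adjust for testing many genera at once (Benjamini-Hochberg FDR).
res$p_adj <- p.adjust(res$wilcox_p, method = "BH")
res
```

With the full 50-genus table the same loop applies unchanged; only the adjusted p-values change as more tests enter `p.adjust`.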

Related

Cluster analysis in R on large data set

I have a data set with rankings as the column names and about 15,000 contestants. My data looks like:
contestant    1    2    3    4
       101   13    0    5   12
        14    0    1   34    6
       ...  ...  ...  ...  ...
       500    0    2   23    3
I've been working on cluster analysis of this dataset. Dendrograms are obviously not very helpful here: with this many entries they collapse into a thick block of lines.
I'm wondering if there is a better way to do cluster analysis with this type of data. I've tried fviz_cluster() and similar commands, and I've worked through multiple tutorials, but most of them build dendrograms, and their example data look quite different from mine (comparing two variables, far fewer rows, etc.). Essentially, I'm asking which types of cluster analysis may work well with this type of data.

CART Methodology for data with mutually exhaustive rows

I am trying to use CART to analyse a data set in which each row is a segment, for example:
Segment_ID | Attribute_1 | Attribute_2 | Attribute_3 | Attribute_4 | Target
         1 |           2 |           3 |         100 |           3 |   0.10
         2 |           0 |           6 |         150 |           5 |   0.30
         3 |           0 |           3 |         200 |           6 |   0.56
         4 |           1 |           4 |         103 |           4 |   0.23
Each segment has a certain population from the base data (irrelevant to my final use).
I want to condense the segments: in the example above, collapse the 4 segments into 2 larger ones, based on the 4 attributes and the target variable. I am currently dealing with 15k segments and want to end up with only about 10, each defined by the target and having a sensible attribute distribution.
Now, pardon me if I am wrong, but CHAID in SPSS (if not using autogrow) will generally split the data 70:30, building the tree on 70% of the data and testing on the remaining 30%. I can't use this approach since I need all of my segments to be included; I essentially want to club these segments into a few big segments as explained above. My question is whether I can use CART (rpart in R) for this. There is an explicit 'subset' option in R's rpart function, but I am not sure whether omitting it guarantees that CART uses 100% of my data. I am relatively new to R, hence this very basic question.
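rpart fits on every row of the data you pass it; there is no automatic 70:30 split, and the 'subset' argument only restricts rows if you actually supply it. A minimal sketch on toy data mimicking the segment table above (values and control settings are illustrative):

```r
library(rpart)

# Toy version of the segment table from the question.
seg <- data.frame(
  Attribute_1 = c(2, 0, 0, 1), Attribute_2 = c(3, 6, 3, 4),
  Attribute_3 = c(100, 150, 200, 103), Attribute_4 = c(3, 5, 6, 4),
  Target      = c(0.10, 0.30, 0.56, 0.23)
)

# minsplit/cp are loosened so a tree can grow on only 4 rows; with 15k
# segments you would instead tune cp (and prune) to land near 10 leaves.
fit <- rpart(Target ~ ., data = seg, method = "anova",
             control = rpart.control(minsplit = 2, cp = 0.001))

# Every input row is assigned to a leaf, i.e. 100% of the data is used;
# the leaves are the "big segments" the question asks for.
leaf <- fit$where
length(leaf)
```

Checking `length(fit$where)` against `nrow(seg)` is a quick way to convince yourself that no rows were held out.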

Using KNN for pattern matching of time series

I want to try to implement a KNN algorithm for pattern matching (or pattern recognition) in my time series data. The data are consumption measurements. I've got a table with several columns, where the first column is the datetime of the measurement and the other columns hold the measurements. Here is an example:
datetime mains stove kitchen microwave TV
2013-04-21 14:22:13 341.03 6 57 5 0
2013-04-21 14:22:16 342.36 6 57 5 0
2013-04-21 14:22:20 342.52 6 58 5 0
2013-04-21 14:22:23 342.07 6 57 5 0
2013-04-21 14:22:26 341.77 6 57 5 0
2013-04-21 14:22:30 341.66 6 55 5 0
I want to use the KNN algorithm to compare the pattern of the mains signal with the patterns of other signals. So my training set would consist of labeled measurements of every appliance and the test data set would consist of the mains signal measurements. The aim of this is to detect changes in the signal - which appliance was turned on in which time.
What I actually want to ask is:
1. How should I cope with the datetime format? In which form should it be passed to KNN? (I assume some conversion to integer, or normalization?)
2. Is the KNN algorithm suitable for this task?
3. How is pattern matching with KNN generally performed?
What I've already tried: I put a single vector consisting of labeled patterns (one per appliance) into KNN as the training set and then used the mains data as the test set. I omitted the datetime column entirely. I got bad results.
I'm implementing this in R. Any ideas?
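On the datetime point: one common approach in R is to convert timestamps to numeric seconds with as.POSIXct and scale them like any other feature before handing everything to class::knn. A minimal sketch (the labels, watt values, and the two-feature setup below are illustrative, not a real disaggregation model):

```r
library(class)

# Illustrative training rows: timestamp + a single measurement, each
# labelled with the appliance it supposedly represents.
dt <- as.POSIXct(c("2013-04-21 14:22:13", "2013-04-21 14:22:16",
                   "2013-04-21 14:22:20", "2013-04-21 14:22:23"), tz = "UTC")
train  <- data.frame(t = as.numeric(dt), watts = c(6, 57, 5, 341))
labels <- factor(c("stove", "kitchen", "microwave", "mains"))

# One unlabeled test measurement to classify.
test <- data.frame(t = as.numeric(dt[2]), watts = 56)

# Scale train and test together so both features contribute comparably;
# this answers the "conversion to integer or normalization?" question.
all_scaled <- scale(rbind(train, test))
pred <- knn(train = all_scaled[1:4, ], test = all_scaled[5, , drop = FALSE],
            cl = labels, k = 1)
pred  # nearest labelled row wins
```

For genuine pattern matching you would typically classify windows of consecutive readings (one feature vector per window) rather than single rows, since single-sample KNN ignores the temporal shape of the signal, which may explain the bad results.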

Fisher test more than 2 groups

Major Edit:
I decided to rewrite this question since my original was poorly put. I will leave the original question below to maintain a record. Basically, I need to do Fisher's Test on tables as big as 4 x 5 with around 200 observations. It turns out that this is often a major computational challenge as explained here (I think, I can't follow it completely). As I use both R and Stata I will frame the question for both with some made-up data.
Stata:
tabi 1 13 3 27 46 \ 25 0 2 5 3 \ 22 2 0 3 0 \ 19 34 3 8 1 , exact(10)
You can increase exact() to 1000 max (but it will take maybe a day before returning an error).
R:
Job <- matrix(c(1,13,3,27,46, 25,0,2,5,3, 22,2,0,3,0, 19,34,3,8,1), 4, 5,
dimnames = list(income = c("< 15k", "15-25k", "25-40k", ">40k"),
satisfaction = c("VeryD", "LittleD", "ModerateS", "VeryS", "exstatic")))
fisher.test(Job)
For me, at least, it errors out in both programs. So the question is: how can this calculation be done in either Stata or R?
Original Question:
I have Stata and R to play with.
I have a dataset with various categorical variables, some of which have multiple categories.
Therefore I'd like to do Fisher's exact test with more than 2 x 2 categories
i.e. apply Fisher's to a 2 x 6 table or a 4 x 4 table.
Can this be done with either R or Stata ?
Edit: whilst this can be done in Stata - it will not work for my dataset as I have too many categories. Stata goes through endless iterations and even being left for a day or more does not produce a solution.
My question is really - can R do this, and can it do it quickly ?
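On the R side, fisher.test documents two escape hatches for tables where the exact network algorithm runs out of room: enlarging the workspace argument, or falling back to a Monte Carlo p-value with simulate.p.value. A sketch using the made-up 4 x 5 table from above (the Monte Carlo answer is an approximation, not the exact p-value):

```r
# The 4 x 5 table from the question.
Job <- matrix(c(1,13,3,27,46, 25,0,2,5,3, 22,2,0,3,0, 19,34,3,8,1), 4, 5,
              dimnames = list(income = c("< 15k", "15-25k", "25-40k", ">40k"),
                              satisfaction = c("VeryD", "LittleD", "ModerateS",
                                               "VeryS", "exstatic")))
set.seed(1)
# Monte Carlo estimate of the exact p-value; increase B for more precision.
ft <- fisher.test(Job, simulate.p.value = TRUE, B = 1e5)
ft$p.value
```

If an exact answer is required, `fisher.test(Job, workspace = 2e8)` raises the memory limit of the network algorithm instead, at the cost of runtime.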
Have you studied the documentation of R function fisher.test? Quoting from help("fisher.test"):
For 2 by 2 cases, p-values are obtained directly using the (central or non-central) hypergeometric distribution. Otherwise, computations are based on a C version of the FORTRAN subroutine FEXACT which implements the network algorithm developed by Mehta and Patel (1986) and improved by Clarkson, Fan and Joe (1993).
This is an example given in the documentation:
Job <- matrix(c(1,2,1,0, 3,3,6,1, 10,10,14,9, 6,7,12,11), 4, 4,
dimnames = list(income = c("< 15k", "15-25k", "25-40k", "> 40k"),
satisfaction = c("VeryD", "LittleD", "ModerateS", "VeryS")))
fisher.test(Job)
# Fisher's Exact Test for Count Data
#
# data: Job
# p-value = 0.7827
# alternative hypothesis: two.sided
As far as Stata is concerned, your original statement was totally incorrect. search fisher leads quickly to help tabulate twoway, where:
- the help for the exact option explains that it may be applied to r x c as well as to 2 x 2 tables;
- the very first example there of Fisher's exact test underlines that Stata is not limited to 2 x 2 tables.
It's a minimal expectation anywhere on this site that you try to read basic documentation. Please!

Testing recurrences and orders in strings matlab

I have observed nurses during 400 episodes of care and recorded the sequence of surface contacts in each.
I categorised the surfaces into 5 groups, 1:5, and calculated the probability density function (PDF) of touching any one of them.
PDF=[ 0.255202629 0.186199343 0.104052574 0.201533406 0.253012048]
I then predicted some 1000 sequences using:
for i = 1:1000 % 1000 different nurses
    seq(i,:) = randsample(1:5, max(observed_seq_length), true, PDF);
end
e.g.
seq = 1 5 2 3 4 2 5 5 2 5
stairs(1:max(observed_seq_length), seq); hold all
I'd like to compare my empirical sequences with my predicted ones. What would you suggest as the best strategy, or property to look at?
Regards,
EDIT: I put r as a tag as this may well fall more easily under that category due to the nature of the question rather than the matlab code.
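Since the question is tagged r: two concrete properties to compare are (a) the category frequencies of the empirical sequences against the assumed PDF (a chi-squared goodness-of-fit test), and (b) the first-order transition matrix, which captures ordering that independent sampling like randsample ignores. A sketch in R with simulated stand-in data (the observed sequence below is itself drawn from the PDF, purely for illustration):

```r
# Probability vector from the question (renormalised to sum to 1).
pdf_probs <- c(0.2552, 0.1862, 0.1041, 0.2015, 0.2530)
pdf_probs <- pdf_probs / sum(pdf_probs)

# Stand-in for one empirical sequence of 400 surface contacts.
set.seed(1)
observed_seq <- sample(1:5, 400, replace = TRUE, prob = pdf_probs)

# (a) Do the observed frequencies match the assumed PDF?
counts <- tabulate(observed_seq, nbins = 5)
gof <- chisq.test(counts, p = pdf_probs)

# (b) First-order structure: empirical transition matrix. Independent
# draws should give rows all close to pdf_probs; real nurse behaviour
# with recurring surface orders would not.
trans <- table(head(observed_seq, -1), tail(observed_seq, -1))
trans_prob <- prop.table(trans, margin = 1)
```

If the empirical transition rows differ clearly from the simulated ones, the independence assumption behind the randsample prediction is the property being violated.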
