I have a fairly simple data set which I'm analyzing with glmmPQL(), then using glht() for Tukey comparisons of the groups. Group A is mostly nonzero values with a few zeros, Group B is all zeros except for one nonzero value, and Group C is all zeros. (I'm doing this analysis on such a sparse data set to match the work done on a more robust data set.)
My glht()/Tukey results say that A and B are significantly different, but not A and C. This makes no sense to me, as C is slightly further from A than B is.
Also, if I change group C by replacing one of its zeros with a 1, glht()/Tukey then reports a significant difference between A and C, even though I have actually moved the two groups closer together.
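For reference, here is a minimal sketch of this kind of setup (the response values, the Poisson family, and the blocking factor used as the random effect are invented placeholders, not the actual data or model):

library(MASS)       # glmmPQL()
library(multcomp)   # glht() and mcp()

dat <- data.frame(
  y     = c(5, 7, 0, 6, 8, 0,   0, 0, 3, 0, 0, 0,   0, 0, 0, 0, 0, 0),
  group = factor(rep(c("A", "B", "C"), each = 6)),
  block = factor(rep(1:6, times = 3))   # hypothetical random-effect grouping
)

fit <- glmmPQL(y ~ group, random = ~ 1 | block, family = poisson, data = dat)
summary(glht(fit, linfct = mcp(group = "Tukey")))   # Tukey all-pairs comparisons

(A toy fit like this may throw convergence warnings, since the effect of an all-zero group is poorly identified on the log scale.)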
Does anyone understand why this happens?
I'm trying to do a compositional analysis of habitat use with the compana() function in the adehabitatHS package (I use adehabitat because I can't install adehabitatHS).
compana() needs two matrices: one of habitat use and one of available habitat.
When I try to run the function it doesn't work (it never finishes), so I have to abort the RStudio session.
I read that one problem could be the 0-values in some habitat types for some animals in the 'available' matrix, whereas other animals have positive values for the same habitat. As others have done, I replaced the 0-values with small values (0.001) and ran compana; it worked, BUT the lambda values it returned were NaN.
The problem is similar to the one found here: adehabitatHS compana test returns lambda = NaN?
They said they resolved it by using counts (integers) rather than proportions for the 'used' habitat matrix.
I also tried this approach, but nothing changed (it freezes when there are 0-values in the available matrix, or returns NaN for lambda if I replace the 0-values with small values).
I checked all the matrices and they are fine, so I'm going crazy.
I have 6 animals and 21 habitat types.
Can you resolve this BIG problem?
PARTIALLY SOLVED: I asked some researchers, and they told me that the number of habitats shouldn't be higher than the number of animals.
In fact, I merged some habitats so as to have six habitats for six animals, and now the function works when I replace the 0-values in the 'available' matrix with small values (e.g. 0.001).
Unfortunately this is not what I wanted, because I needed to find values (rankings, log-ratios, etc.) for each habitat type (originally there were 21).
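For reference, a minimal sketch of the call described above (the data are random placeholders standing in for my matrices; the zero-replacement step follows what I did):

library(adehabitat)   # compana(); the same function exists in adehabitatHS

# placeholder data: 6 animals x 6 habitat types, proportions per row
set.seed(1)
used  <- prop.table(matrix(runif(36), 6, 6), margin = 1)
avail <- prop.table(matrix(runif(36), 6, 6), margin = 1)
colnames(used) <- colnames(avail) <- paste0("hab", 1:6)
avail[1, 2] <- 0                     # mimic a zero availability value

avail[avail == 0] <- 0.001           # replace 0-values with a small value
avail <- avail / rowSums(avail)      # re-normalise so each row sums to 1

res <- compana(used, avail, test = "randomisation",
               nrep = 500, alpha = 0.1)
res$test                             # lambda and its p-value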
Let's say that I have two words, Happy and Happiness, each stored as a factor. I want to find the difference between those two factors, so I would preferably want a function that will spit out yiness, inessy, iness and y in any combination; just iness; or just y.
Let's also consider whether the code would work for these two phrases:
al,y,a€a%f,a,s$lf,askdʇjg,asfg
and
al,y,a€a%f,a,s$lf,askd/879,876/jg,asfg
both without spaces; notice the only difference is that ʇ is replaced by /879,876/.
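A minimal sketch of one possible interpretation (strip the longest common prefix of the two strings and return the remainders; this is an assumption about the desired behaviour, not an existing function):

str_diff <- function(a, b) {
  a_chars <- strsplit(a, "")[[1]]
  b_chars <- strsplit(b, "")[[1]]
  n <- min(length(a_chars), length(b_chars))
  same <- a_chars[seq_len(n)] == b_chars[seq_len(n)]
  k <- if (all(same)) n else which(!same)[1] - 1   # length of common prefix
  list(a_rest = substr(a, k + 1, nchar(a)),
       b_rest = substr(b, k + 1, nchar(b)))
}

str_diff("Happy", "Happiness")
# $a_rest "y"; $b_rest "iness"
# factors would need converting first: str_diff(as.character(f1), as.character(f2))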
Many thanks in advance!
I think I have a rather simple problem but I can't figure out the best approach. I have a vector with 30 different values. Now I need to divide the vector into 10 groups in such a way that the mean within-group variance is as small as possible. The size of the groups is not important; it can be anything between one and 21.
Example: let's say I have a vector of six values that I have to split into three groups:
Myvector <- c(0.88,0.79,0.78,0.62,0.60,0.58)
Obviously the solution would be:
Group1 <-c(0.88)
Group2 <-c(0.79,0.78)
Group3 <-c(0.62,0.60,0.58)
Is there a function that gives the same outcome as the example and that I can use for my vector with 30 values?
Many thanks in advance.
It sounds like you want to do k-means clustering. Something like this would work:
kmeans(Myvector, 3, algorithm = "Lloyd")
Note that I changed the default algorithm to match your desired output. If you read the ?kmeans help page you will see that there are different algorithms for calculating the clusters, because it's not a trivial computational problem. They might not necessarily guarantee optimality.
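Applying that to the example vector (a sketch; k-means cluster numbering is arbitrary, so the labels may differ between runs):

Myvector <- c(0.88, 0.79, 0.78, 0.62, 0.60, 0.58)
km <- kmeans(Myvector, centers = 3, algorithm = "Lloyd")
split(Myvector, km$cluster)   # e.g. 0.88 | 0.79 0.78 | 0.62 0.60 0.58

Increasing nstart (e.g. kmeans(..., nstart = 25)) re-runs the algorithm from several random starts, which makes a suboptimal split less likely.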
I have a list like this:
A B score
B C score
A C score
......
where the first two columns contain the variable names and the third column contains the score between the pair. The total number of variables is 250,000 (A, B, C, ...), and the score is a float in [0,1]. The file is approximately 50 GB. Pairs whose score is 1 have been removed, as more than half of the entries were 1.
I wanted to perform hierarchical clustering on the data.
Should I convert the linear form to a matrix with 250,000 rows and 250,000 columns? Or should I partition the data and do the clustering?
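For scale, here is a sketch of what the first option would look like in R, treating the score as a distance (the file name and column order are assumptions, and given the sizes involved this is only feasible on a small subsample):

pairs <- read.table("scores.txt", col.names = c("a", "b", "score"))
labs <- sort(unique(c(as.character(pairs$a), as.character(pairs$b))))
# pairs with score 1 were removed from the file, so missing entries default to 1
d <- matrix(1, length(labs), length(labs), dimnames = list(labs, labs))
d[cbind(match(pairs$a, labs), match(pairs$b, labs))] <- pairs$score
d[cbind(match(pairs$b, labs), match(pairs$a, labs))] <- pairs$score
hc <- hclust(as.dist(d), method = "single")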
Beyond that, I'm clueless. Please help!
Thanks.
Your input data already is the matrix.
However, hierarchical clustering usually scales as O(n^3). That won't work with a data set of your size. Plus, implementations usually need more than one copy of the distance matrix in memory; you may need about 1 TB of RAM then, since 2 * 8 * 250000 * 250000 bytes is a lot.
Some special cases can run in O(n^2): SLINK does. If your data is nicely sorted, it should be possible to run single-link in a single pass over your file. But you will have to implement this yourself. Don't even think of using R or something fancy.
I am looking for the best way to compare 2 or more images.
The images I have are now in matrix format, so basically I am comparing matrices.
They aren't square (but this isn't a problem).
This is an example of what I have with only two matrices:
#Original data
M1<-cbind(c(0,0,20,40,50,35),c(0,0,5,20,90,80),c(0,0,10,25,85,0),c(58,70,20,50,0,5))
#Data to be compared with M1
M2<-cbind(c(0,5,25,25,60,15),c(0,30,15,10,116,67),c(0,2,9,20,90,1),c(69,50,22,30,0,2))
I can check the differences and the correlation, but I also want to be able to say, for example, whether:
high values in M2 occur in the same positions as in M1
high values in M2 occur close to their positions in M1
high values in M2 occur far away
Same thing for low values.
By high values I mean maximum values: for example, if the maximum value of M1 is at some position (x, y), then the maximum of M2 should be a similar value and should occur at the same or a nearby position.
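For instance, a simple position-based check might look like this (a sketch of the idea, not an established method):

pos_max <- function(M) which(M == max(M), arr.ind = TRUE)[1, ]
p1 <- pos_max(M1)                   # (row, col) of the maximum of M1
p2 <- pos_max(M2)                   # (row, col) of the maximum of M2
sqrt(sum((p1 - p2)^2))              # Euclidean distance between the two peaks
cor(as.vector(M1), as.vector(M2))   # overall agreement of the two images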
I can extract the positions and the variation of the positions of the maximum values, as in the sketch above; however, I am looking for existing methods on which I can base my comparisons.
What kind of calculations can I use to do this type of analysis?
I can use both image processing packages as well as matrices algorithms.
Sounds like a job better handled with ImageJ or SAO DS9 (http://hea-www.harvard.edu/RD/ds9/).
IIRC those apps have built-in tools for spot and blob-finding, which may save you a lot of time and pain.