modelTest and negative edge lengths with phangorn in R

I'm analysing protein data sets. I'm trying to build a tree with the package phangorn in R.
When I construct it, I get negative edge lengths that sometimes make it difficult to proceed with the analysis (modelTest).
Depending on the size of the dataset (more than 250 proteins), I can't perform a modelTest. Apparently there is a problem due to negative edge lengths. However, for smaller datasets I can perform a modelTest even though there are some negative edge lengths.
I am running it directly from my terminal.
library(phangorn)
dat = read.phyDat(file, format="fasta", type="AA")
tax <- read.table("organism_names.txt", sep="\t", row.names=1)
names(dat) <- tax[,1]
distance <- dist.ml(dat, model="WAG")
tree <- bionj(distance)
mt <- modelTest(dat, tree, model=c("WAG", "LG", "cpREV", "mtArt", "MtZoa", "mtREV24"), multicore=TRUE)
Error: NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In pml(tree, data) : negative edges length changed to 0!
Does somebody have any idea of what I can do?
cheers, Alba

As @Marc said, your example isn't really reproducible...
If the problem really is negative or zero branch lengths, you could try to make them a really small positive number, for instance:
tree$edge.length[which(tree$edge.length <= 0)] <- 0.0000001
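For completeness, a hedged sketch of how that fix would slot into the workflow from the question (assuming the same dat and tree objects; the cutoff 1e-7 is arbitrary):
# clamp non-positive branch lengths to a tiny value, then retry modelTest
tree$edge.length[tree$edge.length <= 0] <- 1e-7
mt <- modelTest(dat, tree, model = c("WAG", "LG", "cpREV", "mtArt", "MtZoa", "mtREV24"), multicore = TRUE)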
Another tip is to subscribe to R-sig-phylo, a mailing list about phylogenies in R. People there are really knowledgeable and usually respond pretty fast.

Related

How to deal with zero in dataset when trying to create a Dissimilarity matrix?

I have a simple question, which took me about a hundred hours of googling today, and it is still unresolved. I hope someone here can miraculously answer this.
I am trying to make a Bray-Curtis dissimilarity matrix, run an nMDS and then run a PERMANOVA for my species community data. The problem is that when I assign the community to each plot, not all the species are present, for obvious reasons, right? Now, the function metaMDS from the vegan package does not let me create anything. How do I deal with the zeros in the matrix? Does anybody have any scripts, ideas or magical things to fix my day?
this is my code so far:
Crabdata3<- read.csv("Crab_Edited_Plots_02_12.csv")
str(Crabdata3)
Crabbie1= Crabdata3[,-(1:2),drop=FALSE]
Crabbie2 <-as.matrix(Crabbie1)
is.matrix(Crabbie2)
Disscrabmatrix1 <-data.matrix(Crabbie2)
Disscrabmatrix2 = matrix(Crabbie2, nrow = 42, ncol = 18, byrow = TRUE,
                         dimnames = list(paste("community", 1:42, sep = ""),
                                         paste(colnames(Crabbie2, 1:18))))
example_NMDS = metaMDS(Disscrabmatrix2, distance = "bray", k = 2)  ## number of reduced dimensions
This is the error I am getting:
example_NMDS = metaMDS(Disscrabmatrix2, distance = "bray", k = 2)  ## number of reduced dimensions
Square root transformation
Wisconsin double standardization
Error in cmdscale(dist, k = k) : NA values not allowed in 'd'
In addition: Warning messages:
1: In distfun(comm, method = distance, ...) :
you have empty rows: their dissimilarities may be meaningless in method “bray”
2: In distfun(comm, method = distance, ...) : missing values in results
I'd recommend looking at this link https://www.tutorialspoint.com/how-to-remove-rows-that-contains-all-zeros-in-an-r-data-frame
The author suggests using df1[rowSums(df1[])>0,] to remove all 0 rows. I'm trying to do a similar nMDS/PERMANOVA analysis and found this did the trick.
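For concreteness, a minimal sketch of that idea applied to the matrix built in the question (assuming Disscrabmatrix2 as above):
# drop plots (rows) whose total abundance is zero before computing Bray-Curtis dissimilarities
Disscrabmatrix3 <- Disscrabmatrix2[rowSums(Disscrabmatrix2) > 0, ]
example_NMDS <- metaMDS(Disscrabmatrix3, distance = "bray", k = 2)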
Cheers.

R - Vegan package. metaMDS error

I would like to know why I'm getting this error running metaMDS:
'comm' has negative data: 'autotransform', 'noshare' and 'wascores' set to FALSE
I would like to do NMDS and dendrogram graphs but can't do so because of the error above.
My data set is available for download if anyone wants to check DATASET. After importing the data, I transposed the columns and rows. After that, I replaced the NA values with 0 before trying to run metaMDS.
abundance <- read.table("1_abundance.txt", header = TRUE)
abundance[is.na(abundance)] <- 0
abundance_trans <- t(abundance)
metaMDS(abundance_trans, distance = "bray", k = 2, trymax = 50)
It is not an error message but information: metaMDS tells you that you have negative data entries, and it will not perform some of the tricks it does by default with non-negative data.
The second issue is that you ask for Bray-Curtis dissimilarities, which are only applicable to non-negative data.
You have two alternatives: either take care of the negative values, or use a dissimilarity measure that can handle them. If you think that you do not have negative data, you are wrong: the computer knows. You may have made an error when reading in your data, and you may have columns or rows that you should not have. Check your data.
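A quick way to locate the offending entries, as a hedged sketch (assuming the abundance_trans matrix from the question):
sum(abundance_trans < 0)                       # how many negative entries are there?
which(abundance_trans < 0, arr.ind = TRUE)     # their row/column positions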

R, issue with a Hierarchical clustering after a Multiple correspondence analysis

I want to cluster a dataset (600000 observations), and for each cluster I want to get the principal components.
My vectors are composed of one email and 30 qualitative variables.
Each qualitative variable has 4 classes: 0, 1, 2 and 3.
So first thing I'm doing is to load the library FactoMineR and to load my data:
library(FactoMineR)
mydata = read.csv("/home/tom/Desktop/ACM/acm.csv")
Then I'm setting my variables as qualitative (I'm excluding the variable 'email' though):
for(n in 1:length(mydata)){mydata[[n]] <- factor(mydata[[n]])}
I'm removing the emails from my vectors:
mydata2 = mydata[2:31]
And I'm running a MCA in this new dataset:
mca.res <- MCA(mydata2)
I now want to cluster my dataset using the hcpc function:
res.hcpc <- HCPC(mca.res)
But I got the following error message:
Error: cannot allocate vector of size 1296.0 Gb
What do you think I should do? Is my dataset too large? Am I using well the hcpc function?
Since it uses hierarchical clustering, HCPC needs to compute the lower triangle of a 600000 x 600000 distance matrix (~ 180 billion elements). You simply don't have the RAM to store this object and even if you did, the computation would likely take hours if not days to complete.
There have been various discussions on Stack Overflow/Cross Validated on clustering large datasets; some with solutions in R include:
k-means clustering in R on very large, sparse matrix? (bigkmeans)
Cluster Big Data in R and Is Sampling Relevant? (clara)
If you want to use one of these alternative clustering approaches, you would apply it to mca.res$ind$coord in your example.
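As an illustrative sketch only (k = 10 clusters and samples = 50 are arbitrary placeholders), clara from the cluster package could be applied to the MCA coordinates like this:
library(cluster)
cl <- clara(mca.res$ind$coord, k = 10, samples = 50)   # medoid-based clustering fitted on subsamples
table(cl$clustering)                                   # cluster sizes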
Another idea, suggested in response to the problem clustering very large dataset in R, is to first use k means to find a certain number of cluster centres and then use hierarchical clustering to build the tree from there. This method is actually implemented via the kk argument of HCPC.
For example, using the tea data set from FactoMineR:
library(FactoMineR)
data(tea)
## run MCA as in ?MCA
res.mca <- MCA(tea, quanti.sup = 19, quali.sup = c(20:36), graph = FALSE)
## run HCPC for all 300 individuals
hc <- HCPC(res.mca, kk = Inf, consol = FALSE)
## run HCPC from 30 k means centres
res.consol <- NULL ## bug work-around
hc2 <- HCPC(res.mca, kk = 30, consol = FALSE)
The consol argument offers the option to consolidate the clusters from the hierarchical clustering using k-means; this option is not available when kk is set to a finite number, hence consol is set to FALSE here. The object res.consol is set to NULL to work around a minor bug in FactoMineR 1.27.
Plotting the clusters based on the 300 individuals (kk = Inf) and on the 30 k-means centres (kk = 30) on the first two MCA axes shows that the results are very similar. You should easily be able to apply this to your data with 600 or 1000 k-means centres, perhaps up to 6000 with 8 GB of RAM. If you wanted to use a larger number, you'd probably want to code a more efficient version using bigkmeans, SpatialTools::dist1 and fastcluster::hclust.
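A rough sketch of that hybrid idea, with the caveat that the exact arguments should be checked against each package's documentation (the centre count 6000 is illustrative):
library(biganalytics)   # provides bigkmeans
library(SpatialTools)   # provides dist1
library(fastcluster)    # faster drop-in replacement for hclust

coords <- mca.res$ind$coord
km <- bigkmeans(coords, centers = 6000, iter.max = 50)  # many small k-means clusters first
d  <- as.dist(dist1(km$centers))                        # distances between the 6000 centres only
hc <- fastcluster::hclust(d, method = "ward.D2")        # hierarchical tree over the centres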
That error message usually indicates that R does not have enough RAM at its disposal to complete the command. I guess you are running this within 32-bit R, possibly under Windows? If this is the case, then killing other processes and deleting unused R variables might help: for example, you might try to delete mydata and mydata2 with
rm(mydata, mydata2)
(as well as all other unnecessary R variables) before executing the command which generates the error. However, the ultimate solution in general is to switch to 64-bit R, preferably under 64-bit Linux and with a decent amount of RAM; also see here:
R memory management / cannot allocate vector of size n Mb
R Memory Allocation "Error: cannot allocate vector of size 75.1 Mb"
http://r.789695.n4.nabble.com/Error-cannot-allocate-vector-of-size-td3629384.html

What can I do about svmpath needing so much memory in one gulp?

I am trying out the svmpath package, which is supposed to find optimal hyperparameters for a trained SVM without requiring multiple runs over different subsets of the data. More importantly, it's supposed to be less computationally complex (according to its docs).
However, it seems to ask for a lot of memory all at once.
Minimal working example:
library(data.table)
library(svmpath)
# Loaded svmpath 0.953
features <- data.table(matrix(runif(100000*16),ncol=16))
labels <- (runif(100000) > 0.7)
svmpath(x=features,y=labels)
# Error in x %*% t(y) : requires numeric/complex matrix/vector arguments
svmpath(x=as.matrix(features),y=labels)
# Error: cannot allocate vector of size 74.5 Gb
library(kernlab)
ksvm(as.matrix(features), y=labels, kernel="vanilladot")
# runs
Inspecting the training function only shows one line that pops out as possibly big, Kscript <- K * outer(y, y). Indeed this seems to be the culprit: runif(100000) %o% runif(100000) produces the same error.
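For a rough sense of scale, a dense 100000 x 100000 matrix of doubles needs about 74.5 GiB, matching the error above:
100000^2 * 8 / 2^30   # 1e10 entries x 8 bytes per double, in GiB: ~74.5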
Are there any quick fixes that are easy to implement in R?
Apparently, it doesn't find the optimal C (cost) value. But it lists all the C values that you should try in order to find the best one using N-fold cross-validation or a test dataset.
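One pragmatic workaround, offered only as a hedged sketch (assuming the features and labels objects from the example above, and that fitting the path on a random subsample is acceptable; the subsample size and seed are arbitrary):
library(svmpath)
set.seed(1)
idx <- sample(nrow(features), 2000)               # keep the kernel matrix at ~2000 x 2000
fit <- svmpath(x = as.matrix(features)[idx, ],
               y = ifelse(labels[idx], 1, -1))    # svmpath expects +1/-1 class labels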

plotting loess with standard errors in R causes integer overflow

I am attempting to use predict with a loess object in R. There are 112406 observations. There is one particular line inside stats:::predLoess which attempts to multiply N*M1 where N = M1 = 112406. This causes an integer overflow and the function bombs out. The line of code that does this is the following (copied from the predLoess source):
L <- .C(R_loess_ise, as.double(y), as.double(x), as.double(x.evaluate[inside, ]),
        as.double(weights), as.double(span), as.integer(degree),
        as.integer(nonparametric), as.integer(order.drop.sqr), as.integer(sum.drop.sqr),
        as.double(span * cell), as.integer(D), as.integer(N), as.integer(M1),
        double(M1), L = double(N * M1))$L
Has anyone run into this and found a solution? I am using R 2.13. The name of this forum is fitting for this problem.
It sounds like you're trying to get predictions for all N=112406 observations. First, do you really need to do this? For example, if you want graphical output, it's faster just to get predictions on a small grid over the range of your data.
If you do need 112406 predictions, you can split your data into subsets (say of size 1000 each) and get predictions on each subset independently. This avoids forming a single gigantic matrix inside predLoess.
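A minimal sketch of that chunking idea; fit and newdat are placeholders for your fitted loess object and the data frame of points you want predictions (with standard errors) for:
chunk_size <- 1000
chunks <- split(seq_len(nrow(newdat)), ceiling(seq_len(nrow(newdat)) / chunk_size))
# predict on each chunk separately, then stitch the pieces back together
pred_list <- lapply(chunks, function(i) predict(fit, newdata = newdat[i, , drop = FALSE], se = TRUE))
fits <- unlist(lapply(pred_list, `[[`, "fit"))
ses  <- unlist(lapply(pred_list, `[[`, "se.fit"))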
