I have a set of phylogenetic trees, some with different topologies and different branch lengths. Here is an example set:
(LA:97.592181158,((HS:82.6284812237,RN:72.190055848635):10.438414999999999):3.989335,((CP:32.2668593286,CL:32.266858085):39.9232054349,(CS:78.2389673073,BT:78.238955218815):8.378847):10.974376);
(((HS:71.9309734249,((CP:30.289472339999996,CL:30.289473923):31.8509454,RN:62.1404181356):9.790551):2.049235,(CS:62.74606492390001,BS:62.74606028250001):11.234141000000001):5.067314,LA:79.0475136246);
(((((CP:39.415718961379994,CL:39.4157161214):29.043224136600003,RN:68.4589436016):8.947169,HS:77.4061105636):4.509818,(BS:63.09170355585999,CS:63.09171066541):18.824224):13.975551000000001,LA:95.891473546);
(LA:95.630761929,((HS:73.4928857457,((CP:32.673882875400004,CL:32.673881941):33.703323212,RN:66.37720021233):7.115682):5.537861,(CS:61.798048265700004,BS:61.798043931600006):17.232697):16.600025000000002);
(((HS:72.6356569413,((CP:34.015223002300004,CL:34.015223157499996):35.207698155399996,RN:69.2229294656):3.412726):8.746038,(CS:68.62665546391,BS:68.6266424085):12.755043999999998):13.40646,LA:94.78814570300001);
(LA:89.58710099299999,((HS:72.440439124,((CP:32.270428384199995,CL:32.2704269484):32.0556597315,RN:64.32607145395):8.114349):6.962274,(CS:66.3266360702,BS:66.3266352709):13.076080999999999):10.184418);
(LA:91.116083247,((HS:73.8383213643,((CP:36.4068361936,CL:36.4068400719):32.297183626700004,RN:68.704029984267):5.134297):6.50389,(BS:68.6124876659,CS:68.61249734691):11.729719):10.773886000000001);
(((HS:91.025288418,((CP:40.288406529099994,CL:40.288401832999995):29.854198951399997,RN:70.14260821095):20.882673999999998):6.163698,(CS:81.12951949976,BS:81.12952162629999):16.059462):13.109915,LA:110.298870881);
In this example there are 2 unique topologies - using R's ape unique.multiPhylo shows that (assuming the example above is saved to a file tree.fn):
tree <- ape::read.tree(tree.fn)
unique.tree <- ape::unique.multiPhylo(tree, use.tip.label = FALSE, use.edge.length = FALSE)
> length(tree)
[1] 8
> length(unique.tree)
[1] 2
My question is: how do I get a list of trees, each one representing a unique topology in the input list, with branch lengths that are a summary statistic, such as the mean or median, across all trees with the same topology?
In the example above, it would return the first tree as is, because its topology is unique, and a second tree that has the topology shared by the other trees, with mean or median branch lengths.
If I understand correctly, you want to sort all the trees into groups, one per unique topology (e.g. in your example, the first group contains one tree, etc.), and then measure some stats for each group?
You can do that by first grouping the topologies into a list:
library(ape)
set.seed(5)
## Generating 20 four-tip trees (hopefully some will share a topology!)
tree_list <- rmtree(20, 4)
## How many unique topologies?
length(unique(tree_list))
## Sorting the trees by topologies
tree_list_tmp <- tree_list
sorted_tree_list <- list()
counter <- 0
while(length(tree_list_tmp) != 0) {
    counter <- counter+1
    ## Is the first tree equal to any of the trees in the list
    equal_to_tree_one <- unlist(lapply(tree_list_tmp, function(x, base) all.equal(x, base, use.edge.length = FALSE), base = tree_list_tmp[[1]]))
    ## Saving the identical trees
    sorted_tree_list[[counter]] <- tree_list_tmp[which(equal_to_tree_one)]
    ## Removing them from the list
    tree_list_tmp <- tree_list_tmp[-which(equal_to_tree_one)]
    ## Repeat while there are still some trees!
}
## The list of topologies should be equal to the number of unique trees
length(sorted_tree_list) == length(unique(tree_list))
## Giving them names for fancyness
names(sorted_tree_list) <- paste0("topology", 1:length(sorted_tree_list))
Then for all the trees in each unique topology group you can extract different summary statistics by making a function. Here for example I will measure the branch length sd, mean and 90% quantiles.
## function for getting some stats
get.statistics <- function(unique_topology_group) {
    ## Extract the branch lengths of all the trees
    branch_lengths <- unlist(lapply(unique_topology_group, function(x) x$edge.length))
    ## Apply some statistics
    return(c(n = length(unique_topology_group),
             mean = mean(branch_lengths),
             sd = sd(branch_lengths),
             quantile(branch_lengths, prob = c(0.05, 0.95))))
}
## Getting all the stats
all_stats <- lapply(sorted_tree_list, get.statistics)
## and making it into a nice table
round(do.call(rbind, all_stats), digits = 3)
#            n  mean    sd    5%   95%
# topology1  3 0.559 0.315 0.113 0.962
# topology2  2 0.556 0.259 0.201 0.889
# topology3  4 0.525 0.378 0.033 0.989
# topology4  2 0.489 0.291 0.049 0.855
# topology5  2 0.549 0.291 0.062 0.882
# topology6  1 0.731 0.211 0.443 0.926
# topology7  3 0.432 0.224 0.091 0.789
# topology8  1 0.577 0.329 0.115 0.890
# topology9  1 0.473 0.351 0.108 0.833
# topology10 1 0.439 0.307 0.060 0.795
Of course you can tweak it to get your own desired stats, or even get the stats per tree within each group (using a nested lapply such as lapply(sorted_tree_list, lapply, function(tree) get.statistics(list(tree))) or something like that).
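The statistics above pool all branches of a group together. If what you want is one tree per topology whose individual branch lengths are averaged across the group, a rough sketch follows. It assumes rooted trees with identical tip labels and no single-child nodes, and it matches edges between trees by the set of tips below each edge. The helper names edge_keys and summarise_group are made up for this illustration and are not part of ape.

```r
library(ape)

## Label every edge by the sorted tip labels found below its child node
edge_keys <- function(tree) {
  n_tip <- Ntip(tree)
  vapply(tree$edge[, 2], function(node) {
    tips <- if (node <= n_tip) {
      tree$tip.label[node]
    } else {
      extract.clade(tree, node)$tip.label
    }
    paste(sort(tips), collapse = "|")
  }, character(1))
}

## Collapse a group of same-topology trees into one tree whose
## branch lengths are fun() of the matching branches (mean by default)
summarise_group <- function(group, fun = mean) {
  ref <- group[[1]]
  keys <- edge_keys(ref)
  lens <- sapply(group, function(tr) tr$edge.length[match(keys, edge_keys(tr))])
  lens <- matrix(lens, nrow = length(keys))
  ref$edge.length <- apply(lens, 1, fun)
  ref
}

## Two toy trees with the same topology but different tip order
t1 <- read.tree(text = "((A:1,B:1):1,(C:2,D:2):1);")
t2 <- read.tree(text = "((D:4,C:4):3,(B:3,A:3):3);")
mean_tree <- summarise_group(list(t1, t2))
mean_tree$edge.length

## With the grouping above: lapply(sorted_tree_list, summarise_group, fun = median)
```

Matching by tip sets rather than by edge index matters because two trees with the same topology can store their edges in a different order.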
I'm making a very dirty version of an autocorrelation function in R.
I have a loop that works up to a specified max lag and then returns all the correlations as a matrix, as the acf() function does.
The idea is to replicate the output of the acf() function as shown:
Autocorrelations of series ‘acfData’, by lag
0 1 2 3 4 5 6 7 8
1.000 -0.038 0.253 0.266 0.250 0.267 -0.182 0.281 -0.013
9 10 11 12 13
-0.067 -0.122 -0.115 -0.023 -0.337
What I have so far takes the data and a specified max lag. For each lag, the code slides the data frame back by the needed amount and then computes the covariance and the standard deviations over the overlapping range of the two columns. This is repeated over the range of lags and each result is appended to a vector, as shown. I also included the cor() function on the same data as a check.
My problem is that the code returns the correct value for the first loop, or slide, and then returns slightly wrong values from then on.
library(DataCombine)  # assumption: slide() here is DataCombine::slide()

myAcf <- function(dat, lg){
    dataF <- data.frame("data" = dat)
    names(dataF)[1] <- "acfData"
    lagMat <- c()
    testMat <- c()
    for(i in 0:lg){
        dataLag <- slide(dataF, "acfData", slideBy = -i)
        covacf <- cov(dataLag[(1+i):nrow(dataLag[1]), 1], dataLag[(1+i):nrow(dataLag[1]), 2])
        sd1 <- sd(dataLag[(1+i):nrow(dataLag[1]), 1])
        sd2 <- sd(dataLag[(1+i):nrow(dataLag[1]), 2])
        corrCalc <- covacf/(sd1 * sd2)
        lagMat <- c(lagMat, corrCalc)
        a <- cor(dataLag[(1+i):nrow(dataLag[1]), 1], dataLag[(1+i):nrow(dataLag[1]), 2])
        testMat <- c(testMat, a)
    }
    plot(lagMat)
    return(list(lagMat, testMat))
}
My code then returns for the same data as the acf() function input:
[[1]]
[1] 1.00000000 -0.03786539 0.27700596 0.30197418 0.31009956
[6] 0.37123797 -0.19520518 0.44399863 0.05684766 0.02063488
[11] -0.03724332
[[2]]
[1] 1.00000000 -0.03842146 0.27502462 0.29292583 0.35052131
[6] 0.40931426 -0.23637159 0.52320559 0.07270497 0.02555461
[11] -0.04524035
Any help is greatly appreciated!
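One likely source of the mismatch, offered as a sketch and not a definitive diagnosis: acf() does not compute a Pearson correlation on the overlapping part of the series. With its defaults it centres the whole series on the overall mean, sums the lagged cross-products, and divides by n and by the lag-0 autocovariance, whereas cov() and sd() on sliced data use the means and lengths of the slices. The function name my_acf and its arguments are made up for this illustration.

```r
## Sample autocorrelation using the same estimator as stats::acf() defaults
my_acf <- function(x, max_lag) {
  n  <- length(x)
  xc <- x - mean(x)            # centre on the overall mean, not per-slice means
  c0 <- sum(xc^2) / n          # lag-0 autocovariance (divisor n, not n - 1)
  sapply(0:max_lag, function(k) {
    sum(xc[(1 + k):n] * xc[1:(n - k)]) / n / c0
  })
}

set.seed(1)
x <- rnorm(50)
mine <- my_acf(x, 10)
ref  <- as.numeric(acf(x, lag.max = 10, plot = FALSE)$acf)
all.equal(mine, ref)
```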
I'm performing a latent class analysis using Mplus, and trying to get the output into R via the MplusAutomation package (since I'm doing this many times, I want to avoid copying by hand). I'd like to grab the "Results in Probability Scale" subsection in the "Model Results" section of the Mplus output, but I'm unable to find it in the R object MplusAutomation creates from the .out file. That object contains a "parameters" data frame which includes other information from the "Model Results" section, so is it a matter of "Results in Probability Scale" being a simple transformation of the other model results data, that I could do myself in R? If not, is there some other way of recreating the results of this section from what info I do have in R? Or is the information I'm looking for stored somewhere else in the output?
The "Results in Probability Scale"-section does not seem to be parsed by MplusAutomation.
However, you can convert the threshold parameters yourself to probability scale using the formula prob = 1 / (1 + exp(est)).
For example, the code below should reproduce the results in probability scale from this UCLA example:
library(dplyr)
library(tidyr)
library(MplusAutomation)
# Fetch & write output from UCLA LCA-example to temp file
lca_ex_out = tempfile(fileext = '.out')
fileConn <- file(lca_ex_out)
writeLines(readLines('https://stats.idre.ucla.edu/stat/mplus/dae/lca1.out'), fileConn)
close(fileConn)
lca_ex_result = readModels(lca_ex_out) # extract results from temp file
# select threshold parameters, convert to probability & lay out in a table
lca_ex_result$parameters$unstandardized %>%
filter(paramHeader == 'Thresholds') %>%
mutate(est_probscale = 1 / (1 + exp(est))) %>%
select(param, LatentClass, est_probscale) %>%
spread(LatentClass, est_probscale)
Output:
param 1 2 3
1 ITEM1$1 0.908 0.312 0.923
2 ITEM2$1 0.337 0.164 0.546
3 ITEM3$1 0.067 0.036 0.426
4 ITEM4$1 0.065 0.056 0.418
5 ITEM5$1 0.219 0.044 0.765
6 ITEM6$1 0.320 0.183 0.471
7 ITEM7$1 0.113 0.098 0.512
8 ITEM8$1 0.140 0.110 0.619
9 ITEM9$1 0.325 0.188 0.349
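As a quick sanity check on the conversion itself, here is a minimal base R sketch. The threshold values are made up, and threshold_to_prob is a hypothetical helper name; the point is only that 1 / (1 + exp(est)) is the inverse logit of the negated threshold, which base R exposes as plogis().

```r
## Convert logit-scale thresholds to probabilities
threshold_to_prob <- function(est) 1 / (1 + exp(est))

est   <- c(-2.3, 0, 0.8, 2.6)      # made-up threshold estimates
probs <- threshold_to_prob(est)

## Same result via the logistic distribution function
all.equal(probs, plogis(-est))
round(probs, 3)
```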
I have the following data (dat)
V W X Y Z
1 8 89 3 900
1 8 100 2 800
0 9 333 4 980
0 9 560 1 999
I wish to perform TukeysHSD pairwise test to the above data set.
library(tidyr) # gather() comes from tidyr, not reshape2
dat1 <- gather(dat) #convert to long form
pairwise.t.test(dat1$key, dat1$value, p.adj = "holm")
However, every time I try to run it, it keeps running and does not yield an output. Any suggestions on how to correct this?
I would also like to perform the same test using the function TukeyHSD(). However, whether I pass the data in wide or long format, I run into an error that says
Error in UseMethod("TukeyHSD") :
no applicable method for 'TukeyHSD' applied to an object of class "data.frame"
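'x' needs to be dat1$value and 'g' needs to be dat1$key. Because the arguments are not named in your call, the first is taken as 'x' (the response) and the second as 'g' (the grouping factor), so they have to be swapped: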
pairwise.t.test( dat1$value, dat1$key, p.adj = "holm")
#data: dat1$value and dat1$key
# V W X Y
#W 1.000 - - -
#X 0.018 0.018 - -
#Y 1.000 1.000 0.018 -
#Z 4.1e-08 4.1e-08 2.8e-06 4.1e-08
#P value adjustment method: holm
Or we name the arguments, in which case they can be given in any order:
pairwise.t.test(g = dat1$key, x= dat1$value, p.adj = "holm")
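Regarding TukeyHSD(): it expects a fitted model such as the result of aov(), not a data frame, which is why calling it on the data directly fails: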
TukeyHSD(aov(value~key, data = dat1), ordered = TRUE)
#Tukey multiple comparisons of means
# 95% family-wise confidence level
# factor levels have been ordered
#Fit: aov(formula = value ~ key, data = dat1)
#$key
# diff lwr upr p adj
#Y-V 2.00 -233.42378 237.4238 0.9999999
#W-V 8.00 -227.42378 243.4238 0.9999691
#X-V 270.00 34.57622 505.4238 0.0211466
#Z-V 919.25 683.82622 1154.6738 0.0000000
#W-Y 6.00 -229.42378 241.4238 0.9999902
#X-Y 268.00 32.57622 503.4238 0.0222406
#Z-Y 917.25 681.82622 1152.6738 0.0000000
#X-W 262.00 26.57622 497.4238 0.0258644
#Z-W 911.25 675.82622 1146.6738 0.0000000
#Z-X 649.25 413.82622 884.6738 0.0000034
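For completeness, a self-contained sketch of both tests using only base R. It rebuilds the data from the question and reshapes it with stack() instead of gather(), so the long-format columns are named values and ind here, not value and key.

```r
## Data from the question, in wide format
dat <- data.frame(V = c(1, 1, 0, 0),
                  W = c(8, 8, 9, 9),
                  X = c(89, 100, 333, 560),
                  Y = c(3, 2, 4, 1),
                  Z = c(900, 800, 980, 999))

## Long format: 'values' holds the numbers, 'ind' the original column name
long <- stack(dat)

## Pairwise t-tests with Holm adjustment (response first, grouping second)
pw <- pairwise.t.test(x = long$values, g = long$ind, p.adjust.method = "holm")

## Tukey HSD needs a fitted aov model, not a data frame
fit <- aov(values ~ ind, data = long)
tuk <- TukeyHSD(fit)
tuk$ind["Z-V", "diff"]
```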
I am currently trying to filter the variables in my data based on their correlation, using the caret package from R in RStudio on my Mac.
So far I can calculate and print the correlation matrix of the data set. However, once I apply the findCorrelation method I do not get any data returned. I only get the following warning:
" Combination row and column is above the cut-off, value = Flagging column"
library(caret)
preProcessAttributeClass <- function (data.convert) {
    classe <- data.convert$classe
    data.convert <- as.data.frame(sapply(data.convert,as.numeric))
    data.convert$X.1 <- NULL
    data.convert$X <- NULL
    data.convert$user_name <- NULL
    data.convert$raw_timestamp_part_1 <- NULL
    data.convert$raw_timestamp_part_2 <- NULL
    data.convert$cvtd_timestamp <- NULL
    data.convert$new_window <- NULL
    data.convert$num_window <- NULL
    data.convert
}
data.train <- read.csv(file="training.csv",na.strings=c("NA",""))
data.train <- preProcessAttributeClass(data.train)
descrCor <- (cor(na.omit(data.train),use="complete.obs"))
highlyCorDescr <- findCorrelation(na.omit(descrCor), cutoff = .9, verbose=TRUE,names=FALSE)
Any ideas what could be the cause of my problem?
I think that the problem is your correlation matrix:
> class(na.omit(descrCor))
[1] "matrix"
> dim(na.omit(descrCor))
[1] 0 153
These data have columns with a lot of missing data:
> pct_na <- unlist(lapply(data.train, function(x) mean(is.na(x))))
> summary(pct_na)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.9793 0.6401 0.9793 0.9793
I'll punt on whether columns with ~98% missing are useful, but they are preventing you from getting a useful correlation matrix. I suggest doing the correlation filter with fewer columns:
> sum(pct_na > .1)
[1] 100
> keepers <- data.train[,names(which(pct_na <= .1))]
> descrCor <- cor(keepers ,use="complete.obs")
Most of the remaining columns have either no correlations or very high ones:
> summary(descrCor[upper.tri(descrCor)])
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.992000 -0.108800 0.001911 0.001667 0.088680 0.980900
Now do the filter:
> highlyCorDescr <- findCorrelation(descrCor, cutoff = .9, verbose=TRUE,names=FALSE)
Compare row 10 and column 1 with corr 0.992
Means: 0.266 vs 0.164 so flagging column 10
Compare row 1 and column 9 with corr 0.925
Means: 0.247 vs 0.161 so flagging column 1
Compare row 9 and column 4 with corr 0.928
Means: 0.229 vs 0.158 so flagging column 9
Compare row 8 and column 2 with corr 0.966
Means: 0.24 vs 0.154 so flagging column 8
Compare row 19 and column 18 with corr 0.918
Means: 0.089 vs 0.155 so flagging column 18
Compare row 46 and column 31 with corr 0.914
Means: 0.099 vs 0.158 so flagging column 31
Compare row 46 and column 33 with corr 0.933
Means: 0.081 vs 0.161 so flagging column 33
All correlations <= 0.9
> keep_these <- names(data.train)[!(names(data.train) %in% colnames(descrCor)[highlyCorDescr])]
> data.train.subset <- data.train[, keep_these]
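The missing-data step can be wrapped in a small helper if you need it for several data sets. This is only a sketch on a toy data frame; drop_sparse_columns and its max_na argument are names invented here, and the 0.1 default simply mirrors the cut-off used above.

```r
## Keep only columns whose fraction of missing values is at most max_na
drop_sparse_columns <- function(df, max_na = 0.1) {
  pct_na <- colMeans(is.na(df))
  df[, pct_na <= max_na, drop = FALSE]
}

## Toy data: column 'c' is 80% missing
toy <- data.frame(a = c(1, 2, 3, 4, 5),
                  b = c(2, 1, 4, 3, 6),
                  c = c(NA, NA, NA, NA, 1))

kept <- drop_sparse_columns(toy)
names(kept)
toy_cor <- cor(kept, use = "complete.obs")
```

The resulting correlation matrix has no missing entries, so it can be passed to findCorrelation() without wrapping it in na.omit().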
I am new to R and I am stuck with a problem. I am trying to read a set of data in a table and I want to perform linear modeling. Below is how I read my data and my variables names:
>data =read.table(datafilename,header=TRUE)
>names(data)
[1] "price" "model" "size" "year" "color"
What I want to do is create several linear models using different combinations of the variables (price being the target ), such as:
> attach(data)
> model1 = lm(price~model+size)
> model2 = lm(price~model+year)
> model3 = lm(price~model+color)
> model4 = lm(price~model+size)
> model5 = lm(price~size+year+color)
#... and so on for all different combination...
My main aim is to compare the different models. Is there a more clever way to generate these models instead of hard coding the variables, especially since the number of variables will in some cases increase to 13 or so?
If your goal is model selection there are several tools available in R which attempt to automate this process. Read the documentation on dredge(...) in the MuMIn package.
# dredge: example of use
library(MuMIn)
df <- mtcars[,c("mpg","cyl","disp","hp","wt")] # subset of mtcars
full.model <- lm(mpg ~ cyl+disp+hp+wt,df) # model for predicting mpg
dredge(full.model)
# Global model call: lm(formula = mpg ~ cyl + disp + hp + wt, data = df)
# ---
# Model selection table
#    (Intrc)     cyl      disp       hp     wt df   logLik  AICc delta weight
# 10   39.69 -1.5080                    -3.191  4  -74.005 157.5  0.00  0.291
# 14   38.75 -0.9416           -0.01804 -3.167  5  -72.738 157.8  0.29  0.251
# 13   37.23                   -0.03177 -3.878  4  -74.326 158.1  0.64  0.211
# 16   40.83 -1.2930  0.011600 -0.02054 -3.854  6  -72.169 159.7  2.21  0.096
# 12   41.11 -1.7850  0.007473          -3.636  5  -73.779 159.9  2.37  0.089
# 15   37.11         -0.000937 -0.03116 -3.801  5  -74.321 161.0  3.46  0.052
# 11   34.96         -0.017720          -3.351  4  -78.084 165.6  8.16  0.005
#  9   37.29                            -5.344  3  -80.015 166.9  9.40  0.003
#  4   34.66 -1.5870 -0.020580                  4  -79.573 168.6 11.14  0.001
#  7   30.74         -0.030350 -0.02484         4  -80.309 170.1 12.61  0.001
#  2   37.88 -2.8760                            3  -81.653 170.2 12.67  0.001
#  8   34.18 -1.2270 -0.018840 -0.01468         5  -79.009 170.3 12.83  0.000
#  6   36.91 -2.2650           -0.01912         4  -80.781 171.0 13.55  0.000
#  3   29.60         -0.041220                  3  -82.105 171.1 13.57  0.000
#  5   30.10                   -0.06823         3  -87.619 182.1 24.60  0.000
#  1   20.09                                    2 -102.378 209.2 51.68  0.000
You should consider these tools to help you make intelligent decisions. Do not let the tool make the decision for you!!!
For example, in this case dredge(...) suggests that the "best" model for predicting mpg, based on the AICc criterion, includes cyl and wt. But note that AICc for this model is 157.5 whereas the second best model has an AICc of 157.8, so these are basically the same. In fact, the first 5 models in this list are not significantly different in their ability to predict mpg. It does, however, narrow things down a bit. Among these 5, I would want to look at the distribution of residuals (should be normal), trends in residuals (there should be none), and leverage (do some points have undue influence?) before picking a "best" model.
Here's one way to get all of the combinations of variables using the combn function. It's a bit messy, and uses a loop (perhaps someone can improve on this with mapply):
vars <- c("price","model","size","year","color")
N <- list(1,2,3,4)
COMB <- sapply(N, function(m) combn(x=vars[2:5], m))
COMB2 <- list()
k=0
for(i in seq(COMB)){
    tmp <- COMB[[i]]
    for(j in seq(ncol(tmp))){
        k <- k + 1
        COMB2[[k]] <- formula(paste("price", "~", paste(tmp[,j], collapse=" + ")))
    }
}
Then, you can call these formulas and store the model objects using a list or possibly give unique names with the assign function:
res <- vector(mode="list", length(COMB2))
for(i in seq(COMB2)){
    res[[i]] <- lm(COMB2[[i]], data=data)
}
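The same idea can be written without explicit loops. The sketch below is one possible variant, not a replacement for the code above: it uses mtcars as a stand-in for your data (mpg playing the role of price), builds the formulas with reformulate(), and fits everything with lapply(). The object names are arbitrary.

```r
## Predictors to combine; the response is kept separate
predictors <- c("cyl", "disp", "hp", "wt")

## All non-empty subsets of the predictors, as a flat list of character vectors
combos <- unlist(
  lapply(seq_along(predictors), function(m) combn(predictors, m, simplify = FALSE)),
  recursive = FALSE
)

## One formula and one fitted model per subset
formulas <- lapply(combos, reformulate, response = "mpg")
models   <- lapply(formulas, lm, data = mtcars)
names(models) <- vapply(formulas, function(f) paste(deparse(f), collapse = ""), character(1))

## Compare the fits, e.g. by AIC (smaller is better)
aic <- sort(sapply(models, AIC))
head(aic, 3)
```

With 13 predictors this would fit 2^13 - 1 = 8191 models, which is still feasible for lm() but worth keeping in mind.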
You can use stepwise multiple regression to determine which variables make sense to include. To get started, write one lm() statement with all variables, such as:
library(MASS)
fit <- lm(price ~ model + size + year + color, data = data)
Then continue with:
step <- stepAIC(fit, direction="both")
Finally, you can use the following to show the results:
step$anova
Hope this gives you some inspiration for advancing your script.
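Since the price data are not available here, this is the same stepwise workflow as a runnable sketch on mtcars, with mpg standing in for price. The choice of predictors is arbitrary, and trace = FALSE only silences the step-by-step printout.

```r
library(MASS)

## Full model with all candidate predictors
full <- lm(mpg ~ cyl + disp + hp + wt, data = mtcars)

## Stepwise selection by AIC, allowing terms to be dropped and re-added
selected <- stepAIC(full, direction = "both", trace = FALSE)

formula(selected)   # the model that was kept
selected$anova      # record of the steps taken
```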