I am trying to do a phylogenetic comparison of two trees which contain the same taxa. I want to colour the connections based on isolation site. I had thought I had performed this successfully but there is error in my work flow i.e. the coloured lines are not corresponding to isolation site accurately . I was wondering if you have any insights, please find my reproducible example below.
site <- structure(list(name = structure(c(1L, 3L, 4L, 5L, 6L, 7L, 8L,9L, 10L, 2L), .Label = c("t1", "t10", "t2", "t3", "t4", "t5","t6", "t7", "t8", "t9"), class = "factor"), site = c(1L, 1L,1L, 2L, 2L, 3L, 1L, 3L, 2L, 2L)), .Names = c("name", "site"), row.names = c(NA,10L), class = "data.frame")
library(ape)
t1 <- rtree(10)
t2 <- rtree(10)
order <- cbind(t1$tip.label)
list <- merge(order, site, by.x="V1", by.y="name")
x <- list$site
A <- cbind(t1$tip.label, t1$tip.label)
cophyloplot(t1, t2, assoc = A, show.tip.label = T, space=50, col = x)
As it stands this is my current output:
Just spotted this thread on extracting tip labels and it works.
correct order of tip labels in ape
I also need to incorporate sort=F into the merge function.
So for a finish the workflow looks like:
site <- structure(list(name = structure(c(1L, 3L, 4L, 5L, 6L, 7L, 8L,9L,
10L, 2L), .Label = c("t1", "t10", "t2", "t3", "t4", "t5","t6", "t7", "t8",
"t9"), class = "factor"), site = c(1L, 1L,1L, 2L, 2L, 3L, 1L, 3L, 2L, 2L)),
.Names = c("name", "site"), row.names = c(NA,10L), class = "data.frame")
library(ape)
t1 <- rtree(10)
t2 <- rtree(10)
is_tip<- t1$edge[,2] <= length(t1$tip.label)
ordered_tips <- t1$edge[is_tip,2]
order <-t1$tip.label[ordered_tips]
order <- as.data.frame(order)
list <- merge(order, site, by.x="V1", by.y="name", sort=F)
x <- list$site
A <- cbind(t1$tip.label, t1$tip.label)
cophyloplot(t1, t2, assoc = A, show.tip.label = T, space=50, col = x)
Only as a follow up, in my work the correct order of the labels was being altered by the merge command. My tree structure is quite complicated and probably the absence/presence of individuals between both trees was creating this problem. I just fixed by adding a vector with the positions to the order data.frame.
order <- as.data.frame(order, seq=seq(1:length(order)) )
Latter one can easily rearrange the data.frame accordingly with the tree structure.
Cheers,
Related
I applied the UMAP dimentionaity reduction over my data, and clustred it. I got three different clusters:
I have the data that specefices to which cluster does eahc sample belong, with the name of the sample and everything. Here is a subsample of it, let's call it df_cluster:
structure(list(X1 = c(17.6942795910888, 16.5328416912875, 15.0031683863395,
16.3550118351627, 17.6931159161312, 16.9869249394253, 16.3790173297882,
15.8964870189374, 17.1055608092973, 16.4568632337052), X2 = c(-1.64953541728691,
0.185674946464158, -1.38521677790428, -0.448487127519734, -1.63670327964466,
-0.456667476792068, -0.091689040488956, -1.77486494294163, -1.86407675524967,
0.14666260432486), cluster = c(1L, 2L, 2L, 1L, 2L, 1L, 3L, 3L,
1L, 3L)), row.names = c("Patient1", "Patient13", "Patient2", "Patient99",
"Patient10", "Patient43", "Patient167", "Patient8", "Patient17", "Patient16"
), class = "data.frame")
The samples of df_cluster are the same in the original data, data, which I used for the clustering. Which is basically just the samples you saw as rows, and features as columns, looks something like this:
structure(c(-0.0741098696855045, -0.094401270881699, 0.0410284948786532,
-0.163302950330185, -0.0942478217207681, -0.167314411991775,
-0.118272811489486, -0.0366277340916379, -0.0349008907108641,
-0.167823357941815, -0.178835447722468, -0.253897294559596, -0.0372301980787381,
-0.230579110769457, -0.224125346052727, -0.196933050675633, -0.344608041139497,
-0.0550538743643369, -0.157003425700701, -0.162295446209879,
-0.0384421660291032, -0.0275306107582565, 0.186447606591857,
-0.124972070102036, -0.15348122673842, -0.106812144494277, -0.104757782473888,
0.0686746776877563, -0.0662055287009653, 0.00388752358937872), dim = c(10L,
3L), dimnames = list(c("Patient1", "Patient13", "Patient2", "Patient99",
"Patient10", "Patient43", "Patient167", "Patient8", "Patient17", "Patient16"
), c("Feature1", "Feature2",
"Feature3")))
I just want to view each of those features (the columns of data), in each cluster, using a box plot or a violin plot. Kind of a comparison between the clusters.
So in the X-axis I'll have clusters 1, 2, and 3, the Y-axis would be the values. Each feature will get a plot. I've drawn an example by hand to make it more clear:
You could use facets.
But first you need to pivot the dataframe.
df_cluster <- structure(list(X1 = c(17.6942795910888, 16.5328416912875, 15.0031683863395,
16.3550118351627, 17.6931159161312, 16.9869249394253, 16.3790173297882,
15.8964870189374, 17.1055608092973, 16.4568632337052), X2 = c(-1.64953541728691,
0.185674946464158, -1.38521677790428, -0.448487127519734, -1.63670327964466,
-0.456667476792068, -0.091689040488956, -1.77486494294163, -1.86407675524967,
0.14666260432486), cluster = c(1L, 2L, 2L, 1L, 2L, 1L, 3L, 3L,
1L, 3L)), row.names = c("Patient1", "Patient13", "Patient2", "Patient99",
"Patient10", "Patient43", "Patient167", "Patient8", "Patient17", "Patient16"
), class = "data.frame")
data <- structure(c(-0.0741098696855045, -0.094401270881699, 0.0410284948786532,
-0.163302950330185, -0.0942478217207681, -0.167314411991775,
-0.118272811489486, -0.0366277340916379, -0.0349008907108641,
-0.167823357941815, -0.178835447722468, -0.253897294559596, -0.0372301980787381,
-0.230579110769457, -0.224125346052727, -0.196933050675633, -0.344608041139497,
-0.0550538743643369, -0.157003425700701, -0.162295446209879,
-0.0384421660291032, -0.0275306107582565, 0.186447606591857,
-0.124972070102036, -0.15348122673842, -0.106812144494277, -0.104757782473888,
0.0686746776877563, -0.0662055287009653, 0.00388752358937872), dim = c(10L,
3L), dimnames = list(c("Patient1", "Patient13", "Patient2", "Patient99",
"Patient10", "Patient43", "Patient167", "Patient8", "Patient17", "Patient16"
), c("Feature1", "Feature2",
"Feature3")))
library(tidyverse)
data %>%
as.data.frame() %>%
rownames_to_column("Patient") %>%
left_join(df_cluster %>% rownames_to_column("Patient") %>% select(Patient, cluster)) %>%
pivot_longer(- c(cluster, Patient)) %>% #Pivot the dataframe
ggplot(aes(as.factor(cluster), value)) +
geom_boxplot() +
facet_grid(~ name)
I have a data set that looks like these two first columns are just IDs and the last is the date,
I need to find a relation between them in R but am lost since my first problem is how to visualize my data correctly. I have the id as a factor but each time that I do a plot it gives me a numeric value of that.
You might start visualizing the relationship between your variables using the pairs.panel from the psych package. Here is an output using the sample data you shared. Note the data points are sparse but you have more data points.
library(psych)
pairs.panels(df)
Output
Data
structure(list(id1 = structure(c(6L, 2L, 2L, 1L, 5L, 4L, 5L,
3L), .Label = c("10017097", "17596277", "20501146", "3603827",
"57106539", "7596227"), class = "factor"), id2 = structure(c(3L,
1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("10122", "10197", "13840"
), class = "factor"), t_date = structure(c(17966, 17590, 17956,
17984, 17478, 17483, 17513, 17544), class = "Date")), class = "data.frame", row.names = c(NA,
-8L))
The documentation is available at pairs.panels
.
For a sample dataframe:
df <- structure(list(area = structure(c(1L, 4L, 3L, 8L, 5L, 7L, 6L,
2L), .Label = c("DE1", "DE3", "DE4", "DE5", "DE9", "DEA", "DEB",
"DEC"), class = "factor"), to.delete = c(1L, 0L, 1L, 0L, 1L,
1L, 1L, 0L)), .Names = c("area", "to.delete"), class = "data.frame", row.names = c(NA,
-8L))
I want to create a list of the areas which have a '1' in the 'to'delete' column. I know how to subset the 1s out of this dataframe, however I want the list of areas as eventually I will use this list to extract these areas from the main master data file (df2, listed below).
df2 <- structure(list(id = 1:24, area = structure(c(1L, 1L, 4L, 4L,
4L, 3L, 3L, 3L, 3L, 3L, 8L, 8L, 8L, 8L, 5L, 7L, 7L, 7L, 6L, 6L,
2L, 2L, 2L, 2L), .Label = c("DE1", "DE3", "DE4", "DE5", "DE9",
"DEA", "DEB", "DEC"), class = "factor")), .Names = c("id", "area"
), class = "data.frame", row.names = c(NA, -24L))
I prefer to do this in two steps, so I can easily see which areas I have deleted (thanks to answers below for suggestions of using list).
a <- list(df$area[df$to.delete == 1])
df2.subset <- df2[df2$area %in% a,]
This however doesn't seem to work at the moment, so if anyone has any ideas, then that would be great.
df2 should then be left with only areas DE5, DEC and DE3.
Many thanks.
Here is another method using split to collect the areas into two lists:
# get two lists of areas and give list items appropriate names
keepDrop <- setNames(split(df$area, df$to.delete), c("drop", "keep"))
# now perform dropping
df2.smaller <- df2[df2$area %in% keepDrop[["keep"]],]
We can use subset. Based on the description, the OP wants to subset the rows of a main data ('maindata') based on the 'area' that corresponds to 1 in 'to.delete' column. In that case, we extract the 'area' (df$area[df$to.delete ==1]) and with %in% we subset the 'maindata'.
subset(maindata, area %in% df$area[df$to.delete==1])
It's not too clear what you are asking.
This will create a list where each element is a different Area:
lapply(df$area[df$to.delete == 1], function(x) x)
If you want a list with just one element containing all the areas:
list(df$area[df$to.delete == 1])
Edit:
To answer the second part of your question:
a <- list(df$area[df$to.delete == 1])
df2.subset <- df2[!df2$area %in% a[[1]], ]
Here's what you can try .
a <- as.list(subset(df,df$to.delete == 1))
> a
$area
[1] DE1 DE4 DE9 DEB DEA
Levels: DE1 DE3 DE4 DE5 DE9 DEA DEB DEC
$to.delete
[1] 1 1 1 1 1
I've been trying to make a graph that looks like this (but nicer)
based on what I found in this discussion using the transitionPlot() function from the Gmiscpackage.
However, I can't get my transition_matrix right and I also can't seem to plot the different state classes in separate third column.
My data is based on the symptomatic improvement of patients following surgery. The numbers in the boxes are the number of patients in each "state" pre vs. post surgery. Please note the (LVAD) is not a necessity.
The data for this plot is this called df and is as follows
dput(df)
structure(list(StudyID = structure(c(1L, 2L, 3L, 4L, 5L, 6L,
7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L), .Label = c("P1", "P2", "P3",
"P4", "P5", "P6", "P7"), class = "factor"), MeasureTime = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Postoperative",
"Preoperative"), class = "factor"), NYHA = c(3L, 3L, 3L, 3L,
3L, 2L, 3L, 1L, 3L, 1L, 3L, 3L, 1L, 1L)), .Names = c("StudyID",
"MeasureTime", "NYHA"), row.names = c(NA, -14L), class = "data.frame")
I've made a plot in ggplot2 that looked like this
but my supervisor didn't like it, because I had to jitterthe lines so that they didn't overlap and so one could see what was happening with each patient and thus the points/lines aren't exactly lined up with the y-axis.
So I was wondering if anyone had an idea, how I'd be able to do this using the Gmisc package making what seems to me to be a transitionPlot.
Your help and time is much appreciated.
Thanks.
Using your sample df data, here are some pretty low-level plotting function that can re-create your sample image. It should be straigtforward to customize however you like
First, make sure pre comes before post
df$MeasureTime<-factor(df$MeasureTime, levels=c("Preoperative","Postoperative"))
then define some plot helper functions
textrect<-function(x,y,text,width=.2) {
rect(x-width, y-width, x+width, y+width)
text(x,y,text)
}
connect<-function(x1,y1,x2,y2, width=.2) {
segments(x1+width,y1,x2-width,y2)
}
now draw the plot
plot.new()
par(mar=c(0,0,0,0))
plot.window(c(0,4), c(0,4))
with(unique(reshape(df, idvar="StudyID", timevar="MeasureTime", v.names="NYHA", direction="wide")[,-1]),
connect(2,NYHA.Preoperative,3,NYHA.Postoperative)
)
with(as.data.frame(with(df, table(NYHA, MeasureTime))),
textrect(as.numeric(MeasureTime)+1,as.numeric(as.character(NYHA)), Freq)
)
text(1, 1:3, c("I","II","III"))
text(1:3, 3.75, c("NYHA","Pre-Op","Post-Op"))
text(3.75, 2, "(LVAD)")
which results in
I have a data frame with 18 columns and about 12000 rows. I want to find the outliers for the first 17 columns and compare the results with the column 18. The column 18 is a factor and contains data which can be used as indicator of outlier.
My data frame is ufo and I remove the column 18 as follow:
ufo2 <- ufo[,1:17]
and then convert 3 non0numeric columns to numeric values:
ufo2$Weight <- as.numeric(ufo2$Weight)
ufo2$InvoiceValue <- as.numeric(ufo2$InvoiceValue)
ufo2$Score <- as.numeric(ufo2$Score)
and then use the following command for outlier detection:
outlier.scores <- lofactor(ufo2, k=5)
But all of the elements of the outlier.scores are NA!!!
Do I have any mistake in this code?
Is there another way to find outlier for such a data frame?
All of my code:
setwd(datadirectory)
library(doMC)
registerDoMC(cores=8)
library(DMwR)
# load data
load("data_9802-f2.RData")
ufo2 <- ufo[,2:17]
ufo2$Weight <- as.numeric(ufo2$Weight)
ufo2$InvoiceValue <- as.numeric(ufo2$InvoiceValue)
ufo2$Score <- as.numeric(ufo2$Score)
outlier.scores <- lofactor(ufo2, k=5)
The output of the dput(head(ufo2)) is:
structure(list(Origin = c(2L, 2L, 2L, 2L, 2L, 2L), IO = c(2L,
2L, 2L, 2L, 2L, 2L), Lot = c(1003L, 1003L, 1003L, 1012L, 1012L,
1013L), DocNumber = c(10069L, 10069L, 10087L, 10355L, 10355L,
10382L), OperatorID = c(5698L, 5698L, 2015L, 246L, 246L, 4135L
), Month = c(1L, 1L, 1L, 1L, 1L, 1L), LineNo = c(1L, 2L, 1L,
1L, 2L, 1L), Country = c(1L, 1L, 1L, 1L, 11L, 1L), ProduceCode = c(63456227L,
63455714L, 33687427L, 32686627L, 32686627L, 791614L), Weight = c(900,
850, 483, 110000, 5900, 1000), InvoiceValue = c(637, 775, 2896,
48812, 1459, 77), InvoiceValueWeight = c(707L, 912L, 5995L, 444L,
247L, 77L), AvgWeightMonth = c(1194.53, 1175.53, 7607.17, 311.667,
311.667, 363.526), SDWeightMonth = c(864.931, 780.247, 3442.93,
93.5818, 93.5818, 326.238), Score = c(0.56366535234262, 0.33775439984787,
0.46825476121676, 1.414092583904, 0.69101737288291, 0.87827342721894
), TransactionNo = c(47L, 47L, 6L, 3L, 3L, 57L)), .Names = c("Origin",
"IO", "Lot", "DocNumber", "OperatorID", "Month", "LineNo", "Country",
"ProduceCode", "Weight", "InvoiceValue", "InvoiceValueWeight",
"AvgWeightMonth", "SDWeightMonth", "Score", "TransactionNo"), row.names = c(NA,
6L), class = "data.frame")
First of all, you need to spend a lot more time preprocessing your data.
Your axes have completely different meaning and scale. Without care, the outlier detection results will be meaningless, because they are based on a meaningless distance.
For example produceCode. Are you sure, this should be part of your similarity?
Also note that I found the lofactor implementation of the R DMwR package to be really slow. Plus, it seems to be hard-wired to Euclidean distance!
Instead, I recommend using ELKI for outlier detection. First of all, it comes with a much wider choice of algorithms, secondly it is much faster than R, and third, it is very modular and flexible. For your use case, you may need to implement a custom distance function instead of using Euclidean distance.
Here's the link to the ELKI tutorial on implementing a custom distance function.