Bootstrapped tree values differ from PAST - r

When I compute a bootstrapped tree in R I get different values to when I use PAST (http://folk.uio.no/ohammer/past/). How can I get the output to match from the two programs?
Here's what I'm doing in R (data below):
library("ape")
library("phytools")
library("phangorn")
library("cluster")
# compute neighbour-joined tree
f <- function(xx) nj(daisy(xx))
nj_tree <- f(tab)
nj_tree_root <- root(nj_tree, 1, r = TRUE)
## bootstrap
# bootstrap values do not match PAST output - why is that?
nj_tree_root_boot <- boot.phylo(nj_tree, FUN = f, tab, rooted = TRUE)
# Are bootstrap values stable?
for (i in 1:10){
print(boot.phylo(nj_tree, FUN = f, tab, rooted = TRUE, quiet = TRUE))
}
# yes, they seem ok
# plot tree with bootstrap values
plot(nj_tree_root, use.edge.length = FALSE)
nodelabels(nj_tree_root_boot, adj = c(1.2, 1.2), frame = "none")
Typical output for the bootstrap is [1] 100 6 39 27 23 57 53 75 71 and here's the plot (far LHS value should be 100, it was cropped somehow):
I transform the data to send it to PAST like so:
tab1 <- t(apply(tab, 1, as.numeric))
write.table(tab1, "tab.txt")
In PAST I open the tab.txt file, do multivariate -> cluster -> Neighbour Joining with Euclidian and 100 bootstrap replications, using an outgroup. From PAST I get this plot:
And the values are very different. What do I need to do with R to make the output match that from PAST? Is PAST wrong?
The data:
tab <- structure(list(X1 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
1L, 2L, 2L), .Label = c("0", "1"), class = "factor"), X2 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L), .Label = c("0", "1"), class = "factor"),
X3 = structure(c(1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L,
2L), .Label = c("0", "1"), class = "factor"), X4 = structure(c(2L,
2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L), .Label = c("0",
"1"), class = "factor"), X5 = structure(c(1L, 1L, 1L, 1L,
2L, 2L, 1L, 2L, 1L, 2L, 1L), .Label = c("0", "1"), class = "factor"),
X6 = structure(c(1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L,
2L), .Label = c("0", "1"), class = "factor"), X7 = structure(c(1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L), .Label = c("0",
"1"), class = "factor"), X8 = structure(c(2L, 2L, 2L, 2L,
1L, 1L, 2L, 2L, 1L, 2L, 2L), .Label = c("0", "1"), class = "factor"),
X9 = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L,
1L), .Label = c("0", "1"), class = "factor"), X10 = structure(c(1L,
1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L), .Label = c("0",
"1"), class = "factor"), X11 = structure(c(1L, 2L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 1L, 2L), .Label = c("0", "1"), class = "factor"),
X12 = structure(c(2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("0", "1"), class = "factor"), X13 = structure(c(2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("0",
"1"), class = "factor"), X14 = structure(c(2L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor"),
X15 = structure(c(2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L), .Label = c("0", "1"), class = "factor"), X16 = structure(c(2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), .Label = c("0",
"1"), class = "factor"), X17 = structure(c(2L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 1L, 1L, 2L), .Label = c("0", "1"), class = "factor"),
X18 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L,
1L), .Label = c("0", "1"), class = "factor"), X19 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L), .Label = c("0",
"1"), class = "factor"), X20 = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L), .Label = c("0", "1"), class = "factor"),
X21 = structure(c(1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("0", "1"), class = "factor"), X22 = structure(c(2L,
2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 2L), .Label = c("0",
"1"), class = "factor"), X23 = structure(c(1L, 1L, 2L, 1L,
1L, 1L, 1L, 2L, 1L, 2L, 2L), .Label = c("0", "1"), class = "factor"),
X24 = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L,
2L), .Label = c("0", "1"), class = "factor"), X25 = structure(c(1L,
1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L), .Label = c("0",
"1"), class = "factor"), X26 = structure(c(1L, 1L, 2L, 2L,
2L, 1L, 2L, 2L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor")), .Names = c("X1",
"X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10", "X11",
"X12", "X13", "X14", "X15", "X16", "X17", "X18", "X19", "X20",
"X21", "X22", "X23", "X24", "X25", "X26"), row.names = c("a",
"b", "c", "d", "e", "f", "g", "h", "i", "j", "k"), class = "data.frame")

After much searching around, it turn out the answer is in the ape package FAQ Q14:
I have done a bootstrap analysis with boot.phylo but some bootstrap
values seem at the wrong place after rooting the tree. This is because
the bootstrap values are counted as the frequencies of clades, and not
as actual bipartitions. So these values are really associated to the
nodes, not to the edges. A consequence is that some of the bootstrap
values are lilely to loose their meaning after (re)rooting the tree
since this will affect the definition of the clades in the tree. A
simple solution is to include the rooting process in the definition of
the function FUN that is given as argument to boot.phylo. Obviously
the estimated tree must also be rooted in the same way before doing
the bootstrap. In this situation, it is more convenient to define FUN
beforehand. An example code would be:
outgroup <- 1 # may be several tips, numeric or tip labels
foo <- function(xx) root(nj(dist.dna(xx)), outgroup)
tr <- foo(X) # X is the matrix of DNA sequences
bp <- boot.phylo(tr, X, foo)
plot(tr)
nodelabels(bp) # will have "100" at the root
In the specific case of my question:
nj_tree_root_boot <- boot.phylo(nj_tree, FUN = f, tab, rooted = TRUE)
plot(nj_tree_root, use.edge.length = FALSE)
nodelabels(nj_tree_root_boot, adj = c(1.2, 1.2), frame = "none")
Which matches the PAST output quite well.

Related

How can I correctly plot my qualitative variables in FAMD analysis using FactoMineR?

community!
I'm trying to run FAMD on a morphology-based dataset with 25 qualitative variables recording the presence and absence of fluorescence on a body part (binary) and six quantitative variables. Furthermore, I have a few supplementary variables such as sex, genus and depth.
First I ran the code for the FAMD on my data set after I had removed all missing values with na.omit():
res.famd1<-FAMD(fluo_famd1,sup.var=c(1,2,28,35),graph=FALSE, ncp=5)
and retrieved a bunch of results like eigenvalues, scree plot etc.
I then tried to plot my qualitative variables within the two dimensions like in this example:
[Example][1]
This is the code I used:
quali.var1 <- get_famd_var(res.famd1, "quali.var")
quali.var1
fviz_famd_var(res.famd1, "quali.var")
Instead of plotting the categories R is plotting decimal numbers I can't explain.
[Missing categories][2]
After this I tried running the FAMD on my data set with missing values using the code given in the package description:
require(missMDA)
res.impute <- imputeFAMD(fluo_famd2, ncp=3)
res.famd2 <- FAMD(fluo_famd2,tab.disj=res.impute$tab.disj,sup.var=c(1,2,28))
When trying to plot the categories now, they do appear in the plot but they are doubled and labelled with _0 and _1.
[doubled categories][3]
My questions are:
Can you identify an obvious mistake? Why would the categories be plotted twice in the graph? Does it have an impact on the overall analysis? Is FAMD suited for a data set like this?
[1]: https://i.stack.imgur.com/8UFlA.png
[2]: https://i.stack.imgur.com/qb3Cz.png
[3]: https://i.stack.imgur.com/O1Dff.png
Please find a subset of my data here:
structure(list(genus = structure(c(5L, 7L, 7L, 7L, 9L, 7L, 7L,
9L, 9L, 7L, 7L, 9L, 7L, 6L, 7L), .Label = c("Cryptochirus",
"Dacryomaia",
"Fizesereneia", "Fungicola", "Hapalocarcinus", "Hiroia",
"Lithoscaptus",
"Neotroglocarcinus", "Opecarcinus", "Pseudohapalocarcinus",
"Xynomaia"
), class = "factor"), sex = structure(c(1L, 1L, 1L, 2L, 1L, 1L,
1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("f", "m"), class
=
"factor"),
frontal_dorsal = structure(c(1L, 2L, 2L, 2L, 2L, 2L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("0", "1"), class =
"factor"),
frontal_ventral = structure(c(1L, 2L, 2L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("0", "1"), class =
"factor"),
mesogastric = structure(c(1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 1L, 2L, 2L), .Label = c("0", "1"), class =
"factor"),
cardial = structure(c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L,
1L, 1L, 1L, 1L, 1L, 2L), .Label = c("0", "1"), class = "factor"),
branchial = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L,
1L, 1L, 2L, 1L, 2L, 2L), .Label = c("0", "1"), class = "factor"),
ps1 = structure(c(1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L,
1L, 2L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor"),
ps2 = structure(c(1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L,
1L, 2L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor"),
ps3 = structure(c(1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L,
1L, 2L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor"),
ps4 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor"),
ps6 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor"),
telson = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor"),
eyes = structure(c(1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L,
2L, 2L, 1L, 1L, 2L), .Label = c("0", "1"), class = "factor"),
eyestalk = structure(c(1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L,
2L, 1L, 1L, 1L, 1L, 2L), .Label = c("0", "1"), class = "factor"),
antennules = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("0", "1"), class =
"factor"),
anntenullar_peduncle = structure(c(1L, 1L, 2L, 2L, 2L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L), .Label = c("0", "1"), class
=
"factor"),
depth = c(NA, 10.3, 16, 16.1, 14.3, 12.8, 10.8, 12.6, 10.2,
11, 11.9, 13.1, 10.7, 10.1, 12.3), carapace_fluo = c(NA,
NA, 0.0999104660846311, 0.459446596994549, 0.639459602769835,
0.0157309627508303, NA, 0.792912115871697, 0.385646421420439,
0.0934932558564838, 0.118926192063408, 0.334765757290687,
NA, 0.712954991372207, 0.816431146170724), ap_fluo = c(NA,
0, 0.153709650160554, NA, 0.526410945516736, 0,
0.0572985597508758,
NA, 0.0105633802816901, 0.284174213022855, 0.305258467023173,
0.402286503491138, NA, 0, 0.0679211592610398), prod_fluo = c(NA,
0, 0, NA, 0.528576376861794, 0, 0, 0.15260360009031, 0,
0.0252962625341841,
0.241194486983155, 0.0717077570655442, NA, 0.479219143576826,
0), pol_fluo = c(NA, 0, 0, NA, 0, 0, 0, 0.118164567879938,
0, 0, 1, 0, NA, 0.299160251924423, 0), dac_fluo = c(NA, 0,
0, NA, 0, 0, 0, 0.102848534648042, 0, 0, 0.309536216779573,
0, NA, 0.0654761904761905, 0), sum_chel = c(NA, 0, 0, NA,
0.345118733509235, 0, 0, 0.14349725008088, 0, 0.0155266470835082,
0.347599820547331, 0.0451661774453177, NA, 0.32612422524067,
0)), row.names = c(NA, -15L), class = c("tbl_df", "tbl",
"data.frame"))

Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536") : missing value where TRUE/FALSE for Gower distance

I am struggling to get hierarchical clustering, in R. Please do not downgrade this post since I have tried what is at this link How to use 'hclust' as function call in R
Yet I haven't succeeded. A sample of data is here:
structure(list(respondents_id = c("1", "2", "3", "4", "5", "6",
"7", "8", "9", "10"), comorbidities = c("hypertension", "asthma",
"diabetes_type_two", "hypertension", "hypertension", "lung_condition",
"asthma", "obesity", "obesity", "obesity"), chills = structure(c(2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"),
cough = structure(c(1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L
), .Label = c("No", "Yes"), class = "factor"), diarrhoea = structure(c(2L,
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L), .Label = c("No", "Yes"
), class = "factor"), fatigue = structure(c(2L, 1L, 1L, 1L,
2L, 1L, 2L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"),
headache = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L), .Label = c("No", "Yes"), class = "factor"), loss_smell_taste = structure(c(1L,
1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L), .Label = c("No", "Yes"
), class = "factor"), muscle_ache = structure(c(2L, 1L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"),
nasal_congestion = structure(c(1L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"),
nausea_vomiting = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"),
shortness_breath = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 1L,
1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"),
sore_throat = structure(c(1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = c("No", "Yes"), class = "factor"), sputum = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"
), class = "factor"), temperature = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 1L), .Label = c("No", "Yes"), class = "factor"),
loss_appetite = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = c("No", "Yes"), class = "factor"), chest_pain = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"
), class = "factor"), itchy_eyes = structure(c(1L, 1L, 2L,
2L, 1L, 2L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"),
joint_pain = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = c("No", "Yes"), class = "factor")), row.names = c(NA,
10L), class = "data.frame")
Here is the code:
gower_distance <- cluster::daisy(data_test[,3:19], metric = "gower")
class(gower_distance)
divisive_clustering <- diana(as.matrix(gower_distance), diss = TRUE, keep.diss = TRUE)
hc_complete <- hclust(divisive_clustering, method = "complete")

Transform a data frame into a table with option

I have a data frame with different variables (columns).
I want to transform this data frame into a table with a different structure to make it more readable.
For example, I have a data frame like this:
myData = structure(list(X = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "20", class = "factor"),
Y = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("20", "100"), class = "factor"),
MethodType = structure(c(2L, 2L, 4L, 4L, 1L, 1L, 3L, 3L,
2L, 2L, 4L, 4L, 1L, 1L, 3L, 3L), .Label = c("E", "Q", "R",
"W"), class = "factor"), MethodType2 = structure(c(1L, 2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("A",
"B"), class = "factor"), Metric1 = c(0.970017512487058, 0.969647220975651,
0.965873991040769, 0.966242788535318, 0.986725852301671,
0.98696657967457, 0.98252107117733, 0.982655296614757, 0.278826941542694,
-0.990926101696033, 0.194574672498287, 0.281916524368647,
0.152983364411985, 1.44135982835554, 0.330270447575806, -0.369627160641594
), Metric2 = c(0.987541353383459, 0.987007518796992, 0.980984962406015,
0.981646616541353, 0.984082706766917, 0.984481203007519,
0.988165413533835, 0.988375939849624, -0.109331599015822,
-0.148471161609603, 1.31331396089969, -1.34238564643737,
2.14014350779371, -0.422879539464588, -1.25706359685425,
1.09603324772565)), row.names = c(NA, -16L), class = "data.frame")
and I want to have a table like this:
Which kind of manipulation I can use? Which tool I can use. I'm looking for something flexible that can work also with more factors.

Getting percentages out of a list of dataframes in R

I am very new to R (a few months experience from online learning and reading) and have no coding experience before this.
I have been using a data set obtained from work (healthcare) for some practice. I wanted to demonstrate certain patient outcomes over time (by month) in this data set.
I've separated the data by month into a separate data frames that I have stored in a list. I then narrowed down each data frame within the list to the 3 post-operative outcomes that I want to look at. All three outcomes are binary (Y or N).
I would like to know if there is anyway I can work out the percentages of "Y" for each of these outcomes by month, and then store this in an object that I can then plot to show the trend over time (by month).
Have I approached this problem completely wrongly? Should I not have used a list at all?
I managed to get to a point where I have a list of tables of Y's and N's but am now completely clueless as to what to do from there.
list(structure(list(Mobilised_D1 = structure(c(2L, 1L, 1L, 1L,
2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L), .Label = c("N", "Y"), class =
"factor"),
Catheter_rm_D1 = structure(c(2L, 1L, 1L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 1L, 1L, 1L, 2L), .Label = c("N", "Y"), class = "factor"),
Diet_D1 = structure(c(2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("N", "Y"), class = "factor")), class =
"data.frame", row.names = 2:15),
structure(list(Mobilised_D1 = structure(c(1L, 2L, 1L, 1L,
2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("N",
"Y"), class = "factor"), Catheter_rm_D1 = structure(c(1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 1L
), .Label = c("N", "Y"), class = "factor"), Diet_D1 = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c("N", "Y"), class = "factor")), class = "data.frame",
row.names = 16:31),
structure(list(Mobilised_D1 = structure(c(2L, 1L, 1L, 2L,
1L, 1L, 1L, 2L, 1L, 1L, 2L), .Label = c("N", "Y"), class = "factor"),
Catheter_rm_D1 = structure(c(1L, 1L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 1L, 2L), .Label = c("N", "Y"), class = "factor"),
Diet_D1 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c("N", "Y"), class = "factor")), class =
"data.frame", row.names = 32:42),
structure(list(Mobilised_D1 = structure(c(2L, 1L, 1L, 1L,
1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("N",
"Y"), class = "factor"), Catheter_rm_D1 = structure(c(2L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L), .Label = c("N", "Y"), class = "factor"), Diet_D1 =
structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L), .Label = c("N", "Y"), class = "factor")), class = "data.frame",
row.names = 43:60),
structure(list(Mobilised_D1 = structure(c(1L, 1L, 1L, 2L,
2L, 1L, 1L, 1L, NA, 2L, 1L, 1L, 2L, NA), .Label = c("N",
"Y"), class = "factor"), Catheter_rm_D1 = structure(c(1L,
2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("N",
"Y"), class = "factor"), Diet_D1 = structure(c(2L, 2L, 2L,
2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("N",
"Y"), class = "factor")), class = "data.frame", row.names = 61:74),
structure(list(Mobilised_D1 = structure(c(1L, 2L, 2L, 1L,
1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L), .Label = c("N",
"Y"), class = "factor"), Catheter_rm_D1 = structure(c(1L,
1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L
), .Label = c("N", "Y"), class = "factor"), Diet_D1 = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c("N", "Y"), class = "factor")), class = "data.frame",
row.names = 75:90))
For each component of the input list, L, take the indicated mean arranging that into a multivariate time series with one row per month. Then plot it on a single panel. Remove facet=NULL if you want each series in a separate panel.
library(zoo)
library(ggplot2)
series <- zoo( t(sapply(L, function(x) colMeans(x == "Y"))) )
autoplot(series, facet = NULL) + geom_point()
(continued after graph)
Alternative
An alternative is to create a data frame DF from L along with a month vector aggregating by month as shown. This makes use of the fact that DF will have row names consisting of the month followed by a decimal point and a row number from the original component that each input row was was constructed from.
DF <- do.call("rbind", setNames(L, seq_along(L)))
month <- as.integer(rownames(DF))
series <- aggregate(zoo(DF == "Y"), month, mean)
autoplot(series, facet = NULL) + geom_point()

Indicator feature creation in R based on multiple columns

I have a dataset with 10 columns and out of them 10, 3 are of interest to create a new indicator feature. The features are "pT", "pN", & "M" and they all take different values. Off all the values that these 3 features take, there are a toal of 9 unique combinations that needs to be captures in the new variable.
PATHOT PATHON PATHOM
1 pT2 pN1 M0
4 pT1 pN1 M0
13 pT3 pN1 M0
161 pT1 *pN2 M0
391 pT1 pN1 *M1
810 *pTIS pN1 M0
948 pT3 *pN2 M0
1043 pT2 pN1 *M1
1067 *pT4 pN1 M0
For example, the new variable will have value "1" when PATHOT=pT2, PATHON=pN1 & PATHOM=M0 and so on upto value 9. I have completed the task but after spending almost 20 lines of code involving vectorised operation for all unique combinations.
diag3_bs$sfd[diag3_bs$pathot=="pT2" & diag3_bs$pathon=="pN1" &
diag3_bs$pathom=="M0"] <- 1
diag3_bs$sfd[diag3_bs$pathot=="pT1" & diag3_bs$pathon=="pN1" &
diag3_bs$pathom=="M0"] <- 2
diag3_bs$sfd[diag3_bs$pathot=="pT3" & diag3_bs$pathon=="pN1" &
diag3_bs$pathom=="M0"] <- 3... so on upto 9.
I want to ask if there is a better more automated way of getting the same result?
dput(data.frame) is given below
structure(list(F_STATUS = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "Y", class = "factor"), EVENT_ID = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "BASELINE", class =
"factor"),
PAG_NAME = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "BR2", class = "factor"), PTSIZE = c(3, 4,
2.7, 2, 0.9, 3, 3, 0.9, 3, 4.5), PTSIZE_U = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "CM", class = "factor"),
PT_SYM = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("", "-", "<", ">"), class = "factor"), PATHOT = structure(c(4L,
4L, 4L, 3L, 3L, 4L, 4L, 3L, 4L, 4L), .Label = c("*pT4", "*pTIS",
"pT1", "pT2", "pT3"), class = "factor"), PATHON = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("*pN2", "pN1"
), class = "factor"), PATHOM = structure(c(2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("*M1", "M0"), class = "factor"),
RSUBJID = 901000:901009, RUSUBJID = structure(1:10, .Label = c(
"000301-000-901-251", "000301-000-901-252", "000301-000-901-253",
"000301-000-901-254", "000301-000-901-255", "000301-000-901-256",
"000301-000-901-257", "000301-000-901-258", "000301-000-901-259",
"000301-000-901-260", "000301-000-901-261", "000301-000-901-262")
, class = "factor")), .Names = c("F_STATUS", "EVENT_ID", "PAG_NAME", "PTSIZE", "PTSIZE_U", "PT_SYM", "PATHOT",
"PATHON", "PATHOM", "RSUBJID", "RUSUBJID"), row.names = c(NA, 10L),
class = "data.frame")
Thanks.
I tried to edit the data so it didn't throw an error on input. Also created a version of that tabulation of possible combinations:
stg_tbl <- structure(list(PATHOT = structure(c(4L, 3L, 5L, 3L, 3L, 2L, 5L,
4L, 1L), .Label = c("*pT4", "*pTIS", "pT1", "pT2", "pT3"), class = "factor"),
PATHON = structure(c(2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L), .Label = c("*pN2",
"pN1"), class = "factor"), PATHOM = structure(c(2L, 2L, 2L,
2L, 1L, 2L, 2L, 1L, 2L), .Label = c("*M1", "M0"), class = "factor")), .Names = c("PATHOT",
"PATHON", "PATHOM"), class = "data.frame", row.names = c("1",
"4", "13", "161", "391", "810", "948", "1043", "1067"))
Make a vector of text-equivalents of the categories:
stg_lbls <- with(stg_tbl, paste(PATHOT, PATHON, PATHOM, sep="_") )
Then the as.numeric values of a factor created using those levels will be the desired result:
dat$stg <- with(dat, factor( paste(PATHOT, PATHON, PATHOM, sep="_"), levels=stg_lbls))
as.numeric(dat$stg)
#[1] 1 1 1 2 2 1 1 2 1 1
You can just assign those values in the usual way:
dat$sfd <- as.numeric(dat$stg)
I made some new data, that should be useful for your problem.
k<-expand.grid(data.frame(a=letters[1:3],b=letters[4:6],c=letters[7:9]))
library(dplyr)
k %>% mutate(groups=paste0(a,b,c))->k2
k2$groups<-as.numeric(factor(k2$groups))
k2
It's crude, and you're not picking which combination get's which numbers, so it'd take some digging afterwards, but it's quick.

Resources