I would like to apply the functions available in the WeightedCluster package to analyze multichannel sequences I obtained through TraMineR. I am trying to do so, but because multichannel sequences are lists with each channel stored separately, I get errors in functions like seqtreedisplay() and all those that require a sequence object.
This is an example:
fullsequences <- list(
  work_sequence2 = work_sequence[which(rownames(work_sequence) %in% commonid), ],
  educ_sequence2 = educ_sequence[which(rownames(educ_sequence) %in% commonid), ],
  part_sequence2 = part_sequence[which(rownames(part_sequence) %in% commonid), ],
  kid_sequence2 = kid_sequence[which(rownames(kid_sequence) %in% commonid), ]
) # a total of 926 with complete sequences on all channels
multidist <- seqdistmc(
  channels = fullsequences,
  method = "OM",
  norm = FALSE,
  sm = list("TRATE", "TRATE", "TRATE", "TRATE"),
  with.missing = FALSE,
  full.matrix = TRUE,
  link = "sum")
clusterward <- hclust(as.dist(multidist), method = "ward")
seqtreedisplay(as.seqtree(clusterward, ncluster = 5,
                          seqdata = fullsequences, diss = multidist))
Error in seqlegend(seqdata, fontsize = legend.fontsize, title = "Legend", :
data is not a sequence object, use seqdef function to create one
Is there a method to use the functionalities of the WeightedCluster package on a multichannel-type object (a list of sequence objects)? I am especially interested in using the Partitioning Around Medoids algorithm with initial Ward clusters (function wcKMedoids()). If it is not possible, what is the best alternative to cluster multichannel sequences in R?
Thanks a lot in advance!
The as.seqtree function (from WeightedCluster) requires an object of class stslist (as produced by the TraMineR seqdef function) as seqdata argument. In your case, fullsequences is a list of such objects (the list of parallel sequences), which is NOT itself of class stslist. This causes the error.
Even if you were able to define a tree of parallel sequences, the problem would remain that seqtreedisplay does not know how to plot parallel sequences. This means that you would have to define a plot function for a list of state sequences and, using the more general disstreedisplay function instead of seqtreedisplay, pass that plot function as the imagefunc argument.
To summarize, there are two problems. First you need some as.disstree equivalent of as.seqtree that would work for hierarchical clustering of non-stslist objects. Second, you need a plot function for parallel sequences. The first problem is purely technical and should be easily solved. The second is more conceptual.
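Regarding the PAM part of the question: wcKMedoids() works on a dissimilarity matrix rather than on an stslist, so the clustering itself can be run directly on the multichannel distances. A minimal sketch, assuming the multidist and clusterward objects computed above (check the initialclust argument against the WeightedCluster documentation of your version):
library(WeightedCluster)
# k-medoids (PAM) on the multichannel distance matrix,
# initialised from the Ward hierarchical clustering
pamclust <- wcKMedoids(multidist, k = 5, initialclust = clusterward)
# cluster membership of the sequences
table(pamclust$clustering)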
I am trying to create a new model for the parsnip package from an existing modeling function foo.
I have followed the tutorial on building new models in parsnip and the README on GitHub, but I still cannot figure out some things.
How does the fit function in parsnip know how to assign its input data (e.g. a matrix) to my idiosyncratic function call?
Imagine an idiosyncratic model function foo where the conventional roles of the x and y arguments are reversed: i.e. foo(x, y) where x should be an outcome vector and y should be a predictor matrix, bizarrely.
For example: suppose a is a matrix of predictors and b is a vector of outcomes. Then I call fit_xy(object = my_model, x = a, y = b). Internally, how does fit_xy() know to call foo(x = b, y = a)?
The function that validates the input is check_final_param, which requires that each argument be named. That is why the order is not important.
https://github.com/tidymodels/parsnip/blob/f7ba069671684f61af0ca1eadb1927fedec8a9c6/R/misc.R#L235
The README file you linked points out:
"To create the model fit call, the protect arguments are populated with the appropriate objects (usually from the data set), and rlang::call2 is used to create a call that can be executed. "
An example is randomForest, which uses ntree instead of the default trees argument.
A set of translation calls is created that will be used during evaluation.
https://github.com/tidymodels/parsnip/blob/228a6dc6975fc91562b63d191e43d2164cc78e3d/R/rand_forest_data.R#L339
If we use call2 and splice in the named args, the order does not matter, and we know the args will be properly named because of the additional translation step.
args <- list(na.rm = TRUE, trim = 0)
rlang::call2("mean", 1:10, !!!args)
# builds the call mean(1:10, na.rm = TRUE, trim = 0); the order inside `args` is irrelevant
The way we do this is through the set_fit() function. Most models are pretty sensible and we can use default mappings (for example, from the data argument to the data argument, or x to x), but you are right that some models use different conventions. An example of this is the Spark models, which use x to mean what we might normally call data with a formula method.
The random forest set_fit() function for Spark looks like this:
set_fit(
  model = "rand_forest",
  eng = "spark",
  mode = "classification",
  value = list(
    interface = "formula",
    data = c(formula = "formula", data = "x"),
    protect = c("x", "formula", "type"),
    func = c(pkg = "sparklyr", fun = "ml_random_forest"),
    defaults = list(seed = expr(sample.int(10 ^ 5, 1)))
  )
)
Notice especially the data element of the value argument. You can read a bit more here.
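For the reversed foo() from the question, the same mapping mechanism would apply. A rough sketch only (the model, engine, and package names below are made up, and the exact fields should be checked against the parsnip documentation):
set_fit(
  model = "my_model",
  eng = "foopkg",
  mode = "regression",
  value = list(
    interface = "matrix",
    # parsnip's x (predictors) is handed to foo's y argument,
    # and parsnip's y (outcome) is handed to foo's x argument
    data = c(x = "y", y = "x"),
    protect = c("x", "y"),
    func = c(pkg = "foopkg", fun = "foo"),
    defaults = list()
  )
)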
I made a model using R2jags. I like the JAGS syntax, but I find the output produced by R2jags not easy to use. I recently read about the rstanarm package. It has many useful functions and is well supported by the tidybayes and bayesplot packages for easy model diagnostics and visualisation. However, I'm not a fan of the syntax used to write a model in rstanarm. Ideally, I would like to get the best of both worlds, that is, write the model in R2jags and convert the output into a stanreg object to use the rstanarm functions.
Is that possible? If so, how?
I think the question isn't necessarily whether or not it's possible - I suspect it probably is. The question really is how much time you're prepared to spend doing it. All you'd have to do is replicate, in structure, the object created by rstanarm, to the extent that that's possible with the R2jags output. That would make it so that some post-processing tasks would probably work.
If I might be so bold, I suspect a better use of your time would be to turn the R2jags object into something that could be used with the post-processing functions you want to use. For example, it only takes a small modification to the JAGS output to make all of the mcmc_*() plotting functions from bayesplot work. Here's an example. Below is the example model from the jags() function help.
library(R2jags)
# An example model file is given in:
model.file <- system.file(package = "R2jags", "model", "schools.txt")
# data
J <- 8.0
y <- c(28.4, 7.9, -2.8, 6.8, -0.6, 0.6, 18.0, 12.2)
sd <- c(14.9, 10.2, 16.3, 11.0, 9.4, 11.4, 10.4, 17.6)
jags.data <- list("y", "sd", "J")
jags.params <- c("mu", "sigma", "theta")
jags.inits <- function(){
  list("mu" = rnorm(1), "sigma" = runif(1), "theta" = rnorm(J))
}
jagsfit <- jags(data = jags.data, inits = jags.inits, jags.params,
                n.iter = 5000, model.file = model.file, n.chains = 2)
Now, what the mcmc_*() plotting functions from bayesplot expect is a list of matrices of MCMC draws where the column names give the parameter names. By default, jags() puts all of them into a single matrix. In the example above, there are 5000 iterations in total, with 2500 as burn-in (leaving 2500 sampled), and n.thin is set to 2 (jags() has an algorithm for identifying the thinning parameter). In any case, the jagsfit$BUGSoutput$n.keep element identifies how many iterations per chain are kept; here it is 1250. So you could use that to make a list of two matrices from the output.
# split sims.matrix into one matrix per chain
jflist <- list(jagsfit$BUGSoutput$sims.matrix[1:jagsfit$BUGSoutput$n.keep, ],
               jagsfit$BUGSoutput$sims.matrix[(jagsfit$BUGSoutput$n.keep + 1):(2 * jagsfit$BUGSoutput$n.keep), ])
Now, you'd just have to call some of the plotting functions:
library(bayesplot)
mcmc_trace(jflist, regex_pars = "theta")
or
mcmc_areas(jflist, regex_pars="theta")
So, instead of trying to replicate all of the output that rstanarm produces, it might be a better use of your time to try to bend the jags output into a format that would be amenable to the post-processing functions you want to use.
EDIT - added possibility for pp_check() from bayesplot.
The posterior draws of y in this case are in the theta parameters. So we make an object that has elements y and yrep and give it class foo:
x <- list(y = y, yrep = jagsfit$BUGSoutput$sims.list$theta)
class(x) <- "foo"
We can then write a pp_check method for objects of class foo. This comes straight out of the help file for bayesplot::pp_check().
pp_check.foo <- function(object, ..., type = c("multiple", "overlaid")) {
  y <- object[["y"]]
  yrep <- object[["yrep"]]
  switch(match.arg(type),
         multiple = ppc_hist(y, yrep[1:min(8, nrow(yrep)), , drop = FALSE]),
         overlaid = ppc_dens_overlay(y, yrep[1:min(8, nrow(yrep)), , drop = FALSE]))
}
Then, just call the function:
pp_check(x, type="overlaid")
Does anybody know how to create a dendrogram for an integrated Seurat object? I can do it for a non-integrated object, but when I try:
immune.combined <- BuildClusterTree(object = immune.combined, slot = "data")
I see the error:
Error in hclust(d = data.dist) : NA/NaN/Inf in foreign function call (arg 10)
If you followed the normal Seurat workflow, at some point you will have changed the default assay to "RNA". Looking at the source for BuildClusterTree, it uses the most variable features from the chosen assay (var.features in the Seurat object under your chosen assay). For the integrated workflow, you only calculated these values for the "integrated" assay, not the RNA assay. You therefore need to do the analysis on the integrated assay. That would imply something like this:
sampleIntegrated <- BuildClusterTree(sampleIntegrated,assay="integrated")
For some reason that does not work, and the same error is produced. If you first explicitly set the default assay to integrated, however, it works:
DefaultAssay(sampleIntegrated) <- "integrated"
sampleIntegrated <- BuildClusterTree(sampleIntegrated,assay="integrated")
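As a side check (a sketch only, assuming the object names above), you can confirm that the variable features live in the integrated assay rather than the RNA assay:
# the integrated assay gets its variable features during integration;
# the RNA assay typically has none unless FindVariableFeatures() was run on it
length(VariableFeatures(sampleIntegrated[["integrated"]]))
length(VariableFeatures(sampleIntegrated[["RNA"]]))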
You can then use your visualization method of choice. For example, using the ggtree package and Tool from Seurat:
library(ggtree)
myPhyTree <- Tool(object=sampleIntegrated, slot = "BuildClusterTree")
ggtree(myPhyTree)+geom_tiplab()+theme_tree()+xlim(NA,400)
Is there a way to specify weights in relrisk.ppp function in spatstat (version 1.63-3)?
The relrisk.ppp function calls the density.ppp function, which does allow users to specify their own weights.
For example, let us build upon the provided spatstat.data::urkiola data where, instead of individual trees, the locations are tree stands and we have a second numeric mark for the frequency of trees at each point-location:
urkiola_new <- spatstat.data::urkiola
urkiola_new$marks <- data.frame("type" = urkiola_new$marks, "freq" = rpois(urkiola_new$n, 3))
f1 <- spatstat::relrisk(urkiola_new, weights = urkiola_new$marks$freq)
When urkiola_new is used in a call to relrisk, it is caught by stopifnot(is.multitype(X)) in relrisk.ppp. I next tried specifying the weights separately as a vector while using the original urkiola data,
f2 <- spatstat::relrisk(urkiola, weights = urkiola_new$marks$freq)
but was stopped by an error from the pixellate.ppp function inside the internal density.ppp call:
Error in pixellate.ppp(x, ..., padzero = TRUE) : length(weights) == npoints(x) || length(weights) == 1 is not TRUE
The same error occurs when I convert the weights into a list
urkiola_weights <- split(urkiola_new$marks$freq, urkiola_new$marks$type)
f3 <- spatstat::relrisk(urkiola, weights = urkiola_weights)
I suspect there is a way to specify the weights cleverly, but it yet escapes me. Any suggestions or guidance would be helpful, thank you!
The function relrisk.ppp is not currently designed to handle weights. The help entry for relrisk.ppp does not mention weights.
The example above does not work because relrisk.ppp applies density.ppp separately to the sub-patterns of points of each type, and the extra argument weights is the wrong length for these sub-patterns.
I will take this question as a feature request, to add this capability to relrisk.ppp. It should be done soon.
Update: this is now implemented in the development version, spatstat 1.64-0.018, available at the spatstat GitHub repository.
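Until that version reaches CRAN, a rough workaround is to smooth each type separately with matching weights and normalise the resulting intensity surfaces, which mimics what relrisk does internally. This is only a sketch, assuming the urkiola_new object defined above; the common bandwidth is estimated from the original unweighted pattern:
library(spatstat)
# split the pattern and the weights by tree type (same factor, same ordering)
sub <- split(urkiola_new, f = "type")
w <- split(urkiola_new$marks$freq, urkiola_new$marks$type)
# common smoothing bandwidth, estimated from the unweighted multitype pattern
sig <- bw.relrisk(spatstat.data::urkiola)
# weighted intensity surface for each type
birch <- density(sub$birch, sigma = sig, weights = w$birch)
oak <- density(sub$oak, sigma = sig, weights = w$oak)
# weighted analogue of the probability surface returned by relrisk()
p_birch <- eval.im(birch / (birch + oak))
plot(p_birch)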
I've been trying to use the rpart.plot package to plot a ctree from the partykit library. The reason is that the default plot method is terrible when the tree is deep. In my case, max_depth = 5.
I really enjoy rpart.plot's output as it allows deep trees to display better visually. This is how the output looks for a simple example:
rpart
library(partykit)
library(rpart)
library(rpart.plot)
df_test <- cu.summary[complete.cases(cu.summary),]
multi.class.model <- rpart(Reliability~., data = df_test)
rpart.plot(multi.class.model)
I would like to get this output from the partykit model using ctree:
ctree
multi.class.model <- ctree(Reliability~., data = df_test)
rpart.plot(multi.class.model)
Error: the object passed to prp is not an rpart object
Is there some way one could coerce the ctree object to rpart so this would run?
To the best of my knowledge all the other packages for visualizing rpart trees are really rpart-specific and not based on the agnostic party class for representing trees/recursive partitions. Also, we haven't tried to implement an as.rpart() method for party objects because the rpart class is really not well-suited for this.
But you can try to tweak the partykit visualizations, which are customizable through panel functions for almost all aspects of the tree. One thing that might be helpful is to compute a simpleparty object, which has all sorts of simple summary information in the $info of each node. This can then be used in the node_terminal() panel function for printing information in the tree display. Consider the following simple example for predicting one of three school types in the German Socio-Economic Panel. To achieve the desired depth I switch significance testing essentially off:
library("partykit")
data("GSOEP9402", package = "AER")
ct <- ctree(school ~ ., data = GSOEP9402, maxdepth = 5, alpha = 0.5)
The default plot(ct) on a sufficiently big device gives you:
When turning the tree into a simpleparty you get a textual summary by default:
st <- as.simpleparty(ct)
plot(st)
This still has overlapping labels, so we could set up a small convenience function that extracts the interesting bits from the $info of each node and puts them into a longer character vector with narrower entries:
myfun <- function(i) c(
  as.character(i$prediction),
  paste("n =", i$n),
  format(round(i$distribution/i$n, digits = 3), nsmall = 3)
)
plot(st, tp_args = list(FUN = myfun), ep_args = list(justmin = 20))
In addition to the arguments of the terminal panel function (tp_args) I have tweaked the arguments of the edge panel function (ep_args) to avoid some of the overplotting in the edges.
Of course, you could also change the entire panel function and roll your own...