Visualizing a pcAlgo object in ggdag?

There's documentation here (https://www.rdocumentation.org/packages/pcalg/versions/2.7-3/topics/pc) on how to run the PC algorithm on discrete data.
library(pcalg)
##################################################
## Using discrete data
##################################################
## Load data
data(gmD)
V <- colnames(gmD$x)
## define sufficient statistics
suffStat <- list(dm = gmD$x, nlev = c(3,2,3,4,2), adaptDF = FALSE)
## estimate CPDAG
pc.D <- pc(suffStat,
           ## independence test: G^2 statistic
           indepTest = disCItest, alpha = 0.01, labels = V, verbose = TRUE)
if (require(Rgraphviz)) {
  ## show estimated CPDAG
  par(mfrow = c(1, 2))
  plot(pc.D, main = "Estimated CPDAG")
  plot(gmD$g, main = "True DAG")
}
I'm wondering if there's any way I can use the pc.D object to visualize this in ggdag (https://cran.r-project.org/web/packages/ggdag/vignettes/intro-to-ggdag.html).
Is there a way?
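One route that may work is to translate the estimated graph into a dagitty object, which ggdag can plot directly. A sketch (untested against your pcalg version; it assumes that reciprocal 1s in the graphNEL adjacency matrix encode the CPDAG's undirected edges, and it uses dagitty's pdag graph type with -- for undirected edges):
library(dagitty)
library(ggdag)
## Adjacency matrix of the estimated graph: m[i, j] == 1 is an edge i -> j;
## 1s in both [i, j] and [j, i] are treated as an undirected CPDAG edge.
m <- as(pc.D@graph, "matrix")
nodes <- colnames(m)
edges <- character(0)
for (i in seq_along(nodes)) {
  for (j in seq_along(nodes)) {
    if (m[i, j] == 1 && m[j, i] == 0) {
      edges <- c(edges, paste(nodes[i], "->", nodes[j]))
    } else if (i < j && m[i, j] == 1 && m[j, i] == 1) {
      edges <- c(edges, paste(nodes[i], "--", nodes[j]))
    }
  }
}
g <- dagitty(paste("pdag {", paste(edges, collapse = " ; "), "}"))
ggdag(g) + theme_dag()
From there the usual ggdag layers (tidy_dagitty(), geom_dag_point(), and so on) should apply instead of the one-shot ggdag() call.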

Related

Plot Confidence Ellipses for Multiple Factor Analysis

I'm trying to plot confidence ellipses for the results of a Multiple Factor Analysis, using the function from this article (http://dx.doi.org/10.1016/j.foodqual.2012.04.010), but I can't get the individual ellipses. I am working from the example in the FactoMineR package. If anyone can help me identify the error, I would be very grateful.
library(FactoMineR)
data(wine)
res <- MFA(wine, group = c(2, 5, 3, 10, 9, 2), type = c("n", rep("s", 5)),
           ncp = 5, name.group = c("orig", "olf", "vis", "olfag", "gust", "ens"),
           num.group.sup = c(1, 6))
The function from the article:
MFAconf <- function(MFAresob, axes = c(1, 2)) {
  if (!require("FactoMineR")) install.packages("FactoMineR")
  library("FactoMineR")
  # The number of samples (n)
  n <- dim(MFAresob$ind$coord)[1]
  # The number of groups of variables (m)
  m <- dim(MFAresob$group$coord)[1]
  # New data frame with one row per sample's associated MFA group of variables
  CATnames <- vector(mode = "character", length = n * m)
  for (j in 1:n) {
    CATnames[(((j - 1) * m) + 1):(j * m)] <-
      dimnames(MFAresob$ind$coord[order(row.names(MFAresob$ind$coord)), ])[[1]][j]
  }
  PartielDim <- cbind.data.frame(names = CATnames, MFAresob$ind$coord.partiel)
  PartielDim$names <- as.factor(PartielDim$names)
  # Bootstrap the new data frame
  Boot <- simule(PartielDim, nb.simul = 500)
  # Ellipses around 95% of the bootstrapped means
  EllipCoord <- coord.ellipse(Boot$simul, level.conf = 0.95, bary = FALSE,
                              axes = axes, npoint = 100)
  # Plot the ellipses
  plot.MFA(MFAresob, choix = "ind", title = "", axes = axes,
           ellipse = EllipCoord, ellipse.par = NULL)
}
# Apply the function to the MFA results:
MFAconf(MFAresob = res)

Sampling from the posterior predictive distribution (Stan vs INLA)

I'm trying to use functions from the bayesplot package on an INLA object and am a little unsure how to draw from the posterior predictive distribution. I think I almost have it, but the rstan draws are more variable than the INLA ones.
In rstan, using the simplified example from the bayesplot vignette, I can:
library(bayesplot)
library(ggplot2)
library(rstanarm)
library(ggpubr)
library(tidyverse)
# rstan model set-up
roaches$roach100 <- roaches$roach1 / 100 # pre-treatment number of roaches (in 100s)
fit_poisson <- stan_glm(y ~ roach100 + treatment + senior,
                        offset = log(exposure2),
                        family = poisson(link = "log"),
                        data = roaches, seed = 1111, refresh = 0)
#In order to use the PPC functions from the bayesplot package we need a vector y of outcome values:
y <- roaches$y
#and a matrix yrep of draws from the posterior predictive distribution,
yrep_poisson <- posterior_predict(fit_poisson, draws = 500)
#then plot:
p1 <- bayesplot::ppc_dens_overlay(y, yrep_poisson[1:50, ])
p1
I want to replicate that plot for an INLA object. According to the bayesplot vignette you can do this: it provides code for defining a simple pp_check method for fitted model objects of some class, e.g. foo:
pp_check.foo <- function(object, type = c("multiple", "overlaid"), ...) {
  type <- match.arg(type)
  y <- object[["y"]]
  yrep <- object[["yrep"]]
  stopifnot(nrow(yrep) >= 50)
  samp <- sample(nrow(yrep), size = ifelse(type == "overlaid", 50, 5))
  yrep <- yrep[samp, ]
  if (type == "overlaid") {
    ppc_dens_overlay(y, yrep, ...)
  } else {
    ppc_hist(y, yrep, ...)
  }
}
To use pp_check.foo we can just make a list with y and yrep components and give it class foo:
x <- list(y = rnorm(200), yrep = matrix(rnorm(1e5), nrow = 500, ncol = 200))
class(x) <- "foo"
#create plot above:
pp_check(x, type = "overlaid")
INLA
# create the same model in INLA:
library(INLA)
fit_poisson_inla <- inla(y ~ roach100 + treatment + senior,
                         offset = log(exposure2), data = roaches,
                         control.predictor = list(compute = TRUE),
                         family = "poisson")
The fitted object's marginals.fitted.values component returns a posterior distribution for each fitted value:
fit_poisson_inla$marginals.fitted.values
# so, to get the distribution for the first observation:
fitted.Predictor.1 <- fit_poisson_inla$marginals.fitted.values[[1]]
I think repeatedly sampling from this would give me what I need, but there are only 75 values per observation (dim(fitted.Predictor.1)) used to represent this distribution, when in reality I want to sample from the full range of values. I think we can do this (section 4.3 here) by applying inla.tmarginal to the linear predictor:
fitted_dist <- fit_poisson_inla$marginals.linear.predictor
# should I have used "inla.rmarginal(n, marginal)"?
marginal_dist <- lapply(fitted_dist, function(y) inla.tmarginal(function(x) exp(x), y)) %>%
  map(~ as.data.frame(.) %>% rename(., xx = x))
# resample 500 times
yrep_poisson_inla <- as.matrix(bind_rows(
  rerun(500, lapply(marginal_dist, function(x) sample(x$xx, 1)) %>% as.data.frame())
))
#convert to class foo for pp_check
x <- list(y = y, yrep = yrep_poisson_inla[1:50, ])
class(x) <- "foo"
p2 <- pp_check(x, type = "overlaid")
#plot
ggarrange(p1, p2, ncol = 1, nrow = 2, labels = c("rstan", "inla sample"))
My question is: how do I correctly get a matrix of draws from the posterior predictive distribution of this INLA object (fit_poisson_inla) to pass to pp_check? yrep_poisson contains discrete values while yrep_poisson_inla contains continuous values, and there is a lot more variation in the rstan draws than in the INLA ones (second plot). Is what I have done correct and this is just a sampling issue, or is it an artifact of the different methods? In more complicated examples the differences could be substantial.
Thanks
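One way to get genuinely count-valued draws, sketched here under two assumptions: the model uses a log link, so each element of marginals.linear.predictor is the posterior marginal of the log-rate (offset included), and sampling each observation's marginal independently is acceptable (joint posterior dependence is ignored). Draw the log-rate with inla.rmarginal(), then draw counts with rpois():
## Sketch: count-valued posterior predictive draws from the INLA fit.
## Assumes a log link and uses only the marginals, so posterior
## dependence between observations and parameters is ignored.
set.seed(1111)
n_draws <- 500
yrep_inla <- sapply(fit_poisson_inla$marginals.linear.predictor, function(m) {
  eta <- INLA::inla.rmarginal(n_draws, m)  # draws of the log-rate
  rpois(n_draws, exp(eta))                 # predictive counts
})
## n_draws x n_obs matrix, analogous to posterior_predict() output
x <- list(y = y, yrep = yrep_inla[1:50, ])
class(x) <- "foo"
pp_check(x, type = "overlaid")
The rpois() step is what makes the INLA draws discrete like the rstan ones: sampling the fitted rate alone only reflects posterior uncertainty in the mean, not the Poisson observation noise, which would explain the smaller variability in the second plot.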

Do you need to tokenize your text to visualize data from an LDA topic model?

I'm currently using the textmineR package to run LDA topic models on news articles from 2016-2019.
However, I am quite new to R and I don't know how to display the results from the model.
I want to show the prevalence of the 8 topics my model finds over the period I collected data for. The data is structured in a data frame, with dates at the daily level in %y-%m-%d format.
My LDA model is made like this:
## get textmineR dtm
dtm <- CreateDtm(doc_vec = dat$fulltext, # character vector of documents
                 ngram_window = c(1, 2),
                 doc_names = dat$names,
                 stopword_vec = c(stopwords::stopwords("da"), custom_stopwords),
                 lower = TRUE,              # lowercase - this is the default value
                 remove_punctuation = TRUE, # punctuation - this is the default
                 remove_numbers = TRUE,     # numbers - this is the default
                 verbose = TRUE,
                 cpus = 4)
dtm <- dtm[, colSums(dtm) > 3]
dtm <- dtm[, str_length(colnames(dtm)) > 3]
############################################################
## RUN & EXAMINE TOPIC MODEL
############################################################
# set seed for reproducibility
set.seed(34838)
model <- FitLdaModel(dtm = dtm,
                     k = 8,
                     iterations = 500,
                     burnin = 200,
                     alpha = 0.1,
                     beta = 0.05,
                     optimize_alpha = TRUE,
                     calc_likelihood = TRUE,
                     calc_coherence = TRUE,
                     calc_r2 = TRUE,
                     cpus = 4)
# model log-likelihood
plot(model$log_likelihood, type = "l")
# topic coherence
summary(model$coherence)
hist(model$coherence,
     col = "blue",
     main = "Histogram of probabilistic coherence")
# top terms by topic
model$top_terms1 <- GetTopTerms(phi = model$phi, M = 10)
t(model$top_terms1)
# topic prevalence
model$prevalence <- colSums(model$theta) / sum(model$theta) * 100
# prevalence should be proportional to alpha
plot(model$prevalence, model$alpha, xlab = "prevalence", ylab = "alpha")
Can anyone tell me how to plot the most prevalent topics the model finds over time?
Do I need to tokenize the text or something like that?
I hope this makes sense.
Best,
Tokenization happens in the CreateDtm function. So, it doesn't sound like that's your issue.
You can get the prevalence of topics over a set of documents by taking a mean over the columns of theta, a matrix that's part of the resulting model.
I can't give you an exact answer with your data, but I can show you a similar example with the nih_sample data that ships with textmineR:
# load the NIH sample data
data(nih_sample)
# create a dtm and topic model
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID)
m <- FitLdaModel(dtm = dtm, k = 20, iterations = 100, burnin = 75)
# aggregate theta by the year of the PROJECT_END variable
end_year <- stringr::str_split(string = nih_sample$PROJECT_END, pattern = "/")
end_year <- sapply(end_year, function(x) x[length(x)])
end_year <- as.numeric(end_year)
topic_by_year <- by(data = m$theta, INDICES = end_year, FUN = function(x) {
  if (is.null(nrow(x))) {
    # a single-row group gets converted to a vector; just return it
    return(x)
  } else { # multiple rows: aggregate by taking column means
    return(colMeans(x))
  }
})
topic_by_year <- as.data.frame(do.call(rbind, topic_by_year))
# the years end up as row names; put them in a column for plotting
topic_by_year$year <- as.numeric(rownames(topic_by_year))
# plot topic 10's prevalence by year
plot(topic_by_year$year, topic_by_year$t_10, type = "l")
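If you want all topics over time on one ggplot2 figure rather than one topic in base graphics, here is a short sketch along the same lines (it assumes textmineR's default t_1, t_2, ... topic column names and the year column created above):
library(tidyr)
library(ggplot2)
## long format: one row per (year, topic) pair
topic_long <- pivot_longer(topic_by_year, cols = -year,
                           names_to = "topic", values_to = "prevalence")
ggplot(topic_long, aes(x = year, y = prevalence, colour = topic)) +
  geom_line() +
  labs(x = "year", y = "mean topic prevalence")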

How do I extract random effects from MCMCglmm?

I am looking for a command similar to ranef() in nlme, lme4, and brms that will let me extract the individual random effects from my MCMCglmm model. In my dataset, I have 40 providers, and I would like to extract the random effect for each provider and plot them in a caterpillar plot. Any suggestions would be great. Thank you.
In case it is helpful, here is my MCMCglmm model:
prior.3 <- list(R = list(R1 = list(V = diag(2), nu = 0.002)),
                G = list(G1 = list(V = diag(2), nu = 0.002),
                         G2 = list(V = diag(2), nu = 0.002)))
mc_mod2 <- MCMCglmm(outcome ~ 1, data = filter(data, rem2 == "white" | rem2 == "rem"),
                    random = ~ idh(rem2):id + us(rem2):provider,
                    rcov = ~ idh(rem2):units,
                    verbose = TRUE,
                    prior = prior.3,
                    family = "gaussian",
                    nitt = 100000, burnin = 5000,
                    pr = TRUE)
A little more detail, since the package doesn't seem to have caterpillar plots built in: note you need to use pr=TRUE when calling MCMCglmm in order to store the random effects values.
library(MCMCglmm)
data(PlodiaPO)
model1 <- MCMCglmm(PO ~ 1, random = ~ FSfamily, data = PlodiaPO, verbose = FALSE,
                   nitt = 1300, burnin = 300, thin = 1,
                   pr = TRUE)
if (!require("postMCMCglmm")) {
  devtools::install_github("JWiley/postMCMCglmm")
  library("postMCMCglmm")
}
ranef() appears to return a matrix of the random effects (rows=levels, columns=samples). Convert to a data frame with mean and quantiles:
qfun <- function(x, lev) unname(quantile(x, lev))
rsum <- as.data.frame(t(apply(ranef(model1), 1,
                              function(x) c(est = mean(x),
                                            min = qfun(x, 0.025),
                                            max = qfun(x, 0.975)))))
Order for plotting:
rsum$term <- reorder(factor(rownames(rsum)), rsum$est)
Plot:
library(ggplot2)
ggplot(rsum, aes(term, est)) +
  geom_pointrange(aes(ymin = min, ymax = max)) +
  coord_flip()
I overlooked an additional package that needed to be installed (thanks for pointing this out, Ben).
To be able to run ranef(), simply install the postMCMCglmm package - https://github.com/jwiley/postMCMCglmm/
#install.packages("devtools")
require(devtools)
install_github("JWiley/postMCMCglmm")
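Alternatively, if you'd rather not install an extra package: with pr = TRUE, the random-effect samples are stored alongside the fixed effects in the model's $Sol mcmc object, so you can summarise them directly. A sketch (the "provider" pattern below is an assumption about MCMCglmm's column naming for your model; inspect colnames(mc_mod2$Sol) to find the right pattern):
## Sketch: summarise provider random effects straight from $Sol.
## With pr = TRUE, $Sol holds fixed effects first, then random effects;
## the grep pattern is an assumption; check colnames(mc_mod2$Sol).
sol <- mc_mod2$Sol
re_cols <- grep("provider", colnames(sol), value = TRUE)
re_sum <- data.frame(
  term = re_cols,
  est  = colMeans(sol[, re_cols, drop = FALSE]),
  min  = apply(sol[, re_cols, drop = FALSE], 2, quantile, probs = 0.025),
  max  = apply(sol[, re_cols, drop = FALSE], 2, quantile, probs = 0.975)
)
re_sum then plugs into the same reorder/ggplot caterpillar-plot code as rsum above.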

Maximum Likelihood Estimation three-parameter Weibull for right censored data

I am trying to estimate the parameters of the three-parameter Weibull distribution with ML for right-censored data.
I've worked it out using the flexsurv package, where I've defined my own density function.
I've also followed the instructions in the documentation of flexsurv::flexsurvreg to build the list with all the information required to do MLE with a custom density function.
Below you can see what I've done so far.
library(FAdist)
library(flexsurv)
set.seed(1)
thres <- 3500
data <- rweibull(n = 1000, shape = 2.2, scale = 25000) + thres
y <- sample(c(0, 1), size = 1000, replace = TRUE)
df1 <- data.frame(x = data, status = y)
dweib3 <- function(x, shape, scale, thres, log = FALSE) {
  dweibull(x - thres, shape, scale, log = log)
}
pweib3 <- function(q, shape, scale, thres, log_p = FALSE) {
  pweibull(q - thres, shape, scale, log.p = log_p)
}
# Not required
# qweib3 <- function(p, shape, scale, thres, log.p = FALSE) {
#   if (log.p == TRUE) {
#     p <- exp(p)
#   }
#   qwei3 <- thres + qweibull(p, shape, scale)
#   return(qwei3)
# }
dweib3 <- Vectorize(dweib3)
pweib3 <- Vectorize(pweib3)
custom.weibull <- list(name = "weib3",
                       pars = c("shape", "scale", "thres"), location = "scale",
                       transforms = c(log, log, log),
                       inv.transforms = c(exp, exp, exp),
                       inits = function(t) {
                         shape0 <- 1.2 / sqrt(var(log(t)))
                         c(shape0,
                           exp(mean(log(t)) + .572 / shape0),
                           .5 * min(t))
                       })
ml <- flexsurvreg(Surv(df1$x, df1$status) ~ 1, data = df1, dist = custom.weibull)
The variable y represents the status of a unit, where 1 is a failure and 0 is a unit still unfailed at censoring.
The initial values for shape and scale are taken from the method of moments, as also defined in the fitdistrplus package.
The threshold parameter needs a constraint, since the threshold must be strictly smaller than the minimum of the data; a constraint such as thres <= 0.99 * min(t) would be satisfactory (this is something I haven't implemented yet).
The output of the above MLE is the following:
> ml
Call:
flexsurvreg(formula = Surv(df1$x, df1$status) ~ 1, data = df1,
dist = custom.weibull)
Estimates:
        est       L95%      U95%      se
shape   2.37e+00  2.12e+00  2.65e+00  1.33e-01
scale   3.52e+04  3.32e+04  3.74e+04  1.08e+03
thres   2.75e+03  1.51e+03  5.02e+03  8.44e+02
N = 1000, Events: 481, Censored: 519
Total time at risk: 25558684
Log-likelihood = -5462.027, df = 3
AIC = 10930.05
The estimated parameters aren't good, even allowing for the censoring.
I've run this procedure a few times with other randomly generated data; the estimates are always pretty far from the truth.
So I need either an improvement to my code or another way to estimate the parameters of a three-parameter Weibull with MLE.
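As for the missing constraint, one possibility (a sketch, assuming flexsurv accepts closures in transforms/inv.transforms just as it does plain functions; not verified against flexsurv internals) is to reparameterize the threshold so the optimizer works on log(tmin - thres), which keeps thres below tmin by construction:
## Sketch: enforce thres < 0.99 * min(t) via a reparameterization.
## tmin is captured in the closures; flexsurvreg optimizes each
## parameter on the transformed (unconstrained) scale.
tmin <- 0.99 * min(df1$x)
custom.weibull$transforms <- list(log, log, function(x) log(tmin - x))
custom.weibull$inv.transforms <- list(exp, exp, function(x) tmin - exp(x))
ml2 <- flexsurvreg(Surv(x, status) ~ 1, data = df1, dist = custom.weibull)
The existing inits function already returns 0.5 * min(t) for the threshold, which satisfies the constraint, so the starting point stays valid.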
