Receiving an error when running ggsurvplot - r

I am trying to run a survival analysis and then create a Kaplan-Meier curve using the ggsurvplot function. However, when I run the code, I get the following error:
`Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 4, 0, 8...`
Does anyone know where I may be going wrong? Thank you!!!
`MRE_time <- as.numeric(c(10, 20, 15, 30))
MRE_status <- as.factor(c(1, 0, 1, 0))
MRE <- data.frame(MRE_time, MRE_status)
sfit1 <- survfit(Surv(MRE_time, MRE_status)~1, data = MRE)
ggsurvplot(sfit1, data = MRE)`

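MRE_status should be numeric (0/1), not a factor. Surv() treats a factor status as multi-state data, with the first level taken as censoring, so the resulting fit is not the ordinary survival curve that ggsurvplot expects.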
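library(survival)   # Surv(), survfit()
library(survminer)  # ggsurvplot()
MRE_time <- as.numeric(c(10, 20, 15, 30))
# Status coded as numeric: 1 = event, 0 = censored
MRE_status <- c(1, 0, 1, 0)
MRE <- data.frame(MRE_time, MRE_status)
sfit1 <- survfit(Surv(MRE_time, MRE_status)~1, data = MRE)
ggsurvplot(sfit1, data = MRE)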
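If the status column already exists as a factor (for example after importing data), it needs converting with some care. The base R sketch below uses made-up variable names (status_factor, status_numeric) and shows why calling as.numeric() directly on the factor is not enough: it returns the internal level codes instead of the 0/1 labels.

```r
# Status stored as a factor, as in the question
status_factor <- as.factor(c(1, 0, 1, 0))

# as.numeric() on a factor returns the internal level codes, not the labels
as.numeric(status_factor)
# → 2 1 2 1

# Going through character recovers the original 0/1 values
status_numeric <- as.numeric(as.character(status_factor))
status_numeric
# → 1 0 1 0
```

The converted vector can then be passed to Surv() as the event indicator.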


Using LmFuncs (Linear Regression) in Caret for Recursive Feature Elimination: How do I fix "same number of samples in x and y" error?

I'm new to R and trying to isolate the best performing features from a data set of 247 columns (246 variables + 1 outcome), and 800 or so rows (where each row is one person's data) to create a predictive model.
I'm using caret to do RFE using lmFuncs. I need to use linear regression since the target variable is continuous.
I use the following to split into test/training data (which hasn't evoked errors):
inTrain <- createDataPartition(data$targetVar, p = .8, list = F)
train <- data[inTrain, ]
test <- data[-inTrain, ]
The resulting test and train sets look consistent: within each set, X and Y contain the same number of samples and all columns are the same length.
My control parameters are as follows (this also runs without error):
control = rfeControl(functions = lmFuncs, method = "repeatedcv", repeats = 5, verbose = F, returnResamp = "all")
But when I run RFE I get an error message saying
Error in rfe.default(train[, -1], train[, 1], sizes = c(10, 15, 20, 25, 30), rfeControl = control) :
there should be the same number of samples in x and y
My code for RFE is as follows, with the target variable in the first column:
rfe_lm_profile <- rfe(train[, -1], train[, 1], sizes = c(10, 15, 20, 25, 30), rfeControl = control)
I've looked through various forums, but nothing seems to work.
This Google Groups thread suggests using an older version of caret, which I tried, but I got the same X/Y error: https://groups.google.com/g/rregrs/c/qwcP0VGn4ag?pli=1
Others suggest converting the target variable to a factor or matrix. This hasn't helped, and evokes
Warning message:
In createDataPartition(data$EBI_SUM, p = 0.8, list = F) :
Some classes have a single record
when partitioning the data into test/train, and the same X/Y sample error if you try to carry out RFE.
Mega thanks in advance :)
P.S.
Here's the dput for the target variable (EBI_SUM) and a couple of the other variables:
data <- structure(list(TargetVar = c(243, 243, 243, 243, 355, 355), Dosing = c(2,
2, 2, 2, 2, 2), `QIDS_1 ` = c(1, 1, 3, 1, 1, 1), `QIDS_2 ` = c(3,
3, 2, 3, 3, 3), `QIDS_3 ` = c(1, 2, 1, 1, 1, 2)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
Your column names should not contain spaces (the dput shows a trailing space in names such as `QIDS_1 `):
library(caret)
data <- data.frame(
TargetVar = c(243, 243, 243, 243, 355, 355),
Dosing = c(2, 2, 2, 2, 2, 2),
QIDS_1 = c(1, 1, 3, 1, 1, 1),
QIDS_2 = c(3, 3, 2, 3, 3, 3),
QIDS_3 = c(1, 2, 1, 1, 1, 2)
)
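# Split 80/20 on the outcome, then run RFE with the linear-model helper functions
inTrain <- createDataPartition(data$TargetVar, p = .8, list = FALSE)
train <- data[inTrain, ]
test <- data[-inTrain, ]
control <- rfeControl(functions = lmFuncs, method = "repeatedcv", repeats = 5, verbose = FALSE, returnResamp = "all")
# First column is the outcome; the remaining columns are the candidate predictors
rfe_lm_profile <- rfe(train[, -1], train[, 1], sizes = c(10, 15, 20, 25, 30), rfeControl = control)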
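One more thing that may be worth checking, offered as an assumption rather than something confirmed in this thread: the dput output has class tbl_df, and a tibble keeps train[, 1] as a one-column table instead of dropping it to a vector. As far as I know, rfe() compares nrow(x) with length(y), and the length of a one-column table is its number of columns, i.e. 1. The base R sketch below imitates that behaviour with drop = FALSE (the names y_table and y_vector are made up for illustration).

```r
# Small plain data frame with the same layout as the example data
train <- data.frame(
  TargetVar = c(243, 243, 243, 243, 355, 355),
  Dosing    = c(2, 2, 2, 2, 2, 2),
  QIDS_1    = c(1, 1, 3, 1, 1, 1)
)

# drop = FALSE imitates what a tibble does with train[, 1]:
# the result stays a one-column table
y_table <- train[, 1, drop = FALSE]

# [[ ]] always extracts the column as a plain vector
y_vector <- train[[1]]

length(y_table)    # → 1 (number of columns)
length(y_vector)   # → 6 (number of samples)
nrow(train[, -1])  # → 6
```

If this is the cause, converting with as.data.frame(data) before splitting, or passing train[[1]] as y, should make the sample counts agree.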

R-hat against iterations RStan

I am trying to generate a plot similar to the one below to show the change in R-hat over iterations:
I have tried the following options:
summary(fit1)$summary: gives R-hat with all chains merged
summary(fit1)$c_summary: gives R-hat for each chain individually
Can you please help me get R-hat at each iteration for a given parameter?
rstan provides the Rhat() function, which takes a matrix of iterations x chains and returns R-hat. We can extract this matrix from the fitted model and apply Rhat() cumulatively over it. The code below uses the 8 schools model as an example (copied from the getting started guide).
library(tidyverse)  # loads purrr (map_dfr) and ggplot2
library(rstan)
theme_set(theme_bw())
# Fit the 8 schools model.
schools_dat <- list(J = 8,
y = c(28, 8, -3, 7, -1, 1, 18, 12),
sigma = c(15, 10, 16, 11, 9, 11, 10, 18))
fit <- stan(file = 'schools.stan', data = schools_dat)
# Extract draws for mu as a matrix; columns are chains and rows are iterations.
mu_draws <- as.array(fit)[,,"mu"]
# Get the cumulative R-hat as of each iteration.
# drop = FALSE keeps the iterations x chains shape even when i = 1.
mu_rhat <- map_dfr(
1:nrow(mu_draws),
function(i) {
data.frame(iteration = i,
rhat = Rhat(mu_draws[1:i, , drop = FALSE]))
}
)
# Plot iteration against R-hat.
mu_rhat %>%
ggplot(aes(x = iteration, y = rhat)) +
geom_line() +
labs(x = "Iteration", y = expression(hat(R)))
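To make the "cumulative" idea concrete without needing a fitted Stan model, here is a minimal base R sketch. It is only an illustration under stated assumptions: basic_rhat is a hypothetical helper implementing the classic (non-split) Gelman-Rubin statistic, which is not the same as the rank-normalized split R-hat that rstan's Rhat() computes, and the draws are simulated independent normals standing in for real chains.

```r
# Classic (non-split) Gelman-Rubin R-hat for an iterations x chains matrix.
# Hypothetical helper for illustration; rstan::Rhat() uses a refined version.
basic_rhat <- function(draws) {
  n <- nrow(draws)
  W <- mean(apply(draws, 2, var))   # mean within-chain variance
  B <- n * var(colMeans(draws))     # between-chain variance
  var_hat <- (n - 1) / n * W + B / n
  sqrt(var_hat / W)
}

# Simulated stand-in for posterior draws: 1000 iterations, 4 chains
set.seed(1)
draws <- matrix(rnorm(4000), nrow = 1000, ncol = 4)

# Evaluate R-hat on the first i iterations, for i = 10, 20, ..., 1000
iters <- seq(10, nrow(draws), by = 10)
cum_rhat <- vapply(
  iters,
  function(i) basic_rhat(draws[1:i, , drop = FALSE]),
  numeric(1)
)

# Base graphics version of the iteration vs. R-hat plot
plot(iters, cum_rhat, type = "l", xlab = "Iteration", ylab = "R-hat")
```

Starting at iteration 10 rather than 1 avoids the undefined variances that very short chains produce.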

How do I graph a Bayesian Network with instantiated nodes using bnlearn and graphviz?

I am trying to graph a Bayesian Network (BN) with instantiated nodes using the libraries bnlearn and Rgraphviz. My workflow is as follows:
After creating a data frame with random data (the data I am actually using is obviously not random) I then discretise the data, structure learn the directed acyclic graph (DAG), fit the data to the DAG and then plot the DAG. I also plot a DAG which shows the posterior probabilities of each of the nodes.
#rm(list = ls())
library(bnlearn)
library(Rgraphviz)
# Generating random dataframe
data_clean <- data.frame(a = runif(min = 0, max = 100, n = 1000),
b = runif(min = 0, max = 100, n = 1000),
c = runif(min = 0, max = 100, n = 1000),
d = runif(min = 0, max = 100, n = 1000),
e = runif(min = 0, max = 100, n = 1000))
# Discretising the data into 3 bins
bins <- 3
data_discrete <- discretize(data_clean, breaks = bins)
# Creating factors for each bin in the data
lv <- c("low", "med", "high")
for (i in names(data_discrete)){
levels(data_discrete[, i]) = lv
}
# Structure learning the DAG from the training set
whitelist <- matrix(c("a", "b",
"b", "c",
"c", "e",
"a", "d",
"d", "e"),
ncol = 2, byrow = TRUE, dimnames = list(NULL, c("from", "to")))
bn.hc <- hc(data_discrete, whitelist = whitelist)
# Plotting the DAG
dag.hc <- graphviz.plot(bn.hc,
layout = "dot")
# Fitting the data to the structure
fitted <- bn.fit(bn.hc, data = data_discrete, method = "bayes")
# Plotting the DAG with posteriors
graphviz.chart(fitted, type = "barprob", layout = "dot")
The next thing I do is to manually change the distributions in the bn.fit object, assigned to fitted, and then plot a DAG that shows the instantiated nodes and the updated posterior probability of the response variable e.
# Manually instantiating
fitted_evidence <- fitted
cpt.a = matrix(c(1, 0, 0), ncol = 3, dimnames = list(NULL, lv))
cpt.c = c(1, 0, 0,
0, 1, 0,
0, 0, 1)
dim(cpt.c) <- c(3, 3)
dimnames(cpt.c) <- list("c" = lv, "b" = lv)
cpt.b = c(1, 0, 0,
0, 1, 0,
0, 0, 1)
dim(cpt.b) <- c(3, 3)
dimnames(cpt.b) <- list("b" = lv, "a" = lv)
cpt.d = c(0, 0, 1,
0, 1, 0,
1, 0, 0)
dim(cpt.d) <- c(3, 3)
dimnames(cpt.d) <- list("d" = lv, "a" = lv)
fitted_evidence$a <- cpt.a
fitted_evidence$b <- cpt.b
fitted_evidence$c <- cpt.c
fitted_evidence$d <- cpt.d
# Plotting the DAG with instantiation and posterior for response
graphviz.chart(fitted_evidence, type = "barprob", layout = "dot")
This is the result I get but my actual BN is much larger with many more arcs and it would be impractical to manually change the bn.fit object.
I would like to find out if there is a way to plot a DAG with instantiation without changing the bn.fit object manually? Is there a workaround or function that I am missing?
I think/hope I have read the documentation for bnlearn thoroughly. I appreciate any feedback and would be happy to change anything in the question if I have not conveyed my thoughts clearly enough.
Thank you.
How about using cpdist to draw samples from the posterior given the evidence? You can then re-estimate the parameters by calling bn.fit on the cpdist samples, and plot as before.
An example:
set.seed(69184390) # for sampling
# Your evidence vector
ev <- list(a = "low", b="low", c="low", d="high")
# draw samples
updated_dat <- cpdist(fitted, nodes=bnlearn::nodes(fitted), evidence=ev, method="lw", n=1e6)
# refit : you'll get warnings over missing levels
updated_fit <- bn.fit(bn.hc, data = updated_dat)
# plot
par(mar=rep(0,4))
graphviz.chart(updated_fit, type = "barprob", layout = "dot")
Note that I used bnlearn::nodes because nodes is masked by a dependency of Rgraphviz. I tend to load bnlearn last.

The predict()-function is returning unexpected output

Problem
I have a linear regression model fitted to a dataset (i.e. logAnalysis <- lm(log(wage) ~ female+exper+school)). Everything works fine and looks as expected.
I now have a matrix of new data:
students <- matrix(c(
0, 3, 10,
1, 17, 12,
1, 8, 9,
0, 20, 10,
0, 34, 9,
0, 2, 13
), ncol = 3, byrow = TRUE)
The first column is the female/male indicator, the second is work experience, and the third is school education. I now want to predict the expected wages. This is how I thought it would go:
predictionData <- data.frame(female=students[,1], exper=students[,2], school=students[,3])
predictedIncome <- predict(logAnalysis, newData = predictionData)
But as it turns out, predictedIncome is not a vector of 6 (i.e. 6 predictions, one for each student) but a vector of length 3296. I cannot make sense of that. Maybe I misunderstood the whole function, but I wouldn't know what else it does.
Thank you for your help
Regards
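There was just a typo: the argument is newdata (all lowercase), so use newdata = predictionData instead of newData = predictionData. The misspelled argument is silently ignored, and predict() returns the fitted values for the original data, hence the 3296 values.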
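The effect is easy to reproduce on a small scale. The sketch below uses simulated data (the coefficients and sample size are made up for illustration, and the names wages, wrong and right are hypothetical) to show that a misspelled newData is swallowed without any warning.

```r
# Simulated stand-in for the wage data (100 rows; coefficients are invented)
set.seed(42)
n <- 100
wages <- data.frame(
  female = rbinom(n, 1, 0.5),
  exper  = sample(0:40, n, replace = TRUE),
  school = sample(8:16, n, replace = TRUE)
)
wages$wage <- exp(1 + 0.05 * wages$school + 0.01 * wages$exper -
                    0.1 * wages$female + rnorm(n, sd = 0.2))

fit <- lm(log(wage) ~ female + exper + school, data = wages)

# The six students from the question
students <- data.frame(
  female = c(0, 1, 1, 0, 0, 0),
  exper  = c(3, 17, 8, 20, 34, 2),
  school = c(10, 12, 9, 10, 9, 13)
)

# Misspelled argument: ignored, so fitted values for the training data come back
wrong <- predict(fit, newData = students)
length(wrong)   # → 100

# Correct argument name: one prediction per student
right <- predict(fit, newdata = students)
length(right)   # → 6

# The model is on the log scale, so exponentiate to get back to wages
predicted_wage <- exp(right)
```

Because the response is log(wage), exp() of the prediction is only a rough back-transformation; it ignores the retransformation bias, which may or may not matter for the use case.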

R: apply the pclm function

I am having trouble applying the Penalized Composite Link Model (PCLM) function, which only works with vectors. I use the pclm function to generate single years of age (syoa) population data from 5-year age group population data.
pclm() can be installed by following the instructions given by the author on https://github.com/mpascariu/ungroup.
Usage of the function:
pclm(x, y, nlast, control = list())
- x: vector of the cumulative sum points of the sequence in y.
- y: vector of values to be ungrouped.
- nlast: length of the last interval.
- control: list with additional parameters.
Here's my training dataset:
data<-data.frame(
GEOID= c(1,2),
name= c("A","B"),
"Under 5 years"= c(17,20),
"5-9 years"= c(82,90),
"10-14 years"= c(18, 22),
"15-19 years"= c(90,88),
"20-24 years"= c(98, 100),
check.names=FALSE)
#generating a data.frame storing the fitted values from the pclm for the first row: GEOID=1.
#using the values directly
syoa <- data.frame(fitted(pclm(x=c(0, 5, 10, 15, 20), y=c(17,82,18,90,98), nlast=5, control = list(lambda = .1, deg = 3, kr = 1))))
#or referring to the vector by its rows and columns
syoa <- data.frame(fitted(pclm(x=c(0, 5, 10, 15, 20), y=c(data[1,3:7]), nlast=5, control = list(lambda = .1, deg = 3, kr = 1))))
As my data have many observations, I'd like to apply the pclm() function across all the rows for columns 3-7: data[,3:7].
apply(data[3:7], 1, pclm(x=c(0, 5, 10, 15, 20), y=c(data[,3:7]), nlast=5, control = list(lambda = .1, deg = 3, kr = 1)))
But it's not working and gives the following error message:
Error in eval(substitute(expr), data, enclos = parent.frame()) :
(list) object cannot be coerced to type 'double'
I don't know whether the issue is related to apply() or to the pclm() function. Can anyone help? Thanks.
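It's easier than I thought: apply() needs a function as its third argument, not the result of a pclm() call. Each row is then passed to that function as x.
library(ungroup)
# lambda = NA (instead of .1 as in the question) leaves the choice of smoothing parameter to pclm
syoa_all <- data.frame(apply(data[3:7], 1, function(x){
fit <- pclm(x=c(0, 5, 10, 15, 20), y=c(x), nlast=5, control = list(lambda = NA, deg = 3, kr = 1))
round(fitted(fit))
}))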
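The shape of what apply() returns here can be confusing, so here is a sketch that runs without the ungroup package. It replaces pclm() with a deliberately crude, hypothetical stand-in (split_evenly, which just spreads each 5-year count evenly over five single years) purely to show the row-wise pattern and the orientation of the result.

```r
data <- data.frame(
  GEOID = c(1, 2),
  name = c("A", "B"),
  "Under 5 years" = c(17, 20),
  "5-9 years" = c(82, 90),
  "10-14 years" = c(18, 22),
  "15-19 years" = c(90, 88),
  "20-24 years" = c(98, 100),
  check.names = FALSE
)

# Hypothetical stand-in for pclm(): spread each group's count evenly
# over `width` single years. Not a smoothing model, only a placeholder.
split_evenly <- function(y, width = 5) {
  rep(as.numeric(y) / width, each = width)
}

# apply() hands each row of columns 3-7 to the function as a numeric vector
syoa <- apply(data[3:7], 1, function(row) split_evenly(row))

dim(syoa)
# → 25 2   (25 single years of age in rows, one column per GEOID)

# Transpose to get one row per GEOID again
syoa_by_area <- t(syoa)
```

Note that apply() with MARGIN = 1 puts each row's result in a column, which is why a transpose is needed to line the output up with the original rows.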
