R: apply the pclm function - r

I'm having trouble applying the Penalized Composite Link Model (PCLM) function, which only works with vectors. I use the pclm function to generate single-year-of-age (syoa) population data from 5-year age group population data.
pclm() can be installed by following the instructions given by the author on https://github.com/mpascariu/ungroup.
Usage of the function:
pclm(x, y, nlast,control = list())
-x: vector of the cumulative sum points of the sequence in y.
-y: vector of values to be ungrouped.
-nlast: Length of the last interval.
-control: List with additional parameters.
Here's my training dataset:
data<-data.frame(
GEOID= c(1,2),
name= c("A","B"),
"Under 5 years"= c(17,20),
"5-9 years"= c(82,90),
"10-14 years"= c(18, 22),
"15-19 years"= c(90,88),
"20-24 years"= c(98, 100),
check.names=FALSE)
#generating a data.frame storing the fitted values from the pclm for the first row: GEOID=1.
#using the values directly
syoa <- data.frame(fitted(pclm(x=c(0, 5, 10, 15, 20), y=c(17,82,18,90,98), nlast=5, control = list(lambda = .1, deg = 3, kr = 1))))
#or referring to the vector by its rows and columns
syoa <- data.frame(fitted(pclm(x=c(0, 5, 10, 15, 20), y=c(data[1,3:7]), nlast=5, control = list(lambda = .1, deg = 3, kr = 1))))
As my data have many observations, I'd like to apply the pclm() function across all the rows for columns 3-7: data[,3:7].
apply(data[3:7], 1, pclm(x=c(0, 5, 10, 15, 20), y=c(data[,3:7]), nlast=5, control = list(lambda = .1, deg = 3, kr = 1)))
but it's not working and gives the following error message:
Error in eval(substitute(expr), data, enclos = parent.frame()) :
(list) object cannot be coerced to type 'double'
I don't know whether the issue is related to apply() or to the pclm() function. Can anyone help? Thanks.

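It's easier than I thought: pass apply() an anonymous function, which it then calls once per row.
library(ungroup)
# Fit pclm() to each row's 5-year totals and keep the rounded fitted values.
# The result is stored as syoa so that it does not mask the pclm() function.
syoa <- data.frame(apply(data[3:7], 1, function(y) {
  fit <- pclm(x = c(0, 5, 10, 15, 20), y = y, nlast = 5, control = list(lambda = NA, deg = 3, kr = 1))
  round(fitted(fit))
}))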
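The original call appears to have failed because apply() was given the result of calling pclm() on the whole data frame (where y is a list) instead of a function. The sketch below shows the mechanics with base R only. split_evenly() is a hypothetical stand-in for pclm(), introduced here so the example runs without the ungroup package; it is not part of that package.

```r
# Toy data with the same layout as in the question.
data <- data.frame(
  GEOID = c(1, 2),
  name = c("A", "B"),
  "Under 5 years" = c(17, 20),
  "5-9 years" = c(82, 90),
  "10-14 years" = c(18, 22),
  "15-19 years" = c(90, 88),
  "20-24 years" = c(98, 100),
  check.names = FALSE
)

# Hypothetical stand-in for pclm(): spreads each group total evenly over 5 ages.
split_evenly <- function(y, width = 5) rep(y / width, each = width)

# apply() takes a function as its third argument and calls it once per row;
# each call receives that row as a numeric vector.
syoa <- apply(data[3:7], 1, function(y) split_evenly(y))

dim(syoa)  # 25 single-year ages (rows) by 2 GEOIDs (columns)
```

Each column of the result corresponds to one input row, which is why the answer wraps the output in data.frame() to get one column per GEOID.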

Multiply probability distributions in R [closed]

I'm trying to multiply probability density functions in order to update a probability given certain factors. I've tried several things with the pdqr and bayesmeta packages, but none of them gives the result I intend. What am I missing?
Here is a reproducible example with two distributions, a and b, that I want to multiply. Note that b has no measurements at the low values, so its probability there is 0. This should be reflected in the updated distribution.
library(tidyverse)
library(pdqr)
library(bayesmeta)
#measurements
a <- c(1, 2, 2, 4, 5, 5, 6, 6, 7, 7, 7, 8, 7, 8, 2, 6, 9, 10)
b <- c(5, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 9, 9, 9, 7)
#create probability distribution functions
distr_a <- new_d(a, type = "continuous")
distr_b <- new_d(b, type = "continuous")
#try to combine distributions
summarized <- distr_a + distr_b
multiplied <- distr_a * distr_b
mixture <- form_mix(list(distr_a, distr_b))
convolution <- convolve(distr_a, distr_b)
The resulting PDFs are plotted like this:
bayesmeta::convolve() does the same as summing two pdqr PDFs, and both seem to oddly shift the distribution to the right and make it lower than I expected.
Multiplying the pdqr PDFs directly leaves a very low probability overall.
pdqr::form_mix() seems to average the two PDFs, but leaves probabilities above 0 for the lower x-values.
So, to gain some insight into what I wanted to do, I used the PDFs of a and b to compute the density at each x value and multiplied those:
#multiply distributions manually
x <- c(1:10)
manual <- data.frame(x) %>%
mutate(a = distr_a(x),
b = distr_b(x),
multiplied = a*b)
This does give the shape I am after, but (logically) the probabilities are too low:
I would like to multiply (multiple) PDFs. What am I doing wrong? Are my statistics wrong, or am I missing a useful function?
UPDATE:
It seems I am a stats noob on this subject, but I would like to achieve something like the distribution below. Given that both situations a and b are true, I would expect the distribution to be something like the dotted line. Is that possible?
multiplied is the correct one. One can check this with log-normal distributions: the product of two independent log-normal random variables is log-normal with µ = µ_a + µ_b and sigma² = sigma²_a + sigma²_b.
a <- rlnorm(25000, meanlog = 0, sdlog = 1)
b <- rlnorm(25000, meanlog = 1, sdlog = 1)
distr_a <- new_d(a, type = "continuous")
distr_b <- new_d(b, type = "continuous")
distr_ab <- form_trans(
list(distr_a, distr_b), trans = function(x, y) x*y
)
# or: distr_ab <- distr_a * distr_b
plot(distr_ab, xlim = c(0, 40))
curve(dlnorm(x, meanlog = 1, sdlog = sqrt(2)), add = TRUE, col = "red")
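The log-normal check in this answer multiplies the two variables, and it works because log(a * b) = log(a) + log(b). As a rough numerical cross-check using base R only, the sketch below compares the simulated log of the product with the parameters passed to dlnorm() above. The seed is an arbitrary choice and not part of the original answer.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
a <- rlnorm(25000, meanlog = 0, sdlog = 1)
b <- rlnorm(25000, meanlog = 1, sdlog = 1)

# log(a * b) = log(a) + log(b), so it should be roughly normal with
# mean 0 + 1 = 1 and standard deviation sqrt(1^2 + 1^2) = sqrt(2).
log_prod <- log(a * b)
c(mean = mean(log_prod), sd = sd(log_prod), target_sd = sqrt(2))
```

Note that this describes the distribution of the product of two random variables, which is a different operation from multiplying two density curves pointwise.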
As demonstrated here:
https://www.r-bloggers.com/2019/05/bayesian-models-in-r-2/
# Example distributions
probs <- seq(0,1,length.out= 100)
prior <- dbinom(x = 8, prob = probs, size = 10)
lik <- dnorm(x = probs, mean = .5, sd = .1)
# Multiply distributions
unstdPost <- lik * prior
# If you wanted to get an actual posterior, it must be a probability
# distribution (integrate to 1), so we can divide by the sum:
stdPost <- unstdPost / sum(unstdPost)
# Plot
plot(probs, prior, col = "black",
  type = "l", xlab = "P(Black)", ylab = "Density")
lines(probs, lik / 15, col = "red") # rescaled to fit on the same axis
lines(probs, unstdPost, col = "green")
lines(probs, stdPost, col = "blue")
# Legend labels are listed in the same order as the line colours.
legend("topleft", legend = c("Prior", "Lik", "Unstd Post", "Post"),
  text.col = c("black", "red", "green", "blue"), bty = "n")
Created on 2022-08-06 by the reprex package (v2.0.1)
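To tie this grid approach back to the measurements a and b from the question, here is a minimal sketch that uses base R's density() in place of pdqr. That substitution is an assumption made to keep the example dependency-free, and the grid limits (0 to 12) and n = 512 are arbitrary. It multiplies the two kernel density estimates pointwise and rescales the result so that it integrates to 1.

```r
a <- c(1, 2, 2, 4, 5, 5, 6, 6, 7, 7, 7, 8, 7, 8, 2, 6, 9, 10)
b <- c(5, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 9, 9, 9, 7)

# Evaluate both kernel density estimates on the same grid.
dens_a <- density(a, from = 0, to = 12, n = 512)
dens_b <- density(b, from = 0, to = 12, n = 512)
grid <- dens_a$x
dx <- grid[2] - grid[1]

# Pointwise product, then divide by its integral (a Riemann sum) to normalise.
unnorm <- dens_a$y * dens_b$y
combined <- unnorm / sum(unnorm * dx)

# The rescaled product integrates to approximately 1 over the grid.
sum(combined * dx)
```

Calling plot(grid, combined, type = "l") draws the result. Because b has essentially no density at low x, the combined curve is close to zero there as well.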

Using LmFuncs (Linear Regression) in Caret for Recursive Feature Elimination: How do I fix "same number of samples in x and y" error?

I'm new to R and trying to isolate the best performing features from a data set of 247 columns (246 variables + 1 outcome), and 800 or so rows (where each row is one person's data) to create a predictive model.
I'm using caret to do RFE with lmFuncs; I need linear regression because the target variable is continuous.
I use the following to split into test/training data (which hasn't evoked errors)
inTrain <- createDataPartition(data$targetVar, p = .8, list = F)
train <- data[inTrain, ]
test <- data[-inTrain, ]
The resulting test and train sets look consistent: x and y contain the same number of samples, and all columns are the same length.
My control parameters are as follows (also runs without error)
control = rfeControl(functions = lmFuncs, method = "repeatedcv", repeats = 5, verbose = F, returnResamp = "all")
But when I run RFE I get an error message saying
Error in rfe.default(train[, -1], train[, 1], sizes = c(10, 15, 20, 25, 30), rfeControl = control) :
there should be the same number of samples in x and y
My code for RFE is as follows, with the target variable in first column
rfe_lm_profile <- rfe(train[, -1], train[, 1], sizes = c(10, 15, 20, 25, 30), rfeControl = control)
I've looked through various forums, but nothing seems to work.
This Google Groups thread suggests using an older version of caret, which I tried, but I got the same x/y error: https://groups.google.com/g/rregrs/c/qwcP0VGn4ag?pli=1
Others suggest converting the target variable to a factor or matrix. This hasn't helped, and produces
Warning message:
In createDataPartition(data$EBI_SUM, p = 0.8, list = F) :
Some classes have a single record
when partitioning the data into test/train, and the same X/Y sample error if you try to carry out RFE.
Mega thanks in advance :)
p.s
Here's the dput for the target variable (EBI_SUM) and a couple of variables
data <- structure(list(TargetVar = c(243, 243, 243, 243, 355, 355), Dosing = c(2,
2, 2, 2, 2, 2), `QIDS_1 ` = c(1, 1, 3, 1, 1, 1), `QIDS_2 ` = c(3,
3, 2, 3, 3, 3), `QIDS_3 ` = c(1, 2, 1, 1, 1, 2)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
The column names in your data object should not contain spaces (in the dput, QIDS_1, QIDS_2 and QIDS_3 each end with a trailing space):
library(caret)
data <- data.frame(
TargetVar = c(243, 243, 243, 243, 355, 355),
Dosing = c(2, 2, 2, 2, 2, 2),
QIDS_1 = c(1, 1, 3, 1, 1, 1),
QIDS_2 = c(3, 3, 2, 3, 3, 3),
QIDS_3 = c(1, 2, 1, 1, 1, 2)
)
inTrain <- createDataPartition(data$TargetVar, p = .8, list = F)
train <- data[inTrain, ]
test <- data[-inTrain, ]
control <- rfeControl(functions = lmFuncs, method = "repeatedcv", repeats = 5, verbose = F, returnResamp = "all")
rfe_lm_profile <- rfe(train[, -1], train[, 1], sizes = c(10, 15, 20, 25, 30), rfeControl = control)
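One more thing that may be worth checking, stated as an assumption since the thread does not confirm it: the dput in the question has class tbl_df, and subsetting a single column of a tibble with [ keeps a one-column table instead of dropping to a vector. If rfe() compares nrow(x) with length(y), such a y would have length 1. The base R sketch below mimics this with drop = FALSE so that it runs without the tibble package.

```r
# Same columns as the dput in the question, built as a plain data.frame.
data <- data.frame(
  TargetVar = c(243, 243, 243, 243, 355, 355),
  Dosing = c(2, 2, 2, 2, 2, 2),
  QIDS_1 = c(1, 1, 3, 1, 1, 1)
)

# On a plain data.frame, [, 1] drops to a numeric vector, one value per row.
y_vec <- data[, 1]
length(y_vec) == nrow(data)  # TRUE

# drop = FALSE mimics a tibble, where [, 1] stays a one-column table.
y_tbl <- data[, 1, drop = FALSE]
length(y_tbl)                # 1 (the number of columns, not rows)

# [[ ]] returns a plain vector for data.frames and tibbles alike.
y_safe <- data[["TargetVar"]]
```

Rebuilding the data with data.frame(), as in the answer above, also avoids this because train[, 1] is then a plain vector.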

Two-way repeated measure Anova R with ezANOVA error, One or more cells is missing data

I've created this minimal dataset for the example:
data_long <- data.frame(Subject = factor(c(1, 2, 3, 1, 2, 3)),
Trt = factor(c("T1","T2","T3","T1","T2","T3")),
Day = factor(c(7, 7, 7, 14, 14, 14)),
Value = c(7.6, 5.3, 8.6, 12.4, 11.2, 11))
But when I try to run a two-way repeated measures ANOVA with ezANOVA, I get this error:
m2 <- ezANOVA(data = data_long, dv = Value, wid = Subject, within = c(Day,Trt))
Error in ezANOVA_main(data = data, dv = dv, wid = wid, within = within, :
One or more cells is missing data. Try using ezDesign() to check your data.
I definitely don't have missing data, but this error still occurs. Is there a way to fix that?
Thank you in advance,
Yemoloh
I think the problem is that each participant appears under only one level of the Trt factor, so most Subject x Trt x Day cells are empty and Trt cannot be treated as a within-subject factor.
You can see this by giving every participant data in each condition (so that each participant is present for each Trt and Day combination):
data_long <- data.frame(Subject = factor(rep(1:3, each = 6)),
Trt = factor(rep(c("T1", "T2", "T3"), times = 6)),
Day = factor(rep(c(7, 14), times = 3, each = 3)),
Value = rnorm(n = 18, mean = 6))
With this data structure you would be able to run the ANOVA as you specified it.
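The difference between the two designs can be checked without the ez package by counting observations per Subject x Trt x Day cell with base R's table(). This is a small sketch; data_orig and data_full are names introduced here for illustration.

```r
# The design from the question: each subject appears under a single treatment.
data_orig <- data.frame(
  Subject = factor(c(1, 2, 3, 1, 2, 3)),
  Trt = factor(c("T1", "T2", "T3", "T1", "T2", "T3")),
  Day = factor(c(7, 7, 7, 14, 14, 14))
)

# The fully crossed design from the answer.
data_full <- data.frame(
  Subject = factor(rep(1:3, each = 6)),
  Trt = factor(rep(c("T1", "T2", "T3"), times = 6)),
  Day = factor(rep(c(7, 14), times = 3, each = 3))
)

# Count observations in every Subject x Trt x Day cell.
cells_orig <- with(data_orig, table(Subject, Trt, Day))
cells_full <- with(data_full, table(Subject, Trt, Day))

sum(cells_orig == 0)  # 12 of the 18 cells are empty
all(cells_full == 1)  # TRUE: every cell holds exactly one observation
```

The empty cells in the first table are what the error message refers to, even though no value in the data frame is NA.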

R-hat against iterations RStan

I am trying to generate a plot similar to the one below, showing how R-hat changes over iterations:
I have tried the following options:
summary(fit1)$summary: gives R-hat with all chains merged
summary(fit1)$c_summary: gives R-hat for each chain individually
Can you please help me get R-hat at each iteration for a given parameter?
rstan provides the Rhat() function, which takes a matrix of iterations x chains and returns R-hat. We can extract this matrix from the fitted model and apply Rhat() cumulatively over it. The code below uses the 8 schools model as an example (copied from the getting started guide).
library(tidyverse)
library(purrr)
library(rstan)
theme_set(theme_bw())
# Fit the 8 schools model.
schools_dat <- list(J = 8,
y = c(28, 8, -3, 7, -1, 1, 18, 12),
sigma = c(15, 10, 16, 11, 9, 11, 10, 18))
fit <- stan(file = 'schools.stan', data = schools_dat)
# Extract draws for mu as a matrix; columns are chains and rows are iterations.
mu_draws = as.array(fit)[,,"mu"]
# Get the cumulative R-hat as of each iteration.
mu_rhat = map_dfr(
  1:nrow(mu_draws),
  function(i) {
    # drop = FALSE keeps the iterations x chains matrix shape when i = 1.
    return(data.frame(iteration = i,
                      rhat = Rhat(mu_draws[1:i, , drop = FALSE])))
  }
)
# Plot iteration against R-hat.
mu_rhat %>%
ggplot(aes(x = iteration, y = rhat)) +
geom_line() +
labs(x = "Iteration", y = expression(hat(R)))
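For readers without a fitted Stan model at hand, the same cumulative idea can be sketched in base R on simulated draws. Note the assumptions: classic_rhat() below is a hypothetical helper implementing the classic (non-split) Gelman-Rubin formula, which is not the same computation as rstan's Rhat(), and the draws are independent normal samples, not MCMC output.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Simulated stand-in for mu_draws: 1000 iterations (rows) x 4 chains (columns).
draws <- matrix(rnorm(1000 * 4), nrow = 1000, ncol = 4)

# Classic (non-split) Gelman-Rubin R-hat for an iterations x chains matrix.
classic_rhat <- function(x) {
  n <- nrow(x)
  W <- mean(apply(x, 2, var))  # mean within-chain variance
  B <- n * var(colMeans(x))    # between-chain variance
  sqrt(((n - 1) / n * W + B / n) / W)
}

# Recompute R-hat on the first i iterations, starting at i = 2
# because a variance needs at least two draws per chain.
iters <- 2:nrow(draws)
rhat <- vapply(iters,
               function(i) classic_rhat(draws[1:i, , drop = FALSE]),
               numeric(1))
cumulative <- data.frame(iteration = iters, rhat = rhat)
tail(cumulative, 1)
```

The resulting data frame has the same iteration and rhat columns as mu_rhat, so the ggplot call above can be reused on it.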

R: how to get optim to output parameter values at each iteration

library(stats4)
x <- 0:10
y <- c(26, 17, 13, 12, 20, 5, 9, 8, 5, 4, 8)
## Easy one-dimensional MLE:
nLL <- function(lambda) -sum(stats::dpois(y, lambda, log = TRUE))
fit0 <- mle(nLL, start = list(lambda = 5), nobs = NROW(y), method = "L-BFGS-B")
This is a toy example from mle's documentation. The optimization method I chose to use is L-BFGS-B. I'm interested in seeing the lambda values at different iterations.
Looking into optim's help page, I tried adding trace = TRUE. But that seems to give me the likelihood at each iteration and not the lambda values.
> fit0 <- mle(nLL, start = list(lambda = 5), nobs = NROW(y), method = "L-BFGS-B", control = list(trace = TRUE))
final value 42.726780
converged
How can I obtain the lambda estimates at each iteration?
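One possible approach, sketched here as a suggestion and not as a feature of mle() or optim(), is to wrap the negative log-likelihood so that it records every lambda it is called with. The sketch calls optim() directly; nLL_logged and lambda_trace are names introduced here, and lower = 0.01 is an added assumption to keep lambda positive.

```r
y <- c(26, 17, 13, 12, 20, 5, 9, 8, 5, 4, 8)

# Vector that collects every lambda the optimiser tries.
lambda_trace <- numeric(0)

# Negative log-likelihood that records its argument before returning.
nLL_logged <- function(lambda) {
  lambda_trace <<- c(lambda_trace, lambda)
  -sum(stats::dpois(y, lambda, log = TRUE))
}

# lower = 0.01 is an added assumption that keeps lambda positive.
fit <- optim(par = 5, fn = nLL_logged, method = "L-BFGS-B", lower = 0.01)

head(lambda_trace)  # every evaluation, including finite-difference steps
fit$par             # final estimate
```

Because no gradient function is supplied, optim() approximates the gradient by finite differences, so the trace contains more entries than there are iterations: each iterate is surrounded by small perturbations of lambda. The same wrapper could presumably be passed to mle() in place of nLL, since its argument is still named lambda.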
