My analysis will randomly sample values from pre-specified distributions for each parameter. I am using the rdecision package for this.
For example, in a simplified case with just two parameters:
v1 <- BetaModVar$new("Beta1", alpha = a1, beta = b1, units="")
v2 <- BetaModVar$new("Beta2", alpha = a2, beta = b2, units="")
I want to create
v3 <- v1 + v2
However, this is not possible given the nature of v1 and v2. How can I create v3? Essentially this would combine values randomly drawn from the respective distributions of v1 and v2.
Assuming your question is about using the rdecision package, you can add the v1 and v2 model variables by creating v3 as an expression model variable.
library(rdecision)
library(rlang)
a1 <- 2
b1 <- 2
a2 <- 20
b2 <- 20
# with model variables
v1 <- BetaModVar$new(description = "v1", units = "", alpha = a1, beta = b1)
v2 <- BetaModVar$new(description = "v2", units = "", alpha = a2, beta = b2)
v3 <- ExprModVar$new(description = "v3", units = "", quo = quo(v1 + v2))
print(v3$mu_hat())
print(v3$sigma_hat())
# with base R
V1 <- rbeta(1000, shape1 = a1, shape2 = b1)
V2 <- rbeta(1000, shape1 = a2, shape2 = b2)
V3 <- V1 + V2
print(mean(V3))
print(sd(V3))
Expression model variables can be used in decision analytic models in the same way as regular model variables whose uncertainty follows a particular distribution. But some of their properties (such as standard deviation) may be undefined, so expression model variables offer the mu_hat and sigma_hat methods to allow the mean and standard deviation to be estimated, if you want to summarise their distributions, as in the example above.
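For a plain sum of two independent beta variables, the Monte Carlo summaries can be checked against closed-form values. The sketch below uses base R only (no rdecision calls); the helper names beta_mean and beta_var are made up for this illustration, and independence of v1 and v2 is an assumption.

```r
set.seed(123)

a1 <- 2;  b1 <- 2
a2 <- 20; b2 <- 20

# Mean and variance of a Beta(a, b) distribution (hypothetical helper names)
beta_mean <- function(a, b) a / (a + b)
beta_var  <- function(a, b) a * b / ((a + b)^2 * (a + b + 1))

# For independent variables, means add and variances add
mu_exact <- beta_mean(a1, b1) + beta_mean(a2, b2)
sd_exact <- sqrt(beta_var(a1, b1) + beta_var(a2, b2))

# Monte Carlo estimate of the same quantities
n <- 1e5
draws <- rbeta(n, shape1 = a1, shape2 = b1) + rbeta(n, shape1 = a2, shape2 = b2)

print(c(mu_exact = mu_exact, mu_mc = mean(draws)))
print(c(sd_exact = sd_exact, sd_mc = sd(draws)))
```

For expressions that are not simple sums (products, ratios and so on) there is usually no such closed form, which is where sampling-based estimates become the practical option.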
I am looking for an approach to compare linear models, non-linear models, and non-linear models with different parameterizations. For this:
#Packages
library(nls2)
library(minpack.lm)
# Data set - Diameter in function of Feature and Age
Feature<-sort(rep(c("A","B"),22))
Age<-c(60,72,88,96,27,
36,48,60,72,88,96,27,36,48,60,72,
88,96,27,36,48,60,27,27,36,48,60,
72,88,96,27,36,48,60,72,88,96,27,
36,48,60,72,88,96)
Diameter<-c(13.9,16.2,
19.1,19.3,4.7,6.7,9.6,11.2,13.1,15.3,
15.4,5.4,7,9.9,11.7,13.4,16.1,16.2,
5.9,8.3,12.3,14.5,2.3,5.2,6.2,8.6,9.3,
11.3,15.1,15.5,5,7,7.9,8.4,10.5,14,14,
4.1,4.9,6,6.7,7.7,8,8.2)
d<-dados <- data.frame(Feature,Age,Diameter)
str(d)
I will create three different models: two non-linear models with specific parameterizations and one linear model. In my example,
suppose that all the coefficients of each model are significant (regardless of the actual results).
# Model 1 non-linear
e1<- Diameter ~ a1 * Age^a2
# Levenberg-Marquardt algorithm
m1 <- nlsLM(e1, data = d,
start = list(a1 = 0.1, a2 = 10),
control = nls.control(maxiter = 1000))
# Model 2 linear
m2<-lm(Diameter ~ Age, data=d)
# Model 3 another non-linear
e2<- Diameter ~ a1^(-Age/a2)
m3 <- nls2(e2, data = d, alg = "brute-force",
start = data.frame(a1 = c(-1, 1), a2 = c(-1, 1)),
control = nls.control(maxiter = 1000))
Now, my idea is to find the "better" model despite the different nature of each model, so I tried a proportional measure:
I compare the error sum of squares of each model with the total sum of squares in the data set. Doing this
for models 1 and 2 gives:
## MSE approach (like pseudo R2 approach)
#Model 1
SQEm1<-summary(m1)$sigma^2*summary(m1)$df[2]# residual sum of squares of the model
SQTm1<-var(d$Diameter)*(length(d$Diameter)-1)# total sum of squares in the data set
R1<-1-SQEm1/SQTm1
R1
#Model 2
SQEm2<-summary(m2)$sigma^2*summary(m2)$df[2]# residual sum of squares of the model
R2<-1-SQEm2/SQTm1
R2
In my humble opinion, model 1 is "better" than model 2. My question is: does this approach sound correct? Is there another way to compare these types of models?
Thanks in advance!
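One way to put models of different types on the same footing is to compute the same summary from the residuals of each fit. The sketch below is an illustration on simulated data standing in for the Age/Diameter data set (an assumption, not the real data); it uses base nls rather than nlsLM so that it has no package dependencies, and pseudo_r2 is a made-up helper name. AIC is included as a second yardstick, since both models are fitted to the same untransformed response.

```r
set.seed(1)

# Simulated stand-in for the Age/Diameter data (assumption, not the real data)
age <- rep(c(27, 36, 48, 60, 72, 88, 96), times = 6)
sim <- data.frame(age = age,
                  diameter = 0.4 * age^0.8 + rnorm(length(age), sd = 1))

# Starting values for the power model from a log-log regression
ll <- lm(log(diameter) ~ log(age), data = sim)
start <- list(a1 = exp(coef(ll)[[1]]), a2 = coef(ll)[[2]])

fit_power  <- nls(diameter ~ a1 * age^a2, data = sim, start = start)
fit_linear <- lm(diameter ~ age, data = sim)

# Same formula for every model: 1 - residual SS / total SS (hypothetical helper)
pseudo_r2 <- function(fit, y) {
  1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
}

r2 <- c(power  = pseudo_r2(fit_power,  sim$diameter),
        linear = pseudo_r2(fit_linear, sim$diameter))
print(r2)

# AIC is comparable here because both fits use the same response and data
aic <- AIC(fit_power, fit_linear)
print(aic)
```

Both summaries are in-sample measures, so they tend to favour the more flexible model; an out-of-sample check is a useful complement.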
#First cross-validation approach ------------------------------------------
#Cross-validation model 1
set.seed(123) # for reproducibility
n <- nrow(d)
frac <- 0.8
ix <- sample(n, frac * n) # indexes of in sample rows
e1<- Diameter ~ a1 * Age^a2
# Levenberg-Marquardt algorithm
m1 <- nlsLM(e1, data = d,
start = list(a1 = 0.1, a2 = 10),
control = nls.control(maxiter = 1000), subset = ix)# in sample model
BOD.out <- d[-ix, ] # out of sample data
pred <- predict(m1, new = BOD.out)
act <- BOD.out$Diameter
RSS1 <- sum( (pred - act)^2 )
RSS1
#[1] 56435894734
#Cross-validation model 2
m2<-lm(Diameter ~ Age, data=d, subset = ix)# in sample model
BOD.out2 <- d[-ix, ] # out of sample data
pred <- predict(m2, new = BOD.out2)
act <- BOD.out2$Diameter
RSS2 <- sum( (pred - act)^2 )
RSS2
#[1] 19.11031
# Sum of squares approach -----------------------------------------------
deviance(m1)
#[1] 238314429037
deviance(m2)
#[1] 257.8223
Based on the comments from gfgm and G. Grothendieck: RSS2 is lower than RSS1, and deviance(m2) is lower than deviance(m1), so model 2 is better than model 1.
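A single 80/20 split can depend heavily on which rows land in the hold-out set, so a k-fold version of the same idea may give a steadier comparison. The sketch below is an illustration under assumptions: the data are simulated rather than the real Age/Diameter values, base nls replaces nlsLM, and the helper names (cv_rmse, fit_power, fit_linear) are made up. The very large RSS and deviance reported for model 1 may indicate that the fit did not reach a sensible solution from its start values, so here the start values come from a log-log regression.

```r
set.seed(2)

# Simulated stand-in for the Age/Diameter data (assumption, not the real data)
age <- rep(c(27, 36, 48, 60, 72, 88, 96), times = 6)
sim <- data.frame(age = age,
                  diameter = 0.4 * age^0.8 + rnorm(length(age), sd = 1))

# Assign every row to one of k folds at random
k <- 5
fold <- sample(rep(seq_len(k), length.out = nrow(sim)))

# Model-fitting functions that take a training data frame
fit_power <- function(dat) {
  ll <- lm(log(diameter) ~ log(age), data = dat)
  nls(diameter ~ a1 * age^a2, data = dat,
      start = list(a1 = exp(coef(ll)[[1]]), a2 = coef(ll)[[2]]))
}
fit_linear <- function(dat) lm(diameter ~ age, data = dat)

# Root mean squared prediction error over all held-out rows
cv_rmse <- function(fit_fun) {
  sq_err <- numeric(0)
  for (i in seq_len(k)) {
    train <- sim[fold != i, ]
    test  <- sim[fold == i, ]
    fit   <- fit_fun(train)
    sq_err <- c(sq_err, (test$diameter - predict(fit, newdata = test))^2)
  }
  sqrt(mean(sq_err))
}

rmse <- c(power = cv_rmse(fit_power), linear = cv_rmse(fit_linear))
print(rmse)
```

Because both models are scored on the same held-out rows and on the original Diameter scale, the two RMSE values can be compared directly.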
I am trying to label a PCA biplot with sample names rather than the standard numbers. I am using this code:
PRCOMP1 <- prcomp(~ Max + Min + Range + Average + P10 + P20 +
P50 + P100 + D10 + D20 + D50 + D100 + D500,
data = turbidity,
na.action = na.omit,
scale = TRUE)
biplot(PRCOMP1, cex = 0.8, choices=c(1,2))
which produces the plot below. Does anyone know how I can label the points using a column named Sample in my data sheet?
Also, is there an easy way to change the colour of the arrows? Any help would be much appreciated.
You would name the rows of your input data with the value of the Sample column:
row.names(turbidity) <- turbidity$Sample
The dots on your biplot will then be labelled with their cognate sample name.
Here is an example:
#creating an example data frame with 5 numeric and one character variables
mydata1 <- as.data.frame(matrix(rnorm(100, 0, 2), ncol = 5))
mydata1$sample <- c(sapply(1:20, function(i) paste("s", i, sep = "")))
#view of the df
mydata1
V1 V2 V3 V4 V5 sample
1 1.7398057 -0.8074246 0.009826488 0.58566480 3.88569625 s1
2 -1.3259889 -2.4359229 -1.258855445 2.65124987 -2.64137545 s2
3 -2.3961068 -0.3108402 -1.330362255 -0.35209302 -2.39282594 s3
This is a data frame with 20 rows and 6 variables.
biplot(prcomp(mydata1[,-6]))
This statement returns a plot whose points are labelled with row numbers only, not sample names.
#naming rows of the df with the sample column value
row.names(mydata1) <- mydata1$sample
#viewing the df
head(mydata1)
V1 V2 V3 V4 V5 sample
s1 1.739806 -0.8074246 0.009826488 0.5856648 3.8856962 s1
s2 -1.325989 -2.4359229 -1.258855445 2.6512499 -2.6413755 s2
s3 -2.396107 -0.3108402 -1.330362255 -0.3520930 -2.3928259 s3
#plotting
biplot(prcomp(mydata1[,-6]))
The latter plot will now render the observations with their labels.
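On the second part of the question (arrow colour): the base biplot function takes a col argument of length two, where the first colour is used for the observations and the second for the variables, including their arrows. The sketch below combines this with the row-name approach; the data frame turb and its values are invented for the illustration.

```r
set.seed(1)

# Made-up stand-in for the turbidity data (assumption)
turb <- data.frame(Sample = paste0("site", 1:10),
                   Max    = rnorm(10),
                   Min    = rnorm(10),
                   Range  = rnorm(10),
                   stringsAsFactors = FALSE)

# Row names become the point labels in the biplot
rownames(turb) <- turb$Sample

pca <- prcomp(turb[, c("Max", "Min", "Range")], scale. = TRUE)

# col[1]: observation labels, col[2]: variable labels and arrows
biplot(pca, cex = 0.8, choices = c(1, 2), col = c("black", "steelblue"))
```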
Let me know if that is what you had in mind.
I'm using the gamlss package in R to produce worm plots for studying residuals.
The function wp() has an argument xvar which is used for bucketing.
Assume I have a numeric vector x1. Passing it as xvar = x1 behaves differently from passing it as xvar = ~x1: the second form is treated as a formula, and the buckets created in the two cases differ.
Code:
library(gamlss)
glc<-gamlss.control(n.cyc = 200)
myseed <- 12345
set.seed(myseed) #this will make results reproducible
# generate data
N<-10000 # this is the sample size
dd<-data.frame(x1=rpois(N,1)
,x2=rnorm(N,.7,.3)
,x3=log(rgamma(N,shape=6,scale=10))
,x4=sample(letters[1:3], N, replace = T)
,x5=sample(letters[3:6], N, replace = T)
,ind = rbinom(N,size=1,prob=0.5)
)
#Generate distributions
dd$y_wei1<-rweibull(N,scale=exp(.3*dd$x1+.8*dd$x3),shape=5)
m1 <- gamlss(formula = y_wei1 ~ x1 + x3 + x4 + x5,
data = dd ,
family = "WEI" ,
K = 2,
control = glc
)
# Case 1.
wp(object = m1, xvar = x1, n.iter = 4)
# Case 2.
wp(object = m1, xvar = ~x1, n.iter = 4)
Edit:
I observed that this happens only when the overlap argument is set to 0, because when overlap = 0 another function (check.overlap) is called internally. Why is this function called?
The function has been written so that xvar = ~x1 indicates that x1 is a factor/character variable, so grouping is based on its unique values. When the user calls it with xvar = x1, bins are created from the range of x1 and used to generate the worm plots.
The difference arises because there is an internal check.overlap function that is applied only if x1 is numeric. If the intervals overlap, it clips them so that they are non-overlapping. This step is skipped when the user calls wp with xvar = ~x1.
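The general R mechanism behind this kind of behaviour can be sketched in a few lines: a function can test whether its argument is a formula and branch accordingly. The code below is a hypothetical illustration of the two grouping strategies described above, not the gamlss implementation; group_xvar and n_bins are invented names.

```r
set.seed(12345)
dd <- data.frame(x1 = rpois(1000, 1))

# Hypothetical helper, NOT the gamlss code: shows how one argument
# can be handled differently depending on its class
group_xvar <- function(xvar, data = NULL, n_bins = 4) {
  if (inherits(xvar, "formula")) {
    # One-sided formula such as ~x1: look the variable up in `data`
    # and group by its unique values
    values <- eval(xvar[[2]], envir = data, enclos = environment(xvar))
    factor(values)
  } else {
    # Plain numeric vector: cut its range into equal-width intervals
    cut(xvar, breaks = n_bins)
  }
}

by_vector  <- table(group_xvar(dd$x1))
by_formula <- table(group_xvar(~x1, data = dd))

print(by_vector)   # one count per interval of the range
print(by_formula)  # one count per distinct value of x1
```

With a discrete variable such as a Poisson count, the two groupings generally produce different numbers of buckets, which is consistent with the different plots seen in the two cases.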
I have a logistic regression using glm and I would like to add a term of the form
c1 * (k + a*c2) / (t + c2)
where k and t are columns in a data frame, a is a constant. I would like R to find best-fit values for c1 and c2. Is this possible?
If I only wanted a fixed value, say c2 = 2,
c1 * (k + 2*a) / (t + 2)
I could just write
glm( model$y ~ I((model$k + 2*a)/(model$t + 2)) + model$otherterms,
family = binomial(logit) )
which is similar to what I am doing now. But I don't think that 2 is optimal and iterating 'manually' is very time-consuming.
You can use function gnm from package gnm.
gnm(y~Mult(1, # c1
offset(k)+1,# c3=a*c2
Inv(offset(t)+1)) # c2
+other terms,
family=binomial,
data=models)
EDIT (solution for constrained coefficients)
term_fun <- function(predLabels, varLabels){
paste0(predLabels[1],"*(",varLabels[1],
"+",predLabels[2],"*3)/(", # a=3 for example
varLabels[2],"+", predLabels[2],")")} # the same C2 appears in numerator and denominator
Ratio <- function(t,x){
list(predictors = list(C1 = 1, C2 = 1),
variables = list(substitute(t), substitute(x)),
term = term_fun)
}
class(Ratio) <- "nonlin"
fit <- gnm(Y~Ratio(k,t), data=models, family=binomial)
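If installing gnm is not an option, the manual iteration mentioned in the question can be automated in base R. For any fixed c2 the term is linear in c1, so an ordinary glm can be fitted and its deviance minimised over c2. The sketch below is an illustration on simulated data; the true values (c1 = 1.5, c2 = 2, a = 3), the search interval and the name profile_deviance are all assumptions made for the example.

```r
set.seed(1)

# Simulated data (assumption): true c1 = 1.5, c2 = 2, a = 3
n <- 2000
a <- 3
dat <- data.frame(k = runif(n, 0, 10), t = runif(n, 1, 10))
eta <- -1 + 1.5 * (dat$k + a * 2) / (dat$t + 2)
dat$y <- rbinom(n, 1, plogis(eta))

# For a fixed c2 the model is an ordinary logistic regression in z
profile_deviance <- function(c2) {
  dat$z <- (dat$k + a * c2) / (dat$t + c2)
  deviance(glm(y ~ z, family = binomial, data = dat))
}

# One-dimensional search over c2 (the interval is an assumption)
opt <- optimize(profile_deviance, interval = c(0.01, 20))
c2_hat <- opt$minimum

# Refit at the best c2 to read off c1 (the coefficient of z)
dat$z <- (dat$k + a * c2_hat) / (dat$t + c2_hat)
fit <- glm(y ~ z, family = binomial, data = dat)
c1_hat <- coef(fit)[["z"]]

print(c(c1_hat = c1_hat, c2_hat = c2_hat))
```

The standard errors reported by the final glm treat c2 as known, so they understate the uncertainty; further linear terms can be added to the formula inside profile_deviance in the usual way.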
Let's say that you have a normally distributed variable y with a 3-group categorical predictor x that has the orthogonal contrasts c1 and c2. I am trying to create a program in R that, given x, c1, and c2, creates y such that c1 and c2 have effect sizes r1 and r2 specified by the user.
For example, let's say that x, c1, c2, r1, and r2 were created like the following:
x <- factor(rep(c(1, 2, 3), 100))
contrasts(x) <- matrix(c(0, -.5, .5, -2/3, 1/3, 1/3),
nrow = 3, ncol = 2, dimnames = list(c("1", "2", "3"), c("c1", "c2")))
contrasts(x)
c1 c2
1 0.0 -0.6666667
2 -0.5 0.3333333
3 0.5 0.3333333
r1 <- .09
r2 <- 0
I would like the program to create y such that the variance in y accounted for by c1 equals r1 (.09) and the variance in y accounted for by c2 equals r2 (0).
Does anybody know how I might go about this? I know that I should be using the rnorm function, but I'm stuck on which population means / sds rnorm should use when it does its sampling.
Courtesy of some generous advice from my colleagues, I now have one function that creates simulated data given a specified number of groups, a set of contrasts, a set of regression coefficients, a specified N per cell, and a specified within-group variance:
sim.factor <- function(levels, contr, beta, perCell, errorVar){
# Build design matrix X
X <- cbind(rep(1,levels*perCell), kronecker(contr, rep(1,perCell)))
# Generate y
y <- X %*% beta + rnorm(levels*perCell, sd=sqrt(errorVar))
# Build and return data frame
dat <- cbind.data.frame(y, X[,-1])
names(dat)[-1] <- colnames(contr)
return(dat)
}
I also wrote a function that, given a set of regression coefficients, N per cell, number of groups, set of orthogonal contrasts, desired delta-R^2 for a contrast of interest, returns the required within-group variance:
ws.var <- function(levels, contr, beta, perCell, dc){
# Build design matrix X
X <- cbind(rep(1,levels), contr)
# Generate the expected means
means <- X %*% beta
# Find the sum of squares due to each contrast
var <- (t(means) %*% contr)^2 / apply(contr^2 / perCell, 2, sum)
# Calculate the within-conditions sum of squares
wvar <- var[1] / dc - sum(var)
# Convert the sum of squares to variance
errorVar <- wvar / (levels * (perCell - 1))
return(errorVar)
}
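The algebra behind ws.var can be written out as follows. This is a reading of the code rather than a derivation supplied by the author: g is the number of groups (3 in the example), n is perCell, mu_j are the expected cell means, c_kj are the contrast weights, and Delta R^2 is dc. It works with expected means only, so the Delta R^2 achieved in a simulated sample will be close to the target but not exactly equal to it.

```latex
% Sum of squares for contrast k, from the expected cell means (var in the code)
SS_{c_k} = \frac{\left(\sum_{j=1}^{g} c_{kj}\,\mu_j\right)^{2}}
                {\sum_{j=1}^{g} c_{kj}^{2} / n}

% Target effect size for the first contrast (dc in the code)
\Delta R^2_{c_1} = \frac{SS_{c_1}}{\sum_{k} SS_{c_k} + SS_W}

% Solve for the within-group sum of squares (wvar in the code)
SS_W = \frac{SS_{c_1}}{\Delta R^2_{c_1}} - \sum_{k} SS_{c_k}

% Convert to a within-group variance using the error degrees of freedom
\sigma^2_e = \frac{SS_W}{g\,(n - 1)}
```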
After doing some testing as follows, the functions seem to generate the desired delta R^2 for contrast c1.
contr <- contr.helmert(3)
colnames(contr) <- c("c1","c2")
beta <- c(0, 1, 0)
perCell <- 50
levels = 3
dc <- .08
N <- 1000
# Calculate the error variance
errorVar <- ws.var(levels, contr, beta, perCell, dc)
# To store delta R^2 values
d1 <- vector("numeric", length = N)
# Use the functions
for(i in 1:N)
{
d <- sim.factor(levels=3,
contr=contr,
beta=beta,
perCell=perCell,
errorVar=errorVar)
d1[i] <- lm.sumSquares(lm(y ~ c1 + c2, data = d))[1, 2] # From the lmSupport package
}
m <- round(mean(d1), digits = 3)
bmp("Testing simulation functions.bmp")
hist(d1, xlab = "Percentage of variance due to c1", main = "")
text(.18, 180, labels = paste("Mean =", m))
dev.off()
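The same check can be run without lmSupport by taking the sums of squares from the base anova table. The sketch below is a self-contained illustration that restates the calculation inline instead of calling sim.factor and ws.var; the function name one_delta_r2, the 500 replications and the variable names are choices made for this example only.

```r
set.seed(42)

n_groups <- 3
per_cell <- 50
contr <- contr.helmert(n_groups)
colnames(contr) <- c("c1", "c2")
beta   <- c(0, 1, 0)
target <- 0.08  # desired delta R^2 for c1

# Error variance implied by the target, from the expected cell means
means       <- cbind(1, contr) %*% beta
ss_contrast <- as.vector((t(means) %*% contr)^2 / colSums(contr^2 / per_cell))
ss_within   <- ss_contrast[1] / target - sum(ss_contrast)
error_var   <- ss_within / (n_groups * (per_cell - 1))

# Contrast codes repeated for every observation in each cell
c1 <- rep(unname(contr[, "c1"]), each = per_cell)
c2 <- rep(unname(contr[, "c2"]), each = per_cell)

# Simulate one data set and return SS(c1) / SS(total)
one_delta_r2 <- function() {
  y <- beta[1] + beta[2] * c1 + beta[3] * c2 +
    rnorm(n_groups * per_cell, sd = sqrt(error_var))
  tab <- anova(lm(y ~ c1 + c2))
  tab["c1", "Sum Sq"] / sum(tab[["Sum Sq"]])
}

delta <- replicate(500, one_delta_r2())
print(round(mean(delta), 3))
```

Because the contrasts are orthogonal and the design is balanced, the sequential sums of squares in the anova table do not depend on the order of c1 and c2 in the formula.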
Patrick