Linear formula calculation - math

I have recently been trying to solve a problem that involves a formula.
I already have one formula as an example:
MIN 0.58 equals X = 0.50
MAX 1.23 equals X = 4.50
You can obtain any in-between value with the following formula: (0.1621 * X) + 0.4990, which works great.
Example:
(0.1621 * 1.20) + 0.4990 = 0.69, so 0.69 equals X = 1.20
However, the scenario has changed, so the formula must change too, and I can't work out the new one.
MIN 0.59 equals X = 0.20
MAX 1.10 equals X = 4.80
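A minimal sketch of one way to get the new formula, assuming the relationship is still a straight line through the MIN and MAX points. The helper name line_through is hypothetical and not part of the question:

```r
# Slope and intercept of the straight line through (x1, y1) and (x2, y2).
# line_through is a hypothetical helper introduced for illustration.
line_through <- function(x1, y1, x2, y2) {
  slope <- (y2 - y1) / (x2 - x1)
  intercept <- y1 - slope * x1
  c(slope = slope, intercept = intercept)
}

# Original scenario: MIN 0.58 at X = 0.50, MAX 1.23 at X = 4.50
old_line <- line_through(0.50, 0.58, 4.50, 1.23)

# New scenario: MIN 0.59 at X = 0.20, MAX 1.10 at X = 4.80
new_line <- line_through(0.20, 0.59, 4.80, 1.10)

print(round(old_line, 4))
print(round(new_line, 4))

# Value for an arbitrary X under the new scenario
value_at <- function(x, line) unname(line["slope"] * x + line["intercept"])
print(value_at(1.20, new_line))
```

Under that assumption the new formula is roughly (0.1109 * X) + 0.5678. The exact two-point line for the original scenario is (0.1625 * X) + 0.49875, so the coefficients 0.1621 and 0.4990 quoted in the question were presumably rounded or fitted slightly differently.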

Related

ROC Curve Plot using R (Error code: Predictor must be numeric or ordered)

I am trying to make a ROC curve using pROC with the two columns below (the list goes on for over 300 entries):
Actual_Findings_%   Predicted_Finding_Prob
0.23                0.6
0.48                0.3
0.26                0.62
0.23                0.6
0.48                0.3
0.47                0.3
0.23                0.6
0.6868              0.25
0.77                0.15
0.31                0.55
The code I tried to use is:
roccurve<- plot(roc(response = data$Actual_Findings_% <0.4, predictor = data$Predicted_Finding_Prob >0.5),
legacy.axes = TRUE, print.auc=TRUE, main = "ROC Curve", col = colors)
Where the threshold for positive findings is
Actual_Findings_% <0.4
AND
Predicted_Finding_Prob >0.5
(i.e., to be a TRUE POSITIVE, Actual_Findings_% would be LESS than 0.4 AND Predicted_Finding_Prob would be GREATER than 0.5)
but when I try to plot this roc curve, I get the error:
"Setting levels: control = FALSE, case = TRUE
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'plot': Predictor must be numeric or ordered."
Any help would be much appreciated!
This should work:
data <- read.table( text=
"Actual_Findings_% Predicted_Finding_Prob
0.23 0.6
0.48 0.3
0.26 0.62
0.23 0.6
0.48 0.3
0.47 0.3
0.23 0.6
0.6868 0.25
0.77 0.15
0.31 0.55
", header=TRUE, check.names=FALSE )
library(pROC)
roccurve <- plot(
    roc(
        response = data$"Actual_Findings_%" < 0.4,
        predictor = data$"Predicted_Finding_Prob"
    ),
    legacy.axes = TRUE, print.auc = TRUE, main = "ROC Curve"
)
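Now, importantly: the ROC curve is there to show you what happens when you vary your classification threshold. So one thing you are doing wrong is enforcing a single threshold yourself by passing Predicted_Finding_Prob > 0.5 as the predictor. That turns the predictor into a logical vector, which is why pROC complains that it must be numeric or ordered.
This data does, however, give a perfect separation, which is nice I guess. (Though bad for educational purposes.)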
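To make the "vary the threshold" idea concrete, here is a rough base-R sketch that computes the true and false positive rates by hand for the ten rows above. It does not use pROC. Using each distinct predicted value as a candidate threshold is an assumption made for illustration and is not necessarily how pROC picks its thresholds:

```r
# The ten example rows from the answer above
actual    <- c(0.23, 0.48, 0.26, 0.23, 0.48, 0.47, 0.23, 0.6868, 0.77, 0.31)
predicted <- c(0.6, 0.3, 0.62, 0.6, 0.3, 0.3, 0.6, 0.25, 0.15, 0.55)

# Response: TRUE where the actual finding is below 0.4
is_case <- actual < 0.4

# Candidate thresholds (assumption: every distinct predicted value)
thresholds <- sort(unique(predicted))

# One (threshold, TPR, FPR) row per candidate threshold
roc_points <- t(sapply(thresholds, function(th) {
  called_positive <- predicted >= th
  c(threshold = th,
    tpr = sum(called_positive & is_case) / sum(is_case),
    fpr = sum(called_positive & !is_case) / sum(!is_case))
}))

print(roc_points)
```

Each row is one point on the curve. Because the cases and controls do not overlap in this sample, at least one threshold reaches a true positive rate of 1 with a false positive rate of 0.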

rstan MCMC: Different sequence of data resulting in different results, why?

I am new to Stan and rstan.
I recently ran into a strange issue while working with Markov chain Monte Carlo (MCMC). In short, suppose the data has 10 observations, with IDs 1 to 10. I permute it by moving the 10th row between the original first and second rows, giving the order 1, 10, 2, ..., 9. The two orderings of the data give different estimates, even though I fix the same random seed.
To illustrate the issue in a simpler way, I wrote the following R script.
##TEST 01
# generate data
N <- 100
set.seed(123)
Y <- rnorm(N, 1.6, 0.2)
stan_code1 <- "
data {
int <lower=0> N; //number of data
real Y[N]; //data in an (C++) array
}
parameters {
real mu; //mean parameter of a normal distribution
real <lower=0> sigma; //standard deviation parameter of a normal distribution
}
model {
//prior distributions for parameters
mu ~ normal(1.7, 0.3);
sigma ~ cauchy(0, 1);
//likelihood of Y given parameters
for (i in 1:N) {
Y[i] ~ normal(mu, sigma);
}
}
"
# compile model
library(rstan)
model1 <- stan_model(model_code = stan_code1) # usually takes about half a minute to run
# pass data to stan and run model
set.seed(123)
fit <- sampling(model1, list(N=N, Y=Y), iter=200, chains=4)
print(fit)
# mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
# mu 1.62 0.00 0.02 1.58 1.61 1.62 1.63 1.66 473 1.00
# sigma 0.18 0.00 0.01 0.16 0.18 0.18 0.19 0.21 141 1.02
# lp__ 117.84 0.07 0.85 115.77 117.37 118.07 118.51 118.78 169 1.01
Yp <- Y[c(1,100,2:99)]
set.seed(123)
fit2 <- sampling(model1, list(N=N, Y=Yp), iter=200, chains=4)
print(fit2)
# mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
# mu 1.62 0.00 0.02 1.59 1.61 1.62 1.63 1.66 480 0.99
# sigma 0.18 0.00 0.01 0.16 0.18 0.18 0.19 0.21 139 1.02
# lp__ 117.79 0.09 0.95 115.72 117.35 118.05 118.49 118.77 124 1.01
As we can see from this simple case, the two results, fit and fit2, are different.
Even stranger, if I write the likelihood before the priors in the model code (previously, the priors came before the likelihood), the same random seed and the same data still give different estimates.
##TEST 01'
# generate data
#N <- 100
set.seed(123)
Y <- rnorm(N, 1.6, 0.2)
stan_code11 <- "
data {
int <lower=0> N; //number of data
real Y[N]; //data in an (C++) array
}
parameters {
real mu; //mean parameter of a normal distribution
real <lower=0> sigma; //standard deviation parameter of a normal distribution
}
model {
//likelihood of Y given parameters
for (i in 1:N) {
Y[i] ~ normal(mu, sigma);
}
//prior distributions for parameters
mu ~ normal(1.7, 0.3);
sigma ~ cauchy(0, 1);
}
"
# compile model
#library(rstan)
model11 <- stan_model(model_code = stan_code11) # usually takes about half a minute to run
# pass data to stan and run model
set.seed(123)
fit11 <- sampling(model11, list(N=N, Y=Y), iter=200, chains=4)
print(fit11)
# mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
# mu 1.62 0.00 0.02 1.58 1.61 1.62 1.63 1.66 455 0.99
# sigma 0.19 0.00 0.01 0.16 0.18 0.18 0.20 0.21 94 1.04
# lp__ 117.68 0.08 0.93 115.24 117.18 117.90 118.45 118.77 149 1.01
##TEST01 was
# mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
# mu 1.62 0.00 0.02 1.58 1.61 1.62 1.63 1.66 473 1.00
# sigma 0.18 0.00 0.01 0.16 0.18 0.18 0.19 0.21 141 1.02
# lp__ 117.84 0.07 0.85 115.77 117.37 118.07 118.51 118.78 169 1.01
Stan does not use the same pseudo-random number generator as R. Thus, calling set.seed(123) only makes Y repeatable and does not make the MCMC sampling repeatable. To accomplish the latter, you need to pass an integer as the seed argument to the stan (or sampling) function in the rstan package, like
sampling(model11, list(N = N, Y = Y), seed = 1234).
Even then, I could imagine that permuting the observations could result in different realizations of the draws from the posterior distribution due to floating-point reasons. But none of this really matters (unless you conduct too few iterations or get warning messages) because the posterior distribution is the same even if a finite set of realizations from the posterior distribution are randomly different numbers.
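As a small illustration of the floating-point point, here is a sketch in plain R, with no Stan involved, showing that adding the same numbers in a different order can change the last bits of the result. It only shows the general effect and is not a reproduction of what Stan does internally:

```r
# The same three numbers, added in two different groupings
left_to_right <- (0.1 + 0.2) + 0.3
right_to_left <- 0.1 + (0.2 + 0.3)

# Exact comparison: the two sums differ in the last bit
print(left_to_right == right_to_left)

# Comparison with a numerical tolerance: the sums are equal for practical purposes
print(isTRUE(all.equal(left_to_right, right_to_left)))

# Show the difference explicitly
print(left_to_right - right_to_left)
```

A log-likelihood accumulated over permuted observations can differ in the same way. That would be enough to send a sampler down a slightly different path even with an identical seed.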

Penalized Regression: "ridge" RMSE greater than that for plain "lm"

I am working with the "prostate" dataset in the "ElemStatLearn" package.
library(caret) # provides train()
set.seed(3434)
fit.lm = train(data=trainset, lpsa~., method = "lm")
fit.ridge = train(data=trainset, lpsa~., method = "ridge")
fit.lasso = train(data=trainset, lpsa~., method = "lasso")
Comparing RMSE (for bestTune in case of ridge and lasso)
fit.lm$results[,"RMSE"]
[1] 0.7895572
fit.ridge$results[fit.ridge$results[,"lambda"]==fit.ridge$bestTune$lambda,"RMSE"]
[1] 0.8231873
fit.lasso$results[fit.lasso$results[,"fraction"]==fit.lasso$bestTune$fraction,"RMSE"]
[1] 0.7779534
Comparing absolute value of coefficients
abs(round(fit.lm$finalModel$coefficients,2))
(Intercept)      lcavol     lweight         age        lbph         svi         lcp     gleason       pgg45
       0.43        0.58        0.61        0.02        0.14        0.74        0.21        0.03        0.01
abs(round(predict(fit.ridge$finalModel, type = "coef", mode = "norm")$coefficients[8,],2))
 lcavol lweight     age    lbph     svi     lcp gleason   pgg45
   0.49    0.62    0.01    0.14    0.65    0.05    0.00    0.01
abs(round(predict(fit.lasso$finalModel, type = "coef", mode = "norm")$coefficients[8,],2))
 lcavol lweight     age    lbph     svi     lcp gleason   pgg45
   0.56    0.61    0.02    0.14    0.72    0.18    0.00    0.01
My question is: how can the "ridge" RMSE be higher than that of plain "lm"? Doesn't that defeat the very purpose of penalized regression versus plain "lm"?
Also, how can the absolute value of the coefficient of "lweight" actually be higher in ridge (0.62) than in lm (0.61)? Both coefficients are positive before abs() is applied.
I was expecting ridge to perform similarly to lasso, which not only reduced the RMSE but also shrank the coefficients relative to plain "lm".
Thank you!

Apply log and log1p to several columns with an if condition

I have a dataframe and I need to calculate log for all numbers greater than 0 and log1p for numbers equal to 0. My dataframe is called tcPainelLog and looks like this (str of columns 6:8):
$ IDD: num 0.04 0.06 0.07 0.72 0.52 ...
$ Soil: num 0.25 0.22 0.16 0.00 0.00 ...
$ QAI: num 0.00 0.50 0.00 0.71 0.26 ...
Therefore, I guess I need to combine an ifelse statement with the log and log1p functions. However, I have tried several different ways to do it, and none has succeeded. For instance:
tcPainelLog <- tcPainel
cols <- names(tcPainelLog[,6:17]) # These are the columns I need to calculate
tcPainelLog$IDD <- ifelse(X = tcPainelLog$IDD>0, log(X), log1p(X))
tcPainelLog[cols] <- lapply(tcPainelLog[cols], function(x) ifelse((x > 0), log(x), log1p(x)))
tcPainelLog[cols] <- if(tcPainelLog[,cols] > 0) log(.) else log1p(.)
I haven't been able to get it to work and would appreciate any help. I apologize if there is already an explanation for this; I searched with many different keywords but didn't find one.
Best regards.
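One possible approach, sketched on a small made-up data frame. The five rows reuse the values visible in the str() output above. The names toy, toy_log and log_or_log1p are hypothetical, and the columns are selected by name here rather than by the positions 6:17 used in the question:

```r
# Toy stand-in for tcPainel (assumption: only three numeric columns)
toy <- data.frame(
  IDD  = c(0.04, 0.06, 0.07, 0.72, 0.52),
  Soil = c(0.25, 0.22, 0.16, 0.00, 0.00),
  QAI  = c(0.00, 0.50, 0.00, 0.71, 0.26)
)
cols <- c("IDD", "Soil", "QAI")

# log() for values above 0, log1p() otherwise (log1p(0) is 0)
log_or_log1p <- function(x) ifelse(x > 0, log(x), log1p(x))

# Apply the transformation column by column, leaving the original untouched
toy_log <- toy
toy_log[cols] <- lapply(toy[cols], log_or_log1p)

print(toy_log)
```

ifelse() evaluates both branches on the whole vector and then picks element-wise. The log(0) = -Inf computed for the zero entries is therefore discarded and replaced by log1p(0) = 0.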

How does R treat positional arguments

I'm a Python guy and very new to R (so far, all I've done is copy-paste code and screenshot the resulting graph).
I would now like to actually learn the language so that I can draw useful plots (right now, I am trying to plot this).
In attempting my first plot, I came across this function call:
sets_options("universe", seq(from = 0, to = 25, by = 0.1))
Now, I would like to know if I can achieve the same result by calling
sets_options("universe", seq(0, 25, 0.1))
The help page for seq doesn't speak to this specifically (or I'm not reading it correctly), so I was hoping someone could shed some light on how R handles positional arguments.
I tried calling the function that way in R and it worked (no syntax errors, etc.), but I don't know how to test the output of that function, so I'm forced to ask here.
Calling sets_options() will display the current settings. From the following log, it seems that the positional arguments are treated as expected:
> sets_options("universe", seq(0,5,0.25))
> sets_options()
$quote
[1] TRUE
$hash
[1] TRUE
$openbounds
[1] "()"
$universe
[1] 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 3.75 4.00 4.25 4.50 4.75 5.00
> sets_options("universe", seq(from=0,to=5,by=0.25))
> sets_options()
$quote
[1] TRUE
$hash
[1] TRUE
$openbounds
[1] "()"
$universe
[1] 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 3.75 4.00 4.25 4.50 4.75 5.00
The question is what seq does with positional versus named arguments. The way to address this is to look at the ?seq page, which lays out the named arguments and their order:
seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
length.out = NULL, along.with = NULL, ...)
So seq(0, 25, 0.1) will be interpreted the same way as seq(from = 0, to = 25, by = 0.1), since the order of the positional arguments is the same as the order of the names in the Usage listing.
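A quick way to check this yourself, as a small sketch: compare the results with identical(). The variable names below are made up for illustration.

```r
# Purely positional: matched to from, to, by in that order
positional <- seq(0, 5, 0.25)

# Fully named: the order of the arguments no longer matters
named     <- seq(from = 0, to = 5, by = 0.25)
reordered <- seq(by = 0.25, to = 5, from = 0)

# Mixed: named arguments are matched first, and the remaining
# positional argument fills the first unmatched slot (from)
mixed <- seq(0, by = 0.25, to = 5)

print(identical(positional, named))
print(identical(positional, reordered))
print(identical(positional, mixed))
print(length(positional))
```

All four calls describe the same sequence, so each identical() comparison should report TRUE.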
