How to create a function within a function - R

I am having trouble creating a single function that imputes my generated missing values. I have a data-generating function, a function that generates missing values, and a function that imputes the missing values. How can I combine them into one function?
# generate data; rvm() here is assumed to come from the CircStats package
# (random draws from the von Mises distribution)
library(CircStats)
data <- function(n, alpha, kappa, miu) {  # note: miu is currently unused
  X <- rvm(n, alpha, kappa)
  delta <- rvm(n, 0, kappa)
  epsilon <- rvm(n, 0, kappa)
  x <- (X + delta) %% (2 * pi)
  Y <- (alpha + X) %% (2 * pi)
  y <- (Y + epsilon) %% (2 * pi)
  sample <- cbind(x, y)
  return(sample)
}
# generate missing values (ampute() and mice() are from the mice package)
library(mice)
misVal <- ampute(data = data(10, 0.7854, 5, 0), prop = 0.25, bycases = FALSE)
# impute missing values; ampute() returns a mads object whose $amp component
# holds the amputed data frame
impData <- mice(misVal$amp, m = 5, maxit = 50, meth = 'pmm', seed = 500)
summary(impData)

Combine all three steps into one custom function...
nameYourFunction <- function(n, alpha, kappa, miu) {
  X <- rvm(n, alpha, kappa)
  delta <- rvm(n, 0, kappa)
  epsilon <- rvm(n, 0, kappa)
  x <- (X + delta) %% (2 * pi)
  Y <- (alpha + X) %% (2 * pi)
  y <- (Y + epsilon) %% (2 * pi)
  sample <- cbind(x, y)
  # generate missing values
  misVal <- ampute(data = sample, prop = 0.25, bycases = FALSE)
  # impute missing values (note the $amp: ampute() returns a mads object)
  impData <- mice(misVal$amp, m = 5, maxit = 50, meth = 'pmm', seed = 500)
  return(impData)
}
Then to run...
final_data <- nameYourFunction(n = 10, alpha = 0.7854, kappa = 5, miu = 0)
summary(final_data)
Obviously you may want to rename the function based on your own preferences.
If you wanted something more flexible, such as being able to supply arguments to the other functions called within nameYourFunction, you would add them to the argument list on the first line. It might then end up looking like...
nameYourFunction <- function(n, alpha, kappa, miu, prop, m, maxit, meth, seed) {...}
Then supplying those values to the function call like...
final_data <- nameYourFunction(n = 10, alpha = 0.7854, kappa = 5, miu = 0, prop = 0.25, m = 5, maxit = 50, meth = 'pmm', seed = 500)
And removing the hard-coded values from within the custom function. I would probably recommend against this, though, as that is a lot of arguments to keep track of!
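If you do want that flexibility without the long call, a middle ground is default argument values: existing calls keep working and callers override only what they need. A minimal sketch (same packages as above, and the same $amp detail):
nameYourFunction <- function(n, alpha, kappa, miu,
                             prop = 0.25, m = 5, maxit = 50,
                             meth = 'pmm', seed = 500) {
  X <- rvm(n, alpha, kappa)
  delta <- rvm(n, 0, kappa)
  epsilon <- rvm(n, 0, kappa)
  x <- (X + delta) %% (2 * pi)
  Y <- (alpha + X) %% (2 * pi)
  y <- (Y + epsilon) %% (2 * pi)
  sample <- cbind(x, y)
  misVal <- ampute(data = sample, prop = prop, bycases = FALSE)
  impData <- mice(misVal$amp, m = m, maxit = maxit, meth = meth, seed = seed)
  return(impData)
}
# existing calls still work, and overrides are opt-in:
final_data <- nameYourFunction(n = 10, alpha = 0.7854, kappa = 5, miu = 0, prop = 0.5)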


Avoiding duplication in R

I am trying to fit a variety of (truncated) probability distributions to the same small set of quantiles. I can do it, but it seems to require a lot of duplication of the same code. Is there a neater way?
I am using this code by Nadarajah and Kotz to compute quantiles of the truncated distributions:
qtrunc <- function(p, spec, a = -Inf, b = Inf, ...) {
  G <- get(paste("p", spec, sep = ""), mode = "function")
  Gin <- get(paste("q", spec, sep = ""), mode = "function")
  tt <- Gin(G(a, ...) + p * (G(b, ...) - G(a, ...)), ...)
  return(tt)
}
where spec can be the name of any untruncated distribution for which R code exists, and the ... argument supplies the parameters of that untruncated distribution.
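For example (an illustrative call that is easy to check by hand), the median of a standard normal truncated to [0, Inf) comes out as qnorm(0.75):
qtrunc(p = 0.5, spec = "norm", a = 0, mean = 0, sd = 1)
# [1] 0.6744898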
To achieve the best fit I need to measure the distance between the given quantiles and those calculated using arbitrary values of the parameters of the distribution. In the case of the gamma distribution, for example, the code is as follows:
spec <- "gamma"
fit_gamma <- function(x, l = 0, h = 20, t1 = 5, t2 = 13){
ct1 <- qtrunc(p = 1/3, spec, a = l, b = h, shape = x[1],rate = x[2])
ct2 <- qtrunc(p = 2/3, spec, a = l, b = h, shape = x[1],rate = x[2])
dist <- vector(mode = "numeric", length = 2)
dist[1] <- (t1 - ct1)^2
dist[2] <- (t2- ct2)^2
return(sqrt(sum(dist)))
}
where l is the lower truncation point, h is the upper, and I am given the two tertiles t1 and t2.
Finally, I seek the best fit using optim, thus:
gamma_fit <- optim(par = c(2, 4),
                   fn = fit_gamma,
                   l = l,
                   h = h,
                   t1 = t1,
                   t2 = t2,
                   method = "L-BFGS-B",
                   lower = c(1.01, 1.4))
Now suppose I want to do the same thing but fitting a normal distribution instead. The names of the parameters of the normal distribution that I am using in R are mean and sd.
I can achieve what I want but only by writing a whole new function fit_normal that is extremely similar to my fit_gamma function but with the new parameter names used in the definition of ct1 and ct2.
The problem of duplication of code becomes very severe because I wish to try fitting a large number of different distributions to my data.
What I want to know is whether there is a way of writing a generic fit_spec, as it were, so that I do not have to write out the parameter names myself.
Use x as a named list to create a list of arguments to pass into qtrunc() using do.call().
fit_distro <- function(x, spec, l = 0, h = 20, t1 = 5, t2 = 13) {
  args <- c(x, list(spec = spec, a = l, b = h))
  ct1 <- do.call(qtrunc, args = c(list(p = 1/3), args))
  ct2 <- do.call(qtrunc, args = c(list(p = 2/3), args))
  dist <- vector(mode = "numeric", length = 2)
  dist[1] <- (t1 - ct1)^2
  dist[2] <- (t2 - ct2)^2
  return(sqrt(sum(dist)))
}
This is called as follows and gives the same result as your original function:
fit_distro(list(shape = 2, rate = 3), "gamma")
# [1] 13.07425
fit_gamma(c(2, 3))
# [1] 13.07425
This will work with other distributions, for however many parameters they have.
fit_distro(list(mean = 10, sd = 3), "norm")
# [1] 4.08379
fit_distro(list(shape1 = 2, shape2 = 3, ncp = 10), "beta")
# [1] 12.98371
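From here it is a small step to remove the remaining duplication around optim() as well. A sketch (the per-distribution starting values and lower bounds below are illustrative, not tuned):
start_pars <- list(
  gamma = c(shape = 2, rate = 4),
  norm  = c(mean = 10, sd = 3)
)
lower_bnds <- list(
  gamma = c(1.01, 1.4),  # bounds from the original gamma fit
  norm  = c(-Inf, 0.01)  # sd must stay positive (illustrative)
)
fits <- lapply(names(start_pars), function(s)
  optim(par = start_pars[[s]],
        # re-attach names in case the optimizer drops them, then hand
        # the parameters to fit_distro() as a named list
        fn = function(p) fit_distro(as.list(setNames(p, names(start_pars[[s]]))), s),
        method = "L-BFGS-B",
        lower = lower_bnds[[s]]))
names(fits) <- names(start_pars)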

Can't pass variable into function in R

I am trying to fit a list of data frames, and I can't figure out why I can't define conc and t0 outside of the function.
If I do it like this I get the error:
Error in nls.multstart::nls_multstart(y ~ fit_drx_mono(assoc_time, t0, :
  There must be as many parameter starting bounds as there are parameters
conc <- 5e-9
t0 <- 127
nls.multstart::nls_multstart(y ~ fit_mono(assoc_time, t0, conc, kon, koff, ampon, ampoff),
                             data = data_to_fit,
                             iter = 100,
                             start_lower = c(kon = 1e4, koff = 0.00001, ampon = 0.05, ampoff = 0),
                             start_upper = c(kon = 1e7, koff = 0.5, ampon = 0.6, ampoff = 0.5),
                             lower = c(kon = 0, koff = 0, ampon = 0, ampoff = 0))
When I specify the values inside the function everything works as it is supposed to, and I don't understand why.
It turned out I cannot pass data = data_to_fit and still rely on variables defined outside of it: apparently every name in the formula that is not a column of that data frame is treated as a parameter to be estimated, so t0 and conc were counted as parameters and no longer matched the four starting bounds. Once I defined every variable outside of the function without specifying data, it worked.
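An alternative that keeps data = data_to_fit is to wrap the model function so the fixed values are baked in and never appear in the formula, leaving only the real parameters for nls_multstart to count. A minimal sketch (fit_mono and data_to_fit are from the question):
conc <- 5e-9
t0 <- 127
# wrapper with conc and t0 fixed via the enclosing environment
fit_mono_fixed <- function(assoc_time, kon, koff, ampon, ampoff) {
  fit_mono(assoc_time, t0, conc, kon, koff, ampon, ampoff)
}
nls.multstart::nls_multstart(y ~ fit_mono_fixed(assoc_time, kon, koff, ampon, ampoff),
                             data = data_to_fit,
                             iter = 100,
                             start_lower = c(kon = 1e4, koff = 0.00001, ampon = 0.05, ampoff = 0),
                             start_upper = c(kon = 1e7, koff = 0.5, ampon = 0.6, ampoff = 0.5),
                             lower = c(kon = 0, koff = 0, ampon = 0, ampoff = 0))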

"One after the other" realisation of discrete random variables

I'm stuck with the following problem:
We are given n+1 discrete random variables:
X = {1,...,n} with P(X = i) = p_i
Y_i = {1,...,n_i} with P(Y_i = j) = p_ij, for i = 1,...,n
We do the following:
We draw from X, and the result determines which Y_i we use for the next step: if X = a, we use Y_a.
We draw from this Y_a.
Now my questions about this:
1) How do I get the expected value and the variance of the whole process?
2) Can this "process" be defined by a single random variable?
3) Assume we only know the EV and Var of all Y_i, but not all (or even none) of the probabilities. Can we still calculate the EV and Var of the whole process?
4) If 2) can be done, how can this be done efficiently in R?
To give you an example of what I've tried:
X = {1,2} with P(x = 1) = 0.3 and P(x = 2) = 0.7
Y_1 = {2,3} with P(y_1 = 2) = 0.5 and P(y_1 = 3) = 0.5
Y_2 = {1,5,20} with P(y_2 = 1) = 0.3, P(y_2 = 5) = 0.6 and P(y_2 = 20) = 0.1
I have tried to combine these into a single random variable Z, but I'm not sure if it can be done this way:
Z = {2,3,1,5,20} with probabilities (0.5*0.3, 0.5*0.3, 0.3*0.7, 0.6*0.7, 0.1*0.7)
The weighted EV is correct, but the "weighted" Var differs - if it is even correct here to use the variance formula for a linear combination of independent random variables. (Maybe that formula for the combined Var is simply wrong.)
I used R and the package "discreteRV":
install.packages("discreteRV")
library(discreteRV)
# defining the RVs
Y_1 <- RV(outcomes = c(2, 3), probs = c(0.5, 0.5))          # occurs 30% of the time
Y_2 <- RV(outcomes = c(1, 5, 20), probs = c(0.3, 0.6, 0.1)) # occurs 70% of the time
Z <- RV(outcomes = c(2, 3, 1, 5, 20),
        probs = c(0.5 * 0.3, 0.5 * 0.3, 0.3 * 0.7, 0.6 * 0.7, 0.1 * 0.7))
#calculating the EVs
E(Z)
E(Y_1)*0.3 + E(Y_2)*0.7
#calculating the VARs
V(Z)
V(Y_1)*(0.3)^2 + V(Y_2)*(0.7)^2
Thank you for your help.
Z is a mixture of Y1 and Y2 over an enlarged sample space, not a linear superposition of the two components. In other words, we should read Z as "Y1 with probability 0.3, Y2 with probability 0.7" rather than Z = 0.3*Y1 + 0.7*Y2.
Since we have
V(Z) = E(Z**2) - E(Z)**2
> E(Z**2) - E(Z)**2
[1] 20.7684
> V(Z)
[1] 20.7684
Expanding the term E(Z)**2 produces cross-product terms between the Y1 and Y2 parts, which is why V(Z) != V(Y_1)*(0.3)^2 + V(Y_2)*(0.7)^2.
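The cleanest way to see this is the law of total variance: V(Z) = E[V(Z|X)] + V(E(Z|X)). The first term is the mixture-weighted average of the component variances (weights 0.3 and 0.7, not their squares); the second is the spread of the component means around E(Z). This also answers question 3): the weights plus each Y_i's mean and variance are enough. A quick check with the discreteRV objects defined above:
within  <- 0.3 * V(Y_1) + 0.7 * V(Y_2)                        # E[V(Z|X)]
between <- 0.3 * (E(Y_1) - E(Z))^2 + 0.7 * (E(Y_2) - E(Z))^2  # V(E(Z|X))
within + between
# [1] 20.7684, matching V(Z)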

MLE error: initial value in 'vmmin' is not finite

We simulated a data set and created a model.
set.seed(459)
# seed mass
n <- 1000
seed.mass <- round(rnorm(n, mean = 250, sd = 75),digits = 1)
## Setting up the deterministic function
detFunc <- function(a, b, x) {
  return(exp(a + b * x)) / (1 + exp(a + b * x))
}
# inverse-logit link function for the binomial
inv.link <- function(z) {
  p <- 1 / (1 + exp(-z))
  return(p)
}
#setting a and b values
a <- -2.109
b <- 0.02
# Simulating data
germination <- rbinom(n = n, size = 10,
                      p = inv.link(detFunc(x = seed.mass, a = a, b = b))) / 10
## make data frame
mydata <- data.frame("predictor" = seed.mass, "response" = germination)
# plotting the data
tmp.x <- seq(0, 1e3, length.out = 500)
plot(germination ~ seed.mass,
     xlab = "seed mass (mg)",
     ylab = "germination proportion")
lines(tmp.x, inv.link(detFunc(x = tmp.x, a = a, b = b)), col = "red", lwd = 2)
When we check the model we created and infer the parameters, we get an error:
Error in optim(par = c(a = -2.109, b = 0.02), fn = function (p) : initial value in 'vmmin' is not finite
library(bbmle)
mod1 <- mle2(response ~ dbinom(size = 10,
                               p = inv.link(detFunc(x = predictor, a = a, b = b))),
             data = mydata,
             start = list("a" = -2.109, "b" = 0.02))
We're stumped and can't figure out why we're getting this error.
Your problem is that you're trying to fit a binomial outcome (which must be an integer) to a proportion.
You can use round(response*10) as your response, to put the proportion back on the count scale; round() is needed because (a/b)*b is not always exactly equal to a in floating-point arithmetic. Specifically, with your setup
mod1 <- mle2(round(response * 10) ~ dbinom(size = 10,
                                           p = inv.link(detFunc(x = predictor, a = a, b = b))),
             data = mydata,
             start = list(a = -2.109, b = 0.02))
works fine. coef(mod1) is {-1.85, 0.018}, plausibly close to the true values you started with. (We don't expect to recover the true values exactly, except as the average of many simulations; even then, MLE is only asymptotically unbiased, i.e. for large data sets.)
The proximal problem is that evaluating dbinom() with a non-integer value returns zero (with a warning), so the log-likelihood is not finite at the starting values. The full output from your model fit would have been:
Error in optim(par = c(a = -2.109, b = 0.02), fn = function (p) :
initial value in 'vmmin' is not finite
In addition: There were 50 or more warnings (use warnings() to see the first 50)
It's always a good idea to check those additional warnings ... in this case they are all of the form
1: In dbinom(x = c(1, 1, 1, 0.8, 1, 1, 1, 1, 1, 1, 1, 0.8, ... :
non-integer x = 0.800000
which might have given you a clue ...
PS you can use qlogis() and plogis() from base R for your link and inverse-link functions ...
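For instance, a minimal sketch of that substitution, with everything else exactly as in the fit above (plogis() is the base-R inverse logit, so it replaces inv.link one-for-one):
mod2 <- mle2(round(response * 10) ~ dbinom(size = 10,
                                           p = plogis(detFunc(x = predictor, a = a, b = b))),
             data = mydata,
             start = list(a = -2.109, b = 0.02))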

MXNet: sequence length in LSTM in a non-sequence data (R)

My data are not a time series, but they do have sequential structure.
Consider one sample:
data1 = matrix(rnorm(10, 0, 1), nrow = 1)
label1 = rnorm(1, 0, 1)
label1 is a function of data1, but the data matrix is not a time series. I suspect the label is a function not just of the current data sample but also of older samples, which are naturally ordered in time (not sampled randomly); in other words, the data samples are dependent on one another.
I have a batch of examples, say, 16.
With that, I want to understand how I can design an RNN/LSTM model that will memorize all 16 examples from the batch to construct the internal state. I am especially confused by the seq_len parameter, which as I understand it refers to the length of the time series used as input to the network, which is not the case here.
Now this piece of code (taken from a timeseries example) only confuses me because I don't see how my task fits in.
rm(symbol)
symbol <- rnn.graph.unroll(seq_len = 5,
                           num_rnn_layer = 1,
                           num_hidden = 50,
                           input_size = NULL,
                           num_embed = NULL,
                           num_decode = 1,
                           masking = F,
                           loss_output = "linear",
                           dropout = 0.2,
                           ignore_label = -1,
                           cell_type = "lstm",
                           output_last_state = F,
                           config = "seq-to-one")
graph.viz(symbol, type = "graph", direction = "LR",
          graph.height.px = 600, graph.width.px = 800)
train.data <- mx.io.arrayiter(data = matrix(rnorm(100, 0, 1), ncol = 20),
                              label = rnorm(20, 0, 1),
                              batch.size = 20,
                              shuffle = F)
Sure, you can treat them as time steps and apply an LSTM. Also check out this example: https://github.com/apache/incubator-mxnet/tree/master/example/multivariate_time_series as it might be relevant for your case.
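If you go that route, the only change to the unrolled graph above is seq_len, which becomes the number of ordered samples you want the hidden state to span, e.g. your batch of 16. A sketch reusing the question's own call:
symbol <- rnn.graph.unroll(seq_len = 16,  # one "time step" per ordered sample
                           num_rnn_layer = 1,
                           num_hidden = 50,
                           input_size = NULL,
                           num_embed = NULL,
                           num_decode = 1,
                           masking = F,
                           loss_output = "linear",
                           dropout = 0.2,
                           ignore_label = -1,
                           cell_type = "lstm",
                           output_last_state = F,
                           config = "seq-to-one")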
