I am trying to fit a list of data frames and I can't figure out why I can't define conc and t0 outside of the function.
If I do it like this I get the error:
'Error in nls.multstart::nls_multstart(y ~ fit_drx_mono(assoc_time,
t0, : There must be as many parameter starting bounds as there are
parameters'
conc <- 5e-9
t0 <- 127
nls.multstart::nls_multstart(y ~ fit_mono(assoc_time, t0, conc, kon, koff, ampon, ampoff),
data = data_to_fit,
iter = 100,
start_lower = c(kon = 1e4, koff = 0.00001, ampon = 0.05, ampoff = 0),
start_upper = c(kon = 1e7, koff = 0.5, ampon = 0.6, ampoff = 0.5),
lower = c(kon = 0, koff = 0, ampon = 0, ampoff = 0))
When I specify the values inside the function call, everything works as it is supposed to, and I don't understand why.
It turned out I cannot pass data = data_to_fit: with data supplied, the function looks for variables only in that data frame. Once I defined every variable outside the function, without specifying data, it worked.
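The lookup behaviour can be illustrated with base R's eval(), which is essentially how formula-based fitters resolve names: columns of the supplied data mask same-named variables outside it. (nls-style fitters go a step further: a formula variable that is not a column of data is treated as a parameter to estimate, which would explain the bounds-count error above, since t0 and conc were counted as two extra parameters.) A minimal base-R sketch:

```r
x <- 10            # defined in the global environment
z <- 100           # not a column of the data frame
df <- data.frame(x = 1:3, y = 4:6)

# Columns of df mask the same-named global variable:
eval(quote(x + y), df)   # uses df$x (1:3), not the global x -> 5 7 9

# A name missing from df falls back to the calling environment:
eval(quote(y + z), df)   # -> 104 105 106
```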
I have a problem creating a single function to impute my generated missing values. I have a data-generating function, a missing-value-generating step, and an imputation step. How can I combine them into one function?
# generate data (rvm() draws from a von Mises distribution,
# e.g. from the CircStats or circular package)
# note: the miu argument is currently unused in the body
data <- function(n, alpha, kappa, miu) {
  X <- rvm(n, alpha, kappa)
  delta <- rvm(n, 0, kappa)
  epsilon <- rvm(n, 0, kappa)
  x <- (X + delta) %% (2 * pi)
  Y <- (alpha + X) %% (2 * pi)
  y <- (Y + epsilon) %% (2 * pi)
  sample <- cbind(x, y)
  return(sample)
}
# generate missing values
misVal <- ampute(data = data(10, 0.7854, 5, 0), prop = 0.25, bycases = FALSE)
# impute missing values (ampute() returns a 'mads' object; the amputed data is in $amp)
impData <- mice(misVal$amp, m = 5, maxit = 50, meth = 'pmm', seed = 500)
summary(impData)
Combine all three functions into one large custom function...
nameYourFunction <- function(n, alpha, kappa, miu) {
  X <- rvm(n, alpha, kappa)
  delta <- rvm(n, 0, kappa)
  epsilon <- rvm(n, 0, kappa)
  x <- (X + delta) %% (2 * pi)
  Y <- (alpha + X) %% (2 * pi)
  y <- (Y + epsilon) %% (2 * pi)
  sample <- cbind(x, y)
  # generate missing values
  misVal <- ampute(data = sample, prop = 0.25, bycases = FALSE)
  # impute missing values (the amputed data lives in misVal$amp)
  impData <- mice(misVal$amp, m = 5, maxit = 50, meth = 'pmm', seed = 500)
  return(impData)
}
Then to run...
final_data <- nameYourFunction(n = 10, alpha = 0.7854, kappa = 5, miu = 0)
summary(final_data)
Obviously you may want to rename the function based on your own preferences.
If you wanted something more flexible, such as being able to supply arguments to the other functions called within nameYourFunction, you would add them to the argument list in the first line of code. It might end up looking like...
nameYourFunction <- function(n,alpha,kappa,miu,prop,m,maxit,meth,seed){...}
Then supply those values in the function call:
final_data <- nameYourFunction(n = 10, alpha = 0.7854, kappa = 5, miu = 0, prop = 0.25, m = 5, maxit = 50, meth = 'pmm', seed = 500)
And remove the hard-coded values from within the custom function. I would probably recommend against this, though, as that is a lot of arguments to keep track of!
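A middle ground, if the long argument list is the concern, is to give the extra arguments default values so callers only override what they need. A minimal sketch (the body here is a stand-in; only the signature matters):

```r
nameYourFunction <- function(n, alpha, kappa, miu,
                             prop = 0.25, m = 5, maxit = 50,
                             meth = "pmm", seed = 500) {
  # ... body as above, using prop, m, maxit, meth, seed ...
  list(n = n, prop = prop, m = m)   # stand-in return for illustration
}

nameYourFunction(10, 0.7854, 5, 0)               # all defaults used
nameYourFunction(10, 0.7854, 5, 0, prop = 0.5)   # override just one
```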
My simple code is yielding:
Error in `dimnames<-.data.frame`(`*tmp*`, value = list(n)) :
  invalid 'dimnames' given for data frame
Any help appreciated
library(bvarsv)
library(tidyverse)
library(janitor)
library(readxl)
set.seed(1)
test <- read_excel("Desktop/test.csv")  # note: read_excel() is for .xls/.xlsx; use read.csv() for a .csv file
bvar.sv.tvp(test, p = 2, tau = 40, nf = 10, pdrift = TRUE, nrep = 50000,
nburn = 5000, thinfac = 10, itprint = 10000, save.parameters = TRUE,
k_B = 4, k_A = 4, k_sig = 1, k_Q = 0.01, k_S = 0.1, k_W = 0.01,
pQ = NULL, pW = NULL, pS = NULL)
Edit:
The documentation specifies:
Y - Matrix of data, where rows represent time and columns are different variables. Y must have at least two columns.
So when you read in your dataset, time will initially be a column, meaning you have to transform the data frame so that the time column becomes the row names. (You may also want to use the lubridate package to parse the time column first.)
tst_df <- read.csv("Desktop/tst.csv", header = TRUE, sep = ",")
rownames(tst_df) <- tst_df[, 1]  # use the time column as row names
tst_df[, 1] <- NULL              # then drop it
bvar.sv.tvp(tst_df, ...)
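With a toy data frame (the column names here are hypothetical), the transformation looks like this:

```r
tst_df <- data.frame(time = c("2001 Q1", "2001 Q2", "2001 Q3"),
                     gdp  = c(1.1, 1.3, 0.9),
                     infl = c(2.0, 2.1, 1.9))
rownames(tst_df) <- tst_df[, 1]   # move the time column into the row names
tst_df[, 1] <- NULL               # drop it, leaving only numeric columns
# tst_df is now a 3 x 2 numeric data frame, the shape bvar.sv.tvp() expects
```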
You can also use the usmacro dataset as an example of how the input data for bvar.sv.tvp() should look:
data(usmacro)
print(usmacro)
Original Post:
I don't know what your csv looks like, so it is hard to tell what the actual issue is.
But you can try wrapping your data in as.data.frame(test), like this:
bvar.sv.tvp(as.data.frame(test), p = 2, tau = 40, nf = 10, pdrift = TRUE, nrep = 50000,
nburn = 5000, thinfac = 10, itprint = 10000, save.parameters = TRUE,
k_B = 4, k_A = 4, k_sig = 1, k_Q = 0.01, k_S = 0.1, k_W = 0.01,
pQ = NULL, pW = NULL, pS = NULL)
I am learning to use the XGBoost package in R and I encountered some very weird behaviour that I'm not sure how to explain. Perhaps someone can give me some directions. I simplified the R code as much as possible:
rm(list = ls())
library(xgboost)
setwd("/home/my_username/Documents/R_files")
my_data <- read.csv("my_data.csv")
my_data$outcome_01 = ifelse(my_data$outcome_continuous > 0.0, 1, 0)
reg_features = c("feature_1", "feature_2")
class_features = c("feature_1", "feature_3")
set.seed(93571)
train_data = my_data[seq(1, nrow(my_data), 2), ]
mm_reg_train = model.matrix(~ . + 0, data = train_data[, reg_features])
train_DM_reg = xgb.DMatrix(data = mm_reg_train, label = train_data$outcome_continuous)
var_nrounds = 190
xgb_reg_model = xgb.train(data = train_DM_reg, booster = "gbtree", objective = "reg:squarederror",
nrounds = var_nrounds, eta = 0.07,
max_depth = 5, min_child_weight = 0.8, subsample = 0.6, colsample_bytree = 1.0,
verbose = F)
mm_class_train = model.matrix(~ . + 0, data = train_data[, class_features])
train_DM_class = xgb.DMatrix(data = mm_class_train, label = train_data$outcome_01)
xgb_class_model = xgb.train(data = train_DM_class, booster = "gbtree", objective = "binary:logistic",
eval_metric = 'auc', nrounds = 70, eta = 0.1,
max_depth = 3, min_child_weight = 0.5, subsample = 0.75, colsample_bytree = 0.5,
verbose = F)
probabilities = predict(xgb_class_model, newdata = train_DM_class, type = "response")
print(paste0("simple check: ", sum(probabilities)), quote = F)
Here is the problem: The outcome of sum(probabilities) depends on the value of var_nrounds!
How can that be? After all, var_nrounds enters only xgb_reg_model, while the probabilities are computed with xgb_class_model, which does not (and should not) know anything about the value of var_nrounds. The only thing I change in this code is the value of var_nrounds, and yet the sum of the probabilities changes when I rerun it. It also changes deterministically: with var_nrounds = 190 I always get (with my data) 5324.3, and with var_nrounds = 285 I get 5322.8. However, if I remove the line set.seed(93571), the result changes non-deterministically every time I rerun the code.
Could it be that XGBoost has some built-in stochastic behaviour that depends on the number of rounds run beforehand in another model, and that is also controlled by setting a seed before training? Any ideas?
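A plausible explanation (an assumption, not verified against xgboost internals): with subsample < 1 and colsample_bytree < 1, the R package draws its randomness from R's session RNG, so training the first model advances the RNG state by an amount that depends on var_nrounds, and the second model then starts sampling from a different state. The effect can be sketched with runif() standing in for the samplers:

```r
set.seed(93571)
first  <- runif(190)   # "first model" consumes 190 draws
probs1 <- runif(3)     # state reaching the "second model"

set.seed(93571)
first  <- runif(285)   # change only the first model's workload
probs2 <- runif(3)     # the second model now sees different draws

identical(probs1, probs2)   # FALSE: both models share one RNG stream
```

This also matches the observation that removing set.seed() makes the result vary from run to run.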
We simulated a data set and created a model.
set.seed(459)
# seed mass
n <- 1000
seed.mass <- round(rnorm(n, mean = 250, sd = 75),digits = 1)
## Setting up the deterministic function
detFunc <- function(a,b,x){
return(exp(a+b*x)) / (1+exp(a+b*x))
}
# logit link function for the binomial
inv.link <- function(z){
p <-1/(1+exp(-z))
return(p)
}
#setting a and b values
a <- -2.109
b <- 0.02
# Simulating data
germination <- (rbinom(n = n, size = 10,
p = inv.link(detFunc(x = seed.mass, a = a, b = b))
))/10
## make data frame
mydata <- data.frame("predictor" = seed.mass, "response" = germination)
# plotting the data
tmp.x <- seq(0,1e3,length.out=500)
plot(germination ~ seed.mass,
xlab = "seed mass (mg)",
ylab = "germination proportion")
lines(tmp.x,inv.link(detFunc(x = tmp.x, a = a, b = b)),col="red",lwd=2)
When we check the model we created and infer the parameters, we get an error:
Error in optim(par = c(a = -2.109, b = 0.02), fn = function (p) : initial value in 'vmmin' is not finite
library(bbmle)
mod1<-mle2(response ~ dbinom(size = 10,
p = inv.link(detFunc(x = predictor, a = a, b = b))
),
data = mydata,
start = list("a"= -2.109 ,"b"= 0.02))
We're stumped and can't figure out why we're getting this error.
Your problem is that you're trying to fit a binomial outcome (which must be an integer count) to a proportion.
You can use round(response*10) as your response, to put the proportion back on the count scale; round() is needed because (a/b)*b is not always exactly equal to a in floating-point arithmetic. Specifically, with your setup
mod1 <- mle2(round(response*10) ~ dbinom(size = 10,
p = inv.link(detFunc(x = predictor, a = a, b = b))
),
data = mydata,
start = list(a = -2.109 ,b = 0.02))
works fine. coef(mod1) is {-1.85, 0.018}, plausibly close to the true values you started with (we don't expect to recover the true values exactly, except as the average of many simulations, and even then the MLE is only asymptotically unbiased, i.e. for large data sets).
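The floating-point caveat behind the round() call can be seen with a classic example:

```r
(1/49) * 49 == 1          # FALSE: 1/49 has no exact binary representation
round((1/49) * 49) == 1   # TRUE: rounding recovers the intended integer
```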
The proximal problem is that trying to evaluate dbinom() with a non-integer value gives NA. The full output from your model fit would have been:
Error in optim(par = c(a = -2.109, b = 0.02), fn = function (p) :
initial value in 'vmmin' is not finite
In addition: There were 50 or more warnings (use warnings() to see the first 50)
It's always a good idea to check those additional warnings ... in this case they are all of the form
1: In dbinom(x = c(1, 1, 1, 0.8, 1, 1, 1, 1, 1, 1, 1, 0.8, ... :
non-integer x = 0.800000
which might have given you a clue ...
PS you can use qlogis() and plogis() from base R for your link and inverse-link functions ...
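To confirm the PS, the base-R functions agree with the hand-rolled versions:

```r
inv.link <- function(z) 1 / (1 + exp(-z))
z <- c(-2, 0, 1.5)
all.equal(plogis(z), inv.link(z))   # TRUE: plogis() is the inverse-logit
all.equal(qlogis(plogis(z)), z)     # TRUE: qlogis() is the logit (link)
```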
My data are not time series, but they have sequential properties.
Consider one sample:
data1 = matrix(rnorm(10, 0, 1), nrow = 1)
label1 = rnorm(1, 0, 1)
label1 is a function of data1, but the data matrix is not a time series. I suppose the label is a function not of just one data sample but of several older samples as well, which are naturally ordered in time (not sampled randomly); in other words, the data samples are dependent on one another.
I have a batch of examples, say, 16.
With that, I want to understand how I can design an RNN/LSTM model that will memorize all 16 examples from the batch to construct its internal state. I am especially confused by the seq_len parameter, which as I understand it is specifically the length of the time series used as input to the network, which is not my case.
Now this piece of code (taken from a time-series example) only confuses me, because I don't see how my task fits in.
if (exists("symbol")) rm(symbol)  # avoid an error on the first run
symbol <- rnn.graph.unroll(seq_len = 5,
num_rnn_layer = 1,
num_hidden = 50,
input_size = NULL,
num_embed = NULL,
num_decode = 1,
masking = F,
loss_output = "linear",
dropout = 0.2,
ignore_label = -1,
cell_type = "lstm",
output_last_state = F,
config = "seq-to-one")
graph.viz(symbol, type = "graph", direction = "LR",
graph.height.px = 600, graph.width.px = 800)
train.data <- mx.io.arrayiter(
data = matrix(rnorm(100, 0, 1), ncol = 20)
, label = rnorm(20, 0, 1)
, batch.size = 20
, shuffle = F
)
Sure, you can treat them as time steps and apply an LSTM. Also check out this example, as it might be relevant for your case: https://github.com/apache/incubator-mxnet/tree/master/example/multivariate_time_series
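One way to map this setup onto seq_len (a framework-agnostic sketch; the sliding-window construction is an assumption about how you want to use the history): treat each position as the end of a window of win_len consecutive samples, so win_len plays the role of seq_len. In base R:

```r
set.seed(1)
n_samples <- 16    # one batch of ordered, dependent samples
n_feat    <- 10
win_len   <- 5     # plays the role of seq_len

data  <- matrix(rnorm(n_samples * n_feat), nrow = n_samples)
label <- rnorm(n_samples)

# One training sequence per position t: samples t-win_len+1 .. t,
# labelled with the label of the window's last sample.
windows <- lapply(win_len:n_samples, function(t) {
  list(x = data[(t - win_len + 1):t, , drop = FALSE],  # win_len x n_feat
       y = label[t])
})

length(windows)        # 12 overlapping sequences from 16 samples
dim(windows[[1]]$x)    # 5 10
```

Each windows[[i]]$x is then one sequence of length win_len for a seq-to-one model.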