I'm asking this question because I couldn't figure out why the nlxb fitting function does not work with the predict() function.
I have been looking around for a solution, but so far no luck :(
I use dplyr to group data and use do to fit each group using nlxb from nlmrt package.
Here is my attempt
set.seed(12345)
set =rep(rep(c("1","2","3","4"),each=21),times=1)
time=rep(c(10,seq(100,900,100),seq(1000,10000,1000),20000),times=1)
value <- replicate(1,c(replicate(4,sort(10^runif(21,-6,-3),decreasing=FALSE))))
data_rep <- data.frame(time, value,set)
> head(data_rep)
# time value set
#1 10 1.007882e-06 1
#2 100 1.269423e-06 1
#3 200 2.864973e-06 1
#4 300 3.155843e-06 1
#5 400 3.442633e-06 1
#6 500 9.446831e-06 1
* * * *
library(dplyr)
library(nlmrt)
d_step <- 1
f <- 1e9
d <- 32
formula = value~Ps*(1-exp(-2*f*time*exp(-d)))*1/(sqrt(2*pi*sigma))*exp(-(d-d_ave)^2/(2*sigma))*d_step
dffit = data_rep %>% group_by(set) %>%
do(fit = nlxb(formula ,
data = .,
start=c(d_ave=44,sigma=12,Ps=0.5),
control=nls.lm.control(maxiter = 100),
trace=TRUE))
--------------------------------------------------------
There are two things I would like to achieve:
1) First, how to get the fitted coefficients of each group, continuing from the dffit pipeline.
2) Second, how to predict from new x values, for instance:
range <- data.frame(x=seq(1e-5,20000,length.out=10000))
predict(fit,data.frame(x=range))
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "nlmrt"
Since nlxb works more smoothly than nls (see r-minpack-lmnls-lm-failed-with-good-results), I would prefer solutions that use nlxb. But if you have a better solution, please let us know.
There are no coef or predict methods for "nlmrt" class objects, but the nlmrt package does provide wrapnls, which runs nlxb and then nls so that an "nls" object results; that object can then be used with all the "nls" class methods.
Also note that nls.lm.control is from the minpack.lm package (which provides nlsLM) and should not be used here; use list instead.
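Once each group's fit is an "nls" object, coef() and predict() work group by group. Here is a minimal base-R sketch, assuming a toy exponential model fitted with nls() directly. The data, the model and the names toy, fits and new_times are made up for illustration; an "nls" object returned by wrapnls should be usable in the same way. Note that the newdata column has to carry the predictor's name from the formula (time here, not x):

```r
set.seed(1)

# Toy data: two groups following value = a * exp(-b * time) plus small noise
toy <- data.frame(
  set  = rep(c("1", "2"), each = 20),
  time = rep(seq(0, 9.5, by = 0.5), times = 2)
)
toy$value <- ifelse(toy$set == "1", 5, 3) * exp(-0.4 * toy$time) +
  rnorm(40, sd = 0.02)

# Fit one "nls" model per group
fits <- lapply(split(toy, toy$set), function(g)
  nls(value ~ a * exp(-b * time), data = g, start = list(a = 4, b = 0.3)))

# One row of coefficients per group
coefs <- t(sapply(fits, coef))

# Predictions on new values; the column is named like the predictor in the formula
new_times <- data.frame(time = seq(0, 10, length.out = 5))
preds <- sapply(fits, predict, newdata = new_times)
```

coefs has one row per set and preds one column per set, so both can be bound back to the grouping variable afterwards.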
I'm trying to get a summary plot using fastshap explain function as in the code below.
p_function_G<- function(object, newdata)
caret::predict.train(object,
newdata =
newdata,
type = "prob")[,"AntiSocial"] # select G class
# Calculate the Shapley values
#
# boostFit: is a caret model using catboost algorithm
# trainset: is the dataset used for bulding the caret model.
# The dataset contains 4 categories W,G,R,GM
# corresponding to 4 diferent animal behaviors
library(caret)
shap_values_G <- fastshap::explain(xgb_fit,
X = game_train,
pred_wrapper =
p_function_G,
nsim = 50,
newdata= game_train[which(game_test=="AntiSocial"),])
However I'm getting error
Error in 'stop_vctrs()':
can't combine latitude and gender <factor<919a3>>
What's the way out?
I see that you are adapting code from Julia Silge's Predict ratings for board games Tutorial. The original code used SHAPforxgboost for generating SHAP values, but you're using the fastshap package.
Because Shapley explanations are only recently starting to gain traction, there aren't very many standard data formats. fastshap does not like tidyverse tibbles, it only takes matrices or matrix-likes.
The error occurs because, by default, fastshap attempts to convert the tibble to a matrix. But this fails, because a matrix can only hold one type (for example, either double or factor, not both).
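The single-type constraint is easy to see in base R. In this small sketch (column names borrowed from the error message, values made up), coercing a data frame with one numeric and one factor column gives a character matrix, so the numeric information is lost:

```r
# A data frame can mix column types...
df <- data.frame(latitude = c(10.5, 20.1), gender = factor(c("f", "m")))

# ...but a matrix cannot: every cell is coerced to a common type
m <- as.matrix(df)
typeof(m)      # "character"
is.numeric(m)  # FALSE
```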
I also ran into a similar issue and found that you can solve this by passing the X parameter as a data.frame. I don't have access to your full code, but you could try replacing the shap_values_G code block like so:
shap_values_G <- fastshap::explain(xgb_fit,
X = game_train,
pred_wrapper =
p_function_G,
nsim = 50,
newdata= as.data.frame(game_train[which(game_test=="AntiSocial"),]))
Wrap newdata with as.data.frame. This converts the tibble to a dataframe and so shouldn't upset fastshap.
The R package dynlm offers extended formula notation including two functions very useful for specifying multiple lags: d and L. However these are not exported functions, and only work in the context of the dynlm() command as a replacement for lm(). The lags accept vectors for the number of lags and produce multiple series if the vectors are longer than 1. This question arises because I do not understand how to hand a function that returns a matrix or tibble of terms to the RHS of a formula in such a way that each column will receive a coefficient when the formula is interpreted, and where there are also other terms.
One purpose for this is to make it easy to specify models that are different only in the number of lags, a common task.
I've found other packages that let you do something similar, but only with their own estimation function. I want a function that works with any package that evaluates a model expressed as a formula that is linear (at least in the lags).
I would like a set of similar functions L.() and d.() that could be used in a formula in any package that accepts a formula as input, such as packages for general linear models, other time series models like ARIMA or GARCH, and the like, or in more general functions of formulas that can be defined over time series, such as nls.
I’d prefer to be agnostic about the exact specification of the time series, as I am using my data with the tsibble/fable packages but also with other packages like DREGAR, dynlm and rugarch where I have not yet established whether they choke on ts files in tsibble format. However I do not actually know if format agnosticism is possible here. If not, I’d settle for doing my own format conversions.
I would most like it if the functions could be inserted directly into formulas, creating new variables equal to the number of lags, that are treated as linear combinations in the usual way, with each variable getting its own coefficient.
So if
ts1 <- ts(1:4)
where 1 is most recent, then L.(ts1, 0:2) is
ts1_L0 ts1_L1 ts1_L2
     1     NA     NA
     2      1     NA
     3      2      1
     4      3      2
and
set.seed(1)
y <- rnorm(4)
extra_var <- 1:4 * (1+ 0.1*y)
so that if form_1 <- as.formula(y ~ L.(ts1, 0:2) + extra_var) then
lm(form_1, na.action = na.omit) would be the linear model of y on the four variables as specified above (and would actually crash since it has five coefficients estimated on two rows of data).
and for
ts2 <- ts(2^1:3),
d.(ts2, 1:2) is similarly
ts2_d1 ts2_d2
    -1     -3
    -2     -6
   -41     NA
    NA     NA
and should work in formulas in the same way.
If you do not mind the variable names being prefixed with the function call, then there is a quick and dirty workaround. The idea is that, in an lm call, variable names can be wrapped in functions such as as.factor() or lag(), which change the underlying variable. If you want to create several variables from a single variable name, as in your case with L.(var, 0:2), then the function has to return a matrix.
library(dplyr)
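L. <- function(x, k) {
  # one column per requested lag; the inner argument is called lag_k so it does not shadow k
  res <- as.matrix(dplyr::bind_cols(lapply(k, function(lag_k) dplyr::lag(x, lag_k))))
  # label the columns with the actual lag values rather than their position
  colnames(res) <- paste0("_lag", k)
  res
}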
lm(cyl ~ L.(mpg, 0:2), mtcars)
#>
#> Call:
#> lm(formula = cyl ~ L.(mpg, 0:2), data = mtcars)
#>
#> Coefficients:
#> (Intercept) L.(mpg, 0:2)_lag0 L.(mpg, 0:2)_lag1 L.(mpg, 0:2)_lag2
#> 11.334907 -0.245130 -0.015968 0.004683
Created on 2020-05-19 by the reprex package (v0.3.0)
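The same idea can be sketched without dplyr, which also sidesteps any clash between dplyr::lag and the base lag. The helper name lag_matrix and its prefix argument are hypothetical, and the sketch only handles non-negative lags on a plain numeric vector:

```r
# Hypothetical helper: build a matrix with one column per requested lag.
lag_matrix <- function(x, k, prefix = "L") {
  x <- as.numeric(x)
  n <- length(x)
  # shift by padding with NA at the front and truncating to the original length
  cols <- lapply(k, function(lag_k) c(rep(NA_real_, lag_k), x)[seq_len(n)])
  res <- matrix(unlist(cols), nrow = n)
  colnames(res) <- paste0("_", prefix, k)
  res
}

# Reproduces the layout asked for in the question (lags 0, 1 and 2 of 1:4)
m <- lag_matrix(1:4, 0:2)

# Used inside a formula, each column gets its own coefficient
set.seed(1)
n <- 20
x <- rnorm(n)
extra_var <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)
fit <- lm(y ~ lag_matrix(x, 0:2) + extra_var)
```

Because lm() drops incomplete rows by default, the first two observations are omitted here and the model is estimated on the remaining 18.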
dyn in the dyn package is similar in purpose to dynlm but works with more modeling functions and time series classes. Quoting from the help file:
"dyn" currently works with any regression function that makes use of
"model.frame" and is written in the style of "lm". This includes "lm",
"glm", "loess", "rlm" (from "MASS"), "lqs" (from "MASS"),
"MCMCregress" (from "MCMCpack"), "randomForest" (from "randomForest"),
"rq" (from "quantreg") and others. The time series objects can be one
of the following classes: "ts", "irts", "its", "zoo" or "zooreg".
The dyn help file also discusses how to add additional time series classes without modifying the dyn package itself.
dyn does not provide lag or diff but rather works with the native lag and diff of the time series class you are using. In particular, zoo supports multi-lag lag and diff functions.
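For readers unfamiliar with the native lag of "ts" objects, here is a small base-R sketch (no dyn or zoo needed). stats::lag leaves the values alone and shifts the time index, which is why a negative k is used to get past values; the ts-aware cbind then aligns the series on that index:

```r
ts1 <- ts(1:4)

# Values are unchanged; the time index moves from 1..4 to 2..5
lagged <- stats::lag(ts1, -1)
time(lagged)

# cbind on "ts" objects aligns by time and pads with NA
aligned <- cbind(ts1, lagged)
aligned
```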
There are 8 demos that come with dyn that illustrate its use with various modelling functions. demo(package = "dyn").
The way it works is that you just preface the modeling function, e.g. lm, with dyn$ . (This example will give some warnings about combining integer and numeric indexes but it will work anyways. If you are careful to ensure that all the series you use have the same class of index you can avoid those warnings.)
library(dyn)
set.seed(1)
n <- 10
ts1 <- ts(rnorm(n))
z1 <- zoo(ts1) # so that we can use zoo's multi-lag lag function
y <- rnorm(n)
extra_var <- 1:n * (1+ 0.1*y)
fm2 <- dyn$lm(y ~ lag(z1, 0:-2) + extra_var)
fm1 <- dyn$lm(y ~ lag(z1, 0:-1) + extra_var)
anova(fm1, fm2)
As an aside, if you are using dplyr, note that dplyr masks lag with a version that is incompatible with base lag and will cause numerous time series packages to fail, so be sure NOT to have dplyr loaded. Alternatively, in R 3.6 or later, you can use library(dplyr, exclude = c("lag", "filter")).
I'm working with the caret package, training a model for text classification, but I've faced a problem that bugs me and I'm not finding a proper solution.
I got a data.frame of training like this:
training <- data.frame(x = c(0,0,1),y = c(0,1,0), z = c(1,1,1), result =c('good','good','bad'))
training
x y z result
1 0 0 1 good
2 0 1 1 good
3 1 0 1 bad
So I train my model like this:
library(caret)
svm_mod <- train(result ~ ., training, method = "svmLinear")
# There were 42 warnings (use warnings() to see them) Some warnings, not the point of the question
Now let's skip the testing part, let's think that's ok.
Now I've the real work, i.e. predict unknown data. My problem is that the "applying" data can have different columns from the training dataset, and predicting is not always permitted:
# if the columns are the same, it's ok
applying <- data.frame(x = c(0,0,1),y = c(0,1,0), z = c(0,1,1))
predict(svm_mod, applying)
# if the columns in applying are more than in train, it's ok
applying <- data.frame(x = c(0,0,1),y = c(0,1,0), z = c(0,1,1), k=c(1,1,1))
predict(svm_mod, applying)
# if in applying is missing a column that is in train it does not work:
applying <- data.frame(x = c(0,0,1),y = c(0,1,0))
predict(svm_mod, applying)
# Error in eval(predvars, data, env) : object 'z' not found
Now the solution should be to add every column that is in training but missing from applying, filled with 0s:
applying$z <- 0
but I don't find this very correct or nice. Is there a proper way to do this? I've read several questions about this; my question is about finding a workaround for this issue.
My data are phrases, and I'm using a document-term matrix as input. In a production environment this means there will be new inputs that lack some of the columns seen in training.
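The zero-filling workaround can at least be wrapped in a small helper so that it is applied consistently. This is only a sketch: align_columns is a hypothetical name, and whether 0 is the right fill value depends on the features (for document-term counts it simply means "term absent"):

```r
# Hypothetical helper: make newdata carry exactly the training predictors.
align_columns <- function(newdata, train_cols, fill = 0) {
  # add every training column that newdata lacks, filled with `fill`
  for (col in setdiff(train_cols, names(newdata))) {
    newdata[[col]] <- fill
  }
  # drop columns unseen in training and restore the training order
  newdata[, train_cols, drop = FALSE]
}

train_cols <- c("x", "y", "z")
applying <- data.frame(x = c(0, 0, 1), y = c(0, 1, 0), k = c(1, 1, 1))
aligned <- align_columns(applying, train_cols)
```

The result can then be passed to predict(svm_mod, aligned).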
I am trying to run a linear discriminant analysis (lda()) on a data.frame I created by dividing several variables by an additional scaling variable (not shown here) in R using the MASS package. Below is a sample dataset and a sample version of the code I am using that reproduces the error.
class Var1 Var2 Var3 Var4
2 0.732459522 0.973014649 0.612952968 0.127216654
3 0.76692254 0.990230286 0.629448709 0.104675506
2 0.847487002 1.021663778 0.649046794 0.187175043
3 0.823583181 1.050274223 0.673674589 0.170018282
1 0.796279894 1.058458813 0.583702391 0.222320638
2 0.925681255 1.009909166 0.636663914 0.205615194
2 0.627334465 1.074702886 0.59762309 0.23344652
3 0.980376124 1.011447261 0.646770237 0.232215863
3 0.79342723 1.048826291 0.750234742 0.248826291
1 0.960655738 1.042622951 0.6 0.262295082
2 0.963788301 1.005571031 0.590529248 0.233983287
1 1.013157895 1.049342105 0.657894737 0.223684211
2 1.211538462 1.060897436 0.733974359 0.288461538
3 1.25083612 1.023411371 0.759197324 0.311036789
3 0.959196485 1.009416196 0.635907094 0.12868801
1 0.823681936 1.005185825 0.590319793 0.219533276
2 0.777508091 0.998381877 0.624595469 0.165048544
3 0.749114103 0.985825656 0.585400425 0.133947555
1 0.816999133 1.036426713 0.604509974 0.197745013
library(MASS)
data<-read.csv("data.csv",header=TRUE)
data_train<-na.omit(data)
scores_train<-data_train[-c(1)]
lda_train<-lda(data_train$class~scores_train,prior = c(1,1,1)/3,CV=TRUE)
scores_test<-data[-c(1)]
lda_test<-predict(lda_train,as.data.frame(scores_test),prior = c(1,1,1)/3)
lda_train<-lda(data_train$class~as.matrix(scores_train),prior = c(1,1,1)/3,CV=TRUE)
class(scores_train)
class(scores_test)
When I try to perform the lda using the dataset, I get the following error message.
Error in model.frame.default(formula = data_train$class ~ scores_train) :
invalid type (list) for variable 'scores_train'
I am able to make the data work by coercing it into a matrix using as.matrix. Notably, trying to do something similar using as.data.frame() and data.frame() does not work. However, when I then try to apply the resulting discriminant function to the total dataset, I get the following message...
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "list"
However, when I check the class of the objects using class(), it says both objects are data.frames. I checked the dataset to see if there were any incomplete rows or columns that could cause it to treat them as a series of lists instead of a single data.frame, but there are no missing values. Similarly, it does not appear to be due to the names of any variables.
I am not sure why R is treating the object as a list instead of a data.frame (and thereby causing the linear discriminant analysis to fail), especially as it recognizes that the objects are of class data.frame.
For lda, you have to provide a formula together with the data, so the following works if you pass a data.frame:
lda_train<-lda(class ~ .,data=data_train,prior = c(1,1,1)/3,CV=TRUE)
else if you don't provide the formula, do:
lda(grouping=data_train$class,x=data_train[,-1],prior = c(1,1,1)/3, CV=TRUE)
When you use CV=TRUE, it uses leave-one-out cross validation to give you the posterior, but unfortunately it is not able to retain the model, and you can see it:
class(lda_train)
[1] "list"
To predict, you need to train with CV=FALSE. You then provide a data.frame or matrix that has the same columns as those used for training; in your case it will be:
lda_train<-lda(class ~ .,data=data_train,prior = c(1,1,1)/3)
data_test=data.frame(Var1=rnorm(10),Var2=rnorm(10),
Var3=rnorm(10),Var4=rnorm(10))
predict(lda_train,data_test)
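The difference between the two return values can be checked on a built-in dataset. This is a sketch using iris rather than the data from the question:

```r
library(MASS)

# With CV = TRUE, lda() returns the leave-one-out results as a plain list
cv_fit <- lda(Species ~ ., data = iris, CV = TRUE)
class(cv_fit)

# Without CV, lda() returns an "lda" object that predict() understands
fit <- lda(Species ~ ., data = iris)
class(fit)

# The CV result is still useful on its own, e.g. for leave-one-out accuracy
loo_accuracy <- mean(cv_fit$class == iris$Species)

# Prediction needs the "lda" object
pred <- predict(fit, iris[1:5, ])
```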
For lda from MASS, there is no hyper-parameter to be obtained from training, so maybe you want to elaborate on why you need the cross-validation?
In case you would want to explore it, here's how you can run cross-validation for lda (note using lda2):
library(caret)
data_train$class = factor(data_train$class) # caret needs a factor outcome for classification
lda_train = train(class ~ .,data=data_train,method="lda2",
trControl = trainControl(method = "cv"))
predict(lda_train,data_test)
The formula argument is looking for a structured formula declaring how the variables relate. Each variable named must be a vector. You can pass all the names in the same dataframe whilst declaring the data argument:
lda(class ~ Var1 + Var2 + Var3 + Var4,
data = data, prior = c(1,1,1)/3, CV=TRUE)
Or pass the columns separately:
lda(data$class ~ scores_train$Var1 +
scores_train$Var2 +
scores_train$Var3 +
scores_train$Var4,
prior = c(1,1,1)/3, CV=TRUE)
For the problem of predict not accepting it as an object, you need to change CV to FALSE, otherwise it only returns a list (not a lda object which predict needs):
model <- lda(data$class ~ scores_train$Var1 +
scores_train$Var2 +
scores_train$Var3 +
scores_train$Var4,
prior = c(1,1,1)/3, CV=FALSE)
predict(model)
I am using an accelerated failure time (AFT) model with a Weibull distribution to predict data. I am doing this using the survival package in R. I split my data into training and test sets, train on the training set, and afterwards try to predict the values for the test set. To do that I pass the test set as the newdata parameter, as stated in the references. I get an error saying that newdata does not have the same size as the training data (obviously!). The function then seems to predict the values for the training set instead.
How can I predict the values for the new data?
# get data
library(KMsurv)
library(survival)
data("kidtran")
n = nrow(kidtran)
kidtran <- kidtran[sample(n),] # shuffle row-wise
kidtran.train = kidtran[1:(n * 0.8),]
kidtran.test = kidtran[(n * 0.8):n,]
# create model
aftmodel <- survreg(kidtransurv~kidtran.train$gender+kidtran.train$race+kidtran.train$age, dist = "weibull")
predicted <- predict(aftmodel, newdata = kidtran.test)
Edit: As mentioned by Hack-R, there was this line of code missing
kidtransurv <- Surv(kidtran.train$time, kidtran.train$delta)
The problem seems to be in your specification of the dependent variable.
The data and code definition of the dependent was missing from your question, so I can't see what the specific mistake was, but it did not appear to be a proper Surv() survival object (see ?survreg).
This variation on your code fixes that, makes some minor formatting improvements, and runs fine:
library(KMsurv)
library(survival)
data("kidtran")
n = nrow(kidtran)
kidtran <- kidtran[sample(n),]
n_train <- floor(n * 0.8)
kidtran.train <- kidtran[1:n_train,]
kidtran.test <- kidtran[(n_train + 1):n,] # start after the last training row so the two sets do not overlap
# Whatever kidtransurv was supposed to be is missing from your question,
# so I will replace it with something not-missing
# and I will make it into a proper survival object with Surv()
aftmodel <- survreg(Surv(time, delta) ~ gender + race + age, dist = "weibull", data = kidtran.train)
predicted <- predict(aftmodel, newdata = kidtran.test)
head(predicted)
302 636 727 121 85 612
33190.413 79238.898 111401.546 16792.180 4601.363 17698.895
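A likely reason the original call fell back to the training rows is that its formula referred to the predictors as kidtran.train$gender and so on, so predict() could not find matching columns in newdata. The effect is not specific to survreg. Here is a minimal sketch with lm() on made-up data (all names are illustrative):

```r
set.seed(42)
train <- data.frame(x = rnorm(20))
train$y <- 2 * train$x + rnorm(20)
test <- data.frame(x = rnorm(5))

# Variables hard-wired with `$`: newdata cannot replace them, so predict()
# warns and returns one value per training row
bad_fit <- lm(train$y ~ train$x)
bad_pred <- suppressWarnings(predict(bad_fit, newdata = test))
length(bad_pred)   # 20

# Bare column names plus `data =`: newdata is used as intended
good_fit <- lm(y ~ x, data = train)
good_pred <- predict(good_fit, newdata = test)
length(good_pred)  # 5
```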