I'm trying to apply a custom function to a nested dataframe
I want to apply a machine learning algorithm to predict NA values
After doing a bit of reading online, it seemed that the map function would be the most applicable here
I have a section of code that nests the dataframe and then splits the data into a test (data3) and train (data2) set - with the test dataset containing all the null values for the column to be predicted, and the train containing all the values that are not null to be used to train the ML model
dmaExtendedDataNA2 <- dmaExtendedDataNA %>%
group_by(dma) %>%
nest() %>%
mutate(data2 = map(data, ~filter(., !(is.na(mean_night_flow)))),
data3 = map(data, ~filter(., is.na(mean_night_flow))))
Here is the function I intend to use:
my_function (test,train) {
et <- extraTrees(x = train, y = train[, "mean_night_flow"], na.action = "fuse", ntree = 1000, nodesize = 2, mtry = ncol(train) * 0.9 )
test1 <- test
test1[ , "mean_night_flow"] <- 0
pred <- predict(et, newdata = test1[, "mean_night_flow"])
test1[ , "mean_night_flow"] <- pred
return(test1)
I have tried the following code, however it does not work:
dmaExtendedDataNA2 <- dmaExtendedDataNA %>%
group_by(dma) %>%
nest() %>%
mutate(data2 = map(data, ~filter(., !(is.na(mean_night_flow)))),
data3 = map(data, ~filter(., is.na(mean_night_flow))),
data4 = map(data3, data2, ~my_function(.x,.y)))
It gives the following error:
Error: Index 1 must have length 1, not 33
This is suggests that it expects a column rather than a whole dataframe. How can I get this to work?
Many thanks
Without testing on your data, I think you're using the wrong map function. purrr::map works on one argument (one list, one vector, whatever) and returns a list. You are passing it two values (data3 and data2), so we need to use:
dmaExtendedDataNA2 <- dmaExtendedDataNA %>%
group_by(dma) %>%
nest() %>%
mutate(data2 = map(data, ~filter(., !(is.na(mean_night_flow)))),
data3 = map(data, ~filter(., is.na(mean_night_flow))),
data4 = map2(data3, data2, ~my_function(.x,.y)))
If you find yourself needing more than two, you need pmap. You can use pmap for 1 or 2 arguments, it's effectively the same. The two biggest differences when migrating from map to pmap are:
your arguments need to be enclosed within a list, so
map2(data3, data12, ...)
becomes
pmap(list(data3, data12), ...)
you refer to them with double-dot number position, ..1, ..2, ..3, etc, so
~ my_function(.x, .y)
becomes
~ my_function(..1, ..2)
An alternative that simplifies your overall flow just a little.
my_function (test, train = NULL, fld = "mean_night_flow") {
if (is.null(train)) {
train <- test[ !is.na(test[[fld]]),, drop = FALSE ]
test <- test[ is.na(test[[fld]]),, drop = FALSE ]
}
et <- extraTrees(x = train, y = train[, fld], na.action = "fuse", ntree = 1000, nodesize = 2, mtry = ncol(train) * 0.9 )
test1 <- test
test1[ , fld] <- 0
pred <- predict(et, newdata = test1[, fld])
test1[ , fld] <- pred
return(test1)
}
which auto-populates train based on the missingness of your field. (I also parameterized it in case you ever need to train/test on a different field.) This changes your use to
dmaExtendedDataNA2 <- dmaExtendedDataNA %>%
group_by(dma) %>%
nest() %>%
mutate(data4 = map(data, ~ my_function(.x, fld = "mean_night_flow")))
(It's important to name fld=, since otherwise it will be confused for train.)
If you're planning on reusing data2 and/or data3 later in the pipe or analysis, then this step is not necessarily what you need.
Note: I suspect your function in under-tested or incomplete. The fact that you assign all 0 to your test1[,"mean_night_flow"] and then use those zeroes in your call to predict seems suspect. I might be missing something, but I would expect perhaps
test1 <- test
pred <- predict(et, newdata = test1)
test1[ , fld] <- pred
return(test1)
(though copying to test1 using tibble or data.frame is mostly unnecessary, since it is copied in-place and the original frame is untouched; I would be more cautious if you were using class data.table).
Related
broom::augment outputs only columns from data that are used in formula. This is problematic behavior, because being able to find something like a respondent id can be very helpful at times. Using the newdata argument could be a workaround, but it still doesn't provide a fix when working with nested data.
Additional explanations in-line:
#simulated glm data
glmdata = data.frame(ID=1:100, A=rnorm(100), B=rnorm(100)) %>% mutate(response=rbinom(length(ID),1,1/(1+exp(-2*A-3*B)) ))
#fit model, not including the ID variable
glmfit = glm(response~A+B, glmdata,family='binomial')
#ID variable is contained in glm$data
str(glmfit$data)
#works!
head(glmfit$data$ID)
#use broom::augment
augmented = glmfit %>% augment
#does not work, wth broom?!
augmented$ID
#ok ... I could use the newdata argument
augmented = glmfit %>% augment(newdata=glmdata)
augmented$ID
#however, that is a hacky workaround ....
#... and it does not fix the following scenario:
#Let's say I want to use nest
#simulated glm data
glmdata1 = data.frame(segm=1,ID=1:100, A=rnorm(100), B=rnorm(100)) %>% mutate(response=rbinom(length(ID),1,1/(1+exp(-2*A-3*B)) ))
glmdata2 = data.frame(segm=2,ID=1:100, A=rnorm(100), B=rnorm(100)) %>% mutate(response=rbinom(length(ID),1,1/(1+exp(-3*A-2*B)) ))
glmdata_nest = rbind(glmdata1,glmdata2) %>% group_by(segm) %>% nest
#fit the two models via map
glmfit_nest= glmdata_nest %>% mutate(model=map(data, glm, formula=response~A+B, family='binomial') )
#run augment via map
glmfit_nest_augmented = glmfit_nest %>% mutate(augmented = map(model,augment))
#ID is not here ...
glmfit_nest_augmented$augmented$ID
#ok, so then we have to use map2 ....
glmfit_nest_augmented = glmfit_nest %>% mutate(augmented = map2(model,data,augment,newdata=.y))
#but even this doesn't work
#also, trying to recycling glm$data does not work
glmfit_nest_augmented = glmfit_nest %>% mutate(augmented = map(model,augment,newdata=.$data))
Update:
broom developers deliberately choose this inconsistent behavior
https://github.com/tidymodels/broom/issues/753
Here the .x, and .y goes with the anonymous function call with ~
glmfit_nest_augmented <- glmfit_nest %>%
mutate(augmented = map2(model,data,~ augment(.x, newdata=.y))
I'm trying to use LIME to explain a binary classification model that I've trained using XGboost. I run into an error when calling the explain() function from LIME, which implies that I have columns that aren't matching in my model (or explainer) and the new data I'm trying to explain predictions for.
This vignette for LIME does demonstrate a version with xgboost, however it's a text problem which is a little different to my tabular data. This question seems to be encountering the same error, but also for a document term matrix, which seems to obscure the solution for my case. I've worked up a minimal example with mtcars which produced exactly the same errors I get in my own larger dataset.
library(pacman)
p_load(tidyverse)
p_load(xgboost)
p_load(Matrix)
p_load(lime)
### Prepare data with partition
df <- mtcars %>% rownames_to_column()
length <- df %>% nrow()
df_train <- df %>% select(-rowname) %>% head((length-10))
df_test <- df %>% select(-rowname) %>% tail(10)
### Transform data into matrix objects for XGboost
train <- list(sparse.model.matrix(~., data = df_train %>% select(-vs)), (df_train$vs %>% as.factor()))
names(train) <- c("data", "label")
test <- list(sparse.model.matrix(~., data = df_test %>% select(-vs)), (df_test$vs %>% as.factor()))
names(test) <- c("data", "label")
dtrain <- xgb.DMatrix(data = train$data, label=train$label)
dtest <- xgb.DMatrix(data = test$data, label=test$label)
### Train model
watchlist <- list(train=dtrain, test=dtest)
mod_xgb_tree <- xgb.train(data = dtrain, booster = "gbtree", eta = .1, nrounds = 15, watchlist = watchlist)
### Check prediction works
output <- predict(mod_xgb_tree, test$data) %>% tibble()
### attempt lime explanation
explainer <- df_train %>% select(-vs) %>% lime(model = mod_xgb_tree) ### works, no error or warning
explanation <- df_test %>% select(-vs) %>% explain(explainer, n_features = 4) ### error, Features stored names in `object` and `newdata` are different!
names_test <- test$data#Dimnames[[2]] ### 10 names
names_mod <- mod_xgb_tree$feature_names ### 11 names
names_explainer <- explainer$feature_type %>% enframe() %>% pull(name) ### 11 names
### see whether pre-processing helps
my_preprocess <- function(df){
data <- df %>% select(-vs)
label <- df$vs
test <<- list(sparse.model.matrix( ~ ., data = data), label)
names(test) <<- c("data", "label")
dtest <- xgb.DMatrix(data = test$data, label=test$label)
dtest
}
explanation <- df_test %>% explain(explainer, preprocess = my_preprocess(), n_features = 4) ### Error in feature_distribution[[i]] : subscript out of bounds
### check that the preprocessing is working ok
dtest_check <- df_test %>% my_preprocess()
output_check <- predict(mod_xgb_tree, dtest_check)
I assume that because the explainer only has the names of the original predictor columns, where test data in its transformed state also has an (Intercept) column, this is causing the problem. I just haven't figured out a neat way of preventing this occurring. Any help would be much appreciated. I assume there must be a neat solution.
If you look at this page (https://rdrr.io/cran/xgboost/src/R/xgb.Booster.R), you will see that some R users are likely to get the following error message: "Feature names stored in object and newdata are different!".
Here is the code from this page related to the error message:
predict.xgb.Booster <- function(object, newdata, missing = NA, outputmargin = FALSE, ntreelimit = NULL,predleaf = FALSE, predcontrib = FALSE, approxcontrib = FALSE, predinteraction = FALSE,reshape = FALSE, ...)
object <- xgb.Booster.complete(object, saveraw = FALSE)
if (!inherits(newdata, "xgb.DMatrix"))
newdata <- xgb.DMatrix(newdata, missing = missing)
if (!is.null(object[["feature_names"]]) &&
!is.null(colnames(newdata)) &&
!identical(object[["feature_names"]], colnames(newdata)))
stop("Feature names stored in `object` and `newdata` are different!")
identical(object[["feature_names"]], colnames(newdata)) => If the column names of object (i.e. your model based on your training set) are not identical to the column names of newdata (i.e. your test set), you will get the error message.
For more details:
train_matrix <- xgb.DMatrix(as.matrix(training %>% select(-target)), label = training$target, missing = NaN)
object <- xgb.train(data=train_matrix, params=..., nthread=2, nrounds=..., prediction = T)
newdata <- xgb.DMatrix(as.matrix(test %>% select(-target)), missing = NaN)
While setting by yourself object and newdata with your data thanks to the code above, you can probably fix this issue by looking at the differences between object[["feature_names"]] and colnames(newdata). Probably some columns that don't appear in the same order or something.
Try this in your new dataset,
colnames(test)<- make.names(colnames(test))
newdataset<- test %>% mutate_all(as.numeric)
newdataset<- as.matrix(newdataset)
nwtest<-xgb.DMatrix(newdataset)
I had the same problem but the columns weren't in alphabetical order. To fix this, I matched the order of the column names in the df_test to df_train so that the column names were in the same order.
Create list of df_test column numbers in same order as df_train:
idx<- match(colnames(df_train), colnames(df_test))
Create new df_test file using this column order:
df_test_match <- df_test[,idx]
To prevent the (Intercept) column showing up, you need to change your code slightly when creating the sparse matrix for your test data.
Change the line:
test <- list(sparse.model.matrix( ~ ., data = data), label)
to:
test <- list(sparse.model.matrix( ~ .-1, data = data), label)
Hope this helps
In the past I've used the lm function with matrix-type data and data.frame-type. But I guess this is the first time that I tried to use predict using a model fitted without a data.frame. And I'm can't figure out how to make it work.
I read some other questions (such as Getting Warning: " 'newdata' had 1 row but variables found have 32 rows" on predict.lm) and I'm pretty sure that my problem is related with the coefficient names I'm getting after fitting the model. For some reason the coefficients names are a paste of the matrix name with the column name... and I haven't been able to find how to fix that...
library(tidyverse)
library(MASS)
set.seed(1)
label <- sample(c(T,F), nrow(Boston), replace = T, prob = c(.6,.4))
x.train <- Boston %>% dplyr::filter(., label) %>%
dplyr::select(-medv) %>% as.matrix()
y.train <- Boston %>% dplyr::filter(., label) %>%
dplyr::select(medv) %>% as.matrix()
x.test <- Boston %>% dplyr::filter(., !label) %>%
dplyr::select(-medv) %>% as.matrix()
y.test <- Boston %>% dplyr::filter(., !label) %>%
dplyr::select(medv) %>% as.matrix()
fit_lm <- lm(y.train ~ x.train)
fit_lm2 <- lm(medv ~ ., data = Boston, subset = label)
predict(object = fit_lm, newdata = x.test %>% as.data.frame()) %>% length()
predict(object = fit_lm2, newdata = x.test %>% as.data.frame()) %>% length()
# they get different numbers of predicted data
# the first one gets a number a results consistent with x.train
Any help will be welcome.
I can't fix your tidyverse code because I don't work with this package. But I am able to explain why predict fails in the first case.
Let me just use the built-in dataset trees for a demonstration:
head(trees, 2)
# Girth Height Volume
#1 8.3 70 10.3
#2 8.6 65 10.3
The normal way to use lm is
fit <- lm(Girth ~ ., trees)
The variable names (on the RHS of ~) are
attr(terms(fit), "term.labels")
#[1] "Height" "Volume"
You need to provide these variables in the newdata when using predict.
predict(fit, newdata = data.frame(Height = 1, Volume = 2))
# 1
#11.16125
Now if you fit a model using a matrix:
X <- as.matrix(trees[2:3])
y <- trees[[1]]
fit2 <- lm(y ~ X)
attr(terms(fit2), "term.labels")
#[1] "X"
The variable you need to provide in newdata for predict is now X, not Height or Girth. Note that since X is a matrix variable, you need to protect it with I() when feeding it to a data frame.
newdat <- data.frame(X = I(cbind(1, 2)))
str(newdat)
#'data.frame': 1 obs. of 1 variable:
# $ X: AsIs [1, 1:2] 1 2
predict(fit2, newdat)
# 1
#11.16125
It does not matter that cbind(1, 2) has no column names. What is important is that this matrix is named X in newdat.
I am trying to write a custom function to do logistic regression-based ML with the caTools package, but I keep getting the error: undefined columns selected.
I checked the input to xlearn and ylearn arguments to the logit_boost function and, as explained in the documentation, they are respectively dataframe containing feature and a vector of labels. So not sure what I am doing wrong.
# needed libraries
library(dplyr)
library(rlang)
library(caTools)
# function body
logit_boost <- function(data, x, y, split_size = 0.8) {
# creating a dataframe
data <-
dplyr::select(.data = data,
!!rlang::enquo(x),
!!rlang::enquo(y))
# for reproducibility
set.seed(123)
# creating indices to choose rows from the data
train_indices <-
base::sample(x = base::seq_len(length.out = nrow(data)),
size = floor(split_size * nrow(data)))
# training dataset
train <- data[train_indices, ]
# testing dataset
test <- data[-train_indices, ]
# defining label column we are interested in and everything else
label_train <-
train %>% dplyr::select(.data = ., !!rlang::enquo(x))
data_train <-
train %>% dplyr::select(.data = ., -!!rlang::enquo(x))
# training model (y ~ x)
logit_model <-
caTools::LogitBoost(xlearn = data_train,
ylearn = label_train)
# prediction
# stats::predict(object = logit_model, test, type = "raw")
}
logit_boost(data = mtcars, x = am, y = mpg)
#> Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)): undefined columns selected
In help(LogitBoost) examples section, Label = iris[, 5] results in a vector, as expected in the ylearn argument to LogitBoost().
In your code, label_train <- train %>% dplyr::select(.data = ., !!rlang::enquo(x)) results in a data.frame. dplyr, by design, defaults to drop = FALSE (and even ignores the argument) when only one column is selected.
We could do:
logit_model <- caTools::LogitBoost(xlearn = data_train, ylearn = dplyr::pull(label_train))
I am building a series of Cox regression models, and getting predictions from those models on new data. I am able to get the expected number of events in some cases, but not others.
For example, if the formula in the coxph call is written out, then the predictions are calculated. But, if the the formula is stored in an object and that object called, I get an error. I also cannot get the predictions if I try to create them within a dplyr piped mutate function (for the function I am writing, this would be the most ideal place to get the predictions to work properly).
Any assistance is greatly appreciated!
Thank you,
Daniel
require(survival)
require(tidyverse)
n = 15
# creating tibble of tibbles.
results =
tibble(id = 1:n) %>%
group_by(id) %>%
do(
# creating tibble to evaluate model on
tbl0 = tibble(time = runif(n), x = runif(n)),
# creating tibble to build model on
tbl = tibble(time = runif(n), x = runif(n))
) %>%
ungroup
#it works when the formula is added the the coxph function already written out
map2(results$tbl, results$tbl0, ~ predict(coxph( Surv(time) ~ x, data = .x), newdata = .y, type = "expected"))
#but if the formula is previously defined, I get an error
f = as.formula(Surv(time) ~ x)
map2(results$tbl, results$tbl0, ~ predict(coxph( f, data = .x), newdata = .y, type = "expected"))
# I also get an error when I try to include in a dplyr pipe with mutate
results %>%
mutate(
pred = map2(tbl, tbl0, ~ predict(coxph( f, data = .x), newdata = .y, type = "expected"))
)
I figured it out (with the help of a friend). If you define the formula as a string, and within the function call coerce it to a formula everything runs smoothly. I am not sure why it works, but it does!
#define the formula as a string, and call it in the function with as.formula(.)
f = "Surv(time) ~ x"
map2(results$tbl, results$tbl0, ~ predict(coxph( as.formula(f), data = .x), newdata = .y, type = "expected"))
#also works in a dplyr pipe with mutate
results %>%
mutate(
pred = map2(tbl, tbl0, ~ predict(coxph( as.formula(f), data = .x), newdata = .y, type = "expected"))
)