Why is tidymodels with a ranger engine so much slower than ranger?

I'm taking a first look at tidymodels. My alternative for the current project would be non-tidyfied ranger. On a test run, a classification random forest with tidymodels using the ranger engine is much slower (approximately ten times slower) than calling ranger directly when run on the classic iris dataset. Why is that?
library(tidymodels)
library(ranger)
# Make example data
data("iris")
mydata <- iris[sample(1:nrow(iris), 600, replace = TRUE), ]
# Recipe
myrecipe <- mydata %>% recipe( Species ~ . )
# Setting a Ranger RF model
myRF <- rand_forest( trees = 300, mtry = 3, min_n = 1) %>%
set_mode("classification") %>%
set_engine("ranger")
# Setting a workflow
myworkflow <- workflow() %>%
add_model(myRF) %>%
add_recipe(myrecipe)
# Compare base ranger and tidy setup
time <- Sys.time()
fit_ranger <- ranger(Species ~ ., data = mydata, probability = TRUE,
                     mtry = 3, num.trees = 300, min.node.size = 1)
ranger_time <- difftime(Sys.time(), time, units = "secs")
time <- Sys.time()
fit_tidy <- myworkflow %>%
  fit(data = mydata)
tidy_time <- difftime(Sys.time(), time, units = "secs")
tidy_time
ranger_time
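A hunch worth checking rather than a confirmed diagnosis: ranger() on its own defaults to using all available CPU cores, while parsnip's ranger engine passes num.threads = 1 unless told otherwise, and on a 600-row dataset the recipe/workflow plumbing adds fixed overhead that dominates such a tiny fit. Forwarding the thread count through set_engine() makes the timing comparison fairer:
# Same model specification, but with ranger's threading restored
# (num.threads is passed through to ranger() unchanged).
myRF_mt <- rand_forest(trees = 300, mtry = 3, min_n = 1) %>%
  set_mode("classification") %>%
  set_engine("ranger", num.threads = parallel::detectCores())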

Related

Dimension error when making a neural network with Keras in R

I am trying to build a neural network using keras in R, on a dataset comprised of certain variables from another dataset. I think I have made an error in the data partition, but I am not sure what the error is.
For context, I have used feature selection methods to identify the most important features. Then, I created a dataset comprised of the most important features. I am now trying to use this dataset to train a neural network.
I am getting the following error:
Error in py_call_impl(callable, dots$args, dots$keywords) :
ValueError: Data cardinality is ambiguous:
x sizes: 3505
y sizes: 626
Make sure all arrays contain the same number of samples.
I am using the following R code:
#Loading Libraries
library(dplyr)
library(faux)
library(caret)
library(randomForest)
library(ROSE)
library(e1071)
library(corrplot)
library(tidymodels)
library(tidyverse)
library(class)
library(caTools)
library(DataExplorer)
library(glmnet)
library(xgboost)
library(class)
library(pROC)
library(neuralnet)
library(keras)
hc <- data$hc
cwtfp <- data$cwtfp
irr <- data$irr
pl_c <- data$pl_c
pl_n <- data$pl_n
forestry_change <- data$forestry_change
agriculture_change <- data$agriculture_change
growthbucket_factor_imp <- data$growthbucket_factor
data_important <- data.frame(hc, cwtfp, irr, pl_c, pl_n, forestry_change, agriculture_change, growthbucket_factor_imp)
#Splitting the important dataset
set.seed(1)
sample <- sample.split(data_important$growthbucket_factor, SplitRatio = 0.7)
train_imp <- subset(data_important, sample == TRUE)
test_imp <- subset(data_important, sample == FALSE)
#Dense Neural Network
#data preparation
data_important_matrix <- as.matrix(data_important)
dimnames(data_important_matrix) <- NULL
#Partition
set.seed(1234)
ind <- sample(2, nrow(data_important_matrix), replace = T, prob = c(0.7,0.3))
training_nn <- data_important_matrix[ind==1, 1:7]
test_nn <- data_important_matrix[ind==2, 1:7]
trainingtarget_nn <- data_important_matrix[ind==1, 8]
testtarget_nn <- data_important_matrix[ind==2, 8]
#Data processing
training_nn <- as.numeric(unlist(training_nn))
test_nn <- as.numeric(unlist(training_nn))
#one hot encoding
trainnnLabels <- to_categorical(trainingtarget_nn)
testnnLabels <- to_categorical(testtarget_nn)
#Creating the model specification
model <- keras_model_sequential()
model %>%
layer_dense(units = 8, activation = 'relu', input_shape = c(8)) %>%
layer_dense(units = 4, activation = 'relu') %>%
layer_dense(units = 1, activation = 'softmax')
summary(model)
#Compile
model %>% compile(loss = 'binary_crossentropy',
optimizer = 'adam',
metrics = 'accuracy')
#Fitting the model
history <- model %>%
fit(training_nn,
trainnnLabels,
epoch = 200,
batch_size = 32,
validation_split = 0.2)
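A hedged guess at the cause, not a verified fix: as.numeric(unlist(...)) collapses the n x 7 predictor matrix into one long vector (and the second line reuses training_nn instead of test_nn), so Keras sees an x with far more "samples" than y, which is exactly what the cardinality error complains about. A minimal sketch of that data-processing step, reusing the names from the question:
# Keep the predictors as an n x 7 numeric matrix instead of flattening them,
# and build test_nn from the test partition rather than the training one.
training_nn <- apply(data_important_matrix[ind == 1, 1:7], 2, as.numeric)
test_nn     <- apply(data_important_matrix[ind == 2, 1:7], 2, as.numeric)
# The first dense layer should then expect 7 inputs, not 8:
# layer_dense(units = 8, activation = 'relu', input_shape = c(7))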

How can I tune the minority_prop argument of the ROSE upsampling algorithm using tidymodels?

I have an imbalanced data set and am using the tidymodels framework to build predictive models. To correct for the imbalance, I use the upsampling ROSE algorithm, which has two arguments I'd like to tune, namely over_ratio and minority_prop.
To do so, I specified tune() for each of these arguments in the recipe step and then built a CV grid with the corresponding names. However, the minority_prop argument is not recognized when I run the CV search.
# data
set.seed(20)
y <- rbinom(100, 1, 0.1)
X <- MASS::mvrnorm(100, c(1,2), diag(2))
dat <- cbind(y,X)
dat <- data.frame(dat)
dat$y <- as.factor(dat$y)
# define the recipe
my_recipe <-
recipe(y ~ ., data = dat) |>
step_rose(y, over_ratio = tune(), minority_prop = tune(),
skip = TRUE) %>%
step_normalize(all_numeric_predictors(), skip = FALSE)
# MODEL
mod <-
svm_rbf(mode = "classification", cost = tune(),
rbf_sigma = tune()) %>%
set_engine("kernlab")
# set the workflow
svc_workflow <- workflow() %>%
# add the recipe
add_recipe(my_recipe) %>%
# add the model
add_model(mod)
grid_svc <- expand.grid(rbf_sigma = seq(0, 10, 2), cost = seq(0,10,2),
over_ratio = seq(0.5,1.5,0.5), minority_prop = seq(0.5,0.8,0.15))
# cv tuning
doParallel::registerDoParallel()
cv_tuning <- tune_grid(svc_workflow,
resamples = vfold_cv(dat),
grid = grid_svc,
metrics = metric_set(f_meas, precision, recall,
accuracy, pr_auc))
I then receive the following error.
Error in `check_grid()`:
! The provided `grid` has the following parameter columns that have not been marked for tuning by `tune()`: 'minority_prop'.
Run `rlang::last_error()` to see where the error occurred.
I tried tuning only over_ratio, without minority_prop, and it worked. What am I doing wrong?
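One thing worth checking (an assumption on my part, not a confirmed diagnosis): tune_grid() only accepts grid columns for arguments that the recipe step actually registers as tunable, and you can inspect that registration directly:
# List which step arguments are registered as tunable; if minority_prop does
# not appear here for step_rose, your installed themis version does not expose
# it for tuning, which would explain the check_grid() error above.
tunable(my_recipe)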

Tidymodels / XGBoost error in last_fit with rsplit value

I am trying to follow this tutorial here - https://juliasilge.com/blog/xgboost-tune-volleyball/
I am using it on the most recent Tidy Tuesday dataset about Great Lakes fishing, trying to predict AGENCY based on many other values.
ALL of the code below works except the final line, where I get the following error:
> final_res <- last_fit(final_xgb, stock_folds)
Error: Each element of `splits` must be an `rsplit` object.
I searched that error and came to this page - https://github.com/tidymodels/rsample/issues/175
That issue describes what looks like the same bug, and it appears to have been fixed, but it involves initial_time_split(), not the initial_split() I am using. I would rather not change that, because then I would have to rerun the xgboost tuning that took 9 hours. What went wrong here?
# Setup ----
library(tidyverse)
library(tidymodels)
stocked <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-06-08/stocked.csv')
stocked_modeling <- stocked %>%
mutate(AGENCY = case_when(
AGENCY != "OMNR" ~ "other",
TRUE ~ AGENCY
)) %>%
select(-SID, -MONTH, -DAY, -LATITUDE, -LONGITUDE, -GRID, -STRAIN, -AGEMONTH,
-MARK_EFF, -TAG_NO, -TAG_RET, -LENGTH, -WEIGHT, - CONDITION, -LOT_CODE,
-NOTES, - VALIDATION, -LS_MGMT, -STAT_DIST, -ST_SITE, -YEAR_CLASS, -STOCK_METH) %>%
mutate_if(is.character, factor) %>%
drop_na()
# Start making model ----
set.seed(123)
stock_split <- initial_split(stocked_modeling, strata = AGENCY)
stock_train <- training(stock_split)
stock_test <- testing(stock_split)
xgb_spec <- boost_tree(
trees = 1000,
tree_depth = tune(), min_n = tune(), loss_reduction = tune(),
sample_size = tune(), mtry = tune(),
learn_rate = tune()
) %>%
set_engine("xgboost") %>%
set_mode("classification")
xgb_grid <- grid_latin_hypercube(
tree_depth(),
min_n(),
loss_reduction(),
sample_size = sample_prop(),
finalize(mtry(), stock_train),
learn_rate(),
size = 20
)
xgb_workflow <- workflow() %>%
add_formula(AGENCY ~ .) %>%
add_model(xgb_spec)
set.seed(123)
stock_folds <- vfold_cv(stock_train, strata = AGENCY)
doParallel::registerDoParallel()
# BEWARE, THIS CODE BELOW TOOK 9 HOURS TO RUN
set.seed(234)
xgb_res <- tune_grid(
xgb_workflow,
resamples = stock_folds,
grid = xgb_grid,
control = control_grid(save_pred = TRUE)
)
# Explore results
best_auc <- select_best(xgb_res, "roc_auc")
final_xgb <- finalize_workflow(
xgb_workflow,
best_auc)
final_res <- last_fit(final_xgb, stock_folds)
If we look at the documentation of last_fit(), we see that split must be
"An rsplit object created from rsample::initial_split()".
You accidentally passed the cross-validation folds object stock_folds as split; you should have passed the rsplit object stock_split instead:
final_res <- last_fit(final_xgb, stock_split)
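Note that this fix does not require rerunning the nine-hour grid search: last_fit() takes the finalized workflow, fits it once on the training portion of stock_split, and evaluates it on the held-out test portion. For example:
final_res <- last_fit(final_xgb, stock_split)
collect_metrics(final_res)       # held-out performance metrics
collect_predictions(final_res)   # test-set predictions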

Data shuffling by sample() decreases RMSE to a lower value in the testing set than in the training set

I have noticed a peculiar effect: when I shuffle the data with sample(), the RMSE on the testing set becomes lower than the RMSE on the training set when using the caret package.
My code does a common split of training and testing set:
set.seed(seed)
training.index <- createDataPartition(dataset[[target_label]], p = 0.8, list = FALSE)
training.set <- dataset[training.index, ]
testing.set <- dataset[-training.index, ]
This gives, for example, a testing-set RMSE of 0.651, which is higher than the training-set RMSE of 0.575, as expected.
Following the recommendation of many sources, e.g. here, the data should be shuffled, so I do it before the above split:
# shuffle data - short version:
set.seed(17)
dataset <- data %>% nrow %>% sample %>% data[.,]
After this shuffle, the testing-set RMSE (0.528) is lower than the training-set RMSE (0.575)! This finding is consistent across a number of algorithms, including lm, glm, knn, kknn, rf, gbm, svmLinear, and svmRadial.
According to my knowledge, the default of sample() is replace = FALSE so there can't be any data leakage into the testing set. The same observation occurs in classification (for accuracy and kappa) although the createDataPartition performs stratification, so any data imbalance should be handled.
I don't use any extraordinary configuration, just ordinary cross-validation:
training.configuration <- trainControl(
method = "repeatedcv", number = 10
, repeats = CV.REPEATS
, savePredictions = "final",
# , returnResamp = "all"
)
What did I miss here?
--
Update 1: Hunch about data leakage into testing set
I checked the data distribution and found a potential hint for the described effect.
Training set distribution:
. Freq prop
1 1 124 13.581599
2 2 581 63.636364
3 3 194 21.248631
4 4 14 1.533406
Testing set distribution without shuffle:
. Freq prop
1 1 42 18.502203
2 2 134 59.030837
3 3 45 19.823789
4 4 6 2.643172
Testing set distribution with shuffle:
. Freq prop
1 1 37 16.299559
2 2 139 61.233480
3 3 45 19.823789
4 4 6 2.643172
If we look at the mode (most frequent value), its proportion in the testing set with shuffle (61.2%) is closer to the training-set proportion (63.6%) than without shuffle (59.0%).
I don't know how to interpret this statistically from the underlying theory. Can anybody?
An intuition of mine is that the shuffling makes the stratification of the testing-set distribution (implicitly performed by createDataPartition()) "more stratified", by which I mean "closer to the training-set distribution". This may cause data leakage in the opposite direction, into the testing set.
Update 2: Reproducible Code
library(caret)
library(tidyverse)
library(magrittr)
library(mlbench)
data(BostonHousing)
seed <- 171
# shuffled <- TRUE
shuffled <- FALSE
if (shuffled) {
dataset <- BostonHousing %>% nrow %>% sample %>% BostonHousing[., ]
} else {
dataset <- BostonHousing %>% as_tibble()
}
target_label <- "medv"
features_labels <- dataset %>% select_if(is.numeric) %>%
select(-target_label) %>% names %T>% print
# define ml algorithms to train
algorithm_list <- c(
"lm"
, "glmnet"
, "knn"
, "gbm"
, "rf"
)
# repeated cv
training_configuration <- trainControl(
method = "repeatedcv", number = 10
, repeats = 10
, savePredictions = "final",
# , returnResamp = "all"
)
# preprocess by standardization within each k-fold
preprocess_configuration = c("center", "scale")
# select variables
dataset %<>% select(target_label, features_labels) %>% na.omit
# dataset subsetting for tibble: [[
set.seed(seed)
training.index <- createDataPartition(dataset[[target_label]], p = 0.8, list = FALSE)
training.set <- dataset[training.index, ]
testing.set <- dataset[-training.index, ]
########################################
# 3.2: Select the target & features
########################################
target <- training.set[[target_label]]
features <- training.set %>% select(features_labels) %>% as.data.frame
########################################
# 3.3: Train the models
########################################
models.list <- list()
models.list <- algorithm_list %>%
map(function(algorithm_label) {
model <- train(
x = features,
y = target,
method = algorithm_label,
preProcess = preprocess_configuration,
trControl = training_configuration
)
return(model)
}
) %>%
setNames(algorithm_list)
UPDATE: Code to calculate testing-set performance
observed <- testing.set[[target_label]]
models.list %>%
predict(testing.set) %>%
map_df(function(predicted) {
sqrt(mean((observed - predicted)^2))
}) %>%
t %>% as_tibble(rownames = "model") %>%
rename(RMSE.testing = V1) %>%
arrange(RMSE.testing) %>%
as.data.frame
Running this code both for shuffled = FALSE and shuffled = TRUE on the testing.set gives:
model RMSE.testing RMSE.testing.shuffled
1 gbm 3.436164 2.355525
2 glmnet 4.516441 3.785895
3 knn 3.175147 3.340218
4 lm 4.501077 3.843405
5 rf 3.366466 2.092024
The effect is reproducible!
The reason you get a different test RMSE is that you have a different test set. You are shuffling your data and then applying the same training.index, so there is no reason to believe the test set contains the same rows each time.
In your original comparison you need to compare the RMSE of the shuffled test data with the RMSE of the shuffled training data, not of the original training data.
Edit: the shuffling is also unnecessary, as createDataPartition() has its own sampling scheme. You can just change the seed if you want a different test/training split.
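A minimal sketch (mine, not part of the original answer) that makes this concrete by checking how much the two test sets actually overlap, mirroring the seeds used in the question:
library(caret)
library(mlbench)
data(BostonHousing)
# Split without shuffling
set.seed(171)
idx_plain <- createDataPartition(BostonHousing$medv, p = 0.8, list = FALSE)
test_plain <- rownames(BostonHousing)[-idx_plain]
# Shuffle first, then split with the same partition seed
set.seed(17)
shuffled <- BostonHousing[sample(nrow(BostonHousing)), ]
set.seed(171)
idx_shuf <- createDataPartition(shuffled$medv, p = 0.8, list = FALSE)
test_shuf <- rownames(shuffled)[-idx_shuf]
# Fraction of test rows shared by the two splits; typically well below 1,
# i.e. the two comparisons use different test sets.
length(intersect(test_plain, test_shuf)) / length(test_plain)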
I completely agree with Jonny Phelps' answer. Based on your code and the caret function code, there is no reason to suspect any kind of data leakage when using createDataPartition() on shuffled data. So the variation in performance must be due to different train/test splits.
In order to prove this I checked the performance of the shuffled and non-shuffled workflow using 10 different seeds with 4 algorithms:
I omitted lm and replaced gbm with xgboost. The reason is my own preference.
In my opinion the result suggests there is no performance pattern between shuffled and non-shuffled data. Perhaps only KNN looks suspicious. But this is just one algorithm.
Code:
create initial seeds:
set.seed(1)
gr <- expand.grid(sample = sample(1L:1e5L, 10),
shuffled = c(FALSE, TRUE))
loop over initial seeds:
apply(gr, 1, function(x){
print(x)
shuffled <- x[2]
set.seed(x[1])
if (shuffled) {
dataset <- BostonHousing %>% nrow %>% sample %>% BostonHousing[., ]
} else {
dataset <- BostonHousing %>% as_tibble()
}
target_label <- "medv"
features_labels <- dataset %>% select_if(is.numeric) %>%
select(-target_label) %>% names %T>% print
algorithm_list <- c(
"glmnet",
"knn",
"rf",
"xgbTree"
)
training_configuration <- trainControl(
method = "repeatedcv",
number = 5, #5 folds and 3 reps is plenty
repeats = 3,
savePredictions = "final",
search = "random") #tune hyper parameters
preprocess_configuration = c("center", "scale")
dataset %<>% select(target_label, features_labels) %>% na.omit
set.seed(x[1])
training.index <- createDataPartition(dataset[[target_label]], p = 0.8, list = FALSE)
training.set <- dataset[training.index, ]
testing.set <- dataset[-training.index, ]
target <- training.set[[target_label]]
features <- training.set %>% select(features_labels) %>% as.data.frame
models.list <- list()
models.list <- algorithm_list %>%
map(function(algorithm_label) {
model <- train(
x = features,
y = target,
method = algorithm_label,
preProcess = preprocess_configuration,
trControl = training_configuration,
tuneLength = 100 #get decent hyper parameters
)
return(model)
}
) %>%
setNames(algorithm_list)
observed <- testing.set[[target_label]]
models.list %>%
predict(testing.set) %>%
map_df(function(predicted) {
sqrt(mean((observed - predicted)^2))
}) %>%
t %>% as_tibble(rownames = "model") %>%
rename(RMSE.testing = V1) %>%
arrange(RMSE.testing) %>%
as.data.frame
}) -> perf
do.call(rbind, perf) %>%
mutate(shuffled = rep(c(FALSE, TRUE), each = 40)) %>%
ggplot()+
geom_boxplot(aes(x = model, y = RMSE.testing, color = shuffled)) +
theme_bw()

What can be the cause of the difference in MAE outcome from deep learning with R between these datasets?

I’m trying to replicate the deep learning example below with the same Boston housing dataset from another source.
https://jjallaire.github.io/deep-learning-with-r-notebooks/notebooks/3.6-predicting-house-prices.nb.html
Originally the data source is:
library(keras)
dataset <- dataset_boston_housing()
Alternatively I try to use:
library(mlbench)
data(BostonHousing)
The differences between the datasets are:
the dataset from mlbench contains column names;
the dataset from keras is already split between test and train;
the set from keras is organised as lists containing matrices, while the dataset from mlbench is a data frame;
the fourth column contains a categorical variable "chas", which could not be preprocessed from the mlbench dataset while it can be preprocessed from the keras dataset. To compare apples with apples, I have deleted this column from both datasets.
In order to compare both datasets I have merged the train and testset from keras into 1 dataset. After this I have compared the merged dataset from keras with mlbench with summary() and these are identical for every feature (min, max, median, mean).
Since the dataset from keras is already split between test and train (80-20), I can only use one training set for the deep learning proces. This training set gives a validation_mae of around 2.5. See this graph:
If I partition the data from mlbench at 0.8 to construct a training set of similar size, run the deep learning code, and do this several times, I never reach a validation_mae of around 2.5. The range is between 4 and 6. An example of the output is this graph:
Does someone know what can be the cause for this difference?
Code with dataset from keras:
library(keras)
dataset <- dataset_boston_housing()
c(c(train_data, train_targets), c(test_data, test_targets)) %<-% dataset
train_data <- train_data[,-4]
test_data <- test_data[,-4]
mean <- apply(train_data, 2, mean)
std <- apply(train_data, 2, sd)
train_data <- scale(train_data, center = mean, scale = std)
test_data <- scale(test_data, center = mean, scale = std)
# After this line the code is the same for both code examples.
# =========================================
# Because we will need to instantiate the same model multiple times,
# we use a function to construct it.
build_model <- function() {
model <- keras_model_sequential() %>%
layer_dense(units = 64, activation = "relu",
input_shape = dim(train_data)[[2]]) %>%
layer_dense(units = 64, activation = "relu") %>%
layer_dense(units = 1)
model %>% compile(
optimizer = "rmsprop",
loss = "mse",
metrics = c("mae")
)
}
k <- 4
indices <- sample(1:nrow(train_data))
folds <- cut(1:length(indices), breaks = k, labels = FALSE)
num_epochs <- 100
all_scores <- c()
for (i in 1:k) {
cat("processing fold #", i, "\n")
# Prepare the validation data: data from partition # k
val_indices <- which(folds == i, arr.ind = TRUE)
val_data <- train_data[val_indices,]
val_targets <- train_targets[val_indices]
# Prepare the training data: data from all other partitions
partial_train_data <- train_data[-val_indices,]
partial_train_targets <- train_targets[-val_indices]
# Build the Keras model (already compiled)
model <- build_model()
# Train the model (in silent mode, verbose=0)
model %>% fit(partial_train_data, partial_train_targets,
epochs = num_epochs, batch_size = 1, verbose = 0)
# Evaluate the model on the validation data
results <- model %>% evaluate(val_data, val_targets, verbose = 0)
all_scores <- c(all_scores, results$mean_absolute_error)
}
all_scores
mean(all_scores)
# Some memory clean-up
k_clear_session()
num_epochs <- 500
all_mae_histories <- NULL
for (i in 1:k) {
cat("processing fold #", i, "\n")
# Prepare the validation data: data from partition # k
val_indices <- which(folds == i, arr.ind = TRUE)
val_data <- train_data[val_indices,]
val_targets <- train_targets[val_indices]
# Prepare the training data: data from all other partitions
partial_train_data <- train_data[-val_indices,]
partial_train_targets <- train_targets[-val_indices]
# Build the Keras model (already compiled)
model <- build_model()
# Train the model (in silent mode, verbose=0)
history <- model %>% fit(
partial_train_data, partial_train_targets,
validation_data = list(val_data, val_targets),
epochs = num_epochs, batch_size = 1, verbose = 1
)
mae_history <- history$metrics$val_mean_absolute_error
all_mae_histories <- rbind(all_mae_histories, mae_history)
}
average_mae_history <- data.frame(
epoch = seq(1:ncol(all_mae_histories)),
validation_mae = apply(all_mae_histories, 2, mean)
)
library(ggplot2)
ggplot(average_mae_history, aes(x = epoch, y = validation_mae)) + geom_line()
Code with dataset from mlbench (after the line with "=====", the code is the same as in the code above):
library(dplyr)
library(mlbench)
library(groupdata2)
data(BostonHousing)
parts <- partition(BostonHousing, p = 0.2)
test_data <- parts[[1]]
train_data <- parts[[2]]
train_targets <- train_data$medv
test_targets <- test_data$medv
train_data$medv <- NULL
test_data$medv <- NULL
train_data$chas <- NULL
test_data$chas <- NULL
mean <- apply(train_data, 2, mean)
std <- apply(train_data, 2, sd)
train_data <- scale(train_data, center = mean, scale = std)
test_data <- scale(test_data, center = mean, scale = std)
library(keras)
# After this line the code is the same for both code examples.
# =========================================
build_model <- function() {
model <- keras_model_sequential() %>%
layer_dense(units = 64, activation = "relu",
input_shape = dim(train_data)[[2]]) %>%
layer_dense(units = 64, activation = "relu") %>%
layer_dense(units = 1)
model %>% compile(
optimizer = "rmsprop",
loss = "mse",
metrics = c("mae")
)
}
k <- 4
indices <- sample(1:nrow(train_data))
folds <- cut(1:length(indices), breaks = k, labels = FALSE)
num_epochs <- 100
all_scores <- c()
for (i in 1:k) {
cat("processing fold #", i, "\n")
# Prepare the validation data: data from partition # k
val_indices <- which(folds == i, arr.ind = TRUE)
val_data <- train_data[val_indices,]
val_targets <- train_targets[val_indices]
# Prepare the training data: data from all other partitions
partial_train_data <- train_data[-val_indices,]
partial_train_targets <- train_targets[-val_indices]
# Build the Keras model (already compiled)
model <- build_model()
# Train the model (in silent mode, verbose=0)
model %>% fit(partial_train_data, partial_train_targets,
epochs = num_epochs, batch_size = 1, verbose = 0)
# Evaluate the model on the validation data
results <- model %>% evaluate(val_data, val_targets, verbose = 0)
all_scores <- c(all_scores, results$mean_absolute_error)
}
all_scores
mean(all_scores)
# Some memory clean-up
k_clear_session()
num_epochs <- 500
all_mae_histories <- NULL
for (i in 1:k) {
cat("processing fold #", i, "\n")
# Prepare the validation data: data from partition # k
val_indices <- which(folds == i, arr.ind = TRUE)
val_data <- train_data[val_indices,]
val_targets <- train_targets[val_indices]
# Prepare the training data: data from all other partitions
partial_train_data <- train_data[-val_indices,]
partial_train_targets <- train_targets[-val_indices]
# Build the Keras model (already compiled)
model <- build_model()
# Train the model (in silent mode, verbose=0)
history <- model %>% fit(
partial_train_data, partial_train_targets,
validation_data = list(val_data, val_targets),
epochs = num_epochs, batch_size = 1, verbose = 1
)
mae_history <- history$metrics$val_mean_absolute_error
all_mae_histories <- rbind(all_mae_histories, mae_history)
}
average_mae_history <- data.frame(
epoch = seq(1:ncol(all_mae_histories)),
validation_mae = apply(all_mae_histories, 2, mean)
)
library(ggplot2)
ggplot(average_mae_history, aes(x = epoch, y = validation_mae)) + geom_line()
Thank you!
Writing here because I can't comment...
I checked the mlbench dataset here and it said that it contains the 14 columns of the original Boston data set and 5 additional columns. Not sure if you might have a faulty dataset, because you state that there are no differences in the column counts of the datasets.
Another guess might be that the second example graph is from a model which is stuck in a local minimum. To get more comparable models, you might want to work with the same seeds to make sure that the initializations of the weights etc. are the same, so you get the same results.
Hope this helps and feel free to ask.
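If you want to rule out the initialization guess, one option (an assumption about your setup on my side, not something from the question) is to pin all random seeds before each run, e.g. via the tensorflow R package:
library(keras)
library(tensorflow)
set.seed(42)                     # R-side randomness: partition(), sample(), fold assignment
# Seeds R, Python, NumPy and TensorFlow so that weight initialization is
# repeatable; by default it also disables the GPU to avoid non-deterministic ops.
tensorflow::set_random_seed(42)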

Resources