Error: Can't find column `RH_train` in `.data` - r

I'm trying to train a model with the caret package (random forest). After running the "train" code, I get: Error: Can't find column RH_train in .data. I then tried converting the dependent variable (Rendimiento) to a factor, but I get: Error: At least one of the class levels is not a valid R variable name; This will cause errors when class probabilities are generated because the variables names will be converted to X0, X0.5, X0.6, X0.65, X0.7, X0.75, X0.79, X0.8, X0.81, X0.82, X0.83, X0.85, X0.86, X0.87, X0.88, X0.9, X1 . Please use factor levels that can be used as valid R variable names (see ?make.names for help).
library(rpart)
library(rpart.plot)
library(RWekajars)
library(randomForest)
library(party)
library(caret)
library(e1071)
library(dplyr)
library(readxl)  # provides read_excel()
#### Load the data ####
setwd("C:/Users/Frankenstein/Downloads")
RH <- read_excel("RH.xlsx")
RH$`Año Ingreso`=NULL
RH$`Mes ingreso`=NULL
RH$`Status empleado para Gestión t`=NULL
RH$`Horario trabajo`=NULL
RH$Nacional=NULL
RH$Jefe=NULL
RH$`N Personal`=NULL
colnames(RH)
names(RH)[names(RH) == "Grado de distancia"] <- "Distancia"
names(RH)[names(RH) == "Clave para el estado civil"] <- "EstadoCivil"
names(RH)[names(RH) == "Clave de sexo"] <- "Sexo"
#### Inspect the structure of the data ####
str(RH)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 325 obs. of 10 variables:
$ Rendimiento: num 0.6 0.8 0.85 0.86 0.85 0.8 1 0.86 1 0.9 ...
$ Edad : num 36 37 21 26 25 28 32 32 29 36 ...
$ Posición : chr "ANA" "ANA" "AUX" "AUX" ...
$ Sexo : num 1 1 0 1 1 1 1 1 1 1 ...
$ Distancia : num 5 3 1 1 3 2 2 4 5 4 ...
$ Estrato : num 2 5 3 3 3 3 5 2 5 6 ...
$ EstadoCivil: num 1 2 1 1 1 1 2 1 1 2 ...
$ Hijos : num 1 1 0 0 0 0 0 1 0 0 ...
$ Formación : chr "PREGRADO" "PREGRADO" "PREGRADO" "PREGRADO" ...
$ Educación : num 3 3 3 3 3 3 4 3 3 3 ...
hist(RH$Rendimiento)
summary(RH)
#### Split data into training and test sets ####
RH_train <- RH[1:243, ]
RH_test <- RH[244:325, ]
glimpse(RH_train)
glimpse(RH_test)
# Define the control
trControl <- trainControl(method = "cv",
number = 10,
search = "grid",
classProb=TRUE)
set.seed(1234)
RH_train$Rendimiento=factor(RH_train$Rendimiento)
RendimientoFactor=factor(RH_train$Rendimiento)
# Run the model
rf_default <- train(RH_train$Rendimiento ~ RH_train$Edad + RH_train$Sexo + RH_train$Distancia + RH_train$Estrato + RH_train$EstadoCivil + RH_train$Hijos + RH_train$Educación,
data=RH_train,
method = "rf",
metric = "Accuracy",
trControl = trControl)
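A likely fix, sketched here with made-up data: write the formula with bare column names and let `data =` supply the data frame, instead of prefixing every term with `RH_train$`. For the second error, `make.names()` turns levels such as 0.6 into valid R names such as X0.6. The toy values, the reduced set of predictors, and the names `form`, `RH_class`, `ctrl` and `rf_fit` are assumptions for illustration; the `train()` call only runs if caret and randomForest are installed.

```r
# Toy stand-in for RH_train (hypothetical values; column names follow the question)
set.seed(1234)
RH_train <- data.frame(
  Rendimiento = rep(c(0.6, 0.8, 0.85, 1), each = 10),
  Edad        = sample(21:40, 40, replace = TRUE),
  Sexo        = sample(0:1, 40, replace = TRUE),
  Hijos       = sample(0:2, 40, replace = TRUE)
)

# Bare column names in the formula; the data frame is passed via `data =`
form <- Rendimiento ~ Edad + Sexo + Hijos

# If classification is really intended, make the class levels valid R names
RH_class <- RH_train
RH_class$Rendimiento <- factor(make.names(RH_class$Rendimiento))
levels(RH_class$Rendimiento)

# Fit only when the modelling packages are available
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("randomForest", quietly = TRUE)) {
  ctrl <- caret::trainControl(method = "cv", number = 5, classProbs = TRUE)
  rf_fit <- caret::train(form, data = RH_class, method = "rf",
                         metric = "Accuracy", trControl = ctrl)
}
```

Since Rendimiento is a numeric score with many distinct values, keeping it numeric and fitting a regression forest (metric = "RMSE", without class probabilities) may be closer to the intent; the question does not settle that.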

Related

caret::predict giving Error: $ operator is invalid for atomic vectors

This has been driving me crazy and I've been looking through similar posts all day but can't seem to solve my problem. I have a naive Bayes model trained and stored as model. I'm attempting to predict with a newdata data frame but I keep getting the error Error: $ operator is invalid for atomic vectors. Here is what I am running: stats::predict(model, newdata = newdata) where newdata is the first row of another data frame: newdata <- pbp[1, c("balls", "strikes", "outs_when_up", "stand", "pitcher", "p_throws", "inning")]
class(newdata) gives [1] "tbl_df" "tbl" "data.frame".
The issue is with the data used. It should match the column types and factor levels used in training. E.g. if we use one of the rows from trainingData to predict, it does work:
predict(model, head(model$trainingData, 1))
#[1] Curveball
#Levels: Changeup Curveball Fastball Sinker Slider
Comparing the str of both datasets shows that some columns that are factors in the training data are character or integer in newdata:
str(model$trainingData)
'data.frame': 1277525 obs. of 7 variables:
$ pitcher : Factor w/ 1390 levels "112526","115629",..: 277 277 277 277 277 277 277 277 277 277 ...
$ stand : Factor w/ 2 levels "L","R": 1 1 2 2 2 2 2 1 1 1 ...
$ p_throws : Factor w/ 2 levels "L","R": 2 2 2 2 2 2 2 2 2 2 ...
$ balls : num 0 1 0 1 2 2 2 0 0 0 ...
$ strikes : num 0 0 0 0 0 1 2 0 1 2 ...
$ outs_when_up: num 1 1 1 1 1 1 1 2 2 2 ...
$ .outcome : Factor w/ 5 levels "Changeup","Curveball",..: 3 4 1 4 1 5 5 1 1 5 ...
str(newdata)
tibble [1 × 6] (S3: tbl_df/tbl/data.frame)
$ balls : int 3
$ strikes : int 2
$ outs_when_up: int 1
$ stand : chr "R"
$ pitcher : int 605200
$ p_throws : chr "R"
One option is to convert those columns in newdata to factors with the same levels as in the training data:
nm1 <- intersect(names(model$trainingData), names(newdata))
nm2 <- names(which(sapply(model$trainingData[nm1], is.factor)))
newdata[nm2] <- Map(function(x, y) factor(x, levels = levels(y)), newdata[nm2], model$trainingData[nm2])
Now do the prediction
predict(model, newdata)
#[1] Sinker
#Levels: Changeup Curveball Fastball Sinker Slider
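The same idea can be tried on a small self-contained example. This is only a sketch with made-up data: `train_df` stands in for `model$trainingData`, and the final check for NA is an extra precaution (an assumption, not part of the answer above), because a value that was never seen in training becomes NA when it is coerced to the training levels.

```r
# Hypothetical training data with factor predictors
train_df <- data.frame(
  stand    = factor(c("L", "R", "R", "L")),
  p_throws = factor(c("R", "R", "L", "R")),
  balls    = c(0, 1, 2, 3)
)

# New data arrives with character columns instead of factors
newdata <- data.frame(stand = "R", p_throws = "R", balls = 3L,
                      stringsAsFactors = FALSE)

# Factor columns of the training data that also exist in newdata
factor_cols <- names(train_df)[vapply(train_df, is.factor, logical(1))]
factor_cols <- intersect(factor_cols, names(newdata))

# Re-create each column as a factor using the training levels
newdata[factor_cols] <- Map(
  function(x, ref) factor(as.character(x), levels = levels(ref)),
  newdata[factor_cols], train_df[factor_cols]
)
str(newdata)

# Values absent from the training levels show up as NA
unseen <- vapply(newdata[factor_cols], anyNA, logical(1))
unseen
```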

Error in cross validation with factor value

I have this code:
# Define training control
set.seed(123)
train.control <- trainControl(method = "cv", number = 10)
# Train the model
model <- train(is_nocnv ~., data = mydata, method = "lm", trControl = train.control)
# Summarize the results
print(model)
When I execute this code I obtain this error:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
The field is_nocnv is a factor with the values 'YES' and 'NO':
str(mydata)
'data.frame': 8334 obs. of 7 variables:
$ chr : Factor w/ 1 level "chr1": 1 1 1 1 1 1 1 1 1 1 ...
$ start : int 3218610 154080441 154089408 61735 2069681 2074104 3135175 3137913 3214732 5901288 ...
$ stop : int 154074261 154081058 247813706 2061969 2071738 3130590 3136858 3212946 5900106 5902086 ...
$ strand : Factor w/ 1 level "*": 1 1 1 1 1 1 1 1 1 1 ...
$ num_probes : int 69643 3 59364 379 2 333 2 33 1943 3 ...
$ segment_mean: num -0.122 -13.462 -0.1 -0.326 -25.242 ...
$ is_nocnv : Factor w/ 2 levels "NO","YES": 2 2 2 1 1 1 1 1 1 1 ...
Here is a small part of my dataset (CSV):
"chr","start","stop","strand","num_probes","segment_mean","is_nocnv"
chr1,3218610,154074261,*,69643,-0.122,YES
chr1,154080441,154081058,*,3,-13.462,YES
chr1,154089408,247813706,*,59364,-0.1003,YES
chr1,61735,2061969,*,379,-0.326,NO
chr1,2069681,2071738,*,2,-25.242,NO
chr1,2074104,3130590,*,333,-0.3957,NO
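Judging from the str output, a likely cause is that chr and strand are factors with a single level, and the formula is_nocnv ~ . pulls them in as predictors, which is what the contrasts message complains about. A minimal sketch, using made-up rows shaped like the sample above, that finds and drops such columns (the names `single_level` and `mydata_clean` are hypothetical):

```r
# Toy data mirroring the structure in the question
mydata <- data.frame(
  chr          = factor(rep("chr1", 6)),
  start        = c(3218610, 154080441, 154089408, 61735, 2069681, 2074104),
  strand       = factor(rep("*", 6)),
  segment_mean = c(-0.122, -13.462, -0.1003, -0.326, -25.242, -0.3957),
  is_nocnv     = factor(c("YES", "YES", "YES", "NO", "NO", "NO"))
)

# Flag factor columns with fewer than two levels actually present
single_level <- vapply(
  mydata,
  function(col) is.factor(col) && nlevels(droplevels(col)) < 2,
  logical(1)
)
names(mydata)[single_level]

# Keep only the columns that can contribute to a model matrix
mydata_clean <- mydata[, !single_level, drop = FALSE]
str(mydata_clean)
```

Separately, method = "lm" expects a numeric outcome, so with a YES/NO factor a classification method such as "glm" is probably a better fit.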

Caret: There were missing values in resampled performance measures

I am running caret's neural network on the Bike Sharing dataset and I get the following error message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
I am not sure what the problem is. Can anyone help please?
The dataset is from:
https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
Here is the code:
library(caret)
library(bestNormalize)
data_hour = read.csv("hour.csv")
# Split dataset
set.seed(3)
split = createDataPartition(data_hour$casual, p=0.80, list=FALSE)
validation = data_hour[-split,]
dataset = data_hour[split,]
dataset = dataset[,c(-1,-2,-4)]
# View structure of data
str(dataset)
# 'data.frame': 13905 obs. of 14 variables:
# $ season : int 1 1 1 1 1 1 1 1 1 1 ...
# $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
# $ hr : int 1 2 3 5 8 10 11 12 14 15 ...
# $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
# $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
# $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
# $ weathersit: int 1 1 1 2 1 1 1 1 2 2 ...
# $ temp : num 0.22 0.22 0.24 0.24 0.24 0.38 0.36 0.42 0.46 0.44 ...
# $ atemp : num 0.273 0.273 0.288 0.258 0.288 ...
# $ hum : num 0.8 0.8 0.75 0.75 0.75 0.76 0.81 0.77 0.72 0.77 ...
# $ windspeed : num 0 0 0 0.0896 0 ...
# $ casual : int 8 5 3 0 1 12 26 29 35 40 ...
# $ registered: int 32 27 10 1 7 24 30 55 71 70 ...
# $ cnt : int 40 32 13 1 8 36 56 84 106 110 ...
## Transform numeric data to be approximately Gaussian
dataset_selected = dataset[,c(-13,-14)]
for (i in 8:12) { dataset_selected[,i] = predict(boxcox(dataset_selected[,i] +0.1))}
# View transformed dataset
str(dataset_selected)
#'data.frame': 13905 obs. of 12 variables:
#' $ season : int 1 1 1 1 1 1 1 1 1 1 ...
#' $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
#' $ hr : int 1 2 3 5 8 10 11 12 14 15 ...
#' $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
#' $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
#' $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
#' $ weathersit: int 1 1 1 2 1 1 1 1 2 2 ...
#' $ temp : num -1.47 -1.47 -1.35 -1.35 -1.35 ...
#' $ atemp : num -1.18 -1.18 -1.09 -1.27 -1.09 ...
#' $ hum : num 0.899 0.899 0.637 0.637 0.637 ...
#' $ windspeed : num -1.8 -1.8 -1.8 -0.787 -1.8 ...
#' $ casual : num -0.361 -0.588 -0.81 -1.867 -1.208 ...
# Train data with Neural Network model from caret
control = trainControl(method = 'repeatedcv', number = 10, repeats =3)
metric = 'RMSE'
set.seed(3)
fit = train(casual ~., data = dataset_selected, method = 'nnet', metric = metric, trControl = control, trace = FALSE)
Thanks for your help!
phiver's comment is spot on; however, I would still like to provide a more verbose answer for this concrete example.
To investigate what is going on in more detail, add the argument savePredictions = "all" to trainControl:
control = trainControl(method = 'repeatedcv',
number = 10,
repeats = 3,
returnResamp = "all",
savePredictions = "all")
metric = 'RMSE'
set.seed(3)
fit = train(casual ~.,
data = dataset_selected,
method = 'nnet',
metric = metric,
trControl = control,
trace = FALSE,
form = "traditional")
Now, running:
fit$results
#output
size decay RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 1 0e+00 0.9999205 NaN 0.8213177 0.009655872 NA 0.007919575
2 1 1e-04 0.9479487 0.1850270 0.7657225 0.074211541 0.20380571 0.079640883
3 1 1e-01 0.8801701 0.3516646 0.6937938 0.074484860 0.20787440 0.077960642
4 3 0e+00 0.9999205 NaN 0.8213177 0.009655872 NA 0.007919575
5 3 1e-04 0.9272942 0.2482794 0.7434689 0.091409600 0.24363651 0.098854133
6 3 1e-01 0.7943899 0.6193242 0.5944279 0.011560524 0.03299137 0.013002708
7 5 0e+00 0.9999205 NaN 0.8213177 0.009655872 NA 0.007919575
8 5 1e-04 0.8811411 0.3621494 0.6941335 0.092169810 0.22980560 0.098987058
9 5 1e-01 0.7896507 0.6431808 0.5870894 0.009947324 0.01063359 0.009121535
We notice the problem occurs when decay = 0.
Let's filter the observations and predictions for decay = 0:
library(tidyverse)
fit$pred %>%
filter(decay == 0) -> for_r2
var(for_r2$pred)
#output
0
We can see that all of the predictions when decay == 0 are the same (have zero variance). The model exclusively predicts 0:
unique(for_r2$pred)
#output
0
So when the summary function tries to compute R squared, the result is NA:
caret::R2(for_r2$obs, for_r2$pred)
#output
[1] NA
Warning message:
In cor(obs, pred, use = ifelse(na.rm, "complete.obs", "everything")) :
the standard deviation is zero
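The effect can be reproduced in base R without fitting any model. A small sketch with made-up numbers: the squared correlation behind R squared is undefined when one of the two vectors is constant, while RMSE is still computable, which matches the results table where RMSE is present but Rsquared is NaN.

```r
# Made-up observations and a constant prediction, as in the decay = 0 case
obs  <- c(0.5, -1.2, 0.3, 1.8, -0.7)
pred <- rep(0, length(obs))

# Constant predictions have zero variance
var(pred)

# Correlation with a constant vector is NA (R warns about a zero standard deviation)
r <- suppressWarnings(cor(obs, pred))
is.na(r)

# RMSE does not depend on the correlation, so it is still a finite number
rmse <- sqrt(mean((obs - pred)^2))
rmse
```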
Answer by @topepo (main developer of the caret package), taken from a detailed GitHub thread:
It looks like it happens when you have one hidden unit and almost no
regularization. What is happening is that the model is predicting a
value very close to a constant (so that the RMSE is a little worse
than the basic st deviation of the outcome):
> ANN_cooling_fit$resample %>% dplyr::filter(is.na(Rsquared))
RMSE Rsquared MAE size decay Resample
1 8.414010 NA 6.704311 1 0e+00 Fold04.Rep01
2 8.421244 NA 6.844363 1 0e+00 Fold01.Rep03
3 7.855925 NA 6.372947 1 1e-04 Fold10.Rep07
4 7.963816 NA 6.428947 1 0e+00 Fold07.Rep09
5 8.492898 NA 6.901842 1 0e+00 Fold09.Rep09
6 7.892527 NA 6.479474 1 0e+00 Fold10.Rep10
> sd(mydata$V7)
[1] 7.962888
So it's nothing to really worry about; just some parameters that do very poorly.
The answer by @missuse already explains well why this error happens.
So I just want to add some straightforward ways to get rid of it.
If the predictions have zero variance in some cross-validation folds, the model didn't converge. In such cases, you can try the neuralnet package, which offers two parameters you can tune:
threshold : default value = 0.01. Set it to 0.3 and then try lower values 0.2, 0.1, 0.05.
stepmax : default value = 1e+05. Set it to 1e+08 and then try lower values 1e+07, 1e+06.
In most cases, it is sufficient to change the threshold parameter like this:
model.nn <- caret::train(formula1,
method = "neuralnet",
data = training.set[,],
# apply preProcess within cross-validation folds
preProcess = c("center", "scale"),
trControl = trainControl(method = "repeatedcv",
number = 10,
repeats = 3),
threshold = 0.3
)

R Standardizing numeric variables in dataframe while retaining factor variables

I have a dataframe (dcc) loaded in R which I have narrowed down to complete cases.
str(dcc)
'data.frame': 41715 obs. of 9 variables:
$ XCoord : num 661382 661412 661442 661472 661502 ...
$ YCoord : num 648092 648092 648092 648092 648092 ...
$ OBJECTID : int 1 2 3 4 5 6 7 8 9 10 ...
$ POINTID : int 1 2 3 4 5 6 7 8 9 10 ...
$ GRID_CODE : int 0 0 0 0 0 0 0 0 0 0 ...
$ APPL_COST_DIST_RIV_COAST: num 21350 21674 22185 22748 23448 ...
$ APPL_DEM30 : int 785 793 792 769 765 777 784 789 781 751 ...
$ APPL_DEM30_SLOPE : num 19.7 13.3 18.6 23.2 21 ...
$ APPL_SITE_NONSITE : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
I want to standardize the numeric and integer variables by subtracting the mean and dividing by the standard deviation. When I apply the following code, I inadvertently drop the factor variable APPL_SITE_NONSITE from the dataframe:
ind <- sapply(dcc, is.numeric)
dcc.s<-sapply(dcc[,ind], function(x) (x-mean(x))/sd(x))
dcc.s<-data.frame(dcc.s)
If I'm not mistaken, that happens because ind=FALSE for that variable. It seems like I need some combination of a for loop and if/else statement to standardize the numeric variables and leave the factor variable alone. I have tried a number of permutations, but keep getting errors. For example, the following code:
dcc.s <- for (i in 1:ncol(dcc)){ sapply(dcc[,i],
if (is.numeric(dcc[,i])==TRUE) {
function(x) (x-mean(x))/sd(x) }
else {dcc[,i]})
}
returns the error:
Error in match.fun(FUN) :
c("'if (is.numeric(dcc[, i]) == TRUE) {' is not a function, character or symbol", "' function(x) (x - mean(x))/sd(x)' is not a function, character or symbol", "'} else {' is not a function, character or symbol", "' dcc[, i]' is not a function, character or symbol", "'}' is not a function, character or symbol")
Perhaps this is a simple formatting error or misplaced bracket, but I'm thoroughly stuck. I am open to other approaches if there is a more elegant way to do this. Any help would be much appreciated.
You need to use rapply instead of sapply
set.seed(1)
> df=data.frame(A=rnorm(10),b=1:10,C=as.factor(rep(1:2,5)))
> str(df)
'data.frame': 10 obs. of 3 variables:
$ A: num -0.626 0.184 -0.836 1.595 0.33 ...
$ b: int 1 2 3 4 5 6 7 8 9 10
$ C: Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 2
The code you need to use:
> D=rapply(df,scale,c("numeric","integer"),how="replace")
> D
A b C
1 -0.97190653 -1.4863011 1
2 0.06589991 -1.1560120 2
3 -1.23987805 -0.8257228 1
4 1.87433300 -0.4954337 2
5 0.25276523 -0.1651446 1
6 -1.22045645 0.1651446 2
7 0.45507643 0.4954337 1
8 0.77649606 0.8257228 2
9 0.56826358 1.1560120 1
10 -0.56059319 1.4863011 2
> str(D)
'data.frame': 10 obs. of 3 variables:
$ A: num [1:10, 1] -0.9719 0.0659 -1.2399 1.8743 0.2528 ...
..- attr(*, "scaled:center")= num 0.132
..- attr(*, "scaled:scale")= num 0.781
$ b: num [1:10, 1] -1.486 -1.156 -0.826 -0.495 -0.165 ...
..- attr(*, "scaled:center")= num 5.5
..- attr(*, "scaled:scale")= num 3.03
$ C: Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 2
>
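As the str(D) output shows, scale returns a one-column matrix with attributes, so the scaled columns are no longer plain numeric vectors. If that matters downstream, one possible variation (a sketch, not part of the answer above) is to wrap scale in as.numeric inside the same rapply call:

```r
set.seed(1)
df <- data.frame(A = rnorm(10), b = 1:10, C = as.factor(rep(1:2, 5)))

# Scale numeric and integer columns, dropping the matrix attributes;
# columns of other classes (the factor C) are left unchanged
D <- rapply(df,
            function(x) as.numeric(scale(x)),
            classes = c("numeric", "integer"),
            how = "replace")
str(D)
```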
Here is a dplyr and scale solution.
For dplyr < 1.0.0
require(dplyr)
df %>% mutate_if(is.numeric, scale)
# a runif(20) rnorm(20)
#1 y 0.5783877 -0.004177104
#2 n -0.2344854 -0.866626472
#3 m 1.5629961 1.526857969
#4 h 0.9648646 -1.557975547
#5 u -0.7212756 0.533400304
#6 u 1.4753675 -0.072289864
#7 b 0.5346870 -0.464299111
#8 l -0.4287559 0.426600473
#9 m -1.2050841 -0.880135405
#10 h -0.6150410 -0.040636433
#11 r 1.3768249 -0.719785950
#12 a -1.3929511 0.083010969
#13 a -0.4422665 0.385574213
#14 l -0.7719473 -0.934716525
#15 m 1.4483803 0.131974911
#16 k 0.6291919 2.598581195
#17 k -1.0356817 -1.018890381
#18 s -1.0960083 1.560216350
#19 y -0.8826702 -0.367821579
#20 v 0.2554671 -0.318862011
For dplyr >= 1.0.0
df %>% mutate(across(where(is.numeric), scale))
Note that scale(x) will do the same as (x - mean(x)) / sd(x); if you want to scale based on different metrics (e.g. a robust/modified Z score based on the median and MAD) you can do that using sweep.
Sample data
set.seed(2017);
df <- cbind.data.frame(a = factor(sample(letters, 20, replace = T)), runif(20), rnorm(20));
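The sweep approach mentioned above could look roughly like this. It is a sketch under assumptions: the numeric columns are given explicit names (`x`, `y`) for readability, and the robust score is taken as (value - median) / MAD, using R's default `mad()`, which already includes the 1.4826 consistency constant.

```r
set.seed(2017)
df <- data.frame(a = factor(sample(letters, 20, replace = TRUE)),
                 x = runif(20),
                 y = rnorm(20))

# Work on the numeric columns only
num <- vapply(df, is.numeric, logical(1))
m   <- as.matrix(df[num])

# Column-wise centre and spread
centers <- apply(m, 2, median)
scales  <- apply(m, 2, mad)

# Subtract the medians, then divide by the MADs, column by column
robust <- sweep(sweep(m, 2, centers, "-"), 2, scales, "/")

# Put the scaled columns back; the factor column is untouched
df_robust <- df
df_robust[num] <- as.data.frame(robust)
head(df_robust)
```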
ind <- sapply(dcc, is.numeric)
dcc.s <- as.data.frame(lapply(dcc[,ind], function(x) (x-mean(x))/sd(x)))
dcc.s <- cbind(dcc[, !ind, drop = FALSE], dcc.s)  # non-numeric columns plus the scaled ones
If you don't need the "old" dataframe you can also do
ind <- sapply(dcc, is.numeric)
dcc[ind] <- lapply(dcc[ind], function(x) (x - mean(x)) / sd(x))

Calculate new column with growth rate based on two factors without splintering dataframe

Hi there,
I would like to calculate growth rates, storing them in a new column of my data frame e.g. named growth.per.day. I am - as always - looking for a way that doesn't include hundreds and hundreds of lines of manually edited code.
I have six levels of algae and 25 levels of nutrients.
This means I have 150 "subgroups" for which I want to calculate the rates. Those subsets differ in length depending on the individual algae.
So, basically:
Algae A ->
Nutrient (1) -> C.mikro.gr.L (Day 2) - C.mikro.gr.L (Day 1),C.mikro.gr.L (Day 3) - C.mikro.gr.L (Day 2) ... ;
Nutrient (2) -> C.mikro.gr.L (Day 2) - C.mikro.gr.L (Day 1),C.mikro.gr.L (Day 3) - C.mikro.gr.L (Day 2) ... etc.
I already split the data frame by algae
X <- split(data, data$ALGAE)
names(X) <- c("ANKI", "CHLAMY", "MIX_A", "MIX_B", "SCENE", "STAURA")
list2env(X, envir = .GlobalEnv)
and I have also split those again, creating the aforementioned lovely 150 subsets. Then I applied
ratio1$growth.per.day <- c(NA,ratio1[2:nrow(ratio1), 16] - ratio1[1:(nrow(ratio1)-1), 16])
which is perfect and does what I want, BUT I would really very much appreciate a shorter, more elegant way that doesn't butcher my dataframe.
'data.frame': 3550 obs. of 16 variables:
$ SAMPLE.ID : Factor w/ 150 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ COMMUNITY : chr "com.1" "com.1" "com.1" "com.1" ...
$ NUTRIENT : Factor w/ 25 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ RATIO : Factor w/ 23 levels "3.2","4","5.4",..: 11 9 6 4 1 14 10 8 5 2 ...
$ PHOS : Factor w/ 5 levels "0.09","0.195",..: 5 5 5 5 5 4 4 4 4 4 ...
$ NIT : Factor w/ 5 levels "1.5482","3.0964",..: 5 4 3 2 1 5 4 3 2 1 ...
$ DATUM : Factor w/ 35 levels "30.08.16","31.08.16",..: 1 1 1 1 1 1 1 1 1 1 ...
$ DAY : int 0 0 0 0 0 0 0 0 0 0 ...
$ TYPE : chr "mono" "mono" "mono" "mono" ...
$ ALGAE : Factor w/ 6 levels "ANK","CHLA","MIX A",..: 5 5 5 5 5 5 5 5 5 5 ...
$ MEAN : num 864 868 882 873 872 ...
$ GROW : num 0.00116 0.00115 0.00113 0.00115 0.00115 ...
$ FLUORO : num NA NA NA NA NA NA NA NA NA NA ...
$ MEAN.MQ : num 0.964 0.969 0.985 0.975 0.973 ...
$ GROW.MQ : num 1.04 1.03 1.02 1.03 1.03 ...
$ C.mikro.gr.L: num -764 -913 -1394 -1085 -1039 ...
I hope this sufficiently describes the problem,
Thanks so much!
I hope this is what you asked for:
df = data.frame(algae = sort(rep(LETTERS[1:6], 20)),
nutrient = rep(letters[22:26], 24),
day = rep(c(rep(1, 5),
rep(2, 5),
rep(3, 5),
rep(4, 5)), 6),
growth = runif(120, 30, 60))
library(dplyr)
df = df %>% group_by(algae, nutrient) %>% mutate(rate = c(NA, diff(growth, lag = 1)))
And here is the table for alga A and nutrient v:
algae nutrient day growth rate
<fctr> <fctr> <dbl> <dbl> <dbl>
1 A v 1 48.68547 NA
2 A v 2 55.63570 6.950232
3 A v 3 53.28569 -2.350013
4 A v 4 44.83022 -8.455465
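For comparison, the same grouped difference can be computed in base R with ave(), which also avoids splitting the data frame. This is a sketch on made-up numbers, and it assumes the rows are already ordered by day within each algae/nutrient combination:

```r
# Small made-up data set: 2 algae x 2 nutrients x 3 days
df <- data.frame(
  algae    = rep(c("A", "B"), each = 6),
  nutrient = rep(rep(c("v", "w"), each = 3), times = 2),
  day      = rep(1:3, times = 4),
  growth   = c(10, 12, 15, 20, 19, 25, 5, 8, 8, 30, 33, 31)
)

# Day-to-day change within each algae/nutrient group; first day is NA
df$rate <- ave(df$growth, df$algae, df$nutrient,
               FUN = function(x) c(NA, diff(x)))
df
```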

Resources