Data partitioning function createDataPartition cross-validation problem in R

I am trying to get predictions from a multiple-variable model. The data frame is neplt; it is made of 7 scores plus one final exam score, moy_exam2, and I want to predict the latter using the 7 scores. I have 19643 obs, like this:
'data.frame': 19643 obs. of 8 variables:
$ HG : num 11.5 14 7.5 10.5 9.5 9.5 10 14 11.5 14 ...
$ Math : num 8 7.25 9.25 13.25 4.25 ...
$ Ar : num 11.2 12.8 8.5 11.5 9.5 ...
$ Fr : num 4 4.25 6.5 6.75 5.5 ...
$ EI : num 8 10.5 2.5 4 7 9.5 8.5 9.5 12 14 ...
$ SVT : num 5.25 9.25 7 11.5 12.5 ...
$ PC : num 11.5 16.75 4.25 13.75 10 ...
$ moy_exam2: num 8.15 9.48 7.23 10.33 7.44 ...
I decided on 85% for training and 15% for testing out the model, so to partition the data with createDataPartition I try this:
# Load the data
data("neplt")
# Inspect the data
library(caret) # provides createDataPartition
library(tidyverse)
sample_n(neplt, 3)
# Split the data into training and test set
set.seed(1,sample.kind = "Rounding")
#remember the last sample
training.samples=neplt$moy_exam2
library(Rcpp)
training.samples <- neplt$moy_exam2 %>%
createDataPartition(neplt,p = 0.85, list = FALSE,times = 1)
train.data <- neplt[training.samples, ]
test.data <- neplt[-training.samples, ]
# Build the model
model <- lm(moy_exam2 ~., data = train.data, na.action=na.omit)
# Make predictions and compute the R2, RMSE and MAE
predictions <- model %>% predict(test.data)
data.frame( R2 = R2(predictions, test.data$moy_exam2),
RMSE = RMSE(predictions, test.data$moy_exam2),
MAE = MAE(predictions, test.data$moy_exam2))
I get the error
Error in split_indices(as.integer(splitv), attr(splitv, "n")) :
function 'Rcpp_precious_remove' not provided by package 'Rcpp'
I don't use any split_indices function here, and Rcpp is already loaded, so I continue executing, but the program gets stuck on the createDataPartition line.
I cleaned the data neplt using na.omit and also with na.exclude to remove any doubt about NA missing values.
Then I tried adding the sample.kind = "Rounding" argument to set.seed to get it to work; still, RStudio keeps loading indefinitely, and the console shows a + sign.
Does it seem to be related to memory capacity? Or is it drawing an indefinite number of samples such that it couldn't finish in 100 years? It's been running for hours with no results!

I had a similar problem and error code when running summarySE. It seems like others have had issues like this too: the Rcpp package doesn't include Rcpp_precious_remove.
I installed and loaded Rcpp again and it worked thereafter!
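Separately, note that the piped call also passes the whole neplt data frame as an extra positional argument to createDataPartition (with times, p, and list all named, it ends up matched to the groups parameter), which may be why the call never returns. A minimal sketch of the intended call, assuming caret is loaded:
library(caret)
set.seed(1, sample.kind = "Rounding")
# partition on the outcome vector alone; the extra neplt argument in the
# question gets matched to createDataPartition's 'groups' parameter
training.samples <- createDataPartition(neplt$moy_exam2, p = 0.85, list = FALSE, times = 1)
train.data <- neplt[training.samples, ]
test.data <- neplt[-training.samples, ]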

Related

Error in topsis(d, w, i) : 'decision' must be a matrix or data frame

I am doing a TOPSIS analysis with R using topsis.
For that, I started by importing the data.
As it was a table, and topsis requires the decision to be a matrix or a data frame, I proceeded to convert it to a df as follows:
data.df <- as.data.frame(data)
Or
data.df <- data.frame(data)
And indeed it gives what one is looking for:
> str(data.df)
'data.frame': 225 obs. of 2 variables:
$ time: int 6 6 7 7 6 7 6 7 8 7 ...
$ MAE : num 5.43 5.63 5.35 5.48 5.62 5.48 5.53 5.43 5.24 5.42 ...
However, when I run
d <- data.df
w <- c(1, 1)
i <- c("-", "-")
topsis(d, w, i)
I am getting the following error
Error in topsis(d, w, i) : 'decision' must be a matrix or data frame
It may be something related to the way the library reads the df, because when I converted it to a matrix
data.df <- as.matrix(data)
It worked well.
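For reference, a minimal sketch of the workaround that worked, with hypothetical values mirroring the question's str() output:
library(topsis)
# made-up decision data with the question's two columns
data <- data.frame(time = c(6, 6, 7, 7), MAE = c(5.43, 5.63, 5.35, 5.48))
d <- as.matrix(data)  # topsis accepts the matrix where the data frame failed
w <- c(1, 1)          # equal weights for the two criteria
i <- c("-", "-")      # both criteria are costs (lower is better)
topsis(d, w, i)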

Box-Cox Transformation Error: object 'x' not found

hopefully a relatively easy one for those more experienced than me!
Trying to perform a Box-Cox transformation using the following code:
fit <- lm(ABOVEGROUND_BIO ~ TREATMENT * P_LEVEL, data = MYCORRHIZAL_VARIANCE)
bc <- boxcox(fit)
lambda<-with(bc, x[which.max(y)])
MYCORRHIZAL_VARIANCE$bc <- ((x^lambda)-1/lambda)
boxplot(bc ~ TREATMENT * P_LEVEL, data = MYCORRHIZAL_VARIANCE)
however when I run it, I get the following error message:
Error: object 'x' not found. (on line 4)
For context, here's the str of my dataset:
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 24 obs. of 14 variables:
$ TREATMENT : Factor w/ 2 levels "Mycorrhizal",..: 1 1 1 1 1 1 1 1 1 1 ...
$ P_LEVEL : Factor w/ 2 levels "Low","High": 1 1 1 1 1 1 2 2 2 2 ...
$ REP : int 1 2 3 4 5 6 1 2 3 4 ...
$ ABOVEGROUND_BIO : num 7.5 6.8 5.3 6 6.7 7 12 12.7 12 10.2 ...
$ BELOWGROUND_BIO : num 3 2.4 2 4 2.7 3.6 7.9 8.8 9.5 9.2 ...
$ ROOT_SHOOT : num 0.4 0.35 0.38 0.67 0.4 0.51 0.66 0.69 0.79 0.9 ...
$ ROOT_SHOOT.log : num -0.916 -1.05 -0.968 -0.4 -0.916 ...
$ ABOVEGROUND_BIO.log : num 2.01 1.92 1.67 1.79 1.9 ...
$ ABOVEGROUND_BIO.sqrt : num 2.74 2.61 2.3 2.45 2.59 ...
$ ABOVEGROUND_BIO.cubert: num 1.96 1.89 1.74 1.82 1.89 ...
$ BELOWGROUND_BIO.log : num 1.099 0.875 0.693 1.386 0.993 ...
$ BELOWGROUND_BIO.sqrt : num 1.73 1.55 1.41 2 1.64 ...
$ BELOWGROUND_BIO.cubert: num 1.44 1.34 1.26 1.59 1.39 ...
$ TOTAL_BIO : num 10.5 9.2 7.3 10 9.4 10.6 19.9 21.5 21.5 19.4 ...
- attr(*, "spec")=
.. cols(
.. TREATMENT = col_factor(levels = c("Mycorrhizal", "Non-mycorrhizal"), ordered = FALSE, include_na = FALSE),
.. P_LEVEL = col_factor(levels = c("Low", "High"), ordered = FALSE, include_na = FALSE),
.. REP = col_integer(),
.. ABOVEGROUND_BIO = col_number(),
.. BELOWGROUND_BIO = col_number(),
.. ROOT_SHOOT = col_number()
.. )
I understand there's no variable named bc in the MYCORRHIZAL_VARIANCE dataset, but I'm just following the basic instructions given to me for performing a Box-Cox transformation, and I'm confused as to what 'x' should actually be, since I thought 'x' was being defined in line 3. Any suggestions as to how to fix this error?
Thanks in advance!
I thought 'x' was being defined in line 3?
Line 3 is lambda<-with(bc, x[which.max(y)]). It doesn't define x, it defines lambda. It does use x, which it looks for within the bc environment. If you're using boxcox() from the MASS package, bc should indeed include x and y components, so bc$x wouldn't give you this error message; I'd expect an error about replacement lengths instead. Because...
bc$x holds the candidate lambda values tried by boxcox - you're using the default seq(-2, 2, 1/10) - and it would be an unlikely coincidence if your data had the multiple-of-41 rows needed to avoid an error when assigning those 41 values to a new column.
Line 3 picks out the lambda value that maximizes the likelihood, so you shouldn't need the rest of the values in bc again. I'd expect you to use that lambda value to transform your response variable, as that's what the Box-Cox transformation is for. ((x^lambda)-1/lambda) doesn't make any statistical or programmatic sense. Use this instead:
MYCORRHIZAL_VARIANCE$bc <- (MYCORRHIZAL_VARIANCE$ABOVEGROUND_BIO ^ lambda - 1) / lambda
(Note that I also corrected the parentheses. You want (y ^ lambda - 1) / lambda, not (y ^ lambda) - 1 / lambda.)
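Putting it together, here's a minimal end-to-end sketch of the corrected workflow, using the built-in ToothGrowth data as a stand-in for MYCORRHIZAL_VARIANCE (two factors and a positive response):
library(MASS)
tg <- transform(ToothGrowth, dose = factor(dose))
fit <- lm(len ~ supp * dose, data = tg)
bc <- boxcox(fit, plotit = FALSE)    # profiles the log-likelihood over lambda
lambda <- with(bc, x[which.max(y)])  # lambda with the highest likelihood
# transform the response itself; 'x' only exists inside bc
tg$len.bc <- (tg$len ^ lambda - 1) / lambda
boxplot(len.bc ~ supp * dose, data = tg)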

Why can't I use cv.glm on the output of bestglm?

I am trying to do best subset selection on the wine dataset, and then I want to get the test error rate using 10 fold CV. The code I used is -
library(bestglm) # for bestglm
library(boot)    # for cv.glm
cost1 <- function(good, pi=0) mean(abs(good-pi) > 0.5)
res.best.logistic <-
bestglm(Xy = winedata,
family = binomial, # binomial family for logistic
IC = "AIC", # Information criteria
method = "exhaustive")
res.best.logistic$BestModels
best.cv.err<- cv.glm(winedata,res.best.logistic$BestModel,cost1, K=10)
However, this gives the error -
Error in UseMethod("family") : no applicable method for 'family' applied to an object of class "NULL"
I thought that $BestModel is the lm object that represents the best fit, and that's what the manual also says. If that's the case, then why can't I find the test error on it using 10-fold CV with the help of cv.glm?
The dataset used is the white wine dataset from https://archive.ics.uci.edu/ml/datasets/Wine+Quality, and the packages used are boot (for cv.glm) and bestglm.
The data was processed as -
winedata <- read.delim("winequality-white.csv", sep = ';')
winedata$quality[winedata$quality< 7] <- "0" #recode
winedata$quality[winedata$quality>=7] <- "1" #recode
winedata$quality <- factor(winedata$quality)# Convert the column to a factor
names(winedata)[names(winedata) == "quality"] <- "good" #rename 'quality' to 'good'
The bestglm fit rearranges your data and names your response variable y; hence, if you pass it back into cv.glm, winedata does not have a column y and everything crashes after that.
It's always good to check the class:
class(res.best.logistic$BestModel)
[1] "glm" "lm"
But if you look at the call of res.best.logistic$BestModel:
res.best.logistic$BestModel$call
glm(formula = y ~ ., family = family, data = Xi, weights = weights)
head(res.best.logistic$BestModel$model)
y fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
1 0 7.0 0.27 0.36 20.7 0.045
2 0 6.3 0.30 0.34 1.6 0.049
3 0 8.1 0.28 0.40 6.9 0.050
4 0 7.2 0.23 0.32 8.5 0.058
5 0 7.2 0.23 0.32 8.5 0.058
6 0 8.1 0.28 0.40 6.9 0.050
free.sulfur.dioxide density pH sulphates
1 45 1.0010 3.00 0.45
2 14 0.9940 3.30 0.49
3 30 0.9951 3.26 0.44
4 47 0.9956 3.19 0.40
5 47 0.9956 3.19 0.40
6 30 0.9951 3.26 0.44
You can substitute things in the call etc., but it's too much of a mess. Fitting is not costly, so make a fit on winedata and pass it to cv.glm:
best_var = apply(res.best.logistic$BestModels[, -ncol(winedata)], 1, which)
# take the variable names for the best model
best_var = names(best_var[[1]])
new_form = as.formula(paste("good ~", paste(best_var, collapse = "+")))
fit = glm(new_form, winedata, family = "binomial")
best.cv.err <- cv.glm(winedata, fit, cost1, K = 10)
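Alternatively, a small sketch of the same idea that rebuilds the formula straight from the best model's own model frame rather than from the BestModels table (an untested variant, offered as an assumption about bestglm's output structure):
# pull the selected terms from the fitted best model
best_terms <- attr(res.best.logistic$BestModel$model, "terms")
new_form <- reformulate(attr(best_terms, "term.labels"), response = "good")
fit <- glm(new_form, data = winedata, family = binomial)
best.cv.err <- cv.glm(winedata, fit, cost1, K = 10)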

Reverse Johnson transformation

I want to perform a regression and I have a data set with a left-skewed target variable (Murder) like this:
data("USAArrests")
str(USAArrests)
'data.frame': 50 obs. of 4 variables:
$ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
$ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
$ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
$ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
hist(USAArrests&Murder)
Since the data is left-skewed. I can do a log transformation of the target in order to improve the performance of the model.
train = USArrests[1:30,]
train$Murder = log(train$Murder)
test = USArrests[31:50,]
If I want to apply this model on the test set, I have to reverse the transformation to get the actual result. This I can do with exp:
fit = lm(Murder~., data = train)
pred = predict(fit, test)
exp(pred)
However, in my case, the log transformation is not enough to get a normal distribution of the target. So I used the Johnson transformation.
library(bestNormalize)
train$Murder = yeojohnson(train$Murder)$x.t
Is there a possibility to reverse this transformation like the log transformation like above?
As noted by Rui Barradas, the predict function can be used here. Instead of directly pulling out x.t from the yeojohnson function, you can do the following:
# Store the transformation object
yj_obj <- yeojohnson(train$Murder)
# Perform transformation
yj_vals <- predict(yj_obj)
# Reverse transformation
orig_vals <- predict(yj_obj, newdata = yj_vals, inverse = TRUE)
# Should be the same as the original values
all.equal(orig_vals, train$Murder)
The same workflow can be done with the log and exponentiation transformation via the log_x function (together with the predict function and the inverse = TRUE argument).
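Applied to the regression workflow from the question, a minimal sketch (same train/test split as above):
library(bestNormalize)
train <- USArrests[1:30, ]
test <- USArrests[31:50, ]
yj_obj <- yeojohnson(train$Murder)  # store the transformation object
train$Murder <- predict(yj_obj)     # transformed response
fit <- lm(Murder ~ ., data = train)
pred_t <- predict(fit, test)        # predictions on the transformed scale
pred <- predict(yj_obj, newdata = pred_t, inverse = TRUE)  # original scale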

Error Standardising variables in R : 'only defined on a data frame with all numeric variables'

I am simply looking to standardise my data frame variables to a 100-point scale. The original variables were on a 10-point scale with 4 decimal places.
I can see that my error is not unheard of, e.g.:
Why am I getting a function error in seemingly similar R code?
Error: only defined on a data frame with all numeric variables with ddply on large dataset
but I have verified that all variables are numeric. My code is:
library(foreign)
library(scales)
ches <- read.csv("chesshort15.csv", header = TRUE)
ches2 <- ches[1:244, 3:10]
rescale(ches2, to = c(0,100), from = range(ches2, na.rm = TRUE, finite = TRUE))
This gives the error:
Error in FUN(X[[i]], ...) : only defined on a data frame with all numeric variables
I have verified that all variables are of type numeric using str(ches2) - see below:
'data.frame': 244 obs. of 8 variables:
$ galtan : num 8.8 9 9.65 8.62 8 ...
$ civlib_laworder : num 8.5 8.6 9.56 8.79 8.56 ...
$ sociallifestyle : num 8.89 7.2 9.65 9.21 8.25 ...
$ immigrate_policy : num 9.89 9.6 9.38 9.43 9.13 ...
$ multiculturalism : num 9.9 9.6 9.57 8.77 9.07 ...
$ ethnic_minorities : num 8.8 9.6 9.87 9 8.93 ...
$ nationalism : num 9.4 10 9.82 9 8.81 ...
$ antielite_salience: num 8 9 9.47 8.88 8.38
In short, I'm stumped as to why it refuses to carry out the code.
For info, head(ches2) gives:
galtan civlib_laworder sociallifestyle immigrate_policy multiculturalism ethnic_minorities
1 8.800 8.500 8.889 9.889 9.900 8.800
2 9.000 8.600 7.200 9.600 9.600 9.600
3 9.647 9.563 9.647 9.375 9.571 9.867
4 8.625 8.786 9.214 9.429 8.769 9.000
5 8.000 8.563 8.250 9.133 9.071 8.929
6 7.455 8.357 7.923 8.800 7.800 8.455
nationalism antielite_salience
1 9.400 8.000
2 10.000 9.000
3 9.824 9.471
4 9.000 8.882
5 8.813 8.375
6 8.000 8.824
The rescale function is throwing that error because it expects a numeric vector, and you are feeding it a data frame instead. You need to iterate: go through every column of your data frame and scale each one individually.
Try this:
sapply(ches2, rescale, to = c(0,100))
You don't need the range(ches2, na.rm = TRUE, finite = TRUE) portion of your code, because rescale is smart enough to remove NA values on its own.
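If you'd rather keep the result as a data frame instead of the matrix that sapply returns, a small sketch of the same idea:
library(scales)
# replace each column in place; ches2[] preserves the data.frame structure
ches2[] <- lapply(ches2, rescale, to = c(0, 100))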
