"'groups' must be a factor" on Shapiro-Wilk test on Rcmdr - r

I am trying to run a Shapiro-Wilk normality test in R (Rcmdr, to be more precise) by going to "Statistics=>Summary=>Descriptive statistics", then selecting one of my dependent variables and choosing "summary by group".
Rcmdr automatically generates the following code:
normalityTest(Algometre.J0 ~ Modalite, test="shapiro.test",
data=Dataset)
And I am getting the following error message:
'groups' must be a factor.
I have already categorized my independent variable as a factor (I swear, I did!)
Any idea what's wrong?
Thanks in advance.
Here is what str(Dataset) shows :
'data.frame': 76 obs. of 11 variables:
$ Modalite : chr "C" "C" "C" "C" ...
$ Angle.J0 : num 20.1 20.5 21 22.5 19.1 ...
$ Angle.J1 : num 21.7 22.6 22.8 23.3 20.5 ...
$ Angle.J2 : num 22.3 23 23.9 24.2 21 ...
$ Epaisseur.J0: num 1.97 1.54 1.76 1.89 1.53 1.87 1.54 2 1.79 1.41 ...
$ Epaisseur.J1: num 2.07 1.49 1.87 1.91 1.54 1.9 1.51 2.03 1.71 1.48 ...
$ Epaisseur.J2: num 2.08 1.69 1.77 2 1.61 1.99 1.38 2.06 1.86 1.53 ...
$ Algometre.J0: num 45 40 105 165 66.3 ...
$ Algometre.J1: num 32.7 39.7 91.7 124 63.7 ...
$ Algometre.J2: num 51.3 58.7 101 138 60.3 ...
$ ObsNumber : int 1 2 3 4 5 6 7 8 9 10 ...
What does that mean ?
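For what it's worth, the str(Dataset) output above lists Modalite as chr, i.e. a character vector rather than a factor, which would explain the message. Below is a minimal sketch of the conversion. It assumes a small made-up Dataset (the values and the second level "T" are invented for illustration) and uses base R's tapply() in place of Rcmdr's normalityTest() so that it runs without extra packages:

```r
# Hypothetical stand-in for Dataset; only the two relevant columns are kept
set.seed(1)
Dataset <- data.frame(
  Modalite     = rep(c("C", "T"), each = 10),   # "T" is an invented second group
  Algometre.J0 = rnorm(20, mean = 60, sd = 20),
  stringsAsFactors = FALSE
)

is.factor(Dataset$Modalite)    # character column, so not a factor yet

# Convert the grouping variable to a factor in place
Dataset$Modalite <- factor(Dataset$Modalite)
is.factor(Dataset$Modalite)

# Base-R equivalent of a per-group Shapiro-Wilk test
results <- tapply(Dataset$Algometre.J0, Dataset$Modalite, shapiro.test)
results

# After the conversion, the Rcmdr-generated call should be accepted as well
# (needs the RcmdrMisc package, so it is left commented out here):
# normalityTest(Algometre.J0 ~ Modalite, test = "shapiro.test", data = Dataset)
```

If the conversion was done earlier in the session, re-reading or re-importing the data afterwards would bring the chr column back, so it is worth running str(Dataset) again right before the test.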

Related

In R, how to construct vectors of the corresponding elements of the data frames

I have a list of data frames, for example the first three
[[1]]
01oct 24sep 17sep 10sep 03sep 27aug 20aug 13aug 06aug 30jul 23jul 16jul 09jul 02jul 25jun 18jun 11jun 04jun 28may 21may 14may 07may 30apr 23apr
3.25 9.50 0.80 6.85 6.70 6.65 14.35 62.35 9.75 2.35 18.55 8.90 17.85 14.75 0.90 0.50 17.05 19.15 44.25 0.15 42.05 10.45 12.00 5.05
16apr 09apr 02apr
0.15 12.90 23.20
[[2]]
30sep 23sep 16sep 09sep 02sep 26aug 19aug 12aug 05aug 29jul 22jul 15jul 08jul 01jul 24jun 17jun 10jun 03jun 27may 20may 13may 06may 29apr 22apr
1.90 4.60 23.95 3.95 12.65 26.30 38.30 2.80 2.35 34.10 8.30 7.30 28.85 2.45 5.20 15.35 1.85 36.75 0.95 8.40 22.35 37.70 6.00 0.40
15apr 08apr
3.25 5.45
[[3]]
28sep 21sep 14sep 07sep 31aug 24aug 17aug 10aug 03aug 27jul 20jul 13jul 06jul 29jun 22jun 15jun 08jun 01jun 25may 18may 11may 04may 27apr 20apr
5.85 13.70 2.85 12.50 43.40 13.25 5.65 4.80 12.20 5.40 3.05 12.90 20.70 21.75 13.20 18.60 0.70 13.15 20.30 2.40 2.30 13.50 4.70 19.60
13apr 06apr
17.60 14.50
I am trying to create vectors of the corresponding elements of each data frame. In the above example, the first three elements of my first vector would be 3.25, 1.90, 5.85. The second vector would be 9.5, 4.6, 13.7. The strings showing dates ideally would be left out, since at a later stage the vectors will be used to compute correlations.
My ultimate goal is an array of these vectors.
I know this could be done with nested loops; however, I've tried that and ran into other problems with this kind of array assignment in R (but that's for another thread). I also understand that nested loops are inefficient and not best practice.
What is the most reproducible way to construct these vectors and the array of them in R?
library(tidyverse)
# example data
l <- list(
data.frame(a = 3.25, b = 9.5),
data.frame(c = 1.90, d = 4.6),
data.frame(e = 5.85, d = 13.70)
)
combined <-
l %>%
map(~ {
# unify column names
colnames(.x) <- colnames(.x) %>% length() %>% seq()
.x
}) %>%
reduce(bind_rows)
combined[[1]]
#> [1] 3.25 1.90 5.85
combined[[2]]
#> [1] 9.5 4.6 13.7
Created on 2022-03-01 by the reprex package (v2.0.0)
It sounds like you want a transposition of the vectors. This is going to run into problems, because to do so suggests that the vectors should all be the same length, which they're not.
lengths(yourlist)
# [1] 27 26 26
We can fix that by padding them with NA,
yourlist <- lapply(yourlist, `length<-`, max(lengths(yourlist)))
yourlist
# [[1]]
# 01oct 24sep 17sep 10sep 03sep 27aug 20aug 13aug 06aug 30jul 23jul 16jul 09jul 02jul 25jun 18jun 11jun 04jun 28may 21may 14may 07may 30apr 23apr 16apr 09apr 02apr
# 3.25 9.50 0.80 6.85 6.70 6.65 14.35 62.35 9.75 2.35 18.55 8.90 17.85 14.75 0.90 0.50 17.05 19.15 44.25 0.15 42.05 10.45 12.00 5.05 0.15 12.90 23.20
# [[2]]
# 30sep 23sep 16sep 09sep 02sep 26aug 19aug 12aug 05aug 29jul 22jul 15jul 08jul 01jul 24jun 17jun 10jun 03jun 27may 20may 13may 06may 29apr 22apr 15apr 08apr
# 1.90 4.60 23.95 3.95 12.65 26.30 38.30 2.80 2.35 34.10 8.30 7.30 28.85 2.45 5.20 15.35 1.85 36.75 0.95 8.40 22.35 37.70 6.00 0.40 3.25 5.45 NA
# [[3]]
# 28sep 21sep 14sep 07sep 31aug 24aug 17aug 10aug 03aug 27jul 20jul 13jul 06jul 29jun 22jun 15jun 08jun 01jun 25may 18may 11may 04may 27apr 20apr 13apr 06apr
# 5.85 13.70 2.85 12.50 43.40 13.25 5.65 4.80 12.20 5.40 3.05 12.90 20.70 21.75 13.20 18.60 0.70 13.15 20.30 2.40 2.30 13.50 4.70 19.60 17.60 14.50 NA
Given that, a list-transpose:
yourlist2 <- do.call(Map, c(list(f = c, use.names = FALSE), yourlist))
str(yourlist2)
# List of 27
# $ : num [1:3] 3.25 1.9 5.85
# $ : num [1:3] 9.5 4.6 13.7
# $ : num [1:3] 0.8 23.95 2.85
# $ : num [1:3] 6.85 3.95 12.5
# $ : num [1:3] 6.7 12.6 43.4
# $ : num [1:3] 6.65 26.3 13.25
# $ : num [1:3] 14.35 38.3 5.65
# $ : num [1:3] 62.4 2.8 4.8
# $ : num [1:3] 9.75 2.35 12.2
# $ : num [1:3] 2.35 34.1 5.4
# $ : num [1:3] 18.55 8.3 3.05
# $ : num [1:3] 8.9 7.3 12.9
# $ : num [1:3] 17.9 28.9 20.7
# $ : num [1:3] 14.75 2.45 21.75
# $ : num [1:3] 0.9 5.2 13.2
# $ : num [1:3] 0.5 15.3 18.6
# $ : num [1:3] 17.05 1.85 0.7
# $ : num [1:3] 19.1 36.8 13.2
# $ : num [1:3] 44.25 0.95 20.3
# $ : num [1:3] 0.15 8.4 2.4
# $ : num [1:3] 42 22.4 2.3
# $ : num [1:3] 10.4 37.7 13.5
# $ : num [1:3] 12 6 4.7
# $ : num [1:3] 5.05 0.4 19.6
# $ : num [1:3] 0.15 3.25 17.6
# $ : num [1:3] 12.9 5.45 14.5
# $ : num [1:3] 23.2 NA NA
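Since the stated goal is an array of these vectors, the padding and the transpose can also be done in one step that yields a matrix. This is only a sketch under the assumption that the list elements are named numeric vectors, as the printout suggests; the shortened yourlist below is a hypothetical cut-down copy of the data:

```r
# Hypothetical shortened version of the list from the question
yourlist <- list(
  c(`01oct` = 3.25, `24sep` = 9.50, `17sep` = 0.80),
  c(`30sep` = 1.90, `23sep` = 4.60),
  c(`28sep` = 5.85, `21sep` = 13.70)
)

n <- max(lengths(yourlist))

# Indexing past the end of a vector yields NA, which pads the short ones;
# unname() drops the date labels. vapply() binds the results column-wise.
mat <- vapply(yourlist, function(v) unname(v)[seq_len(n)], numeric(n))

mat[1, ]   # first elements of every list entry
mat[2, ]   # second elements of every list entry
```

Each row of mat is one of the requested vectors, and each column is one of the original list entries, with NA where an entry was shorter.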

How to perform sensitivity analysis using Lek's profile in R?

I am trying to do a sensitivity analysis using R. My data set has a few continuous explanatory variables and a categorical response variable (7 categories).
I tried to run the code below.
model=train(factor(mode)~Time+Cost+Age+Income,
method="nnet",
preProcess=c("center","scale"),
data=train,
verbose=F,
trControl=trainControl(method='cv', verboseIter = F),
tuneGrid=expand.grid(.size=c(1:20), .decay=c(0,0.001,0.01,0.1)))
After fitting the model with this code, I tried to produce Lek's profile using the code below.
lekprofile(model)
However, I got the error "Error in xvars[, x_names]: subscript out of bounds"
Please help me resolve the error.
lekprofile() doesn't work for a classification model. For comparison, it does work if we fit a regression model:
library(caret)
library(NeuralNetTools)
library(mlbench)
data(BostonHousing)
str(BostonHousing)
'data.frame': 506 obs. of 14 variables:
$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm : num 6.58 6.42 7.18 7 7.15 ...
$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ dis : num 4.09 4.97 4.97 6.06 6.06 ...
$ rad : num 1 2 2 3 3 3 5 5 5 5 ...
$ tax : num 296 242 242 222 222 222 311 311 311 311 ...
$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
$ b : num 397 397 393 395 397 ...
$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
We train the model, excluding the categorical variable chas:
model = train(medv ~ .,data=BostonHousing[,-4],method="nnet",
trControl=trainControl(method="cv",number=10),
tuneGrid=data.frame(size=c(5,10,20),decay=0.1))
lekprofile(model)
You can see the y-axis is meant to be continuous. We can try to discretize our response variable medv and you can see it crashes:
BostonHousing$medv = cut(BostonHousing$medv,4)
model = train(medv ~ .,data=BostonHousing[,-4],method="nnet",
trControl=trainControl(method="cv",number=10),
tuneGrid=data.frame(size=c(5,10,20),decay=0.1))
lekprofile(model)
Error in `[.data.frame`(preds, , ysel, drop = FALSE) :
undefined columns selected

How to create a loop to run 2 variable generalized regression models?

I have 19 variables and I want to run 19 different regressions that consist of 2 independent variables from my dataset.
Update: this is my dataset's structure:
$ Failure_Response_Var_Yr: num 0 0 0 0 0 0 0 0 0 0 ...
$ exp_var_nocorr_2 : num 4.61 5.99 6.13 3.17 4.4 ...
$ exp_var_nocorr_3 : num 4.16 5.46 5.24 2.86 3.72 ...
$ exp_var_nocorr_4 : num 0.00191 2.23004 0.5613 1.07986 0.99836 ...
$ exp_var_nocorr_5 : num 0.709 2.79 6.846 15.478 11.418 ...
$ exp_var_nocorr_6 : num 0.724 0.497 1.782 0.156 2.525 ...
$ exp_var_nocorr_7 : num 0 168.17 92.041 0.584 265.338 ...
$ exp_var_nocorr_8 : num -38.64 4.89 1.5 24.8 16.56 ...
$ exp_var_nocorr_9 : num 116 88.3 56.4 60.6 57.6 ...
$ exp_var_nocorr_10 : num 0 10.3 0 93.7 0 ...
$ exp_var_nocorr_11 : num 1.02 1.23 1.31 2.06 1.33 ...
$ exp_var_nocorr_12 : num 60 140 124 275 203 ...
$ exp_var_nocorr_13 : num 10.835 5.175 1.838 0.347 0.783 ...
$ exp_var_nocorr_14 : num 59 60.2 87.2 42.2 84.2 ...
$ exp_var_nocorr_15 : num 61.9 68.3 99 50.2 103.9 ...
$ exp_var_nocorr_16 : num 4.4 11.24 8.23 6.9 8.84 ...
$ exp_var_nocorr_17 : num 6.43 18.62 10.72 15.62 10.35 ...
I wrote this code:
col17 <- names(my.sample)[-c(1:9,26:29)]
Such that now dput(col17) gives out:
c("exp_var_nocorr_2", "exp_var_nocorr_3", "exp_var_nocorr_4", "exp_var_nocorr_5", "exp_var_nocorr_6", "exp_var_nocorr_7", "exp_var_nocorr_8", "exp_var_nocorr_9", "exp_var_nocorr_10", "exp_var_nocorr_11", "exp_var_nocorr_12", "exp_var_nocorr_13", "exp_var_nocorr_14", "exp_var_nocorr_15", "exp_var_nocorr_16", "exp_var_nocorr_17" )
logit.test2 <- vector("list", length(col17))
# start of loop
for(i in seq_along(col17)){
  for(k in seq_along(col17)){
    logit.test2[i] <- glm(reformulate(col17[i]+col17[k], "Failure_Response_Var_Yr"),
                          family=binomial(link='logit'), data=my.sample)
  }
}
# end of loop
but it printed out this problem:
"Error in col17[i] + col17[k] : non-numeric argument to binary operator"
Can anybody suggest code that fixes this problem?
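The error arises because + is not defined for character strings in R, so col17[i] + col17[k] fails before reformulate() is even called. reformulate() accepts a character vector of term labels, so the two names can be passed as c(col17[i], col17[k]). Two further things stand out in the loop: the inner loop overwrites logit.test2[i] on every pass, and storing a fitted model in a list needs [[ ]] rather than [ ]. Here is a minimal sketch, assuming the aim is one model per unordered pair of predictors (the question does not say exactly which pairs are wanted) and using simulated data as a hypothetical stand-in for my.sample:

```r
set.seed(42)
n <- 200

# Hypothetical stand-in for my.sample with only three predictors
my.sample <- data.frame(
  Failure_Response_Var_Yr = rbinom(n, 1, 0.3),
  exp_var_nocorr_2 = rnorm(n),
  exp_var_nocorr_3 = rnorm(n),
  exp_var_nocorr_4 = rnorm(n)
)
col17 <- setdiff(names(my.sample), "Failure_Response_Var_Yr")

# All unordered pairs of predictor names, one pair per column
pairs <- combn(col17, 2)

logit.test2 <- vector("list", ncol(pairs))
for (j in seq_len(ncol(pairs))) {
  # Build Failure_Response_Var_Yr ~ x1 + x2 from the two names
  f <- reformulate(pairs[, j], response = "Failure_Response_Var_Yr")
  # [[ ]] stores the whole glm object in slot j
  logit.test2[[j]] <- glm(f, family = binomial(link = "logit"), data = my.sample)
}

# Label each model with the pair it used
names(logit.test2) <- apply(pairs, 2, paste, collapse = " + ")

lapply(logit.test2, coef)
```

If only specific pairs are needed (for example each variable with its neighbour), pairs can be replaced by any two-row character matrix of names and the loop stays the same.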

Error in r: undefined columns selected

I was trying to do a partition plot, and I used the following code:
install.packages('klaR')
library(klaR)
partimat(Type~. , data = training, method = "lda")
partimat('Type'~. , data = training, method = "qda")
R gave me this error code:
Error in `[.data.frame`(m, xvars) : undefined columns selected
and my data is like this
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 178 obs. of 13 variables:
$ Alcohol : num 14.2 13.2 13.2 14.4 13.2 ...
$ Malic acid : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
$ Ash : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
$ Alcalinity of ash : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
$ Magnesium : int 127 100 101 113 118 112 96 121 97 98 ...
$ Total phenols : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
$ Flavanoids : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
$ Nonflavanoid phenols: num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
$ Proanthocyanins : num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
$ Color intensity : num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
$ Hue : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
$ Proline : int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
$ Type : int 1 1 1 1 1 1 1 1 1 1 ...
Please let me know how to solve it!
There is no Type variable in the UCI Machine Learning Wine data set. The classification variable is class, and it is the first column in the data set.
# data source: UCI ML Repository Wine data
# https://archive.ics.uci.edu/ml/datasets/wine
library(klaR)
colNames <- c("class","alcohol","malicAcid","ash","acalinityOfAsh",
"magnesium","totalPhenols","flavanoids","nonflavanoidPhenols",
"proanthocyanins","colorIntensity","hue","od280.od315OfDilutedWines",
"proline")
wine <- read.csv("./data/wine.csv",header=FALSE,col.names=colNames)
wine$class <- as.factor(wine$class)
partimat(class ~ alcohol + malicAcid, data=wine, method="lda",plot.matrix=FALSE)
...and the output: [figure: partition plot of class on alcohol and malicAcid, method = "lda"]
I had the same problem and fixed it by renaming my variables. One variable in my data set had a name that began with a blank space. The program could not recognize it, and that triggered the error. Once I removed the blank space, the problem disappeared.
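Along the same lines, the str() output in the question shows several column names containing spaces ("Malic acid", "Color intensity", ...), and Type stored as int rather than as a factor. Whether that is the actual cause here is an assumption, but it is cheap to rule out. Below is a minimal sketch that cleans the names with base R's make.names(); the small training data frame is a hypothetical miniature of the asker's data:

```r
# Hypothetical miniature of the training data, keeping the names with spaces
training <- data.frame(
  `Malic acid`      = c(1.71, 1.78, 2.36, 1.95),
  `Color intensity` = c(5.64, 4.38, 5.68, 7.80),
  Type              = c(1L, 1L, 2L, 2L),
  check.names = FALSE
)
names(training)

# Turn every name into a syntactically valid one (spaces become dots)
names(training) <- make.names(names(training))

# The grouping variable is converted to a factor, as in the answer above
training$Type <- factor(training$Type)

names(training)
str(training)

# With clean names the original call can be retried (needs klaR and the
# full data, so it is left commented out here):
# partimat(Type ~ ., data = training, method = "lda")
```

Leading or trailing blanks in names can be removed beforehand with trimws(names(training)) if dots in their place are not wanted.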

ggbiplot graphical display in groups

I am learning biplots with the wine data set. How does R know that Barolo, Grignolino and Barbera are the wine classes when we don't see a wine class column in the data set?
More details about the wine data set are in the following links
ggbiplot - how not to use the feature vectors in the plot
https://github.com/vqv/ggbiplot
Thanks very much
The wine dataset contains 2 objects. One is the data.frame wine, with 178 observations of 13 quantitative variables:
str(wine)
'data.frame': 178 obs. of 13 variables:
$ Alcohol : num 14.2 13.2 13.2 14.4 13.2 ...
$ MalicAcid : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
$ Ash : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
$ AlcAsh : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
$ Mg : int 127 100 101 113 118 112 96 121 97 98 ...
$ Phenols : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
$ Flav : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
$ NonFlavPhenols: num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
$ Proa : num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
$ Color : num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
$ Hue : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
$ OD : num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
$ Proline : int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
There is also one vector wine.class that contains 178 observations of the qualitative wine.class variable:
str(wine.class)
Factor w/ 3 levels "barolo","grignolino",..: 1 1 1 1 1 1 1 1 1 1 ...
The 13 quantitative variables are used to compute the PCA:
wine.pca <- prcomp(wine, scale. = TRUE)
while the wine.class variable is only used to color the points on the plot.
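The link between the two objects is purely positional: element i of the grouping vector describes row i of the data used for the PCA. Here is a minimal sketch of that idea in base R, using the built-in iris data as a stand-in (iris[, 1:4] plays the role of wine and iris$Species the role of wine.class; this substitution is an assumption made so the example runs without ggbiplot):

```r
# Stand-in data: numeric measurements and a separate factor of classes
measurements <- iris[, 1:4]
classes      <- iris$Species

# PCA uses only the numeric columns, exactly as prcomp(wine, scale. = TRUE)
pca    <- prcomp(measurements, scale. = TRUE)
scores <- pca$x[, 1:2]

# The grouping vector must have one entry per scored observation
stopifnot(nrow(scores) == length(classes))

# Color each point by the class found at the same position in `classes`
plot(scores, col = as.integer(classes), pch = 19)
legend("topright", legend = levels(classes),
       col = seq_along(levels(classes)), pch = 19)

# The ggbiplot equivalent for the wine data would be:
# ggbiplot(wine.pca, groups = wine.class)
```

Reordering or filtering the rows of one object without doing the same to the other would silently mislabel the points, which is why the length check is worth keeping.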
