R - NA handling in a Mann-Whitney U test

I need to compute weighted Mann-Whitney U test results a few hundred times. Each iteration is a two-sample test for differences between two groups. I can't figure out how to get the existing function to handle missing values without manually deleting cases first.
The data for a few of the comparisons are here, in a data frame I call dat. All of the variables containing numbers in this sheet are numeric.
Here's how I call the sjstats::mannwhitney() function:
mannwhitney(dat, measure1, group)
When I do so, I get the following error:
Error in `[[<-.data.frame`(`*tmp*`, "grp1.label", value = character(0)) :
replacement has 0 rows, data has 1
I suspect this is because of the missing value in the 212th observation of measure1. But wrapping the vector names in na.omit() or !is.na() doesn't address the problem, perhaps because doing so still results in a data frame where the number of non-NA values of group is greater than the number of non-NA values of measure1.
Any thoughts on how I could incorporate dynamic NA handling into the function call?

I am not sure what class your group column is, but if I do it like this:
library(sjstats)
dat = read.csv("question - Sheet1.csv")
str(dat)
'data.frame': 301 obs. of 5 variables:
$ measure1 : num 2 1.6 2.2 2.7 1.8 1.8 4 4 3.9 -3.7 ...
$ measure2 : num 0.9 0.1 0 0.4 -1 -1.3 2.1 0 -1.1 -3.9 ...
$ measure3 : num 1.1 1.1 2.2 1.2 1.9 1.2 0 3 1.9 -3.8 ...
$ measurre4: num 2 2 2 3 3 2 3 4 3 2.36 ...
$ group : int 0 0 0 0 0 0 0 0 0 0 ...
I get:
mannwhitney(dat, measure1, group)
Error in `[[<-.data.frame`(`*tmp*`, "grp1.label", value = character(0)) :
replacement has 0 rows, data has 1
Convert your group to a factor:
dat$group = factor(dat$group)
mannwhitney(dat, measure1, group)
# Mann-Whitney-U-Test
Groups 1 = 0 (n = 110) | 2 = 1 (n = 190):
U = 16913.000, W = 10808.000, p = 0.621, Z = 0.495
effect-size r = 0.029
rank-mean(1) = 153.75
rank-mean(2) = 148.62
Reading the code, the bug comes from this:
labels <- sjlabelled::get_labels(grp, attr.only = F, values = NULL,
                                 non.labelled = T)
If your group is numeric, it doesn't have attributes and hence you get no labels:
sjlabelled::get_labels(0:1)
NULL
sjlabelled::get_labels(factor(0:1))
[1] "0" "1"

Related

Make a loop function by including gender into the algorithm

I have the following data set:
Age<-c(2,2.1,2.2,3.4,3.5,4.2,4.7,4.8,5,5.6,NA, 5.9, NA)
R<-c(2,2.1,2.2,3.4,3.5,4.2,4.7,4.8,5,5.6,NA, 5.9, NA)
sex<-c(1,0,1,1,1,1,1,0,0,0,NA, 0,1)
df1<-data.frame(Age,R,sex)
# Second dataset:
Age2<-seq(2,20,0.25)
Mspline<-rnorm(73)
df2.F<-data.frame(Age2, Mspline)
# Third data
Age2<-seq(2,20,0.25)
Mspline<-rnorm(73)
df2.M<-data.frame(Age2, Mspline)
I was wondering how I can include gender in the calculation and combine these two algorithms into a loop function. What I need is:
If sex=1 then use the following function to calculate Time
last = dim(df2.F)[1]
fM.F<-approxfun(df2.F$Age2, df2.F$Mspline, yleft = df2.F$Mspline[1] , yright = df2.F$Mspline[last])
df1$Time<-fM.F(df1$Age)
and If sex=0 then use this function to calculate Time
last = dim(df2.M)[1]
fM.M<-approxfun(df2.M$Age2, df2.M$Mspline, yleft = df2.M$Mspline[1] , yright = df2.M$Mspline[last])
df1$Time<-fM.M(df1$Age)
I mean: read the first record in df1; if it is female (say age = 4.1), then Time = fM.F(4.1), but if the gender is male, then Time = fM.M(4.1).
You can create a function that takes the Age vector, the sex value, and the male and female specific dataframes, and selects the frame to use based on the sex value.
f <- function(age, s, m, f) {
  # NA sex -> NA time
  if (is.na(s)) return(NA)
  # pick the male (s == 0) or female reference frame
  df <- if (s == 0) m else f
  last <- nrow(df)
  fM <- approxfun(df$Age2, df$Mspline,
                  yleft = df$Mspline[1], yright = df$Mspline[last])
  fM(age)
}
Now, just apply the function by group, using pull(cur_group(),sex) to get the sex value for the current group.
library(dplyr)
df1 %>%
  group_by(sex) %>%
  mutate(time = f(Age, pull(cur_group(), sex), df2.M, df2.F))
Output:
Age R sex time
<dbl> <dbl> <dbl> <dbl>
1 2 2 1 -0.186
2 2.1 2.1 0 1.02
3 2.2 2.2 1 -1.55
4 3.4 3.4 1 -0.461
5 3.5 3.5 1 0.342
6 4.2 4.2 1 -0.560
7 4.7 4.7 1 -0.114
8 4.8 4.8 0 0.247
9 5 5 0 -0.510
10 5.6 5.6 0 -0.982
11 NA NA NA NA
12 5.9 5.9 0 -0.231
13 NA NA 1 NA
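A grouping-free alternative, sketched under the same data assumptions: build both interpolants once and select per row with ifelse, which also propagates an NA sex to an NA time.
# Build each sex-specific interpolation function once
fM.F <- approxfun(df2.F$Age2, df2.F$Mspline,
                  yleft = df2.F$Mspline[1], yright = df2.F$Mspline[nrow(df2.F)])
fM.M <- approxfun(df2.M$Age2, df2.M$Mspline,
                  yleft = df2.M$Mspline[1], yright = df2.M$Mspline[nrow(df2.M)])

# Vectorized choice: sex == 1 -> female curve, sex == 0 -> male curve
df1$Time <- ifelse(df1$sex == 1, fM.F(df1$Age), fM.M(df1$Age))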

X needs to be a numerical value

I have an assignment for a statistics class where I copied code from the teacher, but I keep getting an error that x should be numerical, and I don't see where the issue is.
Survey=read.csv("ClassSurvey.csv", na.strings = c("", " ", "NA"))
attach(Survey)
library(FSAdata)
op = par(oma=c(0,0,1.5,0), mar=c(3,3,2,1))
hist(Height ~ Gender,
las=1,
nrow=2, ncol=1,
cex.main=0.9, cex.lab=0.8, cex.axis=0.8,
mgp=c(1.8,0.6,0),
xlab = "Height (cms)")
thank you!!
You have a couple of problems. First, you can't make a histogram on a variable that is not numeric. Try this:
data(ToothGrowth)
str(ToothGrowth)
'data.frame': 60 obs. of 3 variables:
$ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
$ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
$ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
attach(ToothGrowth)
Attempt to draw a histogram of supp, a variable of class "factor" (it's not numeric).
> hist(supp)
Error in hist.default(supp) : 'x' must be numeric
Your second problem is that the hist function doesn't accept a formula y ~ x. So
hist(len~supp)
fails as well, even though len is numeric. This works though:
hist(len)
I wonder if you copied the teacher's code incorrectly? Or the teacher's code is incorrect? Or something else.
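One guess worth checking (an assumption on my part): the nrow/ncol arguments in the teacher's code look like they belong to the formula method for hist() provided by the FSA package, whereas the question only loads FSAdata, which contains datasets, so library(FSA) may be the missing piece. Failing that, here is a base-R sketch of per-group histograms, assuming Survey$Height is numeric and Survey$Gender is a factor:
# One histogram panel per gender, stacked in two rows
op <- par(mfrow = c(2, 1), mar = c(3, 3, 2, 1), oma = c(0, 0, 1.5, 0))
for (g in levels(factor(Survey$Gender))) {
  hist(Survey$Height[which(Survey$Gender == g)],
       main = g, xlab = "Height (cms)", las = 1)
}
par(op)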

Caret: There were missing values in resampled performance measures

I am running caret's neural network on the Bike Sharing dataset and I get the following warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
I am not sure what the problem is. Can anyone help please?
The dataset is from:
https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
Here is the code:
library(caret)
library(bestNormalize)
data_hour = read.csv("hour.csv")
# Split dataset
set.seed(3)
split = createDataPartition(data_hour$casual, p=0.80, list=FALSE)
validation = data_hour[-split,]
dataset = data_hour[split,]
dataset = dataset[,c(-1,-2,-4)]
# View structure of data
str(dataset)
# 'data.frame': 13905 obs. of 14 variables:
# $ season : int 1 1 1 1 1 1 1 1 1 1 ...
# $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
# $ hr : int 1 2 3 5 8 10 11 12 14 15 ...
# $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
# $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
# $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
# $ weathersit: int 1 1 1 2 1 1 1 1 2 2 ...
# $ temp : num 0.22 0.22 0.24 0.24 0.24 0.38 0.36 0.42 0.46 0.44 ...
# $ atemp : num 0.273 0.273 0.288 0.258 0.288 ...
# $ hum : num 0.8 0.8 0.75 0.75 0.75 0.76 0.81 0.77 0.72 0.77 ...
# $ windspeed : num 0 0 0 0.0896 0 ...
# $ casual : int 8 5 3 0 1 12 26 29 35 40 ...
# $ registered: int 32 27 10 1 7 24 30 55 71 70 ...
# $ cnt : int 40 32 13 1 8 36 56 84 106 110 ...
## transform numeric data to Gaussian
dataset_selected = dataset[,c(-13,-14)]
for (i in 8:12) { dataset_selected[,i] = predict(boxcox(dataset_selected[,i] +0.1))}
# View transformed dataset
str(dataset_selected)
# 'data.frame': 13905 obs. of 12 variables:
# $ season : int 1 1 1 1 1 1 1 1 1 1 ...
# $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
# $ hr : int 1 2 3 5 8 10 11 12 14 15 ...
# $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
# $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
# $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
# $ weathersit: int 1 1 1 2 1 1 1 1 2 2 ...
# $ temp : num -1.47 -1.47 -1.35 -1.35 -1.35 ...
# $ atemp : num -1.18 -1.18 -1.09 -1.27 -1.09 ...
# $ hum : num 0.899 0.899 0.637 0.637 0.637 ...
# $ windspeed : num -1.8 -1.8 -1.8 -0.787 -1.8 ...
# $ casual : num -0.361 -0.588 -0.81 -1.867 -1.208 ...
# Train data with Neural Network model from caret
control = trainControl(method = 'repeatedcv', number = 10, repeats =3)
metric = 'RMSE'
set.seed(3)
fit = train(casual ~., data = dataset_selected, method = 'nnet', metric = metric, trControl = control, trace = FALSE)
Thanks for your help!
phiver's comment is spot on; however, I would still like to provide a more verbose answer on this concrete example.
In order to investigate what is going on in more detail, one should add the arguments returnResamp = "all" and savePredictions = "all" to trainControl:
control = trainControl(method = 'repeatedcv',
                       number = 10,
                       repeats = 3,
                       returnResamp = "all",
                       savePredictions = "all")
metric = 'RMSE'
set.seed(3)
fit = train(casual ~ .,
            data = dataset_selected,
            method = 'nnet',
            metric = metric,
            trControl = control,
            trace = FALSE)
Now, when running:
fit$results
#output
size decay RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 1 0e+00 0.9999205 NaN 0.8213177 0.009655872 NA 0.007919575
2 1 1e-04 0.9479487 0.1850270 0.7657225 0.074211541 0.20380571 0.079640883
3 1 1e-01 0.8801701 0.3516646 0.6937938 0.074484860 0.20787440 0.077960642
4 3 0e+00 0.9999205 NaN 0.8213177 0.009655872 NA 0.007919575
5 3 1e-04 0.9272942 0.2482794 0.7434689 0.091409600 0.24363651 0.098854133
6 3 1e-01 0.7943899 0.6193242 0.5944279 0.011560524 0.03299137 0.013002708
7 5 0e+00 0.9999205 NaN 0.8213177 0.009655872 NA 0.007919575
8 5 1e-04 0.8811411 0.3621494 0.6941335 0.092169810 0.22980560 0.098987058
9 5 1e-01 0.7896507 0.6431808 0.5870894 0.009947324 0.01063359 0.009121535
We notice the problem occurs when decay = 0.
Let's filter the observations and predictions for decay = 0:
library(tidyverse)
fit$pred %>%
filter(decay == 0) -> for_r2
var(for_r2$pred)
#output
0
We can observe that all of the predictions when decay == 0 are the same (they have zero variance). The model exclusively predicts 0:
unique(for_r2$pred)
#output
0
So when the summary function tries to compute R squared:
caret::R2(for_r2$obs, for_r2$pred)
#output
[1] NA
Warning message:
In cor(obs, pred, use = ifelse(na.rm, "complete.obs", "everything")) :
the standard deviation is zero
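This NA is just the generic behavior of cor() when one of its inputs is constant; a minimal standalone illustration with toy vectors (not from this dataset):
# R-squared is the squared correlation between observed and predicted;
# correlation is undefined when the predictions are constant
obs  <- c(1, 2, 3, 4)
pred <- rep(0, 4)   # zero-variance predictions, like the decay = 0 model
cor(obs, pred)
#output
# [1] NA, with warning "the standard deviation is zero"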
Answer by #topepo (caret package main developer). See the detailed GitHub thread here.
It looks like it happens when you have one hidden unit and almost no
regularization. What is happening is that the model is predicting a
value very close to a constant (so that the RMSE is a little worse
than the basic standard deviation of the outcome):
> ANN_cooling_fit$resample %>% dplyr::filter(is.na(Rsquared))
RMSE Rsquared MAE size decay Resample
1 8.414010 NA 6.704311 1 0e+00 Fold04.Rep01
2 8.421244 NA 6.844363 1 0e+00 Fold01.Rep03
3 7.855925 NA 6.372947 1 1e-04 Fold10.Rep07
4 7.963816 NA 6.428947 1 0e+00 Fold07.Rep09
5 8.492898 NA 6.901842 1 0e+00 Fold09.Rep09
6 7.892527 NA 6.479474 1 0e+00 Fold10.Rep10
> sd(mydata$V7)
[1] 7.962888
So it's nothing to really worry about; just some parameters that do very poorly.
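A practical follow-up (my suggestion, not from the thread above): since the NaN R-squared values all come from the decay = 0 candidates, you can simply exclude that value from the search space with caret's tuneGrid argument. A minimal sketch, reusing control and dataset_selected from above:
# Drop decay = 0 so no candidate model collapses to a constant
# prediction (which is what made R-squared undefined)
grid <- expand.grid(size = c(1, 3, 5), decay = c(1e-4, 1e-2, 1e-1))
set.seed(3)
fit <- train(casual ~ ., data = dataset_selected, method = 'nnet',
             metric = 'RMSE', trControl = control,
             tuneGrid = grid, trace = FALSE)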
The answer by #missuse is already very insightful for understanding why this error happens, so I just want to add some straightforward ways to get rid of it.
If the predictions have zero variance in some cross-validation folds, the model didn't converge. In such cases, you can try the neuralnet package, which offers two parameters you can tune:
threshold : default value = 0.01. Set it to 0.3 and then try lower values 0.2, 0.1, 0.05.
stepmax : default value = 1e+05. Set it to 1e+08 and then try lower values 1e+07, 1e+06.
In most cases, it is sufficient to change the threshold parameter like this:
model.nn <- caret::train(formula1,
                         method = "neuralnet",
                         data = training.set[,],
                         # apply preProcess within cross-validation folds
                         preProcess = c("center", "scale"),
                         trControl = trainControl(method = "repeatedcv",
                                                  number = 10,
                                                  repeats = 3),
                         threshold = 0.3)
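For context on those two knobs (based on neuralnet's documentation as I recall it, so verify against your version): threshold is the stopping criterion on the partial derivatives of the error function, and stepmax caps the number of training steps per repetition. Raising threshold makes convergence easier at the cost of a less precisely fitted network.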

Caret to predict class with knn: Do I need to provide unknown classes with a random class variable?

I have a tab delimited file with 70 rows of data and 34 columns of characteristics, where the first 60 rows look like this:
groups x1 x2 x3 x4 x5 (etc, up to x34)
0 0.1 0.5 0.5 0.4 0.2
1 0.2 0.3 0.8 0.4 0.1
0 0.4 0.7 0.6 0.2 0.1
1 0.4 0.4 0.7 0.1 0.4
And the last 10 rows look like this:
groups x1 x2 x3 x4 x5
NA 0.2 0.1 0.5 0.4 0.2
NA 0.2 0.1 0.8 0.4 0.1
NA 0.2 0.2 0.6 0.2 0.1
NA 0.2 0.3 0.7 0.1 0.4
The groups are binary (i.e. each row either belongs to group 0 or group 1). The aim is to use the first 60 rows as my training data set, and the last 10 rows as my test data set; to classify the last 10 rows into groups 0 or 1. The class of the last 10 rows is currently labelled as "NA" (as they have not been assigned to a class).
I ran this code:
library(caret)
data <- read.table("data_challenge_test.tab", header = TRUE)
set.seed(3303)
train <- sample(1:60)
data.train <- data[train, ]
dim(data.train)
data.test <- data[-train, ]
dim(data.test)
data.train[["groups"]] <- factor(data.train[["groups"]])
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
knn_fit <- train(groups ~ x1 + x2 + x3 + x4 + x5, data = data.train,
                 method = "knn", trControl = trctrl,
                 preProcess = c("center", "scale"), tuneLength = 10)
test_pred <- predict(knn_fit, newdata = data.test)
confusionMatrix(test_pred, data.test$groups)
The test_pred output is:
> test_pred
[1] 0 0 0 0 1 1 0 1 1 0
Levels: 0 1
and the confusion matrix output is:
> confusionMatrix(test_pred, data.test$groups)
Error in confusionMatrix.default(test_pred, data.test$groups) :
the data cannot have more levels than the reference
Then I checked the str of test_pred and data.test$groups:
> str(test_pred)
Factor w/ 2 levels "0","1": 1 1 1 1 2 2 1 2 2 1
> str(data.test$groups)
int [1:10] NA NA NA NA NA NA NA NA NA NA
So I understand that my error occurs because my two inputs to the confusion matrix are not of the same type.
So then, in my data set, I changed the NA values in the groups column to randomly either 0 or 1 (i.e. I manually set the first 5 unknown classes to class 0 and the second 5 unknown classes to class 1).
Then I re-ran the above code.
The output was:
> test_pred
[1] 0 0 0 0 1 1 0 1 1 0
Levels: 0 1
> confusionMatrix(test_pred, data.test$groups)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 4 2
1 1 3
Accuracy : 0.7
95% CI : (0.3475, 0.9333)
No Information Rate : 0.5
P-Value [Acc > NIR] : 0.1719
Kappa : 0.4
Mcnemar's Test P-Value : 1.0000
Sensitivity : 0.8000
Specificity : 0.6000
Pos Pred Value : 0.6667
Neg Pred Value : 0.7500
Prevalence : 0.5000
Detection Rate : 0.4000
Detection Prevalence : 0.6000
Balanced Accuracy : 0.7000
'Positive' Class : 0
So I have three questions:
Originally, the classes in my training data set were 0 or 1, and the classes in my test data set were all marked as NA or ?.
caret doesn't seem to like that, due to the error described above. When I assigned my test data set random binary classes instead of NA/?, the analysis "worked" (as in, no errors).
Are the binary groups I manually and randomly assigned to the test data set affecting the confusion matrix (or any other aspect of the analysis), or is this acceptable? If not, what is the solution: what group do I assign unclassified test data to at the beginning of the analysis?
Is the test_pred output ordered? I wanted the last 10 rows of my table to be predicted, and the output of test_pred is: 0 0 0 0 1 1 0 1 1 0. Are these the last 10 rows, in order?
I would like to visualise the results once this issue is sorted. Can anyone recommend a standard package that is commonly used for this (I am new to machine learning)?
Edit: Given that the confusion matrix directly uses the reference and prediction classes to calculate accuracy, I'm pretty sure I cannot just randomly assign classes to the unknown rows, as that will distort the accuracy. So an alternative suggestion would be appreciated.
A confusion matrix is comparison of your classification output to the actual classes. So if your test data set does not have the labels, you cannot draw a confusion matrix.
There are other ways of checking how well your classification algorithm did. You can, for now, read about AIC, which is analogous to linear regression's R-squared.
If you still want a confusion matrix, use the first 50 rows for training and rows 51-60 for testing. That output will let you create a confusion matrix.
Yes, the output is ordered, and you can column-bind it to your test set.
Visualising classification tasks is commonly done by drawing a ROC curve; the caret library should have that too.
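A minimal sketch of that 50/10 split suggestion, reusing data and trctrl from the question, and assuming rows 1-60 are the labelled ones:
set.seed(3303)
labelled <- data[1:60, ]
labelled$groups <- factor(labelled$groups)
idx <- sample(1:60)

train.set <- labelled[idx[1:50], ]    # 50 labelled rows for fitting
test.set  <- labelled[idx[51:60], ]   # 10 labelled rows held out

knn_fit <- train(groups ~ x1 + x2 + x3 + x4 + x5, data = train.set,
                 method = "knn", trControl = trctrl,
                 preProcess = c("center", "scale"), tuneLength = 10)

# Meaningful confusion matrix: the held-out rows have true labels
confusionMatrix(predict(knn_fit, newdata = test.set), test.set$groups)

# The truly unlabelled rows (61-70) are predicted separately; predictions
# come back in row order, so they can be column-bound to the test set
unlabelled <- data[61:70, ]
unlabelled$pred <- predict(knn_fit, newdata = unlabelled)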

r - subsetting one row of data frame drops zero decimal from number

Data
Given a data frame
df <- data.frame("id"=c(1,2,3), "a"=c(10.0, 11.2, 12.3),"b"=c(10.1, 11.9, 12.9))
> df
id a b
1 1 10.0 10.1
2 2 11.2 11.9
3 3 12.3 12.9
> str(df)
'data.frame': 3 obs. of 3 variables:
$ id: num 1 2 3
$ a : num 10 11.2 12.3
$ b : num 10.1 11.9 12.9
Question
When subsetting the first row, the .0 decimal part from the 10.0 in column a gets dropped
> df[1,]
id a b
1 1 10 10.1
> str(df[1,])
'data.frame': 1 obs. of 3 variables:
$ id: num 1
$ a : num 10
$ b : num 10.1
I 'assume' this is intentional, but how do I subset the first row so that it keeps the .0 part?
Notes
Subsetting two rows keeps the .0
> df[1:2,]
id a b
1 1 10.0 10.1
2 2 11.2 11.9
I assume you understand this is a matter of how the number is printed, and not about how the value is stored by R. Anyway, you can use format to ensure the digits will be printed:
> format(df[1,], nsmall = 1)
id a b
1 1.0 10.0 10.1
> format(df[1,], nsmall = 2)
id a b
1 1.00 10.00 10.10
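One caveat worth flagging (general format() behavior, not specific to this question): format() returns character columns, so treat the result as display-only, not something to compute with:
# format() converts every column to character
str(format(df[1, ], nsmall = 1))
'data.frame': 1 obs. of 3 variables:
$ id: chr "1.0"
$ a : chr "10.0"
$ b : chr "10.1"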
The reason for this behavior is not the number of rows being printed. R tries to display the minimum number of decimals possible, but all numbers in a column get the same number of digits so the display lines up:
> df2 <- data.frame(a=c(1.00001, 1), b=1:2)
> df2
a b
1 1.00001 1
2 1.00000 2
Now if I print only the row with the non-integer number:
> df2[1,]
a b
1 1.00001 1
