Bug in R e1071 Naive Bayes?

I have no experience in the R community, so please point me somewhere else if this is not the appropriate forum...
Long story short, I suspect that e1071::naiveBayes is biased towards predicting labels in alphabetical order.
In an earlier question here I had noticed some strange behavior with numerical predictors in the e1071 implementation of naive Bayes. While I got a more reasonable answer, some probabilities seemed biased upwards.
Can anyone shed any light on why this simulation would end up like this? I can only imagine that it is a bug at this point...
library(e1071)
# get a data frame with numObs rows, and numDistinctLabels possible labels
# each label is randomly drawn from letters a-z
# each label has its own distribution of a numeric variable
# this is normal(i*100, 10), i in 1:numDistinctLabels
# so, if labels are t, m, and q, t is normal(100, 10), m is normal(200, 10), etc
# the idea is that all labels should be predicted just as often
# but it seems that "a" will be predicted most, "b" second, etc
doExperiment = function(numObs, numDistinctLabels){
  possibleLabels = sample(letters, numDistinctLabels, replace=F)
  someFrame = data.frame(
    x=rep(NA, numObs),
    label=rep(NA, numObs)
  )
  numObsPerLabel = numObs / numDistinctLabels
  for(i in 1:length(possibleLabels)){
    label = possibleLabels[i]
    whichAreNA = which(is.na(someFrame$label))
    whichToSet = sample(whichAreNA, numObsPerLabel, replace=F)
    someFrame[whichToSet, "label"] = label
    someFrame[whichToSet, "x"] = rnorm(numObsPerLabel, 100*i, 10)
  }
  someFrame = as.data.frame(unclass(someFrame))
  fit = e1071::naiveBayes(label ~ x, someFrame)
  # The threshold argument doesn't seem to change the matter...
  someFrame$predictions = predict(fit, someFrame, threshold=0)
  someFrame
}
# given a labeled frame, return the label that was predicted most
getMostFrequentPrediction = function(labeledFrame){
  names(which.max(sort(table(labeledFrame$prediction))))
}
# run the experiment a few thousand times
mostPredictedClasses = sapply(1:2000, function(x) getMostFrequentPrediction(doExperiment(100, 5)))
# make a bar chart of the most frequently predicted labels
plot(table(mostPredictedClasses))
This gives a plot like:
Giving every label the same normal distribution (i.e. mean 100, stdev 10) gives:
Regarding the confusion in the comments:
This is maybe getting away from Stack Overflow territory here, but anyways...
While I would expect the classification to be less clumpy, the standard deviations do a lot to flatten out the pdfs, and if you run this enough you can observe that one or two labels actually tend to dominate (red and black in this case).
Too bad we can't exploit the knowledge that the standard deviation is the same for all of them.
If you add just a little noise to the mean it becomes much more evenly distributed, even though there's still some misclassification.
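For example, a sketch of that tweak inside doExperiment() (only the rnorm() call for x changes; the size of the jitter is an arbitrary choice):
# inside the for loop of doExperiment(): jitter each label's mean a little
# before drawing its observations (the sd of 10 for the jitter is arbitrary)
jitteredMean = 100 * i + rnorm(1, mean = 0, sd = 10)
someFrame[whichToSet, "x"] = rnorm(numObsPerLabel, jitteredMean, 10)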

The problem is not naiveBayes, it's your getMostFrequentPrediction function. You are returning only one value even when there are ties for first. Since you are using table(), the counts are implicitly sorted alphabetically in the table, so when you grab the first maximum it will also be the alphabetically "smallest" label. So if you run this a bunch of times:
getMostFrequentPrediction(data.frame(predictions=sample(rep(letters[1:3], 5))))
you will always get "a" even though the letters "a" "b" and "c" all appear 5 times.
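To see the mechanism in isolation (a small illustrative snippet):
predictions = sample(rep(letters[1:3], 5))
table(predictions)                           # counts come back ordered a, b, c
names(which.max(sort(table(predictions))))   # "a": the first of the tied maxima wins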
If you want to randomly choose one of the most frequently predicted categories, here's another possible implementation
getMostFrequentPrediction = function(labeledFrame){
  tt <- table(labeledFrame$predictions)
  names(sample(tt[tt == max(tt)], 1))
}
This gives


Distribution of mean*standard deviation of sample from gaussian

I'm trying to assess the feasibility of an instrumental variable in my project with a variable I haven't seen before. The variable is essentially an interaction between the mean and standard deviation of a sample drawn from a Gaussian, and I'm trying to see what this distribution might look like. Below is what I'm trying to do; any help is much appreciated.
1. Generate a set of 1000 individuals with a variable x following the Gaussian distribution.
2. Draw 50 random samples of 5 individuals from this distribution with replacement.
3. Calculate the mean and standard deviation of x for each sample.
4. Create an interaction variable y by multiplying the mean and standard deviation of x for each sample.
5. Plot the distribution of y.
Beginner's version
There might be more efficient ways to code this, but this is easy to follow, I guess:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
# As Ben suggested, we create a data.frame filled with NA values
samples <- data.frame(mean = rep(NA, N), sd = rep(NA, N))
# Now we use a loop to populate the data.frame
for(i in 1:N){
  # draw 5 samples from the population (without replacement)
  # I assume you want to replace for each turn of taking 5.
  # If you want to replace between drawing each of the 5,
  # I think it should be obvious how to adapt the following code.
  smpl <- sample(stat_pop, size = 5, replace = FALSE)
  # the data.frame currently has two columns. In each row i, we put mean and sd
  samples[i, ] <- c(mean(smpl), sd(smpl))
}
# $ is used to get a certain column of the data.frame by the column name.
# Here, we create a new column y based on the existing two columns.
samples$y <- samples$mean * samples$sd
# plot a histogram
hist(samples$y)
Most functions here use positional arguments, i.e., you are not required to name every parameter. E.g., rnorm(1000, mean = 0, sd = 1) is the same as rnorm(1000, 0, 1) and even the same as rnorm(1000), since 0 and 1 are the default values.
Somewhat more efficient version
In R, explicit loops are often slower than vectorized alternatives, so they are frequently avoided. For your question this makes no noticeable difference; however, for large data sets, performance should be kept in mind. The following might be a bit harder to follow:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
n = 5
# again, I set replace = FALSE here; if you meant to replace each individual
# (so the same individual can be drawn more than once in each "draw 5"),
# set replace = TRUE
# replicate repeats the "draw 5" action N times
smpls <- replicate(N, sample(stat_pop, n, replace = FALSE))
# we transform the output and turn it into a data.frame to make it
# more convenient to work with
samples <- data.frame(t(smpls))
samples$mean <- rowMeans(samples)
samples$sd <- apply(samples[, c(1:n)], 1, sd)
samples$y <- samples$mean * samples$sd
hist(samples$y)
General note
Usually, you should do some research on the problem before posting here. Then, you either find out how it works by yourself, or you can provide an example of what you tried. To this end, you can simply google each of the steps you outlined (e.g., google "generate random normal distribution R" to find out about the function rnorm()).
Run ?rnorm to get help on the function in RStudio.

Mclust() - NAs in model selection

I recently tried to fit a GMM in R on a multivariate matrix (400 observations of 196 variables) whose elements belong to known categories. The Mclust() function (from the mclust package) gave very poor results (around 30% of individuals were correctly classified, whereas with k-means the result reaches more than 90%).
Here is my code:
library(mclust)
X <- read.csv("X.csv", sep = ",", h = T)
y <- read.csv("y.csv", sep = ",")
gmm <- Mclust(X, G = 5) # I want 5 clusters
cl_gmm <- gmm$classification
cl_gmm_lab <- cl_gmm
nclusters <- 5 # same number of clusters as requested above
for (k in 1:nclusters){
  ii = which(cl_gmm == k)  # individuals of group k
  counts = table(y[ii])    # number of occurrences for each label
  imax = which.max(counts) # majority label
  maj_lab = attributes(counts)$dimnames[[1]][imax]
  print(paste("Group ", k, ", majority label = ", maj_lab))
  cl_gmm_lab[ii] = maj_lab
}
conf_mat_gmm <- table(y, cl_gmm_lab) # CONFUSION MATRIX
The problem seems to come from the fact that every model other than "EII" (spherical, equal volume) is "NA" when looking at gmm$BIC.
Until now I did not find any solution to this problem...are you familiar with this issue?
Here is the link for the data: https://drive.google.com/file/d/1j6lpqwQhUyv2qTpm7KbiMRO-0lXC3aKt/view?usp=sharing
Here is the link for the labels: https://docs.google.com/spreadsheets/d/1AVGgjS6h7v6diLFx4CxzxsvsiEm3EHG7/edit?usp=sharing&ouid=103045667565084056710&rtpof=true&sd=true
I finally found the answer. GMMs simply cannot fit every covariance model when too many explanatory variables are involved. The right thing to do is to first reduce the dimensionality, selecting a number of dimensions that makes it possible to properly fit the GMM while preserving as much information as possible about the data.
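A minimal sketch of that approach, assuming the X and G = 5 from the question and an arbitrary placeholder of 10 retained components (to be tuned, e.g. via the cumulative explained variance):
library(mclust)
# reduce dimensionality first, then fit the mixture model on the PC scores
pca <- prcomp(X, center = TRUE, scale. = TRUE)
ndim <- 10                     # placeholder: pick using summary(pca) / explained variance
scores <- pca$x[, 1:ndim]
gmm_red <- Mclust(scores, G = 5)
summary(gmm_red$BIC)           # more covariance models should now be estimable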

PCA - All variables with the same sign on PC1 coordinates

So, I am analyzing a dataset that consists of 160 observations and 20 variables and am performing a PCA. It is about patients affected by a disease, and the variables are antibody levels measured in the same experiment, all in the same units (u/mL). These variables all take positive values, so I can't understand how I would have samples on the positive PC1 side of the plot without any variable contributing to that side (given that there are no negative values involved in these variables).
For confounding factors, what I have is patients' age, gender and duration of infection, but these three were not included in the PC analysis.
I am having some trouble understanding the following: when using the R package factoextra's function fviz_pca_biplot() to see both the sample distribution and each variable's contribution to PCs 1 and 2, I realized that all 20 of my variables have highly negative coordinates on PC1.
The following images were generated using a small sample of my original data and, even though the variables' contributions are not the same, they are still highly negative on PC1. This is understandable if I do not center my data in the prcomp() function (image 1), as all of my samples then fall on the negative side of PC1, which explains most of the data's inertia.
library(factoextra)
PCAf <- read.table("PCA_small_sample.csv", sep = ";", header = T, row.names = 1)
res.pca <- prcomp(PCAf, scale = TRUE, center = F)
fviz_pca_biplot(res.pca)
However, I have been taught that it is necessary to center the data when performing PCA and the image becomes like this:
res.pca <- prcomp(PCAf, scale = TRUE)
fviz_pca_biplot(res.pca)
This diminishes PC1's explained variance and increases PC2's but, even though it changes the variables' coordinates, none of them gets a positive coordinate on PC1.
res.var <- get_pca_var(res.pca)
res.var$coord
These are the values for the non-centered PCA:
And for the centered PCA:
Am I doing something wrong? Should I really present my analysis with the second image even though the vectors do not match what we are seeing?
My main question is: when presenting the PCA, it is better to do so with the centered data, right? Then, should I perform some sort of correction to the variables' coordinates/contributions to the PCs? This second image does not seem too reliable to me, but that may be due to lack of experience... I mean, since all variables point toward the left side of the plot, what would be pulling some of the samples (e.g. 7, 10, 8, 4, 20) toward the right side of the plot (positive PC1)? It seems counterintuitive that there isn't even a single vector on the right side.
This also brings me the question: Should I add confounding factors when performing a PCA? I performed linear regression to account for them but did not include them in the PC analysis.
Anyway, thank you all so much in advance.
PS: I uploaded a file containing a sample of my data, code and images on github
PS2: When plotting this with a generic dataset, I do not see the same issue. At first it happens, but after centering the data there are vectors in all four quadrants, from which I am able to extract some rationale.
data.matrix <- matrix(nrow=100, ncol=10)
colnames(data.matrix) <- c(
  paste("wt", 1:5, sep=""),
  paste("ko", 1:5, sep=""))
rownames(data.matrix) <- paste("gene", 1:100, sep="")
for (i in 1:100) {
  wt.values <- rpois(5, lambda=sample(x=10:1000, size=1))
  ko.values <- rpois(5, lambda=sample(x=10:1000, size=1))
  data.matrix[i,] <- c(wt.values, ko.values)
}
PCAf <- t(data.matrix)
res.pca_NC <- prcomp(PCAf, scale = TRUE, center = F)
res.pca_C <- prcomp(PCAf, scale = TRUE, center = T)
fviz_pca_biplot(res.pca_NC)
fviz_pca_biplot(res.pca_C)
Not centered - generic PCA:
Centered - generic PCA:

Remove outliers in the dataset before running factor analysis in R

The introduction to my dataset: it is questionnaire data about the different reasons for students' antisocial behaviours, and I want to run a factor analysis to group similar reasons into factors.
For instance, one reason for students' antisocial behaviour is their parents' way of educating them, and another is their parents' educational background. There is some similarity between these two reasons, so I am wondering whether they could be merged into one factor; that is why I want to run a factor analysis to see whether different reasons can be merged into one factor.
In order to run the factor analysis, removing outliers (values smaller than the mean minus 3 standard deviations, or larger than the mean plus 3 standard deviations) is quite important, from my understanding. However, I am not sure whether it is necessary for questionnaire data, and if it is necessary, or at least not completely redundant, with which R code could I reach this aim?
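For reference, a bare-bones sketch of that mean ± 3 SD rule on a single column (reusing the D.O_Mean column that appears in the code below; purely illustrative):
# x: one numeric questionnaire column (here the D.O_Mean column used later on)
x <- D.O.Mean.data$D.O_Mean
upper <- mean(x, na.rm = TRUE) + 3 * sd(x, na.rm = TRUE)
lower <- mean(x, na.rm = TRUE) - 3 * sd(x, na.rm = TRUE)
x_clean <- x[x > lower & x < upper]   # keep only values within mean +/- 3 SD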
I did some research on the Median Absolute Deviation (MAD) method, which can filter out the outliers, and I also wrote the R code below:
library(dplyr) # needed for %>%, select() and filter()
mad.mean.D.O <- as.numeric(D.O.Mean.data$D.O_Mean)
median(mad.mean.D.O)
mad(mad.mean.D.O, center = median(mad.mean.D.O), constant = 1.4826,
    na.rm = FALSE, low = FALSE, high = FALSE)
print(Upper.MAD <- (median(mad.mean.D.O) + 3 * (mad(mad.mean.D.O, center = median(mad.mean.D.O), constant = 1.4826,
                                                    na.rm = FALSE, low = FALSE, high = FALSE))))
print(Lower.MAD <- (median(mad.mean.D.O) - 3 * (mad(mad.mean.D.O, center = median(mad.mean.D.O), constant = 1.4826,
                                                    na.rm = FALSE, low = FALSE, high = FALSE))))
D.O.clean.mean.data <- D.O.Mean.data %>%
  select(ID_t, anonymity, fail_exm, pregnant, deg_job, new_job, crowded,
         stu_req, int_sub, no_org, child, exm_cont, lec_sup, fals_exp,
         fin_prob, int_pro, family, illness, perf_req, abroad, relevanc,
         quickcash, deg_per, lack_opp, prac_work, D.O_Mean) %>%
  filter(D.O_Mean < 4.197032 & D.O_Mean > 0.282968)
This R code works.
However, I just wonder whether there are also other methods that could reach the same aim in a simpler way.
In addition, my data set looks like this:
All the variables are questionnaire items measured on a Likert scale, and all of them are reasons for antisocial behaviour. For example, the first participant gave 1 for anonymity; on the scale from "not exactly" to "exactly", this means he/she thinks anonymity did not really contribute to his/her antisocial behaviour.
I would be really thankful for all of your input here.
You can try the functions below to remove outliers. They comb through all columns to identify outliers, so be sure to temporarily remove the columns that do not need outliers removed; you can cbind() them back later.
# identify outliers: cells more than cutoff * sd away from the column mean
idoutlier <- function(data, cutoff = 3) {
  # Calculate the column means and standard deviations
  means <- colMeans(data, na.rm = TRUE)
  sds <- apply(data, 2, sd, na.rm = TRUE)
  # Identify the cells farther than cutoff * sd from the column mean (column wise)
  result <- mapply(function(d, m, s) {
    which(abs(d - m) > cutoff * s)
  }, data, means, sds)
  result
}
# remove outliers: set the flagged cells to NA
rmoutlier <- function(data, outliers) {
  result <- mapply(function(d, o) {
    res <- d
    res[o] <- NA
    return(res)
  }, data, outliers)
  return(as.data.frame(result))
}
cbind() the set-aside columns back if necessary, and then use na.omit() to drop the rows containing your outliers.
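A hedged usage sketch, assuming df is a placeholder data frame holding only the numeric questionnaire columns (the ID and any non-numeric columns set aside beforehand):
# df: placeholder data frame containing only the numeric questionnaire columns
out_idx  <- idoutlier(df, cutoff = 3)  # per-column indices of flagged cells
df_clean <- rmoutlier(df, out_idx)     # flagged cells are set to NA
df_clean <- na.omit(df_clean)          # drop the rows that contained outliers
# cbind() the set-aside columns back afterwards if needed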

Calculating a correlation coefficient that includes missing values

I'm looking to calculate some form of correlation coefficient in R (or any common stats package, actually) in which the value of the correlation is influenced by missing values. I am not sure if this is possible and am looking for a method. I do not want to impute data; I actually want the correlation to be reduced, in some systematic fashion, based on the number of incomplete cases included. The data are a series of time points generated by different individuals, and the correlation coefficient is being used to compute reliability. In many cases, one individual's data will include several more time points than the other individual's...
Again, not sure if there is any standard procedure for dealing with such a situation.
One thing to look at is fitting a logistic regression to whether or not a point is missing. If there is no relationship then that provides support for assuming that the missing values won't provide any information. If that is your case then you won't have to impute anything and can just perform your computation without the missing values. glm in R can be used for logistic regression.
Also on a different note, see the use="pairwise.complete.obs" argument to cor which may or may not apply to you.
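A minimal sketch of that missingness check, with made-up vectors x and y standing in for one series and the series with gaps:
set.seed(1)
x <- rnorm(100)                        # placeholder: the complete series / covariate
y <- rnorm(100)
y[sample(100, 15)] <- NA               # hypothetical missing time points
fit <- glm(is.na(y) ~ x, family = binomial)
summary(fit)  # a non-significant slope is consistent with missingness unrelated to x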
EDIT: I have revised this answer based on rereading the question.
My feeling is that when a data pair has an NA in one of the time series, that pair cannot be used for calculating a correlation, as there is no information at that point. Since there is no information there, there is no way to know how it would influence the correlation. Specifying that an NA reduces the correlation seems tricky; if an observation had been present at that point, it could just as easily have improved the correlation.
The default behavior in R is to return NA for the correlation if any NA is present. This behavior can be tweaked using the 'use' argument of cor(); see the function's documentation for more details.
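For instance (toy vectors, purely illustrative):
x <- c(1, 2, 3, 4, NA)
y <- c(2, 1, 4, 3, 5)
cor(x, y)                        # NA, because of the default use = "everything"
cor(x, y, use = "complete.obs")  # computed on the complete pairs only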
As pointed out in the answer by Paul Hiemstra, there is no way of knowing whether the correlation would have been higher or lower without missing values. However, for some applications it may be appropriate to penalize the observed correlation for non-matching missing values. For example, if we compare two individual coders, we may want coder B to say "NA" if and only if coder A says "NA" as well, plus we want their non-NA values to correlate.
Under these assumptions, a simple way to penalize non-matching missing values is to compute correlation for complete cases and multiply by the proportion of observations that are matched in terms of their NA-status. The penalty term can then be defined as: 1 - mean((is.na(coderA) & !is.na(coderB)) | (!is.na(coderA) & is.na(coderB))). A simple illustration follows.
myfun = function(x1, x2, idx_rm) {
  temp = x2
  # remove 'idx_rm' points from x2
  temp[idx_rm] = NA
  # calculate correlations
  r_full = round(cor(x1, x2, use = 'pairwise.complete.obs'), 2)
  r_NA = round(cor(x1, temp, use = 'pairwise.complete.obs'), 2)
  penalty = 1 - mean((is.na(temp) & !is.na(x1)) |
                     (!is.na(temp) & is.na(x1)))
  r_pen = round(r_NA * penalty, 2)
  # plot
  plot(x1, temp, main = paste('r_full =', r_full,
                              '; r_NA =', r_NA,
                              '; r_pen =', r_pen),
       xlim = c(-4, 4), ylim = c(-4, 4), ylab = 'x2')
  points(x1[idx_rm], x2[idx_rm], col = 'red', pch = 16)
  regr_full = as.numeric(summary(lm(x2 ~ x1))$coef[, 1])
  regr_NA = as.numeric(summary(lm(temp ~ x1))$coef[, 1])
  abline(regr_full[1], regr_full[2])
  abline(regr_NA[1], regr_NA[2], lty = 2)
}
Run a simple simulation to illustrate the possible effects of missing values and penalization:
set.seed(928)
x1 = rnorm(20)
x2 = x1 * rnorm(20, mean = 1, sd = .8)
# A case when NA's artificially inflate the correlation,
# so penalization makes sense:
myfun(x1, x2, idx_rm = c(13, 19))
# A case when NA's DEflate the correlation,
# so penalization may be misleading:
myfun(x1, x2, idx_rm = c(6, 14))
# When there are a lot of NA's, penalization is much stronger
myfun(x1, x2, idx_rm = 7:20)
# Some NA's in x1:
x1[1:5] = NA
myfun(x1, x2, idx_rm = c(6, 14))
