Document Term Matrix for Naive Bayes classfier: unexpected results R - r

I'm having some very annoying problems getting a Naive Bayes Classifier to work with a document term matrix. I'm sure I'm making a very simple mistake but can't figure out what it is. My data is from accounts spreadsheets. I've been asked to figure out which categories (in text format: mostly names of departments or names of budgets) are more likely to spend money on charities and which ones mostly (or only) spend on private companies. They suggested I use Naive Bayes classifiers to do this. I have a thousand or so rows of data to train a model and many hundreds of thousands of rows to test the model against. I have prepared the strings, replacing spaces with underscores and ands/&s with +, then treated each category as one term: so 'alcohol and drug addiction' becomes: alcohol+drug_addiction.
Some example rows:
"environment+housing strategy+commissioning third_party_payments supporting_ppl_block_gross_chargeable" -> This row went to a charity
"west_north_west customer+tenancy premises h.r.a._special_maintenance" -> This row went to a private company.
Using this example as a template, I wrote the following function to come up with my document term matrix (using tm), both for training and test data.
library(tm)
library(e1071)
getMatrix <- function(chrVect){
testsource <- VectorSource(chrVect)
testcorpus <- Corpus(testsource)
testcorpus <- tm_map(testcorpus,stripWhitespace)
testcorpus <- tm_map(testcorpus, removeWords,stopwords("english"))
testmatrix <- t(TermDocumentMatrix(testcorpus))
}
trainmatrix <- getMatrix(traindata$cats)
testmatrix <- getMatrix(testdata$cats)
So far, so good. The problem is when I try to a) apply a Naive Bayes model and b) predict from that model. Using klar package - I get a zero probability error, since many of the terms have zero instances of one category and playing around with the laplace terms does not seem to fix this. Using e1071, the model worked, but then when I tested the model using:
model <- naiveBayes(as.matrix(trainmatrix),as.factor(traindata$Code))
rs<- predict(model, as.matrix(testdata$cats))
... every single item predicted the same category, even though they should be roughly equal. Something in the model clearly isn't working. Looking at some of the terms in model$tables - I can see that many have high values for private and zero for charity and others vice versa. I have used as.factor for the code.
output:
rs 1 2
1 0 0
2 19 17
Any ideas on what is going wrong? Do dtm matrices not play nice with naivebayes? Have I missed a step out in preparing the data? I'm completely out of ideas. Hope this is all clear. Happy to clarify if not. Any suggestions would be much appreciated.

I have already had the problem myself. You have done (as far as I see it) everything right, the Naive Bayes Implementation in e1071 (and thus klar) is buggy.
But there is an easy and quick fix so that Naive Bayes as implemented in e1071 works again: You should change your text-vectors to categorial variables, i.e. as.factor. You have already done this with your target variable traindata$Code, yet you have to also do this for your trainmatrix and for sure then your testdata.
I could not track the bug to 100% percent down, but it lies in this part in the naive bayes implementation from e1071 (I may note, klar is only a wrapper around e1071):
L <- log(object$apriori) + apply(log(sapply(seq_along(attribs),
function(v) {
nd <- ndata[attribs[v]]
## nd is now a cell, row i, column attribs[v]
if (is.na(nd) || nd == 0) {
rep(1, length(object$apriori))
} else {
prob <- if (isnumeric[attribs[v]]) {
## we select table for attribute
msd <- object$tables[[v]]
## if stddev is eqlt eps, assign threshold
msd[, 2][msd[, 2] <= eps] <- threshold
dnorm(nd, msd[, 1], msd[, 2])
} else {
object$tables[[v]][, nd]
}
prob[prob <= eps] <- threshold
prob
}
})), 1, sum)
You see that there is an if-else-condition: if we have no numerics, naive bayes is used as we expect it to work. If we have numerics - and here comes the bug - this naive bayes automatically assumes a normal distribution. If you only have 0 and 1 in your text, dnorm pretty much sucks. I assume due to very low values created by dnorm the prob. are always replaced by the threshold and thus the variable with the higher a priori factor will always „win“.
However, if I understand your problem correct, you do not even need prediction, rather the a priori factor for identifying which department gives money to whom. Then all you have to do is have a deep look at your model. In your model for every term there appears the apriori probability, which is what I assume you are looking for. Let's do this and the aforementioned with a slightly modified version of your sample:
## i have changed the vectors slightly
first <- "environment+housing strategy+commissioning third_party_payments supporting_ppl_block_gross_chargeable"
second <- "west_north_west customer+tenancy premises h.r.a._special_maintenance"
categories <- c("charity", "private")
library(tm)
library(e1071)
getMatrix <- function(chrVect){
testsource <- VectorSource(chrVect)
testcorpus <- Corpus(testsource)
testcorpus <- tm_map(testcorpus,stripWhitespace)
testcorpus <- tm_map(testcorpus, removeWords,stopwords("english"))
## testmatrix <- t(TermDocumentMatrix(testcorpus))
## instead just use DocumentTermMatrix, the assignment is superflous
return(DocumentTermMatrix(testcorpus))
}
## since you did not supply some more data, I cannot do anything about these lines
## trainmatrix <- getMatrix(traindata$cats)
## testmatrix <- getMatrix(testdata$cats)
## instead only
trainmatrix <- getMatrix(c(first, second))
## I prefer running this instead of as.matrix as i can add categories more easily
traindf <- data.frame(categories, as.data.frame(inspect(trainmatrix)))
## now transform everything to a character vector since factors produce an error
for (cols in names(traindf[-1])) traindf[[cols]] <- factor(traindf[[cols]])
## traindf <- apply(traindf, 2, as.factor) did not result in factors
## check if it's as we wished
str(traindf)
## it is
## let's create a model (with formula syntax)
model <- naiveBayes(categories~., data=traindf)
## if you look at the output (doubled to see it more clearly)
predict(model, newdata=rbind(traindf[-1], traindf[-1]))
But as I have already said, you do not need to predict. A look at the model is all right, e.g. model$tables$premises will give you the likelihood for the premises giving money to private corporations: 100 %.
If you are dealing with very large datasets, you should specify threshold and eps in your model. Eps defines the limit, when the threshold should be supplied. E.g. eps = 0 and threshold = 0.000001 can be of use.
Furthermore you should stick to using term-frequency weighting. tf*idv e.g. will not work due to the dnorm in the naive bayes.
Hope I can finally get my 50 reputation :P

Related

Permutations and combinations of all the columns in R

I want to check all the permutations and combinations of columns while selecting models in R. I have 8 columns in my data set and the below piece of code lets me check some of the models, but not all. Models like column 1+6, 1+2+5 will not be covered by this loop. Is there any better way to accomplish this?
best_model <- rep(0,3) #store the best model in this array
for(i in 1:8){
for(j in 1:8){
for(x in k){
diabetes_prediction <- knn(train = diabetes_training[, i:j], test = diabetes_test[, i:j], cl = diabetes_train_labels, k = x)
accuracy[x] <- 100 * sum(diabetes_test_labels == diabetes_prediction)/183
if( best_model[1] < accuracy[x] ){
best_model[1] = accuracy[x]
best_model[2] = i
best_model[3] = j
}
}
}
}
Well, this answer isn't complete, but maybe it'll get you started. You want to be able to subset by all possible subsets of columns. So instead of having i:j for some i and j, you want to be able to subset by c(1,6) or c(1,2,5), etc.
Using the sets package, you can for the power set (set of all subsets) of a set. That's the easy part. I'm new to R, so the hard part for me is understanding the difference between sets, lists, vectors, etc. I'm used to Mathematica, in which they're all the same.
library(sets)
my.set <- 1:8 # you want column indices from 1 to 8
my.power.set <- set_power(my.set) # this creates the set of all subsets of those indices
my.names <- c("a") #I don't know how to index into sets, so I created names (that are numbers, but of type characters)
for(i in 1:length(my.power.set)) {my.names[i] <- as.character(i)}
names(my.power.set) <- my.names
my.indices <- vector("list",length(my.power.set)-1)
for(i in 2:length(my.power.set)) {my.indices[i-1] <- as.vector(my.power.set[[my.names[i]]])} #this is the line I couldn't get to work
I wanted to create a list of lists called my.indices, so that my.indices[i] was a subset of {1,2,3,4,5,6,7,8} that could be used in place of where you have i:j. Then, your for loop would have to run from 1:length(my.indices).
But alas, I have been spoiled by Mathematica, and thus cannot decipher the incredibly complicated world of R data types.
Solved it, below is the code with explanatory comments:
# find out the best model for this data
number_of_columns_to_model <- ncol(diabetes_training)-1
best_model <- c()
best_model_accuracy = 0
for(i in 2:2^number_of_columns_to_model-1){
# ignoring the first case i.e. i=1, as it doesn't represent any model
# convert the value of i to binary, e.g. i=5 will give combination = 0 0 0 0 0 1 0 1
combination = as.binary(i, n=number_of_columns_to_model) # from the binaryLogic package
model <- c()
for(i in 1:length(combination)){
# choose which columns to consider depending on the combination
if(combination[i])
model <- c(model, i)
}
for(x in k){
# for the columns decides by model, find out the accuracies of model for k=1:27
diabetes_prediction <- knn(train = diabetes_training[, model, with = FALSE], test = diabetes_test[, model, with = FALSE], cl = diabetes_train_labels, k = x)
accuracy[x] <- 100 * sum(diabetes_test_labels == diabetes_prediction)/length(diabetes_test_labels)
if( best_model_accuracy < accuracy[x] ){
best_model_accuracy = accuracy[x]
best_model = model
print(model)
}
}
}
I trained with Pima.tr and tested with Pima.te. KNN Accuracy for pre-processed predictors was 78% and 80% without pre-processing (and this because of the large influence of some variables).
The 80% performance is at par with a Logistic Regression model. You don't need to preprocess variables in Logistic Regression.
RandomForest, and Logistic Regression provide a hint on which variables to drop, so you don't need to go and perform all possible combinations.
Another way is to look at a matrix Scatter plot
You get a sense that there is difference between type 0 and type 1 when it comes to npreg, glu, bmi, age
You also notice the highly skewed ped and age, and you notice that there may be an anomaly data point between skin and and and other variables (you may need to remove that observation before going further)
Skin Vs Type box plot shows that for type Yes, an extreme outlier exist (try removing it)
You also notice that most of the boxes for Yes type are higher than No type=> the variables may add prediction to the model (you can confirm this through a Wilcoxon Rank Sum Test)
The high correlation between Skin and bmi means that you can use one or the other or an interact of both.
Another approach to reducing the number of predictors is to use PCA

Dynamic linear regression loop for different order summation

I've been trying hard to recreate this model in R:
Model
(FARHANI 2012)
I've tried many things, such as a cumsum paste - however that would not work as I could not assign strings the correct variable as it kept thinking that L was a function.
I tried to do it manually, I'm only looking for p,q = 1,2,3,4,5 however after starting I realized how inefficient this is.
This is essentially what I am trying to do
model5 <- vector("list",20)
#p=1-5, q=0
model5[[1]] <- dynlm(DLUSGDP~L(DLUSGDP,1))
model5[[2]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2))
model5[[3]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2)+L(DLUSGDP,3))
model5[[4]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2)+L(DLUSGDP,3)+L(DLUSGDP,4))
model5[[5]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2)+L(DLUSGDP,3)+L(DLUSGDP,4)+L(DLUSGDP,5))
I'm also trying to do this for regressing DLUSGDP on DLWTI (my oil variable's name) for when p=0, q=1-5 and also p=1-5, q=1-5
cumsum would not work as it would sum the variables rather than treating them as independent regresses.
My goal is to run these models and then use IC to determine which should be analyzed further.
I hope you understand my problem and any help would be greatly appreciated.
I think this is what you are looking for:
reformulate(paste0("L(DLUSGDP,", 1:n,")"), "DLUSGDP")
where n is some order you want to try. For example,
n <- 3
reformulate(paste0("L(DLUSGDP,", 1:n,")"), "DLUSGDP")
# DLUSGDP ~ L(DLUSGDP, 1) + L(DLUSGDP, 2) + L(DLUSGDP, 3)
Then you can construct your model fitting by
model5 <- vector("list",20)
for (i in 1:20) {
form <- reformulate(paste0("L(DLUSGDP,", 1:i,")"), "DLUSGDP")
model5[[i]] <- dynlm(form)
}

How to find significant correlations in a large dataset

I'm using R.
My dataset has about 40 different Variables/Vektors and each has about 80 entries. I'm trying to find significant correlations, that means I want to pick one variable and let R calculate all the correlations of that variable to the other 39 variables.
I tried to do this by using a linear modell with one explaining variable that means: Y=a*X+b.
Then the lm() command gives me an estimator for a and p-value of that estimator for a. I would then go on and use one of the other variables I have for X and try again until I find a p-value thats really small.
I'm sure this is a common problem, is there some sort of package or function that can try all these possibilities (Brute force),show them and then maybe even sorts them by p-value?
You can use the function rcorr from the package Hmisc.
Using the same demo data from Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Then:
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
To access the p-values:
correlations$P
To visualize you can use the package corrgram
library(corrgram)
corrgram(the_data)
Which will produce:
In order to print a list of the significant correlations (p < 0.05), you can use the following.
Using the same demo data from #Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Install Hmisc
install.packages("Hmisc")
Import library and find the correlations (#Carlos)
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
Loop over the values printing the significant correlations
for (i in 1:m){
for (j in 1:m){
if ( !is.na(correlations$P[i,j])){
if ( correlations$P[i,j] < 0.05 ) {
print(paste(rownames(correlations$P)[i], "-" , colnames(correlations$P)[j], ": ", correlations$P[i,j]))
}
}
}
}
Warning
You should not use this for drawing any serious conclusion; only useful for some exploratory analysis and formulate hypothesis. If you run enough tests, you increase the probability of finding some significant p-values by random chance: https://www.xkcd.com/882/. There are statistical methods that are more suitable for this and that do do some adjustments to compensate for running multiple tests, e.g. https://en.wikipedia.org/wiki/Bonferroni_correction.
Here's some sample data for reproducibility.
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
You can calculate the correlation between two columns using cor. This code loops over all columns except the first one (which contains our response), and calculates the correlation between that column and the first column.
correlations <- vapply(
the_data[, -1],
function(x)
{
cor(the_data[, 1], x)
},
numeric(1)
)
You can then find the column with the largest magnitude of correlation with y using:
correlations[which.max(abs(correlations))]
So knowing which variables are correlated which which other variables can be interesting, but please don't draw any big conclusions from this knowledge. You need to have a proper think about what you are trying to understand, and which techniques you need to use. The folks over at Cross Validated can help.
If you are trying to predict y using only one variable than you have to take the one that is mainly correlated with y.
To do this just use the command which.max(abs(cor(x,y))). If you want to use more than one variable in your model then you have to consider something like the lasso estimator
One option is to run a correlation matrix:
cor_result=cor(data)
write.csv(cor_result, file="cor_result.csv")
This correlates all the variables in the file against each other and outputs a matrix.

Local prediction modelling approach in R

users
I am trying to develop a local model (PLSR) which is predicting a query sample by a model built on the 10 most similar samples using the code below (not the full model yet, just a part of it). I got stuck when trying to predict the query sample (second to last line). The model is actually predicting something, ("prd") but not the query sample!
Here is my code:
require("pls")
set.seed(10000) # generate some sample data
mat <- replicate(100, rnorm(100))
y <- as.matrix(mat[,1], drop=F)
x <- mat[,2:100]
eD <- dist(x, method="euclidean") # create a distance matrix
eDm <- as.matrix(eD)
Looping over all 100 samples and extracting their 10 most similar samples for subsequent model building and prediction of query sample:
for (i in 1:nrow(eDm)) {
kni <- head(order(eDm[,i]),11)[-1] # add 10 most similar samples to kni
pls1 <- plsr(y[kni,] ~ x[kni,], ncomp=5, validation="CV") # run plsr on sel. samples
prd <- predict(pls1, ncomp=5, newdata=x[[i]]) # predict query sample ==> I suspect there is something wrong with this expression: newdata=x[[i]]
}
I can't figure out how to address the query sample properly - many thanks i.a. for any help!
Best regards,
Chega
You are going to run into all sorts of pain building models with formulae like that. Also the x[[i]] isn't doing what you think it is - you need to supply a data frame usually to these modelling functions. In this case a matrix seems fine too.
I get all your code working OK if I use:
prd <- predict(pls1, ncomp=5, newdata=x[i, ,drop = FALSE])
giving
> predict(pls1, ncomp=5, newdata=x[i,,drop = FALSE])
, , 5 comps
y[kni, ]
[1,] 0.6409897
What you were seeing with your code are the fitted values for the training data.
> fitted(pls1)[, , 5, drop = FALSE]
, , 5 comps
y[kni, ]
1 0.1443274
2 0.2706769
3 1.1407780
4 -0.2345429
5 -1.0468221
6 2.1353091
7 0.8267103
8 3.3242296
9 -0.5016016
10 0.6781804
This is convention in R when you either don't supply newdata or the object you are supplying makes no sense and doesn't contain the covariates required to generate predictions.
I would have fitted the model as follows:
pls1 <- plsr(y ~ x, ncomp=5, validation="CV", subset = kni)
where I use the subset argument for its intended purpose; to select the rows of the input data to fit the model with. You get nicer output from the models; the labels use y instead of y[kni, ] etc, plus this general convention will serve you well in other modelling tools, where R will expect newdata to be a data frame with names exactly the same as those mentioned in the model formula. In your case, with your code, that would mean creating a data frame with names like x[kni, ] which are not easy to do, for good reason!

Obtain t-statistic for regression coefficients of an “mlm” object returned by `lm()`

I've used lm() to fit multiple regression models, for multiple (~1 million) response variables in R. Eg.
allModels <- lm(t(responseVariablesMatrix) ~ modelMatrix)
This returns an object of class "mlm", which is like a huge object containing all the models. I want to get the t-statistic for the first coefficient in each model, which I can do using the summary(allModels) function, but its very slow on this large data and returns a lot of unwanted info too.
Is there a faster way of calculating the t-statistic manually, that might be faster than using the summary() function
Thanks!
You can hack the summary.lm() function to get just the bits you need and leave the rest.
If you have
nVariables <- 5
nObs <- 15
y <- rnorm(nObs)
x <- matrix(rnorm(nVariables*nObs),nrow=nObs)
allModels <-lm(y~x)
Then this is the code from the lm.summary() function but with all the excess baggage removed (note, all the error handling has been removed as well).
p <- allModels$rank
rdf <- allModels$df.residual
Qr <- allModels$qr
n <- NROW(Qr$qr)
p1 <- 1L:p
r <- allModels$residuals
f <- allModels$fitted.values
w <- allModels$weights
mss <- if (attr(allModels$terms, "intercept"))
sum((f - mean(f))^2) else sum(f^2)
rss <- sum(r^2)
resvar <- rss/rdf
R <- chol2inv(Qr$qr[p1, p1, drop = FALSE])
se <- sqrt(diag(R) * resvar)
est <- allModels$coefficients[Qr$pivot[p1]]
tval <- est/se
tval is now a vector of the t statistics as also give by
summary(allModels)$coefficients[,3]
If you have problems on the large model you might want to rewrite the code so that it keeps fewer objects by compounding multiple lines/assignments into fewer lines.
Hacky solution I know. But it will be about as fast as possible. I suppose it would be neater to put all the lines of code into a function as well.

Resources