Why do I have so few classified labels - azure-machine-learning-studio

In my dataset I have 300k rows. I do a 70/30 split and the result seems to be a decent model, until I view the true-positive, false-negative, false-positive and true-negative numbers.
TP is 20, FN is 2, FP is 3 and TN is 41.
Isn't that extremely low? The metrics look great, but if the model is only able to classify 66 of the 90,000 test rows it is rather useless.
What can I do to improve this? Switching between the Two-Class Boosted Decision Tree and a neural network does not change the outcome much. Any recommendations?

Can you please check whether you have any missing values in your dataset?

Related

Simulation to find random sequences

With R I am trying to find the probability that the Age vector below resulted from random sampling. I used the runs test (from the randtests package), which gave p-value = 0.2892. Other colleagues used the rle function (run length encoding in R) or simulated whether random allocation could generate the observed sequences. Their result shows p < 0.00000001 that this sequence is the result of random sampling. I am trying to find the R code to replicate their findings. Any help on how to simulate and replicate their findings is highly appreciated.
Update: I received advice from a statistician that I can do this using a non-parametric bootstrap. However, I still do not know how this can be done. I appreciate your help.
example:
Age <-c(68,71,72,69,80,78,80,81,84,82,67,73,65,68,66,70,69,72,74,73,68,75,70,72,75,73,69,75,74,79,80,78,80,81,79,82,69,73,67,66,70,72,69,72,75,80,68,69,71,77,70,73) ;
randtests::runs.test(Age);
X <- rle(Age);X$lengths
What was initially presented isn't the whole story. If one looks at the supplement where these numbers are from, the reported p-value is for comparing two vectors. OP only provides one, and hence the task is not well-defined.
The full assertion of the research article is that
group1 <- c(68,71,72,69,80,78,80,81,84,82,67,73,65,68,66,70,69,72,74,73,68,75,70,72,75,73)
group2 <- c(69,75,74,79,80,78,80,81,79,82,69,73,67,66,70,72,69,72,75,80,68,69,71,77,70,73)
being two independent random samples has a p-value < 0.00000001.
Even just checking position-wise identity (10 matching entries in the original) using permutations within one group, I see only 2 or 3 draws per million with at least as many identical values, i.e. something like:
set.seed(123)
mean(replicate(1e6, sum(sample(group1, length(group1)) == group2)) >= 10)
# 2e-06
Testing correlations and/or bootstrapping could easily be in the p-value range that is reported (nothing as extreme in 100 million simulations).
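For the correlation route, a minimal sketch (my own illustration, not code from the article; it reuses group1 and group2 from above) would be a permutation test on the position-wise correlation:
set.seed(123)
obs_cor  <- cor(group1, group2)
perm_cor <- replicate(1e6, cor(sample(group1), group2))
mean(perm_cor >= obs_cor)   # one-sided p-value: share of shuffles at least as correlated as observed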

Calculation of Precision,Recall and F-Score with confusionMatrix

We have developed an algorithm that detects the number of repetitions per resistance exercise machine from accelerometer data. People always performed 10 repetitions, 2 sets per machine.
n people x 10 repetitions x 2 sets = total number of repetitions performed.
Now, I wanted to calculate the precision, recall and f-score with confusionMatrix from the caret package.
I made an xlsx file with two rows: the real number of repetitions (upper row) and the algorithmically predicted number of repetitions (lower row).
I coded the following:
library(openxlsx)   # for read.xlsx
library(caret)      # for confusionMatrix
reps_prec_phone1 <- read.xlsx("Reps_for_Precision_Recall_FSCORE.xlsx", sheet = "2Vec_Phone1", startRow = 0, colNames = FALSE)
reps_prec_pred_phone1<-as.factor(reps_prec_phone1[1,])
reps_prec_real_Phone1<-as.factor(reps_prec_phone1[2,])
result_phone1 <- confusionMatrix(reps_prec_pred_phone1, reps_prec_real_Phone1, mode="prec_recall")
The result looks like this:
As you can see in the confusionMatrix, 385 sets (1 set consists of 10 repetitions) instead of 3850 repetitions were counted. Now I am wondering, methodologically, how can I get confusionMatrix to calculate the number of repetitions instead of the number of sets?
In my case the error rate is 1 - Accuracy = 2.5%. Since 1 set consists of 10 repetitions, i.e. sets vs. repetitions differ by a factor of 10, I could simply divide the error rate by 10 and recalculate the accuracy as 1 - 0.0025 = 0.9975. However,
does anyone know how to solve this issue with confusionMatrix?
Thank you in advance for your brain power & experience!
There's a theoretical mistake.
A confusion matrix compares observed values with predicted values. R converts your data to factors, so your values {10, 11} are interpreted as levels of those factors, not as numeric counts, and R then counts the matches per level. In short, you have the wrong idea about what a confusion matrix is.
Also, any model will produce biased predictions because your data are extremely unbalanced; in short, there is nothing to predict.
So you don't have a programming problem; it's more a theoretical one. Visit Stack Exchange (Cross Validated) to clear up your ideas.
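To make the factor-level point concrete, here is a minimal sketch with invented toy counts (not the asker's spreadsheet); the repetition-level summary at the end is only one possible convention, not a caret feature:
library(caret)
real <- c(10, 10, 10, 10)   # true repetitions per set
pred <- c(10,  9, 10, 11)   # repetitions counted by the algorithm per set
# Set-level view: 9, 10 and 11 are treated as factor levels, so this
# table counts sets (4 here), not individual repetitions.
confusionMatrix(factor(pred, levels = 9:11), factor(real, levels = 9:11), mode = "prec_recall")
# One possible repetition-level accuracy: count each correctly detected repetition as a hit.
sum(pmin(pred, real)) / sum(pmax(pred, real))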

Unsupervised Supervised Clusters with NAs and Qualitative Data in R

I have basketball player data that looks like the following:
Player  Weight  Height  Shots  School
A       NA      70      23     AB
B       130     62      10     AB
C       180     66      NA     BC
D       157     65      22     CD
and I want to do unsupervised and supervised (based on height) clustering. Looking into online resources I found that I can use kmeans for the unsupervised case, but I don't know how to handle NAs without losing a good amount of data. I also don't know how to handle the qualitative variable "School". Are there any ways to resolve both issues for unsupervised and supervised clustering?
K-means cannot be used for categorical data. One workaround would be to instead use numeric data about the schools, such as number of enrollments or local SES data.
kmeans() in R cannot handle NAs, so you could either omit them (and you should check that the NAs are distributed fairly evenly among the other factors) or look into using cluster::clara() from the cluster library.
You have not asked anything specific about the supervised part, so I cannot address that portion of the question.
The problem you are facing is known as missing data, and you have to decide how to handle it before starting the clustering. In most cases the samples with missing data (NAs here) are simply omitted; that happens in the data preparation and cleaning step of data mining. In R you can do it using the following code:
na.omit(yourdata)
It omits the records or samples (rows) that contain NAs.
But if you want to include them in the clustering process, you can fill in the missing value with the average value of that feature over the whole dataset.
In your case, consider Weight:
for player A you can set (130 + 180 + 157) / 3 as his weight.
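A minimal sketch of that step (assuming the example table above has been read into a data frame called data, matching the code further down):
# Fill missing Weight values with the mean of the observed weights
data$Weight[is.na(data$Weight)] <- mean(data$Weight, na.rm = TRUE)
# The same pattern works for the other numeric columns, e.g. Shots
data$Shots[is.na(data$Shots)] <- mean(data$Shots, na.rm = TRUE)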
On the other question: it seems you are a little bit confused about the meaning of supervised and unsupervised learning. In supervised learning you need to define the class label of the samples. Then you build a model (classifier) and train it to learn about each class of samples; after training you can use the model to predict the label of a test sample, e.g. you give it a player with the values (W = 100, H = 190, Shots = 55) and it returns the predicted class label.
For unsupervised learning you just need to cluster the data to find groups of related samples. For this you do not need a class label; you only define the features on which the samples should be clustered. For example, you can cluster players only on their weight, or only on their height, or you can use all of height, weight and shots for clustering. This is possible in R using the following code:
clus <- kmeans(na.omit(data$Weight), 5)   # cluster into 5 clusters based on Weight only
clus <- kmeans(na.omit(data[, 2:4]), 5)   # cluster based on Weight, Height, Shots into 5 clusters
Note the use of na.omit here, which removes rows that have NAs in their columns.
Let me know if this helps you.

Can I use quickpred in Mice to impute a subset of variables from a larger set of variables in a nested longitudinal (and long) dataframe?

I've tried to create a test data.frame to demonstrate my question, but my R skills aren't quite strong enough to even do that. I am not in a position to share my true database. I hope my question can stand on its own.
I am working with a nested longitudinal dataset that is saved as a long file (1000 subjects nested in 8 sites, 4 potential time points/subject, 68 potential predictor variables). I want to impute missing values on 4 static predictors (e.g., maternal education, family income) prior to conducting lme on the longitudinal outcomes in order to have a consistent number of cases for all models.
I am working with the package mice in r. From all that I have read, it is recommended that I use all the variables in my models and any other variables that may predict the missing values in my imputation. Given the number of variables in my models, I need something like quickpred to simplify this. But I'm getting an error that I do not understand.
I tried the following initial code for my database N2NPL, indicating c(14, 16, 18, 19) as the variables that I want to impute.
iniN2NPL <- mice(N2NPL[, c(14, 16, 18, 19)],
                 pred = quickpred(N2NPL, minpuc = 0.25,
                                  exclude = c('ID', 'TypeConvNon', 'TypeCtPr',
                                              'TypeName', 'CHR_converter')),
                 maxit = 0)
"Error in check.predictorMatrix(setup) :
The predictorMatrix has 73 rows and 73 columns. Both should be 4'
I know that the predictor matrix from mice::quickpred needs to be square, but is there any way of not imputing all of the variables? Is it sufficient to include site as a predictor, given the nesting of subjects within sites?
Thank you for any help directing me to the proper code or instructions on this. The examples I see all seem much simpler than mine, and are thus of little help with the issues I'm having.
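One pattern from the mice documentation that may address this (a sketch, not a tested answer for this dataset; it assumes mice >= 3.0 for make.method): keep the full data frame so the predictor matrix stays 73 x 73, and instead blank out the imputation method for every variable that should not be imputed:
library(mice)
# Predictor matrix built on ALL columns, so its dimensions match the data
pred <- quickpred(N2NPL, minpuc = 0.25,
                  exclude = c('ID', 'TypeConvNon', 'TypeCtPr', 'TypeName', 'CHR_converter'))
# Default methods, then switch off imputation for everything except
# the four static predictors (columns 14, 16, 18, 19 from the question)
meth <- make.method(N2NPL)
meth[-c(14, 16, 18, 19)] <- ""
impN2NPL <- mice(N2NPL, predictorMatrix = pred, method = meth, m = 5, maxit = 10)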

comparing kappa coefficients (intercoder agreements) on categorical data

I have a list of 282 items that has been classified by 6 independent coders into 20 categories.
The 20 categories are defined by words (e.g. "perceptual", "evaluation", etc.).
The 6 coders have different status: 3 of them are experts, 3 are novices.
I calculated all the kappas (and alphas) between each pair of coders, and the overall kappas among the 6 coders, and the kappas between the 3 experts and between the 3 novices.
Now I would like to check whether there is a significant difference between the interrater agreements achieved by the experts vs those achieved by the novices (whose kappa is indeed lower).
How would you approach this question and report the results?
thanks!
You can at least easily obtain Cohen's Kappa and its sd in R (by far the best option in my opinion).
The PresenceAbsence package has a Kappa (see ?Kappa) function.
You can get the package with the regular install.packages("PresenceAbsence"), then pass a confusion matrix, i.e.:
# we load the package
library(PresenceAbsence)
# a dummy confusion matrix
cm <- matrix(round(runif(16, 0, 10)), nrow=4)
Kappa(cm)
You will obtain the Kappa and its sd. As far as I know there are limitations to significance testing with the Kappa metric (e.g. see https://en.wikipedia.org/wiki/Cohen's_kappa#Significance_and_magnitude).
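If you then want to test the expert vs. novice difference directly, one rough option (my own sketch, not part of the package; the numbers are placeholders for the kappas and sds you computed) is an approximate two-sample z-test:
kappa_experts <- 0.72; sd_experts <- 0.05   # placeholder values, substitute your estimates
kappa_novices <- 0.58; sd_novices <- 0.06   # placeholder values, substitute your estimates
z <- (kappa_experts - kappa_novices) / sqrt(sd_experts^2 + sd_novices^2)
2 * pnorm(-abs(z))   # approximate two-sided p-value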
hope this helps
