I'm trying to get a two-way table in R similar to this one from Stata. I was trying to use CrossTable from the gmodels package, but the table is not the same. Do you know how this can be done in R?
I hope at least to get the frequencies for the cells
where cursmoke1 == "Yes" & cursmoke2 == "No" and the reverse.
In R I'm only getting the totals for Yes, No and NA.
Here is the output:
Stata
. tabulate cursmoke1 cursmoke2, cell column miss row
+-------------------+
| Key |
|-------------------|
| frequency |
| row percentage |
| column percentage |
| cell percentage |
+-------------------+
Current |
smoker, | Current smoker, exam 2
exam 1 | No Yes . | Total
-----------+---------------------------------+----------
No | 1,898 131 224 | 2,253
| 84.24 5.81 9.94 | 100.00
| 86.16 7.59 44.44 | 50.81
| 42.81 2.95 5.05 | 50.81
-----------+---------------------------------+----------
Yes | 305 1,596 280 | 2,181
| 13.98 73.18 12.84 | 100.00
| 13.84 92.41 55.56 | 49.19
| 6.88 35.99 6.31 | 49.19
-----------+---------------------------------+----------
Total | 2,203 1,727 504 | 4,434
| 49.68 38.95 11.37 | 100.00
| 100.00 100.00 100.00 | 100.00
| 49.68 38.95 11.37 | 100.00
R
> CrossTable(cursmoke2, cursmoke1, missing.include = T, format="SAS")
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 4434
| cursmoke1
cursmoke2 | No | Yes | NA | Row Total |
-------------|-----------|-----------|-----------|-----------|
No | 2203 | 0 | 0 | 2203 |
| 1122.544 | 858.047 | 250.409 | |
| 1.000 | 0.000 | 0.000 | 0.497 |
| 1.000 | 0.000 | 0.000 | |
| 0.497 | 0.000 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|
Yes | 0 | 1727 | 0 | 1727 |
| 858.047 | 1652.650 | 196.303 | |
| 0.000 | 1.000 | 0.000 | 0.389 |
| 0.000 | 1.000 | 0.000 | |
| 0.000 | 0.389 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|
NA | 0 | 0 | 504 | 504 |
| 250.409 | 196.303 | 3483.288 | |
| 0.000 | 0.000 | 1.000 | 0.114 |
| 0.000 | 0.000 | 1.000 | |
| 0.000 | 0.000 | 0.114 | |
-------------|-----------|-----------|-----------|-----------|
Column Total | 2203 | 1727 | 504 | 4434 |
| 0.497 | 0.389 | 0.114 | |
-------------|-----------|-----------|-----------|-----------|
Maybe I'm missing something here. The default settings for CrossTable seem to provide essentially what you are looking for.
Here is CrossTable with minimal arguments. (I've loaded the dataset as "temp".) Note that the results are the same as what you posted from the Stata output (you just need to multiply by 100 if you want the result as a percentage).
library(gmodels)
with(temp, CrossTable(cursmoke1, cursmoke2, missing.include=TRUE))
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 4434
| cursmoke2
cursmoke1 | No | Yes | NA | Row Total |
-------------|-----------|-----------|-----------|-----------|
No | 1898 | 131 | 224 | 2253 |
| 541.582 | 635.078 | 4.022 | |
| 0.842 | 0.058 | 0.099 | 0.508 |
| 0.862 | 0.076 | 0.444 | |
| 0.428 | 0.030 | 0.051 | |
-------------|-----------|-----------|-----------|-----------|
Yes | 305 | 1596 | 280 | 2181 |
| 559.461 | 656.043 | 4.154 | |
| 0.140 | 0.732 | 0.128 | 0.492 |
| 0.138 | 0.924 | 0.556 | |
| 0.069 | 0.360 | 0.063 | |
-------------|-----------|-----------|-----------|-----------|
Column Total | 2203 | 1727 | 504 | 4434 |
| 0.497 | 0.389 | 0.114 | |
-------------|-----------|-----------|-----------|-----------|
Alternatively, you can use format="SPSS" if you want the numbers displayed as percentages.
with(temp, CrossTable(cursmoke1, cursmoke2, missing.include=TRUE, format="SPSS"))
Cell Contents
|-------------------------|
| Count |
| Chi-square contribution |
| Row Percent |
| Column Percent |
| Total Percent |
|-------------------------|
Total Observations in Table: 4434
| cursmoke2
cursmoke1 | No | Yes | NA | Row Total |
-------------|-----------|-----------|-----------|-----------|
No | 1898 | 131 | 224 | 2253 |
| 541.582 | 635.078 | 4.022 | |
| 84.243% | 5.814% | 9.942% | 50.812% |
| 86.155% | 7.585% | 44.444% | |
| 42.806% | 2.954% | 5.052% | |
-------------|-----------|-----------|-----------|-----------|
Yes | 305 | 1596 | 280 | 2181 |
| 559.461 | 656.043 | 4.154 | |
| 13.984% | 73.177% | 12.838% | 49.188% |
| 13.845% | 92.415% | 55.556% | |
| 6.879% | 35.995% | 6.315% | |
-------------|-----------|-----------|-----------|-----------|
Column Total | 2203 | 1727 | 504 | 4434 |
| 49.684% | 38.949% | 11.367% | |
-------------|-----------|-----------|-----------|-----------|
Update: prop.table()
Just FYI (to save you the tedious work of building your own data.frame as you did), you may also be interested in the prop.table() function.
Again, using the data you linked to and assuming it is named "temp", the following gives you the underlying data from which you can construct your data.frame. You may also be interested in looking into the functions margin.table() or addmargins():
## Your basic table
CurSmoke <- with(temp, table(cursmoke1, cursmoke2, useNA = "ifany"))
CurSmoke
# cursmoke2
# cursmoke1 No Yes <NA>
# No 1898 131 224
# Yes 305 1596 280
## Row proportions
prop.table(CurSmoke, 1) # * 100 # If you so desire
# cursmoke2
# cursmoke1 No Yes <NA>
# No 0.84243231 0.05814470 0.09942299
# Yes 0.13984411 0.73177442 0.12838148
## Column proportions
prop.table(CurSmoke, 2) # * 100 # If you so desire
# cursmoke2
# cursmoke1 No Yes <NA>
# No 0.86155243 0.07585408 0.44444444
# Yes 0.13844757 0.92414592 0.55555556
## Cell proportions
prop.table(CurSmoke) # * 100 # If you so desire
# cursmoke2
# cursmoke1 No Yes <NA>
# No 0.42805593 0.02954443 0.05051872
# Yes 0.06878665 0.35994587 0.06314840
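As a quick follow-up on margin.table() and addmargins() mentioned above, here is a minimal sketch using the same CurSmoke table (the exact printed spacing may differ):
## Table with row and column sums appended
addmargins(CurSmoke)
#          cursmoke2
# cursmoke1   No  Yes <NA>  Sum
#       No  1898  131  224 2253
#       Yes  305 1596  280 2181
#       Sum 2203 1727  504 4434
## Row sums only
margin.table(CurSmoke, 1)
# cursmoke1
#   No  Yes
# 2253 2181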
Related
I have an R dataframe in the following format:
+--------+---------------+--------------------+--------+
| time | Stress_ratio | shear_displacement | CX |
+--------+---------------+--------------------+--------+
| <dbl> | <dbl> | <dbl> | <dbl> |
| 50.1 | -0.224 | 4.9 | 0 |
| 50.2 | -0.219 | 4.98 | 0.0100 |
| . | . | . | . |
| . | . | . | . |
| 249.3 | -0.217 | 4.97 | 0.0200 |
| 250.4 | -0.214 | 4.96 | 0.0300 |
| 251.1 | -0.222 | 4.91 | 0.06 |
| 252.1 | -0.222 | 4.91 | 0.06 |
| 253.3 | -0.222 | 4.91 | 0.06 |
| 254.5 | -0.222 | 4.91 | 0.06 |
| 256.8 | -0.222 | 4.91 | 0.06 |
| . | . | . | . |
| . | . | . | . |
| 500.1 | -0.22 | 4.91 | 0.6 |
| 501.4 | -0.22 | 4.91 | 0.6 |
| 503.1 | -0.22 | 4.91 | 0.6 |
+--------+---------------+--------------------+--------+
and I want a new column that takes a repeated value based on 250-unit ranges of the time column. For example, every row whose time falls in the first 250-unit range should get the value 1 in new_column; when the next 250-unit range starts, the value should change to 2, and so on. So the new dataframe should look like this:
+--------+---------------+--------------------+--------+------------+
| time | Stress_ratio | shear_displacement | CX | new_column |
+--------+---------------+--------------------+--------+------------+
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 50.1 | -0.224 | 4.9 | 0 | 1 |
| 50.2 | -0.219 | 4.98 | 0.0100 | 1 |
| . | . | . | . | 1 |
| . | . | . | . | 1 |
| 249.3 | -0.217 | 4.97 | 0.0200 | 1 |
| 250.4 | -0.214 | 4.96 | 0.0300 | 2 |
| 251.1 | -0.222 | 4.91 | 0.06 | 2 |
| 252.1 | -0.222 | 4.91 | 0.06 | 2 |
| 253.3 | -0.222 | 4.91 | 0.06 | 2 |
| 254.5 | -0.222 | 4.91 | 0.06 | 2 |
| 256.8 | -0.222 | 4.91 | 0.06 | 2 |
| . | . | . | . | . |
| . | . | . | . | . |
| 499.1 | -0.22 | 4.91 | 0.6 | 2 |
| 501.4 | -0.22 | 4.91 | 0.6 | 3 |
| 503.1 | -0.22 | 4.91 | 0.6 | 3 |
+--------+---------------+--------------------+--------+------------+
If I understand what you're trying to do, a base R solution could be:
df$new_column <- df$time %/% 250 + 1
The %/% operator is integer division (sort of the complement of the modulus operator) and tells you how many copies of 250 would fit into your number; we add 1 to get the value you want.
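For instance, on a few of the time values taken from your table (a quick hedged illustration, not part of the original answer):
c(50.1, 249.3, 250.4, 501.4) %/% 250 + 1
# [1] 1 1 2 3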
The tidyverse version (dplyr needs to be loaded):
library(dplyr)
df <- df %>%
  mutate(new_column = time %/% 250 + 1)
library(data.table)
setDT(df)[, new_column := rleid(time %/% 250)][]
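For context (my note, not the original poster's): rleid() from data.table assigns consecutive group ids to runs of identical values, so on a sorted time column it yields the same 1, 2, 3, ... grouping as the integer-division approach. A minimal sketch on a few values from the question:
library(data.table)
dt <- data.table(time = c(50.1, 249.3, 250.4, 501.4))
dt[, new_column := rleid(time %/% 250)][]
#     time new_column
# 1:  50.1          1
# 2: 249.3          1
# 3: 250.4          2
# 4: 501.4          3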
I wanted to know whether AODE can really be better than Naive Bayes in its own way, as the description says:
https://cran.r-project.org/web/packages/AnDE/AnDE.pdf
--> "AODE achieves highly accurate classification by averaging over all of a small space."
https://www.quora.com/What-is-the-difference-between-a-Naive-Bayes-classifier-and-AODE
--> "AODE is a weird way of relaxing naive bayes' independence assumptions. It is no longer a generative model, but it relaxes the independence assumptions in a slightly different (and less principled) way than logistic regression does. It replaces the convex optimization problem used in training a logistic regression classifier by a quadratic (on the number of features) dependency on both training and test times."
But when I experimented with it, I found that the prediction results seem off. I implemented it with this code:
library(gmodels)
library(AnDE)
AODE_Model = aode(iris)
predict_aode = predict(AODE_Model, iris)
CrossTable(as.numeric(iris$Species), predict_aode)
Can anyone explain this to me? Or are there any good practical solutions for implementing AODE? Thank you in advance.
If you check out the vignette for the function:
train: data.frame : training data. It should be a data frame. AODE
works only discretized data. It would be better to
discreetize the data frame before passing it to this
function.However, aode discretizes the data if not done
before hand. It uses an R package called discretization for
the purpose. It uses the well known MDL discretization
technique.(It might fail sometimes)
By default, the discretize() function from arules cuts each variable into 3 intervals, which may not be enough for iris. So I first reproduce the result you got, using discretization from arules:
library(arules)
library(gmodels)
library(AnDE)
set.seed(111)
indata <- data.frame(lapply(iris[,1:4],discretize,breaks=3),Species=iris$Species)
trn = sample(1:nrow(indata),100)
test = setdiff(1:nrow(indata),trn)
AODE_Model = aode(indata[trn,])
predict_aode = predict(AODE_Model, indata[test,])
CrossTable(as.numeric(indata$Species)[test], predict_aode)
| predict_aode
as.numeric(indata$Species)[test] | 1 | 3 | Row Total |
---------------------------------|-----------|-----------|-----------|
1 | 15 | 5 | 20 |
| 0.500 | 4.500 | |
| 0.750 | 0.250 | 0.400 |
| 0.333 | 1.000 | |
| 0.300 | 0.100 | |
---------------------------------|-----------|-----------|-----------|
2 | 11 | 0 | 11 |
| 0.122 | 1.100 | |
| 1.000 | 0.000 | 0.220 |
| 0.244 | 0.000 | |
| 0.220 | 0.000 | |
---------------------------------|-----------|-----------|-----------|
3 | 19 | 0 | 19 |
| 0.211 | 1.900 | |
| 1.000 | 0.000 | 0.380 |
| 0.422 | 0.000 | |
| 0.380 | 0.000 | |
---------------------------------|-----------|-----------|-----------|
Column Total | 45 | 5 | 50 |
| 0.900 | 0.100 | |
---------------------------------|-----------|-----------|-----------|
You can see one of the classes is missing in prediction. Let's increase it to 4:
indata <- data.frame(lapply(iris[,1:4],discretize,breaks=4),Species=iris$Species)
AODE_Model = aode(indata[trn,])
predict_aode = predict(AODE_Model, indata[test,])
CrossTable(as.numeric(indata$Species)[test], predict_aode)
| predict_aode
as.numeric(indata$Species)[test] | 1 | 2 | 3 | Row Total |
---------------------------------|-----------|-----------|-----------|-----------|
1 | 20 | 0 | 0 | 20 |
| 18.000 | 4.800 | 7.200 | |
| 1.000 | 0.000 | 0.000 | 0.400 |
| 1.000 | 0.000 | 0.000 | |
| 0.400 | 0.000 | 0.000 | |
---------------------------------|-----------|-----------|-----------|-----------|
2 | 0 | 10 | 1 | 11 |
| 4.400 | 20.519 | 2.213 | |
| 0.000 | 0.909 | 0.091 | 0.220 |
| 0.000 | 0.833 | 0.056 | |
| 0.000 | 0.200 | 0.020 | |
---------------------------------|-----------|-----------|-----------|-----------|
3 | 0 | 2 | 17 | 19 |
| 7.600 | 1.437 | 15.091 | |
| 0.000 | 0.105 | 0.895 | 0.380 |
| 0.000 | 0.167 | 0.944 | |
| 0.000 | 0.040 | 0.340 | |
---------------------------------|-----------|-----------|-----------|-----------|
Column Total | 20 | 12 | 18 | 50 |
| 0.400 | 0.240 | 0.360 | |
---------------------------------|-----------|-----------|-----------|-----------|
It gets only 3 wrong. To me, it's a matter of tuning the discretization without overfitting, which can be tricky.
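As a rough sanity check (my addition, not part of the original answer), the overall accuracy can be read straight off the confusion table built from the objects above:
## 20 + 10 + 17 = 47 of the 50 test rows are on the diagonal
tab <- table(as.numeric(indata$Species)[test], predict_aode)
sum(diag(tab)) / sum(tab)
# [1] 0.94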
I don't know how it could ever not be square, but here is my code.
The output is missing the "-11" id column, but it is not missing the "-11" row, and it does not include any numbers from a "-11" column in the total column. Though, as you can see, where the total row and total column intersect, the overall total is correct.
Any input would be appreciated.
(the csv is from https://www.kaggle.com/marianna13/starter-particle-identification-from-94dec2e4-9)
library(class)
library(tidyverse)
library(gmodels)
particle <- read_csv("C:/Users/laura_000/Documents/joe/ML with R/pid-5M.csv")
particles <- particle[sample(nrow(particle), 50000), ]
particles_train <- particles[1:45000, 2:7]
particles_test <- particles[45001:50000, 2:7]
particles_train_labels <- particles[1:45000, 1]
particles_test_labels <- particles[45001:50000, 1]
particles_test_pred <- knn(train = particles_train, test = particles_test, cl = particles_train_labels[,1, drop = TRUE], k = round(45000^.5))
CrossTable(x = particles_test_labels[,1, drop = TRUE], y = particles_test_pred, prop.chisq=FALSE)
Here's the output
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 5000
| particles_test_pred
particles_test_labels[, 1, drop = TRUE] | 211 | 321 | 2212 | Row Total |
----------------------------------------|-----------|-----------|-----------|-----------|
-11 | 20 | 0 | 0 | 20 |
| 1.000 | 0.000 | 0.000 | 0.004 |
| 0.007 | 0.000 | 0.000 | |
| 0.004 | 0.000 | 0.000 | |
----------------------------------------|-----------|-----------|-----------|-----------|
211 | 2759 | 0 | 84 | 2843 |
| 0.970 | 0.000 | 0.030 | 0.569 |
| 0.901 | 0.000 | 0.044 | |
| 0.552 | 0.000 | 0.017 | |
----------------------------------------|-----------|-----------|-----------|-----------|
321 | 181 | 8 | 44 | 233 |
| 0.777 | 0.034 | 0.189 | 0.047 |
| 0.059 | 1.000 | 0.023 | |
| 0.036 | 0.002 | 0.009 | |
----------------------------------------|-----------|-----------|-----------|-----------|
2212 | 101 | 0 | 1803 | 1904 |
| 0.053 | 0.000 | 0.947 | 0.381 |
| 0.033 | 0.000 | 0.934 | |
| 0.020 | 0.000 | 0.361 | |
----------------------------------------|-----------|-----------|-----------|-----------|
Column Total | 3061 | 8 | 1931 | 5000 |
| 0.612 | 0.002 | 0.386 | |
----------------------------------------|-----------|-----------|-----------|-----------|
I am trying to find the symbol with the smallest difference, but I don't know what to do after finding the difference in order to compare them.
I have this set:
+------+------+-------------+-------------+--------------------+------+--------+
| clid | cust | Min | Max | Difference | Qty | symbol |
+------+------+-------------+-------------+--------------------+------+--------+
| 102 | C6 | 11.8 | 12.72 | 0.9199999999999999 | 1500 | GE |
| 110 | C3 | 44 | 48.099998 | 4.099997999999999 | 2000 | INTC |
| 115 | C4 | 1755.25 | 1889.650024 | 134.40002400000003 | 2000 | AMZN |
| 121 | C9 | 28.25 | 30.27 | 2.0199999999999996 | 1500 | BAC |
| 130 | C7 | 8.48753 | 9.096588 | 0.609058000000001 | 5000 | F |
| 175 | C3 | 6.41 | 7.71 | 1.2999999999999998 | 1500 | SBS |
| 204 | C5 | 6.41 | 7.56 | 1.1499999999999995 | 5000 | SBS |
| 208 | C2 | 1782.170044 | 2004.359985 | 222.1899410000001 | 5000 | AMZN |
| 224 | C10 | 153.350006 | 162.429993 | 9.079986999999988 | 1500 | FB |
| 269 | C6 | 355.980011 | 392.299988 | 36.319976999999994 | 2000 | BA |
+------+------+-------------+-------------+--------------------+------+--------+
So far I have this query:
select d.clid,
d.cust,
MIN(f.fillPx) as Min,
MAX(f.fillPx) as Max,
MAX(f.fillPx)-MIN(f.fillPx) as Difference,
d.Qty,
d.symbol
from orders d
inner join mp f on d.clid=f.clid
group by f.clid
having SUM(f.fillQty) < d.Qty
order by d.clid;
What am I missing so that I can compare the min and max and get the symbol with the smallest difference?
mp table:
+------+------+--------+------+------+---------+-------------+--------+
| clid | cust | symbol | side | oQty | fillQty | fillPx | execid |
+------+------+--------+------+------+---------+-------------+--------+
| 123 | C2 | SBS | SELL | 5000 | 273 | 7.37 | 1 |
| 157 | C9 | C | SELL | 1500 | 167 | 69.709999 | 2 |
| 254 | C9 | GE | SELL | 5000 | 440 | 13.28 | 3 |
| 208 | C2 | AMZN | SELL | 5000 | 714 | 1864.420044 | 4 |
| 102 | C6 | GE | SELL | 1500 | 136 | 12.32 | 5 |
| 160 | C7 | INTC | SELL | 1500 | 267 | 44.5 | 6 |
| 145 | C10 | GE | SELL | 5000 | 330 | 13.28 | 7 |
| 208 | C2 | AMZN | SELL | 5000 | 1190 | 1788.609985 | 8 |
| 161 | C1 | C | SELL | 1500 | 135 | 72.620003 | 9 |
| 181 | C5 | FCX | BUY | 1500 | 84 | 12.721739 | 10 |
orders table:
+------+------+--------+------+------+
| cust | side | symbol | qty | clid |
+------+------+--------+------+------+
| C1 | SELL | C | 1500 | 161 |
| C9 | SELL | INTC | 2000 | 231 |
| C10 | SELL | BMY | 1500 | 215 |
| C1 | BUY | SBS | 2000 | 243 |
| C4 | BUY | AMZN | 2000 | 226 |
| C10 | BUY | C | 1500 | 211 |
If you want one symbol, you can use order by and limit:
select d.clid,
d.cust,
MIN(f.fillPx) as Min,
MAX(f.fillPx) as Max,
MAX(f.fillPx)-MIN(f.fillPx) as Difference,
d.Qty,
d.symbol
from orders d join
mp f
on d.clid = f.clid
group by d.clid, d.cust, d.Qty, d.symbol
having SUM(f.fillQty) < d.Qty
order by difference
limit 1;
Notice that I added the rest of the unaggregated columns to the group by.
I used CrossTable for a kNN model, but the output is not what I expected. It shows me a bunch of numbers spread across many columns instead of a tidy predicted vs. actual table. (I would add an image, but I need 10 reputation points.) I want a table with clear output.
#I'm setting working directory folder
setwd("F:/Level 5/CT5018 - Data Analytics/My project/Official Dataset - Adult")
#start calculating the time to run the code
k <-Sys.time()
#here I'm assigning adults to read the csv file
adults <- read.csv("Adults.csv", stringsAsFactors = FALSE)
#examine the structure of the adultsTr data frame
str(adults)
#drop the fnlwgt feature
adults <- adults[-3]
#table of sex
table(adults$Sex)
#recode Sex as a factor
adults$Sex <- factor(adults$Sex, levels = c("Female","Male"),
labels = c("Women", "Men"))
#table or proportions with more informative labels
round(prop.table(table(adults$Sex)) * 100, digits = 1)
#summarize all numeric features
summary(adults[c("Age", "Education.num", "Capital.gain", "Capital.loss", "Hours.per.week")])
#---------------------------------------------- Min-Max normalisation ----------------------------------------------
#create normalization function
normalize <- function(x) {
return ((x - min (x)) / (max(x) - min(x)))
}
#test normalization function - result should be identical
normalize(c(1, 2, 3, 4, 5))
normalize(c(10, 20, 30, 40, 50))
#normalize the adultsTr data
adultsN <- as.data.frame(lapply(adults[c("Age", "Education.num", "Capital.gain", "Capital.loss", "Hours.per.week")], normalize))
#confirm that normalization worked
summary(adultsN$Age)
# create training and test data
adultsTrain <- adultsN[1:14999, ]
adultsTest <- adultsN[15000:19999, ]
# create labels for training and test data
adultsTrainLabels <- adults[1:14999, 1]
adultsTestLabels <- adults[15000:19999, 1]
#instaling package class
#install.packages("class")
library(class)
adultsTestPred <- knn(train = adultsTrain, test = adultsTest,
cl = adultsTrainLabels, k=122)
#installing package for cross tables
#install.packages("gmodels")
library(gmodels)
# Create the cross tabulation of predicted vs. actual
CrossTable(x = adultsTestLabels, y = adultsTestPred,
prop.chisq=FALSE)
This is what it shows me:
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 5000
| adultsTestPred
adultsTestLabels | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 71 | 72 | 73 | 76 | 77 | 90 | Row Total |
-----------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
17 | 65 | 6 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 74 |
| 0.878 | 0.081 | 0.014 | 0.014 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.014 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.015 |
| 0.556 | 0.200 | 0.012 | 0.005 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.004 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
--------------------------------------------
Where I actually want this:
-----------------------------------------------
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 2000
| predicted Sex
actual Sex | Female | Male | Row Total |
-------------|-----------|-----------|-----------|
Female | 514 | 161 | 675 |
| 0.257 | 0.080 | |
-------------|-----------|-----------|-----------|
Male | 162 | 1163 | 1325 |
| 0.081 | 0.582 | |
-------------|-----------|-----------|-----------|
Column Total | 676 | 1324 | 2000 |
-------------|-----------|-----------|-----------|
I ran into the same problem. The mistake I had made was mixing up the id and label columns.
My data frame was like x = [Id, label, Feature 1, Feature 2, ...]
and I assigned the label as x[1] instead of x[2].
Try to extract the labels before normalizing.
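Applied to the adults example above, that means using the Sex column (recoded to Women/Men in the question) as the class labels instead of column 1, which is Age. A minimal sketch, reusing the objects from the question's own code:
# labels taken from the Sex factor, not from the first column
adultsTrainLabels <- adults$Sex[1:14999]
adultsTestLabels <- adults$Sex[15000:19999]
adultsTestPred <- knn(train = adultsTrain, test = adultsTest,
                      cl = adultsTrainLabels, k = 122)
# predicted vs. actual Sex
CrossTable(x = adultsTestLabels, y = adultsTestPred, prop.chisq = FALSE)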