I have a table created using the table() command in R:
y
x 0 1 2 3 4 5 6 7 8 9
0 23 0 0 0 0 1 0 0 0 0
1 0 23 1 0 1 0 1 2 0 2
2 1 1 28 0 0 0 1 0 2 2
3 0 1 0 24 0 1 0 0 0 1
4 1 1 0 0 34 0 3 0 0 0
5 0 0 0 0 0 33 0 0 0 0
6 0 0 0 0 0 2 32 0 0 0
7 0 1 0 1 0 0 0 36 0 1
8 1 1 1 1 0 0 0 1 20 1
9 1 3 0 1 0 1 0 1 0 24
This table shows the results of a classification, and I want to sum the leading diagonal of it (the diagonal with the large numbers - like 23, 23, 28 etc). Is there a sensible/easy way to do this in R?
How about sum(diag(tbl)), where tbl is your table?
Related
I would like to do transform Gender and Country using One-Hot-Encoding.
With the code below I can not create the new dataset including the ID
library(caret)
ID<-1:10
Gender<-c("F","F","F","M","M","F","M","M","F","M")
Country<-c("Mali","France","France","Guinea","Senegal",
"Mali","France","Mali","Senegal","France")
data<-data.frame(ID,Gender,Country)
#One hot encoding
dmy <- dummyVars(" ~Gender+Country", data = data, fullRank = T)
dat_transformed <- data.frame(predict(dmy, newdata = data))
dat_transformed
Gender.M Country.Guinea Country.Mali Country.Senegal
1 0 0 1 0
2 0 0 0 0
3 0 0 0 0
4 1 1 0 0
5 1 0 0 1
6 0 0 1 0
7 1 0 0 0
8 1 0 1 0
9 0 0 0 1
10 1 0 0 0
I want to get a dataset that include the ID without enconding it.
ID Gender.M Country.Guinea Country.Mali Country.Senegal
1 1 0 0 1 0
2 2 0 0 0 0
3 3 0 0 0 0
4 4 1 1 0 0
5 5 1 0 0 1
6 6 0 0 1 0
7 7 1 0 0 0
8 8 1 0 1 0
9 9 0 0 0 1
10 10 1 0 0 0
dat_transformed <- cbind(ID,dat_transformed)
dat_transformed
ID Gender.M Country.Guinea Country.Mali Country.Senegal
1 0 0 1 0
2 0 0 0 0
3 0 0 0 0
4 1 1 0 0
5 1 0 0 1
6 0 0 1 0
7 1 0 0 0
8 1 0 1 0
9 0 0 0 1
10 1 0 0 0
I have dummy coded the data in R using the package named "dummies". This gave me the output in various levels and create the dummies for each level. I want to consolidate all those levels on the basis of attributes. Please help me out! Thanks in advance. The following is the code I used:
#To read the data from the working directory
bankfull<-read.csv("bank.csv")
#Calling the library named dummies
library(dummies)
#Redefining the bankfull data with the dummy codes
bankfull<-dummy.data.frame(bankfull,sep=",")
#Viewing the data after dummy coding
print(bankfull)
The output is as follows:
#To read the data from the working directory
> bankfull<-read.csv("bank.csv")
> #Calling the library named dummies
> library(dummies)
dummies-1.5.6 provided by Decision Patterns
> #Redefining the bankfull data with the dummy codes
> bankfull<-dummy.data.frame(bankfull,sep=",")
> #Viewing the data after dummy coding
> View(bankfull)
> library(carData)
> #Viewing the data after dummy coding
> print(bankfull)
CHK_ACCT DURATION HISTORY NEW_CAR USED_CAR FURNITURE RADIO_TV EDUCATION RETRAINING AMOUNT SAV_ACCT EMPLOYMENT INSTALL_RATE MALE_DIV
1 0 6 4 0 0 0 1 0 0 1169 4 4 4 0
2 1 48 2 0 0 0 1 0 0 5951 0 2 2 0
3 3 12 4 0 0 0 0 1 0 2096 0 3 2 0
4 0 42 2 0 0 1 0 0 0 7882 0 3 2 0
5 0 24 3 1 0 0 0 0 0 4870 0 2 3 0
6 3 36 2 0 0 0 0 1 0 9055 4 2 2 0
7 3 24 2 0 0 1 0 0 0 2835 2 4 3 0
8 1 36 2 0 1 0 0 0 0 6948 0 2 2 0
9 3 12 2 0 0 0 1 0 0 3059 3 3 2 1
10 1 30 4 1 0 0 0 0 0 5234 0 0 4 0
11 1 12 2 1 0 0 0 0 0 1295 0 1 3 0
12 0 48 2 0 0 0 0 0 1 4308 0 1 3 0
13 1 12 2 0 0 0 1 0 0 1567 0 2 1 0
14 0 24 4 1 0 0 0 0 0 1199 0 4 4 0
15 0 15 2 1 0 0 0 0 0 1403 0 2 2 0
16 0 24 2 0 0 0 1 0 0 1282 1 2 4 0
17 3 24 4 0 0 0 1 0 0 2424 4 4 4 0
18 0 30 0 0 0 0 0 0 1 8072 4 1 2 0
19 1 24 2 0 1 0 0 0 0 12579 0 4 4 0
20 3 24 2 0 0 0 1 0 0 3430 2 4 3 0
21 3 9 4 1 0 0 0 0 0 2134 0 2 4 0
22 0 6 2 0 0 0 1 0 0 2647 2 2 2 0
23 0 10 4 1 0 0 0 0 0 2241 0 1 1 0
24 1 12 4 0 1 0 0 0 0 1804 1 1 3 0
25 3 10 4 0 0 1 0 0 0 2069 4 2 2 0
26 0 6 2 0 0 1 0 0 0 1374 0 2 1 0
27 3 6 0 0 0 0 1 0 0 426 0 4 4 0
28 2 12 1 0 0 0 1 0 0 409 3 2 3 0
29 1 7 2 0 0 0 1 0 0 2415 0 2 3 0
30 0 60 3 0 0 0 0 0 1 6836 0 4 3 0
31 1 18 2 0 0 0 0 0 1 1913 3 1 3 0
32 0 24 2 0 0 1 0 0 0 4020 0 2 2 0
MALE_SINGLE MALE_MAR_or_WID CO_APPLICANT GUARANTOR PRESENT_RESIDENT REAL_ESTATE PROP_UNKN_NONE AGE OTHER_INSTALL RENT OWN_RES NUM_CREDITS
1 1 0 0 0 4 1 0 67 0 0 1 2
2 0 0 0 0 2 1 0 22 0 0 1 1
3 1 0 0 0 3 1 0 49 0 0 1 1
4 1 0 0 1 4 0 0 45 0 0 0 1
5 1 0 0 0 4 0 1 53 0 0 0 2
6 1 0 0 0 4 0 1 35 0 0 0 1
7 1 0 0 0 4 0 0 53 0 0 1 1
8 1 0 0 0 2 0 0 35 0 1 0 1
9 0 0 0 0 4 1 0 61 0 0 1 1
10 0 1 0 0 2 0 0 28 0 0 1 2
11 0 0 0 0 1 0 0 25 0 1 0 1
12 0 0 0 0 4 0 0 24 0 1 0 1
13 0 0 0 0 1 0 0 22 0 0 1 1
14 1 0 0 0 4 0 0 60 0 0 1 2
15 0 0 0 0 4 0 0 28 0 1 0 1
16 0 0 0 0 2 0 0 32 0 0 1 1
17 1 0 0 0 4 0 0 53 0 0 1 2
18 1 0 0 0 3 0 0 25 1 0 1 3
19 0 0 0 0 2 0 1 44 0 0 0 1
20 1 0 0 0 2 0 0 31 0 0 1 1
21 1 0 0 0 4 0 0 48 0 0 1 3
22 1 0 0 0 3 1 0 44 0 1 0 1
23 1 0 0 0 3 1 0 48 0 1 0 2
24 1 0 0 0 4 0 0 44 0 0 1 1
25 0 1 0 0 1 0 0 26 0 0 1 2
26 1 0 0 0 2 1 0 36 1 0 1 1
27 0 1 0 0 4 0 0 39 0 0 1 1
28 0 0 0 0 3 1 0 42 0 1 0 2
29 1 0 0 1 2 1 0 34 0 0 1 1
30 1 0 0 0 4 0 1 63 0 0 1 2
31 0 1 0 0 3 1 0 36 1 0 1 1
32 1 0 0 0 2 0 0 27 1 0 1 1
JOB NUM_DEPENDENTS TELEPHONE FOREIGN RESPONSE
1 2 1 1 0 1
2 2 1 0 0 0
3 1 2 0 0 1
4 2 2 0 0 1
5 2 2 0 0 0
6 1 2 1 0 1
7 2 1 0 0 1
8 3 1 1 0 1
9 1 1 0 0 1
10 3 1 0 0 0
11 2 1 0 0 0
12 2 1 0 0 0
13 2 1 1 0 1
14 1 1 0 0 0
15 2 1 0 0 1
16 1 1 0 0 0
17 2 1 0 0 1
18 2 1 0 0 1
19 3 1 1 0 0
20 2 2 1 0 1
21 2 1 1 0 1
22 2 2 0 0 1
23 1 2 0 1 1
24 2 1 0 0 1
25 2 1 0 1 1
26 1 1 1 0 1
27 1 1 0 0 1
28 2 1 0 0 1
29 2 1 0 0 1
30 2 1 1 0 0
31 2 1 1 0 1
32 2 1 0 0 1
[ reached getOption("max.print") -- omitted 968 rows ]
In the output the dummies are given according to the levels. I need the output according to the attributes. Can I merge all of them without using any third party packages?
I have that csv file, containing 600k lines and 3 rows, first one containing a disease name, second one a gene, a third one a number something like that: i have roughly 4k disease and 16k genes so sometimes the disease names and genes names are redudant.
cholera xx45 12
Cancer xx65 1
cholera xx65 0
i would like to make a DTM matrix using R, i've been trying to use the Corpus command from the tm library but corpus doesn't reduce the amount of disease and size's 600k ish, i'd love to understand how to transform that file into a DTM.
I'm sorry for not being that precise, totally starting with computer science things as a bio guy :)
Cheers!
If you're not concerned with the number in the third column, then you can accomplish what I think you're trying to do using only the first two columns (gene and disease).
Example with some simulated data:
library(data.table)
# Create a table with 10k combinations of ~6k different genes and 40 different diseases
df <- data.frame(gene=sapply(1:10000, function(x) paste(c(sample(LETTERS, size=2), sample(10, size=1)), collapse="")), disease=sample(40, size=100000, replace=TRUE))
table(df) creates a large matrix, nGenes rows long and nDiseases columns wide. Looking at just the first 10 rows (because it's so large and sparse).
head(table(df))
disease
gene 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
AB10 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0
AB2 1 1 0 0 0 0 1 0 0 0 0 0 0 0 2 0 0 2 0 0 0 0 1 0 1 0 1
AB3 0 1 0 0 2 1 1 0 0 1 0 0 0 0 0 2 1 0 0 1 0 0 1 0 3 0 1
AB4 0 0 1 0 0 1 0 2 1 1 0 1 0 0 1 1 1 1 0 1 0 2 0 0 0 1 1
AB5 0 1 0 1 0 0 2 2 0 1 1 1 0 1 0 0 2 0 0 0 0 0 0 1 1 1 0
AB6 0 0 2 0 2 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0
disease
gene 28 29 30 31 32 33 34 35 36 37 38 39 40
AB10 0 0 1 2 1 0 0 1 0 0 0 0 0
AB2 0 0 0 0 0 0 0 0 0 0 0 0 0
AB3 0 0 1 1 1 0 0 0 0 0 1 1 0
AB4 0 0 1 2 1 1 1 1 1 2 0 3 1
AB5 0 2 1 1 0 0 3 4 0 1 1 0 2
AB6 0 0 0 0 0 0 0 1 0 0 0 0 0
Alternatively, you can exclude the counts of 0 and only include combinations that actually exist. Easy aggregation can be done with data.table, e.g. (continuing from the above example)
library(data.table)
dt <- data.table(df)
dt[, .N, by=list(gene, disease)]
which gives a frequency table like the following:
gene disease N
1: HA5 20 2
2: RF9 10 3
3: SD8 40 2
4: JA7 35 4
5: MJ2 1 2
---
75872: FR10 26 1
75873: IC5 40 1
75874: IU2 20 1
75875: IG5 13 1
75876: DW7 21 1
I am using the following R code to produce a confusion matrix comparing the true labels of some data to the output of a neural network.
t <- table(as.factor(test.labels), as.factor(nnetpredict))
However, sometimes the neural network doesn't predict any of a certain class, so the table isn't square (as, for example, there are 5 levels in the test.labels factor, but only 3 levels in the nnetpredict factor). I want to make the table square by adding in any factor levels necessary, and setting their counts to zero.
How should I go about doing this?
Example:
> table(as.factor(a), as.factor(b))
1 2 3 4 5 6 7 8 9 10
1 1 0 0 0 0 0 0 1 0 0
2 0 1 0 0 0 0 0 0 1 0
3 0 0 1 0 0 0 0 0 0 1
4 0 0 0 1 0 0 0 0 0 0
5 0 0 0 0 1 0 0 0 0 0
6 0 0 0 0 0 1 0 0 0 0
7 0 0 0 0 0 0 1 0 0 0
You can see in the table above that there are 7 rows, but 10 columns, because the a factor only has 7 levels, whereas the b factor has 10 levels. What I want to do is to pad the table with zeros so that the row labels and the column labels are the same, and the matrix is square. From the example above, this would produce:
1 2 3 4 5 6 7 8 9 10
1 1 0 0 0 0 0 0 1 0 0
2 0 1 0 0 0 0 0 0 1 0
3 0 0 1 0 0 0 0 0 0 1
4 0 0 0 1 0 0 0 0 0 0
5 0 0 0 0 1 0 0 0 0 0
6 0 0 0 0 0 1 0 0 0 0
7 0 0 0 0 0 0 1 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0
The reason I need to do this is two-fold:
For display to users/in reports
So that I can use a function to calculate the Kappa statistic, which requires a table formatted like this (square, same row and col labels)
EDIT - round II to address the additional details in the question. I deleted my first answer since it wasn't relevant anymore.
This has produced the desired output for the test cases I've given it, but I definitely advise testing thoroughly with your real data. The approach here is to find the full list of levels for both inputs into the table and set that full list as the levels before generating the table.
squareTable <- function(x,y) {
x <- factor(x)
y <- factor(y)
commonLevels <- sort(unique(c(levels(x), levels(y))))
x <- factor(x, levels = commonLevels)
y <- factor(y, levels = commonLevels)
table(x,y)
}
Two test cases:
> #Test case 1
> set.seed(1)
> x <- factor(sample(0:9, 100, TRUE))
> y <- factor(sample(3:7, 100, TRUE))
>
> table(x,y)
y
x 3 4 5 6 7
0 2 1 3 1 0
1 1 0 2 3 0
2 1 0 3 4 3
3 0 3 6 3 2
4 4 4 3 2 1
5 2 2 0 1 0
6 1 2 3 2 3
7 3 3 3 4 2
8 0 4 1 2 4
9 2 1 0 0 3
> squareTable(x,y)
y
x 0 1 2 3 4 5 6 7 8 9
0 0 0 0 2 1 3 1 0 0 0
1 0 0 0 1 0 2 3 0 0 0
2 0 0 0 1 0 3 4 3 0 0
3 0 0 0 0 3 6 3 2 0 0
4 0 0 0 4 4 3 2 1 0 0
5 0 0 0 2 2 0 1 0 0 0
6 0 0 0 1 2 3 2 3 0 0
7 0 0 0 3 3 3 4 2 0 0
8 0 0 0 0 4 1 2 4 0 0
9 0 0 0 2 1 0 0 3 0 0
> squareTable(y,x)
y
x 0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 2 1 1 0 4 2 1 3 0 2
4 1 0 0 3 4 2 2 3 4 1
5 3 2 3 6 3 0 3 3 1 0
6 1 3 4 3 2 1 2 4 2 0
7 0 0 3 2 1 0 3 2 4 3
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
>
> #Test case 2
> set.seed(1)
> xx <- factor(sample(0:2, 100, TRUE))
> yy <- factor(sample(3:5, 100, TRUE))
>
> table(xx,yy)
yy
xx 3 4 5
0 4 14 9
1 14 15 9
2 11 11 13
> squareTable(xx,yy)
y
x 0 1 2 3 4 5
0 0 0 0 4 14 9
1 0 0 0 14 15 9
2 0 0 0 11 11 13
3 0 0 0 0 0 0
4 0 0 0 0 0 0
5 0 0 0 0 0 0
> squareTable(yy,xx)
y
x 0 1 2 3 4 5
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 4 14 11 0 0 0
4 14 15 11 0 0 0
5 9 9 13 0 0 0
I am using the following R code to produce a confusion matrix comparing the true labels of some data to the output of a neural network.
t <- table(as.factor(test.labels), as.factor(nnetpredict))
However, sometimes the neural network doesn't predict any of a certain class, so the table isn't square (as, for example, there are 5 levels in the test.labels factor, but only 3 levels in the nnetpredict factor). I want to make the table square by adding in any factor levels necessary, and setting their counts to zero.
How should I go about doing this?
Example:
> table(as.factor(a), as.factor(b))
1 2 3 4 5 6 7 8 9 10
1 1 0 0 0 0 0 0 1 0 0
2 0 1 0 0 0 0 0 0 1 0
3 0 0 1 0 0 0 0 0 0 1
4 0 0 0 1 0 0 0 0 0 0
5 0 0 0 0 1 0 0 0 0 0
6 0 0 0 0 0 1 0 0 0 0
7 0 0 0 0 0 0 1 0 0 0
You can see in the table above that there are 7 rows, but 10 columns, because the a factor only has 7 levels, whereas the b factor has 10 levels. What I want to do is to pad the table with zeros so that the row labels and the column labels are the same, and the matrix is square. From the example above, this would produce:
1 2 3 4 5 6 7 8 9 10
1 1 0 0 0 0 0 0 1 0 0
2 0 1 0 0 0 0 0 0 1 0
3 0 0 1 0 0 0 0 0 0 1
4 0 0 0 1 0 0 0 0 0 0
5 0 0 0 0 1 0 0 0 0 0
6 0 0 0 0 0 1 0 0 0 0
7 0 0 0 0 0 0 1 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0
The reason I need to do this is two-fold:
For display to users/in reports
So that I can use a function to calculate the Kappa statistic, which requires a table formatted like this (square, same row and col labels)
EDIT - round II to address the additional details in the question. I deleted my first answer since it wasn't relevant anymore.
This has produced the desired output for the test cases I've given it, but I definitely advise testing thoroughly with your real data. The approach here is to find the full list of levels for both inputs into the table and set that full list as the levels before generating the table.
squareTable <- function(x,y) {
x <- factor(x)
y <- factor(y)
commonLevels <- sort(unique(c(levels(x), levels(y))))
x <- factor(x, levels = commonLevels)
y <- factor(y, levels = commonLevels)
table(x,y)
}
Two test cases:
> #Test case 1
> set.seed(1)
> x <- factor(sample(0:9, 100, TRUE))
> y <- factor(sample(3:7, 100, TRUE))
>
> table(x,y)
y
x 3 4 5 6 7
0 2 1 3 1 0
1 1 0 2 3 0
2 1 0 3 4 3
3 0 3 6 3 2
4 4 4 3 2 1
5 2 2 0 1 0
6 1 2 3 2 3
7 3 3 3 4 2
8 0 4 1 2 4
9 2 1 0 0 3
> squareTable(x,y)
y
x 0 1 2 3 4 5 6 7 8 9
0 0 0 0 2 1 3 1 0 0 0
1 0 0 0 1 0 2 3 0 0 0
2 0 0 0 1 0 3 4 3 0 0
3 0 0 0 0 3 6 3 2 0 0
4 0 0 0 4 4 3 2 1 0 0
5 0 0 0 2 2 0 1 0 0 0
6 0 0 0 1 2 3 2 3 0 0
7 0 0 0 3 3 3 4 2 0 0
8 0 0 0 0 4 1 2 4 0 0
9 0 0 0 2 1 0 0 3 0 0
> squareTable(y,x)
y
x 0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 2 1 1 0 4 2 1 3 0 2
4 1 0 0 3 4 2 2 3 4 1
5 3 2 3 6 3 0 3 3 1 0
6 1 3 4 3 2 1 2 4 2 0
7 0 0 3 2 1 0 3 2 4 3
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
>
> #Test case 2
> set.seed(1)
> xx <- factor(sample(0:2, 100, TRUE))
> yy <- factor(sample(3:5, 100, TRUE))
>
> table(xx,yy)
yy
xx 3 4 5
0 4 14 9
1 14 15 9
2 11 11 13
> squareTable(xx,yy)
y
x 0 1 2 3 4 5
0 0 0 0 4 14 9
1 0 0 0 14 15 9
2 0 0 0 11 11 13
3 0 0 0 0 0 0
4 0 0 0 0 0 0
5 0 0 0 0 0 0
> squareTable(yy,xx)
y
x 0 1 2 3 4 5
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 4 14 11 0 0 0
4 14 15 11 0 0 0
5 9 9 13 0 0 0