R: Squared contingency table

I want to make a contingency table of observations against their predictions from a neural network. Since I want positives to be on the diagonal, I would like the table to be square, even if some rows contain only 0's. That is, I would like to have
b
a a b c d e f g
a 1 0 1 0 2 1 0
b 0 0 0 0 0 0 0
c 0 0 0 0 0 0 0
d 2 3 1 2 2 3 2
e 1 2 1 1 0 1 3
f 0 0 0 0 0 0 0
g 4 2 1 0 3 1 0
Instead of:
> set.seed(1)
> b<-sample(letters[1:7],40,rep=TRUE)
> a<-sample(letters[1:4],40,rep=TRUE)
>
> table(a,b)
b
a a b c d e f g
a 1 0 1 0 2 1 0
d 2 3 1 2 2 3 2
e 1 2 1 1 0 1 3
g 4 2 1 0 3 1 0
How can I do this?

Convert a and b to factors whose levels are the union of the values in both:
tmp <- sort(union(a, b))
table(factor(a, levels = tmp), factor(b, levels = tmp))
# a b c d e f g
# a 0 1 1 2 2 1 4
# b 2 1 1 1 2 3 2
# c 4 0 1 2 0 1 1
# d 0 1 1 1 3 1 1
# e 0 0 0 0 0 0 0
# f 0 0 0 0 0 0 0
# g 0 0 0 0 0 0 0
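If this comes up repeatedly, the same idea could be wrapped in a small function. This is only a sketch: square_table is a hypothetical helper name, not something from base R or from the answer above, and it assumes both inputs are plain vectors of labels.

```r
# Hypothetical helper: build a square contingency table whose rows and
# columns share one common set of levels (the union of both inputs).
square_table <- function(observed, predicted) {
  lv <- sort(union(observed, predicted))
  table(observed  = factor(observed,  levels = lv),
        predicted = factor(predicted, levels = lv))
}

set.seed(1)
b <- sample(letters[1:7], 40, replace = TRUE)
a <- sample(letters[1:4], 40, replace = TRUE)

tab <- square_table(a, b)
dim(tab)        # same number of rows and columns
sum(diag(tab))  # agreeing pairs, since the diagonal now lines up
```

Because both margins use identical levels, diag(tab) always refers to matching labels, even when a label never occurs in one of the two vectors.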

Related

R Frequency table with condition

I have a dataframe with two columns, "CaseID" and "Event" and want to know how often Event with ID=X is followed by Event with ID=Y. But I am only interested in consecutive events with the same CaseID.
The command
df <- data.frame(CaseID = c(1,1,1,2,2,2,3,3,3),
Event = c("A","B","C","A","B","D","B","C","E"))
df
table(df[1:nrow(df) -1, 2], df[2:nrow(df), 2])
results in
CaseID Event
1 1 A
2 1 B
3 1 C
4 2 A
5 2 B
6 2 D
7 3 B
8 3 C
9 3 E
A B C D E
A 0 2 0 0 0
B 0 0 2 1 0
C 1 0 0 0 1
D 0 1 0 0 0
E 0 0 0 0 0
C -> A and D -> B have different CaseID's and should be 0 so what I am looking for is
B C D E
A 2 0 0 0
B 0 2 1 0
C 0 0 0 1
D 0 0 0 0
E 0 0 0 0
Is there any elegant way to add a condition to the table-command, based on two consecutive rows?
Ben
We can tabulate only those consecutive Events that share the same CaseID:
> x <- diff(df$CaseID) == 0
> table(df[1:nrow(df) -1, 2][x], df[2:nrow(df), 2][x])
A B C D E
A 0 2 0 0 0
B 0 0 2 1 0
C 0 0 0 0 1
D 0 0 0 0 0
E 0 0 0 0 0
In case CaseID might be non-numeric:
x <- df$CaseID[-1] == df$CaseID[-length(df$CaseID)]
table(df[1:nrow(df) -1, 2][x], df[2:nrow(df), 2][x])
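For reference, here is a self-contained sketch of the same filter. It assumes the rows are already ordered so that each CaseID forms one contiguous block, and it sets the factor levels explicitly so that all events show up in both margins whether or not data.frame() converted the strings to factors. The names ev, from, to and same are just illustrative.

```r
df <- data.frame(CaseID = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 Event  = c("A", "B", "C", "A", "B", "D", "B", "C", "E"))

# Fix the levels so every event appears as a row and as a column.
ev <- factor(df$Event, levels = sort(unique(df$Event)))

from <- head(ev, -1)   # event in row i
to   <- tail(ev, -1)   # event in row i + 1

# TRUE where row i and row i + 1 belong to the same case.
same <- head(df$CaseID, -1) == tail(df$CaseID, -1)

trans <- table(from = from[same], to = to[same])
trans
```

Comparing head() with tail() works for numeric and non-numeric CaseID alike, so it covers both variants shown above.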

Count values in column by group R

I want to transform the following dataframe into a dataframe that adds a column of index numbers and counts the values in the rows. Like this:
A B C D E value A B C D E
1 2 3 4 4 0 2 2 0 1 1
1 4 4 2 1 => 1 3 0 0 0 2
1 2 2 2 0 2 0 2 2 2 1
0 0 2 0 1 3 0 0 1 1 0
0 0 4 3 2 4 0 1 2 1 1
I am pretty much a beginner in R and can't figure out how to do this.
Thanks in advance :)
You can do:
df <- read.table(header=TRUE, text=
"A B C D E
1 2 3 4 4
1 4 4 2 1
1 2 2 2 0
0 0 2 0 1
0 0 4 3 2")
sapply(df+1, tabulate, nbins=5)
# > sapply(df+1, tabulate, nbins=5)
# A B C D E
# [1,] 2 2 0 1 1
# [2,] 3 0 0 0 2
# [3,] 0 2 2 2 1
# [4,] 0 0 1 1 0
# [5,] 0 1 2 1 1
You may also want to correct the row names so that they show the values 0 to 4:
result <- sapply(df+1, tabulate, nbins=5)
rownames(result) <- (1:nrow(result))-1
result
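If you would rather not rely on the df+1 shift, a variant is to state the set of values explicitly and count matches against it. This is a sketch under the assumption that the values of interest are 0 to 4; vals, counts and result are illustrative names.

```r
df <- read.table(header = TRUE, text = "
A B C D E
1 2 3 4 4
1 4 4 2 1
1 2 2 2 0
0 0 2 0 1
0 0 4 3 2")

vals <- 0:4  # assumed range of values to count

# For each column, map every entry to its position in vals and count
# how often each position occurs.
counts <- vapply(df,
                 function(col) tabulate(match(col, vals), nbins = length(vals)),
                 integer(length(vals)))

# Add the value itself as the first column, as in the desired output.
result <- data.frame(value = vals, counts)
result
```

This also works when the values are not consecutive integers, since match() only needs vals to list them.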

Significance Testing in R

I am trying to determine whether there is a significant difference between two interfaces. I have a text file that looks like this:
group conversion
A 0
A 0
A 1
A 0
A 0
A 1
A 1
A 0
A 0
A 1
A 1
A 1
A 1
A 1
A 1
A 0
A 0
A 0
A 0
A 0
A 1
A 0
A 1
A 0
A 1
A 1
A 0
A 1
A 0
A 1
A 1
A 0
A 0
A 0
A 0
A 0
A 1
A 1
A 0
A 0
A 1
A 1
A 0
A 1
A 1
A 0
A 0
A 0
A 1
A 1
A 0
A 0
A 0
A 0
A 1
A 1
A 0
A 1
A 1
A 1
A 1
A 1
A 1
A 1
A 0
A 0
A 0
A 1
A 1
A 0
A 1
A 1
A 0
A 0
A 1
A 0
A 0
A 0
A 1
A 0
A 1
A 1
A 1
A 0
A 0
A 0
A 0
A 0
A 0
A 0
A 1
A 1
A 1
A 1
A 1
A 1
A 0
A 0
A 1
A 1
B 0
B 0
B 1
B 0
B 0
B 0
B 1
B 0
B 0
B 0
B 0
B 1
B 0
B 1
B 0
B 1
B 0
B 1
B 0
B 0
B 1
B 1
B 1
B 1
B 1
B 1
B 1
B 1
B 1
B 0
B 0
B 1
B 0
B 0
B 1
B 0
B 0
B 0
B 0
B 0
B 1
B 1
B 0
B 0
B 0
B 0
B 1
B 1
B 0
B 0
B 1
B 0
B 1
B 0
B 0
B 0
B 1
B 1
B 1
B 1
B 0
B 1
B 0
B 0
B 1
B 1
B 0
B 0
B 0
B 0
B 0
B 0
B 0
B 1
B 0
B 0
B 1
B 0
B 0
B 0
B 0
B 0
B 0
B 0
B 0
B 1
B 1
B 1
B 0
B 0
B 0
B 0
B 1
B 0
B 1
B 1
B 1
B 1
B 1
B 1
Now I need to find out which method I should use while doing this. So far I've tried the Welch's Two Sample T-test method, which I think is correct. But is that the correct way of determining whether there is a significance or not? By the way, the significance level is 5%.
This is my code:
# Load in the values from "test.txt"
dat = read.delim("test.txt")
# Calculate the amount of unique values
length(unique(dat$group))
# Calculate the p-value
t.test(dat$conversion ~ dat$group)
The output on the p-value was: 0.2586, which is larger than 0.05, which should mean there is no significance, right? Or am I doing something wrong? I'm a beginner at R.
I think that you are looking for Fisher's exact test.
Using your data, I created a data frame named x:
head(x)
group conversion
1 A 0
2 A 0
3 A 1
4 A 0
5 A 0
6 A 1
then I made a frequency table:
y<-table(x)
# and previewed the count table:
y
conversion
group 0 1
A 50 50
B 58 42
Then you run Fisher's exact test:
fisher.test(y)
Fisher's Exact Test for Count Data
data: y
p-value = 0.3207
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.3989079 1.3135633
sample estimates:
odds ratio
0.7253254
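The output header even states that the test is for count data. Fisher's exact test evaluates, exactly rather than by approximation, whether two categorical variables are associated. Here the p-value of 0.3207 is above 0.05, so the difference between the two groups is not significant at the 5% level.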
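If you only have the counts and not the raw file, the test can be run on a hand-built 2x2 matrix as well. A minimal sketch, using the counts from the frequency table above (the names y and res are illustrative):

```r
# Counts taken from the frequency table: rows are groups,
# columns are conversion outcomes. matrix() fills column by column.
y <- matrix(c(50, 58, 50, 42), nrow = 2,
            dimnames = list(group = c("A", "B"),
                            conversion = c("0", "1")))

res <- fisher.test(y)

res$p.value         # p-value of the two-sided test
res$p.value < 0.05  # FALSE: no significant difference at the 5% level
```

The returned object is a list, so the p-value can be compared against the significance level directly instead of reading it off the printed output.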

Split data frame into chunk and assign names to chunks from vectors

I have a data frame with 5*n columns, where n is the number of categories listed in a vector. I want to break the data frame into chunks of 5 columns (eg. category 1 is columns 1:5, category 2 is columns 6:10) and then assign the category names from the vector to the chunks.
eg.
*original data frame* *vector of category names*
X a b c d e a b c d e a b c d e 1 apples
1 1 0 0 0 1 0 1 0 1 0 0 0 1 1 0 2 oranges
2 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1 3 bananas
Will become
*apples* *oranges* *bananas*
X a b c d e X a b c d e X a b c d e
1 1 0 0 0 1 1 0 1 0 1 0 1 0 0 1 1 0
2 0 1 0 1 0 2 0 0 1 0 1 2 1 0 0 0 1
I can find a whole lot of information about splitting data.frames by rows, which is much more common to do, but I can't find anything about splitting a data frame into n chunks by columns. Thanks!
You could split your original data frame by column indices in a similar way:
df <- read.table(header=T, check.names = F, text="
X a b c d e a b c d e a b c d e
1 1 0 0 0 1 0 1 0 1 0 0 0 1 1 0
2 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1")
n <- 5 # fixed chunksize (a-e)
lst <- lapply(split(2:ncol(df), rep(seq(ncol(df[-1])/n), each=n)), function(x) df[, x])
names(lst) <- c("apples", "oranges", "bananas")
# lst
# $apples
# a b c d e
# 1 1 0 0 0 1
# 2 0 1 0 1 0
#
# $oranges
# a b c d e
# 1 0 1 0 1 0
# 2 0 0 1 0 1
#
# $bananas
# a b c d e
# 1 0 0 1 1 0
# 2 1 0 0 0 1
I don't know if this is elegant, but it is the first thing that came to mind.
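Note that the chunks above do not carry the X column. If you want X repeated in every chunk, as in the question, something like the following sketch could work. It assumes X is the first column and that the category names are in a character vector, here called categories (an illustrative name).

```r
df <- read.table(header = TRUE, check.names = FALSE, text = "
X a b c d e a b c d e a b c d e
1 1 0 0 0 1 0 1 0 1 0 0 0 1 1 0
2 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1")

categories <- c("apples", "oranges", "bananas")
n <- 5  # columns per category

# One vector of column positions per category (X is column 1).
idx <- split(seq_len(ncol(df))[-1],
             rep(seq_along(categories), each = n))

# Prepend X to every chunk and label the chunks with the category names.
chunks <- setNames(lapply(idx, function(j) df[, c(1, j)]), categories)
chunks$oranges
```

Splitting positions rather than names avoids trouble with the repeated column names a to e.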

How to create design matrix in r

I have two factors. Factor A has 2 levels and factor B has 3 levels.
How can I create the following design matrix?
factorA1 factorA2 factorB1 factorB2 factorB3
[1,] 1 0 1 0 0
[2,] 1 0 0 1 0
[3,] 1 0 0 0 1
[4,] 0 1 1 0 0
[5,] 0 1 0 1 0
[6,] 0 1 0 0 1
You have a couple of options:
Use base and piece it together yourself:
(iris.dummy<-with(iris,model.matrix(~Species-1)))
(IRIS<-data.frame(iris,iris.dummy))
Or use the ade4 package as follows:
dummy <- function(df) {
require(ade4)
ISFACT <- sapply(df, is.factor)
FACTS <- acm.disjonctif(df[, ISFACT, drop = FALSE])
NONFACTS <- df[, !ISFACT,drop = FALSE]
data.frame(NONFACTS, FACTS)
}
dat <-data.frame(eggs = c("foo", "foo", "bar", "bar"),
ham = c("red","blue","green","red"), x=rnorm(4))
dummy(dat)
## x eggs.bar eggs.foo ham.blue ham.green ham.red
## 1 0.3365302 0 1 0 0 1
## 2 1.1341354 0 1 1 0 0
## 3 2.0489741 1 0 0 1 0
## 4 1.1019108 1 0 0 0 1
Assuming your data is in a data.frame called dat, let's say the two factors are given as in this example:
> dat <- data.frame(f1=sample(LETTERS[1:3],20,T),f2=sample(LETTERS[4:5],20,T),id=1:20)
> dat
f1 f2 id
1 C D 1
2 B E 2
3 B E 3
4 A D 4
5 C E 5
6 C E 6
7 C D 7
8 B E 8
9 C D 9
10 A D 10
11 B E 11
12 C E 12
13 B D 13
14 B E 14
15 A D 15
16 C E 16
17 C D 17
18 C D 18
19 B D 19
20 C D 20
> dat$f1
[1] C B B A C C C B C A B C B B A C C C B C
Levels: A B C
> dat$f2
[1] D E E D E E D E D D E E D E D E D D D D
Levels: D E
You can use outer to get a matrix as you showed, for each factor:
> F1 <- with(dat, outer(f1, levels(f1), `==`)*1)
> colnames(F1) <- paste("f1",sep="=",levels(dat$f1))
> F1
f1=A f1=B f1=C
[1,] 0 0 1
[2,] 0 1 0
[3,] 0 1 0
[4,] 1 0 0
[5,] 0 0 1
[6,] 0 0 1
[7,] 0 0 1
[8,] 0 1 0
[9,] 0 0 1
[10,] 1 0 0
[11,] 0 1 0
[12,] 0 0 1
[13,] 0 1 0
[14,] 0 1 0
[15,] 1 0 0
[16,] 0 0 1
[17,] 0 0 1
[18,] 0 0 1
[19,] 0 1 0
[20,] 0 0 1
Now do the same for the second factor:
> F2 <- with(dat, outer(f2, levels(f2), `==`)*1)
> colnames(F2) <- paste("f2",sep="=",levels(dat$f2))
And cbind them to get the final result:
> cbind(F1,F2)
model.matrix is the process that lm and others use in the background to convert for you.
dat <- data.frame(f1=sample(LETTERS[1:3],20,T),f2=sample(LETTERS[4:5],20,T),id=1:20)
dat
model.matrix(~dat$f1 + dat$f2)
It creates the INTERCEPT variable as a column of 1's, but you can easily remove that if you need.
model.matrix(~dat$f1 + dat$f2)[,-1]
Edit: Now I see that this is essentially the same as one of the other answers, but more concise.
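One caveat: dropping the intercept column this way still leaves out the first level of each factor. A possible way to keep every level, sketched here for the 2 x 3 layout from the question, is to build one intercept-free matrix per factor and bind them together (d and design are illustrative names):

```r
# Six runs: factor A with 2 levels crossed with factor B with 3 levels.
# B is listed first so that it varies fastest, as in the question.
d <- expand.grid(B = factor(1:3), A = factor(1:2))

# "- 1" removes the intercept, so each single-factor matrix keeps
# an indicator column for every level.
design <- cbind(model.matrix(~ A - 1, data = d),
                model.matrix(~ B - 1, data = d))
design
```

Each row then has exactly two 1's, one for its level of A and one for its level of B.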
Expanding and generalizing @Ferdinand.kraft's answer:
dat <- data.frame(
f1 = sample(LETTERS[1:3], 20, TRUE),
f2 = sample(LETTERS[4:5], 20, TRUE),
row.names = paste0("id_", 1:20))
covariates <- c("f1", "f2") # in case you have other columns that you don't want to include in the design matrix
design <- do.call(cbind, lapply(covariates, function(covariate){
apply(outer(dat[[covariate]], unique(dat[[covariate]]), FUN = "=="), 2, as.integer)
}))
rownames(design) <- rownames(dat)
colnames(design) <- unlist(sapply(covariates, function(covariate) unique(dat[[covariate]])))
design <- design[, !duplicated(colnames(design))] # duplicated colnames happen sometimes
design
# C A B D E
# id_1 1 0 0 1 0
# id_2 0 1 0 1 0
# id_3 0 0 1 1 0
# id_4 1 0 0 1 0
# id_5 0 1 0 1 0
# id_6 0 1 0 0 1
# id_7 0 0 1 0 1
Model matrix only allows what it calls "dummy" coding for the first factor in a formula.
If the intercept is present, it plays that role. To get the desired effect of a redundant index matrix (where you have a 1 in every column for the corresponding factor level and 0 elsewhere), you can lie to model.matrix() and pretend there's an extra level. Then trim off the intercept column.
> a=rep(1:2,3)
> b=rep(1:3,2)
> df=data.frame(A=a,B=b)
> # Lie and pretend there's a level 0 in each factor.
> df$A=factor(a,as.character(0:2))
> df$B=factor(b,as.character(0:3))
> mm=model.matrix (~A+B,df)
> mm
(Intercept) A1 A2 B1 B2 B3
1 1 1 0 1 0 0
2 1 0 1 0 1 0
3 1 1 0 0 0 1
4 1 0 1 1 0 0
5 1 1 0 0 1 0
6 1 0 1 0 0 1
attr(,"assign")
[1] 0 1 1 2 2 2
attr(,"contrasts")
attr(,"contrasts")$A
[1] "contr.treatment"
attr(,"contrasts")$B
[1] "contr.treatment"
> # mm has an intercept column not requested, so kill it
> dm=as.matrix(mm[,-1])
> dm
A1 A2 B1 B2 B3
1 1 0 1 0 0
2 0 1 0 1 0
3 1 0 0 0 1
4 0 1 1 0 0
5 1 0 0 1 0
6 0 1 0 0 1
> # You can also add interactions
> mm2=model.matrix (~A*B,df)
> dm2=as.matrix(mm2[,-1])
> dm2
A1 A2 B1 B2 B3 A1:B1 A2:B1 A1:B2 A2:B2 A1:B3 A2:B3
1 1 0 1 0 0 1 0 0 0 0 0
2 0 1 0 1 0 0 0 0 1 0 0
3 1 0 0 0 1 0 0 0 0 1 0
4 0 1 1 0 0 0 1 0 0 0 0
5 1 0 0 1 0 0 0 1 0 0 0
6 0 1 0 0 1 0 0 0 0 0 1
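As an alternative to inventing an extra level, the contrasts.arg argument of model.matrix() can be given full indicator contrasts for each factor. The following is a sketch of that approach for the main-effects case only; d, full, mm and dm are illustrative names.

```r
d <- data.frame(A = factor(rep(1:2, 3)), B = factor(rep(1:3, 2)))

# contrasts(f, contrasts = FALSE) returns an identity matrix, that is,
# one indicator column per level instead of treatment coding.
full <- lapply(d[c("A", "B")], contrasts, contrasts = FALSE)

mm <- model.matrix(~ A + B, data = d, contrasts.arg = full)
dm <- mm[, -1]  # drop the intercept column
dm
```

The factors keep their real levels here, so there is no fake level 0 to remember later.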
Things get complicated with model.matrix() again if we add a covariate x and interactions of x with factors.
a=rep(1:2,3)
b=rep(1:3,2)
x=1:6
df=data.frame(A=a,B=b,x=x)
# Lie and pretend there's a level 0 in each factor.
df$A=factor(a,as.character(0:2))
df$B=factor(b,as.character(0:3))
mm=model.matrix (~A + B + A:x + B:x,df)
print(mm)
(Intercept) A1 A2 B1 B2 B3 A0:x A1:x A2:x B1:x B2:x B3:x
1 1 1 0 1 0 0 0 1 0 1 0 0
2 1 0 1 0 1 0 0 0 2 0 2 0
3 1 1 0 0 0 1 0 3 0 0 0 3
4 1 0 1 1 0 0 0 0 4 4 0 0
5 1 1 0 0 1 0 0 5 0 0 5 0
6 1 0 1 0 0 1 0 0 6 0 0 6
So mm has an intercept, but now the A:x interaction terms include an unwanted level A0:x.
If we reintroduce x as a separate term, that unwanted level is cancelled:
mm2=model.matrix (~ x + A + B + A:x + B:x, df)
print(mm2)
(Intercept) x A1 A2 B1 B2 B3 x:A1 x:A2 x:B1 x:B2 x:B3
1 1 1 1 0 1 0 0 1 0 1 0 0
2 1 2 0 1 0 1 0 0 2 0 2 0
3 1 3 1 0 0 0 1 3 0 0 0 3
4 1 4 0 1 1 0 0 0 4 4 0 0
5 1 5 1 0 0 1 0 5 0 0 5 0
6 1 6 0 1 0 0 1 0 6 0 0 6
We can get rid of the unwanted intercept and the unwanted bare x term
dm2=as.matrix(mm2[,c(-1,-2)])
print(dm2)
A1 A2 B1 B2 B3 x:A1 x:A2 x:B1 x:B2 x:B3
1 1 0 1 0 0 1 0 1 0 0
2 0 1 0 1 0 0 2 0 2 0
3 1 0 0 0 1 3 0 0 0 3
4 0 1 1 0 0 0 4 4 0 0
5 1 0 0 1 0 5 0 0 5 0
6 0 1 0 0 1 0 6 0 0 6
