How to remove a row in a contingency table in R?

Let's say I have a contingency table (made using the table() function in R).
digit
ID 1 2 3 4 5 6 7 8 9
1672120 23 16 8 10 12 13 3 3 5
1672121 2 1 0 0 0 0 1 0 0
1672122 1 2 1 0 1 0 0 1 0
1672123 0 1 1 0 0 0 0 0 0
1672124 1 1 0 1 1 0 0 0 0
1672125 5 2 5 1 1 1 0 0 2
1672127 2 1 2 1 0 0 0 0 0
1672128 2 0 0 1 0 1 0 0 1
1672129 1 0 1 0 0 0 1 0 0
If I want to remove the rows where the number of counts is smaller than 5 from the contingency table, how should I do it?

Since you don't provide reproducible sample data, here is an example based on the mtcars dataset.
Let's create a count table of mtcars$gear vs. mtcars$carb:
tbl <- table(mtcars$gear, mtcars$carb)
#
# 1 2 3 4 6 8
# 3 3 4 3 5 0 0
# 4 4 4 0 4 0 0
# 5 0 2 0 1 1 1
We then keep only those rows where at least one count is larger than 2:
tbl[apply(tbl > 2, 1, any), ]
#
# 1 2 3 4 6 8
# 3 3 4 3 5 0 0
# 4 4 4 0 4 0 0
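If you instead want the original criterion of dropping rows whose total count is below 5, a minimal sketch with rowSums() (assuming "number of counts" means the row total) would be:
tbl[rowSums(tbl) >= 5, ]  # keep only rows whose counts sum to at least 5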

Related

How to create an empty column

I have to consider three columns of a dataset.
One of them has values from 1 to 10, while the others have values from 2 to 10. I wanted to sum the frequencies of each value across the three columns, but an error appears; I think it is because two of the columns have no values of 1.
How can I solve it?
This is what I have:
take.care
face_prod 2 3 4 5 6 7 8 9 10
anti-age 0 1 0 5 3 8 4 1 3
Hydrating 2 3 1 8 9 14 9 3 9
normal skin 0 0 0 0 4 0 1 0 1
Other 0 1 0 1 1 0 0 0 0
purifying 0 0 1 1 4 7 8 4 5
sensitive skin 0 0 0 0 1 2 0 0 1
look.fundam
face_prod 2 3 4 5 6 7 8 9 10
anti-age 0 0 0 2 2 4 3 5 9
Hydrating 1 0 1 4 12 7 10 5 18
normal skin 0 0 0 0 1 2 1 1 1
Other 0 1 0 0 2 0 0 0 0
purifying 0 1 0 0 3 5 9 3 9
sensitive skin 0 1 0 0 0 0 1 1 1
good.app
face_prod 1 2 3 4 5 6 7 8 9 10
anti-age 0 1 1 3 5 2 3 6 1 3
Hydrating 4 1 5 5 8 9 10 7 4 5
normal skin 0 0 0 0 2 2 1 0 0 1
Other 2 0 1 0 0 0 0 0 0 0
purifying 2 0 1 2 3 4 7 5 5 1
sensitive skin 1 0 0 0 2 0 0 0 0 1
This is not a dataset but the result of the table() function.
If some levels are missing, an option is to standardize the columns with factor(), specifying the levels as 1 to 10:
nm1 <- c('take.care', 'look.fundam', 'good.app')
df1[nm1] <- lapply(df1[nm1], factor, levels = 1:10)
and now use table() as before.
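With the levels standardized, the three count tables share the same columns and can be added; a minimal sketch (df1 and its face_prod column are assumed to match the question's data):
tabs <- lapply(df1[nm1], function(x) table(df1$face_prod, x))
Reduce(`+`, tabs)  # frequencies of each value 1-10 summed over the three columns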

How can I calculate the percentage score from test results using tidyverse?

Rather than calculating each individual's score, I want to calculate the percentage of individuals who answered each question correctly. Below is the tibble containing the data; the columns are the candidates, a-r, and the rows are the questions. The data points are the answers given, and the column on the right, named 'correct', shows the correct answer.
A tibble: 20 x 19
question a b c d e g h i j k l m n o p q r correct
<chr> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
1 001 3 3 3 0 4 0 1 4 4 0 2 3 2 0 3 0 3 1
2 002 2 4 2 3 4 NA 4 2 2 2 4 2 4 3 2 2 3 2
3 003 2 2 2 3 4 2 2 4 4 1 4 3 3 2 4 1 3 2
4 005 2 3 1 3 4 NA 2 4 4 2 4 1 4 2 4 2 2 2
5 006 3 1 2 3 3 NA 2 3 4 2 3 3 3 3 3 NA 3 3
6 008 3 3 3 3 3 1 1 3 3 1 3 3 3 3 3 1 3 3
7 010 4 5 4 3 4 4 4 4 4 3 4 4 5 4 4 3 4 4
8 011 3 3 5 3 3 3 3 3 5 4 5 4 4 3 3 2 5 5
9 013 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0
10 014 0 0 0 2 0 1 0 0 0 0 2 0 2 0 0 0 0 0
11 016 3 3 0 0 4 1 1 4 4 2 3 3 3 3 1 0 3 0
12 017 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0
13 019 0 1 0 2 1 1 0 1 0 1 2 2 2 1 0 1 1 0
14 020 0 0 0 0 0 0 0 0 0 0 1 3 0 0 0 0 0 0
15 039 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0
16 041 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0
17 045 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
18 047 0 0 0 0 0 NA 0 0 0 0 1 0 0 0 0 0 0 0
19 049 3 3 3 3 4 NA 2 4 x 2 4 3 5 3 1 1 3 3
20 050 0 3 3 0 1 NA 0 3 3 0 x 0 0 0 0 0 3 1
I would like to generate a column 'percentage' that gives the proportion of correct answers for each question. I suspect I have to do loops or row-wise operations, but I'm so far out of my depth with that, I just can't figure out how to compare factors. I've tried mutate(), if_else(), group_by() and much more but have not managed to get close to an answer.
Any help would be greatly appreciated.
If your data.frame is called data, you may try
library(dplyr)
data %>%
  rowwise() %>%
  mutate(percentage = sum(c_across(a:r) == correct) / length(c_across(a:r)))
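Note that NA answers (and factor columns with differing level sets) can trip up the comparison above; a hedged variant that compares as character and skips NA answers, assuming that is the intended scoring, is:
data %>%
  rowwise() %>%
  mutate(percentage = mean(as.character(c_across(a:r)) == as.character(correct),
                           na.rm = TRUE)) %>%
  ungroup()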
You can try this solution using a loop:
# Code
# First select the range of individuals a to r
index <- 2:18
# Create empty variables to store the results
df$Count <- NA
df$Prop <- NA
# Loop over the questions (rows)
for (i in 1:dim(df)[1]) {
  x <- df[i, index]
  count <- length(which(x == df$correct[i]))
  percentage <- count / dim(x)[2]
  # Assign
  df$Count[i] <- count
  df$Prop[i] <- percentage
}
Output:
question a b c d e g h i j k l m n o p q r correct Count Prop
1 1 3 3 3 0 4 0 1 4 4 0 2 3 2 0 3 0 3 1 1 0.05882353
2 2 2 4 2 3 4 NA 4 2 2 2 4 2 4 3 2 2 3 2 8 0.47058824
3 3 2 2 2 3 4 2 2 4 4 1 4 3 3 2 4 1 3 2 6 0.35294118
4 5 2 3 1 3 4 NA 2 4 4 2 4 1 4 2 4 2 2 2 6 0.35294118
5 6 3 1 2 3 3 NA 2 3 4 2 3 3 3 3 3 NA 3 3 10 0.58823529
6 8 3 3 3 3 3 1 1 3 3 1 3 3 3 3 3 1 3 3 13 0.76470588
7 10 4 5 4 3 4 4 4 4 4 3 4 4 5 4 4 3 4 4 12 0.70588235
8 11 3 3 5 3 3 3 3 3 5 4 5 4 4 3 3 2 5 5 4 0.23529412
9 13 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 14 0.82352941
10 14 0 0 0 2 0 1 0 0 0 0 2 0 2 0 0 0 0 0 13 0.76470588
11 16 3 3 0 0 4 1 1 4 4 2 3 3 3 3 1 0 3 0 3 0.17647059
12 17 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 15 0.88235294
13 19 0 1 0 2 1 1 0 1 0 1 2 2 2 1 0 1 1 0 5 0.29411765
14 20 0 0 0 0 0 0 0 0 0 0 1 3 0 0 0 0 0 0 15 0.88235294
15 39 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 14 0.82352941
16 41 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 14 0.82352941
17 45 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 17 1.00000000
18 47 0 0 0 0 0 NA 0 0 0 0 1 0 0 0 0 0 0 0 15 0.88235294
19 49 3 3 3 3 4 NA 2 4 NA 2 4 3 5 3 1 1 3 3 7 0.41176471
20 50 0 3 3 0 1 NA 0 3 3 0 NA 0 0 0 0 0 3 1 1 0.05882353
You had some x entries in the answers, so I replaced them with NA to make the loop work.
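A hypothetical one-liner for that replacement, assuming df is a plain data.frame and the stray entries are literally the character "x":
df[df == "x"] <- NA  # turn the stray "x" entries into missing values
(droplevels() can then discard the now-unused "x" level from any factor columns.)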

summing all possible left to right diagonals along specified columns in a data frame by group?

Suppose I have something like this:
df <- data.frame(group   = c(1, 1, 2, 2, 2, 4, 4, 4, 4, 6, 6, 6),
                 binary1 = c(1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
                 binary2 = c(0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1),
                 binary3 = c(0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0),
                 binary4 = c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0))
I want to sum along all possible left-to-right diagonals within groups (i.e. groups 1, 2, 4 and 6) and return the maximum sum. This is in a data frame, so I would like to sum only along binary1-binary4. Anyone know if this is possible?
Here's my desired output:
group binary1 binary2 binary3 binary4 want
1 1 1 0 0 0 2
2 1 0 1 0 0 2
3 2 1 0 0 0 3
4 2 0 1 0 0 3
5 2 0 0 1 0 3
6 4 0 1 0 0 3
7 4 0 0 1 0 3
8 4 0 0 0 1 3
9 4 0 0 0 0 3
10 6 0 0 0 0 1
11 6 0 1 0 0 1
12 6 0 1 0 0 1
As an example, I have circled the "diagonals" I would like summed for group 4 in an image.
Here is another solution where we use row and column indices to get all possible diagonals. We use by to split by group and merge the result with the original data frame.
max_diag <- function(x) max(sapply(split(as.matrix(x), row(x) - col(x)), sum))
merge(df, stack(by(df[-1], df$group, max_diag)), by.x = "group", by.y = "ind")
# group binary1 binary2 binary3 binary4 values
#1 1 1 0 0 0 2
#2 1 0 1 0 0 2
#3 2 1 0 0 0 3
#4 2 0 1 0 0 3
#5 2 0 0 1 0 3
#6 4 0 1 0 0 3
#7 4 0 0 1 0 3
#8 4 0 0 0 1 3
#9 4 0 0 0 0 3
#10 6 0 0 0 0 1
#11 6 0 1 0 0 1
#12 6 0 1 0 0 1
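To see why row(x) - col(x) labels the diagonals, here is a small hypothetical illustration:
m <- matrix(1:6, nrow = 2)
row(m) - col(m)
#      [,1] [,2] [,3]
# [1,]    0   -1   -2
# [2,]    1    0   -1
Cells sharing a value lie on the same top-left to bottom-right diagonal, so split() groups them and sum() totals each diagonal.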
You can split the data.frame by group and sum the diagonal using diag(). Once you have this diagonal sum per group, you can put it back into the data.frame by indexing on the group.
Should group 4 be zero? Or am I missing something:
DIAG = by(df[,-1],df$group,function(i)sum(diag(as.matrix(i))))
df$want = DIAG[as.character(df$group)]
If I understand your definition correctly, we define a function that calculates the sums of the diagonals starting from each column (the main diagonal and those above it):
main_diag <- function(m) {
  sapply(1:(ncol(m) - 1), function(i) sum(diag(m[, i:ncol(m)])))
}
Thanks to @IceCreamToucan for correcting this. Then we take the maximum over all these diagonals and those of the transpose:
DIAG <- by(df[, -1], df$group, function(i) {
  i <- as.matrix(i)
  max(main_diag(i), main_diag(t(i)))
})
df$want = DIAG[as.character(df$group)]
group binary1 binary2 binary3 binary4 want
1 1 1 0 0 0 2
2 1 0 1 0 0 2
3 2 1 0 0 0 3
4 2 0 1 0 0 3
5 2 0 0 1 0 3
6 4 0 1 0 0 3
7 4 0 0 1 0 3
8 4 0 0 0 1 3
9 4 0 0 0 0 3
10 6 0 0 0 0 1
11 6 0 1 0 0 1
12 6 0 1 0 0 1
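As a hypothetical check, applying main_diag() from above to group 4 alone reproduces the expected value of 3:
m4 <- as.matrix(df[df$group == 4, -1])
main_diag(m4)                         # 0 3 0: diagonals starting in columns 1, 2, 3
max(main_diag(m4), main_diag(t(m4)))  # 3, matching the 'want' column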

How to convert two factors to adjacency matrix in R?

I have a data frame with two columns (key and value) where each column is a factor:
df = data.frame(gl(3,4,labels=c('a','b','c')), gl(6,2))
colnames(df) = c("key", "value")
key value
1 a 1
2 a 1
3 a 2
4 a 2
5 b 3
6 b 3
7 b 4
8 b 4
9 c 5
10 c 5
11 c 6
12 c 6
I want to convert it to an adjacency matrix (in this case of size 3x6) like:
1 2 3 4 5 6
a 1 1 0 0 0 0
b 0 0 1 1 0 0
c 0 0 0 0 1 1
So that I can run clustering on it (group keys that have similar values together) with either kmeans or hclust.
The closest I was able to get was using model.matrix(~ value, df), which results in:
(Intercept) value2 value3 value4 value5 value6
1 1 0 0 0 0 0
2 1 0 0 0 0 0
3 1 1 0 0 0 0
4 1 1 0 0 0 0
5 1 0 1 0 0 0
6 1 0 1 0 0 0
7 1 0 0 1 0 0
8 1 0 0 1 0 0
9 1 0 0 0 1 0
10 1 0 0 0 1 0
11 1 0 0 0 0 1
12 1 0 0 0 0 1
but the results aren't grouped by key yet.
On the other hand, I can collapse this dataset into groups using:
aggregate(df$value, by=list(df$key), unique)
Group.1 x.1 x.2
1 a 1 2
2 b 3 4
3 c 5 6
But I don't know what to do next...
Can someone help to solve this?
An easy way to do it in base R:
res <- table(df)
res[res > 0] <- 1
res
#     value
# key  1 2 3 4 5 6
#   a  1 1 0 0 0 0
#   b  0 0 1 1 0 0
#   c  0 0 0 0 1 1
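With the 0/1 matrix in hand, a minimal clustering sketch might look like this (the distance method and the number of centers are assumptions, not part of the question):
m <- unclass(res)                         # plain numeric matrix, keys as rows
hc <- hclust(dist(m, method = "binary"))  # hierarchical clustering of the keys
km <- kmeans(m, centers = 2)              # k-means with an arbitrary k of 2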

create a new data frame from existing ones

Suppose I have the following data frames
treatment1 <- data.frame(id = c(1, 2, 7))
treatment2 <- data.frame(id = c(3, 7, 10))
control <- data.frame(id = c(4, 5, 8, 9))
I want to create a new data frame that is the union of those three, with an indicator column for each that takes the value 1 where the id belongs to that group.
experiment <- data.frame(id = c(1:10), treatment1 = 0, treatment2 = 0, control = 0)
where experiment$treatment1[1] = 1, and so on.
What is the best way of doing this in R?
Thanks!
Updated as per @flodel:
kk <- rbind(treatment1, treatment2, control)
var1 <- c("treatment1", "treatment2", "control")
kk$df <- rep(var1, c(dim(treatment1)[1], dim(treatment2)[1], dim(control)[1]))
kk
id df
1 1 treatment1
2 2 treatment1
3 7 treatment1
4 3 treatment2
5 7 treatment2
6 10 treatment2
7 4 control
8 5 control
9 8 control
10 9 control
If you want it in the form of 1s and 0s, you can use table:
ll <- table(kk)
ll
df
id control treatment1 treatment2
1 0 1 0
2 0 1 0
3 0 0 1
4 1 0 0
5 1 0 0
7 0 1 1
8 1 0 0
9 1 0 0
10 0 0 1
If you want it as a data.frame, then you can use reshape:
kk2 <- reshape(data.frame(ll), timevar = "df", idvar = "id", direction = "wide")
names(kk2)[-1] <- sort(var1)
kk2
id control treatment1 treatment2
1 1 0 1 0
2 2 0 1 0
3 3 0 0 1
4 4 1 0 0
5 5 1 0 0
6 7 0 1 1
7 8 1 0 0
8 9 1 0 0
9 10 0 0 1
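As a side note, a hedged shortcut that skips reshape() is as.data.frame.matrix(), which turns the two-way table directly into a wide data frame:
kk3 <- as.data.frame.matrix(ll)  # one 0/1 column per group, row names are the ids
kk3$id <- rownames(kk3)          # recover id as an explicit column if needed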
df.bind <- function(...) {
  df.names <- all.names(substitute(list(...)))[-1L]
  ids.list <- setNames(lapply(list(...), `[[`, "id"), df.names)
  num.ids <- max(unlist(ids.list))
  tabs <- lapply(ids.list, tabulate, num.ids)
  data.frame(id = seq(num.ids), tabs)
}
df.bind(treatment1, treatment2, control)
# id treatment1 treatment2 control
# 1 1 1 0 0
# 2 2 1 0 0
# 3 3 0 1 0
# 4 4 0 0 1
# 5 5 0 0 1
# 6 6 0 0 0
# 7 7 1 1 0
# 8 8 0 0 1
# 9 9 0 0 1
# 10 10 0 1 0
(Notice how it does include a row for id == 6.)
Taking
treatment1<-data.frame(id=c(1,2,7))
treatment2<-data.frame(id=c(3,7,10))
control<-data.frame(id=c(4,5,8,9))
You can use this:
# names of the data frames to combine
x <- c("treatment1", "treatment2", "control")
# add to each data frame an indicator column named after it, set to 1
f <- function(s) within(get(s), assign(s, 1))
# outer-merge the three and replace the missing indicators with 0
r <- Reduce(function(x, y) merge(x, y, all = TRUE), lapply(x, f))
r[is.na(r)] <- 0
Result:
> r
id treatment1 treatment2 control
1 1 1 0 0
2 2 1 0 0
3 3 0 1 0
4 4 0 0 1
5 5 0 0 1
6 7 1 1 0
7 8 0 0 1
8 9 0 0 1
9 10 0 1 0
This illustrates what I was imagining to be the rbind strategy:
alldf <- rbind(treatment1, treatment2, control)
alldf$grps <- model.matrix(~ factor(c(rep(1, nrow(treatment1)),
                                      rep(2, nrow(treatment2)),
                                      rep(3, nrow(control)))) - 1)
dimnames(alldf[[2]])[2] <- list(c("trt1", "trt2", "ctrl"))
alldf
#-------------------
id grps.trt1 grps.trt2 grps.ctrl
1 1 1 0 0
2 2 1 0 0
3 7 1 0 0
4 3 0 1 0
5 7 0 1 0
6 10 0 1 0
7 4 0 0 1
8 5 0 0 1
9 8 0 0 1
10 9 0 0 1

Resources