Create n-way frequency tables using R

I need some help to create an n-way frequency table.
I am using the code below:
tab <- table(VAR1,VAR2,VAR3)
finaltab <- ftable(tab,row.vars=c(2,3))
print(finaltab)
VAR1, VAR2 and VAR3 are all factor variables. Doing this produces an n-way table.
But since VAR2 and VAR3 have several categories, I get a lot of rows containing only "0"s, and I wish to remove those rows so that, within each category of VAR2, only the categories of VAR3 that actually have non-zero frequencies are kept.
Does anyone know how to do this, either by subsetting the table I created first, or by using another function that doesn't return all levels of VAR3 in each VAR2 category but only those that actually have frequencies?

Contingency tables have the same number of rows in every category. If you
remove rows from one category you no longer have a table but a matrix.
t <- structure(c(0L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 1L),
               .Dim = c(3L, 2L, 3L),
               .Dimnames = structure(list(c("A", "B", "C"), c("A", "B"), c("A", "B", "C")),
                                     .Names = c("", "", "")),
               class = "table")
> (ft <- ftable(t, row.vars = c(2, 3)))
      A B C
A A   0 0 1
  B   1 1 1
  C   0 1 0
B A   1 1 0
  B   0 0 0
  C   1 1 1
> ft[apply(ft, 1, any), ]
     [,1] [,2] [,3]
[1,]    0    0    1
[2,]    1    1    1
[3,]    0    1    0
[4,]    1    1    0
[5,]    1    1    1
The unfortunate effect of subsetting the table is that the names are lost. This can be mitigated to some extent by coercing the table to a matrix before taking the subset, but the printed output still isn't as pretty as that of a contingency table.
> as.matrix(ft)[apply(ft, 1, any), ]
_   A B C
A_A 0 0 1
A_B 1 1 1
A_C 0 1 0
B_A 1 1 0
B_C 1 1 1
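An alternative that avoids losing the labels altogether is to flatten the table with as.data.frame() and drop the zero rows. A minimal sketch with made-up toy factors (the variable contents are hypothetical, not the asker's data):

```r
# Toy stand-ins for the asker's factor variables (hypothetical values)
VAR1 <- factor(c("x", "x", "y"))
VAR2 <- factor(c("a", "b", "a"))
VAR3 <- factor(c("p", "p", "q"))

tab <- table(VAR1, VAR2, VAR3)

# as.data.frame() yields one row per cell with a Freq column,
# so the all-zero combinations are easy to filter out
freqs <- subset(as.data.frame(tab), Freq > 0)
freqs
```

The result keeps all three variable names as columns, at the cost of the ftable-style layout.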

Related

Replace multiple columns by head string into one column

I want to replace multiple columns of a data frame by one column per group, while also changing the numbers. Example:
A1 A2 A3 A4 B1 B2 B3
1 1 1 0 1 1 0 0
2 1 0 1 1 0 1 1
3 1 1 1 1 0 1 1
4 0 0 1 0 0 0 1
5 0 0 0 0 0 1 0
I want to collapse this data frame by its headers, meaning I only want one column "A" instead of four and one column "B" instead of three. The numbers should change with the following pattern: if an observation in group "A2" has the value "1", it should be changed to "2"; if an observation in group "A3" has the value "1", it should be changed to "3"; and so on. In the end, each cell should contain the highest such number for that row and group (if I have three "1"s in my row and group, the number that replaces them is the one of the highest column in the group).
If the number is 0, nothing changes. Here is the result I'm looking for:
A B
1 4 1
2 4 3
3 4 3
4 3 3
5 0 2
How can I replace all of these groups by a single column each (one column per group)?
So far I've tried a lot with the function unite(data = testdata, col = "A"), for example, but doing this manually would take too long. There has to be a better way, right?
Thanks in advance!
You can do:
dat <- read.table(header=TRUE, text=
"A1 A2 A3 A4 B1 B2 B3
1 1 1 0 1 1 0 0
2 1 0 1 1 0 1 1
3 1 1 1 1 0 1 1
4 0 0 1 0 0 0 1
5 0 0 0 0 0 1 0")
myfu <- function(x) if (any(x)) max(which(x)) else 0
new <- data.frame(
  A = apply(dat[, 1:4] == 1, 1, myfu),
  B = apply(dat[, 5:7] == 1, 1, myfu))
new
A more general solution:
new2 <- data.frame(
  A = apply(dat[, grepl("^A", names(dat))] == 1, 1, myfu),
  B = apply(dat[, grepl("^B", names(dat))] == 1, 1, myfu))
new2
You can try code like below:
dfout <- as.data.frame(
  lapply(
    split.default(df, gsub("\\d+$", "", names(df))),
    function(v) max.col(v, ties.method = "last") * +(rowSums(v) >= 1)
  )
)
such that
> dfout
A B
1 4 1
2 4 3
3 4 3
4 3 3
5 0 2
Data
df <- structure(list(A1 = c(1L, 1L, 1L, 0L, 0L), A2 = c(1L, 0L, 1L,
0L, 0L), A3 = c(0L, 1L, 1L, 1L, 0L), A4 = c(1L, 1L, 1L, 0L, 0L
), B1 = c(1L, 0L, 0L, 0L, 0L), B2 = c(0L, 1L, 1L, 0L, 1L), B3 = c(0L,
1L, 1L, 1L, 0L)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5"))
Assuming your data is in a data.frame called df1, this works in base R:
df1 <- t(df1)*as.numeric(regmatches(colnames(df1), regexpr("\\d+$", colnames(df1))))
df1 <- split(as.data.frame(df1),sub("\\d+$","",row.names(df1)))
df1 <- sapply(df1, apply, 2, max)
output:
> df1
A B
1 4 1
2 4 3
3 4 3
4 3 3
5 0 2
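The same idea can also be written as a compact base-R sketch with pmax(), assuming, as throughout this thread, that the trailing digit of each column name encodes the replacement value:

```r
dat <- read.table(header = TRUE, text =
"A1 A2 A3 A4 B1 B2 B3
1 1 1 0 1 1 0 0
2 1 0 1 1 0 1 1
3 1 1 1 1 0 1 1
4 0 0 1 0 0 0 1
5 0 0 0 0 0 1 0")

grp <- sub("\\d+$", "", names(dat))              # letter prefix: "A" or "B"
num <- as.integer(sub("^\\D+", "", names(dat)))  # trailing digit: 1, 2, ...

# Weight each column by its digit, then take the row-wise maximum per group
res <- sapply(split(seq_along(dat), grp),
              function(idx) do.call(pmax, Map(`*`, dat[idx], num[idx])))
res
```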

Outside Union in R

I need to perform what appears to be a union of two tables in R. However, the union needs to include columns that are not common to the two parent matrices/tables.
This scenario looks very similar to the Outer Union described here: https://cs.stackexchange.com/questions/6997/what-is-outer-union-and-why-is-it-partially-compatible
I have two Matrices:
Matrix 1
Name Var1 Var2
1 1 0
2 1 0
Matrix 2
Name Var1 Var3
3 0 1
4 0 1
That I need to combine into Matrix 3:
Name Var1 Var2 Var3
1 1 0 0
2 1 0 0
3 0 0 1
4 0 0 1
A base R solution using merge:
M <- as.matrix(merge(data.frame(M1), data.frame(M2), all = TRUE))
M[is.na(M)] <- 0
such that
> M
Name Var1 Var2 Var3
[1,] 1 1 0 0
[2,] 2 1 0 0
[3,] 3 0 0 1
[4,] 4 0 0 1
DATA
M1 <- structure(c(1L, 2L, 1L, 1L, 0L, 0L), .Dim = 2:3, .Dimnames = list(
NULL, c("Name", "Var1", "Var2")))
M2 <- structure(c(3L, 4L, 0L, 0L, 1L, 1L), .Dim = 2:3, .Dimnames = list(
NULL, c("Name", "Var1", "Var3")))
We can convert to data.frame and use bind_rows. By default, it fills the missing values with NA:
library(dplyr)
library(tidyr)
bind_rows(as.data.frame(m1), as.data.frame(m2)) %>%
  mutate_all(replace_na, 0) %>%
  as.matrix
# Name Var1 Var2 Var3
#[1,] 1 1 0 0
#[2,] 2 1 0 0
#[3,] 3 0 0 1
#[4,] 4 0 0 1
Or, as @markus mentioned, rbind.fill.matrix from plyr would be useful:
plyr::rbind.fill.matrix(m1, m2)
data
m1 <- structure(c(1L, 2L, 1L, 1L, 0L, 0L), .Dim = 2:3, .Dimnames = list(
NULL, c("Name", "Var1", "Var2")))
m2 <- structure(c(3L, 4L, 0L, 0L, 1L, 1L), .Dim = 2:3, .Dimnames = list(
NULL, c("Name", "Var1", "Var3")))
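For the curious, the filling that rbind.fill.matrix performs can be sketched by hand in base R: allocate a zero matrix over the union of the column names and copy each block into place. A sketch of the idea, using the matrices defined above:

```r
M1 <- structure(c(1L, 2L, 1L, 1L, 0L, 0L), .Dim = 2:3,
                .Dimnames = list(NULL, c("Name", "Var1", "Var2")))
M2 <- structure(c(3L, 4L, 0L, 0L, 1L, 1L), .Dim = 2:3,
                .Dimnames = list(NULL, c("Name", "Var1", "Var3")))

cols <- union(colnames(M1), colnames(M2))
M3 <- matrix(0L, nrow(M1) + nrow(M2), length(cols),
             dimnames = list(NULL, cols))
M3[seq_len(nrow(M1)), colnames(M1)] <- M1              # top block
M3[nrow(M1) + seq_len(nrow(M2)), colnames(M2)] <- M2   # bottom block
M3
```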

Refer to column name and row name within an apply statement in R

I have a dataframe in R which looks like the one below.
a b c d e f
0 1 1 0 0 0
1 1 1 1 0 1
0 0 0 1 0 1
1 0 0 1 0 1
1 1 1 0 0 0
The data frame is big, spanning over 100 columns and 5000 rows, and contains only binary values (0s and 1s). I want to construct an overlap matrix between every pair of columns in R, like the one given below. This overlap is a square matrix whose number of rows and columns both equal the number of columns of the first data frame.
a b c d e f
a 3 2 2 2 0 2
b 2 3 3 3 0 1
c 2 3 3 1 0 1
d 2 3 1 3 0 3
e 0 0 0 0 0 0
f 2 1 1 3 0 3
Each cell of the second dataframe is populated by the number of cases where both row and column have 1 in the first dataframe.
I'm thinking of constructing an empty matrix like this:
df <- matrix(ncol = ncol(data), nrow = ncol(data))
colnames(df) <- names(data)
rownames(df) <- names(data)
…and then iterating over each cell of this matrix with an apply call, reading the corresponding row name (say, x) and column name (say, y) and running a function like the one below.
summation <- function(x, y) return(sum(data$x * data$y))
The problem is that I can't find out the row name and column name from within an apply function. Any help will be appreciated.
Any more efficient approach than the one I'm describing is more than welcome.
You are looking for crossprod:
crossprod(as.matrix(df1))
# a b c d e f
#a 3 2 2 2 0 2
#b 2 3 3 1 0 1
#c 2 3 3 1 0 1
#d 2 1 1 3 0 3
#e 0 0 0 0 0 0
#f 2 1 1 3 0 3
data
df1 <- structure(list(a = c(0L, 1L, 0L, 1L, 1L), b = c(1L, 1L, 0L, 0L,
1L), c = c(1L, 1L, 0L, 0L, 1L), d = c(0L, 1L, 1L, 1L, 0L), e = c(0L,
0L, 0L, 0L, 0L), f = c(0L, 1L, 1L, 1L, 0L)), .Names = c("a",
"b", "c", "d", "e", "f"), class = "data.frame", row.names = c(NA,
-5L))
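If you do want to index by the names themselves, as the question describes, an outer()-based sketch also works; it is equivalent to crossprod, just slower, and avoids the data$x pitfall by using [[ for name-based lookup:

```r
df1 <- data.frame(a = c(0, 1, 0, 1, 1), b = c(1, 1, 0, 0, 1),
                  c = c(1, 1, 0, 0, 1), d = c(0, 1, 1, 1, 0),
                  e = c(0, 0, 0, 0, 0), f = c(0, 1, 1, 1, 0))

# Look each pair of columns up by name; outer() supplies every (x, y) pair
overlap <- outer(names(df1), names(df1),
                 Vectorize(function(x, y) sum(df1[[x]] * df1[[y]])))
dimnames(overlap) <- list(names(df1), names(df1))
overlap
```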

How to isolate non-0 answers from survey data

I have survey data in which the same people are asked the same question during 6 different periods. Sometimes they answer (in which case we get a score from 1 to 10), sometimes they don’t (in which case the answer is 0).
In the end, I got a data frame that looks like this (the only difference being that in this example the answers range from 1 to 2; that's just because it was easier for me to generate an adequate number of 0s that way):
period_1 <- sample(0:2, 100, replace=T)
period_2 <- sample(0:2, 100, replace=T)
period_3 <- sample(0:2, 100, replace=T)
period_4 <- sample(0:2, 100, replace=T)
period_5 <- sample(0:2, 100, replace=T)
period_6 <- sample(0:2, 100, replace=T)
df <- cbind(period_1, period_2, period_3, period_4, period_5, period_6)
head(df)
period_1 period_2 period_3 period_4 period_5 period_6
[1,] 0 2 1 1 0 1
[2,] 2 1 1 2 0 0
[3,] 1 0 2 0 1 1
[4,] 1 2 2 1 0 2
[5,] 1 1 2 2 0 2
[6,] 1 0 1 2 2 0
Now, I want to see the evolution of their answers over time. But with the current structure of the data frame, it is a bit awkward: I can't just compare period 1 to period 2, for instance, because they didn't all answer at period 1 (or 2).
Instead, what I would like is a data frame which shows their first answer in one vector, no matter which period that answer came from, then the second answer, and so on.
In other words, put the first non-0 answer in survey_1, the second non-0 answer in survey_2, etc.
This is probably not the best solution, but it's the simplest one, and it would work just fine for me.
This would allow me to turn this:
period_1 period_2 period_3 period_4 period_5 period_6
[1,] 0 2 1 1 0 1
[2,] 2 1 1 2 1 0
[3,] 1 0 2 0 1 1
Into this:
survey_1 survey_2 survey_3 survey_4 survey_5 survey_6
[1,] 2 1 1 1 0 0
[2,] 2 1 1 2 1 0
[3,] 1 2 1 1 0 0
But to be honest, I'm still a big newbie in R and programming in general. I don't even know where to begin to achieve this, and I've been stuck on it for some time now without making any progress toward a solution.
Can anyone offer me tips, or even sample code, that would allow me to get the desired result, please?
Thank you!
We can use apply and order each row by whether an element is 0 or not:
df[] <- t(apply(df, 1, function(x) x[order(x == 0)]))
Result:
period_1 period_2 period_3 period_4 period_5 period_6
[1,] 1 2 2 1 0 0
[2,] 2 2 1 0 0 0
[3,] 1 1 1 2 2 0
[4,] 2 2 1 2 1 0
[5,] 2 1 1 1 1 1
[6,] 2 2 1 1 0 0
Data:
df <- structure(c(0L, 2L, 1L, 2L, 2L, 0L, 1L, 0L, 1L, 2L, 1L, 2L, 0L,
2L, 1L, 1L, 1L, 2L, 2L, 0L, 2L, 2L, 1L, 1L, 2L, 0L, 2L, 1L, 1L,
1L, 1L, 1L, 0L, 0L, 1L, 0L), .Dim = c(6L, 6L), .Dimnames = list(
NULL, c("period_1", "period_2", "period_3", "period_4", "period_5",
"period_6")))
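The order(x == 0) trick is compact but perhaps opaque; an equivalent, more explicit sketch moves the non-zero answers to the front of each row by hand:

```r
# Keep the non-zero entries in their original order, then pad with zeros
shift_left <- function(x) c(x[x != 0], rep(0, sum(x == 0)))

m <- rbind(c(0, 2, 1, 1, 0, 1),
           c(2, 1, 1, 2, 1, 0),
           c(1, 0, 2, 0, 1, 1))
t(apply(m, 1, shift_left))
```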

How to count the number of combinations of boolean data in R

What is the best way to determine a factor or create a new category field based on a number of boolean fields? In this example, I need to count the number of unique combinations of medications.
> MultPsychMeds
ID OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE
1 A 1 1 0 0
2 B 1 0 1 0
3 C 1 0 1 0
4 D 1 0 1 0
5 E 1 0 0 1
6 F 1 0 0 1
7 G 1 0 0 1
8 H 1 0 0 1
9 I 0 1 1 0
10 J 0 1 1 0
Perhaps another way to state it is that I need to pivot or cross-tabulate the pairs. The final result needs to look something like:
Combination Count
OLANZAPINE/HALOPERIDOL 1
OLANZAPINE/QUETIAPINE 3
OLANZAPINE/RISPERIDONE 4
HALOPERIDOL/QUETIAPINE 2
This data frame can be replicated in R with:
MultPsychMeds <- structure(list(ID = structure(1:10, .Label = c("A", "B", "C",
"D", "E", "F", "G", "H", "I", "J"), class = "factor"), OLANZAPINE = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L), HALOPERIDOL = c(1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L), QUETIAPINE = c(0L, 1L, 1L, 1L,
0L, 0L, 0L, 0L, 1L, 1L), RISPERIDONE = c(0L, 0L, 0L, 0L, 1L,
1L, 1L, 1L, 0L, 0L)), .Names = c("ID", "OLANZAPINE", "HALOPERIDOL",
"QUETIAPINE", "RISPERIDONE"), class = "data.frame", row.names = c(NA,
-10L))
Here's one approach using the reshape and plyr packages:
library(reshape)
library(plyr)
#Melt into long format
dat.m <- melt(MultPsychMeds, id.vars = "ID")
#Group at the ID level and paste the drugs together with "/"
out <- ddply(dat.m, "ID", summarize, combos = paste(variable[value == 1], collapse = "/"))
#Calculate a table
with(out, count(combos))
x freq
1 HALOPERIDOL/QUETIAPINE 2
2 OLANZAPINE/HALOPERIDOL 1
3 OLANZAPINE/QUETIAPINE 3
4 OLANZAPINE/RISPERIDONE 4
Just for fun, a base R solution (that can be turned into a oneliner :-) ):
data.frame(table(apply(MultPsychMeds[, -1], 1, function(currow) {
  wc <- which(currow == 1)
  paste(colnames(MultPsychMeds)[wc + 1], collapse = "/")
})))
Another way could be:
subset(
  as.data.frame(
    with(MultPsychMeds, table(OLANZAPINE, HALOPERIDOL, QUETIAPINE, RISPERIDONE)),
    responseName = "count"
  ),
  count > 0
)
which gives
OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE count
4 1 1 0 0 1
6 1 0 1 0 3
7 0 1 1 0 2
10 1 0 0 1 4
It's not exactly the format you asked for, but it is fast and simple.
There is a shorthand in the plyr package:
require(plyr)
count(MultPsychMeds, c("OLANZAPINE", "HALOPERIDOL", "QUETIAPINE", "RISPERIDONE"))
# OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE freq
# 1 0 1 1 0 2
# 2 1 0 0 1 4
# 3 1 0 1 0 3
# 4 1 1 0 0 1
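The question also asks about creating a new category field on the original data; that falls out of the same paste() idea used above. A sketch (the column name combo is my own choice, not from the thread):

```r
MultPsychMeds <- structure(list(ID = structure(1:10, .Label = c("A", "B", "C",
  "D", "E", "F", "G", "H", "I", "J"), class = "factor"),
  OLANZAPINE  = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L),
  HALOPERIDOL = c(1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L),
  QUETIAPINE  = c(0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L),
  RISPERIDONE = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L)),
  class = "data.frame", row.names = c(NA, -10L))

meds <- MultPsychMeds[, -1]
# One label per patient: the "/"-separated names of the drugs they take
MultPsychMeds$combo <- factor(apply(meds == 1, 1, function(r)
  paste(colnames(meds)[r], collapse = "/")))
table(MultPsychMeds$combo)
```

Keeping the factor on the data frame means the counts and the per-patient categories stay in sync.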
