I want to create a contingency table that displays the frequency distribution of pairs of variables. Here is an example dataset:
mm <- matrix(0, 5, 6)
df <- data.frame(apply(mm, c(1,2), function(x) sample(c(0,1),1)))
colnames(df) <- c("Horror", "Thriller", "Comedy", "Romantic", "Sci.fi", "gender")
All variables are binary, with 1 indicating either the presence of a specific movie type or the male gender. In the end, I would like a table that counts the presence of each movie type for each gender. Something like this:
male female
Horror 1 1
Thriller 1 3
Comedy 2 2
Romantic 0 0
Sci.fi 2 0
I know I can create two tables of movie-type counts, one for males and one for females (see TarJae's answer in "Create count table under specific condition"), and cbind them afterwards, but I would like to do it in one chunk of code. How can I achieve this efficiently?
You could do
sapply(split(df, df$gender), function(x) colSums(x[names(x)!="gender"]))
#> 0 1
#> Horror 1 1
#> Thriller 1 3
#> Comedy 0 0
#> Romantic 0 0
#> Sci.fi 1 3
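If the columns should read male and female instead of 0 and 1, one possible variation is to split on a labelled factor. This is only a sketch: the data frame below is hard-coded so the result is reproducible, and the names movie_cols, sex and counts are introduced here for illustration.

```r
# Small fixed example (hard-coded instead of sampled, so results are stable)
df <- data.frame(
  Horror   = c(0, 0, 1, 0, 0),
  Thriller = c(1, 0, 0, 1, 1),
  Comedy   = c(1, 0, 1, 0, 0),
  Romantic = c(0, 0, 1, 0, 0),
  Sci.fi   = c(0, 1, 0, 0, 0),
  gender   = c(0, 0, 1, 1, 1)
)

# Every column except gender is a movie type
movie_cols <- setdiff(names(df), "gender")

# 1 codes male and 0 codes female, as stated in the question
sex <- factor(df$gender, levels = c(1, 0), labels = c("male", "female"))

# Split the movie columns by sex and sum each column within each part
counts <- sapply(split(df[movie_cols], sex), colSums)
counts
```

The result is a matrix with one row per movie type and one column per gender label, in the order given to levels.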
Here is a solution using dplyr, tidyr and forcats:
library(dplyr)
library(tidyr)
library(forcats)
df %>% pivot_longer(cols = -gender, names_to = "type") %>%
# 1 codes male and 0 codes female, as stated in the question
mutate(gender = fct_recode(as.character(gender),Female = "0",Male = "1")) %>%
group_by(gender,type) %>%
summarise(sum = sum(value), .groups = "drop") %>%
pivot_wider(names_from = gender,values_from = sum)
Which gives
# A tibble: 5 x 3
type Female Male
<chr> <dbl> <dbl>
1 Comedy 0 1
2 Horror 1 3
3 Romantic 1 1
4 Sci.fi 1 1
5 Thriller 1 1
The mutate() step is optional; it only replaces the 0/1 codes of the variable gender with readable labels.
Please find below a reprex with an alternative solution using data.table and magrittr (for the pipes), also in one chunk.
Reprex
Your data (I set a seed for reproducibility)
set.seed(452)
mm <- matrix(0, 5, 6)
df <- data.frame(apply(mm, c(1,2), function(x) sample(c(0,1),1)))
colnames(df) <- c("Horror", "Thriller", "Comedy", "Romantic", "Sci.fi", "gender")
df
#> Horror Thriller Comedy Romantic Sci.fi gender
#> 1 0 1 1 0 0 0
#> 2 0 0 0 0 1 0
#> 3 1 0 1 1 0 1
#> 4 0 1 0 0 0 1
#> 5 0 1 0 0 0 1
Code in one chunk
library(data.table)
library(magrittr) # for the pipes!
df %>%
transpose(., keep.names = "rn") %>%
setDT(.) %>%
{.[, .(rn = rn,
male = rowSums(.[,.SD, .SDcols = .[, .SD[.N]] == 1]),
female = rowSums(.[,.SD, .SDcols = .[, .SD[.N]] == 0]))][rn !="gender"]}
Output
#> rn male female
#> 1: Horror 1 0
#> 2: Thriller 2 1
#> 3: Comedy 1 1
#> 4: Romantic 1 0
#> 5: Sci.fi 0 1
Created on 2021-11-25 by the reprex package (v2.0.1)
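For comparison, a different data.table route (not the one used above) is to reshape to long form with melt() and aggregate back to wide form with dcast(). The sketch below hard-codes the data printed in the reprex; the names dt, long, sex and counts are assumptions made for this example.

```r
library(data.table)

# Same values as the data frame printed in the reprex above
dt <- data.table(
  Horror   = c(0, 0, 1, 0, 0),
  Thriller = c(1, 0, 0, 1, 1),
  Comedy   = c(1, 0, 1, 0, 0),
  Romantic = c(0, 0, 1, 0, 0),
  Sci.fi   = c(0, 1, 0, 0, 0),
  gender   = c(0, 0, 1, 1, 1)
)

# One row per (person, movie type) pair
long <- melt(dt, id.vars = "gender", variable.name = "type")

# 1 codes male and 0 codes female, as stated in the question
long[, sex := ifelse(gender == 1, "male", "female")]

# Sum the 0/1 flags for every type within each sex
counts <- dcast(long, type ~ sex, value.var = "value", fun.aggregate = sum)
counts
```

The long intermediate table makes it easy to check which rows feed each cell of the final table.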
Related
I have a table that looks like so:
Gender  Time       Payband
male    part time  £15,001-20000
male    full time  £25001-30000
female  full time  £35001-40000
male    part time  £35001-40000
female  part time  £35001-40000
female  full time  £25001-30000
And I need R code that makes 2 different dataframes that are filtered by 'Time' and give a count of the different genders in each payband. For example this table below would be filtered where time == part time:
Payband       Male  Female  Total
£15001-20000  1     0       1
£20001-25000  0     0       0
£25001-30000  0     0       0
£35001-40000  1     1       2
There would also be a dataframe where time == full time.
I imagine it would be a case of using things such as group_by and summarize but I just can't wrap my head around how to do it. Any help is greatly appreciated and I hope I am explaining the problem properly.
You can do
pay <- c("£15,001-20000", "£20001-25000", "£25001-30000", "£35001-40000")
with(subset(df, Time == 'full time'), t(table(Gender, factor(Payband, pay)))) |>
as.data.frame() |>
tidyr::pivot_wider(names_from = 'Gender', values_from = 'Freq') |>
dplyr::rename(Payband = Var1)
#> # A tibble: 4 x 3
#> Payband female male
#> <fct> <int> <int>
#> 1 £15,001-20000 0 0
#> 2 £20001-25000 0 0
#> 3 £25001-30000 1 1
#> 4 £35001-40000 1 0
with(subset(df, Time == 'part time'), t(table(Gender, factor(Payband, pay)))) |>
as.data.frame() |>
tidyr::pivot_wider(names_from = 'Gender', values_from = 'Freq') |>
dplyr::rename(Payband = Var1)
#> # A tibble: 4 x 3
#> Payband female male
#> <fct> <int> <int>
#> 1 £15,001-20000 0 1
#> 2 £20001-25000 0 0
#> 3 £25001-30000 0 0
#> 4 £35001-40000 1 1
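The same counting step can also be wrapped in a small helper and applied to every value of Time at once, which additionally gives the Total column from the question. This is a base R sketch under stated assumptions: the data frame is typed in from the question's table, and count_by_payband and tables are hypothetical names.

```r
# Data typed in from the question's table
df <- data.frame(
  Gender  = c("male", "male", "female", "male", "female", "female"),
  Time    = c("part time", "full time", "full time",
              "part time", "part time", "full time"),
  Payband = c("£15,001-20000", "£25001-30000", "£35001-40000",
              "£35001-40000", "£35001-40000", "£25001-30000")
)

# All paybands, including ones that do not occur in the data
pay <- c("£15,001-20000", "£20001-25000", "£25001-30000", "£35001-40000")

# Hypothetical helper: count genders per payband for one subset of the data
count_by_payband <- function(d) {
  tab <- table(factor(d$Payband, levels = pay),
               factor(d$Gender, levels = c("male", "female")))
  out <- data.frame(Payband = pay,
                    Male    = as.vector(tab[, "male"]),
                    Female  = as.vector(tab[, "female"]))
  out$Total <- out$Male + out$Female
  out
}

# One data frame per value of Time, stored in a named list
tables <- lapply(split(df, df$Time), count_by_payband)
tables[["part time"]]
```

Using factor() with explicit levels is what keeps the empty paybands in the result with a count of zero.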
In the reproducible R code below, I'd like to add a column "adjust" that results from a series of calculations that in Excel would use cumulative countifs, max, and match formulas, as shown in the illustration below. (To be complete, the adjust column should have used the match formula, since there could be more than one element in the list starting in row 15, but I think it's clear what I'm doing without it.) The yellow shading shows what the reproducible code generates, and the blue shading shows the series of Excel calculations that derive the desired values in the "adjust" column. Any suggestions for doing this, in dplyr if possible?
I am a long-time Excel user trying to migrate all of my work to R.
Reproducible code:
library(dplyr)
myData <-
data.frame(
Element = c("A","B","B","B","B","B","B","B"),
Group = c(0,1,1,1,2,2,3,3)
)
myDataGroups <- myData %>%
mutate(origOrder = row_number()) %>%
group_by(Element) %>%
mutate(ElementCnt = row_number()) %>%
ungroup() %>%
mutate(Group = factor(Group, unique(Group))) %>%
arrange(Group) %>%
mutate(groupCt = cumsum(Group != lag(Group, 1, Group[[1]])) - 1L) %>%
as.data.frame()
myDataGroups
We can use rowid (from data.table) to replace 'Group' with a running count within each Element/Group combination, flag the rows where that count exceeds 2 in a binary 'excessOver2' column, and then take the lag of the cumulative sum of 'excessOver2' to get 'adjust'.
library(dplyr)
library(data.table)
myDataGroups %>%
mutate(Group = rowid(Element, Group),
excessOver2 = +(Group > 2), adjust = lag(cumsum(excessOver2),
default = 0))
Output
Element Group origOrder ElementCnt groupCt excessOver2 adjust
1 A 1 1 1 -1 0 0
2 B 1 2 1 0 0 0
3 B 2 3 2 0 0 0
4 B 3 4 3 0 1 0
5 B 1 5 4 1 0 1
6 B 2 6 5 1 0 1
7 B 1 7 6 2 0 1
8 B 2 8 7 2 0 1
Another option, using only dplyr:
library(dplyr)
myData %>%
group_by(Element, Group) %>%
summarize(ElementCnt = row_number(), over2 = 1 * (ElementCnt > 2),
.groups = "drop_last") %>%
mutate(adjust = cumsum(lag(over2, default = 0))) %>%
ungroup()
Result
# A tibble: 8 × 5
Element Group ElementCnt over2 adjust
<chr> <dbl> <int> <dbl> <dbl>
1 A 0 1 0 0
2 B 1 1 0 0
3 B 1 2 0 0
4 B 1 3 1 0
5 B 2 1 0 1
6 B 2 2 0 1
7 B 3 1 0 1
8 B 3 2 0 1
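Recent dplyr releases warn when summarise() returns more than one row per group, so the same idea can be sketched with mutate() instead. This is a variation on the answer above, not a replacement for it; the name result is introduced here for illustration.

```r
library(dplyr)

myData <- data.frame(
  Element = c("A", "B", "B", "B", "B", "B", "B", "B"),
  Group   = c(0, 1, 1, 1, 2, 2, 3, 3)
)

result <- myData %>%
  # Running count within each Element/Group combination
  group_by(Element, Group) %>%
  mutate(ElementCnt = row_number(),
         over2      = as.integer(ElementCnt > 2)) %>%
  # Accumulate the flags per Element, shifted down by one row
  group_by(Element) %>%
  mutate(adjust = cumsum(lag(over2, default = 0L))) %>%
  ungroup()

result
```

Because mutate() keeps every input row, no reshaping is needed to get back to one row per original observation.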
This is my input data:
Program = c("A","A","A","B","B","C")
Age = c(10,30,30,12,32,53)
Gender = c("F","F","M","M","M","F")
Language = c("Eng","Eng","Kor","Kor","Other","Other")
df = data.frame(Program,Age,Gender,Language)
I would like to output a table like this:
Program  MEAN AGE  ENG  KOR  FEMALE  MALE
A
B
C
where MEAN AGE is the average age and ENG, KOR, FEMALE and MALE are counts.
I have tried using dplyr and t(), but I am completely lost as to what the steps should be (my first post, new to this). Thank you in advance!
You can take the following approach:
library(dplyr)
df %>%
group_by(Program) %>%
summarise(
`Mean Age` = mean(Age),
ENG = sum(Language=="Eng"),
KOR = sum(Language=="Kor"),
Female = sum(Gender=="F"),
Male = sum(Gender=="M"),
.groups="drop"
)
Output:
# A tibble: 3 x 6
Program `Mean Age` ENG KOR Female Male
<chr> <dbl> <int> <int> <int> <int>
1 A 23.3 2 1 2 1
2 B 22 0 1 0 2
3 C 53 0 0 1 0
Note: .groups is an argument of summarise(), not a new column. Setting .groups = "drop" returns an ungrouped result, which is equivalent to adding %>% ungroup() after the calculation. Any other name = expression pair passed to summarise() is treated as a column to compute.
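The effect of .groups is easiest to see with two grouping variables. The following sketch reuses the question's data and compares two settings with group_vars(); the names by_both, kept and dropped are made up for this example.

```r
library(dplyr)

df <- data.frame(
  Program  = c("A", "A", "A", "B", "B", "C"),
  Age      = c(10, 30, 30, 12, 32, 53),
  Gender   = c("F", "F", "M", "M", "M", "F"),
  Language = c("Eng", "Eng", "Kor", "Kor", "Other", "Other")
)

by_both <- group_by(df, Program, Gender)

# "drop_last" removes only the last grouping variable (Gender)
kept <- summarise(by_both, n = n(), .groups = "drop_last")

# "drop" removes all grouping, like calling ungroup() afterwards
dropped <- summarise(by_both, n = n(), .groups = "drop")

group_vars(kept)     # still grouped by Program
group_vars(dropped)  # no grouping variables left
```

With a single grouping variable, as in the answer above, both settings return an ungrouped result.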
In base R you could do:
df1 <- cbind(df[1:2], stack(df[3:4])[-2])
cbind(aggregate(Age~Program, df, mean),as.data.frame.matrix(table(df1[-2])))
Program Age Eng F Kor M Other
A A 23.33333 2 2 1 1 0
B B 22.00000 0 0 1 2 1
C C 53.00000 0 1 0 0 1
I want to loop over many columns and under certain conditions, replace values. For example, if disease=0 and treatment=1, replace treatment cell with 99.
Data:
df <- data.frame(id=1:5,
disease1=c(1,1,0,0,0),
treatment1=c(1,0,1,0,0),
outcome1=c("survived", "died", "survived", NA,NA),
disease2=c(1,1,0,0,0),
treatment2=c(1,0,1,0,0),
outcome2=c("survived", "died", "survived", NA,NA))
> df
id disease1 treatment1 outcome1 disease2 treatment2 outcome2
1 1 1 1 survived 1 1 survived
2 2 1 0 died 1 0 died
3 3 0 1 survived 0 1 survived
4 4 0 0 <NA> 0 0 <NA>
5 5 0 0 <NA> 0 0 <NA>
For a single column, case_when works well:
df %>% mutate(treatment1=case_when((disease1!=1&treatment1==1)~99, TRUE~treatment1))
For multiple columns, the following works in base R:
for(i in 1:2) {
df[,paste0("treatment",i)] <- ifelse(df[,paste0("disease",i)]!=1&df[,paste0("treatment",i)]==1,99, df[,paste0("treatment",i)])
}
I am looking for a way to do this all in tidyverse and I am having trouble finding the right recipe. Thank you in advance.
Consider reshaping to long form with pivot_longer; it is then easier to apply the same mutate to every disease/treatment pair. This would be a "tidier" approach if all disease values should sit together in one column (and likewise one column for treatment and one for outcome).
library(tidyverse)
df %>%
pivot_longer(cols = -id, names_to = c(".value", "number"), names_pattern = "(\\w+)(\\d+)") %>%
mutate(treatment = ifelse(disease == 0 & treatment == 1, 99, treatment))
An option using names_sep in pivot_longer together with replace:
library(dplyr)
library(tidyr)
pivot_longer(df, cols = -id, names_to = c('.value', 'number'),
names_sep="(?<=[a-z])(?=[0-9])") %>%
mutate(treatment = replace(treatment, !disease & treatment == 1, 99))
# A tibble: 10 x 5
# id number disease treatment outcome
# <int> <chr> <dbl> <dbl> <chr>
# 1 1 1 1 1 survived
# 2 1 2 1 1 survived
# 3 2 1 1 0 died
# 4 2 2 1 0 died
# 5 3 1 0 99 survived
# 6 3 2 0 99 survived
# 7 4 1 0 0 <NA>
# 8 4 2 0 0 <NA>
# 9 5 1 0 0 <NA>
#10 5 2 0 0 <NA>
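If the original wide layout is needed afterwards, the long table can be widened again with pivot_wider. The sketch below is one possible round trip, not part of either answer: it uses the pattern "(\\D+)(\\d+)" as an assumption so that multi-digit suffixes would also split cleanly, and the name wide is introduced here. Note that the columns come back grouped by variable rather than in their original order.

```r
library(dplyr)
library(tidyr)

df <- data.frame(id = 1:5,
                 disease1   = c(1, 1, 0, 0, 0),
                 treatment1 = c(1, 0, 1, 0, 0),
                 outcome1   = c("survived", "died", "survived", NA, NA),
                 disease2   = c(1, 1, 0, 0, 0),
                 treatment2 = c(1, 0, 1, 0, 0),
                 outcome2   = c("survived", "died", "survived", NA, NA))

wide <- df %>%
  # Long form: one row per id and suffix number
  pivot_longer(cols = -id,
               names_to = c(".value", "number"),
               names_pattern = "(\\D+)(\\d+)") %>%
  # Recode treatment where there is no disease but a treatment was recorded
  mutate(treatment = if_else(disease == 0 & treatment == 1, 99, treatment)) %>%
  # Back to wide form, rebuilding names such as treatment1 and treatment2
  pivot_wider(names_from = number,
              values_from = c(disease, treatment, outcome),
              names_sep = "")

wide
```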
I am aware of the spread function in the tidyr package but this is something I am unable to achieve.
I have a data.frame with 2 columns as defined below. I need to transpose the column Subject into binary columns with 1 and 0.
Below is the data frame:
studentInfo <- data.frame(StudentID = c(1,1,1,2,3,3),
Subject = c("Maths", "Science", "English", "Maths", "History", "History"))
> studentInfo
StudentID Subject
1 1 Maths
2 1 Science
3 1 English
4 2 Maths
5 3 History
6 3 History
And the output I am expecting is:
StudentID Maths Science English History
1 1 1 1 1 0
2 2 1 0 0 0
3 3 0 0 0 1
How can I do this with the spread() function or any other function?
Using reshape2, we can dcast from long to wide.
Since you only want a binary outcome, we can remove duplicate rows with unique() first:
library(reshape2)
si <- unique(studentInfo)
dcast(si, formula = StudentID ~ Subject, fun.aggregate = length)
# StudentID English History Maths Science
#1 1 1 0 1 1
#2 2 0 0 1 0
#3 3 0 1 0 0
Another approach using tidyr and dplyr is
library(tidyr)
library(dplyr)
studentInfo %>%
mutate(yesno = 1) %>%
distinct %>%
spread(Subject, yesno, fill = 0)
# StudentID English History Maths Science
#1 1 1 0 1 1
#2 2 0 0 1 0
#3 3 0 1 0 0
Although I'm not a fan (yet) of tidyr syntax...
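In current tidyr, spread() is superseded by pivot_wider(), so the same steps could be written as below. This is a sketch of the equivalent call, with the helper column named present and the result named wide as assumptions for this example.

```r
library(dplyr)
library(tidyr)

studentInfo <- data.frame(
  StudentID = c(1, 1, 1, 2, 3, 3),
  Subject   = c("Maths", "Science", "English", "Maths", "History", "History")
)

wide <- studentInfo %>%
  distinct() %>%                      # drop the repeated (3, History) row
  mutate(present = 1L) %>%            # indicator to spread across columns
  pivot_wider(names_from  = Subject,
              values_from = present,
              values_fill = 0L)       # missing combinations become 0

wide
```

Unlike spread(), pivot_wider() orders the new columns by first appearance in the data rather than alphabetically.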
We can use table from base R
+(table(studentInfo)!=0)
# Subject
#StudentID English History Maths Science
# 1 1 0 1 1
# 2 0 0 1 0
# 3 0 1 0 0
Using tidyr:
library(tidyr)
studentInfo <- data.frame(
StudentID = c(1,1,1,2,3,3),
Subject = c("Maths", "Science", "English", "Maths", "History", "History"))
pivot_wider(studentInfo,
names_from = "Subject",
values_from = 'Subject',
values_fill = 0,
values_fn = function(x) 1)
#> # A tibble: 3 x 5
#> StudentID Maths Science English History
#> <dbl> <int> <int> <int> <int>
#> 1 1 1 1 1 0
#> 2 2 1 0 0 0
#> 3 3 0 0 0 1
Created on 2019-09-19 by the reprex package (v0.3.0)