Related
For example, my table is shown below:
Job         Gender
CEO         Male
Manager     Male
Manager     Female
Manager     Male
Supervisor  Female
Then I would like to reorganize it into something like this:
Job         Male  Female
CEO            1       0
Manager        2       1
Supervisor     0       1
How can I achieve this?
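For reference, the example data can be reconstructed like this (a minimal sketch; the question does not show the actual code, so plain character columns are assumed):

df <- data.frame(
  Job    = c("CEO", "Manager", "Manager", "Manager", "Supervisor"),
  Gender = c("Male", "Male", "Female", "Male", "Female")
)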
Just pivot_wider() with values_fn = length:
library(tidyr)

df %>%
  pivot_wider(names_from = Gender, values_from = Gender, values_fn = length, values_fill = 0)
# # A tibble: 3 × 3
#   Job        Male Female
#   <chr>     <int>  <int>
# 1 CEO           1      0
# 2 Manager       2      1
# 3 Supervisor    0      1
You need to group_by the Job column, then count the Gender values within each Job. After that, transform the data frame into a "wide" format by expanding the count result.
library(tidyverse)

df %>%
  group_by(Job) %>%
  count(Gender) %>%
  pivot_wider(names_from = Gender, values_from = n, values_fill = 0) %>%
  ungroup()
# A tibble: 3 × 3
  Job        Male Female
  <chr>     <int>  <int>
1 CEO           1      0
2 Manager       2      1
3 Supervisor    0      1
Or, more simply, a single call to the table() function:
table(df$Job, df$Gender)
             Female Male
  CEO             0    1
  Manager         1    2
  Supervisor      1    0
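If a data frame rather than a table object is needed downstream, the result can be converted, for example with as.data.frame.matrix, which keeps the wide layout (Job ends up as row names). A small follow-up sketch, not part of the original answer:

as.data.frame.matrix(table(df$Job, df$Gender))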
Another option is using group_by with count and spread (note that spread() has since been superseded by pivot_wider()), like this:
library(dplyr)
library(tidyr)

df %>%
  group_by(Job, Gender) %>%
  count() %>%
  spread(Gender, n, fill = 0)
#> # A tibble: 3 × 3
#>   Job        Female  Male
#>   <chr>       <dbl> <dbl>
#> 1 CEO             0     1
#> 2 Manager         1     2
#> 3 Supervisor      1     0
Created on 2022-08-11 by the reprex package (v2.0.1)
One possible way to solve your problem:
xtabs(~ Job + Gender, data=df)
            Gender
Job          Female Male
  CEO             0    1
  Manager         1    2
  Supervisor      1    0
I want to create a contingency table that displays the frequency distribution of pairs of variables. Here is an example dataset:
mm <- matrix(0, 5, 6)
df <- data.frame(apply(mm, c(1,2), function(x) sample(c(0,1),1)))
colnames(df) <- c("Horror", "Thriller", "Comedy", "Romantic", "Sci.fi", "gender")
All variables are binary, with 1 indicating either the presence of a specific movie type or the male gender. In the end, I would like to have a table that counts the presence of each movie type for each gender, something like this:
         male female
Horror      1      1
Thriller    1      3
Comedy      2      2
Romantic    0      0
Sci.fi      2      0
I know I can create two tables of movie types, one for males and one for females (see TarJae's answer here: Create count table under specific condition), and cbind them later, but I would like to do it in one chunk of code. How can I achieve this efficiently?
You could do
sapply(split(df, df$gender), function(x) colSums(x[names(x)!="gender"]))
#>          0 1
#> Horror   1 1
#> Thriller 1 3
#> Comedy   0 0
#> Romantic 0 0
#> Sci.fi   1 3
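If you prefer labelled columns instead of 0/1, the columns can be renamed afterwards. A small sketch, not part of the original answer, assuming the coding from the question (1 = male, 0 = female) and that both values occur in the data:

res <- sapply(split(df, df$gender), function(x) colSums(x[names(x) != "gender"]))
colnames(res) <- c("female", "male")  # split() orders the groups as 0, then 1
res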
Here is a solution using dplyr, tidyr, and forcats (for fct_recode):
library(dplyr)
library(tidyr)
library(forcats)

df %>%
  pivot_longer(cols = -gender, names_to = "type") %>%
  mutate(gender = fct_recode(as.character(gender), Male = "0", Female = "1")) %>%
  group_by(gender, type) %>%
  summarise(sum = sum(value)) %>%
  pivot_wider(names_from = gender, values_from = sum)
Which gives
# A tibble: 5 x 3
  type      Male Female
  <chr>    <dbl>  <dbl>
1 Comedy       0      1
2 Horror       1      3
3 Romantic     1      1
4 Sci.fi       1      1
5 Thriller     1      1
The mutate() step with fct_recode() is optional, but it turns the 0/1 codes into readable labels for the gender variable.
Please find below a reprex with an alternative solution using data.table and magrittr (for the pipes), also in one chunk.
Reprex
Your data (I set a seed for reproducibility)
set.seed(452)
mm <- matrix(0, 5, 6)
df <- data.frame(apply(mm, c(1,2), function(x) sample(c(0,1),1)))
colnames(df) <- c("Horror", "Thriller", "Comedy", "Romantic", "Sci.fi", "gender")
df
#>   Horror Thriller Comedy Romantic Sci.fi gender
#> 1      0        1      1        0      0      0
#> 2      0        0      0        0      1      0
#> 3      1        0      1        1      0      1
#> 4      0        1      0        0      0      1
#> 5      0        1      0        0      0      1
Code in one chunk
library(data.table)
library(magrittr) # for the pipes!
df %>%
  transpose(., keep.names = "rn") %>%
  setDT(.) %>%
  {.[, .(rn = rn,
         male = rowSums(.[, .SD, .SDcols = .[, .SD[.N]] == 1]),
         female = rowSums(.[, .SD, .SDcols = .[, .SD[.N]] == 0]))][rn != "gender"]}
Output
#>          rn male female
#> 1:   Horror    1      0
#> 2: Thriller    2      1
#> 3:   Comedy    1      1
#> 4: Romantic    1      0
#> 5:   Sci.fi    0      1
Created on 2021-11-25 by the reprex package (v2.0.1)
I have a long form of clinical data that looks something like this:
patientid <- c(100,100,100,101,101,101,102,102,102,104,104,104)
outcome <- c(1,1,1,1,1,NA,1,NA,NA,NA,NA,NA)
time <- c(1,2,3,1,2,3,1,2,3,1,2,3)
Data <- data.frame(patientid=patientid, outcome=outcome, time=time)
A patient should be kept in the database only if they have 2 or 3 observations (so patients that have complete data for 0 or only 1 time point should be thrown out). For this example, my desired result is this:
patientid <- c(100,100,100,101,101,101)
outcome <- c(1,1,1,1,1,NA)
time <- c(1,2,3,1,2,3)
Data <- data.frame(patientid=patientid, outcome=outcome, time=time)
Hence patients 102 and 104 are thrown out of the database because they were missing the outcome variable at 2 or 3 of the time points.
We can create a logical condition on the sum of non-NA elements, grouped by 'patientid', to keep only the patients having more than one non-NA 'outcome':
library(dplyr)

Data %>%
  group_by(patientid) %>%
  filter(sum(!is.na(outcome)) > 1) %>%
  ungroup
-output
# A tibble: 6 x 3
#   patientid outcome  time
#       <dbl>   <dbl> <dbl>
# 1       100       1     1
# 2       100       1     2
# 3       100       1     3
# 4       101       1     1
# 5       101       1     2
# 6       101      NA     3
A base R option using subset + ave
subset(
  Data,
  ave(!is.na(outcome), patientid, FUN = sum) > 1
)
giving
  patientid outcome time
1       100       1    1
2       100       1    2
3       100       1    3
4       101       1    1
5       101       1    2
6       101      NA    3
A data.table option
setDT(Data)[, Y := sum(!is.na(outcome)), patientid][Y > 1, ][, Y := NULL][]
or a simpler one (thanks @akrun):
setDT(Data)[Data[, .I[sum(!is.na(outcome)) > 1], .(patientid)]$V1]
which gives
   patientid outcome time
1:       100       1    1
2:       100       1    2
3:       100       1    3
4:       101       1    1
5:       101       1    2
6:       101      NA    3
library(dplyr)

Data %>%
  group_by(patientid) %>%
  mutate(observation = sum(outcome, na.rm = TRUE)) %>% # sum outcome per patient; since outcome is always 1 when present, this equals the number of non-missing observations
  filter(observation >= 2) %>%
  ungroup
output:
# A tibble: 6 x 4
  patientid outcome  time observation
      <dbl>   <dbl> <dbl>       <dbl>
1       100       1     1           3
2       100       1     2           3
3       100       1     3           3
4       101       1     1           2
5       101       1     2           2
6       101      NA     3           2
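To match the desired three-column result exactly, the helper column can be dropped at the end with select(). A small sketch extending the code above, not part of the original answer:

library(dplyr)

Data %>%
  group_by(patientid) %>%
  mutate(observation = sum(outcome, na.rm = TRUE)) %>%
  filter(observation >= 2) %>%
  ungroup() %>%
  select(-observation)  # drop the helper column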
I have a data frame with three variables and some missing values in one of the variables that looks like this:
subject <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
part <- c(0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3)
sad <- c(1,7,7,4,NA,NA,2,2,NA,2,3,NA,NA,2,2,1,NA,5,NA,6,6,NA,NA,3,3,NA,NA,5,3,NA,7,2)
df1 <- data.frame(subject,part,sad)
I have created a new data frame with the mean values of 'sad' per subject and part using a loop, like this:
columns <- c("sad.m", "part", "subject")

df2 <- matrix(data = NA, nrow = 1, ncol = length(columns))
df2 <- data.frame(df2)
names(df2) <- columns

tn <- unique(df1$subject)
row <- 1
for (s in tn) {
  for (i in 0:3) {
    TN <- df1[df1$subject == s & df1$part == i, ]
    df2[row, "sad.m"] <- mean(as.numeric(TN$sad), na.rm = TRUE)
    df2[row, "part"] <- i
    df2[row, "subject"] <- s
    row <- row + 1
  }
}
Now I want to include an additional variable 'missing' that indicates the percentage of rows per subject and part with missing values, so that I get df3:
subject <- c(1,1,1,1,2,2,2,2)
part<-c(0,1,2,3,0,1,2,3)
sad.m<-df2$sad.m
missing <- c(0,50,50,25,50,50,50,25)
df3 <- data.frame(subject,part,sad.m,missing)
I'd really appreciate any help on how to go about this!
It's best to avoid loops in R where possible, since they can get messy and tend to be quite slow. For this sort of thing, the dplyr library is perfect and well worth learning; it can save you a lot of time.
You can create a data frame with both variables by first grouping by subject and part, and then performing a summary of the grouped data frame:
library(dplyr)

df2 <- df1 %>%
  dplyr::group_by(subject, part) %>%
  dplyr::summarise(
    sad_mean = mean(na.omit(sad)),
    na_count = sum(is.na(sad)) / n() * 100
  )
df2
# A tibble: 8 x 4
# Groups:   subject [2]
  subject  part sad_mean na_count
    <dbl> <dbl>    <dbl>    <dbl>
1       1     0     4.75        0
2       1     1     2          50
3       1     2     2.5        50
4       1     3     1.67       25
5       2     0     5.5        50
6       2     1     4.5        50
7       2     2     4          50
8       2     3     4          25
For each subject and part, you can calculate the mean of sad and the proportion of NA values using is.na and mean.
library(dplyr)

df1 %>%
  group_by(subject, part) %>%
  summarise(sad.m = mean(sad, na.rm = TRUE),
            perc_missing = mean(is.na(sad)) * 100)
#   subject  part sad.m perc_missing
#     <dbl> <dbl> <dbl>        <dbl>
# 1       1     0  4.75            0
# 2       1     1  2              50
# 3       1     2  2.5            50
# 4       1     3  1.67           25
# 5       2     0  5.5            50
# 6       2     1  4.5            50
# 7       2     2  4              50
# 8       2     3  4              25
Same logic with data.table:
library(data.table)
setDT(df1)[, .(sad.m = mean(sad, na.rm = TRUE),
perc_missing = mean(is.na(sad)) * 100), .(subject, part)]
Try this dplyr approach to compute df3:
library(dplyr)
#Code
df3 <- df1 %>%
  group_by(subject, part) %>%
  summarise(N = 100 * length(which(is.na(sad))) / length(sad))
Output:
# A tibble: 8 x 3
# Groups:   subject [2]
  subject  part     N
    <dbl> <dbl> <dbl>
1       1     0     0
2       1     1    50
3       1     2    50
4       1     3    25
5       2     0    50
6       2     1    50
7       2     2    50
8       2     3    25
And to combine this with df2 you can use left_join():
#Left join
df3 <- df1 %>%
  group_by(subject, part) %>%
  summarise(N = 100 * length(which(is.na(sad))) / length(sad)) %>%
  left_join(df2)
Output:
# A tibble: 8 x 4
# Groups:   subject [2]
  subject  part     N sad.m
    <dbl> <dbl> <dbl> <dbl>
1       1     0     0  4.75
2       1     1    50  2
3       1     2    50  2.5
4       1     3    25  1.67
5       2     0    50  5.5
6       2     1    50  4.5
7       2     2    50  4
8       2     3    25  4
I am trying to make a function with this data and would really appreciate help with this!
example<- data.frame(Day=c(2,4,8,16,32,44,2,4,8,16,32,44,2,4,8,16,32,44),
Replicate=c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,
1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,
1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3),
Treament=c("CC","CC","CC","CC","CC","CC","CC","CC","CC","CC","CC","CC","CC","CC","CC","CC","CC","CC",
"HP","HP","HP","HP","HP","HP","HP","HP","HP","HP","HP","HP","HP","HP","HP","HP","HP","HP",
"LL","LL","LL","LL","LL","LL","LL","LL","LL","LL","LL","LL","LL","LL","LL","LL","LL","LL"),
AFDM=c(94.669342,94.465752,84.897023,81.435993,86.556221,75.328294,94.262162,88.791240,75.735474,81.232403,
67.050593,76.346244,95.076522,88.968823,83.879073,73.958836,70.645724,67.184695,99.763156,92.022673,
92.245362,74.513934,50.083136,36.979418,94.872932,86.353037,81.843173,67.795465,46.622106,18.323099,
95.089932,93.244212,81.679814,65.352385,18.286525,7.517794,99.559972,86.759404,84.693433,79.196504,
67.456961,54.765706,94.074014,87.543693,82.492548,72.333367,51.304676,51.304676,98.340870,86.322153,
87.950873,84.693433,63.316485,63.723665))
Example:
I want to insert a new row with an AFDM value (e.g., 0.9823666) that was calculated with another function.
This new row must be added alongside each Day 2 (and be labelled Day 0), and I want to preserve the Replicate and Treatment of each group.
Thus, this new row must be: Day 0, Replicate=same, Treatment=same, AFDM=0.9823666.
This is so I can later run a regression with the data (from 0 to 44, 3 replicates for each Treatment).
I would prefer a solution using dplyr.
Thanks in advance
We can create a grouping column with cumsum, then expand the dataset with complete and fill the other columns
library(dplyr)
library(tidyr)
example %>%
  group_by(grp = cumsum(Day == 2)) %>%
  complete(Day = c(0, unique(Day)), fill = list(AFDM = 0.9823666)) %>%
  fill(Replicate, Treament, .direction = 'updown')
# A tibble: 63 x 5
# Groups:   grp [9]
#      grp   Day Replicate Treament  AFDM
#    <int> <dbl>     <dbl> <chr>    <dbl>
#  1     1     0         1 CC       0.982
#  2     1     2         1 CC      94.7
#  3     1     4         1 CC      94.5
#  4     1     8         1 CC      84.9
#  5     1    16         1 CC      81.4
#  6     1    32         1 CC      86.6
#  7     1    44         1 CC      75.3
#  8     2     0         2 CC       0.982
#  9     2     2         2 CC      94.3
# 10     2     4         2 CC      88.8
# … with 53 more rows
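If the helper grp column is not wanted in the final data, it can be dropped after ungrouping. A small sketch extending the pipeline above (dplyr's select() is available via the library calls already shown):

example %>%
  group_by(grp = cumsum(Day == 2)) %>%
  complete(Day = c(0, unique(Day)), fill = list(AFDM = 0.9823666)) %>%
  fill(Replicate, Treament, .direction = 'updown') %>%
  ungroup() %>%
  select(-grp)  # drop the helper grouping column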
You can use distinct to get the unique Replicate and Treament combinations, add the Day and AFDM columns with the default values, and bind the rows to the original dataframe.
library(dplyr)
example %>%
  distinct(Replicate, Treament) %>%
  mutate(Day = 0, AFDM = 0.9823666) %>%
  bind_rows(example) %>%
  arrange(Replicate, Treament)
#   Replicate Treament Day       AFDM
#1          1       CC   0  0.9823666
#2          1       CC   2 94.6693420
#3          1       CC   4 94.4657520
#4          1       CC   8 84.8970230
#5          1       CC  16 81.4359930
#6          1       CC  32 86.5562210
#7          1       CC  44 75.3282940
#8          1       HP   0  0.9823666
#9          1       HP   2 99.7631560
#10         1       HP   4 92.0226730
#.....