How to easily format a frequency table in R?

I am working with a data frame where each row is a patient with a particular illness. There is a column for their age category, and several columns of text ("Yes" or "No") indicating whether they are experiencing a particular symptom. An example is provided below:
set.seed(1)
Sick <- data.frame(age = sample(c("Infant", "Child", "Adult", "Elderly"), size = 20, replace = TRUE),
                   cough = sample(c("Yes", "No"), size = 20, replace = TRUE),
                   fever = sample(c("Yes", "No"), size = 20, replace = TRUE),
                   chills = sample(c("Yes", "No"), size = 20, replace = TRUE),
                   fatigue = sample(c("Yes", "No"), size = 20, replace = TRUE))
What I am trying to get is a nicely structured frame that indicates how many patients in each age category experience each symptom: the columns are the age categories, the rows are the symptoms, and each cell holds the count of people in that category who experienced that symptom. The code below shows what I want my end result to be.
Count <- data.frame(symptom = c("cough", "fever", "chills", "fatigue"),
                    Infant = c(5, 1, 4, 2),
                    Child = c(4, 3, 2, 4),
                    Adult = c(2, 3, 1, 5),
                    Elderly = c(1, 0, 0, 0))
I know I could create this with the table and rbind functions; however, I was wondering if anyone has advice on how to streamline this. The real frame has about 10 age categories and 25 symptoms, so building lots of individual tables may not be the most efficient approach.
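For reference, here is roughly what I mean by the table approach, collapsed into one sapply call (a minimal sketch; lvls is just a helper holding the age levels so the columns stay aligned even when a category has no "Yes" answers):
lvls <- c("Infant", "Child", "Adult", "Elderly")
# One "Yes" count per age level for each symptom column, stacked into a symptom-by-age matrix
t(sapply(Sick[-1], function(s) table(factor(Sick$age[s == "Yes"], levels = lvls))))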
Thank you for any and all help!

The other answers here are great (upvoted), though note that the tidyverse is needed for them as well. Or, even simpler:
library(tidyverse)
Sick %>%
  pivot_longer(-age, names_to = "symptom") %>%
  filter(value == "Yes") %>%
  count(age, symptom) %>%
  pivot_wider(names_from = "symptom", values_from = "n", values_fill = 0)
I've found in learning R that a great many problems can be solved by pivoting long and then wide or vice versa with some transform or calculation in between :)

I hope this is right. If I understand your question, you just want to count the "Yes" answers for each category. I've put it into a function, so just change x = Sick to whatever your data frame is called and run it.
EDIT: The pipe is re-exported by dplyr (it originally comes from magrittr), pivot_longer()/pivot_wider() come from tidyr, and column_to_rownames() comes from tibble, so the function now requires all three. If in doubt, just load the tidyverse.
sick_tbl <- function(x = Sick){
  require(dplyr)
  require(tidyr)   # pivot_longer(), pivot_wider()
  require(tibble)  # column_to_rownames()
  sick_piv <- pivot_longer(x, -age, names_to = "names", values_to = "values")
  counts <- sick_piv %>%
    count(values, names, age) %>%
    filter(values == "Yes") %>%
    select(!values)
  sick_out <- pivot_wider(counts,
                          names_from = "age",
                          values_from = "n") %>%
    column_to_rownames(var = "names")
  sick_out[is.na(sick_out)] <- 0  # age/symptom combinations with no "Yes" become 0
  sick_out
}
To run on your example data:
sick_tbl(x = Sick)
        Adult Child Elderly Infant
chills      1     2       4      0
cough       4     2       5      1
fatigue     3     3       2      1
fever       2     4       2      2

Here are three pretty concise options:
Base R
# Can skip the lapply if the Y/N columns were character to begin with
# stack() reshapes the symptom columns to long form (values + ind); keep the "Yes" rows and tabulate
with(subset(cbind(Sick[1], stack(lapply(Sick[-1], as.character))), values == "Yes"),
     table(ind, age))
data.table
library(data.table)
melt(as.data.table(Sick), id.vars="age")[value == "Yes", table(variable, age)]
questionr
library(questionr)
cross.multi.table(Sick[-1], Sick[[1]], true.codes = list("Yes"))
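All three of these return table objects (or a matrix) rather than a data frame like the question's Count. If a plain data frame is needed, as.data.frame.matrix() converts the result; a small sketch reusing the base R option:
# Convert the contingency table to a data frame with age categories as columns
tab <- with(subset(cbind(Sick[1], stack(lapply(Sick[-1], as.character))), values == "Yes"),
            table(ind, age))
as.data.frame.matrix(tab)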

First reshape, then use the datasummary function from the modelsummary package (self-promotion alert). The benefit of this solution is that you can customize the look of your table and save it to many formats (HTML, LaTeX, Word, Markdown, etc.):
library(modelsummary)
library(tidyverse)
dat <- pivot_longer(Sick, -age, names_to = "Symptom") %>%
  filter(value == "Yes")
datasummary(Symptom ~ N * age, data = dat)
        Adult Child Elderly Infant
chills      1     2       0      4
cough       2     4       1      5
fatigue     5     4       0      2
fever       3     3       0      1
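To write the styled table straight to a file instead of printing it, datasummary() also takes an output argument (a sketch; the file name here is made up):
# Save the same table as a Word document (.html, .tex, .md work the same way)
datasummary(Symptom ~ N * age, data = dat, output = "sick_table.docx")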

Related

How to count occurrences of a word/token in a one-token-per-document-per-row tibble

Hello, I have a tibble produced by a pipe through tidytext::unnest_tokens() and count(category, word, name = "count"). It looks like this example:
owl <- tibble(category = c(0, 1, 2, -1, 0, 1, 2),
              word = c(rep("hello", 3), rep("world", 4)),
              count = sample(1:100, 7))
and I would like to get this tibble with an additional column that gives the number of categories the word appears in, i.e. the same number for each occurrence of the word.
I tried the following code, which works in principle; the result is what I want.
owl %>% mutate(sum_t = sapply(1:nrow(.), function(x) {filter(., word == .$word[[x]]) %>% nrow()}))
However, seeing that my data has tens of thousands of rows, this takes a rather long time. Is there a more efficient way to achieve this?
We could use add_count:
library(dplyr)
owl %>%
add_count(word)
output:
category word count n
<dbl> <chr> <int> <int>
1 0 hello 98 3
2 1 hello 30 3
3 2 hello 37 3
4 -1 world 22 4
5 0 world 80 4
6 1 world 18 4
7 2 world 19 4
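A side note: add_count(word) counts rows per word, which equals the number of categories here only because the tibble came out of count(category, word) and therefore has one row per word/category pair. If the same pair could ever repeat, counting distinct categories directly is safer (a sketch):
library(dplyr)
# n_distinct() counts unique categories per word rather than rows
owl %>%
  group_by(word) %>%
  mutate(sum_t = n_distinct(category)) %>%
  ungroup()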
I played around with a few solutions and microbenchmark. I added TarJae's proposition to the benchmark. I also wanted to use the fantastic ave function just to see how it would compare to a dplyr solution.
library(dplyr)
library(ggplot2)  # for autoplot()
library(microbenchmark)
n <- 500
owl2 <- tibble(
  category = sample(-10:10, n, replace = TRUE),
  word = sample(stringi::stri_rand_strings(5, 10), n, replace = TRUE),
  count = sample(1:100, n, replace = TRUE))
mb <- microbenchmark(
  op = owl2 %>% mutate(sum_t = sapply(1:nrow(.), function(x) {filter(., word == .$word[[x]]) %>% nrow()})),
  group_by = owl2 %>% group_by(word) %>% mutate(n = n()),
  add_count = owl2 %>% add_count(word),
  ave = cbind(owl2, n = ave(owl2$word, owl2$word, FUN = length)),
  times = 50L)
autoplot(mb) + theme_bw()
The conclusion is that the elegant solution using add_count will save you a lot of time, and ave speeds up the process a lot as well.

Compute accordance of column values grouped by another column [duplicate]

This question already has an answer here:
Find out what values occur the most in my collection and its proportion
(1 answer)
Closed 1 year ago.
I have a data frame with a column of IDs spanning multiple rows (col_id) and another column with an assessment for each row (col_assessment), like so:
df <- data.frame(col_id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 col_assessment = c("Pos", "Pos", "Neu", "Neu", "Neg", "Neu", "Pos", "Neu", "Neg"))
I now want to calculate how much the assessments are in accordance for each ID (i.e., what share of the assessments are the same per ID). For this, I have the following function. (I do not have to use this function and am also open to other solutions.)
compute_ICR <- function(coding_values){
  ### takes a vector of coding values and returns the share of agreement (up to 1 if all are in agreement)
  most_common_value <- coding_values %>% table() %>% sort(decreasing = TRUE) %>% magrittr::extract(1) %>% names()
  # number of matching, most common values divided by number of total values
  share_accordance <- length(which(coding_values == most_common_value)) / length(coding_values)
  return(share_accordance)
}
I would now like to apply this to df by group of col_id, like so (not working pseudo-code!)
df %>% group_by(col_id) %>% summarize(share_accordance = compute_ICR(df$col_assessment))
This should give me the following data frame for the above example:
data.frame(col_id = c(1, 2, 3), share_accordance = c(.667, .667, .333))
Can someone point out how to achieve this result? Thanks in advance.
I would change the function to:
compute_ICR <- function(x){
  sort(table(x), decreasing = TRUE)[1] / length(x)
}
and apply it for each ID.
library(dplyr)
df %>%
  group_by(col_id) %>%
  summarize(share_accordance = compute_ICR(col_assessment))
# col_id share_accordance
# <dbl> <dbl>
#1 1 0.667
#2 2 0.667
#3 3 0.333
Or in base R:
aggregate(col_assessment ~ col_id, df, compute_ICR)
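which should return the same shares (output sketched from the example data above):
#   col_id col_assessment
# 1      1      0.6666667
# 2      2      0.6666667
# 3      3      0.3333333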
As I understand your question, you want the largest proportion of answers per ID? The code below will give this answer independently of the number of possible values for col_assessment:
library(dplyr)
df %>%
  group_by(col_id) %>%
  summarise(prop = max(prop.table(table(col_assessment))))
Returns:
col_id prop
<dbl> <dbl>
1 1 0.667
2 2 0.667
3 3 0.333

How to merge rows based on conditions with characters values? (Household data)

I have a data frame in which the first column indicates the occupation (manager, employee or worker), the second indicates whether the person works at night or not, and the last is a household code (if two individuals share the same code, it means they share the same house).
# Here is the reproducible data:
PCS <- c("worker", "manager", "employee", "employee", "worker", "worker", "manager", "employee", "manager", "employee")
work_night <- c("Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes", "No", "Yes")
HHnum <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
df <- data.frame(PCS, work_night, HHnum)
My problem is that I would like to have a new data frame with households instead of individuals: I would like to group individuals based on HHnum and then merge their answers.
For the variable "PCS", I have new categories based on the combination of answers: manager+worker = "I"; manager+employee = "II"; employee+employee = "VI"; worker+worker = "III", etc.
For the variable "work_night", I would like to apply a score (if both answered Yes then score = 2, if exactly one answered Yes then score = 1, and if both answered No then score = 0).
To be clear, I would like my data frame to look like this:
HHnum PCS   work_night
1     "I"   2
2     "VI"  0
3     "III" 1
4     "II"  1
5     "II"  1
How can I do this in R using dplyr? I know that I need group_by(), but then I don't know what to use.
Best,
Victor
Here is one way to do it (though I admit it is pretty verbose). I created a reference data frame (i.e., combos) in case you have more categories than three, which is then joined with the main data frame (i.e., df_new) to bring in the PCS Roman numerals.
library(dplyr)
library(tidyr)
# Create a data frame with all of the combinations of PCS.
combos <- expand.grid(unique(df$PCS), unique(df$PCS))
combos <- unique(t(apply(combos, 1, sort))) %>%
  as.data.frame() %>%
  dplyr::mutate(PCS = as.roman(row_number()))
# Create another data frame with the columns reversed (makes it easier to join to the main data frame).
combos2 <- data.frame(V1 = combos$V2, V2 = combos$V1, PCS = combos$PCS) %>%
  dplyr::mutate(PCS = as.roman(PCS))
combos <- rbind(combos, combos2)
# Get the count of "Yes" for each HHnum group, then spread the two PCS values
# into 2 columns to join together with the "combos" df.
df_new <- df %>%
  dplyr::group_by(HHnum) %>%
  dplyr::mutate(work_night = sum(work_night == "Yes")) %>%
  dplyr::group_by(grp = rep(1:2, length.out = n())) %>%
  dplyr::ungroup() %>%
  tidyr::pivot_wider(names_from = grp, values_from = PCS) %>%
  dplyr::rename("V1" = 3, "V2" = 4) %>%
  dplyr::left_join(combos, by = c("V1", "V2")) %>%
  unique() %>%
  dplyr::select(HHnum, PCS, work_night)
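If the set of combinations is as small as in the question, a more compact sketch is to hand-write the lookup and summarise per household (the pcs_codes vector below is assembled from the categories listed in the question, so double-check it against your full coding scheme):
library(dplyr)
# Hand-written lookup from the sorted pair of occupations to the Roman-numeral category
pcs_codes <- c("manager+worker"    = "I",
               "employee+manager"  = "II",
               "worker+worker"     = "III",
               "employee+employee" = "VI")
df %>%
  group_by(HHnum) %>%
  summarise(PCS = unname(pcs_codes[paste(sort(PCS), collapse = "+")]),
            work_night = sum(work_night == "Yes"))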

Insert a (mostly) blank row into an R dataframe for aesthetic purposes (exporting as .csv to LaTeX)

Background:
I have a little data frame composed of percentage estimates across two characteristics. One is "political party", the other is "sex". As it stands, these group names themselves are unmentioned in the df even if the categories that comprise them are. Here it is:
d <- data.frame(characteristic = c("Republican", "Democrat", "Male", "Female"),
                percent = c(45, 55, 70, 30),
                stringsAsFactors = FALSE)
For those without R available to them at the moment, here's what that looks like:
  characteristic percent
1     Republican      45
2       Democrat      55
3           Male      70
4         Female      30
The Problem:
As the result of data analysis, this df is perfectly adequate. But because I want to export it as a .csv file and insert it into a report I'm writing, I'd like to make some aesthetic modifications to make that integration as smooth as possible. In particular, I'd like to insert 2 (mostly) blank rows to act as subheadings. In other words, I'm looking to do something like this:
  characteristic percent
1          Party
2     Republican      45
3       Democrat      55
4            Sex
5           Male      70
6         Female      30
What I've tried:
I feel I'm quite close, because I can get the new rows in using these two lines of code:
d[nrow(d) + 1,] = c("Party","")
d[nrow(d) + 1,] = c("Sex","")
But that only appends them to the bottom of the dataset, and I'm not sure how to move them into the desired positions, or whether there is a more efficient way to do this.
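(For completeness, the closest I've gotten is reindexing after the appends, which works but feels brittle because it hard-codes the row positions:)
# Rows 5 and 6 are the appended heading rows; move each one above its group
d <- d[c(5, 1, 2, 6, 3, 4), ]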
Thanks for any help!
You could use add_row:
library(dplyr)
d %>%
  mutate(percent = as.character(percent)) %>%
  add_row(characteristic = "Party", percent = "", .before = 1) %>%
  add_row(characteristic = "Sex", percent = "", .before = 4)
Output:
characteristic percent
1 Party
2 Republican 45
3 Democrat 55
4 Sex
5 Male 70
6 Female 30
Note that the above solution converts percent to character to match your desired output. If you are fine with NA instead of "", you can keep percent numeric:
d %>%
  add_row(characteristic = "Party", percent = NA, .before = 1) %>%
  add_row(characteristic = "Sex", percent = NA, .before = 4)
Output:
characteristic percent
1 Party NA
2 Republican 45
3 Democrat 55
4 Sex NA
5 Male 70
6 Female 30
As the OP mentioned, it is an aesthetic preference, so an option is kableExtra. (Note: this is posted as an alternative way to visualize.)
library(kableExtra)
kbl(d) %>%
  kable_paper("striped", full_width = FALSE) %>%
  pack_rows("Party", 1, 2) %>%
  pack_rows("Sex", 3, 4)

Remove all individuals in a column that have at least one NA value in variables with repeated measurements

I'm new in R and I would like to ask for some help.
I have a dataframe with the following structure:
DF <- data.frame(patient = c(1, 1, 2, 2, 3, 3),
                 treatment = c("baseline", "on-treatment", "baseline", "on-treatment", "baseline", "on-treatment"),
                 cholesterol_value = c(300, 100, 255, NA, 270, 150))
Patient 2 has a baseline cholesterol value but does not have an on-treatment cholesterol value.
I would like to find a way to remove all the rows corresponding to patient 2, for example in a for loop, keeping only the rows corresponding to patients 1 and 3.
Can anyone help me?
Thanks!
library(tidyverse)
DF %>%
  group_by(patient) %>%
  filter(!any(is.na(cholesterol_value)))
Using base R
subset(DF, !patient %in% unique(patient[is.na(cholesterol_value)]))
Here is another base R option using ave in subset:
subset(
  DF,
  !ave(is.na(cholesterol_value), patient, FUN = any)
)
which gives
patient treatment cholesterol_value
1 1 baseline 300
2 1 on-treatment 100
5 3 baseline 270
6 3 on-treatment 150
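Since the question mentions a for loop, here is an explicit (if unidiomatic) loop for comparison; the vectorized answers above are preferable in practice:
# Flag patients with at least one NA cholesterol value, then drop all of their rows
keep <- rep(TRUE, nrow(DF))
for (p in unique(DF$patient)) {
  rows <- DF$patient == p
  if (any(is.na(DF$cholesterol_value[rows]))) keep[rows] <- FALSE
}
DF[keep, ]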
