apply custom-made function to column pairs and create summary table - r

I have data with ratings on many parameters by two different raters; shown here is just a snippet with ratings on three parameters, each stored as a same-prefix column pair (e.g. DH and DH_ptak):
df <- structure(list(DH = c(0, 1, NA, NA, 1, 1, 1, 1, 1, 1),
DH_ptak = c(0, 1, 1, 1, 1, 1, 1, 1, 1, 1),
SZ = c(1, 1, NA, NA, NA, 0, 1, 0, 1, 1),
SZ_ptak = c(1, 1, NA, NA, NA, 1, 0, NA, 1, 1),
RM = c(0, 1, 1, NA, NA, NA, 0, NA, 1, NA),
RM_ptak = c(0, 1, 1, 1, 1, NA, 0, 1, NA, 1)),
row.names = c(NA, 10L), class = "data.frame")
For each parameter I want to compare the two ratings columns. I use this function to find different ratings:
compare_fun <- function(c1, c2){
case_when(is.na(c1) & is.na(c2) ~ 0,
is.na(c1) | is.na(c2) ~ 1,
c1 != c2 ~ 1,
TRUE ~ 0)
}
I can use this function to sum the differences and compute an agreement percentage agree_pct:
library(dplyr)
df %>%
mutate(diff = compare_fun(DH, DH_ptak)) %>%
summarise(sum = sum(diff),
agree_pct = (nrow(df)-sum)/nrow(df)*100)
sum agree_pct
1 2 80
The problem is that I have multiple parameters. How can I compute the respective sum and agree_pct for all ratings-column pairs in one go, ideally obtaining a table like this:
sum agree_pct
DH 2 80
SZ 3 70
RM 5 50

This is what I would do. It mostly involves pivoting the data a few times. First I make a column from the row names so that I can keep all the rows straight, then I go from wide to long with pivot_longer. I separate the column names to distinguish the two raters and assign them the names "grp1" and "grp2". Then I use pivot_wider so that there are two columns, one for each rater. Lastly I apply your function to all the data, group by the variable of interest, and summarise.
library(tidyverse)
df %>%
rownames_to_column("col") %>%
pivot_longer( -col) %>%
separate(name, into = c("var", "tmp"), sep = "_", fill = "right") %>% # fill = "right" avoids a warning for names without "_"
mutate(grp = ifelse(is.na(tmp), "grp1", "grp2")) %>%
select(col, var, value, grp) %>%
pivot_wider(names_from = grp, values_from = value) %>%
mutate(diff = compare_fun(grp1, grp2)) %>%
group_by(var) %>%
summarise(sum = sum(diff),
agree_pct = (nrow(df)-sum)/nrow(df)*100)
#> # A tibble: 3 x 3
#> var sum agree_pct
#> <chr> <dbl> <dbl>
#> 1 DH 2 80
#> 2 RM 5 50
#> 3 SZ 3 70
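If the parameter names are known up front, the pivoting can be skipped by looping over the prefixes. This is only a sketch: the params vector is typed by hand here and the _ptak suffix is taken from the sample data, so both are assumptions to adapt to the full data set.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  DH      = c(0, 1, NA, NA, 1, 1, 1, 1, 1, 1),
  DH_ptak = c(0, 1, 1, 1, 1, 1, 1, 1, 1, 1),
  SZ      = c(1, 1, NA, NA, NA, 0, 1, 0, 1, 1),
  SZ_ptak = c(1, 1, NA, NA, NA, 1, 0, NA, 1, 1),
  RM      = c(0, 1, 1, NA, NA, NA, 0, NA, 1, NA),
  RM_ptak = c(0, 1, 1, 1, 1, NA, 0, 1, NA, 1)
)

# compare_fun as defined in the question: 1 = ratings differ, 0 = ratings agree
compare_fun <- function(c1, c2) {
  case_when(is.na(c1) & is.na(c2) ~ 0,
            is.na(c1) | is.na(c2) ~ 1,
            c1 != c2 ~ 1,
            TRUE ~ 0)
}

# Parameter prefixes (listed by hand; an assumption for this sketch)
params <- c("DH", "SZ", "RM")

# One pass per prefix: compare the column with its "_ptak" partner
res <- t(sapply(params, function(p) {
  d <- compare_fun(df[[p]], df[[paste0(p, "_ptak")]])
  c(sum = sum(d), agree_pct = (length(d) - sum(d)) / length(d) * 100)
}))
res
```

The result is a matrix with one row per parameter, in the order given in params, and the row names DH, SZ and RM, which matches the layout asked for in the question.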

Related

How to find sum of a column given the date and month is the same

I am wondering how I can find the sum of a column, (in this case it's the AgeGroup_20_to_24 column) for a month and year. Here's the sample data:
https://i.stack.imgur.com/E23Th.png
I essentially want to find the total amount of cases per month/year.
For example: 01/2020 = total sum cases of the AgeGroup
02/2020 = total sum cases of the AgeGroup
I tried doing this, however I get this:
https://i.stack.imgur.com/1eH0O.png
xAge20To24 <- covid%>%
mutate(dates=mdy(Date), year = year(dates), month = month(dates))%>%
mutate(total = sum(AgeGroup_20_to_24))%>%
select(Date, year, month, AgeGroup_20_to_24)%>%
group_by(year)
View(xAge20To24)
Any help will be appreciated.
structure(list(Date = c("3/9/2020", "3/10/2020", "3/11/2020",
"3/12/2020", "3/13/2020", "3/14/2020"), AgeGroup_0_to_19 = c(1,
0, 2, 0, 0, 2), AgeGroup_20_to_24 = c(1, 0, 2, 0, 2, 1), AgeGroup_25_to_29 = c(1,
0, 1, 2, 2, 2), AgeGroup_30_to_34 = c(0, 0, 2, 3, 4, 3), AgeGroup_35_to_39 = c(3,
1, 2, 1, 2, 1), AgeGroup_40_to_44 = c(1, 2, 1, 3, 3, 1), AgeGroup_45_to_49 = c(1,
0, 0, 2, 0, 1), AgeGroup_50_to_54 = c(2, 1, 1, 1, 0, 1), AgeGroup_55_to_59 = c(1,
0, 1, 1, 1, 2), AgeGroup_60_to_64 = c(0, 2, 2, 1, 1, 3), AgeGroup_70_plus = c(2,
0, 2, 0, 0, 0)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
I'm not sure if your question and your data match up. You're asking for by-month summaries of data, but your data only includes March entries. I've provided two examples of summarizing your data below, one that uses the entire date and one that uses by-day summaries since we can't use month. If your full data set has more months included, you can just swap the day for month instead. First, a quick summary of just the dates can be done with this code:
#### Load Library ####
library(tidyverse)
library(lubridate)
#### Pivot and Summarise Data ####
covid %>%
pivot_longer(cols = c(everything(),
-Date),
names_to = "AgeGroup",
values_to = "Cases") %>%
group_by(Date) %>%
summarise(Sum_Cases = sum(Cases))
This pivots your data into long format, groups by the entire date, then summarizes the cases, which gives you this by-date sum of data:
# A tibble: 6 × 2
Date Sum_Cases
<chr> <dbl>
1 3/10/2020 6
2 3/11/2020 16
3 3/12/2020 14
4 3/13/2020 15
5 3/14/2020 17
6 3/9/2020 13
Using the same pivot_longer principle, you can mutate the data to date format like you already did, pivot to longer format, then group by day, thereafter summarizing the cases:
#### Theoretical Example ####
covid %>%
mutate(Date=mdy(Date),
Year = year(Date),
Month = month(Date),
Day = day(Date)) %>%
pivot_longer(cols = c(everything(),
-Date,-Year,-Month,-Day),
names_to = "AgeGroup",
values_to = "Cases") %>%
group_by(Day) %>% # use by day instead of month
summarise(Sum_Cases = sum(Cases))
This gives the result below, where we can see that the 14th had the most cases:
# A tibble: 6 × 2
Day Sum_Cases
<int> <dbl>
1 9 13
2 10 6
3 11 16
4 12 14
5 13 15
6 14 17
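For the month/year totals of a single column that the question asks about, the same idea works without pivoting. The sketch below uses a small made-up data frame (covid_demo, with invented values spanning two months) because the sample data only covers March; the column names follow the question.

```r
library(dplyr)
library(lubridate)

# Hypothetical data spanning two months (values invented for illustration)
covid_demo <- data.frame(
  Date = c("3/30/2020", "3/31/2020", "4/1/2020", "4/2/2020"),
  AgeGroup_20_to_24 = c(1, 2, 3, 4)
)

monthly <- covid_demo %>%
  mutate(Date = mdy(Date),          # parse month/day/year strings
         Year = year(Date),
         Month = month(Date)) %>%
  group_by(Year, Month) %>%         # one group per calendar month
  summarise(Sum_Cases = sum(AgeGroup_20_to_24), .groups = "drop")
monthly
```

With this made-up input, March sums to 3 and April to 7. The key difference from the attempt in the question is that group_by comes before the sum and summarise replaces mutate, so each month collapses to a single row.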

calculate the mean of column and also the comments in next column

I want to calculate the mean of a column and also concatenate the texts in a second column of the output.
For example, below I want to calculate the mean of C1 and then concatenate all texts in C1T in the next column if there is more than one text in C1T.
df <- data.frame(A1 = c("class","type","class","type","class","class","class","class","class"),
B1 = c("b2","b3","b3","b1","b3","b3","b3","b2","b1"),
C1=c(6, NA, 1, 6, NA, 1, 6, 6, 2),
C1T=c(NA, "Part of other business", NA, NA, NA, NA, NA, NA, NA),
C2=c(NA, 4, 1, 2, 4, 4, 3, 3, NA),
C2T=c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
C3=c(3, 4, 3, 3, 6, NA, 2, 4, 1),
C3T=c(NA, NA, NA, NA, "two part are available but not in source", NA, NA, NA, NA),
C4=c(5, 5, 2, NA, NA, 6, 4, 1, 2),
C4T=c(NA, NA, NA, NA, NA, NA, NA, "Critical Expert", NA),
C5=c(6, 2, 6, 4, 2, 2, 5, 4, 1),
C5T=c(NA, NA, NA, NA, NA, "most of things are stuck", "weather responsible", NA, NA))
var <- "C1"
var1 <- "C1T"
var <- rlang::parse_expr(var)
var1 <- rlang::parse_expr(var1)
df1 <- df%>%filter(A1 == "class")
T1<- df1 %>%group_by(B1)%>%summarise(mean=round(mean(!!var,na.rm = TRUE),1))
Comments <- df1 %>% group_by(B1) %>% summarise_at(vars(var1), paste0, collapse = " ") %>%
select(var1) %>% unlist() %>% gsub("NA","",.) %>% stringi::stri_trim_both()
cbind(T1,Comments)
Edited Answer:
var <- "C1"
var1 <- "C1T"
filtercol <- "A1"
filterval <- "class"
groupingvar <- "B1"
var <- rlang::parse_expr(var)
var1 <- rlang::parse_expr(var1)
filtercol <- rlang::parse_expr(filtercol)
groupingvar <- rlang::parse_expr(groupingvar)
library(dplyr)
df1 <- df %>% filter(!!filtercol == filterval)
T1 <- df1 %>% group_by(!!groupingvar) %>% summarise(mean=round(mean(as.numeric(!!var),na.rm = TRUE),1))
Comments <- df1 %>% select(!!groupingvar, !!var1) %>%
group_by(!!groupingvar) %>%
summarise_at(vars(!!var1), paste0, collapse = " ") %>%
select(!!var1) %>% unlist() %>% gsub("NA", "", .) %>%
stringi::stri_trim_both()
T1 <- cbind(T1,Comments)
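When the column names are fixed, the mean and the concatenated comments can also be computed in a single summarise call. This is a sketch under assumptions: it uses only the A1, B1, C1 and C1T columns from the question, hard-codes the column names instead of parsing them from strings, and leaves out the A1 == "class" filter so that the one existing comment stays visible.

```r
library(dplyr)

# A1, B1, C1 and C1T copied from the question's data
df <- data.frame(
  A1  = c("class", "type", "class", "type", "class", "class", "class", "class", "class"),
  B1  = c("b2", "b3", "b3", "b1", "b3", "b3", "b3", "b2", "b1"),
  C1  = c(6, NA, 1, 6, NA, 1, 6, 6, 2),
  C1T = c(NA, "Part of other business", NA, NA, NA, NA, NA, NA, NA)
)

res <- df %>%
  group_by(B1) %>%
  summarise(
    mean = round(mean(C1, na.rm = TRUE), 1),
    # drop NA comments before pasting, so no "NA" strings need removing afterwards
    Comments = paste(na.omit(C1T), collapse = " ")
  )
res
```

Groups without any comment get an empty string, and the b3 group carries "Part of other business". Dropping the NAs with na.omit before pasting replaces the gsub("NA", "", .) step, which would also strip the letters "NA" from inside a real comment.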
Update on OP's request (see comments):
library(dplyr)
library(tidyr) # pivot_longer comes from tidyr
# helper function to coalesce by column
coalesce_by_column <- function(df) {
return(coalesce(df[1], df[2]))
}
df %>%
pivot_longer(
cols = contains("T"),
names_to = "names",
values_to = "values"
) %>%
filter(names == "C1T") %>%
group_by(names) %>%
summarise(Mean = mean(c_across(C1:C5 & where(is.numeric)), na.rm = TRUE),
Comments = coalesce_by_column(values))
Output:
names Mean Comments
<chr> <dbl> <chr>
1 C1T 3.47 Part of other business
First answer:
Use coalesce to construct the Comments column, and rowwise with c_across to calculate the mean for each row.
If you need to group, you can use group_by.
library(dplyr)
df %>%
mutate(Comments = coalesce(C1T, C2T, C3T, C4T, C5T),.keep="unused") %>%
rowwise() %>%
mutate(Mean = mean(c_across(C1:C5 & where(is.numeric)), na.rm = TRUE)) %>%
select(A1, B1, Mean, Comments)
Output:
A1 B1 Mean Comments
<chr> <chr> <dbl> <chr>
1 class b2 5 NA
2 type b3 3.75 Part of other business
3 class b3 2.6 NA
4 type b1 3.75 NA
5 class b3 4 two part are available but not in source
6 class b3 3.25 most of things are stuck
7 class b3 4 weather responsible
8 class b2 3.6 Critical Expert
9 class b1 1.5 NA

if_else with haven_labelled column fails because of wrong class

I have the following data:
dat <- structure(list(value = structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
label = "value: This is my label",
labels = c(`No` = 0, `Yes` = 1),
class = "haven_labelled"),
group = structure(c(1, 2, 1, 1, 2, 3, 3, 1, 3, 1, 3, 3, 1, 2, 3, 2, 1, 3, 3, 1),
label = "my group",
labels = c(first = 1, second = 2, third = 3),
class = "haven_labelled")),
row.names = c(NA, -20L),
class = c("tbl_df", "tbl", "data.frame"),
label = "test.sav")
As you can see, the data uses a special class from the tidyverse's haven package, so-called labelled columns.
Now I want to recode my initial value variable such that:
if group equals 1, value should stay the same, otherwise it should be missing
I was trying the following, but getting an error:
dat_new <- dat %>%
mutate(value = if_else(group != 1, NA, value))
# Error: `false` must be a logical vector, not a `haven_labelled` object
I got as far as understanding that dplyr's if_else requires the true and false arguments to be of the same class, and since there is no NA equivalent for class labelled (similar to NA_real_ for doubles), the code probably fails for that reason, right?
So, how can I recode my initial variable and preserve the labels?
I know I could replace if_else with base R's ifelse. However, this deletes all labels and coerces the value column to a numeric one.
You can try dplyr::case_when for cases where group == 1. If no cases are matched, NA is returned:
dat %>% mutate(value = case_when(group == 1 ~ value))
You can create an NA value in the haven_labelled class with this ugly code:
haven::labelled(NA_real_, labels = attr(dat$value, "labels"))
I'd recommend writing a function for that, e.g.
labelled_NA <- function(value)
haven::labelled(NA_real_, labels = attr(value, "labels"))
and then the code for your mutate isn't quite so ugly:
dat_new <- dat %>%
mutate(value = if_else(group != 1, labelled_NA(value), value))
Then you get
> dat_new[1:5,]
# A tibble: 5 x 2
value group
<dbl+lbl> <dbl+lbl>
1 NA 1 [first]
2 NA 2 [second]
3 0 [No] 1 [first]
4 0 [No] 1 [first]
5 NA 2 [second]

Is there a cleaner way to group and summarize multiple variables multiple ways in R?

This is my first post. Apologies if I botch something.
I have employee opinion survey data with 5-point Likert scale responses along with department (and other demographic data). I would like to get a % unfavorable (a 1 or 2 survey response), % neutral (a survey response == 3), and % favorable (a 4 or 5 response). I would also like to have those %s for each department. I have the result I am looking for with the sample data below, but I actually have 30+ variables. I'm hoping there is a cleaner way to do this!
Here is my sample data:
survey <- data.frame(department = c('hr', 'hr', 'tech', 'tech', 'tech', 'hr', 'hr', 'tech', 'tech', 'tech'),
pride = c(1, 5, 2, 3, NA, 5, 5, 2, 3, NA),
satisfaction = c(5, 2, 3, NA, 5, 5, 2, 3, NA, 3),
leadership = c(5, 2, 3, NA, 5, 1, 1, 5, 2, 3))
I am able to pretty easily get % favorable using this:
items <- c('pride', 'satisfaction', 'leadership')
output <- survey %>%
group_by(department) %>%
mutate_at(items, recode, `1` = 0, `2` = 0, `3` = 0, `4` = 1, `5` = 1) %>%
summarize_at(items, mean, na.rm = T) %>%
rowwise() %>%
mutate(engagement = mean(c(pride,satisfaction,leadership), na.rm = T)) %>%
filter(!is.na(department))
It starts to become messy once I attempt to do all 3 calculations (%unfav, %neutral, and %fav). Is there a better way than this (which does give me the desired output - again it's not very scalable considering I actually have 30+ variables):
items_fav <- c('pride_fav', 'satisfaction_fav', 'leadership_fav')
items_neutral <- c('pride_neut', 'satisfaction_neut', 'leadership_neut')
items_unfav <- c('pride_unfav', 'satisfaction_unfav', 'leadership_unfav')
all_items <- (c('pride_fav', 'satisfaction_fav', 'leadership_fav','pride_neut', 'satisfaction_neut', 'leadership_neut','pride_unfav', 'satisfaction_unfav', 'leadership_unfav'))
output_3parts <- survey %>%
mutate(pride_fav = pride,
satisfaction_fav = satisfaction,
leadership_fav = leadership,
pride_neut = pride,
satisfaction_neut = satisfaction,
leadership_neut = leadership,
pride_unfav = pride,
satisfaction_unfav = satisfaction,
leadership_unfav = leadership) %>%
mutate_at(items_fav, recode, `1` = 0, `2` = 0, `3` = 0, `4` = 1, `5` = 1) %>%
mutate_at(items_neutral, recode, `1` = 0, `2` = 0, `3` = 1, `4` = 0, `5` = 0) %>%
mutate_at(items_unfav, recode, `1` = 1, `2` = 1, `3` = 0, `4` = 0, `5` = 0) %>%
group_by(department) %>%
summarize_at(all_items, mean , na.rm = T)
Output would look something like this:
department pride_fav satisfaction_fav leadership_fav pride_neut satisfaction_neut leadership_neut pride_unfav satisfaction_unfav leadership_unfav
hr              0.75              0.5           0.25          0                 0               0        0.25                0.5             0.75
tech               0             0.25            0.4        0.5              0.75             0.4         0.5                  0              0.2
Thanks!
If I understand you correctly, this might do what you're looking for.
library(tidyverse)
survey %>%
pivot_longer(cols = -department, names_to = "quality", values_to = "ranking") %>%
group_by(department, quality) %>%
summarise(mean_score = mean(ranking, na.rm = T)) %>%
pivot_wider(names_from = quality, values_from = mean_score)
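If the three proportions per item are what is needed rather than the mean score, across() with a named list of functions gets there without the intermediate copies. This is a sketch assuming dplyr >= 1.0; the names fav, neut and unfav are chosen to match the question's column suffixes.

```r
library(dplyr)

# Sample data from the question
survey <- data.frame(
  department   = c("hr", "hr", "tech", "tech", "tech", "hr", "hr", "tech", "tech", "tech"),
  pride        = c(1, 5, 2, 3, NA, 5, 5, 2, 3, NA),
  satisfaction = c(5, 2, 3, NA, 5, 5, 2, 3, NA, 3),
  leadership   = c(5, 2, 3, NA, 5, 1, 1, 5, 2, 3)
)

items <- c("pride", "satisfaction", "leadership")

out <- survey %>%
  group_by(department) %>%
  summarise(across(
    all_of(items),
    list(
      fav   = ~ mean(.x >= 4, na.rm = TRUE),  # share of 4s and 5s
      neut  = ~ mean(.x == 3, na.rm = TRUE),  # share of 3s
      unfav = ~ mean(.x <= 2, na.rm = TRUE)   # share of 1s and 2s
    )
  ))
out
```

The columns come out named pride_fav, pride_neut, pride_unfav and so on, grouped by item rather than by category, with the same values as the table in the question. Scaling to 30+ variables only requires extending items, or replacing all_of(items) with a selection such as -department if every other column is a survey item.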

Summary of N recent values

I am trying to get summary statistics (sum and max here) with most N recent values.
Starting data:
dt = data.table(id = c('a','a','a','a','b','b','b','b'),
week = c(1,2,3,4,1,2,3,4),
value = c(2, 3, 1, 0, 5, 7,3,2))
Desired result:
dt = data.table(id = c('a','a','a','a','b','b','b','b'),
week = c(1,2,3,4,1,2,3,4),
value = c(2, 3, 1, 0, 5, 7,3,2),
sum_recent2week = c(NA, NA, 5, 4, NA, NA, 12, 10),
max_recent2week = c(NA, NA, 3, 3, NA, NA, 7, 7))
With the data, I would like to have sum and max of 2 (N=2) most recent values for each row by id. 4th(sum_recent2week) and 5th (max_recent2week) columns are my desired columns
You can use rollsum and rollmax from the zoo package.
dt[, `:=`(sum_recent2week =
shift(rollsum(value, 2, align = 'left', fill = NA), 2),
max_recent2week =
shift(rollmax(value, 2, align = 'left', fill = NA), 2))
, id]
For the sum, if you're using data.table version >= 1.12, you can use data.table::frollmean and multiply the rolling mean by the window size (2) to recover the sum. The default for frollmean is fill = NA, so there is no need to specify it here.
dt[, `:=`(sum_recent2week =
shift(frollmean(value, 2, align = 'left')*2, 2),
max_recent2week =
shift(rollmax(value, 2, align = 'left', fill = NA), 2))
, id]
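A variant without zoo, as a sketch: it assumes a data.table version recent enough to provide frollsum and frollapply, which were added after frollmean. Both default to a right-aligned window, so lagging the result by one row gives the N values before the current row.

```r
library(data.table)

# Starting data from the question
dt <- data.table(id = c("a", "a", "a", "a", "b", "b", "b", "b"),
                 week = c(1, 2, 3, 4, 1, 2, 3, 4),
                 value = c(2, 3, 1, 0, 5, 7, 3, 2))

n <- 2  # window size (N most recent values before the current row)

dt[, `:=`(
  # rolling sum/max over the current and previous n - 1 rows, then lag by one
  sum_recent2week = shift(frollsum(value, n), 1),
  max_recent2week = shift(frollapply(value, n, max), 1)
), by = id]
dt
```

On the sample data this reproduces the desired columns, with NA in the first two rows of each id. Changing n adjusts the window without touching the shift.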
I'm sure it can be done in a much more elegant way, but here is one tidyverse possibility:
dt %>%
group_by(id) %>%
mutate(sum_recent2week = lag(value + lead(value), n = 2),
max_recent2week = pmax(lag(value, n = 2), lag(value, n = 1))) %>%
rowid_to_column() %>%
select(-week, -value) %>%
top_n(-2) %>%
right_join(dt %>%
rowid_to_column(), by = c("rowid" = "rowid",
"id" = "id")) %>%
select(-rowid)
id sum_recent2week max_recent2week week value
<chr> <dbl> <dbl> <dbl> <dbl>
1 a NA NA 1. 2.
2 a NA NA 2. 3.
3 a 5. 3. 3. 1.
4 a 4. 3. 4. 0.
5 b NA NA 1. 5.
6 b NA NA 2. 7.
7 b 12. 7. 3. 3.
8 b 10. 7. 4. 2.
First, it is computing the "sum_recent2week" and "max_recent2week" per group. Second, it selects the last two rows per group. Finally, it merges it with the original data.
Or if you want to compute it for all rows, not just for the last two rows per group:
dt %>%
group_by(id) %>%
mutate(sum_recent2week = lag(value + lead(value), n = 2),
max_recent2week = pmax(lag(value, n = 2), lag(value, n = 1)))
