summaries extracted from data frame info - r

data <-
STUDY ID BASE CYCLE1 DIED PROG
1 1 100 30 No Yes
1 2 NA 20 Yes No
1 3 16 NA Yes Yes
1 4 15 10 Yes Yes
I wanted to make a summary of the following:
how many subjects have both baseline and CYCLE1 value?
Of those in 1, how many had DIED?
Of those in 1, how many had DIED or PROG?
Answers:
2-subjects (50% of subjects) ==> subjects 1 & 4
1-subject (25%) ===> this is subject 4
2-subjects (50%) ==> subjectys 1 & 4
A summary table by STUDY for this would be great (showing the number and percentage).
I am using Rstudio.

If it is based on the first filter
library(dplyr)
library(stringr)
data %>%
group_by(STUDY) %>%
filter(!is.na(BASE) & !is.na(CYCLE1)) %>%
summarise(ID = str_c(ID, collapse=", "),
n1 = n(),
n2 = sum(DIED== "Yes"),
n3 = sum(DIED == "Yes"|PROG == "Yes"))
# A tibble: 1 x 5
# STUDY ID n1 n2 n3
# <int> <chr> <int> <int> <int>
#1 1 1, 4 2 1 2
if we need the percentage as well
out <- data %>%
group_by(STUDY) %>%
mutate(i1 = !is.na(BASE) & !is.na(CYCLE1),
perc1 = 100 * mean(i1),
n1 = sum(i1),
i2 = DIED == "Yes" & i1,
perc2 = 100 * mean(i2),
n2 = sum(i2),
i3 = (DIED == "Yes"|PROG == "Yes") & i1,
perc3 = 100 * mean(i3),
n3 = sum(i3)) %>%
filter(i1) %>%
select(STUDY, ID, matches("perc"), matches("n")) %>%
mutate(ID = toString(ID)) %>%
slice(1)
# A tibble: 1 x 8
# Groups: STUDY [1]
# STUDY ID perc1 perc2 perc3 n1 n2 n3
# <int> <chr> <dbl> <dbl> <dbl> <int> <int> <int>
#1 1 1, 4 50 25 50 2 1 2
It can be further modified to format the output
library(tidyr) # 0.8.3.9000
out %>%
pivot_longer(cols = perc1:n3, names_to = c( "perc", "n"),
names_sep = "(?<=[a-z])(?=[0-9])") %>%
group_by(STUDY, ID, n) %>%
summarise(value = sprintf("%d (%d%%)", last(value), first(value))) %>%
select(-n)

Related

Summarize one variable/column over all possible values of other variables/columns

I need to summarize one variable/column of a long table after aggregating (group_by()) by another variable/column, I need to have the summarized value by all values of other variables/columns.
Here is test data:
library(tidyverse)
set.seed(123)
Site <- str_c("S", 1:5)
Species <- str_c("Sps", 1:6)
print(Species_tbl <- bind_cols(Species = Species,
Exotic = rbinom(length(Species), 1, .3),
Migrant = rbinom(length(Species), 2, .3)))
Data_tbl <- expand.grid(Site = Site,
Species = Species) %>%
left_join(Species_tbl)
Data_tbl$Presence <- rbinom(nrow(Data_tbl), 1, .5)
And here is my best effort:
print(Data_tbl %>%
group_by(Site) %>%
summarise(N_sp = sum(Presence),
N_sp_Exo = sum(Presence[Exotic == 1]),
N_sp_Nat = sum(Presence[Exotic == 0]),
N_sp_M0 = sum(Presence[Migrant == 0]),
N_sp_M1 = sum(Presence[Migrant == 1]),
N_sp_M2 = sum(Presence[Migrant == 2])))
You can get the data in long format for your columns of interest c(Exotic, Migrant) and take sum of Presence columns for each unique column names and it's values. This can be merged with sum of each Site.
library(dplyr)
library(tidyr)
data1 <- Data_tbl %>%
group_by(Site) %>%
summarise(N_sp = sum(Presence))
data2 <- Data_tbl %>%
pivot_longer(cols = c(Exotic, Migrant)) %>%
group_by(Site, name, value) %>%
summarise(result = sum(Presence), .groups = "drop") %>%
pivot_wider(names_from = c(name, value), values_from = result)
inner_join(data1, data2, by = 'Site')
# Site N_sp Exotic_0 Exotic_1 Migrant_0 Migrant_1 Migrant_2
# <fct> <int> <int> <int> <int> <int> <int>
#1 S1 4 2 2 1 2 1
#2 S2 3 2 1 0 2 1
#3 S3 2 1 1 0 2 0
#4 S4 4 2 2 1 3 0
#5 S5 4 1 3 1 2 1
The answer has been divided in two steps for ease of readability. If you would like to do this in a single chain without creating temporary variables that can be done as well.

Comparing values of pairs of columns whose names are suffixed by numbers (R / tidyverse)

I would like to compare the values of pairs of columns whose names start with certain characters but have same suffix in their column name. In the example below, I want to compare q_1 with v_1 and q_2 with v_2 and I want for each pair compared a new column that indicates if the compared columns are equal, i.e. equal_1 and equal_2:
q_1 v_1 q_2 v_2 equal_1 equal_2
0 1 1 0 NO NO
1 1 0 0 YES YES
(above sample data is simplified, in the original dataframe the suffixes go up to 200 and there are a lot of other variables that are suffixed (like i_1 … i_100), so the solution has be specific to the variables wanted.)
My code so far doesn't return the expected results, any hint much appreciated! An answer should use a tidyverse approach.
df <- data.frame(
ID = c(1, 2),
q_1 = c(0,1),
v_1 = c(1,1),
q_2 = c(1,0),
v_2 = c(0,0)
)
df2 <- df %>%
mutate(across(starts_with('q'), ~if_else(.x == sub("q", "v", .x), 'YES', 'NO'), .names = '{sub("q", "equal", .col)}'))
print(df2, quote = FALSE, row.names = FALSE)
I would first pivot the data, giving you just two columns q and v, labeled row-wise by pair. Then it's trivial to compare the two columns.
library(tidyverse)
df_pivoted <- df %>%
pivot_longer(
!ID,
names_to = c(".value", "pair"),
names_sep = "_"
) %>%
mutate(equal = if_else(q == v, "YES", "NO"))
# # A tibble: 4 x 5
# ID pair q v equal
# <dbl> <chr> <dbl> <dbl> <chr>
# 1 1 1 0 1 NO
# 2 1 2 1 0 NO
# 3 2 1 1 1 YES
# 4 2 2 0 0 YES
Whatever you're doing downstream may also be easier in long format; but you can also pivot back to wide:
df2 <- df_pivoted %>%
pivot_wider(names_from = pair, values_from = q:equal)
# # A tibble: 2 x 7
# ID q_1 q_2 v_1 v_2 equal_1 equal_2
# <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
# 1 1 0 1 1 0 NO NO
# 2 2 1 0 1 0 YES YES
I think it would be easier to pivot to a longer format first and then pivot back to the desired output:
library(tidyr)
library(dplyr)
df %>%
pivot_longer(matches("[a-z]_\\d"),
names_sep = "_",
names_to = c(".value", "pair")) %>%
rowwise() %>%
mutate(equal = n_distinct(c_across(q:v)) == 1) %>%
pivot_wider(names_from = pair,
values_from = q:last_col())
You might need to change some of the tidyselect options used for your more complex data.
Output
ID q_1 q_2 v_1 v_2 equal_1 equal_2
<dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl>
1 1 0 1 1 0 FALSE FALSE
2 2 1 0 1 0 TRUE TRUE
How about a for loop? for the suffixe numbers :
for(i in unique(gsub("[^0-9]", "", names(df)[-1]))){
df <- df %>%
mutate(across(ends_with(i)[1],
~if_else(df[,paste0("q_", i)] == df[,paste0("v_", i)], "YES", "NO"),
.names = '{sub("._", "equal_", .col)}'))
}
> df
# ID q_1 v_1 q_2 v_2 equal_1 equal_2
# 1 1 0 1 1 0 NO NO
# 2 2 1 1 0 0 YES YES
Or in base R, bc it really writes itself better like that :
for(i in unique(strtoi(gsub("[^0-9]","",names(df))))[-1]){
df[,paste0("equal_", i)] <- ifelse(
df[,paste0("q_", i)] == df[,paste0("v_", i)],
"YES", "NO"
)
}
> df
# ID q_1 v_1 q_2 v_2 equal_1 equal_2
# 1 1 0 1 1 0 NO NO
# 2 2 1 1 0 0 YES YES

For loop with if else

i have a df which a part of is similar to the following
| Number|Category| A1|A2|B1|B2|C1|C2|A |B |C |
| ------| -------|---|--|--|--|--|--|--|--|--|
| 1 | 1 | 10|30|5 |15|NA|NA|5 |10|NA|
| 2 | 2 | 10|30|5 |15|25|35|40|20|45|
The conditions are
A1 & A2, B1 & B2, C1 & C2 are the lower and upper limits respectively, of the factors A, B, C
and columns A, B, C represent the measurements.
If the measurement is under the lower limit the factor is "passed",
if it is in between the two limits, then the factor is in "danger",
if the measurement is higher than the higher limit then it is "failed".
For the category=1 we are permitted to have only 1 failure in one of the factors and in that case we classify the asset as "risk",
but if we have 2 failures then the asset in the row 1 "fail".
For Category=2 permitted 2 failures. If one factor fails is at "at risk", if we have 2 failures is "risk" and we have 3 failures then its "fail".
So I would like to calculate for every row(asset) the status of every factor and then the status of the asset. I am trying to that with a for loop and an if-else statement that iterates through all these columns of every row but seems difficult as I am a beginner. The final result is to attach the following columns to the dataset. Thank you in advance
|Number|Aa |Bb |Cc |Result |
|------|------|------|------|-------|
|1 |passed|danger|NA | risk |
|2 |failed|failed|failed| failed|
This can be done in dplyr only without even reshaping the data or using any loop (for/while). Using across, cur_data() and cur_column() which are certainly powerful functions from dplyr.
library(dplyr, warn.conflicts = F)
df
#> Number Category A1 A2 B1 B2 C1 C2 A B C
#> 1 1 1 10 30 5 15 NA NA 5 10 NA
#> 2 2 2 10 30 5 15 25 35 40 20 45
df %>% group_by(Number, Category) %>%
transmute(across(c('A', 'B', 'C'), ~ case_when(is.na(.) | is.na(get(paste0(cur_column(), 1))) |
is.na(get(paste0(cur_column(), 2))) ~ NA_character_,
. < get(paste0(cur_column(), 1)) ~ 'passed',
. <= get(paste0(cur_column(), 2)) ~ 'danger',
TRUE ~ 'failed'),
.names = '{.col}{tolower(.col)}')) %>%
mutate(Result = ifelse(rowSums(cur_data() == 'failed', na.rm = T) <= Category, 'risk', 'failed'))
#> # A tibble: 2 x 6
#> # Groups: Number, Category [2]
#> Number Category Aa Bb Cc Result
#> <int> <int> <chr> <chr> <chr> <chr>
#> 1 1 1 passed danger <NA> risk
#> 2 2 2 failed failed failed failed
Created on 2021-07-06 by the reprex package (v2.0.0)
You can also use the following solution which is a combination of base R and tidyverse:
library(dplyr)
library(purrr)
colnames <- c(1, 2)
tmp <- df[-colnames]
lapply(split.default(tmp, gsub("(\\w)\\d+?", "\\1", names(tmp))),
function(x) cbind(df[colnames], x)) %>%
imap(~ .x %>%
mutate(!!{.y} := pmap_chr(., ~
ifelse(any(is.na(..3), is.na(..4), is.na(..5)), "NA",
ifelse(..5 > ..3 & ..5 < ..4, "danger", ifelse(..5 < ..3, "passed", "failed"))))) %>%
select(-c(3, 4))) %>%
reduce(~ full_join(..1, ..2, id = c("Number", "Category"))) %>%
rowwise() %>%
mutate(Result = case_when(
Category == 1 & sum(c_across(A:C) == "failed") <= 1 ~ "Risk",
Category == 1 & sum(c_across(A:C) == "failed") > 1 ~ "Fail",
Category == 2 & sum(c_across(A:C) == "failed") == 1 ~ "At_Risk",
Category == 2 & sum(c_across(A:C) == "failed") == 2 ~ "Risk",
Category == 2 & sum(c_across(A:C) == "failed") == 3 ~ "Fail"
))
# A tibble: 2 x 6
# Rowwise:
Number Category A B C Result
<dbl> <dbl> <chr> <chr> <chr> <chr>
1 1 1 passed danger NA Risk
2 2 2 failed failed failed Fail
Much of your problem is caused by the untidy nature of your data frame. I started to provide solutions based on both your untidy data and a tidy equivalent, but the untidy solution, whilst possible, became just too painful.
So, here's a solution based on a tidy equivalent of your data frame.
First, make it tidy. The reason your data frame is untidy is that your column names contain information, namely that A1 and A2 contain the acceptance limits for values in A, and so on. We can correct this by making the data frame longer.
The process is a little long because of the extent of the untidyness of the original. It might be possible to create a more compact version of the transformation using, say, names_pattern and other advanced arguments to pivot_longer(), but the long version at least has the benefit of clarity.
longDF <- df %>%
select(Number, Category, A, B, C) %>%
pivot_longer(
c(-Category, -Number),
names_to="Variable",
values_to="Value"
) %>%
left_join(
df %>%
select(Number, Category, A1, B1, C1) %>%
pivot_longer(
c(-Category, -Number),
names_to="Variable",
values_to="Lower"
) %>%
mutate(Variable=str_sub(Variable, 1, 1)),
by=c("Number", "Category", "Variable")
) %>%
left_join(
df %>%
select(Number, Category, A2, B2, C2) %>%
pivot_longer(
c(-Category, -Number),
names_to="Variable",
values_to="Upper"
) %>%
mutate(Variable=str_sub(Variable, 1, 1)),
by=c("Number", "Category", "Variable")
)
longDF
# A tibble: 6 x 6
Number Category Variable Value Lower Upper
<dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 1 A 5 10 30
2 1 1 B 10 5 15
3 1 1 C NA NA NA
4 2 2 A 40 10 30
5 2 2 B 20 5 15
6 2 2 C 45 25 35
So at this point, we have columns that define the Category of the test, the Variable being measured, its Value and the two acceptance limits (Lower and Upper).
Now, determining the acceptability of each Value is straightforward.
longDF <- longDF %>%
mutate(
Result=ifelse(
Value < Lower,
"Pass",
ifelse(Value < Upper, "Danger", "Fail")
)
)
longDF
# A tibble: 6 x 7
Number Category Variable Value Lower Upper Result
<dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
1 1 1 A 5 10 30 Pass
2 1 1 B 10 5 15 Danger
3 1 1 C NA NA NA NA
4 2 2 A 40 10 30 Fail
5 2 2 B 20 5 15 Fail
6 2 2 C 45 25 35 Fail
Also, note that the categorisation of each value is independent of both the Variable and the number of possible variables. So the code is robust in these respects.
Now we can categorise the results by Number and Category.
longDF %>%
group_by(Number, Category, Result) %>%
summarise(N=n(), .groups="drop") %>%
pivot_wider(
names_from=Result,
values_from=N,
values_fill=0
)
# A tibble: 2 x 7
Number Category Danger Pass `NA` Fail
<dbl> <dbl> <int> <int> <int> <int>
1 1 1 1 1 1 0
2 2 2 0 0 0 3
Again, we are robust with respect to both the number of Categorys and Numbers, and their labels.
Evaluating the overall results is also straightforward, but slightly long winded because of the various options. Note that your text is inconsistent with the desired output because you haven't explained how an overall result of "warn" for Category = 1 is obtained. I've gone with the text. if you want to match the sample output, the changes to the code should be simple once the criteria are defined.
longDF %>%
group_by(Number, Category, Result) %>%
summarise(N=n(), .groups="drop") %>%
pivot_wider(
names_from=Result,
values_from=N,
values_fill=0
) %>%
mutate(
Result=ifelse(
Category == 1,
ifelse(Fail == 0, "Pass", ifelse(Fail == 1, "Risk", "Fail")),
ifelse(Fail < 2, "Pass", ifelse(Fail == 2, "Risk", "Fail"))
)
)
# A tibble: 2 x 7
Number Category Danger Pass `NA` Fail Result
<dbl> <dbl> <int> <int> <int> <int> <chr>
1 1 1 1 1 1 0 Pass
2 2 2 0 0 0 3 Fail
If you need to know which Variable caused potential failures, that can also be obtained from longDF with a small change to the grouping.
longDF %>%
group_by(Category, Variable, Result) %>%
summarise(N=n(), .groups="drop") %>%
pivot_wider(
names_from=Variable,
values_from=Result
)
# A tibble: 2 x 5
Category N A B C
<dbl> <int> <chr> <chr> <chr>
1 1 1 Pass Danger NA
2 2 1 Fail Fail Fail
And, of course, you could join these two data frames together to get a comprehensive description of both the overall results and the component variable assessments.

Adding a Proportion Column with Dplyr

Let's say I had the following data frame, that was also altered to include counts of a,b, and c, based on whether or not they are classified by Z = 0 or 1
X <- (1:10)
Y<- c('a','b','a','c','b','b','a','a','c','c')
Z <- c(0,1,1,1,0,1,0,1,1,1)
test_df <- data.frame(X,Y,Z)
(the code below was provided by a stack exchange member, thank you!)
res <- test_df %>% group_by(Y,Z) %>% summarise(N=n()) %>%
pivot_wider(names_from = Z,values_from=N,
values_fill = 0)
How might I add a column on the right which would indicate the proportion of each of the letters for which z=1, out of all appearances of that letter? It would seem that a basic summary statement should work but I figure out how...
My expected output would be something like
Z=0 Z=1 PropZ=1
a 2 2 .5
b 1 2 .66
c 0 3 1
Perhaps this helps
library(dplyr)
library(tidyr)
test_df %>%
group_by(Y, Z) %>%
summarise(N = n(), .groups = 'drop') %>%
left_join(test_df %>%
group_by(Y) %>%
summarise(Prop = mean(Z == 1), .groups = 'drop')) %>%
pivot_wider(names_from = Z, values_from = N, values_fill = 0)
-output
# A tibble: 3 x 4
# Y Prop `0` `1`
# <chr> <dbl> <int> <int>
#1 a 0.5 2 2
#2 b 0.667 1 2
#3 c 1 0 3
test_df %>% group_by(Y) %>%
summarise( z0 = sum(Z == 0), z1 = sum(Z == 1) , PropZ = z1/n())
I am not sure if what is your expected output, but below might be some options
u <- xtabs(q ~ Y + Z, cbind(test_df, q = 1))
> u
Z
Y 0 1
a 2 2
b 1 2
c 0 3
or
> prop.table(u)
Z
Y 0 1
a 0.2 0.2
b 0.1 0.2
c 0.0 0.3
To calculate proportions of 1 for each letter you can use rowSums.
transform(res, prop_1 = `1`/rowSums(res[-1]))
In dplyr :
library(dplyr)
res %>%
ungroup %>%
mutate(prop_1 = `1`/rowSums(.[-1]))
# Y `0` `1` prop_1
# <chr> <int> <int> <dbl>
#1 a 2 2 0.5
#2 b 1 2 0.667
#3 c 0 3 1

Improving a dplyr solution -- Create a variable by conditional ordering (position) based on other information

I'm working on a dataset where every participant (ID) was evaluated 1, 2 or 3 times. It's a longitudinal study. Unfortunately, when the first analyst coded the dataset, she/he did not assign any information about that.
Because all participant have age information (in months), it's easy to identify when was the first evaluation, when was the second and so on. In the first evaluation, the participant was younger than the second and so on.
I used tidyverse tools to deal with that and everything is working. Howerver,I really know (imagine...) there is many other (much more) elegant solution, and I came to this forum to ask for that. Could someone give me thoughts about how to make this code shorter and clear?
This is a fake data to reproduce the code:
ds <- data.frame(id = seq(1:6),
months = round(rnorm(18, mean=12, sd=2),0),
x1 = sample(0:2),
x2 = sample(0:2),
x3 = sample(0:2),
x4 = sample(0:2))
#add how many times each child was acessed
ds <- ds %>% group_by(id) %>% mutate(how_many = n())
#Add position
ds %>% group_by(id) %>%
mutate(first = min(months),
max = max(months),
med = median(months)) -> ds
#add label to the third evaluation (the second will be missing)
ds %>%
mutate(group = case_when((how_many == 3) & (months %in% first) ~ "First evaluation",
(how_many == 3) & (months %in% max) ~ "Third evaluation",
TRUE ~ group)) -> ds
#add label to the second evaluation for all children evaluated two times
ds %>% mutate_at(vars(group), funs(if_else(is.na(.),"Second Evaluation",.))) -> ds
This is my original code:
temp <- dataset %>% select(idind, arm, infant_sex,infant_age_months)
#add how many times each child was acessed
temp <- temp %>% group_by(idind) %>% mutate(how_many = n())
#Add position
temp %>% group_by(idind) %>%
mutate(first = min(infant_age_months),
max = max(infant_age_months),
med = median(infant_age_months)) -> temp
#add label to the first evaluation
temp %>%
mutate(group = case_when(how_many == 1 ~ "First evaluation")) -> temp
#add label to the second evaluation (and keep all previous results)
temp %>%
mutate(group = case_when((how_many == 2) & (infant_age_months %in% first) ~ "First evaluation",
(how_many == 2) & (infant_age_months %in% max) ~ "Second evaluation",
TRUE ~ group)) -> temp
#add label to the third evaluation (the second will be missing)
temp %>%
mutate(group = case_when((how_many == 3) & (infant_age_months %in% first) ~ "First evaluation",
(how_many == 3) & (infant_age_months %in% max) ~ "Third evaluation",
TRUE ~ group)) -> temp
#add label to the second evaluation for all children evaluated two times
temp %>% mutate_at(vars(group), funs(if_else(is.na(.),"Second Evaluation",.))) -> temp
Please, keep in mind I used search box before asking that and I really imagine other people can figure the same question when programing.
Thanks much
There you go. I used rank() to give the order of the treatments.
ds <- data.frame(id = seq(1:6),
months = round(rnorm(18, mean=12, sd=2),0),
x1 = sample(0:2),
x2 = sample(0:2),
x3 = sample(0:2),
x4 = sample(0:2))
ds2 = ds %>% group_by(id) %>% mutate(rank = rank(months,ties.method="first"))
labels = c("First", "Second","Third")
ds2$labels = labels[ds2$rank]
Or just arrange by age and use 1:n() instead of n(), which creates a sequence:
ds <- ds %>% group_by(id) %>% arrange(months) %>% mutate(how_many = 1:n())
ds %>% arrange(id, months)
# A tibble: 18 x 7
# Groups: id [6]
id months x1 x2 x3 x4 how_many
<int> <dbl> <int> <int> <int> <int> <int>
1 1 10 1 2 0 1 1
2 1 11 1 2 0 1 2
3 1 12 1 2 0 1 3
4 2 11 0 1 2 2 1
5 2 14 0 1 2 2 2
6 2 14 0 1 2 2 3
You can then use factor to attach a label, if you wish.
ds$label <- factor(ds$how_many, level = 1:3, label = c("First", "Second","Third"))
head(ds)
# A tibble: 18 x 8
# Groups: id [6]
id months x1 x2 x3 x4 how_many label
<int> <dbl> <int> <int> <int> <int> <int> <fct>
1 1 10 1 2 0 1 1 First
2 1 11 1 2 0 1 2 Second
3 1 12 1 2 0 1 3 Third
4 2 11 0 1 2 2 1 First
5 2 14 0 1 2 2 2 Second
6 2 14 0 1 2 2 3 Third

Resources