Sum NA cases in dplyr's summarise

I can't work out what I am doing wrong when summarising a column that contains both values and NAs. I have read in many places that you can count cases in summarise with sum(), and that NA cases can be counted with sum(is.na(variable)).
Indeed, I can reproduce that behaviour with a test tibble:
df <- tibble(x = c(rep("a",5), rep("b",5)), y = c(NA, NA, 1, 1, NA, 1, 1, 1, NA, NA))
df %>%
group_by(x) %>%
summarise(one = sum(y, na.rm = T),
na = sum(is.na(y)))
And this is the expected result:
# A tibble: 2 x 3
x one na
<chr> <dbl> <int>
1 a 2 3
2 b 3 2
For some reason, I cannot reproduce the result with my data:
mydata <- structure(list(Group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Amphibians",
"Birds", "Mammals", "Reptiles", "Plants"), class = "factor"),
Scenario = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L,
1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("Present",
"RCP 4.5", "RCP 8.5"), class = "factor"), year = c(1940,
1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940, 1940,
1940, 1940, 1940, 1940, 1940, 1940, 1940), random = c("obs",
"obs", "obs", "obs", "obs", "obs", "obs", "obs", "obs", "obs",
"obs", "obs", "obs", "obs", "obs", "obs", "obs", "obs"),
species = c("Allobates fratisenescus", "Allobates fratisenescus",
"Allobates fratisenescus", "Allobates juanii", "Allobates juanii",
"Allobates juanii", "Allobates kingsburyi", "Allobates kingsburyi",
"Allobates kingsburyi", "Adelophryne adiastola", "Adelophryne adiastola",
"Adelophryne adiastola", "Adelophryne gutturosa", "Adelophryne gutturosa",
"Adelophryne gutturosa", "Adelphobates quinquevittatus",
"Adelphobates quinquevittatus", "Adelphobates quinquevittatus"
), Endemic = c(1, 1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA,
NA, NA, NA, NA, NA)), row.names = c(NA, -18L), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), vars = "species", indices = list(
9:11, 12:14, 15:17, 0:2, 3:5, 6:8), group_sizes = c(3L, 3L,
3L, 3L, 3L, 3L), biggest_group_size = 3L, labels = structure(list(
species = c("Adelophryne adiastola", "Adelophryne gutturosa",
"Adelphobates quinquevittatus", "Allobates fratisenescus",
"Allobates juanii", "Allobates kingsburyi")), row.names = c(NA,
-6L), class = "data.frame", vars = "species", .Names = "species"), .Names = c("Group",
"Scenario", "year", "random", "species", "Endemic"))
(my data has several millions of rows, I reproduce here only a part of it)
Testsum <- mydata %>%
group_by(Group, Scenario, year, random) %>%
summarise(All = n(),
Endemic = sum(Endemic, na.rm = T),
noEndemic = sum(is.na(Endemic)))
# A tibble: 3 x 7
# Groups: Group, Scenario, year [?]
Group Scenario year random All Endemic noEndemic
<fctr> <fctr> <dbl> <chr> <int> <dbl> <int>
1 Amphibians Present 1940 obs 6 3 0
2 Amphibians RCP 4.5 1940 obs 6 3 0
3 Amphibians RCP 8.5 1940 obs 6 3 0
!!!!
I expected noEndemic to be 3 in all cases, as Endemic is NA for 3 of the species...
I double-checked the column type:
Test3$Endemic %>% class
[1] "numeric"
Obviously, there is something very stupid that I am not seeing, even after several hours of messing around. Is it obvious to any of you? Thanks!

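The reason for this behavior is that summarise() evaluates its arguments in order. Once Endemic has been assigned as a new summarised variable, the later is.na(Endemic) sees that single sum rather than the original column, and the sum is never NA. Instead, give the summary a new column name and rename it afterwards: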
mydata %>%
group_by(Group, Scenario, year, random) %>%
summarise(All = n(),
EndemicS = sum(Endemic, na.rm = TRUE),
noEndemic = sum(is.na(Endemic))) %>%
rename(Endemic = EndemicS)
# A tibble: 3 x 7
# Groups: Group, Scenario, year [3]
# Group Scenario year random All Endemic noEndemic
# <fctr> <fctr> <dbl> <chr> <int> <dbl> <int>
#1 Amphibians Present 1940 obs 6 3 3
#2 Amphibians RCP 4.5 1940 obs 6 3 3
#3 Amphibians RCP 8.5 1940 obs 6 3 3
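Another option, sketched here on a small made-up tibble (the names toy, g and v are illustrative only, not from the question), is to count the NAs before the column is overwritten. This relies on summarise() evaluating its arguments in order, so the NA count still sees the original column:

```r
library(dplyr)

# Hypothetical toy data: one value and one NA in group "a", one NA in group "b"
toy <- tibble(g = c("a", "a", "b"), v = c(1, NA, NA))

res <- toy %>%
  group_by(g) %>%
  summarise(
    n_na = sum(is.na(v)),       # computed first, on the original column
    v    = sum(v, na.rm = TRUE) # overwriting v is safe once n_na exists
  )

res
```

With this ordering no rename() step is needed, at the cost of the column order in the result.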

Related

Need help making first publication table

I am writing up my first paper. I have a data frame that contains the study, the symptoms, and the odds ratio that was analyzed for each symptom in each study. For example:
df <- structure(list(Study = c("Study1", "Study2", "Study1", "Study2", "Study1", "Study2"), Symptom = c("Symptom1", "Symptom1", "Symptom2", "Symptom2", "Symptom3", "Symptom3"), OR = c(1L, 0L, 1L, 0L, 1L, 0L), lower = c(-2L, -1L, -2L, -1L, -2L, -1L), upper = c(2L, 1L, 2L, 1L, 2L, 1L)), row.names = c(NA, -6L), class = "data.frame")
I am wondering how to make a table for publication/what package to use that transforms the data and then prints a table that would look like:
df2 <- structure(list(Symptom = c("Symptom1", "Symptom2", "Symptom3"), Study1 = c("1(-2,2)", "1(-2,2)", "1(-2,2)"), Study2 = c("0(-1,1)", "0(-1,1)", "0(-1,1)")), row.names = c(NA, -3L), class = "data.frame")
Thanks for the help!
library(dplyr)
library(tidyr)
df %>%
transmute(Study, Symptom, x = sprintf("%i(%i,%i)", OR, lower, upper)) %>%
pivot_wider(id_cols = Symptom, names_from = Study, values_from = x)
# # A tibble: 3 x 3
# Symptom Study1 Study2
# <chr> <chr> <chr>
# 1 Symptom1 1(-2,2) 0(-1,1)
# 2 Symptom2 1(-2,2) 0(-1,1)
# 3 Symptom3 1(-2,2) 0(-1,1)

How do I get accuracy values by group [duplicate]

I can't get the average accuracy (the proportion of TRUE values in the Correct_answer column) for the groups defined by chart type and condition.
data
structure(list(Element = structure(c(1L, 1L, 1L, 1L, 1L), .Label = c("1",
"2", "3", "4", "5", "6"), class = "factor"), Correct_answer = structure(c(2L,
2L, 2L, 1L, 2L), .Label = c("FALSE", "TRUE"), class = "factor"),
Response_time = c(25.155, 6.74, 28.649, 16.112, 105.5906238
), Chart_type = structure(c(2L, 2L, 1L, 1L, 1L), .Label = c("Box",
"Violin"), class = "factor"), Condition = structure(c(1L,
2L, 1L, 2L, 1L), .Label = c("0", "1"), class = "factor")), row.names = c(NA,
5L), class = "data.frame")
Average by chart_type
av_data_chartType <- data %>% group_by(Chart_type) %>% summarise_each(funs(mean, sd))
Average by condition
av_data_conition <- data %>% group_by(Condition) %>% summarise_each(funs(mean, sd))
No mean is produced for accuracy: an NA value appears where the accuracy should be.
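Reproducing your code, I got a warning that led me to the answer: you shouldn't compute statistics on factor variables. If you know what you are doing, you can convert them to numeric, but note that as.numeric() on a factor returns the integer level codes (here 1 for FALSE and 2 for TRUE), so the means below are means of codes rather than proportions: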
data <- structure(list(Element = structure(c(1L, 1L, 1L, 1L, 1L),
.Label = c("1", "2", "3", "4", "5", "6"),
class = "factor"),
Correct_answer = structure(c(2L, 2L, 2L, 1L, 2L),
.Label = c("FALSE", "TRUE"),
class = "factor"),
Response_time = c(25.155, 6.74, 28.649, 16.112, 105.5906238
),
Chart_type = structure(c(2L, 2L, 1L, 1L, 1L),
.Label = c("Box",
"Violin"),
class = "factor"),
Condition = structure(c(1L, 2L, 1L, 2L, 1L),
.Label = c("0", "1"),
class = "factor")),
row.names = c(NA, 5L), class = "data.frame")
library("dplyr", warn.conflicts = FALSE)
data <- data %>% as_tibble
# av_data_chartType
data %>%
group_by(Chart_type) %>%
mutate_if(.predicate = is.factor, .funs = as.numeric) %>%
summarise_each(list( ~mean, ~sd))
#> `mutate_if()` ignored the following grouping variables:
#> Column `Chart_type`
#> # A tibble: 2 x 9
#> Chart_type Element_mean Correct_answer_~ Response_time_m~ Condition_mean
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 Box 1 1.67 50.1 1.33
#> 2 Violin 1 2 15.9 1.5
#> # ... with 4 more variables: Element_sd <dbl>, Correct_answer_sd <dbl>,
#> # Response_time_sd <dbl>, Condition_sd <dbl>
# av_data_condition
data %>%
group_by(Condition) %>%
mutate_if(.predicate = is.factor, .funs = as.numeric) %>%
summarise_each(list( ~mean, ~sd))
#> `mutate_if()` ignored the following grouping variables:
#> Column `Condition`
#> # A tibble: 2 x 9
#> Condition Element_mean Correct_answer_~ Response_time_m~ Chart_type_mean
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 0 1 2 53.1 1.33
#> 2 1 1 1.5 11.4 1.5
#> # ... with 4 more variables: Element_sd <dbl>, Correct_answer_sd <dbl>,
#> # Response_time_sd <dbl>, Chart_type_sd <dbl>
Created on 2019-06-11 by the reprex package (v0.2.1)
This should work (a is your data frame):
a$Correct_answer <- as.logical(a$Correct_answer)
av_data_chartType <- a %>% select(Chart_type, Correct_answer) %>% group_by(Chart_type) %>% summarise_each(funs(mean, sd))
av_data_condition <- a %>% select(Condition, Correct_answer) %>% group_by(Condition) %>% summarise_each(funs(mean, sd))
You had 2 problems:
Your Correct_answer column was a factor.
You tried to apply your functions to every column.
You probably need
library(dplyr)
data %>%
mutate(Correct_answer = as.logical(Correct_answer)) %>%
group_by(Chart_type, Condition) %>%
summarise(avg = mean(Correct_answer))
Or if you need them separately
data %>%
mutate(Correct_answer = as.logical(Correct_answer)) %>%
group_by(Chart_type) %>%
summarise(avg = mean(Correct_answer))
data %>%
mutate(Correct_answer = as.logical(Correct_answer)) %>%
group_by(Condition) %>%
summarise(avg = mean(Correct_answer))
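To see why the choice of conversion matters, here is a small base R sketch on a made-up factor (x is illustrative, not the question's data). as.numeric() returns the level codes, whereas as.logical() goes through the level labels, so only the latter yields a proportion of TRUE values:

```r
# Hypothetical factor with levels "FALSE" (code 1) and "TRUE" (code 2)
x <- factor(c("TRUE", "TRUE", "FALSE", "TRUE"))

codes <- as.numeric(x)  # level codes: 2 2 1 2
flags <- as.logical(x)  # uses the labels: TRUE TRUE FALSE TRUE

mean(codes)  # 1.75, a mean of codes, not an accuracy
mean(flags)  # 0.75, the proportion of TRUE values
```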

squashing multiple rows by time difference

Assume these are a few timestamped observations in a dataset:
Id Status DateCreated Group
10 Read 2017-11-04 18:24:55 Red
10 Write 2017-11-04 18:24:56 Red
10 Review 2017-11-04 18:25:16 Red
10 Read 2017-11-04 18:26:17 Red
10 Write 2017-11-04 18:26:47 Red
How do I collapse rows that are within 1 minute of each other?
For example, rows 1,2,3 are collapsed into 1 row and rows 4 and 5 are collapsed into second row.
The expected output would look like this:
Id Status DateCreated Date Ended Group
10 Read,Write,Review 2017-11-04 18:24:55 2017-11-04 18:25:16 Red, Red, Red
10 Read,Write 2017-11-04 18:26:17 2017-11-04 18:26:47 Red, Red
Here is the code to reproduce the test dataset in this example:
df <- structure(list(Id = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "10", class = "factor"),
Status = structure(c(1L, 3L, 2L, 1L, 3L), .Label = c("Read",
"Review", "Write"), class = "factor"), DateCreated = structure(1:5, .Label = c("2017-11-04 18:24:55",
"2017-11-04 18:24:56", "2017-11-04 18:25:16", "2017-11-04 18:26:17",
"2017-11-04 18:26:47"), class = "factor"), Group = structure(c(1L,
1L, 1L, 1L, 1L), .Label = "Red", class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
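I would do something like this (note that it groups by clock minute rather than by gaps of under one minute, and that Date_ended has to be computed before DateCreated is overwritten):
library(dplyr)
library(lubridate)
df %>%
mutate(DateCreated = ymd_hms(DateCreated)) %>%
group_by(minute(DateCreated)) %>%
arrange(DateCreated) %>%
summarise(Status = paste(Status, collapse = ", "),
Date_ended = last(DateCreated),
DateCreated = first(DateCreated),
Group = paste(Group, collapse = ", "))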
library(lubridate)
library(dplyr)
library(purrr)
df <-
structure(
list(
Id = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "10", class = "factor"),
Status = structure(
c(1L, 3L, 2L, 1L, 3L),
.Label = c("Read",
"Review", "Write"),
class = "factor"
),
DateCreated = structure(
1:5,
.Label = c(
"2017-11-04 18:24:55",
"2017-11-04 18:24:56",
"2017-11-04 18:25:16",
"2017-11-04 18:26:17",
"2017-11-04 18:26:47"
),
class = "factor"
),
Group = structure(c(1L,
1L, 1L, 1L, 1L), .Label = "Red", class = "factor")
),
class = "data.frame",
row.names = c(NA,-5L)
)
df2 <-
df %>%
mutate(DateCreated = as_datetime(df$DateCreated)) %>%
arrange(DateCreated) %>%
mutate(diff = DateCreated - lag(DateCreated))
df2$diff[1] <- 0L
g <- 0
df3 <- mutate(df2, date_groups =
accumulate(df2$diff, function(x, y)
if (y - x < 60)
g
else {
g <<- g + 1
})) %>%
group_by(date_groups) %>%
summarise(
Status = paste(Status, collapse = ", "),
DateCreated = DateCreated[1],
Date_ended = last(DateCreated),
Group = paste(Group, collapse = ", ")
)
df3
#> # A tibble: 2 x 5
#> date_groups Status DateCreated Date_ended Group
#> <dbl> <chr> <dttm> <dttm> <chr>
#> 1 0 Read, Write… 2017-11-04 18:24:55 2017-11-04 18:24:55 Red, Re…
#> 2 1 Read, Write 2017-11-04 18:26:17 2017-11-04 18:26:17 Red, Red
Created on 2019-01-28 by the reprex package (v0.2.1)
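In the output above, Date_ended equals DateCreated because DateCreated is overwritten with its first value before last(DateCreated) is evaluated. A minimal alternative sketch, assuming "within 1 minute" means a gap of at most 60 seconds from the previous row, starts a new block whenever the gap exceeds that threshold and computes the end time before overwriting the start time. The column names gap, block and DateEnded are introduced here for illustration:

```r
library(dplyr)

df <- data.frame(
  Id = "10",
  Status = c("Read", "Write", "Review", "Read", "Write"),
  DateCreated = as.POSIXct(
    c("2017-11-04 18:24:55", "2017-11-04 18:24:56", "2017-11-04 18:25:16",
      "2017-11-04 18:26:17", "2017-11-04 18:26:47"),
    tz = "UTC"
  ),
  Group = "Red"
)

res <- df %>%
  arrange(Id, DateCreated) %>%
  group_by(Id) %>%
  mutate(
    # seconds since the previous row of the same Id (NA for the first row)
    gap = as.numeric(difftime(DateCreated, lag(DateCreated), units = "secs")),
    # a new block starts at the first row and after any gap above 60 seconds
    block = cumsum(is.na(gap) | gap > 60)
  ) %>%
  group_by(Id, block) %>%
  summarise(
    Status = paste(Status, collapse = ","),
    DateEnded = max(DateCreated),   # computed before DateCreated is replaced
    DateCreated = min(DateCreated),
    Group = paste(Group, collapse = ", "),
    .groups = "drop"
  )

res
```

Using cumsum() over a logical "new block" flag avoids the global counter and the accumulate() call, and keeps the grouping per Id.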

Calculate max value across multiple columns by multiple groups

I have a data file with numeric values in three columns and two grouping variables (ID and Group) from which I need to calculate a single max value by ID and Group:
structure(list(ID = structure(c(1L, 1L, 1L, 2L), .Label = c("a1",
"a2"), class = "factor"), Group = structure(c(1L, 1L, 2L, 2L), .Label =
c("abc",
"def"), class = "factor"), Score1 = c(10L, 0L, 0L, 5L), Score2 = c(0L,
0L, 5L, 10L), Score3 = c(0L, 11L, 2L, 11L)), class = "data.frame", row.names =
c(NA,
-4L))
The result I am trying to obtain is:
structure(list(ID = structure(c(1L, 1L, 2L), .Label = c("a1",
"a2"), class = "factor"), Group = structure(c(1L, 2L, 2L), .Label = c("abc",
"def"), class = "factor"), Max = c(11L, 5L, 11L)), class = "data.frame",
row.names = c(NA,
-3L))
I am trying the following in dplyr:
SampTable<-SampDF %>% group_by(ID,Group) %>%
summarize(max = pmax(SampDF$Score1, SampDF$Score2,SampDF$Score3))
But it generates this error:
Error in summarise_impl(.data, dots) :
Column `max` must be length 1 (a summary value), not 4
Is there an easy way to achieve this in dplyr or data.table?
Solution using data.table. Find max value on 3:5 columns (Score columns) by ID and Group.
library(data.table)
setDT(d)
d[, .(Max = do.call(max, .SD)), .SDcols = 3:5, .(ID, Group)]
ID Group Max
1: a1 abc 11
2: a1 def 5
3: a2 def 11
Data:
d <- structure(list(ID = structure(c(1L, 1L, 1L, 2L), .Label = c("a1",
"a2"), class = "factor"), Group = structure(c(1L, 1L, 2L, 2L), .Label =
c("abc",
"def"), class = "factor"), Score1 = c(10L, 0L, 0L, 5L), Score2 = c(0L,
0L, 5L, 10L), Score3 = c(0L, 11L, 2L, 11L)), class = "data.frame", row.names =
c(NA,
-4L))
A solution using tidyverse.
library(tidyverse)
dat2 <- dat1 %>%
gather(Column, Value, starts_with("Score")) %>%
group_by(ID, Group) %>%
summarise(Max = max(Value)) %>%
ungroup()
dat2
# # A tibble: 3 x 3
# ID Group Max
# <fct> <fct> <dbl>
# 1 a1 abc 11
# 2 a1 def 5
# 3 a2 def 11
Here are a couple of other options with tidyverse
library(tidyverse)
df1 %>%
group_by(ID, Group) %>%
nest %>%
mutate(Max = map_dbl(data, ~ max(unlist(.x)))) %>%
select(-data)
Or using pmax
df1 %>%
mutate(Max = pmax(!!! rlang::syms(names(.)[3:5]))) %>%
group_by(ID, Group) %>%
summarise(Max = max(Max))
# A tibble: 3 x 3
# Groups: ID [?]
# ID Group Max
# <fct> <fct> <dbl>
#1 a1 abc 11
#2 a1 def 5
#3 a2 def 11
Or using base R
aggregate(cbind(Max = do.call(pmax, df1[3:5])) ~ ID + Group, df1, max)
Here is a tidyverse solution using nest :
library(tidyverse)
df %>%
nest(-(1:2),.key="Max") %>%
mutate_at("Max",map_dbl, max)
# ID Group Max
# 1 a1 abc 11
# 2 a1 def 5
# 3 a2 def 11
In base R:
res <- aggregate(. ~ ID + Group,df,max)
res <- cbind(res[1:2], Max = do.call(pmax,res[-(1:2)]))
res
# ID Group Max
# 1 a1 abc 11
# 2 a1 def 5
# 3 a2 def 11
Here is a base R solution
# gives 2x2 table
x <- by(df[, !names(df) %in% c("ID", "Group")], list(df$ID, df$Group), max)
# get requested format
tmp <- expand.grid(ID = rownames(x), Group = colnames(x))
tmp$Max <- as.vector(x)
tmp[complete.cases(tmp), ]
#R ID Group Max
#R 1 a1 abc 11
#R 3 a1 def 5
#R 4 a2 def 11
with
df <- structure(list(
ID = structure(c(1L, 1L, 1L, 2L), .Label = c("a1", "a2"), class = "factor"),
Group = structure(c(1L, 1L, 2L, 2L), .Label = c("abc", "def"), class = "factor"),
Score1 = c(10L, 0L, 0L, 5L), Score2 = c(0L, 0L, 5L, 10L),
Score3 = c(0L, 11L, 2L, 11L)),
class = "data.frame", row.names = c(NA, -4L))
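The error in the question comes from referring to SampDF$Score1 and friends inside summarize(): the $ form bypasses the grouping, so pmax() returns one value per row of the whole data frame (4 values) instead of one per group. A minimal sketch of a fix, using bare column names and max() over several vectors at once (the data is retyped from the question with character instead of factor columns):

```r
library(dplyr)

SampDF <- data.frame(
  ID     = c("a1", "a1", "a1", "a2"),
  Group  = c("abc", "abc", "def", "def"),
  Score1 = c(10L, 0L, 0L, 5L),
  Score2 = c(0L, 0L, 5L, 10L),
  Score3 = c(0L, 11L, 2L, 11L)
)

res <- SampDF %>%
  group_by(ID, Group) %>%
  # max() pools all values of the three columns within each group
  summarise(Max = max(Score1, Score2, Score3), .groups = "drop")

res
```

This only suits a short, fixed list of columns; the reshaping answers above scale better when there are many Score columns.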

Calculating ratios by group with dplyr

Using the following dataframe I would like to group the data by replicate and group and then calculate a ratio of treatment values to control values.
structure(list(group = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L), .Label = c("case", "controls"), class = "factor"), treatment = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "EPA", class = "factor"),
replicate = structure(c(2L, 4L, 3L, 1L, 2L, 4L, 3L, 1L), .Label = c("four",
"one", "three", "two"), class = "factor"), fatty_acid_family = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "saturated", class = "factor"),
fatty_acid = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "14:0", class = "factor"),
quant = c(6.16, 6.415, 4.02, 4.05, 4.62, 4.435, 3.755, 3.755
)), .Names = c("group", "treatment", "replicate", "fatty_acid_family",
"fatty_acid", "quant"), class = "data.frame", row.names = c(NA,
-8L))
I have tried using dplyr as follows:
group_by(dataIn, replicate, group) %>% transmute(ratio = quant[group=="case"]/quant[group=="controls"])
but this results in Error: incompatible size (%d), expecting %d (the group size) or 1
Initially I thought this might be because I was trying to create 4 ratios from a data frame 8 rows deep, so I thought summarise might be the answer (collapsing each group to one ratio), but that doesn't work either (the shortcoming is in my understanding).
group_by(dataIn, replicate, group) %>% summarise(ratio = quant[group=="case"]/quant[group=="controls"])
replicate group ratio
1 four case NA
2 four controls NA
3 one case NA
4 one controls NA
5 three case NA
6 three controls NA
7 two case NA
8 two controls NA
I would appreciate some advice on where I'm going wrong or even if this can be done with dplyr.
Thanks.
You can try:
group_by(dataIn, replicate) %>%
summarise(ratio = quant[group=="case"]/quant[group=="controls"])
#Source: local data frame [4 x 2]
#
# replicate ratio
#1 four 1.078562
#2 one 1.333333
#3 three 1.070573
#4 two 1.446449
Because you grouped by replicate and group, you could not access data from different groups at the same time.
@talat's answer solved it for me. I created a minimal reproducible example to help my own understanding:
df <- structure(list(a = c("a", "a", "b", "b", "c", "c", "d", "d"),
b = c(1, 2, 1, 2, 1, 2, 1, 2), c = c(22, 15, 5, 0.2, 107,
6, 0.2, 4)), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"))
# a b c
# 1 a 1 22.0
# 2 a 2 15.0
# 3 b 1 5.0
# 4 b 2 0.2
# 5 c 1 107.0
# 6 c 2 6.0
# 7 d 1 0.2
# 8 d 2 4.0
library(dplyr)
df %>%
group_by(a) %>%
summarise(prop = c[b == 1] / c[b == 2])
# a prop
# 1 a 1.466667
# 2 b 25.000000
# 3 c 17.833333
# 4 d 0.050000
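An alternative sketch, assuming tidyr is available, is to reshape so that case and controls sit side by side in one row per replicate and then divide the two columns. The data below is the question's dataIn, retyped with only the columns that matter here:

```r
library(dplyr)
library(tidyr)

dataIn <- data.frame(
  group     = rep(c("case", "controls"), each = 4),
  replicate = rep(c("one", "two", "three", "four"), times = 2),
  quant     = c(6.16, 6.415, 4.02, 4.05, 4.62, 4.435, 3.755, 3.755)
)

res <- dataIn %>%
  # one row per replicate, with columns named after the group values
  pivot_wider(id_cols = replicate, names_from = group, values_from = quant) %>%
  mutate(ratio = case / controls)

res
```

The wide form makes a missing case or control visible as an NA in its column, rather than as a length mismatch inside summarise().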
