I have a dataframe:
levels counts
1, 2, 2 24
1, 2 20
1, 3, 3, 3 15
1, 3 10
1, 2, 3 25
I want to treat, for example, "1, 2, 2" and "1, 2" as the same thing. So, as long as a row contains only "1" and "2" (in any multiplicity) and no other value, it should count as the level "1, 2". Here is the desired data frame:
levels counts
1, 2 44
1, 3 25
1, 2, 3 25
Here is code to reproduce the original data frame:
df <- data.frame(levels = c("1, 2, 2", "1, 2", "1, 3, 3, 3", "1, 3", "1, 2, 3"),
counts = c(24, 20, 15, 10, 25))
df$levels <- as.character(df$levels)
Split df$levels, get the unique elements, and sort them. Then use the result to aggregate the counts.
df$levels2 <- sapply(strsplit(df$levels, ", "), function(x)
  paste(sort(unique(x)), collapse = ", "))  # or: toString(sort(unique(x)))
aggregate(counts~levels2, df, sum)
# levels2 counts
#1 1, 2 44
#2 1, 2, 3 25
#3 1, 3 25
A solution using the tidyverse; df2 is the final output.
library(tidyverse)
df2 <- df %>%
mutate(ID = 1:n()) %>%
mutate(levels = strsplit(levels, split = ", ")) %>%
unnest(cols = levels) %>%
distinct() %>%
arrange(ID, levels) %>%
group_by(ID, counts) %>%
summarise(levels = paste(levels, collapse = ", ")) %>%
ungroup() %>%
group_by(levels) %>%
summarise(counts = sum(counts))
Update
Based on the comments below, a solution using ideas similar to d.b's:
df2 <- df %>%
mutate(l2 = map_chr(strsplit(levels, ", "),
.f = ~ .x %>% unique %>% sort %>% toString)) %>%
group_by(l2) %>%
summarise(counts = sum(counts))
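For the sample data this reproduces the desired totals (a sketch of the expected output):
# # A tibble: 3 x 2
#   l2      counts
#   <chr>    <dbl>
# 1 1, 2        44
# 2 1, 2, 3     25
# 3 1, 3        25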
Related
I want to count the number of unique values in a table that contains several from...to pairs like below:
tmp <- tribble(
~group, ~from, ~to,
1, 1, 10,
1, 5, 8,
1, 15, 20,
2, 1, 10,
2, 5, 10,
2, 15, 18
)
I tried to nest all values in a list for each row (which works), but combining these nested lists into one vector and counting the unique values doesn't work as expected.
tmp %>%
group_by(group) %>%
rowwise() %>%
mutate(nrs = list(c(from:to))) %>%
summarise(n_uni = length(unique(unlist(list(nrs)))))
The desired output looks like this:
tibble(group = c(1, 2),
n_uni = c(length(unique(unlist(list(tmp$nrs[tmp$group == 1])))),
length(unique(unlist(list(tmp$nrs[tmp$group == 2]))))))
# # A tibble: 2 × 2
# group n_uni
# <dbl> <int>
#1 1 16
#2 2 14
Any help would be much appreciated!
tmp %>%
rowwise() %>%
mutate(nrs = list(from:to)) %>%
group_by(group) %>%
summarise(n_uni = n_distinct(unlist(nrs)))
The issue with the OP's approach is that rowwise() acts as a new grouping (each row becomes its own group), which drops the initial group_by() step. Thus we have to group_by() again after creating nrs in the rowwise step.
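A quick sketch of the behaviour (assuming dplyr >= 1.0): a summarise() that runs while the data are still rowwise returns one value per input row rather than one per group.
tmp %>%
  rowwise() %>%
  mutate(nrs = list(from:to)) %>%
  summarise(n_uni = n_distinct(unlist(nrs)))
# six rows (10, 4, 6, 10, 6, 4) instead of the per-group totals 16 and 14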
I have a column of type "list" in my data frame, and I want to create a column with the sum of each element.
There is no visual difference when printed, but my column consists of list(1, 2, 3) elements rather than c(1, 2, 3) vectors:
tibble(
MY_DATA = list(
list(2, 7, 8),
list(3, 10, 11),
list(4, 2, 8)
),
NOT_MY_DATA = list(
c(2, 7, 8),
c(3, 10, 11),
c(4, 2, 8)
)
)
Unfortunately when I try
mutate(NEW_COL = MY_LIST_COL_D %>% unlist() %>% sum())
the result is that every cell in the new column contains the sum of the entire source column (so a value in the millions)
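Presumably this happens because, without rowwise() or map(), unlist() flattens the whole list column before sum() runs. A minimal sketch of the effect (using a hypothetical tibble d holding the same values as above):
library(tibble)
d <- tibble(MY_LIST_COL_D = list(list(2, 7, 8), list(3, 10, 11), list(4, 2, 8)))
sum(unlist(d$MY_LIST_COL_D))
# [1] 55   (this grand total is what ends up repeated in every row)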
I tried reduce and it did work, but it was slow; I am looking for a better solution.
You could use purrr::map_dbl(), which returns a vector of type double:
library(tibble)
library(dplyr)
library(purrr)
df = tibble(
MY_LIST_COL_D = list(
c(2, 7, 8),
c(3, 10, 11),
c(4, 2, 8)
)
)
df %>%
mutate(NEW_COL= map_dbl(MY_LIST_COL_D, sum), .keep = 'unused')
#   NEW_COL
#     <dbl>
# 1      17
# 2      24
# 3      14
Is this what you were looking for? If you don't want to remove the list column just disregard the .keep argument.
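As a quick aside, map_dbl() simply applies the function to each element and returns one double per element, so the same call works on a bare list too:
map_dbl(list(c(2, 7, 8), c(3, 10, 11), c(4, 2, 8)), sum)
# [1] 17 24 14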
Update
With the underlying structure being lists, the same logic still applies; one way to handle it is to unlist each element first:
df = tibble(
MY_LIST_COL_D = list(
list(2, 7, 8),
list(3, 10, 11),
list(4, 2, 8)
)
)
df %>%
mutate(NEW_COL = map_dbl(MY_LIST_COL_D, ~ sum(unlist(.x))), .keep = 'unused')
# NEW_COL
# <dbl>
# 1 17
# 2 24
# 3 14
You can use rowwise in dplyr
library(dplyr)
df %>% rowwise() %>% mutate(NEW_COL = sum(MY_LIST_COL_D))
rowwise() will also make your attempt work:
df %>% rowwise() %>% mutate(NEW_COL = MY_LIST_COL_D %>% unlist() %>% sum())
You can also use sapply in base R:
df$NEW_COL <- sapply(df$MY_LIST_COL_D, sum)
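If the elements are lists rather than atomic vectors (as in the question's MY_DATA column), sum() will fail with an "invalid 'type' (list)" error; a sketch that unlists each element first:
df$NEW_COL <- sapply(df$MY_LIST_COL_D, function(x) sum(unlist(x)))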
I am trying to summarise multiple columns based on an ID column so I don't double count observations. I have managed to use tapply to get what I need for one variable at a time but can't do this for several variables at the same time.
In addition, the data frame I want to apply this to has 50,000+ rows, and I want to apply this to 10+ different count variables. I was wondering if there is a better solution within dplyr, as I ultimately want to create a Shiny dashboard with this data.
I have replicated a small sample of the data and shown my existing code.
#Creating data frame
df <- data.frame (ID = c(1, 1, 2, 3, 4, 4, 4),
Count = c(1, 1, 30, 15, 1, 1, 1),
Count2 = c(1, 1, 20, 10, 1, 1, 1),
Service = c("Service A", "Service B", "Service C", "Service D",
"Service E", "Service F", "Service G"))
#Create object of variables to count
myvars <- c("Count", "Count2")
#Count number of unique frequencies for both variables at once (this does not work)
df %>%
group_by(ID) %>%
summarise(value_sum = sum(tapply(myvars, ID, FUN = max))) %>%
summarise(value_sum = sum(value_sum))
#Count number of unique frequencies (code works for one variable at a time)
df %>%
group_by(ID) %>%
summarise(value_sum = sum(tapply(Count, ID, FUN = max))) %>%
summarise(value_sum = sum(value_sum))
df %>%
group_by(ID) %>%
summarise(value_sum = sum(tapply(Count2, ID, FUN = max))) %>%
summarise(value_sum = sum(value_sum))
You can use across() to work on multiple variables at the same time within summarise(); wrap the character vector of names in all_of() so across() treats it as column names. In your case:
df %>%
group_by(ID) %>%
summarise(across(all_of(myvars), max)) %>%
summarise(across(all_of(myvars), sum))
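For the sample data this collapses to a single row of de-duplicated totals (a sketch of the expected output):
# # A tibble: 1 x 2
#   Count Count2
#   <dbl>  <dbl>
# 1    47     32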
I have a data frame of various hematology values and their collection times. Those values should only be collected at specific times, but occasionally an extra one is added. I want to remove any instances where a value was collected outside the scheduled time.
To illustrate the issue, here's some code to create a very simplified version of the data frame I'm working with (plus some example schedules):
example <- tibble("Parameter" = c(rep("hgb", 3), rep("bili", 3), rep("LDH", 3)),
"Collection" = c(1, 3, 4, 1, 5, 6, 0, 4, 8))
hgb_sampling <- c(1, 4)
bili_sampling <- c(1, 5)
ldh_sampling <- c(0, 4)
So, I need a way to conditionally apply a filter based on the value in the Parameter column. The solution needs to fit into a dplyr pipeline and yield something like this:
filtered <- tibble("Parameter" = c(rep("hgb", 2), rep("bili", 2), rep("LDH", 2)),
"Collection" = c(1, 4, 1, 5, 0, 4))
I've tried a couple of things (they all amount to something like the code below), but the use of Parameter trips things up:
df <- example %>%
{if (Parameter == "hgb") filter(., Collection %in% hgb_sampling)}
Any suggestions?
You could create a reference tibble, join it with example and keep only selected rows.
library(dplyr)
ref_df <- tibble::tibble(Parameter = c("hgb","bili", "LDH"),
value = list(c(1, 4), c(1, 5), c(0, 4)))
example %>%
inner_join(ref_df, by = 'Parameter') %>%
group_by(Parameter) %>%
filter(Collection %in% unique(unlist(value))) %>%
select(Parameter, Collection)
# Parameter Collection
# <chr> <dbl>
#1 hgb 1
#2 hgb 4
#3 bili 1
#4 bili 5
#5 LDH 0
#6 LDH 4
Put your valid times in a list with names matching the values in Parameter, then group by Parameter and filter Collection against the corresponding element of sample_list:
sample_list <- list(hgb = c(1, 4), bili = c(1, 5), LDH = c(0, 4))
example %>%
group_by(Parameter) %>%
filter(Collection %in% sample_list[[first(Parameter)]])
Output:
# A tibble: 6 x 2
Parameter Collection
<chr> <dbl>
1 hgb 1
2 hgb 4
3 bili 1
4 bili 5
5 LDH 0
6 LDH 4
Try purrr::imap_dfr:
library(tidyverse)
example <- tibble("Parameter" = c(rep("hgb", 3), rep("bili", 3), rep("LDH", 3)),
"Collection" = c(1, 3, 4, 1, 5, 6, 0, 4, 8))
l <- list(hgb = c(1, 4), bili = c(1, 5), LDH = c(0, 4))
imap_dfr(l, ~example %>%
filter(Parameter == .y & Collection %in% .x))
# # A tibble: 6 x 2
# Parameter Collection
# <chr> <dbl>
# 1 hgb 1
# 2 hgb 4
# 3 bili 1
# 4 bili 5
# 5 LDH 0
# 6 LDH 4
A simple method that is easy to modify, extend, or debug:
library(dplyr)
example %>%
filter(Parameter=="hgb" & Collection %in% c(1, 4) |
Parameter=="bili" & Collection %in% c(1, 5) |
Parameter=="LDH" & Collection %in% c(0, 4) )
Or if you want the values to be within a range:
example %>%
filter(Parameter=="hgb" & between(Collection, 1, 4) |
Parameter=="bili" & between(Collection, 1, 5) |
Parameter=="LDH" & between(Collection, 0, 4))
One option involving dplyr, stringr and tibble could be:
enframe(mget(ls(pattern = "sampling"))) %>%
mutate(name = str_extract(name, "[^_]+")) %>%
right_join(example %>%
mutate(Parameter = tolower(Parameter)), by = c("name" = "Parameter")) %>%
filter(Collection %in% unlist(value)) %>%
select(-value)
name Collection
<chr> <dbl>
1 hgb 1
2 hgb 4
3 bili 1
4 bili 5
5 ldh 0
6 ldh 4
If the valid times are stored in a separate data frame (ref_df, as shown by @Ronak Shah above), then you can do:
example %>%
filter(Collection %in% unlist(ref_df$value[match(Parameter, ref_df$Parameter)]))
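Note that unlist() pools every parameter's valid times together, which happens to give the right rows for this data; if a strictly per-parameter check is needed, a rowwise sketch with purrr::map2_lgl (assuming ref_df as defined above) could be:
library(purrr)
example %>%
  filter(map2_lgl(Parameter, Collection,
                  ~ .y %in% ref_df$value[[match(.x, ref_df$Parameter)]]))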
An additional solution:
library(tidyverse)
library(purrr)
fltr <- list(hgb = c(1, 4), bili = c(1, 5), LDH = c(0,4)) %>%
enframe(name = "Parameter")
example %>%
group_by(Parameter) %>%
nest() %>%
left_join(fltr, by = "Parameter") %>%
mutate(out = map2(.x = data, .y = value, .f = ~ filter(.x, Collection %in% .y))) %>%
unnest(out) %>%
select(Parameter, Collection)
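This should return the same six rows as the other approaches (a sketch; row order may vary):
# Parameter Collection
# hgb                1
# hgb                4
# bili               1
# bili               5
# LDH                0
# LDH                4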
Q1. Is there a more direct (but still tidyverse) way to create a summary table like this?
library(tidyverse)
library(knitr)
library(kableExtra)
df <- data.frame(group=c(1, 1, 1, 1, 0, 0, 0, 0),
v1=c(1, 2, 3, 4, 5, 6, 1, 2),
v2=c(4, 3, 2, 5, 3, 5, 3, 8),
v3=c(0, 1, 0, 1, 1, 0, 1, 1))
df %>%
group_by(group) %>%
summarise(v1=paste0(round(mean(v1), 2),
" (",
round(sd(v1), 2),
")"),
v2=paste0(round(mean(v2), 2),
" (",
round(sd(v2), 2),
")"),
v3=round(mean(v3)*100, 1)
) %>%
dplyr::select(-group) %>%
t() %>%
`rownames<-` (c("v1 mean (SD)",
"v2 mean (SD)",
"Percent v3")) %>%
kable("html",
col.names=c("Group 0", "Group 1")) %>%
kable_styling()
Q2. Related to this, is there a way to combine two levels of summarise (e.g., no grouping + grouping) without repeating the summarise code?
all <-
df %>%
summarise(v1=paste0(round(mean(v1), 2),
" (",
round(sd(v1), 2),
")"),
v2=paste0(round(mean(v2), 2),
" (",
round(sd(v2), 2),
")"),
v3=round(mean(v3)*100, 1)
) %>%
t() %>%
`rownames<-` (c("v1 mean (SD)",
"v2 mean (SD)",
"Percent v3"))
groups <-
df %>%
group_by(group) %>%
summarise(v1=paste0(round(mean(v1), 2),
" (",
round(sd(v1), 2),
")"),
v2=paste0(round(mean(v2), 2),
" (",
round(sd(v2), 2),
")"),
v3=round(mean(v3)*100, 1)
) %>%
dplyr::select(-group) %>%
t() %>%
`rownames<-` (c("v1 mean (SD)",
"v2 mean (SD)",
"Percent v3"))
all %>%
cbind(groups) %>%
kable("html",
col.names=c("All", "Group 0", "Group 1")) %>%
kable_styling()
One way to make your code a bit more concise (especially if you want to add more columns like v1, v2, ... in the future) is to put paste0(round(mean(v1), 2), " (", round(sd(v1), 2), ")") into a function, e.g. paste_mean_and_sd <- function(df_col) paste0(round(mean(df_col), 2), " (", round(sd(df_col), 2), ")").
That would shorten your pipeline and make it easier to read:
... %>% summarise(v1 = paste_mean_and_sd(v1), v2 = paste_mean_and_sd(v2), v3=round(mean(v3)*100, 1)) %>% ...
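Spelled out, the suggestion might look like this (a sketch using the question's data):
paste_mean_and_sd <- function(df_col) {
  paste0(round(mean(df_col), 2), " (", round(sd(df_col), 2), ")")
}

df %>%
  group_by(group) %>%
  summarise(v1 = paste_mean_and_sd(v1),
            v2 = paste_mean_and_sd(v2),
            v3 = round(mean(v3) * 100, 1))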
This is the minimum I can think of.
cat_var <- "v3"
df_cal <- function(x, var) {
  if (var[1] %in% cat_var) return(as.character(round(mean(x) * 100, 1)))
  paste0(round(mean(x), 2), " (", round(sd(x), 2), ")")
}
df_tall <- df %>% gather(var, x, v1:v3) %>% group_by(var)
all <- df_tall %>% summarise(stat = df_cal(x, var)) %>% mutate(group = -1)
groups <- df_tall %>% group_by(group, var) %>% summarise(stat = df_cal(x, var))
bind_rows(all, groups) %>%
ungroup() %>%
mutate(var = factor(var, labels = c(
"v1 mean (SD)", "v2 mean (SD)", "Precent v3"
))) %>%
spread(group, stat) %>%
kable("html", col.names = c(" ", "All", "Group 0", "Group 1")) %>%
kable_styling()