Given a data frame like data:
data <- data.frame(group = rep(c('a','b'), each = 100),
                   value = rnorm(200))
We want to filter the values for group == 'b' using dplyr and apply boxplot.stats to identify outliers:
library(dplyr)
data %>%
  filter(group == 'b') %>%
  summarise(out.stats = boxplot.stats(value))
This returns the error Column `out.stats` must be length 1 (a summary value), not 4. Why does this not work? How do you apply functions like this inside a pipe?
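For context, boxplot.stats() returns a four-element list, which is why a pre-1.0.0 summarise() rejects it as a single summary value:

str(boxplot.stats(rnorm(100)))
# List of 4
#  $ stats: num [1:5] ...
#  $ n    : int 100
#  $ conf : num [1:2] ...
#  $ out  : num ...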
The following answers both the question and the last comment on the question, where the OP asks for the row numbers of the outliers:
what if we want to return the row numbers that go with boxplot.stats()$out from the pipe? so if we did b <- data %>% filter(group == 'b') outside of the pipe, we could have used: which(b$value %in% boxplot.stats(b$value)$out)
This is done by left_joining with the original data.
library(dplyr)
set.seed(1234)
data <- data.frame(group = rep(c('a','b'), each = 100),
                   value = rnorm(200))
data %>%
  filter(group == 'b') %>%
  pull(value) %>%
  boxplot.stats() %>%
  `[[`('out') %>%
  data.frame() %>%
  left_join(data, by = c('.' = 'value'))
# . group
#1 3.043766 b
#2 -2.732220 b
#3 -2.855759 b
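Note that joining back on value assumes the outlier values are unique in data. A sketch of an alternative that returns the row numbers directly, mirroring the which() idea from the comment:

data %>%
  filter(group == 'b') %>%
  mutate(row = row_number()) %>%         # row number within group 'b'
  filter(value %in% boxplot.stats(value)$out)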
We can also use the new version of dplyr (>= 1.0.0), where summarise can return more than one row, so the original code now works:
library(dplyr) # >= 1.0.0
data %>%
  filter(group == 'b') %>%
  summarise(out.stats = boxplot.stats(value))
# out.stats
#1 -2.4804222, -0.7546693, 0.1304050, 0.6390749, 2.2682247
#2 100
#3 -0.08980661, 0.35061653
#4 -3.014914
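If only the outliers themselves are needed, the $out component can be extracted in the same summarise() call, giving one row per outlier under dplyr >= 1.0.0:

data %>%
  filter(group == 'b') %>%
  summarise(out = boxplot.stats(value)$out)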
Related
I have used the following code to split the dataframe below (df1) into multiple dataframes/tibbles, filtered by metric, so that I can work out the percentile rank of each metric.
df1:
  name group   metric    value
1    A     A distance 10569.00
2    B     A distance 12939.00
3    C     A distance 11532.00
4    A     A   psv-99    29.30
5    B     A   psv-99    30.89
6    C     A   psv-99    28.90
split <- lapply(unique(df1$metric), function(x) {
  df1 %>% filter(group == "A" & metric == x)
})
This then gives me a large list of tibbles. I want to now mutate a new column for each tibble to work out the percentile rank of the value column which I can do using the following code:
df2 <- split[[1]] %>% mutate(percentile = percent_rank(value))
I could do this for each metric and then bind_rows() them together, but that seems very messy. Could anyone suggest a better way of doing this?
No need to split the data here. You can use group_by to do the calculation for each metric separately.
library(dplyr)
df1 %>%
  filter(group == "A") %>%
  group_by(metric) %>%
  mutate(percentile = percent_rank(value))
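With the df1 shown above, this should give (tibble formatting abbreviated):

#   name group   metric    value percentile
# 1    A     A distance 10569.00        0.0
# 2    B     A distance 12939.00        1.0
# 3    C     A distance 11532.00        0.5
# 4    A     A   psv-99    29.30        0.5
# 5    B     A   psv-99    30.89        1.0
# 6    C     A   psv-99    28.90        0.0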
We can use base R (note that percent_rank() itself still comes from dplyr):
df2 <- subset(df1, group == 'A')
df2$percentile <- with(df2, ave(value, metric, FUN = percent_rank))
Or with group_nest() and map(), which need tidyr and purrr:
library(tidyr)
library(purrr)

df1 %>%
  group_nest(group, metric) %>%
  mutate(percentile = map(data, ~ percent_rank(.x$value))) %>%
  unnest(cols = c("data", "percentile"))
Calculating a function of multiple variables for a dataframe in wide format is very familiar:
library(tidyverse)
df <- tibble(t = 1:3, b = 11:13, c = 21:23)
df <- df %>% mutate(d = b + c) # or base R: df$d <- df$b + df$c
What about when the dataframe is in long format? e.g.
df <- df %>% pivot_longer(-t, names_to = "variable", values_to = "value")
In this long format, you could imagine the same operation working by first group_by(t), and then calculating one value of d for each group, namely that group's variable=b value plus that group's variable=c value. Is this possible? One might think of something like summarise(d = b + c) but that expects wide format.
NB: my real-world example has more than the two columns b and c, and I want to pass them to a defined function, not just add them. My working solution is pivoting a huge dataframe from long to wide, calling my multivariable function to define a new column, then pivoting back to long.
Edit: to make the real world example explicit, I need to call a defined function that treats its arguments differently, unlike sum. For example
my.func <- function(b, c) { b^c }
How could the variable d be calculated by applying this function to the values of b and c associated with the same value of t?
We can just use sum on the relevant subset of value instead of +:
library(dplyr)
library(tidyr)
df %>%
  group_by(t) %>%
  summarise(d = sum(value[variable %in% c('b', 'c')]))
To apply my.func, we instead extract the values that correspond to 'b' and 'c':
df %>%
  group_by(t) %>%
  mutate(new = my.func(value[variable == 'b'], value[variable == 'c']))
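Alternatively, the OP's pivot-to-wide workflow can be kept compact. A sketch with tidyr's pivot_wider()/pivot_longer(), assuming one value per t/variable pair:

library(tidyr)

df %>%
  pivot_wider(names_from = variable, values_from = value) %>%  # back to wide
  mutate(d = my.func(b, c)) %>%                                # apply the function
  pivot_longer(-t, names_to = "variable", values_to = "value") # back to long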
I'm trying to filter on a specific column across a list of dataframes. Typically, for a single dataframe with dplyr, I would use:
#creating dataframe
df <- data.frame(a = 0:10, d = 10:20)
# filtering column a for rows greater than 7
df %>% filter(a > 7)
I've tried doing this across a list using the following:
# creating list
x <- list(data.frame(a = 0:10, b = 10:20),
          data.frame(c = 11:20, d = 21:30),
          data.frame(e = 15:25, f = 35:45))
# selecting the appropriate column and trying to filter
# this is not working
x[1][[1]][1] %>% lapply(. %>% {filter(. > 2)})
# however, if I use the min() function it works
x[1][[1]][1] %>% lapply(. %>% {min(.)})
I find the %>% syntax quite easy to understand and carry out. However, in this case, selecting a specific column and doing something quite simple like filtering is not working. I'm guessing map could be equally useful. Any help is appreciated.
You can use filter_at to refer to a column by position:
library(dplyr)
purrr::map(x, ~.x %>% filter_at(1, any_vars(. > 7)))
In filter, you can subset the column and use it:
purrr::map(x, ~.x %>% filter(.[[1]] > 7))
In base R, that would be:
lapply(x, function(y) y[y[[1]] > 7, ])
It seems you are interested in checking the condition on the first column of each dataframe in your list.
One solution using dplyr would be
lapply(x, function(df) {df %>% filter_at(1, ~. > 7)})
The 1 in filter_at indicates that I want to check the condition on the first column (1 is a positional index) of each dataframe in the list.
EDIT
After the discussion in the comments, I propose the following solution
lapply(x, function(df) {df %>% filter(a > 7) %>% select(a) %>% slice(1)})
Input data
x <- list(data.frame(a = 0:10, b = 10:20),
          data.frame(a = 11:20, b = 21:30),
          data.frame(a = 15:25, b = 35:45))
Output
[[1]]
a
1 8
[[2]]
a
1 11
[[3]]
a
1 15
Using filter with across
library(dplyr)
library(purrr)
map(x, ~ .x %>%
      filter(across(names(.)[1], ~ . > 7)))
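In more recent dplyr (>= 1.0.4), if_any() supersedes across() inside filter(); a sketch:

map(x, ~ .x %>% filter(if_any(1, ~ . > 7)))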
How can I use mutate to achieve the below?
bd_diag_date <- df %>%
  apply(1, function(dates) last(na.omit(dates))) %>%
  as.data.frame() %>%
  `colnames<-`("diag_date")
I tried the below, but it didn't work. I can't figure out why; it says Error: Column `diagnosis_date` is of unsupported type symbol. Should I assume mutate accepts any operation that can be applied to a vector? If not, what kind of operations does it accept?
bd_diag_date <- df %>%
  rowwise() %>%
  {mutate(., diag_date = last(na.omit(all_vars(.))))}
I also have a more general question: how can I debug this? Every time I encounter such a problem I have to search Stack Exchange, but that doesn't feel like the right way to improve my dplyr skills.
We can use pmap
library(dplyr)
library(purrr)
df %>%
  mutate(diag_date = pmap(., ~ last(na.omit(c(...)))))
If the columns are numeric, we can use pmap_dbl; simply using pmap returns a list column.
df %>%
  mutate(diag_date = pmap_dbl(., ~ last(na.omit(c(...)))))
# col1 col2 col3 diag_date
#1 1 NA 2 2
#2 NA 2 NA 2
#3 3 4 NA 4
If we need to return only a single column, use transmute
df %>%
  transmute(diag_date = pmap_dbl(., ~ last(na.omit(c(...)))))
Or with group_split and map
df %>%
  group_split(grp = row_number(), keep = FALSE) %>%
  map_dfr(~ .x %>%
            transmute(diag_date = last(na.omit(unlist(.)))))
Or using base R with max.col, where matrix indexing picks, for each row, the last non-NA column:
df$diag_date <- df[cbind(seq_len(nrow(df)), max.col(!is.na(df), 'last'))]
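Or, fixing the OP's rowwise() attempt directly: all_vars() is a helper for the scoped filter verbs and does not belong inside mutate(); with dplyr >= 1.0.0, c_across() is the rowwise equivalent. A sketch:

df %>%
  rowwise() %>%
  mutate(diag_date = last(na.omit(c_across(everything())))) %>%
  ungroup()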
data
df <- data.frame(col1 = c(1, NA, 3), col2 = c(NA, 2, 4), col3 = c(2, NA, NA))
I want to collapse multiple columns across groups such that the remaining summary statistic is the difference between the column values for each group. I have two methods but I have a feeling that there is a better way I should be doing this.
Example data
library(dplyr)
library(tidyr)
test <- data.frame(year = rep(2010:2011, each = 2),
                   id = c("A","B"),
                   val = 1:4,
                   val2 = 2:5,
                   stringsAsFactors = FALSE)
Using summarize_each
test %>%
  group_by(year) %>%
  summarize_each(funs(.[id == "B"] - .[id == "A"]), val, val2)
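summarize_each() and funs() are deprecated in current dplyr; a sketch of the equivalent with across():

test %>%
  group_by(year) %>%
  summarise(across(c(val, val2), ~ .x[id == "B"] - .x[id == "A"]))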
Using tidyr
test %>%
  gather(key, val, val:val2) %>%
  spread(id, val) %>%
  mutate(B.less.A = B - A) %>%
  select(-c(A, B)) %>%
  spread(key, B.less.A)
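gather() and spread() are superseded by the pivot_* verbs in tidyr >= 1.0.0; the same reshape, sketched:

test %>%
  pivot_longer(val:val2, names_to = "key") %>%
  pivot_wider(names_from = id, values_from = value) %>%
  mutate(B.less.A = B - A) %>%
  select(-A, -B) %>%
  pivot_wider(names_from = key, values_from = B.less.A)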
The summarize_each way seems relatively simple but I feel like there is a way to do this by grouping on id somehow? Is there a way that could ignore NA values in the columns?
We can use data.table
library(data.table)
setDT(test)[, lapply(.SD, diff), by = year, .SDcols = val:val2]
# year val val2
#1: 2010 1 1
#2: 2011 1 1
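On the asker's NA question: one option is to drop NAs before differencing, assuming each year retains at least two non-NA values per column (a sketch):

setDT(test)[, lapply(.SD, function(v) diff(na.omit(v))), by = year, .SDcols = val:val2]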