After grouping by species and taking the max Sepal.Length (column 1) for each group, I need to grab the values of columns 2 to 4 that are associated with the maximum value of column 1 (by group). I'm able to do this one column at a time, but not in a single across call. Any tips?
library(dplyr)
library(datasets)
data(iris)
Summarizing by species, with the data associated with the max Sepal.Length (by group), column by column:
iris_summary <- iris %>%
  group_by(Species) %>%
  summarise(
    max_sep_length = max(Sepal.Length),
    sep_w_associated_to = Sepal.Width[which.max(Sepal.Length)],
    pet_l_associated_to = Petal.Length[which.max(Sepal.Length)],
    pet_w_associated_to = Petal.Width[which.max(Sepal.Length)]
  )
Now I would like to obtain the same result using across, but the outcome differs from what I expected (the df iris_summary now has the same number of rows as iris, and I can't understand why...):
iris_summary <- iris %>%
  group_by(Species) %>%
  summarise(
    max_sepa_length = max(Sepal.Length),
    across(
      .cols = Sepal.Width:Petal.Width,
      .funs = ~ .x[which.max(Sepal.Length)]
    )
  )
Or use slice_max (by default it keeps ties; pass with_ties = FALSE if you want exactly one row per group):
library(dplyr) # needs dplyr >= 1.1.0 for the per-operation `by` argument; on older versions use group_by(Species)
iris %>%
  slice_max(Sepal.Length, n = 1, by = 'Species')
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.8 4.0 1.2 0.2 setosa
2 7.0 3.2 4.7 1.4 versicolor
3 7.9 3.8 6.4 2.0 virginica
In base R you could do the following; merge() joins on the shared columns (Species and Sepal.Length), so only the rows attaining the per-species maximum are kept:
merge(aggregate(Sepal.Length ~ Species, iris, max), iris)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.8 4.0 1.2 0.2
2 versicolor 7.0 3.2 4.7 1.4
3 virginica 7.9 3.8 6.4 2.0
If we want to do the same with across, here is one option:
iris %>%
  group_by(Species) %>%
  summarise(across(everything(), ~ .[which.max(Sepal.Length)]))
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.8 4 1.2 0.2
2 versicolor 7 3.2 4.7 1.4
3 virginica 7.9 3.8 6.4 2
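As for why the across attempt returned as many rows as iris: across() has no `.funs` argument (that spelling belongs to the older mutate_at/summarise_at family); its function argument is named `.fns`. The misspelled argument falls into `...` and the lambda is never applied, so across() returns each selected column unchanged. With that one change, the original approach works:
iris_summary <- iris %>%
  group_by(Species) %>%
  summarise(
    max_sepa_length = max(Sepal.Length),
    across(
      .cols = Sepal.Width:Petal.Width,
      .fns = ~ .x[which.max(Sepal.Length)]  # .fns, not .funs
    )
  )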
I am trying to use a combination of mutate_at and which.max to manipulate a data frame as outlined below.
#This is basically what I want to achieve
df_want <- iris %>% group_by(Species) %>%
  mutate(Sepal.Length = Sepal.Length[which.max(Petal.Width)],
         Sepal.Width = Sepal.Width[which.max(Petal.Width)])
#Here is my attempt at a smarter solution, but it does not work
df_attempt <- iris %>% group_by(Species) %>% mutate_at(c("Sepal.Length", "Sepal.Width"), function(x) x[which.max("Petal.Width")])
#However, this works
df_test <- iris %>% group_by(Species) %>% mutate_at(c("Sepal.Length", "Sepal.Width"), function(x) x + 100)
The code to produce df_attempt does not work. I get the following error message:
Error in mutate_impl(.data, dots) :
  Column `Sepal.Length` must be length 50 (the group size) or one, not 0
Any ideas how I can get around this while still using mutate_at?
Your attempt fails because "Petal.Width" is quoted: inside the anonymous function, which.max() is applied to a literal string rather than to the column, yielding a zero-length index, hence the length-0 column in the error. The standard dplyr way would be:
df_want <- iris %>%
  group_by(Species) %>%
  mutate(Sepal.Length = Sepal.Length[which.max(Petal.Width)],
         Sepal.Width = Sepal.Width[which.max(Petal.Width)])
df_attempt <- iris %>%
  group_by(Species) %>%
  mutate_at(vars(Sepal.Length, Sepal.Width), funs(.[which.max(Petal.Width)]))
Result:
# A tibble: 150 x 5
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fctr>
1 5 3.5 1.4 0.2 setosa
2 5 3.5 1.4 0.2 setosa
3 5 3.5 1.3 0.2 setosa
4 5 3.5 1.5 0.2 setosa
5 5 3.5 1.4 0.2 setosa
6 5 3.5 1.7 0.4 setosa
7 5 3.5 1.4 0.3 setosa
8 5 3.5 1.5 0.2 setosa
9 5 3.5 1.4 0.2 setosa
10 5 3.5 1.5 0.1 setosa
# ... with 140 more rows
> identical(df_want, df_attempt)
[1] TRUE
Note:
With vars you can reference variables using non-standard evaluation (NSE).
With funs you can reference each column with a `.`, which is equivalent to function(x) x.
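Since this answer was written, funs() has been deprecated and mutate_at() superseded. For reference, in current dplyr (>= 1.0) the same operation would typically be written with across():
iris %>%
  group_by(Species) %>%
  mutate(across(c(Sepal.Length, Sepal.Width), ~ .x[which.max(Petal.Width)]))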
library(tidyverse)
iris <- iris
means <- iris %>%
  group_by(Species) %>%
  summarise_all(funs(mean))
sd <- iris %>%
  group_by(Species) %>%
  summarise_all(funs(sd))
bottom <- means[ ,2:5] - sd[ ,2:5]
bottom$Species <- c("setosa", "versicolor", "virginica")
print(bottom)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.653510 3.048936 1.288336 0.1406144 setosa
2 5.419829 2.456202 3.790089 1.1282473 versicolor
3 5.952120 2.651503 5.000105 1.7513499 virginica
top <- means[ ,2:5] + sd[ ,2:5]
top$Species <- c("setosa", "versicolor", "virginica")
print(top)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.358490 3.807064 1.635664 0.3513856 setosa
2 6.452171 3.083798 4.729911 1.5237527 versicolor
3 7.223880 3.296497 6.103895 2.3006501 virginica
How do I get the rows of Iris where the values for Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width all fall between the values in the top and bottom data frames?
For example, I only want setosa rows where Sepal.Length > 4.65 & Sepal.Length < 5.35 and Sepal.Width is between 3.04 and 3.80, etc. Ideally the end result contains only the 4 numeric columns and the species column.
Thanks.
It would be much easier to filter from the start, skipping the summarize step:
iris %>%
  group_by(Species) %>%
  filter_if(is.numeric, all_vars(. < mean(.) + sd(.) & . > mean(.) - sd(.)))
# A tibble: 54 x 5
# Groups: Species [3]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fctr>
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.7 3.2 1.3 0.2 setosa
# 3 5.0 3.6 1.4 0.2 setosa
# 4 5.0 3.4 1.5 0.2 setosa
# 5 4.8 3.4 1.6 0.2 setosa
# 6 5.1 3.5 1.4 0.3 setosa
# 7 5.1 3.8 1.5 0.3 setosa
# 8 5.2 3.5 1.5 0.2 setosa
# 9 5.2 3.4 1.4 0.2 setosa
#10 4.7 3.2 1.6 0.2 setosa
# ... with 44 more rows
I'm not sure whether you can avoid the summarize step in your actual use case, but I'm posting this as an option here.
Or use between:
iris %>%
  group_by(Species) %>%
  filter_if(is.numeric, all_vars(between(., mean(.) - sd(.), mean(.) + sd(.))))
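For what it's worth, filter_if() has likewise been superseded; in dplyr >= 1.0.4 the same filter is usually written with if_all():
iris %>%
  group_by(Species) %>%
  filter(if_all(where(is.numeric),
                ~ between(.x, mean(.x) - sd(.x), mean(.x) + sd(.x))))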
Here is a solution using non-equi joins, building on the (now deleted) approach of @Frank:
library(data.table)
# add a row number column and reshape from wide to long
DT <- melt(data.table(iris)[, rn := .I], id = c("rn", "Species"))
# compute the lower and upper bound for each variable and Species
mDT <- DT[, .(lb = lb <- mean(value) - (s <- sd(value)),
              ub = lb + 2 * s), by = .(Species, variable)]
# find row numbers of items which fulfill the conditions
selected_rn <-
  # non-equi join
  DT[DT[mDT, on = .(Species, variable, value > lb, value < ub), which = TRUE]][
    # all uniqueN(mDT$variable) variables must have been selected
    # for an item to pass (thanks to @Frank for the tip to avoid a hardcoded value)
    , .N, by = rn][N == uniqueN(mDT$variable), rn]
head(iris[sort(selected_rn), ])
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
8 5.0 3.4 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
18 5.1 3.5 1.4 0.3 setosa
Using the iris dataset I'm trying to calculate a z score for each of the variables. I have the data in tidy format, by performing the following:
library(reshape2)
library(dplyr)
test <- iris
test <- melt(iris, id.vars = 'Species')
That gives me the following:
Species variable value
1 setosa Sepal.Length 5.1
2 setosa Sepal.Length 4.9
3 setosa Sepal.Length 4.7
4 setosa Sepal.Length 4.6
5 setosa Sepal.Length 5.0
6 setosa Sepal.Length 5.4
But when I try to create a z-score column for each group (e.g. the z-score for Sepal.Length will not be comparable to that of Sepal.Width) using the following:
test <- test %>%
  group_by(Species, variable) %>%
  mutate(z_score = (value - mean(value)) / sd(value))
The resulting z-scores have not been grouped, and are based on all of the data.
What's the best way to return the z-scores by group using dplyr?
Many thanks!
I believe you were overcomplicating things by computing z-scores with mean/sd. Just use the function scale:
test <- test %>%
  group_by(Species, variable) %>%
  mutate(z_score = scale(value))
test
## A tibble: 600 x 4
## Groups: Species, variable [12]
# Species variable value z_score
# <fctr> <fctr> <dbl> <dbl>
# 1 setosa Sepal.Length 5.1 0.26667447
# 2 setosa Sepal.Length 4.9 -0.30071802
# 3 setosa Sepal.Length 4.7 -0.86811050
# 4 setosa Sepal.Length 4.6 -1.15180675
# 5 setosa Sepal.Length 5.0 -0.01702177
# 6 setosa Sepal.Length 5.4 1.11776320
# 7 setosa Sepal.Length 4.6 -1.15180675
# 8 setosa Sepal.Length 5.0 -0.01702177
# 9 setosa Sepal.Length 4.4 -1.71919923
#10 setosa Sepal.Length 4.9 -0.30071802
## ... with 590 more rows
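One small caveat: scale() returns a one-column matrix, so z_score above is technically a matrix column; if downstream code expects a plain numeric vector, wrap the call in as.numeric():
test <- test %>%
  group_by(Species, variable) %>%
  mutate(z_score = as.numeric(scale(value)))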
Edit.
Following a comment by the OP, I am posting some code to get the rows where Petal.Width has a positive z_score.
i1 <- which(test$variable == "Petal.Width" & test$z_score > 0)
test[i1, ]
## A tibble: 61 x 4
## Groups: Species, variable [3]
# Species variable value z_score
# <fctr> <fctr> <dbl> <dbl>
# 1 setosa Petal.Width 0.4 1.461300
# 2 setosa Petal.Width 0.3 0.512404
# 3 setosa Petal.Width 0.4 1.461300
# 4 setosa Petal.Width 0.4 1.461300
# 5 setosa Petal.Width 0.3 0.512404
# 6 setosa Petal.Width 0.3 0.512404
# 7 setosa Petal.Width 0.3 0.512404
# 8 setosa Petal.Width 0.4 1.461300
# 9 setosa Petal.Width 0.5 2.410197
#10 setosa Petal.Width 0.4 1.461300
## ... with 51 more rows
Your code is giving you z-scores by group. It seems to me these z-scores should be comparable exactly because you've individually scaled each group to mean=0 and sd=1, rather than scaling each value based on the mean and sd of the full data frame. For example:
library(tidyverse)
First, set up the melted data frame:
dat = iris %>%
  gather(variable, value, -Species) %>%
  group_by(Species, variable) %>%
  mutate(z_score_group = (value - mean(value)) / sd(value)) %>% # you can also use scale(value), as pointed out by @RuiBarradas
  ungroup %>%
  mutate(z_score_ungrouped = (value - mean(value)) / sd(value))
Now look at the first three rows and compare with direct calculation:
head(dat, 3)
# Species variable value z_score_group z_score_ungrouped
# 1 setosa Sepal.Length 5.1 0.2666745 0.8278959
# 2 setosa Sepal.Length 4.9 -0.3007180 0.7266552
# 3 setosa Sepal.Length 4.7 -0.8681105 0.6254145
# z-scores by group
with(dat, (value[1:3] - mean(value[Species=="setosa" & variable=="Sepal.Length"])) / sd(value[Species=="setosa" & variable=="Sepal.Length"]))
# [1] 0.2666745 -0.3007180 -0.8681105
# ungrouped z-scores
with(dat, (value[1:3] - mean(value)) / sd(value))
# [1] 0.8278959 0.7266552 0.6254145
Now visualize the z-scores: The first graph below is the raw data. The second is the ungrouped z-scores--we've just rescaled the data to an overall mean=0 and SD=1. The third graph is what your code produces. Each group has been individually scaled to mean=0 and SD=1.
gridExtra::grid.arrange(
  grobs = setNames(names(dat)[c(3, 5, 4)], names(dat)[c(3, 5, 4)]) %>%
    map(~ ggplot(dat %>% mutate(group = paste(Species, variable, sep = "_")),
                 aes_string(.x, colour = "group")) + geom_density()),
  ncol = 1)
I want to append the group maximum to a table of observations, e.g.:
iris %>%
  split(iris$Species) %>%
  lapply(function(l) mutate(l, species_max = max(Sepal.Width))) %>%
  bind_rows() %>%
  .[c(1, 51, 101), ]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species species_max
1 5.1 3.5 1.4 0.2 setosa 4.4
51 7.0 3.2 4.7 1.4 versicolor 3.4
101 6.3 3.3 6.0 2.5 virginica 3.8
Is there a more elegant dplyr::group_by solution to achieve this?
How about this:
group_by(iris, Species) %>%
  mutate(species_max = max(Sepal.Width)) %>%
  slice(1)
# Source: local data frame [3 x 6]
# Groups: Species [3]
#
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species species_max
# <dbl> <dbl> <dbl> <dbl> <fctr> <dbl>
# 1 5.1 3.5 1.4 0.2 setosa 4.4
# 2 7.0 3.2 4.7 1.4 versicolor 3.4
# 3 6.3 3.3 6.0 2.5 virginica 3.8
The difficulty here is that you need to summarise multiple columns (for which summarise_all would be great) but at the same time you need to add a new column (for which you either need a simple summarise or mutate call).
In this regard data.table allows greater flexibility, since its j-argument only needs to return a list. Just as a comparison, you can do it as follows with data.table:
library(data.table)
dt <- as.data.table(iris)
dt[, c(lapply(.SD, first), species_max = max(Sepal.Width)), by = Species]
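As an aside, with dplyr >= 1.1.0 the per-operation `.by` argument makes the plain append (keeping all rows rather than one per group) a one-liner:
iris %>%
  mutate(species_max = max(Sepal.Width), .by = Species)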
I am trying to use dplyr to lag some variables (all of which have a common naming convention) for each group in my data set.
I thought mutate_if would work, but I get an error (below). mutate_each works, but for all columns rather than a select few.
For example, if I were looking to lag only the Sepal measurements:
iris %>%
  tbl_df() %>%
  group_by(Species) %>%
  slice(1:3) %>%
  # mutate_each(funs(lag(.)))
  mutate_if(contains("Sepal"), funs(lag(.)))
#> Error in get(as.character(FUN), mode = "function", envir = envir) : object 'p' of mode 'function' was not found
to get a final data set like:
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fctr>
# 1 NA NA 1.4 0.2 setosa
# 2 5.1 3.5 1.4 0.2 setosa
# 3 4.9 3.0 1.3 0.2 setosa
# 4 NA NA 4.7 1.4 versicolor
# 5 7.0 3.2 4.5 1.5 versicolor
# 6 6.4 3.2 4.9 1.5 versicolor
# 7 NA NA 6.0 2.5 virginica
# 8 6.3 3.3 5.1 1.9 virginica
# 9 5.8 2.7 5.9 2.1 virginica
This seems to work:
library(dplyr)
iris %>%
  tbl_df() %>%
  group_by(Species) %>%
  slice(1:3) %>%
  mutate_if(grepl('Sepal', names(.)), funs(lag(.)))
As @aosmith explains, contains() returns an index of the columns that match the string, whereas mutate_if relies on predicate functions that return logical vectors, which is why the grepl option works.
In addition, as @StevenBeaupre mentions:
iris %>%
  tbl_df() %>%
  group_by(Species) %>%
  slice(1:3) %>%
  mutate_at(vars(contains('Sepal')), lag)
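In current dplyr (>= 1.0), where mutate_if()/mutate_at() are superseded, the across() equivalent would be:
iris %>%
  group_by(Species) %>%
  slice(1:3) %>%
  mutate(across(contains("Sepal"), lag))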