I want to create a data frame that summarizes values such as the number of observations, mean and median for each variable, and that also stores a ggplot histogram per variable in a list column. For this, I will use the iris dataset.
This is my first attempt:
iris %>%
pivot_longer(-Species,
names_to = "Vars",
values_to = "Values") %>%
group_by(Vars) %>%
summarise(obs = n(),
mean = round(mean(Values),2),
median = round(median(Values),2))
So it gives me:
# A tibble: 4 x 4
Vars obs mean median
<chr> <int> <dbl> <dbl>
1 Petal.Length 150 3.76 4.35
2 Petal.Width 150 1.2 1.3
3 Sepal.Length 150 5.84 5.8
4 Sepal.Width 150 3.06 3
This is the expected table:
# A tibble: 4 x 5
Vars obs mean median plot
<chr> <int> <dbl> <dbl> <list>
1 Petal.Length 150 3.76 4.35 <gg>
2 Petal.Width 150 1.2 1.3 <gg>
3 Sepal.Length 150 5.84 5.8 <gg>
4 Sepal.Width 150 3.06 3 <gg>
This is what I have tried:
iris %>%
pivot_longer(-Species,
names_to = "Vars",
values_to = "Values") %>%
group_by(Vars) %>%
nest() %>%
mutate(metrics = lapply(data, function(df) df %>% summarise(obs = n(), mean = mean(Values), median = median(Values))),
plots = lapply(data, function(df) df %>% ggplot(aes(Values)) + geom_histogram()))
Almost there, I see this:
# A tibble: 4 x 4
# Groups: Vars [4]
Vars data metrics plots
<chr> <list> <list> <list>
1 Sepal.Length <tibble [150 × 2]> <tibble [1 × 3]> <gg>
2 Sepal.Width <tibble [150 × 2]> <tibble [1 × 3]> <gg>
3 Petal.Length <tibble [150 × 2]> <tibble [1 × 3]> <gg>
4 Petal.Width <tibble [150 × 2]> <tibble [1 × 3]> <gg>
But I don't know how to see the expected tibble with the obs, mean, median and plots columns without the data and metrics columns. Any help will be greatly appreciated.
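We may use cur_data() inside summarise() and wrap the ggplot object in list(), so that each plot is stored as one element of a list column. (As of dplyr 1.1.0, cur_data() is deprecated in favour of pick().)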
library(dplyr)
library(ggplot2)
library(tidyr)
out <- iris %>%
  pivot_longer(-Species,
               names_to = "Vars",
               values_to = "Values") %>%
  group_by(Vars) %>%
  summarise(obs = n(),
            mean = round(mean(Values), 2),
            median = round(median(Values), 2),
            plots = list(ggplot(cur_data(), aes(Values)) + geom_histogram()))
Output:
out
# A tibble: 4 × 5
Vars obs mean median plots
<chr> <int> <dbl> <dbl> <list>
1 Petal.Length 150 3.76 4.35 <gg>
2 Petal.Width 150 1.2 1.3 <gg>
3 Sepal.Length 150 5.84 5.8 <gg>
4 Sepal.Width 150 3.06 3 <gg>
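Since cur_data() is deprecated from dplyr 1.1.0 onward, the same idea can be sketched with pick(). This is a minimal variant that assumes dplyr >= 1.1.0; the name out_pick and the explicit bins = 30 are illustrative choices, not part of the original answer.

```r
library(dplyr)
library(ggplot2)
library(tidyr)

out_pick <- iris %>%
  pivot_longer(-Species, names_to = "Vars", values_to = "Values") %>%
  group_by(Vars) %>%
  summarise(
    obs    = n(),
    mean   = round(mean(Values), 2),
    median = round(median(Values), 2),
    # pick(everything()) returns the current group's non-grouping columns
    # as a data frame; list() keeps one plot per row of the summary
    plots  = list(ggplot(pick(everything()), aes(Values)) +
                    geom_histogram(bins = 30))
  )

out_pick$plots[[1]]  # draws the histogram for the first variable
```

Each element of the plots column is an ordinary ggplot object, so it can be printed or saved individually.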
I have a list of nested tibbles (which includes lists themselves etc, sort of like inception).
library(tidyverse)
# example function I want to apply
fun1 <- function(data) {
minimum <- min(data$Sepal.Length)
return(minimum)
}
# example of nested list
list_nested_tibbles <- list(
list1 = iris %>% group_by(Species) %>% nest(),
list2 = iris %>% group_by(Species) %>% nest()
)
# applying function to one of the nested tibbles within the list
list_nested_tibbles$list1 %>% mutate(minimum = map_dbl(.x = data, ~ fun1(.x)))
#> # A tibble: 3 x 3
#> # Groups: Species [3]
#> Species data minimum
#> <fct> <list> <dbl>
#> 1 setosa <tibble [50 x 4]> 4.3
#> 2 versicolor <tibble [50 x 4]> 4.9
#> 3 virginica <tibble [50 x 4]> 4.9
# function fails if I apply across whole list
list_nested_tibbles %>% mutate(minimum = map_dbl(.x = data, ~ fun1(.x)))
#> Error in UseMethod("mutate"): no applicable method for 'mutate' applied to an object of class "list"
However, I want to apply the function to both list items at once. I'm guessing I need some sort of nested map statement?
Any help appreciated.
Cheers.
Here's how to do it with purrr functions:
map(list_nested_tibbles, ~ .x %>% mutate(minimum = map_dbl(data, ~ fun1(.x))))
$list1
# A tibble: 3 × 3
# Groups: Species [3]
Species data minimum
<fct> <list> <dbl>
1 setosa <tibble [50 × 4]> 4.3
2 versicolor <tibble [50 × 4]> 4.9
3 virginica <tibble [50 × 4]> 4.9
$list2
# A tibble: 3 × 3
# Groups: Species [3]
Species data minimum
<fct> <list> <dbl>
1 setosa <tibble [50 × 4]> 4.3
2 versicolor <tibble [50 × 4]> 4.9
3 virginica <tibble [50 × 4]> 4.9
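The nested `~ .x` lambdas above can be hard to read, because the inner `.x` shadows the outer one. As a sketch of the same computation, the inner step can be moved into a named helper; add_minimum is a hypothetical name introduced here for illustration.

```r
library(dplyr)
library(purrr)
library(tidyr)

fun1 <- function(data) min(data$Sepal.Length)

list_nested_tibbles <- list(
  list1 = iris %>% group_by(Species) %>% nest(),
  list2 = iris %>% group_by(Species) %>% nest()
)

# Inner level: apply fun1 to every nested tibble of one outer list element
add_minimum <- function(tbl) {
  tbl %>% mutate(minimum = map_dbl(data, fun1))
}

# Outer level: apply the helper to every element of the list
result <- map(list_nested_tibbles, add_minimum)
```

The outer map() walks the list, and the helper's map_dbl() walks the list column inside each tibble, so the two levels of iteration stay visually separate.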
I'm trying to mutate a new variable in a nested dataframe with an ifelse-condition. But the problem is that after implementing the ifelse-condition the nested dataframe turns into a list.
I want to show this problem with the iris dataset:
Here you can see the original nested format:
iris %>% nest(data = -Species)
# A tibble: 3 x 2
Species data
<fct> <list>
1 setosa <tibble [50 x 4]>
2 versicolor <tibble [50 x 4]>
3 virginica <tibble [50 x 4]>
And now I want to mutate a new variable in the nested dataframes:
iris %>%
nest(data = -Species) %>%
mutate(data = map(data, function(x)
x %>% mutate(`Sepal.Length^2` = Sepal.Length^2)))
# A tibble: 3 x 2
Species data
<fct> <list>
1 setosa <tibble [50 x 5]>
2 versicolor <tibble [50 x 5]>
3 virginica <tibble [50 x 5]>
This code works. The data-column is as desired in the tibble-format.
But if I insert the ifelse-condition now, the tibble-format is lost:
iris %>%
nest(data = -Species) %>%
mutate(data = map(data, function(x)
ifelse(!is.na(x), x %>% mutate(`Sepal.Length^2` = Sepal.Length^2), NA)))
# A tibble: 3 x 2
Species data
<fct> <list>
1 setosa <list [200]>
2 versicolor <list [200]>
3 virginica <list [200]>
I want to keep the tibble-format even with the ifelse-condition.
Can anyone help me?
In the first step of the map() computation, i.e. data in setosa, the input x of your custom function is actually
x <- iris[1:50, 1:4]
Then you put x into ifelse()
ifelse(!is.na(x), # part 1
x %>% mutate(`Sepal.Length^2` = Sepal.Length^2), # part 2
NA) # part 3
The first part is !is.na(x), which returns 50x4=200 logical values. Hence, the second and third parts will be recycled to length 200. However, the second part, i.e.
x %>% mutate(`Sepal.Length^2` = Sepal.Length^2)
is a tibble with 5 variables, which is also a list with length 5, so each variable in this tibble will be recycled 40 times and subsequently a list with length 200 will be created. That is why you get 3 lists of length 200.
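The recycling described above can be checked directly in base R. This is a small sketch that uses a plain data frame built with cbind() as a stand-in for the mutate() result; the column name Sepal.Length.sq is an illustrative assumption.

```r
x <- iris[1:50, 1:4]                                  # the setosa block
yes <- cbind(x, Sepal.Length.sq = x$Sepal.Length^2)   # stand-in for the mutate() result

test <- !is.na(x)
length(test)   # 200: one logical value per cell of the 50 x 4 block
length(yes)    # 5: a data frame is a list of its columns

# ifelse() returns something shaped like `test`, so `yes` is recycled
# column by column until 200 elements are filled
res <- ifelse(test, yes, NA)
is.list(res)
length(res)
```

The result is a list with one element per cell of the test, which is why the data column shows lists of length 200 instead of tibbles.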
In your case, ifelse() may not be applicable. You can adjust it to
iris %>%
  nest(data = -Species) %>%
  add_row(Species = "example", data = NA) %>%
  mutate(data = map(data, function(x) {
    if(is.data.frame(x))
      x %>% mutate(`Sepal.Length^2` = Sepal.Length^2)
    else
      NULL
  }))
# # A tibble: 4 x 2
# Species data
# <chr> <list>
# 1 setosa <tibble [50 × 5]>
# 2 versicolor <tibble [50 × 5]>
# 3 virginica <tibble [50 × 5]>
# 4 example <NULL>
Make sure that the condition in if() is a single logical value.
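To illustrate the point about if(), here is a small sketch of what happens with a condition of length two. Depending on the R version this raises either a warning or an error (an error from R 4.2.0 onward), so the sketch catches both; the variable names are illustrative.

```r
cond <- c(TRUE, FALSE)   # a length-2 condition, e.g. from a vectorised test

outcome <- tryCatch(
  if (cond) "ran" else "skipped",
  error   = function(e) "error",
  warning = function(w) "warning"
)
outcome   # "error" or "warning", depending on the R version

# A predicate such as is.data.frame() always returns one TRUE or FALSE,
# which is what if() expects
if (is.data.frame(iris)) "data frame" else "something else"
```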
Thanks to @27ϕ9 for a neater version with map_if():
iris %>%
  nest(data = -Species) %>%
  add_row(Species = "example", data = NA) %>%
  mutate(data = map_if(data, is_tibble,
                       ~ mutate(.x, `Sepal.Length^2` = Sepal.Length^2),
                       .else = NULL))
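One difference from the if/else version is worth noting: with `.else = NULL`, map_if() leaves elements that fail the predicate unchanged rather than replacing them with NULL. A minimal sketch on a plain list (the list items is made up for illustration):

```r
library(purrr)

items <- list(a = 1:3, b = NA, c = 4:6)

# Only integer vectors are transformed; everything else passes through as-is
doubled <- map_if(items, is.integer, ~ .x * 2L)

doubled$a   # 2 4 6
doubled$b   # NA, untouched
```

So in the nested example the "example" row would keep its NA placeholder instead of becoming NULL.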
I am trying to move away from rowwise() for list columns as I have heard that the tidyverse team is in the process of axing it. However, I am not used to using the purrr functions so I feel like there must be a better way of doing the following:
I create a list-column containing a tibble for each species. I then want to go into the tibble and take the mean of certain variables. The first case is using map and second is the rowwise solution that I personally feel is cleaner.
Does anyone know a better way to use map in this situation?
library(tidyverse)
iris %>%
group_by(Species) %>%
nest() %>%
mutate(mean_slength = map_dbl(data, ~mean(.$Sepal.Length, na.rm = TRUE)),
mean_swidth = map_dbl(data, ~mean(.$Sepal.Width, na.rm = TRUE))
)
#> # A tibble: 3 x 4
#> Species data mean_slength mean_swidth
#> <fct> <list> <dbl> <dbl>
#> 1 setosa <tibble [50 x 4]> 5.01 3.43
#> 2 versicolor <tibble [50 x 4]> 5.94 2.77
#> 3 virginica <tibble [50 x 4]> 6.59 2.97
iris %>%
group_by(Species) %>%
nest() %>%
rowwise() %>%
mutate(mean_slength = mean(data$Sepal.Length, na.rm = TRUE),
mean_swidth = mean(data$Sepal.Width, na.rm = TRUE))
#> Source: local data frame [3 x 4]
#> Groups: <by row>
#>
#> # A tibble: 3 x 4
#> Species data mean_slength mean_swidth
#> <fct> <list> <dbl> <dbl>
#> 1 setosa <tibble [50 x 4]> 5.01 3.43
#> 2 versicolor <tibble [50 x 4]> 5.94 2.77
#> 3 virginica <tibble [50 x 4]> 6.59 2.97
Created on 2018-12-26 by the reprex package (v0.2.1)
Instead of two map() calls, use a single one with summarise_at():
library(tidyverse)
iris %>%
  group_by(Species) %>%
  nest() %>%
  mutate(out = map(data, ~
                     .x %>%
                     summarise_at(vars(matches('Sepal')),
                                  funs(mean_s = mean(., na.rm = TRUE))))) %>%
  unnest(out)
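funs() has been deprecated since dplyr 0.8.0 and summarise_at() is superseded by across(). The following is a sketch of the same single-map idea in that newer style, assuming dplyr >= 1.0.0; the name iris_means and the `.names` pattern are illustrative choices meant to reproduce the `_mean_s` suffix.

```r
library(dplyr)
library(purrr)
library(tidyr)

iris_means <- iris %>%
  group_by(Species) %>%
  nest() %>%
  mutate(out = map(data, function(d) {
    # one summarise() per nested tibble, covering every Sepal column
    summarise(d, across(matches("Sepal"),
                        function(col) mean(col, na.rm = TRUE),
                        .names = "{.col}_mean_s"))
  })) %>%
  unnest(out)

iris_means
```

Using named function arguments (d, col) instead of nested `~ .x` lambdas avoids any ambiguity about which level `.x` refers to.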
Thanks to this site, I'm using the R purrr package to aggregate data based on multiple columns. The aggregation is working how I want but the output is not. Here is a sample using the mtcars dataset.
library(dplyr)
library(purrr)
#pull in data
data <- mtcars
#get colnames
variable1 <- colnames(data)
#map the variables
t1 <- map(variable1, ~ data %>%
group_by_at(.x) %>%
summarize(number = mean(mpg))) %>%
set_names(variable1) %>%
bind_rows(., .id = 'variable')
Where I expect three columns (predictor variable, levels within each of those variables, aggregation), I have 8. See the image below:
How can I take my code up at the top and turn out a tidy dataset?
A simple way to do this is to reshape your data to long form, which lets you aggregate with ordinary dplyr:
library(tidyverse)
mpg_means <- mtcars %>%
  gather(variable, value, -mpg) %>%
  group_by(variable, value) %>%
  summarise(mean_mpg = mean(mpg))
mpg_means
#> # A tibble: 146 x 3
#> # Groups: variable [?]
#> variable value mean_mpg
#> <chr> <dbl> <dbl>
#> 1 am 0. 17.1
#> 2 am 1. 24.4
#> 3 carb 1. 25.3
#> 4 carb 2. 22.4
#> 5 carb 3. 16.3
#> 6 carb 4. 15.8
#> 7 carb 6. 19.7
#> 8 carb 8. 15.0
#> 9 cyl 4. 26.7
#> 10 cyl 6. 19.7
#> # ... with 136 more rows
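gather() is superseded in current tidyr, so the same reshaping can be sketched with pivot_longer() (assuming tidyr >= 1.0.0). The name mpg_means_long and the `.groups = "drop"` argument are additions for illustration.

```r
library(dplyr)
library(tidyr)

mpg_means_long <- mtcars %>%
  # every column except mpg becomes a (variable, value) pair
  pivot_longer(-mpg, names_to = "variable", values_to = "value") %>%
  group_by(variable, value) %>%
  summarise(mean_mpg = mean(mpg), .groups = "drop")

mpg_means_long
```

The result has the same three columns as above: the predictor, its level, and the mean mpg at that level.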
Note that while mtcars is entirely numeric, if you have different types, converting to long form will coerce variable types. The calculations will be the same, but it may cause issues later. To resolve it, use an output format that can handle diverse types, e.g.
mpg_means_in_list_cols <- mtcars %>%
as_tibble() %>% # compact printing for list columns
summarise_all(list) %>% # collapse each column into a list of itself
gather(group, group_values, -mpg) %>%
mutate(mpg_means = map2(mpg, group_values, # for each mpg/value pair, ...
~tibble(mpg = .x, group_value = .y) %>% # ...reconstruct a data frame...
group_by(group_value) %>%
summarise(mean_mpg = mean(mpg)))) # ...and aggregate
mpg_means_in_list_cols
#> # A tibble: 10 x 4
#> mpg group group_values mpg_means
#> <list> <chr> <list> <list>
#> 1 <dbl [32]> cyl <dbl [32]> <tibble [3 × 2]>
#> 2 <dbl [32]> disp <dbl [32]> <tibble [27 × 2]>
#> 3 <dbl [32]> hp <dbl [32]> <tibble [22 × 2]>
#> 4 <dbl [32]> drat <dbl [32]> <tibble [22 × 2]>
#> 5 <dbl [32]> wt <dbl [32]> <tibble [29 × 2]>
#> 6 <dbl [32]> qsec <dbl [32]> <tibble [30 × 2]>
#> 7 <dbl [32]> vs <dbl [32]> <tibble [2 × 2]>
#> 8 <dbl [32]> am <dbl [32]> <tibble [2 × 2]>
#> 9 <dbl [32]> gear <dbl [32]> <tibble [3 × 2]>
#> 10 <dbl [32]> carb <dbl [32]> <tibble [6 × 2]>
While this is decidedly not as pretty, it's capable of holding many types tidily. To extract the result above, just add %>% unnest(mpg_means). As-is, grouping variables are each held in a list element of group_values and in aggregated form in the first column of each mpg_means tibble.
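As a sketch of the unnest() step mentioned above, the list-column pipeline can be rebuilt and flattened as follows. This version swaps in across() and pivot_longer() for summarise_all() and gather(), which is an assumption about package versions (dplyr >= 1.0.0, tidyr >= 1.0.0) rather than the code used in the answer; flat_means is an illustrative name.

```r
library(dplyr)
library(purrr)
library(tidyr)

mpg_means_in_list_cols <- mtcars %>%
  as_tibble() %>%
  summarise(across(everything(), list)) %>%        # each column becomes a one-element list
  pivot_longer(-mpg, names_to = "group", values_to = "group_values") %>%
  mutate(mpg_means = map2(mpg, group_values,
                          ~ tibble(mpg = .x, group_value = .y) %>%
                            group_by(group_value) %>%
                            summarise(mean_mpg = mean(mpg))))

# Keep only the identifying column and flatten the nested summaries
flat_means <- mpg_means_in_list_cols %>%
  select(group, mpg_means) %>%
  unnest(mpg_means)

flat_means
```

With all-numeric data such as mtcars this gives the same rows as the long-form approach; the list-column route only pays off when the grouping variables have different types.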
When grouping your data within the map, you can rename the grouping variable to "level", since those values will form the column containing the levels of the grouping variable in the final data set.
When you have mixed types of grouping variables (e.g. both numeric and character), you'll also need to coerce the grouping variable to character in order to be able to bind the results together.
With those additions, you should get what you expect. (You can also skip the bind_rows by using map_df instead of map, to save a little bit of code, like I've done below.)
reprex::reprex_info()
#> Created by the reprex package v0.1.1.9000 on 2018-02-09
library(purrr)
library(dplyr)
data <- iris
vars <- names(data)
set_names(vars) %>%
  map_df(function(var) {
    var <- set_names(var, "level")
    data %>%
      group_by_at(var) %>%
      summarize_at("Sepal.Length", "mean") %>%
      mutate_at("level", as.character)
  }, .id = "variable")
#> # A tibble: 126 x 3
#> variable level Sepal.Length
#> <chr> <chr> <dbl>
#> 1 Sepal.Length 4.3 4.3
#> 2 Sepal.Length 4.4 4.4
#> 3 Sepal.Length 4.5 4.5
#> 4 Sepal.Length 4.6 4.6
#> 5 Sepal.Length 4.7 4.7
#> 6 Sepal.Length 4.8 4.8
#> 7 Sepal.Length 4.9 4.9
#> 8 Sepal.Length 5 5.0
#> 9 Sepal.Length 5.1 5.1
#> 10 Sepal.Length 5.2 5.2
#> # ... with 116 more rows
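The scoped verbs (group_by_at(), summarize_at(), mutate_at()) and map_df() are superseded in current dplyr and purrr. A sketch of the same idea without them, assuming dplyr >= 1.0.0, creates the level column directly inside group_by(); level_means is an illustrative name.

```r
library(dplyr)
library(purrr)

level_means <- names(iris) %>%
  set_names() %>%
  map(function(var) {
    iris %>%
      # build a character "level" column from the current grouping variable
      group_by(level = as.character(.data[[var]])) %>%
      summarise(Sepal.Length = mean(Sepal.Length))
  }) %>%
  bind_rows(.id = "variable")

level_means
```

Because the levels are converted to character before grouping, their sort order within each variable is alphabetical rather than numeric, which may differ from the output shown above.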
You could also wrap the process in a function, and allow multiple variables to summarise with multiple functions. You'd have to spend a moment to come up with an evocative name though (I cheated and just used foo here).
foo <- function(data, vars, funs) {
  grps <- names(data)
  set_names(grps) %>%
    map_df(function(grp) {
      grp <- set_names(grp, "level")
      data %>%
        group_by_at(grp) %>%
        summarize_at(vars, funs) %>%
        mutate_at("level", as.character)
    }, .id = "variable")
}
foo(iris, vars(Sepal.Length, Sepal.Width), funs(mean, sd))
#> # A tibble: 126 x 6
#> variable level Sepal.Length_mean Sepal.Width_mean Sepal.Length_sd
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Sepal.Length 4.3 4.3 3.000000 NaN
#> 2 Sepal.Length 4.4 4.4 3.033333 0
#> 3 Sepal.Length 4.5 4.5 2.300000 NaN
#> 4 Sepal.Length 4.6 4.6 3.325000 0
#> 5 Sepal.Length 4.7 4.7 3.200000 0
#> 6 Sepal.Length 4.8 4.8 3.180000 0
#> 7 Sepal.Length 4.9 4.9 2.950000 0
#> 8 Sepal.Length 5 5.0 3.120000 0
#> 9 Sepal.Length 5.1 5.1 3.477778 0
#> 10 Sepal.Length 5.2 5.2 3.425000 0
#> # ... with 116 more rows, and 1 more variables: Sepal.Width_sd <dbl>
Let's say I have two datasets for the same group of irises over two years:
# Create data for reproducible results.
iris.2007 <- iris
iris.2008 <- iris
iris.2008[1:4] <- 2*iris.2008[1:4] # let's make the 2008 data different
I would like to fit a separate linear model for each species in the 2007 data, which I can do like this:
# First nest by Species.
iris.2007.nested <- iris.2007 %>%
group_by(Species) %>%
nest()
# Now apply the linear model call by group using the data.
iris.2007.nested <- iris.2007.nested %>%
mutate(models = map(data,
~ lm(Petal.Length ~ Petal.Width, data = .)))
When we look at the results, they make sense as a nicely-organized tibble.
head(iris.2007.nested)
# A tibble: 3 × 3
Species data models
<fctr> <list> <list>
1 setosa <tibble [50 × 4]> <S3: lm>
2 versicolor <tibble [50 × 4]> <S3: lm>
3 virginica <tibble [50 × 4]> <S3: lm>
Now let's do the same thing to the 2008 data.
# First nest by species.
iris.2008.nested <- iris.2008 %>%
group_by(Species) %>%
nest()
# Now apply the linear model call by species using the data.
iris.2008.nested <- iris.2008.nested %>%
mutate(models = map(data,
~ lm(Petal.Length ~ Petal.Width, data = .)))
Again, we end up with a nice tibble.
head(iris.2008.nested)
# A tibble: 3 × 3
Species data models
<fctr> <list> <list>
1 setosa <tibble [50 × 4]> <S3: lm>
2 versicolor <tibble [50 × 4]> <S3: lm>
3 virginica <tibble [50 × 4]> <S3: lm>
Now what I would like to do is use the linear models from the 2008 data to predict results using the 2007 data. Thinking that the best way to do that would be to combine the two datasets (retaining the group structure), here is what happens when I try to merge the two nested tibbles:
iris.both.nested <- merge(iris.2007.nested, iris.2008.nested, by='Species')
As you can see below, the tibble no longer seems to have the same format as the individual tibbles above. Specifically, the organization is hard to discern (note that I am not including the full output in this chunk, but you get the idea).
head(iris.both.nested)
Species
1 setosa
2 versicolor
3 virginica
data.x
1 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, ...
... <truncated>
1 1.327563, 0.5464903, -0.03686145, -0.03686145, -0.1368614, 0.06313855,
...
And although I can apparently still apply the models fitted to the 2008 data (models.y) to the data from 2007 (data.x):
iris.both.nested.pred <- iris.both.nested %>%
mutate( pred = map2(models.y,
data.x, predict))
The result is again not a nicely-organized tibble: (again not showing full output)
head(iris.both.nested.pred)
Species
1 setosa
2 versicolor
3 virginica
data.x
1 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, ...
... <truncated>
1 1.327563, 0.5464903, -0.03686145, -0.03686145, -0.1368614,
...
So my question is -- is this process working even though the tibbles become strangely organized after the merge? Or am I missing something? Thanks!
install.packages("pacman")
pacman::p_load(tidyverse)
iris_2007 <- iris %>% mutate(year = 2007)
iris_2008 <- iris %>% mutate(year = 2008)
iris_2008[1:4] <- 2 * iris_2008[1:4]
# combine data
iris_all_data <- iris_2007 %>%
bind_rows(iris_2008) %>%
group_by(Species) %>%
nest()
# model and predict
iris_predict <- iris_all_data %>%
  mutate(
    modelData = data %>% map(., ~ filter(., year == 2007)),
    validationData = data %>% map(., ~ filter(., year == 2008)),
    model = modelData %>% map(., ~ lm(Petal.Length ~ Petal.Width, data = .)),
    prediction = map2(
      .x = model, .y = validationData, ~ predict(object = .x, newdata = .y)
    )
  ) %>%
  select(Species, prediction) %>%
  unnest(cols = c(prediction))
print(iris_predict)
I would double nest it first and apply the models later
# Data
iris.2007 <- iris
iris.2008 <- iris
iris.2008[1:4] <- 2*iris.2008[1:4]
joined<-bind_rows(
cbind(dset=rep("iris.2007",length(iris.2007$Species)),iris.2007)
,cbind(dset=rep("iris.2008",length(iris.2008$Species)),iris.2008)
)
# Double nesting
joined_nested<-
joined %>% group_by(dset) %>% nest(.key=data1) %>%
mutate(data1 = map(data1, ~.x %>% group_by(Species) %>% nest))
# Now apply the linear model call by group using the data.
joined_nested_models<-
joined_nested %>% mutate(data1 = map(data1, ~.x %>%
mutate(models = map(data,
~ lm(Petal.Length ~ Petal.Width, data = .)))
))
joined_nested_models %>% unnest
# # A tibble: 6 × 4
# dset Species data models
# <chr> <fctr> <list> <list>
# 1 iris.2007 setosa <tibble [50 × 4]> <S3: lm>
# 2 iris.2007 versicolor <tibble [50 × 4]> <S3: lm>
# 3 iris.2007 virginica <tibble [50 × 4]> <S3: lm>
# 4 iris.2008 setosa <tibble [50 × 4]> <S3: lm>
# 5 iris.2008 versicolor <tibble [50 × 4]> <S3: lm>
# 6 iris.2008 virginica <tibble [50 × 4]> <S3: lm>
This is a tidier version of what you get with inner_join().
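For comparison, here is a sketch of the inner_join() route. Unlike base merge(), which returns a plain data frame, inner_join() on two tibbles returns a tibble, so the list columns keep their compact printing. The helper fit_by_species and the suffixes are hypothetical names introduced for illustration.

```r
library(dplyr)
library(purrr)
library(tidyr)

iris.2007 <- iris
iris.2008 <- iris
iris.2008[1:4] <- 2 * iris.2008[1:4]

# Nest by species and fit one linear model per nested tibble
fit_by_species <- function(df) {
  df %>%
    nest(data = -Species) %>%
    mutate(models = map(data, ~ lm(Petal.Length ~ Petal.Width, data = .x)))
}

iris.both <- inner_join(fit_by_species(iris.2007),
                        fit_by_species(iris.2008),
                        by = "Species",
                        suffix = c(".2007", ".2008")) %>%
  # models fitted on 2008 data, applied to the 2007 observations
  mutate(pred = map2(models.2008, data.2007, ~ predict(.x, newdata = .y)))

iris.both
```

Each element of pred is a numeric vector with one prediction per 2007 observation of that species.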