I'm trying to add a new variable to a nested dataframe with mutate(), guarded by an ifelse-condition. The problem is that as soon as I add the ifelse-condition, each nested dataframe turns into a list.
I want to show this problem with the iris dataset:
Here you can see the original nested format:
iris %>% nest(data = -Species)
# A tibble: 3 x 2
Species data
<fct> <list>
1 setosa <tibble [50 x 4]>
2 versicolor <tibble [50 x 4]>
3 virginica <tibble [50 x 4]>
And now I want to mutate a new variable in the nested dataframes:
iris %>%
nest(data = -Species) %>%
mutate(data = map(data, function(x)
x %>% mutate(`Sepal.Length^2` = Sepal.Length^2)))
# A tibble: 3 x 2
Species data
<fct> <list>
1 setosa <tibble [50 x 5]>
2 versicolor <tibble [50 x 5]>
3 virginica <tibble [50 x 5]>
This code works. The data-column is as desired in the tibble-format.
But if I insert the ifelse-condition now, the tibble-format is lost:
iris %>%
nest(data = -Species) %>%
mutate(data = map(data, function(x)
ifelse(!is.na(x), x %>% mutate(`Sepal.Length^2` = Sepal.Length^2), NA)))
# A tibble: 3 x 2
Species data
<fct> <list>
1 setosa <list [200]>
2 versicolor <list [200]>
3 virginica <list [200]>
I want to keep the tibble-format even with the ifelse-condition.
Can anyone help me?
In the first step of the map() computation, i.e. data in setosa, the input x of your custom function is actually
x <- iris[1:50, 1:4]
Then you put x into ifelse()
ifelse(!is.na(x), # part 1
x %>% mutate(`Sepal.Length^2` = Sepal.Length^2), # part 2
NA) # part 3
The first part is !is.na(x), which returns 50x4=200 logical values. Hence, the second and third parts will be recycled to length 200. However, the second part, i.e.
x %>% mutate(`Sepal.Length^2` = Sepal.Length^2)
is a tibble with 5 variables, and a tibble is also a list of length 5. Its columns are therefore recycled 40 times, producing a list of length 200. That is why you get 3 lists of length 200.
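The recycling described above can be reproduced outside the pipeline. This is a minimal base R sketch: `transform()` stands in for `mutate()` so that no packages are needed, and the names `test`, `yes` and `res` are only illustrative.

```r
# First 50 rows of iris without Species, i.e. what map() passes in for setosa
x <- iris[1:50, 1:4]

test <- !is.na(x)                                       # 50 x 4 logical matrix
yes  <- transform(x, Sepal.Length.sq = Sepal.Length^2)  # data frame with 5 columns

length(test)  # 200: one logical value per cell
length(yes)   # 5: a data frame is a list of its columns

# ifelse() recycles `yes` to the length of `test`, column by column
res <- ifelse(test, yes, NA)

is.data.frame(res)  # FALSE: the data frame structure is gone
length(res)         # 200: the 5 columns repeated 40 times
```

Each element of `res` is a whole column of 50 values, which is why the result prints as a list rather than a tibble.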
ifelse() is not suitable here because it operates element-wise. You can use a plain if/else instead:
iris %>%
nest(data = -Species) %>%
add_row(Species = "example", data = NA) %>%
mutate(data = map(data, function(x) {
if(is.data.frame(x))
x %>% mutate(`Sepal.Length^2` = Sepal.Length^2)
else
NULL
}))
# # A tibble: 4 x 2
# Species data
# <chr> <list>
# 1 setosa <tibble [50 × 5]>
# 2 versicolor <tibble [50 × 5]>
# 3 virginica <tibble [50 × 5]>
# 4 example <NULL>
Note that the condition in if() must be a single logical value.
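A small base R sketch of what "a single logical value" means in practice. The variable names are illustrative. In recent R versions a condition longer than one element is an error, while older versions only warned and used the first element, so the failing case is wrapped in `tryCatch()`.

```r
x <- iris[1:50, 1:4]

cond_ok  <- is.data.frame(x)  # one TRUE/FALSE for the whole object
cond_bad <- !is.na(x)         # one TRUE/FALSE per cell

length(cond_ok)   # 1
length(cond_bad)  # 200

# Using the per-cell test directly in if() is a mistake; capture what R reports
msg <- tryCatch(
  if (cond_bad) "ran" else "skipped",
  error   = function(e) conditionMessage(e),
  warning = function(w) conditionMessage(w)
)

# If a per-cell test is really needed, collapse it explicitly first
if (is.data.frame(x) && all(cond_bad)) {
  status <- "data frame without missing values"
} else {
  status <- "something else"
}
```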
Thanks to @27ϕ9 for a neater version with map_if():
iris %>%
nest(data = -Species) %>%
add_row(Species = "example", data = NA) %>%
mutate(data = map_if(data, is_tibble,
~ mutate(.x, `Sepal.Length^2` = Sepal.Length^2),
.else = NULL))
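One detail about `.else` is worth checking. As far as I understand the purrr documentation, `.else` expects a function, and the default `NULL` means "leave non-matching elements untouched" rather than "replace them with NULL". The sketch below uses a hypothetical two-element list (`inputs`) to contrast the two behaviours; `transform()` is used only to keep the example short.

```r
library(purrr)

# Hypothetical input: one data frame and one placeholder
inputs <- list(a = head(iris, 3), b = NA)

add_square <- function(d) transform(d, sq = Sepal.Length^2)

# Default .else = NULL: the non-matching element is passed through unchanged
kept <- map_if(inputs, is.data.frame, add_square)

# A function that returns NULL: the non-matching element is replaced by NULL
dropped <- map_if(inputs, is.data.frame, add_square, .else = ~ NULL)

kept$b             # NA
is.null(dropped$b) # TRUE
```

So if the goal is to turn every non-tibble entry into `NULL`, `.else = ~ NULL` states that intent explicitly.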
Related
I am nesting data and want to apply weights to the dataset. In addition, I want each nested group to carry its own weight, with a different value per group.
For example, by using the iris data:
iris %>% group_by(Species) %>% nest(.) %>% ungroup %>% mutate(data = map(data, ~ .x %>% transmute(
Sepal = apply(across(starts_with('Sepal')), 1, function(x)
weighted.mean(x, c(0.20, 0.30))),
Petal = apply(across(starts_with('Petal')), 1, function(x)
weighted.mean(x, c(0.25, 0.35)))
)))
I can apply specific weights across columns, but I also want to apply weights to the nested groups, i.e.
# A tibble: 3 × 2
Species data
<fct> <list>
1 setosa <tibble [50 × 2]>
2 versicolor <tibble [50 × 2]>
3 virginica <tibble [50 × 2]>
an extra weight for each species, such as c(0.15, 0.20, 0.25), when performing the calculation. The best I can come up with is adding an extra column with the value 1, but I do not know how to make each weight depend on the nested group.
Something like:
setosa*Sepal.Length*Sepal.Width ~ 0.15*0.20*0.30
versicolor*Sepal.Length*Sepal.Width ~ 0.20*0.20*0.30
...
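One possible reading of the request above is "scale each species' row-wise weighted means by a species-level weight". The sketch below follows that reading only. The lookup vector `species_w`, the helper `row_wmean()` and the result name `iris_weighted` are all hypothetical, and the species weights are simply the `c(0.15, 0.20, 0.25)` values from the question assigned in factor-level order.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Assumed mapping of the question's weights to species
species_w <- c(setosa = 0.15, versicolor = 0.20, virginica = 0.25)

# Row-wise weighted mean of the columns of `df` (illustrative helper)
row_wmean <- function(df, w) as.numeric(as.matrix(df) %*% w / sum(w))

iris_weighted <- iris %>%
  nest(data = -Species) %>%
  mutate(
    # Look up the weight that belongs to each nested group
    w = unname(species_w[as.character(Species)]),
    # Pass the nested tibble and its group weight to the same function
    data = map2(data, w, function(d, sw) {
      tibble(
        Sepal = sw * row_wmean(select(d, starts_with("Sepal")), c(0.20, 0.30)),
        Petal = sw * row_wmean(select(d, starts_with("Petal")), c(0.25, 0.35))
      )
    })
  )
```

The key idea is `map2()`: it walks over the list-column and the per-group weight in parallel, so each nested tibble sees only its own weight. If a different combination of the weights is intended, only the body of the inner function needs to change.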
I am trying to create a new nested column using the data from the min and max values of another nested column.
If I nest the IRIS data by Species and want to create a new nested data frame by the min and max of the Petal.Length for each Species how would I do it?
My code so far creates a function that builds a new data.frame (or expand.grid), then applies it using mutate(... map(...)).
Code/Data:
func = function(input){
data.frame(
min_to_max = seq(
from = min(.x$Petal.Length),
to = max(.x$Petal.Length),
by = 1
)
)
}
iris %>%
group_by(Species) %>%
nest() %>%
mutate(
expandDF = map(data, ~ func(.x))
)
The function body should use the argument name it declares, i.e. input and not .x
func <- function(input){
data.frame(
min_to_max = seq(
from = min(input$Petal.Length),
to = max(input$Petal.Length),
by = 1
)
)
}
-testing
iris %>%
group_by(Species) %>%
nest() %>%
mutate(
expandDF = map(data, ~ func(.x))
) %>% ungroup
-output
# A tibble: 3 × 3
Species data expandDF
<fct> <list> <list>
1 setosa <tibble [50 × 4]> <df [1 × 1]>
2 versicolor <tibble [50 × 4]> <df [3 × 1]>
3 virginica <tibble [50 × 4]> <df [3 × 1]>
We could also do this without using map, i.e. with nest_by
iris %>%
nest_by(Species) %>%
mutate(expandDF = list(data.frame(min_to_max =
seq(from = min(data$Petal.Length), to = max(data$Petal.Length))))) %>%
ungroup
# A tibble: 3 × 3
Species data expandDF
<fct> <list<tibble[,4]>> <list>
1 setosa [50 × 4] <df [1 × 1]>
2 versicolor [50 × 4] <df [3 × 1]>
3 virginica [50 × 4] <df [3 × 1]>
This is a follow-up question to an earlier question here. I am looking to take it a bit further but am not sure how.
What I want to do is extract the quintiles for each Sepal_width, Petal_width etc. and store them as a nested tibble.
Instead of Species I actually have time series data by year, so I nest on year. I want to calculate the quintiles of a number of columns within each year (as opposed to Sepal.Length etc.), then pull out all rows in quintile 1 across all years, combine them into one time series data frame, and nest the result by quintile.
I know the title might be a little confusing, so what I am trying to do is:
1) Nest the data by "year/(species)"
2) Compute the quintiles for a number of columns
3) unnest the data where all quintiles equal 1, then 2, then 3... etc. so all the years/(species) are put back into a total of 5 data frames (or hopefully new nested tibbles)
4) Renest the data by quintiles (which now contain all the years / (species))
So I will have a nested tibble of 5 tibbles (one for each quintile) where each tibble consists of all the years (species) combined back together
Data:
data(iris)
iris_quintiles <- iris %>%
as_tibble() %>%
group_by(Species) %>%
nest(.key = "data") %>%
mutate(Sep_len = map(data, ~select(.x, Species, Sepal.Length)),
Sep_len = map(Sep_len, ~mutate(.x, quantile_Sep_len = ntile(Sepal.Length, 5))),
Sep_wid = map(data, ~select(.x, Species, Sepal.Width)),
Sep_wid = map(Sep_wid, ~mutate(.x, quantile_Sep_wid = ntile(Sepal.Width, 5))),
Pet_len = map(data, ~select(.x, Species, Petal.Length)),
Pet_len = map(Pet_len, ~mutate(.x, quantile_Pet_len = ntile(Petal.Length, 5))),
Pet_wid = map(data, ~select(.x, Species, Petal.Width)),
Pet_wid = map(Pet_wid, ~mutate(.x, quantile_Pet_wid = ntile(Petal.Width, 5))))
iris_quintiles
# Here is where it gets a little messy and what I am currently doing
# is extracting them individually but I will have to do this for quantile_Sen_len, quantile_Pet_len, quantile_Pet_wid etc. where the code gets quite large
df1 <- iris_quintiles %>%
unnest() %>%
filter(quantile_Sep_len == 1)
df2 <- iris_quintiles %>%
unnest() %>%
filter(quantile_Sep_len == 2)
df3 <- iris_quintiles %>%
unnest() %>%
filter(quantile_Sep_len == 3)
df <- list(df1, df2, df3)
df <- plyr::ldply(df, data.frame)
df %>%
group_by(Species) %>%
nest(.key = "data")
Expected Output: - Not "exactly" but more or less close.
# A tibble: 5 x 6
QUINTILES data Sep_len Sep_wid Pet_len Pet_wid
<fct> <list> <list> <list> <list> <list>
1 quintile_1 <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 5]>
2 quintile_2 <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 5]>
3 quintile_3 <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 5]>
4 quintile_4 <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 5]>
5 quintile_5 <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 5]>
Ignore the tibble sizes since I copied and pasted this tibble and modified it a bit.
So there are 5 tibbles, one for each quintile. Sep_len, Sep_wid, Pet_len and Pet_wid each contain the data for all Species: the quintiles were computed within the tibble nested on Species, and then something like this
df1 <- iris_quintiles %>%
unnest() %>%
filter(quantile_Sep_len == 1)
allowed me to extract the quintile = 1 for all 3 species. So here df1 should basically be Sep_len in the above tibble for quintile_1. The following:
df2 <- iris_quintiles %>%
unnest() %>%
filter(quantile_Sep_len == 2)
would be sep_len for quintile_2 in the same tibble.
We could loop over the column names of 'iris_quintiles', unnest, then nest and reduce it to a single dataset
library(tidyverse)
map(names(iris_quintiles)[-(1:2)], ~
iris_quintiles %>%
select(Species, .x) %>%
unnest %>%
rename_at(vars(matches("quantile")), ~ "QUINTILES") %>%
group_by(QUINTILES = str_c("quintile_", QUINTILES)) %>%
nest(.key = !!.x)) %>%
reduce(inner_join, by = "QUINTILES") %>%
arrange(factor(QUINTILES, levels = str_c("quintile_", 1:5)))
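The answer above relies on `nest(.key = )` and on `unnest()` without column arguments, which newer tidyr releases may warn about. As a hedged alternative, the same "one row per quintile, one list-column per measurement" shape can be sketched with `pivot_longer()` and `pivot_wider()`. This assumes tidyr 1.0 or later; the column names `measure`, `value` and `quintile` and the result name `by_quintile` are illustrative, not taken from the original code.

```r
library(dplyr)
library(tidyr)

by_quintile <- iris %>%
  as_tibble() %>%
  # One row per (flower, measurement)
  pivot_longer(-Species, names_to = "measure", values_to = "value") %>%
  # Quintiles are computed within each species, separately per measurement
  group_by(Species, measure) %>%
  mutate(quintile = ntile(value, 5)) %>%
  ungroup() %>%
  # Collect all species that fall into the same quintile of a measurement
  nest(data = c(Species, value)) %>%
  # Spread the measurements into one list-column each
  pivot_wider(names_from = measure, values_from = data) %>%
  arrange(quintile)

by_quintile
```

Each cell is a tibble holding the rows of all three species for that quintile, so `by_quintile$Sepal.Length[[1]]` corresponds to `df1` in the question. With years in place of `Species`, the same pipeline should apply unchanged.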
I am trying to move away from rowwise() for list columns as I have heard that the tidyverse team is in the process of axing it. However, I am not used to using the purrr functions so I feel like there must be a better way of doing the following:
I create a list-column containing a tibble for each species. I then want to go into the tibble and take the mean of certain variables. The first case is using map and second is the rowwise solution that I personally feel is cleaner.
Does anyone know a better way to use map in this situation?
library(tidyverse)
iris %>%
group_by(Species) %>%
nest() %>%
mutate(mean_slength = map_dbl(data, ~mean(.$Sepal.Length, na.rm = TRUE)),
mean_swidth = map_dbl(data, ~mean(.$Sepal.Width, na.rm = TRUE))
)
#> # A tibble: 3 x 4
#> Species data mean_slength mean_swidth
#> <fct> <list> <dbl> <dbl>
#> 1 setosa <tibble [50 x 4]> 5.01 3.43
#> 2 versicolor <tibble [50 x 4]> 5.94 2.77
#> 3 virginica <tibble [50 x 4]> 6.59 2.97
iris %>%
group_by(Species) %>%
nest() %>%
rowwise() %>%
mutate(mean_slength = mean(data$Sepal.Length, na.rm = TRUE),
mean_swidth = mean(data$Sepal.Width, na.rm = TRUE))
#> Source: local data frame [3 x 4]
#> Groups: <by row>
#>
#> # A tibble: 3 x 4
#> Species data mean_slength mean_swidth
#> <fct> <list> <dbl> <dbl>
#> 1 setosa <tibble [50 x 4]> 5.01 3.43
#> 2 versicolor <tibble [50 x 4]> 5.94 2.77
#> 3 virginica <tibble [50 x 4]> 6.59 2.97
Created on 2018-12-26 by the reprex package (v0.2.1)
Instead of two map() calls, use a single one with summarise_at
library(tidyverse)
iris %>%
group_by(Species) %>%
nest() %>%
mutate(out = map(data, ~
.x %>%
summarise_at(vars(matches('Sepal')),
funs(mean_s = mean(., na.rm = TRUE))))) %>%
unnest(out)
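`summarise_at()` and `funs()` come from older dplyr releases, and `funs()` in particular has been deprecated since. A sketch of the same single-`map()` idea with `across()`, assuming dplyr 1.0 or later, could look like this (the result name `iris_means` and the `.names` pattern are illustrative choices):

```r
library(dplyr)
library(tidyr)
library(purrr)

iris_means <- iris %>%
  nest(data = -Species) %>%
  mutate(out = map(data, function(d) {
    # One summarise() call covers every Sepal column
    summarise(d, across(starts_with("Sepal"),
                        ~ mean(.x, na.rm = TRUE),
                        .names = "{.col}_mean_s"))
  })) %>%
  unnest(out)

iris_means
```

The list-column `data` is kept, and the means appear as ordinary columns next to it, which matches the layout asked for in the question.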
Let's say I have two datasets for the same group of irises over two years:
# Create data for reproducible results.
iris.2007 <- iris
iris.2008 <- iris
iris.2008[1:4] <- 2*iris.2008[1:4] # let's make the 2008 data different
I would like to fit a separate linear model for each species in the 2007 data, which I can do like this:
# First nest by Species.
iris.2007.nested <- iris.2007 %>%
group_by(Species) %>%
nest()
# Now apply the linear model call by group using the data.
iris.2007.nested <- iris.2007.nested %>%
mutate(models = map(data,
~ lm(Petal.Length ~ Petal.Width, data = .)))
When we look at the results, they make sense as a nicely-organized tibble.
head(iris.2007.nested)
# A tibble: 3 × 3
Species data models
<fctr> <list> <list>
1 setosa <tibble [50 × 4]> <S3: lm>
2 versicolor <tibble [50 × 4]> <S3: lm>
3 virginica <tibble [50 × 4]> <S3: lm>
Now let's do the same thing to the 2008 data.
# First nest by species.
iris.2008.nested <- iris.2008 %>%
group_by(Species) %>%
nest()
# Now apply the linear model call by species using the data.
iris.2008.nested <- iris.2008.nested %>%
mutate(models = map(data,
~ lm(Petal.Length ~ Petal.Width, data = .)))
Again, we end up with a nice tibble.
head(iris.2008.nested)
# A tibble: 3 × 3
Species data models
<fctr> <list> <list>
1 setosa <tibble [50 × 4]> <S3: lm>
2 versicolor <tibble [50 × 4]> <S3: lm>
3 virginica <tibble [50 × 4]> <S3: lm>
Now what I would like to do is use the linear models from the 2008 data to predict results using the 2007 data. Thinking that the best way to do that would be to combine the two datasets (retaining the group structure), here is what happens when I try to merge the two nested tibbles:
iris.both.nested <- merge(iris.2007.nested, iris.2008.nested, by='Species')
As you can see below, the tibble no longer seems to have the same format as the individual tibbles above. Specifically, the organization is hard to discern (note that I am not including the full output in this chunk, but you get the idea).
head(iris.both.nested)
Species
1 setosa
2 versicolor
3 virginica
data.x
1 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, ...
... <truncated>
1 1.327563, 0.5464903, -0.03686145, -0.03686145, -0.1368614, 0.06313855,
...
And although I can still apparently use the models fitted to the 2008 data (as models.y) to the data from 2007 (as data.x):
iris.both.nested.pred <- iris.both.nested %>%
mutate( pred = map2(models.y,
data.x, predict))
The result is again not a nicely-organized tibble: (again not showing full output)
head(iris.both.nested.pred)
Species
1 setosa
2 versicolor
3 virginica
data.x
1 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, ...
... <truncated>
1 1.327563, 0.5464903, -0.03686145, -0.03686145, -0.1368614,
...
So my question is -- is this process working even though the tibbles become strangely organized after the merge? Or am I missing something? Thanks!
install.packages("pacman")
pacman::p_load(tidyverse)
iris_2007 <- iris %>% mutate(year = 2007)
iris_2008 <- iris %>% mutate(year = 2008)
iris_2008[1:4] <- 2 * iris_2008[1:4]
# combine data
iris_all_data <- iris_2007 %>%
bind_rows(iris_2008) %>%
group_by(Species) %>%
nest()
# model and predict
iris_predict <- iris_all_data %>%
mutate(
modelData = data %>% map(., ~ filter(., year == 2007)),
validationData = data %>% map(., ~ filter(., year == 2008)),
model = modelData %>% map(., ~ lm(Petal.Length ~ Petal.Width, data = .)),
prediction = map2(
.x = model, .y = validationData, ~ predict(object = .x, newdata = .y)
)
) %>%
select(Species, prediction) %>%
unnest(cols = c(prediction))
print(iris_predict)
I would double nest it first and apply the models later
# Data
iris.2007 <- iris
iris.2008 <- iris
iris.2008[1:4] <- 2*iris.2008[1:4]
joined<-bind_rows(
cbind(dset=rep("iris.2007",length(iris.2007$Species)),iris.2007)
,cbind(dset=rep("iris.2008",length(iris.2008$Species)),iris.2008)
)
# Double nesting
joined_nested<-
joined %>% group_by(dset) %>% nest(.key=data1) %>%
mutate(data1 = map(data1, ~.x %>% group_by(Species) %>% nest))
# Now apply the linear model call by group using the data.
joined_nested_models<-
joined_nested %>% mutate(data1 = map(data1, ~.x %>%
mutate(models = map(data,
~ lm(Petal.Length ~ Petal.Width, data = .)))
))
joined_nested_models %>% unnest
# # A tibble: 6 × 4
# dset Species data models
# <chr> <fctr> <list> <list>
# 1 iris.2007 setosa <tibble [50 × 4]> <S3: lm>
# 2 iris.2007 versicolor <tibble [50 × 4]> <S3: lm>
# 3 iris.2007 virginica <tibble [50 × 4]> <S3: lm>
# 4 iris.2008 setosa <tibble [50 × 4]> <S3: lm>
# 5 iris.2008 versicolor <tibble [50 × 4]> <S3: lm>
# 6 iris.2008 virginica <tibble [50 × 4]> <S3: lm>
This is a tidier version of what you get with inner_join.
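For comparison, here is a sketch of the `inner_join()` route mentioned above. Base `merge()` returns a plain data.frame, which is why the list-columns print in full in the question, whereas dplyr joins on tibbles return a tibble. The helper `fit_by_species()` and the suffixes are hypothetical names chosen for this example, and the current `nest(data = -Species)` interface is assumed.

```r
library(dplyr)
library(tidyr)
library(purrr)

iris_2007 <- iris
iris_2008 <- iris
iris_2008[1:4] <- 2 * iris_2008[1:4]  # make the 2008 data different

# Nest by species and fit one linear model per group (illustrative helper)
fit_by_species <- function(df) {
  df %>%
    nest(data = -Species) %>%
    mutate(models = map(data, function(d) lm(Petal.Length ~ Petal.Width, data = d)))
}

# Joining two tibbles keeps the compact list-column printing
both <- inner_join(
  fit_by_species(iris_2007),
  fit_by_species(iris_2008),
  by = "Species",
  suffix = c(".2007", ".2008")
)

# Apply the 2008 models to the 2007 data, as asked in the question
both_pred <- both %>%
  mutate(pred = map2(models.2008, data.2007,
                     function(m, d) predict(m, newdata = d)))

both_pred
```

Naming `newdata` explicitly makes it clear that `predict()` is evaluated on the 2007 rows rather than on the data the model was fitted to.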