I am nesting data and want to apply weights to the dataset. However, I also want each nested group to carry its own weight, with a different value per group.
For example, using the iris data:
iris %>% group_by(Species) %>% nest(.) %>% ungroup %>% mutate(data = map(data, ~ .x %>% transmute(
Sepal = apply(across(starts_with('Sepal')), 1, function(x)
weighted.mean(x, c(0.20, 0.30))),
Petal = apply(across(starts_with('Petal')), 1, function(x)
weighted.mean(x, c(0.25, 0.35)))
)))
I can apply specific weights across columns; however, I also want to apply weights to the nested groups, i.e.
# A tibble: 3 × 2
Species data
<fct> <list>
1 setosa <tibble [50 × 2]>
2 versicolor <tibble [50 × 2]>
3 virginica <tibble [50 × 2]>
an extra weight for each species, such as c(0.15, 0.20, 0.25), applied when performing the calculation. The best I can come up with is adding an extra column with the value 1, but I do not know how to make each weight depend on the nested group.
Something like:
setosa*Sepal.Length*Sepal.Width ~ 0.15*0.20*0.30
versicolor*Sepal.Length*Sepal.Width ~ 0.20*0.20*0.30
...
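One possible reading of the request, sketched under assumptions: the per-species weight simply scales the column-weighted mean. The lookup vector species_weights and the helper column w are hypothetical names, and the vectorised expression below is equivalent to weighted.mean(x, c(0.20, 0.30)) applied row by row.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical lookup: one weight per species (values taken from the question)
species_weights <- c(setosa = 0.15, versicolor = 0.20, virginica = 0.25)

weighted <- iris %>%
  nest(data = -Species) %>%
  mutate(
    # attach the group-level weight next to each nested tibble
    w = unname(species_weights[as.character(Species)]),
    # map2() passes the nested tibble and its species weight together
    data = map2(data, w, function(d, sw) {
      transmute(
        d,
        # same as weighted.mean(c(Sepal.Length, Sepal.Width), c(0.20, 0.30)), then scaled
        Sepal = sw * (0.20 * Sepal.Length + 0.30 * Sepal.Width) / (0.20 + 0.30),
        Petal = sw * (0.25 * Petal.Length + 0.35 * Petal.Width) / (0.25 + 0.35)
      )
    })
  )
```

If the species weight should enter the calculation differently (for example as a product of raw weights rather than a scaling factor), only the expressions inside transmute() need to change; the map2() pairing of tibble and weight stays the same.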
I have fitted linear models with map(), and I would like to check how successful they are, so I would like to split my data into train and test sets, but I am not sure where to add that step.
At the moment I have something like this:
library(tidyverse)
library(broom)
df<-iris %>% group_by(Species) %>% nest() %>%
mutate(model=map(data,~lm(formula= "Sepal.Length ~ Sepal.Width",data=.x)))
final<-vector()
for(i in 1:length(df$model)){
the_model<-df$model[[i]]
value<-broom::glance(the_model)%>% select(adj.r.squared)
final[i]<-value[[1]]
}
I know I should split my data before the for loop, and then check the model's success inside it.
But I don't know how to split the data inside the nested groups so that I get 75% of each group, rather than risking something like 80% of one group, 60% of another and 75% of the last one.
What should I do?
You can do:
library(tidyverse)
iris %>%
group_by(Species) %>%
mutate(class = sample(rep(c('train', 'test'), c(n() * 0.8, n() * 0.2)))) %>%
group_by(class, .add = TRUE) %>%
nest() %>%
pivot_wider(names_from = class, values_from = data) %>%
mutate(model= map(train,~lm(formula= "Sepal.Length ~ Sepal.Width",data=.x)))
# Species train test model
# <fct> <list> <list> <list>
#1 setosa <tibble [40 × 4]> <tibble [10 × 4]> <lm>
#2 versicolor <tibble [40 × 4]> <tibble [10 × 4]> <lm>
#3 virginica <tibble [40 × 4]> <tibble [10 × 4]> <lm>
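To check how well each model does on its held-out rows, the same layout can be extended with map2_dbl(). This is a minimal sketch, assuming a 75/25 split (as asked in the question) and RMSE as the success measure; split_models and test_rmse are illustrative names, and floor() is used so the group sizes are whole numbers.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(42)  # make the random split reproducible

split_models <- iris %>%
  group_by(Species) %>%
  # label exactly floor(75%) of each species as train, the rest as test
  mutate(class = sample(rep(
    c("train", "test"),
    times = c(floor(n() * 0.75), n() - floor(n() * 0.75))
  ))) %>%
  ungroup() %>%
  nest(data = -c(Species, class)) %>%
  pivot_wider(names_from = class, values_from = data) %>%
  mutate(
    model = map(train, ~ lm(Sepal.Length ~ Sepal.Width, data = .x)),
    # root mean squared error of each model on its own test set
    test_rmse = map2_dbl(
      model, test,
      ~ sqrt(mean((.y$Sepal.Length - predict(.x, newdata = .y))^2))
    )
  )
```

Because the split is drawn inside group_by(Species), every species contributes the same share of rows to train and test.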
I'm trying to mutate a new variable in a nested dataframe with an ifelse-condition. But the problem is that after implementing the ifelse-condition the nested dataframe turns into a list.
I want to show this problem with the iris dataset:
Here you can see the original nested format:
iris %>% nest(data = -Species)
# A tibble: 3 x 2
Species data
<fct> <list>
1 setosa <tibble [50 x 4]>
2 versicolor <tibble [50 x 4]>
3 virginica <tibble [50 x 4]>
And now I want to mutate a new variable in the nested dataframes:
iris %>%
nest(data = -Species) %>%
mutate(data = map(data, function(x)
x %>% mutate(`Sepal.Length^2` = Sepal.Length^2)))
# A tibble: 3 x 2
Species data
<fct> <list>
1 setosa <tibble [50 x 5]>
2 versicolor <tibble [50 x 5]>
3 virginica <tibble [50 x 5]>
This code works. The data-column is as desired in the tibble-format.
But if I insert the ifelse-condition now, the tibble-format is lost:
iris %>%
nest(data = -Species) %>%
mutate(data = map(data, function(x)
ifelse(!is.na(x), x %>% mutate(`Sepal.Length^2` = Sepal.Length^2), NA)))
# A tibble: 3 x 2
Species data
<fct> <list>
1 setosa <list [200]>
2 versicolor <list [200]>
3 virginica <list [200]>
I want to keep the tibble-format even with the ifelse-condition.
Can anyone help me?
In the first step of the map() computation, i.e. data in setosa, the input x of your custom function is actually
x <- iris[1:50, 1:4]
Then you put x into ifelse()
ifelse(!is.na(x), # part 1
x %>% mutate(`Sepal.Length^2` = Sepal.Length^2), # part 2
NA) # part 3
The first part is !is.na(x), which returns 50x4=200 logical values. Hence, the second and third parts will be recycled to length 200. However, the second part, i.e.
x %>% mutate(`Sepal.Length^2` = Sepal.Length^2)
is a tibble with 5 variables, which is also a list with length 5, so each variable in this tibble will be recycled 40 times and subsequently a list with length 200 will be created. That is why you get 3 lists of length 200.
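The recycling described above can be checked directly outside of map(). A small sketch, using the first 50 rows of iris as a stand-in for the setosa group; the names test, yes and res are only for illustration.

```r
library(dplyr)

x <- iris[1:50, 1:4]

test <- !is.na(x)        # logical matrix with 50 x 4 entries
n_test <- length(test)   # number of logical values ifelse() has to fill

yes <- x %>% mutate(`Sepal.Length^2` = Sepal.Length^2)
n_yes <- length(yes)     # a data frame is a list of its columns

# ifelse() returns something shaped like `test`, so `yes` is recycled as a list
res <- ifelse(test, yes, NA)
res_is_list <- is.list(res)
res_length <- length(res)
```

Here n_test is 200 and n_yes is 5, so the five columns are repeated until 200 slots are filled, which is where the list of length 200 comes from.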
In your case, ifelse() may not be applicable. You can adjust it to
iris %>%
nest(data = -Species) %>%
add_row(Species = "example", data = NA) %>%
mutate(data = map(data, function(x) {
if(is.data.frame(x))
x %>% mutate(`Sepal.Length^2` = Sepal.Length^2)
else
NULL
}))
# # A tibble: 4 x 2
# Species data
# <chr> <list>
# 1 setosa <tibble [50 × 5]>
# 2 versicolor <tibble [50 × 5]>
# 3 virginica <tibble [50 × 5]>
# 4 example <NULL>
Make sure that the condition in if() is a single logical value.
Thanks to @27ϕ9 for a neater version with map_if():
iris %>%
nest(data = -Species) %>%
add_row(Species = "example", data = NA) %>%
mutate(data = map_if(data, is_tibble,
~ mutate(.x, `Sepal.Length^2` = Sepal.Length^2),
.else = NULL))
This is a follow-on question to an earlier question here. I am looking to take it a bit further but am not sure how.
What I want to do is extract the quintiles for each of Sepal.Width, Petal.Width etc. and store them as a nested tibble.
Instead of Species I actually have time series data with years (so I nest on the years). I want to calculate the quintiles of a number of columns within each year (as opposed to Sepal.Length etc.), then pull out all observations in quintile 1 across all years, combine them into one time series data frame, and then nest the result by quintile.
I know the title might be a little confusing, so what I am trying to do is:
1) Nest the data by "year/(species)"
2) Compute the quintiles for a number of columns
3) Unnest the data where all quintiles equal 1, then 2, then 3... etc. so all the years/(species) are put back into a total of 5 data frames (or hopefully new nested tibbles)
4) Renest the data by quintiles (which now contain all the years/(species))
So I will have a nested tibble of 5 tibbles (one for each quintile) where each tibble consists of all the years (species) combined back together
Data:
data(iris)
iris_quintiles <- iris %>%
as_tibble() %>%
group_by(Species) %>%
nest(.key = "data") %>%
mutate(Sep_len = map(data, ~select(.x, Species, Sepal.Length)),
Sep_len = map(Sep_len, ~mutate(.x, quantile_Sep_len = ntile(Sepal.Length, 5))),
Sep_wid = map(data, ~select(.x, Species, Sepal.Width)),
Sep_wid = map(Sep_wid, ~mutate(.x, quantile_Sep_wid = ntile(Sepal.Width, 5))),
Pet_len = map(data, ~select(.x, Species, Petal.Length)),
Pet_len = map(Pet_len, ~mutate(.x, quantile_Pet_len = ntile(Petal.Length, 5))),
Pet_wid = map(data, ~select(.x, Species, Petal.Width)),
Pet_wid = map(Pet_wid, ~mutate(.x, quantile_Pet_wid = ntile(Petal.Width, 5))))
iris_quintiles
# Here is where it gets a little messy and what I am currently doing
# is extracting them individually but I will have to do this for quantile_Sep_len, quantile_Pet_len, quantile_Pet_wid etc. where the code gets quite large
df1 <- iris_quintiles %>%
unnest() %>%
filter(quantile_Sep_len == 1)
df2 <- iris_quintiles %>%
unnest() %>%
filter(quantile_Sep_len == 2)
df3 <- iris_quintiles %>%
unnest() %>%
filter(quantile_Sep_len == 3)
df <- list(df1, df2, df3)
df <- plyr::ldply(df, data.frame)
df %>%
group_by(Species) %>%
nest(.key = "data")
Expected Output: - Not "exactly" but more or less close.
# A tibble: 5 x 6
QUINTILES data Sep_len Sep_wid Pet_len Pet_wid
<fct> <list> <list> <list> <list> <list>
1 quintile_1 <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 5]>
2 quintile_2 <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 5]>
3 quintile_3 <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 5]>
4 quintile_4 <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 5]>
5 quintile_5 <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 4]> <tibble [50 x 5]>
Ignore the tibble sizes since I copied and pasted this tibble and modified it a bit.
So there are 5 tibbles, one for each quintile. The Sep_len, Sep_wid, Pet_len and Pet_wid columns consist of all the data (for all Species). That is, I nested the tibble on Species, performed the quintile operation, and then using something like this
df1 <- iris_quintiles %>%
unnest() %>%
filter(quantile_Sep_len == 1)
allowed me to extract the quintile = 1 for all 3 species. So here df1 should basically be Sep_len in the above tibble for quintile_1. The following:
df2 <- iris_quintiles %>%
unnest() %>%
filter(quantile_Sep_len == 2)
would be sep_len for quintile_2 in the same tibble.
We could loop over the column names of 'iris_quintiles', unnest, then nest, and reduce the results to a single dataset:
library(tidyverse)
map(names(iris_quintiles)[-(1:2)], ~
iris_quintiles %>%
select(Species, .x) %>%
unnest %>%
rename_at(vars(matches("quantile")), ~ "QUINTILES") %>%
group_by(QUINTILES = str_c("quintile_", QUINTILES)) %>%
nest(.key = !!.x)) %>%
reduce(inner_join, by = "QUINTILES") %>%
arrange(factor(QUINTILES, levels = str_c("quintile_", 1:5)))
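The answer above relies on older tidyr arguments such as nest(.key = ...). With current tidyr the same shape can be reached by reshaping to long format first. This is a sketch under assumptions, not the answer's own method: it starts from iris rather than iris_quintiles, and the column names measure, value and quintile are illustrative.

```r
library(dplyr)
library(tidyr)

quintile_nested <- iris %>%
  as_tibble() %>%
  # one row per (flower, measurement) so all four columns are handled at once
  pivot_longer(-Species, names_to = "measure", values_to = "value") %>%
  group_by(Species, measure) %>%
  # quintiles are computed within each species, separately per measurement
  mutate(quintile = ntile(value, 5)) %>%
  ungroup() %>%
  mutate(QUINTILES = paste0("quintile_", quintile)) %>%
  # re-nest by quintile and measurement; each tibble holds all species
  nest(data = c(Species, value, quintile)) %>%
  pivot_wider(names_from = measure, values_from = data) %>%
  arrange(QUINTILES)
```

The result has one row per quintile and one list-column per measurement. With yearly data, the year column would take the place of Species in group_by().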
Hi, I am trying to apply a very simple function using purrr::map, but I keep getting this error:
Error in mutate_impl(.data, dots) :
Evaluation error: unused argument (.x[[i]]).
The code is below:
data = data.frame(name = c('A', 'B', 'C'), metric = c(0.29, 0.39,0.89))
get_sample_size = function(metric, threshold = 0.01){
sample_size = ceiling((1.96^2)*(metric*(1-metric))/(threshold^2))
return(data.frame(sample_size))
}
data %>% group_by(name) %>% tidyr::nest() %>%
dplyr::mutate(result = purrr::map( .x = data, .f = get_sample_size, metric = metric, threshold = 0.01 ))
You don't need nest. The metric argument of the get_sample_size function should be a numeric vector, but if you nest, the data column becomes a list of data frames, which cannot be used as input for the metric argument.
I think you can use summarize and map to apply your function to the metric column.
library(tidyverse)
data %>%
group_by(name) %>%
summarize(result = purrr::map(.x = metric,
.f = get_sample_size,
threshold = 0.01))
# # A tibble: 3 x 2
# name result
# <fct> <list>
# 1 A <data.frame [1 x 1]>
# 2 B <data.frame [1 x 1]>
# 3 C <data.frame [1 x 1]>
When you pass metric in the ... part of map, it's not clear that that is a column in the nested data frame. But once you nest the data like you've done, metric isn't a column in data, it's a column in the nested frame...also called "data." (This is a good example of why you want more specific variable names btw.)
If you're mapping over the data column, you can use $metric to point to that column, either in writing out a function, as I've done here (such as df$metric), or in formula notation (such as .$metric).
As @www said, you don't need nested data frames in this case. But for a more complicated case, you might need nested data frames to work with, such as for building models, so it's good to know how to reference exactly the data you want.
library(tidyverse)
data %>%
group_by(name) %>%
tidyr::nest() %>%
mutate(result = map(data, function(df) {
get_sample_size(metric = df$metric, threshold = 0.01)
}))
#> # A tibble: 3 x 3
#> name data result
#> <fct> <list> <list>
#> 1 A <tibble [1 × 1]> <data.frame [1 × 1]>
#> 2 B <tibble [1 × 1]> <data.frame [1 × 1]>
#> 3 C <tibble [1 × 1]> <data.frame [1 × 1]>
Created on 2019-01-16 by the reprex package (v0.2.1)
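The formula notation mentioned above works the same way. A minimal sketch, assuming current tidyr (nest(data = -name)) and using .x$metric to reach the column inside each nested tibble; the final unnest() is optional and only flattens the result.

```r
library(dplyr)
library(tidyr)
library(purrr)

data <- data.frame(name = c("A", "B", "C"), metric = c(0.29, 0.39, 0.89))

get_sample_size <- function(metric, threshold = 0.01) {
  sample_size <- ceiling((1.96^2) * (metric * (1 - metric)) / (threshold^2))
  data.frame(sample_size)
}

result <- data %>%
  nest(data = -name) %>%
  # .x is the nested tibble for one name; .x$metric is its numeric column
  mutate(result = map(data, ~ get_sample_size(metric = .x$metric, threshold = 0.01))) %>%
  unnest(result)
```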
Let's say I have two datasets for the same group of irises over two years:
# Create data for reproducible results.
iris.2007 <- iris
iris.2008 <- iris
iris.2008[1:4] <- 2*iris.2008[1:4] # let's make the 2008 data different
I would like to fit a separate linear model for each species in the 2007 data, which I can do like this:
# First nest by Species.
iris.2007.nested <- iris.2007 %>%
group_by(Species) %>%
nest()
# Now apply the linear model call by group using the data.
iris.2007.nested <- iris.2007.nested %>%
mutate(models = map(data,
~ lm(Petal.Length ~ Petal.Width, data = .)))
When we look at the results, they make sense as a nicely-organized tibble.
head(iris.2007.nested)
# A tibble: 3 × 3
Species data models
<fctr> <list> <list>
1 setosa <tibble [50 × 4]> <S3: lm>
2 versicolor <tibble [50 × 4]> <S3: lm>
3 virginica <tibble [50 × 4]> <S3: lm>
Now let's do the same thing to the 2008 data.
# First nest by species.
iris.2008.nested <- iris.2008 %>%
group_by(Species) %>%
nest()
# Now apply the linear model call by species using the data.
iris.2008.nested <- iris.2008.nested %>%
mutate(models = map(data,
~ lm(Petal.Length ~ Petal.Width, data = .)))
Again, we end up with a nice tibble.
head(iris.2008.nested)
# A tibble: 3 × 3
Species data models
<fctr> <list> <list>
1 setosa <tibble [50 × 4]> <S3: lm>
2 versicolor <tibble [50 × 4]> <S3: lm>
3 virginica <tibble [50 × 4]> <S3: lm>
Now what I would like to do is use the linear models from the 2008 data to predict results using the 2007 data. Thinking that the best way to do that would be to combine the two datasets (retaining the group structure), here is what happens when I try to merge the two nested tibbles:
iris.both.nested <- merge(iris.2007.nested, iris.2008.nested, by='Species')
As you can see below, the tibble no longer seems to have the same format as the individual tibbles above. Specifically, the organization is hard to discern (note that I am not including the full output in this chunk, but you get the idea).
head(iris.both.nested)
Species
1 setosa
2 versicolor
3 virginica
data.x
1 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, ...
... <truncated>
1 1.327563, 0.5464903, -0.03686145, -0.03686145, -0.1368614, 0.06313855,
...
I can still apparently apply the models fitted to the 2008 data (as models.y) to the data from 2007 (as data.x):
iris.both.nested.pred <- iris.both.nested %>%
mutate( pred = map2(models.y,
data.x, predict))
The result is again not a nicely-organized tibble: (again not showing full output)
head(iris.both.nested.pred)
Species
1 setosa
2 versicolor
3 virginica
data.x
1 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, ...
... <truncated>
1 1.327563, 0.5464903, -0.03686145, -0.03686145, -0.1368614,
...
So my question is -- is this process working even though the tibbles become strangely organized after the merge? Or am I missing something? Thanks!
install.packages("pacman")
pacman::p_load(tidyverse)
iris_2007 <- iris %>% mutate(year = 2007)
iris_2008 <- iris %>% mutate(year = 2008)
iris_2008[1:4] <- 2 * iris_2008[1:4]
# combine data
iris_all_data <- iris_2007 %>%
bind_rows(iris_2008) %>%
group_by(Species) %>%
nest()
# model and predict
iris_predict <- iris_all_data %>%
mutate(
modelData = data %>% map(., ~ filter(., year == 2007)),
validationData = data %>% map(., ~ filter(., year == 2008)),
model = modelData %>% map(., ~ lm(Petal.Length ~ Petal.Width, data = .)),
prediction = map2(
.x = model, .y = validationData, ~ predict(object = .x, newdata = .y)
)
) %>%
select(Species, prediction) %>%
unnest(cols = c(prediction))
print(iris_predict)
I would double nest it first and apply the models later
# Data
iris.2007 <- iris
iris.2008 <- iris
iris.2008[1:4] <- 2*iris.2008[1:4]
joined<-bind_rows(
cbind(dset=rep("iris.2007",length(iris.2007$Species)),iris.2007)
,cbind(dset=rep("iris.2008",length(iris.2008$Species)),iris.2008)
)
# Double nesting
joined_nested<-
joined %>% group_by(dset) %>% nest(.key=data1) %>%
mutate(data1 = map(data1, ~.x %>% group_by(Species) %>% nest))
# Now apply the linear model call by group using the data.
joined_nested_models<-
joined_nested %>% mutate(data1 = map(data1, ~.x %>%
mutate(models = map(data,
~ lm(Petal.Length ~ Petal.Width, data = .)))
))
joined_nested_models %>% unnest
# # A tibble: 6 × 4
# dset Species data models
# <chr> <fctr> <list> <list>
# 1 iris.2007 setosa <tibble [50 × 4]> <S3: lm>
# 2 iris.2007 versicolor <tibble [50 × 4]> <S3: lm>
# 3 iris.2007 virginica <tibble [50 × 4]> <S3: lm>
# 4 iris.2008 setosa <tibble [50 × 4]> <S3: lm>
# 5 iris.2008 versicolor <tibble [50 × 4]> <S3: lm>
# 6 iris.2008 virginica <tibble [50 × 4]> <S3: lm>
This is a tidier version of what you get with inner_join.
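For comparison, the inner_join route can be sketched as follows. Unlike base merge(), dplyr's join returns a tibble, so the list-columns keep printing compactly. The helper fit_by_species and the suffixes are hypothetical names chosen for this example; the prediction direction (2008 models applied to 2007 data) follows the question.

```r
library(dplyr)
library(tidyr)
library(purrr)

iris.2007 <- iris
iris.2008 <- iris
iris.2008[1:4] <- 2 * iris.2008[1:4]

# Hypothetical helper: nest by species and fit one linear model per group
fit_by_species <- function(df) {
  df %>%
    nest(data = -Species) %>%
    mutate(models = map(data, ~ lm(Petal.Length ~ Petal.Width, data = .x)))
}

both <- inner_join(
  fit_by_species(iris.2007),
  fit_by_species(iris.2008),
  by = "Species",
  suffix = c(".2007", ".2008")
) %>%
  # apply each 2008 model to the matching species' 2007 data
  mutate(pred = map2(models.2008, data.2007, ~ predict(.x, newdata = .y)))
```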