R: predict new values for groups


I've calculated a different regression for each group in a data frame:
DF.L <- DF %>%
group_by(Channel) %>%
do(Fit = rlm(L ~ -1 + Y + I(Y^2), data = .))
I want to apply this set of regressions to another data frame. To do so, I'm testing how to apply it to the same data frame:
DF %>%
group_by(Channel) %>%
do({
Lfit <- predict(subset(DF.L, Channel == unique(.$Channel))$Fit, .)
data.frame(., Lfit)
})
glimpse(DF)
But I keep getting this error:
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "list"
Calls: %>% ... do_.grouped_df -> eval -> eval -> predict -> predict
What am I doing wrong?

Using the built-in ChickWeight data:
library(dplyr)
library(MASS)
library(broom)
library(tidyr)
library(ggplot2)
head(ChickWeight)
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
Fit some models
ChickWeight_models <- ChickWeight %>%
group_by(Diet) %>%
do(fit = MASS::rlm(weight ~ Time + I(Time^2), data = .))
ChickWeight_models
Source: local data frame [4 x 2]
Groups: <by row>
# A tibble: 4 x 2
Diet fit
* <fctr> <list>
1 1 <S3: rlm>
2 2 <S3: rlm>
3 3 <S3: rlm>
4 4 <S3: rlm>
So I've created a very similar object to your DF.L. It's a frame with the four groups, each with an rlm object in a list-column called fit.
Make up some test data
Now I'll make up some data to test this model on. In this case, I'll just take the original data and add some noise to each of the variables.
ChickWeight_simulated <- ChickWeight %>%
mutate(Time = Time + runif(length(Time)),
weight = weight + rnorm(length(weight)))
ChickWeight_simulated
weight Time Chick Diet
1 42.72075 0.9786272 1 1
2 51.12669 2.8399631 1 1
3 58.64632 4.4576380 1 1
4 63.77617 6.1083591 1 1
5 75.40434 8.1051792 1 1
6 91.75830 10.7899030 1 1
Now we want to combine the dataframe of the models with the new data to test on. First we group_by and tidyr::nest the simulated data. This creates an object that is a dataframe with the four groups and a list-column called data, each element of which contains a rolled-up dataframe.
ChickWeight_simulated %>% group_by(Diet) %>% nest()
# A tibble: 4 x 2
Diet data
<fctr> <list>
1 1 <tibble [220 x 3]>
2 2 <tibble [120 x 3]>
3 3 <tibble [120 x 3]>
4 4 <tibble [118 x 3]>
Add the original models to the new data
Then we can join it to the models dataframe:
ChickWeight_simulated %>% group_by(Diet) %>% nest() %>%
full_join(ChickWeight_models)
# A tibble: 4 x 3
Diet data fit
<fctr> <list> <list>
1 1 <tibble [220 x 3]> <S3: rlm>
2 2 <tibble [120 x 3]> <S3: rlm>
3 3 <tibble [120 x 3]> <S3: rlm>
4 4 <tibble [118 x 3]> <S3: rlm>
Now we group by Diet again, and use broom::augment to make a prediction of each model on the new simulated data. Since each group is one row, there is one element each of fit and data; we have to extract that single element out of each list-column into a usable form by using [[1]].
ChickWeight_simulated_predicted <-
ChickWeight_simulated %>% group_by(Diet) %>% nest() %>%
full_join(ChickWeight_models) %>%
group_by(Diet) %>%
do(augment(.$fit[[1]], newdata = .$data[[1]]))
head(ChickWeight_simulated_predicted)
# A tibble: 6 x 6
# Groups: Diet [1]
Diet weight Time Chick .fitted .se.fit
<fctr> <dbl> <dbl> <ord> <dbl> <dbl>
1 1 42.72075 0.9786272 1 43.62963 2.368838
2 1 51.12669 2.8399631 1 51.80855 1.758385
3 1 58.64632 4.4576380 1 59.67606 1.534051
4 1 63.77617 6.1083591 1 68.43218 1.534152
5 1 75.40434 8.1051792 1 80.00678 1.647612
6 1 91.75830 10.7899030 1 97.26450 1.726331
Sanity check
To prove that this really only used the model from a particular level of Diet on the simulated data from that level of Diet, we can visualize the model fit.
ChickWeight_simulated_predicted %>%
ggplot(aes(Time, weight)) +
geom_point(shape = 1) +
geom_ribbon(aes(Time,
ymin = .fitted-1.96*.se.fit,
ymax = .fitted+1.96*.se.fit),
alpha = 0.5, fill = "black") +
geom_line(aes(Time, .fitted), size = 1, color = "red") +
facet_wrap(~Diet)

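I think your error comes from how you are calling predict: subset(DF.L, ...)$Fit is still a list-column (a list of length one), so predict receives a list instead of the rlm object inside it; extracting the element with [[1]] avoids that. I can't test against your exact data, but here is a simple way you can get predictions from your models. A more sophisticated way using purrr and nest is outlined here: http://ijlyttle.github.io/isugg_purrr/presentation.html#(1)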
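To make the difference between a list and the model stored inside it concrete, here is a minimal base R sketch. It is an illustration only: it uses the built-in ChickWeight data rather than your DF, and lm stands in for MASS::rlm so that no extra packages are needed.

```r
# Fit one model per Diet group; lm stands in for MASS::rlm (an assumption for brevity).
models <- lapply(split(ChickWeight, ChickWeight$Diet),
                 function(d) lm(weight ~ Time + I(Time^2), data = d))
new_data <- subset(ChickWeight, Diet == "1")

# Single brackets keep the list wrapper, so predict() has no method to dispatch to.
list_result <- tryCatch(predict(models["1"], new_data), error = function(e) e)

# Double brackets extract the model object itself, and predict() works as usual.
model_result <- predict(models[["1"]], newdata = new_data)
```

Here list_result holds the captured "no applicable method" error, while model_result is an ordinary numeric vector with one prediction per row of new_data.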
UPDATE - the purrr and nest way
Just adding this to show that it can be done pretty easily within the tidyverse, using predict. See link above for more details.
library(tidyverse)
# shuffle the rows to mix up the species
set.seed(1234)
myiris <- iris[sample(nrow(iris), replace = F),]
# create first dataset - use the first 50 rows for running the model
iris_nested <-
myiris[1:50,] %>%
nest(-Species) %>%
rename(myorigdata = data)
# create second dataset - use the other 100 rows for making predictions
new_iris_nested <-
myiris[51:150,] %>%
nest(-Species) %>%
rename(mynewdata = data)
# make a model function
my_rlm <- function(df) {
MASS::rlm(Sepal.Length ~ Petal.Length + Petal.Width, data = df)
}
# get the predictions (see the GitHub link above which breaks this into steps)
predictions_tall <-
iris_nested %>%
mutate(my_model = map(myorigdata, my_rlm)) %>%
full_join(new_iris_nested, by = "Species") %>%
mutate(my_new_pred = map2(my_model, mynewdata, predict)) %>%
select(Species, mynewdata, my_new_pred) %>%
unnest(mynewdata, my_new_pred) %>%
rename(modeled = my_new_pred, measured = Sepal.Length) %>%
gather("Type", "Sepal.Length", modeled, measured)
The nested predictions_tall object looks like this:
predictions_tall %>% nest(-Species, -Type) %>% as_tibble()
# A tibble: 6 x 3
Species Type data
<fctr> <chr> <list>
1 setosa modeled <data.frame [32 x 4]>
2 versicolor modeled <data.frame [33 x 4]>
3 virginica modeled <data.frame [35 x 4]>
4 setosa measured <data.frame [32 x 4]>
5 versicolor measured <data.frame [33 x 4]>
6 virginica measured <data.frame [35 x 4]>
And finally, the plot to show the prediction results:
predictions_tall %>%
ggplot(aes(x = Petal.Length, y = Sepal.Length)) +
geom_line(aes(color = Species, linetype = Type))
ORIGINAL - the broom way
I've updated this now to only calculate predictions for each group using the model for that group.
This way uses the broom package - specifically the augment function - to add fitted values. See more here: https://cran.r-project.org/web/packages/broom/vignettes/broom.html
Since you don't supply data, I use iris here.
library(tidyverse)
library(broom)
# first shuffle around the rows of iris
set.seed(1234)
myiris <- iris[sample(nrow(iris), replace = F),]
# first data - first 25 rows for running the models on
origiris <-
myiris[1:25,] %>%
nest(-Species) %>%
rename(origdata = data)
# second data - last 50 rows for predicting on
prediris <-
myiris[101:150,] %>%
nest(-Species) %>%
rename(preddata = data)
# estimate models on the first 25 rows
# a separate model is estimated for each species
iris_mod <-
origiris %>%
mutate(mod = map(origdata, ~ MASS::rlm(Sepal.Length ~ Petal.Length + Petal.Width, data = .)))
First get fitted values for the original dataset (not essential, just for illustration):
# get fitted values for the first dataset (origdata)
origiris_aug <-
iris_mod %>%
mutate(origpred = map(mod, augment)) %>%
unnest(origpred) %>%
as.tibble()
The origiris_aug predictions dataframe looks like this:
origiris_aug
# A tibble: 25 x 10
Species .rownames Sepal.Length Petal.Length Petal.Width .fitted .se.fit .resid
<fctr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 18 5.1 1.4 0.3 5.002797 0.1514850 0.09720290
2 setosa 2 4.9 1.4 0.2 4.931824 0.1166911 -0.03182417
3 setosa 34 5.5 1.4 0.2 4.931824 0.1166911 0.56817583
4 setosa 40 5.1 1.5 0.2 4.981975 0.1095883 0.11802526
5 setosa 39 4.4 1.3 0.2 4.881674 0.1422123 -0.48167359
6 setosa 36 5.0 1.2 0.2 4.831523 0.1784156 0.16847698
7 setosa 25 4.8 1.9 0.2 5.182577 0.2357614 -0.38257703
8 setosa 31 4.8 1.6 0.2 5.032125 0.1241074 -0.23212531
9 setosa 42 4.5 1.3 0.3 4.952647 0.1760223 -0.45264653
10 setosa 21 5.4 1.7 0.2 5.082276 0.1542594 0.31772411
# ... with 15 more rows, and 2 more variables: .hat <dbl>, .sigma <dbl>
And now what you actually want - making predictions on the new dataset:
# get fitted values for the second dataset (preddata)
# each model is fitted to the appropriate species' nested dataframe
prediris_aug <-
iris_mod %>%
inner_join(prediris, by = "Species") %>%
map2_df(.x = iris_mod$mod, .y = prediris$preddata, .f = ~augment(.x, newdata = .y)) %>%
as.tibble()
The prediris_aug dataframe looks like this:
prediris_aug
# A tibble: 50 x 7
.rownames Sepal.Length Sepal.Width Petal.Length Petal.Width .fitted .se.fit
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 105 6.5 3.0 5.8 2.2 8.557908 3.570269
2 115 5.8 2.8 5.1 2.4 8.348800 3.666631
3 117 6.5 3.0 5.5 1.8 8.123565 3.005888
4 139 6.0 3.0 4.8 1.8 7.772511 2.812748
5 103 7.1 3.0 5.9 2.1 8.537086 3.475224
6 107 4.9 2.5 4.5 1.7 7.551086 2.611123
7 119 7.7 2.6 6.9 2.3 9.180537 4.000412
8 135 6.1 2.6 5.6 1.4 7.889823 2.611457
9 124 6.3 2.7 4.9 1.8 7.822661 2.838502
10 118 7.7 3.8 6.7 2.2 9.009263 3.825613
# ... with 40 more rows
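One thing to watch in the map2_df call above: it pairs iris_mod$mod with prediris$preddata by position and ignores the result of the inner_join, so it only applies the right model to the right species if both tables happen to list the species in the same order. The sketch below pairs them through the joined columns instead. It is an illustration under assumptions, not a drop-in replacement: lm stands in for MASS::rlm, predict is used instead of augment, the first 50 shuffled rows are used for fitting, and the column name pred is made up for this example.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1234)
myiris <- iris[sample(nrow(iris)), ]

# Nest training and prediction rows separately; species order may differ between the two.
origiris <- myiris[1:50, ] %>% nest(origdata = -Species)
prediris <- myiris[101:150, ] %>% nest(preddata = -Species)

preds <- origiris %>%
  # one model per species (lm is a stand-in for MASS::rlm)
  mutate(mod = map(origdata, ~ lm(Sepal.Length ~ Petal.Length + Petal.Width, data = .x))) %>%
  # the join lines up each model with the new data for the same species
  inner_join(prediris, by = "Species") %>%
  # map2 over the joined columns, so pairing follows the join rather than row position
  mutate(pred = map2(mod, preddata, ~ unname(predict(.x, newdata = .y)))) %>%
  select(Species, preddata, pred) %>%
  unnest(c(preddata, pred))
```

Because mod and preddata sit in the same row after the join, each prediction is guaranteed to come from the model fitted to that species, and the Species column is kept in the result.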

Related

Is there a way to summarise if column value is x?

I am trying to make a data.frame which displays the average time an individual displays a behaviour.
I have been using group_by and summarise to calculate the averages across groups. But the output is many rows down. See an example using the iris dataset...
data(iris)
x <- iris %>%
group_by(Species, Petal.Length) %>%
summarise(mean(Sepal.Length))
I would like to get an output that has, for this example, one row per 'Species' and a column of averages per 'Petal.Length'.
I have resorted to creating multiple outputs and then using left_join to combine them into the desired data.frame. See example below...
a <- iris %>%
group_by(Species) %>%
filter(Petal.Length == 0.1) %>%
summarise(mean(Sepal.Length))
b <- iris %>%
group_by(Species) %>%
filter(Petal.Length == 0.2) %>%
summarise(mean(Sepal.Length))
left_join(a, b)
However, doing this twelve or more times at a time is tedious and I am sure there must be an easy way to get the mean(Sepal.Length) for the 'Petal.Length' 0.1, and 0.2, and 0.3 (etc) in the one output.
n.b. in my data Petal.Length would actually be characters that represent behaviours and Sepal.Length would be the duration of time

Some ideas:
library(tidyverse)
data(iris)
mutate(iris, Petal.Length_discrete = cut(Petal.Length, 5)) %>%
group_by(Species, Petal.Length_discrete) %>%
summarise(mean(Sepal.Length))
#> `summarise()` has grouped output by 'Species'. You can override using the `.groups` argument.
#> # A tibble: 7 x 3
#> # Groups: Species [3]
#> Species Petal.Length_discrete `mean(Sepal.Length)`
#> <fct> <fct> <dbl>
#> 1 setosa (0.994,2.18] 5.01
#> 2 versicolor (2.18,3.36] 5
#> 3 versicolor (3.36,4.54] 5.81
#> 4 versicolor (4.54,5.72] 6.43
#> 5 virginica (3.36,4.54] 4.9
#> 6 virginica (4.54,5.72] 6.32
#> 7 virginica (5.72,6.91] 7.25
iris %>%
group_split(Species, Petal.Length) %>%
map(~ summarise(.x, mean(Sepal.Length))) %>%
head(3)
#> [[1]]
#> # A tibble: 1 x 1
#> `mean(Sepal.Length)`
#> <dbl>
#> 1 4.6
#>
#> [[2]]
#> # A tibble: 1 x 1
#> `mean(Sepal.Length)`
#> <dbl>
#> 1 4.3
#>
#> [[3]]
#> # A tibble: 1 x 1
#> `mean(Sepal.Length)`
#> <dbl>
#> 1 5.4
Created on 2021-06-28 by the reprex package (v2.0.0)
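For the wide layout described in the question (one row per Species, one column of means per Petal.Length value), a possible base R sketch uses tapply with two grouping factors. This is a swapped-in technique, not part of the ideas above; the same reshaping can also be done with tidyr::pivot_wider after summarise. The names wide_means and wide_df are made up for this example.

```r
# Mean Sepal.Length for each Species (rows) and each Petal.Length value (columns).
wide_means <- tapply(iris$Sepal.Length,
                     list(Species = iris$Species, Petal.Length = iris$Petal.Length),
                     mean)

# Turn the matrix into a data frame; combinations that never occur are NA.
wide_df <- data.frame(Species = rownames(wide_means), wide_means, check.names = FALSE)
```

With behaviours stored as characters instead of numeric Petal.Length values, the column names of wide_df would simply be the behaviour labels.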

Problem with pipe within purrr:map2 and mutate

nested_numeric <- model_table %>%
group_by(ano_fiscal) %>%
select(-c("ano_estudo", "payout", "div_ratio","ebitda", "name.company",
"alavancagem","div_pl", "div_liq", "div_total")) %>%
nest()
nested_numeric
# A tibble: 7 x 2
# Groups: ano_fiscal [7]
ano_fiscal data
<dbl> <list>
1 2012 <tibble [34 x 10]>
2 2013 <tibble [35 x 10]>
3 2014 <tibble [35 x 10]>
4 2015 <tibble [35 x 10]>
5 2016 <tibble [35 x 10]>
6 2017 <tibble [35 x 10]>
7 2018 <tibble [35 x 10]>
df_ipca$idx
[1] 0.9652515 0.9741318 0.9817300 0.9911546 0.9941281 0.9985022 1.0000000
The list-column named "data" consists of numeric variables. I want to multiply them by a deflator index (i.e. adjust for inflation).
This works fine:
map2_df(nested_numeric$data, df_ipca$idx, ~ .x * .y)
or even
map2(nested_numeric$data, df_ipca$idx, ~ .x * .y)
but I'm trying to create a new list-column named "adjusted_data" with the result of this operation:
nested_numeric <- model_table %>%
group_by(ano_fiscal) %>%
select(-c("ano_estudo", "payout", "div_ratio","ebitda", "name.company",
"alavancagem","div_pl", "div_liq", "div_total")) %>%
nest() %>%
mutate( adjusted_data = data %>% {
map2(., df_ipca$idx, ~ .x * .y)})
Gives me this error:
Error: Column `adjusted_data` must be length 1 (the group size), not 7
I hope my problem is clear enough: I'm trying to adjust a data frame for inflation, with values nested by year.
I thought that using map2 within mutate would be enough. I've tried everything and couldn't figure out what I'm doing wrong.
I've read similar questions with pipes within map2 here, but still...
Please help :)
Thank you!

A simple solution (which however does break up your pipes) is to just do
nested_numeric$adjusted_data <- map2(nested_numeric$data, df_ipca$idx, ~ .x * .y)
For example, using the iris data:
library(tidyverse)
df_ipca <- data.frame(idx = runif(3))
iris <- iris %>%
group_by(Species) %>%
nest()
iris$adjusted_data <- map2(iris$data, df_ipca$idx, ~.x * .y)
iris
#> # A tibble: 3 x 3
#> # Groups: Species [3]
#> Species data adjusted_data
#> <fct> <list> <list>
#> 1 setosa <tibble [50 × 4]> <df[,4] [50 × 4]>
#> 2 versicolor <tibble [50 × 4]> <df[,4] [50 × 4]>
#> 3 virginica <tibble [50 × 4]> <df[,4] [50 × 4]>
Using solution with mutate
If you want to do the map2 inside mutate after you have grouped and nested your data, you first have to ungroup() before calling mutate. Otherwise mutate operates within each group, where data has length 1 while df_ipca$idx has length 7, instead of looping over the entire data column, which is what you want:
nested_numeric %>%
ungroup() %>%
mutate(
adjusted_data = map2(data, df_ipca$idx, ~ .x * .y)
)
For example, using the iris data:
library(tidyverse)
df_ipca <- data.frame(idx = runif(3))
iris_nested <- iris %>%
group_by(Species) %>%
nest() %>%
ungroup() %>%
mutate(
adjusted_data = map2(data, df_ipca$idx, ~ .x * .y)
)
# Original data
map(iris_nested$data, head)
#> [[1]]
#> # A tibble: 6 x 4
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <dbl> <dbl> <dbl> <dbl>
#> 1 5.1 3.5 1.4 0.2
#> 2 4.9 3 1.4 0.2
#> 3 4.7 3.2 1.3 0.2
#> 4 4.6 3.1 1.5 0.2
#> 5 5 3.6 1.4 0.2
#> 6 5.4 3.9 1.7 0.4
#>
#> [[2]]
#> # A tibble: 6 x 4
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <dbl> <dbl> <dbl> <dbl>
#> 1 7 3.2 4.7 1.4
#> 2 6.4 3.2 4.5 1.5
#> 3 6.9 3.1 4.9 1.5
#> 4 5.5 2.3 4 1.3
#> 5 6.5 2.8 4.6 1.5
#> 6 5.7 2.8 4.5 1.3
#>
#> [[3]]
#> # A tibble: 6 x 4
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <dbl> <dbl> <dbl> <dbl>
#> 1 6.3 3.3 6 2.5
#> 2 5.8 2.7 5.1 1.9
#> 3 7.1 3 5.9 2.1
#> 4 6.3 2.9 5.6 1.8
#> 5 6.5 3 5.8 2.2
#> 6 7.6 3 6.6 2.1
# Adjusted data
map(iris_nested$adjusted_data, head)
#> [[1]]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 1.0206142 0.7004215 0.2801686 0.04002409
#> 2 0.9805901 0.6003613 0.2801686 0.04002409
#> 3 0.9405660 0.6403854 0.2601566 0.04002409
#> 4 0.9205540 0.6203733 0.3001807 0.04002409
#> 5 1.0006022 0.7204336 0.2801686 0.04002409
#> 6 1.0806503 0.7804697 0.3402047 0.08004817
#>
#> [[2]]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 0.3256959 0.1488896 0.2186816 0.06513919
#> 2 0.2977791 0.1488896 0.2093760 0.06979199
#> 3 0.3210431 0.1442368 0.2279872 0.06979199
#> 4 0.2559039 0.1070144 0.1861120 0.06048639
#> 5 0.3024319 0.1302784 0.2140288 0.06979199
#> 6 0.2652095 0.1302784 0.2093760 0.06048639
#>
#> [[3]]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 2.399749 1.257011 2.285475 0.9522814
#> 2 2.209293 1.028464 1.942654 0.7237339
#> 3 2.704479 1.142738 2.247384 0.7999164
#> 4 2.399749 1.104646 2.133110 0.6856426
#> 5 2.475932 1.142738 2.209293 0.8380076
#> 6 2.894935 1.142738 2.514023 0.7999164
In fact, you can also omit the group_by() and ungroup() calls by providing the non-nested column (in your case, ano_fiscal) to nest():
iris %>%
nest(data = -Species) %>%
mutate(
adjusted_data = map2(data, df_ipca$idx, ~ .x * .y)
)
which should give the same result as before. Note that, to avoid a warning, you should name the -Species argument inside nest() (here, data = -Species).

Run a aov test through a tibble in a tidy way

I want to run a linear regression on a data frame using the same dependent variable. A similar question was solved here. The problem is that the aov function, which implements ANOVA, doesn't accept x and y as arguments (as far as I know). Is there a way to implement the analysis in a tidy way? So far I've tried something like:
library(tidyverse)
iris %>%
as_tibble() %>%
select(Sepal.Length, Species) %>%
mutate(foo_a = as_factor(sample(c("a", "b", "c"), nrow(.), replace = T)),
foo_b = as_factor(sample(c("d", "e", "f"), nrow(.), replace = T))) %>%
map(~aov(Sepal.Length ~ .x, data = .))
Created on 2019-02-12 by the reprex package (v0.2.1)
The desired output is three analyses: Sepal.Length and Species, Sepal.Length and foo_a, and Sepal.Length and foo_b. Is this possible, or am I totally wrong?

One approach is to make this into a long-shaped data frame, group by the independent variable of interest, and use the "many models" approach. I usually prefer something like this over trying to do tidyeval across multiple columns—it just gives me a clearer sense of what's going on.
To save space, I'm working with iris_foo, which is your data as you created it up through the 2 mutate lines. Putting it into a long format gives you a key of the names of those three columns that will be used as independent variables in each of the aov calls.
library(tidyverse)
iris_foo %>%
gather(key, value, -Sepal.Length)
#> # A tibble: 450 x 3
#> Sepal.Length key value
#> <dbl> <chr> <chr>
#> 1 5.1 Species setosa
#> 2 4.9 Species setosa
#> 3 4.7 Species setosa
#> 4 4.6 Species setosa
#> 5 5 Species setosa
#> 6 5.4 Species setosa
#> 7 4.6 Species setosa
#> 8 5 Species setosa
#> 9 4.4 Species setosa
#> 10 4.9 Species setosa
#> # … with 440 more rows
From there, nest by key and create a new list-column of ANOVA models. This will be a list of aov objects. For simplicity with getting your models back out, you can drop the data column.
aov_models <- iris_foo %>%
gather(key, value, -Sepal.Length) %>%
group_by(key) %>%
nest() %>%
mutate(model = map(data, ~aov(Sepal.Length ~ value, data = .))) %>%
select(-data)
aov_models
#> # A tibble: 3 x 2
#> key model
#> <chr> <list>
#> 1 Species <S3: aov>
#> 2 foo_a <S3: aov>
#> 3 foo_b <S3: aov>
From there, you can work with the models however you like. They're accessible in the list aov_models$model. Printed, they look how you'd expect. For example, the first model:
aov_models$model[[1]]
#> Call:
#> aov(formula = Sepal.Length ~ value, data = .)
#>
#> Terms:
#> value Residuals
#> Sum of Squares 63.21213 38.95620
#> Deg. of Freedom 2 147
#>
#> Residual standard error: 0.5147894
#> Estimated effects may be unbalanced
To see all the models, call aov_models$model %>% map(print). You might also want to use broom functions, such as broom::tidy or broom::glance, depending on how you need to present the models.
aov_models$model %>%
map(broom::tidy)
#> [[1]]
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 value 2 63.2 31.6 119. 1.67e-31
#> 2 Residuals 147 39.0 0.265 NA NA
#>
#> [[2]]
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 value 2 0.281 0.141 0.203 0.817
#> 2 Residuals 147 102. 0.693 NA NA
#>
#> [[3]]
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 value 2 0.756 0.378 0.548 0.579
#> 2 Residuals 147 101. 0.690 NA NA
Or tidying all the models into a single data frame, which keeps the key column, you could do:
aov_models %>%
mutate(model_tidy = map(model, broom::tidy)) %>%
unnest(model_tidy)
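If you would rather keep the data wide, an alternative sketch is to loop over the predictor names and build each formula with reformulate from base R. This is a different technique from the long-format approach above, shown only for comparison; iris_foo is rebuilt here with a fixed seed so the block runs on its own.

```r
set.seed(1)
# Rebuild the example data: one response and three candidate predictors.
iris_foo <- data.frame(
  Sepal.Length = iris$Sepal.Length,
  Species = iris$Species,
  foo_a = factor(sample(c("a", "b", "c"), nrow(iris), replace = TRUE)),
  foo_b = factor(sample(c("d", "e", "f"), nrow(iris), replace = TRUE))
)

predictors <- setdiff(names(iris_foo), "Sepal.Length")

# One aov per predictor; reformulate() builds e.g. Sepal.Length ~ Species.
aov_models <- lapply(setNames(predictors, predictors), function(p) {
  aov(reformulate(p, response = "Sepal.Length"), data = iris_foo)
})

summary(aov_models$Species)
```

The result is a named list, so each model can be pulled out by the name of its predictor, and the printed formula shows the real column name rather than value.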

Predicting values with dplyr and augment

I'd like to fit models to a grouped data frame and then predict one new value per model (i.e. group).
library(dplyr)
library(broom)
data(iris)
dat <- rbind(iris, iris)
dat$Group <- rep(c("A", "B"), each = 150)
new.dat <- data.frame(Group = rep(c("A", "B"), each = 3),
Species = rep(c("setosa", "versicolor", "virginica"), times = 2),
Sepal.Width = 1:6)
> new.dat
Group Species Sepal.Width
1 A setosa 1
2 A versicolor 2
3 A virginica 3
4 B setosa 4
5 B versicolor 5
6 B virginica 6
However, augment returns 36 rows, as if each new value is fit with each model. How can I preserve the grouping here and get one fitted value per group?
dat %>%
group_by(Species, Group) %>%
do(augment(lm(Sepal.Length ~ Sepal.Width, data = .), newdata = new.dat))
# A tibble: 36 x 5
# Groups: Species, Group [6]
Group Species Sepal.Width .fitted .se.fit
<fct> <fct> <int> <dbl> <dbl>
1 A setosa 1 3.33 0.221
2 A versicolor 2 4.02 0.133
3 A virginica 3 4.71 0.0512
4 B setosa 4 5.40 0.0615
5 B versicolor 5 6.09 0.145
6 B virginica 6 6.78 0.234
7 A setosa 1 3.33 0.221
8 A versicolor 2 4.02 0.133
9 A virginica 3 4.71 0.0512
10 B setosa 4 5.40 0.0615
# ... with 26 more rows
(Note that due to the example data the rows are actually duplicates, which is however not the case with my original data).

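You need to make the Species and Group of new.dat match those of the group currently being processed in do. Because group.cols is a character vector, convert it to symbols with syms() before splicing it into group_by (splicing the bare strings would group by constant values rather than by the columns). You can do this like so:
group.cols <- c("Species", "Group")
dat %>%
group_by(!!!syms(group.cols)) %>%
do(augment(lm(Sepal.Length ~ Sepal.Width, data = .),
newdata = semi_join(new.dat, ., by = group.cols)))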

kmeans clustering in grouped data

I'm trying to find the centers of the clusters in grouped data. Using a sample data set and the problem definition below, I am able to create kmeans clusters within each group. However, I don't know how to get the center of each cluster for a given group. https://rdrr.io/cran/broom/man/kmeans_tidiers.html
The sample data is taken from (with small modifications to add a gr column)
Sample data
library(dplyr)
library(broom)
library(ggplot2)
set.seed(2015)
sizes_1 <- c(20, 100, 500)
sizes_2 <- c(10, 50, 100)
centers_1 <- data_frame(x = c(1, 4, 6),
y = c(5, 0, 6),
n = sizes_1,
cluster = factor(1:3))
centers_2 <- data_frame(x = c(1, 4, 6),
y = c(5, 0, 6),
n = sizes_2,
cluster = factor(1:3))
points1 <- centers_1 %>%
group_by(cluster) %>%
do(data_frame(x = rnorm(.$n, .$x),
y = rnorm(.$n, .$y),
gr="1"))
points2 <- centers_2 %>%
group_by(cluster) %>%
do(data_frame(x = rnorm(.$n, .$x),
y = rnorm(.$n, .$y),
gr="2"))
combined_points <- rbind(points1, points2)
> combined_points
# A tibble: 780 x 4
# Groups: cluster [3]
cluster x y gr
<fctr> <dbl> <dbl> <chr>
1 1 3.66473833 4.285771 1
2 1 0.51540619 5.565826 1
3 1 0.11556319 5.592178 1
4 1 1.60513712 5.360013 1
5 1 2.18001557 4.955883 1
6 1 1.53998887 4.530316 1
7 1 -1.44165622 4.561338 1
8 1 2.35076259 5.408538 1
9 1 -0.03060973 4.980363 1
10 1 2.22165205 5.125556 1
# ... with 770 more rows
ggplot(combined_points, aes(x, y)) +
facet_wrap(~gr) +
geom_point(aes(color = cluster))
Everything works up to here. The problem comes when I try to extract the cluster centers within each group:
clust <- combined_points %>%
group_by(gr) %>%
dplyr::select(x, y) %>%
kmeans(3)
> clust
K-means clustering with 3 clusters of sizes 594, 150, 36
Cluster means:
gr x y
1 1.166667 6.080832 6.0074885
2 1.333333 4.055645 0.0654158
3 1.305556 1.507862 5.2417670
As we can see, the gr values have changed, and I don't know which group these centers belong to.
Going one step further to see the tidy format of clust:
> tidy(clust)
x1 x2 x3 size withinss cluster
1 1.166667 6.080832 6.0074885 594 1095.3047 1
2 1.333333 4.055645 0.0654158 150 312.4182 2
3 1.305556 1.507862 5.2417670 36 115.2484 3
I still can't see the center information for gr 2.
I hope the problem is explained clearly. Let me know if anything is missing. Thanks in advance!

kmeans doesn't understand dplyr grouping, so it's just finding three overall centers instead of three within each group. The preferred idiom at this point is to nest the input data into a list column, e.g.
library(tidyverse)
points_and_models <- combined_points %>%
ungroup() %>% select(-cluster) %>% # cleanup, remove cluster name so data will collapse
nest(x, y) %>% # collapse input data into list column
mutate(model = map(data, kmeans, 3), # iterate model over list column of input data
centers = map(model, broom::tidy)) # extract data from models
points_and_models
#> # A tibble: 2 x 4
#> gr data model centers
#> <chr> <list> <list> <list>
#> 1 1 <tibble [620 × 2]> <S3: kmeans> <data.frame [3 × 5]>
#> 2 2 <tibble [160 × 2]> <S3: kmeans> <data.frame [3 × 5]>
points_and_models %>% unnest(centers)
#> # A tibble: 6 x 6
#> gr x1 x2 size withinss cluster
#> <chr> <dbl> <dbl> <int> <dbl> <fct>
#> 1 1 4.29 5.71 158 441. 1
#> 2 1 3.79 0.121 102 213. 2
#> 3 1 6.39 6.06 360 534. 3
#> 4 2 5.94 5.88 100 194. 1
#> 5 2 4.01 -0.127 50 97.4 2
#> 6 2 1.07 4.57 10 15.7 3
Note that the cluster column is from the model results, not the input data.
You can also do the same thing with do, e.g.
combined_points %>%
group_by(gr) %>%
do(model = kmeans(.[c('x', 'y')], 3)) %>%
ungroup() %>% group_by(gr) %>%
do(map_df(.$model, broom::tidy)) %>% ungroup()
but do and grouping rowwise are sort of soft-deprecated at this point, and the code gets a little janky, as you can see by the need to explicitly ungroup so much.
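For comparison, the same per-group idea can be sketched in base R with split and lapply. This is only an illustration: the simulated points, the helper make_points, and the nstart value are assumptions made for this example, not part of the question's data.

```r
set.seed(2015)
true_centers <- data.frame(x = c(1, 4, 6), y = c(5, 0, 6))

# Hypothetical helper: n points around each of the three true centers, tagged with a group.
make_points <- function(n, gr) {
  idx <- rep(1:3, each = n)
  data.frame(gr = gr,
             x = rnorm(3 * n, true_centers$x[idx]),
             y = rnorm(3 * n, true_centers$y[idx]))
}
pts <- rbind(make_points(50, "1"), make_points(20, "2"))

# Run kmeans separately on each group's x/y columns and tag the centers with the group.
per_group <- lapply(split(pts, pts$gr), function(d) {
  km <- kmeans(d[, c("x", "y")], centers = 3, nstart = 5)
  data.frame(gr = d$gr[1], cluster = 1:3, km$centers, size = km$size)
})
centers_by_group <- do.call(rbind, per_group)
```

centers_by_group has one row per cluster within each group, so the gr column says which group every center belongs to.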
