I want to fit a linear model to several columns of a data frame, always using the same dependent variable. A similar question was solved here. The problem is that the aov function, which implements ANOVA, doesn't accept x and y as arguments (as far as I know). Is there a way to do the analysis in a tidy way? So far I've tried something like:
library(tidyverse)
iris %>%
  as_tibble() %>%
  select(Sepal.Length, Species) %>%
  mutate(foo_a = as_factor(sample(c("a", "b", "c"), nrow(.), replace = T)),
         foo_b = as_factor(sample(c("d", "e", "f"), nrow(.), replace = T))) %>%
  map(~aov(Sepal.Length ~ .x, data = .))
Created on 2019-02-12 by the reprex package (v0.2.1)
The desired output is three analyses: Sepal.Length vs. Species, Sepal.Length vs. foo_a, and Sepal.Length vs. foo_b. Is this possible, or am I going about it completely wrong?
One approach is to make this into a long-shaped data frame, group by the independent variable of interest, and use the "many models" approach. I usually prefer something like this over trying to do tidyeval across multiple columns; it just gives me a clearer sense of what's going on.
To save space, I'm working with iris_foo, which is your data as you created it up through the 2 mutate lines. Putting it into a long format gives you a key of the names of those three columns that will be used as independent variables in each of the aov calls.
library(tidyverse)
iris_foo %>%
gather(key, value, -Sepal.Length)
#> # A tibble: 450 x 3
#> Sepal.Length key value
#> <dbl> <chr> <chr>
#> 1 5.1 Species setosa
#> 2 4.9 Species setosa
#> 3 4.7 Species setosa
#> 4 4.6 Species setosa
#> 5 5 Species setosa
#> 6 5.4 Species setosa
#> 7 4.6 Species setosa
#> 8 5 Species setosa
#> 9 4.4 Species setosa
#> 10 4.9 Species setosa
#> # … with 440 more rows
From there, nest by key and create a new list-column of ANOVA models. This will be a list of aov objects. For simplicity with getting your models back out, you can drop the data column.
aov_models <- iris_foo %>%
  gather(key, value, -Sepal.Length) %>%
  group_by(key) %>%
  nest() %>%
  mutate(model = map(data, ~aov(Sepal.Length ~ value, data = .))) %>%
  select(-data)
aov_models
#> # A tibble: 3 x 2
#> key model
#> <chr> <list>
#> 1 Species <S3: aov>
#> 2 foo_a <S3: aov>
#> 3 foo_b <S3: aov>
From there, you can work with the models however you like. They're accessible in the list aov_models$model. Printed, they look how you'd expect. For example, the first model:
aov_models$model[[1]]
#> Call:
#> aov(formula = Sepal.Length ~ value, data = .)
#>
#> Terms:
#> value Residuals
#> Sum of Squares 63.21213 38.95620
#> Deg. of Freedom 2 147
#>
#> Residual standard error: 0.5147894
#> Estimated effects may be unbalanced
To see all the models, call aov_models$model %>% map(print). You might also want to use broom functions, such as broom::tidy or broom::glance, depending on how you need to present the models.
aov_models$model %>%
map(broom::tidy)
#> [[1]]
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 value 2 63.2 31.6 119. 1.67e-31
#> 2 Residuals 147 39.0 0.265 NA NA
#>
#> [[2]]
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 value 2 0.281 0.141 0.203 0.817
#> 2 Residuals 147 102. 0.693 NA NA
#>
#> [[3]]
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 value 2 0.756 0.378 0.548 0.579
#> 2 Residuals 147 101. 0.690 NA NA
Or, to tidy all the models into a single data frame that keeps the key column, you could do:
aov_models %>%
  mutate(model_tidy = map(model, broom::tidy)) %>%
  unnest(model_tidy)
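For reference, the whole pipeline can be sketched end to end with current tidyr verbs. This is only a minimal sketch, not the exact code above: it assumes pivot_longer() in place of gather(), rebuilds iris_foo with an arbitrary seed and character columns, and uses the made-up name aov_tidy for the result.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)  # arbitrary seed so the random grouping columns are reproducible

# Rebuild the question's data: one response plus three candidate predictors
iris_foo <- iris %>%
  as_tibble() %>%
  select(Sepal.Length, Species) %>%
  mutate(
    Species = as.character(Species),
    foo_a = sample(c("a", "b", "c"), n(), replace = TRUE),
    foo_b = sample(c("d", "e", "f"), n(), replace = TRUE)
  )

aov_tidy <- iris_foo %>%
  # long format: one row per observation and predictor
  pivot_longer(-Sepal.Length, names_to = "key", values_to = "value") %>%
  group_by(key) %>%
  nest() %>%
  ungroup() %>%
  mutate(
    model = map(data, ~ aov(Sepal.Length ~ value, data = .x)),  # one aov per predictor
    model_tidy = map(model, broom::tidy)                        # one tidy table per model
  ) %>%
  select(key, model_tidy) %>%
  unnest(model_tidy)

aov_tidy
```

Each predictor contributes two rows (the value term and the residuals), and the key column says which model a row came from.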
I normalized the numeric variables with the recipes package in R before fitting decision tree models to predict an outcome. Now I have a decision tree in which age is one of the important split variables, with splits like >1.5 and < 1.5. I want to convert that -1.5 back into a non-normalized value so that it has a practical meaning (like age > 50 or <= 50 years old). I have searched and cannot find the answer.
library(recipes)
recipe_obj <- dataset %>%
  recipe(formula = anyaki ~ .) %>%  # specify formula
  step_center(all_numeric()) %>%    # center data (mean = 0)
  step_scale(all_numeric()) %>%     # scale data (sd = 1)
  prep(data = dataset)
dataset_scaled <- bake(recipe_obj, new_data = dataset)
Age is one of the variables that was normalized with the recipes package. I am struggling to convert the normalized values in the final model back into non-normalized values that have a practical meaning. How can I do this?
You can access these kinds of estimated values using the tidy() method for recipes and recipe steps. Check out more details here and here.
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
data(penguins)
penguin_rec <- recipe(~ ., data = penguins) %>%
step_other(all_nominal(), threshold = 0.2, other = "another") %>%
step_normalize(all_numeric()) %>%
step_dummy(all_nominal())
tidy(penguin_rec)
#> # A tibble: 3 × 6
#> number operation type trained skip id
#> <int> <chr> <chr> <lgl> <lgl> <chr>
#> 1 1 step other FALSE FALSE other_ZNJ2R
#> 2 2 step normalize FALSE FALSE normalize_ogEvZ
#> 3 3 step dummy FALSE FALSE dummy_YVCBo
tidy(penguin_rec, number = 1)
#> # A tibble: 1 × 3
#> terms retained id
#> <chr> <chr> <chr>
#> 1 all_nominal() <NA> other_ZNJ2R
penguin_prepped <- prep(penguin_rec, training = penguins)
#> Warning: There are new levels in a factor: NA
tidy(penguin_prepped)
#> # A tibble: 3 × 6
#> number operation type trained skip id
#> <int> <chr> <chr> <lgl> <lgl> <chr>
#> 1 1 step other TRUE FALSE other_ZNJ2R
#> 2 2 step normalize TRUE FALSE normalize_ogEvZ
#> 3 3 step dummy TRUE FALSE dummy_YVCBo
tidy(penguin_prepped, number = 1)
#> # A tibble: 6 × 3
#> terms retained id
#> <chr> <chr> <chr>
#> 1 species Adelie other_ZNJ2R
#> 2 species Gentoo other_ZNJ2R
#> 3 island Biscoe other_ZNJ2R
#> 4 island Dream other_ZNJ2R
#> 5 sex female other_ZNJ2R
#> 6 sex male other_ZNJ2R
tidy(penguin_prepped, number = 2)
#> # A tibble: 8 × 4
#> terms statistic value id
#> <chr> <chr> <dbl> <chr>
#> 1 bill_length_mm mean 43.9 normalize_ogEvZ
#> 2 bill_depth_mm mean 17.2 normalize_ogEvZ
#> 3 flipper_length_mm mean 201. normalize_ogEvZ
#> 4 body_mass_g mean 4202. normalize_ogEvZ
#> 5 bill_length_mm sd 5.46 normalize_ogEvZ
#> 6 bill_depth_mm sd 1.97 normalize_ogEvZ
#> 7 flipper_length_mm sd 14.1 normalize_ogEvZ
#> 8 body_mass_g sd 802. normalize_ogEvZ
Created on 2021-08-07 by the reprex package (v2.0.0)
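Applied to the original question, undoing the normalization is just original = z * sd + mean, with the mean and sd read from the prepped recipe via tidy(). A minimal sketch under stated assumptions: mtcars stands in for the asker's dataset, hp stands in for age, and unscale() is a hypothetical helper, not part of recipes.

```r
library(recipes)

# mtcars stands in for the asker's dataset (assumption)
rec <- recipe(mpg ~ ., data = mtcars)
rec <- step_center(rec, all_numeric())  # step 1: subtract the mean
rec <- step_scale(rec, all_numeric())   # step 2: divide by the sd
rec_prepped <- prep(rec, training = mtcars)

centers <- tidy(rec_prepped, number = 1)  # one row per variable: terms, value (the mean)
scales  <- tidy(rec_prepped, number = 2)  # one row per variable: terms, value (the sd)

# Hypothetical helper: map a normalized value back to the original units
unscale <- function(z, variable) {
  mu    <- centers$value[centers$terms == variable]
  sigma <- scales$value[scales$terms == variable]
  z * sigma + mu
}

# A tree split at -1.5 on the normalized scale, expressed in original hp units
unscale(-1.5, "hp")
```

With a single step_normalize() step, as in the penguins example, the same numbers sit in one table, distinguished by the statistic column (mean or sd).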
I am trying to make a data.frame which displays the average time an individual displays a behaviour.
I have been using group_by and summarise to calculate the averages across groups, but the output is in long format, with one row per combination of groups. See an example using the iris dataset:
data(iris)
x <- iris %>%
group_by(Species, Petal.Length) %>%
summarise(mean(Sepal.Length))
I would like to get an output that has, for this example, one row per 'Species' and a column of averages per 'Petal.Length'.
I have resorted to creating multiple outputs and then using left_join to combine them into the desired data.frame. See example below...
a <- iris %>%
group_by(Species) %>%
filter(Petal.Length == 0.1) %>%
summarise(mean(Sepal.Length))
b <- iris %>%
group_by(Species) %>%
filter(Petal.Length == 0.2) %>%
summarise(mean(Sepal.Length))
left_join(a, b)
However, doing this twelve or more times at a time is tedious and I am sure there must be an easy way to get the mean(Sepal.Length) for the 'Petal.Length' 0.1, and 0.2, and 0.3 (etc) in the one output.
n.b. in my data Petal.Length would actually be characters that represent behaviours and Sepal.Length would be the duration of time
Some ideas:
library(tidyverse)
data(iris)
mutate(iris, Petal.Length_discrete = cut(Petal.Length, 5)) %>%
group_by(Species, Petal.Length_discrete) %>%
summarise(mean(Sepal.Length))
#> `summarise()` has grouped output by 'Species'. You can override using the `.groups` argument.
#> # A tibble: 7 x 3
#> # Groups: Species [3]
#> Species Petal.Length_discrete `mean(Sepal.Length)`
#> <fct> <fct> <dbl>
#> 1 setosa (0.994,2.18] 5.01
#> 2 versicolor (2.18,3.36] 5
#> 3 versicolor (3.36,4.54] 5.81
#> 4 versicolor (4.54,5.72] 6.43
#> 5 virginica (3.36,4.54] 4.9
#> 6 virginica (4.54,5.72] 6.32
#> 7 virginica (5.72,6.91] 7.25
iris %>%
group_split(Species, Petal.Length) %>%
map(~ summarise(.x, mean(Sepal.Length))) %>%
head(3)
#> [[1]]
#> # A tibble: 1 x 1
#> `mean(Sepal.Length)`
#> <dbl>
#> 1 4.6
#>
#> [[2]]
#> # A tibble: 1 x 1
#> `mean(Sepal.Length)`
#> <dbl>
#> 1 4.3
#>
#> [[3]]
#> # A tibble: 1 x 1
#> `mean(Sepal.Length)`
#> <dbl>
#> 1 5.4
Created on 2021-06-28 by the reprex package (v2.0.0)
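If the goal is one row per Species with one column of means per Petal.Length value, the long summary can be spread out with tidyr's pivot_wider(). A minimal sketch on iris; the column name mean_sepal and the PL_ prefix are arbitrary choices, not from the question:

```r
library(dplyr)
library(tidyr)

wide_means <- iris %>%
  group_by(Species, Petal.Length) %>%
  summarise(mean_sepal = mean(Sepal.Length), .groups = "drop") %>%
  # one column per Petal.Length value; combinations that never occur become NA
  pivot_wider(
    names_from = Petal.Length,
    values_from = mean_sepal,
    names_prefix = "PL_"
  )

dim(wide_means)
```

With character behaviour labels instead of Petal.Length, the same call should give one column per behaviour, and names_prefix can be dropped.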
I am using WRS2 to carry out robust pairwise comparisons. One problem is that it removes the group level names from the output data frames and saves them in a different object.
# setup
set.seed(123)
library(WRS2)
library(tidyverse)
# robust pairwise comparisons
x <- lincon(libido ~ dose, data = viagra, tr = 0.1)
# comparisons
x$comp
#> Group Group psihat ci.lower ci.upper p.value
#> [1,] 1 2 -1.0 -3.440879 1.44087853 0.25984505
#> [2,] 1 3 -2.8 -5.536161 -0.06383861 0.04914871
#> [3,] 2 3 -1.8 -4.536161 0.93616139 0.17288911
# vector with group level names
x$fnames
#> [1] "placebo" "low" "high"
I can convert it to a tibble:
# converting to tibble
suppressMessages(as_tibble(x$comp, .name_repair = "unique")) %>%
dplyr::rename(group1 = Group...1, group2 = Group...2)
#> # A tibble: 3 x 6
#> group1 group2 psihat ci.lower ci.upper p.value
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 -1 -3.44 1.44 0.260
#> 2 1 3 -2.8 -5.54 -0.0638 0.0491
#> 3 2 3 -1.8 -4.54 0.936 0.173
I would then like to replace the numeric values in the group columns with the actual names stored in fnames (so 1 becomes fnames[1], 2 becomes fnames[2], and so on).
So the final data frame should look something like the following:
#> # A tibble: 3 x 6
#> group1 group2 psihat ci.lower ci.upper p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 placebo low -1 -3.44 1.44 0.260
#> 2 placebo high -2.8 -5.54 -0.0638 0.0491
#> 3 low high -1.8 -4.54 0.936 0.173
In this case, it was easy to just copy-paste the three values, but I want to have a generalizable approach where no matter the number of levels, it works. How can I do this using dplyr?
Use a named vector as a lookup table within the tidyverse. This matches by value rather than by index position, i.e. it still works if the values in the 'Group' columns are not a 1..n sequence or are characters.
library(dplyr)
as_tibble(x$comp, .name_repair = 'unique') %>%
mutate(across(starts_with("Group"),
~ setNames(x$fnames, seq_along(x$fnames))[as.character(.)]))
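To see why matching by name is more robust than plain indexing, here is a small base-R sketch that needs no WRS2. The group codes 10, 20, 30 are made up for illustration; lincon() itself returns 1, 2, 3.

```r
fnames <- c("placebo", "low", "high")
codes  <- c(10, 20, 30)  # hypothetical non-sequential group codes

# Lookup table: names are the codes (as strings), values are the labels
lookup <- setNames(fnames, codes)

group1 <- c(10, 10, 20)

# Matching by name finds the right label whatever the codes are
by_name <- unname(lookup[as.character(group1)])

# Plain positional indexing only works when the codes are 1..n
by_index <- fnames[group1]

by_name
by_index
```

Here by_name holds the labels, while by_index is all NA because positions 10 and 20 do not exist in fnames.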
Does this fulfil your needs?
names <- c("A","B","C")
df = data.frame(group=c(1,2,3))
library(dplyr)
df %>% mutate(group = names[group])
group
1 A
2 B
3 C
Here's an approach using the recode function, with the recoding vector built programmatically from the data:
# Setup
set.seed(123)
library(WRS2)
library(tidyverse)
x <- lincon(libido ~ dose, data = viagra, tr = 0.1)
# Create recoding vector
recode.vec = x$fnames %>% set_names(1:length(x$fnames))
# Recode columns
x.comp = x$comp %>%
as_tibble(.name_repair=make.unique) %>%
mutate(across(starts_with("Group"), ~recode(., !!!recode.vec)))
Output:
x.comp
#> # A tibble: 3 x 6
#> Group Group.1 psihat ci.lower ci.upper p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 placebo low -1 -3.44 1.44 0.260
#> 2 placebo high -2.8 -5.54 -0.0638 0.0491
#> 3 low high -1.8 -4.54 0.936 0.173
Try this tidyverse approach: reshape the data to long format after extracting the objects as tibbles, then use left_join() to attach the group names. Here is the code to get something close to what you want:
# setup
set.seed(123)
library(WRS2)
library(tidyverse)
# robust pairwise comparisons
x <- lincon(libido ~ dose, data = viagra, tr = 0.1)
#Transform to tibble
df1 <- suppressMessages(as_tibble(x$comp, .name_repair = "unique")) %>%
dplyr::rename(group1 = Group...1, group2 = Group...2)
#Extract labels
df2 <- tibble(treat=x$fnames) %>% mutate(value=1:n())
#Format to long df1
df1 <- df1 %>%
mutate(id=1:n()) %>%
pivot_longer(cols = c(group1,group2)) %>%
rename(group=name) %>% left_join(df2) %>% select(-value) %>%
pivot_wider(names_from = group,values_from=treat) %>% select(-id)
Output:
# A tibble: 3 x 6
psihat ci.lower ci.upper p.value group1 group2
<dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 -1 -3.44 1.44 0.260 placebo low
2 -2.8 -5.54 -0.0638 0.0491 placebo high
3 -1.8 -4.54 0.936 0.173 low high
I have this simple data frame. The Sum column holds the row sums. I would like to use prop.test to determine the p-value for each column and present the results as an additional row labeled p-value. I can use prop.test in the following way to get a p-value for any individual column, but cannot work out how to apply it to multiple columns with a single function.
      Other Island N_Shelf N_Shore S_Shore Sum
Type1    10      4       1       0       3  18
Type2    19     45       1       9      11  85
This will output a p-value for the Island column:
ResI2<- prop.test(x=TableAvE_Island$Island, n=TableAvE_Island$Sum)
output:
data: TableAvE_Island$Island out of TableAvE_Island$Sum
X-squared = 4.456, df = 1, p-value = 0.03478
alternative hypothesis: two.sided
95 percent confidence interval:
-0.56027107 -0.05410802
sample estimates:
prop 1 prop 2
0.2222222 0.5294118
I've tried to use the apply command but cannot work out its usage, and the examples I've been able to find don't seem similar enough. Any pointers would be appreciated.
Here's a look with broom's function tidy, which takes output from tests and other operations and formats them as "tidy" data frames.
For the first prop.test that you posted, the tidy output looks like this:
library(tidyverse)
broom::tidy(prop.test(TableAvE_Island$Island, TableAvE_Island$Sum))
#> estimate1 estimate2 statistic p.value parameter conf.low
#> 1 0.2222222 0.5294118 4.456017 0.03477849 1 -0.5602711
#> conf.high
#> 1 -0.05410802
#> method
#> 1 2-sample test for equality of proportions with continuity correction
#> alternative
#> 1 two.sided
To do this for all the variables in your data frame vs Sum, I gathered it into a long shape
table_long <- gather(TableAvE_Island, key = variable, value = val, -Sum)
head(table_long)
#> # A tibble: 6 x 3
#> Sum variable val
#> <int> <chr> <int>
#> 1 18 Other 10
#> 2 85 Other 19
#> 3 18 Island 4
#> 4 85 Island 45
#> 5 18 N_Shelf 1
#> 6 85 N_Shelf 1
Then I grouped the long-shaped data by variable and piped it into do, which lets you call a function on each group of a data frame, using . as a stand-in for that group's subset of the data. Then I called tidy on the column containing the nested results of prop.test. This gives you a data frame of all the relevant test results, with one row for each of "Island", "N_Shelf", etc.
table_long %>%
group_by(variable) %>%
do(test = prop.test(x = .$val, n = .$Sum)) %>%
broom::tidy(test)
#> # A tibble: 5 x 10
#> # Groups: variable [5]
#> variable estimate1 estimate2 statistic p.value parameter conf.low
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Island 0.222 0.529 4.46 0.0348 1 -0.560
#> 2 N_Shelf 0.0556 0.0118 0.0801 0.777 1 -0.0981
#> 3 N_Shore 0 0.106 0.972 0.324 1 -0.205
#> 4 Other 0.556 0.224 6.54 0.0106 1 0.0523
#> 5 S_Shore 0.167 0.129 0.00163 0.968 1 -0.183
#> # ... with 3 more variables: conf.high <dbl>, method <fct>,
#> # alternative <fct>
Created on 2018-05-10 by the reprex package (v0.2.0).
We could gather into 'long' format and then store it as a list column
library(tidyverse)
res <- gather(TableAvE_Island, key, val, -Sum) %>%
group_by(key) %>%
nest() %>%
mutate(out = map(data, ~prop.test(.x$val, .x$Sum)))
res$out
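If all that is needed is the extra row labeled p-value from the question, base R is enough. A minimal sketch, assuming the data frame is rebuilt from the table in the question; the names count_cols, p_values and with_p are made up for illustration:

```r
# Rebuild the question's table (values copied from the question)
TableAvE_Island <- data.frame(
  Other   = c(10, 19),
  Island  = c(4, 45),
  N_Shelf = c(1, 1),
  N_Shore = c(0, 9),
  S_Shore = c(3, 11),
  Sum     = c(18, 85),
  row.names = c("Type1", "Type2")
)

count_cols <- setdiff(names(TableAvE_Island), "Sum")

# One prop.test per count column; small counts trigger an approximation warning
p_values <- sapply(count_cols, function(v) {
  suppressWarnings(
    prop.test(x = TableAvE_Island[[v]], n = TableAvE_Island$Sum)$p.value
  )
})

# Append the p-values as a new row; Sum has no p-value, so it gets NA
with_p <- rbind(TableAvE_Island, c(p_values, Sum = NA))
rownames(with_p)[nrow(with_p)] <- "p-value"

with_p
```

Mixing counts and p-values in one table is convenient for display, but the long or list-column results above are easier to work with afterwards.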
I've calculated a different regression for each group in a data frame:
DF.L <- DF %>%
  group_by(Channel) %>%
  do(Fit = rlm(L ~ -1 + Y + I(Y^2), data = .))
I want to apply this set of regressions to another data frame. To do so, I'm testing how to apply it to the same data frame:
DF %>%
  group_by(Channel) %>%
  do({
    Lfit <- predict(subset(DF.L, Channel == unique(.$Channel))$Fit, .)
    data.frame(., Lfit)
  })
glimpse(DF)
But I keep getting this error:
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "list"
Calls: %>% ... do_.grouped_df -> eval -> eval -> predict -> predict
What am I doing wrong?
Using the built-in ChickWeight data:
library(dplyr)
library(MASS)
library(broom)
library(tidyr)
library(ggplot2)
head(ChickWeight)
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
Fit some models
ChickWeight_models <- ChickWeight %>%
group_by(Diet) %>%
do(fit = MASS::rlm(weight ~ Time + I(Time^2), data = .))
ChickWeight_models
Source: local data frame [4 x 2]
Groups: <by row>
# A tibble: 4 x 2
Diet fit
* <fctr> <list>
1 1 <S3: rlm>
2 2 <S3: rlm>
3 3 <S3: rlm>
4 4 <S3: rlm>
So I've created a very similar object to your DF.L. It's a frame with the four groups, each with an rlm object in a list-column called fit.
Make up some test data
Now I'll make up some data to test this model on. In this case, I'll just take the original data and add some noise to each of the variables.
ChickWeight_simulated <- ChickWeight %>%
mutate(Time = Time + runif(length(Time)),
weight = weight + rnorm(length(weight)))
ChickWeight_simulated
weight Time Chick Diet
1 42.72075 0.9786272 1 1
2 51.12669 2.8399631 1 1
3 58.64632 4.4576380 1 1
4 63.77617 6.1083591 1 1
5 75.40434 8.1051792 1 1
6 91.75830 10.7899030 1 1
Now we want to combine the dataframe of the models with the new data to test on. First we group_by and tidyr::nest the simulated data. This creates an object that is a dataframe with the four groups and a list-column called data, each element of which contains a rolled-up dataframe.
ChickWeight_simulated %>% group_by(Diet) %>% nest()
# A tibble: 4 x 2
Diet data
<fctr> <list>
1 1 <tibble [220 x 3]>
2 2 <tibble [120 x 3]>
3 3 <tibble [120 x 3]>
4 4 <tibble [118 x 3]>
Add the original models to the new data
Then we can join it to the models dataframe:
ChickWeight_simulated %>% group_by(Diet) %>% nest() %>%
full_join(ChickWeight_models)
# A tibble: 4 x 3
Diet data fit
<fctr> <list> <list>
1 1 <tibble [220 x 3]> <S3: rlm>
2 2 <tibble [120 x 3]> <S3: rlm>
3 3 <tibble [120 x 3]> <S3: rlm>
4 4 <tibble [118 x 3]> <S3: rlm>
Now we group by Diet again, and use broom::augment to make a prediction of each model on the new simulated data. Since each group is one row, there is one element each of fit and data; we have to extract that single element out of each list-column into a usable form by using [[1]].
ChickWeight_simulated_predicted <-
ChickWeight_simulated %>% group_by(Diet) %>% nest() %>%
full_join(ChickWeight_models) %>%
group_by(Diet) %>%
do(augment(.$fit[[1]], newdata = .$data[[1]]))
head(ChickWeight_simulated_predicted)
# A tibble: 6 x 6
# Groups: Diet [1]
Diet weight Time Chick .fitted .se.fit
<fctr> <dbl> <dbl> <ord> <dbl> <dbl>
1 1 42.72075 0.9786272 1 43.62963 2.368838
2 1 51.12669 2.8399631 1 51.80855 1.758385
3 1 58.64632 4.4576380 1 59.67606 1.534051
4 1 63.77617 6.1083591 1 68.43218 1.534152
5 1 75.40434 8.1051792 1 80.00678 1.647612
6 1 91.75830 10.7899030 1 97.26450 1.726331
Sanity check
To prove that this really only used the model from a particular level of Diet on the simulated data from that level of Diet, we can visualize the model fit.
ChickWeight_simulated_predicted %>%
ggplot(aes(Time, weight)) +
geom_point(shape = 1) +
geom_ribbon(aes(Time,
ymin = .fitted-1.96*.se.fit,
ymax = .fitted+1.96*.se.fit),
alpha = 0.5, fill = "black") +
geom_line(aes(Time, .fitted), size = 1, color = "red") +
facet_wrap(~Diet)
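The [[1]] extraction is also what the error message in the question is about: a list-column subset with $ is still a list, and predict() has no method for a bare list. A minimal sketch of the difference, assuming lm in place of MASS::rlm to keep it dependency-light; the names models, diet1 and new_data are made up for illustration:

```r
library(dplyr)
library(tidyr)
library(purrr)

# One model per Diet, stored in a list-column (lm stands in for MASS::rlm)
models <- as.data.frame(ChickWeight) %>%
  group_by(Diet) %>%
  nest() %>%
  ungroup() %>%
  mutate(fit = map(data, ~ lm(weight ~ Time + I(Time^2), data = .x)))

diet1 <- filter(models, Diet == "1")

class(diet1$fit)       # "list": this is what predict() was handed in the question
class(diet1$fit[[1]])  # "lm": the fitted model itself

# Hypothetical new data for Diet 1
new_data <- data.frame(Time = c(1, 5, 9))

# Works once the model is pulled out of the list
pred <- predict(diet1$fit[[1]], newdata = new_data)
pred
```

In the question's code, the analogous change would be adding [[1]] after $Fit before calling predict.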
I think your error comes from how you are calling predict. I can't fix your exact code, but here is a simple way you can get predictions from your model. A more sophisticated way using purrr and nest is outlined here: http://ijlyttle.github.io/isugg_purrr/presentation.html#(1)
UPDATE - the purrr and nest way
Just adding this to show that it can be done pretty easily within the tidyverse, using predict. See link above for more details.
library(tidyverse)
# shuffle the rows to mix up the species
set.seed(1234)
myiris <- iris[sample(nrow(iris), replace = F),]
# create first dataset - use the first 50 rows for running the model
iris_nested <-
myiris[1:50,] %>%
nest(-Species) %>%
rename(myorigdata = data)
# create second dataset - use the other 100 rows for making predictions
new_iris_nested <-
myiris[51:150,] %>%
nest(-Species) %>%
rename(mynewdata = data)
# make a model function
my_rlm <- function(df) {
  MASS::rlm(Sepal.Length ~ Petal.Length + Petal.Width, data = df)
}
# get the predictions (see the GitHub link above which breaks this into steps)
predictions_tall <-
iris_nested %>%
mutate(my_model = map(myorigdata, my_rlm)) %>%
full_join(new_iris_nested, by = "Species") %>%
mutate(my_new_pred = map2(my_model, mynewdata, predict)) %>%
select(Species, mynewdata, my_new_pred) %>%
unnest(mynewdata, my_new_pred) %>%
rename(modeled = my_new_pred, measured = Sepal.Length) %>%
gather("Type", "Sepal.Length", modeled, measured)
The nested predictions_tall object looks like this:
predictions_tall %>% nest(-Species, -Type) %>% as_tibble()
# A tibble: 6 x 3
Species Type data
<fctr> <chr> <list>
1 setosa modeled <data.frame [32 x 4]>
2 versicolor modeled <data.frame [33 x 4]>
3 virginica modeled <data.frame [35 x 4]>
4 setosa measured <data.frame [32 x 4]>
5 versicolor measured <data.frame [33 x 4]>
6 virginica measured <data.frame [35 x 4]>
And finally, the plot to show the prediction results:
predictions_tall %>%
ggplot(aes(x = Petal.Length, y = Sepal.Length)) +
geom_line(aes(color = Species, linetype = Type))
ORIGINAL - the broom way
I've updated this now to only calculate predictions for each group using the model for that group.
This way uses the broom package - specifically the augment function - to add fitted values. See more here: https://cran.r-project.org/web/packages/broom/vignettes/broom.html
Since you don't supply data, I use iris here.
library(tidyverse)
library(broom)
# first shuffle around the rows of iris
set.seed(1234)
myiris <- iris[sample(nrow(iris), replace = F),]
# first data - first 25 rows for running the models on
origiris <-
myiris[1:25,] %>%
nest(-Species) %>%
rename(origdata = data)
# second data - last 50 rows for predicting on
prediris <-
myiris[101:150,] %>%
nest(-Species) %>%
rename(preddata = data)
# estimate models on the first 25 rows
# a separate model is estimated for each species
iris_mod <-
origiris %>%
mutate(mod = map(origdata, ~ MASS::rlm(Sepal.Length ~ Petal.Length + Petal.Width, data = .)))
First get fitted values for the original dataset (not essential, just for illustration):
# get fitted values for the first dataset (origdata)
origiris_aug <-
iris_mod %>%
mutate(origpred = map(mod, augment)) %>%
unnest(origpred) %>%
as.tibble()
The origiris_aug predictions dataframe looks like this:
origiris_aug
# A tibble: 25 x 10
Species .rownames Sepal.Length Petal.Length Petal.Width .fitted .se.fit .resid
<fctr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 18 5.1 1.4 0.3 5.002797 0.1514850 0.09720290
2 setosa 2 4.9 1.4 0.2 4.931824 0.1166911 -0.03182417
3 setosa 34 5.5 1.4 0.2 4.931824 0.1166911 0.56817583
4 setosa 40 5.1 1.5 0.2 4.981975 0.1095883 0.11802526
5 setosa 39 4.4 1.3 0.2 4.881674 0.1422123 -0.48167359
6 setosa 36 5.0 1.2 0.2 4.831523 0.1784156 0.16847698
7 setosa 25 4.8 1.9 0.2 5.182577 0.2357614 -0.38257703
8 setosa 31 4.8 1.6 0.2 5.032125 0.1241074 -0.23212531
9 setosa 42 4.5 1.3 0.3 4.952647 0.1760223 -0.45264653
10 setosa 21 5.4 1.7 0.2 5.082276 0.1542594 0.31772411
# ... with 15 more rows, and 2 more variables: .hat <dbl>, .sigma <dbl>
And now what you actually want - making predictions on the new dataset:
# get fitted values for the second dataset (preddata)
# each model is fitted to the appropriate species' nested dataframe
prediris_aug <-
iris_mod %>%
inner_join(prediris, by = "Species") %>%
map2_df(.x = iris_mod$mod, .y = prediris$preddata, .f = ~augment(.x, newdata = .y)) %>%
as.tibble()
The prediris_aug dataframe looks like this:
prediris_aug
# A tibble: 50 x 7
.rownames Sepal.Length Sepal.Width Petal.Length Petal.Width .fitted .se.fit
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 105 6.5 3.0 5.8 2.2 8.557908 3.570269
2 115 5.8 2.8 5.1 2.4 8.348800 3.666631
3 117 6.5 3.0 5.5 1.8 8.123565 3.005888
4 139 6.0 3.0 4.8 1.8 7.772511 2.812748
5 103 7.1 3.0 5.9 2.1 8.537086 3.475224
6 107 4.9 2.5 4.5 1.7 7.551086 2.611123
7 119 7.7 2.6 6.9 2.3 9.180537 4.000412
8 135 6.1 2.6 5.6 1.4 7.889823 2.611457
9 124 6.3 2.7 4.9 1.8 7.822661 2.838502
10 118 7.7 3.8 6.7 2.2 9.009263 3.825613
# ... with 40 more rows
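Note that the map2_df() call above takes its inputs from iris_mod$mod and prediris$preddata directly, so models and data are paired by row position, and the inner_join() result is not what gets iterated over. That only lines up if both nested frames list the species in the same order. A hedged sketch that pairs them through the join instead; it assumes lm in place of MASS::rlm, predict in place of augment, and a deterministic odd/even split, and the names train, new, models, new_nested and predictions are made up for illustration:

```r
library(dplyr)
library(tidyr)
library(purrr)

# Deterministic split for illustration: odd rows to fit, even rows to predict
train <- iris[seq(1, nrow(iris), by = 2), ]
new   <- iris[seq(2, nrow(iris), by = 2), ]

# One model per species (lm stands in for MASS::rlm)
models <- train %>%
  group_by(Species) %>%
  nest() %>%
  ungroup() %>%
  mutate(mod = map(data, ~ lm(Sepal.Length ~ Petal.Length + Petal.Width, data = .x))) %>%
  select(Species, mod)

# Reverse the species order on purpose: row position no longer matches
new_nested <- new %>%
  group_by(Species) %>%
  nest() %>%
  ungroup() %>%
  arrange(desc(Species)) %>%
  rename(preddata = data)

predictions <- models %>%
  inner_join(new_nested, by = "Species") %>%  # pairs model and data by Species
  mutate(pred = map2(mod, preddata, ~ unname(predict(.x, newdata = .y)))) %>%
  select(Species, preddata, pred) %>%
  unnest(c(preddata, pred))

head(predictions)
```

Because the pairing happens inside mutate() on the joined rows, each species' data is scored by that species' model, and the Species column is kept in the result.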