I am writing a function to dplyr::*_join two dataframes by different columns, with the column name of the first dataframe dynamically specified as a function argument. I believe I need to use rlang quasiquotation/metaprogramming, but I haven't been able to get a working solution. I appreciate any suggestions!
library(dplyr)
library(rlang)
library(palmerpenguins)
# Create a smaller dataset
penguins <-
penguins %>%
group_by(species) %>%
slice_head(n = 4) %>%
ungroup()
# Create a colors dataset
penguin_colors <-
tibble(
type = c("Adelie", "Chinstrap", "Gentoo"),
color = c("orange", "purple", "green")
)
# Without function --------------------------------------------------------
# Join works with character vectors
left_join(
penguins, penguin_colors, by = c("species" = "type")
)
#> # A tibble: 12 x 9
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
#> <chr> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torge… 39.1 18.7 181 3750
#> 2 Adelie Torge… 39.5 17.4 186 3800
#> 3 Adelie Torge… 40.3 18 195 3250
#> 4 Adelie Torge… NA NA NA NA
#> 5 Chinst… Dream 46.5 17.9 192 3500
#> 6 Chinst… Dream 50 19.5 196 3900
#> 7 Chinst… Dream 51.3 19.2 193 3650
#> 8 Chinst… Dream 45.4 18.7 188 3525
#> 9 Gentoo Biscoe 46.1 13.2 211 4500
#> 10 Gentoo Biscoe 50 16.3 230 5700
#> 11 Gentoo Biscoe 48.7 14.1 210 4450
#> 12 Gentoo Biscoe 50 15.2 218 5700
#> # … with 3 more variables: sex <fct>, year <int>, color <chr>
# Join works with data-variable and character vector
left_join(
penguins, penguin_colors, by = c(species = "type")
)
#> # A tibble: 12 x 9
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
#> <chr> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torge… 39.1 18.7 181 3750
#> 2 Adelie Torge… 39.5 17.4 186 3800
#> 3 Adelie Torge… 40.3 18 195 3250
#> 4 Adelie Torge… NA NA NA NA
#> 5 Chinst… Dream 46.5 17.9 192 3500
#> 6 Chinst… Dream 50 19.5 196 3900
#> 7 Chinst… Dream 51.3 19.2 193 3650
#> 8 Chinst… Dream 45.4 18.7 188 3525
#> 9 Gentoo Biscoe 46.1 13.2 211 4500
#> 10 Gentoo Biscoe 50 16.3 230 5700
#> 11 Gentoo Biscoe 48.7 14.1 210 4450
#> 12 Gentoo Biscoe 50 15.2 218 5700
#> # … with 3 more variables: sex <fct>, year <int>, color <chr>
# Join does NOT work with character vector and data-variable
left_join(
penguins, penguin_colors, by = c(species = type)
)
#> Error in standardise_join_by(by, x_names = x_names, y_names = y_names): object 'type' not found
# With function -----------------------------------------------------------
# Version 1: Without tunneling
add_colors <- function(data, var) {
left_join(
data, penguin_colors, by = c(var = "type")
)
}
add_colors(penguins, species)
#> Error: Join columns must be present in data.
#> x Problem with `var`.
add_colors(penguins, "species")
#> Error: Join columns must be present in data.
#> x Problem with `var`.
# Version 2: With tunneling
add_colors <- function(data, var) {
left_join(
data, penguin_colors, by = c("{{var}}" = "type")
)
}
add_colors(penguins, species)
#> Error: Join columns must be present in data.
#> x Problem with `{{var}}`.
add_colors(penguins, "species")
#> Error: Join columns must be present in data.
#> x Problem with `{{var}}`.
# Version 3: With tunneling and glue syntax
add_colors <- function(data, var) {
left_join(
data, penguin_colors, by = c("{{var}}" := "type")
)
}
add_colors(penguins, species)
#> Error: `:=` can only be used within a quasiquoted argument
add_colors(penguins, "species")
#> Error: `:=` can only be used within a quasiquoted argument
Created on 2020-10-05 by the reprex package (v0.3.0)
Here are related resources I consulted:
using `rlang` quasiquotation with `dplyr::_join` functions
https://dplyr.tidyverse.org/reference/join.html
https://speakerdeck.com/lionelhenry/interactivity-and-programming-in-the-tidyverse
https://dplyr.tidyverse.org/articles/programming.html
Thank you for your advice.
library(dplyr)
left_join(
penguins, penguin_colors, by = c(species = "type")
)
The reason the above works is that in by we are creating a named character vector like this:
c(species = "type")
#species
# "type"
You can also do that via setNames :
setNames('type', 'species')
but notice that passing species without quotes fails:
setNames('type', species)
Error in setNames("type", species) : object 'species' not found
So create a named vector with setNames and pass the column name as a character value to the function:
add_colors <- function(data, var) {
left_join(
data, penguin_colors, by = setNames('type', var)
)
}
add_colors(penguins, 'species')
To add to Ronak's solution, you can also achieve this without quotes by using ensym().
Example:
add_colors <- function(data, var) {
left_join(
data, penguin_colors, by = set_names("type", nm = ensym(var))
)
}
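Assuming the definitions above (penguins, penguin_colors, and this ensym() version of add_colors), both quoted and unquoted column names now work:

```r
add_colors(penguins, species)    # bare column name, captured by ensym()
add_colors(penguins, "species")  # a character string works too, since ensym() accepts both
```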
I am working with time series. I have two different time series, each with two columns and a different number of rows.
df_1=read.table("data_1")
df_2=read.table("data_2")
I would like to compare the values of df_1$V2 (the second column) with the values in df_2$V2; when they are equal, calculate the time difference between them (df_2$V1[j] - df_1$V1[i]).
here is my code:
vect=c()
horo=c()
j=1
for (i in 2: nrow(df_1)){
for(j in 1:nrow(df_2)) {
if(df_1$V2[i]==df_2$V2[j]){
calc=abs(df_2$V1[j] - df_1$V1[i])
vect=append(vect, calc)
}
}
}
The problems are:
there may be many elements of df_2$V2 equal to df_1$V2[i], and I only want the first one.
since I know that in my data if (for example) df_1$V2[1] == df_2$V2[8], then for the next iteration there is no need to compare df_1$V2[2] with the first 8 values of df_2$V2, and I can start comparing from df_2$V2[9].
it takes too much time because of the for loops, so is there another way to do it?
Thank you for your help!
data example:
df_1=
15.942627 2633
15.942630 2664
15.942831 2699
15.943421 3068
15.943422 4256
15.943423 5444
15.943425 6632
15.943426 7820
15.945489 9008
15.945490 10196
15.945995 11384
15.960359 12572
15.960360 13760
15.960413 14948
15.960414 16136
15.961537 17202
15.962138 18390
15.962139 18624
16.042805 18659
16.043349 18851
....
df_2=
15.942244 2376
15.942332 2376
15.942332 2376
15.959306 2633
15.960350 2633
15.961223 3068
15.967225 6632
15.978364 10196
15.982280 12572
15.994296 16136
15.994379 18624
16.042336 18624
16.060262 18659
16.065397 21250
16.069239 24814
16.073407 28378
16.077236 31942
You've mentioned that your for-loop is slow; it's generally advisable to avoid writing your own for-loops in R and to let built-in vectorisation handle things efficiently.
Here's a non-for-loop-dependent solution using the popular dplyr package from the tidyverse.
Read in data
First, let's read in your data for the sake of reproducibility. Note that I've added names to your data, because unnamed data is confusing and hard to work with.
require(vroom) # useful package for flexible data reading
df_1 <- vroom(
"timestamp value
15.942627 2633
15.942630 2664
15.942831 2699
15.943421 3068
15.943422 4256
15.943423 5444
15.943425 6632
15.943426 7820
15.945489 9008
15.945490 10196
15.945995 11384
15.960359 12572
15.960360 13760
15.960413 14948
15.960414 16136
15.961537 17202
15.962138 18390
15.962139 18624
16.042805 18659
16.043349 18851")
#> Rows: 20 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: " "
#> dbl (2): timestamp, value
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_2 <- vroom(
"timestamp value
15.942244 2376
15.942332 2376
15.942332 2376
15.959306 2633
15.960350 2633
15.961223 3068
15.967225 6632
15.978364 10196
15.982280 12572
15.994296 16136
15.994379 18624
16.042336 18624
16.060262 18659
16.065397 21250
16.069239 24814
16.073407 28378
16.077236 31942")
#> Rows: 17 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: " "
#> dbl (2): timestamp, value
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Comparing time differences for matching values
Let's go through the solution step-by-step:
Add id for each row of df_1
We'll need this later to remove unwanted values.
require(dplyr)
#> Loading required package: dplyr
require(forcats) # fct_inorder() comes from forcats
#> Loading required package: forcats
df_1 <- mutate(df_1, id = paste0("id_", row_number()) |>
## For the sake of neatness, we'll make id an ordered factor
## whose levels follow its current arrangement
ordered() |>
fct_inorder())
df_1 <- relocate(df_1, id)
head(df_1)
#> # A tibble: 6 × 3
#> id timestamp value
#> <chr> <dbl> <dbl>
#> 1 id_1 15.9 2633
#> 2 id_2 15.9 2664
#> 3 id_3 15.9 2699
#> 4 id_4 15.9 3068
#> 5 id_5 15.9 4256
#> 6 id_6 15.9 5444
Join rows from df_2 on matching values
joined <- left_join(df_1, df_2, by = "value", suffix = c(".1", ".2"))
head(joined)
#> # A tibble: 6 × 4
#> id timestamp.1 value timestamp.2
#> <chr> <dbl> <dbl> <dbl>
#> 1 id_1 15.9 2633 16.0
#> 2 id_1 15.9 2633 16.0
#> 3 id_2 15.9 2664 NA
#> 4 id_3 15.9 2699 NA
#> 5 id_4 15.9 3068 16.0
#> 6 id_5 15.9 4256 NA
Get the first returned value for each value in df_1
We can do this by grouping by our id column, then just getting the first() row from each group.
joined <- group_by(joined, id) # group by row identifiers
summary <- summarise(joined, across(everything(), first))
head(summary)
#> # A tibble: 6 × 4
#> id timestamp.1 value timestamp.2
#> <ord> <dbl> <dbl> <dbl>
#> 1 id_1 15.9 2633 16.0
#> 2 id_2 15.9 2664 NA
#> 3 id_3 15.9 2699 NA
#> 4 id_4 15.9 3068 16.0
#> 5 id_5 15.9 4256 NA
#> 6 id_6 15.9 5444 NA
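As a side note, the group_by()/summarise(first) step can also be written with distinct(), which keeps the first row for each id when .keep_all = TRUE; a sketch, assuming the joined tibble from above:

```r
# keep only the first occurrence of each id, retaining all columns
summary_alt <- distinct(joined, id, .keep_all = TRUE)
```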
Get time difference
A simple case of using mutate() to subtract timestamp.1 from timestamp.2:
times <- mutate(summary, time_diff = timestamp.2 - timestamp.1) |>
relocate(value, .after = id) # this is just for presentation
## You may want to remove rows with no time diff?
filter(times, !is.na(time_diff))
#> # A tibble: 8 × 5
#> id value timestamp.1 timestamp.2 time_diff
#> <ord> <dbl> <dbl> <dbl> <dbl>
#> 1 id_1 2633 15.9 16.0 0.0167
#> 2 id_4 3068 15.9 16.0 0.0178
#> 3 id_7 6632 15.9 16.0 0.0238
#> 4 id_10 10196 15.9 16.0 0.0329
#> 5 id_12 12572 16.0 16.0 0.0219
#> 6 id_15 16136 16.0 16.0 0.0339
#> 7 id_18 18624 16.0 16.0 0.0322
#> 8 id_19 18659 16.0 16.1 0.0175
Created on 2022-10-25 with reprex v2.0.2
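For completeness, the "first matching value" requirement can also be met in base R with match(), which returns the index of the first element of its second argument equal to each element of its first (NA when there is no match). A sketch using the column names read in above; it skips the "resume after the previous match" refinement but avoids the double loop entirely:

```r
# index of the FIRST row of df_2 whose value equals each df_1 value (NA if none)
idx <- match(df_1$value, df_2$value)
time_diff <- df_2$timestamp[idx] - df_1$timestamp
```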
I would like to group all members of the same genus together for some summary statistics, but would like to maintain their full names in the original dataframe. I know that I could change their names or create a new column in the original dataframe, but I am looking for a more elegant solution. I would like to implement this in R with the dplyr package.
Example data here https://knb.ecoinformatics.org/knb/d1/mn/v2/object/urn%3Auuid%3Aeb176981-1909-4d6d-ac07-3406e4efc43f
I would like to group all clams of the genus Macoma as one group, "Macoma sp.", ideally creating this grouping within the following pipeline, perhaps before the group_by(site_code, species_scientific):
summary <- data %>%
group_by(site_code, species_scientific) %>%
summarize(mean_size = mean(width_mm))
Note that there are multiple Macoma xxx species and multiple other species that I want to group as is.
We can replace the elements of species_scientific that contain the substring 'Macoma' (detected with str_detect) with 'Macoma', use that as the grouping column, and get the mean:
library(dplyr)
library(stringr)
data %>%
mutate(species_scientific = replace(species_scientific,
str_detect(species_scientific, "Macoma"), "Macoma")) %>%
group_by(site_code, species_scientific) %>%
summarise(mean_size = mean(width_mm, na.rm = TRUE), .groups = 'drop')
-output
# A tibble: 97 × 3
site_code species_scientific mean_size
<chr> <chr> <dbl>
1 H_01_a Clinocardium nuttallii 33.9
2 H_01_a Macoma 41.0
3 H_01_a Protothaca staminea 37.3
4 H_01_a Saxidomus gigantea 56.0
5 H_01_a Tresus nuttallii 100.
6 H_02_a Clinocardium nuttallii 35.1
7 H_02_a Macoma 41.3
8 H_02_a Protothaca staminea 38.0
9 H_02_a Saxidomus gigantea 54.7
10 H_02_a Tresus nuttallii 50.5
# … with 87 more rows
If the intention is to keep only the first word in 'species_scientific'
data %>%
group_by(genus = str_remove(species_scientific, "\\s+.*"), site_code) %>%
summarise(mean_size = mean(width_mm, na.rm = TRUE), .groups = 'drop')
-output
# A tibble: 97 × 3
genus site_code mean_size
<chr> <chr> <dbl>
1 Clinocardium H_01_a 33.9
2 Clinocardium H_02_a 35.1
3 Clinocardium H_03_a 37.5
4 Clinocardium H_04_a 48.2
5 Clinocardium H_05_a 37.6
6 Clinocardium H_06_a 38.7
7 Clinocardium H_07_a 40.2
8 Clinocardium L_01_a 44.4
9 Clinocardium L_02_a 54.8
10 Clinocardium L_03_a 61.1
# … with 87 more rows
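Equivalently, stringr::word() pulls out the first whitespace-separated token, which reads a little more clearly than the regex; a sketch with the same grouping as above:

```r
library(dplyr)
library(stringr)

data %>%
  group_by(genus = word(species_scientific, 1), site_code) %>%
  summarise(mean_size = mean(width_mm, na.rm = TRUE), .groups = 'drop')
```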
I want to compute a weighted moving average across multiple columns, using the same weights for each column. The weighted moving average shall be computed per group (in contrast to using `dplyr::across` with functions with more than one argument).
In the example below, the grouping should make the weighted moving average "reset" every year, yielding missing values for the first two observations of each year.
How do I make this work?
library(tidyverse)
weighted.filter <- function(x, wt, filter, ...) {
filter <- filter / sum(filter)
stats::filter(x * wt, filter, ...) / stats::filter(wt, filter, ...)
}
economics %>%
group_by(year = lubridate::year(date)) %>%
arrange(date) %>%
mutate(across(
c(pce, psavert, uempmed),
list("moving_average_weighted" = weighted.filter),
wt = pop, filter = rep(1, 3), sides = 1
))
#> Error: Problem with `mutate()` input `..1`.
#> x Input `..1` can't be recycled to size 12.
#> ℹ Input `..1` is `(function (.cols = everything(), .fns = NULL, ..., .names = NULL) ...`.
#> ℹ Input `..1` must be size 12 or 1, not 6.
#> ℹ The error occurred in group 2: year = 1968.
Created on 2021-03-31 by the reprex package (v1.0.0)
Try
economics %>%
group_by(year = lubridate::year(date)) %>%
arrange(date) %>%
mutate(across(
c(pce, psavert, uempmed),
list("moving_average_weighted" =
~ weighted.filter(., wt = pop, filter = rep(1, 3), sides = 1))
))
# # A tibble: 574 x 10
# # Groups: year [49]
# date pce pop psavert uempmed unemploy year pce_moving_average_w~ psavert_moving_avera~ uempmed_moving_avera~
# <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1967-07-01 507. 198712 12.6 4.5 2944 1967 NA NA NA
# 2 1967-08-01 510. 198911 12.6 4.7 2945 1967 NA NA NA
# 3 1967-09-01 516. 199113 11.9 4.6 2958 1967 511. 12.4 4.60
# 4 1967-10-01 512. 199311 12.9 4.9 3143 1967 513. 12.5 4.73
# 5 1967-11-01 517. 199498 12.8 4.7 3066 1967 515. 12.5 4.73
# 6 1967-12-01 525. 199657 11.8 4.8 3018 1967 518. 12.5 4.80
# 7 1968-01-01 531. 199808 11.7 5.1 2878 1968 NA NA NA
# 8 1968-02-01 534. 199920 12.3 4.5 3001 1968 NA NA NA
# 9 1968-03-01 544. 200056 11.7 4.1 2877 1968 536. 11.9 4.57
# 10 1968-04-01 544 200208 12.3 4.6 2709 1968 541. 12.1 4.40
# # ... with 564 more rows
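If you are on R >= 4.1, the same fix can be written with the native lambda syntax instead of the ~/. formula shorthand; behaviour is identical:

```r
economics %>%
  group_by(year = lubridate::year(date)) %>%
  arrange(date) %>%
  mutate(across(
    c(pce, psavert, uempmed),
    list("moving_average_weighted" =
      \(x) weighted.filter(x, wt = pop, filter = rep(1, 3), sides = 1))
  ))
```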
I would like to provide a user-facing function that allows arbitrary grouping variables to be passed to a summary function, with the option of specifying additional arguments for filtering, but which are NULL by default (and thus unevaluated).
I understand why the following example should fail (because it is ambiguous where homeworld belongs and the other arg takes precedence), but I'm unsure what is the best way to pass dots appropriately in this situation. Ideally the result of the second and third calls to fun below would return the same results.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
fun <- function(.df, .species = NULL, ...) {
.group_vars <- rlang::ensyms(...)
if (!is.null(.species)) {
.df <- .df %>%
dplyr::filter(.data[["species"]] %in% .species)
}
.df %>%
dplyr::group_by(!!!.group_vars) %>%
dplyr::summarize(
ht = mean(.data[["height"]], na.rm = TRUE),
.groups = "drop"
)
}
fun(.df = starwars, .species = c("Human", "Droid"), species, homeworld)
#> # A tibble: 19 x 3
#> species homeworld ht
#> <chr> <chr> <dbl>
#> 1 Droid Naboo 96
#> 2 Droid Tatooine 132
#> 3 Droid <NA> 148
#> 4 Human Alderaan 176.
#> 5 Human Bespin 175
#> 6 Human Bestine IV 180
#> 7 Human Chandrila 150
#> 8 Human Concord Dawn 183
#> 9 Human Corellia 175
#> 10 Human Coruscant 168.
#> 11 Human Eriadu 180
#> 12 Human Haruun Kal 188
#> 13 Human Kamino 183
#> 14 Human Naboo 168.
#> 15 Human Serenno 193
#> 16 Human Socorro 177
#> 17 Human Stewjon 182
#> 18 Human Tatooine 179.
#> 19 Human <NA> 193
fun(.df = starwars, .species = NULL, homeworld)
#> # A tibble: 49 x 2
#> homeworld ht
#> <chr> <dbl>
#> 1 Alderaan 176.
#> 2 Aleen Minor 79
#> 3 Bespin 175
#> 4 Bestine IV 180
#> 5 Cato Neimoidia 191
#> 6 Cerea 198
#> 7 Champala 196
#> 8 Chandrila 150
#> 9 Concord Dawn 183
#> 10 Corellia 175
#> # … with 39 more rows
fun(.df = starwars, homeworld)
#> Error in fun(.df = starwars, homeworld): object 'homeworld' not found
Created on 2020-06-15 by the reprex package (v0.3.0)
I know that I can achieve the desired result by:
fun <- function(.df, .species = NULL, .groups = NULL) {
.group_vars <- rlang::syms(purrr::map(.groups, rlang::as_string))
...
}
But I am looking for a solution using ..., or that allows the user to pass either strings or symbols to .groups, e.g. .groups = c(species, homeworld) or .groups = c("species", "homeworld").
You could move the parameters so that .species comes after the dots.
fun <- function(.df, ..., .species = NULL) {
.group_vars <- rlang::ensyms(...)
if (!is.null(.species)) {
.df <- .df %>%
dplyr::filter(.data[["species"]] %in% .species)
}
.df %>%
dplyr::group_by(!!!.group_vars) %>%
dplyr::summarize(
ht = mean(.data[["height"]], na.rm = TRUE),
.groups = "drop"
)
}
fun(.df = starwars, homeworld)
which gives
> fun(.df = starwars, homeworld)
# A tibble: 49 x 3
homeworld ht .groups
<chr> <dbl> <chr>
1 NA 139. drop
2 Alderaan 176. drop
3 Aleen Minor 79 drop
4 Bespin 175 drop
5 Bestine IV 180 drop
6 Cato Neimoidia 191 drop
7 Cerea 198 drop
8 Champala 196 drop
9 Chandrila 150 drop
10 Concord Dawn 183 drop
# ... with 39 more rows
which is what you wanted to happen. The other examples still work as well.
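If you would rather let callers pass a character vector (e.g. .groups = c("species", "homeworld")) instead of relying on ..., a tidyselect-based sketch along these lines also works; all_of() takes grouping columns from a character vector, though the exact idiom may vary with your dplyr version (fun2 is a hypothetical name):

```r
fun2 <- function(.df, .groups = NULL, .species = NULL) {
  if (!is.null(.species)) {
    .df <- dplyr::filter(.df, .data[["species"]] %in% .species)
  }
  .df %>%
    # all_of() turns the character vector into grouping columns;
    # with .groups = NULL this selects nothing and the summary is ungrouped
    dplyr::group_by(dplyr::across(dplyr::all_of(.groups))) %>%
    dplyr::summarize(
      ht = mean(.data[["height"]], na.rm = TRUE),
      .groups = "drop"
    )
}

fun2(starwars, .groups = c("species", "homeworld"))
```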
Is it possible to use a fit object, specifically the regression object I get from a plm() model, to flag which observations in the data were actually used in the regression? I realize this could be done by looking for complete observations in my original data, but I am curious whether there's a way to use the fit/reg object to flag the data.
Let me illustrate my issue with a minimal working example,
First some packages needed,
# install.packages(c("stargazer", "plm", "tidyverse"), dependencies = TRUE)
library(plm); library(stargazer); library(tidyverse)
Second some data, this example is drawing heavily on Baltagi (2013), table 3.1, found in ?plm,
data("Grunfeld", package = "plm")
dta <- Grunfeld
now I create some semi-random missing values in my data object, dta
dta[c(3:13),3] <- NA; dta[c(22:28),4] <- NA; dta[c(30:33),5] <- NA
final step in the data preparation is to create a data frame with an index attribute that describes its individual and time dimensions, using tidyverse,
dta.p <- dta %>% group_by(firm, year)
Now to the regression
plm.reg <- plm(inv ~ value + capital, data = dta.p, model = "pooling")
the results, using stargazer,
stargazer(plm.reg, type="text") # stargazer(dta, type="text")
#> ============================================
#> Dependent variable:
#> ---------------------------
#> inv
#> ----------------------------------------
#> value 0.114***
#> (0.008)
#>
#> capital 0.237***
#> (0.028)
#>
#> Constant -47.962***
#> (9.252)
#>
#> ----------------------------------------
#> Observations 178
#> R2 0.799
#> Adjusted R2 0.797
#> F Statistic 348.176*** (df = 2; 175)
#> ===========================================
#> Note: *p<0.1; **p<0.05; ***p<0.01
Say I know my data has 200 observations, and I want to find the 178 that were used in the regression.
I am speculating whether there's some vector in plm.reg that I can (easily) use to create a flag in my original data, dta, indicating whether each observation was used or not, i.e. marking the semi-random missing values I created above. Maybe some broom-like tool.
I imagine something like,
dta <- dta %>% valid_reg_obs(plm.reg)
The desired outcome would look something like this, the new element is the vector plm.reg at the end, i.e.,
dta %>% as_tibble()
#> # A tibble: 200 x 6
#> firm year inv value capital plm.reg
#> * <int> <int> <dbl> <dbl> <dbl> <lgl>
#> 1 1 1935 318 3078 2.80 T
#> 2 1 1936 392 4662 52.6 T
#> 3 1 1937 NA 5387 157 F
#> 4 1 1938 NA 2792 209 F
#> 5 1 1939 NA 4313 203 F
#> 6 1 1940 NA 4644 207 F
#> 7 1 1941 NA 4551 255 F
#> 8 1 1942 NA 3244 304 F
#> 9 1 1943 NA 4054 264 F
#> 10 1 1944 NA 4379 202 F
#> # ... with 190 more rows
Update: I tried to use broom's augment(), but unfortunately it gave me an error message rather than the flag I had hoped for,
# install.packages(c("broom"), dependencies = TRUE)
library(broom)
augment(plm.reg, dta)
#> Error in data.frame(..., check.names = FALSE) :
#> arguments imply differing number of rows: 200, 178
The vector is plm.reg$residuals. Not sure of a nice broom solution, but this seems to work:
library(tidyverse)
dta.p %>%
as.data.frame %>%
rowid_to_column %>%
mutate(plm.reg = rowid %in% names(plm.reg$residuals))
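For comparison, the complete-cases route mentioned in the question produces the same flag without touching the fit object; a sketch, assuming only inv, value, and capital enter the model:

```r
library(dplyr)

# TRUE where all model variables are non-missing, i.e. rows plm could use
dta %>%
  mutate(plm.reg = complete.cases(inv, value, capital)) %>%
  as_tibble()
```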
For people who use pdata.frame() to create an index attribute that describes the individual and time dimensions, you can use the following code; this is from another Baltagi example in ?plm:
# == Baltagi (2013), pp. 204-205
data("Produc", package = "plm")
pProduc <- pdata.frame(Produc, index = c("state", "year", "region"))
form <- log(gsp) ~ log(pc) + log(emp) + log(hwy) + log(water) + log(util) + unemp
Baltagi_reg_204_5 <- plm(form, data = pProduc, model = "random", effect = "nested")
pProduc %>% mutate(reg.re = rownames(pProduc) %in% names(Baltagi_reg_204_5$residuals)) %>%
as_tibble() %>% select(state, year, region, reg.re)
#> # A tibble: 816 x 4
#> state year region reg.re
#> <fct> <fct> <fct> <lgl>
#> 1 CONNECTICUT 1970 1 T
#> 2 CONNECTICUT 1971 1 T
#> 3 CONNECTICUT 1972 1 T
#> 4 CONNECTICUT 1973 1 T
#> 5 CONNECTICUT 1974 1 T
#> 6 CONNECTICUT 1975 1 T
#> 7 CONNECTICUT 1976 1 T
#> 8 CONNECTICUT 1977 1 T
#> 9 CONNECTICUT 1978 1 T
#> 10 CONNECTICUT 1979 1 T
#> # ... with 806 more rows
finally, if you want the flag on the unmodified Grunfeld data from the help file, i.e. without index attributes, the code should be:
Grunfeld %>% rowid_to_column %>%
mutate(plm.reg = rowid %in% names(plm.reg$residuals)) %>% as_tibble()