Use mutate inside a function called by an apply family function - r

I am trying to change some of my data that are stored as tibbles inside a list.
This list of tibbles was generated by a package.
I do not understand why my function does not work.
If I extract a tibble element manually, the function works but not inside a lapply.
my function:
changesomethingtaxize <- function(x, whatchange=NULL, applyfunction=NULL){
library(lazyeval) ;
mutate_call <- lazyeval::interp(~ a(b), a = match.fun(applyfunction), b = as.name(whatchange) )
x %<>% mutate_(.dots = setNames(list(mutate_call), whatchange) )
return(x)
}
I want to do
mydata <- lapply(mydata, function(x) changesomethingtaxize(x, whatchange=rank, applyfunction=str_to_sentence) )
I could use a loop to extract each tibbles (in this case I only have 5) but I would like to understand what I do wrong :)
From dput()
mydata <- structure(list(`Zostera marina` = structure(list(name = c("Plantae",
"Viridiplantae", "Streptophyta", "Embryophyta", "Tracheophyta",
"Spermatophytina", "Magnoliopsida", "Lilianae", "Alismatales",
"Zosteraceae", "Zostera", "Zostera marina"), rank = c("kingdom",
"subkingdom", "infrakingdom", "superdivision", "division", "subdivision",
"class", "superorder", "order", "family", "genus", "species"),
id = c("202422", "954898", "846494", "954900", "846496",
"846504", "18063", "846542", "38883", "39069", "39073", "39074"
)), row.names = c(NA, 12L), class = "data.frame"), `Vascular plants` = structure(list(
name = c("Plantae", "Viridiplantae", "Streptophyta", "Embryophyta",
"Tracheophyta"), rank = c("kingdom", "subkingdom", "infrakingdom",
"superdivision", "division"), id = c("202422", "954898",
"846494", "954900", "846496")), row.names = c(NA, 5L), class = "data.frame"),
`Fucus vesiculosus` = structure(list(name = c("Chromista",
"Chromista", "Phaeophyta", "Phaeophyceae", "Fucales", "Fucaceae",
"Fucus", "Fucus vesiculosus"), rank = c("kingdom", "subkingdom",
"division", "class", "order", "family", "genus", "species"
), id = c("630578", "590735", "660055", "10686", "11328",
"11329", "11334", "11335")), row.names = c(NA, 8L), class = "data.frame"),
Macroalgae = NA, `Filamentous algae` = NA), class = "classification", db = "itis")

I think I actually found why... :D
The lapply works but was not returning anything because of the NAs (empty elements of the list).
I added an if() that only mutates a tibble if the tibble actually contains something.
It is always an NA issue somewhere!
Well hope that piece of code could help someone someday.

The functions you provided aren't usable by themselves, but it looks like you're attempting to use a function meant to modify a data frame on non-dataframe objects, which mydata contains.
I'm using dplyr::mutate() just to illustrate here.
Your data contain NAs (which in this case are logical). dplyr::mutate() doesnt' have a method for logicals and I'm assuming the function you're trying to use doesn't either (or simply doesn't have a way of handling NA values).
You should be getting an error that's at least conceptually similar to the following ...
lapply(mydata, function(x) dplyr::mutate(x, col_to_modify = toupper(rank)))
#> Error in UseMethod("mutate_"): no applicable method for 'mutate_' applied to an object of class "logical"
To get around this, you can check your list ahead of time and note which elements are indeed data frames.
df_indices <- vapply(mydata, is.data.frame, logical(1L))
df_indices
#> Zostera marina Vascular plants Fucus vesiculosus Macroalgae
#> TRUE TRUE TRUE FALSE
#> Filamentous algae
#> FALSE
Using df_indices, we can modify only those elements in mydata like so...
mydata[df_indices] <- lapply(
mydata[df_indices],
function(x) dplyr::mutate(x, col_to_modify = toupper(rank))
)
mydata
#> $`Zostera marina`
#> name rank id col_to_modify
#> 1 Plantae kingdom 202422 KINGDOM
#> 2 Viridiplantae subkingdom 954898 SUBKINGDOM
#> 3 Streptophyta infrakingdom 846494 INFRAKINGDOM
#> 4 Embryophyta superdivision 954900 SUPERDIVISION
#> 5 Tracheophyta division 846496 DIVISION
#> 6 Spermatophytina subdivision 846504 SUBDIVISION
#> 7 Magnoliopsida class 18063 CLASS
#> 8 Lilianae superorder 846542 SUPERORDER
#> 9 Alismatales order 38883 ORDER
#> 10 Zosteraceae family 39069 FAMILY
#> 11 Zostera genus 39073 GENUS
#> 12 Zostera marina species 39074 SPECIES
#>
#> $`Vascular plants`
#> name rank id col_to_modify
#> 1 Plantae kingdom 202422 KINGDOM
#> 2 Viridiplantae subkingdom 954898 SUBKINGDOM
#> 3 Streptophyta infrakingdom 846494 INFRAKINGDOM
#> 4 Embryophyta superdivision 954900 SUPERDIVISION
#> 5 Tracheophyta division 846496 DIVISION
#>
#> $`Fucus vesiculosus`
#> name rank id col_to_modify
#> 1 Chromista kingdom 630578 KINGDOM
#> 2 Chromista subkingdom 590735 SUBKINGDOM
#> 3 Phaeophyta division 660055 DIVISION
#> 4 Phaeophyceae class 10686 CLASS
#> 5 Fucales order 11328 ORDER
#> 6 Fucaceae family 11329 FAMILY
#> 7 Fucus genus 11334 GENUS
#> 8 Fucus vesiculosus species 11335 SPECIES
#>
#> $Macroalgae
#> [1] NA
#>
#> $`Filamentous algae`
#> [1] NA
#>
#> attr(,"class")
#> [1] "classification"
#> attr(,"db")
#> [1] "itis"
Note that {purrr} has a nice map() variant designed to handle this very situation. purrr::map_if() takes a .p (predicate) argument to which you can provide a function that it applies to .x and returns TRUE or FALSE. Only those elements that return TRUE are modified by the function you provide to .f
purrr::map_if(.x = mydata, .p = is.data.frame,
.f = ~ dplyr::mutate(.x, col_to_modify = toupper(rank)))
#> $`Zostera marina`
#> name rank id col_to_modify
#> 1 Plantae kingdom 202422 KINGDOM
#> 2 Viridiplantae subkingdom 954898 SUBKINGDOM
#> 3 Streptophyta infrakingdom 846494 INFRAKINGDOM
#> 4 Embryophyta superdivision 954900 SUPERDIVISION
#> 5 Tracheophyta division 846496 DIVISION
#> 6 Spermatophytina subdivision 846504 SUBDIVISION
#> 7 Magnoliopsida class 18063 CLASS
#> 8 Lilianae superorder 846542 SUPERORDER
#> 9 Alismatales order 38883 ORDER
#> 10 Zosteraceae family 39069 FAMILY
#> 11 Zostera genus 39073 GENUS
#> 12 Zostera marina species 39074 SPECIES
#>
#> $`Vascular plants`
#> name rank id col_to_modify
#> 1 Plantae kingdom 202422 KINGDOM
#> 2 Viridiplantae subkingdom 954898 SUBKINGDOM
#> 3 Streptophyta infrakingdom 846494 INFRAKINGDOM
#> 4 Embryophyta superdivision 954900 SUPERDIVISION
#> 5 Tracheophyta division 846496 DIVISION
#>
#> $`Fucus vesiculosus`
#> name rank id col_to_modify
#> 1 Chromista kingdom 630578 KINGDOM
#> 2 Chromista subkingdom 590735 SUBKINGDOM
#> 3 Phaeophyta division 660055 DIVISION
#> 4 Phaeophyceae class 10686 CLASS
#> 5 Fucales order 11328 ORDER
#> 6 Fucaceae family 11329 FAMILY
#> 7 Fucus genus 11334 GENUS
#> 8 Fucus vesiculosus species 11335 SPECIES
#>
#> $Macroalgae
#> [1] NA
#>
#> $`Filamentous algae`
#> [1] NA

Related

Synth gives different results depending on the order of predictor names

I'm using the package Synth to run synthetic control. Below is an example coming from the docs (?synth):
library(Synth)
#> ##
#> ## Synth Package: Implements Synthetic Control Methods.
#> ## See https://web.stanford.edu/~jhain/synthpage.html for additional information.
packageVersion("Synth")
#> [1] '1.1.6'
# load data
data(synth.data)
# create matrices from panel data that provide inputs for synth()
dataprep.out<-
dataprep(
foo = synth.data,
predictors = c("X1", "X2", "X3"),
predictors.op = "mean",
dependent = "Y",
unit.variable = "unit.num",
time.variable = "year",
special.predictors = list(
list("Y", 1991, "mean"),
list("Y", 1985, "mean"),
list("Y", 1980, "mean")
),
treatment.identifier = 7,
controls.identifier = c(29, 2, 13, 17, 32, 38),
time.predictors.prior = c(1984:1989),
time.optimize.ssr = c(1984:1990),
unit.names.variable = "name",
time.plot = 1984:1996
)
synth.out <- synth(dataprep.out)
#>
#> X1, X0, Z1, Z0 all come directly from dataprep object.
#>
#>
#> ****************
#> searching for synthetic control unit
#>
#>
#> ****************
#> ****************
#> ****************
#>
#> MSPE (LOSS V): 4.714688
#>
#> solution.v:
#> 0.00490263 0.003884407 0.1972011 0.2707289 0.0007091301 0.5225738
#>
#> solution.w:
#> 0.0001407318 0.004851527 0.1697786 0.2173031 0.6079231 2.9419e-06
gaps<- dataprep.out$Y1plot-(
dataprep.out$Y0plot%*%synth.out$solution.w
)
gaps
#> 7
#> 1984 -0.1686325
#> 1985 1.0936597
#> 1986 0.8502235
#> 1987 3.3489866
#> 1988 -1.4241479
#> 1989 -4.1947133
#> 1990 -0.4646250
#> 1991 0.3998366
#> 1992 8.1181380
#> 1993 12.8906031
#> 1994 15.9712216
#> 1995 16.9174715
#> 1996 22.9879423
If I change the order of the variables names in predictors, I get different results. For instance, if I use predictors = c("X3", "X1", "X2"), I obtain the following results, which are different from those above:
#> 7
#> 1984 0.1893536
#> 1985 1.1648763
#> 1986 0.7300337
#> 1987 2.9856265
#> 1988 -1.7695154
#> 1989 -4.0939843
#> 1990 -0.2292420
#> 1991 0.9373385
#> 1992 8.3290952
#> 1993 13.9617721
#> 1994 17.1929960
#> 1995 18.0576735
#> 1996 24.3488169
To me, it makes no sense to have different results depending on the order of variables. Can someone explain to me why this is the case? (or confirm that this is a bug?)

How to read files in two separate lists in a function based on a condition in R

Okay, I hope I manage to sum up what I need to achieve. I am running experiments in which I obtain data from two different source, with a date_time being the matching unifying variable. The data in the two separate sources have the same structure (in csv or txt). The distinction is in the filenames. I provide an example below:
list_of_files <- structure(
list
(
solid_epoxy1_10 = data.frame(
date_time = c("20/07/2022 13:46",
"20/07/2022 13:56",
"20/07/2022 14:06"),
frequency = c("30000",
"31000",
"32000"),
index = c("1", "2", "3")
),
solid_otherpaint_20 = data.frame(
date_time = c("20/07/2022 13:10",
"20/07/2022 13:20",
"20/07/2022 14:30"),
frequency = c("20000",
"21000",
"22000"),
index = c("1", "2", "3")
),
water_epoxy1_10 = data.frame(
date_time = c("20/07/2022 13:46",
"20/07/2022 13:56",
"20/07/2022 14:06"),
temperature = c("22.3",
"22.6",
"22.5")
),
water_otherpaint_20 = data.frame(
date_time = c("20/07/2022 13:10",
"20/07/2022 13:20",
"20/07/2022 14:30"),
temperature = c("24.5",
"24.6",
"24.8")
)
)
)
First I want to read the data files in two separate lists. One that contains the keyword "solid" in the file name, and the other one that contains "water".
Then I need to create a new columns from the filename in each data frame that will be separated by "_" (e.g paint = "epox1", thickness = "10"), by which I could do an inner join by the date_time column, paint, thickness,etc. Basically what I struggle so far is to create a function that loads that files in two separate lists. This is what I've tried so far
load_files <-
function(list_of_files) {
all.files.board <- list()
all.files.temp <- list()
for (i in 1:length(list_of_files))
{
if (exists("board")) {
all.files.board[[i]] = fread(list_of_files[i])
}
else{
all.files.temp[[i]] = fread(list_of_files[i])
}
return(list(all.files.board, all.files.temp))
}
}
But it doesn't do what I need it. I hope I made it as clear as possible. I'm pretty comfortable with the tidyverse package but writing still a newbie in writing custom functions. Any ideas welcomed.
Regarding question in the title -
first issue, calling return() too early and thus breaking a for-loop, was already mentioned in comments and that should be sorted.
next one is condition itself, if (exists("board")){} checks if there is an object called board; in provided sample it would evaluate to TRUE only if something was assigned to global board object before calling load_files() function and it would evaluate to FALSE only if there were no such assignment or board was explicitly removed. I.e. with
board <- "something"; dataframes <- load_files(file_list) that check will be TRUE while with
rm(board); dataframes <- load_files(file_list) it will be FALSE, there's nothing in function itself that would change the "existance" of board, so the result is actually determined before calling the function.
If core of the question is about joining 2 somewhat different datasets and splitting result by groups, I'd just drop loops, conditions and most of involved lists and would go with something like this with Tidyverse:
library(fs)
library(readr)
library(stringr)
library(dplyr)
library(tidyr)
# prepare input files for sample ------------------------------------------
sample_dfs <- structure(
list
(
solid_epoxy1_10 = data.frame(
date_time = c("20/07/2022 13:46", "20/07/2022 13:56", "20/07/2022 14:06"),
frequency = c("30000", "31000", "32000"),
index = c("1", "2", "3")
),
solid_otherpaint_20 = data.frame(
date_time = c("20/07/2022 13:10", "20/07/2022 13:20", "20/07/2022 14:30"),
frequency = c("20000", "21000", "22000"),
index = c("1", "2", "3")
),
water_epoxy1_10 = data.frame(
date_time = c("20/07/2022 13:46", "20/07/2022 13:56", "20/07/2022 14:06"),
temperature = c("22.3", "22.6", "22.5")
),
water_otherpaint_20 = data.frame(
date_time = c("20/07/2022 13:10", "20/07/2022 13:20", "20/07/2022 14:30"),
temperature = c("24.5", "24.6", "24.8")
)
)
)
tmp_path <- file_temp("reprex")
dir_create(tmp_path)
sample_filenames <- str_glue("{1:length(sample_dfs)}_{names(sample_dfs)}.csv")
for (i in seq_along(sample_dfs)) {
write_csv(sample_dfs[[i]], path(tmp_path, sample_filenames[i]))
}
dir_ls(tmp_path, type = "file")
#> Temp/RtmpqUoct8/reprex5cc517f177b/1_solid_epoxy1_10.csv
#> Temp/RtmpqUoct8/reprex5cc517f177b/2_solid_otherpaint_20.csv
#> Temp/RtmpqUoct8/reprex5cc517f177b/3_water_epoxy1_10.csv
#> Temp/RtmpqUoct8/reprex5cc517f177b/4_water_otherpaint_20.csv
# read files --------------------------------------------------------------
t_solid <- dir_ls(tmp_path, glob = "*solid*.csv", type = "file") %>%
read_csv(id = "filename") %>%
extract(filename, c("paint", "thickness"), "_([^_]+)_(\\d+)\\.csv")
t_solid
#> # A tibble: 6 × 5
#> paint thickness date_time frequency index
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 epoxy1 10 20/07/2022 13:46 30000 1
#> 2 epoxy1 10 20/07/2022 13:56 31000 2
#> 3 epoxy1 10 20/07/2022 14:06 32000 3
#> 4 otherpaint 20 20/07/2022 13:10 20000 1
#> 5 otherpaint 20 20/07/2022 13:20 21000 2
#> 6 otherpaint 20 20/07/2022 14:30 22000 3
t_water <- dir_ls(tmp_path, glob = "*water*.csv", type = "file") %>%
read_csv(id = "filename") %>%
extract(filename, c("paint", "thickness"), "_([^_]+)_(\\d+)\\.csv")
t_water
#> # A tibble: 6 × 4
#> paint thickness date_time temperature
#> <chr> <chr> <chr> <dbl>
#> 1 epoxy1 10 20/07/2022 13:46 22.3
#> 2 epoxy1 10 20/07/2022 13:56 22.6
#> 3 epoxy1 10 20/07/2022 14:06 22.5
#> 4 otherpaint 20 20/07/2022 13:10 24.5
#> 5 otherpaint 20 20/07/2022 13:20 24.6
#> 6 otherpaint 20 20/07/2022 14:30 24.8
# or implement as a function ----------------------------------------------
load_files <- function(csv_path, glob = "*.csv") {
return(
dir_ls(csv_path, glob = glob, type = "file") %>%
# store filenames in filename column
read_csv(id = "filename", show_col_types = FALSE) %>%
# extract each regex group to its own column
extract(filename, c("paint", "thickness"), "_([^_]+)_(\\d+)\\.csv"))
}
# join / group / split ----------------------------------------------------
t_solid <- load_files(tmp_path, "*solid*.csv")
t_water <- load_files(tmp_path, "*water*.csv")
# either join by multiple columns or select only required cols
# to avoid x.* & y.* columns in result
inner_join(t_solid, t_water, by = c("date_time", "paint", "thickness")) %>%
group_by(paint) %>%
group_split()
Final result as a list of tibbles:
#> <list_of<
#> tbl_df<
#> paint : character
#> thickness : character
#> date_time : character
#> frequency : double
#> index : double
#> temperature: double
#> >
#> >[2]>
#> [[1]]
#> # A tibble: 3 × 6
#> paint thickness date_time frequency index temperature
#> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 epoxy1 10 20/07/2022 13:46 30000 1 22.3
#> 2 epoxy1 10 20/07/2022 13:56 31000 2 22.6
#> 3 epoxy1 10 20/07/2022 14:06 32000 3 22.5
#>
#> [[2]]
#> # A tibble: 3 × 6
#> paint thickness date_time frequency index temperature
#> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 otherpaint 20 20/07/2022 13:10 20000 1 24.5
#> 2 otherpaint 20 20/07/2022 13:20 21000 2 24.6
#> 3 otherpaint 20 20/07/2022 14:30 22000 3 24.8

Merging two data frames with different numbers of observations and matching them

The data frame below presents two data frames that I merged through cbindX(Period1, Period2). Both have the same columns but represent two time periods and have different observations for AEZ.
Example for Abyei and Angola
> dput(new_data2[1:6, c(1,2,3,5,7,8,9,11) ])
structure(list(AEZ_1 = c("Tropics, lowland semi-arid", "Dominantly hydromorphic soils", "Tropics, lowland sub-humid", "Tropics, lowland semi-arid", "Dominantly built-up land", "Dominantly hydromorphic soils"), Country_1 = c("Abyei", "Abyei", "Angola", "Angola", "Angola", "Angola"), File_name_1 = c("PRIO_AEZ_FS_1981_2010", "PRIO_AEZ_FS_1981_2010", "PRIO_AEZ_FS_1981_2010", "PRIO_AEZ_FS_1981_2010", "PRIO_AEZ_FS_1981_2010", "PRIO_AEZ_FS_1981_2010"), Share_1 = c(9418.132755827, 520.625044495, 616817.473747498, 278142.684969026, 1330.4290338252, 74581.3053271609), AEZ_2 = c("Tropics, lowland semi-arid", "Tropics, lowland sub-humid", "Dominantly hydromorphic soils", "Tropics, lowland sub-humid", "Tropics, lowland semi-arid", "Dominantly built-up land"), Country_2 = c("Abyei", "Abyei", "Abyei", "Angola", "Angola", "Angola"), File_name_2 = c("PRIO_AEZ_FS_2011_2040", "PRIO_AEZ_FS_2011_2040", "PRIO_AEZ_FS_2011_2040", "PRIO_AEZ_FS_2011_2040", "PRIO_AEZ_FS_2011_2040", "PRIO_AEZ_FS_2011_2040"), Share_2 = c(8475.525647713, 942.6071081139, 520.625044495, 754641.194306016, 289900.409286599, 1330.4290338252)), row.names = c(NA, 6L), class = "data.frame")
I would like to have matching Country to see the change of AEZ over time.
Image 2
Thanks
Assume you have two data frames (an old and a new one) with country properties:
library(tidyverse)
old <- tribble(
~AEZ, ~Country,
1, "Abyei",
2, "Angola"
) %>%
mutate(time = "old")
old
#> # A tibble: 2 x 3
#> AEZ Country time
#> <dbl> <chr> <chr>
#> 1 1 Abyei old
#> 2 2 Angola old
new <- tribble(
~AEZ, ~Country,
1, "Abyei",
2, "Angola",
3, "Angola"
) %>%
mutate(time = "new")
new
#> # A tibble: 3 x 3
#> AEZ Country time
#> <dbl> <chr> <chr>
#> 1 1 Abyei new
#> 2 2 Angola new
#> 3 3 Angola new
old %>%
full_join(new) %>%
pivot_wider(names_from = time, values_from = AEZ) %>%
unnest(old) %>%
unnest(new)
#> Joining, by = c("AEZ", "Country", "time")
#> Warning: Values are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list` to suppress this warning.
#> * Use `values_fn = length` to identify where the duplicates arise
#> * Use `values_fn = {summary_fun}` to summarise duplicates
#> # A tibble: 3 x 3
#> Country old new
#> <chr> <dbl> <dbl>
#> 1 Abyei 1 1
#> 2 Angola 2 2
#> 3 Angola 2 3
Created on 2021-09-21 by the reprex package (v2.0.1)
My suggestion is: Rename AEZ variable in the first file (data frame) as AEZ_1981 and the same variable in the second file as AEZ_2011 before merging. This is how you can keep all the information and compare the changes in the merged file.
Best,
Lev
If it helps, I figure out how to do it:
new_data<-merge(Period1, Period2, by.x=c("Country", "AEZ"), by.y=c("Country", "AEZ"), all= TRUE)

R function to fix automatically formatted data

I am currently analyzing a baseball data set that has the count data included, however, some of the data has automatically been formatted as a date.
I have already tried using as.numeric but it does not help. I have provided a sample of the data below:
Count(Factor) 0-0 0-1 0-2 1-Feb 1-Jan 1-Mar 2-Feb 2-Jan 2-Mar
Feb-00 Jan-00 Mar-00
I would like to remove the date format. For instance, I want to see 1-Feb as 1-2, 1-Jan as 1-1, 1-Mar as 1-3, Feb-00 as 2-0.
Does anyone have any suggestions on how to do so?
You can replace the abbreviated months with their relevant calendar position by referencing months.abb. Below I have created a general function using Base R.
## function to apply
month_num <- function(x){
if (! grepl('\\w{3}', x))
return(x)
gsub('/?\\w{3}', as.character(match(regmatches(x, regexpr('(\\w{3})', x)), month.abb)), x)
}
## vector
strings <- c( '0-0', '0-1' ,'0-2', '1-Feb', '1-Jan', '1-Mar', '2-Feb', '2-Jan', '2-Mar', 'Feb-00', '/Jan-00', 'Mar-00')
sapply(strings, month_num, USE.NAMES = FALSE)
#> [1] "0-0" "0-1" "0-2" "1-2" "1-1" "1-3" "2-2" "2-1" "2-3" "2-00"
#> [11] "1-00" "3-00"
## data.frame or matrix
tmp <- data.frame(
strings = c( '0-0', '0-1' ,'0-2', '1-Feb', '1-Jan', '1-Mar', '2-Feb', '2-Jan', '2-Mar', 'Feb-00', '/Jan-00', 'Mar-00')
)
tmp$strings <- apply(tmp, 1, month_num)
tmp
#> strings
#> 1 0-0
#> 2 0-1
#> 3 0-2
#> 4 1-2
#> 5 1-1
#> 6 1-3
#> 7 2-2
#> 8 2-1
#> 9 2-3
#> 10 2-00
#> 11 1-00
#> 12 3-00
## list
strings <- list( '0-0', '0-1' ,'0-2', '1-Feb', '1-Jan', '1-Mar', '2-Feb', '2-Jan', '2-Mar', 'Feb-00', '/Jan-00', 'Mar-00')
strings <- lapply(strings, month_num)
tail(strings)
#> [[1]]
#> [1] "2-2"
#>
#> [[2]]
#> [1] "2-1"
#>
#> [[3]]
#> [1] "2-3"
#>
#> [[4]]
#> [1] "2-00"
#>
#> [[5]]
#> [1] "1-00"
#>
#> [[6]]
#> [1] "3-00"
Created on 2019-02-12 by the reprex package (v0.2.1)

use model object, e.g. panelmodel, to flag data used

Is it possible in some way to use a fit object, specifically the regression object I get form a plm() model, to flag observations, in the data used for the regression, if they were in fact used in the regression. I realize this could be done my looking for complete observations in my original data, but I am curious if there's a way to use the fit/reg object to flag the data.
Let me illustrate my issue with a minimal working example,
First some packages needed,
# install.packages(c("stargazer", "plm", "tidyverse"), dependencies = TRUE)
library(plm); library(stargazer); library(tidyverse)
Second some data, this example is drawing heavily on Baltagi (2013), table 3.1, found in ?plm,
data("Grunfeld", package = "plm")
dta <- Grunfeld
now I create some semi-random missing values in my data object, dta
dta[c(3:13),3] <- NA; dta[c(22:28),4] <- NA; dta[c(30:33),5] <- NA
final step in the data preparation is to create a data frame with an index attribute that describes its individual and time dimensions, using tidyverse,
dta.p <- dta %>% group_by(firm, year)
Now to the regression
plm.reg <- plm(inv ~ value + capital, data = dta.p, model = "pooling")
the results, using stargazer,
stargazer(plm.reg, type="text") # stargazer(dta, type="text")
#> ============================================
#> Dependent variable:
#> ---------------------------
#> inv
#> ----------------------------------------
#> value 0.114***
#> (0.008)
#>
#> capital 0.237***
#> (0.028)
#>
#> Constant -47.962***
#> (9.252)
#>
#> ----------------------------------------
#> Observations 178
#> R2 0.799
#> Adjusted R2 0.797
#> F Statistic 348.176*** (df = 2; 175)
#> ===========================================
#> Note: *p<0.1; **p<0.05; ***p<0.01
Say I know my data has 200 observations, and I want to find the 178 that was used in the regression.
I am speculating if there's some vector in the plm.reg I can (easily) use to crate a flag i my original data, dta, if this observation was used/not used, i.e. the semi-random missing values I created above. Maybe some broom like tool.
I imagine something like,
dta <- dta %>% valid_reg_obs(plm.reg)
The desired outcome would look something like this, the new element is the vector plm.reg at the end, i.e.,
dta %>% as_tibble()
#> # A tibble: 200 x 6
#> firm year inv value capital plm.reg
#> * <int> <int> <dbl> <dbl> <dbl> <lgl>
#> 1 1 1935 318 3078 2.80 T
#> 2 1 1936 392 4662 52.6 T
#> 3 1 1937 NA 5387 157 F
#> 4 1 1938 NA 2792 209 F
#> 5 1 1939 NA 4313 203 F
#> 6 1 1940 NA 4644 207 F
#> 7 1 1941 NA 4551 255 F
#> 8 1 1942 NA 3244 304 F
#> 9 1 1943 NA 4054 264 F
#> 10 1 1944 NA 4379 202 F
#> # ... with 190 more rows
Update, I tried to use broom's augment(), but unforunatly it gave me the error message I had hoped would create some flag,
# install.packages(c("broom"), dependencies = TRUE)
library(broom)
augment(plm.reg, dta)
#> Error in data.frame(..., check.names = FALSE) :
#> arguments imply differing number of rows: 200, 178
The vector is plm.reg$residuals. Not sure of a nice broom solution, but this seems to work:
library(tidyverse)
dta.p %>%
as.data.frame %>%
rowid_to_column %>%
mutate(plm.reg = rowid %in% names(plm.reg$residuals))
for people who use the class pdata.frame() to create an index attribute that describes its individual and time dimensions, you can us the following code, this is from another Baltagi in the ?plm,
# == Baltagi (2013), pp. 204-205
data("Produc", package = "plm")
pProduc <- pdata.frame(Produc, index = c("state", "year", "region"))
form <- log(gsp) ~ log(pc) + log(emp) + log(hwy) + log(water) + log(util) + unemp
Baltagi_reg_204_5 <- plm(form, data = pProduc, model = "random", effect = "nested")
pProduc %>% mutate(reg.re = rownames(pProduc) %in% names(Baltagi_reg_204_5$residuals)) %>%
as_tibble() %>% select(state, year, region, reg.re)
#> # A tibble: 816 x 4
#> state year region reg.re
#> <fct> <fct> <fct> <lgl>
#> 1 CONNECTICUT 1970 1 T
#> 2 CONNECTICUT 1971 1 T
#> 3 CONNECTICUT 1972 1 T
#> 4 CONNECTICUT 1973 1 T
#> 5 CONNECTICUT 1974 1 T
#> 6 CONNECTICUT 1975 1 T
#> 7 CONNECTICUT 1976 1 T
#> 8 CONNECTICUT 1977 1 T
#> 9 CONNECTICUT 1978 1 T
#> 10 CONNECTICUT 1979 1 T
#> # ... with 806 more rows
finally, if you are running the first Baltagi without index attributes, i.e. unmodified example from the help file, the code should be,
Grunfeld %>% rowid_to_column %>%
mutate(plm.reg = rowid %in% names(p$residuals)) %>% as_tibble()

Resources