I would like to provide a user-facing function that allows arbitrary grouping variables to be passed to a summary function, with the option of specifying additional arguments for filtering, but which are NULL by default (and thus unevaluated).
I understand why the following example should fail (because it is ambiguous where homeworld belongs and the other arg takes precedence), but I'm unsure what is the best way to pass dots appropriately in this situation. Ideally the result of the second and third calls to fun below would return the same results.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
fun <- function(.df, .species = NULL, ...) {
.group_vars <- rlang::ensyms(...)
if (!is.null(.species)) {
.df <- .df %>%
dplyr::filter(.data[["species"]] %in% .species)
}
.df %>%
dplyr::group_by(!!!.group_vars) %>%
dplyr::summarize(
ht = mean(.data[["height"]], na.rm = TRUE),
.groups = "drop"
)
}
fun(.df = starwars, .species = c("Human", "Droid"), species, homeworld)
#> # A tibble: 19 x 3
#> species homeworld ht
#> <chr> <chr> <dbl>
#> 1 Droid Naboo 96
#> 2 Droid Tatooine 132
#> 3 Droid <NA> 148
#> 4 Human Alderaan 176.
#> 5 Human Bespin 175
#> 6 Human Bestine IV 180
#> 7 Human Chandrila 150
#> 8 Human Concord Dawn 183
#> 9 Human Corellia 175
#> 10 Human Coruscant 168.
#> 11 Human Eriadu 180
#> 12 Human Haruun Kal 188
#> 13 Human Kamino 183
#> 14 Human Naboo 168.
#> 15 Human Serenno 193
#> 16 Human Socorro 177
#> 17 Human Stewjon 182
#> 18 Human Tatooine 179.
#> 19 Human <NA> 193
fun(.df = starwars, .species = NULL, homeworld)
#> # A tibble: 49 x 2
#> homeworld ht
#> <chr> <dbl>
#> 1 Alderaan 176.
#> 2 Aleen Minor 79
#> 3 Bespin 175
#> 4 Bestine IV 180
#> 5 Cato Neimoidia 191
#> 6 Cerea 198
#> 7 Champala 196
#> 8 Chandrila 150
#> 9 Concord Dawn 183
#> 10 Corellia 175
#> # … with 39 more rows
fun(.df = starwars, homeworld)
#> Error in fun(.df = starwars, homeworld): object 'homeworld' not found
<sup>Created on 2020-06-15 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)</sup>
I know that I can achieve the desired result by:
fun <- function(.df, .species = NULL, .groups = NULL) {
.group_vars <- rlang::syms(purrr::map(.groups, rlang::as_string))
...
}
But I am looking for a solution using ..., or that allows the user to pass either strings or symbols to .groups, e.g. .groups = c(species, homeworld) or .groups = c("species", "homeworld").
You could move the parameters so that .species comes after the dots.
fun <- function(.df, ..., .species = NULL) {
.group_vars <- rlang::ensyms(...)
if (!is.null(.species)) {
.df <- .df %>%
dplyr::filter(.data[["species"]] %in% .species)
}
.df %>%
dplyr::group_by(!!!.group_vars) %>%
dplyr::summarize(
ht = mean(.data[["height"]], na.rm = TRUE),
.groups = "drop"
)
}
fun(.df = starwars, homeworld)
which gives
> fun(.df = starwars, homeworld)
# A tibble: 49 x 3
homeworld ht .groups
<chr> <dbl> <chr>
1 NA 139. drop
2 Alderaan 176. drop
3 Aleen Minor 79 drop
4 Bespin 175 drop
5 Bestine IV 180 drop
6 Cato Neimoidia 191 drop
7 Cerea 198 drop
8 Champala 196 drop
9 Chandrila 150 drop
10 Concord Dawn 183 drop
# ... with 39 more rows
which is what you wanted to happen. The other examples still work as well.
Related
I wanted to assign same serial number for all same Submission_Ids under one Batch_number. Could some one please help me figure this out?
Submission_Id <- c(619295,619295,619295,619295,619296,619296,619296,619296,619296,556921,556921,559254,647327,647327,647327,646040,646040,646040,646040,646040,64604)
Batch_No <- (633,633,633,633,633,633,633,633,633,633,633,633,634,634,634,650,650,650,650,650,650)
Expected result
Sl.No <- c(1,1,1,1,2,2,2,2,2,3,3,4,1,1,1,1,1,1,1,1,1)
One way to do it is creating run-length IDs with data.table::rleid(Submission_Id) grouped_by(Batch_No). We can use this inside 'dplyr'. To show this I created a tibble() with both given vectors Batch_Id and Submission_Id.
library(dplyr)
library(data.table)
dat <- tibble(Submission_Id = Submission_Id,
Batch_No = Batch_No)
dat %>%
group_by(Batch_No) %>%
mutate(S1.No = data.table::rleid(Submission_Id))
#> # A tibble: 21 x 3
#> # Groups: Batch_No [3]
#> Submission_Id Batch_No S1.No
#> <dbl> <dbl> <int>
#> 1 619295 633 1
#> 2 619295 633 1
#> 3 619295 633 1
#> 4 619295 633 1
#> 5 619296 633 2
#> 6 619296 633 2
#> 7 619296 633 2
#> 8 619296 633 2
#> 9 619296 633 2
#> 10 556921 633 3
#> # ... with 11 more rows
The original data
Submission_Id <- c(619295,619295,619295,619295,619296,619296,619296,619296,619296,556921,556921,559254,647327,647327,647327,646040,646040,646040,646040,646040,64604)
Batch_No <- c(633,633,633,633,633,633,633,633,633,633,633,633,634,634,634,650,650,650,650,650,650)
Created on 2022-12-16 by the reprex package (v2.0.1)
This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 11 months ago.
My data frame has different dates as rows. Every unique date occurs appr. 500 times. I want to make a new data frame where every column is a unique date and where the rows are all the observations of that date from my old dataset. So for every column dat represents a certain date, I should have appr. 500 rows that each represent a rel_spread from that day.
You can use pivot_wider from tidyr:
library(tidyr)
pivot_wider(df, names_from = date, values_from = rel_spread, values_fn = list) %>%
unnest(everything())
#> # A tibble: 2 x 17
#> `20000103` `20000104` `20000105` `20000106` `20000107` `20000108` `20000109`
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.0234 -0.0128 0.00729 0.0408 -0.0298 0.0398 0.0445
#> 2 0.0492 -0.0120 0.0277 0.0435 -0.0288 0.0152 -0.0374
#> # ... with 10 more variables: `20000110` <dbl>, `20000111` <dbl>,
#> # `20000112` <dbl>, `20000113` <dbl>, `20000114` <dbl>, `20000115` <dbl>,
#> # `20000116` <dbl>, `20000117` <dbl>, `20000118` <dbl>, `20000119` <dbl>
Note that we don't have your data (and I wasn't about to transcribe a picture of your data), but I created a little reproducible data set which should match the structure of your data set, except it only has two values per date for demo purposes:
set.seed(1)
df <- data.frame(date = rep(as.character(20000103:20000119), 2),
rel_spread = runif(34, -0.05, 0.05))
df
#> date rel_spread
#> 1 20000103 -0.0234491337
#> 2 20000104 -0.0127876100
#> 3 20000105 0.0072853363
#> 4 20000106 0.0408207790
#> 5 20000107 -0.0298318069
#> 6 20000108 0.0398389685
#> 7 20000109 0.0444675269
#> 8 20000110 0.0160797792
#> 9 20000111 0.0129114044
#> 10 20000112 -0.0438213730
#> 11 20000113 -0.0294025425
#> 12 20000114 -0.0323443247
#> 13 20000115 0.0187022847
#> 14 20000116 -0.0115896282
#> 15 20000117 0.0269841420
#> 16 20000118 -0.0002300758
#> 17 20000119 0.0217618508
#> 18 20000103 0.0491906095
#> 19 20000104 -0.0119964821
#> 20 20000105 0.0277445221
#> 21 20000106 0.0434705231
#> 22 20000107 -0.0287857479
#> 23 20000108 0.0151673766
#> 24 20000109 -0.0374444904
#> 25 20000110 -0.0232779331
#> 26 20000111 -0.0113885907
#> 27 20000112 -0.0486609667
#> 28 20000113 -0.0117612043
#> 29 20000114 0.0369690846
#> 30 20000115 -0.0159651003
#> 31 20000116 -0.0017919885
#> 32 20000117 0.0099565825
#> 33 20000118 -0.0006458693
#> 34 20000119 -0.0313782399
Allan’s answer is perfect if you have the same number of rows for each date. If this isn’t the case, the following should work:
library(tidyr)
library(dplyr)
data_wide <- data_long %>%
group_by(date) %>%
mutate(daterow = row_number()) %>%
ungroup() %>%
pivot_wider(names_from = date, values_from = rel_spread) %>%
select(!daterow)
data_wide
Output:
# A tibble: 6 x 4
`20000103` `20000104` `20000105` `20000106`
<dbl> <dbl> <dbl> <dbl>
1 -0.626 0.184 -0.836 -0.621
2 1.60 0.330 -0.820 -2.21
3 0.487 0.738 0.576 1.12
4 -0.305 1.51 0.390 -0.0449
5 NA NA NA -0.0162
6 NA NA NA 0.944
Example data:
set.seed(1)
data_long <- data.frame(
date = c(rep(20000103:20000105, 4), rep(20000106, 6)),
rel_spread = rnorm(18)
)
The content in the column appears as follows $1,521+ 2 bds. I want to extract 1521 and put it in a new column. I know this can be done in alteryx using regex can I do it R?
How about the following?:
library(tidyverse)
x <- '$1,521+ 2 bds'
parse_number(x)
For example:
library(tidyverse)
#generate some data
tbl <- tibble(string = str_c('$', as.character(seq(1521, 1541, 1)), '+', ' 2bds'))
new_col <-
tbl$string %>%
str_split('\\+',simplify = TRUE) %>%
`[`(, 1) %>%
str_sub(2, -1) #get rid of '$' at the start
mutate(tbl, number = new_col)
#> # A tibble: 21 x 2
#> string number
#> <chr> <chr>
#> 1 $1521+ 2bds 1521
#> 2 $1522+ 2bds 1522
#> 3 $1523+ 2bds 1523
#> 4 $1524+ 2bds 1524
#> 5 $1525+ 2bds 1525
#> 6 $1526+ 2bds 1526
#> 7 $1527+ 2bds 1527
#> 8 $1528+ 2bds 1528
#> 9 $1529+ 2bds 1529
#> 10 $1530+ 2bds 1530
#> # … with 11 more rows
Created on 2021-06-12 by the reprex package (v2.0.0)
We can use sub from base R
as.numeric( sub("\\$(\\d+),(\\d+).*", "\\1\\2", x))
#[1] 1521
data
x <- '$1,521+ 2 bds'
This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 2 years ago.
I am writing a function to dplyr::_join two dataframes by different columns, with the column name of the first dataframe dynamically specified as a function argument. I believe I need to use rlang quasiquotation/metaprogramming but haven't been able to get a working solution. I appreciate any suggestions!
library(dplyr)
library(rlang)
library(palmerpenguins)
# Create a smaller dataset
penguins <-
penguins %>%
group_by(species) %>%
slice_head(n = 4) %>%
ungroup()
# Create a colors dataset
penguin_colors <-
tibble(
type = c("Adelie", "Chinstrap", "Gentoo"),
color = c("orange", "purple", "green")
)
# Without function --------------------------------------------------------
# Join works with character vectors
left_join(
penguins, penguin_colors, by = c("species" = "type")
)
#> # A tibble: 12 x 9
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
#> <chr> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torge… 39.1 18.7 181 3750
#> 2 Adelie Torge… 39.5 17.4 186 3800
#> 3 Adelie Torge… 40.3 18 195 3250
#> 4 Adelie Torge… NA NA NA NA
#> 5 Chinst… Dream 46.5 17.9 192 3500
#> 6 Chinst… Dream 50 19.5 196 3900
#> 7 Chinst… Dream 51.3 19.2 193 3650
#> 8 Chinst… Dream 45.4 18.7 188 3525
#> 9 Gentoo Biscoe 46.1 13.2 211 4500
#> 10 Gentoo Biscoe 50 16.3 230 5700
#> 11 Gentoo Biscoe 48.7 14.1 210 4450
#> 12 Gentoo Biscoe 50 15.2 218 5700
#> # … with 3 more variables: sex <fct>, year <int>, color <chr>
# Join works with data-variable and character vector
left_join(
penguins, penguin_colors, by = c(species = "type")
)
#> # A tibble: 12 x 9
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
#> <chr> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torge… 39.1 18.7 181 3750
#> 2 Adelie Torge… 39.5 17.4 186 3800
#> 3 Adelie Torge… 40.3 18 195 3250
#> 4 Adelie Torge… NA NA NA NA
#> 5 Chinst… Dream 46.5 17.9 192 3500
#> 6 Chinst… Dream 50 19.5 196 3900
#> 7 Chinst… Dream 51.3 19.2 193 3650
#> 8 Chinst… Dream 45.4 18.7 188 3525
#> 9 Gentoo Biscoe 46.1 13.2 211 4500
#> 10 Gentoo Biscoe 50 16.3 230 5700
#> 11 Gentoo Biscoe 48.7 14.1 210 4450
#> 12 Gentoo Biscoe 50 15.2 218 5700
#> # … with 3 more variables: sex <fct>, year <int>, color <chr>
# Join does NOT work with character vector and data-variable
left_join(
penguins, penguin_colors, by = c(species = type)
)
#> Error in standardise_join_by(by, x_names = x_names, y_names = y_names): object 'type' not found
# With function -----------------------------------------------------------
# Version 1: Without tunneling
add_colors <- function(data, var) {
left_join(
data, penguin_colors, by = c(var = "type")
)
}
add_colors(penguins, species)
#> Error: Join columns must be present in data.
#> x Problem with `var`.
add_colors(penguins, "species")
#> Error: Join columns must be present in data.
#> x Problem with `var`.
# Version 2: With tunneling
add_colors <- function(data, var) {
left_join(
data, penguin_colors, by = c("{{var}}" = "type")
)
}
add_colors(penguins, species)
#> Error: Join columns must be present in data.
#> x Problem with `{{var}}`.
add_colors(penguins, "species")
#> Error: Join columns must be present in data.
#> x Problem with `{{var}}`.
# Version 2: With tunneling and glue syntax
add_colors <- function(data, var) {
left_join(
data, penguin_colors, by = c("{{var}}" := "type")
)
}
add_colors(penguins, species)
#> Error: `:=` can only be used within a quasiquoted argument
add_colors(penguins, "species")
#> Error: `:=` can only be used within a quasiquoted argument
Created on 2020-10-05 by the reprex package (v0.3.0)
Here are related resources I consulted:
using `rlang` quasiquotation with `dplyr::_join` functions
https://dplyr.tidyverse.org/reference/join.html
https://speakerdeck.com/lionelhenry/interactivity-and-programming-in-the-tidyverse
https://dplyr.tidyverse.org/articles/programming.html
Thank you for your advice.
library(dplyr)
left_join(
penguins, penguin_colors, by = c(species = "type")
)
The reason why above works is because in by we are creating a named vector like this :
c(species = "type")
#species
# "type"
You can also do that via setNames :
setNames('type', 'species')
but notice that passing species without quotes fail.
setNames('type', species)
Error in setNames("type", species) : object 'species' not found
So create a named vector with setNames and pass character value in the function.
add_colors <- function(data, var) {
left_join(
data, penguin_colors, by = setNames('type', var)
)
}
add_colors(penguins, 'species')
To add to Ronak's solution you can also achieve this without quotes using ensym
Example:
add_colors <- function(data, var) {
left_join(
data, penguin_colors, by = set_names("type", nm = ensym(var))
)
}
Is it possible in some way to use a fit object, specifically the regression object I get form a plm() model, to flag observations, in the data used for the regression, if they were in fact used in the regression. I realize this could be done my looking for complete observations in my original data, but I am curious if there's a way to use the fit/reg object to flag the data.
Let me illustrate my issue with a minimal working example,
First some packages needed,
# install.packages(c("stargazer", "plm", "tidyverse"), dependencies = TRUE)
library(plm); library(stargazer); library(tidyverse)
Second some data, this example is drawing heavily on Baltagi (2013), table 3.1, found in ?plm,
data("Grunfeld", package = "plm")
dta <- Grunfeld
now I create some semi-random missing values in my data object, dta
dta[c(3:13),3] <- NA; dta[c(22:28),4] <- NA; dta[c(30:33),5] <- NA
final step in the data preparation is to create a data frame with an index attribute that describes its individual and time dimensions, using tidyverse,
dta.p <- dta %>% group_by(firm, year)
Now to the regression
plm.reg <- plm(inv ~ value + capital, data = dta.p, model = "pooling")
the results, using stargazer,
stargazer(plm.reg, type="text") # stargazer(dta, type="text")
#> ============================================
#> Dependent variable:
#> ---------------------------
#> inv
#> ----------------------------------------
#> value 0.114***
#> (0.008)
#>
#> capital 0.237***
#> (0.028)
#>
#> Constant -47.962***
#> (9.252)
#>
#> ----------------------------------------
#> Observations 178
#> R2 0.799
#> Adjusted R2 0.797
#> F Statistic 348.176*** (df = 2; 175)
#> ===========================================
#> Note: *p<0.1; **p<0.05; ***p<0.01
Say I know my data has 200 observations, and I want to find the 178 that was used in the regression.
I am speculating if there's some vector in the plm.reg I can (easily) use to crate a flag i my original data, dta, if this observation was used/not used, i.e. the semi-random missing values I created above. Maybe some broom like tool.
I imagine something like,
dta <- dta %>% valid_reg_obs(plm.reg)
The desired outcome would look something like this, the new element is the vector plm.reg at the end, i.e.,
dta %>% as_tibble()
#> # A tibble: 200 x 6
#> firm year inv value capital plm.reg
#> * <int> <int> <dbl> <dbl> <dbl> <lgl>
#> 1 1 1935 318 3078 2.80 T
#> 2 1 1936 392 4662 52.6 T
#> 3 1 1937 NA 5387 157 F
#> 4 1 1938 NA 2792 209 F
#> 5 1 1939 NA 4313 203 F
#> 6 1 1940 NA 4644 207 F
#> 7 1 1941 NA 4551 255 F
#> 8 1 1942 NA 3244 304 F
#> 9 1 1943 NA 4054 264 F
#> 10 1 1944 NA 4379 202 F
#> # ... with 190 more rows
Update, I tried to use broom's augment(), but unforunatly it gave me the error message I had hoped would create some flag,
# install.packages(c("broom"), dependencies = TRUE)
library(broom)
augment(plm.reg, dta)
#> Error in data.frame(..., check.names = FALSE) :
#> arguments imply differing number of rows: 200, 178
The vector is plm.reg$residuals. Not sure of a nice broom solution, but this seems to work:
library(tidyverse)
dta.p %>%
as.data.frame %>%
rowid_to_column %>%
mutate(plm.reg = rowid %in% names(plm.reg$residuals))
for people who use the class pdata.frame() to create an index attribute that describes its individual and time dimensions, you can us the following code, this is from another Baltagi in the ?plm,
# == Baltagi (2013), pp. 204-205
data("Produc", package = "plm")
pProduc <- pdata.frame(Produc, index = c("state", "year", "region"))
form <- log(gsp) ~ log(pc) + log(emp) + log(hwy) + log(water) + log(util) + unemp
Baltagi_reg_204_5 <- plm(form, data = pProduc, model = "random", effect = "nested")
pProduc %>% mutate(reg.re = rownames(pProduc) %in% names(Baltagi_reg_204_5$residuals)) %>%
as_tibble() %>% select(state, year, region, reg.re)
#> # A tibble: 816 x 4
#> state year region reg.re
#> <fct> <fct> <fct> <lgl>
#> 1 CONNECTICUT 1970 1 T
#> 2 CONNECTICUT 1971 1 T
#> 3 CONNECTICUT 1972 1 T
#> 4 CONNECTICUT 1973 1 T
#> 5 CONNECTICUT 1974 1 T
#> 6 CONNECTICUT 1975 1 T
#> 7 CONNECTICUT 1976 1 T
#> 8 CONNECTICUT 1977 1 T
#> 9 CONNECTICUT 1978 1 T
#> 10 CONNECTICUT 1979 1 T
#> # ... with 806 more rows
finally, if you are running the first Baltagi without index attributes, i.e. unmodified example from the help file, the code should be,
Grunfeld %>% rowid_to_column %>%
mutate(plm.reg = rowid %in% names(p$residuals)) %>% as_tibble()