Let's suppose that this is my code:
library(magrittr)
library(dplyr)
set.seed(123,kind="Mersenne-Twister",normal.kind="Inversion")
y = runif(20,0,50)
simulation <- function(y){
x <- rnorm(length(y),3,0.125)
lm(y ~ x)
}
fit <- lapply(1:10, function(dummy) simulation(y))
coef <- sapply(fit, coef) %>%
t() %>%
as.data.frame()
How can I collect the 10 simulated x variables generated from the function simulation in a data frame?
They are stored in the fit object. A single one can be extracted with fit[[1]]$model$x. Programmatically we can get all of them in a data frame like this:
xs = lapply(fit, \(x) x$model$x)
data.frame(i = rep(seq_along(xs), lengths(xs)), x = unlist(xs))
# i x
# 1 1 3.153010
# 2 1 3.044977
# 3 1 3.050096
# 4 1 3.013835
# 5 1 2.930520
# 6 1 3.223364
# 7 1 3.062231
# ...
Or, if you want lots of info you can use broom::augment instead. This will return the full data, along with predicted values, residuals, etc.
result = bind_rows(lapply(fit, broom::augment), .id = "sim")
head(result)
# A tibble: 6 × 9
# sim y x .fitted .resid .hat .sigma .cooksd .std.resid
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 14.4 3.15 19.2 -4.78 0.141 15.1 0.0101 -0.350
# 2 1 39.4 3.04 24.6 14.8 0.0612 14.7 0.0353 1.04
# 3 1 20.4 3.05 24.3 -3.89 0.0633 15.1 0.00252 -0.273
# 4 1 44.2 3.01 26.2 18.0 0.0525 14.5 0.0437 1.26
# 5 1 47.0 2.93 30.4 16.7 0.0603 14.6 0.0438 1.17
# 6 1 2.28 3.22 15.6 -13.3 0.234 14.7 0.164 -1.04
You might also like bind_rows(lapply(fit, broom::tidy), .id = "sim") which will give you one row per coefficient and bind_rows(lapply(fit, broom::glance), .id = "sim") which will give you one row per model.
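If one column per simulation is more convenient than the long layout above, the same x vectors can be bound column-wise. This is a minimal base-R sketch: it re-creates fit as in the question (with the default RNG settings), and the names x_wide and sim1 to sim10 are my own choices for illustration.

```r
set.seed(123)
y <- runif(20, 0, 50)

simulation <- function(y) {
  x <- rnorm(length(y), 3, 0.125)
  lm(y ~ x)
}
fit <- lapply(1:10, function(dummy) simulation(y))

# sapply() returns a 20 x 10 matrix: one column of simulated x per model
x_wide <- as.data.frame(sapply(fit, function(m) m$model$x))
names(x_wide) <- paste0("sim", seq_along(fit))

dim(x_wide)
```

Each column of x_wide then holds the x values that went into the corresponding element of fit.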
I'm trying to write a function that can flexibly group by a variable number of arguments and fit a linear model to each subset. The output should be a table with each row showing the grouping variable(s) and corresponding lm call results that broom::glance provides. But I can't figure out how to structure the output. Code that produces the same error is as follows:
library(dplyr)
library(broom)
test_fcn <- function(var1, ...) {
x <- unlist(list(...))
mtcars %>%
group_by(across(all_of(c('gear', x)))) %>%
mutate(mod = list(lm(hp ~ !!sym(var1), data = .))) %>%
summarize(broom::glance(mod))
}
test_fcn('qsec', 'cyl', 'carb')
I'm pushing my R/dplyr comfort zone by mixing static and dynamic variable arguments, so I've left them here in case that's a contributing factor. Thanks for any input!
You were nearly there.
library(purrr)  # for map()
library(tidyr)  # for unnest()
test_fcn <- function(var1, ...) {
x <- unlist(list(...))
mtcars %>%
group_by(across(all_of(c('gear', x)))) %>%
summarise(
mod = list(lm(hp ~ !!sym(var1), data = .)),
mod = map(mod, broom::glance),
.groups = "drop")
}
test_fcn('qsec', 'cyl', 'carb') %>% unnest(mod)
## A tibble: 12 × 15
# gear cyl carb r.squared adj.r.sq…¹ sigma stati…² p.value df logLik AIC BIC devia…³ df.re…⁴
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 3 4 1 0.502 0.485 49.2 30.2 5.77e-6 1 -169. 344. 348. 72633. 30
# 2 3 6 1 0.502 0.485 49.2 30.2 5.77e-6 1 -169. 344. 348. 72633. 30
# 3 3 8 2 0.502 0.485 49.2 30.2 5.77e-6 1 -169. 344. 348. 72633. 30
# 4 3 8 3 0.502 0.485 49.2 30.2 5.77e-6 1 -169. 344. 348. 72633. 30
# 5 3 8 4 0.502 0.485 49.2 30.2 5.77e-6 1 -169. 344. 348. 72633. 30
# 6 4 4 1 0.502 0.485 49.2 30.2 5.77e-6 1 -169. 344. 348. 72633. 30
# 7 4 4 2 0.502 0.485 49.2 30.2 5.77e-6 1 -169. 344. 348. 72633. 30
# 8 4 6 4 0.502 0.485 49.2 30.2 5.77e-6 1 -169. 344. 348. 72633. 30
# 9 5 4 2 0.502 0.485 49.2 30.2 5.77e-6 1 -169. 344. 348. 72633. 30
#10 5 6 6 0.502 0.485 49.2 30.2 5.77e-6 1 -169. 344. 348. 72633. 30
#11 5 8 4 0.502 0.485 49.2 30.2 5.77e-6 1 -169. 344. 348. 72633. 30
#12 5 8 8 0.502 0.485 49.2 30.2 5.77e-6 1 -169. 344. 348. 72633. 30
## … with 1 more variable: nobs <int>, and abbreviated variable names ¹adj.r.squared, ²statistic,
## ³deviance, ⁴df.residual
## ℹ Use `colnames()` to see all variable names
Because you are storing the lm fit objects in a list, you need to loop over the entries using purrr::map.
You might want to put the unnest into the test_fcn: a slightly more compact version would be
test_fcn <- function(var1, ...) {
x <- unlist(list(...))
mtcars %>%
group_by(across(all_of(c('gear', x)))) %>%
summarise(
mod = map(list(lm(hp ~ !!sym(var1), data = .)), broom::glance),
.groups = "drop") %>%
unnest(mod)
}
Update
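Until your comment, I hadn't realised that the grouping was ignored: inside the pipe, data = . refers to the whole data frame rather than the current group, so every row gets the same model. Here is a nest-unnest-type solution.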
test_fcn <- function(var1, ...) {
x <- list(...)
mtcars %>%
group_by(across(all_of(c('gear', x)))) %>%
nest() %>%
ungroup() %>%
mutate(mod = map(
data,
~ lm(hp ~ !!sym(var1), data = .x) %>% broom::glance())) %>%
unnest(mod)
}
test_fcn('qsec', 'cyl', 'carb')
## A tibble: 12 × 16
# cyl gear carb data r.squared adj.r.s…¹ sigma statis…² p.value df logLik
# <dbl> <dbl> <dbl> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 6 4 4 <tibble> 0.911 0.867 2.74e+ 0 20.5 0.0454 1 -8.32
# 2 4 4 1 <tibble> 0.525 0.287 1.15e+ 1 2.21 0.276 1 -14.1
# 3 6 3 1 <tibble> 1 NaN NaN NaN NaN 1 Inf
# 4 8 3 2 <tibble> 0.0262 -0.461 1.74e+ 1 0.0538 0.838 1 -15.7
# 5 8 3 4 <tibble> 0.869 0.825 7.48e+ 0 19.9 0.0210 1 -15.9
# 6 4 4 2 <tibble> 0.0721 -0.392 3.18e+ 1 0.155 0.732 1 -18.1
# 7 8 3 3 <tibble> 0.538 0.0769 2.63e-14 1.17 0.475 1 91.2
# 8 4 3 1 <tibble> 0 0 NaN NA NA NA Inf
# 9 4 5 2 <tibble> 1 NaN NaN NaN NaN 1 Inf
#10 8 5 4 <tibble> 0 0 NaN NA NA NA Inf
#11 6 5 6 <tibble> 0 0 NaN NA NA NA Inf
#12 8 5 8 <tibble> 0 0 NaN NA NA NA Inf
## … with 5 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>,
## nobs <int>, and abbreviated variable names ¹adj.r.squared, ²statistic
## ℹ Use `colnames()` to see all variable names
Explanation: tidyr::nest nests data in a list column (with name data by default); we can then loop through the data entries, fit the model and extract model summaries with broom::glance in a new column mod; unnesting mod then gives the desired structure. If not needed, you can remove the data column with select(-data).
PS. The example produces some warnings (leading to NAs in the model summaries) from those groups where you have only a single observation.
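As a cross-check that the grouping is respected, a base-R sketch can fit one model per group and compare it with the pooled fit. It uses only split(), sapply() and lm(); grouping by gear alone and the names by_gear, r2_group and r2_pooled are assumptions made for illustration.

```r
# One data frame per value of gear
by_gear <- split(mtcars, mtcars$gear)

# R-squared of hp ~ qsec fitted separately within each group
r2_group <- sapply(by_gear, function(d) {
  summary(lm(hp ~ qsec, data = d))$r.squared
})

# R-squared of the same model fitted to all rows at once
r2_pooled <- summary(lm(hp ~ qsec, data = mtcars))$r.squared

r2_group
r2_pooled
```

If every per-group value equalled the pooled one, the grouping would have been ignored, which is what the repeated 0.502 in the first attempt indicated.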
I wrote a function that runs a linear model and outputs a data frame. I would like to run the function several times over two grouping variables and stack the output. Here is a hypothetical dataset and function:
data = data.frame(grade_level = rep(1:4, each = 3),
x = rnorm(12, mean = 21, sd = 7.5),
y = rnorm(12, mean = 20, sd = 7),
cut_set = rep(c("low", "med", "high"), each = 4))
func = function(grade, set){
model = lm(y ~ x, data=data[data$grade_level == grade & data$cut_set == set,])
fitted.values = model$fitted.values
final = data.frame(grade_level = data$grade_level[data$grade_level == grade & data$cut_set == set],
predicted_values = fitted.values)
final
}
I can run it multiple times and then bind the output, but I know this isn't the best approach:
g1.low <- func(grade = 1, set = "low")
g1.med <- func(grade = 1, set = "med")
pred.values = rbind(g1.low, g1.med)
I would like to loop through all grades (1 to 4) and set ("low", "med", "high") values. I've tried this loop but it doesn't work. I wonder if there is a purrr solution.
for (i in grades) {
for(c in 1:length(cut_sets)) {
temp <- func(grade = i, set = cut_sets[c])
predicted.values <- rbind(predicted.values, temp)
}
}
If I've understood correctly, you can manage it with dplyr and broom:
library(dplyr)
library(broom)
library(tidyr)
mods <- data %>%
group_by(grade_level, cut_set) %>%
do(model = augment(lm(y ~ x, data = .)) )
mods
# A tibble: 6 x 3
# Rowwise:
grade_level cut_set model
<int> <chr> <list>
1 1 low <tibble [3 x 8]>
2 2 low <tibble [1 x 8]>
3 2 med <tibble [2 x 8]>
4 3 high <tibble [1 x 8]>
5 3 med <tibble [2 x 8]>
6 4 high <tibble [3 x 8]>
mods %>% unnest(cols = c(model))
# A tibble: 12 x 10
grade_level cut_set y x .fitted .resid .hat .sigma .cooksd .std.resid
<int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 low 27.5 20.9 27.4 1.12e- 1 0.992 NaN 60.9 1.
2 1 low 24.8 30.4 24.0 8.15e- 1 0.567 Inf 0.656 1.00
3 1 low 23.5 29.3 24.4 -9.26e- 1 0.441 NaN 0.394 -1
4 2 low 31.6 18.6 31.6 0. 1 0 NaN NaN
5 2 med 19.3 20.9 19.3 3.55e-15 1 0 NaN NaN
6 2 med 16.9 14.7 16.9 0. 1 0 NaN NaN
7 3 high 20.1 22.9 20.1 0. 1 0 NaN NaN
8 3 med 21.6 13.2 21.6 3.55e-15 1 0 NaN NaN
9 3 med 20.9 26.5 20.9 0. 1 0 NaN NaN
10 4 high 26.4 20.0 20.9 5.49e+ 0 0.369 NaN 0.293 1.
11 4 high 15.2 15.6 19.0 -3.88e+ 0 0.685 NaN 1.09 -1.
12 4 high 23.7 30.8 25.3 -1.61e+ 0 0.946 NaN 8.71 -1.
To get the slopes, use tidy instead of augment:
data %>%
group_by(grade_level, cut_set) %>%
do(model = tidy(lm(y ~ x, data = .)) ) %>% unnest(cols = c(model))
# A tibble: 12 x 7
grade_level cut_set term estimate std.error statistic p.value
<int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 low (Intercept) 14.8 7.05 2.09 0.284
2 1 low x 0.339 0.371 0.913 0.529
3 2 low (Intercept) 23.1 NaN NaN NaN
4 2 low x NA NA NA NA
5 2 med (Intercept) 1.27 NaN NaN NaN
6 2 med x 0.561 NaN NaN NaN
7 3 high (Intercept) 14.7 NaN NaN NaN
8 3 high x NA NA NA NA
9 3 med (Intercept) 7.29 NaN NaN NaN
10 3 med x 0.229 NaN NaN NaN
11 4 high (Intercept) 13.8 4.18 3.30 0.187
12 4 high x 0.106 0.210 0.505 0.702
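If you would rather keep your own function, the loop can be replaced by mapping over the grade/set combinations that actually occur in the data. This is a base-R sketch under a few assumptions: the seed is arbitrary, func is lightly adapted from the question (it subsets once and also returns cut_set), and combos is a helper name introduced here. purrr::map2() over the same two vectors followed by dplyr::bind_rows() should behave similarly.

```r
set.seed(1)  # assumption: any seed works, this only makes the sketch reproducible
data <- data.frame(grade_level = rep(1:4, each = 3),
                   x = rnorm(12, mean = 21, sd = 7.5),
                   y = rnorm(12, mean = 20, sd = 7),
                   cut_set = rep(c("low", "med", "high"), each = 4))

# Lightly adapted from the question: subset once, keep cut_set in the output
func <- function(grade, set) {
  sub <- data[data$grade_level == grade & data$cut_set == set, ]
  model <- lm(y ~ x, data = sub)
  data.frame(grade_level = sub$grade_level,
             cut_set = sub$cut_set,
             predicted_values = unname(fitted(model)))
}

# Only the combinations present in the data, so lm() never sees an empty subset
combos <- unique(data[, c("grade_level", "cut_set")])

# Call func once per combination and stack the results
pred.values <- do.call(rbind, Map(func, combos$grade_level, combos$cut_set))
pred.values
```

Looping over every grade crossed with every set, as in the original nested loop, would hit combinations with no rows (for example grade 1 with "high"), and lm() stops with an error on an empty subset.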
I have a recollection that purrr::pmap_* can treat a data.frame as a list but the syntax eludes me.
Imagine we wanted to fit a separate lm object for each value of mtcars$vs and mtcars$am
library(tidyverse)
library(broom)
d1 <- mtcars %>%
group_by(
vs, am
) %>%
nest %>%
mutate(
coef = data %>%
map(
~lm(mpg ~ wt, data =.) %>%
tidy
)
)
If I wanted to extract the coefficient estimates as an un-nested data.frame, and append the values of am and vs, I might try
d1[, -3] %>%
pmap_dfr(
function(i, j, k)
k %>%
mutate(
vs = i,
am = j
)
)
But this results in an error. More explicitly declaring these variables as separate lists has the desired effect
list(
d1$vs,
d1$am,
d1$coef
) %>%
pmap_dfr(
function(i, j, k)
k %>%
mutate(
vs = i,
am = j
)
)
Is there a succinct way for pmap_* to treat a data.frame as a list?
We can use the standard option to extract the components (..1, ..2, etc)
d1[, -3] %>%
pmap_dfr(~ ..3 %>%
mutate(vs = ..1, am = ..2))
# A tibble: 8 x 7
# term estimate std.error statistic p.value vs am
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 (Intercept) 42.4 3.30 12.8 0.000213 0 1
#2 wt -7.91 1.14 -6.93 0.00227 0 1
#3 (Intercept) 44.1 6.96 6.34 0.00144 1 1
#4 wt -7.77 3.36 -2.31 0.0689 1 1
#5 (Intercept) 31.5 8.98 3.51 0.0171 1 0
#6 wt -3.38 2.80 -1.21 0.281 1 0
#7 (Intercept) 25.1 3.51 7.14 0.0000315 0 0
#8 wt -2.44 0.842 -2.90 0.0159 0 0
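The second version works because that list has no names attribute, so pmap matches the arguments by position; a named data frame is matched by name instead, and i, j, k do not match vs, am, coef. If you unname d1 it works. The fact that you used the list function in the second example doesn't make a difference (except that it removed the names), because both objects are lists (data frames are lists).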
d1[, -3] %>%
unname %>%
pmap_dfr(
function(i, j, k)
k %>%
mutate(
vs = i,
am = j
)
)
# # A tibble: 8 x 7
# term estimate std.error statistic p.value vs am
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 42.4 3.30 12.8 0.000213 0 1
# 2 wt -7.91 1.14 -6.93 0.00227 0 1
# 3 (Intercept) 44.1 6.96 6.34 0.00144 1 1
# 4 wt -7.77 3.36 -2.31 0.0689 1 1
# 5 (Intercept) 31.5 8.98 3.51 0.0171 1 0
# 6 wt -3.38 2.80 -1.21 0.281 1 0
# 7 (Intercept) 25.1 3.51 7.14 0.0000315 0 0
# 8 wt -2.44 0.842 -2.90 0.0159 0 0
You can also name the arguments in your first code block's function to match (or use ..1 etc) for the same result
d1[, -3] %>%
pmap_dfr(
function(vs, am, coef)
coef %>%
mutate(
vs = vs,
am = am
)
)
# # A tibble: 8 x 7
# term estimate std.error statistic p.value vs am
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 42.4 3.30 12.8 0.000213 0 1
# 2 wt -7.91 1.14 -6.93 0.00227 0 1
# 3 (Intercept) 44.1 6.96 6.34 0.00144 1 1
# 4 wt -7.77 3.36 -2.31 0.0689 1 1
# 5 (Intercept) 31.5 8.98 3.51 0.0171 1 0
# 6 wt -3.38 2.80 -1.21 0.281 1 0
# 7 (Intercept) 25.1 3.51 7.14 0.0000315 0 0
# 8 wt -2.44 0.842 -2.90 0.0159 0 0
You could also use wap from the experimental rap package
library(rap)
d1[, -3] %>%
wap( ~ coef %>%
mutate(
vs = vs,
am = am)) %>%
bind_rows
# # A tibble: 8 x 7
# term estimate std.error statistic p.value vs am
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 42.4 3.30 12.8 0.000213 0 1
# 2 wt -7.91 1.14 -6.93 0.00227 0 1
# 3 (Intercept) 44.1 6.96 6.34 0.00144 1 1
# 4 wt -7.77 3.36 -2.31 0.0689 1 1
# 5 (Intercept) 31.5 8.98 3.51 0.0171 1 0
# 6 wt -3.38 2.80 -1.21 0.281 1 0
# 7 (Intercept) 25.1 3.51 7.14 0.0000315 0 0
# 8 wt -2.44 0.842 -2.90 0.0159 0 0
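The name-versus-position behaviour is easier to see on a toy input. The sketch below uses a small made-up data frame (df, with columns a and b, is purely illustrative) and compares the three cases discussed above.

```r
library(purrr)

df <- data.frame(a = 1:2, b = c(10, 20))

# Argument names match the column names, so matching is by name
by_name <- pmap_dbl(df, function(a, b) a + b)

# With the names removed, arguments are matched by position
by_position <- pmap_dbl(unname(as.list(df)), function(i, j) i + j)

# Named input with non-matching argument names fails (unused arguments)
mismatch <- tryCatch(pmap_dbl(df, function(i, j) i + j),
                     error = function(e) "error")

by_name
by_position
mismatch
```

The first two calls give the same sums, while the third is the situation from the question and ends in an error.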
Is it possible to change the contrasts of interaction terms which have been specified in an lm using the colon : notation?
In the example below, the reference category defaults to the last of the six terms generated by gear:vs (i.e., gear5:vs1). I'd instead like it to use the first of the six as the reference (i.e., gear3:vs0).
mtcars.1 <- mtcars %>%
mutate(gear = as.factor(gear)) %>%
mutate(vs = as.factor(vs))
lm(data=mtcars.1, mpg ~ gear:vs) %>%
tidy
#> # A tibble: 6 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 30.4 4.13 7.36 0.0000000824
#> 2 gear3:vs0 -15.4 4.30 -3.57 0.00143
#> 3 gear4:vs0 -9.40 5.06 -1.86 0.0747
#> 4 gear5:vs0 -11.3 4.62 -2.44 0.0218
#> 5 gear3:vs1 -10.1 4.77 -2.11 0.0447
#> 6 gear4:vs1 -5.16 4.33 -1.19 0.245
Specifying contrasts for gear and vs separately doesn't seem to have an effect:
lm(data=mtcars.1, mpg ~ gear:vs,
contrasts = list(gear = contr.treatment(n=3,base=3),
vs = contr.treatment(n=2,base=2))) %>%
tidy
#> # A tibble: 6 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 30.4 4.13 7.36 0.0000000824
#> 2 gear3:vs0 -15.4 4.30 -3.57 0.00143
#> 3 gear4:vs0 -9.40 5.06 -1.86 0.0747
#> 4 gear5:vs0 -11.3 4.62 -2.44 0.0218
#> 5 gear3:vs1 -10.1 4.77 -2.11 0.0447
#> 6 gear4:vs1 -5.16 4.33 -1.19 0.245
And I'm not sure how to specify a contrast for gear:vs directly:
lm(data=mtcars.1, mpg ~ gear:vs,
contrasts = list("gear:vs" = contr.treatment(n=6,base=6))) %>%
tidy
#> Warning in model.matrix.default(mt, mf, contrasts): variable 'gear:vs' is
#> absent, its contrast will be ignored
#> # A tibble: 6 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 30.4 4.13 7.36 0.0000000824
#> 2 gear3:vs0 -15.4 4.30 -3.57 0.00143
#> 3 gear4:vs0 -9.40 5.06 -1.86 0.0747
#> 4 gear5:vs0 -11.3 4.62 -2.44 0.0218
#> 5 gear3:vs1 -10.1 4.77 -2.11 0.0447
#> 6 gear4:vs1 -5.16 4.33 -1.19 0.245
Created on 2019-01-21 by the reprex package (v0.2.1)
One way around this is to pre-calculate the interaction term before regression.
To demonstrate, we can create a factor column GV in mtcars with the same levels as observed in your lm output. It generates the same values:
mtcars %>%
mutate(GV = interaction(factor(gear), factor(vs)),
GV = factor(GV, levels = c("5.1", "3.0", "4.0", "5.0", "3.1", "4.1"))) %>%
lm(mpg ~ GV, .) %>%
tidy()
# A tibble: 6 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 30.4 4.13 7.36 0.0000000824
2 GV3.0 -15.4 4.30 -3.57 0.00143
3 GV4.0 -9.4 5.06 -1.86 0.0747
4 GV5.0 -11.3 4.62 -2.44 0.0218
5 GV3.1 -10.1 4.77 -2.11 0.0447
6 GV4.1 -5.16 4.33 -1.19 0.245
Now we omit the second mutate term, so the levels are 3.0, 4.0, 5.0, 3.1, 4.1, 5.1.
mtcars %>%
mutate(GV = interaction(factor(gear), factor(vs))) %>%
lm(mpg ~ GV, .) %>%
tidy()
# A tibble: 6 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 15.1 1.19 12.6 1.38e-12
2 GV4.0 5.95 3.16 1.88 7.07e- 2
3 GV5.0 4.08 2.39 1.71 9.96e- 2
4 GV3.1 5.28 2.67 1.98 5.83e- 2
5 GV4.1 10.2 1.77 5.76 4.61e- 6
6 GV5.1 15.4 4.30 3.57 1.43e- 3
Use interaction(factor(gear), factor(vs), lex.order = TRUE) to get the levels 3.0, 3.1, 4.0, 4.1, 5.0, 5.1.
mtcars %>%
mutate(GV = interaction(factor(gear), factor(vs), lex.order = TRUE)) %>%
lm(mpg ~ GV, .) %>%
tidy()
# A tibble: 6 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 15.0 1.19 12.6 1.38e-12
2 GV3.1 5.28 2.67 1.98 5.83e- 2
3 GV4.0 5.95 3.16 1.88 7.07e- 2
4 GV4.1 10.2 1.77 5.76 4.61e- 6
5 GV5.0 4.07 2.39 1.71 9.96e- 2
6 GV5.1 15.3 4.30 3.57 1.43e- 3
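When only the reference category matters, relevel() may be simpler than spelling out every level in factor(levels = ...). A minimal base-R sketch, where the choice of "4.1" as reference and the name GV_ref are arbitrary assumptions:

```r
GV <- interaction(factor(mtcars$gear), factor(mtcars$vs))
levels(GV)

# Move "4.1" to the front; the remaining levels keep their relative order
GV_ref <- relevel(GV, ref = "4.1")
levels(GV_ref)

fit <- lm(mpg ~ GV_ref, data = mtcars)

# With treatment contrasts the intercept is the mean mpg of the reference cell
coef(fit)[["(Intercept)"]]
mean(mtcars$mpg[mtcars$gear == 4 & mtcars$vs == 1])
```

The remaining coefficients are then differences from the gear 4, vs 1 cell.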
I read this question and practiced matching patterns, but I still can't figure it out.
I have a panel with the same measure, several times per year. Now, I want to rename them in a logical way. My raw data looks a bit like this,
set.seed(667)
dta <- data.frame(id = 1:6,
R1213 = runif(6),
R1224 = runif(6, 1, 2),
R1255 = runif(6, 2, 3),
R1235 = runif(6, 3, 4))
# install.packages(c("tidyverse"), dependencies = TRUE)
require(tidyverse)
(tbl <- dta %>% as_tibble())
#> # A tibble: 6 x 5
#> id R1213 R1224 R1255 R1235
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.488 1.60 2.07 3.07
#> 2 2 0.692 1.42 2.76 3.19
#> 3 3 0.262 1.34 2.33 3.82
#> 4 4 0.330 1.77 2.61 3.93
#> 5 5 0.582 1.92 2.15 3.86
#> 6 6 0.930 1.88 2.56 3.59
Now, I use str_replace_all() to rename them, here with only one variable where I use paste0, and everything is fine (it might also be possible to optimize this in other ways; if so, please feel free to let me know),
names(tbl) <- tbl %>% names() %>%
str_replace_all('^R1.[125].$', 'A') %>%
str_replace_all('^R1.[3].$', paste0('A.2018.', 1))
tbl
#> # A tibble: 6 x 5
#> id A A A A.2018.1
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.488 1.60 2.07 3.07
#> 2 2 0.692 1.42 2.76 3.19
#> 3 3 0.262 1.34 2.33 3.82
#> 4 4 0.330 1.77 2.61 3.93
#> 5 5 0.582 1.92 2.15 3.86
#> 6 6 0.930 1.88 2.56 3.59
Everything called A is actually from the same year, let's say 2017, but the suffixes .1, .2, etc. need to be appended. I start over and again use paste0('A.2017.', 1:3), but this time with three suffixes,
tbl <- dta %>% as_tibble()
names(tbl) <- tbl %>% names() %>%
str_replace_all('^R1.[125].$', paste0('A.2017.', 1:3)) %>%
str_replace_all('^R1.[7].$', paste0('A.2018.', 1))
tbl
#> Warning message:
#> In stri_replace_all_regex(string, pattern, fix_replacement(replacement), :
#> longer object length is not a multiple of shorter object length
#> > tbl
#> # A tibble: 6 x 5
#> id A.2017.2 A.2017.3 A.2017.1 R1235
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.488 1.60 2.07 3.07
#> 2 2 0.692 1.42 2.76 3.19
#> 3 3 0.262 1.34 2.33 3.82
#> 4 4 0.330 1.77 2.61 3.93
#> 5 5 0.582 1.92 2.15 3.86
#> 6 6 0.930 1.88 2.56 3.59
This does produce output, but the order is shuffled and I am told that the longer object length is not a multiple of the shorter object length. Isn't 3 the right length? I am looking to do this in a cleaner and simpler way. Also, I don't really like names(tbl) <-, so I'd welcome a more elegant alternative.
Building on David's suggestion - how about something like the following using dplyr::rename_at?
library(dplyr)
## Get data
set.seed(667)
dta <- data.frame(id = 1:6,
R1213 = runif(6),
R1224 = runif(6, 1, 2),
R1255 = runif(6, 2, 3),
R1235 = runif(6, 3, 4)) %>%
as_tibble()
## Rename
dta <- dta %>%
rename_at(.vars = grep('^R1.[125].$', names(.)),
.funs = ~paste0("A.2017.", 1:length(.)))
dta
#> # A tibble: 6 x 5
#> id A.2017.1 A.2017.2 A.2017.3 R1235
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.196 1.74 2.51 3.49
#> 2 2 0.478 1.85 2.06 3.69
#> 3 3 0.780 1.32 2.21 3.26
#> 4 4 0.705 1.49 2.49 3.33
#> 5 5 0.942 1.59 2.66 3.58
#> 6 6 0.906 1.90 2.87 3.93
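The warning and the shuffled suffixes in the question appear to come from recycling: str_replace_all() is vectorised over its replacement argument as well as over the strings, so the three replacements are recycled along all five column names and each matching name receives the replacement at its own position. The base-R sketch below reproduces that pairing with rep_len() and then renames only the matching columns; the names nms, repl, paired and hit are illustrative.

```r
nms <- c("id", "R1213", "R1224", "R1255", "R1235")
repl <- paste0("A.2017.", 1:3)

# How three replacements line up against five names when recycled
paired <- data.frame(name = nms, replacement = rep_len(repl, length(nms)))
paired

# Rename only the columns that match, numbered in column order
hit <- grepl("^R1.[125].$", nms)
nms[hit] <- paste0("A.2017.", seq_len(sum(hit)))
nms
```

In paired, R1213, R1224 and R1255 sit next to the suffixes .2, .3 and .1, which is the order seen in the question's output; numbering with seq_len(sum(hit)) avoids the mismatch because the replacement vector is built to the length of the matches.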
Vectorised solution for multiple patterns
For a complete solution that can be used for multiple patterns and replacements, we can make use of purrr::map2_dfc as follows.
library(dplyr)
library(purrr)
## Get data
set.seed(667)
dta <- data.frame(id = 1:6,
R1213 = runif(6),
R1224 = runif(6, 1, 2),
R1255 = runif(6, 2, 3),
R1235 = runif(6, 3, 4)) %>%
as_tibble()
## Define a function to keep a hold out data set, then rename iteratively for each pattern and replacement.
rename_multiple_years <- function(df, patterns,
replacements,
hold_out_var = "id") {
hold_out_df <- df %>%
select_at(.vars = hold_out_var)
rename_df <- map2_dfc(patterns, replacements, function(pattern, replacement) {
df %>%
rename_at(.vars = grep(pattern, names(.)),
.funs = ~paste0(replacement, 1:length(.))) %>%
select_at(.vars = grep(replacement, names(.)))
})
final_df <- bind_cols(hold_out_df, rename_df)
return(final_df)
}
## Call function on specified patterns and replacements
renamed_dta <- dta %>%
rename_multiple_years(patterns = c("^R1.[125].$", "^R1.[3].$"),
replacements = c("A.2017.", "A.2018."))
renamed_dta
#> # A tibble: 6 x 5
#> id A.2017.1 A.2017.2 A.2017.3 A.2018.1
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.196 1.74 2.51 3.49
#> 2 2 0.478 1.85 2.06 3.69
#> 3 3 0.780 1.32 2.21 3.26
#> 4 4 0.705 1.49 2.49 3.33
#> 5 5 0.942 1.59 2.66 3.58
#> 6 6 0.906 1.90 2.87 3.93
Towards tidy data
Now that the variables have been renamed you might find it useful to have your data in a tidy format. The following using tidyr::gather might be useful.
library(tidyr)
library(dplyr)
# Gather all measurement columns, split their names on "." and drop the A column (keep it if it is a measurement id)
renamed_dta %>%
gather(key = "measure", value = "value", -id) %>%
separate(measure, c("A", "year", "measure"), sep = "\\.") %>%
select(-A)
#> # A tibble: 24 x 4
#> id year measure value
#> <int> <chr> <chr> <dbl>
#> 1 1 2017 1 0.196
#> 2 2 2017 1 0.478
#> 3 3 2017 1 0.780
#> 4 4 2017 1 0.705
#> 5 5 2017 1 0.942
#> 6 6 2017 1 0.906
#> 7 1 2017 2 1.74
#> 8 2 2017 2 1.85
#> 9 3 2017 2 1.32
#> 10 4 2017 2 1.49
#> # ... with 14 more rows
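In current tidyr, gather() is superseded by pivot_longer(), which can also split the column names in the same step. A minimal sketch on a small made-up table (renamed is a stand-in for renamed_dta, with invented values) might look like this:

```r
library(tidyr)

# Toy stand-in for renamed_dta; the values are invented for illustration
renamed <- data.frame(id = 1:2,
                      A.2017.1 = c(0.1, 0.2),
                      A.2017.2 = c(1.1, 1.2),
                      A.2018.1 = c(3.1, 3.2))

# names_pattern captures year and measure from names such as "A.2017.1"
long <- pivot_longer(renamed,
                     cols = -id,
                     names_to = c("year", "measure"),
                     names_pattern = "^A\\.(\\d{4})\\.(\\d+)$",
                     values_to = "value")
long
```

Unlike gather(), the rows come out ordered by id first, and year and measure are character columns unless converted afterwards.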