tidyselect::where() inconsistencies: where is where()?

Summary: You can do rename(A = 1, B = 2); can you do the same using rename_with()? My ~str_replace(..., paste0()) formula works, so I don't need to change that, but it only handles one pattern at a time. tidyselect suggests wrapping the formula in where(~str_replace(...)), but then it complains that it can't find where(), even though I can get where() to work in other instances.
I want to use rename_with() for more than one variable, but I get an error: Error: Formula shorthand must be wrapped in `where()`.
# Bad
data %>% select(~str_replace(., "Var_2_", paste0("Issue: Time")))
# Good
data %>% select(where(~str_replace(., "Var_2_", paste0("Issue: time"))))
Example original:
test %>% rename_with(~str_replace(., "Var_2_", paste0("Issue: Time")), ~str_replace(., "Var_3_", paste0("Issue: Time")))
when I run
test %>% rename_with(where(~str_replace(., "Var_2_", paste0("Issue: Time")), ~str_replace(., "Var_3_", paste0("Issue: Time"))))
and
test %>% rename_with(where(~str_replace(., "Var_2_", paste0("Issue: Time"))), where(~str_replace(., "Var_3_", paste0("Issue: Time"))))
I get
Error in where(~str_replace(., "Var_1_", paste0("Gov't surveillance: video wave")), : could not find function "where"
And I can't find it tabbing through tidyselect::
But I can run
test%>% select(where(is.numeric)) %>% map(sd, na.rm = TRUE)
without any issue so it does exist. What am I doing wrong?
Example data:
x <- c("_1_1",
"_1_2",
"_1_3",
"_2_1",
"_2_2",
"_2_3",
"_3_1",
"_3_2",
"_3_3",
"_4_3")
paste0("Var",x)
test <- t(as_tibble(rnorm(10, 5.5, .35)))
colnames(test) <- paste0("Var",x)

The argument order in rename_with() is the reverse of rename_at(): the renaming function comes first and the column selection second. It is also a bit unclear how the column names in the posted code relate to the data shown, especially with str_replace appearing in both arguments. A typical use, replacing the 'Var_2' prefix of the matching column names with 'Issue: Time', would be
library(dplyr)
data <- data %>%
rename_with(~ str_replace(., 'Var_2', 'Issue: Time'),
starts_with('Var_2'))
-output
data
# A tibble: 1 x 10
# Var_1_1 Var_1_2 Var_1_3 `Issue: Time_1` `Issue: Time_2` `Issue: Time_3` Var_3_1 Var_3_2 Var_3_3 Var_4_3
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 5.68 5.18 5.34 5.38 5.47 5.82 5.93 5.35 5.20 5.62
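For comparison, the superseded rename_at() takes the column selection first and the function second. A rough equivalent of the call above (assuming dplyr >= 0.8, where purrr-style formulas are accepted in .funs) is
data %>%
rename_at(vars(starts_with('Var_2')),
~ str_replace(., 'Var_2', 'Issue: Time'))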
If we need to change multiple column patterns, use matches
data %>%
rename_with(~ str_replace(., '(Var_2|Var_3)', '\\1_Issue: Time'),
matches('Var_2|Var_3'))
# A tibble: 1 x 10
# Var_1_1 Var_1_2 Var_1_3 `Var_2_Issue: Tim… `Var_2_Issue: Tim… `Var_2_Issue: Tim… `Var_3_Issue: Ti… `Var_3_Issue: Ti… `Var_3_Issue: Ti… Var_4_3
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 5.68 5.18 5.34 5.38 5.47 5.82 5.93 5.35 5.20 5.62
Or, if each pattern should get its own replacement, use str_replace_all with a named vector of pattern = replacement pairs
data1 <- data %>%
set_names(str_replace_all(names(.), c("Var_1" = "Issue 1 wave", "Var_2" = "Issue 2 Wave")))
compare the output
data1
# A tibble: 1 x 10
`Issue 1 wave_1` `Issue 1 wave_2` `Issue 1 wave_3` `Issue 2 Wave_1` `Issue 2 Wave_2` `Issue 2 Wave_3` Var_3_1 Var_3_2 Var_3_3 Var_4_3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5.68 5.18 5.34 5.38 5.47 5.82 5.93 5.35 5.20 5.62
with original data
data
# A tibble: 1 x 10
Var_1_1 Var_1_2 Var_1_3 Var_2_1 Var_2_2 Var_2_3 Var_3_1 Var_3_2 Var_3_3 Var_4_3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5.68 5.18 5.34 5.38 5.47 5.82 5.93 5.35 5.20 5.62
where() is generally used to test the column values, not the column names: e.g. to select columns that are numeric, use select(where(is.numeric)). To pick columns by name, use the select helpers instead, i.e. starts_with, ends_with, contains, or pass a regex pattern to matches. Note also that where() is recognised inside a tidyselect context such as select() or the .cols argument of rename_with(); calling it as an ordinary function, e.g. in the .fn position of rename_with(), can fail with "could not find function "where"", which is likely what happened above. A use case of where() would be
data %>%
rename_with(~ str_replace(., 'Var_2', 'Issue: Time'), where(~ all(. > 5.5)))
# A tibble: 1 x 10
# Var_1_1 Var_1_2 Var_1_3 Var_2_1 Var_2_2 `Issue: Time_3` Var_3_1 Var_3_2 Var_3_3 Var_4_3
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 5.68 5.18 5.34 5.38 5.47 5.82 5.93 5.35 5.20 5.62
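where() can also be combined with the name-based helpers inside a single selection, e.g. renaming only the columns whose names match a pattern and whose values satisfy a condition (a sketch, assuming dplyr >= 1.0.0):
data %>%
rename_with(~ str_replace(., 'Var_2', 'Issue: Time'),
matches('Var_2') & where(~ all(. > 5.5)))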
In the OP's code, the select/map step can be replaced with summarise/across
data %>%
summarise(across(where(is.numeric), ~ sd(., na.rm = TRUE)))
data
data <- as_tibble(test)

Related

In function, receive "Internal error in `df_slice()`: Columns must match the data frame size."

I'm trying to run filter() on the cur_data() of (potentially) grouped data
The following works fine:
lookAhead = 2
colnm = sym(glue("maxCloseGainPctNext{lookAhead}"))
p = dailyDataFinal %>%
summarise( xxx=nrow(filter(cur_data(), {{colnm}}>0)) )
But when I add:
p = dailyDataFinal %>%
summarise(n = n(),
xxx = nrow(filter(cur_data(), {{colnm}}>0))
)
I get:
Error: Problem with `summarise()` column `nPos(2)`.
i `xxx = nrow(filter(cur_data(), maxCloseGainPctNext2 > 0))`.
x Internal error in `df_slice()`: Columns must match the data frame size.
In fact, either of the summarise lines is fine by itself; it's just the combination that breaks, even though the output from each is a 1x1 tibble.
I'm at a total loss to understand what that message means.
Input data is a basic tibble:
> dailyDataFinal
# A tibble: 10,003 x 30
date gspc.adjusted gspc.close gspc.high gspc.low gspc.open gspc.volume gspc.DailyGainPct maxCloseGainPctNext2
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1982-04-20 115. 115. 117. 115. 116. 54610000 -1.08 1.52
2 1982-04-21 116. 116. 116. 115. 115. 57820000 0.243 2.52
3 1982-04-22 117. 117. 117. 116. 116. 64470000 1.27 1.77
4 1982-04-23 119. 119. 119. 117. 118. 71840000 1.24 0.523
5 1982-04-26 119. 119. 119. 118. 119. 60500000 0.523 -1.06
6 1982-04-27 118 118 119. 118. 119. 56480000 -1.06 -0.627
7 1982-04-28 117. 117. 118. 117. 118. 50530000 -0.627 -0.699
8 1982-04-29 116. 116. 117. 116. 116. 51330000 -0.955 0.586
9 1982-04-30 116. 116. 117. 116. 116. 48200000 0.258 0.876
10 1982-05-03 117. 117. 117. 116. 116. 46490000 0.326 0.728
# ... with 9,993 more rows, and 21 more variables: maxHighGainPctNext2 <dbl>, minCloseGainPctNext2 <dbl>,
# minLowGainPctNext2 <dbl>, maxCloseGainPctNext5 <dbl>, maxHighGainPctNext5 <dbl>, minCloseGainPctNext5 <dbl>,
# minLowGainPctNext5 <dbl>, maxCloseGainPctNext10 <dbl>, maxHighGainPctNext10 <dbl>, minCloseGainPctNext10 <dbl>,
# minLowGainPctNext10 <dbl>, maxCloseGainPctNext20 <dbl>, maxHighGainPctNext20 <dbl>, minCloseGainPctNext20 <dbl>,
# minLowGainPctNext20 <dbl>, range <dbl>, openProp <dbl>, closeProp <dbl>, openLevel <fct>, closeLevel <fct>,
# candleType <glue>
It is easier to help if you produce a small but reproducible example for us to test the issue. Based on your description I have created a similar example and the code works for me.
Code outside the function.
library(tidyverse)
library(glue)
lookAhead = 2
colnm = sym(glue("abc{lookAhead}"))
set.seed(123)
df <- data.frame(abc1 = rnorm(5), abc2 = rnorm(5))
df %>%
summarise(xxx=nrow(filter(cur_data(), {{colnm}}>0)))
# xxx
#1 2
Code inside the function.
fsumm = function(data, lookAhead) {
colnm = sym(glue("abc{lookAhead}") )
data %>%
drop_na({{colnm}} ) %>%
summarise("nPos({{lookAhead}})" := nrow(filter(cur_data(), {{colnm}}>0)),
)
}
fsumm(df, 2)
# nPos(2)
#1 2
For the updated question, using n := n() is not correct since n is not a variable. If you put n = n() at the end of summarise, it fixes the error.
fsumm = function(data, lookAhead) {
colnm = sym(glue("abc{lookAhead}") )
data %>%
drop_na({{colnm}} ) %>%
summarise(
"nPos({{lookAhead}})" := nrow(filter(cur_data(), {{colnm}}>0)),
n = n()
)
}
fsumm(df, 2)
# nPos(2) n
#1 2 5
Also, I would use sum to count the entries that satisfy a condition instead of using filter and nrow. With that approach there is no error.
fsumm = function(data, lookAhead) {
colnm = sym(glue("abc{lookAhead}") )
data %>%
drop_na({{colnm}} ) %>%
summarise(n = n(),
"nPos({{lookAhead}})" := sum({{colnm}}>0)
)
}
fsumm(df, 2)
# n nPos(2)
#1 5 2
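A variant that avoids cur_data() and filter() altogether is to keep the column name as a plain string and use all_of() and the .data pronoun; a sketch along the same lines (the name fsumm2 is just for illustration):
library(dplyr)
library(tidyr)
library(glue)
fsumm2 = function(data, lookAhead) {
colnm = glue("abc{lookAhead}")
data %>%
drop_na(all_of(colnm)) %>%
summarise(n = n(),
"nPos({lookAhead})" := sum(.data[[colnm]] > 0))
}
fsumm2(df, 2)
# n nPos(2)
#1 5 2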

reshaping rows of data to two columns

We have data on school districts where the columns are the local-specific information (e.g., free and reduced price lunch %) and the corresponding statewide values.
dat <- tribble(
~state.poverty, ~state.EL, ~state.disability, ~state.frpl, ~local.poverty, ~local.frpl, ~local.disability, ~local.EL,
12.50592, 0.08342419, 0.12321831, 0.4495395, 25.23731, 0.6415712, 0.140739, 0.1469898)
dat
# A tibble: 1 x 8
state.poverty state.EL state.disability state.frpl local.poverty local.frpl local.disability local.EL
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12.5 0.0834 0.123 0.450 25.2 0.642 0.141 0.147
We want to reshape that so that it looks like this.
demog state local
<chr> <dbl> <dbl>
1 poverty 12.5 25.2
2 EL 0.0834 0.147
3 disability 0.123 0.141
4 frpl 0.450 0.642
It seems like something that pivot_longer should be able to handle, but I haven't had much success so far. Any suggestions?
We can use pivot_longer. The ".value" sentinel in names_to keeps the part of each column name before the separator (state, local) as the names of the output value columns, while the part after the separator goes into the demog column
library(dplyr)
library(tidyr)
dat %>%
pivot_longer(cols = everything(),
names_to = c(".value", "demog"), names_sep = "\\.")
-output
# A tibble: 4 x 3
# demog state local
# <chr> <dbl> <dbl>
#1 poverty 12.5 25.2
#2 EL 0.0834 0.147
#3 disability 0.123 0.141
#4 frpl 0.450 0.642
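The same reshape can also be expressed with names_pattern instead of names_sep, which is handy when the separator is less regular (a sketch using the same data):
dat %>%
pivot_longer(cols = everything(),
names_to = c(".value", "demog"), names_pattern = "(.*)\\.(.*)")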
A base R option using reshape
reshape(
dat,
direction = "long",
varying = 1:ncol(dat)
)
gives
# A tibble: 4 x 4
time state local id
<chr> <dbl> <dbl> <int>
1 poverty 12.5 25.2 1
2 EL 0.0834 0.642 1
3 disability 0.123 0.141 1
4 frpl 0.450 0.147 1

A quick way to rename multiple columns with unique names using dplyr

I am a beginner R user, currently learning the tidyverse way. I imported a dataset which is a time series of monthly indexed consumer prices over a period of four years. The imported headings of the monthly CPI columns display in R as five-digit numbers (as characters). Here is a short mock-up recreation of what it looks like...
df <- tibble(`Product` = c("Eggs", "Chicken"),
`44213` = c(35.77, 36.77),
`44244` = c(39.19, 39.80),
`44272` = c(40.12, 43.42),
`44303` = c(41.09, 41.33)
)
# A tibble: 2 x 5
# Product `44213` `44244` `44272` `44303`
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 Eggs 35.8 39.2 40.1 41.1
#2 Chicken 36.8 39.8 43.4 41.3
I want to change the column headings (44213 etc) to dates that make more sense to me (still as characters). I understand, using dplyr, to do it the following way:
df <- df %>% rename("Jan17" = `44213`, "Feb17" = `44244`,
"Mar17" = `44272`, "Apr17" = `44303`)
# A tibble: 2 x 5
# Product Jan17 Feb17 Mar17 Apr17
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 Eggs 35.8 39.2 40.1 41.1
#2 Chicken 36.8 39.8 43.4 41.3
The problem is that my actual dataset contains 48 such columns (months) to rename, so this is a lot of work to type out. I looked at other replace and set_names functions, but these seem to apply the same repeated change to every column name rather than give each column the new unique name I am looking for.
(I realise dates as columns is not good practice and would need to shift these to rows before proceeding with any analysis... or maybe this must be a prior step to renaming?)
Trust I expressed my question sufficiently. Would love to learn a quicker solution using dplyr or be directed to where one can be found. Thank you for your time.
We can use !!! with rename by passing a named vector
library(dplyr)
library(stringr)
df1 <- df %>%
rename(!!! setNames(names(df)[-1], str_c(month.abb[1:4], 17)))
-output
df1
# A tibble: 2 x 5
# Product Jan17 Feb17 Mar17 Apr17
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 Eggs 35.8 39.2 40.1 41.1
#2 Chicken 36.8 39.8 43.4 41.3
Or use rename_with
df %>%
rename_with(~str_c(month.abb[1:4], 17), -1)
If the column names should be converted to Date-based labels
nm1 <- format(as.Date(as.numeric(names(df)[-1]), origin = '1896-01-01'), '%b%y')
df %>%
rename_with(~ nm1, -1)
# A tibble: 2 x 5
# Product Jan17 Feb17 Mar17 Apr17
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 Eggs 35.8 39.2 40.1 41.1
#2 Chicken 36.8 39.8 43.4 41.3
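For the full dataset with 48 monthly columns, the replacement vector can be built programmatically instead of typed out. A sketch, assuming the columns are in calendar order and run from Jan17 to Dec20:
library(stringr)
month_labels <- str_c(rep(month.abb, times = 4), rep(17:20, each = 12))
df %>%
rename_with(~ month_labels[seq_along(.)], -1)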
Using arbitrary but sequential names:
names(df)[2:ncol(df)] <- paste0('col_', 1:(ncol(df) - 1))
## A tibble: 2 x 5
# Product col_1 col_2 col_3 col_4
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 Eggs 35.8 39.2 40.1 41.1
#2 Chicken 36.8 39.8 43.4 41.3

Summarizing by group of two rows

I have a data frame that I want to group by two variables, and then summarize the total and average.
I tried this on my data, which is correct.
df %>%
group_by(date, group) %>%
summarise(
weight = sum(ind_weigh) ,
total_usage = sum(total_usage_min) ,
Avg_usage = total_usage / weight) %>%
ungroup()
It returns this data frame:
df <- tibble::tribble(
~date, ~group, ~weight, ~total_usage, ~Avg_usage,
20190201, 0, 450762, 67184943, 149,
20190201, 1, 2788303, 385115718, 138,
20190202, 0, 483959, 60677765, 125,
20190202, 1, 2413699, 311226351, 129,
20190203, 0, 471189, 59921762, 127,
20190203, 1, 2143811, 277425186, 129,
20190204, 0, 531020, 83695977, 158,
20190204, 1, 2640087, 403200829, 153
)
I am wondering how I can add another variable in my script to also get avg_usage_total (across both group 0 and group 1).
Expected result:
e.g., first row --> 67184943 / (450762 + 2788303) = 20.7
date group weight total_usage Avg_usage Avg_usage_total
20190201 0 450762 67184943 149 20.7
20190201 1 2788303 385115718 138 118.9
You can do that using mutate and group_by if necessary.
library(tidyverse)
# generate dataset
(df <- tibble(
date = c(rep(Sys.Date(), 10), rep(Sys.Date() - 1, 10)),
group = rbinom(20, 1, 0.5),
rech = runif(20),
weight = runif(20),
total_usage = runif(20)
))
# A tibble: 20 x 5
date group rech weight total_usage
<date> <int> <dbl> <dbl> <dbl>
1 2019-03-10 0 0.985 0.831 0.963
2 2019-03-10 1 0.178 0.990 0.676
3 2019-03-10 1 0.505 0.697 0.152
4 2019-03-10 1 0.416 0.165 0.824
5 2019-03-10 0 0.554 0.790 0.974
# step 1 of analysis
(df <- df %>%
group_by(date, group) %>%
summarise(rech = sum(rech),
weight = sum(weight),
total_usage = sum(total_usage)) %>%
mutate(Avg_usage = total_usage / weight))
# A tibble: 4 x 6
# Groups: date [2]
date group rech weight total_usage Avg_usage
<date> <int> <dbl> <dbl> <dbl> <dbl>
1 2019-03-09 0 3.29 4.82 3.03 0.628
2 2019-03-09 1 1.45 1.22 1.16 0.954
3 2019-03-10 0 1.54 1.62 1.94 1.20
4 2019-03-10 1 3.15 4.55 4.63 1.02
# step 2 of analysis
df %>%
group_by(date) %>% # only necessary if you want to compute Avg_usage_total by date
mutate(Avg_usage_total = total_usage / sum(rech)) %>% # total_usage is the row value; sum(rech) is taken over all rows in the (date) group
ungroup()
# A tibble: 4 x 7
date group rech weight total_usage Avg_usage Avg_usage_total
<date> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019-03-09 0 3.29 4.82 3.03 0.628 0.639
2 2019-03-09 1 1.45 1.22 1.16 0.954 0.246
3 2019-03-10 0 1.54 1.62 1.94 1.20 0.413
4 2019-03-10 1 3.15 4.55 4.63 1.02 0.986
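Starting again from the raw df generated above, both steps can also be chained in a single pipe; a sketch assuming dplyr >= 1.0.0, where .groups = "drop_last" keeps the result grouped by date so that sum(rech) in the mutate is the per-date total:
df %>%
group_by(date, group) %>%
summarise(rech = sum(rech),
weight = sum(weight),
total_usage = sum(total_usage),
.groups = "drop_last") %>%
mutate(Avg_usage = total_usage / weight,
Avg_usage_total = total_usage / sum(rech)) %>%
ungroup()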

Conditional replacement of column name in tibble using dplyr

I have the following tibble:
df <- structure(list(gene_symbol = c("0610005C13Rik", "0610007P14Rik",
"0610009B22Rik", "0610009L18Rik", "0610009O20Rik", "0610010B08Rik"
), foo.control.cv = c(1.16204038288333, 0.120508045270669, 0.205712615954009,
0.504508040948641, 0.333956330117591, 0.543693011377001), foo.control.mean = c(2.66407458486012,
187.137728870855, 142.111269303428, 16.7278587043453, 69.8602872478098,
4.77769028710622), foo.treated.cv = c(0.905769898934564, 0.186441944401973,
0.158552512842753, 0.551955061149896, 0.15743983656006, 0.290447431974039
), foo.treated.mean = c(2.40658723367692, 180.846795140269, 139.054032348287,
11.8584348984435, 76.8141734599118, 2.24088124240385)), .Names = c("gene_symbol",
"foo.control.cv", "foo.control.mean", "foo.treated.cv", "foo.treated.mean"
), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
6L))
Which looks like this:
# A tibble: 6 × 5
gene_symbol foo.control.cv foo.control.mean foo.treated.cv foo.treated.mean
* <chr> <dbl> <dbl> <dbl> <dbl>
1 0610005C13Rik 1.1620404 2.664075 0.9057699 2.406587
2 0610007P14Rik 0.1205080 187.137729 0.1864419 180.846795
3 0610009B22Rik 0.2057126 142.111269 0.1585525 139.054032
4 0610009L18Rik 0.5045080 16.727859 0.5519551 11.858435
5 0610009O20Rik 0.3339563 69.860287 0.1574398 76.814173
6 0610010B08Rik 0.5436930 4.777690 0.2904474 2.240881
What I want to do is to replace all column names with mean in it into mean_expr. Resulting in
gene_symbol foo.control.cv foo.control.mean_expr foo.treated.cv foo.treated.mean_expr
1 0610005C13Rik 1.1620404 2.664075 0.9057699 2.406587
2 0610007P14Rik 0.1205080 187.137729 0.1864419 180.846795
3 0610009B22Rik 0.2057126 142.111269 0.1585525 139.054032
4 0610009L18Rik 0.5045080 16.727859 0.5519551 11.858435
5 0610009O20Rik 0.3339563 69.860287 0.1574398 76.814173
6 0610010B08Rik 0.5436930 4.777690 0.2904474 2.240881
How can I achieve that?
With current versions of dplyr, you can use rename_at:
library(dplyr)
df %>% rename_at(vars(contains('mean')), funs(sub('mean', 'mean_expr', .)))
#> # A tibble: 6 × 5
#> gene_symbol foo.control.cv foo.control.mean_expr foo.treated.cv
#> * <chr> <dbl> <dbl> <dbl>
#> 1 0610005C13Rik 1.1620404 2.664075 0.9057699
#> 2 0610007P14Rik 0.1205080 187.137729 0.1864419
#> 3 0610009B22Rik 0.2057126 142.111269 0.1585525
#> 4 0610009L18Rik 0.5045080 16.727859 0.5519551
#> 5 0610009O20Rik 0.3339563 69.860287 0.1574398
#> 6 0610010B08Rik 0.5436930 4.777690 0.2904474
#> # ... with 1 more variables: foo.treated.mean_expr <dbl>
Really, you could use rename_all as well, since names that don't match are left unaffected anyway. Further, .funs accepts a quosure-style lambda or anything that can be coerced to a function by rlang::as_function, so you can use purrr-style notation:
df %>% rename_all(~sub('mean', 'mean_expr', .x))
Since a data frame is a list, purrr's set_names can do the same thing:
library(purrr) # or library(tidyverse)
df %>% set_names(~sub('mean', 'mean_expr', .x))
All return the same thing.
Another option is to paste in rename_at (using the devel version of dplyr)
library(dplyr)
df %>%
rename_at(vars(matches('mean')), funs(sprintf('%s_expr', .)))
# A tibble: 6 × 5
# gene_symbol foo.control.cv foo.control.mean_expr foo.treated.cv foo.treated.mean_expr
#* <chr> <dbl> <dbl> <dbl> <dbl>
#1 0610005C13Rik 1.1620404 2.664075 0.9057699 2.406587
#2 0610007P14Rik 0.1205080 187.137729 0.1864419 180.846795
#3 0610009B22Rik 0.2057126 142.111269 0.1585525 139.054032
#4 0610009L18Rik 0.5045080 16.727859 0.5519551 11.858435
#5 0610009O20Rik 0.3339563 69.860287 0.1574398 76.814173
#6 0610010B08Rik 0.5436930 4.777690 0.2904474 2.240881
Or using rename_if
df %>%
rename_if(grepl("mean", names(.)), funs(sprintf("%s_expr", .)))
Here is a non-dplyr base R method:
names(df) <- sub("mean$", "mean_expr", names(df))
# or names(df) <- sub("mean", "mean_expr", names(df)) if the mean doesn't have to be at the
# end of the string
names(df)
#[1] "gene_symbol" "foo.control.cv" "foo.control.mean_expr"
#[4] "foo.treated.cv" "foo.treated.mean_expr"
If you want it to be a part of the pipe, you can make use of setNames function:
df %>% setNames(sub("mean", "mean_expr", names(.))) %>% names(.)
#[1] "gene_symbol" "foo.control.cv" "foo.control.mean_expr"
#[4] "foo.treated.cv" "foo.treated.mean_expr"
Another option is dplyr::select_all():
df %>% select_all(~gsub("mean", "mean_expr", .))
And with the use of magrittr's %<>% operator you can do
library(magrittr)
names(df)[df %>% names %>% grep(pattern = "mean")] %<>% paste0("_expr")
df
# A tibble: 6 x 5
gene_symbol foo.control.cv foo.control.mean_expr foo.treated.cv foo.treated.mean_expr
* <chr> <dbl> <dbl> <dbl> <dbl>
1 0610005C13Rik 1.16 2.66 0.906 2.41
2 0610007P14Rik 0.121 187. 0.186 181.
3 0610009B22Rik 0.206 142. 0.159 139.
4 0610009L18Rik 0.505 16.7 0.552 11.9
5 0610009O20Rik 0.334 69.9 0.157 76.8
6 0610010B08Rik 0.544 4.78 0.290 2.24
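In current dplyr (>= 1.0.0), the rename_at/rename_if/rename_all family is superseded by rename_with(), so the same rename can be written as, for example:
library(dplyr)
df %>% rename_with(~ sub("mean", "mean_expr", .x), contains("mean"))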
