Select columns using starts_with() - r

With select(starts_with("A") I can select all the columns in a dataframe/tibble starting with "A".
But how can I select all the columns in a dataframe/tibble starting with one of the letters in a vector?
Example:
columns_to_select <- c("A", "B", "C")
df %>% select(starts_with(columns_to_select))
I would like to select A1, A2, A3... and B1, B2, B3, ... and C1, C2, Cxy...

This currently seems to be working the way you're describing:
library(tidyverse)
df <- tibble(A1 = 1:10, B1 = 1:10, C3 = 21:30, D2 = 11:20)
columns_to_select <- c("A", "B", "C")
df |>
select(starts_with(columns_to_select))
#> # A tibble: 10 × 3
#> A1 B1 C3
#> <int> <int> <int>
#> 1 1 1 21
#> 2 2 2 22
#> 3 3 3 23
#> 4 4 4 24
#> 5 5 5 25
#> 6 6 6 26
#> 7 7 7 27
#> 8 8 8 28
#> 9 9 9 29
#> 10 10 10 30
Do you mean to select only by one of the letters at a time? (you can use columns_to_select[1] for this) Apologies if I've misunderstood the question - can delete this response if not relevant.

Related

How to expand rows and fill in the numbers between given start and end

I have this data frame:
df <- tibble(x = c(1, 10))
x
<dbl>
1 1
2 10
I want this:
x
<int>
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
Unfortunately I can't remember how I have to approach. I tried expand.grid, uncount, runner::fill_run.
Update: The real world data ist like this with groups and given start and end number. Here are only two groups:
df <- tibble(group = c("A", "A", "B", "B"),
x = c(10,30, 1, 10))
group x
<chr> <dbl>
1 A 10
2 A 30
3 B 1
4 B 10
We may need full_seq with either summarise or reframe or tidyr::complete
library(dplyr)
df %>%
group_by(group) %>%
reframe(x = full_seq(x, period = 1))
# or with
#tidyr::complete(x = full_seq(x, period = 1))
-output
# A tibble: 31 × 2
group x
<chr> <dbl>
1 A 10
2 A 11
3 A 12
4 A 13
5 A 14
6 A 15
7 A 16
8 A 17
9 A 18
10 A 19
# … with 21 more rows
A simple base R variation:
group <- c(rep("A", 21), rep("B ", 10))
x <- c(10:30, 1:10)
df <- tibble(group, x)
df
# A tibble: 31 × 2
group x
<chr> <int>
1 A 10
2 A 11
3 A 12
4 A 13
5 A 14
6 A 15
And here's an expand.grid solution:
g1 <- expand.grid(group = "A", x = 20:30)
g2 <- expand.grid(group = "B", x = 1:10)
df <- rbind(g1, g2)
df
group x
1 A 20
2 A 21
3 A 22
4 A 23
5 A 24
6 A 25
7 A 26
Using base:
stack(sapply(split(df$x, df$group), function(i) seq(i[ 1 ], i[ 2 ])))

Define groups of columns and sum all i-th columns of each groups with dplyr

I have two groups of columns, each with 36 columns, and I want to sum all i-th column of group 1 with i-th column of group2, getting 36 columns. The number of columns in each group is not fix in my code, although each group has the same number of them.
Exemple. What I have:
teste <- tibble(a1=c(1,2,3),a2=c(7,8,9),b1=c(4,5,6),b2=c(10,20,30))
a1 a2 b1 b2
<dbl> <dbl> <dbl> <dbl>
1 1 7 4 10
2 2 8 5 20
3 3 9 6 30
What I want:
resultado <- teste %>%
summarise(
a_b1 = a1+b1,
a_b2 = a2+b2
)
a_b1 a_b2
<dbl> <dbl>
1 5 17
2 7 28
3 9 39
It would be nice to perform this operation with dplyr.
I would thank any help.
You will struggle to find a dplyr solution as simple and elegant as the base R one:
teste[1:2] + teste[3:4]
#> a1 a2
#> 1 5 17
#> 2 7 28
#> 3 9 39
Though I guess in dplyr you get the same result with:
teste %>% select(starts_with("a")) + teste %>% select(starts_with("b"))
teste %>%
summarise(across(starts_with("a")) + across(starts_with("b")))
# A tibble: 3 x 2
a1 a2
<dbl> <dbl>
1 5 17
2 7 28
3 9 39
This might also help in base R:
as.data.frame(do.call(cbind, lapply(split.default(teste, sub("\\D(\\d+)", "\\1", names(teste))), rowSums, na.rm = TRUE)))
1 2
1 5 17
2 7 28
3 9 39
Another dplyr solution. We can use rowwise and c_across together to sum the values per row. Notice that we can add na.rm = TRUE to the sum function in this case.
library(dplyr)
teste2 <- teste %>%
rowwise() %>%
transmute(a_b1 = sum(c_across(ends_with("1")), na.rm = TRUE),
a_b2 = sum(c_across(ends_with("2")), na.rm = TRUE)) %>%
ungroup()
teste2
# # A tibble: 3 x 2
# a_b1 a_b2
# <dbl> <dbl>
# 1 5 17
# 2 7 28
# 3 9 39

Self defined function with dplyr functions won't accept argument values [duplicate]

This question already has an answer here:
dplyr with name of columns in a function
(1 answer)
Closed 4 years ago.
I'm trying to use dplyr's mutate_at to subtract a numeric column's value (A1) from another corresponding numeric column (A2), I have multiple columns and several data frames I want to do for this for (BCDE..., df1:df99) so I want to write a function.
df1 <- df1 %>% mutate_at(.vars = vars(A1), .funs = funs(remainder = .-A2))
Works fine, however when I try and write a function to perform this:
REMAINDER <- function(df, numer, denom){
df <- df %>% mutate_at(.vars = vars(numer), .funs = funs(remainder = .-denom))
return(df)
}
With arguments df1 <- REMAINDER(df1, A1, A2)
I get the error Error in mutate_impl(.data, dots) :
Evaluation error: non-numeric argument to binary operator.
Which I don't understand as I just manually called the line of code without a function and my columns are numeric.
The vignette Programming with dplyr explains in great detail what to do:
library(dplyr)
REMAINDER <- function(df, numer, denom) {
numer <- enquo(numer)
denom <- enquo(denom)
df %>% mutate_at(.vars = vars(!! numer), .funs = funs(remainder = . - !! denom))
}
df1 <- data_frame(A1 = 11:13, A2 = 3:1, B1 = 21:23, B2 = 8:6)
REMAINDER(df1, A1, A2)
# A tibble: 3 x 5
A1 A2 B1 B2 remainder
<int> <int> <int> <int> <int>
1 11 3 21 8 8
2 12 2 22 7 10
3 13 1 23 6 12
REMAINDER(df1, B1, B2)
# A tibble: 3 x 5
A1 A2 B1 B2 remainder
<int> <int> <int> <int> <int>
1 11 3 21 8 13
2 12 2 22 7 15
3 13 1 23 6 17
Naming the result column
The OP wants to update df1 and he wants to apply this operation to other columns as well.
Unfortunately, the REMAINDER() function as it is currently defined will overwrite the result column:
df1
# A tibble: 3 x 4
A1 A2 B1 B2
<int> <int> <int> <int>
1 11 3 21 8
2 12 2 22 7
3 13 1 23 6
df1 <- REMAINDER(df1, A1, A2)
df1
# A tibble: 3 x 5
A1 A2 B1 B2 remainder
<int> <int> <int> <int> <int>
1 11 3 21 8 8
2 12 2 22 7 10
3 13 1 23 6 12
df1 <- REMAINDER(df1, B1, B2)
df1
# A tibble: 3 x 5
A1 A2 B1 B2 remainder
<int> <int> <int> <int> <int>
1 11 3 21 8 13
2 12 2 22 7 15
3 13 1 23 6 17
The function can be modified so that the result column is individually named:
REMAINDER <- function(df, numer, denom) {
numer <- enquo(numer)
denom <- enquo(denom)
result_name <- paste0("remainder_", quo_name(numer), "_", quo_name(denom))
df %>% mutate_at(.vars = vars(!! numer),
.funs = funs(!! result_name := . - !! denom))
}
Now, calling REMAINDER() twice on different columns and replacing df1 after each call, we get
df1 <- REMAINDER(df1, A1, A2)
df1 <- REMAINDER(df1, B1, B2)
df1
# A tibble: 3 x 6
A1 A2 B1 B2 remainder_A1_A2 remainder_B1_B2
<int> <int> <int> <int> <int> <int>
1 11 3 21 8 8 13
2 12 2 22 7 10 15
3 13 1 23 6 12 17
I have used this suggestion in order to subtract pairs of columns in a list of data frames. My example has only 3 pairs of columns in each of the two data frames and it can work with higher number of columns and data frames.
dt <- data.table(A1 = round(runif(3),1), A2 = round(runif(3),1),
B1 = round(runif(3),1), B2 = round(runif(3),1),
C1 =round(runif(3),1), C2 =round(runif(3),1))
dt = list(dt,dt+dt)
lapply(seq_along(dt), function(z) {
dt[[z]][, lapply(1:(ncol(.SD)/2), function(x) (.SD[[2*x-1]] - .SD[[2*x]]))]
})

Using lapply to transpose part of a column and add it as new columns to a data frame

I've been searching for some clarity on this one, but cannot find something that applies to my case, I constructed a DF very similar to this one (but with considerably more data, over a million rows in total)
Key1 <- c("A", "B", "C", "A", "C", "B", "B", "C", "A", "C")
Key2 <- c("A1", "B1", "C1", "A2", "C2", "B2", "B3", "C3", "A3", "C4")
NumVal <- c(2, 3, 1, 4, 6, 8, 2, 3, 1, 0)
DF1 <- as.data.frame(cbind(Key1, Key2, NumVal), stringsAsFactors = FALSE) %>% arrange(Key2)
ConsId <- c(1:10)
DF1 <- cbind(DF1, ConsId)
Now, what I want to do is to add lets say 3 new columns (in real life I need 12, but in order to be more graphic in this toy example we'll use 3) to the data frame, where each row corresponds to the values of $NumVal with the same $Key1 and greater than or equal $ConsId to the ones in each row and filling the remaining spaces with NA's, here is the expected result in case I wasn't very clear:
Key1 Key2 NumVal ConsId V1 V2 V3
A A1 2 1 2 4 1
A A2 4 2 4 1 NA
A A3 1 3 1 NA NA
B B1 3 4 3 8 2
B B2 8 5 8 2 NA
B B3 2 6 2 NA NA
C C1 1 7 1 6 3
C C2 6 8 6 3 0
C C3 3 9 3 0 NA
C C4 0 10 0 NA NA
Now I'm using a do.call(rbind), and even tough it works fine, it takes way too long for my real data with a bit over 1 million rows (around 6 hrs), I also tried with the bind_rows dplyr function but it took a bit longer so I stuck with the do.call option, here's an example of the code I'm using:
# Function
TranspNumVal <- function(i){
Id <- DF1[i, "Key1"]
IdCons <- DF1[i, "ConsId"]
myvect <- as.matrix(filter(DF1, Id == Key1, ConsId >= IdCons) %>% select(NumVal))
Result <- as.data.frame(t(myvect[1:3]))
return(Result)
}
# Applying the function to the entire data frame
DF2 <- do.call(rbind, lapply(1:NROW(DF1), function(i) TranspNumVal(i)))
DF3 <- cbind(DF1, DF2)
Maybe changing the class is causing the code to be so inefficient, or maybe I'm just not finding a better way to vectorize my problem (you don't want to know how long it took with a nested loop), I'm fairly new to R and have just started fooling around with dplyr, so I'm open to any suggestion about how to optimize my code
We can use dplyr::lead
DF1 %>%
group_by(Key1) %>%
mutate(
V1 = NumVal,
V2 = lead(NumVal, n = 1),
V3 = lead(NumVal, n = 2))
## A tibble: 10 x 7
## Groups: Key1 [3]
# Key1 Key2 NumVal ConsId V1 V2 V3
# <chr> <chr> <chr> <int> <chr> <chr> <chr>
# 1 A A1 2 1 2 4 1
# 2 A A2 4 2 4 1 NA
# 3 A A3 1 3 1 NA NA
# 4 B B1 3 4 3 8 2
# 5 B B2 8 5 8 2 NA
# 6 B B3 2 6 2 NA NA
# 7 C C1 1 7 1 6 3
# 8 C C2 6 8 6 3 0
# 9 C C3 3 9 3 0 NA
#10 C C4 0 10 0 NA NA
Explanation: We group entries by Key1 and then use lead to shift NumVal values for columns V2 and V3. V1 is simply a copy of NumVal.
A dplyr pipeline.
First utility function will filter a (NumVal) based on the values of b (ConsId):
myfunc1 <- function(a,b) {
n <- length(b)
lapply(seq_along(b), function(i) a[ b >= b[i] ])
}
Second utility function converts a ragged list into a data.frame. It works with arbitrary number of columns to append, but we've limited it to 3 based on your requirements:
myfunc2 <- function(x, ncols = 3) {
n <- min(ncols, max(lengths(x)))
as.data.frame(do.call(rbind, lapply(x, `length<-`, n)))
}
Now the pipeline:
dat %>%
group_by(Key1) %>%
mutate(lst = myfunc1(NumVal, ConsId)) %>%
ungroup() %>%
bind_cols(myfunc2(.$lst)) %>%
select(-lst) %>%
arrange(Key1, ConsId)
# # A tibble: 10 × 7
# Key1 Key2 NumVal ConsId V1 V2 V3
# <chr> <chr> <int> <int> <int> <int> <int>
# 1 A A1 2 1 2 4 1
# 2 A A2 4 2 4 1 NA
# 3 A A3 1 3 1 NA NA
# 4 B B1 3 4 3 8 2
# 5 B B2 8 5 8 2 NA
# 6 B B3 2 6 2 NA NA
# 7 C C1 1 7 1 6 3
# 8 C C2 6 8 6 3 0
# 9 C C3 3 9 3 0 NA
# 10 C C4 0 10 0 NA NA
After grouping by 'Key1', use shift (from data.table) to get the next value of 'NumVal' in a list, convert it to tibble and unnest the nested list elements to individual columns of the dataset. By default, shift fill NA at the end.
library(data.table)
library(tidyverse)
DF1 %>%
group_by(Key1) %>%
mutate(new = shift(NumVal, 0:(n()-1), type = 'lead') %>%
map(~
as.list(.x) %>%
set_names(paste0("V", seq_along(.))) %>%
as_tibble)) %>%
unnest %>%
select(-V4)
# A tibble: 10 x 7
# Groups: Key1 [3]
# Key1 Key2 NumVal ConsId V1 V2 V3
# <chr> <chr> <dbl> <int> <dbl> <dbl> <dbl>
# 1 A A1 2 1 2 4 1
# 2 A A2 4 2 4 1 NA
# 3 A A3 1 3 1 NA NA
# 4 B B1 3 4 3 8 2
# 5 B B2 8 5 8 2 NA
# 6 B B3 2 6 2 NA NA
# 7 C C1 1 7 1 6 3
# 8 C C2 6 8 6 3 0
# 9 C C3 3 9 3 0 NA
#10 C C4 0 10 0 NA NA
data
DF1 <- data.frame(Key1, Key2, NumVal, stringsAsFactors = FALSE) %>%
arrange(Key2)
DF1$ConsId <- 1:10

Substract value from tibble column based on another tibble

Say I have a tibble of values:
raw = tibble(
group = c("A", "B", "C", "A", "B", "C"),
value = c(10, 20, 30, 40, 50, 60)
)
# A tibble: 6 x 2
group value
<chr> <dbl>
1 A 10
2 B 20
3 C 30
4 A 40
5 B 50
6 C 60
I want to subtract a certain amount from each value in my tibble depending on which group it belongs to. The amounts I need to subtract are in another tibble:
corrections = tibble(
group = c("A", "B", "C"),
corr = c(0, 1, 2)
)
# A tibble: 3 x 2
group corr
<chr> <dbl>
1 A 0
2 B 1
3 C 2
What is the most elegant way to achieve this? The following works, but I feel like it is messy - surely there is another way?
mutate(raw, corrected = value - as_vector(corrections[corrections["group"] == group, "corr"]))
# A tibble: 6 x 3
group value corrected
<chr> <dbl> <dbl>
1 A 10 10
2 B 20 19
3 C 30 28
4 A 40 40
5 B 50 49
6 C 60 58
How about first joining raw and corrections and then calculating corrected?
library(dplyr)
left_join(raw, corrections, by = "group") %>%
mutate(corrected = value - corr) %>%
select(-corr)
#> # A tibble: 6 x 3
#> group value corrected
#> <chr> <dbl> <dbl>
#> 1 A 10 10
#> 2 B 20 19
#> 3 C 30 28
#> 4 A 40 40
#> 5 B 50 49
#> 6 C 60 58

Resources