R lookup function and merge - r

I'm trying to find the appropriate prices in dt for all values in vector a. It's so simple, yet I can't seem to figure it out. It should be a multiple merge...
a <- 1:10
min <- c( 0, 2, 4, 7)
max <- c(1, 3, 6, 10)
price <- c(2, 4, 6, 8)
dt <- data.frame(min = min, max = max, price=price)
This is the output I would like

Using the data.table package, you can do this
library(data.table)
setDT(dt)[data.table(a), .(a = min, price), on = .(min <= a, max >= a)]
Output
a price
1: 1 2
2: 2 4
3: 3 4
4: 4 6
5: 5 6
6: 6 6
7: 7 8
8: 8 8
9: 9 8
10: 10 8

Are you looking for this:
library(dplyr)
library(purrr)
library(tidyr)
library(tibble)
dt %>% mutate(range = map2(min, max, `:`)) %>% unnest(range) %>%
inner_join(as_tibble(a), by = c('range' = 'value')) %>% select('a' = range, price)
# A tibble: 10 x 2
a price
<int> <dbl>
1 1 2
2 2 4
3 3 4
4 4 6
5 5 6
6 6 6
7 7 8
8 8 8
9 9 8
10 10 8

Here's a base R "lookup and merge" approach:
ranges <- mapply(seq, min, max)
values <- mapply(rep, price, lengths(ranges))
lookup <- data.frame(a = unlist(ranges), values = unlist(values))
merge(data.frame(a), lookup)
a values
1 1 2
2 2 4
3 3 4
4 4 6
5 5 6
6 6 6
7 7 8
8 8 8
9 9 8
10 10 8

We can use fuzzy_join
library(fuzzyjoin)
fuzzy_left_join(dt, tibble(a), by = c('min' = 'a', 'max' = 'a'),
match_fun = list(`<=`, `>=`)) %>%
select(a, price)
# a price
#1 1 2
#2 2 4
#3 3 4
#4 4 6
#5 5 6
#6 6 6
#7 7 8
#8 8 8
#9 9 8
#10 10 8
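Because the ranges in dt are sorted and do not overlap, a plain base R lookup without any join is also possible. The following is only a minimal sketch under that assumption, not one of the answers above. findInterval() returns, for each element of a, the index of the last min that is less than or equal to it; the extra check against max guards against values that fall in a gap or beyond the last range. The names idx, inside and result are illustrative.

```r
a <- 1:10
dt <- data.frame(min = c(0, 2, 4, 7), max = c(1, 3, 6, 10), price = c(2, 4, 6, 8))

# Assumption: dt is sorted by min and the ranges do not overlap.
idx <- findInterval(a, dt$min)   # index of the last range whose min is <= a (0 if none)
idx[idx == 0] <- NA              # values below the first range have no match

# Confirm each value is not past the max of the range it was assigned to.
inside <- !is.na(idx) & a <= dt$max[idx]

result <- data.frame(a = a, price = ifelse(inside, dt$price[idx], NA))
result
```

If the ranges could overlap, this shortcut no longer applies and one of the join-based answers is the safer choice.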

Related

Arithmetic between dataframes with varying numbers of rows in R

I have objects containing monthly data on plant growth. Each object is a fixed number of columns, and the number of rows is equal to the number of months the plant survives. I would like to take the mean of these objects so that the mean considers only plants surviving at a given timestep. Here is example data:
df1 <- data.frame(GPP = 1:10, NPP = 1:10)
df2 <- data.frame(GPP = 2:8, NPP = 2:8)
df3 <- data.frame(GPP = 3:9, NPP = 3:9 )
In this scenario, the maximum timesteps is 10, and the 2nd and 3rd plants did not survive this long. To take the mean, my initial thought was to replace empty space with NA to make the dimensions the same, such as this:
na <- matrix( , nrow = 3, ncol = 2)
colnames(na) <- c("GPP","NPP")
df2 <- rbind(df2, na)
df3 <- rbind(df3, na)
This is not desirable because the NA does not simply ignore the value as I had hoped, but nullifies the field, leading to all outputs of arithmetic with NA becoming NA, such as this:
(df1 + df2 + df3) / 3
GPP NPP
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
6 7 7
7 8 8
8 NA NA
9 NA NA
10 NA NA
I can NOT just fill na with 0s because I want to see the mean of every plant that is living at a given timestep while completely ignoring those that have died. Replacing with 0s would skew the mean, and not achieve this. For my example data here, this is the desired outcome
(df1 + df2 + df3) / 3
GPP NPP
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
6 7 7
7 8 8
8 8 8
9 9 9
10 10 10
Here rows 8-10 are replaced with the values from df1 because there are only 7 rows in both df2 and df3.
I'll restate my comment: it is generally much safer to encode the month in the original data before you do anything else; it is explicit and will insulate you from mistakes later in the pipeline that might inadvertently change the order of rows (which completely breaks any valid significance you hope to attain). Additionally, since I'm going to recommend putting all data into one frame, let's encode the plant number as well (even if we don't use it immediately here).
For that, then:
df1 <- data.frame(plant = "A", month = 1:10, GPP = 1:10, NPP = 1:10)
df2 <- data.frame(plant = "B", month = 1:7, GPP = 2:8, NPP = 2:8)
df3 <- data.frame(plant = "C", month = 1:7, GPP = 3:9, NPP = 3:9)
From this, I'm a huge fan of having all data in one frame. This is well-informed by https://stackoverflow.com/a/24376207/3358227, where a premise is that if you're going to do the same thing to a bunch of frames, it should either be a list-of-frames or one combined frame (that keeps the source id encoded):
dfs <- do.call(rbind, list(df1, df2, df3))
### just a sampling, for depiction
dfs[c(1:2, 10:12, 17:19),]
# plant month GPP NPP
# 1 A 1 1 1
# 2 A 2 2 2
# 10 A 10 10 10
# 11 B 1 2 2
# 12 B 2 3 3
# 17 B 7 8 8
# 18 C 1 3 3
# 19 C 2 4 4
base R
aggregate(cbind(GPP, NPP) ~ month, data = dfs, FUN = mean, na.rm = TRUE)
# month GPP NPP
# 1 1 2 2
# 2 2 3 3
# 3 3 4 4
# 4 4 5 5
# 5 5 6 6
# 6 6 7 7
# 7 7 8 8
# 8 8 8 8
# 9 9 9 9
# 10 10 10 10
dplyr
library(dplyr)
dfs %>%
group_by(month) %>%
summarize(across(c(GPP, NPP), mean))
# # A tibble: 10 x 3
# month GPP NPP
# <int> <dbl> <dbl>
# 1 1 2 2
# 2 2 3 3
# 3 3 4 4
# 4 4 5 5
# 5 5 6 6
# 6 6 7 7
# 7 7 8 8
# 8 8 8 8
# 9 9 9 9
# 10 10 10 10
Side point: two bits of data you are "losing" in this summary are the sample size and the variability for each month. You might include them with:
dfs %>%
group_by(month) %>%
summarize(across(c(GPP, NPP), list(mu = ~ mean(.), sigma = ~ sd(.), len = ~ length(.))))
# # A tibble: 10 x 7
# month GPP_mu GPP_sigma GPP_len NPP_mu NPP_sigma NPP_len
# <int> <dbl> <dbl> <int> <dbl> <dbl> <int>
# 1 1 2 1 3 2 1 3
# 2 2 3 1 3 3 1 3
# 3 3 4 1 3 4 1 3
# 4 4 5 1 3 5 1 3
# 5 5 6 1 3 6 1 3
# 6 6 7 1 3 7 1 3
# 7 7 8 1 3 8 1 3
# 8 8 8 NA 1 8 NA 1
# 9 9 9 NA 1 9 NA 1
# 10 10 10 NA 1 10 NA 1
In this case, an average of 8 may be meaningful, but noting that it is a length of 1 is also informative of the "strength" of that statistic (i.e., weak).
library(dplyr)
df1 <- data.frame(month = 1:10, GPP = 1:10, NPP = 1:10)
df2 <- data.frame(month = 1:7, GPP = 2:8, NPP = 2:8)
df3 <- data.frame(month = 1:7, GPP = 3:9, NPP = 3:9 )
df <- rbind(df1, df2, df3)
df %>%
group_by(month) %>%
summarise(GPP = mean(GPP),
NPP = mean(NPP))
month GPP NPP
<int> <dbl> <dbl>
1 1 2 2
2 2 3 3
3 3 4 4
4 4 5 5
5 5 6 6
6 6 7 7
7 7 8 8
8 8 8 8
9 9 9 9
10 10 10 10
Using data.table
library(data.table)
rbindlist(mget(ls(pattern = '^df\\d+$')))[, lapply(.SD, mean), month]
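The NA-padding idea from the question can also be made to work in base R, as long as the mean is computed from a per-cell sum and a per-cell count of non-missing values. The following is a minimal sketch under the assumption that row i of every frame is month i; the names frames, padded, sums, counts and plant_mean are illustrative.

```r
df1 <- data.frame(GPP = 1:10, NPP = 1:10)
df2 <- data.frame(GPP = 2:8, NPP = 2:8)
df3 <- data.frame(GPP = 3:9, NPP = 3:9)

frames <- list(df1, df2, df3)
n_max <- max(sapply(frames, nrow))

# Indexing past the last row of a data frame yields rows of NA,
# so this pads every frame to n_max rows.
padded <- lapply(frames, function(d) as.matrix(d[seq_len(n_max), , drop = FALSE]))

# Sum the observed values and count how many plants contribute to each cell.
sums   <- Reduce(`+`, lapply(padded, function(m) { m[is.na(m)] <- 0; m }))
counts <- Reduce(`+`, lapply(padded, function(m) !is.na(m)))

plant_mean <- sums / counts
plant_mean
```

The counts matrix doubles as the "len" column discussed above: it records how many plants each monthly mean is based on.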

Unnest or unchop dataframe containing lists of different lengths

I have a dataframe with several columns containing list columns that I want to unnest (or unchop). BUT, they are different lengths, so the resulting error is Error: No common size for...
Here is a reprex to show what works and doesn't work.
library(tidyr)
library(vctrs)
# This works as expected
df_A <- tibble(
ID = 1:3,
A = as_list_of(list(c(9, 8, 5), c(7,6), c(6, 9)))
)
unchop(df_A, cols = c(A))
# A tibble: 7 x 2
ID A
<int> <dbl>
1 1 9
2 1 8
3 1 5
4 2 7
5 2 6
6 3 6
7 3 9
# This works as expected as the lists are the same lengths
df_AB_1 <- tibble(
ID = 1:3,
A = as_list_of(list(c(9, 8, 5), c(7,6), c(6, 9))),
B = as_list_of(list(c(1, 2, 3), c(4, 5), c(7, 8)))
)
unchop(df_AB_1, cols = c(A, B))
# A tibble: 7 x 3
ID A B
<int> <dbl> <dbl>
1 1 9 1
2 1 8 2
3 1 5 3
4 2 7 4
5 2 6 5
6 3 6 7
7 3 9 8
# This does NOT work as the lists are different lengths
df_AB_2 <- tibble(
ID = 1:3,
A = as_list_of(list(c(9, 8, 5), c(7,6), c(6, 9))),
B = as_list_of(list(c(1, 2), c(4, 5, 6), c(7, 8, 9, 0)))
)
unchop(df_AB_2, cols = c(A, B))
# Error: No common size for `A`, size 3, and `B`, size 2.
The output that I would like to achieve for df_AB_2 above is as follows where each list is unchopped and missing values are filled with NA:
# A tibble: 10 x 3
ID A B
<dbl> <dbl> <dbl>
1 1 9 1
2 1 8 2
3 1 5 NA
4 2 7 4
5 2 6 5
6 2 NA 6
7 3 6 7
8 3 9 8
9 3 NA 9
10 3 NA 0
I have referenced this issue on Github and StackOverflow here.
Any ideas how to achieve the result above?
Versions
> packageVersion("tidyr")
[1] ‘1.0.0’
> packageVersion("vctrs")
[1] ‘0.2.0.9001’
Here is an idea via dplyr that you can generalise to as many columns as you want,
library(tidyverse)
df_AB_2 %>%
pivot_longer(c(A, B)) %>%
mutate(value = lapply(value, `length<-`, max(lengths(value)))) %>%
pivot_wider(names_from = name, values_from = value) %>%
unnest(cols = c(A, B)) %>%
filter(rowSums(is.na(.[-1])) != 2)
which gives,
# A tibble: 10 x 3
ID A B
<int> <dbl> <dbl>
1 1 9 1
2 1 8 2
3 1 5 NA
4 2 7 4
5 2 6 5
6 2 NA 6
7 3 6 7
8 3 9 8
9 3 NA 9
10 3 NA 0
Defining a helper function to update the lengths of the element and proceeding with dplyr:
foo <- function(x, len_vec) {
lapply(
seq_len(length(x)),
function(i) {
length(x[[i]]) <- len_vec[i]
x[[i]]
}
)
}
df_AB_2 %>%
mutate(maxl = pmax(lengths(A), lengths(B))) %>%
mutate(A = foo(A, maxl), B = foo(B, maxl)) %>%
unchop(cols = c(A, B)) %>%
select(-maxl)
# A tibble: 10 x 3
ID A B
<int> <dbl> <dbl>
1 1 9 1
2 1 8 2
3 1 5 NA
4 2 7 4
5 2 6 5
6 2 NA 6
7 3 6 7
8 3 9 8
9 3 NA 9
10 3 NA 0
Using data.table:
library(data.table)
setDT(df_AB_2)
df_AB_2[, maxl := pmax(lengths(A), lengths(B))]
df_AB_2[, .(unlist(A)[seq_len(maxl)], unlist(B)[seq_len(maxl)]), by = ID]
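The same padding idea can be sketched in base R without any packages. This is only an illustration, assuming plain lists A and B rather than list_of columns; pad_to is a hypothetical helper name. It relies on the fact that assigning a larger length to a vector extends it with NA.

```r
ID <- 1:3
A <- list(c(9, 8, 5), c(7, 6), c(6, 9))
B <- list(c(1, 2), c(4, 5, 6), c(7, 8, 9, 0))

# Target length per row: the longer of the two list elements.
n <- pmax(lengths(A), lengths(B))

# `length<-` extends a vector with NA when the new length is larger.
pad_to <- function(x, len) Map(function(v, k) { length(v) <- k; v }, x, len)

out <- data.frame(
  ID = rep(ID, times = n),
  A  = unlist(pad_to(A, n)),
  B  = unlist(pad_to(B, n))
)
out
```

Padding to the per-row maximum, rather than the overall maximum, avoids having to filter out rows that are NA in every column afterwards.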

How to use group_by with summarise and summarise_all?

x y
1 1 1
2 3 2
3 2 3
4 3 4
5 2 5
6 4 6
7 5 7
8 2 8
9 1 9
10 1 10
11 3 11
12 4 12
The above is part of the input.
Let's suppose that it also has a bunch of other columns
I want to:
group_by x
summarise y by sum
And for all other columns, I want to summarise_all by just taking the first value
Here's an approach that breaks it into two problems and combines them:
library(dplyr)
left_join(
# Here we want to treat column y specially
df %>%
group_by(x) %>%
summarize(sum_y = sum(y)),
# Here we exclude y and use a different summary (the first value) for all the remaining columns
df %>%
group_by(x) %>%
select(-y) %>%
summarise_all(first)
)
# A tibble: 5 x 3
x sum_y z
<int> <int> <int>
1 1 20 1
2 2 16 3
3 3 17 2
4 4 18 2
5 5 7 3
Sample data:
df <- read.table(
header = TRUE,
stringsAsFactors = FALSE,
text="x y z
1 1 1
3 2 2
2 3 3
3 4 4
2 5 1
4 6 2
5 7 3
2 8 4
1 9 1
1 10 2
3 11 3
4 12 4")
library(dplyr)
df1 %>%
group_by(x) %>%
# summarise_each() and funs() are deprecated; summarise_at() covers both uses
summarise_at(vars(-y), list(avg = mean)) %>%
bind_cols(.,{df1 %>%
group_by(x) %>%
summarise_at(vars(y), sum) %>%
select(-x)
})
#> # A tibble: 5 x 4
#> x r_avg r.1_avg y
#> <int> <dbl> <dbl> <int>
#> 1 1 6.67 6.67 20
#> 2 2 5.33 5.33 16
#> 3 3 5.67 5.67 17
#> 4 4 9 9 18
#> 5 5 7 7 7
Created on 2019-06-20 by the reprex package (v0.3.0)
Data:
df1 <- read.table(text="
r x y
1 1 1
2 3 2
3 2 3
4 3 4
5 2 5
6 4 6
7 5 7
8 2 8
9 1 9
10 1 10
11 3 11
12 4 12", header = TRUE)
df1 <- df1[,c(2,3,1,1)]
library(tidyverse)
df <- tribble(~x, ~y, # making a sample data frame
1, 1,
3, 2,
2, 3,
3, 4,
2, 5,
4, 6,
5, 7,
2, 8,
1, 9,
1, 10,
3, 11,
4, 12)
df <- df %>%
add_column(z = sample(1:nrow(df))) #add another column for the example
df
# If there is only one additional column and you need the first value
df %>%
group_by(x) %>%
summarise(sum_y = sum(y), z_1st = z[1])
# otherwise use summarise_at to address all the other columns
f <- function(x){x[1]} # function to extract the first value
df %>%
group_by(x) %>%
summarise_at(.vars = vars(-y), .funs = f) # exclude column y from the calculations
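For comparison, the same two-step idea (sum y per group, take the first value of everything else) can be sketched in base R. This is a minimal illustration using the sample data with columns x, y and z from above; sums, firsts and res are hypothetical names.

```r
df <- data.frame(
  x = c(1, 3, 2, 3, 2, 4, 5, 2, 1, 1, 3, 4),
  y = 1:12,
  z = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)
)

# Step 1: sum y within each value of x.
sums <- aggregate(y ~ x, data = df, FUN = sum)
names(sums)[names(sums) == "y"] <- "sum_y"

# Step 2: keep the first row of each group for every column except y.
firsts <- df[!duplicated(df$x), setdiff(names(df), "y"), drop = FALSE]

# Combine both summaries on the grouping column.
res <- merge(sums, firsts, by = "x")
res
```

Because setdiff() picks up every column other than y, the sketch should carry over unchanged when the data has more columns.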

Repeat a record for N times and create a new sequence from 1 to N

I want to repeat each row of a data.frame N times, where N is computed from the difference between the values in the first and second columns of that row, so N may differ from row to row. I also need to create a new column holding the sequence from the first value to the second value of the row, increasing in steps of K. K is the same for all rows.
Ex: d1<-data.frame(A=c(2,4,6,8,1),B=c(8,6,7,8,10))
In the above dataset there are 5 rows. In the first row the values run from 2 to 8, so with K = 1 the sequence has 7 elements. I need to replicate the first row 7 times and create a new column with the sequence 2, 3, 4, 5, 6, 7, 8.
I can create a dataset by using the following code.
dist<-1
rec_len<-c()
seqe<-c()
for(i in 1:nrow(d1))
{
a<-seq(d1[i,"A"],d1[i,"B"],by=dist)
rec_len<-c(rec_len,length(a))
seqe<-c(seqe,a)
}
d1$C<-rec_len
d1<-d1[rep(1:nrow(d1),d1$C),]
d1$D<-seqe
row.names(d1)<-NULL
But it takes a very long time. Is there any way to speed up the process?
A data.table approach is to use 1:nrow(d1) as the grouping variable, so that each row is processed on its own: build a list column holding the sequence from A to B, and then unlist it, i.e.
library(data.table)
setDT(d1)[, C := B - A + 1][,
D := list(list(seq(A, B))), by = 1:nrow(d1)][,
lapply(.SD, unlist), by = 1:nrow(d1)][,
nrow := NULL][]
Which gives,
A B C D
1: 2 8 7 2
2: 2 8 7 3
3: 2 8 7 4
4: 2 8 7 5
5: 2 8 7 6
6: 2 8 7 7
7: 2 8 7 8
8: 4 6 3 4
9: 4 6 3 5
10: 4 6 3 6
11: 6 7 2 6
12: 6 7 2 7
13: 8 8 1 8
14: 1 10 10 1
15: 1 10 10 2
16: 1 10 10 3
17: 1 10 10 4
18: 1 10 10 5
19: 1 10 10 6
20: 1 10 10 7
21: 1 10 10 8
22: 1 10 10 9
23: 1 10 10 10
A B C D
Note You can easily change K within seq, i.e.
setDT(d1)[, C := B - A + 1][,
D := list(list(seq(A, B, by = 0.2))), by = 1:nrow(d1)][,
lapply(.SD, unlist), by = 1:nrow(d1)][,
nrow := NULL][]
You could use lists and the purrr package to process each row of your data frame:
library(dplyr)
library(purrr)
library(tibble)
data.frame(A=c(2,4,6,8,1),B=c(8,6,7,8,10)) %>% # take original data frame
setNames(c("from", "to")) %>% pmap(seq) %>% # sequence from A to B
map(as_tibble) %>% # convert each element to a tibble (column `value`)
map(~mutate(.,A=min(value), B=max(value))) %>% # add A and B columns
bind_rows() %>% select(A,B,value) # combine and reorder columns
Here is a base R option. We get the number of replications of each row by subtracting column 'A' from column 'B' and adding 1 ('i1'), store that as column 'C', and then replicate the row indices of the original dataset using 'i1'. Finally, the 'D' column is created by getting the sequence between the corresponding elements of 'A' and 'B' using Map. The output is a list, so we unlist it to make a vector.
i1 <- with(d1, B - A + 1)
d1$C <- i1
d2 <- d1[rep(seq_len(nrow(d1)), i1),]
d2$D <- unlist(Map(`:`, d1$A, d1$B))
row.names(d2) <- NULL
d2
# A B C D
#1 2 8 7 2
#2 2 8 7 3
#3 2 8 7 4
#4 2 8 7 5
#5 2 8 7 6
#6 2 8 7 7
#7 2 8 7 8
#8 4 6 3 4
#9 4 6 3 5
#10 4 6 3 6
#11 6 7 2 6
#12 6 7 2 7
#13 8 8 1 8
#14 1 10 10 1
#15 1 10 10 2
#16 1 10 10 3
#17 1 10 10 4
#18 1 10 10 5
#19 1 10 10 6
#20 1 10 10 7
#21 1 10 10 8
#22 1 10 10 9
#23 1 10 10 10
Simple example using N (case where k = 1)
library(dplyr)
# example data frame
d1 <- data.frame(A=c(2,4,6,8,1),B=c(8,6,7,8,10))
# function to use (must have same column names)
f = function(d) {
A = rep(d$A, d$diff)
B = rep(d$B, d$diff)
C = seq(d$A, d$B)
data.frame(A, B, C) }
d1 %>%
mutate(diff = B - A + 1) %>% # calculate difference
rowwise() %>% # for every row
do(f(.)) %>% # apply the function
ungroup() # forget the grouping
# # A tibble: 23 x 3
# A B C
# * <dbl> <dbl> <int>
# 1 2 8 2
# 2 2 8 3
# 3 2 8 4
# 4 2 8 5
# 5 2 8 6
# 6 2 8 7
# 7 2 8 8
# 8 4 6 4
# 9 4 6 5
# 10 4 6 6
# # ... with 13 more rows
Example where you have one k for all rows (I'm using 0.25 to demonstrate)
# example data frame
d1 <- data.frame(A=c(2,4,6,8,1),B=c(8,6,7,8,10))
# function to use (must have same column names)
f = function(d, k) {
A = d$A
B = d$B
C = seq(d$A, d$B, k)
data.frame(A, B, C) }
d1 %>%
rowwise() %>% # for every row
do(f(., 0.25)) %>% # apply the function using your own k
ungroup()
# # A tibble: 77 x 3
# A B C
# * <dbl> <dbl> <dbl>
# 1 2 8 2.00
# 2 2 8 2.25
# 3 2 8 2.50
# 4 2 8 2.75
# 5 2 8 3.00
# 6 2 8 3.25
# 7 2 8 3.50
# 8 2 8 3.75
# 9 2 8 4.00
# 10 2 8 4.25
# # ... with 67 more rows
Example where you have different k for each row
# example data frame
# give manually different k for each row
d1 <- data.frame(A=c(2,4,6,8,1),B=c(8,6,7,8,10))
d1$k = c(0.5, 1, 2, 0.25, 1.5)
d1
# A B k
# 1 2 8 0.50
# 2 4 6 1.00
# 3 6 7 2.00
# 4 8 8 0.25
# 5 1 10 1.50
# function to use (must have same column names)
f = function(d) {
A = d$A
B = d$B
C = seq(d$A, d$B, d$k)
data.frame(A, B, C) }
d1 %>%
rowwise() %>% # for every row
do(f(.)) %>% # apply the function using different k for each row
ungroup()
# # A tibble: 25 x 3
# A B C
# * <dbl> <dbl> <dbl>
# 1 2 8 2.0
# 2 2 8 2.5
# 3 2 8 3.0
# 4 2 8 3.5
# 5 2 8 4.0
# 6 2 8 4.5
# 7 2 8 5.0
# 8 2 8 5.5
# 9 2 8 6.0
# 10 2 8 6.5
# # ... with 15 more rows
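For a constant step K, the row-wise loop can also be avoided in base R by building all sequences first and then repeating the row indices. The following is a minimal sketch, assuming B >= A in every row; k, seqs, n and out are illustrative names, and mapply() is used with MoreArgs so that the same step is passed to every call of seq().

```r
d1 <- data.frame(A = c(2, 4, 6, 8, 1), B = c(8, 6, 7, 8, 10))
k <- 1  # step size K, the same for all rows

# One sequence per row, from A to B in steps of k.
seqs <- mapply(seq, d1$A, d1$B, MoreArgs = list(by = k), SIMPLIFY = FALSE)
n <- lengths(seqs)  # N for each row

# Repeat each row N times, then attach the count and the sequence values.
out <- d1[rep(seq_len(nrow(d1)), times = n), ]
out$C <- rep(n, times = n)
out$D <- unlist(seqs)
rownames(out) <- NULL
out
```

Computing n from the generated sequences rather than from B - A + 1 keeps the count correct when k is not 1.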

How to mutate multiple variables without repeating code?

I'm trying to create new variables from existing variables like below:
a1+a2=a3, b1+b2=b3, ..., z1+z2=z3
Here is an example data frame
df <- data.frame(replicate(10,sample(1:10)))
colnames(df) <- c("a1","a2","b1","b2","c1","c2","d1","d2","e1","e2")
Here's my solution with repeated code
# a solution by base R
df$a3 <- df$a1 + df$a2
df$b3 <- df$b1 + df$b2
df$c3 <- df$c1 + df$c2
df$d3 <- df$d1 + df$d2
df$e3 <- df$e1 + df$e2
Or
# a solution by dplyr
library(dplyr)
df <- df %>%
mutate(a3 = a1+a2,
b3 = b1+b2,
c3 = c1+c2,
d3 = d1+d2,
e3 = e1+e2)
Or
# a solution by data.table
library(data.table)
DT <- data.table(df)
DT[,a3:=a1+a2][,b3:=b1+b2][,c3:=c1+c2][,d3:=d1+d2][,e3:=e1+e2]
Actually I have more than 100 variables, so I want to find a way to do so without repeating code... Although I tried to use mutate_ with standard evaluation and regular expression, I lost my way because I'm a newbie in R. Can you mutate multiple variables without repeating code?
Your data format is making this hard, so I would reshape the data like this. In general, you shouldn't encode actual data information in column names: if the difference between a1 and a2 is meaningful, it is better to have one column with the letter (a, b, c) and one column with the number (1, 2).
df$id = 1:nrow(df)
library(tidyr)
library(dplyr)
tdf = gather(df, key = key, value = value, -id) %>%
separate(key, into = c("letter", "number"), sep = 1) %>%
mutate(number = paste0("V", number)) %>%
spread(key = number, value = value)
## now data is "tidy":
head(tdf)
# id letter V1 V2
# 1 1 a 2 7
# 2 1 b 10 4
# 3 1 c 9 10
# 4 1 d 9 4
# 5 1 e 5 8
# 6 2 a 9 8
## and the operation is simple:
tdf$V3 = tdf$V1 + tdf$V2
head(tdf)
# id letter V1 V2 V3
# 1 1 a 2 7 9
# 2 1 b 10 4 14
# 3 1 c 9 10 19
# 4 1 d 9 4 13
# 5 1 e 5 8 13
# 6 2 a 9 8 17
A possible solution using data.table:
DT <- data.table(df)[, rn := .I]
DTadd3 <- dcast(melt(DT, measure.vars = 1:10)[, `:=` (let = substr(variable,1,1), rn = 1:.N), variable
][, s3 := sum(value), .(let,rn)],
rn ~ paste0(let,3), value.var = 's3', mean)
DT[DTadd3, on = 'rn'][, rn := NULL][]
which gives:
a1 a2 b1 b2 c1 c2 d1 d2 e1 e2 a3 b3 c3 d3 e3
1: 10 5 9 5 10 4 5 3 7 10 15 14 14 8 17
2: 2 6 6 8 3 8 7 1 4 7 8 14 11 8 11
3: 6 4 7 4 4 3 4 6 3 3 10 11 7 10 6
4: 1 2 4 2 9 9 3 7 10 4 3 6 18 10 14
5: 9 10 8 1 8 7 10 5 9 1 19 9 15 15 10
6: 8 8 10 6 2 5 2 4 2 6 16 16 7 6 8
7: 7 9 1 7 5 10 9 2 1 8 16 8 15 11 9
8: 5 1 2 9 7 2 1 8 5 5 6 11 9 9 10
9: 3 7 3 3 1 6 8 10 8 9 10 6 7 18 17
10: 4 3 5 10 6 1 6 9 6 2 7 15 7 15 8
A similar solution using dplyr and tidyr:
df %>%
bind_cols(., df %>%
gather(var, val) %>%
group_by(var) %>%
mutate(let = substr(var,1,1), rn = 1:n()) %>%
group_by(let,rn) %>%
summarise(s3 = sum(val)) %>%
spread(let, s3) %>%
select(-rn)
)
However, as noted by @Gregor, it is much better to transform your data into long format. The data.table equivalent of @Gregor's answer:
DT <- data.table(df)
melt(DT[, rn := .I],
variable.name = 'let',
measure.vars = patterns('1$','2$'),
value.name = paste0('v',1:2)
)[, `:=` (let = letters[let], v3 = v1 + v2)][]
which gives (first 15 rows):
rn let v1 v2 v3
1: 1 a 10 5 15
2: 2 a 2 6 8
3: 3 a 6 4 10
4: 4 a 1 2 3
5: 5 a 9 10 19
6: 6 a 8 8 16
7: 7 a 7 9 16
8: 8 a 5 1 6
9: 9 a 3 7 10
10: 10 a 4 3 7
11: 1 b 9 5 14
12: 2 b 6 8 14
13: 3 b 7 4 11
14: 4 b 4 2 6
15: 5 b 8 1 9
My data.table solution:
library(data.table)
setDT(df) # `:=` only works on a data.table
sapply(c("a", "b", "c", "d", "e"), function(ll)
df[ , paste0(ll, 3) := get(paste0(ll, 1)) + get(paste0(ll, 2))])
df[]
# a1 a2 b1 b2 c1 c2 d1 d2 e1 e2 a3 b3 c3 d3 e3
# 1: 5 2 2 6 4 1 10 7 3 9 7 8 5 17 12
# 2: 4 8 7 3 3 7 9 6 9 7 12 10 10 15 16
# 3: 10 7 6 10 1 9 4 1 2 4 17 16 10 5 6
# 4: 3 4 1 7 6 4 7 4 7 5 7 8 10 11 12
# 5: 8 3 4 2 2 2 3 3 4 10 11 6 4 6 14
# 6: 6 6 5 1 8 10 1 10 5 3 12 6 18 11 8
# 7: 2 10 8 9 5 6 2 5 10 2 12 17 11 7 12
# 8: 1 1 10 8 9 5 6 9 6 8 2 18 14 15 14
# 9: 9 5 3 5 10 3 5 2 1 6 14 8 13 7 7
# 10: 7 9 9 4 7 8 8 8 8 1 16 13 15 16 9
Or, more extensibly:
sapply(c("a", "b", "c", "d", "e"), function(ll)
df[ , paste0(ll, 3) := Reduce(`+`, mget(paste0(ll, 1:2)))])
If all of the variables fit the pattern of ending with 1 or 2, you might try:
stems = unique(gsub("[0-9]", "", names(df)))
Then sapply(stems, ...)
library(tidyverse)
reduce(.init = df, .x = letters[1:5], .f = ~ {
mutate(.x, '{.y}3' := get(str_c(.y, 1)) + get(str_c(.y, 2)))
})
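If every variable follows the stem1/stem2 naming pattern, the sums can also be sketched in base R without a loop, because adding two data frames with matching dimensions works element by element. This is a minimal illustration, not any answerer's method; set.seed(42) is added only to make the example reproducible, and stems is an illustrative name.

```r
set.seed(42)  # assumption: fixed seed, only for reproducibility of the example
df <- data.frame(replicate(10, sample(1:10)))
colnames(df) <- c("a1", "a2", "b1", "b2", "c1", "c2", "d1", "d2", "e1", "e2")

# Derive the stems ("a", "b", ...) by stripping the trailing digits.
stems <- unique(sub("[0-9]+$", "", names(df)))

# Select all "*1" columns and all "*2" columns in the same stem order,
# add them element-wise, and store the result as the new "*3" columns.
df[paste0(stems, 3)] <- df[paste0(stems, 1)] + df[paste0(stems, 2)]
head(df)
```

With more than 100 variables nothing else needs to change, provided each stem has both a 1 and a 2 column.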
