dplyr: How to match a value from multiple columns?

I have a dataset with N columns and an additional column containing a column number. I want to add another column that returns, for each row, the value taken from the column with that number.
Col 1  …  Col 14  …  Col n  Number of column  Value
a1     …  a14     …  an     14                a14
b1     …  b14     …  bn     8                 b8
c1     …  c14     …  cn     1                 c1
Such an operation can be done with a for loop, but how can it be done in dplyr? Thank you!

Base R option -
df$Value <- df[cbind(1:nrow(df), df$n)]
df
# col1 col2 col3 n Value
#1 1 6 11 1 1
#2 2 7 12 2 7
#3 3 8 13 3 13
#4 4 9 14 3 14
#5 5 10 15 2 10
In dplyr -
library(dplyr)
df %>% rowwise() %>% mutate(Value = c_across(everything())[n])
data
df <- data.frame(col1 = 1:5, col2 = 6:10, col3 = 11:15, n = c(1, 2, 3, 3, 2))
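The base R line relies on matrix indexing: a two-column matrix of (row, column) positions picks exactly one cell per row. A minimal sketch of the same idea, restricted to the value columns so that the n column itself is never part of the lookup (the names value_cols and idx are only for illustration):

```r
df <- data.frame(col1 = 1:5, col2 = 6:10, col3 = 11:15, n = c(1, 2, 3, 3, 2))

# columns that hold the candidate values (illustrative name)
value_cols <- c("col1", "col2", "col3")

# one (row, column) pair per row of df
idx <- cbind(seq_len(nrow(df)), df$n)

# index the value columns as a matrix with those pairs
df$Value <- as.matrix(df[value_cols])[idx]
df$Value
```

Note that as.matrix() coerces all selected columns to a common type, so this is best suited to columns of the same type.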

Related

How to get this outcome in R

I have multiple data frames. Here I have demonstrated 3 data frames with different rows.
dat1<-read.table (text=" D Size1
A1 12
A2 18
A3 16
A4 14
A5 11
A6 0
Value1 25
Score1 30
", header=TRUE)
dat2<-read.table (text=" D Size2
S12 5
S13 9
S14 11
S15 12
S16 12
Value2 40
Score2 45
", header=TRUE)
dat3<-read.table (text=" D Size2
S17 0
S19 1
S22 2
S33 1
Value3 22
Score3 60
", header=TRUE)
I want to get the following outcome:
D Value Score
1 25 30
2 40 45
3 22 60
I need to get a data frame containing only Value and Score.
We may have to filter the rows after binding the datasets into a single data frame, and then use pivot_wider to reshape back to wide format
library(dplyr)
library(tidyr)
library(stringr)
bind_rows(dat1, dat2, dat3) %>%
filter(str_detect(D, '(Value|Score)\\d+')) %>%
separate(D, into = c("colnm", "D"), sep = "(?<=[a-z](?=\\d))") %>%
group_by(colnm, D) %>%
transmute(Score = coalesce(Size1, Size2)) %>%
ungroup %>%
pivot_wider(names_from = colnm, values_from = Score)
-output
# A tibble: 3 × 3
D Value Score
<chr> <int> <int>
1 1 25 30
2 2 40 45
3 3 22 60
Or an option in base R
do.call(rbind, Map(function(dat, y) data.frame(D = y,
Value = dat[[2]][grepl('Value', dat$D)],
Score = dat[[2]][grepl('Score', dat$D)]), list(dat1, dat2, dat3), 1:3))
D Value Score
1 1 25 30
2 2 40 45
3 3 22 60

Subsetting dataframe in grouped data

I have a dataframe including a column of factors that I would like to subset to select every nth row, after grouping by factor level. For example,
my_df <- data.frame(col1 = c(1:12), col2 = rep(c("A","B", "C"), 4))
my_df
col1 col2
1 1 A
2 2 B
3 3 C
4 4 A
5 5 B
6 6 C
7 7 A
8 8 B
9 9 C
10 10 A
11 11 B
12 12 C
Subsetting to select every 2nd row should yield my_new_df as,
col1 col2
1 4 A
2 10 A
3 5 B
4 11 B
5 6 C
6 12 C
I tried in dplyr:
my_df %>% group_by(col2) %>%
my_df[seq(2, nrow(my_df), 2), ] -> my_new_df
I get an error:
Error: Can't subset columns that don't exist.
x Locations 4, 6, 8, 10, and 12 don't exist.
ℹ There are only 2 columns.
To see if the nrow function was a problem, I tried using the number directly. So,
my_df %>% group_by(col2) %>%
my_df[seq(2, 4, 2), ] -> my_new_df
Also gave an error,
Error: Can't subset columns that don't exist.
x Location 4 doesn't exist.
ℹ There are only 2 columns.
Run `rlang::last_error()` to see where the error occurred.
My expectation was that it would run the subsetting on each group of data and then combine them into 'my_new_df'. My understanding of how group_by works is clearly wrong, but I am stuck on how to move past this error. Any help would be much appreciated.
Try:
my_df %>%
group_by(col2)%>%
slice(seq(from = 2, to = n(), by = 2))
# A tibble: 6 x 2
# Groups: col2 [3]
col1 col2
<int> <chr>
1 4 A
2 10 A
3 5 B
4 11 B
5 6 C
6 12 C
You might want to ungroup after slicing if you want to do other operations not based on col2.
Here is a data.table option:
library(data.table)
data <- as.data.table(my_df)
data[(rowid(col2) %% 2) == 0]
col1 col2
1: 4 A
2: 5 B
3: 6 C
4: 10 A
5: 11 B
6: 12 C
Or base R:
my_df[as.logical(with(my_df, ave(col1, col2, FUN = function(x)
seq_along(x) %% 2 == 0))), ]
col1 col2
4 4 A
5 5 B
6 6 C
10 10 A
11 11 B
12 12 C
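For context on why the original attempt fails: group_by() only attaches grouping information to the data; it is the dplyr verbs that follow (slice(), filter(), summarise(), ...) that act per group, whereas base `[` indexing does not work per group. A minimal sketch of an equivalent filter()-based variant, assuming dplyr is installed:

```r
library(dplyr)

my_df <- data.frame(col1 = 1:12, col2 = rep(c("A", "B", "C"), 4))

# row_number() restarts at 1 inside each group, so this keeps the
# 2nd, 4th, ... row of every col2 level
my_new_df <- my_df %>%
  group_by(col2) %>%
  filter(row_number() %% 2 == 0) %>%
  ungroup()

my_new_df
```

Unlike slice() on grouped data, filter() keeps the rows in their original order (col1 = 4, 5, 6, 10, 11, 12); adding arrange(col2) would give the ordering shown in the question.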

How to merge two dataframes with replacement/creation of rows depending on existence in first df?

I have two dataframes df1 and df2, I am looking for the simplest operation to get df3.
I want to replace rows in df1 with rows from df2 if the ids match (so rbind.fill is not a solution), and append rows from df2 where the id does not exist in df1, but only for columns that exist in df2.
I guess I could use several joins and antijoins and then merge but I wonder if there already exists a function for that operation.
df1 <- data.frame(id = 1:5, c1 = 11:15, c2 = 16:20, c3 = 21:25)
df2 <- data.frame(id = 4:7, c1 = 1:4, c2 = 5:8)
df1
id c1 c2 c3
1 11 16 21
2 12 17 22
3 13 18 23
4 14 19 24
5 15 20 25
df2
id c1 c2
4 1 5
5 2 6
6 3 7
7 4 8
df3
id c1 c2 c3
1 11 16 21
2 12 17 22
3 13 18 23
4 1 5 24
5 2 6 25
6 3 7 NULL
7 4 8 NULL
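We can use {powerjoin}: make a full join and deal with the conflicts using coalesce_xy (which is really dplyr::coalesce). Note that coalesce_xy prefers the df1 value whenever both sides have one, so in the output below ids 4 and 5 keep df1's c1 and c2; if df2 should win, as in the desired df3, the values have to be coalesced in the opposite order (the package provides coalesce_yx for this):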
library(powerjoin)
df1 <- data.frame(id = 1:5, c1 = 11:15, c2 = 16:20, c3 = 21:25)
df2 <- data.frame(id = 4:7, c1 = 1:4, c2 = 5:8)
power_full_join(df1, df2, by = "id", conflict = coalesce_xy)
# id c1 c2 c3
# 1 1 11 16 21
# 2 2 12 17 22
# 3 3 13 18 23
# 4 4 14 19 24
# 5 5 15 20 25
# 6 6 3 7 NA
# 7 7 4 8 NA
I ended up with :
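library(dplyr)
special_combine <- function(df1, df2){
  # columns of df1 that also exist in df2
  df1_int <- df1[, colnames(df1) %in% colnames(df2)]
  # id plus the columns that exist only in df1
  df1_ext <- df1[, c("id", colnames(df1)[!colnames(df1) %in% colnames(df2)])]
  # stack both; for duplicated ids keep the last occurrence, i.e. the df2 row
  df3 <- bind_rows(df1_int, df2)
  df3 <- df3[!duplicated(df3$id, fromLast=TRUE), ] %>%
    dplyr::left_join(df1_ext, by="id") %>%
    dplyr::arrange(id)
  df3
}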
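As for whether a ready-made function exists: dplyr's rows_upsert() appears to cover this case. The following is a minimal sketch, assuming dplyr 1.0.0 or later; it updates rows whose id already exists in df1 and appends the others, and it requires every column of df2 to exist in df1, which holds here:

```r
library(dplyr)

df1 <- data.frame(id = 1:5, c1 = 11:15, c2 = 16:20, c3 = 21:25)
df2 <- data.frame(id = 4:7, c1 = 1:4, c2 = 5:8)

# update rows with matching ids, insert rows with new ids;
# columns missing from df2 (here c3) are left as-is or filled with NA
df3 <- rows_upsert(df1, df2, by = "id")
df3
```

Rows 4 and 5 take c1 and c2 from df2 while keeping their c3, and ids 6 and 7 are appended with NA in c3.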

Self defined function with dplyr functions won't accept argument values [duplicate]

This question already has an answer here:
dplyr with name of columns in a function
(1 answer)
Closed 4 years ago.
I'm trying to use dplyr's mutate_at to subtract one numeric column (A2) from another corresponding numeric column (A1). I have multiple column pairs (B, C, D, E, ...) and several data frames (df1 to df99) that need the same operation, so I want to write a function.
df1 <- df1 %>% mutate_at(.vars = vars(A1), .funs = funs(remainder = .-A2))
Works fine, however when I try and write a function to perform this:
REMAINDER <- function(df, numer, denom){
df <- df %>% mutate_at(.vars = vars(numer), .funs = funs(remainder = .-denom))
return(df)
}
With arguments df1 <- REMAINDER(df1, A1, A2)
I get the error Error in mutate_impl(.data, dots) :
Evaluation error: non-numeric argument to binary operator.
Which I don't understand as I just manually called the line of code without a function and my columns are numeric.
The vignette Programming with dplyr explains in great detail what to do:
library(dplyr)
REMAINDER <- function(df, numer, denom) {
numer <- enquo(numer)
denom <- enquo(denom)
df %>% mutate_at(.vars = vars(!! numer), .funs = funs(remainder = . - !! denom))
}
df1 <- data_frame(A1 = 11:13, A2 = 3:1, B1 = 21:23, B2 = 8:6)
REMAINDER(df1, A1, A2)
# A tibble: 3 x 5
A1 A2 B1 B2 remainder
<int> <int> <int> <int> <int>
1 11 3 21 8 8
2 12 2 22 7 10
3 13 1 23 6 12
REMAINDER(df1, B1, B2)
# A tibble: 3 x 5
A1 A2 B1 B2 remainder
<int> <int> <int> <int> <int>
1 11 3 21 8 13
2 12 2 22 7 15
3 13 1 23 6 17
Naming the result column
The OP wants to update df1 and he wants to apply this operation to other columns as well.
Unfortunately, the REMAINDER() function as it is currently defined will overwrite the result column:
df1
# A tibble: 3 x 4
A1 A2 B1 B2
<int> <int> <int> <int>
1 11 3 21 8
2 12 2 22 7
3 13 1 23 6
df1 <- REMAINDER(df1, A1, A2)
df1
# A tibble: 3 x 5
A1 A2 B1 B2 remainder
<int> <int> <int> <int> <int>
1 11 3 21 8 8
2 12 2 22 7 10
3 13 1 23 6 12
df1 <- REMAINDER(df1, B1, B2)
df1
# A tibble: 3 x 5
A1 A2 B1 B2 remainder
<int> <int> <int> <int> <int>
1 11 3 21 8 13
2 12 2 22 7 15
3 13 1 23 6 17
The function can be modified so that the result column is individually named:
REMAINDER <- function(df, numer, denom) {
numer <- enquo(numer)
denom <- enquo(denom)
result_name <- paste0("remainder_", quo_name(numer), "_", quo_name(denom))
df %>% mutate_at(.vars = vars(!! numer),
.funs = funs(!! result_name := . - !! denom))
}
Now, calling REMAINDER() twice on different columns and replacing df1 after each call, we get
df1 <- REMAINDER(df1, A1, A2)
df1 <- REMAINDER(df1, B1, B2)
df1
# A tibble: 3 x 6
A1 A2 B1 B2 remainder_A1_A2 remainder_B1_B2
<int> <int> <int> <int> <int> <int>
1 11 3 21 8 8 13
2 12 2 22 7 10 15
3 13 1 23 6 12 17
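In more recent dplyr versions, mutate_at() and funs() are superseded or deprecated. The same function can be sketched with the embrace operator `{{ }}` and a plain mutate(), assuming dplyr 1.0.0 or later with rlang available; the lower-case name remainder() is only chosen here to keep it apart from the version above:

```r
library(dplyr)

remainder <- function(df, numer, denom) {
  # build the result column name from the unquoted argument names
  result_name <- paste0("remainder_",
                        rlang::as_label(rlang::enquo(numer)), "_",
                        rlang::as_label(rlang::enquo(denom)))
  # {{ }} forwards the unquoted column names into mutate()
  df %>% mutate(!!result_name := {{ numer }} - {{ denom }})
}

df1 <- data.frame(A1 = 11:13, A2 = 3:1, B1 = 21:23, B2 = 8:6)
df1 <- remainder(df1, A1, A2)
df1 <- remainder(df1, B1, B2)
df1
```

This should produce the same remainder_A1_A2 and remainder_B1_B2 columns as the mutate_at() version.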
I have used this suggestion in order to subtract pairs of columns in a list of data frames. My example has only 3 pairs of columns in each of the two data frames, but it also works with a higher number of columns and data frames.
library(data.table)
dt <- data.table(A1 = round(runif(3),1), A2 = round(runif(3),1),
B1 = round(runif(3),1), B2 = round(runif(3),1),
C1 =round(runif(3),1), C2 =round(runif(3),1))
dt = list(dt,dt+dt)
lapply(seq_along(dt), function(z) {
dt[[z]][, lapply(1:(ncol(.SD)/2), function(x) (.SD[[2*x-1]] - .SD[[2*x]]))]
})

Aggregate dataframe in rolling blocks of 3 rows

I have the following data frame as an example
df <- data.frame(score=letters[1:15], total1=1:15, total2=16:30)
> df
score total1 total2
1 a 1 16
2 b 2 17
3 c 3 18
4 d 4 19
5 e 5 20
6 f 6 21
7 g 7 22
8 h 8 23
9 i 9 24
10 j 10 25
11 k 11 26
12 l 12 27
13 m 13 28
14 n 14 29
15 o 15 30
I would like to aggregate my data frame by summing over groups of rows that have different names, i.e.
groups sum1 sum2
'a-b-c' 6 51
'c-d-e' 21 60
etc
All the given answers to this kind of question assume that the strings repeat in the row.
The usual aggregate function that I use to obtain the summary delivers a different result:
aggregate(df$total1, by=list(sum1=df$score %in% c('a','b','c'), sum2=df$score %in% c('d','e','f')), FUN=sum)
sum1 sum2 x
1 FALSE FALSE 99
2 TRUE FALSE 6
3 FALSE TRUE 15
If you want a tidyverse solution, here is one possibility:
library(dplyr)
df <- data.frame(score=letters[1:15], total1=1:15, total2=16:30)
df %>%
mutate(groups = case_when(
score %in% c("a","b","c") ~ "a-b-c",
score %in% c("d","e","f") ~ "d-e-f"
)) %>%
group_by(groups) %>%
summarise_if(is.numeric, sum)
returns
# A tibble: 3 x 3
groups total1 total2
<chr> <int> <int>
1 a-b-c 6 51
2 d-e-f 15 60
3 <NA> 99 234
Add a "groups" column with the category value.
df$groups = NA
and then define each group like this:
df$groups[df$score=="a" | df$score=="b" | df$score=="c" ] = "a-b-c"
Finally aggregate by that column.
Here's a solution that works for any sized data frame.
df <- data.frame(score=letters[1:15], total1=1:15, total2=16:30)
# I'm adding a row to demonstrate that the grouping pattern works when the
# number of rows is not equally divisible by 3.
df <- rbind(df, data.frame(score = letters[16], total1 = 16, total2 = 31))
# A vector that represents the correct groupings for the data frame.
groups <- c(rep(1:floor(nrow(df) / 3), each = 3),
rep(floor(nrow(df) / 3) + 1, nrow(df) - length(1:(nrow(df) / 3)) * 3))
# Your method of aggregation by `groups`. I'm going to use `data.table`.
require(data.table)
dt <- as.data.table(df)
dt[, group := groups]
aggDT <- dt[, list(score = paste0(score, collapse = "-"),
total1 = sum(total1), total2 = sum(total2)), by = group][
, group := NULL]
aggDT
score total1 total2
1: a-b-c 6 51
2: d-e-f 15 60
3: g-h-i 24 69
4: j-k-l 33 78
5: m-n-o 42 87
6: p 16 31
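The grouping vector can also be built more compactly: dividing the row index by the block size and rounding up assigns rows 1 to 3 to block 1, rows 4 to 6 to block 2, and so on, including a shorter final block. A minimal base R sketch of this idea (the column name block and the variable block_size are illustrative):

```r
df <- data.frame(score = letters[1:16], total1 = 1:16, total2 = 16:31)

block_size <- 3
# block id per row: 1,1,1,2,2,2,...; the last block may be shorter
df$block <- ceiling(seq_len(nrow(df)) / block_size)

# sum the numeric columns and paste the labels within each block
sums   <- aggregate(cbind(total1, total2) ~ block, data = df, FUN = sum)
labels <- aggregate(score ~ block, data = df, FUN = paste, collapse = "-")

result <- merge(labels, sums, by = "block")
result
```

The same block id could equally be passed to group_by() in dplyr or to by = in data.table.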
