Suppose I have a data frame like this:
df = data.frame(preA = c(1,2,3),preB = c(3,4,5),postA = c(6,7,8),postB = c(9,8,4))
I want to add columns having column-wise differences, that is:
diffA = postA - preA
diffB = postB - preB
and so on...
Is there an efficient way to do this in tidyverse?
The way to go with dplyr and tidyr:
library(dplyr)
library(tidyr)
df %>%
mutate(id = 1:n()) %>%
pivot_longer(-id,
names_to = c("pre_post", ".value"),
names_pattern = "(pre|post)(.*)") %>%
group_by(id) %>%
mutate(across(A:B, diff, .names = "diff{.col}")) %>%
pivot_wider(names_from = pre_post, values_from = c(A, B),
names_glue = '{pre_post}{.value}') %>%
select(id, starts_with("pre"), starts_with("post"), starts_with("diff"))
# id preA preB postA postB diffA diffB
# 1 1 1 3 6 9 5 6
# 2 2 2 4 7 8 5 4
# 3 3 3 5 8 4 5 -1
A shorter but less flexible way is with dplyover::across2:
library(dplyr)
library(dplyover)
df %>%
#relocate(sort(colnames(.))) %>%
mutate(across2(starts_with("post"), starts_with("pre"), `-`,
.names = "diff{idx}"))
# preA preB postA postB diff1 diff2
# 1 1 3 6 9 5 6
# 2 2 4 7 8 5 4
# 3 3 5 8 4 5 -1
You can do this with two uses of across(), creating new variables with the first use and subtracting the second. This also assumes your columns are in order.
df %>%
mutate(across(starts_with("post"), .names = "diff{sub('post', '', .col)}") - across(starts_with("pre")))
preA preB postA postB diffA diffB
1 1 3 6 9 5 6
2 2 4 7 8 5 4
3 3 5 8 4 5 -1
A few more solutions. My favourite is the first one demonstrated here - I think it's the cleanest and most debuggable:
# Setup:
library(dplyr, warn.conflicts = FALSE)
library(glue)
df <- data.frame(
preA = c(1,2,3),
preB = c(3,4,5),
postA = c(6,7,8),
postB = c(9,8,4)
)
Method 1: Using expressions:
This is my favourite approach. I think it's very readable, and I think it should be reasonably fast compared to solutions using across():
cols <- c("A", "B")
exprs <- glue("post{cols} - pre{cols}")
names(exprs) <- glue("diff{cols}")
df |>
mutate(!!!rlang::parse_exprs(exprs))
#> preA preB postA postB diffA diffB
#> 1 1 3 6 9 5 6
#> 2 2 4 7 8 5 4
#> 3 3 5 8 4 5 -1
Method 2: Using mutate() + across() + get():
Personally, I don't like this sort of thing because I think it's really hard to read:
df |>
mutate(across(
starts_with("post"),
~ .x - get(stringr::str_replace_all(cur_column(), "^post", "pre")),
.names = "diff{stringr::str_remove(.col, '^post')}"
))
#> preA preB postA postB diffA diffB
#> 1 1 3 6 9 5 6
#> 2 2 4 7 8 5 4
#> 3 3 5 8 4 5 -1
Method 3: Using base subsetting:
The main advantage here is that you don't need any packages (you can use paste0() instead of glue()), and in my opinion it's also pretty readable. The downside is that it doesn't play well with |>:
cols <- c("A", "B")
df2 <- df
df2[glue("diff{cols}")] <- df2[glue("post{cols}")] - df2[glue("pre{cols}")]
df2
#> preA preB postA postB diffA diffB
#> 1 1 3 6 9 5 6
#> 2 2 4 7 8 5 4
#> 3 3 5 8 4 5 -1
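If there are many pre/post pairs, typing out cols by hand gets tedious. Here is a minimal base R sketch that derives the suffixes from the column names, assuming every pre* column has a matching post* column (the stopifnot() guard is my addition, not part of the answer above):

```r
df <- data.frame(
  preA = c(1, 2, 3),
  preB = c(3, 4, 5),
  postA = c(6, 7, 8),
  postB = c(9, 8, 4)
)

# Derive the suffixes ("A", "B", ...) from the pre* columns
# instead of writing them out by hand.
cols <- sub("^pre", "", grep("^pre", names(df), value = TRUE))

# Added assumption: every pre* column needs a post* partner.
stopifnot(all(paste0("post", cols) %in% names(df)))

df2 <- df
df2[paste0("diff", cols)] <- df2[paste0("post", cols)] - df2[paste0("pre", cols)]
df2
```

The same cols vector could also be fed into Method 1 in place of the hard-coded c("A", "B").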
I have 50 columns of names, but here I have presented only 4 columns for convenience.
Name1 Name2 Name3 Name4
Rose,Ali Van,Hall Ghol,Dam Murr,kate
Camp,Laura Ka,Klo Dan,Dan Ali,Hoss
Rose,Ali Van,Hall Ghol,Dam Kol,Kan
Murr,Kate Ismal, Ismal Sian,Rozi Nas,Ami
Ghol,Dam Ka,Klo Rose,Ali Nor,Ko
Murr,Kate Ismal, Ismal Dan,Dan Nas,Ami
I want to assign a number to each person, column by column, as one continuous sequence.
For example, in Name1 we get the numbers 1-4. Repeated names get the same number.
In Name2, the numbering should start from 5, and so on. This will give me the following table:
Assign1 Assign2 Assign3 Assign4
1 5 8 12
2 6 9 13
1 5 8 14
3 7 10 15
4 6 11 17
3 7 9 15
I would like to do this without an explicit loop, e.g. with something like sapply(dat, function(x) match(x, unique(x))).
Using dplyr or tidyverse would be great.
A tidyverse solution with purrr::accumulate():
library(tidyverse)
df %>%
mutate(as_tibble(
accumulate(across(Name1:Name4, ~ match(.x, unique(.x))), ~ .y + max(.x))
))
# Name1 Name2 Name3 Name4
# 1 1 5 8 12
# 2 2 6 9 13
# 3 1 5 8 14
# 4 3 7 10 15
# 5 4 6 11 16
# 6 3 7 9 15
Because the values in each column depend on the values in the previous column, the calculations have to be done sequentially. This is probably most succinctly achieved by a loop. Remember that lapply and sapply are simply loops-in-disguise, and won't be quicker than an explicit loop.
Note that your expected output has a mistake in it (there is a 17 which should be 16):
output <- setNames(df, paste0('Assign', seq_along(df)))
for(i in seq_along(output)) {
output[[i]] <- match(output[[i]], unique(output[[i]]))
if(i > 1) output[[i]] <- output[[i]] + max(output[[i - 1]])
}
output
#> Assign1 Assign2 Assign3 Assign4
#> 1 1 5 8 12
#> 2 2 6 9 13
#> 3 1 5 8 14
#> 4 3 7 10 15
#> 5 4 6 11 16
#> 6 3 7 9 15
Edit
If you really want it without an explicit loop, you can do:
res <- sapply(seq_along(df), \(i) match(df[[i]], unique(df[[i]])))
(res + t(replicate(nrow(df), head(c(0, cumsum(apply(res, 2, max))), -1)))) |>
as.data.frame() |>
setNames(paste0('Assign', seq_along(df)))
#> Assign1 Assign2 Assign3 Assign4
#> 1 1 5 8 12
#> 2 2 6 9 13
#> 3 1 5 8 14
#> 4 3 7 10 15
#> 5 4 6 11 16
#> 6 3 7 9 15
Created on 2023-01-13 with reprex v2.0.2
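The offset idea in the Edit can perhaps be written a little more readably. The sketch below assumes (as holds for match(x, unique(x))) that the largest ID in a column equals its number of distinct names, so each column's offset is just the count of distinct names in all earlier columns. The names n_unique, offsets and out are mine, chosen for illustration:

```r
df <- data.frame(
  Name1 = c("Rose,Ali", "Camp,Laura", "Rose,Ali", "Murr,Kate", "Ghol,Dam", "Murr,Kate"),
  Name2 = c("Van,Hall", "Ka,Klo", "Van,Hall", "Ismal, Ismal", "Ka,Klo", "Ismal, Ismal"),
  Name3 = c("Ghol,Dam", "Dan,Dan", "Ghol,Dam", "Sian,Rozi", "Rose,Ali", "Dan,Dan"),
  Name4 = c("Murr,kate", "Ali,Hoss", "Kol,Kan", "Nas,Ami", "Nor,Ko", "Nas,Ami")
)

# Number of distinct names in each column.
n_unique <- vapply(df, function(x) length(unique(x)), integer(1))

# Offset for a column = distinct names in all earlier columns.
offsets <- c(0L, head(cumsum(n_unique), -1))

# Within-column IDs shifted by the column's offset.
ids <- Map(function(x, off) match(x, unique(x)) + off, df, offsets)

out <- setNames(as.data.frame(ids), paste0("Assign", seq_along(df)))
out
```

Map() is still a loop in disguise, so this is about readability rather than speed.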
Data taken from question in reproducible format
df <- structure(list(Name1 = c("Rose,Ali", "Camp,Laura", "Rose,Ali",
"Murr,Kate", "Ghol,Dam", "Murr,Kate"), Name2 = c("Van,Hall",
"Ka,Klo", "Van,Hall", "Ismal, Ismal", "Ka,Klo", "Ismal, Ismal"
), Name3 = c("Ghol,Dam", "Dan,Dan", "Ghol,Dam", "Sian,Rozi",
"Rose,Ali", "Dan,Dan"), Name4 = c("Murr,kate", "Ali,Hoss", "Kol,Kan",
"Nas,Ami", "Nor,Ko", "Nas,Ami")), row.names = c(NA, -6L),
class = "data.frame")
Here is a tidyverse approach:
First paste the column name onto each string in every column, so the values can be sorted by column later. Then pivot the data into a two-column data frame so that IDs can be assigned with match(). Finally pivot it back to a wide format and unnest the list columns.
library(tidyverse)
df %>%
mutate(across(everything(), ~ paste0(.x, "_", cur_column()))) %>%
pivot_longer(everything(), names_to = "ab", values_to = "a") %>%
arrange(ab) %>%
mutate(b = match(a, unique(a)), .keep = "unused") %>%
pivot_wider(names_from = "ab", values_from = "b") %>%
unnest(everything())
# A tibble: 6 × 4
Name1 Name2 Name3 Name4
<int> <int> <int> <int>
1 1 5 8 12
2 2 6 9 13
3 1 5 8 14
4 3 7 10 15
5 4 6 11 16
6 3 7 9 15
Data
Taken from @Allan Cameron.
df <- structure(list(Name1 = c("Rose,Ali", "Camp,Laura", "Rose,Ali",
"Murr,Kate", "Ghol,Dam", "Murr,Kate"), Name2 = c("Van,Hall",
"Ka,Klo", "Van,Hall", "Ismal, Ismal", "Ka,Klo", "Ismal, Ismal"
), Name3 = c("Ghol,Dam", "Dan,Dan", "Ghol,Dam", "Sian,Rozi",
"Rose,Ali", "Dan,Dan"), Name4 = c("Murr,kate", "Ali,Hoss", "Kol,Kan",
"Nas,Ami", "Nor,Ko", "Nas,Ami")), row.names = c(NA, -6L),
class = "data.frame")
Update: The approach below is not ideal because the IDs are not unique. Sorry.
Using a lookup table with tidyverse:
library(dplyr)
library(tidyr)
library(tibble) # for deframe()
lookup <-
df |>
pivot_longer(everything()) |>
distinct() |>
arrange(name) |>
transmute(name = value, value = row_number()) |>
deframe()
df |>
mutate(across(everything(), ~ recode(., !!!lookup)))
Output:
Name1 Name2 Name3 Name4
1 1 5 4 12
2 2 6 9 13
3 1 5 4 14
4 3 7 10 15
5 4 6 1 16
6 3 7 9 15
Data from @Allan Cameron, thanks.
I am trying to fill NA values of my dataframe. However, I would like to fill them based on the first value of each group.
#> df = data.frame(
group = c(rep("A", 4), rep("B", 4)),
val = c(1, 2, NA, NA, 4, 3, NA, NA)
)
#> df
group val
1 A 1
2 A 2
3 A NA
4 A NA
5 B 4
6 B 3
7 B NA
8 B NA
#> fill(df, val, .direction = "down")
group val
1 A 1
2 A 2
3 A 2 # -> should be 1
4 A 2 # -> should be 1
5 B 4
6 B 3
7 B 3 # -> should be 4
8 B 3 # -> should be 4
Can I do this with tidyr::fill()? Or is there another (more or less elegant) way to do this? I need to use this in a longer chain (%>%) operation.
Thank you very much!
Use tidyr::replace_na() and dplyr::first() (or val[[1]]) inside a grouped mutate():
library(dplyr)
library(tidyr)
df %>%
group_by(group) %>%
mutate(val = replace_na(val, first(val))) %>%
ungroup()
#> # A tibble: 8 × 2
#> group val
#> <chr> <dbl>
#> 1 A 1
#> 2 A 2
#> 3 A 1
#> 4 A 1
#> 5 B 4
#> 6 B 3
#> 7 B 4
#> 8 B 4
PS - @richarddmorey points out the case where the first value for a group is NA. The above code would leave every NA in that group as NA. If you'd like to instead replace with the first non-missing value per group, you could subset the vector using !is.na():
df %>%
group_by(group) %>%
mutate(val = replace_na(val, first(val[!is.na(val)]))) %>%
ungroup()
Created on 2022-11-17 with reprex v2.0.2
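For comparison, here is a base R sketch of the same "first non-missing value per group" fill using ave(). The helper name fill_first is hypothetical, and the example data is altered from the question so that group A starts with an NA:

```r
df <- data.frame(
  group = c(rep("A", 4), rep("B", 4)),
  val = c(NA, 2, NA, 1, 4, 3, NA, NA)
)

# Replace every NA in a vector with its first non-missing value.
# If the vector is all NA, it is returned unchanged.
fill_first <- function(v) {
  v[is.na(v)] <- v[!is.na(v)][1]
  v
}

# ave() applies fill_first within each group and keeps the row order.
df$val <- ave(df$val, df$group, FUN = fill_first)
df
```

Group A becomes 2, 2, 2, 1 and group B becomes 4, 3, 4, 4.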
This should work; it uses dplyr's case_when():
library(dplyr)
df %>%
group_by(group) %>%
mutate(val = case_when(
is.na(val) ~ val[1],
TRUE ~ val
))
Output:
group val
<chr> <dbl>
1 A 1
2 A 2
3 A 1
4 A 1
5 B 4
6 B 3
7 B 4
8 B 4
I am struggling with what is probably an easy question. I have a dataframe of 1 column with n rows (n is a multiple of 3). I would like to add a second column with integers like: 1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,.. How can I achieve this with dplyr as a general solution for any number of rows (always a multiple of 3)?
I tried this:
df <- tibble(Col1 = c(1:12)) %>%
mutate(Col2 = rep(1:4, each=3))
This works. But I would like to have a solution for n rows, each = 3 . Many thanks!
You can specify the each and length.out parameters in rep.
library(dplyr)
tibble(Col1 = c(1:12)) %>%
mutate(Col2 = rep(row_number(), each=3, length.out = n()))
# Col1 Col2
# <int> <int>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 2
# 5 5 2
# 6 6 2
# 7 7 3
# 8 8 3
# 9 9 3
#10 10 4
#11 11 4
#12 12 4
We can use gl
library(dplyr)
df %>%
mutate(col2 = as.integer(gl(n(), 3, n())))
Integer division (%/% 3) over the sequence 0, 1, 2, ... gives 0, 0, 0, 1, 1, 1, ..., so adding 1 produces the desired sequence. This will therefore also work:
df %>% mutate(col2 = 1+ (row_number()-1) %/% 3)
# A tibble: 12 x 2
Col1 col2
<int> <dbl>
1 1 1
2 2 1
3 3 1
4 4 2
5 5 2
6 6 2
7 7 3
8 8 3
9 9 3
10 10 4
11 11 4
12 12 4
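The integer-division idea generalises to any chunk size. A small base R sketch, where chunk_id is a hypothetical helper name and integer arithmetic is used throughout so the result stays an integer vector:

```r
# Label positions 1..n with consecutive chunk numbers of size k.
# (i - 1) %/% k is 0 for the first k positions, 1 for the next k, etc.
chunk_id <- function(n, k) {
  (seq_len(n) - 1L) %/% as.integer(k) + 1L
}

chunk_id(12, 3)
chunk_id(7, 3) # n need not be a multiple of k; the last chunk is just shorter
```

Inside mutate() the same expression could be written with row_number() in place of seq_len(n).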
I'm sure this question has been asked before, but I can't find the answer.
Here's my data:
df <- data.frame(group=c("a","a","a","b","b","c"), value=c(1,2,3,4,5,7))
df
#> group value
#> 1 a 1
#> 2 a 2
#> 3 a 3
#> 4 b 4
#> 5 b 5
#> 6 c 7
I'd like a 3rd column which has the sum of "value" for each "group", like so:
#> group value group_sum
#> 1 a 1 6
#> 2 a 2 6
#> 3 a 3 6
#> 4 b 4 9
#> 5 b 5 9
#> 6 c 7 7
How can I do this with dplyr?
Using dplyr -
df %>%
group_by(group) %>%
mutate(group_sum = sum(value))
Nobody mentioned data.table yet:
library(data.table)
dat <- data.table(df)
dat[, `:=`(sums = sum(value)), group]
Which transforms dat into:
group value sums
1: a 1 6
2: a 2 6
3: a 3 6
4: b 4 9
5: b 5 9
6: c 7 7
left_join(
df,
df %>% group_by(group) %>% summarise(group_sum = sum(value)),
by = c("group")
)
I don't know how to do it in one step, but
df_avg <- df %>% group_by(group) %>% summarize(group_sum=sum(value))
df %>% full_join(df_avg,by="group")
works. (This is basically equivalent to @KeqiangLi's answer.)
ave(), from base R, is useful here too:
df %>% mutate(group_sum=ave(value,group,FUN=sum))
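If you would rather avoid dplyr entirely, ave() can also be used on its own. A minimal sketch with the question's data:

```r
df <- data.frame(
  group = c("a", "a", "a", "b", "b", "c"),
  value = c(1, 2, 3, 4, 5, 7)
)

# ave() computes FUN within each level of the grouping vector and
# returns a vector as long as the input, aligned with the original rows.
df$group_sum <- ave(df$value, df$group, FUN = sum)
df
```

Because the result is already aligned with the rows of df, no join is needed.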
How can I get a dense rank of multiple columns in a dataframe? For example,
# I have:
df <- data.frame(x = c(1,1,1,1,2,2,2,3,3,3),
y = c(1,2,3,4,2,2,2,1,2,3))
# I want:
res <- data.frame(x = c(1,1,1,1,2,2,2,3,3,3),
y = c(1,2,3,4,2,2,2,1,2,3),
r = c(1,2,3,4,5,5,5,6,7,8))
res
x y r
1 1 1 1
2 1 2 2
3 1 3 3
4 1 4 4
5 2 2 5
6 2 2 5
7 2 2 5
8 3 1 6
9 3 2 7
10 3 3 8
My hack approach works for this particular dataset:
df %>%
arrange(x,y) %>%
mutate(r = if_else(y - lag(y,default=0) == 0, 0, 1)) %>%
mutate(r = cumsum(r))
But there must be a more general solution, maybe using functions like dense_rank() or row_number(). But I'm struggling with this.
dplyr solutions are ideal.
Right after posting, I think I found a solution here. In my case, it would be:
mutate(df, r = dense_rank(interaction(x, y, lex.order = TRUE)))
But if you have a better solution, please share.
data.table
data.table has you covered with frank().
library(data.table)
frank(df, x, y, ties.method = 'dense')
[1] 1 2 3 4 5 5 5 6 7 8
You can use df$r <- frank(df, x, y, ties.method = 'dense') to add it as a new column.
tidyr/dplyr
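Another option (though clunkier) is to use tidyr::unite to collapse your columns into one and then apply dplyr::dense_rank. Note that the united column is a character vector, so the ranks follow string order, which can differ from numeric order when the values have different numbers of digits.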
library(tidyverse)
df %>%
# add a single column with all the info
unite(xy, x, y) %>%
cbind(df) %>%
# dense rank on that
mutate(r = dense_rank(xy)) %>%
# now drop the helper col
select(-xy)
You can use cur_group_id:
library(dplyr)
df %>%
group_by(x, y) %>%
mutate(r = cur_group_id())
# x y r
# <dbl> <dbl> <int>
# 1 1 1 1
# 2 1 2 2
# 3 1 3 3
# 4 1 4 4
# 5 2 2 5
# 6 2 2 5
# 7 2 2 5
# 8 3 1 6
# 9 3 2 7
# 10 3 3 8
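For a package-free take on the general problem, here is a base R sketch. The hack in the question only compares y with its previous value, so it would miss a change in x when y stays the same. The function below flags a new rank whenever any key column changes between consecutive sorted rows. The name dense_rank_multi is hypothetical, and the sketch assumes the key columns contain no NA values:

```r
# Dense rank over several key vectors, returned in the original row order.
dense_rank_multi <- function(...) {
  cols <- list(...)
  o <- do.call(order, cols) # sort order by all keys, first key first

  # For each key: TRUE where the sorted value differs from the previous one.
  changed_by_col <- lapply(cols, function(v) {
    s <- v[o]
    c(TRUE, s[-1] != s[-length(s)])
  })

  # A new rank starts wherever at least one key changed.
  changed <- Reduce(`|`, changed_by_col)

  r <- integer(length(o))
  r[o] <- cumsum(changed) # map ranks back to the original positions
  r
}

df <- data.frame(
  x = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  y = c(1, 2, 3, 4, 2, 2, 2, 1, 2, 3)
)
df$r <- dense_rank_multi(df$x, df$y)
df
```

Unlike the question's approach, this does not require the data to be sorted first.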