Recode multiple columns to numbers increasingly in R - r

I have 50 columns of names, but here I have presented only 4 columns for convenience.
Name1 Name2 Name3 Name4
Rose,Ali Van,Hall Ghol,Dam Murr,kate
Camp,Laura Ka,Klo Dan,Dan Ali,Hoss
Rose,Ali Van,Hall Ghol,Dam Kol,Kan
Murr,Kate Ismal, Ismal Sian,Rozi Nas,Ami
Ghol,Dam Ka,Klo Rose,Ali Nor,Ko
Murr,Kate Ismal, Ismal Dan,Dan Nas,Ami
I want to assign a sequence of numbers to each person, column by column.
For example, in Name1 we get the numbers 1-4, and repeated names get the same number.
In Name2, the numbering should start from 5, and so on. This will give me the following table:
Assign1 Assign2 Assign3 Assign4
1 5 8 12
2 6 9 13
1 5 8 14
3 7 10 15
4 6 11 17
3 7 9 15
I would like to do this without an explicit loop, e.g. with something along the lines of sapply(dat, function(x) match(x, unique(x))).
Using dplyr or tidyverse would be great.

A tidyverse solution with purrr::accumulate():
library(tidyverse)
df %>%
mutate(as_tibble(
accumulate(across(Name1:Name4, ~ match(.x, unique(.x))), ~ .y + max(.x))
))
# Name1 Name2 Name3 Name4
# 1 1 5 8 12
# 2 2 6 9 13
# 3 1 5 8 14
# 4 3 7 10 15
# 5 4 6 11 16
# 6 3 7 9 15
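For readers unfamiliar with accumulate(), the same left-to-right shifting can be sketched in base R with Reduce(accumulate = TRUE). This is only an illustration, not the code above: the three vectors are the per-column match() ids for Name1 to Name3 written out by hand, and the names ids and shifted are made up for the example.

```r
# Per-column ids as produced by match(x, unique(x)) for Name1, Name2, Name3
ids <- list(
  c(1, 2, 1, 3, 4, 3),
  c(1, 2, 1, 3, 2, 3),
  c(1, 2, 1, 3, 4, 2)
)

# Each step adds the maximum of the previous (already shifted) column
# to the current column, so the numbering continues where it left off
shifted <- Reduce(function(prev, cur) cur + max(prev), ids, accumulate = TRUE)

shifted[[2]]  # 5 6 5 7 6 7
shifted[[3]]  # 8 9 8 10 11 9
```

The first element is returned unchanged, which is why Name1 keeps the ids 1 to 4.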

Because the values in each column depend on the values in the previous column, the calculations have to be done sequentially. This is probably most succinctly achieved by a loop. Remember that lapply and sapply are simply loops-in-disguise, and won't be quicker than an explicit loop.
Note that your expected output contains a mistake: the 17 in the fifth row should be 16.
output <- setNames(df, paste0('Assign', seq_along(df)))
for(i in seq_along(output)) {
output[[i]] <- match(output[[i]], unique(output[[i]]))
if(i > 1) output[[i]] <- output[[i]] + max(output[[i - 1]])
}
output
#> Assign1 Assign2 Assign3 Assign4
#> 1 1 5 8 12
#> 2 2 6 9 13
#> 3 1 5 8 14
#> 4 3 7 10 15
#> 5 4 6 11 16
#> 6 3 7 9 15
Edit
If you really want it without an explicit loop, you can do:
res <- sapply(seq_along(df), \(i) match(df[[i]], unique(df[[i]])))
res + t(replicate(nrow(df), head(c(0, cumsum(apply(res, 2, max))), -1))) |>
as.data.frame() |>
setNames(paste0('Assign', seq_along(df)))
#> Assign1 Assign2 Assign3 Assign4
#> 1 1 5 8 12
#> 2 2 6 9 13
#> 3 1 5 8 14
#> 4 3 7 10 15
#> 5 4 6 11 16
#> 6 3 7 9 15
Created on 2023-01-13 with reprex v2.0.2
Data taken from question in reproducible format
df <- structure(list(Name1 = c("Rose,Ali", "Camp,Laura", "Rose,Ali",
"Murr,Kate", "Ghol,Dam", "Murr,Kate"), Name2 = c("Van,Hall",
"Ka,Klo", "Van,Hall", "Ismal, Ismal", "Ka,Klo", "Ismal, Ismal"
), Name3 = c("Ghol,Dam", "Dan,Dan", "Ghol,Dam", "Sian,Rozi",
"Rose,Ali", "Dan,Dan"), Name4 = c("Murr,kate", "Ali,Hoss", "Kol,Kan",
"Nas,Ami", "Nor,Ko", "Nas,Ami")), row.names = c(NA, -6L),
class = "data.frame")
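To unpack the loop-free version, here is a sketch of the same offset idea in smaller steps, using sweep() instead of replicate(). It assumes only the first two columns of the data to keep it short; the names ids, n_unique, offsets and result are chosen for the example.

```r
df <- data.frame(
  Name1 = c("Rose,Ali", "Camp,Laura", "Rose,Ali", "Murr,Kate", "Ghol,Dam", "Murr,Kate"),
  Name2 = c("Van,Hall", "Ka,Klo", "Van,Hall", "Ismal, Ismal", "Ka,Klo", "Ismal, Ismal")
)

# Step 1: number the names within each column independently (each starts at 1)
ids <- sapply(df, function(x) match(x, unique(x)))

# Step 2: how many distinct names each column has
n_unique <- apply(ids, 2, max)

# Step 3: each column is shifted by the total count of all columns before it
offsets <- head(c(0, cumsum(n_unique)), -1)

# Step 4: add the offset of column j to every entry of column j
result <- sweep(ids, 2, offsets, `+`)
result[, "Name2"]  # 5 6 5 7 6 7
```

Because the offsets depend only on the per-column maxima, they can be computed in one pass with cumsum(), which is what removes the need for a sequential loop.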

Here is a tidyverse approach:
First paste the column name onto each string in every column, so that the same name appearing in different columns is treated as a distinct value (and so that the values can be sorted by column later). Then pivot the data into a two-column df so that we can assign IDs with match. Finally pivot it back to wide format and unnest the list columns.
library(tidyverse)
df %>%
mutate(across(everything(), ~ paste0(.x, "_", cur_column()))) %>%
pivot_longer(everything(), names_to = "ab", values_to = "a") %>%
arrange(ab) %>%
mutate(b = match(a, unique(a)), .keep = "unused") %>%
pivot_wider(names_from = "ab", values_from = "b") %>%
unnest(everything())
# A tibble: 6 × 4
Name1 Name2 Name3 Name4
<int> <int> <int> <int>
1 1 5 8 12
2 2 6 9 13
3 1 5 8 14
4 3 7 10 15
5 4 6 11 16
6 3 7 9 15
Data
Taken from @Allan Cameron.
df <- structure(list(Name1 = c("Rose,Ali", "Camp,Laura", "Rose,Ali",
"Murr,Kate", "Ghol,Dam", "Murr,Kate"), Name2 = c("Van,Hall",
"Ka,Klo", "Van,Hall", "Ismal, Ismal", "Ka,Klo", "Ismal, Ismal"
), Name3 = c("Ghol,Dam", "Dan,Dan", "Ghol,Dam", "Sian,Rozi",
"Rose,Ali", "Dan,Dan"), Name4 = c("Murr,kate", "Ali,Hoss", "Kol,Kan",
"Nas,Ami", "Nor,Ko", "Nas,Ami")), row.names = c(NA, -6L),
class = "data.frame")

Update: The approach below is not ideal because the IDs are not unique across columns. Sorry.
Using a lookup table with tidyverse:
library(dplyr)
library(tidyr)
library(tibble) # for deframe()
lookup <-
df |>
pivot_longer(everything()) |>
distinct() |>
arrange(name) |>
transmute(name = value, value = row_number()) |>
deframe()
df |>
mutate(across(everything(), ~ recode(., !!!lookup)))
Output:
Name1 Name2 Name3 Name4
1 1 5 4 12
2 2 6 9 13
3 1 5 4 14
4 3 7 10 15
5 4 6 1 16
6 3 7 9 15
Data from @Allan Cameron, thanks.
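A small base R sketch of why the pooled lookup reuses IDs, and why tagging each name with its column avoids that. The two short vectors are a cut-down stand-in for Name1 and Name3, and all variable names are made up for the illustration.

```r
name1 <- c("Rose,Ali", "Ghol,Dam")
name3 <- c("Ghol,Dam", "Rose,Ali")

# Pooled lookup: a name gets one id regardless of the column it appears in
pooled <- c(name1, name3)
pooled_ids <- match(pooled, unique(pooled))
pooled_ids  # 1 2 2 1

# Tagging each name with its column keeps the columns apart
tagged <- c(paste0(name1, "_Name1"), paste0(name3, "_Name3"))
tagged_ids <- match(tagged, unique(tagged))
tagged_ids  # 1 2 3 4
```

This is the same effect visible in the output above, where "Ghol,Dam" and "Rose,Ali" in Name3 receive the IDs they already had in Name1.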

Related

Tidyverse column-wise differences

Suppose I have a data frame like this:
df = data.frame(preA = c(1,2,3),preB = c(3,4,5),postA = c(6,7,8),postB = c(9,8,4))
I want to add columns having column-wise differences, that is:
diffA = postA - preA
diffB = postB - preB
and so on...
Is there an efficient way to do this in tidyverse?
The way to go with dplyr and tidyr:
library(dplyr)
library(tidyr)
df %>%
mutate(id = 1:n()) %>%
pivot_longer(-id,
names_to = c("pre_post", ".value"),
names_pattern = "(pre|post)(.*)") %>%
group_by(id) %>%
mutate(across(A:B, diff, .names = "diff{col}")) %>%
pivot_wider(names_from = pre_post, values_from = c(A, B),
names_glue = '{pre_post}{.value}') %>%
select(id, starts_with("pre"), starts_with("post"), starts_with("diff"))
# id preA preB postA postB diffA diffB
# 1 1 1 3 6 9 5 6
# 2 2 2 4 7 8 5 4
# 3 3 3 5 8 4 5 -1
A shorter but less flexible way is with dplyover::across2:
library(dplyr)
library(dplyover)
df %>%
#relocate(sort(colnames(.))) %>%
mutate(across2(starts_with("post"), starts_with("pre"), `-`,
.names = "diff{idx}"))
# preA preB postA postB diff1 diff2
# 1 1 3 6 9 5 6
# 2 2 4 7 8 5 4
# 3 3 5 8 4 5 -1
You can do this with two uses of across(), creating new variables with the first use and subtracting the second. This also assumes your columns are in order.
df %>%
mutate(across(starts_with("post"), .names = "diff{sub('post', '', .col)}") - across(starts_with("pre")))
preA preB postA postB diffA diffB
1 1 3 6 9 5 6
2 2 4 7 8 5 4
3 3 5 8 4 5 -1
A few more solutions. My favourite is the first one demonstrated here - I think it's the cleanest and most debuggable:
# Setup:
library(dplyr, warn.conflicts = FALSE)
library(glue)
df <- data.frame(
preA = c(1,2,3),
preB = c(3,4,5),
postA = c(6,7,8),
postB = c(9,8,4)
)
Method 1: Using expressions:
This is my favourite approach. I think it's very readable, and I think it should be reasonably fast compared to solutions using across():
cols <- c("A", "B")
exprs <- glue("post{cols} - pre{cols}")
names(exprs) <- glue("diff{cols}")
df |>
mutate(!!!rlang::parse_exprs(exprs))
#> preA preB postA postB diffA diffB
#> 1 1 3 6 9 5 6
#> 2 2 4 7 8 5 4
#> 3 3 5 8 4 5 -1
Method 2: Using mutate() + across() + get():
Personally, I don't like this sort of thing because I think it's really hard to read:
df |>
mutate(across(
starts_with("post"),
~ .x - get(stringr::str_replace_all(cur_column(), "^post", "pre")),
.names = "diff{stringr::str_remove(.col, '^post')}"
))
#> preA preB postA postB diffA diffB
#> 1 1 3 6 9 5 6
#> 2 2 4 7 8 5 4
#> 3 3 5 8 4 5 -1
Method 3: Using base subsetting:
The main advantage here is that you don't need any packages (you can use paste0() instead of glue()), IMO it's also pretty readable. But I don't like that it doesn't play well with |>:
cols <- c("A", "B")
df2 <- df
df2[glue("diff{cols}")] <- df2[glue("post{cols}")] - df2[glue("pre{cols}")]
df2
#> preA preB postA postB diffA diffB
#> 1 1 3 6 9 5 6
#> 2 2 4 7 8 5 4
#> 3 3 5 8 4 5 -1
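As mentioned for Method 3, glue() can be swapped for paste0() so that no packages are needed at all. A minimal sketch of that variant, assuming the same df and column suffixes as above:

```r
df <- data.frame(
  preA = c(1, 2, 3),
  preB = c(3, 4, 5),
  postA = c(6, 7, 8),
  postB = c(9, 8, 4)
)

cols <- c("A", "B")
df2 <- df

# Subtract the "pre" block from the "post" block column by column;
# both sides select columns in the order given by cols, so A pairs with A and B with B
df2[paste0("diff", cols)] <- df2[paste0("post", cols)] - df2[paste0("pre", cols)]
df2$diffB  # 6 4 -1
```

Because the columns are selected by name on both sides, this does not rely on the order of the columns in df.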

How to replace repeating entries in a data frame with n-(number of times it's repeated) in R?

In my data I have repeating entries in a column. What I'm trying to do is if an entry n is repeated more than 2 times within a column, then I want to replace that entry with n-(number_of_times_it_has_repeated - 2). For example, if my data looks like this:
df <- data.frame(
A = c(1,2,2,4,5,7,7,7,7,2,8,8),
B = c(2,3,4,5,6,7,8,9,10,11,12,13)
)
> df
A B
1 2
2 3
2 4
4 5
5 6
7 7
7 8
7 9
7 10
2 11
8 12
8 13
We can see that in df$A, 7 is repeated 4 times. If an entry is repeated more than 2 times, I want to replace it. So in my example, the 1st and 2nd instances of the number 7 would remain unchanged. The 3rd instance of the number 7 would be replaced by 7 - (3-2) = 6, and the 4th instance by 7 - (4-2) = 5.
We can also see that in df$A, the number 2 is repeated 3 times. Using the same method, the 3rd instance of the number 2 would be replaced by 2 - (3-2) = 1.
As there are no repeating values in df$B, that column would remain unchanged.
For clarity, my expected result would be:
dfNew <- data.frame(
A = c(1,2,2,4,5,7,7,6,5,1,8,8),
B = c(2,3,4,5,6,7,8,9,10,11,12,13)
)
> dfNew
A B
1 2
2 3
2 4
4 5
5 6
7 7
7 8
6 9
5 10
1 11
8 12
8 13
Here's how you can do it for one column -
library(dplyr)
df %>%
group_by(A) %>%
transmute(A = A - c(rep(0, 2), row_number())[row_number()]) %>%
ungroup
# A
# <dbl>
# 1 1
# 2 2
# 3 2
# 4 4
# 5 5
# 6 7
# 7 7
# 8 6
# 9 5
#10 1
#11 8
#12 8
To do it for all the columns you can use map_dfc -
purrr::map_dfc(names(df), ~{
df %>%
group_by(.data[[.x]]) %>%
transmute(!!.x := .data[[.x]] - c(rep(0, 2), row_number())[row_number()])%>%
ungroup
})
# A B
# <dbl> <dbl>
# 1 1 2
# 2 2 3
# 3 2 4
# 4 4 5
# 5 5 6
# 6 7 7
# 7 7 8
# 8 6 9
# 9 5 10
#10 1 11
#11 8 12
#12 8 13
The logic here is that for each number we subtract 0 from the first 2 occurrences, and then subtract 1, 2 and so on from the later ones.
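To make the indexing trick concrete, here is a sketch for a single group, taking the four 7s from the example. The names x, pos and penalty are just for illustration.

```r
x <- c(7, 7, 7, 7)     # one group: the four occurrences of 7
pos <- seq_along(x)    # 1 2 3 4, the role row_number() plays within a group

# Prepend two zeros to the positions and take the first length(x) entries,
# so the first two occurrences are penalised by 0 and the rest by 1, 2, ...
penalty <- c(rep(0, 2), pos)[pos]
penalty      # 0 0 1 2
x - penalty  # 7 7 6 5

# Equivalent formulation: occurrence number minus 2, floored at 0
all(penalty == pmax(pos - 2, 0))  # TRUE
```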
Here is my approach; you can skip the ordering step if you don't need it. If your data still contains duplicates after these changes, I can extend the answer and wrap it in a function.
library(dplyr)
my_df <- data.frame(A = c(1,2,2,4,5,7,7,7,7,2,8,8),
B = c(2,3,4,5,6,7,8,9,10,11,12,13),
stringsAsFactors = FALSE)
my_df <- my_df[order(my_df$A, my_df$B),]
my_df$Id <- seq.int(from = 1, to = nrow(my_df), by = 1)
# For values of A that occur more than twice, shift the 3rd, 4th, ... occurrences
my_temp <- my_df %>%
group_by(A) %>%
filter(n() > 2) %>%
mutate(Count = seq.int(from = 1, to = n(), by = 1)) %>%
filter(Count > 2) %>%
mutate(A = A - (Count - 2))
my_var <- which(my_df$Id %in% my_temp$Id)
if (length(my_var)) {
my_df <- my_df[-my_var,]
my_df <- rbind(my_df, my_temp[, c("A", "B", "Id")])
}
my_df <- my_df[order(my_df$A, my_df$B),]
A base R option using ave + pmax + seq_along
list2DF(
lapply(
df,
function(x) {
x - ave(x, x, FUN = function(v) pmax(seq_along(v) - 2, 0))
}
)
)
gives
A B
1 1 2
2 2 3
3 2 4
4 4 5
5 5 6
6 7 7
7 7 8
8 6 9
9 5 10
10 1 11
11 8 12
12 8 13

dplyr: Mutate a new column with sequential repeated integers of n time in a dataframe

I am struggling with one maybe easy question. I have a dataframe of 1 column with n rows (n is a multiple of 3). I would like to add a second column with integers like: 1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,.. How can I achieve this with dplyr as a general solution for different length of rows (all multiple of 3).
I tried this:
df <- tibble(Col1 = c(1:12)) %>%
mutate(Col2 = rep(1:4, each=3))
This works. But I would like to have a solution for n rows, each = 3 . Many thanks!
You can specify each and length.out parameter in rep.
library(dplyr)
tibble(Col1 = c(1:12)) %>%
mutate(Col2 = rep(row_number(), each=3, length.out = n()))
# Col1 Col2
# <int> <int>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 2
# 5 5 2
# 6 6 2
# 7 7 3
# 8 8 3
# 9 9 3
#10 10 4
#11 11 4
#12 12 4
We can use gl
library(dplyr)
df %>%
mutate(col2 = as.integer(gl(n(), 3, n())))
Integer division (%/% 3) over a sequence such as 0:n gives 0, 0, 0, 1, 1, 1, ..., so adding 1 produces the desired sequence directly. This will also do:
df %>% mutate(col2 = 1+ (row_number()-1) %/% 3)
# A tibble: 12 x 2
Col1 col2
<int> <dbl>
1 1 1
2 2 1
3 3 1
4 4 2
5 5 2
6 6 2
7 7 3
8 8 3
9 9 3
10 10 4
11 11 4
12 12 4
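The integer-division idea also works outside mutate(). A small base R sketch, assuming a block size of 3; using integer literals (1L, 3L) keeps the result an integer vector rather than a double. The names n, idx and block are illustrative.

```r
n <- 12
idx <- seq_len(n)                   # 1, 2, ..., n

# (idx - 1) %/% 3 gives 0 0 0 1 1 1 ...; adding 1 starts the blocks at 1
block <- (idx - 1L) %/% 3L + 1L
block  # 1 1 1 2 2 2 3 3 3 4 4 4

# If n is not a multiple of 3, the last block is simply shorter
(seq_len(10) - 1L) %/% 3L + 1L  # 1 1 1 2 2 2 3 3 3 4
```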

replace at once multiple columns names which end with different patterns in R

I have a table with hundreds of columns. Their names end either with .a or .b
What I need is to rename every column ending in .a to column.a_new and every column ending in .b to column.b_new, all at once.
I can do it only one pattern at a time but I don't know how to do it at once for all columns.
rename_at_example <- my_table %>% rename_at(vars(ends_with(".a")),
funs(str_replace(., ".a", ".a_new")))
Any idea how to write it in a compact way for all columns?
Thank you
One dplyr option could be:
df %>%
rename_at(vars(matches("[ab]$")), ~ paste0(., "_new"))
col1a_new col2a_new col1b_new col2b_new col1c col2c
1 1 11 1 11 1 11
2 2 12 2 12 2 12
3 3 13 3 13 3 13
4 4 14 4 14 4 14
5 5 15 5 15 5 15
6 6 16 6 16 6 16
7 7 17 7 17 7 17
8 8 18 8 18 8 18
9 9 19 9 19 9 19
10 10 20 10 20 10 20
Sample data:
df <- data.frame(col1a = 1:10,
col2a = 11:20,
col1b = 1:10,
col2b = 11:20,
col1c = 1:10,
col2c = 11:20,
stringsAsFactors = FALSE)
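If the names really do end in a literal ".a" or ".b" as in the question, a base R sketch with a single regular expression may be enough. The table my_table below is a made-up stand-in; the dot is escaped because an unescaped "." in a pattern matches any character.

```r
my_table <- data.frame(x.a = 1, y.b = 2, z.c = 3)

# "\\." matches a literal dot, "([ab])" captures the final letter,
# "$" anchors at the end of the name; "\\1" re-inserts the captured letter
names(my_table) <- sub("\\.([ab])$", ".\\1_new", names(my_table))
names(my_table)  # "x.a_new" "y.b_new" "z.c"
```

Columns that do not match the pattern, such as z.c here, are left untouched.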
If '.a' names and '.b' names don't require the same replacement/action, e.g. adding '_new' to the end, you could use reduce2
library(tidyverse) # dplyr + purrr for reduce2
df <- data.frame(one.a = 1, one.d = 2, twoa = 3, two.b = 4, three.a = 5)
df
# one.a one.d twoa two.b three.a
# 1 1 2 3 4 5
df %>%
rename_all(~ reduce2(c('\\.a$', '\\.b$'), c('.a_new1', '.b_new2'),
str_replace, .init = .x))
# one.a_new1 one.d twoa two.b_new2 three.a_new1
# 1 1 2 3 4 5

Stack 10 Columns in R in to two columns [duplicate]

I'm having trouble stacking 10 columns in R into two columns of 5 where each column relates. Basically I have something like:
Name1, ID1, Name2, ID2, Name3, ID3, Name4, ID4, Name5, ID5
And I need to stack them in to a Name and ID table where the values in each Name column still match its ID counterpart. What would be the best way to approach this?
Thanks!
I would recommend melt from the "data.table" package.
Here's some sample data. (This is something you should share.)
mydf <- data.frame(
matrix(1:20, ncol = 10, dimnames = list(NULL, paste0(c("Name", "ID"),
rep(1:5, each = 2)))))
mydf
## Name1 ID1 Name2 ID2 Name3 ID3 Name4 ID4 Name5 ID5
## 1 1 3 5 7 9 11 13 15 17 19
## 2 2 4 6 8 10 12 14 16 18 20
Here's the reshaping:
library(data.table)
melt(as.data.table(mydf), measure = patterns("Name", "ID"),
value.name = c("Name", "ID"))
## variable Name ID
## 1: 1 1 3
## 2: 1 2 4
## 3: 2 5 7
## 4: 2 6 8
## 5: 3 9 11
## 6: 3 10 12
## 7: 4 13 15
## 8: 4 14 16
## 9: 5 17 19
## 10: 5 18 20
You can do this with reshaping
library(dplyr)
library(tidyr)
library(rex)
variable_regex =
rex(capture("Name" %>%
or ("ID") ),
capture(digits) )
mydf %>%
mutate(row_ID = 1:n()) %>%
gather(variable, value, -row_ID) %>%
extract(variable,
c("new_variable", "column_ID"),
variable_regex) %>%
spread(new_variable, value)
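For comparison, a package-free sketch of the same stacking, assuming every NameN column has a matching IDN column as in the sample data above. The helper names name_cols, id_cols and stacked are made up for the example.

```r
mydf <- data.frame(
  matrix(1:20, ncol = 10, dimnames = list(NULL, paste0(c("Name", "ID"),
                                                       rep(1:5, each = 2)))))

# Find the Name columns and derive the matching ID columns from them,
# so each Name stays paired with its own ID
name_cols <- grep("^Name", names(mydf), value = TRUE)
id_cols <- sub("^Name", "ID", name_cols)

# unlist() stacks the selected columns one after another, in the same order
stacked <- data.frame(
  Name = unlist(mydf[name_cols], use.names = FALSE),
  ID = unlist(mydf[id_cols], use.names = FALSE)
)
stacked$Name  # 1 2 5 6 9 10 13 14 17 18
stacked$ID    # 3 4 7 8 11 12 15 16 19 20
```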
