How to recode multiple columns in R

I tried my best to recode multiple columns, but I still struggle to do it. Here is my data:
df<-read.table(text="ZR1 Time1 ZR2 Time2 ZR3 Time3
A 60 A 56 B 44
C 61 B 44 D 78
D 62 C 78 E 66
E 58 D 46 B 45
A 54 B 23 B 23
A 57 E 24 B 100",h=T)
What I have done:
for (i in 1) {
ZRi<-paste0("ZR", i)
Zi<-paste0("Z",i)}
df[,Zi]=c(A=4,B=3,C=2,D=1,E=0)
df[,Zi]=c(A=4,B=3,C=2,D=1,E=0)[df[,ZRi]]
I got this:
ZR1 Time1 ZR2 Time2 ZR3 Time3 Z1
1 A 60 A 56 B 44 4
2 C 61 B 44 D 78 3
3 D 62 C 78 E 66 2
4 E 58 D 46 B 45 1
5 A 54 B 23 B 23 4
6 A 57 E 24 B 100 4
As you can see, I only get Z1, and its values are wrong.
I want to get this:
ZR1 Time1 ZR2 Time2 ZR3 Time3 Z1 Z2 Z3
A 60 A 56 B 44 4 4 3
C 61 B 44 D 78 2 3 1
D 62 C 78 E 66 1 2 0
E 58 D 46 B 45 0 1 3
A 54 B 23 B 23 4 3 3
A 57 E 24 B 100 4 0 3

Here's the base approach (and probably fastest). You are just using the values of the ZR columns as an index into c(A=4,B=3,C=2,D=1,E=0) which becomes a translation table and then assigning those results to new columns in df:
df[ paste0("Z", 1:3) ] <-
lapply( df[ , grepl("^ZR", names(df))] , # passes "ZR" columns one-at-a-time
function(x) {c(A=4,B=3,C=2,D=1,E=0)[as.character(x)]})
Depending on the intended purpose of these new columns, @User60 should be aware that this delivers numeric vectors.
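A minimal sketch of why as.character() matters in the lookup above, assuming the ZR columns were read in as factors (the default of read.table before R 4.0). The names lookup, zr1, by_code and by_name are made up for this illustration:

```r
# Translation table: names are the old codes, values are the new scores
lookup <- c(A = 4, B = 3, C = 2, D = 1, E = 0)

# A factor like ZR1 in the question (its levels are A, C, D, E)
zr1 <- factor(c("A", "C", "D", "E", "A", "A"))

# Indexing with the factor itself uses its integer codes (1, 2, 3, 4, 1, 1)
by_code <- unname(lookup[zr1])

# Indexing with the character values looks the entries up by name
by_name <- unname(lookup[as.character(zr1)])

print(by_code)  # 4 3 2 1 4 4, the wrong Z1 shown in the question
print(by_name)  # 4 2 1 0 4 4, the intended recoding
```

Indexing by position instead of by name is what produced the unexpected Z1 column in the question.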

By playing with levels and labels you can get this:
for (i in 1:3) {
df[[paste0("Z",i)]] <-
factor(df[[paste0("ZR", i)]],levels=LETTERS[1:5],labels=4:0)
}
df
# ZR1 Time1 ZR2 Time2 ZR3 Time3 Z1 Z2 Z3
# 1 A 60 A 56 B 44 4 4 3
# 2 C 61 B 44 D 78 2 3 1
# 3 D 62 C 78 E 66 1 2 0
# 4 E 58 D 46 B 45 0 1 3
# 5 A 54 B 23 B 23 4 3 3
# 6 A 57 E 24 B 100 4 0 3
The columns created with this method will be factors. To get numeric columns instead, use the following:
for (i in 1:3) {
df[[paste0("Z",i)]] <-
as.numeric(as.character(factor(df[[paste0("ZR", i)]],levels=LETTERS[1:5],labels=4:0)))
}
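A small sketch of why the conversion goes through as.character() first: calling as.numeric() directly on a factor returns the positions of the levels, not the labels. The vector zr is a made-up example:

```r
zr <- c("A", "E", "B")

# Same construction as above: levels A..E are relabelled "4".."0"
f <- factor(zr, levels = LETTERS[1:5], labels = 4:0)

# Direct conversion returns the level positions
codes <- as.numeric(f)

# Converting the labels to character first returns the intended scores
values <- as.numeric(as.character(f))

print(codes)   # 1 5 2
print(values)  # 4 0 3
```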

Maybe this one-liner with dplyr could help:
library(dplyr)
df %>%
mutate_at(setNames(paste0("ZR", 1:3), paste0("Z", 1:3)),
~5-as.numeric(factor(.x, levels = LETTERS[1:5])))
The trick here is to pass a named vector to mutate_at to create new columns. You can coerce a factor to numeric if you specify the levels beforehand.
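For newer dplyr versions, where across() supersedes mutate_at(), the same idea might look like the sketch below. It assumes dplyr >= 1.0.0; the small data frame and the intermediate "_num" suffix are only for illustration and are not part of the original answer:

```r
library(dplyr)

# Reduced version of the question's data, ZR columns only
df <- data.frame(ZR1 = c("A", "C", "D"),
                 ZR2 = c("A", "B", "C"),
                 ZR3 = c("B", "D", "E"))

recoded <- df %>%
  # Recode each ZR column into a new column named ZR1_num, ZR2_num, ...
  mutate(across(all_of(paste0("ZR", 1:3)),
                ~ 5 - as.numeric(factor(.x, levels = LETTERS[1:5])),
                .names = "{.col}_num")) %>%
  # Rename ZR1_num -> Z1, ZR2_num -> Z2, ZR3_num -> Z3
  rename_with(~ sub("^ZR(\\d+)_num$", "Z\\1", .x), ends_with("_num"))

print(recoded)
```

The original ZR columns are kept, and Z1 to Z3 are appended at the end.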

An alternative solution using dplyr + magrittr packages
library(dplyr); library(magrittr)
df2 <- select(df, starts_with("ZR")) %>%
lapply(as.character) %>%
mapply(`[`, list(c(A=4,B=3,C=2,D=1,E=0)), .) %>%
data.frame(df, .)
names(df2)[ncol(df2)-2:0] <- paste0("Z", 1:3)

Here's a more dplyr-esque method. Useful for recoding when the output isn't an integer.
library(dplyr)
# Make lookup table
lookup <- data.frame(let = LETTERS[1:5], num = 4:0, stringsAsFactors = F)
# Join with lookup table
df %>%
left_join(lookup, by = c('ZR1' = 'let')) %>%
left_join(lookup, by = c('ZR2' = 'let')) %>%
left_join(lookup, by = c('ZR3' = 'let')) %>%
rename_at(vars(matches('num')), ~paste0('Z', 1:3))
Or, with data.table
library(data.table)
lookup <- data.frame(let = LETTERS[1:5], num = 4:0, stringsAsFactors = F)
setDT(df)
df[, paste0('Z', 1:3) := lapply(df[,paste0('ZR', 1:3)],
function(x) lookup$num[match(x, lookup$let)])]

Related

Identify pairs or groups of rows that have the same values across multiple columns

Say I have a data.frame:
file = read.table(text = "sex age num
M 32 5
F 31 2
M 91 2
M 30 1
M 23 1
F 19 1
F 31 2
F 21 2
M 32 5
F 65 3
M 24 5", header = T, sep = "")
I want to get a sorted data frame of all rows that have the exact same values of sex, age, and num with any other row in the data frame.
The result should look like this (note that the data frame is sorted by the pairs or groups that are duplicated with each other):
result = read.table(text = "sex age num
M 32 5
M 32 5
F 31 2
F 31 2", header = T, sep = "")
I have tried various combinations of distinct in dplyr and duplicated, but they don't quite get at this use case.
We need duplicated twice: once in the normal direction, from top to bottom, and once from bottom to top (fromLast = TRUE). Combining the two with | gives TRUE for every row that has a duplicate in either direction, which we then use for subsetting:
out <- file[duplicated(file)|duplicated(file, fromLast = TRUE),]
out$sex <- factor(out$sex, levels = c("M", "F"))
out1 <- out[do.call(order, out),]
row.names(out1) <- NULL
Output:
> out1
sex age num
1 M 32 5
2 M 32 5
3 F 31 2
4 F 31 2
The above can be written in tidyverse
library(dplyr)
file %>%
arrange(sex == "F", across(everything())) %>%
filter(duplicated(.)|duplicated(., fromLast = TRUE))
sex age num
1 M 32 5
2 M 32 5
3 F 31 2
4 F 31 2
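To see why the duplicated() approach needs both directions, here is a small sketch on a toy character vector (x is made up for the illustration):

```r
x <- c("a", "b", "a", "c", "b")

# Scanning top to bottom flags only the later copies
forward <- duplicated(x)

# Scanning bottom to top flags only the earlier copies
backward <- duplicated(x, fromLast = TRUE)

# A value is part of a duplicated group if either scan flags it
either <- forward | backward

print(forward)    # FALSE FALSE  TRUE FALSE  TRUE
print(backward)   #  TRUE  TRUE FALSE FALSE FALSE
print(x[either])  # "a" "b" "a" "b"
```

Either scan alone would drop one member of every duplicated pair.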
An alternative approach:
Here all groups with more than one row will be kept:
library(dplyr)
file %>%
group_by(sex, age, num) %>%
filter(n() > 1) %>%
arrange(.by_group = T) %>%
ungroup()
sex age num
<chr> <int> <int>
1 F 31 2
2 F 31 2
3 M 32 5
4 M 32 5
file = read.table(text = "sex age num
M 32 5
F 31 2
M 91 2
M 30 1
M 23 1
F 19 1
F 31 2
F 21 2
M 32 5
F 65 3
M 24 5", header = T, sep = "")
library(vctrs)
library(dplyr, warn = F)
#> Warning: package 'dplyr' was built under R version 4.1.2
file %>%
filter(vec_duplicate_detect(.)) %>%
arrange(across(everything()))
#> sex age num
#> 1 F 31 2
#> 2 F 31 2
#> 3 M 32 5
#> 4 M 32 5
Created on 2022-08-19 by the reprex package (v2.0.1.9000)
A base R option using subset + ave
> subset(file, ave(seq_along(num), sex, age, num, FUN = length) > 1)
sex age num
1 M 32 5
2 F 31 2
7 F 31 2
9 M 32 5
or rbind + split
> do.call(rbind, Filter(function(x) nrow(x) > 1, split(file, ~ sex + age + num)))
sex age num
F.31.2.2 F 31 2
F.31.2.7 F 31 2
M.32.5.1 M 32 5
M.32.5.9 M 32 5
Here is an approach, using .SD[.N>1] by group in data.table
library(data.table)
result = setDT(file)[, i:=.I][, .SD[.N>1],.(sex,age,num)][, i:=NULL]
Output:
sex age num
1: M 32 5
2: M 32 5
3: F 31 2
4: F 31 2

Apply a function across groups and columns in data.table and/or dplyr

I would like to combine two data.tables or dataframes of unequal row #, where the # of rows of dt2 is the same as the number of groups of dt1. Here is a reproducible example:
a <- 1:10; b <- 2:11; c <- 3:12
groupVar <- c(1,1,1,2,2,2,3,3,3,3)
dt1 <- data.table(a,b,c,groupVar)
a2 <- c(10,20,30); b2 <- c(20,30,40); c2 <- c(30,40,50)
dt2 <- data.table(a2,b2,c2)
The real case involves a large number of columns, so I wish to refer to them with variables.
Using either a loop or apply, I wish to add each row of dt2 to the rows comprising each group of dt1. Here is one of many attempts that fail:
for (ic in 1:3) {
c1 <- dt2[,(ic), with=FALSE]
c2 <- dt2[,(ic), with=FALSE]
dt1[,(ic) := .(c1 + c2[.G]), by = "groupVar"]
}
I am interested in how to do this kind of operation "by group and by column" in both data.table syntax and dplyr syntax. In place (as above) is not critical.
desired result:
dt1 (or dt3) =
a b c groupVar
11 22 33 1
12 23 34 1
13 24 35 1
24 35 46 2
...
40 51 62 3
The sample datasets provided with the question indicate that the names of the columns may differ between datasets, e.g., column b of dt1 and column b2 of dt2 are supposed to be added.
Here are three approaches which should work for an arbitrary number of arbitrarily named pairs of columns:
Working in long format
EDIT: Update joins using get()
EDIT 2: Computing on the language
1. Working in long format
The information on corresponding columns can be provided in a look-up table or translation table:
library(data.table)
lut <- data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2"))
lut
vars1 vars2
1: a a2
2: b b2
3: c c2
In cases where column names are treated as data and the column data are of the same data type my first approach is to reshape to long format.
# reshape to long format
mdt1 <- melt(dt1[, rn := .I], measure.vars = lut$vars1)
mdt2 <- melt(dt2[, groupVar := .I], measure.vars = lut$vars2)
# update join to translate variable names
mdt2[lut, on = .(variable = vars2), variable := vars1]
# update join to add corresponding values of both tables
mdt1[mdt2, on = .(groupVar, variable), value := x.value + i.value]
# reshape back to wide format
dt3 <- dcast(mdt1, rn + groupVar ~ ...)[, rn := NULL][]
dt3
groupVar a b c
1: 1 11 22 33
2: 1 12 23 34
3: 1 13 24 35
4: 2 24 35 46
5: 2 25 36 47
6: 2 26 37 48
7: 3 37 48 59
8: 3 38 49 60
9: 3 39 50 61
10: 3 40 51 62
2. Update joins using get()
On second thought, here is an approach which is similar to the OP's proposed for loop and requires much less code:
vars1 <- c("a", "b", "c")
vars2 <- c("a2", "b2", "c2")
dt2[, groupVar := .I]
for (iv in seq_along(vars1)) {
dt1[dt2, on = .(groupVar),
(vars1[iv]) := get(paste0("x.", vars1[iv])) + get(paste0("i.", vars2[iv]))][]
}
dt1[]
a b c groupVar
1: 11 22 33 1
2: 12 23 34 1
3: 13 24 35 1
4: 24 35 46 2
5: 25 36 47 2
6: 26 37 48 2
7: 37 48 59 3
8: 38 49 60 3
9: 39 50 61 3
10: 40 51 62 3
Note that dt1 is updated by reference, i.e., without copying.
Prefixing the variable names vars1[iv] with "x." and vars2[iv] with "i." on the right-hand side of := ensures that the right columns from dt1 and dt2, respectively, are picked in case of duplicated column names. See the Advanced: section on the j parameter in help("data.table").
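A minimal sketch of the x. and i. prefixes in an update join, using two toy tables X and Y (hypothetical names) that deliberately share the column name v:

```r
library(data.table)

X <- data.table(id = 1:3, v = c(1, 2, 3))
Y <- data.table(id = 1:3, v = c(10, 20, 30))

# In X[Y, ...], x.v is the column from X and i.v is the column from Y
X[Y, on = .(id), v := x.v + i.v]

print(X$v)  # 11 22 33
```

Without the prefixes, a bare v on the right-hand side would be ambiguous because both tables have a column of that name.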
3. Computing on the language
This follows Matt Dowle's advice to create one expression to be evaluated, "similar to constructing a dynamic SQL statement to send to a server". See here for another use case.
library(glue) # literal string interpolation
library(magrittr) # piping used to improve readability
EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))
data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2")) %>%
glue_data("{vars1} = x.{vars1} + i.{vars2}") %>%
glue_collapse( sep = ", ") %>%
{glue("dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`({.})][]")} %>%
EVAL()
a b c groupVar
1: 11 22 33 1
2: 12 23 34 1
3: 13 24 35 1
4: 24 35 46 2
5: 25 36 47 2
6: 26 37 48 2
7: 37 48 59 3
8: 38 49 60 3
9: 39 50 61 3
10: 40 51 62 3
It starts with a look-up table which is created on-the-fly and subsequently manipulated to form a complete data.table statement
dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`(a = x.a + i.a2, b = x.b + i.b2, c = x.c + i.c2)][]
as a character string. This string is then evaluated and executed in one go; no for loops required.
As the helper function EVAL() already uses paste0() the call to glue() can be omitted:
data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2")) %>%
glue_data("{vars1} = x.{vars1} + i.{vars2}") %>%
glue_collapse( sep = ", ") %>%
{EVAL("dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`(", ., ")][]")}
Note that the dot . and curly brackets {} are used with different meanings in different contexts, which may appear somewhat confusing.
Assuming that the column names are consistent (e.g. you want a + a2, b + b2, etc.), here is a tidyverse solution that starts in a similar way to @dclarson's, then uses the bang-bang operator to select the columns to add up.
Is this what you are after?
## Load packages, create tibbles and join
library(dplyr)
library(purrr)
dt1 <- tibble(groupVar,a,b,c)
dt2 <- tibble(groupVar = 1:3,a2,b2,c2)
dt3 <- inner_join(dt1, dt2, by = "groupVar")
## Define the column starters you are interested in
cols <- c("a","b","c")
## Or in case of many columns
cols <- colnames(dt1[-1])
## Create function to add columns with the same starting letters
add_cols <- function(col){
dt3 %>% select(starts_with(!!col)) %>%
transmute(!!(sym(col)) := !!(sym(col)) + !!(sym(paste0(col,"2"))))
}
## map the function and add groupVar
map_dfc(cols,add_cols) %>% mutate(groupVar = dt3$groupVar)
# A tibble: 10 x 4
a b c groupVar
<dbl> <dbl> <dbl> <dbl>
1 11 22 33 1
2 12 23 34 1
3 13 24 35 1
4 24 35 46 2
5 25 36 47 2
6 26 37 48 2
7 37 48 59 3
8 38 49 60 3
9 39 50 61 3
10 40 51 62 3
It is simple if you add groupVar to dt2:
dt2 <- data.table(a2, b2, c2, groupVar=1:3)
dt3 <- merge(dt1, dt2)
dt4 <- with(dt3, data.table(a=a+a2, b=b+b2, c=c+c2, groupVar))
dt4
# a b c groupVar
# 1: 11 22 33 1
# 2: 12 23 34 1
# 3: 13 24 35 1
# 4: 24 35 46 2
# 5: 25 36 47 2
# 6: 26 37 48 2
# 7: 37 48 59 3
# 8: 38 49 60 3
# 9: 39 50 61 3
# 10: 40 51 62 3
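The merge above spells out each column pair by hand. A base R sketch that takes the pairs from two character vectors instead, assuming (as in the sample data) that the groupVar values 1, 2, 3 are the row positions of the second table. The names d1, d2, d3, vars1 and vars2 are chosen for this illustration:

```r
d1 <- data.frame(a = 1:10, b = 2:11, c = 3:12,
                 groupVar = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))
d2 <- data.frame(a2 = c(10, 20, 30), b2 = c(20, 30, 40), c2 = c(30, 40, 50))

# Column pairs to add: vars1[k] from d1 with vars2[k] from d2
vars1 <- c("a", "b", "c")
vars2 <- c("a2", "b2", "c2")

d3 <- d1
# d2[d1$groupVar, vars2] repeats each row of d2 once per row of its group,
# so it lines up with d1; Map() then adds the columns pair by pair
d3[vars1] <- Map(`+`, d1[vars1], d2[d1$groupVar, vars2])

print(d3)
```

If the group identifiers were not consecutive row positions, a match() against a key column in the second table would be needed in place of the direct row index.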
This should give the desired result:
1. Create a groupVar column in dt2 from the unique groupVar values of dt1
2. right_join with dt1 by groupVar
3. Create the new columns a, b, c with mutate
4. Keep a, b, c and groupVar as the desired output
library(dplyr)
dt3 <- dt2 %>%
mutate(groupVar = unique(dt1$groupVar)) %>%
right_join(dt1, by="groupVar") %>%
mutate(a = a + a2,
b = b + b2,
c = c + c2) %>%
select(a, b, c, groupVar)
data:
library(data.table)
a <- 1:10; b <- 2:11; c <- 3:12
groupVar <- c(1,1,1,2,2,2,3,3,3,3)
dt1 <- data.table(a,b,c,groupVar)
a2 <- c(10,20,30); b2 <- c(20,30,40); c2 <- c(30,40,50)
dt2 <- data.table(a2,b2,c2)

How to match different ids within a single data set in R?

A sample of my data is
df1 <- read.table(text = " id1 time id2 gender id3 group id4 house
123 12 141 F 13 1 156 A
141 19 144 F 144 1 123 A
144 22 123 M 123 1 141 M
168 14 13 M 141 2 144 M
156 13 168 M 168 2 13 Q
13 11 156 F 156 2 168 Q
", header = TRUE)
I want to get the following outcome by looking up each id in the other id columns. For example, for id 123: time = 12, gender = M, group = 1, house = A.
df1 <- read.table(text = " id time gender group house
123 12 M 1 A
141 19 F 2 M
144 22 F 1 M
168 14 M 2 Q
156 13 F 2 A
13 11 M 1 Q
", header = TRUE)
I have tried left_join, but I struggled to get the outcome of interest
df1 <- left_join(id2,id3,id4 by = "id1")
You've got the folks confused here because your table is in an unusual format. Typically in R, we expect one variable per column and one observation per row. What you have is effectively four tables stuck side-by-side, where id1, id2, id3 and id4 are all actually just "id". So effectively, you are looking to left join columns 3:4 to columns 1:2, then left join columns 5:6 to that, and so on.
I'll show one way of doing that, then maybe some of the smart folks here can show you a better way:
library(dplyr)
df_list <- lapply(list(1:2, 3:4, 5:6, 7:8), function(x) df1[x])
df_list <- lapply(df_list, function(x) {names(x)[1] <- "id"; x})
df2 <- df_list[[1]] %>%
left_join(df_list[[2]]) %>%
left_join(df_list[[3]]) %>%
left_join(df_list[[4]])
df2
#> id time gender group house
#> 1 123 12 M 1 A
#> 2 141 19 F 2 M
#> 3 144 22 F 1 M
#> 4 168 14 M 2 Q
#> 5 156 13 F 2 A
#> 6 13 11 M 1 Q
Created on 2020-07-01 by the reprex package (v0.3.0)
It seems we need to match the different 'id' columns and pick up the corresponding 'gender', 'group' and 'house' values:
nm1 <- c('id1', 'time', 'gender', 'group', 'house')
out1 <- transform(df1, gender = gender[match(id1, id2)],
group = group[match(id1, id3)],
house = house[match(id1, id4)])[nm1]
names(out1)[1] <- 'id'
out1
# id time gender group house
#1 123 12 M 1 A
#2 141 19 F 2 M
#3 144 22 F 1 M
#4 168 14 M 2 Q
#5 156 13 F 2 A
#6 13 11 M 1 Q
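A small sketch of what match() does in the transform() call above, using the first three ids of the sample data (the vector pos is introduced for the illustration):

```r
id1 <- c(123, 141, 144)
id2 <- c(141, 144, 123)
gender <- c("F", "F", "M")   # gender[k] belongs to id2[k]

# For each id1, the position at which it appears in id2
pos <- match(id1, id2)

print(pos)          # 3 1 2
print(gender[pos])  # "M" "F" "F", i.e. gender reordered to follow id1
```

An id that does not occur in id2 would give NA at that position, and therefore NA in the looked-up column.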
In addition to the base R approach above, an alternative to @AllanCameron's solution is to split the columns into subsets at each occurrence of an 'id' column (split.default), rename the first column of each subset to 'id', and apply left_join within reduce:
library(dplyr)
library(purrr)
df1 %>%
split.default(cumsum(startsWith(names(.), "id"))) %>%
map(~ rename_at(.x, 1, ~ 'id')) %>%
reduce(left_join, by = 'id')
# id time gender group house
#1 123 12 M 1 A
#2 141 19 F 2 M
#3 144 22 F 1 M
#4 168 14 M 2 Q
#5 156 13 F 2 A
#6 13 11 M 1 Q

Dplyr mutate new column at a specified location

An example:
a = c(10,20,30)
b = c(1,2,3)
c = c(4,5,6)
d = c(7,8,9)
df=data.frame(a,b,c,d)
library(dplyr)
df_1 = df %>% mutate(a1=sum(a+1))
How do I add "a1" after "a" (or any other defined position) and NOT at the end?
Thank you.
An update that might be useful for others who find this question - this can now be achieved directly within mutate (I'm using dplyr v1.0.2).
Just specify which existing column the new column should be positioned after or before, e.g.:
df_after <- df %>%
mutate(a1=sum(a+1), .after = a)
df_before <- df %>%
mutate(a1=sum(a+1), .before = b)
Another option is add_column from tibble
library(tibble)
add_column(df, a1 = sum(a + 1), .after = "a")
# a a1 b c d
#1 10 63 1 4 7
#2 20 63 2 5 8
#3 30 63 3 6 9
Extending on www's answer, we can use dplyr's select_helper functions to reorder newly created columns as we see fit:
library(dplyr)
## add a1 after a
df %>%
mutate(a1 = sum(a + 1)) %>%
select(a, a1, everything())
#> a a1 b c d
#> 1 10 63 1 4 7
#> 2 20 63 2 5 8
#> 3 30 63 3 6 9
## add a1 after c
df %>%
mutate(a1 = sum(a + 1)) %>%
select(1:c, a1, everything())
#> a b c a1 d
#> 1 10 1 4 63 7
#> 2 20 2 5 63 8
#> 3 30 3 6 63 9
dplyr >= 1.0.0
relocate was added as a new verb to change the order of one or more columns. If you pipe the output of your mutate the syntax for relocate also uses .before and .after arguments:
df_1 %>%
relocate(a1, .after = a)
a a1 b c d
1 10 63 1 4 7
2 20 63 2 5 8
3 30 63 3 6 9
An additional benefit is you can also move multiple columns using any tidyselect syntax:
df_1 %>%
relocate(c:a1, .before = b)
a c d a1 b
1 10 4 7 63 1
2 20 5 8 63 2
3 30 6 9 63 3
By default, the mutate function adds the newly created column at the end. However, we can sort the columns alphabetically after the mutate call using select.
library(dplyr)
df_1 <- df %>%
mutate(a1 = sum(a + 1)) %>%
select(sort(names(.)))
df_1
# a a1 b c d
# 1 10 63 1 4 7
# 2 20 63 2 5 8
# 3 30 63 3 6 9

How to make groups in a data.frame equal length?

I have this data.frame:
df <- data.frame(id=c('A','A','B','B','B','C'), amount=c(45,66,99,34,71,22))
id | amount
-----------
A | 45
A | 66
B | 99
B | 34
B | 71
C | 22
which I need to expand so that each by group in the data.frame is of equal length (filling it out with zeroes), like so:
id | amount
-----------
A | 45
A | 66
A | 0 <- added
B | 99
B | 34
B | 71
C | 22
C | 0 <- added
C | 0 <- added
What is the most efficient way of doing this?
NOTE
Benchmarking some of the solutions provided with my actual 1 million row data.frame, I got:
plyr | data.table | unstack
-----------------------------------
Elapsed: 139.87s | 0.09s | 2.00s
One way using data.table
df <- structure(list(V1 = structure(c(1L, 1L, 2L, 2L, 2L, 3L),
.Label = c("A ", "B ", "C "), class = "factor"),
V2 = c(45, 66, 99, 34, 71, 22)),
.Names = c("V1", "V2"),
class = "data.frame", row.names = c(NA, -6L))
require(data.table)
dt <- data.table(df, key="V1")
# get the size of the largest group
idx <- max(dt[, .N, by=V1]$N)
# get final result
dt[, list(V2 = c(V2, rep(0, idx-length(V2)))), by=V1]
# V1 V2
# 1: A 45
# 2: A 66
# 3: A 0
# 4: B 99
# 5: B 34
# 6: B 71
# 7: C 22
# 8: C 0
# 9: C 0
I'm sure there is a base R solution, but here is one that uses ddply in the plyr package
library(plyr)
##N: How many values should be in each group
N = 3
ddply(df, "id", summarize,
amount = c(amount, rep(0, N-length(amount))))
gives:
id amount
1 A 45
2 A 66
3 A 0
4 B 99
5 B 34
6 B 71
7 C 22
8 C 0
9 C 0
Here's another way in base R using unstack and stack.
# ensure character id col
df <- transform(df, id=as.character(id))
# break into a list by id
u <- unstack(df, amount ~ id)
# get max length
max.len <- max(sapply(u, length))
# pad the short ones with 0s
filled <- lapply(u, function(x) c(x, numeric(max.len - length(x))))
# recombine into data.frame
stack(filled)
# values ind
# 1 45 A
# 2 66 A
# 3 0 A
# 4 99 B
# 5 34 B
# 6 71 B
# 7 22 C
# 8 0 C
# 9 0 C
How about this?
out <- by(df, INDICES = df$id, FUN = function(x, N) {
x <- droplevels(x)
lng <- nrow(x)
dif <- N - lng
if (dif == 0) return(x)
make.list <- lapply(1:dif, FUN = function(y) data.frame(id = x$id[1], amount = 0)) # x$id[1] works for factor and character ids
rbind(x, do.call("rbind", make.list))
}, N = max(table(df$id))) # N could also be an integer
do.call("rbind", out)
id amount
A.1 A 45
A.2 A 66
A.3 A 0
B.3 B 99
B.4 B 34
B.5 B 71
C.6 C 22
C.2 C 0
C.3 C 0
Here is a dplyr option:
library(dplyr)
# Get maximum number of rows for all groups
N = max(count(df,id)$n)
df %>%
group_by(id) %>%
summarise(amount = c(amount, rep(0, N-length(amount))), .groups = "drop")
Output
id amount
<chr> <dbl>
1 A 45
2 A 66
3 A 0
4 B 99
5 B 34
6 B 71
7 C 22
8 C 0
9 C 0
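In recent dplyr versions, returning several rows per group from summarise() is deprecated. A sketch of the same padding with reframe(), assuming dplyr >= 1.1.0 is installed; the name padded is made up for the illustration:

```r
library(dplyr)

df <- data.frame(id = c("A", "A", "B", "B", "B", "C"),
                 amount = c(45, 66, 99, 34, 71, 22))

# Size of the largest group
N <- max(count(df, id)$n)

padded <- df %>%
  group_by(id) %>%
  # reframe() may return any number of rows per group and drops the grouping
  reframe(amount = c(amount, rep(0, N - length(amount))))

print(padded)
```

The result has N rows for every id, with zeros filling the shorter groups.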
