Given two dataframes with the same column names:
a <- data.frame(x=1:4,y=5:8)
b <- data.frame(x=LETTERS[1:4],y=LETTERS[5:8])
>a
x y
1 5
2 6
3 7
4 8
>b
x y
A E
B F
C G
D H
How can each column with the same name be concatenated?
Desired output:
cat_x cat_y
1 A 5 E
2 B 6 F
3 C 7 G
4 D 8 H
Tried so far, merging columns one at a time:
a$cat_x <- paste(a$x,b$x)
a$cat_y <- paste(a$y,b$y)
This approach works, but the real data has 40 columns, and there will be several more data frames. I'm looking for a more efficient method for larger data frames.
We can use Map to apply paste to the corresponding columns of both data frames:
data.frame(Map(paste, setNames(a, paste0("cat_", names(a))), b,
MoreArgs = list(sep = "_")))
-output
cat_x cat_y
1 1_A 5_E
2 2_B 6_F
3 3_C 7_G
4 4_D 8_H
sep is used above in case a delimiter is wanted. Without it, paste defaults to a single space:
data.frame(Map(paste, setNames(a, paste0("cat_", names(a))), b ))
cat_x cat_y
1 1 A 5 E
2 2 B 6 F
3 3 C 7 G
4 4 D 8 H
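Since the question mentions that more data frames will follow, here is a minimal sketch of how the same Map idea might be extended to a list of data frames. The third data frame d is hypothetical (it is not in the question), and the sketch assumes every data frame has the same column names and number of rows:

```r
a <- data.frame(x = 1:4, y = 5:8)
b <- data.frame(x = LETTERS[1:4], y = LETTERS[5:8])
d <- data.frame(y = letters[5:8], x = letters[1:4])  # hypothetical, columns in a different order

# Reorder every data frame to a's column order so columns are matched by name
dfs <- lapply(list(a, b, d), function(df) df[names(a)])

# Map() walks the columns of all data frames in parallel;
# paste() accepts any number of vectors
res <- data.frame(do.call(Map, c(list(f = paste), dfs)))
names(res) <- paste0("cat_", names(a))
res
```

Map matches columns by position, which is why the columns are reordered by name first.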
Another possible solution, using purrr::map2_dfc:
library(tidyverse)
map2_dfc(a,b, ~ str_c(.x, .y, sep = " ")) %>%
rename_with(~ str_c("cat", .x, sep = "_"))
#> # A tibble: 4 × 2
#> cat_x cat_y
#> <chr> <chr>
#> 1 1 A 5 E
#> 2 2 B 6 F
#> 3 3 C 7 G
#> 4 4 D 8 H
Suppose I have a data frame with a list of names:
> x <- c("a", "b", "c")
> x <- as.data.frame(x)
# > x
# 1 a
# 2 b
# 3 c
I want to spread each unique name (x, below) to each name (y, below) and create a new column before the original column so that the new data frame looks like this:
# > z
# x y
# a a
# a b
# a c
# b a
# b b
# b c
# c a
# c b
# c c
This is for creating a "from" "to" edge list in igraph where the network is full.
How could I do this? Is there a simple tidyverse solution that I'm missing?
You can use tidyr::expand_grid or tidyr::crossing
tidyr::expand_grid(a = x$x, b = x$x)
#tidyr::crossing(a = x$x, b = x$x)
# a b
# <chr> <chr>
#1 a a
#2 a b
#3 a c
#4 b a
#5 b b
#6 b c
#7 c a
#8 c b
#9 c c
This is similar to base R's expand.grid, except that the row order differs (expand.grid varies its first argument fastest):
expand.grid(a = x$x, b = x$x)
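If the row order matters, one way to reproduce the tidyr ordering with expand.grid might be to swap the arguments and then reorder the columns. This is only a sketch; the names eg and eg2 are made up for illustration:

```r
x <- data.frame(x = c("a", "b", "c"), stringsAsFactors = FALSE)

# Default behaviour: the first argument (a) varies fastest
eg <- expand.grid(a = x$x, b = x$x, stringsAsFactors = FALSE)

# Swapping the arguments makes b vary fastest; selecting the columns
# afterwards restores the a, b column order
eg2 <- expand.grid(b = x$x, a = x$x, stringsAsFactors = FALSE)[, c("a", "b")]
eg2
```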
Using dplyr and tidyr, you could do:
x %>%
mutate(y = x) %>%
complete(y, x)
y x
<fct> <fct>
1 a a
2 a b
3 a c
4 b a
5 b b
6 b c
7 c a
8 c b
9 c c
A base R solution:
names <- c("a", "b", "c")
x = rep(names, each=length(names))
y = rep(names, length(names))
df = data.frame(x,y)
df
x y
1 a a
2 a b
3 a c
4 b a
5 b b
6 b c
7 c a
8 c b
9 c c
You can also use the expand function to return every possible combination of the two columns:
library(tidyr)
x %>%
mutate(y = x) %>%
expand(x, y)
# A tibble: 9 x 2
x y
<chr> <chr>
1 a a
2 a b
3 a c
4 b a
5 b b
6 b c
7 c a
8 c b
9 c c
You can also use the crossing function:
x <- c("a", "b", "c")
x <- as.data.frame(x)
x$y <- c("a", "b", "c")
crossing(x$x, x$y) # But you can't just use it within a pipeline since the first argument is not data
# A tibble: 9 x 2
`x$x` `x$y`
<chr> <chr>
1 a a
2 a b
3 a c
4 b a
5 b b
6 b c
7 c a
8 c b
9 c c
If you really want to use igraph, here is one option (x is the data frame from the question, so the vertex names are x$x):
library(igraph)
make_full_graph(
nrow(x),
directed = TRUE,
loops = TRUE
) %>%
set_vertex_attr(name = "name", value = x$x) %>%
get.data.frame()
which gives
from to
1 a a
2 a b
3 a c
4 b a
5 b b
6 b c
7 c a
8 c b
9 c c
I'm trying to compute a rolling count for Var1 and Var2 within each group.
Essentially, I want to get back new_var1 and new_var2 in this example: a running count that increases every time a value (a, b, and so on) appears in a given group, regardless of whether it appears in column Var1 or Var2. However, each count must be assigned to the proper column. If a appears in Var1, the current count goes into new_var1; if a appears in Var2, the current count goes into new_var2.
x <- expand.grid(letters[1:5],letters[1:5],KEEP.OUT.ATTRS = FALSE)
x <- x[x[,1]!=x[,2],c(2,1)]
x <- data.frame(x,group=as.character(rep(letters[c(1,2,1,4,1)+5],each=4)))
x<- data.frame(x,new_var1 = c(1,2,3,4,1,2,3,4,2,3,4,5,1,2,3,4,3,4,5,6))
x<- data.frame(x,new_var2 = c(1,1,1,1,1,1,1,1,5,2,2,2,1,1,1,1,6,3,6,3))
Var2 Var1 group new_var2 new_var1
a b f 1 1
a c f 2 1
a d f 3 1
a e f 4 1
b a g 1 1
b c g 2 1
b d g 3 1
b e g 4 1
c a f 2 5
c b f 3 2
c d f 4 2
c e f 5 2
d a i 1 1
d b i 2 1
d c i 3 1
d e i 4 1
e a f 3 6
e b f 4 3
e c f 5 6
e d f 6 3
Any help would greatly be appreciated.
I managed to get this working:
x <- data.table(x)
x[, new_var1a := seq(.N) , by = c('Var1','group')]
x[, new_var2a := seq(.N) , by = c('Var2','group')]
Var2 Var1 group new_var2 new_var1 new_var1a new_var2a
1: a b f 1 1 1 1
2: a c f 2 1 1 2
3: a d f 3 1 1 3
4: a e f 4 1 1 4
5: b a g 1 1 1 1
6: b c g 2 1 1 2
7: b d g 3 1 1 3
8: b e g 4 1 1 4
9: c a f 2 5 1 1
10: c b f 3 2 2 2
11: c d f 4 2 2 3
12: c e f 5 2 2 4
13: d a i 1 1 1 1
14: d b i 2 1 1 2
15: d c i 3 1 1 3
16: d e i 4 1 1 4
17: e a f 3 6 2 1
18: e b f 4 3 3 2
19: e c f 5 6 2 3
20: e d f 6 3 3 4
But this counts Var1 and Var2 independently, which is not what I want.
Your problem is more of an algorithmic one, so we'll use a loop instead of dplyr or data.table. For me, using loops in R often means using Rcpp. So this is my answer:
// [[Rcpp::depends(BH)]]
#include <Rcpp.h>
#include <boost/foreach.hpp>
using namespace Rcpp;
// the C-style upper-case macro name is a bit ugly
#define foreach BOOST_FOREACH
// [[Rcpp::export]]
ListOf<IntegerVector> new_vars(const IntegerVector& Var1,
const IntegerVector& Var2,
int n_Var,
ListOf<IntegerVector> ind_groups) {
int nrow = Var1.size();
IntegerVector new_var1a(nrow, NA_INTEGER);
IntegerVector new_var2a(nrow, NA_INTEGER);
for (int i = 0; i < ind_groups.size(); i++) {
IntegerVector counts(n_Var);
foreach(const int& j, ind_groups[i]) {
new_var1a[j] = ++counts[Var1[j]];
new_var2a[j] = ++counts[Var2[j]];
}
}
return List::create(Named("new_var1a") = new_var1a,
Named("new_var2a") = new_var2a);
}
/*** R
x <- expand.grid(letters[1:5],letters[1:5],
KEEP.OUT.ATTRS = FALSE,
stringsAsFactors = FALSE)
x <- x[x[,1]!=x[,2],c(2,1)]
x <- data.frame(x,group=as.character(rep(letters[c(1,2,1,4,1)+5],each=4)))
x <- data.frame(x,new_var1 = c(1,2,3,4,1,2,3,4,2,3,4,5,1,2,3,4,3,4,5,6))
x <- data.frame(x,new_var2 = c(1,1,1,1,1,1,1,1,5,2,2,2,1,1,1,1,6,3,6,3))
getNewVars <- function(x) {
Vars.levels <- unique(c(x$Var2, x$Var1))
new_vars <- new_vars(
Var1 = match(x$Var1, Vars.levels) - 1,
Var2 = match(x$Var2, Vars.levels) - 1,
n_Var = length(Vars.levels),
ind_groups = split(seq_along(x$group) - 1, x$group)
)
cbind(x, new_vars)
}
getNewVars(x)
*/
Put this in a ".cpp" file and source it with Rcpp::sourceCpp(), which also runs the R chunk at the end.
PS: Make sure to use stringsAsFactors = FALSE.
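For readers who would rather not compile C++, the same loop can be sketched in plain R. This is not part of the answer above, only an illustration of the same idea; rolling_counts is a hypothetical helper name, and the sketch assumes Var1 and Var2 are character columns:

```r
x <- expand.grid(letters[1:5], letters[1:5],
                 KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)
x <- x[x[, 1] != x[, 2], c(2, 1)]
x$group <- rep(letters[c(1, 2, 1, 4, 1) + 5], each = 4)

rolling_counts <- function(x) {  # hypothetical helper, not from the answer
  n <- nrow(x)
  new_var1a <- integer(n)
  new_var2a <- integer(n)
  # Process the row positions of each group separately
  for (rows in split(seq_len(n), x$group)) {
    lev <- unique(c(x$Var1[rows], x$Var2[rows]))
    counts <- setNames(integer(length(lev)), lev)  # one shared counter per value
    for (j in rows) {
      v1 <- x$Var1[j]
      v2 <- x$Var2[j]
      counts[[v1]] <- counts[[v1]] + 1L
      new_var1a[j] <- counts[[v1]]
      counts[[v2]] <- counts[[v2]] + 1L
      new_var2a[j] <- counts[[v2]]
    }
  }
  cbind(x, new_var1a, new_var2a)
}

res <- rolling_counts(x)
res
```

Because both columns update the same counter, an appearance in Var1 also advances the count later reported for Var2, which is the behaviour the question asks for. A plain R loop like this will be slower than the Rcpp version on large data.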
A solution with dplyr: first reshape the data from wide to long format, keeping a row id so that the counts can be joined back to the original rows afterwards.
Sample data
df = read.table(text=" Var2 Var1 group new_var2 new_var1
a b f 1 1
a c f 2 1
a d f 3 1
a e f 4 1
b a g 1 1
b c g 2 1
b d g 3 1
b e g 4 1
c a f 2 5
c b f 3 2
c d f 4 2
c e f 5 2
d a i 1 1
d b i 2 1
d c i 3 1
d e i 4 1
e a f 3 6
e b f 4 3
e c f 5 6
e d f 6 3",header=T)
df = df[,c("Var2","Var1","group")]
Code
library(reshape2)
library(dplyr)
df$id = seq(1,nrow(df))
df2 = melt(df, id.vars=c("id", "group")) %>% arrange(id)
df2 = df2 %>% group_by(group,value) %>% mutate(n= row_number())
df = df %>% left_join(df2[df2$variable=="Var1",c("id","n")], by="id")
df = df %>% left_join(df2[df2$variable=="Var2",c("id","n")], by="id")
colnames(df)[colnames(df)=="n.x"]="new_var1"
colnames(df)[colnames(df)=="n.y"]="new_var2"
Optionally add df2 = df2 %>% group_by(group,value,id) %>% mutate(n=max(n)) after the row_number() step and before the joins if a row can contain the same value in both Var1 and Var2 (which is not the case in your example).
Output
Var2 Var1 group id new_var1 new_var2
1 a b f 1 1 1
2 a c f 2 1 2
3 a d f 3 1 3
4 a e f 4 1 4
5 b a g 5 1 1
6 b c g 6 1 2
7 b d g 7 1 3
8 b e g 8 1 4
9 c a f 9 5 2
10 c b f 10 2 3
11 c d f 11 2 4
12 c e f 12 2 5
13 d a i 13 1 1
14 d b i 14 1 2
15 d c i 15 1 3
16 d e i 16 1 4
17 e a f 17 6 3
18 e b f 18 3 4
19 e c f 19 6 5
20 e d f 20 3 6
Hope this helps!
The dcast() function from the data.table package allows us to reshape multiple value variables simultaneously. This can be used to avoid the double left join in Florian's answer:
library(data.table)
long <- melt(setDT(x)[, rn := .I], id.vars = c("rn", "group"),
measure.vars = c("Var1", "Var2"), value.name = "Var")[
, variable := rleid(variable)][
order(rn), new_var := rowid(group, Var)][]
dcast(long, rn + group ~ ..., value.var = c("Var", "new_var"))[, rn := NULL][]
group Var_1 Var_2 new_var_1 new_var_2
1: f b a 1 1
2: f c a 1 2
3: f d a 1 3
4: f e a 1 4
5: g a b 1 1
6: g c b 1 2
7: g d b 1 3
8: g e b 1 4
9: f a c 5 2
10: f b c 2 3
11: f d c 2 4
12: f e c 2 5
13: i a d 1 1
14: i b d 1 2
15: i c d 1 3
16: i e d 1 4
17: f a e 6 3
18: f b e 3 4
19: f c e 6 5
20: f d e 3 6
Explanation
setDT(x) coerces x to a data.table, then a column with row numbers is added before reshaping from wide to long format. The variables are renamed only to get nicer-looking column names from the subsequent dcast() ([, variable := sub("Var", "", variable)] can be used as an alternative to [, variable := rleid(variable)]).
The important step is the numbering of appearances of each Var within each group using rowid() grouped by group and Var.
Now, the result has two value columns. Finally, it is reshaped back from long to wide format again, and the rn column is removed as no longer needed.
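To make the numbering step concrete, here is a small sketch of what rowid() returns on a toy input. The vectors g and v are made up for illustration, and the ave() line is only a rough base R equivalent, not something used in the answer:

```r
library(data.table)

g <- c("f", "f", "g", "f")  # toy group column
v <- c("a", "b", "a", "a")  # toy Var column

# rowid() numbers the appearances of each (g, v) combination in row order
ids <- rowid(g, v)
ids

# Rough base R equivalent: count along each (g, v) subset
ids_base <- ave(seq_along(v), g, v, FUN = seq_along)
ids_base
```

The fourth element is 2 because the combination f/a appears for the second time there.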
Data
x <- expand.grid(letters[1:5], letters[1:5], KEEP.OUT.ATTRS = FALSE)
x <- x[x[, 1] != x[, 2], c(2, 1)]
x <- data.frame(
x,
group = as.character(rep(letters[c(1, 2, 1, 4, 1) + 5], each = 4)),
new_var1 = c(1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 5, 1, 2, 3, 4, 3, 4, 5, 6),
new_var2 = c(1, 1, 1, 1, 1, 1, 1, 1, 5, 2, 2, 2, 1, 1, 1, 1, 6, 3, 6, 3))