I am trying to concatenate certain row values (Strings) given varying conditions in R. I have flagged the row values in Flag (the flagging criteria are irrelevant in this example).
Notations: B is the beginning of a run and E the end. 0 is outside the run. 1 denotes any strings excluding B and E in the run. Your solution does not need to follow my convention.
Rules: Every run must begin with B and end with E. There can be any number of 1s in the run. All strings between B and E (both inclusive) are to be concatenated in the order in which they appear in the run, and the result replaces the B-string. 0-strings remain in the data frame. 1- and E-strings are removed after concatenation.
I haven't come up with anything close to the desired output.
set.seed(128)
df2 <- data.frame(Strings = sample(letters, 17, replace = T),
Flag = c(0,"B",1,1,"E","B","E","B","E",0,"B",1,1,1,"E",0,0))
Strings Flag
1 d 0
2 r B
3 q 1
4 r 1
5 v E
6 f B
7 y E
8 u B
9 c E
10 x 0
11 h B
12 w 1
13 x 1
14 t 1
15 j E
16 d 0
17 j 0
Intermediate output.
Strings Flag Result
1 d 0 d
2 r B r q r v
3 q 1 q
4 r 1 r
5 v E v
6 f B f y
7 y E y
8 u B u c
9 c E c
10 x 0 x
11 h B h w x t j
12 w 1 w
13 x 1 x
14 t 1 t
15 j E j
16 d 0 d
17 j 0 j
Desired output.
Result
1 d
2 r q r v
3 f y
4 u c
5 x
6 h w x t j
7 d
8 j
Here is a solution that might help you. However, I am still not sure if I got your point correctly:
library(dplyr)
df2 %>%
mutate(Flag2 = cumsum(Flag == 'B' | Flag == '0')) %>%
group_by(Flag2) %>%
summarise(Result = paste0(Strings, collapse = ' '))
# A tibble: 8 × 2
Flag2 Result
<int> <chr>
1 1 d
2 2 r q r v
3 3 f y
4 4 u c
5 5 x
6 6 h w x t j
7 7 d
8 8 j
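To see why the cumsum() key works, here is a minimal base R sketch of the same grouping idea. The strings are typed in from the printed data frame above rather than sampled, since sample() output can differ between R versions. The names strings, flag, and key are illustrative, and the sketch assumes every run is well formed (each B is closed by an E before the next B or 0).

```r
# Values copied from the data frame printed in the question
strings <- c("d", "r", "q", "r", "v", "f", "y", "u", "c", "x",
             "h", "w", "x", "t", "j", "d", "j")
flag <- c("0", "B", "1", "1", "E", "B", "E", "B", "E", "0",
          "B", "1", "1", "1", "E", "0", "0")

# A new group starts at every "B" and at every "0";
# "1" and "E" rows keep the key of the preceding "B".
key <- cumsum(flag %in% c("B", "0"))

# Paste the strings of each group together, in their original order
result <- unname(vapply(split(strings, key), paste, character(1), collapse = " "))
print(result)
```

Each 0 row ends up alone in its group, so it passes through unchanged, while each B row absorbs the 1 and E rows that follow it.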
Using dplyr:
library(dplyr)
set.seed(128)
df2 <- data.frame(Strings = sample(letters, 17, replace = T),
Flag = c(0,"B",1,1,"E","B","E","B","E",0,"B",1,1,1,"E",0,0))
df2 %>%
group_by(group = cumsum( (Flag=="B") + (lag(Flag,1,"0")=="E"))) %>%
mutate(Result=if_else(Flag=="B", paste0(Strings,collapse = " "),Strings)) %>%
filter(!(Flag %in% c("1", "E"))) %>% ungroup() %>%
select(-group, -Strings, -Flag)
#> # A tibble: 8 × 1
#> Result
#> <chr>
#> 1 d
#> 2 r q r v
#> 3 f y
#> 4 u c
#> 5 x
#> 6 h w x t j
#> 7 d
#> 8 j
So I have 3 vectors:
n<-10e3
ch<-append(LETTERS,letters)
a<-sample(ch,n,replace=TRUE)
b<-sample(ch,n,replace=TRUE)
c<-sample(ch,n,replace=TRUE)
df<-data.frame(a,b,c)
And now I want it sorted by all of these columns. I found:
df[with(df, order(a,b,c)), ]
but the problem is that it is case-sensitive, and the outcome looks like this:
5359 A a b
7325 A a B
7200 A a g
9122 A a V
2144 A a W
5984 A a z
8349 A A e
5215 A A E
4007 A A f
3099 A A H
3220 A A i
7080 A A N
4963 A A r
9159 A A V
4219 A A w
9723 A b b
4463 A b h
7894 A b V
3765 A B a
8772 A B b
As you can see, it puts "A A e" after "A a z", for example. How can I make the sort case-insensitive?
If you don't mind switching to dplyr, you can sort on a lower-cased helper column:
library(tidyverse)
df %>%
as_tibble() %>%
mutate(lowcase = tolower(paste(a,b,c))) %>%
arrange(lowcase)
# A tibble: 10,000 x 4
a b c lowcase
<fct> <fct> <fct> <chr>
1 a A D a a d
2 A A L a a l
3 a a N a a n
4 A A n a a n
5 a a w a a w
6 A a X a a x
7 A A Y a a y
8 a A Y a a y
9 A b A a b a
10 A B a a b a
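If you would rather stay in base R, a possible sketch is to lower-case each column inside order(), so the sort keys ignore case while the data itself stays untouched. The small data frame below is made up for illustration; tolower() also accepts factor columns because it converts its input to character first.

```r
# Hypothetical small example standing in for the 10,000-row df
df <- data.frame(a = c("b", "A", "a", "A"),
                 b = c("a", "b", "A", "a"),
                 c = c("x", "y", "Z", "a"),
                 stringsAsFactors = FALSE)

# Sort by a, then b, then c, comparing lower-cased values only
sorted <- df[order(tolower(df$a), tolower(df$b), tolower(df$c)), ]
print(sorted)
```

Unlike the pasted helper column, this compares the columns one at a time, which also behaves sensibly when values have different lengths or contain spaces.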
I want to create a new data frame from df below. In the new data frame (df2), each element of df$name is placed in the first column and paired, one row per pair, with every other element of df$name in the same df$group.
df <- data.frame(group = rep(letters[1:2], each=3),
name = LETTERS[1:6])
> df
group name
1 a A
2 a B
3 a C
4 b D
5 b E
6 b F
In this example, "A", "B", and "C" in df$name belong to "a" in df$group, and I want to put them in the same row in a new data frame. The desired output looks like this:
> df2
V1 V2
1 A B
2 A C
3 B A
4 B C
5 C A
6 C B
7 D E
8 D F
9 E D
10 E F
11 F D
12 F E
We could do this in base R with merge
out <- setNames(subset(merge(df, df, by.x = 'group', by.y = 'group'),
name.x != name.y, select = -group), c("V1", "V2"))
row.names(out) <- NULL
out
# V1 V2
#1 A B
#2 A C
#3 B A
#4 B C
#5 C A
#6 C B
#7 D E
#8 D F
#9 E D
#10 E F
#11 F D
#12 F E
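As a rough illustration of what merge() does here: joining df to itself on group yields every within-group combination of names, including each name paired with itself, and the subset() condition then drops those self-pairs. The intermediate name pairs below is only for this sketch.

```r
df <- data.frame(group = rep(letters[1:2], each = 3),
                 name = LETTERS[1:6],
                 stringsAsFactors = FALSE)

# Self-merge: 3 x 3 combinations per group, 2 groups -> 18 rows
pairs <- merge(df, df, by = "group")
print(nrow(pairs))

# Dropping the 6 self-pairs (A-A, B-B, ...) leaves the 12 wanted rows
out <- pairs[pairs$name.x != pairs$name.y, c("name.x", "name.y")]
names(out) <- c("V1", "V2")
print(nrow(out))
```

Because the intermediate result grows with the square of the group size, this approach may become memory-hungry for very large groups.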
In my opinion, it's a case of a self-join. Using dplyr, a solution can be:
library(dplyr)
inner_join(df, df, by="group") %>%
filter(name.x != name.y) %>%
select(V1 = name.x, V2 = name.y)
# V1 V2
# 1 A B
# 2 A C
# 3 B A
# 4 B C
# 5 C A
# 6 C B
# 7 D E
# 8 D F
# 9 E D
# 10 E F
# 11 F D
# 12 F E
df <- data.frame(group = rep(letters[1:2], each=3),
name = LETTERS[1:6])
library(tidyverse)
df %>%
group_by(group) %>% # for every group
summarise(v = list(expand.grid(V1=name, V2=name))) %>% # create all combinations of names
select(v) %>% # keep only the combinations
unnest(v) %>% # unnest combinations
filter(V1 != V2) # exclude rows with same names
# # A tibble: 12 x 2
# V1 V2
# <fct> <fct>
# 1 B A
# 2 C A
# 3 A B
# 4 C B
# 5 A C
# 6 B C
# 7 E D
# 8 F D
# 9 D E
# 10 F E
# 11 D F
# 12 E F
I'm trying to compute a rolling count for the values in Var1 and Var2 within each group.
Essentially, I want to get back new_var1 and new_var2 as in this example: every time Var1 or Var2 contains a within group f, the count for a in f goes up by one, likewise for b within group f, and so on. So the overall appearances of a in each group are counted, regardless of whether a appears in column Var1 or Var2. However, the counts must be assigned to the proper column: if a appears in column Var1, its current count goes into new_var1, and if a appears in Var2, its current count goes into new_var2.
x <- expand.grid(letters[1:5],letters[1:5],KEEP.OUT.ATTRS = FALSE)
x <- x[x[,1]!=x[,2],c(2,1)]
x <- data.frame(x,group=as.character(rep(letters[c(1,2,1,4,1)+5],each=4)))
x<- data.frame(x,new_var1 = c(1,2,3,4,1,2,3,4,2,3,4,5,1,2,3,4,3,4,5,6))
x<- data.frame(x,new_var2 = c(1,1,1,1,1,1,1,1,5,2,2,2,1,1,1,1,6,3,6,3))
Var2 Var1 group new_var2 new_var1
a b f 1 1
a c f 2 1
a d f 3 1
a e f 4 1
b a g 1 1
b c g 2 1
b d g 3 1
b e g 4 1
c a f 2 5
c b f 3 2
c d f 4 2
c e f 5 2
d a i 1 1
d b i 2 1
d c i 3 1
d e i 4 1
e a f 3 6
e b f 4 3
e c f 5 6
e d f 6 3
Any help would greatly be appreciated.
I managed to get this working:
library(data.table)
x <- data.table(x)
x[, new_var1a := seq(.N) , by = c('Var1','group')]
x[, new_var2a := seq(.N) , by = c('Var2','group')]
Var2 Var1 group new_var2 new_var1 new_var1a new_var2a
1: a b f 1 1 1 1
2: a c f 2 1 1 2
3: a d f 3 1 1 3
4: a e f 4 1 1 4
5: b a g 1 1 1 1
6: b c g 2 1 1 2
7: b d g 3 1 1 3
8: b e g 4 1 1 4
9: c a f 2 5 1 1
10: c b f 3 2 2 2
11: c d f 4 2 2 3
12: c e f 5 2 2 4
13: d a i 1 1 1 1
14: d b i 2 1 1 2
15: d c i 3 1 1 3
16: d e i 4 1 1 4
17: e a f 3 6 2 1
18: e b f 4 3 3 2
19: e c f 5 6 2 3
20: e d f 6 3 3 4
But it treats Var1 and Var2 independently, which is not what I want.
Your problem is more of an algorithmic one, so I'll use a loop instead of dplyr or data.table. For me, using loops in R often means using Rcpp. So this is my answer:
// [[Rcpp::depends(BH)]]
#include <Rcpp.h>
#include <boost/foreach.hpp>
using namespace Rcpp;
// the C-style upper-case macro name is a bit ugly
#define foreach BOOST_FOREACH
// [[Rcpp::export]]
ListOf<IntegerVector> new_vars(const IntegerVector& Var1,
const IntegerVector& Var2,
int n_Var,
ListOf<IntegerVector> ind_groups) {
int nrow = Var1.size();
IntegerVector new_var1a(nrow, NA_INTEGER);
IntegerVector new_var2a(nrow, NA_INTEGER);
for (int i = 0; i < ind_groups.size(); i++) {
IntegerVector counts(n_Var);
foreach(const int& j, ind_groups[i]) {
new_var1a[j] = ++counts[Var1[j]];
new_var2a[j] = ++counts[Var2[j]];
}
}
return List::create(Named("new_var1a") = new_var1a,
Named("new_var2a") = new_var2a);
}
/*** R
x <- expand.grid(letters[1:5],letters[1:5],
KEEP.OUT.ATTRS = FALSE,
stringsAsFactors = FALSE)
x <- x[x[,1]!=x[,2],c(2,1)]
x <- data.frame(x,group=as.character(rep(letters[c(1,2,1,4,1)+5],each=4)))
x <- data.frame(x,new_var1 = c(1,2,3,4,1,2,3,4,2,3,4,5,1,2,3,4,3,4,5,6))
x <- data.frame(x,new_var2 = c(1,1,1,1,1,1,1,1,5,2,2,2,1,1,1,1,6,3,6,3))
getNewVars <- function(x) {
Vars.levels <- unique(c(x$Var2, x$Var1))
new_vars <- new_vars(
Var1 = match(x$Var1, Vars.levels) - 1,
Var2 = match(x$Var2, Vars.levels) - 1,
n_Var = length(Vars.levels),
ind_groups = split(seq_along(x$group) - 1, x$group)
)
cbind(x, new_vars)
}
getNewVars(x)
*/
Put this in a ".cpp" file and source it.
PS: Make sure to use stringsAsFactors = FALSE.
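For readers without a C++ toolchain, the same counting loop can be sketched in plain R. This is only an illustration of the algorithm, not a replacement for the Rcpp version (it will be slower on large data), and the function name count_joint is made up for this example. Like the C++ code, it increments the counter for Var1 before the one for Var2 on each row.

```r
# Plain-R sketch of the loop: one shared counter per (group, value) pair
count_joint <- function(var1, var2, group) {
  var1 <- as.character(var1)
  var2 <- as.character(var2)
  group <- as.character(group)
  values <- unique(c(var1, var2))
  groups <- unique(group)
  counts <- matrix(0L, nrow = length(groups), ncol = length(values),
                   dimnames = list(groups, values))
  n <- length(group)
  new_var1 <- integer(n)
  new_var2 <- integer(n)
  for (i in seq_len(n)) {
    g <- group[i]
    counts[g, var1[i]] <- counts[g, var1[i]] + 1L
    new_var1[i] <- counts[g, var1[i]]
    counts[g, var2[i]] <- counts[g, var2[i]] + 1L
    new_var2[i] <- counts[g, var2[i]]
  }
  data.frame(new_var1, new_var2)
}

# Tiny made-up input: three rows, all in group "f"
res <- count_joint(var1 = c("b", "c", "a"),
                   var2 = c("a", "a", "c"),
                   group = c("f", "f", "f"))
print(res)
```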
Solution with dplyr, by first casting the data from wide to long format, while keeping the row id to later merge again.
Sample data
df = read.table(text=" Var2 Var1 group new_var2 new_var1
a b f 1 1
a c f 2 1
a d f 3 1
a e f 4 1
b a g 1 1
b c g 2 1
b d g 3 1
b e g 4 1
c a f 2 5
c b f 3 2
c d f 4 2
c e f 5 2
d a i 1 1
d b i 2 1
d c i 3 1
d e i 4 1
e a f 3 6
e b f 4 3
e c f 5 6
e d f 6 3",header=T)
df = df[,c("Var2","Var1","group")]
Code
library(reshape2)
library(dplyr)
df$id = seq(1,nrow(df))
df2 = melt(df, id.vars=c("id", "group")) %>% arrange(id)
df2 = df2 %>% group_by(group,value) %>% mutate(n= row_number())
df = df %>% left_join(df2[df2$variable=="Var1",c("id","n")], by="id")
df = df %>% left_join(df2[df2$variable=="Var2",c("id","n")], by="id")
colnames(df)[colnames(df)=="n.x"]="new_var1"
colnames(df)[colnames(df)=="n.y"]="new_var2"
Optionally add df2 = df2 %>% group_by(group,value,id) %>% mutate(n=max(n)) if a line can contain the same variables (which is not the case in your example).
Output
Var2 Var1 group id new_var1 new_var2
1 a b f 1 1 1
2 a c f 2 1 2
3 a d f 3 1 3
4 a e f 4 1 4
5 b a g 5 1 1
6 b c g 6 1 2
7 b d g 7 1 3
8 b e g 8 1 4
9 c a f 9 5 2
10 c b f 10 2 3
11 c d f 11 2 4
12 c e f 12 2 5
13 d a i 13 1 1
14 d b i 14 1 2
15 d c i 15 1 3
16 d e i 16 1 4
17 e a f 17 6 3
18 e b f 18 3 4
19 e c f 19 6 5
20 e d f 20 3 6
Hope this helps!
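The same wide-to-long idea can also be sketched in base R without reshape2 or joins, using ave() to number the appearances of each value within each group. The helper data frame long and its column names are assumptions made for this sketch; the input is rebuilt with the question's code, with stringsAsFactors = FALSE so the values stay character.

```r
x <- expand.grid(letters[1:5], letters[1:5],
                 KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)
x <- x[x[, 1] != x[, 2], c(2, 1)]
x$group <- rep(letters[c(1, 2, 1, 4, 1) + 5], each = 4)

# Long format in row order: Var1 and Var2 of row 1, then of row 2, ...
n <- nrow(x)
long <- data.frame(col   = rep(c("Var1", "Var2"), times = n),
                   value = c(rbind(x$Var1, x$Var2)),
                   group = rep(x$group, each = 2),
                   stringsAsFactors = FALSE)

# Running count of each value within each group, across both columns
long$count <- ave(seq_len(nrow(long)), long$group, long$value, FUN = seq_along)

# Put the counts back next to the column they came from
x$new_var1 <- long$count[long$col == "Var1"]
x$new_var2 <- long$count[long$col == "Var2"]
print(x)
```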
The dcast() function from the data.table package allows us to reshape multiple value variables simultaneously. This can be used to avoid the double left join in Florian's answer:
library(data.table)
long <- melt(setDT(x)[, rn := .I], id.vars = c("rn", "group"),
measure.vars = c("Var1", "Var2"), value.name = "Var")[
, variable := rleid(variable)][
order(rn), new_var := rowid(group, Var)][]
dcast(long, rn + group ~ ..., value.var = c("Var", "new_var"))[, rn := NULL][]
group Var_1 Var_2 new_var_1 new_var_2
1: f b a 1 1
2: f c a 1 2
3: f d a 1 3
4: f e a 1 4
5: g a b 1 1
6: g c b 1 2
7: g d b 1 3
8: g e b 1 4
9: f a c 5 2
10: f b c 2 3
11: f d c 2 4
12: f e c 2 5
13: i a d 1 1
14: i b d 1 2
15: i c d 1 3
16: i e d 1 4
17: f a e 6 3
18: f b e 3 4
19: f c e 6 5
20: f d e 3 6
Explanation
setDT(x) coerces x to data.table, then a column with row numbers is added before reshaping from wide to long format. Just to get nicer looking column names from the subsequent dcast(), the variables are renamed (for this [, variable := sub("Var", "", variable)] can be used as alternative to [, variable := rleid(variable)]).
The important step is the numbering of appearances of each Var within each group using rowid() grouped by group and Var.
Now, the result has two value columns. Finally, it is reshaped back from long to wide format again, and the rn column is removed as no longer needed.
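To make the key step concrete, here is a small sketch of rowid() on two hand-made vectors (grp and val are illustrative and not taken from the data). It assumes the data.table package is installed.

```r
library(data.table)

grp <- c("f", "f", "g", "f")
val <- c("a", "a", "a", "b")

# rowid() numbers the rows within each distinct (grp, val) combination,
# in order of appearance
ids <- rowid(grp, val)
print(ids)
```

The second "a" in group "f" gets 2, while "a" in group "g" and "b" in group "f" each start again at 1, which is the rolling count the question asks for once the data is in long format.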
Data
x <- expand.grid(letters[1:5], letters[1:5], KEEP.OUT.ATTRS = FALSE)
x <- x[x[, 1] != x[, 2], c(2, 1)]
x <- data.frame(
x,
group = as.character(rep(letters[c(1, 2, 1, 4, 1) + 5], each = 4)),
new_var1 = c(1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 5, 1, 2, 3, 4, 3, 4, 5, 6),
new_var2 = c(1, 1, 1, 1, 1, 1, 1, 1, 5, 2, 2, 2, 1, 1, 1, 1, 6, 3, 6, 3))
In a dataset like
df <- tibble::tibble(a=letters, a_1=letters, b=letters, b_1=letters)
I would like to concatenate the columns that share a similar "root", namely a with a_1 and b with b_1. The output should look like
# A tibble: 26 x 2
a b
<chr> <chr>
1 a a a a
2 b b b b
3 c c c c
4 d d d d
5 e e e e
6 f f f f
7 g g g g
8 h h h h
9 i i i i
10 j j j j
# ... with 16 more rows
If you're looking for a tidyverse approach, you can do it using tidyr::unite_ (deprecated in current tidyr releases; unite() is its replacement):
library(tidyr)
# get a list of column-name groups
cols <- split(names(df), sub("_.*", "", names(df)))
# loop through list and unite columns
for(x in names(cols)) {
df <- unite_(df, x, cols[[x]], sep = " ")
}
Here is one way to go about it,
ind <- sub('_.*', '', names(df))
as.data.frame(sapply(unique(ind), function(i) do.call(paste, df[i == ind])))
# a b
#1 a a a a
#2 b b b b
#3 c c c c
#4 d d d d
#5 e e e e
#6 f f f f
#7 g g g g
#8 h h h h
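A small variation on this, sketched under the assumption that you want plain character columns in every R version: sapply() returns a character matrix, and as.data.frame() on such a matrix produced factor columns by default before R 4.0. Building a list with lapply() and passing stringsAsFactors = FALSE avoids that. The four-column df and the names roots and res below are made up for illustration.

```r
# Hypothetical small input with the same naming pattern as the question
df <- data.frame(a = c("x", "y"), a_1 = c("p", "q"),
                 b = c("1", "2"), b_1 = c("3", "4"),
                 stringsAsFactors = FALSE)

# Root of each column name: everything before the first underscore
roots <- sub("_.*", "", names(df))

# For each root, paste its columns together row by row
res <- lapply(unique(roots), function(r) do.call(paste, df[roots == r]))
names(res) <- unique(roots)

out <- as.data.frame(res, stringsAsFactors = FALSE)
print(out)
```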