Concatenate row values given varying conditions in R - r

I am trying to concatenate certain row values (Strings) given varying conditions in R. I have flagged the row values in Flag (the flagging criteria are irrelevant in this example).
Notations: B is the beginning of a run and E the end. 0 is outside the run. 1 denotes any strings excluding B and E in the run. Your solution does not need to follow my convention.
Rules: Every run must begin with B and end with E. There can be any number of 1s in the run. All Strings positioned between B and E (both inclusive) are to be concatenated in the order in which they appear in the run, and the result replaces the B-string. 0-strings remain in the dataframe. 1- and E-strings are removed after concatenation.
I haven't come up with anything close to the desired output.
set.seed(128)
df2 <- data.frame(Strings = sample(letters, 17, replace = TRUE),
Flag = c(0,"B",1,1,"E","B","E","B","E",0,"B",1,1,1,"E",0,0))
Strings Flag
1 d 0
2 r B
3 q 1
4 r 1
5 v E
6 f B
7 y E
8 u B
9 c E
10 x 0
11 h B
12 w 1
13 x 1
14 t 1
15 j E
16 d 0
17 j 0
Intermediate output.
Strings Flag Result
1 d 0 d
2 r B r q r v
3 q 1 q
4 r 1 r
5 v E v
6 f B f y
7 y E y
8 u B u c
9 c E c
10 x 0 x
11 h B h w x t j
12 w 1 w
13 x 1 x
14 t 1 t
15 j E j
16 d 0 d
17 j 0 j
Desired output.
Result
1 d
2 r q r v
3 f y
4 u c
5 x
6 h w x t j
7 d
8 j

Here is a solution that might help you. However, I am still not sure if I got your point correctly:
library(dplyr)
df2 %>%
mutate(Flag2 = cumsum(Flag == 'B' | Flag == '0')) %>%
group_by(Flag2) %>%
summarise(Result = paste0(Strings, collapse = ' '))
# A tibble: 8 × 2
Flag2 Result
<int> <chr>
1 1 d
2 2 r q r v
3 3 f y
4 4 u c
5 5 x
6 6 h w x t j
7 7 d
8 8 j
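The grouping idea above (start a new group at every B or 0, then paste within each group) can also be sketched in base R. This is only an illustration: the strings are typed in from the question's printout rather than drawn with sample(), because the random draw can differ between R versions, and the names strings, flag, grp and result are made up for this sketch.

```r
# Strings and flags copied from the question's printed data frame
strings <- c("d", "r", "q", "r", "v", "f", "y", "u", "c", "x",
             "h", "w", "x", "t", "j", "d", "j")
flag <- c("0", "B", "1", "1", "E", "B", "E", "B", "E", "0",
          "B", "1", "1", "1", "E", "0", "0")

# A new group starts at every "B" (start of a run) and every "0" (stand-alone row)
grp <- cumsum(flag %in% c("B", "0"))

# Paste the strings of each group together, keeping their original order
result <- unname(vapply(split(strings, grp), paste, character(1), collapse = " "))
result
```

This relies on the same assumption as the dplyr version: every run is closed by an E before the next B or 0 appears.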

Using dplyr:
library(dplyr)
set.seed(128)
df2 <- data.frame(Strings = sample(letters, 17, replace = TRUE),
Flag = c(0,"B",1,1,"E","B","E","B","E",0,"B",1,1,1,"E",0,0))
df2 %>%
group_by(group = cumsum( (Flag=="B") + (lag(Flag,1,"0")=="E"))) %>%
mutate(Result=if_else(Flag=="B", paste0(Strings,collapse = " "),Strings)) %>%
filter(!(Flag %in% c("1", "E"))) %>% ungroup() %>%
select(-group, -Strings, -Flag)
#> # A tibble: 8 × 1
#> Result
#> <chr>
#> 1 d
#> 2 r q r v
#> 3 f y
#> 4 u c
#> 5 x
#> 6 h w x t j
#> 7 d
#> 8 j

Related

Concatenate values in two data frames in R

Given two dataframes with the same column names:
a <- data.frame(x=1:4,y=5:8)
b <- data.frame(x=LETTERS[1:4],y=LETTERS[5:8])
>a
x y
1 5
2 6
3 7
4 8
>b
x y
A E
B F
C G
D H
How can each column with the same name be concatenated?
Desired output:
cat_x cat_y
1 A 5 E
2 B 6 F
3 C 7 G
4 D 8 H
Tried so far, merging columns one at a time:
a$cat_x <- paste(a$x,b$x)
a$cat_y <- paste(a$y,b$y)
This approach works, but the real data has 40 columns (and will include several more dataframes). I am looking for a more efficient method for larger dataframes.
We can use Map to paste the corresponding columns of a and b pairwise:
data.frame(Map(paste, setNames(a, paste0("cat_", names(a))), b,
MoreArgs = list(sep = "_")))
Output:
cat_x cat_y
1 1_A 5_E
2 2_B 6_F
3 3_C 7_G
4 4_D 8_H
sep is set above in case we want a custom delimiter. Without it, paste separates the values with a space by default:
data.frame(Map(paste, setNames(a, paste0("cat_", names(a))), b ))
cat_x cat_y
1 1 A 5 E
2 2 B 6 F
3 3 C 7 G
4 4 D 8 H
Another possible solution, using purrr::map2_dfc:
library(tidyverse)
map2_dfc(a,b, ~ str_c(.x, .y, sep = " ")) %>%
rename_with(~ str_c("cat", .x, sep = "_"))
#> # A tibble: 4 × 2
#> cat_x cat_y
#> <chr> <chr>
#> 1 1 A 5 E
#> 2 2 B 6 F
#> 3 3 C 7 G
#> 4 4 D 8 H
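Since the question mentions that more dataframes will follow, the Map idea can be extended to any number of them by collecting the dataframes in a list and calling Map through do.call. This is a minimal sketch; the third dataframe d and the names dfs, combined and out are invented for illustration, and it assumes all dataframes have the same columns in the same order.

```r
a <- data.frame(x = 1:4, y = 5:8)
b <- data.frame(x = LETTERS[1:4], y = LETTERS[5:8], stringsAsFactors = FALSE)
d <- data.frame(x = letters[1:4], y = letters[5:8], stringsAsFactors = FALSE)  # hypothetical third dataframe

dfs <- list(a, b, d)

# Equivalent to Map(paste, a, b, d): paste is applied to the first columns,
# then to the second columns, and so on
combined <- do.call(Map, c(list(f = paste), dfs))

out <- as.data.frame(combined, stringsAsFactors = FALSE)
names(out) <- paste0("cat_", names(a))
out
```

If the column order may differ between dataframes, reordering each one with something like df[names(a)] before the call would be a safer starting point.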

Spread multiple values to unique values in data frame in R [duplicate]

Suppose I have a data frame with a list of names:
> x <- c("a", "b", "c")
> x <- as.data.frame(x)
# > x
# 1 a
# 2 b
# 3 c
I want to spread each unique name (x, below) to each name (y, below) and create a new column before the original column so that the new data frame looks like this:
# > z
# x y
# a a
# a b
# a c
# b a
# b b
# b c
# c a
# c b
# c c
This is for creating a "from" "to" edge list in igraph where the network is full.
How could I do this? Is there a simple tidyverse solution that I'm missing?
You can use tidyr::expand_grid or tidyr::crossing
tidyr::expand_grid(a = x$x, b = x$x)
#tidyr::crossing(a = x$x, b = x$x)
# a b
# <chr> <chr>
#1 a a
#2 a b
#3 a c
#4 b a
#5 b b
#6 b c
#7 c a
#8 c b
#9 c c
This is similar to base R's expand.grid; only the row order is different, because expand.grid varies the first column fastest.
expand.grid(a = x$x, b = x$x)
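If the row order matters, one way to get the same order from expand.grid is to pass the columns in reverse and then swap them back. This is a small sketch under the assumption that x is the one-column data frame from the question; eg is just a name chosen here.

```r
x <- data.frame(x = c("a", "b", "c"), stringsAsFactors = FALSE)

# expand.grid varies its first argument fastest, so give it b first
# and then put the columns back in a, b order
eg <- expand.grid(b = x$x, a = x$x, stringsAsFactors = FALSE)[, c("a", "b")]
eg
```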
Using dplyr and tidyr, you could do:
library(dplyr)
library(tidyr)
x %>%
mutate(y = x) %>%
complete(y, x)
y x
<fct> <fct>
1 a a
2 a b
3 a c
4 b a
5 b b
6 b c
7 c a
8 c b
9 c c
A base R solution:
names <- c("a", "b", "c")
x = rep(names, each=length(names))
y = rep(names, length(names))
df = data.frame(x,y)
df
x y
1 a a
2 a b
3 a c
4 b a
5 b b
6 b c
7 c a
8 c b
9 c c
You can also use the expand function to return every possible combination of the two columns:
library(tidyr)
x %>%
mutate(y = x) %>%
expand(x, y)
# A tibble: 9 x 2
x y
<chr> <chr>
1 a a
2 a b
3 a c
4 b a
5 b b
6 b c
7 c a
8 c b
9 c c
You can also use the crossing function:
x <- c("a", "b", "c")
x <- as.data.frame(x)
x$y <- c("a", "b", "c")
crossing(x$x, x$y) # But you can't just use it within a pipeline since the first argument is not data
# A tibble: 9 x 2
`x$x` `x$y`
<chr> <chr>
1 a a
2 a b
3 a c
4 b a
5 b b
6 b c
7 c a
8 c b
9 c c
If you really want to use igraph, here is one option (assuming x is the data frame from the question, so the names are in x$x)
library(igraph)
make_full_graph(
nrow(x),
directed = TRUE,
loops = TRUE
) %>%
set_vertex_attr(name = "name", value = x$x) %>%
get.data.frame()
which gives
from to
1 a a
2 a b
3 a c
4 b a
5 b b
6 b c
7 c a
8 c b
9 c c

Count freq of word and which column it appears in using R

I have a dataset with a series of names in different columns. Each column corresponds to the time at which the names were entered into the system. Is it possible to find the number of times each of the names appears, and the most recent column in which it was entered? I added a picture to show how the dataset works.
Here's one method:
library(dplyr)
set.seed(42)
dat <- setNames(as.data.frame(replicate(4, sample(letters, size = 10, replace = TRUE))), 1:4)
dat
# 1 2 3 4
# 1 q x c c
# 2 e g i z
# 3 a d y a
# 4 y y d j
# 5 j e e x
# 6 d n m k
# 7 r t e o
# 8 z z t v
# 9 q r b z
# 10 o o h h
tidyverse
library(dplyr)
library(tidyr)
pivot_longer(dat, everything(), names_to = "colname", values_to = "word") %>%
mutate(colname = as.integer(colname)) %>%
group_by(word) %>%
summarize(n = n(), latest = max(colname), .groups = "drop")
# # A tibble: 20 x 3
# word n latest
# <chr> <int> <int>
# 1 a 2 4
# 2 b 1 3
# 3 c 2 4
# 4 d 3 3
# 5 e 4 3
# 6 g 1 2
# 7 h 2 4
# 8 i 1 3
# 9 j 2 4
# 10 k 1 4
# 11 m 1 3
# 12 n 1 2
# 13 o 3 4
# 14 q 2 1
# 15 r 2 2
# 16 t 2 3
# 17 v 1 4
# 18 x 2 4
# 19 y 3 3
# 20 z 4 4
data.table
library(data.table)
melt(as.data.table(dat), integer(0), variable.name = "colname", value.name = "word")[
, colname := as.integer(colname)
][, .(n = .N, latest = max(colname)), by = .(word) ]
(though it is not sorted by word, the values are the same)
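For comparison, the same count-and-latest summary can be sketched in base R with table and tapply. The small data frame below is made up so that the result is easy to check by hand; words, cols, n, latest and res are names chosen for this sketch.

```r
# Small made-up example: column names are the entry times 1, 2, 3
dat <- data.frame(`1` = c("q", "e", "a"),
                  `2` = c("x", "a", "q"),
                  `3` = c("a", "e", "z"),
                  check.names = FALSE, stringsAsFactors = FALSE)

# Flatten column by column and remember which column each word came from
words <- unlist(dat, use.names = FALSE)
cols <- rep(as.integer(names(dat)), each = nrow(dat))

n <- table(words)                    # how often each word appears
latest <- tapply(cols, words, max)   # highest column number per word

res <- data.frame(word = names(n),
                  n = as.vector(n),
                  latest = as.vector(latest[names(n)]),
                  stringsAsFactors = FALSE)
res
```

As in the answers above, this assumes the column names can be read as integers that increase with time.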

R slide window through tibble

I have a simple question that I cannot figure out a solution to.
I also didn't find an existing answer that I understand.
Imagine I have this data frame
(ts <- tibble(
a = LETTERS[1:10],
b = c(rep(1, 5), rep(2,5))
))
# A tibble: 10 x 2
a b
<chr> <dbl>
1 A 1
2 B 1
3 C 1
4 D 1
5 E 1
6 F 2
7 G 2
8 H 2
9 I 2
10 J 2
What I want is simple. I want to build a df in which column b indexes a sliding window of size n over column a.
The output can be something like this:
# A tibble: 8 x 2
b a
<dbl> <chr>
1 1 A B
2 1 B C
3 1 C D
4 1 D E
5 2 F G
6 2 G H
7 2 H I
8 2 I J
I don't care if the column a contains an array (nest values).
I just need a new data frame based on the sliding window.
Since this operation will run in a relational database, I'd like a function compatible with DBI and PostgreSQL.
Any help is appreciated.
Thanks in advance
We can group by 'b', create the new column based on the lead of 'a', and remove the NA rows with na.omit
library(dplyr)
ts %>%
group_by(b) %>%
mutate(a2 = lead(a)) %>%
ungroup %>%
na.omit %>%
select(b, everything())
# A tibble: 8 x 3
# b a a2
# <dbl> <chr> <chr>
#1 1 A B
#2 1 B C
#3 1 C D
#4 1 D E
#5 2 F G
#6 2 G H
#7 2 H I
#8 2 I J
If lead doesn't work, then just remove the first element and append NA at the end in the mutate step
ts %>%
group_by(b) %>%
mutate(a2 = c(a[-1], NA)) %>%
ungroup %>%
na.omit %>%
select(b, everything())
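Both versions above give a window of size 2. For a general window size n, a plain base R sketch could look like the following. It runs in memory, not in the database, so it does not address the PostgreSQL requirement; slide_paste, tbl, windows and out are names invented for this example.

```r
# Paste every run of n consecutive elements of v into one string
slide_paste <- function(v, n) {
  if (length(v) < n) return(character(0))
  starts <- seq_len(length(v) - n + 1)
  vapply(starts, function(i) paste(v[i:(i + n - 1)], collapse = " "), character(1))
}

tbl <- data.frame(a = LETTERS[1:10], b = rep(c(1, 2), each = 5),
                  stringsAsFactors = FALSE)

# Apply the window separately within each value of b
windows <- lapply(split(tbl$a, tbl$b), slide_paste, n = 2)

out <- data.frame(b = rep(as.numeric(names(windows)), lengths(windows)),
                  a = unlist(windows, use.names = FALSE),
                  stringsAsFactors = FALSE)
out
```

Calling slide_paste(LETTERS[1:5], 3) gives the three windows "A B C", "B C D" and "C D E".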

rolling count of var1 or var2 by group in R

I'm trying to do a rolling count of the values in variable one or two within each group.
Essentially I want to get back new_var1 and new_var2 as in this example, where the count increases every time Var1 or Var2 contains a within group f, or b within group f, and so on. So the overall appearances of a in each group are counted, regardless of whether a appears in column Var1 or Var2. However, the counts must be assigned to the proper column. So, if a appears in column Var1, then the current count must be assigned to column new_var1. Accordingly, for a in Var2 the current count goes in new_var2.
x <- expand.grid(letters[1:5],letters[1:5],KEEP.OUT.ATTRS = FALSE)
x <- x[x[,1]!=x[,2],c(2,1)]
x <- data.frame(x,group=as.character(rep(letters[c(1,2,1,4,1)+5],each=4)))
x<- data.frame(x,new_var1 = c(1,2,3,4,1,2,3,4,2,3,4,5,1,2,3,4,3,4,5,6))
x<- data.frame(x,new_var2 = c(1,1,1,1,1,1,1,1,5,2,2,2,1,1,1,1,6,3,6,3))
Var2 Var1 group new_var2 new_var1
a b f 1 1
a c f 2 1
a d f 3 1
a e f 4 1
b a g 1 1
b c g 2 1
b d g 3 1
b e g 4 1
c a f 2 5
c b f 3 2
c d f 4 2
c e f 5 2
d a i 1 1
d b i 2 1
d c i 3 1
d e i 4 1
e a f 3 6
e b f 4 3
e c f 5 6
e d f 6 3
Any help would greatly be appreciated.
I managed to get this working:
x <- data.table(x)
x[, new_var1a := seq(.N) , by = c('Var1','group')]
x[, new_var2a := seq(.N) , by = c('Var2','group')]
Var2 Var1 group new_var2 new_var1 new_var1a new_var2a
1: a b f 1 1 1 1
2: a c f 2 1 1 2
3: a d f 3 1 1 3
4: a e f 4 1 1 4
5: b a g 1 1 1 1
6: b c g 2 1 1 2
7: b d g 3 1 1 3
8: b e g 4 1 1 4
9: c a f 2 5 1 1
10: c b f 3 2 2 2
11: c d f 4 2 2 3
12: c e f 5 2 2 4
13: d a i 1 1 1 1
14: d b i 2 1 1 2
15: d c i 3 1 1 3
16: d e i 4 1 1 4
17: e a f 3 6 2 1
18: e b f 4 3 3 2
19: e c f 5 6 2 3
20: e d f 6 3 3 4
But it treats Var1 and Var2 independently, which I do not want.
Your problem is more of an algorithm problem, so we'll use a loop instead of dplyr or data.table. For me, using loops in R often means using Rcpp. So this is my answer:
// [[Rcpp::depends(BH)]]
#include <Rcpp.h>
#include <boost/foreach.hpp>
using namespace Rcpp;
// the C-style upper-case macro name is a bit ugly
#define foreach BOOST_FOREACH
// [[Rcpp::export]]
ListOf<IntegerVector> new_vars(const IntegerVector& Var1,
const IntegerVector& Var2,
int n_Var,
ListOf<IntegerVector> ind_groups) {
int nrow = Var1.size();
IntegerVector new_var1a(nrow, NA_INTEGER);
IntegerVector new_var2a(nrow, NA_INTEGER);
for (int i = 0; i < ind_groups.size(); i++) {
IntegerVector counts(n_Var);
foreach(const int& j, ind_groups[i]) {
new_var1a[j] = ++counts[Var1[j]];
new_var2a[j] = ++counts[Var2[j]];
}
}
return List::create(Named("new_var1a") = new_var1a,
Named("new_var2a") = new_var2a);
}
/*** R
x <- expand.grid(letters[1:5],letters[1:5],
KEEP.OUT.ATTRS = FALSE,
stringsAsFactors = FALSE)
x <- x[x[,1]!=x[,2],c(2,1)]
x <- data.frame(x,group=as.character(rep(letters[c(1,2,1,4,1)+5],each=4)))
x <- data.frame(x,new_var1 = c(1,2,3,4,1,2,3,4,2,3,4,5,1,2,3,4,3,4,5,6))
x <- data.frame(x,new_var2 = c(1,1,1,1,1,1,1,1,5,2,2,2,1,1,1,1,6,3,6,3))
getNewVars <- function(x) {
Vars.levels <- unique(c(x$Var2, x$Var1))
new_vars <- new_vars(
Var1 = match(x$Var1, Vars.levels) - 1,
Var2 = match(x$Var2, Vars.levels) - 1,
n_Var = length(Vars.levels),
ind_groups = split(seq_along(x$group) - 1, x$group)
)
cbind(x, new_vars)
}
getNewVars(x)
*/
Put this in a ".cpp" file and source it.
PS: Make sure to use stringsAsFactors = FALSE.
Solution with dplyr, by first casting the data from wide to long format, while keeping the row id to later merge again.
Sample data
df = read.table(text=" Var2 Var1 group new_var2 new_var1
a b f 1 1
a c f 2 1
a d f 3 1
a e f 4 1
b a g 1 1
b c g 2 1
b d g 3 1
b e g 4 1
c a f 2 5
c b f 3 2
c d f 4 2
c e f 5 2
d a i 1 1
d b i 2 1
d c i 3 1
d e i 4 1
e a f 3 6
e b f 4 3
e c f 5 6
e d f 6 3",header=T)
df = df[,c("Var2","Var1","group")]
Code
library(reshape2)
library(dplyr)
df$id = seq(1,nrow(df))
df2 = melt(df, id.vars=c("id", "group")) %>% arrange(id)
df2 = df2 %>% group_by(group,value) %>% mutate(n= row_number())
df = df %>% left_join(df2[df2$variable=="Var1",c("id","n")], by="id")
df = df %>% left_join(df2[df2$variable=="Var2",c("id","n")], by="id")
colnames(df)[colnames(df)=="n.x"]="new_var1"
colnames(df)[colnames(df)=="n.y"]="new_var2"
Optionally add df2 = df2 %>% group_by(group,value,id) %>% mutate(n=max(n)) if a line can contain the same variables (which is not the case in your example).
Output
Var2 Var1 group id new_var1 new_var2
1 a b f 1 1 1
2 a c f 2 1 2
3 a d f 3 1 3
4 a e f 4 1 4
5 b a g 5 1 1
6 b c g 6 1 2
7 b d g 7 1 3
8 b e g 8 1 4
9 c a f 9 5 2
10 c b f 10 2 3
11 c d f 11 2 4
12 c e f 12 2 5
13 d a i 13 1 1
14 d b i 14 1 2
15 d c i 15 1 3
16 d e i 16 1 4
17 e a f 17 6 3
18 e b f 18 3 4
19 e c f 19 6 5
20 e d f 20 3 6
Hope this helps!
The dcast() function from the data.table package allows us to reshape multiple value variables simultaneously. This can be used to avoid the double left join in Florian's answer:
library(data.table)
long <- melt(setDT(x)[, rn := .I], id.vars = c("rn", "group"),
measure.vars = c("Var1", "Var2"), value.name = "Var")[
, variable := rleid(variable)][
order(rn), new_var := rowid(group, Var)][]
dcast(long, rn + group ~ ..., value.var = c("Var", "new_var"))[, rn := NULL][]
group Var_1 Var_2 new_var_1 new_var_2
1: f b a 1 1
2: f c a 1 2
3: f d a 1 3
4: f e a 1 4
5: g a b 1 1
6: g c b 1 2
7: g d b 1 3
8: g e b 1 4
9: f a c 5 2
10: f b c 2 3
11: f d c 2 4
12: f e c 2 5
13: i a d 1 1
14: i b d 1 2
15: i c d 1 3
16: i e d 1 4
17: f a e 6 3
18: f b e 3 4
19: f c e 6 5
20: f d e 3 6
Explanation
setDT(x) coerces x to data.table, then a column with row numbers is added before reshaping from wide to long format. Just to get nicer looking column names from the subsequent dcast(), the variables are renamed (for this [, variable := sub("Var", "", variable)] can be used as alternative to [, variable := rleid(variable)]).
The important step is the numbering of appearances of each Var within each group using rowid() grouped by group and Var.
Now, the result has two value columns. Finally, it is reshaped back from long to wide format again, and the rn column is removed as no longer needed.
Data
x <- expand.grid(letters[1:5], letters[1:5], KEEP.OUT.ATTRS = FALSE)
x <- x[x[, 1] != x[, 2], c(2, 1)]
x <- data.frame(
x,
group = as.character(rep(letters[c(1, 2, 1, 4, 1) + 5], each = 4)),
new_var1 = c(1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 5, 1, 2, 3, 4, 3, 4, 5, 6),
new_var2 = c(1, 1, 1, 1, 1, 1, 1, 1, 5, 2, 2, 2, 1, 1, 1, 1, 6, 3, 6, 3))
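The same long-format idea can also be sketched in base R without extra packages: interleave Var1 and Var2 row by row, number the appearances of each value within its group with ave, and split the counts back into two columns. This is only an illustration of the technique used in the answers above; long, which and count are names chosen here.

```r
x <- expand.grid(letters[1:5], letters[1:5],
                 KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)
x <- x[x[, 1] != x[, 2], c(2, 1)]
x$group <- rep(letters[c(1, 2, 1, 4, 1) + 5], each = 4)

n <- nrow(x)

# Long format in row order: Var1 of row 1, Var2 of row 1, Var1 of row 2, ...
long <- data.frame(
  which = rep(c("Var1", "Var2"), times = n),
  value = as.vector(rbind(x$Var1, x$Var2)),
  group = rep(x$group, each = 2),
  stringsAsFactors = FALSE
)

# Running count of each value within its group, regardless of source column
long$count <- ave(seq_along(long$value), long$group, long$value, FUN = seq_along)

# Put the counts back next to the column they came from
x$new_var1 <- long$count[long$which == "Var1"]
x$new_var2 <- long$count[long$which == "Var2"]
x
```

Within a row, Var1 is counted before Var2 here. That only matters if the same value can appear in both columns of one row, which does not happen in this data.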
