Given two dataframes with the same column names:
a <- data.frame(x=1:4,y=5:8)
b <- data.frame(x=LETTERS[1:4],y=LETTERS[5:8])
>a
x y
1 5
2 6
3 7
4 8
>b
x y
A E
B F
C G
D H
How can each column with the same name be concatenated?
Desired output:
cat_x cat_y
1 A 5 E
2 B 6 F
3 C 7 G
4 D 8 H
Tried so far, merging columns one at a time:
a$cat_x <- paste(a$x,b$x)
a$cat_y <- paste(a$y,b$y)
This approach works, but the real data has 40 columns (and there will be several more dataframes to combine). I am looking for a more efficient method for larger dataframes.
We may use Map to loop over the corresponding columns of both dataframes:
data.frame(Map(paste, setNames(a, paste0("cat_", names(a))), b,
MoreArgs = list(sep = "_")))
-output
cat_x cat_y
1 1_A 5_E
2 2_B 6_F
3 3_C 7_G
4 4_D 8_H
sep is used above in case a delimiter is wanted. Otherwise, paste separates the values with a space by default:
data.frame(Map(paste, setNames(a, paste0("cat_", names(a))), b ))
cat_x cat_y
1 1 A 5 E
2 2 B 6 F
3 3 C 7 G
4 4 D 8 H
Another possible solution, using purrr::map2_dfc:
library(tidyverse)
map2_dfc(a,b, ~ str_c(.x, .y, sep = " ")) %>%
rename_with(~ str_c("cat", .x, sep = "_"))
#> # A tibble: 4 × 2
#> cat_x cat_y
#> <chr> <chr>
#> 1 1 A 5 E
#> 2 2 B 6 F
#> 3 3 C 7 G
#> 4 4 D 8 H
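Since the question mentions that more dataframes will follow, the Map idea can be extended to any number of them. The following is only a sketch under assumptions: the third data frame `d` and the list `dfs` are made up for illustration, and all data frames are assumed to have the same columns in the same order.

```r
a <- data.frame(x = 1:4, y = 5:8)
b <- data.frame(x = LETTERS[1:4], y = LETTERS[5:8])
d <- data.frame(x = letters[1:4], y = letters[5:8])  # hypothetical third data frame

dfs <- list(a, b, d)

# Map walks the columns of every data frame in parallel;
# paste then joins the matching columns element by element.
res <- data.frame(do.call(Map, c(f = paste, dfs)), stringsAsFactors = FALSE)

# Prefix the column names taken from the first data frame.
names(res) <- paste0("cat_", names(dfs[[1]]))
res
```

A delimiter other than a space could be passed through MoreArgs, as in the two-dataframe version above.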
Related
I am trying to concatenate certain row values (Strings) given varying conditions in R. I have flagged the row values in Flag (the flagging criteria are irrelevant in this example).
Notations: B is the beginning of a run and E the end. 0 is outside the run. 1 denotes any strings excluding B and E in the run. Your solution does not need to follow my convention.
Rules: Every run must begin with B and end with E. There can be any number of 1s in the run. All Strings positioned between B and E (both inclusive) are to be concatenated in the order in which they appear in the run, and the result replaces the B-string. 0-strings remain in the dataframe. 1- and E-strings are removed after concatenation.
I haven't come up with anything close to the desired output.
set.seed(128)
df2 <- data.frame(Strings = sample(letters, 17, replace = T),
Flag = c(0,"B",1,1,"E","B","E","B","E",0,"B",1,1,1,"E",0,0))
Strings Flag
1 d 0
2 r B
3 q 1
4 r 1
5 v E
6 f B
7 y E
8 u B
9 c E
10 x 0
11 h B
12 w 1
13 x 1
14 t 1
15 j E
16 d 0
17 j 0
Intermediate output.
Strings Flag Result
1 d 0 d
2 r B r q r v
3 q 1 q
4 r 1 r
5 v E v
6 f B f y
7 y E y
8 u B u c
9 c E c
10 x 0 x
11 h B h w x t j
12 w 1 w
13 x 1 x
14 t 1 t
15 j E j
16 d 0 d
17 j 0 j
Desired output.
Result
1 d
2 r q r v
3 f y
4 u c
5 x
6 h w x t j
7 d
8 j
Here is a solution that might help you. However, I am still not sure if I got your point correctly:
library(dplyr)
df2 %>%
mutate(Flag2 = cumsum(Flag == 'B' | Flag == '0')) %>%
group_by(Flag2) %>%
summarise(Result = paste0(Strings, collapse = ' '))
# A tibble: 8 × 2
Flag2 Result
<int> <chr>
1 1 d
2 2 r q r v
3 3 f y
4 4 u c
5 5 x
6 6 h w x t j
7 7 d
8 8 j
Using dplyr:
library(dplyr)
set.seed(128)
df2 <- data.frame(Strings = sample(letters, 17, replace = T),
Flag = c(0,"B",1,1,"E","B","E","B","E",0,"B",1,1,1,"E",0,0))
df2 %>%
group_by(group = cumsum( (Flag=="B") + (lag(Flag,1,"0")=="E"))) %>%
mutate(Result=if_else(Flag=="B", paste0(Strings,collapse = " "),Strings)) %>%
filter(!(Flag %in% c("1", "E"))) %>% ungroup() %>%
select(-group, -Strings, -Flag)
#> # A tibble: 8 × 1
#> Result
#> <chr>
#> 1 d
#> 2 r q r v
#> 3 f y
#> 4 u c
#> 5 x
#> 6 h w x t j
#> 7 d
#> 8 j
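The same grouping idea can also be sketched in base R. This assumes every run is well formed (each B is eventually closed by an E, and no 0 appears inside a run). The strings are typed in from the printed data above rather than regenerated with sample(), since the sampled letters may differ between R versions.

```r
df2 <- data.frame(
  Strings = c("d", "r", "q", "r", "v", "f", "y", "u", "c", "x",
              "h", "w", "x", "t", "j", "d", "j"),
  Flag = c("0", "B", "1", "1", "E", "B", "E", "B", "E", "0",
           "B", "1", "1", "1", "E", "0", "0"),
  stringsAsFactors = FALSE
)

# A new group starts at every "B" (start of a run) and every "0" (lone row).
grp <- cumsum(df2$Flag %in% c("B", "0"))

# Collapse the strings of each group into a single value.
Result <- unname(vapply(split(df2$Strings, grp), paste, character(1), collapse = " "))
data.frame(Result)
```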
I have a list of dataframes, and I wish to add a new column of unique values to each element of the list. For example, I want to add a column named id that numbers the rows starting from 1.
Naturally, on a dataframe I would do this:
dat %>% mutate(id = row_number())
For a list, I would do this:
dat %>% map(~ mutate(., id = row_number()))
And I would get output like so:
dat <- list(data.frame(x=c("a", "b" ,"c", "d", "e" ,"f" ,"g") ), data.frame(y=c("p", "lk", "n", "m", "g", "f", "t")))
[[1]]
x id
1 a 1
2 b 2
3 c 3
4 d 4
5 e 5
6 f 6
7 g 7
[[2]]
y id
1 p 1
2 lk 2
3 n 3
4 m 4
5 g 5
6 f 6
7 t 7
However, how would I add a new column id such that the row numbers continue from the first list element?
Expected output:
[[1]]
x id
1 a 1
2 b 2
3 c 3
4 d 4
5 e 5
6 f 6
7 g 7
[[2]]
y id
1 p 8
2 lk 9
3 n 10
4 m 11
5 g 12
6 f 13
7 t 14
An option is to bind them into a single dataset, create 'id' with row_number(), split by 'grp', then loop over the list and drop 'grp' along with any columns that are entirely NA:
library(dplyr)
library(purrr)
dat %>%
bind_rows(.id = 'grp') %>%
mutate(id = row_number()) %>%
group_split(grp) %>%
map(~ .x %>%
select(where(~ any(!is.na(.))), -grp))
-output
#[[1]]
# A tibble: 7 x 2
# x id
# <chr> <int>
#1 a 1
#2 b 2
#3 c 3
#4 d 4
#5 e 5
#6 f 6
#7 g 7
#[[2]]
# A tibble: 7 x 2
# y id
# <chr> <int>
#1 p 8
#2 lk 9
#3 n 10
#4 m 11
#5 g 12
#6 f 13
#7 t 14
Or an easier approach is to unlist (assuming a single column per dataframe), get the sequence, and add the new column with map2:
map2(dat, relist(seq_along(unlist(dat)), skeleton = dat),
~ .x %>% mutate(id = .y))
Or using a for loop
dat[[1]]$id <- seq_len(nrow(dat[[1]]))
for(i in seq_along(dat)[-1]) dat[[i]]$id <-
seq(tail(dat[[i-1]]$id, 1) + 1, length.out = nrow(dat[[i]]), by = 1)
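The for loop above boils down to giving each list element a starting offset equal to the total number of rows before it. A minimal base R sketch of that idea follows; the names `n`, `offset` and `res` are made up for illustration.

```r
dat <- list(data.frame(x = c("a", "b", "c", "d", "e", "f", "g")),
            data.frame(y = c("p", "lk", "n", "m", "g", "f", "t")))

# Number of rows in each list element.
n <- vapply(dat, nrow, integer(1))

# Rows that come before each element: 0 for the first, 7 for the second.
offset <- cumsum(n) - n

# Add the id column, continuing the count from the previous element.
res <- Map(function(d, start) {
  d$id <- start + seq_len(nrow(d))
  d
}, dat, offset)
res
```

Unlike the unlist approach, this does not depend on each data frame having a single column.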
I have a large data frame with a lot of rows and columns. One column contains characters, some of which occur only once and others multiple times. I would like to split the whole data frame into two: one with all the rows whose value in this column is repeated, and another with all the rows whose value occurs only once. For example:
One = c(1,2,3,4,5,6,7,8,9,10)
Two = c(4,5,3,6,2,7,1,8,1,9)
Three = c("a", "b", "c", "d","d","e","f","e","g","c")
df <- data.frame(One, Two, Three)
> df
One Two Three
1 1 4 a
2 2 5 b
3 3 3 c
4 4 6 d
5 5 2 d
6 6 7 e
7 7 1 f
8 8 8 e
9 9 1 g
10 10 9 c
I wish to have two data frames like
> dfSingle
One Two Three
1 1 4 a
2 2 5 b
7 7 1 f
9 9 1 g
> dfMultiple
One Two Three
3 3 3 c
4 4 6 d
5 5 2 d
6 6 7 e
8 8 8 e
10 10 9 c
I tried with the duplicated() function
dfSingle = subset(df, !duplicated(df$Three))
dfMultiple = subset(df, duplicated(df$Three))
but it does not work, as the first occurrences of "c", "d" and "e" go to dfSingle.
I also tried to do a for-loop
MulipleValues = unique(df$Three[c(which(duplicated(df$Three)))])
dfSingle = data.frame()
x = 1
dfMultiple = data.frame()
y = 1
for (i in 1:length(df$One)) {
if(df$Three[i] %in% MulipleValues){
dfMultiple[x,] = df[i,]
x = x+1
} else {
dfSingle[y,] = df[i,]
y = y+1
}
}
It seems to do the right thing, as the data frames now have the right number of rows, but they somehow have 0 columns.
> dfSingle
data frame with 0 columns and 4 rows
> dfMultiple
data frame with 0 columns and 6 rows
What am I doing wrong? Or is there another way to do this?
Thanks for your help!
In base R, we can use split with duplicated, which returns a list of two dataframes.
df1 <- split(df, duplicated(df$Three) | duplicated(df$Three, fromLast = TRUE))
df1
#$`FALSE`
# One Two Three
#1 1 4 a
#2 2 5 b
#7 7 1 f
#9 9 1 g
#$`TRUE`
# One Two Three
#3 3 3 c
#4 4 6 d
#5 5 2 d
#6 6 7 e
#8 8 8 e
#10 10 9 c
where df1[[1]] can be considered as dfSingle and df1[[2]] as dfMultiple.
Here is a dplyr one for fun,
library(dplyr)
df %>%
group_by(Three) %>%
mutate(new = n() > 1) %>%
split(.$new)
which gives,
$`FALSE`
# A tibble: 4 x 4
# Groups: Three [4]
One Two Three new
<dbl> <dbl> <fct> <lgl>
1 1 4 a FALSE
2 2 5 b FALSE
3 7 1 f FALSE
4 9 1 g FALSE
$`TRUE`
# A tibble: 6 x 4
# Groups: Three [3]
One Two Three new
<dbl> <dbl> <fct> <lgl>
1 3 3 c TRUE
2 4 6 d TRUE
3 5 2 d TRUE
4 6 7 e TRUE
5 8 8 e TRUE
6 10 9 c TRUE
A way with dplyr:
library(dplyr)
df %>%
group_split(Duplicated = (add_count(., Three) %>% pull(n)) > 1)
Output:
[[1]]
# A tibble: 4 x 4
One Two Three Duplicated
<dbl> <dbl> <fct> <lgl>
1 1 4 a FALSE
2 2 5 b FALSE
3 7 1 f FALSE
4 9 1 g FALSE
[[2]]
# A tibble: 6 x 4
One Two Three Duplicated
<dbl> <dbl> <fct> <lgl>
1 3 3 c TRUE
2 4 6 d TRUE
3 5 2 d TRUE
4 6 7 e TRUE
5 8 8 e TRUE
6 10 9 c TRUE
You can do it using base R
One = c(1,2,3,4,5,6,7,8,9,10)
Two = c(4,5,3,6,2,7,1,8,1,9)
Three = c("a", "b", "c", "d","d","e","f","e","g","c")
df <- data.frame(One, Two, Three)
str(df)
df$Three <- as.character(df$Three)
df$count <- as.numeric(ave(df$Three,df$Three,FUN = length))
dfSingle = subset(df,df$count == 1)
dfMultiple = subset(df,df$count > 1)
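If the extra count column is not wanted in the result, the counts can be kept outside the data frame. This is just a sketch of that variation, using table() and %in%; the names `counts` and `is_multiple` are illustrative.

```r
One <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
Two <- c(4, 5, 3, 6, 2, 7, 1, 8, 1, 9)
Three <- c("a", "b", "c", "d", "d", "e", "f", "e", "g", "c")
df <- data.frame(One, Two, Three, stringsAsFactors = FALSE)

# How often each value of Three occurs.
counts <- table(df$Three)

# TRUE for every row whose value occurs more than once.
is_multiple <- df$Three %in% names(counts[counts > 1])

dfSingle <- df[!is_multiple, ]
dfMultiple <- df[is_multiple, ]
```

Both results keep the original row names and columns, matching the layout asked for in the question.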
The problem is simple and appears in many other posts, but I haven't found a satisfactory answer.
Say you have a tibble with one column of labels (here letters) and other values in other columns (here just one 'value').
data <- tibble(letter = letters[1:5], value = 1:5)
Now what you want is to generate all the pairs without permutations and keep the value attached to each element of the pair. Here's the solution I have, which I believe is valid but... inelegant.
combn(data$letter, m = 2) %>%
t() %>%
as_tibble() %>%
rename(letter_1 = V1, letter_2 = V2) %>%
left_join(data, by = c("letter_1" = "letter")) %>%
left_join(data, by = c("letter_2" = "letter"), suffix = c("_1", "_2"))
Which outputs the desired result:
# A tibble: 10 x 4
letter_1 letter_2 value_1 value_2
<chr> <chr> <int> <int>
1 a b 1 2
2 a c 1 3
3 a d 1 4
4 a e 1 5
5 b c 2 3
6 b d 2 4
7 b e 2 5
8 c d 3 4
9 c e 3 5
10 d e 4 5
I'm really looking for a tidyverse approach. I'm a fan boy :)
Thank you in advance for any help.
Here is a tidyverse solution using expand (instead of combn):
data %>%
expand(letter_1 = letter, letter_2 = letter) %>%
mutate(
value_1 = match(letter_1, letters),
value_2 = match(letter_2, letters)) %>%
filter(letter_1 != letter_2) %>%
rowwise() %>%
mutate(id = paste0(sort(c(letter_1, letter_2)), collapse = " ")) %>%
distinct(id, .keep_all = TRUE) %>%
select(-id)
## A tibble: 15 x 4
# letter_1 letter_2 value_1 value_2
# <chr> <chr> <int> <int>
# 2 a b 1 2
# 3 a c 1 3
# 4 a d 1 4
# 5 a e 1 5
# 7 b c 2 3
# 8 b d 2 4
# 9 b e 2 5
#11 c d 3 4
#12 c e 3 5
#13 d d 4 4
#14 d e 4 5
One option could be using combn as:
data <- tibble(letter = letters[1:5], value = 1:5)
res <- cbind(data.frame(t(combn(data$letter, 2))), data.frame(t(combn(data$value, 2))))
names(res) <- c("letter_1", "letter_2", "value_1", "value_2")
res
# letter_1 letter_2 value_1 value_2
# 1 a b 1 2
# 2 a c 1 3
# 3 a d 1 4
# 4 a e 1 5
# 5 b c 2 3
# 6 b d 2 4
# 7 b e 2 5
# 8 c d 3 4
# 9 c e 3 5
# 10 d e 4 5
I find the rowwise() function to work inconsistently on my machine. You might want to try the map() functions in the purrr package.
Here's a way to implement this:
library(dplyr)
library(tidyr)
library(purrr)
data %>%
expand(letter_1 = letter, letter_2 = letter) %>%
mutate(
value_1 = match(letter_1, letters),
value_2 = match(letter_2, letters)) %>%
filter(letter_1 != letter_2) %>%
mutate(
id = map2_chr(letter_1, letter_2, function(x, y) {
paste(sort(c(x, y)), collapse = " ")
})
) %>%
distinct(id, .keep_all = TRUE) %>%
select(-id)
# # A tibble: 10 x 4
# letter_1 letter_2 value_1 value_2
# <chr> <chr> <int> <int>
# 1 a b 1 2
# 2 a c 1 3
# 3 a d 1 4
# 4 a e 1 5
# 5 b c 2 3
# 6 b d 2 4
# 7 b e 2 5
# 8 c d 3 4
# 9 c e 3 5
# 10 d e 4 5
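The expand-based answers recover the values with match(letter, letters), which only works because value happens to equal each letter's position in the alphabet. A hedged alternative is to build the pairs from row indices, so any value column is carried along. This sketch uses base R rather than the tidyverse, and `idx` and `res` are illustrative names.

```r
data <- data.frame(letter = letters[1:5], value = 1:5, stringsAsFactors = FALSE)

# All unordered pairs of row indices, one pair per column (2 x 10 matrix).
idx <- combn(nrow(data), 2)

# Look up letters and values by row index for each side of the pair.
res <- data.frame(
  letter_1 = data$letter[idx[1, ]],
  letter_2 = data$letter[idx[2, ]],
  value_1  = data$value[idx[1, ]],
  value_2  = data$value[idx[2, ]],
  stringsAsFactors = FALSE
)
res
```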
Here is my issue:
df1 <- data.frame(x = 1:5, y = 2:6, z = 3:7)
rownames(df1) <- LETTERS[1:5]
df1
x y z
A 1 2 3
B 2 3 4
C 3 4 5
D 4 5 6
E 5 6 7
df2 <- data.frame(x = 1:5, y = 2:6, z = 3:7)
rownames(df2) <- LETTERS[3:7]
df2
x y z
C 1 2 3
D 2 3 4
E 3 4 5
F 4 5 6
G 5 6 7
what I wanted is:
x y z
A 1 2 3
B 2 3 4
C 4 6 8
D 6 8 10
E 8 10 12
F 4 5 6
G 5 6 7
where duplicated rows were added up by same variable.
A solution with base R:
# create a new variable from the rownames
df1$rn <- rownames(df1)
df2$rn <- rownames(df2)
# bind the two dataframes together by row and aggregate
res <- aggregate(cbind(x,y,z) ~ rn, rbind(df1,df2), sum)
# or (thx to @alistaire for reminding me):
res <- aggregate(. ~ rn, rbind(df1,df2), sum)
# assign the rownames again
rownames(res) <- res$rn
# get rid of the 'rn' column
res <- res[, -1]
which gives:
> res
x y z
A 1 2 3
B 2 3 4
C 4 6 8
D 6 8 10
E 8 10 12
F 4 5 6
G 5 6 7
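As a possible shortcut, base R's rowsum() sums rows that share a grouping value, so the helper column is not needed. This is a sketch rather than part of the original answer; it takes the group labels from the original row names before rbind has a chance to alter them.

```r
df1 <- data.frame(x = 1:5, y = 2:6, z = 3:7)
rownames(df1) <- LETTERS[1:5]
df2 <- data.frame(x = 1:5, y = 2:6, z = 3:7)
rownames(df2) <- LETTERS[3:7]

# Group labels come from the original row names of both data frames.
grp <- c(rownames(df1), rownames(df2))

# Sum all rows that share a label; the result has one row per label.
res <- rowsum(rbind(df1, df2), group = grp)
res
```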
With dplyr,
library(dplyr)
# add rownames as a column in each data.frame and bind rows
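# (add_rownames() is deprecated; rownames_to_column() also names the column 'rowname')
bind_rows(df1 %>% tibble::rownames_to_column(),
df2 %>% tibble::rownames_to_column()) %>%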
# evaluate following calls for each value in the rowname column
group_by(rowname) %>%
# add all non-grouping variables
summarise_all(sum)
## # A tibble: 7 x 4
## rowname x y z
## <chr> <int> <int> <int>
## 1 A 1 2 3
## 2 B 2 3 4
## 3 C 4 6 8
## 4 D 6 8 10
## 5 E 8 10 12
## 6 F 4 5 6
## 7 G 5 6 7
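If both data frames have identical row names in the same order, the operation can also be vectorized by converting them to matrices. Note that this adds rows by position and ignores the row names, so for df1 and df2 above it does not give the desired result: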
result_df <- as.data.frame(as.matrix(df1) + as.matrix(df2))
This might need some tweaking to get the rownames logic working on a longer example:
dfr <-rbind(df1,df2)
do.call(rbind, lapply( split(dfr, sapply(rownames(dfr),substr,1,1)), colSums))
x y z
A 1 2 3
B 2 3 4
C 4 6 8
D 6 8 10
E 8 10 12
F 4 5 6
G 5 6 7
If the rownames could all be assumed to be alpha characters a gsub solution should be easy.
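A rough sketch of that gsub idea follows. It relies on rbind appending a suffix to duplicated row names, and it assumes the real row names contain letters only; anything that is not a letter is stripped to recover the original name.

```r
df1 <- data.frame(x = 1:5, y = 2:6, z = 3:7)
rownames(df1) <- LETTERS[1:5]
df2 <- data.frame(x = 1:5, y = 2:6, z = 3:7)
rownames(df2) <- LETTERS[3:7]

dfr <- rbind(df1, df2)

# Remove everything that is not a letter, i.e. the suffix added by rbind.
key <- gsub("[^A-Za-z]", "", rownames(dfr))

# Sum the columns within each recovered name and stack the results.
res <- do.call(rbind, lapply(split(dfr, key), colSums))
res
```

Unlike substr(..., 1, 1), this also handles row names longer than one character.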
An alternative is to melt the data and cast it. First, we store the row names in a new last column of both data frames (thanks to @Jaap):
df1$rn <- rownames(df1)
df2$rn <- rownames(df2)
Then we melt the data, using 'rn' as the id variable (melt and dcast come from the reshape2 package):
library(reshape2)
melt(list(df1, df2), id.vars = "rn")
Then we use dcast together with mget, which retrieves multiple objects by name at once.
mydf<- dcast(melt(mget(ls(pattern = "df\\d+")), id.vars = "rn"),
rn ~ variable, value.var = "value", fun.aggregate = sum)
rownames(mydf) <- mydf$rn
# get rid of the 'rn' column
mydf <- mydf[, -1]
> mydf
# x y z
#A 1 2 3
#B 2 3 4
#C 4 6 8
#D 6 8 10
#E 8 10 12
#F 4 5 6
#G 5 6 7