Merging variables by a group in R

I have a dataset similar to the one below. I would like to merge the IDs based on "TA"; the merged variable should look like "ID1€ID2". The TAs always come in pairs.
usa <- data.frame(
  TA = c(111, 111, 121, 121, 131, 131, 141, 141),
  ID = c("A", "B", "A", "C", "A", "B", "C", "D"))
The expected output is a new dataset:
TA merged
1 111 "A€B"
2 121 "A€C"
3 131 "A€B"
4 141 "C€D"
Another possible output:
TA ID merged
1 111 A "A€B"
2 111 B "A€B"
3 121 A "A€C"
4 121 C "A€C"
5 131 A "A€B"
6 131 B "A€B"
7 141 C "C€D"
8 141 D "C€D"

You can use aggregate with paste:
aggregate(ID ~ TA, usa, paste, collapse = "\u20AC")
# TA ID
#1 111 A€B
#2 121 A€C
#3 131 A€B
#4 141 C€D
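If you also want the column named "merged", and the second (row-per-ID) output, one possible extension (a sketch in base R) is to rename the collapsed column and merge it back onto the original rows:
# Collapse IDs per TA, then rename to match the requested output
res <- aggregate(ID ~ TA, usa, paste, collapse = "\u20AC")
names(res)[2] <- "merged"
# Second output format: join the collapsed values back onto every row
merge(usa, res, by = "TA")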

Does this work:
library(dplyr)
library(stringr)  # str_c() comes from stringr
usa %>% group_by(TA) %>% summarise(merged = str_c(ID, collapse = '\u20AC'))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 4 x 2
TA merged
<dbl> <chr>
1 111 A€B
2 121 A€C
3 131 A€B
4 141 C€D
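The message above is informational; in dplyr >= 1.0.0 it can be silenced with the .groups argument:
usa %>%
  group_by(TA) %>%
  summarise(merged = str_c(ID, collapse = '\u20AC'), .groups = 'drop')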
Second option:
usa %>% group_by(TA) %>% mutate(merged = str_c(ID, collapse = '\u20AC'))
# A tibble: 8 x 3
# Groups: TA [4]
TA ID merged
<dbl> <chr> <chr>
1 111 A A€B
2 111 B A€B
3 121 A A€C
4 121 C A€C
5 131 A A€B
6 131 B A€B
7 141 C C€D
8 141 D C€D

Another option with data.table:
library(data.table)
setDT(usa)[, .(ID = paste(ID, collapse='\u20AC')), TA]
Output:
# TA ID
#1: 111 A€B
#2: 121 A€C
#3: 131 A€B
#4: 141 C€D
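For the second output format (one row per ID), a grouped := assignment should work as well; the trailing [] just prints the result:
# Add the collapsed string as a new column, keeping one row per ID
setDT(usa)[, merged := paste(ID, collapse = '\u20AC'), TA][]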

Choosing the right column based on a vector of column names

I'm trying to pull values from columns based on the values in a vector. I'm not sure I have the right words to describe the problem, but the code should help. This feels related to coalesce, but maybe not?
library(tidyverse)
# Starting table
dat <- tibble(
  A = 1:10,
  B = 31:40,
  C = 101:110,
  value = c("A", "C", "B", "A", "B", "C", "C", "B", "A", "A")
)
I want:
dat %>%
  mutate(
    output = c(1, 102, 33, 4, 35, 106, 107, 38, 9, 10)
  )
I could do
dat %>%
  mutate(
    output = case_when(
      value == "A" ~ A,
      value == "B" ~ B,
      value == "C" ~ C
    )
  )
but my real application has many values, and I want to take advantage of value already holding the name of the matching column.
Is there a function that does:
dat %>%
mutate(output = grab_the_right_column(value))
Thanks!
A rowwise approach would be less efficient, but it is a compact tidyverse way to get the column value based on the column name for each row.
library(dplyr)
dat %>%
  rowwise() %>%
  mutate(output = get(value)) %>%
  ungroup()
Output:
# A tibble: 10 x 5
# A B C value output
# <int> <int> <int> <chr> <int>
# 1 1 31 101 A 1
# 2 2 32 102 C 102
# 3 3 33 103 B 33
# 4 4 34 104 A 4
# 5 5 35 105 B 35
# 6 6 36 106 C 106
# 7 7 37 107 C 107
# 8 8 38 108 B 38
# 9 9 39 109 A 9
#10 10 40 110 A 10
This type of problem is handled more efficiently with row/column indexing in base R: build a two-column matrix from the row sequence and the column positions found by matching the 'value' column against the column names, then use it to extract the elements.
dat$output <- as.data.frame(dat)[,1:3][cbind(seq_len(nrow(dat)), match(dat$value, names(dat)[1:3]))]
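Broken into steps, the same approach looks like this (a sketch; the as.data.frame() conversion is there because base data frames are more permissive about matrix indexing than tibbles):
# Column positions matching each row's 'value' entry
cols <- match(dat$value, names(dat)[1:3])
# Two-column (row, column) index matrix
idx <- cbind(seq_len(nrow(dat)), cols)
# Matrix indexing extracts one element per row
dat$output <- as.data.frame(dat)[1:3][idx]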
You can also use purrr and pmap():
library(dplyr)
library(purrr)
dat %>%
  mutate(output = pmap(., ~ {
    # c(...) collects the row as a named character vector
    v1 <- c(...)
    v1[names(v1) == v1[['value']]]
  }) %>%
    unlist() %>%
    as.numeric())
# A tibble: 10 x 5
A B C value output
<int> <int> <int> <chr> <dbl>
1 1 31 101 A 1
2 2 32 102 C 102
3 3 33 103 B 33
4 4 34 104 A 4
5 5 35 105 B 35
6 6 36 106 C 106
7 7 37 107 C 107
8 8 38 108 B 38
9 9 39 109 A 9
10 10 40 110 A 10
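A slightly tidier variant (my sketch, using pmap_dbl with an explicit named lookup) avoids the unlist step:
dat %>%
  mutate(output = pmap_dbl(list(A, B, C, value),
                           function(A, B, C, value) c(A = A, B = B, C = C)[[value]]))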

Selected columns to new row

I'm trying to split columns into new rows, keeping the data of the first two columns.
d1 <- data.frame(a=c(100,0,78),b=c(0,137,117),c.1=c(111,17,91), d.1=c(99,66,22), c.2=c(11,33,44), d.2=c(000,001,002))
d1
a b c.1 d.1 c.2 d.2
1 100 0 111 99 11 0
2 0 137 17 66 33 1
3 78 117 91 22 44 2
Expected results would be:
a b c d
1 100 0 111 99
2 100 0 11 0
3 0 137 17 66
4 0 137 33 1
5 78 117 91 22
6 78 117 44 2
I made multiple attempts with dplyr, but it seems it is not the right approach.
If you want to stay in dplyr/tidyverse, you want tidyr::pivot_longer with a special reference to .value -- see the pivot vignette for more:
library(tidyverse)
d1 <- data.frame(
  a = c(100, 0, 78),
  b = c(0, 137, 117),
  c.1 = c(111, 17, 91),
  d.1 = c(99, 66, 22),
  c.2 = c(11, 33, 44),
  d.2 = c(000, 001, 002)
)
d1 %>%
  pivot_longer(
    cols = contains("."),
    names_to = c(".value", "group"),
    names_sep = "\\."
  )
#> # A tibble: 6 x 5
#> a b group c d
#> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 100 0 1 111 99
#> 2 100 0 2 11 0
#> 3 0 137 1 17 66
#> 4 0 137 2 33 1
#> 5 78 117 1 91 22
#> 6 78 117 2 44 2
Created on 2020-05-11 by the reprex package (v0.3.0)
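If you don't need the group identifier, dropping it should reproduce the expected output exactly:
d1 %>%
  pivot_longer(
    cols = contains("."),
    names_to = c(".value", "group"),
    names_sep = "\\."
  ) %>%
  select(-group)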
This could solve your issue:
#Try this
a1 <- d1[,c(1:4)]
a2 <- d1[,c(1,2,5,6)]
names(a1) <- names(a2) <- c('a','b','c','d')
DF <- rbind(a1,a2)
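If the row order matters, a small addition (my sketch) interleaves the stacked halves so the two observations from each original row stay adjacent:
# Tag each stacked row with its original row number, then order by it
DF <- DF[order(rep(seq_len(nrow(d1)), 2)), ]
rownames(DF) <- NULL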
The posted answers are good, here's my attempt:
library(dplyr)
library(tidyr)
df <- data.frame(a = c(100, 0, 78), b = c(0, 137, 117),
                 c.1 = c(111, 17, 91), d.1 = c(99, 66, 22),
                 c.2 = c(11, 33, 44), d.2 = c(000, 001, 002))
# Make two pivot_longer operations
df_c <- df %>% select(-d.1, -d.2) %>%
  pivot_longer(cols = c("c.1", "c.2"), values_to = "c") %>% select(-name)
df_d <- df %>% select(-c.1, -c.2) %>%
  pivot_longer(cols = c("d.1", "d.2"), values_to = "d") %>% select(-name)
# Bind them without the "key" columns
bind_cols(df_c, select(df_d, -a, -b))
Which produces
# A tibble: 6 x 4
a b c d
<dbl> <dbl> <dbl> <dbl>
1 100 0 111 99
2 100 0 11 0
3 0 137 17 66
4 0 137 33 1
5 78 117 91 22
6 78 117 44 2

Reordering a pivot wide table

I have the following dataframe:
df <- structure(list(rows = c(1, 2, 3, 4, 5, 6), col1 = c(122, 111,
111, 222, 212, 122), col2 = c(10101, 20202, 200022, 10201, 20022,
22222), col3 = c(11, 22, 22, 22, 11, 22)), class = "data.frame", row.names = c(NA,
-6L))
rows col1 col2 col3
1 1 122 10101 11
2 2 111 20202 22
3 3 111 200022 22
4 4 222 10201 22
5 5 212 20022 11
6 6 122 22222 22
I would like to filter the rows where at least one of columns 2, 3, and 4 includes both "1" and "2".
The desired outcome would be:
rows col1 col2 col3
1 1 122 10101 11
4 4 222 10201 22
5 5 212 20022 11
6 6 122 22222 22
The following two attempts do not work, because they scan the three columns together rather than one by one.
df[which(apply(df[,2:4],1,function(x) any(grepl("1",x)) & any(grepl("2",x)))),]
OR
library(tidyverse)
df %>% filter_at(vars(2, 3, 4), any_vars(str_detect(., pattern = "1|2")))
You could use:
df[apply(df[2:4], 1, function(x) any(grepl('1.*2|2.*1', x))),]
# rows col1 col2 col3
#1 1 122 10101 11
#4 4 222 10201 22
#5 5 212 20022 11
#6 6 122 22222 22
And similarly, using filter_at:
library(dplyr)
df %>% filter_at(2:4, any_vars(grepl('1.*2|2.*1', .)))
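Note that filter_at() is superseded in dplyr >= 1.0.0; the equivalent with if_any() would be:
df %>% filter(if_any(2:4, ~ grepl('1.*2|2.*1', .x)))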
We can vectorize it in base R:
df[Reduce(`|`, lapply(df[2:4], grepl, pattern = '1.*2|2.*1')),]
# rows col1 col2 col3
#1 1 122 10101 11
#4 4 222 10201 22
#5 5 212 20022 11
#6 6 122 22222 22

Get most frequently occurring factor level in dplyr piping structure

I'd like to find the most frequently occurring level of a factor in a dataset while using dplyr's piping structure. I'm trying to create a new variable that contains the 'modal' factor level when grouped by another variable.
This is an example of what I'm looking for:
df <- data.frame(cat = stringi::stri_rand_strings(100, 1, '[A-Z]'),
                 num = floor(runif(100, min = 0, max = 500)))
df <- df %>%
  dplyr::group_by(cat) %>%
  dplyr::mutate(cat_mode = Mode(num))
Where "Mode" is a function that I'm looking for
Use table to count the items and then use which.max to find out the most frequent one:
df %>%
  group_by(cat) %>%
  mutate(cat_mode = names(which.max(table(num)))) %>%
  head()
# A tibble: 6 x 3
# Groups: cat [4]
# cat num cat_mode
# <fctr> <dbl> <chr>
#1 Q 305 138
#2 W 34.0 212
#3 R 53.0 53
#4 D 395 5
#5 W 212 212
#6 Q 417 138
# ...
This is similar to the question Is there a built-in function for finding the mode?:
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
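A quick check of how it works: match(x, ux) maps each element to its position in ux, tabulate() counts those positions, and which.max() picks the most frequent one. For example:
x <- c(3, 7, 7, 2, 7, 3)
ux <- unique(x)         # 3 7 2
tabulate(match(x, ux))  # 2 3 1
Mode(x)                 # 7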
Applying it in the pipeline:
df %>%
  group_by(cat) %>%
  mutate(cat_mode = Mode(num))
# A tibble: 100 x 3
# Groups: cat [26]
cat num cat_mode
<fct> <dbl> <dbl>
1 S 25 25
2 V 86 478
3 R 335 335
4 S 288 25
5 S 330 25
6 Q 384 384
7 C 313 313
8 H 275 275
9 K 274 274
10 J 75 75
# ... with 90 more rows
To see the mode for each factor level:
df %>%
  group_by(cat) %>%
  summarise(cat_mode = Mode(num))
# A tibble: 26 x 2
cat cat_mode
<fct> <dbl>
1 A 480
2 B 380
3 C 313
4 D 253
5 E 202
6 F 52
7 G 182
8 H 275
9 I 356
10 J 75
# ... with 16 more rows

How to identify data which does not show link between two data sets? [duplicate]

This question already has answers here:
Find complement of a data frame (anti-join)
(7 answers)
Closed 4 years ago.
Dataset1:
id1 id2 abc n
1 111 yes 2
2 121 no 1
3 122 yes 2
4 224 no 2
5 441 no 3
6 665 yes 1
Dataset2:
id1 id2 age gen
1 111 45 m
1 111 46 f
2 1 52 f
121 122 41 f
121 122 44 m
4 224 54 f
4 221 56 m
5 441 44 m
5 441 45 f
5 441 58 f
6 665 54 f
I have two data sets, linked by id1 and id2. How can I identify the rows in each data set that fail to link?
We can use anti_join from the dplyr package to filter the rows with no match.
library(dplyr)
Dataset1_anti <- Dataset1 %>% anti_join(Dataset2, by = c("id1", "id2"))
Dataset1_anti
# id1 id2 abc n
# 1 2 121 no 1
# 2 3 122 yes 2
Dataset2_anti <- Dataset2 %>% anti_join(Dataset1, by = c("id1", "id2"))
Dataset2_anti
# id1 id2 age gen
# 1 2 1 52 f
# 2 121 122 41 f
# 3 121 122 44 m
# 4 4 221 56 m
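For reference, a base R equivalent (a sketch using pasted composite keys) would be:
# Keep rows whose (id1, id2) key has no match in the other data set
key1 <- paste(Dataset1$id1, Dataset1$id2)
key2 <- paste(Dataset2$id1, Dataset2$id2)
Dataset1[!key1 %in% key2, ]
Dataset2[!key2 %in% key1, ]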
DATA
Dataset1 <- read.table(text = "id1 id2 abc n
1 111 yes 2
2 121 no 1
3 122 yes 2
4 224 no 2
5 441 no 3
6 665 yes 1 ",
header = TRUE, stringsAsFactors = FALSE)
Dataset2 <- read.table(text = "id1 id2 age gen
1 111 45 m
1 111 46 f
2 1 52 f
121 122 41 f
121 122 44 m
4 224 54 f
4 221 56 m
5 441 44 m
5 441 45 f
5 441 58 f
6 665 54 f ",
header = TRUE, stringsAsFactors = FALSE)
