How to concatenate multiple columns in one and remove duplicates?

I have a dataframe like this one:
A <- c("a", "a", "a", "a")
B <- c("b", "b", "b", "b")
C <- c("c", "a", "c", "c")
D <- c("d", "b", "a", "d")
E <- c("a", "a", "b", "e")
F <- c("b", "b", "c", "f")
G <- c("c", "a", "a", "g")
df <- data.frame(A, B, C, D, E, F, G)
I need to merge all values from columns A to G, remove duplicates, and store the resulting list in a new column. So, the final result should look like this:

Try this one (because the rows have different numbers of unique values, new is stored as a list column):
> df$new <- apply(df,1,unique)
> df
A B C D E F G new
1 a b c d a b c a, b, c, d
2 a b a b a b a a, b
3 a b c a b c a a, b, c
4 a b c d e f g a, b, c, d, e, f, g
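
If a plain character column is preferred over a list column, a minimal base R sketch along the same lines could collapse each row's unique values into one string. The `cols` vector and the ", " separator are my own choices for illustration, not part of the answer above.

```r
# Rebuild the example data frame from the question.
df <- data.frame(
  A = c("a", "a", "a", "a"),
  B = c("b", "b", "b", "b"),
  C = c("c", "a", "c", "c"),
  D = c("d", "b", "a", "d"),
  E = c("a", "a", "b", "e"),
  F = c("b", "b", "c", "f"),
  G = c("c", "a", "a", "g"),
  stringsAsFactors = FALSE
)

# Columns to merge, named explicitly so a rerun does not pick up `new` itself.
cols <- c("A", "B", "C", "D", "E", "F", "G")

# For each row: keep the first occurrence of every value, then join into one string.
df$new <- apply(df[cols], 1, function(row) paste(unique(row), collapse = ", "))
df$new
```

Selecting the columns by name also avoids `apply(df, 1, unique)` returning a matrix instead of a list when every row happens to have the same number of unique values.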

A possible solution:
library(tidyverse)
A <- c("a", "a", "a", "a")
B <- c("b", "b", "b", "b")
C <- c("c", "a", "c", "c")
D <- c("d", "b", "a", "d")
E <- c("a", "a", "b", "e")
F <- c("b", "b", "c", "f")
G <- c("c", "a", "a", "g")
df <- data.frame(A, B, C, D, E, F, G)
df %>%
rowwise %>%
mutate(new = c_across(everything()) %>% unique %>% str_c(collapse = ",")) %>%
ungroup
#> # A tibble: 4 × 8
#> A B C D E F G new
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 a b c d a b c a,b,c,d
#> 2 a b a b a b a a,b
#> 3 a b c a b c a a,b,c
#> 4 a b c d e f g a,b,c,d,e,f,g

This is sort of a silly way of doing it, but does this address your issue?
list(unique(t(df)[,1]),
unique(t(df)[,2]),
unique(t(df)[,3]),
unique(t(df)[,4]))
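
The same idea can probably be written without hard-coding the four row indices. This is only a sketch, assuming the df from the question; `row_uniques` is a name I made up.

```r
# Example data frame from the question.
df <- data.frame(
  A = c("a", "a", "a", "a"),
  B = c("b", "b", "b", "b"),
  C = c("c", "a", "c", "c"),
  D = c("d", "b", "a", "d"),
  E = c("a", "a", "b", "e"),
  F = c("b", "b", "c", "f"),
  G = c("c", "a", "a", "g"),
  stringsAsFactors = FALSE
)

# One list element per row, holding that row's unique values in first-seen order.
row_uniques <- lapply(seq_len(nrow(df)), function(i) {
  unique(unlist(df[i, ], use.names = FALSE))
})
row_uniques
```

This scales to any number of rows, at the cost of one data frame subset per row.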

Related

How to crosstabulate two variables to classify a third categorical variable in R

I want to crosstabulate x by y to obtain in the table cells, the values of z.
library(tidyverse)
df <- tibble(x = c("a", "a", "b", "b"),
y = c("c", "d", "c", "d"),
z = c("e", "g", "f", "h"))
# I want to obtain this result:
# c d
# a e g
# b f h
Created on 2021-07-18 by the reprex package (v2.0.0)

I think you want tidyr::pivot_wider...
df %>% pivot_wider(names_from = y, values_from = z)
# A tibble: 2 x 3
x c d
<chr> <chr> <chr>
1 a e g
2 b f h
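
If you would rather end up with a matrix that has x as row names (closer to the layout sketched in the question), a base R alternative might look like this. It assumes each x/y pair occurs at most once; taking the first value per cell is my own simplification.

```r
# Same data as in the question, as a plain data frame.
df <- data.frame(
  x = c("a", "a", "b", "b"),
  y = c("c", "d", "c", "d"),
  z = c("e", "g", "f", "h"),
  stringsAsFactors = FALSE
)

# Cross x by y and put the (first) z value of each combination in the cell.
tab <- with(df, tapply(z, list(x, y), FUN = function(v) v[1]))
tab
```

Here `tab["a", "d"]` picks out the z value for x = "a" and y = "d".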

Find Count of Elements from One List in Another List

So, if I have two lists, one being a "master list" without repeats, and the other being a subset with possible repeats, I would like to be able to check how many of each element are in the secondary subset list.
So if I have these lists:
a <- c("a", "b", "c", "d", "e", "f", "g")
b <- c("a", "d", "c", "d", "a", "f", "f", "g", "c", "c")
I'd like to determine how many times each element of list a appears in list b. My ideal output would be an R table that looks like:
a b c d e f g
2 0 3 2 0 2 1
I've been trying to think through it with %in% and table()

You can use table and match - but first make the vectors factors so levels not present are included in the output:
a <- factor(c("a", "b", "c", "d", "e", "f", "g"))
b <- factor(c("a", "d", "c", "d", "a", "f", "f", "g", "c", "c"))
table(a[match(b, a)])
a b c d e f g
2 0 3 2 0 2 1
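
A slightly shorter variant of the same idea, as a sketch: keep a and b as plain character vectors and only turn b into a factor whose levels are the master list. `counts` is just an illustrative name.

```r
a <- c("a", "b", "c", "d", "e", "f", "g")
b <- c("a", "d", "c", "d", "a", "f", "f", "g", "c", "c")

# Using `a` as the factor levels makes table() report zero for absent elements.
counts <- table(factor(b, levels = a))
counts
```

Any value in b that is not in a becomes NA and is dropped from the table by default.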

If for some reason you want a tidyverse solution, this method preserves the original data type in the lists.
library(tidyverse)
a <- c("a", "b", "c", "d", "e", "f", "g")
b <- c("a", "d", "c", "d", "a", "f", "f", "g", "c", "c")
tibble(letters = a, count = unlist(map(a, function(x) sum(b %in% x))))
# A tibble: 7 x 2
letters count
<chr> <int>
1 a 2
2 b 0
3 c 3
4 d 2
5 e 0
6 f 2
7 g 1

Sum mismatches in column to column comparison

I am quite new to R programming, and am having some difficulty with ANOTHER step of my project. I am not even sure at this point if I am asking the question correctly. I have a dataframe of actual and predicted values:
actual predicted.1 predicted.2 predicted.3 predicted.4
a a a a a
a a a b b
b b a b b
b a b b c
c c c c c
c d c c d
d d d c d
d d d d a
The issue that I am having is that I need to create a vector of mismatches between the actual value and each of the four predicted values. This should result in a single vector: c(2,1,2,4)
I am trying to use a boolean mask to sum over the TRUE values...but something is not working right. I need to do this sum for each of the four predicted values to actual value comparisons.
discordant_sums(df[,seq(1,ncol(df),2)]!=,df[,seq(2,ncol(df),2)])
Any suggestions would be greatly appreciated.

You can use apply to compare values in 1st column with values in each of all other columns.
apply(df[-1], 2, function(x)sum(df[1]!=x))
# predicted.1 predicted.2 predicted.3 predicted.4
# 2 1 2 4
Data:
df <- read.table(text =
"actual predicted.1 predicted.2 predicted.3 predicted.4
a a a a a
a a a b b
b b a b b
b a b b c
c c c c c
c d c c d
d d d c d
d d d d a",
header = TRUE, stringsAsFactors = FALSE)
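
As a sketch of a vectorised alternative: comparing the data frame of predictions with the actual column recycles that column down each prediction column, so colSums can count the mismatches directly. `mismatches` is a name I chose for illustration.

```r
df <- read.table(text =
"actual predicted.1 predicted.2 predicted.3 predicted.4
a a a a a
a a a b b
b b a b b
b a b b c
c c c c c
c d c c d
d d d c d
d d d d a",
header = TRUE, stringsAsFactors = FALSE)

# TRUE wherever a prediction differs from `actual`; summing TRUEs per column counts mismatches.
mismatches <- colSums(df[-1] != df$actual)
mismatches
```

Wrap the result in as.vector() if you want to drop the column names.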

We can replicate the first column to make the lengths equal between the comparison objects and do the colSums
as.vector(colSums(df[,1][row(df[-1])] != df[-1]))
#[1] 2 1 2 4
data
df <- structure(list(actual = c("a", "a", "b", "b", "c", "c", "d",
"d"), predicted.1 = c("a", "a", "b", "a", "c", "d", "d", "d"),
predicted.2 = c("a", "a", "a", "b", "c", "c", "d", "d"),
predicted.3 = c("a", "b", "b", "b", "c", "c", "c", "d"),
predicted.4 = c("a", "b", "b", "c", "c", "d", "d", "a")),
.Names = c("actual",
"predicted.1", "predicted.2", "predicted.3", "predicted.4"),
class = "data.frame", row.names = c(NA,
-8L))

Match columns with multiple entries in a row and mutate result

I have a data frame:
col_1 <- c("A", "A", "B", "B", "C", "C")
col_2 <- c("A", "B", "C", "D", "E", "F")
col_3 <- c("A", "B", "C", "C", "B", "A")
df <- data.frame(col_1, col_2, col_3)
I want to mutate a new column that contains TRUE or FALSE depending on whether each row has two or more identical entries.
e.g.:
t_f <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
Even better, if I could have a column that contains the repeated values, e.g.:
name <- c("A", "B", "C", NA, NA, NA)

For your first requirement
df$t_f <- apply(df, 1, function(x) any(duplicated(x)))
And your second
df$name <- apply(df, 1, function(x) ifelse(any(duplicated(x)), x[which(duplicated(x))], NA))
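
One thing to watch with the two lines above: the second apply() also sees the freshly added t_f column. A hedged sketch that restricts both steps to the original columns could look like this; `cols` and `first_dup` are hypothetical helper names of mine.

```r
col_1 <- c("A", "A", "B", "B", "C", "C")
col_2 <- c("A", "B", "C", "D", "E", "F")
col_3 <- c("A", "B", "C", "C", "B", "A")
df <- data.frame(col_1, col_2, col_3, stringsAsFactors = FALSE)

# Only these columns are checked, so columns added later cannot interfere.
cols <- c("col_1", "col_2", "col_3")

# Return the first repeated value in a row, or NA if nothing repeats.
first_dup <- function(x) {
  d <- x[duplicated(x)]
  if (length(d) > 0) d[[1]] else NA_character_
}

# anyDuplicated() gives the index of the first duplicate, or 0 if there is none.
df$t_f  <- apply(df[cols], 1, anyDuplicated) > 0
df$name <- apply(df[cols], 1, first_dup)
df
```

With the example data, t_f is TRUE for the first three rows and name holds the repeated value there, NA elsewhere.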

For your second requirement:
col_1 <- c("A", "A", "B", "B", "C", "C")
col_2 <- c("A", "B", "C", "D", "E", "F")
col_3 <- c("A", "B", "C", "C", "B", "A")
df <- data.frame(col_1, col_2, col_3)
df$name <- apply(df, 1,
function(row)ifelse(max(table(row))>=2,
names(table(row))[which.max(table(row))], NA))
df
#> col_1 col_2 col_3 name
#> 1 A A A A
#> 2 A B B B
#> 3 B C C C
#> 4 B D C <NA>
#> 5 C E B <NA>
#> 6 C F A <NA>

In base R you can try
ifelse(colSums(table(row(df), as.matrix(df)) >= 2) == 1, colnames(table(row(df), as.matrix(df))), NA)
A B C D E F
"A" "B" "C" NA NA NA
In tidyverse you can do
library(tidyverse)
df %>%
mutate_if(is.factor, as.character) %>%
rowwise() %>%
mutate(dup=anyDuplicated(c(col_1, col_2, col_3))!=0) %>%
mutate(which.dup=c(col_1, col_2, col_3)[which(duplicated(c(col_1, col_2, col_3)))[1]])
Source: local data frame [6 x 5]
Groups: <by row>
# A tibble: 6 x 5
col_1 col_2 col_3 dup which.dup
<chr> <chr> <chr> <lgl> <chr>
1 A A A TRUE A
2 A B B TRUE B
3 B C C TRUE C
4 B D C FALSE NA
5 C E B FALSE NA
6 C F A FALSE NA

Tabulating list of values in third variable in R

I have the following data:
ddf2 = structure(list(col1 = c(3, 3, 2, 1, 1, 1, 3, 2, 1, 1, 3, 1, 1,
2, 1, 1, 1, 2, 3, 1, 1, 3, 2, 3, 3), col2 = c("c", "c", "b",
"b", "b", "a", "b", "c", "b", "b", "c", "c", "b", "b", "a", "c",
"c", "b", "a", "b", "b", "c", "a", "c", "a"), col3 = c("C", "E",
"E", "B", "D", "E", "C", "C", "E", "E", "C", "A", "D", "D", "C",
"E", "A", "A", "A", "D", "A", "A", "B", "A", "E")), .Names = c("col1",
"col2", "col3"), row.names = c(NA, 25L), class = "data.frame")
head(ddf2)
col1 col2 col3
1 3 c C
2 3 c E
3 2 b E
4 1 b B
5 1 b D
6 1 a E
For every combination of col1 and col2, there may be many values of col3:
with(ddf2, ddf2[col1==1 & col2=='b',])
col1 col2 col3
4 1 b B
5 1 b D
9 1 b E
10 1 b E
13 1 b D
20 1 b D
21 1 b A
with(ddf2, table(col1, col2))
col2
col1 a b c
1 2 7 3
2 1 3 1
3 2 1 5
I want to create a table/matrix of col1 and col2 as above, but each cell should contain the list of unique col3 entries for that combination of col1 and col2. I expect the following output:
col2
col1 a b c
1 E,C A,B,D,E A,E
2 B A,D,E C
3 A,E C A,C,E
I tried the following, but it does not work:
with(ddf2, tapply(col3, list(col1,col2), c))
a b c
1 Character,2 Character,7 Character,3
2 "B" Character,3 "C"
3 Character,2 "C" Character,5
How can this be done? Thanks for your help.

One option:
# collapse the unique col3 values of each group into a single string
d <- with(ddf2, aggregate(col3 ~ col2 + col1, FUN = function(x) paste(unique(x), collapse = ", ")))
library(reshape2)
dcast(d, col1 ~ col2, value.var = "col3")
# col1 a b c
#1 1 E, C B, D, E, A A, E
#2 2 B E, D, A C
#3 3 A, E C C, E, A
Most likely it's possible to do both steps in one, but I'll generously leave it to someone else to figure this out ;)
Or
library(dplyr)
library(tidyr)
ddf2 %>%
group_by(col1, col2) %>%
summarise(col3 = paste(unique(col3), collapse = ", ")) %>%
spread(col2, col3)
#Source: local data frame [3 x 4]
#
# col1 a b c
#1 1 E, C B, D, E, A A, E
#2 2 B E, D, A C
#3 3 A, E C C, E, A
Edit after comment:
Just tested with tapply and this seems to work (the problem was apparently in calling c()):
with(ddf2, tapply(col3, list(col1,col2), FUN = function(x) paste(unique(x), collapse = ", ")))
# a b c
#1 "E, C" "B, D, E, A" "A, E"
#2 "B" "E, D, A" "C"
#3 "A, E" "C" "C, E, A"
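
If the values inside each cell should also come out in alphabetical order, as most cells in the question's expected output do, sorting before pasting is one option. The sketch below uses a small made-up data frame (`toy`) rather than ddf2, purely to keep it short.

```r
# Hypothetical toy data with the same column layout as ddf2.
toy <- data.frame(
  col1 = c(1, 1, 1, 2),
  col2 = c("a", "a", "b", "a"),
  col3 = c("E", "C", "E", "B"),
  stringsAsFactors = FALSE
)

# Per col1/col2 cell: unique col3 values, sorted, joined with commas.
res <- with(toy, tapply(col3, list(col1, col2),
                        FUN = function(x) paste(sort(unique(x)), collapse = ",")))
res
```

Combinations that never occur in the data (here col1 = 2 with col2 = "b") come back as NA.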
