How to transform a given table in R?

I have a dataset of the form:
Var1 Freq
A 16
B 15
C 11
D 11
E 2
F 1
My goal is to get an OUTPUT of the following form:
cat1 cat2 cat3 cat4 cat5
A B C,D E F
16 15 11 2 1
where cat1, ..., cat5 are the variable names. I appreciate your help in advance!

with(aggregate(Var1 ~ Freq, df, paste, collapse = ","),
     setNames(rbind.data.frame(Var1, Freq)[, order(Var1)], paste0("cat", seq(Freq))))
cat1 cat2 cat3 cat4 cat5
1 A B C,D E F
2 16 15 11 2 1
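To see why this works, look at the intermediate result: aggregate() collapses the Var1 values that share a frequency, rbind.data.frame() then stacks Var1 over Freq as two rows, and order(Var1) rearranges the columns alphabetically before setNames() labels them cat1..cat5. A quick sketch with the question's df:
aggregate(Var1 ~ Freq, df, paste, collapse = ",")
#   Freq Var1
# 1    1    F
# 2    2    E
# 3   11  C,D
# 4   15    B
# 5   16    A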

Try this out
library(tidyverse)
df <- tribble(~Var1, ~Freq,
              "A", 16,
              "B", 15,
              "C", 11,
              "D", 11,
              "E", 2,
              "F", 1) %>%
  group_by(Freq) %>%
  summarise(Var1 = paste(Var1, collapse = ",")) %>%
  arrange(Var1) %>%
  as.matrix() %>%
  t() %>%
  as_tibble(.name_repair = "universal") %>%
  mutate_all(~str_trim(.)) %>%
  arrange(desc(...1))
colnames(df) <- paste0("cat", 1:length(df))
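For reference, the transpose step coerces every value to character, so the final tibble should look roughly like this (a sketch):
df
# A tibble: 2 x 5
#   cat1  cat2  cat3  cat4  cat5
#   <chr> <chr> <chr> <chr> <chr>
# 1 A     B     C,D   E     F
# 2 16    15    11    2     1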

# considering your data is in a data.frame called df
# let's create it
var1 <- LETTERS[1:6]
Freq <- c(16, 15, 11, 11, 2, 1)
df <- data.frame(var1, Freq, stringsAsFactors = FALSE)
# function to join var1
join <- function(x) {
  index <- which(df$Freq == x)
  paste(df$var1[index], collapse = ', ')
}
# get unique Freq and its length
unique_freq <- unique(df$Freq)
l <- length(unique_freq)
# create summarised var1
summarised_var <- rep("", l)
for (i in 1:l) {
  summarised_var[i] <- join(unique_freq[i])
}
# create grouped data.frame
grouped_df <- data.frame(summarised_var, unique_freq, stringsAsFactors = FALSE)
# create a transposed data.frame to get rows into columns
transposed_df <- t(grouped_df)
# create columns names (variables names)
col_names <- paste0('cat', 1:nrow(grouped_df))
# rename columns
colnames(transposed_df) <- col_names
# transposed_df is your output
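For reference, the result should look roughly like this; t() goes through as.matrix(), which format()s the numeric column into padded strings:
transposed_df
#                cat1 cat2 cat3   cat4 cat5
# summarised_var "A"  "B"  "C, D" "E"  "F"
# unique_freq    "16" "15" "11"   " 2" " 1"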

Find unique entries in otherwise identical rows

I am currently trying to find a way to identify unique column values in otherwise duplicate rows in a dataset.
My dataset has the following properties:
The dataset's columns comprise an identifier variable (ID) and a large number of response variables (x1 - xn).
Each row should represent one individual, meaning the values in the ID column should all be unique (and not repeated).
Some rows are duplicated, with repeated entries in the ID column and seemingly identical response item values (x1 - xn). However, the dataset is too large to get a full overview over all variables.
As demonstrated in the code below, if rows are truly identical for all variables, then the duplicate row can be removed with the dplyr::distinct() function. In my case, not all "duplicate" rows are removed by distinct(), which can only mean that not all entries are identical.
I want to find a way to identify which entries are unique in these otherwise duplicate rows.
Example:
library(dplyr)
library(janitor)
df <- data.frame(
  "ID" = rep(1:3, each = 2),
  "x1" = rep(4:6, each = 2),
  "x2" = c("a", "a", "b", "b", "c", "d"),
  "x3" = c(7, 10, 8, 8, 9, 11),
  "x4" = rep(letters[4:6], each = 2),
  "x5" = c("x", "p", "y", "y", "z", "q"),
  "x6" = rep(letters[7:9], each = 2)
)
# The dataframe with all entries
df
A data.frame: 6 × 7
ID x1 x2 x3 x4 x5 x6
1 4 a 7 d x g
1 4 a 10 d p g
2 5 b 8 e y h
2 5 b 8 e y h
3 6 c 9 f z i
3 6 d 11 f q i
# The dataframe
df %>%
  # with duplicates removed
  distinct() %>%
  # filtered down to rows whose ID value is duplicated
  janitor::get_dupes(ID)
ID dupe_count x1 x2 x3 x4 x5 x6
1 2 4 a 7 d x g
1 2 4 a 10 d p g
3 2 6 c 9 f z i
3 2 6 d 11 f q i
In the example above I demonstrate how dplyr::distinct() will remove fully duplicate rows (ID = 2), but not rows that are different in some columns (rows where ID = 1 and 3, and columns x2, x3 and x5).
What I want is an overview of which columns are not identical within these otherwise duplicate rows:
df %>%
  distinct() %>%
  janitor::get_dupes(ID) %>%
  # Here I want a way to find columns with unidentical entries:
  find_nomatch()
ID x2 x3 x5
 1     7  x
 1    10  p
 3  c  9  z
 3  d 11  q
A data.table alternative. Coerce the data frame to a data.table (setDT) and melt the data to long format (melt(df, id.vars = "ID")).
Within each group defined by 'ID' and 'variable' (corresponding to the columns in the wide format) (by = .(ID, variable)), count number of unique values (uniqueN(value)) and check if it's equal to the number of rows in the subgroup (== .N). If so (if), select the entire subgroup (.SD).
Finally, reshape the data back to wide format (dcast).
library(data.table)
setDT(df)
d = melt(df, id.vars = "ID")
dcast(d[ , if(uniqueN(value) == .N) .SD, by = .(ID, variable)], ID + rowid(ID, variable) ~ variable)
# ID ID_1 x2 x3 x5
# 1: 1 1 <NA> 7 x
# 2: 1 2 <NA> 10 p
# 3: 3 1 c 9 z
# 4: 3 2 d 11 q
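To see what survives the filter before dcast(), you can run the grouped j-expression on its own: only the ID/variable groups whose values are all distinct are kept (a sketch; note that melt() coerces the mixed-type value column to character and warns about it):
d[, if (uniqueN(value) == .N) .SD, by = .(ID, variable)]
#     ID variable value
#  1:  3       x2     c
#  2:  3       x2     d
#  3:  1       x3     7
#  4:  1       x3    10
#  5:  3       x3     9
#  6:  3       x3    11
#  7:  1       x5     x
#  8:  1       x5     p
#  9:  3       x5     z
# 10:  3       x5     q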
A bit simpler than yours, I think:
library(dplyr)
library(janitor)
df <- data.frame(
  "ID" = rep(1:3, each = 2),
  "x1" = rep(4:6, each = 2),
  "x2" = c("a", "a", "b", "b", "c", "d"),
  "x3" = c(7, 10, 8, 8, 9, 11),
  "x4" = rep(letters[4:6], each = 2),
  "x5" = c("x", "p", "y", "y", "z", "q"),
  "x6" = rep(letters[7:9], each = 2)
)
d <- df %>%
distinct() %>%
janitor::get_dupes(ID)
d %>%
  group_by(ID) %>%
  # Check for each ID which row elements differ from those of the first row
  group_map(\(.x, .id) apply(.x, 1, \(.y) .x[1, ] != .y)) %>%
  do.call(what = cbind) %>%  # Bind results for all IDs
  apply(1, any) %>%          # TRUE where a row differs anywhere
  c(T, .) %>%                # Keep the ID column
  `[`(d, .)
#> ID x2 x3 x5
#> 1 1 a 7 x
#> 2 1 a 10 p
#> 3 3 c 9 z
#> 4 3 d 11 q
Created on 2022-01-18 by the reprex package (v2.0.1)
Edit (same idea, comparing with identical() instead of !=):
d %>%
  group_by(ID) %>%
  # Check for each ID which row elements differ from those of the first row
  group_map(\(.x, .id) apply(.x, 1, \(.y) !Vectorize(identical)(unlist(.x[1, ]), .y))) %>%
  do.call(what = cbind) %>%  # Bind results for all IDs
  apply(1, any) %>%          # TRUE where a row differs anywhere
  c(T, .) %>%                # Keep the ID column
  `[`(d, .)
#> ID x2 x3 x5
#> 1 1 a 7 x
#> 2 1 a 10 p
#> 3 3 c 9 z
#> 4 3 d 11 q
Created on 2022-01-19 by the reprex package (v2.0.1)
I have been working on this issue for some time and I found a solution, though it took more steps than I would have thought necessary. I can only presume there is a more elegant solution out there. Anyway, this should work:
df <- df %>%
  distinct() %>%
  janitor::get_dupes(ID)
# Make a vector of the duplicated ID values
l <- distinct(df, ID) %>% unlist()
# lapply over each ID
df <- lapply(
  l,
  function(x) {
    # Filter rows for the duplicated ID
    dplyr::filter(df, ID == x) %>%
      # Transpose the dataframe (converts it into a matrix)
      t() %>%
      # Convert back to a data frame
      as.data.frame() %>%
      # Keep only rows (the former columns) whose entries are not all identical
      dplyr::filter(!if_all(everything(), ~ . == V1)) %>%
      # Transpose back
      t() %>%
      # Convert back to a data frame
      as.data.frame()
  }
) %>%
  # Bind the dataframes in the list together
  bind_rows() %>%
  # Finally, move the columns back into ascending order
  relocate(x2, .before = x3)
# Remove row names (not necessary)
row.names(df) <- NULL
df
A data.frame: 4 × 3
x2 x3 x5
NA 7 x
NA 10 p
c 9 z
d 11 q
Feel free to comment
If you just want to keep the first instance of each identifier:
df <- data.frame(
"ID" = rep(1:3, each = 2),
"x1" = rep(4:6, each = 2),
"x2" = rep(letters[1:3], each = 2),
"x3" = c(7, 10, 8, 8, 9, 11),
"x4" = rep(letters[4:6], each = 2)
)
df %>%
distinct(ID, .keep_all = TRUE)
Output:
ID x1 x2 x3 x4
1 1 4 a 7 d
2 2 5 b 8 e
3 3 6 c 9 f

Reassigning labels using dplyr

Each ID records a series of signal label: "alpha", "beta" and "unknown".
If an ID has only two distinct labels, I wish to assign the dominating label to all of its records, i.e. if the recorded labels of an ID are
c("alpha", "alpha", "unknown"), they become c("alpha", "alpha", "alpha")
Can someone please help me with this?
library(tidyverse)
# Data preparation (you can directly work with the tbl below)
ID <- c(rep("A", 14), rep("B", 14), rep("C", 10), rep("D", 22), rep("E", 2))
series <- c(11, 3, 12, 2, 8, 2, 11, 8, 3, 2)
label <- unlist(
  sapply(series, function(x) {
    case_when(x < 5 ~ rep("unknown", x),
              x >= 5 ~ case_when(x > 10 ~ rep("alpha", x),
                                 x <= 10 ~ rep("beta", x)))
  }))
# tbl
tbl <- tibble(ID = ID,
              label = label)
If I understood it correctly, from this
tbl %>% group_by(ID) %>% summarise(n_distinct(label))
  ID    `n_distinct(label)`
1 A                       2
2 B                       2
3 C                       2
4 D                       3
5 E                       1
We want to update the labels for IDs A, B and C, but not D or E. We can make use of the table() function to find the most frequent label within those IDs.
tbl2 <- tbl %>%
  group_by(ID) %>%
  mutate(label = if(n_distinct(label) == 2) names(which.max(table(label))) else label)
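To see what the if branch computes for one group: ID A has 11 "alpha" and 3 "unknown" records, and names(which.max(table(label))) picks the most frequent label (a quick sketch):
lab_A <- c(rep("alpha", 11), rep("unknown", 3))
table(lab_A)
# lab_A
#   alpha unknown
#      11       3
names(which.max(table(lab_A)))
# [1] "alpha"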
Checking the number of distinct labels per ID now gives:
tbl2 %>% group_by(ID) %>% summarise(n_distinct(label))
ID `n_distinct(label)`
<chr> <int>
1 A 1
2 B 1
3 C 1
4 D 3
5 E 1

Giving IDs to groups in R [duplicate]

This question already has answers here:
How to create a consecutive group number
(13 answers)
Closed 1 year ago.
I have an R data frame that is divided into groups, like this:
Row Group
1   A
2   B
3   A
4   D
5   C
6   B
7   C
8   C
9   A
10  B
I would like to add a unique numeric ID to each group, so that I would finally have something like this:
Row Group ID
1   A     1
2   B     2
3   A     1
4   D     4
5   C     3
6   B     2
7   C     3
8   C     3
9   A     1
10  B     2
How could I achieve this?
Thank you very much.
Here is a simple way.
df1$ID <- as.integer(factor(df1$Group))
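This works because factor() sorts its levels alphabetically by default, so each group maps to the same integer everywhere it appears (a quick check):
as.integer(factor(c("A", "B", "A", "D", "C")))
# [1] 1 2 1 4 3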
Three solutions have been posted so far: mine, TarJae's and akrun's. They can be timed with increasing data sizes; akrun's is the fastest.
library(microbenchmark)
library(dplyr)
library(ggplot2)
funtest <- function(x, n){
  out <- lapply(seq_len(n), function(i){
    for(j in seq_len(i)) x <- rbind(x, x)
    cat("nrow(x):", nrow(x), "\n")
    mb <- microbenchmark(
      match = with(x, match(Group, sort(unique(Group)))),
      dplyr = x %>% group_by(Group) %>% mutate(ID = cur_group_id()),
      intfac = as.integer(factor(x$Group))
    )
    mb$n <- i
    mb
  })
  out <- do.call(rbind, out)
  aggregate(time ~ ., out, median)
}
df1 %>%
  funtest(10) %>%
  ggplot(aes(n, time, colour = expr)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 1:10, labels = 1:10) +
  scale_y_continuous(trans = "log10") +
  theme_bw()
Update
group_indices() was deprecated in dplyr 1.0.0.
Please use cur_group_id() instead.
df1 <- df %>%
  group_by(Group) %>%
  mutate(ID = cur_group_id())
First answer:
You can use group_indices
library(dplyr)
df1 <- df %>%
  group_by(Group) %>%
  mutate(ID = group_indices())
data
df <- tribble(
  ~Row, ~Group,
  1, "A",
  2, "B",
  3, "A",
  4, "D",
  5, "C",
  6, "B",
  7, "C",
  8, "C",
  9, "A",
  10, "B")
Row Group ID
<int> <chr> <int>
1 1 A 1
2 2 B 2
3 3 A 1
4 4 D 4
5 5 C 3
6 6 B 2
7 7 C 3
8 8 C 3
9 9 A 1
10 10 B 2
We can use match() of 'Group' against the sorted unique values of 'Group' to get the position index:
df1$ID <- with(df1, match(Group, sort(unique(Group))))
data
df1 <- structure(list(Row = 1:10, Group = c("A", "B", "A", "D", "C",
"B", "C", "C", "A", "B")), class = "data.frame", row.names = c(NA,
-10L))

Fastest way to check for unique values and returning it if there is only one unique value in an R data.table

Suppose I have a large data.table that looks like dt below.
dt <- data.table(
  player_1 = c("a", "b", "b", "c"),
  player_1_age = c(10, 20, 20, 30),
  player_2 = c("b", "a", "c", "a"),
  player_2_age = c(20, 10, 30, 10)
)
# dt
# player_1 player_1_age player_2 player_2_age
# 1: a 10 b 20
# 2: b 20 a 10
# 3: b 20 c 30
# 4: c 30 a 10
From the dt above, I would like to create a data.table with unique players and their age like the following, player_dt:
# player_dt
# player age
# a 10
# b 20
# c 30
To do so, I've tried the code below, but it takes too long on my larger dataset, probably because I am creating a data.table for each iteration of sapply.
How would you get the player_dt above, while checking for each player that there is only one unique age value?
# get unique players
player <- sort(unique(c(dt$player_1, dt$player_2)))
# for each player, get their age, if there is only one age value
age <- sapply(player, function(x) {
  unique_values <- unique(c(
    dt[player_1 == x][["player_1_age"]],
    dt[player_2 == x][["player_2_age"]]))
  if (length(unique_values) > 1) stop() else return(unique_values)
})
# combine to create the player_dt
player_dt <- data.table(player, age)
I use the data from @DavidT as input.
dt
# player_1 player_1_age player_2 player_2_age
#1: a 10 b 20
#2: b 20 a 10
#3: b 20 c 30
#4: c 30 a 11 # <--
TL;DR
You can do
nm <- names(dt)
idx <- endsWith(nm, "age")
colsAge <- nm[idx]
colsOther <- nm[!idx]
out <- unique(melt(
  dt,
  measure.vars = list(colsAge, colsOther),
  value.name = c("age", "player")
)[, .(age, player)])[
  # credit: https://stackoverflow.com/a/34427944/8583393
  , if (.N == 1) .SD, by = player]
out
# player age
#1: b 20
#2: c 30
Step-by-step
What you can do is melt multiple columns simultaneously: those that end with "age" and those that don't.
nm <- names(dt)
idx <- endsWith(nm, "age")
colsAge <- nm[idx]
colsOther <- nm[!idx]
dt1 <- melt(dt, measure.vars = list(colsAge, colsOther), value.name = c("age", "player"))
The result is
dt1
# variable age player
#1: 1 10 a
#2: 1 20 b
#3: 1 20 b
#4: 1 30 c
#5: 2 20 b
#6: 2 10 a
#7: 2 30 c
#8: 2 11 a
Now we call unique ...
out <- unique(dt1[, .(age, player)])
out
# age player
#1: 10 a
#2: 20 b
#3: 30 c
#4: 11 a
... and filter for groups of "player" with length equal to 1
out <- out[, if(.N == 1) .SD, by=player]
out
# player age
#1: b 20
#2: c 30
Given OP's input data, that last step is not needed.
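If you would rather fail loudly when a player has more than one recorded age, as the stop() in the question's sapply does, a small check on the deduplicated long data can do it (a sketch using dt1 from above):
conflicts <- unique(dt1[, .(age, player)])[, .N, by = player][N > 1L]
if (nrow(conflicts) > 0L)
  stop("conflicting ages for player(s): ", paste(conflicts$player, collapse = ", "))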
data
library(data.table)
dt <- data.table(
  player_1 = c("a", "b", "b", "c"),
  player_1_age = c(10, 20, 20, 30),
  player_2 = c("b", "a", "c", "a"),
  player_2_age = c(20, 10, 30, 11)
)
Reference: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html
I've altered your data so that there's at least one error to catch:
library(tidyverse)
dt <- tibble(
  player_1 = c("a", "b", "b", "c"),
  player_1_age = c(10, 20, 20, 30),
  player_2 = c("b", "a", "c", "a"),
  player_2_age = c(20, 10, 30, 11)
)
# Get the Names columns and the Age columns
colName <- names(dt)
ageCol <- colName[str_detect(colName, "age$")]
playrCol <- colName[!str_detect(colName, "age$")]
# Gather the ages
ages <- dt %>%
  select(ageCol) %>%
  gather(player_age, age)
# Gather the names
names <- dt %>%
  select(playrCol) %>%
  gather(player_name, name)
# Bind the two together, and throw out the duplicates.
# If there are no contradictions, this is what you want.
allNameAge <- cbind(names, ages) %>%
  select(name, age) %>%
  distinct() %>%
  arrange(name)
# But check for inconsistencies. This should leave you with
# an empty tibble, but instead it shows the error.
inconsistencies <- allNameAge %>%
  group_by(name) %>%
  mutate(AGE.COUNT = n_distinct(age)) %>%
  filter(AGE.COUNT > 1) %>%
  ungroup()
This should extend to more name/age column pairs.
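As a side note, gather() is superseded in current tidyr; a roughly equivalent sketch with pivot_longer() (assuming tidyr >= 1.0; the rename is only there so every column name matches the pattern):
library(dplyr)
library(tidyr)
long <- dt %>%
  rename(player_1_name = player_1, player_2_name = player_2) %>%
  pivot_longer(everything(),
               names_to = c("pair", ".value"),
               names_pattern = "player_(\\d+)_(.*)")
# 'long' has columns pair, name and age; the distinct()/n_distinct()
# checks above then apply unchanged.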

Count occurrence of a categorical variable, when grouping and summarising by a different variable in R

I have a table df that looks like this:
a <- c(10, 20, 20, 20, 30)
b <- c("u", "u", "u", "r", "r")
c <- c("a", "a", "b", "b", "b")
df <- data.frame(a,b,c)
I would like to create a new table that contains the mean of column a, grouped by variable c, and a column counting the occurrences of the b types within each c group.
I would therefore like the result table to look like df2:
a_m <- c(15, 23.3)
c <- c("a", "b")
counts_b <-c("2 u", "1 u, 2 r")
df2 <- data.frame(a_m, c, counts_b)
What I have so far is:
df2 <- df %>% group_by(c) %>% summarise(a_m = mean(a, na.rm = TRUE))
I do not know how to add the column counts_b in the example df2.
Here's a way using a little table magic:
df %>%
  group_by(c) %>%
  summarise(a_mean = mean(a),
            b_list = paste(names(table(b)), table(b), collapse = ', '))
# A tibble: 2 x 3
c a_mean b_list
<fct> <dbl> <chr>
1 a 15.0 r 0, u 2
2 b 23.3 r 2, u 1
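To unpack the "table magic" for one group: within c == "a", b is a factor that keeps both of its levels, so table() reports a zero count for the absent level and paste() glues the level names to the counts (a quick sketch; note that since R 4.0, data.frame() no longer creates factors by default, so you may need stringsAsFactors = TRUE to reproduce the zero counts):
b_a <- factor(c("u", "u"), levels = c("r", "u"))
table(b_a)
# b_a
# r u
# 0 2
paste(names(table(b_a)), table(b_a), collapse = ', ')
# [1] "r 0, u 2"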
Here is another solution using reshape2. The output format may be more convenient to work with: each value of b gets its own column holding the number of occurrences.
library(reshape2)
out1 <- dcast(df, c ~ b, value.var = "c", fun.aggregate = length)
c r u
1 a 0 2
2 b 2 1
out2 <- df %>% group_by(c) %>% summarise(a_m = mean(a))
# A tibble: 2 x 2
c a_m
<fctr> <dbl>
1 a 15.00000
2 b 23.33333
df2 <- merge(out1, out2, by = "c")
c r u a_m
1 a 0 2 15.00000
2 b 2 1 23.33333
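The same wide layout of counts can also be produced without reshape2, using dplyr's count() together with tidyr's pivot_wider() (a sketch, assuming tidyr >= 1.0):
library(dplyr)
library(tidyr)
df2 <- df %>%
  count(c, b) %>%                      # occurrences of each b within c
  pivot_wider(names_from = b, values_from = n, values_fill = 0) %>%
  left_join(df %>% group_by(c) %>% summarise(a_m = mean(a)), by = "c")
df2
#   c u r      a_m
# 1 a 2 0 15.00000
# 2 b 1 2 23.33333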
