Pivot to wide and keep all columns in R

I have a dataset like this:
df <- data.frame(A = c(1, 1, 1, 2, 2, 2),
                 B = c(3, 3, 3, 6, 6, 6),
                 C = c(2, 3, 9, 12, 2, 6),
                 D = c("a1", "a2", "a3", "a1", "a2", "a3"))
and I want a dataset like this:
df2 <- data.frame(a1 = c(2, 12), a2 = c(3, 2), a3 = c(9, 6), B = c(3, 6))
I tried this, but it doesn't work:
df_new <- df %>%
  mutate(B = if_else(B == 1, "A", "B")) %>%
  group_by(B) %>%
  mutate(var = paste0("V", row_number())) %>%
  pivot_wider(id_cols = B, names_from = var, values_from = A) %>%
  rename(row_name = B)
How can I solve this?

You can use pivot_wider. To keep column B, use unused_fn with a summarizing function (here mean, but it could also be first, min, max, etc.; a variant with first is sketched after the output).
library(tidyr)
df %>%
  pivot_wider(id_cols = A, names_from = D, values_from = C, unused_fn = mean)
A a1 a2 a3 B
1 1 2 3 9 3
2 2 12 2 6 6
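If, say, B should keep its first value per group instead of the mean, unused_fn also accepts a named list of functions. A quick sketch (here the result is the same, since B is constant within each A group):
library(dplyr) # for first()
library(tidyr)
df %>%
  pivot_wider(id_cols = A, names_from = D, values_from = C,
              unused_fn = list(B = first))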

data.table provides a nice function, dcast (ported from reshape2), to make this happen:
library(data.table)
dcast(df, A + B ~ D, value.var = "C")
A B a1 a2 a3
1 1 3 2 3 9
2 2 6 12 2 6
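If you'd rather not spell out the id columns, the formula shorthand ... stands for all remaining variables. A sketch:
library(data.table)
dcast(as.data.table(df), ... ~ D, value.var = "C")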
Read data.table's reshaping vignette, vignette("datatable-reshape"), if interested.


Find unique entries in otherwise identical rows

I am trying to find a way to identify unique column values in otherwise duplicated rows in a dataset.
My dataset has the following properties:
The dataset's columns comprise an identifier variable (ID) and a large number of response variables (x1 - xn).
Each row should represent one individual, meaning the values in the ID column should all be unique (and not repeated).
Some rows are duplicated, with repeated entries in the ID column and seemingly identical response item values (x1 - xn). However, the dataset is too large to get a full overview over all variables.
As demonstrated in the code below, if rows are truly identical for all variables, then the duplicate row can be removed with the dplyr::distinct() function. In my case, not all "duplicate" rows are removed by distinct(), which can only mean that not all entries are identical.
I want to find a way to identify which entries are unique in these otherwise duplicate rows.
Example:
library(dplyr)
library(janitor)
df <- data.frame(
  "ID" = rep(1:3, each = 2),
  "x1" = rep(4:6, each = 2),
  "x2" = c("a", "a", "b", "b", "c", "d"),
  "x3" = c(7, 10, 8, 8, 9, 11),
  "x4" = rep(letters[4:6], each = 2),
  "x5" = c("x", "p", "y", "y", "z", "q"),
  "x6" = rep(letters[7:9], each = 2)
)
# The dataframe with all entries
df
A data.frame: 6 × 7
ID x1 x2 x3 x4 x5 x6
1 4 a 7 d x g
1 4 a 10 d p g
2 5 b 8 e y h
2 5 b 8 e y h
3 6 c 9 f z i
3 6 d 11 f q i
# The dataframe
df %>%
  # with duplicates removed
  distinct() %>%
  # filtered for rows whose ID is still duplicated
  janitor::get_dupes(ID)
ID dupe_count x1 x2 x3 x4 x5 x6
1 2 4 a 7 d x g
1 2 4 a 10 d p g
3 2 6 c 9 f z i
3 2 6 d 11 f q i
In the example above I demonstrate how dplyr::distinct() will remove fully duplicated rows (ID = 2), but not rows that differ in some columns (rows where ID is 1 or 3, in columns x2, x3 and x5).
What I want is an overview of which columns are not duplicates for each ID:
df %>%
  distinct() %>%
  janitor::get_dupes(ID) %>%
  # Here I want a way to find columns with unidentical entries:
  find_nomatch()
ID x2 x3 x5
1 7 x
1 10 p
3 c 9 z
3 d 11 q
A data.table alternative. Coerce data frame to a data.table (setDT). Melt data to long format (melt(df, id.vars = "ID")).
Within each group defined by 'ID' and 'variable' (corresponding to the columns in the wide format) (by = .(ID, variable)), count number of unique values (uniqueN(value)) and check if it's equal to the number of rows in the subgroup (== .N). If so (if), select the entire subgroup (.SD).
Finally, reshape the data back to wide format (dcast); rowid(ID, variable) numbers the duplicates within each ID/variable pair so that dcast can spread them onto separate rows.
library(data.table)
setDT(df)
d = melt(df, id.vars = "ID")
dcast(d[ , if(uniqueN(value) == .N) .SD, by = .(ID, variable)], ID + rowid(ID, variable) ~ variable)
# ID ID_1 x2 x3 x5
# 1: 1 1 <NA> 7 x
# 2: 1 2 <NA> 10 p
# 3: 3 1 c 9 z
# 4: 3 2 d 11 q
A bit simpler than yours, I think:
library(dplyr)
library(janitor)
df <- data.frame(
  "ID" = rep(1:3, each = 2),
  "x1" = rep(4:6, each = 2),
  "x2" = c("a", "a", "b", "b", "c", "d"),
  "x3" = c(7, 10, 8, 8, 9, 11),
  "x4" = rep(letters[4:6], each = 2),
  "x5" = c("x", "p", "y", "y", "z", "q"),
  "x6" = rep(letters[7:9], each = 2)
)
d <- df %>%
  distinct() %>%
  janitor::get_dupes(ID)
d %>%
  group_by(ID) %>%
  # Check, for each ID, which row elements differ from those of the first row
  group_map(\(.x, .id) apply(.x, 1, \(.y) .x[1, ] != .y)) %>%
  do.call(what = cbind) %>% # Bind results for all IDs
  apply(1, any) %>%         # TRUE for columns that differ anywhere
  c(TRUE, .) %>%            # Keep the ID column
  `[`(d, .)                 # Select those columns from d
#> ID x2 x3 x5
#> 1 1 a 7 x
#> 2 1 a 10 p
#> 3 3 c 9 z
#> 4 3 d 11 q
Created on 2022-01-18 by the reprex package (v2.0.1)
Edit: the same idea, using identical() for the element-wise comparison instead of !=:
d %>%
  group_by(ID) %>%
  # Check, for each ID, which row elements differ from those of the first row
  group_map(\(.x, .id) apply(.x, 1, \(.y) !Vectorize(identical)(unlist(.x[1, ]), .y))) %>%
  do.call(what = cbind) %>% # Bind results for all IDs
  apply(1, any) %>%         # TRUE for columns that differ anywhere
  c(TRUE, .) %>%            # Keep the ID column
  `[`(d, .)                 # Select those columns from d
#> ID x2 x3 x5
#> 1 1 a 7 x
#> 2 1 a 10 p
#> 3 3 c 9 z
#> 4 3 d 11 q
Created on 2022-01-19 by the reprex package (v2.0.1)
I have been working on this issue for some time and I found a solution, though it took more steps than I would have thought necessary. I can only presume there's a more elegant solution out there. Anyway, this should work:
df <- df %>%
  distinct() %>%
  janitor::get_dupes(ID)
# Make a vector of the unique duplicated ID values
l <- distinct(df, ID) %>% unlist()
# lapply over each ID
df <- lapply(
  l,
  function(x) {
    # Filter rows for the duplicated ID
    dplyr::filter(df, ID == x) %>%
      # Transpose dataframe (converts it into a matrix)
      t() %>%
      # Convert back to data frame
      as.data.frame() %>%
      # Keep rows (former columns) whose entries are not all identical
      dplyr::filter(!if_all(everything(), ~ . == V1)) %>%
      # Transpose back
      t() %>%
      # Convert back to data frame
      as.data.frame()
  }
) %>%
  # Bind the dataframes in the list together
  bind_rows() %>%
  # Finally, move the columns back into ascending order
  relocate(x2, .before = x3)
# Remove row names (not necessary)
row.names(df) <- NULL
df
A data.frame: 4 × 3
x2 x3 x5
NA 7 x
NA 10 p
c 9 z
d 11 q
Feel free to comment
If you just want to keep the first instance of each identifier:
df <- data.frame(
  "ID" = rep(1:3, each = 2),
  "x1" = rep(4:6, each = 2),
  "x2" = rep(letters[1:3], each = 2),
  "x3" = c(7, 10, 8, 8, 9, 11),
  "x4" = rep(letters[4:6], each = 2)
)
df %>%
  distinct(ID, .keep_all = TRUE)
Output:
ID x1 x2 x3 x4
1 1 4 a 7 d
2 2 5 b 8 e
3 3 6 c 9 f

Collapsing Columns in R using tidyverse with mutate, replace, and unite. Writing a function to reuse?

Data:

ID  B   C
1   NA  x
2   x   NA
3   x   x

Results:

ID  Unified
1   C
2   B
3   B_C
I'm trying to combine columns B and C using mutate and unite, but how would I scale this up so that I can reuse it for multiple columns (think 100+), instead of having to write out the variables each time? Or is there a built-in function that does this?
My current solution is this:
library(tidyverse)
Data %>%
  mutate(B = replace(B, B == 'x', 'B'), C = replace(C, C == 'x', 'C')) %>%
  unite("Unified", B:C, na.rm = TRUE, remove = TRUE)
We may use across to loop over the columns and replace the values equal to 'x' with the column name (cur_column()).
library(dplyr)
library(tidyr)
Data %>%
  mutate(across(B:C, ~ replace(., . == 'x', cur_column()))) %>%
  unite(Unified, B:C, na.rm = TRUE, remove = TRUE)
Output:
ID Unified
1 1 C
2 2 B
3 3 B_C
data
Data <- structure(list(ID = 1:3, B = c(NA, "x", "x"), C = c("x", NA,
"x")), class = "data.frame", row.names = c(NA, -3L))
Here are a couple of options.
Using dplyr -
library(dplyr)
cols <- names(Data)[-1]
Data %>%
  rowwise() %>%
  mutate(Unified = paste0(cols[!is.na(c_across(B:C))], collapse = '_')) %>%
  ungroup() -> Data
Data
# ID B C Unified
# <int> <chr> <chr> <chr>
#1 1 NA x C
#2 2 x NA B
#3 3 x x B_C
Base R
Data$Unified <- apply(Data[cols], 1, function(x)
  paste0(cols[!is.na(x)], collapse = '_'))

Replacing multiple columns from different dataframe using dplyr

I have two dataframes, one of which contains a subset of IDs and columns of the other (but has different values).
ds1 <- data.frame(id = c(1:4),
                  d1 = "A",
                  d2 = "B",
                  d3 = "C")
ds2 <- data.frame(id = c(1, 2),
                  d1 = "W",
                  d2 = "X")
I am hoping to use dplyr on ds1 to find the shared columns and replace their values with those found in ds2, matching on id. I can mutate them one at a time like this:
ds1 %>%
  mutate(d1 = ifelse(id %in% ds2$id, ds2$d1[ds2$id == id], d1),
         d2 = ifelse(id %in% ds2$id, ds2$d2[ds2$id == id], d2))
In my real situation, however, I would need to do this 47 times. Given the flexibility of across(), I feel there is a better way. I am open to non-dplyr solutions as well.
Perhaps you need something like this, using dplyr and stringr (it can also be done without stringr):
library(tidyverse)
ds1 %>%
  left_join(ds2, by = 'id') %>%
  mutate(across(ends_with('.y'), ~ coalesce(., get(str_replace(cur_column(), '.y', '.x'))))) %>%
  select(!ends_with('.x')) %>%
  rename_with(~ str_remove(., '.y'), ends_with('.y'))
#> id d3 d1 d2
#> 1 1 C W X
#> 2 2 C W X
#> 3 3 C A B
#> 4 4 C A B
Created on 2021-05-10 by the reprex package (v2.0.0)
Using rows_update():
library(tidyverse)
ds1 <- data.frame(id = c(1:4),
                  d1 = "A",
                  d2 = "B",
                  d3 = "C")
ds2 <- data.frame(id = c(1, 2),
                  d1 = "W",
                  d2 = "X")
rows_update(x = ds1, y = ds2, by = "id")
#> id d1 d2 d3
#> 1 1 W X C
#> 2 2 W X C
#> 3 3 A B C
#> 4 4 A B C
Created on 2021-05-11 by the reprex package (v2.0.0)
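Note that rows_update() errors if ds2 contains an id that is absent from ds1. If that can happen, rows_upsert() inserts such rows instead. A sketch (columns missing from ds2, like d3, become NA in inserted rows):
library(dplyr)
rows_upsert(x = ds1, y = ds2, by = "id")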
This is somewhat similar to the one posted by my dear friend @AnilGoyal, and a little more verbose than yours, but you can use it for larger data sets:
library(dplyr)
library(stringr)
ds1 %>%
  left_join(ds2, by = "id") %>%
  mutate(across(ends_with(".x"),
                ~ ifelse(!is.na(get(str_replace(cur_column(), ".x", ".y"))),
                         get(str_replace(cur_column(), ".x", ".y")),
                         .x))) %>%
  select(!ends_with(".y")) %>%
  rename_with(~ str_remove(., ".x"), ends_with(".x"))
id d1 d2 d3
1 1 W X C
2 2 W X C
3 3 A B C
4 4 A B C
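Since the question is open to non-dplyr solutions, a data.table update join is another option. A sketch (it modifies ds1 by reference):
library(data.table)
setDT(ds1)
setDT(ds2)
# shared value columns, i.e. everything both tables have except the key
cols <- setdiff(intersect(names(ds1), names(ds2)), "id")
# overwrite those columns in ds1 with the matching values from ds2
ds1[ds2, (cols) := mget(paste0("i.", cols)), on = "id"]
ds1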

Merging rows while defining different parameters for each row being combined in R

I have a dataframe with several parameters in each row. I'd like to merge groups of rows, using a different aggregation for each column.
Here is my sample data ZZ:
ZZ <- data.frame(Name = c("A", "B", "C", "D", "E", "F"),
                 A1 = c(19, 20, 21, 23, 45, 67),
                 A2 = c(1, 2, 3, 4, 5, 6),
                 A3 = c(7, 8, 13, 24, 88, 90),
                 x = c(4, 5, 6, 8, 23, 16),
                 y = c(-3, -7, -6, -9, 3, 2))
> ZZ
Name A1 A2 A3 x y
1 A 19 1 7 4 -3
2 B 20 2 8 5 -7
3 C 21 3 13 6 -6
4 D 23 4 24 8 -9
5 E 45 5 88 23 3
6 F 67 6 90 16 2
I want to aggregate rows A, B, C and rows D, E, F such that a new name is defined for each group (e.g. C1 and C2), columns A1, A2 and A3 are combined by sum, and x and y by mean.
How can this be done please? The result should be:
> ZZ2
Name A1 A2 A3 x y
1 C1 60 6 28 5.000 -5.333
2 C2 135 15 202 15.667 -1.333
Based on how I interpreted your question, I believe this should give you what you want using dplyr:
library(dplyr)
result <- ZZ %>%
  mutate(Name = ifelse(Name %in% c("A", "B", "C"), "C1", "C2")) %>%
  group_by(Name) %>%
  summarise(A1 = sum(A1), A2 = sum(A2), A3 = sum(A3), x = mean(x), y = mean(y)) %>%
  ungroup()
Depending on how many rows you have with different names, there might be better alternatives for mutating the Name variable into the two groups; one is sketched after the edit below.
EDIT: an example if four groups exist:
result <- ZZ %>%
  mutate(Name = case_when(Name %in% c("A", "B", "C") ~ "C1",
                          Name %in% c("D", "E") ~ "C2",
                          Name %in% c("F", "G") ~ "C3",
                          Name %in% c("H", "I") ~ "C4")) %>%
  group_by(Name) %>%
  summarise(A1 = sum(A1), A2 = sum(A2), A3 = sum(A3), x = mean(x), y = mean(y)) %>%
  ungroup()
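With many groups, a lookup table joined onto the data can be easier to maintain than a long case_when. A sketch (the lookup data frame here is made up for illustration):
library(dplyr)
lookup <- data.frame(Name = c("A", "B", "C", "D", "E", "F"),
                     Group = rep(c("C1", "C2"), each = 3))
ZZ %>%
  left_join(lookup, by = "Name") %>%
  group_by(Name = Group) %>%
  summarise(across(A1:A3, sum), across(c(x, y), mean)) %>%
  ungroup()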

Remove exact rows and frequency of rows of a data.frame that are in another data.frame in r

Consider the following two data.frames:
a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)])
a2 <- data.frame(A = c(1:3, 2), B = letters[c(1:3, 2)])
I would like to remove the exact rows of a1 that are in a2 so that the result should be:
A B
4 d
5 e
4 d
2 b
Note that one row with 2 b in a1 is retained in the final result. Currently, I use a looping statement, which becomes extremely slow as I have many variables and thousands of rows in my data.frames. Is there any built-in function to get this result?
The idea is to add a counter for duplicates to each table, so you get a unique match for each occurrence of a row. data.table is nice here because it makes it easy to count the duplicates (with .N), and it also provides the necessary function (fsetdiff) for set operations.
library(data.table)
a1 <- data.table(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)])
a2 <- data.table(A = c(1:3,2), B = letters[c(1:3,2)])
# add counter for duplicates
a1[, i := 1:.N, .(A,B)]
a2[, i := 1:.N, .(A,B)]
# fsetdiff computes the set difference
# "all = TRUE" allows duplicate rows to be returned
fsetdiff(a1, a2, all = TRUE)
# A B i
# 1: 4 d 1
# 2: 5 e 1
# 3: 4 d 2
# 4: 2 b 3
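One caveat: fsetdiff() requires both tables to have identical columns, which is why the counter i is added to both. Drop it afterwards if it isn't wanted; a sketch:
fsetdiff(a1, a2, all = TRUE)[, i := NULL][]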
You could use dplyr to do this. I set stringsAsFactors = FALSE to get rid of warnings about factor mismatches.
library(dplyr)
a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)], stringsAsFactors = FALSE)
a2 <- data.frame(A = c(1:3,2), B = letters[c(1:3,2)], stringsAsFactors = FALSE)
## Make temp variables to join on then delete later.
# Create a row number
a1_tmp <- a1 %>%
  group_by(A, B) %>%
  mutate(tmp_id = row_number()) %>%
  ungroup()
# Create a count
a2_tmp <- a2 %>%
  group_by(A, B) %>%
  summarise(count = n()) %>%
  ungroup()
## Keep all rows that have no entry in a2, or where the id exceeds the count (i.e. the a2 entries are used up).
left_join(a1_tmp, a2_tmp, by = c('A', 'B')) %>%
  ungroup() %>%
  filter(is.na(count) | tmp_id > count) %>%
  select(-tmp_id, -count)
## # A tibble: 4 x 2
## A B
## <dbl> <chr>
## 1 4 d
## 2 5 e
## 3 4 d
## 4 2 b
EDIT
Here is a similar solution that is a little shorter. It does the following: (1) adds a row-number column to join both data.frames on; (2) adds a temporary column in a2 (the second data.frame) that will show up as NA in the join to a1, indicating a row unique to a1.
library(dplyr)
left_join(a1 %>% group_by(A, B) %>% mutate(rn = row_number()) %>% ungroup(),
          a2 %>% group_by(A, B) %>% mutate(rn = row_number(), tmpcol = 0) %>% ungroup(),
          by = c('A', 'B', 'rn')) %>%
  filter(is.na(tmpcol)) %>%
  select(-tmpcol, -rn)
## # A tibble: 4 x 2
## A B
## <dbl> <chr>
## 1 4 d
## 2 5 e
## 3 4 d
## 4 2 b
I think this solution is a little simpler (perhaps very little) than the first.
I guess this is similar to DWal's solution, but in base R:
a1_temp = Reduce(paste, a1)
a1_temp = paste(a1_temp, ave(seq_along(a1_temp), a1_temp, FUN = seq_along))
a2_temp = Reduce(paste, a2)
a2_temp = paste(a2_temp, ave(seq_along(a2_temp), a2_temp, FUN = seq_along))
a1[!a1_temp %in% a2_temp,]
# A B
#4 4 d
#5 5 e
#7 4 d
#8 2 b
Here's another solution with dplyr:
library(dplyr)
a1 %>%
  arrange(A) %>%
  group_by(A) %>%
  filter(!(paste0(1:n(), A, B) %in% with(arrange(a2, A), paste0(1:n(), A, B))))
Result:
# A tibble: 4 x 2
# Groups: A [3]
A B
<dbl> <fctr>
1 2 b
2 4 d
3 4 d
4 5 e
This way of filtering avoids creating extra unwanted columns that you have to later remove in the final output. This method also sorts the output. Not sure if it's what you want.
