Apply a function between two lists of data frames - r

I have the following data example and code:
lt1 <- list(df1 <- data.frame(V1 = c("a", "b"),
V2 = c("b", "c"),
V3 = c(1, 2)),
df2 <- data.frame(V1 = c("x", "y"),
V2 = c("x", "z"),
V3 = c(1, 2)))
lvls_func <- function(x) {
x[1:2] %>%
unlist() %>%
unique() %>%
sort()
}
lt_lvls <- lapply(lt1, lvls_func)
complete_func <- function(x) {
tidyr::complete(x[1] = factor(x[1], levels = lt_lvls),
x[2] = factor(x[2], levels = lt_lvls),
x[3] = x[3],
fill = list(x[3] = 0))
}
lt1_final <- lapply(lt1, complete_func)
I am having difficulty building my complete_func().
I am getting this error when I run my complete_func():
Error: unexpected '=' in:
"complete_func <- function(x) {
tidyr::complete(x[1] ="
In my final list lt1_final I expect this output:
lt1_final <- list(df1 <- data.frame(V1 = c("a", "b", "a", "a", "b", "b", "c", "c", "c"),
V2 = c("b", "c", "a", "c", "b", "a", "a", "b", "c"),
V3 = c(1, 2, 0, 0, 0, 0, 0, 0, 0)),
df2 <- data.frame(V1 = c("x", "y", "x", "x", "y", "y", "z", "z", "z"),
V2 = c("x", "z", "y", "z", "y", "x", "z", "x", "y"),
V3 = c(1, 2, 0, 0, 0, 0, 0, 0, 0)))
Thanks for any help.

Because lt_lvls is a list of levels (one element per data frame), the function has to be applied to pairs of elements, either with Map (from base R) or with purrr::map2.
In addition, rewrite the function to make use of across. The function needs several changes:
Add an argument lvls to the function.
Convert columns 1 to 2 to factor by looping with across inside mutate, passing lvls as the levels.
Apply complete on the subset of columns using splicing (!!!) (invoke/exec would also work), and specify fill as a named list with dplyr::lst (or a regular list with setNames).
library(dplyr)
library(tidyr)
library(purrr)
complete_func <- function(x, lvls) {
x %>%
dplyr::mutate(across(1:2, ~ factor(.x, levels = lvls))) %>%
tidyr::complete(!!! .[1:2], fill = dplyr::lst(!! names(.)[3] := 0)) %>%
arrange(across(3, ~ .x == 0))
}
Testing:
map2(lt1, lt_lvls, ~ complete_func(.x, .y))
[[1]]
# A tibble: 9 × 3
V1 V2 V3
<fct> <fct> <dbl>
1 a b 1
2 b c 2
3 a a 0
4 a c 0
5 b a 0
6 b b 0
7 c a 0
8 c b 0
9 c c 0
[[2]]
# A tibble: 9 × 3
V1 V2 V3
<fct> <fct> <dbl>
1 x x 1
2 y z 2
3 x y 0
4 x z 0
5 y x 0
6 y y 0
7 z x 0
8 z y 0
9 z z 0
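For comparison, the Map route mentioned above can be sketched in base R alone. This is a minimal sketch, not the answer's method: complete_base is a hypothetical helper name, and it assumes the columns are literally called V1, V2 and V3 and are stored as character (hence stringsAsFactors = FALSE).

```r
# Example data from the question, with character columns
lt1 <- list(
  data.frame(V1 = c("a", "b"), V2 = c("b", "c"), V3 = c(1, 2),
             stringsAsFactors = FALSE),
  data.frame(V1 = c("x", "y"), V2 = c("x", "z"), V3 = c(1, 2),
             stringsAsFactors = FALSE)
)

# One vector of levels per data frame: sorted unique values of columns 1 and 2
lt_lvls <- lapply(lt1, function(x) sort(unique(as.character(unlist(x[1:2])))))

# Hypothetical helper: build every V1/V2 combination, left-join the observed
# rows onto it, and fill the combinations that were not observed with 0
complete_base <- function(x, lvls) {
  grid <- expand.grid(V1 = lvls, V2 = lvls, stringsAsFactors = FALSE)
  out <- merge(grid, x, by = c("V1", "V2"), all.x = TRUE)
  out$V3[is.na(out$V3)] <- 0
  out
}

# Map walks over both lists in parallel, like purrr::map2
lt1_base <- Map(complete_base, lt1, lt_lvls)
```

Note that merge sorts the result by V1 and V2, so the row order differs from the tidyr output, which puts the observed rows first.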

Related

Duplicating rows conditionally in R

I would like to duplicate each observation based on the count. For example:
If count == 3, duplicate the observation three times but replacing the count with 1 each time.
If count == 1, no changes are required.
# Sample data
df <- tibble(
x = c("A", "C", "C", "B", "C", "A", "A"),
y = c("Y", "N", "Y", "N", "N", "N", "Y"),
count = c(1, 1, 3, 2, 1, 1, 1)
)
# Target output
df <- tibble(
x = c("A", "C", "C", "C", "C", "B", "B", "C", "A", "A"),
y = c("Y", "N", "Y", "Y", "Y", "N", "N", "N", "N", "Y"),
count = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
)
Using dplyr and tidyr, repeat each row count times with uncount() and then reset count to 1:
library(dplyr)
library(tidyr)
df %>%
uncount(count, .remove = FALSE) %>%
mutate(count = 1)
The output is
x y count
<chr> <chr> <dbl>
1 A Y 1
2 C N 1
3 C Y 1
4 C Y 1
5 C Y 1
6 B N 1
7 B N 1
8 C N 1
9 A N 1
10 A Y 1
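The same expansion can be sketched in base R, without tidyr. This is only an illustration on the question's sample data (stored here as a plain data.frame rather than a tibble): rep() repeats each row index count times, and the count column is then overwritten with 1.

```r
# Sample data from the question
df <- data.frame(
  x = c("A", "C", "C", "B", "C", "A", "A"),
  y = c("Y", "N", "Y", "N", "N", "N", "Y"),
  count = c(1, 1, 3, 2, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Repeat row i exactly count[i] times
expanded <- df[rep(seq_len(nrow(df)), times = df$count), ]

# Every duplicated observation now stands for a single count
expanded$count <- 1

# Drop the row names that carry over from the original rows (3, 3.1, 3.2, ...)
rownames(expanded) <- NULL
```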

Find rows in a data frame where certain columns are duplicated, then combine the elements in other columns [duplicate]

I have one data frame. I want to find the rows where both columns A and B are duplicated, and then combine those rows by combining the elements in column C.
My example:
DF = cbind.data.frame(A = c(1, 1, 2, 3, 3),
B = c("a", "b", "a", "c", "c"),
C = c("M", "N", "X", "M", "N"))
My expected result:
DFE = cbind.data.frame(A = c(1, 1, 2, 3),
B = c("a", "b", "a", "c"),
C = c("M", "N", "X", "M; N"))
Thanks a lot
Without packages:
DF <- aggregate(C ~ A + B, FUN = function(x) paste(x, collapse = "; "), data = DF)
Output:
A B C
1 1 a M
2 2 a X
3 1 b N
4 3 c M; N
Or with data.table:
library(data.table)
setDT(DF)[, .(C = paste(C, collapse = "; ")), by = .(A, B)]
This is a tidyverse based solution where you can use paste with collapse after grouping it.
library(dplyr)
DF = cbind.data.frame(A = c(1, 1, 2, 3, 3),
B = c("a", "b", "a", "c", "c"),
C = c("M", "N", "X", "M", "N"))
DFE = cbind.data.frame(A = c(1, 1, 2, 3),
B = c("a", "b", "a", "c"),
C = c("M", "N", "X", "M; N"))
DF %>%
group_by(A,B) %>%
summarise(C = paste(C, collapse = "; "))
#> # A tibble: 4 x 3
#> # Groups: A [3]
#> A B C
#> <dbl> <fct> <chr>
#> 1 1 a M
#> 2 1 b N
#> 3 2 a X
#> 4 3 c M; N
Created on 2019-03-19 by the reprex package (v0.2.1)
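The answers above collapse every (A, B) group, which leaves single-row groups unchanged. If you only want to see which rows share an (A, B) pair before collapsing, a small base R sketch is shown below; the variable names key and is_dup are made up for this illustration.

```r
DF <- data.frame(A = c(1, 1, 2, 3, 3),
                 B = c("a", "b", "a", "c", "c"),
                 C = c("M", "N", "X", "M", "N"),
                 stringsAsFactors = FALSE)

# Only columns A and B decide whether two rows count as duplicates
key <- DF[c("A", "B")]

# duplicated() flags repeats after the first occurrence; combining it with
# fromLast = TRUE also flags the first occurrence of each repeated pair
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Rows whose (A, B) pair occurs more than once
DF[is_dup, ]
```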

dplyr mutate to replace specific values in a data frame

I have a data frame that consists of characters "a", "b", "x", "y".
df <- data.frame(v1 = c("a", "b", "x", "y"),
v2 = c("a", "b", "a", "y"))
Now I want to replace all values with the following scheme and also convert the whole data frame to numeric.
"a" -> 0
"b" -> 1
"x" -> 1
"y" -> 2
I know this must be somehow possible with mutate_all but I cannot figure out how
df %>% mutate_all(replace("a", 1)) %>%
mutate_all(is.character, as.numeric)
One solution could be with case_when:
df %>%
# funs() is deprecated since dplyr 0.8.0; a formula lambda does the same job
mutate_all(~ case_when(. == "a" ~ 0,
. %in% c("b", "x") ~ 1,
. == "y" ~ 2,
TRUE ~ NA_real_))
# v1 v2
# 1 0 0
# 2 1 1
# 3 1 0
# 4 2 2
Create a named vector with the mappings and then index it inside mutate_all:
vec <- c(a = 0, b = 1, x = 1, y = 2)
library(dplyr)
df %>% mutate_all(~vec[.])
# v1 v2
#1 0 0
#2 1 1
#3 1 0
#4 2 2
In base R that would be just
df[] <- vec[unlist(df)]
Data:
df <- data.frame(v1 = c("a", "b", "x", "y"),
v2 = c("a", "b", "a", "y"), stringsAsFactors = FALSE)
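The named-vector lookup relies on the columns being character, which is why the data is built with stringsAsFactors = FALSE. Indexing a vector with a factor uses the factor's integer codes rather than its labels, so factor columns would silently give wrong values. A defensive sketch that works for both column types, assuming the same mapping, converts each column with as.character first:

```r
# Same data, but deliberately stored as factors to show the defensive version
df <- data.frame(v1 = c("a", "b", "x", "y"),
                 v2 = c("a", "b", "a", "y"),
                 stringsAsFactors = TRUE)

# Mapping from the question
vec <- c(a = 0, b = 1, x = 1, y = 2)

# Look up by label, not by factor code, and drop the names from the result
df[] <- lapply(df, function(col) unname(vec[as.character(col)]))
```

Assigning into df[] keeps the data frame structure while replacing every column.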

How to convert tidy hierarchical data frame to hierarchical list grid in R?

This is a more complex version of a previous question, "R convert tidy hierarchical data frame to hierarchical list", where I had abstracted the actual problem too much to apply the answers.
I've converted a hierarchical data frame with two grouping levels into a hierarchical list-grid using a for loop.
Is there a more efficient base R, tidyverse or other approach to achieve this?
In the real dataset:
The grouping variables and description are multi word strings.
The description preface - d# - is in the MWE for ease of checking.
There are 14 associated variables variously of type: character, integer and double
Rules
Group 1 and Group 2 headings to be in description column
Group 1 headings to appear once only
Group 2 headings are children of group 1 headings, and only change when there is a new group 2 heading
Descriptions are children of group 2 headings
From this
g1 g2 desc var1 var2 var3
A a d1 KS3 0.0500 2 PLs
A a d2 CTI 0.0500 9 7O0
A b d3 b8x 0.580 5 he2
A b d4 XOf 0.180 12 XJE
A b d5 ygn 0.900 11 v48
A c d6 dGY 0.770 6 UcH
A d d7 jpG 0.600 4 P5M
B d d8 Z95 0.600 10 j6O
To this
desc var1 var2 var3
A
a
d1 KS3 0.0500 2 PLs
d2 CTI 0.0500 9 7O0
b
d3 b8x 0.580 5 he2
d4 XOf 0.180 12 XJE
d5 ygn 0.900 11 v48
c
d6 dGY 0.770 6 UcH
d
d7 jpG 0.600 4 P5M
B
d
d8 Z95 0.600 10 j6O
Code
library(tidyverse)
library(stringi)
set.seed(2018)
tib <- tibble(g1 = c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "C"),
g2 = c("a", "a", "b", "b", "b", "c", "d", "d", "b", "b", "e", "e"),
desc = paste0("d", 1:12, " ", stri_rand_strings(12, 3)),
var1 = round(runif(12), 2),
var2 = sample.int(12),
var3 = stri_rand_strings(12, 3))
tib
# Number of rows in final table
n_rows <- length(unique(tib$g1)) + length(unique(paste0(tib$g1, tib$g2))) + nrow(tib)
# create empty output tibble
output <-
as_tibble(matrix(nrow = n_rows, ncol = ncol(tib)-1)) %>%
rename(id = V1, desc = V2, var1 = V3, var2 = V4, var3 = V5) %>%
mutate(id = NA_character_,
desc = NA_character_,
var1 = NA_real_,
var2 = NA_integer_,
var3 = NA_character_)
# Loop counters
level_1 <- 0
level_2 <- 0
output_row <- 1
for(i in seq_len(nrow(tib))){
# level 1 headings
if(tib$g1[[i]] != level_1) {
output$id[[output_row]] <- "g1"
output$desc[[output_row]] <- tib$g1[[i]]
output_row <- output_row + 1
}
# level 2 headings
if(paste0(tib$g1[[i]], tib$g2[[i]]) != paste0(level_1, level_2)) {
output$id[[output_row]] <- "g2"
output$desc[[output_row]] <- tib$g2[[i]]
output_row <- output_row + 1
}
level_1 <- tib$g1[[i]]
level_2 <- tib$g2[[i]]
# Description and data grid
output$desc[[output_row]] <- tib$desc[[i]]
output$var1[[output_row]] <- tib$var1[[i]]
output$var2[[output_row]] <- tib$var2[[i]]
output$var3[[output_row]] <- tib$var3[[i]]
output_row <- output_row + 1
}
output
Adapting tyluRp's answer to "R convert tidy hierarchical data frame to hierarchical list", I've hit on a solution.
library(tidyverse)
library(stringi)
set.seed(2018)
tib <- tibble(g1 = c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "C"),
g2 = c("a", "a", "b", "b", "b", "c", "d", "d", "b", "b", "e", "e"),
desc = paste0("d", 1:12, " ", stri_rand_strings(12, 3)),
var1 = round(runif(12), 2),
var2 = sample.int(12),
var3 = stri_rand_strings(12, 3))
# add unique identifier for description and variable rows
tib <-
tib %>%
rowid_to_column() %>%
mutate(rowid = paste0("z_", rowid))
# separate tibble for variables associated with descriptions
tib_var <-
tib %>%
select(rowid, var1, var2, var3)
# code adapted from tyluRp to reorder the data and add description variables
tib <-
tib %>%
select(g1, g2, desc, rowid) %>%
mutate(g2 = paste(g1, g2, sep = "_")) %>%
transpose() %>%
unlist() %>%
stack() %>%
distinct(values, ind) %>%
mutate(detect_var = str_detect(values, "^z_"),
ind = lead(case_when(detect_var == TRUE ~ values)),
values = case_when(detect_var == TRUE ~ NA_character_,
TRUE ~ values))%>%
drop_na(values) %>%
select(values, ind) %>%
mutate(values = str_remove(values, "\\D_")) %>%
left_join(tib_var, by = c("ind" = "rowid")) %>%
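select(-ind) %>%
# convert the numeric columns to character so that "" is a valid replacement
mutate(across(c(var1, var2), as.character)) %>%
replace_na(list(var1 = "", var2 = "", var3 = ""))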
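A loop-free variant of the original for loop can also be sketched with dplyr alone. This is one possible approach under stated assumptions, not the method used above: it works on a reduced four-row example, it assumes no group label is the empty string (that value is used as the lag() default), and the helper columns i, ord, new_g1 and new_g2 are made up for the illustration. Heading rows and data rows are built separately, tagged with a sort key, and interleaved with arrange().

```r
library(dplyr)

# Reduced example with the same shape as the question's data
tib <- tibble(
  g1   = c("A", "A", "A", "B"),
  g2   = c("a", "a", "b", "b"),
  desc = c("d1", "d2", "d3", "d4"),
  var1 = c(0.05, 0.05, 0.58, 0.18)
)

# Flag the rows where a new group 1 or group 2 heading has to be emitted
rows <- tib %>%
  mutate(i = row_number(),
         new_g1 = g1 != lag(g1, default = ""),
         new_g2 = new_g1 | g2 != lag(g2, default = ""))

# ord places headings before the data row they belong to:
# 0 = group 1 heading, 1 = group 2 heading, 2 = description row
output <- bind_rows(
  rows %>% filter(new_g1) %>% transmute(i, ord = 0L, id = "g1", desc = g1),
  rows %>% filter(new_g2) %>% transmute(i, ord = 1L, id = "g2", desc = g2),
  rows %>% transmute(i, ord = 2L, id = NA_character_, desc, var1)
) %>%
  arrange(i, ord) %>%
  select(-i, -ord)
```

bind_rows fills var1 with NA on the heading rows, so the numeric columns keep their types.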

Merging two data frames with different sizes by matching their columns

I am trying to "merge" column V of one data frame into another one where the columns X and Y are equal (I have to match dOne.X == dTwo.X & dOne.Y == dTwo.Y and also dOne.X == dTwo.Y & dOne.Y == dTwo.X).
I solved this using a for loop, but it is slow when the data frame dOne is big (on my machine it takes 25 minutes if length(dOne.X) == 500000). I would like to know if there is a way to solve this problem using a faster "vectorized" operation. Below is an example of what I want to do:
Data Frame ONE
X Y V
a b 2
a c 3
a d 0
a e 0
b c 2
b d 3
b e 0
c d 2
c e 0
d e 0
Data Frame TWO
X Y V
a b 1
a c 1
a d 1
b c 1
b d 1
c d 1
e d 1
Expected Data Frame after the columns are merged
X Y V V2
a b 2 1
a c 3 1
a d 0 1
a e 0 0
b c 2 1
b d 3 1
b e 0 0
c d 2 1
c e 0 0
d e 0 1
This is the code I am using so far, which is slow when dOne is big (hundreds of thousands of rows):
copyadjlistValueColumn <- function(dOne, dTwo) {
dOne$V2 <- 0
lv <- union(levels(dOne$Y), levels(dOne$X))
dTwo$X <- factor(dTwo$X, levels = lv)
dTwo$Y <- factor(dTwo$Y, levels = lv)
dOne$X <- factor(dOne$X, levels = lv)
dOne$Y <- factor(dOne$Y, levels = lv)
for(i in 1:nrow(dTwo)) {
row <- dTwo[i,]
dOne$V2[dOne$X == row$X & dOne$Y == row$Y] <- row$V
dOne$V2[dOne$X == row$Y & dOne$Y == row$X] <- row$V
}
dOne
}
This is a testthat test case that covers what I am expecting (using the data frames above):
test_that("Copy V column to another Data Frame", {
dfOne <- data.frame(X=c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d"),
Y=c("b", "c", "d", "e", "c", "d", "e", "d", "e", "e"),
V=c(2, 3, 0, 0, 2, 3, 0, 2, 0, 0))
dfTwo <- data.frame(X=c("a", "a", "a", "b", "b", "c", "e"),
Y=c("b", "c", "d", "c", "d", "d", "d"),
V=c(1, 1, 1, 1, 1, 1, 1))
lv <- union(levels(dfTwo$Y), levels(dfTwo$X))
dfExpected <- data.frame(X=c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d"),
Y=c("b", "c", "d", "e", "c", "d", "e", "d", "e", "e"),
V=c(2, 3, 0, 0, 2, 3, 0, 2, 0, 0),
V2=c(1, 1, 1, 0, 1, 1, 0, 1, 0, 1))
dfExpected$X <- factor(dfExpected$X, levels = lv)
dfExpected$Y <- factor(dfExpected$Y, levels = lv)
dfMerged <- copyadjlistValueColumn(dfOne, dfTwo)
expect_identical(dfMerged, dfExpected)
})
Any suggestion?
Thanks a lot :)
Try two merges, where the order of the matching columns is reversed in the second, to get the 'bidirectional' matching. Then you may use e.g. rowSums to collapse the two created columns into one.
d1 <- merge(dfOne, dfTwo, by.x = c("X", "Y"), by.y = c("X", "Y"), all.x = TRUE)
d2 <- merge(d1, dfTwo, by.x = c("X", "Y"), by.y = c("Y", "X"), all.x = TRUE)
cbind(dfOne, V2 = rowSums(cbind(d2$V.y, d2$V), na.rm = TRUE))
# X Y V V2
# 1 a b 2 1
# 2 a c 3 1
# 3 a d 0 1
# 4 a e 0 0
# 5 b c 2 1
# 6 b d 3 1
# 7 b e 0 0
# 8 c d 2 1
# 9 c e 0 0
# 10 d e 0 1
For faster alternatives to merge, check data.table and dplyr alternatives here: stackoverflow.com/questions/1299871/how-to-join-data-frames-in-r-inner-outer-left-right/
Here's a possible data.table package approach. This approach should be particularly efficient for a big data set like you have:
First convert to data.table object and add keys
library(data.table)
setkey(setDT(dfOne), X, Y)
setkey(setDT(dfTwo), X, Y)
Then perform a join on X & Y combination - the join is performed by matching key columns X,Y of dfOne with key columns X,Y of dfTwo respectively.
dfOne[dfTwo, V2 := i.V]
Now perform a join on Y & X combination - the join is performed by matching key columns X,Y of dfOne with key columns Y,X of dfTwo respectively.
setkey(dfTwo, Y, X)
dfOne[dfTwo, V2 := i.V][]
Result (I'll keep the unmatched as NAs instead of zeroes as it makes more sense this way):
# X Y V V2
# 1: a b 2 1
# 2: a c 3 1
# 3: a d 0 1
# 4: a e 0 NA
# 5: b c 2 1
# 6: b d 3 1
# 7: b e 0 NA
# 8: c d 2 1
# 9: c e 0 NA
# 10: d e 0 1
With dplyr:
library(dplyr)
left_join(dfOne, dfTwo, by = c("X", "Y")) %>%
left_join(dfTwo, by = c("X" = "Y", "Y" = "X")) %>%
mutate(V2 = coalesce(V.y, V, 0)) %>%
select(X, Y, V = V.x, V2)
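Another vectorized option, sketched here in base R, avoids the double join by building an order-independent key for each pair and looking it up with match(). The helper pair_key is hypothetical, and the sketch assumes character columns, that "|" never occurs in X or Y, and that each unordered pair appears at most once in dfTwo (match() returns the first hit).

```r
dfOne <- data.frame(X = c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d"),
                    Y = c("b", "c", "d", "e", "c", "d", "e", "d", "e", "e"),
                    V = c(2, 3, 0, 0, 2, 3, 0, 2, 0, 0),
                    stringsAsFactors = FALSE)
dfTwo <- data.frame(X = c("a", "a", "a", "b", "b", "c", "e"),
                    Y = c("b", "c", "d", "c", "d", "d", "d"),
                    V = c(1, 1, 1, 1, 1, 1, 1),
                    stringsAsFactors = FALSE)

# Hypothetical helper: put the smaller label first so that (e, d) and (d, e)
# produce the same key
pair_key <- function(x, y) paste(pmin(x, y), pmax(x, y), sep = "|")

# Position in dfTwo of each dfOne pair, NA where there is no match
idx <- match(pair_key(dfOne$X, dfOne$Y), pair_key(dfTwo$X, dfTwo$Y))

# Copy V from the matched row, using 0 for pairs missing from dfTwo
dfOne$V2 <- ifelse(is.na(idx), 0, dfTwo$V[idx])
```

Because match() is a single hashed lookup, this should scale far better than the row-by-row loop, though it has not been timed on the 500000-row case.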
