dplyr: Build set of items in list column

I want a column that tracks which items are included in a set based on a predicate. It seems like I should be able to do this with some combination of the purrr accumulate function and the dplyr lead/lag and union/setdiff functions.
This is probably best expressed as a reprex:
input_df <- dplyr::data_frame(user = c("1", "1", "1", "1"),
                              item = c("a", "b", "a", "a"),
                              include = c(TRUE, TRUE, FALSE, TRUE))
output_df <- dplyr::data_frame(user = c("1", "1", "1", "1"),
                               set = list(
                                 c("a"),
                                 c("a", "b"),
                                 c("b"),
                                 c("a", "b")))
Edit: I'm very close. I need to find a way of finding the "bag difference" (instead of the set difference) between vectors in case a user includes, excludes and then re-includes an item.
library(dplyr)
library(tidyr)   # for fill()
library(purrr)   # for map2(); accumulate() is namespaced below

numbered_input_df <- input_df %>%
  mutate(id = row_number())

include_df <- numbered_input_df %>%
  filter(include == TRUE) %>%
  mutate(include_set = purrr::accumulate(item, c)) %>%
  select(user, id, include_set)

exclude_df <- numbered_input_df %>%
  filter(include == FALSE) %>%
  mutate(exclude_set = purrr::accumulate(item, c)) %>%
  select(user, id, exclude_set)

numbered_input_df %>%
  left_join(include_df) %>%
  left_join(exclude_df) %>%
  fill(include_set, exclude_set) %>%
  mutate(set = map2(include_set, exclude_set, ~ .x[! .x %in% .y]))
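The "bag difference" itself can be written directly; here is a minimal sketch (the helper name bag_diff is mine, not from the post), which removes each excluded element at most once, so an item that is included, excluded, and re-included survives:
bag_diff <- function(x, y) {
  # drop each element of y from x at most once, keeping other duplicates in x
  for (el in y) {
    hit <- match(el, x)
    if (!is.na(hit)) x <- x[-hit]
  }
  x
}

bag_diff(c("a", "b", "a"), "a")  # "b" "a"
Such a helper could slot into the map2() call above in place of ~ .x[! .x %in% .y].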

Define Update which takes the union or setdiff of the basket with the ith item and use Reduce to apply it to each i. Use ave to do all that by user. No packages are used.
Update <- function(basket, i) with(input_df[i, ],
  (if (include) union else setdiff)(basket, item)
)

n <- nrow(input_df)
reduce_user <- function(ix) Reduce(Update, ix, init = NULL, accumulate = TRUE)[-1]
transform(input_df["user"], set = I(ave(as.list(1:n), user, FUN = reduce_user)))
giving:
  user    set
1    1      a
2    1   a, b
3    1      b
4    1   b, a
Alternatively, translating the above to dplyr and purrr, and making use of Update from above, we get the code below.
library(dplyr)
library(purrr)
input_df %>%
  mutate(ix = 1:n()) %>%
  group_by(user) %>%
  mutate(set = accumulate(ix, Update, .init = NULL)[-1]) %>%
  ungroup() %>%
  select(user, set)
(Note that the only use of purrr is accumulate and that could easily be replaced with Reduce if you want to reduce dependencies.)
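For instance, the grouped pipeline might be rewritten with base Reduce like this (a sketch of the substitution the note hints at, reusing Update from above):
library(dplyr)

input_df %>%
  mutate(ix = 1:n()) %>%
  group_by(user) %>%
  mutate(set = Reduce(Update, ix, init = NULL, accumulate = TRUE)[-1]) %>%
  ungroup() %>%
  select(user, set)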

Related

Vectorization to extract and bind very nested data

I have some very nested data. Within my list-column-dataframes, there are some pieces I need to put together and I've done so in a single instance to get my desired dataframe:
a <- df[[2]][["result"]]@data
b <- df[[2]][["result"]]@coords
desired_df <- cbind(a, b)
My original large list has 171 elements, meaning I have 1:171 (3.3 GB) to go inside those square brackets, and would ideally end up with 171 desired dataframes (which I would then bind all together).
I haven't needed to write a loop in 10 years, but I don't see a tidyverse way to deal with this, and I no longer remember how to write loops. There are definitely some elements in there that are junk and will fail.
You haven't provided any sort of minimal example of the data, so I've condensed it to something like this:
base_data <- data.frame(group = c("a", "b", "c"),
                        var1 = c(3, 1, 2),
                        var2 = c(2, 4, 8))

base_data2 <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                     nrow = 3,
                     ncol = 3,
                     byrow = TRUE)
rownames(base_data2) <- c("d", "e", "f")

methods::setClass(
  "weird_object",
  slots = c(data = "data.frame", coords = "matrix"),
  prototype = list(data = base_data, coords = base_data2)
)

df <- list(
  list(result = new("weird_object")),
  list(result = new("weird_object")),
  list(result = new("weird_object")),
  list(result = new("weird_object"))
)
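As a quick sanity check on one element (an illustrative aside, not from the original post), the slots are reached with @:
str(df[[1]]$result@data)    # the data.frame slot
str(df[[1]]$result@coords)  # the matrix slot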
And if I had such a list with these objects, then I could do
library(tidyverse)

df %>%
  map(. %>% {
    list(data = .$result@data,
         coords = .$result@coords)
  }) %>%
  enframe() %>%
  unnest_wider(value)
But the selecting / hoisting function might fail, so one can wrap it in purrr::possibly and choose a reasonable default:
df %>%
  map(possibly(. %>% {
        list(data = .$result@data,
             coords = .$result@coords)
      },
      otherwise = list(data = NA, coords = NA))) %>%
  enframe() %>%
  unnest_wider(value)
Hopefully, this could be a step forward.
Next step is probably something resembling this:
df %>%
  map(. %>% {
    list(data = .$result@data,
         coords = .$result@coords)
  }) %>%
  enframe() %>%
  unnest_wider(value) %>%
  mutate(coords = coords %>% map(. %>% as_tibble(rownames = "rowid"))) %>%
  unnest(cols = c(data, coords)) %>%
  # rotating the thing now
  pivot_longer(cols = c(group, rowid),
               names_to = "var_name",
               values_to = "var") %>%
  select(-var_name) %>%
  pivot_longer(cols = c(var1, var2, V1, V2, V3),
               names_to = "var_name") %>%
  pivot_wider(names_from = var, values_from = value) %>%
  identity()
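To end up with the single bound data frame the question asks for, a possible wrap-up (a sketch under the assumption that each valid element exposes result@data and result@coords as above; failing elements are simply dropped):
library(purrr)
library(dplyr)

df %>%
  map(possibly(~ cbind(.x$result@data, .x$result@coords), otherwise = NULL)) %>%
  compact() %>%   # drop the elements where extraction failed
  bind_rows()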
If I understand your data structure, which I probably don't, you could do:
library(tidyverse)
# Create dummy data
df <- mtcars
df$mpg <- list(result = I(list('test')))
df$mpg$result <- list("#data" = I(list('your data')))
df <- df %>% select(mpg, cyl)
df1 <- df
df2 <- df
# Pull data you're interested in.
# The index is 1 here, instead of 2, because it's fake data and not your data.
# Assuming the # is not unique, and is just parsed from JSON or some other format.
dont_at_me <- function(x) {
  a <- x[[1]][["result"]][["#data"]]
  a
}
# Get a list of all of your data.frames
all_dfs <- Filter(function(x) is(x, "data.frame"), mget(ls()))
# Vectorize
purrr::map(all_dfs, ~dont_at_me(.))

Is there a more efficient way to handle facts which are duplicating in an R dataframe?

I have a dataframe which looks like this:
ID <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
Fact <- c(233, 233, 233, 50, 50, 50, 50, 15, 15, 15, 15)
Overall_Category <- c("Purchaser", "Purchaser", "Purchaser", "Car", "Car", "Car",
                      "Car", "Car", "Car", "Car", "Car")
Descriptor <- c("Country", "Gender", "Eyes", "Color", "Financed", "Type",
                "Transmission", "Color", "Financed", "Type", "Transmission")
Members <- c("America", "Male", "Brown", "Red", "Yes", "Sedan", "Manual",
             "Blue", "No", "Van", "Automatic")
df <- data.frame(ID, Fact, Overall_Category, Descriptor, Members)
The dataframe's dimensions work like this:
There will always be an ID/key which singularly and uniquely identifies a submitted fact.
There will always be a dimension defining the Overall_Category to which a submitted fact belongs.
Most of the time - but not always - there will be a dimension for a "Descriptor".
If there is a "Descriptor" dimension for a given fact, there will be another "Members" dimension to show possible members within "Descriptor".
The problem is that a single submitted fact is duplicated for a given ID based on how many dimensions apply to the given fact. What I'd like is a way to show the fact only once, based on its ID, and have the applicable dimensions stored against that single ID.
I've achieved it by doing this:
library(dplyr)
library(tidyr)
library(stringr)

df1 <- pivot_wider(df,
                   id_cols = ID,
                   names_from = c(Overall_Category, Descriptor, Members),
                   names_prefix = "zzzz",
                   values_from = Fact,
                   names_sep = "-",
                   names_repair = "unique")
ColumnNames <- df1 %>% select(matches("zzzz")) %>% colnames()
df2 <- df1 %>% mutate(mean_sel = rowMeans(select(., ColumnNames), na.rm = TRUE))
df3 <- df2 %>% mutate_at(ColumnNames, function(x) ifelse(!is.na(x), deparse(substitute(x)), NA))
df3 <- df3 %>% unite('Descriptor', ColumnNames, na.rm = TRUE, sep = "_")
df3 <- df3 %>% mutate_at("Descriptor", str_replace_all, "zzzz", "")
But it seems like it wouldn't scale well for facts with many dimensions due to the pivot_wider, and in general it doesn't seem like a very efficient approach.
Is there a better way to do this?
You can unite the columns, combine them for each ID, and take the average of the Fact values.
library(dplyr)
library(tidyr)
df %>%
  unite(Descriptor, Overall_Category:Members, sep = '-', na.rm = TRUE) %>%
  group_by(ID) %>%
  summarise(Descriptor = paste0(Descriptor, collapse = '_'),
            mean_sel = mean(Fact, na.rm = TRUE))
# ID Descriptor mean_sel
# <dbl> <chr> <dbl>
#1 1 Purchaser-Country-America_Purchaser-Gender-Male_Purchas… 233
#2 2 Car-Color-Red_Car-Financed-Yes_Car-Type-Sedan_Car-Trans… 50
#3 3 Car-Color-Blue_Car-Financed-No_Car-Type-Van_Car-Transmi… 15
I think you want a simple paste with the sep and collapse arguments:
library(dplyr, warn.conflicts = F)
df %>%
  group_by(ID, Fact) %>%
  summarise(Descriptor = paste(paste(Overall_Category, Descriptor, Members, sep = '-'),
                               collapse = '_'),
            .groups = 'drop')
# A tibble: 3 x 3
ID Fact Descriptor
<dbl> <dbl> <chr>
1 1 233 Purchaser-Country-America_Purchaser-Gender-Male_Purchaser-Eyes-Brown
2 2 50 Car-Color-Red_Car-Financed-Yes_Car-Type-Sedan_Car-Transmission-Manual
3 3 15 Car-Color-Blue_Car-Financed-No_Car-Type-Van_Car-Transmission-Automatic
An option with str_c
library(dplyr)
library(stringr)
df %>%
  group_by(ID, Fact) %>%
  summarise(Descriptor = str_c(Overall_Category, Descriptor, Members,
                               sep = "-", collapse = "_"),
            .groups = 'drop')

Multiple ratios by column-wise division with dplyr following grouping

I have a df that needs to be grouped by multiple columns to subsequently calculate ratios for subsets of different columns, plus the row-wise means and standard deviations.
grouper1 grouper2 condition value
foo      baz      A         1
foo      baz      B         2
foo      oof      A         1
foo      oof      C         3
bar      zab      B         2
bar      zab      C         4
Based on this elegant answer I have managed to build a generic solution:
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
crossing(c("A"), c("B","C")) %>%
pmap(~ query %>%
group_by(grouper1, grouper2) %>%
summarise(!! str_c('ratio_', ..1, ..2) :=
value[condition == ..1]/value[condition == ..2])) %>%
reduce(full_join, by = c('grouper1', 'grouper2')) %>%
ungroup() %>% mutate(mean=rowMeans(select(.,-(grouper1, grouper2)), SD=unlist(pmap(select(.,-(grouper1, grouper2)), ~sd(c(...)))))
This works well if all the values in the condition column are found in all groups. If this is not the case, e.g. A is not present for grouper1 == "bar" in the example above, I will receive the following error:
Error: Column ratio_AC must be length 1 (a summary value), not 0
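To see where the error comes from (an illustrative aside, not part of the original post): the group that lacks a condition produces a zero-length subset, and summarise() refuses it.
df %>%
  filter(grouper1 == "bar") %>%
  with(value[condition == "A"])
# integer(0)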
I could obviously preselect the values for crossing, but this would require a filter on the df and I would lose generality. I would thus like a solution that simply ignores the missing combinations and still calculates the metrics.
One possible solution would be pivot_wider, but I cannot implement a working solution for calculating the ratios with it.
We could reshape to wide format with pivot_wider and then use that dataset
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
df1 <- df %>%
  pivot_wider(names_from = condition, values_from = value)

crossing(v1 = c("A"), v2 = c("B", "C")) %>%
  pmap(~ df1 %>%
         transmute(grouper1, grouper2,
                   !! str_c('ratio_', ..1, ..2) := .[[..1]] / .[[..2]])) %>%
  reduce(full_join, by = c('grouper1', 'grouper2')) %>%
  mutate(mean = rowMeans(select(., -grouper1, -grouper2), na.rm = TRUE),
         SD = pmap_dbl(select(., -grouper1, -grouper2),
                       ~ sd(c(...), na.rm = TRUE)))
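For reference, the intermediate df1 (computed from the data below) looks roughly like this; a condition missing from a group becomes an NA cell rather than a zero-length vector, which is why the summarise() error disappears:
df1
# grouper1 grouper2  A  B  C
#   foo      baz     1  2 NA
#   foo      oof     1 NA  3
#   bar      zab    NA  2  4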
data
df <- structure(list(grouper1 = c("foo", "foo", "foo", "foo", "bar",
"bar"), grouper2 = c("baz", "baz", "oof", "oof", "zab", "zab"
), condition = c("A", "B", "A", "C", "B", "C"), value = c(1L,
2L, 1L, 3L, 2L, 4L)), class = "data.frame", row.names = c(NA,
-6L))

using grep with count_if (EXPSS package in R)

I'm trying to count instances where a certain string appears in a dataframe (this will be a sub-string, i.e. "blue" will appear within a larger block of text), and then summarize those counts by another field. Here's the code:
totals_by_county <- county_data %>%
  group_by(county_data$county) %>%
  summarise(number_occurences = count(grepl('blue', county_data$color, ignore.case = TRUE)))
totals_by_county
And I get this error:
no applicable method for 'summarise_' applied to an object of class "logical"
Is there a way to do this in the method I'm trying to use above? Thanks in advance!
With grepl (dplyr's count() expects a data frame, so sum() the logical vector instead):
totals_by_county <- county_data %>%
  group_by(county) %>%
  summarise(number_occurences = sum(grepl('blue', color, ignore.case = TRUE)))
or, with count_if from expss:
totals_by_county <- county_data %>%
  group_by(county) %>%
  summarise(number_occurences = count_if(perl('blue', ignore.case = TRUE), color))
UPDATE with reproducible example:
library(dplyr)
library(expss)

county_data <- data.frame(
  county = c("A", "A", "A", "B", "B"),
  color = c("blue-blue", "red", "orange-blue", "yellow", "green"),
  stringsAsFactors = FALSE)

county_data %>%
  group_by(county) %>%
  summarise(number_occurences = count_if(perl('blue', ignore.case = TRUE), color))
# A tibble: 2 x 2
# county number_occurences
# <chr> <int>
# 1 A 2
# 2 B 0

Calculation on every pair from grouped data.frame

My question is about performing a calculation between each pair of groups in a data.frame; I'd like it to be more vectorized.
I have a data.frame that consists of the following columns: Location, Sample, Var1, and Var2. I'd like to find the closest match for each Sample for each pair of Locations, for both Var1 and Var2.
I can accomplish this for one pair of locations as such:
library(dplyr)
library(tidyr)

df0 <- data.frame(Location = rep(c("A", "B", "C"), each = 30),
                  Sample = rep(c(1:30), times = 3),
                  Var1 = sample(1:25, 90, replace = TRUE),
                  Var2 = sample(1:25, 90, replace = TRUE))
df00 <- data.frame(Location = rep(c("A", "B", "C"), each = 30),
                   Sample = rep(c(31:60), times = 3),
                   Var1 = sample(1:100, 90, replace = TRUE),
                   Var2 = sample(1:100, 90, replace = TRUE))
df000 <- rbind(df0, df00)
df <- sample_n(df000, 100)  # data

dfl <- df %>% gather(VAR, value, 3:4)
df1 <- dfl %>% filter(Location == "A")
df2 <- dfl %>% filter(Location == "B")
df3 <- merge(df1, df2, by = c("VAR"), all.x = TRUE, allow.cartesian = TRUE)
df3 <- df3 %>% mutate(DIFF = abs(value.x - value.y))
result <- df3 %>% group_by(VAR, Sample.x) %>% top_n(-1, DIFF)
I tried other possibilities, such as using tidyr::spread, but could not avoid the "Error: Duplicate identifiers for rows" or columns half filled with NA.
Is there a more clean and automated way to do this for each possible group pair? I'd like to avoid the manual subset and merge routine for each pair.
One option would be to create the pairwise combinations of 'Location' with combn and then do the other steps as in the OP's code:
library(tidyverse)

df %>%
  # get the unique elements of Location
  distinct(Location) %>%
  # pull the column as a vector
  pull %>%
  # it is a factor, so convert it to character
  as.character %>%
  # get the pairwise combinations in a list
  combn(m = 2, simplify = FALSE) %>%
  # loop through the list with map and do the full_join
  # with the long-format data dfl (from the question's code)
  map(~ full_join(dfl %>%
                    filter(Location == first(.x)),
                  dfl %>%
                    filter(Location == last(.x)), by = "VAR") %>%
        # create a column of absolute differences
        mutate(DIFF = abs(value.x - value.y)) %>%
        # grouped by VAR, Sample.x
        group_by(VAR, Sample.x) %>%
        # apply top_n with wt as DIFF
        top_n(-1, DIFF))
Also, as the OP asked about picking the pairs up automatically instead of doing the double filter (the expected output is not entirely clear, though):
df %>%
  distinct(Location) %>%
  pull %>%
  as.character %>%
  combn(m = 2, simplify = FALSE) %>%
  map(~ dfl %>%
        # change here, i.e. filter both Locations
        filter(Location %in% .x) %>%
        # spread it to wide format
        spread(Location, value, fill = 0) %>%
        # create the DIFF column by taking the difference
        mutate(DIFF = abs(!! rlang::sym(first(.x)) -
                          !! rlang::sym(last(.x)))) %>%
        group_by(VAR, Sample) %>%
        top_n(-1, DIFF))
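Either version returns a list with one element per Location pair; if a single labelled table is preferred, a small wrap-up sketch (the pair label column is my assumption, not part of the original answers):
library(purrr)
library(dplyr)

# `results` is assumed to hold the list produced by either pipeline above
pairs <- df %>% distinct(Location) %>% pull() %>% as.character() %>%
  combn(m = 2, simplify = FALSE)

results %>%
  set_names(map_chr(pairs, paste, collapse = "-")) %>%
  bind_rows(.id = "pair")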
