Count by vector of multiple columns in sparklyr

In a related question I received some good help generating the possible combinations of a set of variables.
Assume the output of that process is
combo_tbl <- sdf_copy_to(sc = sc,
                         x = data.frame(
                           combo_id    = c("combo1", "combo2", "combo3"),
                           selection_1 = c("Alice", "Alice", "Bob"),
                           selection_2 = c("Bob", "Cat", "Cat")
                         ),
                         name = "combo_table")
This is a tbl reference to a Spark data frame with an ID column and two selection columns; each row represents a selection of 2 values from a list of 3 (Alice, Bob, Cat), which can be imagined as 3 members of a household.
There is also a Spark data frame with a binary encoding per household member and day: 1 if that member was in the house, and 0 if they were not.
obs_tbl <- sdf_copy_to(sc = sc,
                       x = data.frame(
                         obs_day = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                         Alice = c(1, 1, 0, 1, 0, 1, 0),
                         Bob   = c(1, 1, 1, 0, 0, 0, 0),
                         Cat   = c(0, 1, 1, 1, 1, 0, 0)
                       ),
                       name = "obs_table")
I can relatively simply check if a specific pair were present in the house with this code:
obs_tbl %>%
  group_by(Alice, Bob) %>%
  summarise(n())
However, there are 2 flaws with this approach.
Each pair has to be entered manually, even though every combination I need to check is already in combo_tbl.
The output includes every combination of values (illustrated below), i.e. I get the count of days where both Alice == 1 and Bob == 1, but also where Alice == 1 and Bob == 0, where Alice == 0 and Bob == 1, etc.
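For illustration, on the example obs_tbl above that summary comes back with one row per Alice/Bob combination, roughly like this (row order may vary):
Alice | Bob | n()
0     | 0   | 2
0     | 1   | 1
1     | 0   | 2
1     | 1   | 2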
The ideal end result would be an output like so:
Alice | Bob | 2
Alice | Cat | 2
Bob | Cat | 2
i.e. The count of co-habitation days per pair.
A perfect solution would allow a simple modification to increase the number of selections within each combination, i.e. each combo_id may have 3 or more selections, drawn from a larger list than the one given.
So, is it possible in sparklyr to pass a vector of pairs to be iterated over?
How do I check only the cases where both of my selections are present? Instead of a vectorised group_by, should I use a vectorised filter?
I've read about quosures and standard evaluation in the tidyverse. Is that the solution if running locally? And if so, is it supported by Spark?
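For example, running locally with dplyr, the kind of programmatic filtering I have in mind would be something like the sketch below (obs_local is a hypothetical plain data.frame copy of obs_tbl; whether this tidy-eval splicing also translates through sparklyr's SQL backend is exactly what I'm unsure about):
library(dplyr)
library(rlang)

obs_local <- data.frame(obs_day = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                        Alice = c(1, 1, 0, 1, 0, 1, 0),
                        Bob   = c(1, 1, 1, 0, 0, 0, 0),
                        Cat   = c(0, 1, 1, 1, 1, 0, 0))

pair <- c("Alice", "Bob")  # one row of combo_tbl, as a character vector

obs_local %>%
  filter(!!sym(pair[1]) == 1, !!sym(pair[2]) == 1) %>%  # keep days where both were home
  summarise(n = n())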
For reference, I have a relatively similar solution using data.table that can be run in a single-machine, non-Spark context. Some pseudo code:
combo_dt[, obs_dt[get(tolower(selection_1)) == "1" &
                  get(tolower(selection_2)) == "1",
                  .N], by = combo_id]
This nested process effectively splits each combination into its own sub-table (by = combo_id), filters that sub-table to the rows where selection_1 and selection_2 are both 1, applies .N to count the remaining rows, and then aggregates the output.
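Made concrete on the example data above (dropping the tolower(), since here the selection values match the column names exactly), a runnable single-machine sketch of that data.table approach is:
library(data.table)

combo_dt <- data.table(combo_id    = c("combo1", "combo2", "combo3"),
                       selection_1 = c("Alice", "Alice", "Bob"),
                       selection_2 = c("Bob", "Cat", "Cat"))

obs_dt <- data.table(obs_day = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                     Alice = c(1, 1, 0, 1, 0, 1, 0),
                     Bob   = c(1, 1, 1, 0, 0, 0, 0),
                     Cat   = c(0, 1, 1, 1, 1, 0, 0))

# For each combo_id, look up the two selected columns in obs_dt by name
# and count the days on which both members were home.
combo_dt[, .(selection_1, selection_2,
             n = obs_dt[get(selection_1) == 1 & get(selection_2) == 1, .N]),
         by = combo_id]
#    combo_id selection_1 selection_2 n
# 1:   combo1       Alice         Bob 2
# 2:   combo2       Alice         Cat 2
# 3:   combo3         Bob         Cat 2
It is the sparklyr equivalent of this that I am after.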

Related

Filter one column and matching to another (expanded)

Hi, I have a similar question to this one (Filter one column by matching to another column).
For background, I'm trying to match up a code for a book name and a place where it is being used. I figured out how to use filter and grepl to narrow down the book name, but now I need to filter on whether the location names match. I can't share the data since it's private. It's a similar example to the one above, except I'm first filtering by what the animal name starts with, so imagine this:
library(dplyr)

df <- data.frame(pair = c(1, 1, 2, 2, 3, 3, 4, 4, 4),
                 animal = rep(c("Elephant", "Giraffe", "Antelope"), 6),
                 value = seq(1, 12, 2),
                 drop = c("savannah", "savannah", "jungle", "jungle", "zoo",
                          "unknown", "unknown", "zoo", "my house"))
zoo_animals <- filter(df, grepl("Gir|Ele", animal))
What I'm not sure how to do is build on that to see whether the location column matches between the entries. Is it just adding & location == location?
What I want is for it to find: is there an elephant and a giraffe from the zoo? What about the savannah? From the data I made, the only match appears to be the savannah, so it would print those data points, giving a df like this:
pair | animal   | value | drop
1    | Elephant | 7     | savannah
1    | Giraffe  | 3     | savannah
1    | Giraffe  | 3     | savannah
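One possible approach (a sketch, not from the original thread): after filtering to the two animals of interest, group by the location column and keep only the locations where both animals appear.
library(dplyr)

zoo_animals <- filter(df, grepl("Gir|Ele", animal))

zoo_animals %>%
  group_by(drop) %>%                                     # 'drop' holds the location
  filter(all(c("Elephant", "Giraffe") %in% animal)) %>%  # keep locations that have both
  ungroup()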

How to get one row per unique ID with multiple columns per value of a particular column

I have a dataset that looks like (A) and I'm trying to get (B):
#(A)
event <- c('A', 'A', 'A', 'B', 'B', 'C', 'D', 'D', 'D')
person <- c('Ann', 'Sally', 'Ryan', 'Ann', 'Ryan', 'Sally', 'Ann', 'Sally', 'Ryan')
birthday <- c('1990-10-10', NA, NA, NA, '1985-01-01', NA, '1990-10-10', '1950-04-02', NA)
data <- data.frame(event, person, birthday)
#(B)
person <- c('Ann', 'Sally', 'Ryan')
A <- c(1, 1, 1)
B <- c(1, 0, 1)
C <- c(0, 1, 0)
D <- c(1, 1, 1)
birthday <- c('1990-10-10', '1950-04-02', '1985-01-01')
data <- data.frame(person, A, B, C, D, birthday)
Basically, I have a sign-up list of events and can see which people attended which ones. I want to get a list of all the unique people, with columns for which events they did or didn't attend. I also got profile data from some of the events, but some had more data than others, so I also want to keep the most complete data (e.g. I couldn't identify Ryan's birthday from event D but could from event B).
I've tried looking up many different things but get confused about whether I should be looking at reshape, vs. dcast, vs. spread/gather... I'm new to R so any help is appreciated!
EDIT: Additional question - instead of indicating 1/0 for whether someone went to an event, if multiple events were in the same category, how would you count how many times someone went to that category of event? E.g., I would have events called A1, A2, and A3 in the dataset as well. The final table would still have a column called A, but instead of just 1/0, it would say 0 if the person attended no A events, and 1, 2, or 3 if the person attended 1, 2, or 3 A events.
A data.table option
dcast(
  setDT(data),
  person + na.omit(birthday)[match(person, person[!is.na(birthday)])] ~ event,
  fun = length
)
gives
person birthday A B C D
1: Ann 1990-10-10 1 1 0 1
2: Ryan 1985-01-01 1 1 0 1
3: Sally 1950-04-02 1 0 1 1
A base R option using reshape
reshape(
  transform(
    data,
    birthday = na.omit(birthday)[match(person, person[!is.na(birthday)])],
    cnt = 1
  ),
  direction = "wide",
  idvar = c("person", "birthday"),
  timevar = "event"
)
gives
person birthday cnt.A cnt.B cnt.C cnt.D
1 Ann 1990-10-10 1 1 NA 1
2 Sally 1950-04-02 1 NA 1 1
3 Ryan 1985-01-01 1 1 NA 1
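A small follow-up (not part of the original answer): if the NA cells should read 0, as in the target (B), the reshaped result can be assigned to a name (here wide, chosen for illustration) and the NAs replaced afterwards:
wide <- reshape(transform(data,
                          birthday = na.omit(birthday)[match(person, person[!is.na(birthday)])],
                          cnt = 1),
                direction = "wide",
                idvar = c("person", "birthday"),
                timevar = "event")
wide[is.na(wide)] <- 0  # unattended events become 0 rather than NA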
First, isolate the birthdays, which are not represented cleanly in your table; then reshape, and finally merge the birthdays back in.
Using the package reshape2:
birthdays <- unique(data[!is.na(data$birthday), c("person", "birthday")])
reshaped <- reshape2::dcast(data, person ~ event, value.var = "event", fun.aggregate = length)
final <- merge(reshaped, birthdays)
Explanation: I just told reshape2::dcast to put person into rows and event into columns, and to count every occurrence of event (via the aggregation function length).
EDIT: for your additional question, it works just the same; just add substr() on the event variable:
reshaped <- reshape2::dcast(data, person ~ substr(event, 1, 1), value.var = "event", fun.aggregate = length)

Merging when one of the columns is a list, producing a new column that is a list

I have two datasets that I want to merge. One of the columns that I want to use as a key to merge has the values in a list. If any of those values appear in the second dataset’s column, I want the value in the other column to be merged into the first dataset – which might mean there are multiple values, which should be presented as a list.
That is quite hard to explain but hopefully this example data makes it clearer.
Example data
library(data.table)

mother_dt <- data.table(mother = c("Penny", "Penny", "Anya", "Sam", "Sam", "Sam"),
                        child = c("Violet", "Prudence", "Erika", "Jake", "Wolf", "Red"))
mother_dt[, children := .(list(unique(child))), by = mother]
mother_dt[, child := NULL]
mother_dt <- unique(mother_dt, by = "mother")

child_dt <- data.table(child = c("Violet", "Prudence", "Erika", "Jake", "Wolf", "Red"),
                       age = c(10, 8, 9, 6, 5, 2))
So for example, the first row in my new dataset would have "Penny" in the mother column, a list containing "Violet" and "Prudence" in the children column, and a list containing 10 and 8 in the age column.
I've tried the following:
combined_dt <- mother_dt[, child_age := ifelse(child_dt$child %in% children,
                                               .(list(unique(child_dt$age))), NA)]
But that just contains a list of all the ages in the final row.
I appreciate this is probably quite unusual behaviour but is there a way to achieve it?
Edit: The final datatable would look like this:
final_dt <- data.table(mother = c("Penny", "Anya", "Sam"),
                       children = c(list(c("Violet", "Prudence")), list(c("Erika")), list(c("Jake", "Wolf", "Red"))),
                       age = c(list(c(10, 8)), list(c(9)), list(c(6, 5, 2))))
The easiest way I can think of is to first unlist the children, then merge, then list again:
mother1 <- mother_dt[, .(children = unlist(children)), by = mother]
mother1[child_dt, on = c(children = "child")][, .(children = list(children), age = list(age)), by = mother]
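On the example data this should return something like:
#    mother        children   age
# 1:  Penny Violet,Prudence  10,8
# 2:   Anya           Erika     9
# 3:    Sam   Jake,Wolf,Red 6,5,2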
You can do something like this:
library(splitstackshape)
newm <- mother_dt[, .(children = unlist(children)), by = mother]
final_dt <- merge(newm, child_dt, by.x = "children", by.y = "child")
> aggregate(. ~ mother, data = final_dt, toString)
mother children age
1 Anya Erika 9
2 Penny Prudence, Violet 8, 10
3 Sam Jake, Red, Wolf 6, 2, 5
You could do it the following way, which has the advantage of preserving duplicates in the mother column when they exist.
mother_dt$age <- lapply(
  mother_dt$children,
  function(x, y) y[x],
  y = setNames(child_dt$age, child_dt$child))
mother_dt
# mother children age
# 1: Penny Violet,Prudence 10, 8
# 2: Anya Erika 9
# 3: Sam Jake,Wolf,Red 6,5,2
It translates nicely into tidyverse syntax:
library(tidyverse)
mutate(mother_dt, age = map(children, ~ .y[.], deframe(child_dt)))
# mother children age
# 1 Penny Violet, Prudence 10, 8
# 2 Anya Erika 9
# 3 Sam Jake, Wolf, Red 6, 5, 2

Creating numeric variable based on string intersection in R

I'm attempting to create a numeric variable based on the intersection of strings with R's dplyr package.
I have a list of columns containing codes for thousands of individuals who made purchases at an auto dealership. The codes can represent a purchase of a car, internal parts for a car, or items for the exterior of a car. I want to denote codes identified as a car purchase with 2, items for the interior of a car with 1, and items for the exterior of a car with 0. If the customer purchased a car, I want the column LargestPurchase = 2; if the customer didn't buy a car but bought an interior component, I would like the column LargestPurchase = 1; and if the customer did not buy a car or interior component I would like the column LargestPurchase = 0.
The codes for a car purchase are located in a separate data frame with column CarCodes, and the codes for the interior components of a car are located in a separate data frame with column InteriorCodes. Each contains thousands of codes.
The data for the customers would look like the following (called customers):
Customer1 PurchaseCode1 PurchaseCode2 PurchaseCode3
001 STW387 K987 W9333
002 AZ326 CP993 EN499
003 BKY98 A0091 C2001
Example:
df1$CarCodes = c('STW387', 'W9333')
df2$InteriorCodes = c('K987', 'AZ326')
Customer1 PurchaseCode1 PurchaseCode2 PurchaseCode3 LargestPurchase
001 STW387 K987 W9333 2
002 AZ326 CP993 EN499 1
003 BKY98 A0091 C2001 0
I attempted to use the following ifelse function with mutate, but it does not seem to work with strings:
customers <- customers %>%
  mutate(LargestPurchase =
           ifelse(intersect(customers$PurchaseCode1, df1$CarCodes) == TRUE |
                    intersect(customers$PurchaseCode2, df1$CarCodes) |
                    intersect(customers$PurchaseCode3, df1$CarCodes), 2,
                  ifelse(intersect(customers$PurchaseCode1, df2$InteriorCodes) == TRUE |
                           intersect(customers$PurchaseCode2, df2$InteriorCodes) == TRUE |
                           intersect(customers$PurchaseCode3, df3$InteriorCodes) == TRUE, 1, 0)))
Any insight would be great.
Here is a dplyr version
library(dplyr)
library(tidyr)

CarCodes <- c('STW387', 'W9333')
InteriorCodes <- c('K987', 'AZ326')

data.frame(customer = c(001, 002, 003),
           code1 = c('STW387', 'AZ326', 'BKY98'),
           code2 = c('K987', 'CP993', 'A0091'),
           code3 = c('W9333', 'EN499', 'C2001')) %>%
  gather(variable, value, -customer) %>%
  mutate(purchase = case_when(value %in% CarCodes ~ 2,
                              value %in% InteriorCodes ~ 1,
                              TRUE ~ 0)) %>%
  group_by(customer) %>%
  summarise(largest_purchase = max(purchase))
Determine whether either the CarCodes or the InteriorCodes are contained and then take the max value (here df3 is the customers data frame).
c2 <- apply(df3[,-1], 1, function(x) ifelse(any(x %in% df2$InteriorCodes), 1, 0))
c1 <- apply(df3[,-1], 1, function(x) ifelse(any(x %in% df1$CarCodes), 2, 0))
df3$LargestPurchase <- pmax(c1, c2)
Customer1 PurchaseCode1 PurchaseCode2 PurchaseCode3 LargestPurchase
1 1 STW387 K987 W9333 2
2 2 AZ326 CP993 EN499 1
3 3 BKY98 A0091 C2001 0

Efficient way to conditionally edit value labels

I'm working with survey data containing value labels. The haven package allows one to import data with value label attributes. Sometimes these value labels need to be edited in routine ways.
The example I'm giving here is very simple, but I'm looking for a solution that can be applied to similar problems across large data.frames.
d <- dput(structure(list(
  var1 = structure(c(1, 2, NA, NA, 3, NA, 1, 1),
                   labels = structure(c(1, 2, 3, 8, 9),
                                      .Names = c("Protection of environment should be given priority",
                                                 "Economic growth should be given priority",
                                                 "[DON'T READ] Both equally",
                                                 "[DON'T READ] Don't Know",
                                                 "[DON'T READ] Refused")),
                   class = "labelled")),
  .Names = "var1", row.names = c(NA, -8L),
  class = c("tbl_df", "tbl", "data.frame")))
d$var1
<Labelled double>
[1] 1 2 NA NA 3 NA 1 1
Labels:
value label
1 Protection of environment should be given priority
2 Economic growth should be given priority
3 [DON'T READ] Both equally
8 [DON'T READ] Don't Know
9 [DON'T READ] Refused
If a value label begins with "[DON'T READ]" I want to remove "[DON'T READ]" from the beginning of the label and add "(VOL)" at the end. So, "[DON'T READ] Both equally" would now read "Both equally (VOL)."
Of course, it's straightforward to edit this individual variable with a function from haven's associated labelled package. But I want to apply this solution across all the variables in a data.frame.
library(labelled)
val_labels(d$var1) <- c("Protection of environment should be given priority" = 1,
"Economic growth should be given priority" = 2,
"Both equally (VOL)" = 3,
"Don't Know (VOL)" = 8,
"Refused (VOL)" = 9)
How can I achieve the result of the function directly above in a way that can be applied to every variable in a data.frame?
The solution must work regardless of the specific value. (In this instance it is values 3,8, & 9 that need alteration, but this is not necessarily the case).
There are a few ways to do this. You could use lapply() or (if you want a one(ish)-liner) you could use any of the scoped variants of mutate():
1) Using lapply()
This method loops over all columns with gsub() to remove the part you do not want and adds the " (VOL)" to the end of the string. Of course you could use this with a subset as well!
d[] <- lapply(d, function(x) {
  labels <- attributes(x)$labels
  names(labels) <- gsub("\\[DON'T READ\\]\\s*(.*)", "\\1 (VOL)", names(labels))
  attributes(x)$labels <- labels
  x
})
d$var1
[1] 1 2 NA NA 3 NA 1 1
attr(,"labels")
Protection of environment should be given priority Economic growth should be given priority
1 2
Both equally (VOL) Don't Know (VOL)
3 8
Refused (VOL)
9
attr(,"class")
[1] "labelled"
2) Using mutate_all()
Using the same logic (with the same result) you could change the name of the labels in a tidier way:
d %>%
  mutate_all(~ {
    names(attributes(.)$labels) <- gsub("\\[DON'T READ\\]\\s*(.*)", "\\1 (VOL)",
                                        names(attributes(.)$labels))
    .
  }) %>%
  map(attributes) # just to check on the result
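(A side note, not from the original answer: mutate_all() has since been superseded in dplyr; the same idea can be written with across(), roughly as in this sketch.)
library(dplyr)
library(purrr)

d %>%
  mutate(across(everything(), ~ {
    names(attributes(.x)$labels) <- gsub("\\[DON'T READ\\]\\s*(.*)", "\\1 (VOL)",
                                         names(attributes(.x)$labels))
    .x
  })) %>%
  map(attributes)  # same check as above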
