Creating numeric variable based on string intersection in R

I'm attempting to create a numeric variable based on the intersection of strings with R's dplyr package.
I have a list of columns containing codes for thousands of individuals who made purchases at an auto dealership. The codes can represent a purchase of a car, internal parts for a car, or items for the exterior of a car. I want to denote codes identified as a car purchase with 2, items for the interior of a car with 1, and items for the exterior of a car with 0. If the customer purchased a car, I want the column LargestPurchase = 2; if the customer didn't buy a car but bought an interior component, I would like the column LargestPurchase = 1; and if the customer did not buy a car or interior component I would like the column LargestPurchase = 0.
The codes for a car purchase are located in a separate data frame with column CarCodes, and the codes for the interior components of a car are located in a separate data frame with column InteriorCodes. Each contains thousands of codes.
The data for the customers would look like the following (called customers):
Customer1 PurchaseCode1 PurchaseCode2 PurchaseCode3
001 STW387 K987 W9333
002 AZ326 CP993 EN499
003 BKY98 A0091 C2001
Example:
df1$CarCodes = c('STW387', 'W9333')
df2$InteriorCodes = c('K987', 'AZ326')
Customer1 PurchaseCode1 PurchaseCode2 PurchaseCode3 LargestPurchase
001 STW387 K987 W9333 2
002 AZ326 CP993 EN499 1
003 BKY98 A0091 C2001 0
I attempted to use the following ifelse function with mutate, but it does not seem to work with strings:
customers <- customers %>% mutate(LargestPurchase =
(ifelse(intersect(customers$PurchaseCode1, df1$CarCodes) == TRUE |
intersect(customers$PurchaseCode2, df1$CarCodes) |
intersect(customers$PurchaseCode3, df1$CarCodes), 2, ifelse(
intersect(customers$PurchaseCode1, df2$InteriorCodes) == TRUE |
intersect(customers$PurchaseCode2, df2$InteriorCodes) == TRUE |
intersect(customers$PurchaseCode3, df3$InteriorCodes) == TRUE, 1, 0)))
Any insight would be great.

Here is a dplyr version (gather() comes from tidyr):
library(dplyr)
library(tidyr)
CarCodes = c('STW387', 'W9333')
InteriorCodes = c('K987', 'AZ326')
data.frame(customer = c(001, 002, 003),
code1 = c('STW387', 'AZ326', 'BKY98'),
code2 = c('K987', 'CP993', 'A0091'),
code3 = c('W9333', 'EN499', 'C2001')) %>%
gather(variable, value, -customer) %>%
mutate(purchase = case_when(value %in% CarCodes ~ 2,
value %in% InteriorCodes ~ 1,
TRUE ~ 0)) %>%
group_by(customer) %>%
summarise(largest_purchase = max(purchase))

Determine whether each row contains any of the CarCodes or InteriorCodes, then take the larger of the two values (df3 here is the customer data frame).
c2 <- apply(df3[,-1], 1, function(x) ifelse(any(x %in% df2$InteriorCodes), 1, 0))
c1 <- apply(df3[,-1], 1, function(x) ifelse(any(x %in% df1$CarCodes), 2, 0))
df3$LargestPurchase <- pmax(c1, c2)
Customer1 PurchaseCode1 PurchaseCode2 PurchaseCode3 LargestPurchase
1 1 STW387 K987 W9333 2
2 2 AZ326 CP993 EN499 1
3 3 BKY98 A0091 C2001 0
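The attempt in the question fails because intersect() returns the set of shared values, not a TRUE/FALSE value per row, so it cannot be used as a condition inside ifelse(). A minimal base R sketch of a column-wise alternative follows. It assumes the purchase columns all start with "PurchaseCode"; the names car_codes, interior_codes, code_cols, has_car and has_interior are made up for illustration.

```r
# Example data from the question (Customer1 kept as character to preserve leading zeros)
customers <- data.frame(
  Customer1     = c("001", "002", "003"),
  PurchaseCode1 = c("STW387", "AZ326", "BKY98"),
  PurchaseCode2 = c("K987", "CP993", "A0091"),
  PurchaseCode3 = c("W9333", "EN499", "C2001")
)
car_codes      <- c("STW387", "W9333")
interior_codes <- c("K987", "AZ326")

# All purchase-code columns (assumed naming convention)
code_cols <- grep("^PurchaseCode", names(customers), value = TRUE)

# One logical vector per column, combined with OR across columns
has_car      <- Reduce(`|`, lapply(customers[code_cols], function(col) col %in% car_codes))
has_interior <- Reduce(`|`, lapply(customers[code_cols], function(col) col %in% interior_codes))

# Car purchases take priority over interior purchases
customers$LargestPurchase <- ifelse(has_car, 2, ifelse(has_interior, 1, 0))
customers
```

Because %in% is vectorised over whole columns, this avoids a row-by-row apply() and should scale to thousands of codes.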

Related

R- compare different columns of a data frame with different values

I am currently working on microdata, using a survey called SHARE. I want to use a variable for education but the way it was coded makes it kind of hard.
In the survey, households are asked what degree they have. There is one column for each degree and it takes the value 0 or 1 if the interviewed has the degree or not. The issue is that I have two countries with different degrees, but they are using the same column, so I have to go to the user manual to find for each country to which degree corresponds each 0 or 1. I was able to do so and then translate it to an international way of measuring education.
My idea was to sum the columns so that I end up with a single column for each household. However, I wasn't able to proceed because some people have several degrees. I would like to get the highest degree of each household and would appreciate your help with this.
Here are tables of what I have and what I would like:
Let's imagine that in Germany the first diploma is equivalent to the first diploma in international standards, the second and the third in Germany are the same as the second diploma in international standards, and the last diploma in Germany is the same as the third internationally. In France we have first = first int., second = second int., third = third int., and no fourth diploma. Then I have the following table:
country= c( "Germany", "Germany", "Germany", "France" , "France", "France")
degree_one= c( 1, 1, 1, 1 , 1, 1)
degree_two = c( 0, 1, 0, 1 , 1, 0)
degree_three= c( 1, 0, 1, 1 , 1, 0)
degree_four = c( 1, 0, 0, NA ,NA, NA)
f = data.frame(country,degree_one,degree_two,degree_three,degree_four)
Then I can translate and try to create my variable degree by summing everything:
f$degree_one = ifelse(f$country == "Germany" & f$degree_one == 1,1,f$degree_one)
f$degree_two = ifelse(f$country == "Germany" & f$degree_two == 1,2,f$degree_two)
f$degree_three = ifelse(f$country == "Germany" & f$degree_three == 1,2,f$degree_three)
f$degree_four = ifelse(f$country == "Germany" & f$degree_four == 1,3,f$degree_four)
f$degree_one = ifelse(f$country == "France" & f$degree_one == 1,1,f$degree_one)
f$degree_two = ifelse(f$country == "France" & f$degree_two == 1,2,f$degree_two)
f$degree_three = ifelse(f$country == "France" & f$degree_three == 1,3,f$degree_three)
f$degree_four = ifelse(f$country == "France" & is.na(f$degree_four),0,f$degree_four)
f = replace(f, is.na(f), 0)
f2 = f %>% mutate(degree = degree_one + degree_two + degree_three + degree_four )
Unfortunately, it does not work and what I would like should look like this:
degree = c(3,2,2,3,3,1)
f3 = data.frame(f,degree)
I tried to do something with a while loop but it did not work. Does anyone have any idea how I can solve my problem? I tried to make it as clear as possible; I hope you will understand and that someone has an idea on how to fix this.
Thanks :)
Here is an approach using data.table
library(data.table)
##
# create degree map by country
#
degreeMap <- data.table(country=c('France', 'Germany'))
degreeMap <- degreeMap[, .(degree=paste('degree', c('one', 'two', 'three', 'four'), sep='_')), by=.(country)]
degreeMap[country=='France', intlDegree:=c(1,2,3,NA)]
degreeMap[country=='Germany', intlDegree:=c(1,2,2,3)]
##
# process your data
#
setDT(f)
f[, indx:=1:.N] # need an index column to recover original order
f[, HH:=1:.N, by=.(country)] # need a HH column to distinguish different HH w/in country
maxDegree <- melt(f, id=c('country', 'HH', 'indx'), variable.name='degree', value.name = 'flag')
maxDegree <- maxDegree[flag > 0] # remove rows with flag=0 or NA
setorder(maxDegree, HH, degree)
maxDegree <- maxDegree[, .SD[.N], keyby=.(country, HH)]
maxDegree[degreeMap, intlDegree:=i.intlDegree, on=.(country, degree)]
setorder(maxDegree, indx)
maxDegree
## country HH indx degree flag intlDegree
## 1: Germany 1 1 degree_four 1 3
## 2: Germany 2 2 degree_two 1 2
## 3: Germany 3 3 degree_three 1 2
## 4: France 1 4 degree_three 1 3
## 5: France 2 5 degree_three 1 3
## 6: France 3 6 degree_one 1 1
So this converts your f to a data.table and adds an index column and a HH column to distinguish between HH within a country.
We then convert to long format using melt(...). In long format the four degree_ columns are reduced to two columns: a flag column indicating whether or not the degree applies, and a degree column indicating which degree.
Then we remove all rows with 0 or NA flags, and then extract the last remaining row (highest degree) for each country and HH.
Finally, we join to degreeMap to get the equivalent intlDegree.
Change NAs to 0 and then sum degree columns:
f <- f %>%
mutate(
degree_one = ifelse(is.na(degree_one), 0, degree_one),
degree_two = ifelse(is.na(degree_two), 0, degree_two),
degree_three = ifelse(is.na(degree_three), 0, degree_three),
degree_four = ifelse(is.na(degree_four), 0, degree_four),
degree_sum = degree_one + degree_two + degree_three + degree_four
)
Or, if you want to get fancy with the dplyr
f <- f %>%
mutate(across(contains("degree"), \(x) {ifelse(is.na(x), 0, x)})) %>%
mutate(degree_sum = select(., contains("degree")) %>% rowSums())
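Since the question asks for the highest degree rather than a sum, a row-wise maximum may be closer to the desired result. The following is a minimal base R sketch under that assumption: it starts from the original 0/1 flags and uses a hypothetical lookup list, intl_map, that encodes the Germany/France mapping described in the question.

```r
# Original 0/1 flags from the question
f <- data.frame(
  country      = c("Germany", "Germany", "Germany", "France", "France", "France"),
  degree_one   = c(1, 1, 1, 1, 1, 1),
  degree_two   = c(0, 1, 0, 1, 1, 0),
  degree_three = c(1, 0, 1, 1, 1, 0),
  degree_four  = c(1, 0, 0, NA, NA, NA)
)

# International level of each degree column per country (NA = degree does not exist)
intl_map <- list(
  Germany = c(1, 2, 2, 3),
  France  = c(1, 2, 3, NA)
)
degree_cols <- c("degree_one", "degree_two", "degree_three", "degree_four")

# For each row, turn the flags into international levels and keep the highest one
f$degree <- vapply(seq_len(nrow(f)), function(i) {
  flags  <- unlist(f[i, degree_cols])
  levels <- intl_map[[as.character(f$country[i])]]
  max(flags * levels, 0, na.rm = TRUE)
}, numeric(1))

f$degree
```

Multiplying the flags by the levels leaves 0 wherever a degree is absent, so max() picks the highest degree actually held; the extra 0 guards against rows with no degree at all.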

How to get one row per unique ID with multiple columns per values of particular column

I have a dataset that looks like (A) and I'm trying to get (B):
#(A)
event <- c('A', 'A', 'A', 'B', 'B', 'C', 'D', 'D', 'D')
person <- c('Ann', 'Sally', 'Ryan', 'Ann', 'Ryan', 'Sally', 'Ann', 'Sally', 'Ryan')
birthday <- c('1990-10-10', NA, NA, NA, '1985-01-01', NA, '1990-10-10', '1950-04-02', NA)
data <- data.frame(event, person, birthday)
#(B)
person <- c('Ann', 'Sally', 'Ryan')
A <- c(1, 1, 1)
B <- c(1, 0, 1)
C <- c(0, 1, 0)
D <- c(1, 1, 1)
birthday <- c('1990-10-10', '1950-04-02', '1985-01-01')
data <- data.frame(person, A, B, C, D, birthday)
Basically, I have a sign-up list of events and can see people who attended various ones. I want to get a list of all the unique people with columns for which events they did/didn't attend. I also got profile data from some of the events, but some had more data than others - so I also want to keep the most filled out data (i.e. couldn't identify Ryan's birthday from event D but could from event B).
I've tried looking up many different things but get confused between whether I should be looking at reshaping, vs. dcast, vs. spread/gather... new to R so any help is appreciated!
EDIT: Additional q - instead of indicating 1/0 for whether someone went to an event, if multiple events were in the same category, how would you identify how many times someone went to that category of event? E.g., I would have events called A1, A2, and A3 in the dataset as well. The final table would still have a column called A, but instead of just 1/0, it would say 0 if the person attended no A events, and 1, 2, or 3 if the person attended 1, 2, or 3 A events.
A data.table option
library(data.table)
dcast(
setDT(data),
person + na.omit(birthday)[match(person, person[!is.na(birthday)])] ~ event,
fun = length
)
gives
person birthday A B C D
1: Ann 1990-10-10 1 1 0 1
2: Ryan 1985-01-01 1 1 0 1
3: Sally 1950-04-02 1 0 1 1
A base R option using reshape
reshape(
transform(
data,
birthday = na.omit(birthday)[match(person, person[!is.na(birthday)])],
cnt = 1
),
direction = "wide",
idvar = c("person", "birthday"),
timevar = "event"
)
gives
person birthday cnt.A cnt.B cnt.C cnt.D
1 Ann 1990-10-10 1 1 NA 1
2 Sally 1950-04-02 1 NA 1 1
3 Ryan 1985-01-01 1 1 NA 1
First, you should isolate the birthdays, which are not represented cleanly in your table; then you should reshape, and finally merge the birthdays back.
Using the package reshape2:
birthdays <- unique(data[!is.na(data$birthday),c("person","birthday")])
reshaped <- reshape2::dcast(data,person ~ event, value.var = "event",fun.aggregate = length)
final <- merge(reshaped,birthdays)
Explanation: I told reshape2::dcast to put person into rows and event into columns, and to count every occurrence of event (the aggregation function length does the counting).
EDIT: for your additional question, it works just the same; just apply substr() to the event variable:
reshaped <- reshape2::dcast(data,person ~ substr(event,1,1), value.var = "event",fun.aggregate = length)
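The same count-then-merge idea can be sketched without extra packages, using table() for the counts. This is only an illustration on the example data; the names counts, known and result are made up here, and it assumes each person has at most one distinct non-missing birthday.

```r
event    <- c('A', 'A', 'A', 'B', 'B', 'C', 'D', 'D', 'D')
person   <- c('Ann', 'Sally', 'Ryan', 'Ann', 'Ryan', 'Sally', 'Ann', 'Sally', 'Ryan')
birthday <- c('1990-10-10', NA, NA, NA, '1985-01-01', NA, '1990-10-10', '1950-04-02', NA)
data <- data.frame(event, person, birthday)

# Cross-tabulate person by event: one row per person, one count column per event
counts <- as.data.frame.matrix(table(data$person, data$event))
counts$person <- rownames(counts)

# Keep each person's known birthday once (rows with a missing birthday are dropped)
known <- unique(na.omit(data[, c("person", "birthday")]))

# Attach the birthdays; all.x = TRUE keeps people whose birthday is never recorded
result <- merge(counts, known, by = "person", all.x = TRUE)
result
```

For the follow-up about event categories, replacing data$event with substr(data$event, 1, 1) inside table() should give per-category counts in the same way.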

subtracting values in dataframe with condition(s) from another dataframe

I have two data frames that hold sales data from fruit stores.
The 1st data frame has sales data from 'Store A',
and the 2nd data frame has the combined sales data from 'Store A + Store B'.
StoreA = data.frame(
Fruits = c('Apple', 'Banana', 'Blueberry'),
Customer = c('John', 'Peter', 'Jenny'),
Quantity = c(2, 3, 1)
)
Total = data.frame(
Fruits = c('Blueberry', 'Apple', 'Banana', 'Blueberry', 'Pineapple'),
Customer = c('Jenny' , 'John', 'Peter', 'John', 'Peter'),
Quantity = c(4, 7, 3, 5, 3)
)
StoreA
Total
I wish to subtract the sales data of 'StoreA' from 'Total' to get sales data for 'StoreB'.
At the end, I wish to have something like
Great Question! There is a simple and graceful way of achieving exactly what you want.
The title of this question is: "subtracting values in dataframe with condition(s) from another dataframe"
This subtraction can be accomplished just like the title says. But there is a better way than using subtraction. Turning a subtraction problem into an addition problem is often the easiest way of solving a problem.
To make this into an addition problem, just convert one of the data frames (StoreA$Quantity) into negative values. Only convert the Quantity variable into negative values. And then rename the other data frame (Total) into StoreB.
Once that is done, it's easy to finish. Just use the join function with the two data frames (StoreA & StoreB). Doing that brings the negative and positive values together and the data is more understandable. When there are the same things with positive and negative values, then it's obvious these things need to be combined.
To combine those similar items, use the group_by() function and pipe it into a summarize() function. Doing the coding this way makes the code easy to read and easy to understand. The code can almost be read like a book.
Load dplyr (needed for full_join(), group_by() and summarize() below) and create the data frames:
library(dplyr)
StoreA = data.frame(Fruits = c('Apple', 'Banana', 'Blueberry'), Customer = c('John', 'Peter', 'Jenny'), Quantity = c(2,3,1))
StoreB = data.frame(Fruits = c('Blueberry','Apple', 'Banana', 'Blueberry', 'Pineapple'), Customer = c('Jenny' ,'John', 'Peter', 'John', 'Peter'), Quantity = c(4,7,3,5,3))
Convert StoreA$Quantity to negative values:
StoreA_ <- StoreA
StoreA_$Quantity <- StoreA$Quantity * -1
StoreA_
StoreA_ now looks like this:
Fruits Customer Quantity
<fct> <fct> <dbl>
Apple John -2
Banana Peter -3
Blueberry Jenny -1
Now combine StoreA and StoreB. Use the full_join() function to join the two stores:
# joining on all three columns keeps every row from both data frames
Total <- full_join(StoreA_, StoreB, by = c("Fruits", "Customer", "Quantity"))
Total
The last thing is accomplished using the group_by function. This will combine the positive and negative values together.
Total %>% group_by(Fruits, Customer) %>% summarize(s = sum(Quantity))
It's Done! The output is shown at this link:
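The negate-and-sum idea described above can also be sketched in base R, without dplyr. This is a minimal illustration on the example data; it swaps the join for rbind(), and the names neg_a, combined and StoreB_sales are made up here.

```r
StoreA <- data.frame(
  Fruits   = c('Apple', 'Banana', 'Blueberry'),
  Customer = c('John', 'Peter', 'Jenny'),
  Quantity = c(2, 3, 1)
)
Total <- data.frame(
  Fruits   = c('Blueberry', 'Apple', 'Banana', 'Blueberry', 'Pineapple'),
  Customer = c('Jenny', 'John', 'Peter', 'John', 'Peter'),
  Quantity = c(4, 7, 3, 5, 3)
)

# Flip the sign of Store A's quantities so that summing performs the subtraction
neg_a <- transform(StoreA, Quantity = -Quantity)

# Stack both tables and sum per fruit/customer pair
combined     <- rbind(Total, neg_a)
StoreB_sales <- aggregate(Quantity ~ Fruits + Customer, data = combined, FUN = sum)
StoreB_sales
```

Pairs that appear only in Total (for example Pineapple/Peter) keep their full quantity, because there is no negative row to offset them.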
You could do a full join first, then rename the columns, fill the missing values resulting from the join and then compute the difference.
library(tidyverse)
StoreA = data.frame(Fruits = c('Apple', 'Banana', 'Blueberry'), Customer = c('John', 'Peter', 'Jenny'), Quantity = c(2,3,1))
Total = data.frame(Fruits = c('Blueberry','Apple', 'Banana', 'Blueberry', 'Pineapple'), Customer = c('Jenny' ,'John', 'Peter', 'John', 'Peter'), Quantity = c(4,7,3,5,3))
full_join(StoreA %>%
rename(Qty_A = Quantity),
Total %>%
rename(Qty_Total = Quantity), by = c("Fruits", "Customer")) %>%
# fill NAs with zero
replace_na(list(Qty_A = 0)) %>%
# compute the difference
mutate(Qty_B = Qty_Total - Qty_A)
#> Fruits Customer Qty_A Qty_Total Qty_B
#> 1 Apple John 2 7 5
#> 2 Banana Peter 3 3 0
#> 3 Blueberry Jenny 1 4 3
#> 4 Blueberry John 0 5 5
#> 5 Pineapple Peter 0 3 3
Created on 2020-09-28 by the reprex package (v0.3.0)

R! mutate conditional and list intersect (How many time was a player on the court ?)

This is a sports analysis question: how many times was a player on the court?
I have a list of players I am interested in
names <- c('John','Bill','Peter')
and a list of actions during multiple matches
team <- c('teama','teama','teama','teama','teama','teama','teamb','teamb')
player1 <- c('John', 'John', 'John', 'Bill', 'Mike', 'Mike', 'Steve', 'Steve')
player2 <- c('Mike', 'Mike', 'Mike', 'John', 'Bill', 'Bill', 'Peter', 'Bob')
df <- data.frame(team,player1,player2)
I want to build a column that counts the number of actions for which each player was on the court
actions_when_player_on_court <- df %>% group_by(team) %>%
calculate({nb of observation where the player is either player1 or player2} )
so I end up with a new list like
actions_when_player_on_court <- c(4,3,1)
so I can create a new DF like this
new_df <- data.frame(names,actions_when_player_on_court)
where John appears 4 times on the court, Bill three times, and Peter once
I feel I may need to intersect the names and c(player1,player2) especially if
names are unique - John, Bill and Peter cannot belong to other teams and are unique in df
I may have 0 to n players on the field so 0 to n column (player1, player2... playern)
The following code should do what you need.
We first need to create a new data frame to store all names and an empty actions_when_player_on_court variable.
names = c()
for (i in 2:ncol(df)) {
names = c(names, unique(df[,i]))
}
names = data.frame(name = unique(names), actions_when_player_on_court = 0)
Then, we can fill the actions_when_player_on_court variable using a for loop:
df$n = 1
for (i in 2:(ncol(df)-1)) {
tmp = aggregate(cbind(n = n) ~ df[, i], data = df[, c(i, ncol(df))], FUN="sum")
names(tmp)[1] = "name"
names = merge(names, tmp, all=T)
names[is.na(names)] = 0
names$actions_when_player_on_court = names$actions_when_player_on_court + names$n
names = names[-ncol(names)]
}
You can have as many players as you want as long as they start with the second column and run until the end of the data frame. Note that the resulting data frame does not include the team variable. I think you can deal with that yourself. Here is the result:
> names
name actions_when_player_on_court
1 Bill 3
2 Bob 1
3 John 4
4 Mike 5
5 Peter 1
6 Steve 2
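If only the players of interest are needed, the counting can be sketched more compactly with table(). This is a minimal base R illustration that assumes every column other than team holds player names and that a player appears at most once per row; players_of_interest, player_cols and result are made-up names.

```r
team    <- c('teama','teama','teama','teama','teama','teama','teamb','teamb')
player1 <- c('John', 'John', 'John', 'Bill', 'Mike', 'Mike', 'Steve', 'Steve')
player2 <- c('Mike', 'Mike', 'Mike', 'John', 'Bill', 'Bill', 'Peter', 'Bob')
df <- data.frame(team, player1, player2)

players_of_interest <- c('John', 'Bill', 'Peter')

# Every column except team is treated as a player column (player1 ... playern)
player_cols <- setdiff(names(df), "team")

# Flatten all player columns into one vector of appearances
all_appearances <- as.character(unlist(df[player_cols], use.names = FALSE))

# Count only the players of interest; a factor keeps zero counts and the given order
counts <- table(factor(all_appearances, levels = players_of_interest))

result <- data.frame(
  name = players_of_interest,
  actions_when_player_on_court = as.vector(counts)
)
result
```

Because the player columns are selected by name rather than position, the same code should work unchanged when more player columns are added.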

Count by vector of multiple columns in sparklyr

In a related question I had some good help to generate possible combinations of a set of variables.
Assume the output of that process is
combo_tbl <- sdf_copy_to(sc = sc,
x = data.frame(
combo_id = c("combo1", "combo2", "combo3"),
selection_1 = c("Alice", "Alice", "Bob"),
selection_2 = c("Bob", "Cat", "Cat")
),
name = "combo_table")
This is a tbl reference to a spark data frame object with two selection columns (plus combo_id), each row representing a selection of 2 values from a list of 3 (Alice, Bob, Cat), which could be imagined as 3 members of a household.
Now there is also a spark data frame with a binary encoding indicating a 1 if the member of the house was in the house, and 0 where they were not.
obs_tbl <- sdf_copy_to(sc = sc,
x = data.frame(
obs_day = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
Alice = c(1, 1, 0, 1, 0, 1, 0),
Bob = c(1, 1, 1, 0, 0, 0, 0),
Cat = c(0, 1, 1, 1, 1, 0, 0)
),
name = "obs_table")
I can relatively simply check if a specific pair were present in the house with this code:
obs_tbl %>%
group_by(Alice, Bob) %>%
summarise(n())
However, there are 2 flaws with this approach.
Each pair is being put in manually, when every combination I need to check is already in combo_tbl.
The output automatically outputs the intersection of every combination. i.e. I get the count of values where both Alice and Bob == 1, but also where Alice ==1 and Bob == 0, Alice == 0 and Bob ==1, etc.
The ideal end result would be an output like so:
Alice | Bob | 2
Alice | Cat | 2
Bob | Cat | 2
i.e. The count of co-habitation days per pair.
A perfect solution would allow simple modification to increase the number of selections within the combination, i.e. each combo_id may have 3 or more selections, from a larger list than the one given.
So, is it possible on sparklyr to pass a vector of pairs that are iterated through?
How do I only check for where both of my selections are present? Instead of a vectorised group_by should I use a vectorised filter?
I've read about quosures and standard evaluation in the tidyverse. Is that the solution to this if running locally? And if so is this supported by spark?
For reference, I have a relatively similar solution using data.table that can be run on a single-machine, non-spark context. Some pseudo code:
combo_dt[, obs_dt[get(tolower(selection_1)) == "1" &
get(tolower(selection_2)) == "1"
, .N], by = combo_id]
This nested process effectively splits each combination into its own sub-table: by = combo_id, and then for that sub-table filters where selection_1 and selection_2 are 1, and then applies .N to count the rows in that sub-table, and then aggregates the output.
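For comparison, the same per-combination count can be sketched locally in base R on plain data frames. This is not a sparklyr solution and makes no claim about what Spark supports; it only illustrates the logic on the example data, with combo, obs, sel_cols and n_days as made-up names.

```r
combo <- data.frame(
  combo_id    = c("combo1", "combo2", "combo3"),
  selection_1 = c("Alice", "Alice", "Bob"),
  selection_2 = c("Bob", "Cat", "Cat")
)
obs <- data.frame(
  obs_day = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
  Alice   = c(1, 1, 0, 1, 0, 1, 0),
  Bob     = c(1, 1, 1, 0, 0, 0, 0),
  Cat     = c(0, 1, 1, 1, 1, 0, 0)
)

# Works for any number of selection_* columns
sel_cols <- grep("^selection_", names(combo), value = TRUE)

# For each combination, count the days on which every selected member was present
combo$n_days <- vapply(seq_len(nrow(combo)), function(i) {
  members <- as.character(unlist(combo[i, sel_cols], use.names = FALSE))
  sum(rowSums(obs[, members, drop = FALSE] == 1) == length(members))
}, integer(1))

combo
```

Only rows where all selected members equal 1 are counted, which avoids the extra 1/0 combinations that group_by() produces.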
