subtracting values in dataframe with condition(s) from another dataframe - r

I have two data frames containing sales data from a fruit store.
The first data frame has the sales data from 'Store A',
and the second data frame has the combined data from 'Store A + Store B':
StoreA = data.frame(
Fruits = c('Apple', 'Banana', 'Blueberry'),
Customer = c('John', 'Peter', 'Jenny'),
Quantity = c(2, 3, 1)
)
Total = data.frame(
Fruits = c('Blueberry', 'Apple', 'Banana', 'Blueberry', 'Pineapple'),
Customer = c('Jenny' , 'John', 'Peter', 'John', 'Peter'),
Quantity = c(4, 7, 3, 5, 3)
)
StoreA
Total
I wish to subtract the sales data of 'StoreA' from 'Total' to get sales data for 'StoreB'.
At the end, I wish to have something like

Great question! There is a simple and graceful way of achieving exactly what you want.
The title of this question is: "subtracting values in dataframe with condition(s) from another dataframe"
The subtraction can be done just as the title says, but it is often easier to turn a subtraction problem into an addition problem.
To do that, convert the Quantity variable of one data frame (StoreA$Quantity) into negative values, and only that variable. Then rename the other data frame (Total) to StoreB.
Once that is done, it's easy to finish. Join the two data frames (StoreA_ and StoreB) with full_join(). Because the negative quantities never equal the positive ones, the join keeps every row from both tables, so a fruit/customer pair sold in both appears once with a negative value and once with a positive value. Rows like that clearly need to be combined.
To combine those matching items, use group_by() and pipe the result into summarize(). Written this way, the code is easy to read and can almost be read like a book.
Create data frames:
StoreA = data.frame(Fruits = c('Apple', 'Banana', 'Blueberry'), Customer = c('John', 'Peter', 'Jenny'), Quantity = c(2,3,1))
StoreB = data.frame(Fruits = c('Blueberry','Apple', 'Banana', 'Blueberry', 'Pineapple'), Customer = c('Jenny' ,'John', 'Peter', 'John', 'Peter'), Quantity = c(4,7,3,5,3))
Convert StoreA$Quantity to negative values:
StoreA_ <- StoreA
StoreA_$Quantity <- StoreA$Quantity * -1
StoreA_
StoreA_ now looks like this:
Fruits Customer Quantity
<fct> <fct> <dbl>
Apple John -2
Banana Peter -3
Blueberry Jenny -1
Now combine StoreA and StoreB. Use the full_join() function to join the two stores:
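library(dplyr)
# join on all shared columns; no negative quantity matches a positive one, so every row is kept
Total <- full_join(StoreA_, StoreB, by = c("Fruits", "Customer", "Quantity"))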
Total
The last thing is accomplished using the group_by function. This will combine the positive and negative values together.
Total %>% group_by(Fruits, Customer) %>% summarize(s = sum(Quantity))
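For readers without dplyr, the same negate-then-sum idea can be sketched in base R. This is only an illustration under the question's sample data; the names StoreA_neg, stacked and store_b are made up for this sketch, and rbind() is swapped in for the join because stacking the rows is all the join achieves here.

```r
# Sample data from the question
StoreA <- data.frame(
  Fruits = c("Apple", "Banana", "Blueberry"),
  Customer = c("John", "Peter", "Jenny"),
  Quantity = c(2, 3, 1)
)
Total <- data.frame(
  Fruits = c("Blueberry", "Apple", "Banana", "Blueberry", "Pineapple"),
  Customer = c("Jenny", "John", "Peter", "John", "Peter"),
  Quantity = c(4, 7, 3, 5, 3)
)

# Flip the sign of Store A's sales so that summing performs the subtraction
StoreA_neg <- transform(StoreA, Quantity = -Quantity)

# Stack the rows, then add up Quantity for every Fruits/Customer pair
stacked <- rbind(StoreA_neg, Total)
store_b <- aggregate(Quantity ~ Fruits + Customer, data = stacked, FUN = sum)

# Sort for a stable, readable result
store_b <- store_b[order(store_b$Fruits, store_b$Customer), ]
print(store_b)
```

Pairs that only occur in Total (for example Pineapple/Peter) pass through unchanged, because nothing negative is added to them.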
It's Done! The output is shown at this link:

You could do a full join first, then rename the columns, fill the missing values resulting from the join and then compute the difference.
library(tidyverse)
StoreA = data.frame(Fruits = c('Apple', 'Banana', 'Blueberry'), Customer = c('John', 'Peter', 'Jenny'), Quantity = c(2,3,1))
Total = data.frame(Fruits = c('Blueberry','Apple', 'Banana', 'Blueberry', 'Pineapple'), Customer = c('Jenny' ,'John', 'Peter', 'John', 'Peter'), Quantity = c(4,7,3,5,3))
full_join(
  StoreA %>% rename(Qty_A = Quantity),
  Total %>% rename(Qty_Total = Quantity),
  by = c("Fruits", "Customer")
) %>%
  # fill NAs with zero
  replace_na(list(Qty_A = 0)) %>%
  # compute the difference
  mutate(Qty_B = Qty_Total - Qty_A)
#> Fruits Customer Qty_A Qty_Total Qty_B
#> 1 Apple John 2 7 5
#> 2 Banana Peter 3 3 0
#> 3 Blueberry Jenny 1 4 3
#> 4 Blueberry John 0 5 5
#> 5 Pineapple Peter 0 3 3
Created on 2020-09-28 by the reprex package (v0.3.0)
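The join-then-subtract approach also works without the tidyverse. The following is a rough base R sketch using merge(); the column names Quantity_A, Quantity_Total and Quantity_B come from the suffixes chosen here and are not part of the original answer.

```r
StoreA <- data.frame(
  Fruits = c("Apple", "Banana", "Blueberry"),
  Customer = c("John", "Peter", "Jenny"),
  Quantity = c(2, 3, 1)
)
Total <- data.frame(
  Fruits = c("Blueberry", "Apple", "Banana", "Blueberry", "Pineapple"),
  Customer = c("Jenny", "John", "Peter", "John", "Peter"),
  Quantity = c(4, 7, 3, 5, 3)
)

# Full outer join on the two key columns; the clashing Quantity columns get suffixes
joined <- merge(
  StoreA, Total,
  by = c("Fruits", "Customer"),
  all = TRUE,
  suffixes = c("_A", "_Total")
)

# Pairs missing from one table are treated as zero sales there
joined$Quantity_A[is.na(joined$Quantity_A)] <- 0
joined$Quantity_Total[is.na(joined$Quantity_Total)] <- 0

# Store B is whatever remains after removing Store A from the total
joined$Quantity_B <- joined$Quantity_Total - joined$Quantity_A
print(joined)
```

Filling Quantity_Total with zero as well is an extra assumption: it only matters if Store A contains a pair that is absent from Total, in which case Quantity_B would come out negative and flag the inconsistency.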

Related

How to get one row per unique ID with multiple columns per values of particular column

I have a dataset that looks like (A) and I'm trying to get (B):
#(A)
event <- c('A', 'A', 'A', 'B', 'B', 'C', 'D', 'D', 'D')
person <- c('Ann', 'Sally', 'Ryan', 'Ann', 'Ryan', 'Sally', 'Ann', 'Sally', 'Ryan')
birthday <- c('1990-10-10', NA, NA, NA, '1985-01-01', NA, '1990-10-10', '1950-04-02', NA)
data <- data.frame(event, person, birthday)
#(B)
person <- c('Ann', 'Sally', 'Ryan')
A <- c(1, 1, 1)
B <- c(1, 0, 1)
C <- c(0, 1, 0)
D <- c(1, 1, 1)
birthday <- c('1990-10-10', '1950-04-02', '1985-01-01')
data <- data.frame(person, A, B, C, D, birthday)
Basically, I have a sign-up list of events and can see people who attended various ones. I want to get a list of all the unique people with columns for which events they did/didn't attend. I also got profile data from some of the events, but some had more data than others - so I also want to keep the most filled out data (i.e. couldn't identify Ryan's birthday from event D but could from event B).
I've tried looking up many different things but get confused between whether I should be looking at reshaping, vs. dcast, vs. spread/gather... new to R so any help is appreciated!
EDIT: Additional question: instead of indicating 1/0 for whether someone went to an event, if multiple events were in the same category, how would you count how many times someone went to that category of event? E.g., I would have events called A1, A2, and A3 in the dataset as well. The final table would still have a column called A, but instead of just 1/0, it would say 0 if the person attended no A events, and 1, 2, or 3 if the person attended 1, 2, or 3 A events.
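A data.table option
library(data.table)
dcast(
  setDT(data),
  # fill in each person's known birthday so it can serve as an id column
  person + na.omit(birthday)[match(person, person[!is.na(birthday)])] ~ event,
  fun.aggregate = length
)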
gives
person birthday A B C D
1: Ann 1990-10-10 1 1 0 1
2: Ryan 1985-01-01 1 1 0 1
3: Sally 1950-04-02 1 0 1 1
A base R option using reshape
reshape(
transform(
data,
birthday = na.omit(birthday)[match(person, person[!is.na(birthday)])],
cnt = 1
),
direction = "wide",
idvar = c("person", "birthday"),
timevar = "event"
)
gives
person birthday cnt.A cnt.B cnt.C cnt.D
1 Ann 1990-10-10 1 1 NA 1
2 Sally 1950-04-02 1 NA 1 1
3 Ryan 1985-01-01 1 1 NA 1
First, isolate the birthdays, which are not represented cleanly in your table; then reshape, and finally merge the birthdays back in.
Using the package reshape2 :
birthdays <- unique(data[!is.na(data$birthday),c("person","birthday")])
reshaped <- reshape2::dcast(data,person ~ event, value.var = "event",fun.aggregate = length)
final <- merge(reshaped,birthdays)
Explanation: I told reshape2::dcast to put person into rows and event into columns, and to count every occurrence of each event (using the aggregation function length).
EDIT: for your additional question, it works the same way; just apply substr() to the event variable:
reshaped <- reshape2::dcast(data,person ~ substr(event,1,1), value.var = "event",fun.aggregate = length)
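To make the category count concrete, here is a small base R sketch. The events data frame below is invented for illustration (the question only describes events A1, A2 and A3 in words), and it assumes that the category is always the first character of the event name.

```r
# Hypothetical sign-up list with several events per category
events <- data.frame(
  event = c("A1", "A2", "A3", "B1", "A1", "C1"),
  person = c("Ann", "Ann", "Ann", "Ann", "Ryan", "Sally"),
  stringsAsFactors = FALSE
)

# Assumption: the first character of the event name is its category
events$category <- substr(events$event, 1, 1)

# Cross-tabulate person by category; each cell is the number of events attended
counts <- as.data.frame.matrix(table(events$person, events$category))
counts$person <- rownames(counts)
print(counts)
```

A person with no events in a category gets 0 rather than NA, which matches the table asked for in the edit.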

R! mutate conditional and list intersect (How many times was a player on the court?)

This is a sport analysis question: how many times was a player on the court?
I have a list of players I am interested in
names <- c('John','Bill','Peter')
and a list of actions during multiple matches
team <- c('teama','teama','teama','teama','teama','teama','teamb','teamb')
player1 <- c('John', 'John', 'John', 'Bill', 'Mike', 'Mike', 'Steve', 'Steve')
player2 <- c('Mike', 'Mike', 'Mike', 'John', 'Bill', 'Bill', 'Peter', 'Bob')
df <- data.frame(team,player1,player2)
I want to build a column that lists, for each player, the number of actions during which that player was on the court
actions_when_player_on_court <- df %>% group_by(team) %>%
calculate({nb of observation where the player is either player1 or player2} )
so I end up with a new list like
actions_when_player_on_court <- c(4,3,1)
so I can create a new DF like this
new_df <- data.frame(names, actions_when_player_on_court)
where John appears 4 times on the court, Bill three times, and Peter once.
I feel I may need to intersect names with c(player1, player2), especially if
names are unique (John, Bill and Peter cannot belong to other teams and are unique in df).
I may have 0 to n players on the field, so 0 to n columns (player1, player2, ..., playern).
The following code should do what you need.
We first need to create a new data frame to store all names and an empty actions_when_player_on_court variable.
names = c()
for (i in 2:ncol(df)) {
names = c(names, unique(df[,i]))
}
names = data.frame(name = unique(names), actions_when_player_on_court = 0)
Then, we can fill the actions_when_player_on_court variable using a for loop:
df$n = 1
for (i in 2:(ncol(df)-1)) {
tmp = aggregate(cbind(n = n) ~ df[, i], data = df[, c(i, ncol(df))], FUN="sum")
names(tmp)[1] = "name"
names = merge(names, tmp, all=T)
names[is.na(names)] = 0
names$actions_when_player_on_court = names$actions_when_player_on_court + names$n
names = names[-ncol(names)]
}
You can have as many players as you want, as long as the player columns start with the second column and run until the end of the data frame. Note that the resulting data frame does not include the team variable. I think you can deal with that yourself. Here is the result:
> names
name actions_when_player_on_court
1 Bill 3
2 Bob 1
3 John 4
4 Mike 5
5 Peter 1
6 Steve 2
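If only the players of interest are needed, the loop can probably be replaced by a single tabulation. The sketch below is an alternative, not the answer's method; players_of_interest and player_cols are names introduced here, and it assumes every column other than team holds player names.

```r
players_of_interest <- c("John", "Bill", "Peter")

df <- data.frame(
  team = c("teama", "teama", "teama", "teama", "teama", "teama", "teamb", "teamb"),
  player1 = c("John", "John", "John", "Bill", "Mike", "Mike", "Steve", "Steve"),
  player2 = c("Mike", "Mike", "Mike", "John", "Bill", "Bill", "Peter", "Bob"),
  stringsAsFactors = FALSE
)

# Assumption: every column except team is a player column (player1 ... playern)
player_cols <- setdiff(names(df), "team")

# Pool all player columns and count appearances; fixing the factor levels
# keeps the requested order and reports 0 for players who never appear
appearances <- table(factor(unlist(df[player_cols]), levels = players_of_interest))

new_df <- data.frame(
  names = players_of_interest,
  actions_when_player_on_court = as.integer(appearances)
)
print(new_df)
```

This counts one appearance per cell, so a player listed in two columns of the same row would be counted twice; the sample data has no such row.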

R using melt() and dcast() with categorical and numerical variables at the same time

I am a newbie in programming with R, and this is my first question ever here on Stackoverflow.
Let's say that I have a data frame with 4 columns:
(1) Individual ID (numeric);
(2) Morality of the individual (factor);
(3) The city (factor);
(4) Numbers of books possessed (numeric).
Person_ID <- c(1,2,3,4,5,6,7,8,9,10)
Morality <- c("Bad guy","Bad guy","Bad guy","Bad guy","Bad guy",
"Good guy","Good guy","Good guy","Good guy","Good guy")
City <- c("NiceCity", "UglyCity", "NiceCity", "UglyCity", "NiceCity",
"UglyCity", "NiceCity", "UglyCity", "NiceCity", "UglyCity")
Books <- c(0,3,6,9,12,15,18,21,24,27)
mydf <- data.frame(Person_ID, City, Morality, Books)
I am using this code in order to get the counts by each category for the variable Morality in each city:
mycounts <- melt(mydf,
                 id.vars = c("City"),
                 measure.vars = c("Morality")) %>%
  dcast(City ~ variable + value,
        value.var = "value", fill = 0, fun.aggregate = length)
The code gives this kind of table with the counts:
names(mycounts)<-gsub("Morality_","",names(mycounts))
mycounts
City Bad guy Good guy
1 NiceCity 3 2
2 UglyCity 2 3
I wonder if there is a similar way to use dcast() for numerical variables (inside the same script), e.g. in order to get the sum of the Books possessed by all individuals living in each city:
#> City Bad guy Good guy Books
#>1 NiceCity 3 2 [Total number of books in NiceCity]
#>2 UglyCity 2 3 [Total number of books in UglyCity]
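Do you mean something like this?
library(reshape2)
library(magrittr)
mydf %>%
  melt(
    # keep Books as an id variable so it is still available to dcast()
    id.vars = c("City", "Books"),
    measure.vars = c("Morality")
  ) %>%
  dcast(
    City ~ variable + value,
    value.var = "Books",
    fill = 0,
    fun.aggregate = sum
  )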
#> City Morality_Bad guy Morality_Good guy
#> 1 NiceCity 18 42
#> 2 UglyCity 12 63
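The table sketched in the question has both the counts per Morality and one Books total per city. One way to get that shape, shown here in base R as a rough sketch rather than with dcast(), is to build the two summaries separately and merge them on City. The names counts, books and result are introduced for this example.

```r
mydf <- data.frame(
  Person_ID = 1:10,
  City = rep(c("NiceCity", "UglyCity"), times = 5),
  Morality = rep(c("Bad guy", "Good guy"), each = 5),
  Books = c(0, 3, 6, 9, 12, 15, 18, 21, 24, 27),
  stringsAsFactors = FALSE
)

# Counts of each Morality category per city (categorical summary)
counts <- as.data.frame.matrix(table(mydf$City, mydf$Morality))
counts$City <- rownames(counts)

# Total number of books per city (numerical summary)
books <- aggregate(Books ~ City, data = mydf, FUN = sum)

# One row per city with both kinds of summary side by side
result <- merge(counts, books, by = "City")
print(result)
```

Keeping the two aggregations separate avoids forcing length and sum through a single fun.aggregate.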

Creating numeric variable based on string intersection in R

I'm attempting to create a numeric variable based on the intersection of strings with R's dplyr package.
I have a list of columns containing codes for thousands of individuals who made purchases at an auto dealership. The codes can represent a purchase of a car, internal parts for a car, or items for the exterior of a car. I want to denote codes identified as a car purchase with 2, items for the interior of a car with 1, and items for the exterior of a car with 0. If the customer purchased a car, I want the column LargestPurchase = 2; if the customer didn't buy a car but bought an interior component, I would like the column LargestPurchase = 1; and if the customer did not buy a car or interior component I would like the column LargestPurchase = 0.
The codes for a car purchase are located in a separate data frame with column CarCodes, and the codes for the interior components of a car are located in a separate data frame with column InteriorCodes. Each contain thousands of codes.
The data for the customers would look like the following (called customers):
Customer1 PurchaseCode1 PurchaseCode2 PurchaseCode3
001 STW387 K987 W9333
002 AZ326 CP993 EN499
003 BKY98 A0091 C2001
Example:
df1$CarCodes = c('STW387', 'W9333')
df2$InteriorCodes = c('K987', 'AZ326')
Customer1 PurchaseCode1 PurchaseCode2 PurchaseCode3 LargestPurchase
001 STW387 K987 W9333 2
002 AZ326 CP993 EN499 1
003 BKY98 A0091 C2001 0
I attempted to use the following ifelse function with mutate, but it does not seem to work with strings:
customers <- customers %>% mutate(LargestPurchase =
(ifelse(intersect(customers$PurchaseCode1, df1$CarCodes) == TRUE |
intersect(customers$PurchaseCode2, df1$CarCodes) |
intersect(customers$PurchaseCode3, df1$CarCodes), 2, ifelse(
intersect(customers$PurchaseCode1, df2$InteriorCodes) == TRUE |
intersect(customers$PurchaseCode2, df2$InteriorCodes) == TRUE |
intersect(customers$PurchaseCode3, df3$InteriorCodes) == TRUE, 1, 0)))
Any insight would be great.
Here is a dplyr version
library(dplyr)
library(tidyr)  # for gather()
CarCodes = c('STW387', 'W9333')
InteriorCodes = c('K987', 'AZ326')
data.frame(customer = c(001, 002, 003),
           code1 = c('STW387', 'AZ326', 'BKY98'),
           code2 = c('K987', 'CP993', 'A0091'),
           code3 = c('W9333', 'EN499', 'C2001')) %>%
  gather(variable, value, -customer) %>%
  mutate(purchase = case_when(value %in% CarCodes ~ 2,
                              value %in% InteriorCodes ~ 1,
                              TRUE ~ 0)) %>%
  group_by(customer) %>%
  summarise(largest_purchase = max(purchase))
Determine whether any of a customer's codes are contained in CarCodes or InteriorCodes, and then use the max value (here df3 is the customers data frame).
c2 <- apply(df3[,-1], 1, function(x) ifelse(any(x %in% df2$InteriorCodes), 1, 0))
c1 <- apply(df3[,-1], 1, function(x) ifelse(any(x %in% df1$CarCodes), 2, 0))
df3$LargestPurchase <- pmax(c1, c2)
Customer1 PurchaseCode1 PurchaseCode2 PurchaseCode3 LargestPurchase
1 1 STW387 K987 W9333 2
2 2 AZ326 CP993 EN499 1
3 3 BKY98 A0091 C2001 0
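As for why the attempt in the question fails: intersect() returns the set of values two vectors share, not one TRUE/FALSE per row, so it cannot drive ifelse(). The %in% operator gives the per-element test. A minimal base R sketch of the difference, using the question's sample codes and hypothetical names car_codes, interior_codes and code_cols:

```r
car_codes <- c("STW387", "W9333")
interior_codes <- c("K987", "AZ326")

customers <- data.frame(
  Customer1 = c("001", "002", "003"),
  PurchaseCode1 = c("STW387", "AZ326", "BKY98"),
  PurchaseCode2 = c("K987", "CP993", "A0091"),
  PurchaseCode3 = c("W9333", "EN499", "C2001"),
  stringsAsFactors = FALSE
)

# intersect() yields the shared values themselves: a character vector, not a logical
shared <- intersect(customers$PurchaseCode1, car_codes)

# %in% yields one logical per code; reshape it back to customers x code columns
code_cols <- c("PurchaseCode1", "PurchaseCode2", "PurchaseCode3")
codes <- as.matrix(customers[code_cols])
has_car <- rowSums(matrix(codes %in% car_codes, nrow = nrow(codes))) > 0
has_interior <- rowSums(matrix(codes %in% interior_codes, nrow = nrow(codes))) > 0

# A car outranks an interior part, which outranks everything else
customers$LargestPurchase <- ifelse(has_car, 2, ifelse(has_interior, 1, 0))
print(customers)
```

Because %in% is a hash-based lookup, this should stay reasonably fast with thousands of codes, though that has not been measured here.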

Match part of a string in a dataframe and replace it by entry of another dataframe

I'm fairly new to R and I'm running into the following problem.
Let's say I have the following data frames:
sale_df <- data.frame("Cheese" = c("cheese-01", "cheese-02", "cheese-03"), "Number_of_sales" = c(4, 8, 23))
id_df <- data.frame("ID" = c(1, 2, 3), "Name" = c("Leerdammer", "Gouda", "Mozerella"))
What I want to do is match the numbers of the first column of id_df to the numbers in the string of the first column of sale_df.
Then I want to replace the value in sale_df by the value in the second column of id_df, i.e. I want cheese-01 to become "Leerdammer".
Does anyone have any idea how I could solve this?
With tidyverse:
library(tidyverse)
sale_df %>%
  # pull out the digits after "cheese-" and convert them to a numeric ID
  mutate(ID = as.numeric(str_extract(Cheese, "(?<=cheese-).*"))) %>%
  inner_join(id_df, by = "ID")
# Cheese Number_of_sales ID Name
#1 cheese-01 4 1 Leerdammer
#2 cheese-02 8 2 Gouda
#3 cheese-03 23 3 Mozerella
Assuming that all entries for Cheese in sale_df will start with cheese-, here is a simple solution.
sale_df$CheeseID <- as.numeric(substring(sale_df$Cheese, 8))
merge(sale_df, id_df, by.x = "CheeseID", by.y = "ID", all.x = TRUE)
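A base R one-liner with match():
# look up each sale's numeric ID in id_df$ID, then take the corresponding Name
sale_df$Number_of_sales=id_df$Name[match(as.numeric(gsub("\\D","",sale_df$Cheese)),id_df$ID)]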
> sale_df
Cheese Number_of_sales
1 cheese-01 Leerdammer
2 cheese-02 Gouda
3 cheese-03 Mozerella
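The question asks for the Cheese column itself to be replaced (cheese-01 becoming "Leerdammer") while the sales figures stay intact. A minimal base R sketch of that variant follows; sale_ids is a helper name introduced here, and the code assumes every entry has the form cheese-<number>.

```r
sale_df <- data.frame(
  Cheese = c("cheese-01", "cheese-02", "cheese-03"),
  Number_of_sales = c(4, 8, 23),
  stringsAsFactors = FALSE
)
id_df <- data.frame(
  ID = c(1, 2, 3),
  Name = c("Leerdammer", "Gouda", "Mozerella"),
  stringsAsFactors = FALSE
)

# Assumption: every entry looks like "cheese-<number>"; strip the prefix, keep the number
sale_ids <- as.numeric(sub("^cheese-", "", sale_df$Cheese))

# match(x, table): for each sale ID, find its row in id_df, then take that row's Name
sale_df$Cheese <- id_df$Name[match(sale_ids, id_df$ID)]
print(sale_df)
```

An ID with no entry in id_df would produce NA in Cheese, which makes unmatched rows easy to spot.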

Resources