RStudio: Summation of two datasets of different lengths - r

I have two datasets A and B:
Dataset A (called Sales) has the following data:
ID Person Sales
1 1 100
2 2 300
3 3 400
4 4 200
5 5 50
Dataset B (called Account_Scenarios) has the following data (note: there are many more rows in dataset B; I have included only the first 6):
ID Scenario Person Upkeep
1 1 1 -10
2 1 2 -200
3 2 1 -150
4 3 4 -50
5 3 3 -100
6 4 5 -500
I want to add a column called 'Profit' in dataset B such that I am able to see the profit per person per scenario (Profit = Sales + Upkeep). For example as below:
ID Scenario Person Upkeep Profit
1 1 1 -10 90
2 1 2 -200 100
3 2 1 -150 -50
4 3 4 -50 150
5 3 3 -100 300
6 4 5 -500 -450
What is the best way to do this? I am new to R and am trying to use the aggregate function, but it requires the arguments to be the same length.
Account_Scenarios$Profit <- aggregate(Sales[,c('Sales')], Account_Scenarios[,c('Upkeep')], by=list(Sales$Person), 'sum')

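Assuming that Sales$Person contains only unique values, you can look up each person's sales and add the upkeep (Profit = Sales + Upkeep, so the two terms are added, not subtracted):
Account_Scenarios$Profit <- Account_Scenarios$Upkeep + Sales$Sales[sapply(Account_Scenarios$Person, function(x) which(Sales$Person == x))]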
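As a hedged alternative, the same lookup can be sketched with base R's match(), which returns the row of Sales for each Person in Account_Scenarios. The data frames below are re-typed from the question's sample rows (the ID column is omitted), and the sketch assumes that Sales$Person has no duplicates.

```r
# Sample data re-typed from the question (first six rows of Account_Scenarios only)
Sales <- data.frame(Person = 1:5, Sales = c(100, 300, 400, 200, 50))
Account_Scenarios <- data.frame(
  Scenario = c(1, 1, 2, 3, 3, 4),
  Person   = c(1, 2, 1, 4, 3, 5),
  Upkeep   = c(-10, -200, -150, -50, -100, -500)
)

# match() gives, for each scenario row, the position of that person in Sales
idx <- match(Account_Scenarios$Person, Sales$Person)

# Profit = Sales + Upkeep; a person missing from Sales would yield NA
Account_Scenarios$Profit <- Sales$Sales[idx] + Account_Scenarios$Upkeep
print(Account_Scenarios)
```

Because match() is vectorised, it avoids calling which() once per row, which may matter when Account_Scenarios is large.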

I would left_join the two datasets on the Person variable, then calculate the profit:
library(tidyverse)
A <- A %>% select(Person, Sales) # Only need the two variables for the join
df <- left_join(B, A, by = "Person") %>%
mutate(Profit = Sales + Upkeep)

Another option is the sqldf library, which lets you write the join in SQL:
library(sqldf)
A <- data.frame(Person=1:5, Sales=c(100,300,400,200,50))
B <- data.frame(Scenario=c(1,1,2,3,3,4), Person=c(1,2,1,4,3,5), Upkeep=c(-10,-200,-150,-50,-100,-500))
B <- sqldf("SELECT B.*, A.Sales + B.Upkeep as Profit FROM B JOIN A on B.Person = A.Person")


gather() per grouped variables in R for specific columns

I have a long data frame with players' decisions who worked in groups.
I need to convert the data in such a way that each row (individual observation) would contain all group members decisions (so we basically can see whether they are interdependent).
Let's say the generating code is:
group_id <- c(rep(1, 3), rep(2, 3))
player_id <- c(rep(seq(1, 3), 2))
player_decision <- seq(10,60,10)
player_contribution <- seq(6,1,-1)
df <-
data.frame(group_id, player_id, player_decision, player_contribution)
So the initial data looks like:
group_id player_id player_decision player_contribution
1 1 1 10 6
2 1 2 20 5
3 1 3 30 4
4 2 1 40 3
5 2 2 50 2
6 2 3 60 1
But I need to convert it to wide format within each group, and only for some of the variables (in this example, player_contribution), in such a way that the rest of the data remains. So the head of the converted data would be:
data.frame(group_id=c(1,1),
player_id=c(1,2),
player_decision=c(10,20),
player_1_contribution=c(6,6),
player_2_contribution=c(5,5),
player_3_contribution=c(4,4)
)
group_id player_id player_decision player_1_contribution player_2_contribution player_3_contribution
1 1 1 10 6 5 4
2 1 2 20 6 5 4
I suspect I need to group_by in dplyr and then somehow gather per group but only for player_contribution (or a vector of variables). But I really have no clue how to approach it. Any hints would be welcome!
Here is a solution using tidyr and dplyr.
Make a data frame with one column per player's contribution, then join it back onto the columns of interest from the original data frame.
library(tidyr)
library(dplyr)
wide<-pivot_wider(df, id_cols= - player_decision,
names_from = player_id,
values_from = player_contribution,
names_prefix = "player_contribution_")
answer<-left_join(df[, c("group_id", "player_id", "player_decision") ], wide, by = "group_id")
answer
group_id player_id player_decision player_contribution_1 player_contribution_2 player_contribution_3
1 1 1 10 6 5 4
2 1 2 20 6 5 4
3 1 3 30 6 5 4
4 2 1 40 3 2 1
5 2 2 50 3 2 1
6 2 3 60 3 2 1
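For comparison, here is a minimal base R sketch of the same two steps (widen the contributions per group, then join back), using reshape() and merge() in place of pivot_wider() and left_join(). This is a swapped-in technique, not the answer's own code; the data frame is rebuilt from the question.

```r
# Data from the question
df <- data.frame(
  group_id            = c(rep(1, 3), rep(2, 3)),
  player_id           = rep(1:3, 2),
  player_decision     = seq(10, 60, 10),
  player_contribution = seq(6, 1, -1)
)

# One row per group, one contribution column per player
wide <- reshape(
  df[, c("group_id", "player_id", "player_contribution")],
  idvar     = "group_id",
  timevar   = "player_id",
  direction = "wide",
  sep       = "_"
)

# Attach the group-level columns to every individual row
answer <- merge(df[, c("group_id", "player_id", "player_decision")],
                wide, by = "group_id")
answer <- answer[order(answer$group_id, answer$player_id), ]
print(answer)
```

To widen several variables at once, the extra column names would be added to the subset passed to reshape(); each then gets its own set of per-player columns.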

Count number of shared observations between samples using dplyr

I have a list of observations grouped by samples. I want to find the samples that share the most identical observations. An identical observation is where the start and end number are both matching between two samples. I'd like to use R and preferably dplyr to do this if possible.
I've been getting used to using dplyr for simpler data handling but this task is beyond what I am currently able to do. I've been thinking the solution would involve grouping the start and end into a single variable: group_by(start,end) but I also need to keep the information about which sample each observation belongs to and compare between samples.
example:
sample start end
a 2 4
a 3 6
a 4 8
b 2 4
b 3 6
b 10 12
c 10 12
c 0 4
c 2 4
Here samples a, b and c share 1 observation (2, 4)
sample a and b share 2 observations (2 4, 3 6)
sample b and c share 2 observations (2 4, 10 12)
sample a and c share 1 observation (2 4)
I'd like an output like:
abc 1
ab 2
bc 2
ac 1
and also to see what the shared observations are if possible:
abc 2 4
ab 2 4
ab 3 6
etc
Thanks in advance
Here's something that should get you going:
df %>%
group_by(start, end) %>%
summarise(
samples = paste(unique(sample), collapse = ""),
n = length(unique(sample)))
# Source: local data frame [5 x 4]
# Groups: start [?]
#
# start end samples n
# <int> <int> <chr> <int>
# 1 0 4 c 1
# 2 2 4 abc 3
# 3 3 6 ab 2
# 4 4 8 a 1
# 5 10 12 bc 2
Here is an idea via base R,
final_d <- data.frame(count1 = sapply(Filter(nrow, split(df, list(df$start, df$end))), nrow),
pairs1 = sapply(Filter(nrow, split(df, list(df$start, df$end))), function(i) paste(i[[1]], collapse = '')))
# count1 pairs1
#0.4 1 c
#2.4 3 abc
#3.6 2 ab
#4.8 1 a
#10.12 2 bc
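Both answers above report which samples contain each observation, but not the pairwise counts the question asks for (ab 2, bc 2, ac 1). One possible sketch, in base R and not taken from either answer, builds an observation-by-sample incidence matrix and multiplies it with itself via crossprod(). The data frame is re-typed from the question's example.

```r
df <- data.frame(
  sample = rep(c("a", "b", "c"), each = 3),
  start  = c(2, 3, 4, 2, 3, 10, 10, 0, 2),
  end    = c(4, 6, 8, 4, 6, 12, 12, 4, 4)
)

# Label each observation by its start/end pair
obs <- paste(df$start, df$end)

# Incidence matrix: rows are observations, columns are samples, 1 = present
inc <- 1 * (unclass(table(obs, df$sample)) > 0)

# Entry [i, j] counts the observations present in both sample i and sample j
shared <- crossprod(inc)
print(shared)

# Observations present in every sample
n_all <- sum(rowSums(inc) == ncol(inc))

# The shared observations themselves for one pair, here a and b
shared_ab <- rownames(inc)[inc[, "a"] == 1 & inc[, "b"] == 1]
print(shared_ab)
```

The diagonal of shared holds the number of distinct observations in each sample; the off-diagonal entries are the pairwise overlaps.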

Perform operations on a data frame based on a factor

I'm having a hard time describing this, so it's best explained with an example (as can probably be seen from the poor question title).
Using dplyr, I have a data frame that is the result of a group_by and summarize, and I want to do some further manipulation on it by factor.
As an example, here's a data frame that looks like the result of my dplyr operations:
> df <- data.frame(run=as.factor(c(rep(1,3), rep(2,3))),
group=as.factor(rep(c("a","b","c"),2)),
sum=c(1,8,34,2,7,33))
> df
run group sum
1 1 a 1
2 1 b 8
3 1 c 34
4 2 a 2
5 2 b 7
6 2 c 33
I want to divide sum by a value that depends on run. For example, if I have:
> total <- data.frame(run=as.factor(c(1,2)),
total=c(45,47))
> total
run total
1 1 45
2 2 47
Then my final data frame will look like this:
> df
run group sum percent
1 1 a 1 1/45
2 1 b 8 8/45
3 1 c 34 34/45
4 2 a 2 2/47
5 2 b 7 7/47
6 2 c 33 33/47
I inserted the fractions in the percent column by hand to show the operation I want to perform.
I know there is probably some dplyr way to do this with mutate but I can't seem to figure it out right now. How would this be accomplished?
(In base R)
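You can use total as a look-up table to get the total for each run in df. This works because indexing with the factor df$run uses its integer codes, so it relies on the rows of total being in the same order as the factor levels: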
total[df$run,'total']
[1] 45 45 45 47 47 47
And you simply use it to divide the sum and assign the result to a new column:
df$percent <- df$sum / total[df$run,'total']
run group sum percent
1 1 a 1 0.02222222
2 1 b 8 0.17777778
3 1 c 34 0.75555556
4 2 a 2 0.04255319
5 2 b 7 0.14893617
6 2 c 33 0.70212766
If your "run" values are 1, 2, ..., n, then this will work:
divisor <- c(45,47) # c(45,47,...up to n divisors)
df$percent <- df$sum/divisor[df$run]
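Both index-based lookups above depend on the run values lining up with row or vector positions. A minimal sketch that matches on the run values themselves, so the order of the rows in total does not matter, could use match(); the reversed row order of total below is a deliberate change from the question, made only to illustrate this.

```r
df <- data.frame(run   = as.factor(c(rep(1, 3), rep(2, 3))),
                 group = as.factor(rep(c("a", "b", "c"), 2)),
                 sum   = c(1, 8, 34, 2, 7, 33))

# Rows deliberately listed in reverse order to show that order does not matter
total <- data.frame(run = as.factor(c(2, 1)), total = c(47, 45))

# match() compares factors by their labels, not their integer codes
idx <- match(df$run, total$run)
df$percent <- df$sum / total$total[idx]
print(df)
```

A run that is absent from total would produce NA in percent instead of silently picking up the wrong divisor.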
First, merge the total values into your df:
df2 <- merge(df, total, by = "run")
Then you can call mutate (the %<>% compound assignment pipe comes from magrittr):
library(dplyr)
library(magrittr)
df2 %<>% mutate(percent = sum / total)
Convert to data.table in-place, then merge and add new column, again in-place:
library(data.table)
setDT(df)[total, on = 'run', percent := sum/total]
df
# run group sum percent
#1: 1 a 1 0.02222222
#2: 1 b 8 0.17777778
#3: 1 c 34 0.75555556
#4: 2 a 2 0.04255319
#5: 2 b 7 0.14893617
#6: 2 c 33 0.70212766

Assigning values according to set limits in R

I have a list of stores with some quantities of different articles in them and a warehouse with these articles - these are two separate data frames.
Article <- c('a','b','a','b','c','d')
forecast <- c( 1,5,80,10,100,1000)
StoreID <- c(1,1,2,2,3,4)
StoreData <- data.frame(StoreID, Article, forecast)
Something like this:
StoreData
StoreID Article forecast
1 a 1
1 b 5
2 a 80
2 b 10
3 c 100
4 d 1000
And the warehouse data:
Stock <- c(10,11,12,100)
WarehouseData <- data.frame(Article = c('a','b','c','d'), Stock)
WarehouseData
Article Stock
a 10
b 11
c 12
d 100
My goal is to add a purchase order (PO) column. The logic is as follows: for every row in the StoreData table, I look at the Stock of the Article in the warehouse; if it is enough, I approve the full forecast, and if not, I approve only the available quantity. My problem is that the available stock shrinks as quantities are approved, and I cannot work out how to take that into account.
The expected result looks like this:
StoreData
StoreID Article forecast PO
1 a 1 1
1 b 5 5
2 a 80 9
2 b 10 6
3 c 100 12
4 d 1000 100
Can anyone please tell me how to get this right?
Here's another approach using dplyr:
library(dplyr)
left_join(StoreData, WarehouseData, by = "Article") %>%
group_by(Article) %>%
mutate(PO = ifelse(cumsum(forecast) <= Stock, forecast,
Stock - cumsum(forecast) + forecast)) %>% ungroup
#Source: local data frame [6 x 5]
#
# StoreID Article forecast Stock PO
# (int) (fctr) (int) (dbl) (dbl)
#1 1 a 1 10 1
#2 1 b 5 11 5
#3 2 a 80 10 9
#4 2 b 10 11 6
#5 3 c 100 12 12
#6 4 d 1000 100 100
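One caveat with the cumulative-sum formula: once an article's stock is exhausted, any further row for that article gets a negative PO (Stock - cumsum(forecast) + forecast drops below zero). A hedged base R sketch that clamps the remaining stock at zero is shown below. The seventh row (store 5, article a) is a hypothetical addition to the question's data, included only to exercise that case.

```r
StoreData <- data.frame(
  StoreID  = c(1, 1, 2, 2, 3, 4, 5),
  Article  = c("a", "b", "a", "b", "c", "d", "a"),  # last row is hypothetical
  forecast = c(1, 5, 80, 10, 100, 1000, 3)
)
WarehouseData <- data.frame(Article = c("a", "b", "c", "d"),
                            Stock   = c(10, 11, 12, 100))

# Stock available for each row's article
stock <- WarehouseData$Stock[match(StoreData$Article, WarehouseData$Article)]

# Quantity already requested by earlier rows for the same article
requested_before <- ave(StoreData$forecast, StoreData$Article,
                        FUN = function(x) cumsum(x) - x)

# Remaining stock, never below zero; approve at most that amount
remaining <- pmax(0, stock - requested_before)
StoreData$PO <- pmin(StoreData$forecast, remaining)
print(StoreData)
```

For the six original rows this gives the same PO values as the expected result in the question; the extra row receives 0 because article a is already sold out by then.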
See the loop below for example:
StoreData$PO <- NA
for (i in 1:nrow(StoreData)) {
query <- WarehouseData$Article == StoreData[i, "Article"]
po <- ifelse(StoreData[i, "forecast"] > WarehouseData[query, 2],
WarehouseData[query, 2],
StoreData[i, "forecast"])
WarehouseData[query, 2] <- WarehouseData[query, 2] - po
StoreData[i, "PO"] <- po
}
print(StoreData)
# StoreID Article forecast PO
# 1 1 a 1 1
# 2 1 b 5 5
# 3 2 a 80 9
# 4 2 b 10 6
# 5 3 c 100 12
# 6 4 d 1000 100
Here is another alternative in base R, using the same cumulative-sum logic as the dplyr answer:
StoreData <- merge(StoreData, WarehouseData)
StoreData$PO <- do.call(c, lapply(split(StoreData, StoreData$Article), function(z) {
ifelse(cumsum(z$forecast) <= z$Stock, z$forecast,
z$Stock - cumsum(z$forecast) + z$forecast)
}))
And here is what I used to recreate your dataset, which might help with other answers:
StoreData <- read.table(text = "StoreID Article forecast
1 a 1
1 b 5
2 a 80
2 b 10
3 c 100
4 d 1000", header = T)
Article <- c('a','b','c','d')
Stock <- c(10,11,12,100)
WarehouseData <- data.frame(Article, Stock)

Extract grouped Subset with condition

I have following data structure:
Group Count Value
1 1 1000
1 10 2000
2 6 1000
2 7 2000
Some groups that have a count value and a data value. Now I only want those rows where count > 0.25 * sum(count of group).
For example group 1 has sum(count) = 11 so the first row should not be included in the result.
Result should look like:
Group Count Value
1 10 2000
2 6 1000
2 7 2000
How can I do this in R?
Additionally my dataset has around 5 million rows. So please consider performance.
With the sample data
dd<-read.table(text="Group Count Value
1 1 1000
1 10 2000
2 6 1000
2 7 2000", header=T)
you can do this with base R
subset(dd, Count>.25*ave(Count, Group, FUN=sum))
or the dplyr library
library(dplyr)
dd %>% group_by(Group) %>% filter(Count > .25 * sum(Count))
Perhaps you'll find one more readable than the other. Both return
Group Count Value
2 1 10 2000
3 2 6 1000
4 2 7 2000
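Since the question mentions about 5 million rows, another base R variant worth trying is rowsum(), which computes the per-group totals in a single pass before the comparison. This is only a sketch on the sample data and has not been benchmarked at that size.

```r
dd <- read.table(text = "Group Count Value
1 1 1000
1 10 2000
2 6 1000
2 7 2000", header = TRUE)

# Total Count per group: a one-column matrix with the group values as row names
group_totals <- rowsum(dd$Count, dd$Group)

# Look up each row's group total by name and apply the 25% threshold
row_total <- group_totals[as.character(dd$Group), 1]
result <- dd[dd$Count > 0.25 * row_total, ]
print(result)
```

The lookup by as.character(dd$Group) relies on rowsum() naming its rows after the group values, so it works for non-consecutive group IDs as well.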
