Bill of Materials calculations in R

I need to create a recursive function that calculates the cumulative costs of the items in a bill of materials. Each item belongs to a specific level, where the levels represent the production levels from level n down to level 0 (the finished product). My data frame is as follows:
df <- data.frame(
item_id = c("i1", "i2", "i3", "i4", "i5", "i6", "i7", "i8", "i9"),
quantity = c(2, 2, 5, 1, 1, 1, 4,1, 1),
price = c(2, 5, 3, 7, 0, 10, 0, 4, 0),
itemtype = c("A", "A", "A", "A", "B", "A", "B","A", "C"),
productionlevel = c(3, 3, 3, 3, 2, 2, 1,1, 0)
)
I want to create a new column where the total prices will be calculated. The logic of my calculations is:
Level 3 =>i1 final_price= i1 price * i1 quantity
i2 final_price= i2 price * i2 quantity
i3 final_price= i3 price * i3 quantity
i4 final_price= i4 price * i4 quantity
Level 2 => i5 final_price= i5 quantity *{(i1 price * i1 quantity)+(i2 price * i2 quantity)+(i3 price * i3 quantity)+(i4 price * i4 quantity)}
Level 1 => i7 final_price= i7 quantity *{(i5 final_price)+(i6 price * i6 quantity)}
Level 0 => i9 final_price= i9 quantity *{(i7 final_price)+(i8 price * i8 quantity)}
My purpose is to find the accumulated value for each item where price = 0 and assign it to final_price after multiplying it by the quantity.
I tried to find the solution using the accumulative method, but I think it would be better to write a recursive function and call it to get the result.
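As a worked check of the formulas above, the arithmetic can be written out by hand for the data frame in the question. This is only a sketch: the helper names line_cost and qty are introduced here for illustration, and the item-to-parent relationships are taken directly from the Level 2, Level 1 and Level 0 formulas.

```r
df <- data.frame(
  item_id = c("i1", "i2", "i3", "i4", "i5", "i6", "i7", "i8", "i9"),
  quantity = c(2, 2, 5, 1, 1, 1, 4, 1, 1),
  price = c(2, 5, 3, 7, 0, 10, 0, 4, 0),
  stringsAsFactors = FALSE
)

# price * quantity for every item, addressable by item id (illustrative helpers)
line_cost <- setNames(df$quantity * df$price, df$item_id)
qty <- setNames(df$quantity, df$item_id)

# Level 2: i5 rolls up the four level-3 raw materials
i5 <- qty[["i5"]] * sum(line_cost[c("i1", "i2", "i3", "i4")])
# Level 1: i7 rolls up i5 and the priced item i6
i7 <- qty[["i7"]] * (i5 + line_cost[["i6"]])
# Level 0: i9 rolls up i7 and the priced item i8
i9 <- qty[["i9"]] * (i7 + line_cost[["i8"]])

c(i5 = i5, i7 = i7, i9 = i9)
# i5 = 36, i7 = 184, i9 = 188
```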

Thank you very much!
I thought it would be the same if the order of "A" and "B" were as in my df for levels 2 and 1. Actually, A means raw materials, for which the prices are all assigned, and B means the sub-assemblies whose prices should be calculated. That's why I assigned them a price of 0.
I appreciate your help; however, the code you provided calculates only the first sub-assembly of the second level:
item_id quantity price itemtype productionlevel final_price
1 i1 2 2 A 3 4
2 i2 2 5 A 3 10
3 i3 5 3 A 3 15
4 i4 1 7 A 3 7
5 i5 1 10 A 2 36
6 i6 1 0 B 2 0
7 i7 4 4 A 1 144
8 i8 1 0 B 1 0
9 i9 1 0 C 0 144
Can you please help me adjust it? I was expecting a result like this (which I got using the accumulation method for a different hierarchy):
item_id quantity price itemtype productionlevel final_price
1 i1 2 2 A 3 4
2 i2 2 5 A 3 10
3 i3 5 3 A 3 15
4 i4 1 7 A 3 7
5 i5 1 0 A 2 36
6 i6 1 10 B 2 10
7 i7 4 0 A 1 4
8 i8 1 4 B 1 184
9 i9 1 0 C 0 188
The item types correspond to A (raw materials), B (sub-assemblies) and C (finished good).

This is what I have tried in a very basic way, where the number of levels is already known:
df <- data.frame(
item_id = c("i1", "i2", "i3", "i4", "i5", "i6", "i7", "i8", "i9"),
quantity = c(2, 2, 5, 1, 1, 2, 4,1, 1),
price = c(2, 5, 3, 7, 10, 0, 4, 0, 0),
itemtype = c("A", "A", "A", "A", "A", "B", "A","B", "C"),
productionlevel = c(3, 3, 3, 3, 2, 2, 1,1, 0),
stringsAsFactors=FALSE
)
df$final_price=0
maxLevel=max(df$productionlevel)
df[df$price != 0, 'final_price'] <- df[df$price != 0, 'quantity'] * df[df$price != 0, 'price']
df[df$productionlevel == 2 & df$itemtype != "A", 'final_price'] <- sum(df[df$productionlevel == 3, 'final_price']) * df[df$productionlevel == 2 & df$itemtype != "A", 'quantity']
df[df$productionlevel == 1 & df$itemtype != "A", 'final_price'] <- sum(df[df$productionlevel == 2, 'final_price']) * df[df$productionlevel == 1 & df$itemtype != "A", 'quantity']
df[df$productionlevel == 0 & df$itemtype != "A", 'final_price'] <- sum(df[df$productionlevel == 1, 'final_price']) * df[df$productionlevel == 0 & df$itemtype != "A", 'quantity']
The resulting data frame was:
item_id quantity price itemtype productionlevel final_price
1 i1 2 2 A 3 4
2 i2 2 5 A 3 10
3 i3 5 3 A 3 15
4 i4 1 7 A 3 7
5 i5 1 10 A 2 10
6 i6 2 0 B 2 72
7 i7 4 4 A 1 16
8 i8 1 0 B 1 82
9 i9 1 0 C 0 98
I need to make these calculations more generic without specifying the levels and item types.

With my background in manufacturing, I believe there may be a typo in the data frame. Since it is urgent, I will do my best to help. I modified itemtype to fix the BOM hierarchy; my experiments suggest it should be as below.
df <- data.frame(
item_id = c("i1", "i2", "i3", "i4", "i5", "i6", "i7", "i8", "i9"),
quantity = c(2, 2, 5, 1, 1, 1, 4,1, 1),
price = c(2, 5, 3, 7, 0, 10, 0, 4, 0),
itemtype = c("A", "A", "A", "A", "A", "B", "A","B", "C"),
productionlevel = c(3, 3, 3, 3, 2, 2, 1,1, 0),
stringsAsFactors=FALSE
)
df$final_price=0
maxLevel=max(df$productionlevel)
df[df$productionlevel==maxLevel,'final_price']=df[df$productionlevel==maxLevel,2]*df[df$productionlevel==maxLevel,3]
for (i in (maxLevel-1):1)
{
tmpdf=df[df$productionlevel==i, ]
for (itemtype in tmpdf$itemtype)
{
item_id=tmpdf[tmpdf$itemtype==itemtype, 'item_id']
cost=sum(df[df$itemtype==itemtype & df$productionlevel==(i+1), 'final_price'])
manufacturecost=ifelse(df[df$item_id==item_id, 3]==0, 1 , df[df$item_id==item_id, 3])
df[df$item_id==item_id, 'final_price']=df[df$item_id==item_id, 2]*manufacturecost*ifelse(cost==0,1,cost)
}
}
df[df$productionlevel==0,'final_price']=sum(df[df$productionlevel==1,'final_price'])
print(df)
Result:
item_id quantity price itemtype productionlevel final_price
1 i1 2 2 A 3 4
2 i2 2 5 A 3 10
3 i3 5 3 A 3 15
4 i4 1 7 A 3 7
5 i5 1 0 A 2 36
6 i6 1 10 B 2 10
7 i7 4 0 A 1 144
8 i8 1 4 B 1 40
9 i9 1 0 C 0 184

I changed the data frame to make sure it's making the right calculations (increased some quantity values):
df <- data.frame(
item_id = c("i1", "i2", "i3", "i4", "i5", "i6", "i7", "i8", "i9"),
quantity = c(2, 2, 5, 1, 1, 2, 4,1, 1),
price = c(2, 5, 3, 7, 10, 0, 4, 0, 0),
itemtype = c("A", "A", "A", "A", "A", "B", "A","B", "C"),
productionlevel = c(3, 3, 3, 3, 2, 2, 1,1, 0),
stringsAsFactors = FALSE
)
Then I used an iterative loop to calculate the final cost:
df$final_price = 0
maxLevel <- max(df$productionlevel)
df[df$price != 0, 'final_price'] <- df[df$price != 0, 'quantity'] * df[df$price != 0, 'price']
for (level in (maxLevel - 1):0) {
condition <- df$productionlevel == level & df$itemtype != "A"
if (any(condition)) {
higherLevel <- level + 1
df[condition, 'final_price'] <- sum(df[df$productionlevel == higherLevel, 'final_price']) * df[condition, 'quantity']
}
}
Up to here I think it works perfectly. Can anyone help me find a way to write a recursive function that does the job instead of this loop?
Also, I was wondering whether it is a good idea to use recursive functions when we can use loops. Which approach is more efficient?
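One possible recursive formulation is sketched below. It assumes, as the loop above does, that items of type "A" carry their own price and that every other item at a level rolls up the total of the level directly above it. The function names level_total and add_final_price are hypothetical, introduced only for this sketch.

```r
df <- data.frame(
  item_id = c("i1", "i2", "i3", "i4", "i5", "i6", "i7", "i8", "i9"),
  quantity = c(2, 2, 5, 1, 1, 2, 4, 1, 1),
  price = c(2, 5, 3, 7, 10, 0, 4, 0, 0),
  itemtype = c("A", "A", "A", "A", "A", "B", "A", "B", "C"),
  productionlevel = c(3, 3, 3, 3, 2, 2, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: total cost of all items at one production level.
level_total <- function(df, level) {
  rows <- df[df$productionlevel == level, ]
  if (nrow(rows) == 0) return(0)  # base case: no items at this level
  raw <- rows$itemtype == "A"
  raw_cost <- sum(rows$quantity[raw] * rows$price[raw])
  # Non-"A" items multiply the total of the level above; recurse only if any exist.
  asm_cost <- if (any(!raw)) {
    sum(rows$quantity[!raw]) * level_total(df, level + 1)
  } else {
    0
  }
  raw_cost + asm_cost
}

# Hypothetical wrapper: fill final_price for every row.
add_final_price <- function(df) {
  df$final_price <- df$quantity * df$price
  asm <- df$itemtype != "A"
  above <- vapply(df$productionlevel[asm] + 1,
                  function(l) level_total(df, l), numeric(1))
  df$final_price[asm] <- df$quantity[asm] * above
  df
}

res <- add_final_price(df)
res$final_price
# 4 10 15 7 10 72 16 82 98, the same values as the loop
```

On the loop-versus-recursion question: this sketch recomputes the lower-level totals once per assembly, whereas the loop visits each level once, so for a handful of levels the loop is probably at least as efficient and arguably easier to read. The recursive form mainly helps if the hierarchy later becomes a true parent-child tree rather than a chain of levels.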

Related

Find last non-zero element in column for each group, fill different column

I am trying to create a for loop that does the following:
for (i in 2:length(Exampledata$Levels)) {
if (is.na(Exampledata$Levels[i])) {
# find the last instance where is.na(Exampledata$Levels) is FALSE
# for that same ID, and put the day from that row into last_entry[i]
}
}
Example data:
ID<-c("QYZ","MMM","QYZ","bb2","gm6","gm6","YOU","LLL","LLL","LLL")
day<-c(1,2,3,4,5,6,7,8,9,10)
values<-c(1,2,4,5,5,6,8,9,6,4)
Levels<-c("A","","A","C",'D','D',"C","y","","")
last_entry<-c(0,0,0,0,0,0,0,0,0,0)
What data currently looks like:
ID values Levels day last_entry
1 QYZ 1 A 1 0
2 MMM 2 2 0
3 QYZ 4 A 3 0
4 bb2 5 C 4 0
5 gm6 5 D 5 0
6 gm6 6 D 6 0
7 YOU 8 C 7 0
8 LLL 9 y 8 0
9 LLL 6 9 0
10 LLL 4 10 0
What I want it to look like:
ID values Levels day last_entry
1 QYZ 1 A 1 0
2 MMM 2 2 0
3 QYZ 4 A 3 0
4 bb2 5 C 4 0
5 gm6 5 D 5 0
6 gm6 6 D 6 0
7 YOU 8 C 7 0
8 LLL 9 y 8 0
9 LLL 6 9 8
10 LLL 4 10 8
I have seen a lot of code that looks for last non-zero elements or last is.na=FALSE, but none that can do it by ID, and extract a value from that row. I also need to ignore cases where there is no entry for that ID.
Essentially I want to know the last day that a level was entered for that ID.
Here's a solution using data.table:
library('data.table')
dt <- data.table(ID = c("QYZ","MMM","QYZ","bb2","gm6","gm6","YOU","LLL","LLL","LLL"),
Day = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
values = c(1, 2, 4, 5, 5, 6, 8, 9, 6, 4),
Levels = c("A", NA, "A", "C", "D", "D", "C", "y", NA, NA),
last_entry = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0))
func <- function(days, levels){
if(!any(is.na(levels)) | all(is.na(levels))) return(0)
return(last(days[which(!is.na(levels))]))
}
dt[, last_entry := ifelse(!is.na(Levels), 0, func(Day, Levels)), by = ID]
But if you're set on using a for loop:
ID <- c("QYZ","MMM","QYZ","bb2","gm6","gm6","YOU","LLL","LLL","LLL")
Day <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
Levels <- c("A", NA, "A", "C", "D", "D", "C", "y", NA, NA)
last_entry <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
i.na <- which(is.na(Levels))
for(id in unique(ID)){
i.id <- which(ID == id)
if(all(is.na(Levels[i.id])) | !any(is.na(Levels[i.id]))) next
day <- tail(Day[i.id[!(i.id %in% i.na)]], 1) # base R equivalent of last(), which needs data.table loaded
last_entry[i.na[i.na %in% i.id]] <- day
}
Here is one way using tidyr::fill. We replace the last_entry columns with NA where the Levels are empty, then use fill to replace those NA's with latest non-NA values and turn last_entry value of all non-empty Levels to 0.
library(dplyr)
df %>%
mutate(last_entry = ifelse(Levels != "", day, NA)) %>%
group_by(ID) %>%
tidyr::fill(last_entry) %>%
mutate(last_entry = replace(last_entry, Levels != "" | n() == 1, 0))
# ID day values Levels last_entry
# <fct> <dbl> <dbl> <fct> <dbl>
# 1 QYZ 1 1 A 0
# 2 MMM 2 2 "" 0
# 3 QYZ 3 4 A 0
# 4 bb2 4 5 C 0
# 5 gm6 5 5 D 0
# 6 gm6 6 6 D 0
# 7 YOU 7 8 C 0
# 8 LLL 8 9 y 0
# 9 LLL 9 6 "" 8
#10 LLL 10 4 "" 8
We can also do
df %>%
group_by(ID) %>%
mutate(last_entry = purrr::map_dbl(row_number(), ~if (Levels[.x] == "" & n() > 1)
day[max(which(Levels[1:.x] != ""))] else 0))
data
ID<-c("QYZ","MMM","QYZ","bb2","gm6","gm6","YOU","LLL","LLL","LLL")
day<-c(1,2,3,4,5,6,7,8,9,10)
values<-c(1,2,4,5,5,6,8,9,6,4)
Levels<-c("A","","A","C",'D','D',"C","y","","")
last_entry<-c(0,0,0,0,0,0,0,0,0,0)
df <- data.frame(ID, day, values, Levels, last_entry)
If you want to do it properly, you may want to code "empty" cells to NA beforehand.
Exampledata[Exampledata == ""] <- NA
Then you may use by from base R to look up the "day" of the last !is.na entry of "Levels" in the data split by "ID".
res <- do.call(rbind, by(Exampledata, Exampledata$ID, function(x) {
x$last_entry <- ifelse(is.na(x$Levels), x$day[tail(which(!is.na(x$Levels)), 1)], 0)
x
}))
Since the rbind result comes out ordered alphabetically by "ID", we can re-order it by day.
res <- res[order(res$day), ]
res
# ID day values Levels last_entry
# QYZ.1 QYZ 1 1 A 0
# MMM MMM 2 2 <NA> NA
# QYZ.3 QYZ 3 4 A 0
# bb2 bb2 4 5 C 0
# gm6.5 gm6 5 5 D 0
# gm6.6 gm6 6 6 D 0
# YOU YOU 7 8 C 0
# LLL.8 LLL 8 9 y 0
# LLL.9 LLL 9 6 <NA> 8
# LLL.10 LLL 10 4 <NA> 8
Now there are the desired last entries for the "LLL" ID, and an NA for MMM, which is what it logically should have, since its "Levels" is NA and it has no last entry.
Data
Exampledata <- structure(list(ID = structure(c(5L, 4L, 5L, 1L, 2L, 2L, 6L, 3L,
3L, 3L), .Label = c("bb2", "gm6", "LLL", "MMM", "QYZ", "YOU"), class = "factor"),
day = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), values = c(1, 2,
4, 5, 5, 6, 8, 9, 6, 4), Levels = structure(c(2L, NA, 2L,
3L, 4L, 4L, 3L, 5L, NA, NA), .Label = c("", "A", "C", "D",
"y"), class = "factor"), last_entry = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0)), row.names = c(NA, -10L), class = "data.frame")

How to create dummy variables per group of another variable in tidyverse

I want to create (dummy) variables that show whether an observation is in a group of observations (identifiable by a common Group_ID) with a certain combination of characteristics across that group. The code example makes it clearer what exactly I mean.
I tried combinations of group_by and caret::dummyVars, but had no success. I am running out of ideas - any help would be appreciated very much.
library(tidyverse)
# Input data
# please note: in my case each value of the column Role will appear only once per Group_ID.
input_data <- tribble( ~Group_ID, ~Role, ~Income,
#--|--|----
1, "a", 3.6,
1, "b", 8.5,
2, "a", 7.6,
2, "c", 9.5,
2, "d", 9.7,
3, "a", 1.6,
3, "b", 4.5,
3, "c", 2.7,
3, "e", 7.7,
4, "b", 3.3,
4, "c", 6.2,
)
# desired output
output_data <- tribble( ~Group_ID, ~Role, ~Income, ~Role_A, ~Role_B, ~Role_C, ~Role_D, ~Role_E, ~All_roles,
#--|--|----
1, "a", 3.6, 1, 1, 0, 0, 0, "ab",
1, "b", 8.5, 1, 1, 0, 0, 0, "ab",
2, "a", 7.6, 1, 0, 1, 1, 0, "acd",
2, "c", 9.5, 1, 0, 1, 1, 0, "acd",
2, "d", 9.7, 1, 0, 1, 1, 0, "acd",
3, "a", 1.6, 1, 1, 1, 0, 1, "abce",
3, "b", 4.5, 1, 1, 1, 0, 1, "abce",
3, "c", 2.7, 1, 1, 1, 0, 1, "abce",
3, "e", 7.7, 1, 1, 1, 0, 1, "abce",
4, "b", 3.3, 0, 1, 1, 0, 0, "bc",
4, "c", 6.2, 0, 1, 1, 0, 0, "bc"
)
The following takes advantage of base R modeling functions to create the dummies.
First, create a model matrix with no intercept.
fit <- lm(Group_ID ~ 0 + Role, input_data)
m <- model.matrix(fit)
Now, process that matrix by noting that the dummies the question asks for are the sums by groups of Group_ID.
input_data %>%
bind_cols(m %>% as.data.frame()) %>%
group_by(Group_ID) %>%
mutate_at(vars(matches("Role[[:alpha:]]")), sum) %>%
mutate(all_roles = paste(Role, collapse = ""))
## A tibble: 11 x 9
## Groups: Group_ID [4]
# Group_ID Role Income Rolea Roleb Rolec Roled Rolee all_roles
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
# 1 1 a 3.6 1 1 0 0 0 ab
# 2 1 b 8.5 1 1 0 0 0 ab
# 3 2 a 7.6 1 0 1 1 0 acd
# 4 2 c 9.5 1 0 1 1 0 acd
# 5 2 d 9.7 1 0 1 1 0 acd
# 6 3 a 1.6 1 1 1 0 1 abce
# 7 3 b 4.5 1 1 1 0 1 abce
# 8 3 c 2.7 1 1 1 0 1 abce
# 9 3 e 7.7 1 1 1 0 1 abce
#10 4 b 3.3 0 1 1 0 0 bc
#11 4 c 6.2 0 1 1 0 0 bc
Using dplyr and cSplit_e from splitstackshape. For every Group_ID we paste the Role together and then separate them into new columns of binary value based on their presence and absence using cSplit_e.
library(splitstackshape)
library(dplyr)
input_data %>%
group_by(Group_ID) %>%
mutate(new_role = paste(Role, collapse = "")) %>%
ungroup() %>%
cSplit_e("new_role", sep = "", type = "character", fill = 0)
# Group_ID Role Income new_role new_role_a new_role_b new_role_c new_role_d new_role_e
#1 1 a 3.6 ab 1 1 0 0 0
#2 1 b 8.5 ab 1 1 0 0 0
#3 2 a 7.6 acd 1 0 1 1 0
#4 2 c 9.5 acd 1 0 1 1 0
#5 2 d 9.7 acd 1 0 1 1 0
#6 3 a 1.6 abce 1 1 1 0 1
#7 3 b 4.5 abce 1 1 1 0 1
#8 3 c 2.7 abce 1 1 1 0 1
#9 3 e 7.7 abce 1 1 1 0 1
#10 4 b 3.3 bc 0 1 1 0 0
#11 4 c 6.2 bc 0 1 1 0 0
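For comparison, the same idea can be sketched in base R alone: build a group-by-role presence matrix with table() and look up each observation's group row. This is only an illustration; input_data is rebuilt here as a plain data.frame so the block runs without the tidyverse, and the names presence, dummies and output are made up for the sketch.

```r
input_data <- data.frame(
  Group_ID = c(1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4),
  Role = c("a", "b", "a", "c", "d", "a", "b", "c", "e", "b", "c"),
  Income = c(3.6, 8.5, 7.6, 9.5, 9.7, 1.6, 4.5, 2.7, 7.7, 3.3, 6.2),
  stringsAsFactors = FALSE
)

# One row per Group_ID, one column per Role; TRUE where the role occurs in the group.
presence <- unclass(table(input_data$Group_ID, input_data$Role)) > 0

# Pick the group row for every observation and turn TRUE/FALSE into 1/0.
dummies <- presence[as.character(input_data$Group_ID), , drop = FALSE] + 0
colnames(dummies) <- paste0("Role_", toupper(colnames(dummies)))
rownames(dummies) <- NULL

output <- cbind(input_data, dummies)
# Concatenate the roles within each group.
output$All_roles <- ave(input_data$Role, input_data$Group_ID,
                        FUN = function(r) paste(r, collapse = ""))
output
```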

Aggregate using different functions for each column

I have a data.table similar to the one below, but with around 3 million rows and a lot more columns.
key1 price qty status category
1: 1 9.26 3 5 B
2: 1 14.64 1 5 B
3: 1 16.66 3 5 A
4: 1 18.27 1 5 A
5: 2 2.48 1 7 A
6: 2 0.15 2 7 C
7: 2 6.29 1 7 B
8: 3 7.06 1 2 A
9: 3 24.42 1 2 A
10: 3 9.16 2 2 C
11: 3 32.21 2 2 B
12: 4 20.00 2 9 B
Here's the dput() output (with the non-parseable .internal.selfref pointer removed):
dados = structure(list(key1 = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4),
price = c(9.26, 14.64, 16.66, 18.27, 2.48, 0.15, 6.29, 7.06,
24.42, 9.16, 32.21, 20), qty = c(3, 1, 3, 1, 1, 2, 1, 1,
1, 2, 2, 2), status = c(5, 5, 5, 5, 7, 7, 7, 2, 2, 2, 2,
9), category = c("B", "B", "A", "A", "A", "C", "B", "A",
"A", "C", "B", "B")), .Names = c("key1", "price", "qty",
"status", "category"), row.names = c(NA, -12L), class = c("data.table",
"data.frame"))
I need to transform this data so that I have one entry for each key, and in the process I need to create some additional variables. So far I have been using this:
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
key.aggregate = function(x){
return(data.table(
key1 = Mode(x$key1),
perc.A = sum(x$price[x$category == "A"],na.rm=T)/sum(x$price),
perc.B = sum(x$price[x$category == "B"],na.rm=T)/sum(x$price),
perc.C = sum(x$price[x$category == "C"],na.rm=T)/sum(x$price),
status = Mode(x$status),
qty = sum(x$qty),
price = sum(x$price)
))
}
new_data = split(dados,by = "key1") #Runs out of RAM here
results = rbindlist(lapply(new_data,key.aggregate))
I expect the following output:
> results
key1 perc.A perc.B perc.C status qty price
1: 1 0.5937447 0.4062553 0.00000000 5 8 58.83
2: 2 0.2780269 0.7051570 0.01681614 7 4 8.92
3: 3 0.4321208 0.4421414 0.12573782 2 6 72.85
4: 4 0.0000000 1.0000000 0.00000000 9 2 20.00
But I always run out of RAM when splitting the data by keys. I've tried using only a third of the data, and then only a sixth of it, but it still gives the same Error: cannot allocate vector of size 593 Kb.
I suspect this approach is very inefficient. What would be the best way to get this result?
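One way to avoid materialising a list of per-key tables is to let data.table do the grouping itself with by =, so that split() and rbindlist() are not needed. The sketch below reuses the Mode helper from the question and rebuilds the example data with data.table(); whether it fits in memory for the full 3 million rows has not been tested here.

```r
library(data.table)

dados <- data.table(
  key1 = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4),
  price = c(9.26, 14.64, 16.66, 18.27, 2.48, 0.15, 6.29, 7.06, 24.42, 9.16, 32.21, 20),
  qty = c(3, 1, 3, 1, 1, 2, 1, 1, 1, 2, 2, 2),
  status = c(5, 5, 5, 5, 7, 7, 7, 2, 2, 2, 2, 9),
  category = c("B", "B", "A", "A", "A", "C", "B", "A", "A", "C", "B", "B")
)

# Most frequent value, as defined in the question.
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# One grouped pass: each expression is evaluated per key1 group.
results <- dados[, .(
  perc.A = sum(price[category == "A"]) / sum(price),
  perc.B = sum(price[category == "B"]) / sum(price),
  perc.C = sum(price[category == "C"]) / sum(price),
  status = Mode(status),
  qty = sum(qty),
  price = sum(price)
), by = key1]

results
```

Because key1 is the grouping column, it no longer needs Mode(); it appears once per group in the result automatically.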

Find if survey data is consistent across years by participant in dplyr in R

I have data which looks like this:
df <- data.frame(
ID = c(rep(c("ABC123", "BCD234", "CDE345", "DEF456", "EFG567", "FGH678", "GHI891", "HIJ910", "IJK101", "JKL011"),2)),
eth = c(1, 2, 2, 3, 1, 1, 4, 4, 3, 3, 1, 4, 1, 3, 1, 3, 4, 4, 3, 2),
nzdep = c(4, 3, 3, 2, 4, 4, 1, 1, 2, 2, 4, 3, 3, 4, 4, 2, 1, 1, 2, 3),
sex = c("M", "M", "F", "F", "M", "M", "F", "F", "M", "M", "F", "M", "M", "M", "M", "F", "F", "M", "F", "M"),
Year = c(rep("Y1", 10), rep("Y2", 10)))
This is survey data, for the same people, in different years. The ID is a unique ID per person, and the Year tells us which year the survey was completed. What I want to know, is whether the same ID answered the same question the same way in both years.
I have tried something like this:
dems <- df %>%
group_by(ID) %>%
mutate(dep_dif = ifelse(nzdep = nzdep, 1, 0),
sex_dif = ifelse(sex = sex, 1, 0),
eth_dif = ifelse(eth = eth, 1, 0))
This doesn't work, but I was thinking something along these lines.
My desired output would be:
dems <- data.frame(
ID = c(rep(c("ABC123", "BCD234", "CDE345", "DEF456", "EFG567", "FGH678", "GHI891", "HIJ910", "IJK101", "JKL011"),2)),
eth = c(1, 2, 2, 3, 1, 1, 4, 4, 3, 3, 1, 4, 1, 3, 1, 3, 4, 4, 3, 2),
nzdep = c(4, 3, 3, 2, 4, 4, 1, 1, 2, 2, 4, 3, 3, 4, 4, 2, 1, 1, 2, 3),
sex = c("M", "M", "F", "F", "M", "M", "F", "F", "M", "M", "F", "M", "M", "M", "M", "F", "F", "M", "F", "M"),
Year = c(rep("Y1", 10), rep("Y2", 10)),
eth_dif = c(rep(c(1, 0, 0, 1, 1, 0, 1, 1, 1, 0),2)),
dep_dif = c(rep(c(1, 1, 1, 0, 1, 0, 1, 1, 1, 0),2)),
sex_dif = c(rep(c(0, 1, 0, 0, 1, 0, 1, 0, 0, 1),2)))
Does anyone know how to do this?
Thanks
It seems you need to check whether the number of unique values per ID equals one:
df %>% group_by(ID) %>% dplyr::mutate(dep_dif = ifelse(length(unique(nzdep)) == 1, 1, 0),
sex_dif = ifelse(length(unique(sex)) == 1, 1, 0),
eth_dif = ifelse(length(unique(eth)) == 1, 1, 0))
# A tibble: 20 x 8
# Groups: ID [10]
ID eth nzdep sex Year dep_dif sex_dif eth_dif
<fctr> <dbl> <dbl> <fctr> <fctr> <dbl> <dbl> <dbl>
1 ABC123 1 4 M Y1 1 0 1
2 BCD234 2 3 M Y1 1 1 0
3 CDE345 2 3 F Y1 1 0 0
4 DEF456 3 2 F Y1 0 0 1
5 EFG567 1 4 M Y1 1 1 1
6 FGH678 1 4 M Y1 0 0 0
7 GHI891 4 1 F Y1 1 1 1
8 HIJ910 4 1 F Y1 1 0 1
9 IJK101 3 2 M Y1 1 0 1
10 JKL011 3 2 M Y1 0 1 0
11 ABC123 1 4 F Y2 1 0 1
12 BCD234 4 3 M Y2 1 1 0
13 CDE345 1 3 M Y2 1 0 0
14 DEF456 3 4 M Y2 0 0 1
15 EFG567 1 4 M Y2 1 1 1
16 FGH678 3 2 F Y2 0 0 0
17 GHI891 4 1 F Y2 1 1 1
18 HIJ910 4 1 M Y2 1 0 1
19 IJK101 3 2 F Y2 1 0 1
20 JKL011 2 3 M Y2 0 1 0
We could do this with mutate_at
library(dplyr)
df %>%
group_by(ID) %>%
mutate_at(2:4, funs(dif = as.integer(.[Year == "Y1"] == .[Year == "Y2"])))
# A tibble: 20 x 8
# Groups: ID [10]
# ID eth nzdep sex Year eth_dif nzdep_dif sex_dif
# <fct> <dbl> <dbl> <fct> <fct> <int> <int> <int>
# 1 ABC123 1 4 M Y1 1 1 0
# 2 BCD234 2 3 M Y1 0 1 1
# 3 CDE345 2 3 F Y1 0 1 0
# 4 DEF456 3 2 F Y1 1 0 0
# 5 EFG567 1 4 M Y1 1 1 1
# 6 FGH678 1 4 M Y1 0 0 0
# 7 GHI891 4 1 F Y1 1 1 1
# 8 HIJ910 4 1 F Y1 1 1 0
# 9 IJK101 3 2 M Y1 1 1 0
#10 JKL011 3 2 M Y1 0 0 1
#11 ABC123 1 4 F Y2 1 1 0
#12 BCD234 4 3 M Y2 0 1 1
#13 CDE345 1 3 M Y2 0 1 0
#14 DEF456 3 4 M Y2 1 0 0
#15 EFG567 1 4 M Y2 1 1 1
#16 FGH678 3 2 F Y2 0 0 0
#17 GHI891 4 1 F Y2 1 1 1
#18 HIJ910 4 1 M Y2 1 1 0
#19 IJK101 3 2 F Y2 1 1 0
#20 JKL011 2 3 M Y2 0 0 1
If the 'ID' is already ordered a base R option would be
df[paste0(names(df)[2:4], "_dif")] <- +(Reduce(`==`, split(df[2:4], df$Year)))
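If the rows are not ordered by ID, or dplyr is not available, a base R sketch with ave() gives the same flags by checking whether each ID has exactly one distinct answer. The helper name same_answer is made up for this illustration, and it assumes that "answered the same way" means all of an ID's values are identical.

```r
ids <- c("ABC123", "BCD234", "CDE345", "DEF456", "EFG567",
         "FGH678", "GHI891", "HIJ910", "IJK101", "JKL011")
df <- data.frame(
  ID = rep(ids, 2),
  eth = c(1, 2, 2, 3, 1, 1, 4, 4, 3, 3, 1, 4, 1, 3, 1, 3, 4, 4, 3, 2),
  nzdep = c(4, 3, 3, 2, 4, 4, 1, 1, 2, 2, 4, 3, 3, 4, 4, 2, 1, 1, 2, 3),
  sex = c("M", "M", "F", "F", "M", "M", "F", "F", "M", "M",
          "F", "M", "M", "M", "M", "F", "F", "M", "F", "M"),
  Year = c(rep("Y1", 10), rep("Y2", 10)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: 1 if every row of an ID holds the same value, else 0.
same_answer <- function(x, id) {
  codes <- as.integer(factor(x))  # integer codes so character columns work too
  ave(codes, id, FUN = function(v) as.integer(length(unique(v)) == 1))
}

df$eth_dif <- same_answer(df$eth, df$ID)
df$dep_dif <- same_answer(df$nzdep, df$ID)
df$sex_dif <- same_answer(df$sex, df$ID)
df
```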

Compute the sum of variables in data frames allocated in a list in R considering differents conditions over others variables

I am working with a list of data frames in R. The list, named global, holds five data frames: f1, f2, f3, f4 and f5. Each data frame has a principal variable named CreditValue plus variables that work like flags: f1 has a single flag b1 (value 1); f2 has b1 (value 1) and b2 (value 2); f3 has b1, b2 and b3 (values 1, 2 and 3); f4 has b1 to b4 (values 1 to 4); and f5 has b1 to b5 (values 1 to 5). The flag variables always start in column 3 in every data frame. I want to compute the sum of CreditValue in each data frame under different conditions on the flag variables. My list has the following structure (the dput version is included at the end):
global
$f1
KeyID CreditValue b1
1 001 1 1
2 002 2 1
3 003 3 1
4 004 4 1
5 005 5 1
6 006 6 1
7 007 7 1
8 009 8 1
9 010 9 1
$f2
KeyID CreditValue b1 b2
1 001 1 1 2
2 002 2 1 2
3 003 3 NA 2
4 004 4 NA 2
5 005 5 NA 2
6 006 6 1 2
7 007 7 1 2
8 009 8 NA 2
9 010 9 1 2
10 011 10 NA 2
11 012 11 1 2
$f3
KeyID CreditValue b1 b2 b3
1 001 1 1 2 3
2 002 2 1 2 3
3 003 3 1 2 3
4 004 4 1 2 3
5 005 5 NA 2 3
6 006 6 NA 2 3
7 007 7 1 2 3
8 009 8 1 2 3
9 010 9 NA NA 3
10 011 10 NA NA 3
11 012 11 NA 2 3
12 013 11 1 2 3
13 014 11 NA NA 3
$f4
KeyID CreditValue b1 b2 b3 b4
1 001 1 NA 2 3 4
2 002 2 NA 2 3 4
3 003 3 NA NA NA 4
4 004 4 NA NA NA 4
5 005 5 NA NA NA 4
6 006 6 1 2 3 4
7 007 7 1 2 3 4
8 009 8 1 2 3 4
9 010 9 1 2 3 4
10 011 10 1 2 3 4
11 012 11 1 2 3 4
12 013 11 1 2 3 4
13 014 11 1 2 3 4
14 015 12 1 NA 3 4
15 016 12 1 NA 3 4
$f5
KeyID CreditValue b1 b2 b3 b4 b5
1 001 1 1 2 3 4 5
2 002 2 1 2 3 4 5
3 003 3 1 2 3 4 5
4 004 4 1 2 3 4 5
5 005 5 NA NA 3 4 5
6 006 6 1 2 3 4 5
7 007 7 1 2 3 4 5
8 009 8 1 2 3 4 5
9 010 9 1 2 3 4 5
10 011 10 NA NA NA NA 5
11 012 11 1 2 3 4 5
12 013 11 1 2 3 4 5
13 014 11 1 2 3 4 5
14 015 12 1 2 3 4 5
15 016 12 1 2 3 4 5
16 017 14 NA NA NA 4 5
17 018 14 NA NA NA 4 5
I have used the llply() function from the plyr package to work with lists in R, but I don't know how to define a function for this. I computed the sums with the code below, but with more data frames it would become very complex. I would also like to save these values in a new data frame or matrix with one column per flag variable (5). The sums are as follows:
sum(f1$CreditValue[f1[,3]==1])
[1] 45
sum(f2$CreditValue[f2[,3]==1],na.rm=TRUE)
[1] 36
sum(f3$CreditValue[f3[,3]==1],na.rm=TRUE)
[1] 36
sum(f4$CreditValue[f4[,3]==1],na.rm=TRUE)
[1] 97
sum(f5$CreditValue[f5[,3]==1],na.rm=TRUE)
[1] 97
These sums are computed applying those formulas considering b1 variable in all data frames.
sum(f2$CreditValue[is.na(f2[,3]) & f2[,4]==2] ,na.rm=TRUE)
[1] 30
sum(f3$CreditValue[is.na(f3[,3]) & f3[,4]==2] ,na.rm=TRUE)
[1] 22
sum(f4$CreditValue[is.na(f4[,3]) & f4[,4]==2] ,na.rm=TRUE)
[1] 3
sum(f5$CreditValue[is.na(f5[,3]) & f5[,4]==2] ,na.rm=TRUE)
[1] 0
These sums are computed applying those formulas considering values of b2 and b1 variables in all data frames. Here there is a condition over values of b1 (column 3).
sum(f3$CreditValue[is.na(f3[,3]) & is.na(f3[,4]) & f3[,5]==3] ,na.rm=TRUE)
[1] 30
sum(f4$CreditValue[is.na(f4[,3]) & is.na(f4[,4]) & f4[,5]==3] ,na.rm=TRUE)
[1] 0
sum(f5$CreditValue[is.na(f5[,3]) & is.na(f5[,4]) & f5[,5]==3] ,na.rm=TRUE)
[1] 5
These sums are computed applying those formulas considering values of b3, b2 and b1 variables in all data frames. Now there is a condition over values of b1 and b2 (columns 3, 4).
sum(f4$CreditValue[is.na(f4[,3]) & is.na(f4[,4]) & is.na(f4[,5]) & f4[,6]==4] ,na.rm=TRUE)
[1] 12
sum(f5$CreditValue[is.na(f5[,3]) & is.na(f5[,4]) & is.na(f5[,5]) & f5[,6]==4] ,na.rm=TRUE)
[1] 28
These sums are computed applying those formulas considering values of b4, b3, b2 and b1 variables in all data frames. Now there is a condition over values of b1, b2 and b3 (columns 3, 4, 5).
sum(f5$CreditValue[is.na(f5[,3]) & is.na(f5[,4]) & is.na(f5[,5]) & is.na(f5[,6]) & f5[,7]==5] ,na.rm=TRUE)
[1] 10
This sum is computed applying last formula considering values of b5, b4, b3, b2 and b1 variables in all data frames. Now there is a condition over values of b1, b2, b3 and b4 (columns 3, 4, 5, 6).
The sums shown above take a lot of code, and I would like to create a function that works over the flag variables (b1, b2, b3, b4, b5) to compute them. I don't know whether this is possible with a for loop or with a function passed to llply or lapply. I have tried to shorten code like this:
sum(f5$CreditValue[is.na(f5[,3]) & is.na(f5[,4]) & is.na(f5[,5]) & is.na(f5[,6]) & f5[,7]==5] ,na.rm=TRUE)
With this code:
sum(f5$CreditValue[is.na(f5[,3,4,5,6]) & f5[,7]==5] ,na.rm=TRUE)
But it doesn't work, because the original conditions select only specific rows in each data frame and the shortened code does not. I would like to save the resulting sums in a new data frame or matrix like this:
f1 f2 f3 f4 f5
f1 45 0 0 0 0
f2 36 30 0 0 0
f3 36 22 30 0 0
f4 97 3 0 12 0
f5 97 0 5 28 10
The zeros in this table arise because not every data frame has all the flag variables; for example, f1 only has b1 and lacks b2, b3, b4 and b5, unlike f5. The dput version of my list is as follows:
structure(list(f1 = structure(list(KeyID = c("001", "002", "003",
"004", "005", "006", "007", "009", "010"), CreditValue = c(1,
2, 3, 4, 5, 6, 7, 8, 9), b1 = c(1, 1, 1, 1, 1, 1, 1, 1, 1)), .Names = c("KeyID",
"CreditValue", "b1"), row.names = c(NA, 9L), class = "data.frame"),
f2 = structure(list(KeyID = c("001", "002", "003", "004",
"005", "006", "007", "009", "010", "011", "012"), CreditValue = c(1,
2, 3, 4, 5, 6, 7, 8, 9, 10, 11), b1 = c(1, 1, NA, NA, NA,
1, 1, NA, 1, NA, 1), b2 = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2)), .Names = c("KeyID", "CreditValue", "b1", "b2"), row.names = c(NA,
11L), class = "data.frame"), f3 = structure(list(KeyID = c("001",
"002", "003", "004", "005", "006", "007", "009", "010", "011",
"012", "013", "014"), CreditValue = c(1, 2, 3, 4, 5, 6, 7,
8, 9, 10, 11, 11, 11), b1 = c(1, 1, 1, 1, NA, NA, 1, 1, NA,
NA, NA, 1, NA), b2 = c(2, 2, 2, 2, 2, 2, 2, 2, NA, NA, 2,
2, NA), b3 = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)), .Names = c("KeyID",
"CreditValue", "b1", "b2", "b3"), row.names = c(NA, 13L), class = "data.frame"),
f4 = structure(list(KeyID = c("001", "002", "003", "004",
"005", "006", "007", "009", "010", "011", "012", "013", "014",
"015", "016"), CreditValue = c(1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 11, 11, 12, 12), b1 = c(NA, NA, NA, NA, NA, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1), b2 = c(2, 2, NA, NA, NA, 2, 2, 2,
2, 2, 2, 2, 2, NA, NA), b3 = c(3, 3, NA, NA, NA, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3), b4 = c(4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4)), .Names = c("KeyID", "CreditValue", "b1",
"b2", "b3", "b4"), row.names = c(NA, 15L), class = "data.frame"),
f5 = structure(list(KeyID = c("001", "002", "003", "004",
"005", "006", "007", "009", "010", "011", "012", "013", "014",
"015", "016", "017", "018"), CreditValue = c(1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 11, 11, 12, 12, 14, 14), b1 = c(1,
1, 1, 1, NA, 1, 1, 1, 1, NA, 1, 1, 1, 1, 1, NA, NA), b2 = c(2,
2, 2, 2, NA, 2, 2, 2, 2, NA, 2, 2, 2, 2, 2, NA, NA), b3 = c(3,
3, 3, 3, 3, 3, 3, 3, 3, NA, 3, 3, 3, 3, 3, NA, NA), b4 = c(4,
4, 4, 4, 4, 4, 4, 4, 4, NA, 4, 4, 4, 4, 4, 4, 4), b5 = c(5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5)), .Names = c("KeyID",
"CreditValue", "b1", "b2", "b3", "b4", "b5"), row.names = c(NA,
17L), class = "data.frame")), .Names = c("f1", "f2", "f3",
"f4", "f5"))
I hope you can help me. Building a function to compute these sums is too complex for me, and if I keep writing the code by hand I will have problems with lists that contain more data frames. Thanks for your help.
You can use lapply and call a function that builds the rows of your output data frame:
get.sums = function(df) {
sapply(1:5, function(y) {
if (y > 1) {
na.col = 3:(y+1)
} else {
na.col = NULL
}
if (paste0("b", y) %in% names(df)) {
return(sum(df$CreditValue[rowSums(!is.na(df[,na.col,drop=F])) == 0 & df[,(y+2)] == y], na.rm=T))
} else {
return(0)
}
})
}
rows = lapply(global, get.sums)
sums = do.call(rbind, rows)
sums
# [,1] [,2] [,3] [,4] [,5]
# f1 45 0 0 0 0
# f2 36 30 0 0 0
# f3 36 22 30 0 0
# f4 97 3 0 12 0
# f5 97 0 5 28 10
