Totalize component products from a table in R

I need my warehouse to be able to know how many items of each component we need per day. Basically, I have bundled items made of single products, and I want the warehouse to know how many of those single items they should provide on any given date.
I currently have data like this:
date       bundle_name totbund prod1 totprod1 prod2 totprod2
06/01/2019 a_bund      1       a     1        b     1
06/01/2019 a           1
06/01/2019 b           2
07/01/2019 b_bund      1       b     2
07/01/2019 b_bund      2       b     4
07/01/2019 b           2
My expected output is this:
date       all_item total
06/01/2019 a        2
06/01/2019 b        3
07/01/2019 b        8
Please notice that the bundle_name column can contain either a bundle or a single item, so it is mixed.

Something like this could work (I use 'a' as an example):
dat <- dat %>%
  group_by(date) %>%
  summarize(a_bund  = sum(totbund[bundle_name == 'a']),
            a_prod1 = sum(totprod1[prod1 == 'a']),
            a_prod2 = sum(totprod2[prod2 == 'a'])) %>%
  mutate(a = a_bund + a_prod1 + a_prod2)

I wouldn't use the bundled notation; it sounds overly complicated. If you have everything in row format, you can use the group_by/summarize functionality of dplyr.
Assuming the data is called df:
library(dplyr)

df <- df %>%
  select(date, prod = prod1, totprod = totprod1) %>%
  filter(prod != "") %>%
  bind_rows(df %>%
              select(date, prod = prod2, totprod = totprod2) %>%
              filter(prod != "")) %>%
  group_by(date, prod) %>%
  summarize(totprod = sum(totprod))

As I said in the comments, you need a better approach to this problem.
I suggest you look at this from a structured-database perspective. In that view, your data (and thus your world) is made of tables with differing and complementary information, and when you need information to solve your problem, you join data from different tables. If you have used Excel, you'll know this as VLOOKUP.
How I'd approach your problem:
Table of components:
First, I'll have a table of components. This is a very simple table of 3 columns: the name of the product, the component it is made from, and the amount of that component needed.
For your example, I'll have
library(data.table)
components <- structure(list(name = c("a", "b", "a_bund", "a_bund", "b_bund"),
                             component = c("a", "b", "a", "b", "b"),
                             amount = c(1, 1, 1, 1, 2)),
                        row.names = c(NA, -5L),
                        class = c("data.table", "data.frame"))
Which will produce:
components
name component amount
1: a a 1
2: b b 1
3: a_bund a 1
4: a_bund b 1
5: b_bund b 2
Notice that the information contained here is exactly the same information you have in columns 4 to 7 of your table (by the way, your table is in "wide" format, while mine is in "long" format; long is much better for machine processing and is considered "tidy").
Table of Requests
Now that you have a table for the components, you'll need a table stating how many units of product x your clients need on date y. Do you notice that I separated the information into two tables? One holds components and nothing else; the other holds requests and nothing else. Each thingy in its own basket!
I called this table requests, and it has three columns: dates with the date of the request, name with the name of the product requested by the client, and qty with the quantity the client expects of that product. That is what you have in columns one to three of your data.
requests <- structure(list(dates = structure(c(17902, 17902, 17902, 17903, 17903, 17903), class = "Date"),
                           name = c("a_bund", "a", "b", "b_bund", "b_bund", "b"),
                           qty = c(1, 1, 2, 1, 2, 2)),
                      row.names = c(NA, -6L),
                      class = c("data.table", "data.frame"))
Which produces:
requests
dates name qty
1: 2019-01-06 a_bund 1
2: 2019-01-06 a 1
3: 2019-01-06 b 2
4: 2019-01-07 b_bund 1
5: 2019-01-07 b_bund 2
6: 2019-01-07 b 2
Joining the tables
With these two tables, you now need to know how many units of each component you'll need on any given date. To solve this I'll use the data.table package; see ?data.table for details.
requests[components, on = "name" ][, sum(qty*amount), by = .(dates, component)]
What is in there?
requests[components, on = "name"] joins the table requests with components, matching rows that have the same name. In other words, it brings in the component and amount (from components, of course) for each name in requests. Paste the command and see what the result is.
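If it helps, the intermediate result of that join should look roughly like this (my rendering, given the two tables above; the exact row order may differ):
        dates   name qty component amount
1: 2019-01-06      a   1         a      1
2: 2019-01-06      b   2         b      1
3: 2019-01-07      b   2         b      1
4: 2019-01-06 a_bund   1         a      1
5: 2019-01-06 a_bund   1         b      1
6: 2019-01-07 b_bund   1         b      2
7: 2019-01-07 b_bund   2         b      2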
data.table syntax allows "chaining", that is, passing an intermediate result to a new operation. That's what happens with the ][ sequence: I joined the tables and am now feeding that result into a new operation.
That new operation is sum(qty * amount). It multiplies (you weren't wrong initially) the number of requested units qty by the amount of each component needed to produce the product, and sums (aggregates) the result by = .(dates, component), which is fairly self-explanatory. (If you come from the Excel world, think of a pivot or dynamic table.)
That produces your expected output:
requests[components, on = "name" ][, sum(qty*amount), by = .(dates, component)]
dates component V1
1: 2019-01-06 a 2
2: 2019-01-06 b 3
3: 2019-01-07 b 8
While the result is the same as in the other answers, I hope you see the difference in approach and the enhanced usability of this one. If not, just imagine that k_bundle is made of 19 different components ;)
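If you prefer dplyr, a roughly equivalent sketch of the same join-then-aggregate idea (my addition, not part of the original answer, assuming the components and requests tables defined above):
library(dplyr)

requests %>%
  inner_join(components, by = "name") %>%   # bring in component and amount for every request
  group_by(dates, component) %>%
  summarise(total = sum(qty * amount), .groups = "drop")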


How can I create a dummy variable based on text analysis and time sequence of events?

Coworkers  Date
A          2011-01-01
D          2011-01-02
B;;D       2011-01-03
E;;F       2011-01-04
D          2012-11-05
D;;G       2012-11-06
A          2012-11-09
Hello, I am trying to create a dummy variable based on text analysis (e.g., grepl).
The unit of analysis is a project, and the two main variables are coworkers (text) and date.
I am wondering whether there is a way to create a dummy variable that marks with a 1 all projects done by a worker who has collaborated with "B" (in this example, D).
One more KEY condition I would like to add: I want to give the value 1 only to projects that occurred AFTER B and D worked together.
That is, in this case, I want to mark the project in the second row, done by D, as 0, because it occurred before B and D met.
Can I create this type of variable using R commands?
As I have millions of observations, I would not be able to do it manually.
The letters in the table are text values.
Thank you!
PS. In the "Coworkers" column, coworkers are separated by ;;
Here is one possible solution with the tidyverse (at least I think this is what you are looking for). First, I create a new column (flag) that indicates whether B has co-worked with D; if so, I assign a 1. Next, I use cummax, which turns everything after the first B;;D row into 1. This creates two groups, before and after. Next, I use case_when to change the first occurrence itself to 0, as you specified. Then, any remaining row containing D is changed to 1 and all others to 0.
library(tidyverse)

df %>%
  mutate(flag = ifelse(str_detect(Coworkers, "B;;D") | str_detect(Coworkers, "D;;B"), 1, 0),
         flag = cummax(flag == 1),
         flag = case_when(flag != 0 & !duplicated(flag) ~ 0,
                          grepl("D", Coworkers) & flag == 1 ~ 1,
                          TRUE ~ 0))
Output
Coworkers Date flag
1 A 2011-01-01 0
2 D 2011-01-02 0
3 B;;D 2011-01-03 0
4 E;;F 2011-01-04 0
5 D 2012-11-05 1
6 D;;G 2012-11-06 1
7 A 2012-11-09 0
8 B 2012-12-09 0
9 C;;B 2012-12-09 0
Data
df <- structure(list(Coworkers = c("A", "D", "B;;D", "E;;F", "D", "D;;G",
"A", "B", "C;;B"), Date = c("2011-01-01", "2011-01-02", "2011-01-03",
"2011-01-04", "2012-11-05", "2012-11-06", "2012-11-09", "2012-12-09",
"2012-12-09")), class = "data.frame", row.names = c(NA, -9L))

Filtering a specific value with the aggregate function in R

Hi,
I would like to find, for each customer, the largest datetime value within each group defined by the first three digits of mccmnc.
As you can see in the picture, customer == 'abghsd' has two different mccmnc values, '53208' and '53210'. The first three digits of mccmnc, however, are the same (532). So I want to keep customer abghsd's maximum datetime value for mccmnc starting with '532'. For customer 'abbaedl', I need the maximum datetime for mccmnc starting with '623' and for mccmnc starting with '451'.
So may I ask how to set up the conditions for this?
With the query below I was able to get the maximum datetime by customer and mccmnc, but I want to group by the first three digits of mccmnc instead.
processed <- aggregate(datetime ~ customer + mccmnc, data =raw_data3, max)
This is the result that I want to get:
customer datetime       mccmnc
abghsd   20181123222022 53210
abbaedl  20181226121213 62330
abbaedl  20181227191919 45123
Thank you.
Editing your original code, you can just add substr():
processed <- aggregate(datetime ~ customer + substr(mccmnc, 1, 3), data = raw_data3, max)
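One small caveat (my note, not part of the original answer): with a formula like this, the grouping column in the result is literally named substr(mccmnc, 1, 3), so you may want to rename it afterwards:
processed <- aggregate(datetime ~ customer + substr(mccmnc, 1, 3), data = raw_data3, max)
names(processed)[2] <- "mccmnc_group"  # friendlier name for the computed grouping column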
Alternatively, a tidyverse solution:
Code
library(tidyverse)

df %>%
  # Group by customer ID and first 3 characters of mccmnc
  group_by(customer, mccmnc_group = substr(mccmnc, 1, 3)) %>%
  # Get the max datetime per group
  summarise(max_datetime = max(datetime)) %>%
  # Put columns in original order
  select(1, 3, 2)
# A tibble: 3 x 3
# Groups: customer [2]
customer max_datetime mccmnc_group
<fct> <dbl> <chr>
1 John Package 20181201 532
2 Miranda Nuts 20181227 451
3 Miranda Nuts 20181226 623
Data
df <- data.frame(customer = c(rep("John Package", 3), rep("Miranda Nuts", 4)),
                 datetime = c(20181123, 20181201, 20181124, 20181125, 20181226, 20181226, 20181227),
                 mccmnc = c("532-08", "532-08", "532-10", "623-12", "623-30", "451-21", "451-23"))
> df
customer datetime mccmnc
1 John Package 20181123 532-08
2 John Package 20181201 532-08
3 John Package 20181124 532-10
4 Miranda Nuts 20181125 623-12
5 Miranda Nuts 20181226 623-30
6 Miranda Nuts 20181226 451-21
7 Miranda Nuts 20181227 451-23

How to optimize iterating over a huge dataframe with non-unique rows

I understand that if R is not updating a variable in place within a for loop, then I've just written some horrendously slow and expensive code. Unfortunately, with a set of very tight deadlines and a strong background in C++/Java, it's my go-to behaviour until I can get my R hat on.
I have a function I need to improve. It takes a dataframe (as below), returns the unique patid values, and uses those to retrieve subsets of that dataframe for date modification. A trimmed example is below (note: I pulled this from a completed run, so the dates have already been modified). The last R run was over a dataframe of 27 million rows and took about four to five hours. The real dataframe will be a lot bigger.
patid  eventdate
1      12/03/1998
1      12/03/1998
2      04/03/2007
3      15/11/1980
3      15/11/1980
3      01/02/1981
A trimmed example of the function:
rearrangeDates <- function(dataFrame) {
  # return a list of the unique patient ids
  uniquePatids <- getUniquePatidList(dataFrame) # this is only called once and is very fast
  out <- NULL
  for (i in 1:length(uniquePatids)) { # iterate over the list
    idf <- subset(dataFrame, dataFrame$patid == uniquePatids[[i]])
    idf$eventdate <- as.POSIXct(idf$eventdate, format = "%d/%m/%Y")
    idf <- idf[order(idf$eventdate, decreasing = FALSE), ]
    out <- rbind(out, idf)
  }
  return(out)
}
Can anyone suggest improvements?
Since you want to sort your data on patid & eventdate, this should work:
library(dplyr)

df %>%
  mutate(eventdate = as.Date(eventdate, format = "%d/%m/%Y")) %>%
  arrange(patid, eventdate)
Output is:
patid eventdate
1 1 1998-03-12
2 1 1998-03-12
3 2 2007-03-04
4 3 1980-11-15
5 3 1980-11-15
6 3 1981-02-01
Sample data:
df <- structure(list(patid = c(1L, 1L, 2L, 3L, 3L, 3L), eventdate = c("12/03/1998",
"12/03/1998", "04/03/2007", "15/11/1980", "15/11/1980", "01/02/1981"
)), class = "data.frame", row.names = c(NA, -6L))
This is ideally suited to data.table: your data has a well-defined key that you group by (patid, eventdate); you know the size of the output will be at most the size of the input, so it's safe to do in-place assignments (waaay faster) instead of iteratively appending rows; and data.table has a nice fast unique function. So please try out the (loop-free!) code below and let us know how it compares both to your original and to the dplyr approach:
require(data.table)

dt <- data.table(patid = c(1, 1, 2, 3, 3, 3),
                 eventdate = c('12/03/1998', '12/03/1998', '04/03/2007',
                               '15/11/1980', '15/11/1980', '01/02/1981'))
dt[, eventdate := as.POSIXct(eventdate, format = "%d/%m/%Y")]

# Setting a key physically sorts the table by these columns,
# and any later `by` operation on them will be super-fast
setkeyv(dt, c('patid', 'eventdate'))
(odt <- dt[])   # already ordered by the key
patid eventdate
1: 1 1998-03-12
2: 1 1998-03-12
3: 2 2007-03-04
4: 3 1980-11-15
5: 3 1980-11-15
6: 3 1981-02-01
(One last thing: don't be afraid of POSIXct/POSIXlt; convert to them early. They're more efficient than strings, and they support comparison operators, so the column can be used as a key, sorted on, and compared.)
(And for the fastest dplyr implementation, use dplyr::distinct())

Create new index / re-index in dplyr [duplicate]

This question already has answers here:
How to number/label data-table by group-number from group_by?
(6 answers)
Closed 6 years ago.
I am using a dplyr table in R. Typical fields would be a primary key, an id number identifying a group, a date field, and some values. I did some manipulation in preliminary steps that throws out a bunch of the data.
In order to do the next step of my analysis (in MC Stan), it'll be easier if both the date and the group id fields are integer indices. So basically, I need to re-index them as integers between 1 and the total number of distinct elements (about 750 for group_id and about 250 for date_id; group_id is already an integer, but the date is not). This is relatively straightforward to do after exporting to a data frame, but I was curious whether it is possible in dplyr.
My attempt at creating a new date_val (called date_val_new) is below. Per the discussion in the comments, I have some fake data. I purposely made the group and date values not run from 1 upwards, but I didn't make the date an actual date. I made the data unbalanced, removing some values, to illustrate the issue. The dplyr command below restarts the index at 1 for each new group, regardless of the date_val. So every group starts at 1, even if the date is different.
df1 <- data.frame(id = 1:40,
                  group_id = (10 + rep(1:10, each = 4)),
                  date_val = (20 + rep(rep(1:4), 10)),
                  val = runif(40))

for (i in c(5, 17, 33)) {
  df1 <- df1[!df1$id == i, ]
}

df_new <- df1 %>%
  group_by(group_id) %>%
  arrange(date_val) %>%
  mutate(date_val_new = row_number(group_id)) %>%
  ungroup()
This is essentially the base R method (match() against the unique values), applied inside mutate():
df1 %>% mutate(date_val_new = match(date_val, unique(date_val)))
Or with a data.table, df1[, date_val_new := .GRP, by=date_val].
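For completeness, a minimal runnable sketch of that data.table variant (my addition), using the df1 defined above:
library(data.table)

setDT(df1)                                  # convert df1 to a data.table by reference
df1[, date_val_new := .GRP, by = date_val]  # .GRP numbers the groups 1, 2, 3, ... in order of appearance
head(df1)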
Use group_indices_() to generate a unique id for each group:
df1 %>% mutate(date_val_new = group_indices_(., .dots = "date_val"))
Update
Since group_indices() does not handle class tbl_postgres, you could try dense_rank()
copy_to(my_db, df1, name = "df1")
tbl(my_db, "df1") %>%
mutate(date_val_new = dense_rank(date_val))
Or build a custom query using sql()
tbl(my_db, sql("SELECT *,
DENSE_RANK() OVER (ORDER BY date_val) AS DATE_VAL_NEW
FROM df1"))
Alternatively, I think you can try getanID() from the splitstackshape package.
library(splitstackshape)
getanID(df1, "group_id")[]
# id group_id date_val val .id
# 1: 1 11 21 0.01857242 1
# 2: 2 11 22 0.57124557 2
# 3: 3 11 23 0.54318903 3
# 4: 4 11 24 0.59555088 4
# 5: 6 12 22 0.63045007 1
# 6: 7 12 23 0.74571297 2
# 7: 8 12 24 0.88215668 3

Summarize a data.table with unreliable data

I have a data.table of events recording, say, user ID, country of residence, and event.
E.g.,
dt <- data.table(user = c(rep(3, 5), rep(4, 5)),
                 country = c(rep(1, 4), rep(2, 6)),
                 event = 1:10, key = "user")
As you can see, the data is somewhat corrupted: event 5 reports user 3 as being in country 2 (or maybe he traveled; it does not matter to me here).
So when I try to summarize the data:
dt[, country[.N] , by=user]
user V1
1: 3 2
2: 4 2
I get the wrong country for user 3.
Ideally, I would like to get the most common country for a user and the
percentage of time he spent there:
user country support
1: 3 1 0.8
2: 4 2 1.0
How do I do that?
The actual data has ~10^7 rows, so the solution has to scale (this is why I am using data.table and not data.frame after all).
Another way:
Edited. table(.) was the culprit. Changed it to complete data.table syntax.
dt.out <- dt[, .N, by = list(user, country)][, list(country[which.max(N)],
                                                    max(N) / sum(N)), by = user]
setnames(dt.out, c("V1", "V2"), c("country", "support"))
# user country support
# 1: 3 1 0.8
# 2: 4 2 1.0
Using plyr's count function:
library(plyr)

dt[, count(country), by = user][order(-freq),
                                list(country = x[1],
                                     support = freq[1] / sum(freq)),
                                by = user]
# user country support
#1: 4 2 1.0
#2: 3 1 0.8
The idea is to count the countries per user, order by frequency (descending), and then pick out the data you want.
A smarter version, thanks to @mnel, that doesn't need any extra functions:
dt[, list(freq = .N),
   by = list(user, country)][order(-freq),
                             list(country = country[1],
                                  support = freq[1] / sum(freq)),
                             by = user]
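For comparison, a dplyr sketch of the same count-then-pick-the-max idea (my addition, not from the original answers; how well it scales to ~10^7 rows compared with data.table would need benchmarking):
library(dplyr)

as_tibble(dt) %>%
  count(user, country) %>%                     # number of events per (user, country)
  group_by(user) %>%
  summarise(country = country[which.max(n)],   # most frequent country
            support = max(n) / sum(n))         # share of that user's events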
