Way to loop over multiple tables and keep only if condition met? - r

So I'm working on a project that has multiple data tables, separated by month, that I need to iterate through. Speed is of the essence here, and I can't seem to get the time down to something reasonable unless I do a lot of cross joins through data.table functions. So here are my tables:
TABLE 1
Product Date Cost
A 8/1/2020 10
A 8/2/2020 20
A 8/3/2020 30
B 8/4/2020 15
B 8/5/2020 25
B 8/6/2020 35
and TABLE 2:
Product Date Price
A 9/1/2020 20
A 9/2/2020 30
A 9/3/2020 40
B 9/4/2020 27
B 9/5/2020 33
B 9/6/2020 42
So I need to iterate over every combination of Table 2 Price - Table 1 Cost, and do it by Product. So output would be:
NEW TABLE
Product Date1 Date2 Profit
A 8/1/2020 9/1/2020 10
A 8/1/2020 9/2/2020 20
...
EDIT: To clarify, the New Table should continue on. Product A should have 27 different profits (3 cost dates under A x 3 price dates under A x 3 discount rates), assuming they are all above 0. If any of the profits are below 0, then I don't want them as part of the New Table.
I also have a Discount factor that I need to apply to each permutation of Price, since we give discounts quite a bit:
Discount = c(10%,12%,18%)
I've tried using a loop and various ways of using apply but the loops take way too long to finish (hours, and some never do). The combinations lead to millions of rows but I only want to keep the profitable ones, where Price*Discount > Cost, which are only maybe 10,000 in number.
My solution is to cross join the data tables to create a massive table that I can vectorize against, which is much faster (around 1 min) but with some of the larger tables I quickly run into memory constraints and it isn't very scalable.
CTbl = setkey(CTbl[, c(k = 1, .SD)], k)[Price[, c(k = 1, .SD)], allow.cartesian = TRUE][, k := NULL]
CTbl[, Profit := Discount * Price - Cost]
CTbl = CTbl[Profit > 0]
DT = CTbl[, list(MinProfit = min(Profit)), by = Product]
Of course this is quite fast, but it is a huge waste of memory when all I really want is the profitable rows, and the memory problem only gets worse as the tables grow.
Can anyone help? I've asked some R users at work but they seem stumped as well; the loops they wrote couldn't get close to the sub-5-minute runtime of the approach above. I don't mind a bit of extra time if it means I can scale it up.
Thanks!

This sounds like a problem for the dplyr package, which allows you to string data operations together in a "pipe" so you don't have to store intermediate objects. The pipe operator %>% takes the output of the function on its left and uses it as the first argument of the function on its right. Each dplyr verb works over the entire vector or tibble, so there is no need for loops.
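For instance, these two calls are equivalent (a trivial illustration only, using the df1 and cost column defined just below):
filter(df1, cost > 50)       # ordinary function call
df1 %>% filter(cost > 50)    # the same call written with the pipe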
So, your operation might look like the following:
# Initialize random data like your first table
df1 <- data.frame(product = sample(LETTERS[1:10], 10000, replace = TRUE),
                  date1 = sample(seq(as.Date("2020/08/01"), as.Date("2020/08/31"),
                                     by = "day"), 10000, replace = TRUE),
                  cost = round(runif(10000, 5, 100)))

# Initialize random data like your second table
df2 <- data.frame(product = sample(LETTERS[1:10], 10000, replace = TRUE),
                  date2 = sample(seq(as.Date("2020/09/01"), as.Date("2020/09/30"),
                                     by = "day"), 10000, replace = TRUE),
                  price = round(runif(10000, 5, 100)))
# Initialize discounts: every product gets all four discount levels
discounts <- data.frame(product = rep(LETTERS[1:10], each = 4),
                        discount = rep(c(0, 0.1, 0.12, 0.18), 10))
library(dplyr)

out_table <- df1 %>%
  full_join(df2) %>%
  full_join(discounts) %>%
  mutate(profit = price * discount - cost) %>%
  filter(profit > 0)
For my random data, this takes about 3 seconds on my machine. Furthermore, the filter verb only keeps those rows we want.
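The full joins above still materialize every product/date/discount pairing before filter() drops the unprofitable rows, which is where the memory goes. If that remains a problem, one option outside dplyr is a data.table non-equi join that applies the Price*Discount > Cost condition during the join, so only qualifying pairs are ever materialized. A sketch, with the table and column names assumed from the question:
library(data.table)

t1 <- data.table(Product = rep(c("A", "B"), each = 3),
                 Date = c("8/1/2020", "8/2/2020", "8/3/2020", "8/4/2020", "8/5/2020", "8/6/2020"),
                 Cost = c(10, 20, 30, 15, 25, 35))
t2 <- data.table(Product = rep(c("A", "B"), each = 3),
                 Date = c("9/1/2020", "9/2/2020", "9/3/2020", "9/4/2020", "9/5/2020", "9/6/2020"),
                 Price = c(20, 30, 40, 27, 33, 42))
discounts <- c(0.10, 0.12, 0.18)

profitable <- rbindlist(lapply(discounts, function(d) {
  tmp <- copy(t1)[, thr := Cost / d]   # d * Price > Cost  <=>  Price > Cost / d
  t2[tmp,
     on = .(Product, Price > thr),
     .(Product = i.Product, Date1 = i.Date, Date2 = x.Date,
       Discount = d, Profit = d * x.Price - i.Cost),
     nomatch = NULL, allow.cartesian = TRUE]
}))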

This is not a complete answer to your question, but maybe you can loop over products. The following function finds the profits for a specified product. It does not include the discount, but that can be added once the function works the way you want.
profit = function(product, df1, df2) {
  cost = with(df1, df1[which(Product == product), 'Cost'])
  price = with(df2, df2[which(Product == product), 'Price'])
  date = merge(with(df1, df1[which(Product == product), 'Date']),
               with(df2, df2[which(Product == product), 'Date']))
  # matrix of every price - cost combination for this product
  out = t(matrix(rep(price, length(cost)), nrow = length(cost)) -
          t(matrix(rep(cost, length(price)), ncol = length(price))))
  # keep only the positive (profitable) combinations and their date pairs
  out = data.frame(cbind(date[which(out > 0), ], out[which(out > 0)]))
  names(out) = c('costdate', 'pricedate', 'profit')
  return(out)
}
Example:
df1 = data.frame(Product = c('A', 'A', 'A', 'B', 'B', 'B'),
                 Date = c('8/1/2020', '8/2/2020', '8/3/2020', '8/4/2020', '8/5/2020', '8/6/2020'),
                 Cost = c(10, 20, 30, 15, 25, 35))
df2 = data.frame(Product = c('A', 'A', 'A', 'B', 'B', 'B'),
                 Date = c('9/1/2020', '9/2/2020', '9/3/2020', '9/4/2020', '9/5/2020', '9/6/2020'),
                 Price = c(20, 30, 40, 27, 33, 42))
> profit('A', df1, df2)
costdate pricedate profit
1 8/1/2020 9/1/2020 10
4 8/1/2020 9/2/2020 20
5 8/2/2020 9/2/2020 10
7 8/1/2020 9/3/2020 30
8 8/2/2020 9/3/2020 20
9 8/3/2020 9/3/2020 10
> profit('B', df1, df2)
costdate pricedate profit
1 8/4/2020 9/4/2020 12
2 8/5/2020 9/4/2020 2
4 8/4/2020 9/5/2020 18
5 8/5/2020 9/5/2020 8
7 8/4/2020 9/6/2020 27
8 8/5/2020 9/6/2020 17
9 8/6/2020 9/6/2020 7
I could not test it properly since I have limited data.
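If the function behaves as intended for single products, one way to cover the whole data set is to loop over the product values and stack the results; a sketch (untested beyond the toy data above, and the discount would still have to be handled inside profit(), for example by scaling Price before the call):
by_product <- lapply(unique(df1$Product), function(p) {
  out <- profit(p, df1, df2)
  if (nrow(out) == 0) return(NULL)   # skip products with no profitable pairs
  cbind(Product = p, out)            # record which product the rows belong to
})
all_profits <- do.call(rbind, by_product)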

Related

Select columns from a data frame

I have a data frame made up of several columns, each corresponding to a different industry per country. I have 56 industries and 43 countries, and I'd like to select only industries 5 to 22 for each country (18 industries). The big issue is that each industry per country is named AUS1, AUS2, ..., AUS56, so what I need to select is AUS5 to AUS22, AUT5 to AUT22, and so on.
A viable solution could be to select columns according to the following algorithm: the first column of interest, AUS5, corresponds to column 10, and I select up to AUS22 (corresponding to column 27). Then I should skip all the remaining columns for AUS (AUS23 to AUS56) and the first 4 columns of the next country (AUT1 to AUT4), and again select industries 5 to 22 for AUT. Basically, starting from column 10, the algorithm should select 18 columns (including column 10), then skip the next 38 columns, then select the next 18 columns. This process should be repeated for all 43 countries.
How can I code that?
UPDATE, Example:
df = data.frame(industry = c("C10", "C11", "C12", "C13"),
                country = c("USA"),
                AUS3 = runif(4),
                AUS4 = runif(4),
                AUS5 = runif(4),
                AUS6 = runif(4),
                DEU5 = runif(4),
                DEU6 = runif(4),
                DEU7 = runif(4),
                DEU8 = runif(4))
# I'm interested only in C10-C11:
df_a = df %>% filter(grepl('C10|C11', industry))
df_a
# Thus, how can I select columns AUS10, AUS11, DEU10, DEU11 efficiently, considering that I have a huge dataset?
Demonstrating the paste0 approach.
ctr <- unique(gsub('\\d', '', names(df[-(1:2)])))
# ctr <- c("AUS", "DEU") ## alternatively hard-coded
ind <- c(10, 11)
subset(df, industry %in% paste0('C', ind),
       select = c('industry', 'country', paste0(rep(ctr, each = length(ind)), ind)))
# industry country AUS10 AUS11 DEU10 DEU11
# 1 C10 USA 0.3376674 0.1568496 0.5033433 0.7327734
# 2 C11 USA 0.7421840 0.6808892 0.9050158 0.3689741
Or, since you appear to like grep, you could do:
df[grep('10|11', df$industry), grep('industry|country|[A-Z]{3}1[01]', names(df))]
# industry country AUS10 AUS11 DEU10 DEU11
# 1 C10 USA 0.3376674 0.1568496 0.5033433 0.7327734
# 2 C11 USA 0.7421840 0.6808892 0.9050158 0.3689741
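Since the question already uses dplyr, the same selection can also be written with tidyselect helpers, which scales to the original ask (industries 5 to 22 across all countries). A sketch, assuming the real column names follow the three-letter country code plus industry number pattern; the row filter is kept the same as above:
library(dplyr)

# anchored pattern matching exactly industries 5-22 for any 3-letter country code
ind_pattern <- paste0("^[A-Z]{3}(", paste(5:22, collapse = "|"), ")$")

df %>%
  filter(industry %in% c("C10", "C11")) %>%
  select(industry, country, matches(ind_pattern))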
If you have a big data set in memory, data.table could be ideal and much faster than alternatives. Something like the following could work, though you will need to play with select_ind and select_ctr as desired on the real dataset.
It might be worth giving us a slightly larger toy example, if possible.
library(data.table)
setDT(df)
select_ind <- paste0("C", c("11", "10"))
select_ctr <- paste0(rep(c("AUS", "DEU"), each = 2), c("10", "11"))

df[grepl(paste0(select_ind, collapse = "|"), industry),  # select rows
   ..select_ctr]                                          # select columns
AUS10 AUS11 DEU10 DEU11
1: 0.9040223 0.2638725 0.9779399 0.1672789
2: 0.6162678 0.3095942 0.1527307 0.6270880
For more information, see Introduction to data.table.
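If the column names turn out to be unreliable and only the positional rule from the question holds (first block of interest starting at column 10, 56 industry columns per country, 43 countries, 18 columns kept per block), plain index arithmetic is another option; a sketch with those offsets assumed from the description, aimed at the full data set rather than the toy df above:
starts <- 10 + 56 * (0:42)                     # column holding industry 5 for each country
keep   <- as.vector(outer(0:17, starts, "+"))  # 18 consecutive columns per country block
df_sub <- df[, keep]                           # prepend any identifier columns you also need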

How to add values of one column based on conditional statement of another column that has blank cells?

I'm trying to subset data based on a condition on a column that has blank values; the blanks occur because the employee logged in multiple times on the same work order. An example data set is shown below:
employee_name <- c("Person A","Person A","Person A","Person A","Person A", "Person B","Person B","Person B")
work_order <- c("WO001","WO001","WO001","WO002","WO003","WO001","WO003", "WO003")
num_of_points <- c(40,"","",64,25,20,68,"")
time <- c(10, 30, 15, 20, 25, 5, 15, 30)
final_summary <- data.frame(employee_name,work_order,num_of_points, time)
View(final_summary)
[screenshot of the input table]
Basically, I want to sum up the points and time by selecting all rows with points > 30, grouped by Employee Name and Work Order, which should return this:
[screenshot of the desired output]
I can do the summarize step properly, but when I perform the initial subset, it excludes the blank rows for num_of_points and therefore drops their adjacent time (in minutes) values from the sum. This makes sense, because subset(num_of_points > 30) only keeps values greater than 30. How can I tweak this to include the blank rows, so that I can filter the data and still compute the sum of time accurately, grouped by unique work order and employee name?
Convert num_of_points to numeric, then, grouped by 'employee_name' and 'work_order', take the sum of 'num_of_points' where it is greater than 30 and the sum of 'time'; finally, filter out the rows where 'num_of_points' is 0.
library(dplyr)
final_summary %>%
  mutate(num_of_points = as.numeric(num_of_points)) %>%
  group_by(employee_name, work_order) %>%
  summarise(num_of_points = sum(num_of_points[num_of_points > 30], na.rm = TRUE),
            time = sum(time)) %>%
  filter(num_of_points > 0)
# A tibble: 3 x 4
# Groups: employee_name [2]
# employee_name work_order num_of_points time
# <chr> <chr> <dbl> <dbl>
#1 Person A WO001 40 55
#2 Person A WO002 64 20
#3 Person B WO003 68 45
In base R you will do:
aggregate(.~employee_name + work_order, type.convert(final_summary), sum, subset = num_of_points>30)
employee_name work_order num_of_points time
1 Person A WO001 40 10
2 Person A WO002 64 20
3 Person B WO003 68 15
Note that the subset there also restricts the time sum to the rows with more than 30 points, which is why the time column differs from the expected output. To sum time over all rows, you can aggregate num_of_points and time separately and merge the results.
merge(aggregate(num_of_points ~ employee_name + work_order, final_summary,
                sum, subset = num_of_points > 30),
      aggregate(time ~ employee_name + work_order, final_summary, sum))
# employee_name work_order num_of_points time
#1 Person A WO001 40 55
#2 Person A WO002 64 20
#3 Person B WO003 68 45
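For completeness, the same grouped conditional sum can be written with data.table; a sketch, where the blank strings become NA on conversion and are dropped by na.rm = TRUE just as in the dplyr answer:
library(data.table)
setDT(final_summary)[, num_of_points := as.numeric(num_of_points)]

final_summary[, .(num_of_points = sum(num_of_points[num_of_points > 30], na.rm = TRUE),
                  time = sum(time)),
              by = .(employee_name, work_order)][num_of_points > 0]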

Grouped moving average with new time window in R

A minimal example would be:
set.seed(42)
df <- data.frame(ID = rep("P1", 100),
                 treatment = c(rep("A", 50), rep("B", 50)),
                 t = rep(1:50, 2),
                 x = rnorm(n = 100))
which, let's imagine, is some value measured each second. I would also like the average of each non-overlapping 20-second window, i.e. 1-20, 21-40, and so on.
My actual data has an ID column representing multiple subjects, a treatment column with labels for two treatments, and the actual time column is in increments of 5 (0, 5, 10, 15, 20). There are other important grouping columns. The values are from EEG recordings, and I need to increase the window size for some analyses.
What would be the cleanest way to do this? Note that there must be a way of knowing which time window the calculated average relates to (just 20, 40 would be enough, not a string like "1-20").
EDIT:
Based on a now-deleted comment, I got here:
df %>%
  mutate(timeWin = ceiling(t / 20) * 20) %>%
  group_by(ID, treatment, timeWin) %>%
  summarise(xAvg = mean(x))
The only issue with that bit of code is the rather crude use of ceiling: in the example above, the 41-50 bracket will be labelled 60.
df$grouped_time = ave(df$t, ceiling(df$t/20), FUN = max)
aggregate(df["x"], df[c("ID", "treatment", "grouped_time")], mean)
# ID treatment grouped_time x
#1 P1 A 20 0.19192002
#2 P1 B 20 0.27873536
#3 P1 A 40 -0.27099180
#4 P1 B 40 0.01661547
#5 P1 A 50 -0.02021535
#6 P1 B 50 -0.08719458
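If you prefer to stay in the dplyr pipe from the question, the same max-per-window label can be used as the grouping variable; a sketch (here ave() is applied to the full t column, which is fine as long as every ID/treatment combination shares the same time grid):
library(dplyr)

df %>%
  mutate(grouped_time = ave(t, ceiling(t / 20), FUN = max)) %>%
  group_by(ID, treatment, grouped_time) %>%
  summarise(xAvg = mean(x), .groups = "drop")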

R: Using different DFs to get third DF with specific info from first 2

I have two data frames, df1 has information about a publication's year, outlet name, total articles in this publication in a year, and a cumulative sum of articles over the period of time I'm studying. df2 has a random sample of article IDs, with potential values ranging from 1 to the total number of articles given by df1$cumsum.
What I need to do is to grab each article ID in df2 and identify in which publication and year it falls under, using the information contained in df1.
Here's a minimally reproducible example:
set.seed(890)
df1 <- NULL
df1$year <- c(2000:2009, 2000:2009)
df1$outlet <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,2,2,2,2,2,2,2,2,2)
df1$article_total <- sample(1:200, 20, replace = T)
df1$cumsum <- cumsum(df1$article_total)
df1 <- as.data.frame(df1)
df2 <- NULL
df2$art_num <- sample(1:2102, 100, replace = T) # get random sample of article IDs for the total number of articles I have in this db
df2 <- as.data.frame(df2)
Ideally, I would also like to calculate an article's ID within each year. For example, in the data above, outlet 1 has 14 articles in the year 2000 and 168 in 2001 (cumsum = 183). If I have an article ID of 156, I would like to know that it is the 142nd article in the year 2001 of publication 1. And so on and so forth for every article ID I have in this database.
I was thinking I should do this with a for loop, but I'm 100% lost in writing it. Here's what I began writing, but I have a feeling I'm not on the right track with it:
for (i in 1:nrow(df2)) {
  article_number <- df2$art_num[i]
  if (article_number %in% df1$cumsum) { # note: cumsum should be an interval before doing this?
    # get article number, year, publication in new df
    # also calculate article ID in each year/publication
  }
}
Thanks in advance for any help! I'm still lost with writing loops in R...
#######################
EDITED EXAMPLE as per Frank's suggestion
set.seed(890)
df1 <- NULL
df1$year <- c(2000:2002, 2000:2002)
df1$outlet <- c(1, 1, 1, 2,2,2)
df1$article_total <- sample(1:50, 6, replace = T)
df1$cumsum <- cumsum(df1$article_total)
df1 <- as.data.frame(df1)
df2 <- NULL
df2$art_id <- c(66, 120, 77, 156, 24)
df2 <- as.data.frame(df2)
Here's the output I'm looking for:
art_id outlet year article_number
1 66 1 2002 19
2 120 2 2000 35
3 77 1 2002 30
4 156 2 2001 35
5 24 1 2000 20
This example shows my ideal output in df3, which I calculated/built by hand. It has one column with the article's ID, the appropriate outlet, the year, and a new variable art_number. This is different from the article ID in that I calculated it from df1$cumsum and df3$art_id. In this example, the first row shows that the first article in my database has an ID of 66. I obtain an art_number value of 19 because this article (id = 66) is the 19th article published in the year 2002 by outlet 1. I calculated this value by looking at the article ID, locating the year and outlet based on df1$cumsum, and then subtracting the df1$cumsum value for the previous year from the art_id value. So for this specific article, I calculated df3$art_number = df3$art_id[1,1] - df1$cumsum[2,4]
I need to do this calculation for every article in my database so that I don't have to do this process by hand forever.
I think your data structure makes sense, though it would be easier with one additional column, for the first article in a year and outlet:
library(data.table)
setDT(df1); setDT(df2)
df1[, art_cstart := shift(cumsum(article_total), fill=0L) + 1L]
year outlet article_total cumsum art_cstart
1: 2000 1 4 4 1
2: 2001 1 43 47 5
3: 2002 1 38 85 48
4: 2000 2 36 121 86
5: 2001 2 39 160 122
6: 2002 2 8 168 161
Now, we can do a rolling update join, "rolling" each art_id up to the next cumsum at or above it and computing each desired column:
df2[, c("outlet", "year", "art_num") := df1[df2, on=.(cumsum = art_id), roll=-Inf, .(
x.year,
x.outlet,
i.art_id - x.art_cstart + 1L
)]]
art_id outlet year art_num
1: 66 2002 1 19
2: 120 2000 2 35
3: 77 2002 1 30
4: 156 2001 2 35
5: 24 2001 1 20
How it works
x[i, on=, roll=, j] is the syntax for a join, looking up each row of i in x.
In this join, j evaluates to a list of columns; .(...) is shorthand for list(...).
Column assignment is done with (colnames) := .(...).
The assignment is to the existing table df2 instead of unnecessarily creating a new table.
For details on how data.table syntax works, see the startup messages...
> library(data.table)
data.table 1.10.4
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
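To see the roll = -Inf lookup in isolation, here is a toy example with made-up values (unrelated to the data above): each i value is matched to the next x key at or above it.
library(data.table)
x <- data.table(cumsum = c(4, 47, 85), label = c("block1", "block2", "block3"))
i <- data.table(art_id = c(3, 50, 85))
x[i, on = .(cumsum = art_id), roll = -Inf, .(art_id = i.art_id, label)]
#    art_id  label
# 1:      3 block1
# 2:     50 block3
# 3:     85 block3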
This is the code you need, I think:
df3 <- data.frame(matrix(ncol = 3, nrow = 0))
colnames(df3) <- c("articleNumber", "year", "publication")

for (i in 1:nrow(df2)) {
  for (j in 1:nrow(df1)) {
    # the article belongs to block j when its ID is at most this block's cumsum
    # and greater than the previous block's cumsum
    if (df2$art_num[i] <= df1$cumsum[j] &&
        (j == 1 || df2$art_num[i] > df1$cumsum[j - 1])) {
      # get article number, year, publication in the new df
      df3[i, 1] <- df2$art_num[i]
      df3[i, 2] <- df1$year[j]
      df3[i, 3] <- df1$outlet[j]
      # the within-year article number would then be
      # df2$art_num[i] - c(0, df1$cumsum)[j]
    }
  }
}

Adding a calculated column to a data matrix in R

I have a data matrix with several columns, with Revenue, Cost_Unit and Quantity being some of them. I want to append a "Profit" column to my matrix, calculated as Revenue - Cost_Unit*Quantity. What's the most efficient way to do this? There might be a million rows in my matrix so I want it to be as fast as possible.
This is the error I'm getting. Could anyone help me out please?
final_set$Profit = final_set$Revenue - (final_set$Cost_Unit*final_set$Quantity)
Error in `[<-.data.table`(x, j = name, value = value) :
  RHS of assignment to new column 'Profit' is zero length but not empty list(). For new columns the RHS must either be empty list() to create an empty list column, or, have length > 0; e.g. NA_integer_, 0L, etc.
Assuming you have the following data:
set.seed(1)
Cost_Unit <- rnorm(10, 100, 10)
Quantity <- rnorm(10, 1000, 100)
Revenue <- Cost_Unit*runif(10,1.02,1.1)*Quantity
final_set <- data.frame(Cost_Unit, Quantity, Revenue)
final_set$Profit <- with(final_set, Revenue - Cost_Unit * Quantity)
This will give you:
# Cost_Unit Quantity Revenue Profit
#1 93.73546 1151.1781 117151.15 9244.941
#2 101.83643 1038.9843 113399.64 7593.181
#3 91.64371 937.8759 93052.92 7102.482
#4 115.95281 778.5300 96072.12 5799.383
#5 103.29508 1112.4931 122083.18 7168.122
#6 91.79532 995.5066 98981.19 7598.346
#7 104.87429 998.3810 106994.02 2289.520
#8 107.38325 1094.3836 124355.50 6837.037
#9 105.75781 1082.1221 123436.37 8993.504
#10 96.94612 1059.3901 110449.52 7745.766
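Since the error message shows that final_set is actually a data.table, the by-reference idiom may be the more direct route for a million-row table; the zero-length-RHS error usually means one of the referenced columns does not exist under that exact name, so check the spelling first. A sketch, assuming the three columns are present and numeric:
library(data.table)

setDT(final_set)   # no-op if it is already a data.table
final_set[, Profit := Revenue - Cost_Unit * Quantity]   # adds the column by reference, without copying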
