Adding a calculated column to a data matrix in R

Adding a calculated column to a data matrix in R - r

I have a data matrix with several columns, with Revenue, Cost_Unit and Quantity being some of them. I want to append a "Profit" column to my matrix, calculated as Revenue - Cost_Unit*Quantity. What's the most efficient way to do this? There might be a million rows in my matrix so I want it to be as fast as possible.
This is the error I'm getting. Could anyone help me out please?
final_set$Profit = final_set$Revenue - (final_set$Cost_Unit*final_set$Quantity)
Error in [<-.data.table(x, j = name, value = value) :
RHS of assignment to new column 'Profit' is zero length but not empty list(). For new columns the RHS must either be empty list() to create an empty list column, or, have length > 0; e.g. NA_integer_, 0L, etc.

Assuming you have the following data:
set.seed(1)
Cost_Unit <- rnorm(10, 100, 10)
Quantity <- rnorm(10, 1000, 100)
Revenue <- Cost_Unit*runif(10,1.02,1.1)*Quantity
final_set <- data.frame(Cost_Unit, Quantity, Revenue)
final_set$Profit <- with(final_set, Revenue - Cost_Unit * Quantity)
This will give you:
# Cost_Unit Quantity Revenue Profit
#1 93.73546 1151.1781 117151.15 9244.941
#2 101.83643 1038.9843 113399.64 7593.181
#3 91.64371 937.8759 93052.92 7102.482
#4 115.95281 778.5300 96072.12 5799.383
#5 103.29508 1112.4931 122083.18 7168.122
#6 91.79532 995.5066 98981.19 7598.346
#7 104.87429 998.3810 106994.02 2289.520
#8 107.38325 1094.3836 124355.50 6837.037
#9 105.75781 1082.1221 123436.37 8993.504
#10 96.94612 1059.3901 110449.52 7745.766

Related

How can I find the optimal price for each time t in R?

I have the price range price <- c(2.5,2.6,2.7,2.8)
and my dataset have several time t. For each time t, I have a corresponding cost c and demand quantity d.
I need to find the optimal price for each time t to maximise my required profit function (p-c)*d.
How can I achieve that?
The sample of mydata looks like this, I have 74 observations in total:
t
c
d
1
0.8
20
2
0.44
34
3
0.54
56
4
0.67
78
5
0.65
35
Here is my code but it reports error, can anybody help me to fix it? Much thanks!
max <-data.frame()
for (i in mydata$t) {
for (p in price) {
profit <- ((p-mydata$c)*mydata$d)
max <- max %>% bind_rows(data.frame(time=mydata$t,
price=p,
cost=mydata$c,
profit = profit
))
}
}
maxvalue <- max %>% group_by(time) %>% max(profit)

Since you did not provide a piece of your data which I could use, this is a bit of a guess, but the idea would be:
dat <- as.data.table(mydata)
# Iterate through each value of t and get the price for which (p-c)*d is the highest
result <- dat[, p[which.max((p-c)*d))], t]

Ok! I did not realize you kept the price outside your table. Then try adding all possibilities to the table first this:
dat <- data.table(t= 1:5,
c= c(0.8,0.44,0.54,0.67,0.65),
d= c(20,34,56,78,35))
# Add all possible prices as an extra column (named p)
# Note that all lines will be repeated accordingly
dat <- dat[, .(p= c(2.5,2.6,2.7,2.8)), (dat)]
# Iterate through each value of t and get the price for which (p-c)*d is the highest
result <- dat[, .(best_price= p[which.max((p-c)*d)]), t]

Way to loop over multiple tables and keep only if condition met?

So I'm working on project that has multiple data tables, separated by month, that I need to iterate through. Speed is of the essence here, and I can't seem to get the time down to something reasonable unless I do a lot of crossjoins through data table functions. So here are my tables:
TABLE 1
Product Date Cost
A 8/1/2020 10
A 8/2/2020 20
A 8/3/2020 30
B 8/4/2020 15
B 8/5/2020 25
B 8/6/2020 35
and TABLE 2:
Product Date Price
A 9/1/2020 20
A 9/2/2020 30
A 9/3/2020 40
B 9/4/2020 27
B 9/5/2020 33
B 9/6/2020 42
So I need to iterate over every combination of Table 2 Price - Table 1 Cost, and do it by Product. So output would be:
NEW TABLE
Product Date1 Date2 Profit
A 8/1/2020 9/1/2020 10
A 8/1/2020 9/2/2020 20
...
EDIT: To clarify, the New Table should continue on. Product A should have 27 different profits (3 dates under A x 3 dates under A x 3 discount rates) assuming they are all above 0. If any of the profits are below 0, then I don't want them as part of the New Table.
I also have a Discount factor I need to apply to each permutation of Price as we give discounts quite a bit
Discount = c(10%,12%,18%)
I've tried using a loop and various ways of using apply but the loops take way too long to finish (hours, and some never do). The combinations lead to millions of rows but I only want to keep the profitable ones, where Price*Discount > Cost, which are only maybe 10,000 in number.
My solution is to cross join the data tables to create a massive table that I can vectorize against, which is much faster (around 1 min) but with some of the larger tables I quickly run into memory constraints and it isn't very scalable.
CTbl =setkey(CTbl[,c(k=1,.SD)],k)[Price[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]
CTbl[,Profit:=(Discount*Price - Cost]
CTbl = setDT(CTbl)[, .SD[Price > Cost ]]
DT = CTbl[,list(MinProfit = min(Profit)),by = Product]
Of course this is quite fast but is a huge of waste of memory when all I really want is profitable rows, and of course the ongoing memory issue.
Can anyone help? I've asked some R users at work but they seem stumped as well, the loops they made couldn't get close to the sub-5 minutes it takes to run the above. I don't mind a bit of extra time if it means I can scale it up.
Thanks!

This sounds like a problem for the dplyr package, which. The dplyr package allows you to string together data operations in a "pipe" to avoid storing things in memory. The pipe operator %>%takes the output of the function on the left and uses it as the first argument of the function on the right. Each function in the dplyr package works over the entire vector or data tibble, so no need for loops.
So, your operation might look like the following:
# Initialize random data like your first table
df1 <- data.frame(product = sample(LETTERS[1:10], 10000, replace = TRUE),
date1 = sample(seq(as.Date("2020/08/01"), as.Date("2020/08/31"),
by = "day"), 10000, replace = TRUE),
cost = round(runif(10000, 5, 100)))
# Initialize random data like your second table
df2 <- data.frame(product = sample(LETTERS[1:10], 10000, replace = TRUE),
date2 = sample(seq(as.Date("2020/09/01"), as.Date("2020/09/30"),
by = "day"), 10000, replace = TRUE),
price = round(runif(10000, 5, 100)))
# Initialize discounts
discounts <- data.frame(product = rep(LETTERS[1:10],4),
discount = rep(c(0, 0.1, 0.12, 0.18), 10))
library(dplyr)
out_table <- df1 %>%
full_join(df2) %>%
full_join(discounts) %>%
mutate(profit = price * discount - cost) %>%
filter(profit > 0)
For my random data, this takes about 3 seconds on my machine. Furthermore, the filter verb only keeps those rows we want.

This is not a complete answer to your question, but maybe you can iterate a loop by products. The following function finds profits for a specified product. The function does not include discount but it can be added if the function works as you want.
profit = function(product, df1, df2) {
cost = with(df1, df1[which(Product == product), 'Cost'])
price = with(df2, df2[which(Product == product), 'Price'])
date = merge(
with(df1, df1[which(Product == product), 'Date']),
(with(df2, df2[which(Product == product), 'Date']))
)
product = t(matrix(rep(price, length(cost)), nrow = length(cost)) - t(matrix(rep(cost, length(price)), ncol = length(price))))
product = data.frame(cbind(date[which(product > 0), ], product[which(product > 0)]))
names(product) = c('costdate', 'pricedate', 'profit')
return(product)
}
Example:
df1 = data.frame(Product = c('A', 'A', 'A', 'B', 'B', 'B'),
Date = c('8/1/2020', '8/2/2020', '8/3/2020', '8/4/2020', '8/5/2020', '8/6/2020'),
Cost = c(10, 20, 30, 15, 25, 35))
df2 = data.frame(Product = c('A', 'A', 'A', 'B', 'B', 'B'),
Date = c('9/1/2020', '9/2/2020', '9/3/2020', '9/4/2020', '9/5/2020', '9/6/2020'),
Price = c(20, 30, 40, 27, 33, 42))
> profit('A', df1, df2)
costdate pricedate profit
1 8/1/2020 9/1/2020 10
4 8/1/2020 9/2/2020 20
5 8/2/2020 9/2/2020 10
7 8/1/2020 9/3/2020 30
8 8/2/2020 9/3/2020 20
9 8/3/2020 9/3/2020 10
> profit('B', df1, df2)
costdate pricedate profit
1 8/4/2020 9/4/2020 12
2 8/5/2020 9/4/2020 2
4 8/4/2020 9/5/2020 18
5 8/5/2020 9/5/2020 8
7 8/4/2020 9/6/2020 27
8 8/5/2020 9/6/2020 17
9 8/6/2020 9/6/2020 7
I could not test it properly since I have limited data.

How to count and plot a cumulative number over a date range by groups

I want to find the best way to plot a chart showing the cumulative number of individuals in a group based on the date they came into the group as well as the date they may have left the group. This would be within the minimum and maximum date ranges of the date values. Each row is a person.
group_id Date_started Date_exit
1 2005-06-23 NA
1 2013-03-17 2013-09-20
2 2019-10-24 NA
3 2019-11-27 2019-11-27
4 2019-08-14 NA
3 2018-10-17 NA
4 2018-04-13 2019-10-12
1 2019-07-10 NA
I've considered creating a new data frame with a row per day within the min/max range and then applying some kind of function to tally the groups totals for each row (adding and subtracting from a running total based on whether or not there is a new value in either of the columns) but I'm not sure if one, that's the best way to approach the problem and two, how to practically run the cumulative count function either.
Ultimately though I want to be able to plot this as a line chart so I can see the trends over time for each group as I suspect one or more of them are much more volatile in terms of overall numbers. So again I'm not sure if ggplot2 has something already in place to handle this.

As you mentioned, you will need to create a dataframe with the desired dates and count, for each group, how many individuals are in the group.
I quickly put this together, so I'm sure there's a more optimal solution, but it should be what you're looking for.
library(ggplot2)
library(reshape2) # for melt
# your data
test <- read.table(
text =
"group_id,Date_started,Date_exit
1,2005-06-23,NA
1,2013-03-17,2013-09-20
2,2019-10-24,NA
3,2019-11-27,2019-11-27
4,2019-08-14,NA
3,2018-10-17,NA
4,2018-04-13,2019-10-12
1,2019-07-10,NA",
h = T, sep = ",", stringsAsFactors = F
)
# make date series
from <- min(as.POSIXct(test$Date_started))
to <- max(as.POSIXct(test$Date_started))
datebins <- seq(from, to, by = "1 month")
d_between <- function(d, ds, de){
if(ds <= d & (de > d | is.na(de)))
return(TRUE)
return(FALSE)
}
# make df to plot
df <- data.frame(dates = datebins)
df[,paste0("g", unique(test$group_id))] <- 0
for(i in seq_len(nrow(df))){
for(j in seq_len(nrow(test))){
gid <- paste0("g", test$group_id[j])
df[i, gid] <- df[i, gid] + d_between(df$dates[i], test$Date_started[j], test$Date_exit[j])
}
}
# plot
ggplot(melt(df, id.vars = "dates"), aes(dates, value, color = variable)) +
geom_line(size = 1) + theme_bw()
This gives:
Feel free to play with the date bins (in seq()) as necessary.
EDIT : for loop explanation
for(i in seq_len(nrow(df))){
for(j in seq_len(nrow(test))){
gid <- paste0("g", test$group_id[j])
df[i, gid] <- df[i, gid] + d_between(df$dates[i], test$Date_started[j], test$Date_exit[j])
}
}
The first loop iterates over the chosen dates.
For each date, go through the dataframe of interest (test) with the second for loop and use the custom d_between() function to determine whether or not an individual is part of the group. That function returns a boolean (which can translate to 0/1). The value 0 or 1 is then added to the df dataframe's column corresponding to the appropriate group (with gid) at the date we checked (row i).
Note that I'm considering the individuals as part of the group as soon as they join (ds <= d), but are not a part of the group the day they quit (de > d).

Use first digit as factor to standardize values in R

I have large data frame tocalculate from a survey (original data frame brfss2013 where one of the variables represents the number of times a person checks blood glucose levels. The data is in 3 digits:
First digit tells you if the measurements are per day (1), per week (2), per month (3)or per year (4). The second and third digits represent the actual value.
Example: 101 is once ( _01) per day (1 _ _), 202 is twice per week, etc.
I want to standardize everything to get value of times per year. So I will multiply the 2nd and 3rd digits by 365, 52.143, 12 and 1 (days, weeks, months, year).
I think I would be able to "select" the digits to use, but I'm not sure how to write something that can work with different rows with different set of instructions.
EDIT:
Adding my attempt and sample data.
tocalculate <- brfss2013 %>%
filter(nchar(bldsugar) > 2)
bldsugar2 <- sapply(tocalculate$bldsugar, function(x) {
if (substr(x,1,1) == 1) {x*365}
if (substr(x,1,1) == 2) {x*52}
if (substr(x,1,1) == 3) {x*12}
if (substr(x,1,1) == 4) {x*365}
})
I'm getting a lot of NULL values though...

Since you're already using dplyr, recode is a handy function. I use %/% to see how many times 100 goes in to each bldsugar value and %% to get the remainder when divided by 100.
# sample data
brfss_sample = data.frame(bldsugar = c(101, 102, 201, 202, 301, 302, 401, 402))
library(dplyr)
mutate(
brfss_sample,
mult = recode(
bldsugar %/% 100,
`1` = 365.25,
`2` = 52.143,
`3` = 12,
`4` = 1
),
checks_per_year = bldsugar %% 100 * mult
)
# bldsugar mult checks_per_year
# 1 101 365.250 365.250
# 2 102 365.250 730.500
# 3 201 52.143 52.143
# 4 202 52.143 104.286
# 5 301 12.000 12.000
# 6 302 12.000 24.000
# 7 401 1.000 1.000
# 8 402 1.000 2.000
You could, of course, remove the mult column (or combine the definitions so it is never created in the first place).

#Data
set.seed(42)
x = sample(101:499, 100, replace = TRUE)
#1st digit
as.factor(floor((x/100)))
#Values
((x/100) %% 1) * 100

Perhaps the first thing you can do is to split the 3-digit variable into two variables. The first variable is only one digit, which shows sampling frequency; and the second variable shows times of measurement.
In R, substr or substring can select the string by specifying the first and last position to subset.
# Create example data frame
ex_data <- data.frame(var = c("101", "202", "204"))
# Split the variable to create two new columns
ex_data$var1 <- substring(ex_data$var, first = 1, last = 1)
ex_data$var2 <- substring(ex_data$var, first = 2, last = 3)
# Remove the original variable
ex_data$var <- NULL
After this, you can manipulate your data frame. Perhaps convert var1 to factor and var2 to numeric for further manipulation and analysis.

R count and substract events from a data frame

I am trying to calculate the families sizes from a data frame, which also contains two types of events : family members who died, and those who left the family. I would like to take into account these two parameters in order to compute the actual family size.
Here is a reproductive example of my problem, with 3 families only :
family <- factor(rep(c("001","002","003"), c(10,8,15)), levels=c("001","002","003"), labels=c("001","002","003"), ordered=TRUE)
dead <- c(0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0)
left <- c(0,0,0,0,0,1,0,0,0,1,1,0,0,0,1,1,0,0,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0)
DF <- data.frame(family, dead, left) ; DF
I could count N = total family members (in each family) in a second dataframe DF2, by simply using table()
DF2 <- with(DF, data.frame(table(family)))
colnames(DF2)[2] <- "N" ; DF2
family N
1 001 10
2 002 8
3 003 15
But i can not find a proper way to get the actual number of people (for example, creating a new variable N2 into DF2) , calculated by substracting to N the number of members who died or left the family. I suppose i have to relate the two dataframes DF and DF2 in a way. i have looked for other related questions in this site but could not find the right answer...
If anyone has a good idea, it would be great !
Thank you in advance..
Deni

Logic : First we want to group_by(family) and then calculate 2 numbers : i) total #obs in each group ii) subtract the sum(dead) + sum(left) from this total .
In dplyr package : n() helps us get the total #observations in each group
In data.table : .N does the same above job
library(dplyr)
DF %>% group_by(family) %>% summarise( total = n(), current = n()-sum(dead,left, na.rm = TRUE))
# family total current
# (fctr) (int) (dbl)
#1 001 10 6
#2 002 8 4
#3 003 15 7
library(data.table)
# setDT() is preferred if incase your data was a data.frame. else just DF.
setDT(DF)[, .(total = .N, current = .N - sum(dead, left, na.rm = TRUE)), by = family]
# family total current
#1: 001 10 6
#2: 002 8 4
#3: 003 15 7

Here is a base R option
do.call(data.frame, aggregate(dl~family, transform(DF, dl = dead + left),
FUN = function(x) c(total=length(x), current=length(x) - sum(x))))
Or a modified version is
transform(aggregate(. ~ family, transform(DF, total = 1,
current = dead + left)[c(1,4:5)], FUN = sum), current = total - current)
# family total current
#1 001 10 6
#2 002 8 4
#3 003 15 7

I finally found another which works fine (from another post), allowing to compute everything from the original DF table. This uses the ddply function :
DF <- ddply(DF,.(family),transform,total=length(family))
DF <- ddply(DF,.(family),transform,actual=length(family)-sum(dead=="1")-sum(left=="1"))
DF
Thanks a lot to everyone who helped ! Deni

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Adding a calculated column to a data matrix in R - r

Related

How can I find the optimal price for each time t in R?

Way to loop over multiple tables and keep only if condition met?

How to count and plot a cumulative number over a date range by groups

Use first digit as factor to standardize values in R

R count and substract events from a data frame

Categories

Resources