Summation over nearby observations - r

I have a large data.frame that includes the price of goods and the quantity sold at each price. For each row, I would like to find the total quantity of goods sold at a similar price (within a range of that row's price). For example, for the jth observation (row) I would like the sum of the quantities of goods sold at a price lower than Price_j+50 and higher than Price_j-50, and similarly for the other observations.
I can run a for loop over observations and filter the data for each observation's price.
library(dplyr)
df <- data.frame(Price = runif(100)*100, Q = runif(100)*1000)
SumQ <- data.frame()
for (i in seq_len(nrow(df))) {
  # Sum Q over all rows whose price is within 50 of row i's price
  df_filtered <- df %>% filter(Price < Price[i]+50 & Price > Price[i]-50) %>% summarize(sumQ = sum(Q))
  SumQ <- rbind(SumQ, df_filtered$sumQ)
}
Is there a more efficient way to do this? I have a large dataset and it takes a lot of time to run the for loop over all observations.

You want to avoid looping and binding the results - this will be very slow. Instead, try:
with(df, sapply(Price, function(x) sum(Q[Price < x+50 & Price > x-50])))

Or with dplyr and purrr you could do
df %>% mutate(sumQ = map_dbl(Price,
~sum(Q[Price < .+50 & Price > .-50])))
       Price          Q     sumQ
1  5.2272345 284.433416 28356.80
2 17.7292069 454.122990 35459.90
3  9.7329295 509.266254 29989.69
4 68.1042808 131.169813 41230.23
5 38.5612268 938.653962 45227.63
6 44.5808938 774.296761 47758.30
...
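For very large data, the sapply approach still compares every price against every other price. One possible alternative, sketched here as an assumption rather than something from the answers above, sorts the prices once and reads each window sum off a cumulative sum using base R's findInterval(). The column name sumQ_fast is made up for illustration, and the sapply version is kept only as a reference to compare against.

```r
set.seed(1)
df <- data.frame(Price = runif(100) * 100, Q = runif(100) * 1000)

# Sort prices once and build a cumulative sum of Q in price order
ord      <- order(df$Price)
p_sorted <- df$Price[ord]
cum_q    <- c(0, cumsum(df$Q[ord]))   # cum_q[k + 1] = sum of the k cheapest rows' Q

# Number of prices strictly below x + 50, and number of prices <= x - 50
hi <- findInterval(df$Price + 50, p_sorted, left.open = TRUE)
lo <- findInterval(df$Price - 50, p_sorted)

# Sum of Q over the open window (x - 50, x + 50)
df$sumQ_fast <- cum_q[hi + 1] - cum_q[lo + 1]

# Reference result from the sapply approach, for comparison only
reference <- with(df, sapply(Price, function(x) sum(Q[Price < x + 50 & Price > x - 50])))
isTRUE(all.equal(df$sumQ_fast, reference))
```

Because the sums come from differences of a cumulative sum, the results can differ from the direct sums by floating-point rounding, which is why the comparison uses all.equal() rather than ==.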

Related

How can I find the optimal price for each time t in R?

I have the price range price <- c(2.5,2.6,2.7,2.8)
and my dataset has several times t. For each time t, I have a corresponding cost c and demand quantity d.
I need to find the optimal price for each time t to maximise my required profit function (p-c)*d.
How can I achieve that?
The sample of mydata looks like this (I have 74 observations in total):
t  c     d
1  0.8   20
2  0.44  34
3  0.54  56
4  0.67  78
5  0.65  35
Here is my code, but it reports an error. Can anybody help me fix it? Many thanks!
max <-data.frame()
for (i in mydata$t) {
for (p in price) {
profit <- ((p-mydata$c)*mydata$d)
max <- max %>% bind_rows(data.frame(time=mydata$t,
price=p,
cost=mydata$c,
profit = profit
))
}
}
maxvalue <- max %>% group_by(time) %>% max(profit)
Since you did not provide a piece of your data which I could use, this is a bit of a guess, but the idea would be:
library(data.table)
dat <- as.data.table(mydata)
# For each value of t, get the price p for which (p-c)*d is the highest
result <- dat[, p[which.max((p-c)*d)], t]
Ok! I did not realize you kept the price outside your table. In that case, try adding all the possible prices to the table first, like this:
dat <- data.table(t= 1:5,
c= c(0.8,0.44,0.54,0.67,0.65),
d= c(20,34,56,78,35))
# Add all possible prices as an extra column (named p)
# Note that each original row is repeated once per price
dat <- dat[, .(p= c(2.5,2.6,2.7,2.8)), by = .(t, c, d)]
# Iterate through each value of t and get the price for which (p-c)*d is the highest
result <- dat[, .(best_price= p[which.max((p-c)*d)]), t]
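As a cross-check that does not need data.table, the same search can be sketched in base R with outer(), which builds a time-by-price profit matrix. This is only an illustration under the assumption that mydata has the columns t, c and d shown in the question; the names profit, best, best_price and max_profit are made up here.

```r
price  <- c(2.5, 2.6, 2.7, 2.8)
mydata <- data.frame(t = 1:5,
                     c = c(0.8, 0.44, 0.54, 0.67, 0.65),
                     d = c(20, 34, 56, 78, 35))

# Rows are times, columns are candidate prices, entries are (p - c) * d
profit <- outer(seq_len(nrow(mydata)), price,
                function(i, p) (p - mydata$c[i]) * mydata$d[i])

# Column index of the largest profit in each row
best <- max.col(profit, ties.method = "first")

mydata$best_price <- price[best]
mydata$max_profit <- profit[cbind(seq_len(nrow(mydata)), best)]
mydata
```

Note that with a fixed, positive demand d the profit (p-c)*d always increases with p, so for this sample every time t selects the highest price, 2.8.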

Churn Rate in R

I have a data set named df_a with millions of rows. I want to calculate the churn rate and group it by month.
I ran the code below on sample data to prepare it.
The logic is to find the minimum month (the acquisition month),
find the last month based on the records,
find the difference in months and group that difference into buckets.
The code is below:
df_a<-data.table(df)
df_a[,"min_date" := min(yw), by=c("CUSTOMER_DIMENSION_ID")]
df_a[,"max_date" := max(yw), by=c("CUSTOMER_DIMENSION_ID")]
df_a$min_date_m<-anydate(df_a$min_date)
df_a$max_date_m<-anydate(df_a$max_date)
df_a$diff_days <- df_a$max_date_m - df_a$min_date_m
df_a$difference <- as.numeric(df_a$diff_days) /(365.25/12)
df_a$Month_Bucket<-ifelse((df_a$difference>=0 & df_a$difference<3),"3",
ifelse((df_a$difference>=3 & df_a$difference<6),"3-6",
ifelse((df_a$difference>=6 & df_a$difference<9),"6-9",
ifelse((df_a$difference>=9 & df_a$difference<12),"9-12",
ifelse((df_a$difference>=12 & df_a$difference<24),"12-24",
"24+")))))
data_a <- df_a[c(1,1:nrow(df_a)),]
setDT(data_a)
xxx<-(cohorts <-dcast(unique(data_a)[,cohort:=min(yw),by=CUSTOMER_DIMENSION_ID],cohort~Month_Bucket))
I am getting the output in the following format
Month 3
2020-08 92876
2020-07 144873
However the output is not correct
What I want is
Month no of unique customers acquired 0-3 3-6 6-9
2019-08 85749
2019-07 128060
The output basically sums up the customers across months and assigns them to a single bucket. However, if I acquire 85749 customers in 2019-08, I would expect to have, let's say, 25k of them in 0-3 months and 25k again in 3-6 months.
One option here is:
data_unique <- unique(data_a)
# Count distinct customers per acquisition cohort and month bucket
ccc <- ( cohorts <- dcast( data_unique[ ,
                                        cohort := min(yw),
                                        by=CUSTOMER_DIMENSION_ID],
                           cohort ~ Month_Bucket,
                           value.var = "CUSTOMER_DIMENSION_ID",
                           fun.aggregate = function(x) { length(unique(x)) } )
)
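As a side note, the nested ifelse() calls that build Month_Bucket in the question could probably be replaced with base R's cut(). The sketch below uses a made-up vector of month differences and labels the first bucket "0-3" (the question's code labels it "3"), so treat the labels as an assumption.

```r
# Hypothetical month differences, standing in for df_a$difference
difference <- c(0, 2.9, 3, 5.5, 8, 11.9, 12, 30)

# right = FALSE makes each interval closed on the left: [0,3), [3,6), ...
Month_Bucket <- cut(difference,
                    breaks = c(0, 3, 6, 9, 12, 24, Inf),
                    labels = c("0-3", "3-6", "6-9", "9-12", "12-24", "24+"),
                    right  = FALSE)

data.frame(difference, Month_Bucket)
```

Unlike the ifelse() chain, cut() returns NA for negative differences instead of placing them in "24+"; that should not matter here because the maximum date is never earlier than the minimum date.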

r data.table filter based on count of rows satisfying a condition

I am learning data.table and got confused at one point, so I need help understanding how the following can be achieved. From my data, I need to filter out those brands which have sales of 0 in the 1st period OR do not have sales > 0 in at least 14 periods. I think I have achieved the 1st part, but I cannot work out the second part: filtering out those brands which do not have sales > 0 in at least 14 periods.
Below are my sample data and the code I have written. Please suggest how I can achieve the second part.
library(data.table)
#### set the seed value
set.seed(9901)
#### create the sample variables for creating the data
group <- sample(1:7,1200,replace = T)
brn <- sample(1:10,1200,replace = T)
period <- rep(101:116,75)
sales <- sample(0:50,1200,replace = T)
#### create the data.table
df1 <- data.table(cbind(group,brn,period,sales))
#### taking the minimum value by group x brand x period
df1_min <- df1[,.(min1 = min(sales,na.rm = T)),by = c('group','brn','period')][order(group,brn,period)]
#### creating the filter
df1_min$fil1 <- ifelse(df1_min$period == 101 & df1_min$min1 == 0,1,0)
Thank you !!
This assumes that the first restriction applies to the dataset-wide minimum period (101), which implies that brn/group pairs whose first period is later than 101 and has zero sales are still included.
# 1. brn/group pairs with sales of 0 in the 1st period.
brngroup_zerosales101 = df1[sales == 0 & period == min(period), .(brn, group)]
# 2a. Identify brn/group pairs with <14 positive sale periods
df1[, posSale := ifelse(sales > 0, 1, 0)] # Was the period sale positive?
# 2b. For each brn/group pair, sum posSale and filter posSale < 14
brngroup_sub14 = df1[, .(GroupBrnPosSales = sum(posSale)), by = .(brn, group)][GroupBrnPosSales < 14, .(brn, group)]
# 3. Join the two restrictions
restr = rbindlist(list(brngroup_zerosales101, brngroup_sub14))
df1[, ID := paste(brn, group)] # Create a brn-group ID
restr[, ID := paste(brn, group)] # See above
filtered = df1[!(ID %in% restr[,ID]),]
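The code above sums posSale over rows, so a brn/group pair with several positive rows in the same period counts that period more than once. If the intent is to count distinct periods with positive sales, one possible variation (an assumption about the intent, not part of the answer above) uses data.table's uniqueN(). The names first_period, stats, zero_first, n_pos_periods and keep are made up for this sketch.

```r
library(data.table)

set.seed(9901)
df1 <- data.table(group  = sample(1:7, 1200, replace = TRUE),
                  brn    = sample(1:10, 1200, replace = TRUE),
                  period = rep(101:116, 75),
                  sales  = sample(0:50, 1200, replace = TRUE))

first_period <- min(df1$period)

# One row per group/brn pair with both conditions evaluated
stats <- df1[, .(zero_first    = any(sales == 0 & period == first_period),
                 n_pos_periods = uniqueN(period[sales > 0])),
             by = .(group, brn)]

# Keep pairs with no zero sale in the first period and >= 14 positive periods
keep <- stats[!zero_first & n_pos_periods >= 14, .(group, brn)]

# Join back to get the surviving rows
filtered <- df1[keep, on = .(group, brn)]
```

With this small random sample, few or no pairs may reach 14 distinct positive periods, so filtered can legitimately be empty.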

In R, iterating over two datasets and getting back results without looping

I have two data sets, Transaction_long, and Transaction_short. Transaction_long has many quotes of policy and price with a purchase point (denoted by true) in the dataset. Transaction_short has only entries of the purchase points.
My objective is to add a column to the Transaction_short dataset called Policy_Change_Frequency. For every customer in the short dataset, I want to iterate over the rows for that customer in the long dataset and calculate how many times the policy changed.
To find the policy changes I can use sum(diff(Transaction_Long$Policy)!=0), but I am not sure how to iterate over these two data sets and get the results.
Details:
Customer_Name : name of customer
Customer_ID: Customer Identifier number
Purchased: Boolean variable (Yes-1,No-0)
Policy: Categorical (takes values 1-5)
Price : Price quoted
Data set1-Transaction_Long
Customer_Name,Customer_ID,Purchased,Policy,Price
Joe,101,0,1,500
Joe,101,0,1,505
Joe,101,0,2,510
Joe,101,0,2,504
Joe,101,0,2,507
Joe,101,0,1,505
Joe,101,1,3,501
Mary,103,0,1,675
Mary,103,0,3,650
Mary,103,0,2,620
Mary,103,0,2,624
Mary,103,0,2,630
Mary,103,1,2,627
Data set 2:Transaction_Short
Customer_Name , Customer_ID,Purchased,Policy, Price
Joe,101,1,3,501
Mary,103,1,2,627
I need to add a Policy Change Frequency column to the Transaction_Short dataset, so the final dataset should look like this:
Customer_Name , Customer_ID,Purchased, Policy, Price,Policy_ChangeFreq
Joe,101,1,3,501,3
Mary,103,1,2,627,2
Consider a calculated column that flags with a 1 each row whose policy differs from the previous row for the same customer, then aggregate the ones to get a count. A merge is used because two aggregations are needed (the final row for each customer and the PolicyChanged count):
Transaction_Long$PolicyChangedFreq <- sapply(1:nrow(Transaction_Long),
function(i)
if (i > 1) {
ifelse(Transaction_Long$Policy[i-1]==
Transaction_Long$Policy[i], 0,
ifelse(Transaction_Long$Customer_ID[i-1] !=
Transaction_Long$Customer_ID[i], 0, 1))
} else { 0 }
)
Transaction_Final <- merge(aggregate(.~ Customer_ID + Customer_Name,
Transaction_Long[,c(1:5)], FUN = tail, n = 1),
aggregate(.~ Customer_ID + Customer_Name,
Transaction_Long[,c(1:2,6)], FUN = sum),
by = c('Customer_ID', 'Customer_Name'))
Transaction_Final
# Customer_ID Customer_Name Purchased Policy Price PolicyChangedFreq
#1 101 Joe 1 3 501 3
#2 103 Mary 1 2 627 2
@Parfait, thank you for the solution. I solved this using the sqldf package in R:
for (i in 1:nrow(Transaction_short)){
sql <- sprintf("SELECT policy from Transaction_long where customer_ID = %s",ML_Train_short$customer_ID[i])
df<- sqldf(sql)
NF <- sum(df$policy[-1]!= df$policy[-length(df$policy)])
ML_Train_short$Policy_Change_Freq[i] <- NF
}
Since I have about 500K rows in the long dataset and about 100K in the short dataset, this is taking a while. Is there any other solution that does not require loops? Thank you.
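One loop-free possibility, sketched here in base R as a suggestion rather than a tested solution for the full 500K-row data, applies the sum(diff(...) != 0) idea from the question once per customer with tapply() and then looks the counts up by Customer_ID. It assumes the long dataset is already in quote order within each customer; the variable name changes is made up.

```r
Transaction_Long <- read.csv(text = "Customer_Name,Customer_ID,Purchased,Policy,Price
Joe,101,0,1,500
Joe,101,0,1,505
Joe,101,0,2,510
Joe,101,0,2,504
Joe,101,0,2,507
Joe,101,0,1,505
Joe,101,1,3,501
Mary,103,0,1,675
Mary,103,0,3,650
Mary,103,0,2,620
Mary,103,0,2,624
Mary,103,0,2,630
Mary,103,1,2,627")

# The short dataset holds only the purchase rows
Transaction_Short <- Transaction_Long[Transaction_Long$Purchased == 1, ]

# Number of policy changes per customer, named by Customer_ID
changes <- tapply(Transaction_Long$Policy, Transaction_Long$Customer_ID,
                  function(p) sum(diff(p) != 0))

# Look up each short-table customer by ID (names of `changes` are characters)
Transaction_Short$Policy_ChangeFreq <-
  as.integer(changes[as.character(Transaction_Short$Customer_ID)])

Transaction_Short
```

For the sample data this gives 3 changes for Joe and 2 for Mary, matching the expected output in the question.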

Calculating grades in r

I am calculating final averages for a course. There are about 500 students, and the grades are organized into a .csv file. Column headers include:
Name, HW1, ..., HW10, Quiz1, ..., Quiz5, Exam1, Exam2, Final
Each is weighted differently, and that shouldn't be an issue programming. However, the lowest 2 HW and the lowest Quiz are dropped for each student. How could I program this in r? Note that the HW/Quiz dropped for each student may be different (i.e. Student A has HW2, HW5, Quiz2 dropped, Student B has HW4, HW8, Quiz1 dropped).
Here is a simpler solution. The sum_after_drop function takes a vector x and drops the i lowest scores and sums up the remaining. We invoke this function for each row in the dataset. ddply is overkill for this job, but keeps things simple. You should be able to do this with apply, except that you will have to convert the end result to a data frame.
The actual grade calculations can then be carried out on dd2. Note that using the cut function with breaks is a simple way to get letter grades from the total scores.
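library(plyr)
# Drop the i lowest scores from x and sum up the rest
sum_after_drop <- function(x, i){
  sum(sort(x)[-(1:i)])
}
nms <- names(dd)  # column names, used to pick out the HW and Quiz columns
dd2 = ddply(dd, .(Name), function(d){
  # unlist() turns the one-row data frame into a plain numeric vector
  hw = sum_after_drop(unlist(d[,grepl("HW", nms)]), 2)    # the question drops the 2 lowest HW
  qz = sum_after_drop(unlist(d[,grepl("Quiz", nms)]), 1)  # and the lowest quiz
  data.frame(hw = hw, qz = qz)
})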
Here's a sketch of how you could approach it using the reshape2 package and base functions.
#sample data
set.seed(734)
dd<-data.frame(
Name=letters[1:20],
HW1=rpois(20,7),
HW2=rpois(20,7),
HW3=rpois(20,7),
Quiz1=rpois(20,15),
Quiz2=rpois(20,15),
Quiz3=rpois(20,15)
)
Now I convert it to long format and split apart the field names
require(reshape2)
mm<-melt(dd, "Name")
mm<-cbind(mm,
colsplit(gsub("([[:alpha:]]+)(\\d+)","\\1:\\2",mm$variable, perl=T), ":", # letters-only prefix so names like HW10 split as HW:10
names=c("type","number"))
)
Now I can use by() to get a data.frame for each name and do the rest of the calculations. Here I just drop the lowest homework and the lowest quiz, and I give homework a weight of .2 and quizzes a weight of .8 (assuming all homeworks were worth 15 pts and quizzes 25 pts).
grades<-unclass(by(mm, mm$Name, function(x) {
hw <- tail(sort(x$value[x$type=="HW"]), -1);
quiz <- tail(sort(x$value[x$type=="Quiz"]), -1);
(sum(hw)*.2 + sum(quiz)*.8) / (length(hw)*15*.2+length(quiz)*25*.8)
}))
attr(grades, "call")<-NULL #get rid of crud from by()
grades;
Let's check our work. Look at student "c"
Name HW1 HW2 HW3 Quiz1 Quiz2 Quiz3
c 6 9 7 21 20 14
Their grade should be
((9+7)*.2+(21+20)*.8) / ((15+15)*.2 + (25+25)*.8) = 0.7826087
and in fact, we see
grades["c"] == 0.7826087
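The hand calculation for student "c" can be reproduced directly. The sketch below hard-codes that student's scores from the table above; the names hw_kept, quiz_kept and grade_c are made up. It compares with all.equal(), since an exact == test against the rounded value 0.7826087 would return FALSE.

```r
hw   <- c(6, 9, 7)      # student c's homework scores
quiz <- c(21, 20, 14)   # student c's quiz scores

# Drop the single lowest score of each type
hw_kept   <- tail(sort(hw), -1)
quiz_kept <- tail(sort(quiz), -1)

# Weighted points earned divided by weighted points possible
grade_c <- (sum(hw_kept) * .2 + sum(quiz_kept) * .8) /
  (length(hw_kept) * 15 * .2 + length(quiz_kept) * 25 * .8)

round(grade_c, 7)
isTRUE(all.equal(grade_c, 0.7826087, tolerance = 1e-6))
```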
Here's a solution with dplyr. It ranks the scores by student and type of assignment (i.e. calculates the rank order of all of student 1's homeworks, etc.), then filters out the lowest 1 (or 2, or whatever). dplyr's syntax is pretty intuitive, so you should be able to walk through the code fairly easily.
# Load libraries
library(reshape2)
library(dplyr)
# Sample data
grades <- data.frame(name=c("Sally", "Jim"),
HW1=c(10, 9),
HW2=c(10, 5),
HW3=c(5, 10),
HW4=c(6, 9),
HW5=c(8, 9),
Quiz1=c(9, 5),
Quiz2=c(9, 10),
Quiz3=c(10, 8),
Exam1=c(95, 96))
# Melt into long form
grades.long <- melt(grades, id.vars="name", variable.name="graded.name") %>%
  mutate(graded.type=factor(sub("\\d+","", graded.name)))
grades.long
# Remove the lowest scores for each graded type
grades.filtered <- grades.long %>%
  group_by(name, graded.type) %>%
  mutate(ranked.score=rank(value, ties.method="first")) %>% # Rank all the scores
  filter((ranked.score > 2 & graded.type=="HW") | # Ignore the lowest two HWs
         (ranked.score > 1 & graded.type=="Quiz") | # Ignore the lowest quiz
         (graded.type=="Exam"))
grades.filtered
# Calculate the average for each graded type
grade.totals <- grades.filtered %>%
  group_by(name, graded.type) %>%
  summarize(total=mean(value))
grade.totals
# Unmelt, just for fun
final.grades <- dcast(grade.totals, name ~ graded.type, value.var="total")
final.grades
You technically could add the summarize(total=mean(value)) step to the grades.filtered pipeline rather than making a separate grade.totals data frame. I separated them into multiple data frames to make each step easier to follow.
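If the extra packages are not available, the drop-the-lowest step can also be sketched in base R with apply() over the rows. This is only an illustration using the same sample scores as above; the helper mean_after_drop and the columns HW_avg and Quiz_avg are made-up names.

```r
grades <- data.frame(name = c("Sally", "Jim"),
                     HW1 = c(10, 9), HW2 = c(10, 5), HW3 = c(5, 10),
                     HW4 = c(6, 9), HW5 = c(8, 9),
                     Quiz1 = c(9, 5), Quiz2 = c(9, 10), Quiz3 = c(10, 8),
                     Exam1 = c(95, 96))

# Mean of x after dropping its n_drop lowest values
mean_after_drop <- function(x, n_drop) {
  mean(tail(sort(x), length(x) - n_drop))
}

# Locate the score columns before adding any new columns
hw_cols   <- grep("^HW", names(grades))
quiz_cols <- grep("^Quiz", names(grades))

# Apply the helper to each student's row
grades$HW_avg   <- apply(grades[hw_cols], 1, mean_after_drop, n_drop = 2)
grades$Quiz_avg <- apply(grades[quiz_cols], 1, mean_after_drop, n_drop = 1)

grades[, c("name", "HW_avg", "Quiz_avg", "Exam1")]
```

Each student gets their own lowest scores dropped, because the sort happens within each row.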

Resources