I have a set of data below which shows a number of cars at an imaginary car dealership. The 'Current_Price' variable is the price the car is currently set to sell at. The 'Minimum_Price' variable is the hard floor price that the car must not, under any circumstance, sell below (this can be assumed to be the purchase price).
I'm trying to create a function where the user can select a subset of the cars in the database (using the 'user defined parameters' as noted below) and then reduce or increase the 'Current_Price' by either a percentage or a pound (£) value.
The 'minimum profit parameters' set the minimum profit which must be made on every car. In this example they have been set to £10 and 10%, meaning the profit on each car must be at least £10 or 10% of the selling price, whichever is larger.
The price change parameters set how much the price is to move and whether it should move up or down.
# Dummy data
Type <- rep(c("Car", "Van"),each=3)
Age <- as.numeric(c(2, 2, 5, 4, 8,1))
Colour <- c("Red", "Red", "Yellow", "Red", "Black", "Red")
Make <- c("Ford", "VW", "VW", "VW", "BMW", "Ford")
Current_Price <- as.numeric(c(1050, 1000, 1500, 995, 2200, 2100))
Minimum_Price <- as.numeric(c(900, 600, 500, 850, 1900, 1950))
df1 <- data.frame(Type, Age, Colour, Make, Current_Price, Minimum_Price)
# User defined parameters - price to be changed for all cars which fit below selection
Input_Type <- "Car"
Input_Min_Age <- 2 # All cars this age and above
Input_Max_Age <- 10 # All cars this age and below
Input_Colour <- "Red"
Input_Make <- c("Ford", "VW")
# Minimum profit parameters
Input_Min_Pounds <- 10
Input_Min_Percentage <- 0.10
# Price change parameters
Input_Change_Type <- "Percentage" # "Percentage" or "Pound"
Input_Change_Value <- -0.10 # "-" sign to represent price reduction
Given the above, I would expect lines 1 & 2 to be affected by the change. Line 1's price should move down from £1,050 to £1,000, because £1,000 is the lowest possible price at which 10% of the price is profit (900/(1-0.10) = 1000).
Line 2's price should simply move down by 10% to £900.
Has anyone got an idea how to put this into a function which will be fairly intuitive to use for someone who is not used to using R?
This answer uses a data.table to apply the price changes to the original data (a data.frame, as you explained in the comments to your question). The solution could look like this.
I am still ignoring the pricing logic because I want to focus on the usability aspect
(this specialised pricing logic is arbitrary and not of particular interest to anybody else here at SO; if you have a specific problem implementing it yourself, please open a new question and explain the problem in detail).
library(data.table)
data <- as.data.table(df1)
calc_price <- function(Current_Price,
Minimum_Price,
price_change_type,
price_change_value,
min_profit_pounds,
min_profit_percentage) {
# TODO implement your pricing logic here...
return(Current_Price + 1)
}
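# Purely as an illustration and untested: based on the rules described in the
# question, the body of calc_price() could be sketched roughly like this.
# The profit floor is the larger of Minimum_Price plus the fixed pound amount
# and the price at which the given percentage of the price is profit.
calc_price_sketch <- function(Current_Price,
                              Minimum_Price,
                              price_change_type,
                              price_change_value,
                              min_profit_pounds,
                              min_profit_percentage) {
  proposed <- if (price_change_type == "Percentage") {
    Current_Price * (1 + price_change_value)   # e.g. -0.10 reduces by 10%
  } else {
    Current_Price + price_change_value          # e.g. -100 reduces by £100
  }
  floor_price <- pmax(Minimum_Price + min_profit_pounds,
                      Minimum_Price / (1 - min_profit_percentage))
  pmax(proposed, floor_price)                   # never drop below the profit floor
}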
update_car_prices <- function(data,
filter,
price_change_type = c("Percentage", "Pound"),
price_change_value = 0,
min_profit_pounds = 10,
min_profit_percentage = 0.10) {
stopifnot(is.data.table(data))
price_change_type <- match.arg(price_change_type) # use the first value if none was provided
filter_exp <- substitute(filter) # "parse" the passed value as expression
# update the price using a separate function to encapsulate the logic
data[eval(filter_exp), Current_Price := calc_price(Current_Price,
Minimum_Price,
price_change_type,
price_change_value,
min_profit_pounds,
min_profit_percentage)][]
return(data)
}
The usage is still similar to my data.frame answer, e.g.:
update_car_prices(data, Type == "Car" & Age >= 2 & Age <= 10 & Colour == "Red" & Make %in% c("Ford", "VW"))
update_car_prices(data, Colour == "Red")
update_car_prices(data, Colour == "Red", "Pound", 500)
The differences are:
- The whole data.table (data) is returned so that you can see the impact.
- The original data is changed, since data.tables are passed by reference and the price is updated "by reference" using the data.table syntax := (see the small demonstration below).
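For instance, a quick way to see the in-place modification (a rough sketch; copy() is data.table's helper for taking an independent snapshot, and calc_price above is still the dummy implementation):
data_before <- copy(data)
update_car_prices(data, Colour == "Red")
identical(data_before$Current_Price, data$Current_Price)
# [1] FALSE, because `data` itself was modified by reference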
This answer is based on a data.frame...
Your question addresses multiple aspects (pricing logic, filtering logic and usability).
I am focusing on the usability (and ignoring the pricing logic, since that is an implementation detail here).
I see at least three options:
Use a strongly typed function:
get_car_prices1 <- function(data, Input_Type, Input_Min_Age, Input_Max_Age, Input_Colour, Input_Make, Input_Min_Pounds, Input_Min_Percentage)
Use an "untyped" function with a deliberate number of arguments via ... to support filtering by passing only the required arguments:
get_car_prices2 <- function(data, Input_Min_Pounds, Input_Min_Percentage, ...)
Use meta programming with substitute + eval
I have decided on option 3 as the best (user-friendly + flexible) option, IMHO:
get_car_prices <- function(data,
filter,
price_change_type = c("Percentage", "Pound"),
price_change_value = 1)
{
price_change_type <- match.arg(price_change_type) # use the first value if none was provided
filter_exp <- substitute(filter) # "parse" the passed value as expression
data_subset <- subset(data, eval(filter_exp))
# TODO add your pricing logic here (e. g. using "ifelse")
return(data_subset)
}
# Usage examples:
get_car_prices(df1, Colour == "Red")
# Type Age Colour Make Current_Price Minimum_Price
# 1 Car 2 Red Ford 1050 900
# 2 Car 2 Red VW 1000 600
# 4 Van 4 Red VW 995 850
# 6 Van 1 Red Ford 2100 1950
get_car_prices(df1, Type == "Car" & Age >= 2 & Age <= 10 & Colour == "Red" & Make %in% c("Ford", "VW"))
# Type Age Colour Make Current_Price Minimum_Price
# 1 Car 2 Red Ford 1050 900
# 2 Car 2 Red VW 1000 600
get_car_prices(df1, Colour == "Red", "Pound", 500)
# ...
get_car_prices(df1, Colour == "Red", "dumping price", 1)
# Error in match.arg(price_change_type) :
# 'arg' should be one of “Percentage”, “Pound”
# But: The user has to learn at least the expression logic of R and that variables (and values) are case-sensitive:
get_car_prices(df1, Colour == "red")
# [1] Type Age Colour Make Current_Price Minimum_Price
# <0 rows> (or 0-length row.names)
# Error: Assignment operator (=) used instead of comparison operator (==)
get_car_prices(df1, Colour = "Red")
# Error in get_car_prices(df1, Colour = "Red") :
# unused argument (Colour = "Red")
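For comparison, option 2 could be sketched roughly like this (only equality / %in% filters via named arguments are supported here, so range filters such as Input_Min_Age would need extra handling, which is one reason option 3 is more flexible; the pricing logic is again left out):
get_car_prices2 <- function(data,
                            Input_Min_Pounds = 10,
                            Input_Min_Percentage = 0.10,
                            ...) {
  filters <- list(...)   # e.g. Colour = "Red", Make = c("Ford", "VW")
  keep <- rep(TRUE, nrow(data))
  for (col in names(filters)) {
    keep <- keep & data[[col]] %in% filters[[col]]
  }
  # TODO add your pricing logic here as well
  data[keep, , drop = FALSE]
}
# Usage example:
get_car_prices2(df1, Colour = "Red", Make = c("Ford", "VW"))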
I am working on market transaction data where each observation (i.e. each transaction) contains the buyer's id and the seller's id. For each transaction, I would like to create a variable equal to the number of other transactions the associated seller has done with a buyer different from the one involved in this transaction. So, given the following data:
data <- data.frame(Buyer_id = c("001","001","002","001"), Seller_id = c("021","022","022","021"))
I would like to obtain:
Result <- list(0,1,1,0)
I searched for existing answers to problems similar to mine, usually involving the function mapply(), and tried to implement them, but without success.
Thank you very much for helping me.
Are you looking for something like this? If yes, then you might want to change your reproducible example to have a c instead of list when you construct your data.frame.
data <- data.frame(Buyer_id = c("001","001","002","001"),
Seller_id = c("021","022","022","021"))
data$n <- NA
for (i in seq_len(nrow(data))) {
seller <- as.character(data[i, "Seller_id"])
buyer <- as.character(data[i, "Buyer_id"])
with.buyers <- as.character(data[data$Seller_id == seller, "Buyer_id"])
with.buyers <- unique(with.buyers)
diff.buyers <- with.buyers[!(with.buyers %in% buyer)]
data[i, "n"] <- length(diff.buyers)
}
Buyer_id Seller_id n
1 001 021 0
2 001 022 1
3 002 022 1
4 001 021 0
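The explicit loop can also be collapsed into a single sapply() call that counts the distinct other buyers per seller, the same quantity the loop above computes (a sketch, re-using the same data):
data$n <- sapply(seq_len(nrow(data)), function(i) {
  # all buyers this seller appears with, minus the buyer of the current row
  length(setdiff(data$Buyer_id[data$Seller_id == data$Seller_id[i]],
                 data$Buyer_id[i]))
})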
Apart from Roman Lustrik's solution, there is also an approach that uses graphs.
library(igraph)
data <- data.frame(Seller_id = c("021","022","022","021"),
Buyer_id = c("001","001","002","001"),
stringsAsFactors = FALSE)
my.graph <- graph_from_data_frame(data)
plot(my.graph)
degree(my.graph, mode = c("out"))
# Transform the graph into a simple graph. A simple graph does not allow
# duplicate edges.
my.graph <- simplify(my.graph)
plot(my.graph)
degree(my.graph, mode = c("out"))
V(my.graph)$out.degree <- degree(my.graph, mode = c("out"))
data$n <- apply(data,
MARGIN = 1,
FUN = function(transaction)
{
node.out.degree <- V(my.graph)$out.degree[ V(my.graph)$name == transaction["Seller_id"] ]
if (node.out.degree <= 1) {
# An out degree of at most 1 means the current seller has sold to only one
# distinct buyer, namely the buyer in this transaction.
return(0)
} else {
# In this case the seller has sold to at least one other distinct buyer.
# We therefore subtract one (for the buyer in the current transaction)
# from the out degree.
return(node.out.degree - 1)
}
})
data
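Running the code above, the final print of data should reproduce the Result (0, 1, 1, 0) asked for in the question:
#   Seller_id Buyer_id n
# 1       021      001 0
# 2       022      001 1
# 3       022      002 1
# 4       021      001 0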
I want to find the best way to plot a chart showing the cumulative number of individuals in a group based on the date they came into the group as well as the date they may have left the group. This would be within the minimum and maximum date ranges of the date values. Each row is a person.
group_id Date_started Date_exit
1 2005-06-23 NA
1 2013-03-17 2013-09-20
2 2019-10-24 NA
3 2019-11-27 2019-11-27
4 2019-08-14 NA
3 2018-10-17 NA
4 2018-04-13 2019-10-12
1 2019-07-10 NA
I've considered creating a new data frame with a row per day within the min/max range and then applying some kind of function to tally the group totals for each row (adding to and subtracting from a running total based on whether there is a new value in either of the columns), but I'm not sure, one, whether that's the best way to approach the problem and, two, how to practically run the cumulative count function.
Ultimately though I want to be able to plot this as a line chart so I can see the trends over time for each group as I suspect one or more of them are much more volatile in terms of overall numbers. So again I'm not sure if ggplot2 has something already in place to handle this.
As you mentioned, you will need to create a dataframe with the desired dates and count, for each group, how many individuals are in the group.
I quickly put this together, so I'm sure there's a more optimal solution, but it should be what you're looking for.
library(ggplot2)
library(reshape2) # for melt
# your data
test <- read.table(
text =
"group_id,Date_started,Date_exit
1,2005-06-23,NA
1,2013-03-17,2013-09-20
2,2019-10-24,NA
3,2019-11-27,2019-11-27
4,2019-08-14,NA
3,2018-10-17,NA
4,2018-04-13,2019-10-12
1,2019-07-10,NA",
h = T, sep = ",", stringsAsFactors = F
)
# make date series
from <- min(as.POSIXct(test$Date_started))
to <- max(as.POSIXct(test$Date_started))
datebins <- seq(from, to, by = "1 month")
d_between <- function(d, ds, de){
if(ds <= d & (de > d | is.na(de)))
return(TRUE)
return(FALSE)
}
# make df to plot
df <- data.frame(dates = datebins)
df[,paste0("g", unique(test$group_id))] <- 0
for(i in seq_len(nrow(df))){
for(j in seq_len(nrow(test))){
gid <- paste0("g", test$group_id[j])
df[i, gid] <- df[i, gid] + d_between(df$dates[i], test$Date_started[j], test$Date_exit[j])
}
}
# plot
ggplot(melt(df, id.vars = "dates"), aes(dates, value, color = variable)) +
geom_line(size = 1) + theme_bw()
This gives a line chart with one cumulative membership line per group over time.
Feel free to play with the date bins (in seq()) as necessary.
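For example, switching to weekly bins gives a finer-grained curve (re-using from and to defined above; re-run the df construction afterwards):
datebins <- seq(from, to, by = "1 week")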
EDIT: for-loop explanation
for(i in seq_len(nrow(df))){
for(j in seq_len(nrow(test))){
gid <- paste0("g", test$group_id[j])
df[i, gid] <- df[i, gid] + d_between(df$dates[i], test$Date_started[j], test$Date_exit[j])
}
}
The first loop iterates over the chosen dates.
For each date, go through the dataframe of interest (test) with the second for loop and use the custom d_between() function to determine whether or not an individual is part of the group. That function returns a boolean (which can translate to 0/1). The value 0 or 1 is then added to the df dataframe's column corresponding to the appropriate group (with gid) at the date we checked (row i).
Note that I'm counting individuals as part of the group from the day they join (ds <= d), but not on the day they quit (de > d).
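If the nested loop becomes slow on larger data, the row-by-row check can be replaced with whole-vector comparisons. Here is a possible sketch re-using test and datebins from above (df2 is just a new name to avoid clobbering df):
started <- as.POSIXct(test$Date_started)
exited  <- as.POSIXct(test$Date_exit)   # NA means the person never left
df2 <- data.frame(dates = datebins)
for (g in unique(test$group_id)) {
  df2[[paste0("g", g)]] <- sapply(datebins, function(d) {
    sum(test$group_id == g & started <= d & (is.na(exited) | exited > d))
  })
}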
I am calculating final averages for a course. There are about 500 students, and the grades are organized into a .csv file. Column headers include:
Name, HW1, ..., HW10, Quiz1, ..., Quiz5, Exam1, Exam2, Final
Each is weighted differently, and that shouldn't be an issue to program. However, the lowest 2 HW and the lowest Quiz are dropped for each student. How could I program this in R? Note that the HW/Quiz dropped may differ between students (e.g. Student A has HW2, HW5, Quiz2 dropped, while Student B has HW4, HW8, Quiz1 dropped).
Here is a simpler solution. The sum_after_drop function takes a vector x and drops the i lowest scores and sums up the remaining. We invoke this function for each row in the dataset. ddply is overkill for this job, but keeps things simple. You should be able to do this with apply, except that you will have to convert the end result to a data frame.
The actual grade calculations can then be carried out on dd2. Note that using the cut function with breaks is a simple way to get letter grades from the total scores.
library(plyr)
sum_after_drop <- function(x, i){
  x <- unlist(x)  # x may arrive as a one-row data frame; flatten it to a vector
  sum(sort(x)[-(1:i)])
}
dd2 = ddply(dd, .(Name), function(d){
  hw = sum_after_drop(d[, grepl("HW", names(d))], 2)   # drop the two lowest homeworks
  qz = sum_after_drop(d[, grepl("Quiz", names(d))], 1) # drop the lowest quiz
data.frame(hw = hw, qz = qz)
})
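As a rough illustration of the cut() idea mentioned above, with a hypothetical vector of total scores out of 100 and hypothetical grade cut-offs:
total <- c(95, 72, 58)   # hypothetical totals
cut(total,
    breaks = c(-Inf, 60, 70, 80, 90, Inf),
    labels = c("F", "D", "C", "B", "A"),
    right = FALSE)
# [1] A C F
# Levels: F D C B A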
Here's a sketch of how you could approach it using the reshape2 package and base functions.
#sample data
set.seed(734)
dd<-data.frame(
Name=letters[1:20],
HW1=rpois(20,7),
HW2=rpois(20,7),
HW3=rpois(20,7),
Quiz1=rpois(20,15),
Quiz2=rpois(20,15),
Quiz3=rpois(20,15)
)
Now I convert it to long format and split apart the field names
require(reshape2)
mm<-melt(dd, "Name")
mm<-cbind(mm,
colsplit(gsub("(\\w+)(\\d+)","\\1:\\2",mm$variable, perl=T), ":",
names=c("type","number"))
)
Now I can use by() to get a data.frame for each name and do the rest of the calculations. Here I just drop the lowest homework and lowest quiz, and I give homework a weight of .2 and quizzes a weight of .8 (assuming all homeworks were worth 15 pts and quizzes 25 pts).
grades<-unclass(by(mm, mm$Name, function(x) {
hw <- tail(sort(x$value[x$type=="HW"]), -1);
quiz <- tail(sort(x$value[x$type=="Quiz"]), -1);
(sum(hw)*.2 + sum(quiz)*.8) / (length(hw)*15*.2+length(quiz)*25*.8)
}))
attr(grades, "call")<-NULL #get rid of crud from by()
grades;
Let's check our work. Look at student "c"
Name HW1 HW2 HW3 Quiz1 Quiz2 Quiz3
c 6 9 7 21 20 14
Their grade should be
((9+7)*.2+(21+20)*.8) / ((15+15)*.2 + (25+25)*.8) = 0.7826087
and in fact, we see
grades["c"] == 0.7826087
Here's a solution with dplyr. It ranks the scores by student and type of assignment (i.e. calculates the rank order of all of student 1's homeworks, etc.), then filters out the lowest 1 (or 2, or whatever). dplyr's syntax is pretty intuitive—you should be able to walk through the code fairly easily.
# Load libraries
library(reshape2)
library(dplyr)
# Sample data
grades <- data.frame(name=c("Sally", "Jim"),
HW1=c(10, 9),
HW2=c(10, 5),
HW3=c(5, 10),
HW4=c(6, 9),
HW5=c(8, 9),
Quiz1=c(9, 5),
Quiz2=c(9, 10),
Quiz3=c(10, 8),
Exam1=c(95, 96))
# Melt into long form
grades.long <- melt(grades, id.vars="name", variable.name="graded.name") %>%
mutate(graded.type=factor(sub("\\d+","", graded.name)))
grades.long
# Remove the lowest scores for each graded type
grades.filtered <- grades.long %>%
  group_by(name, graded.type) %>%
  mutate(ranked.score=rank(value, ties.method="first")) %>% # Rank all the scores
filter((ranked.score > 2 & graded.type=="HW") | # Ignore the lowest two HWs
(ranked.score > 1 & graded.type=="Quiz") | # Ignore the lowest quiz
(graded.type=="Exam"))
grades.filtered
# Calculate the average for each graded type
grade.totals <- grades.filtered %>%
  group_by(name, graded.type) %>%
summarize(total=mean(value))
grade.totals
# Unmelt, just for fun
final.grades <- dcast(grade.totals, name ~ graded.type, value.var="total")
final.grades
You technically could add the summarize(total=mean(value)) to the grades.filtered data frame rather than making a separate grade.totals data frame—I separated them into multiple data frames for didactical reasons.
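For reference, that combined version might look like this (same objects as above; the grouping carries through to the final summarize):
grade.totals <- grades.long %>%
  group_by(name, graded.type) %>%
  mutate(ranked.score = rank(value, ties.method = "first")) %>%
  filter((ranked.score > 2 & graded.type == "HW") |
           (ranked.score > 1 & graded.type == "Quiz") |
           (graded.type == "Exam")) %>%
  summarize(total = mean(value))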
That's my data frame
Colour = c("red", "blue", "red", "blue", "yellow", "green", "red", "blue", "green", "red", "yellow", "blue")
Volume = c(46,46,57,57,57,57,99,99,99,111,111,122)
Cases = c(7,2,4,2,3,5,1,2,3,2,4,1)
df = data.frame(Colour, Volume, Cases)
I want to sum up Cases if Colour is "red" OR "blue" and the Volume is identical.
Colours which are not specified should be kept as they are. If red and blue can't be summed
up because they differ in Volume, they should also be kept.
The result should look like this:
Colour = c("red_or_blue","red_or_blue","yellow","green","red_or_blue","green","red","yellow","blue")
Volume = c(46,57,57,57,99,99,111,111,122)
Cases = c(9,6,3,5,3,3,2,4,1)
df_agg = data.frame(Colour, Volume, Cases)
I've figured out a way where I create an additional column which assigns "red_or_blue" to rows with red or blue and "x" to the remaining rows, and then used aggregate:
df$test = ifelse(df$Colour %in% c("red", "blue"),"red_or_blue","x")
df_agg = aggregate(df$Cases, list(df$Volume, df$test), sum)
It works, but I found this a bit cumbersome. Is there a handier way that would skip creating an extra column? In future I need to sum up cases for red/blue AND for Volume 57/99, and having the extra column appears to make that a bit more tricky.
Also, I didn't manage to get the original colour carried over if it's neither red nor blue. I tried it this way but it wouldn't work:
df$test = ifelse(df$Colour %in% c("red", "blue"),"red_or_blue",df$Colour)
Cheers, Paul
Here's a way sticking to base R (but probably not the most efficient way...).
Step 1: Split your data into groups by Volume:
temp = split(df, df$Volume)
Step 2: Create a quick function to change the values for "red" and "blue" only in groups where both a "red" AND a "blue" are present.
red.and.blue = function(x) {
if (sum(c("red", "blue") %in% x$Colour) > 1) {
x$Colour = gsub("red|blue", "red-and-blue", x$Colour)
} else {
x$Colour = as.character(x$Colour)
}
x
}
Step 3: Use that function on the temp object that you created in Step 1:
temp = lapply(temp, red.and.blue)
Step 4: Use aggregate() to perform the aggregation you need. Specify the names in the aggregate() arguments so that you maintain your original column names.
temp = lapply(temp, function(x) aggregate(list(Cases = x$Cases),
list(Colour = x$Colour,
Volume = x$Volume), sum))
Step 5: Put it all back into one data.frame with do.call(rbind, ...). Don't forget to assign the result to a name if you want to store it.
do.call(rbind, temp)
# Colour Volume Cases
# 46 red-and-blue 46 9
# 57.1 green 57 5
# 57.2 red-and-blue 57 6
# 57.3 yellow 57 3
# 99.1 green 99 3
# 99.2 red-and-blue 99 3
# 111.1 red 111 2
# 111.2 yellow 111 4
# 122 blue 122 1
I think if you follow #mrdwab's approach, you can use sapply on each "split volume" to do
df$Cases <- sum(df[(df$Colour =='blue' | df$Colour == 'red'),][,3])
to get the number of cases, and
df$Colour[(df$Colour =='blue' | df$Colour == 'red')] <- 'redandblue'
to change the colornames. I'm also willing to bet there's a 2-line solution using ddply but I'm not an expert w/ that tool (yet).
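Regarding that ddply idea: here is a rough sketch (not quite two lines) that merges red and blue only when both occur at the same Volume, which is what the expected output implies:
library(plyr)
df_agg <- ddply(df, .(Volume), function(x) {
  both <- all(c("red", "blue") %in% x$Colour)
  x$Colour <- ifelse(both & x$Colour %in% c("red", "blue"),
                     "red_or_blue", as.character(x$Colour))
  ddply(x, .(Colour), summarise, Cases = sum(Cases))
})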
Suppose I have a data that looks like this.
> print(dat)
V1 V2
1 1 11613
2 2 6517
3 3 2442
4 4 687
5 5 159
6 6 29
# note that V2 is the frequency and V1 does not always start with 1.
> plot(dat,main=title,type="h")
# legend()??
Now what I want to do is plot a histogram and include the mean
and standard deviation in the legend. In the above example the standard deviation equals 0.87 and the mean equals 1.66.
How can I achieve that automatically in R?
This solves the problem with legend creation that Gavin notices.
require(Hmisc)
myMean <- wtd.mean(dat$V1, dat$V2)
mySD <- sqrt(wtd.var(dat$V1, dat$V2))
plot(dat,main="title",type="h")
L= list( bquote(Mean== .(myMean)), bquote(SD== .(mySD) ) )
legend('topright', legend=sapply(L, as.expression))
This was pulled from an answer on R-help that I posted in 2010, which attributed the strategy for the solution to a 2005 exchange between Gabor Grothendieck and Thomas Lumley.
This gets pretty close:
dat <- data.frame(V1 = 1:6, V2 = c(11613, 6517, 2442, 687, 159, 29))
addMyLegend <- function(data, where = "topright", digits = 3, ...) {
MEAN <- round(mean(data), digits = digits)
SD <- round(sd(data), digits = digits)
legend(where, legend = list(bquote(Mean == .(MEAN)),
bquote(SD == .(SD))),
...)
}
plot(dat, type = "h")
addMyLegend(dat$V1, digits = 2, bty = "n")
Which gives the plot with the mean and SD shown in a legend in the top right.
I'm not sure why the plotmath code is not displaying the == and a typeset =... Will have to look into that.
To see what is going on read ?bquote which explains that it can be used to replace components of an expression with dynamic data. Anything wrapped in .( ) will be replaced by the value of the object named in the wrapped part of the expression. Thus foo == .(bar) will look for an object named bar and insert the value of bar into the expression. If bar contained 1.3 then the result after applying bquote(foo == .(bar)) would be similar to expression(foo == 1.3).
The rest of my function addMyLegend() should be fairly self explanatory, if not read ?legend. Note you can pass on any arguments to legend() via the ... in addMyLegend().
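A minimal illustration of that substitution at the console:
# bquote() replaces .(bar) with the current value of bar
bar <- 1.3
bquote(foo == .(bar))
# foo == 1.3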