Assign a dataframe column a value, based on multiple conditions - r

I have a data frame as shown below.
price <- c(287655, 456355, 662500, 597864,876545)
House <- data.frame(price)
I need to create another column for this dataset based on multiple conditions. Let's assume the new column of the data frame is A.
I need to assign the new column a value that depends on several conditions.
I tried several ways, but none succeeded.
if(House$price < 300000) then {House$A='red'}
if(House$price > 300000 & House$price < 500000) then {House$A='blue'}
if(House$price > 500000) then {House$A='green'}
How can I achieve this?

We can use cut, which bins a numeric vector into labelled intervals:
transform(House, newcol = cut(price, breaks = c(-Inf, 300000, 500000, Inf),
labels = c("red", "blue", "green")))
# price newcol
#1 287655 red
#2 456355 blue
#3 662500 green
#4 597864 green
#5 876545 green
Note that if/else is not vectorized: it expects a condition of length 1. Running it in a loop over single elements works, but it is inefficient, because ifelse is the vectorized version of if/else:
House <- transform(House, newcol = ifelse(price < 300000, "red",
ifelse(price > 300000 & price < 500000, "blue", "green")))
House
# price newcol
#1 287655 red
#2 456355 blue
#3 662500 green
#4 597864 green
#5 876545 green
Both approaches give the same output here. The difference is the number of nested ifelse calls, which grows with the number of comparisons. With many intervals it is better to use cut or findInterval instead of nested ifelse.
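A minimal sketch of the findInterval alternative mentioned above, using the data from the question. The names labels and idx are introduced here for illustration only. Note that findInterval treats intervals as closed on the left by default, so a price of exactly 300000 would be labelled "blue" here, whereas cut with its default settings would label it "red".

```r
price <- c(287655, 456355, 662500, 597864, 876545)
House <- data.frame(price)

# Labels for the three intervals (illustrative name, not from the original answer)
labels <- c("red", "blue", "green")

# findInterval() returns 0 for values below the first break,
# 1 for values between the two breaks, and 2 for values above the last break
idx <- findInterval(House$price, c(300000, 500000))

# Shift by one so the interval number can index the labels vector
House$newcol <- labels[idx + 1]
House
```

Adding another category then only requires one more break and one more label, rather than another nested ifelse.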
Also note that in R, if pairs with else, not then. A corrected loop looks like this:
House$newcol <- NA
for (i in seq_len(nrow(House))) {
  House$newcol[i] <- if (House$price[i] < 300000) {
    'red'
  } else if (House$price[i] > 300000 && House$price[i] < 500000) {
    'blue'
  } else 'green'
}

Related

How to build a new column using for loop and if statement based on the values already present

I have a dataset where I want to create a new column and add values based on the value present in each row. Below is an example:
Sales Result of new column
100 Low
200 Low
300 Moderate
400 High
500 High
Below is the code I tried to get the result:
data$New_Column = for (Sal in data$Sales){
if(Sal > 300){
print("High")
} else if(Sal == 300){
print("Moderate")
}else{
print("Low")
}
}
Thank You
Perhaps try using mutate instead of a loop:
library(dplyr)
newdat <- data %>% mutate(New_Column = ifelse(Sales == 300, "Moderate", ifelse(Sales > 300, "High", "Low")))
You would usually not do that in a loop in R since most of the relevant functions can work directly with vectors.
Here is one option using data.table::fcase() and another using ifelse() from base R:
library(data.table)
data$New_Column <- fcase(
data$Sales > 300, "High",
data$Sales == 300, "Moderate",
data$Sales < 300, "Low"
)
# Using base R only:
data$New_Column <-
ifelse(data$Sales == 300, "Moderate", ifelse(data$Sales > 300, "High", "Low"))
# Sales New_Column
# 1 300 Moderate
# 2 500 High
# 3 200 Low
# 4 500 High
# 5 400 High
# 6 500 High
# 7 400 High
# 8 300 Moderate
Example input data
set.seed(13L)
data <- data.frame(
Sales = sample(seq(100, 500, 100), size = 8L, replace = TRUE)
)
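For reference, the loop in the question fails because a for loop returns NULL, and print only writes to the console, so nothing is ever stored in the column. If a loop is really wanted, a minimal sketch of a working version (assuming the five Sales values from the question; the names sales, loop_result and labels are illustrative) would pre-allocate the result and fill it element by element:

```r
sales <- c(100, 200, 300, 400, 500)

# A for loop evaluates to NULL, so assigning it to a column stores nothing
loop_result <- for (s in sales) s
is.null(loop_result)

# Pre-allocate the result and fill it one element at a time
labels <- character(length(sales))
for (i in seq_along(sales)) {
  labels[i] <- if (sales[i] > 300) {
    "High"
  } else if (sales[i] == 300) {
    "Moderate"
  } else {
    "Low"
  }
}
labels
```

The vectorized versions above remain the more idiomatic choice; the loop is shown only to explain what went wrong.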

Perform function on subset of data

I have a set of data below which shows a number of cars at an imaginary car dealership. The 'Current_Price' variable is the price the car is currently set to sell at. The 'Minimum_Price' variable is the hard floor price that the car must not, under any circumstances, sell below. (This can be assumed to be the purchase price.)
I'm trying to create a function where the user can select a subset of the cars in the database (using the 'user defined parameters' as noted below), and then reduce or increase the 'Current_Price' by either a percentage or pound(£) value.
The 'minimum profit parameters' set the minimum profit which is to be made on all cars. In this example they have been set to £10 and 10%, meaning the profit for each car must be at least £10 or 10% of the current price, whichever is larger.
The price change parameters set how much the price is to move and whether it should move up or down.
# Dummy data
Type <- rep(c("Car", "Van"),each=3)
Age <- as.numeric(c(2, 2, 5, 4, 8,1))
Colour <- c("Red", "Red", "Yellow", "Red", "Black", "Red")
Make <- c("Ford", "VW", "VW", "VW", "BMW", "Ford")
Current_Price <- as.numeric(c(1050, 1000, 1500, 995, 2200, 2100))
Minimum_Price <- as.numeric(c(900, 600, 500, 850, 1900, 1950))
df1 <- data.frame(Type, Age, Colour, Make, Current_Price, Minimum_Price)
# User defined parameters - price to be changed for all cars which fit below selection
Input_Type <- "Car"
Input_Min_Age <- 2 # All cars this age and above
Input_Max_Age <- 10 # All cars this age and below
Input_Colour <- "Red"
Input_Make <- c("Ford", "VW")
# Minimum profit parameters
Input_Min_Pounds <- 10
Input_Min_Percentage <- 0.10
# Price change parameters
Input_Change_Type <- "Percentage" # "Percentage" or "Pound"
Input_Change_Value <- -0.10 # "-" sign to represent price reduction
Given the above, I would expect lines 1 & 2 to be affected by the change. Line 1's price should move down from £1,050 to £1,000. This is because £1,000 is the lowest price possible where 10% of the price is profit (900/(1-0.10) = 1000).
Line 2's price should simply move down by 10% to 900.
Has anyone got an idea how to put this into a function which will be fairly intuitive to use for someone who is not used to using R?
This answer uses a data.table so that the price changes are applied to the original data (as you explained in the comments to your question). The solution could look like this.
I am still ignoring the pricing logic because I want to focus on the usability aspect
(this specialized pricing logic is arbitrary and not of particular interest for anybody else here at SO; if you have a specific problem to implement it yourself please open a new question and explain the problem in detail).
library(data.table)
data <- as.data.table(df1)
calc_price <- function(Current_Price,
Minimum_Price,
price_change_type,
price_change_value,
min_profit_pounds,
min_profit_percentage) {
# TODO implement your pricing logic here...
return(Current_Price + 1)
}
update_car_prices <- function(data,
filter,
price_change_type = c("Percentage", "Pound"),
price_change_value = 0,
min_profit_pounds = 10,
min_profit_percentage = 0.10) {
stopifnot(is.data.table(data))
price_change_type <- match.arg(price_change_type) # use the first value if none was provided
filter_exp <- substitute(filter) # "parse" the passed value as expression
# update the price using a separate function to encapsulate the logic
data[eval(filter_exp), Current_Price := calc_price(Current_Price,
Minimum_Price,
price_change_type,
price_change_value,
min_profit_pounds,
min_profit_percentage)][]
return(data)
}
The usage is still similar to my data.frame answer, e. g.:
update_car_prices(data, Type == "Car" & Age >= 2 & Age <= 10 & Colour == "Red" & Make %in% c("Ford", "VW"))
update_car_prices(data, Colour == "Red")
update_car_prices(data, Colour == "Red", "Pound", 500)
The differences are:
The whole data.table (data) is returned to see the impact
The original data is changed since data.tables are passed by reference
and I am updating the price "by reference" using the data.table syntax :=
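The pricing rule described in the question could fill the TODO in calc_price roughly as follows. This is only a sketch of the arithmetic stated in the question, not part of the original answer: calc_price_sketch is a hypothetical name, and the sketch assumes that the required profit is measured against Minimum_Price and that the floor always wins, even if that raises a price whose current profit is already too low.

```r
# Hypothetical helper; parameter names follow the calc_price() skeleton above
calc_price_sketch <- function(Current_Price,
                              Minimum_Price,
                              price_change_type = "Percentage",
                              price_change_value = 0,
                              min_profit_pounds = 10,
                              min_profit_percentage = 0.10) {
  # Proposed new price before the profit floor is applied
  proposed <- if (price_change_type == "Percentage") {
    Current_Price * (1 + price_change_value)
  } else {
    Current_Price + price_change_value
  }
  # Lowest admissible price:
  #   price - Minimum_Price >= min_profit_pounds
  #   price - Minimum_Price >= min_profit_percentage * price
  # The second condition rearranges to price >= Minimum_Price / (1 - percentage)
  floor_price <- pmax(Minimum_Price + min_profit_pounds,
                      Minimum_Price / (1 - min_profit_percentage))
  # Never go below the floor (assumption: the floor may exceed the current price)
  pmax(proposed, floor_price)
}

# Lines 1 and 2 of the dummy data with a 10% reduction
new_prices <- calc_price_sketch(c(1050, 1000), c(900, 600), "Percentage", -0.10)
new_prices
```

For the two rows the question expects to change, this gives about 1000 and 900, in line with the expected results stated there.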
This answer is based on a data.frame...
Your question addresses multiple aspects (pricing logic, filtering logic and usability).
I am focusing on the usability (and ignoring the pricing logic since it is just an arbitrary detail).
I see at least three options:
1. Use a strongly typed function:
get_car_prices1 <- function(data, Input_Type, Input_Min_Age, Input_Max_Age, Input_Colour, Input_Make, Input_Min_Pounds, Input_Min_Percentage)
2. Use an "untyped" function with an arbitrary number of arguments via ... to support filtering by passing only the required arguments:
get_car_prices2 <- function(data, Input_Min_Pounds, Input_Min_Percentage, ...)
3. Use meta programming with substitute + eval
I have chosen option 3 as the best (user friendly + flexible) option IMHO:
get_car_prices <- function(data,
filter,
price_change_type = c("Percentage", "Pound"),
price_change_value = 1)
{
price_change_type <- match.arg(price_change_type) # use the first value if none was provided
filter_exp <- substitute(filter) # "parse" the passed value as expression
data_subset <- subset(data, eval(filter_exp))
# TODO add your pricing logic here (e. g. using "ifelse")
return(data_subset)
}
# Usage examples:
get_car_prices(df1, Colour == "Red")
# Type Age Colour Make Current_Price Minimum_Price
# 1 Car 2 Red Ford 1050 900
# 2 Car 2 Red VW 1000 600
# 4 Van 4 Red VW 995 850
# 6 Van 1 Red Ford 2100 1950
get_car_prices(df1, Type == "Car" & Age >= 2 & Age <= 10 & Colour == "Red" & Make %in% c("Ford", "VW"))
# Type Age Colour Make Current_Price Minimum_Price
# 1 Car 2 Red Ford 1050 900
# 2 Car 2 Red VW 1000 600
get_car_prices(df1, Colour == "Red", "Pound", 500)
# ...
get_car_prices(df1, Colour == "Red", "dumping price", 1)
# Error in match.arg(price_change_type) :
# 'arg' should be one of “Percentage”, “Pound”
# But: The user has to learn at least the expression logic of R and that variables (and values) are case-sensitive:
get_car_prices(df1, Colour == "red")
# [1] Type Age Colour Make Current_Price Minimum_Price
# <0 rows> (or 0-length row.names)
# Error: Assignment operator (=) used instead of comparison operator (==)
get_car_prices(df1, Colour = "Red")
# Error in get_car_prices(df1, Colour = "Red") :
# unused argument (Colour = "Red")

How to combine multiple variable data to a single variable data?

After making my data frame and selecting the variables I want to look at, I face a dilemma. The Excel sheet which acts as my data source was used by different people recording the same type of data.
Mock Neg Neg1PCR Neg2PCR NegPBS red Red RedWine water Water white White
1 9 1 1 1 2 18 4 4 4 2 26
As you can see, because the data is written differently, the major groups (RedWine, WhiteWine and Water) have been split into subgroups. How do I combine the subgroups into one group, e.g. red + Red + RedWine -> total red wine? I use the phyloseq package for this kind of dataset.
names <- c("red","white","water")
df2 <- setNames(data.frame(matrix(ncol = length(names), nrow = nrow(df))),names)
for(col in names){
df2[,col] <- rowSums(df[,grep(col,tolower(names(df)))])
}
Here, grep(col, tolower(names(df))) finds all columns whose lowercased name contains the given string, such as "red". Their row sums are then stored in a new data.frame df2 that was created with matching dimensions.
I would just create a new data.frame, easiest to do with dplyr but also doable with base R:
with dplyr
newFrame <- oldFrame %>% mutate(Mock = Mock, Neg = Neg + Neg1PCR + Neg2PCR + NegPBS, Red = red + Red + RedWine, Water = water + Water, White = white + White)
with base R (not complete but you get the point)
newFrame <- data.frame(Red = oldFrame$Red + oldFrame$red + oldFrame$RedWine...)
One can use dplyr::starts_with and dplyr::select to combine columns. The ignore.case argument of starts_with is TRUE by default, which helps with the data.frame the OP has posted.
library(dplyr)
names <- c("red", "white", "water")
cbind(df[1], t(mapply(function(x)rowSums(select(df, starts_with(x))), names)))
# Mock red white water
# 1 1 24 28 8
Data:
df <- read.table(text =
"Mock Neg Neg1PCR Neg2PCR NegPBS red Red RedWine water Water white White
1 9 1 1 1 2 18 4 4 4 2 26",
header = TRUE, stringsAsFactors = FALSE)
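A base R variant of the same prefix-matching idea might look like the sketch below. It assumes the single-row data shown above, rebuilt here with data.frame so the block runs on its own; prefixes, totals and result are illustrative names. startsWith is part of base R (3.3.0 and later).

```r
df <- data.frame(Mock = 1, Neg = 9, Neg1PCR = 1, Neg2PCR = 1, NegPBS = 1,
                 red = 2, Red = 18, RedWine = 4, water = 4, Water = 4,
                 white = 2, White = 26)

prefixes <- c("red", "white", "water")

# For each prefix, sum all columns whose lowercased name starts with it.
# drop = FALSE keeps a data.frame even if only one column matches.
totals <- as.data.frame(lapply(setNames(prefixes, prefixes), function(p) {
  cols <- startsWith(tolower(names(df)), p)
  rowSums(df[, cols, drop = FALSE])
}))

result <- cbind(df["Mock"], totals)
result
```

Matching on the start of the name rather than anywhere in it avoids accidentally picking up a column that merely contains "red" in the middle of its name.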

Efficient way of assigning of IF conditions provided in a data frame

I have a set of IF conditions provided in tabular form. E.g.
  rule_id number colour          shape
1       1    200
2       2    100                 triangle
3       3     NA red
4       4     NA ‘blue’,‘orange’ rectangle
5       5     NA green           Not(‘triangle’,‘square’)
6       6     NA
The table needs to be read as follows (in pseudo code):
IF number = 200 THEN rule_id = 1
IFELSE number = 100 AND shape = triangle THEN rule_id = 2
IFELSE colour = red THEN rule_id = 3
.....
I am then interested in categorizing the rows in a dataset according to these rules. E.g. (rule_id is the result of the categorization)
number colour shape rule_id
1 100 red triangle 2
2 200 yellow none 1
3 300 blue none 6
4 200 none none 1
5 100 red square 3
6 500 green circle 5
7 400 green square 6
8 600 none none 6
In order to do so, I used a FOR loop running through all rules (I categorize 1st all rows that meet the conditions of rule_id = 1, then go to the next iteration for rule_id = 2). This seems unfortunately a very slow procedure (the dimensions of both my rules table and my dataset are much larger).
Is there a more optimal way of doing this?
And, also how do I best deal with the Not in the rules table?
The code I used is (maybe not a beauty...):
rules = data.frame(rule_id = 1:6,
number = c(200, 100, rep(NA,4)),
colour = c('', '', 'red', paste(sQuote('blue'), sQuote('orange'), sep = ','), 'green', ''),
shape = c('', 'triangle','','rectangle', paste(paste('Not(', sQuote('triangle'), sep = ''),
paste(sQuote('square'), ')', sep = ''), sep = ','), ''),
stringsAsFactors = FALSE)
data = data.frame(number = c(100, 200, 300, 200, 100, 500, 400, 600),
colour = c('red', 'yellow', 'blue', 'none', 'red', 'green', 'green', 'none'),
shape = c('triangle', 'none', 'none', 'none', 'square', 'circle', 'square', 'none'),
stringsAsFactors = FALSE)
data$rule_id = NA
nbrRules = nrow(rules)
for (j in 1:nbrRules){
data$rule_id[is.na(data$rule_id)
& (data$number == rules$number[j]
| is.na(rules$number[j]))
& ((apply(as.data.frame(data$colour),
1,
function(x) grepl(x, rules$colour[j]))
& (!grepl("Not", rules$colour[j])))
| (apply(as.data.frame(data$colour),
1,
function(x) !grepl(x, rules$colour[j]))
& (grepl("Not", rules$colour[j])))
| (rules$colour[j] == ""))
& ((apply(as.data.frame(data$shape),
1,
function(x) grepl(x, rules$shape[j]))
& (!grepl("Not", rules$shape[j])))
| (apply(as.data.frame(data$shape),
1,
function(x) !grepl(x, rules$shape[j]))
& (grepl("Not", rules$shape[j])))
| (rules$shape[j] == ""))] = rules$rule_id[j]
}
UPDATE
I haven't had time to implement @alexis_laz's suggestion, but it made me realize that some time could be gained by looping through the rules in reverse, for (j in nbrRules:1) instead of for (j in 1:nbrRules), as this allows me to remove the is.na(data$rule_id) filter from
data$rule_id[is.na(data$rule_id)
& (data$number == rules$number[j]
| is.na(rules$number[j]))
....
On the actual dataset I applied it to, this led to a small gain in time (from 3.978148 mins to 3.972381 mins).
I also realized that my comment was wrong and that @Oliver Frost's ifelse suggestion would not lead to more conditions. Changing the code to the following, however, led to a slower program (4.079141 mins on my actual dataset):
for (j in 1:nbrRules){
data$rule_id = ifelse(is.na(data$rule_id)
& (data$number == rules$number[j]
| is.na(rules$number[j]))
& ((apply(as.data.frame(data$colour),
1,
function(x) grepl(x, rules$colour[j]))
& (!grepl("Not", rules$colour[j])))
| (apply(as.data.frame(data$colour),
1,
function(x) !grepl(x, rules$colour[j]))
& (grepl("Not", rules$colour[j])))
| (rules$colour[j] == ""))
& ((apply(as.data.frame(data$shape),
1,
function(x) grepl(x, rules$shape[j]))
& (!grepl("Not", rules$shape[j])))
| (apply(as.data.frame(data$shape),
1,
function(x) !grepl(x, rules$shape[j]))
& (grepl("Not", rules$shape[j])))
| (rules$shape[j] == ""))
,rules$rule_id[j],
data$rule_id)
}
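The reverse-order idea from the update can be combined with vectorized %in% tests in place of the apply/grepl calls. The sketch below is one possible shape, not a tested replacement for the code above: it assumes the rules have already been parsed from their string form into a list (a hypothetical representation), with a separate not_shape entry for the Not(...) case, and rule_matches is an illustrative helper name.

```r
data <- data.frame(
  number = c(100, 200, 300, 200, 100, 500, 400, 600),
  colour = c("red", "yellow", "blue", "none", "red", "green", "green", "none"),
  shape  = c("triangle", "none", "none", "none", "square", "circle", "square", "none"),
  stringsAsFactors = FALSE
)

# Hypothetical pre-parsed rules: a missing entry means "no condition";
# not_shape lists shapes that must NOT occur
rules <- list(
  list(id = 1L, number = 200),
  list(id = 2L, number = 100, shape = "triangle"),
  list(id = 3L, colour = "red"),
  list(id = 4L, colour = c("blue", "orange"), shape = "rectangle"),
  list(id = 5L, colour = "green", not_shape = c("triangle", "square")),
  list(id = 6L)
)

# One logical vector per rule; [[ ]] is used because $ would partially match names
rule_matches <- function(rule, data) {
  ok <- rep(TRUE, nrow(data))
  if (!is.null(rule[["number"]]))    ok <- ok & data$number %in% rule[["number"]]
  if (!is.null(rule[["colour"]]))    ok <- ok & data$colour %in% rule[["colour"]]
  if (!is.null(rule[["shape"]]))     ok <- ok & data$shape %in% rule[["shape"]]
  if (!is.null(rule[["not_shape"]])) ok <- ok & !(data$shape %in% rule[["not_shape"]])
  ok
}

# Apply the rules last to first so that earlier rules overwrite later ones
data$rule_id <- NA_integer_
for (rule in rev(rules)) {
  data$rule_id[rule_matches(rule, data)] <- rule[["id"]]
}
data
```

On the example data this reproduces the rule_id column shown in the question. Whether it is faster on the real dataset would need to be measured.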

Sum up rows according to specific values

This is my data frame:
Colour = c("red", "blue", "red", "blue", "yellow", "green", "red", "blue", "green", "red", "yellow", "blue")
Volume = c(46,46,57,57,57,57,99,99,99,111,111,122)
Cases = c(7,2,4,2,3,5,1,2,3,2,4,1)
df = data.frame(Colour, Volume, Cases)
I want to sum up Cases if Colour is "red" or "blue", but only if Volume is identical.
Colours that are not specified should be kept. If red and blue can't be summed
up because they differ in Volume, they should also be kept.
The result should look like this:
Colour = c("red_or_blue","red_or_blue","yellow","green","red_or_blue","green","red","yellow","blue")
Volume = c(46,57,57,57,99,99,111,111,122)
Cases = c(9,6,3,5,3,3,2,4,1)
df_agg = data.frame(Colour, Volume, Cases)
I've figured out a way in which I create an extra column that assigns "red_or_blue" to rows with red or blue and "x" to the remaining rows. I then use aggregate:
df$test = ifelse(df$Colour %in% c("red", "blue"),"red_or_blue","x")
df_agg = aggregate(df$Cases, list(df$Volume, df$test), sum)
It works, but I found this a bit cumbersome. Is there a handier way that would skip creating an extra column? In future I need to sum up cases for red/blue AND for Volume 57/99. Having the extra column appears to make that a bit more tricky.
Also, I didn't manage to carry over the original colour when it is neither red nor blue. I tried it this way, but it wouldn't work:
df$test = ifelse(df$Colour %in% c("red", "blue"),"red_or_blue",df$Colour)
Cheers, Paul
Here's a way sticking in base R (but probably not the most efficient way....)
1. Split your data into groups by Volume:
temp = split(df, df$Volume)
2. Create a quick function to change the values for "red" and "blue" only in groups where there is a "red" AND a "blue" present:
red.and.blue = function(x) {
  if (sum(c("red", "blue") %in% x$Colour) > 1) {
    x$Colour = gsub("red|blue", "red-and-blue", x$Colour)
  } else {
    x$Colour = as.character(x$Colour)
  }
  x
}
3. Use that function on the temp object that you created in Step 1:
temp = lapply(temp, red.and.blue)
4. Use aggregate() to perform the aggregation you need to do. Specify the names in the aggregate() arguments so that you maintain your original column names:
temp = lapply(temp, function(x) aggregate(list(Cases = x$Cases),
                                          list(Colour = x$Colour,
                                               Volume = x$Volume), sum))
5. Put it all back into a data.frame(). Don't forget to assign a name if you want to store it as is:
do.call(rbind, temp)
# Colour Volume Cases
# 46 red-and-blue 46 9
# 57.1 green 57 5
# 57.2 red-and-blue 57 6
# 57.3 yellow 57 3
# 99.1 green 99 3
# 99.2 red-and-blue 99 3
# 111.1 red 111 2
# 111.2 yellow 111 4
# 122 blue 122 1
I think if you follow @mrdwab's approach, you can use sapply on each "split volume" to do
df$Cases <- sum(df[(df$Colour =='blue' | df$Colour == 'red'),][,3])
to get the number of cases, and
df$Colour[(df$Colour =='blue' | df$Colour == 'red')] <- 'redandblue'
to change the colour names. I'm also willing to bet there's a 2-line solution using ddply, but I'm not an expert with that tool (yet).
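A more compact base R sketch of the same idea, without splitting the data, is shown below. It is an illustration rather than one of the original answers; has_both and merge_row are names introduced here. It also assumes Colour is stored as character (stringsAsFactors = FALSE). In R versions before 4.0, data.frame() converted strings to factors by default, in which case ifelse(..., df$Colour) returns the integer codes of the factor, which is likely why the attempt in the question did not keep the original colours.

```r
df <- data.frame(
  Colour = c("red", "blue", "red", "blue", "yellow", "green",
             "red", "blue", "green", "red", "yellow", "blue"),
  Volume = c(46, 46, 57, 57, 57, 57, 99, 99, 99, 111, 111, 122),
  Cases  = c(7, 2, 4, 2, 3, 5, 1, 2, 3, 2, 4, 1),
  stringsAsFactors = FALSE
)

# For each Volume: are both "red" and "blue" present?
has_both <- tapply(df$Colour, df$Volume,
                   function(x) all(c("red", "blue") %in% x))

# Relabel only red/blue rows that belong to such a Volume group
merge_row <- df$Colour %in% c("red", "blue") &
  as.vector(has_both[as.character(df$Volume)])
df$Colour[merge_row] <- "red_or_blue"

# Sum Cases per (relabelled) Colour and Volume
df_agg <- aggregate(Cases ~ Colour + Volume, data = df, FUN = sum)
df_agg
```

Because the relabelling is written straight into Colour, no helper column is needed, and a second grouping (for example on Volume) could be relabelled the same way before the aggregate call.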
