Sum up rows according to specific values


This is my data frame:
Colour = c("red", "blue", "red", "blue", "yellow", "green", "red", "blue", "green", "red", "yellow", "blue")
Volume = c(46,46,57,57,57,57,99,99,99,111,111,122)
Cases = c(7,2,4,2,3,5,1,2,3,2,4,1)
df = data.frame(Colour, Volume, Cases)
I want to sum up Cases where Colour is "red" OR "blue", but only across rows that share the same Volume.
Colours that are not specified should be kept as they are. If red and blue can't be summed
up because they differ in Volume, they should also be kept.
The result should look like this:
Colour = c("red_or_blue","red_or_blue","yellow","green","red_or_blue","green","red","yellow","blue")
Volume = c(46,57,57,57,99,99,111,111,122)
Cases = c(9,6,3,5,3,3,2,4,1)
df_agg = data.frame(Colour, Volume, Cases)
I've figured out a way: I create a further column that assigns "red_or_blue" to the rows with red or blue and "x" to the remaining rows. I then use aggregate:
df$test = ifelse(df$Colour %in% c("red", "blue"),"red_or_blue","x")
df_agg = aggregate(df$Cases, list(df$Volume, df$test), sum)
It works, but I find it a bit cumbersome. Is there a handier way that skips creating an extra column? In future I need to sum up cases for red/blue AND for Volume 57/99, and the extra column seems to make that trickier.
Also, I didn't manage to carry over the original colour when it is neither red nor blue. I tried it this way, but it wouldn't work:
df$test = ifelse(df$Colour %in% c("red", "blue"),"red_or_blue",df$Colour)
Cheers, Paul

Here's a way that sticks to base R (though probably not the most efficient one).
1. Split your data into groups by Volume:
temp = split(df, df$Volume)
2. Create a quick function that changes the values "red" and "blue" only in groups where both a "red" AND a "blue" are present:
red.and.blue = function(x) {
  if (sum(c("red", "blue") %in% x$Colour) > 1) {
    x$Colour = gsub("red|blue", "red-and-blue", x$Colour)
  } else {
    x$Colour = as.character(x$Colour)
  }
  x
}
3. Use that function on the temp object you created in Step 1:
temp = lapply(temp, red.and.blue)
4. Use aggregate() to perform the aggregation. Name the elements in the aggregate() arguments so that you keep your original column names:
temp = lapply(temp, function(x) aggregate(list(Cases = x$Cases),
                                          list(Colour = x$Colour,
                                               Volume = x$Volume), sum))
5. Put it all back into a data.frame. Don't forget to assign the result to a name if you want to keep it:
do.call(rbind, temp)
# Colour Volume Cases
# 46 red-and-blue 46 9
# 57.1 green 57 5
# 57.2 red-and-blue 57 6
# 57.3 yellow 57 3
# 99.1 green 99 3
# 99.2 red-and-blue 99 3
# 111.1 red 111 2
# 111.2 yellow 111 4
# 122 blue 122 1
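The same result can probably be reached without split() and without a helper column. The following is only a sketch in base R; `both` and `merge_row` are names introduced here for illustration. It sets `stringsAsFactors = FALSE` explicitly, because before R 4.0 data.frame() converted strings to factors, which is a likely reason the `ifelse(..., df$Colour)` attempt in the question returned integer codes instead of colour names.

```r
Colour <- c("red", "blue", "red", "blue", "yellow", "green",
            "red", "blue", "green", "red", "yellow", "blue")
Volume <- c(46, 46, 57, 57, 57, 57, 99, 99, 99, 111, 111, 122)
Cases  <- c(7, 2, 4, 2, 3, 5, 1, 2, 3, 2, 4, 1)
df <- data.frame(Colour, Volume, Cases, stringsAsFactors = FALSE)

# One logical per Volume: does that Volume contain both "red" and "blue"?
both <- tapply(df$Colour, df$Volume,
               function(x) all(c("red", "blue") %in% x))

# Rows to relabel: red/blue rows whose Volume group has both colours
merge_row <- df$Colour %in% c("red", "blue") &
  as.vector(both[as.character(df$Volume)])

df$Colour[merge_row] <- "red_or_blue"

# Sum Cases per Colour/Volume combination
df_agg <- aggregate(Cases ~ Colour + Volume, data = df, FUN = sum)
df_agg
```

Red at Volume 111 and blue at Volume 122 keep their own labels, because neither Volume contains both colours.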

I think if you follow @mrdwab's approach, you can use sapply on each "split volume" and do
df$Cases <- sum(df[(df$Colour =='blue' | df$Colour == 'red'),][,3])
to get the number of cases, and
df$Colour[(df$Colour =='blue' | df$Colour == 'red')] <- 'redandblue'
to change the colour names. I'm also willing to bet there's a two-line solution using ddply, but I'm not an expert with that tool (yet).

Related

Renaming coat colors in R goes wrong with str_detect

I have a dataset with horses and want to group them based on coat colors. In my dataset more than 140 colors are used, I would like to go back to only a few coat colors and assign the rest to Other. But for some horses the coat color has not been registered, i.e. those are unknown. Below is what the new colors should be. (To illustrate the problem I have an old coat color and a new one. But I want to simply change the coat colors, not create a new column with colors)
Horse ID  Coatcolor(old)  Coatcolor
1         black           Black
2         bayspotted      Spotted
3         chestnut        Chestnut
4         grey            Grey
5         cream dun       Other
6         (blank)         Unknown
7         blue roan       Other
8         chestnutgrey    Grey
9         blackspotted    Spotted
10        (blank)         Unknown
Instead, I get the data below (second table), where Unknown and Other are switched.
Horse ID  Coatcolor
1         Black
2         Spotted
3         Chestnut
4         Grey
5         Unknown
6         Other
7         Unknown
8         Grey
9         Spotted
10        Other
I used the following code:
mydata <- data %>%
  mutate(Coatcolor = case_when(
    str_detect(Coatcolor, "spotted") ~ "Spotted",
    str_detect(Coatcolor, "grey") ~ "Grey",
    str_detect(Coatcolor, "chestnut") ~ "Chestnut",
    str_detect(Coatcolor, "black") ~ "Black",
    str_detect(Coatcolor, "") ~ "Unknown",
    TRUE ~ Coatcolor
  ))
mydata$Coatcolor[!mydata$Coatcolor %in% c("Spotted", "Grey", "Chestnut", "Black", "Unknown")] <- "Other"
So what am I doing wrong/missing? Thanks in advance.

You can use the recode function of the dplyr package. Assuming the missing entries are NAs, you can set all of them to "Unknown" beforehand with replace_na from the tidyr package. The details depend on how the missing entries are stored in your data.
library(dplyr)
library(tidyr)

mydata <- tibble(
  id = 1:10,
  coatcol = letters[1:10]
)
mydata$coatcol[5] <- NA
mydata$coatcol[4] <- ""
mydata <- mydata %>%
  mutate(coatcol = na_if(coatcol, "")) %>%                    # convert empty strings to NA
  mutate(Coatcolor_old = replace_na(coatcol, "Unknown")) %>%  # set all NA to "Unknown"
  mutate(Coatcolor_new = recode(
    Coatcolor_old,
    'spotted' = 'Spotted',
    'bayspotted' = 'Spotted',
    'old_name' = 'new_name',
    'a' = 'A'  # etc.
  ))
mydata
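The pattern-based grouping from the question can also be written so that missing values are tested explicitly rather than with an empty pattern. An empty pattern is not a test for an empty string, which would explain why Unknown and Other end up swapped. The sketch below uses base R only; the vector `coat` is made-up example data mirroring the question's table, and treating both "" and NA as "not registered" is an assumption.

```r
# Example data modelled on the question (the last entry is NA, the sixth is "")
coat <- c("black", "bayspotted", "chestnut", "grey", "cream dun",
          "", "blue roan", "chestnutgrey", "blackspotted", NA)

# Start with "Other", then overwrite from lowest to highest priority,
# so that later rules win (e.g. "chestnutgrey" ends up as "Grey")
new_coat <- rep("Other", length(coat))
new_coat[grepl("black", coat)]    <- "Black"
new_coat[grepl("chestnut", coat)] <- "Chestnut"
new_coat[grepl("grey", coat)]     <- "Grey"
new_coat[grepl("spotted", coat)]  <- "Spotted"

# Explicit test for unregistered colours: NA or the empty string
new_coat[is.na(coat) | coat == ""] <- "Unknown"

data.frame(HorseID = seq_along(coat), Coatcolor = new_coat)
```

The order of the assignments mirrors the priority of the case_when branches in the question: spotted first, then grey, chestnut and black.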

Assign a dataframe column a value, based on multiple conditions

I have a data frame as below.
price <- c(287655, 456355, 662500, 597864,876545)
House <- data.frame(price)
I need to add another column to this data set based on multiple conditions. Let's assume the new column is called A.
Its value should be assigned according to several conditions on price.
I tried several approaches, but none of them worked.
if(House$price < 300000) then {House$A='red'}
if(House$price > 300000 & House$price < 500000) then {House$A='blue'}
if(House$price > 500000) then {House$A='green'}
How can I achieve this?

We can use cut:
transform(House, newcol = cut(price, breaks = c(-Inf, 300000, 500000, Inf),
                              labels = c("red", "blue", "green")))
# price newcol
#1 287655 red
#2 456355 blue
#3 662500 green
#4 597864 green
#5 876545 green
Note that if/else is not vectorized: it expects a condition of length 1. Looping over the elements one at a time works, but it is inefficient, because ifelse is the vectorized version of if/else:
House <- transform(House, newcol = ifelse(price < 300000, "red",
                   ifelse(price > 300000 & price < 500000, "blue", "green")))
House
# price newcol
#1 287655 red
#2 456355 blue
#3 662500 green
#4 597864 green
#5 876545 green
Both approaches give the same output here, but the number of nested ifelse calls grows with the number of intervals, so cut or findInterval is preferable to nested ifelse.
Also note that if goes with else, not then:
House$newcol <- NA
for (i in seq_len(nrow(House))) {
  House$newcol[i] <- if (House$price[i] < 300000) {
    'red'
  } else if (House$price[i] > 300000 && House$price[i] < 500000) {
    'blue'
  } else 'green'
}
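The findInterval alternative mentioned above is not spelled out in the answer; a minimal sketch could look like this. The names `breaks` and `labels` are introduced here for illustration.

```r
price <- c(287655, 456355, 662500, 597864, 876545)
House <- data.frame(price)

breaks <- c(300000, 500000)           # interior cut points only
labels <- c("red", "blue", "green")   # one label more than there are breaks

# findInterval() returns 0 below the first break, 1 between the breaks and
# 2 from the last break upwards, so add 1 to get an index into `labels`
House$newcol <- labels[findInterval(House$price, breaks) + 1]
House
```

One difference to keep in mind: findInterval treats intervals as closed on the left by default, so a price of exactly 300000 would be labelled "blue", whereas cut with its default right = TRUE would label it "red".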

Perform function on subset of data

I have a set of data below which shows a number of cars at an imaginary car dealership. The 'current_price' variable is obviously the price the car is currently set to sell at. The 'minimum_price' variable shows the hard floor price that the car must not, under any circumstance, sell at. (This can be assumed to be the purchase price).
I'm trying to create a function where the user can select a subset of the cars in the database (using the 'user defined parameters' as noted below), and then reduce or increase the 'Current_Price' by either a percentage or pound(£) value.
The 'minimum profit parameters' set the minimum profit which is to be made on all cars. In this example they have been set to £10 and 10%. Meaning the profit for each car must be either £10, or 10% of the current price - whichever is larger.
The price change parameters set how much the price is to move and whether it should move up or down.
# Dummy data
Type <- rep(c("Car", "Van"),each=3)
Age <- as.numeric(c(2, 2, 5, 4, 8,1))
Colour <- c("Red", "Red", "Yellow", "Red", "Black", "Red")
Make <- c("Ford", "VW", "VW", "VW", "BMW", "Ford")
Current_Price <- as.numeric(c(1050, 1000, 1500, 995, 2200, 2100))
Minimum_Price <- as.numeric(c(900, 600, 500, 850, 1900, 1950))
df1 <- data.frame(Type, Age, Colour, Make, Current_Price, Minimum_Price)
# User defined parameters - price to be changed for all cars which fit below selection
Input_Type <- "Car"
Input_Min_Age <- 2 # All cars this age and above
Input_Max_Age <- 10 # All cars this age and below
Input_Colour <- "Red"
Input_Make <- c("Ford", "VW")
# Minimum profit parameters
Input_Min_Pounds <- 10
Input_Min_Percentage <- 0.10
# Price change parameters
Input_Change_Type <- "Percentage" # "Percentage" or "Pound"
Input_Change_Value <- -0.10 # "-" sign to represent price reduction
Given the above, I would expect lines 1 & 2 to be affected by the change. Line 1's price should move down from £1,050 to £1,000. This is because £1,000 is the lowest possible price at which 10% of the price is profit (900/(1-0.10) = 1000).
Line 2's price should simply move down by 10% to 900.
Has anyone got an idea how to put this into a function which will be fairly intuitive to use for someone who is not used to using R?

This answer uses a data.table so that the price changes are applied to the original data (as you explained in the comments to your question). The solution could look like this.
I am still ignoring the pricing logic because I want to focus on the usability aspect
(this specialized pricing logic is arbitrary and not of particular interest to anybody else here at SO; if you have a specific problem implementing it yourself, please open a new question and explain the problem in detail).
library(data.table)
data <- as.data.table(df1)
calc_price <- function(Current_Price,
                       Minimum_Price,
                       price_change_type,
                       price_change_value,
                       min_profit_pounds,
                       min_profit_percentage) {
  # TODO implement your pricing logic here...
  return(Current_Price + 1)
}
update_car_prices <- function(data,
                              filter,
                              price_change_type = c("Percentage", "Pound"),
                              price_change_value = 0,
                              min_profit_pounds = 10,
                              min_profit_percentage = 0.10) {
  stopifnot(is.data.table(data))
  price_change_type <- match.arg(price_change_type) # use the first value if none was provided
  filter_exp <- substitute(filter) # capture the passed filter as an unevaluated expression
  # update the price using a separate function to encapsulate the logic
  data[eval(filter_exp), Current_Price := calc_price(Current_Price,
                                                     Minimum_Price,
                                                     price_change_type,
                                                     price_change_value,
                                                     min_profit_pounds,
                                                     min_profit_percentage)][]
  return(data)
}
The usage is still similar to my data.frame answer, e. g.:
update_car_prices(data, Type == "Car" & Age >= 2 & Age <= 10 & Colour == "Red" & Make %in% c("Ford", "VW"))
update_car_prices(data, Colour == "Red")
update_car_prices(data, Colour == "Red", "Pound", 500)
The differences are:
- The whole data.table (data) is returned so that you can see the impact.
- The original data is changed, since data.tables are passed by reference and I am updating the price "by reference" using the data.table syntax :=.
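The pricing logic left as a TODO above could be filled in along the following lines. This is only one reading of the rules in the question: the profit must be at least the pound amount and at least the given share of the selling price. The function name `calc_price_sketch` and its argument names are hypothetical. It also assumes that a price below the floor is raised to the floor, which the question does not state.

```r
# Hypothetical pricing function; vectorized over current_price / minimum_price
calc_price_sketch <- function(current_price, minimum_price,
                              change_type = c("Percentage", "Pound"),
                              change_value = 0,
                              min_profit_pounds = 10,
                              min_profit_percentage = 0.10) {
  change_type <- match.arg(change_type)

  # Price after the requested change (negative values reduce the price)
  proposed <- if (change_type == "Percentage") {
    current_price * (1 + change_value)
  } else {
    current_price + change_value
  }

  # Lowest price that still leaves the minimum profit in pounds
  floor_pounds <- minimum_price + min_profit_pounds
  # Lowest price at which the profit is the required share of the price
  floor_percent <- minimum_price / (1 - min_profit_percentage)

  # Never go below either floor
  pmax(proposed, floor_pounds, floor_percent)
}

# Rows 1 and 2 of the dummy data with a 10% reduction
new_prices <- calc_price_sketch(c(1050, 1000), c(900, 600), "Percentage", -0.10)
new_prices
```

For the two rows from the question this gives roughly 1000 and 900 (up to floating-point rounding), matching the expected values stated there.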

This answer is based on a data.frame.
Your question addresses multiple aspects (pricing logic, filtering logic and usability).
I am focusing on usability (and ignoring the pricing logic, since it is just an arbitrary detail).
I see at least three options:
1. Use a function with one explicit argument per filter criterion:
get_car_prices1 <- function(data, Input_Type, Input_Min_Age, Input_Max_Age, Input_Colour, Input_Make, Input_Min_Pounds, Input_Min_Percentage)
2. Use a function that accepts an arbitrary number of arguments via ..., so that filtering only requires passing the arguments that are needed:
get_car_prices2 <- function(data, Input_Min_Pounds, Input_Min_Percentage, ...)
3. Use metaprogramming with substitute + eval.
I have decided on option 3 as the best (user-friendly and flexible) option, IMHO:
get_car_prices <- function(data,
                           filter,
                           price_change_type = c("Percentage", "Pound"),
                           price_change_value = 1)
{
  price_change_type <- match.arg(price_change_type) # use the first value if none was provided
  filter_exp <- substitute(filter) # capture the passed filter as an unevaluated expression
  data_subset <- subset(data, eval(filter_exp))
  # TODO add your pricing logic here (e. g. using "ifelse")
  return(data_subset)
}
# Usage examples:
get_car_prices(df1, Colour == "Red")
# Type Age Colour Make Current_Price Minimum_Price
# 1 Car 2 Red Ford 1050 900
# 2 Car 2 Red VW 1000 600
# 4 Van 4 Red VW 995 850
# 6 Van 1 Red Ford 2100 1950
get_car_prices(df1, Type == "Car" & Age >= 2 & Age <= 10 & Colour == "Red" & Make %in% c("Ford", "VW"))
# Type Age Colour Make Current_Price Minimum_Price
# 1 Car 2 Red Ford 1050 900
# 2 Car 2 Red VW 1000 600
get_car_prices(df1, Colour == "Red", "Pound", 500)
# ...
get_car_prices(df1, Colour == "Red", "dumping price", 1)
# Error in match.arg(price_change_type) :
# 'arg' should be one of “Percentage”, “Pound”
# But: The user has to learn at least the expression logic of R and that variables (and values) are case-sensitive:
get_car_prices(df1, Colour == "red")
# [1] Type Age Colour Make Current_Price Minimum_Price
# <0 rows> (or 0-length row.names)
# Error: Assignment operator (=) used instead of comparison operator (==)
get_car_prices(df1, Colour = "Red")
# Error in get_car_prices(df1, Colour = "Red") :
# unused argument (Colour = "Red")
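To change prices in a data.frame rather than only return the subset, the same substitute + eval idea can be used to build a row selector. The following is a sketch, not the answer's code: `change_prices` and `price_fun` are hypothetical names, and the 10% reduction is just an example.

```r
df1 <- data.frame(
  Type = rep(c("Car", "Van"), each = 3),
  Age = c(2, 2, 5, 4, 8, 1),
  Colour = c("Red", "Red", "Yellow", "Red", "Black", "Red"),
  Make = c("Ford", "VW", "VW", "VW", "BMW", "Ford"),
  Current_Price = c(1050, 1000, 1500, 995, 2200, 2100),
  Minimum_Price = c(900, 600, 500, 850, 1900, 1950)
)

# Hypothetical helper: apply price_fun to Current_Price of the rows matching filter
change_prices <- function(data, filter, price_fun) {
  # Evaluate the filter expression with the columns of `data` in scope
  rows <- eval(substitute(filter), data, parent.frame())
  rows <- !is.na(rows) & rows   # rows where the filter is NA are not selected
  data$Current_Price[rows] <- price_fun(data$Current_Price[rows])
  data                          # a modified copy; the caller's object is untouched
}

updated <- change_prices(df1, Type == "Car" & Colour == "Red",
                         function(p) p * 0.9)
updated[, c("Type", "Colour", "Current_Price")]
```

Unlike the data.table version, this returns a modified copy, so the result has to be assigned (here to `updated`) for the change to persist.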

How to combine multiple variable data to a single variable data?

After building my data frame and selecting the variables I want to look at, I face a dilemma. The Excel sheet that serves as my data source was filled in by different people recording the same type of data.
Mock Neg Neg1PCR Neg2PCR NegPBS red Red RedWine water Water white White
1 9 1 1 1 2 18 4 4 4 2 26
As you can see, because the data was entered differently, the major groups (Redwine, Whitewine and Water) have been split into subgroups. How do I combine the subgroups into one group, e.g. red + Red + RedWine -> total wine? I use the phyloseq package for this kind of dataset.

names <- c("red","white","water")
df2 <- setNames(data.frame(matrix(ncol = length(names), nrow = nrow(df))), names)
for (col in names) {
  # drop = FALSE keeps a data frame even if only one column matches
  df2[, col] <- rowSums(df[, grep(col, tolower(names(df))), drop = FALSE])
}
Here,
grep(col, tolower(names(df)))
finds all columns whose lower-cased name contains the string (e.g. "red"). You then sum those columns row-wise into a new data.frame df2 that was created with the right dimensions.

I would just create a new data.frame. This is easiest with dplyr, but also doable with base R.
With dplyr:
newFrame <- oldFrame %>% mutate(Mock = Mock, Neg = Neg + Neg1PCR + Neg2PCR + NegPBS, Red = red + Red + RedWine, Water = water + Water, White = white + White)
With base R (not complete, but you get the point):
newFrame <- data.frame(Red = oldFrame$Red + oldFrame$red + oldFrame$RedWine...)

You can use dplyr::starts_with together with dplyr::select to combine columns. The ignore.case argument of starts_with is TRUE by default, which helps with the data.frame the OP has posted.
library(dplyr)
names <- c("red", "white", "water")
cbind(df[1], t(mapply(function(x)rowSums(select(df, starts_with(x))), names)))
# Mock red white water
# 1 1 24 28 8
Data:
df <- read.table(text =
"Mock Neg Neg1PCR Neg2PCR NegPBS red Red RedWine water Water white White
1 9 1 1 1 2 18 4 4 4 2 26",
header = TRUE, stringsAsFactors = FALSE)
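The same prefix-based grouping can be sketched in base R, this time including the Neg columns. The mapping `groups` (new column name to lower-case prefix) is an assumption about which columns belong together, not something stated in the question.

```r
df <- read.table(text =
"Mock Neg Neg1PCR Neg2PCR NegPBS red Red RedWine water Water white White
1 9 1 1 1 2 18 4 4 4 2 26",
header = TRUE, stringsAsFactors = FALSE)

# Hypothetical mapping: new column name -> lower-case prefix of the old columns
groups <- c(Neg = "neg", Red = "red", White = "white", Water = "water")

totals <- as.data.frame(lapply(groups, function(prefix) {
  hit <- startsWith(tolower(names(df)), prefix)  # case-insensitive prefix match
  rowSums(df[hit])                               # df[hit] stays a data frame
}))

totals <- cbind(df["Mock"], totals)
totals
```

Because lapply keeps the names of `groups`, the result gets the columns Neg, Red, White and Water, and it works for any number of rows.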

R tableGrob heatmap or conditional formating in column

Is there a way to get an effect similar to Excel's conditional formatting -> color scales when presenting a table as a grid.table/tableGrob object? The colour should be red for the lower values and green for the higher values in the column.
That object format is needed so that the table can be laid out in a grid together with plots.
Thank you.

You can do this within tableGrob. You create a vector of colours, and then assign these to the cells.
So using the data from clemens's answer:
library(gridExtra)
library(grid)
# define colour vector
# change `vec` argument of `findInterval` to suit your cut-points
cols <- c("red" ,"orange", "green") [findInterval(my_data$Balance, c(-Inf, 1e4, 2e4, Inf))]
# or
# https://stackoverflow.com/questions/34517031/red-amber-green-sequential-palette-for-treemap-in-r
cols <- colorRampPalette(c("red", "yellow", "green"))(nrow(my_data))[rank(my_data$Balance)]
# create tables individually for each column
# this makes it easy to assign colours to rows
t1 <- tableGrob(my_data["Balance"],
                theme = ttheme_default(
                  core = list(bg_params = list(fill = cols)),
                  colhead = list(bg_params = list(fill = "white", col = "grey90"))),
                rows = NULL)
t2 <- tableGrob(my_data["ID"],
                theme = ttheme_default(
                  core = list(bg_params = list(fill = "white", col = "grey90")),
                  colhead = list(bg_params = list(fill = "white", col = "grey90"))),
                rows = NULL)
# join tables
tab <- gtable_combine(t2, t1)
# grid.newpage() ; grid.draw(tab)
# if you also want to add a black border
# https://stackoverflow.com/questions/31506294/gtable-put-a-black-line-around-all-cells-in-the-table-body
library(gtable)
tab <- gtable::gtable_add_grob(tab,
                               grobs = rectGrob(gp = gpar(fill = NA, lwd = 2)),
                               t = 1, b = nrow(tab), l = 1, r = ncol(tab))
grid.newpage() ; grid.draw(tab)
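How the colour vector is derived from the values can be checked on its own, without drawing anything. The sketch below uses made-up balances (the vector `balance` is not from the question). It also sets `ties.method = "min"` as a precaution: the default "average" gives tied values fractional ranks such as 2.5, which R truncates when they are used as an index.

```r
balance <- c(10, -5, 30, 10)   # made-up values, including a tie

# One colour per row, interpolated from red (lowest) to green (highest)
pal <- colorRampPalette(c("red", "yellow", "green"))(length(balance))

# ties.method = "min" keeps ranks integer-valued, so they are valid indices
cols <- pal[rank(balance, ties.method = "min")]

data.frame(balance, cols)
```

The smallest value gets the first (red) palette entry and the largest gets the last (green) one; the two tied values share the same colour.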

You could use tableHTML for that:
library(tableHTML)
for the dataset:
set.seed(666)
my_data <- data.frame(ID = 101:117,
Balance = sample(-1000:60000, 17))
ID Balance
1 101 46237
2 102 11030
3 103 58657
4 104 11280
5 105 21034
6 106 44296
7 107 58697
8 108 29381
9 109 -188
10 110 14854
11 111 46322
12 112 -2
13 113 4839
14 114 7670
15 115 11875
16 116 48475
17 117 1228
You can then create an HTML table using the tableHTML() function and apply a colour rank with the RAG theme to the 2nd column of the table:
my_data %>%
  tableHTML(rownames = FALSE,
            widths = c(50, 100)) %>%
  add_css_conditional_column(columns = 2,
                             colour_rank_theme = 'RAG',
                             decreasing = TRUE)
(Screenshot of the resulting table not included.)

Perhaps the most natural solution is to use heatmap()?
heatmap(data.matrix(mtcars))
This yields a heatmap with default colour options. You can change the colours with an additional parameter (e.g. col = cm.colors(256)) or your own colour palette to achieve the desired output.

A solution I found is the following. It only works if the data is already sorted, and you have to supply the number of rows yourself (17, judging by your screenshot):
theme = ttheme_default(
  # blues9 has only 9 entries, so interpolate it to 17 shades
  core = list(bg_params = list(fill = colorRampPalette(blues9)(17))))
Hope that helps. I am also looking for alternatives myself.
