How to do colsum and average on the same dataframe

How to do colsum and average on the same dataframe - r

I am trying to add a new row to this dataframe below name total. Now for the columns counts,cost,views are colsums or totals but for average I want to do average and for average I want to do a custom formula .So how can I do that. I did use the janitor library (adorn_totals("row"))but it just does the sum . Below is the sample dataframe:
data.frame(stringsAsFactors=FALSE,
Site = c("Channel1", "Channel2", "Channel3", "Channel4"),
Counts = c(7637587, 19042385, 72019057, 45742745),
Cost = c(199999.993061, 102196.9726, 102574.79, 196174.712132),
Views = c(3007915, 5897235, 14245859, 24727451),
Average = c(2.54, 3.23, 5.05543800482653, 2.21111111111111),
avg_views = c(7.5197875, 14.7430875, 35.6146475, 48.24)
)

I'm not sure if this can help you but that's what I'm using when I want to add a Total row at the end. I also use this with the data.table package.
Code example:
dt <- rbind(dt, data.table(Site = "Total",
Counts = sum(dt[, Counts]),
Cost = sum(dt[, Cost]),
Views = mean(dt[, Views]),
Average = sum(dt[, Average]),
avg_views = paste("Hi OP")))
Output :
Site Counts Cost Views Average avg_views
1: Channel1 7637587 200000.0 3007915 2.540000 7.519787
2: Channel2 19042385 102197.0 5897235 3.230000 14.743087
3: Channel3 72019057 102574.8 14245859 5.055438 35.614647
4: Channel4 45742745 196174.7 24727451 2.211111 48.240000
5: Total 144441774 600946.5 11969615 13.036549 Hi OP
You can apply whatever functions you want. In my code example you have sum() and also mean() but you could use anything.

Here's a base way. It's ok.
DF_summary <- colSums(DF[, -1])
DF_summary[4] <- mean(DF[, 5])
rbind(DF, c('Total',DF_summary))
Site Counts Cost Views Average avg_views
1 Channel1 7637587 199999.993061 3007915 2.54 7.5197875
2 Channel2 19042385 102196.9726 5897235 3.23 14.7430875
3 Channel3 72019057 102574.79 14245859 5.05543800482653 35.6146475
4 Channel4 45742745 196174.712132 24727451 2.21111111111111 48.24
5 Total 144441774 600946.467793 47878460 3.25913727898441 106.1175225

Related

How can I find the optimal price for each time t in R?

I have the price range price <- c(2.5,2.6,2.7,2.8)
and my dataset have several time t. For each time t, I have a corresponding cost c and demand quantity d.
I need to find the optimal price for each time t to maximise my required profit function (p-c)*d.
How can I achieve that?
The sample of mydata looks like this, I have 74 observations in total:
t
c
d
1
0.8
20
2
0.44
34
3
0.54
56
4
0.67
78
5
0.65
35
Here is my code but it reports error, can anybody help me to fix it? Much thanks!
max <-data.frame()
for (i in mydata$t) {
for (p in price) {
profit <- ((p-mydata$c)*mydata$d)
max <- max %>% bind_rows(data.frame(time=mydata$t,
price=p,
cost=mydata$c,
profit = profit
))
}
}
maxvalue <- max %>% group_by(time) %>% max(profit)

Since you did not provide a piece of your data which I could use, this is a bit of a guess, but the idea would be:
dat <- as.data.table(mydata)
# Iterate through each value of t and get the price for which (p-c)*d is the highest
result <- dat[, p[which.max((p-c)*d))], t]

Ok! I did not realize you kept the price outside your table. Then try adding all possibilities to the table first this:
dat <- data.table(t= 1:5,
c= c(0.8,0.44,0.54,0.67,0.65),
d= c(20,34,56,78,35))
# Add all possible prices as an extra column (named p)
# Note that all lines will be repeated accordingly
dat <- dat[, .(p= c(2.5,2.6,2.7,2.8)), (dat)]
# Iterate through each value of t and get the price for which (p-c)*d is the highest
result <- dat[, .(best_price= p[which.max((p-c)*d)]), t]

r data.table filter based on count of rows satisfying a condition

I am learning data.table and got confused at one place. Need help to understand how the below can be achieved. The data I am having, I need to filter out those brands which have sales of 0 in the 1st period OR do not have sales > 0 in atleast 14 periods. I have tried and I think I have achieved the 1st part....however not able to get how I can get the second part of filtering those brands which do not have sales > 0 in atleast 14 periods.
Below is my sample data and code that I have written. Please suggest how I can I achieve the second part?
library(data.table)
#### set the seed value
set.seed(9901)
#### create the sample variables for creating the data
group <- sample(1:7,1200,replace = T)
brn <- sample(1:10,1200,replace = T)
period <- rep(101:116,75)
sales <- sample(0:50,1200,replace = T)
#### create the data.table
df1 <- data.table(cbind(group,brn,period,sales))
#### taking the minimum value by group x brand x period
df1_min <- df1[,.(min1 = min(sales,na.rm = T)),by = c('group','brn','period')][order(group,brn,period)]
#### creating the filter
df1_min$fil1 <- ifelse(df1_min$period == 101 & df1_min$min1 == 0,1,0)
Thank you !!

Assuming that the first restriction applies on the dataset wide minimum period (101), implying that brn/group pairs starting with a 0-sales period greater than 101 are still included.
# 1. brn/group pairs with sales of 0 in the 1st period.
brngroup_zerosales101 = df1[sales == 0 & period == min(period), .(brn, group)]
# 2a. Identify brn/group pairs with <14 positive sale periods
df1[, posSale := ifelse(sales > 0, 1, 0)] # Was the period sale positive?
# 2b. For each brn/group pair, sum posSale and filter posSale < 14
brngroup_sub14 = df1[, .(GroupBrnPosSales = sum(posSale)), by = .(brn, group)][GroupBrnPosSales < 14, .(brn, group)]
# 3. Join the two restrictions
restr = rbindlist(list(brngroup_zerosales101, brngroup_sub14))
df1[, ID := paste(brn, group)] # Create a brn-group ID
restr[, ID := paste(brn, group)] # See above
filtered = df1[!(ID %in% restr[,ID]),]

Calculating grades in r

I am calculating final averages for a course. There are about 500 students, and the grades are organized into a .csv file. Column headers include:
Name, HW1, ..., HW10, Quiz1, ..., Quiz5, Exam1, Exam2, Final
Each is weighted differently, and that shouldn't be an issue programming. However, the lowest 2 HW and the lowest Quiz are dropped for each student. How could I program this in r? Note that the HW/Quiz dropped for each student may be different (i.e. Student A has HW2, HW5, Quiz2 dropped, Student B has HW4, HW8, Quiz1 dropped).

Here is a simpler solution. The sum_after_drop function takes a vector x and drops the i lowest scores and sums up the remaining. We invoke this function for each row in the dataset. ddply is overkill for this job, but keeps things simple. You should be able to do this with apply, except that you will have to convert the end result to a data frame.
The actual grade calculations can then be carried out on dd2. Note that using the cut function with breaks is a simple way to get letter grades from the total scores.
library(plyr)
sum_after_drop <- function(x, i){
sum(sort(x)[-(1:i)])
}
dd2 = ddply(dd, .(Name), function(d){
hw = sum_after_drop(d[,grepl("HW", nms)], 1)
qz = sum_after_drop(d[,grepl("Quiz", nms)], 1)
data.frame(hw = hw, qz = qz)
})

Here's a sketch of how you could approach it using the reshape2 package and base functions.
#sample data
set.seed(734)
dd<-data.frame(
Name=letters[1:20],
HW1=rpois(20,7),
HW2=rpois(20,7),
HW3=rpois(20,7),
Quiz1=rpois(20,15),
Quiz2=rpois(20,15),
Quiz3=rpois(20,15)
)
Now I convert it to long format and split apart the field names
require(reshape2)
mm<-melt(dd, "Name")
mm<-cbind(mm,
colsplit(gsub("(\\w+)(\\d+)","\\1:\\2",mm$variable, perl=T), ":",
names=c("type","number"))
)
Now i can use by() to get a data.frame for each name and do the rest of the calculations. Here i just drop the lowest homework and lowest quiz and i give homework a weight of .2 and quizzes a weight of .8 (assuming all home works were worth 15pts and quizzes 25 pts).
grades<-unclass(by(mm, mm$Name, function(x) {
hw <- tail(sort(x$value[x$type=="HW"]), -1);
quiz <- tail(sort(x$value[x$type=="Quiz"]), -1);
(sum(hw)*.2 + sum(quiz)*.8) / (length(hw)*15*.2+length(quiz)*25*.8)
}))
attr(grades, "call")<-NULL #get rid of crud from by()
grades;
Let's check our work. Look at student "c"
Name HW1 HW2 HW3 Quiz1 Quiz2 Quiz3
c 6 9 7 21 20 14
Their grade should be
((9+7)*.2+(21+20)*.8) / ((15+15)*.2 + (25+25)*.8) = 0.7826087
and in fact, we see
grades["c"] == 0.7826087

Here's a solution with dplyr. It ranks the scores by student and type of assignment (i.e. calculates the rank order of all of student 1's homeworks, etc.), then filters out the lowest 1 (or 2, or whatever). dplyr's syntax is pretty intuitive—you should be able to walk through the code fairly easily.
# Load libraries
library(reshape2)
library(dplyr)
# Sample data
grades <- data.frame(name=c("Sally", "Jim"),
HW1=c(10, 9),
HW2=c(10, 5),
HW3=c(5, 10),
HW4=c(6, 9),
HW5=c(8, 9),
Quiz1=c(9, 5),
Quiz2=c(9, 10),
Quiz3=c(10, 8),
Exam1=c(95, 96))
# Melt into long form
grades.long <- melt(grades, id.vars="name", variable.name="graded.name") %.%
mutate(graded.type=factor(sub("\\d+","", graded.name)))
grades.long
# Remove the lowest scores for each graded type
grades.filtered <- grades.long %.%
group_by(name, graded.type) %.%
mutate(ranked.score=rank(value, ties.method="first")) %.% # Rank all the scores
filter((ranked.score > 2 & graded.type=="HW") | # Ignore the lowest two HWs
(ranked.score > 1 & graded.type=="Quiz") | # Ignore the lowest quiz
(graded.type=="Exam"))
grades.filtered
# Calculate the average for each graded type
grade.totals <- grades.filtered %.%
group_by(name, graded.type) %.%
summarize(total=mean(value))
grade.totals
# Unmelt, just for fun
final.grades <- dcast(grade.totals, name ~ graded.type, value.var="total")
final.grades
You technically could add the summarize(total=mean(value)) to the grades.filtered data frame rather than making a separate grade.totals data frame—I separated them into multiple data frames for didactical reasons.

Calculate marginal totals as a function within a ddply call

I have been working on a file to calculate hospital infection rates. I want to standardise the infection rates to yearly procedure counts. The data are located here because it is too big for dput. SSI is the number of surgical infections(1 = infected, 0=not infected), Procedure is the type of procedure. Year has been derived using lubridate
library(plyr)
fname <- "https://raw.github.com/johnmarquess/some.data/master/hospG.csv"
download.file(fname, destfile='hospG.csv', method='wget')
hospG <- read.csv('hospG.csv')
Inf_table <- ddply(hospG, "Year", summarise,
Infections = sum(SSI == 1),
Procedures = length(Procedure),
PropInf = round(Infections/Procedures * 100 ,2)
)
This gives me the number of infections, procedures, and proportion infected per year for this hospital.
What I would like is an additional column with the standardised proportion infected. The long way to do this outside the inf_table is:
s1 <- sum(Inf_table$Infections)
s2 <- sum(Inf_table$Procedures)
Expected_prop_inf <- Inf_table$Procedures * s1/s2
Is there a way to get ddply to do this. I tied making a function with the calculation to produce Expected_prop_inf but I did not get very far.
Thanks for any help offered.

It's more difficult with ddply because you are dividing by a number outside the grouping . Better to do it with base R.
# base
> with(Inf_table, Procedures*(sum(Infections)/sum(Procedures)))
[1] 17.39184 17.09623 23.00847 20.84065 24.83141 24.83141
rather than with ddply which is not so natural:
# NB note .(Year) is unique for every row, you might also use rownames
> s1 <- sum(Inf_table$Infections)
> s2 <- sum(Inf_table$Procedures)
> ddply(Inf_table, .(Year), summarise, Procedures*(s1/s2))
Year ..1
1 2001 17.39184
2 2002 17.09623
3 2003 23.00847
4 2004 20.84065
5 2005 24.83141
6 2006 24.83141

Here is a solution to aggregate using data.table.
I'm not sure if it's posible to do it in one step.
require("data.table")
fname <- "https://raw.github.com/johnmarquess/some_data/master/hospG.csv"
hospG <- read.csv(fname)
Inf_table <- DT[, {Infections = sum(SSI == 1)
Procedures = length(Procedure)
PropInf = round(Infections/Procedures * 100 ,2)
list(
Infections = Infections,
Procedures = Procedures,
PropInf = PropInf
)
}, by = Year]
Inf_table[,Expected_prop_inf := list(Procedures * sum(Infections)/sum(Procedures))]
tables()
The added bonus of this approach is that you are not creating another data.table in the second step, a new column of the data.table is created. This would be relevant in case your datasets are bigger.

Is there a quick way to get the mean of various "bins" of a column in a dataframe?

I apologize if this has been answered - I just can't find it! To simplify, I have a dataframe of cars with 2 pertinent columns: mileage and price. I want to calculate the mean price and number of cars for 0-20,000 miles, 20,000-40,000, and so on (in 20,000 mile "bins"). I have been making subsets of data for the various mileage ranges and then looking at the mean and number or vehicles for that subset. I'm wondering if there is a more efficient way to do this, instead of making all of these subsets - I'm doing it many times over with various "bins" and data. I would love to learn a slicker way of doing this.
Thanks!!

You probably want smth along these lines:
library(data.table)
d = data.table(mileage = runif(1000, 0, 100000), price = runif(1000, 15000, 35000))
d[, list(price = mean(price), number = .N),
by = cut(mileage, c(0, 20000, 25000, 30000, 100000))][order(cut)]
# cut price number
# 1: (0,2e+04] 25252.70 215
# 2: (2e+04,2.5e+04] 25497.66 46
# 3: (2.5e+04,3e+04] 25349.79 45
# 4: (3e+04,1e+05] 25037.93 694

This shows how to use aggregate to return more than one statistic by category in a single run.
# Using Quentin's data
d[['mileage.cat']] <- cut(d$mileage, breaks=seq(0, 200000, by= 20000))
aggregate(d$price, d['mileage.cat'] ,
FUN=function(price) c(counts=length(price),
mean.price=mean(price) ) )
mileage.cat x.counts x.mean.price
1 (0,2e+04] 212.00 24859.01
2 (2e+04,4e+04] 194.00 24343.16
3 (4e+04,6e+04] 196.00 24357.73
4 (6e+04,8e+04] 191.00 25006.71
5 (8e+04,1e+05] 207.00 25250.23

To make the bins, "cut". Example:
x=1:10
bkpt=c(0,2.5,7.5,10)
x.cut=cut(x,breaks=bkpt)
which finishes up like this. y is some data for later:
y=21:30
data.frame(x,x.cut,y)
To calculate something for each group, use tapply. Following my example:
tapply(y,x.cut,length)
tapply(y,x.cut,mean)
which calculates (a) the number of y's and (b) the mean of y's in each of the groups defined by x.cut.

Another approach using aggregate:
df <- data.frame(mil = sample(1e5,20),price = sample(1000,20) )
#mil2 is our "mile bin" ( 0 -> [0:20000[; 1 -> [20000:40000[ ...)
df$mil2 = trunc(df$mil /20000)
# then to get the mean by "mile bin":
aggregate(price ~ mil2,df,mean)
# or the number:
aggregate(price ~ mil2,df,length)
# or simply:
table(df$mil2)

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

How to do colsum and average on the same dataframe - r

Related

How can I find the optimal price for each time t in R?

r data.table filter based on count of rows satisfying a condition

Calculating grades in r

Calculate marginal totals as a function within a ddply call

Is there a quick way to get the mean of various "bins" of a column in a dataframe?

Categories

Resources