Summary statistics of retail prices grouped by categorical data - R

I need some help writing a function that takes three categorical inputs and returns a vector of summary statistics based on these inputs.
The data set contains information on retail goods that can be specified by their retail segment, brand name, and type of good along with its retail price and what it actually sold for.
Now I need to write a function that will take these inputs and compute the average, the count, and whatever else is needed.
I have set the function up as follows (using made up data):
dataold = data.frame(segment=c("golf","tenis","football","tenis","golf","golf"),
brand=c("x","y","z","y","x","a"),
type=c("iron","ball","helmet","shoe","driver","iron"),
retail=c(124,.60,80,75,150,108),
actual=c(112,.60,72,75,135,100))
retailsum = funtion(segment,brand,type){
datanew = dataold[which(dataold$segment='segment' &
dataold$brand='brand' &
dataold$type='type'),c("retail","actaul")]
summary = c(dim(datanew)[1],colMeans(datanew))
return(summary)
}
The code inside the function braces works on its own, but once I wrap a function around it I start getting errors or it will just return 0 counts and NaN for the means.
Any help would be greatly appreciated. I have very little experience in R, so I apologize if this is a trivial question, but I have not been able to find a solution.

There are rather a lot of errors in your code, including:
misspelling of function
using single = (assignment) rather than == (equality test)
mistype of actual
hardcoding of segment, brand and type in your function, rather than referencing the arguments.
Here is how your function could look after fixing those errors, so that it produces valid results:
retailsum <- function(data, segment, brand, type, FUN = colMeans){
    x <- data[data$segment == segment & data$brand == brand & data$type == type,
              c("retail", "actual")]
    match.fun(FUN)(x)
}
retailsum(dataold, "golf", "x", "iron", colMeans)
retail actual
   124    112
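Because the summary function is passed as an argument, you can swap in other statistics as well; for example, to get the count you mentioned (a small sketch):
retailsum(dataold, "golf", "x", "iron", nrow)
[1] 1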
And here is a (possibly much more flexible) solution using the plyr package. This calculates your function for all combinations of segment, brand and type:
library(plyr)
ddply(dataold, .(segment, brand, type), numcolwise(mean))
segment brand type retail actual
1 football z helmet 80.0 72.0
2 golf a iron 108.0 100.0
3 golf x driver 150.0 135.0
4 golf x iron 124.0 112.0
5 tenis y ball 0.6 0.6
6 tenis y shoe 75.0 75.0
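For comparison, base R's aggregate() can produce the same all-combinations table without any extra packages, via the formula interface:
aggregate(cbind(retail, actual) ~ segment + brand + type,
          data = dataold, FUN = mean)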

Andrie's solution is pretty complete already. (ddply is cool! I didn't know about that function...)
Just one addition, though: if you want to compute the summary values over all possible combinations, you can do this as a one-liner using R's built-in function by:
by(dataold, list(dataold$segment, dataold$brand, dataold$type),
function(x) summary(x[,c('retail', 'actual')])
)
That is not strictly what you asked for, but may still be instructive.

Walk a CHAID tree in R - need to sort by number of instances

I have a number of trees; when printed, they are 7 pages long. I've had to rebalance the data and need to look at the branches with the highest frequency to see if they make sense - I need to identify a cancellation rate for different clusters.
Given the output is so long, what I need is to get the biggest branches first, so that I can validate those rather than go through 210 branches manually. I will have lots of trees, so I need to automate this to look at the important results.
Example code to use:
library(CHAID)
updatecars<-mtcars
updatecars$cyl<-as.factor(updatecars$cyl)
updatecars$vs<-as.factor(updatecars$vs)
updatecars$am<-as.factor(updatecars$am)
updatecars$gear<-as.factor(updatecars$gear)
carsChaid<-chaid(am~ cyl+vs+gear, data=updatecars)
plot(carsChaid)
carsChaid
When you print this data, you see n=15 for the first group. I need a table where I can sort on this value.
What I need is a decision-tree table with the variable values and the number of observations within each group of the tree. This is not exactly the same as this answer Walk a tree,
as that doesn't give the number within each group, but I think it's in the right direction.
Can someone help?
Thanks,
James
I'm sure there is a better way to do this, but this works. Obviously willing to have corrections and improvements suggested.
The particular trouble I had was creating the list of all combinations. When expand.grid went over 3 factors it stopped working for me, so I had to build a loop on top of it to create the complete list.
All_canx_rates <- function(Var1, Var2, Var3, Var4, Var5, nametree) {
  df1 <- data.frame("CanxRate" = 0, "Num_Canx" = 0, "Num_Cust" = 0)
  # capture the unevaluated arguments so the columns can be passed unquoted
  pars <- as.list(match.call()[-1])
  a <- eval(pars$nametree)[, as.character(pars$Var1)]
  b <- eval(pars$nametree)[, as.character(pars$Var2)]
  c <- eval(pars$nametree)[, as.character(pars$Var3)]
  d <- eval(pars$nametree)[, as.character(pars$Var4)]
  e <- eval(pars$nametree)[, as.character(pars$Var5)]
  # build all combinations of the first three factors, then extend by Var4
  allcombos <- expand.grid(levels(a), levels(b), levels(c))
  clean <- allcombos
  allcombos$Var4 <- levels(d)[1]  # start with the first level of Var4
  for (i in 2:length(levels(d))) {
    clean$Var4 <- levels(d)[i]
    allcombos <- rbind(allcombos, clean)
  }
  # count cancellations (e == '1') and customers for each combination
  for (i in 1:nrow(allcombos)) {
    f1 <- allcombos[i, 1]
    f2 <- allcombos[i, 2]
    f3 <- allcombos[i, 3]
    f4 <- allcombos[i, 4]
    y5 <- nrow(nametree[(a %in% f1 & b %in% f2 & c %in% f3 & d %in% f4 &
                           e == '1'), ])
    y4 <- nrow(nametree[(a %in% f1 & b %in% f2 & c %in% f3 & d %in% f4), ])
    df2 <- data.frame("CanxRate" = y5 / y4, "Num_Canx" = y5, "Num_Cust" = y4)
    df1 <- rbind(df1, df2)
  }
  # drop the dummy first row and make the result available globally
  df1 <- df1[-1, ]
  output <<- cbind(allcombos, df1)
}
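For the mtcars CHAID example above, a hypothetical call could look like this (it reuses the outcome am for both Var4 and Var5 purely to satisfy the five-column signature; am is coded "0"/"1", so e == '1' counts the positive class):
All_canx_rates(cyl, vs, gear, am, am, updatecars)
# sort the resulting table by group size to see the biggest branches first
output[order(-output$Num_Cust), ]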
You can use data.tree to do further operations on a party object, like sorting, walking the tree, custom plotting, etc. The latest release, v0.3.7 on GitHub, has a conversion from party class objects:
devtools::install_github("gluc/data.tree#v0.3.7")
library(data.tree)
tree <- as.Node(carsChaid)
tree$fieldsAll
The last command shows the names of the converted fields of the party class:
[1] "data" "fitted" "nodeinfo" "partyinfo" "split" "splitlevels" "splitname" "terms" "splitLevel"
You can sort by a function, e.g. the rows of the data on each node:
tree$Sort(attribute = function(node) nrow(node$data), decreasing = TRUE)
print(tree,
"splitname",
count = function(node) nrow(node$data),
"splitLevel")
This prints, for instance, like so:
levelName splitname count splitLevel
1 1 gear 32
2 ¦--3 17 4, 5
3 °--2 15 3
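If you would rather have those counts in a sortable data.frame than a printed tree, data.tree can also convert the tree to a table; a minimal sketch (ToDataFrameTree accepts the same attribute arguments as print):
df <- ToDataFrameTree(tree, "splitname",
                      count = function(node) nrow(node$data))
df[order(-df$count), ]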

Table of average score of peer per percentile

I'm quite a newbie in R, so I'm interested in the optimality of my solution. Even though it works, it could be (a bit) long, and I wanted your advice on whether the way I solved it is the best; it could help me learn new techniques and functions in R.
I have a dataset on students identified by their id, together with the school they are matched to and the score they obtained on a specific test (so, for short, 3 variables: id, match and score).
I need to construct the following table: for students between two percentiles of score, I need to calculate the average (across students) of the average score of the school they are matched to. That is, for each school I take the average score of the students matched to it, and then I average these school averages within each percentile class (yes, a school's average can appear twice in this calculation). In English, it allows me to answer: "A student belonging to the x-th percentile in terms of score will on average be matched to a school of this average quality."
Here is an example (shown as a picture in the original post; it is the same five students as in the code below, matched to schools a, b, a, b, c with scores 18, 4, 15, 8, 24):
So in that case, if I take the median (15) for the split (rather than percentiles) I would like to obtain:
[0,15] : 9.5
(15,24] : 20.25
So for students having a score between 0 and 15, I take the average of the average scores of the schools they are matched to (note that b's average appears twice, but that's OK).
Here is how I did it:
match <- c("a","b","a","b","c")
score <- c(18,4,15,8,24)
scoreQuant <- cut(score,quantile(score,probs=seq(0,1,0.1),na.rm=TRUE))
AvgeSchScore <- tapply(score,match,mean,na.rm=TRUE)
AvgScore <- 0
for(i in 1:length(score)) {
AvgScore[i] <- AvgeSchScore[match[i]]
}
results <- tapply(AvgScore,scoreQuant,mean,na.rm = TRUE)
Do you have a more direct way of doing it? I think the weak point is step 3 (using a loop); maybe apply() is better? But I'm not sure how to use it here (I tried to code my own function but it crashed, so I brute-forced it).
Thanks :)
The main fix is to eliminate the for loop with:
AvgScore <- AvgeSchScore[match]
R allows you to subset in ways that you cannot in other languages. The tapply function names its output after the levels of the factor you grouped by; we use match to index into AvgeSchScore by those names.
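Putting it together, the whole computation without the loop, using the question's own objects:
AvgeSchScore <- tapply(score, match, mean, na.rm = TRUE)  # average per school
AvgScore <- AvgeSchScore[match]                           # one value per student
results <- tapply(AvgScore, scoreQuant, mean, na.rm = TRUE)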
data.table
If you would like to try data.table you may see speed improvements.
library(data.table)
match <- c("a","b","a","b","c")
score <- c(18,4,15,8,24)
dt <- data.table(id=1:5, match, score)
scoreQuant <- cut(dt$score,quantile(dt$score,probs=seq(0,1,0.1),na.rm=TRUE))
dt[, AvgeScore := mean(score), match][, mean(AvgeScore), scoreQuant]
# scoreQuant V1
#1: (17.4,19.2] 16.5
#2: NA 6.0
#3: (12.2,15] 16.5
#4: (7.2,9.4] 6.0
#5: (21.6,24] 24.0
It may be faster than base R. If the value in the NA row bothers you, you can delete it afterwards, as sketched below.
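For instance, a sketch that assigns the chained result first and then filters out the NA group:
res <- dt[, AvgeScore := mean(score), match][, mean(AvgeScore), scoreQuant]
res[!is.na(scoreQuant)]   # drop the NA row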

Julia DataFrames: Problems with Split-Apply-Combine strategy

I have some data (from an R course assignment, but that doesn't matter) to which I want to apply a split-apply-combine strategy, but I'm having some problems. The data is in a DataFrame called outcome, and each line represents a hospital. Each column holds a piece of information about that hospital, like name, location, rates, etc.
My objective is to obtain the Hospital with the lowest "Mortality by Heart Attack Rate" of each State.
I was playing around with some strategies, and got a problem using the by function:
best_heart_rate(df) = sort(df, cols = :Mortality)[end,:]
best_hospitals = by(hospitals, :State, best_heart_rate)
The idea was to split the hospitals DataFrame by State, sort each of the SubDataFrames by Mortality rate, get the lowest one, and combine the lines into a new DataFrame.
But when I used this strategy, I got:
ERROR: no method nrow(SubDataFrame{Array{Int64,1}})
in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:311
in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:296
in f at none:1
in based_on at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:144
in by at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:202
I suppose the nrow function is not implemented for SubDataFrames, hence the error. So I used an uglier piece of code:
best_heart_rate(df) = (df[sortperm(df[:,:Mortality] , rev=true), :])[1,:]
best_hospitals = by(hospitals, :State, best_heart_rate)
It seems to work. But now there is an NA problem: how can I remove the rows from the SubDataFrames that have NA in the Mortality column? Is there a better strategy to accomplish my objective?
I think this might work, if I've understood you correctly:
# Let me make up some data about hospitals in states
hospitals = DataFrame(State=sample(["CA", "MA", "PA"], 10), mortality=rand(10), hospital=split("abcdefghij", ""))
hospitals[3, :mortality] = NA
# You can use the indmax function to find the index of the maximum element
by(hospitals[complete_cases(hospitals), :], :State, df -> df[indmax(df[:mortality]), [:mortality, :hospital]])
State mortality hospital
1 CA 0.9469632421111882 j
2 MA 0.7137144590022733 f
3 PA 0.8811901895164764 e

In R and ddply, is it possible to avoid enumerating all columns I need when using ddply?

Other posts suggested that ddply is a good workhorse.
I am trying to learn the xxply functions and I cannot solve this problem.
This is my data:
library(reshape2)  # the tips data set ships with reshape2
(df = tips[1:5,])
  total_bill  tip    sex smoker day   time size
1      16.99 1.01 Female     No Sun Dinner    2
2      10.34 1.66   Male     No Sun Dinner    3
3      21.01 3.50   Male     No Sun Dinner    3
4      23.68 3.31   Male     No Sun Dinner    2
5      24.59 3.61 Female     No Sun Dinner    4
and I need to do something like this:
ddply(df
      , .(<do I have to enumerate all the columns I need to operate on here?>)
      , function(x) {if size>=3 return(size) else return(total_bill+tip)}
)
(The example is a fake problem (it does not make real-life sense) and only demonstrates my issue with larger data.)
I could not get the ddply code right reading just the help files. Any advice appreciated. Or even a good ddply tutorial?
I like that with ddply I can just pass my data frame as input, but in the second argument it is not nice that I am forced to enumerate all the columns that I need later. Is there a way to pass the whole row (all columns)?
I like defining the function on the fly, but I am not sure how to make my pseudocode correct in R (my last argument).
Based on your code, it doesn't look like you need plyr here at all. It seems to me you are calculating a new variable for each row of the data.frame. If that's the case, then just use some base R functions:
df <- transform(df, newval = ifelse(size >= 3, size, total_bill + tip))
df
total_bill tip sex smoker day time size newval
1 16.99 1.01 Female No Sun Dinner 2 18.00
2 10.34 1.66 Male No Sun Dinner 3 3.00
3 21.01 3.50 Male No Sun Dinner 3 3.00
4 23.68 3.31 Male No Sun Dinner 2 26.99
5 24.59 3.61 Female No Sun Dinner 4 4.00
Sorry if I misunderstood what you are doing. If you do in fact need to pass the entire row of a data.frame into plyr with no grouping variable, perhaps you can treat it as an array with margin = 1, i.e. adply(df, 1, ...).
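A minimal sketch of that adply idea (newval is just an illustrative column name):
library(plyr)
# margin = 1 hands each row of df to the function as a one-row data.frame,
# so no grouping columns have to be enumerated
adply(df, 1, transform,
      newval = ifelse(size >= 3, size, total_bill + tip))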
A great introduction to plyr is here: www.jstatsoft.org/v40/i01/paper
The second argument is the "splitting" variable. So in your sample data set, if you're looking at the difference in spending habits between the sexes, you would supply .(sex); if you want all possibilities of your categorical variables, then yes, you would have to supply them all: .(sex, smoker, day, time).
On a separate note, when using ddply your function should take a data.frame and return a data.frame; currently it returns a vector. Also, if is not vectorized, so you should use ifelse:
ddply(df, .(sex), function(x) {
x$new.var <- ifelse(x$size >= 3, x$size, x$total_bill + x$tip)
return(x)
})
If you don't specify the return value, R returns the last thing calculated, which here is a vector.
My only other suggestion is to keep playing with plyr. Eventually it will click and you'll love it!
I don't know if this is still useful, and I am not sure whether it is the best approach, but I usually solve tasks similar to yours as follows, grouping by every column so that the whole row is available inside the function (each group is then a single row, so a plain if works):
ddply(df
      , as.quoted(colnames(df))
      , function(x) if (x$size >= 3) x$size else x$total_bill + x$tip
)

Good ways to code complex tabulations in R?

Does anyone have any good thoughts on how to code complex tabulations in R?
I am afraid I might be a little vague on this, but I want to set up a script to create a bunch of tables of a complexity analogous to the Statistical Abstract of the United States,
e.g.: http://www.census.gov/compendia/statab/tables/09s0015.pdf
and I would like to avoid a whole bunch of rbind and cbind statements.
In SAS, I have heard, there is a table-creation specification language; I was wondering if there is something of similar power for R?
Thanks!
It looks like you want to apply a number of different calculations to some data, grouping it by one field (in the example, by state)?
There are many ways to do this. See this related question.
You could use Hadley Wickham's reshape package (see the reshape homepage). For instance, if you wanted the mean, sum, and count functions applied to some data grouped by a value (the grouping is meaningless here, but it uses the built-in airquality data):
> library(reshape)
> names(airquality) <- tolower(names(airquality))
> # melt the data to just include month and temp
> aqm <- melt(airquality, id="month", measure="temp", na.rm=TRUE)
> # cast by month with the various relevant functions
> cast(aqm, month ~ ., function(x) c(mean(x),sum(x),length(x)))
month X1 X2 X3
1 5 66 2032 31
2 6 79 2373 30
3 7 84 2601 31
4 8 84 2603 31
5 9 77 2307 30
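If the aggregation function returns a named vector, cast should use those names instead of X1, X2, X3; a small variation on the same call:
> cast(aqm, month ~ .,
+      function(x) c(mean = mean(x), sum = sum(x), n = length(x)))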
Or you can use the by() function, where the index represents the states. In your case, rather than applying one function (e.g. mean), you can apply your own function that does multiple things (depending upon your needs), for instance function(x) { c(mean(x), length(x)) }, and then run do.call("rbind", ...) (for instance) on the output, as sketched below.
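A minimal sketch of that by()/do.call() pattern, reusing the lower-cased airquality data from above:
res <- by(airquality$temp, airquality$month,
          function(x) c(mean = mean(x), n = length(x)))
do.call(rbind, res)   # one row per month, columns mean and n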
Also, you might give some consideration to using a reporting package such as Sweave (with xtable) or Jeffrey Horner's brew package. There is a great post on the learnr blog about creating repetitive reports that shows how to use it.
Another option is the plyr package.
library(plyr)
names(airquality) <- tolower(names(airquality))
ddply(airquality, "month", function(x){
with(x, c(meantemp = mean(temp), maxtemp = max(temp), nonsense = max(temp) - min(solar.r)))
})
Here is an interesting blog posting on this topic. The author tries to create a report analogous to the United Nation's World Population Prospects: The 2008 Revision report.
Hope that helps,
Charlie
