The following is a toy problem that demonstrates my question.
I have a data frame that contains a bunch of employees; for each employee, it has a name, salary, gender and state.
aggregate(salary ~ state, data, FUN = mean) # Returns the average salary per state
aggregate(salary ~ state + gender, data, FUN = mean) # Avg salary per state/gender
What I actually need is a summary of the fraction of the total salary earned by women in each state.
aggregate(salary ~ state + gender, data, FUN = sum)
returns the total salary earned by women (and men) in each state, but what I really need is salary_w / salary_total on a per-state level. I could write a for-loop, but I am wondering whether there is some way to use aggregate to do that.
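One way to stay entirely within aggregate is to call it twice and merge the results; a minimal sketch, assuming the data frame is called data and the gender level is labelled "Women":
tot <- aggregate(salary ~ state, data, sum)                            # total per state
wom <- aggregate(salary ~ state, subset(data, gender == "Women"), sum) # women's total per state
m <- merge(tot, wom, by = "state", suffixes = c(".total", ".w"))
m$ratio <- m$salary.w / m$salary.total                                 # women's share per state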
Another option would be using plyr. ddply() expects a data.frame as input and returns a data.frame as output. The second argument is how you want to split the data frame; the third argument is what to apply to the chunks. Here we use summarise to create a new data.frame from the existing one.
library(plyr)
# Using the sample data d from kohske's reshape2 answer below
> ddply(d, .(state), summarise, ratio = sum(salary[gender == "Women"]) / sum(salary))
state ratio
1 1 0.5789860
2 2 0.4530224
The reshape or reshape2 package would probably help with this.
Here is a sample script:
library(reshape2) # from CRAN
# sample data
d <- data.frame(expand.grid(state = gl(2, 2), gender = gl(2, 1, labels = c("Men", "Women"))),
                salary = runif(8))
d2 <- dcast(d, state ~ gender, sum)
d2$frac <- d2$Women / (d2$Men + d2$Women)
The ave function is good for problems like this.
Data$ratio <- ave(Data$salary, Data$state, Data$gender, FUN = sum) /
              ave(Data$salary, Data$state, FUN = sum)
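To collapse that per-row ratio to a single line per state for the women only, a small follow-up sketch (the gender label "Women" is an assumption):
unique(subset(Data, gender == "Women", select = c(state, ratio)))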
Another solution is to use xtabs and prop.table:
prop.table(xtabs(salary ~ state + gender, data), margin = 1)
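To pull out just the women's share as a named vector, a short usage sketch (again assuming the gender level is labelled "Women"):
tab <- prop.table(xtabs(salary ~ state + gender, data), margin = 1)
tab[, "Women"]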
It's generally not advisable to name your datasets "data", so I will change the problem slightly and name the dataset "dat1".
with(subset(dat1, gender == "Female"),
     aggregate(salary, list(state), sum))$x /  # should return a vector
  with(dat1, aggregate(salary, list(state), sum))$x
# using R's element-wise division of the two per-state vectors
I think you are also using attach and there are good reasons to reconsider that decision, despite what you might read in Crawley.
Since you want the results on a per-state basis, tapply might be what you want.
To illustrate let's generate some arbitrary data to play with:
set.seed(349) # For replication
n <- 20000 # Sample size
gender <- sample(c('M', 'W'), size = n, replace = TRUE) # Random selection of gender
state <- c('AL','AK','AZ','AR','CA','CO','CT','DE','DC','FL','GA','HI',
'ID','IL','IN','IA','KS','KY','LA','ME','MD','MA','MI','MN',
'MS','MO','MT','NE','NV','NH','NJ','NM','NY','NC','ND','OH',
'OK','OR','PA','RI','SC','SD','TN','TX','UT','VT','VA','WA',
'WV','WI','WY') # All US states plus DC
state <- sample(state, size = n, replace = TRUE) # Random selection of the states
state_index <- tapply(state, state) # Just for the data generation part ...
gender_index <- tapply(gender, gender)
# Generate salaries
salary <- runif(length(unique(state)))[state_index] # Make states different
salary <- salary + c(.02, -.02)[gender_index] # Make gender different
salary <- salary + log(50) + rnorm(n) # Add mean and error term
salary <- exp(salary) # The variable of interest
What you asked for, the sum of salaries for the women per state and the sum of total salaries per state:
salary_w <- tapply(salary[gender == 'W'], state[gender == 'W'], sum)
salary_total <- tapply(salary, state, sum)
Or if it is in a data-frame:
salary_w <- with(myData, tapply(salary[gender == 'W'], state[gender == 'W'], sum))
salary_total <- with(myData, tapply(salary, state, sum))
Then the answer is:
> salary_w / salary_total
AK AL AR AZ CA CO CT DC
0.4667424 0.4877013 0.4554831 0.4959573 0.5382478 0.5544388 0.5398104 0.4750799
DE FL GA HI IA ID IL IN
0.4684846 0.5365707 0.5457726 0.4788805 0.5409347 0.4596598 0.4765021 0.4873932
KS KY LA MA MD ME MI MN
0.5228247 0.4955802 0.5604342 0.5249406 0.4890297 0.4939574 0.4882687 0.5611435
MO MS MT NC ND NE NH NJ
0.5090843 0.5342312 0.5492702 0.4928284 0.5180169 0.5696885 0.4519603 0.4673822
NM NV NY OH OK OR PA RI
0.4391634 0.4380065 0.5366625 0.5362918 0.5613301 0.4583937 0.5022793 0.4523672
SC SD TN TX UT VA VT WA
0.4862358 0.4895377 0.5048047 0.4443220 0.4881062 0.4880047 0.5338397 0.5136393
WI WV WY
0.4787588 0.5495602 0.5029816
I have some basic calculations I want to apply to the residuals of a plm model, but I am stuck on how to automate the steps for a lot of data.
Let's assume the input is a data.frame (df) with the following data:
Id Year Population Y X1 X2 X3
country A 2009 977612 212451.009 19482.7995 0.346657979 0.001023221
country A 2010 985332 221431.632 18989.3 0.345142551 0.001015205
country A 2011 998211 219939.296 18277.79286 0.344020453 0.001002106
country A 2012 1010001 218487.503 17916.2765 0.342434314 0.000990409
country B 2009 150291 177665.268 18444.04522 0.330864789 0.001940218
country B 2010 150841 183819.407 18042 0.327563461 0.001933143
country B 2011 152210 183761.566 17817.3515 0.32539255 0.001915756
country B 2012 153105 182825.112 17626.62261 0.321315437 0.001904557
country c 2009 83129 132328.034 17113.64268 0.359525557 0.005862866
country c 2010 83752 137413.878 16872.5 0.357854141 0.005819254
country c 2011 84493 136002.537 16576.17856 0.356479235 0.005768219
country c 2012 84958 133064.911 16443.3057 0.355246122 0.005736648
A model was applied and the residuals are stored:
fixed <- plm(Y ~ X1 + X2 + X3,
data=df, drop.unused.levels = TRUE, index=c("Id", "Year"), model="within")
residuals <- resid(fixed)
In my next step, I want to calculate "weighted averages" of my residuals as
res_t = sum_i (n_it / n_t) * resid_it,
with n_it standing for the population in country i at time t and n_t being the total population at t.
My approach so far is:
First I compute the total population nt for every year:
year_range <- seq(from=2009,to=2012,by=1)
tot_pop = NULL
for (n in year_range)
{
tot_pop[n] = with(df, sum(Population[Year == n]))
}
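As an aside, the same yearly totals come out of a single tapply call; a loop-free sketch:
tot_pop <- tapply(df$Population, df$Year, sum)  # named vector, indexed as tot_pop["2009"], tot_pop["2010"], ...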
Before taking the sum of the "weighted" residuals, my next step would be to automate the calculation of my "new" residuals:
res1 <- df$Population[1]/tot_pop[2009] * residuals[1]
res2 <- df$Population[2]/tot_pop[2010] * residuals[2]
res3 <- df$Population[3]/tot_pop[2011] * residuals[3]
...
res12 <- df$Population[12]/tot_pop[2012] * residuals[12]
Edit: Applying the solution of JTT to my problem, the last step would then be:
year_range1 <- rep(year_range, 3)
df_res <- data.frame(year = year_range1, res=as.vector(res))
aggr_res <- aggregate(df_res$res, list(df_res$year), sum)
colnames(aggr_res) <- c("Year", "Aggregated residual")
Is that correct?
I have tried the lapply function and a double "for-loop" without success. I don't know how to do this. Your help would be appreciated. If my question is unclear, please comment and I will try to improve it.
First, instead of a for-loop, you might want to calculate the total population using the aggregate function, e.g.:
a <- aggregate(df$Population, list(df$Year), sum)
Notice the column names of a (Group.1 and x).
Then you can match the results in a to the data in df using the match() function. It gives the matching row numbers, which can be used to pick the right yearly total for each row of df before dividing and multiplying by the residuals. For example:
res <- df$Population / a$x[match(df$Year, a$Group.1)] * residuals
Now you should have a vector of "new" residuals in object res.
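From there, the per-year aggregation in the question's edit reduces to a one-liner, e.g.:
aggr_res <- tapply(res, df$Year, sum)  # aggregated residual per year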
I hope you can help me with this problem: for my work I have to use R to analyze survey data. The data has a number of columns by which I have to group the data and then do some calculations, e.g. how many men and women work at a certain department, and then the count and percentage for each group: department A has 42 employees, of whom 30 are women and 12 are men; department B has 70 employees, of whom 26 are women and 44 are men.
I currently use the following code to output the data (using ddply):
percentage_median_per_group_multiple_columns <- function(data, column_name, column_name2){
  library(plyr)
  descriptive <- ddply(data, column_name,
                       function(x){
                         percentage_median_per_group(x, column_name)
                         percentage_median_per_group(x, column_name2)
                       })
  print(data.frame(descriptive))
}

## give number, percentage and median per group value in column
percentage_median_per_group <- function(data, column_name3){
  library(plyr)
  descriptive <- ddply(data, column_name3,
                       function(x){
                         c(
                           N <- nrow(x[column_name3]),    # number
                           pct <- (N / nrow(data)) * 100  # percentage
                           # TODO: median
                         )
                       })
  return(descriptive)
}
#calculate
percentage_median_per_group_multiple_columns(users_surveys_full_responses, "department", "gender")
At the moment the output looks like this:
Department Sex N % per sex
A f 30 71,4
m 12 28,6
B f 26 37,1
m 44 62,9
But I want the output to look like this, with the calculations performed and printed at each grouping level:
Department N % per department Sex N % per sex
A 42 37,5 f 30 71,4
m 12 28,6
B 70 62,5 f 26 37,1
m 44 62,9
Does anyone have a suggestion of how I can do that? If possible, it could even be built dynamically so I can group by the variables in multiple columns (e.g. department + sex + type of software + ...), but I would already be happy with the output shown in the example =)
thanks!
EDIT
You can use this to generate example data:
n <- 100
sample_data <- data.frame(department = sample(1:20, n, replace = TRUE),
                          gender = sample(1:2, n, replace = TRUE))
percentage_median_per_group_multiple_columns(sample_data, "department", "gender")
V1 in the output stands for N (number) and V2 for %
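For what it's worth, a minimal plyr sketch of the two-level summary (column names follow sample_data above; it returns the department-level figures repeated on every row rather than the blank-cell layout shown):
library(plyr)
dept <- ddply(sample_data, .(department), summarise,
              N.dept = length(gender),
              pct.dept = 100 * length(gender) / nrow(sample_data))
sex <- ddply(sample_data, .(department, gender), summarise, N.sex = length(gender))
sex <- ddply(sex, .(department), transform, pct.sex = 100 * N.sex / sum(N.sex))
merge(dept, sex, by = "department")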
I am calculating final averages for a course. There are about 500 students, and the grades are organized into a .csv file. Column headers include:
Name, HW1, ..., HW10, Quiz1, ..., Quiz5, Exam1, Exam2, Final
Each is weighted differently, and that shouldn't be an issue to program. However, the lowest 2 HWs and the lowest quiz are dropped for each student. How could I program this in R? Note that the HW/quiz dropped may differ between students (e.g. Student A has HW2, HW5, Quiz2 dropped; Student B has HW4, HW8, Quiz1 dropped).
Here is a simpler solution. The sum_after_drop function takes a vector x, drops the i lowest scores, and sums up the remaining ones. We invoke this function for each row in the dataset. ddply is overkill for this job, but keeps things simple. You should be able to do this with apply as well, except that you would then have to convert the end result to a data frame.
The actual grade calculations can then be carried out on dd2. Note that using the cut function with breaks is a simple way to get letter grades from the total scores.
library(plyr)
nms <- names(dd)  # column names, used to pick out the HW and Quiz columns
sum_after_drop <- function(x, i){
  sum(sort(unlist(x))[-(1:i)])
}
dd2 <- ddply(dd, .(Name), function(d){
  hw <- sum_after_drop(d[, grepl("HW", nms)], 2)   # drop the two lowest HWs
  qz <- sum_after_drop(d[, grepl("Quiz", nms)], 1) # drop the lowest quiz
  data.frame(hw = hw, qz = qz)
})
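Following up on the note about cut above, a sketch of the final grading step; the weights, breaks, and letter labels here are made-up assumptions for illustration:
dd2$total <- 0.4 * dd2$hw + 0.6 * dd2$qz  # hypothetical weighting
dd2$letter <- cut(dd2$total,
                  breaks = c(-Inf, 60, 70, 80, 90, Inf),  # assumed cutoffs
                  labels = c("F", "D", "C", "B", "A"))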
Here's a sketch of how you could approach it using the reshape2 package and base functions.
#sample data
set.seed(734)
dd<-data.frame(
Name=letters[1:20],
HW1=rpois(20,7),
HW2=rpois(20,7),
HW3=rpois(20,7),
Quiz1=rpois(20,15),
Quiz2=rpois(20,15),
Quiz3=rpois(20,15)
)
Now I convert it to long format and split apart the field names
require(reshape2)
mm <- melt(dd, "Name")
mm <- cbind(mm,
            # letters-only pattern so two-digit names like Quiz10 split correctly
            colsplit(gsub("([A-Za-z]+)(\\d+)", "\\1:\\2", mm$variable, perl = TRUE), ":",
                     names = c("type", "number")))
Now I can use by() to get a data.frame for each name and do the rest of the calculations. Here I just drop the lowest homework and the lowest quiz, and I give homework a weight of .2 and quizzes a weight of .8 (assuming all homeworks were worth 15 pts and quizzes 25 pts).
grades <- unclass(by(mm, mm$Name, function(x) {
  hw <- tail(sort(x$value[x$type == "HW"]), -1)
  quiz <- tail(sort(x$value[x$type == "Quiz"]), -1)
  (sum(hw) * .2 + sum(quiz) * .8) / (length(hw) * 15 * .2 + length(quiz) * 25 * .8)
}))
attr(grades, "call") <- NULL  # get rid of crud from by()
grades
Let's check our work. Look at student "c"
Name HW1 HW2 HW3 Quiz1 Quiz2 Quiz3
c 6 9 7 21 20 14
Their grade should be
((9+7)*.2+(21+20)*.8) / ((15+15)*.2 + (25+25)*.8) = 0.7826087
and in fact, we see
grades["c"] == 0.7826087
Here's a solution with dplyr. It ranks the scores by student and type of assignment (i.e. it calculates the rank order of all of student 1's homeworks, and so on), then filters out the lowest 1 (or 2, or whatever). dplyr's syntax is pretty intuitive; you should be able to walk through the code fairly easily.
# Load libraries
library(reshape2)
library(dplyr)
# Sample data
grades <- data.frame(name=c("Sally", "Jim"),
HW1=c(10, 9),
HW2=c(10, 5),
HW3=c(5, 10),
HW4=c(6, 9),
HW5=c(8, 9),
Quiz1=c(9, 5),
Quiz2=c(9, 10),
Quiz3=c(10, 8),
Exam1=c(95, 96))
# Melt into long form
grades.long <- melt(grades, id.vars="name", variable.name="graded.name") %>%
  mutate(graded.type=factor(sub("\\d+","", graded.name)))
grades.long
# Remove the lowest scores for each graded type
grades.filtered <- grades.long %>%
  group_by(name, graded.type) %>%
  mutate(ranked.score=rank(value, ties.method="first")) %>%  # Rank all the scores
  filter((ranked.score > 2 & graded.type=="HW") |   # Ignore the lowest two HWs
         (ranked.score > 1 & graded.type=="Quiz") | # Ignore the lowest quiz
         (graded.type=="Exam"))
grades.filtered
# Calculate the average for each graded type
grade.totals <- grades.filtered %>%
  group_by(name, graded.type) %>%
  summarize(total=mean(value))
grade.totals
# Unmelt, just for fun
final.grades <- dcast(grade.totals, name ~ graded.type, value.var="total")
final.grades
You could technically add the summarize(total=mean(value)) step to the grades.filtered pipeline rather than making a separate grade.totals data frame; I separated them into multiple data frames for didactic reasons.
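Combined into a single pipeline, that would look like this:
grade.totals <- grades.long %>%
  group_by(name, graded.type) %>%
  mutate(ranked.score=rank(value, ties.method="first")) %>%
  filter((ranked.score > 2 & graded.type=="HW") |
         (ranked.score > 1 & graded.type=="Quiz") |
         (graded.type=="Exam")) %>%
  summarize(total=mean(value))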
I have a data set that includes a whole bunch of data about students, including their current school, zipcode of former residence, and a score:
students <- read.table(text = "zip school score
43050 'Hunter' 202.72974236
48227 'NYU' 338.49571519
48227 'NYU' 223.48658339
32566 'CCNY' 310.40666224
78596 'Columbia' 821.59318662
78045 'Columbia' 853.09842034
60651 'Lang' 277.48624384
32566 'Lang' 315.49753763
32566 'Lang' 80.296556533
94941 'LIU' 373.53839238
",header = TRUE,sep = "")
I want a heap of summary data about it, per school. How many students from each school are in the data set, how many unique zipcodes per school, average and cumulative score. I know I can get this by using tapply to create a bunch of tmp frames:
tmp.mean <- data.frame(tapply(students$score, students$school, mean))
tmp.sum <- data.frame(tapply(students$score, students$school, sum))
tmp.unique.zip <- data.frame(tapply(students$zip, students$school, function(x) length(unique(x))))
tmp.count <- data.frame(tapply(students$zip, students$school, function(x) length(x)))
Giving them better column names:
colnames(tmp.unique.zip) <- c("Unique zips")
colnames(tmp.count) <- c("Count")
colnames(tmp.mean) <- c("Mean Score")
colnames(tmp.sum) <- c("Total Score")
And using cbind to tie them all back together again:
school.stats <- cbind(tmp.mean, tmp.sum, tmp.unique.zip, tmp.count)
I think the cleaner way to do this is:
library(plyr)
school.stats <- ddply(students, .(school), summarise,
record.count=length(score),
unique.r.zips=length(unique(zip)),
mean.dist=mean(score),
total.dist=sum(score)
)
The resulting data looks about the same (actually, the ddply approach is cleaner and includes the schools as a column instead of as row names). Two questions: is there a better way to find out how many records are associated with each school? And am I using ddply efficiently here? I'm new to it.
If performance is an issue, you can also use data.table
require(data.table)
tab_s <- data.table(students)
setkey(tab_s,school)
tab_s[,list(total=sum(score),
avg=mean(score),
unique.zips=length(unique(zip)),
records=length(score)),
by="school"]
school total avg unique.zips records
1: Hunter 202.7297 202.7297 1 1
2: NYU 561.9823 280.9911 1 2
3: CCNY 310.4067 310.4067 1 1
4: Columbia 1674.6916 837.3458 2 2
5: Lang 673.2803 224.4268 2 3
6: LIU 373.5384 373.5384 1 1
Comments seem to be in general agreement: this looks good.
I have been working on a file to calculate hospital infection rates. I want to standardise the infection rates to yearly procedure counts. The data are at the URL in the code below, as the set is too big for dput. SSI is the number of surgical site infections (1 = infected, 0 = not infected), Procedure is the type of procedure, and Year has been derived using lubridate.
library(plyr)
fname <- "https://raw.github.com/johnmarquess/some.data/master/hospG.csv"
download.file(fname, destfile='hospG.csv', method='wget')
hospG <- read.csv('hospG.csv')
Inf_table <- ddply(hospG, "Year", summarise,
Infections = sum(SSI == 1),
Procedures = length(Procedure),
PropInf = round(Infections/Procedures * 100 ,2)
)
This gives me the number of infections, procedures, and proportion infected per year for this hospital.
What I would like is an additional column with the standardised proportion infected. The long way to do this outside the inf_table is:
s1 <- sum(Inf_table$Infections)
s2 <- sum(Inf_table$Procedures)
Expected_prop_inf <- Inf_table$Procedures * s1/s2
Is there a way to get ddply to do this? I tried making a function with the calculation to produce Expected_prop_inf, but I did not get very far.
Thanks for any help offered.
It's more difficult with ddply because you are dividing by a number computed outside the grouping. Better to do it with base R:
# base
> with(Inf_table, Procedures*(sum(Infections)/sum(Procedures)))
[1] 17.39184 17.09623 23.00847 20.84065 24.83141 24.83141
rather than with ddply, which is not so natural:
# NB note .(Year) is unique for every row, you might also use rownames
> s1 <- sum(Inf_table$Infections)
> s2 <- sum(Inf_table$Procedures)
> ddply(Inf_table, .(Year), summarise, Procedures*(s1/s2))
Year ..1
1 2001 17.39184
2 2002 17.09623
3 2003 23.00847
4 2004 20.84065
5 2005 24.83141
6 2006 24.83141
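If you do want the expected values attached as a column of Inf_table in a single base R step, a small sketch using transform:
Inf_table <- transform(Inf_table,
                       Expected_prop_inf = Procedures * sum(Infections) / sum(Procedures))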
Here is a solution using data.table to aggregate.
I'm not sure if it's possible to do it in one step.
require("data.table")
fname <- "https://raw.github.com/johnmarquess/some_data/master/hospG.csv"
hospG <- read.csv(fname)
DT <- data.table(hospG)
Inf_table <- DT[, {Infections = sum(SSI == 1)
Procedures = length(Procedure)
PropInf = round(Infections/Procedures * 100 ,2)
list(
Infections = Infections,
Procedures = Procedures,
PropInf = PropInf
)
}, by = Year]
Inf_table[,Expected_prop_inf := list(Procedures * sum(Infections)/sum(Procedures))]
tables()
The added bonus of this approach is that you are not creating another data.table in the second step; a new column of the existing data.table is created by reference. This would be relevant if your datasets were bigger.