I hope you can help me with this problem. For my work I have to use R to analyze survey data. The data has a number of columns by which I want to group the data and then do some calculations, e.g. how many men and women work in a certain department? I then want the number and percentage for each group. For example: 42 people work in department A, of whom 30 are women and 12 are men; 70 people work in department B, of whom 26 are women and 44 are men.
I currently use the following code to output the data (using ddply):
percentage_median_per_group_multiple_columns <- function(data, column_name, column_name2){
library(plyr)
descriptive <- ddply( data, column_name,
function(x){
percentage_median_per_group(x, column_name)
percentage_median_per_group(x, column_name2)
}
)
print(data.frame(descriptive))
}
## give number, percentage and median per group_value in column
percentage_median_per_group <- function(data, column_name3){
library(plyr)
descriptive <- ddply( data, column_name3,
function(x){
c(
N <- nrow(x[column_name3]), #number
pct <- (N/nrow(data))*100 #percentage
#TODO: median
)
}
)
return(descriptive)
}
#calculate
percentage_median_per_group_multiple_columns(users_surveys_full_responses, "department", "gender")
Now the data outputs like this:
Department Sex N % per sex
A f 30 71,4
m 12 28,6
B f 26 37,1
m 44 62,9
But, I want the output to look like this, so calculations take place and are printed in each substep:
Department N % per department Sex N % per sex
A 42 37,5 f 30 71,4
m 12 28,6
B 70 62,5 f 26 37,1
m 44 62,9
Does anyone have a suggestion for how I can do that? If possible, I would like it to be dynamic so that I can group by the variables in multiple columns (e.g. department + sex + type of software + ...), but I would already be happy to get the output shown in the example =)
Thanks!
EDIT
You can use this to generate example data:
n=100
sample_data = data.frame(department=sample(1:20,n,replace=TRUE), gender=sample(1:2,n,replace=TRUE))
percentage_median_per_group_multiple_columns(sample_data, "department", "gender")
V1 in the output stands for N (number) and V2 for %
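One possible way to get the nested layout is sketched below in base R. It is only a sketch under assumptions: the data frame is rebuilt from the counts quoted in the question (30/12 in department A, 26/44 in B), and the column names N_dept, pct_dept, N_gender and pct_gender are made up for illustration.

```r
# Toy data rebuilt from the counts in the question (an assumption, not the real survey)
sample_data <- data.frame(
  department = rep(c("A", "B"), times = c(42, 70)),
  gender     = c(rep(c("f", "m"), times = c(30, 12)),
                 rep(c("f", "m"), times = c(26, 44)))
)

# Level 1: number and percentage per department
dept <- as.data.frame(table(department = sample_data$department),
                      responseName = "N_dept")
dept$pct_dept <- 100 * dept$N_dept / sum(dept$N_dept)

# Level 2: number per department and gender
sub <- as.data.frame(table(department = sample_data$department,
                           gender     = sample_data$gender),
                     responseName = "N_gender")

# Combine both levels; the gender percentage is relative to its department
out <- merge(dept, sub, by = "department")
out$pct_gender <- 100 * out$N_gender / out$N_dept
print(out)
```

The department totals are repeated on every gender row rather than printed once, which keeps the result a regular data frame. A further grouping column could be handled the same way by adding it to the second table() call and merging again.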
Related
I have a range of candidate prices, price <- c(2.5,2.6,2.7,2.8),
and my dataset has several time points t. For each time t, I have a corresponding cost c and demand quantity d.
I need to find the optimal price for each time t that maximises my profit function (p-c)*d.
How can I achieve that?
A sample of mydata looks like this (I have 74 observations in total):
t   c     d
1   0.8   20
2   0.44  34
3   0.54  56
4   0.67  78
5   0.65  35
Here is my code, but it reports an error. Can anybody help me fix it? Many thanks!
max <-data.frame()
for (i in mydata$t) {
for (p in price) {
profit <- ((p-mydata$c)*mydata$d)
max <- max %>% bind_rows(data.frame(time=mydata$t,
price=p,
cost=mydata$c,
profit = profit
))
}
}
maxvalue <- max %>% group_by(time) %>% max(profit)
Since you did not provide a piece of your data that I could use, this is a bit of a guess, but the idea would be:
library(data.table)
dat <- as.data.table(mydata)
# Iterate through each value of t and get the price for which (p-c)*d is the highest
# (this assumes the price is stored in a column p of mydata)
result <- dat[, p[which.max((p-c)*d)], t]
OK! I did not realize you kept the price outside your table. In that case, try adding all the possible prices to the table first, like this:
library(data.table)
dat <- data.table(t= 1:5,
c= c(0.8,0.44,0.54,0.67,0.65),
d= c(20,34,56,78,35))
# Add all possible prices as an extra column (named p)
# Grouping by every existing column repeats each row once per price
dat <- dat[, .(p= c(2.5,2.6,2.7,2.8)), by = .(t, c, d)]
# Iterate through each value of t and get the price for which (p-c)*d is the highest
result <- dat[, .(best_price= p[which.max((p-c)*d)]), t]
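For comparison, the same search can be sketched in base R without expanding the table. This is only an illustration using the five sample rows from the question; the names profit, best and max_profit are made up here.

```r
price  <- c(2.5, 2.6, 2.7, 2.8)
mydata <- data.frame(t = 1:5,
                     c = c(0.8, 0.44, 0.54, 0.67, 0.65),
                     d = c(20, 34, 56, 78, 35))

# Profit matrix: one row per time point, one column per candidate price
rows   <- seq_len(nrow(mydata))
profit <- outer(rows, price,
                function(i, p) (p - mydata$c[i]) * mydata$d[i])

# Column index of the largest profit in each row
best <- max.col(profit, ties.method = "first")

result <- data.frame(t          = mydata$t,
                     best_price = price[best],
                     max_profit = profit[cbind(rows, best)])
print(result)
```

Note that with a fixed demand d > 0 the profit (p-c)*d always grows with p, so the highest candidate price wins for every t in this sample. The search only becomes interesting once demand depends on the price.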
I have a weighted survey dataset that involves age groups, incomes and expenditure. I want to find the average of spending within age groups and within income deciles.
So for example
DF:
Age Income Spending1 Spending2 Weight
45-49 1000 50 35 100
30-39 2000 40 60 150
40-44 3434 30 55 120
Currently I have coded this:
DF$hhdecile<-weighted_ntile(DF$Income, weights=DF$Weight, 5)
Result1<- DF %>% group_by(Age,hhdecile) %>% dplyr::summarise(mean.exp = weighted.mean(x = Spending1, w = Weight))
Result2<- DF %>% group_by(Age,hhdecile) %>% dplyr::summarise(mean.exp = weighted.mean(x = Spending2, w = Weight))
df.list <- list(Result1=Result1,
Result2=Result2)
names(df.list$Result1)[names(df.list$Result1)=="mean.exp"] <- "Result1"
ResultJoined <- df.list %>% reduce(full_join, by=c('Age','hhdecile'))
That finds the quintile of people compared to the population of all ages, and I'm interested in their quintile compared to their age group.
Is there a way to use group_by or similar to perform the weighted percentile function on each age group individually?
(there are actually 15 categories of spending)
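One way to approach this is to compute the weighted bins separately inside each age group. The sketch below uses base R and a hypothetical helper, weighted_bin, as a simple stand-in for weighted_ntile; it may treat ties and bin boundaries differently from the packaged function. The toy data and the column name age_group_bin are also made up for illustration.

```r
# Hypothetical helper: assign each row to one of n groups by cumulative weight share
weighted_bin <- function(x, w, n) {
  ord <- order(x)
  cum_share <- cumsum(w[ord]) / sum(w)   # share of total weight up to each sorted row
  bins <- integer(length(x))
  bins[ord] <- as.integer(pmin(n, ceiling(cum_share * n)))
  bins
}

# Toy data (assumed values, not the real survey)
DF <- data.frame(
  Age    = c("30-39", "30-39", "30-39", "30-39", "40-44", "40-44"),
  Income = c(1000, 4000, 2000, 3000, 500, 600),
  Weight = c(1, 1, 1, 1, 2, 2)
)

# Apply the helper within each age group instead of over the whole population
DF$age_group_bin <- NA_integer_
for (idx in split(seq_len(nrow(DF)), DF$Age)) {
  DF$age_group_bin[idx] <- weighted_bin(DF$Income[idx], DF$Weight[idx], n = 2)
}
print(DF)
```

The example uses n = 2 only to keep the toy data small; n = 5 would give quintiles. In dplyr the same idea would presumably be a group_by(Age) followed by a mutate() that calls the binning function, so that it only ever sees one age group at a time.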
I have been working on a file to calculate hospital infection rates. I want to standardise the infection rates to yearly procedure counts. The data are located here because it is too big for dput. SSI is the number of surgical infections(1 = infected, 0=not infected), Procedure is the type of procedure. Year has been derived using lubridate
library(plyr)
fname <- "https://raw.github.com/johnmarquess/some.data/master/hospG.csv"
download.file(fname, destfile='hospG.csv', method='wget')
hospG <- read.csv('hospG.csv')
Inf_table <- ddply(hospG, "Year", summarise,
Infections = sum(SSI == 1),
Procedures = length(Procedure),
PropInf = round(Infections/Procedures * 100 ,2)
)
This gives me the number of infections, procedures, and proportion infected per year for this hospital.
What I would like is an additional column with the standardised proportion infected. The long way to do this outside the inf_table is:
s1 <- sum(Inf_table$Infections)
s2 <- sum(Inf_table$Procedures)
Expected_prop_inf <- Inf_table$Procedures * s1/s2
Is there a way to get ddply to do this? I tried making a function with the calculation to produce Expected_prop_inf, but I did not get very far.
Thanks for any help offered.
It's more difficult with ddply because you are dividing by a number outside the grouping. It is better to do it with base R.
# base
> with(Inf_table, Procedures*(sum(Infections)/sum(Procedures)))
[1] 17.39184 17.09623 23.00847 20.84065 24.83141 24.83141
rather than with ddply which is not so natural:
# NB note .(Year) is unique for every row, you might also use rownames
> s1 <- sum(Inf_table$Infections)
> s2 <- sum(Inf_table$Procedures)
> ddply(Inf_table, .(Year), summarise, Procedures*(s1/s2))
Year ..1
1 2001 17.39184
2 2002 17.09623
3 2003 23.00847
4 2004 20.84065
5 2005 24.83141
6 2006 24.83141
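To make the arithmetic concrete, here is a minimal sketch on a made-up summary table (the counts are invented for illustration and are not the hospital data). The overall infection rate is computed once across all years and then applied to each year's procedure count.

```r
# Toy yearly summary (invented numbers)
Inf_table <- data.frame(Year       = 2001:2003,
                        Infections = c(2, 4, 6),
                        Procedures = c(100, 200, 300))

# Overall rate across all years: 12 / 600 = 0.02
overall_rate <- sum(Inf_table$Infections) / sum(Inf_table$Procedures)

# Expected infections per year if every year had the overall rate
Inf_table$Expected_inf <- Inf_table$Procedures * overall_rate
print(Inf_table)
```

Because the rate uses totals over the whole table, it has to be computed outside any per-year grouping, which is why it does not fit naturally into a single ddply call.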
Here is a solution that aggregates using data.table.
I'm not sure if it's possible to do it in one step.
require("data.table")
fname <- "https://raw.github.com/johnmarquess/some_data/master/hospG.csv"
hospG <- read.csv(fname)
DT <- as.data.table(hospG) # convert the data.frame to a data.table
Inf_table <- DT[, {Infections = sum(SSI == 1)
Procedures = length(Procedure)
PropInf = round(Infections/Procedures * 100 ,2)
list(
Infections = Infections,
Procedures = Procedures,
PropInf = PropInf
)
}, by = Year]
Inf_table[,Expected_prop_inf := list(Procedures * sum(Infections)/sum(Procedures))]
tables()
An added bonus of this approach is that the second step does not create another data.table; it adds a new column to the existing one by reference. This matters when your datasets are bigger.
I have hospital visit data containing records for gender, age, main diagnosis, and hospital identifier. I intend to create separate variables for these entries. The data has some pattern: most observations start with a gender code (M or F) followed by age, then diagnosis, and mostly the hospital identifier. But there are some exceptions. In some, the gender is coded 01 or 02, and in this case the gender identifier appears at the end.
I looked into the archives and found some examples of grep, but I did not manage to apply them efficiently to my data. For example, the code
ndiag<-dat[grep("copd", dat[,1], fixed = TRUE),]
could extract each diagnosis individually, but not all at once. How can I do this task?
Sample data that contain current situation (column 1) and what I intend to have is shown below:
diagnosis hospital diag age gender
m3034CVDA A cvd 30-34 M
m3034cardvA A cardv 30-34 M
f3034aceB B ace 30-34 F
m3034hfC C hf 30-34 M
m3034cereC C cere 30-34 M
m3034resPC C resp 30-34 M
3034copd_Z_01 Z copd 30-34 M
3034copd_Z_01 Z copd 30-34 M
fcereZ Z cere NA F
f3034respC C resp 30-34 F
3034copd_Z_02 Z copd 30-34 F
There appear to be two key parts to this problem:
1. Dealing with the fact that strings are coded in two different ways
2. Splicing the string into the appropriate data columns
Note: as for applying a function over several values at once, many of the functions can handle vectors already. For example str_locate and substr.
Part 1 - Cleaning the strings for m/f // 01/02 coding
# We will be using this library later for str_detect, str_replace, etc
library(stringr)
# first, make sure diagnosis is character (strings) and not factor (category)
diagnosis <- as.character(diagnosis)
# We will use a temporary vector, to preserve the original, but this is not a necessary step.
diagnosisTmp <- diagnosis
males <- str_locate(diagnosisTmp, "_01")
females <- str_locate(diagnosisTmp, "_02")
# NOTE: All of this will work fine as long as '_01'/'_02' appears *__only__* as gender code.
# Therefore, we put in the next two lines to check for errors, make sure we didn't accidentally grab a "_01" from the middle of the string
#-------------------------
if (any(str_length(diagnosisTmp) != males[,2], na.rm=T)) stop ("Error in coding for males")
if (any(str_length(diagnosisTmp) != females[,2], na.rm=T)) stop ("Error in coding for females")
#------------------------
# remove all the '_01'/'_02' (replacing with "")
diagnosisTmp <- str_replace(diagnosisTmp, "_01", "")
diagnosisTmp <- str_replace(diagnosisTmp, "_02", "")
# append to front of string appropriate m/f code
diagnosisTmp[!is.na(males[,1])] <- paste0("m", diagnosisTmp[!is.na(males[,1])])
diagnosisTmp[!is.na(females[,1])] <- paste0("f", diagnosisTmp[!is.na(females[,1])])
# remove superfluous underscores
diagnosisTmp <- str_replace(diagnosisTmp, "_", "")
# display the original next to modified, for visual spot check
cbind(diagnosis, diagnosisTmp)
Part 2 - Splicing the string
# gender is the first char, hospital is the last.
gender <- toupper(str_sub(diagnosisTmp, 1,1))
hosp <- str_sub(diagnosisTmp, -1,-1)
# age, if present, is chars 2-5. A warning will be shown if values are missing. Age needs to be cleaned up
age <- as.numeric(str_sub(diagnosisTmp, 2,5)) # as.numeric will convert non-numbers to NA
age[!is.na(age)] <- paste(substr(age[!is.na(age)], 1, 2), substr(age[!is.na(age)], 3, 4), sep="-")
# diagnosis is variable length, so we have to find where to start
diagStart <- 2 + 4*(!is.na(age))
diag <- str_sub(diagnosisTmp, diagStart, -2)
# Put it all together into a data frame
dat <- data.frame(diagnosis, hosp, diag, age, gender)
## OR WITHOUT ORIGINAL DIAGNOSIS STRING ##
dat <- data.frame(hosp, diag, age, gender)
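The two parts can also be sketched end to end in base R, without stringr, on a handful of the sample strings from the question. This is an illustration under the same assumptions as above ('_01'/'_02' only ever appears as a trailing gender code, and the hospital is the last character); the names tmp, has_age and parsed are made up here.

```r
diagnosis <- c("m3034CVDA", "f3034aceB", "3034copd_Z_01", "fcereZ", "3034copd_Z_02")

# Part 1: move a trailing _01/_02 gender code to a leading m/f
tmp  <- diagnosis
is_m <- grepl("_01$", tmp)
is_f <- grepl("_02$", tmp)
tmp  <- sub("_0[12]$", "", tmp)
tmp[is_m] <- paste0("m", tmp[is_m])
tmp[is_f] <- paste0("f", tmp[is_f])
tmp  <- gsub("_", "", tmp, fixed = TRUE)   # drop remaining underscores

# Part 2: split into fields
gender  <- toupper(substr(tmp, 1, 1))
hosp    <- substr(tmp, nchar(tmp), nchar(tmp))
has_age <- grepl("^[a-zA-Z][0-9]{4}", tmp)  # four digits right after the gender letter
age     <- ifelse(has_age,
                  paste(substr(tmp, 2, 3), substr(tmp, 4, 5), sep = "-"),
                  NA)
diag    <- tolower(substr(tmp, ifelse(has_age, 6, 2), nchar(tmp) - 1))

parsed <- data.frame(diagnosis, hosp, diag, age, gender, stringsAsFactors = FALSE)
print(parsed)
```

Testing for the digits with a regular expression avoids the coercion warning from as.numeric, and tolower() gives the lower-case diagnosis codes shown in the desired output.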
The following is a toy problem that demonstrates my question.
I have a data frame that contains a bunch of employees; for each employee, it has a name, salary, gender and state.
aggregate(salary ~ state) # Returns the average salary per state
aggregate(salary ~ state + gender, data, FUN = mean) # Avg salary per state/gender
What I actually need is a summary of the fraction of the total salary earned by women in each state.
aggregate(salary ~ state + gender, data, FUN = sum)
returns the total salary earned by women (and men) in each state, but what I really need is salary_w / salary_total on a per-state level. I can write a for-loop, etc., but I am wondering if there is some way to use aggregate to do that.
Another option would be using plyr. ddply() expects a data.frame as an input and will return a data.frame as an output. The second argument is how you want to split the data frame. The third argument is what we want to apply to the chunks, here we are using summarise to create a new data.frame from the existing data.frame.
library(plyr)
#Using the sample data from kohske's answer above
> ddply(d, .(state), summarise, ratio = sum(salary[gender == "Woman"]) / sum(salary))
state ratio
1 1 0.5789860
2 2 0.4530224
Probably reshape or reshape2 would help with your work.
Here is a sample script:
library(reshape2) # from CRAN
# sample data
d <- data.frame(expand.grid(state=gl(2,2),gender=gl(2,1, labels=c("Men","Women"))),
salary=runif(8))
d2 <- dcast(d, state~gender, sum, value.var="salary")
d2$frac <- d2$Women/(d2$Men+d2$Women)
The ave function is good for problems like this.
Data$ratio <- ave(Data$salary, Data$state, Data$gender, FUN=sum) /
ave(Data$salary, Data$state, FUN=sum)
Another solution is to use xtabs and prop.table:
prop.table(xtabs(salary ~ state + gender,data),margin=1)
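As a small worked check, the xtabs/prop.table approach can be run on a toy data frame where the answer is easy to verify by hand. The data and the names dat1, shares, female_share and check are invented for illustration.

```r
# Toy data: state A pays women 60 of 100, state B pays women 20 of 100
dat1 <- data.frame(
  state  = c("A", "A", "A", "B", "B"),
  gender = c("Female", "Male", "Female", "Female", "Male"),
  salary = c(30, 40, 30, 20, 80)
)

# Salary totals by state and gender, converted to row-wise (per-state) shares
shares <- prop.table(xtabs(salary ~ state + gender, data = dat1), margin = 1)
female_share <- shares[, "Female"]
print(female_share)

# Cross-check with tapply: women's total divided by the state total
check <- with(dat1,
              tapply(salary[gender == "Female"], state[gender == "Female"], sum) /
              tapply(salary, state, sum))
print(check)
```

Both routes should agree as long as every state has at least one woman; a state with none would be missing from the first tapply() result and the division would no longer line up.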
It's generally not advisable to name your dataset "data", so I will change the problem slightly and name the dataset "dat1".
with(subset(dat1, gender == "Female"), aggregate(salary, list(state = state), sum))$x /
# each aggregate(...)$x is a vector of per-state sums (assumes every state has at least one woman)
with(dat1, aggregate(salary, list(state = state), sum))$x
# using R's element-wise division
I think you are also using attach and there are good reasons to reconsider that decision, despite what you might read in Crawley.
Since you want the results on a per-state basis, tapply might be what you want.
To illustrate let's generate some arbitrary data to play with:
set.seed(349) # For replication
n <- 20000 # Sample size
gender <- sample(c('M', 'W'), size = n, replace = TRUE) # Random selection of gender
state <- c('AL','AK','AZ','AR','CA','CO','CT','DE','DC','FL','GA','HI',
'ID','IL','IN','IA','KS','KY','LA','ME','MD','MA','MI','MN',
'MS','MO','MT','NE','NV','NH','NJ','NM','NY','NC','ND','OH',
'OK','OR','PA','RI','SC','SD','TN','TX','UT','VT','VA','WA',
'WV','WI','WY') # All US states
state <- sample(state, size = n, replace = TRUE) # Random selection of the states
state_index <- tapply(state, state) # Just for the data generation part ...
gender_index <- tapply(gender, gender)
# Generate salaries
salary <- runif(length(unique(state)))[state_index] # Make states different
salary <- salary + c(.02, -.02)[gender_index] # Make gender different
salary <- salary + log(50) + rnorm(n) # Add mean and error term
salary <- exp(salary) # The variable of interest
What you asked for, the sum of salaries for the women per state and the sum of total salaries per state:
salary_w <- tapply(salary[gender == 'W'], state[gender == 'W'], sum)
salary_total <- tapply(salary, state, sum)
Or if it is in a data-frame:
salary_w <- with(myData, tapply(salary[gender == 'W'], state[gender == 'W'], sum))
salary_total <- with(myData, tapply(salary, state, sum))
Then the answer is:
> salary_w / salary_total
AK AL AR AZ CA CO CT DC
0.4667424 0.4877013 0.4554831 0.4959573 0.5382478 0.5544388 0.5398104 0.4750799
DE FL GA HI IA ID IL IN
0.4684846 0.5365707 0.5457726 0.4788805 0.5409347 0.4596598 0.4765021 0.4873932
KS KY LA MA MD ME MI MN
0.5228247 0.4955802 0.5604342 0.5249406 0.4890297 0.4939574 0.4882687 0.5611435
MO MS MT NC ND NE NH NJ
0.5090843 0.5342312 0.5492702 0.4928284 0.5180169 0.5696885 0.4519603 0.4673822
NM NV NY OH OK OR PA RI
0.4391634 0.4380065 0.5366625 0.5362918 0.5613301 0.4583937 0.5022793 0.4523672
SC SD TN TX UT VA VT WA
0.4862358 0.4895377 0.5048047 0.4443220 0.4881062 0.4880047 0.5338397 0.5136393
WI WV WY
0.4787588 0.5495602 0.5029816