Calculating the overall mean, not a mean of means - r

I feel like this is a really simple question that I should be able to figure out, but I have been trying for a while now with no success. I have a dataframe and wish to determine the overall mean rt by cond type and by speaker type, ignoring position. How do I do this?
Simply, three groups of people read sentences ("Speaker"). Each "cond" is a different sentence type ("ExpA", "ExpB", "ExpC", "ExpD"), all made of 5 parts ("Position"). Each part has a corresponding reading time in each sentence type. I want to look at the overall reading time for each cond (all positions together) for each group, e.g. the sum of all position reading times (0, 1, 2, 3, 4) for just FR participants for cond "ExpA", to compare whether they were faster or slower overall in "ExpA" than in "ExpB".
Dataframe:
Speaker: FR, EN, KR
cond: ExpA, ExpB, ExpC, ExpD
Position: 0, 1, 2, 3, 4
rt: 1000, 1500, 2000, 1500, 1000
How would I do this? I've been able to get the mean rt by position, condition, and speaker using the code below. But when I remove "position", thinking it will then give me the combined mean for each "cond", it gives me only one value that is too small to be the sum of 5 means; rather, it looks to be a mean of those means.
pcsmeans = ddply(subj.means, .(cond, position, speaker), summarise, sd = sd(mean.rt), mean = mean(mean.rt))
I hope the lack of a proper dataframe isn't off-putting; I don't know how to input one of those here. Many thanks for any help!

Slightly unclear what you are after, but it looks like you could use dplyr's group_by and summarise:
library(dplyr)
df <- data.frame(Speaker = rep(c("FR", "EN", "KR"), 20),
                 cond = rep(c("ExpA", "ExpB", "ExpC", "ExpD"), 15),
                 Position = rep(c(0, 1, 2, 3, 4), 12),
                 rt = runif(min = 1000, max = 2000, n = 60))
df %>% group_by(Speaker, cond) %>% summarise(mean_rt = mean(rt), overall_rt = sum(rt))
This gives you the mean and sum by Speaker and cond:
# A tibble: 12 x 4
# Groups: Speaker [3]
Speaker cond mean_rt overall_rt
<fct> <fct> <dbl> <dbl>
1 EN ExpA 1690. 8449.
2 EN ExpB 1625. 8127.
3 EN ExpC 1588. 7940.
4 EN ExpD 1475. 7375.
5 FR ExpA 1321. 6603.
6 FR ExpB 1584. 7922.
7 FR ExpC 1493. 7465.
8 FR ExpD 1463. 7315.
9 KR ExpA 1393. 6965.
10 KR ExpB 1540. 7702.
11 KR ExpC 1569. 7847.
12 KR ExpD 1570. 7849.
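If you would rather stay with plyr, a minimal sketch of the same idea, assuming subj.means holds one mean.rt per cond, position, and speaker (as in the ddply call in the question), is to drop position from the grouping and sum the per-position means instead of averaging them again:
library(plyr)
# Total reading time per cond and speaker: sum of the 5 position means,
# plus the usual sd and the grand mean for reference
cond.totals <- ddply(subj.means, .(cond, speaker), summarise,
                     total.rt = sum(mean.rt),
                     sd = sd(mean.rt),
                     grand.mean = mean(mean.rt))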

It is not very clear what your actual issue is. Since you already mentioned that you do not know how to add a sample data.frame, here is one example which I guess fits your problem:
# generate mock df
speaker <- c("FR", "EN", "KR")
exp <- c("ExpA", "ExpB", "ExpC", "ExpD")
position <- c(0, 1, 2, 3, 4)
# rt <- 1000, 1500, 2000, 1500, 1000
data <- expand.grid(speaker, exp, position)
names(data) <- c("speaker", "exp", "position")
data$rt <- rnorm(n = nrow(data), mean = 1300, sd = 250)
head(data)
speaker exp position rt
FR ExpA 0 1269
EN ExpA 0 859
KR ExpA 0 863
FR ExpB 0 718
EN ExpB 0 956
KR ExpB 0 867
...
From this point on, there are multiple options. My preferred quick and efficient tool comes with the sqldf package, which introduces SQL-like constructs. SQL is efficient and easy to read:
library(sqldf)
sqldf::sqldf('select count(*) as N, speaker, exp, avg(rt) as mean from data group by speaker, exp')
Obviously, R has a million tools when it comes to solving a problem, but this is my favourite by far. For anything more complex (e.g. custom functions), I'd probably use a for loop that iterates through each speaker and exp combination:
data$identifier <- paste0(data$speaker, data$exp) # helper column
results <- data.frame()
for (ident in unique(data$identifier)) {
  df <- subset(data, identifier == ident)
  speaker <- unique(df$speaker)
  exp <- unique(df$exp)
  mean <- mean(df$rt)
  se <- sd(df$rt) / sqrt(nrow(df))
  quantileButTransformed <- t(as.data.frame(quantile(df$rt))) # whatever you can think of
  newLine <- data.frame(speaker = speaker, exp = exp, N = nrow(df), mean = mean,
                        se = se, quantile = quantileButTransformed)
  results <- rbind(results, newLine)
}
Cheers!

Related

R: trying to calculate means and sd + warning that object cannot be found while the object is listed in the data frame header

I'm fairly new to R and I struggle with calculating means for a single column. RStudio returns the same warning on several different possibilities (described further down below). I have searched the existing questions, but either the questions did not ask for what I was searching for, or the solution did not work with my data.
My data has different studies as rows and study quality ratings with multiple sub-points as columns.
A simplified version looks like this:
> dd <- data.frame(authoryear = c("Smith, 2020", "Meyer, 2019", "Lim, 2019", "Lowe, 2018"),
+ stqu1 = c(1, 3, 2, 4),
+ stqu2 = c(8, 3, 9, 9),
+ stqu3 = c(1, 1, 1, 2))
> dd
authoryear stqu1 stqu2 stqu3
1 Smith, 2020 1 8 1
2 Meyer, 2019 3 3 1
3 Lim, 2019 2 9 1
4 Lowe, 2018 4 9 2
I calculated the sums of the study quality ratings for each study by rowSums and created a new column in my data frame called "stqu_sum".
Like so:
dd$stqu_sum <- rowSums(subset(dd, select = c(stqu1, stqu2, stqu3)), na.rm = TRUE)
Now I would like to calculate the mean and standard deviation of stqu_sum over all the studies (rows). I googled and found many different ways to do this, but no matter what I try, I get the same warning which I don't know how to fix.
Things I have tried:
#defining stqu_sum as numeric
dd[, stqu_sum := as.numeric(stqu_sum)]
#colMeans
colMeans(dd, select = stqu_sum, na.rm = TRUE)
#sapply
sapply(dd, function(dd) c( "Stand dev" = sd(stqu_sum),
"Mean"= mean(stqu_sum,na.rm=TRUE),
"n" = length(stqu_sum),
"Median" = median(stqu_sum),
))
#data.table
dd[, .(mean_stqu = mean("stqu_sum"), sd_stqu = sd("stqu_sum")),.(variable, value)]
All of these have returned the warning: object stqu_sum not found. However the stqu_sum column is shown in the header of my data frame as seen above.
Can anyone help me to fix this or show me another way to do this?
I hope this is detailed enough. Please let me know if I should add any information.
Thank you in advance!
Is this what you're after? Mean and SD for stqu_sum:
library(dplyr)
dd_summary <- dd %>%
  summarise(mean = mean(stqu_sum),
            SD = sd(stqu_sum))
Gives:
> dd_summary
mean SD
1 11 3.366502
With data.table, we don't need to quote the column names (note that dd first has to be converted to a data.table, e.g. with setDT):
library(data.table)
setDT(dd)[, .(mean_stqu = mean(stqu_sum), sd_stqu = sd(stqu_sum))]
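For completeness, the same summary works in base R as well, using only the stqu_sum column already created above:
# Mean and standard deviation of the row sums, base R only
mean(dd$stqu_sum)
sd(dd$stqu_sum)
# or both at once
c(mean = mean(dd$stqu_sum), SD = sd(dd$stqu_sum))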

How can I tidy student enrollment data on a per semester basis?

I have a dataset that currently lists student information on a term basis (i.e., 201610, 201620, 201630, 201640, 201710, etc.) with suffix 10 = fall, 20 = winter, 30 = spring, and 40 = summer. Not all terms are necessarily listed for every student.
What I would like to do is identify the first term in which a student was enrolled, presumably the fall, as T1, and subsequent terms as T2, T3, etc. Since some students may take a winter or summer term, I would like to identify those as T1_Winter, T2_Summer, etc.
I've been able to isolate the individual terms for which a student has enrolled, and have been able to identify the first, intermediate, and last terms as 1, 2, 3, etc. However, I can't manage to wrap my head around how to identify fall and spring as 1, 2, 3, 4, and the intermediary terms, winter and summer, as 1.5, 2.5, 3.5, 4.5, etc.
# Create the sample dataset
data <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 2),
  RegTerm = c(201810, 201820, 201830, 201910, 201930, 201940, 202010)
)
# Isolate student IDs and terms
stdTerm <- subset(data, select = c("ID","RegTerm"))
# Sort according to ID and RegTerm
stdTerm <- stdTerm[
with(stdTerm, order(ID, RegTerm)),
]
# Remove duplicate combinations of ID and term
y <- stdTerm[!duplicated(stdTerm[c(1,2)]),]
# Create an index to identify the term number
# for which a student enrolled
library(dplyr)
z <- y %>%
arrange(ID, RegTerm) %>%
group_by(ID) %>%
mutate(StdTermIndex = seq(n()))
Right now, it's identifying the progression of all terms for a student as 1, 2, 3, etc., but not winter and summer as intermediary terms. That is, if a student enrolled in fall and winter, winter will appear as 2 and spring will appear as 3.
In the sample data provided, I would like Student ID 1 to reflect 201810 as 1, 201820 as 1.5, and 201830 as 2, etc. Any suggestions or previous code I could reference to wrap my head around how I can code the intermediary semesters?
So, to do it on your sample, I created a helper variable that tells me whether the term digit in RegTerm is even or odd.
The reason is simple: an odd term digit means it is a regular term (fall or spring), whereas even ones will be either winter or summer terms.
library(dplyr)
library(stringr)
data <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 2),
  RegTerm = c(201810, 201820, 201830, 201910, 201930, 201940, 202010)
)
dat <- data %>%
  mutate(term = str_extract(RegTerm, '(?<=\\d{4})\\d{1}(?=0)'),
         term = as.numeric(term) %% 2) %>%
  group_by(ID) %>%
  mutate(numTerm = cumsum(term),
         numTerm = ifelse(term == 0, numTerm + 0.5, numTerm))
The first mutate extracts the 5th digit in the RegTerm column and takes the remainder of dividing it by 2. If it equals 1, it is a regular term; otherwise it is either summer or winter.
Next I take the cumulative sum of this variable, which gives you how many regular terms the student has reached. Then, for every term == 0, I add 0.5 to numTerm to account for the winter and summer terms.
# A tibble: 7 x 4
# Groups: ID [2]
ID RegTerm term numTerm
<dbl> <dbl> <dbl> <dbl>
1 1 201810 1 1
2 1 201820 0 1.5
3 1 201830 1 2
4 2 201910 1 1
5 2 201930 1 2
6 2 201940 0 2.5
7 2 202010 1 3
This way, if there is a student starting in a winter term, numTerm will be assigned a 0.5 value, reaching numTerm = 1 only when they reach a regular term (term == 1).
I think a good way to do this would be to separate your RegTerm column into year and suffix and then apply some condition formula once you have the values split up.
The code below does that; we then just have to apply it to the whole column and do some rejigging.
paste(strsplit(as.character(201810), "")[[1]][1:4], collapse = "")
# "2018"
paste(strsplit(as.character(201810), "")[[1]][5:6], collapse = "")
# "10"
So to do it on the data frame you want to use something like lapply and then unlist the result and add a new column. After that you can change the values to numeric and then use some conditional statements in a mutate function to set the intermediary values etc.
z$year <- unlist(lapply(z$RegTerm, function(x) paste(strsplit(as.character(x), "")[[1]][1:4], collapse = "")))
z$suf <- unlist(lapply(z$RegTerm, function(x) paste(strsplit(as.character(x), "")[[1]][5:6], collapse = "")))
It looks a bit ugly, but all it is doing is splitting RegTerm into characters, selecting the first 4 or the last 2 characters for year and suf respectively, then collapsing them (using collapse = "" in paste) into a single string. We lapply this over the whole column, then unlist it to make a vector.
I would recommend understanding the first two lines of code in this answer; the rest then follows naturally.
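To finish that thought, here is one possible sketch of the conditional step (the regular flag and the x.5 numbering are illustrative, following the question's scheme), turning the suffix into a term index with winter and summer landing on the .5 values:
library(dplyr)
z %>%
  mutate(suf = as.numeric(suf),
         regular = suf %in% c(10, 30)) %>%      # 10 = fall, 30 = spring
  group_by(ID) %>%
  arrange(RegTerm, .by_group = TRUE) %>%
  mutate(StdTermIndex = cumsum(regular) +       # count the regular terms so far
           ifelse(regular, 0, 0.5))             # winter/summer become x.5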

Find shortest distance between multiple points

Imagine a small dataset of xy coordinates. These points are grouped by a variable called indexR, there are 3 groups in total. All xy coordinates are in the same units. The data looks approximately like so:
# A tibble: 61 x 3
indexR x y
<dbl> <dbl> <dbl>
1 1 837 924
2 1 464 661
3 1 838 132
4 1 245 882
5 1 1161 604
6 1 1185 504
7 1 853 870
8 1 1048 859
9 1 1044 514
10 1 141 938
# ... with 51 more rows
The goal is to determine which 3 points, one from each group, are closest to each other, in the sense of minimizing the sum of the pairwise distances between selected points.
I have attempted this by considering Euclidean distances, as follows. (Credit goes to @Mouad_S, in this thread, and https://gis.stackexchange.com/questions/233373/distance-between-coordinates-in-r)
#dput provided at bottom of this post
> df$dummy = 1
> df %>%
+ full_join(df, c("dummy" = "dummy")) %>%
+ full_join(df, c("dummy" = "dummy")) %>%
+ filter(indexR.x != indexR.y & indexR.x != indexR & indexR.y != indexR) %>%
+ mutate(dist =
+ ((.$x - .$x.x)^2 + (.$y- .$y.x)^2)^.5 +
+ ((.$x - .$x.y)^2 + (.$y- .$y.y)^2)^.5 +
+ ((.$x.x - .$x.y)^2 + (.$y.x- .$y.y)^2)^.5,
+ dist = round(dist, digits = 0)) %>%
+ arrange(dist) %>%
+ filter(dist == min(dist))
# A tibble: 6 x 11
indexR.x x.x y.x dummy indexR.y x.y y.y indexR x y dist
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 638 324 1 2 592 250 3 442 513 664
2 1 638 324 1 3 442 513 2 592 250 664
3 2 592 250 1 1 638 324 3 442 513 664
4 2 592 250 1 3 442 513 1 638 324 664
5 3 442 513 1 1 638 324 2 592 250 664
6 3 442 513 1 2 592 250 1 638 324 664
From this we can identify the three points closest together (minimum distance apart; enlarged on the figure below). However, the challenge comes when extending this such that indexR has 4,5 ... n groups. The problem is in finding a more practical or optimised method for making this calculation.
structure(list(indexR = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 3, 3), x = c(836.65, 464.43, 838.12, 244.68, 1160.86,
1184.52, 853.4, 1047.96, 1044.2, 141.06, 561.01, 1110.74, 123.4,
1087.24, 827.83, 100.86, 140.07, 306.5, 267.83, 1118.61, 155.04,
299.52, 543.5, 782.25, 737.1, 1132.14, 659.48, 871.78, 1035.33,
867.81, 192.94, 1167.8, 1099.59, 1097.3, 1089.78, 1166.59, 703.33,
671.64, 346.49, 440.89, 126.38, 638.24, 972.32, 1066.8, 775.68,
591.86, 818.75, 953.63, 1104.98, 1050.47, 722.43, 1022.17, 986.38,
1133.01, 914.27, 725.15, 1151.52, 786.08, 1024.83, 246.52, 441.53
), y = c(923.68, 660.97, 131.61, 882.23, 604.09, 504.05, 870.35,
858.51, 513.5, 937.7, 838.47, 482.69, 473.48, 171.78, 774.99,
792.46, 251.26, 757.95, 317.71, 401.93, 326.32, 725.89, 98.43,
414.01, 510.16, 973.61, 445.33, 504.54, 669.87, 598.75, 225.27,
789.45, 135.31, 935.51, 270.38, 241.19, 595.05, 401.25, 160.98,
778.86, 192.17, 323.76, 361.08, 444.92, 354, 249.57, 301.64,
375.75, 440.03, 428.79, 276.5, 408.84, 381.14, 459.14, 370.26,
304.05, 439.14, 339.91, 435.85, 759.42, 513.37)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -61L), .Names = c("indexR",
"x", "y"))
One possibility would be to formulate the problem of identifying the closest elements, one from each group, as a mixed integer program. We could define decision variables y_i for whether each point i is selected, as well as x_{ij} for whether points i and j are both selected (x_{ij} = y_iy_j). We need to select one element from each group.
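Written out in the same notation (with d_{ij} the Euclidean distance between points i and j, and pairs (i, j) restricted to points from different groups), the program constructed below is roughly:
minimize     sum_{ij} d_{ij} x_{ij}
subject to   sum_{i in g} y_i = 1              for every group g
             sum_{i in g, j in h} x_{ij} = 1   for every pair of distinct groups g, h
             x_{ij} <= y_i,  x_{ij} <= y_j     (a pair can be selected only if both endpoints are)
             x_{ij} >= y_i + y_j - 1           (a pair must be selected if both endpoints are)
             x_{ij}, y_i in {0, 1}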
In practice, you could implement this mixed integer program using the lpSolve package (or one of the other R optimization packages).
opt.closest <- function(df) {
  # Compute every pair of indices
  library(dplyr)
  pairs <- as.data.frame(t(combn(nrow(df), 2))) %>%
    mutate(G1 = df$indexR[V1], G2 = df$indexR[V2]) %>%
    filter(G1 != G2) %>%
    mutate(dist = sqrt((df$x[V1]-df$x[V2])^2 + (df$y[V1]-df$y[V2])^2))
  # Compute a few convenience values
  n <- nrow(df)
  nP <- nrow(pairs)
  groups <- sort(unique(df$indexR))
  nG <- length(groups)
  gpairs <- combn(groups, 2)
  nGP <- ncol(gpairs)
  # Solve the optimization problem
  obj <- c(pairs$dist, rep(0, n))
  constr <- rbind(cbind(diag(nP), -outer(pairs$V1, seq_len(n), "==")),
                  cbind(diag(nP), -outer(pairs$V2, seq_len(n), "==")),
                  cbind(diag(nP), -outer(pairs$V1, seq_len(n), "==") - outer(pairs$V2, seq_len(n), "==")),
                  cbind(matrix(0, nG, nP), outer(groups, df$indexR, "==")),
                  cbind((outer(gpairs[1,], pairs$G1, "==") &
                         outer(gpairs[2,], pairs$G2, "==")) |
                        (outer(gpairs[2,], pairs$G1, "==") &
                         outer(gpairs[1,], pairs$G2, "==")), matrix(0, nGP, n)))
  dir <- rep(c("<=", ">=", "="), c(2*nP, nP, nG+nGP))
  rhs <- rep(c(0, -1, 1), c(2*nP, nP, nG+nGP))
  library(lpSolve)
  mod <- lp("min", obj, constr, dir, rhs, all.bin = TRUE)
  which(tail(mod$solution, n) == 1)
}
This can compute the closest 3 points, one from each cluster, in your example dataset:
df[opt.closest(df),]
# A tibble: 3 x 3
# indexR x y
# <dbl> <dbl> <dbl>
# 1 1 638.24 323.76
# 2 2 591.86 249.57
# 3 3 441.53 513.37
It can also compute the best possible solution for datasets with more points and groups. Here are the runtimes for datasets with 7 groups each and 100 and 200 points:
make.dataset <- function(n, nG) {
set.seed(144)
data.frame(indexR = sample(seq_len(nG), n, replace=T), x = rnorm(n), y=rnorm(n))
}
df100 <- make.dataset(100, 7)
system.time(opt.closest(df100))
# user system elapsed
# 11.536 2.656 15.407
df200 <- make.dataset(200, 7)
system.time(opt.closest(df200))
# user system elapsed
# 187.363 86.454 323.167
This is far from instantaneous -- it takes 15 seconds for the 100-point, 7-group dataset and 323 seconds for the 200-point, 7-group dataset. Still, it is much quicker than iterating through all 92 million 7-tuples in the 100-point dataset or all 13.8 billion 7-tuples in the 200-point dataset. You could set a runtime limit with a solver like the one from the Rglpk package to get the best solution obtained within that limit.
You cannot afford to enumerate all possible solutions, and I don't see any obvious shortcut.
So I guess you'll have to do a branch and bound optimization approach.
First, guess a reasonably good solution: for example, start from the closest two points with different labels, then keep adding the nearest point with a not-yet-covered label until all labels are covered.
Now do some trivial optimization: for every label, try if there is some point that you can use instead of the current point to improve the result. Stop when you can't find any further improvement.
For this initial guess, compute the distances. This will give you an upper bound, which allows you to stop your search early. You can also compute a lower bound, the sum of all best two-label solutions.
Now you can try to remove points, where the nearest neighbors of each label + the lower bounds for all other labels is already worse than your initial solution. This will hopefully eliminate a lot of points.
Then you can start enumerating solutions (probably begin with the smallest labels first), but stop recursion whenever the current solution + the remaining lower bounds are larger than your best known solution (branch and bound).
You can also try sorting points e.g. by minimum distance to the remaining labels, to hopefully find better bounds fast.
I'd certainly not choose R to implement this...
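For what it's worth, here is a minimal sketch in R of the greedy start and local-improvement step described above (greedy.closest and all names are illustrative; the branch-and-bound pruning itself is not shown):
# Greedy start + local improvement; df is assumed to have columns indexR, x, y
greedy.closest <- function(df) {
  d <- as.matrix(dist(df[, c("x", "y")]))   # all pairwise distances
  groups <- sort(unique(df$indexR))
  # total pairwise distance of a set of selected row indices
  total.dist <- function(sel) sum(d[t(combn(sel, 2))])
  # start from the closest pair of points with different labels
  cand <- which(outer(df$indexR, df$indexR, "!=") & upper.tri(d), arr.ind = TRUE)
  sel <- as.integer(cand[which.min(d[cand]), ])
  # for each remaining label, add the point nearest to the current selection
  for (g in setdiff(groups, df$indexR[sel])) {
    idx <- which(df$indexR == g)
    sel <- c(sel, idx[which.min(colSums(d[sel, idx, drop = FALSE]))])
  }
  # local improvement: try swapping each selected point for another of its label
  repeat {
    improved <- FALSE
    for (k in seq_along(sel)) {
      for (j in which(df$indexR == df$indexR[sel[k]])) {
        trial <- sel
        trial[k] <- j
        if (total.dist(trial) < total.dist(sel)) {
          sel <- trial
          improved <- TRUE
        }
      }
    }
    if (!improved) break
  }
  df[sel, ]
}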
You can use cross joins to get all combinations of points, calculate the total distance between the three points in each combination, then take the minimum of that.
df$id <- row.names(df) # to create ID's for the points
df2 <- merge(df, df, by = NULL ) # the first cross join
df3 <- merge(df2, df, by = NULL) # the second cross join
# eliminating rows where the points are of the same indexR
df3 <- df3[df3$indexR.x != df3$indexR.y & df3$indexR.x != df3$indexR
& df3$indexR.y != df3$indexR,]
## calculating the total distance
df3$total_distance <- ((df3$x - df3$x.x)^2 + (df3$y- df3$y.x)^2)^.5 +
((df3$x - df3$x.y)^2 + (df3$y- df3$y.y)^2)^.5 +
((df3$x.x - df3$x.y)^2 + (df3$y.x- df3$y.y)^2)^.5
## minimum distance
df3[which.min(df3$total_distance),]
indexR.x x.x y.x id.x indexR.y x.y y.y id.y indexR x y id total_distance
155367 3 441.53 513.37 61 2 591.86 249.57 46 1 638.24 323.76 42 664.3373
I developed a simple algorithm to quickly solve this problem. First, overlay a grid on the entire area of points and assign each point from each group to the cell (unit square) where it is located. Next, go to the lower left corner of the graph and move over one cell and up one cell; this is the starting cell. Define a region of interest consisting of this cell and all 8 of its neighbors, and test whether at least one point from each of the groups lies within this 9-cell region. If so, calculate the distances between all combinations of points in the 9-cell region, where paired points for a distance calculation are never from the same group. From these calculations, the combination with the minimum total distance involving a single point from each group is saved as a possible solution. Then the entire process is repeated by moving the central cell one column to the right, evaluating each 9-cell region as it goes, and stopping one cell from the right end. When the first row is completed, the process moves up one row and starts again one cell in from the left. In this way every cell has served as the central cell by the time the top row is finished. The solution is the minimum distance found over all the 9-cell regions tested.
The reason we consider a 9-cell region and not just go cell-by-cell is that we could miss closely spaced points from different groups that are located in the corners of cells.
It's important to choose the correct cell or grid size. If the cells are too small then no possible solution will be found because none of the regions will encompass at least one point from each group. If the cells are too large then there will be many points from each group and calculation time will be excessive. Fortunately this optimal cell size can be quickly found through trial and error.
I've run this algorithm multiple times with varying number of groups and number of points in a group. For randomly scattered points in all groups I found that a 15 x 15 grid size works well for a 10 group - 400 point (40 points per group) case. That example runs in under one second.
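Here is a rough sketch of that grid scan in R (grid.closest, n.cells, and the cell-assignment details are illustrative choices, not the exact implementation described above):
# Scan 3 x 3 blocks of grid cells; df is assumed to have columns indexR, x, y
grid.closest <- function(df, n.cells = 15) {
  cell.x <- floor((df$x - min(df$x)) / diff(range(df$x)) * (n.cells - 1e-9))
  cell.y <- floor((df$y - min(df$y)) / diff(range(df$y)) * (n.cells - 1e-9))
  groups <- sort(unique(df$indexR))
  best <- list(dist = Inf, rows = NULL)
  # sum of pairwise distances among a set of row indices
  total.dist <- function(rows) sum(dist(df[rows, c("x", "y")]))
  for (cx in 1:(n.cells - 2)) for (cy in 1:(n.cells - 2)) {
    # points falling in the 3 x 3 block of cells centred on (cx, cy)
    in.region <- which(abs(cell.x - cx) <= 1 & abs(cell.y - cy) <= 1)
    if (!all(groups %in% df$indexR[in.region])) next   # need every group present
    # one point per group: enumerate all combinations within the region
    per.group <- lapply(groups, function(g) in.region[df$indexR[in.region] == g])
    combos <- expand.grid(per.group)
    dists <- apply(combos, 1, total.dist)
    if (min(dists) < best$dist)
      best <- list(dist = min(dists), rows = unlist(combos[which.min(dists), ]))
  }
  df[best$rows, ]   # returns zero rows if no 9-cell region covered every group
}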

R conditional lookup and sum

I have data on college course completions, with estimated numbers of students from each cohort completing after 1, 2, 3, ... 7 years. I want to use these estimates to calculate the total number of students outputting from each College and Course in any year.
The output of students in a given year will be the sum of the previous 7 cohorts outputting after 1, 2, 3, ... 7 years.
For example, the number of students outputting in 2014 from COLLEGE 1, COURSE A is equal to the sum of:
Output of 2013 cohort (College 1, Course A) after 1 year +
Output of 2012 cohort (College 1, Course A) after 2 years +
Output of 2011 cohort (College 1, Course A) after 3 years +
Output of 2010 cohort (College 1, Course A) after 4 years +
Output of 2009 cohort (College 1, Course A) after 5 years +
Output of 2008 cohort (College 1, Course A) after 6 years +
Output of 2007 cohort (College 1, Course A) after 7 years
So there are two dataframes: a lookup table that contains all the output estimates, and a smaller summary table that I'm trying to modify. I want to update dummy.summary$output with, for each row, the total output based on the above calculation.
The following code will replicate my data pretty well
# Lookup table
dummy.lookup <- data.frame(cohort = rep(1998:2014, each = 210),
college = rep(rep(paste("College", 1:6), each = 35), 17),
course = rep(rep(paste("Course", LETTERS[1:5]), each = 7),102),
intake = rep(sample(x = 150:300, size = 510, replace=TRUE), each = 7),
output.year = rep(1:7, 510),
output = sample(x = 10:20, size = 3570, replace=TRUE))
# Summary table to be modified
dummy.summary <- aggregate(x = dummy.lookup["intake"], by = list(dummy.lookup$cohort, dummy.lookup$college, dummy.lookup$course), FUN = mean)
names(dummy.summary)[1:3] <- c("year", "college", "course")
dummy.summary <- dummy.summary[order(dummy.summary$year, dummy.summary$college, dummy.summary$course), ]
dummy.summary$output <- 0
The following code does not work, but shows the approach I've been attempting.
dummy.summary$output <- sapply(dummy.summary$output, function(x){
# empty vector to fill with output values
vec <- c()
# Find relevant output for college + course, from each cohort and exit year
for(j in 1:7){
append(x = vec,
values = dummy.lookup[dummy.lookup$college==dummy.summary[x, "college"] &
dummy.lookup$course==dummy.summary[x, "course"] &
dummy.lookup$cohort==dummy.summary[x, "year"]-j &
dummy.lookup$output.year==j, "output"])
}
# Sum and return total output
sum_vec <- sum(vec)
return(sum_vec)
}
)
I guess it doesn't work because I was hoping to use 'x' in the anonymous function to index particular values of the dummy.summary dataframe. But that clearly isn't happening and is only returning zero for each row, presumably because the starting value of 'x' is zero each time. I don't know if it is possible to access the index position of each value that sapply loops over, and use that to index my summary dataframe.
Is this approach fixable or do I need a completely different approach?
Even if it is fixable, is there a more elegant/faster way to achieve what I'm trying to do?
Thanks in anticipation.
I've just updated your output.year to output.year2, which, instead of a value from 1 to 7, holds the actual calendar year in which that cohort's output occurs.
I've realised that the output information you want corresponds to the output year, while the intake information you want corresponds to the cohort. So I calculate them separately and then join the tables. This automatically creates empty (NA, which I transform to 0) output info for 1998.
# fix your random sampling
set.seed(24)
# Lookup table
dummy.lookup <- data.frame(cohort = rep(1998:2014, each = 210),
college = rep(rep(paste("College", 1:6), each = 35), 17),
course = rep(rep(paste("Course", LETTERS[1:5]), each = 7),102),
intake = rep(sample(x = 150:300, size = 510, replace=TRUE), each = 7),
output.year = rep(1:7, 510),
output = sample(x = 10:20, size = 3570, replace=TRUE))
dummy.lookup$output[dummy.lookup$output.year %in% 1:2] <- 0
library(dplyr)
# create result table for output info
dt_output =
dummy.lookup %>%
mutate(output.year2 = output.year+cohort) %>% # update output.year to get a year value
group_by(output.year2, college, course) %>% # for each output year, college, course
summarise(SumOutput = sum(output)) %>% # calculate sum of output
ungroup() %>%
arrange(college,course,output.year2) %>% # for visualisation purposes
rename(cohort = output.year2) # rename column
# create result for intake info
dt_intake =
dummy.lookup %>%
select(cohort, college, course, intake) %>% # select useful columns
distinct() # keep distinct rows/values
# join info
dt_intake %>%
full_join(dt_output, by=c("cohort","college","course")) %>%
mutate(SumOutput = ifelse(is.na(SumOutput),0,SumOutput)) %>%
arrange(college,course,cohort) %>% # for visualisation purposes
tbl_df() # for printing purposes
# Source: local data frame [720 x 5]
#
# cohort college course intake SumOutput
# (int) (fctr) (fctr) (int) (dbl)
# 1 1998 College 1 Course A 194 0
# 2 1999 College 1 Course A 198 11
# 3 2000 College 1 Course A 223 29
# 4 2001 College 1 Course A 198 45
# 5 2002 College 1 Course A 289 62
# 6 2003 College 1 Course A 163 78
# 7 2004 College 1 Course A 211 74
# 8 2005 College 1 Course A 181 108
# 9 2006 College 1 Course A 277 101
# 10 2007 College 1 Course A 157 109
# .. ... ... ... ... ...
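As an aside, the sapply approach from the question can also be made to work by iterating over row indices rather than over the (all-zero) output column, and by summing the matched outputs so that cohorts missing from the lookup contribute 0; a minimal sketch:
# For each summary row, add up the outputs of the 7 preceding cohorts
dummy.summary$output <- sapply(seq_len(nrow(dummy.summary)), function(i) {
  total <- 0
  for (j in 1:7) {
    total <- total + sum(dummy.lookup[dummy.lookup$college == dummy.summary$college[i] &
                                      dummy.lookup$course == dummy.summary$course[i] &
                                      dummy.lookup$cohort == dummy.summary$year[i] - j &
                                      dummy.lookup$output.year == j, "output"])
  }
  total
})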

Is there a quick way to get the mean of various "bins" of a column in a dataframe?

I apologize if this has been answered - I just can't find it! To simplify, I have a dataframe of cars with 2 pertinent columns: mileage and price. I want to calculate the mean price and number of cars for 0-20,000 miles, 20,000-40,000, and so on (in 20,000 mile "bins"). I have been making subsets of the data for the various mileage ranges and then looking at the mean and number of vehicles for each subset. I'm wondering if there is a more efficient way to do this, instead of making all of these subsets - I'm doing it many times over with various "bins" and data. I would love to learn a slicker way of doing this.
Thanks!!
You probably want something along these lines:
library(data.table)
d = data.table(mileage = runif(1000, 0, 100000), price = runif(1000, 15000, 35000))
d[, list(price = mean(price), number = .N),
by = cut(mileage, c(0, 20000, 25000, 30000, 100000))][order(cut)]
# cut price number
# 1: (0,2e+04] 25252.70 215
# 2: (2e+04,2.5e+04] 25497.66 46
# 3: (2.5e+04,3e+04] 25349.79 45
# 4: (3e+04,1e+05] 25037.93 694
This shows how to use aggregate to return more than one statistic by category in a single run.
# Using Quentin's data
d[['mileage.cat']] <- cut(d$mileage, breaks=seq(0, 200000, by= 20000))
aggregate(d$price, d['mileage.cat'] ,
FUN=function(price) c(counts=length(price),
mean.price=mean(price) ) )
mileage.cat x.counts x.mean.price
1 (0,2e+04] 212.00 24859.01
2 (2e+04,4e+04] 194.00 24343.16
3 (4e+04,6e+04] 196.00 24357.73
4 (6e+04,8e+04] 191.00 25006.71
5 (8e+04,1e+05] 207.00 25250.23
To make the bins, use cut(). Example:
x=1:10
bkpt=c(0,2.5,7.5,10)
x.cut=cut(x,breaks=bkpt)
which finishes up like this (y is some data for later):
y=21:30
data.frame(x,x.cut,y)
To calculate something for each group, use tapply. Following my example:
tapply(y,x.cut,length)
tapply(y,x.cut,mean)
which calculates (a) the number of y's and (b) the mean of y's in each of the groups defined by x.cut.
Another approach using aggregate:
df <- data.frame(mil = sample(1e5,20),price = sample(1000,20) )
# mil2 is our "mile bin" (0 -> [0, 20000); 1 -> [20000, 40000); ...)
df$mil2 = trunc(df$mil /20000)
# then to get the mean by "mile bin":
aggregate(price ~ mil2,df,mean)
# or the number:
aggregate(price ~ mil2,df,length)
# or simply:
table(df$mil2)
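A dplyr version of the same grouped summary is also common; a quick sketch using the 20,000-mile bins from the question (the df here is mock data, as in the other answers):
library(dplyr)
df <- data.frame(mileage = runif(1000, 0, 100000),
                 price = runif(1000, 15000, 35000))
df %>%
  group_by(bin = cut(mileage, breaks = seq(0, 100000, by = 20000))) %>%
  summarise(mean_price = mean(price), n_cars = n())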
