How to loop with two lists in R

I have a dataset with demographic information and with questions.
DF <- data.frame(Participant = c(1,2,3,4,5,6,7,8,9,10),
                 Male = c(1,0,1,1,0,1,0,0,1,0),
                 Female = c(0,1,0,0,1,0,1,1,0,1),
                 Q1 = c(9,6,5,4,5,1,3,5,5,2),
                 Q2 = c(2,4,5,4,2,1,3,5,4,2),
                 Q3 = c(6,8,2,7,5,2,1,1,6,3))
I have two lists (made from column titles), one of demographic information (Males, Females, age group etc) and one of questions with their associated response.
Demographic <- c("Male", "Female", "Age_group_1", "Age_group_2"…)
Questions <- c("Q1", "Q2", "Q3", "Q4"…)
I need something along the lines of: if the value in a demographic column is equal to 1, then sum the scores in each question column separately. I want to do this in a loop so that I get the separate question scores (~300) for every column in the demographic list (~80), and I want to save the output. I have no idea how to do this and I'm getting into a loop of bad programming myself!
The end result should resemble this:
    M  F
Q1 24 21
Q2 16 16
Q3 23 18
I would be grateful for any help!
Thanks in advance.
UPDATE:
With help from a friend, I have found a workaround for my problem. How can this be made more efficient, though?
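library(data.table)
library(dplyr)
setDT(DF)  # the DF[, ..., by = , .SDcols = ] syntax below needs a data.table
df.list <- list()
for(question in Questions){
  question.df <- DF[, lapply(.SD, sum, na.rm = TRUE), by = question,
                    .SDcols = Demographic]
  df.list <- append(df.list, question.df)
}
list_new <- bind_cols(df.list, .id = "column_label")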

library(tidyr)
library(dplyr)
df <- data.frame(
Participant = c(1,2,3,4,5,6,7,8,9,10),
Male = c(1,0,1,1,0,1,0,0,1,0),
Female = c(0,1,0,0,1,0,1,1,0,1),
Q1 = c(9,6,5,4,5,1,3,5,5,2),
Q2 = c(2,4,5,4,2,1,3,5,4,2),
Q3 = c(6,8,2,7,5,2,1,1,6,3)
)
df %>%
  mutate(sex = ifelse(Male == 1, "M", "F")) %>%
  select(-Male, -Female) %>%
  pivot_longer(cols = starts_with("Q"), names_to = "Q") %>%
  group_by(sex, Q) %>%
  summarise(value = sum(value)) %>%
  pivot_wider(names_from = sex)
gives:
  Q         F     M
  <chr> <dbl> <dbl>
1 Q1       21    24
2 Q2       16    16
3 Q3       18    23
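With roughly 80 demographic columns and 300 question columns, collapsing the indicators into a single sex variable does not scale well. As a minimal base R sketch, assuming every demographic column is a 0/1 indicator and there are no missing values, the whole table of sums can be obtained as one matrix product. The vector names demographic and questions stand in for the two lists from the question, and the output file name is hypothetical.

```r
DF <- data.frame(
  Participant = 1:10,
  Male   = c(1,0,1,1,0,1,0,0,1,0),
  Female = c(0,1,0,0,1,0,1,1,0,1),
  Q1 = c(9,6,5,4,5,1,3,5,5,2),
  Q2 = c(2,4,5,4,2,1,3,5,4,2),
  Q3 = c(6,8,2,7,5,2,1,1,6,3)
)

# Stand-ins for the two lists of column names (assumed names)
demographic <- c("Male", "Female")
questions   <- c("Q1", "Q2", "Q3")

# crossprod(A, B) computes t(A) %*% B: each cell is the sum of one question
# over the rows where one demographic indicator equals 1
totals <- crossprod(as.matrix(DF[questions]), as.matrix(DF[demographic]))
totals
#    Male Female
# Q1   24     21
# Q2   16     16
# Q3   23     18

# Saving the result (hypothetical file name):
# write.csv(totals, "question_totals.csv")
```

Because the result is an ordinary matrix with questions as rows and demographic columns as columns, it grows to 300 x 80 without any change to the code.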

Depending on what you want to do with the output, another approach is to use tables::tabular(), which can be used to generate additional statistics (e.g. percentages), as well as customizing row and column headings.
We'll generate a simple table using the data provided in the question.
df <- data.frame(Participant = c(1,2,3,4,5,6,7,8,9,10),
Male = c(1,0,1,1,0,1,0,0,1,0),
Female = c(0,1,0,0,1,0,1,1,0,1),
Q1 = c(9,6,5,4,5,1,3,5,5,2),
Q2 = c(2,4,5,4,2,1,3,5,4,2),
Q3 = c(6,8,2,7,5,2,1,1,6,3))
df$sex <- ifelse(df$Male == 1,"M","F")
library(tables)
tabular((Q1 + Q2 + Q3)~Factor(sex)*(sum),data=df)
...and the output:
> tabular((Q1 + Q2 + Q3)~Factor(sex)*(sum),data=df)
sex
F M
sum sum
Q1 21 24
Q2 16 16
Q3 18 23
Processing multiple demographic variables
In the comments to my answer a question was asked about how to use tabular() with more than one demographic variable.
We can use a combination of lapply(), paste0(), and substitute() to build the correct formula expressions for tabular().
To illustrate the process we will add a second demographic variable, Income, to the data frame listed above. Then we create a vector to represent the list of demographic variables for which we will generate tables. Finally, we use the vector with lapply() to produce the tables.
df <- data.frame(Participant = c(1,2,3,4,5,6,7,8,9,10),
Male = c(1,0,1,1,0,1,0,0,1,0),
Female = c(0,1,0,0,1,0,1,1,0,1),
Income = c(rep("low",5),rep("high",5)),
Q1 = c(9,6,5,4,5,1,3,5,5,2),
Q2 = c(2,4,5,4,2,1,3,5,4,2),
Q3 = c(6,8,2,7,5,2,1,1,6,3))
df$Sex <- ifelse(df$Male == 1,"M","F")
library(tables)
tabular((Q1 + Q2 + Q3)~Factor(Sex)*(sum),data=df)
demoVars <- c("Sex","Income")
lapply(demoVars,function(x){
# generate a formula expression including the column variable
# and use substitute() to render it correctly within tabular()
theExpr <- paste0("(Q1 + Q2 + Q3) ~ Factor(",x,")*(sum)")
tabular(substitute(theExpr),data=df)
})
...and the output:
> lapply(demoVars,function(x){
+ # generate a formula expression including the column variable
+ # and use substitute() to render it correctly within tabular()
+ theExpr <- paste0("(Q1 + Q2 + Q3) ~ Factor(",x,")*(sum)")
+ tabular(substitute(theExpr),data=df)
+ })
[[1]]
Sex
F M
sum sum
Q1 21 24
Q2 16 16
Q3 18 23
[[2]]
Income
high low
sum sum
Q1 16 29
Q2 15 17
Q3 13 28
Note that we can enhance the solution further by saving the tables to an output object and rendering them in a printer-friendly format as needed.
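If the tables package is not available, the same loop over demographic variables can be sketched in base R. This is an illustrative alternative rather than the tabular() approach itself: it assumes each demographic variable is a single grouping column (as with Sex and Income here), and the names questions and results are made up for the example. The results are kept in a named list so they can be saved or printed later.

```r
df <- data.frame(
  Sex    = c("M","F","M","M","F","M","F","F","M","F"),
  Income = c(rep("low", 5), rep("high", 5)),
  Q1 = c(9,6,5,4,5,1,3,5,5,2),
  Q2 = c(2,4,5,4,2,1,3,5,4,2),
  Q3 = c(6,8,2,7,5,2,1,1,6,3)
)
demoVars  <- c("Sex", "Income")
questions <- c("Q1", "Q2", "Q3")   # assumed name for the question columns

# One matrix of sums per demographic variable, stored under its name
results <- lapply(setNames(demoVars, demoVars), function(v) {
  # split the question columns by group, then sum each column within a group
  sapply(split(df[questions], df[[v]]), colSums)
})

results$Income
#    high low
# Q1   16  29
# Q2   15  17
# Q3   13  28
```

Each element of results is a plain matrix, so write.csv() or saveRDS() can be applied to it directly.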

Related

Select columns from a data frame

I have a data frame made up of several columns, each corresponding to a different industry per country. I have 56 industries and 43 countries, and I'd like to select only industries 5 to 22 for each country (18 industries). The big issue is that each industry per country is named as: AUS1, AUS2 ..., AUS56. What I need to select is AUS5 to AUS22, AUT5 to AUT22 ....
A viable solution could be to select columns according to the following algorithm: the first column of interest, i.e., AUS5, corresponds to column 10, and I select up to AUS22 (corresponding to column 27). Then I should skip all the remaining columns for AUS (i.e. AUS23 to AUS56) and the first 4 columns for the next country (AUT1 to AUT4). Then I select, as before, industries 5 to 22 for AUT. Basically, the algorithm, starting from column 10, should select 18 columns (including column 10), skip the next 38 columns, and then select the next 18 columns. This process should be repeated for all 43 countries.
How can I code that?
UPDATE, Example:
df=data.frame(industry = c("C10","C11","C12","C13"),
country = c("USA"),
AUS3 = runif(4),
AUS4 = runif(4),
AUS5 = runif(4),
AUS6 = runif(4),
DEU5 = runif(4),
DEU6 = runif(4),
DEU7 = runif(4),
DEU8 = runif(4))
# I'm interested only in industries C10 and C11:
library(dplyr)
df_a <- df %>% filter(grepl('C10|C11', industry))
df_a
# Thus, how can I efficiently select columns AUS10, AUS11, DEU10, DEU11, considering that I have a huge dataset?
Demonstrating the paste0 approach.
ctr <- unique(gsub('\\d', '', names(df[-(1:2)])))
# ctr <- c("AUS", "DEU") ## alternatively hard-coded
ind <- c(10, 11)
subset(df, industry %in% paste0('C', 10:11),
       select=c('industry', 'country', paste0(rep(ctr, each=length(ind)), ind)))
# industry country AUS10 AUS11 DEU10 DEU11
# 1 C10 USA 0.3376674 0.1568496 0.5033433 0.7327734
# 2 C11 USA 0.7421840 0.6808892 0.9050158 0.3689741
Or, since you appear to like grep you could do.
df[grep('10|11', df$industry), grep('industry|country|[A-Z]{3}1[01]', names(df))]
# industry country AUS10 AUS11 DEU10 DEU11
# 1 C10 USA 0.3376674 0.1568496 0.5033433 0.7327734
# 2 C11 USA 0.7421840 0.6808892 0.9050158 0.3689741
If you have a big data set in memory, data.table could be ideal and much faster than alternatives. Something like the following could work, though you will need to play with select_ind and select_ctr as desired on the real dataset.
It might be worth giving us a slightly larger toy example, if possible.
library(data.table)
setDT(df)
select_ind <- paste0(c("C"), c("11","10"))
select_ctr <- paste0(rep(c("AUS", "DEU"), each = 2), c("10","11"))
df[grepl(paste0(select_ind, collapse = "|"), industry), # select rows
..select_ctr] # select columns
AUS10 AUS11 DEU10 DEU11
1: 0.9040223 0.2638725 0.9779399 0.1672789
2: 0.6162678 0.3095942 0.1527307 0.6270880
For more information, see Introduction to data.table.
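For the layout described at the top of the question (a fixed block of industry columns per country, of which a contiguous range is kept), the wanted columns can be generated either by name or by position. The sketch below uses a deliberately small, made-up data frame (3 countries, 6 industries, keeping industries 2 to 4); the names wide, keep, and wanted are illustrative. The last lines show the position arithmetic for the sizes given in the question, assuming the first wanted column really is column 10 and every country occupies exactly 56 consecutive columns.

```r
# Toy stand-in for the real data: 3 countries x 6 industries
countries <- c("AUS", "AUT", "BEL")
n_ind     <- 6
all_cols  <- paste0(rep(countries, each = n_ind), seq_len(n_ind))
wide <- as.data.frame(matrix(seq_len(2 * length(all_cols)), nrow = 2,
                             dimnames = list(NULL, all_cols)))

# By name: build every country/industry combination that should be kept
keep   <- 2:4
wanted <- paste0(rep(countries, each = length(keep)), keep)
selected <- wide[, wanted]

# By pattern: three capital letters followed by exactly one wanted number
by_regex <- grep("^[A-Z]{3}(2|3|4)$", names(wide), value = TRUE)

# By position, for the sizes in the question: 18 columns starting at column 10,
# repeated every 56 columns for 43 countries
pos <- as.vector(outer(0:17, 10 + 56 * (0:42), "+"))
```

Selecting by name is usually the safer choice, because it does not break if a column is added or reordered.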

Adding value range tags to values in data frame

I am trying to add certain tags to values in my data frame, similar to adding a grade column to marks. The only difference is that the grade scales for each subject are different.
Reprex:
# Specifying grade range for each subject
range <- data.frame(Subject = rep(c('Math','Physics'), each = 3),
                    Start = c(91,81,71,81,61,41),
                    End = c(100,90,80,100,80,60),
                    Grade = rep(LETTERS[1:3], 2), stringsAsFactors = FALSE)
# Marks data of students
set.seed(50)
df <- data.frame(Subject = rep(c('Math','Physics'), each = 4),
                 Student = rep(c('Eeny','Meeny','Miny','Mo'), 2),
                 Marks = c(sample(40:100, 7, TRUE), NA))
You may have noticed that there are cases in df where the marks scored by student do not fall under any grade range or the marks are missing. In such cases I want NA under the grade column.
This is what I've tried to do
res <- merge(df,range) %>% filter(between(Marks,Start,End))
But it gives the following error:
Error: Expecting a single value: [extent=24]
And the reason for this may be that the left and right arguments must be a single value and not a vector in the between() function.
I would like to avoid this approach because it creates all possible combinations of matches and only then filters the data. In my case, I have a large data frame, and it takes more than a couple of minutes just to create the merged data frame. Also, with this approach I would miss the rows where marks do not fall under any grade range.
How should I achieve this?
It might be just as easy to write a little function to do this for you ahead of the dplyr pipe:
grade_it <- function(marks, subject)
{
  helper <- function(x, y)
  {
    z <- range$Grade[range$Start <= x & range$End >= x & range$Subject == y]
    if(length(z) == 1) return(z) else return("FAIL")
  }
  mapply(helper, marks, subject)
}
So now you can just do:
df %>% mutate(Grade = grade_it(Marks, Subject))
#> Subject Student Marks Grade
#> 1 Math Eeny 87 B
#> 2 Math Meeny 50 FAIL
#> 3 Math Miny 91 A
#> 4 Math Mo 70 FAIL
#> 5 Physics Eeny 70 B
#> 6 Physics Meeny 89 A
#> 7 Physics Miny 85 A
#> 8 Physics Mo NA FAIL
Is this what you are looking for?
df %>% mutate(Grade = case_when(Subject == "Math" & Marks %in% 91:100 ~ "A",
Subject == "Math" & Marks %in% 81:90 ~ "B",
Subject == "Math" & Marks %in% 71:80 ~ "C",
Subject == "Physics" & Marks %in% 81:100 ~ "A",
Subject == "Physics" & Marks %in% 61:80 ~ "B",
Subject == "Physics" & Marks %in% 41:60 ~ "C",
TRUE ~ NA_character_))
Subject Student Marks Grade
1 Math Eeny 94 A
2 Math Meeny 42 <NA>
3 Math Miny 47 <NA>
4 Math Mo 99 A
5 Physics Eeny 55 C
6 Physics Meeny 57 C
7 Physics Miny 66 B
8 Physics Mo NA <NA>
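Another option that avoids both the merge and hard-coded ranges is to loop over the rows of the lookup table and fill in the grade with vectorised comparisons. This is only a sketch under assumptions: the lookup table is named grade_ranges here (a made-up name, chosen to avoid masking base::range()), the marks are fixed illustrative values rather than the sampled ones, and the ranges within a subject do not overlap. Rows with missing or out-of-range marks keep NA, as requested in the question.

```r
# Lookup table of grade ranges per subject (hypothetical name)
grade_ranges <- data.frame(
  Subject = rep(c("Math", "Physics"), each = 3),
  Start   = c(91, 81, 71, 81, 61, 41),
  End     = c(100, 90, 80, 100, 80, 60),
  Grade   = rep(LETTERS[1:3], 2),
  stringsAsFactors = FALSE
)

# Fixed marks so the result is reproducible (illustrative values)
marks <- data.frame(
  Subject = rep(c("Math", "Physics"), each = 4),
  Student = rep(c("Eeny", "Meeny", "Miny", "Mo"), 2),
  Marks   = c(87, 50, 91, 70, 70, 89, 85, NA),
  stringsAsFactors = FALSE
)

# Start with NA everywhere, then overwrite the rows matched by each range
marks$Grade <- NA_character_
for (i in seq_len(nrow(grade_ranges))) {
  hit <- !is.na(marks$Marks) &
    marks$Subject == grade_ranges$Subject[i] &
    marks$Marks >= grade_ranges$Start[i] &
    marks$Marks <= grade_ranges$End[i]
  marks$Grade[hit] <- grade_ranges$Grade[i]
}
marks$Grade
# "B" NA "A" NA "B" "A" "A" NA
```

The loop runs once per range (six times here), not once per student, so memory use stays proportional to the size of the marks table.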

Programmatically create new variables which are sums of nested series of other variables

I have data giving me the percentage of people in some groups who have various levels of educational attainment:
df <- tibble::tibble(group = c("A", "B"),
                     no.highschool = c(20, 10),
                     high.school = c(70,40),
                     college = c(10, 40),
                     graduate = c(0,10))
df
# A tibble: 2 x 5
group no.highschool high.school college graduate
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 20. 70. 10. 0.
2 B 10. 40. 40. 10.
E.g., in group A 70% of people have a high school education.
I want to generate 4 variables that give me the proportion of people in each group with less than each of the 4 levels of education (e.g., lessthan_no.highschool, lessthan_high.school, etc.).
desired df would be:
desired.df <- data.frame(group = c("A", "B"),
no.highschool = c(20, 10),
high.school = c(70,40),
college = c(10, 40),
graduate = c(0,10),
lessthan_no.highschool = c(0,0),
lessthan_high.school = c(20, 10),
lessthan_college = c(90, 50),
lessthan_graduate = c(100, 90))
In my actual data I have many groups and a lot more levels of education. Of course I could do this one variable at a time, but how could I do this programmatically (and elegantly) using tidyverse tools?
I would start by doing something like a mutate_at() inside of a map(), but where I get tripped up is that the list of variables being summed is different for each of the new variables. You could pass in the list of new variables and their corresponding variables to be summed as two lists to a pmap(), but it's not obvious how to generate that second list concisely. Wondering if there's some kind of nesting solution...
Here is a base R solution. Though the question asks for a tidyverse one, considering the dialog in the comments to the question I have decided to post it.
It uses apply and cumsum to do the hard work. Then there are some cosmetic concerns before cbinding into the final result.
tmp <- apply(df[-1], 1, function(x){
s <- cumsum(x)
100*c(0, s[-length(s)])/sum(x)
})
rownames(tmp) <- paste("lessthan", names(df)[-1], sep = "_")
desired.df <- cbind(df, t(tmp))
desired.df
# group no.highschool high.school college graduate lessthan_no.highschool
#1 A 20 70 10 0 0
#2 B 10 40 40 10 0
# lessthan_high.school lessthan_college lessthan_graduate
#1 20 90 100
#2 10 50 90
how could I do this programmatically (and elegantly) using tidyverse tools?
Definitely the first step is to tidy your data. Encoding information (like edu level) in column names is not tidy. When you convert education to a factor, make sure the levels are in the correct order - I used the order in which they appeared in the original data column names.
library(tidyr)
library(dplyr)  # needed for mutate(), group_by(), lag() and arrange()
tidy_result = df %>% gather(key = "education", value = "n", -group) %>%
  mutate(education = factor(education, levels = names(df)[-1])) %>%
  group_by(group) %>%
  mutate(lessthan_x = lag(cumsum(n), default = 0) / sum(n) * 100) %>%
  arrange(group, education)
tidy_result
# # A tibble: 8 x 4
# # Groups: group [2]
# group education n lessthan_x
# <chr> <fct> <dbl> <dbl>
# 1 A no.highschool 20 0
# 2 A high.school 70 20
# 3 A college 10 90
# 4 A graduate 0 100
# 5 B no.highschool 10 0
# 6 B high.school 40 10
# 7 B college 40 50
# 8 B graduate 10 90
This gives us a nice, tidy result. If you want to spread/cast this data into your un-tidy desired.df format, I would recommend using data.table::dcast, as (to my knowledge) the tidyverse does not offer a nice way to spread multiple columns. See Spreading multiple columns with tidyr or How can I spread repeated measures of multiple variables into wide format? for the data.table solution or an inelegant tidyr/dplyr version. Before spreading, you could create a key less_than_x_key = paste("lessthan", education, sep = "_").
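Since only one value column (lessthan_x) needs to be spread here, a base R sketch with xtabs() may also be enough. This is an alternative to the dcast route, not a replacement for it: the tidy_result below is rebuilt by hand from the printed output so the example stands alone, and the name wide is made up. It relies on there being exactly one row per group and education level, because xtabs() sums the values that fall into each cell.

```r
lv <- c("no.highschool", "high.school", "college", "graduate")

# Hand-built copy of the tidy result shown above
tidy_result <- data.frame(
  group      = rep(c("A", "B"), each = 4),
  education  = factor(rep(lv, 2), levels = lv),
  lessthan_x = c(0, 20, 90, 100, 0, 10, 50, 90)
)

# Cross-tabulate: one row per group, one column per education level
wide <- as.data.frame.matrix(
  xtabs(lessthan_x ~ group + education, data = tidy_result)
)
names(wide) <- paste("lessthan", names(wide), sep = "_")
wide <- cbind(group = rownames(wide), wide)
wide
```

The factor levels of education determine the column order, which is why the levels are set explicitly.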

How to calculate the percent change of a data frame column, then the next, and so on?

I have the below data-frame, call it "p":
                   Q1          Q2          Q3
X Product 4.986184956 5.083868356 5.109861156
Y Product  2.86990877 2.834816682 2.904347607
Z Product  6.58413545 6.238497279  6.40142101
I would like to calculate the percent change between each of the columns in p, and place the output for each column into a new data-frame called "pchange".
I've tried using the lag() function, but I haven't been successful with it. (I'm still quite new with the language.)
I really appreciate any thoughts on how best to tackle this. Thank you!
Copying from my above comment. Simple solution using dplyr::transmute:
pchange <- df %>%
transmute(
change_Q1_Q2 = ((Q2 - Q1)/Q1)*100,
change_Q2_Q3 = ((Q3 - Q2)/Q2)*100
)
gives
# A tibble: 3 x 2
change_Q1_Q2 change_Q2_Q3
<dbl> <dbl>
1 1.959081 0.511280
2 -1.222760 2.452749
3 -5.249560 2.611586
If you wanted to keep the Product column you could use mutate instead of transmute. I'd echo Jens Leerssen's endorsement of R for Data Science.
(Assuming your data is structured like so)
df <- tibble::tribble(
~Product, ~Q1, ~Q2, ~Q3,
"X Product", 4.986184956, 5.083868356, 5.109861156,
"Y Product", 2.86990877, 2.834816682, 2.904347607,
"Z Product", 6.58413545, 6.238497279, 6.40142101)
Here are a few different approaches. No packages are used.
1) Divide all but the first 2 columns by all but the first and last columns, subtract 1 and multiply by 100. Combine that with the original first column and NA times the original second column.
data.frame(DF[1], NA * DF[2], 100 * (DF[-(1:2)] / DF[-c(1, ncol(DF))] - 1))
giving:
Product Q1 Q2 Q3
1 X Product NA 1.959081 0.511280
2 Y Product NA -1.222760 2.452749
3 Z Product NA -5.249560 2.611586
1a) A variation of (1) that is even shorter is based on working in the log domain and then converting back:
data.frame(DF[1], NA * DF[2], 100 * t(exp(diff(t(log(DF[-1]))))-1))
giving:
Product Q1 Q2 Q3
1 X Product NA 1.959081 0.511280
2 Y Product NA -1.222760 2.452749
3 Z Product NA -5.249560 2.611586
2) Define a function percent which calculates the percentages based on vector x returning a vector the same length as x filling in the first element with NA since there is no prior value for which to calculate its percent. Then apply that to each row noting that apply will return the transpose of what we want so transpose it back.
percent <- function(x) 100 * c(NA * x[1], diff(x) / head(x, -1))
data.frame(DF[1], t(apply(DF[-1], 1, percent)))
giving:
Product Q1 Q2 Q3
1 X Product NA 1.959081 0.511280
2 Y Product NA -1.222760 2.452749
3 Z Product NA -5.249560 2.611586
Note: The input DF in reproducible form was assumed to be:
DF <- structure(list(Product = structure(1:3, .Label = c("X Product",
"Y Product", "Z Product"), class = "factor"), Q1 = c(4.986184956,
2.86990877, 6.58413545), Q2 = c(5.083868356, 2.834816682, 6.238497279
), Q3 = c(5.109861156, 2.904347607, 6.40142101)), .Names = c("Product",
"Q1", "Q2", "Q3"), class = "data.frame", row.names = c(NA, -3L
))
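If a separate pchange data frame with descriptively named columns is wanted, as in the question, the column arithmetic from (1) can be combined with generated names. A minimal sketch, assuming all columns after the first are numeric and ordered in time; the change_Qx_Qy naming is just one possible convention.

```r
p <- data.frame(
  Product = c("X Product", "Y Product", "Z Product"),
  Q1 = c(4.986184956, 2.86990877, 6.58413545),
  Q2 = c(5.083868356, 2.834816682, 6.238497279),
  Q3 = c(5.109861156, 2.904347607, 6.40142101)
)

vals <- as.matrix(p[-1])                    # numeric columns only
cur  <- vals[, -1, drop = FALSE]            # Q2, Q3, ...
prev <- vals[, -ncol(vals), drop = FALSE]   # Q1, Q2, ...

# Percent change of each column relative to the one before it
chg <- 100 * (cur / prev - 1)
colnames(chg) <- paste0("change_", colnames(prev), "_", colnames(cur))

pchange <- data.frame(Product = p$Product, chg)
pchange
```

Adding a Q4 column to p would produce a change_Q3_Q4 column without touching the code.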
A clean and readily extensible solution can be most easily achieved by tidying up your dataframe. The subject can get complicated, but essentially just make it so each row is a single observation and each column is one variable.
While constructing direct references between your columns may get you a quick win, if you start adding more columns you will be forced to write more code. With tidy data, you will not. The tidy solutions will handle the updated data without further hiccups.
Using a rebuild of your view of your dataframe: p
library(tidyverse)
id <- c("X", "Y", "Z")
object <- "Product"
Q1 <- c(4.986184956, 2.86990877, 6.58413545)
Q2 <- c(5.083868356, 2.834816682, 6.238497279)
Q3 <- c(5.109861156, 2.904347607, 6.40142101)
p <- tibble(id, object, Q1, Q2, Q3)
> p
# A tibble: 3 x 5
id object Q1 Q2 Q3
<chr> <chr> <dbl> <dbl> <dbl>
1 X Product 4.986185 5.083868 5.109861
2 Y Product 2.869909 2.834817 2.904348
3 Z Product 6.584135 6.238497 6.401421
You can then execute the transform in tidyverse as below:
tidy_p_change <-
p %>%
gather(qrtr, perf, c(Q1:Q3)) %>% # tidy the data
arrange(id, qrtr) %>% # prep for lag (and easy auditing)
group_by(id) %>% # keep the lags within products
mutate(prev_q = lag(perf), # bring data together into same row
pct_chng = (perf/prev_q - 1)*100
) %>%
select(-c(perf, prev_q)) %>% # stop showing the work
spread(qrtr, pct_chng) # spread the data back out into a `pivot table`
Which will give you this output:
> tidy_p_change
# A tibble: 3 x 5
# Groups: id [3]
id object Q1 Q2 Q3
* <chr> <chr> <dbl> <dbl> <dbl>
1 X Product NA 1.959081 0.511280
2 Y Product NA -1.222760 2.452749
3 Z Product NA -5.249560 2.611586
I have left the wrangling in its verbose form. I can spool the wire down tighter, but thought it best to show all the steps. Let us know if you would like to see a more bummed down version, too.
Additionally, a really great treatment of working with tidy data (and working in the tidyverse) can be found in Hadley Wickham's R for Data Science.

R conditional lookup and sum

I have data on college course completions, with estimated numbers of students from each cohort completing after 1, 2, 3, ... 7 years. I want to use these estimates to calculate the total number of students outputting from each College and Course in any year.
The output of students in a given year will be the sum of the previous 7 cohorts outputting after 1, 2, 3, ... 7 years.
For example, the number of students outputting in 2014 from COLLEGE 1, COURSE A is equal to the sum of:
Output of 2013 cohort (College 1, Course A) after 1 year +
Output of 2012 cohort (College 1, Course A) after 2 years +
Output of 2011 cohort (College 1, Course A) after 3 years +
Output of 2010 cohort (College 1, Course A) after 4 years +
Output of 2009 cohort (College 1, Course A) after 5 years +
Output of 2008 cohort (College 1, Course A) after 6 years +
Output of 2007 cohort (College 1, Course A) after 7 years
So there are two dataframes: a lookup table that contains all the output estimates, and a smaller summary table that I'm trying to modify. I want to update dummy.summary$output with, for each row, the total output based on the above calculation.
The following code will replicate my data pretty well
# Lookup table
dummy.lookup <- data.frame(cohort = rep(1998:2014, each = 210),
college = rep(rep(paste("College", 1:6), each = 35), 17),
course = rep(rep(paste("Course", LETTERS[1:5]), each = 7),102),
intake = rep(sample(x = 150:300, size = 510, replace=TRUE), each = 7),
output.year = rep(1:7, 510),
output = sample(x = 10:20, size = 3570, replace=TRUE))
# Summary table to be modified
dummy.summary <- aggregate(x = dummy.lookup["intake"], by = list(dummy.lookup$cohort, dummy.lookup$college, dummy.lookup$course), FUN = mean)
names(dummy.summary)[1:3] <- c("year", "college", "course")
dummy.summary <- dummy.summary[order(dummy.summary$year, dummy.summary$college, dummy.summary$course), ]
dummy.summary$output <- 0
The following code does not work, but shows the approach I've been attempting.
dummy.summary$output <- sapply(dummy.summary$output, function(x){
# empty vector to fill with output values
vec <- c()
# Find relevant output for college + course, from each cohort and exit year
for(j in 1:7){
append(x = vec,
values = dummy.lookup[dummy.lookup$college==dummy.summary[x, "college"] &
dummy.lookup$course==dummy.summary[x, "course"] &
dummy.lookup$cohort==dummy.summary[x, "year"]-j &
dummy.lookup$output.year==j, "output"])
}
# Sum and return total output
sum_vec <- sum(vec)
return(sum_vec)
}
)
I guess it doesn't work because I was hoping to use 'x' in the anonymous function to index particular values of the dummy.summary dataframe. But that clearly isn't happening and is only returning zero for each row, presumably because the starting value of 'x' is zero each time. I don't know if it is possible to access the index position of each value that sapply loops over, and use that to index my summary dataframe.
Is this approach fixable or do I need a completely different approach?
Even if it is fixable, is there a more elegant/faster way to achieve what I'm trying to do?
Thanks in anticipation.
I've just updated your output.year to output.year2 where instead of a value from 1 to 7 it gets a value of a year based on the cohort you have.
I've realised that the output information you want corresponds to the output.year, but the intake information you want corresponds to the cohort. So, I calculate them separately and then I join tables/information. This automatically creates empty (NA that I transform to 0) output info for 1998.
# fix your random sampling
set.seed(24)
# Lookup table
dummy.lookup <- data.frame(cohort = rep(1998:2014, each = 210),
college = rep(rep(paste("College", 1:6), each = 35), 17),
course = rep(rep(paste("Course", LETTERS[1:5]), each = 7),102),
intake = rep(sample(x = 150:300, size = 510, replace=TRUE), each = 7),
output.year = rep(1:7, 510),
output = sample(x = 10:20, size = 3570, replace=TRUE))
library(dplyr)
# create result table for output info
dt_output =
dummy.lookup %>%
mutate(output.year2 = output.year+cohort) %>% # update output.year to get a year value
group_by(output.year2, college, course) %>% # for each output year, college, course
summarise(SumOutput = sum(output)) %>% # calculate sum of output
ungroup() %>%
arrange(college,course,output.year2) %>% # for visualisation purposes
rename(cohort = output.year2) # rename column
# create result for intake info
dt_intake =
dummy.lookup %>%
select(cohort, college, course, intake) %>% # select useful columns
distinct() # keep distinct rows/values
# join info
dt_intake %>%
full_join(dt_output, by=c("cohort","college","course")) %>%
mutate(SumOutput = ifelse(is.na(SumOutput),0,SumOutput)) %>%
arrange(college,course,cohort) %>% # for visualisation purposes
as_tibble() # for printing purposes
# Source: local data frame [720 x 5]
#
# cohort college course intake SumOutput
# (int) (fctr) (fctr) (int) (dbl)
# 1 1998 College 1 Course A 194 0
# 2 1999 College 1 Course A 198 11
# 3 2000 College 1 Course A 223 29
# 4 2001 College 1 Course A 198 45
# 5 2002 College 1 Course A 289 62
# 6 2003 College 1 Course A 163 78
# 7 2004 College 1 Course A 211 74
# 8 2005 College 1 Course A 181 108
# 9 2006 College 1 Course A 277 101
# 10 2007 College 1 Course A 157 109
# .. ... ... ... ... ...
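On the question of whether the original sapply() approach is fixable: one way is to iterate over row indices rather than over the values of the output column, so that i can be used to look up the college and year of each summary row. The sketch below uses a small deterministic stand-in for the data (two colleges, no course column, three output years); the names lookup and summary_tbl and the dummy output values are made up for illustration. A cohort contributes to a given year when cohort + output.year equals that year.

```r
# Small deterministic stand-in for the lookup table
lookup <- expand.grid(output.year = 1:3,
                      college = c("College 1", "College 2"),
                      cohort = 2000:2003,
                      stringsAsFactors = FALSE)
lookup$output <- seq_len(nrow(lookup))   # dummy estimates: 1, 2, 3, ...

# One summary row per year and college
summary_tbl <- unique(lookup[c("cohort", "college")])
names(summary_tbl)[1] <- "year"

# Iterate over row indices so each row's college and year can be looked up
summary_tbl$output <- sapply(seq_len(nrow(summary_tbl)), function(i) {
  hit <- lookup$college == summary_tbl$college[i] &
    lookup$cohort + lookup$output.year == summary_tbl$year[i]
  sum(lookup$output[hit])   # sum() of an empty selection is 0
})
summary_tbl
```

This scans the lookup table once per summary row, so for large data the grouped dplyr approach above is likely to be faster. The indexed version mainly shows why the original attempt returned zeros.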

Resources