Using Mutate to rank specific columns - r

I'm a relative newbie to dplyr. I have a data.frame organized so that each store name and source (made up of the results for 2018) makes up an observation. The variables are total revenue, quantity, customer experience score, and a few others.
I'd like to rank each numeric column in the data.frame and create new observations holding the ranks. Most variables would be ranked in descending order, but customer experience and one additional column would be ranked in ascending order. The source for these new observations would be called "ranks".
store <- c("NYC", "Chicago", "Boston")
source <- c("2018", "2018", "2018")
revenue <- c(10000, 50000, 2000)
quantity <- c(100, 50, 20)
satisfaction <- c(3, 2, 5)
table <- cbind(store, source, revenue, quantity, satisfaction)
I was able to get what I needed using mutate, but I had to manually name each new column. I'm sure there is a more efficient way to rank these values out there!
Here is what I originally did:
table <- table %>%
  mutate(revenue_rank = rank(-revenue),
         quantity_rank = rank(-quantity),
         satisfaction_rank = rank(satisfaction))

In general, if you're having to do something repeatedly across columns of a data frame, such as calculating ranks, you probably want to reshape to long data. Also note that what you got from cbind is a matrix, not a data frame, which is probably not what you want: it means your numeric variables actually come through as characters. Instead of cbind, use data.frame or data_frame (for a tibble).
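A quick check makes the problem visible (a small sketch using the vectors from the question):
class(cbind(store, source, revenue))               # a matrix, not a data frame
cbind(store, source, revenue)[, "revenue"]         # "10000" "50000" "2000" -- characters!
class(data.frame(store, source, revenue)$revenue)  # "numeric", as intended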
What I did here is gathered into a long data frame, grouped by the measures (revenue, quantity, or satisfaction), then gave ranks based on the value, keeping in mind that you wanted different orders for satisfaction and the other measures.
library(tidyverse)
store <- c("NYC", "Chicago", "Boston")
source <- c("2018", "2018", "2018")
revenue <- c(10000, 50000, 2000)
quantity <- c(100, 50, 20)
satisfaction <- c(3, 2, 5)
df <- data_frame(store, source, revenue, quantity, satisfaction)
df %>%
  gather(key = measure, value = value, revenue:satisfaction) %>%
  group_by(measure) %>%
  mutate(rank = ifelse(measure == "satisfaction", rank(value), rank(-value))) %>%
  ungroup() %>%
  select(-value) %>%
  mutate(measure = paste(measure, "rank", sep = "_")) %>%
  spread(key = measure, value = rank)
#> # A tibble: 3 x 5
#>   store   source quantity_rank revenue_rank satisfaction_rank
#>   <chr>   <chr>          <dbl>        <dbl>             <dbl>
#> 1 Boston  2018               3            3                 3
#> 2 Chicago 2018               2            1                 1
#> 3 NYC     2018               1            2                 2
Created on 2018-05-04 by the reprex package (v0.2.0).
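As an aside (not part of the original answer): on dplyr 1.0 or later you can get the same ranks without reshaping, by using across() with the .names argument. A sketch, assuming the df built above:
df %>%
  mutate(across(c(revenue, quantity), ~ rank(-.x), .names = "{.col}_rank"),
         satisfaction_rank = rank(satisfaction))
The long-data version still scales better when the set of descending- vs ascending-ranked columns is large or determined programmatically.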

Related

matching yearly time points to preceding 365 days of data in R

I am trying to merge two datasets. The survey dataset consists of biodiversity surveys from different regions conducted every 1-5 years in a certain month (the month is constant within, but not between, regions). The temperature dataset consists of daily temperature readings in each survey region.
For multiple surveys that have different start months and temporal extents, I want to pair each survey*year combination with the twelve months of temperature data preceding it. In other words, I want to pair a May 1983 survey with the 12 months (or 365 days -- I don't care which) of daily temperature records preceding it, ending April 30, 1983. Meanwhile, another survey elsewhere conducted in August 1983 needs to be paired with the 365 days of temperature data ending July 31, 1983.
There are (at least) two ways to do this -- one would be joining the survey data to the (longer) temperature data and then somehow subsetting or identifying which dates fall in the 12 months preceding the survey-date. Another is to start with the survey data and try to pair the temperature data to each row with a matrix-column -- I tried doing this with time-series tools from tsibble and tsModel but couldn't get it to "lag" the right values when grouped by region.
I was able to create an identifier to join the datasets such that each date in the temperature data is matched with the subsequent survey in time. However, not all of those are within 365 days (e.g., in the dataset created below, the date 1983-06-03 is matched with the ref_year aleutian_islands-5-1986 because the survey only happens every 3-5 years).
Here are some examples of the behavior I want for a single region (from the example dataset below), although I'm open to solutions that achieve the same thing but don't look exactly like this:
For this row, the value in the new column that I want to generate (ref_match) should be NA; the date is more than 365 days before ref_year.
  region           date        year month month_year ref_year                temperature
  <chr>            <date>     <dbl> <dbl> <chr>      <chr>                         <dbl>
1 aleutian_islands 1982-06-09  1982     6 6-1982     aleutian_islands-5-1983           0
For this row, ref_match should be aleutian_islands-5-2014 since the date is within 12 months of ref_year.
  region           date        year month month_year ref_year                temperature
  <chr>            <date>     <dbl> <dbl> <chr>      <chr>                         <dbl>
1 aleutian_islands 2013-07-22  2013     7 7-2013     aleutian_islands-5-2014       0.998
The following script will generate a dataset temp_dat with columns like those in the snippets above from which I hope to generate the ref_match column.
# load packages
library(tidyverse)
library(lubridate)
set.seed(10)
# make survey dfs
ai_dat <- data.frame("year" = c(1983, 1986, 1991, 1994, 1997), "region" = "aleutian_islands", "startmonth" = 5)
ebs_dat <- data.frame("year" = seq(1983, 1999, 1), "region" = "eastern_bering_sea", "startmonth" = 5)
# join and create what will become ref_year column
surv_dat <- rbind(ai_dat, ebs_dat) %>%
  mutate(month_year = paste0(startmonth, "-", year)) %>%
  select(region, month_year) %>%
  distinct() %>%
  mutate(region_month_year = paste0(region, "-", month_year))
# expand out to all possible month*year combinations for joining with temperature
surv_dat_exploded <- expand.grid(month = seq(1, 12, 1),
                                 year = seq(1982, 2000, 1),
                                 region = c('aleutian_islands', 'eastern_bering_sea')) %>%
  # factorial combo of every possible month*year; have to start in 1982, even though
  # we can't use surveys before 1983, because we need to match temperature data from 1982
  mutate(region_month_year = paste0(region, "-", month, "-", year)) %>% # unique identifier
  mutate(ref_year = ifelse(region_month_year %in% surv_dat$region_month_year, region_month_year, NA),
         month_year = paste0(month, "-", year)) %>%
  select(region, month_year, ref_year) %>%
  distinct() %>%
  group_by(region) %>%
  fill(ref_year, .direction = "up") %>% # fill each region with the survey to which env data from each month*year should correspond
  ungroup()
# make temperature dataset and join in survey ref_year column
temp_dat <- data.frame(expand.grid(date = seq(ymd("1982-01-01"), ymd("1999-12-31"), "days"),
                                   region = c('aleutian_islands', 'eastern_bering_sea'))) %>%
  mutate(temperature = rnorm(nrow(.), 10, 5), # fill in with fake data
         year = year(date),
         month = month(date),
         month_year = paste0(month, "-", year)) %>%
  left_join(surv_dat_exploded, by = c('region', 'month_year')) %>%
  filter(!is.na(ref_year)) # get rid of dates that are after any ref_year
Sounds like you want a non-equi join. This is easily done with data.table and is very fast. Here's an example that lightly modifies your MWE:
library(data.table)
# make survey dfs
ai_dat = data.table(year = c(1983, 1986, 1991, 1994, 1997),
                    region = "aleutian_islands", startmonth = 5)
ebs_dat = data.table(year = seq(1983, 1999, 1),
                     region = "eastern_bering_sea", startmonth = 5)
# bind together and create date (and cutoffdate) vars
surv_dat = rbind(ai_dat, ebs_dat)
surv_dat[, startdate := as.IDate(paste(year, startmonth, '01', sep = '-'))
         ][, cutoffdate := startdate - 365L]
# make temperature df
temp_dat = CJ(date = seq(as.IDate("1982-01-01"), as.IDate("1999-12-31"), "days"),
              region = c('aleutian_islands', 'eastern_bering_sea'))
# add temperature var
temp_dat$temp = rnorm(nrow(temp_dat))
# create duplicate date variable (will make post-join processing easier)
temp_dat[, matchdate := date]
# Optional: Set keys for better join performance
setkey(surv_dat, region, startdate)
setkey(temp_dat, region, matchdate)
# Where the magic happens: Non-equi join
surv_dat = temp_dat[surv_dat, on = .(region == region,
                                     matchdate <= startdate,
                                     matchdate >= cutoffdate)]
# Optional: get rid of unneeded columns
surv_dat[, c('matchdate', 'matchdate.1') := NULL][]
#>             date             region       temp year startmonth
#>    1: 1982-05-01   aleutian_islands  0.3680810 1983          5
#>    2: 1982-05-02   aleutian_islands  0.8349334 1983          5
#>    3: 1982-05-03   aleutian_islands -1.3622227 1983          5
#>    4: 1982-05-04   aleutian_islands  1.4327587 1983          5
#>    5: 1982-05-05   aleutian_islands  0.5068226 1983          5
#>   ---
#> 8048: 1999-04-27 eastern_bering_sea -1.2924594 1999          5
#> 8049: 1999-04-28 eastern_bering_sea  0.7519078 1999          5
#> 8050: 1999-04-29 eastern_bering_sea -1.0185174 1999          5
#> 8051: 1999-04-30 eastern_bering_sea -1.4322252 1999          5
#> 8052: 1999-05-01 eastern_bering_sea -1.0412836 1999          5
Created on 2021-05-20 by the reprex package (v2.0.0)
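Since every retained temperature row is now tagged with the survey it feeds into, a per-survey summary is a natural next step. A minimal sketch (column names as in the join result above):
# average temperature over the 365 days preceding each survey
surv_dat[, .(mean_temp = mean(temp)), by = .(region, year)]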
Try this solution.
I basically used your reference column to generate a reference date and compute the difference in days between each observation and that reference. Then I used a simple ifelse to test whether the date falls within the 365-day range and, if so, copy the temperature into the temp_valid column.
# load packages
library(tidyverse)
library(lubridate)
set.seed(10)
# make survey dfs
ai_dat <- data.frame("year" = c(1983, 1986, 1991, 1994, 1997), "region" = "aleutian_islands", "startmonth" = 5)
ebs_dat <- data.frame("year" = seq(1983, 1999, 1), "region" = "eastern_bering_sea", "startmonth" = 5)
# join and create what will become ref_year column
surv_dat <- rbind(ai_dat, ebs_dat) %>%
  mutate(year_month = paste0(year, "-", startmonth),
         region_year_month = paste0(region, "-", year, "-", startmonth))
# expand out to all possible month*year combinations for joining with temperature
surv_dat_exploded <-
  expand.grid(month = seq(1, 12, 1),
              year = seq(1982, 2000, 1),
              region = c('aleutian_islands', 'eastern_bering_sea')) %>%
  # factorial combo of every possible month*year; have to start in 1982, even though
  # we can't use surveys before 1983, because we need to match temperature data from 1982
  mutate(year_month = paste0(year, "-", month)) %>%
  mutate(region_year_month = paste0(region, "-", year, "-", month)) %>%
  mutate(ref_year = ifelse(region_year_month %in% surv_dat$region_year_month, region_year_month, NA)) %>%
  group_by(region) %>%
  fill(ref_year, .direction = "up") %>% # fill each region with the survey to which env data from each month*year should correspond
  ungroup()
# make temperature dataset and join in survey ref_year column
temp_dat <- data.frame(expand.grid(date = seq(ymd("1982-01-01"), ymd("1999-12-31"), "days"),
                                   region = c('aleutian_islands', 'eastern_bering_sea'))) %>%
  mutate(temperature = rnorm(nrow(.), 10, 5), # fill in with fake data
         year = year(date),
         month = month(date),
         year_month = paste0(year, "-", month))
final_df <-
  left_join(temp_dat, surv_dat_exploded, by = c('region', 'year_month')) %>%
  # split ref column into ref_region and ref_year
  separate(ref_year, c("ref_region", "ref_year"), "-", extra = "merge") %>%
  # convert ref_year into a date; paste on a day first, since as.Date needs one
  # (the original format string "%Y-%M" would fail: %M is minutes, not months)
  mutate(ref_year = as.Date(paste0(ref_year, "-01"), format = "%Y-%m-%d")) %>%
  # difference in days between the observation and the reference
  mutate(diff_days = as.numeric(date - ref_year)) %>%
  # ifelse statement for capturing values of interest
  mutate(temp_valid = ifelse(between(diff_days, -365, 0), temperature, NA))
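For completeness (an addition, not part of either answer): dplyr 1.1.0 and later support non-equi joins directly via join_by(), so the month-year scaffolding can be skipped entirely. A sketch, assuming the surv_dat and temp_dat objects built above:
surv_windows <- surv_dat %>%
  mutate(startdate = as.Date(paste(year, startmonth, "01", sep = "-")),
         cutoffdate = startdate - 365)
temp_dat %>%
  inner_join(surv_windows,
             by = join_by(region, date <= startdate, date >= cutoffdate))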

dplyr: applying calculations with differing column names

I am trying to compute the following ratio:
Interest Expense / (Total Debt summed over all years / number of years)
i.e. interest expense divided by the average Total Debt.
The data looks like the following;
GE2017 GE2016 GE2015 GE2014
Interest Expense -2753000 -2026000 -1706000 -1579000
Long Term Debt 108575000 105080000 144659000 186596000
Short/Current Long Term Debt 134591000 136211000 197602000 261424000
Total_Debt 243166000 241291000 342261000 448020000
GOOG2017 GOOG2016 GOOG2015 GOOG2014
Interest Expense -109000 -124000 -104000 -101000
Long Term Debt 3943000 3935000 1995000 2992000
Short/Current Long Term Debt 3969000 3935000 7648000 8015000
Total_Debt 7912000 7870000 9643000 11007000
NVDA2018 NVDA2017 NVDA2016 NVDA2015
Interest Expense -61000 -58000 -47000 -46000
Long Term Debt 1985000 1985000 7000 1384000
Short/Current Long Term Debt 2000000 2791000 1434000 1398000
Total_Debt 3985000 4776000 1441000 2782000
That is, for GE, I am trying to take interest expense for the latest year -2753000 divide this by the average of Total Debt for all 4 years for GE.
So;
-2753000 / mean(c(243166000, 241291000, 342261000, 448020000)) = -0.0086
However I am running into problems with group_by() when taking the average since GE and the other firms have different column names due to the different years.
cost_of_debt %>%
  t() %>%
  data.frame() %>%
  rownames_to_column('rn') %>%
  group_by(rn)
  # calculation here
Secondly; If possible, I would like to do the same calculation as above but use only the last two years of each firm.
-2753000 / mean(c(243166000, 241291000)) = -0.0114
Would perhaps a grepl function work here?
I have a vector called symbols.
symbols <- c("NVDA", "GOOG", "GE")
Data:
cost_of_debt <- structure(list(GE2017 = c(-2753000, 108575000, 134591000, 243166000
), GE2016 = c(-2026000, 105080000, 136211000, 241291000), GE2015 = c(-1706000,
144659000, 197602000, 342261000), GE2014 = c(-1579000, 186596000,
261424000, 448020000), GOOG2017 = c(-109000, 3943000, 3969000,
7912000), GOOG2016 = c(-124000, 3935000, 3935000, 7870000), GOOG2015 = c(-104000,
1995000, 7648000, 9643000), GOOG2014 = c(-101000, 2992000, 8015000,
11007000), NVDA2018 = c(-61000, 1985000, 2e+06, 3985000), NVDA2017 = c(-58000,
1985000, 2791000, 4776000), NVDA2016 = c(-47000, 7000, 1434000,
1441000), NVDA2015 = c(-46000, 1384000, 1398000, 2782000)), .Names = c("GE2017",
"GE2016", "GE2015", "GE2014", "GOOG2017", "GOOG2016", "GOOG2015",
"GOOG2014", "NVDA2018", "NVDA2017", "NVDA2016", "NVDA2015"), row.names = c("Interest Expense",
"Long Term Debt", "Short/Current Long Term Debt", "Total_Debt"
), class = "data.frame")
For the first case, after creating a column from the row names (rownames_to_column from tibble), separate it into 'firm' and 'year' by splitting at the junction between the end of the firm name and the start of the year. Then, grouped by 'firm', create a 'New' column as the ratio of 'Interest.Expense' to the mean of 'Total_Debt'. For the second case, arrange by 'year', take the mean of the last two 'Total_Debt' values for each 'firm', and divide 'Interest.Expense' by that.
library(dplyr)
library(tidyr)   # for separate()
library(tibble)  # for rownames_to_column()
cost_of_debt %>%
  t() %>%
  data.frame() %>%
  rownames_to_column('rn') %>%
  separate(rn, into = c("firm", "year"),
           "(?<=[A-Z])(?=[0-9])", convert = TRUE) %>%
  group_by(firm) %>%
  mutate(New = Interest.Expense / mean(Total_Debt)) %>%
  arrange(firm, year) %>%
  mutate(NewLast = Interest.Expense / mean(tail(Total_Debt, 2)))
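To see what the lookaround pattern is doing, you can test it on its own; a quick sketch (assuming stringr is available):
library(stringr)
str_split(c("GE2017", "NVDA2018"), "(?<=[A-Z])(?=[0-9])")
# splits at the boundary between a letter and a digit:
# "GE" "2017" and "NVDA" "2018"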
I think you need to clean your data first so that it is easier to understand what is an observation and what is a variable. Google tidy data :) Here is my solution. First I make the data tidy, then the calculations are straightforward.
library(tidyverse)
library(stringr)
# Clean and make the data tidy
cost_of_debt <- cost_of_debt %>%
  rownames_to_column(var = "indicator") %>%  # grab row names before converting: as_tibble() drops them
  as_tibble() %>%
  mutate(indicator = str_replace_all(indicator, regex("\\s|\\/"), "_")) %>%
  gather(k, value, -indicator) %>%
  separate(k, into = c("company", "year"), -4) %>%  # split 4 characters from the end: "GE2017" -> "GE", "2017"
  spread(indicator, value) %>%
  rename_all(tolower)
Results in the data looking like this:
  company year  interest_expense long_term_debt short_current_long_term_debt total_debt
  <chr>   <chr>            <dbl>          <dbl>                        <dbl>      <dbl>
1 GE      2014          -1579000      186596000                    261424000  448020000
2 GE      2015          -1706000      144659000                    197602000  342261000
3 GE      2016          -2026000      105080000                    136211000  241291000
4 GE      2017          -2753000      108575000                    134591000  243166000
5 GOOG    2014           -101000        2992000                      8015000   11007000
Then we can answer your question:
cost_of_debt <- cost_of_debt %>%
  group_by(company) %>%
  mutate(int_over_totdept4 = interest_expense / mean(total_debt),
         # use each firm's two most recent years; NVDA's are 2018 and 2017,
         # so hard-coding c("2017", "2016") would pick the wrong ones
         int_over_totdept2 = interest_expense / mean(total_debt[order(year, decreasing = TRUE)[1:2]]))
Which gives a dataframe (with your new variables furthest to the right):
  company year  interest_expense long_term_debt short_current_long_term_debt total_debt int_over_totdept4 int_over_totdept2
  <chr>   <chr>            <dbl>          <dbl>                        <dbl>      <dbl>             <dbl>             <dbl>
1 GE      2014          -1579000      186596000                    261424000  448020000          -0.00495          -0.00652
2 GE      2015          -1706000      144659000                    197602000  342261000          -0.00535          -0.00704
3 GE      2016          -2026000      105080000                    136211000  241291000          -0.00636          -0.00836
4 GE      2017          -2753000      108575000                    134591000  243166000          -0.00864          -0.0114
5 GOOG    2014           -101000        2992000                      8015000   11007000          -0.0111           -0.0128
And if you want the answers to your questions in summarized form:
# First question:
cost_of_debt %>% filter(company == "GE", year == "2017") %>% select(company, year, int_over_totdept4)
# Second question:
cost_of_debt %>% filter(year == "2017") %>% select(company, year, int_over_totdept2)
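If you only want one number per firm rather than a new column on every row, a grouped summarise works too. A sketch using the tidy cost_of_debt built above, taking each firm's latest year of interest expense as the question describes:
cost_of_debt %>%
  group_by(company) %>%
  summarise(ratio = interest_expense[year == max(year)] / mean(total_debt))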

Collapse data frame, by group, using lists of variables for weighted average AND sum

I want to collapse the following data frame, using both summation and weighted averages, according to groups.
I have the following data frame
group_id = c(1,1,1,2,2,3,3,3,3,3)
var_1 = sample.int(20, 10)
var_2 = sample.int(20, 10)
var_percent_1 =rnorm(10,.5,.4)
var_percent_2 =rnorm(10,.5,.4)
weighting =sample.int(50, 10)
df_to_collapse = data.frame(group_id,var_1,var_2,var_percent_1,var_percent_2,weighting)
I want to collapse my data according to the groups identified by group_id. However, in my data, I have variables in absolute levels (var_1, var_2) and in percentage terms (var_percent_1, var_percent_2).
I create two lists for each type of variable (my real data is much bigger, making this necessary). I also have a weighting variable (weighting).
to_be_weighted =df_to_collapse[, 4:5]
to_be_summed = df_to_collapse[,2:3]
to_be_weighted_2=colnames(to_be_weighted)
to_be_summed_2=colnames(to_be_summed)
And my goal is to simultaneously collapse my data using either sum or weighted average, according to the type of variable (i.e. if it's in percentage terms, I use the weighted average).
Here is my best attempt:
df_to_collapse %>%
  group_by(group_id) %>%
  summarise_at(.vars = c(to_be_summed_2, to_be_weighted_2), .funs = c(sum, mean))
But, as you can see, this is not a weighted average.
I have tried many different ways of using the weighted.mean function, but have had no luck. Here is an example of one such attempt:
df_to_collapse %>%
  group_by(group_id) %>%
  summarise_at(.vars = c(to_be_weighted_2, to_be_summed_2),
               .funs = c(weighted.mean(to_be_weighted_2, weighting), sum))
And the corresponding error (weighted.mean is being handed the character vector of column names rather than the column values):
Error in weighted.mean.default(to_be_weighted_2, weighting) :
'x' and 'w' must have the same length
Here's a way to do it by reshaping into long data, adding a dummy variable called type for whether it's a percentage (optional, but handy), applying a function in summarise based on whether it's a percentage, then spreading back to wide shape. If you can change column names, you could come up with a more elegant way of doing the type column, but that's really more for convenience.
The trick for me was the type[1] == "percent"; I had to use [1] because everything in each group has the same type, but otherwise == operates over every value in the vector and gives multiple logical values, when you really just need 1.
library(tidyverse)
set.seed(1234)
group_id = c(1,1,1,2,2,3,3,3,3,3)
var_1 = sample.int(20, 10)
var_2 = sample.int(20, 10)
var_percent_1 =rnorm(10,.5,.4)
var_percent_2 =rnorm(10,.5,.4)
weighting =sample.int(50, 10)
df_to_collapse <- data.frame(group_id,var_1,var_2,var_percent_1,var_percent_2,weighting)
df_to_collapse %>%
  gather(key = var, value = value, -group_id, -weighting) %>%
  mutate(type = ifelse(str_detect(var, "percent"), "percent", "int")) %>%
  group_by(group_id, var) %>%
  summarise(sum_or_avg = ifelse(type[1] == "percent",
                                weighted.mean(value, weighting),
                                sum(value))) %>%
  ungroup() %>%
  spread(key = var, value = sum_or_avg)
#> # A tibble: 3 x 5
#>   group_id var_1 var_2 var_percent_1 var_percent_2
#>      <dbl> <dbl> <dbl>         <dbl>         <dbl>
#> 1        1    26    31         0.269         0.483
#> 2        2    32    21         0.854         0.261
#> 3        3    29    49         0.461         0.262
Created on 2018-05-04 by the reprex package (v0.2.0).
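As a side note (not part of the original answer): with dplyr 1.0+ you can skip the reshape and apply a different function to each list of columns directly, since the two name vectors already exist. A sketch using the objects defined above:
df_to_collapse %>%
  group_by(group_id) %>%
  summarise(across(all_of(to_be_summed_2), sum),
            across(all_of(to_be_weighted_2), ~ weighted.mean(.x, weighting)))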

Subsetting rows based on dates and criteria across two data frames

I have one data frame outlining pollution levels continuously measured from two sites.
Dates <- as.data.frame(seq(as.Date("2015/01/01"), as.Date("2017/01/01"),"day"))
Pollution_Site.A <- as.data.frame(c(seq(from = 1, to = 366, by = 1),
                                    seq(from = 366, to = 1, by = -1)))
Pollution_Site.B <- as.data.frame(c(seq(from = 0, to = 365, by = 1),
                                    seq(from = 365, to = 0, by = -1)))
df1 <- cbind(Dates,Pollution_Site.A,Pollution_Site.B)
colnames(df1) <- c("Dates","Site.A","Site.B")
I have a separate data frame highlighting when surveyors (each site has one unique surveyor) visited each site.
Site<- c("Site.A","Site.A","Site.B","Site.B")
Survey_Dates <- as.data.frame(as.POSIXct(c("2014/08/17", "2016/08/01",
                                           "2015/02/01", "2016/10/31")))
df2 <- as.data.frame(cbind(Site,Survey_Dates))
colnames(df2) <- c("Site","Survey_Dates")
What I want to do is (i) define a high pollution event for each site (although perhaps some form of 'apply' function would be better, to do this iteratively across multiple sites):
High_limit_Site.A <- 1.5*median(df1$Site.A)
High_limit_Site.B <- 1.5*median(df1$Site.B)
Then I want to (ii) subset the second data frame to show which surveyors have visited the site before and after a high pollution event within 1 year (provided there is pollution data as well). I presume something along the lines of the 'difftime' function will work here, but am not sure how I would apply it.
Finally, I would like (iii) the subsetted data frame to highlight whether the surveyor was out before or after the pollution event.
So in the example above, the desired output should only contain Site B. This is because Site A's first survey date precedes the first pollution measurement AND was over a year before the high pollution event. Thank you in advance for any help on this.
You need to pivot df1 and then cross-join it with df2:
library(dplyr)
library(tidyr)
df1 %>%
  gather(key = Site, value = Pollution, -Dates) %>%
  group_by(Site) %>%
  mutate(HighLimit = as.numeric(Pollution > 1.5 * median(Pollution))) %>%
  filter(HighLimit == 1) %>%
  # this will function as a cross-join because Site is not a unique ID
  left_join(df2, by = c("Site")) %>%
  mutate(Time_Lag = as.numeric(as.Date(Survey_Dates) - as.Date(Dates)),
         Been_Before = ifelse(Time_Lag > 0, "after", "before")) %>%
  filter(abs(Time_Lag) < 365) %>%
  group_by(Site, Survey_Dates, Been_Before) %>%
  summarise(Event_date_min = min(Dates),
            Event_date_max = max(Dates))
Here you can see the earliest and latest event corresponding to each visit:
# A tibble: 3 x 5
# Groups:   Site, Survey_Dates [?]
  Site   Survey_Dates Been_Before Event_date_min Event_date_max
  <chr>  <dttm>       <chr>       <date>         <date>
1 Site.A 2016-08-01   after       2015-10-03     2016-04-01
2 Site.B 2015-02-01   before      2015-10-02     2016-01-30
3 Site.B 2016-10-31   after       2015-11-01     2016-04-02
Just to build on the answer @dmi3kno provided above, I can then subset sites which have both a "before" and an "after" entry.
Output_df <- df1 %>%
  gather(key = Site, value = Pollution, -Dates) %>%
  group_by(Site) %>%
  mutate(HighLimit = as.numeric(Pollution > 1.5 * median(Pollution))) %>%
  filter(HighLimit == 1) %>%
  left_join(df2, by = c("Site")) %>%
  mutate(Time_Lag = as.numeric(as.Date(Survey_Dates) - as.Date(Dates)),
         Been_Before = ifelse(Time_Lag > 0, "after", "before")) %>%
  filter(abs(Time_Lag) < 365) %>%
  group_by(Site, Survey_Dates, Been_Before) %>%
  summarise(Event_date_min = min(Dates),
            Event_date_max = max(Dates))
Then using dplyr again:
Final_df <- Output_df %>%
  group_by(Site) %>%
  filter(all(c("before", "after") %in% Been_Before))

R dplyr summarise date gaps

I have data on a set of students and the semesters they were enrolled in courses.
ID = c(1,1,1,
       2,2,
       3,3,3,3,3,
       4)
The semester variable "Date" is coded as the year followed by 20 for spring, 30 for summer, and 40 for fall, so the Date value 201430 is the summer semester of 2014.
Date = c(201220, 201240, 201330,
         201340, 201420,
         201120, 201340, 201420, 201440, 201540,
         201640)
Enrolled <- data.frame(ID, Date)
I'm using dplyr to group the data by ID and to summarise various aspects of a given student's enrollment history:
Enrollment.History <- dplyr::select(Enrolled, ID, Date) %>%
  group_by(ID) %>%
  summarise(Total.Semesters = n_distinct(Date),
            First.Semester = min(Date))
I'm trying to get a measure of the number of enrollment gaps that each student has, as well as the size of the largest enrollment gap. The data frame should end up looking like this:
Enrollment.History$Gaps<-c(2,0,3,0)
Enrollment.History$Biggest.Gap<-c(1,0,7,0)
print(Enrollment.History)
I'm just trying to figure out the best way to code those gap variables. Is it better to turn that Date variable into an ordered factor? I hope there is a simple solution.
Since you are not dealing with real dates in a standard format, you can instead make use of factors to compute the gaps.
First you need to define a vector of all possible year/semester combinations ("Dates") in the correct order (this is important!).
all_semesters <- c(sapply(2011:2016, paste0, c(20,30,40)))
Then, you can create a new factor variable, arrange the data by ID and Date, and finally compute the maximum difference between two semesters:
Enrolled %>%
  mutate(semester = factor(Date, levels = all_semesters)) %>%
  group_by(ID) %>%
  arrange(Date) %>%
  summarise(max_gap = max(c(0, diff(as.integer(semester)) - 1), na.rm = TRUE))
## A tibble: 4 × 2
#      ID max_gap
#   <dbl>   <dbl>
# 1     1       1
# 2     2       0
# 3     3       7
# 4     4       0
I used max(c(0, ...)) in the summarise, because otherwise you would end up with -Inf for IDs with a single entry.
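A one-liner shows why the guard is needed: diff() of a length-1 vector is zero-length, and max() of a zero-length vector is -Inf:
max(diff(1L))        # -Inf, with a warning
max(c(0, diff(1L)))  # 0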
Similarly, you could also achieve this by using match instead of a factor:
Enrolled %>%
  mutate(semester = match(Date, all_semesters)) %>%
  group_by(ID) %>%
  arrange(Date) %>%
  summarise(max_gap = max(c(0, diff(semester) - 1), na.rm = TRUE))
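The question also asked for the number of gaps. Assuming a gap is any jump of more than one semester between consecutive enrollments (my reading of the expected output, not from the original answers), the match() approach extends naturally:
Enrolled %>%
  mutate(semester = match(Date, all_semesters)) %>%
  group_by(ID) %>%
  arrange(Date) %>%
  summarise(Gaps = sum(diff(semester) > 1),
            Biggest.Gap = max(c(0, diff(semester) - 1)))
# reproduces the expected Gaps = c(2, 0, 3, 0) and Biggest.Gap = c(1, 0, 7, 0)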
