I have a panel dataset that contains the following variables:
1. Country
2. Company
3. Monthly date
4. Revenue
A <- data.frame(Country = as.factor(rep('A', 138)),
                # the rep() counts must match the lengths of the date sequences below (13, 7, 72, 46)
                Company = as.factor(c(rep('AAA', 13), rep('BBB', 7), rep('CCC', 72), rep('DDD', 46))),
                Date = c(seq(as.Date('2010-01-01'), as.Date('2011-01-01'), by = 'month'),
                         seq(as.Date('2010-01-01'), as.Date('2010-07-01'), by = 'month'),
                         seq(as.Date('2010-01-01'), as.Date('2015-12-01'), by = 'month'),
                         seq(as.Date('2012-03-01'), as.Date('2015-12-01'), by = 'month')),
                Revenue = sample(10000:25000, 138))
B <- data.frame(Country = as.factor(rep('B', 108)),
                Company = as.factor(c(rep('EEE', 36), rep('FFF', 36), rep('GGG', 36))),
                Date = c(seq(as.Date('2013-01-01'), as.Date('2015-12-01'), by = 'month'),
                         seq(as.Date('2013-01-01'), as.Date('2015-12-01'), by = 'month'),
                         seq(as.Date('2013-01-01'), as.Date('2015-12-01'), by = 'month')),
                Revenue = sample(10000:25000, 108))
I want to add another variable to the dataset - competitor revenue, which is the sum of the revenues of all other companies in the same country for the corresponding month.
I wrote the following code:
new_B <- data.frame()
for(i in 1:nlevels(B$Company)){
  temp_i <- B[which(B$Company == levels(B$Company)[i]), ]
  temp_j <- B[which(B$Company != levels(B$Company)[i]), ]
  agg_temp <- aggregate(temp_j$Revenue, by = list(temp_j$Date), sum)
  temp_i$competitor_value <- ifelse(agg_temp$Group.1 %in% temp_i$Date, agg_temp$x, 0)
  new_B <- rbind(new_B, temp_i)
}
I created two temporary data sets inside the for loop: one containing company i only, and the other containing all other companies. I summed the revenues of the other companies by month, and then used ifelse to add the new variable to temp_i for the matching dates. It works fine for companies that operated during the same period, but in country A some companies operated over different periods, and when I run the same code I get an error that the objects are not of the same length:
new_A <- data.frame()
for(i in 1:nlevels(A$Company)){
  temp_i <- A[which(A$Company == levels(A$Company)[i]), ]
  temp_j <- A[which(A$Company != levels(A$Company)[i]), ]
  agg_temp <- aggregate(temp_j$Revenue, by = list(temp_j$Date), sum)
  temp_i$competitor_value <- ifelse(agg_temp$Group.1 %in% temp_i$Date, agg_temp$x, 0)
  new_A <- rbind(new_A, temp_i)
}
I found a similar answer ("ifelse statements with dataframes of different lengths"), but I still do not know how to solve my problem.
I would really appreciate any help.
I suggest a different approach using the dplyr package:
library(dplyr)
A %>%
  bind_rows(B) %>%
  group_by(Country, month = format(Date, "%Y-%m")) %>%  # competitors are the other companies in the same country
  mutate(revComp = sum(Revenue)) %>%
  group_by(Company, add = TRUE) %>%
  mutate(revComp = revComp - Revenue)                   # subtract own revenue from the country-month total
# Source: local data frame [246 x 6]
# Groups: Country, month, Company [246]
#
# Country Company Date Revenue month revComp
# (chr) (chr) (date) (int) (chr) (int)
# 1 A AAA 2010-01-01 10657 2010-01 30356
# 2 A AAA 2010-02-01 11620 2010-02 22765
# 3 A AAA 2010-03-01 17285 2010-03 33329
# 4 A AAA 2010-04-01 22886 2010-04 33469
# 5 A AAA 2010-05-01 20129 2010-05 39974
# 6 A AAA 2010-06-01 22865 2010-06 26896
# 7 A AAA 2010-07-01 13087 2010-07 29542
# 8 A AAA 2010-08-01 19451 2010-08 14842
# 9 A AAA 2010-09-01 12364 2010-09 15309
# 10 A AAA 2010-10-01 19375 2010-10 14090
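For completeness, here is a base-R sketch of the same idea that also fixes the original loop's length problem: merge() aligns the competitor totals to each company's own dates instead of relying on ifelse() over vectors of different lengths (the wrapper name competitor_sum is just illustrative):
# base-R sketch: per company, sum the other companies' revenue by month and
# merge it back on Date, so companies with different operating periods are handled correctly
competitor_sum <- function(df) {
  out <- data.frame()
  for (comp in levels(df$Company)) {
    temp_i <- df[df$Company == comp, ]
    temp_j <- df[df$Company != comp, ]
    agg_temp <- aggregate(Revenue ~ Date, data = temp_j, sum)
    names(agg_temp)[2] <- "competitor_value"
    temp_i <- merge(temp_i, agg_temp, by = "Date", all.x = TRUE)
    temp_i$competitor_value[is.na(temp_i$competitor_value)] <- 0  # months with no active competitor
    out <- rbind(out, temp_i)
  }
  out
}
new_A <- competitor_sum(A)
new_B <- competitor_sum(B)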
I have a dataframe that represents a two-year daily time series of temperature for two rivers. For each river and each year, I would like to know what day of year:
temperature is greater than or equal to 15 degrees
temperature is sustained greater than or equal to 15 degrees (sustained is when there are no more dips below 15 until the autumn)
temperature is less than or equal to 15 degrees
temperature is sustained less than or equal to 15 degrees (sustained is when there are no more peaks above 15 until the following spring)
Example timeseries
library(ggplot2)
library(lubridate)
library(dplyr)
library(dataRetrieval)
siteNumber <- c("01428500","01432805") # United States Geological Survey site numbers
parameterCd <- "00010" # temperature
statCd <- "00003" # mean
startDate <- "2018-01-01"
endDate <- "2019-10-31"
dat <- readNWISdv(siteNumber, parameterCd, startDate, endDate, statCd=statCd) # obtains the timeseries from the USGS
dat <- dat[,c(2:4)]
colnames(dat)[3] <- "temperature"
# To view the time series
ggplot(data = dat, aes(x = Date, y = temperature)) +
geom_point() +
theme_bw() +
facet_wrap(~site_no)
# Adds a new column for year and day of year (doy; Jan 1 = 1, Dec 31 = 365)
dat <- dat %>%
mutate(year = year(Date),
doy = yday(Date))
I have tried using the dplyr filter() function but have had little success
dat %>%
group_by(site_no,year) %>%
filter(temperature >= 15 & temperature <= 15)
The ideal output would look something like this:
site_no year doy_firstabove15 doy_sustainedAbove15 doy_firstbelow15 doy_sustainedBelow15
1 01428500 2018 136 144 253 286
2 01428500 2019 140 146 279 289
3 01432805 2018 143 143 272 276
4 01432805 2019 140 140 278 278
I like this question. It's specific, tricky, and has a good example. I would go about this by creating a rolling variable that checks for changes in temperature above and below 15. Then you can group by site and year and check for the first instance of the change in group id. Note that I make several assumptions here that might be wrong: 1) I group by calendar year, but it would likely be better to group by water year. You don't explicitly say that the year is important, but I suspect it is. 2) I also assume that autumn is defined by the first day of month 9; maybe you want "sustained" to run through the fall rather than starting on its first day. The same assumption applies to spring.
library(tidyverse)
library(dataRetrieval)
dat |>
mutate(month = lubridate::month(Date),
year = lubridate::year(Date)) |>
group_by(site_no, year) |>
mutate(gt_15 = temperature > 15,
lt_15 = temperature < 15,
chng1 = cumsum(gt_15 != lag(gt_15, def = first(gt_15))),
chng2 = cumsum(lt_15 != lag(lt_15, def = first(lt_15))),
)|>
summarise(first_above_15 = Date[chng1==1][1],
indx_autum = chng1[month == 9][1],
first_above_15_sustained = Date[chng1== indx_autum][1],
first_below_15 = Date[chng2==1][1],
indx_spring = chng1[month == 3][1],
first_below_15_sustained = Date[chng2== indx_spring][1],
.groups = "drop") |>
select(-contains("indx"))
#> # A tibble: 4 x 6
#> site_no year first_above_15 first_above_15_sustained first_belo~1 first_be~2
#> <chr> <dbl> <date> <date> <date> <date>
#> 1 01428500 2018 2018-05-18 2018-05-24 2018-05-16 2018-01-01
#> 2 01428500 2019 2019-05-20 2019-05-26 2019-05-20 2019-01-01
#> 3 01432805 2018 2018-05-23 2018-05-23 2018-05-23 2018-01-01
#> 4 01432805 2019 2019-05-20 2019-05-20 2019-05-20 2019-02-22
#> # ... with abbreviated variable names 1: first_below_15,
#> # 2: first_below_15_sustained
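If day-of-year values are wanted, as in the ideal output above, the Date columns of this summary can be converted with lubridate::yday(). A small sketch, assuming the summary above has been assigned to a hypothetical object res:
# hypothetical: `res` holds the summarised tibble produced by the pipeline above
res |>
  mutate(across(starts_with("first"), lubridate::yday))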
An alternative approach splits each annual time series into the periods before and after the peak temperature, then returns the day of year (doy) when temperature first rises above 15 before the peak and first falls below 15 after the peak.
Similarly, sustain_above and sustain_below are determined by first creating a run-length ID column (run), then extracting the doy values where run is at its maximum within the subset where below_peak or after_peak is TRUE, and taking the first element of doy.
dat %>%
mutate(year = year(Date),
doy = yday(Date)) %>%
group_by(site_no, year) %>%
mutate(gt_15 = temperature >= 15,
lt_15 = temperature <= 15,
peak_doy = doy[which.max(temperature)],
below_peak = doy < peak_doy,
after_peak = doy > peak_doy,
run = data.table::rleid(lt_15)) %>%
summarise(first_above = doy[below_peak & gt_15][1],
sustain_above = first(doy[run == max(run[below_peak])]),
first_below = doy[after_peak & lt_15][1],
sustain_below = first(doy[run == max(run[after_peak])]), .groups = 'drop')
I have a dataframe representing a two-year daily time series of temperature for two rivers. I have identified when the temperature is either above or below the peak temperature. I have also created a run-length ID column for when temperature is either above or below a threshold temperature of 10 degrees.
How can I get the first day of year, for each site and year, under the following conditions:
maximum run-length & below_peak == TRUE
maximum run-length & after_peak == TRUE
Example Data:
library(ggplot2)
library(lubridate)
library(dplyr)
library(dataRetrieval)
siteNumber <- c("01432805","01388000") # United States Geological Survey site numbers
parameterCd <- "00010" # temperature
statCd <- "00003" # mean
startDate <- "1996-01-01"
endDate <- "1997-12-31"
dat <- readNWISdv(siteNumber, parameterCd, startDate, endDate, statCd=statCd) # obtains the timeseries from the USGS
dat <- dat[,c(2:4)]
colnames(dat)[3] <- "temperature"
# To view the time series
ggplot(data = dat, aes(x = Date, y = temperature)) +
geom_point() +
theme_bw() +
facet_wrap(~site_no)
To create the columns described above
dat <- dat %>%
mutate(year = year(Date),
doy = yday(Date)) %>% # doy = day of year
group_by(site_no, year) %>%
mutate(lt_10 = temperature <= 10,
peak_doy = doy[which.max(temperature)],
below_peak = doy < peak_doy,
after_peak = doy > peak_doy,
run = data.table::rleid(lt_10))
View(dat)
The ideal output would look as follows:
site_no year doy_below doy_after
1 01388000 1996 111 317
2 01388000 1997 112 312
3 01432805 1996 137 315
4 01432805 1997 130 294
doy_after = the first doy where run equals the maximum run among rows with after_peak == TRUE, within each group_by(site_no, year) group
doy_below = the first doy where run equals the maximum run among rows with below_peak == TRUE, within each group_by(site_no, year) group
For site_no = 01388000 in year = 1996, the max(run) when below_peak == TRUE is 4. The first row where run == 4 and below_peak == TRUE corresponds to date 1996-04-20, which has doy = 111.
As the data are already grouped, just summarise by extracting the 'doy' values where run is at its maximum within the subset where 'below_peak' or 'after_peak' is TRUE, and take the first element of 'doy':
library(dplyr)
dat %>%
summarise(doy_below = first(doy[run == max(run[below_peak])]),
doy_above = first(doy[run == max(run[after_peak])]), .groups = 'drop')
Output:
# A tibble: 4 × 4
site_no year doy_below doy_above
<chr> <dbl> <dbl> <dbl>
1 01388000 1996 111 317
2 01388000 1997 112 312
3 01432805 1996 137 315
4 01432805 1997 130 294
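To see which row that expression picks for a given site and year, here is a small inspection sketch (it assumes dat as built in the question, and should land on the 1996-04-20 / doy = 111 row described there):
# inspect the row behind doy_below for one site/year; `dat` is still grouped by site_no and year
dat %>%
  filter(site_no == "01388000", year == 1996) %>%
  filter(run == max(run[below_peak])) %>%
  slice(1) %>%
  select(site_no, year, Date, doy, run)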
I have pollution measures by day and location. I have a population of people for which I want to measure pollution exposure. Each person has a location and a period of time during which they were in that location.
For each person in my dataset, I need to sum up the pollution values from their location over their time period, and also count the number of missing pollution measurements.
The table structures are the following:
ids start_dates end_dates zips
1 1 2000-10-10 2001-02-18 45108
2 2 2000-11-11 2001-04-07 45190
3 3 2000-03-05 2000-06-27 45117
4 4 2001-02-04 2001-06-09 45142
5 5 2000-03-16 2000-07-13 45197
6 6 1999-12-15 2000-04-27 45060
exposure_day exposure_zip exposure_value
1 1999-06-26 45108 14
2 1999-06-27 45108 27
3 1999-06-28 45108 22
4 1999-06-29 45108 4
5 1999-06-30 45108 26
6 1999-07-01 45108 20
Desired output:
ids start_dates end_dates zips exposure_sum na_count
1: 1 2000-10-10 2001-02-18 45108 3188 5
2: 2 2000-11-11 2001-04-07 45190 3789 1
3: 3 2000-03-05 2000-06-27 45117 2917 3
4: 4 2001-02-04 2001-06-09 45142 2969 2
5: 5 2000-03-16 2000-07-13 45197 2860 3
6: 6 1999-12-15 2000-04-27 45060 3497 2
My current solution is quite slow. I would like to find a more efficient solution, so that I can efficiently perform this calculation for roughly 1,000,000 people.
Below is the code to simulate my data and my current solution.
set.seed(123)
library(lubridate)
library(data.table)
# Make person dataframe
n = 1000 # sample size
ids = c(1:n)
end_dates = sample(seq(as.Date('2000-01-01'), as.Date('2002-01-01'), by="day"), n, replace = T)
time_intervals = sample(seq(100, 200), n, replace = T)
start_dates = end_dates - time_intervals
zips = sample(seq(45000, 45200), n, replace = T)
person_df = data.frame(ids, start_dates, end_dates, zips)
# Make exposure dataframe
ziplist = unique(zips)
nzips = length(ziplist)
ndays = as.numeric(as.Date(max(person_df$end_dates)) - as.Date(min(person_df$start_dates)) + 1)
exposure_dates = seq(as.Date(min(person_df$start_dates)), as.Date(max(person_df$end_dates)), by = 'day')
exposure_day = rep(exposure_dates, nzips)
exposure_zip = rep(ziplist, each = ndays)
exposure_value = sample(c(NA, 1:50), length(exposure_day), replace = T)
exposure_df = data.frame(exposure_day, exposure_zip, exposure_value)
# convert to datatable
person_dt = data.table(person_df)
exposure_dt = data.table(exposure_df)
#summarize
summary_dt = person_dt[, ":="(
  exposure_sum = .(sum(exposure_dt[exposure_day >= start_dates & exposure_day <= end_dates & exposure_zip == zips, exposure_value], na.rm = T)),
  na_count = .(sum(is.na(exposure_dt[exposure_day >= start_dates & exposure_day <= end_dates & exposure_zip == zips, exposure_value])))
), by = 'ids'][]
EDIT --- added a variation of @langtang's clever approach below, which brings n = 1M down to about 4 seconds even with dplyr.
This dplyr approach is about 40x faster at n=1000, 50x at n=10k, and 60x faster at n=100k, with the same output. The main gain is from translating the non-equi join into a left join by expanding person_df to have one row for each exposure_day in each ids range. That upfront step of expanding all the dates can be done once to make the subsequent joins dramatically faster.
It takes about 2 minutes when I run this for n=1,000,000, which I presume would take around 2 hours using the original code. I imagine further improvements could be made by porting to data.table or collapse if that isn't fast enough.
library(dplyr)

person_df %>%
group_by(ids, exposure_zip = zips) %>%
summarize(exposure_day = seq.Date(start_dates, end_dates, by = "day"), .groups = "drop") %>%
left_join(exposure_df) %>%
group_by(ids) %>%
summarize(exposure_sum = sum(exposure_value, na.rm = TRUE),
na_count = sum(is.na(exposure_value))) %>%
# optional to add start dates end dates, zips columns back
left_join(person_df)
Update: porting to data.table using dtplyr improved the 1M row test slightly, to 100 seconds on my machine.
library(dtplyr)
person_df %>%
lazy_dt() %>%
group_by(ids, exposure_zip = zips) %>%
summarize(exposure_day = seq.Date(start_dates, end_dates, by = "day"), .groups = "drop") %>%
left_join(exposure_df) %>%
group_by(ids) %>%
summarize(exposure_sum = sum(exposure_value, na.rm = TRUE),
na_count = sum(is.na(exposure_value)), .groups = "drop") %>%
left_join(person_df) %>%
collect()
@langtang's approach is clever: it recognizes that the sum over each id's range can be computed by subtracting the cumulative value on the day before the start of the range from the cumulative value at the end of the range. This improves the time for n = 1M to about 4 seconds, even using dplyr's slower aggregation.
exposure_df_cuml <- exposure_df %>%
group_by(zips = exposure_zip) %>%
transmute(value = exposure_day,
expo_cuml = cumsum(coalesce(exposure_value,0)),
expo_na_cuml = cumsum(is.na(exposure_value))) %>%
ungroup()
person_df %>%
tidyr::pivot_longer(ends_with("dates")) %>%
mutate(value = value - if_else(name == "start_dates", 1, 0)) %>%
left_join(exposure_df_cuml) %>%
mutate(across(starts_with("expo"), ~if_else(name == "start_dates", -.x, .x))) %>%
group_by(ids, zips) %>%
summarize(across(starts_with("expo"), sum), .groups = "drop")
You should be able to get the runtime for one million ids down to a couple of seconds.
The trick here is to use cumulative sums across the entire range of days, and then, for each id, subtract the cumulative sum through the day before the start date from the cumulative sum at the end date. This is quite fast because it doesn't require any expansion of rows and, apart from the direct merge, doesn't require any grouping by ids:
Step 1: Create the cumulative sum of exposure values (cval) and number of NA values (nas)
exposure_dt[order(exposure_day), `:=`(
cval=cumsum(fifelse(is.na(exposure_value),0,exposure_value)),
nas = cumsum(is.na(exposure_value))
),exposure_zip]
Step 2: Simply melt the person_dt frame, and do direct merge on the exposure_dt frame. Make sure to subtract the exposure value from the cumulative sum if this is a start day and exposure value is not NA; similarly subtract one from nas if this is a start day and exposure value is NA.
k <- melt(person_dt,id.vars = c("ids","zips")) %>%
.[exposure_dt, on=.(zips==exposure_zip, value=exposure_day), nomatch=0] %>%
.[variable=="start_dates" & is.na(exposure_value), nas:=nas-1] %>%
.[variable=="start_dates" & !is.na(exposure_value),cval:=cval-exposure_value] %>%
.[order(ids,value)]
Step 3: Simply subtract the odd rows from the even rows, and cbind the result to the person_dt
cbind(
person_dt,
k[seq(2,.N,2),.(cval,nas)] - k[seq(1,.N,2),.(cval,nas)]
)
All of this takes 0.08 seconds on my machine with the original 1000 ids dataset. If I set n=1000000, it takes about 1.1 seconds.
Output:
ids start_dates end_dates zips cval nas
<int> <Date> <Date> <int> <num> <int>
1: 1 2000-10-10 2001-02-18 45108 3188 5
2: 2 2000-11-11 2001-04-07 45190 3789 1
3: 3 2000-03-05 2000-06-27 45117 2917 3
4: 4 2001-02-04 2001-06-09 45142 2969 2
5: 5 2000-03-16 2000-07-13 45197 2860 3
---
996: 996 2000-02-21 2000-07-29 45139 4250 2
997: 997 2000-02-02 2000-07-15 45074 4407 4
998: 998 2001-07-29 2001-11-15 45139 2686 3
999: 999 2001-09-10 2001-12-20 45127 2581 1
1000: 1000 2000-10-15 2001-05-01 45010 4941 2
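For reference, the aggregation can also be written as a single data.table non-equi join, the operation the dplyr answer above alludes to. A minimal sketch using the question's person_dt and exposure_dt (concise, though typically slower at n = 1M than the cumulative-sum trick):
# non-equi join: for each person, take exposure rows with the same zip and a day
# inside [start_dates, end_dates], then aggregate per person with by = .EACHI
summary_dt <- exposure_dt[person_dt,
                          on = .(exposure_zip == zips,
                                 exposure_day >= start_dates,
                                 exposure_day <= end_dates),
                          .(ids = i.ids,
                            exposure_sum = sum(exposure_value, na.rm = TRUE),
                            na_count     = sum(is.na(exposure_value))),
                          by = .EACHI][, .(ids, exposure_sum, na_count)]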
I have found that dplyr is fast and simple for aggregating and summarising data, but I can't figure out how to solve the following problem with it.
Given these data frames:
df_2017 <- data.frame(
expand.grid(1:195,1:65,1:39),
value = sample(1:1000000,(195*65*39)),
period = rep("2017",(195*65*39)),
stringsAsFactors = F
)
df_2017 <- df_2017[sample(1:(195*65*39),450000),]
names(df_2017) <- c("company", "product", "acc_concept", "value", "period")
df_2017$company <- as.character(df_2017$company)
df_2017$product <- as.character(df_2017$product)
df_2017$acc_concept <- as.character(df_2017$acc_concept)
df_2017$value <- as.numeric(df_2017$value)
ratio_df <- data.frame(concept=c("numerator","numerator","numerator","denom", "denom", "denom","name"),
ratio1=c("1","","","4","","","Sales over Assets"),
ratio2=c("1","","","5","6","","Sales over Expenses A + B"), stringsAsFactors = F)
where the columns in df_2017 are:
company = This is a categorical variable with companies from 1 to 195
product = This is a categorical variable, with home appliance products from 1 to 65. For example, 1 could be equal to irons, 2 to televisions, etc.
acc_concept = This is a categorical variable with accounting concepts from 1 to 39. For example, 1 would be equal to "Sales", 2 to "Total Expenses", 3 to "Returns", 4 to "Assets", etc.
value = This is a numeric variable, with USD from 1 to 100,000,000
period = Categorical variable. Always 2017
As the expand.grid implies, the company - product - acc_concept combinations are never duplicated, but it can happen that some companies do not have every company - product - acc_concept combination. That is why the line "df_2017 <- df_2017[sample(1:(195*65*39),450000),]" is there, and why the output can turn out to be NA (see below).
And where the columns in ratio_df are:
concept = indicates whether the row holds the numerator acc_concepts, the denominator acc_concepts, or the name of the ratio
ratio1 = acc_concept and name for ratio1
ratio2 = acc_concept and name for ratio2
I want to calculate the 2 ratios defined in ratio_df between acc_concepts, for each product within each company.
For example:
I take the first ratio's acc_concepts and name from ratio_df:
num_acc_concept <- ratio_df[ratio_df$concept == "numerator", 2]
denom_acc_concept <- ratio_df[ratio_df$concept == "denom", 2]
ratio_name <- ratio_df[ratio_df$concept == "name", 2]
Then I calculate the ratio for one product of one company, just to show what I want to do:
ratio1_value <- sum(df_2017[df_2017$company == 1 & df_2017$product == 1 & df_2017$acc_concept %in% num_acc_concept, 4]) / sum(df_2017[df_2017$company == 1 & df_2017$product == 1 & df_2017$acc_concept %in% denom_acc_concept, 4])
Output:
output <- data.frame(Company="1", Product="1", desc_ratio=ratio_name, ratio_value = ratio1_value, stringsAsFactors = F)
As I said before, I want to do this for each product within each company.
The output data.frame could be something like this (the ratios are not the true ones because I have not done the calculations yet):
company product desc_ratio ratio_value
1 1 Sales over Assets 0.9303675
1 2 Sales over Assets 1.30
1 3 Sales over Assets NaN
1 4 Sales over Assets Inf
1 5 Sales over Assets 2.32
1 6 Sales over Assets NA
.
.
.
1 1 Sales over Expenses A + B 3.25
.
.
.
2 1 Sales over Assets 0.256
and so on...
NaN when the ratio is 0 / 0
Inf when the ratio is a number / 0
NA when there is no data for a given company and product combination.
I hope I have made myself clear this time :)
Is there any way to solve this problem with dplyr? Should I reshape (cast) df_2017 before mutating? If so, what is the best way to do the casting?
Any help would be welcome!
This is one way of doing it. At the end I timed the code on all of your records.
First, create a function that computes all the ratios. Note that this function is designed to be applied to the nested data inside the dplyr code below.
ratio <- function(data){
  result <- data.frame(desc_ratio = rep(NA, ncol(ratio_df) - 1),
                       ratio_value = rep(NA, ncol(ratio_df) - 1))
  for(i in 2:ncol(ratio_df)){
    num <- ratio_df[ratio_df$concept == "numerator", i]
    denom <- ratio_df[ratio_df$concept == "denom", i]
    result$desc_ratio[i - 1] <- ratio_df[ratio_df$concept == "name", i]
    result$ratio_value[i - 1] <- sum(ifelse(data$acc_concept %in% num, data$value, 0)) /
      sum(ifelse(data$acc_concept %in% denom, data$value, 0))
  }
  return(result)
}
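As a quick usage sketch, the function can also be called directly on a single company/product subset, which reproduces the manual calculation from the question (assuming that combination exists in the sampled df_2017):
# returns a two-row data.frame (one row per ratio) with desc_ratio and ratio_value
ratio(df_2017[df_2017$company == "1" & df_2017$product == "1", ])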
Now use dplyr, tidyr and purrr to put everything together. First group the data, nest the columns needed by the function, and run the function inside a mutate on the nested data. Then drop the nested column and unnest to get the desired output. I leave the sorting up to you.
library(dplyr)
library(purrr)
library(tidyr)
output <- df_2017 %>%
group_by(company, product, period) %>%
nest() %>%
mutate(ratios = map(data, ratio)) %>%
select(-data) %>%
unnest
output
# A tibble: 25,350 x 5
company product period desc_ratio ratio_value
<chr> <chr> <chr> <chr> <dbl>
1 103 2 2017 Sales over Assets 0.733
2 103 2 2017 Sales over Expenses A + B 0.219
3 26 26 2017 Sales over Assets 0.954
4 26 26 2017 Sales over Expenses A + B 1.01
5 85 59 2017 Sales over Assets 4.14
6 85 59 2017 Sales over Expenses A + B 1.83
7 186 38 2017 Sales over Assets 7.85
8 186 38 2017 Sales over Expenses A + B 0.722
9 51 25 2017 Sales over Assets 2.34
10 51 25 2017 Sales over Expenses A + B 0.627
# ... with 25,340 more rows
Time it took to run this code on my machine measured with system.time:
user system elapsed
6.75 0.00 6.81
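A quick spot check against the manual calculation in the question (a sketch; it assumes output and ratio1_value as computed above):
# the ratio_value returned here should equal ratio1_value from the manual calculation
output %>%
  filter(company == "1", product == "1", desc_ratio == "Sales over Assets")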
I have a dataset of a hypothetical exam.
library(dplyr)  # provides as_data_frame() and the verbs used below

id <- c(1,1,3,4,5,6,7,7,8,9,9)
test_date <- c("2012-06-27","2012-07-10","2013-07-04","2012-03-24","2012-07-22", "2013-09-16","2012-06-21","2013-10-18", "2013-04-21", "2012-02-16", "2012-03-15")
result_date <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25","2012-09-01","2013-10-20","2012-07-01","2013-10-31", "2013-05-17", "2012-03-17", "2012-04-20")
data1 <- as_data_frame(id)
data1$test_date <- test_date
data1$result_date <- result_date
colnames(data1)[1] <- "id"
"id" indicates the ID of the students who have taken a particular exam. "test_date" is the date the students took the test and "result_date" is the date when the students' results are posted. I'm interested in finding out which students retook the exam BEFORE the result of that exam session was released, e.g. students who knew that they have underperformed and retook the exam without bothering to find out their scores. For example, student with "id" 1 took the exam for the second time on "2012-07-10" which was before the result date for his first exam - "2012-07-29".
I tried to:
data1 %>%
  group_by(id) %>%
  arrange(id, test_date) %>%
  filter(n() >= 2) # to keep only students who have taken the exam more than once, then merge back into the original dataset using a join
So essentially, I want to create a new column called "re_test" where it would equal 1 if a student retook the exam BEFORE receiving the result of a previous exam and 0 otherwise (those who retook after seeing their marks or those who did not retake).
I have tried to mutate in order to find cases where the difference is either positive or negative by subtracting the 2nd test_date from the 1st result_date:
mutate(data1, re_test = result_date - lead(test_date, default = first(test_date)))
However, this mixes up students with different ids. I tried split(), but mutate won't work on a list of data frames, so now I'm stuck:
split(data1, data1$id)
Just to add on, this is a part of the desired result:
data2 <- as_data_frame(id <- c(1,1,3,4))
data2$test_date_result <- c("2012-06-27","2012-07-10", "2013-07-04","2012-03-24")
data2$result_date_result <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25")
data2$re_test <- c(1, 0, 0, 0)
Apologies for the verbosity and hope I was clear enough.
Thanks a lot in advance!
library(reshape2)
library(dplyr)
# first melt so that we can sequence by date
data1m <- data1 %>%
melt(id.vars = "id", measure.vars = c("test_date", "result_date"), value.name = "event_date")
# any two tests in a row is a flag - use dplyr::lag to compare with the previous row
data1mc <- data1m %>%
arrange(id, event_date) %>%
group_by(id) %>%
mutate (multi_test = (variable == "test_date" & lag(variable == "test_date"))) %>%
filter(multi_test)
# id variable event_date multi_test
# 1 1 test_date 2012-07-10 TRUE
# 2 9 test_date 2012-03-15 TRUE
## join back to the original
data1 %>%
left_join (data1mc %>% select(id, event_date, multi_test),
by=c("id" = "id", "test_date" = "event_date"))
I have a piecewise answer that may work for you. I first create a data.frame called student that contains the re-test information, and then join it with the data1 object. If a student re-took the test multiple times, it will compare the last test to the first, which is a flaw, but I'm unsure whether students are able to re-test multiple times.
student <- data1 %>%
group_by(id) %>%
summarise(retest=(test_date[length(test_date)] < result_date[1]) == TRUE)
Some re-test values were NA. These were individuals that only took the test once. I set these to FALSE here, but you can retain the NA, as they do contain information.
student$retest[is.na(student$retest)] <- FALSE
Join the two data.frames to a single object called data2.
data2 <- left_join(data1, student, by='id')
I am sure there are more elegant ways to approach this. I did it by taking advantage of the structure of your data (sorted by id) and the lag function, which can refer to the previous record while dealing with the current one.
### Ensure Data are sorted by ID ###
data1 <- arrange(data1,id)
### Create Flag for those that repeated ###
data1$repeater <- ifelse(lag(data1$id) == data1$id,1,0)
### I chose to do this on all data, you could filter on repeater flag first ###
data1$timegap <- as.Date(data1$result_date) - as.Date(data1$test_date)
data1$lagdate <- as.Date(data1$test_date) - lag(as.Date(data1$result_date))
### Display results where your repeater flag is 1 and there is negative time lag ###
data1[data1$repeater==1 & !is.na(data1$repeater) & as.numeric(data1$lagdate) < 0,]
# A tibble: 2 × 6
id test_date result_date repeater timegap lagdate
<dbl> <chr> <chr> <dbl> <time> <time>
1 1 2012-07-10 2012-09-02 1 54 days -19 days
2 9 2012-03-15 2012-04-20 1 36 days -2 days
I went with a simple shift comparison. 1 line of code.
data1 <- data.frame(id = c(1,1,3,4,5,6,7,7,8,9,9), test_date = c("2012-06-27","2012-07-10","2013-07-04","2012-03-24","2012-07-22", "2013-09-16","2012-06-21","2013-10-18", "2013-04-21", "2012-02-16", "2012-03-15"), result_date = c("2012-07-29","2012-09-02","2013-08-01","2012-04-25","2012-09-01","2013-10-20","2012-07-01","2013-10-31", "2013-05-17", "2012-03-17", "2012-04-20"))
data1$re_test <- unlist(lapply(split(data1,data1$id), function(x)
ifelse(as.Date(x$test_date) > c(NA, as.Date(x$result_date[-nrow(x)])), 0, 1)))
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 NA
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 NA
4 4 2012-03-24 2012-04-25 NA
5 5 2012-07-22 2012-09-01 NA
6 6 2013-09-16 2013-10-20 NA
7 7 2012-06-21 2012-07-01 NA
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 NA
10 9 2012-02-16 2012-03-17 NA
11 9 2012-03-15 2012-04-20 1
I think there is benefit in leaving NAs but if you really want all others as zero, simply:
data1$re_test <- ifelse(is.na(data1$re_test), 0, data1$re_test)
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 0
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 0
4 4 2012-03-24 2012-04-25 0
5 5 2012-07-22 2012-09-01 0
6 6 2013-09-16 2013-10-20 0
7 7 2012-06-21 2012-07-01 0
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 0
10 9 2012-02-16 2012-03-17 0
11 9 2012-03-15 2012-04-20 1
Let me know if you have any questions, cheers.
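For reference, the same shift comparison can be written with dplyr's lag() (a sketch; it should give the same NA-then-0/1 pattern per id as the split()/lapply() version above):
# compare each test date with the previous result date within each id
data1 %>%
  group_by(id) %>%
  mutate(re_test = ifelse(as.Date(test_date) > lag(as.Date(result_date)), 0, 1)) %>%
  ungroup()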