R - dplyr - How to mutate rows or divisions between rows

I have found that dplyr is speedy and simple for aggregating and summarising data. But I can't find out how to solve the following problem with dplyr.
Given these data frames:
df_2017 <- data.frame(
  expand.grid(1:195, 1:65, 1:39),
  value = sample(1:1000000, (195 * 65 * 39)),
  period = rep("2017", (195 * 65 * 39)),
  stringsAsFactors = FALSE
)
df_2017 <- df_2017[sample(1:(195 * 65 * 39), 450000), ]
names(df_2017) <- c("company", "product", "acc_concept", "value", "period")
df_2017$company <- as.character(df_2017$company)
df_2017$product <- as.character(df_2017$product)
df_2017$acc_concept <- as.character(df_2017$acc_concept)
df_2017$value <- as.numeric(df_2017$value)
ratio_df <- data.frame(
  concept = c("numerator", "numerator", "numerator", "denom", "denom", "denom", "name"),
  ratio1  = c("1", "", "", "4", "", "", "Sales over Assets"),
  ratio2  = c("1", "", "", "5", "6", "", "Sales over Expenses A + B"),
  stringsAsFactors = FALSE
)
where the columns in df_2017 are:
company = a categorical variable with companies from 1 to 195
product = a categorical variable with home appliance products from 1 to 65. For example, 1 could be irons, 2 televisions, etc.
acc_concept = a categorical variable with accounting concepts from 1 to 39. For example, 1 would be "Sales", 2 "Total Expenses", 3 "Returns", 4 "Assets", etc.
value = a numeric variable, in USD from 1 to 100,000,000
period = a categorical variable. Always 2017
As the expand.grid implies, the combinations of company - product - acc_concept are never duplicated, but it can happen that certain subjects do not have every company - product - acc_concept combination. That is why the line df_2017 <- df_2017[sample(1:(195*65*39), 450000), ] is there, and that is why the output can turn out to be NA (see below).
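For reference, a quick sanity check (a sketch, using the sizes from the code above) of how many combinations go missing after the sampling step:
full_grid <- 195 * 65 * 39   # 494325 possible combinations
full_grid - 450000           # 44325 combinations are dropped by the sampling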
And where the columns in ratio_df are:
concept = indicates which acc_concept corresponds to the numerator, which one to the denominator, and which row holds the name of the ratio
ratio1 = acc_concepts and name for ratio1
ratio2 = acc_concepts and name for ratio2
I want to calculate 2 ratios (ratio_df) between acc_concept, for each product within each company.
For example:
I take the first ratio's acc_concepts and name from ratio_df:
num_acc_concept <- ratio_df[ratio_df$concept == "numerator", 2]
denom_acc_concept <- ratio_df[ratio_df$concept == "denom", 2]
ratio_name <- ratio_df[ratio_df$concept == "name", 2]
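Note that these extracted vectors include the empty-string placeholders, e.g.:
num_acc_concept
# [1] "1" ""  ""
which is harmless here, because no acc_concept equals the empty string.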
Then I calculate the ratio for one product of one company, just to show what I want to do:
ratio1_value <- sum(df_2017[df_2017$company == 1 & df_2017$product == 1 & df_2017$acc_concept %in% num_acc_concept, 4]) /
  sum(df_2017[df_2017$company == 1 & df_2017$product == 1 & df_2017$acc_concept %in% denom_acc_concept, 4])
Output:
output <- data.frame(Company="1", Product="1", desc_ratio=ratio_name, ratio_value = ratio1_value, stringsAsFactors = F)
As I said before, I want to do this for each product within each company.
The output data.frame could be something like this (the ratios aren't the true ones because I haven't done the calculations yet):
company product desc_ratio ratio_value
1 1 Sales over Assets 0.9303675
1 2 Sales over Assets 1.30
1 3 Sales over Assets NaN
1 4 Sales over Assets Inf
1 5 Sales over Assets 2.32
1 6 Sales over Assets NA
.
.
.
1 1 Sales over Expenses A + B 3.25
.
.
.
2 1 Sales over Assets 0.256
and so on...
NaN when the ratio is 0 / 0
Inf when the ratio is a number / 0
NA when there is no data for a certain company and product.
I hope I have made myself clear this time :)
Is there any way to solve this row problem with dplyr? Should I cast df_2017 before mutating? If so, what is the best way to cast it?
Any help would be welcome!

This is one way of doing it. At the end I timed the code on all of your records.
First, create a function that computes all the ratios. Do note that this function is only useful inside the dplyr code below.
ratio <- function(data){
  result <- data.frame(desc_ratio = rep(NA, ncol(ratio_df) - 1),
                       ratio_value = rep(NA, ncol(ratio_df) - 1))
  for(i in 2:ncol(ratio_df)){
    num <- ratio_df[ratio_df$concept == "numerator", i]
    denom <- ratio_df[ratio_df$concept == "denom", i]
    result$desc_ratio[i - 1] <- ratio_df[ratio_df$concept == "name", i]
    # sum the values whose acc_concept is in the numerator / denominator sets
    result$ratio_value[i - 1] <- sum(ifelse(data$acc_concept %in% num, data$value, 0)) /
      sum(ifelse(data$acc_concept %in% denom, data$value, 0))
  }
  return(result)
}
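As a quick check (a sketch, not part of the pipeline below), the helper can be run on a single company/product subset directly:
ratio(df_2017[df_2017$company == "1" & df_2017$product == "1", ])
# returns one row per ratio column in ratio_df, with desc_ratio and ratio_value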
Using dplyr, tidyr and purrr to put everything together: first group the data, nest the columns needed for the function, run the function via mutate on the nested data, drop the nested column that is no longer needed, and unnest to get the wanted output. I leave the sorting up to you.
library(dplyr)
library(purrr)
library(tidyr)
output <- df_2017 %>%
  group_by(company, product, period) %>%
  nest() %>%
  mutate(ratios = map(data, ratio)) %>%
  select(-data) %>%
  unnest(ratios)
output
# A tibble: 25,350 x 5
company product period desc_ratio ratio_value
<chr> <chr> <chr> <chr> <dbl>
1 103 2 2017 Sales over Assets 0.733
2 103 2 2017 Sales over Expenses A + B 0.219
3 26 26 2017 Sales over Assets 0.954
4 26 26 2017 Sales over Expenses A + B 1.01
5 85 59 2017 Sales over Assets 4.14
6 85 59 2017 Sales over Expenses A + B 1.83
7 186 38 2017 Sales over Assets 7.85
8 186 38 2017 Sales over Expenses A + B 0.722
9 51 25 2017 Sales over Assets 2.34
10 51 25 2017 Sales over Expenses A + B 0.627
# ... with 25,340 more rows
Time it took to run this code on my machine, measured with system.time:
user system elapsed
6.75 0.00 6.81
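For reproducibility, the timing was presumably obtained by wrapping the pipeline, e.g. (a sketch):
system.time({
  output <- df_2017 %>%
    group_by(company, product, period) %>%
    nest() %>%
    mutate(ratios = map(data, ratio)) %>%
    select(-data) %>%
    unnest(ratios)
})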

Related

R median for past 4,8,12 rows - for loop and group not working, df filter not working as well

I need to calculate different medians over the past 4, 8 and 12 weeks (weeks are even numbers), but the group_by is not working inside the for loop. See the code below.
I have a dataframe:
keyword - hits - date
wine - 23 - 01.01.2020
wine - 20 - 01.08.2020
wine - 5 - 01.15.2020
food - 21 - 01.01.2020
food - 8 - 01.08.2020
food - 4 - 01.15.2020
other - 19 - 01.01.2020
other - 12 - 01.08.2020
other - 11 - 01.15.2020
etc...
summary3 <- summary2 %>% group_by(keyword)
for (i in 1:nrow(summary3)) {
  if (i < 9) {
    summary3$median8[i] <- i
  } else {
    summary3$median8[i] <- median(summary3$hits[(i - 1):(i - 8)])
  }
}
for (i in 1:nrow(summary3)) {
  if (i < 5) {
    summary3$median4[i] <- i
  } else {
    summary3$median4[i] <- median(summary3$hits[(i - 1):(i - 4)])
  }
}
for (i in 1:nrow(summary3)) {
  if (i < 13) {
    summary3$median12[i] <- i
  } else {
    summary3$median12[i] <- median(summary3$hits[(i - 1):(i - 12)])
  }
}
I also tried the data frame filter method, but the PID is not displaying correctly when I try to subtract 1 and 8. PID = identifies rows within each keyword group.
#summary3 <- summary2 %>% group_by(keyword) %>% mutate(pid = row_number())
#summary3$median8 <- median(summary3$hits[lag(summary3$pid, k=1) : lag(summary3$pid, k=8)])
or this
#summary3$median8 <- median(summary3$hits[(summary3$pid - 1) : (summary3$pid - 8)])
Does anyone have a smart solution? I know there is rollmedian, but that only works with odd window widths.
The following is based on the {tidyverse}, i.e. {dplyr}, and the nice "rolling window" {runner} package. The latter saves you from creating and adapting week ids (or PIDs) and then iterating over them.
I need to simulate some data, as you have not provided a reproducible example, but you can readily apply this to your own data. I also constrain my example to the 4 and 8 week windows.
library(dplyr)
# simulate some data
set.seed(666)
# create a weekly time series
dates <- seq(from = lubridate::ymd("2020-01-01"), to = lubridate::ymd("2020-05-01"),
             by = "1 week")
# example dataframe with weeks, the wine/food/other keywords, and some random hits
df <- data.frame(
  keyword = c(rep("wine",  length.out = length(dates)),
              rep("food",  length.out = length(dates)),
              rep("other", length.out = length(dates))),
  hits = sample(4:25, size = 3 * length(dates), replace = TRUE),  # random integers
  date = rep(dates, times = 3)
)
With {runner} calculate the rolling mean over the last 4 and 8 weeks.
library(runner)
df %>%
  group_by(keyword) %>%
  mutate(
    ROLL_MEAN_LAST4 = runner(x = hits, k = "4 weeks", idx = date, f = mean),
    ROLL_MEAN_LAST8 = runner(x = hits, k = "8 weeks", idx = date, f = mean)
  ) %>%
  ungroup()
This yields
# A tibble: 54 x 5
keyword hits date ROLL_MEAN_LAST4 ROLL_MEAN_LAST8
<chr> <int> <date> <dbl> <dbl>
1 wine 14 2020-01-01 14 14
2 wine 17 2020-01-08 15.5 15.5
3 wine 8 2020-01-15 13 13
4 wine 12 2020-01-22 12.8 12.8
5 wine 15 2020-01-29 13 13.2
6 wine 4 2020-02-05 9.75 11.7
7 wine 21 2020-02-12 13 13
8 wine 6 2020-02-19 11.5 12.1
9 wine 8 2020-02-26 9.75 11.4
10 wine 16 2020-03-04 12.8 11.2
Note: Check the runner documentation if you need to offset your rolling windows, etc. As written, the solution calculates the mean over the partial time horizon (the first week's mean is based on a single week, the 2nd week's mean on the first 2 weeks).
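Since the original question asked for medians rather than means, the same pattern should carry over by swapping the aggregation function (a sketch, built on the df simulated above):
df %>%
  group_by(keyword) %>%
  mutate(
    ROLL_MED_LAST4 = runner(x = hits, k = "4 weeks", idx = date, f = median),
    ROLL_MED_LAST8 = runner(x = hits, k = "8 weeks", idx = date, f = median)
  ) %>%
  ungroup()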

How to loop many factors into one function

I have a large data frame regarding Covid patients. I have included a very simplified version of what this frame looks like.
CovidFake <- data.frame(
  DateReporting = sample(seq(as.Date("2020-10-01"), as.Date("2020-11-01"), by = "day"),
                         50, replace = TRUE),
  Industry = sample(c("Minor or Student", "Educational Services", "Medical Services",
                      "Food Production"), 50, replace = TRUE)
)
I want to use ggplot to make a graph of the daily cases by the industry of the patient. I have this function to structure the frame so that ggplot can graph it.
library(zoo)
library(dplyr)  # for %>%
library(tidyr)  # for complete() and replace_na()
MainFunction <- function(MainFrame, CatVal){
  Frame <- subset(MainFrame, Industry == CatVal)
  Frame <- as.data.frame(table(Frame$DateReporting))
  colnames(Frame) <- c("Var1", "Freq")
  Frame$Var1 <- as.Date(Frame$Var1, "%Y-%m-%d")
  Frame <- Frame %>% complete(Var1 = seq.Date(as.Date("2020-10-01", "%Y-%m-%d"),
                                              as.Date("2020-11-01", "%Y-%m-%d"), by = "day"))
  Frame$Freq <- replace_na(Frame$Freq, 0)
  Frame$CumSum <- cumsum(Frame$Freq)
  Frame$Cat <- CatVal
  Frame$SevenDayAverage <- rollmean(Frame$Freq, 7, fill = NA, align = "right")
  colnames(Frame) <- c("Date", "DailyCases", "CumSum", "Industry", "SevenDayAve")
  Frame <- subset(Frame, Date >= "2020-03-13")
  return(Frame)
}
I need to create a frame that has all of these industries, so I've been doing something like this.
IndGraph <- rbind(MainFunction(CovidFake, "Minor or Student"),
MainFunction(CovidFake, "Educational Services"),
MainFunction(CovidFake, "Medical Services"),
MainFunction(CovidFake, "Food Production"))
The true frame has about 15 industries, so the code gets pretty long and seemingly unnecessarily repetitive. Is there any way to loop all the factors through the function and do this in one go? Or is there a simpler way to structure the frame? I'm new to R, so any and all help is much appreciated.
Thanks!
Using a for loop (iterating over the unique industries so each frame is built once):
IndGraph <- vector()
for(i in unique(CovidFake$Industry)){
  IndGraph <- rbind(IndGraph, MainFunction(CovidFake, i))
}
Output:
> IndGraph
# A tibble: 128 x 5
Date DailyCases CumSum Industry SevenDayAve
<date> <dbl> <dbl> <chr> <dbl>
1 2020-10-01 0 0 Minor or Student NA
2 2020-10-02 0 0 Minor or Student NA
3 2020-10-03 1 1 Minor or Student NA
4 2020-10-04 0 1 Minor or Student NA
5 2020-10-05 0 1 Minor or Student NA
6 2020-10-06 0 1 Minor or Student NA
7 2020-10-07 1 2 Minor or Student 0.286
8 2020-10-08 1 3 Minor or Student 0.429
9 2020-10-09 2 5 Minor or Student 0.714
10 2020-10-10 0 5 Minor or Student 0.571
# ... with 118 more rows
One option would be:
do.call("rbind", lapply(unique(CovidFake$Industry), FUN = function(x, y = CovidFake) MainFunction(y, x)))

How to do a frequency table where column values are variables?

I have a DF named JOB. In that DF I have 4 columns: Person_ID; JOB; FT (full time or part time, with values of 1 for full time and 2 for part time); and YEAR. Every person can have only 1 full-time job per year in this DF. This is the full-time job from which they got most of their income during the year.
DF
PERSON_ID JOB FT YEAR
1 Analyst 1 2018
1 Analyst 1 2019
1 Analyst 1 2020
2 Coach 1 2018
2 Coach 1 2019
2 Analyst 1 2020
3 Gardener 1 2020
4 Coach 1 2018
4 Coach 1 2019
4 Analyst 1 2020
4 Coach 2 2019
4 Gardener 2 2019
I want to get frequencies along the lines of the following question:
What full-time job changes occurred from 2019 to 2020?
I want to look only at rows where FT = 1.
I want my end table to look like this
2019 2020 frequency
Analyst Analyst 1
Coach Analyst 2
NA Gardener 1
I want to look at the data so that I can say 2 people moved from their coaching job to an analyst job, 1 analyst did not change their job, and one person entered the labour market as a gardener.
I tried to fiddle around with the table function but did not even get close to what I wanted. I could not get the YEARs to become separate variables.
10 bonus points if I can do it in base R :)
Thank you for your help
Not pretty, but it works:
# split df by year
df_2019 <- df[df$YEAR %in% c(2019) & df$FT == 1, ]
df_2020 <- df[df$YEAR %in% c(2020) & df$FT == 1, ]
# rename Job columns
df_2019$JOB_2019 <- df_2019$JOB
df_2020$JOB_2020 <- df_2020$JOB
# select needed columns
df_2019 <- df_2019[, c("PERSON_ID", "JOB_2019")]
df_2020 <- df_2020[, c("PERSON_ID", "JOB_2020")]
# merge dfs
df2 <- merge(df_2019, df_2020, by = "PERSON_ID", all = TRUE)
df2$frequency <- 1
df2$JOB_2019 <- addNA(df2$JOB_2019)
df2$JOB_2020 <- addNA(df2$JOB_2020)
# aggregate frequency
aggregate(frequency ~ JOB_2019 + JOB_2020, data = df2, FUN = sum, na.action=na.pass)
JOB_2019 JOB_2020 frequency
1 Analyst Analyst 1
2 Coach Analyst 2
3 <NA> Gardener 1
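As a side note, once addNA() has made NA a factor level, the same counts can be read off a plain contingency table (a sketch; rows = 2019 job, columns = 2020 job):
table(df2$JOB_2019, df2$JOB_2020)  # NA is already a level thanks to addNA()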
Not base R, but it works:
library(dplyr)
library(tidyr)
data %>%
  filter(FT == 1, YEAR %in% c(2019, 2020)) %>%
  group_by(YEAR, JOB, PERSON_ID) %>%
  tally() %>%
  pivot_wider(names_from = YEAR, values_from = JOB) %>%
  select(-PERSON_ID) %>%
  group_by(`2019`, `2020`) %>%
  summarise(n = n())
`2019` `2020` n
<chr> <chr> <int>
1 Analyst Analyst 1
2 Coach Analyst 2
3 NA Gardener 1

How to diagonally subtract different columns in R

I have a dataset of a hypothetical exam.
library(dplyr)  # for as_data_frame()
id <- c(1,1,3,4,5,6,7,7,8,9,9)
test_date <- c("2012-06-27","2012-07-10","2013-07-04","2012-03-24","2012-07-22",
               "2013-09-16","2012-06-21","2013-10-18","2013-04-21","2012-02-16","2012-03-15")
result_date <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25","2012-09-01",
                 "2013-10-20","2012-07-01","2013-10-31","2013-05-17","2012-03-17","2012-04-20")
data1 <- as_data_frame(id)
data1$test_date <- test_date
data1$result_date <- result_date
colnames(data1)[1] <- "id"
"id" indicates the ID of the students who have taken a particular exam. "test_date" is the date the students took the test and "result_date" is the date when the students' results are posted. I'm interested in finding out which students retook the exam BEFORE the result of that exam session was released, e.g. students who knew that they have underperformed and retook the exam without bothering to find out their scores. For example, student with "id" 1 took the exam for the second time on "2012-07-10" which was before the result date for his first exam - "2012-07-29".
I tried to:
data1 %>%
  group_by(id) %>%
  arrange(id, test_date) %>%
  filter(n() >= 2) # to only keep students who have taken the exam more than once,
                   # then merge back with the original data set using a join
So essentially, I want to create a new column called "re_test" where it would equal 1 if a student retook the exam BEFORE receiving the result of a previous exam and 0 otherwise (those who retook after seeing their marks or those who did not retake).
I have tried to mutate in order to find cases where the date differences are either positive or negative, by subtracting the 2nd test_date from the 1st result_date:
mutate(data1, re_test = result_date - lead(test_date, default = first(test_date)))
However, this mixes up students with different ids. I tried to split, but mutate won't work on a list of data frames, so now I'm stuck:
split(data1, data1$id)
Just to add on, this is a part of the desired result:
data2 <- as_data_frame(c(1, 1, 3, 4))
data2$test_date_result <- c("2012-06-27","2012-07-10", "2013-07-04","2012-03-24")
data2$result_date_result <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25")
data2$re_test <- c(1, 0, 0, 0)
Apologies for the verbosity and hope I was clear enough.
Thanks a lot in advance!
library(reshape2)
library(dplyr)
# first melt so that we can sequence by date
data1m <- data1 %>%
  melt(id.vars = "id", measure.vars = c("test_date", "result_date"),
       value.name = "event_date")
# any two tests in a row is a flag - use dplyr::lag to compare with the previous row
data1mc <- data1m %>%
  arrange(id, event_date) %>%
  group_by(id) %>%
  mutate(multi_test = (variable == "test_date" & lag(variable == "test_date"))) %>%
  filter(multi_test)
#   id  variable event_date multi_test
# 1  1 test_date 2012-07-10       TRUE
# 2  9 test_date 2012-03-15       TRUE
## join back to the original
data1 %>%
  left_join(data1mc %>% select(id, event_date, multi_test),
            by = c("id" = "id", "test_date" = "event_date"))
I have a piecewise answer that may work for you. I first create a data.frame called student that contains the re-test information, and then join it with the data1 object. If students re-took the test multiple times, it will compare the last test to the first, which is a flaw, but I'm unsure whether students can re-test multiple times.
student <- data1 %>%
  group_by(id) %>%
  summarise(retest = (test_date[length(test_date)] < result_date[1]) == TRUE)
Some re-test values were NA. These were individuals that only took the test once. I set these to FALSE here, but you can retain the NA, as they do contain information.
student$retest[is.na(student$retest)] <- FALSE
Join the two data.frames to a single object called data2.
data2 <- left_join(data1, student, by='id')
I am sure there are more elegant ways to approach this. I did this by taking advantage of the structure of your data (sorted by id) and the lag function, which can refer to previous records while dealing with the current record.
### Ensure Data are sorted by ID ###
data1 <- arrange(data1,id)
### Create Flag for those that repeated ###
data1$repeater <- ifelse(lag(data1$id) == data1$id,1,0)
### I chose to do this on all data, you could filter on repeater flag first ###
data1$timegap <- as.Date(data1$result_date) - as.Date(data1$test_date)
data1$lagdate <- as.Date(data1$test_date) - lag(as.Date(data1$result_date))
### Display results where your repeater flag is 1 and there is negative time lag ###
data1[data1$repeater==1 & !is.na(data1$repeater) & as.numeric(data1$lagdate) < 0,]
# A tibble: 2 × 6
id test_date result_date repeater timegap lagdate
<dbl> <chr> <chr> <dbl> <time> <time>
1 1 2012-07-10 2012-09-02 1 54 days -19 days
2 9 2012-03-15 2012-04-20 1 36 days -2 days
I went with a simple shift comparison. 1 line of code.
data1 <- data.frame(
  id = c(1,1,3,4,5,6,7,7,8,9,9),
  test_date = c("2012-06-27","2012-07-10","2013-07-04","2012-03-24","2012-07-22",
                "2013-09-16","2012-06-21","2013-10-18","2013-04-21","2012-02-16","2012-03-15"),
  result_date = c("2012-07-29","2012-09-02","2013-08-01","2012-04-25","2012-09-01",
                  "2013-10-20","2012-07-01","2013-10-31","2013-05-17","2012-03-17","2012-04-20")
)
data1$re_test <- unlist(lapply(split(data1, data1$id), function(x)
  ifelse(as.Date(x$test_date) > c(NA, as.Date(x$result_date[-nrow(x)])), 0, 1)))
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 NA
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 NA
4 4 2012-03-24 2012-04-25 NA
5 5 2012-07-22 2012-09-01 NA
6 6 2013-09-16 2013-10-20 NA
7 7 2012-06-21 2012-07-01 NA
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 NA
10 9 2012-02-16 2012-03-17 NA
11 9 2012-03-15 2012-04-20 1
I think there is a benefit in leaving the NAs, but if you really want all the others as zero, simply:
data1$re_test <- ifelse(is.na(data1$re_test), 0, data1$re_test)
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 0
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 0
4 4 2012-03-24 2012-04-25 0
5 5 2012-07-22 2012-09-01 0
6 6 2013-09-16 2013-10-20 0
7 7 2012-06-21 2012-07-01 0
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 0
10 9 2012-02-16 2012-03-17 0
11 9 2012-03-15 2012-04-20 1
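For completeness, the same shift comparison can be written with grouped dplyr instead of split()/lapply() (a sketch, assuming dplyr; the first row of each id comes out NA, as in the version above):
library(dplyr)
data1 %>%
  group_by(id) %>%
  mutate(re_test = ifelse(as.Date(test_date) > lag(as.Date(result_date)), 0, 1)) %>%
  ungroup()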
Let me know if you have any questions, cheers.

How to calculate time-weighted average and create lags

I have searched the forum but found nothing that answers or even hints at how to do what I wish to do.
I have yearly measurements of exposure data from which I wish to calculate individual-level annual averages based on each individual's entry into the study. For each row, the one-year exposure assignment should include data from the preceding 12 months, starting from the last month before joining the study.
As an example, the first person in the sample data joined the study on Feb 7, 2002. His exposure will include a contribution from January 2002 (annual average 18) and from February to December 2001 (annual average 19). The time-weighted average for this person would be (1/12 * 18) + (11/12 * 19). The two-year average exposure for the same person would extend back from January 2002 to February 2000.
Similarly, the last person, who joined the study in December 2004, will include contributions from 11 months in 2004 and one month in 2003, and his annual average exposure will be (11/12 * 5), derived from 2004, plus (1/12 * 6), which comes from the annual average of 2003.
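For concreteness, here is the arithmetic of those two examples in plain R:
(1/12) * 18 + (11/12) * 19   # first person's one-year average: 18.91667
(11/12) * 5 + (1/12) * 6     # last person's one-year average:  5.083333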
How can I calculate the 1, 2 and 5 year average exposures going back from the date of entry into the study? How can I use lags in the manner that I have described?
Sample data is accessed from this link
https://drive.google.com/file/d/0B_4NdfcEvU7La1ZCd2EtbEdaeGs/view?usp=sharing
This is not an elegant answer, but I would like to leave what I tried. I first arranged the data frame. I wanted to identify which year is the key year for each subject, so I created several helper columns: id identifies each subject; variable comes from the column names (e.g., pol_2000) in your original data set; entryYear and entryMonth come from entry in your data; and check identifies which year is the base year for each participant. In my next step, I extracted six rows for each participant using getMyRows from the SOfun package. Then I used lapply and did the math as you described in your question. For the two- and five-year averages, I divided the total values by the number of years (2 or 5). I was not sure what the final output should look like, so I decided to use the base year for each subject and added three columns to it.
library(stringi)
library(SOfun)
library(reshape2)  # for melt()
devtools::install_github("hadley/tidyr")
library(tidyr)
library(dplyr)
### Big thanks to BondedDust for this function
### http://stackoverflow.com/questions/6987478/convert-a-month-abbreviation-to-a-numeric-month-in-r
mo2Num <- function(x) match(tolower(x), tolower(month.abb))
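### For example (a quick check): mo2Num("Feb") returns 2, mo2Num("dec") returns 12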
### Arrange the data frame.
ana <- foo %>%
  mutate(id = 1:n()) %>%
  melt(id.vars = c("id", "entry")) %>%
  arrange(id) %>%
  mutate(variable = as.numeric(gsub("^.*_", "", variable)),
         entryYear = as.numeric(stri_extract_last(entry, regex = "\\d+")),
         entryMonth = mo2Num(substr(entry, 3, 5)) - 1,
         check = ifelse(variable == entryYear, "Y", "N"))
### Find the base year for each subject and get the relevant rows per participant.
indx <- which(ana$check == "Y")
bob <- getMyRows(ana, pattern = indx, -5:0)
### Get the one-year average
cathy <- lapply(bob, function(x){
  x$one <- ((x[6,6] / 12) * x[6,4]) + (((12 - x[5,6]) / 12) * x[5,4])
  x
})
one <- unnest(lapply(cathy, `[`, i = 6, j = 8))
### Get the two-year average
cathy <- lapply(bob, function(x){
  x$two <- (((x[6,6] / 12) * x[6,4]) + x[5,4] + (((12 - x[4,6]) / 12) * x[4,4])) / 2
  x
})
two <- unnest(lapply(cathy, `[`, i = 6, j = 8))
### Get the five-year average
cathy <- lapply(bob, function(x){
  x$five <- (((x[6,6] / 12) * x[6,4]) + x[5,4] + x[4,4] + x[3,4] + x[2,4] + (((12 - x[2,6]) / 12) * x[1,4])) / 5
  x
})
five <- unnest(lapply(cathy, `[`, i = 6, j = 8))
### Combine the results with the key observations
final <- cbind(ana[which(ana$check == "Y"),], one, two, five)
colnames(final) <- c(names(ana), "one", "two", "five")
# id entry variable value entryYear entryMonth check one two five
#6 1 07feb2002 2002 18 2002 1 Y 18.916667 18.500000 18.766667
#14 2 06jun2002 2002 16 2002 5 Y 16.583333 16.791667 17.150000
#23 3 16apr2003 2003 14 2003 3 Y 15.500000 15.750000 16.050000
#31 4 26may2003 2003 16 2003 4 Y 16.666667 17.166667 17.400000
#39 5 11jun2003 2003 13 2003 5 Y 13.583333 14.083333 14.233333
#48 6 20feb2004 2004 3 2004 1 Y 3.000000 3.458333 3.783333
#56 7 25jul2004 2004 2 2004 6 Y 2.000000 2.250000 2.700000
#64 8 19aug2004 2004 4 2004 7 Y 4.000000 4.208333 4.683333
#72 9 19dec2004 2004 5 2004 11 Y 5.083333 5.458333 4.800000
