How to calculate growth rate in a long-format data frame? - r

With data structured as follows...
df <- data.frame(Category=c(rep("A",6),rep("B",6)),
Year=rep(2010:2015,2),Value=1:12)
I'm having a tough time creating a growth rate column (by year) within category. Can anyone help with code to create something like this...
Category Year Value Growth
A 2010 1
A 2011 2 1.000
A 2012 3 0.500
A 2013 4 0.333
A 2014 5 0.250
A 2015 6 0.200
B 2010 7
B 2011 8 0.143
B 2012 9 0.125
B 2013 10 0.111
B 2014 11 0.100
B 2015 12 0.091

For these sorts of questions ("how do I compute XXX by category YYY?") there are always solutions based on by(), the data.table package, and plyr. I generally prefer plyr, which is often slower but (to me) more transparent and elegant.
df <- data.frame(Category=c(rep("A",6),rep("B",6)),
Year=rep(2010:2015,2),Value=1:12)
library(plyr)
ddply(df,"Category",transform,
Growth=c(NA,exp(diff(log(Value)))-1))
The main difference between this answer and @krlmr's is that I am using a geometric-mean trick (taking differences of logs and then exponentiating) while @krlmr computes an explicit ratio.
Mathematically, diff(log(Value)) is taking the differences of the logs, i.e. log(x[t+1])-log(x[t]) for all t. When we exponentiate that we get the ratio x[t+1]/x[t] (because exp(log(x[t+1])-log(x[t])) = exp(log(x[t+1]))/exp(log(x[t])) = x[t+1]/x[t]). The OP wanted the fractional change rather than the multiplicative growth rate (i.e. x[t+1]==x[t] corresponds to a fractional change of zero rather than a multiplicative growth rate of 1.0), so we subtract 1.
I am also using transform() for a little bit of extra "syntactic sugar", to avoid creating a new anonymous function.
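The identity described above can be checked numerically. This is a minimal base R sketch on a made-up vector (the names x, ratio_growth and log_growth are only illustrative), showing that the log-difference form and the explicit ratio form give the same fractional change:

```r
# Illustrative values; any strictly positive vector works for the log form
x <- c(1, 2, 3, 4, 5, 6)

# Explicit ratio form: (x[t+1] - x[t]) / x[t]
ratio_growth <- diff(x) / head(x, -1)

# Log-difference form: exp(log(x[t+1]) - log(x[t])) - 1
log_growth <- exp(diff(log(x))) - 1

# The two agree up to floating-point error
all.equal(ratio_growth, log_growth)
```

Note that the log form requires strictly positive values, whereas the ratio form only requires non-zero denominators.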

You can simply use the dplyr package:
> library(dplyr)
> df %>% group_by(Category) %>% mutate(Growth = (Value - lag(Value))/lag(Value))
which will produce the following result:
# A tibble: 12 x 4
# Groups: Category [2]
Category Year Value Growth
<fct> <int> <int> <dbl>
1 A 2010 1 NA
2 A 2011 2 1
3 A 2012 3 0.5
4 A 2013 4 0.333
5 A 2014 5 0.25
6 A 2015 6 0.2
7 B 2010 7 NA
8 B 2011 8 0.143
9 B 2012 9 0.125
10 B 2013 10 0.111
11 B 2014 11 0.1
12 B 2015 12 0.0909

Using the base R function ave:
> df$Growth <- with(df, ave(Value, Category,
FUN=function(x) c(NA, diff(x)/x[-length(x)]) ))
> df
Category Year Value Growth
1 A 2010 1 NA
2 A 2011 2 1.00000000
3 A 2012 3 0.50000000
4 A 2013 4 0.33333333
5 A 2014 5 0.25000000
6 A 2015 6 0.20000000
7 B 2010 7 NA
8 B 2011 8 0.14285714
9 B 2012 9 0.12500000
10 B 2013 10 0.11111111
11 B 2014 11 0.10000000
12 B 2015 12 0.09090909
@Ben Bolker's answer is easily adapted to ave:
transform(df, Growth=ave(Value, Category,
FUN=function(x) c(NA,exp(diff(log(x)))-1)))

Very easy with plyr:
library(plyr)
ddply(df, .(Category),
function (d) {
d$Growth <- c(NA, tail(d$Value, -1) / head(d$Value, -1) - 1)
d
}
)
We have two problems here:
Splitting by category
Computing the growth rate
ddply is the workhorse: both the split and the function that computes the growth rate are specified through its arguments.
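For readers without plyr, the same two steps (split by category, then compute the growth rate) can be sketched in base R. This is only an illustration of the split-apply-combine idea, not what ddply does internally, and it assumes the rows are already sorted by Year within each Category:

```r
df <- data.frame(Category = rep(c("A", "B"), each = 6),
                 Year = rep(2010:2015, 2), Value = 1:12)

# Step 1: split the data frame into one piece per category
pieces <- split(df, df$Category)

# Step 2: compute the growth rate within each piece
pieces <- lapply(pieces, function(d) {
  d$Growth <- c(NA, tail(d$Value, -1) / head(d$Value, -1) - 1)
  d
})

# Combine the pieces back into a single data frame
result <- do.call(rbind, pieces)
rownames(result) <- NULL
result
```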

A more elegant variant based on Ben's idea with the new gdiff function in my R package:
df <- data.frame(Category=c(rep("A",6),rep("B",6)),
Year=rep(2010:2015,2),Value=1:12)
library(plyr)
ddply(df, "Category", transform,
Growth=c(NA, kimisc::gdiff(Value, FUN = `/`)-1))
Here, gdiff is used to compute a lagged rate (instead of a lagged difference as diff would).

Many years later: the tsbox package aims to work with all kinds of time series objects, including data frames, and offers a standard time series toolkit. Thus, calculating growth rates (here as percentage changes) is as simple as:
df <- data.frame(Category=c(rep("A",6),rep("B",6)),
Year=rep(2010:2015,2),Value=1:12)
library(tsbox)
ts_pc(df)
#> [time]: 'Year' [value]: 'Value'
#> Category Year Value
#> 1 A 2010-01-01 NA
#> 2 A 2011-01-01 100.000000
#> 3 A 2012-01-01 50.000000
#> 4 A 2013-01-01 33.333333
#> 5 A 2014-01-01 25.000000
#> 6 A 2015-01-01 20.000000
#> 7 B 2010-01-01 NA
#> 8 B 2011-01-01 14.285714
#> 9 B 2012-01-01 12.500000
#> 10 B 2013-01-01 11.111111
#> 11 B 2014-01-01 10.000000
#> 12 B 2015-01-01 9.090909

The collapse package, available on CRAN, provides an easy and fully C/C++ based solution to these kinds of problems, with the generic function fgrowth and the associated growth operator G:
df <- data.frame(Category=c(rep("A",6),rep("B",6)),
Year=rep(2010:2015,2),Value=1:12)
library(collapse)
G(df, by = ~Category, t = ~Year)
Category Year G1.Value
1 A 2010 NA
2 A 2011 100.000000
3 A 2012 50.000000
4 A 2013 33.333333
5 A 2014 25.000000
6 A 2015 20.000000
7 B 2010 NA
8 B 2011 14.285714
9 B 2012 12.500000
10 B 2013 11.111111
11 B 2014 10.000000
12 B 2015 9.090909
# fgrowth is more of a programmer's function; you can do:
fgrowth(df$Value, 1, 1, df$Category, df$Year)
[1] NA 100.000000 50.000000 33.333333 25.000000 20.000000 NA 14.285714 12.500000 11.111111 10.000000 9.090909
# This means: calculate the growth rate of Value, using 1 lag, iterated 1 time (these functions can compute arbitrary sequences of lagged / leaded and iterated growth rates), with groups identified by Category and time by Year.
fgrowth / G also have methods for the plm::pseries and plm::pdata.frame classes available in the plm package.
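To make the meaning of "using 1 lag" concrete, here is a minimal base R sketch of a k-period percentage growth rate on a single ungrouped vector. The helper growth_pct is hypothetical and written only for illustration; it is not collapse's implementation and ignores grouping and time ordering:

```r
# Hypothetical helper: percentage growth of x relative to its value k steps earlier
growth_pct <- function(x, k = 1) {
  n <- length(x)
  if (k >= n) return(rep(NA_real_, n))
  # Shift x forward by k positions, padding the start with NA
  lagged <- c(rep(NA, k), x[seq_len(n - k)])
  (x - lagged) / lagged * 100
}

one_lag <- growth_pct(1:6)          # growth over one period
two_lag <- growth_pct(1:6, k = 2)   # growth over two periods
one_lag
two_lag
```

With k = 1 this gives the same values as the group A rows shown above (NA, 100, 50, 33.3, 25, 20).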

Related

counting NA from R Dataframe in a for loop

I have a time series data frame in R covering 2011 to 2018. How can I write a for loop that counts the number of NA values in each year separately and, if a given year has more than x % missing, drops that year or handles it in some other way?
Please refer to the image to see what my data frame looks like.
https://i.stack.imgur.com/2fwDk.png
years_values <- 2011:2020
years = pretty(years_values,n=10)
count = 0
for (y in years){
for (j in df$Flow == y) {
if (is.na(df$Flow[j]){
count = count+1
}
}
if (count) > 1{
bfi = BFI(df$Flow == y)}
else {bfi = NA}
}
I am trying to use this code to loop over each year and count the NA values. If the share of NA values is greater than 1%, I do not want to compute the BFI; if it is less, I want to compute it. My BFI function already works well. The problem I have is formulating this loop.
Since you have not included any reproducible data, let us take a simple example that captures the essence of your own data. We have a column called Year and one called Flow that contains some missing values:
df <- data.frame(Year = rep(2011:2013, each = 4),
Flow = c(1, 2, NA, NA, 5, 6, NA, 8, 9, 10, 11, 12))
df
#> Year Flow
#> 1 2011 1
#> 2 2011 2
#> 3 2011 NA
#> 4 2011 NA
#> 5 2012 5
#> 6 2012 6
#> 7 2012 NA
#> 8 2012 8
#> 9 2013 9
#> 10 2013 10
#> 11 2013 11
#> 12 2013 12
Now suppose we want to count the number of missing values in each year. We can use table and is.na, like this:
tab <- table(df$Year, is.na(df$Flow))
tab
#>
#> FALSE TRUE
#> 2011 2 2
#> 2012 3 1
#> 2013 4 0
We can see that these are the absolute counts of missing values, but we can convert this into proportions by dividing the second column by the row sums of this table:
props <- tab[,2] / rowSums(tab)
props
#> 2011 2012 2013
#> 0.50 0.25 0.00
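As a cross-check on the table-based proportions, the same numbers can be obtained in one step, because the mean of a logical vector is the fraction of TRUE values. A minimal sketch on the same example data (props_alt is just an illustrative name):

```r
df <- data.frame(Year = rep(2011:2013, each = 4),
                 Flow = c(1, 2, NA, NA, 5, 6, NA, 8, 9, 10, 11, 12))

# Fraction of missing Flow values within each year
props_alt <- tapply(is.na(df$Flow), df$Year, mean)
props_alt
```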
Now, suppose we want to find and remove the years where more than 33% of cases are missing. We can just filter the values of props that are greater than 0.33 and get the associated year (or years):
years_to_drop <- names(props)[props > 0.33]
years_to_drop
#> [1] "2011"
Now we can use this to remove the years with more than 33% missing values from our original data frame by doing:
df[!df$Year %in% years_to_drop,]
#> Year Flow
#> 5 2012 5
#> 6 2012 6
#> 7 2012 NA
#> 8 2012 8
#> 9 2013 9
#> 10 2013 10
#> 11 2013 11
#> 12 2013 12
Created on 2022-11-14 with reprex v2.0.2
As Allan Cameron suggests, there's no need to use a loop, and R is usually more efficient with vectorized operations anyway.
I would suggest a solution based on ave (using the synthetic data from the previous answer)
df$NA_fraction <- ave(df$Flow, df$Year, FUN = \(values) mean(is.na(values)))
df
Year Flow NA_fraction
1 2011 1 0.50
2 2011 2 0.50
3 2011 NA 0.50
4 2011 NA 0.50
5 2012 5 0.25
6 2012 6 0.25
7 2012 NA 0.25
8 2012 8 0.25
9 2013 9 0.00
10 2013 10 0.00
11 2013 11 0.00
12 2013 12 0.00
You can then pick whatever threshold and filter by it
> df[df$NA_fraction < 0.3,]
Year Flow NA_fraction
5 2012 5 0.25
6 2012 6 0.25
7 2012 NA 0.25
8 2012 8 0.25
9 2013 9 0.00
10 2013 10 0.00
11 2013 11 0.00
12 2013 12 0.00
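If the goal is the one stated in the question, computing a per-year statistic only when the year has few enough missing values, the same idea can be sketched without an explicit counter. The asker's BFI() function is not shown, so bfi_stub below is a hypothetical stand-in (a plain mean), and the 0.33 threshold is only an example:

```r
df <- data.frame(Year = rep(2011:2013, each = 4),
                 Flow = c(1, 2, NA, NA, 5, 6, NA, 8, 9, 10, 11, 12))

# Hypothetical stand-in for the asker's BFI() function, which is not shown
bfi_stub <- function(flow) mean(flow, na.rm = TRUE)

# Example threshold: skip years with more than 33% missing values
max_na_fraction <- 0.33

# One value per year: NA if too many values are missing, otherwise the statistic
bfi_by_year <- sapply(split(df$Flow, df$Year), function(flow) {
  if (mean(is.na(flow)) > max_na_fraction) NA_real_ else bfi_stub(flow)
})
bfi_by_year
```

Replacing bfi_stub with the real BFI function should be the only change needed, assuming it accepts a numeric vector of flows.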

Average percentage change over different years in R

I have a data frame from which I created a reproducible example:
country <- c('A','A','A','B','B','C','C','C','C')
year <- c(2010,2011,2015,2008,2009,2008,2009,2011,2015)
score <- c(1,2,2,1,4,1,1,3,2)
country year score
1 A 2010 1
2 A 2011 2
3 A 2015 2
4 B 2008 1
5 B 2009 4
6 C 2008 1
7 C 2009 1
8 C 2011 3
9 C 2015 2
And I am trying to calculate the average percentage increase (or decrease) in the score for each country by calculating [(final score - initial score) ÷ (initial score)] for each year and averaging it over the number of years.
country year score change
1 A 2010 1 NA
2 A 2011 2 1
3 A 2015 2 0
4 B 2008 1 NA
5 B 2009 4 3
6 C 2008 1 NA
7 C 2009 1 0
8 C 2011 3 2
9 C 2015 2 -0.33
The final result I am hoping to obtain:
country avg_change
1 A 0.5
2 B 3
3 C 0.55
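As you can see, the trick is that countries span different years, sometimes with a missing year in between. I tried different ways to do it manually but I am struggling. If someone could point me toward a solution, that would be great. Many thanks.
With dplyr, we can group_by country and get the mean of the differences between consecutive scores. Note that this averages absolute differences rather than percentage changes, which is why country C comes out as 0.333 rather than the 0.55 in the desired result.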
library(dplyr)
df %>%
group_by(country) %>%
summarise(avg_change = mean(c(NA, diff(score)), na.rm = TRUE))
# country avg_change
# <fct> <dbl>
#1 A 0.500
#2 B 3.00
#3 C 0.333
Using base R aggregate with same logic
aggregate(score~country, df, function(x) mean(c(NA, diff(x)), na.rm = TRUE))
We can use data.table to group by 'country' and take the mean of the difference between 'score' and its lag. data.table's own shift() is used for the lag here, because the base lag() function does not shift a plain vector:
library(data.table)
setDT(df1)[, .(avg_change = mean(score - shift(score), na.rm = TRUE)), .(country)]
# country avg_change
#1: A 0.5000000
#2: B 3.0000000
#3: C 0.3333333
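The answers above average the raw differences. If the average percentage change described in the question is wanted instead, each difference has to be divided by the previous score. A minimal base R sketch, assuming the data frame is built from the question's vectors and sorted by year within country:

```r
df <- data.frame(country = c('A','A','A','B','B','C','C','C','C'),
                 year = c(2010,2011,2015,2008,2009,2008,2009,2011,2015),
                 score = c(1,2,2,1,4,1,1,3,2))

# Make sure observations are in time order within each country
df <- df[order(df$country, df$year), ]

# Mean of (score[t] - score[t-1]) / score[t-1] per country
avg_change <- aggregate(score ~ country, df,
                        function(x) mean(diff(x) / head(x, -1)))
names(avg_change)[2] <- "avg_change"
avg_change
```

This gives 0.5, 3 and about 0.556 for A, B and C, in line with the desired result in the question (which rounds the last change to -0.33). A country with a single observation would yield NaN, and gaps between years are ignored, as in the question's own worked table.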

How to add means to an existing column in R

I am manipulating a dataset but I can't make things right.
Here's an example for this, where df is the name of data frame.
year ID value
2013 1 10
2013 2 20
2013 3 10
2014 1 20
2014 2 20
2014 3 30
2015 1 20
2015 2 10
2015 3 30
So I tried to make another data frame df1 <- aggregate(value ~ year, df, mean, na.rm=TRUE)
And made this data frame df1:
year ID value
2013 avg 13.3
2014 avg 23.3
2015 avg 20
But I want to add each mean by year into each row of df.
The expected form is:
year ID value
2013 1 10
2013 2 20
2013 3 10
2013 avg 13.3
2014 1 20
2014 2 20
2014 3 30
2014 avg 23.3
2015 1 20
2015 2 10
2015 3 30
2015 avg 20
Here is an option with data.table, where we convert the 'data.frame' to a 'data.table' (setDT(df)), group by 'year', get the mean of 'value' with 'ID' set to 'avg', then use rbindlist to rbind both datasets and order by 'year':
library(data.table)
rbindlist(list(setDT(df), df[, .(ID = 'avg', value = mean(value)), year]))[order(year)]
# year ID value
# 1: 2013 1 10.00000
# 2: 2013 2 20.00000
# 3: 2013 3 10.00000
# 4: 2013 avg 13.33333
# 5: 2014 1 20.00000
# 6: 2014 2 20.00000
# 7: 2014 3 30.00000
# 8: 2014 avg 23.33333
# 9: 2015 1 20.00000
#10: 2015 2 10.00000
#11: 2015 3 30.00000
#12: 2015 avg 20.00000
Or using the OP's method, rbind both the datasets and then order
df2 <- rbind(df, transform(df1, ID = 'avg'))
df2 <- df2[order(df2$year),]
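For completeness, the base R route can be written as a self-contained sketch that rebuilds the example data from the question. The column reordering with df1[names(df)] is not strictly required, since rbind matches data frame columns by name, but it makes the intent explicit:

```r
df <- data.frame(year = rep(2013:2015, each = 3),
                 ID = rep(1:3, 3),
                 value = c(10, 20, 10, 20, 20, 30, 20, 10, 30))

# Yearly means, labelled with ID "avg"
df1 <- aggregate(value ~ year, df, mean)
df1$ID <- "avg"

# Append the mean rows and sort so each "avg" row follows its year
df2 <- rbind(df, df1[names(df)])
df2 <- df2[order(df2$year), ]
rownames(df2) <- NULL
df2
```

Because ID now mixes numbers and the label "avg", the column ends up as character in the combined data frame.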

Replace NA with mean of adjacent values

I want to replace each NA value in the "return" column with the mean of the adjacent non-missing values, grouped by "id". Let's assume that there are only two months (1 and 2) in a year.
df <- data.frame(id = c("A","A","A","A","B","B","B","B"),
year = c(2014,2014,2015,2015),
month = c(1, 2),
marketcap = c(4,6,2,6,23,2,5,34),
return = c(NA,0.23,0.2,0.1,0.4,0.9,NA,0.6))
df1
id year month marketcap return
1: A 2014 1 4 NA # <-
2: A 2014 2 6 0.23
3: A 2015 1 2 0.20
4: A 2015 2 6 0.10
5: B 2014 1 23 0.40
6: B 2014 2 2 0.90
7: B 2015 1 5 NA # <-
8: B 2015 2 34 0.60
Desired data
desired_df <- data.frame(id = c("A","A","A","A","B","B","B","B"),
year = c(2014,2014,2015,2015),
month = c(1,2),
marketcap = c(4,6,2,6,23,2,5,34),
return = c(0.23,0.23,0.2,0.1,0.4,0.9,0.75,0.6))
desired_df
id year month marketcap return
1 A 2014 1 4 0.23 # <-
2 A 2014 2 6 0.23
3 A 2015 1 2 0.20
4 A 2015 2 6 0.10
5 B 2014 1 23 0.40
6 B 2014 2 2 0.90
7 B 2015 1 5 0.75 # <-
8 B 2015 2 34 0.60
The second NA (row 7) should be replaced by the mean of the values before and after, i.e. (0.9 + 0.6)/2 = 0.75.
Note that the first NA (row 1), has no previous data. Here NA should be replaced with the next non-missing value, 0.23 ("last observation carried backwards").
A data.table solution is preferred if it is possible
UPDATE:
When I use the following code (which works for the sample)
df[,returnInterpolate:=na.approx(return,rule=2), by=id]
I have encountered the error:
Error in approx(x[!na], y[!na], xout, ...) :
need at least two non-NA values to interpolate
I guess there may be some ids that do not have enough non-NA values to interpolate. Any suggestions?
library(data.table)
df <- data.frame(id=c("A","A","A","A","B","B","B","B"),
year=c(2014,2014,2015,2015),
month=c(1,2),
marketcap=c(4,6,2,6,23,2,5,34),
return=c(NA,0.23,0.2,0.1,0.4,0.9,NA,0.6))
setDT(df)
library(zoo)
df[, returnInterpol := na.approx(return, rule = 2), by = id]
# id year month marketcap return returnInterpol
#1: A 2014 1 4 NA 0.23
#2: A 2014 2 6 0.23 0.23
#3: A 2015 1 2 0.20 0.20
#4: A 2015 2 6 0.10 0.10
#5: B 2014 1 23 0.40 0.40
#6: B 2014 2 2 0.90 0.90
#7: B 2015 1 5 NA 0.75
#8: B 2015 2 34 0.60 0.60
Edit:
If you have groups with only NA values or only one non-NA, you could do this:
df <- data.frame(id=c("A","A","A","A","B","B","B","B","C","C","C","C"),
year=c(2014,2014,2015,2015),
month=c(1,2),
marketcap=c(4,6,2,6,23,2,5,34, 1:4),
return=c(NA,0.23,0.2,0.1,0.4,0.9,NA,0.6,NA,NA,0.3,NA))
setDT(df)
df[, returnInterpol := switch(as.character(sum(!is.na(return))),
"0" = return,
"1" = {na.omit(return)},
na.approx(return, rule = 2)), by = id]
# id year month marketcap return returnInterpol
# 1: A 2014 1 4 NA 0.23
# 2: A 2014 2 6 0.23 0.23
# 3: A 2015 1 2 0.20 0.20
# 4: A 2015 2 6 0.10 0.10
# 5: B 2014 1 23 0.40 0.40
# 6: B 2014 2 2 0.90 0.90
# 7: B 2015 1 5 NA 0.75
# 8: B 2015 2 34 0.60 0.60
# 9: C 2014 1 1 NA 0.30
# 10: C 2014 2 2 NA 0.30
# 11: C 2015 1 3 0.30 0.30
# 12: C 2015 2 4 NA 0.30
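The same guard logic (groups with zero or one non-NA value) can also be sketched in base R with stats::approx and ave, without zoo or data.table. The helper fill_linear is hypothetical and written only for illustration. Note that linear interpolation equals the mean of the two neighbours only when a single value is missing between them:

```r
# Hypothetical helper: linear interpolation, with end values carried outwards
fill_linear <- function(x) {
  ok <- !is.na(x)
  if (sum(ok) == 0) return(x)                       # nothing to interpolate from
  if (sum(ok) == 1) return(rep(x[ok], length(x)))   # a single value: repeat it
  approx(which(ok), x[ok], xout = seq_along(x), rule = 2)$y
}

id  <- rep(c("A", "B", "C"), each = 4)
ret <- c(NA, 0.23, 0.2, 0.1, 0.4, 0.9, NA, 0.6, NA, NA, 0.3, NA)

# Apply the helper within each id
filled <- ave(ret, id, FUN = fill_linear)
filled
```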
The easy imputeTS solution, without regard to the ID, would be:
library("imputeTS")
na.interpolation(df)
Since the imputation should be done per ID, it is a little more complicated, because there are often not enough values left once the data are filtered by ID. I would take the solution Roland posted and use imputeTS::na.interpolation() where possible; in the other cases, the overall mean with imputeTS::na.mean() or a random guess within the overall bounds with imputeTS::na.random() could be used. (From imputeTS 3.0 on, these functions are named na_interpolation(), na_mean(), and na_random().)
In this case it might also be a very good idea to look beyond univariate time series interpolation / imputation. Other variables in the data could help estimate the missing values (if they are correlated with them). Packages like Amelia could help here.

For Loop and Table Printing in R

I'm trying to compute confidence intervals for many rows of a table using a for loop, and would like output that is more readable. Here is a snippet of how the data looks.
QUESTION X_YEAR X_PARTNER X_CAMP X_N X_CODE1
1 Q1 2011 SCSD ITC 15 4
2 Q1 2011 SCSD Nottingham 4 1
3 Q1 2011 SCSD ALL 19 5
4 Q1 2011 CP CP1 18 4
5 Q1 2011 ALL ALL 37 9
6 Q1 2012 SCSD ITC 8 1
7 Q1 2012 SCSD Nottingham 8 2
8 Q1 2012 SCSD ALL 16 3
9 Q1 2012 CP CP1 18 2
10 Q1 2012 CP CP1 22 2
11 Q1 2012 CP ALL 40 4
I'm trying to print out a confidence interval, with the Question, Year and Camp included. I'd like the output to be in table form like this
QUESTION YEAR CAMP X N MEAN LOWER UPPER
Q1 2011 ITC 4 15 0.26 0.07 0.55
Q1 2011 NOTTINGHAM 1 4 0.25 0.006 0.8
with the first three columns being taken directly from the data table, and the latter 4 extracted from a confidence interval test I'm using.
The code I'm currently using:
for (i in 1:26){
print(data[i,1],max.levels=0)
print(data[i,2],max.levels=0)
print(data[i,4],max.levels=0)
print(binom.confint(data[i,6],data[i,5],conf.level=0.95,methods="exact"))
}
provides output that (I have a lot more data than the snippet) will be far too time consuming to sift through...
[1] Q1
[1] 2011
[1] ITC
method x n mean lower upper
1 exact 4 15 0.2666667 0.07787155 0.5510032
[1] Q1
[1] 2011
[1] Nottingham
method x n mean lower upper
1 exact 1 4 0.25 0.006309463 0.8058796
Any advice is appreciated!
If df is the name of your data, and you only want to do this for the rows where QUESTION is Q1 (see comments), then
library(binom)
df2 <- df[df$QUESTION == "Q1",]
x <- vector("list", nrow(df2))
for(i in seq_len(nrow(df2))) {
x[[i]] <- binom.confint(df2[i,6], df2[i,5], methods = "exact")
}
cbind(df2[c(1,2,4)], do.call(rbind, x)[,-1])
# QUESTION X_YEAR X_CAMP x n mean lower upper
# 1 Q1 2011 ITC 4 15 0.26666667 0.077871546 0.5510032
# 2 Q1 2011 Nottingham 1 4 0.25000000 0.006309463 0.8058796
# 3 Q1 2011 ALL 5 19 0.26315789 0.091465785 0.5120293
# 4 Q1 2011 CP1 4 18 0.22222222 0.064092048 0.4763728
# 5 Q1 2011 ALL 9 37 0.24324324 0.117725174 0.4119917
# 6 Q1 2012 ITC 1 8 0.12500000 0.003159724 0.5265097
# 7 Q1 2012 Nottingham 2 8 0.25000000 0.031854026 0.6508558
# 8 Q1 2012 ALL 3 16 0.18750000 0.040473734 0.4564565
# 9 Q1 2012 CP1 2 18 0.11111111 0.013751216 0.3471204
# 10 Q1 2012 CP1 2 22 0.09090909 0.011205586 0.2916127
# 11 Q1 2012 ALL 4 40 0.10000000 0.027925415 0.2366374
Note that conf.level = 0.95 is the default setting for binom.confint, so you don't need to include it in your call.
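If the binom package is not available, the exact (Clopper-Pearson) interval can also be obtained from binom.test in base R. The sketch below uses the first three rows of the example data, typed in by hand, and is only meant to show the shape of a loop-free version:

```r
# Successes (X_CODE1) and trials (X_N) for the first three rows of the example
x <- c(4, 1, 5)
n <- c(15, 4, 19)

# One row per (x, n) pair: point estimate plus exact 95% confidence limits
ci <- t(mapply(function(xi, ni) {
  bt <- binom.test(xi, ni, conf.level = 0.95)
  c(mean = xi / ni, lower = bt$conf.int[1], upper = bt$conf.int[2])
}, x, n))
ci
```

The resulting matrix can be attached to the identifying columns with cbind, in the same way as in the answer above.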

Resources