Have a simple R question but cannot seem to find an answer:
I have a data frame like this:
assumption_val year
1.2 2015
0 2016
0 2017
0 2018
0 2019
I want each value to be 20% greater than the previous year's value, to output something like this:
assumption_val year
1.2 2015
1.44 2016
1.73 2017
2.07 2018
2.49 2019
How can I reference the previous row and multiply it by 1.2 to achieve this?
Thanks!
You are looking for cumprod:
cumprod(rep(1.2, 5))
Like its better-known friend cumsum, it accumulates past results, but it multiplies rather than adds.
df <- data.frame(assumption_val = cumprod(rep(1.2, 5)),
                 years = 2015:2019)
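Printing df should then give something like:
  assumption_val years
1        1.20000  2015
2        1.44000  2016
3        1.72800  2017
4        2.07360  2018
5        2.48832  2019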
A nice generalization of these functions is Reduce. For example, here is Reduce performing this calculation; replace the "*" with "+" and you have cumsum.
Reduce("*", rep(1.2, 5), accumulate = TRUE)
A nice feature of this method is that you can adjust the growth rate in each period. For instance if you wanted to start at 1.5 rather than 1.2, you would simply adjust your growth vector to c(1.5, rep(1.2, 4)) to calculate the new growth as follows:
cumprod(c(1.5, rep(1.2, 4)))
data <- read.table(textConnection("
assumption_val year
1.2 2015
0 2016
0 2017
0 2018
0 2019"), header = TRUE)
data$assumption_val <- data$assumption_val[1]^(1:nrow(data))
data
## assumption_val year
## 1 1.20000 2015
## 2 1.44000 2016
## 3 1.72800 2017
## 4 2.07360 2018
## 5 2.48832 2019
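Note that raising the first value to successive powers only works here because the starting value (1.2) happens to equal the growth factor. A more general version of that line (applied to the freshly read data, still assuming 20% growth) would be:
# start * rate^(0, 1, 2, ...) gives the same series when start == rate == 1.2
data$assumption_val <- data$assumption_val[1] * 1.2^(0:(nrow(data) - 1))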
I am trying to create a new vector by a formula, but I want to include the result of that formula for subsequent values in the same vector. Is there a simple way to do this that I am missing?
I am starting with one value and then multiplying it by a percentage (v1) that changes each year. I can get that to work, but it only includes that row and I would like it to be a running sum. I've tried manually inserting the row 1 value (1000) and then doing a formula for the remaining rows [2:10], but that's not working either (getting NAs) and I don't think that will actually result in what I want. 1000 should be the initial value and each of the following should build off it and the row above. I want each row to start with the previous row's answer and multiply it by the percentage. I've tried this several ways and have searched for the answer, but I think I'm not using the correct terminology or something, as I am sure there is a simpler way to do this. Here's a reproducible example:
library(dplyr)
df <- data.frame(Year = c(2010:2019),
                 v1 = c(1.05, 1.1, 1.12, 1.15, 1.05, 1.3, 1.2, 1.2, 1.1, 1.1))
df$v2[1]=1000
df <- df %>%
mutate(v2 = v2[2:10] * v1 + v2[1])
The intended result should look like this.
Year v1 v2
1 2010 1.05 1050.00
2 2011 1.10 1155.00
3 2012 1.12 1293.60
4 2013 1.15 1487.64
5 2014 1.05 1562.02
6 2015 1.30 2030.63
7 2016 1.20 2436.75
8 2017 1.20 2924.11
9 2018 1.10 3216.52
10 2019 1.10 3538.17
So v2[1] = 1000*1.05, v2[2] = 1050*1.10, v2[3] = 1155*1.12, etc.
We can use accumulate from purrr
library(tidyverse)
df %>%
  mutate(v2 = tail(accumulate(v1, ~ .x * .y, .init = 1000), -1))
# Year v1 v2
#1 2010 1.05 1050.000
#2 2011 1.10 1155.000
#3 2012 1.12 1293.600
#4 2013 1.15 1487.640
#5 2014 1.05 1562.022
#6 2015 1.30 2030.629
#7 2016 1.20 2436.754
#8 2017 1.20 2924.105
#9 2018 1.10 3216.516
#10 2019 1.10 3538.167
A base R option is
tail(Reduce(function(x, y) x * y, df$v1, init = 1000, accumulate = TRUE), -1)
#[1] 1050.000 1155.000 1293.600 1487.640 1562.022 2030.629
#[7] 2436.754 2924.105 3216.516 3538.167
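Since each v2 value is just the starting 1000 times the running product of the growth factors, an equivalent base R sketch (under the same assumptions as above) is simply:
df$v2 <- 1000 * cumprod(df$v1)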
I have panel data with county-level values for 15 years of different economic measures (for which I have created an index). There are missing values that I would like to interpolate. However, because the values are missing at random years, linear interpolation does not fully work: it only gives me interpolated values between the first and last non-missing data points. This is a problem because I need interpolated values for the entire series.
Since all of the series have more than 5 data points, is there any code that would interpolate each series based on the data that already exist within that series?
I first thought about indexing my data to run a loop, but then I found code for linear interpolation by groups. While the latter solved some of the NAs, it did not interpolate all of them. Here is an example of my data where some of the values are interpolated but not all.
library(dplyr)
data <- read.csv(text="
index,year,value
1,2001,20864.135
1,2002,20753.867
1,2003,NA
1,2004,17708.224
1,2005,12483.767
1,2006,12896.251
1,2007,NA
1,2008,NA
1,2009,9021.556
1,2010,NA
1,2011,NA
1,2012,13795.752
1,2013,16663.741
1,2014,19349.992
1,2015,NA
2,2001,NA
2,2002,NA
2,2003,NA
2,2004,NA
2,2005,NA
2,2006,NA
2,2007,NA
2,2008,151.108
2,2009,107.205
2,2010,90.869
2,2011,104.142
2,2012,NA
2,2013,128.646
2,2014,NA
2,2015,NA")
Using
interpolation <- data %>%
  group_by(index) %>%
  mutate(valueIpol = approx(year, value, year,
                            method = "linear", rule = 1, f = 0, ties = mean)$y)
I get the following interpolated values.
1,2001,20864.135
1,2002,20753.867
1,2003,19231.046
1,2004,17708.224
1,2005,12483.767
1,2006,12896.251
1,2007,11604.686
1,2008,10313.121
1,2009,9021.556
1,2010,10612.955
1,2011,12204.353
1,2012,13795.752
1,2013,16663.741
1,2014,19349.992
1,2015,NA
2,2001,NA
2,2002,NA
2,2003,NA
2,2004,NA
2,2005,NA
2,2006,NA
2,2007,NA
2,2008,151.108
2,2009,107.205
2,2010,90.869
2,2011,104.142
2,2012,116.394
2,2013,128.646
2,2014,NA
2,2015,NA
Any help would be appreciated. I'm pretty new to R and have never worked with loops, but I have looked at other "interpolation by groups" answers. Nothing seems to solve the issue of filling in values when the first and last points are also NAs.
Maybe this could help:
library(imputeTS)
for(i in unique(data$index)) {
  data[data$index == i, ] <- na.interpolation(data[data$index == i, ])
}
This only works when the groups themselves are already ordered by year (which is the case in your example).
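If they were not, you could sort first, for example:
data <- data[order(data$index, data$year), ]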
Output would look like this:
> data
index year value
1 1 2001 20864.135
2 1 2002 20753.867
3 1 2003 19231.046
4 1 2004 17708.224
5 1 2005 12483.767
6 1 2006 12896.251
7 1 2007 11604.686
8 1 2008 10313.121
9 1 2009 9021.556
10 1 2010 10612.955
11 1 2011 12204.353
12 1 2012 13795.752
13 1 2013 16663.741
14 1 2014 19349.992
15 1 2015 19349.992
16 2 2001 151.108
17 2 2002 151.108
18 2 2003 151.108
19 2 2004 151.108
20 2 2005 151.108
21 2 2006 151.108
22 2 2007 151.108
23 2 2008 151.108
24 2 2009 107.205
25 2 2010 90.869
26 2 2011 104.142
27 2 2012 116.394
28 2 2013 128.646
29 2 2014 128.646
30 2 2015 128.646
Since the na.interpolation function uses approx internally, you can pass parameters of approx through to adjust the behavior.
The parameters you used in your example (method = "linear", rule = 1, f = 0, ties = mean) are the standard parameters, so if you want to use these you don't have to add anything.
Otherwise you would change the line inside the loop, for example to this:
data[data$index == i,] <- na.interpolation(data[data$index == i,], ties ="ordered", f = 1, rule = 2)
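A grouped dplyr variant (just a sketch; it assumes na.interpolation also accepts the plain numeric value vector of each group) would avoid the explicit loop:
library(dplyr)
library(imputeTS)

interpolated <- data %>%
  group_by(index) %>%
  mutate(value = na.interpolation(value)) %>%
  ungroup()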
My dataset looks like this:
Year Risk Resource Utilization Band Percent
2014 0 .25
2014 1 .19
2014 2 .17
2014 3 .31
2014 4 .06
2014 5 .01
2015 0 .23
2015 1 .21
2015 2 .19
2015 3 .31
2015 4 .06
2015 5 .31
I am attempting to compare the percentage change year over year in this dataset. For example, band 0 decreased by 0.02 from 2014 to 2015. So far, I have created a loop that puts each year into bins and runs the calculation. The issue I am having is that the loop indexes each result as 1, so I have a bunch of repeating 1s next to my calculations. Here is the code I have been using; any help is much appreciated.
Results.data <- data.frame()
head(data)
percent <- 0
baseyear <- 0
nextyear <- 0
bin <- 0
yearPlus1 <-0
bin2 <-0
percent1 <-0
percent2 <-0
percentDif <-0
for(i in 1:nrow(data))
{
percent[i] <- data$PERCENT[i]
baseyear[i] <- as.numeric(data$YEAR_RISK[i])
bin[i] <- as.numeric(data$RESOURCE_UTILIZATION_BAND[i])
#print(percent[i])
#print(baseyear[i])
#print(bin[i])
}
for (k in 1:nrow(data))
{
for (j in 1:nrow(data))
{
yearPlus1 <- as.numeric(baseyear[j])-1
firstYear <- as.numeric(baseyear[k])
bin2 <-bin[j]
bin1 <- bin[k]
percent1 <- as.numeric(percent[k])
percent2 <- as.numeric(percent[j])
if(firstYear==yearPlus1 && bin1==bin2)
{
percentDif <- percent2 - percent1
print(percentDif)
Results.data <- rbind(Results.data, c(percentDif))
}
}
}
If I understand your question, you can use grouping and vectorization to avoid loops. Here's an example using the dplyr package.
The code below first sorts by Year_Risk so that the data are ordered properly by time. Then we group by Resource_Utilization_Band so that we can get results separately for each level of Resource_Utilization_Band. Finally, we calculate the difference in Percent from year to year. The lag function returns the previous value in a sequence. (Instead of lag, we could have done Change = c(NA, diff(Percent)) as well.) All of these operations are chained one after the other using the dplyr chaining operator (%>%).
(Note that when I imported your data, I also changed your column names by adding underscores to make them legal R column names.)
library(dplyr)
# Year-over-year change within each Resource_Utilization_Band
# (Assuming your starting data frame is called "dat")
dat %>%
  arrange(Year_Risk) %>%
  group_by(Resource_Utilization_Band) %>%
  mutate(Change = Percent - lag(Percent))
Year_Risk Resource_Utilization_Band Percent Change
1 2014 0 0.25 NA
2 2014 1 0.19 NA
3 2014 2 0.17 NA
4 2014 3 0.31 NA
5 2014 4 0.06 NA
6 2014 5 0.01 NA
7 2015 0 0.23 -0.02
8 2015 1 0.21 0.02
9 2015 2 0.19 0.02
10 2015 3 0.31 0.00
11 2015 4 0.06 0.00
12 2015 5 0.31 0.30
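For reference, a base R sketch of the same idea (assuming dat is sorted by Year_Risk within each band, and using diff with ave) would be:
dat <- dat[order(dat$Resource_Utilization_Band, dat$Year_Risk), ]
dat$Change <- ave(dat$Percent, dat$Resource_Utilization_Band,
                  FUN = function(p) c(NA, diff(p)))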
I have panel data with "entity" and "year". I have a column "x" whose values I treat as a time series. I want to create a new column "xp" where, for each "entity" and each "year", I store the value obtained by forecasting from the previous 5 years. If fewer than 5 previous values are available, xp = NA.
For the sake of generality, the forecast is the output of a function built in R from a couple of predefined functions found in packages like "forecast". If it is easier with a specific function, let's use forecast(auto.arima(x.L5:x.L1), h = 1).
For now, I use data.table in R because it is so much faster for all the other manipulations I make on my dataset.
However, what I want to do is not data.table 101 and I struggle with it.
I would so much appreciate a bit of your time to help me on that.
Thanks.
Here is an extract of what i would like to do:
entity year x xp
1 1980 21 NA
1 1981 23 NA
1 1982 32 NA
1 1983 36 NA
1 1984 38 NA
1 1985 45 42.3 =f((21,23,32,36,38))
1 1986 50 48.6 =f((23,32,36,38,45))
2 1991 2 NA
2 1992 4 NA
2 1993 6 NA
2 1994 8 NA
2 1995 10 NA
2 1996 12 12.4 =f((2,4,6,8,10))
2 1997 14 13.9 =f((4,6,8,10,12))
...
As suggested by Eddi, I found a way using rollapply:
DT <- data.table(mydata)
DT <- DT[order(entity,year)]
DT[, xp := rollapply(.SD$x, 5, timeseries, align = "right", fill = NA), by = "entity"]
with:
library(forecast)  # for auto.arima() and forecast()

timeseries <- function(x){
  fit <- auto.arima(x)
  value <- as.data.frame(forecast(fit, h = 1))[1, 1]
  return(value)
}
For a sample of mydata, it works perfectly. However, when I use the whole dataset (150k lines), after some computing time, I get the following error message:
Error in seq.default(start.at,NROW(data),by = by) : wrong sign in 'by' argument
Where does it come from?
Can it come from the "5" parameter in rollapply and from some specificities of certain entities in the dataset (not enough data...)?
Thanks again for your time and help.
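For what it's worth, one possible cause is entities with fewer than 5 observations: with a full 5-wide window, rollapply's internal seq() call can fail with exactly this kind of 'by' error. A sketch that guards against short groups (assuming zoo's rollapplyr, data.table's by, and the timeseries function above) might look like:
library(data.table)
library(zoo)

DT <- data.table(mydata)
setorder(DT, entity, year)

# only attempt the rolling forecast where a group has at least 5 rows
DT[, xp := if (.N >= 5) rollapplyr(x, width = 5, FUN = timeseries, fill = NA)
           else rep(NA_real_, .N),
   by = entity]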
I have a large data frame as follows which is a subset of a larger data frame.
tree = data.frame(INVYR = tree$INVYR,
                  DIA = tree$DIA, PLOT = tree$PLOT, SPCD = tree$SPCD,
                  D.2 = tree$D.2, BA.T = tree$BA.T)
What I am attempting to do is calculate the total BA.T per Plot per Year (plots are remeasured in subsequent years). I do this by ...
x<-aggregate(tree$BA.T,list(tree$INVYR,tree$PLOT),FUN=sum)
x$PLOT<-x$Group.2
x<- x[with(x, order(Group.1,Group.2)), ]
This gives me the data frame...
x=data.frame(Group.1,Group.2,x,PLOT)
Where Group.1 is the INVYR, Group.2 is the PLOT, and x is the total BA.T per plot per year. So far this works great. Here is where my problem begins. I then want to integrate this back into my original tree data frame. If I merge the data by plot only, it doesn't account for year and quadruples the data set because of the four remeasurements. I can't run an if statement because the data sets are not of equal length. The data frame I wish to end up with is
tree=data.frame(INVYR, DIA, PLOT, SPCD, D.2, BA.T, x)
where x is the total BA.T for the given INVYR and PLOT of that record.
Any thoughts would be greatly appreciated. Thanks.
Edit
INVYR=rbind(1982,1982,1982,1982,1982,1995,1995,1995,1995,1995,2000,2000,2000,2000,2000)
PLOT=rbind(1,1,2,2,3,1,1,2,2,3,1,1,2,2,3)
BA.T=rbind(.1,.2,.3,.4,.2,.3,.5,.8,.3,.6,.7,.2,.1,1,1.02)
tree=data.frame(INVYR,PLOT,BA.T)
head(tree)
x<-aggregate(tree$BA.T,list(tree$INVYR,tree$PLOT),FUN=sum)
x$PLOT<-x$Group.2
x$INVYR<-x$Group.1
x<- x[with(x, order(Group.1,Group.2)), ]
head(x)
One solution is to use the reshape2 package.
library(reshape2)
tree.m <- melt(data = tree, id.vars = c('INVYR', 'PLOT'))  ## notice the choice of the id variables (the keys)
dcast(tree.m, formula = ... ~ variable, fun.aggregate = sum)
INVYR PLOT BA.T
1 1982 1 0.30
2 1982 2 0.70
3 1982 3 0.20
4 1995 1 0.80
5 1995 2 1.10
6 1995 3 0.60
7 2000 1 0.90
8 2000 2 1.10
9 2000 3 1.02
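To attach the plot-by-year total back onto each original record (the original question), one sketch is to store the aggregated result and merge it back by year and plot; the column name BA.T.total is just an illustrative choice:
totals <- dcast(tree.m, formula = ... ~ variable, fun.aggregate = sum)
names(totals)[names(totals) == "BA.T"] <- "BA.T.total"  # rename so it doesn't clash on merge
tree <- merge(tree, totals, by = c("INVYR", "PLOT"))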