Populate a column with forecasts of panel data using data.table in R - r

I have panel data indexed by "entity" and "year". I have a column "x" whose values I treat as a time series. I want to create a new column "xp" that holds, for each "entity" and each "year", the forecast obtained from the previous 5 years. If fewer than 5 previous values are available, xp=NA.
For the sake of generality, the forecast is the output of an R function built from a couple of predefined functions found in packages such as "forecast". If it is easier with a specific function, let's use forecast(auto.arima(x.L5:x.L1),h=1).
For now, I use data.table in R because it is so much faster for all the other manipulations I make on my dataset.
However, what I want to do is not data.table 101 and I struggle with it.
I would so much appreciate a bit of your time to help me on that.
Thanks.
Here is an extract of what I would like to do:
entity year x xp
1 1980 21 NA
1 1981 23 NA
1 1982 32 NA
1 1983 36 NA
1 1984 38 NA
1 1985 45 42.3 =f((21,23,32,36,38))
1 1986 50 48.6 =f((23,32,36,38,45))
2 1991 2 NA
2 1992 4 NA
2 1993 6 NA
2 1994 8 NA
2 1995 10 NA
2 1996 12 12.4 =f((2,4,6,8,10))
2 1997 14 13.9 =f((4,6,8,10,12))
...
As suggested by Eddi, I found a way using rollapply:
DT <- data.table(mydata)
DT <- DT[order(entity,year)]
DT[,xp:=rollapply(.SD$x,5,timeseries,align="right",fill=NA),by="entity"]
with:
timeseries <- function(x){
fit <- auto.arima(x)
value <- as.data.frame(forecast(fit,h=1))[1,1]
return(value)
}
For a sample of mydata, it works perfectly. However, when I use the whole dataset (150k lines), after some computing time, I get the following error message:
Error in seq.default(start.at,NROW(data),by = by) : wrong sign in 'by' argument
Where does it come from?
Can it come from the "5" parameter in rollapply combined with peculiarities of certain entities in the dataset (not enough data...)?
Thanks again for your time and help.
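The error message is at least consistent with some entity having fewer rows than the window width, though I have not checked this against the real data. Here is a minimal sketch that avoids the problem by guarding short groups explicitly. Note that it forecasts from the 5 values strictly before each year (as in the desired output above), whereas a right-aligned rollapply window includes the current year. The names roll_prev and one_step are made up for illustration, and one_step is only a stand-in (a plain mean) for the real auto.arima-based forecaster.

```r
library(data.table)

# Hypothetical stand-in forecaster (an assumption, not the real model).
# Swap in something like:
#   function(w) as.numeric(forecast::forecast(forecast::auto.arima(w), h = 1)$mean)
one_step <- function(w) mean(w)

# For position i, apply f to the `width` values strictly before i.
# Positions with fewer than `width` earlier values stay NA, so groups
# shorter than the window never reach the forecasting call.
roll_prev <- function(x, width = 5, f = one_step) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n > width) {
    for (i in (width + 1):n) out[i] <- f(x[(i - width):(i - 1)])
  }
  out
}

# Toy data: entity 1 has 7 years, entity 2 only 3 (shorter than the window).
DT <- data.table(
  entity = c(rep(1, 7), rep(2, 3)),
  year   = c(1980:1986, 1991:1993),
  x      = c(21, 23, 32, 36, 38, 45, 50, 2, 4, 6)
)
setkey(DT, entity, year)               # sort by entity, then year
DT[, xp := roll_prev(x), by = entity]  # one rolling pass per entity
```

With the mean as stand-in, entity 1 gets forecasts only for 1985 and 1986, and entity 2 stays NA throughout instead of raising an error.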

Related

How to prevent extrapolation using na.spline()

I'm having trouble with the na.spline() function in the zoo package. Although the documentation explicitly states that this is an interpolation function, the behaviour I'm getting includes extrapolation.
The following code reproduces the problem:
require(zoo)
vector <- c(NA,NA,NA,NA,NA,NA,5,NA,7,8,NA,NA)
na.spline(vector)
The output of this should be:
NA NA NA NA NA NA 5 6 7 8 NA NA
This would be interpolation of the internal NA, leaving the trailing NAs in place. But, instead I get:
-1 0 1 2 3 4 5 6 7 8 9 10
According to the documentation, this shouldn't happen. Is there some way to avoid extrapolation?
I recognise that in my example, I could use linear interpolation, but this is a MWE. Although I'm not necessarily wed to the na.spline() function, I need some way to interpolate using cubic splines.
This behavior appears to be coming from the stats::spline function, e.g.,
spline(seq_along(vector), vector, xout=seq_along(vector))$y
# [1] -1 0 1 2 3 4 5 6 7 8 9 10
Here is a work around, using the fact that na.approx strictly interpolates.
replace(na.spline(vector), is.na(na.approx(vector, na.rm=FALSE)), NA)
# [1] NA NA NA NA NA NA 5 6 7 8 NA NA
Edit
As @G.Grothendieck suggests in the comments below, another, no doubt more performant, way is:
na.spline(vector) + 0*na.approx(vector, na.rm = FALSE)
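If a dependency-free variant is useful, here is a rough base-R sketch of the same masking idea: fit the spline through the non-NA points with stats::spline, then blank out every position outside the range of observed indices. The helper name interp_spline is made up for illustration, and the sketch assumes the input has at least two non-NA values.

```r
# Cubic-spline interpolation that leaves leading and trailing NAs in place.
interp_spline <- function(v) {
  idx <- seq_along(v)
  ok  <- !is.na(v)
  # Evaluate the spline at every index (this alone would extrapolate).
  out <- spline(idx[ok], v[ok], xout = idx)$y
  # Keep only positions between the first and last observed value.
  inside <- idx >= min(idx[ok]) & idx <= max(idx[ok])
  out[!inside] <- NA
  out
}

vector <- c(NA, NA, NA, NA, NA, NA, 5, NA, 7, 8, NA, NA)
res <- interp_spline(vector)
```

Here only the internal NA at position 8 is filled; positions 1 to 6 and 11 to 12 remain NA.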

Subsetting in R using a list

I have a large amount of data which I would like to subset based on the values in one of the columns (dive site in this case). The data looks like this:
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
alice rain 95 NA 50 NA 2 4 9
alice over NA 25 NA 25 2 4 9
steps clear NA 27 NA 25 2 4 9
steps NA 30 NA 20 1 4 9
andrea1 clear 60 NA 60 NA 2 4 5
I would like to create a subset of the data which contains only data for one dive site at a time (e.g. one subset for alice, one for steps, one for andrea1 etc...).
I understand that I could subset each individually using
alice <- subset(reefdata, site=="alice")
But as I have over 100 different sites to subset by, I would like to avoid having to specify each subset individually. I think that subset is probably not flexible enough for me to ask it to subset by a list of names (or at least not to my current knowledge of R, which is growing, but still in its infancy). Is there another command which I should be looking into?
Thank you
This will create a list that contains the subset data frames in separate list elements.
splitdat <- split(reefdata, reefdata$site)
Then if you want to access the "alice" data you can reference it like
splitdat[["alice"]]
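For a quick feel of what split() returns, here is a small self-contained sketch. The site names come from the question, but the depth values are made up purely for illustration.

```r
# Toy stand-in for reefdata; the depth column holds invented values.
reefdata <- data.frame(
  site  = c("alice", "alice", "steps", "steps", "andrea1"),
  depth = c(29, 25, 27, 30, 18),
  stringsAsFactors = FALSE
)

# One data frame per site, stored in a named list.
splitdat <- split(reefdata, reefdata$site)

# The list can then be processed in one go, e.g. rows per site.
rows_per_site <- sapply(splitdat, nrow)
```

The list elements are named after the sites (in sorted order), so splitdat[["steps"]] or splitdat$steps pulls out a single subset, and lapply()/sapply() apply the same operation to all of them.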
I would use the plyr package.
library(plyr)
ll <- dlply(reefdata,.variables = c("site"))
Result:
>ll
$alice
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 alice rain 95 NA 50 NA 2 4 9
2 alice over NA 25 NA 25 2 4 9
$andrea1
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 andrea1 clear 60 NA 60 NA 2 4 5
$steps
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 steps clear NA 27 NA 25 2 4 9
2 steps <NA> 30 NA 20 1 4 9 NA
split() and dlply() are perfect one shot solutions.
If you want a "step by step" procedure with a loop (which is frowned upon by many R users, but I find it helpful in order to understand what's going on), try this:
# create vector with site names (works whether reefdata$site is a factor or a character vector)
sites <- as.character( unique( reefdata$site ) )
# create empty list to take dive data per site
dives <- list()
# collect data per site into the list
for( i in seq_along( sites ) )
{
  # subset
  dive <- reefdata[ reefdata$site == sites[ i ] , ]
  # add resulting data.frame to the list
  dives[[ i ]] <- dive
  # name the list element
  names( dives )[ i ] <- sites[ i ]
}

Replace values in one data frame from values in another data frame

I need to change individual identifiers that are currently alphabetical to numerical. I have created a data frame where each alphabetical identifier is associated with a number
individuals num.individuals (g4)
1 ZYO 64
2 KAO 24
3 MKU 32
4 SAG 42
What I need is to replace ZYO with the number 64 in my main data frame (g3), and likewise for all the other codes.
My main data frame (g3) looks like this
SAG YOG GOG BES ATR ALI COC CEL DUN EVA END GAR HAR HUX ISH INO JUL
1 2
2 2 EVA
3 SAG 2 EVA
4 2
5 SAG 2
6 2
Now on a small scale I can write a code to change it like I did with ATR
g3$ATR <- as.character(g3$ATR)
g3[g3$target == "ATR" | g3$ATR == "ATR","ATR"] <- 2
But this is time-consuming and increases the chance of human error.
I know there are ways to do this on a broad scale with NAs
I think maybe we could do a for loop for this, but I am not good enough to write one myself.
I have also been trying to use the function below, which I feel may work, but I am not sure how to build the argument logically. It was posted on the questions board here:
Fast replacing values in dataframe in R
df <- as.data.frame(lapply(df, function(x){replace(x, x <0,0)}))
I have tried to work my data into this by
df <- as.data.frame(lapply(g4, function(g3){replace(x, x <0,0)}))
Here is one approach using the data.table package:
First, create a reproducible example similar to your data:
require(data.table)
ref <- data.table(individuals=1:4,num.individuals=c("ZYO","KAO","MKU","SAG"),g4=c(64,24,32,42))
g3 <- data.table(SAG=c("","SAG","","SAG"),KAO=c("KAO","KAO","",""))
Here is the ref table:
individuals num.individuals g4
1: 1 ZYO 64
2: 2 KAO 24
3: 3 MKU 32
4: 4 SAG 42
And here is your g3 table:
SAG KAO
1: KAO
2: SAG KAO
3:
4: SAG
And now we do our find and replacing:
g3[ , lapply(.SD,function(x) ref$g4[chmatch(x,ref$num.individuals)])]
And the final result:
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA
And if you need more speed, the fastmatch package might help with their fmatch function:
require(fastmatch)
g3[ , lapply(.SD,function(x) ref$g4[fmatch(x,ref$num.individuals)])]
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA
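For comparison, the same lookup can be sketched in base R with match(), which is close in spirit to the lapply() attempt in the question. The column names code and num are my own labels for the lookup table, not names from the original data.

```r
# Lookup table: alphabetical code -> numeric identifier.
ref <- data.frame(
  code = c("ZYO", "KAO", "MKU", "SAG"),
  num  = c(64, 24, 32, 42),
  stringsAsFactors = FALSE
)

# Small stand-in for the main data frame.
g3 <- data.frame(
  SAG = c("", "SAG", "", "SAG"),
  KAO = c("KAO", "KAO", "", ""),
  stringsAsFactors = FALSE
)

# For every column, find each value's row in ref and take the matching number.
# Values with no match (here the empty strings) become NA.
g3_num <- as.data.frame(lapply(g3, function(col) ref$num[match(col, ref$code)]))
```

This gives the same NA/42/24 pattern as the data.table version above, just without the extra package.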

R - Combining multiple columns together within a data frame, while keeping connected data

So I've looked quite a lot for an answer to this question, but I can't find an answer that satisfies my needs or my understanding of R.
First, here's some code to just give you an idea of what my data set looks like
df <- data.frame("Year" = 1991:2000, "Subdiv" = 24:28, H1 = c(31.2,34,70.2,19.8,433.7,126.34,178.39,30.4,56.9,818.3),
H2 = c(53.9,121.5,16.9,11.9,114.6,129.9,221.1,433.4,319.2,52.6))
> df
Year Subdiv H1 H2
1 1991 24 31.20 53.9
2 1992 25 34.00 121.5
3 1993 26 70.20 16.9
4 1994 27 19.80 11.9
5 1995 28 433.70 114.6
6 1996 24 126.34 129.9
7 1997 25 178.39 221.1
8 1998 26 30.40 433.4
9 1999 27 56.90 319.2
10 2000 28 818.30 52.6
So what I've got here is a data set containing the abundance of herring of different ages in different areas ("Subdiv") over time. H1 stands for herring at age 1. My real data set contains more ages as well as more areas (and additional species of fish).
What I would like to do is combine the abundance of different ages into one column while keeping the connected data (Year, Subdiv) as well as creating a new column for Age.
Like so:
Year Subdiv Abun Age
1 1991 24 31.20 1
2 1992 25 34.00 1
3 1993 26 70.20 1
4 1994 27 19.80 1
5 1995 28 433.70 1
6 1991 24 53.9 2
7 1992 25 121.5 2
8 1993 26 16.9 2
9 1994 27 11.9 2
10 1995 28 114.6 2
Note: Yes, I removed some rows, but only to not crowd the screen
I hope this is enough information to make clear what I need and for someone to help.
Since I have more species of fish, if someone would like to include a description for adding a Species column as well, that would be helpful.
Here's code for the same data, just duplicated for sprat (Sn):
df <- data.frame("Year" = 1991:2000, "Subdiv" = 24:28, H1 = c(31.2,34,70.2,19.8,433.7,126.34,178.39,30.4,56.9,818.3),
H2 = c(53.9,121.5,16.9,11.9,114.6,129.9,221.1,433.4,319.2,52.6),
S1 = c(31.2,34,70.2,19.8,433.7,126.34,178.39,30.4,56.9,818.3),
S2 = c(53.9,121.5,16.9,11.9,114.6,129.9,221.1,433.4,319.2,52.6))
Cheers!
I don't think the tags on this question are unrelated, but if you don't find them fitting, go ahead and change them.
This is a typical reshape then supplement task so you can:
1) 'Melt' your data with reshape2
library("reshape2")
df.m<-melt(df,id.vars=c("Year","Subdiv"))
2) Then add additional columns based on the variable column that holds your previous df's column names
library("stringr")
df.m$Fish<-str_extract(df.m$variable,"[A-Z]")
df.m$Age<-str_extract(df.m$variable,"[0-9]")
I recommend you look up the reshape functions as these are very commonly required and learning them will save you lots of time in future
http://www.statmethods.net/management/reshape.html
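If you would rather avoid extra packages, the same melt-then-annotate idea can be sketched in base R. This assumes every abundance column is named as letters followed by digits (H1, S2, ...), with the letters standing for the species and the digits for the age. The Species column name is my own choice.

```r
df <- data.frame(Year = 1991:2000, Subdiv = 24:28,
  H1 = c(31.2, 34, 70.2, 19.8, 433.7, 126.34, 178.39, 30.4, 56.9, 818.3),
  H2 = c(53.9, 121.5, 16.9, 11.9, 114.6, 129.9, 221.1, 433.4, 319.2, 52.6),
  S1 = c(31.2, 34, 70.2, 19.8, 433.7, 126.34, 178.39, 30.4, 56.9, 818.3),
  S2 = c(53.9, 121.5, 16.9, 11.9, 114.6, 129.9, 221.1, 433.4, 319.2, 52.6))

# Every column except the identifiers holds abundance values.
value_cols <- setdiff(names(df), c("Year", "Subdiv"))

# Build one long block per abundance column, then stack the blocks.
long <- do.call(rbind, lapply(value_cols, function(col) {
  data.frame(
    Year    = df$Year,
    Subdiv  = df$Subdiv,
    Species = sub("[0-9]+$", "", col),                # letters: species code
    Age     = as.integer(sub("^[A-Za-z]+", "", col)), # digits: age
    Abun    = df[[col]],
    stringsAsFactors = FALSE
  )
}))
```

Each original column contributes 10 rows, so the four columns give a 40-row long table with Year, Subdiv, Species, Age and Abun.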
I think the basic data.frame function will do exactly what you want. Try something like:
data.frame(Year=df$Year,Subdiv=df$Subdiv,Abun=c(df$H1,df$H2),
Age=rep(c(1,2),each=nrow(df)))
So I'm concatenating the values you want in the abundance column, and creating a new column that is just the ages replicated for each row. You can create a similar species column easily.
Hope that helps!

Ifelse statements for a dataframe in R

I am hoping that someone can help me figure out how to write an if-else statement to work on my dataset. I have data on tree growth rates by year. I need to calculate whether growth rates decreased by >50% from one year to the next. I am having trouble applying an ifelse statement to calculate my final field. I am relatively new to R, so my code is probably not very efficient, but here is an example of what I have so far:
For an example dataset,
test<-data.frame(year=c("1990","1991","1992","1993"),value=c(50,25,20,5))
year value
1 1990 50
2 1991 25
3 1992 20
4 1993 5
I then calculate the difference between the current year and previous year's growth ("value"):
test[-1,"diff"]<-test[-1,"value"]-test[-nrow(test),"value"]
year value diff
1 1990 50 NA
2 1991 25 -25
3 1992 20 -5
4 1993 5 -15
and then calculate what 50% of each years' growth would be:
test$chg<-test$value * 0.5
year value diff chg
1 1990 50 NA 25.0
2 1991 25 -25 12.5
3 1992 20 -5 10.0
4 1993 5 -15 2.5
I am then trying to use an ifelse statement to calculate a field "abrupt" that would be "1" when the decline from one year to the next is greater than 50%. This is the code I am trying to use, but I'm not sure how to properly reference the "chg" field from the previous year, because I am getting an error (copied below):
test$abrupt<-ifelse(test$diff<0 && abs(test$diff)>=test[-nrow(test),"chg"],1,0)
Warning message:
In abs(test$diff) >= test[-nrow(test), "chg"] :
longer object length is not a multiple of shorter object length
> test
year value diff chg abrupt
1 1990 50 NA 25.0 NA
2 1991 25 -25 12.5 NA
3 1992 20 -5 10.0 NA
4 1993 5 -15 2.5 NA
A test of a similar ifelse statement worked when I just assigned a few numbers, but I'm not sure how to get this to work in the context of a data frame. Here is an example of it working on just a few values:
prevyear<-50
curryear<-25
chg<-prevyear*0.5
> chg
[1] 25
> diff<-curryear-prevyear
> diff
[1] -25
> abrupt<-ifelse(diff<0 && abs(diff)>= chg,1,0)
> abrupt
[1] 1
If anyone could help me figure out how to apply a similar ifelse statement to my dataframe I would greatly appreciate it! Thank you for any help you can provide.
thank you,
Katie
It's throwing a warning because the two vectors compared abs(test$diff) >= test[-nrow(test),"chg"] have different lengths. Also, for logical and, you are using && (which gives only one TRUE or FALSE) when you should be using & (which is vectorized: it operates elementwise over two vectors and returns a vector of the same length). Try this:
test$abrupt<-ifelse(test$diff<0 & abs(test$diff)>=test$chg,1,0)
I would change where you're putting chg so that it lines up with the diff you want to compare it to:
test$chg[2:nrow(test)] <- test$value[1:(nrow(test)-1)] * 0.5
Then, correct your logical operator like Blue Magister said:
test$abrupt<-ifelse(test$diff<0 & abs(test$diff)>=test$chg,1,0)
and you have your results:
year value diff chg abrupt
1 1990 50 NA NA NA
2 1991 25 -25 25.0 1
3 1992 20 -5 12.5 0
4 1993 5 -15 10.0 1
Also, you may find the function diff helpful: rather than doing this:
test[-1,"value"]-test[-nrow(test),"value"]
you can just do
diff(test$value)
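Putting the pieces from both answers together, here is a compact sketch that lines each year up with the previous year's value directly, so no separate chg column is needed. The helper vector prev is my own naming. As far as I know, recent R versions (4.3.0 and later) also raise an error when && is given a vector longer than one, which is one more reason to use & here.

```r
test <- data.frame(year = c("1990", "1991", "1992", "1993"),
                   value = c(50, 25, 20, 5))

# Previous year's value, shifted down one row (first year has none).
prev <- c(NA, head(test$value, -1))

# Year-over-year change; identical to diff(test$value) with a leading NA.
test$diff <- test$value - prev

# 1 when growth fell by at least half of the previous year's value, else 0.
# & is elementwise, so the result has one entry per row.
test$abrupt <- ifelse(test$diff < 0 & abs(test$diff) >= 0.5 * prev, 1, 0)
```

The first year has no predecessor, so its diff and abrupt are NA; the remaining rows come out as 1, 0, 1, matching the table above.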

Resources