Dynamic negative binomial regression for time series in R

I have count data and need to do a time series analysis using dynamic negative binomial regression, since the data has autocorrelation and overdispersion issues.
I searched online for an R package I could use, but I was not able to find one.
I would appreciate any help.
An example of my data:
> St1
[1] 17 9 28 7 23 16 17 12 11 16 19 29 5 40 13 27 13 11 10 14 13 23 21 24 9 42 14 22 17 9
> Years
[1] 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006
[23] 2007 2008 2009 2010 2011 2012 2013 2014
> library(AER)
> library(stats)
> rd <- glm(St1 ~ Years, family = poisson)
> dispersiontest(rd)
Overdispersion test
data: rd
z = 2.6479, p-value = 0.00405
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion
4.305539
#Autocorrelation
> Box.test(St1, lag = ceiling(log(length(St1))), type = "Ljung")
Box-Ljung test
data: St1
X-squared = 13.612, df = 4, p-value = 0.008641

So this is basically a request to find a package (and such requests are considered off-topic), so I'm going to see if I can convert it into a question with a coding flavor. As I said in my comment, using "dynamic" as a search term is often disappointing, since everybody seems to use the word for a bunch of disconnected purposes. Witness the functions that come up with this search from the console:
install.packages("sos")
sos::findFn(" dynamic negative binomial")
found 20 matches
Downloaded 20 links in 13 packages.
Nothing that appeared useful. But looking at your citation, it appeared that all the models had an autoregression component, so I tried this search:
sos::findFn(" autoregressive negative binomial")
found 28 matches; retrieving 2 pages
Downloaded 27 links in 16 packages.
This finds "Fitting Longitudinal Data with Negative Binomial Marginal..." and "Generalized Linear Autoregressive Moving Average Models with...". So consider this rather my answer to an "implicit question": how to search effectively from the R console with the sos package.
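The second of those hits is the glarma package, which fits generalized linear autoregressive moving average models for count responses, including a negative binomial family. As a minimal sketch of what a fit to the data above might look like, assuming glarma's documented interface (the lag choice and arguments are my guesses, not a tested model):
library(glarma)
# Design matrix: intercept plus the year trend
X <- cbind(Intercept = 1, Years = Years)
# Negative binomial response with a first-order autoregressive term
fit <- glarma(St1, X, phiLags = 1, type = "NegBin", method = "NR")
summary(fit)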

Related

Dynamic GMM in R

I am dealing with large panel data; here is just a sample:
Year PERMNO A B SIC
1991 a 3 11 2
1991 b 4 12 2
1991 c 5 15 1
1992 a 4 10 2
1992 b 4 14 2
1992 c 3 11 1
1993 a 2 9 2
1993 b 3 15 2
Dynamic GMM enables us to estimate the A-B relation while including both past A levels and fixed effects to account for the dynamic aspects of the A-B relation and time-invariant unobservable heterogeneity.
Specifically, I have the following:
A(t)=alpha+B(t)+A(t-1)+A(t-2)+controls+Year_dummy+Industry_dummy
Currently, I am running:
lm(A ~ B + Controls + lag_A +lag_A_2 + factor(SIC_2)+factor(Year), data = data)
The reason I am not using plm is that normally we have to set the index:
pdata.frame(data, index = c("PERMNO","Year"))
However, my fixed effects are Year and industry, not the individual PERMNO; each industry contains multiple PERMNOs. In lm, the fixed effects can be implemented by adding factor(SIC_2) + factor(Year).
How do I run this dynamic GMM in R? I notice some people refer to pgmm.
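For reference, a pgmm (plm package) call for this model might look like the sketch below. It is an assumption about how the variables map onto the Arellano-Bond setup, not a tested fit: PERMNO serves as the individual index, deeper lags of A act as GMM instruments, and effect = "twoways" supplies the year dummies. The industry dummies are time-invariant within a PERMNO, so the differencing transformation sweeps them out rather than estimating them.
library(plm)
# PERMNO as the individual index, Year as the time index
pdata <- pdata.frame(data, index = c("PERMNO", "Year"))
# A(t) on its first two lags and B; lags 3 and deeper of A serve as
# GMM instruments; "twoways" adds the year effects
fit <- pgmm(A ~ lag(A, 1:2) + B | lag(A, 3:99),
            data = pdata, effect = "twoways", model = "twosteps")
summary(fit)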

Observations with low frequency all go into the train set and produce an error in predict()

I have a dataset (~14410 rows) with observations that include a country. I split this set into a train and a test set and train a decision tree with the rpart() function. When it comes to predicting, I sometimes get the error that the test set has countries which are not in the train set.
At first I excluded/deleted the countries which appeared only once:
# Get countries with frequency one
var.names <- names(which(table(mydata1$country) == 1))
# Row index of each single-occurrence country
loss <- match(var.names, mydata1$country)
mydata1 <- mydata1[-loss, ]
When rerunning my code, I get the same error at the same code line, saying that I have new countries in test which are not in train.
Now I did a count to see how often a country appears.
count <- as.data.frame(count(mydata1, vars=mydata1$country))
count[rev(order(count$n)),]
vars n
3 Bundesrep. Deutschland 7616
9 Grossbritannien 1436
12 Italien 930
2 Belgien 731
22 Schweden 611
23 Schweiz 590
13 Japan 587
19 Oesterreich 449
17 Niederlande 354
8 Frankreich 276
18 Norwegen 238
7 Finnland 130
21 Portugal 105
5 Daenemark 65
26 Spanien 57
4 China 55
20 Polen 51
27 Taiwan 31
14 Korea Süd 30
11 Irland 26
29 Tschechien 13
16 Litauen 9
10 Hong Kong 7
30 <NA> 3
6 Estland 3
24 Serbien 2
1 Australien 2
28 Thailand 1
25 Singapur 1
15 Kroatien 1
From this I can see that I also have NAs in my data.
My question now is, how can I proceed with this problem?
Should I exclude/delete all countries with, e.g., fewer than 7 observations, or should I take the rows with fewer than 7 observations and replicate them two times, so that my predict() function will always work, also for other data sets?
It's somehow not "fancy" just to delete the rows... is there any other possibility?
You need to convert every character variable to a factor:
mydata1$country <- as.factor(mydata1$country)
Then you can simply proceed with the train/test splitting. You won't need to remove anything (except the NAs).
By using the type factor, your model will know that the country variable has a fixed set of possible levels:
Example:
country <- factor("Italy", levels = c("Italy", "USA", "UK")) # just 3 levels for example
country
[1] Italy
Levels: Italy USA UK
# note that as.factor() takes care of defining the levels for you
See the difference with:
country <- "Italy"
country
[1] "Italy"
By using a factor, the model knows all the possible levels. Because of this, even if the train data contains no "Italy" observation, the model knows that one can appear in the test data. The key is to define the levels on the full data before splitting, as in the sketch below. A factor is generally the right type for character variables in models.
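A minimal sketch of that workflow (the 70/30 split and the seed are just illustration; only the country column name comes from the question):
# Define the levels from the full data, then drop rows with missing country
mydata1$country <- factor(mydata1$country)
mydata1 <- droplevels(mydata1[!is.na(mydata1$country), ])
set.seed(42)
idx <- sample(nrow(mydata1), 0.7 * nrow(mydata1))
train <- mydata1[idx, ]
test <- mydata1[-idx, ]
# Subsetting keeps the full level set, so train and test agree
stopifnot(identical(levels(train$country), levels(test$country)))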

R: Interpolation of values for NAs by indices/groups when first or last values aren't given

I have panel data with county-level values for 15 years of different economic measures (for which I have created an index). There are missing values that I would like to interpolate. However, because the values are missing at random years, plain linear interpolation doesn't work: it only gives me values between the first and last existing data points. This is a problem because I need interpolated values for the entire series.
Since all of the series have more than 5 data points, is there any code out there that would interpolate each series based on the data that already exists within that series?
I first thought about indexing my data to run a loop, but then I found code for linear interpolation by groups. While the latter solved some of the NAs, it did not interpolate all of them. Here is an example of my data where some values get interpolated but not all.
library(dplyr)
data <- read.csv(text="
index,year,value
1,2001,20864.135
1,2002,20753.867
1,2003,NA
1,2004,17708.224
1,2005,12483.767
1,2006,12896.251
1,2007,NA
1,2008,NA
1,2009,9021.556
1,2010,NA
1,2011,NA
1,2012,13795.752
1,2013,16663.741
1,2014,19349.992
1,2015,NA
2,2001,NA
2,2002,NA
2,2003,NA
2,2004,NA
2,2005,NA
2,2006,NA
2,2007,NA
2,2008,151.108
2,2009,107.205
2,2010,90.869
2,2011,104.142
2,2012,NA
2,2013,128.646
2,2014,NA
2,2015,NA")
Using
interpolation <- data %>%
  group_by(index) %>%
  mutate(valueIpol = approx(year, value, year,
                            method = "linear", rule = 1, f = 0, ties = mean)$y)
I get the following interpolated values.
1,2001,20864.135
1,2002,20753.867
1,2003,19231.046
1,2004,17708.224
1,2005,12483.767
1,2006,12896.251
1,2007,11604.686
1,2008,10313.121
1,2009,9021.556
1,2010,10612.955
1,2011,12204.353
1,2012,13795.752
1,2013,16663.741
1,2014,19349.992
1,2015,NA
2,2001,NA
2,2002,NA
2,2003,NA
2,2004,NA
2,2005,NA
2,2006,NA
2,2007,NA
2,2008,151.108
2,2009,107.205
2,2010,90.869
2,2011,104.142
2,2012,116.394
2,2013,128.646
2,2014,NA
2,2015,NA
Any help would be appreciated. I'm pretty new to R and have never worked with loops, but I have looked up other "interpolation by groups" help. Nothing seems to solve the issue of filling in values when the first and last points are NAs as well.
Maybe this could help:
library(imputeTS)
for (i in unique(data$index)) {
  data[data$index == i, ] <- na.interpolation(data[data$index == i, ])
}
This only works when the groups themselves are already ordered by year (which is the case in your example).
Output would look like this:
> data
index year value
1 1 2001 20864.135
2 1 2002 20753.867
3 1 2003 19231.046
4 1 2004 17708.224
5 1 2005 12483.767
6 1 2006 12896.251
7 1 2007 11604.686
8 1 2008 10313.121
9 1 2009 9021.556
10 1 2010 10612.955
11 1 2011 12204.353
12 1 2012 13795.752
13 1 2013 16663.741
14 1 2014 19349.992
15 1 2015 19349.992
16 2 2001 151.108
17 2 2002 151.108
18 2 2003 151.108
19 2 2004 151.108
20 2 2005 151.108
21 2 2006 151.108
22 2 2007 151.108
23 2 2008 151.108
24 2 2009 107.205
25 2 2010 90.869
26 2 2011 104.142
27 2 2012 116.394
28 2 2013 128.646
29 2 2014 128.646
30 2 2015 128.646
Since the na.interpolation function uses approx internally, you can pass approx parameters through to adjust the behavior.
The parameters from your example, method = "linear", rule = 1, f = 0, ties = mean, are the defaults, so if you want these you don't have to add anything.
Otherwise you would change the call inside the loop, for example to:
data[data$index == i, ] <- na.interpolation(data[data$index == i, ], ties = "ordered", f = 1, rule = 2)
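If you prefer to avoid the explicit loop, a grouped dplyr version is sketched below. Note this assumes a recent imputeTS, where the function is spelled na_interpolation(); in older versions it is na.interpolation() as above.
library(dplyr)
library(imputeTS)
data <- data %>%
  group_by(index) %>%
  arrange(year, .by_group = TRUE) %>%   # guarantee year order per group
  mutate(value = na_interpolation(value)) %>%
  ungroup()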

Time trend analyses using "years vs continuous var" ecological data?

Looking at my time trend plot, I wonder how to test the statistical significance of the trend shown, given this simple "years vs rate" ecological data, using R. I tried ANOVA, treating the year variable as a factor, and it turned in p < 0.05, but I'm not satisfied with ANOVA. Also, the article I reviewed suggested Wald statistics to test the time trend, but I have found no guiding examples on Google yet.
My data:
> head(yrrace)
year racecat rate outcome pop
1 1995 1 14.2 1585 11170482
2 1995 2 8.7 268 3070363
3 1996 1 14.1 1574 11170482
4 1996 2 7.5 230 3070363
5 1997 1 13.3 1482 11170482
6 1997 2 8.3 254 3070363
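One common way to get a Wald-type trend test from data like this (a sketch, under the assumption that outcome is the event count and pop the population at risk): model the counts with a Poisson-family GLM, use log(pop) as an offset, and read the test on the year coefficient as the linear trend test. The quasipoisson family guards against overdispersion.
fit <- glm(outcome ~ year + factor(racecat) + offset(log(pop)),
           family = quasipoisson, data = yrrace)
summary(fit)  # the Wald test on 'year' is the linear trend test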

R - Combining multiple columns together within a data frame, while keeping connected data

So I've looked quite a lot for an answer to this question, but I can't find an answer that satisfies my needs or my understanding of R.
First, here's some code to just give you an idea of what my data set looks like
df <- data.frame("Year" = 1991:2000, "Subdiv" = 24:28,
                 H1 = c(31.2, 34, 70.2, 19.8, 433.7, 126.34, 178.39, 30.4, 56.9, 818.3),
                 H2 = c(53.9, 121.5, 16.9, 11.9, 114.6, 129.9, 221.1, 433.4, 319.2, 52.6))
> df
Year Subdiv H1 H2
1 1991 24 31.20 53.9
2 1992 25 34.00 121.5
3 1993 26 70.20 16.9
4 1994 27 19.80 11.9
5 1995 28 433.70 114.6
6 1996 24 126.34 129.9
7 1997 25 178.39 221.1
8 1998 26 30.40 433.4
9 1999 27 56.90 319.2
10 2000 28 818.30 52.6
So what I've got here is a data set containing the abundance of herring of different ages in different areas ("Subdiv") over time. H1 stands for herring at age 1. My real data set contains more ages as well as more areas (and additional species of fish).
What I would like to do is combine the abundance of different ages into one column while keeping the connected data (Year, Subdiv) as well as creating a new column for Age.
Like so:
Year Subdiv Abun Age
1 1991 24 31.20 1
2 1992 25 34.00 1
3 1993 26 70.20 1
4 1994 27 19.80 1
5 1995 28 433.70 1
6 1991 24 53.9 2
7 1992 25 121.5 2
8 1993 26 16.9 2
9 1994 27 11.9 2
10 1995 28 114.6 2
Note: Yes, I removed some rows, but only to not crowd the screen
I hope this is enough information to make clear what I need and for someone to help.
Since I have more species of fish, it would be helpful if someone could also include a description of how to add a Species column.
Here's code for the same data, just duplicated for sprat (S1, S2):
df <- data.frame("Year" = 1991:2000, "Subdiv" = 24:28,
                 H1 = c(31.2, 34, 70.2, 19.8, 433.7, 126.34, 178.39, 30.4, 56.9, 818.3),
                 H2 = c(53.9, 121.5, 16.9, 11.9, 114.6, 129.9, 221.1, 433.4, 319.2, 52.6),
                 S1 = c(31.2, 34, 70.2, 19.8, 433.7, 126.34, 178.39, 30.4, 56.9, 818.3),
                 S2 = c(53.9, 121.5, 16.9, 11.9, 114.6, 129.9, 221.1, 433.4, 319.2, 52.6))
Cheers!
I don't think the tags of this question are unrelated, but if you find them unfitting for my question, go ahead and change them.
This is a typical reshape-then-supplement task, so you can:
1) 'Melt' your data with reshape2:
library("reshape2")
df.m <- melt(df, id.vars = c("Year", "Subdiv"))
2) Then add additional columns based on the variable column that holds your previous df's column names:
library("stringr")
df.m$Fish <- str_extract(df.m$variable, "[A-Z]")
df.m$Age <- str_extract(df.m$variable, "[0-9]")
I recommend you look up the reshape functions, as they are very commonly needed and learning them will save you lots of time in the future:
http://www.statmethods.net/management/reshape.html
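For completeness, the same reshape can be done in one step with tidyr (a sketch; the names_pattern regex assumes every measure column is one capital letter followed by digits):
library(tidyr)
df.long <- pivot_longer(df, cols = -c(Year, Subdiv),
                        names_pattern = "([A-Z])([0-9]+)",
                        names_to = c("Species", "Age"),
                        values_to = "Abun")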
I think the basic data.frame function will do exactly what you want. Try something like:
data.frame(Year = df$Year, Subdiv = df$Subdiv, Abun = c(df$H1, df$H2),
           Age = rep(c(1, 2), each = nrow(df)))
So I'm concatenating the values you want into the abundance column and creating a new column that is just the ages replicated for the matching rows; Year and Subdiv are recycled to match. You can create a similar species column easily, as sketched below.
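With the sprat columns from your second df included, the same idea extends to a Species column:
data.frame(Year = df$Year, Subdiv = df$Subdiv,
           Abun = c(df$H1, df$H2, df$S1, df$S2),
           Age = rep(c(1, 2, 1, 2), each = nrow(df)),
           Species = rep(c("Herring", "Sprat"), each = 2 * nrow(df)))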
Hope that helps!
