Parallelize for loops that subset panel data by industry-year - r

I want to carry out an estimation procedure that uses data on all firms in a given sector, for a rolling window of 5 years.
I can do it easily in a loop, but since the estimation procedure takes quite a while, I would like to parallelize it. Is there any way to do this?
My data looks like this:
sale_log cogs_log ppegt_log m_naics4 naics_2 gvkey year
1 3.9070198 2.5146032 3.192821715 9.290151e-02 72 1001 1983
2 4.1028774 2.7375141 3.517861329 1.067687e-01 72 1001 1984
3 4.5909863 3.2106595 3.975112703 2.511660e-01 72 1001 1985
4 3.2560391 2.7867256 -0.763368555 1.351031e-02 44 1003 1982
5 3.2966287 2.8088799 -0.305698649 1.151525e-02 44 1003 1983
6 3.2636907 2.8330357 0.154036559 8.699394e-03 44 1003 1984
7 3.7916480 3.2346849 0.887916936 1.351803e-02 44 1003 1985
8 4.1778028 3.5364473 1.177985972 1.761273e-02 44 1003 1986
9 4.1819066 3.7297111 1.393016951 1.686331e-02 44 1003 1987
10 4.0174411 3.6050022 1.479584215 1.601205e-02 44 1003 1988
11 3.4466429 2.9633579 1.312863013 8.888067e-03 44 1003 1989
12 3.0667367 2.6128805 0.909779173 2.102674e-02 42 1004 1965
13 3.2362968 2.8140391 1.430690273 2.050934e-02 42 1004 1966
14 3.1981990 2.8822097 1.721614365 1.702929e-02 42 1004 1967
15 3.9265031 3.6159280 2.399823853 2.559074e-02 42 1004 1968
16 4.3343438 4.0116068 2.592692585 3.649313e-02 42 1004 1969
17 4.5869564 4.3059855 2.772196529 4.743631e-02 42 1004 1970
18 4.7015486 4.3995561 2.875267240 5.155589e-02 42 1004 1971
19 5.0564414 4.7539697 3.218686385 6.863808e-02 42 1004 1972
20 5.4323873 5.1711531 3.350849771 8.272720e-02 42 1004 1973
21 5.2979696 5.0033437 3.383504340 6.726429e-02 42 1004 1974
22 5.3958779 5.1475985 3.475121024 1.534230e-01 42 1004 1975
23 5.5442635 5.3195666 3.517557041 1.674937e-01 42 1004 1976
24 5.6260795 5.3909462 3.694842501 1.711362e-01 42 1004 1977
25 5.8039766 5.5455887 3.895724689 1.836405e-01 42 1004 1978
26 5.8198831 5.5665980 3.960153940 1.700499e-01 42 1004 1979
27 5.7474447 5.4697019 3.943733263 1.520660e-01 42 1004 1980
where gvkey is the firm id and naics are the industry codes.
The code I wrote:
theta=matrix(,60,23)
count=1
temp <- dat %>% select(
"sale_log", "cogs_log", "ppegt_log",
"m_naics4", "naics_2", "gvkey", "year"
)
for (i in 1960:2019) { # 5-year rolling sector-year specific production functions
sub <- temp[between(temp$year,i-5,i),] # subset 5 years
jcount <- 1
for (j in sort(unique(sub$naics_2))) { # loop over sectors
temp2 <- sub[sub$naics_2==j,]
mdl <- prodestOP(
Y=temp2$sale_log, fX=temp2$cogs_log, sX=temp2$ppegt_log,
pX=temp2$cogs_log, cX=temp2$m_naics4, idvar=temp2$gvkey,
timevar=temp2$year
)
theta[count,jcount] <- mdl@Model$FSbetas[2]
jcount <- jcount+1
}
count <- count+1
}
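One way the loop above could be parallelized, as a minimal sketch: build one task per (year, sector) pair and hand the list to parLapply() from the base parallel package. The data here are simulated, fit_one() is a hypothetical stand-in for the prodestOP() call (it just runs lm()), and the cluster size of 2 is an arbitrary choice. The window is taken as year-4 to year, which is 5 calendar years.

```r
library(parallel)

# Toy panel with column names borrowed from the question (values are simulated).
set.seed(1)
dat <- expand.grid(gvkey = 1:30, year = 1980:1990)
dat$naics_2  <- c(42, 44, 72)[dat$gvkey %% 3 + 1]
dat$cogs_log <- rnorm(nrow(dat))
dat$sale_log <- 0.8 * dat$cogs_log + rnorm(nrow(dat), sd = 0.1)

# Hypothetical stand-in for the prodestOP() call: returns one coefficient.
fit_one <- function(d) {
  if (nrow(d) < 10) return(NA_real_)
  unname(stats::coef(stats::lm(sale_log ~ cogs_log, data = d))[2])
}

# One task per (window end year, sector) pair; year varies fastest.
grid <- expand.grid(year = 1985:1990, sector = sort(unique(dat$naics_2)))
tasks <- lapply(seq_len(nrow(grid)), function(k) {
  dat[dat$year >= grid$year[k] - 4 & dat$year <= grid$year[k] &
        dat$naics_2 == grid$sector[k], ]
})

cl <- makeCluster(2)  # number of workers is an arbitrary choice
grid$theta <- unlist(parLapply(cl, tasks, fit_one))
stopCluster(cl)

# Reshape to a year-by-sector matrix, as in the question.
theta <- matrix(grid$theta, nrow = length(unique(grid$year)),
                dimnames = list(unique(grid$year), unique(grid$sector)))
```

With the real estimator, the workers would also need the package loaded, for example by calling it as prodest::prodestOP() inside fit_one().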

Related

Unique value in row compared to previous rows by group and year in dataframe

I am working with patent data and I would like to find out whether firms have been assigned patents in similar or dissimilar patent classes in the years prior to the year the current patent has been assigned.
As an example: Firm 1010 (see table below) has patented in subcat 67 in year 1984 and I would like to find out whether it has applied for a patent in the same subcat in the X previous years (where X could be 3 or 5, for example). The result should be that for every patent (row), a value of 1 gets assigned if this is the case and 0 if not.
The panel is unbalanced: the number of observations differs across firms (gvkey) and publication years.
I have fumbled around with dplyr and data.table, but cannot seem to find any solution that comes even close.
gvkey publn_year subcat patent
1: 1010 1980 53 4184663
2: 1010 1980 55 4185564
3: 1010 1980 53 4187814
4: 1010 1981 45 4242866
5: 1010 1981 55 4242966
6: 1010 1981 69 4246928
7: 1010 1982 53 4310145
8: 1010 1982 53 4311298
9: 1010 1982 69 4313458
10: 1010 1983 69 4367764
11: 1010 1983 53 4368927
12: 1010 1983 53 4368928
13: 1010 1984 67 4428585
14: 1010 1984 53 4429855
15: 1010 1984 53 4430983
16: 1012 1987 52 4683010
17: 1013 1980 43 4203066
18: 1013 1981 41 4245879
19: 1013 1982 41 4363941
20: 1013 1983 41 4367907
I've searched here and elsewhere for help but have not found what I'm looking for. I'm sure this is possible and I may be overlooking something very simple.
Thanks for your help.
One possible solution that looks at the firm's entire history (rather than a window of X years) is as follows:
df %>%
group_by(gvkey, subcat) %>%
mutate(flagged = ifelse(min(publn_year) == publn_year,
0,
1)
)
Example
Consider the data
> df
gvkey publn_year subcat patent
1 1010 1979 53 44434
2 1010 1980 55 43424
3 1010 1981 53 243423
4 1010 1982 45 234234
Then you get
> df %>% group_by(gvkey, subcat) %>% mutate(flagged = ifelse(min(publn_year) == publn_year, 0, 1))
# A tibble: 4 x 5
# Groups: gvkey, subcat [3]
gvkey publn_year subcat patent flagged
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1010 1979 53 44434 0
2 1010 1980 55 43424 0
3 1010 1981 53 243423 1
4 1010 1982 45 234234 0
Here is one approach using dplyr. First, group_by both gvkey (firm) and subcat (subcategory). Then arrange the rows by year. Finally, add a 0/1 column based on whether the difference between a row's year and the year of the previous patent in the same group is within X years (X is 3 in this example). I also check whether a row is the first one within its group, so that it does not get set to 1. Let me know if this is what you had in mind.
library(dplyr)
df %>%
group_by(gvkey, subcat) %>%
arrange(gvkey, subcat, publn_year) %>%
mutate(prior = ifelse(publn_year - lag(publn_year) <= 3 & row_number() != 1, 1, 0))
Output
gvkey publn_year subcat patent prior
<int> <int> <int> <int> <dbl>
1 1010 1981 45 4242866 0
2 1010 1980 53 4184663 0
3 1010 1980 53 4187814 1
4 1010 1982 53 4310145 1
5 1010 1982 53 4311298 1
6 1010 1983 53 4368927 1
7 1010 1983 53 4368928 1
8 1010 1984 53 4429855 1
9 1010 1984 53 4430983 1
10 1010 1980 55 4185564 0
11 1010 1981 55 4242966 1
12 1010 1984 67 4428585 0
13 1010 1981 69 4246928 0
14 1010 1982 69 4313458 1
15 1010 1983 69 4367764 1
16 1012 1987 52 4683010 0
17 1013 1981 41 4245879 0
18 1013 1982 41 4363941 1
19 1013 1983 41 4367907 1
20 1013 1980 43 4203066 0
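In the output above, a second patent in the same year also gets flagged (row 3). If only strictly earlier years should count, a variant along these lines might work. It is a minimal base R sketch on made-up rows, with X = 3 as an assumed window; it compares every row against all rows, so it is only meant for small data.

```r
# Toy data in the same layout as the question (values are made up).
df <- data.frame(
  gvkey      = c(1010, 1010, 1010, 1010, 1010, 1013),
  publn_year = c(1980, 1980, 1983, 1984, 1988, 1981),
  subcat     = c(53,   53,   53,   67,   53,   41)
)

X <- 3  # look-back window in years (assumed value)

# For each row, look for a row of the same firm and subcat whose
# year lies in [publn_year - X, publn_year - 1].
df$prior <- mapply(function(g, s, y) {
  as.integer(any(df$gvkey == g & df$subcat == s &
                 df$publn_year >= y - X & df$publn_year < y))
}, df$gvkey, df$subcat, df$publn_year)

df
```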

R: Substituting missing values (NAs) with two different values

I might be overcomplicating things - would love to know if there is an easier way to solve this. I have a data frame (df) with 5654 observations - 1332 are foreign-born, and 4322 Canada-born subjects.
The variable df$YR_IMM captures: "In what year did you come to live in Canada?"
See the following distribution of observations per immigration year table(df$YR_IMM) :
1920 1926 1928 1930 1939 1942 1944 1946 1947 1948 1949 1950 1951 1952 1953 1954
2 1 1 2 1 2 1 1 1 9 5 1 7 13 3 5
1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970
10 5 8 6 6 1 5 1 6 3 7 16 18 12 15 13
1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986
10 17 8 18 25 16 15 12 16 27 13 16 11 9 17 16
1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003
24 21 31 36 26 30 26 24 22 30 29 26 47 52 53 28 9
Naturally, these are only foreign-born individuals (mean = 1985); however, 348 foreign-born subjects have a missing value. In total there are 4670 NAs, which also include the Canada-born subjects.
How can I code these df$YR_IMM NAs in such a way that
348 (NA) --> 1985
4322(NA) --> 100
Additionally, the status is given by df$Brthcoun with 0 = "born in Canada" and 1 = "born outside of Canada".
Hope this makes sense - thank you!
EDIT: This was the solution ->
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 0] <- 100
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 1] <- 1985
Try the below code:
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 0] <- 100
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 1] <- 1985
I hope this helps!
Something like this should also work (the nested ifelse() leaves the non-missing years unchanged):
df$YR_IMM <- ifelse(is.na(df$YR_IMM), ifelse(df$Brthcoun == 0, 100, 1985), df$YR_IMM)

Adding value of row below in R - most efficient

My data:
no.att
year freq
1 1896 380
2 1900 1936
3 1904 1301
4 1906 1733
5 1908 3101
6 1912 4040
7 1920 4292
8 1924 5693
9 1928 5574
10 1932 3321
11 1936 7401
12 1948 7480
13 1952 9358
14 1956 6434
15 1960 9235
16 1964 9480
17 1968 10479
18 1972 11959
19 1976 10502
20 1980 8937
21 1984 11588
22 1988 14676
23 1992 16413
24 1994 3160
25 1996 13780
26 1998 3605
27 2000 13821
28 2002 4109
29 2004 13443
30 2006 4382
31 2008 13602
32 2010 4402
33 2012 12920
34 2014 4891
35 2016 13688
My goal:
From 1992 onwards, the observation interval changes from every 4th year to every 2nd year.
I want to keep it at every 4th year, so I want to add each in-between year to the following 4th year, e.g.
no.att[24,2] + no.att[25,2]
my solution is:
x <- 24
y <- 25
temp <- no.att[x,2]
temp1 <- no.att[y,2]
no.att[y,2] <- temp + temp1
x <- x + 2
y <- y + 2
Running the above once and then re-running it without the first two lines (so that x and y keep advancing) does the trick.
What would an alternative to this approach be?
Using ave to sum freq within each 4-year block:
ans <- dat
ans$freq <- ave(dat$freq, ceiling(dat$year/4), FUN=sum)
ans[ans$year %in% seq(1896,2016,4),]
output:
year freq
1 1896 380
2 1900 1936
3 1904 1301
5 1908 4834
6 1912 4040
7 1920 4292
8 1924 5693
9 1928 5574
10 1932 3321
11 1936 7401
12 1948 7480
13 1952 9358
14 1956 6434
15 1960 9235
16 1964 9480
17 1968 10479
18 1972 11959
19 1976 10502
20 1980 8937
21 1984 11588
22 1988 14676
23 1992 16413
25 1996 16940
27 2000 17426
29 2004 17552
31 2008 17984
33 2012 17322
35 2016 18579
data:
dat <- read.table(text="year freq
1896 380
1900 1936
1904 1301
1906 1733
1908 3101
1912 4040
1920 4292
1924 5693
1928 5574
1932 3321
1936 7401
1948 7480
1952 9358
1956 6434
1960 9235
1964 9480
1968 10479
1972 11959
1976 10502
1980 8937
1984 11588
1988 14676
1992 16413
1994 3160
1996 13780
1998 3605
2000 13821
2002 4109
2004 13443
2006 4382
2008 13602
2010 4402
2012 12920
2014 4891
2016 13688", header=TRUE)
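To see why ceiling(year/4) works as the grouping key, here is a small sketch on five rows taken from the data above: an in-between year such as 1994 rounds up to the same key as 1996, so ave() adds the two.

```r
year <- c(1992, 1994, 1996, 1998, 2000)
freq <- c(16413, 3160, 13780, 3605, 13821)

# 1994/4 = 498.5 rounds up to 499, the same key as 1996/4 = 499.
grp <- ceiling(year / 4)

# ave() returns the group sum on every row of the group.
total <- ave(freq, grp, FUN = sum)

data.frame(year, grp, total)[year %% 4 == 0, ]
```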

Numerical Method for SARIMAX Model using R

My friend is currently working on his assignment about estimation of parameter of a time series model, SARIMAX (Seasonal ARIMA Exogenous), with Maximum Likelihood Estimation (MLE) method. The data used by him is about the monthly rainfall from 2000 - 2012 with Indian Ocean Dipole (IOD) index as the exogenous variable.
Here is data:
MONTH YEAR RAINFALL IOD
1 1 2000 15.3720526 0.0624
2 2 2000 10.3440804 0.1784
3 3 2000 14.6116392 0.3135
4 4 2000 18.6842179 0.3495
5 5 2000 15.2937896 0.3374
6 6 2000 15.0233152 0.1946
7 7 2000 11.1803399 0.3948
8 8 2000 11.0589330 0.4391
9 9 2000 10.1488916 0.3020
10 10 2000 21.1187121 0.2373
11 11 2000 15.3980518 -0.0324
12 12 2000 18.9393770 -0.0148
13 1 2001 19.1075901 -0.2448
14 2 2001 14.9097284 0.1673
15 3 2001 19.2379833 0.1538
16 4 2001 19.6900990 0.3387
17 5 2001 8.0684571 0.3578
18 6 2001 14.0463518 0.3394
19 7 2001 5.9916609 0.1754
20 8 2001 8.4439327 0.0048
21 9 2001 11.8321596 0.1648
22 10 2001 24.3700636 -0.0653
23 11 2001 22.3584436 0.0291
24 12 2001 23.6114379 0.1731
25 1 2002 17.8409641 0.0404
26 2 2002 14.7377067 0.0914
27 3 2002 21.2226294 0.1766
28 4 2002 16.6403125 -0.1512
29 5 2002 10.8074049 -0.1072
30 6 2002 6.3796552 0.0244
31 7 2002 17.0704423 0.0542
32 8 2002 1.7606817 0.0898
33 9 2002 5.3665631 0.6736
34 10 2002 8.3246622 0.7780
35 11 2002 17.8044938 0.3616
36 12 2002 16.7062862 0.0673
37 1 2003 13.5572859 -0.0628
38 2 2003 17.1113997 0.2038
39 3 2003 14.9899967 0.1239
40 4 2003 14.0996454 0.0997
41 5 2003 11.4017542 0.0581
42 6 2003 6.7749539 0.3490
43 7 2003 7.1484264 0.4410
44 8 2003 10.3004854 0.4063
45 9 2003 10.6630202 0.3289
46 10 2003 20.6518764 0.1394
47 11 2003 20.8638443 0.1077
48 12 2003 20.5548048 0.4093
49 1 2004 16.0436903 0.2257
50 2 2004 17.2568827 0.2978
51 3 2004 20.2361063 0.2523
52 4 2004 11.6619038 0.1212
53 5 2004 12.8296532 -0.3395
54 6 2004 8.4202138 -0.1764
55 7 2004 15.5916644 0.0118
56 8 2004 0.9486833 0.1651
57 9 2004 7.2732386 0.2825
58 10 2004 18.0083314 0.3747
59 11 2004 14.4672043 0.1074
60 12 2004 17.3637554 0.0926
61 1 2005 18.9420168 0.0551
62 2 2005 17.0146995 -0.3716
63 3 2005 23.3002146 -0.2641
64 4 2005 17.8689675 0.2829
65 5 2005 17.2365890 0.1883
66 6 2005 14.0178458 0.0347
67 7 2005 12.6925175 -0.0680
68 8 2005 9.3861600 -0.0420
69 9 2005 11.7132404 -0.1425
70 10 2005 18.5768673 -0.0514
71 11 2005 19.6723156 -0.0008
72 12 2005 18.3248465 -0.0659
73 1 2006 18.6252517 0.0560
74 2 2006 18.7002674 -0.1151
75 3 2006 23.4882950 -0.0562
76 4 2006 19.5652754 0.1862
77 5 2006 13.6857590 0.0105
78 6 2006 11.1265448 0.1504
79 7 2006 11.0227038 0.3490
80 8 2006 7.6550637 0.5267
81 9 2006 1.8708287 0.8089
82 10 2006 5.4129474 0.9479
83 11 2006 15.2249795 0.7625
84 12 2006 14.1703917 0.3941
85 1 2007 22.8691932 0.4027
86 2 2007 14.3317829 0.3353
87 3 2007 13.0766968 0.2792
88 4 2007 23.2335964 0.2960
89 5 2007 12.2474487 0.4899
90 6 2007 11.3357840 0.2445
91 7 2007 9.3112835 0.3629
92 8 2007 1.6431677 0.5396
93 9 2007 6.8483575 0.6252
94 10 2007 13.1529464 0.4540
95 11 2007 14.5120639 0.2489
96 12 2007 18.7909553 0.0054
97 1 2008 17.6493626 0.3037
98 2 2008 13.3828248 0.1166
99 3 2008 19.0525589 0.2730
100 4 2008 17.3262806 0.0467
101 5 2008 5.2345009 0.4020
102 6 2008 3.3166248 0.4263
103 7 2008 10.1094016 0.5558
104 8 2008 11.7260394 0.4236
105 9 2008 10.7470926 0.4762
106 10 2008 15.1591557 0.4127
107 11 2008 25.5558213 0.1474
108 12 2008 18.2455474 0.1755
109 1 2009 14.5430396 0.2185
110 2 2009 12.8569048 0.3521
111 3 2009 24.0707291 0.2680
112 4 2009 16.0374562 0.3234
113 5 2009 7.2387844 0.4757
114 6 2009 13.8021737 0.3078
115 7 2009 7.5232972 0.1179
116 8 2009 6.3403470 0.1999
117 9 2009 4.6583259 0.2814
118 10 2009 13.0958008 0.3646
119 11 2009 15.3329710 0.1914
120 12 2009 19.0394328 0.3836
121 1 2010 15.5080624 0.4732
122 2 2010 17.1551742 0.2134
123 3 2010 23.9729014 0.6320
124 4 2010 18.2537667 0.5644
125 5 2010 18.2236111 0.1881
126 6 2010 14.6082169 0.0680
127 7 2010 13.6161669 0.3111
128 8 2010 11.1220502 0.2472
129 9 2010 20.7870152 0.1259
130 10 2010 19.5371441 -0.0529
131 11 2010 24.8837296 -0.2133
132 12 2010 15.5016128 0.0233
133 1 2011 17.3435867 0.3739
134 2 2011 17.6096564 0.4228
135 3 2011 19.0682983 0.5413
136 4 2011 20.4890214 0.3569
137 5 2011 12.0540450 0.1313
138 6 2011 12.5896783 0.2642
139 7 2011 5.0990195 0.5356
140 8 2011 6.5726707 0.6490
141 9 2011 2.5099801 0.5884
142 10 2011 17.6380271 0.7376
143 11 2011 17.5128524 0.6004
144 12 2011 17.2655727 0.0990
145 1 2012 16.6883193 0.2272
146 2 2012 20.8374663 0.1049
147 3 2012 16.7002994 0.1991
148 4 2012 18.7962762 -0.0596
149 5 2012 16.9292646 -0.1165
150 6 2012 11.6490343 0.2207
151 7 2012 6.2529993 0.8586
152 8 2012 5.8991525 0.9473
153 9 2012 7.8485667 0.8419
154 10 2012 12.5817328 0.4928
155 11 2012 24.7770055 0.1684
156 12 2012 23.2486559 0.4899
For this he works in R, because it has packages for fitting SARIMAX models. So far it has gone well with the arimax() function of the TSA package with seasonal ARIMA order (1,0,1).
Here is his code:
#Importing data
data=read.csv("C:/DATA.csv", header=TRUE)
rainfall=data$RAINFALL
exo=data$IOD
#Creating the suitable model of data that is able to be read by R with ts() function
library(forecast)
rainfall_ts=ts(rainfall, start=c(2000, 1), end=c(2012, 12), frequency = 12)
exo_ts=ts(exo, start=c(2000, 1), end=c(2012, 12), frequency = 12)
#Fitting SARIMAX model with seasonal ARIMA order (1,0,1) & estimation method is MLE (or ML)
library(TSA)
model_ts=arimax(log(rainfall_ts), order=c(1,0,1), seasonal=list(order=c(1,0,1), period=12), xreg=exo_ts, method='ML')
Below is the result:
> model_ts
Call:
arimax(x = log(rainfall_ts), order = c(1, 0, 1), seasonal = list(order = c(1,
0, 1), period = 12), xreg = exo_ts, method = "ML")
Coefficients:
ar1 ma1 sar1 sma1 intercept xreg
0.5730 -0.4342 0.9996 -0.9764 2.6757 -0.4894
s.e. 0.2348 0.2545 0.0018 0.0508 0.1334 0.1489
sigma^2 estimated as 0.1521: log likelihood = -86.49, aic = 184.99
Although the code works, his lecturer expected more.
Theoretically, because he used MLE, he has shown that setting the first derivatives of the log-likelihood function to zero gives only implicit solutions. This means the estimates cannot be obtained analytically,
so a numerical method is needed to finish the estimation.
So this is what my friend's lecturer expects: that he can at least show that the estimation truly needs to be done numerically
and, if so, which numerical method R uses (Newton-Raphson, BFGS, BHHH, etc.).
But the arimax() function does not seem to offer a choice of numerical method, as the call below shows:
model_ts=arimax(log(rainfall_ts), order=c(1,0,1), seasonal=list(order=c(1,0,1), period=12), xreg=exo_ts, method='ML')
The 'method' argument above selects the estimation method; the available options are ML, CSS, and CSS-ML. The syntax above clearly does not specify a numerical method, and this is the problem.
So is there any way to find out which numerical method R uses? Or does my friend have to write his own program without relying on the arimax() function?
If there are any errors in my code, please let me know. I also apologize for any grammatical or vocabulary mistakes. English is not my native language.
Some suggestions:
Estimate the model with each of the methods: ML, CSS, CSS-ML. Do the parameter estimates agree?
You can view the source code of the arimax() function by typing arimax, View(arimax) or getAnywhere(arimax) in the console.
Or you can debug it: set a breakpoint on the line model_ts=arimax(...) and then source or debugSource() your script. You can then step into the arimax function and verify for yourself which optimization method arimax uses.
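As a related illustration, base R's stats::arima() exposes the optimizer directly: it maximizes the likelihood through optim(), and its optim.method argument defaults to "BFGS". The sketch below uses simulated series in place of the rainfall and IOD data and a plain AR(1) with one regressor to keep it quick. It does not show what TSA::arimax() does internally; that still has to be checked in its source as suggested above.

```r
# Simulated monthly series standing in for log(rainfall) and IOD.
set.seed(42)
n <- 156
exo_num <- rnorm(n, sd = 0.3)
y_num   <- 2.5 - 0.5 * exo_num +
  as.numeric(arima.sim(list(ar = 0.5), n = n, sd = 0.4))
exo <- ts(exo_num, start = c(2000, 1), frequency = 12)
y   <- ts(y_num,   start = c(2000, 1), frequency = 12)

# Default optimizer argument of stats::arima().
default_opt <- formals(stats::arima)$optim.method

# Same model fitted with the default optimizer and with Nelder-Mead.
fit_bfgs <- stats::arima(y, order = c(1, 0, 0), xreg = exo, method = "ML")
fit_nm   <- stats::arima(y, order = c(1, 0, 0), xreg = exo, method = "ML",
                         optim.method = "Nelder-Mead")

default_opt
cbind(BFGS = coef(fit_bfgs), NelderMead = coef(fit_nm))
```

Comparing the two columns gives a rough sense of how much the estimates depend on the optimizer for a given data set.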

Subset dataframe according to maxima of groups

I am trying to create a subset of a dataframe conditional on grouped cumulative sums of one of the columns (i.e., cumsum of Total, grouped by Year, below).
I have a population table that looks as follows (simplified)
Year Age Total Cum.Sum
1991 20 94619 94619
1991 21 97455 192074
1991 22 101418 293492
1991 23 104192 397684
1991 24 108332 506016
1991 25 111355 617371
1991 26 114569 731940
1991 27 113852 845792
1991 28 112264 958056
1991 29 110230 1068286
1991 30 109149 1177435
1991 31 108222 1285657
1991 32 106641 1392298
1991 33 106658 1498956
1991 34 104730 1603686
1991 35 103383 1707069
1991 36 101441 1808510
1991 37 99773 1908283
1991 38 100621 2008904
1991 39 98135 2107039
1991 40 101946 2208985
2010 20 93470 93470
2010 21 94762 188232
2010 22 92527 280759
2010 23 94696 375455
2010 24 95416 470871
2010 25 98016 568887
2010 26 98387 667274
2010 27 102254 769528
2010 28 103343 872871
2010 29 105179 978050
2010 30 104278 1082328
2010 31 104099 1186427
2010 32 105240 1291667
2010 33 105316 1396983
2010 34 106250 1503233
2010 35 109019 1612252
2010 36 110044 1722296
2010 37 113949 1836245
2010 38 118086 1954331
2010 39 119845 2074176
2010 40 123647 2197823
Now I'd like to subset this dataframe so that the cumulative sum of each year does not exceed a certain threshold, e.g.
1991 2010
1605897 1803476
I do not want to have separate datasets per year.
This will do:
t.h <- read.table(header=TRUE, text=
'Year th
1991 1605897
2010 1803476')
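d <- merge(dataset, t.h)      # adds the per-year threshold as column th (joined on Year)
subset(d, Cum.Sum <= th)      # keep rows whose cumulative sum does not exceed it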
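The same idea can be sketched without merge(), by looking the threshold up in a named vector. The numbers below are made up for illustration, and the cumulative sum is recomputed with ave() in case the Cum.Sum column is not already present.

```r
# Toy population table (values are made up).
dataset <- data.frame(
  Year  = c(1991, 1991, 1991, 2010, 2010, 2010),
  Age   = c(20, 21, 22, 20, 21, 22),
  Total = c(100, 200, 300, 150, 250, 350)
)

# Cumulative sum of Total within each year.
dataset$Cum.Sum <- ave(dataset$Total, dataset$Year, FUN = cumsum)

# Per-year thresholds, named by year (assumed values).
th <- c("1991" = 300, "2010" = 500)

# Keep rows whose cumulative sum does not exceed that year's threshold.
kept <- dataset[dataset$Cum.Sum <= th[as.character(dataset$Year)], ]
kept
```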
