Rolling window regressions within multiple groups in R

I am trying to apply a rolling window regression model to multiple groups in my data. Part of my data is as below:
gvkey year LC YTO
1 001004 1972 0.1919713 2.021182
2 001004 1973 0.2275895 2.029056
3 001004 1974 0.3341368 2.053517
4 001004 1975 0.3313518 2.090532
5 001004 1976 0.4005829 2.136939
6 001004 1977 0.4471945 2.123909
7 001004 1978 0.4442004 2.150281
8 001004 1979 0.5054544 2.173162
9 001004 1980 0.5269449 2.188077
10 001004 1981 0.5423774 2.200805
11 001004 1982 0.3528982 2.200851
12 001004 1983 0.3674031 2.190487
13 001004 1984 0.2267620 2.181291
14 001004 1985 0.2796132 2.159443
15 001004 1986 0.3382120 2.128420
16 001004 1987 0.3214131 2.089670
17 001004 1988 0.3883732 2.048279
18 001004 1989 0.4466488 1.999539
19 001004 1990 0.4929991 1.955500
20 001004 1991 0.5150894 1.934893
21 001004 1992 0.5218845 1.925521
22 001004 1993 0.5038105 1.904241
23 001004 1994 0.5041639 1.881731
24 001004 1995 0.5196658 1.863143
25 001004 1996 0.5352994 1.844464
26 001004 1997 0.4556059 1.835676
27 001004 1998 0.4905767 1.837886
28 001004 1999 0.5471959 1.824636
29 001004 2000 0.5920976 1.814944
30 001004 2001 0.5998172 1.893943
31 001004 2002 0.4499911 1.889703
32 001004 2003 0.4207154 1.870703
33 001004 2004 0.4371594 1.831638
34 001004 2005 0.4525900 1.802684
35 001004 2006 0.4342149 1.781757
36 001004 2007 0.4899473 1.753360
37 001004 2008 0.5436673 1.680464
38 001004 2009 0.5873861 1.612499
39 001004 2010 0.5216734 1.544322
40 001004 2011 0.5592963 1.415892
41 001004 2012 0.5627509 1.407393
42 001004 2013 0.5904637 1.384202
43 001004 2014 0.6170085 1.353340
44 001004 2015 0.7145900 1.314014
45 001007 1975 0.3721916 2.090532
46 001007 1976 0.2760902 2.136939
47 001007 1977 0.1866554 2.123909
48 001007 1978 0.1977654 2.150281
49 001007 1979 0.1927100 2.173162
50 001007 1980 0.2112344 2.188077
51 001007 1981 -0.2141724 2.200805
52 001007 1982 -0.2072785 2.200851
53 001007 1983 -1.7406963 2.190487
54 001007 1984 -14.8071429 2.181291
55 001009 1982 -1.2753247 2.200851
56 001009 1983 1.3349904 2.190487
57 001009 1984 2.6192237 2.181291
58 001009 1985 0.5867925 2.159443
59 001009 1986 0.6959436 2.128420
60 001009 1987 0.7142857 2.089670
61 001009 1988 0.7771897 2.048279
62 001009 1989 0.8293820 1.999539
63 001009 1990 0.8655382 1.955500
64 001009 1991 0.8712144 1.934893
65 001009 1992 0.8882548 1.925521
66 001009 1993 0.9190540 1.904241
67 001009 1994 0.9411806 1.881731
68 001010 1971 0.6492499 2.002337
69 001010 1972 0.6667664 2.021182
70 001010 1973 0.6840115 2.029056
71 001010 1974 0.7011797 2.053517
72 001010 1975 0.7189469 2.090532
73 001010 1976 0.7367344 2.136939
74 001010 1977 0.7511779 2.123909
75 001010 1978 0.7673365 2.150281
76 001010 1979 0.7795880 2.173162
77 001010 1980 0.7824448 2.188077
78 001010 1981 0.7821913 2.200805
79 001010 1982 0.7646078 2.200851
80 001010 1983 0.7426172 2.190487
81 001010 1984 -0.0657935 2.181291
82 001010 1985 0.2802410 2.159443
83 001010 1986 0.2052373 2.128420
84 001010 1987 0.2465290 2.089670
85 001010 1988 0.3437856 2.048279
86 001010 1989 0.7398662 1.999539
87 001010 1990 0.6360582 1.955500
88 001010 1991 0.7790707 1.934893
89 001010 1992 0.7588472 1.925521
90 001010 1993 0.7695341 1.904241
91 001010 1994 0.8060759 1.881731
92 001010 1995 0.8381234 1.863143
93 001010 1996 0.8661541 1.844464
94 001010 1997 0.8700456 1.835676
95 001010 1998 0.8748443 1.837886
96 001010 1999 0.8884077 1.824636
97 001010 2000 0.8979903 1.814944
98 001010 2003 0.6812689 1.870703
99 001011 1983 0.3043007 2.190487
100 001011 1984 0.3080601 2.181291
My function is:
Match.LC.YTO <- function(x) {
  rollapplyr(x, width = 10, by.column = F, fill = NA, FUN = function(m) {
    temp.1 <- lm(LC ~ YTO, data = m)
    summary(temp.1)$r.squared * (sign(summary(temp.1)$coefficients[2, 1]))
  })
}
df <- df %>% group_by(gvkey) %>% mutate(MTCH = Match.LC.YTO(df))
My data is grouped by gvkey, and for each group I need to calculate a variable named "MTCH", which equals the R-squared value times the sign of the coefficient of YTO in the linear model LC ~ YTO, estimated over a rolling window of 10 observations. I got this error message:
Error in mutate_impl(.data, dots) :
'data' must be a data.frame, not a matrix or an array
I have checked many other posts concerning the functions rollapply and rollapplyr; some suggest that I need to convert my df to a zoo object or a matrix before using rollapply, but it still did not work.

rollapply in zoo will accept plain matrix and data frame arguments, so that is not the problem. The problems with this code are:
1. the code passes a matrix to lm, but lm's data argument requires a data.frame (a minimal demonstration follows this list);
2. the code attempts to use rollapply with a width of 10 on an object with fewer than 10 rows in the last group;
3. if the intercept fits perfectly, there will be no 2nd coefficient from lm, so the reference to coefficients[2, 1] will fail with an error.
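The first problem can be reproduced in isolation: with by.column = FALSE, rollapply hands FUN a plain matrix, and lm's model.frame then stops with the same message seen in the question. A minimal sketch with made-up data:
library(zoo)

set.seed(1)
m <- cbind(LC = rnorm(12), YTO = rnorm(12))  # rollapply passes FUN plain matrix windows like this
try(rollapplyr(m, width = 10, by.column = FALSE, fill = NA,
               FUN = function(w) coef(lm(LC ~ YTO, data = w))[2]))
# Error: 'data' must be a data.frame, not a matrix or an array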
Although not erroneous, the following are areas for improvement:
TRUE and FALSE should be written out in full, since T and F are valid variable names, which makes the abbreviations highly error-prone.
when using group_by in dplyr, always match it with an ungroup. If you don't, the output will remember the grouping, and the next time you use the output you will get a surprise. For example, consider the difference between the following two snippets. The first results in n being the number of rows in the group that each row belongs to, whereas the second results in n being the number of rows in out.
out <- df %>% group_by(gvkey) %>% mutate(MTCH = Match.LC.YTO(LC, YTO))
out %>% mutate(n = n())
out <- df %>% group_by(gvkey) %>% mutate(MTCH = Match.LC.YTO(LC, YTO)) %>% ungroup
out %>% mutate(n = n())
questions on SO should be self-contained and reproducible, so the library statements should not be omitted and the data should be provided in a reproducible manner.
To fix these problems we:
use partial = TRUE in rollapply so that it also accepts objects with fewer than 10 rows.
pass the variables involved directly.
rollapply over the row numbers.
append an NA to the end of the coefficients, to be picked up if the coefficient vector otherwise has only 1 element.
for clarity, separate out the lm_summary function, which was anonymous in the question.
for reproducibility, add library statements and the Note at the end.
The revised code is:
library(dplyr)
library(zoo)

Match.LC.YTO <- function(LC, YTO) {
  lm_summary <- function(ix) {
    temp.1 <- lm(LC ~ YTO, subset = ix)
    summary(temp.1)$r.squared * sign(c(coef(temp.1), NA)[2])
  }
  rollapplyr(seq_along(LC), width = 10, FUN = lm_summary, partial = TRUE)
}
df %>% group_by(gvkey) %>% mutate(MTCH = Match.LC.YTO(LC, YTO)) %>% ungroup
If you would rather use fill = NA instead of partial = TRUE, then add a check for the series length being less than the window width, i.e. less than 10:
Match.LC.YTO2 <- function(LC, YTO) {
  lm_summary <- function(ix) {
    temp.1 <- lm(LC ~ YTO, subset = ix)
    summary(temp.1)$r.squared * sign(c(coef(temp.1), NA)[2])
  }
  if (length(LC) < 10) return(NA) ##
  rollapplyr(seq_along(LC), width = 10, FUN = lm_summary, fill = NA)
}
df %>% group_by(gvkey) %>% mutate(MTCH = Match.LC.YTO2(LC, YTO)) %>% ungroup
Note 1
For sake of reproducibility we used this as the input df:
Lines <- " gvkey year LC YTO
1 001004 1972 0.1919713 2.021182
2 001004 1973 0.2275895 2.029056
3 001004 1974 0.3341368 2.053517
4 001004 1975 0.3313518 2.090532
5 001004 1976 0.4005829 2.136939
6 001004 1977 0.4471945 2.123909
7 001004 1978 0.4442004 2.150281
8 001004 1979 0.5054544 2.173162
9 001004 1980 0.5269449 2.188077
10 001004 1981 0.5423774 2.200805
11 001004 1982 0.3528982 2.200851
12 001004 1983 0.3674031 2.190487
13 001004 1984 0.2267620 2.181291
14 001004 1985 0.2796132 2.159443
15 001004 1986 0.3382120 2.128420
16 001004 1987 0.3214131 2.089670
17 001004 1988 0.3883732 2.048279
18 001004 1989 0.4466488 1.999539
19 001004 1990 0.4929991 1.955500
20 001004 1991 0.5150894 1.934893
21 001004 1992 0.5218845 1.925521
22 001004 1993 0.5038105 1.904241
23 001004 1994 0.5041639 1.881731
24 001004 1995 0.5196658 1.863143
25 001004 1996 0.5352994 1.844464
26 001004 1997 0.4556059 1.835676
27 001004 1998 0.4905767 1.837886
28 001004 1999 0.5471959 1.824636
29 001004 2000 0.5920976 1.814944
30 001004 2001 0.5998172 1.893943
31 001004 2002 0.4499911 1.889703
32 001004 2003 0.4207154 1.870703
33 001004 2004 0.4371594 1.831638
34 001004 2005 0.4525900 1.802684
35 001004 2006 0.4342149 1.781757
36 001004 2007 0.4899473 1.753360
37 001004 2008 0.5436673 1.680464
38 001004 2009 0.5873861 1.612499
39 001004 2010 0.5216734 1.544322
40 001004 2011 0.5592963 1.415892
41 001004 2012 0.5627509 1.407393
42 001004 2013 0.5904637 1.384202
43 001004 2014 0.6170085 1.353340
44 001004 2015 0.7145900 1.314014
45 001007 1975 0.3721916 2.090532
46 001007 1976 0.2760902 2.136939
47 001007 1977 0.1866554 2.123909
48 001007 1978 0.1977654 2.150281
49 001007 1979 0.1927100 2.173162
50 001007 1980 0.2112344 2.188077
51 001007 1981 -0.2141724 2.200805
52 001007 1982 -0.2072785 2.200851
53 001007 1983 -1.7406963 2.190487
54 001007 1984 -14.8071429 2.181291
55 001009 1982 -1.2753247 2.200851
56 001009 1983 1.3349904 2.190487
57 001009 1984 2.6192237 2.181291
58 001009 1985 0.5867925 2.159443
59 001009 1986 0.6959436 2.128420
60 001009 1987 0.7142857 2.089670
61 001009 1988 0.7771897 2.048279
62 001009 1989 0.8293820 1.999539
63 001009 1990 0.8655382 1.955500
64 001009 1991 0.8712144 1.934893
65 001009 1992 0.8882548 1.925521
66 001009 1993 0.9190540 1.904241
67 001009 1994 0.9411806 1.881731
68 001010 1971 0.6492499 2.002337
69 001010 1972 0.6667664 2.021182
70 001010 1973 0.6840115 2.029056
71 001010 1974 0.7011797 2.053517
72 001010 1975 0.7189469 2.090532
73 001010 1976 0.7367344 2.136939
74 001010 1977 0.7511779 2.123909
75 001010 1978 0.7673365 2.150281
76 001010 1979 0.7795880 2.173162
77 001010 1980 0.7824448 2.188077
78 001010 1981 0.7821913 2.200805
79 001010 1982 0.7646078 2.200851
80 001010 1983 0.7426172 2.190487
81 001010 1984 -0.0657935 2.181291
82 001010 1985 0.2802410 2.159443
83 001010 1986 0.2052373 2.128420
84 001010 1987 0.2465290 2.089670
85 001010 1988 0.3437856 2.048279
86 001010 1989 0.7398662 1.999539
87 001010 1990 0.6360582 1.955500
88 001010 1991 0.7790707 1.934893
89 001010 1992 0.7588472 1.925521
90 001010 1993 0.7695341 1.904241
91 001010 1994 0.8060759 1.881731
92 001010 1995 0.8381234 1.863143
93 001010 1996 0.8661541 1.844464
94 001010 1997 0.8700456 1.835676
95 001010 1998 0.8748443 1.837886
96 001010 1999 0.8884077 1.824636
97 001010 2000 0.8979903 1.814944
98 001010 2003 0.6812689 1.870703
99 001011 1983 0.3043007 2.190487
100 001011 1984 0.3080601 2.181291"
df <- read.table(text = Lines)
Note 2
The check for length in the line marked with ## is no longer necessary, as recent versions of zoo make this check automatically.

Related

R: Substituting missing values (NAs) with two different values

I might be overcomplicating things - would love to know if there is an easier way to solve this. I have a data frame (df) with 5654 observations - 1332 are foreign-born and 4322 are Canada-born subjects.
The variable df$YR_IMM captures: "In what year did you come to live in Canada?"
See the following distribution of observations per immigration year table(df$YR_IMM) :
1920 1926 1928 1930 1939 1942 1944 1946 1947 1948 1949 1950 1951 1952 1953 1954
2 1 1 2 1 2 1 1 1 9 5 1 7 13 3 5
1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970
10 5 8 6 6 1 5 1 6 3 7 16 18 12 15 13
1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986
10 17 8 18 25 16 15 12 16 27 13 16 11 9 17 16
1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003
24 21 31 36 26 30 26 24 22 30 29 26 47 52 53 28 9
Naturally these are only foreign-born individuals (mean = 1985) - however, 348 foreign-born subjects are missing a year. There are a total of 4670 NAs, which also include the Canada-born subjects.
How can I code these df$YR_IMM NAs in such a way that
348 (NA) --> 1985
4322 (NA) --> 100
Additionally, the status is given by df$Brthcoun, with 0 = "born in Canada" and 1 = "born outside of Canada".
Hope this makes sense - thank you!
EDIT: This was the solution ->
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 0] <- 100
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 1] <- 1985
Try the below code:
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 0] <- 100
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 1] <- 1985
I hope this helps!
Something like this should also work; note the nested ifelse, so that non-missing values of YR_IMM are left unchanged:
df$YR_IMM <- ifelse(is.na(df$YR_IMM),
                    ifelse(df$Brthcoun == 0, 100, 1985),
                    df$YR_IMM)
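The same recode can also be written with dplyr's case_when, sketched here on the assumption that the dplyr package is available; the as.numeric guards against an integer/double type clash, since case_when requires all branches to return the same type:
library(dplyr)

df <- df %>%
  mutate(YR_IMM = case_when(
    is.na(YR_IMM) & Brthcoun == 0 ~ 100,
    is.na(YR_IMM) & Brthcoun == 1 ~ 1985,
    TRUE ~ as.numeric(YR_IMM)   # keep non-missing values unchanged
  ))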

Parallelize for loops that subset panel data by industry-year

I want to carry out an estimation procedure that uses data on all firms in a given sector, for a rolling window of 5 years.
I can do it easily in a loop, but since the estimation procedure takes quite a while, I would like to parallelize it. Is there any way to do this?
My data looks like this:
sale_log cogs_log ppegt_log m_naics4 naics_2 gvkey year
1 3.9070198 2.5146032 3.192821715 9.290151e-02 72 1001 1983
2 4.1028774 2.7375141 3.517861329 1.067687e-01 72 1001 1984
3 4.5909863 3.2106595 3.975112703 2.511660e-01 72 1001 1985
4 3.2560391 2.7867256 -0.763368555 1.351031e-02 44 1003 1982
5 3.2966287 2.8088799 -0.305698649 1.151525e-02 44 1003 1983
6 3.2636907 2.8330357 0.154036559 8.699394e-03 44 1003 1984
7 3.7916480 3.2346849 0.887916936 1.351803e-02 44 1003 1985
8 4.1778028 3.5364473 1.177985972 1.761273e-02 44 1003 1986
9 4.1819066 3.7297111 1.393016951 1.686331e-02 44 1003 1987
10 4.0174411 3.6050022 1.479584215 1.601205e-02 44 1003 1988
11 3.4466429 2.9633579 1.312863013 8.888067e-03 44 1003 1989
12 3.0667367 2.6128805 0.909779173 2.102674e-02 42 1004 1965
13 3.2362968 2.8140391 1.430690273 2.050934e-02 42 1004 1966
14 3.1981990 2.8822097 1.721614365 1.702929e-02 42 1004 1967
15 3.9265031 3.6159280 2.399823853 2.559074e-02 42 1004 1968
16 4.3343438 4.0116068 2.592692585 3.649313e-02 42 1004 1969
17 4.5869564 4.3059855 2.772196529 4.743631e-02 42 1004 1970
18 4.7015486 4.3995561 2.875267240 5.155589e-02 42 1004 1971
19 5.0564414 4.7539697 3.218686385 6.863808e-02 42 1004 1972
20 5.4323873 5.1711531 3.350849771 8.272720e-02 42 1004 1973
21 5.2979696 5.0033437 3.383504340 6.726429e-02 42 1004 1974
22 5.3958779 5.1475985 3.475121024 1.534230e-01 42 1004 1975
23 5.5442635 5.3195666 3.517557041 1.674937e-01 42 1004 1976
24 5.6260795 5.3909462 3.694842501 1.711362e-01 42 1004 1977
25 5.8039766 5.5455887 3.895724689 1.836405e-01 42 1004 1978
26 5.8198831 5.5665980 3.960153940 1.700499e-01 42 1004 1979
27 5.7474447 5.4697019 3.943733263 1.520660e-01 42 1004 1980
where gvkey is the firm id and naics are the industry codes.
The code I wrote:
theta <- matrix(NA, 60, 23)
count <- 1
temp <- dat %>% select(
  "sale_log", "cogs_log", "ppegt_log",
  "m_naics4", "naics_2", "gvkey", "year"
)
for (i in 1960:2019) {  # 5-year rolling sector-year specific production functions
  sub <- temp[between(temp$year, i - 5, i), ]  # subset 5 years
  jcount <- 1
  for (j in sort(unique(sub$naics_2))) {  # loop over sectors
    temp2 <- sub[sub$naics_2 == j, ]
    mdl <- prodestOP(
      Y = temp2$sale_log, fX = temp2$cogs_log, sX = temp2$ppegt_log,
      pX = temp2$cogs_log, cX = temp2$m_naics4, idvar = temp2$gvkey,
      timevar = temp2$year
    )
    theta[count, jcount] <- mdl@Model$FSbetas[2]
    jcount <- jcount + 1
  }
  count <- count + 1
}
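One way to parallelize this is to flatten the two nested loops into a single grid of (year, sector) tasks and distribute them with foreach. The sketch below assumes the foreach and doParallel packages are installed and that prodestOP comes from the prodest package; the @Model$FSbetas[2] extraction simply mirrors the loop above, and empty year-sector cells are caught with tryCatch and returned as NA:
library(dplyr)
library(foreach)
library(doParallel)

cl <- makeCluster(max(1, parallel::detectCores() - 1))
registerDoParallel(cl)

# one independent task per (year, sector) pair
grid <- expand.grid(year = 1960:2019, sector = sort(unique(temp$naics_2)))

res <- foreach(k = seq_len(nrow(grid)), .combine = rbind,
               .packages = c("dplyr", "prodest")) %dopar% {
  i <- grid$year[k]
  j <- grid$sector[k]
  temp2 <- temp[between(temp$year, i - 5, i) & temp$naics_2 == j, ]
  est <- tryCatch(
    prodestOP(Y = temp2$sale_log, fX = temp2$cogs_log, sX = temp2$ppegt_log,
              pX = temp2$cogs_log, cX = temp2$m_naics4,
              idvar = temp2$gvkey, timevar = temp2$year)@Model$FSbetas[2],
    error = function(e) NA_real_)   # empty or too-small cells yield NA
  data.frame(year = i, naics_2 = j, theta = est)
}
stopCluster(cl)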

Adding value of row below in R - most efficient

My data:
no.att
year freq
1 1896 380
2 1900 1936
3 1904 1301
4 1906 1733
5 1908 3101
6 1912 4040
7 1920 4292
8 1924 5693
9 1928 5574
10 1932 3321
11 1936 7401
12 1948 7480
13 1952 9358
14 1956 6434
15 1960 9235
16 1964 9480
17 1968 10479
18 1972 11959
19 1976 10502
20 1980 8937
21 1984 11588
22 1988 14676
23 1992 16413
24 1994 3160
25 1996 13780
26 1998 3605
27 2000 13821
28 2002 4109
29 2004 13443
30 2006 4382
31 2008 13602
32 2010 4402
33 2012 12920
34 2014 4891
35 2016 13688
My goal:
From year 1992 onwards, the observation interval changes from every 4th year to every 2nd year.
I want to keep it at every 4th year, so I want to compute
no.att[24,2] + no.att[25,2]
and likewise for each later pair.
My solution is:
x <- 24
y <- 25
temp <- no.att[x, 2]
temp1 <- no.att[y, 2]
no.att[y, 2] <- temp + temp1
x <- x + 2
y <- y + 2
Running the above once, then re-running it with the top two lines skipped for each subsequent pair, does the trick.
What would an alternative to this approach be?
Using ave to sum freq over four-year buckets:
ans <- dat
ans$freq <- ave(dat$freq, ceiling(dat$year/4), FUN=sum)
ans[ans$year %in% seq(1896,2016,4),]
output:
year freq
1 1896 380
2 1900 1936
3 1904 1301
5 1908 4834
6 1912 4040
7 1920 4292
8 1924 5693
9 1928 5574
10 1932 3321
11 1936 7401
12 1948 7480
13 1952 9358
14 1956 6434
15 1960 9235
16 1964 9480
17 1968 10479
18 1972 11959
19 1976 10502
20 1980 8937
21 1984 11588
22 1988 14676
23 1992 16413
25 1996 16940
27 2000 17426
29 2004 17552
31 2008 17984
33 2012 17322
35 2016 18579
data:
dat <- read.table(text="year freq
1896 380
1900 1936
1904 1301
1906 1733
1908 3101
1912 4040
1920 4292
1924 5693
1928 5574
1932 3321
1936 7401
1948 7480
1952 9358
1956 6434
1960 9235
1964 9480
1968 10479
1972 11959
1976 10502
1980 8937
1984 11588
1988 14676
1992 16413
1994 3160
1996 13780
1998 3605
2000 13821
2002 4109
2004 13443
2006 4382
2008 13602
2010 4402
2012 12920
2014 4891
2016 13688", header=TRUE)
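For comparison, the same bucketing idea can be phrased with dplyr (a sketch, assuming the dplyr package; ceiling(year / 4) assigns each off-cycle year to the bucket of the following multiple of 4):
library(dplyr)

dat %>%
  group_by(bucket = ceiling(year / 4)) %>%
  summarise(year = max(year), freq = sum(freq)) %>%
  select(year, freq)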

Convert rows to Columns in R

My Dataframe:
> head(scotland_weather)
JAN Year.1 FEB Year.2 MAR Year.3 APR Year.4 MAY Year.5 JUN Year.6 JUL Year.7 AUG Year.8 SEP Year.9 OCT Year.10
1 293.8 1993 278.1 1993 238.5 1993 191.1 1947 191.4 2011 155.0 1938 185.6 1940 216.5 1985 267.6 1950 258.1 1935
2 292.2 1928 258.8 1997 233.4 1990 149.0 1910 168.7 1986 137.9 2002 181.4 1988 211.9 1992 221.2 1981 254.0 1954
3 275.6 2008 244.7 2002 201.3 1992 146.8 1934 155.9 1925 137.8 1948 170.1 1939 202.3 2009 193.9 1982 248.8 2014
4 252.3 2015 227.9 1989 200.2 1967 142.1 1949 149.5 2015 137.7 1931 165.8 2010 191.4 1962 189.7 2011 247.7 1938
5 246.2 1974 224.9 2014 180.2 1979 133.5 1950 137.4 2003 135.0 1966 162.9 1956 190.3 2014 189.7 1927 242.3 1983
6 245.0 1975 195.6 1995 180.0 1989 132.9 1932 129.7 2007 131.7 2004 159.9 1985 189.1 2004 189.6 1985 240.9 2001
NOV Year.11 DEC Year.12 WIN Year.13 SPR Year.14 SUM Year.15 AUT Year.16 ANN Year.17
1 262.0 2009 300.7 2013 743.6 2014 409.5 1986 455.6 1985 661.2 1981 1886.4 2011
2 244.8 1938 268.5 1986 649.5 1995 401.3 2015 435.6 1948 633.8 1954 1828.1 1990
3 242.2 2006 267.2 1929 645.4 2000 393.7 1994 427.8 2009 615.8 1938 1756.8 2014
4 231.3 1917 265.4 2011 638.3 2007 393.2 1967 422.6 1956 594.5 1935 1735.8 1938
5 229.9 1981 264.0 2006 608.9 1990 391.7 1992 397.0 2004 590.6 1982 1720.0 2008
6 224.9 1951 261.0 1912 592.8 2015 389.1 1913 390.1 1938 589.2 2006 1716.5 1954
The Year.X columns are not ordered. I wish to convert this into the following format:
month year rainfall_mm
Jan 1993 293.8
Feb 1993 278.1
Mar 1993 238.5
...
Nov 2015 230.0
I tried t(), but it keeps the year columns separate.
I also tried reshape2's recast(data, formula, ..., id.var, measure.var), but something is missing, as the month and Year.X columns are numeric and integer:
> str(scotland_weather)
'data.frame': 106 obs. of 34 variables:
$ JAN : num 294 292 276 252 246 ...
$ Year.1 : int 1993 1928 2008 2015 1974 1975 2005 2007 1990 1983 ...
$ FEB : num 278 259 245 228 225 ...
$ Year.2 : int 1990 1997 2002 1989 2014 1995 1998 2000 1920 1918 ...
$ MAR : num 238 233 201 200 180 ...
$ Year.3 : int 1994 1990 1992 1967 1979 1989 1921 1913 2015 1978 ...
$ APR : num 191 149 147 142 134 ...
Based on the pattern of alternating columns in 'scotland_weather', one way would be to use c(TRUE, FALSE) to select the alternate columns by recycling, which is equivalent to seq(1, ncol(scotland_weather), by = 2). Using c(FALSE, TRUE) instead gives seq(2, ncol(scotland_weather), by = 2). These select the rainfall and year columns respectively; we take the transpose (t) of each and concatenate (c) it to a vector. The next step is to extract the column names that are not 'Year', for which grep can be used. Then data.frame binds the vectors into a data frame.
res <- data.frame(
  month = names(scotland_weather)[!grepl("Year", names(scotland_weather))],
  year = c(t(scotland_weather[c(FALSE, TRUE)])),
  rainfall_mm = c(t(scotland_weather[c(TRUE, FALSE)]))
)
head(res,4)
# month year rainfall_mm
#1 JAN 1993 293.8
#2 FEB 1993 278.1
#3 MAR 1993 238.5
#4 APR 1947 191.1
The problem you have is not only that you need to transform your data; you also have the problem that the years for the first column are in the second, the years for the third column are in the fourth, and so on.
Here is a solution using tidyr. Gathering the rainfall columns and the year columns separately stacks each set column by column in the same order, so the two long results line up row for row and can simply be bound back together:
library(tidyr)

year_cols <- grep("Year", names(scotland_weather))

values <- gather(scotland_weather[, -year_cols], "month", "rainfall_mm")
years <- gather(scotland_weather[, year_cols], "yearname", "year")

values$year <- years$year
res <- values[, c("month", "year", "rainfall_mm")]

Subset dataframe according to maxima of groups

I am trying to create a subset of a dataframe conditional on grouped cumulative sums of one of the columns (i.e., cumsum of Total, grouped by Year, below).
I have a population table that looks as follows (simplified)
Year Age Total Cum.Sum
1991 20 94619 94619
1991 21 97455 192074
1991 22 101418 293492
1991 23 104192 397684
1991 24 108332 506016
1991 25 111355 617371
1991 26 114569 731940
1991 27 113852 845792
1991 28 112264 958056
1991 29 110230 1068286
1991 30 109149 1177435
1991 31 108222 1285657
1991 32 106641 1392298
1991 33 106658 1498956
1991 34 104730 1603686
1991 35 103383 1707069
1991 36 101441 1808510
1991 37 99773 1908283
1991 38 100621 2008904
1991 39 98135 2107039
1991 40 101946 2208985
2010 20 93470 93470
2010 21 94762 188232
2010 22 92527 280759
2010 23 94696 375455
2010 24 95416 470871
2010 25 98016 568887
2010 26 98387 667274
2010 27 102254 769528
2010 28 103343 872871
2010 29 105179 978050
2010 30 104278 1082328
2010 31 104099 1186427
2010 32 105240 1291667
2010 33 105316 1396983
2010 34 106250 1503233
2010 35 109019 1612252
2010 36 110044 1722296
2010 37 113949 1836245
2010 38 118086 1954331
2010 39 119845 2074176
2010 40 123647 2197823
Now I'd like to subset this data frame so that the cumulative sum within each year does not exceed a certain threshold, e.g.
Year       1991      2010
Threshold  1605897   1803476
I do not want to have separate datasets per year.
This will do; merge attaches the per-year threshold to each row, and subset then compares Cum.Sum against it:
t.h <- read.table(header = TRUE, text =
'Year th
1991 1605897
2010 1803476')
d <- merge(dataset, t.h)
subset(d, Cum.Sum < th, select = -th)
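The same merge-then-filter idea can be phrased with dplyr (a sketch, assuming the dplyr package and the t.h lookup table above):
library(dplyr)

dataset %>%
  inner_join(t.h, by = "Year") %>%
  filter(Cum.Sum < th) %>%
  select(-th)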
