Rolling average for aggregated results in R

I have a database with sales value for individual firms that belong to different industries.
In the example dataset below:
library(data.table)
set.seed(123)
df <- data.table(year = rep(1980:1984, each = 4), sale = sample(100:150, 20), ind = sample(LETTERS[1:2], 20, replace = TRUE))
df[order(year,ind)]
year sale ind
1: 1980 114 A
2: 1980 102 A
3: 1980 130 B
4: 1980 113 B
5: 1981 136 A
6: 1981 148 A
7: 1981 141 B
8: 1981 142 B
9: 1982 124 A
10: 1982 125 A
11: 1982 104 A
12: 1982 126 B
13: 1983 108 A
14: 1983 128 A
15: 1983 140 B
16: 1983 127 B
17: 1984 134 A
18: 1984 107 A
19: 1984 106 A
20: 1984 146 B
The column "ind" represents the industry; I have omitted firm identifiers since they are of no use in this example.
I want an average defined as follows:
For each year, the desired average is the average over all firms within the industry during the past three years. If data for all of the past three years are not available, a minimum of two prior years is also acceptable.
For example, in the above dataset, if year=1982 and ind=A, there are only two prior years of data (which is still acceptable), so the desired average is the average of all sale values in years 1980 and 1981 for industry A.
If year=1983, and ind=A, we have three prior years, and the desired average is the average of all sale values in years 1980, 1981, and 1982 for industry A.
If year=1984, and ind=A, we have three prior years, and the desired average is the average of all sale values in years 1981, 1982, and 1983 for industry A.
The desired output, thus, will be as follows:
year sale ind mymean
1: 1980 130 B NA
2: 1980 114 A NA
3: 1980 113 B NA
4: 1980 102 A NA
5: 1981 141 B NA
6: 1981 142 B NA
7: 1981 136 A NA
8: 1981 148 A NA
9: 1982 124 A 125.0000
10: 1982 125 A 125.0000
11: 1982 126 B 131.5000
12: 1982 104 A 125.0000
13: 1983 140 B 130.4000
14: 1983 127 B 130.4000
15: 1983 108 A 121.8571
16: 1983 128 A 121.8571
17: 1984 134 A 124.7143
18: 1984 107 A 124.7143
19: 1984 146 B 135.2000
20: 1984 106 A 124.7143
A data.table solution is much preferred for fast implementation.
Many thanks in advance.

I am not very good at data.table. Here is a tidyverse solution, in case it helps or you can translate it to data.table:
library(tidyverse)
df %>%
  group_by(ind, year) %>%
  summarise(ds = sum(sale),
            dn = n()) %>%
  mutate(ds = (lag(ds, 1) + lag(ds, 2) + ifelse(is.na(lag(ds, 3)), 0, lag(ds, 3))) /
              (lag(dn, 1) + lag(dn, 2) + ifelse(is.na(lag(dn, 3)), 0, lag(dn, 3)))) %>%
  select(ind, year, mymean = ds) %>%
  right_join(df, by = c("ind", "year"))
`summarise()` regrouping output by 'ind' (override with `.groups` argument)
# A tibble: 20 x 4
ind year mymean sale
<chr> <int> <dbl> <int>
1 A 1980 NA 114
2 A 1980 NA 102
3 A 1981 NA 136
4 A 1981 NA 148
5 A 1982 125 124
6 A 1982 125 125
7 A 1982 125 104
8 A 1983 122. 108
9 A 1983 122. 128
10 A 1984 125. 134
11 A 1984 125. 107
12 A 1984 125. 106
13 B 1980 NA 130
14 B 1980 NA 113
15 B 1981 NA 141
16 B 1981 NA 142
17 B 1982 132. 126
18 B 1983 130. 140
19 B 1983 130. 127
20 B 1984 135. 146

You can use zoo's rollapply function to perform this rolling calculation. Note that there are dedicated functions for rolling means, like frollmean in data.table and rollmean in zoo, but they lack the partial = TRUE argument present in rollapply. partial = TRUE is useful here since you want to calculate the mean even when the window contains fewer than 3 observations.
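A minimal sketch of what partial = TRUE changes, using toy data rather than the question's:

```r
library(zoo)

# With partial = TRUE, the window shrinks at the start of the series,
# so a value is returned even before 3 observations are available:
rollapplyr(1:5, 3, mean, partial = TRUE)
# [1] 1.0 1.5 2.0 3.0 4.0

# With a fixed window, the first two positions have no complete window:
rollapplyr(1:5, 3, mean, fill = NA)
# [1]  NA  NA 2.0 3.0 4.0
```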
We can first calculate the mean sale value for each ind and year, then perform the rolling-mean calculation with a window size of 3, and join this data with the original dataframe to get all of its rows back.
library(data.table)
library(zoo)
df1 <- df[, .(sale = mean(sale)), .(ind, year)]
df2 <- df1[, my_mean := shift(rollapplyr(sale, 3, function(x)
if(length(x) > 1) mean(x, na.rm = TRUE) else NA, partial = TRUE)), ind]
df[df2, on = .(ind, year)]
This can be written using dplyr as :
library(dplyr)
df %>%
  group_by(ind, year) %>%
  summarise(sale = mean(sale)) %>%
  mutate(avg_mean = lag(rollapplyr(sale, 3, partial = TRUE, function(x)
    if (length(x) > 1) mean(x, na.rm = TRUE) else NA))) %>%
  left_join(df, by = c('ind', 'year'))

Building on Ronak's answer (which takes the mean of the previous yearly means), a more general approach (the mean of all previous values) with a data.table solution can be:
library(data.table)
library(roll)
df1 <- df[, .(sum_1 = sum(sale), n=length(sale)), .(ind, year)]
df1[,`:=`(
my_sum = roll_sum(shift(sum_1),3,min_obs = 2),
my_n = roll_sum(shift(n),3,min_obs = 2)
),by=.(ind)]
df1[,`:=`(my_mean=(my_sum/my_n))]
df[df1[, !c("sum_1","n","my_sum","my_n")], on = .(ind, year)]
year sale ind my_mean
1: 1980 130 B NA
2: 1980 113 B NA
3: 1980 114 A NA
4: 1980 102 A NA
5: 1981 141 B NA
6: 1981 142 B NA
7: 1981 136 A NA
8: 1981 148 A NA
9: 1982 124 A 125.0000
10: 1982 125 A 125.0000
11: 1982 104 A 125.0000
12: 1982 126 B 131.5000
13: 1983 140 B 130.4000
14: 1983 127 B 130.4000
15: 1983 108 A 121.8571
16: 1983 128 A 121.8571
17: 1984 134 A 124.7143
18: 1984 107 A 124.7143
19: 1984 106 A 124.7143
20: 1984 146 B 135.2000
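The min_obs = 2 argument is what encodes the "minimum of two observations" relaxation; a minimal sketch of its effect with toy data:

```r
library(roll)

# With a window of 3, min_obs = 2 returns NA until at least
# two observations are available in the window:
roll_sum(c(10, 20, 30, 40), width = 3, min_obs = 2)
# [1] NA 30 60 90
```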

Related

Euclidean distance for NON-CONSECUTIVE classes of factors iterated by groups

This question is an extension of this question.
Euclidean distance for distinct classes of factors iterated by groups
The same explanations from the previous question apply here as well. I want to calculate the Euclidean distance between consecutive years for each firm based on patent classes according to the following formula:
dist = sqrt( sum_i (X_i - Y_i)^2 ) / ( sqrt( sum_i X_i^2 ) * sqrt( sum_i Y_i^2 ) )
where X_i represents the number of patents belonging to class i in year t, and Y_i represents the number of patents belonging to class i in the previous year (t-1).
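Written as a plain R helper, the distance works out like this (the count vectors below are the worked examples for firm A from the text, aligned over the union of classes of the two years):

```r
# distance between two aligned patent-count vectors x (year t) and y (year t-1),
# counted over the union of classes appearing in either year
el_dist <- function(x, y) {
  sqrt(sum((x - y)^2)) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# year 1981 vs 1980 for firm A, classes (73, 52, 410):
el_dist(c(1, 1, 0), c(0, 0, 1))        # 1.224745
# year 1984 vs 1983 for firm A, classes (250, 252, 73, 52):
el_dist(c(1, 1, 0, 0), c(0, 0, 1, 1))  # 1
```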
The difference here is that I want to add another assumption:
If a year/some years is/are missing in between, I want to implement the assumption that the firm has been active in the same patent classes as the latest non-missing year. For example, in the following dataset:
> set.seed(123)
> df <- data.table(firm = rep(c("A","B"),each=5),
year = c(1979,1981,1981,1984,1984,1959,1960,1963,1963,1965),
patent = sample(3800000:4200000,10,replace = FALSE),
class = c("410","73","52","250","252","105","454","380","380","60")
)
> df
firm year patent class
1: A 1979 3988941 410
2: A 1981 3934057 73
3: A 1981 3924021 52
4: A 1984 3960996 250
5: A 1984 4026317 252
6: B 1959 4165208 105
7: B 1960 3924506 454
8: B 1963 3993626 380
9: B 1963 3845403 380
10: B 1965 3865160 60
Firm A is missing the years 1980, 1982, and 1983. For the year 1980, the assumption is that the firm has been active in the same patent classes as in 1979: a single patent in class 410.
For the years 1982 and 1983, the assumption is that the firm has been active in the same classes as in 1981: one patent in each of classes 73 and 52. The assumption therefore manifests in this way:
> df
firm year patent class
1: A 1979 4108578 410
2: A 1980 4108578 410
3: A 1981 3859133 73
4: A 1981 3983203 52
5: A 1982 3859133 73
6: A 1982 3983203 52
7: A 1983 3859133 73
8: A 1983 3983203 52
9: A 1984 4158992 250
10: A 1984 3945254 252
11: B 1959 4077323 105
12: B 1960 3889708 454
13: B 1961 3889708 454
14: B 1962 3889708 454
15: B 1963 3830537 380
16: B 1963 4025588 380
17: B 1964 3830537 380
18: B 1964 4025588 380
19: B 1965 3944607 60
Here again, for firm A, since 1979 is the beginning year, there is no Euclidean distance for that year (NAs should be produced). Moving forward to year 1980, the distance is zero. For year 1981, the distinct classes for this year (1981) and the previous year (1980) are 73, 52, and 410. Therefore, the above formula is summed over these three distinct classes (there are three distinct 'i's), and its output is sqrt(1 + 1 + 1) / (sqrt(2) * sqrt(1)) = sqrt(3)/sqrt(2) ≈ 1.224745.
To add more clarification, the computation for year 1984 is explained:
In year 1984, firm A has a total of two patents, in classes 250 and 252 (one in each). The immediately previous year is 1983, which was originally missing, but after applying the above assumption it now has one patent in each of classes 73 and 52. Because the distance is only between two consecutive years, only years 1984 and 1983 are considered. Therefore, we have a total of four classes (250, 252, 73, and 52), meaning the summation is done over four 'i's. Beginning with the first i (class 250), the total number of patents in this class is 1 for 1984 and 0 for 1983, so X_i equals 1 and Y_i equals 0. The same logic applies to 252 (X_i = 1, Y_i = 0). For the third 'i', class 73, the total number of patents is 0 in 1984 and 1 in 1983, so X_i equals 0 and Y_i equals 1. The same logic applies to class 52. Therefore the distance is sqrt(1 + 1 + 1 + 1) / (sqrt(2) * sqrt(2)) = 2/2 = 1.
Following the same calculation and reiterating over firms, the final output should be:
> df
firm year patent class El_Dist
1: A 1979 4108578 410 NA
2: A 1980 4108578 410 0.000000
3: A 1981 3859133 73 1.224745
4: A 1981 3983203 52 1.224745
5: A 1982 3859133 73 0.000000
6: A 1982 3983203 52 0.000000
7: A 1983 3859133 73 0.000000
8: A 1983 3983203 52 0.000000
9: A 1984 4158992 250 1.000000
10: A 1984 3945254 252 1.000000
11: B 1959 4077323 105 NA
12: B 1960 3889708 454 1.414214
13: B 1961 3889708 454 0.000000
14: B 1962 3889708 454 0.000000
15: B 1963 3830537 380 1.118034
16: B 1963 4025588 380 1.118034
17: B 1964 3830537 380 0.000000
18: B 1964 4025588 380 0.000000
19: B 1965 3944607 60 1.118034
I'm preferably looking for a data.table solution for speed purposes (my raw data contains about 7 million rows).
Thank you very much in advance for any help.
EDIT: update after some clarification
Here's a mostly (though not completely) vectorized approach with 'implicit expansion':
foo = function(x, y) {
sqrt(sum((x - y)^2)) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}
bar = function(x, y) {
y = unlist(y, use.names = FALSE)
vals = union(x, y)
list(
x = sapply(vals, function(v) sum(x == v)),
y = sapply(vals, function(v) sum(y == v))
)
}
x = df[, .(prev_class = list(class)), by = .(year, firm)]
df[x,
on = .(firm, year > year),
prev_class := i.prev_class]
df[, dist := {
temp = bar(class, prev_class[1L])
foo(temp$x, temp$y)
}, by = .(firm, year)]
df
# firm year patent class prev_class dist
# 1: A 1979 3988941 410 Inf
# 2: A 1981 3934057 73 410 1.224745
# 3: A 1981 3924021 52 410 1.224745
# 4: A 1984 3960996 250 73,52 1.000000
# 5: A 1984 4026317 252 73,52 1.000000
# 6: B 1959 4165208 105 Inf
# 7: B 1960 3924506 454 105 1.414214
# 8: B 1963 3993626 380 454 1.118034
# 9: B 1963 3845403 380 454 1.118034
# 10: B 1965 3865160 60 380,380 1.118034
original answer:
To compute the distance with "implicit expansion" you can use the following approach. However, my results differ from the OP's expected output for firm B in the years 1963 and 1965. It's unclear to me how the OP computed those results.
foo = function(x, y) {
sqrt(sum((x - y)^2)) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}
bar = function(x, y) {
y = unlist(y, use.names = FALSE)
vals = union(x, y)
list(
x = as.integer(vals %in% x),
y = as.integer(vals %in% y)
)
}
x = df[, .(prev_class = list(unique(class))), by = .(year, firm)]
df[x,
on = .(firm, year > year),
prev_class := i.prev_class]
df[, dist := {
temp = bar(class, prev_class)
foo(temp$x, temp$y)
}, by = .(firm, year)]
df
# firm year patent class prev_class dist op_expected
# 1: A 1979 3988941 410 Inf NA
# 2: A 1981 3934057 73 410 1.224745 1.224745
# 3: A 1981 3924021 52 410 1.224745 1.224745
# 4: A 1984 3960996 250 73,52 1.000000 1.000000
# 5: A 1984 4026317 252 73,52 1.000000 1.000000
# 6: B 1959 4165208 105 Inf NA
# 7: B 1960 3924506 454 105 1.414214 1.414214
# 8: B 1963 3993626 380 454 1.414214 1.118034
# 9: B 1963 3845403 380 454 1.414214 1.118034
# 10: B 1965 3865160 60 380 1.414214 1.118034
The expansion can be done as
df1 <- df[, lapply(.SD, list), .(firm, year)][df[,
.(year = min(year):max(year)), firm], on = .(firm, year)]
df1[, grp := cumsum(sapply(patent, Negate(is.null))), .(firm)]
df1[, c('patent', 'class') := list(patent[1], class[1]), .(firm, grp)]
df1[, .(patent = unlist(patent), class = unlist(class)), .(firm, year)]
-output
# firm year patent class
# 1: A 1979 3988941 410
# 2: A 1980 3988941 410
# 3: A 1981 3934057 73
# 4: A 1981 3924021 52
# 5: A 1982 3934057 73
# 6: A 1982 3924021 52
# 7: A 1983 3934057 73
# 8: A 1983 3924021 52
# 9: A 1984 3960996 250
#10: A 1984 4026317 252
#11: B 1959 4165208 105
#12: B 1960 3924506 454
#13: B 1961 3924506 454
#14: B 1962 3924506 454
#15: B 1963 3993626 380
#16: B 1963 3845403 380
#17: B 1964 3993626 380
#18: B 1964 3845403 380
#19: B 1965 3865160 60
Not an answer, but an extension to the answer provided by talat.
Here's what prevents my modification from being the final answer:
I'm sure there's a simpler and more elegant way to do what I've done.
The answer correctly implements the assumption, but does not directly give outputs for the missing years. Sure, the expansion by akrun can be applied first, but maybe there's a way to integrate both steps into one.
Anyways, here's my modification that works according to the formula:
foo = function(x, y) {
sqrt(sum((x - y)^2)) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}
bar = function(x, y) {
x = unlist(x, use.names = FALSE)
y = unlist(y, use.names = FALSE)
vals = c(x, y)
xl = length(unique(x))
yl = length(unique(y))
ul = length(union(x,y))
list(
x = c(table(vals)[names(table(vals)) %in% x],rep(0,(ul-xl))),
y = c(rep(0,(ul-yl)),table(vals)[names(table(vals)) %in% y])
)
}
x1 = df[, .(prev_class = list(class)), by = .(year, firm)]
x2 = df[, .(curr_class = list(class)), by = .(year, firm)]
x1
x2
x = merge(x1,x2)
df[x,
on = .(firm, year > year),
prev_class := i.prev_class]
df[x,
on = .(firm, year),
curr_class := curr_class]
df[, dist := {
temp = bar(unique(curr_class), unique(prev_class))
foo(temp$x, temp$y)
}, by = .(firm, year)]
df
firm year patent class prev_class curr_class dist
1: A 1979 3988941 410 410 Inf
2: A 1981 3934057 73 410 73,52 1.224745
3: A 1981 3924021 52 410 73,52 1.224745
4: A 1984 3960996 250 73,52 250,252 1.000000
5: A 1984 4026317 252 73,52 250,252 1.000000
6: B 1959 4165208 105 105 Inf
7: B 1960 3924506 454 105 454 1.414214
8: B 1963 3993626 380 454 380,380 1.118034
9: B 1963 3845403 380 454 380,380 1.118034
10: B 1965 3865160 60 380,380 60 1.118034

Unique value in row compared to previous rows by group and year in dataframe

I am working with patent data and I would like to find out whether firms have been assigned patents in similar or dissimilar patent classes in the years prior to the year the current patent has been assigned.
As an example: Firm 1010 (see table below) has patented in subcat 67 in year 1984 and I would like to find out whether it has applied for a patent in the same subcat in the X previous years (where X could be 3 or 5, for example). The result should be that for every patent (row), a value of 1 gets assigned if this is the case and 0 if not.
The number of observations per firm (gvkey) and publication year is unbalanced (so not the same number of observations for every firm).
I have fumbled around with dplyr and data.table, but cannot seem to find any solution that comes even close.
gvkey publn_year subcat patent
1: 1010 1980 53 4184663
2: 1010 1980 55 4185564
3: 1010 1980 53 4187814
4: 1010 1981 45 4242866
5: 1010 1981 55 4242966
6: 1010 1981 69 4246928
7: 1010 1982 53 4310145
8: 1010 1982 53 4311298
9: 1010 1982 69 4313458
10: 1010 1983 69 4367764
11: 1010 1983 53 4368927
12: 1010 1983 53 4368928
13: 1010 1984 67 4428585
14: 1010 1984 53 4429855
15: 1010 1984 53 4430983
16: 1012 1987 52 4683010
17: 1013 1980 43 4203066
18: 1013 1981 41 4245879
19: 1013 1982 41 4363941
20: 1013 1983 41 4367907
I've searched here and elsewhere for help but have not found what I'm looking for. I'm sure this is possible and I may be overlooking something very simple.
Thanks for your help.
One possible solution, flagging any repeat over the whole past, is as follows:
df %>%
  group_by(gvkey, subcat) %>%
  mutate(flagged = ifelse(min(publn_year) == publn_year, 0, 1))
Example
Consider the data
> df
gvkey publn_year subcat patent
1 1010 1979 53 44434
2 1010 1980 55 43424
3 1010 1981 53 243423
4 1010 1982 45 234234
Then you get
> df %>% group_by(gvkey, subcat) %>% mutate(flagged = ifelse(min(publn_year) == publn_year, 0, 1))
# A tibble: 4 x 5
# Groups: gvkey, subcat [3]
gvkey publn_year subcat patent flagged
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1010 1979 53 44434 0
2 1010 1980 55 43424 0
3 1010 1981 53 243423 1
4 1010 1982 45 234234 0
Here is one approach using dplyr. First, group_by both gvkey (firm) and subcat (subcategory). Then arrange to sort by year. Then add a 0/1 column indicating whether the difference between a year and the most recent prior patent year is within X years (this example uses 3 years). I also check that the row is not the first within its group, so that it does not get set to 1. Let me know if this is what you had in mind.
library(dplyr)
df %>%
  group_by(gvkey, subcat) %>%
  arrange(gvkey, subcat, publn_year) %>%
  mutate(prior = ifelse(publn_year - lag(publn_year) <= 3 & row_number() != 1, 1, 0))
Output
gvkey publn_year subcat patent prior
<int> <int> <int> <int> <dbl>
1 1010 1981 45 4242866 0
2 1010 1980 53 4184663 0
3 1010 1980 53 4187814 1
4 1010 1982 53 4310145 1
5 1010 1982 53 4311298 1
6 1010 1983 53 4368927 1
7 1010 1983 53 4368928 1
8 1010 1984 53 4429855 1
9 1010 1984 53 4430983 1
10 1010 1980 55 4185564 0
11 1010 1981 55 4242966 1
12 1010 1984 67 4428585 0
13 1010 1981 69 4246928 0
14 1010 1982 69 4313458 1
15 1010 1983 69 4367764 1
16 1012 1987 52 4683010 0
17 1013 1981 41 4245879 0
18 1013 1982 41 4363941 1
19 1013 1983 41 4367907 1
20 1013 1980 43 4203066 0
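For the data.table route mentioned in the question, here is a rough sketch of the same lookback idea (hypothetical toy data, X = 3; like the dplyr answer above, a same-year patent in the same subcat counts as "prior"):

```r
library(data.table)

# hypothetical data in the question's shape
dt <- data.table(
  gvkey      = c(1010, 1010, 1010, 1010, 1012),
  publn_year = c(1980, 1980, 1982, 1987, 1987),
  subcat     = c(53, 53, 53, 53, 52),
  patent     = 1:5
)

setorder(dt, gvkey, subcat, publn_year)
# prior = 1 if the previous patent in the same gvkey/subcat is at most 3 years old;
# NA & FALSE is FALSE in R, so the first row of each group safely becomes 0
dt[, prior := as.integer((publn_year - shift(publn_year) <= 3) & (seq_len(.N) > 1L)),
   by = .(gvkey, subcat)]
dt$prior
# [1] 0 1 1 0 0
```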

Re-code and spread data in columns, based on values in another column

I have a table that looks like this:
Year Tax1 Tax2 Tax3 Tax4
2004 12 123 145 104
2004 145 99 90 56
2005 212 300 240 123
etc...
The Tax# columns give info about the tax paid in years subsequent to the value in the Year column. I would like to re-arrange the table, and rename the columns, so it looks like this:
Year Tax2004 Tax2005 Tax2006 Tax2007 Tax2008
2004 12 123 145 104 NA
2004 145 99 90 56 NA
2005 NA 212 300 240 123
I was thinking of splitting the table into separate tables based on the year column, then renaming the Tax# columns, and joining them back together. But it's a bit convoluted, and I was wondering if there is a simpler way to do this?
Any help much appreciated.
library(dplyr)
library(tidyr)
df <- read.table(text = "
Year Tax1 Tax2 Tax3 Tax4
2004 12 123 145 104
2004 145 99 90 56
2005 212 300 240 123
", header = TRUE)
df %>%
  mutate(id = row_number()) %>%
  gather(rel_year, amount, contains("Tax")) %>%
  mutate(rel_year = as.integer(gsub("Tax", "", rel_year)),
         pay_year = Year + rel_year - 1,
         pay_year = paste0("Tax", pay_year)) %>%
  select(-rel_year) %>%
  spread(pay_year, amount)
Result:
Year id Tax2004 Tax2005 Tax2006 Tax2007 Tax2008
1 2004 1 12 123 145 104 NA
2 2004 2 145 99 90 56 NA
3 2005 3 NA 212 300 240 123
dat1 %>%
  gather(key, value, -Year) %>%
  group_by(key) %>%
  mutate(col = 1:n()) %>%
  ungroup() %>%
  mutate(key = paste0("Tax", 2004:2008)[(Year == 2005) +
           as.numeric(sub("\\D+", "", key))]) %>%
  spread(key, value)
# A tibble: 3 x 7
Year col Tax2004 Tax2005 Tax2006 Tax2007 Tax2008
<int> <int> <int> <int> <int> <int> <int>
1 2004 1 12 123 145 104 NA
2 2004 2 145 99 90 56 NA
3 2005 3 NA 212 300 240 123
Here is an option using data.table
library(data.table)
library(readr)
dcast(melt(setDT(df, keep.rownames = TRUE), id.var = c("rn", "Year"))[,
    newYear := paste0("Tax", Year + parse_number(as.character(variable)) - 1)],
    rn + Year ~ newYear, value.var = 'value')[, rn := NULL][]
# Year Tax2004 Tax2005 Tax2006 Tax2007 Tax2008
#1: 2004 12 123 145 104 NA
#2: 2004 145 99 90 56 NA
#3: 2005 NA 212 300 240 123

Reshape and mean calculation

I have climatic data which have been collected during a whole year along an altitude gradient. Shaped like that:
clim <- read.table(text="alti year month week day meanTemp maxTemp minTemp
350 2011 aug. 31 213 10 14 6
350 2011 aug. 31 214 12 18 6
350 2011 aug. 31 215 10 11 9
550 2011 aug. 31 213 8 10 6
550 2011 aug. 31 214 10 12 8
550 2011 aug. 31 215 8 9 7
350 2011 sep. 31 244 9 10 8
350 2011 sep. 31 245 11 12 10
350 2011 sep. 31 246 10 11 9
550 2011 sep. 31 244 7.5 9 6
550 2011 sep. 31 245 8 10 6
550 2011 sep. 31 246 8.5 9 8", header=TRUE)
and I am trying to reshape these data so as to have only one row per altitude, and to calculate the mean for each month and for the whole year. It would be great if it could be shaped like this:
alti mean_year(meanTemp) mean_year(maxTemp) mean_aug.(meanTemp) mean_aug.(maxTemp) mean_sep.(meanTemp) [...]
350 10.333 12.667 10.667 14.3 10 ...
550 8.333 9.833 8.667 10.333 7.766 ...
Any idea to perform this reshaping & calculation?
You can use data.table and dcast:
library(data.table)
setDT(clim)
merge(
clim[, list("mean_temp_mean_year" = mean(meanTemp), "max_temp_mean_year" = mean(maxTemp)), by = alti]
,
dcast(clim[, list("mean_temp_mean" = mean(meanTemp), "max_temp_mean" = mean(maxTemp)), by = c("alti","month")], alti ~ month, value.var = c("mean_temp_mean","max_temp_mean"))
,
by = "alti")
I've switched the names of some of the variables, and the column order is not perfect, but columns can be reordered/renamed afterwards.
To get the means of the months or years, you can use aggregate followed by reshape.
The two aggregates can be computed separately, and then merge puts them together:
mon <- aggregate(cbind(meanTemp, maxTemp) ~ month + alti, data=clim, FUN=mean)
mon.wide <- reshape(mon, direction='wide', timevar='month', idvar='alti')
yr <- aggregate(cbind(meanTemp, maxTemp) ~ year + alti, data=clim, FUN=mean)
yr.wide <- reshape(yr, direction='wide', timevar='year', idvar='alti')
Each of these .wide sets has the data that you want. The only common column is alti, so we take the merge defaults:
merge(mon.wide, yr.wide)
## alti meanTemp.aug. maxTemp.aug. meanTemp.sep. maxTemp.sep. meanTemp.2011 maxTemp.2011
## 1 350 10.666667 14.33333 10 11.000000 10.333333 12.666667
## 2 550 8.666667 10.33333 8 9.333333 8.333333 9.833333
Here's another variation of data.table solution, but this requires the current devel version, v1.9.5:
require(data.table) # v1.9.5+
setDT(clim)
form = paste("alti", c("year", "month"), sep=" ~ ")
val = c("meanTemp", "maxTemp")
ans = lapply(form, function(x) dcast(clim, x, mean, value.var = val))
Reduce(function(x, y) x[y, on="alti"], ans)
# alti meanTemp_mean_2011 maxTemp_mean_2011 meanTemp_mean_aug. meanTemp_mean_sep. maxTemp_mean_aug. maxTemp_mean_sep.
# 1: 350 10.333333 12.666667 10.666667 10 14.33333 11.000000
# 2: 550 8.333333 9.833333 8.666667 8 10.33333 9.333333

creating index conditioned on value in other column; differences over time

I am struggling with the following problem:
The dataframe below contains the development of a value over time for various ids. What I am trying to get is the increase/decrease of these values relative to the value in the year an event occurred. Several events can occur within one id, and each new event becomes the new baseline year for that id.
To make things clearer, I also add the outcome I want below.
What I have
id value year event
a 100 1950 NA
a 101 1951 NA
a 102 1952 NA
a 103 1953 NA
a 104 1954 NA
a 105 1955 X
a 106 1956 NA
a 107 1957 NA
a 108 1958 NA
a 107 1959 Y
a 106 1960 NA
a 105 1961 NA
a 104.8 1962 NA
a 104.2 1963 NA
b 70 1970 NA
b 75 1971 NA
b 80 1972 NA
b 85 1973 NA
b 90 1974 NA
b 60 1975 Z
b 59 1976 NA
b 58 1977 NA
b 57 1978 NA
b 56 1979 NA
b 55 1980 W
b 54 1981 NA
b 53 1982 NA
b 52 1983 NA
b 51 1984 NA
What I am looking for
id value year event index growth
a 100 1950 NA 0
a 101 1951 NA 0
a 102 1952 NA 0
a 103 1953 NA 0
a 104 1954 NA 0
a 105 1955 X 1 1
a 106 1956 NA 2 1.00952381
a 107 1957 NA 3 1.019047619
a 108 1958 NA 4 1.028571429
a 107 1959 Y 1 1 #new baseline year
a 106 1960 NA 2 0.990654206
a 105 1961 NA 3 0.981308411
a 104.8 1962 NA 4 0.979439252
a 104.2 1963 NA 5 0.973831776
b 70 1970 NA 6
b 75 1971 NA 7
b 80 1972 NA 8
b 85 1973 NA 9
b 90 1974 NA 10
b 60 1975 Z 1 1
b 59 1976 NA 2 0.983333333
b 58 1977 NA 3 0.966666667
b 57 1978 NA 4 0.95
b 56 1979 NA 5 0.933333333
b 55 1980 W 1 1 #new baseline year
b 54 1981 NA 2 0.981818182
b 53 1982 NA 3 0.963636364
b 52 1983 NA 4 0.945454545
b 51 1984 NA 5 0.927272727
What I tried
This and this post were quite helpful, and I managed to create differences between the years; however, I fail to reset the base year (index) when there is a new event. Furthermore, I am doubtful whether my approach is the most efficient/elegant one. Seems a bit clumsy to me...
x <- ddply(x, .(id), transform, year.min=min(year[!is.na(event)])) #identifies first event year
x1 <- ddply(x[x$year>=x$year.min,], .(id), transform, index=seq_along(id)) #creates counter years following first event; prior years are removed
x1 <- x1[order(x1$id, x1$year),] #sort
x1 <- ddply(x1, .(id), transform, growth=100*(value/value[1])) #calculate difference, however, based on first event year; this is wrong.
# then merge the df with the years prior to the first event, which were removed in the beginning
# (interaction() is in base R, so no extra package is needed)
x$id.year <- interaction(x$id, x$year)
x1$id.year <- interaction(x1$id, x1$year)
x$index <- x$growth <- NA
y <- rbind(x[x$year<x$year.min,],x1)
y <- y[order(y$id,y$year),]
Many thanks for any advice.
# Create a tag to indicate the start of each new event by id or
# when id changes
dat$tag <- with(dat, ave(as.character(event), as.character(id),
FUN=function(i) cumsum(!is.na(i))))
# Calculate the growth by id and tag
# this will also produce results for each id before an event has happened
dat$growth <- with(dat, ave(value, tag, id, FUN=function(i) i/i[1] ))
# remove growth prior to an event (this will be when tag equals zero as no
# event have occurred)
dat$growth[dat$tag==0] <- NA
Here is a solution with dplyr (na.locf comes from the zoo package):
library(dplyr)
library(zoo)
ana <- group_by(mydf, id) %>%
  do(na.locf(., na.rm = FALSE)) %>%
  mutate(value = as.numeric(value)) %>%
  group_by(id, event) %>%
  mutate(growth = value/value[1]) %>%
  mutate(index = row_number(event))
ana$growth[is.na(ana$event)] <- 0
id value year event growth index
1 a 100.0 1950 NA 0.0000000 1
2 a 101.0 1951 NA 0.0000000 2
3 a 102.0 1952 NA 0.0000000 3
4 a 103.0 1953 NA 0.0000000 4
5 a 104.0 1954 NA 0.0000000 5
6 a 105.0 1955 X 1.0000000 1
7 a 106.0 1956 X 1.0095238 2
8 a 107.0 1957 X 1.0190476 3
9 a 108.0 1958 X 1.0285714 4
10 a 107.0 1959 Y 1.0000000 1
11 a 106.0 1960 Y 0.9906542 2
12 a 105.0 1961 Y 0.9813084 3
13 a 104.8 1962 Y 0.9794393 4
14 a 104.2 1963 Y 0.9738318 5
15 b 70.0 1970 NA 0.0000000 1
16 b 75.0 1971 NA 0.0000000 2
17 b 80.0 1972 NA 0.0000000 3
18 b 85.0 1973 NA 0.0000000 4
19 b 90.0 1974 NA 0.0000000 5
20 b 60.0 1975 Z 1.0000000 1
21 b 59.0 1976 Z 0.9833333 2
22 b 58.0 1977 Z 0.9666667 3
23 b 57.0 1978 Z 0.9500000 4
24 b 56.0 1979 Z 0.9333333 5
25 b 55.0 1980 W 1.0000000 1
26 b 54.0 1981 W 0.9818182 2
27 b 53.0 1982 W 0.9636364 3
28 b 52.0 1983 W 0.9454545 4
Try:
ddf$index = 0
ddf$growth = 0
baseline = 0
start = FALSE
for (r in 1:nrow(ddf)) {
  if (is.na(ddf$event[r])) {
    if (start) {
      ddf$index[r] = ddf$index[r-1] + 1
      ddf$growth[r] = ddf$value[r] / baseline
    } else {
      ddf$index[r] = 0
    }
  } else {
    start = TRUE
    ddf$index[r] = 1
    ddf$growth[r] = 1
    baseline = ddf$value[r]
  }
}
ddf
id value year event index growth
1 a 100.0 1950 <NA> 0 0.0000000
2 a 101.0 1951 <NA> 0 0.0000000
3 a 102.0 1952 <NA> 0 0.0000000
4 a 103.0 1953 <NA> 0 0.0000000
5 a 104.0 1954 <NA> 0 0.0000000
6 a 105.0 1955 X 1 1.0000000
7 a 106.0 1956 <NA> 2 1.0095238
8 a 107.0 1957 <NA> 3 1.0190476
9 a 108.0 1958 <NA> 4 1.0285714
10 a 107.0 1959 Y 1 1.0000000
11 a 106.0 1960 <NA> 2 0.9906542
12 a 105.0 1961 <NA> 3 0.9813084
13 a 104.8 1962 <NA> 4 0.9794393
14 a 104.2 1963 <NA> 5 0.9738318
15 b 70.0 1970 <NA> 6 0.6542056
16 b 75.0 1971 <NA> 7 0.7009346
17 b 80.0 1972 <NA> 8 0.7476636
18 b 85.0 1973 <NA> 9 0.7943925
19 b 90.0 1974 <NA> 10 0.8411215
20 b 60.0 1975 Z 1 1.0000000
21 b 59.0 1976 <NA> 2 0.9833333
22 b 58.0 1977 <NA> 3 0.9666667
23 b 57.0 1978 <NA> 4 0.9500000
24 b 56.0 1979 <NA> 5 0.9333333
25 b 55.0 1980 W 1 1.0000000
26 b 54.0 1981 <NA> 2 0.9818182
27 b 53.0 1982 <NA> 3 0.9636364
28 b 52.0 1983 <NA> 4 0.9454545
29 b 51.0 1984 <NA> 5 0.9272727
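The loop above can also be expressed as a vectorized data.table sketch (hypothetical toy data; it treats all rows before the first event as index 0 and growth 0, rather than carrying a baseline across ids as the loop output does for id b):

```r
library(data.table)

# hypothetical mini version of the data
ddf <- data.table(
  id    = c("a", "a", "a", "a", "a"),
  value = c(100, 105, 106, 107, 106),
  year  = 1950:1954,
  event = c(NA, "X", NA, NA, "Y")
)

# each event row starts a new group; grp == 0 marks rows before the first event
ddf[, grp := cumsum(!is.na(event)), by = id]
ddf[, `:=`(
  index  = ifelse(grp == 0L, 0L, seq_len(.N)),
  growth = ifelse(grp == 0L, 0, value / value[1])   # value[1] is the baseline of the group
), by = .(id, grp)]
ddf$index
# [1] 0 1 2 3 1
```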
