Why is my grouping function slow when fitting lm models in R?

I have a data frame with 5 columns and about 12,000,000 rows.
lon lat for_R_WDEP_SOX number year
2 -29.95 30.05 128.44461 1 2000
624002 -29.95 30.05 320.17755 1 2001
1248002 -29.95 30.05 192.20628 1 2002
1872002 -29.95 30.05 325.44336 1 2003
2496002 -29.95 30.05 368.46976 1 2004
3120002 -29.95 30.05 409.80154 1 2005
3744002 -29.95 30.05 265.71161 1 2006
4368002 -29.95 30.05 147.92351 1 2007
4992002 -29.95 30.05 279.87851 1 2008
5616002 -29.95 30.05 136.38370 1 2009
6240002 -29.95 30.05 223.43958 1 2010
6864002 -29.95 30.05 132.92253 1 2011
7488002 -29.95 30.05 112.68416 1 2012
8112002 -29.95 30.05 83.81801 1 2013
8736002 -29.95 30.05 80.33523 1 2014
9360002 -29.95 30.05 71.58231 1 2015
9984002 -29.95 30.05 91.07822 1 2016
10608002 -29.95 30.05 98.69281 1 2017
I try to apply a grouping function to it:
gromov_analise_fuction <- function(table)
{
  x <- table$year
  y <- table$for_R_WDEP_SOX
  line <- lm(y ~ x)
  # sign of the slope times (1 - p-value of the slope)
  p_value_coef <- summary(line)$coefficients["x","Estimate"]/abs(summary(line)$coefficients["x","Estimate"])*(1 - summary(line)$coefficients["x","Pr(>|t|)"])
  # slope
  k <- summary(line)$coefficients["x","Estimate"]
  # slope scaled by 1800 and divided by the intercept
  B_K <- summary(line)$coefficients["x","Estimate"]*1800/summary(line)$coefficients["(Intercept)","Estimate"]
  result_vector <- c(p_value_coef, k, B_K)
  return(result_vector)
}
result <- table %>%
  group_by(number) %>%
  do(data.frame(val = gromov_analise_fuction(.)))
It runs for about 30-37 minutes.
Can you please tell me what the reason is?
How can I make this code run faster?
As I understand it, I should remove unused vectors and data frames.

library(data.table)
result2 <- table[,
                 data.frame(val = gromov_analise_fuction(.SD)),
                 by = number]
See how much faster data.table is! (By the way, check Hadley Wickham's GitHub repo; he has been re-writing dplyr verbs with a data.table backend to make these faster.)
Unit: nanoseconds
expr min lq mean median uq max neval cld
dplyr 3717853 4262732 5028474.44 4526830 5648242.0 15919477 100 b
data.table 35 39 316.29 41 603.5 3024 100 a
And here are the outputs from your code with dplyr and the data.table way (with the data you provided in your question; I agree with @Roland, though, when it comes to providing data!):
> result (dplyr)
# A tibble: 3 x 2
# Groups: number [1]
number val
<int> <dbl>
1 1 -0.998
2 1 -13.9
3 1 -0.890
> result2 (data.table)
number val
1: 1 -0.9979470
2: 1 -13.8587730
3: 1 -0.8900289

You can speed up fitting the models significantly, but the slow part is calculating the model summaries. The low-hanging fruit is limiting the calls to summary to one call per model. You might also want to write your own function that calculates only the values of interest, i.e., the p-values for the slopes (see there for the calculation of the standard errors). For the coefficients you can just use the function coef, which is fast.
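For illustration, a rewrite of the question's function that calls summary only once per model might look like this (a sketch; the function and column names simply mirror the question, this is not code from the original answer):
gromov_analise_fuction2 <- function(table) {
  fit <- lm(for_R_WDEP_SOX ~ year, data = table)
  cf  <- summary(fit)$coefficients   # single summary() call
  slope     <- cf["year", "Estimate"]
  intercept <- cf["(Intercept)", "Estimate"]
  p_value_coef <- sign(slope) * (1 - cf["year", "Pr(>|t|)"])
  B_K <- slope * 1800 / intercept
  c(p_value_coef, slope, B_K)
}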
Here is an approach that fits the models much faster (but still uses summary):
library(data.table)
n <- 1e5
set.seed(42)
DT <- data.table(x = 1:10, y = rnorm(n), g = rep(seq_len(n/10), each = 10))
#fit each model separately
system.time({
  res <- DT[, .(pslope = summary(lm(y ~ x, data = .SD))$coefficients["x","Pr(>|t|)"]), by = g]
})
#user system elapsed
#5.89 0.01 6.02
#use the fact that the models have all the same design matrix
system.time({
  DT1 <- dcast(DT, x ~ g, value.var = "y")
  setnames(DT1, make.names(names(DT1)))
  fit <- lm(as.formula(sprintf("cbind(%s) ~ x", paste(names(DT1)[-1], collapse = ","))), data = DT1)
  pslope <- vapply(summary(fit), function(fitsum) fitsum$coefficients["x","Pr(>|t|)"], FUN.VALUE = 1.0)
})
#user system elapsed
#4.34 0.00 4.42
#same result
all.equal(unname(pslope), res[["pslope"]])
#[1] TRUE
#intercepts
coef(fit)[1,]
#slopes
coef(fit)[2,]
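If you also need the sign-adjusted p-value and the B_K ratio from the question, they can be obtained from the joint fit in the same vectorised way (a sketch based on the question's definitions, not part of the original answer):
slopes     <- coef(fit)[2, ]
intercepts <- coef(fit)[1, ]
# the question's derived quantities, one value per group
p_value_coef <- sign(slopes) * (1 - pslope)
B_K <- slopes * 1800 / intercepts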

Related

Euclidean distance for distinct classes of factors iterated by groups

Update: The answer suggested by Rui is great and works as it should. However, when I run it on about 7 million observations (my actual dataset), R gets stuck in a computational block (I'm using a machine with 64 GB of RAM). Any other solutions are greatly appreciated!
I have a dataframe of patents consisting of the firms, application years, patent numbers, and patent classes. I want to calculate the Euclidean distance between consecutive years for each firm based on patent classes according to the following formula:
El_Dist = sqrt( sum_i (X_i - Y_i)^2 / ( sum_i X_i^2 * sum_i Y_i^2 ) )
where X_i represents the number of patents belonging to class i in year t, Y_i represents the number of patents belonging to class i in the previous year (t-1), and i runs over the distinct classes observed in the two years.
To further illustrate this, consider the following dataset:
library(data.table)
df <- data.table(Firm = rep(c(LETTERS[1:2]), each = 6),
                 Year = rep(c(1990, 1990, 1991, 1992, 1992, 1993), 2),
                 Patent_Number = sample(184785:194785, 12, replace = FALSE),
                 Patent_Class = c(12, 5, 31, 12, 31, 6, 15, 15, 15, 3, 3, 1))
> df
Firm Year Patent_Number Patent_Class
1: A 1990 192473 12
2: A 1990 193702 5
3: A 1991 191889 31
4: A 1992 193341 12
5: A 1992 189512 31
6: A 1993 185582 6
7: B 1990 190838 15
8: B 1990 189322 15
9: B 1991 190620 15
10: B 1992 193443 3
11: B 1992 189937 3
12: B 1993 194146 1
Since 1990 is the first year for Firm A, there is no Euclidean distance for that year (NAs should be produced). Moving forward to 1991, the distinct classes for this year (1991) and the previous year (1990) are 31, 5, and 12, so the formula is summed over these three distinct classes (there are three distinct i's). The formula's output will be:
sqrt( ((1-0)^2 + (0-1)^2 + (0-1)^2) / (1^2 * (1^2 + 1^2)) ) = sqrt(3/2) ≈ 1.2247
Following the same calculation and iterating over firms, the final output should be:
> df
Firm Year Patent_Number Patent_Class El_Dist
1: A 1990 192473 12 NA
2: A 1990 193702 5 NA
3: A 1991 191889 31 1.2247450
4: A 1992 193341 12 0.7071068
5: A 1992 189512 31 0.7071068
6: A 1993 185582 6 1.2247450
7: B 1990 190838 15 NA
8: B 1990 189322 15 NA
9: B 1991 190620 15 0.5000000
10: B 1992 193443 3 1.1180340
11: B 1992 189937 3 1.1180340
12: B 1993 194146 1 1.1180340
I'm preferably looking for a data.table solution for speed purposes.
Thank you very much in advance for any help.
I believe that the function below does what the question asks for, but the results for Firm == "B" are not equal to the question's.
fEl_Dist <- function(X){
  Year <- X[["Year"]]
  PatentClass <- X[["Patent_Class"]]
  sapply(seq_along(Year), function(i){
    # rows belonging to the current year and the previous calendar year
    j <- which(Year %in% (Year[i] - 1:0))
    # class counts per year (rows = years, columns = classes)
    tbl <- table(Year[j], PatentClass[j])
    if(NROW(tbl) == 1){
      # no previous year present: no distance
      NA_real_
    } else {
      numer <- sum((tbl[2, ] - tbl[1, ])^2)
      denom <- sum(tbl[2, ]^2) * sum(tbl[1, ]^2)
      sqrt(numer/denom)
    }
  })
}
setDT(df)[, El_Dist := fEl_Dist(.SD),
          by = .(Firm),
          .SDcols = c("Year", "Patent_Class")]
head(df)
# Firm Year Patent_Number Patent_Class El_Dist
#1: A 1990 190948 12 NA
#2: A 1990 186156 5 NA
#3: A 1991 190801 31 1.2247449
#4: A 1992 185226 12 0.7071068
#5: A 1992 185900 31 0.7071068
#6: A 1993 186928 6 1.2247449
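Regarding the update about the 7-million-row data: one way to reduce the work (a sketch under my own assumptions, not from the original answer) is to compute each Firm/Year distance only once instead of once per patent row, and then join the result back. Note that, unlike the function above, this version compares each year with the previous distinct year present for that firm rather than strictly with calendar year t-1, so it only matches when the years within a firm are consecutive.
library(data.table)
# count patents per firm, year and class, then spread classes into columns
cnt <- df[, .N, by = .(Firm, Year, Patent_Class)]
wide <- dcast(cnt, Firm + Year ~ Patent_Class, value.var = "N", fill = 0)
setorder(wide, Firm, Year)
cls <- setdiff(names(wide), c("Firm", "Year"))
# one distance per firm and year, computed on the count matrix
dist_dt <- wide[, {
  X <- as.matrix(.SD)
  numer <- rowSums((X[-1, , drop = FALSE] - X[-.N, , drop = FALSE])^2)
  denom <- rowSums(X[-1, , drop = FALSE]^2) * rowSums(X[-.N, , drop = FALSE]^2)
  .(Year = Year, El_Dist = c(NA_real_, sqrt(numer / denom)))
}, by = Firm, .SDcols = cls]
# join the per-year distances back onto the patent-level rows
df[dist_dt, El_Dist := i.El_Dist, on = .(Firm, Year)]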

multiplying a column from data frame 1 by a value found in data frame 2

I have two separate data frames, and what I am trying to do is, for each year, look up data frame 2 (for the same year) and multiply a column from data frame 1 by the number found there. So for example, imagine my first data frame is:
year <- c(2001,2003,2001,2004,2006,2007,2008,2008,2001,2009,2001)
price <- c(1000,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000)
df <- data.frame(year, price)
year price
1 2001 1000
2 2003 1000
3 2001 1000
4 2004 1000
5 2006 1000
6 2007 1000
7 2008 1000
8 2008 1000
9 2001 1000
10 2009 1000
11 2001 1000
Now, I have a second data frame which includes inflation conversion rates (code from @akrun):
ref_inf <- c(2,3,1,2.2,1.3,1.5,1.9,1.8,1.9,1.9)
ref_year<- seq(2010,2001)
inf_data <- data.frame(ref_year,ref_inf)
inf_data <- inf_data %>%
  mutate(final_inf = cumprod(1 + ref_inf/100))
ref_year ref_inf final_inf
1 2010 2.0 1.020000
2 2009 3.0 1.050600
3 2008 1.0 1.061106
4 2007 2.2 1.084450
5 2006 1.3 1.098548
6 2005 1.5 1.115026
7 2004 1.9 1.136212
8 2003 1.8 1.156664
9 2002 1.9 1.178640
10 2001 1.9 1.201035
What I want to do: for example, the first row of data frame 1 is for the year 2001, so I go and find the conversion for the year 2001 in data frame 2, which is 1.201035, and then multiply the price in data frame 1 by this conversion rate.
So the result should look like this:
year price after_conv
1 2001 1000 1201.035
2 2003 1000 1156.664
3 2001 1000 1201.035
4 2004 1000 1136.212
5 2006 1000 1098.548
6 2007 1000 1084.450
7 2008 1000 1061.106
8 2008 1000 1061.106
9 2001 1000 1201.035
10 2009 1000 1050.600
11 2001 1000 1201.035
Is there any way to do this without using if and else statements?
We can do a join on 'year' with 'ref_year' and create the new column by assigning (:=) the product of 'price' and 'final_inf':
library(data.table)
setDT(df)[inf_data, after_conv := price * final_inf, on = .(year = ref_year)]
-output
df
# year price after_conv
# 1: 2001 1000 1201.035
# 2: 2003 1000 1156.664
# 3: 2001 1000 1201.035
# 4: 2004 1000 1136.212
# 5: 2006 1000 1098.548
# 6: 2007 1000 1084.450
# 7: 2008 1000 1061.106
# 8: 2008 1000 1061.106
# 9: 2001 1000 1201.035
#10: 2009 1000 1050.600
#11: 2001 1000 1201.035
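A minimal base R alternative (my own sketch, not from the original answers) does the same lookup with match(), assuming every year in df appears in inf_data:
df$after_conv <- df$price * inf_data$final_inf[match(df$year, inf_data$ref_year)]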
Since the data is already being processed by dplyr, we can also solve this problem with dplyr. A dplyr-based solution joins the data with the reference data by year and calculates after_conv.
year <- c(2001,2003,2001,2004,2006,2007,2008,2008,2001,2009,2001)
price <- c(1000,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000)
df <- data.frame(year, price)
library(dplyr)
ref_inf <- c(2,3,1,2.2,1.3,1.5,1.9,1.8,1.9,1.9)
ref_year<- seq(2010,2001)
inf_data <- data.frame(ref_year,ref_inf)
inf_data %>%
  mutate(final_inf = cumprod(1 + ref_inf/100)) %>%
  rename(year = ref_year) %>%
  left_join(df, .) %>%
  mutate(after_conv = price * final_inf) %>%
  select(year, price, after_conv)
We use left_join() to keep the data ordered in the original order of df as well as ensure rows in inf_data only contribute to the output if they match at least one row in df. We use . to reference the data already in the pipeline as the right side of the join, merging in final_inf so we can use it in the subsequent mutate() function. We then select() to keep the three result columns we need.
...and the output:
Joining, by = "year"
year price after_conv
1 2001 1000 1201.035
2 2003 1000 1156.664
3 2001 1000 1201.035
4 2004 1000 1136.212
5 2006 1000 1098.548
6 2007 1000 1084.450
7 2008 1000 1061.106
8 2008 1000 1061.106
9 2001 1000 1201.035
10 2009 1000 1050.600
11 2001 1000 1201.035
We can save the result to the original df by writing the result of the pipeline to df.
inf_data %>%
  mutate(final_inf = cumprod(1 + ref_inf/100)) %>%
  rename(year = ref_year) %>%
  left_join(df, .) %>%
  mutate(after_conv = price * final_inf) %>%
  select(year, price, after_conv) -> df

dplyr group_by and iterative loop calculation

I am trying to perform an iterative calculation on grouped data that depends on two previous elements within a group. As a toy example:
library(data.table)
set.seed(100)
df = data.table(ID = c(rep("A_index1", 9)),
                Year = c(2001:2005, 2001:2004),
                Price = c(NA, NA, 10, NA, NA, 15, NA, 13, NA),
                Index = sample(seq(1, 3, by = 0.5), size = 9, replace = TRUE))
R> df
ID Year Price Index
1: A_index1 2001 NA 1.5
2: A_index1 2002 NA 1.5
3: A_index1 2003 10 2.0
4: A_index1 2004 NA 1.0
5: A_index1 2005 NA 2.0
6: A_index1 2001 15 2.0
7: A_index1 2002 NA 3.0
8: A_index1 2003 13 1.5
9: A_index1 2004 NA 2.0
The objective is to fill the missing prices using the last available price and an index to adjust. I have a loop that performs these calculations, which I am trying to vectorize using dplyr.
My logic is defined in the below loop:
df$Price_adj = df$Price
for (i in 2:nrow(df)) {
  if (is.na(df$Price[i])) {
    df$Price_adj[i] = round(df$Price_adj[i-1] * df$Index[i] / df$Index[i-1], 2)
  }
}
R> df
ID Year Price Index Price_adj
1: A_index1 2001 NA 1.5 NA
2: A_index1 2002 NA 1.5 NA
3: A_index1 2003 10 2.0 10.00
4: A_index1 2004 NA 1.0 5.00
5: A_index1 2005 NA 2.0 10.00
6: A_index1 2001 15 2.0 15.00
7: A_index1 2002 NA 3.0 22.50
8: A_index1 2003 13 1.5 13.00
9: A_index1 2004 NA 2.0 17.33
In my actual, much larger data I will have to apply this function to multiple groups, and speed is a consideration. My attempt at this is below; I need help pointing me in the right direction. I did consider Reduce, but I am not sure how it can incorporate two previous elements within the group.
foo = function(Price, Index){
  # Note: this loops over the global df instead of the Price/Index arguments
  # and returns nothing, which is why it does not work as intended.
  for (i in 2:nrow(df)) {
    if (is.na(df$Price[i])) {
      df$Price_adj[i] = df$Price_adj[i-1] * df$Index[i] / df$Index[i-1]
    }
  }
}
df %>%
  group_by(ID) %>%
  mutate(Price_adj = Price,
         Price_adj = foo(Price, Index))
One option with cumprod:
df %>%
  # group the data frame into chunks, each starting from a non-NA price
  group_by(ID, g = cumsum(!is.na(Price))) %>%
  # for each chunk, multiply the first non-NA price by the cumprod of Index[i]/Index[i-1]
  mutate(Price_adj = round(first(Price) * cumprod(Index / lag(Index, default = first(Index))), 2)) %>%
  ungroup() %>%
  select(-g)
# A tibble: 9 x 5
# ID Year Price Index Price_adj
# <fctr> <int> <dbl> <dbl> <dbl>
#1 A_index1 2001 NA 1.5 NA
#2 A_index1 2002 NA 1.5 NA
#3 A_index1 2003 10 2.0 10.00
#4 A_index1 2004 NA 1.0 5.00
#5 A_index1 2005 NA 2.0 10.00
#6 A_index1 2001 15 2.0 15.00
#7 A_index1 2002 NA 3.0 22.50
#8 A_index1 2003 13 1.5 13.00
#9 A_index1 2004 NA 2.0 17.33
Group the data frame by ID and cumsum(!is.na(Price)); the latter splits the data frame into chunks, and each chunk starts with a non-NA Price;
first(Price) * cumprod(Index / lag(Index, default = first(Index))) does the iterative calculation, which is equivalent to the formula given in the question if you substitute Price_adj[i-1] with Price_adj[i-2] and so on, until you reach Price_adj[1], i.e. first(Price);
Caveat: may not be very efficient if you have many NA chunks.
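For what it's worth, the same chunking idea can also be written with data.table (my own sketch, not part of the original answer), which may help if you have many groups:
library(data.table)
# g marks chunks that each start at a non-NA Price, as in the dplyr version
setDT(df)[, g := cumsum(!is.na(Price))][
  , Price_adj := round(first(Price) * cumprod(Index / shift(Index, fill = first(Index))), 2),
  by = .(ID, g)][, g := NULL]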
If the speed is the primary concern, you could write your function using Rcpp package:
library(Rcpp)
cppFunction("
NumericVector price_adj(NumericVector price, NumericVector index) {
  int n = price.size();
  NumericVector adjusted_price(n);
  adjusted_price[0] = price[0];
  for (int i = 1; i < n; i++) {
    if (NumericVector::is_na(price[i])) {
      adjusted_price[i] = adjusted_price[i-1] * index[i] / index[i-1];
    } else {
      adjusted_price[i] = price[i];
    }
  }
  return adjusted_price;
}")
Now use the cpp function with dplyr as follows:
cpp_fun <- function() df %>% group_by(ID) %>% mutate(Price_adj = round(price_adj(Price, Index), 2))
cpp_fun()
# A tibble: 9 x 5
# Groups: ID [1]
# ID Year Price Index Price_adj
# <fctr> <int> <dbl> <dbl> <dbl>
#1 A_index1 2001 NA 1.5 NA
#2 A_index1 2002 NA 1.5 NA
#3 A_index1 2003 10 2.0 10.00
#4 A_index1 2004 NA 1.0 5.00
#5 A_index1 2005 NA 2.0 10.00
#6 A_index1 2001 15 2.0 15.00
#7 A_index1 2002 NA 3.0 22.50
#8 A_index1 2003 13 1.5 13.00
#9 A_index1 2004 NA 2.0 17.33
Benchmark:
Define r_fun as:
r_fun <- function() df %>% group_by(ID, g = cumsum(!is.na(Price))) %>% mutate(Price_adj = round(first(Price) * cumprod(Index / lag(Index, default=first(Index))), 2)) %>% ungroup() %>% select(-g)
On the small sample data, there's already a difference:
microbenchmark::microbenchmark(r_fun(), cpp_fun())
#Unit: milliseconds
# expr min lq mean median uq max neval
# r_fun() 10.127839 10.500281 12.627831 11.148093 12.686662 101.466975 100
# cpp_fun() 3.191278 3.308758 3.738809 3.491495 3.937006 6.627019 100
Testing on a slightly larger data frame:
df <- bind_rows(rep(list(df), 10000))
#dim(df)
#[1] 90000 4
microbenchmark::microbenchmark(r_fun(), cpp_fun(), times = 10)
#Unit: milliseconds
# expr min lq mean median uq max neval
# r_fun() 842.706134 890.978575 904.70863 908.77042 921.89828 986.44576 10
# cpp_fun() 8.722794 8.888667 10.67781 10.86399 12.10647 13.68302 10
Identity test:
identical(ungroup(r_fun()), ungroup(cpp_fun()))
# [1] TRUE
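For completeness, the Reduce idea mentioned in the question could be written roughly like this (a sketch of my own, not part of the original answers); iterating over row indices lets the accumulator see both Index[i] and Index[i-1]:
price_adj_reduce <- function(Price, Index) {
  # carry the last adjusted price forward; recompute it only where Price is NA
  Reduce(function(prev, i) {
    if (is.na(Price[i])) prev * Index[i] / Index[i - 1] else Price[i]
  }, x = seq_along(Price)[-1], init = Price[1], accumulate = TRUE)
}
df %>% group_by(ID) %>% mutate(Price_adj = round(price_adj_reduce(Price, Index), 2))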

replace NA in a dplyr chain

Question has been edited from the original.
After reading this interesting discussion I was wondering how to replace NAs in a column using dplyr in, for example, the Lahman batting data:
Source: local data frame [96,600 x 3]
Groups: teamID
yearID teamID G_batting
1 2004 SFN 11
2 2006 CHN 43
3 2007 CHA 2
4 2008 BOS 5
5 2009 SEA 3
6 2010 SEA 4
7 2012 NYA NA
The following does not work as I expected
library(dplyr)
library(Lahman)
df <- Batting[ c("yearID", "teamID", "G_batting") ]
df <- group_by(df, teamID )
df$G_batting[is.na(df$G_batting)] <- mean(df$G_batting, na.rm = TRUE)
Source: local data frame [20 x 3]
Groups: yearID, teamID
yearID teamID G_batting
1 2004 SFN 11.00000
2 2006 CHN 43.00000
3 2007 CHA 2.00000
4 2008 BOS 5.00000
5 2009 SEA 3.00000
6 2010 SEA 4.00000
7 2012 NYA 49.07894
> mean(Batting$G_batting, na.rm = TRUE)
[1] 49.07894
In fact it imputed the overall mean and not the group mean. How would you do this in a dplyr chain? Using transform from base R also does not work, as it imputed the overall mean and not the group mean; also, this approach converts the data to a regular data frame. Is there a better way to do this?
df %.%
  group_by( yearID ) %.%
  transform(G_batting = ifelse(is.na(G_batting),
                               mean(G_batting, na.rm = TRUE),
                               G_batting)
  )
Edit: Replacing transform with mutate gives the following error
Error in mutate_impl(.data, named_dots(...), environment()) :
INTEGER() can only be applied to a 'integer', not a 'double'
Edit: Adding as.integer seems to resolve the error and does produce the expected result. See also @eddi's answer.
df %.%
  group_by( teamID ) %.%
  mutate(G_batting = ifelse(is.na(G_batting), as.integer(mean(G_batting, na.rm = TRUE)), G_batting))
Source: local data frame [96,600 x 3]
Groups: teamID
yearID teamID G_batting
1 2004 SFN 11
2 2006 CHN 43
3 2007 CHA 2
4 2008 BOS 5
5 2009 SEA 3
6 2010 SEA 4
7 2012 NYA 47
> mean_NYA <- mean(filter(df, teamID == "NYA")$G_batting, na.rm = TRUE)
> as.integer(mean_NYA)
[1] 47
Edit: Following up on @Romain's comment, I installed dplyr from GitHub:
> head(df,10)
yearID teamID G_batting
1 2004 SFN 11
2 2006 CHN 43
3 2007 CHA 2
4 2008 BOS 5
5 2009 SEA 3
6 2010 SEA 4
7 2012 NYA NA
8 1954 ML1 122
9 1955 ML1 153
10 1956 ML1 153
> df %.%
+ group_by(teamID) %.%
+ mutate(G_batting = ifelse(is.na(G_batting), mean(G_batting, na.rm = TRUE), G_batting))
Source: local data frame [96,600 x 3]
Groups: teamID
yearID teamID G_batting
1 2004 SFN 0
2 2006 CHN 0
3 2007 CHA 0
4 2008 BOS 0
5 2009 SEA 0
6 2010 SEA 1074266112
7 2012 NYA 90693125
8 1954 ML1 122
9 1955 ML1 153
10 1956 ML1 153
.. ... ... ...
So I didn't get the error (good) but I got a (seemingly) strange result.
The main issue you're having is that mean returns a double while the G_batting column is an integer. So wrapping the mean in as.integer would work, or you'd need to convert the entire column to numeric, I guess.
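For example, converting the column to numeric first lets you store the fractional group mean without as.integer; a rough dplyr sketch under that assumption (not code from the original answer):
df %>%
  group_by(teamID) %>%
  mutate(G_batting = as.numeric(G_batting),   # avoid the integer/double clash
         G_batting = ifelse(is.na(G_batting), mean(G_batting, na.rm = TRUE), G_batting))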
That said, here are a couple of data.table alternatives - I didn't check which one is faster.
library(data.table)
# using ifelse
dt = data.table(a = 1:2, b = c(1,2,NA,NA,3,4,5,6,7,8))
dt[, b := ifelse(is.na(b), mean(b, na.rm = T), b), by = a]
# using a temporary column
dt = data.table(a = 1:2, b = c(1,2,NA,NA,3,4,5,6,7,8))
dt[, b.mean := mean(b, na.rm = T), by = a][is.na(b), b := b.mean][, b.mean := NULL]
And this is what I'd want to do ideally (there is an FR about this):
# again, atm this is pure fantasy and will not work
dt[, b[is.na(b)] := mean(b, na.rm = T), by = a]
The dplyr version of the ifelse is (as in OP):
dt %>% group_by(a) %>% mutate(b = ifelse(is.na(b), mean(b, na.rm = T), b))
I'm not sure how to implement the second data.table idea in a single line in dplyr. I'm also not sure how you can stop dplyr from scrambling/ordering the data (aside from creating an index column).
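One possible way to get the effect of the second idea in dplyr (a sketch of my own: compute the group mean once and patch only the NA positions; it does not address the row-ordering concern):
dt %>% group_by(a) %>% mutate(b = replace(b, is.na(b), mean(b, na.rm = TRUE)))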

calculate differences in a dataframe

I have a dataframe that looks like this:
set.seed(50)
data.frame(distance = c(rep("long", 5), rep("short", 5)),
           year = rep(2002:2006),
           mean.length = rnorm(10))
distance year mean.length
1 long 2002 0.54966989
2 long 2003 -0.84160374
3 long 2004 0.03299794
4 long 2005 0.52414971
5 long 2006 -1.72760411
6 short 2002 -0.27786453
7 short 2003 0.36082844
8 short 2004 -0.59091244
9 short 2005 0.97559055
10 short 2006 -1.44574995
I need to calculate the difference in mean.length between long and short in each year. What's the fastest way of doing this?
Here's one way using plyr:
set.seed(50)
df <- data.frame(distance = c(rep("long", 5), rep("short", 5)),
                 year = rep(2002:2006),
                 mean.length = rnorm(10))
library(plyr)
aggregation.fn <- function(df) {
  data.frame(year = df$year[1],
             diff = (df$mean.length[df$distance == "long"] -
                     df$mean.length[df$distance == "short"]))}
new.df <- ddply(df, "year", aggregation.fn)
Gives you
> new.df
year diff
1 2002 0.8275344
2 2003 -1.2024322
3 2004 0.6239104
4 2005 -0.4514408
5 2006 -0.2818542
A second way
df <- df[order(df$year, df$distance), ]
n <- dim(df)[1]
df$new.year <- c(1, df$year[2:n] != df$year[1:(n-1)])
df$diff <- c(-diff(df$mean.length), NA)
df$diff[!df$new.year] <- NA
new.df.2 <- df[!is.na(df$diff), c("year", "diff")]
all(new.df.2 == new.df) # True
Use tapply() and apply() like this:
apply(
with(x, tapply(mean.length, list(year, distance), FUN=mean)),
1,
diff
)
2002 2003 2004 2005 2006
-0.8275344 1.2024322 -0.6239104 0.4514408 0.2818542
This works because tapply creates a tabular summary by year and distance:
with(x, tapply(mean.length, list(year, distance), FUN=mean))
long short
2002 0.54966989 -0.2778645
2003 -0.84160374 0.3608284
2004 0.03299794 -0.5909124
2005 0.52414971 0.9755906
2006 -1.72760411 -1.4457499
Since you seem to have paired values and the data.frame is ordered, you can do this:
res <- with(DF, mean.length[distance=="long"]-mean.length[distance=="short"])
names(res) <- unique(DF$year)
# 2002 2003 2004 2005 2006
#0.8275344 -1.2024322 0.6239104 -0.4514408 -0.2818542
This should be quite fast, but not as safe as the other answers as it relies on the assumptions.
You've received some good answers for computing the specific question at hand. It may make sense for you to consider reshaping your data into a wide format. Here are two options:
reshape(df, direction = "wide", idvar = "year", timevar = "distance")
#---
year mean.length.long mean.length.short
1 2002 0.54966989 -0.2778645
2 2003 -0.84160374 0.3608284
3 2004 0.03299794 -0.5909124
4 2005 0.52414971 0.9755906
5 2006 -1.72760411 -1.4457499
#package reshape2 is probably easier to use.
library(reshape2)
dcast(year ~ distance, data = df)
#---
year long short
1 2002 0.54966989 -0.2778645
2 2003 -0.84160374 0.3608284
3 2004 0.03299794 -0.5909124
4 2005 0.52414971 0.9755906
5 2006 -1.72760411 -1.4457499
You can easily compute your new statistics now.
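For example (a sketch of my own, not from the original answer), the long-minus-short difference falls out of the wide form directly:
wide <- reshape(df, direction = "wide", idvar = "year", timevar = "distance")
wide$diff <- wide$mean.length.long - wide$mean.length.short
wide[, c("year", "diff")]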
