R Function For Loop Data Frame

I apologize if this is a duplicate or a bit confusing - I've searched all around SO but can't seem to find what I'm trying to accomplish. I haven't used functions/loops extensively, especially writing them from scratch, so I'm not sure if the error is from the function (likely) or from the construction of the data. The basic flow is as follows:
Dummy data set - grouping, type, rate, years, months
I'm running lm formula on the data set by grouping with this bit:
coef_models <- test_coef %>% group_by(Grouping) %>% do(model = lm(rate ~ years + months, data = .))
The result of the above gives me intercepts and coefficients for the variables -
what I'm trying to accomplish next (and failing at) is: for all the estimates that come back negative, drop that component out of the equation and rerun the lm with just the positive coefficient. So for example, for a grouping of states, if the years coefficient is negative, I would want to run lm(rate ~ months, data = .) with only months in the formula.
To get there, with dplyr/broom, I'm taking the results and putting them into a data frame:
#removed lines with negative coefficients
library(dplyr)
library(broom)
coef_output_test <- as.data.frame(coef_models %>% tidy(model))
coef_output_test$Grouping <- as.character(coef_output_test$Grouping)
#drop these coefficients and rerun
coef_output_test_rerun <- coef_output_test[!(coef_output_test$estimate >= 0),]
From here, I'm trying to rerun the lm for the groupings with issues, without the negative variable from the initial run. Because the dropped variable will vary - in some instances years will drop out, in others months - I need to pass the correct column through. I think this is where I'm getting hung up:
lm_test_rerun_out <- data.frame(grouping = character()
                                , '(intercept)' = double()
                                , term = character()
                                , estimate = double()
                                , stringsAsFactors = FALSE)
lm_test_rerun <- function(r) {
  y = coef_output_test_rerun$Grouping
  x = coef_output_test_rerun$term
  for (i in 2:nrow(coef_output_test_rerun)){
    lm_test_rerun_out <- test_coef %>% group_by(Grouping["y"]) %>% do(model = lm(rate ~ x, data = .))
  }
}
lm_test_rerun(coef_output_test_rerun)
I get this error:
variable lengths differ (found for 'x')
The output for function should be something like this dummy output:
Grouping, Term, (intercept), Estimate
Sports, Years, 0.56, 0.0430
States, Months, 0.67, 0.340
I'm surely not fluent in R, and I'm sure the parts above that do work could be done more efficiently, but the output of the function should be the grouping and x variable used, along with the intercept and estimate for each. Ultimately I'll be taking that output and appending back to the original 'coef_models' - but I can't get past this part for now.
EDIT: sample test_coef set
Grouping Drilldown Years Months Rate
Sports Basketball 10 23 0.42
Sports Soccer 13 18 0.75
Sports Football 9 5 0.83
Sports Golf 13 17 0.59
States CA 13 20 0.85
States TX 14 9 0.43
States AK 14 10 0.63
States AR 10 5 0.60
States ID 18 2 0.22
Countries US 8 19 0.89
Countries CA 9 19 0.86
Countries UK 2 15 0.64
Countries MX 21 15 0.19
Countries AR 8 11 0.62

Consider a base R solution with by, which slices a data frame by one or more factors so that any method can be run on each grouped subset. (As an aside, the error you see arises because, in lm(rate ~ x, data = .), x is not a column of the grouped data, so R picks up the character vector x from the function environment, whose length differs from the number of rows in each group.) Specifically, the code below conditionally re-runs the lm model by checking the coefficient matrix and ultimately returns a data frame with the needed values:
Data
txt <- ' Grouping Drilldown Years Months Rate
Sports Basketball 10 23 0.42
Sports Soccer 13 18 0.75
Sports Football 9 5 0.83
Sports Golf 13 17 0.59
States CA 13 20 0.85
States TX 14 9 0.43
States AK 14 10 0.63
States AR 10 5 0.60
States ID 18 2 0.22
Countries US 8 19 0.89
Countries CA 9 19 0.86
Countries UK 2 15 0.64
Countries MX 21 15 0.19
Countries AR 8 11 0.62'
test_coef <- read.table(text=txt, header=TRUE)
Code
df_list <- by(test_coef, test_coef$Grouping, function(df){
  # FIRST MODEL
  res <- summary(lm(Rate ~ Years + Months, data = df))$coefficients
  # CONDITIONALLY DEFINE FORMULA
  f <- NULL
  if ((res["Years",1]) < 0 & (res["Months",1]) > 0) f <- Rate ~ Months
  if ((res["Years",1]) > 0 & (res["Months",1]) < 0) f <- Rate ~ Years
  # CONDITIONALLY RERUN MODEL
  if (!is.null(f)) res <- summary(lm(f, data = df))$coefficients
  # ITERATE THROUGH LENGTH OF res MATRIX SKIPPING FIRST ROW
  tmp_list <- lapply(seq(length(res[-1,1])), function(i)
    data.frame(Group = as.character(df$Grouping[[1]]),
               Term = row.names(res)[i+1],
               Intercept = res[1,1],
               Estimate = res[i+1,1])
  )
  # RETURN DATAFRAME OF 1 OR MORE ROWS
  return(do.call(rbind, tmp_list))
})
final_df <- do.call(rbind, unname(df_list))
final_df
# Group Term Intercept Estimate
# 1 Countries Months -0.0512500 0.04375000
# 2 Sports Years 0.6894118 -0.00372549
# 3 States Months 0.2754176 0.02941113
Do note: removing the negative coefficient from the first model and re-running the new model can render the other component negative when it was previously positive.
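If you need this to generalize (more predictors, or repeatedly dropping terms until everything left is non-negative), a rough sketch along the same lines, building the formula dynamically with reformulate(), might look like this (the fit_positive name and its defaults are mine, not part of the original answer):
# Sketch: refit per group, dropping negative non-intercept terms and stopping
# once all remaining estimates are non-negative or nothing would be left to keep
fit_positive <- function(df, response = "Rate", terms = c("Years", "Months")) {
  repeat {
    res <- summary(lm(reformulate(terms, response), data = df))$coefficients
    neg <- rownames(res)[-1][res[-1, 1] < 0]   # negative non-intercept terms
    keep <- setdiff(terms, neg)
    if (length(neg) == 0 || length(keep) == 0) break
    terms <- keep
  }
  res
}
df_list2 <- by(test_coef, test_coef$Grouping, fit_positive)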

Related

Writing a function to summarize the results of dunn.test::dunn.test

In R, I perform Dunn's test. The function I use has no option to group the input variables by their statistically significant differences. However, this is what I am genuinely interested in, so I tried to write my own function. Unfortunately, I am not able to wrap my head around it. Perhaps someone can help.
I use the airquality dataset that comes with R as an example. The result that I need could look somewhat like this:
> library (tidyverse)
> ozone_summary <- airquality %>% group_by(Month) %>% dplyr::summarize(Mean = mean(Ozone, na.rm=TRUE))
# A tibble: 5 x 2
Month Mean
<int> <dbl>
1 5 23.6
2 6 29.4
3 7 59.1
4 8 60.0
5 9 31.4
When I run the dunn.test, I get the following:
> dunn.test::dunn.test (airquality$Ozone, airquality$Month, method = "bh", altp = T)
Kruskal-Wallis rank sum test
data: x and group
Kruskal-Wallis chi-squared = 29.2666, df = 4, p-value = 0
Comparison of x by group
(Benjamini-Hochberg)
Col Mean-|
Row Mean | 5 6 7 8
---------+--------------------------------------------
6 | -0.925158
| 0.4436
|
7 | -4.419470 -2.244208
| 0.0001* 0.0496*
|
8 | -4.132813 -2.038635 0.286657
| 0.0002* 0.0691 0.8604
|
9 | -1.321202 0.002538 3.217199 2.922827
| 0.2663 0.9980 0.0043* 0.0087*
alpha = 0.05
Reject Ho if p <= alpha
From this result, I deduce that May differs from July and August, June differs from July (but not from August) and so on. So I'd like to append significantly differing groups to my results table:
# A tibble: 5 x 3
Month Mean Group
<int> <dbl> <chr>
1 5 23.6 a
2 6 29.4 ac
3 7 59.1 b
4 8 60.0 bc
5 9 31.4 a
While I did this by hand, I suppose it must be possible to automate this process. However, I don't find a good starting point. I created a dataframe containing all comparisons:
> ozone_differences <- dunn.test::dunn.test (airquality$Ozone, airquality$Month, method = "bh", altp = T)
> ozone_differences <- data.frame ("P" = ozone_differences$altP.adjusted, "Compare" = ozone_differences$comparisons)
P Compare
1 4.436043e-01 5 - 6
2 9.894296e-05 5 - 7
3 4.963804e-02 6 - 7
4 1.791748e-04 5 - 8
5 6.914403e-02 6 - 8
6 8.604164e-01 7 - 8
7 2.663342e-01 5 - 9
8 9.979745e-01 6 - 9
9 4.314957e-03 7 - 9
10 8.671708e-03 8 - 9
I thought that a function iterating through this data frame and using a selection variable to choose the right letter from letters() might work. However, I cannot even think of a starting point, because a changing number of rows has to be considered at the same time...
Perhaps someone has a good idea?
Perhaps you could look into the cldList() function from the rcompanion library; you can pipe the res results from the output of dunnTest() and create a table that specifies the compact letter display comparison per group.
Following the advice of @TylerRuddenfort, the following code will work. The first cld is created with rcompanion::cldList, and the second directly uses multcompView::multcompLetters. Note that to use multcompLetters, the spaces have to be removed from the names of the comparisons.
Here, I have used FSA::dunnTest for the Dunn (1964) test.
In general, I recommend ordering groups by e.g. median or mean before running e.g. dunnTest if you plan on using a cld, so that the cld comes out in a sensible order (see the sketch after the code below).
library (tidyverse)
ozone_summary <- airquality %>% group_by(Month) %>% dplyr::summarize(Mean = mean(Ozone, na.rm=TRUE))
library(FSA)
Result = dunnTest(airquality$Ozone, airquality$Month, method = "bh")$res
### Use cldList()
library(rcompanion)
cldList(P.adj ~ Comparison, data=Result)
### Use multcompView
library(multcompView)
X = Result$P.adj <= 0.05
names(X) = gsub(" ", "", Result$Comparison)
multcompLetters(X)
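To implement the ordering advice above, a rough sketch (the reorder() step is my own suggestion based on the airquality example, not part of the original answer):
library(FSA)
library(rcompanion)
aq <- airquality
aq$Month <- factor(aq$Month)
# Order Month levels by median Ozone so the cld comes out in a sensible order
aq$Month <- reorder(aq$Month, aq$Ozone, FUN = function(x) median(x, na.rm = TRUE))
Result <- dunnTest(Ozone ~ Month, data = aq, method = "bh")$res
cldList(P.adj ~ Comparison, data = Result)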

Optimization function across multiple factors

I am trying to identify the appropriate thresholds for two activities which generate the greatest success rate.
Listed below is an example of what I am trying to accomplish. For each location I am trying to identify the thresholds to use for activities 1 & 2, so that if either criterion is met then we would guess 'yes' (1). I then need to make sure that we are guessing 'yes' for only a certain percentage of the total volume for each location, and that we are maximizing our accuracy (our guess of 'yes' matches an 'outcome' of 1).
location <- c(1,2,3)
testFile <- data.frame(location = rep.int(location, 20),
activity1 = round(rnorm(20, mean = 10, sd = 3)),
activity2 = round(rnorm(20, mean = 20, sd = 3)),
outcome = rbinom(20,1,0.5)
)
set.seed(145)
act_1_thresholds <- seq(7,12,1)
act_2_thresholds <- seq(19,24,1)
I was able to accomplish this by creating a table that contains all of the possible unique combinations of thresholds for activities 1 & 2, and then merging it with each observation within the sample data set. However, with ~200 locations in the actual data set, each with thousands of observations, I quickly ran out of space.
I would like to create a function that takes the location id, set of possible thresholds for activity 1, and also for activity 2, and then calculates how often we would have guessed yes (i.e. the values in 'activity1' or 'activity2' exceed their respective thresholds we're testing) to ensure our application rate stays within our desired range (50% - 75%). Then for each set of thresholds which produce an application rate within our desired range we would want to store only the set of which maximizes accuracy, along with their respective location id, application rate, and accuracy rate. The desired output is listed below.
location act_1_thresh act_2_thresh application_rate accuracy_rate
1 1 13 19 0.52 0.45
2 2 11 24 0.57 0.53
3 3 14 21 0.67 0.42
I had tried writing this into a for loop, but was not able to navigate my way through the number of nested arguments I would have to make in order to account for all of these conditions. I would appreciate assistance from anyone who has attempted a similar problem. Thank you!
An example of how to calculate the application and accuracy rate for a single set of thresholds is listed below.
### Create yard IDs
location <- c(1,2,3)
### Create a single set of thresholds
single_act_1_threshold <- 12
single_act_2_threshold <- 20
### Calculate the simulated application, and success rate of thresholds mentioned above using historical data
as.data.table(testFile)[,
list(
application_rate = round(sum(ifelse(single_act_1_threshold <= activity1 | single_act_2_threshold <= activity2, 1, 0))/
nrow(testFile),2),
accuracy_rate = round(sum(ifelse((single_act_1_threshold <= activity1 | single_act_2_threshold <= activity2) & (outcome == 1), 1, 0))/
sum(ifelse(single_act_1_threshold <= activity1 | single_act_2_threshold <= activity2, 1, 0)),2)
),
by = location]
Consider expand.grid, which builds a data frame of all combinations between both sets of thresholds. Then use Map to iterate elementwise over both columns of that data frame to build a list of data tables (each of which now includes columns recording the threshold pair used).
act_1_thresholds <- seq(7,12,1)
act_2_thresholds <- seq(19,24,1)
# ALL COMBINATIONS
thresholds_df <- expand.grid(th1=act_1_thresholds, th2=act_2_thresholds)
# USER-DEFINED FUNCTION
calc <- function(th1, th2)
as.data.table(testFile)[, list(
act_1_thresholds = th1, # NEW COLUMN
act_2_thresholds = th2, # NEW COLUMN
application_rate = round(sum(ifelse(th1 <= activity1 | th2 <= activity2, 1, 0)) /
nrow(testFile),2),
accuracy_rate = round(sum(ifelse((th1 <= activity1 | th2 <= activity2) & (outcome == 1), 1, 0)) /
sum(ifelse(th1 <= activity1 | th2 <= activity2, 1, 0)),2)
), by = location]
# LIST OF DATA TABLES
dt_list <- Map(calc, thresholds_df$th1, thresholds_df$th2)
# NAME ELEMENTS OF LIST
names(dt_list) <- paste(thresholds_df$th1, thresholds_df$th2, sep="_")
# SAME RESULT AS POSTED EXAMPLE
dt_list$`12_20`
# location act_1_thresholds act_2_thresholds application_rate accuracy_rate
# 1: 1 12 20 0.23 0.5
# 2: 2 12 20 0.23 0.5
# 3: 3 12 20 0.23 0.5
And if you need to append all elements use data.table's rbindlist:
final_dt <- rbindlist(dt_list)
final_dt
# location act_1_thresholds act_2_thresholds application_rate accuracy_rate
# 1: 1 7 19 0.32 0.47
# 2: 2 7 19 0.32 0.47
# 3: 3 7 19 0.32 0.47
# 4: 1 8 19 0.32 0.47
# 5: 2 8 19 0.32 0.47
# ---
# 104: 2 11 24 0.20 0.42
# 105: 3 11 24 0.20 0.42
# 106: 1 12 24 0.15 0.56
# 107: 2 12 24 0.15 0.56
# 108: 3 12 24 0.15 0.56
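If you then want the single best threshold pair per location (the 50% - 75% application-rate window comes from the question), a small follow-up sketch on final_dt could be:
# Sketch: keep threshold pairs whose application rate is within the desired
# window, then pick the most accurate pair per location
best_dt <- final_dt[application_rate >= 0.50 & application_rate <= 0.75]
best_dt <- best_dt[, .SD[which.max(accuracy_rate)], by = location]
best_dt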

Check if a variable is time invariant in R

I tried to search for an answer to my question, but I could only find the right answer for Stata (I am using R).
I am using a national survey to study which variables influence the investment in complementary pension (it is voluntary in my country).
The survey is conducted every two years and some individuals are interviewed more than once. Using the filter command, I filtered the df in order to keep only the individuals who appear more than once. This is an example from the original survey, already filtered:
year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 1
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1
2008 4 1972 F 33000 0
2010 4 1972 F 35000 0
where id is the individual, y.b is year of birth, pens is a dummy which takes value 1 if the individual invests in a complementary pension form.
I wanted to run a FE regression so I load the plm package and then I set the df like this:
df.p <- plm.data(df, c("id", "year"))
After this command, I expected that constant variables were deleted but after running this regression:
pan1 <- plm (pens ~ woman + age + I(age^2) + high + medium + north + centre, model="within", effect = "individual", data=dd.p, na.action = na.omit)
(where woman is a variable which takes value 1 if the individual is a woman, high, medium refer to education level and north, centre to geographical regions) and after the command summary(pan1) the variable woman is still present.
At this point I think that there are some mistakes in the survey (for example, sex was not entered correctly and so it wasn't the same for the same id), so I tried to find a way to check whether, for each id, sex is constant.
I tried this code but I am sure it is not correct:
df$x <- ifelse(df$id==df$id & df$sex==df$sex,1,0)
the basic idea should be like this:
df$x <- ifelse(df$id=="1" & df$sex=="F",1,0)
but I can't do it manually since the df has up to 40k observations.
If you know another way to check whether a variable is constant in R, I would be glad to hear it.
Thank you in advance
I think what you are trying to do is calculate the number of unique values of sex for each id. You are hoping it is 1, but any cases of 2 indicate a transcription error. The way to do this in R is
any(by(df$sex,df$id,function(x) length(unique(x))) > 1)
To break that down, the function length(unique(x)) tells you the number of different unique values in a vector. It's similar to levels for a factor (but not identical, since a factor can have levels not present).
The function by calculates the given function on each subset of df$sex according to df$id. In other words, it calculates length(unique(df$sex)) where df$id is 1, then 2, etc.
Lastly, any(... > 1) checks if any of the results are more than one. If they are, the result will be TRUE (and you can use which instead of any to find which ones). If everything is okay, the result will be FALSE.
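For example, to list the affected ids rather than just get a TRUE/FALSE, a small sketch (tapply here is equivalent to the by call above):
n_sex <- tapply(df$sex, df$id, function(x) length(unique(x)))
names(which(n_sex > 1))   # ids recorded with more than one sex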
We can try with dplyr
Example data:
df=data.frame(year=c(2002,2002,2004,2004,2006,2008,2008,2010),
id=c(1,2,1,2,3,3,4,4),
sex=c("F","M","M","M","M","M","F","F"))
Id 1 is both F and M
library(dplyr)
df%>%group_by(id)%>%summarise(sexes=length(unique(sex)))
# A tibble: 4 x 2
id sexes
<dbl> <int>
1 1 2
2 2 1
3 3 1
4 4 1
We can then filter:
df%>%group_by(id)%>%summarise(sexes=length(unique(sex)))%>%filter(sexes==2)
# A tibble: 1 x 2
id sexes
<dbl> <int>
1 1 2
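To inspect the offending rows themselves, one more small sketch (the semi_join step is my own suggestion, not part of the original answer):
library(dplyr)
bad_ids <- df %>% group_by(id) %>% summarise(sexes = length(unique(sex))) %>% filter(sexes > 1)
df %>% semi_join(bad_ids, by = "id")   # all survey rows for the inconsistent ids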

I am confused by the R implementation of lag in regression analysis

Look at this linear regression: Y ~ X + lag(X,1). The meaning is very clear: it is trying to do a linear regression, and lag(X,1) means the first lag of X. What confuses me is the R implementation of the lag function. In R, lag(X, 1) moves X to the prior time, for example
>library(zoo)
>
>str(zoo(x))
‘zoo’ series from 1 to 4
Data: num [1:4] 11 12 13 14
Index:int [1:4] 1 2 3 4
>lag(zoo(x))
1 2 3
12 13 14
When you regress, which value exactly does R use at time 2? I guess R uses the data like this:
time 1 2 3 4
Y anything
X 11 12 13 14
lagX 12 13 14
But this is nonsense! We are supposed to use the first lag of X and the current X at time 2 (or any specific time), that is 11 and 12, not 12 and 13 as above! The first lag of X should be the prior X, shouldn't it? I am so confused! Please explain, thanks a lot.
The question starts out with:
look at this linear regression: Y ~ X + lag(X,1), the meaning is very clear
that it is trying to do a linear regression, and the lag(X,1) means the first
lag of X
Actually that is not the case. It does not refer to this model:
Y[i] = a + b * X[i] + c * X[i-1] + error[i]
It actually refers to this model:
Y[i] = a + b * X[i] + c * X[i+1] + error[i]
which is not likely what you intended.
It is likely that you wanted lag(X, -1) rather than lag(X, 1). Lagging a series in R means that the lagged series starts earlier which implies that the series itself moves forward.
The other item to be careful of is that lm does not align series. It knows nothing about the time index. You will need to align the series yourself or use a package which does it for you.
More on these points below.
ts
First let us consider lag.ts from the core of R since lag.zoo and lag.zooreg are based on it and consistent with it. lag.ts lags the times of the series so that the lagged series starts earlier. That is if we have a series whose values are 11, 12, 13 and 14 at times 1, 2, 3 and 4 respectively lag.ts lags each time so that the lagged series has the same values 11, 12, 13 and 14 but at the times 0, 1, 2, 3. The original series started at 1 but the lagged series starts at 0. Originally the value 12 was at time 2 but in the lagged series the value 13 is at time 2. In code, we have:
tt <- ts(11:14)
cbind(tt, lag(tt), lag(tt, 1), lag(tt, -1))
gives:
Time Series:
Start = 0
End = 5
Frequency = 1
tt lag(tt) lag(tt, 1) lag(tt, -1)
0 NA 11 11 NA
1 11 12 12 NA
2 12 13 13 11
3 13 14 14 12
4 14 NA NA 13
5 NA NA NA 14
zoo
lag.zoo is consistent with lag.ts. Note that since zoo represents irregularly spaced series it cannot assume that time 0 comes before time 1. We could only make such an assumption if we knew the series were regularly spaced. Thus if time 1 is the earliest time in a series, the value at this time is dropped since there is no way to determine what earlier time to lag it to. The new lagged series now starts at the second time value in the original series. This is similar to the lag.ts example except that in lag.ts there was a time 0 and in this example there is no such time. Similarly we cannot extend the time scale forward in time either.
library(zoo)
z <- zoo(11:14)
merge(z, lag(z), lag(z, 1), lag(z,-1))
giving:
z lag(z) lag(z, 1) lag(z, -1)
1 11 12 12 NA
2 12 13 13 11
3 13 14 14 12
4 14 NA NA 13
zooreg
The zoo package does have a zooreg class which assumes regularly spaced series except for some missing values and it can deduce what comes before just as ts can. With zooreg it can deduce that time 0 comes before and time 5 comes after.
library(zoo)
zr <- zooreg(11:14)
merge(zr, lag(zr), lag(zr, 1), lag(zr,-1))
giving:
zr lag(zr) lag(zr, 1) lag(zr, -1)
0 NA 11 11 NA
1 11 12 12 NA
2 12 13 13 11
3 13 14 14 12
4 14 NA NA 13
5 NA NA NA 14
lm
lm does not know anything about zoo and will ignore the time index entirely. If you want to not ignore it, i.e. you want to align the series involved prior to running the regression, use the dyn (or dynlm) package. Using the former:
library(dyn)
set.seed(123)
zr <- zooreg(rnorm(10))
y <- 1 + 2 * zr + 3 * lag(zr, -1)
dyn$lm(y ~ zr + lag(zr, -1))
giving:
Call:
lm(formula = dyn(y ~ zr + lag(zr, -1)))
Coefficients:
(Intercept) zr lag(zr, -1)
1 2 3
Note 1: Be sure to read the documentation in the help files: ?lag.ts , ?lag.zoo , ?lag.zooreg and help(package = dyn)
Note 2: If the direction of the lag seems confusing you could define your own function and use that in place of lag. For example, this gives the same coefficients as the lm output shown above:
Lag <- function(x, k = 1) lag(x, -k)
dyn$lm(y ~ zr + Lag(zr))
An additional word of warning: unlike lag.zoo and lag.zooreg, which are consistent with the core of R, lag.xts from the xts package is inconsistent. The lag in dplyr is also inconsistent (and, to make things worse, if you load dplyr it will mask lag in R with its own inconsistent version). Also note that L in dynlm works the same as Lag but wisely uses a different name to avoid confusion.
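A tiny sketch of that masking issue (loading dplyr here purely to demonstrate the clash; the numbers match the example series used above):
library(zoo)
library(dplyr)        # dplyr::lag now masks stats::lag
z <- zoo(11:14)
stats::lag(z, 1)      # zoo convention: 12 13 14 at times 1 2 3 (series moves earlier)
dplyr::lag(11:14, 1)  # dplyr convention: NA 11 12 13 (values pushed later)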
Please, consult the manual first:
Description
Compute a lagged version of a time series, shifting the time base back by a given number of observations.
Default S3 method:
lag(x, k = 1, ...)
Arguments
x A vector or matrix or univariate or multivariate time series
k The number of lags (in units of observations).
So, lag does not return a lagged value. It returns the entire lagged time series, shifted back by some k. This is not something a simple lm can work with, and indeed not what you want to use. This, however, does work for me:
library(zoo)
x <- zoo(c(11, 12, 13, 14))
y <- c(1, 2.3, 3.8, 4.2)
lagged <- lag(x, -1)
lagged <- c(lagged, c=0) # first lag is defined as zero
model <- lm(y ~ x + lagged)
summary(model)
Returns:
Call:
lm(formula = y ~ x + lagged)
Residuals:
1 2 3 4
-8.327e-17 -1.833e-01 3.667e-01 -1.833e-01
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.86333 4.20149 -2.110 0.282
x 0.89667 0.38456 2.332 0.258
lagged 0.05333 0.08199 0.650 0.633
Residual standard error: 0.4491 on 1 degrees of freedom
Multiple R-squared: 0.9687, Adjusted R-squared: 0.9062
F-statistic: 15.49 on 2 and 1 DF, p-value: 0.1769

Integrating Data

I have a large data frame as follows which is a subset of a larger data frame.
tree=data.frame(INVYR=tree$INVYR,
DIA=tree$DIA,PLOT=tree$PLOT,SPCD=tree$SPCD,
D.2=tree$D.2, BA.T=tree$BA.T)
What I am attempting to do is calculate the total BA.T per Plot per Year (plots are remeasured in subsequent years). I do this by ...
x<-aggregate(tree$BA.T,list(tree$INVYR,tree$PLOT),FUN=sum)
x$PLOT<-x$Group.2
x<- x[with(x, order(Group.1,Group.2)), ]
This gives me the data frame...
x=data.frame(Group.1,Group.2,x,PLOT)
Where Group.1 is the INVYR, Group.2 is the PLOT, and x is the total BA.T per plot per year. So far this works great. Here is where my problem begins. I then want to integrate this back into my original tree data.frame. If I merge the data by plot it doesn't account for year and quadruples the data set because of the four remeasurements. I can't run an if statement because the data sets are not of equal length. The data.frame I wish to end up with is
tree=data.frame(INVYR, DIA, PLOT, SPCD, D.2, BA.T, x)
where x is the total BA.T for the given INVYR and PLOT of that record.
Any thoughts would be greatly appreciated. Thanks.
Edit
INVYR=rbind(1982,1982,1982,1982,1982,1995,1995,1995,1995,1995,2000,2000,2000,2000,2000)
PLOT=rbind(1,1,2,2,3,1,1,2,2,3,1,1,2,2,3)
BA.T=rbind(.1,.2,.3,.4,.2,.3,.5,.8,.3,.6,.7,.2,.1,1,1.02)
tree=data.frame(INVYR,PLOT,BA.T)
head(tree)
x<-aggregate(tree$BA.T,list(tree$INVYR,tree$PLOT),FUN=sum)
x$PLOT<-x$Group.2
x$INVYR<-x$Group.1
x<- x[with(x, order(Group.1,Group.2)), ]
head(x)
One solution is to use the reshape2 package.
library(reshape2)
tree.m <- melt(data=tree, id.vars=c('INVYR','PLOT')) ## Notice the choice of the id vars - the keys!
dcast(tree.m, formula=...~variable, fun.aggregate=sum)
INVYR PLOT BA.T
1 1982 1 0.30
2 1982 2 0.70
3 1982 3 0.20
4 1995 1 0.80
5 1995 2 1.10
6 1995 3 0.60
7 2000 1 0.90
8 2000 2 1.10
9 2000 3 1.02
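An alternative sketch using base merge() on both keys, so the totals join back to tree without duplicating rows (column names as in the question; the BA.T.total name is mine):
# Sketch: aggregate per year and plot, then merge back on both keys
x <- aggregate(BA.T ~ INVYR + PLOT, data = tree, FUN = sum)
names(x)[names(x) == "BA.T"] <- "BA.T.total"
tree2 <- merge(tree, x, by = c("INVYR", "PLOT"))
head(tree2)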
