Dataframe operation within and between dataframes - r

How can I perform operations within and between dataframes in R?
For example, here is a dataframe on stock returns.
stocks <- data.frame(
time=as.Date('2009-01-01') + 0:9,
X=rnorm(10, 0, 1),
Y=rnorm(10, 0, 2),
Z=rnorm(10, 0, 4)
)
time X Y Z
1 2009-01-01 -0.31758501 -1.2718424 -2.9979292
2 2009-01-02 -1.06440187 0.4202969 -5.7925412
3 2009-01-03 0.26475736 -2.3955779 -2.2638179
4 2009-01-04 -0.83653746 0.4161053 -10.1011995
5 2009-01-05 -0.12214392 0.7143456 3.6851497
6 2009-01-06 -0.01186287 -2.1322029 -0.1577852
7 2009-01-07 0.27729415 0.1323237 -4.4237673
8 2009-01-08 -1.74389562 0.4962045 0.4192498
9 2009-01-09 0.83150240 -0.9241747 -1.6752324
10 2009-01-10 -0.52863956 0.1044531 -1.2083588
Q1) I'd like to create a dataframe containing the previous day's values.
For example, the final result I want could be expressed as lag(stocks, 1).
What is the simplest and most elegant way to achieve this?
Is there a simple way to do it with dplyr?
Q2) How can I apply basic arithmetic operations to this dataframe?
For example, I'd like to create dataframes such as:
stocks1 = stocks + 1
stocks2 = stocks * 3
stocks3 = stocks2 / stocks1 (an operation between two dataframes)
stocks4 = stocks3 / lag(stocks1)
What would be the simplest and most elegant way?

To address the first problem, this might be of help to you. You don't necessarily need dplyr here; the head() function is sufficient if all you wish to do is lag the variables.
stocks <- data.frame(
time=as.Date('2009-01-01') + 0:9,
X=rnorm(10, 0, 1),
Y=rnorm(10, 0, 2),
Z=rnorm(10, 0, 4)
)
previous <- head(stocks, 9)
df <- data.frame(stocks$time[2:10], stocks$X[2:10], stocks$Y[2:10], stocks$Z[2:10],
                 previous$X, previous$Y, previous$Z)
col_headings <- c("time", "X", "Y", "Z", "previousX", "previousY", "previousZ")
names(df) <- col_headings
Here, the dates from 2nd January to 10th January are displayed, with the lags for X, Y, and Z also being included in the data frame.
> df
time X Y Z previousX previousY
1 2009-01-02 0.7878110 -2.1394047 0.68775794 -0.0759606 1.2863089
2 2009-01-03 -0.2767296 -2.3453356 -1.56313888 0.7878110 -2.1394047
3 2009-01-04 -0.2122021 0.1589629 -1.13926020 -0.2767296 -2.3453356
4 2009-01-05 0.1195826 3.2320352 -0.32020803 -0.2122021 0.1589629
5 2009-01-06 0.7642622 -0.7621168 1.66614679 0.1195826 3.2320352
6 2009-01-07 -0.3073972 -2.9475654 5.63945611 0.7642622 -0.7621168
7 2009-01-08 0.3597369 0.5011861 5.95424269 -0.3073972 -2.9475654
8 2009-01-09 -1.8701881 0.4417496 1.34273218 0.3597369 0.5011861
9 2009-01-10 -1.1172033 -0.5566736 0.05432339 -1.8701881 0.4417496
previousZ
1 3.2188050
2 0.6877579
3 -1.5631389
4 -1.1392602
5 -0.3202080
6 1.6661468
7 5.6394561
8 5.9542427
9 1.3427322
As regards calculations, it depends on what you are trying to do.
For example, do you want to add 1 to each value in Z?
> df$Z+1
[1] 1.6877579 -0.5631389 -0.1392602 0.6797920 2.6661468 6.6394561
[7] 6.9542427 2.3427322 1.0543234
You could divide two stock returns by each other as you've specified as well. Note that we have combined them in the one dataframe, so we are not necessarily conducting an "operation between two dataframes" per se.
> df$Y/df$Z
[1] -3.11069421 1.50040132 -0.13953168 -10.09354826 -0.45741275
[6] -0.52266839 0.08417294 0.32899307 -10.24740160
By specifying the dataframe (in this case, df), along with the associated variable (as indicated after the $ symbol), then you should be able to carry out a wide range of calculations across the dataframe.
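Since the question specifically asks about dplyr: the whole lagging step can also be written with mutate() and across(). This is only a sketch, assuming dplyr >= 1.0 (for across()); the previousX-style column names are just illustrative choices.

```r
library(dplyr)

stocks <- data.frame(
  time = as.Date("2009-01-01") + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)

# Q1: lag every return column by one day, keeping the originals
lagged <- stocks %>%
  mutate(across(c(X, Y, Z), lag, .names = "previous{.col}"))

# Q2: arithmetic on the numeric columns only (time stays untouched)
stocks1 <- stocks %>% mutate(across(c(X, Y, Z), ~ .x + 1))
stocks2 <- stocks %>% mutate(across(c(X, Y, Z), ~ .x * 3))

# operations *between* dataframes work column-wise in base R
# once the non-numeric time column is excluded:
stocks3 <- stocks1
stocks3[-1] <- stocks2[-1] / stocks1[-1]
```

The same pattern extends to stocks4 by dividing by a lagged copy of stocks1.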

Using a for loop to create time series by groups

I have a for loop that I would like to run by group. It should run through a set of data, create a time series for most rows, and then output a forecast for that row of data (based on that time point and the ones preceding it) within the group. The issue I am having is running that loop for every 'group' within my data. I want to avoid doing so manually, as that would take hours and surely there is a better way.
Allow me to explain in more detail.
I have a large dataset (1.6M rows), each row has a year, country A, country B, and a number of measures which concern the relationship between the two.
So far, I have been successful in extracting a single (country A, country B) relationship into a new table and using a for loop to output the necessary forecast data to a new variable in the dataset. I'd like to have that for loop run over every (country A, country B) grouping with more than 3 entries.
The data:
Here I will replicate a small slice of the data, and will include a missing value for realism.
set.seed(2000)
df <- data.frame(year = rep(c(1946:1970),length.out=50),
ccode1 = rep(c("2"), length.out = 50),
ccode2 = rep(c("20","31"), each=25),
kappavv = rnorm(50,mean = 0, sd=0.25),
output = NA)
df$kappavv[12] <- NA
What I've done:
NOTE: I start forecasting from the third data point of each group but based on all time points preceding the forecast.
for(i in 3:nrow(df)){
dat_ts <- ts(df[, 4], start = c(min(df$year), 1), end = c(df$year[i], 1), frequency = 1)
dat_ts_corr <- na_interpolation(dat_ts)
trialseries <- holt(dat_ts_corr, h=1)
df$output[i] <- trialseries$mean
}
This part works and outputs what I want when I apply it to a single pairing of ccode1 and ccode2, arranged correctly in ascending order of years.
What isn't working:
I am having some serious problems getting my head around applying this for loop by grouping of ccode2. Some of my data is uneven: sometimes groups are different sizes, having different start/end points, and there are missing data.
I have tried expressing the loop as a function, using group_by() and piping, and using various types of apply() functions.
Your help is appreciated. Thanks in advance. I am glad to answer any clarifying questions you have.
You can put the for loop code in a function.
library(dplyr)
library(purrr)
apply_func <- function(df) {
for(i in 3:nrow(df)){
dat_ts <- ts(df[, 4], start = c(min(df$year), 1),
end = c(df$year[i], 1), frequency = 1)
dat_ts_corr <- imputeTS::na_interpolation(dat_ts)
trialseries <- forecast::holt(dat_ts_corr, h=1)
df$output[i] <- trialseries$mean
}
return(df)
}
Split the data by ccode2 and apply apply_func.
df %>% group_split(ccode2) %>% map_df(apply_func)
# year ccode1 ccode2 kappavv output
# <int> <chr> <chr> <dbl> <dbl>
# 1 1946 2 20 -0.213 NA
# 2 1947 2 20 -0.0882 NA
# 3 1948 2 20 0.223 0.286
# 4 1949 2 20 0.435 0.413
# 5 1950 2 20 0.229 0.538
# 6 1951 2 20 -0.294 0.477
# 7 1952 2 20 -0.485 -0.675
# 8 1953 2 20 0.524 0.405
# 9 1954 2 20 0.0564 0.0418
#10 1955 2 20 0.294 0.161
# … with 40 more rows
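If you prefer base R over purrr, the same split / apply / recombine step can be sketched as below. To keep the sketch runnable without the forecast and imputeTS packages, a cumulative mean stands in for holt(); swap the real apply_func back in for actual forecasting.

```r
set.seed(2000)
df <- data.frame(year = rep(1946:1970, length.out = 50),
                 ccode1 = "2",
                 ccode2 = rep(c("20", "31"), each = 25),
                 kappavv = rnorm(50, mean = 0, sd = 0.25),
                 output = NA_real_)
df$kappavv[12] <- NA  # missing value for realism, as in the question

apply_func <- function(d) {
  for (i in 3:nrow(d)) {
    # stand-in for holt(): mean of all points up to and including i
    d$output[i] <- mean(d$kappavv[1:i], na.rm = TRUE)
  }
  d
}

# split by group, apply, and recombine - base R's version of
# group_split() %>% map_df()
result <- do.call(rbind, lapply(split(df, df$ccode2), apply_func))
rownames(result) <- NULL
```

Each group keeps its first two rows as NA, since forecasting starts at the third data point.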

Removing Values in DataFrame Based On Condition R

I have a data frame of 4 columns (magnified for this example). Most columns have outliers which are significantly larger than the other values in the data frame. For example: A column (with a maximum value of 99), has outliers with 96, 97, 98, 99. These outliers signify essentially "no response". This obviously heavily skews the data, thus they must be removed. I want to remove the outliers, but each variable has a different maximum value (and different set of outliers) and some have decimals.
96, 97, 98, 99 must be removed ONLY from the columns that have those as reserve values. So the function must know which columns have each specific classification of reserve values. More below.
The issue is that I do not want to remove the reserve values from all columns, because the same number may mean something else in another column: 996 could be a reserve value in one column but a legitimate value of significance in another, such as hourly wage/week.
It gets tricky as some columns have decimals, like hours worked/week. For example, 37.5 hours worked per week would have reserve values of 999.6, 999.7, 999.8, and 999.9. This length would be classified as 5.1 (5 characters total, 1 decimal).
I need to remove these reserve values from the data frame, but they must first match the corresponding reserve value length. Since each column has a different reserve value, the column names of the data frame should correspond to a specific reserve value.
df <- data.frame("children#" = c(1,5,0,2,10),
"annual income" = c(700000.00,50000.65,30000.45,1000000.59,9999999.96),
"hour wage"= c(25.65,9999999.99,50.23,1000.72,65.16),
"hours worked/week" = c(148.5,77.0,64.2,25.9,999.7))
Max length of children# is 2
Max length of annual income is 10.2 (10 total, 2 decimal)
Max length of hour wage is 10.2
Max length of hours worked/week is 5.1 (5 total, 1 decimal)
ALWAYS WILL BE 4 RESERVE VALUES
If max length = 2, remove reserve values: 96,97,98,99
If max length = 3, remove reserve values: 996, 997, 998, 999... and so forth with solid numbers
With decimals:
If max length = 5.1, remove reserve values: 999.6, 999.7, 999.8, 999.9.
If max length = 10.2, remove reserve values: 9999999.96, 9999999.97, 9999999.98, 9999999.99
Thus, I would like to figure out how to make a function that will
find max lengths
connect the corresponding max lengths with the correct reserve values
remove reserve values from data frame based on max lengths of each column
So far I have the max lengths of each column with the decimal points.
I just need some help with connecting it to the reserve values and getting those reserve values removed from the data frame.
If more info is required please comment as I will elaborate further if needed.
Code sample: For the reserve values I was thinking of creating a separate data frame and using that to remove the values. Other suggestions are welcome.
Find.Max.Length <- function(data){
  # Check the max length of each column
  tmp <- data.frame(lapply(data, function(x) max(nchar(x, keepNA = FALSE))))
  tmp <- data.frame(t(tmp))
  return(tmp)
}
max.length <- Find.Max.Length(df)
Check.Decimal.Places <- function(x){
if((x %% 1) != 0){
nchar(strsplit(sub('0+$', '',as.character(x)), ".", fixed = TRUE)[[1]][[2]])
}else{
return(0)}
}
decimal <- data.frame(Check.Decimal.Places(df$random)) # used to initialize the variable before the loop
for(i in seq_along(df)){
decimal[i] <- data.frame(Check.Decimal.Places(df[[i]]))}
decimal<- data.frame(t(decimal))
rownames(decimal) <- names(df)
length.df <- cbind(max.length, decimal)
names(length.df) <- c("Max Length", "Decimal Place")
length.df$NewVariableLength <- paste(length.df$`Max Length`,
                                     length.df$`Decimal Place`, sep = ".")
NOTE: Row names of length.df data frame match original data frame names. That can possibly be a way to link the two together?
There is probably a faster way to do this all, all suggestions are welcome.
edit: Now I understand what you mean with "reserve values" - answers from a survey that should not be counted (e.g. "I don't want to answer this question")
You essentially have three easy methods here, without having to compute "integer lengths" or other overengineering:
Max values (i.e., remove the four highest values),
Manual thresholds (i.e., remove all values above X),
If-else logic (i.e., if answer == X, remove it).
Building the dataset
Your data did not correspond to your specification ("always 4 reserve values"), so I took the liberty of extending it.
df <- data.frame(
"children" = c(1, 0, 96, 2, 10, 99, 98, 99),
"annual_income" = c(700000.00, 50000.65, 30000.45, 1000000.59, 9999999.96, 9999999.97, 9999999.98, 9999999.99),
"hour_wage"= c(25.65, 9999999.99, 50.23, 9999999.98, 9999999.99, 9999999.98, 1000.72, 65.16),
"hours_worked_week" = c(148.5, 999.6, 77.0, 64.2, 999.9, 999.8, 25.9, 999.7)
)
df
children annual_income hour_wage hours_worked_week
1 1 700000.00 25.65 148.5
2 0 50000.65 9999999.99 999.6
3 96 30000.45 50.23 77.0
4 2 1000000.59 9999999.98 64.2
5 10 9999999.96 9999999.99 999.9
6 99 9999999.97 9999999.98 999.8
7 98 9999999.98 1000.72 25.9
8 99 9999999.99 65.16 999.7
1. Maximum-Values-Approach (obsolete after clarification)
Load libraries
library(dplyr)
library(magrittr)
Get the four outliers
children_out <- tail(sort(df$children), 4)
Replace outliers with NA
df[df$children %in% children_out, ] %<>% mutate(children = NA)
Check dataset
df
children annual_income hour_wage hours_worked_week
1 1 700000.00 25.65 148.5
2 0 50000.65 9999999.99 999.6
3 NA 30000.45 50.23 77.0
4 2 1000000.59 9999999.98 64.2
5 10 9999999.96 9999999.99 999.9
6 NA 9999999.97 9999999.98 999.8
7 NA 9999999.98 1000.72 25.9
8 NA 9999999.99 65.16 999.7
Caveat: This approach will work only if you always have four outliers for each column.
2. Manual thresholds
Load libraries
library(dplyr)
library(magrittr)
Exclude existing NA and replace anything that is 96 or above with NA
df[!is.na(df$children) & df$children >=96, ] %<>%
mutate(children = NA)
Check dataset
df
children annual_income hour_wage hours_worked_week
1 1 700000.00 25.65 148.5
2 0 50000.65 9999999.99 999.6
3 NA 30000.45 50.23 77.0
4 2 1000000.59 9999999.98 64.2
5 10 9999999.96 9999999.99 999.9
6 NA 9999999.97 9999999.98 999.8
7 NA 9999999.98 1000.72 25.9
8 NA 9999999.99 65.16 999.7
3. If-else logic
Load libraries
library(dplyr)
library(magrittr)
Save "reserved answers"
children_res <- c(96, 97, 98, 99)
Replace anything that is a reserved answer with NA (excluding existing NA is not needed here)
df[df$children %in% children_res, ] %<>%
mutate(children = NA)
Check dataset
df
children annual_income hour_wage hours_worked_week
1 1 700000.00 25.65 148.5
2 0 50000.65 9999999.99 999.6
3 NA 30000.45 50.23 77.0
4 2 1000000.59 9999999.98 64.2
5 10 9999999.96 9999999.99 999.9
6 NA 9999999.97 9999999.98 999.8
7 NA 9999999.98 1000.72 25.9
8 NA 9999999.99 65.16 999.7
4. edit: Combined approach 1&3
Load libraries
library(dplyr)
library(magrittr)
Get "reserved answers"
children_res <- tail(sort(unique(df$children)), 4)
Replace anything that is a reserved answer with NA (excluding existing NA is not needed here)
df[df$children %in% children_res, ] %<>%
mutate(children = NA)
Caveat: This approach will work only if you always have ALL reserved answers (e.g., 96, 97, 98, and 99) present in each column. This will NOT WORK if, by accident, nobody would answer "97".
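Coming back to the length-based approach the question originally asked for: the four reserve codes can be derived from a column's total width and decimal places, then blanked per column. This is only a sketch under the question's stated assumption that the reserve values are always the four largest codes of the 9...6 to 9...9 form; the function names here are my own.

```r
# Build the four reserve codes for a given "length" spec, e.g.
# (2, 0)  -> 96 97 98 99
# (5, 1)  -> 999.6 999.7 999.8 999.9
# (10, 2) -> 9999999.96 ... 9999999.99
reserve_values <- function(total_chars, dec_digits) {
  int_digits <- total_chars - if (dec_digits > 0) dec_digits + 1 else 0
  max_code <- as.numeric(paste0(
    strrep("9", int_digits),
    if (dec_digits > 0) paste0(".", strrep("9", dec_digits)) else ""
  ))
  step <- 10^(-dec_digits)
  max_code - (3:0) * step
}

# Replace a column's reserve codes with NA; the tolerance guards
# against floating-point representation of the decimal codes
remove_reserves <- function(col, total_chars, dec_digits) {
  res <- reserve_values(total_chars, dec_digits)
  hit <- vapply(col, function(v) !is.na(v) && any(abs(v - res) < 1e-6),
                logical(1))
  col[hit] <- NA
  col
}
```

With the length.df table from the question, you could loop over its rows and call remove_reserves() on each matching column of the data frame.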

Loop that matches row to column names and computes an average of the 3 preceding columns

I'm trying to make some computations in R. I have a dataset where the columns are id, startdate, and then one column for every day from 2014 till 2017.
Every id has a different start date. Accompanying every date are concentrations of a chemical, specific to each id.
A sample from my data looks like this:
id time 20140101 20140102 20140103 20140104 20140105 20140106 20140107
1 1 20141119 2.6 2.5 4.1 4.8 3.1 1.8 3.5
2 4 20150403 1.7 1.6 2.8 3.4 2.0 1.2 1.9
3 7 20140104 2.2 2.2 3.7 4.4 2.6 1.3 2.9
4 8 20141027 2.7 2.5 4.1 4.9 3.3 1.8 3.6
5 9 20141112 2.6 2.4 3.9 4.7 3.1 1.7 3.4
Now what I would like to do is run a script that loops through each row's id and time combo, e.g. "1 20141119" or "8 20141027", matches the date number to the column names, and gives me the corresponding concentration value.
So the combo "7 20140104" gives me the concentration 4.4.
After this I would like to do the same, but take the date and make a 3-day average preceding (and including) the time date. So for the combo "7 20140104", average the concentrations of dates 20140102, 20140103, and 20140104 for id 7.
I made a small test data frame
id <- 12:18
date <- c("c","d","e","f","c","d","e")
a <- rnorm(7, 2, 1)
b <- rnorm(7, 2, 1)
c <- rnorm(7, 2, 1)
d <- rnorm(7, 2, 1)
e <- rnorm(7, 2, 1)
f <- rnorm(7, 2, 1)
df <- data.frame(id, date, a, b, c, d, e, f)
This was my solution for the first part of my question.
for(i in 1:nrow(df)){
conc <- df[i, df[i,"date"]==colnames(df)]
print(conc)
}
which works well enough for the first part, but currently I don't know how to do the 3-day average. If you have tips on how to do the first part more nicely, I'm all ears.
Hopefully you people can help me.
Thanks very much for your help.
If I've understood the question correctly, given a value, you want to get the next two values in that row and return the mean of the 3 values.
Assuming that these date columns are in order, I've adapted your loop to include what I think you are after. Not the most elegant code, but I've tried to lay it out in a step-by-step manner:
for (i in 1:1) { # demo on the first row
  conc <- df[i, df[i, "date"] == colnames(df)]
  conPos <- which(df[i, "date"] == colnames(df)) # get the matched column's position
  av <- df[i, conPos:(conPos + 2)] # the matched column plus the next two
  print(rowMeans(av)) # the average of the 3 values
}
Potentially a more efficient way to do this (depending on the size of your dataset) is to instead of a for loop, use an apply function. Something such as:
apply(df, MARGIN = 1, FUN = function(x) {
  position <- which(x[["date"]] == colnames(df))
  threeDayAverage <- as.numeric(x[position:(position + 2)])
  print(sum(threeDayAverage) / 3)
})
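The question actually describes averaging the matched date column together with the two *preceding* columns (e.g. 20140102-20140104 for 20140104). A base-R sketch of that direction, using the question's toy data, which also avoids running past the last column:

```r
set.seed(42)
id   <- 12:18
date <- c("c", "d", "e", "f", "c", "d", "e")
df <- data.frame(id, date,
                 a = rnorm(7, 2, 1), b = rnorm(7, 2, 1),
                 c = rnorm(7, 2, 1), d = rnorm(7, 2, 1),
                 e = rnorm(7, 2, 1), f = rnorm(7, 2, 1))

pos <- match(df$date, colnames(df))  # column index of each row's date
df$threeDayMean <- sapply(seq_len(nrow(df)), function(i) {
  cols <- (pos[i] - 2):pos[i]        # the date column and the two before it
  mean(unlist(df[i, cols]))
})
```

match() replaces the row-by-row which() comparison, and the sapply() keeps the averaging logic in one place.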

Transferring categorical means to a new table

I'm fairly new to R, but I've tackled much larger challenges than my current problem, which makes it particularly frustrating. I searched the forums and found some related topics, but none would do the trick for this situation.
I've got a dataset with 184 observations of 14 variables:
> head(diving)
tagID ddmmyy Hour.GMT. Hour.Local. X0 X3 X10 X20 X50 X100 X150 X200 X300 X400
1 122097 250912 0 9 0.0 0.0 0.3 12.0 15.3 59.6 12.8 0.0 0 0
2 122097 260912 0 9 0.0 2.4 6.9 5.5 13.7 66.5 5.0 0.0 0 0
3 122097 260912 6 15 0.0 1.9 3.6 4.1 12.7 39.3 34.6 3.8 0 0
4 122097 260912 12 21 0.0 0.2 5.5 8.0 18.1 61.4 6.7 0.0 0 0
5 122097 280912 6 15 2.4 9.3 6.0 3.4 7.6 21.1 50.3 0.0 0 0
6 122097 290912 18 3 0.0 0.2 1.6 6.4 41.4 50.4 0.0 0.0 0 0
This is tagging data, with each date having one or more 6-hour time bins (not a continuous dataset due to transmission interruptions). In each 6-hour bin, the depths to which the animal dived are broken down, by %, into 10 bins. So X0 = % of time spent between 0-3m, X3= % of time spent between 3-10m, and so on.
What I want to do for starters is take the mean % time spent in each depth bin and plot it. To start, I did the following:
avg0<-mean(diving$X0)
avg3<-mean(diving$X3)
avg10<-mean(diving$X10)
avg20<-mean(diving$X20)
avg50<-mean(diving$X50)
avg100<-mean(diving$X100)
avg150<-mean(diving$X150)
avg200<-mean(diving$X200)
avg300<-mean(diving$X300)
avg400<-mean(diving$X400)
At this point, I wasn't sure how to then plot the resulting means, so I made them a list:
divingmeans<-list(avg0, avg3, avg10, avg20, avg50, avg100, avg150, avg200, avg300, avg400)
boxplot(divingmeans) sort of works, providing 1:10 on the X axis and the % 0-30 on the y axis. However, I would prefer a histogram, as well as the x-axis providing categorical bin names (e.g. avg3 or X3), rather than just a rank 1:10.
hist() and plot() provide the following:
> plot(divingmeans)
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' is a list, but does not have components 'x' and 'y'
> hist(divingmeans)
Error in hist.default(divingmeans) : 'x' must be numeric
I've also tried:
> df<-as.data.frame(divingmeans)
> df
X3.33097826086957 X3.29945652173913 X8.85760869565217 X17.6461956521739 X30.2614130434783
1 3.330978 3.299457 8.857609 17.6462 30.26141
X29.3565217391304 X6.44510869565217 X0.664130434782609 X0.135869565217391 X0.0016304347826087
1 29.35652 6.445109 0.6641304 0.1358696 0.001630435
and
> df <- data.frame(matrix(unlist(divingmeans), nrow=10, byrow=T))
> df
matrix.unlist.divingmeans...nrow...10..byrow...T.
1 3.330978261
2 3.299456522
3 8.857608696
4 17.646195652
5 30.261413043
6 29.356521739
7 6.445108696
8 0.664130435
9 0.135869565
10 0.001630435
neither of which provide the sort of table I'm looking for.
I know there must be a really basic solution for converting this into an appropriate table, but I can't figure it out for the life of me. I'd like to be able to make a basic histogram showing the % of time spent in each diving bin, on average. It seems the best format for the data to be in for this purpose would be a table with two columns: col1=bin (category; e.g. avg50), and col2=% (numeric; mean % time spent in that category).
You'll also notice that the data is broken up into different timing bins; eventually I'd like to be able to separate out the data by time of day, to see if, for example, the average diving depths shift between day/night, and so on. I figure that once I have this initial bit of code worked out, I can then do the same by time-of-day by selecting, for example X0[which(Hour.GMT.=="6")]. Tips on this would also be very welcome.
I think you will find it far easier to deal with the data in long format.
You can reshape using reshape. I will use data.table to show how to easily calculate the means by group.
library(data.table)
DT <- data.table(diving)
DTlong <- reshape(DT, varying = list(5:14), direction = 'long',
times = c(0,3,10,20,50,100,150,200,300,400),
v.names = 'time.spent', timevar = 'hours')
timeByHours <- DTlong[,list(mean.time = mean(time.spent)),by=hours]
# you can then plot the two column data.table
plot(timeByHours, type = 'l')
You can now analyse by any combination of date / hour / time at depth
How would you like to plot them?
# grab the means of the ten depth-bin columns (the first four columns are identifiers)
diving.means <- colMeans(diving[, -(1:4)])
# plot it
plot(diving.means)
# boxplot
boxplot(diving.means)
If you'd like to grab the lower bound of each interval from the column names, simply strip away the X:
lowerIntervalBound <- gsub("X", "", names(diving)[-(1:4)])
# you can convert these to numeric and plot against them
lowInts <- as.numeric(lowerIntervalBound)
plot(x=lowInts, y=diving.means)
# ... or taking log
plot(x=log(lowInts), y=diving.means)
# ... or as factors (similar to basic plot)
plot(x=factor(lowInts), y=diving.means)
Instead of putting the diving means in a list, try putting them in a vector (using c()).
If you want to combine it into a data.frame:
data.frame(lowInts, diving.means)
# or adding a row id if needed.
data.frame(rowid=seq(along=diving.means), lowInts, diving.means)
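To get the categorical bar chart the question describes (mean % time per labelled depth bin), barplot() on the column means works directly. A sketch; the two-row frame below just stands in for the real diving data:

```r
# toy stand-in for the diving data: 4 id columns + 10 depth bins
diving <- data.frame(tagID = c(122097, 122097), ddmmyy = c(250912, 260912),
                     Hour.GMT. = c(0, 0), Hour.Local. = c(9, 9),
                     X0 = c(0, 0), X3 = c(0, 2.4), X10 = c(0.3, 6.9),
                     X20 = c(12, 5.5), X50 = c(15.3, 13.7),
                     X100 = c(59.6, 66.5), X150 = c(12.8, 5),
                     X200 = c(0, 0), X300 = c(0, 0), X400 = c(0, 0))

diving.means <- colMeans(diving[, -(1:4)])   # drop the four id columns
barplot(diving.means,
        names.arg = sub("^X", "", names(diving.means)),
        xlab = "depth bin lower bound (m)", ylab = "mean % time")
```

Unlike hist(), barplot() takes the precomputed means and keeps the categorical bin labels on the x-axis, which is what the question asks for.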

summarize data from csv using R

I'm new to R, and I wrote some code to summarize data from a .csv file according to my needs.
Here is the code:
raw <- read.csv("trees.csv")
which looks like this:
SNAME CNAME FAMILY PLOT INDIVIDUAL CAP H
1 Alchornea triplinervia (Spreng.) M. Arg. Tainheiro Euphorbiaceae 5 176 15 9.5
2 Andira fraxinifolia Benth. Angelim Fabaceae 3 321 12 6.0
3 Andira fraxinifolia Benth. Angelim Fabaceae 3 326 14 7.0
4 Andira fraxinifolia Benth. Angelim Fabaceae 3 327 18 5.0
5 Andira fraxinifolia Benth. Angelim Fabaceae 3 328 12 6.0
6 Andira fraxinifolia Benth. Angelim Fabaceae 3 329 21 7.0
#add 2 other rows
for (i in 1:nrow(raw)) {
raw$VOLUME[i] <- treeVolume(raw$CAP[i],raw$H[i])
raw$BASALAREA[i] <- treeBasalArea(raw$CAP[i])
}
Here comes the part I need help with: I need a new data frame with the means of columns H and CAP and the sums of columns VOLUME and BASALAREA, grouped by column SNAME and subgrouped by column PLOT.
plotSummary = merge(
aggregate(raw$CAP ~ raw$SNAME * raw$PLOT, raw, mean),
aggregate(raw$H ~ raw$SNAME * raw$PLOT, raw, mean))
plotSummary = merge(
plotSummary,
aggregate(raw$VOLUME ~ raw$SNAME * raw$PLOT, raw, sum))
plotSummary = merge(
plotSummary,
aggregate(raw$BASALAREA ~ raw$SNAME * raw$PLOT, raw, sum))
The functions treeVolume and treeBasalArea just return numbers.
treeVolume <- function(radius, height) {
return (0.000074230*radius**1.707348*height**1.16873)
}
treeBasalArea <- function(radius) {
return (((radius**2)*pi)/40000)
}
I'm sure that there is a better way of doing this, but how?
I can't manage to read your example data in, but I think I've made something that generally represents it, so give this a whirl. This answer builds off of Greg's suggestion to look at plyr, using ddply to operate on groups of your data.frame and numcolwise to calculate your statistics of interest.
#Sample data
set.seed(1)
dat <- data.frame(sname = rep(letters[1:3],2), plot = rep(letters[1:3],2),
CAP = rnorm(6),
H = rlnorm(6),
VOLUME = runif(6),
BASALAREA = rlnorm(6)
)
#Calculate mean for all numeric columns, grouping by sname and plot
library(plyr)
ddply(dat, c("sname", "plot"), numcolwise(mean))
#-----
sname plot CAP H VOLUME BASALAREA
1 a a 0.4844135 1.182481 0.3248043 1.614668
2 b b 0.2565755 3.313614 0.6279025 1.397490
3 c c -0.8280485 1.627634 0.1768697 2.538273
EDIT - response to updated question
Ok, now that your question is more or less reproducible, here's how I'd approach it. First of all, you can take advantage of the fact that R is vectorized, meaning that you can calculate ALL of the VOLUME and BASALAREA values in one pass, without looping through each row. For that bit, I recommend the transform function:
dat <- transform(dat, VOLUME = treeVolume(CAP, H), BASALAREA = treeBasalArea(CAP))
Secondly, realizing that you intend to calculate different statistics for CAP & H and then VOLUME & BASALAREA, I recommend using the summarize function, like this:
ddply(dat, c("sname", "plot"), summarize,
meanCAP = mean(CAP),
meanH = mean(H),
sumVOLUME = sum(VOLUME),
sumBASAL = sum(BASALAREA)
)
Which will give you an output that looks like:
sname plot meanCAP meanH sumVOLUME sumBASAL
1 a a 0.5868582 0.5032308 9.650184e-06 7.031954e-05
2 b b 0.2869029 0.4333862 9.219770e-06 1.407055e-05
3 c c 0.7356215 0.4028354 2.482775e-05 8.916350e-05
The help pages for ?ddply, ?transform, ?summarize should be insightful.
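For readers on current tooling: the same grouped summary can be sketched with dplyr (plyr's successor, assuming dplyr >= 1.0 for the .groups argument); the toy data is rebuilt here so the snippet stands alone.

```r
library(dplyr)

set.seed(1)
dat <- data.frame(sname = rep(letters[1:3], 2), plot = rep(letters[1:3], 2),
                  CAP = rnorm(6), H = rlnorm(6),
                  VOLUME = runif(6), BASALAREA = rlnorm(6))

# group, then mix means and sums in one summarize() call
res <- dat %>%
  group_by(sname, plot) %>%
  summarize(meanCAP = mean(CAP), meanH = mean(H),
            sumVOLUME = sum(VOLUME), sumBASAL = sum(BASALAREA),
            .groups = "drop")
```

This mirrors the ddply/summarize call above, one output row per (sname, plot) pair.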
Look at the plyr package. It will split the data by the SNAME variable for you; then you give it code to do the set of summaries that you want (mixing mean and sum and whatever), and it will put the pieces back together for you. You probably want either the ddply or the daply function in that package.
