Subtract data from lat/lon coordinates - r

I have 2 files of data that look like this:
Model Data
long lat count
96.25 18.75 4
78.75 21.25 3
86.75 23.25 7
91.25 33.75 10
Observation Data
long lat count
96.75 25.75 10
86.75 23.25 7
78.75 21.25 11
95.25 30.25 5
I'm trying to subtract the counts for the lat/long combinations that match (model data - observation data), so that the first matching combination, 78.75 & 21.25, would give a difference count of -8. Any lat/long point without a match would just be subtracted from or by 0.
I've tried an if statement as such to match points for subtraction:
if (modeldata$long == obsdata$long & modeldata$lat == obsdata$lat) {
  obsdata$difference <- modeldata$count - obsdata$count
}
However, this just subtracts rows in order, not by matching points, unless matching points happen to fall within the same row.
I also get these warnings:
Warning messages:
1: In modeldata$long == obsdata$long :
longer object length is not a multiple of shorter object length
2: In modeldata$lat == obsdata$lat :
longer object length is not a multiple of shorter object length
3: In if (modeldata$long == obsdata$long & modeldata$lat == :
the condition has length > 1 and only the first element will be used
Any help would be greatly appreciated!

You can merge on the coordinates, replace NA with 0, and subtract.
mdl <- read.table(text = "long lat count
96.25 18.75 4
78.75 21.25 3
86.75 23.25 7
91.25 33.75 10", header = TRUE)
obs <- read.table(text = "long lat count
96.75 25.75 10
86.75 23.25 7
78.75 21.25 11
95.25 30.25 5", header = TRUE)
xy <- merge(mdl, obs, by = c("long", "lat"), all.x = TRUE)
xy[is.na(xy)] <- 0
xy$diff <- xy$count.x - xy$count.y
xy
long lat count.x count.y diff
1 78.75 21.25 3 11 -8
2 86.75 23.25 7 7 0
3 91.25 33.75 10 0 10
4 96.25 18.75 4 0 4
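The merge above keeps only the model rows (all.x = TRUE). If you also want observation points with no model match (the question treats unmatched points on either side as differenced against 0), a sketch of the same idea as a full outer join:
# all = TRUE keeps coordinates that appear in only one of the two tables
xy <- merge(mdl, obs, by = c("long", "lat"), all = TRUE)
xy[is.na(xy)] <- 0                  # unmatched counts become 0
xy$diff <- xy$count.x - xy$count.y  # model - observation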

You can do this using a data.table join & update
library(data.table)
## reading your supplied data
# dt_model <- fread(
# 'long lat count
# 96.25 18.75 4
# 78.75 21.25 3
# 86.75 23.25 7
# 91.25 33.75 10'
# )
#
#
# dt_obs <- fread(
# "long lat count
# 96.75 25.75 10
# 86.75 23.25 7
# 78.75 21.25 11
# 95.25 30.25 5"
# )
setDT(dt_model)
setDT(dt_obs)
## this join & update will update the `dt_model`.
dt_model[
dt_obs
, on = c("long", "lat")
, count := count - i.count
]
dt_model
# long lat count
# 1: 96.25 18.75 4
# 2: 78.75 21.25 -8
# 3: 86.75 23.25 0
# 4: 91.25 33.75 10
Note the obvious caveat that joining on coordinates (floating-point values) may not always give the right answer.
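One way to guard against such float mismatches, assuming the coordinates lie on a fixed grid, is to round both tables to a common precision before joining; a sketch (2 decimal places matches the sample data's apparent resolution):
# round coordinates so tiny floating-point differences cannot break the join
dt_model[, `:=`(long = round(long, 2), lat = round(lat, 2))]
dt_obs[, `:=`(long = round(long, 2), lat = round(lat, 2))]
dt_model[dt_obs, on = c("long", "lat"), count := count - i.count]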

Here is an option with dplyr
library(dplyr)
left_join(mdl, obs, by = c("long", "lat")) %>%
transmute(long, lat, count = count.x - replace(count.y, is.na(count.y), 0))
# long lat count
#1 96.25 18.75 4
#2 78.75 21.25 -8
#3 86.75 23.25 0
#4 91.25 33.75 10
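As with the base merge answer, a full_join() would also keep observation-only points; a sketch using coalesce() to treat missing counts as 0:
# full_join keeps points present in either table; coalesce() turns NA into 0
full_join(mdl, obs, by = c("long", "lat")) %>%
  transmute(long, lat, count = coalesce(count.x, 0) - coalesce(count.y, 0))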


read.csv error due to no column names (R)

I'm trying to read a CSV file in R.
The issue is that my file has no column names except for the first column.
Using the read.csv() function gives me the 'Error in read.table : more columns than column names' error.
So I used the read_csv() function from the readr library.
However, this creates a data frame with just one column containing all the values.
(screenshot: https://i.stack.imgur.com/Och8A.png)
What should I do to fix this issue?
A first cut at reading the data would be to use skip=1 (to drop the first line, which appears to be descriptive only) and header=FALSE:
quux <- read.csv("path/to/file.csv", skip = 1, header = FALSE)
This format is a bit awkward, so we may want to reshape it:
# transpose the data (dropping the first column) and use the first column,
# with its trailing ":" removed, as the new column names
quux <- setNames(data.frame(t(quux[,-1])), sub(":$", "", quux[[1]]))
quux
# LON LAT MMM 1984-Nov-01 1974-Nov-05
# V2 151.0 -24.5 27.11 22.28 22.92
# V3 151.5 -24.0 27.46 22.47 22.83
# V4 152.0 -24.0 27.19 22.27 22.64
Many tools prefer to have the "month" column names as a single column, which means converting this data from "wide" format to "long" format. This is easily done with either tidyr::pivot_longer or reshape2::melt:
dat <- reshape2::melt(quux, c("LON", "LAT", "MMM"), variable.name = "date")
dat
# LON LAT MMM date value
# 1 151.0 -24.5 27.11 1984-Nov-01 22.28
# 2 151.5 -24.0 27.46 1984-Nov-01 22.47
# 3 152.0 -24.0 27.19 1984-Nov-01 22.27
# 4 151.0 -24.5 27.11 1974-Nov-05 22.92
# 5 151.5 -24.0 27.46 1974-Nov-05 22.83
# 6 152.0 -24.0 27.19 1974-Nov-05 22.64
dat <- tidyr::pivot_longer(quux, -c(LON, LAT, MMM), names_to = "date")
From here, it might be nice to make the date column a "proper" Date object so that number-like things can be done with it. For example, in its present form, sorting is incorrect since Apr will land before Jan; other number-like operations include finding ranges of dates (which can be done with strings, but not these strings) and adding/subtracting days (e.g., 7 days prior to a value).
dat$date <- as.Date(dat$date, format = "%Y-%b-%d")
dat
# LON LAT MMM date value
# 1 151.0 -24.5 27.11 1984-11-01 22.28
# 2 151.5 -24.0 27.46 1984-11-01 22.47
# 3 152.0 -24.0 27.19 1984-11-01 22.27
# 4 151.0 -24.5 27.11 1974-11-05 22.92
# 5 151.5 -24.0 27.46 1974-11-05 22.83
# 6 152.0 -24.0 27.19 1974-11-05 22.64
Sample data:
quux <- read.csv(skip = 1, header = FALSE, text = '
LON:,151.0,151.5,152.0
LAT:,-24.5,-24.0,-24.0
MMM:,27.11,27.46,27.19
1984-Nov-01,22.28,22.47,22.27
1974-Nov-05,22.92,22.83,22.64
')
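Since the question mentions readr, the same skip-and-no-header approach should work there too, assuming the file really is comma-separated; a sketch:
# readr equivalent: skip the descriptive first line and let readr name the columns
library(readr)
quux <- read_csv("path/to/file.csv", skip = 1, col_names = FALSE)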

Averaging every n columns and keep first two columns in the new data.frame in r

I have a data frame of daily time series with 4 observations per day (every 6 hours) for each x and y (I have 202552 cells).
> head(tab,10)
x y X1990.05.01.01.00.00 X1990.05.01.07.00.00 X1990.05.01.13.00.00 X1990.05.01.19.00.00 X1990.05.02.01.00.00 X1990.05.02.07.00.00 X1990.05.02.13.00.00
1 5.000 60 276.9105 277.8516 278.9908 279.2422 279.6751 279.8078 280.4396
2 5.125 60 276.8863 277.8682 278.9966 279.2543 279.6863 279.7885 280.4033
3 5.250 60 276.8621 277.8830 279.0006 279.2659 279.6989 279.7688 280.3661
4 5.375 60 276.8379 277.8969 279.0029 279.2772 279.7123 279.7477 280.3289
5 5.500 60 276.8142 277.9094 279.0033 279.2879 279.7257 279.7244 280.2909
6 5.625 60 276.7913 277.9224 279.0033 279.2987 279.7396 279.6993 280.2523
7 5.750 60 276.7707 277.9363 279.0020 279.3094 279.7531 279.6715 280.2142
8 5.875 60 276.7537 277.9520 279.0002 279.3202 279.7656 279.6406 280.1770
9 6.000 60 276.7416 277.9713 278.9980 279.3314 279.7773 279.6070 280.1407
10 6.125 60 276.7357 277.9946 278.9953 279.3435 279.7871 279.5707 280.1071
X1990.05.02.19.00.00 X1990.05.03.01.00.00 X1990.05.03.07.00.00 X1990.05.03.13.00.00 X1990.05.03.19.00.00 X1990.05.04.01.00.00 X1990.05.04.07.00.00
1 280.5674 280.3316 280.3796 280.2308 280.6216 280.6216 280.1842
2 280.5414 280.3106 280.3697 280.2133 280.6220 280.6368 280.2053
3 280.5145 280.2886 280.3594 280.1927 280.6184 280.6503 280.2227
4 280.4858 280.2653 280.3482 280.1703 280.6113 280.6619 280.2380
5 280.4562 280.2420 280.3379 280.1466 280.6010 280.6722 280.2492
6 280.4262 280.2192 280.3280 280.1219 280.5880 280.6816 280.2572
7 280.3957 280.1981 280.3209 280.0973 280.5732 280.6910 280.2613
8 280.3661 280.1793 280.3159 280.0748 280.5571 280.7009 280.2626
9 280.3384 280.1640 280.3155 280.0542 280.5414 280.7112 280.2599
10 280.3128 280.1542 280.3195 280.0385 280.5270
I'd like to compute the daily average over every 4 columns (as each day has 4 measurements). I was able to use this function, but I need to keep x and y for each row.
### daily mean
byapply <- function(x, by, fun, ...)
{
  # Create index list
  if (length(by) == 1)
  {
    nc <- ncol(x)
    split.index <- rep(1:ceiling(nc / by), each = by, length.out = nc)
  } else # 'by' is a vector of groups
  {
    nc <- length(by)
    split.index <- by
  }
  index.list <- split(seq(from = 1, to = nc), split.index)
  # Pass index list to fun using sapply() and return object
  sapply(index.list, function(i)
  {
    do.call(fun, list(x[, i], ...))
  })
}
DM <- data.frame(byapply(tab[3:2800], 4, rowMeans))
> head(DM, 10)
X1 X2 X3 X4 X5
1 278.2488 280.1225 280.3909 279.4138 276.6809
2 278.2514 280.1049 280.3789 279.4395 276.7141
3 278.2529 280.0871 280.3648 279.4634 276.7437
4 278.2537 280.0687 280.3488 279.4858 276.7691
5 278.2537 280.0493 280.3319 279.5066 276.7909
6 278.2539 280.0294 280.3143 279.5264 276.8090
7 278.2546 280.0086 280.2974 279.5453 276.8244
8 278.2565 279.9873 280.2818 279.5639 276.8377
9 278.2605 279.9658 280.2688 279.5819 276.8495
10 278.2673 279.9444 280.2598 279.5998 276.8611
Then I can use cbind to link the daily means with each x and y:
lonlat <- tab[-(3:2800)]
DMxy <- data.frame(cbind(lonlat, DM))
But I am looking for a way to compute the daily average directly while keeping the first two columns (x and y) in the new data frame (without deleting x and y), to minimize any possible error from cbind.
Instead of
DM <- data.frame(byapply(tab[3:2800], 4, rowMeans))
try
DM2 <- cbind(byapply(tab[-(1:2)], 4, rowMeans), tab[1:2])
That will get you the desired result in a single step. You also minimize the chance of a mistake because you don't need to know the number of columns in your data frame; tab[-(1:2)] means "every column in tab except the first two".
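For comparison, a compact base R sketch of the same idea without the helper function, assuming exactly 4 measurement columns per day after the two coordinate columns:
# group the measurement columns into consecutive blocks of 4 and row-average each block
grp <- ceiling(seq_along(tab[-(1:2)]) / 4)
DM3 <- cbind(tab[1:2], sapply(split.default(tab[-(1:2)], grp), rowMeans))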
This is a classic textbook case for not storing data in wide format when you need operations such as grouped aggregation, specifically averaging. Consider melting your data into long format and aggregating by day for each x and y grouping:
DATA (the OP's posted example, with the two missing values in row 10 filled in)
txt= ' x y X1990.05.01.01.00.00 X1990.05.01.07.00.00 X1990.05.01.13.00.00 X1990.05.01.19.00.00 X1990.05.02.01.00.00 X1990.05.02.07.00.00 X1990.05.02.13.00.00 X1990.05.02.19.00.00 X1990.05.03.01.00.00 X1990.05.03.07.00.00 X1990.05.03.13.00.00 X1990.05.03.19.00.00 X1990.05.04.01.00.00 X1990.05.04.07.00.00
1 5.000 60 276.9105 277.8516 278.9908 279.2422 279.6751 279.8078 280.4396 280.5674 280.3316 280.3796 280.2308 280.6216 280.6216 280.1842
2 5.125 60 276.8863 277.8682 278.9966 279.2543 279.6863 279.7885 280.4033 280.5414 280.3106 280.3697 280.2133 280.6220 280.6368 280.2053
3 5.250 60 276.8621 277.8830 279.0006 279.2659 279.6989 279.7688 280.3661 280.5145 280.2886 280.3594 280.1927 280.6184 280.6503 280.2227
4 5.375 60 276.8379 277.8969 279.0029 279.2772 279.7123 279.7477 280.3289 280.4858 280.2653 280.3482 280.1703 280.6113 280.6619 280.2380
5 5.500 60 276.8142 277.9094 279.0033 279.2879 279.7257 279.7244 280.2909 280.4562 280.2420 280.3379 280.1466 280.6010 280.6722 280.2492
6 5.625 60 276.7913 277.9224 279.0033 279.2987 279.7396 279.6993 280.2523 280.4262 280.2192 280.3280 280.1219 280.5880 280.6816 280.2572
7 5.750 60 276.7707 277.9363 279.0020 279.3094 279.7531 279.6715 280.2142 280.3957 280.1981 280.3209 280.0973 280.5732 280.6910 280.2613
8 5.875 60 276.7537 277.9520 279.0002 279.3202 279.7656 279.6406 280.1770 280.3661 280.1793 280.3159 280.0748 280.5571 280.7009 280.2626
9 6.000 60 276.7416 277.9713 278.9980 279.3314 279.7773 279.6070 280.1407 280.3384 280.1640 280.3155 280.0542 280.5414 280.7112 280.2599
10 6.125 60 276.7357 277.9946 278.9953 279.3435 279.7871 279.5707 280.1071 280.3128 280.1542 280.3195 280.0385 280.5270 280.6581 280.3139'
df <- read.table(text=txt, header=TRUE)
CODE
library(reshape2)
mdf <- melt(df, id.vars = c('x', 'y'), variable.name = "day")
mdf$day <- gsub("X", "", mdf$day)
mdf$datetime <- as.POSIXct(mdf$day, format="%Y.%m.%d.%H.%M.%S")
mdf$day <- format(mdf$datetime, "%Y-%m-%d")
head(mdf)
# x y day value datetime
# 1 5.000 60 1990-05-01 276.9105 1990-05-01 01:00:00
# 2 5.125 60 1990-05-01 276.8863 1990-05-01 01:00:00
# 3 5.250 60 1990-05-01 276.8621 1990-05-01 01:00:00
# 4 5.375 60 1990-05-01 276.8379 1990-05-01 01:00:00
# 5 5.500 60 1990-05-01 276.8142 1990-05-01 01:00:00
# 6 5.625 60 1990-05-01 276.7913 1990-05-01 01:00:00
aggdf <- aggregate(value ~ x + y + day, mdf, FUN=mean)
aggdf <- with(aggdf, aggdf[order(x,y),]) # RE-ORDER BY X
row.names(aggdf) <- NULL # RESET ROW NAMES
head(aggdf, 12)
# x y day value
# 1 5.000 60 1990-05-01 278.2488
# 2 5.000 60 1990-05-02 280.1225
# 3 5.000 60 1990-05-03 280.3909
# 4 5.000 60 1990-05-04 280.4029
# 5 5.125 60 1990-05-01 278.2514
# 6 5.125 60 1990-05-02 280.1049
# 7 5.125 60 1990-05-03 280.3789
# 8 5.125 60 1990-05-04 280.4211
# 9 5.250 60 1990-05-01 278.2529
# 10 5.250 60 1990-05-02 280.0871
# 11 5.250 60 1990-05-03 280.3648
# 12 5.250 60 1990-05-04 280.4365
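The same long-format approach can also be sketched with tidyr/dplyr (assuming dplyr >= 1.0, and reusing the df built above):
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(-c(x, y), names_to = "day", values_to = "value") %>%
  mutate(day = format(as.POSIXct(sub("^X", "", day), format = "%Y.%m.%d.%H.%M.%S"), "%Y-%m-%d")) %>%
  group_by(x, y, day) %>%
  summarise(value = mean(value), .groups = "drop")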

How can I extract data from a raster stack based on a list of lat long?

I have a raster stack with 100+ files. And I want to extract the values from each file for particular lat-long locations. This gives me the list of values for one Lat-Long combination.
plist <- list.files(pattern = "\\.tif$", include.dirs = TRUE)
pstack <- stack(plist)
#levelplot(pstack)
for (i in 1:length(plist))
  t[i] = extract(pstack[[i]], 35, -90)
How can I do this for thousands of locations when I have the lat-long locations in a separate file/dataframe? There is a location ID that I want to preserve in the final list:
Lat Long LocID
35 -90 001
35 -95 221
30 -95.4 226
31.5 - 90 776
My final objective is to have a dataframe of this type:
Lat Long LocID value
35 -90 001 0.5
35 -95 221 1.4
30 -95.4 226 2.5
31.5 - 90 776 4.5
Though if it is not possible to preserve the LocID, that's fine too.
One of the files: https://www.dropbox.com/s/ank4uxjbjk3chaz/new_conus.tif?dl=0
Testing a solution from comments:
latlong<-structure(list(lon = c(-71.506667, -71.506667, -71.506667, -71.215278,
-71.215278, -71.215278, -71.215278, -71.215278, -71.215278, -71.215278
), lat = c(42.8575, 42.8575, 42.8575, 42.568056, 42.568056, 42.568056,
42.568056, 42.568056, 42.568056, 42.568056)), .Names = c("lon",
"lat"), row.names = c(NA, 10L), class = "data.frame")
ext <- extract(pstack, latlong)
gives
Error in UseMethod("extract_") :
no applicable method for 'extract_' applied to an object of class "c('RasterStack', 'Raster', 'RasterStackBrick', 'BasicRaster')"
Update #2:
The error was because extract was being masked by a function from another package. This works:
raster::extract(pstack,latlong)
You can use the extract() function from the raster library. First read in your data frame and select the lon/lat columns. Let's say you have a data frame dat and the raster stack pstack:
loc <- dat[,c("long", "lat")]
ext <- extract(pstack, loc)
new_d <- cbind(dat, ext) # bind the extracted values back to the previous dataframe
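One detail worth flagging: extract() expects point coordinates in (x, y) order, i.e. longitude first. If your file lists Lat before Long, as in the question, reorder when selecting; a sketch that also carries LocID along (column names assumed from the question's table):
# extract() wants (x, y) = (Long, Lat); keep LocID by binding it back on
loc <- dat[, c("Long", "Lat")]
new_d <- cbind(dat["LocID"], loc, raster::extract(pstack, loc))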
I don't usually work with this type of data, but how about this:
library(sp)
library(raster)
library(rgdal)
# coordinate data
coords <- read.table(text = 'Lat Long LocID
35 -90 001
35 -95 221
30 -95.4 226
31.5 -90 776', header = T)
# list of all files
plist <- c('~/Downloads/new_conus.tif', '~/Downloads/new_conus copy.tif')
# image stack
data.images <- stack(plist)
# make a master data frame containing all necessary data
data.master <- data.frame(file = rep(plist, each = nrow(coords)), file.id = rep(1:length(plist), each = nrow(coords)), coords)
At this point, we have a master data frame that looks like this:
file file.id Lat Long LocID
1 ~/Downloads/new_conus.tif 1 35.0 -90.0 1
2 ~/Downloads/new_conus.tif 1 35.0 -95.0 221
3 ~/Downloads/new_conus.tif 1 30.0 -95.4 226
4 ~/Downloads/new_conus.tif 1 31.5 -90.0 776
5 ~/Downloads/new_conus copy.tif 2 35.0 -90.0 1
6 ~/Downloads/new_conus copy.tif 2 35.0 -95.0 221
7 ~/Downloads/new_conus copy.tif 2 30.0 -95.4 226
8 ~/Downloads/new_conus copy.tif 2 31.5 -90.0 776
Now we just extract the value corresponding to the data in each row of the data frame:
# extract values for each row in the master data frame
data.master$value <- NA
for (i in 1:nrow(data.master)) {
  # extract() expects (x, y) coordinates, i.e. (Long, Lat), as a two-column matrix
  data.master$value[i] <- with(data.master, extract(data.images[[file.id[i]]], cbind(Long[i], Lat[i])))
}
file file.id Lat Long LocID value
1 ~/Downloads/new_conus.tif 1 35.0 -90.0 1 255
2 ~/Downloads/new_conus.tif 1 35.0 -95.0 221 255
3 ~/Downloads/new_conus.tif 1 30.0 -95.4 226 259
4 ~/Downloads/new_conus.tif 1 31.5 -90.0 776 249
5 ~/Downloads/new_conus copy.tif 2 35.0 -90.0 1 255
6 ~/Downloads/new_conus copy.tif 2 35.0 -95.0 221 255
7 ~/Downloads/new_conus copy.tif 2 30.0 -95.4 226 259
8 ~/Downloads/new_conus copy.tif 2 31.5 -90.0 776 249
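Note that extract() on a RasterStack is already vectorized over both points and layers, so the row-by-row loop can usually be avoided; a sketch producing one value column per file:
# one call returns a matrix: one row per point, one column per layer
vals <- raster::extract(data.images, coords[, c("Long", "Lat")])
result <- cbind(coords, vals)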

R: sum vector by vector of conditions

I am trying to obtain a vector that contains the sums of the elements that fit a condition.
values = runif(5000)
bin = seq(0, 0.9, by = 0.1)
sum(values < bin)
I expected that sum would return 10 values - for each element of bin, a sum of the values elements that fit the < condition.
However, it returns only one value.
How can I achieve the result without using a while loop?
I understand this to mean that you want, for each value in bin, the number of elements in values that are less than that bin value. So I think you want vapply() here:
vapply(bin, function(x) sum(values < x), 1L)
# [1] 0 497 1025 1501 1981 2461 2955 3446 3981 4526
If you want a little table for reference, you could add names
v <- vapply(bin, function(x) sum(values < x), 1L)
setNames(v, bin)
# 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
# 0 497 1025 1501 1981 2461 2955 3446 3981 4526
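A loop-free base R alternative: bin every value once with findInterval() and take cumulative counts. A sketch, assuming no value falls below bin[1] (true for runif() output with bin starting at 0):
# tabulate() counts values per interval; the cumulative sum gives the number
# of values below each bin edge
fi <- findInterval(values, bin)
c(0, cumsum(tabulate(fi, nbins = length(bin))))[seq_along(bin)]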
I personally prefer data.table over tapply or vapply, and findInterval over cut.
set.seed(1)
library(data.table)
dt <- data.table(values, groups = findInterval(values, bin))
setkey(dt, groups)
dt[, .(n = .N, v = sum(values)), by = groups][, .(cumsum(n), cumsum(v))]
# V1 V2
# 1: 537 26.43445
# 2: 1041 101.55686
# 3: 1537 226.12625
# 4: 2059 410.41487
# 5: 2564 637.18782
# 6: 3050 904.65876
# 7: 3473 1180.53342
# 8: 3951 1540.18559
# 9: 4464 1976.23067
#10: 5000 2485.44920
cbind(vapply(bin, function(x) sum(values < x), 1L)[-1],
cumsum(tapply( values, cut(values, bin), sum)))
# [,1] [,2]
#(0,0.1] 537 26.43445
#(0.1,0.2] 1041 101.55686
#(0.2,0.3] 1537 226.12625
#(0.3,0.4] 2059 410.41487
#(0.4,0.5] 2564 637.18782
#(0.5,0.6] 3050 904.65876
#(0.6,0.7] 3473 1180.53342
#(0.7,0.8] 3951 1540.18559
#(0.8,0.9] 4464 1976.23067
Using tapply with a cut()-constructed INDEX vector seems to deliver:
tapply( values, cut(values, bin), sum)
(0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6]
25.43052 71.06897 129.99698 167.56887 222.74620 277.16395
(0.6,0.7] (0.7,0.8] (0.8,0.9]
332.18292 368.49341 435.01104
Although I'm guessing you would want the cut-vector to extend to 1.0:
bin = seq(0, 1, by = 0.1)
tapply( values, cut(values, bin), sum)
(0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6]
25.48087 69.87902 129.37348 169.46013 224.81064 282.22455
(0.6,0.7] (0.7,0.8] (0.8,0.9] (0.9,1]
335.43991 371.60885 425.66550 463.37312
I see that I understood the question differently than Richard. If you wanted his result you can use cumsum on my result.
Using dplyr:
set.seed(1)
library(dplyr)
# the grouped data frame used below (assuming groups was built with cut())
df <- data.frame(values, groups = cut(values, bin))
df %>% group_by(groups) %>%
  summarise(count = n(), sum = sum(values)) %>%
  mutate(cumcount = cumsum(count), cumsum = cumsum(sum))
Output:
groups count sum cumcount cumsum
1 (0,0.1] 537 26.43445 537 26.43445
2 (0.1,0.2] 504 75.12241 1041 101.55686
3 (0.2,0.3] 496 124.56939 1537 226.12625
4 (0.3,0.4] 522 184.28862 2059 410.41487
5 (0.4,0.5] 505 226.77295 2564 637.18782
6 (0.5,0.6] 486 267.47094 3050 904.65876
7 (0.6,0.7] 423 275.87466 3473 1180.53342
8 (0.7,0.8] 478 359.65217 3951 1540.18559
9 (0.8,0.9] 513 436.04508 4464 1976.23067
10 NA 536 509.21853 5000 2485.44920

mean and standard deviation by group for multiple variables [duplicate]

This question already has answers here:
plyr package writing the same function over multiple columns
(2 answers)
Closed 9 years ago.
I am sure this question has been answered before, but I would like to calculate the mean and sd by treatment for multiple variables (100s) all at once, and I cannot figure out how to do it aside from writing long-winded ddply code.
This is a portion of my dataframe (g):
trt blk til res sand silt clay ibd1_6 ibd9_14 ibd_ave
1 CTK 1 CT K 74 15 11 1.323 1.593 1.458
2 CTK 2 CT K 71 15 14 1.575 1.601 1.588
3 CTK 3 CT K 72 14 14 1.551 1.594 1.573
4 CTR 1 CT R 72 15 13 1.560 1.647 1.604
5 CTR 2 CT R 73 14 13 1.612 1.580 1.596
6 CTR 3 CT R 73 13 14 1.709 1.577 1.643
7 ZTK 1 ZT K 72 16 12 1.526 1.546 1.536
8 ZTK 2 ZT K 71 16 13 1.292 1.626 1.459
9 ZTK 3 ZT K 71 17 12 1.623 1.607 1.615
10 ZTR 1 ZT R 66 16 18 1.719 1.709 1.714
11 ZTR 2 ZT R 67 17 16 1.529 1.708 1.618
12 ZTR 3 ZT R 66 17 17 1.663 1.655 1.659
I would like to have a function that does what ddply does, i.e. ddply(g, trt, meanSand = mean(sand), sdSand = sd(sand), meanSilt = mean(silt), ...), without having to write it all out. Any ideas? Thank you for your patience!
The function you will likely want to apply to your dataframe is aggregate() with either mean or sd as the function parameter.
Assuming myDF is your original dataset:
library(data.table)
myDT <- data.table(myDF)
# Which variables to calculate? All columns but the first five:
variables <- tail(names(myDT), -5)
myDT[, lapply(.SD, function(x) list(mean(x), sd(x))), .SDcols=variables, by=list(trt, til)]
## OR Separately, if you prefer shorter `lapply` statements
myDT[, lapply(.SD, mean), .SDcols=variables, by=list(trt, til)]
myDT[, lapply(.SD, sd), .SDcols=variables, by=list(trt, til)]
--
> myDT[, lapply(.SD, mean), .SDcols=variables, by=list(trt, til)]
# trt til silt clay ibd1_6 ibd9_14 ibd_ave
# 1: CTK CT 14.66667 13.00000 1.483000 1.596000 1.539667
# 2: CTR CT 14.00000 13.33333 1.627000 1.601333 1.614333
# 3: ZTK ZT 16.33333 12.33333 1.480333 1.593000 1.536667
# 4: ZTR ZT 16.66667 17.00000 1.637000 1.690667 1.663667
> myDT[, lapply(.SD, sd), .SDcols=variables, by=list(trt, til)]
# trt til silt clay ibd1_6 ibd9_14 ibd_ave
# 1: CTK CT 0.5773503 1.7320508 0.13908271 0.004358899 0.07112196
# 2: CTR CT 1.0000000 0.5773503 0.07562407 0.039576929 0.02514624
# 3: ZTK ZT 0.5773503 0.5773503 0.17015973 0.041797129 0.07800214
# 4: ZTR ZT 0.5773503 1.0000000 0.09763196 0.030892286 0.04816984
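If you would rather have mean and sd side by side in one result (instead of stacked as in the combined lapply call above), a sketch that flattens each column's list into named columns:
# unlist(recursive = FALSE) turns each column's list(mean, sd) into two named
# columns, e.g. silt.mean and silt.sd
myDT[, unlist(lapply(.SD, function(x) list(mean = mean(x), sd = sd(x))), recursive = FALSE),
     .SDcols = variables, by = list(trt, til)]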
aggregate(g[, c("sand", "silt", "clay")], by = list(trt = g$trt), function(x) c(mean = mean(x), sd = sd(x)))
Using an anonymous function with aggregate.data.frame lets you get both values with one call. You only want to pass in the columns to be aggregated. If you had a long list of columns and only wanted to exclude, say, the first 4 from the calculations, it could be written as:
aggregate(g[, names(g)[-(1:4)]], by = list(trt = g$trt), function(x) c(mean = mean(x), sd = sd(x)))
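For completeness, a modern dplyr sketch of the same task (assuming dplyr >= 1.0 for across()):
library(dplyr)
# one mean and one sd column per numeric variable, grouped by treatment
g %>%
  group_by(trt) %>%
  summarise(across(where(is.numeric), list(mean = mean, sd = sd)))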
