How to do a "full" union with the R package sf - r

I am trying to do a union of three polygons using sf::st_union. The figure below shows the result of the ArcGIS "Overlay, Union, All" operation; I wish to obtain a similar result, the five different polygons in 'OUTPUT', using the sf package in R.
library(sf)
a1 <- st_polygon(list(rbind(c(0, 10), c(45, 10), c(45, 90), c(0, 90), c(0, 10))))
a2 <- st_polygon(list(rbind(c(45, 10), c(90,10), c(90, 90), c(45, 90), c(45, 10))))
b <- st_polygon(list(rbind(c(15, 5), c(75, 5), c(75, 50), c(15, 50), c(15, 5))))
a <- st_sf(c(st_sfc(a1), st_sfc(a2)))
b <- st_sf(st_sfc(b))
a$station <- c(1, 2)
b$type <- "A"
ab_union <- st_union(a, b)
In this simple example the resulting sf object 'ab_union' will contain only two polygons, not the expected five. Can I get the desired result, with five features as in the figure above, using functions in the sf package?

I didn't find a function that does everything in one step, but here is a way to solve your problem:
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
library(tidyverse)
a1 <- st_polygon(list(rbind(c(0, 10), c(45, 10), c(45, 90), c(0, 90), c(0, 10))))
a2 <- st_polygon(list(rbind(c(45, 10), c(90,10), c(90, 90), c(45, 90), c(45, 10))))
b1 <- st_polygon(list(rbind(c(15, 5), c(75, 5), c(75, 50), c(15, 50), c(15, 5))))
a <- st_sf(station=c(1, 2), geometry=st_sfc(a1, a2))
b <- st_sf(type="A", geometry=st_sfc(b1))
st_agr(a) = "constant" #to avoid warnings, but see https://github.com/r-spatial/sf/issues/406
st_agr(b) = "constant"
#Operations
plot(st_geometry(st_union(a,b)))
op1 <- st_difference(a,st_union(b)) #notice the use of st_union()
plot(st_geometry(op1), border="red", add=TRUE)
op2 <- st_difference(b, st_union(a)) #notice the order of b and a and st_union()
plot(st_geometry(op2), border="green", add=TRUE)
op3 <- st_intersection(b, a) #notice the order of b and a
plot(st_geometry(op3), border="blue", add=TRUE)
union <- rbind(op1, op2, op3) #Error because op1 (op2) doesn't have the column "type" ("station")
#> Error in match.names(clabs, names(xi)): names do not match previous names
op11 <- dplyr::mutate(op1, type=NA)
op22 <- dplyr::mutate(op2, station=NA)
union <- rbind(op11, op22, op3)
(as.data.frame(union)) #The row names must be ordered.
#> station type geometry
#> 1 1 <NA> POLYGON ((15 10, 0 10, 0 90...
#> 2 2 <NA> POLYGON ((45 50, 45 90, 90 ...
#> 3 NA A POLYGON ((75 10, 75 5, 15 5...
#> 11 1 A POLYGON ((15 10, 15 50, 45 ...
#> 1.1 2 A POLYGON ((45 50, 75 50, 75 ...
plot(union)
#Another approach, which avoids creating the new columns, would be:
union2 <- dplyr::bind_rows(op1, op2, op3) #But see the discussion here: https://github.com/r-spatial/sf/issues/49
#> Warning in bind_rows_(x, .id): Vectorizing 'sfc_POLYGON' elements may not
#> preserve their attributes
#> Warning in bind_rows_(x, .id): Vectorizing 'sfc_POLYGON' elements may not
#> preserve their attributes
#> Warning in bind_rows_(x, .id): Vectorizing 'sfc_POLYGON' elements may not
#> preserve their attributes
Created on 2019-04-06 by the reprex package (v0.2.1)
The discussions I refer to:
https://github.com/r-spatial/sf/issues/406
https://github.com/r-spatial/sf/issues/49

I also needed such a function, so here is my version for arbitrary sf objects, which avoids explicitly adding the missing columns. It additionally needs the 'plyr' package (for the function rbind.fill):
my_union <- function(a, b) {
  # Function doing a real GIS union operation such as in QGIS or ArcGIS
  #
  # a - the first sf object
  # b - the second sf object
  st_agr(a) <- "constant"
  st_agr(b) <- "constant"
  op1 <- st_difference(a, st_union(b))
  op2 <- st_difference(b, st_union(a))
  op3 <- st_intersection(b, a)
  union <- plyr::rbind.fill(op1, op2, op3)
  return(st_as_sf(union))
}
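A quick usage sketch with the objects a and b built in the question above (assuming plyr is installed; ab_full is just an illustrative name):
ab_full <- my_union(a, b)
nrow(ab_full) # should be 5, matching the five polygons in the ArcGIS output
plot(ab_full)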

Related

Why is the intersect function from the terra R package not giving all the combinations?

I want to calculate the area of every possible combination of two classified rasters. I am using the following code:
library(terra)
#First create two rasters
r1 <- r2 <- rast(nrow=100, ncol=100)
#Assign random cell values
set.seed(123)
values(r1) <- runif(ncell(r1), min=0, max=1)
values(r2) <- runif(ncell(r2), min=0, max=1)
# classify the values into two groups
m_r1 <- c(min(global(r1, "min", na.rm=TRUE)), 0.2, 1,
          0.2, max(global(r1, "max", na.rm=TRUE)), 2)
m_r2 <- c(min(global(r2, "min", na.rm=TRUE)), 0.2, 1,
          0.2, max(global(r2, "max", na.rm=TRUE)), 2)
#Reclassify the rasters
rclmat_r1 <- matrix(m_r1, ncol=3, byrow=TRUE)
rc_r1 <- classify(r1, rclmat_r1, include.lowest=TRUE)
rclmat_r2 <- matrix(m_r2, ncol=3, byrow=TRUE)
rc_r2 <- classify(r2, rclmat_r2, include.lowest=TRUE)
plot(rc_r1)
plot(rc_r2)
#Convert to polygons
r1_poly <- as.polygons(rc_r1, dissolve=TRUE)
r2_poly <- as.polygons(rc_r2, dissolve=TRUE)
plot(r1_poly)
plot(r2_poly)
#Perform intersections
x <- intersect(r1_poly, r2_poly)
x
#> class : SpatVector
#> geometry : polygons
#> dimensions : 2747, 2 (geometries, attributes)
#> extent : -180, 180, -90, 90 (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84
#> names : lyr.1 lyr.1
#> type : <int> <int>
#> values : 1 1
#> 1 2
#> 2 1
As you can see from the output, one combination, i.e. 2-2, is missing. Why is this happening?
When I try to calculate the area for each combination using expanse(x), it returns a long result. How can I get the area in km2 for the following combinations?
Combination Area (km2)
1-1
1-2
2-1
2-2
With this example it would be better to stay with raster data.
x = 10 * rc_r1 + rc_r2
a = cellSize(x, unit="km")
zonal(a, x, sum)
# lyr.1 area
#1 11 19886611
#2 12 81946082
#3 21 84763905
#4 22 323469024
By multiplying by 10, the values in the first layer become 10 (if they were 1) or 20 (if they were 2). If you then add the second layer, you get 10 + 1 or 2, and 20 + 1 or 2, so you end up with four classes: 11, 12, 21, and 22. These show the value in the first raster (first digit) and in the second raster (second digit).
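The same encoding on plain vectors, just to make the arithmetic explicit:
10 * c(1, 1, 2, 2) + c(1, 2, 1, 2)
#> [1] 11 12 21 22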
When you show a SpatVector, only the first three records are printed, and there is in fact a 2-2 record. Nevertheless, intersect did not work properly, and I have now fixed this.

Is it possible to write a single function to create raster files from a data.frame object?

I have loaded a data.frame object in R named "prec" with 1009549 rows and 8 variables. I want to create 60 raster layers of the cumulative "prec" variable values for each x-y coordinate pair at every 4 time steps (the "tstep" variable, from index 2 to 241), as summarized in the code below. I achieved this in 3 steps, with a separate line of code for each of the 60 layers per step. However, is it possible to write a single function for each step, or a single function for the entire code (steps 1 to 4)?
Load required packages:
library(data.table)
library(raster)
structure of the "prec" data.frame
> headTail(prec)
x y prec index tstep variable level date
1 -47.8 -21.2 0 1 1 prec 1000 2015-01-01 00:00:00
1.1 -47.6 -21.2 0 1 1 prec 1000 2015-01-01 00:00:00
1.2 -47.4 -21.2 0 1 1 prec 1000 2015-01-01 00:00:00
1.3 -47.2 -21.2 0 1 1 prec 1000 2015-01-01 00:00:00
... ... ... ... ... ... <NA> ... <NA>
241.4185 -36.8 -7.2 0 241 241 prec 1000 2015-01-01 00:00:59
241.4186 -36.6 -7.2 0 241 241 prec 1000 2015-01-01 00:00:59
241.4187 -36.4 -7.2 0 241 241 prec 1000 2015-01-01 00:00:59
241.4188 -36.2 -7.2 0 241 241 prec 1000 2015-01-01 00:01:00
Step 1: subset by tstep
prec_1 <- prec[prec$tstep %in% c(2, 3, 4, 5),]
prec_2 <- prec[prec$tstep %in% c(6, 7, 8, 9),]
prec_3 <- prec[prec$tstep %in% c(10, 11, 12, 13),]
...
prec_60 <- prec[prec$tstep %in% c(238, 239, 240, 241),]
Step 2: coerce to data.table
prec_1_sum <- setDT(prec_1)[, list(prec_sum_1 = sum(prec*1000)), list(x, y)]
prec_2_sum <- setDT(prec_2)[, list(prec_sum_2 = sum(prec*1000)), list(x, y)]
prec_3_sum <- setDT(prec_3)[, list(prec_sum_3 = sum(prec*1000)), list(x, y)]
...
prec_60_sum <- setDT(prec_60)[, list(prec_sum_60 = sum(prec*1000)), list(x, y)]
Step 3: create n raster layers
layer_1 <- rasterFromXYZ(prec_1_sum[, 1:3], res = c(0.20, 0.20), crs = sp::CRS("+init=epsg:4326"))
layer_2 <- rasterFromXYZ(prec_2_sum[, 1:3], res = c(0.20, 0.20), crs = sp::CRS("+init=epsg:4326"))
layer_3 <- rasterFromXYZ(prec_3_sum[, 1:3], res = c(0.20, 0.20), crs = sp::CRS("+init=epsg:4326"))
...
layer_60 <- rasterFromXYZ(prec_60_sum[, 1:3], res = c(0.20, 0.20), crs = sp::CRS("+init=epsg:4326"))
Step 4: stack raster layers
stack_prec <- stack(layer_1, layer_2, layer_3, layer_4, layer_5, layer_6, layer_7, layer_8, layer_9, layer_10,
layer_11, layer_12, layer_13, layer_14, layer_15, layer_16, layer_17, layer_18, layer_19, layer_20,
layer_21, layer_22, layer_23, layer_24, layer_25, layer_26, layer_27, layer_28, layer_29, layer_30,
layer_31, layer_32, layer_33, layer_34, layer_35, layer_36, layer_37, layer_38, layer_39, layer_40,
layer_41, layer_42, layer_43, layer_44, layer_45, layer_46, layer_47, layer_48, layer_49, layer_50,
layer_51, layer_52, layer_53, layer_54, layer_55, layer_56, layer_57, layer_58, layer_59, layer_60)
It’s always much easier to help when we have sample data we can use. In the future you can use dput(prec) and copy and paste that output for people to use. At the very least some sample data is useful, particularly when you’re using functions that have certain specifications for what the data should look like. Here we generate some data to work with.
library(raster)
#> Loading required package: sp
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following object is masked from 'package:raster':
#>
#> shift
set.seed(1)
dat <-
data.frame(
x = rep(seq(-47.8, -47.2, by = 0.2), 241),
y = -21.2,
prec = runif(964),
tstep = rep(1:241, each = 4),
date = c(rep(as.Date("2015-01-01"), 4), rep(seq(as.Date("2015-01-01"), by = "day", length.out = 60), each = 16))
)
For your process, it seems a bit more straightforward to group the data rather than break it up. That way you only have to perform the operations on one data set rather than do it many times over. Steps 1 and 2 can be reduced to only a few lines that way. Without thinking too much about optimizing this, I’ve looped over the groups created in the first step to create the raster layers.
raster_layers <- function(dat){
  ## some flexibility if there is a differing number of tsteps
  ## it will by default exclude the first tstep as in your example
  min_tstep <- min(dat$tstep)
  max_tstep <- max(dat$tstep)
  breaks <- seq(min_tstep, max_tstep, by = 4)

  ## Step 1
  dat$group <- cut(dat$tstep, breaks)
  dat <- dat[!is.na(dat$group), ]

  ## Step 2
  prec <- setDT(dat)[, list(prec_sum = sum(prec * 1000)), by = list(group, x, y)]

  ## Step 3
  layer <- list()
  group <- unique(prec$group)
  j <- 1
  for (i in group){
    raster_dat <- prec[prec$group %in% i, c("x", "y", "prec_sum")]
    ## looks like your plot uses the names for changing labels??
    colnames(raster_dat)[colnames(raster_dat) == "prec_sum"] <- paste0("prec_sum_", j)
    layer[[j]] <- rasterFromXYZ(raster_dat,
                                res = c(0.20, 0.20),
                                crs = sp::CRS("+init=epsg:4326"))
    j <- j + 1
  }

  ## Step 4
  stack_prec <- stack(unlist(layer))
  return(stack_prec)
}
Example
stack_prec <- raster_layers(dat = dat)
stack_prec
#> class : RasterStack
#> dimensions : 1, 4, 4, 60 (nrow, ncol, ncell, nlayers)
#> resolution : 0.2, 0.2 (x, y)
#> extent : -47.9, -47.1, -21.3, -21.1 (xmin, xmax, ymin, ymax)
#> crs : +init=epsg:4326
#> names : prec_sum_1, prec_sum_2, prec_sum_3, prec_sum_4, prec_sum_5, prec_sum_6, prec_sum_7, prec_sum_8, prec_sum_9, prec_sum_10, prec_sum_11, prec_sum_12, prec_sum_13, prec_sum_14, prec_sum_15, ...
#> min values : 2112.4990, 1124.8232, 2007.5945, 1315.0517, 1729.9294, 1582.8684, 1524.0147, 1098.1529, 2008.5390, 1248.1860, 1680.0199, 1855.4024, 815.4047, 1204.8576, 1416.3943, ...
#> max values : 2336.186, 2565.158, 2877.219, 2318.115, 3017.609, 2540.536, 2569.019, 2690.884, 2327.706, 2288.046, 3104.792, 2639.530, 2358.953, 2599.245, 2618.676, ...
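If the goal beyond stacking is to end up with raster files on disk, one possible finishing step (a sketch; check the bylayer/suffix behaviour of writeRaster in your version of raster) is to write the layers out in one call:
# write one GeoTIFF per layer, suffixed with the layer names
writeRaster(stack_prec, filename = "prec_sum.tif", format = "GTiff",
            bylayer = TRUE, suffix = "names")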

Check if polygons intersect in R

Many users have asked how to check whether polygons intersect, however, current answers to those questions are not working for my case.
I have two objects. One is a "Large SpatialPolygons" named "farms". This object has several polygons (2011 in total), and each polygon indicates the limits of a different farm (see screenshot).
My second object is a "Large SpatialPolygonsDataFrame" named slope_RJ_100m. That object divides a large area into several square polygons with an area of 10,000 m^2 each (310000 polygons in total).
For each of the squares (polygons) in "slope_RJ_100m", I would like to know whether they intersect any of the polygons in "farms". In other words, I want to know whether each particular square in "slope_RJ_100m" has a farm inside (even if just a piece of a farm).
I was expecting the outcome to be something with 310000 rows and two variables, one indicating the polygon in slope_RJ_100m, and the other with TRUE or FALSE for whether that polygon has a farm.
I have tried:
inters = gIntersection(slope_RJ_100m, farms)
This one produces an output of about 1500 polygons.
I am not sure how to use this to know which of my 310000 polygons has a farm in it.
inters = raster::intersect(slope_RJ_100m, farms)
The output has 29144 polygons. As in the previous case, not sure how I can use this to know whether the square has a farm.
and
inters = st_intersects(slope_RJ_100m, farms)
Error in UseMethod("st_intersects") :
no applicable method for 'st_intersects' applied to an object of class "c('SpatialPolygonsDataFrame', 'SpatialPolygons', 'Spatial', 'SpatialPolygonsNULL', 'SpatialVector', 'SPNULL')"
I am sure my question is trivial and I apologize beforehand.
Here is an example with terra (to run this you need terra 1.1-17)
library(terra)
# polygons
p1 <- vect("POLYGON ((0 0, 8 0, 8 9, 0 9, 0 0))")
p2 <- vect("POLYGON ((5 6, 15 6, 15 15, 5 15, 5 6))")
p3 <- vect("POLYGON ((8 2, 9 2, 9 3, 8 3, 8 2))")
p4 <- vect("POLYGON ((2 6, 3 6, 3 8, 2 8, 2 6))")
p5 <- vect("POLYGON ((2 12, 3 12, 3 13, 2 13, 2 12))")
p6 <- vect("POLYGON ((10 4, 12 4, 12 7, 11 7, 11 6, 10 6, 10 4))")
p <- rbind(p1, p2, p3, p4, p5, p6)
plot(p, col=rainbow(6, alpha=.5))
lines(p, lwd=2)
text(p)
relate(rbind(p1, p2), rbind(p3,p4,p5,p6), "intersects")
# [,1] [,2] [,3] [,4]
#[1,] TRUE TRUE FALSE FALSE
#[2,] FALSE FALSE FALSE TRUE
With your SpatialPolygons* you should be able to do the following:
s <- vect(slope_RJ_100m)
f <- vect(farms)
x <- relate(s, f, "intersects")
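Here x is a logical matrix with one row per square in s and one column per farm in f. To get the single TRUE/FALSE per square you describe (does this square touch any farm at all), collapse the rows; has_farm is just an illustrative name:
has_farm <- apply(x, 1, any) # TRUE if the square intersects at least one farm
out <- data.frame(square = 1:nrow(x), has_farm = has_farm)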
Likewise, to use st_intersects, you need to use sf objects, not Spatial* objects. Something like
library(sf)
ss <- st_as_sf(slope_RJ_100m)
ff <- st_as_sf(farms)
inters <- st_intersects(ss, ff)
With the example data from above
s1 <- st_as_sf(rbind(p1, p2))
s2 <- st_as_sf(rbind(p3,p4,p5,p6))
st_intersects(s1, s2)
#Sparse geometry binary predicate list of length 2, where the predicate was `intersects'
# 1: 1, 2
# 2: 4
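st_intersects returns a sparse list holding, for each geometry in ss, the indices of the intersecting geometries in ff. To turn that into the TRUE/FALSE vector you asked for, check which elements are non-empty:
has_farm <- lengths(inters) > 0 # TRUE where a square intersects at least one farm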

R raster calculate areal weighted mean when scaling to a larger resolution with offset

I have two raster grids in R with different resolutions which don't line up exactly. In actual fact I have hundreds of each so any answer must be easily run many times.
I want to scale the finer resolution grid up to the coarser resolution by taking an areal weighted mean of the grid cells.
I was hoping I could use projectRaster or resample but neither give the desired output and I cannot use aggregate as I need my new grids to align to the coarser resolution grid.
For my real data my finer grid is 0.005 deg intervals and coarser is at 0.02479172 deg intervals and extents/origins don't exactly match up.
I've made an extreme version as an example of why neither resample nor projectRaster works:
library(raster)
#> Warning: package 'raster' was built under R version 3.5.3
#> Loading required package: sp
testproj <- "+proj=lcc +lat_1=48 +lat_2=33 +lon_0=-100 +ellps=WGS84"
testmat <- matrix(1, nrow = 8, ncol = 8)
# testmat <- matrix(sample(1:10, 64, replace = T), nrow = 8, ncol = 8)
testmat[1,5] <- 400
testmat[8,4] <- -400
testsmallraster <- raster(testmat, xmn=0, xmx=8, ymn=0, ymx=8)
crs(testsmallraster) <- testproj
plot(testsmallraster)
testlarger <- raster(matrix(rep(NA,4), nrow = 2, ncol = 2), xmn=0.3, xmx=8.3, ymn=0, ymx=8)
crs(testlarger) <- testproj
tout_reproj <- projectRaster(testsmallraster, testlarger)
tout_resamp <- resample(testsmallraster, testlarger)
tout_resampngb <- resample(testsmallraster, testlarger, method = "ngb")
tout_agg <- aggregate(testsmallraster, fact = 4)
#reprojected values ignore all but 4 cells closest to new centre
values(tout_reproj)
#> [1] 1 1 1 1
#resample uses bilinear interpolation, which weights the grid cells farthest from the new centre less than those closest
# I need all grid cells entirely contained in the new grid cell to have equal weighting
#bilinear interpolation also gives weight to cells which do not fall within the new cell at all, which I do not want
values(tout_resamp)
#> [1] 10.851852 15.777778 -7.911111 -12.366667
#aggregate gives values close to what I want, but they are not in the new raster origin/resolution, and therefore values that fall across grid boundaries are not split
values(tout_agg)
#> [1] 1.0000 25.9375 -24.0625 1.0000
#using ngb was never really going to make any sense, but thought I'd add it for completeness
values(tout_resampngb)
#> [1] 1 1 1 1
#desired output, first cell: only 0.3 of the cell equal to 400 falls inside; the rest equal 1
#desired output, second cell: 0.7 of the cell equal to 400 falls inside; the rest equal 1
#desired output, third cell: exactly 1 grid cell of -400 and 15 cells of 1
#desired output, fourth cell: only overlaps grid cells equal to 1
desiredoutput <- raster(matrix(c((15.7*1+0.3*400)/16,(15.3*1+0.7*400)/16,mean(c(-400, rep(1,15))),1),byrow = T, nrow = 2, ncol = 2), xmn=0.3, xmx=8.3, ymn=0, ymx=8)
values(desiredoutput)
#> [1] 8.48125 18.45625 -24.06250 1.00000
Created on 2020-07-02 by the reprex package (v0.3.0)
You can get closer to the desired result by using a similar spatial resolution for resample, and then aggregating the results:
library(raster)
testproj <- "+proj=lcc +lat_1=48 +lat_2=33 +lon_0=-100 +datum=WGS84"
testmat <- matrix(1, nrow = 8, ncol = 8)
testmat[1,5] <- 400
testmat[8,4] <- -400
testsmallraster <- raster(testmat, xmn=0, xmx=8, ymn=0, ymx=8, crs=testproj)
testlarger <- raster(matrix(rep(NA,4), nrow = 2, ncol = 2), xmn=0.3, xmx=8.3, ymn=0, ymx=8, crs = testproj)
y <- disaggregate(testlarger, 4)
z <- resample(testsmallraster, y)
za <- aggregate(z, 4)
values(za)
#[1] 8.48125 18.45625 -24.06250 1.00000
For much better speed, try terra:
library(terra)
a <- rast(testsmallraster)
b <- rast(testlarger)
b <- disaggregate(b, 4)
d <- resample(a, b)
da <- aggregate(d, 4)
values(da)
# layer
#[1,] 8.48125
#[2,] 18.45625
#[3,] -24.06250
#[4,] 1.00000
This probably ought to be done automatically by resample and projectRaster. raster attempts to do some of this for resample, but in this case not very satisfactorily.
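As an aside, and this is an assumption you should verify against your terra version rather than something shown above: more recent terra releases expose GDAL's area-weighted resampling directly, which may replace the disaggregate/aggregate workaround:
# hedged sketch: method="average" requests an (approximately) area-weighted mean
d2 <- resample(a, rast(testlarger), method = "average")
values(d2)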
When I needed to do similar resampling, this worked for me. This example is a 4-cell destination grid at 1° x 1° spacing with centroids at half degrees (to match some satellite data), and an offset half-degree grid for source data (ECMWF weather). resample does the heavy lifting of interpolating on the mismatched grids. The code below is basically a manual version of a 'weights=' option that doesn't exist for resample. We need relative, not absolute, areas to be correct for weighting, so the caveat on the precision of raster::area described in the help seems of low concern.
library(raster)
wgs84 <- "+init=epsg:4326"
polar.brick.source <- array(dim = c(5, 5, 2), rep(c(1, 2), each = 25))
dimnames(polar.brick.source)[[1]] <- seq(-1, 1, by = .5)
dimnames(polar.brick.source)[[2]] <- seq(80, 82, by = .5)
dimnames(polar.brick.source)[[3]] <- c("time.a", "time.b")
# Add some outliers to see their effects.
polar.brick.source[1, 2, ] <- c(25, 50)
polar.brick.source[3, 2, 2] <- -30
polar.brick <- brick(polar.brick.source, crs = CRS(wgs84),
xmn = min(as.numeric(dimnames(polar.brick.source)[[1]])) - .25,
xmx = max(as.numeric(dimnames(polar.brick.source)[[1]])) + .25,
ymn = min(as.numeric(dimnames(polar.brick.source)[[2]])) - .25,
ymx = max(as.numeric(dimnames(polar.brick.source)[[2]])) + .25)
fine.polar.area <- raster::area(polar.brick)
polar.one.degree.source <- data.frame(
lon = c(-.5, .5, -.5, .5),
lat = c(80.5, 80.5, 81.5, 81.5),
placeholder = rep(1, 4))
polar.one.degree.raster <- rasterFromXYZ(polar.one.degree.source, crs = CRS(wgs84))
polar.one.degree.area <- raster::area(polar.one.degree.raster)
as.data.frame(polar.one.degree.area, xy = T)
fine.clip.layer <- disaggregate(polar.one.degree.raster, 2)
clipped.fine.polar <- resample(polar.brick * fine.polar.area,
                               fine.clip.layer)
new.weighted.wx <- aggregate(clipped.fine.polar * 4, 2)
as.data.frame(new.weighted.wx, xy = T) # look at partial results.
new.weather <- new.weighted.wx / polar.one.degree.area
as.data.frame(new.weather, xy = T)

Calculate different sets of values in a variable into a data frame

I am trying to figure out how to calculate the average, median and standard deviation for each value of each variable. Here is some of the data (thanks to @Barranka for providing the data in an easy-to-copy format):
df <- data.frame(
gama=c(10, 1, 1, 1, 1, 1, 10, 0.1, 10),
theta=c(1, 1, 1, 1, 0.65, 1, 0.65, 1, 1),
detectl=c(3, 5, 1, 1, 5, 3, 5, 5, 1),
NSMOOTH=c(10, 5, 20, 20, 5, 20, 10, 10, 40),
NREF=c(50, 80, 80, 50, 80, 50, 10, 100, 30),
NOBS=c(10, 40, 40, 20, 20, 20, 10, 40, 10),
sma=c(15, 15, 15, 15, 15, 15, 15, 15, 15),
lma=c(33, 33, 33, 33, 33, 33, 33, 33, 33),
PosTrades=c(11, 7, 6, 3, 9, 3, 6, 6, 5),
NegTrades=c(2, 2, 1, 0, 1, 0, 1, 5, 1),
Acc=c(0.846154, 0.777778, 0.857143, 1, 0.9, 1, 0.857143, 0.545455, 0.833333),
AvgWin=c(0.0451529, 0.0676022, 0.0673241, 0.13204, 0.0412913, 0.126522, 0.0630061, 0.0689745, 0.0748437),
AvgLoss=c(-0.0194498, -0.0083954, -0.0174653, NaN, -0.00264179, NaN, -0.0161558, -0.013903, -0.0278908),
Return=c(1.54942, 1.54916, 1.44823, 1.44716, 1.42789, 1.42581, 1.40993, 1.38605, 1.38401)
)
To save it to csv later, I have to make it into a data frame that is supposed to look like this:
Table for gama
Value Average Median Standard Deviation
10 (Avg of 10) (median of 10) (Stdev of 10)
1 (Avg of 1) (median of 1) (Stdev of 1)
0.1 (Avg of 0.1) (median of 0.1) (Stdev of 0.1)
Table for theta
Value Average Median Standard Deviation
1 (Avg of 1) (median of 1) (Stdev of 1)
0.65 (Avg of 0.65) (median of 0.65) (Stdev of 0.65)
Table for detectl
Value Average Median Standard Deviation
3 (Avg of 3) (median of 3) (Stdev of 3)
5 (Avg of 5) (median of 5) (Stdev of 5)
...
The columns to be used as ID's are:
ids <- c("gama", "theta","detectl", "NSMOOTH", "NREF", "NOBS", "sma", "lma")
Summary statistics should be computed over the following columns:
vals <- c("PosTrades", "NegTrades", "Acc", "AvgWin", "AvgLoss", "Return")
I have tried using the data.table package, but I cannot figure out how to develop an approach using data.table without renaming values one by one; also, when pursuing this approach, my code gets very complicated.
Clever use of melt() and tapply() can help you. I made the following assumptions:
You have to get the mean, median and standard deviation of the last four columns (Acc, AvgWin, AvgLoss, Return)
You need to group the data for each of the first ten columns (gama, theta, ..., NegTrades)
For reproducibility, here's the input:
# Your example data
df <- data.frame(
gama=c(10, 1, 1, 1, 1, 1, 10, 0.1, 10),
theta=c(1, 1, 1, 1, 0.65, 1, 0.65, 1, 1),
detectl=c(3, 5, 1, 1, 5, 3, 5, 5, 1),
NSMOOTH=c(10, 5, 20, 20, 5, 20, 10, 10, 40),
NREF=c(50, 80, 80, 50, 80, 50, 10, 100, 30),
NOBS=c(10, 40, 40, 20, 20, 20, 10, 40, 10),
sma=c(15, 15, 15, 15, 15, 15, 15, 15, 15),
lma=c(33, 33, 33, 33, 33, 33, 33, 33, 33),
PosTrades=c(11, 7, 6, 3, 9, 3, 6, 6, 5),
NegTrades=c(2, 2, 1, 0, 1, 0, 1, 5, 1),
Acc=c(0.846154, 0.777778, 0.857143, 1, 0.9, 1, 0.857143, 0.545455, 0.833333),
AvgWin=c(0.0451529, 0.0676022, 0.0673241, 0.13204, 0.0412913, 0.126522, 0.0630061, 0.0689745, 0.0748437),
AvgLoss=c(-0.0194498, -0.0083954, -0.0174653, NaN, -0.00264179, NaN, -0.0161558, -0.013903, -0.0278908),
Return=c(1.54942, 1.54916, 1.44823, 1.44716, 1.42789, 1.42581, 1.40993, 1.38605, 1.38401)
)
And here's my proposed solution:
library(reshape)
md <- melt(df, id=colnames(df)[1:10]) # This will create one row for each
# 'id' combination, and will store
# the rest of the column headers
# in the `variable` column, and
# each value corresponding to the
# variable. Like this:
head(md)
## gama theta detectl NSMOOTH NREF NOBS sma lma PosTrades NegTrades variable value
## 1 10 1.00 3 10 50 10 15 33 11 2 Acc 0.846154
## 2 1 1.00 5 5 80 40 15 33 7 2 Acc 0.777778
## 3 1 1.00 1 20 80 40 15 33 6 1 Acc 0.857143
## 4 1 1.00 1 20 50 20 15 33 3 0 Acc 1.000000
## 5 1 0.65 5 5 80 20 15 33 9 1 Acc 0.900000
## 6 1 1.00 3 20 50 20 15 33 3 0 Acc 1.000000
results <- list()                       # Prepare the results list
for(i in unique(md$variable)) {         # For each variable you have...
  results[[i]] <- list()                # ... create a new list to hold the 'summary'
  tmp_data <- subset(md, variable==i)   # Filter the data you'll use
  for(j in colnames(tmp_data)[1:10]) {  # For each id column, use tapply() to get what
                                        # you need, and store it into a data frame
                                        # inside the results
    results[[i]][[j]] <- as.data.frame(
      t(
        rbind(
          tapply(tmp_data$value, tmp_data[,j], mean),
          tapply(tmp_data$value, tmp_data[,j], median),
          tapply(tmp_data$value, tmp_data[,j], sd))
      )
    )
    colnames(results[[i]][[j]]) <- c('average', 'median', 'sd')
  }
  rm(tmp_data)                          # You'll no longer need this
}
Now what? Check out the summary for results:
summary(results)
## Length Class Mode
## Acc 10 -none- list
## AvgWin 10 -none- list
## AvgLoss 10 -none- list
## Return 10 -none- list
You have a list for each variable. Now, if you check out the summary for any results "sublist", you'll see this:
summary(results$Acc)
## Length Class Mode
## gama 3 data.frame list
## theta 3 data.frame list
## detectl 3 data.frame list
## NSMOOTH 3 data.frame list
## NREF 3 data.frame list
## NOBS 3 data.frame list
## sma 3 data.frame list
## lma 3 data.frame list
## PosTrades 3 data.frame list
## NegTrades 3 data.frame list
See what happens when you peek into the results$Acc$gama list:
results$Acc$gama
## average median sd
## 0.1 0.5454550 0.545455 NA
## 1 0.9069842 0.900000 0.09556548
## 10 0.8455433 0.846154 0.01191674
So, for each variable and each "id" column, you have the data summary you want.
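Since you mentioned saving to csv later, a minimal sketch for writing these tables out (the file names here are just illustrative):
for (id in names(results$Acc)) {
  write.csv(results$Acc[[id]], paste0("Acc_", id, ".csv"))
}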
Hope this helps.
I have an approach involving data.table.
EDIT: I tried to submit an edit to the question, but I took some liberties so it'll probably get rejected. I made assumptions about which columns were to be used as "id" columns (columns whose values subset data), and which should be "measure" columns (columns whose values are used to calculate the summary statistics). See here for these designations:
ids <- c("gama", "theta","detectl", "NSMOOTH", "NREF", "NOBS", "sma", "lma")
vals <- c("PosTrades", "NegTrades", "Acc", "AvgWin", "AvgLoss", "Return")
Setup
# Convert to data.table
df <- data.table(df)
# Helper function to convert a string to a call
# useful in a data.table j
s2c <- function(x, type = "list"){
  as.call(lapply(c(type, x), as.symbol))
}
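For example, s2c(c("PosTrades", "NegTrades")) builds the unevaluated call list(PosTrades, NegTrades); eval()-ing that inside a data.table j selects those columns:
s2c(c("PosTrades", "NegTrades"))
## list(PosTrades, NegTrades)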
# Function to compute the desired summary stats
smry <- function(x) list(Average=mean(x, na.rm=T), Median=median(x, na.rm=T), StandardDeviation=sd(x, na.rm=T))
# Define some names to use later
ids <- c("gama", "theta","detectl", "NSMOOTH", "NREF", "NOBS", "sma", "lma")
vals <- c("PosTrades", "NegTrades", "Acc", "AvgWin", "AvgLoss", "Return")
usenames <- paste(rep(c("Average","Median","StdDev"),each=length(vals)), vals,sep="_")
Calculations in data.table
# Compute the summary statistics
df2 <- df[, j = {
  for(i in 1:length(ids)){  # loop through each id
    t.id <- ids[i]
    t.out <- .SD[, j = {
      t.vals <- .SD[, eval(s2c(vals))]  # returns a data.table with each of vals as a column
      sapply(t.vals, smry)              # apply the summary statistics
    }, by = t.id]  # this by= loops through each value of the current id (t.id)
    setnames(t.out, c("id.val", usenames))  # fix the names of the data.table returned for this i
    t.out <- cbind(id = t.id, t.out)        # add a column indicating the variable name (t.id)
    if(i == 1){ big.out <- t.out } else { big.out <- rbind(big.out, t.out) }  # accumulate the output
  }
  big.out
}]
Formatting
df2 <- data.table:::melt.data.table(df2, id.vars=c("id","id.val")) # melt into "long" format
df2[,c("val","metric"):=list(gsub(".*_","",variable),gsub("_.*","",variable))] # splice names to create id's
df2[,variable:=NULL] # delete old column that had the names we just split up
df2 <- data.table:::dcast.data.table(df2, id+id.val+val~metric) # go a bit wider, so stats in diff columns
# reshape2:::acast(df2, id+id.val~metric~val) # maybe replace the above line with this
Result
id id.val val Average Median StdDev
1: NOBS 10 Acc 3.214550 0.01191674 0.006052701
2: NOBS 10 AvgLoss 1.000000 0.06300610 1.409930000
3: NOBS 10 AvgWin 1.333333 0.06100090 1.447786667
4: NOBS 10 NegTrades 6.000000 0.84615400 -0.019449800
5: NOBS 10 PosTrades 7.333333 0.84554333 -0.021165467
---
128: theta 1 AvgLoss 1.000000 0.06897450 1.447160000
129: theta 1 AvgWin 1.571429 0.08320849 1.455691429
130: theta 1 NegTrades 6.000000 0.84615400 -0.017465300
131: theta 1 PosTrades 5.857143 0.83712329 -0.017420860
132: theta 1 Return 1.718249 0.03285638 0.068957635
