How to properly rasterize irregular data points on the EASE2 grid? - r

I was given data with 3 columns (lon, lat, data) which I would like to turn into a SpatRaster.
library(terra)
#> terra 1.5.21
df <- read.csv("data.csv")
head(df)
#> lon lat mean_daily_par_mol
#> 1 -7.52993 61.91259 0.5463984
#> 2 -7.16531 61.77965 0.5635693
#> 3 -6.80426 61.64570 0.6404579
#> 4 -6.44676 61.51075 0.6602439
#> 5 -6.09277 61.37483 0.6801313
#> 6 -5.74228 61.23794 0.6358355
Unfortunately, the data is not on a regular grid, so I cannot create a raster directly. The data is on the MODIS ISIN grid.
rast(df)
#> Error: [raster,matrix(xyz)] x cell sizes are not regular
e <- ext(c(min(df$lon), max(df$lon), min(df$lat), max(df$lat)))
e
#> SpatExtent : -180, 179.69196, 60.00299, 89.41164 (xmin, xmax, ymin, ymax)
r <- rast(e, ncols = 100, nrows = 100)
r <- rasterize(
as.matrix(df[c("lon", "lat")]),
r,
df$mean_daily_par_mol,
fun = mean,
na.rm = TRUE
)
plot(r)
Obviously, changing ncols and nrows considerably influences how the raster is created.
r <- rast(e, ncols = 300, nrows = 300)
r <- rasterize(
as.matrix(df[c("lon", "lat")]),
r,
df$mean_daily_par_mol,
fun = mean,
na.rm = TRUE
)
plot(r)
My question is: how can I find the optimal ncols and nrows so that there is approximately one value per pixel? The ultimate goal is to reproject this data onto the EASE2 grid.
The data is available here
Created on 2022-05-29 by the reprex package (v2.0.1)
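One rough way to pick ncols and nrows is to ask for about as many cells as there are points and to split them according to the aspect ratio of the extent. The sketch below uses only base R; grid_dims() is a hypothetical helper (not a terra function), and the simulated coordinates stand in for df$lon and df$lat. It only gives about one point per cell on average: because the points come from a sinusoidal grid, their density in lon/lat space is probably not uniform, so some cells may stay empty while others hold several points.

```r
# Hypothetical helper: grid dimensions giving ~1 point per cell on average
grid_dims <- function(x, y) {
  n <- length(x)
  width  <- diff(range(x))
  height <- diff(range(y))
  # ncols / nrows should follow width / height, and ncols * nrows should be ~n
  ncols <- max(1, round(sqrt(n * width / height)))
  nrows <- max(1, round(n / ncols))
  c(ncols = ncols, nrows = nrows)
}

# Simulated stand-in for df$lon and df$lat (assumption, not the real data)
set.seed(1)
lon <- runif(10000, -180, 180)
lat <- runif(10000, 60, 90)

dims <- grid_dims(lon, lat)
dims
```

The result could then be passed on as rast(e, ncols = dims[["ncols"]], nrows = dims[["nrows"]]).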

Related

stack geotiff with stars 'along' when 'band' dimension contains band + time information

I have a timeseries of geotiff files I'd like to stack in R using stars. Here's the first two:
urls <- paste0("/vsicurl/",
"https://sdsc.osn.xsede.org/bio230014-bucket01/neon4cast-drivers/",
"noaa/gefs-v12/cogs/gefs.20221201/",
c("gep01.t00z.pgrb2a.0p50.f003.tif", "gep01.t00z.pgrb2a.0p50.f006.tif"))
library(stars)
stars::read_stars(urls, along="time")
Errors with:
Error in c.stars_proxy(`3` = list(gep01.t00z.pgrb2a.0p50.f003.tif = "/vsicurl/https://sdsc.osn.xsede.org/bio230014-bucket01/neon4cast-drivers/noaa/gefs-v12/cogs/gefs.20221201/gep01.t00z.pgrb2a.0p50.f003.tif"), :
don't know how to merge arrays: please specify parameter along
Context: bands contain both time+band info
This fails because the dimensions do not match, which happens because the files have concatenated temporal information into the band names:
x<- lapply(urls, read_stars)
x
produces:
[[1]]
stars object with 3 dimensions and 1 attribute
attribute(s), summary of first 1e+05 cells:
Min. 1st Qu. Median Mean 3rd Qu. Max.
gep01.t00z.pgrb2a.0p50.f003.ti... 50026.01 98094.81 101138 98347.42 101845.2 104605.2
dimension(s):
from to offset delta refsys point
x 1 720 -180.25 0.5 Coordinate System importe... FALSE
y 1 361 90.25 -0.5 Coordinate System importe... FALSE
band 1 8 NA NA NA NA
values x/y
x NULL [x]
y NULL [y]
band PRES:surface:3 hour fcst,...,DLWRF:surface:0-3 hour ave fcst
[[2]]
stars object with 3 dimensions and 1 attribute
attribute(s), summary of first 1e+05 cells:
Min. 1st Qu. Median Mean 3rd Qu. Max.
gep01.t00z.pgrb2a.0p50.f006.ti... 50029.83 98101.83 101170.6 98337.52 101825 104588.2
dimension(s):
from to offset delta refsys point
x 1 720 -180.25 0.5 Coordinate System importe... FALSE
y 1 361 90.25 -0.5 Coordinate System importe... FALSE
band 1 8 NA NA NA NA
values x/y
x NULL [x]
y NULL [y]
band PRES:surface:6 hour fcst,...,DLWRF:surface:0-6 hour ave fcst
Note the band names would align except for the existence of the timestamp being tacked on, e.g. PRES:surface:3 hour fcst vs PRES:surface:6 hour fcst.
How can I best read in these files so that I have dimensions of x,y,band, and time in my stars object?
alternatives: terra?
How about terra? Note that terra is happy to read these files in directly, but treats them as 16 unique bands. Can I re-align them so that I have the original 8 bands along a new "time" dimension? (I recognize stars emphasizes 'spatio-temporal'; maybe such a cube is out of scope for terra?) Also note that terra for some reason mangles the timestamp in these band names:
x <- terra::rast(urls)
x
class : SpatRaster
dimensions : 361, 720, 16 (nrow, ncol, nlyr)
resolution : 0.5, 0.5 (x, y)
extent : -180.25, 179.75, -90.25, 90.25 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat Coordinate System imported from GRIB file
sources : gep01.t00z.pgrb2a.0p50.f003.tif (8 layers)
gep01.t00z.pgrb2a.0p50.f006.tif (8 layers)
names : PRES:~ fcst, TMP:2~ fcst, RH:2 ~ fcst, UGRD:~ fcst, VGRD:~ fcst, APCP:~ fcst, .
With terra it is pretty easy to make a time-series for each variable as I show below.
urls <- paste0("/vsicurl/",
"https://sdsc.osn.xsede.org/bio230014-bucket01/neon4cast-drivers/",
"noaa/gefs-v12/cogs/gefs.20221201/",
c("gep01.t00z.pgrb2a.0p50.f003.tif", "gep01.t00z.pgrb2a.0p50.f006.tif"))
library(terra)
r <- rast(urls)
Extract two variables of interest
nms <- names(r)
tmp <- r[[grep("TMP", nms)]]
rh <- r[[grep("RH", nms)]]
# set time
tm <- as.POSIXct("2022-12-01", tz="GMT") + c(3,6) * 3600
time(rh) <- tm
time(tmp) <- tm
And you could combine them into a SpatRasterDataset like this:
s <- sds(list(tmp=tmp, rh=rh))
An alternative path to get to the same point would be to start with a SpatRasterDataset and subset it.
sd <- sds(urls)
nl <- 1:length(sd)
nms <- names(sd[1])
tmp2 <- rast(sd[nl, grep("TMP", nms)])
time(tmp2) <- tm
rh2 <- rast(sd[nl, grep("RH", nms)])
time(rh2) <- tm
I made the subsetting work a little nicer in terra version 1.7-5
urls <- paste0("/vsicurl/",
"https://sdsc.osn.xsede.org/bio230014-bucket01/neon4cast-drivers/",
"noaa/gefs-v12/cogs/gefs.20221201/", c("gep01.t00z.pgrb2a.0p50.f003.tif", "gep01.t00z.pgrb2a.0p50.f006.tif"))
library(terra)
#terra 1.7.5
sd <- sds(urls)
tmp <- sd[,2]
tmp
#class : SpatRaster
#dimensions : 361, 720, 2 (nrow, ncol, nlyr)
#resolution : 0.5, 0.5 (x, y)
#extent : -180.25, 179.75, -90.25, 90.25 (xmin, xmax, ymin, ymax)
#coord. ref. : lon/lat Coordinate System imported from GRIB file
#sources : gep01.t00z.pgrb2a.0p50.f003.tif
# gep01.t00z.pgrb2a.0p50.f006.tif
#names : TMP:2 m above g~Temperature [C], TMP:2 m above g~Temperature [C]
#unit : C, C
#time : 2022-12-01 03:00:00 to 2022-12-01 06:00:00 UTC
As for the layer names containing the forecast time, that is just because that is what is in the tif metadata. It looks like that was a decision made when they were created from the original GRIB files.
The latitude extent going beyond the north and south poles is an interesting feature of this dataset.
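Since the forecast time is embedded in the layer names, it can be separated from the variable name with regular expressions. This is a minimal base R sketch, using only the four band names quoted in the question; the patterns are an assumption about the naming scheme and may need adjusting for other bands.

```r
# Band names as printed in the question (first and last band of each file)
band_names <- c(
  "PRES:surface:3 hour fcst", "DLWRF:surface:0-3 hour ave fcst",
  "PRES:surface:6 hour fcst", "DLWRF:surface:0-6 hour ave fcst"
)

# Variable name: everything before the first colon
variable <- sub(":.*$", "", band_names)

# Forecast horizon: the number directly in front of " hour"
horizon <- as.integer(sub("^.*?(\\d+) hour.*$", "\\1", band_names, perl = TRUE))

data.frame(variable, horizon)
```

With the horizon separated out, layers that share a variable name can be matched across files, and the horizon can be added to the forecast start time to build the time stamps.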
Just wanted to share some additional possible solutions for comparison. With larger numbers of files some of these differences become more relevant. This expands a bit beyond my original question.
terra
Prof Hijmans gives a very nice solution in terra. He also asked about the original upstream sources, which I didn't explain properly -- these are originally GRIB files for NOAA GEFS forecast.
Notably, we can work directly from the GRIB files. GEFS is a 35-day forecast, so let's try going more than 6 hrs into the future:
library(terra)
# original GRIB sources, AWS mirror
gribs <- paste0("/vsicurl/https://noaa-gefs-pds.s3.amazonaws.com/gefs.20220314/00/atmos/pgrb2ap5/geavg.t00z.pgrb2a.0p50.f",
stringr::str_pad(seq(3,240,by=3), 3, pad="0"))
bench::bench_time({
cube <- terra::sds(gribs)
})
cube[1,63] |> plot()
very nice!
gdalcubes
gdalcubes is another package that can also leverage the gdal virtual filesystem when working with these large-ish remote files. It also lets us define an abstract cube at potentially a different resolution in space & time than the original sources (averaging or interpolating). Lazy operations mean this may run a bit faster(?)
library(gdalcubes)
date <- as.Date("2023-01-26")
date_time = date + lubridate::hours(seq(3,240,by=3))
# USA box
v <- cube_view(srs = "EPSG:4326",
extent = list(left = -125, right = -66,top = 49, bottom = 25,
t0= as.character(min(date_time)), t1=as.character(max(date_time))),
dx = 0.5, dy = 0.5, dt = "PT3H")
gribs <- paste0("/vsicurl/https://noaa-gefs-pds.s3.amazonaws.com/gefs.20220314/00/atmos/pgrb2ap5/geavg.t00z.pgrb2a.0p50.f",
stringr::str_pad(seq(3,240,by=3), 3, pad="0"))
bench::bench_time({
cube <- gdalcubes::create_image_collection(gribs, date_time = date_time)
})
bench::bench_time({
raster_cube(cube, v) |>
select_bands("band63") |> # temperature
animate(col = viridisLite::viridis, nbreaks=50, fps=10, save_as = "temp.gif")
})
stars
I didn't translate a full stars example, but here at least is the band name correction; it is a bit more cumbersome than the examples above.
urls <- paste0("/vsicurl/",
"https://sdsc.osn.xsede.org/bio230014-bucket01/neon4cast-drivers/",
"noaa/gefs-v12/cogs/gefs.20221201/",
c("gep01.t00z.pgrb2a.0p50.f003.tif", "gep01.t00z.pgrb2a.0p50.f006.tif"))
library(stars)
#stars::read_stars(urls, along="time") # no luck!
## grab unstacked proxy object for each geotiff
x <- lapply(urls, read_stars)
# extract band-names-part
band_names <- st_get_dimension_values(x[[1]], "band") |>
stringr::str_extract("([A-Z]+):") |>
stringr::str_remove(":") # stringr is not attached, so call it with its namespace
# apply corrected band-names
x1 <- lapply(x, st_set_dimensions, "band", band_names)
# at last, we can stack into a cube:
x1 <- do.call(c, c(x1, along="time"))
# and add correct date timestamps to the new time dimension
dates <- as.Date("2022-12-01") + lubridate::hours(c(3,6))
x1 <- st_set_dimensions(x1, "time", dates)
x1

Why intersect function from terra R package not giving all the combinations?

I want to calculate the area under every possible combination of two classified rasters. I am using the following code
library(terra)
#First create two rasters
r1 <- r2 <- rast(nrow=100, ncol=100)
#Assign random cell values
set.seed(123)
values(r1) <- runif(ncell(r1), min=0, max=1)
values(r2) <- runif(ncell(r2), min=0, max=1)
# classify the values into two groups
m_r1 <- c(min(global(r1, "min", na.rm=TRUE)), 0.2, 1,
0.2, max(global(r1, "max", na.rm=TRUE)), 2)
m_r2 <- c(min(global(r2, "min", na.rm=TRUE)), 0.2, 1,
0.2, max(global(r2, "max", na.rm=TRUE)), 2)
#Reclassify the rasters
rclmat_r1 <- matrix(m_r1, ncol=3, byrow=TRUE)
rc_r1 <- classify(r1, rclmat_r1, include.lowest=TRUE)
rclmat_r2 <- matrix(m_r2, ncol=3, byrow=TRUE)
rc_r2 <- classify(r2, rclmat_r2, include.lowest=TRUE)
plot(rc_r1)
plot(rc_r2)
#Convert to polygons
r1_poly <- as.polygons(rc_r1, dissolve=TRUE)
r2_poly <- as.polygons(rc_r2, dissolve=TRUE)
plot(r1_poly)
plot(r2_poly)
#Perform intersections
x <- intersect(r1_poly, r2_poly)
x
#> class : SpatVector
#> geometry : polygons
#> dimensions : 2747, 2 (geometries, attributes)
#> extent : -180, 180, -90, 90 (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84
#> names : lyr.1 lyr.1
#> type : <int> <int>
#> values : 1 1
#> 1 2
#> 2 1
As you can see from the output, one combination i.e. 2-2 is missing. Why is this happening?
When I am trying to calculate the area for each combination using expanse(x), it returns a long result. How can I get the area in km2 for the following combinations?
Combination Area (km2)
1-1
1-2
2-1
2-2
With this example it would be better to stay with raster data.
x = 10 * rc_r1 + rc_r2
a = cellSize(x, unit="km")
zonal(a, x, sum)
# lyr.1 area
#1 11 19886611
#2 12 81946082
#3 21 84763905
#4 22 323469024
By multiplying with 10, the values in the first layer become 10 (if they were 1) or 20 (if they were 2). If you then add the second layer, you get 10 + 1 or 2 and 20 + 1 or 2, so you end up with four classes: 11, 12, 21, and 22. These show the value in the first raster (first digit) and in the second raster (second digit).
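The same encoding can be tried on plain vectors. This is a small base R sketch with made-up class values and cell areas (all numbers are hypothetical); tapply() plays the role that zonal() plays for rasters.

```r
# Class values (1 or 2) of five hypothetical cells in each raster
r1_class <- c(1, 1, 2, 2, 2)
r2_class <- c(1, 2, 1, 2, 2)
# Hypothetical cell areas in km2
cell_area <- c(10, 20, 30, 40, 50)

# Combine both classes into one two-digit code and sum the area per code
combo <- 10 * r1_class + r2_class
area_by_combo <- tapply(cell_area, combo, sum)
area_by_combo

# The two original classes can be recovered from the combined code
first  <- combo %/% 10
second <- combo %% 10
```

This only works as long as the second raster has fewer than ten classes; with more classes a larger multiplier would be needed.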
When you show a SpatVector only the first three records are printed, and there is a 2-2 record. Nevertheless, intersect did not work properly and I have now fixed this.

sf: Generate random points with maximal distance condition

I'd like to generate 100 random points while imposing a maximal distance around points, using st_buffer() with a size of 1000 meters around each point and eliminating any offending points. But, in my example:
library(sf)
library(dplyr) # provides the %>% pipe used below
library(ggplot2) # used for the final plot
# Data set creation
set.seed(1)
df <- data.frame(
gr = c(rep("a",5),rep("b",5)),
x = rnorm(10),
y = rnorm(10)
)
df <- st_as_sf(df,coords = c("x","y"),remove = F, crs = 4326)
df.laea = st_transform(df,
crs = "+proj=laea +x_0=4600000 +y_0=4600000 +lon_0=0.13 +lat_0=0.24 +datum=WGS84 +units=m")
st_bbox(df.laea)
#
# Random simulation of 100 point inside df.laea extent
sim_study_area <- st_sample(st_as_sfc(st_bbox(df.laea)), 100) %>% # random points, as a list ...
st_sf()
border_area <- st_as_sfc(st_bbox(df.laea))%>% # bounding box polygon of the study area
st_sf()
# I'd like to impose a maximal distance of 1000 meters around points, and for this:
i <- 1 # iterator start
buffer_size <- 1000 # minimal distance to be enforced (in meters)
repeat( {
# create buffer around i-th point
buffer <- st_buffer(sim_study_area[i,], buffer_size)
offending <- sim_study_area %>% # start with the intersection of master points...
st_intersects(buffer, sparse = F) # ... and the buffer, as a vector
# i-th point is not really offending
offending[i] <- TRUE
# if there are any offending points left - re-assign the master points
sim_study_area <- sim_study_area[offending,]
if ( i >= nrow(sim_study_area)) {
# the end was reached; no more points to process
break
} else {
# rinse & repeat
i <- i + 1
}
} )
# Visualization of the points created with the offending condition:
simulation_area <- ggplot() +
geom_sf(data = border_area, col = 'gray40', fill = NA, lwd = 1) +
geom_sf(data = sim_study_area, pch = 3, col = 'red', alpha = 0.67) +
theme_bw()
plot(simulation_area)
The result is not OK because I don't have 100 points, and I don't know how to fix it.
Any ideas, please?
Thanks in advance,
Alexandre
I think that the easiest solution is to adopt one of the sampling functions defined in the R package spatstat. For example:
# packages
library(sf)
#> Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
# create data
set.seed(1)
df <- data.frame(
gr = c(rep("a",5),rep("b",5)),
x = rnorm(10),
y = rnorm(10)
)
df <- st_as_sf(df,coords = c("x","y"),remove = F, crs = 4326)
df.laea = st_transform(
df,
crs = "+proj=laea +x_0=4600000 +y_0=4600000 +lon_0=0.13 +lat_0=0.24 +datum=WGS84 +units=m"
)
Now we sample with a Simple Sequential Inhibition Process. Check ?spatstat.core::rSSI for more details.
sampled_points <- st_sample(
x = st_as_sfc(st_bbox(df.laea)),
type = "SSI",
r = 1000, # threshold distance (in metres)
n = 100 # number of points
)
# Check result
par(mar = rep(0, 4))
plot(st_as_sfc(st_bbox(df.laea)), reset = FALSE)
plot(sampled_points, add = TRUE, pch = 16)
# Estimate all distances
all_distances <- st_distance(sampled_points)
all_distances[1:5, 1:5]
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.00 57735.67 183205.74 189381.50 81079.79
#> [2,] 57735.67 0.00 153892.93 143755.73 61475.85
#> [3,] 183205.74 153892.93 0.00 62696.68 213379.39
#> [4,] 189381.50 143755.73 62696.68 0.00 194237.12
#> [5,] 81079.79 61475.85 213379.39 194237.12 0.00
# Check they are all greater than 1000
sum(all_distances < 1000)
#> [1] 100 # since the diagonal is full of 100 zeros
Created on 2021-08-12 by the reprex package (v2.0.0)
Check here (in particular the answer from Prof. Baddeley), the references therein, and the help page of st_sample for more details.
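To see what sequential inhibition does, here is a minimal base R sketch of the idea: candidate points are drawn uniformly one at a time and kept only if they lie at least r away from every point accepted so far. ssi_sample() and its max_tries argument are hypothetical names for illustration; this is not the spatstat implementation, and the square window is an arbitrary assumption.

```r
# Hypothetical helper: sequential inhibition in a rectangular window
ssi_sample <- function(n, r, xlim, ylim, max_tries = 10000) {
  pts <- matrix(NA_real_, nrow = 0, ncol = 2)
  tries <- 0
  while (nrow(pts) < n && tries < max_tries) {
    tries <- tries + 1
    cand <- c(runif(1, xlim[1], xlim[2]), runif(1, ylim[1], ylim[2]))
    # distances from the candidate to all points accepted so far
    d <- sqrt((pts[, 1] - cand[1])^2 + (pts[, 2] - cand[2])^2)
    if (all(d >= r)) {
      pts <- rbind(pts, cand)
    }
  }
  unname(pts)
}

set.seed(1)
pts <- ssi_sample(n = 100, r = 1000, xlim = c(0, 2e5), ylim = c(0, 2e5))

# Smallest pairwise distance between the accepted points
min_dist <- min(dist(pts))
min_dist
```

If the window is too small for n points at distance r, the loop stops after max_tries candidates and returns fewer than n points.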

How do I calculate the mean of two different variables taking into account the values of latitude and longitude in R?

I am currently trying to obtain some data in R from a table.
I have a dataset with two different variables, the annual range and the annual mean, of the worldwide sea surface temperature (SST). I have these values for each latitude (from 90 to -90) and longitude (from 180 to -180) level.
I would like to obtain the mean of the aforementioned variables (annual range and annual mean) for 5x5 grid cells of latitude/longitude. For example, I would need to know the "annual range" mean for a longitude between -180 and -176 and a latitude between 90 and 86, and so on until getting the mean of this variable for all the possible 5x5 grid cells.
My data looks like:
lon lat ANNUAL_MEAN ANNUAL_RANGE
1 0.5 89.5 -1.8 0
2 1.5 89.5 -1.8 0
3 2.5 89.5 -1.8 0
4 3.5 89.5 -1.8 0
5 4.5 89.5 -1.8 0
6 5.5 89.5 -1.8 0
...
52001 354.5 -89.5 -1.8 0
52002 355.5 -89.5 -1.8 0
52003 356.5 -89.5 -1.8 0
52004 357.5 -89.5 -1.8 0
52005 358.5 -89.5 -1.8 0
52006 359.5 -89.5 -1.8 0
Thank you in advance
You can use the raster package and its focal function for computations with a moving window.
First I will create a dummy data.frame which represents your data.
# Prepare dummy data.frame
set.seed(2222)
lonlat <- expand.grid(1:10, 1:10)
df <- data.frame( lon = lonlat[, 1],
lat = lonlat[, 2],
ANNUAL_MEAN = rnorm(100),
ANNUAL_RANGE = runif(100, 1, 5)
)
Now we have to convert the data frame into a raster and perform a moving window averaging.
library(raster)
# Convert data frame to raster object
rdf <- df
coordinates(rdf) <- ~ lon + lat
gridded(rdf) <- TRUE
rdf <- brick(rdf) # our raster brick
## Perform moving window averaging
# prepare weights matrix (5*5)
w <- matrix(1, ncol = 5, nrow = 5)
# perform moving window averaging
ANNUAL_MEAN_AVG <- focal(rdf[[1]], w, mean, pad = TRUE, na.rm = TRUE)
ANNUAL_RANGE_AVG <- focal(rdf[[2]], w, mean, pad = TRUE, na.rm = TRUE)
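# Append new data to the initial data.frame; look the values up by coordinate,
# because raster cell order (top row first) differs from the row order of df
df$ANNUAL_MEAN_AVG <- raster::extract(ANNUAL_MEAN_AVG, df[, c("lon", "lat")])
df$ANNUAL_RANGE_AVG <- raster::extract(ANNUAL_RANGE_AVG, df[, c("lon", "lat")])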
Now each cell in df$ANNUAL_MEAN_AVG and df$ANNUAL_RANGE_AVG contains the mean value of the corresponding 5*5 square.
UPD 1. 5x5 downsampling
If you need fixed 5x5 grid cells with mean values per cell, you can use the raster::aggregate function.
Working with rdf raster brick from the previous example.
# perform an aggregation with given downsampling factor
rdf_d <- aggregate(rdf, fact=5, fun = mean)
# Now each pixel in the raster `rdf_d` contains a mean value of 5x5 pixels from initial `rdf`
# we need to get pixels coordinates and their values
coord <- coordinates(rdf_d)
vals <- as.data.frame(rdf_d)
colnames(coord) <- c("lon", "lat")
colnames(vals) <- c("ANNUAL_MEAN_AVG", "ANNUAL_RANGE_AVG")
res <- cbind(coord, vals)
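What the aggregation with fact = 5 computes can be reproduced on a plain matrix. This is a base R sketch on a made-up 10 x 10 matrix (not the rdf brick): every cell gets a block index by integer division, and tapply() averages the cells in each 5x5 block.

```r
# Made-up 10 x 10 matrix standing in for one raster layer
m <- matrix(1:100, nrow = 10)

# Block index of every cell: rows/columns 1-5 -> block 0, 6-10 -> block 1
block_row <- (row(m) - 1) %/% 5
block_col <- (col(m) - 1) %/% 5

# Mean of each 5x5 block, returned as a 2 x 2 matrix
block_means <- tapply(m, list(block_row, block_col), mean)
block_means
```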
This is a solution that uses the dplyr package, included in tidyverse. It should be easy to follow, step by step.
library(tidyverse)
# set.seed() ensures reproducibility of the example with identical random numbers
set.seed(42)
# build a simulated data set as described in the question
lats <- seq(from = -90, to = 90, by = 0.5)
lons <- seq(from = -180, to = 179.5, by = 0.5) # we must omit +180 or we would
# double count those points
# since they coincide with -180
# combining each latitude point with each longitude point
coord <- merge(lats, lons) %>%
rename(lat = x) %>%
rename(lon = y) %>%
# adding simulated values
mutate(annual_mean = runif(n = nrow(.), min = -2, max = 2)) %>%
mutate(annual_range = runif(n = nrow(.), min = 0, max = 3)) %>%
# defining bands of 5 degrees of latitude and longitude by using integer division
mutate(lat_band = lat%/%5) %>%
mutate(lon_band = lon%/%5) %>%
# creating a name label for each unique 5x5 gridcell
mutate(gridcell_5x5 = paste(lat_band, lon_band, sep = ",")) %>%
# group-by instruction, much like in SQL
group_by(lat_band, lon_band, gridcell_5x5) %>%
# sorting to get a nice order
arrange(lat_band, lon_band) %>%
# calculating minimum and maximum latitude and longitude for each gridcell
# calculating the mean values per gridcell
summarize(gridcell_min_lat = min(lat),
gridcell_max_lat = max(lat),
gridcell_min_lon = min(lon),
gridcell_max_lon = max(lon),
gridcell_mean_annual_mean = round(mean(annual_mean), 3),
gridcell_mean_annual_range = round(mean(annual_range), 3) )
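The banding relies on how %/% behaves for negative numbers: it rounds towards minus infinity, so every band covers the half-open interval from band * 5 up to, but not including, band * 5 + 5. A short base R check with a few hand-picked longitudes:

```r
lon <- c(-180, -177.5, -175.5, -175, 0, 4.5, 5)

# Integer division rounds down, also for negative values
lon_band <- lon %/% 5
lon_band

# Lower edge (in degrees) of the band each longitude falls into
band_start <- lon_band * 5
band_start
```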

Retrieving values from raster into another raster with different extent and resolution

I have a raster layer 'w_wgs' of climatic values. On the other side, I created an empty raster 'grid'.
I want to assign the values stored in 'w_wgs' into my empty raster grid. I tried merge(), overlay(), over() but these don't work.
How can I do this, assuming these two rasters have neither the same extent nor the same resolution (nor an equal number of cells, obviously), but both have the same CRS (WGS84)?
Here, a description of my 2 objects :
> extent(w_wgs)
class : Extent
xmin : -64.50344
xmax : 74.07016
ymin : 12.93039
ymax : 72.72534
> res(w_wgs)
[1] 0.01320 0.00895
> res(grid)
[1] 0.08 0.08
> extent(grid)
class : Extent
xmin : 5
xmax : 17.96
ymin : 40
ymax : 50
I think you need to use resample
grid <- resample(w_wgs, grid)
A small example that should be similar to your case:
library(raster)
x <- matrix(1:100, nrow = 10, ncol = 10)
a <- raster(x)
x2 <- matrix(NA, nrow = 3, ncol = 3)
b <- raster(x2)
# Manually changing the extent and resolution of b
b@extent@xmax <- 0.5
b@extent@ymin <- 0.3
b <- resample(a, b)
par(mfrow = c(2, 1))
plot(a)
plot(extent(b), add = T)
plot(b)
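For comparison, the same operation can be sketched with the terra package, which appears in the other questions on this page. The source raster here is hypothetical and much coarser than w_wgs to keep the example small; the target uses the extent and resolution of 'grid' from the question. The names src, target and out are made up for illustration.

```r
library(terra)

# Hypothetical source raster, roughly covering the extent of w_wgs
src <- rast(xmin = -64.5, xmax = 74, ymin = 13, ymax = 73, resolution = 0.5)
values(src) <- seq_len(ncell(src))

# Empty target raster with the extent and resolution of 'grid'
target <- rast(xmin = 5, xmax = 17.96, ymin = 40, ymax = 50, resolution = 0.08)

# Bilinear interpolation of the source values onto the target cells
out <- resample(src, target, method = "bilinear")
out
```

Bilinear interpolation suits continuous values such as climate variables; for categorical rasters method = "near" would probably be the safer choice.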
