Following a previous question (Faster reading of time series from netCDF?) I have re-permuted my netCDF files to provide fast time-series reads (scripts on github to be cleaned up eventually ...).
In short, to make reads faster, I have rearranged the dimensions from lat, lon, time to time, lat, lon. Now, my existing scripts break because they assume that the dimensions will always be lat, lon, time, following the ncdf4 documentation of ncvar_get, for the 'start' argument:
Order is X-Y-Z-T (i.e., the time dimension is last)
However, this is not the case.
Furthermore, there is a related inconsistency in the order of dimensions reported by the command-line netCDF utility ncdump -h and the R function ncdf4::nc_open. The former shows the dimensions in the expected (lat, lon, time) order, while the latter sees time first (time, lat, lon).
For a minimal example, download the file test.nc and run
bash-$ ncdump -h test.nc
bash-$ R
R> library(ncdf4)
R> print(nc_open("test.nc"))
What I want to do is get records 5-15 from the variable "lwdown"
my.nc <- nc_open("test.nc")
In the old layout I would have used start = c(1, 1, 5), count = c(1, 1, 10), but this doesn't work now, since R sees the time dimension first, so I must change my scripts to
ncvar_get(my.nc, "lwdown", start = c(5, 1, 1), count = c(10, 1, 1))
It wouldn't be so bad to update my scripts and functions, except that I want to be able to read data from files regardless of the dimension order.
Is there a way to generalize this function so that it works independently of dimension order?
While asking the question, I figured out this solution, though there is still room for improvement:
The closest I can get is to open the file and find the order in this way:
my.nc$var$lwdown$dim[[1]]$name
[1] "time"
my.nc$var$lwdown$dim[[2]]$name
[1] "lon"
my.nc$var$lwdown$dim[[3]]$name
[1] "lat"
which is a bit unsatisfying, although it led me to this solution:
If I want to start at c(lat = 1, lon = 1, time = 5), but ncvar_get expects the dimensions in whatever order the file happens to use, I can say:
start <- c(lat = 1, lon = 1, time = 5)
count <- c(lat = 1, lon = 1, time = 10)
dim.order <- sapply(my.nc$var$lwdown$dim, function(x) x$name)
ncvar_get(my.nc, "lwdown", start = start[dim.order], count = count[dim.order])
I ran into this recently as well. I have a netcdf with data in this format
nc_in <- nc_open("my.nc")
nc_in$dim[[1]]$name == "time"
nc_in$dim[[2]]$name == "latitude"
nc_in$dim[[3]]$name == "longitude"
nc_in$dim[[1]]$len == 3653 # this is the number of timesteps in my netcdf
nc_in$dim[[2]]$len == 180 # this is the number of latitude cells
nc_in$dim[[3]]$len == 360 # this is the number of longitude cells
The obnoxious part here is that the DIM component of the netCDF is in the order of T,Y,X
If I try to grab time series data for the pr variable using the indices in the order they appear in nc_in$dim, I get an error:
ncvar_get(nc_in,"pr")[3653,180,360] # 'subscript out of bounds'
If I instead grab data in X,Y,T order, it works:
ncvar_get(nc_in,"pr")[360,180,3653] # gives me a value
What I don't understand is how the ncvar_get() function knows which dimension represents X, Y and T, especially if you have generated your own netCDF.
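As far as I can tell, ncvar_get() doesn't know which dimension is X, Y or T at all; it simply returns the array in the variable's own storage order, which you can inspect per variable (a short sketch, assuming the variable is called pr as above):
# The order of this list is the order of the array returned by ncvar_get()
sapply(nc_in$var$pr$dim, function(d) d$name)
# presumably "longitude" "latitude" "time" here, hence indexing as [lon, lat, time]
sapply(nc_in$var$pr$dim, function(d) d$len)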
Related
I have several rasters, 343 to be more exact, from Cropscape. I need to get the locations (centroids) and area measurements of pixels that represent potatoes and tomatoes based on the associated values in the rasters. The pixel values are 43 and 54, respectively. Cropscape provides rasters separated by year and state, except for 2016, which has the lower 48 states combined. The rasters are saved as GeoTiffs on a Google Drive and I am using Google File Stream to connect to the rasters locally.
I want to create a SpatialPointsDataFrame from the centroids of each pixel or group of adjacent pixels for tomatoes and potatoes in all the rasters. Right now, my code will
Subset the rasters to potatoes and tomatoes
Change the raster subsets to polygons, one for potatoes and one for tomatoes
Create centroids from each polygon
Create a SpatialPointsDataFrame based on the centroids
Extract the area measurement for each area of interest with SpatialPointsDataFrame
Write the raster subsets and each polygon to a file.
Code:
library(raster)
library(rgdal)
library(rgeos)
dat_dir2 = getwd()
mepg <- make_EPSG()
ae_pr <- mepg[mepg$code == "5070", "prj4"]
# Toy raster list for use with code
# I use `list.files()` with the directories that hold
# the rasters and use list that is generated from
# that to read in the files to raster. My list is called
# "tiflist". Not used in the code, but mentioned later.
rmk1 <- function(x, ...) {
  r1 = raster(ncol = 1000, nrow = 1000)
  r1[] = sample(1:60, 1000000, replace = T)
  proj4string(r1) = CRS(ae_pr)
  return(r1)
}
rlis <- lapply(1:5, rmk1)
#Pixel values needed
ptto <- c(43, 54)
# My function to go through rasters for locations and area measurements.
# This code is somewhat edited to work with the demo raster list.
# It produces the same output as what I wanted, but with the demo list.
pottom <- function(x, ...) {
  # Next line is not necessary with the raster list created above.
  # temras = raster(x)
  now = format(Sys.time(), "%b%d%H%M%S")
  nwnm = paste0(names(x), now)
  rasmatx = match(x = x, table = ptto)
  writeRaster(rasmatx, file.path(dat_dir2, paste0(nwnm, "ras")), format = "GTiff")
  tempol = rasterToPolygons(rasmatx, fun = function(x) { x > 0 & x < 4 }, dissolve = T)
  tempol2 = disaggregate(tempol)
  # for potatoes
  tempol2p = tempol2[tempol2$layer == '1',]
  if (nrow(tempol2p) > 0) {
    temcenp = gCentroid(tempol2p, byid = T)
    temcenpdf = SpatialPointsDataFrame(temcenp, data.frame(ID = 1:length(temcenp), temcenp))
    temcenpdf$pot_p = extract(rasmatx, temcenpdf)
    temcenpdf$areap_m = gArea(tempol2p, byid = T)
    # writeOGR(temcenpdf, dsn = file.path(dat_dir2), paste0(nwnm, "p"), driver = "ESRI Shapefile")
  }
  # for tomatoes
  tempol2t = tempol2[tempol2$layer == '2',]
  if (nrow(tempol2t) > 0) {
    temcent = gCentroid(tempol2t, byid = T)
    temcentdf = SpatialPointsDataFrame(temcent, data.frame(ID = 1:length(temcent), temcent))
    temcentdf$tom_t = extract(rasmatx, temcentdf)
    temcentdf$areat_m = gArea(tempol2t, byid = T)
    writeOGR(temcentdf, dsn = file.path(dat_dir2), paste0(nwnm, "t"), driver = "ESRI Shapefile")
  }
}
lapply(rlis, pottom)
I know I should provide some toy data, and I created some above, but I don't know whether it exactly recreates my problem, which follows.
Besides my wonky code, which seems to work, I have a bigger problem: a lot of memory is used when this code runs. The real tiflist only gets through the first 4 files of the list, and by then the RAM on my laptop (16 GB) is completely consumed. I'm pretty sure it's the connections to the Google Drive, since the cache for the drive stream is at least 8 GB. I guess each raster stays open after being read from the Google Drive? I don't know how to confirm that.
I think I need to get the function to clear out all of the objects that are created, e.g. temras, rasmatx, tempol, etc., after processing each raster, but I'm not sure how to do that. I did try adding rm(temras ...) to the end of the function, but when I did that there was no output at all from the function after 10 minutes, and by then I would usually have had the first 3 rasters processed.
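One thing I might try (a sketch, not yet tested on the real rasters) is that objects created inside a function are normally released once it returns, so instead of lapply I could force a garbage collection between files:
# Process the rasters one at a time and ask R to release memory in between.
for (r in rlis) {
  pottom(r)
  gc(verbose = FALSE)  # trigger garbage collection after each raster
}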
27/Oct EDIT after comments from RobertHijmans. It seems that the states with large geographic extents are causing problems with rasterToPolygons(). I edited the code from the way it works for me locally to work with the demo data I included, since RobertHijmans pointed out it wasn't functional. So I hope this is now reproducible.
I feel silly answering my own question, but here it is: the rasterToPolygons function is notoriously slow. I was unaware of this issue. In one of my attempts I waited 30 minutes before killing the process, with no result. It works under the conditions I require for rasters for Alabama and Arkansas, for example, but not California.
A submitted solution, which I am in the process of testing, comes from this GitHub repo. The test is ongoing at 12 minutes, so I don't know if it works for an object as large as California. I don't want to copy and paste someone else's code in an answer to my own question.
One of the comments suggested using profvis, but I couldn't really figure out the output. And it hung with the process too.
I'm trying to extract a subset of depth data from GEBCO's global ocean bathymetry dataset, which is a 10.9 GB netCDF-4 (.nc) file (direct link).
I open a connection to the file, which doesn't load it into memory:
library(ncdf4)
GEBCO <- nc_open(filename = "GEBCO_2019.nc", verbose = T)
Find the lat & lon indices corresponding to my subset area:
LonIdx <- which(GEBCO$dim$lon$vals < -80 & GEBCO$dim$lon$vals > -81.7) #n=408 long
LatIdx <- which(GEBCO$dim$lat$vals < 26 & GEBCO$dim$lat$vals > 25) #n=240; 240*408=97920
Then get Z data for those extents:
z <- ncvar_get(GEBCO, GEBCO$var$elevation)[LonIdx, LatIdx]
Resulting in:
Error: cannot allocate vector of size 27.8 Gb
However it does this regardless of the size of the subset, even down to a 14*14 matrix. I presume, therefore, that ncvar_get() is pulling the whole dataset just to extract the indexed subset... even though I was under the impression that the entire point of netCDF files was that you could subset with matrix indexing without loading the whole thing into memory?
FWIW I'm on a 32 GB Linux machine, so it should work anyway? [edit: and the file is 10.9 GB in the first place, so one would think a subset would be smaller]
Any ideas/intel/insights gratefully received. Thanks in advance.
Edit: other times it crashes RStudio rather than giving the error: R Session Aborted, fatal error, session terminated.
OK, solved. It turns out the approach I found online, indexing with [LonIdx, LatIdx], subsets the object only after the whole thing has been read into memory. I still don't know why that was a problem, given that the file size is a third of my memory and even the expanded size fits within my memory, but either way it was the wrong way to go.
Assuming one's rows and columns are contiguous (they should be in netCDF), the solution is:
z <- ncvar_get(nc = GEBCO,
               varid = GEBCO$var$elevation,
               start = c(LonIdx[1], LatIdx[1]),
               count = c(length(LonIdx), length(LatIdx)),
               verbose = T)
To convert to long format:
lon <- GEBCO$dim$lon$vals[LonIdx]
lat <- GEBCO$dim$lat$vals[LatIdx]
rownames(z) <- as.character(lon)
colnames(z) <- as.character(lat)
library(tidyr)
library(magrittr)
ztbl <- as_tibble(z, rownames = "lon")
ztbl %<>% pivot_longer(-lon, names_to = "lat", values_to = "depth")
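One small follow-up: because the values came from rownames/colnames, lon and lat end up as character columns, so you may want to convert them back to numeric, e.g.:
ztbl$lon <- as.numeric(ztbl$lon)
ztbl$lat <- as.numeric(ztbl$lat)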
The simplest description of what I am trying to do is that I have a column in a data.frame like 1,2,3,..., n, 1,2,3,...n,.... and I want to group the first 1...n as 1, the second 1...n as 2, and so on.
The full context is: I am using the R spcosa package to do equal-area stratification composite sampling on parcels of land. I start with a shape file from a GIS that contains a number of polygons (land parcels). The end result I want is a GIS file with the strata and sample locations, with each stratum and sample location labeled by land parcel, stratum and sample id. So far I can do all of this except one bit, which is identifying the stratum that each sample belongs to and including it in the sample label. The sample label needs to look like "parcel#-strata#-composite#" (where # is the number). In practice I don't need the actual label, just these as separate attributes in the GIS file.
The basic workflow is as follows.
For each individual polygon, I use spcosa::stratify to divide it into a number of equal-area strata, like:
strata.CSEA <- stratify(poly[i,], nStrata = n, nTry = 1, equalArea = TRUE, nGridCells = x)
Note that spcosa::stratify generates a CompactStratificationEqualArea object. I coerce this to a SpatialPixelsDataFrame and then use rasterToPolygons to be able to output it as a GIS file.
I then generate the sample locations as follows:
samples.SPRC <- spsample(strata.CSEA, n = n, type = "composite")
spcosa::spsample creates a SamplingPatternRandomComposite object. I coerce this to a SpatialPointsDataFrame
samples.SPDF <- as(samples.SPRC, "SpatialPointsDataFrame")
and add two columns to the @data slot
samples.SPDF@data$Strata <- "this is the bit I can't do yet"
samples.SPDF@data$CEA <- poly[i,]$name
I can then write samples.SPDF as a GIS file (i.e. with writeOGR) with all the wanted attributes.
As above, the part I can't sort out is how the sample ids relate to the strata ids. The sample points are a vector like 1,2,3...n, 1,2,3...n,.... How do I extract which sample goes with which stratum? As the actual strata numbers are arbitrary, I could just group them (as per my simple question above), but ideally I would like to use the numbering of the actual strata so everything lines up.
To give any contributors access to a hands on example I copy below the code from the spcosa documentation slightly modified to generate the correct objects.
# Note: the example below requires the 'rgdal' package. You may consider the 'maptools' package as an alternative.
if (require(rgdal)) {
  # read a vector representation of the `Farmsum' field
  shpFarmsum <- readOGR(
    dsn = system.file("maps", package = "spcosa"),
    layer = "farmsum"
  )
  # stratify `Farmsum' into 50 strata
  # NB: increase argument 'nTry' to get better results
  set.seed(314)
  myStratification <- stratify(shpFarmsum, nStrata = 50, nTry = 1, equalArea = TRUE)
  # sample two sampling units per stratum
  mySamplingPattern <- spsample(myStratification, n = 2, type = "composite")
  # plot the resulting sampling pattern on
  # top of the stratification
  plot(myStratification, mySamplingPattern)
}
Maybe the order() function can help you:
n <- 10
dat <- data.frame(col1 = rep(1:n, 2), col2 = rnorm(2*n))
head(dat)
dat[order(dat$col1), ]
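For the repeated 1...n blocks you describe, a minimal sketch of one way to build a group id, assuming each block restarts at 1:
# group is 1 for the first 1..n block, 2 for the second, and so on
dat$group <- cumsum(dat$col1 == 1)
head(dat)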
I did not get where the "ID" (1,2,3...n) is to be found, so let's assume you have your SpatialPolygonsDataFrame called shpFarmsum with an attribute data column "ID". You can access this column via shpFarmsum$ID. Therefore, if you want to create individual subsets for each ID, this is one way to go:
for (i in unique(shpFarmsum$ID)) {
  tempSubset <- shpFarmsum[shpFarmsum$ID == i,]
  writeOGR(tempSubset, ".", paste0("subset_", i), driver = "ESRI Shapefile")
}
I added the writeOGR(...) line so all subsets are written to your working directory. However, you can change this line or add further analysis inside the for loop.
How it works
unique(shpFarmsum$ID) extracts all occurring IDs (comparable to your 1,2,3...n).
In each iteration of the for loop, another of these IDs is used to create a subset of the whole SpatialPolygonsDataFrame, which you can then use for further analysis.
I am trying to create a NetCDF from a .csv file. I have read several tutorials here and other places and still have some doubts.
I have a table according to this:
lat,long,time,rh,temp
41,-109,6,1,1
40,-107,18,2,2
39,-105,6,3,3
41,-103,18,4,4
40,-109,6,5,2
39,-107,18,6,4
I create the NetCDF using the ncdf4 package in R.
xvals <- data$lon
yvals <- data$lat
nx <- length(xvals)
ny <- length(yvals)
lon1 <- ncdim_def("longitude", "degrees_east", xvals)
lat2 <- ncdim_def("latitude", "degrees_north", yvals)
time <- data$time
mv <- -999 #missing value to use
var_temp <- ncvar_def("temperatura", "celsius", list(lon1, lat2, time), longname="Temp. da superfĂcie", mv)
var_rh <- ncvar_def("humidade", "%", list(lon1, lat2, time), longname = "humidade relativa", mv )
ncnew <- nc_create(filename, list(var_temp, var_rh))
ncvar_put(ncnew, var_temp, dadostemp, start=c(1,1,1), count=c(nx,ny,nt))
When I follow this procedure, it states that the nc file expects 3 times the amount of data that I have.
I understand why: one matrix for each dimension, since I stated that the variables are defined over longitude, latitude and time.
So, how would I import this kind of data, where I already have one Lon, Lat, Time and other variables for each data acquisition?
Could someone shed some light?
PS: The data used here is not my real data, just some example I was using for the tutorials.
I think there is more than one problem in your code. Step by step:
Create dimensions
In a nc file, dimensions don't work as key-value pairs; they're just a vector of values defining what each position in a variable array means.
This means you should create your dimensions like this:
xvals <- unique(data$lon)
xvals <- xvals[order(xvals)]
yvals <- unique(data$lat)
yvals <- yvals[order(yvals)]
lon1 <- ncdim_def("longitude", "degrees_east", xvals)
lat2 <- ncdim_def("latitude", "degrees_north", yvals)
time <- data$time
time_d <- ncdim_def("time","h",unique(time))
Where I work, we use unlimited dimensions as mere indexes, while a 1-d variable with the same name as the dimension holds the values. I'm not sure how unlimited dimensions work in R. Since you didn't ask about it, I'll leave this out :-)
Define variables
mv <- -999 #missing value to use
var_temp <- ncvar_def("temperatura", "celsius",
list(lon1, lat2, time_d),
longname="Temp. da superfĂcie", mv)
var_rh <- ncvar_def("humidade", "%",
list(lon1, lat2, time_d),
longname = "humidade relativa", mv )
Add data
Create an nc file: ncnew <- nc_create(f, list(var_temp, var_rh))
When adding values, the object holding the data is flattened to a 1-d array and a sequential write is started at the position specified by start. How many values are written along each dimension is controlled by count. If you have data like this:
long, lat, time, t
1, 1, 1, 1
2, 1, 1, 2
1, 2, 1, 3
2, 2, 1, 4
The command ncvar_put(ncnew, var_temp,data$t,count=c(2,2,1)) would give you what you (probably) expect.
For your data, the first step is to create the indexes for the dimensions:
data$idx_lon <- match(data$long,xvals)
data$idx_lat <- match(data$lat,yvals)
data$idx_time <- match(data$time,unique(time))
Then create an array with the dimensions appropriate for your data:
m <- array(mv, dim = c(length(xvals), length(yvals), length(unique(time))))
Then fill the array with your values:
for(i in 1:NROW(data)){
  m[data$idx_lon[i], data$idx_lat[i], data$idx_time[i]] <- data$temp[i]
}
If speed is a concern, you could compute the indices in a vectorised way and use them for the assignment.
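A rough sketch of that vectorised assignment, using matrix indexing with the same idx_* columns as above (an alternative to an explicit linear index):
# Each row of idx addresses one cell of m, so no explicit loop is needed.
idx <- cbind(data$idx_lon, data$idx_lat, data$idx_time)
m[idx] <- data$temp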
Write the data
ncvar_put(ncnew, var_temp,m)
Note that you don't need start and count.
Finally, close the nc file to write the data to disk: nc_close(ncnew)
Optionally, I would recommend the ncdump console command to check your file.
Edit
Regarding your question whether to write a complete array or use start and count: I believe both methods work reliably. Which one to prefer depends on your data and your personal preferences.
I think the method of building an array, adding the values and then writing it as a whole is easier to understand. However, which is more efficient depends on the data. If your data is big and has many NA values, I believe using multiple writes with start and count could be faster. If NAs are rare, creating one matrix and doing a single write would be faster. If your data is so big that creating an extra array would exceed your available memory, you have to combine both methods.
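To illustrate the start/count route under the same assumptions as above, a sketch (untested) that writes one time slice at a time:
# Write each time slice separately instead of the whole array at once.
nt <- length(unique(time))
for (t in seq_len(nt)) {
  ncvar_put(ncnew, var_temp, m[, , t],
            start = c(1, 1, t),
            count = c(length(xvals), length(yvals), 1))
}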
Using R, I am trying to open all the netCDF files I have in a single folder (e.g. 20 files), read a single variable, and create a single data.frame combining the values from all files. I have been using RNetCDF to read netCDF files. For a single file, I read the variable with the following commands:
library('RNetCDF')
nc = open.nc('file.nc')
lw = var.get.nc(nc,'LWdown',start=c(414,315,1),count=c(1,1,240))
where 414 & 315 are the longitude and latitude indices of the value I would like to extract and 240 is the number of timesteps.
I have found this thread which explains how to open multiple files. Following it, I have managed to open the files using:
filenames= list.files('/MY_FOLDER/',pattern='*.nc',full.names=TRUE)
ldf = lapply(filenames,open.nc)
but now I'm stuck. I tried
var1= lapply(ldf, var.get.nc(ldf,'LWdown',start=c(414,315,1),count=c(1,1,240)))
but it doesn't work.
The added complication is that every nc file has a different number of timestep. So I have 2 questions:
1: How can I open all files, read the variable in each file and combine all values in a single data frame?
2: How can I set the last dimension in count to vary for all files?
Following @mdsummer's comment, I have tried a loop instead and have managed to do everything I needed:
# Declare data frame
df = NULL
# Open all files
files = list.files('MY_FOLDER/', pattern = '*.nc', full.names = TRUE)
# Loop over files
for(i in seq_along(files)) {
  nc = open.nc(files[i])
  # Read the whole nc file and get the length of the varying dimension
  # (here, the 3rd dimension, specifically time)
  lw = var.get.nc(nc, 'LWdown')
  x = dim(lw)
  # Vary the time dimension for each file as required
  lw = var.get.nc(nc, 'LWdown', start = c(414, 315, 1), count = c(1, 1, x[3]))
  # Add the values from each file to a single data.frame
  df = rbind(df, data.frame(lw))
}
There may be a more elegant way but it works.
You're passing the additional function arguments incorrectly; they should go through lapply's ... argument. Here's a simple example of how to pass na.rm to mean.
x.var <- 1:10
x.var[5] <- NA
x.var <- list(x.var)
x.var[[2]] <- 1:10
lapply(x.var, FUN = mean)
lapply(x.var, FUN = mean, na.rm = TRUE)
edit
For your specific example, this would be something along the lines of
var1 <- lapply(ldf, FUN = var.get.nc, variable = 'LWdown', start = c(414, 315, 1), count = c(1, 1, 240))
though this is untested.
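To also cover the varying number of timesteps (your second question): if I remember the RNetCDF documentation correctly, an NA in count means "read to the end of that dimension", so a hedged variant would be:
var1 <- lapply(ldf, FUN = var.get.nc, variable = 'LWdown',
               start = c(414, 315, 1), count = c(1, 1, NA))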
I think this is much easier to do with CDO as you can select the varying timestep easily using the date or time stamp, and pick out the desired nearest grid point. This would be an example bash script:
# I don't know how your time axis is
# you may need to use a date with a time stamp too if your data is not e.g. daily
# see the CDO manual for how to define dates.
date=20090101
lat=10
lon=50
files=`ls MY_FOLDER/*.nc`
for file in $files ; do
    # select the nearest grid point and the date slice desired:
    # %??? strips the .nc from the file name
    cdo seldate,$date -remapnn,lon=$lon/lat=$lat $file ${file%???}_${lat}_${lon}_${date}.nc
done
Then read in the resulting files with an R script.
It is possible to merge all the new files with cdo, but you would need to be careful if the time stamp is the same. You could try cdo merge or cdo cat - that way you can read in a single file to R, rather than having to loop and open each file separately.