How to read a stars object with over 32768 bands?

I have a very large dataset consisting of one attribute, simulated daily from 1970 to 2100, defined on a rather fine geographic grid. It has been given to me as a netCDF file, which I would like to read and analyze in an R script. The data is too big to fully fit in memory, so I wrote a script that does the analysis with stars proxy objects and the purrr package. It has worked for similar smaller datasets.
However, this dataset seems too big: there are 45956 bands, one for each time step, and it seems the read_stars() command has an upper limit on how many bands an object can have. This is what my code looks like after loading the proper libraries, where data_path points to a single .nc file:
data_full <- read_stars(data_path, proxy = TRUE)
It returns the following:
Warning message:
In CPL_read_gdal(as.character(x), as.character(options), as.character(driver), :
GDAL Message 1 : Limiting number of bands to 32768 instead of 45956
The data is then truncated and stops around 2050. I would like to have the full data in the data_full variable. Is it possible to increase the band limit? Or are there alternative ways of doing this?

Try setting GDAL_MAX_BAND_COUNT to 65536
Python:
from osgeo import gdal
gdal.SetConfigOption('GDAL_MAX_BAND_COUNT', '65536')  # config option values are strings
bash:
export GDAL_MAX_BAND_COUNT=65536
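In R (which the question uses), GDAL also reads configuration options from environment variables, so a hedged equivalent is to set the variable with Sys.setenv() before stars first touches the file; a minimal sketch:
# Assumption: this must be set before the session's first GDAL call
Sys.setenv(GDAL_MAX_BAND_COUNT = "65536")
library(stars)
data_full <- read_stars(data_path, proxy = TRUE)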

Related

Dimension problem while reading netcdf4 files in 'stars' R package

In essence I'm trying to do a relatively simple set of operations on a collection of netcdf4 files I've downloaded. They're sourced from ESA's Lakes Climate Change Initiative database of satellite-derived limnological data, and each netcdf4 file represents one day in a time series going back to the 2000s or earlier. Each netcdf4 file contains a number of variables of interest (surface temperature, chlorophyll-a concentration, etc.).
Using the stars package I was hoping to geographically subset the dataset to only the lake I'm interested in and create a monthly aggregated time series for the lake for a number of those variables.
Unfortunately it seems as though there might be something wrong with the netcdf4 files ESA provided, as I'm running into odd errors while working with them in the 'stars' package in R.
For reproducibility's sake here's a link to the file in question - it should be the first file in the directory. If you download it to a directory of your choice and setwd() to it you should be able to repeat what I've managed here:
library(stars)
setwd("path/to/downloaded/file")  # placeholder: the directory holding the .nc file
lake_2015_01_01 <- read_stars('ESACCI-LAKES-L3S-LK_PRODUCTS-MERGED-20150101-fv2.0.2.nc')
## Compare to using the read_ncdf() function, also from stars:
lake_2015_01_01_nc <- read_ncdf('ESACCI-LAKES-L3S-LK_PRODUCTS-MERGED-20150101-fv2.0.2.nc')
Running the file through read_stars() produces the error:
Error in attr(x, "dimensions")[[along]] : subscript out of bounds
In addition: There were 50 or more warnings (use warnings() to see the first 50)
While running it through read_ncdf() produces the following:
Warning messages:
1: In CPL_crs_from_input(x) :
GDAL Error 1: PROJ: proj_create: Error 1027 (Invalid value for an argument): longlat: Invalid value for units
2: In value[3L] : failed to create crs based on grid mapping
and coordinate variable units. Will return NULL crs.
Original error:
Error in st_crs.character(base_gm): invalid crs: +proj=longlat +a=6378137 +f=0.00335281066474748 +pm=0 +no_defs +units=degrees
But it does successfully complete, just with a broken coordinate system that can be fixed by approximating the coordinate system originally set by the creators:
lake_2015_01_01_nc <- st_set_crs(lake_2015_01_01_nc, 4979)
However, the functions stars uses to manipulate data don't work on it, as it's a proxy object that points back to the original netcdf4 file. The architecture of the stars package would suggest that if I want to manipulate the file I need to use read_stars().
I've tried to reverse-engineer the problem a little bit by opening the .nc file in QGIS, where it seems to perform as expected. I'm able to display the data and it seems to be georeferenced correctly, which makes me suspect the data is not being read into 'stars' correctly in the read_ncdf() function, and likely in the read_stars() function as well.
I'm not sure how to go about fixing this however, and I'd like to use both this dataset and stars in the analysis. Does anyone have any insight as to what might be going on here and if it's something I can fix?
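For what it's worth, once the CRS is patched with st_set_crs() as above, a minimal sketch of the crop-and-aggregate workflow I'm after might look like this (the bounding-box coordinates are placeholders, and the monthly aggregation assumes several daily files have first been combined along a time dimension):
library(stars)
lake <- read_ncdf('ESACCI-LAKES-L3S-LK_PRODUCTS-MERGED-20150101-fv2.0.2.nc')
lake <- st_set_crs(lake, 4979)
# Placeholder lon/lat bounding box for the lake of interest
bb <- st_bbox(c(xmin = -85, ymin = 44, xmax = -84, ymax = 45), crs = st_crs(lake))
lake_sub <- st_crop(lake, bb)
# After concatenating daily files along time, e.g. c(..., along = "time"):
# monthly <- aggregate(lake_ts, by = "months", FUN = mean, na.rm = TRUE)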

Why does my raster file size increase after aggregating/cropping?

I am working with NetCDF files on air pollution. Monthly data is stored in .nc files that span the entire North America at a resolution of 0.01° × 0.01°. However, I only need the data for the US, at a resolution of 0.5°.
So I do the following:
Crop the data using USA boundaries: raster::mask(raster::crop(PM25, raster::extent(USA)), USA)
Aggregate the data: aggregate(PM25, fact = 50)
However, the file size of my raster increases tremendously after each of these two steps. I also tried to first aggregate and then crop the raster. But in this case, the file size first increases and then decreases slightly after cropping.
Any ideas why this is happening? Large file sizes are a big problem because I have over 100 files on which I need to perform the above operations in a loop.
Any help/advice is appreciated. Thanks.
If what you refer to as "file size" is actually the size of the R object, the explanation is probably that PM25 has its values on disk (in a file) while the RasterBrick with the aggregated values has its values in RAM. If you are concerned about running out of RAM you could instead write the values to file by using a filename argument like this:
aggregate(PM25, fact = 50, filename="agg1.tif")
There is a (perhaps small) performance penalty for writing to disk.
Also, I would think that you ought to first use aggregate, and then use mask to avoid edge effects.
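For the ~100-file loop, a minimal sketch along those lines (the directory, file names, and the USA boundary object are assumptions), aggregating first and writing each result straight to disk:
library(raster)
nc_files <- list.files("pm25_monthly", pattern = "\\.nc$", full.names = TRUE)  # placeholder dir
for (f in nc_files) {
  PM25 <- raster(f)                    # one monthly layer
  agg  <- aggregate(PM25, fact = 50)   # 0.01 deg -> 0.5 deg
  out  <- file.path("out", sub("\\.nc$", "_usa.tif", basename(f)))
  mask(crop(agg, extent(USA)), USA, filename = out, overwrite = TRUE)
}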

Extracting point data from a large shape file in R

I'm having trouble extracting point data from a large shapefile (916.2 MB, 4618197 elements - from here: https://earthdata.nasa.gov/data/near-real-time-data/firms/active-fire-data) in R. I'm using readShapeSpatial in maptools to read in the shapefile, which takes a while but eventually works:
worldmap <- readShapeSpatial("shp_file_name")
I then have a data.frame of coordinates that I want to extract data for. However R is really struggling with this and either loses connection or freezes, even with just one set of coordinates!
pt <- data.frame(lat=-64, long=-13.5)
pt<-SpatialPoints(pt)
e<-over(pt,worldmap)
Could anyone advise me on a more efficient way of doing this?
Or is it the case that I need to run this script on something more powerful (currently using a mac mini with 2.3 GHz processor)?
Many thanks!
By 'point data' do you mean the longitude and latitude coordinates? If that's the case, you can obtain the data underlying the shapefile with:
worldmap@data
You can view this in the same way you would any other data frame, for example:
View(worldmap@data)
You can also access columns in this data frame in the same way you normally would, except you don't need the @data, e.g.:
worldmap$LATITUDE
Finally, it is recommended to use readOGR from the rgdal package rather than maptools::readShapeSpatial as the former reads in the CRS/projection information.
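A hedged sketch putting that together with rgdal; note too that SpatialPoints expects coordinates in x/y (longitude-first) order, so the lat-first data.frame in the question may be silently swapping the axes:
library(rgdal)
library(sp)
worldmap <- readOGR(dsn = ".", layer = "shp_file_name")  # layer name is a placeholder

pt <- data.frame(long = -13.5, lat = -64)  # x = longitude must come first
coordinates(pt) <- ~ long + lat
proj4string(pt) <- proj4string(worldmap)   # match the shapefile's CRS

e <- over(pt, worldmap)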

Metafor measure argument error

I have calculated the effect sizes and pooled SEs the way I wanted. The only thing left is drawing a forest plot and letting metafor calculate the summary effect size. I have over 30 .csv data files to plot separately. When I do that with the following data (below), it plots and calculates the summary effect smoothly.
DeltaPI Spooled
-75.35224985 7.618629848
-51.85221078 7.513461236
-37.77455275 7.164279414
The line I use is:
meta1<-rma(yi=mydata$DeltaPI, sei=mydata$Spooled)
forest(meta1,slab=paste(mydata$Study,mydata$Genotype..Experimental.),showweight=TRUE,alim=c(-100,25),at=c(-100,-50,0,25),xlab="Percentage Change of PI Score",cex=0.7,cex.lab=1,col="red")
However, when I try to do the same thing with some other .csv files I have, rma gives an error and asks for a 'measure' argument. And since the measure is already the DeltaPI I calculated manually, I don't want to specify one.
Weirdly, even if I replace the data in the non-working files with the data that works properly (the three rows above), it still gives the same error. Yet the same data works fine in some other .csv file.
So I'm not clear on why I'm getting the error or what the solution is.
Any comment would be appreciated!
My guess is that this has nothing to do with the plotting, but occurs when the rma() command is run. And it sounds to me like there are issues with how variables are named in the data that you are reading in. You are reading in data from .csv files, but this is probably what is happening:
> library(metafor)
> dat <- data.frame(DeltaP1 = c(.2,.4), Spooled = c(.1,.1))  # note: DeltaP1, not DeltaPI
> rma(dat$DeltaPI, sei = dat$Spooled)
Error in rma(dat$DeltaPI, sei = dat$Spooled) :
  Specify the desired outcome measure via the 'measure' argument.
Because the data frame contains DeltaP1 (with a digit one) rather than DeltaPI, dat$DeltaPI is NULL, so rma() assumes you want it to compute an outcome measure and asks for the 'measure' argument. So, in essence, you should carefully check the variable names.
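A hedged first check before running rma() on each file, with a placeholder file name:
library(metafor)
mydata <- read.csv("one_of_the_csv_files.csv")  # placeholder name
names(mydata)  # should include exactly "DeltaPI" and "Spooled"
meta1 <- rma(yi = mydata$DeltaPI, sei = mydata$Spooled)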

Extract certain values out of netCDF

I have a netCDF file with 3 dimensions. The first dimension is longitude and runs from 1 to 464. The second dimension is latitude and runs from 1 to 201. The third dimension is time and runs from 1 to 5479.
Now I want to extract certain values out of the file. I think one can handle it with the start argument. I tried this command:
test = open.ncdf("rr_0.25deg_reg_1980-1994_v8.0.nc")
data = get.var.ncdf(test,start=c(1:464,1:201,1:365))
But somehow it doesn't work. Does anybody have a solution?
Thanks in advance...
It looks like you are using the ncdf package in R. If you can, I recommend using the updated ncdf4 package, which is based on Unidata's netCDF-4 library.
Back to your problem. I use the ncdf4 package, but I think the ncdf package works the same way. When you call the function get.var.ncdf, you also need to explicitly supply the name of the variable that you want to extract. I think you can get the names of the variables using names(test$var).
So you need to do something like this:
# Open the nc file
test = open.ncdf("rr_0.25deg_reg_1980-1994_v8.0.nc")
# Now get the names of the variables in the nc file
names(test$var)
# Get the data from the first variable listed above
# (May not fit in memory)
data = get.var.ncdf(test,varid=names(test$var)[1])
# If you only want a certain range of data.
# The following will probably not fit in memory either
# data = get.var.ncdf(test,varid=names(test$var)[1])[1:464,1:201,1:365]
For your problem, you would need to replace varid=names(test$var)[1] above with varid='VARIABLE_NAME', where VARIABLE_NAME is the variable you want to extract.
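And since you specifically tried the start argument: get.var.ncdf() also accepts start and count vectors, one entry per dimension, so you can read just that hyperslab from disk instead of the whole array. A sketch, assuming the first listed variable is the one you want:
# start = first index along each dimension (lon, lat, time)
# count = number of values to read along each dimension
data = get.var.ncdf(test, varid = names(test$var)[1],
                    start = c(1, 1, 1), count = c(464, 201, 365))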
Hope that helps.
EDIT:
I installed the ncdf package on my system, and the above code works for me!
You could also do the extraction of timesteps/dates and locations outside of R, before reading the data into R for plotting etc., by using CDO (Climate Data Operators). This has the advantage that you can work directly in coordinate space and specify timesteps or dates directly:
e.g.
cdo seldate,20100101,20121031 in.nc out.nc
cdo sellonlatbox,lon1,lon2,lat1,lat2 in.nc out.nc
