Why does my raster file size increase after aggregating/cropping? - r

I am working with a NETcdf files on air pollution. Monthly data is stored in .nc files, which span over the entire North America, in a resolution of 0.01*0.01. However, I only need the data for US, at a resolution of 0.5.
So I do the following:
crop the data using USA boundaries: raster::mask(raster::crop(PM25, raster::extent(USA)),USA)
Aggregate the data: aggregate(PM25, fact = 50)
However, the file size of my raster increases tremendously after each of these two steps. I also tried to first aggregate and then crop the raster. But in this case, the file size first increases and then deceases slightly after cropping.
Any ideas why this is happening? Large file sizes are a big problem because I have over 100 files on which I need to perform the above operations in a loop.
Any help/advice is appreciated. Thanks.

If what you refer to as "file size" is actually the size of the R object, the explanation is probably that PM25 has its values on disk (in a file) while the RasterBrick with the aggregated values has its values in RAM memory. If you are concerned about running out of RAM you could instead write the values to file by using a filename argument like this
aggregate(PM25, fact = 50, filename="agg1.tif")
There is a (perhaps small) performance penalty for writing to disk.
Also, I would think that you ought to first use aggregate, and then use mask to avoid edge effects.

Related

How to read a stars object with over 32768 bands?

I have a very large dataset consisting of one attribute, simulated daily from 1970 to 2100, defined on a rather fine geographic grid. It has been given to me as a netCDF file, which I would like to read and analyze in an R script. The data is too big to fully fit in memory, so I wrote a script that does the analysis with stars proxy objects and the purrr package. It has worked for similar smaller datasets.
However, this dataset seems too big - there are 45956 bands, one for each time step., and it seems like the read_stars() command has an upper limit to how many bands an object can have. This is what my code looks like after loading the proper librairies, where data_path points to a single .nc file:
data_full <- read_stars(data_path, proxy = TRUE)
It returns the following:
Warning message:
In CPL_read_gdal(as.character(x), as.charater(options), as.characters(driver), :
GDAL Message 1 : Limiting number of bands to 32768 instead of 45956
Then the data is cropped and stops around 2050. I would like to have the full data in the data_full variable. Is is posible to increase the bands limits? Or are there alternative ways of doing this?
Try setting GDAL_MAX_BAND_COUNT to 65536
Python:
gdal.SetConfigOption('GDAL_MAX_BAND_COUNT',65536)
bash:
export GDAL_MAX_BAND_COUNT=65536

R save as NetCDF file after simple calculation

I want to do something (apparently) simple, but didn't yet find the right way to do it:
I read a netcdf file (wind speed from the ERA5 reanalysis) on a grid.
From this, I use the wind speed to calculate a wind capacity factor (using a given power curve).
I then want to write a new netcdf file, with exactly the same structure as the input file, but just replacing the input wind speed by the new variable (wind capacity factor).
Is there a simple/fast way to do this, avoiding to redefine all the dims, vars ... with ncvar_def and ncdim_def ?
Thanks in advance for your replies!
Writing a netcdf file in R is not overly complicated, there is a nice example online here:
http://geog.uoregon.edu/GeogR/topics/netCDF-write-ncdf4.html
You could copy the dimensions from the input file.
However if your wind power curve is a simple analytical expression then you could perform this task in one line from the command line in bash/linux using climate data operators (cdo).
For example, if you have two variables 10u and 10v in the file (I don't recalled the reanalysis names exactly) then you could make a new variable WCF=SQRT(U2+V2) in the following way
cdo expr,'wcf=sqrt(10u**2+10v**2)' input.nc output.nc
See an example here:
https://code.mpimet.mpg.de/boards/53/topics/1622
So if your window power function is an analytical expression you can define it this way without using R at all or worrying about dimensions etc, the new file will have an variable wcf added. You should then probably use NCO to alter the metadata (units etc) to ensure they are appropriate.

High-scale signal processing in R

I have high-dimensional data, for brain signals, that I would like to explore using R.
Since I am a data scientist I really do not work with Matlab, but R and Python. Unfortunately, the team I am working with is using Matlab to record the signals. Therefore, I have several questions for those of you who are interested in data science.
The Matlab files, recorded data, are single objects with the following dimensions:
1000*32*6000
1000: denotes the sampling rate of the signal.
32: denotes the number of channels.
6000: denotes the time in seconds, so that is 1 hour and 40 minutes long.
The questions/challenges I am facing:
I converted the "mat" files I have into CSV files, so I can use them in R.
However, CSV files are 2 dimensional files with the dimensions: 1000*192000.
the CSV files are rather large, about 1.3 gigabytes. Is there a
better way to convert "mat" files into something compatible with R,
and smaller in size? I have tried "R.matlab" with readMat, but it is
not compatible with the 7th version of Matlab; so I tried to save as V6 version, but it says "Error: cannot allocate vector of size 5.7 Gb"
the time it takes to read the CSV file is rather long! It takes
about 9 minutes to load the data. That is using "fread" since the
base R function read.csv takes forever. Is there a better way to
read files faster?
Once I read the data into R, it is 1000*192000; while it is actually
1000*32*6000. Is there a way to have multidimensional object in R,
where accessing signals and time frames at a given time becomes
easier. like dataset[1007,2], which would be the time frame of the
1007 second and channel 2. The reason I want to access it this way
is to compare time frames easily and plot them against each other.
Any answer to any question would be appreciated.
This is a good reference for reading large CSV files: https://rpubs.com/msundar/large_data_analysis A key takeaway is to assign the datatype for each column that you are reading versus having the read function decide based on the content.

Extracting point data from a large shape file in R

I'm having trouble extracting point data from a large shape file (916.2 Mb, 4618197 elements - from here: https://earthdata.nasa.gov/data/near-real-time-data/firms/active-fire-data) in R. I'm using readShapeSpatial in maptools to read in the shape file which takes a while but eventually works:
worldmap <- readShapeSpatial("shp_file_name")
I then have a data.frame of coordinates that I want extract data for. However R is really struggling with this and either loses connection or freezes, even with just one set of coordinates!
pt <-data.frame(lat=-64,long=-13.5)
pt<-SpatialPoints(pt)
e<-over(pt,worldmap)
Could anyone advise me on a more efficient way of doing this?
Or is it the case that I need to run this script on something more powerful (currently using a mac mini with 2.3 GHz processor)?
Many thanks!
By 'point data' do you mean the longitude and latitude coordinates? If that's the case, you can obtain the data underlying the shapefile with:
worldmap#data
You can view this in the same way you would any other data frame, for example:
View(worldmap#data)
You can also access columns in this data frame in the same way you normally would, except you don't need the #data, e.g.:
worldmap$LATITUDE
Finally, it is recommended to use readOGR from the rgdal package rather than maptools::readShapeSpatial as the former reads in the CRS/projection information.

Handling multiple raster files and executing unit conversions on them: R

I've dug around a lot for an answer to this and wasn't able to find anything, so here I am.
I have a whole bunch of ascii raster files corresponding to air temperature and dew point temperature of a certain area over 744 hourly time steps. (So I have 744 air temp and 744 dew point files corresponding to a 31-day month). The files are only about 45 kB each.
I want to stack them together so I can perform some analyses on them, and I also want to convert their units from K to deg F.
The file names air Tair1.txt, Tair2.txt, ... Tair744.txt and Eair1.txt, Eair2.txt, ... Eair744.txt.
Using the raster package, I can easily load all the files as rasters:
for (i in 1:744) {
assign(paste0("Tair",i),raster(paste0("Tair",i,".txt")))
assign(paste0("Eair",i),raster(paste0("Tair",i,".txt")))
}
I've tried to use ls() with pattern or glob2rx to define just the raster file names and
then do conversions on them, or to do something similar to join them in a stack, but to no avail. I also tried mget, values(mget(filename)) and things like that to get at the values in a loop.
I know R doesn't handle large datasets very well, but I'm thinking these aren't really that large so there should be something pretty simple?
I would appreciate any help and advice! Thank you.
The raster package's RasterStack is for this:
library(raster)
files <- paste0("Tair",1:744,".txt")
rs <- stack(files)
Why do you have these files in text format though? Who imposed this disaster on you? I suspect your individual layers have insufficient metadata, so try one and see if it's sensible. You can use extent(rs) <- and projection(rs) <- to fix:
r <- raster(files[1])
print(r)
Don't use assign() that's just creating a mess.

Resources