Writing a netCDF after running xarray.Dataset.reindex to fill gaps in a time series fails with an out-of-memory error

Problem Summary
I am attempting to convert a .grib2 file, representing a single day's worth of gridded radar rainfall data spanning the continental US, into a netCDF. When the .grib2 is missing timesteps, I fill them in with NA values using xarray.Dataset.reindex before calling xarray.Dataset.to_netcdf. However, after I've reindexed the dataset, the script fails with a memory allocation error; it succeeds if I don't reindex. One clue may be that the dataset chunks are set to (70, 3500, 7000), but when ds.to_netcdf is called the script fails while attempting to load a chunk with dimensions (210, 3500, 7000).
Accessing Full Reproducible Example
The code and data to reproduce my results can be downloaded from this Dropbox link. The code is also shown below, followed by the outputs. Potentially relevant OS and environment information is shown below as well.
Code
#%% Import libraries
import time
start_time = time.time()
import xarray as xr
import cfgrib
from glob import glob
import pandas as pd
import dask
dask.config.set(**{'array.slicing.split_large_chunks': False}) # to silence warnings of loading large slice into memory
dask.config.set(scheduler='synchronous') # this forces single threaded computations (netcdfs can only be written serially)
#%% parameters
chnk_sz = "7000MB"
fl_out_nc = "out_netcdfs/20010101.nc"
fldr_in_grib = "in_gribs/20010101.grib2"
#%% loading and exporting dataset
ds = xr.open_dataset(fldr_in_grib, engine="cfgrib", chunks={"time": chnk_sz},
                     backend_kwargs={'indexpath': ''})
# reindex
start_date = pd.to_datetime('2001-01-01')
tstep = pd.Timedelta('0 days 00:05:00')
new_index = pd.date_range(start=start_date, end=start_date + pd.Timedelta(1, "day"),
                          freq=tstep, inclusive='left')
ds = ds.reindex(indexers={"time":new_index})
ds = ds.unify_chunks()
ds = ds.chunk(chunks={'time':chnk_sz})
print("######## INSPECTING DATASET PRIOR TO WRITING TO NETCDF ########")
print(ds)
print(' ')
print("######## ERROR MESSAGE ########")
ds.to_netcdf(fl_out_nc, encoding= {"unknown":{"zlib":True}})
Outputs
######## INSPECTING DATASET PRIOR TO WRITING TO NETCDF ########
<xarray.Dataset>
Dimensions: (time: 288, latitude: 3500, longitude: 7000)
Coordinates:
* time (time) datetime64[ns] 2001-01-01 ... 2001-01-01T23:55:00
* latitude (latitude) float64 54.99 54.98 54.98 54.97 ... 20.03 20.02 20.01
* longitude (longitude) float64 230.0 230.0 230.0 ... 300.0 300.0 300.0
step timedelta64[ns] ...
surface float64 ...
valid_time (time) datetime64[ns] dask.array<chunksize=(288,), meta=np.ndarray>
Data variables:
unknown (time, latitude, longitude) float32 dask.array<chunksize=(70, 3500, 7000), meta=np.ndarray>
Attributes:
GRIB_edition: 2
GRIB_centre: 161
GRIB_centreDescription: 161
GRIB_subCentre: 0
Conventions: CF-1.7
institution: 161
history: 2022-09-10T14:50 GRIB to CDM+CF via cfgrib-0.9.1...
######## ERROR MESSAGE ########
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
d:\Dropbox\_Sharing\reprex\2022-9-9_writing_ncdf_fails\reprex\exporting_netcdfs_reduced.py in <cell line: 22>()
160 print(' ')
161 print("######## ERROR MESSAGE ########")
---> 162 ds.to_netcdf(fl_out_nc, encoding= {"unknown":{"zlib":True}})
File c:\Users\Daniel\anaconda3\envs\weather_gen_3\lib\site-packages\xarray\core\dataset.py:1882, in Dataset.to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute, invalid_netcdf)
1879 encoding = {}
1880 from ..backends.api import to_netcdf
-> 1882 return to_netcdf( # type: ignore # mypy cannot resolve the overloads:(
1883 self,
1884 path,
1885 mode=mode,
1886 format=format,
1887 group=group,
1888 engine=engine,
1889 encoding=encoding,
1890 unlimited_dims=unlimited_dims,
1891 compute=compute,
1892 multifile=False,
1893 invalid_netcdf=invalid_netcdf,
1894 )
File c:\Users\xxxxx\anaconda3\envs\weather_gen_3\lib\site-packages\xarray\backends\api.py:1219, in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf)
...
121 return arg
File <__array_function__ internals>:180, in where(*args, **kwargs)
MemoryError: Unable to allocate 19.2 GiB for an array with shape (210, 3500, 7000) and data type float32
Environment
windows 11 Home
xarray 2022.3.0
cfgrib 0.9.10.1
dask 2022.7.0

A functional workaround is to chunk along a dimension that is unchanged by the reindexing. The following modification allows the script to run successfully:
ds = xr.open_dataset(
    fldr_in_grib,
    engine="cfgrib",
    chunks={"latitude": 875},
    backend_kwargs={'indexpath': ''}
)
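For completeness, a second (untested) sketch of the same idea: explicitly re-chunking the time dimension to a fixed number of steps after the reindex, matching the original on-disk chunking, should also keep dask from having to materialize a merged (210, 3500, 7000) chunk. The chunk size of 70 is taken from the dataset inspection above and is an assumption, not a tested value.
# untested sketch: force uniform integer chunks along time after the reindex,
# so no merged chunk larger than the original (70, 3500, 7000) is built
ds = ds.reindex(indexers={"time": new_index})
ds = ds.chunk({"time": 70})
ds.to_netcdf(fl_out_nc, encoding={"unknown": {"zlib": True}})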

Related

How would I go about web scraping from an interactive map?

This pertains to this interactive map: https://www.newworld-map.com/?filters=ores
Take the ores, for example: how would I go about getting the coordinates of each node? It looks like the HTML element is a canvas, and I could not for the life of me figure out where it pulls the data from.
Any help would be greatly appreciated.
Hoping that the next OP's question will be more in line with Stack Overflow's guidelines (see https://stackoverflow.com/help/minimal-reproducible-example), one way to solve this is to inspect which network calls are made when the page loads and scrape the API endpoint the data is pulled from, like below:
import requests
import pandas as pd
import time
time_stamp = int(time.time_ns() / 1000)  # current time in microseconds, used as a query parameter
ore_list = []
url = f'https://www.newworld-map.com/markers.json?time={time_stamp}'
ores = requests.get(url).json()['ores']
for ore in ores:
    for x in ores[ore]:
        ore_list.append((ore, x, ores[ore][x]['x'], ores[ore][x]['y']))
df = pd.DataFrame(ore_list, columns = ['Ore', 'Code', 'X_Coord', 'Y_Coord'])
print(df)
Result in terminal:
Ore Code X_Coord Y_Coord
0 brimstone 02d1ba070438d53ce5fbb1955cd7d694 7473.096191 8715.674805
1 brimstone 0a50c499af034aeb6f38e011648a2ea8 7471.124512 8709.161133
2 brimstone 0b5b190c31eb3d314d993dd393aadfe8 5670.894043 7862.319336
3 brimstone 0f5c7427c75d80e10f71f9e92ddc4362 5883.601562 7703.445801
4 brimstone 20b0801bdb41c7dafbb1053b43c25bd8 6020.838379 8147.747070
... ... ... ... ...
4260 starmetal 86h 8766.964000 8431.438000
4261 starmetal 86i 8598.688000 8562.974000
4262 starmetal 86j 8586.000000 8211.000000
4263 starmetal 86k 8688.938000 8509.722000
4264 starmetal 86l 8685.827000 8505.694000
4265 rows × 4 columns

Line profiling with Cython in a Jupyter notebook

I'm trying to use the line_profiler library in a Jupyter notebook with a Cython function. It only works halfway: the result I get consists only of the first row of the function, with no profiling results.
%%cython -a
# cython: linetrace=True
# cython: binding=True
# distutils: define_macros=CYTHON_TRACE_NOGIL=1
import numpy as np
cimport numpy as np
from datetime import datetime
import math
cpdef np.int64_t get_days(np.int64_t year, np.int64_t month):
    cdef np.ndarray months=np.array([31,28,31,30,31,30,31,31,30,31,30,31])
    if month==2:
        if (year%4==0 and year%100!=0) or (year%400==0):
            return 29
    return months[month-1]
The profiling result only shows one line of code:
Timer unit: 1e-07 s
Total time: 0.0015096 s
File: .ipython\cython\_cython_magic_0154a9feed9bbd6e4f23e57d73acf50f.pyx
Function: get_days at line 15
Line # Hits Time Per Hit % Time Line Contents
==============================================================
15 cpdef np.int64_t get_days(np.int64_t year, np.int64_t month):
This can be seen as a bug in line_profiler (if it is supposed to support Cython at all). To get the code of the profiled function, line_profiler reads the pyx file and tries to extract the code with the help of inspect.getblock:
...
# read pyx-file
all_lines = linecache.getlines(filename)
# try to extract the body of the function starting at start_lineno:
sublines = inspect.getblock(all_lines[start_lineno-1:])
...
However, getblock knows nothing about cpdef functions, as Python has only def functions, and thus yields the wrong function body (i.e. only the signature).
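A minimal sketch illustrating this (the source lines below are a hand-written stand-in for what line_profiler reads from the generated pyx file):
import inspect

# stand-in for the lines read from the pyx file
src = [
    "cpdef np.int64_t get_days(np.int64_t year, np.int64_t month):\n",
    "    cdef np.ndarray months = np.array([31, 28, 31])\n",
    "    return months[month - 1]\n",
]

# BlockFinder never sees a 'def', 'class' or 'lambda' token, so it never
# marks a block as started and getblock falls back to the first line only
print(inspect.getblock(src))
# ['cpdef np.int64_t get_days(np.int64_t year, np.int64_t month):\n']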
Workaround:
A simple workaround would be to introduce a dummy def function that acts as a sentinel for the cpdef function, in such a way that inspect.getblock yields the whole body of the cpdef function plus the body of the sentinel function, i.e.:
%%cython
...
cpdef np.int64_t get_days(np.int64_t year, np.int64_t month):
...
def get_days_sentinel():
    pass
and now the report %lprun -f get_days get_days(2019,3) looks as follows:
Timer unit: 1e-06 s
Total time: 1.7e-05 s
File: XXXX.pyx
Function: get_days at line 10
Line # Hits Time Per Hit % Time Line Contents
==============================================================
10 cpdef np.int64_t get_days(np.int64_t year, np.int64_t month):
11 1 14.0 14.0 82.4 cdef np.ndarray months=np.array([31,28,31,30,31,30,31,31,30,31,30,31])
12 1 1.0 1.0 5.9 if month==2:
13 if (year%4==0 and year%100!=0) or (year%400==0):
14 return 29
15 1 2.0 2.0 11.8 return months[month-1]
16
17 def get_days_sentinel():
18 pass
There are still somewhat ugly trailing lines from the sentinel, but it is probably better than not seeing anything at all.

Restore vgg16 network in tensorflow

This one has been giving me a headache for quite some time now, even though it seems to be very basic.
I have the vgg16 network downloaded as a .ckpt file
(from https://github.com/tensorflow/models/blob/master/slim/README.md#Pretrained)
Now what I want to do is load, for example, the tensor of the first convolution layer of this network as an array in R.
I tried
restorer = tf$train$Saver()
sess = tf$Session()
restorer$restore(sess, "/home/beheerder/R/vgg_16.ckpt")
But then I do not see any variables appearing in my environment.
I'm working in R, but an answer in Python is OK as well, as I can probably translate it to R.
Saver takes the variables to restore in its constructor. In other words, you have to create the variables before you can restore them. Here is the example from Saver's documentation:
v1 = tf.Variable(..., name='v1')
v2 = tf.Variable(..., name='v2')
# Pass the variables as a dict:
saver = tf.train.Saver({'v1': v1, 'v2': v2})
# Or pass them as a list.
saver = tf.train.Saver([v1, v2])
If you were to run the first line of your code in Python, you would get:
In [1]: import tensorflow as tf
In [2]: saver = tf.train.Saver()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-2-18da33d742f9> in <module>()
----> 1 saver = tf.train.Saver()
/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.pyc in __init__(self, var_list, reshape, sharded, max_to_keep, keep_checkpoint_every_n_hours, name, restore_sequentially, saver_def, builder, defer_build, allow_empty, write_version, pad_step_number)
1054 self._pad_step_number = pad_step_number
1055 if not defer_build:
-> 1056 self.build()
1057 if self.saver_def:
1058 self._check_saver_def()
/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.pyc in build(self)
1075 return
1076 else:
-> 1077 raise ValueError("No variables to save")
1078 self._is_empty = False
1079 self.saver_def = self._builder.build(
ValueError: No variables to save
You can see how model variables are created before being restored in the 20 lines starting from https://github.com/tensorflow/models/blob/master/slim/train_image_classifier.py#L338
This code gets executed if you make a call to train_image_classifier.py similar to the flower example in https://github.com/tensorflow/models/blob/master/slim/README.md#fine-tuning-a-model-from-an-existing-checkpoint
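For reference, a Python (TF 1.x) sketch of the create-then-restore pattern using the slim VGG-16 definition; the tensor name below follows the usual slim naming convention but should be verified against the checkpoint (e.g. with tf.train.NewCheckpointReader):
import tensorflow as tf
from tensorflow.contrib.slim.nets import vgg

slim = tf.contrib.slim

# build the graph first, so that Saver has variables to restore
images = tf.placeholder(tf.float32, [None, 224, 224, 3])
with slim.arg_scope(vgg.vgg_arg_scope()):
    logits, _ = vgg.vgg_16(images, num_classes=1000, is_training=False)

saver = tf.train.Saver()  # no "No variables to save" error now
with tf.Session() as sess:
    saver.restore(sess, "/home/beheerder/R/vgg_16.ckpt")
    # pull the first convolution kernel out as a numpy array
    conv1 = tf.get_default_graph().get_tensor_by_name("vgg_16/conv1/conv1_1/weights:0")
    print(sess.run(conv1).shape)  # expected (3, 3, 3, 64)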

PipelinedRDD can't be converted to a DataFrame using toDF

I have a PySpark DataFrame containing rows of data separated by commas. I want to split each row and apply the LabeledPoint method to it, then convert it to a DataFrame.
Here is my code
import os.path
from pyspark.mllib.regression import LabeledPoint
import numpy as np
file_name = os.path.join('databricks-datasets', 'cs190', 'data-001', 'millionsong.txt')
raw_data_df = sqlContext.read.load(file_name, 'text')
rdd = raw_data_df.rdd.map(lambda line: line.split(',')).map(lambda seq:LabeledPoints(seq[0],seq[1:])).toDF()
It gives the following error message after applying .toDF().
---------------------------------------------------------------------------
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 38.0 failed 1 times, most recent failure: Lost task 0.0 in stage 38.0 (TID 44, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
Py4JJavaError Traceback (most recent call last)
<ipython-input-65-dc4d86a8ee45> in <module>()
----> 1 rdd = raw_data_df.rdd.map(lambda line: line.split(',')).map(lambda seq:LabeledPoints(seq[0],seq[1:])).toDF()
2 print(type(rdd))
3 #print(rdd.take(5))
/databricks/spark/python/pyspark/sql/context.py in toDF(self, schema, sampleRatio)
62 [Row(name=u'Alice', age=1)]
63 """
---> 64 return sqlContext.createDataFrame(self, schema, sampleRatio)
65
66 RDD.toDF = toDF
/databricks/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
421
422 if isinstance(data, RDD):
--> 423 rdd, schema = self._createFromRDD(data, schema, samplingRatio)
424 else:
425 rdd, schema = self._createFromLocal(data, schema)
/databricks/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, schema, samplingRatio)
Answer found:
rdd = raw_data_df.map(lambda row: row['value'].split(',')).map(lambda seq:LabeledPoint(float(seq[0]),seq[1:])).toDF()
Here, I need to specifically reference each line of text using row['value'], even though there is only one feature in the row.
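A sketch of an equivalent pipeline (assuming a SparkContext sc is available alongside sqlContext): reading the file as a plain-text RDD sidesteps the Row wrapper entirely, since textFile yields strings rather than Rows:
# sketch assuming sc (SparkContext) is available; textFile yields plain
# strings, so no row['value'] lookup is needed
rdd = (sc.textFile(file_name)
         .map(lambda line: line.split(','))
         .map(lambda seq: LabeledPoint(float(seq[0]), seq[1:])))
df = rdd.toDF()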

XLDateAmbiguous error even when using dayfirst argument

I'm trying to import data from an Excel spreadsheet into a pandas DataFrame object, parsing dates. I'm using dayfirst, however I still get an XLDateAmbiguous error (docs).
The dates are in a single column in the format 25/09/1990.
Could somebody explain to me why this is happening and how I can fix it? Thanks in advance.
Edit: It seems as though the problem is caused by xlrd attempting to parse a non-date column as a date even though I've specified which column the dates are in. Unfortunately I don't know how to explicitly indicate that a column should not be parsed as dates. Does anybody have any ideas?
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
import dateutil
path6 = 'C:\\Users\\Site2_Homepage_2013-06-04.xlsx'
df8 = pd.io.excel.read_excel(path6, 'Site2_Homepage_2012_06_13', header=1, parse_dates=True, dayfirst=True)
XLDateAmbiguous Traceback (most recent call last)
<ipython-input-17-4a83d104ab72> in <module>()
4 path7 = 'C:\\Users\\Site4_Homepage_2013-06-04.xlsx'
5 path8 = 'C:\\Users\\Site7_Homepage_2013-06-04.xlsx'
----> 6 df8 = pd.io.excel.read_excel(path6, 'Site2_Homepage_2012_06_13', header=1, parse_dates=True, dayfirst=True)
7 df9 = pd.io.excel.read_excel(path7, 'Site4_Homepage_2012_06_13', header=1, parse_dates=[3], dayfirst=True)
8 df10 = pd.io.excel.read_excel(path8, 'Site7_Homepage_2012_06_13', header=1, parse_dates=[3], dayfirst=True)
C:\Users\AppData\Local\Enthought\Canopy32\User\lib\site-packages\pandas\io\excel.pyc in read_excel(io, sheetname, **kwds)
101 engine = kwds.pop('engine', None)
102
--> 103 return ExcelFile(io, engine=engine).parse(sheetname=sheetname, **kwds)
104
105
C:\Users\AppData\Local\Enthought\Canopy32\User\lib\site-packages\pandas\io\excel.pyc in parse(self, sheetname, header, skiprows, skip_footer, index_col, parse_cols, parse_dates, date_parser, na_values, thousands, chunksize, convert_float, has_index_names, **kwds)
206 skip_footer=skip_footer,
207 convert_float=convert_float,
--> 208 **kwds)
209
210 def _should_parse(self, i, parse_cols):
C:\Users\AppData\Local\Enthought\Canopy32\User\lib\site-packages\pandas\io\excel.pyc in _parse_excel(self, sheetname, header, skiprows, skip_footer, index_col, has_index_names, parse_cols, parse_dates, date_parser, na_values, thousands, chunksize, convert_float, **kwds)
267 if parse_cols is None or should_parse[j]:
268 if typ == XL_CELL_DATE:
--> 269 dt = xldate_as_tuple(value, datemode)
270 # how to produce this first case?
271 if dt[0] < datetime.MINYEAR: # pragma: no cover
C:\Users\AppData\Local\Enthought\Canopy32\User\lib\site-packages\xlrd\xldate.pyc in xldate_as_tuple(xldate, datemode)
78
79 if xldays < 61 and datemode == 0:
---> 80 raise XLDateAmbiguous(xldate)
81
82 jdn = xldays + _JDN_delta[datemode]
XLDateAmbiguous: 15.3
I didn't manage to find a solution to this. In the end I had to use .csv versions of the files for the dates to parse correctly.
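A sketch of that CSV fallback (the file path and the date column name here are hypothetical; the sheet is assumed to have been exported to CSV first):
import pandas as pd

# hypothetical path and column name; the sheet was exported to CSV beforehand
df8 = pd.read_csv('C:\\Users\\Site2_Homepage_2013-06-04.csv',
                  header=1, parse_dates=['Date'], dayfirst=True)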
