Line profiling with Cython in a Jupyter notebook

I'm trying to use the line_profiler library in a Jupyter notebook with a Cython function. It only works halfway: the result consists of nothing but the first line of the function, with no per-line profiling data.
%%cython -a
# cython: linetrace=True
# cython: binding=True
# distutils: define_macros=CYTHON_TRACE_NOGIL=1
import numpy as np
cimport numpy as np
from datetime import datetime
import math

cpdef np.int64_t get_days(np.int64_t year, np.int64_t month):
    cdef np.ndarray months = np.array([31,28,31,30,31,30,31,31,30,31,30,31])
    if month == 2:
        if (year%4 == 0 and year%100 != 0) or (year%400 == 0):
            return 29
    return months[month-1]
The profiling result shows only one line of code:
Timer unit: 1e-07 s
Total time: 0.0015096 s
File: .ipython\cython\_cython_magic_0154a9feed9bbd6e4f23e57d73acf50f.pyx
Function: get_days at line 15
Line # Hits Time Per Hit % Time Line Contents
==============================================================
15 cpdef np.int64_t get_days(np.int64_t year, np.int64_t month):

This can be seen as a bug in line_profiler (if it is supposed to support Cython at all). To get the code of the profiled function, line_profiler reads the pyx file and tries to extract the function body with the help of inspect.getblock:
...
# read the pyx-file
all_lines = linecache.getlines(filename)
# try to extract the body of the function starting at start_lineno:
sublines = inspect.getblock(all_lines[start_lineno-1:])
...
However, getblock knows nothing about cpdef functions, as Python itself has only def functions, and thus yields the wrong function body, i.e. only the signature.
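This is easy to reproduce outside of line_profiler. In the minimal sketch below (the lines are illustrative stand-ins for the generated pyx file), getblock's block finder only starts a block at a def/class/lambda keyword, so for a cpdef line it never starts one and falls back to returning just the first line:
import inspect

# lines as they would appear in the generated .pyx file (illustrative only)
lines = [
    "cpdef np.int64_t get_days(np.int64_t year, np.int64_t month):\n",
    "    cdef np.ndarray months = np.array([31,28,31,30,31,30,31,31,30,31,30,31])\n",
    "    return months[month-1]\n",
]
print(inspect.getblock(lines))
# ['cpdef np.int64_t get_days(np.int64_t year, np.int64_t month):\n']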
Workaround:
A simple workaround would be to introduce a dummy def function as a sentinel for the cpdef function, placed in such a way that inspect.getblock yields the whole body of the cpdef function plus the body of the sentinel function, i.e.:
%%cython
...
cpdef np.int64_t get_days(np.int64_t year, np.int64_t month):
    ...
def get_days_sentinel():
    pass
and now the report %lprun -f get_days get_days(2019,3) looks as follows:
Timer unit: 1e-06 s
Total time: 1.7e-05 s
File: XXXX.pyx
Function: get_days at line 10
Line # Hits Time Per Hit % Time Line Contents
==============================================================
10 cpdef np.int64_t get_days(np.int64_t year, np.int64_t month):
11 1 14.0 14.0 82.4 cdef np.ndarray months=np.array([31,28,31,30,31,30,31,31,30,31,30,31])
12 1 1.0 1.0 5.9 if month==2:
13 if (year%4==0 and year%100!=0) or (year%400==0):
14 return 29
15 1 2.0 2.0 11.8 return months[month-1]
16
17 def get_days_sentinel():
18 pass
There are still some ugly trailing lines from the sentinel, but that is probably better than not seeing anything at all.
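For reference, here is the complete workaround cell, assembled from the question's code plus the sentinel (nothing new beyond the pieces shown above):
%%cython -a
# cython: linetrace=True
# cython: binding=True
# distutils: define_macros=CYTHON_TRACE_NOGIL=1
import numpy as np
cimport numpy as np

cpdef np.int64_t get_days(np.int64_t year, np.int64_t month):
    cdef np.ndarray months = np.array([31,28,31,30,31,30,31,31,30,31,30,31])
    if month == 2:
        if (year%4 == 0 and year%100 != 0) or (year%400 == 0):
            return 29
    return months[month-1]

def get_days_sentinel():
    pass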

Related

Writing netcdf after running xarray.dataset.reindex to fill gaps in a time series fails due to memory allocation error

Problem Summary
I am attempting to convert a .grib2 file, representing a single day's worth of gridded radar rainfall data spanning the continental US, into a netcdf. When a .grib2 file is missing timesteps, I attempt to fill them in with NA values using xarray.Dataset.reindex before running xarray.Dataset.to_netcdf. However, after I've reindexed the dataset, the script fails due to a memory allocation error. It succeeds if I don't reindex. One clue may be that the dataset chunks are set to (70, 3500, 7000), but when ds.to_netcdf is called, the script fails because it attempts to load a chunk with dimensions (210, 3500, 7000).
Accessing Full Reproducible Example
The code and data to reproduce my results can be downloaded from this Dropbox link. The code is also shown below followed by the outputs. Potentially relevant OS and environment information are shown below as well.
Code
#%% Import libraries
import time
start_time = time.time()
import xarray as xr
import cfgrib
from glob import glob
import pandas as pd
import dask
dask.config.set(**{'array.slicing.split_large_chunks': False}) # to silence warnings of loading large slice into memory
dask.config.set(scheduler='synchronous') # this forces single threaded computations (netcdfs can only be written serially)
#%% parameters
chnk_sz = "7000MB"
fl_out_nc = "out_netcdfs/20010101.nc"
fldr_in_grib = "in_gribs/20010101.grib2"
#%% loading and exporting dataset
ds = xr.open_dataset(fldr_in_grib, engine="cfgrib", chunks={"time": chnk_sz},
                     backend_kwargs={'indexpath': ''})
# reindex
start_date = pd.to_datetime('2001-01-01')
tstep = pd.Timedelta('0 days 00:05:00')
new_index = pd.date_range(start=start_date, end=start_date + pd.Timedelta(1, "day"),
                          freq=tstep, inclusive='left')
ds = ds.reindex(indexers={"time":new_index})
ds = ds.unify_chunks()
ds = ds.chunk(chunks={'time':chnk_sz})
print("######## INSPECTING DATASET PRIOR TO WRITING TO NETCDF ########")
print(ds)
print(' ')
print("######## ERROR MESSAGE ########")
ds.to_netcdf(fl_out_nc, encoding= {"unknown":{"zlib":True}})
Outputs
######## INSPECTING DATASET PRIOR TO WRITING TO NETCDF ########
<xarray.Dataset>
Dimensions: (time: 288, latitude: 3500, longitude: 7000)
Coordinates:
* time (time) datetime64[ns] 2001-01-01 ... 2001-01-01T23:55:00
* latitude (latitude) float64 54.99 54.98 54.98 54.97 ... 20.03 20.02 20.01
* longitude (longitude) float64 230.0 230.0 230.0 ... 300.0 300.0 300.0
step timedelta64[ns] ...
surface float64 ...
valid_time (time) datetime64[ns] dask.array<chunksize=(288,), meta=np.ndarray>
Data variables:
unknown (time, latitude, longitude) float32 dask.array<chunksize=(70, 3500, 7000), meta=np.ndarray>
Attributes:
GRIB_edition: 2
GRIB_centre: 161
GRIB_centreDescription: 161
GRIB_subCentre: 0
Conventions: CF-1.7
institution: 161
history: 2022-09-10T14:50 GRIB to CDM+CF via cfgrib-0.9.1...
######## ERROR MESSAGE ########
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
d:\Dropbox\_Sharing\reprex\2022-9-9_writing_ncdf_fails\reprex\exporting_netcdfs_reduced.py in <cell line: 22>()
160 print(' ')
161 print("######## ERROR MESSAGE ########")
---> 162 ds.to_netcdf(fl_out_nc, encoding= {"unknown":{"zlib":True}})
File c:\Users\Daniel\anaconda3\envs\weather_gen_3\lib\site-packages\xarray\core\dataset.py:1882, in Dataset.to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute, invalid_netcdf)
1879 encoding = {}
1880 from ..backends.api import to_netcdf
-> 1882 return to_netcdf( # type: ignore # mypy cannot resolve the overloads:(
1883 self,
1884 path,
1885 mode=mode,
1886 format=format,
1887 group=group,
1888 engine=engine,
1889 encoding=encoding,
1890 unlimited_dims=unlimited_dims,
1891 compute=compute,
1892 multifile=False,
1893 invalid_netcdf=invalid_netcdf,
1894 )
File c:\Users\xxxxx\anaconda3\envs\weather_gen_3\lib\site-packages\xarray\backends\api.py:1219, in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf)
...
121 return arg
File <__array_function__ internals>:180, in where(*args, **kwargs)
MemoryError: Unable to allocate 19.2 GiB for an array with shape (210, 3500, 7000) and data type float32
Environment
Windows 11 Home
xarray 2022.3.0
cfgrib 0.9.10.1
dask 2022.7.0
A functional workaround is to chunk by a dimension that is unchanged during reindexing. The following modification causes the script to run successfully:
ds = xr.open_dataset(
    fldr_in_grib,
    engine="cfgrib",
    chunks={"latitude": 875},
    backend_kwargs={'indexpath': ''},
)
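A quick sanity check before writing can catch this class of failure early. The sketch below (assuming the ds from the code above; the 4-byte multiplier reflects the float32 dtype of the "unknown" variable) prints the largest chunk that to_netcdf will have to materialize:
import numpy as np

# largest chunk along each dimension that to_netcdf will load at once
print(ds["unknown"].data.chunksize)

# rough memory footprint of that chunk (float32 = 4 bytes per value)
nbytes = np.prod(ds["unknown"].data.chunksize) * 4
print(f"{nbytes / 2**30:.1f} GiB per chunk")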

Groovy: Time in ISO 8601 format

How do I get the current time, and the time 15 minutes ago, in ISO 8601 format (YYYY-MM-DDTHH:mm:ss) in Groovy?
You can use java.time's Instant and its default toString() format:
import java.time.*
def now = Instant.now()
def fifteenAgo = now.minus(Duration.ofMinutes(15))
println "Now is ${now} and 15 mins ago was ${fifteenAgo}"
Prints:
Now is 2020-06-30T19:53:17.445039Z and 15 mins ago was 2020-06-30T19:38:17.445039Z
You can format the date any way you want in Groovy, e.g.:
println new Date().format("yyyy-MM-dd HH.mm.ss.SSSSS Z")
Then, you can do calculations on the date, like this:
new Date(System.currentTimeMillis()-91*60*1000)
which subtracts 91 minutes (91 min * 60 s * 1000 ms).
Then you can put the statements together, which is why Groovy is great:
def a = new Date(System.currentTimeMillis()-91*60*1000).format("yyyy-MM-dd")
And so you can get the half before the T. And the half after the T:
def b = new Date(System.currentTimeMillis()-91*60*1000).format("HH:mm:ss")
And then concatenate them with a T:
println "91 minutes ago in iso 8601 format is: ${a}T${b}"
There are other ways of doing it, such as with TimeCategory.minus, but this is a good illustration. I used 91 minutes, but you can adapt it to your own requirement.

Why is the difference of two datetime.now() objects not close to 0?

I am confused about the output of the following code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import datetime
import pytz
local_time = datetime.datetime.now(pytz.timezone('Europe/Berlin'))
utc_time = datetime.datetime.now(datetime.timezone.utc)
print(local_time - utc_time)
I thought a timezone was simply an offset (one that depends on factors such as the date and the location, with DST and definitions changing over time... so not that simple, but still an offset). So I thought a timezone-aware datetime would simply be:
utc time      = 2018-01-01 at 12:34
+ offset of timezone Europe/Berlin relative to UTC (+01:00)
=> local time = (utc + offset) = 2018-01-01 at 13:34
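That mental model checks out in code; a minimal sketch using the same example instant (pytz and the Europe/Berlin zone are the ones from the question):
import datetime
import pytz

utc_dt = datetime.datetime(2018, 1, 1, 12, 34, tzinfo=datetime.timezone.utc)
berlin_dt = utc_dt.astimezone(pytz.timezone('Europe/Berlin'))
print(berlin_dt)              # 2018-01-01 13:34:00+01:00
print(berlin_dt.utcoffset())  # 1:00:00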
Then I thought the difference between two timezone-aware datetime objects should equal the difference between the underlying points in time, regardless of the timezone each one is expressed in. Hence, if I execute datetime.now at (about) the same point in time and "assign" the results to different timezones, the difference should be on the order of the milliseconds that passed between the two calls.
But it actually is -1 day, 23:59:59.999982.
Found it; it's just a representation issue:
>>> print(local_time)
2018-03-13 14:01:14.973876+01:00
>>> print(utc_time)
2018-03-13 13:01:14.973899+00:00
>>> print(utc_time - local_time)
0:00:00.000023
>>> print(local_time - utc_time)
-1 day, 23:59:59.999977
>>> print((local_time - utc_time).total_seconds())
-2.3e-05
so it is -1 day + 23:59:59.999977, i.e. -0.000023 seconds
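This is simply how timedelta normalizes its fields: only days may be negative, while seconds and microseconds are always non-negative, so a tiny negative difference prints as -1 day plus almost a full day. A minimal sketch:
from datetime import timedelta

# negative timedeltas are normalized so that only `days` is negative
td = timedelta(microseconds=-23)
print(td)                                    # -1 day, 23:59:59.999977
print(td.days, td.seconds, td.microseconds)  # -1 86399 999977
print(td.total_seconds())                    # -2.3e-05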

Restore vgg16 network in tensorflow

This one has been giving me a headache for quite some time now, even though it seems to be very basic.
I have the vgg16 network downloaded as a .ckpt
(from https://github.com/tensorflow/models/blob/master/slim/README.md#Pretrained)
Now what I want to do is load, for example, the tensor of the first convolution layer of this network as an array in R.
I tried
restorer = tf$train$Saver()
sess = tf$Session()
restorer$restore(sess, "/home/beheerder/R/vgg_16.ckpt")
But then I do not see any variables appearing in my environment.
I'm working in R, but an answer in Python is OK as well, as I can probably translate it to R.
Saver takes the variables to restore in its constructor. In other words, you have to create the variables before you can restore them. Here is the example from Saver's documentation:
v1 = tf.Variable(..., name='v1')
v2 = tf.Variable(..., name='v2')
# Pass the variables as a dict:
saver = tf.train.Saver({'v1': v1, 'v2': v2})
# Or pass them as a list.
saver = tf.train.Saver([v1, v2])
If you were to run the first line of your code in python you would get:
In [1]: import tensorflow as tf
In [2]: saver = tf.train.Saver()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-2-18da33d742f9> in <module>()
----> 1 saver = tf.train.Saver()
/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.pyc in __init__(self, var_list, reshape, sharded, max_to_keep, keep_checkpoint_every_n_hours, name, restore_sequentially, saver_def, builder, defer_build, allow_empty, write_version, pad_step_number)
1054 self._pad_step_number = pad_step_number
1055 if not defer_build:
-> 1056 self.build()
1057 if self.saver_def:
1058 self._check_saver_def()
/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.pyc in build(self)
1075 return
1076 else:
-> 1077 raise ValueError("No variables to save")
1078 self._is_empty = False
1079 self.saver_def = self._builder.build(
ValueError: No variables to save
You can see how model variables are created before being restored in the 20 lines starting from https://github.com/tensorflow/models/blob/master/slim/train_image_classifier.py#L338
This code gets executed if you make a call to train_image_classifier.py similar to the flower example in https://github.com/tensorflow/models/blob/master/slim/README.md#fine-tuning-a-model-from-an-existing-checkpoint
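As an illustration in Python, here is a minimal sketch of that pattern for this checkpoint. It assumes the TF-slim nets package (the vgg module shipped with tensorflow/models) is importable; the tensor name for the first convolution kernel follows slim's standard vgg_16 naming and is worth double-checking against tf.global_variables():
import tensorflow as tf
from tensorflow.contrib.slim.nets import vgg

slim = tf.contrib.slim

# build the vgg_16 graph first, so that Saver has variables to restore
images = tf.placeholder(tf.float32, shape=[None, 224, 224, 3])
with slim.arg_scope(vgg.vgg_arg_scope()):
    logits, end_points = vgg.vgg_16(images, num_classes=1000, is_training=False)

saver = tf.train.Saver()  # the graph now contains the vgg_16/* variables
with tf.Session() as sess:
    saver.restore(sess, "/home/beheerder/R/vgg_16.ckpt")
    # fetch the first convolution layer's kernel as a numpy array
    conv1_1 = tf.get_default_graph().get_tensor_by_name(
        "vgg_16/conv1/conv1_1/weights:0")
    weights = sess.run(conv1_1)
    print(weights.shape)  # (3, 3, 3, 64)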

Read a CSV of year, day-of-year, and hour/minute columns into a pandas datetime index

I am trying to read in a csv in this form:
2014,92,1931,6.234,10.14
2014,92,1932,5.823,9.49
2014,92,1933,5.33,7.65
2014,92,1934,4.751,6.19
2014,92,1935,4.156,5.285
2014,92,1936,3.962,4.652
2014,92,1937,3.74,4.314
2014,92,1938,3.325,3.98
2014,92,1939,2.909,3.847
2014,92,1940,2.878,3.164
To be clear, this is (Year, Day of year, 2400hr time, and 2 columns of values).
I gave this some thought in a previous question, but to no avail; it is proving to be a matter of several problems... (Create an indexed datetime from date/time info in 3 columns using pandas)
As noted in the above question, the following "read_csv" attempt
df = pd.read_csv("home_prepped.dat", parse_dates={"dt" : [0,1,2]},
date_parser=parser, header=None)
triggers a TypeError:
TypeError: parser() takes exactly 1 argument (3 given)
This is due to the "parse_dates" arg having 0,1,2 in it.
I have also tried putting them in double brackets [[0,1,2]] and get:
ValueError: [0, 1, 2] is not in list
I have gotten past this by setting parse_dates=True and thought I could just set_index after but get this:
TypeError: must be string, not numpy.int64
My parser gets hung up on the format too, and I have read conflicting stories about zero-padding the day-of-year value. Mine are not zero-padded, but even then, the above errors aside, I have had the format get hung up on the first value, the year! Here is the parser:
def parser(x):
    return pd.datetime.strptime(x, '%Y %j %H%M')
So yes, I have had errors saying '2014' is not recognized, and '92' (day of year) is not recognized, but I have been encouraged because at least strptime has been able to make its way "through" to try out the format.
I am wondering if this has something to do with my data.
I am looking for a way to get this datetime info indexed as a datetime, and I have had nothing but problems. I have gone ahead and zero-padded the day-of-year values in case someone wants to test whether the padding is the problem; see below:
2014,092,1931,6.234,10.14
2014,092,1932,5.823,9.49
2014,092,1933,5.33,7.65
2014,092,1934,4.751,6.19
2014,092,1935,4.156,5.285
2014,092,1936,3.962,4.652
2014,092,1937,3.74,4.314
2014,092,1938,3.325,3.98
2014,092,1939,2.909,3.847
2014,092,1940,2.878,3.164
Thanks for your help guys, I am starting to really get frustrated here :S
After correcting your %m (month) to %M (minute), your code works for me:
>>> import pandas as pd
>>> print pd.version.version
0.15.2-10-gf7af818
>>>
>>> def parser(x):
... return pd.datetime.strptime(x, '%Y %j %H%M')
...
>>> df = pd.read_csv("home_prepped.dat", parse_dates={"dt" : [0,1,2]},
... date_parser=parser, header=None)
>>> df
dt 3 4
0 2014-04-02 19:31:00 6.234 10.140
1 2014-04-02 19:32:00 5.823 9.490
2 2014-04-02 19:33:00 5.330 7.650
3 2014-04-02 19:34:00 4.751 6.190
4 2014-04-02 19:35:00 4.156 5.285
5 2014-04-02 19:36:00 3.962 4.652
6 2014-04-02 19:37:00 3.740 4.314
7 2014-04-02 19:38:00 3.325 3.980
8 2014-04-02 19:39:00 2.909 3.847
9 2014-04-02 19:40:00 2.878 3.164
But after playing around with this for a little while, there are some very strange behaviours when an error happens, leading to some odd error messages, so I can see why it's very hard to debug this.
If for some reason the above isn't working, you could try doing the parsing yourself:
df = pd.read_csv("home_prepped.dat", header=None)
timestr = df.iloc[:,:3].astype(str).apply(' '.join,axis=1)
df = df.iloc[:,3:]
times = pd.to_datetime(timestr, format='%Y %j %H%M')
df["dt"] = times
As mentioned above, when something goes wrong (e.g. a parse error) the error messages are very confusing from within read_csv.
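Regarding the zero-padding worry from the question: in a quick check, strptime's %j accepts both padded and unpadded day-of-year values, so padding should not be the deciding factor here. A minimal sketch:
from datetime import datetime

# %j parses day-of-year with or without zero padding
print(datetime.strptime("2014 92 1931", "%Y %j %H%M"))   # 2014-04-02 19:31:00
print(datetime.strptime("2014 092 1931", "%Y %j %H%M"))  # 2014-04-02 19:31:00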
The following seems to work, I think. Keep in mind this is the first time I have ever brought anything into pandas to work with, so I am not sure how to properly test it, but it recognizes the format and says:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-04-02 19:31:00, ..., 2014-12-21 23:59:00]
Length: 337917, Freq: None, Timezone: None
Which is sweet, as I believe this means I have finally indexed a datetime!
Here is what I did...
In [41]:
import numpy as np
import pandas as pd
from datetime import datetime
In [60]:
def parse(yr, yearday, hrmn):
    date_string = ''.join([yr, yearday, hrmn])
    return datetime.strptime(date_string, "%Y%j%H%M")
In [61]:
df = pd.read_csv('home_prepped.csv', parse_dates={'datetime':[0,1,2]}, date_parser=parse, index_col='datetime', header=None)
Note that I initially tried putting a space inside the quotes before the .join; that separated %Y and %j fine, but a stray "1" ended up being read as part of %H, so I removed the space and made the format string spaceless as well.
Thanks for your work on this DSM.
