pandera - use decorator to specify multiple output schemas

I would like to know if it is possible to use a pandera decorator to specify multiple output schemas.
Let's say, for example, you have a function that returns 2 dataframes and you want to check the schemas of these dataframes using the check_io() decorator:
import pandas as pd
import pandera as pa
from pandera import DataFrameSchema, Column, check_io

df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
})

in_schema = DataFrameSchema({
    "column1": Column(int),
    "column2": Column(float),
})

out_schema1 = DataFrameSchema({
    "column1": Column(int),
    "column2": Column(float),
    "column3": Column(float),
})

out_schema2 = DataFrameSchema({
    "column1": Column(int),
    "column2": Column(float),
    "column3": Column(int),
})

def preprocessor(df1, df2):
    df_out1 = (df1 + df2).assign(column3=lambda x: x.column1 + x.column2)
    df_out2 = (df1 + df2).assign(column3=lambda x: x.column1 ** 2)
    return df_out1, df_out2
How would this be implemented for the above example?

Just in case anyone else is looking for the solution: pass out= a list of (output index, schema) pairs.
@pa.check_io(df1=in_schema, df2=in_schema, out=[(0, out_schema1), (1, out_schema2)])
def preprocessor(df1, df2):
    df_out1 = (df1 + df2).assign(column3=lambda x: x.column1 + x.column2)
    df_out2 = (df1 + df2).assign(column3=lambda x: x.column1 ** 2)
    return df_out1, df_out2

preprocessor(df, df)
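As a side note (not from the original answer), a quick way to see the decorator doing its job is to violate one of the output schemas on purpose. A sketch, assuming pandera exposes SchemaError under pandera.errors; bad_preprocessor is a hypothetical name:

import pandera.errors  # makes pa.errors available

@pa.check_io(df1=in_schema, df2=in_schema, out=[(0, out_schema1), (1, out_schema2)])
def bad_preprocessor(df1, df2):
    df_out1 = (df1 + df2).assign(column3=lambda x: x.column1 + x.column2)
    # column3 is cast to float here, but out_schema2 expects int
    df_out2 = (df1 + df2).assign(column3=lambda x: (x.column1 ** 2).astype(float))
    return df_out1, df_out2

try:
    bad_preprocessor(df, df)
except pa.errors.SchemaError as e:
    print("second output failed validation:", e)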

Related

Plotting data in a nested for loop and function and then combining all the plots in one plot

I am trying to plot a simple nested function, and my MWE goes like this; in the end I get just an empty plot:
import numpy as np
import matplotlib.pyplot as plt

k = 1.38e-23
h = 6.6e-34
T = 7
Temp = np.array([7, 0.268, 0.02025])
Freq = np.arange(1, 10, 2)
for T in Temp:
    for f in Freq:
        def quanta(f, T):
            return f * T
        final = quanta(f, T)
        plt.plot(f, final)
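No answer is shown here, but a likely cause: plt.plot(f, final) is called with two scalars, and a single point with the default line style renders nothing visible, so the figure stays empty. A minimal sketch of one fix, assuming the goal is one curve per temperature:

import numpy as np
import matplotlib.pyplot as plt

def quanta(f, T):
    return f * T

Temp = np.array([7, 0.268, 0.02025])
Freq = np.arange(1, 10, 2)

for T in Temp:
    # evaluate over the whole frequency array and draw one line per temperature
    plt.plot(Freq, quanta(Freq, T), marker='o', label='T = %g' % T)

plt.xlabel('f')
plt.ylabel('quanta(f, T)')
plt.legend()
plt.show()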

How do I calculate a scalar product in a sympy vector?

Pretty simple question:
from sympy.vector import Vector
v = Vector(0, 2, 1)
print(2*v)
Output expected:
Vector(0, 4, 2)
I'm unable to find how to do this; the docs do not talk about scalar multiplication.
Matrix supports scalar multiplication, but Vector does not. So either convert back and forth between Matrix and Vector, file a feature request and wait for a possible enhancement, or write your own __mul__ and __rmul__ routines (part of the beauty of Python):
>>> v.func(*Matrix(v.args)*2)
Vector(0, 4, 2)
>>> Vector.__rmul__ = Vector.__mul__ = lambda s,o: s.func(*[i*o for i in s.args])
>>> 2*v
Vector(0, 4, 2)
>>> v*2
Vector(0, 4, 2)
I think the expected way to work with vectors is to create a coordinate system and use its i, j, k basis:
In [32]: from sympy.vector import CoordSys3D
In [33]: N = CoordSys3D('N')
In [34]: v = 2*N.j + N.k
In [35]: v
Out[35]: 2*N.j + N.k
In [36]: 2*v
Out[36]: 4*N.j + 2*N.k
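As a side note, the question title says "scalar product", which usually means the dot product; CoordSys3D vectors support that too. A quick sketch, continuing the session above (w is a vector introduced here for illustration):

In [37]: w = N.i + 3*N.j

In [38]: v.dot(w)
Out[38]: 6

In [39]: v & w  # operator shorthand for dot
Out[39]: 6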

xarray: replace array values corresponding to particular dates in datetime

I have an example array of zeros:
import numpy as np
import pandas as pd
import xarray as xr

time = np.arange('2000', '2005', dtype='datetime64[D]')
test_array = xr.DataArray(np.zeros(len(time)), coords={'time': time}, dims=['time'])
Now, if I have some data, e.g. test_data = np.ones(365), that I want to put into the array for year 2001 (which has 365 days), how do I go about doing this?
I want to do something like: test_array[test_array.where(time='2001')] = test_data but .where() here doesn't work.
The following solution works, but if there is a more elegant way I'd love to know.
ind_start = (test_array.indexes['time'] == pd.Timestamp('2001-01-01')).argmax()
ind_end = (test_array.indexes['time'] == pd.Timestamp('2001-12-31')).argmax()
test_array[ind_start:ind_end + 1] = test_data
The three-argument xarray.where function could be a more elegant alternative:
import pandas as pd
import xarray as xr
times = pd.date_range('2000', '2002')
da = xr.DataArray(range(len(times)), [('time', times)])
result = xr.where(da.time.dt.year == 2001, 1, da)
It works with arrays of values too:
ones = xr.ones_like(da)
result = xr.where(da.time.dt.year == 2001, ones, da)
If you are starting from a pure NumPy array, you'll need to cast it to a DataArray and make sure that its time coordinate aligns exactly with the time coordinate of da; if the initial length of the NumPy array differs from that of da, you'll need to add a reindexing step. Here's one way to do that:
import numpy as np
year_2001_times = da.time.sel(time=da.time.dt.year == 2001)
arr = np.random.random(len(year_2001_times))
random_da = xr.DataArray(arr, [('time', year_2001_times)])
reindexed_random_da = random_da.reindex_like(da)
result = xr.where(da.time.dt.year == 2001, reindexed_random_da, da)
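For strict in-place assignment, which is what the question originally asked for, label-based indexing may be the most direct route. A sketch, reusing the arrays from the question and xarray's documented dict-style .loc assignment:

import numpy as np
import xarray as xr

time = np.arange('2000', '2005', dtype='datetime64[D]')
test_array = xr.DataArray(np.zeros(len(time)), coords={'time': time}, dims=['time'])
test_data = np.ones(365)  # year 2001 has 365 days

# assign by label: select all timestamps in 2001 and overwrite them
test_array.loc[{'time': slice('2001-01-01', '2001-12-31')}] = test_data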

Use scipy.integrate.quad with Tensorflow

I am trying to use scipy.integrate.quad with TensorFlow as follows.
time and Lambda are two Tensors with shape (None, 1).
def f_t(self, time, Lambda):
    h = Lambda * self.shape * time ** (self.shape - 1)
    S = tf.exp(-1 * Lambda * time ** self.shape)
    return h * S

def left_censoring(self, time, Lambda):
    return tf.map_fn(lambda x: integrate.quad(self.f_t,
                                              0.0,
                                              x[0],  # it is not a float before evaluation
                                              args=(x[1],)),
                     tf.concat([time, Lambda], 1))
However, I get the error below:
File "J:\Workspace\Distributions.py", line 30, in <lambda>
    args=(x[1],)),
  File "I:\Anaconda3\envs\tensorflow\lib\site-packages\scipy\integrate\quadpack.py", line 323, in quad
    points)
  File "I:\Anaconda3\envs\tensorflow\lib\site-packages\scipy\integrate\quadpack.py", line 388, in _quad
    return _quadpack._qagse(func,a,b,args,full_output,epsabs,epsrel,limit)
TypeError: a float is required
x[0] is a Tensor with shape=(); it is not a float value before evaluation. Is it possible to solve this problem? How should I calculate the integral in TensorFlow?
If you have at least TensorFlow 1.8.0, you're probably best off using tf.contrib.integrate.odeint_fixed() like this code (tested):
from __future__ import print_function
import tensorflow as tf

assert tf.VERSION >= "1.8.0", "This code only works with TensorFlow 1.8.0 or later."

def f( y, a ):
    return a * a

x = tf.constant( [ 0.0, 1.0, 2, 3, 4 ], dtype = tf.float32 )
i = tf.contrib.integrate.odeint_fixed( f, 0.0, x, method = "rk4" )

with tf.Session() as sess:
    res = sess.run( i )
    print( res )
will output:
[ 0. 0.33333334 2.6666667 9. 21.333334 ]
properly integrating x² over the intervals [ 0, 0 ], [ 0, 1 ], [ 0, 2 ], [ 0, 3 ], and [ 0, 4 ], as per x = [ 0, 1, 2, 3, 4 ] above. (The primitive function of x² is ⅓ x³, so for example 4³ / 3 = 64 / 3 = 21 ⅓.)
Otherwise, for earlier TensorFlow versions, here's how to fix your code.
So the main issue is that you have to use tf.py_func() to map a Python function (scipy.integrate.quad() in this case) over a tensor. tf.map_fn() maps other TensorFlow operations, and it passes and expects tensors as operands. Therefore x[ 0 ] will never be a simple float; it will be a scalar tensor, and scipy.integrate.quad() will not know what to do with that.
You can't completely get rid of tf.map_fn() either, unless you want to manually loop over numpy arrays.
Furthermore, scipy.integrate.quad() returns a double (float64), whereas your tensors are float32.
I've simplified your code a lot, because I don't have access to the rest of it and it looks too complicated compared to the core of this question. The following code (tested):
from __future__ import print_function
import tensorflow as tf
from scipy import integrate

def f( a ):
    return a * a

def integrated( f, x ):
    return tf.map_fn( lambda y: tf.py_func(
        lambda z: integrate.quad( f, 0.0, z )[ 0 ], [ y ], tf.float64 ),
        x )

x = tf.constant( [ 1.0, 2, 3, 4 ], dtype = tf.float64 )
i = integrated( f, x )

with tf.Session() as sess:
    res = sess.run( i )
    print( res )
will also output:
[ 0.33333333 2.66666667 9. 21.33333333]

How to use pykalman filter_update for online regression

I want to use Kalman regression recursively on an incoming stream of price data using kf.filter_update() but I can't make it work. Here's the example code framing the problem:
The dataset (i.e. the stream):
DateTime CAT DOG
2015-01-02 09:01:00, 1471.24, 9868.76
2015-01-02 09:02:00, 1471.75, 9877.75
2015-01-02 09:03:00, 1471.81, 9867.70
2015-01-02 09:04:00, 1471.59, 9849.03
2015-01-02 09:05:00, 1471.45, 9840.15
2015-01-02 09:06:00, 1471.16, 9852.71
2015-01-02 09:07:00, 1471.30, 9860.24
2015-01-02 09:08:00, 1471.39, 9862.94
The data is read into a Pandas dataframe and the following code simulates the stream by iterating over the df:
df = pd.read_csv('data.txt')
df.dropna(inplace=True)

history = {}
history["spread"] = []
history["state_means"] = []
history["state_covs"] = []

for idx, row in df.iterrows():
    if idx == 0:  # Initialize the Kalman filter
        delta = 1e-9
        trans_cov = delta / (1 - delta) * np.eye(2)
        obs_mat = np.vstack([df.iloc[0].CAT, np.ones(df.iloc[0].CAT.shape)]).T[:, np.newaxis]
        kf = KalmanFilter(n_dim_obs=1, n_dim_state=2,
                          initial_state_mean=np.zeros(2),
                          initial_state_covariance=np.ones((2, 2)),
                          transition_matrices=np.eye(2),
                          observation_matrices=obs_mat,
                          observation_covariance=1.0,
                          transition_covariance=trans_cov)
        state_means, state_covs = kf.filter(np.asarray(df.iloc[0].DOG))
        history["state_means"], history["state_covs"] = state_means, state_covs
        slope = state_means[:, 0]
        print "SLOPE", slope
    else:
        state_means, state_covs = kf.filter_update(history["state_means"][-1],
                                                   history["state_covs"][-1],
                                                   observation=np.asarray(df.iloc[idx].DOG))
        history["state_means"].append(state_means)
        history["state_covs"].append(state_covs)
        slope = state_means[:, 0]
        print "SLOPE", slope
The Kalman filter initializes properly and I get the first regression coefficient, but the subsequent updates throw an exception:
SLOPE [ 6.70319125]
Traceback (most recent call last):
  File "C:/Users/.../KalmanUpdate_example.py", line 50, in <module>
    KalmanOnline(df)
  File "C:/Users/.../KalmanUpdate_example.py", line 43, in KalmanOnline
    state_means, state_covs = kf.filter_update(history["state_means"][-1], history["state_covs"][-1], observation = np.asarray(df.iloc[idx].DOG))
  File "C:\Python27\Lib\site-packages\pykalman\standard.py", line 1253, in filter_update
    2, "observation_matrix"
  File "C:\Python27\Lib\site-packages\pykalman\standard.py", line 38, in _arg_or_default
    + ' You must specify it manually.') % (name,)
ValueError: observation_matrix is not constant for all time. You must specify it manually.

Process finished with exit code 1
It seems intuitively clear that the observation matrix is required (it's provided in the initial step, but not in the updating steps), but I cannot figure out how to set it up properly. Any feedback would be highly appreciated.
Pykalman allows you to declare the observation matrix in two ways:
[n_timesteps, n_dim_obs, n_dim_state] - one matrix per time step, declared once for the whole estimation
[n_dim_obs, n_dim_state] - a single matrix, which can also be passed separately at each estimation step
In your code you used the first option (that's why "observation_matrix is not constant for all time"). But then you used filter_update in the loop, and Pykalman could not tell which matrix to use as the observation matrix in each iteration.
I would declare the observation matrix as a 2-element array:
from pykalman import KalmanFilter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.txt')
df.dropna(inplace=True)

n = df.shape[0]
n_dim_state = 2
history_state_means = np.zeros((n, n_dim_state))
history_state_covs = np.zeros((n, n_dim_state, n_dim_state))

for idx, row in df.iterrows():
    if idx == 0:  # Initialize the Kalman filter
        delta = 1e-9
        trans_cov = delta / (1 - delta) * np.eye(2)
        obs_mat = [df.iloc[0].CAT, 1]
        kf = KalmanFilter(n_dim_obs=1, n_dim_state=2,
                          initial_state_mean=np.zeros(2),
                          initial_state_covariance=np.ones((2, 2)),
                          transition_matrices=np.eye(2),
                          observation_matrices=obs_mat,
                          observation_covariance=1.0,
                          transition_covariance=trans_cov)
        history_state_means[0], history_state_covs[0] = kf.filter(np.asarray(df.iloc[0].DOG))
        slope = history_state_means[0, 0]
        print "SLOPE", slope
    else:
        obs_mat = np.asarray([[df.iloc[idx].CAT, 1]])
        history_state_means[idx], history_state_covs[idx] = kf.filter_update(history_state_means[idx-1],
                                                                             history_state_covs[idx-1],
                                                                             observation=df.iloc[idx].DOG,
                                                                             observation_matrix=obs_mat)
        slope = history_state_means[idx, 0]
        print "SLOPE", slope

plt.figure(1)
plt.plot(history_state_means[:, 0], label="Slope")
plt.grid()
plt.show()
It results in the following output:
SLOPE 6.70322464199
SLOPE 6.70512037269
SLOPE 6.70337808649
SLOPE 6.69956406785
SLOPE 6.6961767953
SLOPE 6.69558438828
SLOPE 6.69581682668
SLOPE 6.69617670459
Pykalman is not well documented, and there are mistakes on the official page, so I recommend testing the result against the offline estimation in one step. In that case the observation matrix has to be declared as you did in your code.
from pykalman import KalmanFilter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.txt')
df.dropna(inplace=True)

delta = 1e-9
trans_cov = delta / (1 - delta) * np.eye(2)
obs_mat = np.vstack([df.iloc[:].CAT, np.ones(df.iloc[:].CAT.shape)]).T[:, np.newaxis]
kf = KalmanFilter(n_dim_obs=1, n_dim_state=2,
                  initial_state_mean=np.zeros(2),
                  initial_state_covariance=np.ones((2, 2)),
                  transition_matrices=np.eye(2),
                  observation_matrices=obs_mat,
                  observation_covariance=1.0,
                  transition_covariance=trans_cov)

state_means, state_covs = kf.filter(df.iloc[:].DOG)
print "SLOPE", state_means[:, 0]

plt.figure(1)
plt.plot(state_means[:, 0], label="Slope")
plt.grid()
plt.show()
The result is the same.
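If both scripts are run in the same session, that claim can be checked numerically. A quick sketch, assuming history_state_means from the online loop and state_means from the offline run are both still in scope:

import numpy as np

# compare the online (filter_update) slopes with the offline (filter) slopes
print "max difference:", np.abs(history_state_means[:, 0] - state_means[:, 0]).max()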
