I am working with forecast data from GFS. I have written the following function to extract a time series from an archive of forecasts:
def time_series_from_ensemble_archive(ensemble_archive, lead_time: int=0, ensemble_member: int=0):
    data = ensemble_archive
    data['age'] = (data.validityDateTime - data.as_of_datetime).astype(np.float)
    age_idx = data['age']==lead_time
    return data[:, ensemble_member, :, :, :].values[age_idx.T, :, :]
It works as expected:
Here is the data:
Coordinates:
* validityDateTime (validityDateTime) datetime64[ns] 2017-10-01 ...
* perturbationNumber (perturbationNumber) int32 0 1 2 3 4 5 6 7 8 9 10 11 ...
* lon (lon) float64 -119.0 -118.5 -118.0 -117.5 -117.0 ...
* lat (lat) float64 45.5 45.0 44.5 44.0 43.5 43.0 42.5 ...
* as_of_datetime (as_of_datetime) datetime64[ns] 2017-10-01 ...
Attributes:
name: 2 metre temperature
And with my function:
temp_ts = time_series_from_ensemble_archive(data)
temp_ts.shape
(124, 10, 20)
type(temp_ts)
numpy.ndarray
However, I feel like it is not the most 'pythonic' or 'xarrayic' approach, and that it would be better to return another xarray object. Any suggestions for improvement here? Could someone provide a solution using expand_dims or the .sel method?
xarray provides a variety of ways to index and select data. You might try indexing with dimension names, e.g.:
# select using positional & boolean indices
return data[{
    'perturbationNumber': ensemble_member,
    'validityDateTime': (data['age'] == lead_time)}]
or if lead_time is actually a positional index, just
# select using positional indices
return data[{
    'perturbationNumber': ensemble_member,
    'validityDateTime': lead_time}]
If you'd like to provide the index labels rather than their positions, you can just use the .sel or .loc methods:
# select using labels
return data.sel(
    perturbationNumber=ensemble_member,
    validityDateTime=lead_time)
or
# select using labels and boolean indices
return data.loc[{
    'perturbationNumber': ensemble_member,
    'validityDateTime': (data['age'] == lead_time)}]
Calling .values is the step that returns the underlying NumPy array backing the xarray data. There's no reason the same indices you provided shouldn't work when applied to the xarray DataArray itself (without .values), in which case you get a DataArray back rather than a plain array.
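For instance, here is a minimal sketch of the rewritten function (assuming the DataArray layout shown above, and treating lead_time as a purely positional index for simplicity; this is an illustration, not the original poster's exact logic):
def time_series_from_ensemble_archive(ensemble_archive, lead_time: int=0, ensemble_member: int=0):
    # Positional selection by dimension name; the result is still a DataArray,
    # so coordinates and attributes (e.g. the "2 metre temperature" name) are preserved.
    return ensemble_archive.isel(
        perturbationNumber=ensemble_member,
        validityDateTime=lead_time)
Because the return value is still a DataArray, calling .values can be deferred until a plain NumPy array is actually needed.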
I created a new struct called HousingData, and also defined functions such as iterate and length. However, when I use the function collect on my HousingData object, I run into the following error:
TypeError: in typeassert, expected Integer, got a value of type Float64
using Random: randperm  # randperm (used in iterate below) lives in the Random stdlib
import Base: length, size, iterate

struct HousingData
    x
    y
    batchsize::Int
    shuffle::Bool
    num_instances::Int

    function HousingData(
            x, y; batchsize::Int=100, shuffle::Bool=false, dtype::Type=Array{Float64})
        new(convert(dtype, x), convert(dtype, y), batchsize, shuffle, size(y)[end])
    end
end
function length(d::HousingData)
    return ceil(d.num_instances / d.batchsize)
end
function iterate(d::HousingData, state=ifelse(
        d.shuffle, randperm(d.num_instances), collect(1:d.num_instances)))
    if length(state) == 0
        return nothing
    end
    return ((d.x[:, state[1]], d.y[:, state[1]]), state[2:end])
end
x1 = randn(5, 100); y1 = rand(1, 100);
obj = HousingData(x1,y1; batchsize=20)
collect(obj)
There are multiple problems in your code. The first one is related to length not returning an integer, but rather a float. This is explained by the behavior of ceil:
julia> ceil(3.8)
4.0 # Notice: 4.0 (Float64) and not 4 (Int)
You can easily fix this:
function length(d::HousingData)
    return Int(ceil(d.num_instances / d.batchsize))
end
Another problem lies in the logic of your iteration function, which is not consistent with the advertised length. To take a smaller example than yours:
julia> x1 = [i+j/10 for i in 1:2, j in 1:6]
2×6 Array{Float64,2}:
1.1 1.2 1.3 1.4 1.5 1.6
2.1 2.2 2.3 2.4 2.5 2.6
# As an aside, unless you really want to work with 1xN matrices
# it is more idiomatic in Julia to use 1D Vectors in such situations
julia> y1 = [Float64(j) for i in 1:1, j in 1:6]
1×6 Array{Float64,2}:
1.0 2.0 3.0 4.0 5.0 6.0
julia> obj = HousingData(x1,y1; batchsize=3)
HousingData([1.1 1.2 … 1.5 1.6; 2.1 2.2 … 2.5 2.6], [1.0 2.0 … 5.0 6.0], 3, false, 6)
julia> length(obj)
2
julia> for (i, e) in enumerate(obj)
           println("$i -> $e")
       end
1 -> ([1.1, 2.1], [1.0])
2 -> ([1.2, 2.2], [2.0])
3 -> ([1.3, 2.3], [3.0])
4 -> ([1.4, 2.4], [4.0])
5 -> ([1.5, 2.5], [5.0])
6 -> ([1.6, 2.6], [6.0])
The iterator produces 6 elements, whereas the length of this object is only 2. This explains why collect errors out:
julia> collect(obj)
ERROR: ArgumentError: destination has fewer elements than required
Knowing your code, you're probably the best person to fix its logic.
I'd like to be able to vectorize this piece of code for speed. The purpose is to calculate a function, in this case a standard deviation, over pairs of dates whose start and end points are contained in two separate arrays.
import pandas as pd
import numpy as np
asd_1 = pd.Series(0.01 * np.random.randn(252), index=pd.date_range('2011-1-1', periods=252))
index_1 = pd.to_datetime(['2011-2-2', '2011-4-3', '2011-5-1',])
index_2 = pd.to_datetime(['2011-2-15', '2011-4-16', '2011-5-17',])
index_tot = list(zip(index_1,index_2))
aux_learning_std = pd.DataFrame([np.nanstd(asd_1.loc[i:j]) for i, j in index_tot], index=index_1)
The solution, which works, is performed through a loop, but I'd rather vectorize it with numpy/pandas, which would be much faster. Initially I thought about using something like:
df_aux = pd.concat([asd_1 for _ in range(len(index_1))], axis=1)
results = df_aux.apply(lambda x: np.nanstd(x.loc[i,j]), axis = 0)
but here I fail to put the vectors together into one operation.
Any and all advice is welcome.
p.s.: below there is an image for explanatory purposes
Vectorized standard deviation across ranges in an array
def get_ranges_arr(starts, ends):
    # Taken from http://stackoverflow.com/a/37626057/3293881
    counts = ends - starts
    counts_csum = counts.cumsum()
    id_arr = np.ones(counts_csum[-1], dtype=int)
    id_arr[0] = starts[0]
    id_arr[counts_csum[:-1]] = starts[1:] - ends[:-1] + 1
    return id_arr.cumsum()
def ranged_std(arr, starts, ends):
    # Get all indices and the IDs corresponding to same groups
    idx = get_ranges_arr(starts, ends)
    id_arr = np.repeat(np.arange(starts.size), ends - starts)

    # Extract relevant data
    slice_arr = arr[idx]

    # Simulate standard deviation implementation for a number of groups
    # using id_arr as the basis to perform various mathematical operations
    # within each group. Since std. deviation performs sum/mean reduction,
    # we can simply use np.bincount for an efficient implementation.
    # Std. deviation formula used:
    # https://github.com/numpy/numpy/blob/v1.11.0/numpy/core/fromnumeric.py#L2939
    grp_counts = np.bincount(id_arr)
    mean_vals = np.bincount(id_arr, slice_arr) / grp_counts
    abs_vals = np.abs(slice_arr - mean_vals[id_arr])**2
    return np.sqrt(np.bincount(id_arr, abs_vals) / grp_counts)
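As a quick sanity check (not part of the original answer; assumes numpy is imported as np and get_ranges_arr is defined as above), get_ranges_arr simply concatenates the index ranges starts[i]:ends[i]:
starts = np.array([2, 6, 11])
ends = np.array([8, 9, 15])
# Concatenation of the half-open ranges 2..7, 6..8 and 11..14
print(get_ranges_arr(starts, ends))
# [ 2  3  4  5  6  7  6  7  8 11 12 13 14]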
Sample run (verify against a loopy version)
In [173]: arr = np.random.randint(0,9,(20))
In [174]: starts = np.array([2,6,11])
In [175]: ends = np.array([8,9,15])
In [176]: [np.std(arr[i:j]) for i,j in zip(starts,ends)]
Out[176]: [1.9720265943665387, 0.81649658092772603, 0.82915619758884995]
In [177]: ranged_std(arr,starts,ends)
Out[177]: array([ 1.97202659, 0.81649658, 0.8291562 ])
Runtime test
Case #1 : Very small number of ranges 3
In [21]: arr = np.random.randint(0,9,(20))
In [22]: starts = np.array([2,6,11])
In [23]: ends = np.array([8,9,15])
In [24]: %timeit [np.std(arr[i:j]) for i,j in zip(starts,ends)]
10000 loops, best of 3: 146 µs per loop
In [25]: %timeit ranged_std(arr,starts,ends)
10000 loops, best of 3: 45 µs per loop
Case #2 : Decent number of ranges 1000
In [32]: arr = np.random.randint(0,9,(1010))
In [33]: starts = np.random.randint(0,9,(1000))
In [34]: ends = starts + np.random.randint(0,9,(1000))
In [35]: %timeit [np.std(arr[i:j]) for i,j in zip(starts,ends)]
10 loops, best of 3: 47.5 ms per loop
In [36]: %timeit ranged_std(arr,starts,ends)
1000 loops, best of 3: 217 µs per loop
Case #3 : Large number of ranges 10000
In [60]: arr = np.random.randint(0,9,(1010))
In [61]: arr = np.random.randint(0,9,(10010))
In [62]: starts = np.random.randint(0,9,(10000))
In [63]: ends = starts + np.random.randint(0,9,(10000))
In [64]: %timeit [np.std(arr[i:j]) for i,j in zip(starts,ends)]
1 loops, best of 3: 474 ms per loop
In [65]: %timeit ranged_std(arr,starts,ends)
100 loops, best of 3: 2.17 ms per loop
Really amazing speedups of 200x+!
Using ranged_std to solve our case
# Get start, stop numeric indices as needed for getting ranges array later on
starts = asd_1.index.searchsorted(index_1)
ends = asd_1.index.searchsorted(index_2)
# Create final dataframe output using ranged_std func
df = pd.DataFrame(ranged_std(asd_1.values,starts,ends+1),index=index_1)
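The ends+1 is needed because the .loc date slicing in the question is inclusive of both endpoints, while ranged_std works on half-open positional ranges. A small hypothetical check of that equivalence (illustrative values, not from the original answer; pandas and numpy imported as in the question):
idx = pd.date_range('2011-01-01', periods=10)
s = pd.Series(np.arange(10.0), index=idx)
start = s.index.searchsorted(pd.Timestamp('2011-01-03'))   # -> 2
end = s.index.searchsorted(pd.Timestamp('2011-01-06'))     # -> 5
# .loc includes both endpoints, so it covers positions 2..5, i.e. start:end+1
assert np.isclose(s.loc['2011-01-03':'2011-01-06'].std(ddof=0),
                  np.std(s.values[start:end + 1]))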
Sample run for verification -
In [17]: asd_1 = pd.Series(0.01 * np.random.randn(252), index=\
...: pd.date_range('2011-1-1', periods=252))
...:
...: index_1 = pd.to_datetime(['2011-2-2', '2011-4-3', '2011-5-1',])
...: index_2 = pd.to_datetime(['2011-2-15', '2011-4-16', '2011-5-17',])
...:
...: index_tot = list(zip(index_1,index_2))
...: aux_learning_std = pd.DataFrame([np.nanstd(asd_1.loc[i:j]) for i, j in \
...: index_tot], index=index_1)
...:
In [18]: starts = asd_1.index.searchsorted(index_1)
...: ends = asd_1.index.searchsorted(index_2)
...: df = pd.DataFrame(ranged_std(asd_1.values,starts,ends+1),index=index_1)
...:
In [19]: aux_learning_std
Out[19]:
0
2011-02-02 0.007244
2011-04-03 0.012862
2011-05-01 0.010155
In [20]: df
Out[20]:
0
2011-02-02 0.007244
2011-04-03 0.012862
2011-05-01 0.010155
First time looking at Julia
julia> x=[1 2 3];
julia> x[2]=3+5im
ERROR: InexactError()
in convert at complex.jl:18
in setindex! at array.jl:346
I am sure this is because Julia's type system is different.
How would one do the equivalent of this (MATLAB code) in Julia?
x=[1 2 3];
x(2)=3+5*1i
x =
1.0000 + 0.0000i 3.0000 + 5.0000i 3.0000 + 0.0000i
You can make x a complex array:
x=[1 2 3];
x=complex(x);
Now you can perform this operation:
x[2]=3+5im;
You can check the result:
println(x)
This outputs:
1+0im 3+5im 3+0im
As desired.
You probably want x to be complex. In which case, you can do this:
x = Complex{Float64}[1, 2, 3]
Which allows you to do what you want. You can also change Float64 to something else like Int or Int64.
Also, you should separate entries with commas to get 1-dimensional arrays instead of the 2-dimensional arrays you have now. To find the type, do this:
typeof(x)
which gives
Array{Complex{Float64},1}
The 1 at the end indicates that this is a 1-dimensional array.
I am trying to build a Pandas series by passing it a dictionary containing index and data pairs. While doing so I noticed an interesting quirk. If the index of the data pair is a very large integer the data will show up as NaN. This is fixed by reducing the size of the index values, or creating the Series using two lists instead of a single dict. I have large index values because I am using time-stamps in microseconds-since-1970 format. Am I doing something wrong or is this a bug?
Here's an example:
import pandas as pd
test_series_time = [1357230060000000, 1357230180000000, 1357230300000000]
test_series_value = [1, 2, 3]
series = pd.Series(test_series_value, test_series_time, name="this works")
test_series_dict = {1357230060000000: 1, 1357230180000000: 2, 1357230300000000: 3}
series2 = pd.Series(test_series_dict, name="this doesn't")
test_series_dict_smaller_index = {1357230060: 1, 1357230180: 2, 1357230300: 3}
series3 = pd.Series(test_series_dict_smaller_index, name="this does")
print series
print series2
print series3
and the output:
1357230060000000 1
1357230180000000 2
1357230300000000 3
Name: this works
1357230060000000 NaN
1357230180000000 NaN
1357230300000000 NaN
Name: this doesn't
1357230060 1
1357230180 2
1357230300 3
Name: this does
So what's up with this?
I bet you are on 32-bit; on 64-bit this works fine. In 0.10.1, creation via dicts uses the default numpy integer dtype, which is platform dependent (e.g. int32 on 32-bit and int64 on 64-bit). You are overflowing the dtype, which results in unpredictable behavior.
In 0.11 (coming out this week!), this will work as it will default to creating int64s regardless of the system.
In [12]: np.iinfo(np.int32).max
Out[12]: 2147483647
In [13]: np.iinfo(np.int64).max
Out[13]: 9223372036854775807
Convert your microseconds to Timestamps (multiply by 1000 to put them in nanoseconds, which is what Timestamp accepts as integer input), and you are good to go:
In [5]: pd.Series(test_series_value,
[ pd.Timestamp(k*1000) for k in test_series_time ])
Out[5]:
2013-01-03 16:21:00 1
2013-01-03 16:23:00 2
2013-01-03 16:25:00 3
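As an aside not covered in the original answer, newer pandas can also build the datetime index directly from the integer microsecond values with pd.to_datetime and unit='us' (a minimal sketch, reusing the question's variables):
idx = pd.to_datetime(test_series_time, unit='us')   # interpret integers as microseconds since the epoch
pd.Series(test_series_value, index=idx)
# 2013-01-03 16:21:00    1
# 2013-01-03 16:23:00    2
# 2013-01-03 16:25:00    3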
I have two p*n arrays, y and ymiss. y contains real numbers and NA's. ymiss contains 1's and 0's, so that if y(i,j)==NA then ymiss(i,j)==0, and 1 otherwise. I also have a 1*n array ydim which tells how many real numbers there are in y(1:p,n), so ydim has values from 0 to p.
In the R programming language, I can do the following:
if(ydim!=p && ydim!=0)
y(1:ydim(t), t) = y(ymiss(,t), t)
That code moves all the real numbers of y(,t) to the front. For example, first there's
y(,t) = (3,1,NA,6,2,NA)
after the code it's
y(,t) = (3,1,6,2,2,NA)
I will only need the first 1:ydim(t) elements, so it doesn't matter what the rest are.
The question is, how can I do something like that in Fortran?
Thanks,
Jouni
The "where statement" and the "merge" intrinsic function are powerful, operating on selected positions in arrays, but they don't move items to the front of an array. With old-fashioned code with explicit indexing (could be packaged into a function) e.g.:
k = 1
do i = 1, n
   if (ymiss(i) == 1) then
      y(k) = y(i)
      k = k + 1
   end if
end do
What you want could be done with array intrinsics using the "pack" intrinsic. Convert ymiss into a logical array: 0 --> .false., 1 --> .true.. Then use code like (tested without the second index):
y(1:ydim(t), t) = pack (y (:,t), ymiss (:,t))
Edit to add example code, showing use of Fortran intrinsics "where", "count" and "pack". "where" alone can't solve the problem, but "pack" can. I used "< -90" as NaN for this example. The step "y (ydim+1:LEN) = -99.0" isn't required by the OP, who doesn't need to use these elements.
program test1

   integer, parameter :: LEN = 6
   real, dimension (1:LEN) :: y = [3.0, 1.0, -99.0, 6.0, 2.0, -99.0 ]
   real, dimension (1:LEN) :: y2
   logical, dimension (1:LEN) :: ymiss
   integer :: ydim

   y2 = y
   write (*, '(/ "The input array:" / 6(F6.1) )' ) y

   where (y < -90.0)
      ymiss = .false.
   elsewhere
      ymiss = .true.
   end where

   ydim = count (ymiss)
   where (ymiss) y2 = y
   write (*, '(/ "Masking with where does not rearrange:" / 6(F6.1) )' ) y2

   y (1:ydim) = pack (y, ymiss)
   y (ydim+1:LEN) = -99.0
   write (*, '(/ "After using pack, and ""erasing"" the end:" / 6(F6.1) )' ) y

   stop

end program test1
Output is:
The input array:
3.0 1.0 -99.0 6.0 2.0 -99.0
Masking with where does not rearrange:
3.0 1.0 -99.0 6.0 2.0 -99.0
After using pack, and "erasing" the end:
3.0 1.0 6.0 2.0 -99.0 -99.0
In Fortran you can't store NA in an array of real numbers; you can only store real numbers. So you'll probably want to replace the NA's with some value not likely to be present in your data: huge() might be suitable. 2D arrays are no problem at all in Fortran. You might want to use a 2D array of logicals to replace ymiss, rather than a 2D array of 1s and 0s.
There is no simple intrinsic to achieve what you want; you'd need to write a function. However, a more Fortran way of doing things would be to use the array of logicals as a mask for the operations you want to carry out.
So, here's some fragmentary Fortran code, not tested:
! Declarations
real(8), dimension(m,n) :: y, ynew
logical, dimension(m,n) :: ymiss
! Executable
where (ymiss) ynew = func(y) ! here func() is whatever your function is