The number of occurrences of elements in a vector [JULIA]

I have a vector of 2500 values composed of repeated values and NaN values. I want to remove all the NaN values and compute the number of occurrences of each other value.
y
2500-element Array{Int64,1}:
8
43
NaN
46
NaN
8
8
3
46
NaN
For example:
the number of occurrences of 8 is 3
the number of occurrences of 46 is 2
the number of occurrences of 43 is 1.

To remove the NaN values you can use the filter function. From the Julia docs:
filter(function, collection)
Return a copy of collection, removing elements for which function is false.
x = filter(y->!isnan(y),y)
filter!(y->!isnan(y),y)
Thus, we use the condition !isnan(y) as our function and filter the array y with it. (We could equally have written filter(z->!isnan(z),y), using z or any other variable name, since the first argument of filter just defines an inline function.) We can either save the result as a new object or use the in-place version, signaled by the !, to modify the existing object y.
Then, either before or after this, depending on whether we want to include the NaNs in our count, we can use the countmap() function from StatsBase. From the Julia docs:
countmap(x)
Return a dictionary mapping each unique value in x to its number of
occurrences.
using StatsBase
a = countmap(y)
You can then access specific elements of this dictionary, e.g. a[-1] will tell you how many occurrences there are of -1.
Or, if you wanted to then convert that dictionary to an Array, you could use:
b = hcat([[key, val] for (key, val) in a]...)'
Note: Thanks to @JeffBezanon for comments on the correct method for filtering NaN values.
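The difference between filter and filter! described above can be sketched as follows. This is only an illustration under one assumption: y is taken to be a Float64 vector, since NaN is a floating-point value and cannot be stored in an Int64 array. The sample values and the names x, n_before and counts are made up for the example, and the counting step uses only Base functions rather than countmap.

```julia
# Assumption: y holds floating-point values, because NaN is a Float64
# and cannot be stored in an Int64 array.
y = [8.0, 43.0, NaN, 46.0, NaN, 8.0, 8.0, 3.0, 46.0, NaN]

x = filter(v -> !isnan(v), y)   # returns a new vector; y is unchanged
n_before = length(y)            # still includes the NaN entries

filter!(v -> !isnan(v), y)      # same filtering, but modifies y in place

# Count occurrences with Base only (no StatsBase needed for this sketch).
counts = Dict(v => count(w -> w == v, x) for v in unique(x))

println(counts)
```

Here counts plays the same role as the dictionary returned by countmap(x), so counts[8.0] gives the number of times 8.0 occurs.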

y=rand(1:10,20)
u=unique(y)
d=Dict([(i,count(x->x==i,y)) for i in u])
println("count for 10 is $(d[10])")

countmap is the best solution I've seen so far, but here's a written out version, which is only slightly slower. It only passes over the array once, so if you have many unique values, it is very efficient:
function countmemb1(y)
    d = Dict{Int, Int}()
    for val in y
        # skip NaN entries
        if isnan(val)
            continue
        end
        if haskey(d, val)
            d[val] += 1
        else
            d[val] = 1
        end
    end
    return d
end
The solution in the accepted answer can be a bit faster if there are a very small number of unique values, but otherwise scales poorly.
Edit: Because I just couldn't leave well enough alone, here's a version that is more generic and also faster (countmap doesn't accept strings, sets or tuples, for example):
function countmemb(itr)
    d = Dict{eltype(itr), Int}()
    for val in itr
        # only numbers can be NaN; skip those
        if isa(val, Number) && isnan(val)
            continue
        end
        d[val] = get(d, val, 0) + 1
    end
    return d
end
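As a usage sketch of the generic version, the function is repeated below so that the snippet runs on its own, and is then applied to strings, tuples and floats. The sample inputs and the names words, pairs_seen, word_counts, pair_counts and float_counts are hypothetical and only serve the illustration.

```julia
function countmemb(itr)
    d = Dict{eltype(itr), Int}()
    for val in itr
        # only numbers can be NaN; skip those
        if isa(val, Number) && isnan(val)
            continue
        end
        d[val] = get(d, val, 0) + 1
    end
    return d
end

words = ["a", "b", "a", "c", "a"]        # hypothetical sample data
pairs_seen = [(1, 2), (3, 4), (1, 2)]    # hypothetical sample data

word_counts = countmemb(words)               # keys are strings
pair_counts = countmemb(pairs_seen)          # keys are tuples
float_counts = countmemb([1.5, NaN, 1.5])    # the NaN entry is skipped

println(word_counts)
println(pair_counts)
println(float_counts)
```

Because the key type is taken from eltype(itr), the same function covers all three inputs without any change.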

Related

how to return different things in function in Julia

Like the simplified function below, if b<c then how can I get the result "No"?
function o(b,c)
    if b>=c
        return b,c,b+c
    else
        return "No"
    end
end
b = 3
c = 4
k,h,l = o(b,c)
The real problem is that you are returning two completely different things. In one case 3 different variables assigned to integers, in the other case 1 string.
The function is actually working here. The specific error you get is because you are trying to assign 3 variables from one string. When you assign multiple variables from a string, Julia splits the string into characters and assigns one character to each variable, but your string is only 2 characters long and you are assigning 3 variables.
You should try to have your function return objects of the same type, or at least the same number of values. If you insist on getting this function to work in something resembling its current form, then you could do something like this:
function o(b,c)
    if b>=c
        return [b,c,b+c]
    else
        return "No"
    end
end
b = 5
c = 4
result = o(b,c)
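One possible way to keep the three-value destructuring while still handling the b<c case is to check the result before unpacking it. The sketch below is an alternative, not what the answer above proposes: o_checked is a hypothetical name, and returning nothing instead of the string "No" is an assumption made for the example.

```julia
# Hypothetical variant of o: a 3-tuple on success, `nothing` otherwise.
function o_checked(b, c)
    if b >= c
        return (b, c, b + c)
    else
        return nothing
    end
end

result = o_checked(3, 4)
if result === nothing
    # the b < c case: there is nothing to unpack
    println("No")
else
    # safe to destructure, because result is known to be a 3-tuple here
    k, h, l = result
    println("k = $k, h = $h, l = $l")
end
```

The caller decides what to do in the failure case, and the destructuring only happens when there really are three values.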

How can I slice a shaped array in Perl 6?

I can make a shaped (fixed-size) array:
my @array[3;3] = (
    < 1 2 3 >,
    < 4 5 6 >,
    < 7 8 9 >
);
say @array; # [[1 2 3] [4 5 6] [7 8 9]]
say @array[1;1]; # 5
How can I slice this to get any particular column or diagonal that I want (rows are easy)?
How do I turn a list of the indices in each dimension into the right thing to put in the square braces?
And, surely there's some fancy syntax that would keep me from doing something complicated:
my @diagonal = gather {
    my @ends = @array.shape.map: { (0 ..^ $^a).List };
    for [Z] @ends {
        take @array[ $_ ] # how do I make that $_[0];$_[1];...
    };
}
How can I slice this to get any particular column or diagonal that I want?
As far as I know, you can't currently use slice syntax with shaped arrays (notwithstanding your "(rows are easy)" comment which confuses me per my comment on your post).
The obvious solution is to drop the shape and use slice syntax:
my @array = ( < 1 2 3 >, < 4 5 6 >, < 7 8 9 > );
say @array[1]; # 4 5 6 (second row)
say @array[1;*]; # same
say @array[*;1]; # 2 5 8 (second column)
If you wanted to retain the bounds-checking safety of using a shaped array (and/or the C array compatibility of a shaped native array if I'm right that's a thing) then you'd presumably have to keep two copies of the array around, using one to retain the desired aspect of shaped arrays, the other to slice.
How do I turn a list of the indices in each dimension into the right thing to put in the square braces?
Each dimensional slice before the final leaf one must be separated from the next by a ;.
I'm not yet clear on whether that's because the ; is a statement separator (within the subscript) or a list-of-list indicator, nor how to programmatically turn a list of indices into that form. (Investigation continues.)
And, surely there's some fancy syntax that would keep me from doing something complicated [for a diagonal slice]:
say @array[*;{$++}]; # 1 5 9 (diagonal)
The first ; separated field in the [...] array subscript corresponds to the first dimension in the array, i.e. the rows in the array.
Specifying * means you want to include all rows rather than specify the specific row(s).
The last field corresponds to the leaves of the subscript, the actual elements to be accessed.
I first tried just $++ rather than {$++} but that gave me column zero for all elements presumably because the language/roast and/or Rakudo only evaluates a scalar index value once per call of the [...] subscript operator.
Then I reasoned that if an index is Callable it'll be called and it might be called once per row. And that worked.
I think that corresponds to this code in Rakudo.
At first glance this appears to mean you can't use a Callable to calculate a leaf slice and I note that the roast'd slicing for "calculated indices" doesn't include use of a Callable. Perhaps I'm just not looking at it right.
You have probably seen that slicing a shaped array returns a "not yet implemented" error (which was inserted to solve this bug):
Partially dimensioned views of shaped arrays not yet implemented. Sorry.
In this case, it might be better to just unshape the array and use a more traditional approach:
use v6;
my @array = (
    < 1 2 3 >,
    < 4 5 6 >,
    < 7 8 9 >
);
my @diagonal = gather {
    my @ends = ((0,0),(1,1),(2,2));
    for @ends -> @indices {
        take @array[ @indices[0] ][@indices[1]];
    };
}
say @diagonal;
By looking at the synopsis on the subject, I would say that approach is not really specified. So when all is said and done, you will probably have to use either EVAL or macros (when they are eventually implemented, of course... )

How to reshape Arrays quickly

In the following code I am using the Julia Optim package for finding an optimal matrix with respect to an objective function.
Unfortunately the provided optimize function only supports vectors, so I have to transform the matrix to a vector before passing it to the optimize function, and also transform it back when using it in the objective function.
function opt(A0,X)
    I1(A) = sum(maximum(X*A,1))
    function transform(A)
        # reshape matrix to vector
        return reshape(A,prod(size(A)))
    end
    function transformback(tA)
        # reshape vector to matrix
        return reshape(tA, size(A0))
    end
    obj(tA) = -I1(transformback(tA))
    result = optimize(obj, transform(A0), method = :nelder_mead)
    return transformback(result.minimum)
end
I think Julia is allocating new space for this every time and it feels slow, so what would be a more efficient way to tackle this problem?
So long as an array contains elements that are considered immutable, which includes all primitives, its elements are stored in one big contiguous block of memory. So you can break the dimension rules and simply treat a 2-dimensional array as a 1-dimensional array, which is what you want to do. So you don't need to reshape, but I don't think reshape is your problem.
Arrays are column major and contiguous
Consider the following function
function enumerateArray(a)
    for i = 1:*(size(a)...)
        print(a[i])
    end
end
This function multiplies all of the dimensions of a together and then loops from 1 to that number assuming a is one dimensional.
When you define a as the following
julia> a = [ 1 2; 3 4; 5 6]
3x2 Array{Int64,2}:
1 2
3 4
5 6
The result is
julia> enumerateArray(a)
135246
This illustrates a couple of things:
1. Yes, it actually works.
2. Matrices are stored in column-major format.
reshape
So, the question is why doesn't reshape use that fact? Well it does. Here's the julia source for reshape in array.c
a = (jl_array_t*)allocobj((sizeof(jl_array_t) + sizeof(void*) + ndimwords*sizeof(size_t) + 15)&-16);
So yes, a new array is created, but only the new dimension information is created; it points back to the original data, which is not copied. You can verify this like so:
b = reshape(a,6);
julia> size(b)
(6,)
julia> size(a)
(3,2)
julia> b[4]=100
100
julia> a
3x2 Array{Int64,2}:
1 100
3 4
5 6
So setting the 4th element of b sets the (1,2) element of a.
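The same check can be sketched with vec, which for an ordinary Array gives a one-dimensional array that shares memory with the original. This is a minimal illustration assuming a plain Array{Int64,2} and a current Julia version; the names v and m are made up for the example.

```julia
a = [1 2; 3 4; 5 6]

v = vec(a)               # 6-element vector sharing a's memory (column-major order)
v[4] = 100               # writes through to a[1, 2]

m = reshape(v, size(a))  # back to 3x2, again without copying the data
m[3, 2] = -1             # writes through to both v and a

println(a)
println(v)
```

Since neither call copies the data, flattening a matrix before passing it to an optimizer and reshaping it back inside the objective only creates new array headers.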
As for overall slowness
I1(A) = sum(maximum(X*A,1))
will create a new array.
You can use a couple of macros to track this down: @profile and @time. @time will additionally record the amount of memory allocated and can be put in front of any expression.
For example
julia> A = rand(1000,1000);
julia> X = rand(1000,1000);
julia> @time sum(maximum(X*A,1))
elapsed time: 0.484229671 seconds (8008640 bytes allocated)
266274.8435928134
The statistics recorded by @profile are output using Profile.print().
Also, most methods in Optim actually allow you to supply Arrays, not just Vectors. You could generalize the nelder_mead function to do the same.

Multidimensional Array Comprehension in Julia

I'm mucking about with Julia and can't seem to get multidimensional array comprehensions to work. I'm using a nightly build of 0.20-pre for OSX; this could conceivably be a bug in the build. I suspect, however, it's a bug in the user.
Let's say I want to wind up with something like:
5x2 Array
1 6
2 7
3 8
4 9
5 10
And I don't want to just call reshape. From what I can tell, a multidimensional array should be generated something like: [(x, y) for x in 1:5, y in 6:10]. But this generates a 5x5 Array of tuples:
julia> [(x, y) for x in 1:5, y in 6:10]
5x5 Array{(Int64,Int64),2}:
(1,6) (1,7) (1,8) (1,9) (1,10)
(2,6) (2,7) (2,8) (2,9) (2,10)
(3,6) (3,7) (3,8) (3,9) (3,10)
(4,6) (4,7) (4,8) (4,9) (4,10)
(5,6) (5,7) (5,8) (5,9) (5,10)
Or, maybe I want to generate a set of values and a boolean code for each:
5x2 Array
1 false
2 false
3 false
4 false
5 false
Again, I can only seem to create an array of tuples with {(x, y) for x in 1:5, y=false}. If I remove the parens around x, y I get ERROR: syntax: missing separator in array expression. If I wrap x, y in something, I always get output of that kind -- Array, Array{Any}, or Tuple.
My guess: there's something I just don't get here. Anybody willing to help me understand what?
I don't think a comprehension is appropriate for what you're trying to do. The reason can be found in the Array Comprehension section of the Julia Manual:
A = [ F(x,y,...) for x=rx, y=ry, ... ]
The meaning of this form is that F(x,y,...) is evaluated with the variables x, y, etc. taking on each value in their given list of values. Values can be specified as any iterable object, but will commonly be ranges like 1:n or 2:(n-1), or explicit arrays of values like [1.2, 3.4, 5.7]. The result is an N-d dense array with dimensions that are the concatenation of the dimensions of the variable ranges rx, ry, etc. and each F(x,y,...) evaluation returns a scalar.
A caveat here is that if you set one of the variables to a >1 dimensional Array, it seems to get flattened first; so the statement that "the result is... an array with dimensions that are the concatenation of the dimensions of the variable ranges rx, ry, etc" is not really accurate, since if rx is 2x2 and ry is 3, then you will not get a 2x2x3 result but rather a 4x3. But the result you're getting should make sense in light of the above: you are returning a tuple, so that's what goes in the Array cell. There is no automatic expansion of the returned tuple into the row of an Array.
If you want to get a 5x2 Array from a comprehension, you'll need to make sure x has a length of 5 and y has a length of 2. Then each cell would contain the result of the function evaluated with each possible pairing of elements from x and y as arguments. The thing is that the values in the cells of your example Arrays don't really require evaluating a function of two arguments. Rather, what you're trying to do is just to stick two predetermined columns together into a 2D array. For that, use hcat or a literal:
hcat(1:5, 6:10)
[ 1:5 6:10 ]
hcat(1:5, falses(5))
[ 1:5 falses(5) ]
If you wanted to create a 2D Array where column 2 contained the result of a function evaluated on column 1, you could do this with a comprehension like so:
f(x) = x + 5
[ y ? f(x) : x for x=1:5, y=(false,true) ]
But this is a little confusing and it seems more intuitive to me to just do
x = 1:5
hcat( x, map(f,x) )
I think you are just reading the list comprehension wrong
julia> [x+5y for x in 1:5, y in 0:1]
5x2 Array{Int64,2}:
1 6
2 7
3 8
4 9
5 10
When you use them in multiple dimensions you get two variables and need a function for the cell values based on the coordinates
For your second question, I think that you should reconsider your requirements. Julia uses typed arrays for performance; storing different types in different columns is possible, but it requires an untyped array, which you can get by using {} instead of []. I think the better solution is to have an array of tuples (Int, Bool) or, even better, just use two arrays (one for the ints and one for the bools).
julia> [(i,false) for i in 1:5]
5-element Array{(Int64,Bool),1}:
(1,false)
(2,false)
(3,false)
(4,false)
(5,false)
I kind of like the answer @fawr gave for the efficiency of the datatypes while retaining mutability, but this quickly gets you what you asked for (working off of Shawn's answer):
hcat(1:5,6:10)
hcat({i for i=1:5},falses(5))
The cell-array comprehension in the second part forces the datatype to be Any instead of IntXX
This also works:
hcat(1:5,{i for i in falses(5)})
I haven't found another way to explicitly convert an array to type Any besides the comprehension.
Your intuition was to write [(x, y) for x in 1:5, y in 6:10], but what you need is to wrap the ranges in zip, like this:
[i for i in zip(1:5, 6:10)]
Which gives you something very close to what you need, namely:
5-element Array{(Int64,Int64),1}:
(1,6)
(2,7)
(3,8)
(4,9)
(5,10)
To get exactly what you're looking for, you'll need:
hcat([[i...] for i in zip(1:5, 6:10)]...)'
This gives you:
5x2 Array{Int64,2}:
1 6
2 7
3 8
4 9
5 10
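For readers on a current Julia version, the same zip-based idea can be sketched without the splat and transpose. This is an illustration under the assumption of Julia 1.x, not the code from the answer above; the names rows, A and B are made up for the example.

```julia
# Pair up the two ranges element by element; each pair becomes a 1x2 row.
rows = [[x y] for (x, y) in zip(1:5, 6:10)]

# Stack the five rows vertically into a 5x2 matrix.
A = reduce(vcat, rows)

# The same matrix obtained by gluing the two columns together directly.
B = hcat(1:5, 6:10)

println(A)
println(A == B)
```

Both routes build the same 5x2 integer matrix; hcat is the shorter one when the columns are already known.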
This is another (albeit convoluted) way:
x1 = 1
x2 = 5
y1 = 6
y2 = 10
x = [x for x in x1:x2, y in y1:y2]
y = [y for x in x1:x2, y in y1:y2]
xy = cat(2,x[:],y[:])
As @ivarne noted
[{x,false} for x in 1:5]
would work and give you something mutable
I found a way to produce numerical multidimensional arrays via vcat and the splat operator:
R = [ [x y] for x in 1:3, y in 4:6 ] # make the list of rows
A = vcat(R...) # make n-dim. array from the row list
Then R will be a 3x3 Array{Array{Int64,2},2} while A is a 9x2 Array{Int64,2}, as you want.
For the second case (a set of values and a Boolean code for each), one can do something like
R = [[x y > 5] for x in 1:3, y in 4:6] # condition is y > 5
A = vcat(R...)
where A will be a 9x2 Array{Int64,2}, where true/false is denoted by 1/0.
I have tested those in Julia 0.4.7.

calculating sums of unique values in a log in R

I have a data frame with three columns: timestamp, key, event which is ordered by time.
ts,key,event
3,12,1
8,49,1
12,42,1
46,12,-1
100,49,1
From this, I want to create a data frame with timestamp and (all unique keys - all unique keys with cumulative sum 0 up until a given timestamp) divided by all unique keys until the same timestamp. E.g. for the above example the result should be:
ts,prob
3,1
8,1
12,1
46,2/3
100,2/3
My initial step is to calculate the cumsum grouped by key:
items = data.frame(ts=c(3,8,12,46,100), key=c(12,49,42,12,49), event=c(1,1,1,-1,1))
sumByKey = ddply(items, .(key), transform, sum=cumsum(event))
In the second (and final) step I iterate over sumByKey with a for-loop and keep track of both all unique keys and all unique keys that have a 0 in their sum using vectors, e.g. if(!(k %in% uniqueKeys)) uniqueKeys = append(uniqueKeys, key). The prob is derived using the two vectors.
Initially, I tried to solve the second step using plyr, but I wanted to avoid re-calculating the unique keys up to a certain timestamp for each row in sumByKey. What I'm missing is a way to refer to external variables from a function passed to ddply or, alternatively (and more functionally), to use an accumulator passed back into the function, e.g. function(acc, x) acc + x.
Is it possible to solve the second step in a better way, using e.g. ddply?
If my interpretation is right, then this should do it :
items = data.frame(ts=c(3,8,12,46,100), key=c(12,49,42,12,49), event=c(1,1,1,-1,1))
# numbers of keys that sum to zero, no ddply necessary
nzero <- cumsum(ave(items$event,items$key,FUN=cumsum)==0)
# number of unique keys at a given timepoint
nunique <- rep(F,length(items$key))
nunique[match(unique(items$key),items$key)] <- T
nunique <- cumsum(nunique)
# makes :
items$p <- (nunique-nzero)/nunique
items
   ts key event         p
1   3  12     1 1.0000000
2   8  49     1 1.0000000
3  12  42     1 1.0000000
4  46  12    -1 0.6666667
5 100  49     1 0.6666667
If your problem is only computational time, I bet the better idea will be to implement your algorithm as a C chunk; you may first use R to convert keys to a coherent interval of integers (as.numeric(factor(...))) and then use boolean array in C to obtain unique key number easily and very fast. Remember that neither plyr nor standard R *pplys are significantly faster than loops (providing both are used without embarrassing errors, of course).

Resources