Multiple summary statistics on grouped column in Julia

I am trying to get the code below to work in Julia (1.5.3). It is just a representation of what I am trying to do.
using DataFrames
using DataFramesMeta
using RDatasets
using Statistics  # provides mean and median
## setup
iris = dataset("datasets", "iris")
gdf = groupby(iris, :Species)
## Applying the split combine
## This code works fine
combine(gdf, nrow, (valuecols(gdf) .=> mean))
But when I try to compute multiple summaries at once, it fails:
combine(gdf, nrow, (valuecols(gdf) .=> [mean, sum]))
Error:
ERROR: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 4 and 2")
A little debugging of the error suggests changing my code to this:
combine(gdf, nrow, ([:SepalLength, :PetalLength] .=> [mean,sum]))
## This code works, but it is still not what I want: it gives the mean of SepalLength and the sum of PetalLength, not the mean and sum of both columns (which is consistent with the previous error)
After a little more research I found that we can do something like the following. The values are correct, but the result is in long form rather than wide form, so it does not answer my question.
combine(gdf, ([:SepalWidth, :PetalWidth] .=> x -> ([sum(x), mean(x)])))
## The code above works, but the output is a 6x3 DataFrame; I was expecting a 3x6 DataFrame
My question is:
Is there any way to use split-combine so that I get a wide table like the one below (I used "do ... end" with "combine" to generate it)? I am okay with this solution, but I need to type out every column. Is there a way to get all the summary statistics (sum, median, mean, etc.) as columns for every column passed to combine? I hope my question is clear. Please point it out if it is a duplicate or not well communicated. Thanks.
combine(gdf) do x
    return (sw_sum = sum(x.SepalWidth),
            sw_mean = mean(x.SepalWidth),
            sp_mean = mean(x.PetalWidth),
            sp_sum = sum(x.PetalWidth))
end
## My expected answer should be similar to this
# 3×5 DataFrame
#  Row │ Species     sw_sum   sw_mean  sp_mean  sp_sum
#      │ Cat…        Float64  Float64  Float64  Float64
# ─────┼────────────────────────────────────────────────
#    1 │ setosa        171.4    3.428    0.246     12.3
#    2 │ versicolor    138.5    2.77     1.326     66.3
#    3 │ virginica     148.7    2.974    2.026    101.3
Also, this works:
combine(gdf, [:1] .=> [mean, sum, minimum, maximum,median])
But this doesn't work and throws the same dimension error as above. I am still scratching my head over this:
combine(gdf, [:1, :2] .=> [mean, sum, minimum, maximum,median])

Do:
combine(gdf, nrow, vec(valuecols(gdf) .=> [mean sum]))
or
combine(gdf, nrow, (valuecols(gdf) .=> [mean sum])...)
or
combine(gdf, nrow, [n => f for n in valuecols(gdf) for f in [mean sum]])
(Note that there is no comma between mean and sum: [mean sum] is a 1×2 row matrix, not a 2-element vector.)
The reason is that you need to add an additional dimension to the broadcasted .=> in order to get all combinations of inputs.
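The shape rule can be checked without DataFrames at all. This is a minimal sketch using only Base; the symbols in cols are hypothetical stand-ins for valuecols(gdf), and sum and maximum stand in for the summary functions.

```julia
cols = [:a, :b, :c, :d]   # hypothetical column names standing in for valuecols(gdf)

# A 4-element vector against a 2-element vector cannot be broadcast.
failed = try
    cols .=> [sum, maximum]
    false
catch err
    err isa DimensionMismatch
end

# A 4-element vector against a 1×2 row matrix broadcasts to a 4×2 matrix of pairs.
pairs_matrix = cols .=> [sum maximum]

# vec flattens the matrix column by column into the vector form used in the answer.
pairs_vector = vec(pairs_matrix)

println(failed)               # true
println(size(pairs_matrix))   # (4, 2)
println(length(pairs_vector)) # 8
```

Every column is paired with every function, which is why the row-matrix form yields one output column per combination.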
EDIT:
The ... (splat) operator just iterates a collection and passes its elements as consecutive positional arguments to the function, e.g.:
julia> f(x...) = x
f (generic function with 1 method)
julia> f(1, [2,3,4]...)
(1, 2, 3, 4)

Related

Julia: Vectorize function along a specific axis of a matrix

What is the best way in Julia to vectorize a function along a specific axis? For example sum up all the rows of a matrix. Is it possible with the dot notation?
sum.(ones(4,4))
Does not yield the desired result.

Many functions that reduce over a collection of values accept a dims keyword argument. Try using it:
sum([1 2; 3 4], dims=2)
2×1 Matrix{Int64}:
3
7
# or
using Statistics
mean([1 2; 3 4], dims=1)
1×2 Matrix{Float64}:
2.0 3.0

There is already a standard function called mapslices, looks like exactly what you need.
julia> mapslices(sum, ones(4, 4), dims = 2)
4×1 Matrix{Float64}:
4.0
4.0
4.0
4.0
You can find the documentation in the Julia manual or by typing ? followed by mapslices in the REPL.
If in your example you want to use the dot notation you should pass an array of rows, not the array itself. Otherwise, sum is applied to each element resulting in the same matrix. It can be done with eachrow and eachcol for rows and columns respectively.
julia> sum.(eachrow(ones(4, 4)))
4-element Vector{Float64}:
4.0
4.0
4.0
4.0
EDIT: I tried to suggest a more general solution, but if you have this option I would recommend using Andre's answer.
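To compare the approaches from the answers above side by side, here is a small sketch using only Base (eachrow assumes Julia 1.1 or later). The variable names are illustrative only.

```julia
A = [1 2; 3 4]

elementwise = sum.(A)                     # sum of each scalar, so the values equal A
by_dims     = sum(A, dims=2)              # 2×1 matrix of row sums
by_slices   = mapslices(sum, A, dims=2)   # also a 2×1 matrix
by_rows     = sum.(eachrow(A))            # plain 2-element vector

println(elementwise == A)                 # true
println(size(by_dims), " ", size(by_slices), " ", size(by_rows))
println(vec(by_dims) == by_rows)          # true
```

The dims and mapslices forms keep the reduced dimension with length 1, while broadcasting over eachrow returns a flat vector; vec converts the former into the latter.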

Calculate mean value of two objects

Let's say we have two objects at the beginning:
a <- c(2,4,6)
b <- 8
If we apply the mean() function in each of them we get this:
> mean(a)
[1] 4
> mean(b)
[1] 8
... which is absolutely normal.
If I create a new object merging a and b...
c <- c(2,4,6,8)
and calculate its mean...
> mean(c)
[1] 5
... we get 5, which is the expected value.
However, I would like to calculate the mean value of both objects at the same time. I tried this way:
> mean(a,b)
[1] 4
As we can see, its value differs from the expected correct value (5). What am I missing?

As mentioned, the correct solution is to concatenate the vectors before passing them to mean:
mean(c(a, b))
The reason that your original code gives a wrong result is due to what mean’s second argument is:
## Default S3 method:
mean(x, trim = 0, na.rm = FALSE, ...)
So when calling mean with two numeric arguments, the second one is passed as the trim argument, which, in turn, controls how much trimming is to be done. In your case, 8 causes the function to simply return the median (meaningful values for trim would be fractions between 0 and 0.5).
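Since most of this page is about Julia, here is a rough Julia sketch of the trimming rule described above. trimmed_mean is a hypothetical helper written for illustration; it is not R's source code, it ignores na.rm, and it only assumes that a trim of 0.5 or more falls back to the median, as stated in the answer.

```julia
using Statistics

# Hypothetical helper mirroring the rule described above (not R's implementation).
function trimmed_mean(x::AbstractVector{<:Real}, trim::Real = 0)
    n = length(x)
    if trim > 0 && n > 0
        trim >= 0.5 && return median(x)   # over-large trim falls back to the median
        lo = floor(Int, n * trim) + 1     # first index kept after trimming
        hi = n + 1 - lo                   # last index kept after trimming
        return mean(sort(x)[lo:hi])
    end
    return mean(x)
end

a = [2, 4, 6]
b = 8

println(trimmed_mean(a, b))   # 4.0: b acts as trim, so this is the median of a
println(mean(vcat(a, b)))     # 5.0: concatenate first, like mean(c(a, b)) in R
```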

If you print the argument a,b that you are feeding into the mean function, you will see that only a prints:
print(a,b)
[1] 2 4 6
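So mean(a,b) does not treat b as data: only a is summarized (b is taken as the trim argument, and the resulting median of a happens to equal its mean, 4).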
mean(c(a,b)) will produce the expected answer of 5.

Preallocate a data frame of known size in Julia

When I'm running simulations, I like to initialize a big, empty array and fill it up as the simulation iterates through to the end. I do this with something like res = Array(Real,(n_iterations,n_parameters)). However, it would be nice to have named columns, which I think means using a DataFrame. Yet when I try to do something like res_df = convert(DataFrame,res) it throws an error. I would like a more concise approach than doing something like res_df = DataFrame(a=Array(Real,N),b=Array(Real,N),c=Array(Real,N),....) as suggested by the answers to: julia create an empty dataframe and append rows to it

To preallocate a data frame, you must preallocate its columns. You can create three columns full of missing values by simply doing [fill(missing, 10000) for _ in 1:3], but that doesn't actually reserve any usable storage, because those vectors have element type Missing and can only hold one value, missing, so they can't be changed to hold other values later. One way to do this is by using Vector constructors whose element type can hold either Missing or Float64:
julia> DataFrame([Vector{Union{Missing, Float64}}(missing, 10000) for _ in 1:3], [:a, :b, :c])
10000×3 DataFrame
   Row │ a         b         c
       │ Float64?  Float64?  Float64?
───────┼──────────────────────────────
     1 │ missing   missing   missing
     2 │ missing   missing   missing
   ⋮   │    ⋮         ⋮         ⋮
 10000 │ missing   missing   missing
                    9997 rows omitted
Note that rather than Real, this is using the concrete Float64 — this will have significantly better performance.
(this answer was edited to reflect DataFrames v1.0 syntax)
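The difference between the two kinds of column can be seen with plain vectors, without DataFrames. This is a minimal sketch; only_missing and slots are illustrative names.

```julia
only_missing = fill(missing, 3)                        # element type is Missing
slots = Vector{Union{Missing, Float64}}(missing, 3)    # can hold Float64 later

slots[1] = 1.5                    # fine: Float64 fits the element type

rejected = try
    only_missing[1] = 1.5         # fails: missing is the only allowed value
    false
catch err
    err isa MethodError
end

println(eltype(only_missing))     # Missing
println(slots[1])                 # 1.5
println(rejected)                 # true
```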

Multidimensional Array Comprehension in Julia

I'm mucking about with Julia and can't seem to get multidimensional array comprehensions to work. I'm using a nightly build of 0.20-pre for OSX; this could conceivably be a bug in the build. I suspect, however, it's a bug in the user.
Let's say I want to wind up with something like:
5x2 Array
1 6
2 7
3 8
4 9
5 10
And I don't want to just call reshape. From what I can tell, a multidimensional array should be generated something like: [(x, y) for x in 1:5, y in 6:10]. But this generates a 5x5 Array of tuples:
julia> [(x, y) for x in 1:5, y in 6:10]
5x5 Array{(Int64,Int64),2}:
(1,6) (1,7) (1,8) (1,9) (1,10)
(2,6) (2,7) (2,8) (2,9) (2,10)
(3,6) (3,7) (3,8) (3,9) (3,10)
(4,6) (4,7) (4,8) (4,9) (4,10)
(5,6) (5,7) (5,8) (5,9) (5,10)
Or, maybe I want to generate a set of values and a boolean code for each:
5x2 Array
1 false
2 false
3 false
4 false
5 false
Again, I can only seem to create an array of tuples with {(x, y) for x in 1:5, y=false}. If I remove the parens around x, y I get ERROR: syntax: missing separator in array expression. If I wrap x, y in something, I always get output of that kind -- Array, Array{Any}, or Tuple.
My guess: there's something I just don't get here. Anybody willing to help me understand what?

I don't think a comprehension is appropriate for what you're trying to do. The reason can be found in the Array Comprehension section of the Julia Manual:
A = [ F(x,y,...) for x=rx, y=ry, ... ]
The meaning of this form is that F(x,y,...) is evaluated with the variables x, y, etc. taking on each value in their given list of values. Values can be specified as any iterable object, but will commonly be ranges like 1:n or 2:(n-1), or explicit arrays of values like [1.2, 3.4, 5.7]. The result is an N-d dense array with dimensions that are the concatenation of the dimensions of the variable ranges rx, ry, etc. and each F(x,y,...) evaluation returns a scalar.
A caveat here is that if you set one of the variables to a >1 dimensional Array, it seems to get flattened first; so the statement that "the result is... an array with dimensions that are the concatenation of the dimensions of the variable ranges rx, ry, etc" is not really accurate, since if rx is 2x2 and ry is 3, then you will not get a 2x2x3 result but rather a 4x3. But the result you're getting should make sense in light of the above: you are returning a tuple, so that's what goes in the Array cell. There is no automatic expansion of the returned tuple into the row of an Array.
If you want to get a 5x2 Array from a comprehension, you'll need to make sure x has a length of 5 and y has a length of 2. Then each cell would contain the result of the function evaluated with each possible pairing of elements from x and y as arguments. The thing is that the values in the cells of your example Arrays don't really require evaluating a function of two arguments. Rather, what you're trying to do is just to stick two predetermined columns together into a 2D array. For that, use hcat or a literal:
hcat(1:5, 6:10)
[ 1:5 6:10 ]
hcat(1:5, falses(5))
[ 1:5 falses(5) ]
If you wanted to create a 2D Array where column 2 contained the result of a function evaluated on column 1, you could do this with a comprehension like so:
f(x) = x + 5
[ y ? f(x) : x for x=1:5, y=(false,true) ]
But this is a little confusing and it seems more intuitive to me to just do
x = 1:5
hcat( x, map(f,x) )
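For readers on current Julia (1.x), the shapes discussed in this answer can be checked with a short sketch. The variable names are illustrative; f is the same function as above.

```julia
tuples = [(x, y) for x in 1:5, y in 6:10]   # one cell per (x, y) pairing
columns = hcat(1:5, 6:10)                   # two given columns side by side

f(x) = x + 5
computed = hcat(1:5, map(f, 1:5))           # second column derived from the first

println(size(tuples))          # (5, 5)
println(size(columns))         # (5, 2)
println(columns == computed)   # true
```

The comprehension builds a grid over all pairings, while hcat simply places existing columns next to each other.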

I think you are just reading the list comprehension wrong:
julia> [x+5y for x in 1:5, y in 0:1]
5x2 Array{Int64,2}:
1 6
2 7
3 8
4 9
5 10
When you use them in multiple dimensions, you get two variables and need a function that computes each cell value from its coordinates.
For your second question I think that you should reconsider your requirements. Julia uses typed arrays for performance, so storing different types in different columns of one typed array is not possible. To get an untyped array you can use {} instead of [], but I think the better solution is to have an array of tuples (Int, Bool) or, even better, just use two arrays (one for the ints and one for the bools).
julia> [(i,false) for i in 1:5]
5-element Array{(Int64,Bool),1}:
(1,false)
(2,false)
(3,false)
(4,false)
(5,false)

I kind of like the answer @fawr gave for the efficiency of the datatypes while retaining mutability, but this quickly gets you what you asked for (working off of Shawn's answer):
hcat(1:5,6:10)
hcat({i for i=1:5},falses(5))
The cell-array comprehension in the second part forces the datatype to be Any instead of IntXX.
This also works:
hcat(1:5,{i for i in falses(5)})
I haven't found another way to explicitly convert an array to type Any besides the comprehension.

Your intuition was to write [(x, y) for x in 1:5, y in 6:10], but what you need is to wrap the ranges in zip, like this:
[i for i in zip(1:5, 6:10)]
Which gives you something very close to what you need, namely:
5-element Array{(Int64,Int64),1}:
(1,6)
(2,7)
(3,8)
(4,9)
(5,10)
To get exactly what you're looking for, you'll need:
hcat([[i...] for i in zip(1:5, 6:10)]...)'
This gives you:
5x2 Array{Int64,2}:
1 6
2 7
3 8
4 9
5 10
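The session above uses old Julia syntax. On Julia 1.x, one way to get a plain 5×2 matrix from the same zip idea is sketched below; this is only one option among several, and rows and M are illustrative names.

```julia
rows = [collect(t) for t in zip(1:5, 6:10)]   # five 2-element vectors
M = permutedims(reduce(hcat, rows))           # 2×5 after hcat, 5×2 after permutedims

println(size(M))                  # (5, 2)
println(M == hcat(1:5, 6:10))     # true
```

Using reduce(hcat, rows) avoids splatting a long collection into a function call, and permutedims returns an ordinary Matrix rather than a lazy adjoint wrapper.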

This is another (albeit convoluted) way:
x1 = 1
x2 = 5
y1 = 6
y2 = 10
x = [x for x in x1:x2, y in y1:y2]
y = [y for x in x1:x2, y in y1:y2]
xy = cat(2,x[:],y[:])

As @ivarne noted,
[{x,false} for x in 1:5]
would work and give you something mutable

I found a way to produce numerical multidimensional arrays via vcat and the splat operator:
R = [ [x y] for x in 1:3, y in 4:6 ] # make the list of rows
A = vcat(R...) # make n-dim. array from the row list
Then R will be a 3x3 Array{Array{Int64,2},2} while A is a 9x2 Array{Int64,2}, as you want.
For the second case (a set of values and a Boolean code for each), one can do something like
R = [[x y > 5] for x in 1:3, y in 4:6] # condition is y > 5
A = vcat(R...)
where A will be a 9x2 Array{Int64,2}, where true/false is denoted by 1/0.
I have tested those in Julia 0.4.7.

Mean Row of Matrix

I'm trying to use mean(A,1) to get the mean row of a matrix A, but am getting an error.
For example, try running the command mean(eye(3), 1).
This gives the error no method mean(Array{Float64,2},Int32).
The only documentation I can find for the mean function is here:
http://docs.julialang.org/en/release-0.1/stdlib/base/#statistics
mean(v[, region])
Compute the mean of whole array v, or optionally along the dimensions in region.
What is the region parameter?
EDIT: for Julia 0.7 and higher, write this as mean(v, dims=1).

julia> using Statistics
julia> A = [[1 2 3];[ 4 5 6]]
2×3 Array{Int64,2}:
1 2 3
4 5 6
# Column means
julia> mean(A, dims=1)
1×3 Array{Float64,2}:
2.5 3.5 4.5
# Row means
julia> mean(A, dims=2)
2×1 Array{Float64,2}:
2.0
5.0
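To connect this back to the question: the "mean row" is what dims=1 returns, since that dimension (the rows) is the one averaged away. A small sketch, assuming Julia 1.1 or later for eachcol:

```julia
using Statistics

A = [1 2 3; 4 5 6]

mean_row = mean(A, dims=1)                    # 1×3 matrix: average over the rows
mean_row_vec = vec(mean_row)                  # same values as a flat vector
per_column = [mean(c) for c in eachcol(A)]    # equivalent, column by column

println(size(mean_row))               # (1, 3)
println(mean_row_vec)                 # [2.5, 3.5, 4.5]
println(mean_row_vec == per_column)   # true
```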

It must be something with your installation, mean(eye(3),1) works just fine here.
