How to add a Column to an empty DataFrame in Julia - julia

I want to append a vector as a column to an empty DataFrame. Suppose I have defined an empty DataFrame like this:
import DataFrames
dataframe = DataFrames.DataFrame()
Then I want to append this vector as a column to the dataframe:
vec = [1,2,3]
I tried push!(dataframe, vec), but I got this error:
DimensionMismatch("Length of `row` does not match `DataFrame` column count.")
Stacktrace:
[1] push!(df::DataFrames.DataFrame, row::Vector{Int64}; promote::Bool)
@ DataFrames C:\Users\Shayan\.julia\packages\DataFrames\BM4OQ\src\dataframe\dataframe.jl:1691
[2] push!(df::DataFrames.DataFrame, row::Vector{Int64})
@ DataFrames C:\Users\Shayan\.julia\packages\DataFrames\BM4OQ\src\dataframe\dataframe.jl:1680
[3] top-level scope
@ c:\Users\Shayan\Documents\PyJul Scripts\Jul-test.ipynb:2
[4] eval
@ .\boot.jl:373 [inlined]
[5] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
@ Base .\loading.jl:1196
[6] #invokelatest#2
@ .\essentials.jl:716 [inlined]
[7] invokelatest
@ .\essentials.jl:714 [inlined]
[8] (::VSCodeServer.var"#164#165"{VSCodeServer.NotebookRunCellArguments, String})()
@ VSCodeServer c:\Users\Shayan\.vscode\extensions\julialang.language-julia-1.6.17\scripts\packages\VSCodeServer\src\serve_notebook.jl:19
[9] withpath(f::VSCodeServer.var"#164#165"{VSCodeServer.NotebookRunCellArguments, String}, path::String)
@ VSCodeServer c:\Users\Shayan\.vscode\extensions\julialang.language-julia-1.6.17\scripts\packages\VSCodeServer\src\repl.jl:184
[10] notebook_runcell_request(conn::VSCodeServer.JSONRPC.JSONRPCEndpoint{Base.PipeEndpoint, Base.PipeEndpoint}, params::VSCodeServer.NotebookRunCellArguments)
@ VSCodeServer c:\Users\Shayan\.vscode\extensions\julialang.language-julia-1.6.17\scripts\packages\VSCodeServer\src\serve_notebook.jl:13
[11] dispatch_msg(x::VSCodeServer.JSONRPC.JSONRPCEndpoint{Base.PipeEndpoint, Base.PipeEndpoint}, dispatcher::VSCodeServer.JSONRPC.MsgDispatcher, msg::Dict{String, Any})
@ VSCodeServer.JSONRPC c:\Users\Shayan\.vscode\extensions\julialang.language-julia-1.6.17\scripts\packages\JSONRPC\src\typed.jl:67
[12] serve_notebook(pipename::String, outputchannel_logger::Base.CoreLogging.SimpleLogger; crashreporting_pipename::String)
@ VSCodeServer c:\Users\Shayan\.vscode\extensions\julialang.language-julia-1.6.17\scripts\packages\VSCodeServer\src\serve_notebook.jl:136
[13] top-level scope
@ c:\Users\Shayan\.vscode\extensions\julialang.language-julia-1.6.17\scripts\notebook\notebook.jl:32
[14] include(mod::Module, _path::String)
@ Base .\Base.jl:418
[15] exec_options(opts::Base.JLOptions)
@ Base .\client.jl:292
[16] _start()
@ Base .\client.jl:495
Also, I tried insert!(dataframe, vec), but I got this:
MethodError: no method matching insert!(::DataFrames.DataFrame, ::Vector{Int64})
Closest candidates are:
insert!(!Matched::DataStructures.AVLTree{K}, ::K) where K at C:\Users\Shayan\.julia\packages\DataStructures\vSp4s\src\avl_tree.jl:128
insert!(!Matched::DataStructures.SortedSet, ::Any) at C:\Users\Shayan\.julia\packages\DataStructures\vSp4s\src\sorted_set.jl:114
insert!(!Matched::DataStructures.SortedDict{K, D, Ord}, ::Any, !Matched::Any) where {K, D, Ord<:Base.Order.Ordering} at C:\Users\Shayan\.julia\packages\DataStructures\vSp4s\src\sorted_dict.jl:268
How can I do this? Any help would be appreciated.
Additional note: vec is intentionally defined after dataframe. I have to create the empty DataFrame first!

There are the following options, depending on what you need.
Add a vector without copying
julia> x = [1, 2, 3]
3-element Vector{Int64}:
1
2
3
julia> df = DataFrame()
0×0 DataFrame
julia> df.x = x
3-element Vector{Int64}:
1
2
3
julia> df.x === x
true
or
julia> x = [1, 2, 3]
3-element Vector{Int64}:
1
2
3
julia> df = DataFrame()
0×0 DataFrame
julia> df[!, :x] = x
3-element Vector{Int64}:
1
2
3
julia> df.x === x
true
Add a vector with copying
julia> x = [1, 2, 3]
3-element Vector{Int64}:
1
2
3
julia> df = DataFrame()
0×0 DataFrame
julia> df[:, :x] = x
3-element Vector{Int64}:
1
2
3
julia> df.x == x
true
julia> df.x === x
false
If you have a scalar, you can use insertcols! (this also works with vectors):
julia> df = DataFrame()
0×0 DataFrame
julia> insertcols!(df, :x => 1)
1×1 DataFrame
Row │ x
│ Int64
─────┼───────
1 │ 1
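As a rough sketch of the "also works with vectors" remark, insertcols! can take a vector directly, and a scalar passed to a data frame that already has rows is repeated for every row. The column names :x and :y are illustrative, and this assumes DataFrames.jl 1.x, where insertcols! copies the vector by default:

```julia
using DataFrames

x = [1, 2, 3]
df = DataFrame()

# Adding a vector creates a 3-row column; by default the vector is copied
insertcols!(df, :x => x)

# A scalar is repeated to match the existing number of rows
insertcols!(df, :y => 0)

df.x == x    # true: same values
df.x === x   # false: insertcols! stored a copy
size(df)     # (3, 2)
```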

You can do as follows:
julia> r=DataFrame(:a=>rand(5),:b=>rand(5))
5×2 DataFrame
Row │ a b
│ Float64 Float64
─────┼────────────────────
1 │ 0.8613 0.207534
2 │ 0.994096 0.561571
3 │ 0.220975 0.429286
4 │ 0.884805 0.835078
5 │ 0.964035 0.653509
julia> r[:,:c]=rand(5)
5-element Vector{Float64}:
0.5722614445699863
0.1582911302051686
0.14114436033460553
0.20981872218154363
0.07636493031324465
julia> r
5×3 DataFrame
Row │ a b c
│ Float64 Float64 Float64
─────┼───────────────────────────────
1 │ 0.8613 0.207534 0.572261
2 │ 0.994096 0.561571 0.158291
3 │ 0.220975 0.429286 0.141144
4 │ 0.884805 0.835078 0.209819
5 │ 0.964035 0.653509 0.0763649
nb: also works starting from an empty dataframe:
julia> r=DataFrame()
0×0 DataFrame
julia> r[:,:c]=rand(5)
5-element Vector{Float64}:
0.6792303081607677
0.08094072339097869
0.5171831771259873
0.35343166177619845
0.44751700973394026
julia> r
5×1 DataFrame
Row │ c
│ Float64
─────┼───────────
1 │ 0.67923
2 │ 0.0809407
3 │ 0.517183
4 │ 0.353432
5 │ 0.447517
Update & summary (completed using Bogumił Kamiński's answer)
You can do:
d[:,:colname] = x_vector # stores a copy of x_vector
d[!,:colname] = x_vector # no copy; the column shares memory with x_vector
If x is a scalar, see Bogumił Kamiński's answer.
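To see why the copy/no-copy distinction matters, here is a small sketch (the column names :shared and :copied are made up for illustration). Mutating the original vector afterwards shows through only in the column that was added with !:

```julia
using DataFrames

x = [1, 2, 3]
d = DataFrame()
d[!, :shared] = x   # the column is the very same vector as x
d[:, :copied] = x   # the column is an independent copy of x

x[1] = 100          # mutate the original vector

d.shared[1]         # 100, because the column aliases x
d.copied[1]         # 1, because the copy is unaffected
```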

Related

How to apply a statistical test to each group of a dataframe in Julia? (tapply equivalent)

In Julia, I want to test the normality of a variable for each group defined in another column in a dataframe.
Let's say we have:
using DataFrames, Distributions
df = DataFrame(x = rand(Normal(),30), group = repeat(["A", "B"],15))
I know I can test the normality of x with:
using HypothesisTests
OneSampleADTest(df.x, Normal())
So the question is: how do I test the normality of x for each group? In R, I would use tapply(), but I couldn't find the equivalent in Julia...
It depends on what output you expect. I recommend that you store the result in a data frame (this is not what tapply does):
julia> gdf = groupby(df, :group, sort=true) # group by :group and keep groups sorted
GroupedDataFrame with 2 groups based on key: group
First Group (15 rows): group = "A"
Row │ x group
│ Float64 String
─────┼───────────────────
1 │ -0.869008 A
2 │ 0.190041 A
3 │ 0.369881 A
4 │ 0.445092 A
⋮ │ ⋮ ⋮
13 │ -0.599266 A
14 │ 0.696132 A
15 │ 0.788465 A
8 rows omitted
⋮
Last Group (15 rows): group = "B"
Row │ x group
│ Float64 String
─────┼───────────────────
1 │ -1.19973 B
2 │ 0.557241 B
3 │ -0.425667 B
4 │ 0.787917 B
⋮ │ ⋮ ⋮
13 │ 1.96912 B
14 │ 0.567594 B
15 │ 1.39739 B
8 rows omitted
julia> res = combine(gdf, :x => (x -> OneSampleADTest(x, Normal())) => :ADTest)
2×2 DataFrame
Row │ group ADTest
│ String OneSampl…
─────┼───────────────────────────────────────────
1 │ A One sample Anderson-Darling test…
2 │ B One sample Anderson-Darling test…
Now in res you have both group name and the result of the test (a full test-result object that you can work with later).
If you are interested only in the p-value, do:
julia> res = combine(gdf, :x => (x -> pvalue(OneSampleADTest(x, Normal()))) => :ADTest_pvalue)
2×2 DataFrame
Row │ group ADTest_pvalue
│ String Float64
─────┼───────────────────────
1 │ A 0.469626
2 │ B 0.750134
If you are used to dplyr style, use DataFramesMeta.jl:
julia> using DataFramesMeta
julia> @combine(gdf, :ADTest = OneSampleADTest(:x, Normal()))
2×2 DataFrame
Row │ group ADTest
│ String OneSampl…
─────┼───────────────────────────────────────────
1 │ A One sample Anderson-Darling test…
2 │ B One sample Anderson-Darling test…
julia> @combine(gdf, :ADTest_pvalue = pvalue(OneSampleADTest(:x, Normal())))
2×2 DataFrame
Row │ group ADTest_pvalue
│ String Float64
─────┼───────────────────────
1 │ A 0.469626
2 │ B 0.750134
If you just want the p-value for each group in the data frame:
julia> combine(groupby(df, :group), :x => (x -> pvalue(OneSampleADTest(x, Normal()))) => :onesampleAD_pvalue)
2×2 DataFrame
Row │ group onesampleAD_pvalue
│ String Float64
─────┼────────────────────────────
1 │ A 0.275653
2 │ B 0.544317
If you want to print the test details (or do more complex manipulations) per group, you can instead loop over the groups too:
julia> for (key, sdf) in pairs(groupby(df, :group))
println("Group $(key.group)")
display(OneSampleADTest(sdf.x, Normal()))
end
Group A
One sample Anderson-Darling test
--------------------------------
...
Group B
One sample Anderson-Darling test
--------------------------------
...
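If you want something closer to what tapply returns in R (one result per group, looked up by group name), one possible sketch is to collect the p-values into a Dict. The name pvals is an illustrative choice, and because the data is random the actual numbers differ from run to run:

```julia
using DataFrames, Distributions, HypothesisTests

df = DataFrame(x = rand(Normal(), 30), group = repeat(["A", "B"], 15))

# Map each group label to the p-value of its Anderson-Darling test
pvals = Dict(
    key.group => pvalue(OneSampleADTest(sdf.x, Normal()))
    for (key, sdf) in pairs(groupby(df, :group))
)

pvals["A"]   # p-value for group A
```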

Storing all nonzero elements of a 2-D array via slicing

To extract all nonzero elements of a one-dimensional array, we do the following:
One_D = [1,4,5,0,0,4,7,0,2,6]
One_D[One_D .> 0]
How can I do a similar thing for an array with two or more dimensions?
two_D = [[1,0,2,3,0],[4,0,5,0,6]]
Obviously, two_D[two_D .> 0] is incorrect. So, what else can we try?
Your two_D is not two-dimensional; it is a vector of vectors. You can then use a broadcasted filter:
julia> filter.(>(0), two_D)
2-element Vector{Vector{Int64}}:
[1, 2, 3]
[4, 5, 6]
If instead your two_D were a matrix like this:
julia> mat = [[1,0,2,3,0] [4,0,5,0,6]]
5×2 Matrix{Int64}:
1 4
0 0
2 5
3 0
0 6
You can still use filter but without broadcasting. In this case you will get a flat vector of found elements:
julia> filter(>(0), mat)
6-element Vector{Int64}:
1
2
3
4
5
6
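For a true matrix, the logical-indexing idiom from the question also works, and findall reports where the positive entries are. A sketch in plain Base Julia, reusing two_D and mat from above (flat is an illustrative name):

```julia
two_D = [[1, 0, 2, 3, 0], [4, 0, 5, 0, 6]]
mat = [[1, 0, 2, 3, 0] [4, 0, 5, 0, 6]]

# Logical indexing on a matrix returns a flat vector in column-major order,
# the same elements that filter(>(0), mat) returns
mat[mat .> 0]

# Positions of the positive entries as CartesianIndex values
findall(>(0), mat)

# Flatten the per-vector results of the broadcasted filter into one vector
flat = reduce(vcat, filter.(>(0), two_D))
```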

Conditional selection of elements of a list in Base R

I'm trying to find the unique elements in the variables listed as x.
The only constraint is that I want to first find the variable (here either a, b, or c) in the list whose max element is smallest, and keep that variable untouched at the top of the output.
I have tried something but can't implement the constraint above:
P.S. My goal is to achieve a function/looping structure to handle larger lists.
x = list(a = 1:5, b = 3:7, c = 6:9) ## a list of 3 variables; variable `a` has the smallest
## max among all variables in the list, so keep `a`
## untouched at the top of the output.
x[-1] <- Map(setdiff, x[-1], x[-length(x)]) ## Now, take the values of `b` not shared
## with `a`, AND values of `c` not shared
## with `b`.
x
# Output: # This output is OK now, but if we change the order of `a`, `b`,
# and `c` in the initial list the output will change.
# This is why the constraint above is necessary.
$a
[1] 1 2 3 4 5
$b
[1] 6 7
$c
[1] 8 9
#Find which element in the list has smallest max.
smallest_max <- which.min(sapply(x, max))
#Rearrange the list by keeping the smallest max in first place
#followed by remaining ones
new_x <- c(x[smallest_max], x[-smallest_max])
#Apply the Map function
new_x[-1] <- Map(setdiff, new_x[-1], new_x[-length(new_x)])
new_x
#$a
#[1] 1 2 3 4 5
#$b
#[1] 6 7
#$c
#[1] 8 9
We can wrap this up in a function and then use it
keep_smallest_max <- function(x) {
smallest_max <- which.min(sapply(x, max))
new_x <- c(x[smallest_max], x[-smallest_max])
new_x[-1] <- Map(setdiff, new_x[-1], new_x[-length(new_x)])
new_x
}
keep_smallest_max(x)
#$a
#[1] 1 2 3 4 5
#$b
#[1] 6 7
#$c
#[1] 8 9

How to split a list of vectors into sublist based on a values of another vector in r

Suppose I have a list of vectors x as follows:
> x <- list(x1=c(1,2,3), x2=c(1,4,3), x3=c(3,4,6), x4=c(4,8,4), x5=c(4,33,4), x6=c(9,6,7))
Suppose I have another two vectors y, y1 such that:
y <- c(3,3)
y1 <- c(2,4)
I would like to split x based on the values of y and y1. For example:
For y, I would like to split x into two sub-lists with the same number of vectors (3 vectors in each sub-list).
For y1, I would like to split x into two sub-lists with different numbers of vectors, where the first sub-list contains 2 vectors and the second sub-list contains 4 vectors.
I tried this:
> z <- split(x, y[1])
but it is not what I expected.
The output should be as follows:
based on y:
sublist_1 = list(x1, x2, x3),
sublist_2= list(x4,x5,x6)
based on y1:
sublist_1 = list(x1, x2),
sublist_2 = list(x3, x4, x5, x6)
Any help, please?
We can use split to split the list into elements by creating groups to split.
split(x, rep(c(1, 2), y))
#$`1`
#$`1`$x1
#[1] 1 2 3
#$`1`$x2
#[1] 1 4 3
#$`1`$x3
#[1] 3 4 6
#$`2`
#$`2`$x4
#[1] 4 8 4
#$`2`$x5
#[1] 4 33 4
#$`2`$x6
#[1] 9 6 7
We can also write a function to do this
split_list <- function(x, split_var) {
split(x, rep(seq_along(split_var), split_var))
}
split_list(x, y1)
#$`1`
#$`1`$x1
#[1] 1 2 3
#$`1`$x2
#[1] 1 4 3
#$`2`
#$`2`$x3
#[1] 3 4 6
#$`2`$x4
#[1] 4 8 4
#$`2`$x5
#[1] 4 33 4
#$`2`$x6
#[1] 9 6 7

How to use map in Julia to mimic a nested list comprehension?

I would like to use Julia's map function to mimic a nested list comprehension. This ability would be particularly useful for parallel mapping (pmap).
For example, this nested list comprehension
[x+y for x in [0,10] for y in [1,2,3]]
produces the nice result
6-element Array{Int64,1}:
1
2
3
11
12
13
and this
[x+y for x in [0,10], y in [1,2,3]]
produces the equally nice
2×3 Array{Int64,2}:
1 2 3
11 12 13
Either of these outcomes is satisfactory for my purposes.
Now here is my best effort at replicating these outcomes with map
map([0,10]) do x
map([1,2,3]) do y
x + y
end
end
which yields the correct results, just not in a form I admire:
2-element Array{Array{Int64,1},1}:
[1, 2, 3]
[11, 12, 13]
Now I know there are brute-force ways to get the outcome I want, such as hcat/vcat'ing the output or manipulating the input into a matrix, but I'd like to know if there exists a solution as elegant as the nested list comprehension.
The simplest way I can think of is to use comprehensions and combine them with map (low benefit) or pmap (here you get the real value).
On Julia 0.7 (this relies on the destructuring of function arguments that is available in this release):
julia> map(((x,y) for x in [0,10] for y in [1,2,3])) do (x,y)
x+y
end
6-element Array{Int64,1}:
1
2
3
11
12
13
julia> map(((x,y) for x in [0,10], y in [1,2,3])) do (x,y)
x+y
end
2×3 Array{Int64,2}:
1 2 3
11 12 13
On Julia 0.6.2 (less nice):
julia> map(((x,y) for x in [0,10] for y in [1,2,3])) do v
v[1]+v[2]
end
6-element Array{Int64,1}:
1
2
3
11
12
13
julia> map(((x,y) for x in [0,10], y in [1,2,3])) do v
v[1]+v[2]
end
2×3 Array{Int64,2}:
1 2 3
11 12 13
You could use Iterators.product:
julia> map(t -> t[1]+t[2], Iterators.product([0,10], [1,2,3]))
2×3 Array{Int64,2}:
1 2 3
11 12 13
Iterators.product returns an iterator whose elements are tuples.
(It's a shame the anonymous function above couldn't be written (x,y) -> x+y)
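On Julia 0.7 and later, argument destructuring also works in anonymous functions, provided the tuple pattern is wrapped as a single argument. A short sketch, with broadcasting shown as an alternative for this particular sum (result, flat, and same are illustrative names):

```julia
# ((x, y),) declares ONE argument that is destructured into x and y
result = map(((x, y),) -> x + y, Iterators.product([0, 10], [1, 2, 3]))

# vec flattens the 2×3 matrix column by column; note that this order differs
# from the nested `for x ... for y` comprehension
flat = vec(result)

# For a simple elementwise operation, broadcasting against a row gives the same matrix
same = [0, 10] .+ [1, 2, 3]'
```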
