I have a dataframe with a column of datetimes, a column of floats, and a column of integers like this:
┌─────────────────────────┬───────────┬─────────────┐
│ time ┆ NAV_DEPTH ┆ coarse_ints │
│ --- ┆ --- ┆ --- │
│ datetime[ms] ┆ f64 ┆ i64 │
╞═════════════════════════╪═══════════╪═════════════╡
│ 2019-07-21 23:25:02.737 ┆ 3.424 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-07-21 23:25:32.745 ┆ 2.514 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-07-21 23:26:02.753 ┆ 2.514 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-07-21 23:26:32.668 ┆ 2.323 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-07-23 21:24:16.383 ┆ 3.17 ┆ 689 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-07-23 21:24:46.390 ┆ 3.213 ┆ 689 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-07-23 21:25:16.396 ┆ 3.361 ┆ 689 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-07-23 21:25:46.402 ┆ 3.403 ┆ 690 │
The integer column serves to split the dataset into sequential groups of 8 samples for averaging. I would like to perform a groupby on the integer column and get the mean depth and datetime for each integer. It works with median:
df.groupby('coarse_ints').median()
┌─────────────┬─────────────────────────┬───────────┐
│ coarse_ints ┆ time ┆ NAV_DEPTH │
│ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[ms] ┆ f64 │
╞═════════════╪═════════════════════════╪═══════════╡
│ 128 ┆ 2019-07-22 07:58:55.498 ┆ 207.8305 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 672 ┆ 2019-07-23 20:15:29.461 ┆ 3.086 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 328 ┆ 2019-07-22 21:19:08.667 ┆ 694.677 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
But with mean, the datetimes all become null:
df.groupby('coarse_ints').mean()
┌─────────────┬──────────────┬────────────┐
│ coarse_ints ┆ time ┆ NAV_DEPTH │
│ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[ms] ┆ f64 │
╞═════════════╪══════════════╪════════════╡
│ 232 ┆ null ┆ 96.967125 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 216 ┆ null ┆ 156.889 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
groupby_dynamic looked promising, but it needs a regular time interval. I need to average every 8 samples, regardless of the irregular time interval.
If you operate on the underlying integer representation of the datetime and then cast back when done, you can get the mean via a regular groupby (I admit this is slightly non-intuitive ;)
df.with_column(
    pl.col('time').to_physical()
).groupby(
    by=pl.col('coarse_ints'),
    maintain_order=True  # or not :)
).mean().with_column(
    pl.col('time').cast(pl.Datetime('ms'))
)
Note that casting back from the physical/integer representation should use the original time unit (e.g. 'ms', 'us', 'ns') so as to avoid potentially incorrect scaling.
I understand the theory behind tagged pointers and how they are used to store additional data in a pointer.
But I don't understand this part (from the Wikipedia article about tagged pointers):
Most architectures are byte-addressable (the smallest addressable unit is a byte), but certain types of data will often be aligned to the size of the data, often a word or multiple thereof. This discrepancy leaves a few of the least significant bits of the pointer unused
Why does this happen?
Does the pointer have only 30 bits (on 32-bit architectures), with the other 2 bits being the result of alignment?
Why are there 2 bits left unused in the first place?
And does this decrease the size of the addressable space (from 2^32 bytes to 2^30 bytes)?
Consider an architecture that uses 16-bit alignment and also 16-bit pointers (just to avoid having too many binary digits!). A pointer will only ever refer to memory locations that are multiples of 2 bytes (16 bits), but the pointer value is still precise down to the byte. So a pointer that, in binary, is:
0000000000000100
refers to the memory location 4 (the fifth byte in memory):
┌────────────────────┬───────────────────┬─────────────┬──────────────┐
│ Address in Decimal │ Address in Binary │ 8─bit bytes │ 16─bit words │
├────────────────────┼───────────────────┼─────────────┼──────────────┤
│ 0 │ 0000000000000000 │ ┌─────────┐ │ ┌──────────┐ │
│ │ │ └─────────┘ │ │ │ │
│ 1 │ 0000000000000001 │ ┌─────────┐ │ │ │ │
│ │ │ └─────────┘ │ └──────────┘ │
│ 2 │ 0000000000000010 │ ┌─────────┐ │ ┌──────────┐ │
│ │ │ └─────────┘ │ │ │ │
│ 3 │ 0000000000000011 │ ┌─────────┐ │ │ │ │
│ │ │ └─────────┘ │ └──────────┘ │
│ 4 - this one │ 0000000000000100 │ ┌─────────┐ │ ┌──────────┐ │
│ │ │ └─────────┘ │ │ │ │
│ 5 │ 0000000000000101 │ ┌─────────┐ │ │ │ │
│ │ │ └─────────┘ │ └──────────┘ │
│ 6 │ 0000000000000110 │ ┌─────────┐ │ ┌──────────┐ │
│ │ │ └─────────┘ │ │ │ │
│ 7 │ 0000000000000111 │ ┌─────────┐ │ │ │ │
│ │ │ └─────────┘ │ └──────────┘ │
│ ... │ │ │ │
└────────────────────┴───────────────────┴─────────────┴──────────────┘
With 16-bit alignment, there will never be a pointer referring to memory location 5 because it wouldn't be aligned, the next useful value is 6:
0000000000000110
Note that the least significant bit (the one on the far right) is still 0. In fact, for all valid pointer values on that architecture, that bit will be 0. That's what they mean by leaving "...a few of the least significant bits of the pointer unused." In my example it's just one bit, but if you had 32-bit alignment, it would be two bits at the end of the pointer value that would always be zero:
┌────────────────────┬───────────────────┬─────────────┬──────────────┐
│ Address in Decimal │ Address in Binary │ 8─bit bytes │ 32─bit words │
├────────────────────┼───────────────────┼─────────────┼──────────────┤
│ 0 │ 0000000000000000 │ ┌─────────┐ │ ┌──────────┐ │
│ │ │ └─────────┘ │ │ │ │
│ 1 │ 0000000000000001 │ ┌─────────┐ │ │ │ │
│ │ │ └─────────┘ │ │ │ │
│ 2 │ 0000000000000010 │ ┌─────────┐ │ │ │ │
│ │ │ └─────────┘ │ │ │ │
│ 3 │ 0000000000000011 │ ┌─────────┐ │ │ │ │
│ │ │ └─────────┘ │ └──────────┘ │
│ 4 │ 0000000000000100 │ ┌─────────┐ │ ┌──────────┐ │
│ │ │ └─────────┘ │ │ │ │
│ 5 │ 0000000000000101 │ ┌─────────┐ │ │ │ │
│ │ │ └─────────┘ │ │ │ │
│ 6 │ 0000000000000110 │ ┌─────────┐ │ │ │ │
│ │ │ └─────────┘ │ │ │ │
│ 7 │ 0000000000000111 │ ┌─────────┐ │ │ │ │
│ │ │ └─────────┘ │ └──────────┘ │
│ 8 │ 0000000000001000 │ ┌─────────┐ │ ┌──────────┐ │
│ │ │ └─────────┘ │ │ │ │
│ 9 │ 0000000000001001 │ ┌─────────┐ │ │ │ │
│ │ │ └─────────┘ │ │ │ │
│ 10 │ 0000000000001010 │ ┌─────────┐ │ │ │ │
│ │ │ └─────────┘ │ │ │ │
│ 11 │ 0000000000001011 │ ┌─────────┐ │ │ │ │
│ │ │ └─────────┘ │ └──────────┘ │
│ ... │ │ │ │
└────────────────────┴───────────────────┴─────────────┴──────────────┘
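Those always-zero low bits are exactly where a tag goes. Here is a minimal sketch in Julia that uses plain integers as stand-in addresses (illustrative values, not real pointers); it assumes 4-byte (32-bit) alignment, so the two low bits of any valid address are 00 and can carry a 2-bit tag that must be masked off before the address is used:
addr = UInt32(8)                        # a 4-byte-aligned address: low two bits are 00
tag  = UInt32(0b10)                     # a 2-bit tag to smuggle into the pointer
tagged = addr | tag                     # pack: set the otherwise-unused low bits
@assert tagged & ~UInt32(0b11) == addr  # unpack the address: mask the tag off
@assert tagged &  UInt32(0b11) == tag   # unpack the tag
As for the last question: no, this does not shrink the addressable space from 2^32 to 2^30 bytes. Pointers still address individual bytes across the full 2^32-byte space; alignment only guarantees that valid object addresses never have those low bits set, which is what frees them up for tagging, provided you mask them off before dereferencing.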
I want to figure out where the duplicate data that causes this error is, but how?
using DataFrames, TimeSeries, CSV
s = "2019-12-25,3
2020-01-01,6
2019-12-25,9
2020-01-02,10
2020-01-03,11
2020-01-04,12
2020-01-02,13
2020-01-02,14"
df = CSV.read(IOBuffer(s), types=[Date, Int], header=["timestamp", "V"])
ta = TimeArray(df, timestamp=:timestamp)
The error message:
ERROR: ArgumentError: timestamps must be strictly monotonic
Stacktrace:
[1] (::TimeSeries.var"#_#1#2")(::Bool, ::Type{TimeArray{Int64,1,Date,Array{Int64,1}}}, ::Array{Date,1}, ::Array{Int64,1}, ::Array{Symbol,1}, ::DataFrame) at /home/dlin/.julia/packages/TimeSeries/8Z5Is/src/timearray.jl:81
[2] TimeArray at /home/dlin/.julia/packages/TimeSeries/8Z5Is/src/timearray.jl:65 [inlined]
[3] #TimeArray#3 at /home/dlin/.julia/packages/TimeSeries/8Z5Is/src/timearray.jl:89 [inlined]
[4] TimeArray(::Array{Date,1}, ::Array{Int64,1}, ::Array{Symbol,1}, ::DataFrame) at /home/dlin/.julia/packages/TimeSeries/8Z5Is/src/timearray.jl:89
[5] #TimeArray#3(::Symbol, ::Type{TimeArray}, ::DataFrame) at /home/dlin/.julia/packages/TimeSeries/8Z5Is/src/tables.jl:70
[6] (::Core.var"#kw#Type")(::NamedTuple{(:timestamp,),Tuple{Symbol}}, ::Type{TimeArray}, ::DataFrame) at ./none:0
[7] top-level scope at REPL[239]:1
I want to find out which index caused the error, maybe with output similar to:
│ Row │ timestamp │ V │
│ │ Date │ Int64 │
├─────┼────────────┼───────┤
│ 1 │ 2019-12-25 │ 3 │
│ 3 │ 2019-12-25 │ 9 │
Or, even better, find all rows with non-unique values:
│ Row │ timestamp │ V │
│ │ Date │ Int64 │
├─────┼────────────┼───────┤
│ 1 │ 2019-12-25 │ 3 │
│ 3 │ 2019-12-25 │ 9 │
│ 4 │ 2020-01-02 │ 10 │
│ 7 │ 2020-01-02 │ 13 │
│ 8 │ 2020-01-02 │ 14 │
Remove duplicates and then pass the DataFrame to TimeArray:
julia> TimeArray(aggregate(df, :timestamp, minimum, sort=true), timestamp=:timestamp)
2×1 TimeArray{Int64,1,Date,Array{Int64,1}} 2019-12-25 to 2020-01-01
│ │ V_minimum │
├────────────┼───────────┤
│ 2019-12-25 │ 3 │
│ 2020-01-01 │ 6 │
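In newer releases of DataFrames.jl the aggregate function has been removed; here is a sketch of the same deduplicate-then-convert idea using groupby/combine (assuming DataFrames.jl 0.21+ and the df from the question):
using DataFrames, TimeSeries

gdf = combine(groupby(df, :timestamp), :V => minimum => :V_minimum)  # one row per timestamp
ta = TimeArray(sort(gdf, :timestamp), timestamp=:timestamp)          # sorted, so strictly monotonic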
If you have a DataFrame and just want to identify duplicate date values, use the nonunique function.
julia> nonunique(df,:timestamp)
3-element Array{Bool,1}:
0
0
1
If you want to keep just one row per date:
julia> unique(df,:timestamp)
2×2 DataFrame
│ Row │ timestamp │ V │
│ │ Date │ Int64 │
├─────┼────────────┼───────┤
│ 1 │ 2019-12-25 │ 3 │
│ 2 │ 2020-01-01 │ 6 │
Based on @Przemyslaw Szufel's answer, I figured out a way to find the content, but it is still not perfect: it can't show the original row indices, and it only shows the first non-unique value.
julia> v=nonunique(df,1)
8-element Array{Bool,1}:
0
0
1
0
0
0
1
1
julia> f=findfirst(v)
3
julia> df[df.Column1 .== df.Column1[f],:]
2×2 DataFrame
│ Row │ Column1 │ Column2 │
│ │ Date │ Int64 │
├─────┼────────────┼─────────┤
│ 1 │ 2019-12-25 │ 3 │
│ 2 │ 2019-12-25 │ 9 │
BTW, after checking the source code of timearray.jl, I found that the check behind the "ArgumentError: timestamps must be strictly monotonic" message requires the timestamps to be not only duplicate-free but also sorted.
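For completeness, here is a sketch that lists every row whose timestamp occurs more than once while keeping the original row indices (it assumes a DataFrames.jl version where transform accepts a GroupedDataFrame, i.e. 0.21+, and the timestamp/V column names from the question):
using DataFrames

df.row = collect(1:nrow(df))                              # remember the original row indices
counted = transform(groupby(df, :timestamp), nrow => :n)  # count rows per timestamp
dups = counted[counted.n .> 1, [:row, :timestamp, :V]]    # every row sharing a timestamp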
Suppose I have the following DataFrame in Julia, named A:
│ Row │ x1 │ x2 │
├──────┼─────┼─────────┤
│ 1 │ 1.0 │ 5.78341 │
│ 2 │ 2.0 │ 5.05401 │
│ 3 │ 3.0 │ 4.79754 │
│ 4 │ 4.0 │ 4.4126 │
│ 5 │ 5.0 │ 4.29433 │
│ 6 │ 6.0 │ 4.14306 │
│ 7 │ 1.0 │ 5.94811 │
│ 8 │ 2.0 │ 5.0432 │
│ 9 │ 3.0 │ 4.78697 │
│ 10 │ 4.0 │ 4.40384 │
│ 11 │ 5.0 │ 4.29901 │
⋮
│ 3933 │ 2.0 │ 4.90528 │
│ 3934 │ 3.0 │ 4.57429 │
│ 3935 │ 4.0 │ 4.3988 │
│ 3936 │ 5.0 │ 4.19076 │
│ 3937 │ 6.0 │ 4.09517 │
│ 3938 │ 7.0 │ 3.96192 │
│ 3939 │ 1.0 │ 5.88878 │
│ 3940 │ 2.0 │ 5.87492 │
│ 3941 │ 3.0 │ 4.9453 │
│ 3942 │ 4.0 │ 4.39047 │
│ 3943 │ 5.0 │ 4.28096 │
│ 3944 │ 6.0 │ 4.13686 │
I want to calculate the mean of the x2 values grouped by x1, but only if the number of repetitions of the x1 value is less than or equal to 500, for example. I tried the following code, but it didn't work:
aggregate(A,length(:x1).<=500,mean)
If, for example, only the values 1, 2 and 3 meet the condition, the result should be:
│ Row │ x1 │ x2 │
├──────┼─────┼─────────┤
│ 1 │ 1.0 │ 5.85264 │
│ 2 │ 2.0 │ 5.15852 │
│ 3 │ 3.0 │ 4.92586 │
where the x2 values are the corresponding mean values.
Any suggestions?
I would use DataFramesMeta.jl here, as it will be cleaner than using only DataFrames.jl functionality (I give two ways to obtain the desired result as examples):
using DataFramesMeta
using Statistics  # for mean on Julia 1.0+

# I generate a smaller DataFrame with a cutoff of 15 for the example
df = DataFrame(x1=repeat([1, 1, 2, 2, 3], inner=10), x2=rand(50))

# first way to do it
@linq df |>
    groupby(:x1) |>
    where(length(:x1) > 15) |>
    @based_on(x2 = mean(:x2))

# other way to do the same
@linq df |>
    by(:x1, x2 = mean(:x2), n = length(:x2)) |>
    where(:n .> 15) |>
    select(:x1, :x2)
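On current DataFrames.jl versions (where by and the @linq verbs above have been superseded), a minimal sketch of the same filter-by-group-size idea:
using DataFrames, Statistics

res = combine(groupby(df, :x1), :x2 => mean => :x2, nrow => :n)  # mean and group size per x1
res = res[res.n .> 15, [:x1, :x2]]                               # keep groups with more than 15 rows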
I have loaded data from a .csv file in Julia.
I wish to convert my Date column from String to the Date type:
julia> head(df)
6×7 DataFrames.DataFrame
│ Row │ Date │ Open │ High │ Low │ Close │ Adj_Close │ Volume │
├─────┼────────────┼─────────┼─────────┼─────────┼─────────┼───────────┼─────────┤
│ 1 │ 1993-01-29 │ 43.9687 │ 43.9687 │ 43.75 │ 43.9375 │ 27.6073 │ 1003200 │
│ 2 │ 1993-02-01 │ 43.9687 │ 44.25 │ 43.9687 │ 44.25 │ 27.8036 │ 480500 │
│ 3 │ 1993-02-02 │ 44.2187 │ 44.375 │ 44.125 │ 44.3437 │ 27.8625 │ 201300 │
│ 4 │ 1993-02-03 │ 44.4062 │ 44.8437 │ 44.375 │ 44.8125 │ 28.1571 │ 529400 │
│ 5 │ 1993-02-04 │ 44.9687 │ 45.0937 │ 44.4687 │ 45.0 │ 28.2749 │ 531500 │
│ 6 │ 1993-02-05 │ 44.9687 │ 45.0625 │ 44.7187 │ 44.9687 │ 28.2552 │ 492100 │
The type is:
julia> showcols(df)
6258×7 DataFrames.DataFrame
│ Col # │ Name │ Eltype │ Missing │ Values │
├───────┼───────────┼──────────────────────────────────┼─────────┼───────────────────────────┤
│ 1 │ Date │ Union{Missings.Missing, String} │ 0 │ 1993-01-29 … 2017-12-01 │
│ 2 │ Open │ Union{Float64, Missings.Missing} │ 0 │ 43.9687 … 264.76 │
│ 3 │ High │ Union{Float64, Missings.Missing} │ 0 │ 43.9687 … 265.31 │
│ 4 │ Low │ Union{Float64, Missings.Missing} │ 0 │ 43.75 … 260.76 │
│ 5 │ Close │ Union{Float64, Missings.Missing} │ 0 │ 43.9375 … 264.46 │
│ 6 │ Adj_Close │ Union{Float64, Missings.Missing} │ 0 │ 27.6073 … 264.46 │
│ 7 │ Volume │ Union{Int64, Missings.Missing} │ 0 │ 1003200 … 159947700 │
Right now the Date column is a String, so I wish to convert it to the Date type.
Trying:
df[:Date, DateFormat("yyyy-mm-dd")]
and
df[df[:Date] = DateFormat("yyyy-mm-dd")]
with error:
MethodError: Cannot convert an object of type DateFormat{Symbol("yyyy-mm-dd"),Tuple{Base.Dates.DatePart{'y'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'m'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'d'}}} to an object of type String
This may have arisen from a call to the constructor String(...),
since type constructors fall back to convert methods.
in setindex! at DataFrames\src\dataframe\dataframe.jl:376
in fill! at base\multidimensional.jl:841
In case my syntax is wrong, I make a vector x from the date column:
x = df[:Date]
Date(x, "yyyy-mm-dd")
MethodError: Cannot convert an object of type Array{Union{Missings.Missing, String},1} to an object of type Int64
This is easy with R, but I cannot find that much good information for Julia; any assistance appreciated.
I am also following this link:
https://docs.julialang.org/en/release-0.4/manual/dates/
Here is an example:
julia> df = Dates.DateFormat("y-m-d");
julia> dt = Date("2015-01-01",df)
2015-01-01
julia> dt2 = Date("2015-01-02",df)
2015-01-02
Why can't I pass a vector or data frame column through this?
Update:
This works when I pass one element from the vector:
julia> Date(x[1], Dates.DateFormat("yyyy-mm-dd"))
1993-01-29
I just want to convert every element to this format and store the result in the data frame.
Simply write Date.(x, Dates.DateFormat("yyyy-mm-dd")) to get what you want.
Notice the . after Date: it tells Julia to apply the Date function to all elements of x, and Dates.DateFormat("yyyy-mm-dd") will be reused in every call because it is a scalar.
The details are explained here https://docs.julialang.org/en/latest/base/arrays/#Broadcast-and-vectorization-1.
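A tiny illustration of that broadcasting rule (with made-up date strings):
using Dates

x = ["1993-01-29", "1993-02-01"]
fmt = Dates.DateFormat("yyyy-mm-dd")
Date.(x, fmt)  # 2-element Vector{Date}; the scalar fmt is reused for every element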
As a side note, if you use the latest version of the CSV.jl package, it should detect the Date type automatically:
julia> data="""Date,Open,High,Low,Close,Adj_Close,Volume
1993-01-29,43.9687,43.9687,43.75,43.9375,27.6073,1003200
1993-02-01,43.9687,44.25,43.9687,44.25,27.8036,480500
1993-02-02,44.2187,44.375,44.125 ,44.3437,27.8625,201300"""
"Date,Open,High,Low,Close,Adj_Close,Volume\n1993-01-29,43.9687,43.9687,43.75,43.9375,27.6073,1003200\n1993-02-01,43.9687,44.25,43.9687,44.25,27.8036,480500\n1993-02-02,44.2187,44.375,44.125 ,44.3437,27.8625,201300"
julia> showcols(CSV.read(IOBuffer(data)))
3×7 DataFrames.DataFrame
│ Col # │ Name │ Eltype │ Missing │ Values │
├───────┼───────────┼──────────────────────────────────┼─────────┼───────────────────────────┤
│ 1 │ Date │ Union{Date, Missings.Missing} │ 0 │ 1993-01-29 … 1993-02-02 │
│ 2 │ Open │ Union{Float64, Missings.Missing} │ 0 │ 43.9687 … 44.2187 │
│ 3 │ High │ Union{Float64, Missings.Missing} │ 0 │ 43.9687 … 44.375 │
│ 4 │ Low │ Union{Float64, Missings.Missing} │ 0 │ 43.75 … 44.125 │
│ 5 │ Close │ Union{Float64, Missings.Missing} │ 0 │ 43.9375 … 44.3437 │
│ 6 │ Adj_Close │ Union{Float64, Missings.Missing} │ 0 │ 27.6073 … 27.8625 │
│ 7 │ Volume │ Union{Int64, Missings.Missing} │ 0 │ 1003200 … 201300 │
And even if it did not, you can pass the types argument (in the example below it avoids a union with Missing, if you do not want that for some reason):
julia> showcols(CSV.read(IOBuffer(data), types=[String; fill(Float64, 5); Int]))
3×7 DataFrames.DataFrame
│ Col # │ Name │ Eltype │ Missing │ Values │
├───────┼───────────┼─────────┼─────────┼───────────────────────────┤
│ 1 │ Date │ String │ 0 │ 1993-01-29 … 1993-02-02 │
│ 2 │ Open │ Float64 │ 0 │ 43.9687 … 44.2187 │
│ 3 │ High │ Float64 │ 0 │ 43.9687 … 44.375 │
│ 4 │ Low │ Float64 │ 0 │ 43.75 … 44.125 │
│ 5 │ Close │ Float64 │ 0 │ 43.9375 … 44.3437 │
│ 6 │ Adj_Close │ Float64 │ 0 │ 27.6073 … 27.8625 │
│ 7 │ Volume │ Int64 │ 0 │ 1003200 … 201300 │
EDIT: Under DataFrames.jl version 0.14 or later use describe instead of showcols.
Here is what I came up with:
# Pull the date column and store it in a vector
x = df[:Date]
# Loop through each element of the vector, converting to Date
v = Date[]  # a typed vector rather than an untyped Any[]
for i in 1:length(x)
    z = Date(x[i], Dates.DateFormat("yyyy-mm-dd"))
    push!(v, z)
end
# Check format
julia> v[1] - v[3]
-4 days
# hcat() is the Julia equivalent of R's cbind(): column-bind v onto the existing data frame
df = hcat(df, v)
With the output:
julia> head(df)
6×8 DataFrames.DataFrame
│ Row │ Date │ Open │ High │ Low │ Close │ Adj_Close │ Volume │ x1 │
├─────┼────────────┼─────────┼─────────┼─────────┼─────────┼───────────┼─────────┼────────────┤
│ 1 │ 1993-01-29 │ 43.9687 │ 43.9687 │ 43.75 │ 43.9375 │ 27.6073 │ 1003200 │ 1993-01-29 │
│ 2 │ 1993-02-01 │ 43.9687 │ 44.25 │ 43.9687 │ 44.25 │ 27.8036 │ 480500 │ 1993-02-01 │
│ 3 │ 1993-02-02 │ 44.2187 │ 44.375 │ 44.125 │ 44.3437 │ 27.8625 │ 201300 │ 1993-02-02 │
│ 4 │ 1993-02-03 │ 44.4062 │ 44.8437 │ 44.375 │ 44.8125 │ 28.1571 │ 529400 │ 1993-02-03 │
│ 5 │ 1993-02-04 │ 44.9687 │ 45.0937 │ 44.4687 │ 45.0 │ 28.2749 │ 531500 │ 1993-02-04 │
│ 6 │ 1993-02-05 │ 44.9687 │ 45.0625 │ 44.7187 │ 44.9687 │ 28.2552 │ 492100 │ 1993-02-05 │
I have imported a DataFrame as below:
julia> df
100×3 DataFrames.DataFrame
│ Row │ ex1 │ ex2 │ admit │
├─────┼─────────┼─────────┼───────┤
│ 1 │ 34.6237 │ 78.0247 │ 0 │
│ 2 │ 30.2867 │ 43.895 │ 0 │
│ 3 │ 35.8474 │ 72.9022 │ 0 │
│ 4 │ 60.1826 │ 86.3086 │ 1 │
│ 5 │ 79.0327 │ 75.3444 │ 1 │
│ 6 │ 45.0833 │ 56.3164 │ 0 │
│ 7 │ 61.1067 │ 96.5114 │ 1 │
│ 8 │ 75.0247 │ 46.554 │ 1 │
⋮
│ 92 │ 90.4486 │ 87.5088 │ 1 │
│ 93 │ 55.4822 │ 35.5707 │ 0 │
│ 94 │ 74.4927 │ 84.8451 │ 1 │
│ 95 │ 89.8458 │ 45.3583 │ 1 │
│ 96 │ 83.4892 │ 48.3803 │ 1 │
│ 97 │ 42.2617 │ 87.1039 │ 1 │
│ 98 │ 99.315 │ 68.7754 │ 1 │
│ 99 │ 55.34 │ 64.9319 │ 1 │
│ 100 │ 74.7759 │ 89.5298 │ 1 │
I want to plot this DataFrame using ex1 as the x-axis and ex2 as the y-axis. In addition, the data is categorized by the third column, :admit, so I want to give the dots different colors based on the :admit value.
I used Scale.color_discrete_manual to set up the colors, and I tried to use Guide.manual_color_key to change the color key legend. However, it turns out Gadfly made two color keys.
p = plot(df, x = :ex1, y = :ex2, color = :admit,
         Scale.color_discrete_manual(colorant"deep sky blue",
                                     colorant"light pink"),
         Guide.manual_color_key("Legend", ["Failure", "Success"],
                                ["deep sky blue", "light pink"]))
My question is how to change the color key legend when using Scale.color_discrete_manual?
One related question is Remove automatically generated color key in Gadfly plot, where the best answer suggests using two layers plus Guide.manual_color_key. Is there a better solution when using a DataFrame with Scale.color_discrete_manual?
Based on the discussion, it currently looks like users cannot customize the color legend generated by color or Scale.color_discrete_manual.
In the same source, Mattriks suggested using an extra column as a "label". Although this is not a "natural" way to change the color key, it works pretty well.
Therefore, for the same dataset as in the problem, we add one more column:
df[:admission] = map(df[:admit]) do x
    if x == 1
        return "Success"
    else
        return "Failure"
    end
end
julia> df
100×4 DataFrames.DataFrame
│ Row │ exam1 │ exam2 │ admit │ admission │
├─────┼─────────┼─────────┼───────┼───────────┤
│ 1 │ 34.6237 │ 78.0247 │ 0 │ "Failure" │
│ 2 │ 30.2867 │ 43.895 │ 0 │ "Failure" │
│ 3 │ 35.8474 │ 72.9022 │ 0 │ "Failure" │
│ 4 │ 60.1826 │ 86.3086 │ 1 │ "Success" │
│ 5 │ 79.0327 │ 75.3444 │ 1 │ "Success" │
│ 6 │ 45.0833 │ 56.3164 │ 0 │ "Failure" │
│ 7 │ 61.1067 │ 96.5114 │ 1 │ "Success" │
│ 8 │ 75.0247 │ 46.554 │ 1 │ "Success" │
⋮
│ 92 │ 90.4486 │ 87.5088 │ 1 │ "Success" │
│ 93 │ 55.4822 │ 35.5707 │ 0 │ "Failure" │
│ 94 │ 74.4927 │ 84.8451 │ 1 │ "Success" │
│ 95 │ 89.8458 │ 45.3583 │ 1 │ "Success" │
│ 96 │ 83.4892 │ 48.3803 │ 1 │ "Success" │
│ 97 │ 42.2617 │ 87.1039 │ 1 │ "Success" │
│ 98 │ 99.315 │ 68.7754 │ 1 │ "Success" │
│ 99 │ 55.34 │ 64.9319 │ 1 │ "Success" │
│ 100 │ 74.7759 │ 89.5298 │ 1 │ "Success" │
Then color the data using this new column with Scale.color_discrete_manual:
plot(df, x = :exam1, y = :exam2, color = :admission,
     Scale.color_discrete_manual(colorant"deep sky blue",
                                 colorant"light pink"))