I'm trying to see how the sizes of the variables I'm working with increase over each iteration of a loop. I'm not sure which variables are growing, so I would like to capture all of them. To do so I would like to use varinfo(), but since it outputs a Markdown table I'm not able to access the values. Is there a way either to convert the Markdown table to a more usable format, or else to save the sizes of the variables in the environment in general?
Ideally, the output would be a data frame with one row or column per loop iteration, holding the size of each variable in that iteration.
I really like your idea for debugging purposes :-)
This code is inspired by: https://github.com/JuliaLang/julia/blob/master/stdlib/InteractiveUtils/src/InteractiveUtils.jl
using DataFrames
function debug_list_vals(m::Module=Main)
    res = DataFrame()
    # names defined in the module, sorted alphabetically
    vs = [v for v in sort!(names(m)) if isdefined(m, v)]
    for v in vs
        value = getfield(m, v)
        # skip the standard modules and this function itself
        if !(value===Base || value===Main || value===Core ||
             value===InteractiveUtils || value===debug_list_vals)
            append!(res, DataFrame(v=v,size=Base.summarysize(value),
                                   summary=summary(value)))
        end
    end
    res
end
Now let's give it a spin:
julia> for i in 1:3
           push!(some_array, i)
           println("i=$i ::", debug_list_vals())
       end
i=1 ::2×3 DataFrame
│ Row │ v │ size │ summary │
│ │ Symbol │ Int64 │ String │
├─────┼────────────┼───────┼──────────────────────────┤
│ 1 │ ans │ 48 │ 1-element Array{Int64,1} │
│ 2 │ some_array │ 48 │ 1-element Array{Int64,1} │
i=2 ::2×3 DataFrame
│ Row │ v │ size │ summary │
│ │ Symbol │ Int64 │ String │
├─────┼────────────┼───────┼──────────────────────────┤
│ 1 │ ans │ 56 │ 2-element Array{Int64,1} │
│ 2 │ some_array │ 56 │ 2-element Array{Int64,1} │
i=3 ::2×3 DataFrame
│ Row │ v │ size │ summary │
│ │ Symbol │ Int64 │ String │
├─────┼────────────┼───────┼──────────────────────────┤
│ 1 │ ans │ 64 │ 3-element Array{Int64,1} │
│ 2 │ some_array │ 64 │ 3-element Array{Int64,1} │
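To get closer to the one-entry-per-iteration output the question asks for, the snapshots can be collected instead of printed. This is only a minimal sketch using Base: `snapshot_sizes` and `history` are hypothetical names that do not appear in the answer above, and the sketch skips all modules and functions rather than the explicit list used in debug_list_vals.

```julia
# Hypothetical helper: map each global name in module `m` to its size in bytes.
function snapshot_sizes(m::Module=Main)
    sizes = Dict{Symbol,Int}()
    for v in names(m; all=true)
        startswith(String(v), "#") && continue   # skip compiler-generated names
        isdefined(m, v) || continue
        value = getfield(m, v)
        (value isa Module || value isa Function) && continue
        sizes[v] = Base.summarysize(value)
    end
    return sizes
end

some_array = Int[]
history = Dict{Symbol,Int}[]     # one snapshot per loop iteration

for i in 1:3
    push!(some_array, i)
    push!(history, snapshot_sizes(@__MODULE__))
end

# Print how one variable evolved across iterations.
for (i, snap) in enumerate(history)
    println("i=$i some_array => ", snap[:some_array], " bytes")
end
```

Note that `history` is itself a global, so it shows up in every snapshot and grows with each iteration.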
I am able to read a JSON file and convert it into a DataFrame using the code below.
df = open(jsontable, "normal.json") |> DataFrame
normal.json looks like below,
{"col1":["thasin", "hello", "world"],"col2":[1,2,3],"col3":["abc", "def", "ghi"]}
So final df has,
3×3 DataFrame
│ Row │ col1 │ col2 │ col3 │
│ │ String │ Int64 │ String │
├─────┼────────┼───────┼────────┤
│ 1 │ thasin │ 1 │ abc │
│ 2 │ hello │ 2 │ def │
│ 3 │ world │ 3 │ ghi │
But the same code does not work for a record-formatted JSON file,
where the format is a list like {column -> value}, … , {column -> value}.
My sample JSON:
{"billing_account_id":"0139A","credits":[],"invoice":{"month":"202003"},"cost_type":"regular"}
{"billing_account_id":"0139A","credits":[1.45],"invoice":{"month":"202003"},"cost_type":"regular"}
{"billing_account_id":"0139A","credits":[2.00, 3.56],"invoice":{"month":"202003"},"cost_type":"regular"}
Expected output:
billing_account_id cost_type credits invoice
0 0139A regular [] {'month': '202003'}
1 0139A regular [1.45] {'month': '202003'}
2 0139A regular [2.0, 3.56] {'month': '202003'}
This can be done in python like below,
import json
import pandas as pd

data = []
for line in open("sample.json", 'r'):
    data.append(json.loads(line))
print(data)
df = pd.DataFrame(data)
How to do this in Julia?
Note that your file is not a valid JSON (its lines are valid JSON, not the whole file).
You can do this like this:
julia> using DataFrames, JSON3
julia> df = JSON3.read.(eachline("sample.json")) |> DataFrame;
julia> df.credits = Vector{Float64}.(df.credits);
julia> df.invoice = Dict{Symbol,String}.(df.invoice);
julia> df
3×4 DataFrame
│ Row │ billing_account_id │ credits │ invoice │ cost_type │
│ │ String │ Array{Float64,1} │ Dict{Symbol,String} │ String │
├─────┼────────────────────┼────────────────────────────┼────────────────────────┼───────────┤
│ 1 │ 0139A │ 0-element Array{Float64,1} │ Dict(:month=>"202003") │ regular │
│ 2 │ 0139A │ [1.45] │ Dict(:month=>"202003") │ regular │
│ 3 │ 0139A │ [2.0, 3.56] │ Dict(:month=>"202003") │ regular │
The transformations on the :credits and :invoice columns give them types that are easy to work with (otherwise they keep the types that JSON3.jl defines internally).
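The line-by-line parsing and the two conversions can also be tried without a file. The following is a small sketch that reads the records from an in-memory string and uses only the calls already shown in this answer; the names `lines`, `rows`, `credits` and `invoices` are illustrative.

```julia
using JSON3

# Two records in the same line-delimited layout as sample.json.
lines = """
{"billing_account_id":"0139A","credits":[],"invoice":{"month":"202003"}}
{"billing_account_id":"0139A","credits":[1.45],"invoice":{"month":"202003"}}
"""

# Each line is a valid JSON document on its own, so parse them one by one.
rows = JSON3.read.(eachline(IOBuffer(lines)))

# Convert the JSON3 internal array and object types to plain Julia containers.
credits  = [Vector{Float64}(r.credits) for r in rows]
invoices = [Dict{Symbol,String}(r.invoice) for r in rows]
```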
A more advanced option is to do it in one shot by specifying the row schema using a NamedTuple type e.g.:
julia> df = JSON3.read.(eachline("sample.json"),
NamedTuple{(:billing_account_id, :credits, :invoice, :cost_type),Tuple{String,Vector{Float64},Dict{String,String},String}}) |>
DataFrame
3×4 DataFrame
│ Row │ billing_account_id │ credits │ invoice │ cost_type │
│ │ String │ Array{Float64,1} │ Dict{String,String} │ String │
├─────┼────────────────────┼────────────────────────────┼─────────────────────────┼───────────┤
│ 1 │ 0139A │ 0-element Array{Float64,1} │ Dict("month"=>"202003") │ regular │
│ 2 │ 0139A │ [1.45] │ Dict("month"=>"202003") │ regular │
│ 3 │ 0139A │ [2.0, 3.56] │ Dict("month"=>"202003") │ regular │
Unrelated to the julia answer, but in python you can do
pd.read_json("sample.json", lines=True)
I have a csv file which looks like below,
20×2 DataFrame
│ Row │ Id │ Date │
│ │ Int64 │ String │
├─────┼───────┼────────────┤
│ 1 │ 1 │ 01-01-2010 │
│ 2 │ 2 │ 02-01-2010 │
│ 3 │ 3 │ 03-01-2010 │
│ 4 │ 4 │ 04-01-2010 │
│ 5 │ 5 │ 05-01-2010 │
│ 6 │ 6 │ 06-01-2010 │
│ 7 │ 7 │ 07-01-2010 │
│ 8 │ 8 │ 08-01-2010 │
│ 9 │ 9 │ 09-01-2010 │
│ 10 │ 10 │ 10-01-2010 │
│ 11 │ 11 │ 11-01-2010 │
│ 12 │ 12 │ 12-01-2010 │
│ 13 │ 13 │ 13-01-2010 │
│ 14 │ 14 │ 14-01-2010 │
│ 15 │ 15 │ 15-01-2010 │
│ 16 │ 16 │ 16-01-2010 │
│ 17 │ 17 │ 17-01-2010 │
│ 18 │ 18 │ 18-01-2010 │
│ 19 │ 19 │ 19-01-2010 │
│ 20 │ 20 │ 20-01-2010 │
After reading the CSV file, the Date column has type String. The DataFrames documentation doesn't say anything about time series.
How can I convert a series or vector of strings into a DateTime format after loading?
Is there any way to specify the date columns while reading a CSV file?
When reading in a CSV file you can specify the dateformat keyword argument in CSV.jl:
CSV.File("your_file_name.csv", dateformat="dd-mm-yyyy") |> DataFrame
On the other hand if your data frame is called df then to convert String to Date in your case use:
using Dates
df.Date = Date.(df.Date, "dd-mm-yyyy")
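The conversion itself needs only the Dates standard library, so it can be checked on a plain vector before touching the data frame. A minimal sketch, where `raw`, `fmt`, `dates` and `maybe` are illustrative names:

```julia
using Dates

raw = ["01-01-2010", "02-01-2010", "20-01-2010"]
fmt = dateformat"dd-mm-yyyy"

# Broadcast the Date constructor over the strings; this throws on a malformed entry.
dates = Date.(raw, fmt)

# tryparse returns `nothing` instead of throwing, which helps locate bad rows.
maybe = tryparse(Date, "not a date", fmt)
```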
Here is how I have done it:
First a helper function to convert different string formats.
using CSV, Dates, InlineStrings, Pipe, TimeSeries
parse_date(d::AbstractString) = DateTime(d, dateformat"yyyy-mm-dd HH:MM:SS")
parse_date(v::Vector{AbstractString}) = parse_date.(v)
parse_date(v::Vector{String}) = parse_date.(v)
parse_date(v::Vector{String31}) = parse_date(String.(v))
prices = @pipe CSV.File(filename; header = 1, delim = ",") |>
TimeArray(_; timestamp = :Date, timeparser = parse_date)
I want to figure out where the duplicate data that causes this error is, but how?
using DataFrames, TimeSeries, CSV
s = "2019-12-25,3
2020-01-01,6
2019-12-25,9
2020-01-02,10
2020-01-03,11
2020-01-04,12
2020-01-02,13
2020-01-02,14"
df=CSV.read(IOBuffer(s), types=[Date,Int], header=["timestamp","V"])
ta = TimeArray(df, timestamp=:timestamp)
error message
ERROR: ArgumentError: timestamps must be strictly monotonic
Stacktrace:
[1] (::TimeSeries.var"#_#1#2")(::Bool, ::Type{TimeArray{Int64,1,Date,Array{Int64,1}}}, ::Array{Date,1}, ::Array{Int64,1}, ::Array{Symbol,1}, ::DataFrame) at /home/dlin/.julia/packages/TimeSeries/8Z5Is/src/timearray.jl:81
[2] TimeArray at /home/dlin/.julia/packages/TimeSeries/8Z5Is/src/timearray.jl:65 [inlined]
[3] #TimeArray#3 at /home/dlin/.julia/packages/TimeSeries/8Z5Is/src/timearray.jl:89 [inlined]
[4] TimeArray(::Array{Date,1}, ::Array{Int64,1}, ::Array{Symbol,1}, ::DataFrame) at /home/dlin/.julia/packages/TimeSeries/8Z5Is/src/timearray.jl:89
[5] #TimeArray#3(::Symbol, ::Type{TimeArray}, ::DataFrame) at /home/dlin/.julia/packages/TimeSeries/8Z5Is/src/tables.jl:70
[6] (::Core.var"#kw#Type")(::NamedTuple{(:timestamp,),Tuple{Symbol}}, ::Type{TimeArray}, ::DataFrame) at ./none:0
[7] top-level scope at REPL[239]:1
I want to find out which index caused the error, maybe something similar to
│ Row │ timestamp │ V │
│ │ Date │ Int64 │
├─────┼────────────┼───────┤
│ 1 │ 2019-12-25 │ 3 │
│ 3 │ 2019-12-25 │ 9 │
Or, even better, find all rows with non-unique values:
│ Row │ timestamp │ V │
│ │ Date │ Int64 │
├─────┼────────────┼───────┤
│ 1 │ 2019-12-25 │ 3 │
│ 3 │ 2019-12-25 │ 9 │
│ 4 │ 2020-01-02 │ 10 │
│ 7 │ 2020-01-02 │ 13 │
│ 8 │ 2020-01-02 │ 14 │
Remove the duplicates and then pass the DataFrame to TimeArray:
julia> TimeArray(aggregate(df, :timestamp, minimum, sort=true), timestamp=:timestamp)
2×1 TimeArray{Int64,1,Date,Array{Int64,1}} 2019-12-25 to 2020-01-01
│ │ V_minimum │
├────────────┼───────────┤
│ 2019-12-25 │ 3 │
│ 2020-01-01 │ 6 │
If you have a DataFrame and just want to identify duplicate date values use the nonunique function.
julia> nonunique(df,:timestamp)
3-element Array{Bool,1}:
0
0
1
If you want just the rows unique to the date:
julia> unique(df,:timestamp)
2×2 DataFrame
│ Row │ timestamp │ V │
│ │ Date │ Int64 │
├─────┼────────────┼───────┤
│ 1 │ 2019-12-25 │ 3 │
│ 2 │ 2020-01-01 │ 6 │
Based on @Przemyslaw Szufel's answer, I figured out a way to find the content, but it is still not perfect: it can't show the original row index and it only shows the first non-unique content.
julia> v=nonunique(df,1)
8-element Array{Bool,1}:
0
0
1
0
0
0
1
1
julia> f=findfirst(v)
3
julia> df[df.Column1 .== df.Column1[f],:]
2×2 DataFrame
│ Row │ Column1 │ Column2 │
│ │ Date │ Int64 │
├─────┼────────────┼─────────┤
│ 1 │ 2019-12-25 │ 3 │
│ 2 │ 2019-12-25 │ 9 │
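One way to keep the original row indices and to catch every repeated value is to count the occurrences first and then select the rows whose timestamp occurs more than once. This is a sketch in Base Julia on the timestamp column from the question; `counts` and `dup_rows` are illustrative names.

```julia
using Dates

# The timestamp column from the question, in its original row order.
timestamps = Date.(["2019-12-25", "2020-01-01", "2019-12-25", "2020-01-02",
                    "2020-01-03", "2020-01-04", "2020-01-02", "2020-01-02"])

# Count how often each timestamp occurs.
counts = Dict{Date,Int}()
for t in timestamps
    counts[t] = get(counts, t, 0) + 1
end

# Original row indices of every timestamp that occurs more than once.
dup_rows = findall(t -> counts[t] > 1, timestamps)
```

With a data frame, `df[dup_rows, :]` would then select those rows.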
BTW, after checking the source code of timearray.jl, I found that "ArgumentError: timestamps must be strictly monotonic" is raised not only for duplicate timestamps but also when the timestamps are not sorted.
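Since both duplicates and out-of-order values break the requirement, a quick check is to look for the first position where a timestamp is not strictly later than the one before it. The following is a sketch in Base Julia, not the check that timearray.jl itself performs; `bad` and `cleaned` are illustrative names.

```julia
using Dates

ts = Date.(["2019-12-25", "2020-01-01", "2019-12-25", "2020-01-02",
            "2020-01-03", "2020-01-04", "2020-01-02", "2020-01-02"])

# First index i for which ts[i + 1] is not strictly later than ts[i];
# `nothing` would mean the vector is already strictly increasing.
bad = findfirst(i -> ts[i + 1] <= ts[i], 1:length(ts) - 1)

# Dropping duplicates and sorting gives a strictly increasing vector.
cleaned = sort(unique(ts))
```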
I have imported a DataFrame as below:
julia> df
100×3 DataFrames.DataFrame
│ Row │ ex1 │ ex2 │ admit │
├─────┼─────────┼─────────┼───────┤
│ 1 │ 34.6237 │ 78.0247 │ 0 │
│ 2 │ 30.2867 │ 43.895 │ 0 │
│ 3 │ 35.8474 │ 72.9022 │ 0 │
│ 4 │ 60.1826 │ 86.3086 │ 1 │
│ 5 │ 79.0327 │ 75.3444 │ 1 │
│ 6 │ 45.0833 │ 56.3164 │ 0 │
│ 7 │ 61.1067 │ 96.5114 │ 1 │
│ 8 │ 75.0247 │ 46.554 │ 1 │
⋮
│ 92 │ 90.4486 │ 87.5088 │ 1 │
│ 93 │ 55.4822 │ 35.5707 │ 0 │
│ 94 │ 74.4927 │ 84.8451 │ 1 │
│ 95 │ 89.8458 │ 45.3583 │ 1 │
│ 96 │ 83.4892 │ 48.3803 │ 1 │
│ 97 │ 42.2617 │ 87.1039 │ 1 │
│ 98 │ 99.315 │ 68.7754 │ 1 │
│ 99 │ 55.34 │ 64.9319 │ 1 │
│ 100 │ 74.7759 │ 89.5298 │ 1 │
I want to plot this DataFrame using ex1 as x-axis, ex2 as y-axis. In addition, the data is categorized by the third column :admit, so I want to give dots different colors based on the :admit value.
I used Scale.color_discrete_manual to set up colors, and I tried to use Guide.manual_color_key to change the color key legend. However it turns out Gadfly made two color keys.
p = plot(df, x = :ex1, y = :ex2, color=:admit,
Scale.color_discrete_manual(colorant"deep sky blue",
colorant"light pink"),
Guide.manual_color_key("Legend", ["Failure", "Success"],
["deep sky blue", "light pink"]))
My question is how to change the color key legend when using Scale.color_discrete_manual?
One related question is "Remove automatically generated color key in Gadfly plot", where the best answer suggests using two layers plus Guide.manual_color_key. Is there a better solution for using a DataFrame and Scale.color_discrete_manual?
Based on the discussion, it currently looks like users cannot customize the color legend generated by color or Scale.color_discrete_manual.
From the same source, Mattriks suggested using an extra column as the "label". Although this is not a "natural" way to change the color key, it works pretty well.
Therefore, for the same dataset as in the question, we add one more column:
df[:admission] = map(df[:admit]) do x
    if x == 1
        return "Success"
    else
        return "Failure"
    end
end
julia> df
100×4 DataFrames.DataFrame
│ Row │ exam1 │ exam2 │ admit │ admission │
├─────┼─────────┼─────────┼───────┼───────────┤
│ 1 │ 34.6237 │ 78.0247 │ 0 │ "Failure" │
│ 2 │ 30.2867 │ 43.895 │ 0 │ "Failure" │
│ 3 │ 35.8474 │ 72.9022 │ 0 │ "Failure" │
│ 4 │ 60.1826 │ 86.3086 │ 1 │ "Success" │
│ 5 │ 79.0327 │ 75.3444 │ 1 │ "Success" │
│ 6 │ 45.0833 │ 56.3164 │ 0 │ "Failure" │
│ 7 │ 61.1067 │ 96.5114 │ 1 │ "Success" │
│ 8 │ 75.0247 │ 46.554 │ 1 │ "Success" │
⋮
│ 92 │ 90.4486 │ 87.5088 │ 1 │ "Success" │
│ 93 │ 55.4822 │ 35.5707 │ 0 │ "Failure" │
│ 94 │ 74.4927 │ 84.8451 │ 1 │ "Success" │
│ 95 │ 89.8458 │ 45.3583 │ 1 │ "Success" │
│ 96 │ 83.4892 │ 48.3803 │ 1 │ "Success" │
│ 97 │ 42.2617 │ 87.1039 │ 1 │ "Success" │
│ 98 │ 99.315 │ 68.7754 │ 1 │ "Success" │
│ 99 │ 55.34 │ 64.9319 │ 1 │ "Success" │
│ 100 │ 74.7759 │ 89.5298 │ 1 │ "Success" │
Then color the data by this new column with Scale.color_discrete_manual:
plot(df, x = :exam1, y = :exam2, color = :admission,
Scale.color_discrete_manual(colorant"deep sky blue",
colorant"light pink"))
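The label column can also be built with a single broadcast instead of a map block. A small sketch on a plain vector, where `admit` holds the first eight values of the :admit column shown earlier and `admission` is an illustrative name:

```julia
# First eight values of the :admit column from the table above.
admit = [0, 0, 0, 1, 1, 0, 1, 1]

# Elementwise choice between the two labels.
admission = ifelse.(admit .== 1, "Success", "Failure")
```

The resulting vector can be assigned to the data frame as a new column in the same way as the map result.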
What Julia functions can output a DataFrame so it is converted into text other than the ones shown below?
using DataFrames
A = DataFrame(randn(10, 7));
print("\n\n\n", "A = DataFrame(randn(10, 7))")
print("\n\n\n","print(A)\n")
print(A)
print("\n\n\n","show(A)\n")
show(A)
print("\n\n\n","show(A, true)\n")
show(A, true)
print("\n\n\n","show(A, false)\n")
show(A, false)
print("\n\n\n","showall(A)\n")
showall(A)
print("\n\n\n","showall(A, true)\n")
showall(A, true)
print("\n\n\n","showall(A, false)\n")
showall(A, false)
print("\n\n\n","display(A)\n")
display(A)
Most of these output something similar to the following:
10×7 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │ x7 │
├─────┼────────────┼───────────┼───────────┼───────────┼────────────┼───────────┼───────────┤
│ 1 │ 0.377968 │ -2.23532 │ 0.560632 │ 1.00294 │ 1.32404 │ 1.30788 │ -2.09068 │
│ 2 │ -0.694824 │ -0.765572 │ -1.11163 │ 0.038083 │ -0.52553 │ -0.571156 │ 0.977219 │
│ 3 │ 0.343035 │ -1.47047 │ 0.228148 │ -1.29784 │ -1.00742 │ 0.127103 │ -0.399041 │
│ 4 │ -0.0979587 │ -0.445756 │ -0.483188 │ 0.816921 │ -1.12535 │ 0.603824 │ 0.293274 │
│ 5 │ 1.12755 │ -1.62993 │ 0.178764 │ -0.201441 │ -0.730923 │ 0.230186 │ -0.679262 │
│ 6 │ 0.481705 │ -0.716072 │ 0.747341 │ -0.310009 │ 1.4159 │ -0.175918 │ -0.079051 │
│ 7 │ 0.732061 │ -1.08842 │ -1.18988 │ 0.577758 │ -1.474 │ -1.43082 │ -0.584148 │
│ 8 │ -1.077 │ -1.41973 │ -0.330143 │ -1.12357 │ 1.01005 │ 1.06746 │ 2.09197 │
│ 9 │ -1.60122 │ -1.44661 │ 0.299586 │ 1.46604 │ -0.0200695 │ 2.62421 │ 0.396777 │
│ 10 │ -1.74101 │ -0.541589 │ 0.425117 │ 0.14669 │ 0.95779 │ 1.73954 │ -1.7994 │
This is ok and looks decent in a notebook, and is output correctly to latex/pdf with nbconvert as a plain ascii text table. However, I want more options to get text output similar to the following which looks much better in the latex/pdf generated by nbconvert.
| Column 1 | Column 2 | Column 3 | Column 4 |
|-------|-------------|-------------|--------------|
| a | b | c | d |
| A | B | C | D |
Are there any functions that output a Julia DataFrame with this formatting? What about other parameters like digits = or caption =?
Looking at the source code, there isn't an out-of-the-box way. You'll have to write your own (which you can then contribute to the GitHub repo).
The easiest way to do this is to have writetable() write to an in-memory file (using | as the separator) and then handle the column headers.
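As an alternative that does not depend on writetable(), the pipe-separated layout can be written by hand. This is only a sketch: `markdown_table` is a hypothetical helper, not part of DataFrames, and it assumes the column names and the numeric values have already been extracted into a vector of strings and a matrix. The `digits` keyword mirrors the option asked about in the question.

```julia
# Hypothetical helper: write a numeric matrix as a pipe-separated Markdown table.
function markdown_table(io::IO, header::Vector{String}, data::AbstractMatrix;
                        digits::Int=3)
    println(io, "| ", join(header, " | "), " |")
    println(io, "|", join(fill("---", length(header)), "|"), "|")
    for r in axes(data, 1)
        # Round each cell to the requested number of digits before printing.
        cells = [string(round(data[r, c]; digits=digits)) for c in axes(data, 2)]
        println(io, "| ", join(cells, " | "), " |")
    end
end

buf = IOBuffer()
markdown_table(buf, ["a", "b"], [1.23456 2.0; 3.14159 4.5]; digits=2)
out = String(take!(buf))
print(out)
```

Passing `stdout` as the first argument prints the table directly, which is what a notebook cell would need.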