I must be doing something wrong. I have a Julia script (below) that uses both vcat and plot. When I run the script, vcat returns an empty DataFrame. Another function calls plot and no plot is generated.
When I manually type the commands in the terminal window the commands behave normally.
Any help would be appreciated.
f_l = file_list[start_row_num:end_row_num] # Build a dataframe containing the data
len = length(f_l)
tmp_stock_df = DataFrame(CSV.File(f_l[1]))
vcat(s_d_df, tmp_stock_df)
println(s_d_df)
for i = 2:len
tmp_stock_df = DataFrame(CSV.File(f_l[i]))
tmp_stock_df.quote_datetime = map((x) -> DateTime(x, "yyyy-mm-dd HH:MM:SS"), tmp_stock_df.quote_datetime)
DataFrames.vcat(s_d_df, tmp_stock_df)
end
It's hard to say what you're doing differently when typing the commands manually, but it seems to me that this code would never produce the results you're looking for. Apart from the fact that s_d_df is not defined, vcat does not mutate its arguments, so you're never actually adding to your DataFrame:
julia> using DataFrames
julia> df1 = DataFrame(a = rand(2), b = rand(2)); df2 = DataFrame(a = rand(2), b = rand(2));
julia> vcat(df1, df2)
4×2 DataFrame
Row │ a b
│ Float64 Float64
─────┼────────────────────
1 │ 0.918298 0.343344
2 │ 0.538763 0.188229
3 │ 0.347177 0.385166
4 │ 0.18795 0.98408
julia> df1
2×2 DataFrame
Row │ a b
│ Float64 Float64
─────┼────────────────────
1 │ 0.918298 0.343344
2 │ 0.538763 0.188229
You probably want s_d_df = vcat(s_d_df, tmp_stock_df) to assign the result of the concatenation.
On a related note, it looks like you just have a list of files f_l with different csv files stored on your system which you want to read into a single DataFrame, in which case you can just replace the whole loop with:
s_d_df = vcat(CSV.read.(f_l, DataFrame)...)
(potentially also use the dateformat = "yyyy-mm-dd HH:MM:SS" kwarg in CSV.read to directly parse the dates when reading in the file).
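To make the difference between non-mutating and mutating concatenation concrete, here is a minimal sketch. The `parts` vector is a hypothetical stand-in for the DataFrames you would read from f_l; `reduce(vcat, ...)` returns a new DataFrame, while `append!` grows its first argument in place.

```julia
using DataFrames

# Hypothetical stand-ins for the per-file DataFrames read from f_l.
parts = [DataFrame(a = [1, 2], b = [0.1, 0.2]),
         DataFrame(a = [3, 4], b = [0.3, 0.4])]

# Non-mutating: vcat returns a new DataFrame, so the result must be bound to a name.
combined = reduce(vcat, parts)

# Mutating alternative: append! adds rows to its first argument in place.
acc = copy(parts[1])        # copy so that parts[1] itself is left untouched
for part in parts[2:end]
    append!(acc, part)
end
```

Both approaches end up with the same rows; `reduce(vcat, parts)` also avoids splatting a long list of files into a single call.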
I am new to Julia. When I try to import a CSV file
using CSV
CSV.read("C:\\Users\\...\\loan_predicton.csv")
I get the error below:
Error : ArgumentError: provide a valid sink argument, like `using DataFrames; CSV.read(source, DataFrame)`
Use:
using CSV
using DataFrames
df = CSV.read("C:\\Users\\...\\loan_predicton.csv", DataFrame)
Once you have more experience with Julia you will find that a CSV file can be read into different tabular data formats. That is why CSV.read asks you to provide the type of output you want to read your data into. Here is a small example:
julia> write("test.csv",
"""
a,b,c
1,2,3
4,5,6
""")
18
julia> using CSV, DataFrames
julia> CSV.read("test.csv", DataFrame)
2×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
2 │ 4 5 6
julia> CSV.read("test.csv", NamedTuple)
(a = [1, 4], b = [2, 5], c = [3, 6])
and you can see that in the first case you stored the result in a DataFrame, and in the second a NamedTuple.
The following code creates a segfault for me - is this a bug? And if so, in which component?
using DataFrames
function test()
Threads.@threads for i in 1:50
df = DataFrame()
df.foo = 1
end
end
test()
(You need to start Julia with multithreading support for this to work, e.g. JULIA_NUM_THREADS=50 julia.)
It only generates a segfault if the number of iterations / threads is sufficiently high, e.g. 50. For lower numbers it does so only sporadically or never.
My environment:
julia> versioninfo()
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
JULIA_NUM_THREADS = 50
It is most likely caused by your use of deprecated syntax: something in the deprecation handling probably misbehaves under threading (I do not have enough cores to test it).
In general, your code uses deprecated syntax (and produces something different from what you probably expect):
~$ julia --depwarn=yes --banner=no
julia> using DataFrames
julia> df = DataFrame()
0×0 DataFrame
julia> df.foo=1
┌ Warning: `setproperty!(df::DataFrame, col_ind::Symbol, v)` is deprecated, use `df[!, col_ind] .= v` instead.
│ caller = top-level scope at REPL[3]:1
└ @ Core REPL[3]:1
1
julia> df # note that the resulting deprecated syntax has added the column but it has 0 rows
0×1 DataFrame
julia> df2 = DataFrame()
0×0 DataFrame
julia> df2.foo = [1] # this is a correct syntax - assign a vector
1-element Array{Int64,1}:
1
julia> df2[:, :foo2] .= 1 # or use broadcasting
1-element Array{Int64,1}:
1
julia> insertcols!(df2, :foo3 => 1) # or use insertcols! which does broadcasting automatically, see the docstring for details
1×3 DataFrame
│ Row │ foo │ foo2 │ foo3 │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 1 │ 1 │
The reason why df.foo = 1 is disallowed and df.foo = [1] is required follows from the fact that, as opposed to e.g. R, Julia distinguishes scalars and vectors (in R everything is a vector).
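A minimal sketch of that distinction in plain Julia, without DataFrames (the names x, v and col are only illustrative):

```julia
x = 1      # a scalar Int
v = [1]    # a one-element Vector{Int}

x_is_vector = x isa AbstractVector   # a scalar is not a vector
v_is_vector = v isa AbstractVector   # a one-element vector is

# Broadcasting is the explicit way to expand a scalar to a given length.
col = Vector{Int}(undef, 3)
col .= x   # fills every element of col with the scalar
```

This is why a column has to be given either as a vector or through a broadcasting assignment: the scalar alone carries no length.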
Going back to the original question, something like this should work:
using DataFrames
function test()
Threads.@threads for i in 1:50
df = DataFrame()
df.foo = [1]
end
end
test()
Please let me know whether it causes problems or not. Thank you!
I have a number of very large CSV files which I would like to parse into custom data structures for subsequent processing. My current approach involves CSV.File and then converting each CSV.Row into the custom data structure. It works well for small test cases but gets really inefficient for the large files (GC very high). The problem is in the second step and I suspect is due to type instability. I'm providing a mock example below.
(I'm new to Julia, so apologies if I misunderstood something)
Define data structure and conversion logic:
using CSV
struct Foo
a::Int32
b::Float32
end
Foo(csv_row::CSV.Row) = Foo(csv_row.a, csv_row.b)
Using the default constructor causes 0 allocations:
julia> @allocated foo1 = Foo(1, 2.5)
0
However, when creating the object from CSV.Row all of a sudden 80 bytes are allocated:
julia> data = CSV.File(Vector{UInt8}("a,b\n1,2.5"); threaded = false)
1-element CSV.File{false}:
CSV.Row: (a = 1, b = 2.5f0)
julia> @allocated foo2 = Foo(data[1])
80
In the first case all types are stable:
julia> @code_warntype Foo(1, 2)
Variables
#self#::Core.Compiler.Const(Foo, false)
a::Int64
b::Int64
Body::Foo
1 ─ %1 = Main.Foo::Core.Compiler.Const(Foo, false)
│ %2 = Core.fieldtype(%1, 1)::Core.Compiler.Const(Int32, false)
│ %3 = Base.convert(%2, a)::Int32
│ %4 = Core.fieldtype(%1, 2)::Core.Compiler.Const(Float32, false)
│ %5 = Base.convert(%4, b)::Float32
│ %6 = %new(%1, %3, %5)::Foo
└── return %6
Whereas in the second case they are not:
julia> @code_warntype Foo(data[1])
Variables
#self#::Core.Compiler.Const(Foo, false)
csv_row::CSV.Row
Body::Foo
1 ─ %1 = Base.getproperty(csv_row, :a)::Any
│ %2 = Base.getproperty(csv_row, :b)::Any
│ %3 = Main.Foo(%1, %2)::Foo
└── return %3
So I guess my question is: How can I make the second case type-stable and avoid the allocations?
Providing the types explicitly in CSV.File does not make a difference by the way.
While this does not focus on type stability, I would expect the highest performance combined with flexibility from the following code:
d = DataFrame!(CSV.File(Vector{UInt8}("a,b\n1,2.5\n3,4.0"); threaded = false))
The above efficiently transforms a CSV.File into a type stable structure, additionally avoiding data copying in this process. This should work for your case of huge CSV files.
And now:
julia> Foo.(d.a, d.b)
2-element Array{Foo,1}:
Foo(1, 2.5f0)
Foo(3, 4.0f0)
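The same idea can be sketched without CSV or DataFrames. Here the vectors `a` and `b` are hypothetical stand-ins for the typed columns d.a and d.b, and `build_foos` is an illustrative helper name, not part of any library. Because the columns are concretely typed vectors, the element types are known when the broadcast is compiled, unlike the `Any` returned by property access on a CSV.Row.

```julia
struct Foo
    a::Int32
    b::Float32
end

# Hypothetical stand-ins for the typed columns d.a and d.b.
a = [1, 3]
b = [2.5, 4.0]

# Passing the columns through a function lets the compiler specialize
# on their concrete element types before constructing each Foo.
build_foos(a::AbstractVector, b::AbstractVector) = Foo.(a, b)

foos = build_foos(a, b)
```

Whether this removes all allocations in your real workload is something to confirm with @allocated or a profiler on your own data.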
I'm trying to subset a DataFrame in Julia as follows:
df = DataFrame(a=[1,2,3], b=["x", "y", "z"])
df2 = df[df.a == 2, :]
I'd expect to get back just the second row, but instead I get an error:
ERROR: BoundsError: attempt to access "attempt to access a data frame
with 3 rows at index false"
What does this error mean and how do I subset the DataFrame?
Just to mention other options note that you can use the filter function here:
julia> filter(row -> row.a == 2, df)
1×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 2 │ y │
or
julia> df[findall(==(2), df.a), :]
1×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 2 │ y │
Fortunately, you only need to add one character: a dot (.). The . character enables broadcasting on any Julia function, even ones like ==. Therefore, your code would be as follows:
df = DataFrame(a=[1,2,3], b=["x", "y", "z"])
df2 = df[df.a .== 2, :]
Without the broadcast, the clause df.a == 2 returns false because it's literally comparing the Array [1,2,3], as a whole unit, to the scalar value of 2. An Array of shape (3,) will never be equal to a scalar value of 2, without broadcasting, because the sizes are different. Therefore, that clause just returns a single false.
The error you're getting tells you that you're trying to access the DataFrame at index false, which is not a valid index for a DataFrame with 3 rows. By broadcasting with ., you're now creating a Bool Array of shape (3,), which is a valid way to index a DataFrame with 3 rows.
For more on broadcasting, see the section on dot syntax for vectorizing functions in the official Julia documentation.
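The difference can be checked on a plain vector, without a DataFrame. This is a small sketch; the names whole, mask and selected are only illustrative.

```julia
a = [1, 2, 3]

# Without the dot, the whole vector is compared to the scalar: one Bool.
whole = (a == 2)

# With the dot, the comparison is applied elementwise: one Bool per element.
mask = (a .== 2)

# A Bool mask of matching length is a valid index; a single Bool is not.
selected = a[mask]
```

The row index in df[df.a .== 2, :] works the same way as `mask` does here.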
Is there a way to convert an object in Julia to a code representation generating the same object?
I am basically looking for an equivalent to R's dput function.
So if I have an object like:
A = rand(2,2)
# Which outputs
>2×2 Array{Float64,2}:
0.0462887 0.365109
0.698356 0.302478
I can do something like dput(A) which prints something like the following to the console that can be copy-pasted to be able to replicate the object:
[0.0462887 0.365109; 0.698356 0.302478]
I think you are looking for repr:
julia> A = rand(2, 2);
julia> repr(A)
"[0.427705 0.0971806; 0.395074 0.168961]"
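As a quick check that the string really recreates the object, it can be parsed and evaluated again. This is a minimal sketch with a fixed matrix so the result is reproducible; the names code and B are only illustrative.

```julia
A = [0.5 1.25; 2.0 3.75]

# repr returns a String containing Julia code for the matrix.
code = repr(A)

# Parsing and evaluating that string rebuilds an equal matrix.
B = eval(Meta.parse(code))
```

In practice you would simply paste the printed string into your script rather than calling eval.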
Just use Base.dump.
julia> dump(rand(2,2))
Array{Float64}((2, 2)) [0.162861 0.434463; 0.0823066 0.519742]
You can copy the second part.
(This is a modified crosspost of https://stackoverflow.com/a/73337342/18431399)
repr might not work as expected for DataFrames.
Here is one way to mimic the behaviour of R's dput for DataFrames in Julia:
julia> using DataFrames
julia> using Random; Random.seed!(0);
julia> df = DataFrame(a = rand(3), b = rand(1:10, 3))
3×2 DataFrame
Row │ a b
│ Float64 Int64
─────┼──────────────────
1 │ 0.405699 1
2 │ 0.0685458 7
3 │ 0.862141 2
julia> repr(df) # Attempting with repr()
"3×2 DataFrame\n Row │ a b\n │ Float64 Int64\n─────┼──────────────────\n 1 │ 0.405699 1\n 2 │ 0.0685458 7\n 3 │ 0.862141 2"
julia> julian_dput(x) = invoke(show, Tuple{typeof(stdout), Any}, stdout, x);
julia> julian_dput(df)
DataFrame(AbstractVector[[0.4056994708920292, 0.06854582438651502, 0.8621408571954849], [1, 7, 2]], DataFrames.Index(Dict(:a => 1, :b => 2), [:a, :b]))
That is, julian_dput() takes a DataFrame as input and prints code that can recreate that input.
Source: https://discourse.julialang.org/t/given-an-object-return-julia-code-that-defines-the-object/80579/12