I have a very large (~120 GB) CSV file with ~100 columns. I want to iterate through the file line by line using CSV.File and aggregate certain ranges of columns. However, it appears that there is no getindex method for the CSV.Row type. Here is a simplified example:
using CSV
using DataFrames
df = DataFrame(reshape(1:60, 6, 10)) # column names are x1 through x10
CSV.write("test_data.csv", df)
file = CSV.File("test_data.csv")
row1 = first(file)
row1.x3 # Works fine
# Both of these throw method errors:
row1[4]
row1[4:7]
Suppose that for each row I want to sum columns [1:3; 8:10] in a variable a and sum columns 4:7 in a variable b. The final output should be a data frame with columns a and b. Is there an easy way to do this when iterating through CSV.Rows?
Here's a version that lets you avoid having to think about translating to the "drop"/"take" logic:
using CSV, Tables, DataFrames
df = DataFrame(reshape(1:60, 6, 10)) # column names are x1 through x10
CSV.write("test_data.csv", df)
function aggregate_file(path, a_inds, b_inds)
file = CSV.File(path)
a, b = Int64[], Int64[]
a_cols = propertynames(file)[a_inds]
b_cols = propertynames(file)[b_inds]
for row in file
push!(a, sum(getproperty.(Ref(row), a_cols)))
push!(b, sum(getproperty.(Ref(row), b_cols)))
end
DataFrame(a = a, b = b)
end
julia> aggregate_file("test_data.csv", [1:3; 8:10], 4:7)
6×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 168 │ 112 │
│ 2 │ 174 │ 116 │
│ 3 │ 180 │ 120 │
│ 4 │ 186 │ 124 │
│ 5 │ 192 │ 128 │
│ 6 │ 198 │ 132 │
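Since the question was about positional access, here is a minimal sketch of another route: indexing each row by position with Tables.getcolumn(row, i). It assumes Tables.jl 1.0 or later. The function name aggregate_by_index and the NamedTuple return value are illustrative choices, not part of the answer above.

```julia
using CSV, Tables

# Recreate the 6×10 test table without DataFrames:
# a NamedTuple of vectors is a valid Tables.jl column table.
colnames = Tuple(Symbol("x", j) for j in 1:10)
columns = Tuple(collect((6j - 5):(6j)) for j in 1:10)
CSV.write("test_data.csv", NamedTuple{colnames}(columns))

# Hypothetical helper: sum two sets of columns, addressed by position, row by row.
function aggregate_by_index(path, a_inds, b_inds)
    a, b = Int[], Int[]
    for row in CSV.File(path)
        push!(a, sum(Tables.getcolumn(row, i) for i in a_inds))
        push!(b, sum(Tables.getcolumn(row, i) for i in b_inds))
    end
    return (a = a, b = b)
end

result = aggregate_by_index("test_data.csv", [1:3; 8:10], 4:7)
```

The generator inside sum avoids allocating a temporary array per row, which may matter when the file has many rows.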
You can use Tables.eachcolumn from Tables.jl, since CSV.File supports the Tables interface. Tables.eachcolumn creates an iterator over the columns of a row. You can then combine Iterators.take and Iterators.drop to access the desired ranges of columns:
using CSV
using Tables
using DataFrames
using Base.Iterators: take
using Base.Iterators: drop
function aggregate_file(path)
file = CSV.File(path)
a, b = Int64[], Int64[]
for row in file
cols = Tables.eachcolumn(row)
sum1to3 = sum(take(cols, 3))
sum8to10 = sum(drop(cols, 7))
push!(a, sum1to3 + sum8to10)
sum4to7 = sum(drop(take(cols, 7), 3))
push!(b, sum4to7)
end
DataFrame(a = a, b = b)
end
julia> aggregate_file("test_data.csv")
6×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 168 │ 112 │
│ 2 │ 174 │ 116 │
│ 3 │ 180 │ 120 │
│ 4 │ 186 │ 124 │
│ 5 │ 192 │ 128 │
│ 6 │ 198 │ 132 │
If you need to aggregate over an arbitrary set of column indices, you could use enumerate on the column iterator:
inds = [2, 4, 7]
sum(j for (i, j) in enumerate(cols) if i in inds)
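To see the enumerate pattern in isolation, here is a small sketch that uses a plain NamedTuple as a stand-in for one row's column iterator. The values are made up for illustration, and inds is stored in a Set so that the membership test stays cheap for long index lists.

```julia
# Stand-in for one row: iterating a NamedTuple yields its values in order.
row = (x1 = 1, x2 = 7, x3 = 13, x4 = 19, x5 = 25, x6 = 31, x7 = 37)

inds = Set([2, 4, 7])

# Keep only the values whose position is in inds, then sum them.
total = sum(v for (i, v) in enumerate(row) if i in inds)
```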
EDIT:
I did a performance comparison of my answer and @IanFiske's. His version appears to be faster and to use less memory:
julia> using BenchmarkTools
julia> @btime aggregate_file("test_data.csv");
118.687 μs (550 allocations: 24.42 KiB)
julia> @btime aggregate_file("test_data.csv", [1:3; 8:10], 4:7);
62.416 μs (236 allocations: 14.48 KiB)
I've been using the following code to generate histograms from binning one column of a Dataframe and using that bin to calculate a median from another column.
using Plots, Statistics, DataFrames
df = DataFrame(x=repeat(["a","b","c"], 5), y=1:15)
res = combine(groupby(df, :x, sort=true), :y => median)
bar(res.x, res.y_median, legend=false)
This code treats each distinct value as its own bin. Is it possible to specify the bin ranges manually instead?
Row │ A B_median
│ Any Float64
─────┼───────────────────
1 │ 1515.74 0.09
2 │ 1517.7 0.81
3 │ 1527.22 10.23
4 │ 1529.88 2.95
5 │ 1530.72 17.32
6 │ 1530.86 15.22
7 │ 1532.26 1.45
8 │ 1532.68 18.51
9 │ 1541.08 1.32
10 │ 1541.22 15.78
11 │ 1541.36 0.12
12 │ 1541.5 13.55
13 │ 1541.92 11.99
14 │ 1542.06 21.14
15 │ 1542.34 10.645
16 │ 1542.62 19.95
17 │ 1542.76 21.0
18 │ 1543.32 20.91
For example, instead of calculating a median for each of rows 9 to 17 individually, could these rows be grouped together automatically, e.g. as 1542.7 +/- 0.7, with a single median calculated for that range?
Many thanks!
I assume you want something like this:
julia> using DataFrames, CategoricalArrays, Random, Statistics
julia> Random.seed!(1234);
julia> df = DataFrame(A=rand(20), B=rand(20));
julia> df.A_group = cut(df.A, 4);
julia> res = combine(groupby(df, :A_group, sort=true), :B => median)
4×2 DataFrame
Row │ A_group B_median
│ Cat… Float64
─────┼─────────────────────────────────────────────
1 │ Q1: [0.014908849285099945, 0.532… 0.134685
2 │ Q2: [0.5323651749779272, 0.65860… 0.347995
3 │ Q3: [0.6586057536399257, 0.81493… 0.501756
4 │ Q4: [0.8149335702852593, 0.97213… 0.531899
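If you would rather set the bin edges yourself than use quantiles, cut also accepts a vector of break points. The sketch below uses a few rows from the question's table. The break values are arbitrary choices for illustration, and the intervals are closed on the left and open on the right.

```julia
using DataFrames, CategoricalArrays, Statistics

# A handful of (A, B) pairs taken from the table in the question.
df = DataFrame(A = [1515.74, 1527.22, 1529.88, 1530.72, 1541.08, 1542.06, 1543.32],
               B = [0.09, 10.23, 2.95, 17.32, 1.32, 21.14, 20.91])

# Hypothetical manual bin edges: [1510, 1520), [1520, 1540), [1540, 1545).
breaks = [1510.0, 1520.0, 1540.0, 1545.0]
df.A_bin = cut(df.A, breaks)

# One median of B per manually defined bin.
res = combine(groupby(df, :A_bin, sort=true), :B => median)
```

Values outside the outermost breaks raise an error by default; cut has an extend keyword for that case.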
I am new to Julia. I want to see the first 5 rows of a data frame, but when I write the code below
head(df,5)
I am getting
UndefVarError: head not defined
head is available in e.g. R but not in Julia. First, note that Julia has a nice data frame printing system out of the box that crops the output to fit the terminal window, so you do not need to subset your data frame to see its head and tail. Here is an example:
julia> df = DataFrame(rand(100, 100), :auto)
100×100 DataFrame
Row │ x1 x2 x3 x4 x5 x6 x7 ⋯
│ Float64 Float64 Float64 Float64 Float64 Float64 Float6 ⋯
─────┼────────────────────────────────────────────────────────────────────────────
1 │ 0.915485 0.176254 0.381047 0.710266 0.597914 0.177617 0.4475 ⋯
2 │ 0.58495 0.551726 0.464703 0.630956 0.476727 0.804854 0.7908
3 │ 0.123723 0.183817 0.986624 0.306091 0.202054 0.148579 0.3433
4 │ 0.558321 0.117478 0.187091 0.482795 0.0718985 0.807018 0.9463
5 │ 0.771561 0.515823 0.830598 0.0742368 0.0831569 0.818487 0.4912 ⋯
6 │ 0.139018 0.182928 0.00129572 0.0439561 0.0929167 0.264609 0.1555
7 │ 0.16076 0.404707 0.0300284 0.665413 0.681704 0.431746 0.3460
8 │ 0.149331 0.132869 0.237446 0.599701 0.149257 0.70753 0.7687
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
93 │ 0.912703 0.98395 0.133307 0.493799 0.76125 0.295725 0.9249 ⋯
94 │ 0.153175 0.339036 0.685642 0.355421 0.365252 0.434604 0.1515
95 │ 0.780877 0.225312 0.511122 0.0506186 0.108054 0.729219 0.5275
96 │ 0.132961 0.348176 0.619712 0.791334 0.052787 0.577896 0.6696
97 │ 0.904386 0.938876 0.988184 0.831708 0.699214 0.627366 0.4320 ⋯
98 │ 0.0295777 0.704879 0.905364 0.142231 0.586725 0.584692 0.9546
99 │ 0.848715 0.177192 0.544509 0.771653 0.472267 0.584306 0.0089
100 │ 0.81299 0.00540772 0.107315 0.323288 0.592159 0.1297 0.3383
94 columns and 84 rows omitted
Now if you need to fetch first 5 rows of your data frame and create a new data frame then use the first function that is defined in Julia Base:
julia> first(df, 5)
5×100 DataFrame
Row │ x1 x2 x3 x4 x5 x6 x7 x ⋯
│ Float64 Float64 Float64 Float64 Float64 Float64 Float64 F ⋯
─────┼────────────────────────────────────────────────────────────────────────────
1 │ 0.915485 0.176254 0.381047 0.710266 0.597914 0.177617 0.447533 0 ⋯
2 │ 0.58495 0.551726 0.464703 0.630956 0.476727 0.804854 0.790866 0
3 │ 0.123723 0.183817 0.986624 0.306091 0.202054 0.148579 0.343316 0
4 │ 0.558321 0.117478 0.187091 0.482795 0.0718985 0.807018 0.946342 0
5 │ 0.771561 0.515823 0.830598 0.0742368 0.0831569 0.818487 0.491206 0 ⋯
93 columns omitted
In general, the design of DataFrames.jl is to limit the number of new function names as much as possible and reuse what is defined in Julia Base where possible. This is one example of such a situation. This way users have fewer things to learn.
In Julia, the equivalent command is first rather than head.
Use first instead of head, and last instead of tail.
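As a quick illustration of both functions, here is a sketch on a made-up ten-row data frame (the column names id and sq are arbitrary):

```julia
using DataFrames

df = DataFrame(id = 1:10, sq = (1:10) .^ 2)

top = first(df, 3)     # the first 3 rows, as a new data frame
bottom = last(df, 3)   # the last 3 rows, as a new data frame
```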
I am generating TeX files using a template and rendering that template using Mustache.
Firstly I have data in a DataFrame:
Row │ label │ score │ max │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 4 │
│ 2 │ 2 │ 3 │ 5 │
│ 3 │ 3 │ 4 │ 6 │
│ 4 │ 4 │ 5 │ 7 │
and a dictionary:
student = Dict( "name" => "John", "surname" => "Smith");
I want to render a template such that both the dictionary variables and the DataFrame variables are replaced in the template. Rendering works when I use either a dictionary or a DataFrame, but not both at the same time.
For example, the render works on a DataFrame only with the template 'tmpl' shown below:
tmpl = """
Your marks are:
\\begin{itemize}
{{#:D}}
\\item Mark for question {{:label}} is {{:score}} out of {{:max}}
{{/:D}}
"""
rendered_marks = render(tmpl, D=df );
However, when I add variables such as :name or :surname from the 'student' dictionary, I get error messages:
marks_tmpl = """
Hello \\textbf{ {{:name}}, {{:surname}} }
Your marks are:
\\begin{itemize}
{{#:D}}
\\item Mark for question {{:label}} is {{:score}} out of {{:max}}
{{/:D}}
\\end{itemize}
\\end{document}
"""
rendered_marks = render(tmpl, student, D=df );
What is the right way to do it?
You are not allowed to mix Dict and keyword arguments. The easiest thing is to add the DataFrame to the dictionary.
First, create your DataFrame:
df = DataFrame(label=1:4, score=2:5, max=4:7)
4×3 DataFrame
│ Row │ label │ score │ max │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 4 │
│ 2 │ 2 │ 3 │ 5 │
│ 3 │ 3 │ 4 │ 6 │
│ 4 │ 4 │ 5 │ 7 │
Next, reference your DataFrame in the dictionary for Mustache.jl rendering:
student = Dict( "name" => "John", "surname" => "Smith", "df" => df);
marks_tmpl = """
Hello \\textbf{ {{name}}, {{surname}} }
Your marks are:
\\begin{itemize}
{{#df}}
\\item Mark for question {{:label}} is {{:score}} out of {{:max}}
{{/df}}
\\end{itemize}
"""
In this way, both dictionary and DataFrame variables are rendered:
julia> println(render(marks_tmpl, student))
Hello \textbf{ John, Smith }
Your marks are:
\begin{itemize}
\item Mark for question 1 is 2 out of 4
\item Mark for question 2 is 3 out of 5
\item Mark for question 3 is 4 out of 6
\item Mark for question 4 is 5 out of 7
\end{itemize}
I guess this is what you wanted?
To add to the answer, you could also pass the student fields as a named tuple (or another iterable) and wrap them in their own section in the template:
tmpl = """
Hello {{#:E}}\\textbf{ {{:name}}, {{:surname}} }{{/:E}}
Your marks are:
\\begin{itemize}
{{#:D}}
\\item Mark for question {{:label}} is {{:score}} out of {{:max}}
{{/:D}}
\\end{itemize}
\\end{document}
"""
using Mustache
using DataFrames
student = Dict( "name" => "John", "surname" => "Smith");
D = DataFrame(label=[1,2], score=[80,90])
Mustache.render(tmpl, E=(name="John",surname="Doe"),D=D, max=100)
How can I read DateTime values from a CSV file with Julia (version 1.0.1)? As you can see below, when my data is read in, the columns are marked as "String" values, but I was hoping the call to head() would show DateTime as the data type.
I'm reading like this:
using Dates, CSV, DataFrames
dfmt = dateformat"yyyy-mm-dd hh:MM:ss"
column_types = Dict(:pickup_datetime=>DateTime, :dropoff_datetime=>DateTime)
df = convert(DataFrame, CSV.read("$(Base.source_dir())/small_taxi.csv",
types=column_types, dateformat=dfmt))
function reduce_dataframe(data_frame)
return data_frame[[:vendor_id, :pickup_datetime, :dropoff_datetime,
:passenger_count, :trip_distance]]
end
df = reduce_dataframe(df)
head(df)
Here is my program output (from taxi data):
julia> include("hello.jl")
Started ...
elapsed CPU time: 0.09325 seconds
0.094642 seconds (548.85 k allocations: 10.445 MiB)
6×4 DataFrame
│ Row │ vendor_id │ pickup_datetime │ dropoff_datetime │ passenger_count │
│ │ Int64⍰ │ String⍰ │ String⍰ │ Int64⍰ │
├─────┼───────────┼─────────────────────┼─────────────────────┼─────────────────┤
│ 1 │ 1 │ 2017-01-01 01:21:25 │ 2017-01-01 01:51:56 │ 2 │
│ 2 │ 1 │ 2017-01-01 02:17:49 │ 2017-01-01 02:17:49 │ 3 │
│ 3 │ 1 │ 2017-01-01 02:30:02 │ 2017-01-01 02:52:56 │ 1 │
│ 4 │ 1 │ 2017-01-01 04:17:32 │ 2017-01-01 04:17:36 │ 1 │
│ 5 │ 1 │ 2017-01-01 04:41:54 │ 2017-01-01 05:24:22 │ 1 │
│ 6 │ 1 │ 2017-01-01 10:41:18 │ 2017-01-01 10:56:59 │ 2 │
What is the trick here? Here is some sample data if you want to try yourself: https://gist.github.com/djangofan/09c6304b55f2a73cb05d0d2afc7902b1
When faced with such conversion issues, it is best to go a bit lower level to understand what is going on.
So, we start by looking at a date-time string from your table:
dt_str="2017-01-01 01:21:25"
Can it be formatted with our format string?
dfmt = dateformat"yyyy-MM-dd hh:mm:ss"
Date(dt_str,dfmt)
Running that we get
ERROR: ArgumentError: Unable to parse date time. Expected directive Delim( hh:) at char 11
Something is not quite right here. Let's consult the manual. The manual points to Dates.DateFormat, and there are plenty of examples in stdlib/Dates/test/io.jl.
We notice that we have been using the wrong letters for months, hours, minutes and seconds. We test again:
dfmt = dateformat"yyyy-mm-dd HH:MM:SS"
Date(dt_str,dfmt)
No errors this time! We try it on our table
t_data=CSV.read("$(Base.source_dir())/small_taxi.csv", dateformat=dfmt)
t_data[[:vendor_id, :pickup_datetime, :dropoff_datetime,
:passenger_count, :trip_distance]]
We get
julia> t_data[[:vendor_id, :pickup_datetime, :dropoff_datetime,
:passenger_count]]
5×4 DataFrame
│ Row │ vendor_id │ pickup_datetime │ dropoff_datetime │ passenger_count │
│ │ Int64⍰ │ DateTime⍰ │ DateTime⍰ │ Int64⍰ │
├─────┼───────────┼─────────────────────┼─────────────────────┼─────────────────┤
│ 1 │ 2 │ 2017-09-23T05:08:42 │ 2017-09-23T05:27:39 │ 6 │
│ 2 │ 1 │ 2017-07-14T19:07:38 │ 2017-07-14T19:54:17 │ 1 │
│ 3 │ 2 │ 2017-10-29T00:42:06 │ 2017-10-29T00:43:12 │ 2 │
│ 4 │ 2 │ 2017-10-02T20:38:17 │ 2017-10-02T21:13:09 │ 1 │
│ 5 │ 1 │ 2017-05-11T22:53:11 │ 2017-05-11T23:27:53 │ 2 │
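The letter codes can be checked in isolation with the Dates standard library alone. In this sketch, lowercase m is the month and d the day, while uppercase H, M and S are hour, minute and second. The sample string is the one from the question's data.

```julia
using Dates

dfmt = dateformat"yyyy-mm-dd HH:MM:SS"

# Parse the full timestamp, including the time-of-day part.
dt = DateTime("2017-01-01 01:21:25", dfmt)

# Formatting with the same DateFormat should reproduce the input string.
roundtrip = Dates.format(dt, dfmt)
```

Parsing with DateTime rather than Date keeps the hours, minutes and seconds.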
The libraries you need (often left out of answers, which frustrates learners):
# import Pkg; Pkg.add("CSV")
using CSV
# import Pkg; Pkg.add("Dates")
using Dates
# import Pkg; Pkg.add("DataFrames")
using DataFrames
The date format depends on the original data in CSV file.
Note below: 'u' stands for 3-letter English month, e.g. "Aug. 3, 2020"
date_format="yyyy.mm.dd" # or "yyyy-mm-dd" or "u. dd, yyyy"
Read the DataFrame with the given date format; the dates are output as standard Date values ("yyyy-mm-dd"):
df = CSV.read( # returns DataFrame
file_path, # URL
dateformat="$date_format"
)
Example output:
82 rows × 4 columns
Date ActualValue ForecastValue PreviousValue
Date Float64 Float64? Float64?
1 2020-08-03 44.3 34.4 42.1
I think that they changed the macro in Julia 1.0, so the dateformat statement form is
dfmt = @dateformat_str("yyyy-mm-dd HH:MM:SS")
or
dfmt = dateformat"yyyy-mm-dd HH:MM:SS"
though I don't have your dated CSV file to verify this works.
(added when you edited question to give file) In addition, your provided file is tab separated with repeated tabs, so you need:
using Dates, CSV, DataFrames
dfmt = dateformat"yyyy-mm-dd HH:MM:SS"
df = convert(DataFrame, CSV.read("$(Base.source_dir())/small_taxi.csv",
dateformat=dfmt, delim="\t", ignorerepeated=true))
function reduce_dataframe(data_frame)
return data_frame[[:vendor_id, :pickup_datetime, :dropoff_datetime,
:passenger_count, :trip_distance]]
end
df = reduce_dataframe(df)
head(df)
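For readers on a newer CSV.jl, here is a hedged sketch of the same idea. It assumes a CSV.jl version in which CSV.read takes the sink type (DataFrame) as its second argument, and it reads from an in-memory string with made-up rows instead of the original file. The explicit types entry is a belt-and-braces choice in case automatic detection does not pick DateTime.

```julia
using CSV, DataFrames, Dates

# Made-up sample in the same shape as the taxi file: tab-separated with repeated tabs.
raw = """
vendor_id\t\tpickup_datetime\t\tpassenger_count
1\t\t2017-01-01 01:21:25\t\t2
1\t\t2017-01-01 02:17:49\t\t3
"""

dfmt = dateformat"yyyy-mm-dd HH:MM:SS"

df = CSV.read(IOBuffer(raw), DataFrame;
              delim = '\t',
              ignorerepeated = true,   # treat consecutive tabs as one delimiter
              dateformat = dfmt,
              types = Dict(:pickup_datetime => DateTime))

first(df, 2)   # first replaces head in current DataFrames.jl
```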
I am attempting to create a lag +1 forward for a particular column in my data frame.
My data is like this:
julia> head(df)
6×9 DataFrames.DataFrame. Omitted printing of 1 columns
│ Row │ Date │ Open │ High │ Low │ Close │ Adj Close │ Volume │ Close_200sma │
├─────┼────────────┼─────────┼─────────┼─────────┼─────────┼───────────┼─────────┼──────────────┤
│ 1 │ 1993-02-02 │ 43.9687 │ 43.9687 │ 43.75 │ 43.9375 │ 27.6073 │ 1003200 │ NaN │
│ 2 │ 1993-02-03 │ 43.9687 │ 44.25 │ 43.9687 │ 44.25 │ 27.8036 │ 480500 │ NaN │
│ 3 │ 1993-02-04 │ 44.2187 │ 44.375 │ 44.125 │ 44.3437 │ 27.8625 │ 201300 │ NaN │
│ 4 │ 1993-02-05 │ 44.4062 │ 44.8437 │ 44.375 │ 44.8125 │ 28.1571 │ 529400 │ NaN │
│ 5 │ 1993-02-08 │ 44.9687 │ 45.0937 │ 44.4687 │ 45.0 │ 28.2749 │ 531500 │ NaN │
│ 6 │ 1993-02-09 │ 44.9687 │ 45.0625 │ 44.7187 │ 44.9687 │ 28.2552 │ 492100 │ NaN │
This is my attempt at lagging forward. In R I might use rep(NA, 1) and then prepend it to the subsetted data. Here is my Julia code:
# Lag data +1 forward
lag = df[1:nrow(df)-1,[:Long]] # shorten vector by 1 (remove last element)
v = Float64[]
v = vec(convert(Array, lag)) # convert df column to vector
z = fill(NaN, 1) # rep NaN, 1 time (add this to front) to push all forward +1
lags = Float64[]
lags= vec[z; [v]] # join both arrays z=NA first , make vector same nrow(df)
When I join the NaN and my array I have a length(lags) of 2.
The data is split in two:
julia> length(lags[2])
6255
I only see the full length when I access the second element.
If I join the other way, with the numbers first and NaN at the end, I obtain the correct length:
# try joining other way
lags_flip = [v; [z]]
julia> length(lags_flip)
6256
I can also add this back to my data frame (with NaN at the bottom, whereas I want it at the front):
# add back to data frame
df[:add] = lags_flip
1
1
1
1
1
1
1
1
[NaN]
My question is this. When I join the NaN and my data like this:
lags_flip = [v; [z]]
I obtain the correct length. When I do it the other way,
with NaN first:
lags= [z; [v]]
the result does not appear correct.
How can I offset my data +1 forward, placing a NaN in front, and add it back to my df? I feel I'm close but missing something.
EDIT:
On second thought, changing the length of a column in a DataFrame is probably not the best thing to do, and I assume you want a new column anyway. In that case, this could be a basic approach:
df[:LagLong] = [missing; df[1:end-1,:Long]]
or if you want NaN (but probably you want missing as explained below):
df[:LagLong] = [NaN; df[1:end-1,:Long]]
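The length of 2 seen in the question comes from the extra brackets: [v] is a one-element vector that holds v, so [z; [v]] concatenates one NaN with one nested vector. A minimal sketch with a made-up three-element vector shows the difference:

```julia
v = [1.0, 2.0, 3.0]
z = fill(NaN, 1)

nested = [z; [v]]           # 2 elements: NaN and the whole vector v
flat = [z; v]               # 4 elements: NaN, 1.0, 2.0, 3.0

# Lag by one: put NaN in front and drop the last element so the length is unchanged.
lagged = [NaN; v[1:end-1]]
```

In newer DataFrames.jl versions, where columns are accessed as df.Long, the equivalent assignment would presumably be df.LagLong = [missing; df.Long[1:end-1]].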
PREVIOUS REPLY:
You can do it in place:
julia> x = [1.0,2.0,3.0]
3-element Array{Float64,1}:
1.0
2.0
3.0
julia> pop!(unshift!(x, NaN))
3.0
julia> x
3-element Array{Float64,1}:
NaN
1.0
2.0
Replace x in pop!(unshift!(x, NaN)) by an appropriate column selector like df[:Long].
Note, however, that NaN is not the equivalent of NA in R; in Julia that role is played by missing. There are now two cases:
If your column allows missing values (it will show Union{Missing, [Something]} in showcols), then do the same as above: pop!(unshift!(df[:Long], missing)).
If it does not allow missing values, you have two options. The first is to call allowmissing!(df, :Long) to allow them and then proceed as described above. The other is similar to the approach you proposed: df[:Long] = [missing; df[1:end-1, :Long]].
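For current Julia versions, note that unshift! was renamed to pushfirst! in Julia 0.7. The sketch below shows both cases on plain vectors, using only Base. The conversion to Vector{Union{Missing, Float64}} stands in for what allowmissing! does to a data frame column, which is an analogy rather than the same call.

```julia
# Case 1: NaN as the filler works on a plain Float64 vector.
x = [1.0, 2.0, 3.0]
pop!(pushfirst!(x, NaN))      # prepend NaN, then drop the last element, in place

# Case 2: missing needs an element type that allows it.
y = [1.0, 2.0, 3.0]
ym = Vector{Union{Missing, Float64}}(y)
pop!(pushfirst!(ym, missing))
```

Calling pushfirst!(y, missing) directly on the Float64 vector would throw, because missing cannot be converted to Float64.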