I have this for loop in Julia:
begin
countries_data_labels = ["Canada", "Italy", "China", "United States", "Spain"]
y_axis = DataFrame()
for country in countries_data_labels
new_dataframe = get_country(df, country)
new_dataframe = DataFrame(new_dataframe)
df_rows, df_columns = size(new_dataframe)
new_dataframe_long = stack(new_dataframe, begin:end-4)
y_axis[!, Symbol("$country")] = new_dataframe_long[!, :value]
end
end
and I'm getting this error:
syntax: extra token ")" after end of expression
I commented out the whole body of the for loop except the first line and re-ran the cell after uncommenting each line in turn to see which one was throwing this error. It turned out to be the fourth line in the body:
new_dataframe_long = stack(new_dataframe, begin:end-4)
I don't see why this error should occur; there are no extra closing brackets on this line.
The error occurs because begin and end act as shortcuts for the first and last index only inside square-bracket indexing; in an ordinary function argument they are parsed as block keywords, which is what trips up the parser here. My guess is that you meant:
stack(new_dataframe[begin:end-4, :])
See the MWE below:
julia> df = DataFrame(a=11:16,b=2.5:7.5)
6×2 DataFrame
Row │ a b
│ Int64 Float64
─────┼────────────────
1 │ 11 2.5
2 │ 12 3.5
3 │ 13 4.5
4 │ 14 5.5
5 │ 15 6.5
6 │ 16 7.5
julia> stack(df[begin:end-3, :])
3×3 DataFrame
Row │ a variable value
│ Int64 String Float64
─────┼──────────────────────────
1 │ 11 b 2.5
2 │ 12 b 3.5
3 │ 13 b 4.5
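If the intent was instead to stack all but the last four columns (rather than drop the last four rows), the column range has to be written out explicitly, since the begin/end shortcuts are not available outside indexing brackets. A minimal sketch of that reading, with a made-up new_dataframe:
using DataFrames
new_dataframe = DataFrame(rand(3, 6), :auto)   # stand-in for the real data
# 1:ncol(new_dataframe)-4 plays the role begin:end-4 was presumably meant to
# play: the first ncol-4 columns become the measure variables that get stacked.
new_dataframe_long = stack(new_dataframe, 1:ncol(new_dataframe)-4)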
I've been using the following code to generate histograms by binning one column of a DataFrame and using that bin to calculate a median from another column.
using Plots, Statistics, DataFrames
df = DataFrame(x=repeat(["a","b","c"], 5), y=1:15)
res = combine(groupby(df, :x, sort=true), :y => median)
bar(res.x, res.y_median, legend=false)
The code currently selects individual point values as the bins; would it be possible to apply a range of values per bin manually?
Row │ A B_median
│ Any Float64
─────┼───────────────────
1 │ 1515.74 0.09
2 │ 1517.7 0.81
3 │ 1527.22 10.23
4 │ 1529.88 2.95
5 │ 1530.72 17.32
6 │ 1530.86 15.22
7 │ 1532.26 1.45
8 │ 1532.68 18.51
9 │ 1541.08 1.32
10 │ 1541.22 15.78
11 │ 1541.36 0.12
12 │ 1541.5 13.55
13 │ 1541.92 11.99
14 │ 1542.06 21.14
15 │ 1542.34 10.645
16 │ 1542.62 19.95
17 │ 1542.76 21.0
18 │ 1543.32 20.91
For example, instead of calculating a median for rows 9 to 17 individually, could these rows be bunched together automatically, i.e. 1542.7 +/- 0.7, and a total median value be calculated for that range?
Many thanks!
I assume you want something like this:
julia> using DataFrames, CategoricalArrays, Random, Statistics
julia> Random.seed!(1234);
julia> df = DataFrame(A=rand(20), B=rand(20));
julia> df.A_group = cut(df.A, 4);
julia> res = combine(groupby(df, :A_group, sort=true), :B => median)
4×2 DataFrame
Row │ A_group B_median
│ Cat… Float64
─────┼─────────────────────────────────────────────
1 │ Q1: [0.014908849285099945, 0.532… 0.134685
2 │ Q2: [0.5323651749779272, 0.65860… 0.347995
3 │ Q3: [0.6586057536399257, 0.81493… 0.501756
4 │ Q4: [0.8149335702852593, 0.97213… 0.531899
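If you want to pick the bin edges yourself rather than let cut split at quantiles, cut also accepts an explicit vector of break points. A minimal sketch continuing the session above (the edges are arbitrary and chosen to cover rand(20) values in [0, 1]):
df.A_group = cut(df.A, [0.0, 0.25, 0.5, 0.75, 1.0])   # explicit bin edges
res = combine(groupby(df, :A_group, sort=true), :B => median)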
I am new to Julia. I want to see the first 5 rows of a data frame, but when I run the code below
head(df,5)
I am getting
UndefVarError: head not defined
head is available in e.g. R but not in Julia. First, note that Julia has a nice data frame printing system out of the box that crops the output to fit the terminal window, so you do not need to subset your data frame just to see its head and tail. Here is an example:
julia> df = DataFrame(rand(100, 100), :auto)
100×100 DataFrame
Row │ x1 x2 x3 x4 x5 x6 x7 ⋯
│ Float64 Float64 Float64 Float64 Float64 Float64 Float6 ⋯
─────┼────────────────────────────────────────────────────────────────────────────
1 │ 0.915485 0.176254 0.381047 0.710266 0.597914 0.177617 0.4475 ⋯
2 │ 0.58495 0.551726 0.464703 0.630956 0.476727 0.804854 0.7908
3 │ 0.123723 0.183817 0.986624 0.306091 0.202054 0.148579 0.3433
4 │ 0.558321 0.117478 0.187091 0.482795 0.0718985 0.807018 0.9463
5 │ 0.771561 0.515823 0.830598 0.0742368 0.0831569 0.818487 0.4912 ⋯
6 │ 0.139018 0.182928 0.00129572 0.0439561 0.0929167 0.264609 0.1555
7 │ 0.16076 0.404707 0.0300284 0.665413 0.681704 0.431746 0.3460
8 │ 0.149331 0.132869 0.237446 0.599701 0.149257 0.70753 0.7687
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
93 │ 0.912703 0.98395 0.133307 0.493799 0.76125 0.295725 0.9249 ⋯
94 │ 0.153175 0.339036 0.685642 0.355421 0.365252 0.434604 0.1515
95 │ 0.780877 0.225312 0.511122 0.0506186 0.108054 0.729219 0.5275
96 │ 0.132961 0.348176 0.619712 0.791334 0.052787 0.577896 0.6696
97 │ 0.904386 0.938876 0.988184 0.831708 0.699214 0.627366 0.4320 ⋯
98 │ 0.0295777 0.704879 0.905364 0.142231 0.586725 0.584692 0.9546
99 │ 0.848715 0.177192 0.544509 0.771653 0.472267 0.584306 0.0089
100 │ 0.81299 0.00540772 0.107315 0.323288 0.592159 0.1297 0.3383
94 columns and 84 rows omitted
Now, if you need to fetch the first 5 rows of your data frame and create a new data frame, use the first function, which is defined in Julia Base:
julia> first(df, 5)
5×100 DataFrame
Row │ x1 x2 x3 x4 x5 x6 x7 x ⋯
│ Float64 Float64 Float64 Float64 Float64 Float64 Float64 F ⋯
─────┼────────────────────────────────────────────────────────────────────────────
1 │ 0.915485 0.176254 0.381047 0.710266 0.597914 0.177617 0.447533 0 ⋯
2 │ 0.58495 0.551726 0.464703 0.630956 0.476727 0.804854 0.790866 0
3 │ 0.123723 0.183817 0.986624 0.306091 0.202054 0.148579 0.343316 0
4 │ 0.558321 0.117478 0.187091 0.482795 0.0718985 0.807018 0.946342 0
5 │ 0.771561 0.515823 0.830598 0.0742368 0.0831569 0.818487 0.491206 0 ⋯
93 columns omitted
In general, the design philosophy of DataFrames.jl is to limit the number of new function names as much as possible and reuse what is defined in Julia Base where we can. This is one example of such a situation. This way users have fewer things to learn.
In Julia, the equivalent function is first rather than head.
first is used instead of head, and last is used instead of tail.
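A minimal sketch of both on a toy data frame:
using DataFrames
df = DataFrame(x = 1:10, y = rand(10))
first(df, 5)   # first 5 rows, analogous to R's head(df, 5)
last(df, 5)    # last 5 rows, analogous to R's tail(df, 5)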
I have a very large (~120 GB) CSV file with ~100 columns. I want to iterate through the file line by line using CSV.File and aggregate certain ranges of columns. However, it appears that there is no getindex method for the CSV.Row type. Here is a simplified example:
using CSV
using DataFrames
df = DataFrame(reshape(1:60, 6, 10)) # column names are x1 through x10
CSV.write("test_data.csv", df)
file = CSV.File("test_data.csv")
row1 = first(file)
row1.x3 # Works fine
# Both of these throw method errors:
row1[4]
row1[4:7]
Suppose that for each row I want to sum columns [1:3; 8:10] in a variable a and sum columns 4:7 in a variable b. The final output should be a data frame with columns a and b. Is there an easy way to do this when iterating through CSV.Rows?
Here's a version that lets you avoid having to think about translating to the "drop"/"take" logic:
using CSV, Tables, DataFrames
df = DataFrame(reshape(1:60, 6, 10)) # column names are x1 through x10
CSV.write("test_data.csv", df)
function aggregate_file(path, a_inds, b_inds)
file = CSV.File(path)
a, b = Int64[], Int64[]
# translate the column index ranges into column names once, up front
a_cols = propertynames(file)[a_inds]
b_cols = propertynames(file)[b_inds]
for row in file
# broadcast getproperty over the names to sum each row's selected columns
push!(a, sum(getproperty.(Ref(row), a_cols)))
push!(b, sum(getproperty.(Ref(row), b_cols)))
end
DataFrame(a = a, b = b)
end
julia> aggregate_file("test_data.csv", [1:3; 8:10], 4:7)
6×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 168 │ 112 │
│ 2 │ 174 │ 116 │
│ 3 │ 180 │ 120 │
│ 4 │ 186 │ 124 │
│ 5 │ 192 │ 128 │
│ 6 │ 198 │ 132 │
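Since the real file is around 120 GB, it may also be worth swapping CSV.File for CSV.Rows, which streams one row at a time instead of materializing the whole file. A sketch of the same idea, passing column names rather than indices; it assumes the installed CSV.jl supports the types keyword for CSV.Rows (otherwise the values come back as strings and need an explicit parse):
using CSV, DataFrames
function aggregate_rows(path, a_cols, b_cols)
a, b = Int64[], Int64[]
for row in CSV.Rows(path; types=Int64)   # assumed keyword: parse every column as Int64
push!(a, sum(getproperty.(Ref(row), a_cols)))
push!(b, sum(getproperty.(Ref(row), b_cols)))
end
DataFrame(a = a, b = b)
end
aggregate_rows("test_data.csv", [:x1, :x2, :x3, :x8, :x9, :x10], [:x4, :x5, :x6, :x7])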
You can use Tables.eachcolumn from Tables.jl, since CSV.File supports the Tables interface. Tables.eachcolumn creates an iterator over the columns of a row. Then you can use a combination of Iterators.take and Iterators.drop to access the desired ranges of columns:
using CSV
using Tables
using DataFrames
using Base.Iterators: take
using Base.Iterators: drop
function aggregate_file(path)
file = CSV.File(path)
a, b = Int64[], Int64[]
for row in file
cols = Tables.eachcolumn(row)
sum1to3 = sum(take(cols, 3))
sum8to10 = sum(drop(cols, 7))
push!(a, sum1to3 + sum8to10)
sum4to7 = sum(drop(take(cols, 7), 3))
push!(b, sum4to7)
end
DataFrame(a = a, b = b)
end
julia> aggregate_file("test_data.csv")
6×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 168 │ 112 │
│ 2 │ 174 │ 116 │
│ 3 │ 180 │ 120 │
│ 4 │ 186 │ 124 │
│ 5 │ 192 │ 128 │
│ 6 │ 198 │ 132 │
If you need to aggregate over an arbitrary set of column indices, you could use enumerate on the column iterator:
inds = [2, 4, 7]
sum(j for (i, j) in enumerate(cols) if i in inds)
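The same pattern works on any iterator of values; a standalone illustration with made-up vals and inds:
vals = (10, 20, 30, 40, 50, 60, 70)
inds = [2, 4, 7]
sum(j for (i, j) in enumerate(vals) if i in inds)   # 20 + 40 + 70 = 130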
EDIT:
I did a performance comparison of my answer and @IanFiske's. His version appears to be faster and to use less memory:
julia> using BenchmarkTools
julia> @btime aggregate_file("test_data.csv");
118.687 μs (550 allocations: 24.42 KiB)
julia> @btime aggregate_file("test_data.csv", [1:3; 8:10], 4:7);
62.416 μs (236 allocations: 14.48 KiB)
I am generating TeX files using a template and rendering that template using Mustache.
Firstly I have data in a DataFrame:
│ Row │ label │ score │ max │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 4 │
│ 2 │ 2 │ 3 │ 5 │
│ 3 │ 3 │ 4 │ 6 │
│ 4 │ 4 │ 5 │ 7 │
and a dictionary:
student = Dict( "name" => "John", "surname" => "Smith");
I want to render a template such that both dictionary variables and DataFrame variables are replaced in the template. Rendering works with either a dictionary or a DataFrame on its own, but not with both at the same time.
For example, rendering works with a DataFrame alone using the template tmpl shown below:
tmpl = """
Your marks are:
\\begin{itemize}
{{#:D}}
\\item Mark for question {{:label}} is {{:score}} out of {{:max}}
{{/:D}}
"""
rendered_marks = render(tmpl, D=df );
However, when I add variables such as :name or :surname from the 'student' dictionary, I get error messages:
marks_tmpl = """
Hello \\textbf{ {{:name}}, {{:surname}} }
Your marks are:
\\begin{itemize}
{{#:D}}
\\item Mark for question {{:label}} is {{:score}} out of {{:max}}
{{/:D}}
\\end{itemize}
\\end{document}
"""
rendered_marks = render(marks_tmpl, student, D=df);
What is the right way to do it?
You are not allowed to mix Dict and keyword arguments. The easiest thing is to add the DataFrame to the dictionary.
First, create your DataFrame:
df = DataFrame(label=1:4, score=2:5, max=4:7)
4×3 DataFrame
│ Row │ label │ score │ max │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 4 │
│ 2 │ 2 │ 3 │ 5 │
│ 3 │ 3 │ 4 │ 6 │
│ 4 │ 4 │ 5 │ 7 │
Next, reference your DataFrame in the dictionary for Mustache.jl rendering:
student = Dict( "name" => "John", "surname" => "Smith", "df" => df);
marks_tmpl = """
Hello \\textbf{ {{name}}, {{surname}} }
Your marks are:
\\begin{itemize}
{{#df}}
\\item Mark for question {{:label}} is {{:score}} out of {{:max}}
{{/df}}
\\end{itemize}
"""
In this way, both dictionary and DataFrame variables are rendered:
julia> println(render(marks_tmpl, student))
Hello \textbf{ John, Smith }
Your marks are:
\begin{itemize}
\item Mark for question 1 is 2 out of 4
\item Mark for question 2 is 3 out of 5
\item Mark for question 3 is 4 out of 6
\item Mark for question 4 is 5 out of 7
\end{itemize}
I guess this is what you wanted?
To add to the answer, you could also use an iterable to access the keys of the dictionary, or alternatively a named tuple:
tmpl = """
Hello {{#:E}}\\textbf{ {{:name}}, {{:surname}} }{{/:E}}
Your marks are:
\\begin{itemize}
{{#:D}}
\\item Mark for question {{:label}} is {{:score}} out of {{:max}}
{{/:D}}
\\end{itemize}
\\end{document}
"""
using Mustache
using DataFrames
student = Dict( "name" => "John", "surname" => "Smith");
D = DataFrame(label=[1,2], score=[80,90])
Mustache.render(tmpl, E=(name="John",surname="Doe"),D=D, max=100)
I am attempting to create a lag +1 forward for a particular column in my data frame.
My data is like this:
julia> head(df)
6×9 DataFrames.DataFrame. Omitted printing of 1 columns
│ Row │ Date │ Open │ High │ Low │ Close │ Adj Close │ Volume │ Close_200sma │
├─────┼────────────┼─────────┼─────────┼─────────┼─────────┼───────────┼─────────┼──────────────┤
│ 1 │ 1993-02-02 │ 43.9687 │ 43.9687 │ 43.75 │ 43.9375 │ 27.6073 │ 1003200 │ NaN │
│ 2 │ 1993-02-03 │ 43.9687 │ 44.25 │ 43.9687 │ 44.25 │ 27.8036 │ 480500 │ NaN │
│ 3 │ 1993-02-04 │ 44.2187 │ 44.375 │ 44.125 │ 44.3437 │ 27.8625 │ 201300 │ NaN │
│ 4 │ 1993-02-05 │ 44.4062 │ 44.8437 │ 44.375 │ 44.8125 │ 28.1571 │ 529400 │ NaN │
│ 5 │ 1993-02-08 │ 44.9687 │ 45.0937 │ 44.4687 │ 45.0 │ 28.2749 │ 531500 │ NaN │
│ 6 │ 1993-02-09 │ 44.9687 │ 45.0625 │ 44.7187 │ 44.9687 │ 28.2552 │ 492100 │ NaN │
So this is my attempt at lagging forward. In R I might rep(NA, 1) and then append it to the front of the subsetted data. Here is my Julia:
# Lag data +1 forward
lag = df[1:nrow(df)-1,[:Long]] # shorten vector by 1 (remove last element)
v = Float64[]
v = vec(convert(Array, lag)) # convert df column to vector
z = fill(NaN, 1) # rep NaN, 1 time (add this to front) to push all forward +1
lags = Float64[]
lags= vec[z; [v]] # join both arrays z=NA first , make vector same nrow(df)
When I join the NaN and my array I have a length(lags) of 2.
The data is split in two:
julia> length(lags[2])
6255
I see the longer length when I access the second portion.
If I join the other way (NaN at the end, numbers first), I obtain the correct length:
# try joining other way
lags_flip = [v; [z]]
julia> length(lags_flip)
6256
I can also add this back to my data frame (NaN at the bottom, but I want it at the front):
# add back to data frame
df[:add] = lags_flip
1
1
1
1
1
1
1
1
[NaN]
My question is: when I join the NaN and my data like this:
lags_flip = [v; [z]]
I obtain the correct length. When I do it the other way, NaN first:
lags= [z; [v]]
then it doesn't come out correct.
How can I offset my data +1 forward, placing a NaN in front and adding it back to my df? I feel I'm close but missing something.
EDIT:
On second thought, messing with the length of a column in a DataFrame in place is probably not the best thing to do, and I assume you want a new column anyway. In this case this could be a basic approach:
df[:LagLong] = [missing; df[1:end-1,:Long]]
or if you want NaN (but probably you want missing as explained below):
df[:LagLong] = [NaN; df[1:end-1,:Long]]
PREVIOUS REPLY:
You can do it in place:
julia> x = [1.0,2.0,3.0]
3-element Array{Float64,1}:
1.0
2.0
3.0
julia> pop!(unshift!(x, NaN))
3.0
julia> x
3-element Array{Float64,1}:
NaN
1.0
2.0
Replace x in pop!(unshift!(x, NaN)) by an appropriate column selector like df[:Long].
Note, however, that NaN is not the equivalent of R's NA. In Julia, the equivalent of NA is missing. Now there are two cases:
If your column allows missing values (it will show Union{Missing, [Something]} in showcols), then do the same as above: pop!(unshift!(df[:Long], missing)).
If it does not allow missings, you have two options. The first is to call allowmissing!(df, :Long) to allow missings and then proceed as described above. The other is similar to the approach you proposed: df[:Long] = [missing; df[1:end-1, :Long]] (see the sketch below).
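For reference, a self-contained sketch of that last, non-mutating approach in current Julia 1.x / DataFrames 1.x syntax (pushfirst! has replaced unshift!, and columns are accessed as df.Long or df[!, :Long] rather than df[:Long]); the toy column is made up:
using DataFrames
df = DataFrame(Long = [1.0, 2.0, 3.0, 4.0])   # illustrative data
df.LagLong = [missing; df.Long[1:end-1]]      # lag by one row, missing in front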