I am trying to generate lagged variables in panel data (ID and year). Each ID may cover a different set of years, and the years are sometimes not consecutive within an ID. For example, consider the data set below:
ID  Year  x
1   2001  3
1   2002  1
1   2006  2
1   2007  2
2   2002  1
2   2003  5
3   2006  2
3   2007  2
3   2008  4
And the lagged variable for x that I want to generate is:
ID  Year  x  x_lag
1   2001  3  .
1   2002  1  3
1   2006  2  .
1   2007  2  2
2   2002  1  .
2   2003  5  1
3   2006  2  .
3   2007  2  2
3   2008  4  2
I found some other answers on how to create lagged variables by group, but they do not work for me because some of the IDs in my data set have non-consecutive years (e.g. rows 2-3 in the example above).
So, I am using the function that I have written down below:
function lagged(data,x)
    for c in x
        data[:,c*"_lag"] .= 0.0
    end
    allowmissing!(data)
    for row in eachrow(data)
        for c in x
            if filter(y -> y.id == row.id && y.year == row.year - 1, data)[:,c] == []
                row[c*"_lag"] = missing
            else
                row[c*"_lag"] = filter(y -> y.id == row.id && y.year == row.year - 1, data)[:,c][1]
            end
        end
    end
    return data
end
But it is extremely slow... Is there any faster way to create lagged variables in panel data with discontinuous years? Thanks!
Is this what you need?
julia> df = DataFrame(ID=[1,1,1,1,2,2,3,3,3],
Year=[2001, 2002, 2006, 2007, 2002, 2003, 2006, 2007, 2008],
x=[3,1,2,2,1,5,2,2,4])
9×3 DataFrame
Row │ ID Year x
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2001 3
2 │ 1 2002 1
3 │ 1 2006 2
4 │ 1 2007 2
5 │ 2 2002 1
6 │ 2 2003 5
7 │ 3 2006 2
8 │ 3 2007 2
9 │ 3 2008 4
julia> function lag_v(year, x)
           result = missings(eltype(x), length(x))
           length(x) < 2 && return result
           last = first(year)
           for i in 2:length(x)
               current = year[i]
               current-last == 1 && (result[i] = x[i-1])
               last = current
           end
           return result
       end
lag_v (generic function with 1 method)
julia> transform(groupby(df, :ID), [:Year, :x] => lag_v => :x_lag)
9×4 DataFrame
Row │ ID Year x x_lag
│ Int64 Int64 Int64 Int64?
─────┼──────────────────────────────
1 │ 1 2001 3 missing
2 │ 1 2002 1 3
3 │ 1 2006 2 missing
4 │ 1 2007 2 2
5 │ 2 2002 1 missing
6 │ 2 2003 5 1
7 │ 3 2006 2 missing
8 │ 3 2007 2 2
9 │ 3 2008 4 2
and is it fast enough?
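To see what the per-group function does without DataFrames, here is a minimal sketch on plain vectors. The name lag_consecutive is made up for illustration, and the sketch assumes the rows of a group are already sorted by year, as in the example data.

```julia
# Hypothetical helper (not from the thread): lag x by one position,
# but only where the previous observation is exactly one year earlier.
# Assumes `year` is sorted in ascending order within the group.
function lag_consecutive(year::AbstractVector, x::AbstractVector)
    result = Vector{Union{Missing, eltype(x)}}(missing, length(x))
    for i in 2:length(x)
        if year[i] - year[i-1] == 1
            result[i] = x[i-1]
        end
    end
    return result
end

# Rows for ID 1 in the example data
years = [2001, 2002, 2006, 2007]
xs = [3, 1, 2, 2]
lagged = lag_consecutive(years, xs)
println(lagged)
```

Each group is scanned once, so the cost grows linearly with the number of rows, whereas the function in the question calls filter over the whole table for every row.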
An approach using ShiftedArrays
Data
julia> using ShiftedArrays
julia> using DataFrames
julia> df = DataFrame(ID = [1, 1, 1, 1, 2, 2, 3, 3, 3],
Year = [2001, 2002, 2006, 2007, 2002, 2003, 2006, 2007, 2008],
x = [3, 1, 2, 2, 1, 5, 2, 2, 4])
Add a helper column that checks whether the lagged Year plus one equals Year, then
replace the missings with false, create and fill the x_lag column, and finally remove the helper column.
julia> function do_lag!(frame)
           for grp in groupby(frame, :ID)
               grp.is = (ShiftedArrays.lag(grp.Year, 1) .+ 1) .== grp.Year
           end
           frame.is[ismissing.(frame.is)] .= false
           frame.x_lag = Vector{Union{Int, String}}(repeat(["."], size(frame)[1]))
           frame.x_lag[findall(frame.is)] =
               ShiftedArrays.lag(frame.x)[findall(frame.is)]
           select!(frame, Not([:is]))
       end
julia> do_lag!(df)
9×4 DataFrame
Row │ ID Year x x_lag
│ Int64 Int64 Int64 Union…
─────┼─────────────────────────────
1 │ 1 2001 3 .
2 │ 1 2002 1 3
3 │ 1 2006 2 .
4 │ 1 2007 2 2
5 │ 2 2002 1 .
6 │ 2 2003 5 1
7 │ 3 2006 2 .
8 │ 3 2007 2 2
9 │ 3 2008 4 2
And another option, without defining a function, but using a leftjoin with next-year version of df:
julia> df = DataFrame(ID=[1,1,1,1,2,2,3,3,3],
Year=[2001, 2002, 2006, 2007, 2002, 2003, 2006, 2007, 2008],
x=[3,1,2,2,1,5,2,2,4]);
julia> leftjoin(df, select(df, :ID,
:Year => ByRow(Base.Fix2(+,1)) => :Year,
:x => :lag_x); on=[:ID, :Year])
9×4 DataFrame
Row │ ID Year x lag_x
│ Int64 Int64 Int64 Int64?
─────┼──────────────────────────────
1 │ 1 2002 1 3
2 │ 1 2007 2 2
3 │ 2 2003 5 1
4 │ 3 2007 2 2
5 │ 3 2008 4 2
6 │ 1 2001 3 missing
7 │ 1 2006 2 missing
8 │ 2 2002 1 missing
9 │ 3 2006 2 missing
It's easier to see what this join does after sorting (though not necessary):
julia> sort!(leftjoin(df,
select(df, :ID, :Year => ByRow(Base.Fix2(+,1)) => :Year,
:x => :lag_x); on=[:ID, :Year]), [:ID, :Year])
9×4 DataFrame
Row │ ID Year x lag_x
│ Int64 Int64 Int64 Int64?
─────┼──────────────────────────────
1 │ 1 2001 3 missing
2 │ 1 2002 1 3
3 │ 1 2006 2 missing
4 │ 1 2007 2 2
5 │ 2 2002 1 missing
6 │ 2 2003 5 1
7 │ 3 2006 2 missing
8 │ 3 2007 2 2
9 │ 3 2008 4 2
This solution has the benefit of being easier to use when eventually the underlying DataFrame storage becomes a database (like SQLite, or Postgres), once the OP's operation scales to 100M users :-)
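The idea behind the join (match each row with the row of the same ID one year earlier) can also be sketched in plain Julia with a Dict keyed on (ID, Year). This is only an illustration of the lookup, not a replacement for leftjoin, and the variable names are made up.

```julia
ids   = [1, 1, 1, 1, 2, 2, 3, 3, 3]
years = [2001, 2002, 2006, 2007, 2002, 2003, 2006, 2007, 2008]
xs    = [3, 1, 2, 2, 1, 5, 2, 2, 4]

# Index every observation by its (ID, Year) key.
lookup = Dict(zip(zip(ids, years), xs))

# For each row, fetch x from the same ID one year earlier;
# fall back to missing when no such observation exists.
lag_x = [get(lookup, (id, yr - 1), missing) for (id, yr) in zip(ids, years)]
println(lag_x)
```

Unlike the join output, the result keeps the original row order, and it does not require the data to be sorted by year.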
Why don't you just do
transform!(groupby(df, :id), :x => lag, :year => lag)
And then only keep the x_lag values where year minus year_lag is 1?
EDIT
Now that I'm not on my phone here's what I meant in DataFrames minilanguage:
julia> using DataFrames; import ShiftedArrays: lag
julia> df = DataFrame(ID=[1,1,1,1,2,2,3,3,3],
Year=[2001, 2002, 2006, 2007, 2002, 2003, 2006, 2007, 2008],
x=[3,1,2,2,1,5,2,2,4]);
julia> transform!(groupby(df, :ID),
[:Year, :x] => ((y, x) -> ifelse.(coalesce.(y .- lag(y) .== 1, false), lag(x), missing)) => :x_lag)
9×4 DataFrame
Row │ ID Year x x_lag
│ Int64 Int64 Int64 Int64?
─────┼──────────────────────────────
1 │ 1 2001 3 missing
2 │ 1 2002 1 3
3 │ 1 2006 2 missing
4 │ 1 2007 2 2
5 │ 2 2002 1 missing
6 │ 2 2003 5 1
7 │ 3 2006 2 missing
8 │ 3 2007 2 2
9 │ 3 2008 4 2
I haven't benchmarked all the solutions in this thread, but testing this on a 10-million row DataFrame this runs in about 1 second on my laptop, which seems acceptable?
julia> df_test = DataFrame(id = repeat(1:1_000_000, inner = 10), year = reduce(vcat, [sort(rand(2000:2015, 10)) for _ ∈ 1:1_000_000]), x = rand(10_000_000));
julia> @time transform!(groupby(df_test, :id),
           [:year, :x] => ((y, x) -> ifelse.(coalesce.(y .- lag(y) .== 1, false), lag(x), missing)) => :x_lag);
1.139166 seconds (9.33 M allocations: 940.402 MiB, 7.77% gc time, 16.18% compilation time)
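The broadcast expression in that transform! call is easier to follow when split into steps on a single group. In this sketch, shift1 is a hypothetical stand-in for ShiftedArrays.lag so that the example runs without any packages.

```julia
# Hypothetical stand-in for ShiftedArrays.lag on a plain vector:
# shift everything down by one position and pad with missing.
shift1(v) = vcat(missing, v[1:end-1])

# Rows for ID 1 in the example data
y = [2001, 2002, 2006, 2007]
x = [3, 1, 2, 2]

gap = y .- shift1(y)                           # missing, 1, 4, 1
is_consecutive = coalesce.(gap .== 1, false)   # false, true, false, true
x_lag = ifelse.(is_consecutive, shift1(x), missing)
println(x_lag)
```

The coalesce step matters because the comparison for the first row of a group yields missing rather than false, and ifelse needs a Bool condition.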
Related
Given the dataframe below, I want to filter records that shares the same q2, id_q, check_id and keep only the ones with the highest value.
input dataframe:
q1      q2        id_q  check_id  value
sdfsdf  dfsdfsdf  10    10        90
hdfhhd  dfsdfsdf  10    10        80
Two rows share the same q2, id_q and check_id but have different values: 90 and 80.
For each combination of q2, id_q and check_id, I want to return only the row with the highest value.
In other words, I want to drop duplicates with respect to q2, id_q and check_id and keep the row with the highest value in the value column.
Desired Output:
q1      q2        id_q  check_id  value
sdfsdf  dfsdfsdf  10    10        90
For this case, this code seems to be shorter than the ones referenced in other answers:
Suppose you have
julia> df = DataFrame(a=["a","a","b","b","b","b"], b=[1,1,2,2,3,3],c=11:16,notimportant=rand(6))
6×4 DataFrame
Row │ a b c notimportant
│ String Int64 Int64 Float64
─────┼────────────────────────────────────
1 │ a 1 11 0.93785
2 │ a 1 12 0.877777
3 │ b 2 13 0.845306
4 │ b 2 14 0.477606
5 │ b 3 15 0.722569
6 │ b 3 16 0.122807
Then you can just do:
julia> combine(groupby(df, [:a, :b]), :c => maximum => :c)
3×3 DataFrame
Row │ a b c
│ String Int64 Int64
─────┼──────────────────────
1 │ a 1 12
2 │ b 2 14
3 │ b 3 16
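Note that the combine call above returns only the grouping columns and the maximum. If the whole row is needed (including q1 in the question's data), one possible approach is to track the best row per key. The sketch below does this in plain Julia on the question's two rows; representing rows as NamedTuples is an assumption made for illustration.

```julia
# The two rows from the question, as NamedTuples (illustrative representation).
rows = [
    (q1 = "sdfsdf", q2 = "dfsdfsdf", id_q = 10, check_id = 10, value = 90),
    (q1 = "hdfhhd", q2 = "dfsdfsdf", id_q = 10, check_id = 10, value = 80),
]

# Keep, for every (q2, id_q, check_id) key, the row with the largest value.
best = Dict{Tuple{String, Int, Int}, eltype(rows)}()
for r in rows
    key = (r.q2, r.id_q, r.check_id)
    if !haskey(best, key) || r.value > best[key].value
        best[key] = r
    end
end

kept = collect(values(best))
println(kept)
```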
Consider
import TypedTables as TT
TT.Table(this=[1,2,3])
Fine. Now instead I have
a = "this"
b = [1,2,3]
How do I create the same table from a and b? Going via a NamedTuple is a bit roundabout but seems to work:
TT.Table((; Symbol(a) =>b))
Is a less roundabout approach available?
You can skip NamedTuple construction and just pass this as kwargs:
julia> Table(;Symbol(a) =>b)
Table with 1 column and 3 rows:
this
┌─────
1 │ 1
2 │ 2
3 │ 3
Regarding the multi-column comments:
julia> as = ["this", "that"];
julia> bs = [[1,2,3],[4,5,6]];
julia> Table(; (Symbol.(as) .=> bs)...)
Table with 2 columns and 3 rows:
this that
┌───────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
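Both forms rely on Julia accepting Symbol => value pairs after the semicolon. The same mechanism can be checked on a plain NamedTuple, without TypedTables:

```julia
a = "this"
b = [1, 2, 3]

# A pair with a computed Symbol key after `;` becomes a named field.
nt = (; Symbol(a) => b)

as = ["this", "that"]
bs = [[1, 2, 3], [4, 5, 6]]

# Splatting a vector of Symbol => value pairs builds several fields at once.
nt2 = (; (Symbol.(as) .=> bs)...)

println(nt)
println(nt2)
```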
I have column in a dataframe like this:
df = DataFrame(:num=>rand(0:10,20))
From df I want to make two other dataframes:
df1 = counter(df[!, :num])
To have the frequencies of each integer from 0 to 10. But I need the values sorted from 0 to 10:
0=>2
1=>3
2=>7
so on..
Then I want a new dataframe df2 where:
column_p = sum of occurrences of 9 and 10
column_n = sum of occurrences of 7 and 8
column_d = sum of occurrences of 0 to 6
I managed to get the first part, even though the result is not sorted but this last dataframe has been a challenge to my julia skills (still learning)
UPDATE 1
I managed to write this function:
function f(dff)
    @eachrow dff begin
        if :num >=9
            :class = "Positive"
        elseif :num >=7
            :class = "Neutral"
        elseif :num <7
            :class = "Negative"
        end
    end
end
This function does half of what I want and fails if there's no :class column in the dataframe.
Now I want to count how many positives, neutrals and negatives there are to do this operation:
(positives - negatives) / (negatives+neutrals+positives)
The first part is:
julia> using DataFrames, Random
julia> Random.seed!(1234);
julia> df = DataFrame(:num=>rand(0:10,20));
julia> df1 = combine(groupby(df, :num, sort=true), nrow)
10×2 DataFrame
Row │ num nrow
│ Int64 Int64
─────┼──────────────
1 │ 0 1
2 │ 2 2
3 │ 3 2
4 │ 4 2
5 │ 5 1
6 │ 6 2
7 │ 7 2
8 │ 8 4
9 │ 9 1
10 │ 10 3
I was not sure what you wanted in the second step, but here are two ways to achieve the third step using either df1 or df:
julia> (sum(df1.nrow[df1.num .>= 9]) - sum(df1.nrow[df1.num .<= 6])) / sum(df1.nrow)
-0.3
julia> (count(>=(9), df.num) - count(<=(6), df.num)) / nrow(df)
-0.3
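For completeness, here is a sketch of the classification and the score in plain Julia. The vector nums is made up so that its frequencies match the df1 output above; classify is a hypothetical helper that follows the thresholds from the question (9-10 positive, 7-8 neutral, 0-6 negative).

```julia
# Made-up data whose frequencies match the df1 table above.
nums = [0, 2, 2, 3, 3, 4, 4, 5, 6, 6, 7, 7, 8, 8, 8, 8, 9, 10, 10, 10]

# Hypothetical helper: thresholds taken from the question.
classify(n) = n >= 9 ? "Positive" : n >= 7 ? "Neutral" : "Negative"

# Frequencies sorted from 0 to 10, including values that never occur.
freq = [k => count(==(k), nums) for k in 0:10]

classes  = classify.(nums)
positive = count(==("Positive"), classes)
neutral  = count(==("Neutral"), classes)
negative = count(==("Negative"), classes)

score = (positive - negative) / (negative + neutral + positive)
println(score)
```

Unlike the groupby version, freq also lists values with zero occurrences (here 1), which may or may not be what is wanted.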
Suppose I have the following data.
dt = DataFrame(
id = [1,1,1,1,1,2,2,2,2,2,],
t = [1,2,3,4,5, 1,2,3,4,5],
val = randn(10)
)
Row │ id t val
│ Int64 Int64 Float64
─────┼─────────────────────────
1 │ 1 1 0.546673
2 │ 1 2 -0.817519
3 │ 1 3 0.201231
4 │ 1 4 0.856569
5 │ 1 5 1.8941
6 │ 2 1 0.240532
7 │ 2 2 -0.431824
8 │ 2 3 0.165137
9 │ 2 4 1.22958
10 │ 2 5 -0.424504
I want to create a dummy variable that indicates whether val > 0.5 at any point from t to t+2.
For instance, I want a new variable val_gr_0.5 like the one in the table below.
Could someone help me with how to do this?
Row │ id t val val_gr_0.5
│ Int64 Int64 Float64 Float64
─────┼─────────────────────────
1 │ 1 1 0.546673 0 (search t:1 to 3)
2 │ 1 2 -0.817519 1 (search t:2 to 4)
3 │ 1 3 0.201231 1 (search t:3 to 5)
4 │ 1 4 0.856569 missing
5 │ 1 5 1.8941 missing
6 │ 2 1 0.240532 0 (search t:1 to 3)
7 │ 2 2 -0.431824 1 (search t:2 to 4)
8 │ 2 3 0.165137 1 (search t:3 to 5)
9 │ 2 4 1.22958 missing
10 │ 2 5 -0.424504 missing
julia> using DataFramesMeta
julia> function checkvals(subsetdf)
           vals = subsetdf[!, :val]
           length(vals) < 3 && return missing
           any(vals .> 0.5)
       end
checkvals (generic function with 1 method)
julia> for sdf in groupby(dt, :id)
           transform!(sdf, :t => ByRow(t -> checkvals(@subset(sdf, @byrow t <= :t <= t+2))) => :val_gr)
       end
julia> dt
10×4 DataFrame
Row │ id t val val_gr
│ Int64 Int64 Float64 Bool?
─────┼──────────────────────────────────
1 │ 1 1 0.0619327 false
2 │ 1 2 0.278406 false
3 │ 1 3 -0.595824 true
4 │ 1 4 0.0466594 missing
5 │ 1 5 1.08579 missing
6 │ 2 1 -1.57656 true
7 │ 2 2 0.17594 true
8 │ 2 3 0.865381 true
9 │ 2 4 0.972024 missing
10 │ 2 5 1.54641 missing
First define a function:
function run_max(x, window)
    window -= 1
    res = missings(eltype(x), length(x))
    for i in 1:length(x)-window
        res[i] = maximum(view(x, i:i+window))
    end
    res
end
Then use it with DataFrames.jl:
dt.new = dt.val .> 0.5
transform!(groupby(dt,1), :new => x->run_max(x, 3))
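The rolling-window idea can be tried on a single group without any packages. In this sketch, rolling_any is a hypothetical variant of run_max that works on Bool flags directly, and the values in val are made up for illustration.

```julia
# Hypothetical variant of the rolling-window function above:
# true if any flag in positions i..i+window-1 is true,
# missing where the window would run past the end of the vector.
function rolling_any(flags::AbstractVector{Bool}, window::Int)
    res = Vector{Union{Missing, Bool}}(missing, length(flags))
    for i in 1:(length(flags) - window + 1)
        res[i] = any(view(flags, i:(i + window - 1)))
    end
    return res
end

val = [0.1, -0.8, 0.2, 0.9, 1.9]   # made-up values for one id
flags = val .> 0.5                  # false, false, false, true, true
result = rolling_any(flags, 3)
println(result)
```

Like run_max, this assumes the rows of each group are sorted by t with no gaps, so that three consecutive positions correspond to t, t+1 and t+2.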
Suppose my DataFrame has two columns v and g. First, I grouped the DataFrame by column g and calculated the sum of the column v. Second, I used the function maximum to retrieve the maximum sum. I am wondering whether it is possible to retrieve the value in one step? Thanks.
julia> using Random
julia> Random.seed!(1)
TaskLocalRNG()
julia> dt = DataFrame(v = rand(15), g = rand(1:3, 15))
15×2 DataFrame
Row │ v g
│ Float64 Int64
─────┼──────────────────
1 │ 0.0491718 3
2 │ 0.119079 2
3 │ 0.393271 2
4 │ 0.0240943 3
5 │ 0.691857 2
6 │ 0.767518 2
7 │ 0.087253 1
8 │ 0.855718 1
9 │ 0.802561 3
10 │ 0.661425 1
11 │ 0.347513 2
12 │ 0.778149 3
13 │ 0.196832 1
14 │ 0.438058 2
15 │ 0.0113425 1
julia> gdt = combine(groupby(dt, :g), :v => sum => :v)
3×2 DataFrame
Row │ g v
│ Int64 Float64
─────┼────────────────
1 │ 1 1.81257
2 │ 2 2.7573
3 │ 3 1.65398
julia> maximum(gdt.v)
2.7572966050340257
I am not sure if that is what you mean but you can retrieve the values of g and v in one step using the following command:
julia> v, g = findmax(x-> (x.v, x.g), eachrow(gdt))[1]
(2.7572966050340257, 2)
DataFramesMeta.jl has an @by macro:
julia> @by(dt, :g, :sv = sum(:v))
3×2 DataFrame
Row │ g sv
│ Int64 Float64
─────┼────────────────
1 │ 1 1.81257
2 │ 2 2.7573
3 │ 3 1.65398
which gives you somewhat neater syntax for the first part of this.
With that, you can do either:
julia> @by(dt, :g, :sv = sum(:v)).sv |> maximum
2.7572966050340257
or (IMO more readably):
julia> @chain dt begin
           @by(:g, :sv = sum(:v))
           maximum(_.sv)
       end
2.7572966050340257
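If you want to check the grouped-sum-then-maximum logic independently of DataFrames, a rough sketch in plain Julia is to accumulate the sums in a Dict and take the maximum over its pairs. The vectors below are the v and g columns as printed in the question, so the sums agree with the table only up to display rounding.

```julia
# Columns v and g as printed in the question (rounded display values).
v = [0.0491718, 0.119079, 0.393271, 0.0240943, 0.691857,
     0.767518, 0.087253, 0.855718, 0.802561, 0.661425,
     0.347513, 0.778149, 0.196832, 0.438058, 0.0113425]
g = [3, 2, 2, 3, 2, 2, 1, 1, 3, 1, 2, 3, 1, 2, 1]

# Accumulate the sum of v for every group g.
sums = Dict{Int, Float64}()
for (gi, vi) in zip(g, v)
    sums[gi] = get(sums, gi, 0.0) + vi
end

# Tuples compare element by element, so this picks the largest sum
# and carries its group along with it.
best_sum, best_g = maximum(p -> (last(p), first(p)), sums)
println((best_sum, best_g))
```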