Looking for a function that works like by but doesn't collapse my DataFrame. In R I would use dplyr's groupby(b) %>% mutate(x1 = sum(a)). I don't want to lose information from the table such as that in variable :c.
mydf = DataFrame(a = 1:4, b = repeat(1:2,2), c=4:-1:1)
bypreserve(mydf, :b, x -> sum(x.a))
│ Row │ a │ b │ c │ x1
│ │ Int64 │ Int64 │ Int64 │Int64
├─────┼───────┼───────┼───────┤───────
│ 1 │ 1 │ 1 │ 4 │ 4
│ 2 │ 2 │ 2 │ 3 │ 6
│ 3 │ 3 │ 1 │ 2 │ 4
│ 4 │ 4 │ 2 │ 1 │ 6
Adding this functionality is discussed, but I would say that it will take several months to be shipped (the general idea is to allow select to have groupby keyword argument + also add transform function that will work like select but preserve columns of the source data frame).
For now the solution is to use join after by:
join(mydf, by(mydf, :b, x1 = :a => sum), on=:b)
Related
Is there a way to add a row to an existing dataframe at a specific index?
E.g. you have a dataframe with 3 rows and 1 columns
df = DataFrame(x = [2,3,4])
X
2
3
4
any way to do the following:
insert!(df, 1, [1])
in order to get
X
1
2
3
4
I know that i could probably concat two dataframes df = [df1; df2] but i was hoping to avoid garbaging a large DF whenever i want to insert a row.
In DataFrames 0.21.4 just write (I give two options: one, with broadcasting, is short but creates a temporary object; the other, with foreach is longer to write but allocates a bit less):
julia> df = DataFrame(x = [1,2,3], y = ["a", "b", "c"])
3×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
│ 3 │ 3 │ c │
julia> insert!.(eachcol(df), 2, [4, "d"]) # creates an temporary object but is terse
2-element Array{Array{T,1} where T,1}:
[1, 4, 2, 3]
["a", "d", "b", "c"]
julia> df
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ a │
│ 2 │ 4 │ d │
│ 3 │ 2 │ b │
│ 4 │ 3 │ c │
julia> foreach((c, v) -> insert!(c, 2, v), eachcol(df), [4, "d"]) # does not create a temporary object
julia> df
5×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ a │
│ 2 │ 4 │ d │
│ 3 │ 4 │ d │
│ 4 │ 2 │ b │
│ 5 │ 3 │ c │
note that the above operation is not atomic (it may corrupt your data frame if the type of the element you want to add does not match the element type allowed in the column).
If you want a safe operation that will provide automatic promotion use this:
julia> df = DataFrame(x = [1,2,3], y = ["a", "b", "c"])
3×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
│ 3 │ 3 │ c │
julia> [view(df, 1:1, :); DataFrame(names(df) .=> ['a', 'b']); view(df, 3:3, :)]
3×2 DataFrame
│ Row │ x │ y │
│ │ Any │ Any │
├─────┼─────┼─────┤
│ 1 │ 1 │ a │
│ 2 │ 'a' │ 'b' │
│ 3 │ 3 │ c │
(it is a bit slower though and creates a new data frame)
Deprecated
The original answer is here. It was valid for Julia before 1.0 release (and DataFrames.jl version that was compatible with it).
I guess you want to do it in place. Then you can use insert! function like this:
julia> df = DataFrame(x = [1,2,3], y = ["a", "b", "c"])
3×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───┼───┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
│ 3 │ 3 │ c │
julia> foreach((v,n) -> insert!(df[n], 2, v), [4, "d"], names(df))
julia> df
4×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───┼───┤
│ 1 │ 1 │ a │
│ 2 │ 4 │ d │
│ 3 │ 2 │ b │
│ 4 │ 3 │ c │
Of course you have to make sure that you have the right number of columns in the added collection.
If you accept using unexported internal structure of a DataFrame you can do it even simpler:
julia> df = DataFrame(x = [1,2,3], y = ["a", "b", "c"])
3×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───┼───┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
│ 3 │ 3 │ c │
julia> insert!.(df.columns, 2, [4, "d"])
2-element Array{Array{T,1} where T,1}:
[1, 4, 2, 3]
String["a", "d", "b", "c"]
julia> df
4×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───┼───┤
│ 1 │ 1 │ a │
│ 2 │ 4 │ d │
│ 3 │ 2 │ b │
│ 4 │ 3 │ c │
I want to plot a data, but it's x-axis is time, for the missing value in each half-hour, I wish to fill zero.
using CSV, DataFrames, Dates
s="ts, v
2020-01-01T01:00:00, 3
2020-01-01T04:00:00, 6
2020-01-01T05:00:00, 1"
d=CSV.read(IOBuffer(s))
I expect to expand the d like d2
s2="ts, v
2020-01-01T01:00:00, 3
2020-01-01T01:30:00, 0
2020-01-01T02:00:00, 0
2020-01-01T02:30:00, 0
2020-01-01T03:00:00, 0
2020-01-01T03:30:00, 0
2020-01-01T04:00:00, 6
2020-01-01T04:30:00, 0
2020-01-01T05:00:00, 1"
d2=CSV.read(IOBuffer(s2))
I would probably do the following:
# Create half-hourly data frame with zeros from first to last observation
julia> df = DataFrame(ts = minimum(d.ts):Minute(30):maximum(d.ts), v_filled = 0);
# Join the existing observations onto this dataframe
julia> df = join(df, d, on = :ts, kind = :left);
# Replace zeros with observations where available
julia> df[.!ismissing.(df.v), :v_filled] = df[.!ismissing.(df.v), :v];
julia> df
9×3 DataFrame
│ Row │ ts │ v_filled │ v │
│ │ DateTime │ Int64 │ Int64⍰ │
├─────┼─────────────────────┼──────────┼─────────┤
│ 1 │ 2020-01-01T01:00:00 │ 3 │ 3 │
│ 2 │ 2020-01-01T01:30:00 │ 0 │ missing │
│ 3 │ 2020-01-01T02:00:00 │ 0 │ missing │
│ 4 │ 2020-01-01T02:30:00 │ 0 │ missing │
│ 5 │ 2020-01-01T03:00:00 │ 0 │ missing │
│ 6 │ 2020-01-01T03:30:00 │ 0 │ missing │
│ 7 │ 2020-01-01T04:00:00 │ 6 │ 6 │
│ 8 │ 2020-01-01T04:30:00 │ 0 │ missing │
│ 9 │ 2020-01-01T05:00:00 │ 1 │ 1 │
Ok lets say I have a series of arrays:
data_one = ["dog","cat"]
data_two = [1,2]
data_three = ["1/1/2018","1/2/2018"]
I build them into a matrix
m = hcat(data_one,data_two,data_three)
and convert to a df
df = DataFrame(m)
showcols(df)
for output:
julia> showcols(df)
3×5 DataFrames.DataFrame
│ Row │ variable │ eltype │ nmissing │ first │ last │
├─────┼──────────┼────────┼──────────┼──────────┼──────────┤
│ 1 │ x1 │ Any │ 0 │ dog │ cat │
│ 2 │ x2 │ Any │ 0 │ 1 │ 2 │
│ 3 │ x3 │ Any │ 0 │ 1/1/2018 │ 1/2/2018 │
When I build this data frame - how may I specify the types of each column??
col1 should be String
col2 = Int
col3 = String
You can do it only indirectly through the following DataFrame constructor (of course you could pass [String, Int, String] as a variable here):
DataFrame([([String, Int, String][i]).(m[:,i]) for i in 1:size(m, 2)])
and if you want to use automatic detection of column type you can use:
DataFrame([[v for v in m[:,i]] for i in 1:size(m, 2)])
In R we can convert NA to 0 with:
df[is.na(df)] <- 0
This works for single columns:
df[ismissing.(df[:col]), :col] = 0
There a way for the full df?
I don't think there's such a function in DataFrames.jl yet.
But you can hack your way around it by combining colwise and recode. I'm also providing a reproducible example here, in case someone wants to iterate on this answer:
julia> using DataFrames
julia> df = DataFrame(a = [missing, 5, 5],
b = [1, missing, missing])
3×2 DataFrames.DataFrame
│ Row │ a │ b │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 1 │
│ 2 │ 5 │ missing │
│ 3 │ 5 │ missing │
julia> DataFrame(colwise(col -> recode(col, missing=>0), df), names(df))
3×2 DataFrames.DataFrame
│ Row │ a │ b │
├─────┼───┼───┤
│ 1 │ 0 │ 1 │
│ 2 │ 5 │ 0 │
│ 3 │ 5 │ 0 │
This is a bit ugly as you have to reassign the dataframe column names.
Maybe a simpler way to convert all missing values in a DataFrame is to just use list comprehension:
[df[ismissing.(df[i]), i] = 0 for i in names(df)]
Let's say I have a pre-sized dataframe and I want to assign values to every row. (Therfore push! and append! are out of game)
length = 10
df = DataFrame(id = Array(Int64,length),value = Array(String,length))
for n in 1:10
df[n,:id] = n
df[n,:value] = "random text"
end
The above code shows how to do that cell by cell for each iterated row.
Is there a solution to add an entire row at once for each iteration?
Because
for n in 1:10
df[n] = [n "random text"]
end
throws a wrong type exception.
To access a row the syntax is [row,:] rather than just row.
Also you'll need to convert the row to a DataFrame first.
for n in 1:10
df[n,:] = DataFrame([n "random text2"])
end
You can roll your own function to set a row quite easily:
julia> function setrow!(df, rowi, val)
for j in eachindex(val)
df[rowi, j] = val[j]
end
df
end
setrow! (generic function with 1 method)
julia> setrow!(df, 1, [1, "a"])
10×2 DataFrames.DataFrame
│ Row │ id │ value │
├─────┼─────────────────┼──────────┤
│ 1 │ 1 │ "a" │
│ 2 │ 140525709817424 │ "#undef" │
│ 3 │ 140525709817488 │ "#undef" │
│ 4 │ 140525709817072 │ "#undef" │
│ 5 │ 140525709817104 │ "#undef" │
│ 6 │ 140525709817136 │ "#undef" │
│ 7 │ 140525709817168 │ "#undef" │
│ 8 │ 140525709817200 │ "#undef" │
│ 9 │ 140525709817232 │ "#undef" │
│ 10 │ 0 │ "#undef" │
Ideally, you might be able to use the broadcasting assignment syntax:
df[2, :] .= [2, "b"]
But that appears to be not implemented (perhaps for good reason, I'm not sure).