Let's say I have a pre-sized DataFrame and I want to assign values to every row. (Therefore push! and append! are out of the question.)
length = 10
df = DataFrame(id = Array(Int64,length),value = Array(String,length))
for n in 1:10
df[n,:id] = n
df[n,:value] = "random text"
end
The above code shows how to do that cell by cell for each iterated row.
Is there a solution to add an entire row at once for each iteration?
Because
for n in 1:10
df[n] = [n "random text"]
end
throws a wrong type exception.
To access a row the syntax is [row,:] rather than just row.
Also you'll need to convert the row to a DataFrame first.
for n in 1:10
df[n,:] = DataFrame([n "random text2"])
end
You can roll your own function to set a row quite easily:
julia> function setrow!(df, rowi, val)
for j in eachindex(val)
df[rowi, j] = val[j]
end
df
end
setrow! (generic function with 1 method)
julia> setrow!(df, 1, [1, "a"])
10×2 DataFrames.DataFrame
│ Row │ id │ value │
├─────┼─────────────────┼──────────┤
│ 1 │ 1 │ "a" │
│ 2 │ 140525709817424 │ "#undef" │
│ 3 │ 140525709817488 │ "#undef" │
│ 4 │ 140525709817072 │ "#undef" │
│ 5 │ 140525709817104 │ "#undef" │
│ 6 │ 140525709817136 │ "#undef" │
│ 7 │ 140525709817168 │ "#undef" │
│ 8 │ 140525709817200 │ "#undef" │
│ 9 │ 140525709817232 │ "#undef" │
│ 10 │ 0 │ "#undef" │
Ideally, you might be able to use the broadcasting assignment syntax:
df[2, :] .= [2, "b"]
But that does not appear to be implemented (perhaps for good reason, I'm not sure).
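For what it's worth, in current DataFrames.jl releases (1.x, as far as I can tell) plain assignment to df[row, :] accepts a Tuple or Vector with one element per column, so no helper is needed. A minimal sketch, assuming a recent DataFrames.jl and pre-filling the columns with placeholder values instead of uninitialized memory:

```julia
using DataFrames

n = 10
# Pre-size the columns; zeros and empty strings are placeholders to be overwritten.
df = DataFrame(id = zeros(Int, n), value = fill("", n))

for i in 1:n
    # One element per column, written in place into row i.
    df[i, :] = (i, "random text")
end
```

The number of elements on the right-hand side has to match the number of columns, otherwise the assignment throws.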
Is there a way to add a row to an existing dataframe at a specific index?
E.g. you have a DataFrame with 3 rows and 1 column
df = DataFrame(x = [2,3,4])
X
2
3
4
any way to do the following:
insert!(df, 1, [1])
in order to get
X
1
2
3
4
I know that I could probably concatenate two DataFrames with df = [df1; df2], but I was hoping to avoid allocating and discarding a large DataFrame every time I want to insert a row.
In DataFrames 0.21.4 just write (I give two options: one, with broadcasting, is short but creates a temporary object; the other, with foreach is longer to write but allocates a bit less):
julia> df = DataFrame(x = [1,2,3], y = ["a", "b", "c"])
3×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
│ 3 │ 3 │ c │
julia> insert!.(eachcol(df), 2, [4, "d"]) # creates a temporary object but is terse
2-element Array{Array{T,1} where T,1}:
[1, 4, 2, 3]
["a", "d", "b", "c"]
julia> df
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ a │
│ 2 │ 4 │ d │
│ 3 │ 2 │ b │
│ 4 │ 3 │ c │
julia> foreach((c, v) -> insert!(c, 2, v), eachcol(df), [4, "d"]) # does not create a temporary object
julia> df
5×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ a │
│ 2 │ 4 │ d │
│ 3 │ 4 │ d │
│ 4 │ 2 │ b │
│ 5 │ 3 │ c │
Note that the above operation is not atomic: if the type of an element you want to add does not match the element type allowed in its column, the insertion fails part-way and may leave your data frame corrupted (columns of different lengths).
If you want a safe operation that will provide automatic promotion use this:
julia> df = DataFrame(x = [1,2,3], y = ["a", "b", "c"])
3×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
│ 3 │ 3 │ c │
julia> [view(df, 1:1, :); DataFrame(names(df) .=> ['a', 'b']); view(df, 3:3, :)]
3×2 DataFrame
│ Row │ x │ y │
│ │ Any │ Any │
├─────┼─────┼─────┤
│ 1 │ 1 │ a │
│ 2 │ 'a' │ 'b' │
│ 3 │ 3 │ c │
(it is a bit slower though and creates a new data frame)
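If you want to stay in place but are worried about the non-atomic behaviour mentioned above, one option is to validate everything before mutating anything. A minimal sketch: insertrow! is a hypothetical helper name (not part of DataFrames.jl), and the check is deliberately strict (it requires v isa eltype(col) and does no promotion or conversion):

```julia
using DataFrames

# Hypothetical helper, not a DataFrames.jl function: validate first, then mutate.
function insertrow!(df::DataFrame, i::Integer, vals)
    length(vals) == ncol(df) ||
        throw(DimensionMismatch("expected $(ncol(df)) values, got $(length(vals))"))
    1 <= i <= nrow(df) + 1 || throw(BoundsError(df, i))
    # Check every value against its column before touching any column.
    for (col, v) in zip(eachcol(df), vals)
        v isa eltype(col) ||
            throw(ArgumentError("$(repr(v)) does not fit column eltype $(eltype(col))"))
    end
    # All checks passed, so each per-column insert! is expected to succeed.
    foreach((col, v) -> insert!(col, i, v), eachcol(df), vals)
    return df
end

df = DataFrame(x = [1, 2, 3], y = ["a", "b", "c"])
insertrow!(df, 2, [4, "d"])
```

A mismatched row such as [5, 6] is rejected before any column is modified, so df keeps its shape. I believe recent DataFrames.jl releases also provide insert!(df, index, row) directly, which would make a helper like this unnecessary.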
Deprecated
The original answer is here. It was valid for Julia before 1.0 release (and DataFrames.jl version that was compatible with it).
I guess you want to do it in place. Then you can use insert! function like this:
julia> df = DataFrame(x = [1,2,3], y = ["a", "b", "c"])
3×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───┼───┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
│ 3 │ 3 │ c │
julia> foreach((v,n) -> insert!(df[n], 2, v), [4, "d"], names(df))
julia> df
4×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───┼───┤
│ 1 │ 1 │ a │
│ 2 │ 4 │ d │
│ 3 │ 2 │ b │
│ 4 │ 3 │ c │
Of course you have to make sure that you have the right number of columns in the added collection.
If you accept relying on the unexported internal structure of a DataFrame, you can do it even more simply:
julia> df = DataFrame(x = [1,2,3], y = ["a", "b", "c"])
3×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───┼───┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
│ 3 │ 3 │ c │
julia> insert!.(df.columns, 2, [4, "d"])
2-element Array{Array{T,1} where T,1}:
[1, 4, 2, 3]
String["a", "d", "b", "c"]
julia> df
4×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───┼───┤
│ 1 │ 1 │ a │
│ 2 │ 4 │ d │
│ 3 │ 2 │ b │
│ 4 │ 3 │ c │
I had two tab delimited files; one with the data, and the second one with the names of the columns that I am interested in. I want to subset the data frame so that it only has my columns of interest. Here is my code:
dat1 = DataFrame(CSV.File("data.txt"))
hdr = Symbol(readdlm("header.txt",'\t'))
which gives
julia> dat1
4×5 DataFrame
│ Row │ chr │ pos │ alt │ ref │ cadd │
│ │ String │ Int64 │ String │ String │ Float64 │
├─────┼────────┼───────┼────────┼────────┼─────────┤
│ 1 │ chr1 │ 1234 │ A │ T │ 23.4 │
│ 2 │ chr2 │ 1234 │ C │ G │ 5.4 │
│ 3 │ chr2 │ 1234 │ G │ C │ 11.0 │
│ 4 │ chr5 │ 3216 │ A │ T │ 3.0 │
julia> hdr
Symbol("Any[\"pos\" \"alt\"]")
However, I get an error if I try to subset with:
julia> dat2 = dat1[ :, :hdr]
What would be the correct way to subset? Thanks!
Just do:
hdr = vec(readdlm("header.txt",'\t'))
dat2 = dat1[:, hdr]
or for the second step
dat2 = select(dat1, hdr)
What is important here is that hdr should be a vector of strings.
You could also have written:
dat2 = select(dat1, readdlm("header.txt",'\t')...)
splatting the contents of the matrix (strings holding column names) as positional arguments.
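Here is a self-contained sketch of the same idea that does not need the files on disk. The IOBuffer stands in for header.txt and the small dat1 is made-up sample data (both are assumptions for illustration). The String.(...) call is optional; it just turns the elements that readdlm returns into a concrete Vector{String}:

```julia
using DataFrames, DelimitedFiles

# Made-up sample data standing in for data.txt.
dat1 = DataFrame(chr = ["chr1", "chr2"], pos = [1234, 1234],
                 alt = ["A", "C"], ref = ["T", "G"])

# Stand-in for header.txt: one tab-delimited line of column names.
io = IOBuffer("pos\talt\n")

# readdlm returns a 1×2 matrix; vec flattens it and String.() fixes the eltype.
hdr = String.(vec(readdlm(io, '\t')))

dat2 = dat1[:, hdr]        # copy of the selected columns
dat3 = select(dat1, hdr)   # same selection via select
```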
I thought that the exclamation mark ! is the symbol for the logical NOT operator. Now, learning indexing in the DataFrames package, I came across this: data[!,:Treatment]. Which seems to be the same as using the known colon symbol :
data[:,:Treatment]==data[!,:Treatment] is true.
Why this redundancy then?
! in indexing is specific to DataFrames, and signals that you want a reference to the underlying vector storing the data, rather than a copy of it. You can read all about indexing DataFrames here. In your example they are both == because all values are identical, but they are not === since df[:, :Treatment] gives you a copy of the underlying data.
Example:
julia> using DataFrames
julia> df = DataFrame(y = [1, 2, 3]);
julia> df[:, :y] == df[!, :y] # true because all values are equal
true
julia> df[:, :y] === df[!, :y] # false because they are not the same vector
false
Quoting the documentation of DataFrames.jl:
Columns can be directly (i.e. without copying) accessed via df.col or df[!, :col]. [...] Since df[!, :col] does not make a copy, changing the elements of the column vector returned by this syntax will affect the values stored in the original df. To get a copy of the column use df[:, :col]: changing the vector returned by this syntax does not change df.
An example might make this clearer:
julia> using DataFrames
julia> df = DataFrame(x = rand(5), y=rand(5))
5×2 DataFrame
│ Row │ x │ y │
│ │ Float64 │ Float64 │
├─────┼──────────┼───────────┤
│ 1 │ 0.937892 │ 0.42232 │
│ 2 │ 0.54413 │ 0.932265 │
│ 3 │ 0.961372 │ 0.680818 │
│ 4 │ 0.958788 │ 0.923667 │
│ 5 │ 0.942518 │ 0.0428454 │
# `a` is a copy of `df.x`: modifying it will not affect `df`
julia> a = df[:, :x]
5-element Array{Float64,1}:
0.9378915597741728
0.544130347207969
0.9613717853719412
0.958788066884128
0.9425183324742632
julia> a[2] = 1;
julia> df
5×2 DataFrame
│ Row │ x │ y │
│ │ Float64 │ Float64 │
├─────┼──────────┼───────────┤
│ 1 │ 0.937892 │ 0.42232 │
│ 2 │ 0.54413 │ 0.932265 │
│ 3 │ 0.961372 │ 0.680818 │
│ 4 │ 0.958788 │ 0.923667 │
│ 5 │ 0.942518 │ 0.0428454 │
# `b` is the column vector stored in `df` itself (not a copy): any change made to it will be reflected in df
julia> b = df[!, :x]
5-element Array{Float64,1}:
0.9378915597741728
0.544130347207969
0.9613717853719412
0.958788066884128
0.9425183324742632
julia> b[2] = 1;
julia> df
5×2 DataFrame
│ Row │ x │ y │
│ │ Float64 │ Float64 │
├─────┼──────────┼───────────┤
│ 1 │ 0.937892 │ 0.42232 │
│ 2 │ 1.0 │ 0.932265 │
│ 3 │ 0.961372 │ 0.680818 │
│ 4 │ 0.958788 │ 0.923667 │
│ 5 │ 0.942518 │ 0.0428454 │
Note that, since the indexing with ! does not involve any data copy, it will generally be more efficient.
OK, let's say I have a series of arrays:
data_one = ["dog","cat"]
data_two = [1,2]
data_three = ["1/1/2018","1/2/2018"]
I build them into a matrix
m = hcat(data_one,data_two,data_three)
and convert to a df
df = DataFrame(m)
showcols(df)
for output:
julia> showcols(df)
3×5 DataFrames.DataFrame
│ Row │ variable │ eltype │ nmissing │ first │ last │
├─────┼──────────┼────────┼──────────┼──────────┼──────────┤
│ 1 │ x1 │ Any │ 0 │ dog │ cat │
│ 2 │ x2 │ Any │ 0 │ 1 │ 2 │
│ 3 │ x3 │ Any │ 0 │ 1/1/2018 │ 1/2/2018 │
When I build this data frame, how can I specify the type of each column?
col1 should be String
col2 = Int
col3 = String
You can do it only indirectly through the following DataFrame constructor (of course you could pass [String, Int, String] as a variable here):
DataFrame([([String, Int, String][i]).(m[:,i]) for i in 1:size(m, 2)])
and if you want to use automatic detection of column type you can use:
DataFrame([[v for v in m[:,i]] for i in 1:size(m, 2)])
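With a more recent DataFrames.jl (1.x is assumed here), the same idea can be sketched by converting each column of the matrix to its target element type and passing :auto to get the generated names x1, x2, x3. The types vector is the only thing you would need to change:

```julia
using DataFrames

data_one = ["dog", "cat"]
data_two = [1, 2]
data_three = ["1/1/2018", "1/2/2018"]
m = hcat(data_one, data_two, data_three)   # mixed element types give a Matrix{Any}

# Desired element type of each column, in column order.
types = [String, Int, String]

# convert(Vector{T}, ...) fails loudly if a value does not fit the requested type.
cols = [convert(Vector{T}, m[:, i]) for (i, T) in enumerate(types)]

df = DataFrame(cols, :auto)   # :auto generates the column names x1, x2, x3
```

If you still have the original vectors around, building the data frame from them directly (DataFrame(x1 = data_one, x2 = data_two, x3 = data_three)) keeps their types without any conversion.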
I am transferring a script written in R to Julia, and one of the R functions is the names() function. Is there a synonymous function in Julia?
DataFrames
In Julia there is a names function for a DataFrame, which returns the column names, e.g.:
julia> using DataFrames
julia> x = DataFrame(rand(3,4))
3×4 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │
├─────┼───────────┼──────────┼──────────┼──────────┤
│ 1 │ 0.721198 │ 0.605882 │ 0.191941 │ 0.597295 │
│ 2 │ 0.0537836 │ 0.619698 │ 0.764937 │ 0.273197 │
│ 3 │ 0.679952 │ 0.899523 │ 0.206124 │ 0.928319 │
julia> names(x)
4-element Array{Symbol,1}:
:x1
:x2
:x3
:x4
Then, in order to set the column names of a DataFrame, you can use the names! function (example continued):
julia> names!(x, [:a,:b,:c,:d])
3×4 DataFrames.DataFrame
│ Row │ a │ b │ c │ d │
├─────┼───────────┼──────────┼──────────┼──────────┤
│ 1 │ 0.721198 │ 0.605882 │ 0.191941 │ 0.597295 │
│ 2 │ 0.0537836 │ 0.619698 │ 0.764937 │ 0.273197 │
│ 3 │ 0.679952 │ 0.899523 │ 0.206124 │ 0.928319 │
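Note that the transcript above is from an old DataFrames.jl version. In the 1.x releases, to the best of my knowledge, names! is no longer available and rename! does the same job, names returns strings instead of symbols, and the matrix constructor needs :auto to generate column names. A short sketch under those assumptions:

```julia
using DataFrames

x = DataFrame(rand(3, 4), :auto)   # columns are named x1, x2, x3, x4

old_names = names(x)               # a Vector{String} in DataFrames 1.x

# Replace all column names at once, in place.
rename!(x, [:a, :b, :c, :d])

new_names = names(x)
```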
Arrays
Standard arrays do not support naming their dimensions. But you can use NamedArrays.jl package which adds this functionality. You can get and set names of dimensions as well as names of indices along each dimension. You can find the details here https://github.com/davidavdav/NamedArrays.jl#general-functions.
I'm not an R expert, but I think you want fieldnames:
struct Foo
bar::Int
end
@show fieldnames(Foo)
baz = Foo(2)
# fieldnames takes a type, so pass the type of the instance
@show fieldnames(typeof(baz))