I have two tab-delimited files: one with the data, and a second with the names of the columns I am interested in. I want to subset the data frame so that it contains only my columns of interest. Here is my code:
dat1 = DataFrame(CSV.File("data.txt"))
hdr = Symbol(readdlm("header.txt",'\t'))
which gives
julia> dat1
4×5 DataFrame
│ Row │ chr │ pos │ alt │ ref │ cadd │
│ │ String │ Int64 │ String │ String │ Float64 │
├─────┼────────┼───────┼────────┼────────┼─────────┤
│ 1 │ chr1 │ 1234 │ A │ T │ 23.4 │
│ 2 │ chr2 │ 1234 │ C │ G │ 5.4 │
│ 3 │ chr2 │ 1234 │ G │ C │ 11.0 │
│ 4 │ chr5 │ 3216 │ A │ T │ 3.0 │
julia> hdr
Symbol("Any[\"pos\" \"alt\"]")
However, I get an error if I try to subset with:
julia> dat2 = dat1[ :, :hdr]
What would be the correct way to subset? Thanks!
Just do:
hdr = vec(readdlm("header.txt",'\t'))
dat2 = dat1[:, hdr]
or for the second step
dat2 = select(dat1, hdr)
What is important here is that hdr should be a vector of strings.
You could also have written:
dat2 = select(dat1, readdlm("header.txt",'\t')...)
splatting the contents of the matrix (strings holding column names) as positional arguments.
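To see why the original attempt failed: Symbol applied to the whole matrix builds one symbol from its printed form, and :hdr is the literal column name hdr rather than the variable. Here is a minimal sketch, assuming a recent DataFrames.jl that accepts strings as column selectors. The IOBuffer is only a stand-in for header.txt, and the small table is made up to mirror the question:

```julia
using DataFrames, DelimitedFiles

# Made-up data mirroring the question's table.
dat1 = DataFrame(chr = ["chr1", "chr2"], pos = [1234, 1234],
                 alt = ["A", "C"], ref = ["T", "G"], cadd = [23.4, 5.4])

# Stand-in for header.txt: a single tab-delimited line of column names.
raw = readdlm(IOBuffer("pos\talt\n"), '\t')  # a 1×2 matrix
hdr = String.(vec(raw))                      # flatten to a Vector{String}

dat2 = dat1[:, hdr]       # indexing with a vector of names
dat3 = select(dat1, hdr)  # same selection via select

# Symbols also work, as long as Symbol is applied element-wise.
dat4 = dat1[:, Symbol.(hdr)]
```

Using vec means the same code works whether the names sit on one line or one per line.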
How do I get the dtypes of all columns and of particular column(s) in Julia? To be specific, what is the Julia equivalent of pandas' df.dtypes?
For example,
I have a df like below,
│ Row │ Id │ name │ item location │
│ │ Int64 │ String │ String │
├─────┼───────┼────────┼───────────────┤
│ 1 │ 1 │ A │ xyz │
│ 2 │ 2 │ B │ abc │
│ 3 │ 3 │ C │ def │
│ 4 │ 4 │ D │ ghi │
│ 5 │ 5 │ E │ xyz │
│ 6 │ 6 │ F │ abc │
│ 7 │ 7 │ G │ def │
│ 8 │ 8 │ H │ ghi │
│ 9 │ 9 │ I │ xyz │
│ 10 │ 10 │ J │ abc │
Expected output:
{'id': Int64, 'name': String, 'item location': String}
How to get dtypes, i.e., Int64 │ String │ String in Julia?
You have specified two different expected outputs so I show here how to get both:
julia> df = DataFrame("Id" => 1, "name" => "A", "item_location" => "xyz")
1×3 DataFrame
│ Row │ Id │ name │ item_location │
│ │ Int64 │ String │ String │
├─────┼───────┼────────┼───────────────┤
│ 1 │ 1 │ A │ xyz │
julia> eltype.(eachcol(df))
3-element Array{DataType,1}:
Int64
String
String
julia> Dict(names(df) .=> eltype.(eachcol(df)))
Dict{String,DataType} with 3 entries:
"Id" => Int64
"name" => String
"item_location" => String
Additionally, if you wanted to store the result in a DataFrame instead of a Dict you can write (see the mapcols documentation):
julia> mapcols(eltype, df)
1×3 DataFrame
│ Row │ Id │ name │ item_location │
│ │ DataType │ DataType │ DataType │
├─────┼──────────┼──────────┼───────────────┤
│ 1 │ Int64 │ String │ String │
And if you want a NamedTuple storing this information, write (see the documentation of Tables.columntable):
julia> map(eltype, Tables.columntable(df))
(Id = Int64, name = String, item_location = String)
(in this case note that for very wide tables this might have some extra compilation cost as each time you call it you potentially get a new type of NamedTuple)
You can also use describe(df) which is a catchall for learning about the columns in your data frame.
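The question also asks about particular columns, which the snippets above do not show. Here is a minimal sketch, assuming a recent DataFrames.jl; the two-row table is made up, and "item location" keeps the space from the question to show that string names handle it:

```julia
using DataFrames

df = DataFrame("Id" => [1, 2], "name" => ["A", "B"],
               "item location" => ["xyz", "abc"])

# Element type of a single column.
t_name = eltype(df.name)
t_loc = eltype(df[!, "item location"])  # string indexing copes with the space

# Element types of a chosen subset of columns, as a Dict keyed by name.
wanted = ["Id", "name"]
subset_types = Dict(n => eltype(df[!, n]) for n in wanted)
```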
I want to use Faker data for many rows. My current code only repeats whatever was generated by the Faker library at that moment:
Current output:
│ Row │ Identifier │
│ │ String │
├─────┼────────────┤
│ 1 │ 40D593 │
│ 2 │ 40D593 │
│ 3 │ 40D593 │
Desired outputs:
│ Row │ Digits │
│ │ String │
├─────┼────────┤
│ 1 │ 23K125 │
│ 2 │ 13K125 │
│ 3 │ 45K125 │
df2 = DataFrame(Identifier = repeat([Faker.bothify("##?###")], outer=[3]))
I thought I could do something like Faker.bothify("##?###") * 3. I suppose there may also be a way to apply it to a dataframe column that was already made, but I can't find a way just looking through the docs quickly.
The simple way is to use a comprehension:
df2 = DataFrame(Identifier=[Faker.bothify("##?###") for _ in 1:3])
an alternative is to use broadcasting (but for me a comprehension in this case is more natural to use):
df2 = DataFrame(Identifier=Faker.bothify.(Iterators.repeated("##?###", 3)))
(I assume this is what you want)
and this is the way to apply it to an existing column with String eltype. This operation is in-place:
julia> df = DataFrame(Identifier=Vector{String}(undef, 3))
3×1 DataFrame
│ Row │ Identifier │
│ │ String │
├─────┼────────────┤
│ 1 │ #undef │
│ 2 │ #undef │
│ 3 │ #undef │
julia> df.Identifier .= Faker.bothify.("##?###")
3-element Array{String,1}:
"12H314"
"56G992"
"23X588"
julia> df
3×1 DataFrame
│ Row │ Identifier │
│ │ String │
├─────┼────────────┤
│ 1 │ 12H314 │
│ 2 │ 56G992 │
│ 3 │ 23X588 │
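The difference between the repeat attempt and the comprehension can be seen without Faker at all. In the sketch below, fake_id is a hypothetical stand-in that I wrote to imitate the "##?###" pattern (a digit for each #, an uppercase letter for ?); it is not part of Faker.jl:

```julia
# Hypothetical stand-in for Faker.bothify: '#' becomes a random digit,
# '?' becomes a random uppercase letter, anything else is kept.
function fake_id(pattern::AbstractString)
    return map(pattern) do c
        c == '#' ? rand('0':'9') :
        c == '?' ? rand('A':'Z') : c
    end
end

# The function runs once; repeat only copies that single result.
repeated = repeat([fake_id("##?###")], outer = 3)

# The function runs once per iteration, so each element is drawn separately.
fresh = [fake_id("##?###") for _ in 1:3]
```

The arguments to repeat are evaluated before repeat is called, which is why every row ended up with the same identifier.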
OK, let's say I have a series of arrays:
data_one = ["dog","cat"]
data_two = [1,2]
data_three = ["1/1/2018","1/2/2018"]
I build them into a matrix
m = hcat(data_one,data_two,data_three)
and convert to a df
df = DataFrame(m)
showcols(df)
for output:
julia> showcols(df)
3×5 DataFrames.DataFrame
│ Row │ variable │ eltype │ nmissing │ first │ last │
├─────┼──────────┼────────┼──────────┼──────────┼──────────┤
│ 1 │ x1 │ Any │ 0 │ dog │ cat │
│ 2 │ x2 │ Any │ 0 │ 1 │ 2 │
│ 3 │ x3 │ Any │ 0 │ 1/1/2018 │ 1/2/2018 │
When I build this data frame, how can I specify the type of each column?
col1 should be String
col2 = Int
col3 = String
You can do it only indirectly through the following DataFrame constructor (of course you could pass [String, Int, String] as a variable here):
DataFrame([([String, Int, String][i]).(m[:,i]) for i in 1:size(m, 2)])
and if you want to use automatic detection of column type you can use:
DataFrame([[v for v in m[:,i]] for i in 1:size(m, 2)])
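On newer DataFrames.jl versions the same idea might look like the sketch below. It assumes DataFrames 1.x, where a vector of columns needs :auto to get the generated names x1, x2, x3; the variable names df_typed, df_auto and df_direct are mine:

```julia
using DataFrames

data_one = ["dog", "cat"]
data_two = [1, 2]
data_three = ["1/1/2018", "1/2/2018"]
m = hcat(data_one, data_two, data_three)  # mixed types give a Matrix{Any}

# Explicit types: convert each matrix column with its target type.
types = [String, Int, String]
df_typed = DataFrame([T.(col) for (T, col) in zip(types, eachcol(m))], :auto)

# Automatic detection: broadcasting identity narrows Any to the concrete type.
df_auto = DataFrame([identity.(col) for col in eachcol(m)], :auto)

# If the original vectors are still around, passing them directly
# keeps their types and skips the matrix step.
df_direct = DataFrame(x1 = data_one, x2 = data_two, x3 = data_three)
```

The types are lost in hcat, not in the DataFrame constructor, so avoiding the matrix is the simplest route when that is an option.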
So I tried this:
df[:new_col] = (df[:col_one ] .* [df[:col_two]])
It produces a wild result.
I then thought to iterate row-wise by accessing the data frame index:
v = Float64[]
for i in 1:nrow(df)
    z = df[[i],[:col_one]] * df[[i],[:col_two]]
    append!(v,z)
end
This however does not work. Any ideas?
What are my options from here? Pull the data from a data frame and make a vector?
** Update **
df = DataFrame(a = 1:10, b = 10*rand(10), c = 10 * rand(10))
df[:new_d] = df[:b] .* df[:c]
For output:
julia> head(df)
6×4 DataFrames.DataFrame
│ Row │ a │ b │ c │ new_d │
├─────┼───┼─────────┼─────────┼─────────┤
│ 1 │ 1 │ 6.67916 │ 8.38096 │ 55.9778 │
│ 2 │ 2 │ 7.50056 │ 5.26593 │ 39.4974 │
│ 3 │ 3 │ 7.76419 │ 3.54361 │ 27.5133 │
│ 4 │ 4 │ 2.86521 │ 8.41335 │ 24.1061 │
│ 5 │ 5 │ 3.7417 │ 8.10884 │ 30.3409 │
│ 6 │ 6 │ 7.52014 │ 2.61603 │ 19.6729 │
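The "wild result" from the first attempt most likely comes from the extra brackets: wrapping a column in [ ] makes a one-element vector that holds the whole column, so broadcasting multiplies every value of the first column by the entire second column. Here is a small sketch with made-up numbers, assuming a recent DataFrames.jl where df.col replaces the older df[:col] syntax:

```julia
using DataFrames

a = [1.0, 2.0]
b = [3.0, 4.0]

# [b] has length 1, so each a[i] is multiplied by all of b:
# the result is a vector of vectors.
wrapped = a .* [b]

# Without the brackets, broadcasting pairs the elements up.
elementwise = a .* b

df = DataFrame(col_one = a, col_two = b)
df.new_col = df.col_one .* df.col_two  # element-wise product as a new column
```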
Let's say I have a pre-sized dataframe and I want to assign values to every row. (Therefore push! and append! are out of the game.)
length = 10
df = DataFrame(id = Array(Int64,length),value = Array(String,length))
for n in 1:10
    df[n,:id] = n
    df[n,:value] = "random text"
end
The above code shows how to do that cell by cell for each iterated row.
Is there a solution to add an entire row at once for each iteration?
Because
for n in 1:10
    df[n] = [n "random text"]
end
throws a wrong type exception.
To access a row the syntax is [row,:] rather than just row.
Also you'll need to convert the row to a DataFrame first.
for n in 1:10
    df[n,:] = DataFrame([n "random text2"])
end
You can roll your own function to set a row quite easily:
julia> function setrow!(df, rowi, val)
           for j in eachindex(val)
               df[rowi, j] = val[j]
           end
           df
       end
setrow! (generic function with 1 method)
julia> setrow!(df, 1, [1, "a"])
10×2 DataFrames.DataFrame
│ Row │ id │ value │
├─────┼─────────────────┼──────────┤
│ 1 │ 1 │ "a" │
│ 2 │ 140525709817424 │ "#undef" │
│ 3 │ 140525709817488 │ "#undef" │
│ 4 │ 140525709817072 │ "#undef" │
│ 5 │ 140525709817104 │ "#undef" │
│ 6 │ 140525709817136 │ "#undef" │
│ 7 │ 140525709817168 │ "#undef" │
│ 8 │ 140525709817200 │ "#undef" │
│ 9 │ 140525709817232 │ "#undef" │
│ 10 │ 0 │ "#undef" │
Ideally, you might be able to use the broadcasting assignment syntax:
df[2, :] .= [2, "b"]
But that does not appear to be implemented (perhaps for good reason, I'm not sure).
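For readers on newer versions: as far as I know, DataFrames.jl 1.x accepts a tuple or vector on the right-hand side of a whole-row assignment, so no helper is needed. Here is a sketch under that assumption; I pre-fill the columns with zeros and empty strings instead of undef so the frame is always safe to print:

```julia
using DataFrames

n = 10
# Pre-sized frame with placeholder values.
df = DataFrame(id = zeros(Int, n), value = fill("", n))

for i in 1:n
    # Assign a whole row at once; the tuple must have one entry per column.
    df[i, :] = (i, "random text")
end
```

The values are written into the existing column vectors, so the column types stay Int and String.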