Suppose I have a DataFrame with numeric elements. I want to check that all the elements are non-negative. I can do something like:
df .> 0
which results in a DataFrame of ones and zeros. How do I reduce it to a single true/false value?
An efficient and almost non-allocating way to do it is:
all(all.(>(0), eachcol(df)))
or
all(all.(x -> isless(0, x), eachcol(df)))
depending on how you want to handle missing values.
Here is an example of the difference:
julia> df = DataFrame(a=[1, missing], b=1:2)
2×2 DataFrame
 Row │ a        b
     │ Int64?   Int64
─────┼────────────────
   1 │       1      1
   2 │ missing      2
julia> all(all.(>(0), eachcol(df)))
missing
julia> all(all.(x -> isless(0, x), eachcol(df)))
true
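The first call returns missing because missing > 0 is itself missing, whereas isless treats a missing value as greater than any other value, so the second call returns true.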
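If a missing entry should instead make the check fail, or should simply be ignored, the same reduction can be adapted. This is a minimal sketch, assuming DataFrames.jl is installed; the example frame and the names strict and lenient are made up for illustration. It uses >= rather than >, since "non-negative" includes zero.

```julia
using DataFrames

# Illustrative data: one missing entry and one zero.
df = DataFrame(a = [1, missing], b = 0:1)

# coalesce(..., false) turns a missing comparison into false,
# so any missing value makes the whole check fail.
strict = all(col -> all(x -> coalesce(x >= 0, false), col), eachcol(df))

# skipmissing drops missing values before the check,
# so only the values that are present have to be non-negative.
lenient = all(col -> all(>=(0), skipmissing(col)), eachcol(df))
```

Here strict is false because column a contains a missing value, while lenient is true because every value that is present is at least zero.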
I have a dataframe whose columns are sometimes: [Type_House, Name, Location].
And sometimes they are: [Type_Build, Name, Location]
Is there a way to access the Type column of this dataframe dynamically, like this?
colName = "House"
dataframe.Type_colName
Thanks.
If you have
colName = "House"
you can access the column whose name is stored in colName with
df[!, colName]
and from there you can use typeof() or eltype() to get the type or element type of that column.
As indicated by @jling, but specific to your question, it would be:
> colName = "House"
> df[!, "Type_"*colName]
or
> getproperty(df, "Type_"*colName)
Then you can just set colName = "Build" to select the other column.
If you want to access the column that starts with Type_, you can use the names function this way:
julia> df = DataFrame(Type_Build = ["foo", "bar"], Name = ["A", "B"])
2×2 DataFrame
 Row │ Type_Build  Name
     │ String      String
─────┼────────────────────
   1 │ foo         A
   2 │ bar         B
julia> names(df, startswith("Type_"))
1-element Vector{String}:
"Type_Build"
To access the values in the column, you can use that to index into the dataframe:
julia> df[!, names(df, startswith("Type_"))]
2×1 DataFrame
 Row │ Type_Build
     │ String
─────┼────────────
   1 │ foo
   2 │ bar
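The two approaches above can be combined into a guarded lookup. This is a minimal sketch, assuming DataFrames.jl 1.x; the frame and the names kind, col, housetypes and typecol are illustrative only.

```julia
using DataFrames

# Illustrative frame; only one of Type_House / Type_Build is present.
df = DataFrame(Type_House = ["flat", "villa"], Name = ["A", "B"])

kind = "House"
col = "Type_" * kind                  # builds the string "Type_House"

# Guard the lookup, since indexing with an absent column name throws an error.
housetypes = col in names(df) ? df[!, col] : nothing

# Alternatively, find the Type_ column without knowing its suffix.
# only() throws unless exactly one column name matches.
typecol = only(names(df, startswith("Type_")))
```

Indexing with a single name, as in df[!, col], returns the column vector itself, whereas indexing with a vector of names returns a DataFrame.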
I have a file that in principle looks like this
1 2 2,3,4 5
6 3 7,8 9
10 5 11,12,13,14,15 16
There are 4 columns: columns 1 and 4 are integer/float columns, and column 2 indicates how many elements are in the vector in column 3, so column 3 is in reality a column of vectors with different lengths.
How can I best read this in and represent this structure in a "variable" / dict?
I have tried to use readdlm with delim = "," and I can read in the file, but I need to easily address the structure to get the integers/floats/arrays for plotting etc. In reality there are many more columns and rows, and which columns hold variable-sized vectors varies from case to case.
So I'm not 100% sure what exactly it is you want to do. But if you want to parse the file into some data structure, here's a stab at a solution:
function parseline(line::AbstractString)
    fields = split(line, ' ')
    # Column 3: comma-separated integers of varying length.
    v = [parse(Int, i) for i in split(fields[3], ',')]
    # Column 2: the declared length of that vector.
    lenv = parse(Int, fields[2])
    lenv == length(v) || throw(DimensionMismatch("Incorrect length for $v"))
    result = Any[parse(Int, fields[1]), v, parse(Int, fields[4])]
    return result
end

function parsefile(path::AbstractString)
    return open(path) do file
        [parseline(line) for line in eachline(file)]
    end
end
The first function parses a line and checks if the length of column 3 matches the value in column 2. It returns a vector of [column 1, column 3, column 4]. The second function just does that for each line in a file.
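For plotting, it may be more convenient to address the parsed fields by name. This is a sketch of one possible variant, not the answer's own code: parserow and the field names col1, items and col4 are hypothetical, and the input is read from an in-memory string rather than a file. Note that in the sample from the question, the counts in the first two lines (2 and 3) do not match the vector lengths (3 and 2), so the length check would throw on them; the sample below uses adjusted counts as an assumption.

```julia
# Hypothetical variant of parseline that returns a NamedTuple per line.
function parserow(line::AbstractString)
    fields = split(line)                          # split on whitespace
    n = parse(Int, fields[2])                     # declared vector length
    items = parse.(Int, split(fields[3], ','))    # the variable-length vector
    length(items) == n ||
        throw(DimensionMismatch("expected $n items, got $(length(items))"))
    return (col1 = parse(Int, fields[1]), items = items, col4 = parse(Int, fields[4]))
end

# Sample data with counts adjusted to match the vector lengths (assumption).
sample = """
1 3 2,3,4 5
6 2 7,8 9
10 5 11,12,13,14,15 16
"""

rows = [parserow(line) for line in eachline(IOBuffer(sample))]

# Scalar columns can be gathered into plain vectors for plotting.
col1 = [r.col1 for r in rows]
lens = [length(r.items) for r in rows]
```

Each element of rows can then be addressed as rows[i].col1, rows[i].items and rows[i].col4.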
I have a list of numbers and I want to find the numbers whose second digit is 9. grep() finds any number that contains a 9, but I am looking for code that finds only the numbers whose second digit is 9. The code below returns:
p <- c(34405, 09098424, 6908347, 8900333, 453434)
grep(9, p)
[1] 1 2 3 4
I am looking for something that returns:
[1] 2 3 4
Thanks
Majran
We can use substr to extract the 2nd digit and check whether (==) that is equal to 9, get the numeric index by wrapping with which.
which(substr(p,2,2)=="9")
#[1] 2 3 4
Another option is grep, where we match the pattern ^.9 (^ anchors the start of the string and . matches any single character, so the 9 must be the second character)
grep("^.9", p)
#[1] 2 3 4
NOTE: Here I am assuming that the OP's vector is character class because numeric elements don't have 0 padded on the left.
data
p <- c("34405", "09098424", "6908347", "8900333", "453434")
I have a DataFrame with several columns, say column1, column2...column100. How do I select only a subset of the columns, e.g. (not column1) should return all columns column2...column100?
data[[colnames(data) .!= "column1"]])
doesn't seem to work.
I don't want to mutate the dataframe. I just want to select all the columns that don't have a particular column name like in my example
EDIT 2/7/2021: as people still seem to find this on Google, I'll note at the top that current DataFrames (1.0+) supports both Not() selection (provided by InvertedIndices.jl) and strings as column names, including regex selection with the r"" string macro. Examples:
julia> df = DataFrame(a1 = rand(2), a2 = rand(2), x1 = rand(2), x2 = rand(2), y = rand(["a", "b"], 2))
2×5 DataFrame
 Row │ a1        a2        x1        x2        y
     │ Float64   Float64   Float64   Float64   String
─────┼────────────────────────────────────────────────
   1 │ 0.784704  0.963761  0.124937  0.37532   a
   2 │ 0.814647  0.986194  0.236149  0.468216  a
julia> df[!, r"2"]
2×2 DataFrame
 Row │ a2        x2
     │ Float64   Float64
─────┼────────────────────
   1 │ 0.963761  0.37532
   2 │ 0.986194  0.468216
julia> df[!, Not(r"2")]
2×3 DataFrame
 Row │ a1        x1        y
     │ Float64   Float64   String
─────┼────────────────────────────
   1 │ 0.784704  0.124937  a
   2 │ 0.814647  0.236149  a
Finally, the names function has a method which takes a type as its second argument, which is handy for subsetting DataFrames by the element type of each column:
julia> df[!, names(df, String)]
2×1 DataFrame
 Row │ y
     │ String
─────┼────────
   1 │ a
   2 │ a
In addition to indexing with square brackets, there's also the select function (and its mutating equivalent select!), which basically takes the same input as the column index in []-indexing as its second argument:
julia> select(df, Not(r"a"))
2×3 DataFrame
 Row │ x1        x2        y
     │ Float64   Float64   String
─────┼────────────────────────────
   1 │ 0.124937  0.37532   a
   2 │ 0.236149  0.468216  a
Original answer below
As @Reza Afzalan said, what you're trying to do returns an array of strings, while column names in DataFrames are symbols.
Given that Julia doesn't have conditional list comprehensions, the nicest thing you could do, I guess, would be
data[:, filter(x -> x != :column1, names(data))]
This will give you the data set with column1 removed (without mutating it). You could extend this to checking against lists of names as well:
data[:, filter(x -> !(x in [:column1, :column2]), names(data))]
UPDATE: As Ian says below, for this use case the Not syntax is now the best way to go.
More generally, conditional list comprehensions are also available by now, so you could do:
data[:, [x for x in names(data) if x != :column1]]
As of DataFrames 0.19, it seems that you can now do
select(data, Not(:column1))
to select all but the column column1. To select all except for multiple columns, use an array in the inverted index:
select(data, Not([:column1, :column2]))
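The following is a small runnable sketch of the Not selection, assuming DataFrames.jl 1.x; the three-column frame and the names rest, rest2 and kept are illustrative. One thing to keep in mind with 1.x is that names(data) returns strings, so a comparison against a Symbol such as :column1 never matches and the name has to be compared as a string.

```julia
using DataFrames

# Illustrative frame standing in for column1 ... column100.
data = DataFrame(column1 = 1:2, column2 = 3:4, column3 = 5:6)

# select returns a new DataFrame and leaves data untouched.
rest = select(data, Not(:column1))

# The same inverted index works with square-bracket indexing;
# data[:, ...] copies the selected columns.
rest2 = data[:, Not([:column1, :column2])]

# Comprehension form: names(data) is a Vector{String} in DataFrames 1.x.
kept = [n for n in names(data) if n != "column1"]
```

After this runs, data still has all three columns, rest has column2 and column3, and rest2 has only column3.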
To select several columns by name:
df[[:col1, :col2]]
or, for other versions of the DataFrames library, I use:
select(df, [:col1, :col2])
colnames(data) .!= "column1" # => returns an array of bool
I think the right way is to use a filter function that returns desired column names
filter(x->x != "column1", colnames(data)) # => returns an array of string
DataFrame column names are of Symbol datatype
map(symbol, str_array_of_filtered_column_names) # => returns array of identical symbols
One way is selecting a range of columns using the index
idx = length(data)
data[2:idx]
Other ways to do conditional selection are in the DataFrames docs
For example say you create a Julia DataFrame like so with 20 columns:
y=convert(DataFrame, randn(10,20))
How do you convert the column names (:x1 ... :x20) to something else, like (:col1, ..., :col20) for example, all at once?
You might find the names! function more concise:
julia> using DataFrames
julia> df = DataFrame(x1 = 1:2, x2 = 2:3, x3 = 3:4)
2x3 DataFrame
|-------|----|----|----|
| Row # | x1 | x2 | x3 |
| 1     | 1  | 2  | 3  |
| 2     | 2  | 3  | 4  |
julia> names!(df, [symbol("col$i") for i in 1:3])
Index([:col2=>2,:col1=>1,:col3=>3],[:col1,:col2,:col3])
julia> df
2x3 DataFrame
|-------|------|------|------|
| Row # | col1 | col2 | col3 |
| 1     | 1    | 2    | 3    |
| 2     | 2    | 3    | 4    |
One way to do this is with the rename! function. The rename! method that takes a DataFrame as input, though, only allows you to change a single column name at a time (as of the development version 0.3 branch on 1/4/2014). Looking into the code of Index.jl in the DataFrames repository led me to this solution, which works for me:
rename!(y.colindex, [(symbol("x$i")=>symbol("col$i")) for i in 1:20])
y.colindex returns the index for the dataframe y, and the next argument creates a dictionary mapping the old column symbols to the new column symbols. I imagine that by the time someone else needs this, there will be a nicer way to do this, but I just spent a few hours figuring this out in the development version 0.3 of Julia, so I thought I would share!
As an update to the answer of @JohnMylesWhite, the names! function has been deprecated in DataFrames v0.20.2. The latest way of going about this is by using the rename! function:
import DataFrames
DF = DataFrames
df = DF.DataFrame(x1 = 1:2, x2 = 2:3, x3 = 3:4)
println(df)
DF.rename!(df, [Symbol("Col$i") for i in 1:size(df,2)])
println(df)
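For reference, here is a short sketch of the three rename! forms as I understand them in DataFrames.jl 1.x: a full vector of new names, a single old => new pair, and a function applied to every name. The target names used here (col1, id, and the uppercase variants) are made up for illustration.

```julia
using DataFrames

df = DataFrame(x1 = 1:2, x2 = 2:3, x3 = 3:4)

# Replace all names at once; the vector must have one entry per column.
rename!(df, ["col$i" for i in 1:ncol(df)])

# Rename a single column with an old => new pair.
rename!(df, :col1 => :id)

# Apply a function to every name; it receives each name as a String.
rename!(uppercase, df)
```

After these three calls the columns are named ID, COL2 and COL3. The non-mutating rename takes the same arguments and returns a new DataFrame instead.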
v1.1.0
One can directly change the column names by
names!(df, colNames_as_Symbols)
To rename the columns with a vector of strings, this can be done via
names!(df, Symbol.(colNames_as_strings) )
# import Pkg; Pkg.add("DataFrames")
using DataFrames
The question has been answered, but for the additional clarity, sometimes you just want to specify the names without using loops (i.e. over-engineering):
rename!(df, [:Date, :feature_1, :feature_2 ], makeunique=true)
Example output:
141 rows × 3 columns
Date feature_1 feature_2
Date Float64? Float64?
1 2020-08-03 44.3 missing
Update:
For Julia 0.4, as described by John Myles White, all the names can be changed with:
names!(df::AbstractDataFrame, vals)
where vals is a Vector{Symbol} the same length as
the number of columns in df.
Specific names can be changed with:
rename!(df::AbstractDataFrame, from::Symbol, to::Symbol)
rename!(df::AbstractDataFrame, d::Associative)
rename!(f::Function, df::AbstractDataFrame)
where d is an Associative type that maps the original name to a new name
and f is a function that has the old column name (a symbol) as input
and new column name (a symbol) as output.
This is documented in the code at https://github.com/JuliaStats/DataFrames.jl/blob/7e2f48ad9f31185d279fdd81d6413a79b7e42e87/src/abstractdataframe/abstractdataframe.jl
This is the short and simple answer for Julia 1.1.1:
names!(df, [Symbol("Col$i") for i in 1:size(df,2)])
Use the rename! function with an array containing the new names:
Vector_with_names = ["col1","col2","col3"]
rename!(df,Vector_with_names)
Using John's dataframe, I had to use colnames! instead of names!
df = DataFrame(x1 = 1:2, x2 = 2:3, x3 = 3:4)
colnames!(df, ["col$i" for i in 1:3])
My version of Julia is 0.2.1