How to select only a subset of DataFrame columns in Julia

I have a DataFrame with many columns, say column1, column2, ..., column100. How do I select only a subset of the columns? For example, "not column1" should return all columns column2, ..., column100.
data[[colnames(data) .!= "column1"]]
doesn't seem to work.
I don't want to mutate the DataFrame. I just want to select all the columns that don't have a particular column name, as in my example.

EDIT 2/7/2021: as people still seem to find this on Google, I'm adding a note at the top: current DataFrames (1.0+) supports both Not() selection (via InvertedIndices.jl) and strings as column names, including regex selection with the r"" string macro. Examples:
julia> df = DataFrame(a1 = rand(2), a2 = rand(2), x1 = rand(2), x2 = rand(2), y = rand(["a", "b"], 2))
2×5 DataFrame
 Row │ a1        a2        x1        x2        y
     │ Float64   Float64   Float64   Float64   String
─────┼────────────────────────────────────────────────
   1 │ 0.784704  0.963761  0.124937  0.37532   a
   2 │ 0.814647  0.986194  0.236149  0.468216  a
julia> df[!, r"2"]
2×2 DataFrame
 Row │ a2        x2
     │ Float64   Float64
─────┼────────────────────
   1 │ 0.963761  0.37532
   2 │ 0.986194  0.468216
julia> df[!, Not(r"2")]
2×3 DataFrame
 Row │ a1        x1        y
     │ Float64   Float64   String
─────┼────────────────────────────
   1 │ 0.784704  0.124937  a
   2 │ 0.814647  0.236149  a
Finally, the names function has a method which takes a type as its second argument, which is handy for subsetting DataFrames by the element type of each column:
julia> df[!, names(df, String)]
2×1 DataFrame
 Row │ y
     │ String
─────┼────────
   1 │ a
   2 │ a
In addition to indexing with square brackets, there's also the select function (and its mutating equivalent select!), which basically takes the same input as the column index in []-indexing as its second argument:
julia> select(df, Not(r"a"))
2×3 DataFrame
 Row │ x1        x2        y
     │ Float64   Float64   String
─────┼────────────────────────────
   1 │ 0.124937  0.37532   a
   2 │ 0.236149  0.468216  a
Original answer below
As @Reza Afzalan said, what you're trying to do returns an array of strings, while column names in DataFrames are symbols.
Given that Julia doesn't have conditional list comprehension, the nicest thing you could do, I guess, would be
data[:, filter(x -> x != :column1, names(data))]
This will give you the data set with column1 removed (without mutating it). You could extend this to checking against lists of names as well:
data[:, filter(x -> !(x in [:column1, :column2]), names(data))]
UPDATE: As Ian says below, for this use case the Not syntax is now the best way to go.
More generally, conditional list comprehensions are also available by now, so you could do:
data[:, [x for x in names(data) if x != :column1]]
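One caveat for readers on current DataFrames versions: there, names(data) returns strings rather than symbols, so a comparison such as x != :column1 is true for every name and no column is dropped. The sketch below assumes DataFrames 1.x and uses a small made-up data frame; it compares against a String, or uses propertynames when symbols are preferred.

```julia
using DataFrames

# Hypothetical example data, standing in for the asker's `data`
data = DataFrame(column1 = 1:2, column2 = 3:4, column3 = 5:6)

# names(data) yields Strings in DataFrames 1.x, so compare against a String
keep = [x for x in names(data) if x != "column1"]
subset = data[:, keep]

# propertynames(data) yields Symbols, if symbol comparison is preferred
keep_syms = filter(x -> x != :column1, propertynames(data))
subset_syms = data[:, keep_syms]
```

Both forms leave data itself untouched, because indexing with : for the rows copies the selected columns.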

As of DataFrames 0.19, it seems that you can now do
select(data, Not(:column1))
to select all but the column column1. To select all except for multiple columns, use an array in the inverted index:
select(data, Not([:column1, :column2]))
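A minimal sketch of the Not selectors above, assuming a recent DataFrames version and a small made-up data frame. It also checks that the original is not mutated, which was the asker's requirement.

```julia
using DataFrames

# Hypothetical example data
data = DataFrame(column1 = 1:2, column2 = 3:4, column3 = 5:6)

# Drop one column; select returns a new DataFrame
rest = select(data, Not(:column1))

# Not also accepts a vector of names, here given as strings
last_only = data[:, Not(["column1", "column2"])]

# The original still has all three columns
original_names = names(data)
```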

To select several columns by name:
df[[:col1, :col2]]
or, for other versions of the DataFrames library, I use:
select(df, [:col1, :col2])
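On recent DataFrames versions, indexing needs both a row and a column selector, so the single-bracket form above may not work there. The following sketch assumes DataFrames 1.x and a made-up data frame, and shows how the copying behavior differs between the variants.

```julia
using DataFrames

# Hypothetical example data
df = DataFrame(col1 = 1:3, col2 = 4:6, col3 = 7:9)

copied = df[:, [:col1, :col2]]      # `:` copies the selected columns
shared = df[!, [:col1, :col2]]      # `!` reuses the column vectors without copying
selected = select(df, [:col1, :col2])  # select copies by default
```

The ! form is cheaper, but changes made to the columns of shared will also show up in df.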

colnames(data) .!= "column1" # => returns an array of bool
I think the right way is to use filter, which returns the desired column names:
filter(x -> x != "column1", colnames(data)) # => returns an array of strings
DataFrame column names are of the Symbol datatype, so convert the strings:
map(symbol, str_array_of_filterd_column_names) # => returns an array of the corresponding symbols

One way is selecting a range of columns using the index
idx = length(data)
data[2:idx]
Other ways to do conditional selection are in the DataFrames docs
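For readers on a recent DataFrames version, positional selection can be sketched with two-argument indexing, as below. This assumes DataFrames 1.x and a made-up data frame; inside the column position, end stands for the last column.

```julia
using DataFrames

# Hypothetical example data
data = DataFrame(column1 = 1:2, column2 = 3:4, column3 = 5:6)

tail_cols = data[:, 2:end]          # columns 2 through the last one
same = data[:, 2:ncol(data)]        # equivalent, with the column count spelled out
by_position = data[:, Not(1)]       # everything except the first column
```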

Related

Change variable name dynamically

I have a DataFrame whose columns are sometimes [Type_House, Name, Location]
and sometimes [Type_Build, Name, Location].
Is there a way to access the Type column dynamically, something like this?
colName = "House"
dataframe.Type_colName
Thanks.
If you have
colName = "House"
you can access the column with
df[!, colName]
and from there you can use typeof() or eltype() to get the type or element type of that column.
As indicated by @jling, but specific to your question, it would be:
> colName = "House"
> df[!, "Type_"*colName]
or
> getproperty(df, "Type_"*colName)
then you can just change colName="Build" to select the other column.
If you want to access the column that starts with Type_, you can use the names function this way:
julia> df = DataFrame( Type_Build = ["foo", "bar"], Name = ["A", "B"])
2×2 DataFrame
 Row │ Type_Build  Name
     │ String      String
─────┼────────────────────
   1 │ foo         A
   2 │ bar         B
julia> names(df, startswith("Type_"))
1-element Vector{String}:
"Type_Build"
To access the values in the column, you can use that to index into the dataframe:
julia> df[!, names(df, startswith("Type_"))]
2×1 DataFrame
 Row │ Type_Build
     │ String
─────┼────────────
   1 │ foo
   2 │ bar
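Indexing with the vector returned by names gives a one-column DataFrame. If the column's values are wanted as a plain vector instead, one option is to pull out the single matching name first. This is a sketch that assumes exactly one column starts with Type_; only (from Base, Julia 1.4+) throws an error otherwise.

```julia
using DataFrames

df = DataFrame(Type_Build = ["foo", "bar"], Name = ["A", "B"])

# Exactly one column is assumed to match; `only` errors if there are zero or several
type_col = only(names(df, startswith("Type_")))

# Indexing with a single name returns the column vector itself
type_values = df[!, type_col]
```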

Add table inside plot Julia

I have the following bar plot:
using Plots, DataFrames
Plots.bar(["A", "B", "c"],[6,5,3],fillcolor=[:red,:green,:blue], legend = :none)
Output:
I would like to add a simple small table inside the plot, in the top right corner. The table should contain the following values from the DataFrame:
df = DataFrame(x = ["A", "B", "c"], y = [6,5,3])
3×2 DataFrame
 Row │ x       y
     │ String  Int64
─────┼───────────────
   1 │ A           6
   2 │ B           5
   3 │ c           3
So I was wondering if anyone knows how to add a simple table to a Plots graph in Julia?
You can use the following:
using Plots, DataFrames
df = DataFrame(; x=["A", "B", "c"], y=[6, 5, 3])
plt = Plots.bar(
df[!, :x], df[!, :y]; fillcolor=[:red, :green, :blue], legend=:none, dpi=300
)
Plots.annotate!(
1,
4,
Plots.text(
replace(string(df), ' ' => '\u00A0'); family="Courier", halign=:left, color="black"
),
)
Plots.savefig("plot.svg")
Which gets you the following plot (after conversion to PNG for uploading to StackOverflow):
Note that we have to replace space characters with '\u00A0' (non-breaking space) to prevent multiple consecutive spaces from being collapsed in the SVG output (SVG renderers collapse consecutive spaces by default).

How do I check if all elements of DataFrame are non-negative?

Suppose I have a DataFrame with numeric elements. I want to check that all the elements are non-negative. I can do something like:
df .> 0
which results in a DataFrame of ones and zeros. How do I reduce it to a one true/false value?
The almost non-allocating and efficient way to do it is:
all(all.(>(0), eachcol(df)))
or
all(all.(x -> isless(0, x), eachcol(df)))
depending on how you want to handle missing values.
Here is an example of the difference:
julia> df = DataFrame(a=[1, missing], b=1:2)
2×2 DataFrame
 Row │ a        b
     │ Int64?   Int64
─────┼────────────────
   1 │       1      1
   2 │ missing      2
julia> all(all.(>(0), eachcol(df)))
missing
julia> all(all.(x -> isless(0, x), eachcol(df)))
true
The results differ because isless treats missing as greater than any other value, whereas > propagates missing.
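Note that >(0) tests for strictly positive values, following the df .> 0 in the question. If zero should count as non-negative, the same pattern can be written with >=(0). A small sketch with made-up data:

```julia
using DataFrames

# Hypothetical data containing a zero
df = DataFrame(a = [0, 1], b = [2, 3])

all_positive = all(all.(>(0), eachcol(df)))       # the 0 in column a fails this test
all_nonnegative = all(all.(>=(0), eachcol(df)))   # zero is accepted here

# Variant that treats missing as "not negative", using isless
dfm = DataFrame(a = [0, missing], b = [2, 3])
nonnegative_or_missing = all(all.(x -> !isless(x, 0), eachcol(dfm)))
```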

How do you change multiple column names in a Julia (version 0.3) DataFrame?

For example say you create a Julia DataFrame like so with 20 columns:
y=convert(DataFrame, randn(10,20))
How do you convert the column names (:x1 ... :x20) to something else, like (:col1, ..., :col20) for example, all at once?
You might find the names! function more concise:
julia> using DataFrames
julia> df = DataFrame(x1 = 1:2, x2 = 2:3, x3 = 3:4)
2x3 DataFrame
|-------|----|----|----|
| Row # | x1 | x2 | x3 |
| 1 | 1 | 2 | 3 |
| 2 | 2 | 3 | 4 |
julia> names!(df, [symbol("col$i") for i in 1:3])
Index([:col2=>2,:col1=>1,:col3=>3],[:col1,:col2,:col3])
julia> df
2x3 DataFrame
|-------|------|------|------|
| Row # | col1 | col2 | col3 |
| 1 | 1 | 2 | 3 |
| 2 | 2 | 3 | 4 |
One way to do this is with the rename! function. The rename! method that takes a DataFrame only lets you change a single column name at a time (as of the development version 0.3 branch on 1/4/2014). Looking into the code of Index.jl in the DataFrames repository led me to this solution, which works for me:
rename!(y.colindex, [(symbol("x$i")=>symbol("col$i")) for i in 1:20])
y.colindex returns the index for the DataFrame y, and the next argument creates a dictionary mapping the old column symbols to the new column symbols. I imagine that by the time someone else needs this there will be a nicer way, but I just spent a few hours figuring this out in the development version 0.3 of Julia, so I thought I would share!
As an update to the answer of @JohnMylesWhite, the names! function has been deprecated in DataFrames v0.20.2. The current way of going about this is the rename! function:
import DataFrames
DF = DataFrames
df = DF.DataFrame(x1 = 1:2, x2 = 2:3, x3 = 3:4)
println(df)
DF.rename!(df, [Symbol("Col$i") for i in 1:size(df,2)])
println(df)
v1.1.0
One can directly change the column names by
names!(df, colNames_as_Symbols)
To rename the columns with a vector of strings, this can be done via
names!(df, Symbol.(colNames_as_strings) )
# import Pkg; Pkg.add("DataFrames")
using DataFrames
The question has been answered, but for the additional clarity, sometimes you just want to specify the names without using loops (i.e. over-engineering):
rename!(df, [:Date, :feature_1, :feature_2 ], makeunique=true)
Example output:
141 rows × 3 columns
    Date        feature_1  feature_2
    Date        Float64?   Float64?
1   2020-08-03  44.3       missing
Update:
For Julia 0.4, as described by John Myles White, all the names can be changed with:
names!(df::AbstractDataFrame, vals)
where vals is a Vector{Symbol} the same length as
the number of columns in df.
Specific names can be changed with:
rename!(df::AbstractDataFrame, from::Symbol, to::Symbol)
rename!(df::AbstractDataFrame, d::Associative)
rename!(f::Function, df::AbstractDataFrame)
where d is an Associative type that maps the original name to a new name
and f is a function that has the old column name (a symbol) as input
and new column name (a symbol) as output.
This is documented in the code at https://github.com/JuliaStats/DataFrames.jl/blob/7e2f48ad9f31185d279fdd81d6413a79b7e42e87/src/abstractdataframe/abstractdataframe.jl
This is the short and simple answer for Julia 1.1.1:
names!(df, [Symbol("Col$i") for i in 1:size(df,2)])
Use the rename! function with an array containing the new names:
Vector_with_names = ["col1","col2","col3"]
rename!(df,Vector_with_names)
Using John's DataFrame, I had to use colnames! instead of names!
df = DataFrame(x1 = 1:2, x2 = 2:3, x3 = 3:4)
colnames!(df, ["col$i" for i in 1:3])
My version of Julia is 0.2.1
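The answers above span many DataFrames versions. For readers on a current release, the rename! variants can be sketched as follows. This assumes DataFrames 1.x, and the new column names are made up for illustration.

```julia
using DataFrames

df = DataFrame(x1 = 1:2, x2 = 2:3, x3 = 3:4)

# 1. Replace every name at once with a vector of new names
rename!(df, ["col$i" for i in 1:ncol(df)])

# 2. Rename selected columns with old => new pairs
rename!(df, :col1 => :first, :col2 => :second)

# 3. Derive each new name from the old one; the function receives the name as a String
rename!(uppercase, df)
```

The non-mutating rename accepts the same arguments and returns a new DataFrame instead of changing df.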

Replace string within a variable in R using fuzzy matching

I have data with 3 relevant variables: the first is an activity (x1), the second is the respondent's rating of that activity (x2), and the third is the proper name of the activity in x1 (x3). The x1 values are written by respondents and are very close matches to the reference activity names in x3, but each is a little different. I would like to match and replace all x1 values with the reference x3. I was thinking of looping over each reference activity in x3 and replacing the respondent-written x1 value using a function like agrep. However, agrep seems only to tell me what the matches are. How can I replace the x1 values with the "correct" string title in x3?
In R, the function agrep returns the indices where it found a match, not the number of matches:
agrep('chrg', c('charge', 'trapper', 'friend', 'charger'))
# [1] 1 4
If you would like to have the value instead of the index, you can pass value=TRUE.
agrep('chrg', c('charge', 'trapper', 'friend', 'charger'), value=TRUE)
# [1] "charge" "charger"
EDIT after your update:
If x1 and x3 are in phase (for each index you have the names of the same activity) here is a snippet that will do the trick.
subs <- function(x, old, new) {
# Replace 'old' by 'new' in 'x'.
matchv <- match(x, old, nomatch=0)
replace(x, matchv > 0, new[matchv])
}
# y is any vector that contains short names.
subs(y, x1, x3)
If they are not in phase you can create the old and new vectors as follows with agrep.
oldnew <- sapply(x1, function(x) { agrep(x, x3, value=TRUE)[1] })
subs(y, names(oldnew), oldnew)
