I have a number of lists of conditions and I would like to evaluate their combinations, and then I'd like to get binary values for these logical values (True = 1, False = 0). The conditions themselves may change or grow as my project progresses, and so I'd like to have one place within the script where I can alter these conditional statements, while the rest of the script stays the same.
Here is a simplified, reproducible example:
# get the data
df <- data.frame(id = c(1,2,3,4,5), x = c(11,4,8,9,12), y = c(0.5,0.9,0.11,0.6, 0.5))
# name and define the conditions
names1 <- c("above2","above5")
conditions1 <- c("df$x > 2", "df$x >5")
names2 <- c("belowpt6", "belowpt4")
conditions2 <- c("df$y < 0.6", "df$y < 0.4")
# create an object that contains the unique combinations of these conditions and their names, to be used for labeling columns later
names_combinations <- as.vector(t(outer(names1, names2, paste, sep="_")))
condition_combinations <- as.vector(t(outer(conditions1, conditions2, paste, sep=" & ")))
# create a dataframe of the logical values of these conditions
condition_combinations_logical <- ????? # This is where I need help
# convert the logical values to binary (unary + coerces TRUE/FALSE to 1/0)
df[paste0("var_", names_combinations)] <- +(condition_combinations_logical)
to get output that could look something like:
id | x  | y    | var_above2_belowpt6 | var_above2_belowpt4 | etc.
1  | 11 | 0.5  | 1                   | 0                   |
2  | 4  | 0.9  | 0                   | 0                   |
3  | 8  | 0.11 | 1                   | 1                   |
etc. ....
Looks like the dreaded eval(parse()) does it (hard to think of a much easier way ...). Then use storage.mode()<- to convert from logical to integer ...
res <- sapply(condition_combinations,function(x) eval(parse(text=x)))
storage.mode(res) <- "integer"
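Putting the pieces together, a minimal sketch of the whole pipeline might look like the following. It reuses the question's data and condition strings; wrapping the result in as.data.frame() before assigning it is a choice made for this example, not something the answer specifies. Keep in mind that eval(parse()) runs whatever text it is handed, so this only makes sense when you write the condition strings yourself.

```r
# Data and conditions copied from the question
df <- data.frame(id = c(1, 2, 3, 4, 5), x = c(11, 4, 8, 9, 12),
                 y = c(0.5, 0.9, 0.11, 0.6, 0.5))
names1 <- c("above2", "above5")
conditions1 <- c("df$x > 2", "df$x > 5")
names2 <- c("belowpt6", "belowpt4")
conditions2 <- c("df$y < 0.6", "df$y < 0.4")

# All pairwise combinations, in matching order for names and conditions
names_combinations <- as.vector(t(outer(names1, names2, paste, sep = "_")))
condition_combinations <- as.vector(t(outer(conditions1, conditions2, paste, sep = " & ")))

# Evaluate each condition string; the result is a logical matrix (rows x conditions)
res <- sapply(condition_combinations, function(x) eval(parse(text = x)))
storage.mode(res) <- "integer"

# Attach one 0/1 column per combination (as.data.frame() is this sketch's assumption)
df[paste0("var_", names_combinations)] <- as.data.frame(res)
print(df)
```

Only the two condition vectors and their names need to change when the rules grow; everything after them stays the same.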
I'm having difficulty developing a function/algorithm that updates a dataframe based on certain conditions. I've looked at some answers related to "updating" a dataframe via for loops, but I'm still stuck.
Say I have a dataframe:
df <- data.frame("data_low" = .2143, "data_high" = .7149)
where data_low and data_high are the min and max of some column in a dataframe
I also have two functions:
checker(b[1,])
Takes the value of data_high and data_low, and returns a scalar. If the scalar is less than 1, I'd like to store this in another dataframe, say "d". Else, I want to split "b" with the following function:
splitter()
splits "b" by the median of data_high and data_low.
I've considered trying to develop this with a loop:
storage <- data.frame(data_low = double(), data_high = double())
for( i in 1:nrow(b)){
  if(checker(b[i,]) <1){
    storage <- splitter(b[i,])
  } else {
    temp <- splitter(b[i,])
    b <- rbind(b,temp)
  }
}
My desired output after two iterations (where check > 1 for each row):
** Obviously these numbers are picked at random, I'm just hoping to gain some intuition related to looping/updating dataframes based on cases..
starting at i = 0:
| .2143 | .7149 |,
i = 2
| .2143 | .4442 | ** Note that splitter() should break this into 2 rows after i = 2 is complete.
| .4442 | .7149 | ** And again here
i = 3
| .2143 | .3002 |
| .3002 | .4442 |
| .4442 | .5630 |
| .5630 | .7149 |
Can anyone give me some tips on how to organize this loop? I'm thinking my issue here is related to rbind and/or the actual updating of b.
I recognize that much of this code isn't reproducible, but I am more interested in the thought process here.
Any help would be greatly appreciated!
You can do this with a nested loop (one for the number of iterations and one for the number of rows in b), or using nested Reduce calls, as shown here.
Reduce(function(x, y) {
  List=apply(x, 1, function(z) {
    med=median(c(z[1], z[2]))
    dat=data.frame(data_low=c(z[1], med), data_high=c(med, z[2]))
    rownames(dat)=NULL
    return(dat)
  })
  Reduce(function(w, z) rbind(w, z), List)
}, rep(NA, 2), init=df)
One rep:
data_low data_high
1 0.2143 0.4646
2 0.4646 0.7149
Two reps:
data_low data_high
1 0.21430 0.33945
2 0.33945 0.46460
3 0.46460 0.58975
4 0.58975 0.71490
Three reps:
data_low data_high
1 0.214300 0.276875
2 0.276875 0.339450
3 0.339450 0.402025
4 0.402025 0.464600
5 0.464600 0.527175
6 0.527175 0.589750
7 0.589750 0.652325
8 0.652325 0.714900
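For comparison, the same repeated splitting can be sketched with an explicit loop, which may be easier to follow than nested Reduce calls. The splitter() below is a hypothetical stand-in (the question does not show its implementation), and split_all() is a helper name introduced only for this example. The key point is that each pass builds a new data frame from the old one rather than appending to b while iterating over it.

```r
# Hypothetical splitter: cut one interval (a one-row data frame) at its midpoint
splitter <- function(row) {
  med <- median(c(row$data_low, row$data_high))
  data.frame(data_low = c(row$data_low, med),
             data_high = c(med, row$data_high))
}

# Helper (assumed name): split every row of b and stack the pieces
split_all <- function(b) {
  pieces <- lapply(seq_len(nrow(b)), function(i) splitter(b[i, ]))
  do.call(rbind, pieces)
}

b <- data.frame(data_low = 0.2143, data_high = 0.7149)
for (iter in 1:2) {
  b <- split_all(b)  # replace b wholesale instead of rbind-ing inside the row loop
}
rownames(b) <- NULL
print(b)
```

After two passes this gives the same four intervals as the "Two reps" output above. A checker() test could be added inside split_all() to decide which rows to split and which to set aside.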
I have a huge list of small data frames which I would like to meaningfully combine into one, however the logic around how to do so escapes me.
For instance, if I have a list of data frames that look something like this albeit with far more files, many of which I do not want in my data frame:
MyList = c("AthosVersusAthos.csv", "AthosVersusPorthos.csv", "AthosVersusAramis.csv", "PorthosVersusAthos.csv", "PorthosVersusPorthos.csv", "PorthosVersusAramis.csv", "AramisVersusAthos.csv", "AramisVersusPorthos.csv", "AramisVersusAramis.csv", "BobVersusMary.csv", "LostCities.txt")
What I want is to assemble these into one large data frame. Which would look like this.
| |
AthosVersusAthos | PorthosVersusAthos | AramisVersusAthos
| |
------------------------------------------------------
| |
AthosVersusPorthos | PorthosVersusPorthos| AramisVersusPorthos
| |
------------------------------------------------------
| |
AthosVersusAramis | PorthosVersusAramis| AramisVersusAramis
| |
Or perhaps more correctly (with sample numbers in only one portion of the matrix):
| Athos | Porthos | Aramis
-------|------------------------------------------------------
| 10 9 5 | |
Athos | 2 10 4 | |
| 3 0 10 | |
-------|------------------------------------------------------
| | |
Porthos | | |
| | |
-------|------------------------------------------------------
| | |
Aramis | | |
| | |
-------------------------------------------------------------
What I have managed so far is:
Musketeers = c("Athos", "Porthos", "Aramis")
for(i in 1:length(Musketeers)) {
for(j in 1:length(Musketeers)) {
CombinedMatrix <- cbind (
rbind(MyList[grep(paste0("^(", Musketeers[i],
")(?=.*Versus[", Musketeers[j], "]"), names(MyList),
value = T, perl=T)])
)
}
}
What I was trying to do was combine my grep command (quite important given the number of files and the specificity with which I need to select them) with rbind and cbind, so that the rows and the columns of the matrix are meaningfully concatenated.
My general plan was to merge all the data frames starting with 'Athos' into one column, and doing this once again for data frames starting with 'Porthos' and 'Aramis', and then combine those three columns, row-wise into a final dataframe.
I know I'm quite far off but I can't quite get my head around where to start.
Edit: @PierreGramme generated a useful model data set which I will add below seeing as I imagine it would have been useful to provide it originally.
Musketeers = c("Athos", "Porthos", "Aramis")
MyList = c("AthosVersusAthos.csv", "AthosVersusPorthos.csv", "AthosVersusAramis.csv",
"PorthosVersusAthos.csv", "PorthosVersusPorthos.csv", "PorthosVersusAramis.csv",
"AramisVersusAthos.csv", "AramisVersusPorthos.csv", "AramisVersusAramis.csv",
"BobVersusMary.csv", "LostCities.txt")
MyList = lapply(setNames(nm=MyList), function(x) matrix(rnorm(9), nrow=3, dimnames=list(c("a","b","c"), c("x","y","z"))) )
First make a reproducible example. Is it faithful? If so, I will add code to answer
Musketeers = c("Athos", "Porthos", "Aramis")
MyList = c("AthosVersusAthos.csv", "AthosVersusPorthos.csv", "AthosVersusAramis.csv",
"PorthosVersusAthos.csv", "PorthosVersusPorthos.csv", "PorthosVersusAramis.csv",
"AramisVersusAthos.csv", "AramisVersusPorthos.csv", "AramisVersusAramis.csv",
"BobVersusMary.csv", "LostCities.txt")
MyList = lapply(setNames(nm=MyList), function(x) matrix(rnorm(9), nrow=3, dimnames=list(c("a","b","c"), c("x","y","z"))) )
And then is it correct that you would like to concatenate 9 of these matrices into your combined matrix shaped as you described?
Edit:
Then the code solving your problem:
# Helper function to extract the relevant portion of MyList and rbind() it
makeColumns = function(n){
    re = paste0("^",n,"Versus")
    sublist = MyList[grep(re, names(MyList))]
    names(sublist) = sub(re, "", sub("\\.csv$","", names(sublist)))
    # Make sure sublist is sorted correctly and contains info on all musketeers
    sublist = sublist[Musketeers]
    # Change row and col names so that they are unique in the final result
    sublist = lapply(names(sublist), function(m) {
        res = sublist[[m]]
        rownames(res) = paste0(m,"_",rownames(res))
        colnames(res) = paste0(n,"_",colnames(res))
        res
    })
    do.call(rbind, sublist)
}
lColumns = lapply(setNames(nm=Musketeers), makeColumns)
CombinedMatrix = do.call(cbind, lColumns)
Scaled down my dataframe looks like this:
+---+------------+-------------+
| | Label1 | Label2 |
+---+------------+-------------+
| 1 | T | F |
| 2 | F | F |
| 3 | T | T |
+---+------------+-------------+
I need to create a list of lists that map the column names to all the row numbers that have a false boolean as their value. For the above example it would look something like this:
{"Label1" : (2), "Label2" : (1,2)}
I am currently doing it as so:
myList = with(data.frame(which(!myDataFrame, arr.ind = TRUE)), list("colNames" = names(myDataFrame)[col], "rows" = row))
l = list()
count = 1;
for (i in myList[["colNames"]]) {
    tmpRowNum = myList[["rows"]][[count]];
    tmpList = l[[i]];
    if (is.null(tmpList)) {
        tmpList = list();
    }
    l[[i]] = c(tmpList, list(tmpRowNum))
    count = count + 1;
}
This does work, but as I am new to R I can only assume there is a more efficient method of doing this. The with function creates two separate lists that I essentially have to combine to get the result that I am looking for.
You could try:
df <- data.frame(Label1=c("T","F","T"),Label2=c("F","F","T"))
lapply(df,function(x) which(x=="F"))
$Label1
[1] 2
$Label2
[1] 1 2
EDIT To get the same by row, use apply with margin=1:
apply(df,1,function(x) which(x=="F"))
To get a vector of the "F"s in row 2:
res <- apply(df,1,function(x) which(x=="F"))
res[[2]]
1 2
One useful way to get the row/column index is with which and arr.ind
i1 <- which(df=="F", arr.ind=TRUE)
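If the columns hold genuine logical values, as the which(!myDataFrame, ...) call in the question suggests, rather than "T"/"F" strings, the same idea can be sketched without the string comparison. The data frame below is a stand-in built from the question's small table; falseRows is a name introduced for this example.

```r
# Stand-in for the question's data, assuming real logical columns
myDataFrame <- data.frame(Label1 = c(TRUE, FALSE, TRUE),
                          Label2 = c(FALSE, FALSE, TRUE))

# One list element per column, holding the row numbers where the value is FALSE
falseRows <- lapply(myDataFrame, function(x) which(!x))
print(falseRows)
```

Because a data frame is a list of columns, lapply() keeps the column names as the names of the result, which gives the name-to-rows mapping directly.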
I've got a data frame with this structure:
> df
modifications
13-MOD:0057
13-MOD:0046
13-MOD:0051,13-MOD:0076
13-MOD:0036,13-MOD:0076,13-MOD:0016
13-MOD:0256,13-MOD:0156,13-MOD:0956,13-MOD:0125
13-MOD:0014 13-MOD:0156, 13-MOD:0956,13-MOD:0125...n
13-MOD:0012 ... n
To split the data I used this code:
df2 <- data.frame(str_split_fixed(df$modifications, ",", 20))
Basically, I get this data.
> df2
X1 | X2 | X3 | X4 |
13-MOD:0057 | empty | empty | empty |
13-MOD:0046 | empty | empty | empty |
13-MOD:0051 | 13-MOD:0076 | empty | empty |
13-MOD:0036 | 13-MOD:0076 | 13-MOD:0016 | empty |
13-MOD:0256 | 13-MOD:0156 | 13-MOD:0956 | 13-MOD:0125
13-MOD:0014 | 13-MOD:0156 | 13-MOD:0956 | 13-MOD:0125 | ... n
13-MOD:0012 | ... | ...n
What I want is to remove the empty values and stack the data from columns X2, X3, X4 ... n onto the first one, X1.
To do that I was using this:
df3 <- melt(setDT(df2), # set df to a data.table
measure.vars = list(c(1:20)), # set column groupings
value.name = 'V')[ # set output name scheme
, -1, with = F]
To remove the empty values:
df3[df3==""] <- NA
histo3 = subset(df3, V1 != 'NA')
But I don't know why I get an error about the length of the column in the melt function. Do you know any way to make this easier?
Reproducible example:
df <- data.frame(modifications=c("UNIMOD:108,UNIMOD:108","UNIMOD:108","UNIMOD:108","UNIMOD:108,UNIMOD:108,UNIMOD:108","UNIMOD:108,UNIMOD:108,UNIMOD:108,UNIMOD:108,UNIMOD:108,UNIMOD:108","UNIMOD:108"))
Could it be something like this?
library(stringr)
# input dataset
s <- c('13-MOD:0057', '13-MOD:0046', '13-MOD:0051,13-MOD:0076', '13-MOD:0036,13-MOD:0076,13-MOD:0016', '13-MOD:0256,13-MOD:0156,13-MOD:0956,13-MOD:0125')
s
[1] "13-MOD:0057"
[2] "13-MOD:0046"
[3] "13-MOD:0051,13-MOD:0076"
[4] "13-MOD:0036,13-MOD:0076,13-MOD:0016"
[5] "13-MOD:0256,13-MOD:0156,13-MOD:0956,13-MOD:0125"
# get the individual lengths
lengths <- sapply(str_split(s,','), function(x){ length(x) })
# create the dataframe splitting in N columns
as.data.frame(str_split_fixed(s, ',', max(lengths)))
V1 V2 V3 V4
1 13-MOD:0057
2 13-MOD:0046
3 13-MOD:0051 13-MOD:0076
4 13-MOD:0036 13-MOD:0076 13-MOD:0016
5 13-MOD:0256 13-MOD:0156 13-MOD:0956 13-MOD:0125
UPDATE 1
To stack all the non-empty cells into a single column
# create the dataframe splitting in N columns
first.matrix <- str_split_fixed(s, ',', max(lengths))
# select only the cells != ""
first.matrix[which(first.matrix!="")]
[1] "13-MOD:0057" "13-MOD:0046" "13-MOD:0051" "13-MOD:0036" "13-MOD:0256" "13-MOD:0076"
[7] "13-MOD:0076" "13-MOD:0156" "13-MOD:0016" "13-MOD:0956" "13-MOD:0125"
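If the intermediate wide table is not needed at all, a shorter route is possibly to split and flatten in one step with base R. This is a sketch under that assumption, using the same input vector s as above; stacked is a name introduced for this example. Note that the values come out row by row here, whereas the matrix approach above returns them column by column.

```r
# Same input as in the answer above
s <- c('13-MOD:0057', '13-MOD:0046', '13-MOD:0051,13-MOD:0076',
       '13-MOD:0036,13-MOD:0076,13-MOD:0016',
       '13-MOD:0256,13-MOD:0156,13-MOD:0956,13-MOD:0125')

# Split each string on commas, flatten to one vector, and trim stray spaces
stacked <- trimws(unlist(strsplit(s, ",", fixed = TRUE)))

# Drop any empty entries (e.g. from trailing commas) and wrap in a data frame
stacked <- stacked[stacked != ""]
df3 <- data.frame(V1 = stacked)
print(df3)
```

No padding with empty strings is created in the first place, so there is nothing to melt or filter out afterwards.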
For example say you create a Julia DataFrame like so with 20 columns:
y=convert(DataFrame, randn(10,20))
How do you convert the column names (:x1 ... :x20) to something else, like (:col1, ..., :col20) for example, all at once?
You might find the names! function more concise:
julia> using DataFrames
julia> df = DataFrame(x1 = 1:2, x2 = 2:3, x3 = 3:4)
2x3 DataFrame
|-------|----|----|----|
| Row # | x1 | x2 | x3 |
| 1 | 1 | 2 | 3 |
| 2 | 2 | 3 | 4 |
julia> names!(df, [symbol("col$i") for i in 1:3])
Index([:col2=>2,:col1=>1,:col3=>3],[:col1,:col2,:col3])
julia> df
2x3 DataFrame
|-------|------|------|------|
| Row # | col1 | col2 | col3 |
| 1 | 1 | 2 | 3 |
| 2 | 2 | 3 | 4 |
One way to do this is with the rename! function. The rename! method takes a DataFrame as input, but only lets you change a single column name at a time (as of the development version 0.3 branch on 1/4/2014). Looking into the code of Index.jl in the DataFrames repository led me to this solution, which works for me:
rename!(y.colindex, [(symbol("x$i")=>symbol("col$i")) for i in 1:20])
y.colindex returns the index for the dataframe y, and the next argument creates a dictionary mapping the old column symbols to the new column symbols. I imagine that by the time someone else needs this, there will be a nicer way to do this, but I just spent a few hours figuring this out in the development version 0.3 of Julia, so I thought I would share!
As an update to the answer of @JohnMylesWhite, the names! function has been deprecated in DataFrames v 0.20.2. The latest way of going about this is by using the rename! function:
import DataFrames
DF = DataFrames
df = DF.DataFrame(x1 = 1:2, x2 = 2:3, x3 = 3:4)
println(df)
DF.rename!(df, [Symbol("Col$i") for i in 1:size(df,2)])
println(df)
v1.1.0
One can directly change the column names by
names!(df, colNames_as_Symbols)
To rename the columns with a vector of strings, this can be done via
names!(df, Symbol.(colNames_as_strings) )
# import Pkg; Pkg.add("DataFrames")
using DataFrames
The question has been answered, but for additional clarity: sometimes you just want to specify the names directly, without using loops (i.e. without over-engineering):
rename!(df, [:Date, :feature_1, :feature_2 ], makeunique=true)
Example output:
141 rows × 3 columns
Date feature_1 feature_2
Date Float64? Float64?
1 2020-08-03 44.3 missing
Update:
For Julia 0.4, as described by John Myles White, all the names can be changed with:
names!(df::AbstractDataFrame, vals)
where vals is a Vector{Symbol} the same length as
the number of columns in df.
Specific names can be changed with:
rename!(df::AbstractDataFrame, from::Symbol, to::Symbol)
rename!(df::AbstractDataFrame, d::Associative)
rename!(f::Function, df::AbstractDataFrame)
where d is an Associative type that maps the original name to a new name
and f is a function that has the old column name (a symbol) as input
and new column name (a symbol) as output.
This is documented in the code at https://github.com/JuliaStats/DataFrames.jl/blob/7e2f48ad9f31185d279fdd81d6413a79b7e42e87/src/abstractdataframe/abstractdataframe.jl
This is the short and simple answer for Julia 1.1.1:
names!(df, [Symbol("Col$i") for i in 1:size(df,2)])
Use the rename! function with an array containing the new names:
Vector_with_names = ["col1","col2","col3"]
rename!(df,Vector_with_names)
Using John's dataframe, i had to use colnames! instead of names!
df = DataFrame(x1 = 1:2, x2 = 2:3, x3 = 3:4)
colnames!(df, ["col$i" for i in 1:3])
My version of Julia is 0.2.1