I'm having difficulty developing a function/algorithm that updates a dataframe based on certain conditions. I've looked at some answers related to "updating" a dataframe via for loops, but I'm still stuck.
Say I have a dataframe:
df <- data.frame("data_low" = .2143, "data_high" = .7149)
where data_low and data_high are the min and max of some column in a dataframe.
I also have two functions:
checker(b[1,])
Takes the value of data_high and data_low, and returns a scalar. If the scalar is less than 1, I'd like to store this in another dataframe, say "d". Else, I want to split "b" with the following function:
splitter()
splits "b" by the median of data_high and data_low.
I've considered trying to develop this with a loop:
storage <- data.frame(data_low = double(), data_high = double())
for( i in 1:nrow(b)){
if(checker(b[i,]) <1){
storage <- splitter(b[i,])
} else {
temp <- splitter(b[i,])
b <- rbind(b,temp)
}
}
My desired output after two iterations (where checker() > 1 for each row):
** Obviously these numbers are picked at random, I'm just hoping to gain some intuition related to looping/updating dataframes based on cases..
starting at i = 0:
| .2143 | .7149 |,
i = 2
| .2143 | .4442 | ** Note that splitter() should break this into 2 rows after i = 2 is complete.
| .4442 | .7149 | ** And again here
i = 3
| .2143 | .3002 |
| .3002 | .4442 |
| .4442 | .5630 |
| .5630 | .7149 |
Can anyone give me some tips on how to organize this loop? I'm thinking my issue here is related to rbind and/or the actual updating of b.
I recognize that much of this code isn't reproducible, but I am more interested in the thought process here.
Any help would be greatly appreciated!
You can do this with a nested loop (one for the number of iterations and one for the number of rows in b), or using nested Reduce calls, as shown here.
Reduce(function(x, y) {
List=apply(x, 1, function(z) {
med=median(c(z[1], z[2]))
dat=data.frame(data_low=c(z[1], med), data_high=c(med, z[2]))
rownames(dat)=NULL
return(dat)
})
Reduce(function(w, z) rbind(w, z), List)
}, rep(NA, 2), init=df)
One rep:
data_low data_high
1 0.2143 0.4646
2 0.4646 0.7149
Two reps:
data_low data_high
1 0.21430 0.33945
2 0.33945 0.46460
3 0.46460 0.58975
4 0.58975 0.71490
Three reps:
data_low data_high
1 0.214300 0.276875
2 0.276875 0.339450
3 0.339450 0.402025
4 0.402025 0.464600
5 0.464600 0.527175
6 0.527175 0.589750
7 0.589750 0.652325
8 0.652325 0.714900
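If the stopping rule matters (store a row once checker() drops below 1, otherwise keep splitting), a work-queue loop may be easier to reason about than a for loop, because for evaluates 1:nrow(b) only once and so never visits rows appended by rbind. The sketch below is only an illustration: checker() and splitter() here are hypothetical stand-ins (an interval counts as "done" once it is narrower than 0.2), not the functions from the question.

```r
# Hypothetical stand-ins for the asker's functions (assumptions, not the originals).
# checker() returns a scalar that falls below 1 once the interval is narrower than 0.2.
checker <- function(row) (row$data_high - row$data_low) / 0.2
# splitter() cuts one interval into two halves at its midpoint.
splitter <- function(row) {
  mid <- (row$data_low + row$data_high) / 2
  data.frame(data_low = c(row$data_low, mid), data_high = c(mid, row$data_high))
}

refine <- function(b) {
  storage <- b[0, ]                 # empty data frame with the same columns
  while (nrow(b) > 0) {             # keep going until the queue is empty
    current <- b[1, ]               # take the first pending row
    b <- b[-1, ]                    # and remove it from the queue
    if (checker(current) < 1) {
      storage <- rbind(storage, current)   # finished: store it
    } else {
      b <- rbind(b, splitter(current))     # not finished: queue both halves
    }
  }
  rownames(storage) <- NULL
  storage
}

df <- data.frame(data_low = 0.2143, data_high = 0.7149)
result <- refine(df)
print(result)
```

With these stand-in definitions the loop stops after two rounds of splitting and returns the same four intervals as the "Two reps" output above.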
I am working with a database of three-dimensional vectors and am trying to calculate the surface area of the triangles between all possible combinations of three vectors. The goal is to get a list or dataframe containing the area for all possible combinations, each named based on the column names of the respective coordinates (e.g. c1:c2:c3).
For the moment, I get "invalid subscript type 'list'" as an error when running my function for the triangle calculation but I don't know how else to iterate through my list.
I am generating a list of all possible combinations of coordinates using combn
tridf <- combn(newdata, 3, simplify=FALSE) #newdata contains the coordinates, each column consists of a three-dimensional vector with x, y and z
Example for structure of newdata:
| c1 | c2 | c3 | c4 | c5 |
x| -8.99 | -8.71 | -10.52 | -8.38 | -55.76 |
y| -267.54 | -266.50 | -266.26 | -279.47 | -243.53 |
z| -117.85 | -122.87 | -200.95 | -146.96 | -130.40 |
dput(newdata):
structure(list(g = c("-8.993426322937012", "-267.54718017578125",
"-117.85099792480469"), n = c("-8.717547416687012", "-266.50799560546875",
"-122.87059020996094"), ale = c("-10.52885627746582", "-266.2621154785156",
"-200.95721435546875"), rhi = c("-8.382125854492188", "-279.47918701171875",
"-146.96658325195312"), fmo.r = c("-55.76047897338867", "-243.5348663330078",
"-130.4052734375")), row.names = c("V2", "V3", "V4"), class = "data.frame")
which gives me a list of n dataframes through which I now would like to iterate using the following function:
triarea <- function(i){
newtridf <- as.data.frame(tridf[[i]])
ab <- as.numeric(newtridf[,2])-as.numeric(newtridf[,1])
ac <- as.numeric(newtridf[,3])-as.numeric(newtridf[,1])
c <- as.data.frame(cross(ab,ac)) #cross is a function of library(pracma)
area <- 0.5*sqrt(c[1,]^2+c[2,]^2+c[3,]^2)
}
When running this code manually outside the function there is no problem and I always end up with the correct result for area, but when running this as a function, called using combn
newcombn <- combn(tridf, 1, triarea, simplify=FALSE)
it throws the following error:
Error in tridf[[i]] : invalid subscript type 'list'
I've been searching the web and trying things for hours now but I am completely lost, especially as I am relatively new to R and programming in general. I get that there seems to be a problem with the data being stored in a list, but I do not know how to approach solving this or how to refer iteratively to the respective column of the dataframe inside the list of dataframes without the need for auxiliary elements like newtridf.
Thank you very much in advance for your time and help!
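One likely reading of the error: combn(tridf, 1, triarea) does not pass an index to triarea. It passes tridf[a], a length-one list holding one data frame, so tridf[[i]] ends up being subscripted with a list. A possible way around it, sketched below under some assumptions, is to let the function take the data frame itself and apply it with sapply. The coordinates are the rounded values from the example table, entered as numbers (the dput output stores them as character strings, so real data would need as.numeric first), and cross3() is a hypothetical hand-written cross product used here instead of pracma::cross.

```r
# Rounded example coordinates from the table above, stored as numeric (an assumption).
newdata <- data.frame(
  c1 = c(-8.99, -267.54, -117.85),
  c2 = c(-8.71, -266.50, -122.87),
  c3 = c(-10.52, -266.26, -200.95),
  c4 = c(-8.38, -279.47, -146.96),
  c5 = c(-55.76, -243.53, -130.40),
  row.names = c("x", "y", "z")
)

# Hypothetical helper: cross product of two 3-vectors.
cross3 <- function(u, v) {
  c(u[2] * v[3] - u[3] * v[2],
    u[3] * v[1] - u[1] * v[3],
    u[1] * v[2] - u[2] * v[1])
}

# Takes a data frame whose three columns are the triangle's corners.
triarea <- function(tri) {
  ab <- tri[[2]] - tri[[1]]
  ac <- tri[[3]] - tri[[1]]
  0.5 * sqrt(sum(cross3(ab, ac)^2))   # half the length of the cross product
}

tridf <- combn(newdata, 3, simplify = FALSE)   # list of 3-column data frames
areas <- sapply(tridf, triarea)                # one area per combination
names(areas) <- combn(names(newdata), 3, paste, collapse = ":")
print(areas)
```

Because both combn calls enumerate the combinations in the same order, each area is labelled with the matching column names, e.g. c1:c2:c3.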
I have a huge list of small data frames which I would like to meaningfully combine into one, however the logic around how to do so escapes me.
For instance, if I have a list of data frames that look something like this albeit with far more files, many of which I do not want in my data frame:
MyList = c("AthosVersusAthos.csv", "AthosVersusPorthos.csv", "AthosVersusAramis.csv", "PorthosVersusAthos.csv", "PorthosVersusPorthos.csv", "PorthosVersusAramis.csv", "AramisVersusAthos.csv", "AramisVersusPorthos.csv", "AramisVersusAramis.csv", "BobVersusMary.csv", "LostCities.txt")
What I want is to assemble these into one large data frame, which would look like this:
| |
AthosVersusAthos | PorthosVersusAthos | AramisVersusAthos
| |
------------------------------------------------------
| |
AthosVersusPorthos | PorthosVersusPorthos | AramisVersusPorthos
| |
------------------------------------------------------
| |
AthosVersusAramis | PorthosVersusAramis| AramisVersusAramis
| |
Or perhaps more correctly (with sample numbers in only one portion of the matrix):
| Athos | Porthos | Aramis
-------|------------------------------------------------------
| 10 9 5 | |
Athos | 2 10 4 | |
| 3 0 10 | |
-------|------------------------------------------------------
| | |
Porthos | | |
| | |
-------|------------------------------------------------------
| | |
Aramis | | |
| | |
-------------------------------------------------------------
What I have managed so far is:
Musketeers = c("Athos", "Porthos", "Aramis")
for(i in 1:length(Musketeers)) {
for(j in 1:length(Musketeers)) {
CombinedMatrix <- cbind (
rbind(MyList[grep(paste0("^(", Musketeers[i],
")(?=.*Versus[", Musketeers[j], "]"), names(MyList),
value = T, perl=T)])
)
}
}
What I was trying to do was use my grep command (quite important given the number of files and the specificity with which I need to select them) and then combine rbind and cbind so that the rows and the columns of the matrix are meaningfully concatenated.
My general plan was to merge all the data frames starting with 'Athos' into one column, and doing this once again for data frames starting with 'Porthos' and 'Aramis', and then combine those three columns, row-wise into a final dataframe.
I know I'm quite far off but I can't quite get my head around where to start.
Edit: @PierreGramme generated a useful model data set, which I will add below since it would have been useful to provide it originally.
Musketeers = c("Athos", "Porthos", "Aramis")
MyList = c("AthosVersusAthos.csv", "AthosVersusPorthos.csv", "AthosVersusAramis.csv",
"PorthosVersusAthos.csv", "PorthosVersusPorthos.csv", "PorthosVersusAramis.csv",
"AramisVersusAthos.csv", "AramisVersusPorthos.csv", "AramisVersusAramis.csv",
"BobVersusMary.csv", "LostCities.txt")
MyList = lapply(setNames(nm=MyList), function(x) matrix(rnorm(9), nrow=3, dimnames=list(c("a","b","c"), c("x","y","z"))) )
First make a reproducible example. Is it faithful? If so, I will add code to answer
Musketeers = c("Athos", "Pothos", "Aramis")
MyList = c("AthosVersusAthos.csv", "AthosVersusPothos.csv", "AthosVersusAramis.csv",
"PothosVersusAthos.csv", "PothosVersusPothos.csv", "PothosVersusAramis.csv",
"AramisVersusAthos.csv", "AramisVersusPothos.csv", "AramisVersusAramis.csv",
"BobVersusMary.csv", "LostCities.txt")
MyList = lapply(setNames(nm=MyList), function(x) matrix(rnorm(9), nrow=3, dimnames=list(c("a","b","c"), c("x","y","z"))) )
And then is it correct that you would like to concatenate 9 of these matrices into your combined matrix shaped as you described?
Edit:
Then the code solving your problem:
# Helper function to extract the relevant portion of MyList and rbind() it
makeColumns = function(n){
re = paste0("^",n,"Versus")
sublist = MyList[grep(re, names(MyList))]
names(sublist) = sub(re, "", sub("\\.csv$","", names(sublist)))
# Make sure sublist is sorted correctly and contains info on all musketeers
sublist = sublist[Musketeers]
# Change row and col names so that they are unique in the final result
sublist = lapply(names(sublist), function(m) {
res = sublist[[m]]
rownames(res) = paste0(m,"_",rownames(res))
colnames(res) = paste0(n,"_",colnames(res))
res
})
do.call(rbind, sublist)
}
lColumns = lapply(setNames(nm=Musketeers), makeColumns)
CombinedMatrix = do.call(cbind, lColumns)
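If the file names follow the pattern exactly, a variant without grep is to build each name with paste0 and look it up directly. This is only a sketch of that alternative, not the code above: it regenerates a small random MyList (with the "Porthos" spelling) so that it runs on its own, and it assumes every "<first>Versus<second>.csv" entry exists.

```r
set.seed(1)
Musketeers <- c("Athos", "Porthos", "Aramis")

# Model data: nine 3x3 matrices plus one entry that should be ignored.
files <- c(as.vector(outer(Musketeers, Musketeers, paste, sep = "Versus")),
           "BobVersusMary")
files <- paste0(files, ".csv")
MyList <- lapply(setNames(nm = files), function(f) {
  matrix(rnorm(9), nrow = 3, dimnames = list(c("a", "b", "c"), c("x", "y", "z")))
})

# Outer lapply: one column block per first name.
# Inner lapply: stack the blocks for each second name.
CombinedMatrix <- do.call(cbind, lapply(Musketeers, function(first) {
  do.call(rbind, lapply(Musketeers, function(second) {
    m <- MyList[[paste0(first, "Versus", second, ".csv")]]
    # Prefix row and column names so they stay unique in the result
    dimnames(m) <- list(paste0(second, "_", rownames(m)),
                        paste0(first, "_", colnames(m)))
    m
  }))
}))

print(dim(CombinedMatrix))
```

A missing file would show up as NULL in the lookup, so a real script would probably want to check for that before calling rbind.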
I've got a coding problem I'm not able to solve on my own, so I'd appreciate any help from you. To summarise, I'd like to create a new column attached to my dataframe listing the column names of those cells which match a specific condition (row by row). I have searched for solutions for a very long time, but I still haven't found the right one for me.
Let's say I got a dataframe like this:
a <- c(90, NA,20)
b <- c(NA, 89, 20)
d <- as.data.frame(cbind(a,b))
names(d) <- c("house", "cat")
| house | cat
--|-------|----
1 | 90 | NA
--|-------|----
2 | NA | 89
--|-------|----
3 | 20 | 20
I'd like to get a final data frame with a new column, which lists all the column names of those cell values that are not NA. So, ideally it would look like this:
| house | cat | newcol
---|-------|-----|--------
1 | 90 | NA | house
---|-------|-----|--------
2 | NA | 89 | cat
---|-------|-----|--------
3 | 20 | 20 | house, cat
I must admit that, even though I've been searching for this for about a week now, I have trouble indexing the cells and the column names. I've tried a for loop and I've tried using apply. I've tried every one-bracket and two-bracket version I could think of. I tried to include which() in apply, I tried... a lot.
Most of the time I addressed the rows within the apply function because, as I understand it, I want the function to loop over the rows and write a new vector at the end of each row. But it didn't get me anywhere; one of the many versions was this one:
col <- colnames(d)[apply(d, 1, function(x) which(!is.na(x),arr.ind=T))]
But it prints an error: "Error in colnames(d)[apply(d, 1, function(x) which(!is.na(x), arr.ind = T))] : invalid subscript type 'list'"
So I tried addressing the columns, which didn't do it either...:
col <- colnames(d)[apply(d, 2, function(x) which(!is.na(x),arr.ind=T))]
col
[1] "house" NA "cat" NA
I also had the colname reference within apply, trying to row by row build vectors. (I've tried this also with print() or paste() around the colnamesindex):
similar <- c(similar, apply(d, 1, function(x) colnames(x)[x[!is.na(x)]]))
The last thing I tried was without a loop:
e <- which(!is.na(d),arr.ind=T)
list <- names(d[e[,2]])
list
[1] "house" "house.1" "cat" "cat.1"
But this code is running down the columns and the output doesn't allow me to match elements of the output with its corresponding row.
I'd very much appreciate your help. I feel like I'm not asking for a complicated thing to be done but still it's too complicated for me. (I'd like to add that I just started using R so my current workflow is still mostly google-trial and error.)
I'd be very happy to learn from you.
Thank you very much.
LK
This should do it...
d$newcol <- apply(d,1,function(x) paste(names(d)[!is.na(x)],collapse=", "))
d
house cat newcol
1 90 NA house
2 NA 89 cat
3 20 20 house, cat
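As for why the first attempt in the question failed: when the rows have different numbers of non-NA cells, apply cannot simplify the result and returns a list, and a list cannot be used as a subscript for colnames(). The small sketch below shows that intermediate list and one way to collapse it per row; the names idx and newcol are just illustrative.

```r
d <- data.frame(house = c(90, NA, 20), cat = c(NA, 89, 20))

# Row-wise positions of the non-NA cells; the lengths differ (1, 1, 2),
# so apply() returns a list rather than a vector or matrix.
idx <- apply(d, 1, function(x) which(!is.na(x)))
print(is.list(idx))

# Collapse each row's positions into one comma-separated string of column names.
newcol <- sapply(idx, function(i) paste(names(d)[i], collapse = ", "))
print(newcol)
```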
I have a number of lists of conditions and I would like to evaluate their combinations, and then I'd like to get binary values for these logical values (True = 1, False = 0). The conditions themselves may change or grow as my project progresses, and so I'd like to have one place within the script where I can alter these conditional statements, while the rest of the script stays the same.
Here is a simplified, reproducible example:
# get the data
df <- data.frame(id = c(1,2,3,4,5), x = c(11,4,8,9,12), y = c(0.5,0.9,0.11,0.6, 0.5))
# name and define the conditions
names1 <- c("above2","above5")
conditions1 <- c("df$x > 2", "df$x >5")
names2 <- c("belowpt6", "belowpt4")
conditions2 <- c("df$y < 0.6", "df$y < 0.4")
# create an object that contains the unique combinations of these conditions and their names, to be used for labeling columns later
names_combinations <- as.vector(t(outer(names1, names2, paste, sep="_")))
condition_combinations <- as.vector(t(outer(conditions1, conditions2, paste, sep=" & ")))
# create a dataframe of the logical values of these conditions
condition_combinations_logical <- ????? # This is where I need help
# lapply to get binary values from these logical vectors
df[paste0("var_", names_combinations)] <- +(condition_combinations_logical)
to get output that could look something like:
-id -- | -x -- | -y -- | -var_above2_belowpt6 -- | -var_above2_belowpt4 -- | etc.
1 | 11 | 0.5 | 1 | 0 |
2 | 4 | 0.9 | 0 | 0 |
3 | 8 | 0.11 | 1 | 1 |
etc. ....
Looks like the dreaded eval(parse()) does it (hard to think of a much easier way ...). Then use storage.mode()<- to convert from logical to integer ...
res <- sapply(condition_combinations,function(x) eval(parse(text=x)))
storage.mode(res) <- "integer"
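If you would rather avoid eval(parse()), one option is to keep the conditions as named lists of small functions instead of strings; they still live in one place in the script. This is a sketch of that alternative, with the list names conditions1 and conditions2 reused from the question but holding functions (an assumption on my part, not the asker's setup).

```r
df <- data.frame(id = c(1, 2, 3, 4, 5),
                 x = c(11, 4, 8, 9, 12),
                 y = c(0.5, 0.9, 0.11, 0.6, 0.5))

# Conditions as named functions of the data frame (hypothetical replacement
# for the character vectors used in the question).
conditions1 <- list(above2 = function(d) d$x > 2,
                    above5 = function(d) d$x > 5)
conditions2 <- list(belowpt6 = function(d) d$y < 0.6,
                    belowpt4 = function(d) d$y < 0.4)

# One 0/1 column per combination, named var_<name1>_<name2>.
for (n1 in names(conditions1)) {
  for (n2 in names(conditions2)) {
    both <- conditions1[[n1]](df) & conditions2[[n2]](df)
    df[[paste0("var_", n1, "_", n2)]] <- as.integer(both)
  }
}

print(df)
```

Adding or changing a condition then only touches the two lists at the top; the loop stays the same.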
For example say you create a Julia DataFrame like so with 20 columns:
y=convert(DataFrame, randn(10,20))
How do you convert the column names (:x1 ... :x20) to something else, like (:col1, ..., :col20) for example, all at once?
You might find the names! function more concise:
julia> using DataFrames
julia> df = DataFrame(x1 = 1:2, x2 = 2:3, x3 = 3:4)
2x3 DataFrame
|-------|----|----|----|
| Row # | x1 | x2 | x3 |
| 1 | 1 | 2 | 3 |
| 2 | 2 | 3 | 4 |
julia> names!(df, [symbol("col$i") for i in 1:3])
Index([:col2=>2,:col1=>1,:col3=>3],[:col1,:col2,:col3])
julia> df
2x3 DataFrame
|-------|------|------|------|
| Row # | col1 | col2 | col3 |
| 1 | 1 | 2 | 3 |
| 2 | 2 | 3 | 4 |
One way to do this is with the rename! function. The rename method takes a DataFrame as input but only allows you to change a single column name at a time (as of the development version 0.3 branch on 1/4/2014). Looking into the code of Index.jl in the DataFrames repository led me to this solution, which works for me:
rename!(y.colindex, [(symbol("x$i")=>symbol("col$i")) for i in 1:20])
y.colindex returns the index for the dataframe y, and the next argument creates a dictionary mapping the old column symbols to the new column symbols. I imagine that by the time someone else needs this, there will be a nicer way to do it, but I just spent a few hours figuring this out in the development version 0.3 of Julia, so I thought I would share!
As an update to the answer of @JohnMylesWhite, the names! function has been deprecated in DataFrames v0.20.2. The current way of going about this is to use the rename! function:
import DataFrames
DF = DataFrames
df = DF.DataFrame(x1 = 1:2, x2 = 2:3, x3 = 3:4)
println(df)
DF.rename!(df, [Symbol("Col$i") for i in 1:size(df,2)])
println(df)
Julia v1.1.0:
One can directly change the column names by
names!(df, colNames_as_Symbols)
To rename the columns with a vector of strings, this can be done via
names!(df, Symbol.(colNames_as_strings) )
# import Pkg; Pkg.add("DataFrames")
using DataFrames
The question has been answered, but for additional clarity: sometimes you just want to specify the names directly, without loops (which can be over-engineering):
rename!(df, [:Date, :feature_1, :feature_2 ], makeunique=true)
Example output:
141 rows × 3 columns
Date feature_1 feature_2
Date Float64? Float64?
1 2020-08-03 44.3 missing
Update:
For Julia 0.4, as described by John Myles White, all the names can be changed with:
names!(df::AbstractDataFrame, vals)
where vals is a Vector{Symbol} the same length as
the number of columns in df.
Specific names can be changed with:
rename!(df::AbstractDataFrame, from::Symbol, to::Symbol)
rename!(df::AbstractDataFrame, d::Associative)
rename!(f::Function, df::AbstractDataFrame)
where d is an Associative type that maps the original name to a new name
and f is a function that has the old column name (a symbol) as input
and new column name (a symbol) as output.
This is documented in the code at https://github.com/JuliaStats/DataFrames.jl/blob/7e2f48ad9f31185d279fdd81d6413a79b7e42e87/src/abstractdataframe/abstractdataframe.jl
This is the short and simple answer for Julia 1.1.1:
names!(df, [Symbol("Col$i") for i in 1:size(df,2)])
Use the rename! function with an array containing the new names:
Vector_with_names = ["col1","col2","col3"]
rename!(df,Vector_with_names)
Using John's dataframe, I had to use colnames! instead of names!
df = DataFrame(x1 = 1:2, x2 = 2:3, x3 = 3:4)
colnames!(df, ["col$i" for i in 1:3])
My version of Julia is 0.2.1