Combining Grep and a For-Loop to Construct A Matrix (R) - r

I have a huge list of small data frames that I would like to combine meaningfully into one, but the logic of how to do so escapes me.
For instance, suppose I have a list of data frames that looks something like this, albeit with far more files, many of which I do not want in my final data frame:
MyList = c("AthosVersusAthos.csv", "AthosVersusPorthos.csv", "AthosVersusAramis.csv", "PorthosVersusAthos.csv", "PorthosVersusPorthos.csv", "PorthosVersusAramis.csv", "AramisVersusAthos.csv", "AramisVersusPorthos.csv", "AramisVersusAramis.csv", "BobVersusMary.csv", "LostCities.txt")
What I want is to assemble these into one large data frame, which would look like this:
AthosVersusAthos   | PorthosVersusAthos   | AramisVersusAthos
-------------------|----------------------|--------------------
AthosVersusPorthos | PorthosVersusPorthos | AramisVersusPorthos
-------------------|----------------------|--------------------
AthosVersusAramis  | PorthosVersusAramis  | AramisVersusAramis
Or perhaps more correctly (with sample numbers in only one portion of the matrix):
        |  Athos   | Porthos | Aramis
--------|----------|---------|--------
        | 10  9  5 |         |
Athos   |  2 10  4 |         |
        |  3  0 10 |         |
--------|----------|---------|--------
        |          |         |
Porthos |          |         |
        |          |         |
--------|----------|---------|--------
        |          |         |
Aramis  |          |         |
        |          |         |
--------|----------|---------|--------
What I have managed so far is:
Musketeers = c("Athos", "Porthos", "Aramis")
for(i in 1:length(Musketeers)) {
  for(j in 1:length(Musketeers)) {
    CombinedMatrix <- cbind (
      rbind(MyList[grep(paste0("^(", Musketeers[i],
        ")(?=.*Versus[", Musketeers[j], "]"), names(MyList),
        value = T, perl=T)])
    )
  }
}
What I was trying to do was use my grep command (quite important given the number of files and the specificity with which I need to select them) and then combine rbind and cbind so that the rows and the columns of the matrix are meaningfully concatenated.
My general plan was to merge all the data frames starting with 'Athos' into one column, do the same for the data frames starting with 'Porthos' and 'Aramis', and then combine those three columns into a final data frame.
I know I'm quite far off but I can't quite get my head around where to start.
Edit: @PierreGramme generated a useful model data set, which I add below since it would have been useful to provide it originally.
Musketeers = c("Athos", "Porthos", "Aramis")
MyList = c("AthosVersusAthos.csv", "AthosVersusPorthos.csv", "AthosVersusAramis.csv",
           "PorthosVersusAthos.csv", "PorthosVersusPorthos.csv", "PorthosVersusAramis.csv",
           "AramisVersusAthos.csv", "AramisVersusPorthos.csv", "AramisVersusAramis.csv",
           "BobVersusMary.csv", "LostCities.txt")
MyList = lapply(setNames(nm=MyList), function(x)
  matrix(rnorm(9), nrow=3, dimnames=list(c("a","b","c"), c("x","y","z"))) )

First, let's make a reproducible example. Is it faithful? If so, I will add code to answer.
Musketeers = c("Athos", "Porthos", "Aramis")
MyList = c("AthosVersusAthos.csv", "AthosVersusPorthos.csv", "AthosVersusAramis.csv",
           "PorthosVersusAthos.csv", "PorthosVersusPorthos.csv", "PorthosVersusAramis.csv",
           "AramisVersusAthos.csv", "AramisVersusPorthos.csv", "AramisVersusAramis.csv",
           "BobVersusMary.csv", "LostCities.txt")
MyList = lapply(setNames(nm=MyList), function(x)
  matrix(rnorm(9), nrow=3, dimnames=list(c("a","b","c"), c("x","y","z"))) )
And then is it correct that you would like to concatenate 9 of these matrices into your combined matrix shaped as you described?
Edit:
Then the code solving your problem:
# Helper function to extract the relevant portion of MyList and rbind() it
makeColumns = function(n){
  re = paste0("^",n,"Versus")
  sublist = MyList[grep(re, names(MyList))]
  names(sublist) = sub(re, "", sub("\\.csv$","", names(sublist)))
  # Make sure sublist is sorted correctly and contains info on all musketeers
  sublist = sublist[Musketeers]
  # Change row and col names so that they are unique in the final result
  sublist = lapply(names(sublist), function(m) {
    res = sublist[[m]]
    rownames(res) = paste0(m,"_",rownames(res))
    colnames(res) = paste0(n,"_",colnames(res))
    res
  })
  do.call(rbind, sublist)
}
lColumns = lapply(setNames(nm=Musketeers), makeColumns)
CombinedMatrix = do.call(cbind, lColumns)
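If the file names follow the pattern FirstVersusSecond.csv exactly, the same block layout can also be built by constructing each name directly instead of matching with grep. The following is only a sketch under that assumption; the `files` and `blocks` variables are illustrative names, and unlike the helper above it leaves the row and column names of the blocks unchanged (so they repeat).

```r
set.seed(1)
Musketeers <- c("Athos", "Porthos", "Aramis")

# Model data: the nine wanted files plus two unwanted ones (names are assumptions)
files <- as.vector(outer(Musketeers, Musketeers, paste, sep = "Versus"))
files <- c(paste0(files, ".csv"), "BobVersusMary.csv", "LostCities.txt")
MyList <- lapply(setNames(nm = files), function(x)
  matrix(rnorm(9), nrow = 3, dimnames = list(c("a", "b", "c"), c("x", "y", "z"))))

# One block of columns per first name; within it, stack the matrices by second name
blocks <- lapply(Musketeers, function(first) {
  do.call(rbind, lapply(Musketeers, function(second) {
    MyList[[paste0(first, "Versus", second, ".csv")]]
  }))
})

# Bind the three 9 x 3 blocks side by side into a 9 x 9 matrix
CombinedMatrix <- do.call(cbind, blocks)
dim(CombinedMatrix)
```

Exact-name lookup skips unwanted entries such as BobVersusMary.csv automatically, but it returns NULL for a missing file, so a real data set may need an explicit check before binding.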

Related

Problem iterating through list of dataframes

I am working with a database of three-dimensional vectors and am trying to calculate the surface area of the triangles between all possible combinations of three vectors. The goal is to get a list or dataframe containing the area for all possible combinations, each named based on the column names of the respective coordinates (e. g. c1:c2:c3).
For the moment, I get "invalid subscript type 'list'" as an error when running my function for the triangle calculation but I don't know how else to iterate through my list.
I am generating a list of all possible combinations of coordinates using combn:
# newdata contains the coordinates; each column is a three-dimensional vector with x, y and z
tridf <- combn(newdata, 3, simplify=FALSE)
Example for structure of newdata:
| c1 | c2 | c3 | c4 | c5 |
x| -8.99 | -8.71 | -10.52 | -8.38 | -55.76 |
y| -267.54 | -266.50 | -266.26 | -279.47 | -243.53 |
z| -117.85 | -122.87 | -200.95 | -146.96 | -130.40 |
dput(newdata):
structure(list(g = c("-8.993426322937012", "-267.54718017578125",
"-117.85099792480469"), n = c("-8.717547416687012", "-266.50799560546875",
"-122.87059020996094"), ale = c("-10.52885627746582", "-266.2621154785156",
"-200.95721435546875"), rhi = c("-8.382125854492188", "-279.47918701171875",
"-146.96658325195312"), fmo.r = c("-55.76047897338867", "-243.5348663330078",
"-130.4052734375")), row.names = c("V2", "V3", "V4"), class = "data.frame")
The combn call gives me a list of n data frames, through which I would now like to iterate using the following function:
triarea <- function(i){
  newtridf <- as.data.frame(tridf[[i]])
  ab <- as.numeric(newtridf[,2])-as.numeric(newtridf[,1])
  ac <- as.numeric(newtridf[,3])-as.numeric(newtridf[,1])
  c <- as.data.frame(cross(ab,ac)) # cross is a function of library(pracma)
  area <- 0.5*sqrt(c[1,]^2+c[2,]^2+c[3,]^2)
}
When running this code manually outside the function there is no problem and I always end up with the correct result for area, but when running this as a function, called using combn
newcombn <- combn(tridf, 1, triarea, simplify=FALSE)
it throws the following error:
Error in tridf[[i]] : invalid subscript type 'list'
I've been searching the web and trying things for hours now but I am completely lost, especially as I am relatively new to R and programming in general. I get that there seems to be a problem with the data being stored in a list, but I do not know how to approach solving this, or how to refer iteratively to the respective columns of each data frame inside the list without the need for auxiliary elements like newtridf ...
Thank you very much in advance for your time and help!
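One plausible reading of the error: combn(tridf, 1, triarea) hands triarea each one-element subset of tridf, which is itself a list, so tridf[[i]] ends up being indexed with a list rather than a number. A minimal sketch that avoids this iterates over the column-name triples directly. It rests on several assumptions: the data frame below is a rounded, numeric stand-in for newdata, cross3 is a hypothetical base-R replacement for pracma's cross, and triarea is rewritten to take the three points themselves instead of an index.

```r
# Rounded numeric stand-in for the question's newdata (assumption)
newdata <- data.frame(
  c1 = c(-8.99, -267.54, -117.85),
  c2 = c(-8.71, -266.50, -122.87),
  c3 = c(-10.52, -266.26, -200.95),
  c4 = c(-8.38, -279.47, -146.96),
  c5 = c(-55.76, -243.53, -130.40)
)

# Cross product of two 3-vectors in base R (hypothetical helper)
cross3 <- function(u, v) {
  c(u[2] * v[3] - u[3] * v[2],
    u[3] * v[1] - u[1] * v[3],
    u[1] * v[2] - u[2] * v[1])
}

# Area of the triangle spanned by the three columns of a data frame
triarea <- function(tri) {
  p1 <- as.numeric(tri[[1]])
  p2 <- as.numeric(tri[[2]])
  p3 <- as.numeric(tri[[3]])
  0.5 * sqrt(sum(cross3(p2 - p1, p3 - p1)^2))
}

# All triples of column names, then one area per triple, named like "c1:c2:c3"
triples <- combn(names(newdata), 3, simplify = FALSE)
areas <- vapply(triples, function(cols) triarea(newdata[cols]), numeric(1))
names(areas) <- vapply(triples, paste, character(1), collapse = ":")
areas
```

Working from the column names keeps the labels and the areas in the same order, so no intermediate list of data frames is needed.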

Updating dataframe based on conditions (over loop) R

I'm having difficulty developing a function/algorithm that updates a dataframe based on certain conditions. I've looked at some answers related to "updating" a dataframe via for loops, but I'm still stuck.
Say I have a dataframe:
df <- data.frame("data_low" = .2143, "data_high" = .7149)
where data_low and data_high are the min and max of some column in a dataframe.
I also have two functions:
checker(b[1,])
takes the values of data_high and data_low and returns a scalar. If the scalar is less than 1, I'd like to store the row in another dataframe, say "d". Otherwise, I want to split "b" with the following function:
splitter()
which splits "b" at the median of data_high and data_low.
I've considered trying to develop this with a loop:
storage <- data.frame(data_low = double(), data_high = double())
for( i in 1:nrow(b)){
  if(checker(b[i,]) <1){
    storage <- splitter(b[i,])
  } else {
    temp <- splitter(b[i,])
    b <- rbind(b,temp)
  }
}
My desired output after two iterations (where checker() returns a value greater than 1 for each row):
** Obviously these numbers are picked at random, I'm just hoping to gain some intuition related to looping/updating dataframes based on cases.
starting at i = 0:
| .2143 | .7149 |,
i = 2
| .2143 | .4442 | ** Note that splitter() should break this into 2 rows after i = 2 is complete.
| .4442 | .7149 | ** And again here
i = 3
| .2143 | .3002 |
| .3002 | .4442 |
| .4442 | .5630 |
| .5630 | .7149 |
Can anyone give me some tips on how to organize this loop? I'm thinking my issue here is related to rbind and/or the actual updating of b.
I recognize that much of this code isn't reproducible, but I am more interested in the thought process here.
Any help would be greatly appreciated!
You can do this with a nested loop (one for the number of iterations and one for the number of rows in b), or using nested Reduce calls, as shown here.
Reduce(function(x, y) {
  List=apply(x, 1, function(z) {
    med=median(c(z[1], z[2]))
    dat=data.frame(data_low=c(z[1], med), data_high=c(med, z[2]))
    rownames(dat)=NULL
    return(dat)
  })
  Reduce(function(w, z) rbind(w, z), List)
}, rep(NA, 2), init=df)
One rep:
data_low data_high
1 0.2143 0.4646
2 0.4646 0.7149
Two reps:
data_low data_high
1 0.21430 0.33945
2 0.33945 0.46460
3 0.46460 0.58975
4 0.58975 0.71490
Three reps:
data_low data_high
1 0.214300 0.276875
2 0.276875 0.339450
3 0.339450 0.402025
4 0.402025 0.464600
5 0.464600 0.527175
6 0.527175 0.589750
7 0.589750 0.652325
8 0.652325 0.714900
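A loop-based variant that also applies the check can be sketched by treating b as a queue: rows that pass go to storage, and rows that fail are split and queued again. The question does not define checker() or splitter(), so both definitions below are hypothetical stand-ins (the tolerance of 0.2 is an arbitrary assumption).

```r
# Hypothetical checker: interval width relative to a tolerance (assumption)
checker <- function(row, tol = 0.2) (row$data_high - row$data_low) / tol

# Hypothetical splitter: cut one interval at the median of its two bounds
splitter <- function(row) {
  mid <- median(c(row$data_low, row$data_high))
  data.frame(data_low = c(row$data_low, mid), data_high = c(mid, row$data_high))
}

b <- data.frame(data_low = 0.2143, data_high = 0.7149)
storage <- b[0, ]  # empty data frame with the same columns

# Process b as a queue until nothing is left to check
while (nrow(b) > 0) {
  current <- b[1, ]
  b <- b[-1, ]
  if (checker(current) < 1) {
    storage <- rbind(storage, current)   # small enough: keep it
  } else {
    b <- rbind(b, splitter(current))     # too wide: split and re-queue
  }
}

# Sort the accepted intervals and reset the row names
storage <- storage[order(storage$data_low), ]
rownames(storage) <- NULL
storage
```

A while loop is used instead of for (i in 1:nrow(b)) because the bounds of a for loop are fixed when it starts, so rows appended to b inside the loop would never be visited.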

Selecting row by min value and merging

This is a row selection problem in R.
I want to get only one row in each dataset based on the minimum value of the variable fifteen. I have tried the two approaches below, neither of which returns the desired output.
For the first list element, SPX1[[1]], the data.frame is set up like this:
SPX1[[1]]
+----------------------------------+---------+------------------------+----------------+
| | stkPx | expirDate | fifteen |
+----------------------------------+---------+------------------------+----------------+
| 1 | 1461.62 | 2013-01-19 | 2 |
| 2 | 1461.25 | 2013-01-25 | 8 |
| 3 | 1461.35 | 2013-02-01 | 3 |
| . | . | . | . |
| . | . | . | . |
+----------------------------------+---------+------------------------+----------------+
The first approach is to aggregate and then merge. As this has to be done for every element of the list, the code is in a loop:
df.agg <- list() # creates a list
for (l in 1:length(SPX1)){
  df.agg[[l]] <- SPX1[[l]] %>%
    aggregate(fifteen ~ ticker, data=SPX1[[l]], min) # finding minimum value of fifteen for ticker
  df.minSPX1 <- merge(df.agg[[l]], SPX1[[l]]) # merge dataset so we only get row with min fifteen value
}
I get :
Error in get(as.character(FUN), mode = "function", envir = envir) :
object 'SPX1' of mode 'function' was not found
Another approach is below; however, it just changes all the values of the first column to one and does not delete any rows when merging:
TESTER<- which.min(SPX1[[1]]$fifteen) # Finds which row has minimum value of fifteen
df.minSPX1 <- merge(TESTER, SPX1[[1]],by.x=by) #Try to merge so I only get the row with min. fifteen
I have tried reading other answers on SO but maybe because of the way the lists are set up this won't work?
I hope you can tell me where I get this wrong.
Try this:
df<- lapply(SPX, function(x) x[x$fifteen==min(x$fifteen),])
df<- as.data.frame(df)
Edit:
As suggested by @Gregor-Thomas, this version also works when there is a tie, because which.min returns only the first minimum:
df<- lapply(SPX, function(x) x[which.min(x$fifteen), ])
df<- as.data.frame(df)
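The difference between the two versions shows up on a small example with a tie. The list below is made-up data in the shape of the question's table (all values are assumptions), and stacking the selected rows with do.call(rbind, ...) is one possible way to combine them, not necessarily what the answer intends with as.data.frame.

```r
# Made-up stand-in for the question's list of data frames (values are assumptions)
SPX1 <- list(
  data.frame(stkPx = c(1461.62, 1461.25, 1461.35),
             expirDate = as.Date(c("2013-01-19", "2013-01-25", "2013-02-01")),
             fifteen = c(2, 8, 2)),   # rows 1 and 3 tie for the minimum
  data.frame(stkPx = c(1470.10, 1471.20),
             expirDate = as.Date(c("2013-01-19", "2013-01-25")),
             fifteen = c(5, 1))
)

# Comparison with min() keeps every row that ties for the minimum
all_min <- lapply(SPX1, function(x) x[x$fifteen == min(x$fifteen), ])

# which.min() keeps only the first minimum, so exactly one row per data frame
first_min <- lapply(SPX1, function(x) x[which.min(x$fifteen), ])

# Stack the selected rows into a single data frame, one row per list element
combined <- do.call(rbind, first_min)
combined
```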

R: Aggregating a list of column names mapped to row numbers based off of a condition in a data frame

Scaled down, my data frame looks like this:
+---+------------+-------------+
| | Label1 | Label2 |
+---+------------+-------------+
| 1 | T | F |
| 2 | F | F |
| 3 | T | T |
+---+------------+-------------+
I need to create a list of lists that map the column names to all the row numbers that have a false boolean as their value. For the above example it would look something like this:
{"Label1" : (2), "Label2" : (1,2)}
I am currently doing it like so:
myList = with(data.frame(which(!myDataFrame, arr.ind = TRUE)),
              list("colNames" = names(myDataFrame)[col], "rows" = row))
l = list()
count = 1
for (i in myList[["colNames"]]) {
  tmpRowNum = myList[["rows"]][[count]]
  tmpList = l[[i]]
  if (is.null(tmpList)) {
    tmpList = list()
  }
  l[[i]] = c(tmpList, list(tmpRowNum))
  count = count + 1
}
This does work, but as I am new to R I can only assume there is a more efficient method of doing this. The with function creates two separate lists that I essentially have to combine to get the result that I am looking for.
You could try:
df <- data.frame(Label1=c("T","F","T"),Label2=c("F","F","T"))
lapply(df,function(x) which(x=="F"))
$Label1
[1] 2
$Label2
[1] 1 2
Edit: to get the same by row, use apply with MARGIN = 1:
apply(df,1,function(x) which(x=="F"))
To get a vector of the "F"s in row 2:
res <- apply(df,1,function(x) which(x=="F"))
res[[2]]
1 2
One useful way to get the row/column index is with which and arr.ind
i1 <- which(df=="F", arr.ind=TRUE)
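Both ideas can be applied to a logical data frame like the one in the question. The sketch below rebuilds that data frame (an assumption about its exact construction) and produces the column-name-to-rows mapping in two ways: column by column with lapply, and by splitting the arr.ind result on the column names.

```r
# Logical data frame mirroring the question's table (construction is an assumption)
myDataFrame <- data.frame(Label1 = c(TRUE, FALSE, TRUE),
                          Label2 = c(FALSE, FALSE, TRUE))

# Column by column: row numbers where the value is FALSE
by_column <- lapply(myDataFrame, function(x) which(!x))

# Via arr.ind: one (row, col) pair per FALSE cell, then group rows by column name
idx <- which(!as.matrix(myDataFrame), arr.ind = TRUE)
by_column2 <- split(unname(idx[, "row"]), names(myDataFrame)[idx[, "col"]])

by_column
by_column2
```

The two results agree here, but they differ for a column with no FALSE values: lapply keeps it with an empty integer vector, while split drops it entirely.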

Getting a dataframe of logical values from a vector of statements

I have a number of lists of conditions and I would like to evaluate their combinations, and then I'd like to get binary values for these logical values (True = 1, False = 0). The conditions themselves may change or grow as my project progresses, and so I'd like to have one place within the script where I can alter these conditional statements, while the rest of the script stays the same.
Here is a simplified, reproducible example:
# get the data
df <- data.frame(id = c(1,2,3,4,5), x = c(11,4,8,9,12), y = c(0.5,0.9,0.11,0.6, 0.5))
# name and define the conditions
names1 <- c("above2","above5")
conditions1 <- c("df$x > 2", "df$x >5")
names2 <- c("belowpt6", "belowpt4")
conditions2 <- c("df$y < 0.6", "df$y < 0.4")
# create an object that contains the unique combinations of these conditions and their names, to be used for labeling columns later
names_combinations <- as.vector(t(outer(names1, names2, paste, sep="_")))
condition_combinations <- as.vector(t(outer(conditions1, conditions2, paste, sep=" & ")))
# create a dataframe of the logical values of these conditions
condition_combinations_logical <- ????? # This is where I need help
# lapply to get binary values from these logical vectors
df[paste0("var_",names_combinations)] <- +(condition_combinations_logical)
to get output that could look something like:
id | x  | y    | var_above2_belowpt6 | var_above2_belowpt4 | etc.
1  | 11 | 0.5  | 1                   | 0                   |
2  | 4  | 0.9  | 0                   | 0                   |
3  | 8  | 0.11 | 1                   | 1                   |
etc. ....
Looks like the dreaded eval(parse()) does it (hard to think of a much easier way ...). Then use storage.mode()<- to convert from logical to integer ...
res <- sapply(condition_combinations,function(x) eval(parse(text=x)))
storage.mode(res) <- "integer"
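If the conditions can be written as functions rather than strings, eval(parse()) can be avoided altogether. This is a different technique from the one above, sketched under the assumption that each condition takes the data frame as its only argument; the `grid` variable and the loop are illustrative, and the conditions still live in one place at the top of the script.

```r
df <- data.frame(id = c(1, 2, 3, 4, 5), x = c(11, 4, 8, 9, 12),
                 y = c(0.5, 0.9, 0.11, 0.6, 0.5))

# Name and define the conditions as functions of the data frame
conditions1 <- list(above2 = function(d) d$x > 2, above5 = function(d) d$x > 5)
conditions2 <- list(belowpt6 = function(d) d$y < 0.6, belowpt4 = function(d) d$y < 0.4)

# All pairs of condition names; the first column of expand.grid varies fastest,
# which reproduces the ordering of as.vector(t(outer(names1, names2, paste)))
grid <- expand.grid(c2 = names(conditions2), c1 = names(conditions1),
                    stringsAsFactors = FALSE)

# Evaluate each pair and store the result as a 0/1 integer column
for (k in seq_len(nrow(grid))) {
  n1 <- grid$c1[k]
  n2 <- grid$c2[k]
  df[[paste0("var_", n1, "_", n2)]] <-
    as.integer(conditions1[[n1]](df) & conditions2[[n2]](df))
}
df
```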
