Selecting row by min value and merging - r

This is a row selection problem in R.
I want to get only one row from each data frame, based on the minimum value of the variable fifteen. I have tried the two approaches below, but neither returns the desired output.
For the first element of the list, the data frame is set up like this:
SPX1[[1]]
+---+---------+------------+---------+
|   | stkPx   | expirDate  | fifteen |
+---+---------+------------+---------+
| 1 | 1461.62 | 2013-01-19 | 2       |
| 2 | 1461.25 | 2013-01-25 | 8       |
| 3 | 1461.35 | 2013-02-01 | 3       |
| . | .       | .          | .       |
| . | .       | .          | .       |
+---+---------+------------+---------+
The first approach is to aggregate and then merge. As this has to be done for each element of the list, the code is in a loop:
df.agg <- list()  # creates a list
for (l in 1:length(SPX1)) {
  df.agg[[l]] <- SPX1[[l]] %>%
    aggregate(fifteen ~ ticker, data = SPX1[[l]], min)  # finding minimum value of fifteen per ticker
  df.minSPX1 <- merge(df.agg[[l]], SPX1[[l]])           # merge so we only get the row with the min fifteen value
}
I get:
Error in get(as.character(FUN), mode = "function", envir = envir) :
object 'SPX1' of mode 'function' was not found
Here is another approach; however, it just changes all the values of the first column to one rather than dropping any rows when merging:
TESTER<- which.min(SPX1[[1]]$fifteen) # Finds which row has minimum value of fifteen
df.minSPX1 <- merge(TESTER, SPX1[[1]],by.x=by) #Try to merge so I only get the row with min. fifteen
I have tried reading other answers on SO, but maybe because of the way the lists are set up, those approaches won't work? I hope you can tell me where I am going wrong.

Try this:
df <- lapply(SPX1, function(x) x[x$fifteen == min(x$fifteen), ])
df <- as.data.frame(df)
Edit:
As suggested by @Gregor-Thomas, this version will still return a single row per data frame when there is a tie.
df <- lapply(SPX1, function(x) x[which.min(x$fifteen), ])
df <- as.data.frame(df)
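Since the question already uses the pipe, here is a dplyr-flavoured sketch as well (my own alternative, not part of the answer above); it assumes SPX1 is a list of data frames and uses slice_min(), which keeps all tied rows unless with_ties = FALSE is set:
library(dplyr)
# one minimum-fifteen row per list element (hypothetical object name df.min)
df.min <- lapply(SPX1, function(x) slice_min(x, fifteen, n = 1, with_ties = FALSE))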

Related

Combining Grep and a For-Loop to Construct A Matrix (R)

I have a huge list of small data frames that I would like to combine meaningfully into one; however, the logic of how to do so escapes me.
For instance, suppose I have a list of data frames that looks something like this, albeit with far more files, many of which I do not want in my final data frame:
MyList = c("AthosVersusAthos.csv", "AthosVerusPorthos.csv", "AthosVersusAramis.csv", "PorthosVerusAthos.csv", "PorthosVersusPorthos.csv", "PorthosVersusAramis.csv", "AramisVersusAthos.csv", "AramisVersusPorthos.csv", "AramisVerusPothos.csv", "BobVersusMary.csv", "LostCities.txt")
What I want is to assemble these into one large data frame, which would look like this:
AthosVersusAthos   | PorthosVersusAthos   | AramisVersusAthos
-------------------+----------------------+---------------------
AthosVersusPorthos | PorthosVersusPorthos | AramisVersusPorthos
-------------------+----------------------+---------------------
AthosVersusAramis  | PorthosVersusAramis  | AramisVersusAramis
Or perhaps more correctly (with sample numbers in only one portion of the matrix):
        |  Athos   | Porthos | Aramis
--------+----------+---------+--------
        | 10  9  5 |         |
Athos   |  2 10  4 |         |
        |  3  0 10 |         |
--------+----------+---------+--------
Porthos |          |         |
--------+----------+---------+--------
Aramis  |          |         |
What I have managed so far is:
Musketeers = c("Athos", "Porthos", "Aramis")
for (i in 1:length(Musketeers)) {
  for (j in 1:length(Musketeers)) {
    CombinedMatrix <- cbind(
      rbind(MyList[grep(paste0("^(", Musketeers[i],
                               ")(?=.*Versus[", Musketeers[j], "]"),
                        names(MyList), value = T, perl = T)])
    )
  }
}
What I was trying to do was combine my grep command (quite important given the number of files and the specificity with which I need to select them) with rbind and cbind so that the rows and columns of the matrix are meaningfully concatenated.
My general plan was to merge all the data frames starting with 'Athos' into one column, do the same for the data frames starting with 'Porthos' and 'Aramis', and then combine those three columns row-wise into a final data frame.
I know I'm quite far off, but I can't quite get my head around where to start.
Edit: @PierreGramme generated a useful model data set, which I will add below, seeing as it would have been useful to provide it originally.
Musketeers = c("Athos", "Porthos", "Aramis")
MyList = c("AthosVersusAthos.csv", "AthosVersusPorthos.csv", "AthosVersusAramis.csv",
"PorthosVersusAthos.csv", "PorthosVersusPorthos.csv", "PorthosVersusAramis.csv",
"AramisVersusAthos.csv", "AramisVersusPorthos.csv", "AramisVersusAramis.csv",
"BobVersusMary.csv", "LostCities.txt")
MyList = lapply(setNames(nm=MyList), function(x) matrix(rnorm(9), nrow=3, dimnames=list(c("a","b","c"), c("x","y","z"))) )
First, make a reproducible example. Is it faithful? If so, I will add code to answer:
Musketeers = c("Athos", "Pothos", "Aramis")
MyList = c("AthosVersusAthos.csv", "AthosVersusPothos.csv", "AthosVersusAramis.csv",
"PothosVersusAthos.csv", "PothosVersusPothos.csv", "PothosVersusAramis.csv",
"AramisVersusAthos.csv", "AramisVersusPothos.csv", "AramisVersusAramis.csv",
"BobVersusMary.csv", "LostCities.txt")
MyList = lapply(setNames(nm=MyList), function(x) matrix(rnorm(9), nrow=3, dimnames=list(c("a","b","c"), c("x","y","z"))) )
And then is it correct that you would like to concatenate 9 of these matrices into your combined matrix shaped as you described?
Edit:
Then here is code solving your problem:
# Helper function to extract the relevant portion of MyList and rbind() it
makeColumns = function(n) {
  re = paste0("^", n, "Versus")
  sublist = MyList[grep(re, names(MyList))]
  names(sublist) = sub(re, "", sub("\\.csv$", "", names(sublist)))
  # Make sure sublist is sorted correctly and contains info on all musketeers
  sublist = sublist[Musketeers]
  # Change row and col names so that they are unique in the final result
  sublist = lapply(names(sublist), function(m) {
    res = sublist[[m]]
    rownames(res) = paste0(m, "_", rownames(res))
    colnames(res) = paste0(n, "_", colnames(res))
    res
  })
  do.call(rbind, sublist)
}
lColumns = lapply(setNames(nm=Musketeers), makeColumns)
CombinedMatrix = do.call(cbind, lColumns)
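With the model data above, a quick sanity check (a sketch, not part of the original answer) is that the result is a 9 x 9 matrix whose row and column names carry the musketeer prefixes:
dim(CombinedMatrix)       # expected: 9 9
rownames(CombinedMatrix)  # "Athos_a", "Athos_b", ..., "Aramis_c"
colnames(CombinedMatrix)  # "Athos_x", "Athos_y", ..., "Aramis_z"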

Splitting strings and stacking them in one column

I've got a data frame with this structure:
> df
modifications
13-MOD:0057
13-MOD:0046
13-MOD:0051,13-MOD:0076
13-MOD:0036,13-MOD:0076,13-MOD:0016
13-MOD:0256,13-MOD:0156,13-MOD:0956,13-MOD:0125
13-MOD:0014 13-MOD:0156, 13-MOD:0956,13-MOD:0125...n
13-MOD:0012 ... n
To split the data I used this code:
df2 <- data.frame(str_split_fixed(df$modifications, ",", 20))
Basically, I get this data:
> df2
X1          | X2          | X3          | X4
13-MOD:0057 | empty       | empty       | empty
13-MOD:0046 | empty       | empty       | empty
13-MOD:0051 | 13-MOD:0076 | empty       | empty
13-MOD:0036 | 13-MOD:0076 | 13-MOD:0016 | empty
13-MOD:0256 | 13-MOD:0156 | 13-MOD:0956 | 13-MOD:0125
13-MOD:0014 | 13-MOD:0156 | 13-MOD:0956 | 13-MOD:0125 | ... n
13-MOD:0012 | ... n
What I want is to remove the empty values and stack the data from columns X2, X3, X4, ... n onto the first one, X1.
To do that I was using this:
df3 <- melt(setDT(df2),                   # set df to a data.table
            measure.vars = list(c(1:20)),  # set column groupings
            value.name = 'V')[             # set output name scheme
              , -1, with = F]
To remove the empty values:
df3[df3==""] <- NA
histo3 = subset(df3, V1 != 'NA')
But I don't know why I get an error about the column lengths in the melt function. Do you know of any way to make this easier?
Reproducible example:
df <- data.frame(modifications = c("UNIMOD:108,UNIMOD:108", "UNIMOD:108", "UNIMOD:108",
                                   "UNIMOD:108,UNIMOD:108,UNIMOD:108",
                                   "UNIMOD:108,UNIMOD:108,UNIMOD:108,UNIMOD:108,UNIMOD:108,UNIMOD:108",
                                   "UNIMOD:108"))
Could it be something like this?
library(stringr)
# input dataset
s <- c('13-MOD:0057', '13-MOD:0046', '13-MOD:0051,13-MOD:0076', '13-MOD:0036,13-MOD:0076,13-MOD:0016', '13-MOD:0256,13-MOD:0156,13-MOD:0956,13-MOD:0125')
s
[1] "13-MOD:0057"
[2] "13-MOD:0046"
[3] "13-MOD:0051,13-MOD:0076"
[4] "13-MOD:0036,13-MOD:0076,13-MOD:0016"
[5] "13-MOD:0256,13-MOD:0156,13-MOD:0956,13-MOD:0125"
# get the individual lengths
lengths <- sapply(str_split(s,','), function(x){ length(x) })
# create the dataframe splitting in N columns
as.data.frame(str_split_fixed(s, ',', max(lengths)))
V1 V2 V3 V4
1 13-MOD:0057
2 13-MOD:0046
3 13-MOD:0051 13-MOD:0076
4 13-MOD:0036 13-MOD:0076 13-MOD:0016
5 13-MOD:0256 13-MOD:0156 13-MOD:0956 13-MOD:0125
UPDATE 1
To stack all the non-empty cells into a single column:
# create the dataframe splitting in N columns
first.matrix <- str_split_fixed(s, ',', max(lengths))
# select only the cells != ""
first.matrix[which(first.matrix!="")]
[1] "13-MOD:0057" "13-MOD:0046" "13-MOD:0051" "13-MOD:0036" "13-MOD:0256" "13-MOD:0076"
[7] "13-MOD:0076" "13-MOD:0156" "13-MOD:0016" "13-MOD:0956" "13-MOD:0125"

Getting a dataframe of logical values from a vector of statements

I have a number of lists of conditions whose combinations I would like to evaluate, and then I'd like to turn the resulting logical values into binary values (TRUE = 1, FALSE = 0). The conditions themselves may change or grow as my project progresses, so I'd like to have one place within the script where I can alter these conditional statements while the rest of the script stays the same.
Here is a simplified, reproducible example:
# get the data
df <- data.frame(id = c(1,2,3,4,5), x = c(11,4,8,9,12), y = c(0.5,0.9,0.11,0.6, 0.5))
# name and define the conditions
names1 <- c("above2","above5")
conditions1 <- c("df$x > 2", "df$x >5")
names2 <- c("belowpt6", "belowpt4")
conditions2 <- c("df$y < 0.6", "df$y < 0.4")
# create an object that contains the unique combinations of these conditions and their names, to be used for labeling columns later
names_combinations <- as.vector(t(outer(names1, names2, paste, sep="_")))
condition_combinations <- as.vector(t(outer(conditions1, conditions2, paste, sep=" & ")))
# create a dataframe of the logical values of these conditions
condition_combinations_logical <- ????? # This is where I need help
# lapply to get binary values from these logical vectors
df[paste0("var_", names_combinations)] <- +(condition_combinations_logical)
to get output that could look something like:
id | x  | y    | var_above2_belowpt6 | var_above2_belowpt4 | etc.
1  | 11 | 0.5  | 1                   | 0                   |
2  | 4  | 0.9  | 0                   | 0                   |
3  | 8  | 0.11 | 1                   | 1                   |
etc. ...
Looks like the dreaded eval(parse()) does it (hard to think of a much easier way ...). Then use storage.mode()<- to convert from logical to integer ...
res <- sapply(condition_combinations,function(x) eval(parse(text=x)))
storage.mode(res) <- "integer"
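Putting that together with the question's own assignment step, a fuller sketch (using the objects defined in the question) might look like this:
# evaluate each combined condition string against df; result is a logical matrix
condition_combinations_logical <- sapply(condition_combinations,
                                         function(x) eval(parse(text = x)))
# convert TRUE/FALSE to 1/0 and attach as new columns
storage.mode(condition_combinations_logical) <- "integer"
df[paste0("var_", names_combinations)] <- condition_combinations_logical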

How to return the row from a data frame based on the maximum value of the data frame in R?

Suppose I have a data frame as displayed below. Most of the suggestions I found on Stack Overflow aim at getting the max of one column and then returning the row index.
I was wondering whether there is a way to return the row index of the data frame by scanning two or more columns for the maximum.
To summarize, from the example below, I want to get the row:
11 building_footprint_sum 0.003 0.470
which holds the maximum value in the data frame:
+----+-------------------------+--------------------+-------------------+
| id | plot_name | rsquare_allotments | rsquare_block_dev |
+----+-------------------------+--------------------+-------------------+
| 6 | building_footprint_max | 0.002 | 0.421 |
| 7 | building_footprint_mean | 0.002 | 0.354 |
| 8 | building_footprint_med | 0.002 | 0.350 |
| 9 | building_footprint_min | 0.002 | 0.278 |
| 10 | building_footprint_sd | 0.003 | 0.052 |
| 11 | building_footprint_sum | 0.003 | 0.470 |
+----+-------------------------+--------------------+-------------------+
Is there a rather simple way to achieve this?
You are looking for the row index in which a matrix attains its maximum. You can do this by using which() with the arr.ind=TRUE option:
> set.seed(1)
> foo <- matrix(rnorm(6),3,2)
> which(foo==max(foo),arr.ind=TRUE)
row col
[1,] 1 2
So in this case, you would need row 1. (And you can discard the col output.)
If you go this route, be wary of floating point arithmetic and == (see FAQ 7.31). Better to do this:
> which(foo>max(foo)-0.01,arr.ind=TRUE)
row col
[1,] 1 2
where you use an appropriate small value in place of 0.01.
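Applied to the data frame in the question (a sketch; the object is assumed to be called df, with the two rsquare columns holding the numeric values), that would look like:
# compare against the maximum over both numeric columns and pull out the matching row
m <- as.matrix(df[, c("rsquare_allotments", "rsquare_block_dev")])
idx <- which(m == max(m), arr.ind = TRUE)
df[idx[, "row"], ]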
Try using pmax:
?pmax
pmax and pmin take one or more vectors (or matrices) as arguments and
return a single vector giving the ‘parallel’ maxima (or minima) of the vectors.
I would suggest doing this in two steps:
# make a new column that compares column 3 and column 4 and returns the larger value
> df$new <- pmax(df$rsquare_allotments, df$rsquare_block_dev)
# look for the row, where the new variable has the largest value
> df[(df$new == max(df$new)), ][3:4]
Consider that if the max value occurs more than once, your result will have more than one row.
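The two steps can also be collapsed into one line with which.max(), which keeps only the first row in case of ties (a sketch under the same df assumption):
# row whose larger rsquare value is the overall maximum (first match only)
df[which.max(pmax(df$rsquare_allotments, df$rsquare_block_dev)), ]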

Limit frequency table to one entry per unique ID

I have a data frame that tracks error codes generated by a script. Every time the script is executed, it adds records to a massive CSV file. The event_id field is unique to each run of the script. Each run may add multiple combinations of CRITICAL, ERROR, WARNING, DIAGNOSTIC, or INFORMATION messages with accompanying values and additional information (not represented here for simplicity).
I need to summarize the number of each class of error in the CSV file, but multiple errors from the same event id should only count as one error. Here's an example of how the data is structured:
event_id | class | value
1 | ERROR | 5409
1 | ERROR | 5410
2 | WARNING | 212
3 | ERROR | 5409
3 | WARNING | 400
3 | DIAGNOSTIC | 64
And this is what I'm looking to get as output. Even though there were three ERROR class events, two of them were associated with the same event, so it only counts as one.
class | count
ERROR | 2
WARNING | 2
DIAGNOSTIC | 1
I did try searching for this, but don't even know what keywords to search for. So even if you aren't able to answer the question, I'd appreciate any help with search queries.
df = read.table(header = T, sep = "|", text = "
event_id | class | value
1 | ERROR | 5409
1 | ERROR | 5410
2 | WARNING | 212
3 | ERROR | 5409
3 | WARNING | 400
3 | DIAGNOSTIC | 64")
library(data.table)
df = as.data.table(df)
setkey(df, event_id, class)
unique(df)[, .N, by = class]
# class N
#1: ERROR 2
#2: WARNING 2
#3: DIAGNOSTIC 1
You could split the event ids by class, count the unique ids in each group, and then create a data frame:
> s <- sapply(split(dat$event_id, dat$class), function(x) length(unique(x)))
> data.frame(count = s)
## count
## DIAGNOSTIC 1
## ERROR 2
## WARNING 2
You could build a 2-d table using the class and event_id variables, use pmin to limit values to 1 in that table, and then use rowSums to get it back to a 1-d table:
rowSums(pmin(table(dat$class, dat$event_id), 1))
# DIAGNOSTIC      ERROR    WARNING
#          1          2          2
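For completeness, a dplyr sketch (my own addition, using a hypothetical dat rebuilt from the example data) counts each class at most once per event_id:
library(dplyr)
# hypothetical reconstruction of the example data
dat <- data.frame(event_id = c(1, 1, 2, 3, 3, 3),
                  class    = c("ERROR", "ERROR", "WARNING", "ERROR", "WARNING", "DIAGNOSTIC"))
dat %>%
  distinct(event_id, class) %>%   # keep each (event_id, class) pair once
  count(class, name = "count")    # then count rows per class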
