rename_with but predicate based on value in other variable

Is there a way to use rename_with where the predicate is based on a value in another variable rather than on the column name?
Say I have a dataset as follows:
data <- tibble(home_team = c("SF", "KC", "JAX", "WAS", "BUF"),
away_team = c("GB", "CAR", "HOU", "NYG", "SEA"),
home_total = c(21, 25, 30, 22, 23.5),
home_plays = c(65, 64, 63, 57, 60),
away_total = c(30, 22, 25, 22, 25),
away_plays = c(56, 62, 66, 59, 62))
And I am trying to get it to look something like:
finalized_data <- tibble(team = c("SF", "KC", "JAX", "WAS", "BUF", "GB", "CAR", "HOU", "NYG", "SEA"),
total = c(21, 25, 30, 22, 23.5, 30, 22, 25, 22, 25),
plays = c(65, 64, 63, 57, 60, 56, 62, 66, 59, 62))
Currently, the best way I know is a mutate call that gets long when there are a lot of variables. There has got to be a cleaner way, since it is essentially a rename based on a variable in the data.
current_way <- data %>%
pivot_longer(c(home_team, away_team), names_to = "team_type", values_to = "team") %>%
mutate(total = ifelse(str_detect(team_type, "home_team"), home_total, away_total),
plays = ifelse(str_detect(team_type, "home_team"), home_plays, away_plays)) %>%
select(team, total, plays)
Any thoughts, or is there even a way to do it in the pivot function that I am missing?

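Here is an option with pivot_longer that uses the pattern in the column names. Splitting each name at "_" gives a prefix (home/away), which goes to a grp column, and a suffix (team/total/plays), which the special ".value" entry turns into the output columns.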
library(dplyr)
library(tidyr)
data %>%
pivot_longer(cols = everything(), names_to = c("grp", ".value"),
names_sep = "_") %>%
arrange(desc(grp)) %>%
select(-grp)
-output
# A tibble: 10 x 3
# team total plays
# <chr> <dbl> <dbl>
# 1 SF 21 65
# 2 KC 25 64
# 3 JAX 30 63
# 4 WAS 22 57
# 5 BUF 23.5 60
# 6 GB 30 56
# 7 CAR 22 62
# 8 HOU 25 66
# 9 NYG 22 59
#10 SEA 25 62
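To see what the ".value" entry is doing, here is a rough base R equivalent on a cut-down version of the question's data. It is only a sketch: the helper name strip_prefix is made up for illustration, and it assumes every column name starts with one of the known prefixes followed by "_".

```r
# Toy data with the same home_/away_ naming scheme (values from the question)
data <- data.frame(
  home_team  = c("SF", "KC"),
  away_team  = c("GB", "CAR"),
  home_total = c(21, 25),
  away_total = c(30, 22),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the columns for one prefix and strip the prefix
strip_prefix <- function(prefix, df) {
  cols <- grep(paste0("^", prefix, "_"), names(df), value = TRUE)
  out <- df[cols]
  names(out) <- sub(paste0("^", prefix, "_"), "", cols)
  out
}

# Stack the home block on top of the away block
stacked <- do.call(rbind, lapply(c("home", "away"), strip_prefix, df = data))
stacked
```

The pivot_longer call does the same thing in one step and does not need the prefixes listed by hand.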

Related

R: Sum Two Lists Embedded in Rows of a Dataframe by Groups

I have data on groups of bonds and their cash flows. Each IssueID contains multiple BondIDs. Each BondID has its own cash flow, rendered as a list within the data frame via dplyr mutate. The cash flow lists do not have equal numbers of elements. The data structure is:
IssueID  BondID  cashflow
AA       AA1     c(-1000, 50, 50, 1050)
AA       AA2     c(-1000, 25, 25, 25, 25, 1025)
AB       AB1     c(-2000, 100, 100, 2100)
AB       AB1     c(-1000, 75, 75, 75, 75, 1075)
I need to sum each BondID's cashflows by IssueID while maintaining each element's position in the list. The output needs to look like:
IssueID  sumcashflow
AA       c(-2000, 75, 75, 1075, 25, 1025)
AB       c(-3000, 175, 175, 2175, 75, 1075)
Appreciate any assistance. Thanks.
Am not able to sum the lists.

Here is one option. Grouped by 'IssueID', convert the cashflow list to a named list (names from row_number()), convert that to a tibble with enframe, and unnest the list column. Then group by the position within each 'name' (rowid, from data.table), which handles the unequal lengths, take the sum of 'value', pull the column, and wrap it in a list.
library(dplyr)
library(tibble)
library(tidyr)
library(data.table)
out <- df1 %>%
group_by(IssueID) %>%
summarise(sumcashflow = setNames(cashflow, row_number()) %>%
enframe %>%
unnest(value) %>%
group_by(grp = rowid(name)) %>%
summarise(value = sum(value, na.rm = TRUE)) %>%
pull(value) %>%
list(.))
-output
> out$sumcashflow
[[1]]
[1] -2000 75 75 1075 25 1025
[[2]]
[1] -3000 175 175 2175 75 1075
> out
# A tibble: 2 × 2
IssueID sumcashflow
<chr> <list>
1 AA <dbl [6]>
2 AB <dbl [6]>
or using base R with split
lst1 <- lapply(split(df1$cashflow, df1$IssueID), \(x) {
mx <- max(lengths(x))
rowSums(sapply(x, `length<-`, mx), na.rm = TRUE)
})
> lst1
$AA
[1] -2000 75 75 1075 25 1025
$AB
[1] -3000 175 175 2175 75 1075
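The padding step in the base R version can be hard to read at first, so here it is in isolation for the two AA cash flows from the question. This is only an illustration of what `length<-` and rowSums do; the variable names are chosen for the example.

```r
# Two cash-flow vectors of unequal length (the AA bonds from the question)
x <- list(c(-1000, 50, 50, 1050),
          c(-1000, 25, 25, 25, 25, 1025))

mx <- max(lengths(x))  # length of the longest vector

# `length<-` extends each vector to mx elements, filling with NA;
# sapply then binds the padded vectors as the columns of a matrix
padded <- sapply(x, `length<-`, mx)

# Sum position by position, ignoring the NA padding
total <- rowSums(padded, na.rm = TRUE)
total
```

Each row of padded corresponds to one position in the cash flow, which is why rowSums keeps the positions aligned.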
data
df1 <- structure(list(IssueID = c("AA", "AA", "AB", "AB"), BondID = c("AA1",
"AA2", "AB1", "AB1"), cashflow = list(c(-1000, 50, 50, 1050),
c(-1000, 25, 25, 25, 25, 1025), c(-2000, 100, 100, 2100),
c(-1000, 75, 75, 75, 75, 1075))), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))

df1 %>%
group_by(IssueID) %>%
summarise(val = list(colSums(do.call(qpcR:::rbind.na, cashflow), na.rm = TRUE)))
# A tibble: 2 x 2
IssueID val
<chr> <list>
1 AA <dbl [6]>
2 AB <dbl [6]>
val:
[[1]]
[1] -2000 75 75 1075 25 1025
[[2]]
[1] -3000 175 175 2175 75 1075

You can write a small function, and apply it to each IssueID
f <- function(cf) {
ml = lengths(cf)
for(i in seq_along(ml)) {
if(length(cf[[i]])<max(ml)) cf[[i]]=c(cf[[i]],rep(0,max(ml)-length(cf[[i]])))
}
list(rowSums(matrix(unlist(cf),ncol=length(cf))))
}
Using dplyr:
d %>% group_by(IssueID) %>% summarize(sumcashflow = f(cashflow))
Using data.table:
setDT(d)[,.(f(cashflow)), by=IssueID]
Output:
IssueID sumcashflow
1: AA -2000, 75, 75, 1075, 25, 1025
2: AB -3000, 175, 175, 2175, 75, 1075
Input:
d = data.table(
IssueID = c("AA","AA","AB","AB"),
BondID = c("AA1", "AA2","AB1", "AB2"),
cashflow = list(c(-1000,50,50,1050),
c(-1000,25,25,25,25,1025),
c(-2000,100,100,2100),
c(-1000,75,75,75,75,1075))
)

sum cells across similar data frames within a list in R, by data frames names

I have a list of data frames that look like this:
df1_BC <- data.frame(name=c("name1", "name2", "name3"),
year1=c(23, 45, 54),
year2=c(54, 23, 79),
year3=c(67, 29, 76))
df2_BC <- data.frame(name=c("name1", "name2", "name3"),
year1=c(93, 32, 56),
year2=c(82, 96, 72),
year3=c(54, 76, 19))
df3_BC <- data.frame(name=c("name1", "name2", "name3"),
year1=c(83, 41, 92),
year2=c(76, 73, 65),
year3=c(63, 62, 95))
df1_BA <- data.frame(name=c("name1", "name2", "name3", "name4"),
year1=c(23, 35, 54, 41),
year2=c(84, 23, 79, 69),
year3=c(97, 29, 76, 0))
df2_BA <- data.frame(name=c("name1", "name2", "name3", "name4"),
year1=c(93, 32, 56, 64),
year2=c(82, 96, 53, 0),
year3=c(54, 76, 19, 3))
df3_BA <- data.frame(name=c("name1", "name2", "name3", "name4"),
year1=c(83, 41, 92, 5),
year2=c(76, 3, 65, 82),
year3=c(3, 62, 95, 6))
list_dfs <- list(df1_BC, df2_BC, df3_BC, df1_BA, df2_BA, df3_BA)
As you can see, data frames with the same suffix ('BA' or 'BC') have the same columns and number of rows.
What I want to do is sum across the cells within each of the two groups of data frames (the ones with the 'BA' suffix and the ones with the 'BC' suffix).
If I do it on the dataframes alone, without listing them, I get the expected result:
result_BA <- df1_BA[,-1] + df2_BA[,-1] + df3_BA[,-1]
result_BC <- df1_BC[,-1] + df2_BC[,-1] + df3_BC[,-1]
print(result_BA)
year1 year2 year3
1 199 242 154
2 108 122 167
3 202 197 190
4 110 151 9
As you can also see, it is necessary to leave the name column out to do the sum. EDIT: Then I would like to put it back. Something like this:
result_BA <- cbind(df1_BA[1], result_BA)
so that the column of names is added back to each corresponding data frame in the list.
This is a simplified example from much larger lists, so doing it as a list and matching the dataframes to add up by suffix really simplifies the task.
Thanks!

The list didn't have any names, so we first need to construct it with names. One option is to create a named list, split it by the suffix of the names, and use Reduce with + on the inner list elements (after dropping the name column).
list_dfs <- list(df1_BC = df1_BC, df2_BC = df2_BC, df3_BC = df3_BC,
df1_BA = df1_BA, df2_BA = df2_BA, df3_BA = df3_BA)
lapply(split(list_dfs, sub(".*_", "", names(list_dfs))),
\(x) Reduce(`+`, lapply(x, `[`, -1)))
-output
$BA
year1 year2 year3
1 199 242 154
2 108 122 167
3 202 197 190
4 110 151 9
$BC
year1 year2 year3
1 199 212 184
2 118 192 167
3 202 216 190
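The question's EDIT also asks for the name column to be put back. A minimal sketch of one way to do that is below, using small made-up data frames; the helper sum_with_names and the suffix X are hypothetical, and it assumes the name column is the first column and is identical across the data frames in a group.

```r
# Made-up data frames sharing the suffix "X"
df1_X <- data.frame(name = c("a", "b"), y1 = c(1, 2), y2 = c(3, 4))
df2_X <- data.frame(name = c("a", "b"), y1 = c(10, 20), y2 = c(30, 40))
grp <- list(df1_X = df1_X, df2_X = df2_X)

# Hypothetical helper: sum the numeric columns, then re-attach the
# first column (the names) taken from the first data frame of the group
sum_with_names <- function(dfs) {
  cbind(dfs[[1]][1], Reduce(`+`, lapply(dfs, `[`, -1)))
}

res <- lapply(split(grp, sub(".*_", "", names(grp))), sum_with_names)
res$X
```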
Or this may be done with tidyverse using a group by approach
library(dplyr)
library(tidyr)
library(data.table)
list_dfs <- lst(df1_BC, df2_BC, df3_BC, df1_BA, df2_BA, df3_BA)
bind_rows(list_dfs, .id = 'name') %>%
separate(name, into = c("name1", "name2")) %>%
mutate(grp = rowid(name1, name2)) %>%
group_by(name2, grp) %>%
summarise(across(where(is.numeric), sum), .groups = "drop") %>%
select(-grp)
-output
# A tibble: 7 × 4
name2 year1 year2 year3
<chr> <dbl> <dbl> <dbl>
1 BA 199 242 154
2 BA 108 122 167
3 BA 202 197 190
4 BA 110 151 9
5 BC 199 212 184
6 BC 118 192 167
7 BC 202 216 190

Reshaping dataset in a way that values for variables become variable names and their values are picked from another column

-------------------NEW POST:
I posted an incorrect example of my data previously (leaving it below). In reality, my data has repeated "Modules" under the same column, and the previous solution doesn't work for my problem.
My example data (current dataset):
Year <- c("2013", "2020", "2015", "2012")
Grade <- c(28, 39, 76, 54)
Code <- c("A", "B", "C", "A")
Module1 <- c("English", "English", "Science", "English")
Results1 <- c(45, 58, 34, 54)
Module2 <- c("History", "History", "History", "Art")
Results2 <- c(12, 67, 98, 45)
Module3 <- c("Art", "Geography", "Math", "Geography")
Results3 <- c(89, 84, 45, 67)
Module14 <- c("Math", "Math", "Geography", "Art")
Results14 <- c(89, 24, 95, 67)
Module15 <-c("Science", "Art", "Art", "Science")
Results15 <-c(87, 24, 25, 67)
daf <- data.frame(Id, Year, Grade, Code, Module1, Results1, Module2, Results2, Module3, Results3, Module14, Results14, Module15, Results15)
My target - dataset I need to achieve:
Year <- c("2013", "2020", "2015", "2012")
Grade <- c(28, 39, 76, 54)
Code <- c("A", "B", "C", "A")
English <- c(45, 58,NA,54)
Math <- c(89, 24,45, NA)
Science <- c(87, NA, 34, 67)
Geography <- c(NA, 84, 95,67)
Art <- c(89,24,25,45)
wished_df <- data.frame(Id, Year, Grade, Code, English, Math, Science,Geography, Art)
Thanks again for any help!
-------------------------------- OLD POST:
I am trying to reshape my current data to new format.
Module1 <- c("English", "Math", "Science", "Geography")
Results1 <- c(45, 58, 34, 54)
Module2 <- c("Math", "History", "English", "Art")
Results2 <- c(12, 67, 98, 45)
Module3 <- c("History", "Art", "English", "Geography")
Results3 <- c(89, 84, 45, 67)
daf <- data.frame(Module1, Results1, Module2, Results2, Module3, Results3)
What I need is module names set as ‘variable names’, and module results set as ‘values for variable names’, looking like:
English1 <- c(45, 98, 45)
Math1 <- c(58, 12, NA)
Science1 <- c(34, NA, NA)
Geography1 <- c(54,NA, 67)
Art1 <- c(NA, 45, 84)
wished_df <- data.frame(English1, Math1, Science1,Geography1, Art1)
Thank you for any ideas.

1) reshape Using the data in the Note at the end, split the input column names into two groups (Module columns and Results columns) giving varying. Using that reshape to long form where varying= defines which columns in the input correspond to a single column in the long form. v.names= specifies the names to use for each of the two columns produced from the varying columns. reshape will give a data frame with columns time, Module, Result and id columns. We don't need the id column so drop it using [-4].
Then reshape that back to the new wide form. idvar= specifies the source of the output rows and timevar= specifies the source of the output columns. Everything else is the body of the result. reshape will generate a time column which we don't need so remove it using [-1]. At the end we remove the junk part of each column name.
No packages are used.
varying <- split(names(daf), sub("\\d+$", "", names(daf)))
long <- reshape(daf, dir = "long", varying = varying, v.names = names(varying))[-4]
wide <- reshape(long, dir = "wide", idvar = "time", timevar = "Module")[-1]
names(wide) <- sub(".*[.]", "", names(wide))
giving:
> wide
English Math Science Geography History Art
1.1 45 58 34 54 NA NA
1.2 98 12 NA NA 67 45
1.3 45 NA NA 67 89 84
2) pivot_ Using the data in the Note at the end, specify that all columns are to be used, and with the ".value" entry of names_to= specify that the column names in long form are taken from the first portion of the input column names, where the input names are split according to the names_pattern= regular expression. Then pivot to a new wide form where the column names are taken from the Module column and the values in the body of the result are taken from the Results column. The index column defines the rows and can be dropped afterwards.
library(dplyr)
library(tidyr)
daf %>%
pivot_longer(everything(), names_to = c(".value", "index"),
names_pattern = "(\\D+)(\\d+)") %>%
pivot_wider(names_from = Module, values_from = Results) %>%
select(-index)
giving:
# A tibble: 3 x 6
English Math History Art Science Geography
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 45 58 NA NA 34 54
2 98 12 67 45 NA NA
3 45 NA 89 84 NA 67
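To see how the names_pattern= regular expression splits the column names, the same two capture groups can be applied with base R. This is only an approximation of what pivot_longer does internally, shown on a few column names from the question; the data frame parts and its column names are made up for the example.

```r
nms <- c("Module1", "Results1", "Module14", "Results14")

# Group 1 = leading non-digits, group 2 = trailing digits
m <- regmatches(nms, regexec("(\\D+)(\\d+)", nms))

parts <- data.frame(
  column    = nms,
  value_col = vapply(m, `[`, character(1), 2),  # goes to ".value"
  index     = vapply(m, `[`, character(1), 3),  # goes to "index"
  stringsAsFactors = FALSE
)
parts
```

Columns that share a value_col end up in the same long-form column, and index tells them apart.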
3) unlist/tapply Using the data in the Note at the end, another base solution can be fashioned by separately unlisting the Module and Results columns to get the long form and using tapply to convert to wide form. No packages are used.
is_mod <- grepl("Module", names(daf))
long <- data.frame(Module = unlist(daf[is_mod]), Results = unlist(daf[!is_mod]))
tab <- tapply(long$Results, list(sub("\\d+$", "", rownames(long)), long$Module), sum)
as.data.frame.matrix(tab)
giving:
Art English Geography History Math Science
Module1 NA 45 54 NA 58 34
Module2 45 98 NA 67 12 NA
Module3 84 45 67 89 NA NA
Note
Module1 <- c("English", "Math", "Science", "Geography")
Results1 <- c(45, 58, 34, 54)
Module2 <- c("Math", "History", "English", "Art")
Results2 <- c(12, 67, 98, 45)
Module3 <- c("History", "Art", "English", "Geography")
Results3 <- c(89, 84, 45, 67)
daf <- data.frame(Module1, Results1, Module2, Results2, Module3, Results3)

A data.table version:
library(data.table)
library(magrittr)
dt <- as.data.table(daf)
dt %>%
melt.data.table(measure.vars = patterns("^Module", "^Result")) %>%
dcast.data.table(variable ~ ..., value.var = "value2")
giving:
Key: <variable>
variable Art English Geography History Math Science
<fctr> <num> <num> <num> <num> <num> <num>
1: 1 NA 45 54 NA 58 34
2: 2 45 98 NA 67 12 NA
3: 3 84 45 67 89 NA NA

How to find maximum value across several lists stored in a dataframe?

Another question about lists. Say I have a dataframe containing several lists. Assume they're the results of an American election. They include the vote shares for Democrats, Republicans, and Third Party candidates across three simulations for three states:
list1 <- list(c(40, 44, 52))
list2 <- list(c(22, 36, 18))
list3 <- list(c(45, 37, 42))
list4 <- list(c(60, 56, 48))
list5 <- list(c(34, 52, 26))
list6 <- list(c(55, 63, 58))
list7 <- list(c(0, 0, 0))
list8 <- list(c(44, 12, 56))
list9 <- list(c(0, 0, 0))
dat <- data.frame(State = c("Iowa", "Wisconsin", "Ohio"))
dat$DemocratVoteShare <- c(list1, list2, list3)
dat$RepublicanVoteShare <- c(list4, list5, list6)
dat$ThirdPartyVoteShare <- c(list7, list8, list9)
Note that Wisconsin is the only state with a Third Party candidate.
I'm trying to evaluate when a party had the maximum vote share in a state for a given simulation. The results would look like this:
dat$Winner <- c(list(c("Republican", "Republican", "Democrat")),
list(c("Third Party", "Republican", "Third Party")),
list(c("Republican", "Republican", "Republican")))
How can I achieve this using R, and ideally functionality from the tidyverse? Thanks in advance.

Here is a simple solution
dat %>%
mutate(
Winner = Map(function(x, y, z) {
c("Democrat", "Republican", "Third Party")[max.col(cbind(x, y ,z) == pmax(x, y ,z))]
}, DemocratVoteShare, RepublicanVoteShare, ThirdPartyVoteShare)
)
Output
State DemocratVoteShare RepublicanVoteShare ThirdPartyVoteShare Winner
1 Iowa 40, 44, 52 60, 56, 48 0, 0, 0 Republican, Republican, Democrat
2 Wisconsin 22, 36, 18 34, 52, 26 44, 12, 56 Third Party, Republican, Third Party
3 Ohio 45, 37, 42 55, 63, 58 0, 0, 0 Republican, Republican, Republican
Update
As pointed out by @akrun, you can use max.col directly, and it allows you to select a ties.method.
dat %>%
mutate(
Winner = Map(function(x, y, z) {
c("Democrat", "Republican", "Third Party")[max.col(cbind(x, y ,z), "first")]
}, DemocratVoteShare, RepublicanVoteShare, ThirdPartyVoteShare)
)
The code above gives higher priority to the option "Democrat". If you want to use "Republican" instead, then swap their positions like this
dat %>%
mutate(
Winner = Map(function(x, y, z) {
c("Republican", "Democrat", "Third Party")[max.col(cbind(x, y ,z), "first")]
}, RepublicanVoteShare, DemocratVoteShare, ThirdPartyVoteShare)
)
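The effect of ties.method is easiest to see on a small matrix with a deliberate tie. The numbers below are made up for illustration; only the second row has a tie between the first two columns.

```r
parties <- c("Democrat", "Republican", "Third Party")

# Rows are simulations, columns are parties; row 2 is a 50/50 tie
m <- cbind(x = c(40, 50, 52),
           y = c(60, 50, 48),
           z = c(0, 0, 0))

first_idx <- max.col(m, ties.method = "first")  # tie goes to column 1
last_idx  <- max.col(m, ties.method = "last")   # tie goes to column 2

parties[first_idx]
parties[last_idx]
```

The default ties.method is "random", so without an explicit choice a tied row can give different winners from run to run.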

We can use pmap
library(dplyr)
library(purrr)
dat %>%
mutate(Winner = pmap(select(cur_data(), -State),
~ c("Democrat", "Republican", "Third Party")[max.col(cbind(..1, ..2, ..3))]))
# State DemocratVoteShare RepublicanVoteShare ThirdPartyVoteShare Winner
#1 Iowa 40, 44, 52 60, 56, 48 0, 0, 0 Republican, Republican, Democrat
#2 Wisconsin 22, 36, 18 34, 52, 26 44, 12, 56 Third Party, Republican, Third Party
#3 Ohio 45, 37, 42 55, 63, 58 0, 0, 0 Republican, Republican, Republican

I think it would be better if you unnest the data.
library(dplyr)
val <- sub('VoteShare', '', names(dat[-1]))
dat %>%
tidyr::unnest(-State) %>%
mutate(Winner = val[max.col(.[-1], ties.method = 'first')])
# State DemocratVoteShare RepublicanVoteShare ThirdPartyVoteShare Winner
# <chr> <dbl> <dbl> <dbl> <chr>
#1 Iowa 40 60 0 Republican
#2 Iowa 44 56 0 Republican
#3 Iowa 52 48 0 Democrat
#4 Wisconsin 22 34 44 ThirdParty
#5 Wisconsin 36 52 12 Republican
#6 Wisconsin 18 26 56 ThirdParty
#7 Ohio 45 55 0 Republican
#8 Ohio 37 63 0 Republican
#9 Ohio 42 58 0 Republican

How to perform splsda in R showing error rate for several elements of a list?

I am performing a splsda-model in R on 10 dataframes (data of 10 study areas), stored as a list (datalist). All these dataframes are similar, with the same variables, but just different values.
I use the mixOmics library to do this.
This is the head of the first study area. It compares the absence or presence of wetlands (factor variable - wetl or no wetl) depending on its value of TPI of different ranges.
> head(datalist[[1]])
OID POINTID WETLAND TPI200 TPI350 TPI500 TPI700 TPI900 TPI1000 TPI2000 TPI3000 TPI4000 TPI5000 TPI2500
1 -1 1 no wetl 70 67 55 50 48 46 53 47 49 63 48
2 -1 2 no wetl 37 42 35 29 32 16 17 35 49 63 26
3 -1 3 no wetl 45 55 45 39 41 41 53 47 49 63 48
4 -1 4 no wetl 46 58 51 43 46 36 54 47 49 62 49
5 -1 5 no wetl 58 55 53 49 47 46 54 47 49 62 49
6 -1 6 no wetl 56 53 51 49 46 46 54 47 49 61 49
I have done the cross validation step using following code:
library(mixOmics)
for (i in 1: length(model_list))
{
myperf_plsda <- perf(model_list[[i]], validation = "Mfold", folds = 10,
progressBar = FALSE, nrepeat = 10, auc = TRUE)
save(myperf_plsda, file="performancePLSDA.RData")
}
model_list is the list obtained from the splsda function.
But now I am stuck in the next step, which is to look at the error rate (overall and per class)
For just one dataframe (studyarea), I can use the following code:
# cross-validation error in function of nr of PCs
# can see how many PCs is best
plot(myperf_plsda, col = color.mixo(5:7), sd = TRUE,
legend.position = "horizontal")
# error rate overall and per class
myperf_plsda$error.rate
myperf_plsda$error.rate.class
myperf_plsda$auc
So first, I am trying to plot the error as a function of the number of principal components (the plot call in the code above, for one study area). I would like to have the result in a PDF.
Second, I want to know the overall error rate and error rate per class, from which the code is mentioned above for one study area. The result for one study area is then for example:
overall error rate:
error rate per class:
I have tried some ways to put all this code in a for loop, or to use lapply, in order to get these results for the 10 study areas, such as:
### To see how many PCs is best ###
pdf('overallerrorrate_wetlall_small.pdf')
for (i in 1:length(myperf_plsda))
{
plot(model_list[[i]], col = color.mixo(5:7), sd = TRUE,
legend.position = "horizontal")
}
dev.off()
or
for (i in 1:length(myperf_plsda))
{plot(myperf_plsda, col = color.mixo(5:7), sd = TRUE,
legend.position = "horizontal")}
or
for (i in 1:length(myperf_plsda))
{myperf_plsda[[1]]error.rate
myperf_plsda[[1]]error.rate.class
myperf_plsda[[i]]auc
}
or
lapply(myperf_plsda, `[[`, 'error.rate')
But none of these attempts work! How can I run the code for multiple elements in a list? Many thanks!

Based on your outputs, you will have to create a new list and save the results in it. Reusing the single object myperf_plsda overwrites it at each step of the loop. Also, most of the measures you want are lists, so I added some processing functions to get data frames. I used the following dummy data:
library(mixOmics)
#Function
custom_splsda <- function(datalist, ncomp, keepX, ..., Xcols, Ycol){
Y <- datalist[[Ycol]]
X <- datalist[Xcols]
res <- splsda(X, Y, ncomp = ncomp, keepX = keepX, ...)
res
}
#Data
datalist <- list(df1 = structure(list(OID = c(-1, -1, -1, -1, -1, -1), POINTID = c(1,
2, 3, 4, 5, 6), WETLAND = c("no wetl", "no wetl", "no wetl",
"wetl", "wetl", "wetl"), TPI200 = c(70, 37, 45, 46, 58, 56),
TPI350 = c(67, 42, 55, 58, 55, 53), TPI500 = c(55, 35, 45,
51, 53, 51), TPI700 = c(50, 29, 39, 43, 49, 49), TPI900 = c(48,
32, 41, 46, 47, 46), TPI1000 = c(46, 16, 41, 36, 46, 46),
TPI2000 = c(53, 17, 53, 54, 54, 54), TPI3000 = c(47, 35,
47, 47, 47, 47), TPI4000 = c(49, 49, 49, 49, 49, 49), TPI5000 = c(63,
63, 63, 62, 62, 61), TPI2500 = c(48, 26, 48, 49, 49, 49)), row.names = c(NA,
6L), class = "data.frame"), df2 = structure(list(OID = c(-1,
-1, -1, -1, -1, -1), POINTID = c(1, 2, 3, 4, 5, 6), WETLAND = c("no wetl",
"no wetl", "no wetl", "wetl", "wetl", "wetl"), TPI200 = c(70,
37, 45, 46, 58, 56), TPI350 = c(67, 42, 55, 58, 55, 53), TPI500 = c(55,
35, 45, 51, 53, 51), TPI700 = c(50, 29, 39, 43, 49, 49), TPI900 = c(48,
32, 41, 46, 47, 46), TPI1000 = c(46, 16, 41, 36, 46, 46), TPI2000 = c(53,
17, 53, 54, 54, 54), TPI3000 = c(47, 35, 47, 47, 47, 47), TPI4000 = c(49,
49, 49, 49, 49, 49), TPI5000 = c(63, 63, 63, 62, 62, 61), TPI2500 = c(48,
26, 48, 49, 49, 49)), row.names = c(NA, 6L), class = "data.frame"))
Now the code. I will create an empty list myperf_plsda:
#Create model_list, you must have the object created
model_list <- lapply(datalist, custom_splsda,
ncomp = 2, keepX = c(5, 5),
Xcols = 4:8, Ycol = "WETLAND")
#Create empty list
myperf_plsda <- list()
#Loop for objects and saving
for (i in 1: length(model_list))
{
myperf_plsda[[i]] <- perf(model_list[[i]], validation = "Mfold", folds = 3,
progressBar = FALSE, nrepeat = 3, auc = TRUE)
object <- myperf_plsda[[i]]
save(object,file = paste0("performancePLSDA.",i,".RData"))
}
#Process the object myperf_plsda
#First function to get elements
extract1 <- function(x)
{
#Object
error.rate <- x$error.rate
error.rate <- lapply(error.rate, as.data.frame)
#Process
O1 <- do.call(rbind,error.rate)
#Separate vars
O1$id <- rownames(O1)
rownames(O1) <- NULL
O1$id1 <- gsub("\\..*","", O1$id )
O1$id2 <- gsub(".*\\.","", O1$id )
O1$id <- NULL
return(O1)
}
#Function 2
extract2 <- function(x)
{
#Object
error.rate.class <- x$error.rate.class
names(error.rate.class) <- gsub('.','_',names(error.rate.class),fixed = T)
error.rate.class <- lapply(error.rate.class, as.data.frame)
#Process
O2 <- do.call(rbind,error.rate.class)
#Separate vars
O2$id <- rownames(O2)
rownames(O2) <- NULL
O2$id1 <- gsub("\\..*","", O2$id )
O2$id2 <- gsub(".*\\.","", O2$id )
O2$id <- NULL
return(O2)
}
#Function 3
extract3 <- function(x)
{
#Object
auc <- x$auc
#Modify for dataframe
change <- function(x)
{
y <- as.data.frame(x)
y$id1 <- rownames(y)
rownames(y)<-NULL
y$id1 <- gsub('.','_',y$id1,fixed = T)
return(y)
}
auc <- lapply(auc, change)
#Process
O3 <- do.call(rbind,auc)
#Separate vars
O3$id2 <- rownames(O3)
rownames(O3) <- NULL
O3$id2 <- gsub("\\..*","", O3$id2 )
return(O3)
}
#Apply functions and save in lists for later processing
L1 <- lapply(myperf_plsda,extract1)
L2 <- lapply(myperf_plsda,extract2)
L3 <- lapply(myperf_plsda,extract3)
#Assign the same names from model_list
names(L1) <- names(model_list)
names(L2) <- names(model_list)
names(L3) <- names(model_list)
#Bind the data
#Error rate
error.rate.df <- do.call(rbind,L1)
error.rate.df$genid <- gsub("\\..*","", rownames(error.rate.df) )
rownames(error.rate.df) <- NULL
#Error rate class
error.rate.class.df <- do.call(rbind,L2)
error.rate.class.df$genid <- gsub("\\..*","", rownames(error.rate.class.df) )
rownames(error.rate.class.df) <- NULL
#Auc
auc.df <- do.call(rbind,L3)
auc.df$genid <- gsub("\\..*","", rownames(auc.df) )
rownames(auc.df) <- NULL
With the previous code you will end up with three data frames whose values are identified according to the names of model_list. You can navigate by the variables id1, id2 and genid to see measures, components and datasets:
error.rate.df
max.dist centroids.dist mahalanobis.dist id1 id2 genid
1 0.2222222 0.2222222 0.2222222 overall comp1 df1
2 0.2777778 0.3888889 0.2777778 overall comp2 df1
3 0.2222222 0.2222222 0.2222222 BER comp1 df1
4 0.2777778 0.3888889 0.2777778 BER comp2 df1
5 0.2222222 0.2222222 0.2222222 overall comp1 df2
6 0.2777778 0.3333333 0.2777778 overall comp2 df2
7 0.2222222 0.2222222 0.2222222 BER comp1 df2
8 0.2777778 0.3333333 0.2777778 BER comp2 df2
error.rate.class.df
comp1 comp2 id1 id2 genid
1 0.3333333 0.3333333 max_dist no wetl df1
2 0.1111111 0.2222222 max_dist wetl df1
3 0.3333333 0.6666667 centroids_dist no wetl df1
4 0.1111111 0.1111111 centroids_dist wetl df1
5 0.3333333 0.3333333 mahalanobis_dist no wetl df1
6 0.1111111 0.2222222 mahalanobis_dist wetl df1
7 0.3333333 0.3333333 max_dist no wetl df2
8 0.1111111 0.2222222 max_dist wetl df2
9 0.3333333 0.5555556 centroids_dist no wetl df2
10 0.1111111 0.1111111 centroids_dist wetl df2
11 0.3333333 0.3333333 mahalanobis_dist no wetl df2
12 0.1111111 0.2222222 mahalanobis_dist wetl df2
auc.df
x id1 id2 genid
1 0.62966667 AUC_mean comp1 df1
2 0.06414361 AUC_sd comp1 df1
3 0.81483333 AUC_mean comp2 df1
4 0.06414361 AUC_sd comp2 df1
5 0.62966667 AUC_mean comp1 df2
6 0.06414361 AUC_sd comp1 df2
7 0.77780000 AUC_mean comp2 df2
8 0.11110000 AUC_sd comp2 df2
Finally, for the plots you can use the following code (I have assigned the name of the dataset to the x label so you can identify it in the plots):
#Plot and save
#Assign names
names(myperf_plsda) <- names(model_list)
pdf('example.pdf')
for (i in 1:length(myperf_plsda))
{
plot(myperf_plsda[[i]], col = color.mixo(5:7), sd = TRUE,
legend.position = "horizontal",xlab = paste0(names(myperf_plsda)[i],' (Comp)'))
}
dev.off()
As a remark, I have reduced the number of folds and repeats so that the code works on the small dummy data; with your real data you can set the original values you have.
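Stripped of mixOmics, the core pattern is: store one result per list element, then pull the same component out of every element. The sketch below uses a hand-made list as a stand-in for the perf() results; the object fake_perf and all numbers in it are made up and are not mixOmics output.

```r
# Stand-in for a list of perf() results, one element per study area
fake_perf <- list(
  area1 = list(error.rate = c(comp1 = 0.22, comp2 = 0.28), auc = 0.63),
  area2 = list(error.rate = c(comp1 = 0.25, comp2 = 0.30), auc = 0.78)
)

# Pull the same component out of every element with `[[`
rates <- lapply(fake_perf, `[[`, "error.rate")

# One row per study area, one column per component
rate_tab <- do.call(rbind, rates)
rate_tab
```

Note the backticks around [[ in the lapply call: without them R tries to parse [[ as an incomplete subsetting expression rather than as a function.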
