Using apply over two lists of different lengths - r

This question is related to my earlier question found here: https://stackoverflow.com/questions/33089532/r-accounting-for-a-factor-with-this-logistic-regression-function-replace-lappl
I realize that I didn't do a good job of asking the first question, so here is a simpler analog with actual data:
My data looks something like this:
# data look like this, but with a variable number of "y" columns
wk<-rep(1:50,2)
X<-rnorm(100,1)
y1<-rnorm(100,1)
y2<-rnorm(100,1)
df1<-as.data.frame(cbind(wk,X,y1,y2))
df1$hyst<-ifelse(df1$wk>=5 & df1$wk<32, "R", "F")
Y<-df1[, -which(colnames(df1) %in% c("wk"))] #this step makes more sense with my actual data since I have a bunch of columns to remove
l1<-length(Y)-1
lst1<-lapply(2:l1,function(x){colnames(Y[x])})
dflst<-c("Y",'Y[Y$hyst=="R",]','Y[Y$hyst=="F",]')
I want to run a model over all Y columns for the full data set (all data) and for two subsets, when the factor hyst=="R" and when hyst=="F".
To do this, I have nested two lapply functions, which sort of works, but I think it essentially doubles my results and is causing me all sorts of list headaches.
Here is the nested lapply code:
lms <- lapply(dflst, function(z){
  lapply(lst1, function(y) {
    form <- paste0(y, " ~ X")
    lm(form, data=eval(parse(text=z)))
  })
})
How can I replace or modify the nested lapply function to obtain a model run for each Y column for each data set (all, "R", and "F")?

Construct your DF list like
DFlst <- c(list(full=Y), split(Y, Y$hyst))
str(DFlst)
List of 3
$ full:'data.frame': 100 obs. of 4 variables:
..$ X : num [1:100] 1.792 3.192 0.367 1.632 1.388 ...
..$ y1 : num [1:100] 3.354 1.189 1.99 0.639 0.1 ...
..$ y2 : num [1:100] 0.864 2.415 0.437 1.069 1.368 ...
..$ hyst: chr [1:100] "F" "F" "F" "F" ...
$ F :'data.frame': 46 obs. of 4 variables:
..$ X : num [1:46] 1.792 3.192 0.367 1.632 0.707 ...
..$ y1 : num [1:46] 3.354 1.189 1.99 0.639 0.894 ...
..$ y2 : num [1:46] 0.864 2.415 0.437 1.069 1.213 ...
..$ hyst: chr [1:46] "F" "F" "F" "F" ...
$ R :'data.frame': 54 obs. of 4 variables:
..$ X : num [1:54] 1.388 2.296 0.409 1.494 0.943 ...
..$ y1 : num [1:54] 0.1002 0.6425 -0.0918 1.199 0.8767 ...
..$ y2 : num [1:54] 1.368 1.122 0.402 -0.237 1.518 ...
..$ hyst: chr [1:54] "R" "R" "R" "R" ...
Do some regressions:
res <- lapply(DFlst, function(DF) {
  cols = grep("^y[0-9]+$",names(DF),value=TRUE)
  lapply(setNames(cols,cols),
         function(y) lm(paste(y,"~X"), data=DF))
})
str(res, list.len=2, give.attr=FALSE)
List of 3
$ full:List of 2
..$ y1:List of 12
.. ..$ coefficients : Named num [1:2] 0.903 0.111
.. ..$ residuals : Named num [1:100] 2.2509 -0.0698 1.046 -0.4464 -0.9578 ...
.. .. [list output truncated]
..$ y2:List of 12
.. ..$ coefficients : Named num [1:2] 1.423 -0.166
.. ..$ residuals : Named num [1:100] -0.2623 1.5213 -0.9253 -0.0837 0.1751 ...
.. .. [list output truncated]
$ F :List of 2
..$ y1:List of 12
.. ..$ coefficients : Named num [1:2] 0.9289 0.0769
.. ..$ residuals : Named num [1:46] 2.2871 0.0146 1.0332 -0.4157 -0.0889 ...
.. .. [list output truncated]
..$ y2:List of 12
.. ..$ coefficients : Named num [1:2] 1.4177 -0.0789
.. ..$ residuals : Named num [1:46] -0.413 1.25 -0.952 -0.22 -0.149 ...
.. .. [list output truncated]
[list output truncated]
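If the nested result is still awkward to work with, one option is to flatten it by one level. The following is a minimal sketch, not part of the answer above: it regenerates similar toy data (the seed and the names `flat` and `coef_tab` are assumptions for illustration) and collects the coefficients of all six models into one matrix.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
wk <- rep(1:50, 2)
Y <- data.frame(X = rnorm(100, 1), y1 = rnorm(100, 1), y2 = rnorm(100, 1))
Y$hyst <- ifelse(wk >= 5 & wk < 32, "R", "F")

# Same construction as in the answer: full data plus one subset per factor level
DFlst <- c(list(full = Y), split(Y, Y$hyst))
res <- lapply(DFlst, function(DF) {
  cols <- grep("^y[0-9]+$", names(DF), value = TRUE)
  lapply(setNames(cols, cols), function(y) lm(paste(y, "~ X"), data = DF))
})

# Drop one nesting level: names become "full.y1", "full.y2", "F.y1", ...
flat <- unlist(res, recursive = FALSE)

# One row per model, one column per coefficient
coef_tab <- t(sapply(flat, coef))
print(coef_tab)
```

Because every model is named by data set and response, a single model can then be pulled out directly, for example `flat[["R.y2"]]`.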

Related

Mutate within map of a list within list

How can I move the value g into a column in df using map?
r<-data.frame(o=runif(n = 50),m=rep(c("A","N"),25))
te<-data.frame(o=runif(n = 50),m=rep(c("G","H"),25))
aq<-list(f=list(df=r,g=0),g2=list(df=te,g=5))
the expected result after str is:
List of 2
$ f :List of 2
..$ df:'data.frame': 50 obs. of 2 variables:
.. ..$ o: num [1:50] 0.785 0.253 0.228 0.323 0.332 ...
.. ..$ m: chr [1:50] "A" "N" "A" "N" ...
.. ..$ g: num [1:50] 0
..$ g : num 0
$ g2:List of 2
..$ df:'data.frame': 50 obs. of 2 variables:
.. ..$ o: num [1:50] 0.0271 0.6264 0.1487 0.2008 0.6946 ...
.. ..$ m: chr [1:50] "G" "H" "G" "H" ...
.. ..$ g: num [1:50] 5
..$ g : num 5
map(aq,~mutate(.$df$g=.$g)) does not work. Any other idea how this can be done?
Same output as in Akrun's comment (i.e. one less nesting level), based on your code:
map(aq, ~ dplyr::mutate(.x$df, g = .x$g))
Simple edit to get your desired structure:
map(aq, ~ list(df = dplyr::mutate(.x$df, g = .x$g), g = .x$g))
(Edit: per Misha's comment, this is working with the development version of purrr (0.2.2.9000) but not with the current CRAN version (0.2.2). Don't know why yet).
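If the purrr version is a concern, the same structure can probably be produced with base R alone. This is a sketch under the assumption that each list element has a data frame `df` and a scalar `g`, as in the question; the seed and the name `aq2` are made up for illustration.

```r
set.seed(2)  # arbitrary seed for reproducibility
r  <- data.frame(o = runif(50), m = rep(c("A", "N"), 25))
te <- data.frame(o = runif(50), m = rep(c("G", "H"), 25))
aq <- list(f = list(df = r, g = 0), g2 = list(df = te, g = 5))

# For each element, copy the scalar g into a new column of its data frame
aq2 <- lapply(aq, function(el) {
  el$df$g <- el$g   # the length-1 value is recycled to every row
  el                # return the whole element so the nesting is preserved
})
str(aq2)
```

Returning `el` rather than `el$df` is what keeps both `df` and `g` in each element, matching the expected structure in the question.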

Extract nested list elements using bracketed numbers and names

After running a repeated measures ANOVA and naming the output
RM_test <- ezANOVA(data=test_data, dv=var_test, wid = .(subject),
within = .(water_year), type = 3)
I looked at the internal structure of the named object using str(RM_test) and received the following:
List of 3
$ ANOVA :List of 3
..$ ANOVA :'data.frame': 1 obs. of 7 variables:
.. ..$ Effect: chr "water_year"
.. ..$ DFn : num 2
.. ..$ DFd : num 22
.. ..$ F : num 26.8
.. ..$ p : num 1.26e-06
.. ..$ p<.05 : chr "*"
.. ..$ ges : num 0.531
..$ Mauchly's Test for Sphericity:'data.frame': 1 obs. of 4 variables:
.. ..$ Effect: chr "water_year"
.. ..$ W : num 0.875
.. ..$ p : num 0.512
.. ..$ p<.05 : chr ""
..$ Sphericity Corrections :'data.frame': 1 obs. of 7 variables:
.. ..$ Effect : chr "water_year"
.. ..$ GGe : num 0.889
.. ..$ p[GG] : num 4.26e-06
.. ..$ p[GG]<.05: chr "*"
.. ..$ HFe : num 1.05
.. ..$ p[HF] : num 1.26e-06
.. ..$ p[HF]<.05: chr "*"
$ Mauchly's Test for Sphericity:'data.frame': 1 obs. of 4 variables:
..$ Effect: chr "wtr_yr"
..$ W : num 0.875
..$ p : num 0.512
..$ p<.05 : chr ""
$ Sphericity Corrections :'data.frame': 1 obs. of 7 variables:
..$ Effect : chr "wtr_yr"
..$ GGe : num 0.889
..$ p[GG] : num 4.26e-06
..$ p[GG]<.05: chr "*"
..$ HFe : num 1.05
..$ p[HF] : num 1.26e-06
..$ p[HF]<.05: chr "*"
I was able to extract the fourth variable F from the first data frame using RM_test[[1]][[4]][1] but cannot figure out how to extract the third variable p[GG] from the data frame Sphericity Corrections. This data frame appears twice so extracting either one would be fine.
Suggestions on how to do this using bracketed numbers and names would be appreciated.
The problem seems to be that you do not know how to extract list elements. As you said, there are two Sphericity Corrections data frames, so I will show how to get the p[GG] value for both.
using bracketed number
For the first one, we do RM_test[[1]][[3]][[3]]. You can do it step by step to understand it:
x1 <- RM_test[[1]]; str(x1)
x2 <- x1[[3]]; str(x2)
x3 <- x2[[3]]; str(x3)
For the second one, do RM_test[[3]][[3]].
using bracketed name
Instead of using numbers for indexing, we can use names. For the first, do
RM_test[["ANOVA"]][["Sphericity Corrections"]][["p[GG]"]]
For the second, do
RM_test[["Sphericity Corrections"]][["p[GG]"]]
using $
For the first one, do
RM_test$ANOVA$"Sphericity Corrections"$"p[GG]"
For the second one, do
RM_test$"Sphericity Corrections"$"p[GG]"
Note the use of quotes "" where a name contains spaces or other special characters.
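To try the three indexing styles without running ezANOVA, here is a small sketch on a mock list. `RM_mock` is a hypothetical stand-in that only mimics the nesting and a few of the columns shown in the str() output above; it is not real ezANOVA output. Backticks are used with `$` here as an alternative to the quotes shown above.

```r
# Mock object mimicking the nesting above; values copied from the str() output,
# with most columns left out for brevity
sph <- data.frame(Effect = "water_year", GGe = 0.889, `p[GG]` = 4.26e-06,
                  check.names = FALSE)  # keep the non-syntactic name "p[GG]"
RM_mock <- list(
  ANOVA = list(
    ANOVA = data.frame(Effect = "water_year", F = 26.8),
    "Mauchly's Test for Sphericity" = data.frame(Effect = "water_year", W = 0.875),
    "Sphericity Corrections" = sph
  ),
  "Mauchly's Test for Sphericity" = data.frame(Effect = "wtr_yr", W = 0.875),
  "Sphericity Corrections" = sph
)

by_number <- RM_mock[[1]][[3]][[3]]
by_name   <- RM_mock[["ANOVA"]][["Sphericity Corrections"]][["p[GG]"]]
by_dollar <- RM_mock$ANOVA$`Sphericity Corrections`$`p[GG]`
top_level <- RM_mock[[3]][[3]]   # the copy stored at the top level

print(c(by_number, by_name, by_dollar, top_level))
```

Indexing by name is usually the more robust choice, since it keeps working if the order of the elements changes.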

Adding principal components as variables to a data frame

I am working with a dataset of 10000 data points and 100 variables in R. Unfortunately the variables I have do not describe the data in a good way. I carried out a PCA using prcomp() and the first 3 PCs seem to account for most of the variability of the data. As far as I understand, a principal component is a combination of different variables; therefore it has a certain value corresponding to each data point and can be considered as a new variable. Would I be able to add these principal components as 3 new variables to my data? I would need them for further analysis.
A reproducible dataset:
set.seed(144)
x <- data.frame(matrix(rnorm(2^10*12), ncol=12))
y <- prcomp(formula = ~., data=x, center = TRUE, scale = TRUE, na.action = na.omit)
PC scores are stored in the element x of the prcomp() result.
str(y)
List of 6
$ sdev : num [1:12] 1.08 1.06 1.05 1.04 1.03 ...
$ rotation: num [1:12, 1:12] -0.0175 -0.1312 0.3284 -0.4134 0.2341 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:12] "X1" "X2" "X3" "X4" ...
.. ..$ : chr [1:12] "PC1" "PC2" "PC3" "PC4" ...
$ center : Named num [1:12] 0.02741 -0.01692 -0.03228 -0.03303 0.00122 ...
..- attr(*, "names")= chr [1:12] "X1" "X2" "X3" "X4" ...
$ scale : Named num [1:12] 0.998 1.057 1.019 1.007 0.993 ...
..- attr(*, "names")= chr [1:12] "X1" "X2" "X3" "X4" ...
$ x : num [1:1024, 1:12] 1.023 -1.213 0.167 -0.118 -0.186 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:1024] "1" "2" "3" "4" ...
.. ..$ : chr [1:12] "PC1" "PC2" "PC3" "PC4" ...
$ call : language prcomp(formula = ~., data = x, na.action = na.omit, center = TRUE, scale = TRUE)
- attr(*, "class")= chr "prcomp"
You can get them with y$x and then choose the columns you need.
x.new<-cbind(x,y$x[,1:3])
str(x.new)
'data.frame': 1024 obs. of 15 variables:
$ X1 : num 1.14 2.38 0.684 1.785 0.313 ...
$ X2 : num -0.689 0.446 -0.72 -3.511 0.36 ...
$ X3 : num 0.722 0.816 0.295 -0.48 0.566 ...
$ X4 : num 1.629 0.738 0.85 1.057 0.116 ...
$ X5 : num -0.737 -0.827 0.65 -0.496 -1.045 ...
$ X6 : num 0.347 0.056 -0.606 1.077 0.257 ...
$ X7 : num -0.773 1.042 2.149 -0.599 0.516 ...
$ X8 : num 2.05511 0.4772 0.18614 0.02585 0.00619 ...
$ X9 : num -0.0462 1.3784 -0.2489 0.1625 0.6137 ...
$ X10: num -0.709 0.755 0.463 -0.594 -1.228 ...
$ X11: num -1.233 -0.376 -2.646 1.094 0.207 ...
$ X12: num -0.44 -2.049 0.315 0.157 2.245 ...
$ PC1: num 1.023 -1.213 0.167 -0.118 -0.186 ...
$ PC2: num 1.2408 0.6077 1.1885 3.0789 0.0797 ...
$ PC3: num -0.776 -1.41 0.977 -1.343 0.987 ...
Didzis Elferts's response only works if your data, x, has no NAs. Here's how you can add the components if your data does have NAs.
library(tidyverse)
components <- as.data.frame(y$x) %>% rownames_to_column("id") # y$x is a matrix; rownames_to_column() needs a data frame
x <- x %>% rownames_to_column("id") %>% left_join(components, by = "id")
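The same alignment can be sketched in base R by matching on row names. This is an illustration rather than the method of either answer: the rows set to NA (`na_rows`) and the names `padded` and `x_new` are assumptions, and the score matrix is padded with NA for the rows that na.omit dropped.

```r
set.seed(144)
x <- data.frame(matrix(rnorm(2^10 * 12), ncol = 12))
na_rows <- c(3, 10)          # hypothetical rows given a missing value
x[na_rows, "X1"] <- NA

y <- prcomp(~ ., data = x, center = TRUE, scale. = TRUE, na.action = na.omit)

scores <- y$x[, 1:3]         # rows containing NAs are absent here
padded <- matrix(NA_real_, nrow = nrow(x), ncol = ncol(scores),
                 dimnames = list(rownames(x), colnames(scores)))
padded[rownames(scores), ] <- scores   # align the scores by row name

x_new <- cbind(x, padded)    # rows without scores keep NA in PC1..PC3
str(x_new[, c("X1", "PC1", "PC2", "PC3")])
```

The result has the same number of rows as the original data, so later steps that rely on row order are not affected by the dropped observations.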

Feature selection using the penalizedLDA package

I am trying to use the penalizedLDA package to run a penalized linear discriminant analysis in order to select the "most meaningful" variables. I have searched here and on other sites for help in accessing the output from the penalized model, to no avail.
My data comprise 400 variables and 44 groups. Here is the code I used and the results I have so far:
yy.m<-as.matrix(yy) #Factors/groups
xx.m<-as.matrix(xx) #Variables
cv.out<-PenalizedLDA.cv(xx.m,yy.m,type="standard")
## apply the penalty
out <- PenalizedLDA(xx.m,yy.m,lambda=cv.out$bestlambda,K=cv.out$bestK)
To get the structure of the output from the analysis:
> str(out)
List of 10
$ discrim: num [1:401, 1:4] -0.0234 -0.0219 -0.0189 -0.0143 -0.0102 ...
$ xproj : num [1:100, 1:4] -8.31 -14.68 -11.07 -13.46 -26.2 ...
$ K : int 4
$ crits :List of 4
..$ : num [1:4] 2827 2827 2827 2827
..$ : num [1:4] 914 914 914 914
..$ : num [1:4] 162 162 162 162
..$ : num [1:4] 48.6 48.6 48.6 48.6
$ type : chr "standard"
$ lambda : num 0
$ lambda2: NULL
$ wcsd.x : Named num [1:401] 0.0379 0.0335 0.0292 0.0261 0.0217 ...
..- attr(*, "names")= chr [1:401] "R400" "R405" "R410" "R415" ...
$ x : num [1:100, 1:401] 0.147 0.144 0.145 0.141 0.129 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:401] "R400" "R405" "R410" "R415" ...
$ y : num [1:100, 1] 2 2 2 2 2 1 1 1 1 1 ...
- attr(*, "class")= chr "penlda"
I am interested in obtaining a list or matrix of the top 20 variables for feature selection, most likely based on the coefficients of the linear discriminants.
I realized I would have to sort the coefficients in descending order, and get the variable names matched to it. So the output I would expect is something like this imaginary example
V1 V2
R400 0.34
R1535 0.22...
Can anyone provide any pointers (not necessarily the R code)? Thanks in advance.
Your out$K is 4, and that means you have 4 discriminant vectors. If you want the top 20 variables according to, say, the 2nd vector, try this:
# get the data frame of variable names and coefficients
var.coef = data.frame(colnames(xx.m), out$discrim[,2])
# sort the 2nd column (the coefficients) in decreasing order, and only keep the top 20
var.coef.top = var.coef[order(var.coef[,2], decreasing = TRUE)[1:20], ]
var.coef.top is what you want.
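One variation worth considering: sorting in decreasing order keeps only large positive coefficients, while a large negative coefficient may matter just as much. The sketch below ranks by absolute value instead. It is an assumption on my part, not the answer's method, and it uses a simulated matrix (`discrim`) and made-up variable names in place of out$discrim and colnames(xx.m) so that it runs without the package.

```r
set.seed(3)                                   # arbitrary seed
# Stand-ins for colnames(xx.m) and out$discrim (401 variables, 4 vectors)
var_names <- paste0("R", seq(400, by = 5, length.out = 401))
discrim <- matrix(rnorm(401 * 4), ncol = 4)

k <- 2                                        # which discriminant vector to rank by
var_coef <- data.frame(variable = var_names, coefficient = discrim[, k])

# Rank by magnitude so that large negative coefficients are kept too
top20 <- var_coef[order(abs(var_coef$coefficient), decreasing = TRUE)[1:20], ]
print(top20)
```

Whether magnitude or signed value is the right criterion depends on how the discriminant vectors are interpreted in your analysis.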

Resources