Feature selection using the penalizedLDA package

I am trying to use the penalizedLDA package to run a penalized linear discriminant analysis in order to select the "most meaningful" variables. I have searched here and on other sites for help in accessing the output from the penalized model, to no avail.
My data comprise 400 variables and 44 groups. The code I used and the results I have so far:
yy.m<-as.matrix(yy) #Factors/groups
xx.m<-as.matrix(xx) #Variables
cv.out<-PenalizedLDA.cv(xx.m,yy.m,type="standard")
## apply the penalty
out <- PenalizedLDA(xx.m,yy.m,lambda=cv.out$bestlambda,K=cv.out$bestK)
To get the structure of the output from the analysis:
> str(out)
List of 10
$ discrim: num [1:401, 1:4] -0.0234 -0.0219 -0.0189 -0.0143 -0.0102 ...
$ xproj : num [1:100, 1:4] -8.31 -14.68 -11.07 -13.46 -26.2 ...
$ K : int 4
$ crits :List of 4
..$ : num [1:4] 2827 2827 2827 2827
..$ : num [1:4] 914 914 914 914
..$ : num [1:4] 162 162 162 162
..$ : num [1:4] 48.6 48.6 48.6 48.6
$ type : chr "standard"
$ lambda : num 0
$ lambda2: NULL
$ wcsd.x : Named num [1:401] 0.0379 0.0335 0.0292 0.0261 0.0217 ...
..- attr(*, "names")= chr [1:401] "R400" "R405" "R410" "R415" ...
$ x : num [1:100, 1:401] 0.147 0.144 0.145 0.141 0.129 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:401] "R400" "R405" "R410" "R415" ...
$ y : num [1:100, 1] 2 2 2 2 2 1 1 1 1 1 ...
- attr(*, "class")= chr "penlda"
I am interested in obtaining a list or matrix of the top 20 variables for feature selection, most likely based on the coefficients of the linear discriminants.
I realized I would have to sort the coefficients in descending order and match the variable names to them. So the output I would expect is something like this imaginary example:
V1 V2
R400 0.34
R1535 0.22...
Can anyone provide any pointers (not necessarily the R code)? Thanks in advance.

Your out$K is 4, and that means you have 4 discriminant vectors. If you want the top 20 variables according to, say, the 2nd vector, try this:
# get the data frame of variable names and coefficients
var.coef = data.frame(colnames(xx.m), out$discrim[,2])
# sort the 2nd column (the coefficients) in decreasing order, and only keep the top 20
var.coef.top = var.coef[order(var.coef[,2], decreasing = TRUE)[1:20], ]
var.coef.top is what you want.
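Sorting with decreasing = TRUE ranks by signed value, so variables with large negative coefficients end up at the bottom. If "most meaningful" is taken to mean largest in magnitude, ranking on the absolute value may be closer to what is wanted. The following is a minimal sketch under that assumption; the small discrim matrix, the variable names, and the helper top_features() are made up for illustration and stand in for out$discrim and colnames(xx.m).

```r
# Hypothetical stand-in for out$discrim: 6 variables x 2 discriminant vectors
discrim <- matrix(c(0.10, -0.80,  0.05, 0.00, 0.40, -0.30,
                    0.20,  0.00, -0.60, 0.10, 0.00,  0.50),
                  ncol = 2)
# Hypothetical variable names: R400, R405, ..., R425
var.names <- paste0("R", seq(400, by = 5, length.out = nrow(discrim)))

# Illustrative helper (not part of penalizedLDA): top n variables of
# discriminant vector k, ranked by absolute coefficient
top_features <- function(discrim, var.names, k = 1, n = 20) {
  coefs <- discrim[, k]
  keep <- which(coefs != 0)  # a penalized fit may set some coefficients to zero
  ord <- keep[order(abs(coefs[keep]), decreasing = TRUE)]
  ord <- head(ord, n)
  data.frame(variable = var.names[ord], coefficient = coefs[ord],
             stringsAsFactors = FALSE)
}

top3 <- top_features(discrim, var.names, k = 1, n = 3)
print(top3)
```

With the real fit, the call would presumably be top_features(out$discrim, colnames(xx.m), k = 2, n = 20), assuming the rows of out$discrim are in the same order as the columns of xx.m.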

Related

Rphylopars: "Error in class(tree) <- "phylo" : attempt to set an attribute on NULL"

I'm trying to compute a phenotypic covariance matrix between a fatty acid dataset and a phylogenetic tree using the Rphylopars package.
I'm able to load the data set and phylogeny; however, when I attempt to run the test I get the error message
Error in class(tree) <- "phylo" : attempt to set an attribute on NULL
This is the code for the test
phy <- read.tree("combined_trees.txt")
plot(phy)
phy$tip.label
FA_data <- read.csv("fatty_acid_example_data.csv", header = TRUE, na.strings = ".")
head(FA_data)
str(FA_data)
PPE <- phylopars(trait_data = FA_data$fatty1_continuous, tree = FA_data$phy)
Not sure what other info will help figure out the issue. The data set and phylogeny loaded without an error.

In the tutorial, the tree and trait data are jointly simulated by the simtraits() function, so both end up as elements of a single list. In your case (which will be typical of real-data cases), the tree and the trait data come from different sources, so most likely you want
PPE <- phylopars(trait_data = FA_data, tree = phy)
provided that FA_data contains a first column named species that matches the tip names in phy, and otherwise only the numeric data you want to use (potentially only the single fatty1_continuous column).
For comparison, the data structure returned by simtraits() looks like this (using str()):
List of 4
$ trait_data:'data.frame': 45 obs. of 5 variables:
..$ species: chr [1:45] "t7" "t8" "t2" "t3" ...
..$ V1 : num [1:45] 1.338 0.308 1.739 2.009 2.903 ...
..$ V2 : num [1:45] -2.002 -0.115 -0.349 -4.452 NA ...
..$ V3 : num [1:45] -1.74 NA 1.09 -2.54 -1.19 ...
..$ V4 : num [1:45] 2.496 2.712 1.198 1.675 -0.117 ...
$ tree :List of 4
..$ edge : int [1:28, 1:2] 29 29 28 28 27 27 26 26 25 25 ...
..$ edge.length: num [1:28] 0.0941 0.0941 0.6233 0.7174 0.0527 ...
..$ Nnode : int 14
..$ tip.label : chr [1:15] "t7" "t8" "t2" "t3" ...
..- attr(*, "class")= chr "phylo"
..- attr(*, "order")= chr "postorder"
...
You can see that simtraits() returns a list containing (among other things) (1) a data frame with species as the first column and the other columns numeric and (2) a phylogenetic tree.
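The error itself arises because FA_data is a data frame with no column called phy, so FA_data$phy evaluates to NULL. Before calling phylopars() it can also help to check that the species column and the tip labels agree. The sketch below uses base R only; the tip labels and trait values are invented stand-ins for phy$tip.label and FA_data.

```r
# Hypothetical stand-ins for phy$tip.label and FA_data
tip_labels <- c("sp_a", "sp_b", "sp_c")
traits <- data.frame(
  species = c("sp_a", "sp_b", "sp_d"),
  fatty1_continuous = c(1.2, 0.8, 2.1),
  stringsAsFactors = FALSE
)

# A data frame has no "phy" column, so this is NULL (the source of the error)
is.null(traits$phy)

# Species present in the data but absent from the tree, and vice versa
missing_from_tree <- setdiff(traits$species, tip_labels)
missing_from_data <- setdiff(tip_labels, traits$species)

print(missing_from_tree)
print(missing_from_data)
```

Both vectors being empty would indicate that the names line up.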

Produce a graphic tree diagram showing the structure of an R object

In R, str() is handy for showing the structure of an object, such as the list of lists returned by lm() and other modelling functions, but it gives way too much output. I'm looking for some tool to create a simple tree diagram showing only the names of the list elements and their structure.
e.g., for this example,
data(Prestige, package="car")
out <- lm(prestige ~ income+education+women, data=Prestige)
str(out, max.level=2)
#> List of 12
#> $ coefficients : Named num [1:4] -6.79433 0.00131 4.18664 -0.00891
#> ..- attr(*, "names")= chr [1:4] "(Intercept)" "income" "education" "women"
#> $ residuals : Named num [1:102] 4.58 -9.39 4.69 4.22 8.15 ...
#> ..- attr(*, "names")= chr [1:102] "gov.administrators" "general.managers" "accountants" "purchasing.officers" ...
#> $ effects : Named num [1:102] -472.99 -123.61 -92.61 -2.3 6.83 ...
#> ..- attr(*, "names")= chr [1:102] "(Intercept)" "income" "education" "women" ...
#> $ rank : int 4
#> $ fitted.values: Named num [1:102] 64.2 78.5 58.7 52.6 65.3 ...
#> ..- attr(*, "names")= chr [1:102] "gov.administrators" "general.managers" "accountants" "purchasing.officers" ...
#> $ assign : int [1:4] 0 1 2 3
#> $ qr :List of 5
#> ..$ qr : num [1:102, 1:4] -10.1 0.099 0.099 0.099 0.099 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> .. ..- attr(*, "assign")= int [1:4] 0 1 2 3
#> ..$ qraux: num [1:4] 1.1 1.44 1.06 1.06
#> ..$ pivot: int [1:4] 1 2 3 4
#> ..$ tol : num 1e-07
#> ..$ rank : int 4
#> ..- attr(*, "class")= chr "qr"
#> $ df.residual : int 98
...
I would like to get something like this (the original post included an image of a tree diagram here).
This is similar to what I get from tree for file folders in my file system:
C:\Dropbox\Documents\images>tree
Folder PATH listing
Volume serial number is 2250-8E6F
C:.
+---cartoons
+---chevaliers
+---icons
+---milestones
+---minard
+---minard-besancon
The result could be either in graphic characters, as in tree, or an actual graphic as shown above. Is anything like this available?

A simple approach to getting this from the str output would be something like...
a <- capture.output(str(out, max.level=2))
a <- trimws(gsub("\\:.*", "", a[grepl("\\$", a)]))
cat(a, sep="\n")
$ coefficients
$ residuals
$ effects
$ rank
$ fitted.values
$ assign
$ qr
..$ qr
..$ qraux
..$ pivot
..$ tol
..$ rank
$ df.residual
$ xlevels
$ call
$ terms
$ model
..$ prestige
..$ income
..$ education
..$ women
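An alternative to parsing the text of str() is to walk the list recursively and print only the element names, in the style of the tree command. This is a rough sketch in base R; name_tree() is a hypothetical helper written for this illustration, and the lm fit on the built-in cars data is used only so the example needs no extra packages.

```r
# Illustrative helper: return one line per list element, indented by depth
name_tree <- function(x, max.level = 2, level = 1, prefix = "") {
  # Stop at non-lists, empty lists, or once the depth limit is exceeded
  if (!is.list(x) || length(x) == 0 || level > max.level) return(character(0))
  nms <- names(x)
  if (is.null(nms)) nms <- rep("", length(x))
  nms[is.na(nms)] <- ""
  unnamed <- which(nms == "")
  if (length(unnamed) > 0) nms[unnamed] <- paste0("[[", unnamed, "]]")
  out <- character(0)
  for (i in seq_along(x)) {
    out <- c(out, paste0(prefix, "+---", nms[i]))
    out <- c(out, name_tree(x[[i]], max.level, level + 1,
                            paste0(prefix, "|   ")))
  }
  out
}

obj <- list(a = 1, b = list(c = 2, d = list(e = 3)))
cat(name_tree(obj, max.level = 2), sep = "\n")

fit <- lm(dist ~ speed, data = cars)
cat(name_tree(fit, max.level = 2), sep = "\n")
```

Because a data frame is itself a list, the model component expands into its columns, much as in the str()-based output above.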

Using apply over two lists of different lengths

This question is related to my earlier question found here: https://stackoverflow.com/questions/33089532/r-accounting-for-a-factor-with-this-logistic-regression-function-replace-lappl
I realize that I didn't do a good job of asking the first question, so here is a simpler analogue with actual data:
My data looks something like this:
# data look like this, but with a variable number of "y" columns
wk<-rep(1:50,2)
X<-rnorm(100,1)
y1<-rnorm(100,1)
y2<-rnorm(100,1)
df1<-as.data.frame(cbind(wk,X,y1,y2))
df1$hyst<-ifelse(df1$wk>=5 & df1$wk<32, "R", "F")
Y<-df1[, -which(colnames(df1) %in% c("wk"))] #this step makes more sense with my actual data since I have a bunch of columns to remove
l1<-length(Y)-1
lst1<-lapply(2:l1,function(x){colnames(Y[x])})
dflst<-c("Y",'Y[Y$hyst=="R",]','Y[Y$hyst=="F",]')
I want to run a model over all Y columns for the full data set (all data) and for two subsets, when the factor hyst=="R" and when hyst=="F".
To do this, I have nested two lapply functions, which sort of works, but I think it essentially doubles my results and is causing me all sorts of list headaches.
Here is the nested lapply code:
lms <- lapply(dflst, function(z){
lapply(lst1, function(y) {
form <- paste0(y, " ~ X")
lm(form, data=eval(parse(text=z)))
})
})
How can I replace or modify the nested lapply function to obtain a model run for each Y column for each data set (all, "R", and "F")?

Construct your list of data frames like this:
DFlst <- c(list(full=Y), split(Y, Y$hyst))
str(DFlst)
List of 3
$ full:'data.frame': 100 obs. of 4 variables:
..$ X : num [1:100] 1.792 3.192 0.367 1.632 1.388 ...
..$ y1 : num [1:100] 3.354 1.189 1.99 0.639 0.1 ...
..$ y2 : num [1:100] 0.864 2.415 0.437 1.069 1.368 ...
..$ hyst: chr [1:100] "F" "F" "F" "F" ...
$ F :'data.frame': 46 obs. of 4 variables:
..$ X : num [1:46] 1.792 3.192 0.367 1.632 0.707 ...
..$ y1 : num [1:46] 3.354 1.189 1.99 0.639 0.894 ...
..$ y2 : num [1:46] 0.864 2.415 0.437 1.069 1.213 ...
..$ hyst: chr [1:46] "F" "F" "F" "F" ...
$ R :'data.frame': 54 obs. of 4 variables:
..$ X : num [1:54] 1.388 2.296 0.409 1.494 0.943 ...
..$ y1 : num [1:54] 0.1002 0.6425 -0.0918 1.199 0.8767 ...
..$ y2 : num [1:54] 1.368 1.122 0.402 -0.237 1.518 ...
..$ hyst: chr [1:54] "R" "R" "R" "R" ...
Do some regressions:
res <- lapply(DFlst, function(DF) {
cols = grep("^y[0-9]+$",names(DF),value=TRUE)
lapply(setNames(cols,cols),
function(y) lm(paste(y,"~X"), data=DF))
})
str(res, list.len=2, give.attr=FALSE)
List of 3
$ full:List of 2
..$ y1:List of 12
.. ..$ coefficients : Named num [1:2] 0.903 0.111
.. ..$ residuals : Named num [1:100] 2.2509 -0.0698 1.046 -0.4464 -0.9578 ...
.. .. [list output truncated]
..$ y2:List of 12
.. ..$ coefficients : Named num [1:2] 1.423 -0.166
.. ..$ residuals : Named num [1:100] -0.2623 1.5213 -0.9253 -0.0837 0.1751 ...
.. .. [list output truncated]
$ F :List of 2
..$ y1:List of 12
.. ..$ coefficients : Named num [1:2] 0.9289 0.0769
.. ..$ residuals : Named num [1:46] 2.2871 0.0146 1.0332 -0.4157 -0.0889 ...
.. .. [list output truncated]
..$ y2:List of 12
.. ..$ coefficients : Named num [1:2] 1.4177 -0.0789
.. ..$ residuals : Named num [1:46] -0.413 1.25 -0.952 -0.22 -0.149 ...
.. .. [list output truncated]
[list output truncated]
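Once the models are stored in a named two-level list like this, summaries can be pulled out with another pair of nested sapply calls. The sketch below rebuilds a comparable res from simulated data (the seed, the even F/R split, and the use of reformulate() are choices made for this illustration, not taken from the question) and collects the slope on X into a matrix.

```r
set.seed(1)
# Simulated stand-in for Y; the 50/50 split of hyst is an assumption
Y <- data.frame(X = rnorm(100, 1), y1 = rnorm(100, 1), y2 = rnorm(100, 1),
                hyst = rep(c("F", "R"), each = 50),
                stringsAsFactors = FALSE)

DFlst <- c(list(full = Y), split(Y, Y$hyst))

res <- lapply(DFlst, function(DF) {
  cols <- grep("^y[0-9]+$", names(DF), value = TRUE)
  lapply(setNames(cols, cols),
         function(y) lm(reformulate("X", response = y), data = DF))
})

# Rows are response columns, columns are data sets
slopes <- sapply(res, function(fits) {
  sapply(fits, function(fit) coef(fit)[["X"]])
})
print(slopes)
```

A single model remains reachable by name, for example res$R$y2.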

How to get fitted values from ar() method model in R

I want to retrieve the fitted values from a model returned by the ar() function in R. When using the Arima() function, I get them with fitted(model.object), but I cannot find an equivalent for ar().

The ar object does not store a fitted vector, but it does store the residuals. Here is an example of using the residuals from the ar object to reconstruct the fitted values from the original data:
data(WWWusage)
arf <- ar(WWWusage)
str(arf)
#====================
List of 14
$ order : int 3
$ ar : num [1:3] 1.175 -0.0788 -0.1544
$ var.pred : num 117
$ x.mean : num 137
$ aic : Named num [1:21] 258.822 5.787 0.413 0 0.545 ...
..- attr(*, "names")= chr [1:21] "0" "1" "2" "3" ...
$ n.used : int 100
$ order.max : num 20
$ partialacf : num [1:20, 1, 1] 0.9602 -0.2666 -0.1544 -0.1202 -0.0715 ...
$ resid : Time-Series [1:100] from 1 to 100: NA NA NA -2.65 -4.19 ...
$ method : chr "Yule-Walker"
$ series : chr "WWWusage"
$ frequency : num 1
$ call : language ar(x = WWWusage)
$ asy.var.coef: num [1:3, 1:3] 0.01017 -0.01237 0.00271 -0.01237 0.02449 ...
- attr(*, "class")= chr "ar"
#===================
str(WWWusage)
# Time-Series [1:100] from 1 to 100: 88 84 85 85 84 85 83 85 88 89 ...
png(); plot(WWWusage)
lines(seq(WWWusage),WWWusage - arf$resid, col="red"); dev.off()
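The same subtraction can be kept as a vector rather than only plotted. As a sanity check, the sketch below also recomputes one fitted value by hand from the stored coefficients, on the assumption that the Yule-Walker fit predicts the mean-centred series from its previous order values; the first order fitted values are NA because no residual is available there.

```r
arf <- ar(WWWusage)

# Fitted values = observed series minus residuals
fitted_ar <- WWWusage - arf$resid

# Hand computation of the first available fitted value (time p + 1)
p <- arf$order
lags <- as.numeric(WWWusage)[p:1]  # x[p], x[p-1], ..., x[1]
manual <- arf$x.mean + sum(arf$ar * (lags - arf$x.mean))

head(fitted_ar, 6)
c(from_resid = as.numeric(fitted_ar[p + 1]), by_hand = manual)
```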

The simplest way to get the fits from an AR(p) model would be to use auto.arima() from the forecast package, which does have a fitted() method. If you really want a pure AR model, you can constrain the differencing via the d parameter and the MA order via the max.q parameter.
> library(forecast)
> fitted(auto.arima(WWWusage,d=0,max.q=0))
Time Series:
Start = 1
End = 100
Frequency = 1
[1] 91.68778 86.20842 82.13922 87.60576 ...

How to access parts of a list in R

I've got the optim function in R returning a list of results like this:
[[354]]
r k sigma
389.4 354.0 354.0
but when I try accessing, say, list$sigma, it doesn't exist and returns NULL.
I've tried attach, I've tried names, and I've tried assigning it to a matrix, but none of these worked.
Does anyone have any idea how I can access the lowest or highest value of sigma, r or k in my list?
Many thanks!
str gives me this output:
List of 354
$ : Named num [1:3] -55.25 2.99 119.37
..- attr(*, "names")= chr [1:3] "r" "k" "sigma"
$ : Named num [1:3] -53.91 4.21 119.71
..- attr(*, "names")= chr [1:3] "r" "k" "sigma"
$ : Named num [1:3] -41.7 14.6 119.2
So I've got a double within a list within a list (?). I'm still mystified as to how I can cycle through the list and pick out an element meeting my conditions without writing a function from scratch.

The key issue is that you have a list of lists (or a list of data.frames, which in fact is also a list).
To confirm this, take a look at is(list[[354]]).
The solution is simply to add an additional level of indexing. Below you have multiple alternatives of how to accomplish this.
You can use a vector as an index to [[, so for example if you want to access the third element of the 354th element, you can use
myList[[ c(354, 3) ]]
You can also use character indices; however, all nested levels must then have names.
names(myList) <- as.character(seq_along(myList))
myList[[ c("5", "sigma") ]]
Lastly, please try to avoid using names like list, data, df, etc. for your own objects. This leads to broken code and errors that seem inexplicable until one realizes that one has tried to subset a function.
Edit:
In response to your question in the comments above: if you want to see the structure of an object (i.e. the "makeup" of the object), use str
> str(myList)
List of 5
$ :'data.frame': 1 obs. of 3 variables:
..$ a : num 0.654
..$ b : num -0.0823
..$ sigma: num -31
$ :'data.frame': 1 obs. of 3 variables:
..$ a : num -0.656
..$ b : num -0.167
..$ sigma: num -49
$ :'data.frame': 1 obs. of 3 variables:
..$ a : num 0.154
..$ b : num 0.522
..$ sigma: num -89
$ :'data.frame': 1 obs. of 3 variables:
..$ a : num 0.676
..$ b : num 0.595
..$ sigma: num 145
$ :'data.frame': 1 obs. of 3 variables:
..$ a : num -0.75
..$ b : num 0.772
..$ sigma: num 6

If you want, for example, all the sigmas, you can use sapply:
sapply(list, function(x)x["sigma"])
You can use that to find the minimum and maximum:
range(sapply(list, function(x)x["sigma"]))
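To go from the extreme value back to the whole parameter set that produced it, which.min() or which.max() can be applied to the extracted sigmas. The following sketch uses three parameter vectors with values copied from the str() output in the question as illustrative data; the names fits, sigmas, best and params are invented here.

```r
# Illustrative data: three named parameter vectors as in the str() output
fits <- list(c(r = -55.25, k = 2.99, sigma = 119.37),
             c(r = -53.91, k = 4.21, sigma = 119.71),
             c(r = -41.7,  k = 14.6, sigma = 119.2))

# One sigma per list element; vapply enforces a single numeric each
sigmas <- vapply(fits, function(p) p[["sigma"]], numeric(1))
range(sigmas)

# The full parameter vector with the smallest sigma
best <- fits[[which.min(sigmas)]]
print(best)

# Alternatively, stack everything into a matrix with one row per element
params <- do.call(rbind, fits)
print(params[which.max(params[, "sigma"]), ])
```

The matrix form is convenient when several columns need to be filtered at once, for example params[params[, "k"] > 4, ].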

Using do.call, you can do this:
do.call('[[', list(mylist, 354))['sigma']
