How to analyze irregular results with discriminant analysis in R?
I have learned how to use the lda() function in R to analyze regular results like this
(X1 and X2 are the predictor variables; G is the class label):
X1 X2 G
2.95 6.63 1
2.53 7.79 1
3.57 5.65 1
3.16 5.47 2
2.16 6.22 2
Now my question is: how do I analyze the data if G holds irregular (continuous) results such as 1.2, 2.3, 1.6, ...?
I have also learned how to get the scores from LDA, but I cannot get them from QDA.
I have read the help page for predict.qda and cannot find a score option. Is this impossible with QDA?
This is my code:
C <- read.table("clipboard", header=TRUE)
attach(C)                 # the original attach(T) refers to TRUE, not the data frame C
library(MASS)
ld <- lda(G ~ X1 + X2)    # the column names are X1 and X2, not x1 and x2
Z <- predict(ld)
newG <- Z$class
cbind(G, Z$x, newG)       # Z$x holds the LDA discriminant scores
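Two points worth noting, as a minimal sketch rather than a full answer. First, MASS's predict.qda returns only class and posterior components, so there is no $x score matrix to extract as there is for LDA: the LDA scores come from its linear discriminants, which QDA does not have. Second, lda() and qda() require a categorical grouping variable, so continuous G values such as 1.2, 2.3, 1.6 would have to be binned first; cut() below is just one assumed way to do that, and this assumes a realistically sized dataset (qda() refuses groups smaller than the number of predictors, so it would fail on the five-row toy data above):

qd <- qda(G ~ X1 + X2, data = C)
str(predict(qd))                  # components: class and posterior only; no $x scores

# If G is continuous, discretize it before fitting (3 bins is purely illustrative):
C$Gclass <- cut(C$G, breaks = 3)
ld2 <- lda(Gclass ~ X1 + X2, data = C)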
Related
Read functions as text and use for plotting
I have a set of 500 equations listed in a single column of a .csv file. The equations are written as text, like this (for example):

15+6.2*A-4.3*B+3.7*C-7.9*B*C+2*D^2

This is the right-hand side of the equation, which equals "y", but the text "y=" does not appear in the .csv file. These are general linear models that have been written to a .csv file by someone else. Not all models have the same number of variables. I would like to read these functions into R and format them in a way that will allow using them to (iteratively) make simple line plots (one for each of the n = 500 models) of "y" across a range of values for A (shown on the x-axis), given values of B, C, and D. Does anyone have any suggestions for how to do this?
I thought of something based on this [post][1]; it is not the best solution, but it seems to work.

Equations

Create two equations as an example:

models <- c("15+6.2*A-4.3*B+3.7*C-7.9*B*C+2*D^2",
            "50+6.2*A-4.3*B+3.7*C-7.9*B*C+2*D^2")
models_names <- c("model1", "model2")

Data

Random data as an example (this needs the tidyverse packages):

library(tidyverse)   # for tibble, map, set_names, bind_rows, bind_cols

data <- tibble(
  A = rnorm(100),
  B = rnorm(100),
  C = rnorm(100),
  D = rnorm(100)
)

Function

Then I created a function that turns each text equation into an R function and returns the evaluated values:

text_model <- function(formula){
  eval(parse(text = paste('f <- function(A,B,C,D) { return(', formula, ')}', sep='')))
  out <- f(data$A, data$B, data$C, data$D)
  return(out)
}

Applied equations

Finally, I apply each equation to the data, binding both:

data %>%
  bind_cols(
    map(.x = models, .f = text_model) %>%
      set_names(models_names) %>%
      bind_rows(.id = "model")
  )

# A tibble: 100 x 6
          A       B      C       D model1 model2
      <dbl>   <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
 1 -0.0633   1.18   -0.409  2.01     9.52   54.9
 2 -0.00207  1.35    1.28   1.59     9.16   40.3
 3  0.798   -0.141   1.58  -0.123   20.6    63.2
 4 -0.162   -0.0795  0.408  0.663   14.3    52.0
 5 -1.11     0.788  -1.37   1.20     4.71   46.0
 6  2.80     1.84   -0.850  0.161   24.4    68.7
 7  1.03     0.550   0.907 -1.92    19.0    60.8
 8  0.515   -0.179  -0.980  0.0437  19.0    48.9
 9 -0.353    0.0643  1.39   1.30    12.5    55.3
10 -0.427   -1.01   -1.11  -0.547   16.7    39.3
# ... with 90 more rows
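To get the line plots of y against A that the question asks for, here is a minimal base-R sketch building on the same parse/eval idea; the range for A and the fixed values of B, C, and D are assumptions chosen purely for illustration:

A_grid <- seq(-3, 3, length.out = 100)   # assumed plotting range for A
for (i in seq_along(models)) {
  # turn the text equation into a function of A, B, C, D
  f <- eval(parse(text = paste0("function(A, B, C, D) ", models[i])))
  y <- f(A_grid, B = 0, C = 0, D = 0)    # B, C, D held at illustrative fixed values
  plot(A_grid, y, type = "l", main = models_names[i], xlab = "A", ylab = "y")
}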
Why can't I use cv.glm on the output of bestglm?
I am trying to do best subset selection on the wine dataset, and then I want to get the test error rate using 10-fold CV. The code I used is:

cost1 <- function(good, pi=0) mean(abs(good-pi) > 0.5)
res.best.logistic <-
    bestglm(Xy = winedata,
            family = binomial,      # binomial family for logistic regression
            IC = "AIC",             # information criterion
            method = "exhaustive")
res.best.logistic$BestModels
best.cv.err <- cv.glm(winedata, res.best.logistic$BestModel, cost1, K=10)

However, this gives the error:

Error in UseMethod("family") :
  no applicable method for 'family' applied to an object of class "NULL"

I thought that $BestModel is the lm object that represents the best fit, and that is what the manual also says. If that is the case, then why can't I find the test error on it using 10-fold CV, with the help of cv.glm?

The dataset used is the white wine dataset from https://archive.ics.uci.edu/ml/datasets/Wine+Quality, and the packages used are the boot package for cv.glm and the bestglm package. The data was processed as:

winedata <- read.delim("winequality-white.csv", sep = ';')
winedata$quality[winedata$quality <  7] <- "0"             # recode
winedata$quality[winedata$quality >= 7] <- "1"             # recode
winedata$quality <- factor(winedata$quality)               # convert the column to a factor
names(winedata)[names(winedata) == "quality"] <- "good"    # rename 'quality' to 'good'
A bestglm fit rearranges your data and renames your response variable to y, so if you pass it back into cv.glm, winedata does not have a column y and everything crashes after that.

It is always good to check the class:

class(res.best.logistic$BestModel)
[1] "glm" "lm"

But if you look at the call of res.best.logistic$BestModel:

res.best.logistic$BestModel$call
glm(formula = y ~ ., family = family, data = Xi, weights = weights)

head(res.best.logistic$BestModel$model)
  y fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
1 0           7.0             0.27        0.36           20.7     0.045
2 0           6.3             0.30        0.34            1.6     0.049
3 0           8.1             0.28        0.40            6.9     0.050
4 0           7.2             0.23        0.32            8.5     0.058
5 0           7.2             0.23        0.32            8.5     0.058
6 0           8.1             0.28        0.40            6.9     0.050
  free.sulfur.dioxide density   pH sulphates
1                  45  1.0010 3.00      0.45
2                  14  0.9940 3.30      0.49
3                  30  0.9951 3.26      0.44
4                  47  0.9956 3.19      0.40
5                  47  0.9956 3.19      0.40
6                  30  0.9951 3.26      0.44

You could substitute things in the call, etc., but it is too much of a mess. Fitting is not costly, so refit the best model on winedata and pass that to cv.glm:

best_var <- apply(res.best.logistic$BestModels[,-ncol(winedata)], 1, which)  # variable names of the best model
best_var <- names(best_var[[1]])
new_form <- as.formula(paste("good ~", paste(best_var, collapse="+")))
fit <- glm(new_form, winedata, family="binomial")
best.cv.err <- cv.glm(winedata, fit, cost1, K=10)
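As a quick sanity check (an added sketch, not part of the original answer), the refit should reproduce the coefficients of the model bestglm selected, up to the response being renamed to y internally:

coef(fit)
coef(res.best.logistic$BestModel)   # the estimates should agree, though the ordering may differ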
Conditional density distribution, two discrete variables
I have plotted the conditional density distribution of my variables by using cdplot (R). My independent variable and my dependent variable are not independent. The independent variable is discrete (it takes only certain values between 0 and 3) and the dependent variable is also discrete (11 levels, from 0 to 1 in steps of 0.1). Some data:

dat <- read.table(text="y    x
3.00 0.0
2.75 0.0
2.75 0.1
2.75 0.1
2.75 0.2
2.25 0.2
3    0.3
2    0.3
2.25 0.4
1.75 0.4
1.75 0.5
2    0.5
1.75 0.6
1.75 0.6
1.75 0.7
1    0.7
0.54 0.8
0    0.8
0.54 0.9
0    0.9
0    1.0
0    1.0", header=TRUE, colClasses="factor")

I wonder if my variables are appropriate to run this kind of analysis. Also, I would like to know how to report these results in an elegant way, with academic and statistical sense.
This is a run using the rms package's lrm function, which is typically used for binary outcomes but also handles ordered categorical variables:

library(rms)   # also loads Hmisc

# first get the data in the form you described
dat[] <- lapply(dat, ordered)   # makes both columns ordered factor variables
?lrm   # read the help page ... also look at the supporting book and citations on that page

lrm(y ~ x, data=dat)

# --- output ------
Logistic Regression Model

lrm(formula = y ~ x, data = dat)

Frequencies of Responses

   0 0.54    1 1.75    2 2.25 2.75    3 3.00
   4    2    1    5    2    2    4    1    1

                      Model Likelihood      Discrimination      Rank Discrim.
                         Ratio Test            Indexes             Indexes
Obs            22    LR chi2      51.66    R2          0.920    C       0.869
max |deriv| 0.0004   d.f.            10    g          20.742    Dxy     0.738
                     Pr(> chi2) <0.0001    gr 1019053402.761    gamma   0.916
                                           gp          0.500    tau-a   0.658
                                           Brier       0.048

          Coef     S.E.     Wald Z Pr(>|Z|)
y>=0.54   41.6140 108.3624   0.38  0.7010
y>=1      31.9345  88.0084   0.36  0.7167
y>=1.75   23.5277  74.2031   0.32  0.7512
y>=2       6.3002   2.2886   2.75  0.0059
y>=2.25    4.6790   2.0494   2.28  0.0224
y>=2.75    3.2223   1.8577   1.73  0.0828
y>=3       0.5919   1.4855   0.40  0.6903
y>=3.00   -0.4283   1.5004  -0.29  0.7753
x        -19.0710  19.8718  -0.96  0.3372
x=0.2      0.7630   3.1058   0.25  0.8059
x=0.3      3.0129   5.2589   0.57  0.5667
x=0.4      1.9526   6.9051   0.28  0.7773
x=0.5      2.9703   8.8464   0.34  0.7370
x=0.6     -3.4705  53.5272  -0.06  0.9483
x=0.7    -10.1780  75.2585  -0.14  0.8924
x=0.8    -26.3573 109.3298  -0.24  0.8095
x=0.9    -24.4502 109.6118  -0.22  0.8235
x=1      -35.5679 488.7155  -0.07  0.9420

There is also the MASS::polr function, but I find Harrell's version more approachable. This could also be approached with rank regression; the quantreg package is pretty standard if that were the route you chose. Looking at your other question, I wondered if you had tried a logistic transform as a method of linearizing that relationship. Of course, the illustrated use of lrm with an ordered variable is a logistic transformation "under the hood".
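Since MASS::polr is mentioned as an alternative, here is a minimal sketch of the corresponding proportional-odds fit. It assumes dat has already been converted to ordered factors as above, and with this few observations the fit may warn about convergence:

library(MASS)
fit_polr <- polr(y ~ x, data = dat, Hess = TRUE)   # Hess=TRUE stores the Hessian so summary() can report SEs
summary(fit_polr)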
Apply model coefficients to new data
I have two matrices, sub and macro_data. They contain the estimated coefficients of a model and the macro data, respectively:

> sub
  coeff   varname
1 -1.50       gdp
2 0.005 inflation
3  -2.4  constant

> macro_data
   gdp inflation
1 18.0     -0.17
2 15.8     -0.14
3 17.7     -0.15

I would like to apply the following formula in order to get the scores:

-1.5*gdp + 0.005*inflation - 2.4

I have tried:

for (i in 1:1){
  sub$coeff[i]*macro_data[,1] + sub$coeff[i+1]*macro_data[,sub$coeff[i+1]] + sub$coeff[i+2]
}

Actually it works, but this is not the best solution, because I would like something general. Any idea?
You can do a matrix multiplication:

cbind(macro_data, 1) %*% sub[, "coeff", drop=FALSE]

If your coefficients come from estimating a model, then normally the model's predict() method can take a newdata= argument to calculate estimates for new data. For your example data this won't work, because you have data frames. This will do:

sub <- read.table(header=TRUE, text=
"coeff varname
-1.50  gdp
0.005  inflation
-2.4   constant")

macro_data <- read.table(header=TRUE, text=
" gdp inflation
1 18.0 -0.17
2 15.8 -0.14
3 17.7 -0.15")

m <- cbind(macro_data, constant=1)
C <- sub$coeff
names(C) <- sub$varname

m$gdp*C["gdp"] + m$inflation*C["inflation"] + m$constant*C["constant"]

The last line can be shortened to:

as.matrix(m) %*% C[names(m)]
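For the fully general version the question asks for, a small helper along these lines may do; apply_coefs is a hypothetical name, and it assumes the intercept row of sub is labelled "constant":

apply_coefs <- function(sub, newdata) {
  newdata$constant <- 1                               # add the intercept column
  # align data columns to coefficients by variable name, then multiply
  as.matrix(newdata[, as.character(sub$varname)]) %*% sub$coeff
}
apply_coefs(sub, macro_data)   # same scores as above, for any number of variables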
Find where species accumulation curve reaches asymptote
I have used the specaccum() command to develop species accumulation curves for my samples. Here is some example data:

site1 <- c(0,8,9,7,0,0,0,8,0,7,8,0)
site2 <- c(5,0,9,0,5,0,0,0,0,0,0,0)
site3 <- c(5,0,9,0,0,0,0,0,0,6,0,0)
site4 <- c(5,0,9,0,0,0,0,0,0,0,0,0)
site5 <- c(5,0,9,0,0,6,6,0,0,0,0,0)
site6 <- c(5,0,9,0,0,0,6,6,0,0,0,0)
site7 <- c(5,0,9,0,0,0,0,0,7,0,0,3)
site8 <- c(5,0,9,0,0,0,0,0,0,0,1,0)
site9 <- c(5,0,9,0,0,0,0,0,0,0,1,0)
site10 <- c(5,0,9,0,0,0,0,0,0,0,1,6)
site11 <- c(5,0,9,0,0,0,5,0,0,0,0,0)
site12 <- c(5,0,9,0,0,0,0,0,0,0,0,0)
site13 <- c(5,1,9,0,0,0,0,0,0,0,0,0)

species_counts <- rbind(site1,site2,site3,site4,site5,site6,site7,
                        site8,site9,site10,site11,site12,site13)

accum <- specaccum(species_counts, method="random", permutations=100)
plot(accum)

In order to ensure I have sampled sufficiently, I need to make sure the curve of the species accumulation plot reaches an asymptote, defined as a slope of <0.3 between the last two points (i.e. between sites 12 and 13).

results <- with(accum, data.frame(sites, richness, sd))

produces this:

   sites richness        sd
1      1     3.46 0.9991916
2      2     4.94 1.6625403
3      3     5.94 1.7513054
4      4     7.05 1.6779918
5      5     8.03 1.6542263
6      6     8.74 1.6794660
7      7     9.32 1.5497149
8      8     9.92 1.3534841
9      9    10.51 1.0492422
10    10    11.00 0.8408750
11    11    11.35 0.7017295
12    12    11.67 0.4725816
13    13    12.00 0.0000000

I feel like I'm getting there. I could fit an lm of richness against sites and extract the exact slope (tangent?) between sites 12 and 13. Going to search a bit longer here.
Streamlining your data-generation process a little bit:

species_counts <- matrix(c(0,8,9,7,0,0,0,8,0,7,8,0,
                           5,0,9,0,5,0,0,0,0,0,0,0,
                           5,0,9,0,0,0,0,0,0,6,0,0,
                           5,0,9,0,0,0,0,0,0,0,0,0,
                           5,0,9,0,0,6,6,0,0,0,0,0,
                           5,0,9,0,0,0,6,6,0,0,0,0,
                           5,0,9,0,0,0,0,0,7,0,0,3,
                           5,0,9,0,0,0,0,0,0,0,1,0,
                           5,0,9,0,0,0,0,0,0,0,1,0,
                           5,0,9,0,0,0,0,0,0,0,1,6,
                           5,0,9,0,0,0,5,0,0,0,0,0,
                           5,0,9,0,0,0,0,0,0,0,0,0,
                           5,1,9,0,0,0,0,0,0,0,0,0),
                         byrow=TRUE, nrow=13)

It is always a good idea to set.seed() before running randomization tests (and to let us know that specaccum is in the vegan package):

set.seed(101)
library(vegan)
accum <- specaccum(species_counts, method="random", permutations=100)

Extract the richness and sites components from the returned object and compute d(richness)/d(sites). Note that the slope vector is one element shorter than the original sites/richness vectors: be careful if you are trying to match up slopes with particular numbers of sites.

(slopes <- with(accum, diff(richness)/diff(sites)))
## [1] 1.45 1.07 0.93 0.91 0.86 0.66 0.65 0.45 0.54 0.39 0.32 0.31

In this case the slope never actually goes below 0.3, so this code for finding the first time that the slope falls below 0.3:

which(slopes < 0.3)[1]

returns NA.
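If a formal asymptote estimate would help beyond the slope cut-off, one option (an added sketch, not part of the original answer; whether nls converges depends on the data) is to fit a self-starting asymptotic regression with stats::SSasymp and read off the predicted plateau:

results <- with(accum, data.frame(sites, richness))
fit <- nls(richness ~ SSasymp(sites, Asym, R0, lrc), data = results)
coef(fit)["Asym"]   # estimated asymptotic richness

# SSasymp fits richness = Asym + (R0 - Asym) * exp(-exp(lrc) * sites),
# so the fitted slope at each site count is:
with(as.list(coef(fit)), (Asym - R0) * exp(lrc) * exp(-exp(lrc) * results$sites))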