I have a dataset, df, with one column where weights are present as below as a .csv file:
Outcome,Heat,Mobility,Time,weights
Good,125,0.2,9,2
Neutral,250,0.5,10,2
Bad,12,1.6,1,3
Good,162,0.1,9,1
Good,150,0.3,9,1
Bad,8,5.2,2,4
Neutral,330,0.2,12,3
Neutral,200,0.6,8,1
Bad,50,12,4,3
Good,130,0.9,10,4
I usually begin a PCA analysis with prcomp(df[,2:4]), but there doesn't seem to be any option to add the weights.
I tried prcomp(df[,2:4], scale. = as.numeric(unlist(df[5]))), but that gave errors stating that the number of columns provided was not suitable. Is there a way to add the weights associated with each row here, somehow?
Also, how do I go about cross-validating the model I generate here using the "leave-one-out" approach?
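One possible workaround (just a sketch, and "data.csv" here is a placeholder for the actual file): prcomp() has no weights argument, but a weighted covariance matrix from cov.wt() can be passed to princomp() via its covmat argument.
#sketch: weighted PCA via a weighted covariance matrix, since prcomp() has no weights argument
df <- read.csv("data.csv")                     #placeholder path for the file shown above
wcov <- cov.wt(df[, 2:4], wt = df$weights)     #weighted covariance, weights taken from the last column
wpca <- princomp(covmat = wcov, cor = TRUE)    #cor = TRUE rescales to correlations, like scaling in prcomp
summary(wpca)                                  #variances and loadings (no scores when only covmat is supplied)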
I am using a random forest model. I want to compare the outcomes with OLS, and mainly to see whether and how the contributions of individual variables differ between the two. So I used partial dependence plots to see the effect of each variable. However, this still does not give me a clear coefficient.
Is there another way to extract coefficients, or is it possible to extract a coefficient from the PDP?
I have tried different ways to get at the underlying data of the PDP, for example by creating simulated data in which only one variable differs, to see how the predictions change.
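A rough sketch of the kind of thing I mean (rf_fit, x1, and df_train below are placeholders for my actual model, predictor, and training data):
#placeholder sketch: read a rough "coefficient" off the partial dependence curve
library(randomForest)
library(pdp)
pd <- partial(rf_fit, pred.var = "x1", train = df_train)   #data frame behind the PDP: columns x1 and yhat
coef(lm(yhat ~ x1, data = pd))["x1"]                        #slope of the PD curve as a rough linear effect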
First, I gathered from this link Applying a function to multiple columns that using the "function" function would perhaps do what I'm looking for. However, I have not been able to make the leap from thinking about it in the way presented there to making it actually work in my situation (or really even knowing where to start). I'm a beginner in R, so I apologize in advance if this is a really "newb" question.

My data is a data frame that consists of an event variable (tumor recurrence) and a time variable (follow-up time / time to recurrence), as well as recurrence risk factors (t-stage, tumor size, age at dx, etc.). Some risk factors are categorical and some are continuous.

I have been running my univariate analyses by hand, one at a time, like this example: univariateageatdx <- coxph(survobj ~ agedx), and then collecting the data. This gets very tedious for multiple factors, and for a few different recurrence types. I figured there must be a way to write one line of code containing the coxph equation that could be applied to all of my variables of interest and spit out the univariate analysis results for each factor.

I tried using cbind to bind variables (i.e. x <- cbind("agedx", "tumor size")) and then running coxph(recurrencesurvobj ~ x), but this of course just did a multivariate analysis on these variables and didn't split them out as true univariate analyses.
I also tried the following code based on a similar problem that I found on a different site, but it gave the error shown and I don't know quite what to make of it. Is this on the right track?
f <- as.formula(paste('regionalsurvobj ~', paste(colnames(nodcistradmasvssubcutmasR)[6-9], collapse = '+')))
#note: [6-9] evaluates to [-3] (drop the 3rd name); [6:9] would select columns 6 through 9
I then ran it as coxph(f).
This gave me the results of a multivariate Cox analysis.
Thanks!
**Edit:** I just fixed the error; I needed to use the column numbers, I suppose, not the names. The changes are reflected in the code above. However, it still runs the selected variables as a multivariate analysis and not as true univariate analyses...
If you want to go the formula route (which in your case, with multiple outcomes and multiple variables, is probably the most practical way to go about it), you need to create one formula per model you want to fit. I've split the steps up a bit here (making formulas, making models, and extracting data); they can of course be combined, but this way you can inspect all your models.
#example using the transplant data from the survival package
library(survival)
#make a new event variable (death or no death) to have a dichotomous outcome
transplant$death <- transplant$event == "death"
#build one formula per predictor
univ_formulas <- sapply(c("age", "sex", "abo"),
                        function(x) as.formula(paste('Surv(futime, death) ~', x)))
#fit one Cox model per formula
univ_models <- lapply(univ_formulas, function(x) coxph(x, data = transplant))
#extract the data (here I've gone for HR and confidence intervals)
univ_results <- lapply(univ_models, function(x) exp(cbind(coef(x), confint(x))))
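If a single overview table is easier to work with, the per-variable matrices can then be stacked, for example:
#combine the per-variable results into one table of HRs and confidence intervals
do.call(rbind, univ_results)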
I am new to machine learning/statistical modelling.
I am trying to run a classification on a highly sparse dataset with 100 features, most of which are categorical (TRUE/FALSE) with the remaining values missing. To handle missing values, I filled the missing spots with the text 'Nothing', thereby creating a new level.
Next, I am trying to run a penalized logistic regression (glmnet package). When I check the coefficients, I see that the dummy variables corresponding to 'Nothing' have the highest coefficients.
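A rough sketch of that setup, in case it helps clarify the question (predictors and y below are placeholders for my actual data):
#placeholder sketch of the setup described above
library(Matrix)
library(glmnet)
x <- sparse.model.matrix(~ . - 1, data = predictors)    #dummy-code the factors, including the 'Nothing' level
fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)  #penalized (lasso) logistic regression
coef(fit, s = "lambda.min")                             #inspect coefficients at the CV-chosen penalty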
How should I remove these coefficients? What would be a better approach to this?
Or should I just use trees? Please suggest the best way forward.
Thanks!
I have this script which does a simple PCA analysis on a number of variables and at the end attaches two coordinate columns and two other columns (presence, NZ_Field) to the output file. I have done this many times before, but now it's giving me this error:
covariance matrix is not non-negative definite
I understand that it means there are negative eigenvalues. I looked at similar posts which suggest using na.omit, but it didn't work.
I am pretty sure it is not because of missing values in the data, because I have used the same data with different "presence" and "NZ_Field" columns.
I have uploaded the "biodata.Rdata" file here:
https://www.dropbox.com/s/1ex2z72lilxe16l/biodata.rdata?dl=0
Any help is highly appreciated.
load("biodata.rdata")
#save data separately
coords=biodata[,1:2]
biovars=biodata[,3:21]
presence=biodata[,22]
NZ_Field=biodata[,23]
#Do PCA
bpc=princomp(biovars ,cor=TRUE)
#re-attach data with auxiliary data..coordinates, presence and NZ location data
PCresults=cbind(coords, bpc$scores[,1:3], presence, NZ_Field)
write.table(PCresults,file= "hlb_pca_all.txt", sep= ",",row.names=FALSE)
This does appear to be an issue with missing data, so there are a few ways to deal with it. One way is to manually do listwise deletion on the data before running the PCA, which in your case would be:
biovars <- biovars[complete.cases(biovars), ]
The other option is to use another package; specifically, psych seems to work well here, and you can use principal(biovars). While the output is a bit different, it does work, using pairwise deletion. So basically it comes down to whether you want pairwise or listwise deletion. Thanks!
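For instance, something along these lines (the nfactors and rotate choices here are just meant to mirror the three unrotated components used above):
#sketch of the psych-based alternative; pairwise-complete correlations are the default
library(psych)
ppc <- principal(biovars, nfactors = 3, rotate = "none")
head(ppc$scores)   #component scores, analogous to bpc$scores[, 1:3]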
So I'm currently using the Naive Bayes classifier from the e1071 package to classify data, and I was wondering if there was any way to interact with, and edit the data.
For example, using the iris dataset, and the methods described here to extract a classifier from it, I want to be able to select the individual tables in the classifier.
I would like to be able to select a specific data table (such as the Sepal.Length table) and compare the values against each other in order to get more information.
Does anyone have any methods for doing this?
Just figured it out
Essentially, the classifier is a set of four components: the a priori probabilities, the tables of means and standard deviations for each variable, the different classes, and the original call.
Each of those components is a nested list, and if you keep delving into the individual lists you can get at the individual items, including the individual probability matrices, and work from there. In each table the first column is the mean and the second is the standard deviation. From there you can pull whatever data you want and edit it to your heart's content.
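A small sketch of what that poking-around looks like with the iris example (nb is just the name I'm using for the fitted classifier):
library(e1071)
nb <- naiveBayes(Species ~ ., data = iris)
str(nb, max.level = 1)          #inspect the top-level components of the classifier
nb$tables$Sepal.Length          #per-class table: column 1 is the mean, column 2 the standard deviation
nb$tables$Sepal.Length[, 1]     #just the class means, which can also be edited in place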