PLS-DA deal with Missing values - r

I am performing an OPLSDA, all my columns have some missing values.
I am following these instructions: https://www.bioconductor.org/packages/devel/bioc/vignettes/ropls/inst/doc/ropls-vignette.html
This is my code:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(version = "3.10")
BiocManager::install("ropls")
library(ropls)
dataMatrix=df.Baseline.All[,c(6:63,74:143)]
dataMatrix= dataMatrix[c(23:294),]
dataMatrix = as.matrix(as.data.frame(lapply(dataMatrix, as.numeric)))
str(dataMatrix)
class(dataMatrix)
sampleMetadata = df.Baseline.All[,c(2,165,168,192)]
sampleMetadata= as.data.frame(sampleMetadata)
attach(df.Baseline.All)
dev.off()
view(dataMatrix)
dev.off()
view(sampleMetadata)
adds.pca <- opls(dataMatrix)
The call opls(dataMatrix) gives me an error:
Error: 'x' contains columns with 'NA' only
How can I handle missing data?
This is how SIMCA software deals with missing values:
"Put simply the NIPALS algorithm interpolates the missing point using a least squares fit but give the
missing data no influence on the model. Successive iterations refine the missing value by simply
multiplying the score and the loading for that point. Many different methods exist for missing data,
such as estimation but they generally converge to the same solution. Missing data is acceptable if they
are randomly distributed. Systematic blocks of missing data are problematic. "
How would you do this in R?
Thanks!
lili

Apparently, you have whole columns that contain only NAs. You should remove those columns from your data before attempting to perform PCA. I created a function to detect which columns are all NAs.
NAcols <- function(X){
# TRUE for each column in which every entry is NA
thecols <- apply(X, 2, function(x){sum(is.na(x))}) == dim(X)[1]
return(thecols)
}
# keep only the columns that are NOT entirely NA
dataMatrixClean <- dataMatrix[, !NAcols(dataMatrix)]
adds.pca <- opls(dataMatrixClean)
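The idea of dropping columns that are entirely NA can be sketched in base R on a small made-up matrix. The matrix `toy` and its column names are hypothetical and not taken from the question. `colSums(is.na(X)) == nrow(X)` flags the all-NA columns, and negating that vector keeps the rest.

```r
# Toy matrix: column "b" is entirely NA, column "c" is only partly missing
toy <- cbind(a = c(1, 2, 3), b = c(NA, NA, NA), c = c(4, NA, 6))

# TRUE for each column in which every entry is NA
all_na <- colSums(is.na(toy)) == nrow(toy)

# Keep only the columns that contain at least one observed value
toy_clean <- toy[, !all_na, drop = FALSE]
print(colnames(toy_clean))
```

Columns with only some missing values, such as `c` here, are kept; only columns with no observed value at all are removed.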

Related

R BiCopKDE cov.wt(z) : 'x' must contain finite values only

My dataset consists of stock prices. My final goal is to fit for practice a copula to two stocks.
I've transformed my data to a [0,1] scale and would like to plot the bivariate density with BiCopKDE.
However, although I tried to detect possible non-finite values, I still get the same error message "cov.wt(z) : 'x' must contain finite values only". I reduced my dataset to 16 rows in order to understand the reason, but it didn't help.
The code:
DFM.roh <- read.xlsx("C:\\Users\\Simon\\Documents\\ML Seminar\\Deutscher Finanzmarkt Daten.xlsx")
DFM <- data.frame(X_bei = DFM.roh$s_bei, X_bayn = DFM.roh$s_bayn)
y_a <- ecdf(DFM$X_bei)(DFM$X_bei)
y_b <- ecdf(DFM$X_bayn)(DFM$X_bayn)
Datacop <- data.frame(y_a, y_b)
which(is.na(Datacop), arr.ind=TRUE)
#row col
all(sapply(Datacop, is.finite))
#TRUE
BiCopKDE(Datacop$y_a, Datacop$y_b, "surface")
# cov.wt(z) : 'x' must contain finite values only
The dataset: [image not included]
Anybody with an idea to solve this?
Best,
Simon
A good way to get what you want is to use BiCopSelect, which is a function in the VineCopula package. Once you get the result, then you can just use the plot function available in the same package.
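One possible cause of the error, offered as an assumption and not confirmed in this thread: `ecdf(x)(x)` returns exactly 1 for the largest observation, and any step that maps the data to a normal scale with `qnorm` turns that 1 into `Inf`. The `is.na` and `is.finite` checks in the question still pass, because 1 itself is finite. Whether `BiCopKDE` applies such a transformation internally is assumed here. A common workaround is to rescale ranks by `n + 1` so that all values lie strictly inside (0, 1). The vector `x` below is made-up data standing in for one price series.

```r
set.seed(1)
x <- rnorm(16)                       # made-up stand-in for one price series

u_ecdf <- ecdf(x)(x)                 # largest observation maps to exactly 1
u_rank <- rank(x) / (length(x) + 1)  # pseudo-observations strictly inside (0, 1)

print(max(u_ecdf))
print(all(is.finite(qnorm(u_ecdf))))  # qnorm(1) is Inf
print(all(is.finite(qnorm(u_rank))))
```

If this is indeed the cause, passing the rank-based values for both stocks instead of the `ecdf` values should avoid the non-finite input.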

error with rda test in vegan r package. Variable not being read correctly

I am trying to perform a simple RDA using the vegan package to test the effects of depth, basin and sector on genetic population structure using the following data frame.
datafile.
The "ALL" variable is the genetic population assignment (structure).
In case the link to my data doesn't work well, I'll paste a snippet of my data frame here.
I read in the data this way:
RDAmorph_Oct6 <- read.csv("RDAmorph_Oct6.csv")
My problems are two-fold:
1) I can't seem to get my genetic variable to read correctly. I have tried three things to fix this.
gen=rda(ALL ~ Depth + Basin + Sector, data=RDAmorph_Oct6, na.action="na.exclude")
Error in eval(specdata, environment(formula), enclos = globalenv()) :
object 'ALL' not found
In addition: There were 12 warnings (use warnings() to see them)
so, I tried things like:
> gen=rda("ALL ~ Depth + Basin + Sector", data=RDAmorph_Oct6, na.action="na.exclude")
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
so I specified numeric
> RDAmorph_Oct6$ALL = as.numeric(RDAmorph_Oct6$ALL)
> gen=rda("ALL ~ Depth + Basin + Sector", data=RDAmorph_Oct6, na.action="na.exclude")
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
I am really baffled. I've also tried specifying each variable with dataset$variable, but this doesn't work either.
The strange thing is, I can get an rda to work if I look the effects of the environmental variables on a different, composite, variable
MC = RDAmorph_Oct6[,5:6]
H_morph_var=rda(MC ~ Depth + Basin + Sector, data=RDAmorph_Oct6, na.action="na.exclude")
Note that I did try to just extract the ALL column for the genetic rda above. This didn't work either.
Regardless, this leads to my second problem.
When I try to plot the rda I get a super weird plot. Note the five dots in three places. I have no idea where these come from.
I will have to graph the genetic rda, and I figure I'll come up with the same issue, so I thought I'd ask now.
I've been though several tutorials and tried many iterations of each issue. What I have provided here is I think the best summary. If anyone can give me some clues, I would much appreciate it.
The documentation, ?rda, says that the left-hand side of the formula specifying your model needs to be a data matrix. You can't pass it the name of a variable in the data object as the left-hand side (or at least if this was ever anticipated, doing so exposes bugs in how we parse the formula which is what leads to further errors).
What you want is a data frame containing a variable ALL for the left-hand side of the formula.
This works:
library('vegan')
df <- read.csv('~/Downloads/RDAmorph_Oct6.csv')
ALL <- df[, 'ALL', drop = FALSE]
Notice the drop = FALSE, which stops R from dropping the empty dimension (i.e. converting the single-column data frame to a vector).
Then your original call works:
ord <- rda(ALL ~ Basin + Depth + Sector, data = df, na.action = 'na.exclude')
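The effect of `drop = FALSE` can be seen on a small made-up data frame. The data frame `toy` and its values are hypothetical; only the indexing behaviour is the point.

```r
# Made-up data frame with a response column "ALL" and one predictor
toy <- data.frame(ALL = c(0.1, 0.9, 0.4), Depth = c(10, 20, 30))

as_vector <- toy[, "ALL"]                # dimension dropped: plain numeric vector
as_frame  <- toy[, "ALL", drop = FALSE]  # dimension kept: one-column data frame

print(class(as_vector))
print(class(as_frame))
print(dim(as_frame))
```

Only the second form keeps the two-dimensional shape that a function expecting a data matrix on the left-hand side can work with.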
The problem is that rda expects a separate df for the first part of the formula (ALL in your code), and does not use the one in the data = argument.
As mentioned above, you can create a new df with the variable needed for analysis, but here's a oneline solution that should also work:
gen <- rda(RDAmorph_Oct6$ALL ~ Depth + Basin + Sector, data = RDAmorph_Oct6, na.action = na.exclude)
This is partly similar to Gavin Simpson's answer. There is also a problem with the categorical vectors in your data frame. You can either use library(data.table) and the rowid function to convert the categorical variables to integers or, preferably, not use them at all. I also wanted to set the ID vector as site names, but I am too lazy now.
library(data.table)
RDAmorph_Oct6 <- read.csv("C:/........../RDAmorph_Oct6.csv")
#remove NAs before. I like looking at my dataframes before I analyze them.
RDAmorph_Oct6 <- na.omit(RDAmorph_Oct6)
#I removed one duplicate
RDAmorph_Oct6 <- RDAmorph_Oct6[!duplicated(RDAmorph_Oct6$ID),]
#Create vector with only ALL
ALL <- RDAmorph_Oct6$ALL
#Create data frame with only numeric vectors and remove ALL
dfn <- RDAmorph_Oct6[,-c(1,4,11,12)]
#Select all categorical vectors.
dfc <- RDAmorph_Oct6[,c(1,11,12)]
#Number the categorical values with rowid(). Note that rowid() counts occurrences within each group, so a column of unique values such as ID becomes all 1s.
dfc2 <- as.data.frame(apply(dfc, 2, function(x) rowid(x)))
#Bind back with numeric data frame
dfnc <- cbind.data.frame(dfn, dfc2)
#Select only what you need
df <- dfnc[c("Depth", "Basin", "Sector")]
#The rest you know
rda.out <- rda(ALL ~ ., data=df, scale=T)
plot(rda.out, scaling = 2, xlim=c(-3,2), ylim=c(-1,1))
#Also plot correlations
plot(cbind.data.frame(ALL, df))
Sector and Depth show the highest variation, which is not surprising since only three predictors are used. The integers assigned to the categorical vectors probably have no meaning at all: the function simply assigns them from top to bottom as it encounters each character string. I am also not really sure which question you want to answer. Once that is clear, you can organize the data frame accordingly.

Kaggle Digit Recognizer Using SVM (e1071): Error in predict.svm(ret, xhold, decision.values = TRUE) : Model is empty

I am trying to solve the digit Recognizer competition in Kaggle and I run in to this error.
I loaded the training data and adjusted the values of it by dividing it with the maximum pixel value which is 255. After that, I am trying to build my model.
Here Goes my code,
Given_Training_data <- get(load("Given_Training_data.RData"))
Given_Testing_data <- get(load("Given_Testing_data.RData"))
Maximum_Pixel_value = max(Given_Training_data)
Tot_Col_Train_data = ncol(Given_Training_data)
training_data_adjusted <- Given_Training_data[, 2:ncol(Given_Training_data)]/Maximum_Pixel_value
testing_data_adjusted <- Given_Testing_data[, 2:ncol(Given_Testing_data)]/Maximum_Pixel_value
label_training_data <- Given_Training_data$label
final_training_data <- cbind(label_training_data, training_data_adjusted)
smp_size <- floor(0.75 * nrow(final_training_data))
set.seed(100)
training_ind <- sample(seq_len(nrow(final_training_data)), size = smp_size)
training_data1 <- final_training_data[training_ind, ]
train_no_label1 <- as.data.frame(training_data1[,-1])
train_label1 <-as.data.frame(training_data1[,1])
svm_model1 <- svm(train_label1,train_no_label1) #This line is throwing an error
Error : Error in predict.svm(ret, xhold, decision.values = TRUE) : Model is empty!
Please Kindly share your thoughts. I am not looking for an answer but rather some idea that guides me in the right direction as I am in a learning phase.
Thanks.
Update to the question :
trainlabel1 <- train_label1[sapply(train_label1, function(x) !is.factor(x) | length(unique(x))>1 )]
trainnolabel1 <- train_no_label1[sapply(train_no_label1, function(x) !is.factor(x) | length(unique(x))>1 )]
svm_model2 <- svm(trainlabel1,trainnolabel1,scale = F)
It didn't help either.
Read the manual (https://cran.r-project.org/web/packages/e1071/e1071.pdf):
svm(x, y = NULL, scale = TRUE, type = NULL, ...)
...
Arguments:
...
x a data matrix, a vector, or a sparse matrix (object of class
Matrix provided by the Matrix package, or of class matrix.csr
provided by the SparseM package,
or of class simple_triplet_matrix provided by the slam package).
y a response vector with one label for each row/component of x.
Can be either a factor (for classification tasks) or a numeric vector
(for regression).
Therefore, the main problems are that your call to svm swaps the data matrix and the response vector, and that you are passing the response vector as an integer, which results in a regression model. Furthermore, you are passing the response vector as a single-column data frame, which is not how it is supposed to be supplied. Hence, if you change the call to:
svm_model1 <- svm(train_no_label1, as.factor(train_label1[, 1]))
it will work as expected. Note that training will take some minutes to run.
You may also want to remove features that are constant (where the values in the respective column of the training data matrix are all identical) in the training data, since these will not influence the classification.
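Removing constant features can be sketched in base R as follows. The data frame `toy` and its column names are made up for illustration; in the digit data, the constant columns would typically be pixels that are never set.

```r
# Made-up feature table: p1 and p3 take the same value in every row
toy <- data.frame(
  p1 = c(0, 0, 0, 0),
  p2 = c(0.1, 0.5, 0, 0.9),
  p3 = c(1, 1, 1, 1),
  p4 = c(0, 0.2, 0.2, 0)
)

# TRUE for each column that holds a single distinct value
is_constant <- vapply(toy, function(col) length(unique(col)) == 1, logical(1))

# Keep only the columns that vary
toy_reduced <- toy[, !is_constant, drop = FALSE]
print(names(toy_reduced))
```

The same columns would need to be dropped from the test data, using the `is_constant` vector computed on the training data, so that both sets keep identical features.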
I don't think you need to scale it manually, since svm scales the data itself by default, unlike most neural network packages.
You can also use the formula version of svm instead of the matrix and vectors which is
svm(result~.,data = your_training_set)
In your case, I guess you want to make sure that result is used as a factor, because you want a label like 1, 2, 3 and not a value like 1.5467, which is what a regression would give.
I can debug it if you can share the data:Given_Training_data.RData

e1071 SVM: Error trying to predict

I keep receiving this error and I cannot figure out why.
Error in scale.default(newdata[, object$scaled, drop = FALSE], center
= object$x.scale$"scaled:center", : length of 'center' must equal the number of columns of 'x'
I'm using the default iris dataset, and here is all of my code. It's an attempt at implementing a multiclass SVM using the pairwise method.
# pass in the dataframe & the number of classes
multiclass.svm <- function(data) {
class.vec = data[,length(data)]
levels = levels(class.vec)
pair1 <- data[which(class.vec == levels[1]),]
pair1 <- droplevels(pair1)
pair2 <- data[which(class.vec == levels[length(levels)]),]
pair2 <- droplevels(pair2)
pairs = list(rbind(pair1, pair2))
# print(pairs)
for(i in 2:length(levels)){
L1 <- data[which(class.vec == levels[i-1]),]
L1 <- droplevels(L1)
L2 <- data[which(class.vec == levels[i]),]
L2 <- droplevels(L2)
pair <- list(rbind(L1, L2))
pairs <- c(pairs, pair)
}
# now we construct our (n choose 2) binary models
models = list()
for(pair in pairs){
classifier = pair[,length(pair)]
p.svm = svm(formula=classifier~., data=pair)
models = c(models, list(p.svm))
}
for(model in models){
test = iris[1,]
print(predict(model, test))
}
return(models)
}
Testing/usage:
> h = multiclass.svm(iris)
Error in scale.default(newdata[, object$scaled, drop = FALSE], center = object$x.scale$"scaled:center", :
length of 'center' must equal the number of columns of 'x'
>
Any help would be very much appreciated... I've found a few other questions on this very topic to no avail. Thank you.
Okay, so the answer is unfortunately quite tricky (depending on your dataset.) The problem is that in the iris dataset, there are THREE levels of classification. Since I'm breaking the classes into pairs, each of my models only have TWO levels of classification.
When using predict, the model you've trained and the value you're testing must both have the same levels. So, the tricky part (at least in this case) is deleting the unnecessary levels from each pair.
I recommend using the library plyr for its revalue function. To remove specific levels (instead of all unused ones, a la the droplevels function) you can use revalue and rename each unwanted level to an existing one (essentially destroying it).
Credit to this Polish blogger for steering me in the right direction:
http://ppiotrow.blogspot.com/2013/04/solved-r-svm-test-data-does-not-match.html
The quick and easy way, though, to solve my specific problem was simply removing all of the droplevels calls haha. Since the SVM won't find any points to draw upon the unused level, there is no actual problem with leaving the extraneous level in.
Hope this helps someone out there.
Mike
I have experienced the same issue. I fixed the error by converting all of the predictors in the test set to their correct class, i.e., as.factor, as.numeric.
For example, if a numeric predictor variable in the training set is in memory as a character variable in your test set you will get this error. I hope this helps.
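A minimal base R sketch of that fix, assuming the training and test sets share column names: each test column is coerced to the class of the matching training column, and factor columns reuse the training levels. The data frames `train` and `test` are made up for illustration.

```r
# Made-up training set with one numeric and one factor predictor
train <- data.frame(
  size   = c(1.5, 2.0, 3.2),
  colour = factor(c("red", "blue", "red"))
)

# Made-up test set in which both columns were read in as character
test <- data.frame(
  size   = c("2.5", "1.1"),
  colour = c("blue", "red"),
  stringsAsFactors = FALSE
)

# Coerce every test column to the class of the matching training column
for (name in names(train)) {
  if (is.factor(train[[name]])) {
    # Reuse the training levels so both sets agree
    test[[name]] <- factor(test[[name]], levels = levels(train[[name]]))
  } else if (is.numeric(train[[name]])) {
    test[[name]] <- as.numeric(test[[name]])
  }
}
str(test)
```

Values in a test factor column that never occurred in training become NA with this approach, which makes such mismatches visible before predict is called.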

Use of randomforest() for classification in R?

I originally had a data frame composed of 12 columns in N rows. The last column is my class (0 or 1). I had to convert my entire data frame to numeric with
training <- sapply(training.temp,as.numeric)
But then I thought I needed the class column to be a factor column to use the randomforest() tool as a classifier, so I did
training[,"Class"] <- factor(training[,ncol(training)])
I proceed to creating the tree with
training_rf <- randomForest(Class ~., data = trainData, importance = TRUE, do.trace = 100)
But I'm getting two errors:
1: In Ops.factor(training[, "Status"], factor(training[, ncol(training)])) :
<= this is not relevant for factors (roughly translated)
2: In randomForest.default(m, y, ...) :
The response has five or fewer unique values. Are you sure you want to do regression?
I would appreciate it if someone could point out the formatting mistake I'm making.
Thanks!
So the issue is actually quite simple. It turns out my training data was an atomic vector (a matrix, as returned by sapply), not a data frame. So it first had to be converted to a data frame. I needed to add the following line:
training <- as.data.frame(training)
Problem solved!
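The behaviour behind this can be reproduced on a tiny made-up data frame. The object `training.temp` below is a hypothetical stand-in for the real data. `sapply` over a data frame returns a matrix, and the class column can only become a factor once the result is a data frame again.

```r
# Made-up stand-in for the original data, with everything stored as character
training.temp <- data.frame(
  x1    = c("1.2", "3.4", "5.6"),
  Class = c("0", "1", "0"),
  stringsAsFactors = FALSE
)

# sapply simplifies its result to a numeric matrix, not a data frame
as_matrix <- sapply(training.temp, as.numeric)
print(is.matrix(as_matrix))

# Convert back to a data frame, then make the class column a factor
training <- as.data.frame(as_matrix)
training$Class <- factor(training$Class)
print(is.factor(training$Class))
print(levels(training$Class))
```

With `Class` stored as a factor in a data frame, a formula such as `Class ~ .` describes a classification problem and not a regression.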
First, your coercion to a factor has no effect because sapply returns a matrix, and a matrix cannot hold a factor column: the values are stored as plain numbers again. Convert to a data frame first. Second, I recommend specifying the RF model through indexing (the x and y arguments) and not through a formula. Here are changes to your code that should make it work.
library(randomForest)
# sapply returns a matrix, so convert back to a data frame before adding a factor
training <- as.data.frame(sapply(training.temp, as.numeric))
training[,"Class"] <- as.factor(training[,"Class"])
training_rf <- randomForest(x=training[,1:(ncol(training)-1)], y=training[,"Class"],
importance=TRUE, do.trace=100)
# You can also coerce to a factor directly in the model statement
training_rf <- randomForest(x=training[,1:(ncol(training)-1)], y=as.factor(training[,"Class"]),
importance=TRUE, do.trace=100)
