Pincipal Component Analysis error - r

I keep getting this error when I try to run a Principal Component Analysis -
Final_Dataset <- Final_Dataset[, colSums(is.na(Final_Dataset)) != nrow(Final_Dataset)]
Final_Dataset <- Final_Dataset[,-grep("Date|factor|character|logical", sapply(Final_Dataset, class))]
table(sapply(Final_Dataset, class))
nzv <- nearZeroVar(Final_Dataset, saveMetrics = TRUE)
print(paste('range:', range(nzv$percentUnique)))
dim(nzv[nzv$percentUnique > 0.1,])
gisette_nzv <- Final_Dataset[c(rownames(nzv[nzv$percentUnique > 0.1,]))]
pmatrix <- scale(gisette_nzv)
princ <- prcomp(pmatrix)
Error in svd(x, nu = 0) : infinite or missing values in 'x'
Is there any way of telling the function to omit na? The problem here is the dataset is huge so if I remove nas there will be no rows left, because out of the ~1000 there are always rows with missing values.

Related

Having trouble with making K Nearest Neighbors work in R Studio

I'm trying to use the knn function in r but I keep getting this error message when I try to compute it.
> knn(Taxi_train,Taxi_test,cl,k=100)
Error in knn(Taxi_train, Taxi_test, cl, k = 100) :
NA/NaN/Inf in foreign function call (arg 6)
In addition: Warning messages:
1: In knn(Taxi_train, Taxi_test, cl, k = 100) : NAs introduced by coercion
2: In knn(Taxi_train, Taxi_test, cl, k = 100) : NAs introduced by coercion
I don't know what exactly is wrong with my code so I need some help to get it working.
I tried making sure that all the variables are numeric but that didn't change anything. It may also be an issue with my cl factor in the knn equation.
Here is what my code is currently:
date<-chicago_taxi$date
class(date)
Date <- as.Date(date)
class(Date)
Julian <- yday(Date)
class(Julian)
head(Julian)
chicago_taxi <- cbind(chicago_taxi,Julian)
chicago_taxi$seconds <- as.numeric(chicago_taxi$seconds)
set.seed(7777)
train_set <- sample(1:13081,10400,replace = FALSE)
Taxi_train <- chicago_taxi[train_set,]
Taxi_test <- chicago_taxi[-train_set,]
cl <- Taxi_train$payment_type
scale(chicago_taxi$miles)
scale(chicago_taxi$seconds)
scale(chicago_taxi$Julian)
knn(Taxi_train,Taxi_test,cl,k=100)

Why am I getting an error with TTR rollSFM loading a CSV file

I'm trying to plot a rolling linear regression, but get the following error:
Error in if (any(by < 0L) || any(by > nc)) stop("'by' must match numbers of columns") :
missing value where TRUE/FALSE needed
The relevant code is:
library(TTR)
mydata <-read.csv("EURUSD2.csv", sep=",",header=TRUE)
gh<-as.matrix(mydata)
reg <- rollSFM(mydata[, "PRICE"], 1:nrow(mydata), n=5)
and here is a sample of the CSV file:
BAR,PRICE,HIGH,LOW
0,113.3035,113.306,113.282
1,113.306,113.311,113.296
2,113.306,113.312,113.302
3,113.297,113.308,113.297
4,113.2875,113.304,113.285
5,113.2975,113.298,113.29
6,113.278,113.303,113.278
7,113.2915,113.294,113.28
8,113.3,113.3,113.284
9,113.298,113.302,113.296
10,113.3195,113.32,113.3
11,113.3275,113.336,113.32
Why won't it accept this data?
I figured it out. You just have to provide your own index to the data. For example:
xts1 <- xts(mydata[, "PRICE"], order.by=Sys.Date()+1:length(mydata[, "BAR"]))
reg <- rollSFM(xts1, .index(xts1), 2)
rma <- reg$alpha + reg$beta*.index(xts1)

KNN: "no missing values are allow" -> I do not have missing values

I am in a group project for a class and one of the people in my group ran the normalization, as well as creating the test/train sets so that we all have the same sets to work from (we're all utilizing different algorithms). I am assigned with running the KNN algorithm.
We had multiple columns with NA's so those columns were omitted (<-NULL). When attempting to run the KNN I keep getting the error of
Error in knn(train = trainsetne, test = testsetne, cl = ne_train_target, :
no missing values are allowed
I ran which(is.na(dataset$col)) and found:
which(is.na(testsetne$median_days_on_market))
# [1] 8038 8097 8098 8100 8293 8304
When I look through the dataset those cells do not have missing data.
I am wondering if I may get some help with how to either find and fix the "No missing values" or to find a work around (if any).
I am sorry if I am missing something simple. Any help is appreciated.
I have listed the code that we have below:
ne$pending_ratio_yy <- ne$total_listing_count_yy <- ne$average_listing_price_yy <- ne$median_square_feet_yy <- ne$median_listing_price_per_square_feet_yy <- ne$pending_listing_count_yy <- ne$price_reduced_count_yy <- ne$median_days_on_market_yy <- ne$new_listing_count_yy <- ne$price_increased_count_yy <- ne$active_listing_count_yy <- ne$median_listing_price_yy <- ne$flag <- NULL
ne$pending_ratio_mm <- ne$total_listing_count_mm <- ne$average_listing_price_mm <- ne$median_square_feet_mm <- ne$median_listing_price_per_square_feet_mm <- ne$pending_listing_count_mm <- ne$price_reduced_count_mm <- ne$price_increased_count_mm <- ne$new_listing_count_mm <- ne$median_days_on_market_mm <- ne$active_listing_count_mm <- ne$median_listing_price_mm <- NULL
ne$factor_month_date <- as.factor(ne$month_date_yyyymm)
ne$factor_median_days_on_market <- as.factor(ne$median_days_on_market)
train20ne= sample(1:20893, 4179)
trainsetne=ne[train20ne,1:10]
testsetne=ne[-train20ne,1:10]
#This is where I start to come in
ne_train_target <- ne[train20ne, 3]
ne_test_target <- ne[-train20ne, 3]
predict_1 <- knn(train = trainsetne, test = testsetne, cl=ne_train_target, k=145)
# Error in knn(train = trainsetne, test = testsetne, cl = ne_train_target, :
# no missing values are allowed

PCA with result non-interactively in R

I send you a message because I would like realise an PCA in R with the package ade4.
I have the data "PAYSAGE" :
All the variables are numeric, PAYSAGE is a data frame, there are no NAS or blank.
But when I do :
require(ade4)
ACP<-dudi.pca(PAYSAGE)
2
I have the message error :
**You can reproduce this result non-interactively with:
dudi.pca(df = PAYSAGE, scannf = FALSE, nf = NA)
Error in if (nf <= 0) nf <- 2 : missing value where TRUE/FALSE needed
In addition: Warning message:
In as.dudi(df, col.w, row.w, scannf = scannf, nf = nf, call = match.call(), :
NAs introduced by coercion**
I don't understand what does that mean. Have you any idea??
Thank you so much
I'd suggest sharing a data set/example others could access, if possible. This seems data-specific and with NAs introduced by coercion you may want to check the type of your input - typeof(PAYSAGE) - the manual for dudi.pca states it takes a data frame of numeric values as input.
Yes, for example :
ag_div <- c(75362,68795,78384,79087,79120,73155,58558,58444,68795,76223,50696,0,17161,0,0)
canne <- c(rep(0,10),5214,6030,0,0,0)
prairie_el<- c(60, rep(0,13),76985)
sol_nu <- c(18820,25948,13150,9903,12097,21032,35032,35504,25948,20438,12153,33096,15748,33260,44786)
urb_peu_d <- c(448,459,5575,5902,5562,458,6271,6136,459,1850,40,13871,40,13920,28669)
urb_den <- c(rep(0,12),14579,0,0)
veg_arbo <- c(2366,3327,3110,3006,3049,2632,7546,7620,3327,37100,3710,0,181,0,181)
veg_arbu <- c(18704,18526,15768,15527,15675,18886,12971,12790,18526,15975,22216,24257,30962,24001,14523)
eau <- c(rep(0,10),34747,31621,36966,32165,28054)
PAYSAGE<-data.frame(ag_div,canne,prairie_el,sol_nu,urb_peu_d,urb_den,veg_arbo,veg_arbu,eau)
require(ade4)
ACP<-dudi.pca(PAYSAGE)

Error running neural net

library(nnet)
set.seed(9850)
train1<- sample(1:155,110)
test1 <- setdiff(1:110,train1)
ideal <- class.ind(hepatitis$class)
hepatitisANN = nnet(hepatitis[train1,-20], ideal[train1,], size=10, softmax=TRUE)
j <- predict(hepatitisANN, hepatitis[test1,-20], type="class")
hepatitis[test1,]$class
table(predict(hepatitisANN, hepatitis[test1,-20], type="class"),hepatitis[test1,]$class)
confusionMatrix(hepatitis[test1,]$class, j)
Error:
Error in nnet.default(hepatitis[train1, -20], ideal[train1, ], size = 10, :
NA/NaN/Inf in foreign function call (arg 2)
In addition: Warning message:
In nnet.default(hepatitis[train1, -20], ideal[train1, ], size = 10, :
NAs introduced by coercion
hepatitis variable consists of the hepatitis dataset available on UCI.
This error message is because you have character values in your data.
Try reading the hepatitis dataset with na.strings = "?". This is defined in the description of the dataset on the uci page.
headers <- c("Class","AGE","SEX","STEROID","ANTIVIRALS","FATIGUE","MALAISE","ANOREXIA","LIVER BIG","LIVER FIRM","SPLEEN PALPABLE","SPIDERS","ASCITES","VARICES","BILIRUBIN","ALK PHOSPHATE","SGOT","ALBUMIN","PROTIME","HISTOLOGY")
hepatitis <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/hepatitis/hepatitis.data", header = FALSE, na.strings = "?")
names(hepatitis) <- headers
library(nnet)
set.seed(9850)
train1<- sample(1:155,110)
test1 <- setdiff(1:110,train1)
ideal <- class.ind(hepatitis$Class)
# will give error due to missing values
# 1st column of hepatitis dataset is the class variable
hepatitisANN <- nnet(hepatitis[train1,-1], ideal[train1,], size=10, softmax=TRUE)
This code will not give your error, but it will give an error on missing values. You will need to do address those before you can continue.
Also be aware that the class variable is the first variable in the dataset straight from the UCI data repository
Edit based on comments:
The na.action only works if you use the formula notation of nnet.
So in your case:
hepatitisANN <- nnet(class.ind(Class)~., hepatitis[train1,], size=10, softmax=TRUE, na.action = na.omit)

Resources