I have a data frame with 30 rows and 850 columns (features).
When I try to use svm or another classifier with the caret and e1071 packages, I get this error:
Error in terms.formula(formula, data = data) :
duplicated name 'X10Percentile' in data frame using '.'
Even when I try to use a feature selection method such as Boruta, I get the same error.
I double-checked my features and found nothing. I suspected I must have duplicate column names in the data frame, so I created some sample data to check, as follows:
library(caret)
library(e1071)

test <- data.frame("w1" = c(1:6), "w1.1" = c(2:7), "w1" = c(3:8),
                   "ta" = c("T", "F", "T", "F", "F", "T"))
set.seed(100)
train <- createDataPartition(y = test$ta, p = 0.6, list = FALSE)
TrainSet <- test[train, ]
TestSet <- test[-train, ]
trcontrol_rcv <- trainControl(method = "cv", number = 10)
svm_test <- svm(ta ~ ., data = TrainSet, trControl = trcontrol_rcv)
It works fine and no error occurs.
As far as I can see, no error happens even though the test data has exactly the same column name twice.
I want to know why the error "Error in terms.formula(formula, data = data) : duplicated name 'X10Percentile' in data frame using '.'" happens for my data, and how I can eliminate it.
Thank you in advance.
Thank you, everyone. Fortunately, I found the cause of this error.
R was treating my variables as factors, so the object ended up as a data.frame (which is in fact a list). To solve the problem, I converted it to numeric in the following way:
test1 <- sapply(test, function(x) as.numeric(as.character(x)))
For me that was not the solution; my object was a large matrix containing only numeric vectors.
The problem was that some entries of dimnames(MyLargeMatrix) were duplicated. I changed them and the error went away.
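As an aside (not from either answer above), here is a minimal sketch of how one might locate and repair duplicated column names before using the '.' formula; df stands in for your own data frame:
# List any duplicated column names, then make them unique.
dup_names <- names(df)[duplicated(names(df))]
print(dup_names)                     # e.g. "X10Percentile"
names(df) <- make.unique(names(df))
# Note: data.frame(..., check.names = TRUE) (the default) renames duplicates
# automatically, which is why the small 'test' example above raised no error.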
Okay so here's the deal. I have to use the BaylorEdPsych package in R to test whether the dataset that I have is MCAR or not.
I ran the LittleMCAR function in it with the sample dataset (EndersTable1_1) and it worked flawlessly.
When I try to run my own dataset through the function, I get this error:
Error in eigen(sampmat, symmetric = TRUE) :
infinite or missing values in 'x'
I don't understand why this would throw an error when my dataset conforms to the structure of the sample data.
My dataset by the way is a time series that details climate variables for the year 2000 with daily resolution.
Here's my dataset for anyone who wants to reproduce this problem. https://drive.google.com/open?id=0B8hGFkkZ5DlfZFl4MGxXY1Y2dlE
My code is below:
install.packages("BaylorEdPsych")
install.packages("mvnmle")
library(BaylorEdPsych)
library(mvnmle)
#<update>
data(EndersTable1_1) #retrieve the enders dataset
View(EndersTable1_1) #view the dataset in R's data viewer
LittleMCAR(EndersTable1_1)
#</update>
LittleMCAR(year_2000) #this is what I named the imported dataset
What am I doing wrong?
Thanks to anyone who replies.
After taking out the blocks of rows that were all NA and the column that was all NA, this succeeds:
LittleMCAR(year_2000[ !apply(year_2000, 1, function(x) all(is.na(x))), -10])
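For clarity, here is a slightly more general sketch of the same clean-up (assuming year_2000 is already loaded as a data frame; in the call above, column 10 was the all-NA one):
# Drop rows and columns that are entirely NA before calling LittleMCAR().
all_na_rows <- apply(year_2000, 1, function(x) all(is.na(x)))
all_na_cols <- apply(year_2000, 2, function(x) all(is.na(x)))
LittleMCAR(year_2000[!all_na_rows, !all_na_cols])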
Even though the NaiveBayes() help says that numeric variables can be passed in the first parameter 'x', I am not able to run it successfully. Without the numeric variable (resale) it works fine. Here is the script:
library(readr)
library(klaR)
### load dataset
Dataset <- read_csv("D:/sampledata.csv")
### converting 'model' and 'type' to factor
Dataset$model <- factor(Dataset$model)
Dataset$type <- factor(Dataset$type)
### Executing NaiveBayes with numeric 'resale'
NaiveBayesModel1 <- NaiveBayes(model ~ type + mylogical + resale, data = Dataset, na.action = na.omit)
### now removing resale. Following works as expected.
NaiveBayesModel1 <- NaiveBayes(model ~ type + mylogical, data = Dataset, na.action = na.omit)
'model' and 'type' are factors,
'mylogical' is a logical and
'resale' is a numeric variable.
Since I cannot attach my data file, I am pasting a few rows here. Copy these rows and save them as sampledata.csv on your drive, then modify read_csv() in the script above to point to this csv file.
"model","sales","resale","type","mylogical"
"Integra",16.919,16.36,"Automobile",TRUE
"TL",39.384,19.875,"Automobile",FALSE
"Camry",247.994,13.245,"Automobile",FALSE
"Avalon",63.849,18.14,"Automobile",TRUE
"Celica",33.269,15.445,"Automobile",TRUE
"Tacoma",84.087,9.575,"Truck",TRUE
"RAV4",25.106,13.325,"Truck",FALSE
"4Runner",68.411,19.425,"Truck",FALSE
"Land Cruiser",9.835,34.08,"Truck",TRUE
"Golf",9.761,11.425,"Automobile",FALSE
"Jetta",83.721,13.24,"Automobile",FALSE
"Passat",51.102,16.725,"Automobile",TRUE
"Cabrio",9.569,16.575,"Automobile",FALSE
"GTI",5.596,13.76,"Automobile",FALSE
I get the following error if I run NaiveBayes with "resale":
Error in if (any(temp)) stop("Zero variances for at least one class in variables: ", :
missing value where TRUE/FALSE needed
The R help (help(NaiveBayes)) says I can use numeric variables. I don't understand what is wrong. Please help.
Regards,
SG
The error is caused by zero variance of the resale values within each outcome class of model. Most likely your training set contains a single record for each distinct value of model.
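A quick diagnostic sketch (not part of the original answer, reusing the Dataset object from the question above):
# With only one row per level of 'model', var() is undefined (NA) for every
# class, which trips the "missing value where TRUE/FALSE needed" check.
table(Dataset$model)                        # one observation per class
tapply(Dataset$resale, Dataset$model, var)  # NA for every class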
I'm trying to run correlations on R.
This is my code so far:
library("foreign")
mydata <- read.csv(" ", header = FALSE)
options(max.print = 1000000)
attach(mydata)
cor(as.numeric(agree_election), as.numeric(agree_party))
Then it gives me an error saying that object "agree_election" is not found.
However, agree_election is just one of the column headers in my Excel spreadsheet. How do I fix this?
Check the names in your data frame! Does it contain a variable with a name agree_election?
Please avoid the attach function. It could be fine with just one data frame, but it can make a mess if you have several data frames attached.
This code should be fine, if the variable names are correct.
mydata <- read.csv("...", header = F)
names(mydata)
str(mydata)
cor(as.numeric(mydata$agree_election), as.numeric(mydata$agree_party))
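One more thing worth checking (an assumption on my part, since the CSV itself isn't shown): with header = FALSE, read.csv() names the columns V1, V2, ..., so the spreadsheet headers never become variable names. If the first row of the file contains the headers, read with header = TRUE:
# Assumption: the first row of the CSV holds the column headers.
mydata <- read.csv("...", header = TRUE)
names(mydata)   # should now include "agree_election" and "agree_party"
cor(as.numeric(mydata$agree_election), as.numeric(mydata$agree_party))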
I'm running through a large dataset chunk by chunk, updating a list of linear models as I go using the biglm function. The issue occurs when a particular chunk does not contain all the factors that I have in my linear model, and I get this error:
Error in update.biglm(model, new) : model matrices incompatible
The description of update.biglm mentions that factor levels must be the same across all chunks. I could probably come up with a workaround to avoid this, but there must be a better way. This pdf, on the 'biglm' page, mentions that "Factors must have their full set of levels specified (not necessarily present in the data chunk)". So I think there is some way to specify all the possible levels so that I can update a model even when not all factor levels are present in a chunk, but I can't figure out how to do it.
Here's an example piece of code to illustrate my problem:
library(biglm)

df <- data.frame(a = rnorm(12), b = as.factor(rep(1:4, each = 3)), c = rep(0:1, 6))
model <- biglm(a ~ b + c, data = df)
df.new <- data.frame(a = rnorm(6), b = as.factor(rep(1:2, each = 3)), c = rep(0:1, 3))
model.new <- update(model, df.new)
Thanks for any advice you have.
I came across this problem also. Are the variables in your large data frame specified as factors before breaking them into chunks? Also, is the data set formatted as a data frame?
large_df <- as.data.frame(large_data_set) # just to make sure it's a df.
large_df$factor.vars <- as.factor(large_df$factor.vars)
If this is the case, then all of the factor levels should be preserved in the factor variables even after breaking the data frame into chunks. This will ensure that biglm creates the proper design matrix from the first call, and that all subsequent updates will be compatible.
If you have different data frames from the start (as you illustrate in your example), perhaps you should merge them into one before breaking it down into chunks. Continuing from your example:
df.large <- rbind(df,df.new)
chunk1 <- df.large[1:12,]
chunk2 <- df.large[13:18,]
model <- biglm(a~b+c,data = chunk1)
model.new <- update(model,chunk2) # this is now compatible
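Alternatively, here is a minimal sketch of the approach the biglm documentation hints at: declare the full set of levels on every chunk, even when some levels are absent from a given chunk. This reuses the objects from the question and is an illustration, not the original poster's solution:
library(biglm)

# Declare all four levels of b on both chunks, even though df.new only
# contains levels 1 and 2; the design matrices then stay compatible.
df <- data.frame(a = rnorm(12), b = factor(rep(1:4, each = 3), levels = 1:4), c = rep(0:1, 6))
df.new <- data.frame(a = rnorm(6), b = factor(rep(1:2, each = 3), levels = 1:4), c = rep(0:1, 3))

model <- biglm(a ~ b + c, data = df)
model.new <- update(model, df.new)  # no "model matrices incompatible" error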
I am trying to read in yearly data with gaps using the read.zoo function from the zoo package. I am having some trouble finding the FUN argument that declares the data to be yearly. The data set is located here.
The function call I am trying is
tsGDP <- read.zoo("us-gross-domestic-product-192919.csv", sep=",", format="%Y",
regular=FALSE, header=TRUE, index.column=1)
plot(log(tsGDP))
This works fine, but it chokes when I try to plot the ACF of the series
> acf(tsGDP)
Error in na.fail.default(as.ts(x)) : missing values in object
This R-list posting seems to indicate that this is because I am not declaring yearly data correctly.
Without the data, it is hard to reproduce the problem.
But, from the documentation of acf:
By default, no missing values are allowed. If the na.action function passes through missing values (as na.pass does), the covariances are computed from the complete cases.
why not try
acf(x = tsGDP, na.action = na.pass)