I used qda() from the MASS package to find a classifier for my data, and it always reported "some group is too small for 'qda'". Is it due to the size of the test data I used for the model? I increased the test sample size from 30 to 100, but it reported the same error.
set.seed(1345)
AllMono <- AllData[AllData$type == "monocot", ]
MonoSample <- sample(1:nrow(AllMono), size = 100, replace = FALSE)
set.seed(1355)
AllEudi <- AllData[AllData$type == "eudicot", ]
EudiSample <- sample(1:nrow(AllEudi), size = 100, replace = FALSE)
testData <- rbind(AllMono[MonoSample, ], AllEudi[EudiSample, ])
plot(testData$mono_score, testData$eudi_score, col = as.numeric(testData$type),
     xlab = "mono_score", ylab = "eudi_score", pch = 19)
qda(type ~ mono_score + eudi_score, data = testData)
Here is an example of my data:
> head(testData)
sequence mono_score eudi_score type
PhHe_4822_404_76 DTRPTAPGHSPGAGH 51.4930 39.55000 monocot
SoBi_10_265860_58 QTESTTPGHSPSIGH 33.1408 2.23333 monocot
EuGr_5_187924_158 AFRPTSPGHSPGAGH 27.0000 54.55000 eudicot
LuAn_AOCW01152859.1_2_79 NFRPTEPGHSPGVGH 20.6901 50.21670 eudicot
PoTr_Chr07_112594_90 DFRPTAPGHSPGVGH 43.8732 56.66670 eudicot
OrSa.JA_3_261556_75 GVRPTNPGHSPGIGH 55.0986 45.08330 monocot
PaVi_contig16368_21_57 QTDSTTPGHSPSIGH 25.8169 2.50000 monocot
> testData$type <- as.factor(testData$type)
> dim (testData)
[1] 200 4
> levels (testData$type)
[1] "eudicot" "monocot" "other"
> table (testData$type)
eudicot monocot other
100 100 0
> packageDescription("MASS")
Package: MASS
Priority: recommended
Version: 7.3-29
Date: 2013-08-17
Revision: $Rev: 3344 $
Depends: R (>= 3.0.0), grDevices, graphics, stats, utils
My R version is R 3.0.2.
tl;dr my guess is that your predictor variables got made into factors or character vectors by accident. This can easily happen if you have some minor glitch in your data set, such as a spurious character in one row.
Here's a way to make up a data set that looks like yours:
set.seed(101)
mytest <- data.frame(type=rep(c("monocot","dicot"),each=100),
mono_score=runif(100,0,100),
dicot_score=runif(100,0,100))
Some useful diagnostics:
str(mytest)
## 'data.frame': 200 obs. of 3 variables:
## $ type : Factor w/ 2 levels "dicot","monocot": 2 2 2 2 2 2 2 2 2 2 ...
## $ mono_score : num 37.22 4.38 70.97 65.77 24.99 ...
## $ dicot_score: num 12.5 2.33 39.19 85.96 71.83 ...
summary(mytest)
## type mono_score dicot_score
## dicot :100 Min. : 1.019 Min. : 0.8594
## monocot:100 1st Qu.:24.741 1st Qu.:26.7358
## Median :57.578 Median :50.6275
## Mean :52.502 Mean :52.2376
## 3rd Qu.:77.783 3rd Qu.:78.2199
## Max. :99.341 Max. :99.9288
##
with(mytest,table(type))
## type
## dicot monocot
## 100 100
Importantly, the first two (str() and summary()) show us what type each variable is. Update: it turns out the third test is actually the important one in this case, since the problem was a spurious extra level: the droplevels() function should take care of this problem ...
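A minimal sketch of that fix, using the questioner's testData from above:

testData$type <- droplevels(testData$type)   # drop the empty "other" level
table(testData$type)                         # only eudicot and monocot remain
qda(type ~ mono_score + eudi_score, data = testData)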
This made-up example seems to work fine, so there must be something you're not showing us about your data set ...
library(MASS)
qda(type~mono_score+dicot_score,data=mytest)
Here's a guess. If your score variables were actually factors rather than numeric, then qda would automatically attempt to create dummy variables from them which would then make the model matrix much wider (101 columns in this example) and provoke the error you're seeing ...
bad <- transform(mytest,mono_score=factor(mono_score))
qda(type~mono_score+dicot_score,data=bad)
## Error in qda.default(x, grouping, ...) :
## some group is too small for 'qda'
I had this error as well, so I'll explain what went wrong on my side for anyone stumbling upon this in the future.
You might have a factor as the variable you want to predict. Every level of this factor must have some observations; if a group has no (or too few) observations, you will get this error.
In my case, I had removed all rows belonging to one level, but the level itself was still left in the factor.
To remove it, you have to re-apply factor:
df$var %<>% factor
NB: %<>% requires magrittr.
However, even when I did this, it still failed. When I debugged further, it appeared that if you subset a data frame that already had factor() applied, you have to re-apply factor(), because subsetting keeps the original set of levels.
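For reference, the magrittr compound pipe above is just shorthand for reassignment; a base-R sketch of the same fix (df and var are placeholder names):

df$var <- factor(df$var)   # rebuild the factor, dropping unused levels
df <- droplevels(df)       # or: drop unused levels across the whole data frame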
Your grouping variable has 3 levels, including 'other' with no cases. Since the number of predictor variables (2 variables, i.e. mono_score and eudi_score) is larger than the number of cases in the smallest group level (100, 100 and 0 for eudicot, monocot and other, respectively), the analysis cannot be performed.
One way to get rid of unnecessary group levels is by redefining the grouping variable as a factor after converting it to character:
testData$type <- as.factor(as.character(testData$type))
Another alternative is to define the levels of the grouping variable explicitly:
testData$type <- factor(testData$type, levels = c("eudicot", "monocot"))
If your dataset were very unbalanced and had, for example, only 2 cases of 'other', it would probably make sense to exclude them from the analysis.
The message could still appear if the number of predictor variables were larger than the number of cases in some group level, but since you have 100 cases for both remaining levels (i.e. eudicot, monocot) and only two predictors (i.e. mono_score, eudi_score), this should not be a problem anymore.
I am looking at a video game dataset and I'm trying to calculate the average user score (the User_Score column in the dataset).
The issue I'm facing is that whenever I try to use the mean function to get the user score average, I always get this error:
"‘>’ not meaningful for factors"
and I get NaN as a result.
I looked up this problem online, and it seems to happen when you try to find the mean of a categorical variable. However, when I use typeof() to check the data type of User_Score, it says it's an integer, the same as another column I successfully found the mean of (Critic_Score). I tried to remove all rows that have NaN and NA values in order for it to work, but it hasn't helped.
Here is what I have tried so far:
game_data = read.csv('Video_Games_Sales_as_at_22_Dec_2016.csv')
game_data <- mutate(game_data, Critic_Score = ifelse(Critic_Score > 100, NA, Critic_Score))
game_data <- game_data[complete.cases(game_data), ]
typeof(game_data$User_Score)
typeof(game_data$Critic_Score)
#game_data$User_Score = as.numeric(game_data$User_Score)
game_data <- mutate(game_data, User_Score = ifelse(User_Score > 10, NA, User_Score))
head(game_data)
ncol(game_data)
nrow(game_data)
mean(game_data$Critic_Score, na.rm = T)
mean(game_data$User_Score,na.rm = T)
Here are the results:
[1] "integer"
[1] "integer"
‘>’ not meaningful for factors
[1] 16
[1] 7017
[1] 70.24982
[1] NaN
I was wondering if anyone could help
It seems that it's a data cleaning issue: some values in the User_Score column are not numeric but "tbd", which is why the column is imported as character instead of numeric; read.csv() then converts it to a factor (stringsAsFactors was TRUE by default before R 4.0). This also explains the typeof() result that confused you: a factor is stored internally as an integer vector of level codes, so typeof() reports "integer" even though the column is categorical; use class() or str() instead.
str(game_data$User_Score)
# Factor w/ 97 levels "","0","0.2","0.3",..: 79 1 82 79 1 1 84 65 83 1 ...
Check that with:
table(game_data$User_Score)
So you need to replace the "tbd" values. You need to decide what you want to do with them: replace them with 0 or replace them with NA; it's up to you and depends on your insight into the dataset.
If you want to use NAs, you can just convert the column from factor to character and then to numeric:
game_data$User_Score = as.numeric(as.character(game_data$User_Score))
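A quick usage sketch of the conversion and the mean, using the column and data frame names from the question:

game_data$User_Score <- as.numeric(as.character(game_data$User_Score))  # "tbd" and "" become NA, with a coercion warning
class(game_data$User_Score)                # now "numeric"
mean(game_data$User_Score, na.rm = TRUE)   # NAs are excluded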
I am struggling to write a function for a dataset that looks like this:
identifier age occupation
pers1 18 student
pers2 45 teacher
pers3 65 retired
What I am trying to do is to write a function that will:
1) sort my variables into numeric vs. factor variables
2) for the numeric variables, give me the mean, min and max
3) for the factor variables, give me a frequency table
4) return points (2) and (3) in a "nice" format (data frame, vector or table)
So far, I have tried this:
describe <- function(x) {
  if (is.numeric(x)) {
    data.frame(mean = mean(x), min = min(x), max = max(x))
  } else {
    table(x)
  }
}
stats <- lapply(data, describe)
Problems:
My problem is that "stats" is now a list, which is difficult to read and to export to Excel or share. I don't know how to make it more reader-friendly.
Alternatively, maybe there is a better way to build the describe function?
Any thoughts on how to solve these two problems are much appreciated!
I may be late to the party, but maybe you still need a solution. I combined the answers from some of the comments to your post into the following code. It assumes you only have numeric columns and factors, and it scales to a large number of columns, as you specified:
# Just some sample data for my example; you don't need ggplot2 otherwise.
library(ggplot2)
data <- diamonds
# Find which columns are numeric and which are not.
# class() can return more than one value (e.g. for ordered factors),
# so we keep only the first element.
classes <- sapply(data, function(x) class(x)[1])
numeric <- which(classes == "numeric")
non_numeric <- which(classes != "numeric")
# Create the summary objects.
summ_numeric <- summary(data[, numeric])
summ_non_numeric <- summary(data[, non_numeric])
# The result is easily written to CSV.
write.csv(summ_non_numeric, file = "test.csv")
Hope this helps.
The desired functionality is already available elsewhere, so if you are not interested in coding it yourself, you can use this instead. The Publish package can be used to generate a table for presentation in a paper. It is not on CRAN, but you can install it from GitHub:
devtools::install_github('tagteam/Publish')
library(Publish)
library(isdals) # Get some data
data(fev)
fev$Smoke <- factor(fev$Smoke, levels=0:1, labels=c("No", "Yes"))
fev$Gender <- factor(fev$Gender, levels=0:1, labels=c("Girl", "Boy"))
univariateTable() can generate a publication-ready table presenting the data. By default, univariateTable() computes the mean and standard deviation for numeric variables and the distribution of observations across categories for factors. These values can be computed and compared across groups. The main input to univariateTable() is a formula, where the right-hand side lists the variables to be included in the table and the left-hand side, if present, specifies a grouping variable.
univariateTable(Smoke ~ Age + Ht + FEV + Gender, data=fev)
This produces the following output
Variable Level No (n=589) Yes (n=65) Total (n=654) p-value
1 Age mean (sd) 9.5 (2.7) 13.5 (2.3) 9.9 (3.0) <1e-04
2 Ht mean (sd) 60.6 (5.7) 66.0 (3.2) 61.1 (5.7) <1e-04
3 FEV mean (sd) 2.6 (0.9) 3.3 (0.7) 2.6 (0.9) <1e-04
4 Gender Girl 279 (47.4) 39 (60.0) 318 (48.6)
5 Boy 310 (52.6) 26 (40.0) 336 (51.4) 0.0714
Hi, I'm a beginner in the R programming language. I wrote code for a regression tree using the rpart package. Some of my independent variables have more than 100 levels. After running the rpart function,
I get the following warning message: "More than 52 levels in a predicting factor, truncated for printout", and my tree is displayed in a very weird way. For example, my tree splits on location, which has around 70 distinct levels, but the label displayed in the tree shows "ZZZZZZZZZZZZZZZZ..........." even though I don't have any location called "ZZZZZZZZ".
Please help me.
Thanks in advance.
Many functions in R place limits on the number of levels a factor variable can have (e.g. randomForest limits factors to 32 levels).
One way I've seen this dealt with, especially in data mining competitions, is to:
1) Determine the maximum number of levels allowed for a given function (call this X).
2) Use table() to count the occurrences of each level of the factor and rank the levels from most to least frequent.
3) Leave the top X - 1 levels as they are.
4) Collapse all remaining levels into a single catch-all level that identifies them as low-occurrence levels.
Here's an example that's a bit long but hopefully helps:
# Generate 1000 random numbers between 0 and 100.
vars1 <- data.frame(values1 = round(runif(1000) * 100, 0))
# Change the values to a factor variable.
vars1$values1 <- factor(vars1$values1)
# Show the top 6 rows of the data frame.
head(vars1)
# Show the number of unique factor levels.
length(unique(vars1$values1))
# Create a table showing the frequency of each level's occurrence.
# data.frame(table(...)) names the columns Var1 and Freq.
table1 <- data.frame(table(vars1$values1))
# Order the table in descending order of frequency.
table1 <- table1[order(-table1$Freq), ]
head(table1)
# Assuming we want to use CART, we choose the top 51
# levels to leave unchanged.
# Get the values of the top 51 occurring levels.
noChange <- table1$Var1[1:51]
# We use '-1000' as the catch-all level to avoid overlap with real
# levels (i.e. in case '52' was actually one of the levels).
# ifelse() checks whether each value is among the top 51 levels;
# if so it is kept as is, otherwise it is changed to '-1000'.
# Note: ifelse() on a factor returns the underlying integer codes,
# so convert to character first.
vars1$newFactor <- factor(ifelse(vars1$values1 %in% noChange,
                                 as.character(vars1$values1), "-1000"))
# Show the number of levels of the new factor column.
length(unique(vars1$newFactor))
Finally, you may want to consider using shortened variable and level names in rpart, as the tree display gets very busy when there are a large number of variables or they have long names.
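As a minimal sketch, base R's abbreviate() can shorten long level labels before fitting (mydata and location are placeholder names):

# Abbreviate long location names so the printed tree stays readable.
levels(mydata$location) <- abbreviate(levels(mydata$location), minlength = 8)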
My dataset, named data, has both categorical and continuous variables, and I would like to delete the categorical variables.
I tried:
data.1 <- data[,colnames(data)[[3L]]!=0]
No error is printed, but the categorical variables stay in data.1. Where is the problem?
The output of head(data) looks like this:
id 1,2,3,4,...
age 45,32,54,23,...
status 0,1,0,0,...
...
(more variables like the ones above)
All variables are defined as "Factor".
What are you trying to do with that code? First of all, colnames(data) is not a list, so using [[ ]] doesn't make sense. Second, the only thing you test is whether the third column name is not equal to zero. As a column name can never start with a number, that's pretty much always true. So your code translates to:
data1 <- data[, TRUE]
Not what you intend to do.
I suppose you know the meaning of binomial. One way of doing this is to define your own function is.binomial(), like this:
is.binomial <- function(x, na.action = c('na.omit', 'na.fail', 'na.pass')) {
  FUN <- match.fun(match.arg(na.action))
  length(unique(FUN(x))) == 2
}
in case you want to take care of NAs. You can then apply this to your data frame:
data.1 <- data[!sapply(data,is.binomial)]
This way you drop all binomial columns, i.e. columns with only two distinct values.
@Shimpei Morimoto,
I think you need a different approach.
Are the categorical variables defined in the data frame as factors?
If so you can use:
data.1 <- data[, !sapply(data, is.factor)]
(sapply() rather than apply() here: apply() first coerces the data frame to a matrix, which discards the factor classes, so is.factor would never be TRUE.)
The test you perform now checks whether column name number 3L is not 0, which I think is not what you want.
Another approach is:
data.1 <- data[, -3L]
but this works only if column 3 is the only column with categorical variables.
I think you're getting there, with your last comment to @Mischa Vreeburg. It might make sense (as you suggest) to reformat your original data file, but you should also be able to solve the problem within R. I can't quite replicate the 'undefined columns' error you got.
Construct some data that look as much like your data as possible:
X <- read.csv(textConnection(
"id,age,pre.treat,status
1,'27', 0,0
2,'35', 1,0
3,'22', 0,1
4,'24', 1,2
5,'55', 1,3
, ,yes(vs)no,"),
quote="\"'")
Take a look:
str(X)
'data.frame': 6 obs. of 4 variables:
$ id : int 1 2 3 4 5 NA
$ age : int 27 35 22 24 55 NA
$ pre.treat: Factor w/ 3 levels " 0"," 1","yes(vs)no": 1 2 1 2 2 3
$ status : int 0 0 1 2 3 NA
Define @Joris Meys's function:
is.binomial <- function(x,na.action=c('na.omit','na.fail','na.pass')) {
FUN <- match.fun(match.arg(na.action))
length(unique(FUN(x)))==2
}
Try it out: you'll see that it does not detect pre.treat as binomial, and keeps all the variables.
sapply(X,is.binomial)
X1 <- X[!sapply(X,is.binomial)]
names(X1)
## keeps everything
We can drop the last row and try again:
X2 <- X[-nrow(X),]
sapply(X2,is.binomial)
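With the last row dropped, pre.treat has only two observed values, so the test should now detect it (a sketch of the expected result):

X1 <- X2[!sapply(X2, is.binomial)]   # now drops pre.treat
names(X1)
## [1] "id"     "age"    "status"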
It is true in general that R does not expect "extraneous" information such as level IDs to be in the same column as the data themselves. On the one hand, you can do even better in the R world by simply leaving the data as their original, meaningful values ("no", "yes", or "healthy", "sick" rather than 0, 1); on the other hand the data take up slightly more space if stored as a text file, and, more important, it becomes harder to incorporate other meta-data such as units in the file along with the data ...
Many surveys have codes for different kinds of missingness. For instance, a codebook might indicate:
0-99 Data
-1 Question not asked
-5 Do not know
-7 Refused to respond
-9 Module not asked
Stata has a beautiful facility for handling these multiple kinds of missingness, in that it allows you to assign a generic . to missing data, but more specific kinds of missingness (.a, .b, .c, ..., .z) are allowed as well. All the commands which look at missingness report answers for all the missing entries however specified, but you can sort out the various kinds of missingness later on as well. This is particularly helpful when you believe that refusal to respond has different implications for the imputation strategy than does question not asked.
I have never run across such a facility in R, but I would really like to have this capability. Are there any ways of marking several different types of NA? I could imagine creating more data (either a vector of length nrow(my.data.frame) containing the types of missingness, or a more compact index of which rows had what types of missingness), but that seems pretty unwieldy.
I know what you're looking for, and it is not implemented in R. I don't know of a package where it is implemented either, but it's not too difficult to code yourself.
A workable way is to add a data frame to the attributes, containing the codes. To avoid doubling the whole data frame and to save space, I'd store indices in that attribute data frame instead of reconstructing a complete data frame.
E.g.:
NACode <- function(x, code){
  Df <- sapply(x, function(i){
    i[i %in% code] <- NA
    i
  })
  # Linear indices of the NAs in the matrix, converted to 1-based
  # row/column indices (the -1/+1 avoids an off-by-one for NAs in
  # the last row of a column).
  id <- which(is.na(Df))
  rowid <- (id - 1) %% nrow(x) + 1
  colid <- (id - 1) %/% nrow(x) + 1
  NAdf <- data.frame(
    id, rowid, colid,
    value = as.matrix(x)[id]
  )
  Df <- as.data.frame(Df)
  attr(Df, "NAcode") <- NAdf
  Df
}
This allows you to do:
> Df <- data.frame(A = 1:10,B=c(1:5,-1,-2,-3,9,10) )
> code <- list("Missing"=-1,"Not Answered"=-2,"Don't know"=-3)
> DfwithNA <- NACode(Df,code)
> str(DfwithNA)
'data.frame': 10 obs. of 2 variables:
$ A: num 1 2 3 4 5 6 7 8 9 10
$ B: num 1 2 3 4 5 NA NA NA 9 10
- attr(*, "NAcode")='data.frame': 3 obs. of 4 variables:
..$ id : int 16 17 18
..$ rowid: num 6 7 8
..$ colid: num 2 2 2
..$ value: num -1 -2 -3
The function can also be adjusted to add an extra attribute that gives you the label for the different values (see also this question). You can back-transform with:
ChangeNAToCode <- function(x,code){
NAval <- attr(x,"NAcode")
for(i in which(NAval$value %in% code))
x[NAval$rowid[i],NAval$colid[i]] <- NAval$value[i]
x
}
> Dfback <- ChangeNAToCode(DfwithNA,c(-2,-3))
> str(Dfback)
'data.frame': 10 obs. of 2 variables:
$ A: num 1 2 3 4 5 6 7 8 9 10
$ B: num 1 2 3 4 5 NA -2 -3 9 10
- attr(*, "NAcode")='data.frame': 3 obs. of 4 variables:
..$ id : int 16 17 18
..$ rowid: num 6 7 8
..$ colid: num 2 2 2
..$ value: num -1 -2 -3
This allows you to change only the codes you want, if that is ever necessary. The function can be adapted to return all codes when no argument is given. Similar functions can be constructed to extract data based on the code; I guess you can figure that one out yourself.
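For instance, a minimal sketch of such an extractor (the function name is mine, assuming the NACode() structure above):

# Return the row indices where a given missingness code was recorded.
whichNACode <- function(x, code){
  NAval <- attr(x, "NAcode")
  NAval$rowid[NAval$value %in% code]
}
whichNACode(DfwithNA, -2)   # rows where "Not Answered" occurred
## [1] 7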
But in one line: using attributes and indices might be a nice way of doing it.
The most obvious way seems to use two vectors:
Vector 1: a data vector, where all missing values are represented using NA. For example, c(2, 50, NA, NA)
Vector 2: a vector of factors indicating the type of data. For example, factor(c(1, 1, -1, -7)), where level 1 indicates a correctly answered question.
Having this structure would give you a great deal of flexibility, since all the standard na.rm arguments still work with your data vector, but you can use more complex concepts with the factor vector.
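A minimal sketch of that structure (the variable names are mine):

value <- c(2, 50, NA, NA)            # all missingness collapses to NA
na_type <- factor(c(1, 1, -1, -7))   # parallel record of the kind of response
survey <- data.frame(value, na_type)
mean(survey$value, na.rm = TRUE)     # standard NA handling still works
survey[survey$na_type != "-7", ]     # e.g. set refusals aside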
Update following questions from @gsk3:
"Data storage will dramatically increase": the data storage will double. However, if doubling the size causes a real problem, it may be worth thinking about other strategies.
"Programs don't automatically deal with it": that's a strange comment. Some functions handle NAs in a sensible way by default. However, you want to treat the NAs differently, which implies that you will have to do something bespoke. If you want to analyse only the data where the NAs mean "Question not asked", just use a data frame subset.
"Now you have to manipulate two vectors together every time you want to conceptually manipulate a variable": I envisaged a data frame of the two vectors; I would subset the data frame based on the second vector.
"There's no standard implementation, so my solution might differ from someone else's": true. However, if an off-the-shelf package doesn't meet your needs, then (almost) by definition you want to do something different.
I should state that I have never analysed survey data (although I have analysed large biological data sets). My answers above appear quite defensive, but that's not my intention. I think your question is a good one, and I'm interested in other responses.
This is more than just a "technical" issue. You should have a thorough statistical background in missing value analysis and imputation. One solution involves playing with R and GGobi: you can assign extremely negative values to the several types of NA (pushing the NAs into the margin) and do some diagnostics "manually". You should bear in mind that there are three types of missingness:
MCAR - missing completely at random, where P(missing|observed,unobserved) = P(missing)
MAR - missing at random, where P(missing|observed,unobserved) = P(missing|observed)
MNAR - missing not at random (or non-ignorable), where P(missing|observed,unobserved) cannot be quantified in any way.
IMHO this question is more suitable for CrossValidated.
But here's a link from SO that you may find useful:
Handling missing/incomplete data in R--is there function to mask but not remove NAs?
You can dispense with NA entirely and just use the coded values. You can then also roll them up into a single global missing value. I often prefer to code without NA, since NA can cause problems in coding and I like to be able to control exactly what goes into the analysis. I have also used the string "NA" to represent NA, which often makes things easier.
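A minimal sketch of that approach (the codes and names are mine, for illustration):

x <- c(2, 50, -1, -7)           # keep the missingness codes as data
x_na <- replace(x, x < 0, NA)   # roll all codes up into a global NA when needed
mean(x_na, na.rm = TRUE)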
-Ralph Winters
I usually use them as values, as Ralph already suggested, since the type of missing value is itself data. But on one or two occasions where I mainly wanted it for documentation, I have used an attribute on the value, e.g.:
> a <- NA
> attr(a, 'na.type') <- -1
> print(a)
[1] NA
attr(,"na.type")
[1] -1
That way my analysis is clean but I still keep the documentation. But as I said: usually I keep the values.
Allan.
I'd like to add to the "statistical background" component here: Statistical Analysis with Missing Data (Little and Rubin) is a very good read on this.