‘>’ not meaningful for factors error using dplyr [duplicate] - r

This question already has answers here:
How to convert a factor to integer\numeric without loss of information?
(12 answers)
Closed 2 years ago.
I am looking at a video game dataset.
I'm trying to calculate the average user score (the User_Score column in the dataset).
The issue I'm facing is that whenever I try to use the mean function to get the User_Score average, I always get this error:
"‘>’ not meaningful for factors [1] 16" and I get NaN as a result.
I looked up this problem online and it seems to happen because I'm trying to find the mean of a categorical variable. However, when I use typeof() to check the data type of User_Score, it says it's an integer, the same as another column I successfully found the mean of (Critic_Score). I tried removing all rows that have NaN and NA values to make it work, but it hasn't helped.
Here is what I tried so far
game_data = read.csv('Video_Games_Sales_as_at_22_Dec_2016.csv')
game_data <- mutate(game_data, Critic_Score = ifelse(Critic_Score > 100, NA, Critic_Score))
game_data <- game_data[complete.cases(game_data), ]
typeof(game_data$User_Score)
typeof(game_data$Critic_Score)
#game_data$User_Score = as.numeric(game_data$User_Score)
game_data <- mutate(game_data, User_Score = ifelse(User_Score > 10, NA, User_Score))
head(game_data)
ncol(game_data)
nrow(game_data)
mean(game_data$Critic_Score, na.rm = T)
mean(game_data$User_Score,na.rm = T)
here are the results
[1] "integer"
[1] "integer"
‘>’ not meaningful for factors[1] 16
[1] 7017
[1] 70.24982
[1] NaN
I was wondering if anyone could help

It seems that it's a data-cleaning issue: some values in the User_Score column are not numeric but "tbd", which is why the column is imported as text instead of numeric. Moreover, read.csv() imports it as a factor (stringsAsFactors = TRUE was the default before R 4.0).
str(game_data$User_Score)
# Factor w/ 97 levels "","0","0.2","0.3",..: 79 1 82 79 1 1 84 65 83 1 ...
Check that with:
table(game_data$User_Score)
So you need to replace the "tbd" values. You need to decide what you want to do with them: replace with 0, replace with NA - it's up to you and depends on your insight into the dataset.
If you want to use NAs, you can just convert that from factor to characters and then to numeric values:
game_data$User_Score = as.numeric(as.character(game_data$User_Score))
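A minimal sketch of what that conversion does (the sample values here are made up, not taken from the actual dataset):

```r
# A factor column containing "tbd" entries, as read.csv() would produce
# with stringsAsFactors = TRUE (the default before R 4.0).
user_score <- factor(c("8.2", "tbd", "9", "", "7.5"))

# Converting via character: non-numeric strings ("tbd", "") become NA,
# with an "NAs introduced by coercion" warning.
scores <- suppressWarnings(as.numeric(as.character(user_score)))
scores
# [1] 8.2  NA 9.0  NA 7.5

mean(scores, na.rm = TRUE)
# [1] 8.233333
```

With the "tbd" entries turned into NAs, mean(..., na.rm = TRUE) works as expected.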

Related

Strange behaviour of as.numeric() with factor variable - gives completely different numbers to those supplied [duplicate]

This question already has answers here:
How to convert a factor to integer\numeric without loss of information?
(12 answers)
Closed 3 years ago.
I have a dataset where I am trying to convert a factor into a numeric variable. It appeared to work fine the first time I ran it, but now that I have changed the vector contents, the as.numeric() function returns different (possibly previous) values rather than the values now in the vector, despite the fact that these do not appear to be stored anywhere. It works fine if I convert to a character first, however. The code I am using is:
rm(reprex) # ensure it does not exist from a previous run
reprex <- data.frame(rbind(c("BT",8),c("BL", 1), c("TS",1), c("SA", 7), c("S", 5), c("LS",5), c("M",3), c("CV",3), c("CF",3), c("PE",3)))
names(reprex) <-c("Post Area", "Count")
reprex$Countnum <- as.numeric(reprex$Count) # should be same as Count
reprex$Countnum_char <- as.numeric(as.character(reprex$Count)) # is same as Count
head(reprex)
gives:
Post Area Count Countnum Countnum_char
1 BT 8 5 8
2 BL 1 1 1
3 TS 1 1 1
4 SA 7 4 7
5 S 5 3 5
6 LS 5 3 5
Why is this? It seems to work if I convert it to a character before converting to numeric so I can avoid it, but I am confused about why this happens at all and where the strangely-mapped (I suspect from a previous version of the dataframe) factor levels are being stored such that they persist after I remove the object.
This comes down to how R stores factors. Count = 1 is the smallest value, so it becomes factor level 1 and hence Countnum = 1; Count = 3 is the next distinct value, so it gets level 2, which means Countnum = 2, and so on. In effect, your first as.numeric() takes the internal factor level and converts that to a number. The Countnum_char version instead takes the character value (e.g. Count = 8, which happens to be factor level 5) and converts that printed value to a number, not the level.
Take a look here to shed some light on the why this is happening: https://www.dummies.com/programming/r/how-to-convert-a-factor-in-r/
The Dummies website has a lot of good free resources on R.
> numbers <- factor(c(9, 8, 10, 8, 9))
If you run str() on the above code snippet you get this output:
> str(numbers)
Factor w/ 3 levels "8","9","10": 2 1 3 1 2
R stores the values as c(2, 1, 3, 1, 2) with associated levels of c("8", "9", "10")
When converting numbers to character vectors you receive the expected output:
> as.character(numbers)
[1] "9" "8" "10" "8" "9"
However when you use as.numeric() you will get the output of the internal level representation of the vector, and not the original values.
Doing what you did
> as.numeric(as.character(numbers))
[1] 9 8 10 8 9
is exactly how you fix this! This is normal behavior for R when doing what you are doing; you've not made any mistakes here that I can see.

Changing df of integers to num doesn't work

I'm a newbie in R and don't have much experience with solving errors, so one more time I need help. I have a data frame named n_occur that has 2 columns - number and freq. Both values are integers. I wanted to get the histogram, but I got the error: argument x must be numeric, so I wanted to change both columns into num.
Firstly I tried with the simplest way:
n_occur[,] = as.numeric(as.character(n_occur[,]))
but as a result all values changed into NA. So after searching on stack, I decided to use this formula:
indx <- sapply(n_occur, is.factor)
n_occur[indx] <- lapply(n_occur[indx], function(x) as.numeric(as.character(x)))
and nothing changed, I still have integers and hist still doesn't work. Any ideas how to do that?
If anyone needs it in future, I solved the problem with mutate from dplyr:
n_occur <- mutate(n_occur, freq=as.numeric(freq))
for both columns separately. It worked!
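As a variant of that self-answer, both columns can be converted in one step (a sketch assuming dplyr >= 1.0 for across(); the toy data here is made up):

```r
library(dplyr)

# Toy data standing in for n_occur: both columns read in as factors.
n_occur <- data.frame(
  number = factor(c("1", "2", "10")),
  freq   = factor(c("5", "3", "12"))
)

# Convert every factor column in one step; going through as.character()
# keeps the printed values ("10" -> 10), not the internal level codes
# (which would give 1, 3, 2 here because levels sort as "1", "10", "2").
n_occur <- n_occur %>%
  mutate(across(where(is.factor), ~ as.numeric(as.character(.x))))

n_occur$number
# [1]  1  2 10
```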
I don't think you really have to do that; just supply the function with the actual columns instead of the entire data frame. For example:
n_occur = data.frame(
number = sample(as.integer(1:10), 10, TRUE),
freq = sample(as.integer(0:10), 10, TRUE)
)
str(n_occur)
'data.frame': 10 obs. of 2 variables:
$ number: int 9 8 8 5 6 7 8 10 3 4
$ freq : int 0 9 2 0 4 10 7 2 7 9
hist(n_occur$number) # works
hist(n_occur$freq) # works
plot(n_occur$number, n_occur$freq, type = 'h') # works
hist(n_occur) # fails since it is the whole dataframe
Also if you still want to do that, this converts a factor to numeric:
as.numeric(factor(1:10))

R Error : some group is too small for 'qda'

I used qda() from MASS to find the classifier for my data and it always reports "some group is too small for 'qda'". Is it due to the size of the test data I used for the model? I increased the test sample size from 30 to 100 and it reported the same error. Help...
set.seed(1345)
AllMono <- AllData[AllData$type == "monocot",]
MonoSample <- sample (1:nrow(AllMono), size = 100, replace = F)
set.seed(1355)
AllEudi <- AllData[AllData$type == "eudicot",]
EudiSample <- sample (1:nrow(AllEudi), size = 100, replace = F)
testData <- rbind (AllMono[MonoSample,],AllEudi[EudiSample,])
plot (testData$mono_score, testData$eudi_score, col = as.numeric(testData$type), xlab = "mono_score", ylab = "eudi_score", pch = 19)
qda (type~mono_score+eudi_score, data = testData)
Here is my data example
>head (testData)
sequence mono_score eudi_score type
PhHe_4822_404_76 DTRPTAPGHSPGAGH 51.4930 39.55000 monocot
SoBi_10_265860_58 QTESTTPGHSPSIGH 33.1408 2.23333 monocot
EuGr_5_187924_158 AFRPTSPGHSPGAGH 27.0000 54.55000 eudicot
LuAn_AOCW01152859.1_2_79 NFRPTEPGHSPGVGH 20.6901 50.21670 eudicot
PoTr_Chr07_112594_90 DFRPTAPGHSPGVGH 43.8732 56.66670 eudicot
OrSa.JA_3_261556_75 GVRPTNPGHSPGIGH 55.0986 45.08330 monocot
PaVi_contig16368_21_57 QTDSTTPGHSPSIGH 25.8169 2.50000 monocot
>testData$type <- as.factor (testData$type)
> dim (testData)
[1] 200 4
> levels (testData$type)
[1] "eudicot" "monocot" "other"
> table (testData$type)
eudicot monocot other
100 100 0
> packageDescription("MASS")
Package: MASS
Priority: recommended
Version: 7.3-29
Date: 2013-08-17
Revision: $Rev: 3344 $
Depends: R (>= 3.0.0), grDevices, graphics, stats, utils
My R version is R 3.0.2.
tl;dr my guess is that your predictor variables got made into factors or character vectors by accident. This can easily happen if you have some minor glitch in your data set, such as a spurious character in one row.
Here's a way to make up a data set that looks like yours:
set.seed(101)
mytest <- data.frame(type=rep(c("monocot","dicot"),each=100),
mono_score=runif(100,0,100),
dicot_score=runif(100,0,100))
Some useful diagnostics:
str(mytest)
## 'data.frame': 200 obs. of 3 variables:
## $ type : Factor w/ 2 levels "dicot","monocot": 2 2 2 2 2 2 ...
## $ mono_score : num 37.22 4.38 70.97 65.77 24.99 ...
## $ dicot_score: num 12.5 2.33 39.19 85.96 71.83 ...
summary(mytest)
## type mono_score dicot_score
## dicot :100 Min. : 1.019 Min. : 0.8594
## monocot:100 1st Qu.:24.741 1st Qu.:26.7358
## Median :57.578 Median :50.6275
## Mean :52.502 Mean :52.2376
## 3rd Qu.:77.783 3rd Qu.:78.2199
## Max. :99.341 Max. :99.9288
##
with(mytest,table(type))
## type
## dicot monocot
## 100 100
Importantly, the first two (str() and summary()) show us what type each variable is. Update: it turns out the third test is actually the important one in this case, since the problem was a spurious extra level: the droplevels() function should take care of this problem ...
This made-up example seems to work fine, so there must be something you're not showing us about your data set ...
library(MASS)
qda(type~mono_score+dicot_score,data=mytest)
Here's a guess. If your score variables were actually factors rather than numeric, then qda would automatically attempt to create dummy variables from them which would then make the model matrix much wider (101 columns in this example) and provoke the error you're seeing ...
bad <- transform(mytest,mono_score=factor(mono_score))
qda(type~mono_score+dicot_score,data=bad)
## Error in qda.default(x, grouping, ...) :
## some group is too small for 'qda'
I had this error as well, so I'll explain what went wrong on my side for anyone stumbling upon this in the future.
You might have factors on the variable you want to predict. Every level of that factor must have some observations; if a group has too few (or zero) observations, you will get this error.
In my case I had removed a group completely, but its level was still left in the factor.
To remove it you can do this:
df$var %<>% factor
NB. %<>% requires magrittr; the base-R equivalent is df$var <- factor(df$var).
However, even when I did this, it still failed. When I debugged further, it turned out that if you subset a data frame whose column is a factor, the unused levels are kept, so you have to re-apply factor() (or droplevels()) after subsetting.
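A minimal sketch of that subsetting behaviour (variable names here are illustrative):

```r
# Subsetting keeps unused factor levels around:
f <- factor(c("a", "b", "c"))
sub <- f[f != "c"]
levels(sub)
# [1] "a" "b" "c"   <- "c" survives with zero observations

# Re-applying factor(), or calling droplevels(), discards the empty level:
levels(droplevels(sub))
# [1] "a" "b"
```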
Your grouping variable has 3 levels, including 'other' with no cases. Since the number of predictor variables (2, i.e. mono_score and eudi_score) is larger than the number of cases in the 'other' level (0), the analysis cannot be performed.
One way to get rid of unnecessary group levels is by redefining the grouping variable as a factor after converting it to character:
testData$type <- as.factor(as.character(testData$type))
Another alternative is to define the levels of the grouping variable explicitly:
testData$type <- factor(testData$type, levels = c("eudicot", "monocot"))
If your dataset were very unbalanced and had, for example, 2 cases of 'other', it would probably make sense to exclude them from the analysis.
This message can still appear whenever the number of predictor variables is larger than the number of cases in some group level. Since you have 100 cases for each of the two remaining levels (eudicot and monocot) and only two predictors (mono_score and eudi_score), this should no longer be a problem.
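A sketch of the fix end to end, on made-up data mimicking the question's structure (assumes MASS is installed):

```r
library(MASS)

set.seed(101)
# A 3-level grouping factor where the "other" level has zero rows,
# as in the question's table() output.
d <- data.frame(
  type = factor(rep(c("monocot", "eudicot"), each = 100),
                levels = c("eudicot", "monocot", "other")),
  mono_score = runif(200, 0, 100),
  eudi_score = runif(200, 0, 100)
)
table(d$type)   # "other" has 0 cases, so qda() would fail here

# Dropping the unused level removes the empty group, and the fit succeeds:
d$type <- droplevels(d$type)
fit <- qda(type ~ mono_score + eudi_score, data = d)
nlevels(d$type)
# [1] 2
```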

as.numeric is rounding positive values / outputing NA for negative values [duplicate]

This question already has answers here:
How to convert a factor to integer\numeric without loss of information?
(12 answers)
Closed 4 years ago.
I am trying to do something with R which should be extremely simple: convert values in a data.frame to numbers, as I need to test for their values and R does not recognize them as numbers.
When I convert a decimal number to numeric, I get the correct value:
> a <- as.numeric(1.2)
> a
[1] 1.2
However, when I extract a positive value from the data.frame then use as.numeric, the number is rounded up:
> class(slices2drop)
[1] "data.frame"
> slices2drop[2,1]
[1] 1.2
Levels: 1 1.2
> a <- as.numeric(slices2drop[2,1])
> a
[1] 2
Just in case:
> a*100
[1] 200
So this is not a problem with display, the data itself is not properly handled.
Also, when the number is negative, I get NA:
> slices2drop[2,1] <- -1
> a <- as.numeric(slices2drop[2,1])
> a
[1] NA
Any idea as to what may be happening?
This problem has to do with factors. To solve your problem, first coerce your factor variable to be character and then apply as.numeric to get what you want.
> x <- factor(c(1, 1.2, 1.3)) # a factor variable
> as.numeric(x)
[1] 1 2 3
Integers are returned, one per level. There are 3 levels (1, 1.2 and 1.3), therefore 1, 2, 3 is returned.
> as.numeric(as.character(x)) # this is what you're looking for
[1] 1.0 1.2 1.3
Actually as.numeric is not rounding your numbers; it returns the unique integer code for each level of your factor variable.
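An alternative from the linked duplicate: instead of converting the whole vector to character, convert just the levels and index them by the factor's integer codes.

```r
x <- factor(c(1, 1.2, 1.3))

# Convert the (short) character vector of levels once, then index it by
# the factor's internal integer codes; this avoids building a full
# character copy of the data vector.
as.numeric(levels(x))[x]
# [1] 1.0 1.2 1.3
```

This gives the same result as as.numeric(as.character(x)) and is typically faster on long vectors, since only the levels are parsed.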
I faced a similar situation where converting a factor to numeric generated incorrect results.
When you type ?factor, the Warning section of the help page explains this complexity very well and provides the solution as well. It's a good place to start.
Another thing to consider is that such a conversion turns NULLs into NAs.

How to sum data.frame column values?

I have a data frame with several columns; some numeric and some character. How to compute the sum of a specific column? I’ve googled for this and I see numerous functions (sum, cumsum, rowsum, rowSums, colSums, aggregate, apply) but I can’t make sense of it all.
For example suppose I have a data frame people with the following columns
people <- read.table(
text =
"Name Height Weight
Mary 65 110
John 70 200
Jane 64 115",
header = TRUE
)
…
How do I get the sum of all the weights?
You can just use sum(people$Weight).
sum sums up a vector, and people$Weight retrieves the weight column from your data frame.
Note - you can get built-in help by using ?sum, ?colSums, etc. (by the way, colSums will give you the sum for each column).
To sum values in data.frame you first need to extract them as a vector.
There are several way to do it:
# $ operator
x <- people$Weight
x
# [1] 110 200 115
Or using [, ], similar to a matrix:
x <- people[, 'Weight']
x
# [1] 110 200 115
Once you have the vector you can use any vector-to-scalar function to aggregate the result:
sum(people[, 'Weight'])
# [1] 425
If you have NA values in your data, you should specify na.rm parameter:
sum(people[, 'Weight'], na.rm = TRUE)
You can use the tidyverse package to solve it; it would look like the following (which is more readable for me):
library(tidyverse)
people %>%
summarise(sum(Weight, na.rm = TRUE))
When you have NA values in the column:
sum(as.numeric(JuneData1$Account.Balance), na.rm = TRUE)
To order the columns by their sums, largest first (note that colSums() needs all-numeric input, so select the numeric columns):
order(colSums(people[, c('Height', 'Weight')]), decreasing = TRUE)
With many columns (20+) you can use an index range instead:
order(colSums(people[, 5:25]), decreasing = TRUE) ## keeps the first 4 columns out of the sum
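To make the colSums() route concrete, here is a self-contained sketch using the question's data:

```r
# Rebuilding the example data frame from the question:
people <- data.frame(
  Name   = c("Mary", "John", "Jane"),
  Height = c(65, 70, 64),
  Weight = c(110, 200, 115)
)

# colSums() needs all-numeric input, so select the numeric columns first:
num_cols <- sapply(people, is.numeric)
colSums(people[, num_cols])
# Height Weight
#    199    425

# Column indices ordered by total, largest first:
order(colSums(people[, num_cols]), decreasing = TRUE)
# [1] 2 1
```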
