In R I am trying to delete rows within a data frame (ants) that have a negative value in the column Turbidity. I have tried
ants<-ants[ants$Turbidity<0,]
but it returns the following error:
Warning message:
In Ops.factor(ants$Turbidity, 0) : < not meaningful for factors
Any ideas why this may be? Perhaps I need to make the negative values
NA before I then delete all NAs?
Any ideas much appreciated, thank you!
@Joris: the result is
str(ants$Turbidity)
num [1:291] 0 0 -0.1 -0.2 -0.2 -0.5 0.1 -0.4 0 -0.2 ...
Marek is right, it's a data problem. Now be careful if you use as.numeric(ants$Turbidity), as that will always be positive: it gives the internal factor level codes (integers from 1 to nlevels(ants$Turbidity)), not the numeric values themselves.
Try this:
tt <- as.numeric(as.character(ants$Turbidity))
which(is.na(tt))
It will give you the indices of the values that were not numeric in the first place. This should enable you to clean up your data first.
eg:
> Turbidity <- factor(c(1,2,3,4,5,6,7,8,9,0,"a"))
> tt <- as.numeric(as.character(Turbidity))
Warning message:
NAs introduced by coercion
> which(is.na(tt))
[1] 11
You shouldn't use the as.numeric(as.character(...)) construct to convert problematic data without cleaning it first, as the NAs it generates will mess with everything downstream. E.g.:
> Turbidity[tt > 5]
[1] 6 7 8 9 <NA>
Levels: 0 1 2 3 4 5 6 7 8 9 a
Always do summary(ants) after reading in data, and check if you get what you expect.
It will save you lots of problems. Numeric data is prone to magic conversion to character or factor types.
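As a minimal sketch of what that check catches (with a made-up stand-in for the real ants data, and one stray non-numeric entry forcing the factor conversion):
ants <- data.frame(Turbidity = factor(c("0", "-0.1", "0.2", "oops")))
summary(ants)   # per-level counts instead of Min./Median/Max. betray the factor
str(ants)       # 'Factor w/ 4 levels ...' confirms it directly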
EDIT: I forgot about the as.character conversion (see Joris's comment).
The message means that ants$Turbidity is a factor. It will work when you do
ants <- ants[as.numeric(as.character(ants$Turbidity)) >= 0,]
or
ants <- subset(ants, as.numeric(as.character(Turbidity)) >= 0)
But the real problem is that your data are not prepared for analysis. Such conversion should be done at the beginning. Be careful, because there could also be non-numeric values.
This should also work using the tidyverse (assuming the column already has the correct data type).
ants %>% dplyr::filter(Turbidity >= 0)
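If the column is still a factor, one possible sketch is to do the conversion inside the same pipeline, using the as.numeric(as.character(...)) idiom from the answers above:
library(dplyr)
ants <- ants %>%
  mutate(Turbidity = as.numeric(as.character(Turbidity))) %>%   # undo the factor first
  filter(Turbidity >= 0)                                        # then drop the negatives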
I'm currently writing my master's thesis, and while running a regression I found some outliers which I would like to either delete or replace with zero. I have a data frame with company names and their daily returns from 2010 until 2021.
The data frame is called xsr. I want to find the outliers, which are above 0.5 or below -0.5. I managed to create a logical matrix according to this condition: xsr_short <- xsr[,c(2:214)] < 0.5. Then I tried to pick the FALSE values with outliers <- subset(xsr_short, xsr_short = FALSE), which just gives me back the initial xsr_short.
I also tried it with the select command: xsr_short <- select(xsr, c('ABBN SW Equity':'ZWM SW Equity') < 0.5).
The output to this is:
Error in `select()`:
! NA/NaN argument
Backtrace:
1. dplyr::select(xsr, c("ABBN SW Equity":"ZWM SW Equity") < 0.5)
22. base::.handleSimpleError(`<fn>`, "NA/NaN argument", base::quote("ABBN SW Equity":"ZWM SW Equity"))
23. rlang (local) h(simpleError(msg, call))
24. handlers[[1L]](cnd)
Warning messages:
1: In eval_tidy(expr, context_mask) : NAs introduced by coercion
2: In eval_tidy(expr, context_mask) : NAs introduced by coercion
I still need to add the second condition (> -0.5) and then delete the values that fall outside this range.
Thank you very much in advance for your help and your time!
It seems like you are less concerned with an actual subset and more with swapping out unwanted values while preserving the rest of your data for the regression. In that case, the tidyverse may be helpful. First, load the package and create a fake dataset:
#### Load Tidyverse ####
library(tidyverse)
#### Make Data Frame ####
data <- data.frame(IV = c("Control","Treatment",
"Control","Treatment"),
DV = c(-9999,2,4,5555))
data
Which gives you this:
IV DV
1 Control -9999
2 Treatment 2
3 Control 4
4 Treatment 5555
From there you can simply use mutate and ifelse to replace the unwanted values with NA, saving the result into a new version of the data:
#### Swap Outliers with NA Values ####
clean.data <- data %>%
mutate(DV = ifelse(DV < 0,
NA,
ifelse(DV > 100,
NA,
DV)))
clean.data
Which gives you this:
IV DV
1 Control NA
2 Treatment 2
3 Control 4
4 Treatment NA
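For the actual xsr frame, the same idea can be applied to all return columns at once. A sketch, assuming (as in the question) that the return columns are columns 2:214 and the cutoffs are -0.5 and 0.5, with the tidyverse already loaded:
xsr_clean <- xsr %>%
  mutate(across(2:214, ~ ifelse(between(.x, -0.5, 0.5), .x, NA_real_)))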
As some others have noted, it's generally bad practice to delete outliers from your data unless you have a defensible reason to do so. So if you do remove them, make sure your thesis includes a justifiable explanation of why you removed the values.
I am using the ordinal package and the clmm function in R, but I keep getting the following error despite ensuring that my response variable is ordinal (i.e., an ordered factor):
Error in getY(fullmf) : response needs to be a factor
Here is the code with the error, also showing how R already understands the variable 'helpfulness' to be an ordered factor.
> library(ordinal)
> hyp1.model1<-clmm(helpfulness~reflectiontype+session+(1+reflectiontype|participant),data=hyp1data)
Error in getY(fullmf) : response needs to be a factor
> unique(helpfulness)
[1] 4 2 1 3 0
Levels: 0 < 1 < 2 < 3 < 4
> class(helpfulness)
[1] "ordered" "factor"
I do not have much to add to what has been said by Roland and Ryan J Field. I am just trying to explain it.
Without looking at your data, I could be totally wrong.
I have tested this using the example data wine. The model only gave such an error message when the left side of the ~ was numeric. It did not give me any error messages when I replaced other variables with numeric values.
It appears that you might have two helpfulness objects: a standalone vector helpfulness and a column helpfulness in your data hyp1data.
> unique(helpfulness)
[1] 4 2 1 3 0
Levels: 0 < 1 < 2 < 3 < 4
> class(helpfulness)
[1] "ordered" "factor"
These only tell you that the standalone vector helpfulness is an ordered factor; the column hyp1data$helpfulness might still be numeric. You may want to try class(hyp1data$helpfulness) to check this.
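If that check does come back "numeric", a minimal sketch of the fix (assuming the 0-4 levels shown above) is to convert the column inside the data frame before refitting:
hyp1data$helpfulness <- factor(hyp1data$helpfulness,
                               levels = 0:4, ordered = TRUE)
class(hyp1data$helpfulness)   # should now be "ordered" "factor"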
Please let me know if this is not the case. I will delete this answer.
I am trying to do something in R which should be extremely simple: convert values in a data.frame to numbers. I need to test their values, and R does not recognize them as numbers.
When I convert a decimal number to numeric, I get the correct value:
> a <- as.numeric(1.2)
> a
[1] 1.2
However, when I extract a positive value from the data.frame then use as.numeric, the number is rounded up:
> class(slices2drop)
[1] "data.frame"
> slices2drop[2,1]
[1] 1.2
Levels: 1 1.2
> a <- as.numeric(slices2drop[2,1])
> a
[1] 2
Just in case:
> a*100
[1] 200
So this is not a problem with display, the data itself is not properly handled.
Also, when the number is negative, I get NA:
> slices2drop[2,1] <- -1
> a <- as.numeric(slices2drop[2,1])
> a
[1] NA
Any idea as to what may be happening?
This problem has to do with factors. To solve it, first coerce your factor variable to character and then apply as.numeric to get what you want.
> x <- factor(c(1, 1.2, 1.3)) # a factor variable
> as.numeric(x)
[1] 1 2 3
Integers are returned, one per level. There are 3 levels (1, 1.2 and 1.3), therefore 1, 2, 3 is returned.
> as.numeric(as.character(x)) # this is what you're looking for
[1] 1.0 1.2 1.3
Actually as.numeric is not rounding your numbers; it returns the unique integer code for each level in your factor variable.
I faced a similar situation where converting a factor to numeric generated incorrect results.
When you type ?factor, the Warning section explains this complexity very well and also provides the solution.
It's a good place to start.
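For reference, the idiom that Warning section recommends (slightly more efficient than as.numeric(as.character(x))):
x <- factor(c(1, 1.2, 1.3))
as.numeric(levels(x))[x]   # index the numeric levels by the factor's codes
# [1] 1.0 1.2 1.3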
Another thing to consider is that such a conversion transforms NULLs into NAs.
So I am having some issues with NA values in the residuals of an lm cross-sectional regression in R.
The issue isn't the NA values themselves, it's the way R presents them.
For example:
test$residuals
# 1 2 4 5
# 0.2757677 -0.5772193 -5.3061303 4.5102816
test$residuals[3]
# 4
# -5.30613
In this simple example an NA value makes one of the residuals go missing. When I extract the residuals I can clearly see the third index missing. So far so good, no complaints here. The problem is that the corresponding numeric vector is now one item shorter, so the third index is actually the fourth. How can I make R return the residuals like this instead, i.e., explicitly showing NA rather than skipping an index?
test$residuals
# 1 2 3 4 5
# 0.2757677 -0.5772193 NA -5.3061303 4.5102816
I need to keep track of all individual residuals so it would make my life much easier if I could extract them this way instead.
I just found this by googling around a bit deeper. The resid function on an lm fitted with na.action=na.exclude is the way to go.
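A minimal sketch of that, with made-up data rather than the asker's:
d <- data.frame(x = c(1, 2, NA, 4, 5), y = c(1.1, 2.3, 3.0, 3.9, 5.2))
fit <- lm(y ~ x, data = d, na.action = na.exclude)
resid(fit)   # position 3 comes back as NA instead of being dropped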
Yet another idea is to take advantage of the row names of the data frame provided as input to lm. In that case the residuals retain the names from the source data: accessing the residuals from your example would give -5.3061303 for test$residuals["4"] and NA for test$residuals["3"].
However, this does not exactly answer your question. One approach to doing exactly what you asked for in terms of getting the NA values back into the residuals is illustrated below:
> D<-data.frame(x=c(NA,2,3,4,5,6),y=c(2.1,3.2,4.9,5,6,7),residual=NA)
> Z<-lm(y~x,data=D)
> D[names(Z$residuals),"residual"]<-Z$residuals
> D
x y residual
1 NA 2.1 NA
2 2 3.2 -0.28
3 3 4.9 0.55
4 4 5.0 -0.22
5 5 6.0 -0.09
6 6 7.0 0.04
If you are doing predictions based on the regression results, you may want to specify na.action=na.exclude in lm. See the help page for na.omit for a discussion. Note that simply specifying na.exclude does not put the NA values back into the Z$residuals component itself.
As noted in a prior answer, resid (a synonym for residuals) provides a generic accessor whose result contains the desired NA values when na.exclude was specified in lm. Using resid is probably the more general and cleaner approach. In that case, the code for the above example changes to:
> D<-data.frame(x=c(NA,2,3,4,5,6),y=c(2.1,3.2,4.9,5,6,7),residual=NA)
> Z<-lm(y~x,data=D,na.action=na.exclude)
> D$residuals<-residuals(Z)
Here is an illustrated strategy using a slightly modified example from the lm help page. It is a direct application of the definition of a residual:
## Annette Dobson (1990) "An Introduction to Generalized Linear Models".
## Page 9: Plant Weight Data.
# Two NA's introduced
weight <- c(4.17,5.58,NA,6.11,4.50,4.61,5.17,4.53,5.33,5.14,
4.81,4.17,4.41,3.59,5.87,3.83,6.03,NA,4.32,4.69)
group <- gl(2,10,20, labels=c("Ctl","Trt"))
lm.D9 <- lm(weight ~ group, na.action = na.exclude)  # keep placeholders for the NA rows
rr2 <- weight - predict(lm.D9)  # predict() returns NA-padded fitted values, so lengths match
> rr2
[1] -0.8455556 0.5644444 NA 1.0944444 -0.5155556 -0.4055556 0.1544444
[8] -0.4855556 0.3144444 0.5044444 0.1744444 -0.4655556 -0.2255556 -1.0455556
[15] 1.2344444 -0.8055556 1.3944444 NA -0.6955556 -0.3255556
I think it would be dangerous to directly modify the lm object so that lm.D9$residuals would return that result.
Many surveys have codes for different kinds of missingness. For instance, a codebook might indicate:
0-99 Data
-1 Question not asked
-5 Do not know
-7 Refused to respond
-9 Module not asked
Stata has a beautiful facility for handling these multiple kinds of missingness: it allows you to assign a generic . to missing data, but more specific kinds of missingness (.a, .b, .c, ..., .z) are allowed as well. All the commands that look at missingness report answers for all the missing entries however specified, but you can still sort out the various kinds of missingness later. This is particularly helpful when you believe that a refusal to respond has different implications for the imputation strategy than a question that was not asked.
I have never run across such a facility in R, but I would really like to have this capability. Are there any ways of marking several different types of NA? I could imagine creating more data (either a vector of length nrow(my.data.frame) containing the types of missingness, or a more compact index of which rows had what types of missingness), but that seems pretty unwieldy.
I know what you're looking for, and it is not implemented in R. I don't know of a package where it is implemented either, but it's not too difficult to code yourself.
A workable way is to add a data frame containing the codes to the attributes. To avoid doubling the whole data frame and to save space, I'd store only the indices in that attribute data frame instead of reconstructing a complete copy. E.g.:
NACode <- function(x, code){
  # replace every coded value with NA, column by column
  Df <- sapply(x, function(i){
    i[i %in% code] <- NA
    i
  })
  # remember where the codes were and what they were
  id <- which(is.na(Df))
  rowid <- (id - 1L) %% nrow(x) + 1L    # the -1L/+1L keeps the last row from mapping to 0
  colid <- (id - 1L) %/% nrow(x) + 1L
  NAdf <- data.frame(
    id, rowid, colid,
    value = as.matrix(x)[id]            # the original codes at those positions
  )
  Df <- as.data.frame(Df)
  attr(Df, "NAcode") <- NAdf
  Df
}
This allows you to do:
> Df <- data.frame(A = 1:10,B=c(1:5,-1,-2,-3,9,10) )
> code <- list("Missing"=-1,"Not Answered"=-2,"Don't know"=-3)
> DfwithNA <- NACode(Df,code)
> str(DfwithNA)
'data.frame': 10 obs. of 2 variables:
$ A: num 1 2 3 4 5 6 7 8 9 10
$ B: num 1 2 3 4 5 NA NA NA 9 10
- attr(*, "NAcode")='data.frame': 3 obs. of 4 variables:
..$ id : int 16 17 18
..$ rowid: int 6 7 8
..$ colid: int 2 2 2
..$ value: num -1 -2 -3
The function can also be adjusted to add an extra attribute giving you the labels for the different values; see also this question. You can transform back with:
ChangeNAToCode <- function(x, code){
  NAval <- attr(x, "NAcode")
  # put the requested codes back in their original positions
  for(i in which(NAval$value %in% code))
    x[NAval$rowid[i], NAval$colid[i]] <- NAval$value[i]
  x
}
> Dfback <- ChangeNAToCode(DfwithNA,c(-2,-3))
> str(Dfback)
'data.frame': 10 obs. of 2 variables:
$ A: num 1 2 3 4 5 6 7 8 9 10
$ B: num 1 2 3 4 5 NA -2 -3 9 10
- attr(*, "NAcode")='data.frame': 3 obs. of 4 variables:
..$ id : int 16 17 18
..$ rowid: int 6 7 8
..$ colid: int 2 2 2
..$ value: num -1 -2 -3
This allows you to restore only the codes you want, if that is ever necessary. The function can be adapted to return all codes when no argument is given. Similar functions can be constructed to extract data based on the codes; I guess you can figure that out yourself.
In one line: using attributes and indices might be a nice way of doing it.
The most obvious way seems to use two vectors:
Vector 1: a data vector, where all missing values are represented using NA. For example, c(2, 50, NA, NA)
Vector 2: a vector of factors indicating the type of data. For example, factor(c(1, 1, -1, -7)), where level 1 indicates a correctly answered question.
Having this structure gives you a great deal of flexibility: all the standard na.rm arguments still work on your data vector, but you can use more complex concepts with the factor vector.
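A minimal sketch of that two-vector structure (codes invented for illustration):
values <- c(2, 50, NA, NA)
miss_type <- factor(c("data", "data", "do not know", "refused"))
mean(values, na.rm = TRUE)      # standard tools still work on the data vector
which(miss_type == "refused")   # while each kind of missingness stays recoverable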
Update following questions from @gsk3
"Data storage will dramatically increase": yes, the data storage will double. However, if doubling the size causes a real problem, it may be worth thinking about other strategies.
"Programs don't automatically deal with it": that's a strange comment. Some functions handle NAs in a sensible way by default. However, you want to treat the NAs differently, which implies you will have to do something bespoke anyway. If you just want to analyse the data where the NAs are "Question not asked", use a data frame subset.
"Now you have to manipulate two vectors together every time you want to conceptually manipulate a variable": I envisaged a data frame of the two vectors, and would subset that data frame based on the second vector.
"There's no standard implementation, so my solution might differ from someone else's": true. However, if an off-the-shelf package doesn't meet your needs, then (almost) by definition you want to do something different.
I should state that I have never analysed survey data (although I have analysed large biological data sets). My answers above may appear quite defensive, but that's not my intention. I think your question is a good one, and I'm interested in other responses.
This is more than just a "technical" issue: you should have a thorough statistical background in missing value analysis and imputation. One approach is to play with R and GGobi: you can assign extremely negative values to the several types of NA (pushing the NAs into the margin) and do some diagnostics "manually". Bear in mind that there are three types of missingness:
MCAR - missing completely at random, where P(missing|observed,unobserved) = P(missing)
MAR - missing at random, where P(missing|observed,unobserved) = P(missing|observed)
MNAR - missing not at random (or non-ignorable), where P(missing|observed,unobserved) cannot be quantified in any way.
IMHO this question is more suitable for CrossValidated.
But here's a link from SO that you may find useful:
Handling missing/incomplete data in R--is there function to mask but not remove NAs?
You can dispense with NA entirely and just use the coded values. You can then also roll them up into a global missing value. I often prefer to code without NA, since NA can cause problems in coding and I like to be able to control exactly what goes into the analysis. I have also used the string "NA" to represent NA, which often makes things easier.
-Ralph Winters
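A minimal sketch of that roll-up idea, assuming (purely for illustration) that all negative values are missingness codes:
x <- c(12, 47, -1, -7, 33, -9)      # raw data keeps the codes
x_global <- replace(x, x < 0, NA)   # roll every code up to a single NA for analysis
mean(x_global, na.rm = TRUE)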
I usually keep them as values, as Ralph already suggested, since the type of missing value is itself data. But on one or two occasions where I mainly wanted it for documentation, I have used an attribute on the value, e.g.:
> a <- NA
> attr(a, 'na.type') <- -1
> print(a)
[1] NA
attr(,"na.type")
[1] -1
That way my analysis is clean but I still keep the documentation. But as I said: usually I keep the values.
Allan.
I'd like to add to the "statistical background" component here. Statistical Analysis with Missing Data is a very good read on this.