factor in R Programming

I'm quite new to R programming, so I am just learning here and there. I recently came across these lines:
x <- as.factor(rep(1:4, 2))
x
# [1] 1 2 3 4 1 2 3 4
# Levels: 1 2 3 4
But if I do
x <- factor(rep(1:4, 2))
that gives me the same result. So what is the difference between factor and as.factor? I understand how factor takes the distinct values and makes them levels, but I don't see what the exact differences between factor and as.factor are.

The four most common atomic data types in R (ordered from least to most flexible) are: logical, integer, double, and character.
All elements of an atomic vector must be the same type, so when we attempt to combine different types, they will be coerced to the most flexible type.
For example:
str(c("a", 1))
#> chr [1:2] "a" "1"
Coercion often happens automatically. We can also coerce explicitly with as.factor(), as.character(), as.double(), as.integer(), or as.logical().
So, as pointed out by @alistaire, to understand the difference between factor() and as.factor(), you must understand 'coercion'.
You can read about coercion at https://www.safaribooksonline.com/library/view/r-in-a/9781449358204/ch05s08.html
Also, as a beginner, you should read about data structures in R at
http://adv-r.had.co.nz/Data-structures.html
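To make the difference concrete, here is a small sketch based on the behavior documented in ?factor: as.factor() returns an existing factor unchanged (the help page calls it an abbreviated, sometimes faster, form of factor()), while factor() recomputes the levels and drops unused ones.
x <- factor(c("a", "b"), levels = c("a", "b", "c"))
levels(factor(x))     # "a" "b"      -- factor() drops the unused level "c"
levels(as.factor(x))  # "a" "b" "c"  -- as.factor() returns x unchanged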

Related

How do I get the number of levels of a factor in a tibble?

This seems pretty basic, but the number of verbs in the tidyverse is huge now and I don't know which package to look in for this.
Here is the problem. I have a tibble
df <- tibble(f1 = factor(rep(letters[1:3], 5)),
             c1 = rnorm(15))
Now if I use the $ operator I can easily find out how many levels are in the factor.
nlevels(df$f1)
# [1] 3
But if I use the [] operator it returns an incorrect number of levels.
nlevels(df[,"f1"])
# [1] 0
Now if df is a data.frame and not a tibble, the nlevels() function works with both the $ operator and the [] operator.
So does anyone know the tidyverse equivalent of nlevels() that works on both data.frames and tibbles?
Elaborating on the answer from timcdlucas (and the comments from r2evans), the issue here is the behavior of the various forms of the extract operator, not the behavior of tibble. Why? A tibble is actually a kind of data.frame, as illustrated when we use the str() function on a tibble.
> library(dplyr)
> aTibble <- tibble(f1 = factor(rep(letters[1:3],5)),
+ c1 = rnorm(15))
>
> # illustrate that aTibble is actually a type of data frame
> str(aTibble)
tibble [15 × 2] (S3: tbl_df/tbl/data.frame)
$ f1: Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3 1 ...
$ c1: num [1:15] -0.5829 0.3682 1.1854 -0.6309 -0.0268 ...
There are four forms of the extract operator in R: [, [[, $, and @, as noted in What is the meaning of the dollar sign $ in R function?.
The first form, [, can be used to extract content from vectors, lists, matrices, or data frames. When used with a data frame (or tibble in the tidyverse), it returns an object of type data.frame or tibble unless the drop = TRUE argument is included, as noted in the question comments by r2evans.
Since the default setting of drop in the tibble [ method is FALSE, it follows that df[,"f1"] produces an unexpected or "wrong" result for the code posted with the original question.
library(dplyr)
aTibble <- tibble(f1 = factor(rep(letters[1:3],5)),
c1 = rnorm(15))
# produces unexpected answer
nlevels(aTibble[,"f1"])
> nlevels(aTibble[,"f1"])
[1] 0
The drop argument is used when extracting from matrices or arrays (i.e. any object that has a dim attribute), as explained in the help for the drop() function.
> dim(aTibble)
[1] 15 2
>
When we set drop = TRUE, the extract operator returns the simplest structure possible; that is, all extents of length 1 are removed. In the case of the original question, drop = TRUE with the extract operator returns a factor, which is the right type of input for nlevels().
> nlevels(aTibble[,"f1",drop=TRUE])
[1] 3
The [[ and $ forms of the extract operator extract a single object, so they return objects of type factor, the required input to nlevels().
> str(aTibble$f1)
Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3 1 ...
> nlevels(aTibble$f1)
[1] 3
>
> # produces expected answer
> str(aTibble[["f1"]])
Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3 1 ...
> nlevels(aTibble[["f1"]])
[1] 3
>
The fourth form of the extract operator, @ (known as the slot operator), is used with formally defined objects built with the S4 object system, and is not relevant for this question.
Conclusion: Base R is still relevant when using the Tidyverse
Per tidyverse.org, the tidyverse is a collection of R packages that share an underlying philosophy, grammar, and data structures. When one becomes familiar with the tidyverse family of packages, it's possible to do many things in R without understanding the fundamentals of how Base R works.
That said, when one incorporates Base R functions or functions from packages outside the tidyverse into tidyverse-style code, it's important to know key Base R concepts.
I think you might need to use [[ rather than [, e.g.,
> nlevels(df[["f1"]])
[1] 3
df[,"f1"] returns a tibble with one column. So you're doing nlevels on an entire tibble which doesn't make sense.
df %>% pull('f1') %>% nlevels
gives you what you want.
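As a follow-up, if you want level counts for several columns at once, a dplyr-style sketch (assuming dplyr >= 1.0 for across(), and the df from the question) might look like:
df %>% summarise(across(where(is.factor), nlevels))
# # A tibble: 1 x 1
#      f1
#   <int>
# 1     3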

as.numeric is rounding positive values / outputing NA for negative values [duplicate]

This question already has answers here:
How to convert a factor to integer\numeric without loss of information? (12 answers)
Closed 4 years ago.
I am trying to do something in R which should be extremely simple: convert values in a data.frame to numbers, as I need to test their values and R does not recognize them as numbers.
When I convert a decimal number to numeric, I get the correct value:
> a <- as.numeric(1.2)
> a
[1] 1.2
However, when I extract a positive value from the data.frame and then use as.numeric, the number is rounded up:
> class(slices2drop)
[1] "data.frame"
> slices2drop[2,1]
[1] 1.2
Levels: 1 1.2
> a <- as.numeric(slices2drop[2,1])
> a
[1] 2
Just in case:
> a*100
[1] 200
So this is not a problem with display, the data itself is not properly handled.
Also, when the number is negative, I get NA:
> slices2drop[2,1] <- -1
> a <- as.numeric(slices2drop[2,1])
> a
[1] NA
Any idea as to what may be happening?
This problem has to do with factors. To solve it, first coerce your factor variable to character and then apply as.numeric to get what you want.
> x <- factor(c(1, 1.2, 1.3)) # a factor variable
> as.numeric(x)
[1] 1 2 3
Integers are returned, one per level; there are 3 levels (1, 1.2, and 1.3), so 1 2 3 is returned.
> as.numeric(as.character(x)) # this is what you're looking for
[1] 1.0 1.2 1.3
Actually, as.numeric is not rounding your numbers; it returns a unique integer for each level in your factor variable.
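A slightly more efficient idiom, recommended in the Warning section of ?factor, indexes the levels directly instead of converting the whole vector to character:
x <- factor(c(1, 1.2, 1.3))
as.numeric(levels(x))[x]
# [1] 1.0 1.2 1.3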
I faced a similar situation where the conversion of a factor into numeric would generate incorrect results.
When you type ?factor, the Warning section of the help page explains this complexity very well and provides the solution to this problem as well. It's a good place to start.
Another thing to consider is that such a conversion will transform NULLs into NAs.
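As a quick illustration of that last point: any level that cannot be parsed as a number, such as the string "NULL", becomes NA with a coercion warning.
x <- factor(c("1.2", "NULL"))
as.numeric(as.character(x))
# [1] 1.2  NA
# Warning message: NAs introduced by coercion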

Data.frame with both characters and numerics in one column

I have a function I'm using in R that takes input for several parameters, either as a numeric (1) or as a character (NULL). The default is NULL.
I want to apply the function using all possible combinations of parameters, so I used expand.grid to try and create a dataframe which stores these. However, I am running into problems with creating an object that contains both numerics and characters in one column.
This is what I've tried:
comb <- expand.grid(c("NULL", 1), c("NULL", 1), stringsAsFactors = FALSE), which returns:
comb
Var1 Var2
1 NULL NULL
2 1 NULL
3 NULL 1
4 1 1
with all entries characters:
class(comb[1,1])
[1] "character"
If I now try and insert a numeric into a specific spot, I still receive a character:
comb[2,1]<-as.numeric(1)
class(comb[2,1])
[1] "character"
I've also tried it using stringsAsFactors=TRUE, or using expand.grid(c(0,1), c(0,1)) and then switching out the 0 for NULL, but I always have the exact same problem: whenever I do this, I do not get a numeric 1.
Manually creating an object using cbind and then inserting the NULL as a character also does not help. I'd be grateful for a pointer, or a work-around for running the function with all possible combinations of parameters.
As you have been told, generally speaking, columns of data frames need to be a single type. It's hard to solve your specific problem, because the solution is likely not really "putting multiple types into a single column" but rather reorganizing your other, unseen code to work within this restriction.
As I suggested, it probably will be better to use the built in NA value as expand.grid(c(NA,1),c(NA,1)) and then modify your function to use NA as an input. Or, of course, you could just use some "special" numeric value, like -1, or -99 or something.
The related issue that I mentioned is that you really should avoid using the character string "NULL" to mean anything, since NULL is a special value in R, and confusion will ensue.
These sorts of strategies would all be preferable to mixing types, and using character strings of reserved words like NULL.
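For example, a minimal sketch of the suggested NA-based grid; both columns stay numeric, with NA standing in for the "not supplied" case:
comb <- expand.grid(Var1 = c(NA, 1), Var2 = c(NA, 1))
comb
#   Var1 Var2
# 1   NA   NA
# 2    1   NA
# 3   NA    1
# 4    1    1
sapply(comb, class)
#      Var1      Var2
# "numeric" "numeric"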
All that said, it technically is possible to get around this, but it is awkward, and not a good idea.
> d <- data.frame(x = 1:5)
> d$y <- list("a",1,2,3,"b")
> d
x y
1 1 a
2 2 1
3 3 2
4 4 3
5 5 b
> str(d)
'data.frame': 5 obs. of 2 variables:
$ x: int 1 2 3 4 5
$ y:List of 5
..$ : chr "a"
..$ : num 1
..$ : num 2
..$ : num 3
..$ : chr "b"

AUTORECODE from SPSS to R

I want to write a function that is doing the same as the SPSS command AUTORECODE.
AUTORECODE recodes the values of string and numeric variables to consecutive integers and puts the recoded values into a new variable called a target variable.
At first I tried this way:
AUTORECODE <- function(variable = NULL){
  A <- sort(unique(variable))
  B <- seq(1:length(unique(variable)))
  REC <- Recode(var = variable, recodes = "A = B")
  return(REC)
}
But this causes an error. I think the problem is caused by how A and B are passed to the recodes argument. That's why I tried
eval(parse(text = paste("REC <- Recode(var = variable, recodes = 'c(",A,") = c(",B,")')")))
within the function. But this isn't the right solution.
Ideas?
factor may be simply what you need, as James suggested in a comment: it stores the values as integers behind the scenes (as seen by str) and just prints the corresponding labels. This can also be very useful, as R has lots of commands for working with factors appropriately; for example, when fitting linear models it makes all the "dummy" variables for you.
> x <- LETTERS[c(4,2,3,1,3)]
> f <- factor(x)
> f
[1] D B C A C
Levels: A B C D
> str(f)
Factor w/ 4 levels "A","B","C","D": 4 2 3 1 3
If you do just need the numbers, use as.integer on the factor.
> n <- as.integer(f)
> n
[1] 4 2 3 1 3
An alternate solution is to use match, but if you're starting with floating-point numbers, watch out for floating-point traps. factor converts everything to characters first, which effectively rounds floating-point numbers to a certain number of digits, making floating-point traps less of a concern.
> match(x, sort(unique(x)))
[1] 4 2 3 1 3
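Putting the pieces together, a minimal AUTORECODE-style wrapper built on match (a sketch only; the "labels" attribute is an illustrative stand-in, not anything SPSS-compatible) might look like:
AUTORECODE <- function(variable) {
  lev <- sort(unique(variable))   # one level per distinct value
  rec <- match(variable, lev)     # consecutive integers 1..n
  attr(rec, "labels") <- lev      # keep the original values for reference
  rec
}
AUTORECODE(LETTERS[c(4, 2, 3, 1, 3)])
# [1] 4 2 3 1 3
# attr(,"labels")
# [1] "A" "B" "C" "D"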

How do I handle multiple kinds of missingness in R?

Many surveys have codes for different kinds of missingness. For instance, a codebook might indicate:
0-99 Data
-1 Question not asked
-5 Do not know
-7 Refused to respond
-9 Module not asked
Stata has a beautiful facility for handling these multiple kinds of missingness, in that it allows you to assign a generic . to missing data, but more specific kinds of missingness (.a, .b, .c, ..., .z) are allowed as well. All the commands which look at missingness report answers for all the missing entries however specified, but you can sort out the various kinds of missingness later on as well. This is particularly helpful when you believe that refusal to respond has different implications for the imputation strategy than does question not asked.
I have never run across such a facility in R, but I would really like to have this capability. Are there any ways of marking several different types of NA? I could imagine creating more data (either a vector of length nrow(my.data.frame) containing the types of missingness, or a more compact index of which rows had what types of missingness), but that seems pretty unwieldy.
I know what you're looking for, and it is not implemented in R. I don't know of a package where it is implemented either, but it's not too difficult to code yourself.
A workable way is to add a data frame to the attributes, containing the codes. To avoid doubling the whole data frame and to save space, I'd store the indices in that attribute data frame instead of reconstructing a complete data frame.
e.g.:
NACode <- function(x, code){
  Df <- sapply(x, function(i){
    i[i %in% code] <- NA
    i
  })
  id <- which(is.na(Df))
  # map the linear indices to (row, column) pairs; the -1/+1 keeps
  # values in the last row from mapping to row 0
  rowid <- (id - 1) %% nrow(x) + 1
  colid <- (id - 1) %/% nrow(x) + 1
  NAdf <- data.frame(
    id, rowid, colid,
    value = as.matrix(x)[id]
  )
  Df <- as.data.frame(Df)
  attr(Df, "NAcode") <- NAdf
  Df
}
This allows you to do:
> Df <- data.frame(A = 1:10,B=c(1:5,-1,-2,-3,9,10) )
> code <- list("Missing"=-1,"Not Answered"=-2,"Don't know"=-3)
> DfwithNA <- NACode(Df,code)
> str(DfwithNA)
'data.frame': 10 obs. of 2 variables:
$ A: num 1 2 3 4 5 6 7 8 9 10
$ B: num 1 2 3 4 5 NA NA NA 9 10
- attr(*, "NAcode")='data.frame': 3 obs. of 4 variables:
..$ id : int 16 17 18
..$ rowid: int 6 7 8
..$ colid: num 2 2 2
..$ value: num -1 -2 -3
The function can also be adjusted to add an extra attribute that gives you the label for the different values; see also this question. You can transform back with:
ChangeNAToCode <- function(x, code){
  NAval <- attr(x, "NAcode")
  for(i in which(NAval$value %in% code))
    x[NAval$rowid[i], NAval$colid[i]] <- NAval$value[i]
  x
}
> Dfback <- ChangeNAToCode(DfwithNA,c(-2,-3))
> str(Dfback)
'data.frame': 10 obs. of 2 variables:
$ A: num 1 2 3 4 5 6 7 8 9 10
$ B: num 1 2 3 4 5 NA -2 -3 9 10
- attr(*, "NAcode")='data.frame': 3 obs. of 4 variables:
..$ id : int 16 17 18
..$ rowid: int 6 7 8
..$ colid: num 2 2 2
..$ value: num -1 -2 -3
This allows you to change back only the codes you want, if that is ever necessary. The function can be adapted to return all codes when no argument is given. Similar functions can be constructed to extract data based on the code; I guess you can figure that one out yourself.
In one line: using attributes and indices might be a nice way of doing it.
The most obvious way seems to use two vectors:
Vector 1: a data vector, where all missing values are represented using NA. For example, c(2, 50, NA, NA)
Vector 2: a vector of factors, indicating the type of data. For example, factor(c(1, 1, -1, -7)), where level 1 indicates a correctly answered question.
Having this structure would give you a great deal of flexibility, since all the standard na.rm arguments still work with your data vector, but you can use more complex concepts with the factor vector.
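A minimal sketch of that two-vector structure (the names and codes here are illustrative):
resp   <- c(2, 50, NA, NA)                                   # data vector
status <- factor(c("data", "data", "refused", "not asked"))  # type vector
mean(resp, na.rm = TRUE)    # standard NA handling still works: 26
resp[status != "refused"]   # exclude just one kind of missingness
# [1]  2 50 NA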
Update following questions from @gsk3
"Data storage will dramatically increase": the data storage will double. However, if doubling the size causes a real problem, it may be worth thinking about other strategies.
"Programs don't automatically deal with it": that's a strange comment. Some functions handle NAs in a sensible way by default. However, you want to treat the NAs differently, which implies you will have to do something bespoke. If you want to analyse only the data where the NAs are "Question not asked", then just use a data frame subset.
"Now you have to manipulate two vectors together every time you want to conceptually manipulate a variable": I suppose I envisaged a data frame of the two vectors. I would subset the data frame based on the second vector.
"There's no standard implementation, so my solution might differ from someone else's": true. However, if an off-the-shelf package doesn't meet your needs, then (almost) by definition you want to do something different.
I should state that I have never analysed survey data (although I have analysed large biological data sets). My answers above appear quite defensive, but that's not my intention. I think your question is a good one, and I'm interested in other responses.
This is more than just a "technical" issue. You should have a thorough statistical background in missing value analysis and imputation. One solution requires playing with R and ggobi. You can assign extremely negative values to the several types of NA (putting the NAs into the margins), and do some diagnostics "manually". You should bear in mind that there are three types of missingness:
MCAR - missing completely at random, where P(missing|observed,unobserved) = P(missing)
MAR - missing at random, where P(missing|observed,unobserved) = P(missing|observed)
MNAR - missing not at random (or non-ignorable), where P(missing|observed,unobserved) cannot be quantified in any way.
IMHO this question is more suitable for CrossValidated.
But here's a link from SO that you may find useful:
Handling missing/incomplete data in R--is there function to mask but not remove NAs?
You can dispense with NA entirely and just use the coded values. You can then also roll them up to a global missing value. I often prefer to code without NA, since NA can cause problems in coding and I like to be able to control exactly what goes into the analysis. I have also used the string "NA" to represent NA, which often makes things easier.
-Ralph Winters
I usually use them as values, as Ralph already suggested, since the type of missing value seems to be data itself; but on one or two occasions where I mainly wanted it for documentation, I have used an attribute on the value, e.g.
> a <- NA
> attr(a, 'na.type') <- -1
> print(a)
[1] NA
attr(,"na.type")
[1] -1
That way my analysis is clean but I still keep the documentation. But as I said: usually I keep the values.
Allan.
I'd like to add to the "statistical background" component here. Statistical Analysis with Missing Data (by Little and Rubin) is a very good read on this.
