Using the code below, I import a dataset, explore it, and remove a row.
After removing the row, the output of my length and levels commands is unchanged. Why?
MT <- read_csv("Q:/PhD/PhD courses/Data Doc and Man/day3-day4/bromraw.txt",
col_names = FALSE)
names(MT) <- c("id","pnr","age","sex", "runtime")
MT$sex <- as.factor(MT$sex)
length(levels(MT$sex))
levels(MT$sex)
This is the output:
[1] 3
[1] "33529" "K" "M"
Something is wrong. I investigate the row where sex has the value 33529
filter(MT, sex == 33529)
After examining the row I decide to drop it, and recheck the sex variable again.
MT <- subset(MT, sex !=33529)
length(levels(MT$sex))
levels(MT$sex)
[1] 3
[1] "33529" "K" "M"
The row is not there when I browse the data, but the output of the length and levels command is the same as before. What am I doing wrong?
I feel the question deserves a better explanation than just a piece of code.
Factor levels can exist independent of the data, e.g.
x <- factor(character(0), levels = LETTERS[1:3])
creates a vector of length 0 which has 3 factor levels
x
factor(0)
Levels: A B C
The length of the vector length(x) is zero but x has 3 levels
levels(x)
[1] "A" "B" "C"
(and length(levels(x)) is 3, accordingly).
The benefit is that we can add data later on which is checked if it is compatible with the defined factor levels:
x[1:4] <- LETTERS[1:4]
Warning message: In [<-.factor(*tmp*, 1:4, value = c("A", "B",
"C", "D")) : invalid factor level, NA generated
x
[1] A B C <NA>
Levels: A B C
Now, the vector consists of 4 elements (length(x)) but there are still only 3 factor levels. Note that "D" has not become an additional factor level automatically but was replaced by NA instead.
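If "D" is meant to be a valid value, one option is to extend the levels before assigning. The following is a minimal sketch under that assumption, using a small stand-in vector rather than the x from above:

```r
# Stand-in factor with three defined levels
x <- factor(c("A", "B", "C"), levels = LETTERS[1:3])

# Append "D" to the existing levels; the existing values keep their meaning
levels(x) <- c(levels(x), "D")

# Now the assignment is accepted instead of producing NA
x[4] <- "D"
x
```

Appending to the end of levels(x) leaves the integer codes of the existing values untouched, which is why this is safe here.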
If elements of the vector are removed, e.g.
y <- x[-c(1L, 4L)]
y
[1] B C
Levels: A B C
the factor levels remain unchanged while length(y) is 2 now.
However, if you want to remove unused factor levels you can do so by explicitly using the droplevels() function as pointed out by akrun:
y <- droplevels(y)
y
[1] B C
Levels: B C
Now, factor level "A" has been dropped as it is unused.
While the levels() function shows the factor levels which are defined it does not tell which of the boxes (credit to Acccumulation for the picture) are filled or not. The unique() function returns a vector of distinct values while the table() function counts the number of occurrences:
set.seed(1L)
z <- sample(LETTERS[1:8], 10, replace = TRUE)
z
[1] "C" "B" "E" "H" "A" "B" "D" "A" "D" "C"
unique(z)
[1] "C" "B" "E" "H" "A" "D"
table(z)
z
A B C D E H
2 2 2 2 1 1
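Applied to a factor, table() also reports defined levels that have no observations, which gives a quick way to spot unused levels. A small sketch with a made-up factor (the names counts and unused are only for illustration):

```r
# Factor with a defined but unused level "A"
y <- factor(c("B", "C"), levels = c("A", "B", "C"))

# table() counts every defined level, including empty ones
counts <- table(y)

# Levels whose count is zero are the unused ones
unused <- names(counts)[as.vector(counts) == 0]
unused
```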
This could be a case of unused levels. We can resolve it by dropping the levels
MT <- droplevels(subset(MT, sex != 33529))
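The effect can be reproduced without the original file. The sketch below uses a small made-up data frame in place of bromraw.txt (the values are hypothetical) and shows that subset() keeps the level while droplevels() removes it:

```r
# Made-up stand-in for the imported data
MT <- data.frame(id = 1:4, sex = factor(c("K", "M", "33529", "M")))

# Dropping the row removes the value but not the level definition
MT_subset <- subset(MT, sex != "33529")
levels(MT_subset$sex)

# droplevels() removes levels that no longer occur in the data
MT_clean <- droplevels(MT_subset)
levels(MT_clean$sex)
```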
I have three columns which are characters A, B, and C respectively. I am using as.numeric to convert them to numeric and then assign them values, e.g. 1, 2 and 3, but when I use as.numeric(), it returns NAs. In different data frames these orders vary, e.g. ABC or ACB, but A=i+0i, B=2+3i and C is also a complex number. I want to first convert the string to a complex number and then assign values to them.
LV$phase1 <- as.numeric(LV$phase1)
class(phase1)
A=1
print(phase1)
This is the error:
"Warning message:
NAs introduced by coercion "
It does not usually make sense to convert character data to numeric, but if the letters refer to an ordered sequence of events/phases/periods, then it may be useful. R uses factors for this purpose. For example
set.seed(42)
phase <- sample(LETTERS[1:4], 10, replace=TRUE)
phase
# [1] "A" "A" "A" "A" "B" "D" "B" "B" "A" "D"
factor(phase)
# [1] A A A A B D B B A D
# Levels: A B D
as.numeric(factor(phase))
# [1] 1 1 1 1 2 3 2 2 1 3
If this is what you are trying to do
LV$phase1 <- as.numeric(factor(LV$phase1))
will convert the letters to an ordered sequence and assign numbers to represent those categories.
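One caveat with as.numeric(factor(...)) is that the codes depend on which letters happen to be present: in the example above "D" became 3 because "C" never occurred. If the full set of phases is known in advance, passing it as levels keeps the codes stable. A sketch with a made-up vector:

```r
phase <- c("A", "A", "B", "D")

# Codes follow the levels that are present: A, B, D -> 1, 2, 3
codes_present <- as.integer(factor(phase))

# Codes follow the declared levels: A, B, C, D -> 1, 2, 3, 4
codes_declared <- as.integer(factor(phase, levels = LETTERS[1:4]))

codes_present
codes_declared
```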
I am creating a function to help me quickly recode variables into numerical values, as a form of practice. The idea behind creating the function is to quickly recode several values into numerical form, for any length. If a dataset is really long for instance, the function in theory should recode all of these values without having to manually type out each condition in which to recode it into a specific value.
For instance:
levels(d$letters)
[1] a b c d
The general form of the function is to:
d$letters.recode[d$letters == "a"] <- 1
d$letters.recode[d$letters == "b"] <- 2
d$letters.recode[d$letters == "c"] <- 3
And using this function:
rc.f <- function(a, b){
x <- levels(a)
y <- length(a)
b <- NA
for (i in 1:y){
z <- b[a==x[i]] <- i
}
}
In theory, the idea is that this function should create another variable, where a is recoded as 1, b is recoded as 2 and so on.
However when I run rc.f(d$letters, d$letters.recode), no new variables are created in the dataset, and the function does not return an error.
Any ideas?
Thanks.
Another example dataset d:
Say for a list of respondents they are assigned a category depending on their region:
Respondent Region
1 d
2 b
3 g
4 c
5 e
6 c
7 f
8 a
I am looking for a way to recode d$Region into a numerical value, to d$Region.R.
Using the same function as above, I am wondering whether I can use the function to create another variable in the dataframe, by inputting d$Region and d$Region.R into the function. So recoding a,b,c,[...],g into 1,2,3,[...],7.
If you want a, b, f, d recoded as 1, 2, 4, 3 (codes follow the sorted levels that are actually present), use the following.
I have updated your rc.f function a little:
Removed the second argument b: b is created inside the function, and a function cannot modify the variable passed to it, so the argument is not needed.
Removed z, since we do not need another variable to store the value of b.
Coerced the input with as.factor(), since not every argument will already be a factor.
Removed y; the loop should run over the levels (seq_along(x)), not over the length of a.
Last but not least, put b on the last line: the value of the last expression is what the function returns unless we use return(). Assign that result to the new column yourself.
The code is
rc.f <- function(a)
{
  a <- as.factor(a)
  x <- levels(a)
  # one result slot per element of a
  b <- rep(NA_integer_, length(a))
  # loop over the levels, not over the elements
  for (i in seq_along(x))
  {
    b[a == x[i]] <- i
  }
  b
}
let us take an example
> l<-c("a","b","b","a","a","g","h","y","f","v","h","j","f","d","a","s","s","s")
> l
[1] "a" "b" "b" "a" "a" "g" "h" "y" "f" "v" "h" "j" "f"
[14] "d" "a" "s" "s" "s"
> rc.f(l)
[1] 1 2 2 1 1 5 6 10 4 9 6 7 4 3 1 8 8 8
If you want a, b, f, d recoded as 1, 2, 6, 4 (position in the alphabet), use the following:
rc.f <- function(a)
{
a<-as.factor(a)
b <- NA
for (i in 1:26)
{
b[a==letters[i]] <- i
}
b
}
lets take an example
> l<-c("a","b","b","a","a","g","h","y","f","v","h","j","f","d","a","s","s","s")
> l
[1] "a" "b" "b" "a" "a" "g" "h" "y" "f" "v" "h" "j" "f" "d"
[15] "a" "s" "s" "s"
> rc.f(l)
[1] 1 2 2 1 1 7 8 25 6 22 8 10 6 4 1 19 19 19
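Both recodings can also be written without a loop. The sketch below is an alternative rather than the answer's own method: as.integer(factor()) gives the position among the sorted levels that are present, and match() against the built-in letters gives the position in the alphabet. The data frame d is a reconstruction of the Region example from the question.

```r
l <- c("a","b","b","a","a","g","h","y","f","v","h","j","f","d","a","s","s","s")

# Position among the levels present in the data (same as the first rc.f)
by_level <- as.integer(factor(l))

# Position in the alphabet (same as the second rc.f)
by_alphabet <- match(l, letters)

# The result has to be assigned to the new column explicitly
d <- data.frame(Respondent = 1:8,
                Region = c("d", "b", "g", "c", "e", "c", "f", "a"))
d$Region.R <- as.integer(factor(d$Region))
d
```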
I am trying to count the number of discordant pairs. For example:
arg1=c("b","c","a","d")
arg2 = c("b","c","d","a")
There is 1 discordant pair in the above (the pair: "a" and "d")
But when I run:
require(asbio)
sum(ConDis.matrix(arg1,arg2)==-1,na.rm=TRUE)
The answer I receive is: 5 (instead of the correct answer - 1)
I also tried:
require(RankAggreg)
require(DescTools)
xy <- table(arg1,arg2)
cd <- ConDisPairs(xy)
cd$D
the answer is 5 again.
What am I missing?
I think you are misunderstanding how ConDis.matrix works.
The pairs it refers to are pairs of indices of elements and the function checks, for each pair, whether they are moving in the same way in both vectors.
So, in your vector, you have indeed 5 discordant pairs, that is (considering letters with an ordered quantitative view):
between obs1 and obs3 ("a" is lower than "b" in arg1 but "d" is higher in arg2)
between obs1 and obs4 ("a" is lower than "b" in arg2 but "d" is higher in arg1)
between obs2 and obs3 ("a" is lower than "c" in arg1 but "d" is higher in arg2)
between obs2 and obs4 ("a" is lower than "c" in arg2 but "d" is higher in arg1)
between obs3 and obs4 ("a" is lower than "d" in arg1 but "d" is higher than "a" in arg2)
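The pair-by-pair check above can be reproduced in base R. This is a minimal sketch, not how asbio implements it: count_discordant is a hypothetical helper that compares the direction of change in both vectors for every pair of observations. It also shows that the count depends on how the letters are turned into numbers.

```r
# Hypothetical helper: number of observation pairs that move in opposite directions
count_discordant <- function(x, y) {
  pairs <- combn(length(x), 2)               # all pairs of indices
  dx <- sign(x[pairs[1, ]] - x[pairs[2, ]])  # direction of change in x
  dy <- sign(y[pairs[1, ]] - y[pairs[2, ]])  # direction of change in y
  sum(dx * dy < 0)
}

arg1 <- c("b", "c", "a", "d")
arg2 <- c("b", "c", "d", "a")

# Alphabetical ranks: a = 1, b = 2, ...
n_alpha <- count_discordant(match(arg1, letters), match(arg2, letters))

# Ranks by position in arg1: b = 1, c = 2, a = 3, d = 4
n_position <- count_discordant(match(arg1, arg1), match(arg2, arg1))

n_alpha
n_position
```

With alphabetical ranks the five pairs listed above are discordant; ranking by position in arg1 leaves only the swapped pair.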
Based on @Cath's initial comment, converting the character vectors into factors seems like it might provide a workaround by mapping the text values to integers that can then be used in the function. Edit: be aware that reordering the factor levels changes the final result. I don't know enough about the discordance function to say if this is the expected behavior.
# Original Character vectors
arg1 <- c("b","c","a","d")
arg2 <- c("b","c","d","a")
# Translate character vectors into factors
all_levels <- unique(c(arg1, arg2))
arg1 <- factor(arg1, levels = all_levels)
arg1
[1] b c a d
Levels: b c a d
arg2 <- factor(arg2, levels = all_levels)
arg2
[1] b c d a
Levels: b c a d
# This maps each text string to a number
as.numeric(arg1)
[1] 1 2 3 4
as.numeric(arg2)
[1] 1 2 4 3
# Use the underlying numeric data in the function
require(asbio)
sum(ConDis.matrix(as.numeric(arg1), as.numeric(arg2))==-1,na.rm=TRUE)
[1] 1
Edit: sorting the factor levels changes the final output
arg1 <- c("b","c","a","d")
arg2 <- c("b","c","d","a")
all_levels <- sort(unique(c(arg1, arg2))) # sorted
arg1 <- factor(arg1, levels = all_levels)
arg2 <- factor(arg2, levels = all_levels)
sum(ConDis.matrix(as.numeric(arg1), as.numeric(arg2))==-1,na.rm=TRUE)
[1] 5
I have data frame with some numerical variables and some categorical factor variables. The order of levels for those factors is not the way I want them to be.
numbers <- 1:4
letters <- factor(c("a", "b", "c", "d"))
df <- data.frame(numbers, letters)
df
# numbers letters
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
If I change the order of the levels, the letters no longer are with their corresponding numbers (my data is total nonsense from this point on).
levels(df$letters) <- c("d", "c", "b", "a")
df
# numbers letters
# 1 1 d
# 2 2 c
# 3 3 b
# 4 4 a
I simply want to change the level order, so when plotting, the bars are shown in the desired order - which may differ from default alphabetical order.
Use the levels argument of factor:
df <- data.frame(f = 1:4, g = factor(letters[1:4]))
df
# f g
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
levels(df$g)
# [1] "a" "b" "c" "d"
df$g <- factor(df$g, levels = letters[4:1])
levels(df$g)
# [1] "d" "c" "b" "a"
df
# f g
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
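The difference between the two approaches is easiest to see side by side. A short sketch with a stand-alone factor (the names reordered and relabelled are only for illustration): factor(x, levels = ...) changes the level order and keeps the values, whereas levels(x) <- ... renames the levels and therefore changes the values.

```r
g <- factor(c("a", "b", "c", "d"))

# Reorder: same values, new level order, integer codes change
reordered <- factor(g, levels = rev(levels(g)))

# Relabel: integer codes stay, so every value gets a new name
relabelled <- g
levels(relabelled) <- rev(levels(g))

as.character(reordered)
as.character(relabelled)
```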
some more, just for the record
## reorder() is a generic from stats; the new.order argument comes from the factor method in gdata
library(gdata)
df$letters <- reorder(df$letters, new.order = letters[4:1])
df$letters <- reorder.factor(df$letters, new.order = letters[4:1])
You may also find useful Relevel and combine_factor.
Since this question was last active Hadley has released his new forcats package for manipulating factors and I'm finding it outrageously useful. Examples from the OP's data frame:
levels(df$letters)
# [1] "a" "b" "c" "d"
To reverse levels:
library(forcats)
fct_rev(df$letters) %>% levels
# [1] "d" "c" "b" "a"
To add more levels:
fct_expand(df$letters, "e") %>% levels
# [1] "a" "b" "c" "d" "e"
And many more useful fct_xxx() functions.
So what you want, in R lexicon, is to change only the labels for a given factor variable (i.e., leave the data, as well as the factor levels, unchanged).
df$letters = factor(df$letters, labels=c("d", "c", "b", "a"))
Given that you want to change only the datapoint-to-label mapping and not the data or the factor schema (how the datapoints are binned into individual bins or factor values), it might help to know how the mapping is set when you initially create the factor.
The rules are simple:
labels are mapped to levels by index value (i.e., the value at levels[2] is given the label labels[2]);
factor levels can be set explicitly by passing them in via the levels argument; or
if no value is supplied for the levels argument, the default value is used, which is the sorted result of calling unique on the data vector passed in (the x argument);
labels can be set explicitly via the labels argument; or
if no value is supplied for the labels argument, the default value is used, which is just the levels vector.
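These rules can be checked on a tiny vector. The sketch below uses made-up values ("lo", "hi") and made-up labels; it is only meant to illustrate the defaults described above.

```r
x <- c("lo", "hi", "lo")

# No levels given: the sorted unique values are used
default_f <- factor(x)
levels(default_f)

# Levels given explicitly: their order defines the integer codes
explicit_f <- factor(x, levels = c("lo", "hi"))
as.integer(explicit_f)

# Labels are matched to levels by position: "lo" -> "Low", "hi" -> "High"
labelled_f <- factor(x, levels = c("lo", "hi"), labels = c("Low", "High"))
labelled_f
```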
Dealing with factors in R is quite a peculiar job, I must admit... When you reorder the factor levels, you're not reordering the underlying numerical values. Here's a little demonstration:
> numbers = 1:4
> letters = factor(letters[1:4])
> dtf <- data.frame(numbers, letters)
> dtf
numbers letters
1 1 a
2 2 b
3 3 c
4 4 d
> sapply(dtf, class)
numbers letters
"integer" "factor"
Now, if you convert this factor to numeric, you'll get:
# return underlying numerical values
> with(dtf, as.numeric(letters))
[1] 1 2 3 4
# change levels
> levels(dtf$letters) <- letters[4:1]
> dtf
numbers letters
1 1 d
2 2 c
3 3 b
4 4 a
# return numerical values once again
> with(dtf, as.numeric(letters))
[1] 1 2 3 4
As you can see... by changing levels, you change levels only (who would tell, eh?), not the numerical values! But when you use the factor function as @Jonathan Chang suggested, something different happens: you change the numerical values themselves.
You're getting the error once again 'cause you assign to levels and then try to relevel with factor. Don't do it!!! Do not assign to levels or you'll mess things up (unless you know exactly what you're doing).
One lil' suggestion: avoid giving your objects the same names as R's built-in objects (df is the density function of the F distribution, letters gives the lowercase alphabet letters). In this particular case your code would not be faulty, but it can create confusion, and we don't want that, do we?!? =)
Instead, use something like this (I'll go from the beginning once again):
> dtf <- data.frame(f = 1:4, g = factor(letters[1:4]))
> dtf
f g
1 1 a
2 2 b
3 3 c
4 4 d
> with(dtf, as.numeric(g))
[1] 1 2 3 4
> dtf$g <- factor(dtf$g, levels = letters[4:1])
> dtf
f g
1 1 a
2 2 b
3 3 c
4 4 d
> with(dtf, as.numeric(g))
[1] 4 3 2 1
Note that you can also name your data.frame df and the column letters instead of g, and the result will be OK. Actually, this code is identical to the one you posted, only the names are changed. The call factor(dtf$letters, levels = letters[4:1]) wouldn't throw an error, but it can be confusing!
Read the ?factor manual thoroughly! What's the difference between factor(g, levels = letters[4:1]) and factor(g, labels = letters[4:1])? What's similar in levels(g) <- letters[4:1] and g <- factor(g, labels = letters[4:1])?
You can post your ggplot code, so we can help you more on this one!
Cheers!!!
Edit:
ggplot2 actually requires to change both levels and values? Hm... I'll dig this one out...
I wish to add another case where the levels could be strings carrying numbers along with some special characters, like the example below:
df <- data.frame(x = c("15-25", "0-4", "5-10", "11-14", "100+"), stringsAsFactors = TRUE)
The default levels of x are:
df$x
# [1] 15-25 0-4 5-10 11-14 100+
# Levels: 0-4 100+ 11-14 15-25 5-10
Here if we want to reorder the factor levels according to the numeric value, without explicitly writing out the levels, what we could do is
library(gtools)
df$x <- factor(df$x, levels = mixedsort(df$x))
df$x
# [1] 15-25 0-4 5-10 11-14 100+
# Levels: 0-4 5-10 11-14 15-25 100+
as.numeric(df$x)
# [1] 4 1 2 3 5
I hope this can be considered as useful information for future readers.
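If adding a package is not an option, a similar ordering can be sketched in base R by sorting on the leading number of each label. This assumes every label starts with digits, as in the example above; it is not a general replacement for mixedsort.

```r
x <- c("15-25", "0-4", "5-10", "11-14", "100+")

# Keep only the leading digits of each label: 15, 0, 5, 11, 100
lower <- as.numeric(sub("[^0-9].*$", "", x))

# Use the labels, ordered by that number, as the factor levels
f <- factor(x, levels = unique(x[order(lower)]))

levels(f)
as.integer(f)
```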
Here's my function to reorder factors of a given dataframe:
reorderFactors <- function(df, column = "my_column_name",
desired_level_order = c("fac1", "fac2", "fac3")) {
x = df[[column]]
lvls_src = levels(x)
idxs_target <- vector(mode="numeric", length=0)
for (target in desired_level_order) {
idxs_target <- c(idxs_target, which(lvls_src == target))
}
x_new <- factor(x,levels(x)[idxs_target])
df[[column]] <- x_new
return (df)
}
Usage: reorderFactors(df, "my_col", desired_level_order = c("how","I","want"))
I would simply use the levels argument of factor():
df$letters <- factor(df$letters, levels = levels(df$letters)[4:1])
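To add yet another approach that is quite useful as it frees us from remembering functions from different packages: the levels of a factor are just attributes, so one can do the following. Note that, like assigning to levels(), this relabels the existing values rather than reordering the levels.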
numbers <- 1:4
letters <- factor(c("a", "b", "c", "d"))
df <- data.frame(numbers, letters)
# Original attributes
> attributes(df$letters)
$levels
[1] "a" "b" "c" "d"
$class
[1] "factor"
# Modify attributes
attr(df$letters,"levels") <- c("d", "c", "b", "a")
> df$letters
[1] d c b a
Levels: d c b a
# New attributes
> attributes(df$letters)
$levels
[1] "d" "c" "b" "a"
$class
[1] "factor"
I am using R to build a prediction model. However, predict always gives me an error message such as
I know that it is probably caused by feature levels in the test data that are not among the training feature levels. Since the feature matrix is big, it is very hard to modify the levels one by one in the test data set. Is there a way to force the levels of the features in the test data set to match the existing levels of the training features?
Here's an example of making a test variable have the same levels as a training variable:
test <- factor(LETTERS[1:5])
training <- factor(LETTERS[4:10])
levels(test)
#[1] "A" "B" "C" "D" "E"
Trying to replace a value where the level is not present:
test[2] <- training[5]
#Warning:
# In `[<-.factor`(`*tmp*`, 2, value = 5L) :
# invalid factor level, NA generated
You can get around this by uniting the factor levels:
levels(test) <- union(levels(test), levels(training))
levels(test)
#[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
test
#[1] A B C D E
#Levels: A B C D E F G H I J
Now you can do the previous operation without warning:
test[2] <- training[5]
test
#[1] A H C D E
#Levels: A B C D E F G H I J
Most likely you can use a similar approach in your case, but I'm not sure about the exact structure of your data.
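For data frames, a related option is to rebuild the test column with the training levels. The sketch below uses made-up train and test data frames with a single factor column f. Note the different trade-off: values that never occurred in training become NA instead of being added as new levels.

```r
# Made-up training and test sets
train <- data.frame(f = factor(c("a", "b", "c")))
test  <- data.frame(f = factor(c("b", "d")))

# Re-create the test factor using exactly the training levels
test$f <- factor(test$f, levels = levels(train$f))

levels(test$f)
test$f   # "d" was not seen in training, so it becomes NA
```

Whether NA or an extended level set is preferable depends on how the model should treat unseen categories.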