Duplicated levels in factors will be forbidden in April 2017. What about the levels function?

In the R-devel list, Martin Maechler posted a message about duplicated levels in factors
"factors with non-unique (duplicated) levels have been deprecated since 2009 -- are more deprecated now ..." June 4, 2016
It states that in R 3.4, scheduled for April 2017, duplicated levels will cause an error, no longer just a warning.
Why does the levels function not cause a similar warning? Here I combine the first three levels as "a" in two ways, one of them deprecated.
Example
> x <- c("a", "b", "c", "d")
> xf <- factor(x, levels = c("a", "b", "c", "d"),
labels = c("a", "a", "a", "d"))
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL)
as.character(labels) else paste0(labels, :
duplicated levels in factors are deprecated
> xf <- factor(x)
> levels(xf) <- c("a", "a", "a", "d")
> xf
[1] a a a d
Levels: a d
I would like to understand why the latter is treated differently by R than the former.
This is the documented behavior of levels; I'm not exploiting an unstated element. In ?levels, there is an example in which duplicated levels are allowed. I'll paste it in to save you the lookup.
## combine some levels
z <- gl(3, 2, 12, labels = c("apple", "salad", "orange"))
z
levels(z) <- c("fruit", "veg", "fruit")
z

Factors are used to create categorical variables. The levels attribute of the variable represents its categories. A variable cannot have duplicate categories; that would not make sense. The data, however, can contain the same category many times.
The data inside a categorical variable is represented as an integer vector. Use unclass to see the integer vector. The levels attribute represents the categories of this variable. For example, the first value of this variable belongs to a particular category and is assigned the number 1. If it is an ordered factor, then the lowest category is assigned the number 1.
x <- c(letters[1:3], letters[1:3])
xf <- factor(x)
xf
# [1] a b c a b c
# Levels: a b c
attributes(xf)
# $levels
# [1] "a" "b" "c"
#
# $class
# [1] "factor"
unclass(xf)
# [1] 1 2 3 1 2 3
# attr(,"levels")
# [1] "a" "b" "c"
If a data value does not match any of the levels, it becomes NA.
factor(c("a", "b", "c"), levels = c("e", "f", "g"))
# [1] <NA> <NA> <NA>
# Levels: e f g
labels is an optional argument used to rename the categories. Each data value that matches an entry in the levels argument receives the label at the same position in the labels argument. Notice that the value "e" is given the label "h".
factor(c("a", "b", "e"), levels = c("e", "f", "g"), labels = c("h", "i", "j"))
# [1] <NA> <NA> h
# Levels: h i j
Now levels<- is a replacement function used to rename the levels of a factor variable. The replacement values are matched by position against the existing levels, not against the data, so they must correspond to the levels of the factor. Otherwise garbage is created.
xf
# [1] a b c a b c
# Levels: a b c
The values "a" are changed to "e", "b" to "f", and "c" to "g". This example shows how to properly convert the category names of a factor variable. (Only the first three replacement values, one per existing level, take effect.)
levels(xf) <- c("e", "f", "g", "e", "f", "g")
xf
# [1] e f g e f g
# Levels: e f g
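Now the garbage type: notice that the replacement values do not correspond to the levels of xf. The first three values are all "m", so all three existing levels collapse into "m", and "n" is left as an unused level. To see the integer vector, use unclass(xf).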
levels(xf) <- c("m", "m", "m", "n", "n", "n")
xf
# [1] m m m m m m
# Levels: m n
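The merge the question asks about can also be written with the list form of levels<-, which ?levels documents. The sketch below is only a minimal illustration using the question's data; inspecting the result with unclass suggests that merging renumbers the integer codes rather than leaving duplicated levels behind.

```r
x <- c("a", "b", "c", "d")
xf <- factor(x)

# List form: each name is a new level, each element lists the old levels it absorbs
levels(xf) <- list(a = c("a", "b", "c"), d = "d")

levels(xf)    # the merged levels, with no duplicates
unclass(xf)   # the integer codes now point at the merged levels
```

The list form makes the intended mapping explicit, which may be easier to read than a positional vector with repeated names.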

Related

Is there a way to relevel a variable using the original level positions?

I have a variable with many very long factor names that are in alphabetical order instead of logical. Is there a way to relevel by position instead of variable name?
f <- factor(c("a", "b", "c", "d"), levels = c("b", "c", "d", "a"))
Instead of fct_relevel(f, "b", "a"), can I use positions to move the second (b) before the first (a), as in fct_relevel(f, 2, 1)?
You can get the values from f by position:
forcats::fct_relevel(f, as.character(f[2]), as.character(f[1]))
#[1] a b c d
#Levels: b a c d
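If the goal is to reorder by level position rather than by data position, one option in base R is to index levels(f) directly. This is a sketch under the assumption that the new order is supplied as a permutation of the level positions; new_order and f2 are names made up for illustration.

```r
f <- factor(c("a", "b", "c", "d"), levels = c("b", "c", "d", "a"))

# Positions refer to levels(f), i.e. "b" "c" "d" "a"; move the 4th level to the front
new_order <- c(4, 1, 2, 3)
f2 <- factor(f, levels = levels(f)[new_order])

levels(f2)
```

Because only the levels argument changes, the data values of f2 stay the same as those of f.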

Generating distinct groups based on vector/column pairs in R

SEE UPDATE BELOW:
Given a data frame with two columns (x1, x2) representing pairs of objects, I would like to generate groups where all members of each group are paired with all other members in that group. Thus far, I have been able to generate groups by showing all items in x2 that are paired with each item in x1, but this leaves me with groups where a couple of members are only paired with one other group member. I'm having a hard time getting off the ground with this one... Thanks in advance for any help you may have. Please let me know if I should edit this post as I am new to Stack Overflow and new to R coding.
x1 <- c("A", "B", "B", "B", "C", "C", "D", "D", "D", "E", "E")
x2 <- c("A", "B", "C", "D", "B", "C", "B", "D", "E", "D", "E")
df <- data.frame(x1, x2)
I would like to go from this df, to an output that looks like df2.
group1 <- c("A")
group2 <- c("B", "C")
group3 <- c("B", "D")
group4 <- c("D", "E")
df2 <- data.frame(cbind.fill(group1, group2, group3, group4, fill = "NULL"))
UPDATE:
Given the following dataset....
x1 <- c("A", "B", "B", "B", "C", "C", "D", "D", "D", "E", "E", "B", "C", "F")
x2 <- c("A", "B", "C", "D", "B", "C", "B", "D", "E", "D", "E", "F", "F", "F")
df <- data.frame(x1, x2)
.... I would like to identify groups of x1/x2 where all objects within said group are connected to all other objects of that group.
This is what I have thus far (I'm sure this is riddled with best-practice errors, feel free to call them out. I'm eager to learn)...
n <- nrow(as.data.frame(unique(df$x1)))
RosterGuide <- as.data.frame(matrix(nrow = n , ncol = 1))
RosterGuide$V1 <- seq.int(nrow(RosterGuide))
RosterGuide$Object <- (unique(df$x1))
colnames(RosterGuide) <- c("V1","Object")
groups_frame <- matrix(, ncol= length(n), nrow = length(n))
for (loopItem in 1:nrow(RosterGuide)) {
  object <- subset(RosterGuide$Object, RosterGuide$V1 == loopItem)
  group <- as.data.frame(subset(df$x2, df$x1 == object))
  groups_frame <- cbind.fill(group, groups_frame, fill = "NULL")
}
Groups <- as.data.frame(groups_frame)
Groups <- subset(Groups, select = - c(object))
colnames(Groups) <- RosterGuide$V1
This yields the data frame 'Groups'....
     1    2    3    4    5    6
1    F    D    B    B    B    A
2 NULL    E    D    C    C NULL
3 NULL NULL    E    F    D NULL
4 NULL NULL NULL NULL    F NULL
... which is exactly what I am looking for, except that if you look at the original df, objects F and D are never paired, rendering group 5 invalid. Also, objects B and E are never paired, rendering group 3 invalid. A valid output should look like this...
     1    2    3    4    5
1    D    B    B    B    A
2    E    D    C    C NULL
3 NULL NULL NULL    F NULL
Question: is there some way that I can relate the groups listed above in the 'Groups' data frame to the original df to remove groups with invalid relationships? This really has me stumped.
For context: What I am really trying to do is group items based on pairwise connections derived from a network of nodes where not all nodes are connected.
Here is one way of doing it in base R, using apply and unique:
df <- data.frame(x1, x2, stringsAsFactors = FALSE)
df <- df[df$x1 != df$x2, ]
unique(t(apply(df, 1, sort)))
[,1] [,2]
3 "B" "C"
4 "B" "D"
9 "D" "E"
dplyr
library(dplyr)
df %>%
  dplyr::filter(x1 != x2) %>%
  dplyr::filter(!duplicated(paste(pmin(x1, x2), pmax(x1, x2), sep = "-")))
x1 x2
1 B C
2 B D
3 D E
data.table (there might be another better way)
library(data.table)
as.data.table(df)[, .SD[x1 != x2]][, .GRP, by = .(x1 = pmin(x1,x2), x2 = pmax(x1,x2))]
x1 x2 GRP
1: B C 1
2: B D 2
3: D E 3
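The answers above list the unique pairs. For the validity check the question asks about (dropping groups in which some members are never paired), one possible base R sketch is to test every pair inside a candidate group against the observed pairs. The helper names pair_key and is_fully_connected, and the variables edges, candidates and valid, are made up for illustration; the data is the updated dataset from the question.

```r
x1 <- c("A", "B", "B", "B", "C", "C", "D", "D", "D", "E", "E", "B", "C", "F")
x2 <- c("A", "B", "C", "D", "B", "C", "B", "D", "E", "D", "E", "F", "F", "F")
df <- data.frame(x1, x2, stringsAsFactors = FALSE)

# Undirected pair keys such as "B-C", always with the smaller label first
pair_key <- function(a, b) paste(pmin(a, b), pmax(a, b), sep = "-")
edges <- unique(pair_key(df$x1, df$x2))

# TRUE when every pair of distinct members appears among the observed pairs
is_fully_connected <- function(members, edges) {
  if (length(members) < 2) return(TRUE)
  pairs <- combn(members, 2)
  all(pair_key(pairs[1, ], pairs[2, ]) %in% edges)
}

# Candidate groups: each object together with everything it is paired with
candidates <- lapply(split(df$x2, df$x1), unique)

# Keep only the candidates in which all members are mutually paired
valid <- Filter(function(g) is_fully_connected(g, edges), candidates)
valid
```

With this data the candidates built from B (B, C, D, F) and from D (B, D, E) are dropped, because C and D, and B and E, are never paired. For larger networks a graph library with a maximal-clique routine may be a better fit than this brute-force check.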

Is there a way in R to rank categorical variable (of characters) into ranked ordinal data?

I have a list of character strings, say
alphabets = c(a, b, c, d,..., z) and I would like to get the index of this list as a new column in a data.frame.
e.g. (b, a, c, d, e, g) would yield (2, 1, 3, 4, 5, 7).
The solution you need is to convert the character vector to a factor:
alphabets = c("b", "a", "c", "d", "e", "g")
# convert to class factor with the order defined by the levels argument
alphabets <- factor(alphabets, levels = letters)
# display the values
as.numeric(alphabets)
#[1] 2 1 3 4 5 7
This is a case for match
x <- c("b", "a", "c", "d", "e", "g")
match(x, letters)
#[1] 2 1 3 4 5 7
Or sapply with grep returning a named int vector
sapply(x, grep, letters)
#b a c d e g
#2 1 3 4 5 7
Two comments:
"I have a list of character strings" Be precise with class names of objects! alphabets = c("a", "b", "c", "d") is a character vector, not a list.
letters is a built-in constant containing the 26 lower-case letters (of the Roman alphabet) as a character vector. See ?letters for details.
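Since the question asks for the index as a new column in a data.frame, the match approach can be applied to a column directly. This is a minimal sketch; the data frame df and the column names letter and rank are assumptions made for illustration.

```r
df <- data.frame(letter = c("b", "a", "c", "d", "e", "g"),
                 stringsAsFactors = FALSE)

# Position of each value within the reference ordering (here the built-in letters)
df$rank <- match(df$letter, letters)

df
```

Values that do not occur in the reference vector get NA from match, so it is worth checking for missing ranks afterwards.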

For loop with factor data

I have two vectors of factor data with equal length. Just for example's sake:
observed=c("a", "b", "c", "a", "b", "c", "a")
predicted=c("a", "a", "b", "b", "b", "c", "c")
Ultimately, I am trying to generate a classification matrix showing the number of times each factor is correctly predicted. This would look like the following for the example:
name T F
a 1 2
b 1 1
c 1 1
Note that the table() command doesn't work here because I have 11 different levels, and the output would be 11x11 instead of 11x2. My plan is to create three vectors, and combine them into a data frame.
First, a vector of the unique factor values in the existing vectors. This is simple enough,
names=unique(df$observed)
Next, a vector of values showing the number of correct predictions. This is where I am running into trouble. I can get the number of correct predictions for an individual factor like so:
correct.a=sum(predicted[which(observed == "a")] == "a")
But this is cumbersome to repeat time and time again, and then combine into a vector like
correct=c(correct.a, correct.b, correct.c)
Is there a way to use a loop (or other strategy that you can think of) to improve this process?
Also note that the final vector I would create would be something like this:
incorrect.a=sum(observed == "a")-correct.a
t(sapply(split(predicted == observed, observed), table))
# FALSE TRUE
#a 2 1
#b 1 1
#c 1 1
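One caveat with sapply over table: if some level has only correct or only incorrect predictions, table returns a single count for it and sapply can no longer simplify to a matrix. A sketch that avoids this, assuming base R only, fixes both outcomes as factor levels before tabulating; correct and res are illustrative names.

```r
observed  <- c("a", "b", "c", "a", "b", "c", "a")
predicted <- c("a", "a", "b", "b", "b", "c", "c")

# Force both outcomes to exist as levels, even if one of them never occurs
correct <- factor(predicted == observed, levels = c(FALSE, TRUE))

# Two-way table: one row per observed level, one column per outcome
res <- table(observed, correct)
res
```

The result always has a FALSE and a TRUE column, so it can be converted with as.data.frame.matrix(res) without special-casing levels that were always or never predicted correctly.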
I would suggest using data.table for an explicit, clean way to define your results:
library(data.table)
observed=c("a", "b", "c", "a", "b", "c", "a")
predicted=c("a", "a", "b", "b", "b", "c", "c")
dt <- data.table(observed, predicted)
res <- dt[, .(
  T = sum(observed == predicted),
  F = sum(observed != predicted)),
  observed
]
res
# observed T F
# 1: a 1 2
# 2: b 1 1
# 3: c 1 1

Keeping value as a factor when extracting most common factor in R

I can get the most frequent level or name of a factor in a table using table() and levels() or names() as explained here, but how can I get a factor itself?
> a <- ordered (c("a", "b", "c", "b", "c", "b", "a", "c", "c"))
> tt <- table(a)
> m = names(which.max(tt)) # what do I put here?
> is.factor(m)
[1] FALSE # I want this to be TRUE and for m to be identical a[3]
This is just an example, of course. What I'm really trying to do is a lot of manipulation and aggregation of factors and I just want to keep the factors consistent across all the variables. I don't want them to change levels or order or drop levels because there is no data.
It's not clear exactly what you do want. If you want a factor vector of length 4 with the same levels as a:
m = a[ a %in% names(which.max(tt)) ]
For a length one vector, do the same as above and just take the first one:
m = a[ a %in% names(which.max(tt)) ][1]
m
#--------
[1] c
Levels: a < b < c
> m == a[3]
[1] TRUE
If you want a vector of the same length, then:
m <- a
is.na(m) <- ! m %in% names(which.max(tt))
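Another way to get a length-one result that keeps the levels and ordering of a is to rebuild the factor from the most frequent name. This is a sketch, not necessarily what the answer intended; it assumes the levels and ordered status should be copied from a.

```r
a <- ordered(c("a", "b", "c", "b", "c", "b", "a", "c", "c"))
tt <- table(a)

# Rebuild a factor from the most frequent name, reusing the levels and ordering of a
m <- factor(names(which.max(tt)), levels = levels(a), ordered = is.ordered(a))

is.factor(m)
identical(m, a[3])
```

Copying levels(a) explicitly keeps unused levels in place, which matches the stated goal of not dropping levels when there is no data for them.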
