Is there a way to relevel a variable using the original level positions? - r

I have a factor with many very long level names that are in alphabetical order rather than a logical one. Is there a way to relevel by position instead of by level name?
f <- factor(c("a", "b", "c", "d"), levels = c("b", "c", "d", "a"))
Instead of fct_relevel(f, "b", "a"), can I use positions to move the second value (b) before the first (a), along the lines of fct_relevel(f, 2, 1)?

You can look up the values in f by position and pass them as level names:
forcats::fct_relevel(f, as.character(f[2]), as.character(f[1]))
#[1] a b c d
#Levels: b a c d
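Note that f[2] and f[1] index the data values, not the levels. If the goal is to work with positions in levels(f) itself, a minimal base R sketch could look like the following. The names front, new_order and f2 are made up for illustration, and the choice of which positions to move is an assumption.

```r
# Same factor as in the question: levels are b, c, d, a.
f <- factor(c("a", "b", "c", "d"), levels = c("b", "c", "d", "a"))

# Positions (within levels(f)) to move to the front, in the desired order.
front <- c(4, 1)   # "a" first, then "b"

# Put the chosen positions first and keep all other levels in their current order.
new_order <- c(front, setdiff(seq_along(levels(f)), front))
f2 <- factor(f, levels = levels(f)[new_order])

levels(f2)
# [1] "a" "b" "c" "d"
```

The same lookup also avoids typing long names with forcats, for example fct_relevel(f, levels(f)[4]).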

Related

Exclude common rows in tibbles [duplicate]

This question already has an answer here:
Using anti_join() from the dplyr on two tables from two different databases
I'm looking for a way to join two tibbles so that only the rows unique to the first tibble remain, or the rows unique to either tibble: simply those that do not have a matching key.
Here is an example:
A <- tibble( A = c("a", "b", "c", "d", "e"))
B <- tibble( A = c("a", "b", "c"))
With the usual dplyr joins I am not able to get this:
A
1 d
2 e
Is there some way to do this within dplyr, or within the tidyverse in general?
Use the setdiff() function from the dplyr package:
library(dplyr)
A <- tibble( A = c("a", "b", "c", "d", "e"))
B <- tibble( A = c("a", "b", "c"))
C <- setdiff(A, B)
Just to add:
setdiff(A, B) returns those rows present in A but not in B.
dplyr::anti_join will keep only the rows that are unique to the tibble/data.frame of the first argument.
A <- tibble( A = c("a", "b", "c", "d", "e"))
B <- tibble( A = c("a", "b", "c"))
dplyr::anti_join(A, B, by = "A")
# A
# <chr>
# 1 d
# 2 e
A base R possibility (well except the tibble):
A[!A$A %in% B$A,]
returns
# A tibble: 2 x 1
A
<chr>
1 d
2 e
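The question also asks about rows that are unique to either table. The answers above only cover rows unique to the first one, so here is a hedged base R sketch of the two-sided case. B is given an extra value "f" purely for illustration (an assumption, not part of the original data), and plain data frames are used so that no package is needed.

```r
# Example data; the "f" row in B is an added assumption so that B has
# something unmatched as well.
A <- data.frame(A = c("a", "b", "c", "d", "e"), stringsAsFactors = FALSE)
B <- data.frame(A = c("a", "b", "c", "f"), stringsAsFactors = FALSE)

# Rows of each table whose key has no match in the other table.
only_A <- A[!A$A %in% B$A, , drop = FALSE]
only_B <- B[!B$A %in% A$A, , drop = FALSE]

# Stack both sets of unmatched rows.
unmatched <- rbind(only_A, only_B)
unmatched$A
# [1] "d" "e" "f"
```

With dplyr, the same idea would be two anti_join() calls, one in each direction, combined with bind_rows().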

For loop with factor data

I have two vectors of factor data with equal length. Just for example's sake:
observed=c("a", "b", "c", "a", "b", "c", "a")
predicted=c("a", "a", "b", "b", "b", "c", "c")
Ultimately, I am trying to generate a classification matrix showing the number of times each level is predicted correctly (T) and incorrectly (F). This would look like the following for the example:
name T F
a 1 2
b 1 1
c 1 1
Note that the table() command doesn't do what I want here, because I have 11 different levels and the output would be 11x11 instead of 11x2. My plan is to create three vectors and combine them into a data frame.
First, a vector of the unique values in the existing vectors. This is simple enough:
names=unique(observed)
Next, a vector of values showing the number of correct predictions. This is where I am running into trouble. I can get the number of correct predictions for an individual factor like so:
correct.a=sum(predicted[which(observed == "a")] == "a")
But this is cumbersome to repeat for every level and then combine into a vector like
correct=c(correct.a, correct.b, correct.c)
Is there a way to use a loop (or other strategy that you can think of) to improve this process?
Also note that the final vector I would create would be something like this:
incorrect.a=sum(observed == "a")-correct.a
One option is to split the match indicator by observed class and tabulate each piece:
t(sapply(split(predicted == observed, observed), table))
# FALSE TRUE
#a 2 1
#b 1 1
#c 1 1
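The sapply() version relies on every class having both correct and incorrect predictions. If some class were always right (or always wrong), the per-class tables would have different lengths and the result would no longer be a matrix. A sketch that sidesteps this, assuming the same two vectors, is a two-way table with an explicit T/F factor. The names correct and res are illustrative.

```r
observed  <- c("a", "b", "c", "a", "b", "c", "a")
predicted <- c("a", "a", "b", "b", "b", "c", "c")

# Fix both categories up front so each row always has a T and an F column.
correct <- factor(predicted == observed,
                  levels = c(TRUE, FALSE), labels = c("T", "F"))

# One row per observed class, one column per outcome.
res <- table(observed, correct)
res
#         correct
# observed T F
#        a 1 2
#        b 1 1
#        c 1 1
```

If a plain data frame is preferred, as.data.frame.matrix(res) converts the table while keeping the 3x2 layout.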
I would suggest data.table as an explicit, clean way to define your results:
library(data.table)
observed=c("a", "b", "c", "a", "b", "c", "a")
predicted=c("a", "a", "b", "b", "b", "c", "c")
dt <- data.table(observed, predicted)
res <- dt[, .(
T = sum(observed == predicted),
F = sum(observed != predicted)),
observed
]
res
# observed T F
# 1: a 1 2
# 2: b 1 1
# 3: c 1 1

Counting number of elements in a character column by levels of a factor column in a dataframe

I am a beginner in R. I have a dataframe in which there are two factor columns. One column is a company column, second is a product column. There are several missing values in product column and so I want to count the number of values in product column for each company (or each level of the company variable). I tried table, and count function in plyr package but they only seem to work with numeric variables. Please help!
Let's say the data frame looks like this:
df <- data.frame(company= c("A", "B", "C", "D", "A", "B", "C", "C", "D", "D"), product = c(1, 1, 2, 3, 4, 3, 3, 1, NA, NA))
So the output I am looking for is:
A 2
B 2
C 3
D 1
Thanks in advance!!
A dplyr solution:
library(dplyr)
df %>%
filter(!is.na(product)) %>%
group_by(company) %>%
count()
# A tibble: 4 × 2
company n
<fctr> <int>
1 A 2
2 B 2
3 C 3
4 D 1
We can use rowsum from base R:
with(df, rowsum(+!is.na(product), company))
Assuming your df is one of the following:
CASE 1) As given in the question
Data for df:
comp <- c("A", "B", "C", "D", "A", "B", "C", "C", "D","D" )
prod <- c(1,1,2,3,4,3,3,1,NA,NA)
df <- data.frame(comp=comp,prod=prod,stringsAsFactors = FALSE)
Program:
df$prodflag <- !is.na(df$prod)
tapply(df$prodflag , df$comp,sum)
Output:
A B C D
2 2 3 1
#########################################################################
CASE 2) If prod is character, the missing values were entered as the quoted string "NA", and the column is converted to a factor (stringsAsFactors = TRUE), then the "NA" entries are ordinary values rather than real missing values, so compare against the string:
Data:
comp <- c("A", "B", "C", "D", "A", "B", "C", "C", "D","D" )
prod <- c("a","a","b","c","d","c","c","a","NA","NA")
df <- data.frame(comp=comp,prod=prod,stringsAsFactors = TRUE)
Solution:
df$prodflag <- as.numeric(as.character(df$prod) != "NA")
tapply(df$prodflag , df$comp,sum)
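An alternative for this case is to turn the "NA" strings into real missing values first, so that the is.na() logic from the other cases applies unchanged. The sketch below assumes the same data as in Case 2; counts is an illustrative name. Assigning NA to a level through levels<- drops that level and marks the corresponding values as missing.

```r
comp <- c("A", "B", "C", "D", "A", "B", "C", "C", "D", "D")
prod <- c("a", "a", "b", "c", "d", "c", "c", "a", "NA", "NA")
df <- data.frame(comp = comp, prod = prod, stringsAsFactors = TRUE)

# Replace the level literally named "NA" with a real missing value.
levels(df$prod)[levels(df$prod) == "NA"] <- NA

# Count the non-missing products per company.
counts <- tapply(!is.na(df$prod), df$comp, sum)
counts
# A B C D
# 2 2 3 1
```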
#########################################################################
CASE 3) If prod is character and converted to a factor (stringsAsFactors = TRUE), but the NAs are real missing values (not quoted), is.na() works directly:
Data:
comp <- c("A", "B", "C", "D", "A", "B", "C", "C", "D","D" )
prod <- c("a","a","b","c","d","c","c","a",NA,NA)
df <- data.frame(comp=comp,prod=prod,stringsAsFactors = TRUE)
Solution:
df$prodflag <- as.numeric(!is.na(df$prod))
tapply(df$prodflag , df$comp,sum)
Moral of the story: we should understand our data first, and then we can choose the logic that best suits our needs.

duplicated levels in factors will be forbidden April 2017. What about the levels function?

In the R-devel list, Martin Maechler posted a message about duplicated levels in factors
"factors with non-unique (duplicated) levels have been deprecated since 2009 -- are more deprecated now ..." June 4, 2016
It states that in R 3.4, scheduled for April 2017, duplicated levels will cause an error, no longer just a warning.
I wonder why the levels function does not cause a similar warning. Here I combine the first three levels as "a" in two ways, one of them deprecated.
Example
> x <- c("a", "b", "c", "d")
> xf <- factor(x, levels = c("a", "b", "c", "d"),
labels = c("a", "a", "a", "d"))
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL)
as.character(labels) else paste0(labels, :
duplicated levels in factors are deprecated
> xf <- factor(x)
> levels(xf) <- c("a", "a", "a", "d")
> xf
[1] a a a d
Levels: a d
I would like to understand why the latter is treated differently by R than the former.
This is the documented behavior of levels; I'm not exploiting an undocumented feature. In ?levels, there is an example in which duplicated levels are allowed. I'll paste it in to save you the lookup.
## combine some levels
z <- gl(3, 2, 12, labels = c("apple", "salad", "orange"))
z
levels(z) <- c("fruit", "veg", "fruit")
z
Factors represent categorical variables. The levels attribute of a factor lists its distinct categories. A factor cannot list the same category twice in its levels, which would not make sense, but the same category can of course occur many times among the data values.
The data inside a factor are stored as an integer vector; use unclass to see it. Each integer is an index into the levels attribute, so every value that belongs to the first level is stored as 1. For an ordered factor, the lowest category is stored as 1.
x <- c(letters[1:3], letters[1:3])
xf <- factor(x)
xf
# [1] a b c a b c
# Levels: a b c
attributes(xf)
# $levels
# [1] "a" "b" "c"
#
# $class
# [1] "factor"
unclass(xf)
# [1] 1 2 3 1 2 3
# attr(,"levels")
# [1] "a" "b" "c"
If a data value does not match any of the supplied levels, it becomes NA.
factor(c("a", "b", "c"), levels = c("e", "f", "g"))
# [1] <NA> <NA> <NA>
# Levels: e f g
labels is an optional argument used to rename the categories. Each label replaces the level in the same position, so a data value that matches a level receives the corresponding label. Notice that the value "e" is given the label "h".
factor(c("a", "b", "e"), levels = c("e", "f", "g"), labels = c("h", "i", "j"))
# [1] <NA> <NA> h
# Levels: h i j
Now levels() also has a replacement form, levels(x) <- value, which renames the levels of a factor and thereby changes how the stored data are displayed. The new names are matched to the existing levels by position, so they must line up with the factor's current levels. Otherwise garbage is created.
xf
# [1] a b c a b c
# Levels: a b c
The values "a" are changed to "e", "b" to "f", and "c" to "g". This example shows how to rename the categories of a factor variable. Only the first three entries are needed, one per existing level; the repeated entries are redundant here.
levels(xf) <- c("e", "f", "g", "e", "f", "g")
xf
# [1] e f g e f g
# Levels: e f g
Now the garbage type: the first three entries, which are matched to the three existing levels, are all "m", so every value collapses into "m", and "n" is left over as an unused level. To see the integer vector, use unclass(xf).
levels(xf) <- c("m", "m", "m", "n", "n", "n")
xf
# [1] m m m m m m
# Levels: m n
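To make the positional matching concrete, here is a small sketch of merging levels on purpose. It starts from a fresh factor with levels a, b, c and uses both the character form and the named-list form of levels<-, which are described in ?levels. The variable names merged and merged2 are illustrative.

```r
xf <- factor(c("a", "b", "c", "a", "b", "c"))

# Character form: one new name per existing level, matched by position.
# "a" and "b" both become "m"; "c" becomes "n".
merged <- xf
levels(merged) <- c("m", "m", "n")
merged
# [1] m m n m m n
# Levels: m n

# The duplicates are collapsed, and the integer codes are remapped to match.
as.integer(merged)
# [1] 1 1 2 1 1 2

# List form: new name = old levels, which states the mapping explicitly.
merged2 <- xf
levels(merged2) <- list(m = c("a", "b"), n = "c")
```

The list form is arguably safer for long level sets, because it does not depend on remembering the order of the existing levels.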

Can R display how many changes were made to a variable like Stata does

When one is, e.g., replacing a variable in Stata, the Stata output will say that x real changes were made to the variable. This is very useful to know. Is there any similar functionality in R?
I think you could achieve the desired result by simply comparing the original vector with the newly created one and tabulating the result:
A <- c("A", "B", "C", "D")
B <- c("A", "C", "C", "E")
A == B
# OR
table(A == B)
In effect, save the transformed values as a new column or vector and then compare it with the original object. In table(A == B), the FALSE count is the number of values that were changed and the TRUE count is the number left unchanged.
Full output
> A <- c("A", "B", "C", "D")
> B <- c("A", "C", "C", "E")
> A == B
[1] TRUE FALSE TRUE FALSE
> table(A == B)["TRUE"]
TRUE
2
> table(A == B)
FALSE TRUE
2 2
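Base R has no built-in report of this kind as far as the answer above goes, but the comparison can be wrapped in a small helper. The function below is a hypothetical sketch, not an existing R function: count_changes is a made-up name, and treating NA to value (or value to NA) as a change is an assumption about what should count as a "real change".

```r
# Hypothetical helper: report how many elements differ between two vectors.
count_changes <- function(old, new) {
  stopifnot(length(old) == length(new))
  both_na <- is.na(old) & is.na(new)      # missing before and after: unchanged
  one_na  <- xor(is.na(old), is.na(new))  # became missing, or was filled in
  differ  <- !both_na & (one_na | (old != new))
  n <- sum(differ)
  message(n, " real changes made")
  invisible(n)
}

A <- c("A", "B", "C", "D")
B <- c("A", "C", "C", "E")
n_changed <- count_changes(A, B)
# 2 real changes made
```

Because plain != returns NA whenever either side is missing, the helper handles the NA cases separately before counting.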
