Is there a way in R to rank categorical variable (of characters) into ranked ordinal data? - r

I have a list of character strings, say
alphabets = c(a, b, c, d,..., z) and I would like to get the index of this list as a new column in a data.frame.
e.g. (b, a, c, d, e, g) would yield (2, 1, 3, 4, 5, 7).

The solution you need is to convert the character vector to a factor:
alphabets = c("b", "a", "c", "d", "e", "g")
#convert to class factor with the order defined by the levels argument
alphabets<-factor(alphabets, levels=letters)
#display the values
as.numeric(alphabets)
#[1] 2 1 3 4 5 7

This is a case for match
x <- c("b", "a", "c", "d", "e", "g")
match(x, letters)
#[1] 2 1 3 4 5 7
Or use sapply with grep, which returns a named integer vector:
sapply(x, grep, letters)
#b a c d e g
#2 1 3 4 5 7
Two comments:
"I have a list of character strings" Be precise with class names of objects! alphabets = c("a", "b", "c", "d") is a character vector, not a list.
letters is a built-in constant which returns the 26 lower-case letters (of the Roman alphabet) as a character vector. See ?letters for details.
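Since the question asks for the index as a new column in a data.frame, here is a minimal sketch of that step using match. The data frame df and its column names letter and index are made up for illustration; they do not come from the question.

```r
# Hypothetical data frame; the names df, letter and index are illustrative only.
df <- data.frame(letter = c("b", "a", "c", "d", "e", "g"),
                 stringsAsFactors = FALSE)

# Position of each value in the reference vector (here the built-in letters).
df$index <- match(df$letter, letters)

df$index
# [1] 2 1 3 4 5 7
```

Values that do not occur in the reference vector would come back as NA, which is worth checking for before using the column.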

Related

Looping in R with dynamic variables as dataframe names

I am trying to loop through dataframes where my search variable is in the name of the dataframe. Here I have multiple dataframes beginning with "person", "place", or "thing" and ending with either "5" or "8." I would like to loop through the many combinations of beginning and ending to create a temporary dataframe. The temporary dataframe will be used to create a plot and save the plot.
When I try my current code, I'm able to get the variable name to loop correctly (in other words, I can get "person_odds5" or "place_odds5"), but I cannot use those variables to access the corresponding column in the dataframe.
My current code is:
person_odds5 <- data.frame(odds=c("a", "b", "c", "d"), or_lci95=1:4, or_uci95=11:14, id.exposure=c("f", "g", "h", "i"), id.outcome=c("w", "x", "y", "z"))
place_odds5 <- data.frame(odds=c("a", "b", "c", "d"), or_lci95=5:8, or_uci95=15:18, id.exposure=c("f", "g", "h", "i"), id.outcome=c("w", "x", "y", "z"))
thing_odds5 <- data.frame(odds=c("a", "b", "c", "d"), or_lci95=9:12, or_uci95=19:22, id.exposure=c("f", "g", "h", "i"), id.outcome=c("w", "x", "y", "z"))
nouns <- list("person", "place", "thing")
for (x in nouns) {
pval <- c(5)
for (p in pval) {
name <- paste(x,"_odds",p, sep="")
odds <- paste(name,"$odds", sep="")
temp_dat <- data.frame(odds=odds, index=1:nrow(name))
}
}
When I run this code, my output for "name" is "person_odds5" as character type; my output for "odds" is "person_odds5$odds" as character type, and I encounter "Error in 1:nrow(name) : argument of length 0." Basically, it appears that I can't parse my name assignment through the original dataframe.
Input:
>person_odds5
odds or_lci95 or_uci95 id.exposure id.outcome
1 a 1 11 f w
2 b 2 12 g x
3 c 3 13 h y
4 d 4 14 i z
>
Desired output:
>temp_dat
odds index
1 a 1
2 b 2
3 c 3
4 d 4
>
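One possible way to reach the desired output, sketched here under the assumption that the data frames live in the current environment, is to look each one up by its name with get() instead of pasting "$odds" onto the name string. The results list is a hypothetical container added only so that each temp_dat can be inspected after the loop.

```r
person_odds5 <- data.frame(odds = c("a", "b", "c", "d"), or_lci95 = 1:4)
place_odds5  <- data.frame(odds = c("a", "b", "c", "d"), or_lci95 = 5:8)
thing_odds5  <- data.frame(odds = c("a", "b", "c", "d"), or_lci95 = 9:12)

results <- list()  # hypothetical container, not part of the original code
for (x in c("person", "place", "thing")) {
  for (p in c(5)) {
    name <- paste0(x, "_odds", p)
    dat <- get(name)  # fetch the data frame whose name is stored in 'name'
    temp_dat <- data.frame(odds = dat$odds, index = seq_len(nrow(dat)))
    results[[name]] <- temp_dat
  }
}

results$person_odds5$index
# [1] 1 2 3 4
```

The original error arises because name is a character string, so nrow(name) is NULL. Keeping the data frames in a named list from the start would avoid the lookup by name altogether.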

R - Data.table fast binary search based subset with multiple values in second key

I have come across this vignette at https://cran.r-project.org/web/packages/data.table/vignettes/datatable-keys-fast-subset.html#multiple-key-point.
My data looks like this:
ID TYPE MEASURE_1 MEASURE_2
1 A 3 3
1 B 4 4
1 C 5 5
1 Mean 4 4
2 A 10 1
2 B 20 2
2 C 30 3
2 Mean 20 2
When I do this ... all works as expected.
setkey(dt, ID, TYPE)
dt[.(unique(ID), "A")] # extract SD of all IDs with Type A
dt[.(unique(ID), "B")] # extract SD of all IDs with Type B
dt[.(unique(ID), "C")] # extract SD of all IDs with Type C
Whenever I try something like the following, where I want to base the keyed subset on multiple values for the second key, I only get the combinations of the unique values in key 1 with the first value in the vector c() given for the second key. So it only takes the first value in the vector and ignores all the following values.
# extract SD of all IDs with one of the 3 types A/B/C
dt[.(unique(ID), c("A", "B", "C"))]
# previous output is equivalent to
dt[.(unique(ID), "A")] # extract SD of all IDs with Type A
# I want/expect
dt[TYPE %in% c("A", "B", "C")]
What am I missing here, or is this something I cannot do with keyed subsets?
To clarify: because I cannot leave out the first key in a keyed subset, the vignette calls for including it via unique(key1).
Supplying multiple values for the first key also works as expected.
dt[.(c(1, 2), "A")] == dt[ID %in% c(1,2) & TYPE == "A"] # TRUE
In the data.table documentation (see help("data.table") or https://rdatatable.gitlab.io/data.table/reference/data.table.html#arguments), it is mentioned:
character, list and data.frame input to i is converted into a data.table internally using as.data.table.
So, the classical recycling rule used in R (or in data.frame) applies. That is, .(unique(ID), c("A", "B", "C")), which is equivalent to list(unique(ID), c("A", "B", "C")), becomes:
as.data.table(list(unique(ID), c("A", "B", "C")))
and since the length of the longest list element (length of c("A", "B", "C")) is not a multiple of the shorter one (length of unique(ID)), you will get an error.
If you want each value in unique(ID) combined with each element in c("A", "B", "C"), you should use CJ(unique(ID), c("A", "B", "C")) instead.
So what you should do is dt[CJ(unique(ID), c("A", "B", "C"))].
Note that dt[.(unique(ID), "A")] works correctly because you passed only one element for the second key and this gets recycled to match the length of unique(ID).
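A minimal sketch of the CJ approach, assuming the data.table package is installed and rebuilding the example table from the question. The names keyed and filtered are illustrative, and unique(dt$ID) is written out in full so the sketch does not depend on how i is scoped.

```r
library(data.table)

dt <- data.table(
  ID        = rep(1:2, each = 4),
  TYPE      = rep(c("A", "B", "C", "Mean"), times = 2),
  MEASURE_1 = c(3, 4, 5, 4, 10, 20, 30, 20),
  MEASURE_2 = c(3, 4, 5, 4, 1, 2, 3, 2)
)
setkey(dt, ID, TYPE)

# Cross join: every unique ID paired with every requested TYPE.
keyed <- dt[CJ(unique(dt$ID), c("A", "B", "C"))]

# The vector-scan equivalent the question expects.
filtered <- dt[TYPE %in% c("A", "B", "C")]

nrow(keyed)
# [1] 6
```

Both subsets should contain the same six rows. The keyed version reaches them through binary search on the (ID, TYPE) key.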

Is there a way to relevel a variable using the original level positions?

I have a variable with many very long factor names that are in alphabetical order instead of logical. Is there a way to relevel by position instead of variable name?
f <- factor(c("a", "b", "c", "d"), levels = c("b", "c", "d", "a"))
Instead of fct_relevel(f, "b", "a"), can I use the level order to move the second (b) before the first (a), e.g. fct_relevel(f, 2, 1)?
You can get the value from f :
forcats::fct_relevel(f, as.character(f[2]), as.character(f[1]))
#[1] a b c d
#Levels: b a c d
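If the positions are meant to refer to the levels rather than to the data values, a base R alternative (not the forcats call used above) is to index levels(f) and rebuild the factor. This is only a sketch; the position vector c(4, 1, 2, 3) and the name f2 are illustrative choices.

```r
f <- factor(c("a", "b", "c", "d"), levels = c("b", "c", "d", "a"))

lv <- levels(f)  # "b" "c" "d" "a"

# Move the 4th level to the front purely by position.
f2 <- factor(f, levels = lv[c(4, 1, 2, 3)])

levels(f2)
# [1] "a" "b" "c" "d"
```

The data values are unchanged; only the order of the levels differs. The same lv[...] indexing could be passed to fct_relevel if staying within forcats is preferred.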

How to return the index of certain duplicate strings in a character vector ignoring the index of the first occurrence of the duplicate string?

I have a vector of strings and I want to return the index of the duplicate values, except for the index of the first occurrence of a duplicate value, given another vector with matches. For example:
x <- c("a", "b", "c", "b", "a", "a", "c", "c")
matching_values <- c("a", "b")
So I would like to have an integer vector returned with the indexes 4, 5, 6. So the first duplicate of a occurs at position 5 and the second duplicate at position 6. The first duplicate for b occurs at index 4 and because I did not specify to match c, there will be no index returned. Thank you!
You could use :
which(duplicated(x) & x %in% matching_values)
#[1] 4 5 6
We can use duplicated with %in%
which(x %in% matching_values & duplicated(x))
#[1] 4 5 6

duplicated levels in factors will be forbidden April 2017. What about the levels function?

In the R-devel list, Martin Maechler posted a message about duplicated levels in factors
"factors with non-unique (duplicated) levels have been deprecated since 2009 -- are more deprecated now ..." June 4, 2016
It states that in R 3.4, scheduled for April 2017, duplicated levels will cause an error, no longer just a warning.
I wonder why the levels function does not cause a similar warning. Here I combine the first three levels as "a" in two ways, one of them deprecated.
Example
> x <- c("a", "b", "c", "d")
> xf <- factor(x, levels = c("a", "b", "c", "d"),
labels = c("a", "a", "a", "d"))
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL)
as.character(labels) else paste0(labels, :
duplicated levels in factors are deprecated
> xf <- factor(x)
> levels(xf) <- c("a", "a", "a", "d")
> xf
[1] a a a d
Levels: a d
I would like to understand why the latter is treated differently by R than the former.
This is the documented behavior of levels; I'm not exploiting an unstated element. In ?levels, there is an example in which duplicated levels are allowed. I'll paste it in to save you the lookup.
## combine some levels
z <- gl(3, 2, 12, labels = c("apple", "salad", "orange"))
z
levels(z) <- c("fruit", "veg", "fruit")
z
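For comparison, the levels replacement function also accepts a named list, which states the merge explicitly. The sketch below reuses the same z and should give the same two levels as the character-vector form above.

```r
z <- gl(3, 2, 12, labels = c("apple", "salad", "orange"))

# Each list name is a new level; its elements are the old levels to merge.
levels(z) <- list(fruit = c("apple", "orange"), veg = "salad")

levels(z)
# [1] "fruit" "veg"
```

The list form makes it harder to merge levels by accident, because every old level has to be named.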
Factors are used to create categorical variables. The levels attribute of such a variable represents its categories. A variable cannot have duplicate categories; that would not make sense. However, a variable can have several data values belonging to the same category.
The data inside a categorical variable is represented as an integer vector. Use unclass to see the integer vector. The levels attribute represents the categories of this variable. For example, a value that belongs to the first category is stored as the number 1. If it is an ordered factor, then the lowest category is assigned the number 1.
x <- c(letters[1:3], letters[1:3])
xf <- factor(x)
xf
# [1] a b c a b c
# Levels: a b c
attributes(xf)
# $levels
# [1] "a" "b" "c"
#
# $class
# [1] "factor"
unclass(xf)
# [1] 1 2 3 1 2 3
# attr(,"levels")
# [1] "a" "b" "c"
If a data value does not match any of the categories given in levels, it is assigned NA.
factor(c("a", "b", "c"), levels = c("e", "f", "g"))
# [1] <NA> <NA> <NA>
# Levels: e f g
labels is an optional argument used to change the names of the categories. Each data value that matches an entry in the levels argument is given the label at the same position in the labels argument. Notice that the value "e" is given the category "h".
factor(c("a", "b", "e"), levels = c("e", "f", "g"), labels = c("h", "i", "j"))
# [1] <NA> <NA> h
# Levels: h i j
Now levels() used on the left-hand side of an assignment is a replacement function that renames the categories of a factor variable. The new names are matched to the existing levels by position, so they must correspond to the levels of the factor variable. Otherwise garbage is created.
xf
# [1] a b c a b c
# Levels: a b c
The values "a" are changed to "e", "b" to "f", and "c" to "g". This example shows how to rename the categories of a factor variable.
levels(xf) <- c("e", "f", "g", "e", "f", "g")
xf
# [1] e f g e f g
# Levels: e f g
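Now the garbage case: the new names do not correspond to the levels of the factor variable xf. The three existing levels are all renamed to "m", so every value becomes "m", and "n" is left as an unused level. To see the integer vector, use unclass(xf).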
levels(xf) <- c("m", "m", "m", "n", "n", "n")
xf
# [1] m m m m m m
# Levels: m n
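The positional renaming above can be seen more directly with a replacement vector of the same length as the number of levels. This is a small sketch and the level names are arbitrary. No warning is raised, because duplicated names passed to the levels replacement function simply merge the corresponding levels.

```r
xf <- factor(c("a", "b", "c", "a", "b", "c"))

# Rename by position: level 1 ("a") -> "m", level 2 ("b") -> "m", level 3 ("c") -> "n".
levels(xf) <- c("m", "m", "n")

as.character(xf)
# [1] "m" "m" "n" "m" "m" "n"
levels(xf)
# [1] "m" "n"
as.integer(xf)
# [1] 1 1 2 1 1 2
```

Former levels "a" and "b" now share integer code 1, which is how the merged factor ends up with unique levels.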
