How to access actual internal factor lookup hashtable in R

Dear Stackoverflow community,
I have looked everywhere but can't find the answer to this question. I am trying to access the factor lookup table that R uses when you change a string vector into a factor vector. I am not trying to convert a string to a factor but rather to get the lookup table underlying the factor variable and store it as a hash table for use elsewhere.
I ran into this because I want to apply the factor's lookup table to a list of vectors of different lengths, converting them from strings to numbers.
That is, I have a list of item sets that I want to convert to numeric, but each set in the list has a different number of items.
So far, I have converted the list of vectors into a single vector and then into a factor:
vec <- unlist(list)
vec <- factor(vec)
Now I want to do a lookup on the original list with the factor lookup table which must be underlying vec, but I can't seem to find it.

I think you either want the indexes which map the elements of the factor to elements of the factor levels, as in:
vec <- c('a','b','c','b','a')
f <- factor(vec)
f
#> [1] a b c b a
#> Levels: a b c
indx <- f
attributes(indx) <- NULL
indx
#> [1] 1 2 3 2 1
or you want the hash tables used internally to create the factor. Unfortunately, any hash table built while creating a factor is built inside the functions unique and match, which are implemented internally, so you have no access to anything those functions create (other than their return values, of course). If you want a hash table for mapping a character vector onto the levels of your existing factor, just create one, as in:
library(hash)
.levels <- levels(f)
h <- hash(keys = .levels,values = seq_along(.levels))
newVec <- sample(.levels, 10, replace = TRUE)
newVec
#> [1] "a" "b" "a" "a" "a" "c" "c" "b" "c" "a"
values(h,keys = newVec)
#> a b a a a c c b c a
#> 1 2 1 1 1 3 3 2 3 1
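
For the original problem (a list of item sets of different lengths), a separate hash table may not be needed at all, since the levels themselves can serve as the lookup table. Below is a minimal base-R sketch of that idea. The list item_sets and its contents are made up for illustration and are not from the question.

```r
# Hypothetical example data: item sets of different lengths
item_sets <- list(c("a", "b"), "c", c("b", "c", "a"))

# The factor levels act as the lookup table: level i maps to integer code i
lv <- levels(factor(unlist(item_sets)))

# Option 1: match() each set against the levels
coded <- lapply(item_sets, match, table = lv)

# Option 2: a named integer vector used as a simple lookup table
lookup <- setNames(seq_along(lv), lv)
coded2 <- lapply(item_sets, function(x) unname(lookup[x]))

str(coded)
```

Both options keep the list structure intact, so each set keeps its own length. Strings that are not among the levels come back as NA.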

Related

Extract all values from a vector of named numerics with the same name in R

I'm trying to handle a vector of named numerics for the first time in R. The vector itself is named p.values. It consists of p-values which are named after their corresponding variables. Through simulation I obtained a huge number of p-values, each named after one of the five variables it corresponds to. However, I'm only interested in the p-values of one variable and tried to extract them with p.values[["var_a"]], but that gives me only the p-value of var_a's last entry. p.values$var_a is invalid, and as.numeric(p.values) or unname(p.values) obviously gives me all values without names. Any idea how I can get R to give me the 1/5 of named numerics that are named var_a?
Short example:
p.values <- as.numeric(c(rep(1:5, each = 5)))
names(p.values) <- rep(letters[1:5], 5)
str(p.values)
Named num [1:25] 1 1 1 1 1 2 2 2 2 2 ...
- attr(*, "names")= chr [1:25] "a" "b" "c" "d" ...
I'd like to get R to show me all 5 numbers named "a".
Thanks for reading my first post here and I hope some more experienced R users know how to deal with named numerics and can help me with this issue.

You can subset p.values using [ with names(p.values) == "a" to show all values named a.
p.values[names(p.values) == "a"]
#a a a a a
#1 2 3 4 5
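
If the values for every name are needed rather than just one, split() is another option. The sketch below reuses the example data from the question. Note that [[ extracts a single element, which is why it cannot return all five values.

```r
p.values <- as.numeric(rep(1:5, each = 5))
names(p.values) <- rep(letters[1:5], 5)

# One list element per name, each holding all values that share that name
by_name <- split(p.values, names(p.values))

# All values named "a", without the repeated names
unname(by_name$a)

# The same values obtained by logical subsetting on the names
unname(p.values[names(p.values) == "a"])
```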

R: Index to unique vector that returns original

I have a vector v <- c(6,8,5,5,8) of which I can obtain the unique values using
> u <- unique(v)
> u
[1] 6 8 5
Now I need an index i = [2,3,1,1,3] that returns the original vector v when indexed into u.
> u[i]
[1] 6 8 5 5 8
I know such an index can be generated automatically in Matlab, the ci index, but does not seem to be part of the standard repertoire in R. Is anyone aware of a function that can do this?
The background is that I have several vectors with anonymized IDs that are long character strings:
ids
"PTefkd43fmkl28en==3rnl4"
"cmdREW3rFDS32fDSdd;32FF"
"PTefkd43fmkl28en==3rnl4"
"PTefkd43fmkl28en==3rnl4"
"cmdREW3rFDS32fDSdd;32FF"
To reduce the file size and simplify the code, I want to transform them into integers of the sort
ids
1
2
1
1
2
and found that the index of the unique vector does just this. Since there are many rows, I am hesitant to write a function that loops over each element of the unique vector and wonder whether there is a more efficient way — or a completely different way to transform the character strings into matching integers.

Try with match
df1$ids <- with(df1, match(ids, unique(ids)) )
df1$ids
#[1] 1 2 1 1 2
Or we can convert to a factor and coerce to integer
with(df1,as.integer(factor(ids, levels=unique(ids))))
#[1] 1 2 1 1 2
Using u and v. The index i shown in the OP's post only reproduces v if 'u' is sorted, so 'u' must have been sorted
u <- sort(unique(v))
match(v, u)
#[1] 2 3 1 1 3
Or using findInterval. Make sure that 'u' is sorted.
findInterval(v,u)
#[1] 2 3 1 1 3
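
As a self-contained check of the match() approach, the sketch below uses the vectors from the question and verifies that indexing the unique values with the computed index gives back the original vector. The shortened id strings are placeholders, not the OP's data.

```r
v <- c(6, 8, 5, 5, 8)

# Index into the unique values in order of first appearance
u <- unique(v)
i <- match(v, u)
i
u[i]

# Index into the sorted unique values (this matches the index in the question)
us <- sort(u)
is <- match(v, us)
is

# The same idea applied to character ids (placeholder strings)
ids <- c("PTefk", "cmdRE", "PTefk", "PTefk", "cmdRE")
codes <- match(ids, unique(ids))
codes
```

match() is vectorised, so no explicit loop over the unique values is needed.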

understanding levels: is levels not same as unique()

I read a csv file into a data frame named rr. The character column was treated as factors which was nice.
Do I understand correctly that the levels are just the unique values of the columns? i.e.
levels(rr$col) == unique(rr$col)
Then I wanted to strip leading and trailing whitespace. (I didn't know about the strip.white option in read.csv.)
So I did
rr$col = str_trim(rr$col)
Now the rr$col is no longer a factor. So I did
rr$col = as.factor(rr$col)
But I see now that levels(rr$col) is missing some unique values !! Why?

"Level" is a special property of a variable (column). They are handy because they are retained even if a subset does not contain any values from a specific level. Take for example
x <- as.factor(rep(letters[1:3], each = 3))
If we subset only elements under levels a and b, c is left out. It will be detected with levels(), but not unique(). The latter will see which values appear in the subset only.
> x[c(1,2, 4)]
[1] a a b
Levels: a b c
> levels(x[c(1,2, 4)])
[1] "a" "b" "c"
> unique(x[c(1,2, 4)])
[1] a b
Levels: a b c
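
The difference can be reproduced in a few lines. The sketch below also shows one plausible reason for seeing fewer levels after trimming: values that differed only by whitespace collapse into one. This is an assumption about the OP's data, and base R's trimws() is used here in place of str_trim() so that no extra package is needed.

```r
x <- factor(rep(letters[1:3], each = 3))
sub <- x[c(1, 2, 4)]

levels(sub)               # unused level "c" is still there
as.character(unique(sub)) # only the values actually present
levels(droplevels(sub))   # droplevels() removes unused levels

# Hypothetical raw column where some values differ only by whitespace
raw <- c("a", " a", "b ", "b")
length(unique(raw))          # four distinct strings before trimming
levels(factor(trimws(raw)))  # two levels after trimming
```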

Having Numeric data type and character data type in the same column of a data frame?

I have a large data frame (570 rows by 200000 columns) in R. For those of you that are familiar with PLINK, I am trying to create a PED file for a GWAS analysis. Plink requires that each missing character be coded with a zero. The non-missing values are "A", "T", "C", or "G".
So, for example, the data structure looks like this in the data frame.
     COL1  COL2
PT1  A     T
PT2  T     T
PT3  A     A
PT4  A     T
PT5  0     0
PT6  A     A
PT7  T     A
PTn  T     T
When I run my file in Plink, I get an error. I went back to check my file in R and found that the zeros were "character" types. Is it possible to have two different data types (numeric and character) in a given column in R? I've tried making the 0's a numeric type and keep the letters as character type, but it won't work.

I think Justin's advice will probably fix the problem you have with Plink, but wanting to answer your question in bold...
Is it possible to have two different data types (numeric and character) in a given column in R?
Not really, but in this particular scenario, where the variable is discrete, kind of yes. R has the factor type, which is similar to an enumerated type in some other languages.
For example try this:
x = factor(c("0","A","C","G","T"),levels=c(0,"A","T","G","C"))
print(x)
[1] 0 A C G T
Levels: 0 A T G C
You can transform them back in integers (first level is 1 by default) and characters:
> as.integer(x)
[1] 1 2 5 4 3
> as.character(x)
[1] "0" "A" "C" "G" "T"
Now when you read a table with read.table, you can indicate that all character columns should be read as factors, even those with quotes around them:
mydata = read.table("yourData.tsv", stringsAsFactors = TRUE)
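
As a side note, a character "0" and a numeric 0 look the same once written to a text file without quotes. The sketch below shows this on a tiny made-up data frame. The column and row names follow the question, but the file name and the write.table settings are illustrative assumptions, not a verified PLINK recipe.

```r
# Mixing a letter and a number in one vector coerces everything to character
class(c("A", 0))

# Hypothetical miniature of the genotype table, with "0" for missing calls
ped <- data.frame(COL1 = c("A", "T", "0"),
                  COL2 = c("T", "T", "0"),
                  row.names = c("PT1", "PT2", "PT5"),
                  stringsAsFactors = FALSE)

# quote = FALSE writes the character "0" as a bare 0 in the output file
out <- tempfile(fileext = ".ped")
write.table(ped, file = out, quote = FALSE, col.names = FALSE)
written <- readLines(out)
written
```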

Concatenating data frame values in R not working as expected

Consider the following code:
> a <- data.frame(name=c('a','b','c'))
> b <- data.frame(type=a$name[1])
> c <- data.frame(type=c(a$name[1],a$name[2]))
> b
type
1 a
> c
type
1 1
2 2
Why does b$type have a value of a, the actual assigned value, whereas c$type takes the value of the index number (1 and 2)?

Well, a$name is a factor, not a character vector, and you can't concatenate factors like that (because the c function currently doesn't handle factors). Factors are really integer vectors with a levels attribute (and a class), so the c function just uses the integer values. This can probably be considered a bug.
One way to combine factors is by using unlist, which has special code for this case:
c <- data.frame(type=unlist(list(a$name[1], a$name[2])))
Another way is to convert to character vectors:
c <- data.frame(type=c(as.character(a$name[1]), as.character(a$name[2])))
A third way is to use a character vector from the start:
a <- data.frame(name=c('a','b','c'), stringsAsFactors=FALSE)
c <- data.frame(type=c(a$name[1],a$name[2]))
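
The sketch below puts the first two workarounds together in a form that should behave the same across R versions. As far as I know, data.frame stopped converting strings to factors by default in R 4.0.0, and c() gained a method that combines factors in R 4.1.0, so stringsAsFactors = TRUE is set explicitly here to reproduce the factor column. The variable names combined, chars and res are illustrative.

```r
# Force a factor column, as in the R versions the question was written for
a <- data.frame(name = c("a", "b", "c"), stringsAsFactors = TRUE)

# unlist() on a list of factors returns a factor with the combined levels
combined <- unlist(list(a$name[1], a$name[2]))
is.factor(combined)
as.character(combined)

# Converting to character first sidesteps factor handling entirely
chars <- c(as.character(a$name[1]), as.character(a$name[2]))
res <- data.frame(type = chars)
res
```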