Understanding levels: is levels() not the same as unique()? - R

I read a CSV file into a data frame named rr. The character column was read as a factor, which was nice.
Do I understand correctly that the levels are just the unique values of the column? i.e.
levels(rr$col) == unique(rr$col)
Then I wanted to strip leading and trailing whitespace. (I didn't know about the strip.white option of read.csv.)
So I did
rr$col = str_trim(rr$col)
Now rr$col is no longer a factor. So I did
rr$col = as.factor(rr$col)
But I see now that levels(rr$col) is missing some unique values! Why?

"Level" is a special property of a variable (column). They are handy because they are retained even if a subset does not contain any values from a specific level. Take for example
x <- as.factor(rep(letters[1:3], each = 3))
If we keep only elements with levels a and b, no element of the subset has level c. levels() still reports c, because levels are an attribute of the factor and are retained when subsetting; unique() returns only the values that actually appear in the subset.
> x[c(1,2, 4)]
[1] a a b
Levels: a b c
> levels(x[c(1,2, 4)])
[1] "a" "b" "c"
> unique(x[c(1,2, 4)])
[1] a b
Levels: a b c
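One plausible reading of the question is that trimming made several distinct raw values identical, so the rebuilt factor has fewer levels than the original. The sketch below illustrates that with a made-up column (col is hypothetical and stands in for rr$col), using base R's trimws() instead of str_trim(). It also shows droplevels(), which discards levels that no longer occur in a subset.

```r
# Hypothetical column with stray whitespace, standing in for rr$col
col <- factor(c("a", " a", "b ", "b", "c"))
n_before <- nlevels(col)        # " a" and "a" count as different levels

# Trim the labels, then rebuild the factor: variants collapse together
trimmed <- factor(trimws(as.character(col)))
levels(trimmed)

# Levels survive subsetting unless they are dropped explicitly
x <- as.factor(rep(letters[1:3], each = 3))
sub <- x[c(1, 2, 4)]
kept <- levels(sub)                 # still includes "c"
dropped <- levels(droplevels(sub))  # only levels present in the subset
```

If the rebuilt factor has fewer levels than the original, comparing levels(col) with levels(trimmed) shows which raw values were merged.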

Related

Trouble with understanding explanation of %in%

I am having trouble understanding %in%. In Hadley Wickham's book "R for Data Science", section 5.2.2 says, "A useful short-hand for this problem is x %in% y. This will select every row where x is one of the values in y." Then this example is given:
nov_dec <- filter(flights, month %in% c(11, 12))
However, when I look at the syntax, it appears that it should be selecting every row where y is one of the values in x(?) So in the example, all the cases where 11 and 12 (y) appear in "month" (x).
?"%in%" doesn't make this any clearer to me. Obviously I'm missing something, but could someone please spell out exactly how this function works?
You wrote: "It appears that it should be selecting every row where y is one of the values in x(?) So in the example, all the cases where 11 and 12 appear in 'month.'"
If you don't understand the behavior from looking at the example, try it out yourself. For example, you could do this:
> c(1,2,3) %in% c(2,4,6)
[1] FALSE TRUE FALSE
So it looks like %in% gives you a vector of TRUE and FALSE values that correspond to each of the items in the first argument (the one before %in%). Let's try another:
> c(1,2,3) %in% c(2,4,6,8,10,12,1)
[1] TRUE TRUE FALSE
That confirms it: the first item in the returned vector is TRUE if the first item in the first argument is found anywhere in the second argument, and so on. Compare that result to the one you get using match():
> match(c(1,2,3), c(2,4,6,8,10,12,1))
[1] 7 1 NA
So the difference between match() and %in% is that the former gives you the actual position in the second argument of the first match for each item in the first argument, whereas %in% gives you a logical vector that just tells you whether each item in the first argument appears in the second.
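The relationship between the two can be checked directly: %in% behaves like match() followed by a test for "was a position found". A minimal sketch, where lookup_table is just an illustrative name for the second argument:

```r
x <- c(1, 2, 3)
lookup_table <- c(2, 4, 6, 8, 10, 12, 1)

positions <- match(x, lookup_table)   # position of first match, NA if none
via_match <- !is.na(positions)        # TRUE wherever a match exists
via_in    <- x %in% lookup_table

same <- identical(via_match, via_in)
```

So %in% throws away the positions and keeps only the yes/no answer for each element of x.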
In the context of Wickham's book example, month is a vector of values representing the months in which various flights take place. So for the sake of argument, something like:
> month <- c(2,3,5,11,2,9,12,10,9,12,8,11,3)
Using the %in% operator lets you turn that vector into the answers to the question "Is this flight in month 11 or 12?" like this:
> month %in% c(11,12)
[1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE
[13] FALSE
which gives you a logical vector, i.e. a vector of TRUE/FALSE values. The filter() function uses that logical vector to select the corresponding rows from the flights table. Used together, filter() and %in% answer the question "What are all the flights that occur in months 11 or 12?"
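The same row selection can be sketched in base R without dplyr, which makes the role of the logical vector explicit. The data frame flights_small below is a made-up stand-in for the flights table, not the real data set:

```r
# Hypothetical miniature of the flights table
flights_small <- data.frame(
  month  = c(2, 11, 12, 9),
  flight = c("A1", "B2", "C3", "D4"),
  stringsAsFactors = FALSE
)

# One TRUE/FALSE per row: is this row's month one of 11 or 12?
keep <- flights_small$month %in% c(11, 12)

# Rows where keep is TRUE are retained, the rest are dropped
nov_dec <- flights_small[keep, ]
```

Here keep has one element per row of flights_small, which is why month has to be on the left of %in%.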
If you turned the %in% around and instead asked:
> c(11,12) %in% month
[1] TRUE TRUE
you're really just asking "Are there any flights in each of month 11 and month 12?"
I can imagine that it might seem odd to ask whether a large vector is "in" a vector that has only two values. Consider reading x %in% y as "Is each of the values from x also in y?"
A quick exercise should be enough to demonstrate how the function works:
> x <- c(1, 2, 3, 4)
> y <- 4
> z <- 5
> x %in% y
[1] FALSE FALSE FALSE TRUE
So the fourth element of numeric vector x is present in numeric vector y.
> y %in% x
[1] TRUE
And the first element of y (there's only one) is in x.
> z %in% x
[1] FALSE
> x %in% z
[1] FALSE FALSE FALSE FALSE
So z is not in x, and no element of x is in z.
Also see the help for all matching functions with ?match
I think understanding how it works is mostly a matter of semantics: once you can say it logically, the grammar works itself out.
The key is to form a sentence in your head as you read the code: the test is applied to each row in turn, and Boolean logic includes or excludes the row depending on whether its value appears in the filter list given to %in% c().
nov_dec <- filter(flights, month %in% c(11, 12))
In your example above it should read like this:
"Set the variable nov_dec equal to the subset of rows in flights where the column month (from those rows) is in the list c(11, 12)."
Conceptually, R goes through month from the top down, and if a value is either 11 or 12, the two values in your list, it includes that row in nov_dec; otherwise it just continues on.
x %in% y explicitly means: is each value from x also in y?
The best way to understand is an example:
x <- 1:10 # numbers from 1 to 10
y <- (1:5)*2 # even numbers between 2 and 10
y %in% x # all TRUE: every even number between 2 and 10 is among the numbers from 1 to 10
x %in% y # TRUE only for the even numbers

Different results subsetting with column names

I apologize if I'm duplicating a question but I'm a newbie and I couldn't find the answer (probably because I lack the jargon).
I generated a data frame like so:
x1 <- c(1,2,3,4,5)
x2 <- c("a", "b", "c", "d", "e")
df <- data.frame(x1,x2)
  x1 x2
1  1  a
2  2  b
3  3  c
4  4  d
5  5  e
Then I tried to subset conditioning on the first column like this
df[df$x1>3, "x2"]
The result was as expected
[1] d e
However when I try
df["x1" >3, "x2"]
[1] a b c d e
R seems to ignore the conditional statement and returns the whole column x2. Is there a way of evaluating conditional statements (<,>,==) using the column names?
EDIT: I think I found the answer partially: R evaluates
"some text" > 1000
[1] TRUE
and that explains why I get all the rows.
The question remains: what is a good way of evaluating conditional statements using column names?
I won't go into a long explanation because I think you'll be able to see the issue clearly with a few examples. But basically, if you want to refer to columns by their names as character strings, you will need a construct like this
df[df[["x1"]] > 3, "x2"]
# [1] d e
# Levels: a b c d e
What was happening with your second try is this
"x1" > 3
# [1] TRUE
And then basically what you did was this
df[TRUE, "x2"]
# [1] a b c d e
# Levels: a b c d e
giving all elements. The reason is coercion: when a character string is compared with a number, R converts the number to a string and compares the two as text. So "x1" > 3 is evaluated as "x1" > "3", which is TRUE because letters sort after digits. A string is therefore not always greater than a number; the result depends on the text comparison.
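The comparison can be checked directly. The short sketch below assumes R's documented coercion rule for comparisons between different atomic types, where the number is converted to a character string before the two are compared as text:

```r
cmp_mixed  <- "x1" > 3      # 3 is coerced to "3", then compared as text
cmp_string <- "x1" > "3"    # the same comparison written out explicitly

# A string is not automatically "greater" than a number:
cmp_ten <- "10" > 3         # "10" vs "3": "1" sorts before "3"

# A single TRUE as row index is recycled, so every row is selected
recycled <- c(10, 20, 30)[TRUE]
```

The last line shows why a lone TRUE in the row position returns everything: the logical index is recycled to the full length.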
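Your question could have many answers, especially depending on the context and the type of data you're working with. In this particular case though, you could simply use df[x1 > 3, "x2"]. Note that this works only because the standalone vector x1 still exists in your workspace; it does not look up the column x1 inside df.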
The first argument is for rows and the second argument is for columns. Essentially, you are saying to return all df rows where x1 is greater than 3. In terms of columns, you want only column x2. You'll get it pretty quickly once you tweak around with the different statements. Hope this helps.
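For the remaining question (conditions written in terms of column names), a few base R options can be sketched as follows. The variable col is a hypothetical name holding the column name as a string; stringsAsFactors = FALSE is set only so the results are plain character vectors:

```r
df <- data.frame(
  x1 = c(1, 2, 3, 4, 5),
  x2 = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# 1. Column name held in a string: extract the column with [[ ]]
col <- "x1"
r1 <- df[df[[col]] > 3, "x2"]

# 2. subset() evaluates the condition inside the data frame
r2 <- subset(df, x1 > 3)$x2

# 3. with() does the same for an arbitrary expression
r3 <- with(df, x2[x1 > 3])
```

The first form is the one to use when the column name is only known at run time; the other two are convenient for interactive work.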

How to access actual internal factor lookup hashtable in R

Dear Stackoverflow community,
I have looked everywhere but can't find the answer to this question. I am trying to access the factor lookup table that R uses when you change a string vector into a factor vector. I am not trying to convert a string to a factor but rather to get the lookup table underlying the factor variable and store it as a hash table for use elsewhere.
I encountered the problem because I want to use this factor lookup table on a list of different length vectors, to convert them from strings to numbers.
i.e., I have a list of item sets that I want to convert to numeric, but each set in the list has a different number of items.
So far, I have converted the list of vectors into a vector
vec <- unlist(list)
vec <- factor(vec)
Now I want to do a lookup on the original list with the factor lookup table which must be underlying vec, but I can't seem to find it.
I think you either want the indexes which map the elements of the factor to elements of the factor levels, as in:
vec <- c('a','b','c','b','a')
f <- factor(vec)
f
#> [1] a b c b a
#> Levels: a b c
indx <- as.integer(f) # the integer codes that index into levels(f)
indx
#> [1] 1 2 3 2 1
or you want the hash tables used internally to create the factor variable. Unfortunately, any hash tables created while building a factor are created inside unique() and match(), which are internal functions, so you won't have access to anything those functions create (other than their return values, of course). If you want a hash table so you can use it to index a character vector with the same levels as your existing factor, just create one, as in:
library(hash)
.levels <- levels(f)
h <- hash(keys = .levels, values = seq_along(.levels))
newVec <- sample(.levels, 10, replace = TRUE)
newVec
#> [1] "a" "b" "a" "a" "a" "c" "c" "b" "c" "a"
values(h,keys = newVec)
#> a b a a a c c b c a
#> 1 2 1 1 1 3 3 2 3 1
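If an extra package is not wanted, a named integer vector can serve as the lookup table and be applied to each vector in the original list. This is only a sketch under the assumption that the input is a list of character vectors; item_sets is a made-up example, not the asker's data:

```r
# Hypothetical list of item sets of different lengths
item_sets <- list(c("a", "b"), c("c"), c("b", "c", "a"))

# Build the factor once from all items, as in the question
f <- factor(unlist(item_sets))

# Named vector: names are the levels, values are their integer codes
lookup <- setNames(seq_along(levels(f)), levels(f))

# Apply the lookup to every set, keeping the list structure
coded <- lapply(item_sets, function(s) unname(lookup[s]))

# match() against the levels gives the same codes without a lookup object
coded_match <- lapply(item_sets, match, table = levels(f))
```

Both versions reproduce the codes the factor uses, because a factor's integer codes are positions in levels(f).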

Safely merge data frames by factor columns

Factors can help prevent some kinds of programming errors in R: you cannot perform an equality check on factors that use different levels, and you are warned when performing greater/less than checks on unordered factors.
a <- factor(letters[1:3])
b <- factor(letters[1:3], levels=letters[4:1])
a == b
## Error in Ops.factor(a, b) : level sets of factors are different
a < a
## [1] NA NA NA
## Warning message:
## In Ops.factor(a, a) : < not meaningful for factors
However, contrary to my expectation, this check is not performed when merging data frames:
ad <- data.frame(x=a, a=as.numeric(a))
bd <- data.frame(x=b, b=as.numeric(b))
merge(ad, bd)
##   x a b
## 1 a 1 4
## 2 b 2 3
## 3 c 3 2
Those factors simply seem to be coerced to characters.
Is a "safe merge" available somewhere that would do the check? Do you see specific reasons for not doing this check by default?
Example (real-life use case): Assume two spatial data sets with very similar but not identical subdivision in, say, communes. The data sets refer to slightly different points in time, and some of the communes have merged during that time span. Each data set has a "commune ID" column, perhaps even named identically. While the semantics of this column are very similar, I wouldn't want to (accidentally) merge the data sets over this commune ID column. Instead, I construct a matching table between "old" and "new" commune IDs. If the commune IDs are encoded as factors, a "safe merge" would give a correctness check for the merge operation at no extra (implementation) cost and very little computational cost.
The "safe guard" with merge is the by= parameter. You can set exactly which columns you think should match. If you match up two factor columns, R will use the labels for those values to match them up. So "a" will match with "a" regardless of how the hidden inner workings of factor have coded those values. That's what a user sees, so that's how it will be merged. It's just like with numeric values: you can choose to merge on columns that have completely different ranges (the first column has 1:10, the second has 100:1000). When the by value is set, R will do what it's asked. And if you don't explicitly set the by parameter, then R will find all shared column names in the two data frames and use those.
And many times when merging, you don't always expect matches. Sometimes you're using all.x or all.y to specifically get unmatched records. In this case, depending on how the different data frames were created, one may not know about the levels it doesn't have. So it's not at all unreasonable to try to merge them.
So basically R is treating factors like characters during merging, because it assumes that you already know that two columns belong together.
Well, with much credit (and apologies to) MrFlick:
> attributes(ad$x)
$levels
[1] "a" "b" "c"
$class
[1] "factor"
> attributes(ad$a)
NULL
> attributes(ad$b)
NULL
> adfoo<-merge(ad,bd)
> attributes(adfoo$x)
$levels
[1] "a" "b" "c"
$class
[1] "factor"
So in fact the merged column $x is a factor, although only levels common to both ad and bd are merged. The other columns were coerced via as.numeric long ago.
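If the stricter behaviour is wanted, a small wrapper can perform the level check before delegating to merge(). The function safe_merge below is hypothetical (it is not part of base R or any package referenced here) and is only a sketch: it assumes the key columns have the same names in both data frames and refuses to merge when two factor key columns have different level sets.

```r
# Hypothetical helper: merge() with a level check on factor key columns
safe_merge <- function(x, y, by = intersect(names(x), names(y)), ...) {
  for (col in by) {
    both_factors <- is.factor(x[[col]]) && is.factor(y[[col]])
    if (both_factors && !identical(levels(x[[col]]), levels(y[[col]]))) {
      stop("level sets of factor column '", col, "' differ")
    }
  }
  merge(x, y, by = by, ...)
}

a <- factor(letters[1:3])
b <- factor(letters[1:3], levels = letters[4:1])
ad <- data.frame(x = a, a = as.numeric(a))
bd <- data.frame(x = b, b = as.numeric(b))

# Differing levels: the wrapper stops instead of silently merging
res <- tryCatch(safe_merge(ad, bd), error = function(e) conditionMessage(e))

# Identical levels: behaves like plain merge()
ok <- safe_merge(ad, data.frame(x = a, c = 3:1))
```

Whether identical() on the levels is the right test is a design choice; comparing the level sets regardless of order would be a looser alternative.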

Having Numeric data type and character data type in the same column of a data frame?

I have a large data frame (570 rows by 200000 columns) in R. For those of you that are familiar with PLINK, I am trying to create a PED file for a GWAS analysis. Plink requires that each missing character be coded with a zero. The non-missing values are "A", "T", "C", or "G".
So, for example, the data structure looks like this in the data frame.
COL1 COL2
PT1 A T
PT2 T T
PT3 A A
PT4 A T
PT5 0 0
PT6 A A
PT7 T A
PTn T T
When I run my file in Plink, I get an error. I went back to check my file in R and found that the zeros were "character" types. Is it possible to have two different data types (numeric and character) in a given column in R? I've tried making the 0's a numeric type and keep the letters as character type, but it won't work.
I think Justin's advice will probably fix the problem you have with Plink, but wanting to answer your question in bold...
Is it possible to have two different data types (numeric and character) in a given column in R?
Not really, but in this particular scenario, where the variable is discrete, kind of yes. R has the factor type, which is similar to an enumeration (enum) in some other languages.
For example try this:
x = factor(c("0","A","C","G","T"),levels=c(0,"A","T","G","C"))
print(x)
[1] 0 A C G T
Levels: 0 A T G C
You can convert them to integers (the first level is 1 by default) and to characters:
> as.integer(x)
[1] 1 2 5 4 3
> as.character(x)
[1] "0" "A" "C" "G" "T"
Now when you read a table with read.table you can indicate that all character columns should be read as factors, even those with quotes around them.
mydata = read.table("yourData.tsv", stringsAsFactors = TRUE)
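Since a column cannot mix numeric and character values, one workable approach is to keep the column as character and make sure the zero is written to the file without quotes. The sketch below uses a made-up two-column data frame (geno) and a temporary file; a real PED file also needs its leading identifier columns, which are omitted here.

```r
# Hypothetical genotype columns; NA marks a missing call
geno <- data.frame(
  COL1 = c("A", "T", NA),
  COL2 = c("T", "T", NA),
  stringsAsFactors = FALSE
)

# Mixing types in one vector coerces everything to character
mixed_class <- class(c("A", 0))

# Code missing values as the character "0"
geno[is.na(geno)] <- "0"

# quote = FALSE writes a bare 0, so the type inside R does not matter
out <- tempfile(fileext = ".ped")
write.table(geno, out, quote = FALSE, row.names = FALSE, col.names = FALSE)
written <- readLines(out)
```

In the written file the zero is indistinguishable from a numeric zero, which is presumably what matters to the program reading it.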

Resources