R factor and level - r

Levels make sense that it is unique values of the vector, but I can't get my head around what factor is. It just seems to repeat the vector values.
factor(c(1,2,3,3,4,5,1))
[1] 1 2 3 3 4 5 1
Levels: 1 2 3 4 5
Can anyone explain what factor is supposed to do, or why would I used it?
I'm starting to wonder if factors are like a code table in a database. Where the factor name is code table name and levels are the unique options of the code table. ?

A factor is stored as a hash table rather than raw character vector. What does this imply? There are two major benefits.
Much smaller memory footprint. Consider a text file containing the phrase "New Jersey" 100,000 times over encoded in ASCII. Now imagine if you just had to store the number 16 (in binary 100,000 times and then another table indicating that 16 means "New Jersey". It's leaner and faster.
Especially for visualization and statistical analysis, frequently we test for values "across all categories" (think ANOVA or what you would color a stacked barplot by). We can either repeatedly encode all of our functions to stack up observed choices in a string vector or we can simply create a new type of vector which will tell you what the valid choices are. That is called a factor, and the valid choices are called levels.

Related

Investigate if scores of a Likert item actually differ from one onother?

My dataset contains a Likert item containing how energetic the participants were at that moment rated from 0-6. Where 0 = not energetic at all and 6 = very energetic. I have to investigate if these scores actually differ from one another based on the data. If 0 and 1 do not differ from eachother, I have to combine these two levels into one and so on. So at the end I might have 2 or 4 levels instead of 6.
I have tried applying classification algorithms to the data to see if a model classifying '0' would give an error rate when classifying '1'. Unfortunately, this did not work as I wanted. Is this actually possible?
My question is if someone knows how I can best investigate if there is indeed a difference between those 6 levels or whether I can combine some of them based on differences (or not) in the data of those levels.

Does 0 plays any important role in as.numeric function when using factors in R

Hi guys :) I know this question has been asked before here for example but I would like to ask if 0 plays any important role using the as.numeric function. For example, we have the following simple code
x2<-factor(c(2,2,0,2), label=c('Male','Female'))
as.numeric(x2) #knonwing that this is not the appropriate command used , as.numeric(levels(x2))[x2] would be more appropriate but return NAs
this returns
[1] 2 2 1 2
Is 0 being replaced here by 1 ? Moreover,
unclass(x2)
seems to give the same thing as well:
[1] 2 2 1 2
attr(,"levels")
[1] "Male" "Female"
It might be simple but I am trying to figure this out and it seems that I cant. Any help would be highly appreciated as I am new in R.
0 has no special meaning for factor.
As commenters have pointed out, factor recodes the input vector to an integer vector (starting with 1) and slaps a name tag onto each integer (the levels).
In the most simplest case, factor(c(2,2,0,2), the function takes the unique values of the input vector, sorts it, and converts it to a character vector, for the levels. I.e. the factor is internally represented as c(2,2,1,2) where 1 corresponds to '0' and 2 to '2'.
You then go further on by giving the levels some labels; these are normally identical to the levels. In your case factor(c(2,2,0,2), labels=c('Male','Female')), the levels are still evaluated to the sorted, unique vector (i.e. c(2,2,1,2)) but the levels now have labels Male for first level and Female for second level.
We can decide which levels should be used, as in factor(c(2,2,0,2), levels=c(2,0), labels=c('Male','Female')). Now we have been explicit towards which input value should have which level and label.

converting rows to columns of a data frame in R

I have a data set like this
movieID title year country genre directorName Rating actorName1 actorName.2
1 hello 1995 USA action john smith 6 tom hanks charlie sheen
2 MI2 1997 USA action mad max 8 tom cruize some_body
3 MI2 1997 USA thriller mad max 8 tom cruize some_body
basically there are numerous rows that just have a different user given genre that I would like to columns having genre1, genre2, ...
I tried reshape() but it would only convert based on some ID variable. If anyone has any ideas let me know
You can use reshape() to do this, if you understand the lens through which reshape() views data.
Background
First, consider the concept of a record in the context of the relational model of data management. Generally, in a table of data, each record should correspond to a well-defined unit of data, concisely termed the record unit, with one or more columns acting as identification or key variables that serve to differentiate between unique instances of the record unit.
Usually, units are described by a set of scalar variables. In other words, each record has associated with it one or more scalar values, each of which provides a single piece of information about the unit. In a nice simple world, all properties of units would be scalar, and thus you could represent each variable as a single column vector, with each element/cell corresponding to one record unit, and thereby providing the value of that particular property for that particular unit.
Further to the concept of properties, it is possible and very common to identify typing or grouping classifications of units. These are often represented as additional scalar properties of units.
When people talk about the long format vs. the wide format of tabular data, they are generally referring to how these kinds of type classifications are laid out in a table. This choice of data layout is directly related to the choice of unit that is represented by a single record in the table. These are actually one and the same choice.
For example, in an experiment with multiple measurements per individual, it would be possible to store one measurement per record, with individuals represented over multiple records, and a type column to distinguish between measurement type. Alternatively, it would be possible to store one individual per record, with each measurement represented by a single column. With respect to each other, the former format is long, and the latter format is wide. But now consider that, if each individual belonged to a single experimental group within the experiment, it would be possible to store one group per record, with each individual represented by a set of columns, and each measurement represented by one column within the set. This is yet a "wider" format, if you will. It's all relative.
Unfortunately, unit characteristics are sometimes more complex than simple scalar values. The most common case of this is a multivalue property, sometimes described as a many-to-one relationship (especially in the context of DBMSs). In other words, multiple values for the property can be associated with a single record unit. In cases like this, it is not possible to represent the multivalue property as a simple column vector within the data set. There are hacks that programmers often settle into when trying to deal with this complexity, such as:
Concatenating the multiple values into a single scalar value (such as a single comma-separated string, or a bit vector). Let's call this the "concatenation hack".
Duplicating the unit record for each value of the property. (This generally can only be plausible if only one of the properties in the data set is multivalue.) Let's call this the "duplication hack".
Separating the property into multiple "instances" of itself, each stored in its own column. Let's call this the "separation hack".
Simply trying to ignore all but one of the multiple values. Let's call this the "ignorance hack".
In some contexts, special data types can be used to more appropriately represent the data as a pseudo-column-vector. PostgreSQL, for example, provides an array column type, and even R data.frames can have list columns whose individual elements can hold any data type supported by R, including multielement vectors. These representations are usually preferable to the aforementioned hacks.
Probably the most widely used solution that I wouldn't qualify as a hack is to completely separate the multivalue property from the primary table of data, and instead store it as a separate table which is linked to the primary table on a key. Each record in the secondary table has a key to a record in the primary table, and stores alongside the key a single value of the multivalue property. This is the design advocated by the relational model.
These approaches all have their own tradeoffs, of course, and the analysis of which is optimal for a given situation can be very complex, nebulous, and even somewhat subjective. I won't go into more detail on this here.
Before I begin to talk about reshape(), it is important to emphasize that unit typing is a very different thing from multivalue properties. Reshaping data is generally supposed to be about managing typing and record unit selection. It is not supposed to be about managing multivalue property layout, but it can be used in this way, as we will see.
reshape()
At its most abstract, reshape() can be used to transform a set of typed scalar data columns from one row per type with a discriminator column to one column per type with a discriminator suffix in the column name, for every unique (possibly multicolumn) key, and vice-versa.
The key will generally correspond with a single record unit, to use the terminology introduced earlier. Each key uniquely identifies one record unit.
The data columns are the actual variables/properties which describe the record units, with the discriminator acting to distinguish between the different types of the data variables.
In the terminology of the reshape() documentation and interface, the key columns are "id" columns, the discriminator is the "time" column, and the data columns are "varying" columns.
It is important to understand that the key you specify as the idvar argument is always the unique key of the wide format, whether you are transforming to wide from long, or to long from wide. In the long format, the unique key is the idvar columns plus the discriminator column (timevar).
Here's a simple demo:
## define example long table
long <- data.frame(id1=rep(letters[1:2],each=4L),id2=rep(1:2,each=2L),type=1:2,x=1:8,y=9:16);
long;
## id1 id2 type x y
## 1 a 1 1 1 9
## 2 a 1 2 2 10
## 3 a 2 1 3 11
## 4 a 2 2 4 12
## 5 b 1 1 5 13
## 6 b 1 2 6 14
## 7 b 2 1 7 15
## 8 b 2 2 8 16
## convert to wide
idvar <- c('id1','id2');
timevar <- 'type';
wide <- reshape(long,dir='w',idvar=idvar,timevar=timevar);
attr(wide,'reshapeWide') <- NULL; ## remove "helper" attribute, which cannot always be relied upon
wide;
## id1 id2 x.1 y.1 x.2 y.2
## 1 a 1 1 9 2 10
## 3 a 2 3 11 4 12
## 5 b 1 5 13 6 14
## 7 b 2 7 15 8 16
## convert back to long
long2 <- reshape(wide,dir='l',idvar=idvar,timevar=timevar,varying=names(wide)[!names(wide)%in%c(idvar,timevar)]);
attr(long2,'reshapeLong') <- NULL; ## remove "helper" attribute, which cannot always be relied upon
long2 <- long2[do.call(order,long2[c(idvar,timevar)]),]; ## better order, corresponding with original long
rownames(long2) <- NULL; ## remove useless row names
long2$type <- as.integer(long2$type); ## annoyingly, longifying interprets discriminator suffixes as doubles
identical(long,long2);
## [1] TRUE
The above code also demonstrates some of the quirks committed by reshape(), such as attribute assignments (that I've never seen anyone rely upon), unexpected row order, undesirable row names, and non-ideal vector type derivation. All of these quirks can be papered over with simple modifications, as I show above.
Also notice that the varying argument can be omitted when transforming from long to wide, in which case it is derived by reshape() by the process of elimination, but it cannot be omitted when transforming from wide to long.
Input
The situation you've gotten yourself into appears to be that you have a data.frame that is supposed to contain one row per movie, but each movie record has been duplicated for each genre that is associated with the movie. In other words, the movie is the record unit, and the genre is a multivalue property associated with the movie, which is currently being represented by the duplication hack.
Your objective seems to be to transform the data from the duplication hack into the separation hack.
I don't mean to sound too critical here; these hacks are widely used and are, in many cases, fairly effective at handling this kind of complexity in a relatively simple way. It's very likely this is a good solution for your application. But I'm going to call a spade a spade; these are hacks, and are far from the most appropriate or robust solutions for data processing. And I agree that the separation hack is better than the duplication hack.
Another confusing detail is that there is a movieID column which appears to be unique per row, and not unique per movie. IDs 2 and 3 both seem to be associated with movie MI2.
My interpretation is that, in the input, because the duplication hack has been used to deal with multiple genres, each row can be thought of as being unique per genre instance. In other words, each row represents a single instance of a genre as used in a single movie. Hence the movieID column is better thought of as a genre instance identifier, and has just been misnamed. (An alternative interpretation is that it was generated incorrectly, and should be unique per movie, in which case it should be fixed and treated identically to the key columns described later.)
Solution
We can solve this problem by calling reshape() to transform from long format to wide format.
Recall that reshaping is supposed to be used for type layout, for navigating between record unit representations. Here we're instead going to use it for transforming how the multivalue property currently stored in the genre column is laid out.
Now, the most important question is, which columns are keys (idvar), which is the discriminator (timevar), and which are data (varying)?
The easiest one is the genre column. It's a data column. It's not part of the key that will help uniquely identify each movie record in the wide format, and it's certainly not a discriminator of other data columns, so it must be a data column itself. We can also arrive at this answer by considering what must happen to it during the transformation; for each unique key, the genre values must be separated from one row per value to one column per value, which is what happens to all data columns when transforming from long to wide.
Now it's useful to consider the discriminator column. Which one is it? In actuality, it doesn't exist in the input. There's no column that says "this is genre type X, this is genre type Y". So what do we do? According to your required output, you want to associate with each genre a sequential index number, presumably in row order. This means we need to synthesize a new column with such a sequence when passing the data.frame to reshape(). However, we must be careful to ensure that the sequence starts anew for each movie, otherwise every record in the input table would see its genre occupy its own column in the output, due to its unique discriminator suffix. We can do this with ave() (grouping by the key columns) and transform(). We'll name the synthesized column time, which is the default assumption by reshape() if you don't specify the timevar argument. This will allow us to omit specification of that argument. (Note: I've always wished that reshape() would default to such a row-order sequence instead of looking for an input column named time, but it doesn't do that. Oh well.)
Now let's deal with the movieID column. Being a unique identifier in the input table, the only way to include it in the output table would be to also treat it as a data column, so that it would be split by the discriminator into separate columns. I decided to make the assumption that you don't want to do this, so I just removed it from the input table before reshaping, by exploiting the same transform() call. If you want, you can excise the removal piece to see the effect of including movieID across the transformation.
That leaves the remaining columns of title, year, country, directorName, Rating, actorName1, and actorName.2. How should we treat these?
Technically speaking, conceptually, most of them should be data columns. They can't be discriminators (we already covered that), and there's no way most of them (Rating, for example) could be considered key columns. Again, conceptually.
But it would be incorrect to specify any of them as data columns. The reason is that we're not using reshape() in the normal way. We know the movie records have been duplicated for the genre duplication hack used by the input data.frame, and so all the columns I just listed are actually just duplicates within the movie record group. We need these columns to effectively collapse to a single record in the output, and that's exactly what happens with key columns that pass through a reshape() call. Hence, we must identify them all as key columns by passing them to the idvar argument.
Another way of thinking about this is that the key columns are left untouched by reshape(), other than deduplication (if going from long to wide) or duplication (if going from wide to long). It is only the discriminator column that is transferred from column to suffix (if going from long to wide) or vice-versa (if going from wide to long), and data columns that are transferred from single column to multiple columns (if going from long to wide) or vice-versa (if going from wide to long). We need these columns to remain untouched, other than deduplication. Hence we require all columns, other than the target multivalue property column genre and the synthesized time column (and, in this case, the extraneous movieID column) to be specified as key columns.
Note that this is true even if one or more of the key columns could serve as a true key for the movie records. For example, if title was known to be unique within the table by movie, it would still be incorrect to just specify title as the key, and all the other column names I listed as data columns, because they would then be widened in the output according to the synthesized discriminator, even though we know all values within each movie record group are identical.
So, here's the end result:
df <- data.frame(movieID=c(1L,2L,3L),title=c('hello','MI2','MI2'),year=c(1995L,1997L,1997L),country=c('USA','USA','USA'),genre=c('action','action','thriller'),directorName=c('john smith','mad max','mad max'),Rating=c(6L,8L,8L),actorName1=c('tom hanks','tom cruize','tom cruize'),actorName.2=c('charlie sheen','some_body','some_body'),stringsAsFactors=F);
idcns <- names(df)[!names(df)%in%c('movieID','genre')];
reshape(transform(df,movieID=NULL,time=ave(df$movieID,df[idcns],FUN=seq_along)),dir='w',idvar=idcns,sep='');
## title year country directorName Rating actorName1 actorName.2 genre1 genre2
## 1 hello 1995 USA john smith 6 tom hanks charlie sheen action <NA>
## 2 MI2 1997 USA mad max 8 tom cruize some_body action thriller
Note that it is irrelevant exactly which vector is passed as the first argument to ave(), since seq_along() ignores its argument, except for its length. But we do require an integer vector, since ave() tries to coerce its result to the same type as the argument. It is acceptable to use df$movieID because it is an integer vector; alternatively we could use df$year, df$Rating, or synthesize an integer vector with seq_len(nrow(df)) or integer(nrow(df)).
Try this with dplyr and tidyr:
library(tidyr)
library(dplyr)
df %>% mutate(yesno=1) %>% spread(genre, yesno, fill=0)
This creates a column yesno that just gives a value to fill in for each genre. We can then use spread from tidyr. fill=0 means to fill in those not in the genre with 0 instead of NA.
Before:
genre title yesno
1 action lethal weapon 1
2 thriller shining 1
3 action taken 1
4 scifi alien 1
After:
title action scifi thriller
1 alien 0 1 0
2 lethal weapon 1 0 0
3 shining 0 0 1
4 taken 1 0 0

Is there anything like numerical variable with labels?

I have a numerical variable with discrete levels, that have a special meaning for me, e.g.
-1 'less than zero'
0 'zero'
1 'more than zero'
I know, that I can convert the variable as factor/ordinal and keep the labels, but then the numerical representation of the variable would be
1 'less than zero'
2 'zero'
3 'more than zero'
which is useless for me. I cannot afford having two copies of the variable, because of memory constraints (it is a very big data.table).
Is there any standard way of adding text labels to certain levels of the numerical (possibly integer) variable, so that I can get a nice looking frequency tables just like if it was a factor, and simultaneously being able to treat it as the source numerical variable with values untouched?
I'm going to say the answer to your questions is "no". There's no standard or built-in way of doing what you want.
Because, as you note, factors have positive non-zero integer codes, and integers can't be denoted by label strings in a vector. Not in a "standard" way anyway.
So you will have to do the labelling yourself, in whatever outputs you want to present, manually.
Any tricks like keeping your data (once) as a factor and subtracting a number to get the negative values you need (presumably for your analysis) will make a copy of that data. Keep the numbers, do the analysis, then do replacement with the results (which I presume are tables and plots and so aren't as big as the data).
R also doesn't have an equivalent to the "enumerated type" of many languages, which is one way this can be done.
You could use a vector. Would that work?
var <- c(-1,0,1)
names(var) <- c("less than zero", "zero", "more than zero")
that would give you
> var
less than zero zero more than zero
-1 0 1
Hope that helps,
Umberto

Different behaviour of intersect on vectors and factors

I try to compare multiple vectors of Entrez IDs (integer vectors) by using Reduce(intersect,...). The vectors are selected from a database using "DISTINCT" so a single vector does not contain duplicates.
length(factor(c(l1$entrez)))
gives the same length (and the same IDs w/o the length function) as
length(c(l1$entrez))
When I compare multiple vectors with
length(Reduce(intersect,list(c(l1$entrez),c(l2$entrez),c(l3$entrez),c(l4$entrez))))
or
length(Reduce(intersect,list(c(factor(l1$entrez)),c(factor(l2$entrez)),c(factor(l3$entrez)),c(factor(l4$entrez)))))
the result is not the same. I know that factor!=originalVector but I cannot understand why the result differs although the length and the levels of the initial factors/vectors are the same.
Could somebody please explain the different behaviour of the intersect function on vectors and factors? Is it that the intersect of two factor lists are again factorlists and then duplicates are treated differently?
Edit - Example:
> head(l1)
entrez
1 1
2 503538
3 29974
4 87769
5 2
6 144568
> head(l2)
entrez
1 1743
2 1188
3 8915
4 7412
5 51082
6 5538
The lists contain around 500 to 20K Entrez IDs. So the vectors contain pure integer and should give the intersect among all tested vectors.
> length(Reduce(intersect,list(c(factor(l1$entrez)),c(factor(l2$entrez)),c(factor(l3$entrez)),c(factor(l4$entrez)))))
[1] 514
> length(Reduce(intersect,list(c(l1$entrez),c(l2$entrez),c(l3$entrez),c(l4$entrez))))
[1] 338
> length(Reduce(intersect,list(l1$entrez,l2$entrez,l3$entrez,l4$entrez)))
[1] 494
I have to apologize profusely. The different behaviour of the intersect function may be caused by a problem with the data. I have found fields in the dataset containing comma seperated Entrez IDs (22038, 23207, ...). I should have had a more detailed look at the data first. Thank you for the answers and your time. Although I do not understand the different results yet, I am sure that this is the cause of the different behaviour. Can somebody confirm that?
As Roman says, an example would be very helpful.
Nevertheless, one possibility is that your variables l1$entrez, l2$entrez etc have the same levels but in different orders.
intersect converts its arguments via as.vector, which turns factors into character variables. This is usually the right thing to do, as it means that varying level order doesn't make any difference to the result.
Passing factor(l1$entrez) as an argument to intersect also removes the impact of varying level order, as it effectively creates a new factor with level ordering set to the default. However, if you pass c(l1$entrez), you strip the factor attributes off your variable and what you're left with is the raw integer codes which will depend on level ordering.
Example:
a <- factor(letters[1:3], levels=letters)
b <- factor(letters[1:3], levels=rev(letters)
# returns 1 2 3
intersect(c(factor(a)), c(factor(b)))
# returns integer(0)
intersect(c(a), c(b))
I don't see any reason why you should use c() in here. Just let R handle factors by itself (although to be fair, there are other scenarios where you do want to step in).

Resources