converting rows to columns of a data frame in R - r

I have a data set like this
movieID title year country genre directorName Rating actorName1 actorName.2
1 hello 1995 USA action john smith 6 tom hanks charlie sheen
2 MI2 1997 USA action mad max 8 tom cruize some_body
3 MI2 1997 USA thriller mad max 8 tom cruize some_body
basically there are numerous rows that just have a different user given genre that I would like to columns having genre1, genre2, ...
I tried reshape() but it would only convert based on some ID variable. If anyone has any ideas let me know

You can use reshape() to do this, if you understand the lens through which reshape() views data.
Background
First, consider the concept of a record in the context of the relational model of data management. Generally, in a table of data, each record should correspond to a well-defined unit of data, concisely termed the record unit, with one or more columns acting as identification or key variables that serve to differentiate between unique instances of the record unit.
Usually, units are described by a set of scalar variables. In other words, each record has associated with it one or more scalar values, each of which provides a single piece of information about the unit. In a nice simple world, all properties of units would be scalar, and thus you could represent each variable as a single column vector, with each element/cell corresponding to one record unit, and thereby providing the value of that particular property for that particular unit.
Further to the concept of properties, it is possible and very common to identify typing or grouping classifications of units. These are often represented as additional scalar properties of units.
When people talk about the long format vs. the wide format of tabular data, they are generally referring to how these kinds of type classifications are laid out in a table. This choice of data layout is directly related to the choice of unit that is represented by a single record in the table. These are actually one and the same choice.
For example, in an experiment with multiple measurements per individual, it would be possible to store one measurement per record, with individuals represented over multiple records, and a type column to distinguish between measurement type. Alternatively, it would be possible to store one individual per record, with each measurement represented by a single column. With respect to each other, the former format is long, and the latter format is wide. But now consider that, if each individual belonged to a single experimental group within the experiment, it would be possible to store one group per record, with each individual represented by a set of columns, and each measurement represented by one column within the set. This is yet a "wider" format, if you will. It's all relative.
Unfortunately, unit characteristics are sometimes more complex than simple scalar values. The most common case of this is a multivalue property, sometimes described as a many-to-one relationship (especially in the context of DBMSs). In other words, multiple values for the property can be associated with a single record unit. In cases like this, it is not possible to represent the multivalue property as a simple column vector within the data set. There are hacks that programmers often settle into when trying to deal with this complexity, such as:
Concatenating the multiple values into a single scalar value (such as a single comma-separated string, or a bit vector). Let's call this the "concatenation hack".
Duplicating the unit record for each value of the property. (This generally can only be plausible if only one of the properties in the data set is multivalue.) Let's call this the "duplication hack".
Separating the property into multiple "instances" of itself, each stored in its own column. Let's call this the "separation hack".
Simply trying to ignore all but one of the multiple values. Let's call this the "ignorance hack".
In some contexts, special data types can be used to more appropriately represent the data as a pseudo-column-vector. PostgreSQL, for example, provides an array column type, and even R data.frames can have list columns whose individual elements can hold any data type supported by R, including multielement vectors. These representations are usually preferable to the aforementioned hacks.
Probably the most widely used solution that I wouldn't qualify as a hack is to completely separate the multivalue property from the primary table of data, and instead store it as a separate table which is linked to the primary table on a key. Each record in the secondary table has a key to a record in the primary table, and stores alongside the key a single value of the multivalue property. This is the design advocated by the relational model.
These approaches all have their own tradeoffs, of course, and the analysis of which is optimal for a given situation can be very complex, nebulous, and even somewhat subjective. I won't go into more detail on this here.
Before I begin to talk about reshape(), it is important to emphasize that unit typing is a very different thing from multivalue properties. Reshaping data is generally supposed to be about managing typing and record unit selection. It is not supposed to be about managing multivalue property layout, but it can be used in this way, as we will see.
reshape()
At its most abstract, reshape() can be used to transform a set of typed scalar data columns from one row per type with a discriminator column to one column per type with a discriminator suffix in the column name, for every unique (possibly multicolumn) key, and vice-versa.
The key will generally correspond with a single record unit, to use the terminology introduced earlier. Each key uniquely identifies one record unit.
The data columns are the actual variables/properties which describe the record units, with the discriminator acting to distinguish between the different types of the data variables.
In the terminology of the reshape() documentation and interface, the key columns are "id" columns, the discriminator is the "time" column, and the data columns are "varying" columns.
It is important to understand that the key you specify as the idvar argument is always the unique key of the wide format, whether you are transforming to wide from long, or to long from wide. In the long format, the unique key is the idvar columns plus the discriminator column (timevar).
Here's a simple demo:
## define example long table
long <- data.frame(id1=rep(letters[1:2],each=4L),id2=rep(1:2,each=2L),type=1:2,x=1:8,y=9:16);
long;
## id1 id2 type x y
## 1 a 1 1 1 9
## 2 a 1 2 2 10
## 3 a 2 1 3 11
## 4 a 2 2 4 12
## 5 b 1 1 5 13
## 6 b 1 2 6 14
## 7 b 2 1 7 15
## 8 b 2 2 8 16
## convert to wide
idvar <- c('id1','id2');
timevar <- 'type';
wide <- reshape(long,dir='w',idvar=idvar,timevar=timevar);
attr(wide,'reshapeWide') <- NULL; ## remove "helper" attribute, which cannot always be relied upon
wide;
## id1 id2 x.1 y.1 x.2 y.2
## 1 a 1 1 9 2 10
## 3 a 2 3 11 4 12
## 5 b 1 5 13 6 14
## 7 b 2 7 15 8 16
## convert back to long
long2 <- reshape(wide,dir='l',idvar=idvar,timevar=timevar,varying=names(wide)[!names(wide)%in%c(idvar,timevar)]);
attr(long2,'reshapeLong') <- NULL; ## remove "helper" attribute, which cannot always be relied upon
long2 <- long2[do.call(order,long2[c(idvar,timevar)]),]; ## better order, corresponding with original long
rownames(long2) <- NULL; ## remove useless row names
long2$type <- as.integer(long2$type); ## annoyingly, longifying interprets discriminator suffixes as doubles
identical(long,long2);
## [1] TRUE
The above code also demonstrates some of the quirks committed by reshape(), such as attribute assignments (that I've never seen anyone rely upon), unexpected row order, undesirable row names, and non-ideal vector type derivation. All of these quirks can be papered over with simple modifications, as I show above.
Also notice that the varying argument can be omitted when transforming from long to wide, in which case it is derived by reshape() by the process of elimination, but it cannot be omitted when transforming from wide to long.
Input
The situation you've gotten yourself into appears to be that you have a data.frame that is supposed to contain one row per movie, but each movie record has been duplicated for each genre that is associated with the movie. In other words, the movie is the record unit, and the genre is a multivalue property associated with the movie, which is currently being represented by the duplication hack.
Your objective seems to be to transform the data from the duplication hack into the separation hack.
I don't mean to sound too critical here; these hacks are widely used and are, in many cases, fairly effective at handling this kind of complexity in a relatively simple way. It's very likely this is a good solution for your application. But I'm going to call a spade a spade; these are hacks, and are far from the most appropriate or robust solutions for data processing. And I agree that the separation hack is better than the duplication hack.
Another confusing detail is that there is a movieID column which appears to be unique per row, and not unique per movie. IDs 2 and 3 both seem to be associated with movie MI2.
My interpretation is that, in the input, because the duplication hack has been used to deal with multiple genres, each row can be thought of as being unique per genre instance. In other words, each row represents a single instance of a genre as used in a single movie. Hence the movieID column is better thought of as a genre instance identifier, and has just been misnamed. (An alternative interpretation is that it was generated incorrectly, and should be unique per movie, in which case it should be fixed and treated identically to the key columns described later.)
Solution
We can solve this problem by calling reshape() to transform from long format to wide format.
Recall that reshaping is supposed to be used for type layout, for navigating between record unit representations. Here we're instead going to use it for transforming how the multivalue property currently stored in the genre column is laid out.
Now, the most important question is, which columns are keys (idvar), which is the discriminator (timevar), and which are data (varying)?
The easiest one is the genre column. It's a data column. It's not part of the key that will help uniquely identify each movie record in the wide format, and it's certainly not a discriminator of other data columns, so it must be a data column itself. We can also arrive at this answer by considering what must happen to it during the transformation; for each unique key, the genre values must be separated from one row per value to one column per value, which is what happens to all data columns when transforming from long to wide.
Now it's useful to consider the discriminator column. Which one is it? In actuality, it doesn't exist in the input. There's no column that says "this is genre type X, this is genre type Y". So what do we do? According to your required output, you want to associate with each genre a sequential index number, presumably in row order. This means we need to synthesize a new column with such a sequence when passing the data.frame to reshape(). However, we must be careful to ensure that the sequence starts anew for each movie, otherwise every record in the input table would see its genre occupy its own column in the output, due to its unique discriminator suffix. We can do this with ave() (grouping by the key columns) and transform(). We'll name the synthesized column time, which is the default assumption by reshape() if you don't specify the timevar argument. This will allow us to omit specification of that argument. (Note: I've always wished that reshape() would default to such a row-order sequence instead of looking for an input column named time, but it doesn't do that. Oh well.)
Now let's deal with the movieID column. Being a unique identifier in the input table, the only way to include it in the output table would be to also treat it as a data column, so that it would be split by the discriminator into separate columns. I decided to make the assumption that you don't want to do this, so I just removed it from the input table before reshaping, by exploiting the same transform() call. If you want, you can excise the removal piece to see the effect of including movieID across the transformation.
That leaves the remaining columns of title, year, country, directorName, Rating, actorName1, and actorName.2. How should we treat these?
Technically speaking, conceptually, most of them should be data columns. They can't be discriminators (we already covered that), and there's no way most of them (Rating, for example) could be considered key columns. Again, conceptually.
But it would be incorrect to specify any of them as data columns. The reason is that we're not using reshape() in the normal way. We know the movie records have been duplicated for the genre duplication hack used by the input data.frame, and so all the columns I just listed are actually just duplicates within the movie record group. We need these columns to effectively collapse to a single record in the output, and that's exactly what happens with key columns that pass through a reshape() call. Hence, we must identify them all as key columns by passing them to the idvar argument.
Another way of thinking about this is that the key columns are left untouched by reshape(), other than deduplication (if going from long to wide) or duplication (if going from wide to long). It is only the discriminator column that is transferred from column to suffix (if going from long to wide) or vice-versa (if going from wide to long), and data columns that are transferred from single column to multiple columns (if going from long to wide) or vice-versa (if going from wide to long). We need these columns to remain untouched, other than deduplication. Hence we require all columns, other than the target multivalue property column genre and the synthesized time column (and, in this case, the extraneous movieID column) to be specified as key columns.
Note that this is true even if one or more of the key columns could serve as a true key for the movie records. For example, if title was known to be unique within the table by movie, it would still be incorrect to just specify title as the key, and all the other column names I listed as data columns, because they would then be widened in the output according to the synthesized discriminator, even though we know all values within each movie record group are identical.
So, here's the end result:
df <- data.frame(movieID=c(1L,2L,3L),title=c('hello','MI2','MI2'),year=c(1995L,1997L,1997L),country=c('USA','USA','USA'),genre=c('action','action','thriller'),directorName=c('john smith','mad max','mad max'),Rating=c(6L,8L,8L),actorName1=c('tom hanks','tom cruize','tom cruize'),actorName.2=c('charlie sheen','some_body','some_body'),stringsAsFactors=F);
idcns <- names(df)[!names(df)%in%c('movieID','genre')];
reshape(transform(df,movieID=NULL,time=ave(df$movieID,df[idcns],FUN=seq_along)),dir='w',idvar=idcns,sep='');
## title year country directorName Rating actorName1 actorName.2 genre1 genre2
## 1 hello 1995 USA john smith 6 tom hanks charlie sheen action <NA>
## 2 MI2 1997 USA mad max 8 tom cruize some_body action thriller
Note that it is irrelevant exactly which vector is passed as the first argument to ave(), since seq_along() ignores its argument, except for its length. But we do require an integer vector, since ave() tries to coerce its result to the same type as the argument. It is acceptable to use df$movieID because it is an integer vector; alternatively we could use df$year, df$Rating, or synthesize an integer vector with seq_len(nrow(df)) or integer(nrow(df)).

Try this with dplyr and tidyr:
library(tidyr)
library(dplyr)
df %>% mutate(yesno=1) %>% spread(genre, yesno, fill=0)
This creates a column yesno that just gives a value to fill in for each genre. We can then use spread from tidyr. fill=0 means to fill in those not in the genre with 0 instead of NA.
Before:
genre title yesno
1 action lethal weapon 1
2 thriller shining 1
3 action taken 1
4 scifi alien 1
After:
title action scifi thriller
1 alien 0 1 0
2 lethal weapon 1 0 0
3 shining 0 0 1
4 taken 1 0 0

Related

Assign integral value to list of relative values

I have an assortment of syrups, each of which has a value - the amount of sugar per volume. As people blend these syrups, I track which ones are used, and created a table to get a Relative Weight of each blend. I understand > Data > Sort > Options > Custom Sort Order.
However, I really don't wish to sort each table, and am looking for a way to parse a column of this list as entered, and return a column with results in an Integral Relative Value of each row, as compared to the weights of syrups in the other rows of the table.
Unique Name weight Not Unique Relative Value
blueberry .250 2
raspberry .333 3
orange .425 4
tangerine .333 3
blackberry .225 1
I am attempting to find the "Relative Sort", a nested function which can assign an integral value of the Unique Name which compares the weights of the syrups. A "Lookup" only works if there is an absolute equality, right?
What if someone doesn't use "blackberry syrup", then "blueberry" is the lightest, and should be labeled as 1.
Is this too complicated for LibreOffice Calc?
It's a recursive greater than/less than/equal to comparison?
IF the problem is calculating the right hand column below from entries that may be sorted ascending by value as on the left:
then an answer is, in C2 and copied down to suit (provided C1 is blank or 0):
=IF(B1<>B2,C1+1,C1)
Without sorting the RANK function might be simpler and adequate (though in the example returning 5 rather than 4).

R set column value to be other column value based on string search

I'm trying to find a clean way to get the first column of my DT, for each row, to be equal to the user_id found in other columns. That is, I must perform a search of "user_id" across each row, and return the entirety of the cell where the instance is found.
I first tried to get the index of the column where the partial match is found, and then use this to set the first column's values, but it did not work. Example:
user_id 1 2
1: N/A 300 user_id154
2: N/A user_id301 user_id125040
3: N/A 302 user_id2
For instance, I want to obtain the following
**user_id**
user_id154
user_id301
user_id2
Please bear in mind I am new to such data formatting in R (most of the work I do does not involve cleaning JSON files..), and that my data.table has overs 1M rows. The answer does not need to be super efficient, but it definitely shouldn't take more than 5 minutes or it will be considered as too slow by my boss.
Hopefully it is understandable
I'm sure someone will provide a more elegant solution, but this does the trick:
dt[, user_id := str_extract(str_c(1, 2), "user_id[0-9]*")]
This first combines all columns row-per-row, then for each row, looks for the first user_id in the combined value.
(Requires the stringr package)
For every row in your table grep first value that has "user_id" in it and put result into column user_id.
df$user_id <- apply(df, 1, function(x) grep("user_id", x, value = TRUE)[1])

R factor and level

Levels make sense that it is unique values of the vector, but I can't get my head around what factor is. It just seems to repeat the vector values.
factor(c(1,2,3,3,4,5,1))
[1] 1 2 3 3 4 5 1
Levels: 1 2 3 4 5
Can anyone explain what factor is supposed to do, or why would I used it?
I'm starting to wonder if factors are like a code table in a database. Where the factor name is code table name and levels are the unique options of the code table. ?
A factor is stored as a hash table rather than raw character vector. What does this imply? There are two major benefits.
Much smaller memory footprint. Consider a text file containing the phrase "New Jersey" 100,000 times over encoded in ASCII. Now imagine if you just had to store the number 16 (in binary 100,000 times and then another table indicating that 16 means "New Jersey". It's leaner and faster.
Especially for visualization and statistical analysis, frequently we test for values "across all categories" (think ANOVA or what you would color a stacked barplot by). We can either repeatedly encode all of our functions to stack up observed choices in a string vector or we can simply create a new type of vector which will tell you what the valid choices are. That is called a factor, and the valid choices are called levels.

Most efficient format for array data for R import?

I'm in the enviable position of being able to set up the format for my data collection ahead of time, rather than being handed some crazy format and having to struggle with it. I'd like to make sure I'm setting it up in a way that minimizes headaches down the road, but I'm not very familiar with importing into multidimensional arrays so I'd like input. It also seems like a thought exercise that others might get some use from.
I am compiling a large number of data summaries (500+) with 23 single data values for each experiment and two additional vectors that vary between 100 and 1500 data values (these two vectors happen to always match in length for each sample, but their length is different for each sample). I'm having to store all of these in an Excel sheet which I'm currently building. I want to set it up in a way that efficiently stores this data for import into an R array.
I'm assuming that the longer dimensions, which vary in length, will have the max length (1500) and a bunch of NA's at the end rather than try to keep track of ragged data in Excel.
My current plan would be to store these in long form in Excel, with data labels in the first column (dim1, dim2,...), and the data summaries in each subsequent column (a, b, c...), since this saves the most space. Using a smaller number of dimensions as an example (7 single values, 2 vectors of length 1500), the data would look like this in Excel:
a b c...
dim1 2 5 7...
dim2 3 6 8...
dim3 6 8 2 ...
dim4 5 6 1...
dim5 6 2 1...
dim6 0 3 8...
dim7 8 5 4...
dim8 1 1 1...
dim8 2 2 2 ...
... continued x1500
dim9 4 4 4...
dim9 5 5 5 ...
...continued x1500
Can I easily import this, using the leftmost column to identify the dimensions of the array in long form? I don't see an easy way to do this using Reshape2, but perhaps I'm missing something. Or, do I need to have the data in paired columns?
It isn't clear to me whether this format is the most efficient way to organize this data for import into a multidimensional array, or if there is a better way. Eventually there will be a large number of samples so I'd like to think through this now rather than struggle later.
What is the most painless way to import this...or, is there a more efficient way of setting it up for easier import?
Hmm.. I can't think of a case that you would have to use melt. If you keep the current format, and add a heading to the 'dim' column then you should be able to work with that data fairly easily.
If you did transpose the data on 'dim' I think it would make things a lot more difficult.
It might good to know what variable types a,b,c,etc. are in order to make a better assessment.

Sorting by a slice of the text string

I am wanting to sort my data but the standard Excel "A to Z" sort function isn't cutting it. I was hoping someone knew how to make a custom sort that could suit my needs. Here is a sample:
chrPos count
chr1_10000598 10
chr1_10000647 10
chr1_10001370 30
chr1_10001390 30
chr1_10001392 30
chr1_10001414 30
chr1_10001418 30
chr1_10001473 10
chr1_10001505 10
chr1_10001516 20
chr1_1000156 30
As you can see the last row is out of place when using the built in sort function, this should be the first row not the last one here. I think adding a second layer of sorting would to the trick but that layer would have to sort by ascending value based on the number that is following the underscore.
Any ideas? Would this possibly be easier with R instead?
Edit to add details from comments:
Sorting is to be ascending on the numeric part after the underscore, within ascending on the chr numeric part (running from 1 to 22 both inclusive) and then chrM_, chrX_ and chrY_ in that order (also with their numeric parts sorted ascending).
The numeric part after the underscore may be up to 8 digits.
Assuming chrPos is in ColumnA, please try in a helper column:
=IF(FIND("_",A1)=5,CHAR(64+MID(A1,4,1)),CHAR(64+MID(A1,4,2)))&REPT("0",8-LEN(A1)+FIND("_",A1))&MID(A1,FIND("_",A1)+1,8)
OR, for additional requirements as mentioned in comments:
=IF(MID(A1,4,1)="M","W",IF(MID(A1,4,1)="X","X",IF(MID(A1,4,1)="Y","Y",IF(FIND("_",A1)=5,CHAR(64+MID(A1,4,1)),CHAR(64+MID(A1,4,2))))))&REPT("0",9-LEN(A1)+FIND("_",A1))&MID(A1,FIND("_",A1)+1,9)
then select the helper column, Copy, Paste Special, Values over the top and use that for sorting.

Resources