Most efficient format for array data for R import?

I'm in the enviable position of being able to set up the format for my data collection ahead of time, rather than being handed some crazy format and having to struggle with it. I'd like to make sure I'm setting it up in a way that minimizes headaches down the road, but I'm not very familiar with importing into multidimensional arrays so I'd like input. It also seems like a thought exercise that others might get some use from.
I am compiling a large number of data summaries (500+) with 23 single data values for each experiment and two additional vectors that vary between 100 and 1500 data values (these two vectors happen to always match in length for each sample, but their length is different for each sample). I'm having to store all of these in an Excel sheet which I'm currently building. I want to set it up in a way that efficiently stores this data for import into an R array.
I'm assuming that the longer dimensions, which vary in length, will have the max length (1500) and a bunch of NA's at the end rather than try to keep track of ragged data in Excel.
My current plan would be to store these in long form in Excel, with data labels in the first column (dim1, dim2,...), and the data summaries in each subsequent column (a, b, c...), since this saves the most space. Using a smaller number of dimensions as an example (7 single values, 2 vectors of length 1500), the data would look like this in Excel:
a b c...
dim1 2 5 7...
dim2 3 6 8...
dim3 6 8 2 ...
dim4 5 6 1...
dim5 6 2 1...
dim6 0 3 8...
dim7 8 5 4...
dim8 1 1 1...
dim8 2 2 2 ...
... continued x1500
dim9 4 4 4...
dim9 5 5 5 ...
...continued x1500
Can I easily import this, using the leftmost column to identify the dimensions of the array in long form? I don't see an easy way to do this using reshape2, but perhaps I'm missing something. Or do I need to have the data in paired columns?
It isn't clear to me whether this format is the most efficient way to organize this data for import into a multidimensional array, or if there is a better way. Eventually there will be a large number of samples so I'd like to think through this now rather than struggle later.
What is the most painless way to import this...or, is there a more efficient way of setting it up for easier import?

Hmm, I can't think of a case where you would have to use melt. If you keep the current format and add a heading to the 'dim' column, you should be able to work with that data fairly easily.
If you did transpose the data on 'dim', I think it would make things a lot more difficult.
It might be good to know what variable types a, b, c, etc. are in order to make a better assessment.
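As a rough sketch of how the long layout could be handled once it is in R: assume the sheet has been read into a data frame (built by hand here as `raw`; the `label` heading and all the values are made up for illustration). Splitting each sample column by the label column gives a nested list, which copes with the ragged vectors without forcing them into a rectangular array.

```r
# Hypothetical stand-in for the imported sheet: one label column, one column per sample.
# dim1/dim2 are single values; dim8/dim9 are the variable-length vectors, NA-padded.
raw <- data.frame(
  label = c("dim1", "dim2", "dim8", "dim8", "dim8", "dim9", "dim9", "dim9"),
  a     = c(2, 3, 1, 2, NA, 4, 5, NA),
  b     = c(5, 6, 1, 2, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

samples <- setdiff(names(raw), "label")

# One list element per sample; inside it, one element per label with the NA padding dropped
by_sample <- lapply(setNames(samples, samples), function(s) {
  lapply(split(raw[[s]], raw$label), function(v) v[!is.na(v)])
})

by_sample$a$dim8          # the two real values for sample a
length(by_sample$b$dim9)  # sample b keeps all three values
```

Dropping NA this way also drops any genuine missing values inside a vector, so it only suits data where NA appears purely as padding.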

Related

R: Finding duplicates in a data frame and recording them in vectors

I am trying to create some lines on a graph based on a third coordinate (x,y, temp). I would like to get a vector of indexes so I can split them into x and y vectors for each duplicate temperature. To make this more clear, I will include my actual data set:
DataFrame
I am trying to make multiple lines that have the same temp value. For example, I would like to have the following coordinates on the same line [0,14] [0,22] [0,26] [0,28]. They all have the temp value of 5.8. Once I find the duplicates, I will record the indexes in a vector which will allow me to retrieve the x and y coordinates. One other aspect is that I will not always know how many entries are going to be in the data.frame.
My question is how can I find the duplicates and store their indices in a vector? Once I have the indices for the duplicate temps, I can be sure to grab their x y coordinates and use that to create lines.
If you can answer my question or have any advice on how I can do this better, all help is appreciated
Consider the following:
df <- data.frame(temp = sample.int(n=3, size=5, replace=TRUE))
df
##   temp
## 1    3
## 2    3
## 3    1
## 4    3
## 5    1
duplicated(df$temp)
## [1] FALSE  TRUE FALSE  TRUE  TRUE
which(duplicated(df$temp))
## [1] 2 4 5
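Since duplicated() only flags the second and later occurrences, a sketch that groups all row indices by temperature may be closer to what the question asks for. The first four rows below use the coordinates quoted in the question; the last two rows are invented for illustration.

```r
df <- data.frame(
  x    = c(0, 0, 0, 0, 1, 1),
  y    = c(14, 22, 26, 28, 14, 22),
  temp = c(5.8, 5.8, 5.8, 5.8, 6.1, 7.0)
)

# Row indices grouped by temperature value
idx <- split(seq_len(nrow(df)), df$temp)

# Keep only temperatures that occur more than once
idx <- idx[lengths(idx) > 1]

# x/y coordinates for each repeated temperature, ready to pass to lines()
lines_xy <- lapply(idx, function(i) df[i, c("x", "y")])
lines_xy[["5.8"]]
```

This works for any number of rows, since nothing depends on knowing the size of the data frame in advance.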
You've stated in the comments that you're looking to make an isopleth graph. The procedure you have described will not generate anything resembling an isopleth graph. Since it looks like your data is arranged in a regular grid, you should do something like the solutions presented in this question and answer, which use functions specifically designed for extracting contours from a grid of values. Another option is the contourLines function in the grDevices package. If you want higher-resolution, less jagged contours, you might look into using either the interp.surface or Krig functions from the fields package to interpolate your data to the resolution you require.
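A minimal sketch of the contourLines route, using a made-up 5 x 5 grid in place of the real temperature data (the grid values and the 5.5 level are assumptions chosen only to keep the example small):

```r
x <- 1:5
y <- 1:5
z <- outer(x, y, "+")   # hypothetical temperatures on a regular grid, z[i, j] = x[i] + y[j]

# Each element of the result is a list with the contour level and its x/y coordinates
cl <- contourLines(x, y, z, levels = 5.5)

cl[[1]]$level
cbind(cl[[1]]$x, cl[[1]]$y)   # points along the isopleth where x + y = 5.5
```

The x and y vectors of each element can be passed straight to lines() to draw that isopleth.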

converting rows to columns of a data frame in R

I have a data set like this
movieID  title  year  country  genre     directorName  Rating  actorName1  actorName.2
1        hello  1995  USA      action    john smith    6       tom hanks   charlie sheen
2        MI2    1997  USA      action    mad max       8       tom cruize  some_body
3        MI2    1997  USA      thriller  mad max       8       tom cruize  some_body
Basically, there are numerous rows that differ only in a user-given genre, and I would like to convert those into columns named genre1, genre2, ...
I tried reshape() but it would only convert based on some ID variable. If anyone has any ideas let me know
You can use reshape() to do this, if you understand the lens through which reshape() views data.
Background
First, consider the concept of a record in the context of the relational model of data management. Generally, in a table of data, each record should correspond to a well-defined unit of data, concisely termed the record unit, with one or more columns acting as identification or key variables that serve to differentiate between unique instances of the record unit.
Usually, units are described by a set of scalar variables. In other words, each record has associated with it one or more scalar values, each of which provides a single piece of information about the unit. In a nice simple world, all properties of units would be scalar, and thus you could represent each variable as a single column vector, with each element/cell corresponding to one record unit, and thereby providing the value of that particular property for that particular unit.
Further to the concept of properties, it is possible and very common to identify typing or grouping classifications of units. These are often represented as additional scalar properties of units.
When people talk about the long format vs. the wide format of tabular data, they are generally referring to how these kinds of type classifications are laid out in a table. This choice of data layout is directly related to the choice of unit that is represented by a single record in the table. These are actually one and the same choice.
For example, in an experiment with multiple measurements per individual, it would be possible to store one measurement per record, with individuals represented over multiple records, and a type column to distinguish between measurement type. Alternatively, it would be possible to store one individual per record, with each measurement represented by a single column. With respect to each other, the former format is long, and the latter format is wide. But now consider that, if each individual belonged to a single experimental group within the experiment, it would be possible to store one group per record, with each individual represented by a set of columns, and each measurement represented by one column within the set. This is yet a "wider" format, if you will. It's all relative.
Unfortunately, unit characteristics are sometimes more complex than simple scalar values. The most common case of this is a multivalue property, sometimes described as a many-to-one relationship (especially in the context of DBMSs). In other words, multiple values for the property can be associated with a single record unit. In cases like this, it is not possible to represent the multivalue property as a simple column vector within the data set. There are hacks that programmers often settle into when trying to deal with this complexity, such as:
Concatenating the multiple values into a single scalar value (such as a single comma-separated string, or a bit vector). Let's call this the "concatenation hack".
Duplicating the unit record for each value of the property. (This generally can only be plausible if only one of the properties in the data set is multivalue.) Let's call this the "duplication hack".
Separating the property into multiple "instances" of itself, each stored in its own column. Let's call this the "separation hack".
Simply trying to ignore all but one of the multiple values. Let's call this the "ignorance hack".
In some contexts, special data types can be used to more appropriately represent the data as a pseudo-column-vector. PostgreSQL, for example, provides an array column type, and even R data.frames can have list columns whose individual elements can hold any data type supported by R, including multielement vectors. These representations are usually preferable to the aforementioned hacks.
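For instance, a list column in an R data.frame might look like the sketch below. The movie titles and genres are borrowed from the question; the layout itself is only an illustration.

```r
movies <- data.frame(title = c("hello", "MI2"), stringsAsFactors = FALSE)

# One list element per row; each element can hold any number of genres
movies$genre <- list("action", c("action", "thriller"))

lengths(movies$genre)   # number of genres stored for each movie
movies$genre[[2]]       # all genres of the second movie
```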
Probably the most widely used solution that I wouldn't qualify as a hack is to completely separate the multivalue property from the primary table of data, and instead store it as a separate table which is linked to the primary table on a key. Each record in the secondary table has a key to a record in the primary table, and stores alongside the key a single value of the multivalue property. This is the design advocated by the relational model.
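A small sketch of that two-table design in R, with a hypothetical movie_key column linking the tables (the key name and the values are assumptions for illustration):

```r
# Primary table: one row per movie
movies <- data.frame(movie_key = 1:2,
                     title = c("hello", "MI2"),
                     stringsAsFactors = FALSE)

# Secondary table: one row per (movie, genre) pair
genres <- data.frame(movie_key = c(1L, 2L, 2L),
                     genre = c("action", "action", "thriller"),
                     stringsAsFactors = FALSE)

# Joining on the key recovers the duplicated layout only when it is needed
joined <- merge(movies, genres, by = "movie_key")
joined
```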
These approaches all have their own tradeoffs, of course, and the analysis of which is optimal for a given situation can be very complex, nebulous, and even somewhat subjective. I won't go into more detail on this here.
Before I begin to talk about reshape(), it is important to emphasize that unit typing is a very different thing from multivalue properties. Reshaping data is generally supposed to be about managing typing and record unit selection. It is not supposed to be about managing multivalue property layout, but it can be used in this way, as we will see.
reshape()
At its most abstract, reshape() can be used to transform a set of typed scalar data columns from one row per type with a discriminator column to one column per type with a discriminator suffix in the column name, for every unique (possibly multicolumn) key, and vice-versa.
The key will generally correspond with a single record unit, to use the terminology introduced earlier. Each key uniquely identifies one record unit.
The data columns are the actual variables/properties which describe the record units, with the discriminator acting to distinguish between the different types of the data variables.
In the terminology of the reshape() documentation and interface, the key columns are "id" columns, the discriminator is the "time" column, and the data columns are "varying" columns.
It is important to understand that the key you specify as the idvar argument is always the unique key of the wide format, whether you are transforming to wide from long, or to long from wide. In the long format, the unique key is the idvar columns plus the discriminator column (timevar).
Here's a simple demo:
## define example long table
long <- data.frame(id1=rep(letters[1:2],each=4L),id2=rep(1:2,each=2L),type=1:2,x=1:8,y=9:16);
long;
## id1 id2 type x y
## 1 a 1 1 1 9
## 2 a 1 2 2 10
## 3 a 2 1 3 11
## 4 a 2 2 4 12
## 5 b 1 1 5 13
## 6 b 1 2 6 14
## 7 b 2 1 7 15
## 8 b 2 2 8 16
## convert to wide
idvar <- c('id1','id2');
timevar <- 'type';
wide <- reshape(long,dir='w',idvar=idvar,timevar=timevar);
attr(wide,'reshapeWide') <- NULL; ## remove "helper" attribute, which cannot always be relied upon
wide;
## id1 id2 x.1 y.1 x.2 y.2
## 1 a 1 1 9 2 10
## 3 a 2 3 11 4 12
## 5 b 1 5 13 6 14
## 7 b 2 7 15 8 16
## convert back to long
long2 <- reshape(wide,dir='l',idvar=idvar,timevar=timevar,varying=names(wide)[!names(wide)%in%c(idvar,timevar)]);
attr(long2,'reshapeLong') <- NULL; ## remove "helper" attribute, which cannot always be relied upon
long2 <- long2[do.call(order,long2[c(idvar,timevar)]),]; ## better order, corresponding with original long
rownames(long2) <- NULL; ## remove useless row names
long2$type <- as.integer(long2$type); ## annoyingly, longifying interprets discriminator suffixes as doubles
identical(long,long2);
## [1] TRUE
The above code also demonstrates some of the quirks committed by reshape(), such as attribute assignments (that I've never seen anyone rely upon), unexpected row order, undesirable row names, and non-ideal vector type derivation. All of these quirks can be papered over with simple modifications, as I show above.
Also notice that the data columns need not be specified when transforming from long to wide, in which case reshape() derives them by the process of elimination (every column that is not an id or time column is treated as varying), but the varying argument cannot be omitted when transforming from wide to long.
Input
The situation you've gotten yourself into appears to be that you have a data.frame that is supposed to contain one row per movie, but each movie record has been duplicated for each genre that is associated with the movie. In other words, the movie is the record unit, and the genre is a multivalue property associated with the movie, which is currently being represented by the duplication hack.
Your objective seems to be to transform the data from the duplication hack into the separation hack.
I don't mean to sound too critical here; these hacks are widely used and are, in many cases, fairly effective at handling this kind of complexity in a relatively simple way. It's very likely this is a good solution for your application. But I'm going to call a spade a spade; these are hacks, and are far from the most appropriate or robust solutions for data processing. And I agree that the separation hack is better than the duplication hack.
Another confusing detail is that there is a movieID column which appears to be unique per row, and not unique per movie. IDs 2 and 3 both seem to be associated with movie MI2.
My interpretation is that, in the input, because the duplication hack has been used to deal with multiple genres, each row can be thought of as being unique per genre instance. In other words, each row represents a single instance of a genre as used in a single movie. Hence the movieID column is better thought of as a genre instance identifier, and has just been misnamed. (An alternative interpretation is that it was generated incorrectly, and should be unique per movie, in which case it should be fixed and treated identically to the key columns described later.)
Solution
We can solve this problem by calling reshape() to transform from long format to wide format.
Recall that reshaping is supposed to be used for type layout, for navigating between record unit representations. Here we're instead going to use it for transforming how the multivalue property currently stored in the genre column is laid out.
Now, the most important question is, which columns are keys (idvar), which is the discriminator (timevar), and which are data (varying)?
The easiest one is the genre column. It's a data column. It's not part of the key that will help uniquely identify each movie record in the wide format, and it's certainly not a discriminator of other data columns, so it must be a data column itself. We can also arrive at this answer by considering what must happen to it during the transformation; for each unique key, the genre values must be separated from one row per value to one column per value, which is what happens to all data columns when transforming from long to wide.
Now it's useful to consider the discriminator column. Which one is it? In actuality, it doesn't exist in the input. There's no column that says "this is genre type X, this is genre type Y". So what do we do? According to your required output, you want to associate with each genre a sequential index number, presumably in row order. This means we need to synthesize a new column with such a sequence when passing the data.frame to reshape(). However, we must be careful to ensure that the sequence starts anew for each movie, otherwise every record in the input table would see its genre occupy its own column in the output, due to its unique discriminator suffix. We can do this with ave() (grouping by the key columns) and transform(). We'll name the synthesized column time, which is the default assumption by reshape() if you don't specify the timevar argument. This will allow us to omit specification of that argument. (Note: I've always wished that reshape() would default to such a row-order sequence instead of looking for an input column named time, but it doesn't do that. Oh well.)
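To see what the synthesized sequence looks like in isolation, here is a small sketch of ave() restarting a counter within each group. It groups by title alone for brevity, which is a simplification of grouping by all the key columns.

```r
title <- c("hello", "MI2", "MI2")

# seq_along() is applied separately within each group of identical titles,
# so the counter restarts at 1 for every movie
time <- ave(seq_along(title), title, FUN = seq_along)
time
```

The two MI2 rows receive 1 and 2, which is what later becomes the genre1 and genre2 suffixes.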
Now let's deal with the movieID column. Being a unique identifier in the input table, the only way to include it in the output table would be to also treat it as a data column, so that it would be split by the discriminator into separate columns. I decided to make the assumption that you don't want to do this, so I just removed it from the input table before reshaping, by exploiting the same transform() call. If you want, you can excise the removal piece to see the effect of including movieID across the transformation.
That leaves the remaining columns of title, year, country, directorName, Rating, actorName1, and actorName.2. How should we treat these?
Technically speaking, conceptually, most of them should be data columns. They can't be discriminators (we already covered that), and there's no way most of them (Rating, for example) could be considered key columns. Again, conceptually.
But it would be incorrect to specify any of them as data columns. The reason is that we're not using reshape() in the normal way. We know the movie records have been duplicated for the genre duplication hack used by the input data.frame, and so all the columns I just listed are actually just duplicates within the movie record group. We need these columns to effectively collapse to a single record in the output, and that's exactly what happens with key columns that pass through a reshape() call. Hence, we must identify them all as key columns by passing them to the idvar argument.
Another way of thinking about this is that the key columns are left untouched by reshape(), other than deduplication (if going from long to wide) or duplication (if going from wide to long). It is only the discriminator column that is transferred from column to suffix (if going from long to wide) or vice-versa (if going from wide to long), and data columns that are transferred from single column to multiple columns (if going from long to wide) or vice-versa (if going from wide to long). We need these columns to remain untouched, other than deduplication. Hence we require all columns, other than the target multivalue property column genre and the synthesized time column (and, in this case, the extraneous movieID column) to be specified as key columns.
Note that this is true even if one or more of the key columns could serve as a true key for the movie records. For example, if title was known to be unique within the table by movie, it would still be incorrect to just specify title as the key, and all the other column names I listed as data columns, because they would then be widened in the output according to the synthesized discriminator, even though we know all values within each movie record group are identical.
So, here's the end result:
df <- data.frame(
    movieID=c(1L,2L,3L),
    title=c('hello','MI2','MI2'),
    year=c(1995L,1997L,1997L),
    country=c('USA','USA','USA'),
    genre=c('action','action','thriller'),
    directorName=c('john smith','mad max','mad max'),
    Rating=c(6L,8L,8L),
    actorName1=c('tom hanks','tom cruize','tom cruize'),
    actorName.2=c('charlie sheen','some_body','some_body'),
    stringsAsFactors=FALSE
);
idcns <- names(df)[!names(df)%in%c('movieID','genre')];
reshape(transform(df,movieID=NULL,time=ave(df$movieID,df[idcns],FUN=seq_along)),dir='w',idvar=idcns,sep='');
## title year country directorName Rating actorName1 actorName.2 genre1 genre2
## 1 hello 1995 USA john smith 6 tom hanks charlie sheen action <NA>
## 2 MI2 1997 USA mad max 8 tom cruize some_body action thriller
Note that it is irrelevant exactly which vector is passed as the first argument to ave(), since seq_along() ignores its argument, except for its length. But we do require an integer vector, since ave() tries to coerce its result to the same type as the argument. It is acceptable to use df$movieID because it is an integer vector; alternatively we could use df$year, df$Rating, or synthesize an integer vector with seq_len(nrow(df)) or integer(nrow(df)).
Try this with dplyr and tidyr:
library(tidyr)
library(dplyr)
df %>% mutate(yesno=1) %>% spread(genre, yesno, fill=0)
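This creates a column yesno that supplies a value to fill in for each genre, which spread() from tidyr then turns into one indicator column per genre. fill=0 means that genres a movie does not belong to are filled with 0 instead of NA. Note that every remaining column acts as a key, so with the question's data the movieID column would have to be dropped first for the two MI2 rows to collapse into one.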
Before:
genre title yesno
1 action lethal weapon 1
2 thriller shining 1
3 action taken 1
4 scifi alien 1
After:
title action scifi thriller
1 alien 0 1 0
2 lethal weapon 1 0 0
3 shining 0 0 1
4 taken 1 0 0

Clustering big data

I have a list like this:
A B score
B C score
A C score
......
where the first two columns contain the variable name and third column contains the score between both. Total number of variables is 250,000 (A,B,C....). And the score is a float [0,1]. The file is approximately 50 GB. And the pairs of A,B where scores are 1, have been removed as more than half the entries were 1.
I wanted to perform hierarchical clustering on the data.
Should I convert the linear form to a matrix with 250,000 rows and 250,000 columns? Or should I partition the data and do the clustering?
I'm clueless with this. Please help!
Thanks.
Your input data already is the matrix.
However, hierarchical clustering usually scales as O(n^3). That won't work with your data set's size. Plus, implementations usually need more than one copy of the matrix. You may need 1 TB of RAM then: 2*8*250000*250000 bytes is a lot.
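The arithmetic behind that estimate, assuming 8-byte doubles and two full copies of a dense 250,000 x 250,000 matrix:

```r
n <- 250000            # number of variables
bytes_per_double <- 8  # size of one double-precision score
copies <- 2            # assumed number of matrix copies held in memory

total_bytes <- copies * bytes_per_double * n^2
total_bytes / 1e12     # size in (decimal) terabytes
```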
Some special cases can run in O(n^2): SLINK does. If your data is nicely sorted, it should be possible to run single-link in a single pass over your file. But you will have to implement this yourself. Don't even think of using R or something fancy.
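To illustrate the single-pass idea only (not as a way to process the 50 GB file), here is a toy sketch of single-link merging over an edge list with a simple union-find. It assumes the score is a distance and that the pairs are already sorted in increasing order; both are assumptions, and the edge values are made up.

```r
# Toy edge list, assumed sorted by increasing distance
edges <- data.frame(a = c("A", "C", "B", "A"),
                    b = c("B", "D", "C", "D"),
                    d = c(0.1, 0.2, 0.5, 0.9),
                    stringsAsFactors = FALSE)

nodes  <- unique(c(edges$a, edges$b))
parent <- setNames(nodes, nodes)   # each node starts as its own cluster root

# Follow parent links until reaching a node that is its own parent
find_root <- function(v) {
  while (parent[[v]] != v) v <- parent[[v]]
  v
}

merges <- list()
for (i in seq_len(nrow(edges))) {
  ra <- find_root(edges$a[i])
  rb <- find_root(edges$b[i])
  if (ra != rb) {                  # the pair joins two different clusters
    parent[[rb]] <- ra
    merges[[length(merges) + 1L]] <- data.frame(a = ra, b = rb, height = edges$d[i])
  }
}
merges <- do.call(rbind, merges)
merges   # one row per merge, in order of increasing height
```

Each accepted pair is one merge of the single-link dendrogram, so n variables yield n - 1 merges. At the scale in the question, the same logic would have to stream the file line by line in a lower-level language, as suggested above.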

Arranging Vigenere Cipher into columns

As I understand it, if you arrange a Vigenere ciphertext into columns, you can use the Index of Coincidence (IOC) to find the key length.
I'm struggling to write an algorithm that would take a piece of text and arrange it into columns.
For example -
1 2 3 4 5 6 7 8 9 10
Would return this if the period is 2 -
1,3,5,7,9
2,4,6,8,10
and perform an IOC test on each of these strings
If the period is 3 -
1,4,7,10
2,5,8
3,6,9
and perform an IOC test on each of these strings
Etc etc.
I've constructed an IOC test; however, I'm struggling to think of an algorithm to split the text up into columns. Any tips on how to think more like a computer scientist and construct algorithms like this?
If you already know the key length, it's pretty trivial. If you don't know the key length, you have to guess it by entropy. Here is an example in Python, for instance:
if you_dont_know_key_length:
    key_length = find_key_length_by_entropy(ciphertext)
# column i holds every key_length-th character, starting at offset i
columns = [ciphertext[i::key_length] for i in range(key_length)]
Any language should basically have the same construct (pick every n-th element in the ciphertext)
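In R, the same every-n-th-element split could be sketched as follows. split_columns is a hypothetical helper name, and the input string simply stands in for the numbered example from the question.

```r
# Split a string into `period` columns: column k gets characters k, k + period, k + 2*period, ...
split_columns <- function(text, period) {
  chars <- strsplit(text, "")[[1]]
  pos <- (seq_along(chars) - 1) %% period   # column index of each character, starting at 0
  unname(vapply(split(chars, pos), paste, character(1), collapse = ""))
}

split_columns("ABCDEFGHIJ", 2)
split_columns("ABCDEFGHIJ", 3)
```

Each returned string can then be passed to the IOC test, repeating for every candidate period.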

Iterate process in R using range of vectors derived from matrix

I must first apologize as I have no programming background, so please forgive me if this question is overly simplistic or if it has been addressed repeatedly. I would be very willing to help clarify my issue if it is not clear from my explanation.
I have two sets of data matrices. "A":
[Ac1] [Ac2] ... [Ac500]
[Ar1] 25 30 ... 15
[Ar2] 7 54 ... 41
...
[Ar25000]
and
"B" which is similar in the number of columns, but not the number of rows
[Bc1] [Bc2] ... [Bc500]
[Br1] 25 30 ... 15
[Br2] 7 54 ... 41
...
[Br20000]
I'm running a module ("npSeq") in R that takes matrix A as a constant input, together with a horizontal vector containing all of the values from one row of matrix B, e.g. [Br1]. The module returns a separate list of values. I will need to run the analysis independently for every row in matrix B, saving all of the returned lists, which I will then need to combine.
However I would like to know if there is a way to automate the process so that the module runs using a vector derived from row [Br1], saves the returned list, and then runs the process again using the vector derived from row [Br2]. Repeating the process until [Br20000].
Again I'm sorry that this is worded so poorly. I wish I understood enough of the terminology to state my problem more clearly.
You can use lapply to loop over B's row indices:
result.list <- lapply(seq_len(nrow(B)), function(i) npSeq(A, B[i, ]))
Note that this is not going to be much (any?) faster than using a for loop. It is just a short and clean equivalent. 20,000 iterations does sound like a lot so it may take a while depending on how slow the function is.
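A self-contained sketch of the whole loop, including the combining step. Because the real npSeq call signature and return value are not shown in the question, npSeq_stub below is a hypothetical stand-in that just returns a small list; the matrices are tiny made-up examples.

```r
# Hypothetical stand-in for npSeq: takes the full matrix A and one row of B
npSeq_stub <- function(A, b) list(total = sum(A) + sum(b), n = length(b))

A <- matrix(1:6, nrow = 2)
B <- matrix(1:9, nrow = 3)

# One call per row of B, results collected in a list
result.list <- lapply(seq_len(nrow(B)), function(i) npSeq_stub(A, B[i, ]))

# Combine the per-row lists into a single data frame, one row per row of B
combined <- do.call(rbind, lapply(result.list, as.data.frame))
combined
```

How the real results should be combined depends on what npSeq actually returns, so the rbind step is only one possibility.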
