Assign integral value to list of relative values - formula

I have an assortment of syrups, each of which has a value - the amount of sugar per volume. As people blend these syrups, I track which ones are used, and created a table to get a Relative Weight of each blend. I understand > Data > Sort > Options > Custom Sort Order.
However, I really don't wish to sort each table, and am looking for a way to parse a column of this list as entered, and return a column with results in an Integral Relative Value of each row, as compared to the weights of syrups in the other rows of the table.
Unique Name weight Not Unique Relative Value
blueberry .250 2
raspberry .333 3
orange .425 4
tangerine .333 3
blackberry .225 1
I am attempting to find the "Relative Sort", a nested function which can assign an integral value of the Unique Name which compares the weights of the syrups. A "Lookup" only works if there is an absolute equality, right?
What if someone doesn't use "blackberry syrup", then "blueberry" is the lightest, and should be labeled as 1.
Is this too complicated for LibreOffice Calc?
It's a recursive greater than/less than/equal to comparison?

IF the problem is calculating the right hand column below from entries that may be sorted ascending by value as on the left:
then an answer is, in C2 and copied down to suit (provided C1 is blank or 0):
=IF(B1<>B2,C1+1,C1)
Without sorting the RANK function might be simpler and adequate (though in the example returning 5 rather than 4).

Related

Octave: Values inside a matrix that are close

I have a vector that is being filled with random numbers within this range [0,1]. I want to somehow accept only the vectors, in which an element inside of it has a maximum deviation of 0,02 from its previous one and its next one.
For example I have the below vector [3,1]. This is acceptable, because the deviation of the 2nd element, between the first and the third element is not bigger than 0,02. Vector is not always consisted of 3 rows, it could be more.
**Vector**
0.32957
0.33097
0.33946
This is what i thought:
n=4
P=rand(1,n);
sort(P,"ascend");
for L=2:n
while P(L-1)-P(L)>0.02
P=rand(1,n);
endwhile
endfor
Vectorize this!
isvalid=~any(diff(sort(a))>0.02);
sort(a) : if its not sorted, sort
diff() : take the difference between adjacent elements
___ >0.02: Check if any of those differences is bigger than what you accept
~any(): if any is bigger, then return zero, "not valid".
From your code, it seems that there may be more to the question than what you ask, you seem to have the XY problem. You want to create a random vector that has the properties that you describe. You seem to be using uniform random numbers, so let me propose a way to generate your vector where your conditions are always true.
a(1)=rand(1); %or any other way to generate a first value.
length=100; %desired length.
a(2:length)=rand(length-1,1)*0.02; %generate random numbers never bigger than 0.02
a=cumsum(a); %cumulative sum
This ensures the vector is increasing in value, and never increasing more than 0.02

How to obtain the maximum sum of the array with the following condition?

Suppose the problem posed is as follows:
On Mars there lives a colony of worms. Each worm is represented as elements in an 1D array. Worms decide to eat each other but any worm can eat only its nearest neighbour. Each worm has a preset amount of energy(i.e the value of the element). On Mars, the laws dictate that when a worm i with energy x eats another worm with energy y, the i-th worm’s final energy becomes x-y. A worm is allowed to have negative energy levels.
Find the maximum value of energy of the last standing worm.
Sample data:
0,-1,-1,-1,-1 has answer 4.
2,1,2,1 has answer 4.
What will be the suitable logic to address this problem?
This problem has a surprisingly simple O(N) solution.
If any two members in the array have different signs, the answer is then sum of absolute values of all elements.
To see why, imagine a single positive value in the array, all other elements are negative (Example 1). Now the best strategy would be keeping this value positive and gradually eating all neighbors away to increase this positive value. The position of the positive value doesn't matter. The strategy is same in case of a single negative element.
In more general case, if an array of size N have values of different signs, we can always find an array of size N-1 with different signs, because there must be a pair of neighbors with different sign, which we can combine to form a number of any sign we prefer.
For example with this array : [1,2,-5,4,-10]
we can combine either (2,-5) or (4,-10). Lets combine (4,-10) to get [1,2,-5,-14]
We can only take (2,-5) now. So our array now is : [1,-7,-14]
Again only (1,-7) possible. But this time we have to keep combined value positive. So we are left with: [8,-14]
Final combining gives us 22, sum of all absolute values.
In case of all values with same sign, our first move would be to produce an opposite sign combining a neighbor pair with as little "cost" as possible. Intuitively, we don't want to waste two big numbers on this conversion. If we take x,y neighbor pair, when combined the new value (of opposite sign) will be abs(x-y). Since result is simply sum of absolute values, we can interpret it as - "loosing" abs(x) and abs(y) from maximum possible output and "gaining" abs(x-y) instead. So the "cost" for using this pair for sign conversion is abs(x)+abs(y)-abs(x-y). Since we need to minimise this cost, we choose from initial array neighbor pair that have lowest such value.
So if we take the above array but now all values are positive [1,2,5,4,10]:
"cost" of converting (1,2) to -1 is 1+2-abs(-1)=2.
"cost" of converting (2,5) to -3 is 2+5-abs(-3)=4.
"cost" of converting (5,4) to -1 is 5+4-abs(-1)=8.
"cost" of converting (4,10) to -6 is 4+10-abs(-6)=8.
So, we take and convert pair (1,2) to -1. Then just sum absolute values of resultant array to get 20. Notice that this value is exactly 2 less than our previous example.

Efficient string similarity grouping

Setting:
I have data on people, and their parent's names, and I want to find siblings (people with identical parent names).
pdata<-data.frame(parents_name=c("peter pan + marta steward",
"pieter pan + marta steward",
"armin dolgner + jane johanna dough",
"jack jackson + sombody else"))
The expected output here would be a column indicating that the first two observations belong to family X, while the third and fourth columns are each in a separate family. E.g:
person_id parents_name family_id
1 "peter pan + marta steward", 1
2 "pieter pan + marta steward", 1
3 "armin dolgner + jane johanna dough", 2
4 "jack jackson + sombody else" 3
Current approach:
I am flexible regarding the distance metric. Currently, I use Levenshtein edit-distance to match obs, allowing for two-character differences. But other variants such as "largest common sub string" would be fine if they run faster.
For smaller subsamples I use stringdist::stringdist in a loop or stringdist::stringdistmatrix, but this is getting increasingly inefficient as sample size increases.
The matrix version explodes once a certain sample size is used. My terribly inefficient attempt at looping is here:
#create data of the same complexity using random last-names
#(4mio obs and ~1-3 kids per parents)
pdata<-data.frame(parents_name=paste0(rep(c("peter pan + marta ",
"pieter pan + marta ",
"armin dolgner + jane johanna ",
"jack jackson + sombody "),1e6),stringi::stri_rand_strings(4e6, 5)))
for (i in 1:nrow(pdata)) {
similar_fatersname0<-stringdist::stringdist(pdata$parents_name[i],pdata$parents_name[i:nrow(pdata)],nthread=4)<2
#[create grouping indicator]
}
My question: There should be substantial efficiency gains, e.g. because I could stop comparing strings once I found them to sufficiently different in something that is easier to assess, eg. string length, or first word. The string length variant already works and reduces complexity by a factor ~3. But thats by far too little. Any suggestions to reduce computation time are appreciated.
Remarks:
The strings are actually in unicode and not in the Latin alphabet (Devnagari)
Pre-processing to drop unused characters etc is done
There are two challenges:
A. The parallel execution of Levenstein distance - instead of a sequential loop
B. The number of comparisons: if our source list has 4 million entries, theoretically we should run 16 trillion of Levenstein distance measures, which is unrealistic, even if we resolve the first challenge.
To make my use of language clear, here are our definitions
we want to measure the Levenstein distance between expressions.
every expression has two sections, the parent A full name and the parent B full name which are separated by a plus sign
the order of the sections matters (i.e. two expressions (1, 2) are identical if Parent A of expression 1 = Parent A of expression 2 and Parent B or expression 1= Parent B of expression 2. Expressions will not be considered identical if Parent A of expression 1 = Parent B of expression 2 and Parent B of expression 1 = Parent A of expression 2)
a section (or a full name) is a series of words, which are separated by spaces or dashes and correspond to the the first name and last name of a person
we assume the maximum number of words in a section is 6 (your example has sections of 2 or 3 words, I assume we can have up to 6)
the sequence of words in a section matters (the section is always a first name followed by a last name and never the last name first, e.g. Jack John and John Jack are two different persons).
there are 4 million expressions
expressions are assumed to contain only English characters. Numbers, spaces, punctuation, dashes, and any non-English character can be ignored
we assume the easy matches are already done (like the exact expression matches) and we do not have to search for exact matches
Technically the goal is to find series of matching expressions in the 4-million expressions list. Two expressions are considered matching expression if their Levenstein distance is less than 2.
Practically we create two lists, which are exact copies of the initial 4-million expressions list. We call then the Left list and the Right list. Each expression is assigned an expression id before duplicating the list.
Our goal is to find entries in the Right list which have a Levenstein distance of less than 2 to entries of the Left list, excluding the same entry (same expression id).
I suggest a two step approach to resolve the two challenges separately. The first step will reduce the list of the possible matching expressions, the second will simplify the Levenstein distance measurement since we only look at very close expressions. The technology used is any traditional database server because we need to index the data sets for performance.
CHALLENGE A
The challenge A consists of reducing the number of distance measurements. We start from a maximum of approx. 16 trillion (4 million to the power of two) and we should not exceed a few tens or hundreds of millions.
The technique to use here consists of searching for at least one similar word in the complete expression. Depending on how the data is distributed, this will dramatically reduce the number of possible matching pairs. Alternatively, depending on the required accuracy of the result, we can also search for pairs with at least two similar words, or with at least half of similar words.
Technically I suggest to put the expression list in a table. Add an identity column to create a unique id per expression, and create 12 character columns. Then parse the expressions and put each word of each section in a separate column. This will look like (I have not represented all the 12 columns, but the idea is below):
|id | expression | sect_a_w_1 | sect_a_w_2 | sect_b_w_1 |sect_b_w_2 |
|1 | peter pan + marta steward | peter | pan | marta |steward |
There are empty columns (since there are very few expressions with 12 words) but it does not matter.
Then we replicate the table and create an index on every sect... column.
We run 12 joins which try to find similar words, something like
SELECT L.id, R.id
FROM left table L JOIN right table T
ON L.sect_a_w_1 = R.sect_a_w_1
AND L.id <> R.id
We collect the output in 12 temp tables and run an union query of the 12 tables to get a short list of all expressions which have a potential matching expressions with at least one identical word. This is the solution to our challenge A. We now have a short list of the most likely matching pairs. This list will contain millions of records (pairs of Left and Right entries), but not billions.
CHALLENGE B
The goal of challenge B is to process a simplified Levenstein distance in batch (instead of running it in a loop).
First we should agree on what is a simplified Levenstein distance.
First we agree that the levenstein distance of two expressions is the sum of the levenstein distance of all the words of the two expressions which have the same index. I mean the Levenstein distance of two expressions is the distance of their two first words, plus the distance of their two second words, etc.
Secondly, we need to invent a simplified Levenstein distance. I suggest to use the n-gram approach with only grams of 2 characters which have an index absolute difference of less than 2 .
e.g. the distance between peter and pieter is calculated as below
Peter
1 = pe
2 = et
3 = te
4 = er
5 = r_
Pieter
1 = pi
2 = ie
3 = et
4 = te
5 = er
6 = r_
Peter and Pieter have 4 common 2-grams with an index absolute difference of less than 2 'et','te','er','r_'. There are 6 possible 2-grams in the largest of the two words, the distance is then 6-4 = 2 - The Levenstein distance would also be 2 because there's one move of 'eter' and one letter insertion 'i'.
This is an approximation which will not work in all cases, but I think in our situation it will work very well. If we're not satisfied with the quality of the results we can try with 3-grams or 4-grams or allow a larger than 2 gram sequence difference. But the idea is to execute much fewer calculations per pair than in the traditional Levenstein algorithm.
Then we need to convert this into a technical solution. What I have done before is the following:
First isolate the words: since we need only to measure the distance between words, and then sum these distances per expression, we can further reduce the number of calculations by running a distinct select on the list of words (we have already prepared the list of words in the previous section).
This approach requires a mapping table which keeps track of the expression id, the section id, the word id and the word sequence number for word, so that the original expression distance can be calculated at the end of the process.
We then have a new list which is much shorter, and contains a cross join of all words for which the 2-gram distance measure is relevant.
Then we want to batch process this 2-gram distance measurement, and I suggest to do it in a SQL join. This requires a pre-processing step which consists of creating a new temporary table which stores every 2-gram in a separate row – and keeps track of the word Id, the word sequence and the section type
Technically this is done by slicing the list of words using a series (or a loop) of substring select, like this (assuming the word list tables - there are two copies, one Left and one Right - contain 2 columns word_id and word) :
INSERT INTO left_gram_table (word_id, gram_seq, gram)
SELECT word_id, 1 AS gram_seq, SUBSTRING(word,1,2) AS gram
FROM left_word_table
And then
INSERT INTO left_gram_table (word_id, gram_seq, gram)
SELECT word_id, 2 AS gram_seq, SUBSTRING(word,2,2) AS gram
FROM left_word_table
Etc.
Something which will make “steward” look like this (assume the word id is 152)
| pk | word_id | gram_seq | gram |
| 1 | 152 | 1 | st |
| 2 | 152 | 2 | te |
| 3 | 152 | 3 | ew |
| 4 | 152 | 4 | wa |
| 5 | 152 | 5 | ar |
| 6 | 152 | 6 | rd |
| 7 | 152 | 7 | d_ |
Don't forget to create an index on the word_id, the gram and the gram_seq columns, and the distance can be calculated with a join of the left and the right gram list, where the ON looks like
ON L.gram = R.gram
AND ABS(L.gram_seq + R.gram_seq)< 2
AND L.word_id <> R.word_id
The distance is the length of the longest of the two words minus the number of the matching grams. SQL is extremely fast to make such a query, and I think a simple computer with 8 gigs of RAM would easily do several hundred of million lines in a reasonable time frame.
And then it's only a matter of joining the mapping table to calculate the sum of word to word distance in every expression, to get the total expression to expression distance.
You are using the stringdist package anyway, does stringdist::phonetic() suit your needs? It computes the soundex code for each string, eg:
phonetic(pdata$parents_name)
[1] "P361" "P361" "A655" "J225"
Soundex is a tried-and-true method (almost 100 years old) for hashing names, and that means you don't need to compare every single pair of observations.
You might want to go further and do soundex on first name and last name seperately for father and mother.
My suggestion is to use a data science approach to identify only similar (same cluster) names to compare using stringdist.
I have modified a little bit the code generating "parents_name" adding more variability in first and second names in a scenario close to reality.
num<-4e6
#Random length
random_l<-round(runif(num,min = 5, max=15),0)
#Random strings in the first and second name
parent_rand_first<-stringi::stri_rand_strings(num, random_l)
order<-sample(1:num, num, replace=F)
parent_rand_second<-parent_rand_first[order]
#Paste first and second name
parents_name<-paste(parent_rand_first," + ",parent_rand_second)
parents_name[1:10]
Here start the real analysis, first extract feature from the names such as global length, length of the first, length of the second one, numeber of vowels and consonansts in both first and second name (and any other of interest).
After that bind all these feature and clusterize the data.frame in a high number of clusters (eg. 1000)
features<-cbind(nchars,nchars_first,nchars_second,nvowels_first,nvowels_second,nconsonants_first,nconsonants_second)
n_clusters<-1000
clusters<-kmeans(features,centers = n_clusters)
Apply stringdistmatrix only inside each cluster (containing similar couple of names)
dist_matrix<-NULL
for(i in 1:n_clusters)
{
cluster_i<-clusters$cluster==i
parents_name<-as.character(parents_name[cluster_i])
dist_matrix[[i]]<-stringdistmatrix(parents_name,parents_name,"lv")
}
In dist_matrix you have the distance beetwen each element in the cluster and you are able to assign the family_id using this distance.
To compute the distance in each cluster (in this example) the code takes approximately 1 sec (depending on the dimension of the cluster), in 15mins all the distances are computed.
WARNING: dist_matrix grow very fast, in your code is better if you will analyze it inside di for loop extracting famyli_id and then you can discard it.
You may improve by not comparing all the couples of lines.
Instead, create a new variable that will be helpfull for decide if it is worth comparing.
For exemple, create a new variable "score" contaning the ordered list of letters used in parents_name (for exemple if "peter pan + marta steward" then the score will be "ademnprstw"), and calculate distance only between lines where score are matching.
Of course, you can find a score that fits better your need, and improve a little to enable comparison when not all the letters used are common ..
I faced the same performance issue couple years ago. I had to match people's duplicates based on their typed names. My dataset had 200k names and the matrix approach exploded. After searching for some day about a better method, the method I'm proposing here did the job for me in some minutes:
library(stringdist)
parents_name <- c("peter pan + marta steward",
"pieter pan + marta steward",
"armin dolgner + jane johanna dough",
"jack jackson + sombody else")
person_id <- 1:length(parents_name)
family_id <- vector("integer", length(parents_name))
#Looping through unassigned family ids
while(sum(family_id == 0) > 0){
ids <- person_id[family_id == 0]
dists <- stringdist(parents_name[family_id == 0][1],
parents_name[family_id == 0],
method = "lv")
matches <- ids[dists <= 3]
family_id[matches] <- max(family_id) + 1
}
result <- data.frame(person_id, parents_name, family_id)
That way the while will compare fewer matches on every iteration. From that, you might implement different performance boosters, like filtering the names with the same first letter before comparing, etc.
Making equivalency groups on non transitive relation does not make sense. If A is like B and B is like C, but A is not like C, how would you make families from that? Using something like soundex (that was idea of Neal Fultz, not mine) seems the only meaningful option and it solves your problem with performance too.
What I have used to reduce the permutations involved in this sort of name matching, is create a function that counts the syllables in the name (surname) involved. Then store this in the database, as a pre-processed value. This becomes a Syllable Hash function.
Then you can choose to group words together with the same number of syllables as each other. (Although I use algorithms that allow 1 or 2 syllables difference, which may be presented as legitimate spelling / typo errors...But my research has found that 95% of misspellings share the same number of syllables)
In this case Peter and Pieter would have the same syllable count (2), but Jones and Smith do not (they have 1). (For example)
If your function does not get 1 syllable for Jones, then you may need to increase your tolerance to allow for at least 1 syllable difference in the Syllable Hash function grouping that you use. (To account for incorrect syllable function results, and to catch the matching surname correctly in the grouping)
My syllable counting function may not apply completely - as you might need to cope with non-English letter sets...(So I have not pasted the code...Its in C anyway) Mind you - the Syllable count function does not have to be accurate in terms of TRUE syllable count; it simply needs to act as a reliable Hashing function - which it does. Far superior to SoundEx which relies on the first letter being accurate.
Give it a go, you might be surprised how much improvement you get by implementing a Syllable Hash function. You may have to ask SO for help getting the function into your language.
If I get it right, you want to compare every parent pair (every row in parent_name data frame) with all other pairs (rows), and keep rows that have Levenstein distance smaller or equal to 2.
I have written following code for the beginning:
pdata<-data.frame(parents_name=c("peter pan + marta steward",
"pieter pan + marta steward",
"armin dolgner + jane johanna dough",
"jack jackson + sombody else"))
fuzzy_match <- list()
system.time(for (i in 1:nrow(pdata)){
fuzzy_match[[i]] <- cbind(pdata, parents_name_2 = pdata[i,"parents_name"],
dist = as.integer(stringdist(pdata[i,"parents_name"], pdata$parents_name)))
fuzzy_match[[i]] <- fuzzy_match[[i]][fuzzy_match[[i]]$dist <= 2,]
})
fuzzy_final <- do.call(rbind, fuzzy_match)
Does it return what you wanted?
it reproduces your output, i guess you will have to decide partial matching criteria, i kept the default agrep ones
pdata$parents_name<-as.character(pdata$parents_name)
x00<-unique(lapply(pdata$parents_name,function(x) agrep(x,pdata$parents_name)))
x=c()
for (i in 1:length(x00)){
x=c(x,rep(i,length(x00[[i]])))
}
pdata$person_id=seq(1:nrow(pdata))
pdata$family_id=x

converting rows to columns of a data frame in R

I have a data set like this
movieID title year country genre directorName Rating actorName1 actorName.2
1 hello 1995 USA action john smith 6 tom hanks charlie sheen
2 MI2 1997 USA action mad max 8 tom cruize some_body
3 MI2 1997 USA thriller mad max 8 tom cruize some_body
basically there are numerous rows that just have a different user given genre that I would like to columns having genre1, genre2, ...
I tried reshape() but it would only convert based on some ID variable. If anyone has any ideas let me know
You can use reshape() to do this, if you understand the lens through which reshape() views data.
Background
First, consider the concept of a record in the context of the relational model of data management. Generally, in a table of data, each record should correspond to a well-defined unit of data, concisely termed the record unit, with one or more columns acting as identification or key variables that serve to differentiate between unique instances of the record unit.
Usually, units are described by a set of scalar variables. In other words, each record has associated with it one or more scalar values, each of which provides a single piece of information about the unit. In a nice simple world, all properties of units would be scalar, and thus you could represent each variable as a single column vector, with each element/cell corresponding to one record unit, and thereby providing the value of that particular property for that particular unit.
Further to the concept of properties, it is possible and very common to identify typing or grouping classifications of units. These are often represented as additional scalar properties of units.
When people talk about the long format vs. the wide format of tabular data, they are generally referring to how these kinds of type classifications are laid out in a table. This choice of data layout is directly related to the choice of unit that is represented by a single record in the table. These are actually one and the same choice.
For example, in an experiment with multiple measurements per individual, it would be possible to store one measurement per record, with individuals represented over multiple records, and a type column to distinguish between measurement type. Alternatively, it would be possible to store one individual per record, with each measurement represented by a single column. With respect to each other, the former format is long, and the latter format is wide. But now consider that, if each individual belonged to a single experimental group within the experiment, it would be possible to store one group per record, with each individual represented by a set of columns, and each measurement represented by one column within the set. This is yet a "wider" format, if you will. It's all relative.
Unfortunately, unit characteristics are sometimes more complex than simple scalar values. The most common case of this is a multivalue property, sometimes described as a many-to-one relationship (especially in the context of DBMSs). In other words, multiple values for the property can be associated with a single record unit. In cases like this, it is not possible to represent the multivalue property as a simple column vector within the data set. There are hacks that programmers often settle into when trying to deal with this complexity, such as:
Concatenating the multiple values into a single scalar value (such as a single comma-separated string, or a bit vector). Let's call this the "concatenation hack".
Duplicating the unit record for each value of the property. (This generally can only be plausible if only one of the properties in the data set is multivalue.) Let's call this the "duplication hack".
Separating the property into multiple "instances" of itself, each stored in its own column. Let's call this the "separation hack".
Simply trying to ignore all but one of the multiple values. Let's call this the "ignorance hack".
In some contexts, special data types can be used to more appropriately represent the data as a pseudo-column-vector. PostgreSQL, for example, provides an array column type, and even R data.frames can have list columns whose individual elements can hold any data type supported by R, including multielement vectors. These representations are usually preferable to the aforementioned hacks.
Probably the most widely used solution that I wouldn't qualify as a hack is to completely separate the multivalue property from the primary table of data, and instead store it as a separate table which is linked to the primary table on a key. Each record in the secondary table has a key to a record in the primary table, and stores alongside the key a single value of the multivalue property. This is the design advocated by the relational model.
These approaches all have their own tradeoffs, of course, and the analysis of which is optimal for a given situation can be very complex, nebulous, and even somewhat subjective. I won't go into more detail on this here.
Before I begin to talk about reshape(), it is important to emphasize that unit typing is a very different thing from multivalue properties. Reshaping data is generally supposed to be about managing typing and record unit selection. It is not supposed to be about managing multivalue property layout, but it can be used in this way, as we will see.
reshape()
At its most abstract, reshape() can be used to transform a set of typed scalar data columns from one row per type with a discriminator column to one column per type with a discriminator suffix in the column name, for every unique (possibly multicolumn) key, and vice-versa.
The key will generally correspond with a single record unit, to use the terminology introduced earlier. Each key uniquely identifies one record unit.
The data columns are the actual variables/properties which describe the record units, with the discriminator acting to distinguish between the different types of the data variables.
In the terminology of the reshape() documentation and interface, the key columns are "id" columns, the discriminator is the "time" column, and the data columns are "varying" columns.
It is important to understand that the key you specify as the idvar argument is always the unique key of the wide format, whether you are transforming to wide from long, or to long from wide. In the long format, the unique key is the idvar columns plus the discriminator column (timevar).
Here's a simple demo:
## define example long table
long <- data.frame(id1=rep(letters[1:2],each=4L),id2=rep(1:2,each=2L),type=1:2,x=1:8,y=9:16);
long;
## id1 id2 type x y
## 1 a 1 1 1 9
## 2 a 1 2 2 10
## 3 a 2 1 3 11
## 4 a 2 2 4 12
## 5 b 1 1 5 13
## 6 b 1 2 6 14
## 7 b 2 1 7 15
## 8 b 2 2 8 16
## convert to wide
idvar <- c('id1','id2');
timevar <- 'type';
wide <- reshape(long,dir='w',idvar=idvar,timevar=timevar);
attr(wide,'reshapeWide') <- NULL; ## remove "helper" attribute, which cannot always be relied upon
wide;
## id1 id2 x.1 y.1 x.2 y.2
## 1 a 1 1 9 2 10
## 3 a 2 3 11 4 12
## 5 b 1 5 13 6 14
## 7 b 2 7 15 8 16
## convert back to long
long2 <- reshape(wide,dir='l',idvar=idvar,timevar=timevar,varying=names(wide)[!names(wide)%in%c(idvar,timevar)]);
attr(long2,'reshapeLong') <- NULL; ## remove "helper" attribute, which cannot always be relied upon
long2 <- long2[do.call(order,long2[c(idvar,timevar)]),]; ## better order, corresponding with original long
rownames(long2) <- NULL; ## remove useless row names
long2$type <- as.integer(long2$type); ## annoyingly, longifying interprets discriminator suffixes as doubles
identical(long,long2);
## [1] TRUE
The above code also demonstrates some of the quirks committed by reshape(), such as attribute assignments (that I've never seen anyone rely upon), unexpected row order, undesirable row names, and non-ideal vector type derivation. All of these quirks can be papered over with simple modifications, as I show above.
Also notice that the varying argument can be omitted when transforming from long to wide, in which case it is derived by reshape() by the process of elimination, but it cannot be omitted when transforming from wide to long.
Input
The situation you've gotten yourself into appears to be that you have a data.frame that is supposed to contain one row per movie, but each movie record has been duplicated for each genre that is associated with the movie. In other words, the movie is the record unit, and the genre is a multivalue property associated with the movie, which is currently being represented by the duplication hack.
Your objective seems to be to transform the data from the duplication hack into the separation hack.
I don't mean to sound too critical here; these hacks are widely used and are, in many cases, fairly effective at handling this kind of complexity in a relatively simple way. It's very likely this is a good solution for your application. But I'm going to call a spade a spade; these are hacks, and are far from the most appropriate or robust solutions for data processing. And I agree that the separation hack is better than the duplication hack.
Another confusing detail is that there is a movieID column which appears to be unique per row, and not unique per movie. IDs 2 and 3 both seem to be associated with movie MI2.
My interpretation is that, in the input, because the duplication hack has been used to deal with multiple genres, each row can be thought of as being unique per genre instance. In other words, each row represents a single instance of a genre as used in a single movie. Hence the movieID column is better thought of as a genre instance identifier, and has just been misnamed. (An alternative interpretation is that it was generated incorrectly, and should be unique per movie, in which case it should be fixed and treated identically to the key columns described later.)
Solution
We can solve this problem by calling reshape() to transform from long format to wide format.
Recall that reshaping is supposed to be used for type layout, for navigating between record unit representations. Here we're instead going to use it for transforming how the multivalue property currently stored in the genre column is laid out.
Now, the most important question is, which columns are keys (idvar), which is the discriminator (timevar), and which are data (varying)?
The easiest one is the genre column. It's a data column. It's not part of the key that will help uniquely identify each movie record in the wide format, and it's certainly not a discriminator of other data columns, so it must be a data column itself. We can also arrive at this answer by considering what must happen to it during the transformation; for each unique key, the genre values must be separated from one row per value to one column per value, which is what happens to all data columns when transforming from long to wide.
Now it's useful to consider the discriminator column. Which one is it? In actuality, it doesn't exist in the input. There's no column that says "this is genre type X, this is genre type Y". So what do we do? According to your required output, you want to associate with each genre a sequential index number, presumably in row order. This means we need to synthesize a new column with such a sequence when passing the data.frame to reshape(). However, we must be careful to ensure that the sequence starts anew for each movie, otherwise every record in the input table would see its genre occupy its own column in the output, due to its unique discriminator suffix. We can do this with ave() (grouping by the key columns) and transform(). We'll name the synthesized column time, which is the default assumption by reshape() if you don't specify the timevar argument. This will allow us to omit specification of that argument. (Note: I've always wished that reshape() would default to such a row-order sequence instead of looking for an input column named time, but it doesn't do that. Oh well.)
Now let's deal with the movieID column. Being a unique identifier in the input table, the only way to include it in the output table would be to also treat it as a data column, so that it would be split by the discriminator into separate columns. I decided to make the assumption that you don't want to do this, so I just removed it from the input table before reshaping, by exploiting the same transform() call. If you want, you can excise the removal piece to see the effect of including movieID across the transformation.
That leaves the remaining columns of title, year, country, directorName, Rating, actorName1, and actorName.2. How should we treat these?
Technically speaking, conceptually, most of them should be data columns. They can't be discriminators (we already covered that), and there's no way most of them (Rating, for example) could be considered key columns. Again, conceptually.
But it would be incorrect to specify any of them as data columns. The reason is that we're not using reshape() in the normal way. We know the movie records have been duplicated for the genre duplication hack used by the input data.frame, and so all the columns I just listed are actually just duplicates within the movie record group. We need these columns to effectively collapse to a single record in the output, and that's exactly what happens with key columns that pass through a reshape() call. Hence, we must identify them all as key columns by passing them to the idvar argument.
Another way of thinking about this is that the key columns are left untouched by reshape(), other than deduplication (if going from long to wide) or duplication (if going from wide to long). It is only the discriminator column that is transferred from column to suffix (if going from long to wide) or vice-versa (if going from wide to long), and data columns that are transferred from single column to multiple columns (if going from long to wide) or vice-versa (if going from wide to long). We need these columns to remain untouched, other than deduplication. Hence we require all columns, other than the target multivalue property column genre and the synthesized time column (and, in this case, the extraneous movieID column) to be specified as key columns.
Note that this is true even if one or more of the key columns could serve as a true key for the movie records. For example, if title was known to be unique within the table by movie, it would still be incorrect to just specify title as the key, and all the other column names I listed as data columns, because they would then be widened in the output according to the synthesized discriminator, even though we know all values within each movie record group are identical.
So, here's the end result:
df <- data.frame(movieID=c(1L,2L,3L),title=c('hello','MI2','MI2'),year=c(1995L,1997L,1997L),country=c('USA','USA','USA'),genre=c('action','action','thriller'),directorName=c('john smith','mad max','mad max'),Rating=c(6L,8L,8L),actorName1=c('tom hanks','tom cruize','tom cruize'),actorName.2=c('charlie sheen','some_body','some_body'),stringsAsFactors=F);
idcns <- names(df)[!names(df)%in%c('movieID','genre')];
reshape(transform(df,movieID=NULL,time=ave(df$movieID,df[idcns],FUN=seq_along)),dir='w',idvar=idcns,sep='');
## title year country directorName Rating actorName1 actorName.2 genre1 genre2
## 1 hello 1995 USA john smith 6 tom hanks charlie sheen action <NA>
## 2 MI2 1997 USA mad max 8 tom cruize some_body action thriller
Note that it is irrelevant exactly which vector is passed as the first argument to ave(), since seq_along() ignores its argument, except for its length. But we do require an integer vector, since ave() tries to coerce its result to the same type as the argument. It is acceptable to use df$movieID because it is an integer vector; alternatively we could use df$year, df$Rating, or synthesize an integer vector with seq_len(nrow(df)) or integer(nrow(df)).
Try this with dplyr and tidyr:
library(tidyr)
library(dplyr)
df %>% mutate(yesno=1) %>% spread(genre, yesno, fill=0)
This creates a column yesno that just gives a value to fill in for each genre. We can then use spread from tidyr. fill=0 means to fill in those not in the genre with 0 instead of NA.
Before:
genre title yesno
1 action lethal weapon 1
2 thriller shining 1
3 action taken 1
4 scifi alien 1
After:
title action scifi thriller
1 alien 0 1 0
2 lethal weapon 1 0 0
3 shining 0 0 1
4 taken 1 0 0

Cumulative sum of a georeferenced variable in R

I have a number of fishing boat tracks, and I'm trying to detect a certain pattern in their movement using R. In doing so I have reached a point where I have discarded all points of the track where the desired pattern is not occurring within a given time window, and I'm left with the remaining georeferenced points. These points have a score value associated, which measures the 'intensity' of the desired pattern.
track_1[1:10,]:
LAT LON SCORE
1 32.34855 -35.49264 80.67
2 31.54764 -35.58691 18.14
3 31.38293 -35.25243 46.70
4 31.21447 -35.25830 22.65
5 30.76365 -35.38881 11.93
6 30.75872 -35.54733 22.97
7 30.60261 -35.95472 35.98
8 30.62818 -36.27024 31.09
9 31.35912 -35.73573 14.97
10 31.15218 -36.38027 37.60
The code bellow provides the same data
data.frame(cbind(
LAT=c(32.34855,31.54764,31.38293,31.21447,30.76365,30.75872,30.60261,30.62818,31.35912,31.15218),
LON=c(-35.49264,-35.58691,-35.25243,-35.25830,-35.38881,-35.54733,-35.95472,-36.27024,-35.73573,-36.38027),
SCORE=c(80.67,18.14,46.70,22.65,11.93,22.97,35.98,31.09,14.97,37.60)))
Because some of these points occur geographically close to each other I need to 'pool' their scores together. Hence, I now need a way to throw this data into some kind of a spatial grid and cumulatively sum the scores of all points that fall in the same cell of the grid. This would allow me to find in what areas a given fishing boat exhibits the pattern I'm after the most (and this is not just about time spent in one place). Ultimately, the preferred output would contain lat and lon for every grid cell (center), and the sum of all scores on each cell. In addition, I would also like to be able to adjust the sizing of the grid cells.
I've looked around and all I can find either does not preserve the georeferenced information, is very inefficient, or performs binning of data. There may already be some answers out there, but it might be the case that I'm not able to recognize them since I'm a bit out of my league on this stuff. Can someone please point me to some direction (package, function, etc.)? Any guidance will be greatly appreciated.
Take your lat/lon coordinates, and multiply them by the inverse of your desired grid cell edge lengths, measured in degrees. The result will be a pair of floating point numbers whose integer part identifies the grid cell in question. Take the floor of these and you have two numbers describing the cell, which you could paste to form a single string. You may add that as a new factor column of your data frame. Then you can perform operations based on that factor, like summarizing values.
Example:
latScale <- 2 # one cell for every 0.5 degrees
lonScale <- 2 # likewise
track_1$cell <- factor(with(track_1,
paste(floor(LAT*latScale), floor(LON*lonScale), sep='.')))
library(plyr)
ddply(track_1, .(cell), summarize,
LAT=mean(LAT), LON=mean(LON), SCORE=sum(SCORE))
If you want to, you can use weighted.mean instead of mean. If you don't like these factors, you can put more effort in making them nice (e.g. by using compass directions instead of signs), or drop them altogether and use a pair of integer columns instead.

Resources