Finding matching rows - multidimensional-array

Given two matrices A and B with the same number of columns I would like to know if there are any rows which are the same in A and B. In Dyalog APL I can use the function split like this:
(↓A) ∊ ↓B
Is there a way to calculate the same result without the split function?

What you've found is a design flaw in Membership ∊: it treats the right argument as a set of scalars rather than as a collection of major cells. This precluded extending it according to leading axis theory. However, Index Of ⍳ was extended in this way, so we can use the fact that it returns an index one beyond the end of the lookup array when a major cell isn't found (assuming the default ⎕IO←1):
⎕← A ← 4 2⍴2 7 1 8 2 8 1 8
2 7
1 8
2 8
1 8
⎕← B ← 5 2⍴1 6 1 8 0 3 3 9 8 9
1 6
1 8
0 3
3 9
8 9
(↓A) ∊ ↓B
0 1 0 1
Membership ← {(≢⍵) ≥ ⍵⍳⍺}
A Membership B
0 1 0 1
This can also be written tacitly as Membership ← ⊢∘≢ ≥ ⍳⍨.
Either way, note that avoiding the detour of nested arrays leads to significant speed gains:
A←?1000 4⍴10
B←?1000 4⍴10
]runtime -compare "(↓A) ∊ ↓B" "A Membership B"
(↓A) ∊ ↓B → 1.6E¯4 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
A Membership B → 8.9E¯6 | -95% ⎕⎕
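For readers who want to check the expected result outside APL, the same row-wise membership test can be sketched in Python. This is only an illustration and not the APL technique itself: the function name `row_membership` is hypothetical, and rows are converted to tuples so they can be hashed.

```python
def row_membership(a, b):
    """For each row of a, report whether an identical row occurs in b."""
    seen = {tuple(row) for row in b}  # hash every row of b once
    return [tuple(row) in seen for row in a]

# The matrices A and B from the example above
A = [[2, 7], [1, 8], [2, 8], [1, 8]]
B = [[1, 6], [1, 8], [0, 3], [3, 9], [8, 9]]

print([int(found) for found in row_membership(A, B)])  # [0, 1, 0, 1]
```

Building the set once keeps the lookup for each row of `a` close to constant time, which is the same reason the APL version benefits from a single whole-array lookup.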

Something like A⍳B would show not only membership but location of equal rows.

Related

How to calculate similarity of numbers (in list)

I am looking for a method for calculating similarity score for list of numbers. Ideally the method should give result in fixed range. For example from 0 to 1 where 0 is not similar at all and 1 means all numbers are identical.
For clarity let me provide a few examples:
0 1 2 3 4 5 6 7 8 9 10 => the similarity should be 0 or close to zero as all numbers are different
1 1 1 1 1 1 1 => 1
10 9 11 10.5 => close to 1
1 1 1 1 1 1 1 1 1 1 100 => score should be still pretty high as only the last value is different
I have tried to calculate the similarity based on normalization and average, but that gives me really bad results when there is one 'bad number'.
Thank you.
Similarity tests are always incredibly subjective, and the right one to use depends heavily on what you're trying to use it for. We already have three typical measures of central tendency (mean, median, mode). It's hard to say what test will work for you because there are different ways of measuring that will do what you're asking, but have wildly different measures for other lists (like [1]*7 + [100] * 7). Here's one solution:
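import statistics as stats

def tester(ell):
    # Share of repeated values: 0 when every value is distinct
    mode_measure = 1 - len(set(ell)) / len(ell)
    # One minus the coefficient of variation (needs at least two values and a nonzero mean)
    avg_measure = 1 - stats.stdev(ell) / stats.mean(ell)
    return max(avg_measure, mode_measure)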

Stacking two data frame columns into a single separate data frame column in R

I will present my question in two ways. First, requesting a solution for a task; and second, as a description of my overall objective (in case I am overthinking this and there is an easier solution).
1) Task Solution
Data context: each row contains four price variables (columns) representing (a) the price at which the respondent feels the product is too cheap; (b) the price that is perceived as a bargain; (c) the price that is perceived as expensive; (d) the price that is too expensive to purchase.
## mock data set
a<-c(1,5,3,4,5)
b<-c(6,6,5,6,8)
c<-c(7,8,8,10,9)
d<-c(8,10,9,11,12)
df<-as.data.frame(cbind(a,b,c,d))
## result
# a b c d
#1 1 6 7 8
#2 5 6 8 10
#3 3 5 8 9
#4 4 6 10 11
#5 5 8 9 12
Task Objective: The goal is to create a single column in a new data frame that lists all of the unique values contained in a, b, c, and d.
price
#1 1
#2 3
#3 4
#4 5
#5 6
...
#12 12
My initial thought was to use rbind() and unique()...
price<-rbind(df$a,df$b,df$c,df$d)
price<-unique(price)
...expecting that a, b, c and d would stack vertically.
[Pseudo illustration]
a[1]
a[2]
a[...]
a[n]
b[1]
b[2]
b[...]
b[n]
etc.
Instead, the "columns" are treated as rows and stacked horizontally.
V1 V2 V3 V4 V5
1 1 5 3 4 5
2 6 6 5 6 8
3 7 8 8 10 9
4 8 10 9 11 12
How may I stack a, b, c and d such that price consists of only one column ("V1") that contains all twenty responses? (The unique part I can handle separately afterwards).
2) Overall Objective: The Bigger Picture
Ultimately, I want to create a cumulative share of population for each price (too cheap, bargain, expensive, too expensive) at each price point (defined by the unique values described above). For example, what percentage of respondents felt $1 was too cheap, what percentage felt $3 or less was too cheap, etc.
The cumulative shares for bargain and expensive are later inverted to become not.bargain and not.expensive and the four vectors reside in a data frame like this:
buckets too.cheap not.bargain not.expensive too.expensive
1 0.01 to 0.50 0.000000000 1 1 0
2 0.51 to 1.00 0.000000000 1 1 0
3 1.01 to 1.50 0.000000000 1 1 0
4 1.51 to 2.00 0.000000000 1 1 0
5 2.01 to 2.50 0.001041667 1 1 0
6 2.51 to 3.00 0.001041667 1 1 0
...
from which I may plot something that looks like this:
Above, I accomplished my plotting objective using defined price buckets ($0.50 ranges) and the hist() function.
However, the intersections of these lines have meanings and I want to calculate the exact price at which any of the lines cross. This is difficult when the x-axis is defined by price range buckets instead of a specific value; hence the desire to switch to exact values and the need to generate the unique price variable.
[Postscript: This analysis is based on Peter Van Westendorp's Price Sensitivity Meter (https://en.wikipedia.org/wiki/Van_Westendorp%27s_Price_Sensitivity_Meter) which has known practical limitations but is relevant in the context of my research which will explore consumer perceptions of value under different treatments rather than defining an actual real-world price. I mention this for two reasons 1) to provide greater insight into my objective in case another approach comes to mind, and 2) to keep the thread focused on the mechanics rather than whether or not the Price Sensitivity Meter should be used.]
We can unlist the data.frame to a vector and get the sorted unique elements
sort(unique(unlist(df)))
When we do an rbind, it creates a matrix, and calling unique on a matrix dispatches to unique.matrix
methods('unique')
#[1] unique.array unique.bibentry* unique.data.frame unique.data.table* unique.default unique.IDate* unique.ITime*
#[8] unique.matrix unique.numeric_version unique.POSIXlt unique.warnings
which loops through the rows, as the default MARGIN is 1, and keeps the unique rows rather than the unique elements. Instead, if we start from 'price', either as.vector(price) or c(price) converts the matrix into a vector first
sort(unique(c(price)))
#[1] 1 3 4 5 6 7 8 9 10 11 12
Alternatively, we can call unique.default directly, which bypasses the matrix method
sort(unique.default(price))
#[1] 1 3 4 5 6 7 8 9 10 11 12
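The bigger-picture goal from the question (a cumulative share of respondents at each unique price point) can be sketched in Python as follows. This is a rough illustration under assumptions, not a translation of the R answer: the helper name `cumulative_share` is hypothetical, and only the "too cheap" column `a` is shown.

```python
# Mock data from the question; each list is one column
columns = {
    "a": [1, 5, 3, 4, 5],
    "b": [6, 6, 5, 6, 8],
    "c": [7, 8, 8, 10, 9],
    "d": [8, 10, 9, 11, 12],
}

# Stack all columns into one long list (20 responses), then keep sorted unique values
stacked = [value for column in columns.values() for value in column]
price = sorted(set(stacked))

def cumulative_share(responses, points):
    """Share of responses at or below each price point."""
    return [sum(r <= p for r in responses) / len(responses) for p in points]

too_cheap = cumulative_share(columns["a"], price)
print(price)          # [1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
print(too_cheap[:4])  # [0.2, 0.4, 0.6, 1.0]
```

Using exact price points on the x-axis instead of buckets is what makes it possible to look for the prices at which two cumulative curves cross.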

Sum variables conditionally with loop in r

I realize this is a topic that's covered somewhat well but I couldn't find anything that approaches this specific concern:
I have a df with 800 columns, 10 iterations of 80 columns (each column represents an item) - Each column is named something like: 1_BL_PRE.1 1_FU_PRE.1 1_BL_PRE.1 1_BL_POST.1
Where the first '1' indicates the item number and the second '1' indicates the iteration number.
What I'm trying to figure out is how to get the sums of specific groups of items from all 10 iterations.
As a short example let's say I want to take the 1st and 3rd item of BL_PRE and get the sum of all 10 iterations for those 2 items - how would I do this?
subject 1_BL_PRE.1 2_BL_PRE.1 3_BL_PRE.1 1_BL_PRE.2 2_BL_PRE.2
1 40002 3 4 3 1 2
2 40004 1 2 3 4 4
3 40006 4 3 3 3 1
4 40008 2 3 1 2 3
5 40009 3 4 1 2 3
Expected output (where A represents the sum of 1_BL_PRE.1, 3_BL_PRE.1, 1_BL_PRE.2 and so on):
subject BL_PRE_A
1 40002 12
2 40004 14
3 40006 15
4 40008 20
5 40009 12
My hunch is that the solution is related to a for-loop or lapply (and I'm not familiar at all with either). I'm trying to work with apply(finaldata,1,function(x) {sum(x ...)}) but I haven't been able to figure out the conditional statement for the function of sum.
If there's an implementation with plyr I'd be really curious to see what that looks like. (and if there's a thread that answers this, apologies and just re-direct!)
**Edited to include small example + code I'm trying to get to work
Thanks!
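The selection logic described above (pick columns by item number and measure name, then sum across all iterations) can be sketched in Python. This is only an illustration of the idea and not an R solution: the `rows` structure, the regular expression, and the function name `sum_items` are assumptions based on the column naming shown in the question.

```python
import re

# Two sample rows from the question, keyed by column name (illustrative structure)
rows = [
    {"subject": 40002, "1_BL_PRE.1": 3, "2_BL_PRE.1": 4, "3_BL_PRE.1": 3,
     "1_BL_PRE.2": 1, "2_BL_PRE.2": 2},
    {"subject": 40004, "1_BL_PRE.1": 1, "2_BL_PRE.1": 2, "3_BL_PRE.1": 3,
     "1_BL_PRE.2": 4, "2_BL_PRE.2": 4},
]

# Assumed naming scheme: <item>_<measure>.<iteration>
COLUMN = re.compile(r"^(\d+)_(.+)\.(\d+)$")

def sum_items(row, items, measure):
    """Sum the values of the chosen items for one measure across all iterations."""
    total = 0
    for name, value in row.items():
        match = COLUMN.match(name)
        if match and int(match.group(1)) in items and match.group(2) == measure:
            total += value
    return total

for row in rows:
    print(row["subject"], sum_items(row, {1, 3}, "BL_PRE"))
```

The totals printed here cover only the five columns shown in the sample, so they will not match the expected output in the question, which sums over all ten iterations.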

Short(er) notation of selecting a part of a data.frame or other objects in R

I always get angry at my R code when I have to process dataframes, i.e. filtering out certain rows. The code gets very illegible as I tend to choose meaningful, but long, names for my objects. An example:
all.mutations.extra.large.name <- read.delim(filename)
head(all.mutations.extra.large.name)
id gene pos aa consequence V
ENSG00000105732 ZN574_HUMAN 81 x/N missense_variant 3
ENSG00000125879 OTOR_HUMAN 7 V/3 missense_variant 2
ENSG00000129194 SOX15_HUMAN 20 N/T missense_variant 3
ENSG00000099204 ABLM1_HUMAN 33 H/R missense_variant 2
ENSG00000103335 PIEZ1_HUMAN 11 Q/R missense_variant 3
ENSG00000171533 MAP6_HUMAN 39 A/G missense_variant 3
all.mutations.extra.large.name <- all.mutations.extra.large.name[which(all.mutations.extra.large.name$gene == "ZN574_HUMAN"), ]
So in order to kick out all the other lines in which I am not interested, I need to reference the object all.mutations.extra.large.name three times. And repeating this kind of step for different columns makes the code really difficult to understand.
Therefore my question: Is there a way to filter out rows by a criterion without referencing the object three times? Something like this would be beautiful: myobj[,gene=="ZN574_HUMAN"]
You can use subset for that:
subset(all.mutations.extra.large.name, gene == "ZN574_HUMAN")
Several options:
all.mutations.extra.large.name <- data.frame(a=1:5, b=2:6)
within(all.mutations.extra.large.name, a[a < 3] <- 0)
a b
1 0 2
2 0 3
3 3 4
4 4 5
5 5 6
transform(all.mutations.extra.large.name, b = b^2)
a b
1 1 4
2 2 9
3 3 16
4 4 25
5 5 36
Also check ?attach if you would like to avoid repetitive typing like all.mutations.extra.large.name$foo.

Combination with a minimum number of elements in a fixed length subset

I have been searching for a long time but have been unable to find a solution for this.
My question is "Suppose you have n street lights (which cannot be moved), and if you take any m of them, at least k should be working. In how many ways can this be done?"
This seems to be a combination problem, but the catch here is that the "m" lights must be sequential.
Eg:
1 2 3 4 5 6 7 (Street lamps)
Let m=3
Then the valid sets are,
1 2 3
2 3 4
3 4 5
4 5 6
5 6 7
Whereas 1 2 4 and so on are invalid selections.
So every set must have at least 2 working lights. I have figured out how to find the minimum number of lamps required to satisfy the condition, but how can I find the number of ways in which it can be done?
There should certainly be some formula to do this, but I am unable to find it.. :(
Should always be (n-m)+1.
E.g., 10 lights (n = 10), 5 in set (m = 5):
1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
4 5 6 7 8
5 6 7 8 9
6 7 8 9 10
Gives (10-5)+1 = 6 sets.
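As a quick check of this count, the sets of consecutive lamps can be enumerated directly. A minimal sketch, with the hypothetical helper name `consecutive_sets`:

```python
def consecutive_sets(n, m):
    """All runs of m consecutive lamp numbers taken from 1..n."""
    return [list(range(start, start + m)) for start in range(1, n - m + 2)]

sets = consecutive_sets(10, 5)
for lamps in sets:
    print(*lamps)
print(len(sets))  # 6, matching (10 - 5) + 1
```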
The answer should always be m choose k for all values of n where n > m > k. I'll try to explain why.
Given, for example, the values n = 10, m = 4, k = 2, you can start by generating all possible permutations of 1s and 0s for sets of 4 lights, with exactly 2 lights on:
1100
0110
0011
1001
0101
1010
As you can see, there are 6 permutations, because 4 choose 2 = 6. You can choose any of these 6 permutations to be the first 4 lights. You then continue the sequence until you get n (in this case 10) lights, ensuring that you only ever add a zero if you must in order to keep the condition true of having 2 lights on for every 4. What you will find is that the sequence simply repeats; for example:
1100 -> next can be 1, so 11001
Next can still be 1 and meet the condition, so 110011.
The next must now be a zero, giving 1100110, and then again -> 11001100. This simply continues until the length is n : 1100110011. Given that the starting four can only be one of the above set, you will only get 6 different permutations.
Now, since the sequence will repeat exactly the same for any value of n, it means that the answer will always be m choose k.
For your example in your comment of 6,3,2, I can only find the following permutations:
011011
110110
101101
Which works, because 3 choose 2 = 3. If you can find more, then I guess I'm wrong and I've probably misunderstood again :D but from my understanding of this problem, I'm certain that the answer will always be m choose k.
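The claim can be checked by brute force for small values. The sketch below assumes the reading used in this answer, namely that every run of m consecutive lights has exactly k lights on. The function name `count_arrangements` and its `exact` flag are hypothetical.

```python
from itertools import product
from math import comb

def count_arrangements(n, m, k, exact=True):
    """Count on/off patterns of n lights in which every run of m consecutive
    lights has exactly k on (exact=True) or at least k on (exact=False)."""
    count = 0
    for lights in product((0, 1), repeat=n):
        windows = (sum(lights[i:i + m]) for i in range(n - m + 1))
        if all((w == k) if exact else (w >= k) for w in windows):
            count += 1
    return count

print(count_arrangements(6, 3, 2), comb(3, 2))   # 3 3
print(count_arrangements(10, 4, 2), comb(4, 2))  # 6 6
```

Passing `exact=False` counts the "at least k" reading from the question instead, which can give a larger number because patterns such as all lights on also qualify.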
