Item similarity based on their features

I have a dataset with items but no user ratings.
Items have features (~400 features).
I want to measure the similarity between items based on their features (row similarity).
I converted the item-feature data into a binary matrix like the following:
itemID | feature1 | feature2 | feature3 | feature4 ....
1 | 0 | 1 | 1 | 0
2 | 1 | 0 | 0 | 1
3 | 1 | 1 | 1 | 0
4 | 0 | 0 | 1 | 1
I don't know what to use (and how to use it) to measure the row similarity.
I want, for item X, to get the top k similar items.
Sample code would be very much appreciated.

What you are looking for is termed a similarity measure. A quick Google/SO search will reveal various methods for computing the similarity between two vectors. Here is some sample code in Python for cosine similarity:
from math import sqrt

def square_rooted(x):
    # Euclidean norm of the vector, rounded to 3 decimals
    return round(sqrt(sum(a * a for a in x)), 3)

def cosine_similarity(x, y):
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = square_rooted(x) * square_rooted(y)
    return round(numerator / float(denominator), 3)

print(cosine_similarity([3, 45, 7, 2], [2, 54, 13, 15]))
adapted to Python 3 from: http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/
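For instance, applied to the binary rows of items 1 and 3 from the table in the question (a quick illustrative call; the function simply treats each row as a plain vector):
# rows for item 1 and item 3 from the binary item-feature matrix above
item1 = [0, 1, 1, 0]
item3 = [1, 1, 1, 0]
print(cosine_similarity(item1, item3))  # ≈ 0.817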
I noticed that you want the top k similar items for every item. The best way to do that is with a k-nearest-neighbour (kNN) implementation: build a kNN index over the items and return the top k similar items from it for each query.
A great library for this is nmslib. Here is some sample code for a kNN query using the HNSW method with cosine similarity (you can use any of the several available methods; HNSW is particularly efficient for high-dimensional data like yours):
import nmslib
import numpy
# create a random matrix to index
data = numpy.random.randn(10000, 100).astype(numpy.float32)
# initialize a new index, using a HNSW index on Cosine Similarity
index = nmslib.init(method='hnsw', space='cosinesimil')
index.addDataPointBatch(data)
index.createIndex({'post': 2}, print_progress=True)
# query for the nearest neighbours of the first datapoint
ids, distances = index.knnQuery(data[0], k=10)
# get all nearest neighbours for all the data points
# using a pool of 4 threads to compute
neighbours = index.knnQueryBatch(data, k=10, num_threads=4)
At the end of the code, the top k neighbours for every data point are stored in the neighbours variable, which you can use for your purposes.
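If you prefer to avoid an extra dependency, a brute-force top-k lookup over the binary item-feature matrix is also feasible at ~400 features. Below is a minimal NumPy sketch (the items matrix and the value of k are assumptions taken from the toy table in the question):
import numpy as np

# binary item-feature matrix from the question: one row per item
items = np.array([[0, 1, 1, 0],
                  [1, 0, 0, 1],
                  [1, 1, 1, 0],
                  [0, 0, 1, 1]], dtype=np.float32)

def top_k_similar(matrix, query_idx, k=2):
    # L2-normalise the rows so a dot product equals cosine similarity
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / np.clip(norms, 1e-12, None)
    sims = unit @ unit[query_idx]
    sims[query_idx] = -np.inf          # do not return the query item itself
    top = np.argsort(sims)[::-1][:k]   # indices of the k largest similarities
    return top, sims[top]

ids, scores = top_k_similar(items, query_idx=0, k=2)
print(ids, scores)  # the two items most similar to item 1, with their scores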

Related

How to make a dataset with rows representing the clusters while columns are two variables with different values within clusters

I'm quite confused.
I have 50 clusters, each with a different size, and I have two variables, "Year" and "Income level".
The dataset I have right now has 10,000 rows, where each row represents a single individual.
What I want to do is form a new dataset from this data frame with one row per cluster (50 rows) and the two variables plus the cluster variable as columns. The problem is that these two variables (which we call the study-level covariates) do not have a unique value per cluster.
How would I put them in one cell for each cluster then?
X1 <- c(1,1,1,2,2,2,2,2,3,3,4,4,4,4,4,4)  # Clusters
X2 <- c(1,2,3,1,1,1,1,1,1,2,3,3,1,1,2,2)  # Covariate 1
X3 <- c(1991,2001,2002,1998,2014,2015,1990,
        2002,2004,2006,2006,2006,2005,2003,2003,2000)  # Covariate 2
data <- data.frame(X1, X2, X3)
My desired output should be something like this:
|Clusters|Covariate1|Covariate2|
|--------|---------|----------|
|1 | ? |? |
|2 | ? |? |
|3 | ? |? |
|4 | ? |? |
Meaning that instead of a data frame with 16 rows, I want a data frame with 4 rows.
Here is how to aggregate the data using the average of the covariate per cluster:
df <- data.frame(X1 = c(1,1,1,2,2,2,2,2,3,3,4,4,4,4,4,4),
                 X2 = c(1,2,3,1,1,1,1,1,1,2,3,3,1,1,2,2),
                 X3 = c(1991,2001,2002,1998,2014,2015,1990,2002,2004,2006,2006,2006,2005,2003,2003,2000))
library(tidyverse)
df %>% group_by(X1) %>% summarise(mean_cov1 = mean(X2))
# A tibble: 4 x 2
X1 mean_cov1
* <dbl> <dbl>
1 1 2
2 2 1
3 3 1.5
4 4 2
For the case you are working on, you have to decide which aggregation is most relevant. You can probably also compute several at once, as in the sketch below.
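For instance, a minimal sketch that aggregates both covariates in one call (using the mean for both X2 and X3 is just an assumption here; substitute whatever aggregation fits your study-level covariates):
library(dplyr)  # already attached if you loaded the tidyverse above

df %>%
  group_by(X1) %>%
  summarise(mean_cov1 = mean(X2),   # average of covariate 1 per cluster
            mean_cov2 = mean(X3))   # average of covariate 2 per cluster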

Loop for fetching and/or manipulating phylogenies in R over multiple values of another variable

I have a dataset in which there are multiple species listed for each value of another column (call it the index column) in R. I need to extract or prune a phylogeny (from Phylomatic) for each index value, and then calculate the phylogenetic distance for each value of the index column. There are many index values in my dataset, so it is impractical to do this manually.
Example input data (2 columns, one for the index and one for the species):
Index Species
A ; Sp. 1
A ; Sp. 2
A ; Sp. 3
A ; Sp. 4
B ; Sp. 5
B ; Sp. 1
C ; Sp. 6
D ; Sp. 7
D ; Sp. 2
D ; Sp. 8
D ; Sp. 9
What I want for output (values made up; each index appears once and the phylogenetic distance of all its associated species is calculated):
Index ; phylogenetic distance
A ; 7
B ; 3
C ; 1
D ; 5
I have downloaded a global tree (for all species in the dataset) from Phylomatic using the brranching package. I could either set up a loop or function to prune the phylogeny to the species within each index category and calculate PD, or I could set up a function to extract a new tree from Phylomatic for each value of the index, and then calculate PD. I would really appreciate it if anyone has done something similar or has ideas about how to implement this in R. Much appreciated in advance!
library(brranching)
phylomatic(data$Species)
I worked it out. First I converted the Index column to numeric, and then I ran the following loop. The drop.tip() function is from the ape package.
library(ape)  # for drop.tip()

IDs <- unique(d.test$index.num)
dd <- data.frame(matrix(nrow = length(IDs), ncol = 2))
for (i in 1:length(IDs)) {
  # subset the rows belonging to the i-th index value
  temp <- subset(d.test, d.test$index.num == IDs[i])
  species <- tolower(as.vector(temp$Species))
  # prune the global tree down to the species of this index
  pruned.tree <- drop.tip(tree, setdiff(tree$tip.label, species))
  # phylogenetic distance = sum of the remaining branch lengths
  pd <- sum(pruned.tree$edge.length)
  dd[i, 1] <- IDs[i]
  dd[i, 2] <- pd
}
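A more compact version of the same idea, as a hedged sketch: it assumes the same global tree object and a d.test data frame with Species and index.num columns as above.
library(ape)

# split the species by index value, prune the tree for each group,
# and sum the remaining branch lengths to get the phylogenetic distance
pd_by_index <- sapply(split(tolower(d.test$Species), d.test$index.num), function(sp) {
  pruned <- drop.tip(tree, setdiff(tree$tip.label, sp))
  sum(pruned$edge.length)
})
dd <- data.frame(index = names(pd_by_index), pd = unname(pd_by_index))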

Calculate autocorrelation in panel data?

I have a large panel data set in the form:
ID | Time| X-VALUE
---| ----|-----
1 | 1 |x
1 | 2 |x
1 | 3 |x
2 | 1 |x
2 | 2 |x
2 | 3 |x
3 | 1 |x
3 | 2 |x
3 | 3 |x
. | . |.
. | . |.
More specifically, I have a dataset of individual stock returns for a large set of stocks over a period of 30 years. I would like to calculate the "stock-specific" lag-1 (first-order) autocorrelation in returns for each stock individually.
I suspect that by applying the code acf(pdata$return, lag.max = 1, plot = FALSE) I'll only get some kind of "average" autocorrelation value. Is that correct?
Thank you
You can split the data frame and run acf() on each subset. There are tons of ways to do this in R. For example:
by(pdata$return, pdata$ID, function(i) { acf(i, lag.max = 1, plot = FALSE) })
You may need to change variable and data frame names to match your own data.
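If you only need the lag-1 coefficient as a single number per stock, here is a small dplyr sketch along the same lines (the names pdata, ID, Time and return are taken from the question; sorting by Time within ID is assumed to be necessary):
library(dplyr)

lag1 <- pdata %>%
  arrange(ID, Time) %>%   # make sure returns are in time order within each stock
  group_by(ID) %>%
  summarise(ac1 = acf(return, lag.max = 1, plot = FALSE)$acf[2])  # $acf[1] is lag 0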
This is not exactly what was requested, but a real autocorrelation function for panel data in R is collapse::psacf, it works by first standardizing data in each group, and then computing the autocovariance on the group-standardized panel-series using proper panel-lagging. Implementation is in C++ and very fast.

Creating a dictionary for tabular data in Julia

I have a tabular data like:
+---+----+----+
| | a | b |
+---+----+----+
| P | 1 | 2 |
| Q | 10 | 20 |
+---+----+----+
and I want to represent this using a Dict.
With the column and row names:
x = ["a", "b"]
y = ["P", "Q"]
and data
data = [ 1 2 ;
10 20 ]
how may I create a dictionary object d, so that d["a", "P"] = 1 and so on? Is there a way like
d = Dict(zip(x,y,data))
?
Your code works with a minor change, using Iterators.product to pair up the column and row names:
d = Dict(zip(Iterators.product(x, y), permutedims(data)))
Iterators.product is part of Base (Base.Iterators), so no extra package is needed. Because Julia matrices are column-major (elements are stored in order within columns, and columns are stored in order within the matrix), the data matrix has to be transposed with permutedims so that the values line up with the (column, row) pairs produced by the product.
This is a literal answer to your question. I don't recommend doing that. If you have tabular data, it's probably better to use a DataFrame. These are not two dimensional (rows have no names) but that can be fixed by adding an additional column, and using select.
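For illustration, a minimal sketch of that DataFrame alternative (the column name row is an assumption; DataFrames.jl must be installed):
using DataFrames

# rows get their "names" as an ordinary column
df = DataFrame(row = ["P", "Q"], a = [1, 10], b = [2, 20])

# look up the value for row "P", column a
df[df.row .== "P", :a]   # 1-element Vector{Int64}: [1]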

Plot a tree in R given pairs of leaves and heights where they merge

I have a list of leaves in a tree and the height at which I'd like them to merge, i.e. the height of their most recent common ancestor. All leaves are assumed to be at height 0. A toy example might look like:
as.data.frame(rbind(c("a","b",1),c("c","d",2),c("a","d",4)))
V1 V2 V3
1 a b 1
2 c d 2
3 a d 4
I want to plot a tree representing this data. I know that R can plot trees coming from hclust. How do I get my data into the format returned by hclust or into some other format that is easily plotted?
Edited to add diagram:
The tree for the above dataset looks like this:
__|___
| |
| _|_
_|_ | |
| | | |
a b c d
What you have is a hierarchical clustering that is already specified (in your own data format convention), and you would like to use R's plotting facilities. This does not seem to be easy. The only way I can see to achieve it is to create an object like the one returned by hclust. Such an object has components "merge", "height", "order", "labels", "method", "call" and "dist.method", which are all fairly easy to understand. Someone already tried this: https://stat.ethz.ch/pipermail/r-help/2006-February/089170.html but apparently still had issues.
What you could also try is to fill in a distance matrix with dummy values that are consistent with your clustering and then submit it to hclust. E.g.
a <- matrix(ncol = 4, nrow = 4, c(0,1,4,4, 1,0,4,4, 4,4,0,2, 4,4,2,0))
rownames(a) <- colnames(a) <- c("a", "b", "c", "d")  # leaf labels for the plot
b <- hclust(as.dist(a), method = "single")
plot(b, hang = -1)
This could perhaps be useful.
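As a complement, here is a minimal sketch of building the hclust-style object by hand for the toy example above (the merge/height/order/labels components follow the format documented in ?hclust; this is not a general converter, just this one tree):
# each row of merge joins two items: negative numbers are leaves,
# positive numbers refer to clusters formed in earlier rows
hc <- list(
  merge  = rbind(c(-1, -2),   # a + b at height 1
                 c(-3, -4),   # c + d at height 2
                 c( 1,  2)),  # the two clusters at height 4
  height = c(1, 2, 4),
  order  = c(1, 2, 3, 4),     # left-to-right leaf order for plotting
  labels = c("a", "b", "c", "d")
)
class(hc) <- "hclust"
plot(hc, hang = -1)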
