Creating a dictionary for tabular data in Julia

I have tabular data like:
+---+----+----+
| | a | b |
+---+----+----+
| P | 1 | 2 |
| Q | 10 | 20 |
+---+----+----+
and I want to represent this using a Dict.
With the column and row names:
x = ["a", "b"]
y = ["P", "Q"]
and data
data = [ 1 2 ;
10 20 ]
how may I create a dictionary object d, so that d["a", "P"] = 1 and so on? Is there a way like
d = Dict(zip(x,y,data))
?

Your code works with a minor change, using Iterators.product:
d = Dict(zip(Iterators.product(x, y), permutedims(data)))
d["a", "P"]   # 1
Iterators is part of Base, so nothing needs to be installed. Because Julia matrices are column-major (elements are stored in order within columns, and columns are stored in order within the matrix), the data matrix has to be transposed with permutedims so that zip pairs the values up row by row.
This is a literal answer to your question, but I don't recommend doing it this way. If you have tabular data, it's probably better to use a DataFrame. A DataFrame is not two-dimensional in this sense (rows have no names), but that can be fixed by adding an extra key column and selecting on it.

Related

Subtract one column from multiple columns in r

I have this data frame with Date, Mkt, Rf and then 237 variables which have numbered names. I want to subtract the variable Rf from all 237 numbered variables. I have tried
df[,4:240] = df[,4:240] - df[,3]
but it doesn't seem to work. I'm assuming I would have to create a loop for this kind of subtraction, but I don't know how to reference the Rf column for the subtraction inside the loop.
| |Date |Mkt |Rf |10094|10098|10115|...
|:-|:---------|:----|:----|:----|:----|:----|...
|1 |01-01-1997|0.056|0.006|0.002|0.034|0.564|...
|2 |01-02-1997|0.653|0.009|0.009|0.052|0.445|...
You could use this simple for loop:
for (column in 4:240) {
  df[, column] <- df[, column] - df[, 3]
}
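If the numbered columns are all numeric, a vectorized alternative avoids the loop entirely; here is a minimal sketch using sweep(), assuming the same column layout as above:
# subtract column 3 (Rf) from every row of columns 4:240
df[, 4:240] <- sweep(df[, 4:240], MARGIN = 1, STATS = df[, 3], FUN = "-")
If the one-liner you tried keeps failing, check with str(df) that columns 4:240 really are numeric rather than character or factor.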

R Group by exactly inverted columns for large data

I'm trying to group my data by a very specific condition. Consider below data.frame:
from <- c("a", "b", "a", "b")
to <- c("b", "a", "b", "a")
give <- c("x", "y", "y", "x")
take <- c("y", "x", "x", "y")
amount <- c(1, 2, 3, 4)
df <- data.frame(from, to, give, take, amount)
which creates something like:
| from | to | give | take | amount
---------------------------------------
1 | a | b | x | y | 1
2 | b | a | y | x | 2
3 | a | b | y | x | 3
4 | b | a | x | y | 4
To provide some background: consider a user in the 'from' column giving something (in column 'give') to the user in the 'to' column and taking something in return (in column 'take'). As you might see, rows 1 & 2 are the same in that sense, because they describe the same scenario, just from the other perspective; therefore I want them to belong to the same group. (You could also consider them duplicates, which involves the same task, i.e. identifying them as similar.) The same holds for rows 3 & 4. The amount is some value to be summed up per group, to make the example clear.
My desired result for grouping them is as follows.
| user1 | user2 | given_by_user1 | taken_by_user1 | amount
-----------------------------------------------------------
| a | b | x | y | 3 # contains former rows 1&2
| a | b | y | x | 7 # contains former rows 3&4
Note that both from&to and give&take need to be inverted; i.e. taking the values of two columns, sorting them and treating rows as equal on that basis is not what I need, since that would make all four rows in the above example count as equal. That kind of solution was proposed in similar posts, e.g.:
Remove duplicates where values are swapped across 2 columns in R
I've read many similar solutions and found one which actually does the trick:
match two columns with two other columns
However, the proposed solution creates an outer product of two columns, which is not feasible in my case, because my data has millions of rows and at least thousands of unique values within each column.
(Any solution that either groups the rows directly, or gets the indices of rows belonging to the same group would be great!)
Many thanks for any suggestions!
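One linear-time idea, sketched below (untested at scale, and assuming matching rows are exact perspective swaps of each other, as in the example): canonicalize every row by putting the lexicographically smaller user first, swapping give/take whenever from/to get swapped, and then group on the canonical columns.
f <- as.character(df$from); t2 <- as.character(df$to)
g <- as.character(df$give); k <- as.character(df$take)
swap <- f > t2  # TRUE where the row is the 'second' perspective
res <- aggregate(
  df$amount,   # summed amounts land in a column named 'x'
  by = list(user1          = ifelse(swap, t2, f),
            user2          = ifelse(swap, f,  t2),
            given_by_user1 = ifelse(swap, k,  g),
            taken_by_user1 = ifelse(swap, g,  k)),
  FUN = sum)
On the toy data this yields the two desired groups with amounts 3 and 7. The same canonical columns can be fed to data.table or dplyr grouping if aggregate turns out to be too slow at millions of rows.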

Summarizing R corpus with doc ID

I've created a DocumentTermMatrix similar to the one in this post:
Keep document ID with R corpus
Where I've maintained the doc_id so I can join the data back to a larger data set.
My issue is that I can't figure out how to summarize the words and word count and keep the doc_id. I'd like to be able to join this data to an existing data set using only 3 columns (doc_id, word, freq).
Without needing the doc_id, this is straightforward, and I use this code to get my end result:
library(tm)  # DataframeSource, VCorpus and TermDocumentMatrix come from tm
df_source <- DataframeSource(df)
df_corpus <- VCorpus(df_source)
tdm <- TermDocumentMatrix(df_corpus)
tdm_m <- as.matrix(tdm)
word_freqs <- sort(rowSums(tdm_m), decreasing = TRUE)
tdm_sorted <- data.frame(word = names(word_freqs), freq = word_freqs)
I've tried several different approaches to this and just cannot get it to work. This is where I am now (image). I've used this code:
tdm_m <- cbind("doc.id" = rownames(tdm_m), tdm_m)
to move the doc_id into a column in the matrix, but cannot get the numeric columns to sum and keep the doc_id associated.
Any help, greatly appreciated, thanks!
Expected result:
doc.id | word | frequency
1 | Apple | 2
2 | Apple | 1
3 | Banana | 4
3 | Orange | 1
4 | Pear | 3
If I look at your expected output, you don't need the line word_freqs <- sort(rowSums(tdm_m), decreasing = TRUE), because rowSums totals each word across all documents, giving Apple = 3 instead of 2 and 1 in separate documents.
To get the output you want, it is slightly easier to use DocumentTermMatrix instead of TermDocumentMatrix; there is then no need to switch columns around. I'm showing you two examples of how to get the result: one with melt from the reshape2 package and one with the tidy function from the tidytext package.
# example 1
dtm <- DocumentTermMatrix(df_corpus)
dtm_df <- reshape2::melt(as.matrix(dtm))
# remove 0 values and order the data.frame
dtm_df <- dtm_df[dtm_df$value > 0, ]
dtm_df <- dtm_df[order(dtm_df$value, decreasing = TRUE), ]
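If you want the three columns to carry the names from your expected output, you can rename them afterwards (melt names them after the matrix dimnames, typically Docs, Terms and value here):
names(dtm_df) <- c("doc.id", "word", "frequency")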
or use tidytext::tidy to get the data into a tidy format. There is no need to remove the 0 values, because tidy works on the sparse matrix directly rather than first converting it into a dense matrix.
# example 2
dtm_tidy <- tidytext::tidy(dtm)
# order the data.frame or start using dplyr syntax if needed
dtm_tidy <- dtm_tidy[order(dtm_tidy$count, decreasing = TRUE), ]
In my tests tidytext is a lot faster and uses less memory as there is no need to first create a dense matrix.
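To join this back to your larger data set, a plain merge on the document id should work; a sketch where big_df and its doc_id column are hypothetical names (tidy stores the document id in a character column called document):
merged <- merge(dtm_tidy, big_df, by.x = "document", by.y = "doc_id")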

Items Similarity based on their features

I have a dataset with items but with no user ratings.
Items have features (~400 features).
I want to measure the similarity between items based on features (Row similarity).
I convert the item-feature data into a binary matrix like the following:
itemID | feature1 | feature2 | feature3 | feature4 ....
1 | 0 | 1 | 1 | 0
2 | 1 | 0 | 0 | 1
3 | 1 | 1 | 1 | 0
4 | 0 | 0 | 1 | 1
I don't know what to use (and how to use it) to measure the row similarity.
I want, for Item X, to get the top k similar items.
Sample code would be very much appreciated.
What you are looking for is termed a similarity measure. A quick Google/SO search will reveal various methods to get the similarity between two vectors. Here is some sample code in Python for cosine similarity:
from math import sqrt

def square_rooted(x):
    return round(sqrt(sum([a * a for a in x])), 3)

def cosine_similarity(x, y):
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = square_rooted(x) * square_rooted(y)
    return round(numerator / denominator, 3)

print(cosine_similarity([3, 45, 7, 2], [2, 54, 13, 15]))  # 0.972
adapted from: http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/
I noticed that you want the top k similar items for every item. The best way to do that is with a k-nearest-neighbour implementation. What you can do is create a knn graph and return the top k similar items from the graph for a query.
A great library for this is nmslib. Here is some sample code for a knn query from the library, using the HNSW method with cosine similarity (you can use any of the several available methods; HNSW is particularly efficient for your high-dimensional data):
import nmslib
import numpy
# create a random matrix to index
data = numpy.random.randn(10000, 100).astype(numpy.float32)
# initialize a new index, using a HNSW index on Cosine Similarity
index = nmslib.init(method='hnsw', space='cosinesimil')
index.addDataPointBatch(data)
index.createIndex({'post': 2}, print_progress=True)
# query for the nearest neighbours of the first datapoint
ids, distances = index.knnQuery(data[0], k=10)
# get all nearest neighbours for all the datapoints
# using a pool of 4 threads to compute
neighbours = index.knnQueryBatch(data, k=10, num_threads=4)
At the end of the code, the top k neighbours for every data point will be stored in the neighbours variable. You can use those for your purposes.

Plot a tree in R given pairs of leaves and heights where they merge

I have a list of leaves in a tree and the height at which I'd like them to merge, i.e. the height of their most recent common ancestor. All leaves are assumed to be at height 0. A toy example might look like:
as.data.frame(rbind(c("a","b",1),c("c","d",2),c("a","d",4)))
V1 V2 V3
1 a b 1
2 c d 2
3 a d 4
I want to plot a tree representing this data. I know that R can plot trees coming from hclust. How do I get my data into the format returned by hclust or into some other format that is easily plotted?
Edited to add diagram:
The tree for the above dataset looks like this:
__|___
| |
| _|_
_|_ | |
| | | |
a b c d
What you have is a hierarchical clustering that is already fully specified (in your own data format convention), and you would like to use R's plotting facilities. This does not seem to be easy. The only way I can see to achieve it is to create an object like the one returned by hclust. Such an object has components "merge", "height", "order", "labels", "method", "call" and "dist.method", which are all fairly easy to understand. Someone already tried this: https://stat.ethz.ch/pipermail/r-help/2006-February/089170.html but apparently still had issues. What you could also try is to fill a distance matrix with dummy values that are consistent with your clustering and then submit it to hclust. E.g.
a <- matrix(ncol = 4, nrow = 4, c(0,1,4,4, 1,0,4,4, 4,4,0,2, 4,4,2,0))
rownames(a) <- colnames(a) <- c("a", "b", "c", "d")  # label the leaves
b <- hclust(as.dist(a), method = "single")
plot(b, hang = -1)
This could perhaps be useful.
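For completeness, here is a sketch of the first route, building the hclust-like object by hand for the toy example; it assumes (as the components listed above suggest) that merge, height, order and labels are enough for plot():
# merge: one row per merge; negative entries are leaves,
# positive entries refer to clusters formed in earlier rows
h <- list(
  merge  = rbind(c(-1, -2),   # a and b join at height 1
                 c(-3, -4),   # c and d join at height 2
                 c( 1,  2)),  # the two clusters join at height 4
  height = c(1, 2, 4),
  order  = 1:4,               # left-to-right leaf order
  labels = c("a", "b", "c", "d"))
class(h) <- "hclust"
plot(h, hang = -1)
This draws the same tree as the dummy-distance approach, without having to invent distances.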
