I'm calculating the cosine similarity between feature vectors and wondering if someone might have a neat solution to the problem below involving categorical features.
Currently I have (example):
# define the similarity function
cosineSim <- function(x){
as.matrix(x%*%t(x)/(sqrt(rowSums(x^2) %*% t(rowSums(x^2)))))
}
# define some feature vectors
A <- c(1,1,0,0.5)
B <- c(1,1,0,0.5)
C <- c(1,1,0,1.2)
D <- c(1,0,0,0.7)
dataTest <- data.frame(A,B,C,D)
dataTest <- data.frame(t(dataTest))
dataMatrix <- as.matrix(dataTest)
# get similarity matrix
cosineSim(dataMatrix)
which works fine.
But say I want to add a categorical variable such as city, generating a feature that is 1 when two cities are equal and 0 otherwise.
In this case, example feature vectors would be:
A <- c(1,1,0,0.5,"Dublin")
B <- c(1,1,0,0.5,"London")
C <- c(1,1,0,1.2,"Dublin")
D <- c(1,0,0,0.7,"New York")
I'm wondering whether there is a neat way to generate the pairwise equality of the last feature on the fly within the function, in a way that keeps the implementation vectorised.
I have tried pre-processing to make binary flags for each category, so that the above example would become something like:
A <- c(1,1,0,0.5,1,0,0)
B <- c(1,1,0,0.5,0,1,0)
C <- c(1,1,0,1.2,1,0,0)
D <- c(1,0,0,0.7,0,0,1)
This works, but the problem is that I have to pre-process each variable, and in some cases I can see the number of categories becoming quite large. This seems quite expensive/inefficient when all I want is to generate a feature that returns 1 for equality and 0 otherwise (granted, there is complexity here in that it is essentially a feature dependent on two records and shared between them).
One solution I can see is to just write a loop to build each pair of feature vectors (where I can build a feature such as [is_same_city]=1/0, set to 1 for each vector when we have equality and 0 otherwise) and then get the distance, but this approach will kill me when I try to scale.
I am hoping it is just that my R skills are not well enough developed, and that there is a neat solution that ticks most of the boxes...
Any suggestions at all are very welcome, Thanks
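One possible way to avoid building the binary flags, sketched here under the assumption that the categorical column should behave exactly as its one-hot encoding would: a one-hot block adds 1 to the dot product of two rows when their categories match (0 otherwise) and adds exactly 1 to every row's squared norm. The pairwise equality matrix can be produced directly with outer(). The names cosineSimCat, num and city are made up for this illustration.

```r
# Cosine similarity with one categorical column, without one-hot encoding.
# x:   numeric matrix, one row per record
# cat: vector of category labels, one per row of x
cosineSimCat <- function(x, cat) {
  eq    <- outer(cat, cat, "==") * 1   # 1 where two rows share a category, else 0
  dots  <- x %*% t(x) + eq             # one-hot block contributes eq to the dot products
  norms <- sqrt(rowSums(x^2) + 1)      # and exactly 1 to each squared norm
  dots / (norms %o% norms)
}

num <- rbind(A = c(1, 1, 0, 0.5),
             B = c(1, 1, 0, 0.5),
             C = c(1, 1, 0, 1.2),
             D = c(1, 0, 0, 0.7))
city <- c("Dublin", "London", "Dublin", "New York")

sim <- cosineSimCat(num, city)
sim
```

With several categorical columns, the same idea would add one equality matrix per column to the dot products and add the number of categorical columns (rather than 1) to the squared norms. Note that outer() still builds a full n-by-n matrix, but so does the similarity matrix itself.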
I've been looking for library functions to do this, but I'm surprised that I cannot find one.
There are quite a few stats functions in R that do the statistical test and then output a table that includes letters denoting significance groups, for example LSD.test. (An example of how LSDs might be calculated and used to make multicomparison letters, used in a graph.)
There are others. All of the examples I could find tend to work from a model object, and then do their job. However, I already have the LSD values and the means -- and want to work directly with them. I've been looking for a common function that all of these multicomparison methods use to do this final step, but can't find one.
So, this is what I want to do... given the least significant difference between values (LSD) and the mean values themselves:
lsd <- 1.0
vals <- c(2,3,3.5,4,4.2,6.0)
I want to produce output something like:
2 a
3 b
3.5 bc
4 c
4.2 c
6.0 d
where values followed by the same letter are not significantly different, based on the least significant difference value.
Ideally, it would be best if it could handle the list of values un-ordered...
vals <- c(6.0, 2, 3.5, 4.0, 4.2, 3)
producing the output:
6.0 d
2 a
3.5 bc
4.0 c
4.2 c
3 b
I've been thinking that most of these LSD.test and multicompare functions
are probably using a base function to put together the letter list -- but I have not been able to find it.
Working through the problem, I think this does the trick, but it's pretty ugly...
lsd.letters <- function(vals, lsd) {
  # record the sort order of the values
  indx <- order(vals)
  # sorted values
  srt <- vals[indx]
  # pool of letters to assign
  lts <- letters
  # one (initially empty) letter string per value
  siglets <- rep("", length(vals))
  # single pass through the sorted means, using "a" for the lowest value
  itlet <- 1
  for (i in seq_along(vals)) {
    crnt <- srt[i]
    clet <- lts[itlet]
    # which of the remaining values lie within the LSD of the current value?
    ix <- which(srt[i:length(srt)] < (crnt + lsd)) + i - 1
    for (ix2 in ix) {
      newletter <- 0
      if (length(intersect(unlist(strsplit(siglets[i], "")), unlist(strsplit(siglets[ix2], "")))) == 0) {
        # this mean does not yet share a letter with the current mean, so a new letter is needed
        newletter <- 1
      }
    }
    if (newletter == 1) {
      siglets[ix] <- paste0(siglets[ix], clet)
      itlet <- itlet + 1
    }
  }
  siglets
}
It's ugly, and I am not yet sorting the output (sorting it is easy).
Is there a library function to do this? Or has anyone written a better approach to do this?
Thanks for your help!
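A possible tidier take on the same idea, offered as a sketch rather than a library routine. It assumes, as the code above does, that two means are "not different" when they differ by strictly less than the LSD, and that no more than 26 groups are needed. The name lsd_letters is made up here. For each sorted value it finds the furthest value still within the LSD and only opens a new letter when that range reaches past the previous one; the result is returned in the original order of vals.

```r
lsd_letters <- function(vals, lsd) {
  ord  <- order(vals)
  srt  <- vals[ord]
  n    <- length(srt)
  labs <- character(n)
  last_end <- 0   # right-hand end of the most recent letter group
  k <- 0          # number of letters used so far
  for (i in seq_len(n)) {
    # srt is sorted, so this is the furthest value within lsd of srt[i]
    end <- max(which(srt - srt[i] < lsd))
    if (end > last_end) {
      # range i..end is not contained in an earlier group: give it a new letter
      k <- k + 1
      labs[i:end] <- paste0(labs[i:end], letters[k])
      last_end <- end
    }
  }
  out <- character(n)
  out[ord] <- labs   # undo the sort so letters line up with the input
  out
}

lsd_letters(c(2, 3, 3.5, 4, 4.2, 6.0), 1.0)
# "a" "b" "bc" "c" "c" "d"
lsd_letters(c(6.0, 2, 3.5, 4.0, 4.2, 3), 1.0)
# "d" "a" "bc" "c" "c" "b"
```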
Using WinBUGS, how can I calculate the product of all values in a single vector?
I have tried using a for loop over the same vector.
For example:
In R, if A <- c(1,2,3,4), then prod(A) = 24.
However, in BUGS, if a <- 2, and for (i in 1:n){ a <- a * A[i] }, this loop cannot work because 'a' is defined twice.
Hi and welcome to the site!
Remember that BUGS is a declarative language rather than a procedural one, so you cannot overwrite the value of a node the way you can reassign a variable in a language such as R. Instead, you need to create some intermediate nodes to carry out the calculation.
If you have the following data:
A <- c(1,2,3,4)
nA <- 4
Then you can include in your model:
sumlogA[1] <- 0
for(i in 1:nA){
sumlogA[i+1] <- sumlogA[i] + log(A[i])
}
prodA <- exp(sumlogA[nA+1])
Notice that I am working on the log scale and then take the exponent of the sum - this is mathematically equivalent to the product but is a much more computationally stable calculation.
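As a quick sanity check of the identity used above, here is the same running sum of logs written in R (not BUGS), assuming all elements of A are strictly positive so that log(A[i]) is defined. The vector sumlogA mirrors the intermediate nodes in the model code.

```r
A  <- c(1, 2, 3, 4)
nA <- length(A)

# sumlogA[1] = 0 and sumlogA[i + 1] = sumlogA[i] + log(A[i]), as in the BUGS model
sumlogA <- c(0, cumsum(log(A)))
prodA   <- exp(sumlogA[nA + 1])

all.equal(prodA, prod(A))   # TRUE
```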
Hope that helps,
Matt
I have started working on a few ML projects and use R as the preferred language. I am trying to build a basic recommendation system
http://www.dataperspective.info/2014/05/basic-recommendation-engine-using-r.html
I need to find the similarity matrix (according to the website) and am using the cosine function (in the 'lsa' package) to find user_similarity.
library(lsa)
data_rating <- read.csv("recommendation_basic1.csv", header = TRUE)
x = data_rating[,2:7]
x[is.na(x)] = 0
print(x)
similarity_users <- cosine(as.matrix(x))
similarity_users
But I need to find the similarity matrix among users and this code is giving me an output similarity matrix among the movies. Do I need to modify the below line?
x = data_rating[,2:7]
PS. The recommendation_basic1.csv is the same as in the link.
Putting this in so the question is not unanswered.
You can just use similarity_users <- cosine(as.matrix(t(x)))
Here, t is the matrix transpose. cosine() compares the columns of the matrix it is given, so switching the rows and columns makes it compare users instead of movies.
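To see the effect of the transpose without the CSV file, here is a small sketch with made-up ratings. cosine_cols is a hypothetical base-R stand-in that, like lsa's cosine() on a matrix, compares columns; it is not the lsa implementation itself.

```r
# Made-up ratings: rows are users, columns are movies
ratings <- matrix(c(5, 3, 0,
                    4, 0, 1,
                    1, 1, 5,
                    0, 2, 4),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(c("u1", "u2", "u3", "u4"),
                                  c("m1", "m2", "m3")))

# Cosine similarity between the columns of m
cosine_cols <- function(m) {
  cp    <- crossprod(m)        # t(m) %*% m: dot products between columns
  norms <- sqrt(diag(cp))      # column lengths
  cp / (norms %o% norms)
}

movie_sim <- cosine_cols(ratings)      # 3 x 3, one row/column per movie
user_sim  <- cosine_cols(t(ratings))   # 4 x 4, one row/column per user
dim(movie_sim)
dim(user_sim)
```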
I am trying to generate a function that conducts various mathematical operations within a matrix and stores the outcomes of these operations in a new matrix with similar dimensions.
Here's an example matrix (a lot of silly computations in it to get sufficient variability in the data)
test<-matrix(1:290,nrow=10,ncol=29) ; colnames(test)<-1979+seq(1,29)
rownames(test)<-c("a","b","c","d","e","f","g","h","i","j")
test[,4]<-rep(8)
test[7,]<-seq(1,29)
test[c(3,5,9),]<-test[c(3,5,9),] * 1/2
test[,c(4,6,8,9,10,15,16,18)]<-test[,c(4,6,8,9,10,15,16,18)]*1/3
I want for instance to be able to calculate the difference between the value in (a,1999) and the average of the 3 values before (a, 1999). This needs to be flexible and for every rowname (firm) and every column (year).
The code I am trying to build looks something like this (I guess):
for(year in 1:29)
for (k in 1:10)
qw<-matrix((test[k, year] + 1/3*(- test[k, year-1] - test[k,year -2] - test[k, year-3])), nrow=10, ncol=29)
When I run it, this code generates a matrix but the value in that matrix is always the one for the last calculation (i.e. 20 in my example) while every matrix value should be stored in qw.
Any suggestions on how I can achieve this (maybe via an apply function)?
Thanks in advance
You are creating a new matrix qw in every iteration, and each new matrix overwrites the previous one. Here's how to do what I think you would like to do, although I didn't know how you want to handle the first 3 years (they are left as NA here).
qw <- matrix(nrow=10, ncol=29)
colnames(qw)<-1979+seq(1,29)
rownames(qw)<-c("a","b","c","d","e","f","g","h","i","j")
for(year in 4:29){
for (k in 1:10){
qw[k, year] <- (test[k, year] + 1/3*(- test[k, year-1] - test[k,year -2] - test[k, year-3]))
}
}
qw
In R, explicit loops over matrix elements are usually slower and more verbose than vectorised functions. Here is a more idiomatic R way of doing this, using the package zoo.
require(zoo)
qw <- matrix(nrow=10, ncol=29)
colnames(qw)<-1979+seq(1,29)
rownames(qw)<-c("a","b","c","d","e","f","g","h","i","j")
qw[,4:29] <- test[,4:29]-t(head(rollmean(t(test), 3),-1))
qw
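If you would rather not depend on zoo, the same result can be sketched in base R by adding three shifted column blocks. This assumes the window is always the 3 preceding years, as in the question; the helper names prev3 and cols are made up for this example.

```r
# Rebuild the example matrix from the question
test <- matrix(1:290, nrow = 10, ncol = 29)
colnames(test) <- 1979 + seq(1, 29)
rownames(test) <- letters[1:10]
test[, 4] <- 8
test[7, ] <- seq(1, 29)
test[c(3, 5, 9), ] <- test[c(3, 5, 9), ] / 2
cols <- c(4, 6, 8, 9, 10, 15, 16, 18)
test[, cols] <- test[, cols] / 3

# Mean of the 3 preceding years for columns 4..29, computed in one step
prev3 <- (test[, 1:26] + test[, 2:27] + test[, 3:28]) / 3

qw <- matrix(NA_real_, nrow = 10, ncol = 29, dimnames = dimnames(test))
qw[, 4:29] <- test[, 4:29] - prev3   # first 3 years stay NA
qw
```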
I am trying to set up a Gibbs sampler in R where I update my value at each step.
I have a function in R that I want to maximise for 2 values; my previous value and a new one.
So I know the maximum outcome from the function applied to both values. But then how do I select the best input without doing it manually? (I need to do a lot of iterations). Here is an idea of the code and the variables:
g0<-function(k){sample(0:1,k,replace=T)}
This gives a k-dimensional vector whose entries are 1 or 0, drawn uniformly. It is the initial starting point for my chain. If g[i]=1 then the i'th variable is included in the design matrix.
X1 design matrix
Xg<-function(g){
Xg<-cbind(X1[,1]*g[1],X1[,2]*g[2],X1[,3]*g[3],X1[,4]*g[4],X1[,5]*g[5],X1[,6]*g[6],X1[,7]*g[7])
return(Xg[,which(!apply(Xg,2,FUN = function(x){all(x == 0)}))])
}
Xg0<-Xg(g0)
reduced design matrix for g0
c<-1:100000
mp<-function(g){
mp<-sum((1/(c*(c+1)^-((q+1)/2)))*
(t(Y)%*%Y-(c/(c+1))*t(Y)%*%Xg(g)%*%solve(t(Xg(g))%*%Xg(g))%*%t(Xg(g))%*%Y)^(-27/2))
return(mp)
}
this is my function.
Therefore if I have mp(g) and mp(g*), for 2 inputs g and g*, such that the max is mp(g*) how can I return g*?
Thanks for any help and if you have any queries just ask. sorry about the messy code as well; I have not used this site before.
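Like this (with g2 standing for g*):
inputs <- list(g, g2)
outputs <- sapply(inputs, mp)
# use [[ ]] so that the vector itself is returned, not a list of length 1
best.input <- inputs[[which.max(outputs)]]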
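A self-contained sketch of the same pattern, with a toy function score standing in for mp (which needs X1, Y and q from the question). The function and the two candidate vectors are made up purely for illustration.

```r
# Hypothetical stand-in for mp(): any function returning a single number
score <- function(g) sum(g * c(0.5, 2, 1))

g  <- c(1, 0, 0)   # previous value
g2 <- c(0, 1, 1)   # proposed value

inputs  <- list(g, g2)
outputs <- vapply(inputs, score, numeric(1))   # one score per candidate
best.input <- inputs[[which.max(outputs)]]     # the candidate with the highest score
best.input
```

If the scores tie, which.max() returns the first maximum, so the previous value would be kept when it is listed first.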