Calculate similarity matrix for the 1st column - r

I have started working on a few ML projects and use R as my preferred language. I am trying to build a basic recommendation system following this tutorial:
http://www.dataperspective.info/2014/05/basic-recommendation-engine-using-r.html
I need to find the similarity matrix (as described on the website) and am using the cosine function (from the 'lsa' package) to compute user_similarity.
library(lsa)
# ratings data: the first column holds the users, the remaining columns the movies
data_rating <- read.csv("recommendation_basic1.csv", header = TRUE)
x <- data_rating[, 2:7]
x[is.na(x)] <- 0
print(x)
# cosine() computes similarity between the columns of the matrix it is given
similarity_users <- cosine(as.matrix(x))
similarity_users
But I need the similarity matrix among users, and this code is giving me a similarity matrix among the movies. Do I need to modify the line below?
x = data_rating[,2:7]
PS. The recommendation_basic1.csv is the same as in the link.

Putting this in so the question is not unanswered.
You can just use similarity_users <- cosine(as.matrix(t(x)))
Here, t() is the matrix transpose, so it just switches the rows and columns, which is equivalent to switching the users and the movies.
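For reference, a minimal sketch with a made-up 3-user by 4-movie rating matrix, showing that cosine() from lsa compares the columns of whatever matrix it is given, so transposing first gives a user-by-user matrix:
library(lsa)
# hypothetical ratings: 3 users (rows) x 4 movies (columns), NAs already set to 0
x <- matrix(c(5, 3, 2, 1,
              4, 1, 1, 1,
              1, 1, 4, 5),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("user", 1:3), paste0("movie", 1:4)))
cosine(x)     # 4 x 4 matrix: similarity between movies
cosine(t(x))  # 3 x 3 matrix: similarity between users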

Related

subsetting from a bigstatsr::FBM object for the LDpred2 R tutorial

I'm using LDpred2 incorporated in bigsnpr to calculate polygenic scores with my own set of genetic Data. I am following the steps found in the online tutorial of LDpred2 on Github (https://privefl.github.io/bigsnpr/articles/LDpred2.html) to use the automatic model snp_ldpred2_auto.
I cannot execute the line:
pred_auto <- big_prodMat(G, beta_auto, ind.row = ind.test, ind.col = df_beta[["_NUM_ID_"]])
I suspect this happens because the matrices are not fit for multiplication with each other, since the number of columns in G (the FBM matrix) is not identical to the number of rows in beta_auto (an ordinary matrix). I intend to filter out variants (SNPs) from G so that the number of variants in G equals the number of variants in beta_auto.
I have never worked with matrices of class FBM.code256 before and do not know how to achieve this subsetting. Guidance is much appreciated.
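Not from the post, but a sanity-check sketch, assuming the objects are the ones from the LDpred2 tutorial (where df_beta comes from snp_match()): big_prodMat() multiplies G[ind.row, ind.col] by beta_auto, so length(ind.col) has to equal nrow(beta_auto), and comparing the dimensions usually shows where the two variant sets drifted apart.
# Sketch only: diagnose the dimension mismatch before taking the product
length(df_beta[["_NUM_ID_"]])  # number of columns of G selected via ind.col
nrow(beta_auto)                # number of variants with an effect estimate
ncol(G)                        # all variants stored in the FBM.code256
# If these disagree, rebuild beta_auto from the same matched variant set
# (the same df_beta) rather than subsetting G itself; ind.col already does
# the subsetting of G on the fly.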

What is the fastest way to compute correlation between a vector and a matrix in R?

I am trying to find a fast way to calculate the correlation between a vector of values and a matrix. After transposing, my data frame has 200 rows and 400,000 columns, and I need to find the correlation between each column and every other column.
My code is below, but it is too slow. Can anyone suggest a faster way?
for (i in 1:400000) {
  x <- cor(trainDataNew[, i], trainDataNew[, -i])
}
You don't need my data to do this. You can create random data like below.
norm1 <- rnorm(1000)
norm2 <- rnorm(1000)
norm3 <- rnorm(1000)
trainDataNew <- as.data.frame(cbind(norm1, norm2, norm3))
What's wrong with
cc <- cor(trainDataNew)
?
If you only want the lower triangle you can then use
cc2 <- cc[lower.tri(cc,diag=FALSE)]
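If what the original loop was computing is the correlation of each column against all of the others, those numbers are just the rows of cc, so no loop is needed at all. A quick check with the toy data above (stored in trainDataNew):
cc <- cor(trainDataNew)
i <- 1
cc[i, -i]  # correlation of column i with every other column
# same values as cor(trainDataNew[, i], trainDataNew[, -i]) from the loop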
This blog post claims to have done a similar-sized (slightly smaller) problem in about a minute. Their approach is implemented in HiClimR::fastCor.
library(HiClimR)
# dd here stands for your full numeric data matrix
system.time(cc <- fastCor(dd, nSplit = 10,
                          upperTri = TRUE, verbose = TRUE,
                          optBLAS = TRUE))
I haven't gotten this working yet (keep running out of memory), but you may have better luck. You should also look into linking R to an optimized BLAS, e.g. see here for MacOS.
Someone here reports a parallelized version (code is here, along with some forked versions)

Overlaying a matrix to diamond square algorithm output

This is a partial repeat of a question I asked a couple of days ago, but as has been pointed out, I asked it extremely poorly, so I'm sorry for that; I'm still learning how to make everything minimal. I'll ask the two parts separately, as that might make it easier for others to find the answers, and to actually answer in the first place (I hope).
I've used a diamond square algorithm (DSQA) I found on this website, which produces a heightmap image as output.
What I need to do is overlay a matrix onto this output, which will then be populated with "species", creating an ecosystem with species occupying different levels of the "terrain". Eventually I'd like to create a range for each "species", but for now I'd just like to know how to overlay the matrix in such a way that the species populate different levels (e.g. a species at a "high/orange" location would have different coordinates to one at a "lower/green" location).
The matrix I create looks something like this:
#Create Species Vector
species.v<-letters[1:5]
species.v<-as.character(species.v)
#Check species Vector
species.v
#Immigration Vector
immigration.lower<-letters[1:26]
immigration.vec<-toupper(immigration.lower)
immigration.vec
#Matrix creation (Random)
orig.neutral<- matrix(sample(species.v,25,replace=TRUE),
nrow=5,
ncol=5)
#Neutral Matrix
neutral.v0<-orig.neutral
#Create dice roll for replacement
dice.vector<-c(1:10)
dice.vector
#For loop and Ifs for replacement/immigration/speciation
for (i in 1:100) {
  dice.roll <- sample(dice.vector, 1)
  if (dice.roll <= 7) {
    neutral.v0[sample(length(neutral.v0), 1)] <- as.character(sample(neutral.v0, 1))
  } else if (dice.roll > 7 & dice.roll < 10) {
    neutral.v0[sample(length(neutral.v0), 1)] <- as.character(sample(immigration.vec, 1))
  } else if (dice.roll == 10) {
    elIdx <- sample(length(neutral.v0), 1)  # index of a randomly selected element
    neutral.v0[elIdx] <- paste(neutral.v0[elIdx], "2", sep = "")
  }
}
The replacement logic is all part of future ecosystem code, which will eventually check that species placed into the matrix fall within the correct "range" of the DSQA output.
But what I need to know is how to overlay/merge/create this matrix with the DSQA output, so that the matrix is part of the output. There don't need to be any limiting ranges at present; I just can't conceptualise how to merge these two separate pieces of code into one thing that I can work on.
In the example my matrix is only 5x5, but I have no idea how to create or specify the size of the DSQA output, let alone ensure my matrix is part of it or affected by it. Perhaps the DSQA output is too high a density for a simple 5x5 matrix? The actual matrix I'm using in my project is 1000x1000, but that's unnecessarily large for an example, as @Gregor pointed out; I just need to know the concept, which I can then apply to my ridiculously sized matrix/DSQA output.
I'm still not sure I've explained it well, but any help would be appreciated.
I found out what I was doing wrong: instead of taking the output of the diamond square algorithm as a matrix of values, I was trying to overlay a different matrix over the top. The long and short of it is that I only have to refer to the diamond square algorithm's output matrix, not use it directly as a "base" for a second, higher matrix.
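To make that concrete, a small sketch (the object names here are made up; it assumes the DSQA output is a numeric matrix of heights with the same dimensions as the species grid): heights are looked up by index rather than by overlaying a second structure.
# stand-in for the DSQA output: a numeric matrix of terrain heights
terrain <- matrix(runif(25), nrow = 5, ncol = 5)
# height of the cell occupied by the species at row 2, column 3 of neutral.v0
terrain[2, 3]
# e.g. indices of all "low" cells, to restrict a replacement to that range
low.cells <- which(terrain < 0.3, arr.ind = TRUE)
neutral.v0[low.cells]  # the species currently sitting in those cells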

Categorical Features in Distance Matrix

I'm calculating the cosine similarity between two feature vectors and wondering if someone might have a neat solution to the below problem around categorical features.
Currently I have (example):
# define the similarity function
cosineSim <- function(x) {
  as.matrix(x %*% t(x) / (sqrt(rowSums(x^2) %*% t(rowSums(x^2)))))
}
# define some feature vectors
A <- c(1,1,0,0.5)
B <- c(1,1,0,0.5)
C <- c(1,1,0,1.2)
D <- c(1,0,0,0.7)
dataTest <- data.frame(A,B,C,D)
dataTest <- data.frame(t(dataTest))
dataMatrix <- as.matrix(dataTest)
# get similarity matrix
cosineSim(dataMatrix)
which works fine.
But say I want to add in a categorical variable, such as city, to generate a feature that is 1 when two cities are equal and 0 otherwise.
In this case, example feature vectors would be:
A <- c(1,1,0,0.5,"Dublin")
B <- c(1,1,0,0.5,"London")
C <- c(1,1,0,1.2,"Dublin")
D <- c(1,0,0,0.7,"New York")
I'm wondering whether there is a neat way to generate the pairwise equality of the last feature on the fly, within the function, in a way that keeps the implementation vectorised.
I have tried pre-processing to make binary flags for each category, so that the above example would become something like:
A <- c(1,1,0,0.5,1,0,0)
B <- c(1,1,0,0.5,0,1,0)
C <- c(1,1,0,1.2,1,0,0)
D <- c(1,0,0,0.7,0,0,1)
This works, but the problem is that I have to pre-process each variable, and in some cases I can see the number of categories becoming quite large. This seems quite expensive/inefficient when all I want is a feature that returns 1 for equality and 0 otherwise (granted, there is complexity here in that the feature depends on two records and is shared between them).
One solution I can see is to write a loop that builds each pair of feature vectors (where I can add a feature such as [is_same_city], set to 1 for both vectors when the cities are equal and 0 otherwise) and then compute the distance, but this approach will kill me when I try to scale.
I am hoping my R skills are just not well enough developed and there is a neat solution that ticks most of the boxes...
Any suggestions at all are very welcome. Thanks!
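For what it's worth, one vectorised way to get that pairwise-equality term on the fly (a sketch, not from the post, reusing the cosineSim() defined above) is outer(), which builds the 0/1 equality matrix directly from the raw city labels; it can then be combined with the numeric cosine matrix, e.g. as a weighted sum.
# numeric part of the example feature vectors
num <- rbind(A = c(1, 1, 0, 0.5),
             B = c(1, 1, 0, 0.5),
             C = c(1, 1, 0, 1.2),
             D = c(1, 0, 0, 0.7))
city <- c(A = "Dublin", B = "London", C = "Dublin", D = "New York")
# 1 where two records share a city, 0 otherwise, with no one-hot encoding
sameCity <- outer(city, city, "==") * 1
# combine with the cosine similarity of the numeric features
# (the weights here are arbitrary and just for illustration)
0.8 * cosineSim(num) + 0.2 * sameCity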

perform function on pairs of columns

I am trying to run some Monte Carlo simulations on animal position data. So far, I have sampled 100 X and Y coordinates, 100 times, which results in a list of 200. I then convert this list into a data frame that is more conducive to the eventual function I want to run for each sample (kernel.area).
Now I have a data frame with 200 columns, and I would like to perform the kernel.area function using each successive pair of columns.
I can't reproduce my own data here very well, so I've tried to give a basic example just to show the structure of the data frame I'm working with. I've included the for loop I've tried so far, but I am still an R novice and would appreciate any suggestions.
# generate dataframe representing X and Y positions
df <- data.frame(x=seq(1:200),y=seq(1:200))
# 100 replications of sampling 100 "positions"
resamp <- replicate(100,df[sample(nrow(df),100),])
# convert to data frame (kernel.area needs an xy dataframe)
df2 <- do.call("rbind", resamp[1:2,])
# xy positions need to be in columns for kernel.area
df3 <- t(df2)
#edit: kernel.area requires you have an id field, but I am only dealing with one individual, so I'll construct a fake one of the same length as the positions
id=replicate(100,c("id"))
id=data.frame(id)
Here is the structure of the for loop I've tried (edited since first post):
for (j in seq(1, ncol(df3) - 1, 2)) {
  kud <- kernel.area(df3[, j:(j+1)], id = id, kern = "bivnorm", unin = c("m"), unout = c("km2"))
  print(kud)
}
My end goal is to calculate kernel.area for each resampling event (i.e. rows 1:100 for every pair of columns up to 200), and to be able to combine the results in a data frame. However, after running the loop, I get this error message:
Error in df[, 1] : incorrect number of dimensions
Edit: I realised my id format was not the same as my data frame, so I changed it, and now I get the error:
Error in kernelUD(xy, id, h, grid, same4all, hlim, kern, extent) :
id should have the same length as xy
First, a disclaimer: I have never worked with the package adehabitat, which has a function kernel.area, which I assume you are using. Perhaps you could confirm which package contains the function in question.
I think there are a couple suggestions I can make that are independent of knowledge of the specific package, though.
The first lies in the creation of df3. This should probably be df3 <- t(df2), but this is most likely correct in your actual code and just a typo in your post.
The second suggestion has to do with the way you subset df3 in the loop. j:j+1 is just a single number, since : has a higher precedence than + (see ?Syntax for the order in which operations are evaluated in R). To get the desired two columns, use j:(j+1) instead, as in the quick illustration below.
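A quick base-R illustration of that precedence point:
j <- 1
j:j+1    # evaluates as (j:j) + 1, i.e. the single number 2
j:(j+1)  # 1 2, the pair of column indices intended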
EDIT:
When loading adehabitat, I was warned to "Be careful" and to use the related newer packages, among which is adehabitatHR, which also contains a kernel.area function. This function has slightly different syntax and behavior, but it is perhaps worth examining. Using adehabitatHR (I had to install from source, since the package is not available for R 2.15.0), I was able to do the following.
library(adehabitatHR)
for (j in seq(1, ncol(df3) - 1, 2)) {
  kud <- kernelUD(SpatialPoints(df3[, j:(j+1)]), kern = "bivnorm")
  kernAr <- kernel.area(kud, unin = c("m"), unout = c("km2"))
  print(kernAr)
}
detach(package:adehabitatHR, unload = TRUE)
This prints something, and as is mentioned in a comment below, kernelUD() is called before kernel.area().
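And if you also want to combine the results in a data frame, as mentioned in the question, one possible variant of the loop above (a sketch; how best to bind depends on exactly what kernel.area() returns here) is to collect each result in a list first:
results <- vector("list", ncol(df3) / 2)
for (j in seq(1, ncol(df3) - 1, 2)) {
  kud <- kernelUD(SpatialPoints(df3[, j:(j+1)]), kern = "bivnorm")
  results[[(j + 1) / 2]] <- kernel.area(kud, unin = c("m"), unout = c("km2"))
}
kernAr.df <- as.data.frame(do.call(cbind, results))  # one column per resampling event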
