I have been trying to figure out how to access what kknn does internally to produce its results. I am using the kknn function and package to help predict future baseball stats. It takes in 11 predictor variables (the previous 3 years' stats, PA, and level, along with age and one other predictor). The predictions work great, but when I am predicting only one player (doing this for hundreds of players would be unwieldy), I would like to see, say, the 3 closest neighbors to the player in question, with their previous stats and what they produced the next year. I am most interested in the names of the nearest neighbors, since knowing which players are closest gives context to the prediction the model makes.
I am fine with editing the function's actual code if that is the only way to get at these. Even finding the indices would be helpful, as I can back-solve from there to get the names. Thank you so much for all of your help!
Here is some sample code that should help:
library(kknn)
name=c("McGwire,Mark","Bonds,Barry","Helton,Todd","Walker,Larry","Pujols,Albert","Pedroia,Dustin")
lag1=c(100,90,75,89,95,70)
lag2=c(120,80,95,79,92,90)
Runs=c(65,120,105,99,65,100)
full=cbind(name,lag1,lag2,Runs)
full=data.frame(full)
learn=full
learn
learn$lag1=as.numeric(as.character(learn$lag1))
learn$lag2=as.numeric(as.character(learn$lag2))
learn$Runs=as.numeric(as.character(learn$Runs))
valid=learn[5,]   # hold out Pujols as the single test case
learn=learn[-5,]  # the remaining players form the training set
valid
k=kknn(Runs~lag1+lag2,learn,valid,k=2,distance=1)
summary(k)
fit=fitted(k)
fit
Here is the call I am actually making, in case that helps you tailor your answers or workarounds:
kknn(RVPA~(lag1*lag1LVL*lag1PA)+(lag2*lag2LVL*lag2PA)+(lag3*lag3LVL*lag3PA)+Age1+PAsize, RV.learn, RV.valid,k=86, distance = 1,kernel = "optimal")
Here's a slightly modified version of your example:
full= data.frame(
name=c("McGwire,Mark","Bonds,Barry","Helton,Todd","Walker,Larry","Pujols,Albert","Pedroia,Dustin"),
lag1=c(100,90,75,89,95,70),
lag2=c(120,80,95,79,92,90),
Runs=c(65,120,105,99,65,100)
)
library(kknn)
train=full[full$name!="Bonds,Barry",]
test=full[full$name=="Bonds,Barry",]
k=kknn(Runs~lag1+lag2,train=train, test=test,k=2,distance=1)
This predicts Bonds to have 80.2 runs. The Runs variable acts like a class label, and if you call k$CL you'll get back 65 and 99 (the number of runs corresponding to the two nearest neighbors). There are two players (McGwire, Pujols) with 65 runs and one with 99, so you can't tell directly who the neighbors are. It appears that the output of kknn does not include a list of the nearest neighbors to the test set (though you could probably back it out from the various outputs).
The FNN package, however, will let you do a query against your training data in the way you want:
library(FNN)
get.knnx(data=train[,c("lag1","lag2")], query=test[,c("lag1","lag2")],k=2)
$nn.index
[,1] [,2]
[1,] 3 4
$nn.dist
[,1] [,2]
[1,] 1.414214 13
train[c(3,4),"name"]
[1] Walker,Larry Pujols,Albert
So nearest neighbors to Bonds are Pujols and Walker.
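If you want the same thing for the real model in the question, something along these lines should work. A caveat, and an assumption: the column names below are guessed from the kknn() call above, and kknn rescales its variables, expands the formula's interaction terms, and uses Manhattan distance when distance = 1, while get.knnx computes plain Euclidean distance on the raw columns, so the neighbors found here may not match kknn's internal ones exactly.
library(FNN)
# predictor columns, guessed from the formula in the question -- adjust as needed
preds <- c("lag1", "lag1LVL", "lag1PA", "lag2", "lag2LVL", "lag2PA",
           "lag3", "lag3LVL", "lag3PA", "Age1", "PAsize")
nn <- get.knnx(data = RV.learn[, preds], query = RV.valid[, preds], k = 3)
nn$nn.index                   # row indices of the 3 closest players in RV.learn
RV.learn[nn$nn.index[1, ], ]  # their names and previous stats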
Related
I have an R question that has kept me from completing several tasks for the last year; I am relatively new to R. I am trying to loop over a list to create two variables with a specified correlation structure, and I have been able to cobble this together with a for loop. To further complicate matters, I need to be able to record the correlation number in a data frame as well.
For my ultimate usage, I am concerned about the speed, efficiency, and long-term maintainability of my code.
library(mvtnorm)
n = 100
d = NULL                  # empty container for the correlation ids
col = c(0, .3, .5)        # the three correlation levels
for (j in 1:length(col)){
  X.corr = matrix(c(1, col[j], col[j], 1), nrow=2, ncol=2)
  x = rmvnorm(n, mean=c(0,0), sigma=X.corr)
  x1 = x[,1]
  x2 = x[,2]
  d = rbind(d, c(j))      # record which correlation level this pass used
}
Let me describe my code so my logic is clear; this is part of a larger simulation. I am trying to draw 2 correlated variables from rmvnorm at 3 different correlation levels, one level per pass, using 100 observations (toy data to get the coding correct). d starts as an empty object. The 3 correlation levels are used as follows: pass 1 uses correlation 0 to create the variables, and then other code runs; pass 2 uses correlation .3 to create 2 new variables, and then other code runs; pass 3 uses correlation .5, and then other code runs. Within my larger code, the for loop gets the job done. The last line of the loop records the index of the correlation in the data frame, putting 3 different numbers in a single column (1 = 0, 2 = .3, and 3 = .5). To reiterate, the for loop gets the job done, but I believe there is a better way, perhaps something in the apply family. I do not know how to construct that and still keep track of which correlation is being used. Would someone help me develop this little piece of code? Thank you.
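For instance, here is a minimal sketch of an lapply-based alternative, assuming (as described above) that each pass should return its two variables together with the correlation that produced them:
library(mvtnorm)
n    <- 100
cors <- c(0, .3, .5)
sims <- lapply(seq_along(cors), function(j) {
  X.corr <- matrix(c(1, cors[j], cors[j], 1), nrow = 2, ncol = 2)
  x <- rmvnorm(n, mean = c(0, 0), sigma = X.corr)
  # per-pass "other code" would go here, using x[, 1] and x[, 2]
  data.frame(corr.id = j, corr = cors[j], x1 = x[, 1], x2 = x[, 2])
})
d <- do.call(rbind, sims)  # one stacked data frame; the correlation id is on every row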
Using the var function,
(a) find the sample variance of your row averages from above;
(b) find the sample variance for your XYZmat as a whole; <- this is the part I am stuck on
(c) Divide the sample variance of the XYZmat by the sample variance of the row averages. The statistical theory says that ratio will on average be close to the row sample size, which is n, here.
(d) Do your results agree with theory? (That is a non-trivial question.) Show your work.
This is what the assignment asks for. I could not get a single-number result, so I just used the sd function and squared the result. I keep wondering if there is still a way to get a single-number result using the var function. In my case n is 30; I got it from the previous part of the homework. This is the first R class I am taking and this is the first homework assigned, so the answer should be pretty simple.
I tried the as.vector() function and I still got a set of numbers as the result. I played around with the var function with no change.
Unfortunately, I deleted all the code I had, since the matrix is so big that my laptop started lagging.
I did not get any error messages, but I kept getting a set of numbers for the answer...
set.seed(123)
XYZmat <- matrix(runif(10000), nrow=100, ncol=100) # make a matrix
varmat <- var(as.vector(XYZmat)) # variance of whole matrix
n <- nrow(XYZmat) # number of rows
n
#> [1] 100
rowmeans <- rowMeans(XYZmat) # row means
varmat/var(rowmeans) # should be near n
#> [1] 100.6907
Created on 2019-07-17 by the reprex package (v0.3.0)
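For what it's worth, the theory in part (c) is just that the variance of a mean of independent values is the underlying variance divided by the number of values averaged (here ncol(XYZmat) = 100), which the reprex above lets you sanity-check:
# var(rowmeans) should be about varmat / 100, since each row mean
# averages 100 independent uniforms
all.equal(varmat / n, var(rowmeans), tolerance = 0.05)
#> [1] TRUE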
I'm profiling the tumor microenvironment, and I want to show interactions between the subpopulations that I found. I have a list of receptors and ligands, and I want to show that, for example, population A expresses ligand 1 and population C expresses receptor 1, so there is likely an interaction between these two populations through the expression of ligand-receptor pair 1.
I have been trying to use circlize to visualize these interactions by making a chordDiagram, but it requires an adjacency matrix as input and I do not understand how to create the matrix. The adjacency matrix is supposed to show the strength of the relationship between any two genes in my matrix. I have 6 unique populations of cells that can express any of the 485 ligands/receptors that I am interested in, and the goal is to show interactions between these populations through the ligands and receptors.
I found a tool to use in R called BUS, with a function gene.similarity: "Calculate adjacency matrix for gene-gene interaction."
Maybe I am just using BUS incorrectly but it says: For gene expression data with M genes and N experiments, the adjacency matrix is in size of MxM. An adjacency matrix in size of MxM with rows and columns both standing for genes. Element in row i and column j indicates the similarity between gene i and gene j.
So, I made a matrix where each column is a subpopulation and each row is a ligand/receptor I want to show interactions with. The cells have expression values and it looks like this:
> head(Test)
A B C D E F
Adam10 440.755990 669.875468 748.7313995 702.991422 1872.033343 2515.074366
Adam17 369.813134 292.625603 363.0301707 434.905968 1183.152694 1375.424034
Agt 12.676036 28.269671 9.2428034 19.920561 121.587010 168.116735
Angpt1 22.807415 42.350205 25.5464603 16.010813 194.620550 99.383567
Angpt2 92.492760 186.167844 819.3679836 852.666499 669.642441 1608.748788
Angpt4 3.327743 0.693985 0.8292746 1.112826 5.463647 5.826927
Where A-F are my populations. Then I pass this matrix to BUS:
res<-gene.similarity(Test,measure="corr",net.trim="none")
Warning message:
In cor(mat) : the standard deviation is zero
But the output file which is supposed to be my adjacency matrix is full of NA's:
Adam10 Adam17
Adam10 1 NA
Adam17 NA 1
I thought maybe my matrix was too complex, so I compared only 2 cell populations with my ligands/receptors, but I get the exact same output.
I was expecting to get something like:
A:Adam10 A:Adam17
C:Adam10 6 1
E:Adam17 2 10
But even if the res object gave me numbers instead of NAs, it does not maintain the identity of the population when making relationships amongst genes, so it still would not produce my expected output.
I do not have to use BUS to make the matrix, so I don't necessarily need help troubleshooting that code, I just need SOME way to make an adjacency matrix.
I've never used circlize or Circos before so I apologize if my question is stupid.
It seems like you need to transform your matrix a little.
You can create a new matrix of size (nrow(Test) * ncol(Test)) x (nrow(Test) * ncol(Test)); in the example you gave, the new matrix will be 36 x 36, and the colnames and rownames will be identical: A_Adam10, A_Adam17, ..., A_Angpt4, B_Adam10, ..., F_Angpt4.
With the help of a loop, you can load the similarity of each pair into the new matrix (a rough sketch follows), and then you can plot it. It's a little complicated, and the loop takes a while to run, but it's intuitive.
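Here is a rough sketch of that loop, using the Test matrix from the question. The similarity filled in is only a placeholder (the product of the two expression values); substitute whatever ligand-receptor score you actually want.
# population:gene labels in the order A_Adam10, A_Adam17, ..., F_Angpt4
combos <- expand.grid(gene = rownames(Test), pop = colnames(Test),
                      stringsAsFactors = FALSE)
labs <- paste(combos$pop, combos$gene, sep = "_")
# empty (pop x gene) by (pop x gene) adjacency matrix
adj <- matrix(0, nrow = length(labs), ncol = length(labs),
              dimnames = list(labs, labs))
# fill in a score for every pair -- the product below is a placeholder only
for (i in seq_along(labs)) {
  for (j in seq_along(labs)) {
    adj[i, j] <- Test[combos$gene[i], combos$pop[i]] *
                 Test[combos$gene[j], combos$pop[j]]
  }
}
The resulting adj can then go straight into circlize::chordDiagram(adj).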
You're welcome to check my GitHub repo, since I had a similar problem not too long ago and posted detailed code there. I hope this helps.
I have two time series: a baseline (x) and one with an event (y). I'd like to cluster based on the dissimilarity of these two time series. Specifically, I'm hoping to create new features to predict the event. I'm much more familiar with clustering, but fairly new to time series.
I've tried a few different things with a limited understanding...
Simulating data...
x<-rnorm(100000,mean=1,sd=10)
y<-rnorm(100000,mean=1,sd=10)
This package seems awesome but there is limited information available on SO or Google.
library(TSclust)
d<-diss.ACF(x, y)
the value of d is
[,1]
[1,] 0.07173596
I then move on to clustering...
hc <- hclust(d)
but I get the following error:
Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536") :
missing value where TRUE/FALSE needed
My assumption is this error is because I only have one value in d.
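As a point of comparison, here is a minimal sketch (with made-up series) of the multi-series workflow, where diss() returns a full dist object that hclust will accept:
library(TSclust)
set.seed(1)
# four toy series in the rows: two white noise, two AR(1)
series <- rbind(wn1 = rnorm(200),
                wn2 = rnorm(200),
                ar1 = as.numeric(arima.sim(list(ar = 0.7), n = 200)),
                ar2 = as.numeric(arima.sim(list(ar = 0.7), n = 200)))
d  <- diss(series, METHOD = "ACF")  # pairwise ACF-based dissimilarities
hc <- hclust(d)                     # works now: d is a proper dist object
plot(hc)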
Alternatively, I've tried the following on a single time series (the event).
library(dtw)
distMatrix <- dist(y, method="DTW")          # dtw registers "DTW" with the proxy package
hc <- hclust(distMatrix, method="complete")  # cluster the distances, not the raw series
but it takes FOREVER to compute the distance matrix.
I have a couple of guesses at what is going wrong, but could use some guidance.
My questions...
Do I need a set of baseline and a set of event time series? Or is one pairing ok to start?
My time series are quite large (100000 rows). I'm guessing this is causing the SLOW distMatrix calculation. Thoughts on this?
Any resources on applied clustering on large time series are welcome. I've done a pretty thorough search, but I'm sure there are things I haven't found.
Is this the code you would use to accomplish these goals?
Thanks!
I am interested in deriving dominance metrics (as in a dominance hierarchy) for nodes in a dominance directed graph, aka a tournament graph. I can use R and the package igraph to easily construct such graphs, e.g.
library(igraph)
# create a data frame of edges
the.froms <- c(1,1,1,2,2,3)
the.tos <- c(2,3,4,3,4,4)
the.set <- data.frame(the.froms, the.tos)
set.graph <- graph.data.frame(the.set)
plot(set.graph)
This plotted graph shows that node 1 influences nodes 2, 3, and 4 (is dominant to them), that 2 is dominant to 3 and 4, and that 3 is dominant to 4.
However, I see no easy way to actually calculate a dominance hierarchy as on this page: https://www.math.ucdavis.edu/~daddel/linear_algebra_appl/Applications/GraphTheory/GraphTheory_9_17/node11.html . So my first and main question is: does anyone know how to derive a dominance hierarchy / node-based dominance metric for a graph like this, using some (hopefully already coded) solution in R?
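For reference, the metric on that page (count each node's one-step dominances plus its two-step dominances) can be computed directly from the graph's adjacency matrix; here is a minimal sketch using the set.graph built above:
# row sums of A + A^2 count direct wins plus wins-of-wins
A <- as.matrix(get.adjacency(set.graph))
dominance <- rowSums(A + A %*% A)
sort(dominance, decreasing = TRUE)  # node 1 scores highest, node 4 lowest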
Moreover, in my real case, I actually have a sparse matrix that is missing some interactions, e.g.
incomplete.set <- the.set[-2, ]
incomplete.graph <- graph.data.frame(incomplete.set)
plot(incomplete.graph)
In this plotted graph there is no connection between species 1 and 3; however, making some assumptions about transitivity, the dominance hierarchy is the same as above.
This is a much more complicated problem, but if anyone has any input about how I might go about deriving node-based metrics of dominance for sparse matrices like this, please let me know. I am hoping for an already coded solution in R, but I'm certainly MORE than willing to code it myself.
Thanks in advance!
Not sure if this is perfect or that I fully understand this, but it seems to work as it should from some trial and error:
library(relations)
result <- relation_consensus(endorelation(graph=the.set),method="Borda")
relation_class_ids(result)
#1 2 3 4
#1 2 3 4
There are lots of potential options for method= for dealing with ties etc.; see ?relation_consensus for more information. Using method="SD/L", which gives a linear order, might be the most appropriate for your data, though it can suggest multiple possible solutions when more complex examples contain conflicts. That is not the case for the current simple data, though; try:
result <- relation_consensus(endorelation(graph=the.set),method="SD/L",
control=list(n="all"))
result
#An ensemble of 1 relation of size 4 x 4.
lapply(result,relation_class_ids)
#[[1]]
#1 2 3 4
#1 2 3 4
Methods of dealing with this are again provided in the examples in ?relation_consensus.
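As for the incomplete graph in the question, one rough idea (leaning on the transitivity assumption mentioned there) is to score each node by how many other nodes it can reach along directed paths:
# reachable nodes per vertex; with transitive dominance this recovers
# the same hierarchy even though the 1 -> 3 edge is missing
reach <- rowSums(shortest.paths(incomplete.graph, mode = "out") < Inf) - 1
sort(reach, decreasing = TRUE)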