Calculate the Euclidean distance of 3 points - r

I have a data.frame (centroid) that contains points in a virtual 3D space (columns AV, V and A), each representing a character (column Character). Each row holds a different character.
AV<-c(37.9,10.87,40.05)
V<-c(1.07,1.14,1.9)
A<-c(0.04,-1.23,-1.1)
Character<-c("a","A","b")
centroid = data.frame(AV,V,A,Character)
centroid
AV V A Character
1 37.90 1.07 0.04 a
2 10.87 1.14 -1.23 A
3 40.05 1.90 -1.10 b
I wish to know the similarity/dissimilarity between each pair of characters. For example, "a" corresponds to 37.9, 1.07 and 0.04, whilst "A" corresponds to 10.87, 1.14 and -1.23. I want to know the distance between these characters, that is, between the points they define.
I believe I can calculate this using Euclidean distance between each character, but am unsure of the code to run.
I have attempted to use
dist(as.matrix(Centroids))
but I have been unsuccessful: this just prints a large block of output to the console. Any assistance would be greatly appreciated.

The following may be helpful:
AV<-c(37.9,10.87,40.05)
V<-c(1.07,1.14,1.9)
A<-c(0.04,-1.23,-1.1)
centroid = data.frame(A,V,AV)
centroid
A V AV
1 0.04 1.07 37.90
2 -1.23 1.14 10.87
3 -1.10 1.90 40.05
mm = as.matrix(centroid)
mm
A V AV
[1,] 0.04 1.07 37.90
[2,] -1.23 1.14 10.87
[3,] -1.10 1.90 40.05
dist(mm)
1 2
2 27.059909
3 2.571186 29.190185
as.dist(mm)
A V
V -1.23
AV -1.10 1.90
It is not clear what you mean by "Character<-c(a,A,b)"
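To keep the character labels attached to the distances, one option is to drop the Character column before converting to a matrix and use it as row names instead. This is a minimal sketch using the question's data; note that as.matrix() on a data frame that still contains a character column returns a character matrix, and that as.dist() only reinterprets an existing square matrix as distances rather than computing them.

```r
AV <- c(37.9, 10.87, 40.05)
V <- c(1.07, 1.14, 1.9)
A <- c(0.04, -1.23, -1.1)
Character <- c("a", "A", "b")
centroid <- data.frame(AV, V, A, Character)

# Keep only the numeric coordinates and label the rows by character
coords <- as.matrix(centroid[, c("AV", "V", "A")])
rownames(coords) <- centroid$Character

# Pairwise Euclidean distances between the rows
d <- dist(coords, method = "euclidean")

# Full symmetric matrix, indexable by label, e.g. dmat["a", "A"]
dmat <- as.matrix(d)
dmat
```

The entry dmat["a", "A"] should match the 27.059909 shown in the dist(mm) output above, since the order of the columns does not affect the Euclidean distance.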

Related

Creating an igraph from weighted correlation matrix csv

First of all, I'd like to say that I'm completely new to R, and I'm just trying to accomplish this one task.
What I'm trying to do is create a network diagram from a weighted matrix. I made an example:
The CSV is a simple correlation matrix that looks like this:
,A,B,C,D,E,F,G
A,1,0.9,0.64,0.43,0.38,0.33,0.33
B,0.9,1,0.64,0.33,0.43,0.38,0.38
C,0.64,0.64,1,0.59,0.69,0.64,0.64
D,0.43,0.33,0.59,1,0.28,0.23,0.28
E,0.38,0.43,0.69,0.28,1,0.95,0.9
F,0.33,0.38,0.64,0.23,0.95,1,0.9
G,0.33,0.38,0.64,0.28,0.9,0.9,1
I tried to draw the wanted result by myself and came up with this:
To be more precise, I drew the diagram first, then, using a ruler, I took note of the distances, worked out an equation to get the weights, and made the CSV table.
The higher the value is, the closer the two points are to each other.
However, whatever I do, the best result I get is this:
And this is how I'm trying to accomplish it, using this tutorial:
First of all, I import my matrix:
> matrix <- read.csv(file = 'test_dataset.csv')
But after printing the matrix out with head(), this already somehow cuts the last line of the matrix:
> head(matrix)
ï.. A B C D E F G
1 A 1.00 0.90 0.64 0.43 0.38 0.33 0.33
2 B 0.90 1.00 0.64 0.33 0.43 0.38 0.38
3 C 0.64 0.64 1.00 0.59 0.69 0.64 0.64
4 D 0.43 0.33 0.59 1.00 0.28 0.23 0.28
5 E 0.38 0.43 0.69 0.28 1.00 0.95 0.90
6 F 0.33 0.38 0.64 0.23 0.95 1.00 0.90
> dim(matrix)
[1] 7 8
I then proceed with removing the first column so the matrix is square again...
> matrix <- data.matrix(matrix)[,-1]
> head(matrix)
A B C D E F G
[1,] 1.00 0.90 0.64 0.43 0.38 0.33 0.33
[2,] 0.90 1.00 0.64 0.33 0.43 0.38 0.38
[3,] 0.64 0.64 1.00 0.59 0.69 0.64 0.64
[4,] 0.43 0.33 0.59 1.00 0.28 0.23 0.28
[5,] 0.38 0.43 0.69 0.28 1.00 0.95 0.90
[6,] 0.33 0.38 0.64 0.23 0.95 1.00 0.90
> dim(matrix)
[1] 7 7
Then I create the graph and try to plot it:
> network <- graph_from_adjacency_matrix(matrix, weighted=T, mode="undirected", diag=F)
> plot(network)
And the result above appears...
So, after spending the last few hours googling and trying way, way more things, this is the closest I've been able to get to.
So I'm asking for your help, thank you very much!
This is all fine.
head() prints only the first 6 rows of a matrix or data frame. To see all of it, use print() or just type the name of the matrix variable.
graph_from_adjacency_matrix produces a link between two nodes if the value is non-zero. That's why you are getting every node linked to every other node.
To get what that tutorial is doing you need to add a line like
matrix[matrix<0.5] <- 0
to remove the edges for correlations below a cut off before you create the graph.
It's still not going to produce a chart like your hand drawn one (where closeness is roughly the correlation), just clump them together if they are above 0.5 correlation.
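Putting those pieces together, a minimal sketch might look like the following. It assumes the igraph package is installed and reads the matrix from inline text rather than from test_dataset.csv so that it is self-contained; the 0.5 cut-off is the one suggested above, and the edge.width scaling is just an illustrative choice.

```r
library(igraph)

csv_text <- ",A,B,C,D,E,F,G
A,1,0.9,0.64,0.43,0.38,0.33,0.33
B,0.9,1,0.64,0.33,0.43,0.38,0.38
C,0.64,0.64,1,0.59,0.69,0.64,0.64
D,0.43,0.33,0.59,1,0.28,0.23,0.28
E,0.38,0.43,0.69,0.28,1,0.95,0.9
F,0.33,0.38,0.64,0.23,0.95,1,0.9
G,0.33,0.38,0.64,0.28,0.9,0.9,1"

# row.names = 1 turns the first column into row labels, so no column has to
# be dropped afterwards. For a file on disk, something like
# read.csv("test_dataset.csv", row.names = 1, fileEncoding = "UTF-8-BOM")
# may also get rid of the odd first header, assuming it is caused by a
# byte-order mark.
m <- as.matrix(read.csv(text = csv_text, row.names = 1))

# Drop weak correlations so they do not become edges
m[m < 0.5] <- 0

network <- graph_from_adjacency_matrix(m, weighted = TRUE,
                                       mode = "undirected", diag = FALSE)

n_edges <- ecount(network)   # number of edges that survive the cut-off

# Thicker lines for stronger correlations
plot(network, edge.width = 5 * E(network)$weight)
```

With this cut-off only the pairs with correlation of at least 0.5 remain connected, so D hangs off C while A-B-C and E-F-G form tighter groups.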

Conditional density distribution, two discrete variables

I have plotted the conditional density distribution of my variables using cdplot (R). My independent and dependent variables are not independent of each other. The independent variable is discrete (it takes only certain values between 0 and 3) and the dependent variable is also discrete (11 levels from 0 to 1 in steps of 0.1).
Some data:
dat <- read.table( text="y x
3.00 0.0
2.75 0.0
2.75 0.1
2.75 0.1
2.75 0.2
2.25 0.2
3 0.3
2 0.3
2.25 0.4
1.75 0.4
1.75 0.5
2 0.5
1.75 0.6
1.75 0.6
1.75 0.7
1 0.7
0.54 0.8
0 0.8
0.54 0.9
0 0.9
0 1.0
0 1.0", header=TRUE, colClasses="factor")
I wonder if my variables are appropriate to run this kind of analysis.
Also, I'd like to know how to report these results in an elegant way that makes academic and statistical sense.
This is a run using the rms package's lrm function, which is typically used for binary outcomes but also handles ordered categorical variables:
library(rms) # also loads Hmisc
# first get data in the form you described
dat[] <- lapply(dat, ordered) # makes both columns ordered factor variables
?lrm
#read help page ... Also look at the supporting book and citations on that page
lrm( y ~ x, data=dat)
# --- output------
Logistic Regression Model
lrm(formula = y ~ x, data = dat)
Frequencies of Responses
0 0.54 1 1.75 2 2.25 2.75 3 3.00
4 2 1 5 2 2 4 1 1
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 22 LR chi2 51.66 R2 0.920 C 0.869
max |deriv| 0.0004 d.f. 10 g 20.742 Dxy 0.738
Pr(> chi2) <0.0001 gr 1019053402.761 gamma 0.916
gp 0.500 tau-a 0.658
Brier 0.048
Coef S.E. Wald Z Pr(>|Z|)
y>=0.54 41.6140 108.3624 0.38 0.7010
y>=1 31.9345 88.0084 0.36 0.7167
y>=1.75 23.5277 74.2031 0.32 0.7512
y>=2 6.3002 2.2886 2.75 0.0059
y>=2.25 4.6790 2.0494 2.28 0.0224
y>=2.75 3.2223 1.8577 1.73 0.0828
y>=3 0.5919 1.4855 0.40 0.6903
y>=3.00 -0.4283 1.5004 -0.29 0.7753
x -19.0710 19.8718 -0.96 0.3372
x=0.2 0.7630 3.1058 0.25 0.8059
x=0.3 3.0129 5.2589 0.57 0.5667
x=0.4 1.9526 6.9051 0.28 0.7773
x=0.5 2.9703 8.8464 0.34 0.7370
x=0.6 -3.4705 53.5272 -0.06 0.9483
x=0.7 -10.1780 75.2585 -0.14 0.8924
x=0.8 -26.3573 109.3298 -0.24 0.8095
x=0.9 -24.4502 109.6118 -0.22 0.8235
x=1 -35.5679 488.7155 -0.07 0.9420
There is also the MASS::polr function, but I find Harrell's version more approachable. This could also be approached with rank regression. The quantreg package is pretty standard if that were the route you chose. Looking at your other question, I wondered if you had tried a logistic transform as a method of linearizing that relationship. Of course, the illustrated use of lrm with an ordered variable is a logistic transformation "under the hood".
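One detail worth checking before fitting any ordinal model: the frequency table in the output above lists "3" and "3.00" as separate response levels, because the data were read with colClasses="factor" and the two spellings are different text labels. If they are meant to be the same value (an assumption on my part), a sketch like the following, using only a few rows of the question's data, merges them by converting to numbers before building the ordered factor.

```r
# A few rows from the question's data, read the same way (as factors)
dat <- read.table(text = "y x
3.00 0.0
2.75 0.1
3 0.3
2 0.5
0 1.0", header = TRUE, colClasses = "factor")

levels(dat$y)                 # "3" and "3.00" are distinct labels here
n_before <- nlevels(dat$y)

# Convert the labels to numbers first, then build the ordered factors
dat$y <- ordered(as.numeric(as.character(dat$y)))
dat$x <- ordered(as.numeric(as.character(dat$x)))

levels(dat$y)                 # 3 and 3.00 now share a single level
n_after <- nlevels(dat$y)
```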

Finding Sync of Columns in R

Data
v1 <- c(52.9799999999814, 53.4200000000128, 52.0899999999965, 57.9700000000012,
60.679999999993, 0.300000000017462, 1.76999999998952, 61.1900000000023,
58.9599999999919, 1.73000000001048, 0.269999999989523, 6.92000000001281,
60.5299999999988, 60.859999999986, 59.5599999999977, 61.0600000000268,
60.6299999999756, 60.9700000000012, 60.1600000000035, 60.4599999999919,
60.0900000000256)
v2 <- c(52.679999999993, 53.140000000014, 52.8899999999849, 57.6700000000128,
60.5199999999895, 2.04000000000815, 61.890000000014, 59.5699999999779,
2.05999999999767, 6.98000000001048, 60.7399999999907, 60.7799999999988,
59.7300000000105, 60.9100000000035, 60.3299999999872, 60.5500000000175,
60.6600000000035, 60.3499999999767, 60.7300000000105, 60.6700000000128,
60.3799999999756)
tv3 <- data.frame(v1,v2)
tv3$v5 <- tv3$v2 - tv3$v1
tv3$v5
[1] -0.30 -0.28 0.80 -0.30 -0.16 1.74 60.12 -1.62 -56.90 5.25 60.47 53.86 -0.80
[14] 0.05 0.77 -0.51 0.03 -0.62 0.57 0.21 0.29
As you can see, the difference should stay small, but at particular rows it jumps to around 60.
If we remove the 0.30 entry from v1 only and shift the rest of v1 up by one cell, the difference would no longer jump to 60.
So the 0.30 is a noise value; that is what I need to detect and put in V3.
My desired results are the following.
v1 v2 V3
52.98 52.68 0.3
53.42 53.14 0.27
52.09 52.89
57.97 57.67
60.68 60.52
1.77 2.04
61.19 61.89
58.96 59.57
1.73 2.06
6.92 6.98
60.53 60.74
60.86 60.78
59.56 59.73
61.06 60.91
60.63 60.33
60.97 60.55
60.16 60.66
60.46 60.35
60.09 60.73
Notice that the two columns are now in sync, differing by only a few points.
Maybe my case requires an implementation of the Needleman-Wunsch algorithm.
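Short of a full Needleman-Wunsch implementation, a simple greedy pass may already reproduce the desired table. The sketch below is only an illustration under stated assumptions: the function name align_greedy and the tolerance tol = 1 are hypothetical choices, the data are the question's values rounded to two decimals, and only extra values in the first vector are treated as noise.

```r
v1 <- c(52.98, 53.42, 52.09, 57.97, 60.68, 0.30, 1.77, 61.19, 58.96, 1.73,
        0.27, 6.92, 60.53, 60.86, 59.56, 61.06, 60.63, 60.97, 60.16, 60.46,
        60.09)
v2 <- c(52.68, 53.14, 52.89, 57.67, 60.52, 2.04, 61.89, 59.57, 2.06, 6.98,
        60.74, 60.78, 59.73, 60.91, 60.33, 60.55, 60.66, 60.35, 60.73, 60.67,
        60.38)

# Hypothetical helper: walk both vectors in parallel and skip an element of
# `a` when the element after it is the one that matches the current `b`.
align_greedy <- function(a, b, tol = 1) {
  i <- 1
  j <- 1
  pairs <- list()
  noise <- numeric(0)
  while (i <= length(a) && j <= length(b)) {
    if (abs(a[i] - b[j]) <= tol) {
      # values agree: keep them as a pair
      pairs[[length(pairs) + 1]] <- c(v1 = a[i], v2 = b[j])
      i <- i + 1
      j <- j + 1
    } else if (i < length(a) && abs(a[i + 1] - b[j]) <= tol) {
      # a[i] has no partner but a[i + 1] does: record a[i] as noise
      noise <- c(noise, a[i])
      i <- i + 1
    } else {
      # no obvious fix: keep the pair as it is so the loop always advances
      pairs[[length(pairs) + 1]] <- c(v1 = a[i], v2 = b[j])
      i <- i + 1
      j <- j + 1
    }
  }
  list(aligned = do.call(rbind, pairs), noise = noise)
}

res <- align_greedy(v1, v2)
res$noise      # values removed from v1
res$aligned    # the synchronised pairs
```

On this data the pass flags 0.30 and 0.27 as noise and leaves the last two values of v2 unpaired, which matches the desired table. A dynamic-programming alignment would be more robust when noise can occur in both columns or in runs.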

svd imputation R

I'm trying to use the SVD imputation from the bcv package but all the imputed values are the same (by column).
This is the dataset with missing data
http://pastebin.com/YS9qaUPs
#load data
dataMiss = read.csv('dataMiss.csv')
#impute data
SVDimputation = round(impute.svd(dataMiss)$x, 2)
#find index of missing values
bool = apply(X = dataMiss, 2, is.na)
#put in a new data frame only the imputed value
SVDImpNA = mapply(function(x,y) x[y], as.data.frame(SVDimputation), as.data.frame(bool))
View(SVDImpNA)
head(SVDImpNA)
V1 V2 V3
[1,] -0.01 0.01 0.01
[2,] -0.01 0.01 0.01
[3,] -0.01 0.01 0.01
[4,] -0.01 0.01 0.01
[5,] -0.01 0.01 0.01
[6,] -0.01 0.01 0.01
Where am I wrong?
The impute.svd algorithm works as follows:
1. Replace all missing values with the corresponding column means.
2. Compute a rank-k approximation to the imputed matrix.
3. Replace the values in the imputed positions with the corresponding values from the rank-k approximation computed in Step 2.
4. Repeat Steps 2 and 3 until convergence.
In your example code, you are setting k=min(n,p) (the default). Then, in Step 2, the rank-k approximation is exactly equal to the imputed matrix, so the algorithm converges after 0 iterations. That is, the algorithm sets all imputed entries to the column means (or something extremely close to them if there is numerical error).
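To make the effect of k concrete, here is a base R sketch of the procedure described above. It is not the bcv implementation; the function name impute_svd_sketch, the stopping rule and the toy matrix are assumptions made purely for illustration.

```r
# Hypothetical re-implementation of the steps above, for illustration only
impute_svd_sketch <- function(x, k = 1, tol = 1e-8, maxiter = 100) {
  x <- as.matrix(x)
  miss <- is.na(x)
  # Step 1: start from the column means
  col_means <- colMeans(x, na.rm = TRUE)
  x[miss] <- col_means[col(x)[miss]]
  for (iter in seq_len(maxiter)) {
    # Step 2: rank-k approximation from the truncated SVD
    s <- svd(x, nu = k, nv = k)
    approx <- s$u %*% diag(s$d[seq_len(k)], nrow = k) %*% t(s$v)
    # Step 3: overwrite only the originally missing cells
    change <- sum((x[miss] - approx[miss])^2)
    x[miss] <- approx[miss]
    # Step 4: stop once the imputed cells no longer move
    if (change < tol) break
  }
  x
}

# Toy 5 x 3 matrix with one missing cell (made-up numbers)
toy <- matrix(c(1, NA, 3, 4, 5,
                2, 4, 6, 8, 11,
                1, 3, 2, 5, 4), ncol = 3)

full_rank <- impute_svd_sketch(toy, k = 3)  # k = min(n, p)
low_rank  <- impute_svd_sketch(toy, k = 1)

full_rank[2, 1]               # stays at the column mean
mean(toy[, 1], na.rm = TRUE)  # the column mean, for comparison
low_rank[2, 1]                # moves away from the column mean
```

With k equal to the number of columns, the approximation reproduces the filled-in matrix, so the loop exits on the first pass and the missing cell keeps the column mean.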
If you want to do something other than impute the missing values with the column means, you need to use a smaller value for k. The following code demonstrates this with your sample data:
> library("bcv")
> dataMiss = read.csv('dataMiss.csv')
k=3
> SVDimputation = impute.svd(dataMiss, k = 3, maxiter=10000)$x
> table(round(SVDimputation[is.na(dataMiss)], 2))
-0.01 0.01
531 1062
k=2
> SVDimputation = impute.svd(dataMiss, k = 2, maxiter=10000)$x
> table(round(SVDimputation[is.na(dataMiss)], 2))
-11.31 -6.94 -2.59 -2.52 -2.19 -2.02 -1.67 -1.63
25 23 61 2 54 23 5 44
-1.61 -1.2 -0.83 -0.8 -0.78 -0.43 -0.31 -0.15
14 10 13 19 39 1 14 19
-0.14 -0.02 0 0.01 0.02 0.03 0.06 0.17
83 96 94 77 30 96 82 28
0.46 0.53 0.55 0.56 0.83 0.91 1.26 1.53
1 209 83 23 28 111 16 8
1.77 5.63 9.99 14.34
112 12 33 5
Note that for your data, the default maximum number of iterations (100) was too low (I got a warning message). To fix this, I set maxiter=10000.
The problem that you describe likely occurs because impute.svd initially sets all of the NA values to be equal to the column means, and then doesn't change these values upon convergence.
It depends on why you are using SVD imputation in the first place, but if you are flexible, a good solution might be to change the rank used in the SVD by setting k to, e.g., 1. Currently, k is set automatically to min(n, p), where n = nrow and p = ncol, which for your data means k = 3. If you set it to 1 (as in the example in the impute.svd documentation), the problem does not occur:
library(bcv)
dataMiss = read.csv("dataMiss.csv")
SVDimputation = round(impute.svd(dataMiss, k = 1)$x, 2)
head(SVDimputation)
[,1] [,2] [,3]
[1,] 0.96 -0.23 0.52
[2,] 0.02 -0.23 -1.92
[3,] -1.87 -0.23 0.52
[4,] -0.92 -0.23 0.52
[5,] 0.49 -0.46 0.52
[6,] -1.87 -0.23 0.52

how can I do vector integral in R? [duplicate]

I want to integrate a one-dimensional vector in R. How should I do that?
Let's say I have:
d=hist(p, breaks=100, plot=FALSE)$density
where p is a sample like:
p=rnorm(1e5)
How can I calculate an integral over d?
If we assume that the values in d correspond to the y values of a function, we can calculate the integral using a discrete approximation, for example the trapezium rule or Simpson's rule. We also need to supply the step size, that is, the spacing between consecutive points on the x-axis, in order to approximate the area under the curve.
Discrete integration functions defined below:
p=rnorm(1e5)
d=hist(p,breaks=100,plot=FALSE)$density
discreteIntegrationTrapeziumRule <- function(v,lower=1,upper=length(v),stepsize=1)
{
  # clamp the requested index range to the vector
  if(upper > length(v))
    upper=length(v)
  if(lower < 1)
    lower=1
  integrand <- v[lower:upper]
  l <- length(integrand)
  # end points get weight 0.5, interior points weight 1
  stepsize*(0.5*integrand[1]+sum(integrand[2:(l-1)])+0.5*integrand[l])
}
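discreteIntegrationSimpsonRule <- function(v,lower=1,upper=length(v),stepsize=1)
{
  # clamp the requested index range to the vector
  if(upper > length(v))
    upper=length(v)
  if(lower < 1)
    lower=1
  integrand <- v[lower:upper]
  l <- length(integrand)
  # composite Simpson weights: 1, 4, 2, 4, ..., 2, 4, 1
  # (the pattern only closes correctly when the number of points is odd)
  a = seq(from=2,to=l-1,by=2) # even positions, weight 4
  b = seq(from=3,to=l-1,by=2) # odd interior positions, weight 2
  (stepsize/3)*(integrand[1]+4*sum(integrand[a])+2*sum(integrand[b])+integrand[l])
}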
As an example, let's approximate the complete area under the curve while assuming discrete x steps of size 1 and then do the same for the second half of d while we assume x-steps of size 0.2.
> plot(1:length(d),d) # stepsize one on x-axis
> resultTrapeziumRule <- discreteIntegrationTrapeziumRule(d) # integrate over complete interval, assume x-stepsize = 1
> resultSimpsonRule <- discreteIntegrationSimpsonRule(d) # integrate over complete interval, assume x-stepsize = 1
> resultTrapeziumRule
[1] 9.9999
> resultSimpsonRule
[1] 10.00247
> plot(seq(from=-10,to=(-10+(length(d)*0.2)-0.2),by=0.2),d) # stepsize 0.2 on x-axis
> resultTrapeziumRule <- discreteIntegrationTrapeziumRule(d,ceiling(length(d)/2),length(d),0.2) # integrate over second part of vector, x-stepsize=0.2
> resultSimpsonRule <- discreteIntegrationSimpsonRule(d,ceiling(length(d)/2),length(d),0.2) # integrate over second part of vector, x-stepsize=0.2
> resultTrapeziumRule
[1] 1.15478
> resultSimpsonRule
[1] 1.11678
In general, the Simpson rule offers better approximations of the integral. The more y-values you have (and the smaller the x-axis stepsize), the better your approximations will become.
Small EDIT for clarity:
In this particular case the stepsize should obviously be 0.1. The complete area under the density curve is then (approximately) equal to 1, as expected.
> d=hist(p,breaks=100,plot=FALSE)$density
> hist(p,breaks=100,plot=FALSE)$mids # stepsize = 0.1
[1] -4.75 -4.65 -4.55 -4.45 -4.35 -4.25 -4.15 -4.05 -3.95 -3.85 -3.75 -3.65 -3.55 -3.45 -3.35 -3.25 -3.15 -3.05 -2.95 -2.85 -2.75 -2.65 -2.55
[24] -2.45 -2.35 -2.25 -2.15 -2.05 -1.95 -1.85 -1.75 -1.65 -1.55 -1.45 -1.35 -1.25 -1.15 -1.05 -0.95 -0.85 -0.75 -0.65 -0.55 -0.45 -0.35 -0.25
[47] -0.15 -0.05 0.05 0.15 0.25 0.35 0.45 0.55 0.65 0.75 0.85 0.95 1.05 1.15 1.25 1.35 1.45 1.55 1.65 1.75 1.85 1.95 2.05
[70] 2.15 2.25 2.35 2.45 2.55 2.65 2.75 2.85 2.95 3.05 3.15 3.25 3.35 3.45 3.55 3.65 3.75 3.85 3.95 4.05 4.15
> resultTrapeziumRule <- discreteIntegrationTrapeziumRule(d,stepsize=0.1)
> resultTrapeziumRule
[1] 0.999985
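For a histogram specifically, the step size does not have to be guessed: the object returned by hist() also carries the bin edges (breaks) and midpoints (mids). A short sketch, with an arbitrary seed added only so that it is reproducible:

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
p <- rnorm(1e5)
h <- hist(p, breaks = 100, plot = FALSE)

# Rectangle sum: bar height times bar width, using the actual bin widths
area_bars <- sum(h$density * diff(h$breaks))

# Trapezium rule on the bin midpoints, written with vectorised differences
x <- h$mids
y <- h$density
area_trap <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

area_bars
area_trap
```

The rectangle sum equals 1 up to floating-point error, because that is how hist() defines the density; the trapezium value is slightly smaller because the two outermost bars only count with half weight.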

Resources