R - Manhattan / Euclidean distance calculations into a matrix

I would like to ask for help with distance measures for continuous variables.
Here is an example:
x1 = (0,0)
x2 = (1,0)
x3 = (5,5)
The task is to find the distance matrices for the L1 norm (Manhattan) and the L2 norm (Euclidean).
I don't know how to compute in R to get the following answer:
I have tried to do it like this but it didn't work as expected.
y2 <- c(0,0)
y3 <- c(1,0)
y4 <- c(5,5)
y5 <- rbind(y2,y3,y4)
dist(y5)

y2 <- c(0,0)
y3 <- c(1,0)
y4 <- c(5,5)
mat <- rbind(y2, y3, y4)
d1 <- dist(mat, upper=TRUE, diag=TRUE, method="manhattan")
d1
# y2 y3 y4
# y2 0 1 10
# y3 1 0 9
# y4 10 9 0
# Squaring gives squared Euclidean distances; drop the ^2 for the plain L2 norm
d2 <- dist(mat, upper=TRUE, diag=TRUE)^2
d2
# y2 y3 y4
# y2 0 1 50
# y3 1 0 41
# y4 50 41 0
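To make the difference between the plain and the squared Euclidean distance explicit, here is a minimal sketch that converts both results to ordinary matrices and checks one pair by hand. The names l1 and l2 are only illustrative.

```r
# Three points, one per row (same data as above)
mat <- rbind(y2 = c(0, 0), y3 = c(1, 0), y4 = c(5, 5))

# as.matrix() turns the compact "dist" object into a full square matrix
l1 <- as.matrix(dist(mat, method = "manhattan"))
l2 <- as.matrix(dist(mat, method = "euclidean"))

# Hand check for the pair (y3, y4):
#   L1: |1 - 5| + |0 - 5| = 9
#   L2: sqrt((1 - 5)^2 + (0 - 5)^2) = sqrt(41)
l1["y3", "y4"]
l2["y3", "y4"]

# Squared Euclidean distances, as in d2 above
l2^2
```

Working with as.matrix() also makes it easy to pull out a single distance by row and column name.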

Related

Adding data by row into an empty matrix and handling missing data

I have an empty matrix with a fixed number of columns that I'm trying to fill row by row with the output vectors of a for-loop. However, some of the output vectors are shorter than the number of columns available for them in my matrix, and I just want to fill those "empty spaces" with NAs.
For example:
matrix.names <- c("x1", "x2", "x3", "x4", "y1", "y2", "y3", "y4", "z1", "z2", "z3", "z4")
my.matrix <- matrix(ncol = length(matrix.names))
colnames(my.matrix) <- matrix.names
This would be the output from one iteration:
x <- c(1,2)
y <- c(4,2,1,5)
z <- c(1)
Where I would want it in the matrix like this:
x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4
[1,] 1 2 NA NA 4 2 1 5 1 NA NA NA
The output from the next iteration would be, for example:
x <- c(1,1,1,1)
y <- c(0,4)
z <- c(4,1,3)
And added as a new row in the matrix:
x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4
[1,] 1 2 NA NA 4 2 1 5 1 NA NA NA
[2,] 1 1 1 1 0 4 NA NA 4 1 3 NA
It's not really a concern if I have a 0, it's just where there is no data. Also, the data is saved in such a way that whatever is there is listed in the row first, followed by NAs in empty slots. In other words, I'm not worried if an NA may pop up first.
Also, is such a thing better handled in data frames rather than matrices?
Not the most efficient answer, just an attempt.
Logic: append four NAs to each vector so that it has at least four elements, then keep only the first four elements of each when calling rbind (a vector that already has length 4 ends up with length 8, hence the truncation).
x[length(x)+1:4] <- NA
y[length(y)+1:4] <- NA
z[length(z)+1:4] <- NA
my.matrix <- rbind(my.matrix,c(x[1:4],y[1:4],z[1:4]))
Note: the case mentioned above, where the vector already has length 4, looks like this:
> x <- c(1,1,1,1)
> x
[1] 1 1 1 1
> x[length(x)+1:4] <- NA
> x
[1] 1 1 1 1 NA NA NA NA # therefore I extract only the first four
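A possibly simpler variant of the same idea: indexing a vector past its end returns NA, so taking elements 1 to 4 pads short vectors and leaves full ones unchanged. This is only a sketch; pad_to is a hypothetical helper name, and the width of 4 per group is taken from the question.

```r
# Hypothetical helper: pad (or truncate) a vector to exactly n elements.
# Indexing past the end of a vector returns NA, so no special case is needed.
pad_to <- function(v, n = 4) v[seq_len(n)]

# First iteration from the question
x <- c(1, 2)
y <- c(4, 2, 1, 5)
z <- c(1)
row1 <- c(pad_to(x), pad_to(y), pad_to(z))

# Second iteration from the question
x <- c(1, 1, 1, 1)
y <- c(0, 4)
z <- c(4, 1, 3)
row2 <- c(pad_to(x), pad_to(y), pad_to(z))

my.matrix <- rbind(row1, row2)
colnames(my.matrix) <- c(paste0("x", 1:4), paste0("y", 1:4), paste0("z", 1:4))
my.matrix
```

In a real loop it is usually cheaper to collect the rows in a list and call do.call(rbind, rows) once at the end than to rbind inside every iteration.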
Here is an option to do this programmatically
d1 <- stack(mget(c("x", "y", "z")))[2:1]
nm <- with(d1, paste0(ind, ave(seq_along(ind),ind, FUN = seq_along)))
my.matrix[,match(nm,colnames(my.matrix), nomatch = 0)] <- d1$values
my.matrix
# x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4
#[1,] 1 2 NA NA 4 2 1 5 1 NA NA NA
Or another option is stri_list2matrix from stringi
library(stringi)
m1 <- as.numeric(stri_list2matrix(list(x,y, z)))
After changing the 'x', 'y', 'z' values to those of the next iteration:
m2 <- as.numeric(stri_list2matrix(list(x,y, z)))
rbind(m1, m2)

Correlation with p values matrix

I have the following data:
x1 = sample(1:10, 100, replace=T)
x2 = sample(1:3, 100, replace=T)
x3 = sample(50:100, 100, replace=T)
y1 = sample(50:100, 100, replace=T)
y2 = sample(50:100, 100, replace=T)
mydf = data.frame(x1,x2,x3,y1,y2)
head(mydf)
x1 x2 x3 y1 y2
1 2 2 96 100 73
2 5 2 77 93 52
3 10 1 86 54 80
4 3 2 98 59 94
5 2 2 85 94 85
6 9 2 56 79 99
I want to compute correlations and produce the following output:
x1 x2 x3
y1 r.value; p.value r.value; p.value r.value; p.value
y2 r.value; p.value r.value; p.value r.value; p.value
The r value needs to be rounded to 2 digits and the p-value to 3 digits.
How can this be done? Thanks for your help.
I tried the following:
library(Hmisc)
res = rcorr(as.matrix(mydf), type="pearson")
res
      x1    x2    x3    y1    y2
x1  1.00 -0.01 -0.16 -0.28 -0.21
x2 -0.01  1.00 -0.20 -0.10 -0.13
x3 -0.16 -0.20  1.00  0.14 -0.09
y1 -0.28 -0.10  0.14  1.00  0.12
y2 -0.21 -0.13 -0.09  0.12  1.00
n= 100
P
   x1     x2     x3     y1     y2
x1        0.9520 0.1089 0.0047 0.0364
x2 0.9520        0.0444 0.3463 0.1887
x3 0.1089 0.0444        0.1727 0.3948
y1 0.0047 0.3463 0.1727        0.2482
y2 0.0364 0.1887 0.3948 0.2482
matrix(paste0(round(res[[1]][,1:3],2),';',round(res[[3]][1:2,],4)),ncol=3)
[,1] [,2] [,3]
[1,] "1;NA" "-0.01;0.0444" "-0.16;NA"
[2,] "-0.01;0.952" "1;0.0047" "-0.2;0.952"
[3,] "-0.16;0.952" "-0.2;0.3463" "1;0.952"
[4,] "-0.28;NA" "-0.1;0.0364" "0.14;NA"
[5,] "-0.21;0.1089" "-0.13;0.1887" "-0.09;0.1089"
But the combination is not correct.
You can also do the following, which doesn't require you to specify the positions of the rows/columns you need:
matrix(paste(unlist(round(res[[1]],2)),unlist(round(res[[3]],3)),sep=";"),
nrow=nrow(res[[1]]),dimnames=dimnames(res[[1]]))
Update: I added a dimnames argument so the dimnames are carried over to the result matrix.
For example, with the random sampling I had, you'll get :
x1 x2 x3 y1 y2
x1 "1;NA" "-0.2;0.052" "0.02;0.833" "-0.04;0.674" "0.02;0.819"
x2 "-0.2;0.052" "1;NA" "-0.13;0.202" "-0.01;0.896" "0.05;0.653"
x3 "0.02;0.833" "-0.13;0.202" "1;NA" "-0.05;0.636" "-0.13;0.185"
y1 "-0.04;0.674" "-0.01;0.896" "-0.05;0.636" "1;NA" "-0.02;0.858"
y2 "0.02;0.819" "0.05;0.653" "-0.13;0.185" "-0.02;0.858" "1;NA"
Try
r2 <- matrix(0, ncol=3, nrow=2,
dimnames=list( paste0('y',1:2), paste0('x',1:3)))
r2[] <- paste(round(res$r[4:5,1:3],2), round(res$P[4:5,1:3],4), sep="; ")
Update
You could create a function like below
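f1 <- function(df){
  # Sort columns by name so that all x columns come before the y columns
  df1 <- df[order(colnames(df))]
  # Strip the digits to get the prefix ("x" or "y") of each column
  indx <- sub('\\d+', '', colnames(df1))
  # Position of the last column before the prefix changes (assumes exactly two prefixes)
  indx1 <- which(indx[-1]!= indx[-length(indx)])
  indx2 <- (indx1+1):ncol(df1)
  r2 <- matrix(0, ncol=indx1, nrow=(ncol(df1)-indx1),
               dimnames=list(colnames(df1)[indx2], colnames(df1)[1:indx1]))
  r1 <- rcorr(as.matrix(df1), type='pearson')
  r2[] <- paste(round(r1$r[indx2,1:indx1],2), round(r1$P[indx2,1:indx1],4),
                sep="; ")
  r2
}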
f1(mydf) #using your dataset (`set.seed` is different)
# x1 x2 x3
#y1 "0.07; 0.4773" "0.02; 0.84" "0.21; 0.0385"
#y2 "-0.08; 0.4363" "0.08; 0.4146" "0.02; 0.8599"
Testing with unordered dataset
f1(mydf1)
# x1 x2 x3 x4
#y1 "-0.08; 0.4086" "0.17; 0.0945" "-0.25; 0.0112" "-0.16; 0.1025"
#y2 "0.07; 0.5174" "-0.1; 0.3054" "0.03; 0.7478" "-0.06; 0.5776"
Update2
If you want a function that takes numeric column indices as arguments (v1 for the columns of the result, v2 for its rows)
f2 <- function(df, v1, v2){
  r2 <- matrix(0, nrow=length(v2), ncol=length(v1),
               dimnames=list(colnames(df)[v2], colnames(df)[v1]))
  r1 <- rcorr(as.matrix(df), type='pearson')
  r2[] <- paste(round(r1$r[v2,v1],2), round(r1$P[v2,v1],4), sep="; ")
  r2
}
f2(mydf, 1:3, 4:5)
f2(mydf, c(1,3), c(2,4,5))
data
set.seed(29)
x1 = sample(1:10, 100, replace=T)
x2 = sample(1:3, 100, replace=T)
x3 = sample(50:100, 100, replace=T)
x4 <- sample(40:80, 100, replace=TRUE)
y1 = sample(50:100, 100, replace=T)
y2 = sample(50:100, 100, replace=T)
mydfN = data.frame(x1,x2,x3,x4, y1,y2)
set.seed(25)
mydf1 <- mydfN[sample(colnames(mydfN))]
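If Hmisc is not available, the same kind of table can be sketched with base R's cor.test(), computing one test per x/y pair. This is only an illustration under assumptions: fmt_cor, xs, ys and out are hypothetical names, the seed is arbitrary, and the rounding (2 digits for r, 3 for p) follows the question rather than the answers above.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
mydf <- data.frame(
  x1 = sample(1:10, 100, replace = TRUE),
  x2 = sample(1:3, 100, replace = TRUE),
  x3 = sample(50:100, 100, replace = TRUE),
  y1 = sample(50:100, 100, replace = TRUE),
  y2 = sample(50:100, 100, replace = TRUE)
)

# Hypothetical helper: one "r; p" string for a pair of numeric vectors
fmt_cor <- function(a, b) {
  ct <- cor.test(a, b, method = "pearson")
  paste(round(ct$estimate, 2), round(ct$p.value, 3), sep = "; ")
}

xs <- c("x1", "x2", "x3")
ys <- c("y1", "y2")

# Outer sapply() builds the columns (x variables),
# inner sapply() builds the rows (y variables)
out <- sapply(xs, function(xv) {
  sapply(ys, function(yv) fmt_cor(mydf[[xv]], mydf[[yv]]))
})
out
```

This runs one test per cell, so it is slower than rcorr() on wide data, but it only touches the pairs that end up in the table.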

cast() dataset and return two values

Aloha,
I am trying to cast() a dataset in which every unique combination of W-X-Y returns the max number of Z AND the associated week. For example:
W X Y week Z
w1 x1 y1 1 0
w1 x1 y1 2 0.1
w1 x1 y1 3 0.2
w2 x2 y1 1 0.5
w2 x2 y1 2 0.7
w2 x2 y1 3 0.3
w3 x1 y1 1 0.1
w3 x1 y1 2 0.2
w3 x1 y1 3 0.5
w4 x2 y2 1 0.7
w4 x2 y2 2 0.3
w4 x2 y2 3 0.1
w5 x1 y2 1 0.3
w5 x1 y2 2 0.1
w5 x1 y2 3 0.2
Can I do this w/cast()? I am able to return just the max number of Z per unique W-X-Y combination, but not the week with the following:
cast(foo, W + X + Y ~ ., max, value="Z")
For the above dataset, I would like the output to look as such:
W X Y week Z
w1 x1 y1 3 0.2
w2 x2 y1 2 0.7
w3 x1 y1 3 0.5
w4 x2 y2 1 0.7
w5 x1 y2 1 0.3
Mahalo for your suggestions!
cast is not the right tool for this. Consider instead the functions in the plyr package:
library("plyr")
ddply(foo, .(W, X, Y), summarise, week=week[which.max(Z)], Z=max(Z))
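For comparison, a base R sketch of the same "row with the maximum Z per group" idea, without plyr. The data frame foo is rebuilt here from the table in the question; pieces and best are illustrative names.

```r
foo <- data.frame(
  W = rep(c("w1", "w2", "w3", "w4", "w5"), each = 3),
  X = rep(c("x1", "x2", "x1", "x2", "x1"), each = 3),
  Y = rep(c("y1", "y1", "y1", "y2", "y2"), each = 3),
  week = rep(1:3, times = 5),
  Z = c(0, 0.1, 0.2, 0.5, 0.7, 0.3, 0.1, 0.2, 0.5, 0.7, 0.3, 0.1, 0.3, 0.1, 0.2)
)

# One piece per W-X-Y combination; drop = TRUE skips combinations that do not occur
pieces <- split(foo, list(foo$W, foo$X, foo$Y), drop = TRUE)

# Keep the row with the largest Z from each piece (first one in case of ties)
best <- do.call(rbind, lapply(pieces, function(d) d[which.max(d$Z), ]))
best <- best[order(best$W), ]
rownames(best) <- NULL
best
```

Because whole rows are kept, any additional columns travel along with the maximum without being listed explicitly.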

How do I convert table formats in R

Specifically,
I used the following set up:
newdata <- tapply(mydata(#), list(mydata(X), mydata(Y)), sum)
I currently have a table laid out as follows:
X= State, Y= County within State, #= a numerical total of something
__ Y1 Y2 Y3 Yn
X1 ## ## ## ##
X2 ## ## ## ##
X3 ## ## ## ##
Xn ## ## ## ##
What I need is a table listed as follows:
X1 Y1 ##
X1 Y2 ##
X1 Y3 ##
X1 Yn ##
X2 Y1 ##
X2 Y2 ##
X2 Y3 ##
X2 Yn ##
Xn Y1 ##
Xn Y2 ##
Xn Y3 ##
Xn Yn ##
library(reshape2)
new_data <- melt(old_data, id.vars=1)
Look into ?melt for more details on syntax.
example:
> df <- data.frame(x=1:5, y1=rnorm(5), y2=rnorm(5))
> df
x y1 y2
1 1 -1.3417817 -1.1777317
2 2 -0.4014688 1.4653270
3 3 0.4050132 1.5547598
4 4 0.1622901 -1.2976084
5 5 -0.7207541 -0.1203277
> melt(df, id.vars=1)
x variable value
1 1 y1 -1.3417817
2 2 y1 -0.4014688
3 3 y1 0.4050132
4 4 y1 0.1622901
5 5 y1 -0.7207541
6 1 y2 -1.1777317
7 2 y2 1.4653270
8 3 y2 1.5547598
9 4 y2 -1.2976084
10 5 y2 -0.1203277
Some example data
mydata <- data.frame(num=rnorm(40),
gp1=rep(LETTERS[1:2],2),
gp2=rep(letters[1:2],each=2))
And applying tapply to it:
tmp <- tapply(mydata$num, list(mydata$gp1, mydata$gp2), sum)
The result of tapply is a matrix, but you can treat it like a table and use as.data.frame.table to convert it. This does not rely on any additional packages.
as.data.frame.table(tmp)
The two different data structures look like:
> tmp
a b
A 8.381483 6.373657
B 2.379303 -1.189488
> as.data.frame.table(tmp)
Var1 Var2 Freq
1 A a 8.381483
2 B a 2.379303
3 A b 6.373657
4 B b -1.189488
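The default column names Var1, Var2 and Freq can be replaced in the same step. A small sketch, assuming you want the columns called state, county and total (these names are made up for illustration, and the seed is arbitrary):

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible
mydata <- data.frame(num = rnorm(40),
                     gp1 = rep(LETTERS[1:2], 2),
                     gp2 = rep(letters[1:2], each = 2))

# Naming the grouping list names the dimensions of the resulting matrix
wide <- tapply(mydata$num, list(state = mydata$gp1, county = mydata$gp2), sum)

# The dimension names become column names; responseName replaces "Freq"
long <- as.data.frame.table(wide, responseName = "total")
long
```

Combinations that never occur in the data show up as NA in the tapply() result and therefore as NA rows in the long table, which can be filtered out with !is.na(long$total) if they are not wanted.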

Selecting values from a 3-column dataframe in R

I have a 3-column data frame with the variables x, y and z: x is a list of places, y is a list of times, and z is a list of names. The names do not start at the same initial time across the places:
x y z
x1 1 NA
x1 2 z2
x1 3 z3
x1 4 z1
x2 1 NA
x2 2 NA
x2 3 z5
x2 4 z3
x3 1 z3
x3 2 z1
x3 3 z2
x3 4 z2
How do I find the first z for every x? I want the output matrix or dataframe to be:
x z
x1 z2
x2 z5
x3 z3
EDITED, after example data was supplied
You can use function ddply() in package plyr
dat <- "x y z
x1 1 NA
x1 2 z2
x1 3 z3
x1 4 z1
x2 1 NA
x2 2 NA
x2 3 z5
x2 4 z3
x3 1 z3
x3 2 z1
x3 3 z2
x3 4 z2"
df <- read.table(textConnection(dat), header=TRUE, stringsAsFactors=FALSE)
library(plyr)
ddply(df, .(x), function(x)x[!is.na(x$z), ][1, "z"])
x V1
1 x1 z2
2 x2 z5
3 x3 z3
If you don't want to use plyr
t(data.frame(lapply(split(df, as.factor(df$x)), function(k) head(k$z[!is.na(k$z)], 1))))
[,1]
x1 "z2"
x2 "z5"
x3 "z3"
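Another base R possibility that returns a two-column data frame directly: drop the rows with a missing z, sort by time within each place, and keep the first remaining row per place with duplicated(). A minimal sketch, with the data rebuilt from the question (ok and first_z are illustrative names):

```r
df <- data.frame(
  x = rep(c("x1", "x2", "x3"), each = 4),
  y = rep(1:4, times = 3),
  z = c(NA, "z2", "z3", "z1", NA, NA, "z5", "z3", "z3", "z1", "z2", "z2"),
  stringsAsFactors = FALSE
)

# Drop rows without a name, then make sure rows are ordered by time within place
ok <- df[!is.na(df$z), ]
ok <- ok[order(ok$x, ok$y), ]

# duplicated() flags every occurrence of x after the first, so negating it
# keeps exactly the earliest non-missing z for each place
first_z <- ok[!duplicated(ok$x), c("x", "z")]
rownames(first_z) <- NULL
first_z
```

The explicit sort only matters if the input is not already ordered by y within each x.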
