Suppose I have two symmetric matrices:
library(Matrix)
set.seed(123)
s1 <- forceSymmetric(matrix(round(rnorm(25), 2), 5))
colnames(s1) <- LETTERS[1:5]
rownames(s1) <- LETTERS[6:10]
diag(s1) <- 1
s2 <- forceSymmetric(matrix(round(rbinom(25, 25, 0.3), 2), 5))
colnames(s2) <- LETTERS[1:5]
rownames(s2) <- LETTERS[6:10]
diag(s2) <- 1
s1
# 5 x 5 Matrix of class "dsyMatrix"
#      A     B     C     D     E
# F  1.00  1.72  1.22  1.79 -1.07
# G  1.72  1.00  0.36  0.50 -0.22
# H  1.22  0.36  1.00 -1.97 -1.03
# I  1.79  0.50 -1.97  1.00 -0.73
# J -1.07 -0.22 -1.03 -0.73  1.00
s2
# 5 x 5 Matrix of class "dsyMatrix"
#   A B  C  D E
# F 1 6  8  7 9
# G 6 1  5  9 8
# H 8 5  1 10 9
# I 7 9 10  1 1
# J 9 8  9  1 1
What I want is to generate a single heatmap where the upper triangle and its legend are based on matrix s1, while the lower triangle and its legend are based on matrix s2. Here is a similar approach I found:
Get the upper part (flipping the columns maps the anti-diagonal onto the main diagonal, so lower.tri() can zero out the unwanted half before flipping back):
reverse <- s1[, ncol(s1):1]
diag(reverse) <- 0
reverse[lower.tri(reverse, diag = FALSE)] <- 0
upper <- reverse[, ncol(reverse):1]
Get the lower part the same way, this time zeroing the upper triangle of the flipped matrix:
reverse1 <- s2[, ncol(s2):1]
diag(reverse1) <- 0
reverse1[upper.tri(reverse1, diag = FALSE)] <- 0
lower <- reverse1[, ncol(reverse1):1]
Time to add them up.
merged <- as.matrix(upper + lower)
merged
     A    B     C     D E
F 1.00 1.72  1.22  1.79 0
G 1.72 1.00  0.36  0.00 8
H 1.22 0.36  0.00 10.00 9
I 1.79 0.00 10.00  1.00 1
J 0.00 8.00  9.00  1.00 1
Plot it, keeping the original row/column order and raw values (by default heatmap() scales rows and reorders rows and columns by clustering, which would scramble the two triangles):
heatmap(merged, Rowv = NA, Colv = NA, scale = "none")
You can find nicer heatmap plots elsewhere.
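For instance, here is a minimal ggplot2/reshape2 sketch of the same plot. Note it still uses a single fill scale, so both triangles share one legend; packages built for composite heatmaps, such as ComplexHeatmap, are better suited if the two halves really need independent legends:
library(ggplot2)
library(reshape2)

m <- melt(merged)  # long form: Var1 = row label, Var2 = column label, value
ggplot(m, aes(Var2, Var1, fill = value)) +
  geom_tile() +
  labs(x = NULL, y = NULL, fill = "value")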
I have the following dataframe (DF_A):
   PARTY_ID PROBS_3001 PROBS_3002 PROBS_3003 PROBS_3004 PROBS_3005 PROBS_3006 PROBS_3007 PROBS_3008
1:  1000000       0.03       0.58       0.01       0.42       0.69       0.98       0.55       0.96
2:  1000001       0.80       0.37       0.10       0.95       0.77       0.69       0.23       0.07
3:  1000002       0.25       0.73       0.79       0.83       0.24       0.82       0.81       0.01
4:  1000003       0.10       0.96       0.53       0.59       0.96       0.10       0.98       0.76
5:  1000004       0.36       0.87       0.76       0.03       0.95       0.40       0.53       0.89
6:  1000005       0.15       0.78       0.24       0.21       0.03       0.87       0.67       0.64
And I have this other dataframe (DF_B):
    V1   V2   V3   V4 PARTY_ID
1 0.58 0.69 0.96 0.98  1000000
2 0.69 0.77 0.80 0.95  1000001
3 0.79 0.81 0.82 0.83  1000002
4 0.76 0.96 0.96 0.98  1000003
5 0.76 0.87 0.89 0.95  1000004
6 0.64 0.67 0.78 0.87  1000005
I need to find the positions of the elements of DF_B within the corresponding rows of DF_A, to get something like this:
PARTY_ID P1 P2 P3 P4
1 1000000 3 6 9 7
...
Currently I'm working with the match function, but it takes a lot of time (I have 400K rows). I'm doing this:
positions <- vector("list", nrow(DF_A))
i <- 1
while(i <= nrow(DF_A)){  # <= so the last row is not skipped
  positions[[i]] <- match(DF_B[i,], DF_A[i,])  # store each row's result
  i <- i + 1
}
Although it works, it's very slow, and I know it's not the best answer to my problem. Can anyone help me, please?
You can merge and then use Map with a by-group operation:
library(data.table)
setDT(df_a)  # the join below needs data.tables on both sides

# right-join df_b to df_a on PARTY_ID, then, row by row, find where each
# V column's value occurs in the full df_a row
df_a2 <- df_a[setDT(df_b), on = "PARTY_ID"]
df_a3 <- df_a2[, c(PARTY_ID,
                   Map(f = function(x, y) which(x == y),
                       x = list(.SD[, names(df_a), with = FALSE]),
                       y = .SD[, paste0("V", 1:4), with = FALSE])),
               by = 1:nrow(df_a2)]
setnames(df_a3, paste0("V", 1:5), c("PARTY_ID", paste0("P", 1:4)))[, nrow := NULL]
df_a3
# PARTY_ID P1 P2 P3 P4
#1: 1000000 3 6 9 7
#2: 1000001 7 6 2 5
#3: 1000002 4 8 7 5
#4: 1000003 9 3 3 8
#5: 1000003 9 6 6 8
#6: 1000004 4 3 9 6
#7: 1000005 9 8 3 7
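Note that which(x == y) can return more than one position, in which case the by-group expands; that is why PARTY_ID 1000003 appears twice above: the value 0.96 occurs at two positions in its DF_A row.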
Here is an example on 1 million rows with two columns. It takes 14 ms on my computer.
library(data.table)
# create data tables with matching ids but in different positions
x <- as.data.table(data.frame(id = sample(1:1000000, 1000000, replace = FALSE), y = sample(LETTERS, 1000000, replace = TRUE)))
y <- as.data.table(data.frame(id = sample(1:1000000, 1000000, replace = FALSE), z = sample(LETTERS, 1000000, replace = TRUE)))
# add column to both data tables which will store the position in x and y
x$x_row_nr <- 1:nrow(x)
y$y_row_nr <- 1:nrow(y)
# set the key in both data tables using the matching column name
setkey(x, "id")
setkey(y, "id")
# merge data tables into one
z <- merge(x,y)
# now you can use this to extract the position in the y data table
# of, say, the 100th record of the x data table
z[x_row_nr==100, y_row_nr]
z will contain the matching rows from both data tables with their columns attached.
Data Frame:
set.seed(90)
df <- data.frame(id = 1:10, values = round(rnorm(10),1))
id values
1 1 0.1
2 2 -0.2
3 3 -0.9
4 4 -0.7
5 5 0.7
6 6 0.4
7 7 1.0
8 8 0.9
9 9 -0.6
10 10 2.4
Table:
table <- data.frame(values = c(-2.0001,1.0023,0.0005,1.0002,2.00009), final_values = round(rnorm(5),2))
values final_values
1 -2.00010 -0.81
2 1.00230 -0.08
3 0.00050 0.87
4 1.00020 1.66
5 2.00009 -0.24
I need to replace the values in the data frame with the final_values from the table, based on the closest match of the values.
Final Output:
id final_values
1 1 0.87
2 2 0.87
3 3 -0.08
4 4 -0.08
5 5 1.66
6 6 0.87
7 7 1.66
8 8 1.66
9 9 -0.08
10 10 -0.24
What is the best way to do this with base R?
Here is a way; you can write the result back into df:
sapply(df$values, function(x) table$final_values[which.min(abs(x - table$values))])
[1] 0.87 0.87 -0.08 -0.08 1.66 0.87 1.66 1.66 -0.08 -0.24
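If df grows large, a fully vectorized variant can be faster than sapply(); here is a base-R sketch using outer() and max.col(), assuming the full length(df$values) x nrow(table) distance matrix fits in memory:
# distance from every df value to every table value; negating the distances
# lets max.col() pick the column index of the nearest table value per row
idx <- max.col(-abs(outer(df$values, table$values, "-")), ties.method = "first")
df$final_values <- table$final_values[idx]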
I know that if I have raw data I can create a distance matrix, but for this problem I already have the distances, and I want to be able to run commands like hclust on them in R. Below is the distance matrix I want in R. I am not sure storing this data as a plain matrix will work, as I would be unable to run hclust on a matrix.
I have tried to create it using the as.dist function, to no avail. My faulty code:
test=as.dist(c(.76,2.97,4.88,3.86,.8,4.17,1.96,.21,1.51,.51), diag = FALSE, upper = FALSE)
test
1 2 3 4 5 6 7 8 9
2 2.97
3 4.88 2.97
4 3.86 4.88 0.51
5 0.80 3.86 2.97 0.21
6 4.17 0.80 4.88 1.51 0.80
7 1.96 4.17 3.86 0.51 4.17 0.51
8 0.21 1.96 0.80 2.97 1.96 2.97 0.80
9 1.51 0.21 4.17 4.88 0.21 4.88 4.17 0.21
10 0.51 1.51 1.96 3.86 1.51 3.86 1.96 1.51 0.51
Since you already have the distance values, you don't need dist() to calculate them. The data can be stored in a regular matrix:
test <- matrix(ncol=5,nrow=5)
test[lower.tri(test)] <- c(.76,2.97,4.88,3.86,.8,4.17,1.96,.21,1.51,.51)
diag(test) <- 0
> test
[,1] [,2] [,3] [,4] [,5]
[1,] 0.00 NA NA NA NA
[2,] 0.76 0.00 NA NA NA
[3,] 2.97 0.80 0.00 NA NA
[4,] 4.88 4.17 0.21 0.00 NA
[5,] 3.86 1.96 1.51 0.51 0
In order to apply hclust(), this matrix can then be converted to a distance matrix with as.dist():
> test <- as.dist(test, diag = TRUE)
> test
1 2 3 4 5
1 0.00
2 0.76 0.00
3 2.97 0.80 0.00
4 4.88 4.17 0.21 0.00
5 3.86 1.96 1.51 0.51 0.00
> hclust(test)
#
#Call:
#hclust(d = test)
#
#Cluster method : complete
#Number of objects: 5
> plot(hclust(test))
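As an aside, a dist object can also be built straight from the vector of lower-triangle values (stored column by column), skipping the intermediate matrix; a minimal sketch:
# Size = 5 tells print() and hclust() how many objects the ten
# lower-triangle values describe
test2 <- structure(c(.76, 2.97, 4.88, 3.86, .8, 4.17, 1.96, .21, 1.51, .51),
                   Size = 5L, Diag = FALSE, Upper = FALSE, class = "dist")
plot(hclust(test2))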
I have the following data set:
TRAIN dataset
Sr A B C XX
1 0.09 0.52 11.1 high
2 0.13 0.25 11.1 low
3 0.20 0.28 11.1 high
4 0.29 0.50 11.1 low
5 0.31 0.58 11.1 high
6 0.32 0.37 11.1 high
7 0.37 0.58 11.1 low
8 0.38 0.40 11.1 low
9 0.42 0.65 11.1 high
10 0.42 0.79 11.1 low
11 0.44 0.34 11.1 high
12 0.45 0.89 11.1 low
13 0.57 0.72 11.1 low
TEST dataset
Sr A B C XX
1 0.54 1.36 9.80 low
2 0.72 0.82 9.80 low
3 0.19 0.38 9.90 high
4 0.25 0.44 9.90 high
5 0.29 0.54 9.90 high
6 0.30 0.54 9.90 high
7 0.42 0.86 9.90 low
8 0.44 0.86 9.90 low
9 0.49 0.66 9.90 low
10 0.54 0.76 9.90 low
11 0.54 0.76 9.90 low
12 0.68 1.08 9.90 low
13 0.88 0.51 9.90 high
Sr : Serial Number
A-C : Parameters
XX : Output Binary Parameter
I am trying to use the KNN classifier to develop a predictor model with 5 nearest neighbors. Following is the code that I have written:
library(class)  # provides knn()

train_input <- as.matrix(train[, -ncol(train)])
train_output <- as.factor(train[, ncol(train)])
test_input <- as.matrix(test[, -ncol(test)])

prediction <- knn(train_input, test_input, train_output, k = 5, prob = TRUE)

# cbind() coerces both factors to their integer codes,
# which is why the result below shows 1/2 rather than high/low
resultdf <- as.data.frame(cbind(test[, ncol(test)], prediction))
colnames(resultdf) <- c("Actual", "Predicted")
RESULT dataset
A P
1 2 2
2 2 2
3 1 2
4 1 1
5 1 1
6 1 2
7 2 2
8 2 2
9 2 2
10 2 2
11 2 2
12 2 1
13 1 2
I have the following concerns:
What should I do to obtain probability values? Is this the probability of getting high or low, i.e. P(high) or P(low)?
The levels are set to 1 (high) and 2 (low), based on the order of first appearance: if low appeared before high in the train dataset, it would have the value 1. I feel this is not good practice. Is there any way I can avoid this?
If there were more than two classes, how would I handle this in the classifier?
I am using the class and e1071 libraries.
Thanks.
A utility function built before the "text" argument to read.table was introduced:
rd.txt <- function(txt, header = TRUE, ...) {
    tconn <- textConnection(txt)
    rd <- read.table(tconn, header = header, ...)
    close(tconn)
    rd
}
RESULT <- rd.txt(" A P
1 2 2
2 2 2
3 1 2
4 1 1
5 1 1
6 1 2
7 2 2
8 2 2
9 2 2
10 2 2
11 2 2
12 2 1
13 1 2
")
> prop.table(table(RESULT))
   P
A         1       2
  1 0.15385 0.23077
  2 0.07692 0.53846
You can also set up prop.table to deliver row or column proportions (AKA probabilities).
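For example, margin = 1 normalizes within each Actual class:
prop.table(table(RESULT), margin = 1)  # rows sum to 1; margin = 2 for columns
On the other concerns (assuming class::knn as in the question): with prob = TRUE, knn() attaches the winning class's share of the k votes as an attribute, i.e. the probability of whichever class was predicted; and you can fix the factor level order yourself instead of relying on first appearance:
attr(prediction, "prob")  # vote share of the predicted class, per test row
train$XX <- factor(train$XX, levels = c("low", "high"))  # explicit level order
resultdf <- data.frame(Actual = test$XX, Predicted = prediction)  # keeps labels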
Hello, I'm looking for the cleanest/fastest way to solve the following problem. My setup looks like this:
library(data.table)
set.seed(1234)
DT1 <- data.table(replicate(12,runif(5)))
setnames(DT1,LETTERS[1:12])
DT1[,time:=100]
DT2 <- data.table(time=rep(100,12), grp=rep(c("X","Y","Z"),each=4),
sub=LETTERS[1:12], weight=sample(1:100,12))
options(digits=2)
DT1
A B C D E F G H I J K L time
1: 0.11 0.6403 0.69 0.84 0.32 0.811 0.46 0.76 0.55 0.50 0.074 0.50 100
2: 0.62 0.0095 0.54 0.29 0.30 0.526 0.27 0.20 0.65 0.68 0.310 0.49 100
3: 0.61 0.2326 0.28 0.27 0.16 0.915 0.30 0.26 0.31 0.48 0.717 0.75 100
4: 0.62 0.6661 0.92 0.19 0.04 0.831 0.51 0.99 0.62 0.24 0.505 0.17 100
5: 0.86 0.5143 0.29 0.23 0.22 0.046 0.18 0.81 0.33 0.77 0.153 0.85 100
> DT2
time grp sub weight
1: 100 X A 87
2: 100 X B 5
3: 100 X C 32
4: 100 X D 2
5: 100 Y E 23
6: 100 Y F 68
7: 100 Y G 29
8: 100 Y H 48
9: 100 Z I 99
10: 100 Z J 52
11: 100 Z K 11
12: 100 Z L 80
I want to compute a weighted average (per row) of the columns of DT1 by referencing the groups, subclasses & weights from DT2, while joining per time point.
E.g. DT1 then gets columns X, Y & Z bound to it; in this case, column X of the first row is (87*0.11 + 5*0.64 + 32*0.69 + 2*0.84) / (87 + 5 + 32 + 2).
There are millions of rows in DT1 with different time points, so memory might be a limiting factor, though.
Any advice would be much appreciated!
Something like this perhaps:
library(reshape2)

# key DT2 so the join matches the melted (time, variable) pairs to (time, sub)
setkey(DT2, time, sub)

# melt DT1 to long form, tag each original row, join DT2 on, then take the
# weighted mean per time point, group and original row
DT2[melt(DT1, id.var = 'time')[, row := 1:.N, by = list(time, variable)]][,
    sum(weight * value) / sum(weight), by = list(time, grp, row)]
# time grp row V1
# 1: 100 X 1 0.29
# 2: 100 X 2 0.57
# 3: 100 X 3 0.51
# 4: 100 X 4 0.69
# 5: 100 X 5 0.69
# 6: 100 Y 1 0.67
# 7: 100 Y 2 0.36
# 8: 100 Y 3 0.52
# 9: 100 Y 4 0.71
#10: 100 Y 5 0.31
#11: 100 Z 1 0.50
#12: 100 Z 2 0.59
#13: 100 Z 3 0.51
#14: 100 Z 4 0.39
#15: 100 Z 5 0.59
You can also reshape the above result if you like:
# assuming you called the above table "res"
dcast.data.table(res, row + time ~ grp)
#Using 'V1' as value column. Use 'value.var' to override
# row time X Y Z
#1: 1 100 0.29 0.67 0.50
#2: 2 100 0.57 0.36 0.59
#3: 3 100 0.51 0.52 0.51
#4: 4 100 0.69 0.71 0.39
#5: 5 100 0.69 0.31 0.59
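If memory is tight, here is a sketch of an alternative that avoids the long melted copy by computing each group's weighted row average with a matrix product (it assumes, as in this example, one set of weights per group; with several time points you would join or loop on time first):
# for each group, weight its member columns and normalize by the weight sum
for (g in unique(DT2$grp)) {
  subs <- DT2[grp == g, sub]     # member columns of this group
  w    <- DT2[grp == g, weight]  # their weights
  DT1[, (g) := as.vector(as.matrix(.SD) %*% w) / sum(w), .SDcols = subs]
}
DT1[, .(time, X, Y, Z)]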