How can I find differences between several columns - r

I have data like this:
df <- structure(list(X1 = c(37L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, NA,
11L, 12L), X2 = c(40L, NA, 35L, 35L, 35L, 34L, 29L, 28L, 28L,
NA, 25L, 24L), X3 = c(60L, 44L, 49L, 41L, NA, NA, NA, 25L, 26L,
NA, NA, 22L), T1 = c(19L, 55L, 47L, 46L, 36L, 42L, 25L, NA, 33L,
42L, 50L, 22L), T2 = c(75L, NA, 32L, 44L, 27L, 31L, 17L, NA,
18L, 45L, 10L, 11L), T3 = c(5L, 6L, 7L, 8L, 9L, 10L, 11L, NA,
46L, 36L, 42L, NA)), class = "data.frame", row.names = c(NA,
-12L))
I want to be able to obtain the following info
Number_of_values_X1-X3 11
Number_of_missing_in_X1 1
Number_of_missing_in_X2 2
Number_of_missing_in_X3 5
Number_of_missing_in_X1X2X3 1
Number_of_Missing_in_X1_X2 0
Number_of_missing_in_X1_X3 0
Number_of_missing_in_X2_X3 0
The same for the three other columns
Number_of_values_T1-T3 11
Number_of_missing_in_T1 1
Number_of_missing_in_T2 2
Number_of_missing_in_T3 2
Number_of_missing_in_T1T2T3 1
Number_of_Missing_in_T1_T2 0
Number_of_missing_in_T1_T3 0
Number_of_missing_in_T2_T3 0
I tried to do this with the following function, but when more columns are involved I cannot figure out how to amend it:
myData <- function(column) {
  N_V <- length(setdiff(df[[column]], NA))
  N_Missing <- sum(is.na(df[[column]]))
  print(paste("Number of values in", column, N_V))
  print(paste("Number of missing in", column, N_Missing))
}
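With the argument name fixed (the original took Ecol but used column in the body), this works one column at a time; note that setdiff() also drops duplicate values, which happens not to matter for this data:
myData("X1")
# [1] "Number of values in X1 11"
# [1] "Number of missing in X1 1"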

The following gives the total number of NA values per column and, for any set of columns, the number of rows that are NA in all of them.
First define a function and compute a logical matrix.
# Count the rows that are NA in every one of the selected columns
na_cols <- function(X, cols){
  all_na <- apply(X[, cols], 1, function(y) Reduce('&', y))
  sum(all_na)
}
na <- sapply(df, is.na)
Now the total number of NA values per column.
apply(na[, 1:3], 2, sum)
#X1 X2 X3
# 1 2 5
And the number of rows that are NA in every column of a given combination.
na_cols(na, 1:3)
#[1] 1
na_cols(na, 1:2)
#[1] 1
na_cols(na, c(1, 3))
#[1] 1
na_cols(na, 2:3)
#[1] 1
For the columns starting with T, the equivalent code is
apply(na[, 4:6], 2, sum)
#T1 T2 T3
# 1  2  2
na_cols(na, 4:6)
#[1] 1
na_cols(na, 4:5)
#[1] 1
na_cols(na, c(4, 6))
#[1] 1
na_cols(na, 5:6)
#[1] 1
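To scale this beyond three columns without writing one call per pair, here is a hedged sketch (the helper name na_combos is illustrative) that enumerates every combination of the selected columns with combn() and counts the rows that are NA in all of them:
na_combos <- function(df, cols) {
  na <- is.na(df[, cols, drop = FALSE])
  out <- list(per_column = colSums(na))
  for (k in 2:length(cols)) {
    combos <- combn(cols, k, simplify = FALSE)
    counts <- sapply(combos, function(cc) sum(rowSums(na[, cc, drop = FALSE]) == k))
    names(counts) <- sapply(combos, paste, collapse = "_")
    out[[paste0("all_na_in_", k)]] <- counts
  }
  out
}
na_combos(df, c("X1", "X2", "X3"))
na_combos(df, c("T1", "T2", "T3"))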

Related

How to geometrically project 3D xyz vectors onto a plane orthogonal to the surface normals of a triangulated surface? R

I'm solving an equation with OLS which returns a timing point on an x/y/z surface mesh. I can solve for speed and directionality with the method outlined below. The problem is that my vectors are not parallel to the surface mesh, and I'm uncertain of the correct transforms needed to get them pointing in the correct direction.
The literature states:
"For the visualization of the vectors in the 3D geometry ... If only surface data were used in the calculation of the vectors, the conduction velocity vectors were geometrically projected into the plane orthogonal to the surface normals. This constraint corrected vectors pointing inside or outside the surface"
I have no idea how to do that.
Reprex Data:
library(plotly)
library(dplyr)
tri_list <- list(surf_tri = structure(c(1L, 1L, 4L, 8L, 11L, 12L, 17L, 6L,
19L, 13L, 3L, 5L, 22L, 1L, 26L, 18L, 18L, 8L, 21L, 14L, 27L,
4L, 14L, 7L, 6L, 29L, 22L, 16L, 23L, 26L, 3L, 28L, 2L, 13L, 12L,
31L, 33L, 35L, 32L, 28L, 33L, 31L, 20L, 9L, 36L, 12L, 29L, 20L,
31L, 27L, 17L, 26L, 8L, 15L, 9L, 29L, 9L, 35L, 15L, 11L, 16L,
16L, 38L, 31L, 37L, 32L, 24L, 35L, 23L, 23L, 23L, 37L, 10L, 35L,
2L, 4L, 5L, 7L, 10L, 13L, 16L, 18L, 20L, 2L, 2L, 7L, 3L, 3L,
25L, 17L, 27L, 17L, 7L, 2L, 18L, 27L, 28L, 5L, 7L, 1L, 30L, 29L,
30L, 31L, 13L, 5L, 5L, 12L, 14L, 33L, 10L, 25L, 25L, 20L, 12L,
12L, 28L, 36L, 15L, 19L, 4L, 7L, 37L, 4L, 27L, 13L, 36L, 36L,
7L, 16L, 10L, 32L, 16L, 15L, 23L, 30L, 16L, 34L, 34L, 37L, 35L,
24L, 13L, 24L, 26L, 11L, 11L, 38L, 3L, 2L, 2L, 6L, 9L, 14L, 15L,
8L, 9L, 14L, 13L, 21L, 23L, 22L, 24L, 8L, 17L, 15L, 20L, 28L,
6L, 6L, 19L, 4L, 4L, 22L, 16L, 22L, 22L, 32L, 23L, 21L, 28L,
31L, 19L, 34L, 34L, 32L, 26L, 19L, 9L, 33L, 21L, 11L, 11L, 9L,
1L, 9L, 32L, 29L, 29L, 31L, 9L, 8L, 8L, 17L, 33L, 38L, 38L, 38L,
39L, 23L, 39L, 37L, 10L, 38L, 39L, 25L, 26L, 39L, 24L, 38L, 37L,
39L), dim = c(74L, 3L), dimnames = list(NULL, c("V1", "V2", "V3"
))), coordinates = structure(c(1.9194540977478, 11.1435489654541,
5.42491292953491, -1.66259050369263, 7.44240760803223, -6.69524192810059,
-0.0333123430609703, -4.23227882385254, 21.1540908813477, 28.65553855896,
26.0386810302734, 24.6995525360107, 38.495361328125, 22.120174407959,
-2.17373514175415, 0.292593091726303, -11.4042282104492, -10.3343534469604,
17.5129470825195, 11.7786016464233, 9.42895698547363, -1.22890889644623,
23.1779632568359, 31.624174118042, 38.5983734130859, 45.1263847351074,
-11.1577320098877, 12.0059661865234, -2.91624927520752, 0.309472501277924,
30.9889793395996, 42.6134490966797, 23.7120399475098, 26.9033088684082,
30.3514823913574, 7.96694564819336, 31.1637763977051, 31.1527690887451,
25.4455451965332, -61.4243927001953, -56.5476036071777, -45.9739837646484,
-66.2752914428711, -66.6171951293945, -84.5177154541016, -78.7427749633789,
-92.5044403076172, -86.8224487304688, -90.4450073242188, -100.261276245117,
-64.494514465332, -54.8481636047363, -57.9494781494141, -109.206504821777,
-76.7904586791992, -95.8042831420898, -88.341796875, -64.2513122558594,
-69.0683898925781, -68.2359237670898, -69.7075653076172, -76.6936798095703,
-76.4758529663086, -78.4939727783203, -64.2876815795898, -86.8634643554688,
-61.3102798461914, -68.8540496826172, -74.4825897216797, -79.063835144043,
-81.4954528808594, -86.3758010864258, -87.7453536987305, -83.0354080200195,
-109.458961486816, -90.4418487548828, -90.170280456543, -80.8295745849609,
189.259994506836, 207.376663208008, 187.388595581055, 200.771896362305,
215.514175415039, 214.221862792969, 217.531005859375, 215.624313354492,
211.211242675781, 205.350524902344, 203.870376586914, 210.061508178711,
195.598449707031, 210.308349609375, 195.291015625, 177.159469604492,
194.172241210938, 210.24040222168, 213.330307006836, 216.293930053711,
216.293441772461, 180.884536743164, 163.47248840332, 163.422592163086,
168.801071166992, 186.711502075195, 207.275924682617, 212.505798339844,
190.655349731445, 176.77978515625, 204.541122436523, 191.065948486328,
209.147216796875, 206.617050170898, 172.961532592773, 211.9609375,
203.223968505859, 179.055145263672, 170.081588745117), dim = c(39L,
3L)), activation_timing_ms = c(-25.9116878216266, -35, 17.8156572141554,
-32.3085435482249, -30.425790486056, -22.7921350518616, -19.7547757364543,
-17.1422476398632, -5.21201605679312, 6.529324221754, 6.37415402817896,
-18.6397688908753, 6.16765189118496, -23.9073664920415, -6.74190321661786,
-17, -15.3633306040704, -20.0486637091315, -23.098251364298,
-23.5797990311573, -23.6085020691451, -27.8322123439857, 11.4184099588406,
12.5422915301849, 9.07414737038971, 5.42139612786991, -21.8214591939045,
-29.3013020135356, -33.236757970705, -19.3552220222891, -5.4698807256143,
-2.27373675443232e-13, -3.44126922278861, -0.401211436169433,
17.9550207172763, 1, 5.84672565064739, -2.04852001824679, 26.5347286690835
))
# Polynomial Surface Fitting
lm_dat <- cbind(tri_list$coordinates, activation_timing_ms = tri_list$activation_timing_ms) |>
  as.data.frame() |>
  rename(x = V1, y = V2, z = V3)
lm <- lm(activation_timing_ms~polym(x,y,z, degree=2, raw = TRUE), lm_dat)
lm_coef <- coef(lm)
a <- lm_coef[["polym(x, y, z, degree = 2, raw = TRUE)2.0.0"]]
b <- lm_coef[["polym(x, y, z, degree = 2, raw = TRUE)0.2.0"]]
c <- lm_coef[["polym(x, y, z, degree = 2, raw = TRUE)0.0.2"]]
d <- lm_coef[["polym(x, y, z, degree = 2, raw = TRUE)1.1.0"]]
e <- lm_coef[["polym(x, y, z, degree = 2, raw = TRUE)1.0.1"]]
f <- lm_coef[["polym(x, y, z, degree = 2, raw = TRUE)0.1.1"]]
g <- lm_coef[["polym(x, y, z, degree = 2, raw = TRUE)1.0.0"]]
h <- lm_coef[["polym(x, y, z, degree = 2, raw = TRUE)0.1.0"]]
i <- lm_coef[["polym(x, y, z, degree = 2, raw = TRUE)0.0.1"]]
j <- lm_coef[["(Intercept)"]]
# Differentiation: the fitted quadratic is
# t(x,y,z) = a*x^2 + b*y^2 + c*z^2 + d*x*y + e*x*z + f*y*z + g*x + h*y + i*z + j,
# so its gradient components are:
t_x_fun <- function(x, y, z){
  2*a*x + d*y + e*z + g
}
t_y_fun <- function(x, y, z){
  2*b*y + d*x + f*z + h
}
t_z_fun <- function(x, y, z){
  2*c*z + e*x + f*y + i
}
# Conduction velocity: the gradient divided by its squared norm
CV_x_fun <- function(t_x, t_y, t_z){
  t_x/(t_x^2 + t_y^2 + t_z^2)
}
CV_y_fun <- function(t_x, t_y, t_z){
  t_y/(t_x^2 + t_y^2 + t_z^2)
}
CV_z_fun <- function(t_x, t_y, t_z){
  t_z/(t_x^2 + t_y^2 + t_z^2)
}
fitted_activation_timing <- fitted(lm)
# Loop through the V1 coordinates. The loop index is named k rather than i,
# because a loop variable called i would clobber the coefficient i that
# t_z_fun() reads from the global environment.
node_indices_tmp <- vector(length = nrow(tri_list$surf_tri))
v1_vertex_coord <- vector("list",
                          length = nrow(tri_list$coordinates))
for (k in seq_len(nrow(tri_list$coordinates))) {
  v3_index_tmp <- tri_list$surf_tri[k, 3]
  v2_index_tmp <- tri_list$surf_tri[k, 2]
  v1_index_tmp <- tri_list$surf_tri[k, 1]
  timing_sorted <- matrix(c(fitted_activation_timing[v1_index_tmp], v1_index_tmp,
                            fitted_activation_timing[v2_index_tmp], v2_index_tmp,
                            fitted_activation_timing[v3_index_tmp], v3_index_tmp),
                          nrow = 3,
                          ncol = 2,
                          byrow = TRUE)
  # Sort the three vertices by fitted timing; V1 becomes the earliest
  timing_sorted <- timing_sorted[order(timing_sorted[, 1]), ]
  v3_index <- timing_sorted[3, 2]
  v2_index <- timing_sorted[2, 2]
  v1_index <- timing_sorted[1, 2]
  v1_vertex_coord[[k]] <- tri_list$coordinates[v1_index, ]
}
v1_vertex_coord <- matrix(unlist(v1_vertex_coord),
                          nrow = length(v1_vertex_coord),
                          ncol = 3,
                          byrow = TRUE)
CV_x <- vector(length = nrow(v1_vertex_coord))
CV_y <- vector(length = nrow(v1_vertex_coord))
CV_z <- vector(length = nrow(v1_vertex_coord))
for (k in seq_len(nrow(v1_vertex_coord))) {
  x <- v1_vertex_coord[k, 1]
  y <- v1_vertex_coord[k, 2]
  z <- v1_vertex_coord[k, 3]
  t_x <- t_x_fun(x, y, z)
  t_y <- t_y_fun(x, y, z)
  t_z <- t_z_fun(x, y, z)
  CV_x[k] <- CV_x_fun(t_x, t_y, t_z)
  CV_y[k] <- CV_y_fun(t_x, t_y, t_z)
  CV_z[k] <- CV_z_fun(t_x, t_y, t_z)
}
CV <- cbind(lm_dat, CV_x, CV_y, CV_z, fitted_activation_timing) |> as.data.frame()
plot_ly() |>
add_trace(
name = "Faces",
type = "mesh3d",
x = tri_list$coordinates[,1], y = tri_list$coordinates[,2], z = tri_list$coordinates[,3],
i = tri_list$surf_tri[,1]-1, j = tri_list$surf_tri[,2]-1, k = tri_list$surf_tri[,3]-1,
opacity = 0.8,
flatshading = TRUE # we don't want smoothing
) |>
add_trace(
type = "scatter3d",
mode = "markers",
x = tri_list$coordinates[,1],
y = tri_list$coordinates[,2],
z = tri_list$coordinates[,3]) |>
add_trace(
type="cone",
name = "CV Vectors",
x = CV[,"x"],
y = CV[,"y"],
z = CV[,"z"],
u = CV[,"CV_x"],
v = CV[,"CV_y"],
w = CV[,"CV_z"],
opacity=0.8,
sizemode= 'absolute',
sizeref= 1
)
Relevant sections of the literature: [screenshots omitted]
Calculating the weights is beyond the scope of the MRE, so I left it out.
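For what it's worth, the projection the quoted passage describes is the standard removal of the out-of-plane component: with a unit surface normal n, the in-plane part of a vector v is v - (v . n) n. A minimal sketch under that reading (the helper names are illustrative, and the per-triangle normal comes from the cross product of two edges):
cross3 <- function(u, w) {
  c(u[2]*w[3] - u[3]*w[2],
    u[3]*w[1] - u[1]*w[3],
    u[1]*w[2] - u[2]*w[1])
}
project_to_tangent <- function(v, n) {
  n_hat <- n / sqrt(sum(n^2))  # normalise the surface normal
  v - sum(v * n_hat) * n_hat   # remove the out-of-plane component
}
# Example: normal of the first triangle, then project the first CV vector
tri <- tri_list$surf_tri[1, ]
p1 <- tri_list$coordinates[tri[1], ]
p2 <- tri_list$coordinates[tri[2], ]
p3 <- tri_list$coordinates[tri[3], ]
project_to_tangent(c(CV_x[1], CV_y[1], CV_z[1]), cross3(p2 - p1, p3 - p1))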

Creating new column based on values in one column over another

I have some data
structure(list(Factor = c(0L, 1L, 0L, 1L, 1L, 0L, 1L), Col_A = c(45L,
23L, 35L, 43L, 42L, 23L, 11L), Col_B = c(85L, 67L, 55L, 40L,
27L, 85L, 12L), New_Column = c(45L, 67L, 35L, 40L, 27L, 23L,
12L)), class = "data.frame", row.names = c(NA, -7L))
Pretend that the 4th column is not there. I need to write a script that, based on the value in the Factor column, takes a value from either Col_A or Col_B and puts it in New_Column. If the value in Factor is 0, it should take the value from Col_A, so the value of New_Column in the first row is 45.
A base solution:
df$New_Column <- ifelse(df$Factor == 0, df$Col_A, df$Col_B)
We can use row/column indexing to get the values:
df$New_Column <- df[2:3][cbind(seq_len(nrow(df)), df$Factor + 1)]
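To see why this works, a quick sketch: cbind() builds a two-column index matrix of (row, column) pairs, and Factor + 1 maps 0 to column 1 (Col_A) and 1 to column 2 (Col_B):
idx <- cbind(seq_len(nrow(df)), df$Factor + 1)
head(idx, 3)
#      [,1] [,2]
# [1,]    1    1
# [2,]    2    2
# [3,]    3    1
df[2:3][idx]
# [1] 45 67 35 40 27 23 12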

Looping and clustering

I have to admit this is too hard for me to do on my own. I have to analyze some data and this step is crucial for me.
Data which I want to analyze:
> dput(tbl_clustering)
structure(list(P1 = structure(c(14L, 14L, 6L, 6L, 6L, 19L, 15L,
13L, 13L, 13L, 13L, 10L, 10L, 6L, 6L, 10L, 27L, 27L, 27L, 27L,
27L, 22L, 22L, 22L, 21L, 21L, 21L, 27L, 27L, 27L, 27L, 21L, 21L,
21L, 28L, 28L, 25L, 25L, 25L, 29L, 29L, 17L, 17L, 17L, 5L, 5L,
5L, 5L, 20L, 20L, 23L, 23L, 23L, 23L, 7L, 26L, 26L, 24L, 24L,
24L, 24L, 3L, 3L, 3L, 9L, 8L, 2L, 11L, 11L, 11L, 11L, 11L, 12L,
12L, 4L, 4L, 4L, 1L, 1L, 1L, 18L, 18L, 18L, 18L, 18L, 18L, 18L,
18L, 18L, 18L, 18L, 16L, 16L, 16L, 16L, 16L, 16L, 16L), .Label = c("AT1G09130",
"AT1G09620", "AT1G10760", "AT1G14610", "AT1G43170", "AT1G58080",
"AT2G27680", "AT2G27710", "AT3G03710", "AT3G05590", "AT3G11510",
"AT3G56130", "AT3G58730", "AT3G61540", "AT4G03520", "AT4G22930",
"AT4G33030", "AT5G01600", "AT5G04710", "AT5G17990", "AT5G19220",
"AT5G43940", "AT5G63310", "ATCG00020", "ATCG00380", "ATCG00720",
"ATCG00770", "ATCG00810", "ATCG00900"), class = "factor"), P2 = structure(c(55L,
54L, 29L, 4L, 70L, 72L, 18L, 9L, 58L, 68L, 19L, 6L, 1L, 16L,
34L, 32L, 77L, 12L, 61L, 41L, 71L, 73L, 50L, 11L, 69L, 22L, 60L,
42L, 47L, 45L, 59L, 30L, 24L, 23L, 77L, 45L, 12L, 47L, 59L, 82L,
75L, 40L, 26L, 83L, 81L, 47L, 36L, 45L, 2L, 65L, 11L, 38L, 13L,
31L, 53L, 78L, 7L, 80L, 79L, 7L, 76L, 17L, 10L, 3L, 68L, 51L,
48L, 62L, 58L, 64L, 68L, 74L, 63L, 14L, 57L, 33L, 56L, 39L, 52L,
35L, 43L, 25L, 27L, 21L, 15L, 5L, 49L, 37L, 66L, 20L, 44L, 69L,
22L, 67L, 57L, 8L, 46L, 28L), .Label = c("AT1G01090", "AT1G02150",
"AT1G03870", "AT1G09795", "AT1G13060", "AT1G14320", "AT1G15820",
"AT1G17745", "AT1G20630", "AT1G29880", "AT1G29990", "AT1G43170",
"AT1G52340", "AT1G52670", "AT1G56450", "AT1G59900", "AT1G69830",
"AT1G75330", "AT1G78570", "AT2G05840", "AT2G28000", "AT2G34590",
"AT2G35040", "AT2G37020", "AT2G40300", "AT2G42910", "AT2G44050",
"AT2G44350", "AT2G45440", "AT3G01500", "AT3G03980", "AT3G04840",
"AT3G07770", "AT3G13235", "AT3G14415", "AT3G18740", "AT3G22110",
"AT3G22480", "AT3G22960", "AT3G51840", "AT3G54210", "AT3G54400",
"AT3G56090", "AT3G60820", "AT4G00100", "AT4G00570", "AT4G02770",
"AT4G11010", "AT4G14800", "AT4G18480", "AT4G20760", "AT4G26530",
"AT4G28750", "AT4G30910", "AT4G30920", "AT4G33760", "AT4G34200",
"AT5G02500", "AT5G02960", "AT5G10920", "AT5G12250", "AT5G13120",
"AT5G16390", "AT5G18380", "AT5G35360", "AT5G35590", "AT5G35630",
"AT5G35790", "AT5G48300", "AT5G52100", "AT5G56030", "AT5G60160",
"AT5G64300", "AT5G67360", "ATCG00160", "ATCG00270", "ATCG00380",
"ATCG00540", "ATCG00580", "ATCG00680", "ATCG00750", "ATCG00820",
"ATCG01110"), class = "factor"), No_Interactions = c(8L, 5L,
5L, 9L, 7L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 5L, 8L, 6L,
5L, 5L, 5L, 5L, 5L, 5L, 10L, 6L, 6L, 5L, 5L, 5L, 5L, 8L, 5L,
5L, 7L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 5L,
6L, 5L, 5L, 6L, 5L, 5L, 6L, 5L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 6L,
5L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 7L,
8L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 7L, 5L, 5L,
6L)), .Names = c("P1", "P2", "No_Interactions"), class = "data.frame", row.names = c(NA,
-98L))
To explain better what I want to achieve, I will paste some rows here:
P1 P2 No_Interactions
1 AT3G61540 AT4G30920 8
2 AT3G61540 AT4G30910 5
3 AT1G58080 AT2G45440 5
4 AT1G58080 AT1G09795 9
5 AT1G58080 AT5G52100 7
6 AT5G04710 AT5G60160 6
7 AT4G03520 AT1G75330 5
8 AT3G58730 AT1G20630 5
9 AT3G58730 AT5G02500 5
10 AT3G58730 AT5G35790 5
First of all, a new column Cluster has to be created. Next we focus only on the two columns P1 and P2. As you can see, the first row holds two names, AT3G61540 and AT4G30920, and that's our starting point (a loop, I believe, will be necessary). We put the number 1 in the Cluster column. Then we take the first name, AT3G61540, and scan through both columns P1 and P2; wherever we find this name again, paired with a name other than the one in the first row, we put 1 in Cluster as well. Next we take the second name from the first row, AT4G30920, and do the same screening through the whole data.
The next step is to analyze the next row and do exactly the same. In this case the next row has exactly the same name in P1, which means we don't need to screen for it again, but the second name, AT4G30910, is different, so we should screen with that one. Note that this row should be in cluster 1 as well. Cluster 2 starts with the third row, because there we have a completely new pair of names.
I am aware that this is not an easy task and it probably has to be done in a couple of steps.
EDIT:
The output I would like to get:
P1 P2 No_Interactions Cluster
1 AT3G61540 AT4G30920 8 1
2 AT3G61540 AT4G30910 5 1
3 AT1G58080 AT2G45440 5 2
4 AT1G58080 AT1G09795 9 2
5 AT1G58080 AT5G52100 7 2
6 AT5G04710 AT5G60160 6 3
7 AT5G52100 AT1G75330 5 2 ### Cluster 2 because AT5G52100 was found in the row number 5 as a partner of AT1G58080
8 AT3G58730 AT1G20630 5 5
9 AT3G58730 AT5G02500 5 5
10 AT3G58730 AT3G61540 5 1 ## Cluster 1 because AT3G61540 was found in first row.
I corrected my initial answer and propose a functional programming approach, using Map and recursion to find your clusters:
library(magrittr)
similar = function(u, v) if (length(intersect(u, v)) == 0) FALSE else TRUE
clusterify = function(df)
{
  clusters = df$cluster
  if (!any(clusters == 0)) return(df)
  idx = pmatch(0, clusters)                                  # first unclustered row
  lst = Map(c, as.character(df[,1]), as.character(df[,2]))   # name pair per row
  el = c(as.character(df[idx, 1]), as.character(df[idx, 2]))
  K = lst %>%
    sapply(similar, v = el) %>%
    add(0)                                                   # 1 where a row shares a name with el
  mask = clusters != 0 & K == 1
  if (any(mask))
  {
    cl = min(clusters[mask])
    df[K == 1, ]$cluster = cl
  }
  else
  {
    df[K == 1, ]$cluster = max(clusters) + 1
  }
  clusterify(df)
}
You can use it with clusterify(transform(df, cluster = 0)).
For example, the clustering operates correctly on your data; take cluster 9 (you can check the other clusters):
subset(clusterify(transform(df, cluster=0)), cluster==9)
# P1 P2 No_Interactions cluster
#25 AT5G19220 AT5G48300 10 9
#26 AT5G19220 AT2G34590 6 9
#27 AT5G19220 AT5G10920 6 9
#32 AT5G19220 AT3G01500 8 9
#33 AT5G19220 AT2G37020 5 9
#34 AT5G19220 AT2G35040 5 9
#92 AT4G22930 AT5G48300 5 9
#93 AT4G22930 AT2G34590 5 9
#94 AT4G22930 AT5G35630 5 9
#95 AT4G22930 AT4G34200 7 9
#96 AT4G22930 AT1G17745 5 9
#97 AT4G22930 AT4G00570 5 9
#98 AT4G22930 AT2G44350 6 9
You could use the igraph library to make an undirected graph in which you cluster connected components:
library('igraph')
# make graph and cluster
g = graph.data.frame(tbl_clustering[,c('P1', 'P2')], directed=FALSE)
c = clusters(g)
# append cluster number to original data
tbl_clustering$cluster = sapply(as.vector(tbl_clustering$P1), function(x)c$membership[x])
This assigns clusters to the entries (here the first rows):
> head(tbl_clustering, 8)
P1 P2 No_Interactions cluster
1 AT3G61540 AT4G30920 8 1
2 AT3G61540 AT4G30910 5 1
3 AT1G58080 AT2G45440 5 2
4 AT1G58080 AT1G09795 9 2
5 AT1G58080 AT5G52100 7 2
6 AT5G04710 AT5G60160 6 3
7 AT4G03520 AT1G75330 5 4
8 AT3G58730 AT1G20630 5 5
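As a side note, newer igraph releases (an assumption about the installed version) rename these functions; the same answer, with the membership vector indexed by vertex name directly, reads:
g <- graph_from_data_frame(tbl_clustering[, c('P1', 'P2')], directed = FALSE)
tbl_clustering$cluster <- components(g)$membership[as.character(tbl_clustering$P1)]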
I believe you want to divide your data set into equivalence classes. I have an implementation based on an algorithm in Numerical Recipes; the code is included below. It can be used as follows (data here is your tbl_clustering data frame):
source("equivalence.R")
ids <- unique(c(levels(data[[1]]), levels(data[[2]])))
classes <- equivalence(ids, data[1:2])
data$class <- classes[match(data$P1, ids)]
equivalence.R
library(Rcpp)
Rcpp::sourceCpp('equivalence.cpp')

equivalence <- function(x, rules) {
  tmp <- unique(x)
  tmp <- tmp[!is.na(tmp)]
  a <- match(rules[[1]], tmp)
  b <- match(rules[[2]], tmp)
  sel <- !is.na(a) & !is.na(b)
  if (any(!sel)) {
    warning("Not all values in rules are present in x.")
    a <- a[sel]
    b <- b[sel]
  }
  res <- c_equivalence(as.integer(a) - 1L, as.integer(b) - 1L,
                       as.integer(length(tmp)))
  res[match(x, tmp)] + 1L
}
equivalence.cpp
#include <R.h>
#include <Rinternals.h>
#include <string>

extern "C" {

// [[Rcpp::export]]
SEXP c_equivalence(SEXP ra, SEXP rb, SEXP rn) {
  try {
    if (LENGTH(ra) != LENGTH(rb))
      throw std::string("Lengths of a and b do not match.");
    int* a = INTEGER(ra);
    int* b = INTEGER(rb);
    int m = LENGTH(ra);
    int n = INTEGER(rn)[0];
    SEXP classes = PROTECT(allocVector(INTSXP, n));
    int* cls = INTEGER(classes);
    // Initialize each element as its own class.
    for (int k = 0; k < n; k++) cls[k] = k;
    // For each piece of input information...
    for (int l = 0; l < m; l++) {
      // Track first element up to its ancestor.
      int j = a[l];
      while (cls[j] != j) j = cls[j];
      // Track second element up to its ancestor.
      int k = b[l];
      while (cls[k] != k) k = cls[k];
      // If they are not already related, make them so.
      if (j != k) {
        cls[j] = k;
      }
    }
    // Final sweep up to highest ancestors.
    for (int j = 0; j < n; j++) {
      while (cls[j] != cls[cls[j]]) cls[j] = cls[cls[j]];
    }
    UNPROTECT(1);
    return classes;
  } catch (const std::string& e) {
    error(e.c_str());
    return R_NilValue;
  } catch (...) {
    error("Uncaught exception.");
    return R_NilValue;
  }
}

}
Okay, here is a new answer, which goes some of the way. Again, dat is the data frame.
Cluster <- rep(NA, length(dat[, 1]))  # initialise
for (r in 1:length(Cluster)) {
  if (identical(as.numeric(r), 1)) {
    # start the match matrix with the first row's pair of names
    Cmatches <- matrix(c(as.character(dat[1, 1]), as.character(dat[1, 2])), 2, 1)
  }
  matched <- FALSE
  for (cl in 1:length(Cmatches[1, ])) {
    if (sum(c(as.character(dat[r, 1]), as.character(dat[r, 2])) %in% Cmatches[, cl]) != 0) {
      # add P1 and P2 from this row to the cluster which it matches
      Cmatches <- rbind(Cmatches,
                        matrix(c(if (cl != 1) {rep("", (cl - 1) * 2)} else {character(0)},
                                 as.character(dat[r, 1]), as.character(dat[r, 2]),
                                 if (cl != length(Cmatches[1, ])) {rep("", (length(Cmatches[1, ]) - cl) * 2)} else {character(0)}),
                               2, length(Cmatches[1, ]), byrow = FALSE))
      matched <- TRUE
      Cluster[r] <- cl
    }
  }
  if (!matched) {
    # add a new cluster, because the row doesn't match any existing one
    Cmatches <- cbind(Cmatches,
                      c(as.character(dat[r, 1]), as.character(dat[r, 2]),
                        rep("", length(Cmatches[, 1]) - 2)))
    Cluster[r] <- length(Cmatches[1, ])
  }
}
After this, you would take the Cmatches matrix and check for matches between the clusters using if(sum(match(Cmatches[, cl1], Cmatches[, cl2], incomparables = ""), na.rm = TRUE) != 0) (where cl1 and cl2 are the cluster numbers to be compared). If that test is true, then those clusters should be grouped.
Cmatched <- rep(NA, length(Cmatches[1, ]))
for (cl in 1:(length(Cmatches[1, ]) - 1)) {
  for (cl2 in (cl + 1):length(Cmatches[1, ])) {
    if (sum(match(Cmatches[, cl], Cmatches[, cl2], incomparables = ""), na.rm = TRUE) != 0) {
      if (is.na(Cmatched[cl])) {
        Cluster[Cluster == cl2] <- cl
        Cmatched[cl2] <- cl
      } else {
        Cluster[Cluster == cl2] <- Cmatched[cl]
        Cmatched[cl2] <- cl
      }
    }
  }
}
And I think that there is your answer. Then just dat <- cbind(dat, Cluster).
It sounds like you want to do categorical clustering. You should look into k-modes clustering, which is an extension of k-means. The k-modes algorithm mirrors the steps of k-means. Here is the outline provided in the paper:
Randomly select k unique objects as the initial cluster centers (modes).
Calculate the distances between each object and the cluster mode; assign the object to the cluster whose center has the shortest distance to the object; repeat this step until all objects are assigned to clusters.
Select a new mode for each cluster and compare it with the previous mode. If different, go back to Step 2; otherwise, stop.
There are other issues discussed like k-prototypes (for mixing categorical with numerical data), fuzzy k-modes (for assigning cluster membership), and initialization of k-modes.
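If you want to try k-modes directly in R, a minimal sketch using the kmodes() function from the klaR package (assuming it is installed; the choice of 3 modes here is arbitrary):
library(klaR)
km <- kmodes(tbl_clustering[, c("P1", "P2")], modes = 3)
tbl_clustering$kmodes_cluster <- km$cluster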

Errors when exporting plots to Plot.ly

I have this data (sample of the first 20 rows):
Codering variable value
1 Z1 Week.0 0
2 Z2 Week.0 0
3 Z3 Week.0 0
4 Z4 Week.0 0
5 Z5 Week.0 0
6 Z6 Week.0 0
7 Z7 Week.0 0
8 Z8 Week.0 0
9 Z9 Week.0 0
10 Z101 Week.0 NA
11 Z102 Week.0 NA
12 Z1 Week.1 0
13 Z2 Week.1 0
14 Z3 Week.1 0
15 Z4 Week.1 0
16 Z5 Week.1 0
17 Z6 Week.1 0
18 Z7 Week.1 0
19 Z8 Week.1 0
and I plot it using:
pZ <- ggplot(zmeltdata, aes(x=variable, y=value, color=Codering, group=Codering)) +
  geom_line() +
  geom_point() +
  theme_few() +
  theme(legend.position="right") +
  scale_color_hue(name = "Treatment group:") +
  scale_y_continuous(labels = percent) +
  ylab("Germination percentage") +
  xlab("Week number") +
  labs(title = "Z. monophyllum germination data")
pZ
The graph displays just fine:
Yet when I want to export this to Plot.ly I get the following errors:
> py <- plotly()
> response<-py$ggplotly(pZ)
Error in if (all(xcomp) && all(ycomp)) { :
missing value where TRUE/FALSE needed
In addition: Warning message:
In trace.list[[lind[1]]]$y == trace.list[[lind[2]]]$y :
longer object length is not a multiple of shorter object length
And I have searched for these errors, yet the explanations thoroughly confuse me. "Missing value where TRUE/FALSE needed" is supposed to occur if you use logical terms like IF/ELSE/TRUE/FALSE in your process, which I don't at all! Even when checking for any NAs in the plot object I get:
> is.na(pZ)
data layers scales mapping theme coordinates facet plot_env labels
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
and the 'longer object length is not a multiple of shorter object length' warning is supposed to pop up when you have objects of different lengths, but I'm only using one object whose three columns all have exactly the same length. The plot object does give me NULL when I ask for its rows, but that is supposed to happen:
> nrow(zmeltdata)
[1] 143
> nrow(test)
NULL
All in all, I'm very confused and don't know how to correctly handle these errors, could someone elaborate?
Thanks for your time.
EDIT: I have tried to export a different graph to Plot.ly using a random sample of 1:100, and that worked just fine. I'm pretty sure the error is in my data; I just can't figure out how to fix it.
EDIT2: In response to @Gregor:
> dput(head(zmeltdata, 20))
structure(list(Codering = structure(c(16L, 19L, 20L, 21L, 22L,
23L, 24L, 25L, 26L, 17L, 18L, 16L, 19L, 20L, 21L, 22L, 23L, 24L,
25L, 26L), .Label = c("B1", "C2", "C3", "C8", "M1", "M101", "M102",
"M2", "M3", "M4", "M5", "M6", "M7", "M8", "M9", "Z1", "Z101",
"Z102", "Z2", "Z3", "Z4", "Z5", "Z6", "Z7", "Z8", "Z9"), class = "factor"),
variable = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Week.0",
"Week.1", "Week.2", "Week.3", "Week.4", "Week.5", "Week.6",
"Week.7", "Week.8", "Week.9", "Week.10", "Week.11", "Week.12"
), class = "factor"), value = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("Codering",
"variable", "value"), row.names = c(NA, 20L), class = "data.frame")
And the tail:
> dput(tail(zmeltdata, 43))
structure(list(Codering = structure(c(19L, 20L, 21L, 22L, 23L,
24L, 25L, 26L, 17L, 18L, 16L, 19L, 20L, 21L, 22L, 23L, 24L, 25L,
26L, 17L, 18L, 16L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 17L,
18L, 16L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 17L, 18L), .Label = c("B1",
"C2", "C3", "C8", "M1", "M101", "M102", "M2", "M3", "M4", "M5",
"M6", "M7", "M8", "M9", "Z1", "Z101", "Z102", "Z2", "Z3", "Z4",
"Z5", "Z6", "Z7", "Z8", "Z9"), class = "factor"), variable = structure(c(10L,
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L,
11L, 11L, 11L, 11L, 11L, 11L, 11L, 12L, 12L, 12L, 12L, 12L, 12L,
12L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L,
13L, 13L, 13L), .Label = c("Week.0", "Week.1", "Week.2", "Week.3",
"Week.4", "Week.5", "Week.6", "Week.7", "Week.8", "Week.9", "Week.10",
"Week.11", "Week.12"), class = "factor"), value = c(0.1, 0.06,
0.05, 0.09, 0.04, 0.08, 0.05, 0.08, 0, 0, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("Codering",
"variable", "value"), row.names = 101:143, class = "data.frame")
I am not at all surprised by these; there are quite a few NAs in the dataset, but they shouldn't be an issue, since I have used a similar (bigger) dataset before.
And I also have the .csv file for you to use if you wish: https://www.mediafire.com/?jij1vlp14a29ntt
The issue is about handling NA's... I got https://plot.ly/~marianne2/417/z-monophyllum-germination-data/ by running the following code:
pZ <- ggplot(na.omit(zmeltdata), aes(x=variable, y=value, color=Codering,
group=Codering)) +
geom_line() +
geom_point() +
# theme_few() +
theme(legend.position="right") +
scale_color_hue(name="Treatment group:") +
# scale_y_continuous(labels = percent) +
ylab("Germination percentage") +
xlab("Week number") +
labs(title="Z. monophyllum germination data")
py$ggplotly(pZ, kwargs=list(fileopt="overwrite", filename="test_zdata"))
Note that I had to comment out theme_few() and scale_y_continuous(labels = percent) because from loading only "ggplot2", I would get the following errors:
Error: could not find function "theme_few"
and
Error in structure(list(call = match.call(), aesthetics = aesthetics, :
object 'percent' not found
respectively. I guess these are dependency issues: theme_few() comes from the "ggthemes" package and percent from "scales".
I don't know what kind of magic theme_few() does, but if I don't use na.omit() on zmeltdata, my pZ plot looks like this:
Eww, "Week.10" comes after "Week.1" instead of after "Week.9"... So you wouldn't want to send this to plotly anyway! So I cannot exactly reproduce your ggplot example. But I wonder if you really want to keep these NA's (the CSV itself reads "NA", I was expecting blank "cells"). Don't you want to pre-process these anyway?
Note that I get the following warning message when I don't use na.omit() on zmeltdata:
Warning messages:
1: Removed 20 rows containing missing values (geom_path).
2: Removed 47 rows containing missing values (geom_point).
Again, beyond pure displaying/plotting considerations, since this looks like scientific data, wouldn't you want to number weeks with actual numbers, or pad the digits if you really want a string? ("Week.01", "Week.02", etc.)
And it looks like the missing data is all trailing... There's just no data (yet) for weeks 10+, right?
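As a sketch of that kind of pre-processing (assuming every label follows the "Week.N" pattern), you could recover the week numbers and reorder the factor levels so "Week.10" sorts after "Week.9":
wk <- as.integer(sub("Week\\.", "", as.character(zmeltdata$variable)))
zmeltdata$variable <- factor(as.character(zmeltdata$variable),
                             levels = paste0("Week.", sort(unique(wk))))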
Thanks for reporting,
Marianne

Adding legend and structuring data for ggplot

In the data included below I have three sites (AAA,BBB,CCC) and individuals within each site (7, 12, 7 respectively). For each individual I have observed values (ObsValues) and three sets of predicted values each with a standard error. I have 26 rows (i.e. 26 individuals) and 9 columns.
The data is included here through dput()
help <- structure(list(StudyArea = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L, 3L, 3L, 3L), .Label = c("AAA", "BBB", "CCC"), class = "factor"),
Ind = structure(1:26, .Label = c("AAA_F01", "AAA_F17", "AAA_F33",
"AAA_F49", "AAA_F65", "AAA_F81", "AAA_F97", "BBB_P01", "BBB_P02",
"BBB_P03", "BBB_P04", "BBB_P05", "BBB_P06", "BBB_P07", "BBB_P08",
"BBB_P09", "BBB_P10", "BBB_P11", "BBB_P12", "CCC_F02", "CCC_F03",
"CCC_F04", "CCC_F05", "CCC_F06", "CCC_F07", "CCC_F08"), class = "factor"),
ObsValues = c(22L, 50L, 8L, 15L, 54L, 30L, 11L, 90L, 6L,
53L, 9L, 42L, 72L, 40L, 60L, 58L, 1L, 20L, 37L, 2L, 50L,
68L, 20L, 19L, 58L, 5L), AAAPred = c(28L, 52L, 6L, 15L, 35L,
31L, 13L, 79L, 6L, 58L, 5L, 42L, 88L, 49L, 68L, 60L, 1L,
26L, 46L, 0L, 34L, 71L, 20L, 15L, 35L, 5L), AAAPredSE = c(3.5027829,
4.7852191, 1.231803, 2.5244013, 4.873907, 3.8854192, 2.3532752,
6.3444402, 1.7387295, 5.605111, 1.667818, 4.4709107, 7.0437967,
5.447496, 6.0840486, 5.4371275, 0.8156916, 3.5153847, 4.698754,
0, 3.8901103, 5.993616, 3.1720272, 2.6777869, 4.5647313,
1.4864128), BBBPred = c(14L, 43L, 5L, 13L, 26L, 32L, 14L,
80L, 5L, 62L, 4L, 44L, 67L, 44L, 55L, 42L, 1L, 20L, 47L,
0L, 26L, 51L, 15L, 16L, 34L, 6L), BBBPredSE = c(3.1873435,
4.8782831, 1.3739863, 2.5752273, 4.4155679, 3.8102168, 2.3419518,
6.364606, 1.7096028, 5.6333421, 1.5861323, 4.4951428, 6.6046699,
5.302902, 5.9244328, 5.1887055, 0.8268689, 3.4014041, 4.6600598,
0, 3.8510512, 5.5776686, 3.0569531, 2.6358433, 4.5273782,
1.4263518), CCCPred = c(29L, 53L, 7L, 15L, 44L, 32L, 15L,
86L, 8L, 61L, 5L, 46L, 99L, 54L, 74L, 67L, 1L, 30L, 51L,
1L, 37L, 94L, 21L, 17L, 36L, 6L), CCCPredSE = c(3.4634488,
4.7953389, 0.9484051, 2.5207022, 5.053452, 3.8072731, 2.2764727,
6.3605968, 1.6044067, 5.590048, 1.6611899, 4.4183913, 7.0124638,
5.6495918, 6.1091934, 5.4797929, 0.8135164, 3.4353934, 4.6261147,
0.8187396, 3.7936333, 5.6512378, 3.1686123, 2.633179, 4.5841921,
1.3989955)), .Names = c("StudyArea", "Ind", "ObsValues",
"AAAPred", "AAAPredSE", "BBBPred", "BBBPredSE", "CCCPred", "CCCPredSE"
), class = "data.frame", row.names = c(NA, -26L))
The head() and dim() of help are below too
head(help)
StudyArea Ind ObsValues AAAPred AAAPredSE BBBPred BBBPredSE CCCPred CCCPredSE
1 AAA AAA_F01 22 28 3.502783 14 3.187343 29 3.4634488
2 AAA AAA_F17 50 52 4.785219 43 4.878283 53 4.7953389
3 AAA AAA_F33 8 6 1.231803 5 1.373986 7 0.9484051
4 AAA AAA_F49 15 15 2.524401 13 2.575227 15 2.5207022
5 AAA AAA_F65 54 35 4.873907 26 4.415568 44 5.0534520
6 AAA AAA_F81 30 31 3.885419 32 3.810217 32 3.8072731
> dim(help)
[1] 26 9
I am a relative newcomer to ggplot and am trying to make a plot that displays the observed and predicted values for each individual, with a different color for each StudyArea. I can manually add points and force the color with the code below; however, this feels rather clunky and also does not produce a legend, as I have not specified color in aes().
require(ggplot2)
ggplot(help, aes(x=Ind, y=ObsValues)) +
  geom_point(color="red", pch = "*", cex = 10) +
  geom_point(aes(y = AAAPred), color="blue") +
  geom_errorbar(aes(ymin=AAAPred-AAAPredSE, ymax=AAAPred+AAAPredSE), color = "blue") +
  geom_point(aes(y = BBBPred), color="darkgreen") +
  geom_errorbar(aes(ymin=BBBPred-BBBPredSE, ymax=BBBPred+BBBPredSE), color = "darkgreen") +
  geom_point(aes(y = CCCPred), color="black") +
  geom_errorbar(aes(ymin=CCCPred-CCCPredSE, ymax=CCCPred+CCCPredSE), color = "black") +
  theme(axis.text.x=element_text(angle=30, hjust=1))
In the figure above, the asterisks are the observed values and the points are the predicted values, one from each StudyArea.
I tried to melt() the data, but ran into more problems plotting. That being said, I suspect melt()ing or reshape()ing is the best option.
Any suggestions on how to best alter/restructure the help data so that I can plot the observed and predicted values for each individual with a different color for each StudyArea would be greatly appreciated.
I also hope to produce a legend - the likely default once the data is correctly formatted
Note: Indeed, the resulting figure is very busy; it will likely be simplified once I get a better handle on ggplot.
thanks in advance.
Try this:
library(reshape2)
x.value <- melt(help, id.vars=1:3, measure.vars=c(4,6,8))
x.se <- melt(help, id.vars=1:3, measure.vars=c(5,7,9))
gg <- data.frame(x.value, se=x.se$value)
ggplot(gg) +
  geom_point(aes(x=Ind, y=ObsValues), size=5, shape=18) +
  geom_point(aes(x=Ind, y=value, color=variable), size=3, shape=1) +
  geom_errorbar(aes(x=Ind, ymin=value-se, ymax=value+se, color=variable)) +
  theme(axis.text.x=element_text(angle=-90))
Produces this:
Edit: Response to @B.Davis' questions below:
You have to group the ObsValues by StudyArea, not variable. But when you do that you get six colors, three for StudyArea and three for the predictor groups (variable). If we give the predictor groups (e.g., AAAPred, etc.) the same names as the StudyArea groups (e.g. AAA, etc.), then ggplot just generates three colors.
gg$variable <- substring(gg$variable, 1, 3)  # removes "Pred" from group names
ggplot(gg) +
  geom_point(aes(x=Ind, y=ObsValues, color=StudyArea), size=5, shape=18) +
  geom_point(aes(x=Ind, y=value, color=variable), size=3, shape=1) +
  geom_errorbar(aes(x=Ind, ymin=value-se, ymax=value+se, color=variable)) +
  theme(axis.text.x=element_text(angle=-90))
Produces this:
Similar to @jlhoward's solution, but I chose to treat ObsValues as a variable to get it in the legend.
help <- dat  # dat holds the dput() data from the question
x.value <- melt(help, id.vars=1:2, measure.vars=c(3,4,6,8))
x.se <- melt(help, id.vars=1:2, measure.vars=c(3,5,7,9))
gg <- data.frame(x.value, se=x.se$value)
ggplot(gg) +
  geom_point(aes(x=Ind, y=value, color=variable), size=3, shape=1) +
  geom_errorbar(data = subset(gg, variable != 'ObsValues'),
                aes(x=Ind, ymin=value-se, ymax=value+se, color=variable)) +
  theme(axis.text.x=element_text(angle=-90))
This is a little clumsy, but gets you what you want:
# jlhoward's melting is more elegant.
require(reshape2)
require(plyr)  # for rename()
melted.points <- melt(help[, c('Ind','ObsValues','AAAPred','BBBPred','CCCPred')])
melted.points$observed <- ifelse(melted.points$variable == 'ObsValues', 'observed', 'predicted')
melted.points.se <- melt(help[, c('Ind','AAAPredSE','BBBPredSE','CCCPredSE')])
melted.points.se$variable <- gsub('SE', '', melted.points.se$variable)
help2 <- merge(melted.points, melted.points.se, by=c('Ind','variable'), all.x=TRUE)
help2 <- rename(help2, c(value.x='value', value.y='se'))
And now the actual plot:
ggplot(help2, aes(x=Ind, y=value, color=variable, size=observed, shape=observed,
                  ymin=value-se, ymax=value+se)) +
  geom_point() +
  geom_errorbar(size=1) +
  scale_colour_manual(values = c("red","blue","darkgreen","black")) +
  scale_size_manual(values=c(observed=4, predicted=3)) +
  scale_shape_manual(values=c(observed=8, predicted=16))
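As a closing side note, here is a hedged sketch of the same reshaping with tidyr::pivot_longer (assuming tidyr >= 1.0 is available), which pairs each prediction with its SE in one call; the column name PredArea is illustrative:
library(tidyr)
long <- pivot_longer(help,
                     cols = AAAPred:CCCPredSE,
                     names_to = c("PredArea", ".value"),
                     names_pattern = "^(AAA|BBB|CCC)(Pred|PredSE)$")
head(long)  # one row per individual and prediction area, with Pred and PredSE side by side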
