Looping and clustering

Looping and clustering - r

I have to admit this's too hard for me to do it on my own. I have to analyze some data and this step is crucial for me.
Data which I want to analyze:
> dput(tbl_clustering)
structure(list(P1 = structure(c(14L, 14L, 6L, 6L, 6L, 19L, 15L,
13L, 13L, 13L, 13L, 10L, 10L, 6L, 6L, 10L, 27L, 27L, 27L, 27L,
27L, 22L, 22L, 22L, 21L, 21L, 21L, 27L, 27L, 27L, 27L, 21L, 21L,
21L, 28L, 28L, 25L, 25L, 25L, 29L, 29L, 17L, 17L, 17L, 5L, 5L,
5L, 5L, 20L, 20L, 23L, 23L, 23L, 23L, 7L, 26L, 26L, 24L, 24L,
24L, 24L, 3L, 3L, 3L, 9L, 8L, 2L, 11L, 11L, 11L, 11L, 11L, 12L,
12L, 4L, 4L, 4L, 1L, 1L, 1L, 18L, 18L, 18L, 18L, 18L, 18L, 18L,
18L, 18L, 18L, 18L, 16L, 16L, 16L, 16L, 16L, 16L, 16L), .Label = c("AT1G09130",
"AT1G09620", "AT1G10760", "AT1G14610", "AT1G43170", "AT1G58080",
"AT2G27680", "AT2G27710", "AT3G03710", "AT3G05590", "AT3G11510",
"AT3G56130", "AT3G58730", "AT3G61540", "AT4G03520", "AT4G22930",
"AT4G33030", "AT5G01600", "AT5G04710", "AT5G17990", "AT5G19220",
"AT5G43940", "AT5G63310", "ATCG00020", "ATCG00380", "ATCG00720",
"ATCG00770", "ATCG00810", "ATCG00900"), class = "factor"), P2 = structure(c(55L,
54L, 29L, 4L, 70L, 72L, 18L, 9L, 58L, 68L, 19L, 6L, 1L, 16L,
34L, 32L, 77L, 12L, 61L, 41L, 71L, 73L, 50L, 11L, 69L, 22L, 60L,
42L, 47L, 45L, 59L, 30L, 24L, 23L, 77L, 45L, 12L, 47L, 59L, 82L,
75L, 40L, 26L, 83L, 81L, 47L, 36L, 45L, 2L, 65L, 11L, 38L, 13L,
31L, 53L, 78L, 7L, 80L, 79L, 7L, 76L, 17L, 10L, 3L, 68L, 51L,
48L, 62L, 58L, 64L, 68L, 74L, 63L, 14L, 57L, 33L, 56L, 39L, 52L,
35L, 43L, 25L, 27L, 21L, 15L, 5L, 49L, 37L, 66L, 20L, 44L, 69L,
22L, 67L, 57L, 8L, 46L, 28L), .Label = c("AT1G01090", "AT1G02150",
"AT1G03870", "AT1G09795", "AT1G13060", "AT1G14320", "AT1G15820",
"AT1G17745", "AT1G20630", "AT1G29880", "AT1G29990", "AT1G43170",
"AT1G52340", "AT1G52670", "AT1G56450", "AT1G59900", "AT1G69830",
"AT1G75330", "AT1G78570", "AT2G05840", "AT2G28000", "AT2G34590",
"AT2G35040", "AT2G37020", "AT2G40300", "AT2G42910", "AT2G44050",
"AT2G44350", "AT2G45440", "AT3G01500", "AT3G03980", "AT3G04840",
"AT3G07770", "AT3G13235", "AT3G14415", "AT3G18740", "AT3G22110",
"AT3G22480", "AT3G22960", "AT3G51840", "AT3G54210", "AT3G54400",
"AT3G56090", "AT3G60820", "AT4G00100", "AT4G00570", "AT4G02770",
"AT4G11010", "AT4G14800", "AT4G18480", "AT4G20760", "AT4G26530",
"AT4G28750", "AT4G30910", "AT4G30920", "AT4G33760", "AT4G34200",
"AT5G02500", "AT5G02960", "AT5G10920", "AT5G12250", "AT5G13120",
"AT5G16390", "AT5G18380", "AT5G35360", "AT5G35590", "AT5G35630",
"AT5G35790", "AT5G48300", "AT5G52100", "AT5G56030", "AT5G60160",
"AT5G64300", "AT5G67360", "ATCG00160", "ATCG00270", "ATCG00380",
"ATCG00540", "ATCG00580", "ATCG00680", "ATCG00750", "ATCG00820",
"ATCG01110"), class = "factor"), No_Interactions = c(8L, 5L,
5L, 9L, 7L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 5L, 8L, 6L,
5L, 5L, 5L, 5L, 5L, 5L, 10L, 6L, 6L, 5L, 5L, 5L, 5L, 8L, 5L,
5L, 7L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 5L,
6L, 5L, 5L, 6L, 5L, 5L, 6L, 5L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 6L,
5L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 7L,
8L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 7L, 5L, 5L,
6L)), .Names = c("P1", "P2", "No_Interactions"), class = "data.frame", row.names = c(NA,
-98L))
To explain better what I want to achieve I will paste some rows over here:
P1 P2 No_Interactions
1 AT3G61540 AT4G30920 8
2 AT3G61540 AT4G30910 5
3 AT1G58080 AT2G45440 5
4 AT1G58080 AT1G09795 9
5 AT1G58080 AT5G52100 7
6 AT5G04710 AT5G60160 6
7 AT4G03520 AT1G75330 5
8 AT3G58730 AT1G20630 5
9 AT3G58730 AT5G02500 5
10 AT3G58730 AT5G35790 5
First of all the new column Cluster has to be created. Next we focus only on two columns P1 and P2. As you can see in first row we have two names AT3G61540 and AT4G30920 and that's our starting point (loop I believe will be necessary). We put the number 1 in Cluster column. Than we take first name AT3G61540 and scan through both columns P1 and P2 if we find this name once again somewhere with other name than in first row we put number 1 as well in Cluster. Next we take second name from first row AT4G30920 and do the same screening through whole data.
The next step will be to analyze next row and do exactly the same things. In that case in the next row we have exactly the same name for P1 that means we don't need to screen it but the second name AT4G30910 is different so would be great to screen with that one. The problem which appears here is that this row should be the cluster 1 as well. The cluster 2 starts with third row because we have completely new pair of names.
I am aware that's not so easy task and probably it has to be done in couple steps.
EDIT:
The output I would like to get:
P1 P2 No_Interactions Cluster
1 AT3G61540 AT4G30920 8 1
2 AT3G61540 AT4G30910 5 1
3 AT1G58080 AT2G45440 5 2
4 AT1G58080 AT1G09795 9 2
5 AT1G58080 AT5G52100 7 2
6 AT5G04710 AT5G60160 6 3
7 AT5G52100 AT1G75330 5 2 ### Cluster 2 because AT5G52100 was found in the row number 5 as a partner of AT1G58080
8 AT3G58730 AT1G20630 5 5
9 AT3G58730 AT5G02500 5 5
10 AT3G58730 AT3G61540 5 1 ## Cluster 1 because AT3G61540 was found in first row.

I corrected my initial answer and propose you a functional programming approach, using Map and recursion to find your clusters:
library(magrittr)
similar = function(u,v) if(length(intersect(u,v))==0) FALSE else TRUE
clusterify = function(df)
{
clusters = df$cluster
if(!any(clusters==0)) return(df)
idx = pmatch(0, clusters)
lst = Map(c, as.character(df[,1]), as.character(df[,2]))
el = c(as.character(df[idx, 1]), as.character(df[idx, 2]))
K = lst %>%
sapply(similar, v=el) %>%
add(0)
mask = if(any(clusters!=0 & K==1))
if(any(mask))
{
cl = min(clusters[mask])
df[K==1,]$cluster = cl
}
else
{
df[K==1,]$cluster = max(clusters) + 1
}
clusterify(df)
}
You can use it by clusterify(transform(df, cluster=0))
For example, the clustering operates correctly on your example, by taking cluster 9 (you can check other clusters):
subset(clusterify(transform(df, cluster=0)), cluster==9)
# P1 P2 No_Interactions cluster
#25 AT5G19220 AT5G48300 10 9
#26 AT5G19220 AT2G34590 6 9
#27 AT5G19220 AT5G10920 6 9
#32 AT5G19220 AT3G01500 8 9
#33 AT5G19220 AT2G37020 5 9
#34 AT5G19220 AT2G35040 5 9
#92 AT4G22930 AT5G48300 5 9
#93 AT4G22930 AT2G34590 5 9
#94 AT4G22930 AT5G35630 5 9
#95 AT4G22930 AT4G34200 7 9
#96 AT4G22930 AT1G17745 5 9
#97 AT4G22930 AT4G00570 5 9
#98 AT4G22930 AT2G44350 6 9

You could use library igraph to make an undirected graph in which you cluster connected composents:
library('igraph')
# make graph and cluster
g = graph.data.frame(tbl_clustering[,c('P1', 'P2')], directed=FALSE)
c = clusters(g)
# append cluster number to original data
tbl_clustering$cluster = sapply(as.vector(tbl_clustering$P1), function(x)c$membership[x])
This assigns clusters to the entries (here the first rows):
> head(tbl_clustering, 8)
P1 P2 No_Interactions cluster
1 AT3G61540 AT4G30920 8 1
2 AT3G61540 AT4G30910 5 1
3 AT1G58080 AT2G45440 5 2
4 AT1G58080 AT1G09795 9 2
5 AT1G58080 AT5G52100 7 2
6 AT5G04710 AT5G60160 6 3
7 AT4G03520 AT1G75330 5 4
8 AT3G58730 AT1G20630 5 5

I believe you want to divide your data set into equivalence classes. I have an implementation based on a algorithm in Numerical Recipes. I have included the code below. It can be used as follows:
source("equivalence.R")
ids <- unique(c(levels(data[[1]]), levels(data[[2]])))
classes <- equivalence(ids, data[1:2])
data$class <- classes[match(data$P1, ids)]
equivalence.R
library(Rcpp)
Rcpp::sourceCpp('equivalence.cpp')
equivalence <- function(x, rules) {
tmp <- unique(x)
tmp <- tmp[!is.na(tmp)]
a <- match(rules[[1]], tmp)
b <- match(rules[[2]], tmp)
sel <- !is.na(a) & !is.na(b)
if (any(!sel)) {
warning("Not all values in rules are present in x.")
a <- a[sel]
b <- b[sel]
}
res <- c_equivalence(as.integer(a)-1L, as.integer(b)-1L,
as.integer(length(tmp)))
res[match(x, tmp)] + 1L
}
equivalence.cpp
#include <R.h>
#include <Rinternals.h>
#include <string>
extern "C" {
// [[Rcpp::export]]
SEXP c_equivalence(SEXP ra, SEXP rb, SEXP rn) {
try {
if (LENGTH(ra) != LENGTH(rb))
throw std::string("Lengths of a and be do not match.");
int* a = INTEGER(ra);
int* b = INTEGER(rb);
int m = LENGTH(ra);
int n = INTEGER(rn)[0];
SEXP classes = PROTECT(allocVector(INTSXP, n));
int* cls = INTEGER(classes);
//Initialize each element its own class.
for (int k = 0; k < n; k++) cls[k] = k;
//For each piece of input information...
for (int l = 0; l < m; l++) {
//Track first element up to its ancestor.
int j = a[l];
while (cls[j] != j) j = cls[j];
//Track second element up to its ancestor.
int k = b[l];
while (cls[k] != k) k = cls[k];
//If they are not already related, make them so.
if (j != k) {
cls[j] = k;
}
}
//Final sweep up to highest ancestors.
for (int j = 0; j < n; j++) {
while (cls[j] != cls[cls[j]]) cls[j] = cls[cls[j]];
}
UNPROTECT(1);
return classes;
} catch(const std::string& e) {
error(e.c_str());
return R_NilValue;
} catch (...) {
error("Uncaught exception.");
return R_NilValue;
}
}
}

Okay, here is a new answer, which goes some of the way. Again, dat is the data frame.
Cluster <- rep(NA, length(dat[, 1])) #initialise
for(r in 1:length(Cluster)){
if(identical(as.numeric(r), 1)){Cmatches <- matrix(c(as.character(dat[1, 1]), as.character(dat[1, 2])), 2, 1)}
matched <- F
for(cl in 1:length(Cmatches[1,])){
if(sum(c(as.character(dat[r, 1]), as.character(dat[r, 2])) %in% Cmatches[, cl]) != 0){
#add P1 and P2 from this row to the cluster which it matches
Cmatches <- rbind(Cmatches, matrix(c(if(cl != 1){rep("", (cl - 1)*2)}else{character(0)}, as.character(dat[r, 1]), as.character(dat[r, 2]), if(cl != length(Cmatches[1,])){rep("", (length(Cmatches[1, ]) - cl)*2)}else{character(0)}), 2, length(Cmatches[1,]), byrow = F))
matched <- T
Cluster[r] <- cl
}
}
if(!matched){
#add a new cluster, because doesn't match any existing
Cmatches <- cbind(Cmatches, c(c(as.character(dat[r, 1]), as.character(dat[r, 2])), rep("", length(Cmatches[, 1]) - 2)))
Cluster[r] <- length(Cmatches[1,])
}
}
After this, you would take the Cmatch matrix and then check for matches between the clusters using if(sum(match(Cmatch[, cl1], Cmatch[, cl2], incomparables = ""), na.rm = T) != 0) (where cl1 and cl2 are cluster numbers to be compared). If that test was true, then those clusters should be grouped.
Cmatched <- rep(NA, length(Cmatches[1,]))
for(cl in 1:(length(Cmatches[1,]) - 1)){
for(cl2 in (cl + 1):length(Cmatches[1,])){
if(sum(match(Cmatches[, cl], Cmatches[, cl2], incomparables = ""), na.rm = T) != 0){
if(is.na(Cmatched[cl])){
Cluster[Cluster == cl2] <- cl
Cmatched[cl2] <- cl
}else{
Cluster[Cluster == cl2] <- Cmatched[cl]
Cmatched[cl2] <- cl
}
}
}
}
And I think that there is your answer. Then just dat <- cbind(dat, Cluster).

It sounds like you want to do categorical clustering. You should look into k-modes clustering which is an an extension of k-means. The k-modes algorithm mirrors the steps of k-means. Here is the outline provided in the paper.
Randomly select k unique objects as the initial cluster centers (modes).
Calculate the distances between each object and the cluster mode; assign the object to the cluster whose center has the shortest distance to the object; repeat this step until all objects are assigned to clusters.
Select a new mode for each cluster and compare it with the previous mode. If different, go back to Step 2; otherwise, stop
There are other issues discussed like k-prototypes (for mixing categorical with numerical data), fuzzy k-modes (for assigning cluster membership), and initialization of k-modes.

Related

Tabu search in R

Good evening,
As part of a data analysis course we have been thrown into the Metaheuristics realm.....and I am really struggling to understand how to implement a Tabu search in R since my background in programming is rather limited.
I haven't found any R or Python example on Google or youtube either so I'm really praying I'll find something here.
The problem I have is similar to the "location problem" in optimisation. I need to find the best combination of Hubs that minimizes the total distance between Hubs and nodes.
I need to find 5 hubs, and the total capacity for each one is 120
nodes <- structure(list(node_number = 1:50,
x = c(2L, 80L, 36L, 57L, 33L, 76L, 77L, 94L,
89L, 59L, 39L, 87L, 44L, 2L, 19L, 5L,
58L, 14L, 43L, 87L, 11L, 31L, 51L, 55L,
84L, 12L, 53L, 53L, 33L, 69L, 43L, 10L,
8L, 3L, 96L, 6L, 59L, 66L, 22L, 75L, 4L,
41L, 92L, 12L, 60L, 35L, 38L, 9L, 54L, 1L),
y = c(62L, 25L, 88L, 23L, 17L, 43L, 85L, 6L, 11L,
72L, 82L, 24L, 76L, 83L, 43L, 27L, 72L, 50L,
18L, 7L, 56L, 16L, 94L, 13L, 57L, 2L, 33L, 10L,
32L, 67L, 5L, 75L, 26L, 1L, 22L, 48L, 22L, 69L,
50L, 21L, 81L, 97L, 34L, 64L, 84L, 100L, 2L, 9L, 59L, 58L),
node_demand = c(3L, 14L, 1L, 14L, 19L, 2L, 14L, 6L,
7L, 6L, 10L, 18L, 3L, 6L, 20L, 4L,
14L, 11L, 19L, 15L, 15L, 4L, 13L,
13L, 5L, 16L, 3L, 7L, 14L, 17L,
3L, 3L, 12L, 14L, 20L, 13L, 10L,
9L, 6L, 18L, 7L, 20L, 9L, 1L, 8L,
5L, 1L, 7L, 9L, 2L)),
.Names = c("node_number", "x", "y", "node_demand"),
class = "data.frame", row.names = c(NA, -50L))
hubs_required = 5
total_capacity = 120
My strategy was to create a distance matrix, then I will create another 50 x 50 matrix to represent wether a node becomes a hub or not, and finally I will multiply both and add all the distances to get the total distance.
I created the dataframe:
nodes_df <- as.data.frame(nodes)
colnames(nodes_df) <- c("x", "y", "node_demand")
rownames(nodes_df) <- paste('Node',1:50)
I created the distance matrix
distance_df <-as.data.frame(as.matrix(round(dist(nodes_df,method = "euclidean",diag = TRUE,upper = TRUE))))
colnames(distance_df) <- paste("Node",1:50)
I created the node demand matrix:
demand <- as.vector(rep(c(nodes_df[,'node_demand']),50))
demand_matrix <- matrix(demand,nrow=50,ncol=50,byrow = TRUE)
diag(demand_matrix) <- 0
demand_matrix <- as.data.frame(demand_matrix)
I created an empty matrix to show whether a node becomes a hub "1" or not "0"
hubs_matrix <- matrix(0,nrow = 50,ncol = 50,byrow = TRUE)
colnames(hubs_matrix) <- paste("Hub",1:50)
rownames(hubs_matrix) <- paste("Node",1:50)
Then to create the initial solution I randomly assign Hubs and calculate the distance and demand.
set.seed(37)
hubs_matrix <- do.call("cbind", lapply(1:50, function(x) sample(c(1, rep(0, 49)), 50)))
sum_distances <- (hubs_matrix * distance_df)
sum(rowSums(sum_distances))
The idea is to try different combinations of '1'' and '0' as to minimise the total distance but I am having the following issues:
I got no idea how to do the local search and do the permutations from the initial solution.
I got no idea how to prevent R to use the best solution for a certain period of time, i.e the Tabu list
I got no idea how to deal with the supply restriction for each node ( total demand from each node < 120), I could do it with a loop but since in this case I'm multiplying matrices I'm pretty lost.
Anybody could give me a hand???
Many thanks!

Subsetting from a dataframe in R

I have sampled 'n' rows from a dataframe called nodes:
nodes <- structure(list(node_number = 1:50,
x = c(2L, 80L, 36L, 57L, 33L, 76L, 77L, 94L,
89L, 59L, 39L, 87L, 44L, 2L, 19L, 5L,
58L, 14L, 43L, 87L, 11L, 31L, 51L, 55L,
84L, 12L, 53L, 53L, 33L, 69L, 43L, 10L,
8L, 3L, 96L, 6L, 59L, 66L, 22L, 75L, 4L,
41L, 92L, 12L, 60L, 35L, 38L, 9L, 54L, 1L),
y = c(62L, 25L, 88L, 23L, 17L, 43L, 85L, 6L, 11L,
72L, 82L, 24L, 76L, 83L, 43L, 27L, 72L, 50L,
18L, 7L, 56L, 16L, 94L, 13L, 57L, 2L, 33L, 10L,
32L, 67L, 5L, 75L, 26L, 1L, 22L, 48L, 22L, 69L,
50L, 21L, 81L, 97L, 34L, 64L, 84L, 100L, 2L, 9L, 59L, 58L),
node_demand = c(3L, 14L, 1L, 14L, 19L, 2L, 14L, 6L,
7L, 6L, 10L, 18L, 3L, 6L, 20L, 4L,
14L, 11L, 19L, 15L, 15L, 4L, 13L,
13L, 5L, 16L, 3L, 7L, 14L, 17L,
3L, 3L, 12L, 14L, 20L, 13L, 10L,
9L, 6L, 18L, 7L, 20L, 9L, 1L, 8L,
5L, 1L, 7L, 9L, 2L)),
.Names = c("node_number", "x", "y", "node_demand"),
class = "data.frame", row.names = c(NA, -50L))
To sample I use this code:
hubs <- nodes[sample(1:total_nodes, hubs_required, replace = FALSE),]
Which returns :
node_number x y node_demand
33 33 8 26 12
14 14 2 83 6
42 42 41 97 20
13 13 44 76 3
10 10 59 72 6
I would like to return all the rows that haven't been selected so that I can perform a series of calculations on them.
I thought that using something like data[-sample,] would work but I get the following error
Error in xj[i] : invalid subscript type 'list'.
Anybody know who could I get these values?

It would be easier to keep the list of indexes that selected. Somthing like
hubs <- nodes[keep <- sample(1:total_nodes, hubs_required, replace = FALSE),]
other_hubs <- nodes[-keep, ]
Otherwise, if your data has some sort of key/ID, you can do something like
other_hubs <- nodes[nodes%node_number %in% hubs$node_number, ]
or with dplyr, this can be an anti-join
nodes %>% anti_join(hubs, by="node_number")

Running rapply on lists of dataframes

To follow-up on two rapply questions, here and here from years ago, it seems rapply only works on simple classes (i.e., vector, matrix) and not the multifaceted data.frame class.
In most cases and demonstrated below, the rapply equivalent is nested lapply and its variant wrappers, v/sapply where the number of nests correlates to number of levels. Below is my testing scenario between nested lapply and rapply between vector, matrix, and dataframe types. All but datafames fail to equalize.
Question
Is there a use case in base R for rapply() to recursively run operations on a list of dataframes and return a list of dataframes as it does for lists of vectors or matrices? If not, is this a bug or should it be warned in ?rapply base R docs? Most tutorials do not show rapply dataframe examples.
One Dimension (character vector)
Below shows how rapply is equivalent to nested lapply on simple character vectors running count of characters, and even shows how rapply is appreciably faster in processing:
library(microbenchmark)
ScriptLists <- list(R = list.files(path="/path/to/Scripts", pattern="\\.R"),
Python = list.files(path="/path/to/Scripts", pattern="\\.py"),
SQL = list.files(path="/path/to/Scripts", pattern="\\.sql"),
PHP = list.files(path="/path/to/Scripts", pattern="\\.xsl"),
XSLT = list.files(path="/path/to/Scripts", pattern="\\.php"))
microbenchmark(
ScriptsLists1 <- lapply(ScriptLists, function(i){
unname(vapply(i, function(x){
nchar(x)
}, numeric(1)))
})
)
# Unit: microseconds
# min lq mean median uq max neval
# 384 408.782 524.1363 434.7675 678.016 886.377 100
microbenchmark(
ScriptsLists2 <- rapply(ScriptLists, function(x){
nchar(x)
}, how="list")
)
# Unit: microseconds
# min lq mean median uq max neval
# 110.196 112.8425 131.6141 114.5265 123.91 352.722 100
all.equal(ScriptsLists1, ScriptsLists2)
# [1] TRUE
Two Dimension Type (matrix vs. data.frame)
Input dataframe (pulled from highest year rankings of StackOverflow top users) to build list of top users' dataframes by language tags (C#, Python, R, etc.).
df <- structure(list(user = structure(c(12L, 14L, 19L, 35L, 22L, 32L,
1L, 36L, 7L, 9L, 2L, 18L, 27L, 6L, 30L, 20L, 10L, 24L, 29L, 23L,
5L, 3L, 4L, 15L, 25L, 17L, 11L, 8L, 33L, 13L, 34L, 16L, 21L,
26L, 28L, 31L), .Label = c("akrun", "alecxe", "Alexey Mezenin",
"BalusC", "Barmar", "CommonsWare", "Darin Dimitrov", "dasblinkenlight",
"Eric Duminil", "Felix Kling", "Frank van Puffelen", "Gordon Linoff",
"Greg Hewgill", "Günter Zöchbauer", "GurV", "Hans Passant", "JB Nizet",
"Jean-François Fabre", "jezrael", "Jon Skeet", "Jonathan Leffler",
"Martijn Pieters", "Martin R", "matt", "Nina Scholz", "paxdiablo",
"piRSquared", "Pranav C Balan", "Psidom", "Quentin", "Suragch",
"T.J. Crowder", "Tim Biegeleisen", "unutbu", "VonC", "Wiktor Stribi?ew"
), class = "factor"), link = structure(c(2L, 17L, 21L, 31L, 1L,
10L, 27L, 28L, 22L, 33L, 35L, 34L, 20L, 3L, 15L, 19L, 18L, 25L,
29L, 4L, 8L, 5L, 11L, 32L, 6L, 30L, 16L, 24L, 13L, 36L, 14L,
12L, 9L, 7L, 23L, 26L), .Label = c("http://www.stackoverflow.com//users/100297/martijn-pieters",
"http://www.stackoverflow.com//users/1144035/gordon-linoff",
"http://www.stackoverflow.com//users/115145/commonsware", "http://www.stackoverflow.com//users/1187415/martin-r",
"http://www.stackoverflow.com//users/1227923/alexey-mezenin",
"http://www.stackoverflow.com//users/1447675/nina-scholz", "http://www.stackoverflow.com//users/14860/paxdiablo",
"http://www.stackoverflow.com//users/1491895/barmar", "http://www.stackoverflow.com//users/15168/jonathan-leffler",
"http://www.stackoverflow.com//users/157247/t-j-crowder", "http://www.stackoverflow.com//users/157882/balusc",
"http://www.stackoverflow.com//users/17034/hans-passant", "http://www.stackoverflow.com//users/1863229/tim-biegeleisen",
"http://www.stackoverflow.com//users/190597/unutbu", "http://www.stackoverflow.com//users/19068/quentin",
"http://www.stackoverflow.com//users/209103/frank-van-puffelen",
"http://www.stackoverflow.com//users/217408/g%c3%bcnter-z%c3%b6chbauer",
"http://www.stackoverflow.com//users/218196/felix-kling", "http://www.stackoverflow.com//users/22656/jon-skeet",
"http://www.stackoverflow.com//users/2336654/pirsquared", "http://www.stackoverflow.com//users/2901002/jezrael",
"http://www.stackoverflow.com//users/29407/darin-dimitrov", "http://www.stackoverflow.com//users/3037257/pranav-c-balan",
"http://www.stackoverflow.com//users/335858/dasblinkenlight",
"http://www.stackoverflow.com//users/341994/matt", "http://www.stackoverflow.com//users/3681880/suragch",
"http://www.stackoverflow.com//users/3732271/akrun", "http://www.stackoverflow.com//users/3832970/wiktor-stribi%c5%bcew",
"http://www.stackoverflow.com//users/4983450/psidom", "http://www.stackoverflow.com//users/571407/jb-nizet",
"http://www.stackoverflow.com//users/6309/vonc", "http://www.stackoverflow.com//users/6348498/gurv",
"http://www.stackoverflow.com//users/6419007/eric-duminil", "http://www.stackoverflow.com//users/6451573/jean-fran%c3%a7ois-fabre",
"http://www.stackoverflow.com//users/771848/alecxe", "http://www.stackoverflow.com//users/893/greg-hewgill"
), class = "factor"), location = structure(c(17L, 15L, 8L, 12L,
10L, 26L, 1L, 28L, 23L, 1L, 17L, 25L, 6L, 29L, 26L, 19L, 24L,
1L, 5L, 13L, 4L, 2L, 3L, 1L, 7L, 20L, 21L, 27L, 22L, 11L, 1L,
16L, 9L, 1L, 18L, 14L), .Label = c("", "??????", "Amsterdam, Netherlands",
"Arlington, MA", "Atlanta, GA, United States", "Bellevue, WA, United States",
"Berlin, Deutschland", "Bratislava, Slovakia", "California, USA",
"Cambridge, United Kingdom", "Christchurch, New Zealand", "France",
"Germany", "Hohhot, China", "Linz, Austria", "Madison, WI", "New York, United States",
"Ramanthali, Kannur, Kerala, India", "Reading, United Kingdom",
"Saint-Etienne, France", "San Francisco, CA", "Singapore", "Sofia, Bulgaria",
"Sunnyvale, CA", "Toulouse, France", "United Kingdom", "United States",
"Warsaw, Poland", "Who Wants to Know?"), class = "factor"), year_rep = structure(c(36L,
35L, 34L, 33L, 32L, 31L, 30L, 29L, 28L, 27L, 26L, 25L, 24L, 23L,
22L, 21L, 20L, 19L, 18L, 17L, 16L, 15L, 14L, 13L, 12L, 11L, 10L,
9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L), .Label = c("3,580", "3,604",
"3,636", "3,649", "3,688", "3,735", "3,796", "3,814", "3,886",
"3,920", "3,923", "3,950", "4,016", "4,046", "4,142", "4,179",
"4,195", "4,236", "4,313", "4,324", "4,348", "4,464", "4,475",
"4,482", "4,526", "4,723", "4,854", "4,936", "4,948", "5,188",
"5,258", "5,337", "5,577", "5,740", "5,835", "5,985"), class = "factor"),
total_rep = structure(c(18L, 2L, 34L, 27L, 22L, 20L, 5L,
3L, 31L, 1L, 6L, 9L, 13L, 25L, 21L, 36L, 14L, 4L, 11L, 7L,
8L, 10L, 30L, 29L, 24L, 15L, 35L, 17L, 33L, 23L, 12L, 28L,
16L, 19L, 26L, 32L), .Label = c("12,557", "154,439", "158,134",
"220,515", "229,553", "233,368", "269,380", "289,989", "30,027",
"31,602", "36,950", "401,595", "41,183", "411,535", "418,780",
"455,157", "475,813", "499,408", "507,043", "508,310", "509,365",
"525,176", "529,137", "61,135", "616,135", "64,476", "651,397",
"672,118", "7,932", "703,046", "709,683", "71,032", "77,211",
"83,237", "86,520", "921,690"), class = "factor"), tag1 = structure(c(15L,
2L, 10L, 6L, 11L, 8L, 12L, 13L, 4L, 14L, 11L, 11L, 10L, 1L,
8L, 4L, 8L, 16L, 11L, 16L, 8L, 9L, 7L, 15L, 8L, 7L, 5L, 4L,
15L, 6L, 11L, 4L, 3L, 3L, 8L, 16L), .Label = c("android",
"angular2", "c", "c#", "firebase", "git", "java", "javascript",
"laravel", "pandas", "python", "r", "regex", "ruby", "sql",
"swift"), class = "factor"), tag2 = structure(c(23L, 24L,
19L, 8L, 20L, 14L, 6L, 13L, 3L, 21L, 22L, 20L, 19L, 12L,
10L, 12L, 14L, 11L, 17L, 11L, 18L, 18L, 15L, 16L, 2L, 9L,
7L, 12L, 16L, 19L, 17L, 1L, 4L, 5L, 14L, 11L), .Label = c(".net",
"arrays", "asp.net-mvc", "bash", "c++", "dplyr", "firebase-database",
"github", "hibernate", "html", "ios", "java", "javascript",
"jquery", "jsf", "mysql", "pandas", "php", "python", "python-3.x",
"ruby-on-rails", "selenium", "sql-server", "typescript"), class = "factor"),
tag3 = structure(c(20L, 17L, 11L, 12L, 24L, 15L, 11L, 8L,
5L, 4L, 23L, 24L, 11L, 3L, 10L, 1L, 6L, 31L, 25L, 28L, 18L,
19L, 26L, 27L, 22L, 16L, 2L, 9L, 15L, 13L, 21L, 30L, 29L,
7L, 14L, 2L), .Label = c(".net", "android", "android-intent",
"arrays", "asp.net-mvc-3", "asynchronous", "bash", "c#",
"c++", "css", "dataframe", "docker", "git-pull", "html",
"java", "java-8", "javascript", "jquery", "laravel-5.3",
"mysql", "numpy", "object", "protractor", "python-2.7", "r",
"servlets", "sql-server", "swift3", "unix", "winforms", "xcode"
), class = "factor")), .Names = c("user", "link", "location",
"year_rep", "total_rep", "tag1", "tag2", "tag3"), class = "data.frame", row.names = c(NA,
-36L))
R Code
Below methods average year_rep and total_rep (5th/6th) columns in either types, matrix or dataframe. Be sure to change return statements in setup block, swapping out the commented section type. Notice only the rapply() for matrix returns same as nested lapply, but not for dataframe returns.
# NESTED LIST SETUP ------------------------------------
LangLists <- list(`c#`=list(), python=list(), sql=list(), php=list(), r=list(),
java=list(), javascript=list(), ruby=list(), `c++`=list())
LangLists <- setNames(mapply(function(i, j){
df <- subset(df, tag1 == j | tag2 == j | tag3 == j)
df$year_rep <- as.numeric(as.character(gsub(",", "", df$year_rep)))
df$total_rep <- as.numeric(as.character(gsub(",", "", df$total_rep)))
return(list(as.matrix(df))) # MATRIX TYPE
# return(list(df)) # DF TYPE
}, LangLists, names(LangLists), SIMPLIFY=FALSE), names(LangLists))
# -----------------------------------------------------
# MATRIX RETURN
LangLists1 <- lapply(LangLists, function(i){
lapply(i, function(df){
cbind(mean(as.numeric(df[,5])),
mean(as.numeric(df[,6])))
})
})
LangLists2 <- rapply(LangLists, function(i){
cbind(mean(as.numeric(i[,5])),
mean(as.numeric(i[,6])))
}, classes="matrix", how="list")
all.equal(LangLists1, LangLists2)
# [1] TRUE
# DATA FRAME RETURN
LangLists1 <- lapply(LangLists, function(i){
lapply(i, function(df){
data.frame(year_rep=mean(df$year_rep),
total_rep=mean(df$total_rep))
})
})
LangLists2 <- rapply(LangLists, function(i){
data.frame(year_rep=mean(i$year_rep),
total_rep=mean(i$total_rep))
}, classes="data.frame", how="list")
all.equal(LangLists1, LangLists2)
# [1] "Component “c#”: Component 1: Names: 2 string mismatches"
# [2] "Component “c#”: Component 1: Attributes: < names for target but not for current >"
# [3] "Component “c#”: Component 1: Attributes: < Length mismatch: comparison on first 0 components >"
# [4] "Component “c#”: Component 1: Length mismatch: comparison on first 2 components"
# [5] "Component “c#”: Component 1: Component 1: Modes: numeric, NULL"
...
In fact, whereas the nested lapply remains a list of intact dataframes of the two columns for rep means, the rapply for dataframes converts underlying dataframes to lists of NULLs. So again, why does rapply fail to return original list of dataframes compared to vectors/matrices?
# $`c#`
# $`c#`[[1]]
# $`c#`[[1]]$X
# NULL
# $`c#`[[1]]$user
# NULL
# $`c#`[[1]]$link
# NULL
# $`c#`[[1]]$location
# NULL
# $`c#`[[1]]$year_rep
# NULL
# $`c#`[[1]]$total_rep
# NULL
# $`c#`[[1]]$tag1
# NULL
# $`c#`[[1]]$tag2
# NULL
# $`c#`[[1]]$tag3
# NULL
# $python
# $python[[1]]
# $python[[1]]$X
# NULL
# $python[[1]]$user
# NULL
# $python[[1]]$link
# NULL
# $python[[1]]$location
# NULL
# $python[[1]]$year_rep
# NULL
# $python[[1]]$total_rep
# NULL
# $python[[1]]$tag1
# NULL
# $python[[1]]$tag2
# NULL
# $python[[1]]$tag3
# NULL

It appears that rapply is not designed to process lists of data.frames.
From the Details section of ?rapply it says, if
how = "list" or how = "unlist", the list is copied, all non-list elements which have a class included in classes are replaced by the result of applying f to the element and all others are replaced by deflt.
Since data.frames are lists, they do not fall under the first category. Thus, they fall into the all others catch-all and are replaced by dflt, whose default value is NULL. This explains the result of the final line of code in the question.
The final alternative argument to how is "replace" and it appears that data.frames are simply ignored under this "mode"
If how = "replace", each element of the list which is not itself a list and has a class included in classes is replaced by the result of applying f to the element.
No mention of elements which are themselves lists and running the code above with how="replace" appears to return a nested list where what were data.frames are now simple lists. So it appears that rapply went through and stripped the class attribute.

Create and output multiple plots from list

I am attempting to create and output as pdfs a list of 64 items. My data takes the form:
QQJAN List of 64
file1: List of 2
..$x: num [1:161] 96.7 96.8 97.5 ...
..$y: num [1:161] 9.3 10.3 17.3 ...
..................................................................
file64: List of 2
..$x: num [1:161] 42.6 59.9 70.4 ...
..$y: num [1:161] 9.3 10.3 17.3 ...
I can do this for any single item in the list using:
plot(QQJAN$file1)
and can then output these files to my working directory as pdfs, but how can all 64 files in the list be plotted and outputted with their names, i.e. file1.pdf, file 2.pdf etc.
Can the lapply function be used here?
A reproducible example:
QQJAN$file1$x=c(1,2,3,4)
QQJAN$file1$y=c(2,4,5,6)
QQJAN$file2$x=c(2,2,3,5)
QQJAN$file2$y=c(2,4,5,6)

Not tested due to lack of a reproducible example:
for (i in seq_along(QQJAN)) {
pdf(sprintf("plot%i.pdf", i)) #or pdf(paste0(names(QQJAN)[i], ".pdf"))
plot(QQJAN[[i]])
dev.off()
}
If you are only interested in side effects, such as plotting, a for loop is usually appropriate. You should use lapply if you need a return value.

We can use lapply to loop over the names of the list elements, create the pdf file by pasteing the individual names with .pdf, subset the list (QQJAN[[x]]), plot.
invisible(lapply(names(QQJAN), function(x) {
pdf(paste0(x, '.pdf'))
plot(QQJAN[[x]])
dev.off()}))
data
QQJAN <- structure(list(file1 = structure(list(x = c(6L, 5L, 15L, 11L,
14L, 19L, 6L, 16L, 17L, 6L, 13L, 8L, 14L, 14L, 7L, 19L, 4L, 1L,
11L, 3L, 2L, 12L, 15L, 3L, 5L, 14L, 2L, 12L, 13L, 1L, 7L, 5L,
8L, 3L, 19L, 5L, 15L, 13L, 14L, 20L), y = c(29L, 23L, 17L, 14L,
3L, 5L, 24L, 22L, 16L, 21L, 28L, 52L, 28L, 43L, 33L, 60L, 28L,
18L, 11L, 9L, 30L, 15L, 17L, 8L, 44L, 19L, 57L, 59L, 45L, 30L,
9L, 13L, 1L, 60L, 39L, 21L, 35L, 50L, 3L, 44L)), .Names = c("x",
"y")), file2 = structure(list(x = c(11L, 3L, 11L, 5L, 8L, 7L,
6L, 18L, 8L, 17L, 7L, 15L, 19L, 3L, 10L, 12L, 13L, 2L, 9L, 10L,
15L, 13L, 3L, 6L, 16L, 1L, 20L, 5L, 9L, 4L, 12L, 1L, 6L, 13L,
18L, 7L, 18L, 19L, 15L, 13L), y = c(56L, 31L, 40L, 43L, 20L,
45L, 55L, 8L, 43L, 26L, 7L, 52L, 7L, 31L, 11L, 14L, 55L, 26L,
4L, 42L, 34L, 44L, 12L, 4L, 30L, 60L, 23L, 44L, 29L, 55L, 6L,
37L, 11L, 14L, 36L, 52L, 28L, 22L, 31L, 33L)), .Names = c("x",
"y"))), .Names = c("file1", "file2"))

Loop when the output for each iteration is a data.frame

I want to do the looping for the following data. The output for a single iteration is a data.frame. My code is:
Data <- structure(list(v = c(15L, 15L, 15L, 15L, 16L, 16L, 16L, 17L,
17L, 18L, 19L, 19L, 19L, 20L, 20L, 21L, 21L, 22L, 22L, 25L, 25L
), b = c(35L, 70L, 42L, 35L, 20L, 48L, 16L, 68L, 68L, 51L, 57L,
57L, 57L, 95L, 76L, 70L, 21L, 77L, 77L, 100L, 30L), r = c(7L,
14L, 14L, 14L, 5L, 15L, 6L, 16L, 20L, 17L, 9L, 12L, 18L, 19L,
19L, 10L, 5L, 14L, 21L, 12L, 6L), k = c(3L, 3L, 5L, 6L, 4L, 5L,
6L, 4L, 5L, 6L, 3L, 4L, 6L, 4L, 5L, 3L, 5L, 4L, 6L, 3L, 5L),
lambda = c(1L, 2L, 4L, 5L, 1L, 4L, 2L, 3L, 5L, 5L, 1L, 2L,
5L, 3L, 4L, 1L, 1L, 2L, 5L, 1L, 1L)), .Names = c("v", "b",
"r", "k", "lambda"), class = "data.frame", row.names = c(NA,
-21L))
library(AlgDesign)
BIB <- list()
for(i in 1:nrow(Data)){
BIB[[i]] <- data.frame(optBlock(~., withinData = factor(1:Data[i, "v"]), blocksize = rep(Data[i, "k"], Data[i, "b"]))$Blocks)
dimnames(BIB[[i]]) <- list(1:Data[i, "k"], paste("Block", 1:Data[i, "b"], sep = " "))
}
BIB
Is there an easy way to accomplish the same task?

BIB <- list()
for(i in 1:nrow(Data)){
BIB[[i]] <- data.frame(optBlock(~., withinData = factor(1:Data[i, "v"]), blocksize = rep(Data[i, "k"], Data[i, "b"]))$Blocks)
dimnames(BIB[[i]]) <- list(1:Data[i, "k"], paste("Block", 1:Data[i, "b"], sep = "_"))
}
print(BIB)

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Looping and clustering - r

Related

Tabu search in R

Subsetting from a dataframe in R

Running rapply on lists of dataframes

Create and output multiple plots from list

Loop when the output for each iteration is a data.frame

Categories

Resources