I have 963 lists, all containing the same type of information. The amount of data in any given list can vary, however. Instead of creating many individual lists, is there an efficient way to group them? Examples follow.
list001 <- c(originApt = 'ATL', destinApt = 'BOS', flightIndxs = 1:7)
list002 <- c(originApt = 'ATL', destinApt = 'DEN', flightIndxs = 9:19)
:
list963 <- c(originApt = 'DCA', destinApt = 'TPA', flightIndxs = c(8582, 8583, 8584, 8585, 8586, 8587))
and so forth. Note that the integer vector in the third entry of each list varies in length. In MATLAB, I'd just construct a structure called 'flight' with an index for each list instance. Is there a way to organize my lists in R short of having many individual variables?
You can create a list of lists:
list001 <- list(originApt = 'ATL', destinApt = 'BOS', flightIndxs = 1:7)
list002 <- list(originApt = 'ATL', destinApt = 'DEN', flightIndxs = 9:19)
large_list <- list(list001, list002)
> large_list
[[1]]
[[1]]$originApt
[1] "ATL"
[[1]]$destinApt
[1] "BOS"
[[1]]$flightIndxs
[1] 1 2 3 4 5 6 7
[[2]]
[[2]]$originApt
[1] "ATL"
[[2]]$destinApt
[1] "DEN"
[[2]]$flightIndxs
[1] 9 10 11 12 13 14 15 16 17 18 19
A list can contain any other R object as a member. Do note, however, that you should construct the sublists with list() rather than c(): c() would flatten everything into a single named character vector.
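For example, here is a minimal sketch of building the container in a loop and indexing it, much like a MATLAB struct array (the name flight is illustrative):
flight <- vector("list", 2)  # illustrative; yours would be length 963
flight[[1]] <- list(originApt = 'ATL', destinApt = 'BOS', flightIndxs = 1:7)
flight[[2]] <- list(originApt = 'ATL', destinApt = 'DEN', flightIndxs = 9:19)
flight[[2]]$destinApt              # access one field: "DEN"
sapply(flight, `[[`, "originApt")  # pull one field across all entries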
You can also create a long-format data.frame:
do.call('rbind', lapply(large_list, function(x) as.data.frame(do.call('cbind', x))))
originApt destinApt flightIndxs
1 ATL BOS 1
2 ATL BOS 2
3 ATL BOS 3
4 ATL BOS 4
5 ATL BOS 5
6 ATL BOS 6
7 ATL BOS 7
8 ATL DEN 9
9 ATL DEN 10
10 ATL DEN 11
11 ATL DEN 12
12 ATL DEN 13
13 ATL DEN 14
14 ATL DEN 15
15 ATL DEN 16
16 ATL DEN 17
17 ATL DEN 18
18 ATL DEN 19
Do note that this only works because flightIndxs is the only entry with multiple values, and there is a clear interpretation: each flight index has exactly one origin and destination. It also works when several variables have multiple values, as long as they all contain the same number of values.
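One caveat: do.call('cbind', x) builds a character matrix when x mixes character and numeric entries, so flightIndxs comes out as a character column above. A sketch of an alternative that keeps the column types, relying on data.frame() recycling the length-1 entries:
# each sublist becomes a small data.frame; originApt/destinApt recycle to match flightIndxs
do.call(rbind, lapply(large_list, as.data.frame, stringsAsFactors = FALSE))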
Related
I have a 4-dimensional array: location (3) x species (3) x Season (6) x Depth (2), i.e. the 3 x 3 matrix below repeated 12 times (6 seasons x 2 depths).
Season = 1, depth = 1
[A] [B] [C]
[a] 12 52 55
[b] 13 14 235
[c] 13 76 355
I would like to merge everything in one big matrix like:
Season = 1, depth = 1
[A] [B] [C]
[a11] 12 52 55
[b11] 13 14 235
[c11] 13 76 355
[a12] 12 52 55
[b12] 13 14 235
[c12] 13 76 355
[a21] 12 52 55
[b21] 13 14 235
[c21] 13 76 355
...
and so on. The first number would refer to one extra dimension, and the second to the other. Does that make sense? Any ideas?
Thanks a lot!! :)
This transposes the array with aperm() and then reshapes it into a matrix:
location = 3
species = 3
Season = 6
Depth = 2
set.seed(1)
myArr <- array(sample(1000, location * species * Season * Depth), dim = c(location, species, Season, Depth))
myArrPerm <- aperm(myArr, perm = c(1,3,4,2))
matrix(myArrPerm, ncol = species)
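If you want row labels like those in the question, here is a sketch (the locSeasonDepth label format is illustrative); expand.grid() varies its first factor fastest, which matches matrix()'s column-major fill after this permutation:
myMat <- matrix(myArrPerm, ncol = species)
dim(myMat)  # 36 x 3: location varies fastest, then Season, then Depth
labs <- expand.grid(loc = letters[1:3], season = 1:6, depth = 1:2)
rownames(myMat) <- paste0(labs$loc, labs$season, labs$depth)  # "a11", "b11", "c11", "a21", ...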
I am trying to cluster several amino acid sequences of a fixed length (13) into K clusters based on the Atchley factors (5 numbers that represent each amino acid).
For example, I have an input vector of strings like the following:
key <- HDMD::AAMetric.Atchley
sequences <- sapply(1:10000, function(x) paste(sapply(1:13, function (X) sample(rownames(key), 1)), collapse = ""))
However, my actual list of sequences is over 10^5 entries long (hence the need for computational efficiency).
I then convert these sequences into numeric vectors by the following:
key <- HDMD::AAMetric.Atchley
m1 <- key[strsplit(paste(sequences, collapse = ""), "")[[1]], ]
p = 13
output <- do.call(cbind, lapply(1:p, function(i)
  m1[seq(i, nrow(m1), by = p), ]))
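A quick sanity check on the reshaped matrix (assuming the 10,000 simulated sequences above):
dim(output)  # 10000 x 65: 13 positions x 5 Atchley factors per sequence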
I want to cluster this output (a set of 65-dimensional vectors, one per sequence) in an efficient way.
I was originally using mini-batch k-means, but I noticed the results were very inconsistent across repeated runs. I need a consistent clustering approach.
I am also concerned about the curse of dimensionality: at 65 dimensions, Euclidean distance loses discriminating power.
Many high-dimensional clustering algorithms I saw assume that outliers and noise exist in the data, but as these are biological sequences converted to numeric values, there are no noise points or outliers.
In addition, feature selection will not work, as every property of every amino acid is relevant in the biological context.
How would you recommend clustering these vectors?
I think self-organizing maps can be of help here - at least the implementation is quite fast, so you will know soon enough whether it is helpful or not. Using the data from the OP, along with:
rownames(output) <- 1:nrow(output)
colnames(output) <- make.names(colnames(output), unique = TRUE)
library(SOMbrero)
You define the number of clusters in advance (a 5 x 5 grid, i.e. 25 clusters):
fit <- trainSOM(x.data = output, dimension = c(5, 5), nb.save = 10, maxit = 2000,
                scaling = "none", radius.type = "gaussian")
The nb.save argument stores intermediate states, useful for exploring how the training developed over the iterations:
plot(fit, what ="energy")
It looks like more iterations are in order.
Check the frequency of the clusters:
table(fit$clustering)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
428 417 439 393 505 458 382 406 271 299 390 303 336 358 365 372 332 268 437 464 541 381 569 419 467
Predict clusters for new data:
predict(fit, output[1:20, ])
#output
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
19 12 11 8 9 1 11 13 14 5 18 2 22 21 23 22 4 14 24 12
Check which variables were important for the clustering:
summary(fit)
#part of output
Summary
Class : somRes
Self-Organizing Map object...
online learning, type: numeric
5 x 5 grid with square topology
neighbourhood type: gaussian
distance type: euclidean
Final energy : 44.93509
Topographic error: 0.0053
ANOVA :
Degrees of freedom : 24
F pvalue significativity
pah 1.343 0.12156074
pss 1.300 0.14868987
ms 16.401 0.00000000 ***
cc 1.695 0.01827619 *
ec 17.853 0.00000000 ***
Find the optimal number of super-clusters:
plot(superClass(fit))
fit1 <- superClass(fit, k = 4)
summary(fit1)
#part of output
SOM Super Classes
Initial number of clusters : 25
Number of super clusters : 4
Frequency table
1 2 3 4
6 9 4 6
Clustering
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
1 1 2 2 2 1 1 2 2 2 1 1 2 2 2 3 3 4 4 4 3 3 4 4 4
ANOVA
Degrees of freedom : 3
F pvalue significativity
pah 1.393 0.24277933
pss 3.071 0.02664661 *
ms 19.007 0.00000000 ***
cc 2.906 0.03332672 *
ec 23.103 0.00000000 ***
There is much more in the SOMbrero vignette.
Suppose we have the following data-set:
dat <- data.frame(num = 20:29, names = c(rep("Harry", 2), rep("Gary", 2), rep("Dairy", 3), rep("Harry", 3)))
num names
1 20 Harry
2 21 Harry
3 22 Gary
4 23 Gary
5 24 Dairy
6 25 Dairy
7 26 Dairy
8 27 Harry
9 28 Harry
10 29 Harry
And we also have the following values for each factor level:
fvals <- c(Harry = 1, Gary = 2, Dairy = 3)
The goal is to multiply each num value by the corresponding entry in fvals, matched via the names column. For example, the desired output for this data set should be:
20 # 20 * 1
21 # 21 * 1
44 # 22 * 2
46 # 23 * 2
72 # 24 * 3
75 # 25 * 3
78 # 26 * 3
27 # 27 * 1
28 # 28 * 1
29 # 29 * 1
I have been doing this by converting the factor variable into a matrix of binary indicator columns (one per level) and then using matrix multiplication. But it was quite confusing to set up the matrices and vectors so that R could multiply them (and so that the level columns matched). I'm also not sure the matrix-multiplication method is efficient with a large number of observations. Is there a better alternative?
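For reference, the indicator-matrix approach described above looks something like this (an illustrative sketch, not the original code):
mm <- model.matrix(~ names + 0, dat)  # one 0/1 column per factor level
# map each indicator column back to its fvals entry, then collapse to a per-row factor value
dat$num * as.vector(mm %*% fvals[sub("^names", "", colnames(mm))])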
Here is an idea. Notice that I set stringsAsFactors = FALSE because it is easier to work with a character vector directly.
dat <- data.frame(num = 20:29,
                  names = c(rep("Harry", 2), rep("Gary", 2), rep("Dairy", 3), rep("Harry", 3)),
                  stringsAsFactors = FALSE)
fvals <- c(Harry = 1, Gary = 2, Dairy = 3)
dat$num * fvals[dat$names]
# Harry Harry Gary Gary Dairy Dairy Dairy Harry Harry Harry
# 20 21 44 46 72 75 78 27 28 29
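If names is a factor instead (e.g. with the default stringsAsFactors = TRUE in older R versions), convert it explicitly first; indexing a named vector with a factor uses the underlying integer codes, not the labels:
dat$num * fvals[as.character(dat$names)]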
Given the following example:
library(metafor)
dat <- escalc(measure = "RR", ai = tpos, bi = tneg, ci = cpos, di = cneg, data = dat.bcg, append = TRUE)
dat
rma(yi, vi, data = dat, mods = ~dat[[8]], subset = (alloc=="systematic"), knha = TRUE)
trial author year tpos tneg cpos cneg ablat alloc yi vi
1 1 Aronson 1948 4 119 11 128 44 random -0.8893 0.3256
2 2 Ferguson & Simes 1949 6 300 29 274 55 random -1.5854 0.1946
3 3 Rosenthal et al 1960 3 228 11 209 42 random -1.3481 0.4154
4 4 Hart & Sutherland 1977 62 13536 248 12619 52 random -1.4416 0.0200
5 5 Frimodt-Moller et al 1973 33 5036 47 5761 13 alternate -0.2175 0.0512
6 6 Stein & Aronson 1953 NA NA NA NA 44 alternate NA NA
7 7 Vandiviere et al 1973 8 2537 10 619 19 random -1.6209 0.2230
8 8 TPT Madras 1980 505 87886 499 87892 NA random 0.0120 0.0040
9 9 Coetzee & Berjak 1968 29 7470 45 7232 27 random -0.4694 0.0564
10 10 Rosenthal et al 1961 17 1699 65 1600 42 systematic -1.3713 0.0730
11 11 Comstock et al 1974 186 50448 141 27197 18 systematic -0.3394 0.0124
12 12 Comstock & Webster 1969 5 2493 3 2338 33 systematic 0.4459 0.5325
13 13 Comstock et al 1976 27 16886 29 17825 33 systematic -0.0173 0.0714
Now what I basically want is to iterate the rma() call (varying only the mods argument) over, let's say, columns [7:8], and to store each result in a variable named after the column.
Two problems:
1) When I enter the command:
rma(yi, vi, data = dat, mods = ~dat[[8]], subset = (alloc=="systematic"), knha = TRUE)
The moderator is labeled dat[[8]] in the output, but I want the label to be the column name (i.e. colnames(dat)[i]):
Model Results:
estimate se tval pval ci.lb ci.ub
intrcpt 0.5543 1.4045 0.3947 0.7312 -5.4888 6.5975
dat[[8]] -0.0312 0.0435 -0.7172 0.5477 -0.2185 0.1560
2) Now imagine that I have many more columns and want to iterate over [8:53], such that each result gets stored in a variable named after the column.
Problem 2) has been solved:
for (i in 7:8) {
  assign(paste0(colnames(dat)[i], i),
         rma(yi, vi, data = dat, mods = ~dat[[i]], subset = (alloc == "systematic"), knha = TRUE))
}
To answer the first part of your question: you can change the name by accessing the attributes of the fitted model object (called model below). In this case:
# inspect the attributes
attr(model$vb, which = "dimnames")
# assign the name
attr(model$vb, which = "dimnames")[[1]][2] <- colnames(dat)[8]
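Putting both parts together: instead of assign(), you can collect the fits in a named list and relabel each moderator as it is fitted (a sketch along the lines of the loop above):
results <- list()
for (i in 7:8) {
  fit <- rma(yi, vi, data = dat, mods = ~dat[[i]],
             subset = (alloc == "systematic"), knha = TRUE)
  # relabel the moderator with the column name
  attr(fit$vb, which = "dimnames")[[1]][2] <- colnames(dat)[i]
  results[[colnames(dat)[i]]] <- fit
}
results$ablat  # access a fit by its column name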
In previous versions of R I could combine factor levels that didn't meet a "significant" volume threshold using the following little function:
whittle = function(data, cutoff_val){
  # convert to a data frame
  tab = as.data.frame.table(table(data))
  # returns vector of indices where value is below cutoff_val
  idx = which(tab$Freq < cutoff_val)
  levels(data)[idx] = "Other"
  return(data)
}
This takes in a factor vector, looks for levels that don't appear "often enough" and combines all of those levels into one "Other" factor level. An example of this is as follows:
> sort(table(data$State))
05 27 35 40 54 84 9 AP AU BE BI DI G GP GU GZ HN HR JA JM KE KU L LD LI MH NA
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
OU P PL RM SR TB TP TW U VD VI VS WS X ZH 47 BL BS DL M MB NB RP TU 11 DU KA
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3
BW ND NS WY AK SD 13 QC 01 BC MT AB HE ID J NO LN NM ON NE VT UT IA MS AO AR ME
4 4 4 4 5 5 6 6 7 7 7 8 8 8 9 10 11 17 23 26 26 30 31 31 38 40 44
OR KS HI NV WI OK KY IN WV AL CO WA MN NH MO SC LA TN AZ IL NC MI GA OH ** CT DE
45 47 48 57 57 64 106 108 112 113 120 125 131 131 135 138 198 200 233 492 511 579 645 646 840 873 1432
RI DC TX MA FL VA MD CA NJ PA NY
1782 2513 6992 7027 10527 11016 11836 12221 15485 16359 34045
Now when I use whittle it gives me the following warning:
> delete = whittle(data$State, 1000)
Warning message:
In `levels<-`(`*tmp*`, value = c("Other", "Other", "Other", "Other", :
duplicated levels in factors are deprecated
How can I modify my function so that it has the same effect but doesn't rely on this deprecated behavior? Should I convert to character, table it, and then replace the rare values with "Other"?
I've always found it easiest (less typing and less headache) to convert to character and back for these sorts of operations. Keeping with your as.data.frame.table and using replace to do the replacement of the low-frequency levels:
whittle <- function(data, cutoff_val) {
  tab <- as.data.frame.table(table(data))
  factor(replace(as.character(data), data %in% tab$data[tab$Freq < cutoff_val], "Other"))
}
Testing on some sample data:
state <- factor(c("MD", "MD", "MD", "VA", "TX"))
whittle(state, 2)
# [1] MD MD MD Other Other
# Levels: MD Other
I think this version should work. The levels<- function allows you to collapse levels by assigning a list (see ?levels).
whittle <- function(data, cutoff_val) {
  tab <- table(data)
  shouldmerge <- tab < cutoff_val
  tokeep <- names(tab)[!shouldmerge]
  tomerge <- names(tab)[shouldmerge]
  nv <- c(as.list(setNames(tokeep, tokeep)), list("Other" = tomerge))
  levels(data) <- nv
  return(data)
}
And we test it with
set.seed(15)
x <- factor(c(sample(letters[1:10], 100, replace = TRUE), sample(letters[11:13], 10, replace = TRUE)))
table(x)
# x
# a b c d e f g h i j k l m
# 5 11 8 8 7 5 13 14 14 15 2 3 5
y <- whittle(x, 9)
table(y)
# y
# b g h i j Other
# 11 13 14 14 15 43
It's worth adding to this answer that the new forcats package contains the fct_lump() function, which is dedicated to this.
Using #MrFlick's data:
x <- factor(c(sample(letters[1:10], 100, replace=T),
sample(letters[11:13], 10, replace=T)))
library(forcats)
library(magrittr) ## for %>% ; could also load dplyr
fct_lump(x, n=5) %>% table
# b g h i j Other
#11 13 14 14 15 43
The n argument specifies the number of most common values to preserve.
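If you want the question's cutoff semantics (lump every level whose count falls below a threshold) rather than keeping the n most common levels, newer forcats versions also provide fct_lump_min(); it should reproduce whittle(x, 9) here:
fct_lump_min(x, min = 9) %>% table
# b g h i j Other
#11 13 14 14 15 43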
Here's another way of doing it: replace all the items below the threshold with the first such level and then rename that level to "Other".
whittle <- function(x, thresh) {
  belowThresh <- names(which(table(x) < thresh))
  x[x %in% belowThresh] <- belowThresh[1]
  levels(x)[levels(x) == belowThresh[1]] <- "Other"
  factor(x)  # drop the now-unused levels
}
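Testing on the same x as above (the counts are worked out from the earlier table; "Other" sorts first because it replaces level "a"):
table(whittle(x, 9))
# Other     b     g     h     i     j
#    43    11    13    14    14    15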