I am trying to add all the elements in a matrix. This is an example of my matrix (the actual matrix is bigger):
m = matrix(c(528,479,538,603),nrow=2,ncol=2,dimnames=list(c("male","female"),c("A","B")))
m
A B
male 528 538
female 479 603
I am trying to do:
sum.elements = colSums(colSums(m))
but it gives the following error:
Error in colSums(colSums(m)) : 'x' must be an array of at least two
dimensions
I have tried doing:
x = colSums(m)
sum.elements = x[1] + x[2]
but this would be very long when you have a 100-column matrix...
Any help would be greatly appreciated!
You can use sum(). It also has an na.rm argument to drop NA values before summing.
sum(m)
#[1] 2148
In general, sum() works on a vector, a matrix, or a data.frame whose columns are all numeric.
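As a quick illustration of the na.rm argument mentioned above (a minimal sketch; the objects m_na and df are made up for the example):

```r
m <- matrix(c(528, 479, 538, 603), nrow = 2, ncol = 2)
total <- sum(m)                        # all four elements

# Introduce a missing value to show the effect of na.rm
m_na <- m
m_na[1, 2] <- NA
total_na <- sum(m_na)                  # NA, because one element is missing
total_rm <- sum(m_na, na.rm = TRUE)    # sums the three remaining elements

# sum() also accepts a data.frame as long as every column is numeric
df <- as.data.frame(m)
total_df <- sum(df)
```

Without na.rm = TRUE, a single NA anywhere in the matrix makes the whole sum NA.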
Benchmarks
set.seed(24)
m1 <- matrix(sample(0:20, 5000*5000, replace=TRUE), ncol=5000)
system.time(sum(m1))
# user system elapsed
# 0.027 0.000 0.026
system.time(sum(colSums(m1)))
# user system elapsed
# 0.027 0.000 0.027
system.time(Reduce('+', m1))
# user system elapsed
#25.977 0.644 26.673
Reduce will work
Reduce(`+`,m)
[1] 2148
Related
I have a set of phylogenetic trees, some with different topologies and different branch lengths. Here is an example set:
(LA:97.592181158,((HS:82.6284812237,RN:72.190055848635):10.438414999999999):3.989335,((CP:32.2668593286,CL:32.266858085):39.9232054349,(CS:78.2389673073,BT:78.238955218815):8.378847):10.974376);
(((HS:71.9309734249,((CP:30.289472339999996,CL:30.289473923):31.8509454,RN:62.1404181356):9.790551):2.049235,(CS:62.74606492390001,BS:62.74606028250001):11.234141000000001):5.067314,LA:79.0475136246);
(((((CP:39.415718961379994,CL:39.4157161214):29.043224136600003,RN:68.4589436016):8.947169,HS:77.4061105636):4.509818,(BS:63.09170355585999,CS:63.09171066541):18.824224):13.975551000000001,LA:95.891473546);
(LA:95.630761929,((HS:73.4928857457,((CP:32.673882875400004,CL:32.673881941):33.703323212,RN:66.37720021233):7.115682):5.537861,(CS:61.798048265700004,BS:61.798043931600006):17.232697):16.600025000000002);
(((HS:72.6356569413,((CP:34.015223002300004,CL:34.015223157499996):35.207698155399996,RN:69.2229294656):3.412726):8.746038,(CS:68.62665546391,BS:68.6266424085):12.755043999999998):13.40646,LA:94.78814570300001);
(LA:89.58710099299999,((HS:72.440439124,((CP:32.270428384199995,CL:32.2704269484):32.0556597315,RN:64.32607145395):8.114349):6.962274,(CS:66.3266360702,BS:66.3266352709):13.076080999999999):10.184418);
(LA:91.116083247,((HS:73.8383213643,((CP:36.4068361936,CL:36.4068400719):32.297183626700004,RN:68.704029984267):5.134297):6.50389,(BS:68.6124876659,CS:68.61249734691):11.729719):10.773886000000001);
(((HS:91.025288418,((CP:40.288406529099994,CL:40.288401832999995):29.854198951399997,RN:70.14260821095):20.882673999999998):6.163698,(CS:81.12951949976,BS:81.12952162629999):16.059462):13.109915,LA:110.298870881);
In this example there are 2 unique topologies - using R's ape unique.multiPhylo shows that (assuming the example above is saved to a file tree.fn):
tree <- ape::read.tree(tree.fn)
unique.tree <- ape::unique.multiPhylo(tree, use.tip.label = F, use.edge.length = F)
> length(tree)
[1] 8
> length(unique.tree)
[1] 2
My question is: how do I get a list of trees, one per unique topology in the input list, where the branch lengths are a summary statistic (such as the mean or median) across all trees that share that topology?
In the example above, it would return the first tree as is, because its topology is unique, plus another tree that has the topology of the other trees, with mean or median branch lengths.
If I understand correctly, you want to sort the trees into groups, one per unique topology (e.g. in your example, the first group contains one tree), and then measure some statistics for each group?
You can do that by first grouping the topologies into a list:
library(ape)
set.seed(5)
## Generating 20 trees with 4 tips each (hopefully some will share a topology!)
tree_list <- rmtree(20, 4)
## How many unique topologies?
length(unique(tree_list))
## Sorting the trees by topologies
tree_list_tmp <- tree_list
sorted_tree_list <- list()
counter <- 0
while(length(tree_list_tmp) != 0) {
counter <- counter+1
## Is the first tree equal to any of the trees in the list
equal_to_tree_one <- unlist(lapply(tree_list_tmp, function(x, base) all.equal(x, base, use.edge.length = FALSE), base = tree_list_tmp[[1]]))
## Saving the identical trees
sorted_tree_list[[counter]] <- tree_list_tmp[which(equal_to_tree_one)]
## Removing them from the list
tree_list_tmp <- tree_list_tmp[-which(equal_to_tree_one)]
## Repeat while there are still some trees!
}
## The number of topology groups should be equal to the number of unique trees
length(sorted_tree_list) == length(unique(tree_list))
## Giving them names for fanciness
names(sorted_tree_list) <- paste0("topology", seq_along(sorted_tree_list))
Then, for all the trees in each unique topology group, you can extract different summary statistics by writing a function. Here, for example, I will measure the branch length sd, mean, and 5% and 95% quantiles.
## function for getting some stats
get.statistics <- function(unique_topology_group) {
## Extract the branch lengths of all the trees
branch_lengths <- unlist(lapply(unique_topology_group, function(x) x$edge.length))
## Apply some statistics
return(c( n = length(unique_topology_group),
mean = mean(branch_lengths),
sd = sd(branch_lengths),
quantile(branch_lengths, probs = c(0.05, 0.95))))
}
## Getting all the stats
all_stats <- lapply(sorted_tree_list, get.statistics)
## and making it into a nice table
round(do.call(rbind, all_stats), digits = 3)
# n mean sd 5% 95%
# topology1 3 0.559 0.315 0.113 0.962
# topology2 2 0.556 0.259 0.201 0.889
# topology3 4 0.525 0.378 0.033 0.989
# topology4 2 0.489 0.291 0.049 0.855
# topology5 2 0.549 0.291 0.062 0.882
# topology6 1 0.731 0.211 0.443 0.926
# topology7 3 0.432 0.224 0.091 0.789
# topology8 1 0.577 0.329 0.115 0.890
# topology9 1 0.473 0.351 0.108 0.833
# topology10 1 0.439 0.307 0.060 0.795
Of course you can tweak it to get your own desired stats, or even get the stats per tree within each group (using a double lapply such as lapply(sorted_tree_list, lapply, get.statistics) or something like that, with get.statistics adapted to take a single tree).
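A rough sketch of the per-tree variant, under some assumptions: mock_groups below is a made-up stand-in for sorted_tree_list (plain lists that carry only an edge.length vector, so the example runs without ape), and tree.statistics is a hypothetical single-tree version of the function above.

```r
# Stand-in for sorted_tree_list: each "tree" is just a list with edge.length
mock_groups <- list(
  topology1 = list(list(edge.length = c(1, 2, 3)),
                   list(edge.length = c(2, 4, 6))),
  topology2 = list(list(edge.length = c(5, 5)))
)

# Hypothetical single-tree summary (takes one tree, not a group of trees)
tree.statistics <- function(tree) {
  c(mean = mean(tree$edge.length), sd = sd(tree$edge.length))
}

# Outer lapply walks the topology groups, inner sapply walks the trees;
# t() puts one tree per row
per_tree_stats <- lapply(mock_groups, function(group) {
  t(sapply(group, tree.statistics))
})
```

Each element of per_tree_stats is then a small matrix with one row per tree in that topology group.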
I'm making a very dirty version of an autocorrelation function in R.
I have a loop that works up to a specified max lag and then returns all the correlations as a matrix, as the acf() function does.
The idea is to replicate the output of the acf() function as shown:
Autocorrelations of series ‘acfData’, by lag
0 1 2 3 4 5 6 7 8
1.000 -0.038 0.253 0.266 0.250 0.267 -0.182 0.281 -0.013
9 10 11 12 13
-0.067 -0.122 -0.115 -0.023 -0.337
What I have so far takes the data and the specified max lag as input. For each lag, the code slides the data frame back by that amount and then computes the covariance and the standard deviations over the overlapping range of the data. This is repeated over the range of lags and each result is appended to a vector, as shown. I also included the cor() function on the same data frames as a test.
My problem is that the code returns the correct value for the first loop, or slide, and then returns slightly wrong values from then on.
library(DataCombine)  # slide() is assumed to come from the DataCombine package
myAcf <- function(dat, lg){
dataF <- data.frame("data" = dat)
names(dataF)[1] <- "acfData"
lagMat <- c()
testMat <- c()
for(i in 0:lg){
dataLag <- slide(dataF, "acfData", slideBy = -i)
covacf <- cov(dataLag[(1+i):nrow(dataLag[1]), 1], dataLag[(1+i):nrow(dataLag[1]), 2])
sd1 <- sd(dataLag[(1+i):nrow(dataLag[1]), 1])
sd2 <- sd(dataLag[(1+i):nrow(dataLag[1]), 2])
corrCalc <- covacf/(sd1 * sd2)
lagMat <- c(lagMat, corrCalc)
a <- cor(dataLag[(1+i):nrow(dataLag[1]), 1], dataLag[(1+i):nrow(dataLag[1]), 2])
testMat <- c(testMat, a)
}
plot(lagMat)
return(list(lagMat, testMat))
}
My code then returns for the same data as the acf() function input:
[[1]]
[1] 1.00000000 -0.03786539 0.27700596 0.30197418 0.31009956
[6] 0.37123797 -0.19520518 0.44399863 0.05684766 0.02063488
[11] -0.03724332
[[2]]
[1] 1.00000000 -0.03842146 0.27502462 0.29292583 0.35052131
[6] 0.40931426 -0.23637159 0.52320559 0.07270497 0.02555461
[11] -0.04524035
Any help is greatly appreciated!
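One likely source of the discrepancy, offered as a sketch rather than a definitive diagnosis: by default acf() subtracts the mean of the whole series and divides by the lag-0 sum of squares, whereas cov(), sd() and cor() recompute means and standard deviations on each shortened window. The function name my_acf below is made up for illustration.

```r
my_acf <- function(x, lag.max) {
  n <- length(x)
  xc <- x - mean(x)       # centre once, using the mean of the full series
  denom <- sum(xc^2)      # lag-0 sum of squares, shared by every lag
  sapply(0:lag.max, function(k) {
    # Products of the series with itself shifted by k, over the overlap only
    sum(xc[1:(n - k)] * xc[(1 + k):n]) / denom
  })
}

set.seed(1)
x <- rnorm(50)
mine <- my_acf(x, 10)
ref <- as.numeric(acf(x, lag.max = 10, plot = FALSE)$acf)
all.equal(mine, ref)
```

Because the denominator is the same for every lag, values at larger lags are pulled toward zero compared with a windowed cor(), which would be consistent with the pattern in the output above.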
What would be the simplest way to round up a value to a specific significant figure?
Something like the function signif(), but one that only rounds up (never down).
For example, rounding up 0.001145288 to 1 significant figure would yield 0.002.
Any suggestions will be appreciated :)
Cheers!
This function provides the desired output in the case of one significant figure:
upround1 <- function(x) {
if (isTRUE(all.equal(x,0))) return (x)
decs <- 10^floor(log10(abs(x)))
ceiling(x/decs)*decs
}
Examples:
upround1(0.001145288)
#[1] 0.002
upround1(0.0008145258)
#[1] 9e-04
upround1(11)
#[1] 20
upround1(-11)
#[1] -10
upround1(-0.023)
#[1] -0.02
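The same idea can be extended to an arbitrary number of significant figures. This is only a sketch: the name ceiling_signif and its digits argument are made up here, and values that already sit exactly on a rounding boundary may be affected by floating-point error.

```r
ceiling_signif <- function(x, digits = 1) {
  out <- x
  nz <- is.finite(x) & x != 0    # leave 0, NA and Inf untouched
  # Size of one unit in the last significant digit that is kept
  scale <- 10^(floor(log10(abs(x[nz]))) - digits + 1)
  out[nz] <- ceiling(x[nz] / scale) * scale
  out
}

r1 <- ceiling_signif(0.001145288)        # 1 significant figure
r2 <- ceiling_signif(0.001145288, 2)     # 2 significant figures
r3 <- ceiling_signif(c(11, -11, 0))      # works on vectors as well
```

As with upround1, negative numbers are rounded toward zero, because ceiling() always moves toward positive infinity.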
I created a function which converts the number to scientific notation, splits the string at 'e', and applies ceiling() to the mantissa before scaling it back by the exponent.
my_ceiling <- function(x){
num_string <- format(x, scientific=TRUE)
n <- strsplit(num_string, "e")
n1 <- sapply(n, function(x) as.numeric(x[1]))
n2 <- sapply(n, function(x) as.numeric(x[2]))
ceiling(n1) * 10^(n2)
}
my_ceiling(0.001145288)
# [1] 0.002
my_ceiling(0.0974343)
# [1] 0.1
my_ceiling(0)
# [1] 0
set.seed(1)
x <- runif(10, 0.001, 0.01)
my_ceiling(x)
# [1] 0.004 0.005 0.007 0.010 0.003 0.010 0.010 0.007
# [9] 0.007 0.002
I’ve been having some trouble with the plotCalibration() function. I have managed to get it to work before, but recently, whilst working with another dataset (here is a link to the .Rda data file), I have been unable to shake off an error message that keeps cropping up:
> plotCalibration(data = data, cOutcome = 2, predRisk = data$sortmort)
Error in plotCalibration(data = data, cOutcome = 2, predRisk = data$sortmort) : The specified outcome is not a binary variable.
When I’ve tried to set the cOutcome column to factors or to logical, it still doesn’t work.
I’ve looked at the source of the function and the only time the error message comes up is in the first if()else{} statement:
if (length(unique(y))!=2) {stop(" The specified outcome is not a binary variable.\n")}
else{
But I have checked that the length(unique(y)) is indeed ==2, and so don’t understand why the error message still crops up!
Be sure you're passing a data frame to plotCalibration(). Passing a dplyr tibble can cause this error. Converting it with the normal as.data.frame() worked for me.
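A likely reason, assuming plotCalibration() extracts the outcome with something like data[, cOutcome] (I have not checked every version of the source): a tibble keeps a one-column selection as a tibble, whereas a plain data frame drops it to a vector. The sketch below mimics the tibble behaviour in base R with drop = FALSE; the data frame d and its columns are made up for the example.

```r
# Hypothetical data: column 2 is a binary outcome
d <- data.frame(id = 1:6, outcome = c(0, 1, 0, 0, 1, 0))

# Plain data frame: [, 2] drops to a vector, so two unique values are found
y_vec <- d[, 2]
n_vec <- length(unique(y_vec))

# Tibble-like behaviour: the selection stays a one-column data frame,
# and length() of a data frame is its number of columns
y_df <- d[, 2, drop = FALSE]
n_df <- length(unique(y_df))

c(vector = n_vec, one_column_frame = n_df)
```

In the second case the length is 1 rather than 2, so a check of the form length(unique(y)) != 2 would fail even though the column itself is binary.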
Using the data you sent earlier, I do not see any error, though.
The following output was produced, along with a calibration plot:
> library(PredictABEL)
> plotCalibration(data = data, cOutcome = 2, predRisk = data$sortmort)
$Table_HLtest
total meanpred meanobs predicted observed
[0.000632,0.00129) 340 0.001 0.000 0.31 0
0.001287 198 0.001 0.000 0.25 0
[0.001374,0.00201) 283 0.002 0.004 0.53 1
0.002009 310 0.002 0.000 0.62 0
[0.002505,0.00409) 154 0.003 0.000 0.52 0
[0.004086,0.00793) 251 0.006 0.000 1.42 0
[0.007931,0.00998) 116 0.008 0.009 0.96 1
[0.009981,0.19545] 181 0.024 0.011 4.40 2
$Chi_square
[1] 4.906
$df
[1] 8
$p_value
[1] 0.7676
Please try using table(data[,2],useNA = "ifany") to see the number of levels of the outcome variable of your dataset.
The function plotCalibration will execute when the outcome is a binary variable (two levels).
Is there a simple and fast way to obtain the frequency of each integer that occurs in a vector of integers in R?
Here are my attempts so far:
x <- floor(runif(1000000)*1000)
print('*** using TABLE:')
system.time(as.data.frame(table(x)))
print('*** using HIST:')
system.time(hist(x,breaks=min(x):(max(x)+1),plot=FALSE,right=FALSE))
print('*** using SORT')
system.time({cdf<-cbind(sort(x),seq_along(x)); cdf<-cdf[!duplicated(cdf[,1]),2]; c(cdf[-1],length(x)+1)-cdf})
print('*** using ECDF')
system.time({i<-min(x):max(x); cdf<-ecdf(x)(i)*length(x); cdf-c(0,cdf[-length(i)])})
print('*** counting in loop')
system.time({h<-rep(0,max(x)+1);for(i in seq_along(x)){h[x[i]]<-h[x[i]]+1}; h})
#print('*** vectorized summation') #This uses too much memory if x is large
#system.time(colSums(matrix(rbind(min(x):max(x))[rep(1,length(x)),]==x,ncol=max(x)-min(x)+1)))
#Note: There are some fail cases in some of the above methods that need patching if, for example, there is a chance that some integer bins are unoccupied
and here are the results:
[1] "*** using TABLE:"
user system elapsed
1.26 0.03 1.29
[1] "*** using HIST:"
user system elapsed
0.11 0.00 0.10
[1] "*** using SORT"
user system elapsed
0.22 0.02 0.23
[1] "*** using ECDF"
user system elapsed
0.17 0.00 0.17
[1] "*** counting in loop"
user system elapsed
3.12 0.00 3.12
As you can see table is ridiculously slow and hist seems to be the fastest. But hist (as I am using it) is working on arbitrarily-specifiable breakpoints, whereas I simply want to bin integers. Isn't there a way to trade that flexibility for better performance?
In C, for(i=0;i<1000000;i++)h[x[i]]++; would be blisteringly fast.
The fastest is to use tabulate but it requires positive integers as input, so you have to do a quick monotonic transformation.
set.seed(21)
x <- as.integer(runif(1e6)*1000)
system.time({
adj <- 1L - min(x)
y <- setNames(tabulate(x+adj), sort(unique(x)))
})
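One caveat, offered as a sketch rather than a correction of the timing code: naming the counts with sort(unique(x)) only lines up when every integer between min(x) and max(x) occurs at least once. Naming the bins with the full range avoids that. The helper name count_ints is hypothetical.

```r
count_ints <- function(x) {
  lo <- min(x)
  # Shift so the smallest value lands in bin 1 (tabulate() needs positive integers)
  counts <- tabulate(x - lo + 1L, nbins = max(x) - lo + 1L)
  # One name per integer in the range, including values that never occur
  setNames(counts, lo:max(x))
}

x_small <- c(3L, 5L, 5L, -1L)
res <- count_ints(x_small)
res
```

Unoccupied integers simply show up with a count of zero, so the names stay aligned with the bins.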
Don't forget you can inline C++ code in R.
library(inline)
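src <- '
// Assumes a holds positive whole numbers (1, 2, ...), as tabulate() does
Rcpp::NumericVector xa(a);
int n_xa = xa.size();
int test = max(xa);
// One bin per value from 1 to max(xa), initialised to zero
Rcpp::NumericVector xab(test);
for (int i = 0; i < n_xa; i++)
xab[xa[i]-1]++;
return xab;
'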
fun <- cxxfunction(signature(a = "numeric"),src, plugin = "Rcpp")
I think tabulate or the C++ versions are the way to go but here's some code using rbenchmark which is a great package for looking at timings (I added a few slower function tests too):
######################
### ---Clean Up--- ###
######################
rm(list = ls())
gc()
######################
### ---Packages--- ###
#####################
require(parallel)
require(data.table)
require(rbenchmark)
require(inline)
#######################
### ---Functions--- ###
#######################
# Competitor functions by Breyal
Breyal.using_datatable <- function(x) {DT <- data.table(x = x, weight = 1, key = "x"); DT[, length(weight), by = x]}
Breyal.using_lapply_1c_eq <- function(x = sort(x)) { lapply(unique(x), function(u) sum(x == u)) } # 1 core
Breyal.using_mclapply_8c_eq <- function(x = sort(x)) { mclapply(unique(x), function(u) sum(x == u), mc.cores = 8L) } # 8 cores
# Competitor functions by tennenrishin
tennenrishin.using_table <- function(x) as.data.frame(table(x))
tennenrishin.using_hist <- function(x) hist(x,breaks=min(x):(max(x)+1),plot=FALSE,right=FALSE)
tennenrishin.using_sort <- function(x) {cdf<-cbind(sort(x),seq_along(x)); cdf<-cdf[!duplicated(cdf[,1]),2]; c(cdf[-1],length(x)+1)-cdf}
tennenrishin.using_ecdf <- function(x) {i<-min(x):max(x); cdf<-ecdf(x)(i)*length(x); cdf-c(0,cdf[-length(i)])}
tennenrishin.using_counting_loop <- function(x) {h<-rep(0,max(x)+1);for(i in seq_along(x)){h[x[i]]<-h[x[i]]+1}; h}
# Competitor function by Ulrich
Ulrich.using_tabulate <- function(x) {
adj <- 1L - min(x)
y <- setNames(tabulate(x+adj), sort(unique(x)))
return(y)
}
# I couldn't get Joe's C++ version to work (my laptop won't install inline), but I suspect that would be the fastest solution
##################
### ---Data--- ###
##################
# Set seed so results are reproducible
set.seed(21)
# Data vector
x <- floor(runif(1000000)*1000)
#####################
### ---Timings--- ###
#####################
# Benchmarks using Ubuntu 13.04 x64 with 8GB RAM and i7-2600K CPU @ 3.40GHz
benchmark(replications = 5,
tennenrishin.using_table(x),
tennenrishin.using_hist(x),
tennenrishin.using_sort(x),
tennenrishin.using_ecdf(x),
tennenrishin.using_counting_loop(x),
Ulrich.using_tabulate(x),
Breyal.using_datatable(x),
Breyal.using_lapply_1c_eq(x),
Breyal.using_mclapply_8c_eq(x),
order = "relative")
Which results in the following timings:
test replications elapsed relative user.self sys.self user.child sys.child
6 Ulrich.using_tabulate(x) 5 0.176 1.000 0.176 0.000 0.00 0.000
2 tennenrishin.using_hist(x) 5 0.468 2.659 0.468 0.000 0.00 0.000
3 tennenrishin.using_sort(x) 5 0.687 3.903 0.688 0.000 0.00 0.000
4 tennenrishin.using_ecdf(x) 5 0.749 4.256 0.748 0.000 0.00 0.000
7 Breyal.using_datatable(x) 5 2.960 16.818 2.960 0.000 0.00 0.000
1 tennenrishin.using_table(x) 5 4.651 26.426 4.596 0.052 0.00 0.000
9 Breyal.using_mclapply_8c_eq(x) 5 10.817 61.460 0.140 1.196 54.62 7.112
5 tennenrishin.using_counting_loop(x) 5 10.922 62.057 10.912 0.000 0.00 0.000
8 Breyal.using_lapply_1c_eq(x) 5 36.807 209.131 36.768 0.000 0.00 0.000