What would be the simplest way to round a value up to a specific number of significant figures?
Something like signif(), but one that only rounds up (never down).
For example: rounding 0.001145288 up to 1 significant figure would yield 0.002.
Any suggestions will be appreciated :)
Cheers!
This function provides the desired output in the case of one significant figure:
upround1 <- function(x) {
  # zero has no significant figures; return it unchanged
  if (isTRUE(all.equal(x, 0))) return(x)
  # size of the leading decimal place, e.g. 0.001145288 -> 0.001
  decs <- 10^floor(log10(abs(x)))
  ceiling(x/decs) * decs
}
Examples:
upround1(0.001145288)
#[1] 0.002
upround1(0.0008145258)
#[1] 9e-04
upround1(11)
#[1] 20
upround1(-11)
#[1] -10
upround1(-0.023)
#[1] -0.02
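If more than one significant figure is needed, the same idea generalizes by shifting the scaling exponent. A minimal sketch (the name upround and its digits argument are my own, and it is only meant for scalar inputs, like upround1 above):
upround <- function(x, digits = 1) {
  if (isTRUE(all.equal(x, 0))) return(x)
  # scale so that `digits` significant figures lie left of the decimal point
  decs <- 10^(floor(log10(abs(x))) - digits + 1)
  ceiling(x/decs) * decs
}
upround(0.001145288, 2)
#[1] 0.0012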
I created a function that converts the number to scientific notation, splits the string on 'e', and applies ceiling() to the mantissa before rescaling by the exponent.
my_ceiling <- function(x){
  num_string <- format(x, scientific = TRUE)
  n <- strsplit(num_string, "e")
  n1 <- sapply(n, function(x) as.numeric(x[1]))  # mantissa
  n2 <- sapply(n, function(x) as.numeric(x[2]))  # exponent
  ceiling(n1) * 10^(n2)
}
my_ceiling(0.001145288)
# [1] 0.002
my_ceiling(0.0974343)
# [1] 0.1
my_ceiling(0)
# [1] 0
set.seed(1)
x <- runif(10, 0.001, 0.01)
my_ceiling(x)
# [1] 0.004 0.005 0.007 0.010 0.003 0.010 0.010 0.007
# [9] 0.007 0.002
I am running a Quadratic Assignment Procedure (QAP) in R, using the sna package, to find the correlation of a dependent network matrix with four independent monadic node covariate matrices.
I am using the following code:
cor2 <- sna::gcor(list(bilateralaid_network, GDP_capita_receiver,
                       GDP_capita_sender, HDI_receiver, HDI_sender))
p2 <- qaptest(list(bilateralaid_network, GDP_capita_receiver,
                   GDP_capita_sender, HDI_receiver, HDI_sender),
              gcor, g1 = 1, g2 = 2, reps = 1000)
p3 <- qaptest(list(bilateralaid_network, GDP_capita_receiver,
                   GDP_capita_sender, HDI_receiver, HDI_sender),
              gcor, g1 = 1, g2 = 3, reps = 1000)
# etc.
However, I do not want to run a separate qaptest for each covariate, as shown above.
Is there a way of returning all correlation coefficients and p-values in table or matrix format in R?
I read that it is possible with the UCINET software, which reports all the correlations and p-values together in a single table, but I personally do not have access to it. I would like to produce a similar output in R.
A very simple solution would be the following. It can be optimized in several ways, but it is quite straightforward and only uses functionality from the sna package.
# some random networks
g <- array(dim = c(3, 10, 10))
g[1, , ] <- sna::rgraph(10)
g[2, , ] <- sna::rgraph(10, tprob = g[1, , ]*0.8)  # correlated with the first network
g[3, , ] <- 1
g[3, 1, 2] <- 0
socmat <- network::as.sociomatrix(g)
reps <- 1000

# observed correlations between the networks
testval <- sna::gcor(g)

# permutation distribution, centered on the observed correlations
out <- array(dim = c(dim(g)[1], dim(g)[1], reps))
for (i in 1:reps) {
  out[, , i] <- sna::gcor(sna::rmperm(socmat)) - testval
}
(pgreq <- rowMeans(out > 0, dims = 2))   # Pr(permuted correlation > observed)
(pleeq <- rowMeans(out <= 0, dims = 2))  # Pr(permuted correlation <= observed)
testval
testval
Here, pgreq contains the p-values for the permuted correlations being greater than the observed ones, pleeq contains the p-values for them being smaller than or equal to the observed ones, and testval contains the correlations between the observed networks.
> pgreq
[,1] [,2] [,3]
[1,] 0.000 0.000 0.436
[2,] 0.000 0.000 0.619
[3,] 0.436 0.619 0.000
> pleeq
[,1] [,2] [,3]
[1,] 1.000 1.000 0.564
[2,] 1.000 1.000 0.381
[3,] 0.564 0.381 1.000
> testval
1 2 3
1 1.00000000 0.7135061 -0.09480909
2 0.71350607 1.0000000 -0.13287777
3 -0.09480909 -0.1328778 1.00000000
You can easily package this into a function and make it more efficient. However, this code will probably already be sufficient in many cases.
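As a starting point, here is a minimal sketch of such a function (the name qap_gcor_table is mine; it simply wraps the code above and assumes the graphs are supplied as a single array):
qap_gcor_table <- function(g, reps = 1000) {
  socmat <- network::as.sociomatrix(g)
  testval <- sna::gcor(g)
  out <- array(dim = c(dim(g)[1], dim(g)[1], reps))
  for (i in seq_len(reps)) {
    out[, , i] <- sna::gcor(sna::rmperm(socmat)) - testval
  }
  list(gcor  = testval,
       pgreq = rowMeans(out > 0, dims = 2),   # Pr(permuted > observed)
       pleeq = rowMeans(out <= 0, dims = 2))  # Pr(permuted <= observed)
}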
Given I have 4 different values
intensities <- c(0.1,-0.1,0.05,-0.05)
My goal is to randomly sample each value 5 times, with positive and negative values alternating, e.g.
resultingList = (0.1, -0.05, 0.05, -0.05, 0.1, -0.1, ...)
Does anybody know an elegant way to do this in R?
Maybe something like this
# seed
set.seed(123)
plus <- rep(intensities[intensities >= 0], each = 5)
minus <- rep(intensities[intensities < 0], each = 5)
out <- numeric(length(plus) + length(minus))
out[seq(1, length(out), 2)] <- sample(plus)   # shuffled positives at odd positions
out[seq(2, length(out), 2)] <- sample(minus)  # shuffled negatives at even positions
out
# [1] 0.10 -0.05 0.05 -0.10 0.10 -0.05 0.05 -0.05 0.05 -0.10 0.10 -0.05 0.05 -0.05 0.05 -0.10
# [17] 0.10 -0.10 0.10 -0.10
If the intensities you are sampling from come in +/- pairs, you could just sample from the positive values and then flip the sign of every other draw:
N <- 5
positiveIntensities <- c(0.1, 0.05)
resultingList <- sample(positiveIntensities, N, replace = TRUE) * (-1)^(0:(N - 1))
Here is my solution: a custom function whose argument n gives the length of the output. ceiling() and floor() decide how many draws go to the odd and even positions.
mySample <- function(x, n){
  res <- numeric(n)
  res[seq(1, n, 2)] <- sample(x[x >= 0], ceiling(n / 2), replace = TRUE)  # odd positions
  res[seq(2, n, 2)] <- sample(x[x < 0], floor(n / 2), replace = TRUE)     # even positions
  return(res)
}
intensities <- c(0.1, -0.1, 0.05, -0.05)
mySample(intensities, 10)
# [1] 0.10 -0.10 0.05 -0.05 0.10 -0.05 0.05 -0.05 0.05 -0.10
I'm making a very dirty version of an autocorrelation function in R.
I have a loop that works up to a specified max lag and then returns all the correlations as a matrix, as the acf() function does.
The idea is to replicate the output of the acf() function as shown:
Autocorrelations of series ‘acfData’, by lag
0 1 2 3 4 5 6 7 8
1.000 -0.038 0.253 0.266 0.250 0.267 -0.182 0.281 -0.013
9 10 11 12 13
-0.067 -0.122 -0.115 -0.023 -0.337
What I have so far takes the data and the specified max lag. For each lag, the code slides the data frame back by the required amount, computes the covariance and standard deviations over the overlapping rows, and appends the resulting correlation as shown. I also ran cor() on the same data frames as a check.
My problem is that the code returns the correct value for the first loop, or slide, and then slightly wrong values from then on.
library(DataCombine)  # for slide()

myAcf <- function(dat, lg){
  dataF <- data.frame("data" = dat)
  names(dataF)[1] <- "acfData"
  lagMat <- c()
  testMat <- c()
  for(i in 0:lg){
    # shift the series back by i rows; the first i rows become NA
    dataLag <- slide(dataF, "acfData", slideBy = -i)
    rows <- (1 + i):nrow(dataLag)
    covacf <- cov(dataLag[rows, 1], dataLag[rows, 2])
    sd1 <- sd(dataLag[rows, 1])
    sd2 <- sd(dataLag[rows, 2])
    corrCalc <- covacf/(sd1 * sd2)
    lagMat <- c(lagMat, corrCalc)
    a <- cor(dataLag[rows, 1], dataLag[rows, 2])  # cross-check with cor()
    testMat <- c(testMat, a)
  }
  plot(lagMat)
  return(list(lagMat, testMat))
}
plot(lagMat)
return(list(lagMat, testMat))
}
For the same data as the acf() output above, my code returns:
[[1]]
[1] 1.00000000 -0.03786539 0.27700596 0.30197418 0.31009956
[6] 0.37123797 -0.19520518 0.44399863 0.05684766 0.02063488
[11] -0.03724332
[[2]]
[1] 1.00000000 -0.03842146 0.27502462 0.29292583 0.35052131
[6] 0.40931426 -0.23637159 0.52320559 0.07270497 0.02555461
[11] -0.04524035
Any help is greatly appreciated!
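For reference, some divergence is expected: acf() does not compute a Pearson correlation on each truncated pair of series. It uses the mean of the full series throughout and normalizes by the lag-0 autocovariance, dividing by n rather than n - k. A minimal sketch of that convention (acfSketch is my own name, not the questioner's code):
acfSketch <- function(x, lag.max){
  xc <- x - mean(x)  # demean with the full-series mean
  n <- length(x)
  # r_k = sum of overlapping cross-products, divided by the lag-0 sum of squares
  sapply(0:lag.max, function(k) sum(xc[1:(n - k)] * xc[(1 + k):n])) / sum(xc^2)
}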
I am trying to add all the elements in a matrix. This is an example of my matrix (the actual matrix is bigger):
m = matrix(c(528,479,538,603), nrow = 2, ncol = 2,
           dimnames = list(c("male","female"), c("A","B")))
m
         A   B
male   528 538
female 479 603
I am trying to do:
sum.elements = colSums(colSums(m))
but it gives the following error:
Error in colSums(colSums(m)) : 'x' must be an array of at least two
dimensions
I have tried doing:
x = colSums(m)
sum.elements = x[1] + x[2]
but this would be very long when you have a 100-column matrix...
Any help would be greatly appreciated!
You can use sum(). It also has an na.rm argument to drop NA values.
sum(m)
#[1] 2148
In general, sum() works on vectors, matrices and data.frames.
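For example, with a missing value inserted into the small matrix above:
m2 <- m
m2[1, 1] <- NA
sum(m2)                # NA: missing values propagate by default
sum(m2, na.rm = TRUE)
#[1] 1620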
Benchmarks
set.seed(24)
m1 <- matrix(sample(0:20, 5000*5000, replace=TRUE), ncol=5000)
system.time(sum(m1))
# user system elapsed
# 0.027 0.000 0.026
system.time(sum(colSums(m1)))
# user system elapsed
# 0.027 0.000 0.027
system.time(Reduce('+', m1))
# user system elapsed
#25.977 0.644 26.673
Reduce will work
Reduce(`+`,m)
[1] 2148
Is there a simple and fast way to obtain the frequency of each integer that occurs in a vector of integers in R?
Here are my attempts so far:
x <- floor(runif(1000000)*1000)
print('*** using TABLE:')
system.time(as.data.frame(table(x)))
print('*** using HIST:')
system.time(hist(x,breaks=min(x):(max(x)+1),plot=FALSE,right=FALSE))
print('*** using SORT')
system.time({cdf<-cbind(sort(x),seq_along(x)); cdf<-cdf[!duplicated(cdf[,1]),2]; c(cdf[-1],length(x)+1)-cdf})
print('*** using ECDF')
system.time({i<-min(x):max(x); cdf<-ecdf(x)(i)*length(x); cdf-c(0,cdf[-length(i)])})
print('*** counting in loop')
system.time({h<-rep(0,max(x)+1);for(i in seq_along(x)){h[x[i]]<-h[x[i]]+1}; h})
#print('*** vectorized summation') #This uses too much memory if x is large
#system.time(colSums(matrix(rbind(min(x):max(x))[rep(1,length(x)),]==x,ncol=max(x)-min(x)+1)))
#Note: There are some fail cases in some of the above methods that need patching if, for example, there is a chance that some integer bins are unoccupied
and here are the results:
[1] "*** using TABLE:"
user system elapsed
1.26 0.03 1.29
[1] "*** using HIST:"
user system elapsed
0.11 0.00 0.10
[1] "*** using SORT"
user system elapsed
0.22 0.02 0.23
[1] "*** using ECDF"
user system elapsed
0.17 0.00 0.17
[1] "*** counting in loop"
user system elapsed
3.12 0.00 3.12
As you can see, table() is ridiculously slow and hist() seems to be the fastest. But hist() (as I am using it) works with arbitrarily specifiable breakpoints, whereas I simply want to bin integers. Isn't there a way to trade that flexibility for better performance?
In C, for(i=0;i<1000000;i++)h[x[i]]++; would be blisteringly fast.
The fastest is to use tabulate, but it requires positive integers as input, so you first have to apply a quick monotonic shift.
set.seed(21)
x <- as.integer(runif(1e6)*1000)
system.time({
  adj <- 1L - min(x)
  y <- setNames(tabulate(x + adj), sort(unique(x)))
})
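Note that naming the counts with sort(unique(x)) assumes every integer between min(x) and max(x) actually occurs; with gaps, tabulate() returns more bins than there are unique values. A gap-safe variant might look like this (a sketch using tabulate()'s nbins argument):
adj <- 1L - min(x)
y <- tabulate(x + adj, nbins = max(x) + adj)
names(y) <- seq(min(x), max(x))  # empty bins are kept as zero counts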
Don't forget you can inline C++ code in R.
library(inline)
src <- '
  Rcpp::NumericVector xa(a);
  int n_xa = xa.size();
  int test = max(xa);
  // one bin per value; assumes the input holds only integers in 1..max
  Rcpp::NumericVector xab(test);
  for (int i = 0; i < n_xa; i++)
    xab[xa[i] - 1]++;
  return xab;
'
fun <- cxxfunction(signature(a = "numeric"), src, plugin = "Rcpp")
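A quick usage sketch (my addition; the original answer does not show a call). Since the function counts values 1..max, zero-based data such as the x above needs shifting first:
counts <- fun(x + 1)  # counts[k] tallies how often k - 1 occurs in x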
I think tabulate or the C++ version is the way to go, but here is some code using rbenchmark, which is a great package for looking at timings (I added a few slower functions to the test too):
######################
### ---Clean Up--- ###
######################
rm(list = ls())
gc()
######################
### ---Packages--- ###
######################
require(parallel)
require(data.table)
require(rbenchmark)
require(inline)
#######################
### ---Functions--- ###
#######################
# Competitor functions by Breyal
Breyal.using_datatable <- function(x) {DT <- data.table(x = x, weight = 1, key = "x"); DT[, length(weight), by = x]}
Breyal.using_lapply_1c_eq <- function(x = sort(x)) { lapply(unique(x), function(u) sum(x == u)) } # 1 core
Breyal.using_mclapply_8c_eq <- function(x = sort(x)) { mclapply(unique(x), function(u) sum(x == u), mc.cores = 8L) } # 8 cores
# Competitor functions by tennenrishin
tennenrishin.using_table <- function(x) as.data.frame(table(x))
tennenrishin.using_hist <- function(x) hist(x,breaks=min(x):(max(x)+1),plot=FALSE,right=FALSE)
tennenrishin.using_sort <- function(x) {cdf<-cbind(sort(x),seq_along(x)); cdf<-cdf[!duplicated(cdf[,1]),2]; c(cdf[-1],length(x)+1)-cdf}
tennenrishin.using_ecdf <- function(x) {i<-min(x):max(x); cdf<-ecdf(x)(i)*length(x); cdf-c(0,cdf[-length(i)])}
tennenrishin.using_counting_loop <- function(x) {h<-rep(0,max(x)+1);for(i in seq_along(x)){h[x[i]]<-h[x[i]]+1}; h}
# Competitor function by Ulrich
Ulrich.using_tabulate <- function(x) {
adj <- 1L - min(x)
y <- setNames(tabulate(x+adj), sort(unique(x)))
return(y)
}
# I couldn't get Joe's C++ version to work (my laptop won't install inline), but I suspect that would be the fastest solution
##################
### ---Data--- ###
##################
# Set seed so results are reproducible
set.seed(21)
# Data vector
x <- floor(runif(1000000)*1000)
#####################
### ---Timings--- ###
#####################
# Benchmarks using Ubuntu 13.04 x64 with 8GB RAM and an i7-2600K CPU @ 3.40GHz
benchmark(replications = 5,
tennenrishin.using_table(x),
tennenrishin.using_hist(x),
tennenrishin.using_sort(x),
tennenrishin.using_ecdf(x),
tennenrishin.using_counting_loop(x),
Ulrich.using_tabulate(x),
Breyal.using_datatable(x),
Breyal.using_lapply_1c_eq(x),
Breyal.using_mclapply_8c_eq(x),
order = "relative")
This results in the following timings:
test replications elapsed relative user.self sys.self user.child sys.child
6 Ulrich.using_tabulate(x) 5 0.176 1.000 0.176 0.000 0.00 0.000
2 tennenrishin.using_hist(x) 5 0.468 2.659 0.468 0.000 0.00 0.000
3 tennenrishin.using_sort(x) 5 0.687 3.903 0.688 0.000 0.00 0.000
4 tennenrishin.using_ecdf(x) 5 0.749 4.256 0.748 0.000 0.00 0.000
7 Breyal.using_datatable(x) 5 2.960 16.818 2.960 0.000 0.00 0.000
1 tennenrishin.using_table(x) 5 4.651 26.426 4.596 0.052 0.00 0.000
9 Breyal.using_mclapply_8c_eq(x) 5 10.817 61.460 0.140 1.196 54.62 7.112
5 tennenrishin.using_counting_loop(x) 5 10.922 62.057 10.912 0.000 0.00 0.000
8 Breyal.using_lapply_1c_eq(x) 5 36.807 209.131 36.768 0.000 0.00 0.000