Calculate a percentile of dataframe column efficiently - r

This question is an extension to the StackOverflow question asked and answered here.
My circumstances are different in that I want to calculate the percentile of each value within a vector of 50,000 (or more!) values. For example --
df <- data.frame(val = rnorm(n = 50000, mean = 50, sd = 20))
df$val.percentile <- sapply(X = df$val, function(x) ecdf(df$val)(x))
head(df)
Is there a good way to optimize the process for calculating the percentile for each value? Essentially I'd like to make it as efficient as possible so the run time is as small as possible.

ecdf() is already vectorized, so there is no reason to use an apply function. You can simply run:
df$val.percentile <- ecdf(df$val)(df$val)
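To see what that buys you, here is a minimal benchmark sketch (my own addition, run on a smaller vector so the sapply version finishes quickly; it assumes the microbenchmark package is installed):
library(microbenchmark)
vals <- rnorm(5000, mean = 50, sd = 20)
microbenchmark(
  sapply_ecdf = sapply(vals, function(x) ecdf(vals)(x)),  # rebuilds the ECDF once per value
  vector_ecdf = ecdf(vals)(vals),                         # builds the ECDF once, evaluates everything
  times = 5
)
The per-value version re-sorts the whole vector for every element, so the gap widens rapidly as the length of val grows.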

You can use dplyr::percent_rank() to rank each value on a percentile scale. Note, however, that this differs from dplyr::cume_dist(), the cumulative distribution function (the proportion of all values less than or equal to the current value).
Reproducible example:
set.seed(1)
df <- data.frame(val = rnorm(n = 1000000, mean = 50, sd = 20))
Show that percent_rank() differs from cume_dist() and that cume_dist() is the same as ecdf(x)(x):
library(tidyverse)
head(df) %>%
  mutate(pr = percent_rank(val),
         cd = ecdf(val)(val),
         cd2 = cume_dist(val))
val pr cd cd2
1 37.47092 0.4 0.5000000 0.5000000
2 53.67287 0.6 0.6666667 0.6666667
3 33.28743 0.0 0.1666667 0.1666667
4 81.90562 1.0 1.0000000 1.0000000
5 56.59016 0.8 0.8333333 0.8333333
6 33.59063 0.2 0.3333333 0.3333333
The speed of each approach on this example dataset is roughly similar, with the slowest no more than twice the fastest:
library(microbenchmark)
mbm <- microbenchmark(
  pr_dplyr = mutate(df, pr = percent_rank(val)),
  cd_dplyr = mutate(df, cd = cume_dist(val)),
  cd_base = mutate(df, cd = ecdf(val)(val)),
  times = 20
)
autoplot(mbm)

Related

Sampling using conditional probability table

I am trying to simulate a certain discrete variable depicting the "true state of the world" (say, "red", "green" or "blue") and an indicator that describes it somewhat imperfectly.
r_names <- c("R", "G", "B")
Let's say I have some prior belief about the distribution of the "reality" variable, which I will use to sample it.
r_probs <- c(0.3, 0.5, 0.2)
set.seed(100)
reality <- sample(seq_along(r_names), 10000, prob=r_probs, replace = TRUE)
Now, let's say I have a conditional probability table that stipulates the value of the indicator given each of the "realities":
ri_matrix <- matrix(c(0.7,  0.3,  0,
                      0.2,  0.6,  0.2,
                      0.05, 0.15, 0.8), byrow = TRUE, nrow = 3)
dimnames(ri_matrix) <- list(paste("real", r_names, sep = "_"),
                            paste("ind", r_names, sep = "_"))
ri_matrix
>#        ind_R ind_G ind_B
># real_R  0.70  0.30   0.0
># real_G  0.20  0.60   0.2
># real_B  0.05  0.15   0.8
Since base::sample() is not vectorized over the prob argument, I have to do this:
sample_cond <- function(r, rim){
  unlist(lapply(r, function(x)
    sample(seq_len(ncol(rim)), 1, prob = rim[x, ], replace = TRUE)))
}
Now I can sample my "indicator" variable using the conditional probability matrix
set.seed(200)
indicator <- sample_cond(reality, ri_matrix)
Just to make sure the distributions turned out as expected:
prop.table(table(reality, indicator), margin = 1)
#> indicator
#> reality 1 2 3
#> 1 0.70043610 0.29956390 0.00000000
#> 2 0.19976124 0.59331476 0.20692400
#> 3 0.04365278 0.14400401 0.81234320
Is there a better (i.e. more idiomatic and/or efficient) way to sample a discrete variable conditioned on another discrete random variable?
UPDATE:
As suggested by @Mr.Flick, the following is at least 50x faster, because it reuses the probability vectors instead of repeatedly subsetting the conditional probability matrix.
sample_cond_group <- function(r, rim){
  il <- mapply(function(x, y) sample(seq(ncol(rim)), length(x), prob = y, replace = TRUE),
               x = split(r, r),
               y = split(rim, seq(nrow(rim))))
  unsplit(il, r)
}
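Usage mirrors the original sample_cond(); the seed below is arbitrary:
set.seed(200)
indicator2 <- sample_cond_group(reality, ri_matrix)
prop.table(table(reality, indicator2), margin = 1)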
You can be a bit more efficient by drawing all the random samples per group with a split/combine type strategy. That might look something like this:
simFun <- function(N, r_probs, ri_matrix) {
  stopifnot(length(r_probs) == nrow(ri_matrix))
  ind <- sample.int(length(r_probs), N, prob = r_probs, replace = TRUE)
  grp <- split(data.frame(ind), ind)
  unsplit(Map(function(data, r) {
    draw <- sample.int(ncol(ri_matrix), nrow(data), replace = TRUE, prob = ri_matrix[r, ])
    data.frame(data, draw)
  }, grp, as.numeric(names(grp))), ind)
}
Then you can call it with:
simFun(10000, r_probs, ri_matrix)
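For what it's worth, a fully vectorized sketch (my own addition, not from either answer) avoids splitting altogether by inverting the row-wise cumulative probabilities of the conditional table; it assumes reality and ri_matrix are defined as above:
# Row-wise cumulative probabilities of the conditional table
ri_cum <- t(apply(ri_matrix, 1, cumsum))
# One uniform draw per observation; the indicator is the first category whose
# cumulative probability exceeds the draw
u <- runif(length(reality))
indicator_vec <- rowSums(u > ri_cum[reality, , drop = FALSE]) + 1
prop.table(table(reality, indicator_vec), margin = 1)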

Large raster frequency table / counts

I am trying to calculate the frequency/count of the pixel values of a raster in R using freq().
Create two example rasters for comparison:
library(raster)
RastSmall <- raster(nrow=70, ncol=70)
RastBig <- raster(nrow=7000, ncol=7000)
set.seed(0)
RastSmall[] <- round(runif(ncell(RastSmall), 1, 5))
RastBig[] <- round(runif(ncell(RastBig), 1, 5))
Get the pixel count using freq() (output shown for the large raster):
freq(RastBig)
value count
[1,] 1 6540000
[2,] 2 12150000
[3,] 3 12140000
[4,] 4 11720000
[5,] 5 6450000
However, for a fairly large raster this takes extremely long, i.e. up to hours. Is there a faster way in R?
Here is the speed difference for a small and a large raster:
system.time(freq(RastSmall))
user system elapsed
0.008 0.000 0.004
system.time(freq(RastBig))
user system elapsed
40.484 0.964 41.445
Is there a way to speed this up? Alternatively, can this be done on the command line using something like the GDAL tools?
I did exactly that last week; however, I couldn't find any faster way to do it in R. I tried the RQGIS package, calling GRASS's r.report. It works but was slower than the native R function. Maybe you'll have better luck. Here is my code with GRASS in case you want to try it:
library(RQGIS)
monqgis <- set_env("C:\\Mrnmicro\\Applic\\OSGeo4W")
find_algorithms(search_term = "report", qgis_env = monqgis)
get_usage(alg = "grass7:r.report", qgis_env = monqgis)
params <- get_args_man(alg = "grass7:r.report", qgis_env = monqgis)
get_usage(alg = "grass7:r.report", qgis_env = monqgis)
params$map <- classif   # 'classif' is the raster being summarised (defined elsewhere)
params$units <- 5
params$rawoutput <- "C:\\temp\\outputRQGIS_raw"
params$html <- "C:\\temp\\outputRQGIS"
system.time(asas <- run_qgis(alg = "grass7:r.report", params = params,
                             load_output = params$OUTPUT, qgis_env = monqgis))
Not an amazing saving, but if you use getValues() on your raster and then run base::table(), it saves about 20%. My raster was c. 500 million cells.
# read in raster to obtain frequency table
r <- raster("./path/myraster.tif")
# perform tests; traditional freq() vs. getValues() & table()
require(microbenchmark)
mbm <- microbenchmark(
  Freq = {freqf <- freq(r, useNA = "no");
          freq.df <- data.frame(CODE = freqf[, 1], N = freqf[, 2])},
  GetVals = {v <- getValues(r);
             vt <- table(v);
             getval.df <- data.frame(CODE = as.numeric(names(vt)), N = as.numeric(as.matrix(vt)))},
  times = 5
)
mbm
Unit: seconds
expr min lq mean median uq max neval
Freq 191.1649 191.8001 198.8567 192.5256 193.0986 225.6942 5
GetVals 153.5552 154.8776 156.9173 157.0539 159.0400 160.0598 5
# check the routines have identical results
identical(freq.df,getval.df)
[1] TRUE
Bit of a saving, I guess.
(N.B. the reason I make the data frames is that I go on to process the data that comes out of the frequency analysis.)
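A further squeeze (my own sketch, not part of the answer above): when the raster stores small positive integer codes, base::tabulate() skips the factor handling that makes table() comparatively slow.
v <- getValues(r)                                    # 'r' as read in above
counts <- tabulate(v, nbins = max(v, na.rm = TRUE))  # NAs and zeros are silently ignored
tab.df <- data.frame(CODE = seq_along(counts), N = counts)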
I think the most effective way to calculate this is by using GetHistogram() from GDAL. Unfortunately, I can't find a way to call it directly from R. The closest approach is to use gdalUtilities::gdalinfo from R with the -hist flag (hist = TRUE), but its calculations are limited to values between 0 and 255.
Another option is rasterDT::freqDT, which is faster than the regular options. Here is an example:
library(gdalUtilities)
library(raster)
library(rasterDT)
library(microbenchmark)
RastBig <- raster(nrow=7000, ncol=7000)
set.seed(0)
RastBig[] <- round(runif(1:ncell(RastBig), 1, 5))
writeRaster(RastBig, filename = 'C:/temp/RastBig.tif')
mbm <- microbenchmark(times = 50,
  freq1 = freq(RastBig),
  freq2 = table(RastBig[]),
  freq3 = freqDT(RastBig),
  freq4 = {
    gdalLog <- capture.output(gdalUtilities::gdalinfo(datasetname = 'C:/temp/RastBig.tif', hist = TRUE))
    bucxml <- as.numeric(sub('buckets.+', '', grep('buckets ', gdalLog, value = TRUE)))
    minxml <- as.numeric(gsub('.+from | to.+', '', grep('buckets ', gdalLog, value = TRUE)))
    maxxml <- as.numeric(gsub('.+to |:', '', grep('buckets ', gdalLog, value = TRUE)))
    histxml <- as.numeric(strsplit(split = '[[:space:]]', gsub("^ |^ ", "", gdalLog[grep('buckets', gdalLog) + 1]))[[1]])
    labs <- seq(from = minxml, to = maxxml, length.out = bucxml)
    df <- data.frame(labs,
                     nwlab = c(ceiling(labs[1]),
                               round(labs[2:(bucxml - 1)]),
                               floor(labs[bucxml])),
                     val = histxml)
    hist <- aggregate(df$val, by = list(df$nwlab), sum)
  }
)
Results:
> freq1
value count
[1,] 1 6127755
[2,] 2 12251324
[3,] 3 12249376
[4,] 4 12248938
[5,] 5 6122607
> freq2
1 2 3 4 5
6127755 12251324 12249376 12248938 6122607
> freq3
ID freq
1: 1 6127755
2: 2 12251324
3: 3 12249376
4: 4 12248938
5: 5 6122607
> freq4
Group.1 x
1 1 6127755
2 2 12251324
3 3 12249376
4 4 12248938
5 5 6122607
Unit: milliseconds
expr min lq mean median uq max neval
freq1 58628.486301 59100.539302 59400.304887 59383.913701 59650.412 60841.3975 50
freq2 55912.170401 56663.025202 56954.032395 56919.905051 57202.001 58307.9500 50
freq3 3785.767301 4006.858102 4288.699531 4292.447250 4536.382 4996.0598 50
freq4 7.892201 8.883102 9.255641 9.154001 9.483 15.6072 50
EDIT: running freqDT() on the raster read back from disk is quite a bit faster than option 3:
rB <- raster('C:/temp/RastBig.tif')
freq3B <- freqDT(rB)

Performing simulations and combining the data into one data frame

For each reported study, I want to do 1000 simulations of a parameter X using a normal or log-normal distribution (based on a flag) and then combine all the simulations into one data frame. I am looking for an automated way of doing this.
What I have is a data frame with the following columns:
SOURCE NSUB MEAN SD DIST
Study1 10 1.5 0.3 0
Study2 5 2.5 0.4 1
Study1 4 3.5 0.3 0
When DIST == 0 the distribution is normal; when DIST == 1 it is log-normal.
I am able to do the simulations and combine them using hard-coding; for example:
#for Study1:
set.seed(1)
NSUB <- 10
MEAN <- 1.5
SD <- 0.3
DIST <- 0 #Normal distribution
df1 <- data.frame("SOURCE"="Study1","NSUB"=NSUB,"DIST"=DIST, "VALUE" = rnorm(1000, mean=MEAN, sd=SD))
#For study2
set.seed(2)
NSUB <- 5
MEAN <- 2.5
SD <- 0.4
DIST <- 1 #log-normal distribution
df2 <- data.frame("SOURCE"="Study2","NSUB"=NSUB,"DIST"=DIST, "VALUE" = rlnorm(1000, meanlog=log(MEAN), sdlog=SD))
#Combine all
dfall <- rbind(df1,df2)
However, this would be tedious because I have a lot of reported means and SDs for the parameter. I need help making this automated, so that it does 1000 simulations for each row (using MEAN and SD) and then combines all the simulated data into one data frame.
In the interest of implementing readable and general code, you should do two things here:
Write a function that takes each row of your simulation configuration dataset and returns the simulated values as a data_frame (doSim below). This makes it easier to test your simulation code separately from your iteration over simulation configurations.
Use dplyr to pass each row of the configuration dataset to this function and collect the results as a data_frame.
Here is some sample code:
library(dplyr)
# read in the simultation configuration dataset
dfX = read.table(textConnection("
SOURCE NSUB MEAN SD DIST
Study1 10 1.5 0.3 0
Study2 5 2.5 0.4 1
Study1 4 3.5 0.3 0"),
header = TRUE, stringsAsFactors = FALSE)
# write a function that takes each row of the configuration
# data.frame and returns the simulations
doSim = function(simConfig, seed = 12345) {
  set.seed(seed)
  # rnorm() takes mean/sd; rlnorm() takes meanlog/sdlog, which R's partial
  # argument matching resolves from the mean/sd names used below
  dist = if (simConfig[["DIST"]] == 0) rnorm else rlnorm
  mean = if (simConfig[["DIST"]] == 0) simConfig[["MEAN"]] else log(simConfig[["MEAN"]])
  return(
    data_frame(
      source = simConfig[["SOURCE"]],
      nsub = simConfig[["NSUB"]],
      value = dist(1000, mean = mean, sd = simConfig[["SD"]])
    )
  )
}
# test the function
doSim(dfX[1, ])
# apply over dfX
dfX %>%
rowwise() %>%
do(doSim(.))
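If you prefer plain iteration to rowwise()/do(), an equivalent sketch using the same doSim() looks like this (the per-row seed offset is my own choice so that rows are not all simulated with the identical default seed):
simList <- lapply(seq_len(nrow(dfX)), function(i) doSim(dfX[i, ], seed = 12345 + i))
dfAll <- do.call(rbind, simList)
head(dfAll)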

trying to perform a t.test for each row and count all rows where p-value is less than 0.05

I've been racking my brain for the past four hours trying to find the solution to an R problem, which is driving me nuts. I've been searching everywhere for a decent answer but so far I've been hitting wall after wall. I am now appealing to the good will of this fine community for help.
Consider the following dataset:
set.seed(2112)
DataSample <- matrix(rnorm(24000),nrow=1000)
colnames(DataSample) <- c(paste("Trial",1:12,sep=""),paste("Control",13:24,sep=""))
I need to perform a t-test for every row in DataSample in order to find out if groups TRIAL and CONTROL differ (equal variance applies).
Then I need to count the number of rows with a p-value equal to, or lower than 0.05.
So here is the code I tried, which I know is wrong:
set.seed(2112)
DataSample <- matrix(rnorm(24000),nrow=1000)
colnames(DataSample) <- c(paste("Trial",1:12,sep=""),paste("Control",13:24,sep=""))
pValResults <- apply(
DataSample[,1:12],1,function(x) t.test(x,DataSample[,13:24], var.equal=T)$p.value
)
sum(pValResults < 0.05) # Returns the wrong answer (so I was told)
I did try looking at many similar questions around Stack Overflow, but I would often end up with syntax errors or a dimensional mismatch. The code above is the best I could get without R returning an error -- but since the code is returning the wrong answer, I have nothing to feel proud of.
Any advice will be greatly appreciated! Thanks in advance for your time.
One option is to loop over the data set, calculating the t-test for each row, though it is not as elegant.
set.seed(2112)
DataSample <- matrix(rnorm(24000),nrow=1000)
colnames(DataSample) <- c(paste("Trial",1:12,sep=""),paste("Control",13:24,sep=""))
# initialize vector of stored p-values
pvalue <- rep(0,nrow(DataSample))
for (i in 1:nrow(DataSample)) {
  pvalue[i] <- t.test(DataSample[i, 1:12], DataSample[i, 13:24])$p.value
}
# finding number that are significant
sum(pvalue < 0.05)
I converted to a data.table, and the answer I got was 45:
library(data.table)
DataSample.dt <- as.data.table(DataSample)
sum(sapply(seq_len(nrow(DataSample.dt)), function(x)
  t.test(DataSample.dt[x, paste0('Trial', 1:12), with = F],
         DataSample.dt[x, paste0('Control', 13:24), with = F],
         var.equal = T)$p.value) < 0.05)
To do a paired t-test, you need to supply the paired = TRUE argument. The t.test function isn't vectorised, but it's quite simple to do t-tests a whole matrix at a time. Here are three methods (including using apply):
library("genefilter")
library("matrixStats")
library("microbenchmark")
dd <- DataSample[, 1:12] - DataSample[, 13:24]
microbenchmark::microbenchmark(
manual = {ps1 <- 2 * pt(-abs(rowMeans(dd) / sqrt(rowVars(dd) / ncol(dd))), ncol(dd) - 1)},
apply = {ps2 <- apply(DataSample, 1, function(x) t.test(x[1:12], x[13:24], paired=TRUE)$p.value)},
rowttests = {ps3 <- rowttests(dd)[, "p.value"]})
#Unit: milliseconds
#      expr        min         lq       mean     median         uq        max neval
#    manual   1.611808   1.641783   1.677010   1.663122   1.709401   1.852347   100
#     apply 390.869635 398.720930 404.391487 401.508382 405.715668 634.932675   100
# rowttests   2.368823   2.417837   2.639671   2.574320   2.757870   7.207135   100
You can see the manual method is over 200x faster than apply.
If you actually meant an unpaired test, here's the equivalent comparison:
microbenchmark::microbenchmark(
manual = {x <- DataSample[, 1:12]; y <- DataSample[, 13:24]; ps1 <- 2 * pt(-abs((rowMeans(x) - rowMeans(y)) / sqrt((rowVars(x) + rowVars(y)) / ncol(x))), ncol(DataSample) - 2)},
apply = { ps2 <- apply(DataSample, 1, function(x) t.test(x[1:12], x[13:24], var.equal = TRUE)$p.value)},
rowttests = {ps3 <- rowttests(DataSample, factor(rep(1:2, each = 12)))[, "p.value"]})
Note the manual method assumes that the two groups are the same sizes.
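If the groups had different sizes, a sketch of the same manual approach with the usual pooled-variance formula (assuming x and y are the per-group matrices from the manual expression above, with matrixStats loaded) would be:
nx <- ncol(x); ny <- ncol(y)
# pooled variance and equal-variance t statistic for unequal group sizes
sp2 <- ((nx - 1) * rowVars(x) + (ny - 1) * rowVars(y)) / (nx + ny - 2)
tstat <- (rowMeans(x) - rowMeans(y)) / sqrt(sp2 * (1 / nx + 1 / ny))
ps <- 2 * pt(-abs(tstat), df = nx + ny - 2)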
Adding an alternative using an external library.
Performing the test:
library(matrixTests)
res <- row_t_equalvar(DataSample[,1:12], DataSample[,13:24])
Format of the result:
res
obs.x obs.y obs.tot mean.x mean.y mean.diff var.x var.y var.pooled stderr df statistic pvalue conf.low conf.high alternative mean.null conf.level
1 12 12 24 0.30569721 0.160622830 0.145074376 0.5034806 1.0769678 0.7902242 0.3629105 22 0.399752487 0.69319351 -0.6075559 0.89770469 two.sided 0 0.95
2 12 12 24 -0.27463354 -0.206396781 -0.068236762 0.8133311 0.2807800 0.5470556 0.3019535 22 -0.225984324 0.82329990 -0.6944500 0.55797651 two.sided 0 0.95
3 12 12 24 -0.19805092 -0.023207888 -0.174843032 0.4278359 0.5604078 0.4941219 0.2869733 22 -0.609265949 0.54858909 -0.7699891 0.42030307 two.sided 0 0.95
Number of rows with p <= 0.05:
> sum(res$pvalue <= 0.05)
[1] 4

Running 'prop.test' multiple times in R

I have some data showing a long list of regions, the population of each region and the number of people in each region with a certain disease. I'm trying to show the confidence intervals for each proportion (but I'm not testing whether the proportions are statistically different).
One approach is to manually calculate the standard errors and confidence intervals but I'd like to use a built-in tool like prop.test, because it has some useful options. However, when I use prop.test with vectors, it runs a chi-square test across all the proportions.
I've solved this with a while loop (see dummy data below), but I sense there must be a better and simpler way to approach this problem. Would apply work here, and how? Thanks!
dat <- data.frame(1:5, c(10, 50, 20, 30, 35))
names(dat) <- c("X", "N")
dat$Prop <- dat$X / dat$N
ConfLower = 0
x = 1
while (x < 6) {
  a <- prop.test(dat$X[x], dat$N[x])$conf.int[1]
  ConfLower <- c(ConfLower, a)
  x <- x + 1
}
ConfUpper = 0
x = 1
while (x < 6) {
  a <- prop.test(dat$X[x], dat$N[x])$conf.int[2]
  ConfUpper <- c(ConfUpper, a)
  x <- x + 1
}
dat$ConfLower <- ConfLower[2:6]
dat$ConfUpper <- ConfUpper[2:6]
Here's an attempt using Map, essentially stolen from a previous answer here:
https://stackoverflow.com/a/15059327/496803
res <- Map(prop.test,dat$X,dat$N)
dat[c("lower","upper")] <- t(sapply(res,"[[","conf.int"))
# X N Prop lower upper
#1 1 10 0.1000000 0.005242302 0.4588460
#2 2 50 0.0400000 0.006958623 0.1485882
#3 3 20 0.1500000 0.039566272 0.3886251
#4 4 30 0.1333333 0.043597084 0.3164238
#5 5 35 0.1428571 0.053814457 0.3104216
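A closely related sketch with mapply (equivalent to the Map version above) collects both bounds in one step:
# each prop.test() call returns a length-2 conf.int; mapply simplifies the
# results to a 2 x n matrix, which t() turns into one row per region
ci <- t(mapply(function(x, n) prop.test(x, n)$conf.int, dat$X, dat$N))
dat[c("lower", "upper")] <- ci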
