This question already has answers here:
Calculate the Area under a Curve
(7 answers)
Closed 7 years ago.
I want to integrate a one dimensional vector in R, How should I do that?
Let's say I have:
d=hist(p, breaks=100, plot=FALSE)$density
where p is a sample like:
p=rnorm(1e5)
How can I calculate an integral over d?
If we assume that the values in d correspond to the y values of a function then we can calculate the integral by using a discrete approximation. We can for example use the trapezium rule or Simpsons rule for this purpose. We then also need to input the stepsize that corresponds to the discrete interval on the x-axis in order to "approximate the area under the curve".
Discrete integration functions defined below:
p=rnorm(1e5)
d=hist(p,breaks=100,plot=FALSE)$density
discreteIntegrationTrapeziumRule <- function(v,lower=1,upper=length(v),stepsize=1)
{
if(upper > length(v))
upper=length(v)
if(lower < 1)
lower=1
integrand <- v[lower:upper]
l <- length(integrand)
stepsize*(0.5*integrand[1]+sum(integrand[2:(l-1)])+0.5*v[l])
}
discreteIntegrationSimpsonRule <- function(v,lower=1,upper=length(v),stepsize=1)
{
if(upper > length(v))
upper=length(v)
if(lower < 1)
lower=1
integrand <- v[lower:upper]
l <- length(integrand)
a = seq(from=2,to=l-1,by=2);
b = seq(from=3,to=l-1,by=2)
(stepsize/3)*(integrand[1]+4*sum(integrand[a])+2*sum(integrand[b])+integrand[l])
}
As an example, let's approximate the complete area under the curve while assuming discrete x steps of size 1 and then do the same for the second half of d while we assume x-steps of size 0.2.
> plot(1:length(d),d) # stepsize one on x-axis
> resultTrapeziumRule <- discreteIntegrationTrapeziumRule(d) # integrate over complete interval, assume x-stepsize = 1
> resultSimpsonRule <- discreteIntegrationSimpsonRule(d) # integrate over complete interval, assume x-stepsize = 1
> resultTrapeziumRule
[1] 9.9999
> resultSimpsonRule
[1] 10.00247
> plot(seq(from=-10,to=(-10+(length(d)*0.2)-0.2),by=0.2),d) # stepsize 0.2 on x-axis
> resultTrapziumRule <- discreteIntegrationTrapeziumRule(d,ceiling(length(d)/2),length(d),0.2) # integrate over second part of vector, x-stepsize=0.2
> resultSimpsonRule <- discreteIntegrationSimpsonRule(d,ceiling(length(d)/2),length(d),0.2) # integrate over second part of vector, x-stepsize=0.2
> resultTrapziumRule
[1] 1.15478
> resultSimpsonRule
[1] 1.11678
In general, the Simpson rule offers better approximations of the integral. The more y-values you have (and the smaller the x-axis stepsize), the better your approximations will become.
Small EDIT for clarity:
In this particular case the stepsize should obviously be 0.1. The complete area under the density curve is then (approximately) equal to 1, as expected.
> d=hist(p,breaks=100,plot=FALSE)$density
> hist(p,breaks=100,plot=FALSE)$mids # stepsize = 0.1
[1] -4.75 -4.65 -4.55 -4.45 -4.35 -4.25 -4.15 -4.05 -3.95 -3.85 -3.75 -3.65 -3.55 -3.45 -3.35 -3.25 -3.15 -3.05 -2.95 -2.85 -2.75 -2.65 -2.55
[24] -2.45 -2.35 -2.25 -2.15 -2.05 -1.95 -1.85 -1.75 -1.65 -1.55 -1.45 -1.35 -1.25 -1.15 -1.05 -0.95 -0.85 -0.75 -0.65 -0.55 -0.45 -0.35 -0.25
[47] -0.15 -0.05 0.05 0.15 0.25 0.35 0.45 0.55 0.65 0.75 0.85 0.95 1.05 1.15 1.25 1.35 1.45 1.55 1.65 1.75 1.85 1.95 2.05
[70] 2.15 2.25 2.35 2.45 2.55 2.65 2.75 2.85 2.95 3.05 3.15 3.25 3.35 3.45 3.55 3.65 3.75 3.85 3.95 4.05 4.15
> resultTrapeziumRule <- discreteIntegrationTrapeziumRule(d,stepsize=0.1)
> resultTrapeziumRule
[1] 0.999985
Related
I have plotted the conditional density distribution of my variables by using cdplot (R). My independent variable and my dependent variable are not independent. Independent variable is discrete (it takes only certain values between 0 and 3) and dependent variable is also discrete (11 levels from 0 to 1 in steps of 0.1).
Some data:
dat <- read.table( text="y x
3.00 0.0
2.75 0.0
2.75 0.1
2.75 0.1
2.75 0.2
2.25 0.2
3 0.3
2 0.3
2.25 0.4
1.75 0.4
1.75 0.5
2 0.5
1.75 0.6
1.75 0.6
1.75 0.7
1 0.7
0.54 0.8
0 0.8
0.54 0.9
0 0.9
0 1.0
0 1.0", header=TRUE, colClasses="factor")
I wonder if my variables are appropriate to run this kind of analysis.
Also, I'd like to know how to report this results in an elegant way with academic and statistical sense.
This is a run using the rms-packages `lrm function which is typically used for binary outcomes but also handles ordered categorical variables:
library(rms) # also loads Hmisc
# first get data in the form you described
dat[] <- lapply(dat, ordered) # makes both columns ordered factor variables
?lrm
#read help page ... Also look at the supporting book and citations on that page
lrm( y ~ x, data=dat)
# --- output------
Logistic Regression Model
lrm(formula = y ~ x, data = dat)
Frequencies of Responses
0 0.54 1 1.75 2 2.25 2.75 3 3.00
4 2 1 5 2 2 4 1 1
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 22 LR chi2 51.66 R2 0.920 C 0.869
max |deriv| 0.0004 d.f. 10 g 20.742 Dxy 0.738
Pr(> chi2) <0.0001 gr 1019053402.761 gamma 0.916
gp 0.500 tau-a 0.658
Brier 0.048
Coef S.E. Wald Z Pr(>|Z|)
y>=0.54 41.6140 108.3624 0.38 0.7010
y>=1 31.9345 88.0084 0.36 0.7167
y>=1.75 23.5277 74.2031 0.32 0.7512
y>=2 6.3002 2.2886 2.75 0.0059
y>=2.25 4.6790 2.0494 2.28 0.0224
y>=2.75 3.2223 1.8577 1.73 0.0828
y>=3 0.5919 1.4855 0.40 0.6903
y>=3.00 -0.4283 1.5004 -0.29 0.7753
x -19.0710 19.8718 -0.96 0.3372
x=0.2 0.7630 3.1058 0.25 0.8059
x=0.3 3.0129 5.2589 0.57 0.5667
x=0.4 1.9526 6.9051 0.28 0.7773
x=0.5 2.9703 8.8464 0.34 0.7370
x=0.6 -3.4705 53.5272 -0.06 0.9483
x=0.7 -10.1780 75.2585 -0.14 0.8924
x=0.8 -26.3573 109.3298 -0.24 0.8095
x=0.9 -24.4502 109.6118 -0.22 0.8235
x=1 -35.5679 488.7155 -0.07 0.9420
There is also the MASS::polr function, but I find Harrell's version more approachable. This could also be approached with rank regression. The quantreg package is pretty standard if that were the route you chose. Looking at your other question, I wondered if you had tried a logistic transform as a method of linearizing that relationship. Of course, the illustrated use of lrm with an ordered variable is a logistic transformation "under the hood".
I have to do an operation that involves two matrices, matrix #1 with data and matrix #2 with coefficients to multiply columns of matrix #1
matrix #1 is:
dim(dat)
[1] 612 2068
dat[1:6,1:8]
X0005 X0010 X0011 X0013 X0015 X0016 X0017 X0018
1 1.96 1.82 8.80 1.75 2.95 1.10 0.46 0.96
2 1.17 0.94 2.74 0.59 0.86 0.63 0.16 0.31
3 2.17 2.53 10.40 4.19 4.79 2.22 0.31 3.32
4 3.62 1.93 6.25 2.38 2.25 0.69 0.16 1.01
5 2.32 1.93 3.74 1.97 1.31 0.44 0.28 0.98
6 1.30 2.04 1.47 1.80 0.43 0.33 0.18 0.46
and matrix #2 is:
dim(lo)
[1] 2068 8
head(lo)
i1 i2 i3 i4 i5 i6
X0005 -0.11858852 0.10336788 0.62618771 0.08706041 -0.02733101 0.006287923
X0010 0.06405406 0.13692216 0.64813610 0.15750302 -0.13503956 0.139280709
X0011 -0.06789727 0.30473549 0.07727417 0.24907723 -0.05345123 0.141591330
X0013 0.20909664 0.01275553 0.21067894 0.12666704 -0.02836527 0.464548147
X0015 -0.07690560 0.18788859 -0.03551084 0.19120773 -0.10196578 0.234037820
X0016 -0.06442454 0.34993481 -0.04057001 0.20258195 -0.09318325 0.130669546
i7 i8
X0005 0.08571777 0.031531478
X0010 0.31170850 -0.003127279
X0011 0.52527759 -0.065002026
X0013 0.27858049 -0.032178156
X0015 0.50693977 -0.058003429
X0016 0.53162596 -0.052091767
I want to multiply each column of matrix#1 by its correspondent coefficient of matrix#2 first column, and sum up all resulting columns. Then repeat the operation but with coefficients of matrix#2 second column, then third column, and so on...
The result is then a matrix with 8 columns, which are lineal combinations of data in matrix#1
My attempt includes nested for-loops. it works, but takes about 30' to execute. Is there any way to avoid these loops and reduce computational effort?
here is my attempt:
r=nrow(dat)
n=ncol(dat)
m=ncol(lo)
eme<-matrix(NA,r,m)
for (i in(1:m)){
SC<-matrix(NA,r,n)
for (j in(1:n)){
nom<-rownames(lo)
x<-dat[ , colnames(dat) == nom[j]]
SC[,j]<-x*lo[j,i]
SC1<-rowSums(SC)
}
eme[,i]<-SC1
}
Thanks for your help
It looks like you are just doing matrix - vector multiplication. In R, use the%*% operator, so all the looping is delegated to a fortran routine. I think it equates to the following
apply(lo, 2, function(x) dat %*% x)
Your code could be improved by moving the nom <- assignment outside the loops since it recalculates the same thing every iteration. Also, what is the point of SC1 being computed during each iteration?
I have a small table of summary data with the odds ratio, upper and lower confidence limits for four categories, with six levels within each category. I'd like to produce a chart using ggplot2 that looks similar to the usual one created when you specify a lm and it's se, but I'd like R just to use the pre-specified values I have in my table. I've managed to create the line graph with error bars, but these overlap and make it unclear. The data look like this:
interval OR Drug lower upper
14 0.004 a 0.002 0.205
30 0.022 a 0.001 0.101
60 0.13 a 0.061 0.23
90 0.22 a 0.14 0.34
180 0.25 a 0.17 0.35
365 0.31 a 0.23 0.41
14 0.84 b 0.59 1.19
30 0.85 b 0.66 1.084
60 0.94 b 0.75 1.17
90 0.83 b 0.68 1.01
180 1.28 b 1.09 1.51
365 1.58 b 1.38 1.82
14 1.9 c 0.9 4.27
30 2.91 c 1.47 6.29
60 2.57 c 1.52 4.55
90 2.05 c 1.31 3.27
180 2.422 c 1.596 3.769
365 2.83 c 1.93 4.26
14 0.29 d 0.04 1.18
30 0.09 d 0.01 0.29
60 0.39 d 0.17 0.82
90 0.39 d 0.2 0.7
180 0.37 d 0.22 0.59
365 0.34 d 0.21 0.53
I have tried this:
limits <- aes(ymax=upper, ymin=lower)
dodge <- position_dodge(width=0.9)
ggplot(data, aes(y=OR, x=days, colour=Drug)) +
geom_line(stat="identity") +
geom_errorbar(limits, position=dodge)
and searched for a suitable answer to create a pretty plot, but I'm flummoxed!
Any help greatly appreciated!
You need the following lines:
p<-ggplot(data=data, aes(x=interval, y=OR, colour=Drug)) + geom_point() + geom_line()
p<-p+geom_ribbon(aes(ymin=data$lower, ymax=data$upper), linetype=2, alpha=0.1)
Here is a base R approach using polygon() since #jmb requested a solution in the comments. Note that I have to define two sets of x-values and associated y values for the polygon to plot. It works by plotting the outer perimeter of the polygon. I define plot type = 'n' and use points() separately to get the points on top of the polygon. My personal preference is the ggplot solutions above when possible since polygon() is pretty clunky.
library(tidyverse)
data('mtcars') #built in dataset
mean.mpg = mtcars %>%
group_by(cyl) %>%
summarise(N = n(),
avg.mpg = mean(mpg),
SE.low = avg.mpg - (sd(mpg)/sqrt(N)),
SE.high =avg.mpg + (sd(mpg)/sqrt(N)))
plot(avg.mpg ~ cyl, data = mean.mpg, ylim = c(10,30), type = 'n')
#note I have defined c(x1, x2) and c(y1, y2)
polygon(c(mean.mpg$cyl, rev(mean.mpg$cyl)),
c(mean.mpg$SE.low,rev(mean.mpg$SE.high)), density = 200, col ='grey90')
points(avg.mpg ~ cyl, data = mean.mpg, pch = 19, col = 'firebrick')
I have a data.frame (Centroid) that contains points in virtual 3D space (columns = AV, V and A), each representing a character (column = Character). Each row contains a different character.
AV<-c(37.9,10.87,40.05)
V<-c(1.07,1.14,1.9)
A<-c(0.04,-1.23,-1.1)
Character<-c("a","A","b")
centroid = data.frame(AV,V,A,Character)
centroid
AV V A Character
1 37.90 1.07 0.04 a
2 10.87 1.14 -1.23 A
3 40.05 1.90 -1.10 b
I wish to know the similarity/dissimilarity between each character. For example, "a" corresponds to 37.9, 1.07 and 0.04 whilst "A" corresponds to 10.87, 1.14, -1.23. I want to know the distance between these characters/ 3 points.
I believe I can calculate this using Euclidean distance between each character, but am unsure of the code to run.
I have attempted to use
dist(as.matrix(Centroids))
But have been unsuccessful, as this just gives a big print in the console. Any assistance would be greatly appreciated.
Following may be helpful:
AV<-c(37.9,10.87,40.05)
V<-c(1.07,1.14,1.9)
A<-c(0.04,-1.23,-1.1)
centroid = data.frame(A,V,AV)
centroid
A V AV
1 0.04 1.07 37.90
2 -1.23 1.14 10.87
3 -1.10 1.90 40.05
mm = as.matrix(centroid)
mm
A V AV
[1,] 0.04 1.07 37.90
[2,] -1.23 1.14 10.87
[3,] -1.10 1.90 40.05
dist(mm)
1 2
2 27.059909
3 2.571186 29.190185
as.dist(mm)
A V
V -1.23
AV -1.10 1.90
It is not clear what you mean by "Character<-c(a,A,b)"
I have a data.table object similar to this one
library(data.table)
c <- data.table(CO = c(10000,10000,10000,20000,20000,20000,20000),
SH = c(1427,1333,1333,1000,1000,300,350),
PRC = c(6.5,6.125,6.2,0.75,0.5,3,3.5),
DAT = c(0.5,-0.5,0,-0.1,NA_real_,0.2,0.5),
MM = c("A","A","A","A","A","B","B"))
and I am trying to perform calculations using nested grouping, passing an expression as an argument. Here is a simplified version of what I have:
setkey(c,MM)
mycalc <- quote({nobscc <- length(DAT[complete.cases(DAT)]);
list(MKTCAP = tail(SH,n=1)*tail(PRC,n=1),
SQSUM = ifelse(nobscc>=2, sum(DAT^2,na.rm=TRUE), NA_real_),
COVCOMP = ifelse(nobscc >= 2, head(DAT,n=1), NA_real_),
NOBS = nobscc)})
myresults <- c[,.SD[,{setkey=CO; eval(mycalc)},by=CO],by=MM]
which produces
MM CO MKTCAP SQSUM COVCOMP NOBS
[1,] A 10000 8264.6 0.50 0.5 3
[2,] A 20000 500.0 NA NA 1
[3,] B 20000 1225.0 0.29 0.2 2
In the example above I have two elements of the list which use the ifelse construct (in the actual code there are 3), all doing the same test : if the number of observations is greater than 2, then a certain calculation (which is different for each element of the list, and each could be written as a function) is to be performed, otherwise I want the value of the these elements to be NA. Another thing these elements have in common is that they use one and the same column of my data.table: the one called DAT.
So my question is: is there any way I can do the ifelse test only once, and if it is FALSE, pass the value NA to the respective elements of the list, and if TRUE, evaluate a different expression for each of the elements of the list?
NOTE: My goal is to reduce the system.time (system and elapsed). If this modification will not reduce time and calculations, bearing in mind I have 72 million observations, that's an acceptable answer. I also welcome suggestions to change other parts of the code.
EDIT: Results of summaryRprof()
$by.total
total.time total.pct self.time self.pct
"system.time" 18.94 99.79 0.00 0.00
".Call" 18.92 99.68 0.10 0.53
"[" 18.92 99.68 0.04 0.21
"[.data.table" 18.92 99.68 0.02 0.11
"eval" 18.80 99.05 0.24 1.26
"ifelse" 18.30 96.42 0.46 2.42
"lm" 17.70 93.26 0.58 3.06
"sapply" 8.06 42.47 0.36 1.90
"model.frame" 7.74 40.78 0.16 0.84
"model.frame.default" 7.58 39.94 0.98 5.16
"lapply" 6.62 34.88 0.70 3.69
"FUN" 4.24 22.34 1.10 5.80
"model.matrix" 4.04 21.29 0.02 0.11
"model.matrix.default" 4.02 21.18 0.26 1.37
"match" 3.66 19.28 0.86 4.53
".getXlevels" 3.12 16.44 0.12 0.63
"na.omit" 2.40 12.64 0.24 1.26
"%in%" 2.30 12.12 0.34 1.79
"simplify2array" 2.24 11.80 0.12 0.63
"na.omit.data.frame" 2.16 11.38 0.14 0.74
"[.data.frame" 2.12 11.17 1.18 6.22
"deparse" 1.80 9.48 0.66 3.48
"unique" 1.80 9.48 0.54 2.85
"[[" 1.52 8.01 0.12 0.63
"[[.data.frame" 1.40 7.38 0.54 2.85
".deparseOpts" 1.34 7.06 0.96 5.06
"paste" 1.32 6.95 0.16 0.84
"lm.fit" 1.20 6.32 0.64 3.37
"mode" 1.14 6.01 0.14 0.74
"unlist" 1.12 5.90 0.56 2.95
Instead of forming and operating on data subsets like this:
setkey(c,MM)
myresults <- c[, .SD[,{setkey=CO; eval(mycalc)},by=CO], by=MM]
You could try doing this:
setkeyv(c, c("MM", "CO"))
myresults <- c[, eval(mycalc), by=key(c)]
This should speed up your code, since it avoids all of the nested subsetting of .SD objects, each of which requires its own call to [.data.table.
On your original question, I doubt the ifelse evaluations are taking much time, but if you want to avoid them, you could take them out of mycalc and use := to overwrite the desired values with NA:
mycalc <- quote(list(MKTCAP = tail(SH,n=1)*tail(PRC,n=1),
SQSUM = sum(DAT^2,na.rm=TRUE),
COVCOMP = head(DAT,n=1),
NOBS = length(DAT[complete.cases(DAT)])))
setkeyv(c, c("MM", "CO"))
myresults <- c[, eval(mycalc), by=key(c)]
myresults[NOBS<2, c("SQSUM", "COVCOMP"):=NA_real_]
## Or, alternatively
# myresults[NOBS<2, SQSUM:=NA_real_]
# myresults[NOBS<2, COVCOMP:=NA_real_]