Finding Area Under the Curve (AUC) in R by Trapezoidal Rule

I have a list of data frames like the sample below (each one has the columns ID, yobs and x, the independent variable), and I want to find the AUC (by the trapezoidal rule) for each case (ID),
so that my output (a master data frame) looks like the one shown at the end.
Can anybody suggest an efficient way of doing this? (I have a large number of rows for each ID.)
Thank you.
# Some made-up code for only one data frame
Y1=c(0,2,5,7,9)
Y2=c(0,1,3,8,11)
Y3=c(0,4,8,9,12,14,18)
t1=c(0:4)
t2=c(0:4)
t3=c(0:6)
a1=data.frame(ID=1,y=Y1,x=t1)
a2=data.frame(ID=2,y=Y2,x=t2)
a3=data.frame(ID=3,y=Y3,x=t3)
data=rbind(a1,a2,a3)
#dataA(Just to show)
ID obs time
1 1 0 0
2 1 2 1
3 1 5 2
4 1 7 3
5 1 9 4
6 2 0 0
7 2 1 1
8 2 3 2
9 2 8 3
10 2 11 4
11 3 0 0
12 3 4 1
13 3 8 2
14 3 9 3
15 3 12 4
16 3 14 5
17 3 18 6
#dataB(Just to show)
ID obs time
1 1 0 0
2 1 2 1
3 1 5 2
4 1 7 3
5 1 9 4
6 2 0 0
7 2 1 1
8 2 3 2
#dataC(Just to show)
ID obs time
1 1 0 0
2 1 2 1
3 1 5 2
4 1 7 3
5 1 9 4
6 2 0 0
7 2 1 1
8 2 3 2
##Desired output
ID AUC
dataA 1 XX
dataA 2 XX
dataA 3 XX
dataB 1 XX
dataB 2 XX
dataC 1 XX
dataC 2 XX

Here are two other ways. The first uses integrate(...) on a function defined by linear interpolation between the points. The second uses the trapz(...) function from the caTools package, described in the comment from @nrussel.
f <- function(x,df) approxfun(df)(x)
sapply(split(data,data$ID),function(df)c(integrate(f,min(df$x),max(df$x),df[3:2])$value))
# 1 2 3
# 18.5 17.5 56.0
library(caTools)
sapply(split(data,data$ID),function(df) trapz(df$x,df$y))
# 1 2 3
# 18.5 17.5 56.0

I'm guessing something like this would work
calcauc<-function(data) {
psum<-function(x) rowSums(embed(x,2))
stack(lapply(split(data, data$ID), function(z)
with(z, sum(psum(y) * diff(x)/ 2)))
)
}
calcauc(data)
# values ind
# 1 18.5 1
# 2 17.5 2
# 3 56.0 3
Of course, x and y values normally lie between 0 and 1 for ROC curves, which is why the "AUC" values here seem so large; really this is just the area of the polygon underneath the line defined by the points in the data set.
The psum function is just a helper that calculates pair-wise sums of neighbouring values (useful in the formula for the area of a trapezoid).
Basically, we use split() to look at one ID at a time, then we calculate the area for each ID, and then we use stack() to bring everything back into one data.frame.
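The answers above work on a single data frame, while the question asks about a list of them. A minimal base-R sketch of how the same split-and-sum idea might be extended follows; the names trap_auc, auc_one, data_list and master are made up for illustration, and the columns are assumed to be ID, y and x as in the made-up code from the question.

```r
# Trapezoidal rule for one curve: sum of interval width * mean height.
trap_auc <- function(x, y) {
  o <- order(x)                      # make sure x is increasing
  x <- x[o]
  y <- y[o]
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

# Hypothetical named list of data frames, each with columns ID, y, x.
dataA <- data.frame(
  ID = rep(1:3, times = c(5, 5, 7)),
  y  = c(0, 2, 5, 7, 9, 0, 1, 3, 8, 11, 0, 4, 8, 9, 12, 14, 18),
  x  = c(0:4, 0:4, 0:6)
)
dataB <- dataA[1:8, ]
data_list <- list(dataA = dataA, dataB = dataB)

# One block of the master table per data frame: one row per ID.
auc_one <- function(name) {
  df  <- data_list[[name]]
  auc <- sapply(split(df, df$ID), function(g) trap_auc(g$x, g$y))
  data.frame(data = name, ID = names(auc), AUC = unname(auc))
}

master <- do.call(rbind, lapply(names(data_list), auc_one))
```

With many rows per ID this stays vectorised inside each group; only the loop over IDs and data frames happens at the R level.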

Related

anti-join not working - giving 0 rows, why?

I am trying to use anti-join exactly as I have done many times to establish which rows across two datasets do not have matches for two specific columns. For some reason I keep getting 0 rows in the result and I can't understand why.
Below are two dummy data frames containing the two columns I am trying to compare. One is missing an entry (df1 has no row for SITE 2, PLOT 8), so when I use anti_join to compare the two, this entry should be returned, but I just get 0 rows.
a<- seq(1:3)
SITE <- rep(a, times = c(16,15,1))
PLOT <- c(1:16,1:7,9:16,1)
df1 <- data.frame(SITE,PLOT)
SITE <- rep(a, times = c(16,16,1))
PLOT <- c(rep(1:16,2),1)
df2 <- data.frame(SITE,PLOT)
df1 df2
SITE PLOT SITE PLOT
1 1 1 1
1 2 1 2
1 3 1 3
1 4 1 4
1 5 1 5
1 6 1 6
1 7 1 7
1 9 1 8
1 10 1 9
1 11 1 10
1 12 1 11
1 13 1 12
1 14 1 13
1 15 1 14
1 16 1 15
1 1 1 16
2 2 2 1
2 3 2 2
2 4 2 3
2 5 2 4
2 6 2 5
2 7 2 6
2 8 2 7
2 9 2 8
2 10 2 9
2 11 2 10
2 12 2 11
2 13 2 12
2 14 2 13
2 15 2 14
2 16 2 15
3 1 2 16
3 1
library(dplyr)
a <- anti_join(df1, df2, by=c('SITE', 'PLOT'))
a
<0 rows> (or 0-length row.names)
I'm sure the answer is obvious but I can't see it.
The answer can be found in the help file:
anti_join() returns all rows from x without a match in y.
Every row of df1 has a match in df2 (df1 is the one missing an entry), so anti_join(df1, df2) is empty. Reversing the inputs will give you what you expect.
anti_join(df2, df1, by=c('SITE', 'PLOT'))
# SITE PLOT
# 1 2 8
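To see the asymmetry without dplyr, here is a rough base-R stand-in for an anti-join. The helper anti_join_base is a made-up name and not part of any package; it only assumes that pasting the key columns together gives a usable key.

```r
# Base-R stand-in for an anti-join: rows of x whose key has no match in y.
# anti_join_base is a hypothetical helper, not a dplyr function.
anti_join_base <- function(x, y, by) {
  key_x <- do.call(paste, c(x[by], sep = "\t"))
  key_y <- do.call(paste, c(y[by], sep = "\t"))
  x[!(key_x %in% key_y), , drop = FALSE]
}

df1 <- data.frame(SITE = rep(1:3, times = c(16, 15, 1)),
                  PLOT = c(1:16, 1:7, 9:16, 1))
df2 <- data.frame(SITE = rep(1:3, times = c(16, 16, 1)),
                  PLOT = c(rep(1:16, 2), 1))

# Rows of df1 missing from df2: none, because df1 is the smaller set.
in_df1_only <- anti_join_base(df1, df2, by = c("SITE", "PLOT"))

# Rows of df2 missing from df1: the SITE 2, PLOT 8 entry.
in_df2_only <- anti_join_base(df2, df1, by = c("SITE", "PLOT"))
```

The order of the arguments decides which side's unmatched rows come back, which is the same behaviour the help file describes for anti_join().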

Relabel samples in kmean results considering the order of centers

I am using kmeans to cluster my data, and I want to relabel the samples according to the ordered centres.
Consider the following example:
a = c("a","b","c","d","e","F","i","j","k","l","m","n")
b = c(1,2,3,20,21,21,40,41,42,4,23,50)
mydata = data.frame(id=a,amount=b)
result = kmeans(mydata$amount,3,nstart=10)
Here is the result:
result$cluster
2 2 2 3 3 3 1 1 1 2 3 1
result$centers
1 43.25
2 2.50
3 21.25
mydata = data.frame(mydata,label =result$cluster)
mydata
id amount label
1 a 1 2
2 b 2 2
3 c 3 2
4 d 20 3
5 e 21 3
6 F 21 3
7 i 40 1
8 j 41 1
9 k 42 1
10 l 4 2
11 m 23 3
12 n 50 1
What I am looking for is sorting the centres and producing the labels accordingly:
1 2.50
2 21.25
3 43.25
and label the samples going to:
1 1 1 2 2 2 3 3 3 1 2 3
and the result should be :
id amount label
1 a 1 1
2 b 2 1
3 c 3 1
4 d 20 2
5 e 21 2
6 F 21 2
7 i 40 3
8 j 41 3
9 k 42 3
10 l 4 1
11 m 23 2
12 n 50 3
I think it is possible to do this by ordering the centres and then, for each sample, taking the index of the nearest ordered centre as its label.
Is there another way that R can do it automatically?
One idea is to create a named vector by matching your centers with the sorted centers. Then match the vector with mydata$label and replace with the names of the vector, i.e.
i1 <- setNames(match(sort(result$centers), result$centers), rownames(result$centers))
as.numeric(names(i1)[match(mydata$label, i1)])
# [1] 1 1 1 2 2 2 3 3 3 1 2 3
You can also use a for loop, if you don't mind loops. It only needs to run once per cluster:
cls <- result$cluster
for (i in seq_len(nrow(result$centers)))
result$cluster[cls == order(result$centers)[i]] <- i
result$cluster
#[1] 1 1 1 2 2 2 3 3 3 1 2 3
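Both answers boil down to mapping each old label to the rank of its centre. A compact sketch of that mapping for one-dimensional data like the example follows; set.seed() is only there to make the sketch reproducible, and new_label and new_centers are illustrative names.

```r
set.seed(1)  # assumption: any seed works, this only fixes the random starts
amount <- c(1, 2, 3, 20, 21, 21, 40, 41, 42, 4, 23, 50)
result <- kmeans(amount, 3, nstart = 10)

centers <- as.vector(result$centers)

# order(centers)[k] is the old label of the k-th smallest centre, so
# match() turns each old label into its position in that ordering.
new_label   <- match(result$cluster, order(centers))
new_centers <- sort(centers)
```

After this, label 1 always belongs to the cluster with the smallest centre, whatever numbering kmeans happened to pick on that run.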

subset.data.frame in R

I have a data frame of raw data:
raw <- data.frame(subj = c(1,1,1,2,2,2,3,3,3,4,4,4),
blah = c(0,0,0,1,1,1,1,0,1,0,0,0))
From it, I want to remove the bad subj.
badsubj <- c(1,4)
trim <- subset.data.frame(raw, subj != badsubj)
But for some reason, not all of the badsubj values are removed:
subj blah
2 1 0
4 2 1
5 2 1
6 2 1
7 3 1
8 3 0
9 3 1
11 4 0
What am I doing wrong? Observations 2 and 11 should be excluded because they are members of badsubj.
raw[!raw$subj %in% badsubj, ]
The original code uses != where it needs %in%.
The problem is that subj and badsubj do not have the same length, so badsubj is recycled until both vectors have the same length. Your code then compares the values element by element, as in the table below, and a row is dropped only where the two columns happen to be equal.
subj badsubj
1 1 1
2 1 4
3 1 1
4 2 4
5 2 1
6 2 4
7 3 1
8 3 4
9 3 1
10 4 4
11 4 1
12 4 4
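The recycling can be checked directly. This small sketch builds the recycled vector by hand with rep() and compares it with what != does implicitly; recycled, kept_wrong and kept_right are illustrative names.

```r
subj    <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4)
badsubj <- c(1, 4)

# What != sees: badsubj repeated until it is as long as subj.
recycled <- rep(badsubj, length.out = length(subj))

# The implicit and the explicit comparison give the same result.
same <- identical(subj != badsubj, subj != recycled)

# Rows kept by the element-wise comparison versus the set-membership test.
kept_wrong <- which(subj != badsubj)
kept_right <- which(!(subj %in% badsubj))
```

kept_wrong still contains rows 2 and 11, because on those rows the recycled value is the other bad subject, while kept_right contains only the rows for subjects 2 and 3.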

Combine minimum values of row and column in matrix

Suppose I have a vector of size n = 8, v = (5,8,2,7,9,12,2,1). I would like to know how to build an n x n matrix that compares every pair of values of v and returns the minimum of each comparison. In this example, it would look like this:
5 5 2 5 5 5 2 1
5 8 2 7 8 8 2 1
2 2 2 2 2 2 2 1
5 7 2 7 7 7 2 1
5 8 2 7 9 9 2 1
5 8 2 7 9 12 2 1
2 2 2 2 2 2 2 1
1 1 1 1 1 1 1 1
Could you help me with this, please?
outer(v, v, pmin)
Notice the use of pmin, not min: pmin is vectorised and returns the element-wise minimum of its arguments, whereas min collapses everything to a single number.
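A short sketch of why the distinction matters: outer() calls the function once with two long vectors and expects a result of the same length. The tryCatch() wrapper is only there so the failing call with min does not stop the script.

```r
v <- c(5, 8, 2, 7, 9, 12, 2, 1)

# pmin returns one minimum per pair, so outer() can reshape it into a matrix.
m <- outer(v, v, pmin)

# min returns a single number for the whole call, which outer() cannot
# reshape into an 8 x 8 matrix, so it raises an error.
min_attempt <- tryCatch(outer(v, v, min), error = function(e) "error")
```

The resulting matrix m is symmetric, and its diagonal is v itself, since the minimum of a value with itself is that value.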

In R, how can I make a running count of runs?

Suppose I have an R dataframe that looks like this, where end.group signifies the end of a unique group of observations:
x <- data.frame(end.group=c(0,0,1,0,0,1,1,0,0,0,1,1,1,0,1))
I want to return the following, where group.count is a running count of the number of observations in a group, and group is a unique identifier for each group, in number order. Can anyone help me with a piece of R code to do this?
end.group group.count group
0 1 1
0 2 1
1 3 1
0 1 2
0 2 2
1 3 2
1 1 3
0 1 4
0 2 4
0 3 4
1 4 4
1 1 5
1 1 6
0 1 7
1 2 7
You can create group by using cumsum and rev. You need rev because you have the end points of the groups.
x <- data.frame(end.group=c(0,0,1,0,0,1,1,0,0,0,1,1,1,0,1))
# create groups
x$group <- rev(cumsum(rev(x$end.group)))
# re-number groups from smallest to largest
x$group <- abs(x$group-max(x$group)-1)
Now you can use ave to create group.count.
x$group.count <- ave(x$end.group, x$group, FUN=seq_along)
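Another approach finds the row of each group end with which() and the group sizes with diff():
x <- data.frame(end.group=c(0,0,1,0,0,1,1,0,0,0,1,1,1,0,1))
ends <- which(as.logical(x$end.group)) # row index of each group end
ends2 <- c(ends[1],diff(ends)) # size of each group
transform(x, group.count=unlist(lapply(ends2,seq)), group=rep(seq(length(ends)),times=ends2))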
end.group group.count group
1 0 1 1
2 0 2 1
3 1 3 1
4 0 1 2
5 0 2 2
6 1 3 2
7 1 1 3
8 0 1 4
9 0 2 4
10 0 3 4
11 1 4 4
12 1 1 5
13 1 1 6
14 0 1 7
15 1 2 7
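A third variant, sketched here as an alternative rather than as either answerer's method, relies on the fact that a new group starts on the row right after each 1. Shifting end.group down by one row and taking cumsum() then gives the group number directly.

```r
x <- data.frame(end.group = c(0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1))

# A group starts on row 1 and on every row that follows an end marker.
starts <- c(1, head(x$end.group, -1))
x$group <- cumsum(starts)

# Running position of each row within its group.
x$group.count <- ave(x$end.group, x$group, FUN = seq_along)
```

This avoids the double rev() and also works when the last row is not a group end.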
