subtract first value from each subset of dataframe - r

I want to subtract the smallest value in each subset of a data frame from each value in that subset i.e.
A <- c(1,3,5,6,4,5,6,7,10)
B <- rep(1:4, length.out=length(A))
df <- data.frame(A, B)
df <- df[order(B),]
Subtracting would give me:
A B
1 0 1
2 3 1
3 9 1
4 0 2
5 2 2
6 0 3
7 1 3
8 0 4
9 1 4

I think the output you show is not correct. In any case, from what you explain, I think this is what you want. This uses ave base function:
within(df, { A <- ave(A, B, FUN=function(x) x-min(x))})
A B
1 0 1
5 3 1
9 9 1
2 0 2
6 2 2
3 0 3
7 1 3
4 0 4
8 1 4
Of course there are other alternatives such as plyr and data.table.

Echoing Arun's comment above, I think your expected output might be off. In any event, you should be able to use can use tapply to calculate subsets and then use match to line those subsets up with the original values:
subs <- tapply(df$A, df$B, min)
df$A <- df$A - subs[match(df$B, names(subs))]
df
A B
1 0 1
5 3 1
9 9 1
2 0 2
6 2 2
3 0 3
7 1 3
4 0 4
8 1 4

Related

How to add sequence for specific values in R

I have following dataframe in R
a b
1 0
2 0
3 0
4 1
5 1
6 1
7 0
8 0
9 0
10 1
11 1
Desired dataframe would be
a b Flag
1 0 1
2 0 2
3 0 3
4 1 4
5 1 4
6 1 4
7 0 5
8 0 6
9 0 7
10 1 8
11 1 8
The sequence should change for 0 and shall remain same for 1.
I am doing it with following command
df$flag <- with(a, match(b, unique(b)))
But,does not give me desired output.
This has been updated to account for the first element of b being 1. Thanks to #tk3 for pointing out that a change was needed.
It looks like your rule is to increase flag if b is zero OR if it is the first 1 in a sequence.
This will give your answer.
cumsum(1 + c(df$b[1],diff(df$b)>0) - df$b)
[1] 1 2 3 4 4 4 5 6 7 8 8
If you just wanted to increase flag when b is zero, you could use
cumsum(1-df$b). Except that would not change the flag for the first one in a series. So I wanted to make an altered version of b that would set b=0 for all of the first ones. You can use c(df$b[1], diff(df$b) >0) to get all of the places that b changed from zero to one - the "first ones". Now
df$b - c(df$b[1],diff(df$b)>0)
0 0 0 0 1 1 0 0 0 0 1
changes all of the "first ones" to zeros unless it is the first element of b. With this altered b we can use cumsum as above. We want to take cumsum of
1 - ( df$b - c(df$b[1],diff(df$b)>0) ) = 1 + c(df$b[1],diff(df$b)>0) - df$b
Which was my response
cumsum(1 + c(df$b[1],diff(df$b)>0) - df$b)
[1] 1 2 3 4 4 4 5 6 7 8 8
The original version worked only for df$b[1] = 0. The updated version should also work for df$b[1] = 1.
The following seems to do what you want.
I find it a bit complicated but it works.
sp <- split(df, cumsum(c(0, abs(diff(df$b)))))
df2 <- lapply(sp, function(DF) {
DF$Flag <- as.integer(DF$b != 1)
if(DF$b[1] == 1) DF$Flag[1] <- 1
DF
})
rm(sp) # clean up
df2 <- do.call(rbind, df2)
df2$Flag <- cumsum(df2$Flag)
row.names(df2) <- NULL
df2
# a b Flag
#1 1 0 1
#2 2 0 2
#3 3 0 3
#4 4 1 4
#5 5 1 4
#6 6 1 4
#7 7 0 5
#8 8 0 6
#9 9 0 7
#10 10 1 8
#11 11 1 8

How to split a dataframe by factor but discontinuous row should be in different group

For example:
> a <- 1:10
> c <- c(1,1,1,0,0,0,1,1,1,0)
> dt <- data.frame(a,c)
> dt
a c
1 1 1
2 2 1
3 3 1
4 4 0
5 5 0
6 6 0
7 7 1
8 8 1
9 9 1
10 10 0
I want the data should be seperated in 4 group by c:
The first group:
a c
1 1 1
2 2 1
3 3 1
The second one:
a c
1 4 0
2 5 0
3 6 0
The third one:
a c
1 7 1
2 8 1
3 9 1
The forth one:
a c
1 10 0
We can use rleid from data.table to create a grouping variable and use that to split the 'dt' into a list of data.frames.
library(data.table)
split(dt, rleid(dt$c))
Or as #ZheyuanLi mentioned, the rle from base R can be used to create the grouping variable
split(dt, with(rle(dt$c), rep(seq_along(values), lengths)))

Finding Area Under the Curve (AUC) in R by Trapezoidal Rule

I have a below mentioned Sample List containing Data Frames (Each in has ...ID,yobs,x(independent variable)).And I want to find AUC(Trapezoidal rule)for each case(ID)..,
So that my output(master data frame) looks like following (have shown at last)
Can anybody suggest the efficient way of finding this (I have a high number of rows for each ID's)
Thank you
#Some Make up code for only one data frame
Y1=c(0,2,5,7,9)
Y2=c(0,1,3,8,11)
Y3=c(0,4,8,9,12,14,18)
t1=c(0:4)
t2=c(0:4)
t3=c(0:6)
a1=data.frame(ID=1,y=Y1,x=t1)
a2=data.frame(ID=2,y=Y2,x=t2)
a3=data.frame(ID=3,y=Y3,x=t3)
data=rbind(a1,a2,a3)
#dataA(Just to show)
ID obs time
1 1 0 0
2 1 2 1
3 1 5 2
4 1 7 3
5 1 9 4
6 2 0 0
7 2 1 1
8 2 3 2
9 2 8 3
10 2 11 4
11 3 0 0
12 3 4 1
13 3 8 2
14 3 9 3
15 3 12 4
16 3 14 5
17 3 18 6
#dataB(Just to show)
ID obs time
1 1 0 0
2 1 2 1
3 1 5 2
4 1 7 3
5 1 9 4
6 2 0 0
7 2 1 1
8 2 3 2
#dataC(Just to show)
ID obs time
1 1 0 0
2 1 2 1
3 1 5 2
4 1 7 3
5 1 9 4
6 2 0 0
7 2 1 1
8 2 3 2
##Desired output
ID AUC
dataA 1 XX
dataA 2 XX
dataA 3 XX
dataB 1 XX
dataB 2 XX
dataC 1 XX
dataC 2 XX
Here are two other ways. The first uses integrate(...) on a function defined by the linear interpolation between the points. The second uses the trapz(...) function described in the comment from #nrussel.
f <- function(x,df) approxfun(df)(x)
sapply(split(data,data$ID),function(df)c(integrate(f,min(df$x),max(df$x),df[3:2])$value))
# 1 2 3
# 18.5 17.5 56.0
library(caTools)
sapply(split(data,data$ID),function(df) trapz(df$x,df$y))
# 1 2 3
# 18.5 17.5 56.0
I'm guessing something like this would work
calcauc<-function(data) {
psum<-function(x) rowSums(embed(x,2))
stack(lapply(split(data, data$ID), function(z)
with(z, sum(psum(y) * diff(x)/ 2)))
)
}
calcauc(data)
# values ind
# 1 18.5 1
# 2 17.5 2
# 3 56.0 3
Of course normally x and y values are between 0 and 1 for ROC curves which is why we seem to have such large "AUC" values but really this is just the area of the polygon underneath the line defined by the points in the data set.
The psum function is just a helper function to calculate pair-wise sums (useful in the formula for the area of trapezoid).
Basically we use split() to look at one ID at a time, then we calculate the area for each ID, then we use stack() to bring everything back into one data.frame.

Subsequent row summing in dataframe object

I would like to do subsequent row summing of a columnvalue and put the result into a new columnvariable without deleting any row by another columnvalue .
Below is some R-code and an example that does the trick and hopefully illustrates my question. I was wondering if there is a more elegant way to do since the for loop will be time consuming in my actual object.
Thanks for any feedback.
As an example dataframe:
MyDf <- data.frame(ID = c(1,1,1,2,2,2), Y = 1:6)
MyDf$FIRST <- c(1,0,0,1,0,0)
MyDf.2 <- MyDf
MyDf.2$Y2 <- c(1,3,6,4,9,15)
The purpose of this is so that I can write code that calculates Y2 in MyDf.2 above for each ID, separately.
This is what I came up with and, it does the trick. (Calculating a TEST column in MyDf that has to be equal to Y2 cin MyDf.2)
MyDf$TEST <- NA
for(i in 1:length(MyDf$Y)){
MyDf[i,]$TEST <- ifelse(MyDf[i,]$FIRST == 1, MyDf[i,]$Y,MyDf[i,]$Y + MyDf[i-1,]$TEST)
}
MyDf
ID Y FIRST TEST
1 1 1 1 1
2 1 2 0 3
3 1 3 0 6
4 2 4 1 4
5 2 5 0 9
6 2 6 0 15
MyDf.2
ID Y FIRST Y2
1 1 1 1 1
2 1 2 0 3
3 1 3 0 6
4 2 4 1 4
5 2 5 0 9
6 2 6 0 15
You need ave and cumsum to get the column you want. transform is just to modify your existing data.frame.
> MyDf <- transform(MyDf, TEST=ave(Y, ID, FUN=cumsum))
ID Y FIRST TEST
1 1 1 1 1
2 1 2 0 3
3 1 3 0 6
4 2 4 1 4
5 2 5 0 9
6 2 6 0 15

Removing specific rows from a dataframe

I have a data frame e.g.:
sub day
1 1
1 2
1 3
1 4
2 1
2 2
2 3
2 4
3 1
3 2
3 3
3 4
and I would like to remove specific rows that can be identified by the combination of sub and day.
For example say I wanted to remove rows where sub='1' and day='2' and sub=3 and day='4'. How could I do this?
I realise that I could specify the row numbers, but this needs to be applied to a huge dataframe which would be tedious to go through and ID each row.
DF[ ! ( ( DF$sub ==1 & DF$day==2) | ( DF$sub ==3 & DF$day==4) ) , ] # note the ! (negation)
Or if sub is a factor as suggested by your use of quotes:
DF[ ! paste(sub,day,sep="_") %in% c("1_2", "3_4"), ]
Could also use subset:
subset(DF, ! paste(sub,day,sep="_") %in% c("1_2", "3_4") )
(And I endorse the use of which in Dirk's answer when using "[" even though some claim it is not needed.)
This boils down to two distinct steps:
Figure out when your condition is true, and hence compute a vector of booleans, or, as I prefer, their indices by wrapping it into which()
Create an updated data.frame by excluding the indices from the previous step.
Here is an example:
R> set.seed(42)
R> DF <- data.frame(sub=rep(1:4, each=4), day=sample(1:4, 16, replace=TRUE))
R> DF
sub day
1 1 4
2 1 4
3 1 2
4 1 4
5 2 3
6 2 3
7 2 3
8 2 1
9 3 3
10 3 3
11 3 2
12 3 3
13 4 4
14 4 2
15 4 2
16 4 4
R> ind <- which(with( DF, sub==2 & day==3 ))
R> ind
[1] 5 6 7
R> DF <- DF[ -ind, ]
R> table(DF)
day
sub 1 2 3 4
1 0 1 0 3
2 1 0 0 0
3 0 1 3 0
4 0 2 0 2
R>
And we see that sub==2 has only one entry remaining with day==1.
Edit The compound condition can be done with an 'or' as follows:
ind <- which(with( DF, (sub==1 & day==2) | (sub=3 & day=4) ))
and here is a new full example
R> set.seed(1)
R> DF <- data.frame(sub=rep(1:4, each=5), day=sample(1:4, 20, replace=TRUE))
R> table(DF)
day
sub 1 2 3 4
1 1 2 1 1
2 1 0 2 2
3 2 1 1 1
4 0 2 1 2
R> ind <- which(with( DF, (sub==1 & day==2) | (sub==3 & day==4) ))
R> ind
[1] 1 2 15
R> DF <- DF[-ind, ]
R> table(DF)
day
sub 1 2 3 4
1 1 0 1 1
2 1 0 2 2
3 2 1 1 0
4 0 2 1 2
R>
Here's a solution to your problem using dplyr's filter function.
Although you can pass your data frame as the first argument to any dplyr function, I've used its %>% operator, which pipes your data frame to one or more dplyr functions (just filter in this case).
Once you are somewhat familiar with dplyr, the cheat sheet is very handy.
> print(df <- data.frame(sub=rep(1:3, each=4), day=1:4))
sub day
1 1 1
2 1 2
3 1 3
4 1 4
5 2 1
6 2 2
7 2 3
8 2 4
9 3 1
10 3 2
11 3 3
12 3 4
> print(df <- df %>% filter(!((sub==1 & day==2) | (sub==3 & day==4))))
sub day
1 1 1
2 1 3
3 1 4
4 2 1
5 2 2
6 2 3
7 2 4
8 3 1
9 3 2
10 3 3
One simple solution:
cond1 <- df$sub == 1 & df$day == 2
cond2 <- df$sub == 3 & df$day == 4
df <- df[!(cond1 | cond2),]

Resources