For each row, I am trying to compute the cumulative sum of the values up to my current position minus the mean of the column. For example I have:
A
1
2
3
4
5
and I want this:
A B
1 NA
2 3-mean(A)
3 6-mean(A)
4 10-mean(A)
5 15-mean(A)
Not sure why you want NA as the first value for the B column. Here I use 1-mean(A) instead:
> A <- 1:5
> data.frame(A=A, B=cumsum(A)-mean(A))
A B
1 1 -2
2 2 0
3 3 3
4 4 7
5 5 12
library(dplyr)
x<-data.frame(a=1:6)
x %>%
mutate(mycol=mean(a)-lag(cumsum(a),1))
a mycol
1 1 NA
2 2 2.5
3 3 0.5
4 4 -2.5
5 5 -6.5
6 6 -11.5
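Note that the dplyr version above computes mean(a) minus the lagged cumulative sum, which has the opposite sign and is shifted by one row compared with the table in the question. If the output in the question is wanted exactly (NA in the first row, then cumsum minus mean), a minimal base R sketch could look like this. Blanking the first row is an assumption taken from the question's table, not something either answer states:

```r
A <- 1:5

# Cumulative sum up to the current row, minus the overall mean of A
B <- cumsum(A) - mean(A)

# Assumption: the question's table shows NA for the first row, so blank it out
B[1] <- NA

res <- data.frame(A = A, B = B)
print(res)
```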
For each ID, I want to return the value in the 'distance' column where the value becomes negative for the first time. If the value does not become negative at all, return the value 99 (or some other random number) for that ID. A sample data frame is given below.
df <- data.frame(ID=c(rep(1, 4),rep(2,4),rep(3,4),rep(4,4),rep(5,4)),distance=rep(1:4,5), value=c(1,4,3,-1,2,1,-4,1,3,2,-1,1,-4,3,2,1,2,3,4,5))
> df
ID distance value
1 1 1 1
2 1 2 4
3 1 3 3
4 1 4 -1
5 2 1 2
6 2 2 1
7 2 3 -4
8 2 4 1
9 3 1 3
10 3 2 2
11 3 3 -1
12 3 4 1
13 4 1 -4
14 4 2 3
15 4 3 2
16 4 4 1
17 5 1 2
18 5 2 3
19 5 3 4
20 5 4 5
The desired output is as follows
> df2
ID first_negative_distance
1 1 4
2 2 3
3 3 3
4 4 1
5 5 99
I tried but couldn't figure out how to do it through dplyr. Any help would be much appreciated. The actual data I'm working on has thousands of IDs with 30 different distance levels for each. Bear in mind that for any ID, there could be multiple negative values. I just need the first one.
Edit:
Tried the solution proposed by AntonoisK.
> df%>%group_by(ID)%>%summarise(first_neg_dist=first(distance[value<0]))
first_neg_dist
1 4
This is the result I am getting. Does not match what Antonois got. Not sure why.
library(dplyr)
df %>%
group_by(ID) %>%
summarise(first_neg_dist = first(distance[value < 0]))
# # A tibble: 5 x 2
# ID first_neg_dist
# <dbl> <int>
# 1 1 4
# 2 2 3
# 3 3 3
# 4 4 1
# 5 5 NA
If you really prefer 99 instead of NA you can use
summarise(first_neg_dist = coalesce(first(distance[value < 0]), 99L))
instead.
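The single-row result in the question's edit looks like what happens when the grouping is ignored, for example when another package's summarise masks dplyr's. That is a guess; the post does not confirm it. In that case, calling dplyr::summarise explicitly may help. As a cross-check that does not depend on dplyr at all, here is a base R sketch; the names first_neg and df2 are illustrative, and it assumes the rows are already ordered by distance within each ID:

```r
df <- data.frame(ID = rep(1:5, each = 4),
                 distance = rep(1:4, 5),
                 value = c(1, 4, 3, -1, 2, 1, -4, 1, 3, 2, -1, 1,
                           -4, 3, 2, 1, 2, 3, 4, 5))

# For each ID, take the first distance whose value is negative;
# fall back to 99 when the ID has no negative value at all
first_neg <- sapply(split(df, df$ID), function(g) {
  hit <- g$distance[g$value < 0]
  if (length(hit) > 0) hit[1] else 99L
})

df2 <- data.frame(ID = as.integer(names(first_neg)),
                  first_negative_distance = unname(first_neg))
print(df2)
```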
Say I have a data.frame:
t<-c(1,1,2,4,NA,3)
u<-c(1,3,4,6,4,2)
v<-c(2,3,4,NA,3,2)
w<-c(2,3,4,5,2,3)
x<-c(2,3,4,5,6,NA)
df<-data.frame(t,u,v,w,x)
df
t u v w x
1 1 1 2 2 2
2 1 3 3 3 3
3 2 4 4 4 4
4 4 6 NA 5 5
5 NA 4 3 2 6
6 3 2 2 3 NA
I would like to change the NAs so that the NA becomes replaced by the average of the one value before the NA and the one value after the NA. However, if a row starts with an NA I would like it to be replaced by the value that follows it. When a row ends with NA, I would like it to be replaced by the value before the NA. Thus, I would like to get the following result:
t u v w x
1 1 1 2 2 2
2 1 3 3 3 3
3 2 4 4 4 4
4 4 6 5.5 5 5 --> NA becomes average of 6 and 5
5 4 4 3 2 6 --> NA becomes value of next case
6 3 2 2 3 3 --> NA becomes value of previous case
I have thousands of rows, so any help is very much appreciated!
Based on previous na.approx solutions this might do the trick:
library(zoo)
t(apply(df, 1,function(x) na.approx(x,rule=2)))
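If installing zoo is not an option, roughly the same thing can be sketched with base R's approx(), which also accepts rule = 2 to carry the nearest value outward at the ends. The helper name fill_row is made up for this example, and the sketch assumes every row has at least two non-missing values:

```r
t <- c(1, 1, 2, 4, NA, 3)
u <- c(1, 3, 4, 6, 4, 2)
v <- c(2, 3, 4, NA, 3, 2)
w <- c(2, 3, 4, 5, 2, 3)
x <- c(2, 3, 4, 5, 6, NA)
df <- data.frame(t, u, v, w, x)

# Hypothetical helper: linear interpolation along one row.
# rule = 2 repeats the nearest observed value for leading/trailing NAs.
fill_row <- function(vals) {
  pos <- seq_along(vals)
  ok <- !is.na(vals)
  approx(pos[ok], vals[ok], xout = pos, rule = 2)$y
}

res <- t(apply(df, 1, fill_row))
colnames(res) <- names(df)
print(res)
```

For a single isolated NA, linear interpolation between the two neighbours is the same as their average, which is why this matches the requested output.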
Always look for the na.rm = TRUE parameter in the functions that you use.
In this case you can take the mean of one of the columns with the na.rm parameter set to TRUE.
Then you substitute the NAs with it.
df[is.na(df[,'t']),'t'] = mean(df[,'t'], na.rm = TRUE)
(assuming that I did not reverse the order of dimensions)
Here is a possible solution:
if a value is NA, replace it with (lag + lead) / 2; if it is still NA, replace it with the lead; if it is still NA, replace it with the lag.
library(dplyr)
t(apply(df, 1, function(x) {
  lagx = dplyr::lag(x)
  leadx = dplyr::lead(x)
  b = ifelse(is.na(x), (leadx + lagx) / 2, x)
  b = ifelse(is.na(b), leadx, b)
  b = ifelse(is.na(b), lagx, b)
  return(b)
}))
#output
t u v w x
[1,] 1 1 2.0 2 2
[2,] 1 3 3.0 3 3
[3,] 2 4 4.0 4 4
[4,] 4 6 5.5 5 5
[5,] 4 4 3.0 2 6
[6,] 3 2 2.0 3 3
t<-c(1,1,2,4,NA,3)
u<-c(1,3,4,6,4,2)
v<-c(2,3,4,NA,3,2)
w<-c(2,3,4,5,2,3)
x<-c(2,3,4,5,6,NA)
df<-data.frame(t,u,v,w,x)
df[which(is.na(t)), "t"] <- df[which(is.na(t)), "u"]
df[which(is.na(x)), "x"] <- df[which(is.na(x)), "w"]
df[which(is.na(v)), "v"] <- (df[which(is.na(v)), "u"] + df[which(is.na(v)), "w"])/2
> df
t u v w x
1 1 1 2.0 2 2
2 1 3 3.0 3 3
3 2 4 4.0 4 4
4 4 6 5.5 5 5
5 4 4 3.0 2 6
6 3 2 2.0 3 3
Suppose I have a three-by-three symmetric matrix.
> x<-matrix(1:9,3)
> x[lower.tri(x)] = t(x)[lower.tri(x)]
> x
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 4 5 8
[3,] 7 8 9
Then I use the reshape2 library to convert it to long format.
> library(reshape2)
> x <- melt(x)
> x
Var1 Var2 value
1 1 1 1
2 2 1 4
3 3 1 7
4 1 2 4
5 2 2 5
6 3 2 8
7 1 3 7
8 2 3 8
9 3 3 9
As the upper and lower triangles are identical, I only need half of the result, which would look like this:
Var1 Var2 value
1 1 1
2 1 4
3 1 7
2 2 5
3 2 8
3 3 9
Is there an elegant way to do this?
On the original matrix (before melting), you can set the values in the upper or lower triangle to NA and then melt while dropping missing values. This assumes the matrix has no missing values originally, or that you don't need to keep them in the result if it does:
x[upper.tri(x)] = NA
reshape2::melt(x, na.rm=TRUE)
# Var1 Var2 value
#1 1 1 1
#2 2 1 4
#3 3 1 7
#5 2 2 5
#6 3 2 8
#9 3 3 9
Since 'x' has already been melted, we can sort the first two columns within each row, build a logical index of the non-duplicated pairs, and use it to subset the rows:
x[!duplicated(t(apply(x[1:2], 1, sort))),]
# Var1 Var2 value
#1 1 1 1
#2 2 1 4
#3 3 1 7
#5 2 2 5
#6 3 2 8
#9 3 3 9
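A third option that avoids reshape2 altogether is to ask base R for the positions of the lower triangle (diagonal included) and build the long table directly. This is only a sketch; the column names Var1, Var2 and value are chosen to mirror melt's output, and the object names idx and long are illustrative:

```r
x <- matrix(1:9, 3)
x[lower.tri(x)] <- t(x)[lower.tri(x)]   # make the matrix symmetric

# Row/column positions of the lower triangle, including the diagonal
idx <- which(lower.tri(x, diag = TRUE), arr.ind = TRUE)

long <- data.frame(Var1 = idx[, "row"],
                   Var2 = idx[, "col"],
                   value = x[idx])
print(long)
```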
I have a dataframe as shown below. I want to reverse the order of column B without disturbing the order of the rest of the dataframe. Column B currently holds 5,4,3,2,1 and I want to change it to 1,2,3,4,5. I don't want to sort, as that would disturb the overall ordering.
A B C
1 5 6
2 4 8
3 3 5
4 2 5
5 1 3
You can replace just that column:
x$B <- rev(x$B)
On your data:
> x$B <- rev(x$B)
> x
A B C
1 1 1 6
2 2 2 8
3 3 3 5
4 4 4 5
5 5 5 3
transform is also handy for this:
> transform(x, B = rev(B))
A B C
1 1 1 6
2 2 2 8
3 3 3 5
4 4 4 5
5 5 5 3
This doesn't modify x so you need to assign the result to something (perhaps back to x).
I have the sample list shown below, containing data frames (each has ID, yobs, and x, the independent variable). I want to find the AUC (trapezoidal rule) for each case (ID),
so that my output (a master data frame) looks like the one shown at the end.
Can anybody suggest an efficient way of doing this? I have a large number of rows for each ID.
Thank you
# Some made-up code for only one data frame
Y1=c(0,2,5,7,9)
Y2=c(0,1,3,8,11)
Y3=c(0,4,8,9,12,14,18)
t1=c(0:4)
t2=c(0:4)
t3=c(0:6)
a1=data.frame(ID=1,y=Y1,x=t1)
a2=data.frame(ID=2,y=Y2,x=t2)
a3=data.frame(ID=3,y=Y3,x=t3)
data=rbind(a1,a2,a3)
#dataA(Just to show)
ID obs time
1 1 0 0
2 1 2 1
3 1 5 2
4 1 7 3
5 1 9 4
6 2 0 0
7 2 1 1
8 2 3 2
9 2 8 3
10 2 11 4
11 3 0 0
12 3 4 1
13 3 8 2
14 3 9 3
15 3 12 4
16 3 14 5
17 3 18 6
#dataB(Just to show)
ID obs time
1 1 0 0
2 1 2 1
3 1 5 2
4 1 7 3
5 1 9 4
6 2 0 0
7 2 1 1
8 2 3 2
#dataC(Just to show)
ID obs time
1 1 0 0
2 1 2 1
3 1 5 2
4 1 7 3
5 1 9 4
6 2 0 0
7 2 1 1
8 2 3 2
##Desired output
ID AUC
dataA 1 XX
dataA 2 XX
dataA 3 XX
dataB 1 XX
dataB 2 XX
dataC 1 XX
dataC 2 XX
Here are two other ways. The first uses integrate(...) on a function defined by the linear interpolation between the points. The second uses the trapz(...) function described in the comment from @nrussel.
f <- function(x,df) approxfun(df)(x)
sapply(split(data,data$ID),function(df)c(integrate(f,min(df$x),max(df$x),df[3:2])$value))
# 1 2 3
# 18.5 17.5 56.0
library(caTools)
sapply(split(data,data$ID),function(df) trapz(df$x,df$y))
# 1 2 3
# 18.5 17.5 56.0
I'm guessing something like this would work
calcauc<-function(data) {
psum<-function(x) rowSums(embed(x,2))
stack(lapply(split(data, data$ID), function(z)
with(z, sum(psum(y) * diff(x)/ 2)))
)
}
calcauc(data)
# values ind
# 1 18.5 1
# 2 17.5 2
# 3 56.0 3
Of course, x and y values are normally between 0 and 1 for ROC curves, which is why the "AUC" values here look so large; really this is just the area of the polygon under the line defined by the points in the data set.
The psum function is just a helper function to calculate pair-wise sums (useful in the formula for the area of trapezoid).
Basically we use split() to look at one ID at a time, then we calculate the area for each ID, then we use stack() to bring everything back into one data.frame.
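The answers above work on a single data frame, while the desired output covers a list of them. A rough sketch of that last step in base R follows. The names trap_auc, auc_one, data_list and master are invented for the example, dataB and dataC are built here as the first eight rows of the sample data to mimic the tables in the question, and x is assumed to be sorted in increasing order within each ID:

```r
# Hypothetical helper: trapezoidal rule for points (x, y), x sorted ascending
trap_auc <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

a1 <- data.frame(ID = 1, y = c(0, 2, 5, 7, 9), x = 0:4)
a2 <- data.frame(ID = 2, y = c(0, 1, 3, 8, 11), x = 0:4)
a3 <- data.frame(ID = 3, y = c(0, 4, 8, 9, 12, 14, 18), x = 0:6)
data <- rbind(a1, a2, a3)

# Assumed list layout: one named data frame per data set
data_list <- list(dataA = data, dataB = data[1:8, ], dataC = data[1:8, ])

# AUC per ID within one data set, tagged with the data set's name
auc_one <- function(name) {
  d <- data_list[[name]]
  auc <- sapply(split(d, d$ID), function(g) trap_auc(g$x, g$y))
  data.frame(dataset = name, ID = names(auc), AUC = unname(auc))
}

master <- do.call(rbind, lapply(names(data_list), auc_one))
print(master)
```

Because each group is reduced with vectorised diff() and sum(), the cost per ID grows linearly with its number of rows.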