Replace NA value with next or previous non-NA value conditional on other column - r

Below is an example data set similar to what I'm working with.
df <- data.frame(Loc = c(rev(seq(-4, 5, 1)), seq(-4, 5, 1)),
                 Reg = c("A", rep(NA, 8), "B", rep(NA, 9), "C"))
In this example we have a sequence of values (Loc) running from positive to negative or vice versa. What I am trying to accomplish is to fill in the NA values: B is always associated with negative values of Loc, while positive values take on A if the NAs fall between A and B, or C if the NAs fall between B and C.
The desired output should look like the following:
df2 <- data.frame(Loc = c(rev(seq(-4, 5, 1)), seq(-4, 5, 1)),
                  Reg = c(rep("A", 6), rep("B", 8), rep("C", 6)))
I have looked into na.locf from the zoo package, but I'm not sure how to control the direction in which the function looks for the non-NA value to get the desired output.
df$Reg2 <- ifelse(df$Loc <= 0, "B", na.locf(df$Reg, fromLast = FALSE))
The above code returns the right response only for some of the rows, whichever direction I choose (i.e. fromLast = TRUE or FALSE).
Any help on this would be much appreciated.

Use ave, splitting by a grouping variable generated from rleid of the sign. Then omit the NAs, leaving the single non-NA value in each group, which ave will copy to all values in that group.
library(data.table)
transform(df, Reg = ave(Reg, rleid(Loc >= 0), FUN = na.omit))
giving:
Loc Reg
1 5 A
2 4 A
3 3 A
4 2 A
5 1 A
6 0 A
7 -1 B
8 -2 B
9 -3 B
10 -4 B
11 -4 B
12 -3 B
13 -2 B
14 -1 B
15 0 C
16 1 C
17 2 C
18 3 C
19 4 C
20 5 C
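As a quick check, the result should match df2 from the question (a sketch; assumes R >= 4.0 so that both Reg columns are character, with data.table loaded for rleid()):
out <- transform(df, Reg = ave(Reg, rleid(Loc >= 0), FUN = na.omit))
identical(out$Reg, df2$Reg)  # expected TRUE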

Here is a data.table solution which reproduces OP's expected answer:
library(data.table)
result <- as.data.table(df)[, Reg := first(Reg[!is.na(Reg)]), by = rleid(Loc >= 0)][]
result
Loc Reg
1: 5 A
2: 4 A
3: 3 A
4: 2 A
5: 1 A
6: 0 A
7: -1 B
8: -2 B
9: -3 B
10: -4 B
11: -4 B
12: -3 B
13: -2 B
14: -1 B
15: 0 C
16: 1 C
17: 2 C
18: 3 C
19: 4 C
20: 5 C
identical(as.data.frame(result), df2)
[1] TRUE
Note that this approach is similar to G. Grothendieck's base R solution in that it uses rleid(Loc >= 0) to group the data, but instead of calling transform() and ave() it updates Reg by reference, i.e., without copying the whole object.
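To see the by-reference update in action, you can compare the object's address before and after (a sketch using data.table's address(); the address is expected to be unchanged):
library(data.table)
dt <- as.data.table(df)
before <- address(dt)           # memory address prior to the update
dt[, Reg := first(Reg[!is.na(Reg)]), by = rleid(Loc >= 0)]
identical(before, address(dt))  # TRUE: Reg was updated in place, no copy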

Here is a quick solution with dplyr:
library(dplyr)
df <- data.frame(Loc = c(rev(seq(-4, 5, 1)), seq(-4, 5, 1)),
                 Reg = c("A", rep(NA, 8), "B", rep(NA, 9), "C"))
c <- match("C", df$Reg)
a <- match("A", df$Reg)
df2 <- df %>%
  mutate(newReg = case_when(Loc < 0 ~ "B",
                            Loc >= 0 & abs(row_number() - c) < abs(row_number() - a) ~ "C",
                            TRUE ~ "A"))

Note: This is hideous and I am doubtful it is reproducible for more use cases... this is probably better suited for some type of dplyr::case_when solution, but I just couldn't think it through at this point.
lapply(2:nrow(df), function(i){
  this_row <- df[i, ]
  if(is.na(this_row[['Reg']])){
    if(this_row[['Loc']] < 0){
      df[i, 'Reg'] <<- "B"
    } else if(df[i - 1, 'Reg'] == "A"){
      df[i, 'Reg'] <<- "A"
    } else {
      df[i, 'Reg'] <<- "C"
    }
  }
})
> df
Loc Reg
1 5 A
2 4 A
3 3 A
4 2 A
5 1 A
6 0 A
7 -1 B
8 -2 B
9 -3 B
10 -4 B
11 -4 B
12 -3 B
13 -2 B
14 -1 B
15 0 C
16 1 C
17 2 C
18 3 C
19 4 C
20 5 C
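Since the lapply() above is used only for its side effects and its return value is discarded, the same logic arguably reads more naturally as a plain for loop; a minimal sketch:
# sequential fill: each row can rely on the previous row already being filled
for (i in 2:nrow(df)) {
  if (is.na(df$Reg[i])) {
    if (df$Loc[i] < 0) {
      df$Reg[i] <- "B"
    } else if (df$Reg[i - 1] == "A") {
      df$Reg[i] <- "A"
    } else {
      df$Reg[i] <- "C"
    }
  }
}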

Labelling rows according to how many times the group appeared in previous rows

Suppose I have the following data.frame object:
df = data.frame(id=(1:25),
col1=c('a','a','a',
'b','b','b',
'c','c','c',
'd','d',
'b','b','b',
'e',
'c','c','c',
'e','e',
'a','a','a',
'e','e'))
From the snapshot above, you can see that there are two groups of rows that have col1=="a": rows 1 through 3 and rows 21 through 23. Similarly, there are three groups of rows that have col1=="e": row 15, rows 19 through 20 and rows 24 through 25 (and so on and so on with "b", "c" and "d").
Here's my main question
Is it possible to label the rows according to what "chunk" we're currently on? More explicitly: since rows 1 through 3 are the first time where we have col1=="a", they should be labelled as 1. Then, rows 21 through 23 should be labelled as 2, because that is the second time that we have a set of rows that have col1=="a". Using the same logic, but for col1=="e", we'd label row 15 as 1, rows 19 and 20 as 2 and rows 24 and 25 as 3 (again, so on and so on with "b", "c" and "d").
Desired output
Here is what the resulting data.frame would look like:
df = data.frame(id=(1:25),
col1=c('a','a','a',
'b','b','b',
'c','c','c',
'd','d',
'b','b','b',
'e',
'c','c','c',
'e','e',
'a','a','a',
'e','e'),
grup=c(1,1,1,
1,1,1,
1,1,1,
1,1,
2,2,2,
1,
2,2,2,
2,2,
2,2,2,
3,3))
My attempt
I tried implementing a solution using a for loop, but that was quite slow (the original data I'm working on has about 500,000 rows), and it just looked a bit sloppy:
my_classifier = function(input_df, ref_column){
  # Keeps a tally of how many times each unique group was "found" before.
  group_counter = list()
  # Dealing with the corner case of the first row
  group_counter[[input_df[[ref_column]][1]]] = 1
  output_groups = rep(-1, nrow(input_df))
  output_groups[1] = 1
  # The for loop starts at the second row because I've already "dealt" with the
  # first row in the corner cases above
  for(i in 2:nrow(input_df)){
    prev_group = input_df[[ref_column]][i-1]
    this_group = input_df[[ref_column]][i]
    if(is.null(group_counter[[this_group]])){
      this_counter = 0
    } else {
      this_counter = group_counter[[this_group]]
    }
    if(prev_group != this_group){
      this_counter = this_counter + 1
    }
    output_groups[i] = this_counter
    group_counter[[this_group]] = this_counter
  }
  return(output_groups)
}
df$grup = my_classifier(df,'col1')
Is there a quicker/more efficient way to solve this problem? Maybe something that relies on vectorized functions or something?
Important notes
Consider that we cannot rely on the number of repetitions of each "block". Sometimes, col1 will have just one single row of a particular group, while other times the "block" will have several rows where col1 share the same value. Also consider that we cannot assume any logic in the "order" or the number of times each group shows up.
So, for example, there might be a stretch of 10 rows where col1=="z", then a stretch of 15 rows where col1=="x", then another single row where col1=="x" and then finally a stretch of 100 rows where col1=="w".
You can use data.table::rleid() twice: the first call numbers each run of col1 across the whole table, and the second renumbers those run ids within each col1 group, giving each group's occurrence count. Like this:
library(data.table)
setDT(df)[,grp:=rleid(col1)][, grp:=rleid(grp), by=col1][order(id)]
Output:
id col1 grp
<int> <char> <int>
1: 1 a 1
2: 2 a 1
3: 3 a 1
4: 4 b 1
5: 5 b 1
6: 6 b 1
7: 7 c 1
8: 8 c 1
9: 9 c 1
10: 10 d 1
11: 11 d 1
12: 12 b 2
13: 13 b 2
14: 14 b 2
15: 15 e 1
16: 16 c 2
17: 17 c 2
18: 18 c 2
19: 19 e 2
20: 20 e 2
21: 21 a 2
22: 22 a 2
23: 23 a 2
24: 24 e 3
25: 25 e 3
id col1 grp
Here is a possible base R solution:
# number each run of col1 across the whole table: 1,1,1,2,2,2,...
change <- with(rle(df$col1), rep(seq_along(values), lengths))
# within each level of col1, renumber those run ids consecutively (1, 2, 3, ...)
cbind(df, grp = with(df, ave(
  change,
  col1,
  FUN = function(x)
    inverse.rle(within.list(rle(x), values <- seq_along(values)))
)))
Or another option uses a combination of rle and dplyr, with the function from here:
rle_new <- function(x) {
x <- rle(x)$lengths
rep(seq_along(x), times=x)
}
library(dplyr)
df %>%
mutate(grp = rle_new(col1)) %>%
group_by(col1) %>%
mutate(grp = rle_new(grp))
Output
id col1 grp
1 1 a 1
2 2 a 1
3 3 a 1
4 4 b 1
5 5 b 1
6 6 b 1
7 7 c 1
8 8 c 1
9 9 c 1
10 10 d 1
11 11 d 1
12 12 b 2
13 13 b 2
14 14 b 2
15 15 e 1
16 16 c 2
17 17 c 2
18 18 c 2
19 19 e 2
20 20 e 2
21 21 a 2
22 22 a 2
23 23 a 2
24 24 e 3
25 25 e 3

Iteratively calculate columns of a data.table one row at a time (recursive column definitions)

Background/Example
Hi all,
I am trying to use existing columns within a data.table to calculate new columns. However, the columns rely on the previous row's value; for example, say my column R is defined by R[t] = A[t] + B[t] + R[t-1]. I have two columns that make up my key: scenario and t. How I have been trying to do this is:
Current solution:
for(i in 1:maxScenario){
  for(j in 2:nrow(dt)) {
    dt[scenario == i & t == j, "R"] <- dt[scenario == i & t == j - 1, "R"] +
      dt[scenario == i & t == j, "A"] + dt[scenario == i & t == j, "B"]
  } # end for loop for t
} # end for loop for scenario
The distinction here is that after the "<-" I'm using j - 1 instead of j for R to retrieve the previous row's value.
Question
I realize this is adding a lot of computation time, and is a pretty rough way to go about this. Is there a better way within the data.table package to do this? I have tried using shift() but ran into problems there. Using shift() doesn't "recalculate" the columns based on A and B.
I have considered using a recursive formula, but I wasn't sure what that would do to efficiency and run time. Ideally, I'm hoping to run about 100K scenarios and need these calculations tacked on after the stochastic scenarios are completed.
Thanks!
Edit: Example
Here's an attempt at a small example. Each row's value of R depends on the value from the previous row.
t R A B
1 0 1 2
2 3 2 3
3 8 2 5
4 15 8 5
5 28 10 8
Edit 2: Further Clarification
I was finally able to translate my actual problem function into algebra:
R[t] = λ*P[t] + λ*R[t-1] - min{λ*P[t] + λ*R[t-1], D[t]} - A(t) * max{λ*P[t] + λ*R[t-1] - D[t] - M[t], 0}, where P[t], D[t], and M[t] are other known columns and A(t) is an indicator function that returns 0 when t %% 4 != 0, and 1 otherwise.
Is there a way to use shift() and cumsum() with such a nested equation?
Here is an option using Rcpp with data.table, as it is easier to think/code in C++ for a recursive equation:
DT[, A := +(t %% 4 == 0)]
library(Rcpp)
cppFunction('NumericVector recur(double lambda, NumericVector P,
                                 NumericVector D, NumericVector M, NumericVector A) {
    int sz = P.size(), t;
    NumericVector R(sz);
    for (t = 1; t < sz; t++) {
        R[t] = lambda * P[t] + lambda * R[t-1] -
            std::min(lambda * P[t] + lambda * R[t-1], D[t]) -
            A[t] * std::max(lambda * P[t] * lambda * R[t-1] - D[t] - M[t], 0.0);
    }
    return(R);
}')
DT[, R := recur(lambda, P, D, M, A)]
output:
t P D M A R
1: 1 1.262954285 0.25222345 -0.4333103 0 0.00000000
2: 2 -0.326233361 -0.89192113 -0.6494716 0 0.72880445
3: 3 1.329799263 0.43568330 0.7267507 0 0.59361856
4: 4 1.272429321 -1.23753842 1.1519118 1 1.89610128
5: 5 0.414641434 -0.22426789 0.9921604 0 1.37963924
6: 6 -1.539950042 0.37739565 -0.4295131 0 0.00000000
7: 7 -0.928567035 0.13333636 1.2383041 0 0.00000000
8: 8 -0.294720447 0.80418951 -0.2793463 1 0.00000000
9: 9 -0.005767173 -0.05710677 1.7579031 0 0.05422319
10: 10 2.404653389 0.50360797 0.5607461 0 0.72583032
11: 11 0.763593461 1.08576936 -0.4527840 0 0.00000000
12: 12 -0.799009249 -0.69095384 -0.8320433 1 -1.23154792
13: 13 -1.147657009 -1.28459935 -1.1665705 0 0.09499689
14: 14 -0.289461574 0.04672617 -1.0655906 0 0.00000000
15: 15 -0.299215118 -0.23570656 -1.5637821 0 0.08609900
16: 16 -0.411510833 -0.54288826 1.1565370 1 0.38018234
data:
library(data.table)
set.seed(0L)
nr <- 16L
DT <- data.table(t=1L:nr, P=rnorm(nr), D=rnorm(nr), M=rnorm(nr))
lambda <- 0.5
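If the real data also carries the scenario key from the question, the same function could presumably be applied per group (a sketch; assumes DT has a scenario column and that R restarts at 0 within each scenario):
DT[, R := recur(lambda, P, D, M, A), by = scenario]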
This creates a new column R2 with the same values as R:
DT[, R2 := shift( cumsum(A+B), type = "lag", fill = 0 ) ][]
# t R A B R2
# 1: 1 0 1 2 0
# 2: 2 3 2 3 3
# 3: 3 8 2 5 8
# 4: 4 15 8 5 15
# 5: 5 28 10 8 28
To my knowledge there is no way to iteratively calculate the rows with built-in functions from data.table. I believe there is a duplicate question out there with a similar problem (although I cannot find it right now).
We can however speed up the calculations by noting the tricks we could use in the formulation. First, to obtain the result in the example provided, note that each row's R is the cumulative sum of the previous rows' A + B, which is just cumsum(shift(A, 1, fill = 0) + shift(B, 1, fill = 0)):
dt <- fread('t R A B
1 0 1 2
2 3 2 3
3 8 2 5
4 15 8 5
5 28 10 8')
dt[, R2 := cumsum(shift(A, 1, fill = 0) + shift(B, 1, fill = 0))]
dt
t R A B R2
1: 1 0 1 2 0
2: 2 3 2 3 3
3: 3 8 2 5 8
4: 4 15 8 5 15
5: 5 28 10 8 28
However, for the exact problem described, R[t] = A[t] + B[t] + R[t-1], we will have to be a bit smarter:
dt[, R3 := cumsum(A + B) - head(A + B, 1)]
dt
t R A B R2 R3
1: 1 0 1 2 0 0
2: 2 3 2 3 3 5
3: 3 8 2 5 8 12
4: 4 15 8 5 15 25
5: 5 28 10 8 28 43
This follows the description above: unrolling the recurrence with R[1] = 0 gives R[t] = cumsum(A + B)[t] minus the first term, A[1] + B[1]. Note that I subtract the first term on the assumption that R[1] = 0; otherwise it simply becomes cumsum(A + B).
Edit
As the question is asking about some possibly more complicated situations, I'll add an example using a slower (but more general) approach. The idea here is to use the set function, to avoid intermediate shallow copies (see help(set) or help("datatable-optimize")).
dt[, R4 := 0]
for(i in seq.int(2, dt[, .N])){
#dummy complicated scenario
f <- dt[seq(i), lm(A ~ B - 1)]
set(dt, i, 'R4', unname(unlist(coef(f))))
}
dt
t R A B R2 R3 R4
1: 1 0 1 2 0 0 0.0000000
2: 2 3 2 3 3 5 0.6153846
3: 3 8 2 5 8 12 0.4736842
4: 4 15 8 5 15 25 0.9206349
5: 5 28 10 8 28 43 1.0866142

Selecting rows by offsetting

I have this data frame, lets call it my_df.
It looks like this:
my_df <- data.frame(rnorm(n = 30,sd=.5),rep(c("a","b","c"),each=10))
names(my_df) <- c("num","let")
head(my_df)
num let
1 0.01202600 a
2 1.09025768 a
3 -0.08656178 a
4 -0.04847073 a
5 -0.63750258 a
6 0.58846135 a
What I want to do is select all of the rows where my_df$let == "b", as well as the five rows before the first row where my_df$let == "b", and the five rows after the last row where my_df$let == "b". So basically my_df[6:25,].
The data I'm actually working with is hundreds of thousands of lines long and I don't know which rows are which; besides that, the data sets don't match up row-wise, and I can't take the time to go through each set of data individually. I've been using a subset to select the data I want, but I don't know how to select the additional rows outside of the subset (1000 rows before and after).
Here's my subset for what I'm doing:
# The following lines separate pXX_NoNegative into individual field sections
p04_HighWeeds <- subset(p04_NoNegative, subset = p04_NoNegative$GS_Field == "High Weeds")
I want to select all of the rows that the above code selects, but I also want 100 rows before that, and 1000 rows after that.
If you need any additional information that may help you please ask.
Here's another idea using dplyr:
library(dplyr)
my_df %>% filter(lead(let == "b", 5) | lag(let == "b", 5))
Or, as per @akrun's suggestion, using the devel version of data.table:
setDT(my_df)[shift(let == "b", 5) | shift(let == "b", type = "lead", 5)]
Which gives:
# num let
#1 0.36723709 a
#2 0.24743170 a
#3 -0.33339924 a
#4 -0.57024317 a
#5 0.03390278 a
#6 -0.43495096 b
#7 -0.85107347 b
#8 0.53048931 b
#9 -0.26739611 b
#10 -0.96029355 b
#11 -0.71737408 b
#12 0.34324685 b
#13 0.12319646 b
#14 0.75207703 b
#15 0.18134006 b
#16 -0.02230777 c
#17 0.42646106 c
#18 -0.11055478 c
#19 0.06013187 c
#20 0.50782158 c
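A single offset of 5 happens to work here because the "b" run is long enough that every "b" row has another "b" exactly five positions away; for arbitrary run lengths you would OR over all offsets up to 5. A sketch of that idea (illustrative, combining base R with dplyr's lead/lag):
b <- my_df$let == "b"
# keep "b" rows plus anything within 5 positions of one; which() drops the NAs
keep <- Reduce(`|`, lapply(1:5, function(k) dplyr::lead(b, k) | dplyr::lag(b, k)), init = b)
my_df[which(keep), ]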
Normally, splitting a data frame into a list of data frames based on some categorization is straightforward; in your case you would use split(my_df, my_df$let). However, with the added complication that you want some number of rows before or after each category's rows, I would operate over the set of unique categories, selecting the rows you want in each case:
before <- 5
after <- 5
ret <- setNames(lapply(unique(my_df$let), function(x) {
  positions <- which(my_df$let == x)
  start.pos <- max(1, min(positions) - before)
  end.pos <- min(nrow(my_df), max(positions) + after)
  my_df[start.pos:end.pos, ]
}), unique(my_df$let))
You can grab the observations for any category you want out of the returned list:
ret$b # Also works: ret[["b"]]
# num let
# 6 -0.197901427 a
# 7 0.194607192 a
# 8 -0.107318203 a
# 9 -0.365313233 a
# 10 -0.188926562 a
# 11 0.636272295 b
# 12 -0.058791973 b
# 13 -0.231029510 b
# 14 0.519441716 b
# 15 0.239510912 b
# 16 0.107025658 b
# 17 -0.446644081 b
# 18 0.145052077 b
# 19 -0.426090749 b
# 20 -0.356062993 b
# 21 -0.155012203 c
# 22 -0.007968255 c
# 23 -0.504253089 c
# 24 0.081624303 c
# 25 -0.657008233 c
I recently answered a nearly identical question: Select n rows after specific number. Adapting the single-segment solution to your data:
set.seed(1); my_df <- data.frame(rnorm(n = 30,sd=.5),rep(c("a","b","c"),each=10));
names(my_df) <- c("num","let");
brange <- range(which(my_df$let=='b'));     # first and last row where let == "b"
my_df$offb <- c((1-brange[1]):-1,           # negative offsets before the "b" block
                rep(0,diff(brange)+1),      # zero within the block
                1:(nrow(my_df)-brange[2])); # positive offsets after it
my_df;
## num let offb
## 1 -0.313226905 a -10
## 2 0.091821662 a -9
## 3 -0.417814306 a -8
## 4 0.797640401 a -7
## 5 0.164753886 a -6
## 6 -0.410234192 a -5
## 7 0.243714526 a -4
## 8 0.369162353 a -3
## 9 0.287890676 a -2
## 10 -0.152694194 a -1
## 11 0.755890584 b 0
## 12 0.194921618 b 0
## 13 -0.310620290 b 0
## 14 -1.107349944 b 0
## 15 0.562465459 b 0
## 16 -0.022466805 b 0
## 17 -0.008095132 b 0
## 18 0.471918105 b 0
## 19 0.410610598 b 0
## 20 0.296950661 b 0
## 21 0.459488686 c 1
## 22 0.391068150 c 2
## 23 0.037282492 c 3
## 24 -0.994675848 c 4
## 25 0.309912874 c 5
## 26 -0.028064370 c 6
## 27 -0.077897753 c 7
## 28 -0.735376192 c 8
## 29 -0.239075028 c 9
## 30 0.208970780 c 10
subset(my_df,offb>=-5&offb<=5);
## num let offb
## 6 -0.410234192 a -5
## 7 0.243714526 a -4
## 8 0.369162353 a -3
## 9 0.287890676 a -2
## 10 -0.152694194 a -1
## 11 0.755890584 b 0
## 12 0.194921618 b 0
## 13 -0.310620290 b 0
## 14 -1.107349944 b 0
## 15 0.562465459 b 0
## 16 -0.022466805 b 0
## 17 -0.008095132 b 0
## 18 0.471918105 b 0
## 19 0.410610598 b 0
## 20 0.296950661 b 0
## 21 0.459488686 c 1
## 22 0.391068150 c 2
## 23 0.037282492 c 3
## 24 -0.994675848 c 4
## 25 0.309912874 c 5

R: using if/else to append column in a list with objects of varying lengths

I am trying to append a column of values to the elements of an R list, where each element is of varying length. Here is an example list foo:
A B C
1 1 150
1 2 25
1 4 30
2 1 200
2 3 15
3 4 30
First, I split foo into a list with one element for each unique value of A. Now, I would like to write a function that a) sums the values of C for each value of A, but b) excludes rows where B == 4 from that sum. c) The sum is appended as a new column D, and d) C is divided by D to yield a proportion (column E). Ultimately, it would be combined in a new df to look like:
A B C D E
1 1 150 175 0.857
1 2 25 175 0.143
1 4 30 175 0.171
2 1 200 215 0.930
2 3 15 215 0.070
3 4 30 0 0/NA
However, I'm having problems because in some cases, for a given value of A, there are only cases when B == 4 (here, where A == 3), so when I try to divide C by D, I get error messages.
Is there a way to incorporate an if/else statement into the function so that when, for a given value of A, the only value of B present is 4, the operation is skipped and a default non-zero value is placed in the appended column?
Subsetting the df to exclude cases where B == 4 makes later operations more difficult, but including those cases makes the sum/proportion calculation inaccurate.
Any help is appreciated! Here is the current code:
goo <- lapply(foo, function(df){
  df$D <- sum(df$C, na.rm = TRUE)
  df$E <- df$C / df$D
  ### .....
  df
})
Here's how I would do it using dplyr
library(dplyr)
newfoo <- foo %>%
  group_by(A) %>%
  mutate(D = sum(C[B != 4]),
         E = C/D)
#newfoo # the resulting data.frame
#Source: local data frame [6 x 5]
#Groups: A
#
# A B C D E
#1 1 1 150 175 0.85714286
#2 1 2 25 175 0.14285714
#3 1 4 30 175 0.17142857
#4 2 1 200 215 0.93023256
#5 2 3 15 215 0.06976744
#6 3 4 30 0 Inf
Or if you want to avoid Inf, you can use ifelse like this:
newfoo <- foo %>%
  group_by(A) %>%
  mutate(D = sum(C[B != 4]),
         E = ifelse(D == 0, 0, C/D))
#Source: local data frame [6 x 5]
#Groups: A
#
# A B C D E
#1 1 1 150 175 0.85714286
#2 1 2 25 175 0.14285714
#3 1 4 30 175 0.17142857
#4 2 1 200 215 0.93023256
#5 2 3 15 215 0.06976744
#6 3 4 30 0 0.00000000
And a possible data.table solution:
library(data.table)
setDT(foo)[, D := sum(C[B != 4]), by = A][, E := C/D]
# foo
# A B C D E
# 1: 1 1 150 175 0.85714286
# 2: 1 2 25 175 0.14285714
# 3: 1 4 30 175 0.17142857
# 4: 2 1 200 215 0.93023256
# 5: 2 3 15 215 0.06976744
# 6: 3 4 30 0 Inf
Not sure what you want to put into column E when A == 3, but you can use is.finite and avoid messing around with ifelse, for example (replacing the Inf with a zero):
setDT(foo)[, D := sum(C[B!=4]), by = A][, E := C/D][!is.finite(E), E := 0]
Here is a solution using the base package.
First, ensure that the data are modeled appropriately by converting A into a factor if it is not one already:
df$A <- factor(df$A)
Now, we can compute D using tapply, which iterates groupwise and returns the result as a table. We do this with the subset of df where B != 4.
df$D <- with(subset(df, B != 4), tapply(C, A, sum))[df$A]
Note that since A is a factor, we can index into the table to perform the merge. Now we can use ifelse to compute E:
df$E <- with(df, ifelse(is.na(D), 0, C/D))
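To see the indexing trick in isolation, here is a small sketch (values are illustrative); indexing by a factor uses its integer codes, which line up with the order of the table returned by tapply:
tab <- tapply(c(10, 20, 5), factor(c("a", "a", "b")), sum)
tab                            # a: 30, b: 5
tab[factor(c("a", "b", "b"))]  # 30 5 5 (per-row lookup via the integer codes)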

Conditional calculation of means of different columns in data.table with R

An earlier question discussed the calculation of means and medians of vector t, for each value of vector y (from 1 to 4) where x == 1 and z == 1, using the aggregate function in R:
x y z t
1 1 1 10
1 0 1 15
2 NA 1 14
2 3 0 15
2 2 1 17
2 1 NA 19
3 4 2 18
3 0 2 NA
3 2 2 45
4 3 2 NA
4 1 3 59
5 0 3 0
5 4 3 45
5 4 4 74
5 1 4 86
Multiple aggregation in R with 4 parameters
But how can I calculate, for each value (from 1 to 5) of vector x, (mean(y) + mean(z)) / (mean(z) - mean(t))? Calculations should not use elements that are 0 or NA in any vector. For example, in vector y the 3rd value is 0, so the 3rd element of every vector (y, z, t) should not be used, and as a result the third row (for x = 3) should be NA.
Here is the code for calculating the means of y, z and t; the formula (mean(y) + mean(z)) / (mean(z) - mean(t)) still needs to be added:
data <- data.table(dataframe)
bar <- data[, .N, by = x]
foo <- data[, list(mean.y = mean(y, na.rm = T),
                   mean.z = mean(z, na.rm = T),
                   mean.t = mean(t, na.rm = T)),
            by = x]
In this code all rows are used to calculate the means, but for (mean(y) + mean(z)) / (mean(z) - mean(t)) any row where y, z or t is zero or NA should not be used.
Update:
Oh, this can be further simplified, as data.table doesn't subset NA by default (especially with such cases in mind, similar to base::subset). So, you just have to do:
dt[y != 0 & z != 0 & t != 0,
list(ans = (mean(y) + mean(z))/(mean(z) - mean(t))), by = x]
FWIW, here's how I'd do it in data.table:
dt[(y | NA) & (z | NA) & (t | NA),
list(ans=(mean(y)+mean(z))/(mean(z)-mean(t))), by=x]
# x ans
# 1: 1 -0.22222222
# 2: 2 -0.18750000
# 3: 3 -0.16949153
# 4: 4 -0.07142857
# 5: 5 -0.10309278
Let's break it down with the general syntax: dt[i, j, by]:
In i, we filter for your conditions using a nice little hack: TRUE | NA is TRUE, FALSE | NA is NA, and NA | NA is NA (you can test these out in your R session; see the snippet below).
Since you say you need only the non-zero non-NA values, it's just a matter of |ing each column with NA - which'll return TRUE only for your condition. That settles the subset by condition part.
Then for each group in by, we aggregate according to your function, in j, to get the result.
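A quick sanity check of the hack in an R session:
TRUE  | NA   # TRUE -> a non-zero, non-NA value keeps the row
FALSE | NA   # NA   -> a zero drops the row (rows with NA in i are dropped)
NA    | NA   # NA   -> an NA drops the row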
HTH
Here's one solution:
# create your sample data frame
df <- read.table(text = " x y z t
1 1 1 10
1 0 1 15
2 NA 1 14
2 3 0 15
2 2 1 17
2 1 NA 19
3 4 2 18
3 0 2 NA
3 2 2 45
4 3 2 NA
4 1 3 59
5 0 3 0
5 4 3 45
5 4 4 74
5 1 4 86", header = TRUE)
library('dplyr')
dfmeans <- df %>%
  filter(!is.na(y) & !is.na(z) & !is.na(t)) %>%  # remove rows with NAs
  filter(y != 0 & z != 0 & t != 0) %>%           # remove rows with zeroes
  group_by(x) %>%
  summarize(xmeans = (mean(y) + mean(z)) / (mean(z) - mean(t)))
I'm sure there is a simpler way to remove the rows with NAs and zeroes, but it's not coming to me. Anyway, dfmeans looks like this:
# x xmeans
# 1 1 -0.22222222
# 2 2 -0.18750000
# 3 3 -0.16949153
# 4 4 -0.07142857
# 5 5 -0.10309278
And if you just want the values from xmeans use dfmeans$xmeans.
