R: using if/else to append column in a list with objects of varying lengths

I am trying to append a column of values to the elements of an R list, where each element is of varying length. Here is an example list foo:
A B C
1 1 150
1 2 25
1 4 30
2 1 200
2 3 15
3 4 30
First, I split the data frame foo into a list (also called foo) with one element for each unique value of A. Now, I would like to write a function that a) sums the values of C for each value of A, but b) excludes rows where B == 4; c) the sum is appended as a new column D; and d) C is divided by D to yield a proportion (column E). Ultimately, it would be combined in a new df to look like:
A B C D E
1 1 150 175 0.857
1 2 25 175 0.143
1 4 30 175 0.171
2 1 200 215 0.930
2 3 15 215 0.070
3 4 30 0 0/NA
However, I'm having problems because in some cases, for a given value of A, there are only cases when B == 4 (here, where A == 3), so when I try to divide C by D, I get error messages.
Is there a way to incorporate an if/else statement into the function so that when A is unique and the only possible value of B is 4, the operation is skipped and a default non-zero value is placed in the appended column?
Subsetting the df to exclude cases where B == 4 makes later operations more difficult, but including cases where B == 4 makes the sum/proportion calculations inaccurate.
Any help is appreciated! Here is the current code:
goo <- lapply(foo, function(df){
  df$D <- sum(df$C, na.rm = TRUE)  # sums every row, including those where B == 4
  df$E <- df$C / df$D
  ### .....
  df
})
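For completeness, here is one sketch of how the lapply() approach above could be finished with the requested if/else guard (assuming foo is the split list; groups consisting only of B == 4 rows get the defaults D = 0 and E = NA):
goo <- lapply(foo, function(df) {
  keep <- df$B != 4                  # rows that count toward the sum
  if (any(keep, na.rm = TRUE)) {
    df$D <- sum(df$C[keep], na.rm = TRUE)
    df$E <- df$C / df$D
  } else {
    # only B == 4 rows in this group: skip the division and use defaults
    df$D <- 0
    df$E <- NA_real_
  }
  df
})
newdf <- do.call(rbind, goo)         # recombine into a single data frame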

Here's how I would do it using dplyr:
library(dplyr)
newfoo <- foo %>%
  group_by(A) %>%
  mutate(D = sum(C[B != 4]),
         E = C/D)
#newfoo # the resulting data.frame
#Source: local data frame [6 x 5]
#Groups: A
#
# A B C D E
#1 1 1 150 175 0.85714286
#2 1 2 25 175 0.14285714
#3 1 4 30 175 0.17142857
#4 2 1 200 215 0.93023256
#5 2 3 15 215 0.06976744
#6 3 4 30 0 Inf
Or if you want to avoid Inf, you can use ifelse like this:
newfoo <- foo %>%
  group_by(A) %>%
  mutate(D = sum(C[B != 4]),
         E = ifelse(D == 0, 0, C/D))
#Source: local data frame [6 x 5]
#Groups: A
#
# A B C D E
#1 1 1 150 175 0.85714286
#2 1 2 25 175 0.14285714
#3 1 4 30 175 0.17142857
#4 2 1 200 215 0.93023256
#5 2 3 15 215 0.06976744
#6 3 4 30 0 0.00000000

And a possible data.table solution:
library(data.table)
setDT(foo)[, D := sum(C[B != 4]), by = A][, E := C/D]
# foo
# A B C D E
# 1: 1 1 150 175 0.85714286
# 2: 1 2 25 175 0.14285714
# 3: 1 4 30 175 0.17142857
# 4: 2 1 200 215 0.93023256
# 5: 2 3 15 215 0.06976744
# 6: 3 4 30 0 Inf
Not sure what you want to put into column E when A == 3, but you can use is.finite for it and avoid messing around with ifelse, for example (replacing with a zero):
setDT(foo)[, D := sum(C[B!=4]), by = A][, E := C/D][!is.finite(E), E := 0]

Here is a solution using the base package.
First, ensure that the data are modeled appropriately by converting A into a factor if it is not one already:
df$A <- factor(df$A)
Now, we can compute D using tapply, which iterates groupwise and returns the result as a table. We do this with the subset of df where B != 4.
df$D <- with(subset(df, B != 4), tapply(C, A, sum))[df$A]
Note that since A is a factor, we can index into the table to perform the merge. Now we can use ifelse to compute E:
df$E <- with(df, ifelse(is.na(D), 0, C/D))
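With the example data, these steps should yield something like the following (E rounded to three digits; for A == 3 the subset leaves an empty group, so D is NA and the ifelse sets E to 0):
  A B   C   D     E
1 1 1 150 175 0.857
2 1 2  25 175 0.143
3 1 4  30 175 0.171
4 2 1 200 215 0.930
5 2 3  15 215 0.070
6 3 4  30  NA 0.000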


Replace NA value with next or previous non-NA value conditional on other column

Below is an example data set similar to what I'm working with.
df <- data.frame(Loc = c(rev(seq(-4, 5, 1)), seq(-4, 5, 1)),
                 Reg = c("A", rep(NA, 8), "B", rep(NA, 9), "C"))
In this example we have a string of values running from positive to negative or vice versa (Loc). What I am trying to accomplish is to fill in these NA values, where B is always associated with negative values of Loc, while positive values take on A if the NAs fall between A and B, or C if the NAs fall between B and C.
The desired output should look like the following
df2 <- data.frame(Loc = c(rev(seq(-4, 5, 1)), seq(-4, 5, 1)),
                  Reg = c(rep("A", 6), rep("B", 8), rep("C", 6)))
I have looked into na.locf from the zoo package, but I'm not sure how to control which direction the function looks for the non-NA value in order to get the desired output.
df$Reg2<-ifelse(df$Loc<=0,df$Reg2<-"B",na.locf(df$Reg,fromLast = F))
The above code only returns the right response for some of the rows, depending on the direction (i.e. fromLast = T or F).
Any help on this would be much appreciated.
Use ave, splitting by a grouping variable generated from rleid of the sign of Loc. Then omit the NAs, leaving the single non-NA value in each group, which ave recycles across all values of that group.
library(data.table)
transform(df, Reg = ave(Reg, rleid(Loc >= 0), FUN = na.omit))
giving:
Loc Reg
1 5 A
2 4 A
3 3 A
4 2 A
5 1 A
6 0 A
7 -1 B
8 -2 B
9 -3 B
10 -4 B
11 -4 B
12 -3 B
13 -2 B
14 -1 B
15 0 C
16 1 C
17 2 C
18 3 C
19 4 C
20 5 C
Here is a data.table solution which reproduces OP's expected answer:
library(data.table)
result <- as.data.table(df)[, Reg := first(Reg[!is.na(Reg)]), by = rleid(Loc >= 0)][]
result
Loc Reg
1: 5 A
2: 4 A
3: 3 A
4: 2 A
5: 1 A
6: 0 A
7: -1 B
8: -2 B
9: -3 B
10: -4 B
11: -4 B
12: -3 B
13: -2 B
14: -1 B
15: 0 C
16: 1 C
17: 2 C
18: 3 C
19: 4 C
20: 5 C
identical(as.data.frame(result), df2)
[1] TRUE
Note that this approach is similar to G. Grothendieck's base R solution in that it uses rleid(Loc >= 0) to group the data, but it does not call transform() and ave(); instead it updates Reg by reference, i.e., without copying the whole object.
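To see what that grouping variable looks like: rleid() starts a new id each time the logical condition changes value, so every run of same-signed Loc values becomes one group. A small illustration:
library(data.table)
rleid(c(TRUE, TRUE, FALSE, FALSE, TRUE))
# [1] 1 1 2 2 3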
Here is a quick solution with dplyr:
df <- data.frame(Loc = c(rev(seq(-4, 5, 1)), seq(-4, 5, 1)),
                 Reg = c("A", rep(NA, 8), "B", rep(NA, 9), "C"))
c <- match("C",df$Reg)
a <- match("A",df$Reg)
df2 <- df %>%
  mutate(newReg = case_when(Loc < 0 ~ "B",
                            Loc >= 0 & abs(row_number() - c) < abs(row_number() - a) ~ "C",
                            TRUE ~ "A"))
Note: This is hideous and I am doubtful this is reproducible for more use cases... this is probably better suited for some type of dplyr::case_when function, but I just couldn't think it through at this point.
lapply(2:nrow(df), function(i){
  this_row <- df[i, ]
  last_row <- i - 1
  if (is.na(this_row[['Reg']])) {
    if (this_row[['Loc']] < 0) {
      df[i, 'Reg'] <<- "B"           # negative Loc is always "B"
    } else if (df[last_row, 'Reg'] == "A") {
      df[i, 'Reg'] <<- "A"           # carry "A" forward from the previous row
    } else {
      df[i, 'Reg'] <<- "C"
    }
  }
})
> df
Loc Reg
1 5 A
2 4 A
3 3 A
4 2 A
5 1 A
6 0 A
7 -1 B
8 -2 B
9 -3 B
10 -4 B
11 -4 B
12 -3 B
13 -2 B
14 -1 B
15 0 C
16 1 C
17 2 C
18 3 C
19 4 C
20 5 C

How do I delete rows with NAs and those that follow the NAs?

I have some data where I want to remove the NAs and the data that follows the NAs by the level of a factor.
Removing the NAs is easy:
df <- data.frame(a = c("A","A","A","B","B","B","C","C","C","D","D","D"),
                 b = c(0,1,0,0,0,0,0,1,0,0,0,1),
                 c = c(4,5,3,2,1,5,NA,5,1,6,NA,2))
df
newdf <- df[complete.cases(df), ]; newdf
The final result should remove all of the rows for C and the final two rows of D.
Hope you can help.
We can try with data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)); then, grouped by 'a', take the cumulative sum of the logical vector of NA elements in 'c' and subset on it being less than 1:
library(data.table)
setDT(df)[, .SD[cumsum(is.na(c))<1], by= a]
Or, a faster option: use .I to return the row indices where the logical vector is TRUE and subset with them.
setDT(df)[df[, .I[cumsum(is.na(c)) < 1], by = a]$V1]
# a b c
#1: A 0 4
#2: A 1 5
#3: A 0 3
#4: B 0 2
#5: B 0 1
#6: B 0 5
#7: D 0 6
A classic split-apply-combine in base R:
do.call(rbind,lapply(split(df, df$a),function(x)x[cumsum(is.na(x$c))<1,]))
Here it is again, but in several lines:
split_df <- split(df, df$a)
apply_df <- lapply(split_df, function(x)x[cumsum(is.na(x$c))<1,])
combine_df <- do.call(rbind, apply_df)
The result:
> do.call(rbind,lapply(split(df, df$a),function(x)x[cumsum(is.na(x$c))<1,]))
# a b c
#A.1 A 0 4
#A.2 A 1 5
#A.3 A 0 3
#B.4 B 0 2
#B.5 B 0 1
#B.6 B 0 5
#D D 0 6
A similar solution in dplyr would be
library(dplyr)
df %>% group_by(a) %>% filter(!is.na(cumsum(c)))
Output:
Source: local data frame [7 x 3]
Groups: a [3]
a b c
<fctr> <dbl> <dbl>
1 A 0 4
2 A 1 5
3 A 0 3
4 B 0 2
5 B 0 1
6 B 0 5
7 D 0 6
If we take the cumulative sum of variable C, any values after the first NA will be converted to NA. Performing this at the group level allows us to remove NA rows and get the desired output.
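A quick illustration of that NA propagation on a toy vector:
cumsum(c(4, 5, NA, 2))
# [1]  4  9 NA NA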

R--aggregate like function that ensures interactions of all factor levels

I'm wondering how I can ensure that all interactions of factors are included when using aggregate, even if they don't appear in the given dataset.
dff <- data.frame(a = as.factor(c(rep(1, 3), rep(2, 4), rep(3, 3))),
                  b = as.factor(c(rep("A", 4), rep("B", 6))),
                  c = sample(100, 10))
levels(dff$b) <- c(levels(dff$b), "C")
levels(dff$a) <- c(levels(dff$a), 10)
dff$b
#[1] A A A A B B B B B B
#Levels: A B C
dff$a
#[1] 1 1 1 2 2 2 2 3 3 3
#Levels: 1 2 3 10
aggregate(c~a+b, dff, sum)
# a b c
#1 1 A 233
#2 2 A 78
#3 2 B 212
#4 3 B 73
What I want is:
a b c
1 1 A 233
2 1 B 0
3 1 C 0
4 2 A 78
5 2 B 212
6 2 C 0
7 3 A 0
8 3 B 73
9 3 C 0
10 10 A 0
11 10 B 0
12 10 C 0
NA is fine too.
The reason I want it in this format is that I need to interact dff$c with results from other datasets, and they may be of different lengths if not all factor levels are accounted for. I'm trying to avoid merge and instead use vectorized calculation.
Thank you in advance.
If your aggregation function is just going to be sum, you can simply use xtabs, which creates an object that includes the class "table". As such, you can call data.frame on it, which dispatches to the respective method and creates a "long" data.frame.
data.frame(xtabs(c ~ b + a, dff))
# b a Freq
# 1 A 1 121
# 2 B 1 0
# 3 C 1 0
# 4 A 2 89
# 5 B 2 203
# 6 C 2 0
# 7 A 3 0
# 8 B 3 126
# 9 C 3 0
# 10 A 10 0
# 11 B 10 0
# 12 C 10 0
This is similar to @nicola's suggestion to use as.data.frame.table, which explicitly calls the method on something that is not of class "table" but can be treated as one.
One advantage of this approach (and all the others that follow) is that you can use different functions other than sum.
as.data.frame.table(tapply(dff$c, dff[c("a","b")], sum))
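For example, a sketch with mean swapped in for sum (combinations with no observations come back as NA, which you can replace afterwards if you prefer zeros):
res <- as.data.frame.table(tapply(dff$c, dff[c("a", "b")], mean))
res$Freq[is.na(res$Freq)] <- 0   # optional: zeros instead of NA for empty cells
res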
If merge is OK, you can continue with your aggregate step. In this case, we use expand.grid on the levels of your factor vectors:
merge(expand.grid(lapply(dff[c(1, 2)], levels)),
      aggregate(c ~ a + b, dff, sum, drop = FALSE), all = TRUE)
A similar approach can be taken in "data.table":
library(data.table)
as.data.table(dff)[, sum(c), by = .(a, b)][do.call(CJ, lapply(dff[c(1, 2)], levels)), on = c("a", "b")]
Or using "dplyr" + "tidyr" (which essentially hides the merge, but ultimately uses left_join to create the missing combinations):
library(dplyr)
library(tidyr)
dff %>%
  group_by(a, b) %>%
  summarise(c = sum(c)) %>%
  complete(a, b, fill = list(c = 0))

Selecting rows by offsetting

I have this data frame, let's call it my_df.
It looks like this:
my_df <- data.frame(rnorm(n = 30,sd=.5),rep(c("a","b","c"),each=10))
names(my_df) <- c("num","let")
head(my_df)
num let
1 0.01202600 a
2 1.09025768 a
3 -0.08656178 a
4 -0.04847073 a
5 -0.63750258 a
6 0.58846135 a
What I want to do is select all of the rows where my_df$let == "b", as well as the five rows before the first such row and the five rows after the last such row. So basically my_df[6:25,].
The data I'm actually working with is hundreds of thousands of lines long, and I don't know which rows are which; besides that, the data sets don't match up row-wise, and I can't take the time to go through each set of data individually. I've been using a subset to select the data I want, but I don't know how to select the additional rows outside of the subset (e.g. 1000 rows before and after).
Here's my subset for what I'm doing:
#The following lines separate pXX_NoNegative into individual field sections
p04_HighWeeds <- subset(p04_NoNegative, subset = p04_NoNegative$GS_Field == "High Weeds")
I want to select all of the rows that the above code selects, but I also want 100 rows before that, and 1000 rows after that.
If you need any additional information that may help you please ask.
Here's another idea using dplyr:
library(dplyr)
my_df %>% filter(lead(let == "b", 5) | lag(let == "b", 5))
Or, as per @akrun's suggestion, using the devel version of data.table:
setDT(my_df)[shift(let == "b", 5) | shift(let == "b", type = "lead", 5)]
Which gives:
# num let
#1 0.36723709 a
#2 0.24743170 a
#3 -0.33339924 a
#4 -0.57024317 a
#5 0.03390278 a
#6 -0.43495096 b
#7 -0.85107347 b
#8 0.53048931 b
#9 -0.26739611 b
#10 -0.96029355 b
#11 -0.71737408 b
#12 0.34324685 b
#13 0.12319646 b
#14 0.75207703 b
#15 0.18134006 b
#16 -0.02230777 c
#17 0.42646106 c
#18 -0.11055478 c
#19 0.06013187 c
#20 0.50782158 c
Normally, splitting a data frame into a list of data frames based on some categorization is straightforward; you would use split(my_df, my_df$let) in your case. However, with the added complication that you want some number of rows before or after each category's block, I would operate over the set of unique categorizations, selecting the rows you want in each case:
before <- 5
after <- 5
ret <- setNames(lapply(unique(my_df$let), function(x) {
  positions <- which(my_df$let == x)
  start.pos <- max(1, min(positions) - before)
  end.pos <- min(nrow(my_df), max(positions) + after)
  my_df[start.pos:end.pos, ]
}), unique(my_df$let))
You can grab the observations for any category you want out of the returned list:
ret$b # Also works: ret[["b"]]
# num let
# 6 -0.197901427 a
# 7 0.194607192 a
# 8 -0.107318203 a
# 9 -0.365313233 a
# 10 -0.188926562 a
# 11 0.636272295 b
# 12 -0.058791973 b
# 13 -0.231029510 b
# 14 0.519441716 b
# 15 0.239510912 b
# 16 0.107025658 b
# 17 -0.446644081 b
# 18 0.145052077 b
# 19 -0.426090749 b
# 20 -0.356062993 b
# 21 -0.155012203 c
# 22 -0.007968255 c
# 23 -0.504253089 c
# 24 0.081624303 c
# 25 -0.657008233 c
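A sketch adapting this to the question's own objects and window sizes (p04_NoNegative and GS_Field are the asker's names; like the code above, this assumes the "High Weeds" rows form one contiguous block):
before <- 100
after <- 1000
positions <- which(p04_NoNegative$GS_Field == "High Weeds")
start.pos <- max(1, min(positions) - before)
end.pos <- min(nrow(p04_NoNegative), max(positions) + after)
p04_HighWeeds <- p04_NoNegative[start.pos:end.pos, ]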
I recently answered a nearly identical question: Select n rows after specific number. Adapting the single-segment solution to your data:
set.seed(1); my_df <- data.frame(rnorm(n = 30,sd=.5),rep(c("a","b","c"),each=10));
names(my_df) <- c("num","let");
brange <- range(which(my_df$let=='b'));
my_df$offb <- c((1-brange[1]):-1,rep(0,diff(brange)+1),1:(nrow(my_df)-brange[2]));
my_df;
## num let offb
## 1 -0.313226905 a -10
## 2 0.091821662 a -9
## 3 -0.417814306 a -8
## 4 0.797640401 a -7
## 5 0.164753886 a -6
## 6 -0.410234192 a -5
## 7 0.243714526 a -4
## 8 0.369162353 a -3
## 9 0.287890676 a -2
## 10 -0.152694194 a -1
## 11 0.755890584 b 0
## 12 0.194921618 b 0
## 13 -0.310620290 b 0
## 14 -1.107349944 b 0
## 15 0.562465459 b 0
## 16 -0.022466805 b 0
## 17 -0.008095132 b 0
## 18 0.471918105 b 0
## 19 0.410610598 b 0
## 20 0.296950661 b 0
## 21 0.459488686 c 1
## 22 0.391068150 c 2
## 23 0.037282492 c 3
## 24 -0.994675848 c 4
## 25 0.309912874 c 5
## 26 -0.028064370 c 6
## 27 -0.077897753 c 7
## 28 -0.735376192 c 8
## 29 -0.239075028 c 9
## 30 0.208970780 c 10
subset(my_df,offb>=-5&offb<=5);
## num let offb
## 6 -0.410234192 a -5
## 7 0.243714526 a -4
## 8 0.369162353 a -3
## 9 0.287890676 a -2
## 10 -0.152694194 a -1
## 11 0.755890584 b 0
## 12 0.194921618 b 0
## 13 -0.310620290 b 0
## 14 -1.107349944 b 0
## 15 0.562465459 b 0
## 16 -0.022466805 b 0
## 17 -0.008095132 b 0
## 18 0.471918105 b 0
## 19 0.410610598 b 0
## 20 0.296950661 b 0
## 21 0.459488686 c 1
## 22 0.391068150 c 2
## 23 0.037282492 c 3
## 24 -0.994675848 c 4
## 25 0.309912874 c 5
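The offb column also generalizes to asymmetric windows; for the asker's stated 100-before/1000-after requirement, the same subset becomes (a sketch, assuming offb is built on the real data in the same way):
subset(my_df, offb >= -100 & offb <= 1000);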

Conditional calculation of means of different columns in data.table with R

In the question Multiple aggregation in R with 4 parameters, the calculation of means and medians of vector t was discussed, for each value of vector y (from 1 to 4) where x = 1 and z = 1, using the aggregate function in R.
x y z t
1 1 1 10
1 0 1 15
2 NA 1 14
2 3 0 15
2 2 1 17
2 1 NA 19
3 4 2 18
3 0 2 NA
3 2 2 45
4 3 2 NA
4 1 3 59
5 0 3 0
5 4 3 45
5 4 4 74
5 1 4 86
But how can I calculate (mean(y) + mean(z))/(mean(z) - mean(t)) for each value (from 1 to 5) of vector x, without using values that are 0 or NA in any vector? For example, in vector y the 3rd value is 0, so the 3rd number in every vector (y, z, t) should not be used. And in the result the third row (for x = 3) should be NA.
Here is the code for calculating the means of y, z and t; the formula for (mean(y) + mean(z))/(mean(z) - mean(t)) still needs to be added:
data <- data.table(dataframe)
bar <- data[, .N, by = x]
foo <- data[, list(mean.y = mean(y, na.rm = T),
                   mean.z = mean(z, na.rm = T),
                   mean.t = mean(t, na.rm = T)),
            by = x]
In this code all rows are used for calculating the means, but for calculating (mean(y) + mean(z))/(mean(z) - mean(t)), any row where y, z or t is zero or NA should not be used.
Update:
Oh, this can be further simplified, as data.table drops rows where i evaluates to NA by default (designed with such cases in mind, similar to base::subset). So, you just have to do:
dt[y != 0 & z != 0 & t != 0,
   list(ans = (mean(y) + mean(z))/(mean(z) - mean(t))), by = x]
FWIW, here's how I'd do it in data.table:
dt[(y | NA) & (z | NA) & (t | NA),
   list(ans = (mean(y) + mean(z))/(mean(z) - mean(t))), by = x]
# x ans
# 1: 1 -0.22222222
# 2: 2 -0.18750000
# 3: 3 -0.16949153
# 4: 4 -0.07142857
# 5: 5 -0.10309278
Let's break it down with the general syntax: dt[i, j, by]:
In i, we filter on your conditions using a nice little hack: TRUE | NA is TRUE, FALSE | NA is NA, and NA | NA is NA (you can test these out in your R session).
Since you say you need only the non-zero, non-NA values, it's just a matter of |ing each column with NA, which returns TRUE only for your condition. That settles the subset-by-condition part.
Then for each group in by, we aggregate according to your function, in j, to get the result.
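As a quick sanity check of those identities in an R session (comments note what each result means for the subset in i):
TRUE | NA    # TRUE -> row kept
FALSE | NA   # NA   -> row dropped (data.table drops NA rows in i)
NA | NA      # NA   -> row dropped
5 | NA       # TRUE -> non-zero, non-NA values survive
0 | NA       # NA   -> zeros are dropped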
HTH
Here's one solution:
# create your sample data frame
df <- read.table(text = " x y z t
1 1 1 10
1 0 1 15
2 NA 1 14
2 3 0 15
2 2 1 17
2 1 NA 19
3 4 2 18
3 0 2 NA
3 2 2 45
4 3 2 NA
4 1 3 59
5 0 3 0
5 4 3 45
5 4 4 74
5 1 4 86", header = TRUE)
library('dplyr')
dfmeans <- df %>%
  filter(!is.na(y) & !is.na(z) & !is.na(t)) %>%  # remove rows with NAs
  filter(y != 0 & z != 0 & t != 0) %>%           # remove rows with zeroes
  group_by(x) %>%
  summarize(xmeans = (mean(y) + mean(z)) / (mean(z) - mean(t)))
I'm sure there is a simpler way to remove the rows with NAs and zeroes, but it's not coming to me. Anyway, dfmeans looks like this:
# x xmeans
# 1 1 -0.22222222
# 2 2 -0.18750000
# 3 3 -0.16949153
# 4 4 -0.07142857
# 5 5 -0.10309278
And if you just want the values from xmeans, use dfmeans$xmeans.
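As a follow-up to the "simpler way" remark above: the two filter() calls can be collapsed into one. For instance, a sketch assuming a recent dplyr (>= 1.0.4, for if_all()):
dfmeans <- df %>%
  filter(if_all(c(y, z, t), ~ !is.na(.x) & .x != 0)) %>%  # drop NA and zero rows in one pass
  group_by(x) %>%
  summarize(xmeans = (mean(y) + mean(z)) / (mean(z) - mean(t)))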
