How to adapt an ifelse statement to NA values in R?

I am trying to create a new column conditionally: if the value in column A equals 1, the value is copied from column B, otherwise from column C. When A is NA, the condition does not work. I don't want to drop the NA rows; instead:
If A contains NA, the value has to be taken from column C.
df <- data.frame(A = c(1,1,NA,2,2,2),
                 B = c(10,20,30,40,25,45),
                 C = c(11,23,33,45,56,13))
# If A is 1 take the score from B, otherwise from C
df$D1 <- ifelse(df$A == 1, df$B, df$C)
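The reason the plain ifelse() fails is worth spelling out: comparing NA with == yields NA, not FALSE, and ifelse() propagates an NA condition instead of choosing either branch. A minimal illustration:

```r
# Comparing NA to anything yields NA, not FALSE:
NA == 1
#> [1] NA

# ifelse() returns NA wherever its condition is NA,
# so the NA row of A produces NA in D1 instead of taking C:
ifelse(c(1, NA) == 1, "yes", "no")
#> [1] "yes" NA
```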
Expected output: D1 = c(10, 20, 33, 45, 56, 13)

Use a Boolean operation here:
df$D1 <- ifelse(df$A == 1 & !is.na(df$A), df$B, df$C)
An alternative is a nested ifelse:
df$D1 <- ifelse(is.na(df$A), df$C, ifelse(df$A == 1, df$B, df$C))

Using case_when:
library(dplyr)
df %>%
  mutate(D1 = case_when(A %in% 1 ~ B, TRUE ~ C))
A B C D1
1 1 10 11 10
2 1 20 23 20
3 NA 30 33 33
4 2 40 45 45
5 2 25 56 56
6 2 45 13 13
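The trick here is that %in%, unlike ==, never returns NA, so the NA row falls through to the TRUE ~ C default. A quick check:

```r
# match() finds no hit for NA in the table, so %in% reports FALSE rather than NA:
NA %in% 1
#> [1] FALSE

c(1, NA, 2) %in% 1
#> [1]  TRUE FALSE FALSE
```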

Same idea as @Mohanasundaram, but implemented in a dplyr chain:
library(dplyr)
df %>%
  mutate(D1 = ifelse(A == 1 & !is.na(A), B, C))
Output:
A B C D1
1 1 10 11 10
2 1 20 23 20
3 NA 30 33 33
4 2 40 45 45
5 2 25 56 56
6 2 45 13 13

Another option is to use dplyr::if_else(), which has a missing argument to handle the NA values. A further advantage of dplyr::if_else() is that it checks that true and false are of the same type.
dplyr::mutate(
  .data = df,
  D1 = dplyr::if_else(
    condition = A == 1, true = B, false = C, missing = C
  )
)
Output:
A B C D1
1 1 10 11 10
2 1 20 23 20
3 NA 30 33 33
4 2 40 45 45
5 2 25 56 56
6 2 45 13 13
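To see that type check in action (behaviour of current dplyr versions; base ifelse() coerces silently where if_else() raises an error):

```r
library(dplyr)

# base ifelse() silently coerces the mixed result to character:
ifelse(c(TRUE, FALSE), 1, "a")
#> [1] "1" "a"

# if_else() refuses incompatible types for true/false:
tryCatch(
  if_else(c(TRUE, FALSE), 1, "a"),
  error = function(e) "if_else() rejected the incompatible types"
)
```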

Related

Filter a grouped variable from a dataset based on the range values of another dataset using dplyr

I want to take the values of a (large) data frame:
library(tidyverse)
df.grid = expand.grid(x = letters, y = 1:60)
head(df.grid)
x y
1 a 1
2 b 1
3 c 1
4 d 1
5 e 1
6 f 1
[...]
Which eventually reaches a 2, a 3, etc.
And I have a second data frame that, for some of those x values, specifies a range (min, max), which differs for each x:
sub.data = data.frame(x = c("a","c","d"), min = c(2,50,25), max = c(6,53,30))
sub.data
x min max
1 a 2 6
2 c 50 53
3 d 25 30
The output should look like something like this:
x y
1 a 2
2 a 3
3 a 4
4 a 5
5 a 6
6 c 50
7 c 51
8 c 52
9 c 53
10 d 25
11 d 26
12 d 27
13 d 28
14 d 29
15 d 30
I've tried this:
df.grid %>%
  group_by(x) %>%
  filter_if(y > sub.data$min)
But it doesn't work as the min column has multiple values and the 'if' part complains.
I also found this post, but it doesn't seem to work for me as there is no 'matching' variables to guide the filtering process.
I want to avoid using for loops since I want to apply this to a data frame that is 11GB in size.
We could use a non-equi join
library(data.table)
setDT(df.grid)[, y1 := y][sub.data, .(x, y), on = .(x, y1 >= min, y1 <= max)]
-output
x y
1: a 2
2: a 3
3: a 4
4: a 5
5: a 6
6: c 50
7: c 51
8: c 52
9: c 53
10: d 25
11: d 26
12: d 27
13: d 28
14: d 29
15: d 30
With dplyr (>= 1.1.0), we could also use non-equi joins with join_by():
library(dplyr)
inner_join(df.grid, sub.data, by = join_by(x, y >= min, y <= max)) %>%
  select(x, y)
-output
x y
1 a 2
2 a 3
3 a 4
4 a 5
5 a 6
6 d 25
7 d 26
8 d 27
9 d 28
10 d 29
11 d 30
12 c 50
13 c 51
14 c 52
15 c 53
Or, as @Davis Vaughan mentioned, use between() with a left_join():
left_join(sub.data, df.grid,
          by = join_by(x, between(y$y, x$min, x$max))) %>%
  select(names(df.grid))
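For completeness, a base R sketch of the same non-equi join, fine for moderate sizes (for the 11 GB case the data.table non-equi join above is the safer bet): join on x first, then keep the rows whose y falls inside the per-x range. The intermediate merge only expands rows whose x appears in sub.data.

```r
df.grid <- expand.grid(x = letters, y = 1:60)
sub.data <- data.frame(x = c("a","c","d"), min = c(2,50,25), max = c(6,53,30))

# inner join on x, then filter each row against its own range
m   <- merge(df.grid, sub.data, by = "x")
res <- subset(m, y >= min & y <= max, select = c(x, y))
res <- res[order(res$x, res$y), ]
```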

Create "row" from first non-NA value in an R data frame

I want to create a "row" containing the first non-NA value that appears in a data frame. So for example, given this test data frame:
test.df <- data.frame(a = c(11,12,13,14,15,16),
                      b = c(NA,NA,23,24,25,26),
                      c = c(31,32,33,34,35,36),
                      d = c(NA,NA,NA,NA,45,46))
test.df
a b c d
1 11 NA 31 NA
2 12 NA 32 NA
3 13 23 33 NA
4 14 24 34 NA
5 15 25 35 45
6 16 26 36 46
I know that I can detect the first appearance of a non-NA like this:
first.appearance <- as.numeric(sapply(test.df, function(col) min(which(!is.na(col)))))
first.appearance
[1] 1 3 1 5
This tells me that the first element in column 1 is not NA, the third element in column 2 is not NA, the first element in column 3 is not NA, and the fifth element in column 4 is not NA. But when I put the pieces together, it yields this (which is logical, but not what I want):
> test.df[first.appearance,]
a b c d
1 11 NA 31 NA
3 13 23 33 NA
1.1 11 NA 31 NA
5 15 25 35 45
I would like the output to be the first non-NA in each column. What is a base or dplyr way to do this? I am not seeing it. Thanks in advance.
a b c d
1 11 23 31 45
We can use
library(dplyr)
test.df %>%
  slice(first.appearance) %>%
  summarise_all(~ first(.[!is.na(.)]))
# a b c d
#1 11 23 31 45
Or it can be
test.df %>%
  summarise_all(~ min(na.omit(.)))
# a b c d
#1 11 23 31 45
Or with colMins (here the column minima happen to coincide with the first non-NA values):
library(matrixStats)
colMins(as.matrix(test.df), na.rm = TRUE)
#[1] 11 23 31 45
You can use:
library(tidyverse)
test.df %>% fill(everything(), .direction = "up") %>% head(1)
a b c d
<dbl> <dbl> <dbl> <dbl>
1 11 23 31 45
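Since the question also asks for a base way, a one-liner without any packages: drop the NAs in each column and take the first remaining element.

```r
test.df <- data.frame(a = c(11,12,13,14,15,16), b = c(NA,NA,23,24,25,26),
                      c = c(31,32,33,34,35,36), d = c(NA,NA,NA,NA,45,46))

# remove the NAs from each column, then take the first survivor
sapply(test.df, function(col) col[!is.na(col)][1])
#>  a  b  c  d
#> 11 23 31 45
```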

Assigning values to a new column based on a condition between two dataframes

I have two dataframes. I need to add the value of one column to every row in the other dataframe where the values of a particular column meet a condition from the first dataframe.
df1:
a b
x 23
s 34
v 15
g 05
k 69
df2:
x y z
1 0 10
2 10 20
3 20 30
4 30 40
5 40 50
6 50 60
7 60 70
Desired output:
a b n
x 23 3
s 34 4
v 15 2
g 05 1
k 69 7
In my dataset the intervals are large, and it's unlikely that a value from df1 is exactly on the boundary of a df2 interval.
Essentially, for every row in df1 I need to assign the number corresponding to the range it falls into in df2. So if df1$b is between df2$y and df2$z, assign the matching value of df2$x to output$n. This is quite a wordy question, so please ask if I need to clarify.
df1 = read.table(text = "
a b
x 23
s 34
v 15
g 05
k 69
", header=T, stringsAsFactors=F)
df2 = read.table(text = "
x y z
1 0 10
2 10 20
3 20 30
4 30 40
5 40 50
6 50 60
7 60 70
", header=T, stringsAsFactors=F)
# function
f = function(x) min(which(x >= df2$y & x <= df2$z))
f = Vectorize(f)
# apply function
df1$n = f(df1$b)
# check updated dataset
df1
# a b n
# 1 x 23 3
# 2 s 34 4
# 3 v 15 2
# 4 g 5 1
# 5 k 69 7
You can try:
library(tidyverse)
df1 %>%
  rowwise() %>%
  mutate(n = df2[b > df2$y & b <= df2$z, 1]) %>%
  ungroup()
# A tibble: 5 x 3
a b n
<chr> <int> <int>
1 x 23 3
2 s 34 4
3 v 15 2
4 g 5 1
5 k 69 7
As already commented, change < or > to <= or >= according to your needs.
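Because the ranges in df2 are contiguous and sorted, the lower bounds alone determine the bin, so base R's findInterval() can do the whole lookup in one vectorised call (no rowwise() needed). A sketch on the question's data:

```r
df1 <- data.frame(a = c("x","s","v","g","k"), b = c(23, 34, 15, 5, 69))
df2 <- data.frame(x = 1:7, y = seq(0, 60, by = 10), z = seq(10, 70, by = 10))

# each b is matched to the interval whose lower bound it last passed
df1$n <- findInterval(df1$b, df2$y)
df1$n
#> [1] 3 4 2 1 7
```

This only works when the intervals tile the range without gaps, as they do here.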

selecting middle n rows in R

I have a data.table in R, say df:
row.number <- c(1:20)
a <- c(rep("A", 10), rep("B", 10))
b <- c(sample(c(0:100), 20, replace = TRUE))
df <-data.table(row.number,a,b)
df
row.number a b
1 1 A 14
2 2 A 59
3 3 A 39
4 4 A 22
5 5 A 75
6 6 A 89
7 7 A 11
8 8 A 88
9 9 A 22
10 10 A 6
11 11 B 37
12 12 B 42
13 13 B 39
14 14 B 8
15 15 B 74
16 16 B 67
17 17 B 18
18 18 B 12
19 19 B 56
20 20 B 21
I want to take n rows (say 10) from the middle after arranging the records in increasing order of column b.
Use setorder to sort and .N to filter:
setorder(df, b)[(.N/2 - 10/2):(.N/2 + 10/2 - 1), ]
row.number a b
1: 11 B 36
2: 5 A 38
3: 8 A 41
4: 18 B 43
5: 1 A 50
6: 12 B 51
7: 15 B 54
8: 3 A 55
9: 20 B 59
10: 4 A 60
You could use the following code
library(data.table)
set.seed(9876) # for reproducibility
# your data
row.number <- c(1:20)
a <- c(rep("A", 10), rep("B", 10))
b <- c(sample(c(0:100), 20, replace = TRUE))
df <- data.table(row.number,a,b)
df
# define how many to select and store in n
n <- 10
# calculate how many to cut off at each end
n_not <- (nrow(df) - n) / 2
# use data.table's setorder to arrange based on column b
setorder(df, b)
# select the rows wanted based on n
df[(n_not + 1):(nrow(df) - n_not), ]
Please let me know whether this is what you want.
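An equivalent dplyr sketch (using a plain data frame for brevity; the arithmetic assumes nrow(df) - n is even, as in the question):

```r
library(dplyr)

set.seed(9876)
df <- data.frame(row.number = 1:20,
                 a = rep(c("A", "B"), each = 10),
                 b = sample(0:100, 20, replace = TRUE))

n     <- 10
n_not <- (nrow(df) - n) / 2          # rows to drop at each end

middle <- df %>%
  arrange(b) %>%                     # sort by b, ascending
  slice((n_not + 1):(n() - n_not))   # keep the middle n rows
```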

How to use an apply-like function on a data frame?

I have a dataframe with columns A, B and C.
I want to apply a function to each row of the dataframe that checks the values of row$A and row$B and updates row$C based on them. How can I achieve that?
Example:
A B C
1 1 10 10
2 2 20 20
3 NA 30 30
4 NA 40 40
5 5 50 50
Now I want to set C to B/2 in every row where A is NA.
So the dataframe after changes would look like:
A B C
1 1 10 10
2 2 20 20
3 NA 30 15
4 NA 40 20
5 5 50 50
I would like to know if this can be done without using a for loop.
If you want to update the column by reference (without copying the whole data set when updating the column), you could try data.table:
library(data.table)
setDT(dat)[is.na(A), C := B/2]
dat
# A B C
# 1: 1 10 10
# 2: 2 20 20
# 3: NA 30 15
# 4: NA 40 20
# 5: 5 50 50
Edit:
Regarding @arun's comment: checking the address before and after the change confirms the column was still updated by reference.
library(pryr)
address(dat$C)
## [1] "0x2f85a4f0"
setDT(dat)[is.na(A), C := B/2]
address(dat$C)
## [1] "0x2f85a4f0"
Try this:
your_data <- within(your_data, C[is.na(A)] <- B[is.na(A)] / 2)
Try
indx <- is.na(df$A)
df$C[indx] <- df$B[indx]/2
df
# A B C
#1 1 10 10
#2 2 20 20
#3 NA 30 15
#4 NA 40 20
#5 5 50 50
Here is a simple example using library(dplyr).
Fictional dataset:
df <- data.frame(a=c(1, NA, NA, 2), b=c(10, 20, 50, 50))
You want to change just those rows where a is NA, so you can use ifelse:
df <- mutate(df, c=ifelse(is.na(a), b/2, b))
Another approach:
dat <- transform(dat, C = B / 2 * (i <- is.na(A)) + C * !i)
# A B C
# 1 1 10 10
# 2 2 20 20
# 3 NA 30 15
# 4 NA 40 20
# 5 5 50 50
Try:
> ddf$C = with(ddf, ifelse(is.na(A), B/2, C))
>
> ddf
A B C
1 1 10 10
2 2 20 20
3 NA 30 15
4 NA 40 20
5 5 50 50
