Set NA to 0 in R

After merging a data frame with another, I'm left with random NAs for the occasional row. I'd like to set these NAs to 0 so I can perform calculations with them.
I'm trying to do this with:
bothbeams.data = within(bothbeams.data, {
  bothbeams.data$x.x = ifelse(is.na(bothbeams.data$x.x) == TRUE, 0, bothbeams.data$x.x)
  bothbeams.data$x.y = ifelse(is.na(bothbeams.data$x.y) == TRUE, 0, bothbeams.data$x.y)
})
Where $x.x is one column and $x.y is the other of course, but this doesn't seem to work.

You can just use the output of is.na to replace directly with subsetting:
bothbeams.data[is.na(bothbeams.data)] <- 0
Or with a reproducible example:
dfr <- data.frame(x=c(1:3,NA),y=c(NA,4:6))
dfr[is.na(dfr)] <- 0
dfr
x y
1 1 0
2 2 4
3 3 5
4 0 6
However, be careful using this method on a data frame containing factors that also have missing values:
> d <- data.frame(x = c(NA,2,3),y = c("a",NA,"c"))
> d[is.na(d)] <- 0
Warning message:
In `[<-.factor`(`*tmp*`, thisvar, value = 0) :
invalid factor level, NA generated
It "works":
> d
x y
1 0 a
2 2 <NA>
3 3 c
...but you will likely want to alter only the numeric columns in this case, rather than the whole data frame. See, e.g., the answer below using dplyr::mutate_if.
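A minimal base R sketch of that idea, starting from a fresh copy of the d example (so the earlier replacement hasn't already altered it):
d <- data.frame(x = c(NA, 2, 3), y = c("a", NA, "c"))
# replace NAs only where the column is numeric; factor/character columns are untouched
num_cols <- sapply(d, is.numeric)
d[num_cols] <- lapply(d[num_cols], function(x) replace(x, is.na(x), 0))
d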

A solution using mutate_all from dplyr, in case you want to add that to your dplyr pipeline (note that funs() is deprecated in newer versions of dplyr; see the across() update below):
library(dplyr)
df %>%
  mutate_all(funs(ifelse(is.na(.), 0, .)))
Result:
A B C
1 0 0 0
2 1 0 0
3 2 0 2
4 3 0 5
5 0 0 2
6 0 0 1
7 1 0 1
8 2 0 5
9 3 0 2
10 0 0 4
11 0 0 3
12 1 0 5
13 2 0 5
14 3 0 0
15 0 0 1
If you only want to replace the NAs in numeric columns, as is often the case when preparing data for modeling, you can use mutate_if:
library(dplyr)
df %>%
  mutate_if(is.numeric, funs(ifelse(is.na(.), 0, .)))
or in base R, restricted to the numeric columns:
num <- sapply(df, is.numeric)
df[num] <- replace(df[num], is.na(df[num]), 0)
Result:
A B C
1 0 0 0
2 1 <NA> 0
3 2 0 2
4 3 <NA> 5
5 0 0 2
6 0 <NA> 1
7 1 0 1
8 2 <NA> 5
9 3 0 2
10 0 <NA> 4
11 0 0 3
12 1 <NA> 5
13 2 0 5
14 3 <NA> 0
15 0 0 1
Update
With dplyr 1.0.0, across() was introduced:
library(dplyr)
# Replace `NA` for all columns
df %>%
  mutate(across(everything(), ~ ifelse(is.na(.), 0, .)))
# Replace `NA` for numeric columns
df %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), 0, .)))
Data:
set.seed(123)
df <- data.frame(A = rep(c(0:3, NA), 3),
                 B = rep(c("0", NA), length.out = 15),
                 C = sample(c(0:5, NA), 15, replace = TRUE))

You can use replace_na() from the tidyr package:
df %>% replace_na(list(column1 = 0, column2 = 0))
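replace_na() also accepts the data frame directly, so it can be used without a pipe; a small sketch on the dfr example from above:
library(tidyr)
dfr <- data.frame(x = c(1:3, NA), y = c(NA, 4:6))
replace_na(dfr, list(x = 0, y = 0))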

To add to James's example, it seems you often have to create an intermediate when performing calculations on NA-containing data frames.
For instance, adding two columns (A and B) together from a data frame dfr:
temp.df <- data.frame(dfr) # copy the original
temp.df[is.na(temp.df)] <- 0
dfr$C <- temp.df$A + temp.df$B # or any other calculation
remove('temp.df')
When I do this I throw away the intermediate afterwards with remove/rm.
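Depending on the calculation, the intermediate can sometimes be avoided by zero-filling inline; a sketch with the same hypothetical A and B columns:
# zero-fill each operand inline instead of copying the whole data frame
dfr$C <- ifelse(is.na(dfr$A), 0, dfr$A) + ifelse(is.na(dfr$B), 0, dfr$B)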

If you only want to replace NAs with 0s for a few select columns, you can also use an lapply solution, e.g.:
data <- data.frame(
  one = c(NA, 0),
  two = c(NA, NA),
  three = c(1, 2),
  four = c("A", NA)
)
data[1:2] <- lapply(data[1:2], function(x) {
  x[is.na(x)] <- 0
  return(x)
})
data

Why not try this:
na.zero <- function(x) {
  x[is.na(x)] <- 0
  return(x)
}
df <- na.zero(df)
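As with the direct subsetting approach above, applying na.zero to a whole data frame that contains factor columns will trigger the invalid-factor-level warning, so applying it per column is safer; for example, on the question's columns:
bothbeams.data$x.x <- na.zero(bothbeams.data$x.x)
bothbeams.data$x.y <- na.zero(bothbeams.data$x.y)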

Related

Iterating over columns to create flagging variables

I've got a dataset that has a lot of numerical columns (in the example below these columns are x, y, z). I want to create individual flagging variables for each of those columns (x_YN, y_YN, z_YN) such that, if the numerical column is > 0, the flagging variable is = 1 and otherwise it's = 0. What might be the most efficient way to tackle this?
Thanks for the help!
x <- c(3, 7, 0, 10)
y <- c(5, 2, 20, 0)
z <- c(0, 0, 4, 12)
df <- data.frame(x,y,z)
We may use a logical matrix and coerce it to 0/1 with the unary + operator:
df[paste0(names(df), "_YN")] <- +(df > 0)
Output:
> df
x y z x_YN y_YN z_YN
1 3 5 0 1 1 0
2 7 2 0 1 1 0
3 0 20 4 0 1 1
4 10 0 12 1 0 1
The dplyr alternative:
library(dplyr)
df %>%
  mutate(across(everything(), ~ +(.x > 0), .names = "{col}_YN"))
Output:
x y z x_YN y_YN z_YN
1 3 5 0 1 1 0
2 7 2 0 1 1 0
3 0 20 4 0 1 1
4 10 0 12 1 0 1

How to calculate values for the first row that meets a certain condition?

I have the following dummy dataframe:
t <- data.frame(
  a = c(0, 0, 2, 4, 5),
  b = c(0, 0, 4, 6, 5))
a b
0 0
0 0
2 4
4 6
5 5
I want to replace just the first value that is not zero in column b. Imagine that the row that meets this criterion is i. I want to replace t$b[i] with t$b[i+2] + t$b[i+1], and the rest of t$b should remain the same. So the output would be:
a b
0 0
0 0
2 11
4 6
5 5
In fact, the dataset is dynamic, so I cannot directly point to a specific row; it has to meet the criterion of being the first row not equal to zero in column b.
How can I create this new t$b?
Here is a straightforward solution in base R:
t <- data.frame(
  a = c(0, 0, 2, 4, 5),
  b = c(0, 0, 4, 6, 5))
ind <- which(t$b > 0)[1L]
t$b[ind] <- t$b[ind+2L] + t$b[ind+1L]
t
a b
1 0 0
2 0 0
3 2 11
4 4 6
5 5 5
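Since the dataset is dynamic, it may be worth guarding against edge cases: which() returns NA here when no b is non-zero, and ind + 2 can run past the last row. A small sketch (leaving b unchanged in those cases is an assumption):
# only replace when a match exists and two following rows are available
if (!is.na(ind) && ind + 2L <= nrow(t)) {
  t$b[ind] <- t$b[ind + 2L] + t$b[ind + 1L]
}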
Here is a roundabout way of getting there with a combination of group_by() and mutate():
library(tidyverse)
t %>%
  mutate(
    b_cond = b != 0,
    row_number = row_number()
  ) %>%
  group_by(b_cond) %>%
  mutate(
    min_row_number = row_number == min(row_number),
    b = if_else(b_cond & min_row_number, lead(b, 1) + lead(b, 2), b)
  ) %>%
  ungroup() %>%
  select(a, b) # optional, to get back to original columns
# A tibble: 5 × 2
      a     b
  <dbl> <dbl>
1     0     0
2     0     0
3     2    11
4     4     6
5     5     5

Removing a group with a conditional statement in R

The condition is that, within any event, if there are two or more consecutive rows with values higher than 1, the whole group should be deleted.
For example:
Event<- c(1,1,1,1,2,2,2,2,2,2,3,3,3,3,3)
Value<- c(1,0,0,0,8,7,1,0,0,0,8,0,0,0,0)
A<- data.frame(Event, Value)
Event Value
1 1
1 0
1 0
1 0
2 8
2 7
2 1
2 0
2 0
2 0
3 8
3 0
3 0
3 0
3 0
In this example the group for event 2 should be deleted because it has two consecutive rows with values higher than 1. So it should look like:
Event Value
1 1
1 0
1 0
1 0
3 8
3 0
3 0
3 0
3 0
Any suggestion?
We can use rle() by groups: rle(Value > 1) encodes runs of TRUE/FALSE, lengths[values] extracts the lengths of the TRUE runs, and any TRUE run longer than 1 means two or more consecutive values above 1.
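To see what the condition checks, here is the run-length encoding for Event 2's values:
rle(c(8, 7, 1, 0, 0, 0) > 1)
# Run Length Encoding
#   lengths: int [1:2] 2 4
#   values : logi [1:2] TRUE FALSE
# lengths[values] is 2, i.e. a run of two consecutive values above 1
Applied per group: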
library(dplyr)
A %>%
  group_by(Event) %>%
  filter(!any(with(rle(Value > 1), lengths[values] > 1)))
#Opposite way using all
#filter(all(with(rle(Value > 1), lengths[values] < 2)))
# Event Value
# <dbl> <dbl>
#1 1 1
#2 1 0
#3 1 0
#4 1 0
#5 3 8
#6 3 0
#7 3 0
#8 3 0
#9 3 0
The same logic can be used in base R :
subset(A, !ave(Value > 1, Event, FUN = function(x)
any(with(rle(x), lengths[values] > 1))))
as well as data.table
library(data.table)
setDT(A)[, .SD[!any(with(rle(Value > 1), lengths[values] > 1))], Event]
Using dplyr, another option is to flag rows whose previous value is also greater than 1:
A %>%
  group_by(Event) %>%
  filter(!any(Value > 1 & lag(Value, default = 0) > 1)) %>%
  ungroup()
A base R approach:
# split the values by event into a list, recording whether each value is > 1 (TRUE/FALSE)
A_split <- split(A$Value > 1, A$Event)
# for each event, count adjacent pairs that are both TRUE;
# keep only the events where no such pair exists
keep <- sapply(A_split, function(x) sum(x[-length(x)] * x[-1])) == 0
# convert keep to the numeric vector of Event values to retain
keep <- as.numeric(names(keep)[keep])
# subset A based on the keep vector
A[A$Event %in% keep, ]

Add X number of columns to a data.frame

I would like to add a varying number (X) of columns with 0 to an existing data.frame within a function.
Here is an example data.frame:
dt <- data.frame(x=1:3, y=4:6)
I would like to get this result if X=1 :
a x y
1 0 1 4
2 0 2 5
3 0 3 6
And this if X=3 :
a b c x y
1 0 0 0 1 4
2 0 0 0 2 5
3 0 0 0 3 6
What would be an efficient way to do this?
We can assign multiple columns of 0 based on the value of X, and then reorder so the new columns come first:
X <- 3
nm1 <- names(dt)
dt[letters[seq_len(X)]] <- 0
dt[c(setdiff(names(dt), nm1), nm1)]
Also, we can use add_column from tibble to create columns at a specific location:
library(tibble)
add_column(dt, .before = 1, !!!setNames(as.list(rep(0, X)),
                                        letters[seq_len(X)]))
A second option is cbind
f <- function(x, n = 3) {
  cbind.data.frame(matrix(
    0,
    ncol = n,
    nrow = nrow(x),
    dimnames = list(NULL, letters[1:n])
  ), x)
}
f(dt, 5)
# a b c d e x y
#1 0 0 0 0 0 1 4
#2 0 0 0 0 0 2 5
#3 0 0 0 0 0 3 6
NOTE: because letters has a length of 26 the function would need some adjustment regarding the naming scheme if n > 26.
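One possible adjustment (a sketch; the V-prefixed fallback names are an arbitrary choice):
# fall back to generated names once the 26 single letters run out
col_names <- if (n <= 26) letters[seq_len(n)] else paste0("V", seq_len(n))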
You can try the code below:
dt <- cbind(`colnames<-`(t(rep(0, X)), letters[seq(X)]), dt)
If you don't care about the names of the added columns, you can simply use
dt <- cbind(t(rep(0, X)), dt)
which is much shorter.

Propagate changes down a column

I would like to use dplyr to go through a dataframe row by row, and if A == 0, then set B to the value of B in the previous row, otherwise leave it unchanged. However, I want "the value of B in the previous row" to refer to the previous row during the computation, not before the computation began, because the value may have changed -- in other words, I'd like changes to propagate downwards. For example, with the following data:
dat <- data.frame(A=c(1,0,0,0,1),B=c(0,1,1,1,1))
A B
1 0
0 1
0 1
0 1
1 1
I would like the result of the computation to be:
result <- data.frame(A=c(1,0,0,0,1),B=c(0,0,0,0,1))
A B
1 0
0 0
0 0
0 0
1 1
If I use something like result <- dat %>% mutate(B = ifelse(A == 0, lag(B), B)), then changes won't propagate downwards: result$B will be equal to c(0,0,1,1,1), not c(0,0,0,0,1).
More generally, how do you use dplyr::mutate to create a column that depends on itself (as it updates during the computation, not a copy of what it was before)?
Seems like you want a "last observation carried forward" approach. The most common R implementation is zoo::na.locf which fills in NA values with the last observation. All we need to do to use it in this case is to first set to NA all the B values that we want to fill in:
mutate(dat,
       B = ifelse(A == 0, NA, B),
       B = zoo::na.locf(B))
# A B
# 1 1 0
# 2 0 0
# 3 0 0
# 4 0 0
# 5 1 1
As to my comment, do note that the only thing mutate does is add the column to the data frame. We could do it just as well without mutate:
result = dat
result$B = with(result, ifelse(A == 0, NA, B))
result$B = zoo::na.locf(result$B)
Whether you use mutate or [ or $ or any other method to access/add the columns is tangential to the problem.
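For reference, the carry-forward itself can also be done in base R without zoo, by indexing the non-NA values with a cumulative count. A sketch that assumes the first value is not NA after masking (true here, since A[1] == 1):
B <- ifelse(dat$A == 0, NA, dat$B)
idx <- cumsum(!is.na(B))   # position of the last non-NA value seen so far
dat$B <- B[!is.na(B)][idx]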
We could use fill from tidyr after changing the 'B' values to NA that corresponds to 0 in 'A'
library(dplyr)
library(tidyr)
dat %>%
  mutate(B = NA^(!A) * B) %>% # NA^1 is NA when A == 0; NA^0 is 1 when A == 1
  fill(B)
# A B
#1 1 0
#2 0 0
#3 0 0
#4 0 0
#5 1 1
NOTE: By default, the .direction (argument in fill) is "down", but it can also take "up" i.e. fill(B, .direction="up")
Here's a solution using grouping and rleid (run length encoding id) from data.table. I think it should be faster than the zoo solution, since zoo relies on doing multiple revs and a cumsum, and rleid is blazing fast.
Basically, we only want the last value of the previous group, so we create a grouping variable based on the diff vector of the rleid and add that to the rleid if A == 1. Then we group and take the first B value of the group for every case where A == 0.
library(dplyr)
library(data.table)
dat <- data.frame(A=c(1,0,0,0,1),B=c(0,1,1,1,1))
dat <- dat %>%
  mutate(grp = data.table::rleid(A),
         grp = ifelse(A == 1, grp + c(diff(grp), 0), grp)) %>%
  group_by(grp) %>%
  mutate(B = ifelse(A == 0, B[1], B)) # EDIT: always carry forward B when A == 0
dat
Source: local data frame [5 x 3]
Groups: grp [2]
A B grp
<dbl> <dbl> <dbl>
1 1 0 2
2 0 0 2
3 0 0 2
4 0 0 2
5 1 1 3
EDIT: Here's an example with a longer dataset so we can really see the behavior. (Also, the condition was switched: it should be if all A != 1, not if not all A == 1.)
set.seed(30)
dat <- data.frame(A = sample(0:1, 15, replace = TRUE),
                  B = sample(0:1, 15, replace = TRUE))
> dat
A B
1 0 1
2 0 0
3 0 1
4 0 1
5 0 0
6 0 0
7 1 1
8 0 0
9 1 0
10 0 0
11 0 0
12 0 0
13 1 0
14 1 1
15 0 0
Result:
Source: local data frame [15 x 3]
Groups: grp [5]
A B grp
<int> <int> <dbl>
1 0 1 1
2 0 1 1
3 0 1 1
4 0 1 1
5 0 1 1
6 0 1 1
7 1 1 3
8 0 1 3
9 1 0 5
10 0 0 5
11 0 0 5
12 0 0 5
13 1 0 6
14 1 1 7
15 0 1 7
