Replace NaN with 0 in R [duplicate]

I tried to replace NaN values with zeros using the following script:
rapply( data123, f=function(x) ifelse(is.nan(x),0,x), how="replace" )
# [31] 0.00000000 -0.67994832 0.50287454 0.63979527 1.48410571 -2.90402836
The NaN value was shown as zero, but when I typed the name of the data frame to review it, the value was still NaN.
data123$contri_us
# [31] NaN -0.67994832 0.50287454 0.63979527 1.48410571 -2.90402836
I am not sure whether the rapply command actually applied the adjustment to the data frame or only displayed the replaced values.
Any idea how to actually change the NaN value to zero?

It would seem that is.nan doesn't actually have a method for data frames, unlike is.na. So, let's fix that!
is.nan.data.frame <- function(x)
  do.call(cbind, lapply(x, is.nan))
data123[is.nan(data123)] <- 0
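As for why the original attempt only appeared to work: rapply() returns a modified copy and leaves data123 untouched, so the result has to be assigned back. A minimal sketch, using a small made-up stand-in for data123 (the id column, the values, and the helper name zero_nan are assumptions for illustration):

```r
# Hypothetical stand-in for the asker's data123
data123 <- data.frame(id = c("a", "b", "c"),
                      contri_us = c(NaN, -0.68, 0.50))

# Hypothetical helper: turn NaN into 0, keep everything else
zero_nan <- function(x) ifelse(is.nan(x), 0, x)

# rapply() returns a modified copy; data123 itself is not changed by this call
res <- rapply(data123, zero_nan, classes = "numeric", how = "replace")
any(is.nan(data123$contri_us))   # the original still contains NaN

# Assigning into data123[] stores the result and keeps the data frame structure
data123[] <- res
data123$contri_us
```

Restricting the call with classes = "numeric" leaves character or factor columns alone, which avoids ifelse() silently changing their type.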

In fact, in R, this operation is very easy:
If the matrix 'a' contains some NaN, you just need to use the following code to replace it by 0:
a <- matrix(c(1, NaN, 2, NaN), ncol=2, nrow=2)
a[is.nan(a)] <- 0
a
If the data frame 'b' contains some NaN, you just need to use the following code to replace it by 0:
#for a data.frame:
b <- data.frame(c1=c(1, NaN, 2), c2=c(NaN, 2, 7))
b[is.na(b)] <- 0
b
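Note the difference: is.nan for a matrix vs. is.na for a data frame. Keep in mind that is.na is TRUE for both NaN and NA, so the data frame version also replaces any genuine NA values with 0.
Doing
b[is.nan(b)] <- 0
yields: Error in is.nan(b) : default method not implemented for type 'list', because b is a data frame (a list of columns) and is.nan has no method for lists.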
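To make the is.na() / is.nan() distinction concrete, here is a small sketch (the data frame b2 is made up for illustration) that zeroes only NaN, column by column, and leaves NA alone:

```r
b2 <- data.frame(c1 = c(1, NaN, NA), c2 = c(NaN, 2, 7))

is.na(b2$c1)    # FALSE TRUE TRUE  : NaN counts as missing too
is.nan(b2$c1)   # FALSE TRUE FALSE : only NaN

# is.nan() works on the individual (atomic) columns, so apply it per column
b2[] <- lapply(b2, function(col) replace(col, is.nan(col), 0))
b2
```

Assigning into b2[] keeps the object a data frame instead of turning it into a plain list.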

The following should do what you want (note that is.na() is TRUE for both NaN and NA, so any NA values are replaced as well):
x <- data.frame(X1=sample(c(1:3,NaN), 200, replace=TRUE), X2=sample(c(4:6,NaN), 200, replace=TRUE))
head(x)
x <- replace(x, is.na(x), 0)
head(x)

Here is a tidyverse solution. I've generated sample data with both NaN and NA. The first column is fully complete.
library(dplyr)
df <- tibble(x = LETTERS[1:5],
             y = c(1:3, NaN, 4),
             z = c(rep(NaN, 3), NA, 5))
df
# A tibble: 5 x 3
#   x         y     z
#   <chr> <dbl> <dbl>
# 1 A         1   NaN
# 2 B         2   NaN
# 3 C         3   NaN
# 4 D       NaN    NA
# 5 E         4     5
Then we can apply mutate_all with replace to the dataframe:
df %>%
  mutate_all(~replace(., is.nan(.), 0))
# A tibble: 5 x 3
#   x         y     z
#   <chr> <dbl> <dbl>
# 1 A         1     0
# 2 B         2     0
# 3 C         3     0
# 4 D         0    NA
# 5 E         4     5
We've replaced NaN values with zero and touched neither NA values nor the x column.
UPDATE for dplyr 1.0.0
Since mutate_all() is superseded, we can now rewrite the expression using across() as follows:
df %>%
  mutate(across(everything(), ~replace(.x, is.nan(.x), 0)))
# A tibble: 5 × 3
#   x         y     z
#   <chr> <dbl> <dbl>
# 1 A         1     0
# 2 B         2     0
# 3 C         3     0
# 4 D         0    NA
# 5 E         4     5

Related

How to assign 1s and 0s to columns if variable in row matches or not match in R

I'm an absolute beginner in coding and R and this is my third week doing it for a project. (for biologists, I'm trying to find the sum of risk alleles for PRS) but I need help with this part
```
df
  x y z
1 t c a
2 a t a
3 g g t
# so when code applied:
  x y z
1 t 0 0
2 a 0 1
3 g 1 0
```
I'm trying to make it that if the rows in y or z match x the value changes to 1 and if not, zero
I started with:
```
for(i in 1:ncol(df)){
df[, i]<-df[df$x == df[,i], df[ ,i]<- 1]
}
```
But got all NA values
In reality, I have 100 columns I have to compare with x in the data frame. Any help is appreciated
An alternative way to do this is by using ifelse() in base R.
df$y <- ifelse(df$y == df$x, 1, 0)
df$z <- ifelse(df$z == df$x, 1, 0)
df
# x y z
#1 t 0 0
#2 a 0 1
#3 g 1 0
Edit to extend this step to all columns efficiently
For example:
df1
# x y z w
#1 t c a t
#2 a t a a
#3 g g t m
To edit many columns efficiently, a better approach is to write a function and apply it to all targeted columns in the data frame. Here is a simple function to do the work:
edit_col <- function(any_col) ifelse(any_col == df1$x, 1, 0)
This function takes a single column, compares its elements with the elements of df1$x, and returns 1 where they match and 0 where they do not. To apply it to all targeted columns, you can use apply(). Because x is not a targeted column in your case, exclude it by indexing with [, -1], since it is the first column in df1.
# Here number 2 indicates columns. Use number 1 for rows.
df1[, -1] <- apply(df1[,-1], 2, edit_col)
df1
# x y z w
#1 t 0 0 1
#2 a 0 1 1
#3 g 1 0 0
Of course, you can also define a function that edits the whole data frame, so you don't need to call apply() manually.
Here is an example of such a function:
edit_df <- function(any_df){
  # Compare one column with the x column: 1 for a match, 0 otherwise.
  edit_col <- function(any_col) ifelse(any_col == any_df$x, 1, 0)
  # Create a vector containing all names of the targeted columns.
  target_col_names <- setdiff(colnames(any_df), "x")
  any_df[, target_col_names] <- apply(any_df[, target_col_names], 2, edit_col)
  return(any_df)
}
Then use the function:
edit_df(df1)
# x y z w
#1 t 0 0 1
#2 a 0 1 1
#3 g 1 0 0
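A variant of the same idea that avoids the matrix conversion done by apply() is to work column by column with lapply(). This is only a rough sketch: match_x() is a hypothetical helper name, and the key column is assumed to be called x as in the question.

```r
df1 <- data.frame(x = c("t", "a", "g"),
                  y = c("c", "t", "g"),
                  z = c("a", "a", "t"),
                  w = c("t", "a", "m"),
                  stringsAsFactors = FALSE)

# Compare every column except `key` with the key column: 1 = match, 0 = no match
match_x <- function(d, key = "x") {
  targets <- setdiff(names(d), key)
  d[targets] <- lapply(d[targets], function(col) as.integer(col == d[[key]]))
  d
}

res <- match_x(df1)
res
```

Because the targets are found by name with setdiff(), the same call should work unchanged with 100 comparison columns.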
A tidyverse approach
library(dplyr)
df <-
  tibble(
    x = c("t","a","g"),
    y = c("c","t","g"),
    z = c("a","a","t")
  )
df %>%
  mutate(
    across(
      .cols = c(y,z),
      .fns = ~if_else(. == x,1,0)
    )
  )
# A tibble: 3 x 3
#   x         y     z
#   <chr> <dbl> <dbl>
# 1 t         0     0
# 2 a         0     1
# 3 g         1     0

How do I create a conditional variable based on another variable in R?

I'm back to using R after using SAS for a few years, and I'm relearning everything again.
I have a dataset with variable Lot_Size, which contains continuous data from 0.1980028 - 1.2000000 acres. I'd like to categorize this variable based on these demarcations:
0 - 1/3 acre = 0
1/3 - 2/3 acre = 1
2/3 - 1 acre = 2
1+ acre = 3
Into a new variable LS_cat.
I've explored the mutate command but I keep returning errors. Anyone have any ideas?
UPDATE
Thanks for responding - both solutions worked perfectly. Since this was a learning experience for me, I'll add to the question.
I actually misunderstood the question posed to me - if I were to make dummy variables for each category previously noted, how would I do that? For example, if Lot_Size is 0 - 1/3 of an acre, I want variable ls_1_3 to be 1, if it's not then I'd like it to be 0. Would I use ifelse command?
Use case_when().
library(tidyverse)
set.seed(123)
my_df <- tibble(
lot_size = runif(n = 10, min = 0.1980028, max = 1.2)
)
my_df |> mutate(
ls_cat = case_when(lot_size < 1 / 3 ~ 0,
lot_size < 2 / 3 ~ 1,
lot_size < 1 ~ 2,
TRUE ~ 3)
)
#> A tibble: 10 x 2
#> lot_size ls_cat
#> <dbl> <dbl>
#> 1 0.486 1
#> 2 0.988 2
#> 3 0.608 1
#> 4 1.08 3
#> 5 1.14 3
#> 6 0.244 0
#> 7 0.727 2
#> 8 1.09 3
#> 9 0.751 2
#>10 0.656 1
case_when() is usually a sound solution when there are more than two options (if_else() if there are just two), but in this case there's a simpler math(s) solution.
my_df <- tibble(lot_size = seq(0, 1.2, by = 0.1))
my_df$ls_cat <- ceiling((my_df$lot_size*3)-0.99)
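Though this may be less instructive on R programming, and the 0.99 offset only approximates the cut points: a value just below a boundary (e.g. 0.332) lands in the next category up.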
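If exact cut points matter, one possible variant of the arithmetic idea (a sketch, not something taken from the answer above) is to scale by 3, round down, and cap the result at 3:

```r
lot_size <- c(0.2, 0.332, 0.4, 0.7, 1.0, 1.2)

# floor(lot_size * 3) counts how many whole thirds of an acre fit;
# pmin() caps the result at 3 for lots of one acre or more
ls_cat <- pmin(floor(lot_size * 3), 3)
ls_cat
```

Values that sit within floating-point error of a boundary may still be classified differently from the case_when() version, so the explicit comparisons remain the safer choice when that matters.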
For your follow on question, ifelse() works well, e.g.
Base:
my_df$ls_1_3 <- ifelse(my_df$lot_size < 1/3, 1, 0)
Or Tidyverse:
my_df <- my_df %>%
mutate(ls_1_3 = if_else(lot_size < 1/3, 1, 0))
NB: if_else() is a more pedantic version of ifelse(). Both should work equally well here, but if_else() is better for catching possible errors
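For dummy variables covering every category rather than just the first, a possible base R sketch is below. Only ls_1_3 comes from the question; the other column names and the sample values are made up for illustration.

```r
lot_size <- c(0.25, 0.5, 0.8, 1.1)

# Category 0-3 using the cut points from the question (intervals closed on the left)
ls_cat <- findInterval(lot_size, c(1/3, 2/3, 1))

# One 0/1 column per category
dummies <- sapply(0:3, function(k) as.integer(ls_cat == k))
colnames(dummies) <- c("ls_1_3", "ls_2_3", "ls_1", "ls_1_plus")

my_dummies <- data.frame(lot_size, dummies)
my_dummies
```

Each row has exactly one 1, in the column of the category that lot falls into.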
We can use findInterval:
Lot_Size <- seq(0.2, 1.2, len=10)
Lot_Size
# [1] 0.2000000 0.3111111 0.4222222 0.5333333 0.6444444 0.7555556 0.8666667 0.9777778 1.0888889 1.2000000
findInterval(Lot_Size, c(0, 1/3, 2/3, 1, Inf), rightmost.closed = TRUE) - 1L
# [1] 0 0 1 1 1 2 2 2 3 3
findInterval() returns the index of the interval each value falls into, which we then convert to your 0-based categories with the trailing - 1L (integer 1).
Use cut():
dat <- transform(dat, Lot_Size_cat=
cut(Lot_Size, breaks=c(0, 1/3, 2/3, 1, Inf), labels=0:3,
include.lowest=TRUE))
dat
# X1 Lot_Size Lot_Size_cat
# 1 0.77436849 1.0509024 3
# 2 0.19722419 0.2819626 0
# 3 0.97801384 0.8002238 2
# 4 0.20132735 0.9272001 2
# 5 0.36124443 0.6396998 1
# 6 0.74261194 1.0990851 3
# 7 0.97872844 1.1648617 3
# 8 0.49811371 0.7221819 2
# 9 0.01331584 1.1915689 3
# 10 0.25994613 0.4076475 1
Data:
set.seed(666)
n <- 10
dat <- data.frame(X1=runif(n),
Lot_Size=sample(seq(0.1980028, 1.2, 1e-7), n, replace=TRUE))
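One difference between the findInterval() and cut() answers is how a value that sits exactly on a break is handled. A small sketch with two such values (the variable names are made up for illustration):

```r
breaks <- c(0, 1/3, 2/3, 1, Inf)
on_break <- c(1/3, 1)   # values sitting exactly on a cut point

# findInterval(): intervals are closed on the left, [a, b)
fi <- findInterval(on_break, breaks) - 1L

# cut() default: intervals are closed on the right, (a, b]
cut_default <- cut(on_break, breaks = breaks, labels = 0:3, include.lowest = TRUE)

# cut() with right = FALSE uses [a, b), like findInterval()
cut_left <- cut(on_break, breaks = breaks, labels = 0:3, right = FALSE)

fi
as.character(cut_default)
as.character(cut_left)
```

Which convention is right depends on whether a lot of exactly 1/3 acre should count as category 0 or 1; the question does not say.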

How to change specific values in a dataframe

Could anyone explain how to change the negative values in the data frame below?
We have been asked to create a data structure that produces the output below.
# > df
# x y z
# 1 a -2 3
# 2 b 0 4
# 3 c 2 -5
# 4 d 4 6
Then we have to use control flow operators and/or vectorisation to multiply only the negative values by 10.
I tried many different ways but cannot get this to work. I get an error when I try to use a loop, because of the letters.
Create indices of the negative values and multiply by 10, i.e.
i1 <- which(df < 0, arr.ind = TRUE)
df[i1] <- as.numeric(df[i1]) * 10
# x y z
#1 a -20 3
#2 b 0 4
#3 c 2 -50
#4 d 4 6
First find out the numeric columns of the dataframe and multiply the negative values by 10.
cols <- sapply(df, is.numeric)
#Multiply negative values by 10 and positive with 1
df[cols] <- df[cols] * ifelse(sign(df[cols]) == -1, 10, 1)
df
# x y z
#1 a -20 3
#2 b 0 4
#3 c 2 -50
#4 d 4 6
Using dplyr -
library(dplyr)
df <- df %>% mutate(across(where(is.numeric), ~. * ifelse(sign(.) == -1, 10, 1)))
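Since the exercise also mentions control flow, here is a minimal loop-based sketch of the same operation. The data frame is rebuilt from the output shown in the question; the is.numeric() check is what keeps the letter column from causing errors.

```r
df <- data.frame(x = c("a", "b", "c", "d"),
                 y = c(-2, 0, 2, 4),
                 z = c(3, 4, -5, 6))

for (col in names(df)) {
  # Skip the letter column; only numeric columns can be compared with 0
  if (is.numeric(df[[col]])) {
    neg <- df[[col]] < 0                     # logical index of negative values
    df[[col]][neg] <- df[[col]][neg] * 10    # multiply only those by 10
  }
}
df
```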

recoding variable into two new variables in R

I have a variable A containing continuous numeric values and a binary variable B. I would like to create a new variable A1 which contains the same values as A if B=1 and missing values (NA) if B=2.
Many thanks!
You can use ifelse() for that:
A1 <- ifelse(B == 1, A, NA)
Here's a simple and efficient approach without ifelse. It relies on NA ^ 0 being 1 and NA ^ 1 being NA:
A <- 1:10
# [1] 1 2 3 4 5 6 7 8 9 10
B <- rep(1:2, 5)
# [1] 1 2 1 2 1 2 1 2 1 2
A1 <- A * NA ^ (B - 1)
# [1] 1 NA 3 NA 5 NA 7 NA 9 NA
You can use ifelse for this:
A = runif(100)
B = sample(c(0,1), 100, replace = TRUE)
A1 = ifelse(B == 1, A, NA)
If B is coded as 0/1, as in this example, you can even leave out the == 1, because R interprets 0 as FALSE and any other number as TRUE:
A1 = ifelse(B, A, NA)
However, the == 1 is both more flexible and makes it clearer what happens, and with the 1/2 coding from the question it is required, since both values would count as TRUE. So I'd go for the first approach.
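Another explicit option is replace(), which blanks out the positions that should be missing. The sketch below uses the 1/2 coding from the question; A2 is an assumption based on the title's "two new variables", and the sample values are made up.

```r
A <- c(0.5, 1.2, 3.4, 2.2)
B <- c(1, 2, 2, 1)

A1 <- replace(A, B != 1, NA)   # A where B == 1, NA otherwise
A2 <- replace(A, B != 2, NA)   # A where B == 2, NA otherwise (assumed second variable)

A1
A2
```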

