How to replace NaN values with zero in a huge data frame in R?

I tried to replace NaN values with zeros using the following script:
rapply( data123, f=function(x) ifelse(is.nan(x),0,x), how="replace" )
# [31] 0.00000000 -0.67994832 0.50287454 0.63979527 1.48410571 -2.90402836
The NaN value showed as zero in the printed output, but when I typed the name of the data frame to review it, the value was still NaN.
data123$contri_us
# [31] NaN -0.67994832 0.50287454 0.63979527 1.48410571 -2.90402836
I am not sure whether the rapply command actually applied the adjustment to the data frame, or just displayed a modified copy.
Any idea how to actually change the NaN value to zero?
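As a side note, rapply() returns a modified copy rather than changing data123 in place, so for the printed fix to stick, the result has to be assigned back; a minimal sketch:
# assign the result back; rapply() does not modify its input
data123 <- rapply(data123, f = function(x) ifelse(is.nan(x), 0, x), how = "replace")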

It would seem that is.nan doesn't actually have a method for data frames, unlike is.na. So, let's fix that!
is.nan.data.frame <- function(x)
  do.call(cbind, lapply(x, is.nan))

data123[is.nan(data123)] <- 0
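To see the method in action, here is a quick demonstration on a toy data frame standing in for data123 (is.nan() is an internal generic, so the new method is dispatched automatically):
d <- data.frame(a = c(1, NaN), b = c(NaN, 2))
d[is.nan(d)] <- 0  # dispatches to is.nan.data.frame
d
#   a b
# 1 1 0
# 2 0 2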

In fact, in R, this operation is very easy:
If the matrix 'a' contains some NaN values, you just need the following code to replace them with 0:
a <- matrix(c(1, NaN, 2, NaN), ncol=2, nrow=2)
a[is.nan(a)] <- 0
a
If the data frame 'b' contains some NaN values, you just need the following code to replace them with 0:
#for a data.frame:
b <- data.frame(c1=c(1, NaN, 2), c2=c(NaN, 2, 7))
b[is.na(b)] <- 0
b
Note the difference: is.nan() when it's a matrix vs. is.na() when it's a data frame. Keep in mind that is.na() is TRUE for both NA and NaN, so on a data frame this replaces genuine NA values as well.
Doing
#...
b[is.nan(b)] <- 0
#...
yields Error in is.nan(b) : default method not implemented for type 'list', because b is a data frame.
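To make the distinction concrete, compare the two tests on a small vector:
v <- c(1, NA, NaN)
is.na(v)   # FALSE  TRUE  TRUE  (NA and NaN both count as missing)
is.nan(v)  # FALSE FALSE  TRUE  (only NaN)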

The following should do what you want:
x <- data.frame(X1 = sample(c(1:3, NaN), 200, replace = TRUE),
                X2 = sample(c(4:6, NaN), 200, replace = TRUE))
head(x)
x <- replace(x, is.na(x), 0)
head(x)
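Note that is.na() also matches genuine NA values here; if only NaN should be zeroed while NA is preserved, one option (a sketch) is to build the logical index column by column:
# sapply() returns a logical matrix marking only the NaN entries
x <- replace(x, sapply(x, is.nan), 0)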

Here is a tidyverse solution. I've generated sample data with both NaN and NA. The first column is fully complete.
library(tidyverse)

df <- tibble(x = LETTERS[1:5],
             y = c(1:3, NaN, 4),
             z = c(rep(NaN, 3), NA, 5))
df
# A tibble: 5 x 3
  x         y     z
  <chr> <dbl> <dbl>
1 A         1   NaN
2 B         2   NaN
3 C         3   NaN
4 D       NaN    NA
5 E         4     5
Then we can apply mutate_all() with replace() to the data frame:
df %>%
  mutate_all(~replace(., is.nan(.), 0))
# A tibble: 5 x 3
  x         y     z
  <chr> <dbl> <dbl>
1 A         1     0
2 B         2     0
3 C         3     0
4 D         0    NA
5 E         4     5
We've replaced NaN values with zero and touched neither NA values nor the x column.
UPDATE for dplyr 1.0.0
Since mutate_all() is superseded, we can now rewrite the expression using across() as follows:
df %>%
  mutate(across(everything(), ~replace(.x, is.nan(.x), 0)))
# A tibble: 5 × 3
  x         y     z
  <chr> <dbl> <dbl>
1 A         1     0
2 B         2     0
3 C         3     0
4 D         0    NA
5 E         4     5
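If you prefer to be explicit about touching only numeric columns, across() also accepts a where() selector:
df %>%
  mutate(across(where(is.numeric), ~replace(.x, is.nan(.x), 0)))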

Related

How to assign 1s and 0s to columns if the variable in a row matches or doesn't match in R

I'm an absolute beginner in coding and R, and this is my third week doing it for a project (for biologists: I'm trying to find the sum of risk alleles for a PRS), but I need help with this part:
df
  x y z
1 t c a
2 a t a
3 g g t
so when the code is applied:
  x y z
1 t 0 0
2 a 0 1
3 g 1 0
I'm trying to make it so that if the value in y or z matches x in the same row, it changes to 1, and if not, to 0.
I started with:
for (i in 1:ncol(df)) {
  df[, i] <- df[df$x == df[, i], df[, i] <- 1]
}
But I got all NA values.
In reality, I have 100 columns that I have to compare with x in the data frame. Any help is appreciated.
An alternative way to do this is by using ifelse() in base R.
df$y <- ifelse(df$y == df$x, 1, 0)
df$z <- ifelse(df$z == df$x, 1, 0)
df
# x y z
#1 t 0 0
#2 a 0 1
#3 g 1 0
Edit to extend this step to all columns efficiently
For example:
df1
# x y z w
#1 t c a t
#2 a t a a
#3 g g t m
To apply column editing efficiently, a better approach is to use a function applied to all targeted columns in the data frame. Here is a simple function to do the work:
edit_col <- function(any_col) ifelse(any_col == df1$x, 1, 0)
This function takes a single column, compares its elements with the elements of df1$x, and edits the column accordingly. To apply it to all targeted columns, you can use apply(). Because in your case x is not a targeted column, exclude it with the index [, -1], since it is the first column of df1.
# Here number 2 indicates columns. Use number 1 for rows.
df1[, -1] <- apply(df1[,-1], 2, edit_col)
df1
# x y z w
#1 t 0 0 1
#2 a 0 1 1
#3 g 1 0 0
Of course you can also define a function that edit the data frame so you don't need to do apply() manually.
Here is an example of such function
edit_df <- function(any_df) {
  edit_col <- function(any_col) ifelse(any_col == any_df$x, 1, 0)
  # Create a vector containing the names of all targeted columns.
  target_col_names <- setdiff(colnames(any_df), "x")
  any_df[, target_col_names] <- apply(any_df[, target_col_names], 2, edit_col)
  return(any_df)
}
Then use the function:
edit_df(df1)
# x y z w
#1 t 0 0 1
#2 a 0 1 1
#3 g 1 0 0
A tidyverse approach
library(dplyr)

df <- tibble(
  x = c("t", "a", "g"),
  y = c("c", "t", "g"),
  z = c("a", "a", "t")
)

df %>%
  mutate(
    across(
      .cols = c(y, z),
      .fns = ~if_else(. == x, 1, 0)
    )
  )
# A tibble: 3 x 3
  x         y     z
  <chr> <dbl> <dbl>
1 t         0     0
2 a         0     1
3 g         1     0
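With the real data of 100 columns, listing each one isn't practical; across() can instead select everything except x (a sketch of the same idea):
df %>%
  mutate(across(.cols = -x, .fns = ~if_else(. == x, 1, 0)))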

How to change specific values in a dataframe

Could anyone explain how to change the negative values in the data frame below?
We have been asked to create a data structure to get the following output.
# > df
# x y z
# 1 a -2 3
# 2 b 0 4
# 3 c 2 -5
# 4 d 4 6
Then we have to use control flow operators and/or vectorisation to multiply only the negative values by 10.
I tried so many different ways but cannot get this to work. I get an error when I try to use a loop, because of the letter column.
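For reference, the data frame from the question can be rebuilt like this (inferred from the printed output):
df <- data.frame(x = c("a", "b", "c", "d"),
                 y = c(-2, 0, 2, 4),
                 z = c(3, 4, -5, 6))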
Create indices of the negative values and multiply by 10, i.e.
i1 <- which(df < 0, arr.ind = TRUE)
df[i1] <- as.numeric(df[i1]) * 10
# x y z
#1 a -20 3
#2 b 0 4
#3 c 2 -50
#4 d 4 6
First find the numeric columns of the data frame, then multiply the negative values by 10.
cols <- sapply(df, is.numeric)
# Multiply negative values by 10 and everything else by 1
df[cols] <- df[cols] * ifelse(sign(df[cols]) == -1, 10, 1)
df
# x y z
#1 a -20 3
#2 b 0 4
#3 c 2 -50
#4 d 4 6
Using dplyr -
library(dplyr)
df <- df %>% mutate(across(where(is.numeric), ~. * ifelse(sign(.) == -1, 10, 1)))
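An equivalent, arguably more direct formulation conditions on the value itself rather than on sign():
df <- df %>%
  mutate(across(where(is.numeric), ~ifelse(. < 0, . * 10, .)))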

R: Find the Variance of all Non-Zero Elements in Each Row

I have a data frame d like this:
ID Value1 Value2 Value3
 1     20     25      0
 2      2      0      0
 3     15     32     16
 4      0      0      0
What I would like to do is calculate the variance for each person (ID), based only on non-zero values, and return NA where this is not possible.
So, for instance, in this example the variance for ID 1 would be var(20, 25);
for ID 2 it would return NA because you can't calculate a variance on just one entry; for ID 3 the variance would be var(15, 32, 16); and for ID 4 it would again return NA because it has no non-zero values to calculate a variance on.
How would I go about this? I currently have the following (incomplete) code, but this might not be the best way to go about it:
len <- nrow(d)
variances <- numeric(len)
for (i in 1:len) {
  # get all non-zero values in the ith row of the data into a vector nonzerodat here
  currentvar <- var(nonzerodat)
  variances[i] <- currentvar
}
Note this is a toy example, but the dataset I'm actually working with has over 40 different columns of values to calculate variance on, so something that easily scales would be great.
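For reference, the placeholder in that loop can be completed as below (a sketch; var() returns NA for fewer than two values, which gives exactly the NA behaviour asked for):
len <- nrow(d)
variances <- numeric(len)
for (i in 1:len) {
  # drop the ID column, then keep only the non-zero values of row i
  row_vals <- as.numeric(d[i, -1])
  nonzerodat <- row_vals[row_vals != 0]
  variances[i] <- var(nonzerodat)  # NA when fewer than two non-zero values
}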
Data <- data.frame(ID = 1:4,
                   Value1 = c(20, 2, 15, 0),
                   Value2 = c(25, 0, 32, 0),
                   Value3 = c(0, 0, 16, 0))
var_nonzero <- function(x) var(x[!x == 0])
apply(Data[, -1], 1, var_nonzero)
[1] 12.5 NA 91.0 NA
This seems overwrought, but it works, and it gives you back an object with the ids attached to the statistics:
library(reshape2)
library(dplyr)
variances <- df %>%
  melt(., id.var = "id") %>%
  group_by(id) %>%
  summarise(variance = var(value[value != 0]))
Here's the toy data I used to test it:
df <- data.frame(id = seq(4), X1 = c(3, 0, 1, 7), X2 = c(10, 5, 0, 0), X3 = c(4, 6, 0, 0))
> df
id X1 X2 X3
1 1 3 10 4
2 2 0 5 6
3 3 1 0 0
4 4 7 0 0
And here's the result:
  id variance
1  1 14.33333
2  2  0.50000
3  3       NA
4  4       NA

Joining two data frames of different lengths

I have a data frame which has 25 weeks of data on sales. I have computed a lagged moving average. Now, say x <- c(1, 2, 3, 4) and the moving average is y <- c(NaN, 1, 1.5, 2, 2.5).
If I use z <- data.frame(x, y) it gives me an error because the dimensions don't match. Is there any way to join them as a data frame by inserting an NA value at the end of the x column?
Is the same thing possible when x is a data frame with n rows and m columns and I want to append a column of length n + 1 to the right of it?
Yet another way of doing it
data.frame(x[1:length(y)], y)
If x is a data frame, you can use
data.frame(x[1:length(y), ], y)
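For the example vectors above, indexing past the end of x pads it with NA, so the columns line up:
data.frame(x = x[1:length(y)], y)
#    x   y
# 1  1 NaN
# 2  2 1.0
# 3  3 1.5
# 4  4 2.0
# 5 NA 2.5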
You could do this
> lst <- list(x = x, y = y)
> m <- max(sapply(lst, length))
> as.data.frame(lapply(lst, function(x){ length(x) <- m; x }))
# x y
# 1 1 NaN
# 2 2 1.0
# 3 3 1.5
# 4 4 2.0
# 5 NA 2.5
In response to your comment, if x is a matrix and y is a vector, it would depend on the number of columns in x. But for this example
cbind(append(x, rep(NA, length(y)-length(x))), y)
If x has multiple columns, you could use some variety of
apply(x, 2, append, NA)
But again, it depends on what's in the columns and what's in y.
Maybe this also helps:
x <- 1:4
x1 <- matrix(1:8, ncol = 2)
y <- c(NaN, 1, 1.5, 2, 2.5)
do.call(merge, c(list(x, y), by = 0, all = TRUE))[, -1]
# x y
# 1 1 NaN
# 2 2 1.0
# 3 3 1.5
# 4 4 2.0
# 5 NA 2.5
do.call(merge, c(list(x1, y), by = 0, all = TRUE))[, -1]
# V1 V2 y
#1 1 5 NaN
#2 2 6 1.0
#3 3 7 1.5
#4 4 8 2.0
#5 NA NA 2.5
