Applying a custom function to each row uses only the first value of an argument - R

I am trying to recode NA values to 0 in a subset of columns using the following dataset:
set.seed(1)
df <- data.frame(
  id = 1:10,
  trials = sample(1:3, 10, replace = TRUE),
  t1 = sample(c(1:9, NA), 10),
  t2 = sample(c(1:7, rep(NA, 3)), 10),
  t3 = sample(c(1:5, rep(NA, 5)), 10)
)
Each row has a certain number of trials associated with it (between 1 and 3), specified by the trials column. Columns t1-t3 represent the scores for each trial.
The number of trials indicates the subset of columns in which NAs should be recoded to 0: NAs that are within the number of trials represent missing data, and should be recoded as 0, while NAs outside the number of trials are not meaningful, and should remain NAs. So, for a row where trials == 3, an NA in column t3 would be recoded as 0, but in a row where trials == 2, an NA in t3 would remain an NA.
So, I tried using this function:
replace0 <- function(x, num.sun) {
  x[which(is.na(x[1:(num.sun + 2)]))] <- 0
  return(x)
}
This works well for single vectors. When I try applying the same function to a data frame with apply(), though:
apply(df, 1, replace0, num.sun = df$trials)
I get a warning saying:
In 1:(num.sun + 2) :
numerical expression has 10 elements: only the first used
The result is that instead of num.sun changing for each row according to the value in trials, apply() simply uses the first value of the trials column for every row. How could I apply the function so that the num.sun argument changes according to the value of df$trials?
Thanks!
Edit: as some have commented, the original example data had some non-NA scores that didn't make sense according to the trials column. Here's a corrected dataset:
df <- data.frame(
  id = 1:5,
  trials = c(rep(1, 2), rep(2, 1), rep(3, 2)),
  t1 = c(NA, 7, NA, 6, NA),
  t2 = c(NA, NA, 3, 7, 12),
  t3 = c(NA, NA, NA, 4, NA)
)
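
One direct fix is to loop over row indices rather than rows, so each call to replace0() picks up the trials value from the same row; a minimal sketch (assuming the replace0() defined in the question):
# Minimal sketch: iterate over row indices so num.sun is taken from the same row.
res <- t(sapply(seq_len(nrow(df)), function(i) {
  replace0(unlist(df[i, ]), num.sun = df$trials[i])
}))
res  # a matrix; wrap in as.data.frame() if a data frame is needed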

Another approach:
# create an index of the NA values
w <- which(is.na(df), arr.ind = TRUE)
# create an index with the max column by row where an NA is allowed to be replaced by a zero
m <- matrix(c(1:nrow(df), (df$trials + 2)), ncol = 2)
# subset 'w' such that only the NA's which fall in the scope of 'm' remain
i <- w[w[,2] <= m[,2][match(w[,1], m[,1])],]
# use 'i' to replace the allowed NA's with a zero
df[i] <- 0
which gives:
> df
id trials t1 t2 t3
1 1 1 3 NA 5
2 2 2 2 2 NA
3 3 2 6 6 4
4 4 3 0 1 2
5 5 1 5 NA NA
6 6 3 7 0 0
7 7 3 8 7 0
8 8 2 4 5 1
9 9 2 1 3 NA
10 10 1 9 4 3
You could easily wrap this in a function:
replace.NA.with.0 <- function(df) {
w <- which(is.na(df), arr.ind = TRUE)
m <- matrix(c(1:nrow(df), (df$trials + 2)), ncol = 2)
i <- w[w[,2] <= m[,2][match(w[,1], m[,1])],]
df[i] <- 0
return(df)
}
Now, using replace.NA.with.0(df) will produce the above result.
As noted by others, some rows (1, 3 and 10) have more values than trials. You could tackle that problem by rewriting the above function to:
replace.with.NA.or.0 <- function(df) {
  w <- which(is.na(df), arr.ind = TRUE)
  df[w] <- 0
  # last allowed column per row (trials plus the two leading columns)
  m <- matrix(c(1:nrow(df), (df$trials + 2)), ncol = 2)
  # for each row, the columns beyond the allowed range are set back to NA
  v <- tapply(m[, 2], m[, 1], FUN = function(x) tail(x:ncol(df), -1))
  ina <- matrix(as.integer(unlist(stack(v)[2:1])), ncol = 2)
  df[ina] <- NA
  return(df)
}
Now, using replace.with.NA.or.0(df) produces the following result:
id trials t1 t2 t3
1 1 1 3 NA NA
2 2 2 2 2 NA
3 3 2 6 6 NA
4 4 3 0 1 2
5 5 1 5 NA NA
6 6 3 7 0 0
7 7 3 8 7 0
8 8 2 4 5 NA
9 9 2 1 3 NA
10 10 1 9 NA NA

Here I just rewrite your function using double subsetting, x[paste0('t', x['trials'])], which overcomes the problem the other two solutions have with row 6:
replace0 <- function(x) {
  x_na <- x[paste0('t', x['trials'])]
  if (is.na(x_na)) { x[paste0('t', x['trials'])] <- 0 }
  return(x)
}
t(apply(df, 1, replace0))
id trials t1 t2 t3
[1,] 1 1 3 NA 5
[2,] 2 2 2 2 NA
[3,] 3 2 6 6 4
[4,] 4 3 NA 1 2
[5,] 5 1 5 NA NA
[6,] 6 3 7 NA 0
[7,] 7 3 8 7 0
[8,] 8 2 4 5 1
[9,] 9 2 1 3 NA
[10,] 10 1 9 4 3

Here is a way to do it:
# logical matrix marking the NA cells
x <- is.na(df)
# running count of NAs within each row; an NA is replaced with 0 only when
# that running count exceeds 3 - trials for the row
df[x & t(apply(x, 1, cumsum)) > 3 - df$trials] <- 0
The output looks like this:
> df
id trials t1 t2 t3
1 1 1 3 NA 5
2 2 2 2 2 NA
3 3 2 6 6 4
4 4 3 0 1 2
5 5 1 5 NA NA
6 6 3 7 0 0
7 7 3 8 7 0
8 8 2 4 5 1
9 9 2 1 3 NA
10 10 1 9 4 3
Note: rows 1, 3 and 10 are problematic, since they contain more non-NA values than trials.

Here's a tidyverse way; note that it doesn't give the same output as the other solutions.
Your example data shows results for trials that "didn't happen"; I assumed your real data doesn't.
library(tidyverse)
df %>%
  nest(matches("^t\\d")) %>%
  mutate(data = map2(data, trials, ~ mutate_all(., replace_na, 0) %>% select(., 1:.y))) %>%
  unnest
# id trials t1 t2 t3
# 1 1 1 3 NA NA
# 2 2 2 2 2 NA
# 3 3 2 6 6 NA
# 4 4 3 0 1 2
# 5 5 1 5 NA NA
# 6 6 3 7 0 0
# 7 7 3 8 7 0
# 8 8 2 4 5 NA
# 9 9 2 1 3 NA
# 10 10 1 9 NA NA
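With tidyr >= 1.0.0 the nest()/unnest() interface changed; a sketch of the equivalent call under that assumption (not part of the original answer):
# Same idea with the newer nest()/unnest() syntax (tidyr >= 1.0.0 assumed)
df %>%
  nest(data = matches("^t\\d")) %>%
  mutate(data = map2(data, trials,
                     ~ mutate_all(.x, replace_na, 0) %>% select(1:.y))) %>%
  unnest(cols = data)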
Using the more commonly used gather strategy this would be:
df %>%
  gather(k, v, matches("^t\\d")) %>%
  arrange(id) %>%
  group_by(id) %>%
  slice(1:first(trials)) %>%
  mutate_at("v", ~ replace(., is.na(.), 0)) %>%
  spread(k, v)
# # A tibble: 10 x 5
# # Groups: id [10]
# id trials t1 t2 t3
# <int> <int> <dbl> <dbl> <dbl>
# 1 1 1 3 NA NA
# 2 2 2 2 2 NA
# 3 3 2 6 6 NA
# 4 4 3 0 1 2
# 5 5 1 5 NA NA
# 6 6 3 7 0 0
# 7 7 3 8 7 0
# 8 8 2 4 5 NA
# 9 9 2 1 3 NA
# 10 10 1 9 NA NA

Related

How to select and remove rows based on position for a specific range in R

Suppose I have two data frames like this:
df1 <- data.frame(a = c(1, 2, 4, 0, 0),
                  b = c(0, 3, 5, 5, 0),
                  c = c(0, 0, 6, 7, 6))
df2 <- data.frame(a = c(3, 6, 8, 0, 0),
                  b = c(0, 9, 10, 4, 0),
                  c = c(0, 0, 1, 4, 9))
And then I join them, like this:
df3 <- full_join(df1, df2)
print(df3)
a b c
1 1 0 0
2 2 3 0
3 4 5 6
4 0 5 7
5 0 0 6
6 3 0 0
7 6 9 0
8 8 10 1
9 0 4 4
10 0 0 9
Note that I always have the same pattern, with zeros in rows 1 and 2 and in rows 9 and 10. I also have zeros between rows 4 and 7.
I want to remove only the zeros between rows 4 and 7.
So I can solve it like this:
df3[4,1] <- NA
df3[5,1] <- NA
df3[5,2] <- NA
df3[6,2] <- NA
df3[6,3] <- NA
df3[7,3] <- NA
new.df3 <- as.data.frame(lapply(df3, na.omit))
print(new.df3)
a b c
1 1 0 0
2 2 3 0
3 4 5 6
4 3 5 7
5 6 9 6
6 8 10 1
7 0 4 4
8 0 0 9
But it is not elegant, and it is very time-consuming.
Any thoughts? I really appreciate it, thanks in advance.
Best!
df3 %>%
  mutate(rn = between(row_number(), 4, 7)) %>%
  summarise(across(-rn, ~ .x[!(.x == 0 & rn)]))
a b c
1 1 0 0
2 2 3 0
3 4 5 6
4 3 5 7
5 6 9 6
6 8 10 1
7 0 4 4
8 0 0 9
First, find which values are zero between rows 4 and 7.
to_remove <- apply(df3[4:7, ], 1, function(x) which(x == 0))
Then, substitute them with NAs.
for (i in seq(length(to_remove))) {
  df3[as.numeric(names(to_remove))[i], to_remove[[i]]] <- NA
}
And, finally, drop them.
new.df3 <- as.data.frame(lapply(df3, na.omit))
print(new.df3)
Here's a different approach:
mask <- !(seq(nrow(df3)) %in% 4:7 & df3 == 0)
df.lst <- lapply(1:3, function(x) df3[mask[, x], x])
sapply(df.lst, length)
# [1] 8 8 8 # Check to make sure the columns are the same length
names(df.lst) <- colnames(df3)
(new.df3 <- as.data.frame(df.lst))
# a b c
# 1 1 0 0
# 2 2 3 0
# 3 4 5 6
# 4 3 5 7
# 5 6 9 6
# 6 8 10 1
# 7 0 4 4
# 8 0 0 9

How to make some values of a column NA based on another column

I want to set the value of column A to NA in each row where column B is 2:
data
A B
1 2
2 4
NA 5
6 2
output
A B
NA 2
2 4
NA 5
NA 2
The first and last rows have B equal to 2, so A becomes NA in those rows.
Here's a way using ifelse in base R -
df$A <- ifelse(df$B == 2, NA_real_, df$A)
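For instance, with the small example from the question (column values taken from the table above):
# Rebuild the question's example data (structure assumed from the table above)
df <- data.frame(A = c(1, 2, NA, 6), B = c(2, 4, 5, 2))
df$A <- ifelse(df$B == 2, NA_real_, df$A)
df
#    A B
# 1 NA 2
# 2  2 4
# 3 NA 5
# 4 NA 2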
set.seed(0)
df <- data.frame(A = sample(1:10, size = 5, replace = TRUE),
                 B = sample(1:10, size = 5, replace = TRUE))
df
A B
1 9 7
2 4 2
3 7 3
4 1 1
5 2 5
df$A[df$B == 2] <- NA
df
A B
1 9 7
2 NA 2
3 7 3
4 1 1
5 2 5

How to replace 0 or missing value with NA in R [duplicate]

This question already has answers here:
Replace all 0 values to NA
(11 answers)
Closed 4 years ago.
This is what I have already done so far (the data is of numeric type):
if (is.na(data) || attribute==0){replace(data,NA)}
It gives me the error message:
Error in replace(attribute, NA) : argument "values" is missing, with no default
With mutate_all:
library(dplyr)
df %>%
  mutate_all(~ replace(., . == 0, NA))
or with mutate_if to be safe:
df %>%
  mutate_if(is.numeric, ~ replace(., . == 0, NA))
Note that there is no need to check for NA's, because we are replacing with NA anyway.
Output:
> df %>%
+ mutate_all(~replace(., . == 0, NA))
X Y Z
1 1 5 <NA>
2 4 4 2
3 2 3 2
4 5 5 2
5 5 3 <NA>
6 NA 4 <NA>
7 3 3 1
8 5 3 2
9 3 1 1
10 2 NA 5
11 5 5 <NA>
12 2 5 2
13 4 4 4
14 3 4 <NA>
15 NA NA 3
16 5 2 1
17 1 4 <NA>
18 NA 1 4
19 1 1 5
20 5 1 2
> df %>%
+ mutate_if(is.numeric, ~replace(., . == 0, NA))
X Y Z
1 1 5 0
2 4 4 2
3 2 3 2
4 5 5 2
5 5 3 0
6 NA 4 0
7 3 3 1
8 5 3 2
9 3 1 1
10 2 NA 5
11 5 5 0
12 2 5 2
13 4 4 4
14 3 4 0
15 NA NA 3
16 5 2 1
17 1 4 0
18 NA 1 4
19 1 1 5
20 5 1 2
Data:
set.seed(123)
df <- data.frame(X = sample(0:5, 20, replace = TRUE),
                 Y = sample(0:5, 20, replace = TRUE),
                 Z = as.character(sample(0:5, 20, replace = TRUE)))
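In dplyr >= 1.0.0, mutate_all and mutate_if are superseded; a sketch of the same idea with across() (an addition, not part of the original answer):
df %>%
  mutate(across(where(is.numeric), ~ replace(., . == 0, NA)))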
You could just use replace without any additional function / package:
data <- replace(data, data == 0, NA)
This is now assuming that data is your data frame.
Otherwise you can simply insert the column name, e.g. if your data frame is df and column name data:
df$data <- replace(df$data, df$data == 0, NA)
Assuming that data is a data frame, you could use sapply to update your values based on a set of filters. Note that replace() needs the replacement value as its third argument:
new.data <- as.data.frame(sapply(data, FUN = function(x) replace(x, is.na(x) | x == 0, NA)))

Find the index of columns containing more than 5 NA values

I want to subset a dataframe and extract only the columns that contain 5 or more NA values.
df <- data.frame(A = rep(1, 10),
                 B = c(rep(2, 5), rep(3, 5)),
                 D = rep(5, 10),
                 E = c(rep(1, 2), rep(NA, 6), rep(6, 2)),
                 F = c(rep(NA, 2), rep(2, 8)))
A B D E F
1 1 2 5 1 NA
2 1 2 5 1 NA
3 1 2 5 NA 2
4 1 2 5 NA 2
5 1 2 5 NA 2
6 1 3 5 NA 2
7 1 3 5 NA 2
8 1 3 5 NA 2
9 1 3 5 6 2
10 1 3 5 6 2
So in this example I want to have the index of the column "E".
My original dataset has about 3000 columns, so speed is more or less important.
I have been trying to do this with sum(is.na) and filter_if(any_vars), but all to no avail.
Using colSums with is.na:
names(df)[colSums(is.na(df))>5]
[1] "E"
We can use colSums on the logical matrix (is.na(df1)), get the index with which, and extract the names:
names(which(colSums(is.na(df1)) >= 5))
#[1] "E"
which(unlist(lapply(df, function(x) sum(is.na(x)) > 5)))
4
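Since the question mentions trying a dplyr approach, here is a sketch with select() and where() (assuming dplyr >= 1.0.0; not from the original answers):
library(dplyr)
df %>% select(where(~ sum(is.na(.x)) >= 5))
# returns only column E for this data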

R language check missing data for columns and rows

I have a data frame sells and I want to check for missing data in both rows and columns.
What I did for rows is:
sells[, complete.cases(sells)]
nrows(sells[, complete.cases(sells)])
but I didn't know how to solve it for columns.
Help please!
First, let's take the iris data frame and randomly insert some NA's:
iris.demo <- iris
iris.nas <- matrix(as.logical(sample(FALSE:TRUE, size = 150 * 5,
                                     prob = c(.9, .1), replace = TRUE)),
                   ncol = 5)
iris.demo[iris.nas] <- NA
For rows, it is pretty straightforward:
sum(complete.cases(iris.demo))
# [1] 75
For columns, here are two possibilities (among several others):
Transposing the whole data frame:
sum(complete.cases(t(iris.demo)))
# [1] 0 # 0 columns are complete
Using lapply to count the non-missing values in every column and check whether the count equals nrow:
sum(lapply(iris.demo, function(x) sum(!is.na(x))) == nrow(iris.demo))
# [1] 0
You could do it like this:
set.seed(1)
(sells <- data.frame(replicate(2, sample(c(1:3, NA), 10, T)), x3 = 1:10))
# X1 X2 x3
# 1 NA 2 1
# 2 1 3 2
# 3 3 2 3
# 4 1 1 4
# 5 2 NA 5
# 6 2 3 6
# 7 1 NA 7
# 8 2 1 8
# 9 NA 3 9
# 10 2 2 10
Rows:
sells[complete.cases(sells), ]
# X1 X2 x3
# 1 2 1 1
# 2 2 1 2
# 3 3 3 3
# 9 3 2 9
nrow(sells[complete.cases(sells), ])
# [1] 6
Columns:
sells[, sapply(sells, function(col) any(is.na(col)))]
# X1 X2
# 1 2 1
# 2 2 1
# 3 3 3
# 4 NA 2
# 5 1 NA
# 6 NA 2
# 7 NA 3
# 8 3 NA
# 9 3 2
# 10 1 NA
sum(sapply(sells, function(col) any(is.na(col))))
# [1] 2
