R dataframe: combine conditions by processing - r

I need to find the columns that contain only NA values. In every column that is not entirely NA, I need to replace the NAs with 0.
My solution is:
NA_check <- colSums(is.na(frame)) == nrow(frame) #True or False - all NA or not
frame[is.na(frame) & which(names(frame) %in% names(NA_check)[which(NA_check == FALSE, arr.ind=T)])] <- 0
These conditions work separately, but they don't work together or I get some errors combining them. How can I solve my problem?
P.S. This modification also doesn't work if NA_check is not all FALSE:
frame[is.na(frame[which(names(frame) %in% names(NA_check)[which(NA_check == FALSE, arr.ind=T)])])] <- 0

You can find the columns that have at least one non-NA value (that is, not all values are NA) and replace the NAs in that subset with 0.
not_all_NA <- colSums(!is.na(frame)) > 0
frame[not_all_NA][is.na(frame[not_all_NA])] <- 0
We can check this with an example:
frame <- data.frame(a = c(NA, NA, 3, 4), b = NA, c = c(NA, 1:3), d = NA)
frame
# a b c d
#1 NA NA NA NA
#2 NA NA 1 NA
#3 3 NA 2 NA
#4 4 NA 3 NA
not_all_NA <- colSums(!is.na(frame)) > 0
frame[not_all_NA][is.na(frame[not_all_NA])] <- 0
frame
# a b c d
#1 0 NA 0 NA
#2 0 NA 1 NA
#3 3 NA 2 NA
#4 4 NA 3 NA
We can also do this with dplyr:
library(dplyr)
frame %>% mutate(across(where(~any(!is.na(.))), ~ tidyr::replace_na(.x, 0)))
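As a sketch of why the combined condition in the question misbehaves: `is.na(frame)` is a logical matrix, while the `which(...)` term is a vector of column positions, so `&` recycles those positions element by element instead of restricting the test to the chosen columns. One way to combine the two conditions into a single mask is to stretch the per-column flag to the shape of the matrix. The names `mask` and `filled` are only illustrative.

```r
frame <- data.frame(a = c(NA, NA, 3, 4), b = NA, c = c(NA, 1:3), d = NA)

# Per-column flag: TRUE when the column has at least one non-NA value
not_all_NA <- colSums(!is.na(frame)) > 0

# Repeat each column's flag once per row so it lines up with is.na(frame),
# which is stored column by column
mask <- is.na(frame) & rep(unname(not_all_NA), each = nrow(frame))

# Replace only the NAs that sit in columns which are not entirely NA
filled <- frame
filled[mask] <- 0
filled
```

The mask is TRUE only where a cell is NA and its column is not entirely NA, which is the conjunction the question was trying to express.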

Related

How to verify if when a column is NA the other is not?

I have a dataframe with two columns. I need to check whether, wherever one column is NA, the other is not. Thanks
Edited.
I would like to know, for each row of the dataframe, if there are rows with both columns not NA.
You can use the following code to find the rows that have no NA values:
df <- data.frame(x = c(1, NA),
y = c(2, NA))
which(rowSums(is.na(df)) == 0)
Output:
[1] 1
As you can see, the first row has no NA values, so both of its columns are non-NA.
Here's some simple code to generate a column with the NA count for each row:
x <- sample(c(1, NA), 25, replace = TRUE)
y <- sample(c(1, NA), 25, replace = TRUE)
df <- data.frame(x, y)
df$NA_Count <- apply(df, 1, function(x) sum(is.na(x)))
df
x y NA_Count
1 NA 1 1
2 NA NA 2
3 1 NA 1
4 1 NA 1
5 NA NA 2
6 1 NA 1
7 1 1 0
8 1 1 0
9 1 1 0
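The row-wise checks discussed above can also be written without counting. The following is a minimal sketch on a small made-up data frame (the values and the variable names `both_present`, `exactly_one_na` and `both_na` are assumptions for illustration), using `complete.cases()` and `xor()` from base R.

```r
# Hypothetical two-column data covering all four NA patterns
df <- data.frame(x = c(1, NA, 3, NA),
                 y = c(2, NA, NA, 4))

# Rows where both columns are present
both_present <- complete.cases(df)

# Rows where exactly one of the two columns is NA
exactly_one_na <- xor(is.na(df$x), is.na(df$y))

# Rows where both columns are NA
both_na <- is.na(df$x) & is.na(df$y)

data.frame(df, both_present, exactly_one_na, both_na)
```

If the question is "wherever one column is NA, the other is not", then `any(both_na)` being FALSE is the condition to look for.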

Delete records containing more than 5 null values?

I would like to know how I can remove from a dataset the records that have more than 5 null values in the columns that define them. The following code allows you to delete records with any NA in any column. However, how can I modify it to do exactly what I ask? Any ideas?
df[complete.cases(df), ]
Here is an example data frame. One of the rows has 6 NA values.
We sum the NA values by row in a new column, filter where the number of NA is less than or equal to 5, then remove the new column.
df <- data.frame(a = c(1,NA,1,1),
b = c(1, NA, NA, 1),
c = c(1, NA, NA, NA),
d = c(1, NA, NA ,NA),
e = c(1, NA, NA, NA),
f = c(1, NA, NA, NA))
a b c d e f
1 1 1 1 1 1 1
2 NA NA NA NA NA NA
3 1 NA NA NA NA NA
4 1 1 NA NA NA NA
library(dplyr)
df %>%
  mutate(count = rowSums(is.na(df))) %>%
  filter(count <= 5) %>%
  select(-count)
a b c d e f
1 1 1 1 1 1 1
2 1 NA NA NA NA NA
3 1 1 NA NA NA NA
I'm assuming you are referring to values of NA in your data indicating a missing value. NULL is returned by expressions and functions whose value is undefined. First create some reproducible data:
set.seed(42)
vals <- sample.int(1000, 250)
idx <- sample.int(250, 100)
vals[idx] <- NA
example <- as.data.frame(matrix(vals, 25))
Now compute the number of missing values by row and exclude the rows with more than 5 missing values:
na.count <- rowSums(is.na(example))
example[na.count<=5, ]
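The same filter can be wrapped in a small helper so the threshold is explicit. This is only a sketch: `drop_sparse_rows` and its `max_na` argument are hypothetical names, and the data frame is the six-column example from the earlier answer.

```r
# Keep only the rows that have at most `max_na` missing values
drop_sparse_rows <- function(data, max_na = 5) {
  data[rowSums(is.na(data)) <= max_na, , drop = FALSE]
}

df <- data.frame(a = c(1, NA, 1, 1),
                 b = c(1, NA, NA, 1),
                 c = c(1, NA, NA, NA),
                 d = c(1, NA, NA, NA),
                 e = c(1, NA, NA, NA),
                 f = c(1, NA, NA, NA))

kept <- drop_sparse_rows(df)   # row 2 has 6 NAs and is dropped
kept
```

Note that base R subsetting keeps the original row names, whereas the dplyr pipeline renumbers the rows.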

dplyr/purrr iterate over columns as well as rows

I'm trying to drop (set to NA) values in 1 column, based on values in another column; and to do this over a large set of columns. The idea is to then pass the data to a plotting function, to generate different plots for different cuts of the data.
Here's a reproducible example:
d <- data.frame("A_agree" = sample(1:7, 20, replace=T),
"B_agree" = sample(1:7, 20, replace=T),
"C_agree" = sample(1:7, 20, replace=T),
"A_change" = sample(1:5, 20, replace=T),
"B_change" = sample(1:5, 20, replace=T),
"C_change" = sample(1:5, 20, replace=T))
I've already found the following solution using base R, but it's of course slow, and I'm trying to learn more and more dplyr, so was wondering how to achieve this in dplyr
d.positive <- d
for (n in (c("A","B","C"))) {
for (i in 1:nrow(d.positive)) {
d.positive[i, paste0(n, "_agree")] <- ifelse(d.positive[i, paste0(n, "_change")] > 3,
d.positive[i, paste0(n, "_agree")],
NA)
}
}
d.neutral <- d
for (n in (c("A","B","C"))) {
for (i in 1:nrow(d.neutral)) {
d.neutral[i, paste0(n, "_agree")] <- ifelse(d.neutral[i, paste0(n, "_change")] == 3,
d.neutral[i, paste0(n, "_agree")],
NA)
}
}
d.negative <- d
for (n in (c("A","B","C"))) {
for (i in 1:nrow(d.negative)) {
d.negative[i, paste0(n, "_agree")] <- ifelse(d.negative[i, paste0(n, "_change")] < 3,
d.negative[i, paste0(n, "_agree")],
NA)
}
}
I thought I would use gather(), and then check for each row whether the corresponding column (hence the !!dimension) is bigger than a certain value (3 in this case), but it doesn't seem to work?
d %>%
gather(dimension,
value,
paste0(c("A","B","C"), "_agree")
) %>%
case_when(!!dimension > 3 ~ value=NA)
Alternatively, I thought I'd use map2_dfr from purrr, but I don't think it iterates over cells, just takes the entire column, hence this doesn't work:
map2_dfr(.x = d %>%
select( paste0(c("A","B","C"), "_agree") ),
.y = d %>%
select( paste0(c("A","B","C"), "_change") ),
~ if_else(.y > 3, x, NA)} )
Any pointers would be really helpful, to keep learning about the wonderful world of dplyr !
I get that you want to learn about purrr, but base R is just easier here:
d.positive <- d
check <- d.positive[4:6] <= 3 # TRUE where change is not > 3, i.e. where agree should become NA
d.positive[,1:3][check] <- NA
> d.positive
A_agree B_agree C_agree A_change B_change C_change
1 1 NA NA 4 3 2
2 2 2 NA 4 5 2
3 4 NA NA 4 3 1
4 1 NA NA 4 1 2
5 NA 1 NA 2 4 1
6 NA 7 NA 3 5 1
7 NA 6 NA 1 5 1
8 NA 6 4 2 5 5
9 4 NA NA 4 1 2
10 1 NA NA 5 1 2
11 NA NA NA 3 1 2
12 NA NA NA 1 3 3
13 NA NA NA 1 1 1
14 NA NA NA 3 2 3
15 1 NA NA 5 3 3
16 2 NA NA 4 3 2
17 NA NA 6 1 1 4
18 NA NA NA 1 1 2
19 NA NA NA 2 3 1
20 NA NA NA 1 3 1
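To cover all three cuts (positive, neutral, negative) without repeating the code, the column-pairing idea can be sketched as a helper that takes the condition as a function. `mask_agree` and its `keep` argument are hypothetical names, the seed is arbitrary, and the sketch assumes every `*_agree` column has a matching `*_change` column.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
d <- data.frame(A_agree  = sample(1:7, 20, replace = TRUE),
                B_agree  = sample(1:7, 20, replace = TRUE),
                C_agree  = sample(1:7, 20, replace = TRUE),
                A_change = sample(1:5, 20, replace = TRUE),
                B_change = sample(1:5, 20, replace = TRUE),
                C_change = sample(1:5, 20, replace = TRUE))

# Set each *_agree value to NA unless keep(<matching *_change value>) is TRUE
mask_agree <- function(data, keep) {
  agree_cols  <- grep("_agree$", names(data), value = TRUE)
  change_cols <- sub("_agree$", "_change", agree_cols)
  out <- data
  # Map() walks the agree/change columns in pairs; ifelse() is vectorised over rows
  out[agree_cols] <- Map(function(agree, change) ifelse(keep(change), agree, NA),
                         data[agree_cols], data[change_cols])
  out
}

d.positive <- mask_agree(d, function(change) change > 3)
d.neutral  <- mask_agree(d, function(change) change == 3)
d.negative <- mask_agree(d, function(change) change < 3)
```

Because the work is done one whole column at a time, this avoids the cell-by-cell loop from the question.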
I would suggest using the tidyr package in combination with dplyr. It provides the functions pivot_longer and pivot_wider, which replace the older gather and spread.
Using a combination of both the solution could be as follows:
d.neutral1 =
d %>%
mutate(row = row_number() ) %>%
pivot_longer(-row, names_sep = "_", names_to = c("name","type") ) %>%
pivot_wider(names_from = type, values_from = value) %>%
mutate(result = if_else(change == 3, agree, NA_integer_))
and if you want a similar shape to the original
d.neutral1 %>%
select(-agree, -change) %>%
pivot_wider(names_from = name, values_from = result)

Merge data.frame columns on set number of columns removing na's unless not enough values in row

I'd like to remove the NA values from my columns and merge all columns into four columns, while keeping NAs if there are fewer than 4 values in a row.
Say I have data like this,
df <- data.frame('a' = c(1,4,NA,3),
'b' = c(3,NA,3,NA),
'c' = c(NA,2,NA,NA),
'd' = c(4,2,NA,NA),
'e'= c(NA,5,3,NA),
'f'= c(1,NA,NA,4),
'g'= c(NA,NA,NA,4))
#> a b c d e f g
#> 1 1 3 NA 4 NA 1 NA
#> 2 4 NA 2 2 5 NA NA
#> 3 NA 3 NA NA 3 NA NA
#> 4 3 NA NA NA NA 4 4
My desired outcome would be,
df.desired <- data.frame('a' = c(1,4,3,3),
'b' = c(3,2,3,4),
'c' = c(4,2,NA,4),
'd' = c(1,5,NA,NA))
df.desired
#> a b c d
#> 1 1 3 4 1
#> 2 4 2 2 5
#> 3 3 3 NA NA
#> 4 3 4 4 NA
You could probably have adapted two existing answers on SO:
Shift all the numbers to the left, past the NAs
Remove the columns that contain only NAs
Result:
df <- data.frame('a' = c(1,4,NA,3),
'b' = c(3,NA,3,NA),
'c' = c(NA,2,NA,NA),
'd' = c(4,2,NA,NA),
'e'= c(NA,5,3,NA),
'f'= c(1,NA,NA,4),
'g'= c(NA,NA,NA,4))
df.new<-do.call(rbind,lapply(1:nrow(df),function(x) t(matrix(df[x,order(is.na(df[x,]))])) ))
colnames(df.new)<-colnames(df)
df.new
df.new[,colSums(is.na(df.new))<nrow(df.new)]
Output:
> df.new[,colSums(is.na(df.new))<nrow(df.new)]
a b c d
[1,] 1 3 4 1
[2,] 4 2 2 5
[3,] 3 3 NA NA
[4,] 3 4 4 NA
I believe there are more efficient ways, but here is my attempt:
x00=sapply(1:nrow(df),function(x) df[x,][!is.na( df[x,])])
x01=lapply(x00,function(x) x=c(x,rep(NA,7-length(x)-1)))
x02=as.data.frame(do.call("rbind",x01))
x02 <- x02[,colSums(is.na(x02))<nrow(x02)]
I have the following solution:
df <- data.frame('a' = c(1,4,NA,3),
'b' = c(3,NA,3,NA),
'c' = c(NA,2,NA,NA),
'd' = c(4,2,NA,NA),
'e'= c(NA,5,3,NA),
'f'= c(1,NA,NA,4),
'g'= c(NA,NA,NA,4))
df
x <-list()
for(i in 1:nrow(df)){
x[[i]] <- df[i,]
x[[i]] <- x[[i]][!is.na(x[[i]])]
# x[[i]] <- as.data.frame(x[[i]], stringsAsFactors = FALSE)
x[[i]] <- c(x[[i]], rep(NA, 4 - length(x[[i]]))) # pad with NA up to the 4 desired columns
}
result <- do.call(rbind, x)
result
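The answers above share one idea: drop the NAs in each row, then pad the row back to a fixed width. A compact base R sketch of that idea follows. `shift_left` and `n_keep` are hypothetical names, and it relies on `length(v) <- n` padding a vector with NA (or truncating it).

```r
df <- data.frame(a = c(1, 4, NA, 3),
                 b = c(3, NA, 3, NA),
                 c = c(NA, 2, NA, NA),
                 d = c(4, 2, NA, NA),
                 e = c(NA, 5, 3, NA),
                 f = c(1, NA, NA, 4),
                 g = c(NA, NA, NA, 4))

# Move the non-NA values of every row to the left and keep n_keep columns
shift_left <- function(data, n_keep) {
  rows <- lapply(seq_len(nrow(data)), function(i) {
    v <- unlist(data[i, ], use.names = FALSE)
    v <- v[!is.na(v)]
    length(v) <- n_keep   # pads with NA when short, truncates when long
    v
  })
  out <- as.data.frame(do.call(rbind, rows))
  names(out) <- names(data)[seq_len(n_keep)]
  out
}

# Use the largest number of non-NA values in any row as the width (4 here)
result <- shift_left(df, n_keep = max(rowSums(!is.na(df))))
result
```

Choosing the width from the data means no value is truncated; passing a smaller `n_keep` would silently drop trailing values.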

Conditional replacement of specific row values

I have a problem with conditional replacement. Let's assume I have the following code for a dataframe
a=c("0","1","0","B","NA","NA","NA","NA","NA")
b=c(0,1,0,0,1,0,1,0,1)
c=c(0,0,0,0,1,0,0,1,1)
d=c("0","1","0","0","1","0","B","NA","NA")
dat=data.frame(rbind(a,b,c,d))
names(dat)=c("P1","P2","P3","P4","C1","C2","C3","C4","C5")
Now I want to replace the row values of P1:P4 with NA if one of these values is B and I also want to replace the row values of C1:C5 with NA if one of these values is B. So I want the Dataframe to look like this:
a=c("NA","NA","NA","NA","NA","NA","NA","NA","NA")
b=c(0,1,0,0,1,0,1,0,1)
c=c(0,0,0,0,1,0,0,1,1)
d=c("0","1","0","0","NA","NA","NA","NA","NA")
dat=data.frame(rbind(a,b,c,d))
names(dat)=c("P1","P2","P3","P4","C1","C2","C3","C4","C5")
I hope the problem is understandable and I would appreciate any help.
Considering dat to be the original provided dataframe, I'm providing a comparatively lengthy code for better understanding. Hope it helps.
dat2 <- data.frame()
for(i in 1:nrow(dat)){
datSubset <- with(dat, dat[i,])
col.num.of.B <- which(datSubset == "B", arr.ind = T)[2]
if(is.na(col.num.of.B)){
datSubset <- datSubset
} else if(col.num.of.B < 5) {
datSubset[,c(1:4)] <- NA
} else {
datSubset[,c(5:9)] <- NA
}
dat2 <- rbind(dat2, datSubset)
}
dat2
# P1 P2 P3 P4 C1 C2 C3 C4 C5
# a <NA> <NA> <NA> <NA> NA NA NA NA NA
# b 0 1 0 0 1 0 1 0 1
# c 0 0 0 0 1 0 0 1 1
# d 0 1 0 0 <NA> <NA> <NA> <NA> <NA>
As I understood it... If the value B is found in columns P1 to P4, then set all the values within P1 to P4 to NA.
You can try:
nm <- c("P1", "P2", "P3", "P4")
cols <- which(names(dat) %in% nm)
dat[,cols][any(dat[,cols] == "B")] <- NA
dat
# P1 P2 P3 P4 C1 C2 C3 C4 C5
# a NA NA NA NA NA NA NA NA NA
# b NA NA NA NA 1 0 1 0 1
# c NA NA NA NA 1 0 0 1 1
# d NA NA NA NA 1 0 B NA NA
If you want to apply this to only the first row, then use dat[1,cols][any(dat[,cols] == "B")] <- NA.
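Note that `any()` collapses the whole P1:P4 block into a single TRUE/FALSE, which is why every row is blanked in the output above. A row-wise variant, sketched here under the assumption that each row should be judged on its own, tests each row separately with `rowSums()`. `blank_group` and its `flag` argument are hypothetical names.

```r
a <- c("0", "1", "0", "B", "NA", "NA", "NA", "NA", "NA")
b <- c(0, 1, 0, 0, 1, 0, 1, 0, 1)
c <- c(0, 0, 0, 0, 1, 0, 0, 1, 1)
d <- c("0", "1", "0", "0", "1", "0", "B", "NA", "NA")
dat <- data.frame(rbind(a, b, c, d), stringsAsFactors = FALSE)
names(dat) <- c("P1", "P2", "P3", "P4", "C1", "C2", "C3", "C4", "C5")

# For every row that contains `flag` in any of `cols`, set those columns to NA
blank_group <- function(data, cols, flag = "B") {
  hit <- rowSums(data[cols] == flag, na.rm = TRUE) > 0
  data[hit, cols] <- NA
  data
}

out <- blank_group(dat, paste0("P", 1:4))   # P block: only row a is affected
out <- blank_group(out, paste0("C", 1:5))   # C block: only row d is affected
out
```

Rows b and c are left untouched, and the existing "NA" strings in the data stay as strings; only the flagged blocks become real NA values.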
