Remove rows with specific NA column - r

I have the following dataset, where some entries (unique values of A) never have data in B and others have it only sometimes.
A B
1 NA
2 NA
3 77
1 NA
2 81
I want to delete the entries that always have NA in B and keep the rest:
A B
2 NA
3 77
2 81

We can use ave grouped by A and remove the groups in which every B is NA:
df[!with(df, ave(is.na(B), A, FUN = all)), ]
# A B
#2 2 NA
#3 3 77
#5 2 81
Using the same logic with dplyr
library(dplyr)
df %>%
group_by(A) %>%
filter(!all(is.na(B)))
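To see what the grouped test computes, the intermediate logical vector can be inspected before it is used for subsetting. This is a minimal base R sketch; the data frame is rebuilt here from the question's table, and the names all_na and kept are illustrative only.

```r
# Rebuild the example data from the question
df <- data.frame(A = c(1L, 2L, 3L, 1L, 2L),
                 B = c(NA, NA, 77L, NA, 81L))

# For every row: is B missing in all rows that share this row's A?
all_na <- with(df, ave(is.na(B), A, FUN = all))
all_na

# Keep only the rows whose group has at least one non-missing B
kept <- df[!all_na, ]
kept
```

Only the rows with A equal to 1 are flagged, because that is the only group in which B is always missing.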

Assuming the input shown reproducibly in the Note at the end, for each group defined by A we return TRUE if any of its elements in B are not NA.
subset(DF, ave(!is.na(B), A, FUN = any))
Note
Lines <- "
A B
1 NA
2 NA
3 77
1 NA
2 81"
DF <- read.table(text = Lines, header = TRUE)

We can use data.table
library(data.table)
setDT(df1)[, .SD[any(!is.na(B))], A]
# A B
#1: 2 NA
#2: 2 81
#3: 3 77
data
df1 <- structure(list(A = c(1L, 2L, 3L, 1L, 2L), B = c(NA, NA, 77L,
NA, 81L)), class = "data.frame", row.names = c(NA, -5L))

Related

Using case_when() within mutate_at() to recode several columns with different types of NA

Given the data:
df <- structure(list(cola = structure(c(5L, 9L, 6L, 2L, 7L, 10L, 3L,
8L, 1L, 4L), .Label = c("a", "b", "d", "g", "q", "r", "t", "w",
"x", "z"), class = "factor"), colb = c(156L, 8L, 6L, 100L, 49L,
31L, 189L, 77L, 154L, 171L), colc = c(0.207140279468149, 0.51990159181878,
0.402017514919862, 0.382948065642267, 0.488511856179684, 0.263168515404686,
0.38591041485779, 0.774066215148196, 0.763264901703224, 0.474355421960354
), cold = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("a",
"b"), class = "factor")), class = "data.frame", row.names = c(NA,
-10L))
df
# cola colb colc cold
# 1 q 156 0.2071403 a
# 2 x 8 0.5199016 b
# 3 r 6 0.4020175 a
# 4 b 100 0.3829481 b
# 5 t 49 0.4885119 a
# 6 z 31 0.2631685 b
# 7 d 189 0.3859104 a
# 8 w 77 0.7740662 b
# 9 a 154 0.7632649 a
# 10 g 171 0.4743554 b
If the value in colc in a particular row is >= 0.5, I would like to replace the contents of all the other cells in that row with NA, except for the contents of cold for that row (which I would like to retain as it is).
I attempted this with dplyr::mutate_at() and base::ifelse(), and it works fine:
df %>% mutate_at(vars(-c(cold)), funs(ifelse(colc >= 0.5, NA, .)))
# cola colb colc cold
# 1 5 156 0.2071403 a
# 2 NA NA NA b
# 3 6 6 0.4020175 a
# 4 2 100 0.3829481 b
# 5 7 49 0.4885119 a
# 6 10 31 0.2631685 b
# 7 3 189 0.3859104 a
# 8 NA NA NA b
# 9 NA NA NA a
# 10 4 171 0.4743554 b
But I would like to do this with dplyr::case_when(), as I might have more than one replacement condition to fulfill (e.g., replace with "foo" if colc < 0.5 & colc >= 0.3). But case_when() does not appear to be playing nice:
df %>% mutate_at(vars(-c(cold)), funs(case_when(colc >= 0.5 ~ NA, TRUE ~ .)))
Error: must be a logical vector, not a factor object
Why is this happening and what can I do to fix it? I assume this is because I am trying to convert multiple columns with different data types to NA. I tried to look for a solution online, but I wasn't able to find one.
Edit: specifically, I would like to preserve the data types of the various columns as they are.
library(dplyr)
df %>%
mutate_at(vars(-c(cold)), ~ case_when(colc >= 0.5 ~ `is.na<-`(., TRUE), TRUE ~ .))
# cola colb colc cold
# 1 q 156 0.2071403 a
# 2 <NA> NA NA b
# 3 r 6 0.4020175 a
# 4 b 100 0.3829481 b
# 5 t 49 0.4885119 a
# 6 z 31 0.2631685 b
# 7 d 189 0.3859104 a
# 8 <NA> NA NA b
# 9 <NA> NA NA a
# 10 g 171 0.4743554 b
Description
When you use case_when to assign NA, you need to specify the type of NA, i.e. NA_integer_, NA_real_, NA_complex_ or NA_character_. However, mutate_at transforms multiple columns at once and these columns have different types, so a single typed NA cannot serve all of them. Ideally there would be something like an NA_guess that infers the type, but I have not found one so far. This method is a little tricky: I use the replacement function is.na<- to set every element of the input vector to NA, and the resulting NAs have the same type as the input vector. For example:
x <- 1:5
is.na(x) <- TRUE ; x
# [1] NA NA NA NA NA
class(x)
# [1] "integer"
y <- letters[1:5]
is.na(y) <- TRUE ; y
# [1] NA NA NA NA NA
class(y)
# [1] "character"
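The same replacement function also works on factors, which is the column type that triggered the error in the question. A small sketch, assuming only base R (the vector f and the result names are made up for illustration):

```r
# A small factor standing in for a column like cola
f <- factor(c("q", "x", "r"))

# Set every element to NA; the result is still a factor with the same levels
f_all <- `is.na<-`(f, TRUE)
class(f_all)
levels(f_all)

# A logical vector marks only selected elements as NA
f_one <- `is.na<-`(f, c(FALSE, TRUE, FALSE))
f_one
```

Because the class and levels survive, the NA branch and the fallback branch of case_when end up with the same type.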
A workaround similar to @NelsonGon's:
library(dplyr)
df %>%
mutate_all(as.character) %>%
mutate_at(vars(-c(cold)),
~case_when(colc >= 0.5 ~ NA_character_, # ifelse(is.numeric(.), NA_real_, NA_character_),
TRUE ~ .
)
) %>%
mutate(colb = as.numeric(colb),
colc = as.numeric(colc)
)
#> cola colb colc cold
#> 1 q 156 0.2071403 a
#> 2 <NA> NA NA b
#> 3 r 6 0.4020175 a
#> 4 b 100 0.3829481 b
#> 5 t 49 0.4885119 a
#> 6 z 31 0.2631685 b
#> 7 d 189 0.3859104 a
#> 8 <NA> NA NA b
#> 9 <NA> NA NA a
#> 10 g 171 0.4743554 b

R: Return rows with only 1 non-NA value for a set of columns

Suppose I have a data.table with the following data:
colA colB colC result
1 2 3 231
1 NA 2 123
NA 3 NA 345
11 NA NA 754
How would I use dplyr and magrittr to only select the following rows:
colA colB colC result
NA 3 NA 345
11 NA NA 754
The selection criterion is: exactly 1 non-NA value across columns A-C (i.e. colA, colB, colC).
I have been unable to find a similar question; guessing this is an odd situation.
A base R option would be
df[apply(df, 1, function(x) sum(!is.na(x)) == 1), ]
# colA colB colC
#3 NA 3 NA
#4 11 NA NA
A dplyr option is
df %>% filter(rowSums(!is.na(.)) == 1)
Update
In response to your comment, you can do
df[apply(df[, -ncol(df)], 1, function(x) sum(!is.na(x)) == 1), ]
# colA colB colC result
#3 NA 3 NA 345
#4 11 NA NA 754
Or the same in dplyr
df %>% filter(rowSums(!is.na(.[-length(.)])) == 1)
This assumes that the last column is the one you'd like to ignore.
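If the columns to check are easier to identify by name than by position, the same test can be written against an explicit column vector. This is a base R sketch under that assumption; the names cols and res are illustrative, and the data are copied from the question.

```r
# Example data from the question
df <- data.frame(colA = c(1, 1, NA, 11),
                 colB = c(2, NA, 3, NA),
                 colC = c(3, 2, NA, NA),
                 result = c(231, 123, 345, 754))

# Columns that take part in the NA count
cols <- c("colA", "colB", "colC")

# Keep the rows with exactly one non-NA value among those columns
res <- df[rowSums(!is.na(df[cols])) == 1, ]
res
```

Naming the columns keeps the filter correct even if further columns are added after result.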
Sample data
df <-read.table(text = "colA colB colC
1 2 3
1 NA 2
NA 3 NA
11 NA NA", header = T)
Sample data for update
df <- read.table(text =
"colA colB colC result
1 2 3 231
1 NA 2 123
NA 3 NA 345
11 NA NA 754
", header = T)
Another option is filter with map
library(dplyr)
library(purrr)
df %>%
filter(map(select(., starts_with('col')), ~ !is.na(.)) %>%
reduce(`+`) == 1)
# colA colB colC result
#1 NA 3 NA 345
#2 11 NA NA 754
Or another option is to use transmute_at
df %>%
transmute_at(vars(starts_with('col')), ~ !is.na(.)) %>%
reduce(`+`) %>%
magrittr::equals(1) %>% filter(df, .)
# colA colB colC result
#1 NA 3 NA 345
#2 11 NA NA 754
data
df <- structure(list(colA = c(1L, 1L, NA, 11L), colB = c(2L, NA, 3L,
NA), colC = c(3L, 2L, NA, NA), result = c(231L, 123L, 345L, 754L
)), class = "data.frame", row.names = c(NA, -4L))
I think this would be possible with filter_at, but I was not able to make it work. Here is one attempt with filter and pmap_lgl, where you can specify the range of columns in select, give them by position, or use other tidyselect helpers.
library(dplyr)
library(purrr)
df %>%
filter(pmap_lgl(select(., colA:colC), ~sum(!is.na(c(...))) == 1))
# colA colB colC result
#1 NA 3 NA 345
#2 11 NA NA 754
data
df <- structure(list(colA = c(1L, 1L, NA, 11L), colB = c(2L, NA, 3L,
NA), colC = c(3L, 2L, NA, NA), result = c(231L, 123L, 345L, 754L
)), class = "data.frame", row.names = c(NA, -4L))

Sum many rows when some of them have NA in all needed columns

I am trying to do rowSums but I got zero for the last row and I need it to be "NA".
My df is
a b c sum
1 1 4 7 12
2 2 NA 8 10
3 3 5 NA 8
4 NA NA NA NA
I used this code, based on this link: Sum of two Columns of Data Frame with NA Values
df$sum<-rowSums(df[,c("a", "b", "c")], na.rm=T)
Any advice will be greatly appreciated
For each row check if it is all NA and if so return NA; otherwise, apply sum. We have selected columns a, b and c even though that is all the columns because the poster indicated that there might be additional ones.
sum_or_na <- function(x) if (all(is.na(x))) NA else sum(x, na.rm = TRUE)
transform(df, sum = apply(df[c("a", "b", "c")], 1, sum_or_na))
giving:
a b c sum
1 1 4 7 12
2 2 NA 8 10
3 3 5 NA 8
4 NA NA NA NA
Note
df in reproducible form is assumed to be:
df <- structure(list(a = c(1L, 2L, 3L, NA), b = c(4L, NA, 5L, NA),
c = c(7L, 8L, NA, NA)),
row.names = c("1", "2", "3", "4"), class = "data.frame")
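For a large data frame, a vectorised variant of the same idea may be preferable to calling a function on every row. The following is a sketch only, assuming the three columns are numeric; the names cols and all_missing are illustrative.

```r
# Example data from the question
df <- data.frame(a = c(1L, 2L, 3L, NA),
                 b = c(4L, NA, 5L, NA),
                 c = c(7L, 8L, NA, NA))

cols <- c("a", "b", "c")

# Rows in which every selected column is NA
all_missing <- rowSums(!is.na(df[cols])) == 0

# Sum while ignoring NAs, then restore NA for the all-missing rows
df$sum <- rowSums(df[cols], na.rm = TRUE)
df$sum[all_missing] <- NA
df
```

The first rowSums call counts non-missing values per row, so the zero that na.rm = TRUE produces for an empty row is overwritten with NA.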

R split each row of a dataframe into two rows

I would like to split each row of a (numeric) data frame into two rows. For example, part of the original data frame looks like this (nrow(original data frame) > 2800000):
ID X Y Z value_1 value_2
1 3 2 6 22 54
6 11 5 9 52 71
3 7 2 5 2 34
5 10 7 1 23 47
After splitting each row, we get:
ID X Y Z
1 3 2 6
22 54 NA NA
6 11 5 9
52 71 NA NA
3 7 2 5
2 34 NA NA
5 10 7 1
23 47 NA NA
The "value_1" and "value_2" columns are split off and their elements are placed in a new row. For example, value_1 = 22 and value_2 = 54 form a new row.
Here is one option with data.table. We convert the 'data.frame' to 'data.table' by creating a column of rownames (setDT(df1, keep.rownames = TRUE)). Subset the columns 1:5 and 1, 6, 7 in a list, rbind the list element with fill = TRUE option to return NA for corresponding columns that are not found in one of the datasets, order by the row number ('rn') and assign (:=) the row number column to 'NULL'.
library(data.table)
setDT(df1, keep.rownames = TRUE)[]
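# rn is a character column, so convert it before ordering;
# as text, "10" would sort before "2" on larger data
rbindlist(list(df1[, 1:5, with = FALSE], setnames(df1[, c(1, 6:7),
with = FALSE], 2:3, c("ID", "X"))), fill = TRUE)[order(as.integer(rn))][, rn:= NULL][]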
# ID X Y Z
#1: 1 3 2 6
#2: 22 54 NA NA
#3: 6 11 5 9
#4: 52 71 NA NA
#5: 3 7 2 5
#6: 2 34 NA NA
#7: 5 10 7 1
#8: 23 47 NA NA
A hadleyverse version of the above logic would be
library(dplyr)
tibble::rownames_to_column(df1[1:4]) %>%
bind_rows(., setNames(tibble::rownames_to_column(df1[5:6]),
c("rowname", "ID", "X"))) %>%
arrange(as.integer(rowname)) %>% # numeric order; as text, "10" sorts before "2"
select(-rowname)
# ID X Y Z
#1 1 3 2 6
#2 22 54 NA NA
#3 6 11 5 9
#4 52 71 NA NA
#5 3 7 2 5
#6 2 34 NA NA
#7 5 10 7 1
#8 23 47 NA NA
data
df1 <- structure(list(ID = c(1L, 6L, 3L, 5L), X = c(3L, 11L, 7L, 10L
), Y = c(2L, 5L, 2L, 7L), Z = c(6L, 9L, 5L, 1L), value_1 = c(22L,
52L, 2L, 23L), value_2 = c(54L, 71L, 34L, 47L)), .Names = c("ID",
"X", "Y", "Z", "value_1", "value_2"), class = "data.frame",
row.names = c(NA, -4L))
Here's a (very slow) pure R solution using no extra packages:
# Replicate your matrix
input_df <- data.frame(ID = rnorm(10000),
X = rnorm(10000),
Y = rnorm(10000),
Z = rnorm(10000),
value_1 = rnorm(10000),
value_2 = rnorm(10000))
# Preallocate memory to a data frame
output_df <- data.frame(
matrix(
nrow = nrow(input_df)*2,
ncol = ncol(input_df)-2))
# Loop through each input row in turn.
# Put its first four elements into output row 2*i - 1,
# and the last two, padded with two NAs, into output row 2*i.
for (i in seq_len(nrow(input_df))) {
  output_df[2 * i - 1, ] <- input_df[i, 1:4]
  output_df[2 * i, ] <- c(input_df[i, 5:6], NA, NA)
}
colnames(output_df) <- c("ID", "X", "Y", "Z")
Which results in
> head(output_df)
X1 X2 X3 X4
1 0.5529417 -0.93859275 2.0900276 -2.4023800
2 0.9751090 0.13357075 NA NA
3 0.6753835 0.07018647 0.8529300 -0.9844643
4 1.6405939 0.96133195 NA NA
5 0.3378821 -0.44612782 -0.8176745 0.2759752
6 -0.8910678 -0.37928353 NA NA
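The row-by-row loop above can probably be avoided by stacking the two halves and then interleaving them with a single index vector. This is a base R sketch on the question's four example rows; top, bottom, stacked, idx and out are illustrative names, not part of any answer above.

```r
# Example data from the question
df1 <- data.frame(ID = c(1L, 6L, 3L, 5L), X = c(3L, 11L, 7L, 10L),
                  Y = c(2L, 5L, 2L, 7L), Z = c(6L, 9L, 5L, 1L),
                  value_1 = c(22L, 52L, 2L, 23L),
                  value_2 = c(54L, 71L, 34L, 47L))
n <- nrow(df1)

# First half: the original ID, X, Y, Z columns
top <- df1[, 1:4]

# Second half: value_1 and value_2 moved under ID and X, padded with NA
bottom <- data.frame(ID = df1$value_1, X = df1$value_2, Y = NA, Z = NA)

# Stack both halves, then interleave: rows 1, n+1, 2, n+2, ...
stacked <- rbind(top, bottom)
idx <- as.vector(rbind(seq_len(n), n + seq_len(n)))
out <- stacked[idx, ]
rownames(out) <- NULL
out
```

All the work happens in one rbind and one subsetting step, which should scale far better than assigning into a data frame inside a loop.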
This should work
data <- read.table(text= "ID X Y Z value_1 value_2
1 3 2 6 22 54
6 11 5 9 52 71
3 7 2 5 2 34
5 10 7 1 23 47", header=T)
data1 <- data[,1:4]
data2 <- data[, setdiff(names(data), names(data1))]
names(data2) <- names(data1)[1:ncol(data2)]
combined <- plyr::rbind.fill(data1,data2)
n <- nrow(data1)
combined[kronecker(1:n, c(0, n), "+"),]
Though why you would need to do this beats me.

Find the last row in a data frame that meets certain criteria

I'm looking for a way to refer to a previous row in my data frame that has one column value in common with the 'current row'. Basically, if this were my data frame
A B D
1 10
4 5
6 6
3 25
1 40
I would want D(i) to contain the B value of the last row for which A has the same value as A(i). So for the last row that should be 10.
You could try this:
for(i in seq_len(nrow(dat))) {
try(dat$D[i] <- dat$B[tail(which(dat$A[seq_len(i - 1)] == dat$A[i]),1)],silent=TRUE)
}
Results:
> dat
A B D
1 1 10 NA
2 4 5 NA
3 6 6 NA
4 3 25 NA
5 1 40 10
Data:
dat <- read.csv(text="A,B,D
1,10
4,5
6,6
3,25
1,40")
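The same result can be obtained without an explicit loop by shifting B within each group of A. This is a base R sketch using ave; the data frame is rebuilt from the question, and the anonymous helper function is only an illustration.

```r
# Example data from the question
dat <- data.frame(A = c(1, 4, 6, 3, 1),
                  B = c(10, 5, 6, 25, 40))

# Within each group of A, shift B down by one position:
# the first row of a group gets NA, later rows get the previous B
dat$D <- ave(dat$B, dat$A, FUN = function(b) c(NA, head(b, -1)))
dat
```

Since ave keeps the original row order, D lines up with the rows of dat without any re-sorting.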
You may try
library(dplyr)
df1%>%
group_by(A) %>%
mutate(D=lag(B))
# A B D
#1 1 10 NA
#2 4 5 NA
#3 6 6 NA
#4 3 25 NA
#5 1 40 10
Or
library(data.table)#data.table_1.9.5
setDT(df1)[, D:=shift(B), A][]
data
df1 <- structure(list(A = c(1L, 4L, 6L, 3L, 1L), B = c(10L, 5L, 6L,
25L, 40L)), .Names = c("A", "B"), class = "data.frame",
row.names = c(NA, -5L))
