remove NA values and combine non NA values into a single column - r

I have a data set which has numeric and NA values in all columns. I would like to create a new column with all non-NA values and preserve the row names.
v1 v2 v3 v4 v5
a 1 NA NA NA NA
b NA 2 NA NA NA
c NA NA 3 NA NA
d NA NA NA 4 NA
e NA NA NA NA 5
I have tried using the coalesce function from dplyr
digital_metrics_FB <- fb_all_data %>%
mutate(fb_metrics = coalesce("v1",
"v2",
"v3",
"v4",
"v5"))
and also tried an apply function
df2 <- sapply(fb_all_data,function(x) x[!is.na(x)])
still cannot get it to work.
I am looking for the final result to be where all non NA values come together in the final column and the row names are preserved
final
a 1
b 2
c 3
d 4
e 5
any help would be much appreciated

We can use pmax
do.call(pmax, c(fb_all_data , na.rm = TRUE))
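A minimal base R sketch of why this works on the example data: with at most one non-NA value per row, the row-wise maximum is simply that value. The result of pmax here is an unnamed vector, so the row names are re-attached by hand; the `final` and `result` names and the data.frame wrapping are illustrative additions, not part of the one-liner above.

```r
# Example data from the question
fb_all_data <- data.frame(
  v1 = c(1L, NA, NA, NA, NA),
  v2 = c(NA, 2L, NA, NA, NA),
  v3 = c(NA, NA, 3L, NA, NA),
  v4 = c(NA, NA, NA, 4L, NA),
  v5 = c(NA, NA, NA, NA, 5L),
  row.names = c("a", "b", "c", "d", "e")
)

# Row-wise maximum across all columns, ignoring NAs
final <- do.call(pmax, c(fb_all_data, na.rm = TRUE))

# The result carries no row names, so attach them explicitly
result <- data.frame(final = final, row.names = rownames(fb_all_data))
print(result)
```

If a row held several non-NA values, this would keep only the largest one.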
If a row can contain more than one non-NA element and you want to combine them into a string, a simple base R option would be
data.frame(final = apply(fb_all_data, 1, function(x) toString(x[!is.na(x)])))
Or using coalesce
library(dplyr)
library(tibble)
fb_all_data %>%
rownames_to_column('rn') %>%
transmute(rn, final = coalesce(v1, v2, v3, v4, v5)) %>%
column_to_rownames('rn')
# final
#a 1
#b 2
#c 3
#d 4
#e 5
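For readers without dplyr, the idea behind coalesce can be approximated in base R. This is only a sketch: `first_non_na` is a hypothetical helper, not a dplyr function. The last line also hints at why the attempt in the question did not work: quoted column names are plain strings, none of them NA, so the first string always wins.

```r
# Example data from the question
fb_all_data <- data.frame(
  v1 = c(1L, NA, NA, NA, NA),
  v2 = c(NA, 2L, NA, NA, NA),
  v3 = c(NA, NA, 3L, NA, NA),
  v4 = c(NA, NA, NA, 4L, NA),
  v5 = c(NA, NA, NA, NA, 5L),
  row.names = c("a", "b", "c", "d", "e")
)

# Hypothetical helper: keep `a` where it is present, otherwise take `b`
first_non_na <- function(a, b) ifelse(is.na(a), b, a)

# Fold the helper over the columns from left to right
final <- Reduce(first_non_na, fb_all_data)
result <- data.frame(final = final, row.names = rownames(fb_all_data))
print(result)

# With quoted names there is nothing missing to fill in
quoted <- Reduce(first_non_na, list("v1", "v2", "v3"))
print(quoted)
```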
Or using tidyverse, for multiple non-NA elements
library(purrr)
fb_all_data %>%
rownames_to_column('rn') %>%
transmute(rn, final = pmap_chr(.[-1], ~ c(...) %>%
na.omit %>%
toString)) %>%
column_to_rownames('rn')
NOTE: The data below is the example shown by the OP, not some other dataset
data
fb_all_data <- structure(list(v1 = c(1L, NA, NA, NA, NA), v2 = c(NA, 2L, NA,
NA, NA), v3 = c(NA, NA, 3L, NA, NA), v4 = c(NA, NA, NA, 4L, NA
), v5 = c(NA, NA, NA, NA, 5L)), class = "data.frame",
row.names = c("a",
"b", "c", "d", "e"))

With tidyverse, you can do:
df %>%
rownames_to_column() %>%
gather(var, val, -1, na.rm = TRUE) %>%
group_by(rowname) %>%
summarise(val = paste(val, collapse = ", "))
rowname val
<chr> <chr>
1 a 1
2 b 2, 3
3 c 3
4 d 4
5 e 5
Sample data to have a row with more than one non-NA value:
df <- read.table(text = " v1 v2 v3 v4 v5
a 1 NA NA NA NA
b NA 2 3 NA NA
c NA NA 3 NA NA
d NA NA NA 4 NA
e NA NA NA NA 5", header = TRUE)

Related

Forming a new column from whichever of two columns isn’t NA [duplicate]

I have a simplified dataframe:
test <- data.frame(
x = c(1,2,3,NA,NA,NA),
y = c(NA, NA, NA, 3, 2, NA),
a = c(NA, NA, NA, NA, NA, TRUE)
)
I want to create a new column rating that takes its value from either column x or column y. The dataset is structured so that whenever there is a numeric value in x, y is NA. If both columns are NA, then the value in rating should be NA.
In this case, the expected output is: 1,2,3,3,2,NA
With coalesce:
library(dplyr)
test %>%
mutate(rating = coalesce(x, y))
x y a rating
1 1 NA NA 1
2 2 NA NA 2
3 3 NA NA 3
4 NA 3 NA 3
5 NA 2 NA 2
6 NA NA TRUE NA
With if_else:
library(dplyr)
test %>%
mutate(rating = if_else(is.na(x),
y, x))
x y a rating
1 1 NA NA 1
2 2 NA NA 2
3 3 NA NA 3
4 NA 3 NA 3
5 NA 2 NA 2
6 NA NA TRUE NA
Here are several solutions.
# Input
test <- data.frame(
x = c(1,2,3,NA,NA,NA),
y = c(NA, NA, NA, 3, 2, NA),
a = c(NA, NA, NA, NA, NA, TRUE)
)
# Base R solution
test$rating <- ifelse(!is.na(test$x), test$x,
ifelse(!is.na(test$y), test$y, NA))
# dplyr solution
library(dplyr)
test <- test %>%
mutate(rating = case_when(!is.na(x) ~ x,
!is.na(y) ~ y,
TRUE ~ NA_real_))
# data.table solution
library(data.table)
setDT(test)
test[, rating := ifelse(!is.na(x), x, ifelse(!is.na(y), y, NA))]
Created on 2022-12-23 with reprex v2.0.2
test <- data.frame(
x = c(1,2,3,NA,NA,NA),
y = c(NA, NA, NA, 3, 2, NA),
a = c(NA, NA, NA, NA, NA, TRUE)
)
test$rating <- dplyr::coalesce(test$x, test$y)

Editing data in rows

I'm trying to convert my data in R, but I can't manage to get the column I want.
My dataset is as below, and the column I want to get is "total", it is the sum of D1 + D2 + D3 + D4 + D5, and ignores "NA".
NR D1 D2 D3 D4 D5 total
A 1 NA NA 1 NA 2
B NA NA NA NA NA NA
C NA 1 NA NA NA 1
It is probably quite a dumb question, but I can't get it.
I already tried:
total <- NA
total <- ifelse(D1==1, 1, total)
total <- ifelse(D2==1, total + 1, total)
total <- ifelse(D3==1, total + 1, total)
total <- ifelse(D4==1, total + 1, total)
total <- ifelse(D5==1, total + 1, total)
But it sets all my rows to "NA"
and I tried:
total <- mutate(dataset, total=D1+D2+D3+D4+D5)
but then I don't get an aggregation of the values of D1 to D5.
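For what it's worth, both attempts end up as NA for the same reason: any comparison or arithmetic involving NA yields NA, and ifelse returns NA wherever its test is NA. A small sketch with the first two columns of the example (the `step1`, `step2` and `plus_sum` names are only for illustration):

```r
D1 <- c(1, NA, NA)
D2 <- c(NA, NA, 1)

total <- NA
total <- ifelse(D1 == 1, 1, total)
step1 <- total   # row A is 1, rows B and C stay NA

# The test is NA for rows A and B, so ifelse gives NA there;
# for row C the "yes" branch is NA + 1, which is NA as well
total <- ifelse(D2 == 1, total + 1, total)
step2 <- total   # every row is NA now

# Plain addition has the same problem: one NA makes the whole sum NA
plus_sum <- D1 + D2

print(step1)
print(step2)
print(plus_sum)
```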
We could use rowSums
df1$total <- rowSums(df1[startsWith(names(df1), "D")], na.rm = TRUE)
df1$total[df1$total == 0] <- NA
Or the same logic in dplyr
library(dplyr)
df1 %>%
mutate(total = na_if(rowSums(select(., starts_with('D')), na.rm = TRUE), 0))
NR D1 D2 D3 D4 D5 total
1 A 1 NA NA 1 NA 2
2 B NA NA NA NA NA NA
3 C NA 1 NA NA NA 1
data
df1 <- structure(list(NR = c("A", "B", "C"), D1 = c(1L, NA, NA), D2 = c(NA,
NA, 1L), D3 = c(NA, NA, NA), D4 = c(1L, NA, NA), D5 = c(NA, NA,
NA), total = c(2L, NA, 1L)), class = "data.frame", row.names = c(NA,
-3L))
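One caveat with turning every 0 into NA: a row whose values genuinely sum to zero would be blanked too. A hedged variant in base R marks the all-NA rows first and only resets those. The extra row "D" with values 1 and -1 is made up for this illustration and is not in the original data.

```r
# Reduced example; row "D" is a hypothetical row whose values sum to zero
df1 <- data.frame(
  NR = c("A", "B", "C", "D"),
  D1 = c(1, NA, NA, 1),
  D2 = c(NA, NA, 1, -1)
)

d_cols <- startsWith(names(df1), "D")

# TRUE for rows where every D column is NA
all_na <- rowSums(!is.na(df1[d_cols])) == 0

df1$total <- rowSums(df1[d_cols], na.rm = TRUE)
df1$total[all_na] <- NA   # only the all-NA rows become NA
print(df1)
```

Here row B ends up NA while row D keeps its legitimate total of 0.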
Here is a solution with c_across and rowwise
library(dplyr)
df %>%
rowwise() %>%
mutate(Total = sum(c_across(D1:D5 & where(is.numeric)), na.rm = TRUE))
Output:
NR D1 D2 D3 D4 D5 Total
<chr> <int> <int> <lgl> <int> <lgl> <int>
1 A 1 NA NA 1 NA 2
2 B NA NA NA NA NA 0
3 C NA 1 NA NA NA 1
data:
structure(list(NR = c("A", "B", "C"), D1 = c(1L, NA, NA), D2 = c(NA,
NA, 1L), D3 = c(NA, NA, NA), D4 = c(1L, NA, NA), D5 = c(NA, NA,
NA)), row.names = c(NA, -3L), class = "data.frame")
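You can try the code below, which counts the non-NA cells in each row and subtracts 1 for the NR column; the count equals the sum only because every non-NA value here is 1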
df$total <- replace(u <- rowSums(!is.na(df)) - 1, u == 0, NA)
which gives
> df
NR D1 D2 D3 D4 D5 total
1 A 1 NA NA 1 NA 2
2 B NA NA NA NA NA NA
3 C NA 1 NA NA NA 1
And also this one:
library(dplyr)
library(purrr)
df1 <- df1[, !names(df1) %in% "total"]
df1 %>%
mutate(total = pmap_dbl(select(cur_data(), starts_with("D")), ~ ifelse(all(is.na(c(...))),
NA, sum(c(...), na.rm = TRUE))))
NR D1 D2 D3 D4 D5 total
1 A 1 NA NA 1 NA 2
2 B NA NA NA NA NA NA
3 C NA 1 NA NA NA 1

R: Return rows with only 1 non-NA value for a set of columns

Suppose I have a data.table with the following data:
colA colB colC result
1 2 3 231
1 NA 2 123
NA 3 NA 345
11 NA NA 754
How would I use dplyr and magrittr to only select the following rows:
colA colB colC result
NA 3 NA 345
11 NA NA 754
The selection criterion is: exactly 1 non-NA value across columns A-C (i.e. colA, colB, colC)
I have been unable to find a similar question; guessing this is an odd situation.
A base R option would be
df[apply(df, 1, function(x) sum(!is.na(x)) == 1), ]
# colA colB colC
#3 NA 3 NA
#4 11 NA NA
A dplyr option is
df %>% filter(rowSums(!is.na(.)) == 1)
Update
In response to your comment, you can do
df[apply(df[, -ncol(df)], 1, function(x) sum(!is.na(x)) == 1), ]
# colA colB colC result
#3 NA 3 NA 345
#4 11 NA NA 754
Or the same in dplyr
df %>% filter(rowSums(!is.na(.[-length(.)])) == 1)
This assumes that the last column is the one you'd like to ignore.
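If depending on the position of the last column feels fragile, the columns to check can be named explicitly instead. A minimal base R sketch, assuming the key columns are colA, colB and colC (`key_cols` and `picked` are illustrative names):

```r
df <- read.table(text = "colA colB colC result
1 2 3 231
1 NA 2 123
NA 3 NA 345
11 NA NA 754", header = TRUE)

# Columns that take part in the check, selected by name
key_cols <- c("colA", "colB", "colC")

# Keep rows with exactly one non-NA value among the key columns
picked <- df[rowSums(!is.na(df[key_cols])) == 1, ]
print(picked)
```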
Sample data
df <-read.table(text = "colA colB colC
1 2 3
1 NA 2
NA 3 NA
11 NA NA", header = T)
Sample data for update
df <- read.table(text =
"colA colB colC result
1 2 3 231
1 NA 2 123
NA 3 NA 345
11 NA NA 754
", header = T)
Another option is filter with map
library(dplyr)
library(purrr)
df %>%
filter(map(select(., starts_with('col')), ~ !is.na(.)) %>%
reduce(`+`) == 1)
# colA colB colC result
#1 NA 3 NA 345
#2 11 NA NA 754
Or another option is to use transmute_at
df %>%
transmute_at(vars(starts_with('col')), ~ !is.na(.)) %>%
reduce(`+`) %>%
magrittr::equals(1) %>% filter(df, .)
# colA colB colC result
#1 NA 3 NA 345
#2 11 NA NA 754
data
df <- structure(list(colA = c(1L, 1L, NA, 11L), colB = c(2L, NA, 3L,
NA), colC = c(3L, 2L, NA, NA), result = c(231L, 123L, 345L, 754L
)), class = "data.frame", row.names = c(NA, -4L))
I think this would be possible with filter_at but I was not able to make it work. Here is one attempt with filter and pmap_lgl where you can specify the range of columns in select or specify by their positions or use other tidyselect helper variables.
library(dplyr)
library(purrr)
df %>%
filter(pmap_lgl(select(., colA:colC), ~sum(!is.na(c(...))) == 1))
# colA colB colC result
#1 NA 3 NA 345
#2 11 NA NA 754
data
df <- structure(list(colA = c(1L, 1L, NA, 11L), colB = c(2L, NA, 3L,
NA), colC = c(3L, 2L, NA, NA), result = c(231L, 123L, 345L, 754L
)), class = "data.frame", row.names = c(NA, -4L))

Sum many rows with some of them have NA in all needed columns

I am trying to do rowSums but I got zero for the last row and I need it to be "NA".
My df is
a b c sum
1 1 4 7 12
2 2 NA 8 10
3 3 5 NA 8
4 NA NA NA NA
I used this code, based on this link: Sum of two Columns of Data Frame with NA Values
df$sum<-rowSums(df[,c("a", "b", "c")], na.rm=T)
Any advice will be greatly appreciated
For each row check if it is all NA and if so return NA; otherwise, apply sum. We have selected columns a, b and c even though that is all the columns because the poster indicated that there might be additional ones.
sum_or_na <- function(x) if (all(is.na(x))) NA else sum(x, na.rm = TRUE)
transform(df, sum = apply(df[c("a", "b", "c")], 1, sum_or_na))
giving:
a b c sum
1 1 4 7 12
2 2 NA 8 10
3 3 5 NA 8
4 NA NA NA NA
Note
df in reproducible form is assumed to be:
df <- structure(list(a = c(1L, 2L, 3L, NA), b = c(4L, NA, 5L, NA),
c = c(7L, 8L, NA, NA)),
row.names = c("1", "2", "3", "4"), class = "data.frame")

Rearrange data by matching columns

I am having issue with rearranging some data.
The original data is:
structure(list(id = 1:3, artery.1 = structure(c(1L, 1L, 2L), .Label = c("a",
"b"), class = "factor"), artery.2 = structure(c(1L, NA, 2L), .Label = c("b",
"c"), class = "factor"), artery.3 = structure(c(1L, NA, 2L), .Label = c("c",
"d"), class = "factor"), artery.4 = structure(c(NA, NA, 1L), .Label = "e", class = "factor"), artery.5 = structure(c(NA, NA, 1L), .Label = "f", class = "factor"),
diameter.1 = c(3L, 2L, 1L), diameter.2 = c(2L, NA, 2L), diameter.3 = c(3L,
NA, 3L), diameter.4 = c(NA, NA, 4L), diameter.5 = c(NA, NA,
5L)), .Names = c("id", "artery.1", "artery.2", "artery.3",
"artery.4", "artery.5", "diameter.1", "diameter.2", "diameter.3",
"diameter.4", "diameter.5"), class = "data.frame", row.names = c(NA,
-3L))
# id artery.1 artery.2 artery.3 artery.4 artery.5 diameter.1 diameter.2 diameter.3 diameter.4 diameter.5
# 1 1 a b c <NA> <NA> 3 2 3 NA NA
# 2 2 a <NA> <NA> <NA> <NA> 2 NA NA NA NA
# 3 3 b c d e f 1 2 3 4 5
I would like to get to this:
structure(list(id = 1:3, a = c(3L, 2L, NA), b = c(2L, NA, 1L),
c = c(3L, NA, 2L), d = c(NA, NA, 3L), e = c(NA, NA, 4L),
f = c(NA, NA, 5L)), .Names = c("id", "a", "b", "c", "d",
"e", "f"), class = "data.frame", row.names = c(NA, -3L))
# id a b c d e f
# 1 1 3 2 3 NA NA NA
# 2 2 2 NA NA NA NA NA
# 3 3 NA 1 2 3 4 5
Basically, a to f represents arteries and the numerical values represent the corresponding diameter. Each row represents a patient.
Is there a neat way to sort this dataframe out?
Modern tidyr makes the solution even more succinct via the pivot_ functions:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(-id, names_pattern = '(artery|diameter)\\.(\\d+)', names_to = c('.value', NA)) %>%
filter(!is.na(artery)) %>%
pivot_wider(names_from = artery, values_from = diameter)
id a b c d e f
<int> <int> <int> <int> <int> <int> <int>
1 1 3 2 3 NA NA NA
2 2 2 NA NA NA NA NA
3 3 NA 1 2 3 4 5
Here is the older solution, which uses the deprecated gather and spread functions:
library(dplyr)
library(tidyr)
new.df <- gather(df, variable, value, artery.1:diameter.5) %>%
separate(variable, c('variable', 'num')) %>%
spread(variable, value) %>%
subset(!is.na(artery)) %>%
mutate(diameter = as.numeric(diameter)) %>%
select(-num) %>%
spread(artery, diameter)
Output:
id a b c d e f
1 1 3 2 3 NA NA NA
2 2 2 NA NA NA NA NA
3 3 NA 1 2 3 4 5
Or using melt/dcast combination with data.table while selecting variables using regex in the patterns function
library(data.table) #v>=1.9.6
dcast(melt(setDT(df),
id = "id",
measure = patterns("artery", "diameter")),
id ~ value1,
sum,
value.var = "value2",
subset = .(!is.na(value2)),
fill = NA)
# id a b c d e f
# 1: 1 3 2 3 NA NA NA
# 2: 2 2 NA NA NA NA NA
# 3: 3 NA 1 2 3 4 5
As you can see, both melt and dcast are very flexible and you can use regex, specify a subset, pass multiple functions and specify how you want to fill missing values.
You can use xtabs with reshape from base R. Use the latter to transform the data to long format and the former to build the table of diameters summed by id and artery (note that combinations with no data show up as 0 rather than NA):
xtabs(diameter ~ id + artery, reshape(df, varying = 2:11, sep = '.', dir = "long"))
# artery
#id a b c d e f
# 1 3 2 3 0 0 0
# 2 2 0 0 0 0 0
# 3 0 1 2 3 4 5
This can be done with two reshape() calls. First, we can longify both artery and diameter on id, then widen with artery as the time variable. To prevent a column of NAs, we also must subset out rows with NA values for artery in the intermediate frame.
reshape(
  subset(
    reshape(df, dir = 'l', varying = setdiff(names(df), 'id'), timevar = NULL),
    !is.na(artery)
  ),
  dir = 'w', timevar = 'artery'
)
## id diameter.a diameter.b diameter.c diameter.d diameter.e diameter.f
## 1.1 1 3 2 3 NA NA NA
## 2.1 2 2 NA NA NA NA NA
## 3.1 3 NA 1 2 3 4 5
The diameter. prefixes can be removed afterward, if desired. However, an advantage of this solution is that it would be capable of preserving multiple column sets, whereas the xtabs() solution cannot. The prefixes would be essential to distinguish the column sets in that case.
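As a further base R sketch, tapply can be used to build the wide table while keeping NA for artery/patient combinations that have no data. The `long` and `wide` names are illustrative, and the arteries are stored as character columns here rather than factors, which is a simplification of the original data.

```r
# Example data from the question, with character instead of factor columns
df <- data.frame(
  id = 1:3,
  artery.1 = c("a", "a", "b"),
  artery.2 = c("b", NA, "c"),
  artery.3 = c("c", NA, "d"),
  artery.4 = c(NA, NA, "e"),
  artery.5 = c(NA, NA, "f"),
  diameter.1 = c(3L, 2L, 1L),
  diameter.2 = c(2L, NA, 2L),
  diameter.3 = c(3L, NA, 3L),
  diameter.4 = c(NA, NA, 4L),
  diameter.5 = c(NA, NA, 5L),
  stringsAsFactors = FALSE
)

artery_cols   <- grep("^artery", names(df))
diameter_cols <- grep("^diameter", names(df))

# Stack the columns; unlist() goes column by column, so the ids repeat per column
long <- data.frame(
  id       = rep(df$id, times = length(artery_cols)),
  artery   = unlist(df[artery_cols], use.names = FALSE),
  diameter = unlist(df[diameter_cols], use.names = FALSE)
)
long <- long[!is.na(long$artery), ]

# One cell per (id, artery) pair; pairs without data stay NA
wide <- with(long, tapply(diameter, list(id = id, artery = artery), sum))
print(wide)
```

The result is a matrix with ids as row names; wrapping it in as.data.frame() would give a data frame if one is needed.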
