"summarize" multiple incomplete columns to 1 summary column [duplicate] - r

I have some columns in R, and for each row there will only ever be a value in one of them; the rest will be NAs. I want to combine these into one column holding the non-NA value. Does anyone know of an easy way of doing this? For example, I could have the following:
data <- data.frame('a' = c('A','B','C','D','E'),
'x' = c(1,2,NA,NA,NA),
'y' = c(NA,NA,3,NA,NA),
'z' = c(NA,NA,NA,4,5))
So I would have
'a' 'x' 'y' 'z'
A 1 NA NA
B 2 NA NA
C NA 3 NA
D NA NA 4
E NA NA 5
And I would like to get
'a' 'mycol'
A 1
B 2
C 3
D 4
E 5
The names of the columns containing NAs change depending on code earlier in the query, so I won't be able to call the column names explicitly. However, I have the names of the columns which contain NAs stored as a vector, e.g. in this example cols <- c('x','y','z'), so I could call the columns using data[, cols].
Any help would be appreciated.
Thanks

A dplyr::coalesce-based solution could be:
library(dplyr)
data %>% mutate(mycol = coalesce(x,y,z)) %>%
select(a, mycol)
# a mycol
# 1 A 1
# 2 B 2
# 3 C 3
# 4 D 4
# 5 E 5
Data
data <- data.frame('a' = c('A','B','C','D','E'),
'x' = c(1,2,NA,NA,NA),
'y' = c(NA,NA,3,NA,NA),
'z' = c(NA,NA,NA,4,5))

You can use unlist to turn the columns into one vector. Afterwards, na.omit can be used to remove the NAs.
cbind(data[1], mycol = na.omit(unlist(data[-1])))
a mycol
x1 A 1
x2 B 2
y3 C 3
z4 D 4
z5 E 5
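One thing worth checking with this approach: unlist() stacks the columns one after another, so the values come out in column order, which only matches row order when the non-NA values happen to move left to right as in the example. A minimal sketch, using a hypothetical two-row data frame d2 (not the question's data), of where the order breaks and one possible base R workaround:

```r
# Hypothetical data: row 1 has its value in z, row 2 has its value in x
d2 <- data.frame(a = c("A", "B"), x = c(NA, 2), z = c(1, NA))

# unlist() stacks x first, then z, so the values come out column by column
stacked <- as.vector(na.omit(unlist(d2[-1])))

# Transposing first makes as.vector() walk the values row by row instead
m <- as.matrix(d2[-1])
by_row <- as.vector(na.omit(as.vector(t(m))))

stacked  # 2 1 (does not line up with rows A, B)
by_row   # 1 2 (lines up with rows A, B)
```

Both versions still assume exactly one non-NA value per row; otherwise the result will not have one element per row.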

Here's a more general (but even simpler) solution which extends to all column types (factors, characters etc.) with non-ordered NA's. The strategy is simply to merge the non-NA values of other columns into your merged column using is.na for indexing:
data$mycol = data$x # your new merged column. Start with x
data$mycol[!is.na(data$y)] = data$y[!is.na(data$y)] # merge with y
data$mycol[!is.na(data$z)] = data$z[!is.na(data$z)] # merge with z
> data
a x y z mycol
1 A 1 NA NA 1
2 B 2 NA NA 2
3 C NA 3 NA 3
4 D NA NA 4 4
5 E NA NA 5 5
Note that this will overwrite existing values in mycol if there are several non-NA values in the same row. If you have a lot of columns you could automate this by looping over colnames(data).
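The loop mentioned above could look roughly like this. It is a sketch that assumes the column names are stored in cols, as in the question, and it differs from the three-line version in one respect: it only fills positions that are still NA, so the first non-NA value in a row wins instead of the last.

```r
data <- data.frame(a = c("A", "B", "C", "D", "E"),
                   x = c(1, 2, NA, NA, NA),
                   y = c(NA, NA, 3, NA, NA),
                   z = c(NA, NA, NA, 4, 5))
cols <- c("x", "y", "z")  # names of the columns to merge, as in the question

# Start from the first column, then fill remaining gaps from each later column
data$mycol <- data[[cols[1]]]
for (col in cols[-1]) {
  fill <- is.na(data$mycol) & !is.na(data[[col]])
  data$mycol[fill] <- data[[col]][fill]
}

data[, c("a", "mycol")]
```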

I would use rowSums() with the na.rm = TRUE argument:
cbind.data.frame(a=data$a, mycol = rowSums(data[, -1], na.rm = TRUE))
which gives:
> cbind.data.frame(a=data$a, mycol = rowSums(data[, -1], na.rm = TRUE))
a mycol
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
You have to call the method directly (cbind.data.frame) as the first argument above is not a data frame.
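One caveat worth checking before relying on sums: with na.rm = TRUE, a row that is entirely NA sums to 0 rather than NA, and a row with several values is added up. A minimal sketch, using a small hypothetical data frame vals (not the question's data), of how the all-NA case could be restored:

```r
# Hypothetical data: row 2 is all NA, row 3 has two values
vals <- data.frame(x = c(1, NA, 2), y = c(NA, NA, 3))

s <- unname(rowSums(vals, na.rm = TRUE))  # all-NA row becomes 0, row 3 becomes 5

# Put NA back wherever a row had no non-NA value at all
s[rowSums(!is.na(vals)) == 0] <- NA

s  # 1 NA 5
```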

Something like this ?
data.frame(a=data$a, mycol=apply(data[,-1],1,sum,na.rm=TRUE))
gives :
a mycol
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5

max works too. It also works on string vectors.
cbind(data[1], mycol=apply(data[-1], 1, max, na.rm=TRUE))

One possibility using dplyr and tidyr could be:
library(dplyr)
library(tidyr)
data %>%
gather(variables, mycol, -1, na.rm = TRUE) %>%
select(-variables)
a mycol
1 A 1
2 B 2
8 C 3
14 D 4
15 E 5
Here it transforms the data from wide to long format, excluding the first column from this operation and removing the NAs.

In a related link (suppress NAs in paste()) I present a version of paste with a na.rm option (with the unfortunate name of paste5).
With this the code becomes
cols <- c("x", "y", "z")
cbind.data.frame(a = data$a, mycol = paste5(data[, cols], na.rm = TRUE))
The output of paste5 is a character vector, which works if you have character data; otherwise you'll need to coerce it to the type you want.

This is not the OP's case, but since some people like the sum-based approach, the same idea can be extended to the mean and the mode to make the answer more general. This answer matches the title, which is what many people will find.
data <- data.frame('a' = c('A','B','C','D','E'),
'x' = c(1,2,NA,NA,9),
'y' = c(NA,6,3,NA,5),
'z' = c(NA,NA,NA,4,5))
splitdf<-split(data[,c(2:4)], seq(nrow(data[,c(2:4)])))
data$mean<-unlist(lapply(splitdf, function(x) mean(unlist(x), na.rm=T) ) )
data$mode<-unlist(lapply(splitdf, function(x) {
tab <- tabulate(match(x, na.omit(unique(unlist(x) ))));
paste(na.omit(unique(unlist(x) ))[tab == max(tab) ], collapse = ", " )}) )
data
a x y z mean mode
1 A 1 NA NA 1.000000 1
2 B 2 6 NA 4.000000 2, 6
3 C NA 3 NA 3.000000 3
4 D NA NA 4 4.000000 4
5 E 9 5 5 6.333333 5

If you want to stick with base,
data <- data.frame('a' = c('A','B','C','D','E'),'x' = c(1,2,NA,NA,NA),'y' = c(NA,NA,3,NA,NA),'z' = c(NA,NA,NA,4,5))
data[is.na(data)]<-","
data$mycol<-paste0(data$x,data$y,data$z)
data$mycol <- gsub(',','',data$mycol)

Related

calculated columns in new datatable without altering the original

I have a dataset which looks like this:
set.seed(43)
dt <- data.table(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10),
e = sample(c("x","y"),10,replace = T),
f=sample(c("t","s"),10,replace = T)
)
I need (for example) a count of negative values in columns 1:4 for each value of e and f. The result would have to look like this:
e neg_a_count neg_b_count neg_c_count neg_d_count
1: x 6 3 5 3
2: y 2 1 3 NA
1: s 4 2 3 1
2: t 4 2 5 2
Here's my code:
for (k in 5:6) { #these are the *by* columns
for (i in 1:4) {#these are the columns whose negative values i'm counting
n=paste("neg",names(dt[,i,with=F]),"count","by",names(dt[,k,with=F]),sep="_")
dt[dt[[i]]<0, (n):=.N, by=names(dt[,k,with=F])]
}
}
dcast(unique(melt(dt[,5:14], id=1, measure=3:6))[!is.na(value),],e~variable)
dcast(unique(melt(dt[,5:14], id=2, measure=7:10))[!is.na(value),],f~variable)
which obviously produces two tables, not one:
e neg_a_count_by_e neg_b_count_by_e neg_c_count_by_e neg_d_count_by_e
1: x 6 3 5 3
2: y 2 1 3 NA
f neg_a_count_by_f neg_b_count_by_f neg_c_count_by_f neg_d_count_by_f
1: s 4 2 3 1
2: t 4 2 5 2
These need to be combined with rbind to produce one table.
This approach modifies dt by adding eight additional columns (4 data columns x 2 by columns), and the counts related to the levels of e and f get recycled (as expected). I was wondering if there is a cleaner way to achieve the result, one which does not modify dt. Also, casting after melting seems inefficient, there should be a better way, especially since my dataset has several e and f-like columns.
If there are only two grouping columns, we could call rbindlist after grouping by each of them separately:
rbindlist(list(dt[,lapply(.SD, function(x) sum(x < 0)) , .(e), .SDcols = a:d],
dt[,lapply(.SD, function(x) sum(x < 0)) , .(f), .SDcols = a:d]))
# e a b c d
#1: y 2 1 3 0
#2: x 6 3 5 3
#3: s 4 2 3 1
#4: t 4 2 5 2
Or make it more dynamic by looping through the grouping column names
rbindlist(lapply(c('e', 'f'), function(x) dt[, lapply(.SD,
function(.x) sum(.x < 0)), by = x, .SDcols = a:d]))
You can melt before aggregating as follows:
cols <- c("a","b","c", "d")
melt(dt, id.vars=cols)[,
lapply(.SD, function(x) sum(x < 0)), by=value, .SDcols=cols]
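The rbindlist approach can be tried on a small deterministic table to confirm that dt is left untouched. This is a sketch with made-up values; the names value_cols, group_cols, group and grouping are illustrative and not from the question.

```r
library(data.table)

# Small deterministic table (hypothetical values, not the question's random data)
dt <- data.table(
  a = c(-1, -2, -3, 4),
  b = c(1, -2, 3, 4),
  e = c("x", "x", "y", "y"),
  f = c("s", "t", "s", "t")
)
value_cols <- c("a", "b")  # columns whose negative values are counted
group_cols <- c("e", "f")  # the "by" columns

# One aggregated table per grouping column, stacked with rbindlist().
# No := is applied to dt, so dt keeps its original four columns.
res <- rbindlist(lapply(group_cols, function(g) {
  out <- dt[, lapply(.SD, function(v) sum(v < 0)), by = g, .SDcols = value_cols]
  setnames(out, g, "group")   # common name so the pieces stack cleanly
  out[, grouping := g][]      # remember which column produced each row
}))

res
```

Keeping a grouping column in the result avoids ambiguity if two grouping columns ever share a level name.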

Replace NA values using if statement based on group by

I am looking to do the following in a more elegant manner in R. I believe there is a way but just can't wrap my head around it. The problem is as follows.
I have a df which contains NAs. Within each group of A, I want to turn the NAs into zeros if the group has at least one non-NA value (so its sum is not NA); if every value in the group is NA, they should stay NA. The example below should make it clear.
A<-c("A", "A", "A", "A",
"B","B","B","B",
"C","C","C","C")
B<-c(1,NA,NA,1,NA,NA,NA,NA,2,1,2,3)
data<-data.frame(A,B)
This is what the data looks like:
A B
1 A 1
2 A NA
3 A NA
4 A 1
5 B NA
6 B NA
7 B NA
8 B NA
9 C 2
10 C 1
11 C 2
12 C 3
And am looking to get a result as per the following
A B
1 A 1
2 A 0
3 A 0
4 A 1
5 B NA
6 B NA
7 B NA
8 B NA
9 C 2
10 C 1
11 C 2
12 C 3
I know I can use an inner join by creating a table first and then making an IF statement based on that table, but I was wondering if there is a way to do it in one or two lines of code in R.
Following is the solution related to the inner join I was referring to
library(dplyr)
sum_NA <- function(x) if(all(is.na(x))) NA_integer_ else sum(x, na.rm=TRUE)
data2 <- data %>% group_by(A) %>% summarize(x = sum_NA(B), Y =
ifelse(is.na(x), TRUE, FALSE))
data2
data2_1 <- right_join(data, data2, by = "A")
data <- mutate(data2_1, B = ifelse(Y == FALSE & is.na(B), 0,B))
data <- select(data, - Y,-x)
data
Maybe a solution like this would work:
data$B[is.na(data$B) & data$A %in% unique(na.omit(data)$A)] <- 0
Here you're asking:
if B is NA
if A is within letters that have non-NA values
Then make those values 0.
Or similarly, with ifelse():
data$B <- ifelse(is.na(data$B) & data$A %in% unique(na.omit(data)$A), 0, data$B)
or with dplyr:
library(dplyr)
data %>%
mutate(B=ifelse(is.na(B) & A %in% unique(na.omit(data)$A), 0, B))
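The same rule can also be written group by group in base R with ave(), which applies a function within each level of A and returns the results in the original row order. This is a sketch of an alternative, not the method used in the answers above:

```r
data <- data.frame(
  A = rep(c("A", "B", "C"), each = 4),
  B = c(1, NA, NA, 1, NA, NA, NA, NA, 2, 1, 2, 3)
)

# Within each group of A: leave an all-NA group alone,
# otherwise replace the NAs in that group with 0
data$B <- ave(data$B, data$A, FUN = function(v) {
  if (all(is.na(v))) v else replace(v, is.na(v), 0)
})

data
```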

How to combine two columns of a data-frame with missing data? [duplicate]

This is an extension of this earlier question. How can I combine two columns of a data frame as
data <- data.frame('a' = c('A','B','C','D','E'),
'x' = c("t",2,NA,NA,NA),
'y' = c(NA,NA,NA,4,"r"))
displayed as
'a' 'x' 'y'
A t NA
B 2 NA
C NA NA
D NA 4
E NA r
to get
'a' 'mycol'
A t
B 2
C NA
D 4
E r
I tried this
cbind(data[1], mycol = na.omit(unlist(data[-1])))
But it obviously doesn't keep the NA row.
You could do it by using ifelse, like this (the output below is from a numeric version of the data, with 1 and 2 in x and 4 and 5 in y):
data$mycol <- ifelse(!is.na(data$x), data$x, data$y)
> data
## a x y mycol
## 1 A 1 NA 1
## 2 B 2 NA 2
## 3 C NA NA NA
## 4 D NA 4 4
## 5 E NA 5 5
Going with your logic, you can do the following, returning NA for rows where every value is missing and the first non-NA value otherwise:
cbind(data[1], mycol = apply(data[2:3], 1, function(i)
if (all(is.na(i))) NA else na.omit(i)[1]))
# a mycol
#1 A 1
#2 B 2
#3 C NA
#4 D 4
#5 E 5
This has been addressed here indirectly. Here is a simple solution based on that:
data$mycol <- coalesce(data$x, data$y)
Extending the answer to any number of columns, and using the neat max.col() function I've discovered thanks to this question:
coalesce <- function(value_matrix) {
value_matrix <- as.matrix(value_matrix)
first_non_missing <- max.col(!is.na(value_matrix), ties.method = "first")
indices <- cbind(
row = seq_len(nrow(value_matrix)),
col = first_non_missing
)
value_matrix[indices]
}
data$mycol <- coalesce(data[, c('x', 'y')])
data
# a x y mycol
# 1 A 1 NA 1
# 2 B 2 NA 2
# 3 C NA NA NA
# 4 D NA 4 4
# 5 E NA 5 5
max.col(..., ties.method = "first") returns, for each row, the index of the first column holding the row maximum. On the logical matrix !is.na(value_matrix), the maximum is TRUE whenever the row has at least one non-NA value, so we get the first non-NA value for each row. If the entire row is NA, every entry is FALSE, the tie resolves to the first column, and the value picked from there is NA, as desired.
After that, the function uses a matrix of row-column indices to subset the values.
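The two steps can be traced on a tiny matrix. This is an illustrative sketch; the matrix m and the variable names are made up for the example:

```r
# Row 1 has a value in x, row 2 is all NA, row 3 has a value in y
m <- cbind(x = c(1, NA, NA), y = c(NA, NA, 4))

present <- !is.na(m)                              # logical matrix of non-missing cells
first <- max.col(present, ties.method = "first")  # first column holding the row maximum
first                                             # 1 1 2

# A two-column (row, col) index matrix picks exactly one cell per row
picked <- m[cbind(seq_len(nrow(m)), first)]
picked                                            # 1 NA 4
```

Row 2 shows the tie case: both entries are FALSE, column 1 is chosen, and the value there is NA.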
Edit
In comparison to mrip's coalesce (coalesce_reduce below), my max.col version (coalesce_maxcol below) is slower when there are a few long columns, but faster when there are many short columns.
coalesce_reduce <- function(...) {
Reduce(function(x, y) {
i <- which(is.na(x))
x[i] <- y[i]
x},
list(...))
}
coalesce_maxcol <- function(...) {
value_matrix <- cbind(...)
first_non_missing <- max.col(!is.na(value_matrix), ties.method = "first")
indices <- cbind(
row = seq_len(nrow(value_matrix)),
col = first_non_missing
)
value_matrix[indices]
}
set.seed(100)
wide <- replicate(
1000,
{sample(c(NA, 1:10), 10, replace = TRUE)},
simplify = FALSE
)
long <- replicate(
10,
{sample(c(NA, 1:10), 1000, replace = TRUE)},
simplify = FALSE
)
microbenchmark(
do.call(coalesce_reduce, wide),
do.call(coalesce_maxcol, wide),
do.call(coalesce_reduce, long),
do.call(coalesce_maxcol, long)
)
# Unit: microseconds
# expr min lq mean median uq max neval
# do.call(coalesce_reduce, wide) 1879.460 1953.5695 2136.09954 2007.303 2152.654 5284.583 100
# do.call(coalesce_maxcol, wide) 403.604 423.5280 490.40797 433.641 456.583 2543.580 100
# do.call(coalesce_reduce, long) 36.829 41.5085 45.75875 43.471 46.942 79.393 100
# do.call(coalesce_maxcol, long) 80.903 88.1475 175.79337 92.374 101.581 3438.329 100

removing columns with NA values only [duplicate]

I am using this command to remove the columns where all the values are NA.
testing5 <- subset(testing4,
select = -c(kurtosis_picth_belt, skewness_roll_belt,
skewness_roll_belt.1, min_yaw_belt, amplitude_yaw_belt,
kurtosis_roll_arm, kurtosis_picth_arm, kurtosis_yaw_arm,
skewness_roll_arm, skewness_pitch_arm, kurtosis_picth_dumbbell,
skewness_roll_dumbbell, skewness_pitch_dumbbell, min_yaw_dumbbell,
kurtosis_roll_forearm, kurtosis_picth_forearm, skewness_roll_forearm,
skewness_pitch_forearm))
Is there a shorter (programmatic) method?
Thanks and Regards,
Partha
The tidyverse approach would look like this (also using @Rich Scriven's data):
d %>% select_if(~any(!is.na(.)))
# x
# 1 NA
# 2 3
# 3 NA
You can remove the columns that contain all NA values with e.g.
d <- data.frame(x = c(NA, 3, NA), y = rep(NA, 3))
# x y
# 1 NA NA
# 2 3 NA
# 3 NA NA
d[!sapply(d, function(x) all(is.na(x)))]
# x
# 1 NA
# 2 3
# 3 NA
On your data, this would be
testing4[!sapply(testing4, function(x) all(is.na(x)))]
Yet another way (a bit more vectorized), using @Richard's data:
d[!is.nan(colMeans(d, na.rm = TRUE))]
# x
# 1 NA
# 2 3
# 3 NA
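The colMeans() version needs numeric (or logical) columns, since colMeans() stops with an error on character data. A sketch of a variant that counts non-missing values instead, so the column type does not matter; the helper name drop_all_na and the data frame d3 are made up for this illustration:

```r
# Hypothetical helper: keep only columns with at least one non-NA value
drop_all_na <- function(df) {
  keep <- colSums(!is.na(df)) > 0
  df[, keep, drop = FALSE]  # drop = FALSE keeps a data frame even for one column
}

# Mixed-type example: y is entirely NA, s is a character column
d3 <- data.frame(x = c(NA, 3, NA), y = rep(NA, 3), s = c("a", NA, NA))
kept <- drop_all_na(d3)
names(kept)  # "x" "s"
```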

Remove columns from dataframe where some of values are NA

I have a dataframe where some of the values are NA. I would like to remove these columns.
My data.frame looks like this
v1 v2
1 1 NA
2 1 1
3 2 2
4 1 1
5 2 2
6 1 NA
I tried to estimate the col mean and select the column means !=NA. I tried this statement, it does not work.
data=subset(Itun, select=c(is.na(colMeans(Itun))))
I got an error,
error : 'x' must be an array of at least two dimensions
Can anyone give me some help?
The data:
Itun <- data.frame(v1 = c(1,1,2,1,2,1), v2 = c(NA, 1, 2, 1, 2, NA))
This will remove all columns containing at least one NA:
Itun[ , colSums(is.na(Itun)) == 0]
An alternative way is to use apply:
Itun[ , apply(Itun, 2, function(x) !any(is.na(x)))]
Here's a convenient way to do it using the dplyr function select_if(). Combine not (!), any() and is.na(), which is equivalent to selecting all columns that don't contain any NA values.
library(dplyr)
Itun %>%
select_if(~ !any(is.na(.)))
Alternatively, select(where(~FUNCTION)) can be used:
library(dplyr)
(df <- data.frame(x = letters[1:5], y = NA, z = c(1:4, NA)))
#> x y z
#> 1 a NA 1
#> 2 b NA 2
#> 3 c NA 3
#> 4 d NA 4
#> 5 e NA NA
# Remove columns where all values are NA
df %>%
select(where(~!all(is.na(.))))
#> x z
#> 1 a 1
#> 2 b 2
#> 3 c 3
#> 4 d 4
#> 5 e NA
# Remove columns with at least one NA
df %>%
select(where(~!any(is.na(.))))
#> x
#> 1 a
#> 2 b
#> 3 c
#> 4 d
#> 5 e
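You can use transpose twice (t() returns a matrix, so convert back with as.data.frame() if you need a data frame):
newdf <- t(na.omit(t(df)))
Or, with apply:
data[,!apply(is.na(data), 2, any)]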
A base R method related to the apply answers is
Itun[!unlist(vapply(Itun, anyNA, logical(1)))]
v1
1 1
2 1
3 2
4 1
5 2
6 1
Here, vapply is used as we are operating on a list and, unlike apply, it does not coerce the object into a matrix. Also, since we know that each result will be a logical vector of length 1, we can tell vapply so and potentially get a little speed boost. For the same reason, I used anyNA instead of any(is.na()).
Another alternative, in base R, would be to make use of the Filter function
Filter(function(x) !any(is.na(x)), Itun)
With data.table it would be a little more cumbersome:
setDT(Itun)[,.SD,.SDcols=setdiff((1:ncol(Itun)),
which(colSums(is.na(Itun))>0))]
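You can also try the following, which drops only the columns where every value is NA (columns with just some NAs are kept):
df <- df[,colSums(is.na(df))<nrow(df)]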
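The answers in this thread use two different conditions, and the difference is easy to miss. A short sketch on the question's Itun data, with an extra all-NA column v3 added here purely for illustration:

```r
# v3 is not in the question; it is added to show the all-NA case
Itun <- data.frame(v1 = c(1, 1, 2, 1, 2, 1),
                   v2 = c(NA, 1, 2, 1, 2, NA),
                   v3 = NA)

na_count <- colSums(is.na(Itun))  # number of NAs per column

# "No NA at all": drops v2 and v3
no_na <- names(Itun)[na_count == 0]

# "Not entirely NA": drops only v3
not_all_na <- names(Itun)[na_count < nrow(Itun)]

no_na       # "v1"
not_all_na  # "v1" "v2"

# drop = FALSE keeps the result a data frame when a single column survives
Itun[, na_count == 0, drop = FALSE]
```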

Resources