This question already has answers here:
Using `:=` in data.table to sum the values of two columns in R, ignoring NAs
(2 answers)
Closed 3 years ago.
I am trying to sum two columns that contain some NAs. There are many forum questions about how to sum while ignoring NA, but I additionally want the result to be NA when both columns are NA in a given row. This is an example:
library(data.table)
df <- data.table(x = c(1, 2, NA),
                 y = c(1, NA, NA))
> df
x y
1 1
2 NA
NA NA
and I want this:
x y final
1 1 2
2 NA 2
NA NA NA
I've tried the following:
df$sum<-rowSums(df[,c("x", "y")], na.rm=TRUE)
df$final<-ifelse (is.na(df$x) && is.na(df$y) , NA,
ifelse (is.na(df$x) | is.na(df$y), df$sum,
ifelse (!is.na(df$x) && !is.na(df$y), df$sum)))
But this doesn't return what I want. Could someone help?
NOTE: Some have flagged this as a duplicate because I ask for NAs to be ignored, but those questions do not answer my main question: how can two NAs give NA rather than 0?
I used the following. It sums across NAs, but returns NA when all summed elements are NA.
rowSums(df, na.rm = TRUE) * NA ^ (rowSums(!is.na(df)) == 0)
Here are two more options:
ifelse(rowSums(is.na(df)) != ncol(df), rowSums(df, na.rm = TRUE), NA)
#[1] 2 2 NA
and
vals <- rowSums(df, na.rm = TRUE)
NA^(vals == 0) * vals
#[1] 2 2 NA
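One caveat about the second option: it tests the value of the sum rather than the number of non-NA entries, so a row whose real values happen to sum to 0 is also turned into NA. A small sketch with made-up data (the data frame d and the names by_value and by_count are illustrative, not from the question) shows the difference:

```r
# Illustrative data: row 1 sums to 0 from real values, row 2 has one real 0,
# row 3 is entirely NA.
d <- data.frame(x = c(1, 0, NA), y = c(-1, NA, NA))

vals <- rowSums(d, na.rm = TRUE)            # 0 0 0

# Value-based test: every row with sum 0 becomes NA, including rows 1 and 2.
by_value <- NA^(vals == 0) * vals           # NA NA NA

# Count-based test: only rows with no non-NA entries become NA.
by_count <- ifelse(rowSums(!is.na(d)) == 0, NA, vals)   # 0 0 NA
```

If genuine zero sums can occur in your data, the count-based test is the safer of the two.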
I have a data frame in R that looks like
1 3 NULL,
2 NULL 5,
NULL NULL 9
I want to iterate through each row and add the two numbers that are present. If there aren't two numbers present, I want to throw an error. How do I refer to specific rows and cells in R? To iterate through the rows I have a for loop. Sorry, I'm not sure how to format a matrix above.
for(i in 1:nrow(df))
Data:
df <- data.frame(
v1 = c(1, 2, NA),
v2 = c(3, NA, NA),
v3 = c(NA, 5, 9)
)
Use rowSums:
df$sum <- rowSums(df, na.rm = TRUE)
Result:
df
v1 v2 v3 sum
1 1 3 NA 4
2 2 NA 5 7
3 NA NA 9 9
If you do need a for loop:
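cols <- c("v1", "v2", "v3")  # sum only the original columns, not the "sum" column itself
for (i in seq_len(nrow(df))) {
  df$sum[i] <- rowSums(df[i, cols], na.rm = TRUE)
}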
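The question also asks for an error when a row does not contain two numbers, which rowSums alone does not do. A minimal sketch, assuming the requirement means exactly two non-NA values per row (sum_pairs is a hypothetical helper name, not an existing function):

```r
# Hypothetical helper: row sums, but stop if any row lacks exactly two numbers.
sum_pairs <- function(d) {
  n_present <- rowSums(!is.na(d))          # number of non-NA values per row
  bad <- which(n_present != 2)
  if (length(bad) > 0) {
    stop("rows without exactly two numbers: ", paste(bad, collapse = ", "))
  }
  rowSums(d, na.rm = TRUE)
}

ok <- data.frame(v1 = c(1, 2), v2 = c(3, NA), v3 = c(NA, 5))
sum_pairs(ok)    # 4 7

bad <- data.frame(v1 = c(1, NA), v2 = c(3, NA), v3 = c(NA, 9))
msg <- tryCatch(sum_pairs(bad), error = function(e) conditionMessage(e))
msg              # "rows without exactly two numbers: 2"
```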
If your data contain the string NULL, you can still read them into a data.frame, but every column containing NULL becomes a character vector. You have to convert those columns to numeric, which turns each NULL into NA.
rowSums will then compute the sum you want.
df <- read.table(text=
"
a b c
1 3 NULL
2 NULL 5
NULL NULL 9
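", header = TRUE, stringsAsFactors = FALSE)  # keep NULL columns as character, not factor
# make columns numeric; this changes each "NULL" to NA (with a coercion warning)
df <- data.frame(lapply(df, as.numeric))
cbind(df, sum = rowSums(df, na.rm = TRUE))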
# a b c sum
# 1 1 3 NA 4
# 2 2 NA 5 7
# 3 NA NA 9 9
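As an alternative sketch, read.table can be told up front that the string NULL marks a missing value, via its na.strings argument. The columns are then read as numbers directly and no conversion step is needed (the name df2 is only for illustration):

```r
# Treat the literal string "NULL" as NA while reading.
df2 <- read.table(text = "
a b c
1 3 NULL
2 NULL 5
NULL NULL 9
", header = TRUE, na.strings = "NULL")

# All three columns are already numeric, so rowSums works right away.
df2$sum <- rowSums(df2[, c("a", "b", "c")], na.rm = TRUE)
df2$sum   # 4 7 9
```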
Suppose I have this dataframe
df <- data.frame(keep = c(1, NA, 2),
also_want = c(NA, NA, NA),
maybe = c(1, 2, NA),
maybe_2 = c(NA, NA, NA))
Edit: In the actual dataframe there are many columns I'd like to keep, so spelling them all out isn't viable. These columns are all the columns that do not start with maybe. The maybe columns, instead, do have a common naming like maybe, maybe_1 etc. that could work with grep or stringr::str_detect
I want to select keep, and also_want. I also want any of the maybe columns that have values other than NA
desired_df
keep also_want maybe
1 1 NA 1
2 NA NA 2
3 2 NA NA
I can use select_if to get all columns that have non-NA values, but then I lose also_want
library(dplyr)
df %>%
select_if(~sum(!is.na(.)) > 0)
keep maybe
1 1 1
2 NA 2
3 2 NA
Thoughts?
With dplyr 1.0.0 you can use the where() function inside select() to test conditions that your variables have to satisfy; first, though, you specify the variables you always want to keep.
EDIT
I've added the condition that only the "maybe" variables have to contain values other than NA; first, we select every column that does not start with "maybe".
df %>%
select(!starts_with("maybe"), starts_with("maybe") & where(~sum(!is.na(.)) > 0))
Output
# keep also_want maybe
# 1 1 NA 1
# 2 NA NA 2
# 3 2 NA NA
Following your comments, in base R we can use:
df[,!apply(
rbind(
grepl("maybe",colnames(df)),
!apply(df, 2, function(x) !all(is.na(x)))
)
,2,all)]
keep also_want maybe
1 1 NA 1
2 NA NA 2
3 2 NA NA
Or, if you prefer the same code on one line:
df[,!apply(rbind(grepl("maybe",colnames(df)),!apply(df, 2, function(x) !all(is.na(x)))),2,all)]
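The same logic may be easier to read when split into named steps. This is a sketch of the idea rather than the answerer's code; is_maybe, all_na and result are illustrative names, and the pattern is anchored at the start of the column name, which is an assumption about how the real columns are named:

```r
df <- data.frame(keep = c(1, NA, 2),
                 also_want = c(NA, NA, NA),
                 maybe = c(1, 2, NA),
                 maybe_2 = c(NA, NA, NA))

is_maybe <- grepl("^maybe", names(df))                        # is this a "maybe" column?
all_na   <- vapply(df, function(x) all(is.na(x)), logical(1)) # is the column entirely NA?

# Drop a column only when it is a "maybe" column AND entirely NA.
result <- df[, !(is_maybe & all_na), drop = FALSE]
names(result)   # "keep" "also_want" "maybe"
```

drop = FALSE keeps the result a data frame even if only one column survives.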
I eventually figured this out: use str_detect to select all non-maybe columns, then a one-liner inside sapply to also select any other columns (i.e. any maybe columns) that have non-NA values.
library(dplyr)
library(stringr)
df %>%
  select_if(str_detect(names(.), "maybe", negate = TRUE) |
              sapply(., function(x) sum(!is.na(x)) > 0))
keep also_want maybe
1 1 NA 1
2 NA NA 2
3 2 NA NA
I have some data that I am looking at in R. One particular column, titled "Height", contains a few rows of NA.
I am looking to subset my data-frame so that all Heights above a certain value are excluded from my analysis.
df2 <- subset ( df1 , Height < 40 )
However, whenever I do this, R automatically removes all rows that contain NA values for Height. I do not want this. I have tried adding an na.rm argument:
f1 <- function ( x , na.rm = FALSE ) {
df2 <- subset ( x , Height < 40 )
}
f1 ( df1 , na.rm = FALSE )
but this does not seem to do anything; the rows with NA still end up disappearing from my data-frame. Is there a way of subsetting my data as such, without losing the NA rows?
If we use the subset function, we need to be careful. Its documentation (?subset) says:
For ordinary vectors, the result is simply ‘x[subset & !is.na(subset)]’.
So only cases where the condition is TRUE are retained; cases where it evaluates to NA are dropped.
If you want to keep the NA cases, add a logical "or" condition to tell R not to drop them:
subset(df1, Height < 40 | is.na(Height))
# or `df1[df1$Height < 40 | is.na(df1$Height), ]`
Don't index directly with the bare condition (explained below):
df2 <- df1[df1$Height < 40, ]
Example
df1 <- data.frame(Height = c(NA, 2, 4, NA, 50, 60), y = 1:6)
subset(df1, Height < 40 | is.na(Height))
# Height y
#1 NA 1
#2 2 2
#3 4 3
#4 NA 4
df1[df1$Height < 40, ]
#      Height  y
#NA        NA NA
#2          2  2
#3          4  3
#NA.1      NA NA
The latter fails because indexing by NA gives NA. Consider this simple example with a vector:
x <- 1:4
ind <- c(NA, TRUE, NA, FALSE)
x[ind]
# [1] NA 2 NA
We need to somehow replace those NAs with TRUE. The most straightforward way is to add another "or" condition, is.na(ind):
x[ind | is.na(ind)]
# [1] 1 2 3
This is exactly what happens in your situation. If Height contains NA, the logical operation Height < 40 yields a mix of TRUE / FALSE / NA, so we need to replace NA with TRUE as above.
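The reason the extra "or" condition works is R's three-valued logic: NA | TRUE is TRUE, whereas NA | FALSE stays NA. A short sketch on a made-up Height vector (the names cond and keep are illustrative):

```r
Height <- c(NA, 2, 50)

cond <- Height < 40             # NA TRUE FALSE  (a comparison with NA is NA)
keep <- cond | is.na(Height)    # TRUE TRUE FALSE (NA | TRUE is TRUE)

Height[keep]                    # NA 2
```

The NA row is kept because is.na(Height) is TRUE exactly where the comparison is NA.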
You could also do:
df2 <- df1[(df1$Height < 40 | is.na(df1$Height)),]
For subsetting by character/factor variables, you can use %in% to keep NAs. Specify the data you wish to exclude.
# Create Dataset
library(data.table)
df <- data.table(V1 = c('Surface', 'Bottom', NA), V2 = 1:3)
df
# V1 V2
# 1: Surface 1
# 2: Bottom 2
# 3: <NA> 3
# Keep all but 'Bottom'
df[!V1 %in% c('Bottom')]
# V1 V2
# 1: Surface 1
# 2: <NA> 3
This works because %in% never returns NA (see ?match).
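To see the difference, compare != with a negated %in% on a plain character vector. This is a sketch using the same values as above; ne and not_in are illustrative names:

```r
V1 <- c("Surface", "Bottom", NA)

ne     <- V1 != "Bottom"        # TRUE FALSE NA    (the NA propagates)
not_in <- !(V1 %in% "Bottom")   # TRUE FALSE TRUE  (%in% returns FALSE for NA here)

V1[not_in]                      # "Surface" NA
```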
I'd like to understand a behavior of the function rowSums that seems weird to me. Imagine I have this very simple data frame:
a = c(NA, NA,3)
b = c(2,NA,2)
df = data.frame(a,b)
df
a b
1 NA 2
2 NA NA
3 3 2
and now I want a third column that is the sum of the other two. I cannot simply use + because of the NAs:
df$c <- df$a + df$b
df
a b c
1 NA 2 NA
2 NA NA NA
3 3 2 5
but if I use rowSums, the rows where every value is NA come out as 0, while rows with only one NA work fine:
df$d <- rowSums(df, na.rm = TRUE)
df
a b c d
1 NA 2 NA 2
2 NA NA NA 0
3 3 2 5 10
Am I missing something?
Thanks to all
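One option with rowSums is to compute the row sums with na.rm=TRUE and multiply them by NA^!rowSums(!is.na(df)). Here rowSums(!is.na(df)) counts the non-NA values in each row, and negating it (!) gives TRUE only for rows that are entirely NA. Raising NA to that logical (NA^) yields NA for those rows and 1 for all others, so only the all-NA rows are turned into NA.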
rowSums(df, na.rm=TRUE) *NA^!rowSums(!is.na(df))
#[1] 2 NA 10
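The trick relies on two facts about R arithmetic: NA^0 is 1 while NA^1 is NA, and logical values are coerced to 0/1 in arithmetic. A quick check (flag is an illustrative name):

```r
NA^0                         # 1
NA^1                         # NA

# FALSE is treated as 0 and TRUE as 1, so a logical vector works as the exponent.
flag <- NA^c(FALSE, TRUE)    # 1 NA

# Multiplying by the flag leaves the first value alone and blanks out the second.
c(2, 0) * flag               # 2 NA
```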
This happens because the sum of an empty vector is 0:
sum(numeric(0))
# 0
Once you use na.rm = TRUE in rowSums, the NAs are removed first, so the second row is effectively numeric(0), and its sum is 0.
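A small sketch of what happens to an all-NA row, step by step (row2 and kept are illustrative names):

```r
row2 <- c(NA_real_, NA_real_)   # the all-NA row from the example

kept <- row2[!is.na(row2)]      # what remains after dropping NAs
length(kept)                    # 0, i.e. numeric(0)

sum(kept)                       # 0: the sum of an empty vector
sum(row2, na.rm = TRUE)         # 0: the same computation in one call
```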
If you want to retain NA for rows that are entirely NA, you need a two-step approach. I recommend writing a small function for this purpose:
my_rowSums <- function(x) {
if (is.data.frame(x)) x <- as.matrix(x)
z <- base::rowSums(x, na.rm = TRUE)
z[!base::rowSums(!is.na(x))] <- NA
z
}
my_rowSums(df)
# [1] 2 NA 10
This can be particularly useful if the input x is a data frame (as in your case). base::rowSums first checks whether its input is a data frame and, if so, converts it to a matrix. That type conversion can be more costly than the row sum computation itself. Note that we call base::rowSums twice, so to avoid paying for the conversion twice, we make sure x is a matrix beforehand.
For @akrun's "hacking" answer, I suggest:
akrun_rowSums <- function (x) {
if (is.data.frame(x)) x <- as.matrix(x)
rowSums(x, na.rm=TRUE) *NA^!rowSums(!is.na(x))
}
akrun_rowSums(df)
# [1] 2 NA 10
I am trying to work out a way to create a single column from multiple columns in R. What I want to do is for R to go through all rows for multiple columns and if it finds a positive result in one of those columns, to pass that result into an 'amalgam' column (sorry I don't know a better word for it).
See the toy dataset below
x <- c(NA, NA, NA, NA, NA, 1)
y <- c(NA, NA, 1, NA, NA, NA)
z <- c(NA, 1, NA, NA, NA, NA)
df <- data.frame(cbind(x, y, z))
df[, "compCol"] <- NA
df
x y z compCol
1 NA NA NA NA
2 NA NA 1 NA
3 NA 1 NA NA
4 NA NA NA NA
5 NA NA NA NA
6 1 NA NA NA
I need to pass positive results from each of the columns into the compCol column while changing negative results to 0. So that it looks like this.
x y z compCol
1 NA NA NA 0
2 NA NA 1 3
3 NA 1 NA 2
4 NA NA NA 0
5 NA NA NA 0
6 1 NA NA 1
I know it probably requires an if/else statement nested inside a for loop, but everything I have tried results in errors that I don't understand.
I tried the following just for a single column
for (i in 1:length(x)) {
if (df$x[i] == 1) {
df$compCol[i] <- df$x[i]
}
}
But it didn't work at all.
I got the message 'Error in if (df$x[i] == 1) { : missing value where TRUE/FALSE needed'
And that makes sense, but I can't see where to put the TRUE/FALSE statement.
You can also use reshaping with NA removal:
library(dplyr)
library(tidyr)
df.id = df %>% mutate(ID = 1:n() )
df.id %>%
gather(variable, value,
x, y, z,
na.rm = TRUE) %>%
left_join(df.id)
We can use max.col. Create a logical matrix ('ind') by checking whether the selected columns are greater than 0 and are not NA. max.col then gives, for each row, the index of the column holding the maximum; multiplying by rowSums of 'ind' makes the result 0 for rows that have no TRUE values. This assumes at most one positive value per row, as in the example data.
ind <- df > 0 & !is.na(df)
df$compCol <- max.col(ind) *rowSums(ind)
df$compCol
#[1] 0 3 2 0 0 1
Another option is pmax after multiplying by col(df):
do.call(pmax,col(df)*replace(df, is.na(df), 0))
#[1] 0 3 2 0 0 1
NOTE: I used the dataset before creating the 'compCol' in the OP's post.
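For completeness, the error in the question's loop comes from if () receiving NA: df$x[i] == 1 is NA whenever df$x[i] is NA. Below is a sketch of a loop that guards with is.na() first (&& short-circuits, so the comparison is only evaluated for non-NA values). It assumes at most one positive value per row, as in the toy data, and the variable names are illustrative:

```r
x <- c(NA, NA, NA, NA, NA, 1)
y <- c(NA, NA, 1, NA, NA, NA)
z <- c(NA, 1, NA, NA, NA, NA)
df <- data.frame(x, y, z)

cols <- c("x", "y", "z")
df$compCol <- 0                       # default when no positive value is found

for (i in seq_len(nrow(df))) {
  for (j in seq_along(cols)) {
    value <- df[i, cols[j]]
    # Check for NA first so that if () never sees a missing value.
    if (!is.na(value) && value > 0) {
      df$compCol[i] <- j              # store the index of the positive column
    }
  }
}

df$compCol   # 0 3 2 0 0 1
```

The vectorised answers above are preferable for real data; the loop is only meant to show where the NA check belongs.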