Replace NA values using if statement based on group by - r

I am looking to do the following in a more elegant manner in R. I believe there is a way, but I just can't wrap my head around it. The problem is as follows.
I have a df which contains NAs. Within each group, I want to turn the NAs into zeros if the group has at least one non-NA value (so its sum is not NA); if every value in the group is NA, the NAs should stay as NA. The example below should make it clear.
A<-c("A", "A", "A", "A",
"B","B","B","B",
"C","C","C","C")
B<-c(1,NA,NA,1,NA,NA,NA,NA,2,1,2,3)
data<-data.frame(A,B)
This is what the data looks like:
A B
1 A 1
2 A NA
3 A NA
4 A 1
5 B NA
6 B NA
7 B NA
8 B NA
9 C 2
10 C 1
11 C 2
12 C 3
And I am looking to get the following result:
A B
1 A 1
2 A 0
3 A 0
4 A 1
5 B NA
6 B NA
7 B NA
8 B NA
9 C 2
10 C 1
11 C 2
12 C 3
I know I can use a join by creating a summary table first and then writing an IF statement based on that table, but I was wondering if there is a way to do it in one or two lines of code in R.
The following is the join-based solution I was referring to:
library(dplyr)
sum_NA <- function(x) if (all(is.na(x))) NA_integer_ else sum(x, na.rm = TRUE)
# One row per group: x is the group sum, Y flags groups that are entirely NA
data2 <- data %>% group_by(A) %>% summarize(x = sum_NA(B), Y = is.na(x))
data2
data2_1 <- right_join(data, data2, by = "A")
# Zero out NAs only in groups that have at least one non-NA value
data <- mutate(data2_1, B = ifelse(!Y & is.na(B), 0, B))
data <- select(data, -Y, -x)
data

Maybe a solution like this would work:
data$B[is.na(data$B) & data$A %in% unique(na.omit(data)$A)] <- 0
Here you're asking:
if B is NA
if A is within letters that have non-NA values
Then make those values 0.

Or similarly, with ifelse():
data$B <- ifelse(is.na(data$B) & data$A %in% unique(na.omit(data)$A), 0, data$B)

Or with dplyr:
library(dplyr)
data %>%
mutate(B=ifelse(is.na(B) & A %in% unique(na.omit(data)$A), 0, B))
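The condition "this group has at least one non-NA value" can also be computed per group directly. The following is a minimal base R sketch using ave(); the variable name has_value is made up for illustration, and the data is the example from the question.

```r
A <- c("A", "A", "A", "A",
       "B", "B", "B", "B",
       "C", "C", "C", "C")
B <- c(1, NA, NA, 1, NA, NA, NA, NA, 2, 1, 2, 3)
data <- data.frame(A, B)

# For every row, TRUE if its group (column A) contains any non-NA value of B
has_value <- ave(!is.na(data$B), data$A, FUN = any)

# Replace NA with 0 only inside groups that have some observed value
data$B[is.na(data$B) & has_value] <- 0

data$B
```

Group B is left untouched because all of its values are NA, while the two NAs in group A become 0.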

Related

Why does subsetting a data frame result in NA rows? [duplicate]

I've been encountering what I think is a bug. It's not a big deal, but I'm curious if anyone else has seen this. Unfortunately, my data is confidential, so I have to make up an example, and it's not going to be very helpful.
When subsetting my data, I occasionally get mysterious NA rows that aren't in my original data frame. Even the rownames are NA. E.g.:
example <- data.frame("var1"=c("A", "B", "A"), "var2"=c("X", "Y", "Z"))
example
var1 var2
1 A X
2 B Y
3 A Z
then I run:
example[example$var1=="A",]
var1 var2
1 A X
3 A Z
NA <NA> <NA>
Of course, the example above does not actually give you this mysterious NA row; I am adding it here to illustrate the problem I'm having with my data.
Maybe it has to do with the fact that I'm importing my original data set using Google's read.xlsx package and then executing wide to long reshape before subsetting.
Thanks
Wrap the condition in which:
df[which(df$number1 < df$number2), ]
How it works:
which() returns the row numbers where the condition is TRUE (rows where it is FALSE or NA are dropped), and the data frame is then subset on those rows.
Say that:
which(df$number1 < df$number2)
returns row numbers 1, 2, 3, 4 and 5.
As such, writing:
df[which(df$number1 < df$number2), ]
is the same as writing:
df[c(1, 2, 3, 4, 5), ]
Or an even simpler version is:
df[1:5, ]
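To make the effect of which() concrete, here is a small sketch on a made-up data frame (the names scores, id and value are hypothetical and only serve the illustration). It compares indexing with the raw logical condition against indexing with the positions returned by which().

```r
# Hypothetical mini data frame; row 2 has a missing value
scores <- data.frame(id = 1:4, value = c(3, NA, 7, 1))

cond <- scores$value > 2       # TRUE, NA, TRUE, FALSE
idx  <- which(cond)            # positions of TRUE only; NA and FALSE are dropped

with_na    <- scores[cond, ]   # rows 1 and 3 plus one all-NA row
without_na <- scores[idx, ]    # rows 1 and 3 only

nrow(with_na)
nrow(without_na)
```

The logical index keeps a placeholder row for the NA, whereas the integer index from which() simply has no entry for it.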
I see this was already answered by the OP, but since his comment is buried deep within the comment section, here's my attempt to fix this issue (at least with my data, which was behaving the same way).
First of all, some sample data:
> df <- data.frame(name = LETTERS[1:10], number1 = 1:10, number2 = c(10:3, NA, NA))
> df
name number1 number2
1 A 1 10
2 B 2 9
3 C 3 8
4 D 4 7
5 E 5 6
6 F 6 5
7 G 7 4
8 H 8 3
9 I 9 NA
10 J 10 NA
Now for a simple filter:
> df[df$number1 < df$number2, ]
name number1 number2
1 A 1 10
2 B 2 9
3 C 3 8
4 D 4 7
5 E 5 6
NA <NA> NA NA
NA.1 <NA> NA NA
The problem here is that the comparison evaluates to NA for the rows where number2 is NA, and indexing a data frame with an NA in a logical index returns a row filled with NAs instead of dropping it. That is why two extra NA rows show up. Here's my fix, which requires knowledge of which column contains the NAs:
> df[df$number1 < df$number2 & !is.na(df$number2), ]
name number1 number2
1 A 1 10
2 B 2 9
3 C 3 8
4 D 4 7
5 E 5 6
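If it is not known in advance which column holds the NAs, the condition itself can be cleaned up instead. This is a sketch under that assumption, reusing the sample data from this answer; the names cond, keep and result are illustrative.

```r
df <- data.frame(name = LETTERS[1:10], number1 = 1:10,
                 number2 = c(10:3, NA, NA))

# The same behavior on a plain vector: an NA in a logical index yields NA
picked <- c(10, 20, 30)[c(TRUE, NA, FALSE)]

cond <- df$number1 < df$number2   # NA wherever number2 is NA
keep <- !is.na(cond) & cond       # treat NA in the condition as FALSE

result <- df[keep, ]
result
```

Because the NA check is applied to the condition rather than to a specific column, the same line works no matter which input column introduced the NA.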
I get the same problem when using code similar to what you posted. If I use the function subset() instead:
subset(example, var1 == "A")
the NA row gets excluded.
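A quick side-by-side sketch of the two behaviors, using a variant of the example data frame with an NA in var1 (the names by_bracket and by_subset are made up here):

```r
example <- data.frame(var1 = c("A", NA, "A"), var2 = c("X", "Y", "Z"))

# Bracket indexing keeps a placeholder row for the NA in the condition
by_bracket <- example[example$var1 == "A", ]

# subset() treats NA in the condition as FALSE and drops that row
by_subset <- subset(example, var1 == "A")

nrow(by_bracket)
nrow(by_subset)
```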
Using dplyr:
library(dplyr)
filter(df, number1 < number2)
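I find using %in% instead of == can solve this issue. The reason is that == returns NA when the value being compared is NA, while %in% is based on matching and returns FALSE for it (as long as NA is not one of the values matched against).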
For example, instead of:
df[df$num == 1,]
use:
df[df$num %in% c(1), ]
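The difference between the two operators is easy to see on a bare vector. A minimal sketch, with a made-up vector num standing in for df$num:

```r
num <- c(1, NA, 2, 1)

eq_result <- num == 1     # comparison with NA gives NA
in_result <- num %in% 1   # matching never returns NA

eq_result
in_result
```

Since in_result contains no NA, using it as a row index cannot produce the extra NA rows.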
> example <- data.frame("var1"=c("A", NA, "A"), "var2"=c("X", "Y", "Z"))
> example
var1 var2
1 A X
2 <NA> Y
3 A Z
> example[example$var1=="A",]
var1 var2
1 A X
NA <NA> <NA>
3 A Z
This is probably the result you are seeing.
Try wrapping the condition in which() to avoid the NA rows:
example[which(example$var1=="A"),]
var1 var2
1 A X
3 A Z
Another cause may be that you get the condition wrong, such as checking if a factor column is equal to a value that is not among its levels. Troubled me for a while.

"summarize" multiple incomplete columns to 1 summary column [duplicate]

I have some columns in R and for each row there will only ever be a value in one of them; the rest will be NAs. I want to combine these into one column with the non-NA value. Does anyone know of an easy way of doing this? For example, I could have the following:
data <- data.frame('a' = c('A','B','C','D','E'),
'x' = c(1,2,NA,NA,NA),
'y' = c(NA,NA,3,NA,NA),
'z' = c(NA,NA,NA,4,5))
So I would have
'a' 'x' 'y' 'z'
A 1 NA NA
B 2 NA NA
C NA 3 NA
D NA NA 4
E NA NA 5
And I would like to get
'a' 'mycol'
A 1
B 2
C 3
D 4
E 5
The names of the columns containing NAs change depending on code earlier in the query, so I won't be able to refer to the columns by name explicitly. However, I have the names of those columns stored as a vector, e.g. in this example cols <- c('x','y','z'), so I could select the columns using data[, cols].
Any help would be appreciated.
Thanks
A dplyr::coalesce-based solution could be:
data %>% mutate(mycol = coalesce(x,y,z)) %>%
select(a, mycol)
# a mycol
# 1 A 1
# 2 B 2
# 3 C 3
# 4 D 4
# 5 E 5
Data
data <- data.frame('a' = c('A','B','C','D','E'),
'x' = c(1,2,NA,NA,NA),
'y' = c(NA,NA,3,NA,NA),
'z' = c(NA,NA,NA,4,5))
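The call above names x, y and z explicitly, while the question states that the column names are only available as a character vector. Here is a base R sketch of the same "first non-NA value" idea that works from such a vector; it swaps in Reduce() and ifelse() rather than dplyr, and the helper name first_non_na is hypothetical.

```r
data <- data.frame(a = c("A", "B", "C", "D", "E"),
                   x = c(1, 2, NA, NA, NA),
                   y = c(NA, NA, 3, NA, NA),
                   z = c(NA, NA, NA, 4, 5))
cols <- c("x", "y", "z")   # column names known only at run time

# Keep the value seen so far unless it is NA, in which case take the next column
first_non_na <- function(so_far, nxt) ifelse(is.na(so_far), nxt, so_far)

result <- data.frame(a = data$a,
                     mycol = Reduce(first_non_na, data[cols]))
result
```

Reduce() walks through the selected columns left to right, so with several non-NA values in a row the leftmost column in cols wins.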
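You can use unlist to turn the columns into one vector. Afterwards, na.omit can be used to remove the NAs. Note that unlist goes through the data column by column, so the result only lines up with column a when, as in this example, the non-NA values appear in row order across the columns.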
cbind(data[1], mycol = na.omit(unlist(data[-1])))
a mycol
x1 A 1
x2 B 2
y3 C 3
z4 D 4
z5 E 5
Here's a more general (but even simpler) solution which extends to all column types (factors, characters etc.) with non-ordered NA's. The strategy is simply to merge the non-NA values of other columns into your merged column using is.na for indexing:
data$mycol = data$x # your new merged column. Start with x
data$mycol[!is.na(data$y)] = data$y[!is.na(data$y)] # merge with y
data$mycol[!is.na(data$z)] = data$z[!is.na(data$z)] # merge with z
> data
a x y z mycol
1 A 1 NA NA 1
2 B 2 NA NA 2
3 C NA 3 NA 3
4 D NA NA 4 4
5 E NA NA 5 5
Note that this will overwrite existing values in mycol if there are several non-NA values in the same row. If you have a lot of columns you could automate this by looping over colnames(data).
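The loop mentioned above could look roughly like this. It is a sketch that assumes the relevant column names are stored in a vector cols, as described in the question, rather than looping over all of colnames(data); the variable filled is illustrative.

```r
data <- data.frame(a = c("A", "B", "C", "D", "E"),
                   x = c(1, 2, NA, NA, NA),
                   y = c(NA, NA, 3, NA, NA),
                   z = c(NA, NA, NA, 4, 5))
cols <- c("x", "y", "z")

data$mycol <- NA                               # start with an empty merged column
for (col in cols) {
  filled <- !is.na(data[[col]])                # rows where this column has a value
  data$mycol[filled] <- data[[col]][filled]    # later columns overwrite earlier ones
}
data
```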
I would use rowSums() with the na.rm = TRUE argument:
cbind.data.frame(a=data$a, mycol = rowSums(data[, -1], na.rm = TRUE))
which gives:
> cbind.data.frame(a=data$a, mycol = rowSums(data[, -1], na.rm = TRUE))
a mycol
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
You have to call the method directly (cbind.data.frame) as the first argument above is not a data frame.
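One thing to keep in mind with the sum-based approach: with na.rm = TRUE, a row whose columns are all NA sums to 0 rather than NA. The sketch below uses a made-up data frame d (not the question's data) to show this, together with one possible way to restore NA for such rows; the name all_missing is illustrative.

```r
# Hypothetical data: row 3 has no value in any column
d <- data.frame(a = c("A", "B", "C"),
                x = c(1, NA, NA),
                y = c(NA, 2, NA))
cols <- c("x", "y")

mycol <- rowSums(d[, cols], na.rm = TRUE)        # row 3 becomes 0, not NA
zero_before_fix <- unname(mycol[3])

all_missing <- rowSums(!is.na(d[, cols])) == 0   # rows with no observed value
mycol[all_missing] <- NA

data.frame(a = d$a, mycol = mycol)
```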
Something like this?
data.frame(a=data$a, mycol=apply(data[,-1],1,sum,na.rm=TRUE))
gives:
a mycol
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
max works too, and it also works on character vectors.
cbind(data[1], mycol=apply(data[-1], 1, max, na.rm=TRUE))
One possibility using dplyr and tidyr could be:
data %>%
gather(variables, mycol, -1, na.rm = TRUE) %>%
select(-variables)
a mycol
1 A 1
2 B 2
8 C 3
14 D 4
15 E 5
Here it transforms the data from wide to long format, excluding the first column from this operation and removing the NAs.
In a related answer (suppress NAs in paste()) I present a version of paste with an na.rm option (with the unfortunate name of paste5).
With this the code becomes
cols <- c("x", "y", "z")
cbind.data.frame(a = data$a, mycol = paste5(data[, cols], na.rm = TRUE))
The output of paste5 is a character vector, which works if you have character data; otherwise you'll need to coerce it to the type you want.
Though this is not the OP's case, some people seem to like the approach based on sums, so how about also using the mean and the mode to make the answer more general? This answer matches the title, which is what many people searching will find.
data <- data.frame('a' = c('A','B','C','D','E'),
'x' = c(1,2,NA,NA,9),
'y' = c(NA,6,3,NA,5),
'z' = c(NA,NA,NA,4,5))
splitdf<-split(data[,c(2:4)], seq(nrow(data[,c(2:4)])))
data$mean<-unlist(lapply(splitdf, function(x) mean(unlist(x), na.rm=T) ) )
data$mode<-unlist(lapply(splitdf, function(x) {
tab <- tabulate(match(x, na.omit(unique(unlist(x) ))));
paste(na.omit(unique(unlist(x) ))[tab == max(tab) ], collapse = ", " )}) )
data
a x y z mean mode
1 A 1 NA NA 1.000000 1
2 B 2 6 NA 4.000000 2, 6
3 C NA 3 NA 3.000000 3
4 D NA NA 4 4.000000 4
5 E 9 5 5 6.333333 5
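If you want to stick with base R (note that mycol ends up as a character column, and x, y and z are converted to character as well):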
data <- data.frame('a' = c('A','B','C','D','E'),'x' = c(1,2,NA,NA,NA),'y' = c(NA,NA,3,NA,NA),'z' = c(NA,NA,NA,4,5))
data[is.na(data)]<-","
data$mycol<-paste0(data$x,data$y,data$z)
data$mycol <- gsub(',','',data$mycol)

