I have this dataframe
dtf <- data.frame(
id = seq(1, 4),
amt = c(1, 4, NA, 123),
xamt = c(1, 4, NA, 123),
camt = c(1, 4, NA, 123),
date = c("2020-01-01", NA, "2020-01-01", NA),
pamt = c(1, 4, NA, 123)
)
I'd like to replace all NA values with 0, but only in the numeric columns (in my case amt, xamt, pamt and camt). I'm looking for a dplyr way. Normally I would use
replace(is.na(.), 0)
But this does not work because of the date column.
You can use across :
library(dplyr)
dtf %>% mutate(across(where(is.numeric), ~replace(., is.na(.), 0)))
#mutate_if for dplyr < 1.0.0
#dtf %>% mutate_if(is.numeric, ~replace(., is.na(.), 0))
You can also use replace_na from tidyr :
dtf %>% mutate(across(where(is.numeric), ~ tidyr::replace_na(.x, 0)))
# id amt xamt camt date pamt
#1 1 1 1 1 2020-01-01 1
#2 2 4 4 4 <NA> 4
#3 3 0 0 0 2020-01-01 0
#4 4 123 123 123 <NA> 123
As suggested by @Darren Tsai, we can also use coalesce:
dtf %>% mutate(across(where(is.numeric), ~ coalesce(.x, 0)))
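For comparison, the same numeric-only replacement can be sketched in base R without any packages. This is only an illustration built on the dtf example from the question; num_cols is a helper name introduced here and is not part of the original answers.

```r
dtf <- data.frame(
  id = seq(1, 4),
  amt = c(1, 4, NA, 123),
  xamt = c(1, 4, NA, 123),
  camt = c(1, 4, NA, 123),
  date = c("2020-01-01", NA, "2020-01-01", NA),
  pamt = c(1, 4, NA, 123)
)

# One logical flag per column: TRUE for numeric (integer or double) columns
num_cols <- vapply(dtf, is.numeric, logical(1))

# Replace NA with 0 in the numeric columns only; the date column is untouched
dtf[num_cols] <- lapply(dtf[num_cols], function(x) replace(x, is.na(x), 0))
```

The NA values in date survive because that column is character, so it is never selected by num_cols.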
I've got this data (stored as tmp):
tmp <- tribble(
~ranges, ~last,
0, NA,
1, NA,
1, NA,
1, NA,
1, NA,
2, NA,
2, NA,
2, NA,
3, NA,
3, NA
)
and I want to fill the last column only in the final row of each run of equal values in the ranges column. That means it should look like this:
tribble(
~ranges, ~last,
0, 0,
1, NA,
1, NA,
1, NA,
1, 1,
2, NA,
2, NA,
2, 2,
3, NA,
3, 3
)
So far I came up with a row-wise approach:
# Loop over the distinct values so that ranges == 0 is handled as well
for (r in unique(tmp$ranges)) {
idx <- max(which(tmp$ranges == r)) # row index of the last occurrence of r
tmp$last[idx] <- r
}
The main issue is that it is terribly slow. I am looking for a vectorized approach to this issue. Any creative solution out there?
Here's a dplyr solution:
library(dplyr)
tmp %>%
group_by(ranges) %>%
mutate(
last = case_when(row_number() == n() ~ ranges, TRUE ~ NA_real_)
) %>%
ungroup()
# # A tibble: 10 × 2
# ranges last
# <dbl> <dbl>
# 1 0 0
# 2 1 NA
# 3 1 NA
# 4 1 NA
# 5 1 1
# 6 2 NA
# 7 2 NA
# 8 2 2
# 9 3 NA
# 10 3 3
Or we could do something clever with base R for the same result. Here we calculate the difference of ranges to identify when the next row is different (i.e., the last of a group). We then stick a TRUE on the end so the last row is included. This assumes your data is already sorted by ranges.
tmp$last = ifelse(c(diff(tmp$ranges) != 0, TRUE), tmp$ranges, NA)
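If the data are not sorted, diff() marks the end of every run rather than the last occurrence of each value. One possible alternative, offered here as a sketch and not taken from the original answer, is duplicated() with fromLast = TRUE, which flags the last occurrence of each value regardless of order. The is_last name is introduced only for this illustration.

```r
tmp <- data.frame(
  ranges = c(0, 1, 1, 1, 1, 2, 2, 2, 3, 3),
  last = NA_real_
)

# TRUE where a value does not appear again further down the column
is_last <- !duplicated(tmp$ranges, fromLast = TRUE)

# Copy ranges into last on those rows only, NA everywhere else
tmp$last <- ifelse(is_last, tmp$ranges, NA)
```

On sorted input such as this example, both approaches mark the same rows.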
Using replace:
library(dplyr)
df %>%
group_by(ranges) %>%
mutate(last = replace(last, n(), ranges[n()]))
Using ifelse:
library(dplyr)
df %>%
group_by(ranges) %>%
mutate(last = ifelse(row_number() == n(), ranges, NA))
Using tail:
library(dplyr)
df %>%
group_by(ranges) %>%
mutate(last = c(last[-n()], tail(ranges, 1)))
output
ranges last
<dbl> <dbl>
1 0 0
2 1 NA
3 1 NA
4 1 NA
5 1 1
6 2 NA
7 2 NA
8 2 2
9 3 NA
10 3 3
I am trying to compare two columns in a dataframe to find rows where the two columns are not equal.
I would do:
df %>% filter(column1 != column2)
This gives me the rows where both columns have a value and the values are not equal (e.g. column1 = 5, column2 = 6).
However, it does not give me the rows where one of the values is NA (e.g. column1 = NA, column2 = 7).
How can I include the latter case into the filter function?
Thanks
Or use xor:
df %>% filter(a != b | xor(is.na(a), is.na(b)))
Or, as @thelatemail mentioned, you could use base R:
df[which(df$a != df$b | xor(is.na(df$a), is.na(df$b))),]
Or, as @runr mentioned, you could try subset in base R:
subset(df, a != b | xor(is.na(a), is.na(b)))
You can include them with an OR (|) condition -
library(dplyr)
df <- data.frame(a = c(1, 2, NA, 4, 5), b = c(NA, 2, 3, 4, 8))
df %>% filter(a != b | is.na(a) | is.na(b))
# a b
#1 1 NA
#2 NA 3
#3 5 8
Another option would be to change the NA values to the string "NA", after which a plain a != b comparison works.
df %>%
mutate(across(everything(), ~replace(., is.na(.), 'NA'))) %>%
filter(a != b) %>%
type.convert(as.is = TRUE)
We can use if_any
library(dplyr)
df %>%
filter(a != b | if_any(everything(), is.na))
a b
1 1 NA
2 NA 3
3 5 8
data
df <- structure(list(a = c(1, 2, NA, 4, 5), b = c(NA, 2, 3, 4, 8)),
class = "data.frame", row.names = c(NA,
-5L))
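The xor and OR filters shown in the answers above agree on the example data, but they differ when both values in a row are NA. The following base R sketch uses a small made-up data frame (the values are assumptions chosen to include such a row) to show the difference.

```r
# Row 5 has NA in both columns
df <- data.frame(a = c(1, 2, NA, 4, NA), b = c(NA, 2, 3, 5, NA))

# OR version: keeps a row as soon as either value is NA
or_rows <- which(df$a != df$b | is.na(df$a) | is.na(df$b))

# xor version: keeps a row only when exactly one value is NA
xor_rows <- which(df$a != df$b | xor(is.na(df$a), is.na(df$b)))
```

Here or_rows includes row 5 while xor_rows does not, so the choice depends on whether two missing values should count as "not equal".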
library(tidyverse)
df <- tibble(Date = c(rep(as.Date("2020-01-01"), 3), NA),
col1 = 1:4,
thisCol = c(NA, 8, NA, 3),
thatCol = 25:28,
col999 = rep(99, 4))
#> # A tibble: 4 x 5
#> Date col1 thisCol thatCol col999
#> <date> <int> <dbl> <int> <dbl>
#> 1 2020-01-01 1 NA 25 99
#> 2 2020-01-01 2 8 26 99
#> 3 2020-01-01 3 NA 27 99
#> 4 NA 4 3 28 99
My actual R data frame has hundreds of columns that aren't neatly named, but can be approximated by the df data frame above.
I want to replace all NA values with 0, with the exception of several columns (in my example I want to leave out the Date column and the thatCol column). I'd want to do it in this sort of fashion:
df %>% replace(is.na(.), 0)
#> Error: Assigned data `values` must be compatible with existing data.
#> i Error occurred for column `Date`.
#> x Can't convert <double> to <date>.
#> Run `rlang::last_error()` to see where the error occurred.
And my unsuccessful ideas for accomplishing the "everything except" replace NA are shown below.
df %>% replace(is.na(c(., -c(Date, thatCol)), 0))
df %>% replace_na(list([, c(2:3, 5)] = 0))
df %>% replace_na(list(everything(-c(Date, thatCol)) = 0))
Is there a way to select everything BUT in the way I need to? There are hundreds of columns, named inconsistently, so typing them one by one is not a practical option.
You can use mutate_at :
library(dplyr)
Remove them by Name
df %>% mutate_at(vars(-c(Date, thatCol)), ~replace(., is.na(.), 0))
Remove them by position
df %>% mutate_at(-c(1,4), ~replace(., is.na(.), 0))
Select them by name
df %>% mutate_at(vars(col1, thisCol, col999), ~replace(., is.na(.), 0))
Select them by position
df %>% mutate_at(c(2, 3, 5), ~replace(., is.na(.), 0))
If you want to use replace_na
df %>% mutate_at(vars(-c(Date, thatCol)), tidyr::replace_na, 0)
Note that mutate_at has been superseded by across as of dplyr 1.0.0.
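The mutate_at calls above can also be written with across. The following is a minimal sketch, assuming dplyr >= 1.0.0 and a plain data.frame version of the example data; zero_na is a helper name introduced here for illustration.

```r
library(dplyr)

df <- data.frame(
  Date = c(rep(as.Date("2020-01-01"), 3), NA),
  col1 = 1:4,
  thisCol = c(NA, 8, NA, 3),
  thatCol = 25:28,
  col999 = rep(99, 4)
)

# Replace NA with 0 in a single column
zero_na <- function(x) replace(x, is.na(x), 0)

# Exclude columns by name
by_name <- df %>% mutate(across(-c(Date, thatCol), zero_na))

# Select columns by position
by_pos <- df %>% mutate(across(c(2, 3, 5), zero_na))
```

Both calls touch the same three columns, so the Date and thatCol columns keep their original values, including the NA in Date.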
You have several options here based on data.table.
One of the coolest options: setnafill (version >= 1.12.4):
library(data.table)
setDT(df)
data.table::setnafill(df, fill = 0, cols = colnames(df)[!(colnames(df) %in% c("Date", "thatCol"))])
Note that your dataframe is updated by reference.
Another base solution:
to_change<-grep("^(this|col)",names(df))
df[to_change] <- lapply(df[to_change], function(x) replace(x, is.na(x), 0))
df
# A tibble: 4 x 5
Date col1 thisCol thatCol col999
<date> <dbl> <dbl> <int> <dbl>
1 2020-01-01 1 0 25 99
2 2020-01-01 2 8 26 99
3 2020-01-01 3 0 27 99
4 NA 0 3 28 99
Data (I changed one value):
df <- structure(list(Date = structure(c(18262, 18262, 18262, NA), class = "Date"),
col1 = c(1L, 2L, 3L, NA), thisCol = c(NA, 8, NA, 3), thatCol = 25:28,
col999 = c(99, 99, 99, 99)), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
replace works on a data.frame, so we can just do the replacement by index and update the original dataset
df[-c(1, 4)] <- replace(df[-c(1, 4)], is.na(df[-c(1, 4)]), 0)
Or using replace_na with across (dplyr >= 1.0.0)
library(dplyr)
library(tidyr)
df %>%
mutate(across(-c(Date, thatCol), ~ replace_na(., 0)))
If you know the ones that you don't want to change, you could do it like this:
df <- tibble(Date = c(rep(as.Date("2020-01-01"), 3), NA),
col1 = 1:4,
thisCol = c(NA, 8, NA, 3),
thatCol = 25:28,
col999 = rep(99, 4))
#dplyr
df_nonreplace <- select(df, c("Date", "thatCol"))
df_replace <- df[ ,!names(df) %in% names(df_nonreplace)]
df_replace[is.na(df_replace)] <- 0
df <- cbind(df_nonreplace, df_replace)
> head(df)
Date thatCol col1 thisCol col999
1 2020-01-01 25 1 0 99
2 2020-01-01 26 2 8 99
3 2020-01-01 27 3 0 99
4 <NA> 28 4 3 99
I have a dataset1 which is as follows:
dataset1 <- data.frame(
id1 = c(1, 1, 1, 2, 2, 2),
id2 = c(122, 122, 122, 133, 133, 133),
num1 = c(1, NA, NA, 50,NA, NA),
num2 = c(NA, 2, NA, NA, 45, NA),
num3 = c(NA, NA, 3, NA, NA, 4)
)
How can I collapse these multiple rows into a single row per id?
The desired output is:
id1 id2 num1 num2 num3
1 122 1 2 3
2 133 50 45 4
library(dplyr)
dataset1 %>% group_by(id1, id2) %>%
summarise_all(~ .[!is.na(.)]) %>%
as.data.frame()
# id1 id2 num1 num2 num3
# 1 1 122 1 2 3
# 2 2 133 50 45 4
Note: Assuming there will be only 1 non-NA item in a column.
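Under the same assumption of one non-NA value per column and group, a base R version can be sketched with aggregate(). The first_non_na helper is a name introduced here for illustration and is not part of the original answer.

```r
dataset1 <- data.frame(
  id1 = c(1, 1, 1, 2, 2, 2),
  id2 = c(122, 122, 122, 133, 133, 133),
  num1 = c(1, NA, NA, 50, NA, NA),
  num2 = c(NA, 2, NA, NA, 45, NA),
  num3 = c(NA, NA, 3, NA, NA, 4)
)

# Return the first non-missing value, or NA if every value is missing
first_non_na <- function(x) x[!is.na(x)][1]

# Collapse each (id1, id2) group to one row
out <- aggregate(
  dataset1[c("num1", "num2", "num3")],
  by = dataset1[c("id1", "id2")],
  FUN = first_non_na
)
```

Taking only the first non-missing value means a group with several values in one column is still reduced to a single row, at the cost of silently dropping the extra values.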
Using data.table
library(data.table)
data.table(dataset1)[, lapply(.SD, sum, na.rm = TRUE), by = c("id1", "id2")]
# id1 id2 num1 num2 num3
#1: 1 122 1 2 3
#2: 2 133 50 45 4
You can use dplyr to achieve that:
library(dplyr)
dataset1 %>%
group_by(id1, id2) %>%
mutate(
num1 = sum(num1, na.rm = TRUE),
num2 = sum(num2, na.rm = TRUE),
num3 = sum(num3, na.rm = TRUE)
) %>%
distinct()
This also assumes that repeated values within a variable should be summed: if id1 = 1 has two values for num1, they are added together. If you're confident that every id has only one possible value for each of num1 to num3, this makes no difference.
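One further caveat of the sum-based approaches is worth checking: a group whose values are all NA sums to 0 rather than NA. The short sketch below illustrates this in base R; first_non_na is a hypothetical helper shown only for contrast.

```r
all_missing <- c(NA_real_, NA_real_)

# With na.rm = TRUE, the sum of an all-NA vector is 0, not NA
summed <- sum(all_missing, na.rm = TRUE)

# Picking the first non-missing value keeps the result as NA instead
first_non_na <- function(x) x[!is.na(x)][1]
picked <- first_non_na(all_missing)
```

If a genuinely missing group must stay distinguishable from a true zero, the second form may be the safer choice.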
I have two data frames:
Harry <- c(1, NA, NA, NA)
Tom <- c(NA, 2, NA, NA)
Sally <- c(NA, NA, 3, NA)
Jane <- c(NA, NA, NA, 4)
df <- data.frame(Harry, Tom, Sally, Jane)
Harry <- c(1, NA, NA, NA)
Tom <- c(1, NA, NA, NA)
Mary <- c(NA, NA, 3, NA)
Sarah <- c(NA, NA, NA, 4)
df2 <- data.frame(Harry, Tom, Mary, Sarah)
... where there's only one value per column. I'd like to flatten the data frames into single rows and then vertically concatenate such that each data frame becomes an observation in the new frame. There may be different columns, in which case these columns would be added and hence why I can't use rbind.
In addition and since these are numeric, the NAs should be zeroes and the resulting frame would look as below:
Harry <- c(1, 1)
Tom <- c(2, 1)
Sally <- c(3, 0)
Jane <- c(4, 0)
Mary <- c(0, 3)
Sarah <- c(0, 4)
df <- data.frame(Harry, Tom, Sally, Jane, Mary, Sarah)
I realise I could make everything numeric and total to get each row, but my issue is to get this into a single object.
We can use the gather and spread approach from tidyr (both functions have since been superseded by pivot_longer and pivot_wider).
library(dplyr)
library(tidyr)
df_2 <- df %>% gather(Col, Val, na.rm = TRUE)
df2_2 <- df2 %>% gather(Col, Val, na.rm = TRUE)
df3 <- bind_rows(df_2, df2_2, .id = "ID") %>%
spread(Col, Val, fill = 0) %>%
select(-ID)
df3
# Harry Jane Mary Sally Sarah Tom
# 1 1 4 0 3 0 2
# 2 1 0 3 0 4 1
We can combine the data frames into a single one with bind_rows, create a grouping column with .id (here 'grp'), group by it, and then take the sum of each column with summarise_all:
library(dplyr)
bind_rows(df, df2, .id = 'grp') %>%
group_by(grp) %>%
summarise_all(~ sum(., na.rm = TRUE)) %>%
ungroup %>%
select(-grp)
# A tibble: 2 x 6
# Harry Tom Sally Jane Mary Sarah
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 2 3 4 0 0
#2 1 1 0 0 3 4
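The same flatten-and-stack idea can be sketched in base R without any packages. This is an illustration under the question's assumption of numeric columns; flatten, all_cols and padded are names introduced here.

```r
df <- data.frame(
  Harry = c(1, NA, NA, NA), Tom = c(NA, 2, NA, NA),
  Sally = c(NA, NA, 3, NA), Jane = c(NA, NA, NA, 4)
)
df2 <- data.frame(
  Harry = c(1, NA, NA, NA), Tom = c(1, NA, NA, NA),
  Mary = c(NA, NA, 3, NA), Sarah = c(NA, NA, NA, 4)
)

# Collapse a data frame to one named numeric vector, ignoring NA
flatten <- function(d) colSums(d, na.rm = TRUE)
rows <- list(flatten(df), flatten(df2))

# Union of all column names, in order of first appearance
all_cols <- unique(unlist(lapply(rows, names)))

# Pad each flattened row with zeros for the columns it lacks
padded <- lapply(rows, function(r) {
  out <- setNames(numeric(length(all_cols)), all_cols)
  out[names(r)] <- r
  out
})

result <- as.data.frame(do.call(rbind, padded))
```

Because each row is padded to the full set of names before rbind, data frames with different columns can be stacked, and missing columns come out as 0.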