This question already has answers here:
Combining rows based on a column
(1 answer)
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 4 years ago.
I have a dataset1 which is as follows:
dataset1 <- data.frame(
id1 = c(1, 1, 1, 2, 2, 2),
id2 = c(122, 122, 122, 133, 133, 133),
num1 = c(1, NA, NA, 50,NA, NA),
num2 = c(NA, 2, NA, NA, 45, NA),
num3 = c(NA, NA, 3, NA, NA, 4)
)
How to convert multiple rows into a single row?
The desired output is:
id1, id2, num1, num2, num3
1 122 1 2 3
2 133 50 45 4
library(dplyr)
dataset1 %>% group_by(id1, id2) %>%
summarise_all(~ .[!is.na(.)]) %>%  # funs() is deprecated since dplyr 0.8.0
as.data.frame()
# id1 id2 num1 num2 num3
# 1 1 122 1 2 3
# 2 2 133 50 45 4
Note: this assumes each column has exactly one non-NA value per group.
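If a group can hold more than one non-NA value in a column, `.[!is.na(.)]` returns several values and the summary no longer collapses to one row per group. Here is a minimal base R sketch, assuming you want to keep the first non-NA value per group (`first_non_na` is a hypothetical helper, not part of dplyr):

```r
dataset1 <- data.frame(
  id1 = c(1, 1, 1, 2, 2, 2),
  id2 = c(122, 122, 122, 133, 133, 133),
  num1 = c(1, NA, NA, 50, NA, NA),
  num2 = c(NA, 2, NA, NA, 45, NA),
  num3 = c(NA, NA, 3, NA, NA, 4)
)

# Hypothetical helper: first non-NA value, or NA if the group has none
first_non_na <- function(x) x[!is.na(x)][1]

# na.action = na.pass keeps the rows containing NA so the helper sees them
collapsed <- aggregate(
  cbind(num1, num2, num3) ~ id1 + id2,
  data = dataset1,
  FUN = first_non_na,
  na.action = na.pass
)
collapsed
```

Swapping the helper for `max`, `mean` or similar (with `na.rm = TRUE`) changes how ties within a group are resolved.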
Using data.table
library(data.table)
data.table(dataset1)[, lapply(.SD, sum, na.rm = TRUE), by = c("id1", "id2")]
# id1 id2 num1 num2 num3
#1: 1 122 1 2 3
#2: 2 133 50 45 4
You can use dplyr to achieve that:
library(dplyr)
dataset1 %>%
group_by(id1, id2) %>%
mutate(
num1 = sum(num1, na.rm = TRUE),
num2 = sum(num2, na.rm = TRUE),
num3 = sum(num3, na.rm = TRUE)
) %>%
distinct()
This returns the same two rows as the desired output (as a grouped tibble).
Note that this approach sums repeated values: if id1 = 1 had two non-NA values for num1, the result would contain their sum. If you're confident that every id has at most one value for each of num1 to num3, this doesn't matter.
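To make the behaviour of `sum(..., na.rm = TRUE)` concrete, here is a small base R sketch. It also shows a side effect worth knowing: a group with no observed value at all comes out as 0, not NA. `sum_or_na` is a hypothetical helper for cases where that matters.

```r
# Two non-NA values in one group are added together
sum(c(1, 5, NA), na.rm = TRUE)            # 6

# An all-NA group yields 0 rather than NA
sum(c(NA_real_, NA_real_), na.rm = TRUE)  # 0

# Hypothetical helper: return NA when a group has no observed value
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

sum_or_na(c(NA_real_, NA_real_))          # NA
sum_or_na(c(1, 5, NA))                    # 6
```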
This question already has answers here:
How can I count the number of NAs per group?
(3 answers)
Closed 6 months ago.
I have a grouped data frame with some NA values in all columns.
id <- rep(c("a", "b", "c"), 3)
x1 <- c(1, NA, NA, 2, 2, NA, 0, NA, 0)
x2 <- c(1, 2, 3, NA, 12, NA, NA, 4, NA)
df <- cbind.data.frame(id, x1, x2)
I want to group by ID and then summarize the number of NAs across all numeric columns. The resulting data frame should have 3 rows (1 for each ID) and 2 columns (x1 and x2) and should contain the sums of NAs in both columns by ID.
library(dplyr)
df %>%
group_by(id) %>%
summarise(across(c(x1, x2), ~ sum(is.na(.x))))
or, with aggregate:
aggregate(list(x1 = df$x1, x2 = df$x2), by = list(id = df$id), function(x) sum(is.na(x)))
output
id x1 x2
<chr> <int> <int>
1 a 0 2
2 b 2 0
3 c 2 2
Using rowsum in base R
rowsum(+(is.na(df[-1])), df$id, na.rm = TRUE)
x1 x2
a 0 2
b 2 0
c 2 2
I am trying to aggregate records with a specific type into subsequent records.
I have a dataset similar to the following:
df_initial <- data.frame("Id" = c(1, 2, 3, 4, 5),
"Qty" = c(105, 110, 100, 115, 120),
"Type" = c("A", "B", "B", "A", "A"),
"Difference" = c(30, 34, 32, 30, 34))
After sorting on the Id field, I'd like to aggregate records of Type = "B" into the next record of Type = "A".
In other words, I'm looking to create df_new, which adds the Qty and Difference values for Ids 2 and 3 into the Qty and Difference values for Id 4, and flags Id 4 as being adjusted (in the field AdjustedFlag).
df_new <- data.frame("Id" = c(1, 4, 5),
"Qty" = c(105, 325, 120),
"Type" = c("A", "A", "A"),
"Difference" = c(30, 96, 34),
"AdjustedFlag" = c(0, 1, 0))
I'd greatly appreciate any advice or ideas about how to do this in R, preferably using data.table.
A data.table solution:
library(data.table)
setDT(df_initial)  # df_initial is a data.frame; convert it by reference first
df_initial[, .(
Id = Id[.N], Qty = sum(Qty),
Difference = sum(Difference),
AdjustedFlag = +(.N > 1)
), by = .(grp = rev(cumsum(rev(Type == "A"))))
][, grp := NULL][]
# Id Qty Difference AdjustedFlag
# <num> <num> <num> <int>
# 1: 1 105 30 0
# 2: 4 325 96 1
# 3: 5 120 34 0
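The grouping expression in `by` does most of the work, so it may help to take it apart. This is a base R sketch of the same expression evaluated step by step on the example's Type column (the intermediate names `is_a` and `from_end` are just for illustration):

```r
Type <- c("A", "B", "B", "A", "A")

is_a <- Type == "A"            # TRUE FALSE FALSE TRUE TRUE

# Count "A" rows starting from the last row; the counter only moves on an "A",
# so every "B" inherits the count of the next "A" below it
from_end <- cumsum(rev(is_a))  # 1 2 2 2 3

# Flip back to the original row order
grp <- rev(from_end)           # 3 2 2 2 1
grp
```

Rows 2 to 4 share the value 2, which is what merges Ids 2 and 3 into Id 4.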
This can be solved by creating a new grouping variable that assigns each run of B rows to the A row that follows it, and then aggregating by that variable.
Instead of having
A B B A A
that new grouping variable should look something like this:
1 2 2 2 3
This is not a data.table solution, but the same logic could be applied there:
library(tidyverse)
df_initial |>
mutate(
type2 = ifelse(Type == "A", as.numeric(factor(Type)), 0),
type2 = cumsum(type2),
type2 = ifelse(Type == "B", NA, type2)
) |>
fill(type2, .direction = "up") |>
group_by(type2) |>
summarise(
id = max(Id),
Qty = sum(Qty),
Difference = sum(Difference),
AdjustedFlag = as.numeric(n() > 1)
)
#> # A tibble: 3 × 5
#> type2 id Qty Difference AdjustedFlag
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 105 30 0
#> 2 2 4 325 96 1
#> 3 3 5 120 34 0
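The 1 2 2 2 3 grouping variable can also be built with a single cumulative sum. Below is a base R sketch, assuming the data is already sorted by Id and the last row is of Type "A" (trailing B rows would otherwise form a group of their own):

```r
df_initial <- data.frame(
  Id = c(1, 2, 3, 4, 5),
  Qty = c(105, 110, 100, 115, 120),
  Type = c("A", "B", "B", "A", "A"),
  Difference = c(30, 34, 32, 30, 34)
)

is_a <- df_initial$Type == "A"

# A new group starts on the row after each "A": 1 2 2 2 3
grp <- cumsum(c(TRUE, head(is_a, -1)))

# Aggregate each column by group; tapply returns groups in sorted order
df_new <- data.frame(
  Id = as.numeric(tapply(df_initial$Id, grp, max)),
  Qty = as.numeric(tapply(df_initial$Qty, grp, sum)),
  Type = "A",
  Difference = as.numeric(tapply(df_initial$Difference, grp, sum)),
  AdjustedFlag = as.numeric(table(grp) > 1)
)
df_new
```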
Using tidyverse
library(dplyr)
df_initial %>%
mutate(gn = if_else(lag(Type, default = 'A') == 'B' | Type == 'B', 'B', Type),
gr = cumsum(lag(gn, default = 'A') != gn),
adjusted = if_else(lag(Type, default = 'A') == 'B' | Type == 'B', 1, 0)) %>%
group_by(gr) %>%
summarise(Id = last(Id),
Qty = sum(Qty),
Type = 'A',
Difference = sum(Difference),
Adjusted_flg = max(adjusted)) %>% ungroup()
Here we create an interim dataset that looks like:
  Id Qty Type Difference gn gr adjusted
1  1 105    A         30  A  0        0
2  2 110    B         34  B  1        1
3  3 100    B         32  B  1        1
4  4 115    A         30  B  1        1
5  5 120    A         34  A  2        0
The final table is then built from this in the summarise step. The gr column identifies each group of rows to be merged, which is why we group_by it.
This question already has answers here:
Replace all NA with FALSE in selected columns in R
(5 answers)
Closed 2 years ago.
I have this dataframe
dtf <- data.frame(
id = seq(1, 4),
amt = c(1, 4, NA, 123),
xamt = c(1, 4, NA, 123),
camt = c(1, 4, NA, 123),
date = c("2020-01-01", NA, "2020-01-01", NA),
pamt = c(1, 4, NA, 123)
)
I'd like to replace all NA values with 0, but only in the numeric columns, in my case amt, xamt, pamt and camt. I'm looking for a dplyr way. Normally I would use
replace(is.na(.), 0)
But this doesn't work because of the date column.
You can use across :
library(dplyr)
dtf %>% mutate(across(where(is.numeric), ~replace(., is.na(.), 0)))
#mutate_if for dplyr < 1.0.0
#dtf %>% mutate_if(is.numeric, ~replace(., is.na(.), 0))
You can also use replace_na from tidyr:
dtf %>% mutate(across(where(is.numeric), ~ tidyr::replace_na(.x, 0)))
# id amt xamt camt date pamt
#1 1 1 1 1 2020-01-01 1
#2 2 4 4 4 <NA> 4
#3 3 0 0 0 2020-01-01 0
#4 4 123 123 123 <NA> 123
As suggested by @Darren Tsai we can also use coalesce.
dtf %>% mutate(across(where(is.numeric), ~ coalesce(.x, 0)))
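For comparison, the same idea works without any packages. This is a base R sketch: find the numeric columns, then replace NA only in those (`num_cols` is just an illustrative name).

```r
dtf <- data.frame(
  id = seq(1, 4),
  amt = c(1, 4, NA, 123),
  xamt = c(1, 4, NA, 123),
  camt = c(1, 4, NA, 123),
  date = c("2020-01-01", NA, "2020-01-01", NA),
  pamt = c(1, 4, NA, 123)
)

# Logical vector marking the numeric columns
num_cols <- vapply(dtf, is.numeric, logical(1))

# Replace NA with 0 in those columns only; `date` is left untouched
dtf[num_cols] <- lapply(dtf[num_cols], function(x) replace(x, is.na(x), 0))
dtf
```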
library(tidyverse)
df <- tibble(Date = c(rep(as.Date("2020-01-01"), 3), NA),
col1 = 1:4,
thisCol = c(NA, 8, NA, 3),
thatCol = 25:28,
col999 = rep(99, 4))
#> # A tibble: 4 x 5
#> Date col1 thisCol thatCol col999
#> <date> <int> <dbl> <int> <dbl>
#> 1 2020-01-01 1 NA 25 99
#> 2 2020-01-01 2 8 26 99
#> 3 2020-01-01 3 NA 27 99
#> 4 NA 4 3 28 99
My actual R data frame has hundreds of columns that aren't neatly named, but can be approximated by the df data frame above.
I want to replace all NA values with 0, with the exception of several columns (in my example I want to leave out the Date column and the thatCol column). I'd want to do it in this sort of fashion:
df %>% replace(is.na(.), 0)
#> Error: Assigned data `values` must be compatible with existing data.
#> i Error occurred for column `Date`.
#> x Can't convert <double> to <date>.
#> Run `rlang::last_error()` to see where the error occurred.
And my unsuccessful ideas for accomplishing the "everything except" replace NA are shown below.
df %>% replace(is.na(c(., -c(Date, thatCol)), 0))
df %>% replace_na(list([, c(2:3, 5)] = 0))
df %>% replace_na(list(everything(-c(Date, thatCol)) = 0))
Is there a way to select "everything but" in the way I need to? There are hundreds of columns, named inconsistently, so typing them one by one is not a practical option.
You can use mutate_at :
library(dplyr)
Remove them by Name
df %>% mutate_at(vars(-c(Date, thatCol)), ~replace(., is.na(.), 0))
Remove them by position
df %>% mutate_at(-c(1,4), ~replace(., is.na(.), 0))
Select them by name
df %>% mutate_at(vars(col1, thisCol, col999), ~replace(., is.na(.), 0))
Select them by position
df %>% mutate_at(c(2, 3, 5), ~replace(., is.na(.), 0))
If you want to use replace_na
df %>% mutate_at(vars(-c(Date, thatCol)), tidyr::replace_na, 0)
Note that mutate_at is superseded by across as of dplyr 1.0.0.
You have several options here based on data.table.
One of the coolest options: setnafill (version >= 1.12.4):
library(data.table)
setDT(df)
data.table::setnafill(df, fill = 0, cols = colnames(df)[!(colnames(df) %in% c("Date", "thatCol"))])
Note that your dataframe is updated by reference.
Another base solution:
to_change<-grep("^(this|col)",names(df))
df[to_change] <- lapply(df[to_change], function(x) replace(x, is.na(x), 0))  # lapply keeps a list of columns; sapply would return a matrix
df
# A tibble: 4 x 5
Date col1 thisCol thatCol col999
<date> <dbl> <dbl> <int> <dbl>
1 2020-01-01 1 0 25 99
2 2020-01-01 2 8 26 99
3 2020-01-01 3 0 27 99
4 NA 0 3 28 99
Data (I changed one value):
df <- structure(list(Date = structure(c(18262, 18262, 18262, NA), class = "Date"),
col1 = c(1L, 2L, 3L, NA), thisCol = c(NA, 8, NA, 3), thatCol = 25:28,
col999 = c(99, 99, 99, 99)), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
replace works on a data.frame, so we can just do the replacement by index and update the original dataset
df[-c(1, 4)] <- replace(df[-c(1, 4)], is.na(df[-c(1, 4)]), 0)
Or using replace_na with across (from the new dplyr)
library(dplyr)
library(tidyr)
df %>%
mutate(across(-c(Date, thatCol), ~ replace_na(., 0)))
If you know the ones that you don't want to change, you could do it like this:
df <- tibble(Date = c(rep(as.Date("2020-01-01"), 3), NA),
col1 = 1:4,
thisCol = c(NA, 8, NA, 3),
thatCol = 25:28,
col999 = rep(99, 4))
#dplyr
df_nonreplace <- select(df, c("Date", "thatCol"))
df_replace <- df[ ,!names(df) %in% names(df_nonreplace)]
df_replace[is.na(df_replace)] <- 0
df <- cbind(df_nonreplace, df_replace)
> head(df)
Date thatCol col1 thisCol col999
1 2020-01-01 25 1 0 99
2 2020-01-01 26 2 8 99
3 2020-01-01 27 3 0 99
4 <NA> 28 4 3 99
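Splitting and re-binding moves the untouched columns to the front, as the output shows. If the original column order matters, here is a base R sketch that assigns into the selected columns in place instead. It assumes a plain data.frame, and `keep_as_is` and `to_fill` are illustrative names:

```r
df <- data.frame(
  Date = c(rep(as.Date("2020-01-01"), 3), NA),
  col1 = 1:4,
  thisCol = c(NA, 8, NA, 3),
  thatCol = 25:28,
  col999 = rep(99, 4)
)

keep_as_is <- c("Date", "thatCol")
to_fill <- setdiff(names(df), keep_as_is)

# Index-assign into the selected columns so the column order is preserved
df[to_fill][is.na(df[to_fill])] <- 0
df
```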
This question already has answers here:
Filter data.frame rows by a logical condition
(9 answers)
Closed 3 years ago.
I have a data frame with lots of columns. For example:
sample treatment col5 col6 col7
1 a 3 0 5
2 a 1 0 3
3 a 0 0 2
4 b 0 1 1
I want to select the sample and treatment columns plus all columns that meet the following 2 conditions:
Their value on the row in which treatment == 'b' is 0
Their value from at least one row where treatment == 'a' is not 0.
The expected result should look like this:
sample treatment col5
1 a 3
2 a 1
3 a 0
4 b 0
Example dataframe:
structure(list(sample = 1:4, treatment = structure(c(1L, 1L,
1L, 2L), .Label = c("a", "b"), class = "factor"), col5 = c(3,
1, 0, 0), col6 = c(0, 0, 0, 1), col7 = c(5, 3, 2, 1)), class = "data.frame", row.names = c(NA,
-4L))
Here's a way in base R -
cs_a <- colSums(df[df$treatment == "a",-c(1:2)]) > 0
cs_b <- colSums(df[df$treatment == "b",-c(1:2)]) == 0
df[, c(TRUE, TRUE, cs_a & cs_b)]
sample treatment col5
1 1 a 3
2 2 a 1
3 3 a 0
4 4 b 0
With dplyr -
library(dplyr)
df %>%
select_at(which(c(TRUE, TRUE, cs_a & cs_b)))
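The `colSums(...) > 0` test relies on the values being non-negative (a column holding 2 and -2 in the "a" rows would sum to 0). Here is a base R sketch that checks the two stated conditions literally instead. The names `is_a`, `is_b`, `candidates` and `keep` are illustrative:

```r
df <- data.frame(
  sample = 1:4,
  treatment = factor(c("a", "a", "a", "b")),
  col5 = c(3, 1, 0, 0),
  col6 = c(0, 0, 0, 1),
  col7 = c(5, 3, 2, 1)
)

is_a <- df$treatment == "a"
is_b <- df$treatment == "b"
candidates <- setdiff(names(df), c("sample", "treatment"))

# Keep a column if every "b" value is 0 and at least one "a" value is not 0
keep <- vapply(
  df[candidates],
  function(x) all(x[is_b] == 0) && any(x[is_a] != 0),
  logical(1)
)

result <- df[c("sample", "treatment", candidates[keep])]
result
```

Selecting the id columns by name rather than by position also avoids depending on them being the first two columns.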
Here is a much more verbose way in tidyverse that does not require a manual colSums for each level of treatment:
library(dplyr)
library(purrr)
library(tidyr)
sample <- 1:4
treatment <- c("a", "a", "a", "b")
col5 <- c(3,1,0,0)
col6 <- c(0,0,0,1)
col7 <- c(5,3,2,1)
dd <- data.frame(sample, treatment, col5, col6, col7)
# first create new columns that report whether the entries are zero
dd2 <- mutate_if(
.tbl = dd,
.predicate = is.numeric,
.funs = function(x)
x == 0
)
# then find the sum per column and per treatment group
# in R TRUE = 1 and FALSE = 0
number_of_zeros <- dd2 %>%
group_by(treatment) %>%
summarise_at(.vars = vars(col5:col7), .funs = "sum")
# then find the names of the columns you want to keep
keeper_columns <-
number_of_zeros %>%
select(-treatment) %>% # remove the treatment grouping variable
map_dfr( # check whether all entries per column (now one per treatment level) are greater than zero
.x = .,
.f = function(x)
all(x > 0)
) %>%
gather(column, keeper) %>% # reformat
filter(keeper == TRUE) %>% # to grab the keepers
select(column) %>% # then select the column with column names
unlist %>% # and convert to character vector
unname
# subset the original dataset for the wanted columns
wanted_columns <- dd %>% select(1:2, all_of(keeper_columns))  # all_of() marks keeper_columns as an external character vector