Merging rows by a group [duplicate] - r

This question already has answers here:
Apply several summary functions (sum, mean, etc.) on several variables by group in one call
(7 answers)
Closed 5 years ago.
I have a data set
d1 <- data.frame(GROUP=c("A","A","A","G","G","F","F","E","T"),
FIRST=c(10,2,3,6,NA,NA,NA,1,NA),
SECOND=c(3,NA,NA,1,NA,4,2,1,NA),
THIRD=c(5,7,NA,NA,NA,1,NA,1,1))
GROUP FIRST SECOND THIRD
1 A 10 3 5
2 A 2 NA 7
3 A 3 NA NA
4 G 6 1 NA
5 G NA NA NA
6 F NA 4 1
7 F NA 2 NA
8 E 1 1 1
9 T NA NA 1
I want to combine the data using the GROUP-column in two ways:
Mean of columns inside a group
GROUP FIRST SECOND THIRD
1 A 5 3 6
2 G 6 1 NA
3 F NA 3 1
4 E 1 1 1
5 T NA NA 1
Column-wise max value inside a group
GROUP FIRST SECOND THIRD
1 A 10 3 7
2 G 6 1 NA
3 F NA 4 1
4 E 1 1 1
5 T NA NA 1
Is there a quick way to do this or should I create a new function?

We can use aggregate from base R
aggregate(.~GROUP, d1, mean, na.rm = TRUE, na.action=NULL)
Or using dplyr
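library(dplyr)
# summarise_each() and funs() are deprecated; across() is the current equivalent
d1 %>%
  group_by(GROUP) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))
Or
d1 %>%
  group_by(GROUP) %>%
  summarise(across(everything(), ~ max(.x, na.rm = TRUE)))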
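Note that for a group where a column is entirely NA, mean(..., na.rm = TRUE) returns NaN and max(..., na.rm = TRUE) returns -Inf with a warning, whereas the desired output in the question shows NA. A minimal base R sketch that returns NA in that case is below; safe_stat is a hypothetical helper name, not part of any package, and aggregate sorts the result by GROUP rather than keeping the original order.

```r
d1 <- data.frame(
  GROUP  = c("A", "A", "A", "G", "G", "F", "F", "E", "T"),
  FIRST  = c(10, 2, 3, 6, NA, NA, NA, 1, NA),
  SECOND = c(3, NA, NA, 1, NA, 4, 2, 1, NA),
  THIRD  = c(5, 7, NA, NA, NA, 1, NA, 1, 1)
)

# Hypothetical helper: apply f to the non-missing values,
# or return NA when there are none
safe_stat <- function(x, f) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_real_ else f(x)
}

# na.action = na.pass keeps rows containing NA so safe_stat can handle them
group_means <- aggregate(. ~ GROUP, d1, function(x) safe_stat(x, mean),
                         na.action = na.pass)
group_maxes <- aggregate(. ~ GROUP, d1, function(x) safe_stat(x, max),
                         na.action = na.pass)

print(group_means)
print(group_maxes)
```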

Related

Conditionally replace NAs in Certain Columns Based on Row Values

For a dataframe like the one below, I am trying to selectively replace the NAs in columns a, b, and c with 0 using R, but only when at least one of those columns has a non-missing value in that row.
For example, I would want to replace the NAs in rows 1,2, and 5, but leave row 4 alone, and not replace the NA in column d
sample data
df <- data.frame(a = c(1,NA,2,NA,3,4),
b = c(NA,5,6,NA,7,8),
c = c(9,NA,10,NA,NA,11),
d = c("Alpha","Beta","Charlie","Delta",NA,"Foxtrot"))
> df
a b c d
1 1 NA 9 Alpha
2 NA 5 NA Beta
3 2 6 10 Charlie
4 NA NA NA Delta
5 3 7 NA <NA>
6 4 8 11 Foxtrot
Desired outcome
> df_naReplaced
a b c d
1 1 0 9 Alpha
2 0 5 0 Beta
3 2 6 10 Charlie
4 NA NA NA Delta
5 3 7 0 <NA>
6 4 8 11 Foxtrot
The solutions that I have found so far only work on conditions by column, but not by row, or would require actively removing those columns from their context (in this example separating it from d).
I have tried using ifelse and an if statement like below but was unable to get it to work as selectively as I would like, as it replaces all NA in that column.
if(df %>% select(a:c) %>% any(!is.na(.))){
df<- df %>% replace_na(list(a= 0,
b= 0,
c= 0)
)
}
Thank you for whatever help you are able to offer!
Here's a base R solution
> df[,-4][(is.na(df[, -4]) & rowSums(is.na(df[, -4])) < 3)] <- 0
> df
a b c d
1 1 0 9 Alpha
2 0 5 0 Beta
3 2 6 10 Charlie
4 NA NA NA Delta
5 3 7 0 <NA>
6 4 8 11 Foxtrot
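The same idea can be written with column names instead of the positional index -4, which may be easier to adapt. This is only a sketch; cols, na_mask and partial_rows are illustrative names that do not appear in the original answer.

```r
df <- data.frame(a = c(1, NA, 2, NA, 3, 4),
                 b = c(NA, 5, 6, NA, 7, 8),
                 c = c(9, NA, 10, NA, NA, 11),
                 d = c("Alpha", "Beta", "Charlie", "Delta", NA, "Foxtrot"))

cols <- c("a", "b", "c")

# Logical matrix marking the missing cells in the selected columns
na_mask <- is.na(df[cols])

# Rows where at least one selected column is non-missing
partial_rows <- rowSums(na_mask) < length(cols)

# The row vector is recycled down each column of the mask,
# so only NAs in partially filled rows are replaced
df[cols][na_mask & partial_rows] <- 0

print(df)
```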

Coalesce multiple columns at once

My question is similar to existing questions about coalesce, but I want to coalesce several columns by row such that NAs are pushed to the last column.
Here's an example:
If I have
a <- data.frame(A=c(2,NA,4,3,2), B=c(NA,3,4,NA,5), C= c(1,3,6,7,NA), D=c(5,6,NA,4,3), E=c(2,NA,1,3,NA))
A B C D E
1 2 NA 1 5 2
2 NA 3 3 6 NA
3 4 4 6 NA 1
4 3 NA 7 4 3
5 2 5 NA 3 NA
I would like to get
b <- data.frame(A=c(2,3,4,3,2), B=c(1,3,4,7,5), C=c(5,6,6,4,3), D=c(2,NA,1,3,NA))
A B C D
1 2 1 5 2
2 3 3 6 NA
3 4 4 6 1
4 3 7 4 3
5 2 5 3 NA
Does anyone have any ideas for how I could do this? I would be so grateful for any tips, as my searches have come up dry.
You can use unite and separate:
library(tidyverse)
a %>%
unite(newcol, everything(), na.rm = TRUE) %>%
separate(newcol, into = LETTERS[1:4])
A B C D
1 2 1 5 2
2 3 3 6 <NA>
3 4 4 6 1
4 3 7 4 3
5 2 5 3 <NA>
Since you have an unknown number of new columns in separate, one can use splitstackshape's function cSplit:
library(splitstackshape)
a %>%
unite(newcol, na.rm = TRUE) %>%
cSplit("newcol", "_", type.convert = F) %>%
rename_with(~ LETTERS[seq_along(.x)])
This could be another solution. From what I understood, you want to drop the first NA in each row so that the values after it shift to the left, and I don't think coalesce can help you here. Note that this removes only the first NA in each row and assumes every row contains at least one NA.
library(dplyr)
library(purrr)
a %>%
pmap_dfr(~ {x <- c(...)[-which(is.na(c(...)))[1]]
setNames(x, LETTERS[seq_along(x)])})
# A tibble: 5 x 4
A B C D
<dbl> <dbl> <dbl> <dbl>
1 2 1 5 2
2 3 3 6 NA
3 4 4 6 1
4 3 7 4 3
5 2 5 3 NA
We may use base R - loop over the rows, order based on the NA elements and remove the columns that have all NAs
a[] <- t(apply(a, 1, \(x) x[order(is.na(x))]))
a[colSums(!is.na(a)) > 0]
A B C D
1 2 1 5 2
2 3 3 6 NA
3 4 4 6 1
4 3 7 4 3
5 2 5 3 NA
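For reuse, the base R approach could be wrapped in a function that leaves the input untouched. This is a sketch only; shift_left is a hypothetical name, and it assumes all columns share one type (numeric here), since apply converts the data frame to a matrix.

```r
a <- data.frame(A = c(2, NA, 4, 3, 2), B = c(NA, 3, 4, NA, 5),
                C = c(1, 3, 6, 7, NA), D = c(5, 6, NA, 4, 3),
                E = c(2, NA, 1, 3, NA))

# Hypothetical helper: move non-missing values to the left within each row,
# then drop the columns that end up entirely NA
shift_left <- function(df) {
  shifted <- t(apply(df, 1, function(x) x[order(is.na(x))]))
  out <- as.data.frame(shifted)
  names(out) <- names(df)  # restore the original column names
  out[colSums(!is.na(out)) > 0]
}

b <- shift_left(a)
print(b)
```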

Remove NA rows based on multiple columns' names in R [duplicate]

This question already has answers here:
Omit rows containing specific column of NA
(10 answers)
Closed 2 years ago.
Given a small dataset as follows:
A B C
1 2 NA
NA 2 3
1 NA 3
1 2 3
How could I remove the rows in which column B or C contains NA?
The expected result looks like this:
A B C
NA 2 3
1 2 3
Another option in Base R is
df[complete.cases(df[c("B","C")]),]
A B C
2 NA 2 3
4 1 2 3
With base R:
df[!is.na(df$B) & !is.na(df$C),]
Using dplyr:
df %>%
filter(!is.na(B), !is.na(C))
returns
# A tibble: 2 x 3
A B C
<dbl> <dbl> <dbl>
1 NA 2 3
2 1 2 3
or
df %>%
drop_na(B, C)
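For completeness, here is a self-contained sketch of the complete.cases approach with the columns given as a character vector. The data frame is rebuilt from the table in the question, and cols and kept are illustrative names.

```r
df <- data.frame(A = c(1, NA, 1, 1),
                 B = c(2, 2, NA, 2),
                 C = c(NA, 3, 3, 3))

# Columns that must be non-missing for a row to be kept
cols <- c("B", "C")

# complete.cases() is TRUE for rows with no NA in the selected columns
kept <- df[complete.cases(df[cols]), ]

print(kept)
```

Because the check only looks at the columns in cols, the NA in column A of row 2 does not cause that row to be dropped.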

Remove groups which do not have non-consecutive NA values in R

I have the following Data frame
group <- c(2,2,2,2,4,4,4,4,5,5,5,5)
D <- c(NA,2,NA,NA,NA,2,3,NA,NA,NA,1,1)
df <- data.frame(group, D)
df
group D
1 2 NA
2 2 2
3 2 NA
4 2 NA
5 4 NA
6 4 2
7 4 3
8 4 NA
9 5 NA
10 5 NA
11 5 1
12 5 1
I would like to keep only the groups that contain non-consecutive NA values at least once. In this case, group 5 would be removed because it contains only consecutive NA values. Groups 2 and 4 remain because they contain non-consecutive NA values (NA values separated by one or more rows with a non-NA value).
The resulting data frame would therefore look like this:
df2
group D
1 2 NA
2 2 2
3 2 NA
4 2 NA
5 4 NA
6 4 2
7 4 3
8 4 NA
any ideas :)?
How about using difference between the index of NA-values per group?
library(dplyr)
df %>% group_by(group) %>% filter(any(diff(which(is.na(D))) > 1))
## A tibble: 8 x 2
## Groups: group [2]
# group D
# <dbl> <dbl>
#1 2. NA
#2 2. 2.
#3 2. NA
#4 2. NA
#5 4. NA
#6 4. 2.
#7 4. 3.
#8 4. NA
I'm not sure this would catch all potential edge cases but it seems to work for the given example.
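The same test can be sketched in base R with ave, which also makes two edge cases easy to see: for a group with zero or one NA, diff() returns an empty vector and any() of an empty vector is FALSE, so such groups are dropped. The names keep and df2 are illustrative.

```r
group <- c(2, 2, 2, 2, 4, 4, 4, 4, 5, 5, 5, 5)
D <- c(NA, 2, NA, NA, NA, 2, 3, NA, NA, NA, 1, 1)
df <- data.frame(group, D)

# For each group, check whether any two NA positions are more than one row
# apart; ave() recycles the single TRUE/FALSE to every row of the group
keep <- ave(is.na(df$D), df$group,
            FUN = function(na) any(diff(which(na)) > 1))

df2 <- df[keep, ]
print(df2)
```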

Growth rates, using the last non-NA value by groups

I have a dataframe that looks like this:
value id
1 2 A
2 5 A
3 NA A
4 7 A
5 9 A
6 1 B
7 NA B
8 NA B
9 5 B
10 6 B
And I would like to calculate growth rates of the value using the id variable to group. Usually, I would do something like this:
df <- df %>% group_by(id) %>% mutate(growth = log(value) - as.numeric(lag(value)))
To get this dataframe:
value id growth
(dbl) (chr) (dbl)
1 2 A NA
2 5 A -0.3905621
3 NA A NA
4 7 A NA
5 9 A -4.8027754
6 1 B NA
7 NA B NA
8 NA B NA
9 5 B NA
10 6 B -3.2082405
Now what I want to do is to use the last non NA value as well for the growth rates. Kind of like calculating the growth rates over the "NA-gaps" as well. For example: In row 4 should be the growth rate from 5 to 7 and in row 9 should be the growth rate from 1 to 5.
Thanks!
zoo::na.locf will replace NAs with the last non-NA value, so this may work for you:
df <- df %>%
group_by(id) %>%
mutate(
valuenoNA = zoo::na.locf(value, na.rm = FALSE),  # na.rm = FALSE keeps leading NAs so the length matches
growth = log(valuenoNA) - as.numeric(lag(valuenoNA)))
value id growth valuenoNA
1 2 A NA 2
2 5 A -0.3905621 5
3 NA A -3.3905621 5
4 7 A -3.0540899 7
5 9 A -4.8027754 9
6 1 B NA 1
7 NA B -1.0000000 1
8 NA B -1.0000000 1
9 5 B 0.6094379 5
10 6 B -3.2082405 6
We can use fill from tidyr (part of the tidyverse)
library(tidyverse)
df %>%
group_by(id) %>%
fill(value) %>%
mutate(growth = log(value) - lag(value))
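The snippets above keep the question's formula, log(value) minus the raw lagged value. If a log growth rate is what is wanted, the lagged value presumably needs a log as well. The base R sketch below makes that assumption, and it also assumes that rows with a missing value should keep an NA growth rate instead of being filled. log_growth_skip_na is a hypothetical helper name.

```r
df <- data.frame(
  value = c(2, 5, NA, 7, 9, 1, NA, NA, 5, 6),
  id    = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B")
)

# Hypothetical helper: log difference to the previous non-missing value;
# rows with a missing value stay NA
log_growth_skip_na <- function(x) {
  out <- rep(NA_real_, length(x))
  ok <- which(!is.na(x))
  if (length(ok) > 0) {
    out[ok] <- c(NA_real_, diff(log(x[ok])))
  }
  out
}

# Apply the helper within each id
df$growth <- ave(df$value, df$id, FUN = log_growth_skip_na)
print(df)
```

With this version, row 4 holds the growth from 5 to 7 and row 9 the growth from 1 to 5, as the question asks.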
