Want to remove duplicated rows unless NA value exists in columns - r

I have a data table with 4 columns: ID, Name, Rate1, Rate2.
I want to remove duplicates where ID, Rate1, and Rate2 are the same, but if both rates are NA, I would like to keep both rows.
Basically, I want to conditionally remove duplicates, but only if the conditions != NA.
For example, I would like this:
ID Name Rate1 Rate2
1 Xyz 1 2
1 Abc 1 2
2 Def NA NA
2 Lmn NA NA
3 Hij 3 5
3 Qrs 3 7
to become this:
ID Name Rate1 Rate2
1 Xyz 1 2
2 Def NA NA
2 Lmn NA NA
3 Hij 3 5
3 Qrs 3 7
Thanks in advance!
EDIT: I know it's possible to just take a subset of the data table where the Rates are NA, then remove duplicates on what's left, then add the NA rows back in - but, I would rather avoid this strategy. This is because in reality there are quite a few couplets of rates that I want to do this for consecutively.
EDIT2: Added in some more rows to the example for clarity.

A base R option is to call duplicated on the dataset without the 'Name' column (column index 2) to create a logical vector, then negate it (! turns TRUE into FALSE and vice versa) so that TRUE marks the non-duplicated rows. A second condition uses rowSums on a logical matrix (is.na(df1[3:4]), i.e. the Rate columns) to find the rows whose rates are all NA; the row sum is compared with 2, the number of Rate columns in the dataset. The two conditions are joined with | to create the expected logical index.
i1 <- !duplicated(df1[-2])| rowSums(is.na(df1[3:4])) == 2
df1[i1,]
# ID Name Rate1 Rate2
#1 1 Xyz 1 2
#3 2 Def NA NA
#4 2 Lmn NA NA
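To see how the two conditions combine, the intermediate vectors can be inspected one at a time. This is only an illustrative breakdown of the expression above; the names dup, all_na and keep are introduced here and are not part of the original answer.

```r
df1 <- data.frame(
  ID    = c(1L, 1L, 2L, 2L),
  Name  = c("Xyz", "Abc", "Def", "Lmn"),
  Rate1 = c(1L, 1L, NA, NA),
  Rate2 = c(2L, 2L, NA, NA),
  stringsAsFactors = FALSE
)

# TRUE where ID, Rate1 and Rate2 repeat an earlier row (Name is dropped first)
dup <- duplicated(df1[-2])

# TRUE where both Rate columns are NA
all_na <- rowSums(is.na(df1[3:4])) == 2

# keep a row if it is not a duplicate, or if its rates are all NA
keep <- !dup | all_na

df1[keep, ]
```

Note that duplicated treats the two NA rows as equal to each other, which is why the second condition is needed to bring the row for Lmn back.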
Or with Reduce from base R
df1[Reduce(`&`, lapply(df1[3:4], is.na)) | !duplicated(df1[-2]), ]
Wrapping it in a function
f1 <- function(dat, i, method ) {
nm1 <- grep("^Rate", colnames(dat), value = TRUE)
i1 <- !duplicated(dat[-i])
i2 <- switch(method,
"rowSums" = rowSums(is.na(dat[nm1])) == length(nm1),
"Reduce" = Reduce(`&`, lapply(dat[nm1], is.na))
)
i3 <- i1|i2
dat[i3,]
}
-testing
f1(df1, 2, "rowSums")
# ID Name Rate1 Rate2
#1 1 Xyz 1 2
#3 2 Def NA NA
#4 2 Lmn NA NA
f1(df1, 2, "Reduce")
# ID Name Rate1 Rate2
#1 1 Xyz 1 2
#3 2 Def NA NA
#4 2 Lmn NA NA
f1(df2, 2, "rowSums")
# ID Name Rate1 Rate2
#1 1 Xyz 1 2
#3 2 Def NA NA
#4 2 Lmn NA NA
#5 3 Hij 3 5
#6 3 Qrs 3 7
f1(df2, 2, "Reduce")
# ID Name Rate1 Rate2
#1 1 Xyz 1 2
#3 2 Def NA NA
#4 2 Lmn NA NA
#5 3 Hij 3 5
#6 3 Qrs 3 7
If there are multiple 'Rate' columns (say 100 or more), the only thing to change in the first solution is the 2, which should become the number of 'Rate' columns.
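The question's EDIT mentions several couplets of rates that should be handled one after another. A minimal sketch of one possible reading of that requirement follows. The columns Rate3 and Rate4, the helper dedupe_on and the list rate_pairs are all hypothetical, and unlike f1 the helper only compares ID plus the current pair of rates.

```r
# hypothetical data with two pairs of rate columns
dat <- data.frame(
  ID    = c(1L, 1L, 2L, 2L, 3L, 3L),
  Name  = c("Xyz", "Abc", "Def", "Lmn", "Hij", "Qrs"),
  Rate1 = c(1L, 1L, NA, NA, 3L, 3L),
  Rate2 = c(2L, 2L, NA, NA, 5L, 7L),
  Rate3 = c(4L, 4L, 1L, 1L, NA, NA),
  Rate4 = c(5L, 5L, 1L, 1L, NA, NA),
  stringsAsFactors = FALSE
)

# hypothetical helper: drop duplicates on ID + one set of rate columns,
# but keep every row whose rates in that set are all NA
dedupe_on <- function(d, id_col, rate_cols) {
  dup    <- duplicated(d[c(id_col, rate_cols)])
  all_na <- rowSums(is.na(d[rate_cols])) == length(rate_cols)
  d[!dup | all_na, , drop = FALSE]
}

rate_pairs <- list(c("Rate1", "Rate2"), c("Rate3", "Rate4"))

# apply the helper for each pair in turn, feeding each result into the next step
res <- Reduce(function(d, p) dedupe_on(d, "ID", p), rate_pairs, dat)
res
```

Because the pairs are applied consecutively, a row that survives the first pair only because its rates were NA can still be dropped as a duplicate by a later pair. Whether that is the desired behaviour depends on the real data.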
Or using tidyverse
library(tidyverse)
df1 %>%
group_by(ID) %>%
filter_at(vars(Rate1, Rate2), any_vars(!duplicated(.)|is.na(.)))
# A tibble: 3 x 4
# Groups: ID [2]
# ID Name Rate1 Rate2
# <int> <chr> <int> <int>
#1 1 Xyz 1 2
#2 2 Def NA NA
#3 2 Lmn NA NA
df2 %>%
group_by(ID) %>%
filter_at(vars(Rate1, Rate2), any_vars(!duplicated(.)|is.na(.)))
# A tibble: 5 x 4
# Groups: ID [3]
# ID Name Rate1 Rate2
# <int> <chr> <int> <int>
#1 1 Xyz 1 2
#2 2 Def NA NA
#3 2 Lmn NA NA
#4 3 Hij 3 5
#5 3 Qrs 3 7
As @Paul mentioned in the comments, the updated tidyverse syntax as of Nov 4 2021 is
library(dplyr)
df2 %>%
group_by(ID) %>%
filter(if_any(c(Rate1, Rate2), ~ !duplicated(.) | is.na(.)))
data
df1 <- structure(list(ID = c(1L, 1L, 2L, 2L), Name = c("Xyz", "Abc",
"Def", "Lmn"), Rate1 = c(1L, 1L, NA, NA), Rate2 = c(2L, 2L, NA,
NA)), class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 3L), Name = c("Xyz",
"Abc", "Def", "Lmn", "Hij", "Qrs"), Rate1 = c(1L, 1L, NA, NA,
3L, 3L), Rate2 = c(2L, 2L, NA, NA, 5L, 7L)), class = "data.frame",
row.names = c(NA, -6L))

Related

R Merging non-unique columns to consolidate data frame

I'm having issues figuring out how to merge non-unique columns that look like this:
2_2 2_3 2_4 2_2 3_2
1 2 3 NA NA
2 3 -1 NA NA
NA NA NA 3 -2
NA NA NA -2 4
To make them look like this:
2_2 2_3 2_4 3_2
1 2 3 NA
2 3 -1 NA
3 NA NA -2
-2 NA NA 4
Essentially reshaping any non-unique columns. I have a large data set to work with so this is becoming an issue!
Note that data.frame doesn't allow for duplicate column names. Even if we create those, it may get modified when we apply functions as make.unique is automatically applied. Assuming we created the data.frame with duplicate names, an option is to use split.default to split the data into list of subset of data, then loop over the list with map and use coalesce
library(dplyr)
library(purrr)
map_dfc(split.default(df1, names(df1)),~ invoke(coalesce, .x))
-output
# A tibble: 4 × 4
`2_2` `2_3` `2_4` `3_2`
<int> <int> <int> <int>
1 1 2 3 NA
2 2 3 -1 NA
3 3 NA NA -2
4 -2 NA NA 4
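For readers who prefer to avoid the purrr dependency, the same idea can be sketched in base R. This is an alternative written for illustration, not the answer's own method; first_non_na, groups, merged and res are names introduced here, and the data is the df1 from this answer.

```r
df1 <- structure(list(`2_2` = c(1L, 2L, NA, NA), `2_3` = c(2L, 3L, NA, NA),
                      `2_4` = c(3L, -1L, NA, NA), `2_2` = c(NA, NA, 3L, -2L),
                      `3_2` = c(NA, NA, -2L, 4L)),
                 class = "data.frame", row.names = c(NA, -4L))

# element-wise: take x where it is present, otherwise fall back to y
first_non_na <- function(x, y) ifelse(is.na(x), y, x)

# one list element per distinct column name, holding the columns that share it
groups <- split.default(df1, names(df1))

# collapse each group of columns into a single vector
merged <- lapply(groups, function(g) Reduce(first_non_na, g))

# check.names = FALSE keeps the non-syntactic names such as 2_2
res <- data.frame(merged, check.names = FALSE)
res
```

As with coalesce, the first non-NA value wins when two columns with the same name both hold a value in the same row.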
data
df1 <- structure(list(`2_2` = c(1L, 2L, NA, NA), `2_3` = c(2L, 3L, NA,
NA), `2_4` = c(3L, -1L, NA, NA), `2_2` = c(NA, NA, 3L, -2L),
`3_2` = c(NA, NA, -2L, 4L)), class = "data.frame", row.names = c(NA,
-4L))
Also using coalesce:
You are using non-syntactic names. R is strict about names (see https://adv-r.hadley.nz/names-values.html, and note the explanation by @akrun), so the example below uses the syntactic versions X2_2, X2_2.1, and so on:
library(dplyr)
df %>%
mutate(X2_2 = coalesce(X2_2, X2_2.1), .keep="unused")
X2_2 X2_3 X2_4 X3_2
1 1 2 3 NA
2 2 3 -1 NA
3 3 NA NA -2
4 -2 NA NA 4

Join similar observations within a data.frame with R

I want to mix several observations in a data.frame using as a reference one constantly repeated variable.
Example:
id var1 var2 var3
a 1 na na
a na 2 na
a na na 3
b 1 na
b na 2 na
b na na na
c na na 3
c na 2 na
c 1 na na
Expected result:
id var1 var2 var3
a 1 2 3
b 1 2 na
c 1 2 3
A possible solution (replacing "na" by NA with na_if):
library(tidyverse)
df %>%
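# na_if() no longer accepts a whole data frame since dplyr 1.1.0, so apply it per column
mutate(across(everything(), ~ na_if(.x, "na"))) %>%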
group_by(id) %>%
summarize(across(var1:var3, ~ sort(.x)[1]))
#> # A tibble: 3 × 4
#> id var1 var2 var3
#> <chr> <chr> <chr> <chr>
#> 1 a 1 2 3
#> 2 b 1 2 <NA>
#> 3 c 1 2 3
Assumptions:
"na" above is really R's native NA (not a string);
b's first row, var2 should be NA instead of an empty string ""
perhaps from the above, var1:var3 should be numbers
either you will never have a group where there is more than one non-NA in a group/column, or you don't care about anything other than the first and want the remaining discarded
library(dplyr)
dat %>%
group_by(id) %>%
summarize(across(everything(), ~ na.omit(.)[1]))
# # A tibble: 3 x 4
# id var1 var2 var3
# <chr> <int> <int> <int>
# 1 a 1 2 3
# 2 b 1 2 NA
# 3 c 1 2 3
Data
dat <- structure(list(id = c("a", "a", "a", "b", "b", "b", "c", "c", "c"), var1 = c(1L, NA, NA, 1L, NA, NA, NA, NA, 1L), var2 = c(NA, 2L, NA, NA, 2L, NA, NA, 2L, NA), var3 = c(NA, NA, 3L, NA, NA, NA, 3L, NA, NA)), class = "data.frame", row.names = c(NA, -9L))
Assuming that your data has NA, you can use the following base R option with the data from @r2evans (thanks!):
aggregate(.~id, dat, mean, na.rm = TRUE, na.action=NULL)
Output:
id var1 var2 var3
1 a 1 2 3
2 b 1 2 NaN
3 c 1 2 3

Aggregating rows across multiple values

I have a large dataframe with approximately this pattern:
Person Rate Street a b c d e f
A 2 XYZ 1 NULL 3 4 5 NULL
A 2 XYZ NULL 2 NULL NULL NULL NULL
A 3 XYZ NULL NULL NULL NULL NULL 6
B 2 DEF NULL NULL NULL NULL 5 NULL
B 2 DEF NULL 2 3 NULL NULL 6
C 1 DEF 1 2 3 4 5 6
The columns a, b, c, d, e, f stand in for about 600 columns.
I am trying to combine the rows so that each person becomes one line: columns a-f are combined into a single line using sum, and any conflicting rate or street information becomes a new column. So the data should look something like this:
Person  Rate  Rate 2  Street  a     b  c  d     e  f
A       2     3       XYZ     1     2  3  4     5  6
B       2             DEF     NULL  2  3  NULL  5  6
C       1             DEF     1     2  3  4     5  6
I keep trying to make this work with aggregate and summarize but I'm not sure that's the right approach.
Thank you very much for your help!
First we pivot all the unique rates per person and street.
library(reshape2)
tmp1=dcast(unique(df[,c("Person","Rate","Street")]),Person+Street~Rate,value.var="Rate")
colnames(tmp1)[-c(1:2)]=paste("Rate",colnames(tmp1)[-c(1:2)])
Then we aggregate by person and street, summing columns 4 to 9 ("a" to "f"; change the indices accordingly).
tmp2=aggregate(df[,4:9],list(Person=df$Person,Street=df$Street),function(x){
# NA when the group has no values at all, otherwise the sum of the non-missing values
ifelse(all(is.na(x)),NA,sum(x,na.rm=TRUE))
})
And finally merge the two.
merge(tmp1,tmp2,by=c("Person","Street"))
Person Street Rate 1 Rate 2 Rate 3 a b c d e f
1 A XYZ NA 2 3 1 2 3 4 5 6
2 B DEF NA 2 NA NA 2 3 NA 5 6
3 C DEF 1 NA NA 1 2 3 4 5 6
Perhaps you can do this as a two-step process:
library(dplyr)
library(tidyr)
#sum columns a-f
table1 <- df %>%
group_by(Person) %>%
summarise(across(a:f, ~ sum(.x, na.rm = TRUE)))
#Remove duplicated values and get the data in separate columns
#for Rate and Street columns.
table2 <- df %>%
group_by(Person) %>%
mutate(across(c(Rate, Street), ~replace(., duplicated(.), NA))) %>%
select(Person, Rate, Street) %>%
filter(if_any(c(Rate, Street), ~!is.na(.))) %>%
mutate(col = row_number()) %>%
ungroup %>%
pivot_wider(names_from = col, values_from = c(Rate, Street)) %>%
select(where(~any(!is.na(.))))
#Join the two data to get final result
inner_join(table1, table2, by = 'Person')
# Person a b c d e f Rate_1 Rate_2 Street_1
# <chr> <int> <int> <int> <int> <int> <int> <int> <int> <chr>
#1 A 1 2 3 4 5 6 2 3 XYZ
#2 B 0 2 3 0 5 6 2 NA DEF
#3 C 1 2 3 4 5 6 1 NA DEF
data
It is helpful and easier to help when you share data in a reproducible format which can be copied directly. I have used the below data for the answer.
df <- structure(list(Person = c("A", "A", "A", "B", "B", "C"), Rate = c(2L,
2L, 3L, 2L, 2L, 1L), Street = c("XYZ", "XYZ", "XYZ", "DEF", "DEF",
"DEF"), a = c(1L, NA, NA, NA, NA, 1L), b = c(NA, 2L, NA, NA,
2L, 2L), c = c(3L, NA, NA, NA, 3L, 3L), d = c(4L, NA, NA, NA,
NA, 4L), e = c(5L, NA, NA, 5L, NA, 5L), f = c(NA, NA, 6L, NA,
6L, 6L)), row.names = c(NA, -6L), class = "data.frame")

Create lag numbers upto unique values

I have a dataframe df, where I need to have the lag values to get the difference between times
df
ColA ColB Lag(ColB)
1 11:00:12 11:00:13
1 11:00:13 11:00:14
1 11:00:14 NA
2 11:00:15 11:00:16
2 11:00:16 11:00:17
2 11:00:17 NA
3 11:00:18 11:00:19
3 11:00:19 11:00:20
3 11:00:20 NA
Above only upto unique values I need to create a lag. If you see, the moment ColA changes from 1 to 2 and from 2 to 3, the lag is NA. So Is it possible to achieve this?
As mentioned by @Sotos, you need to group by ColA before creating the lag column, and then calculate the time difference. Note that the next value within each group comes from lead(), not lag().
Using dplyr and lubridate packages, you can calculate diff time by group
library(dplyr)
library(lubridate)
df %>% group_by(ColA) %>% mutate(NewLag = lead(ColB)) %>%
mutate(diff = hms(NewLag)-hms(ColB))
# A tibble: 9 x 5
# Groups: ColA [3]
ColA ColB `Lag(ColB)` NewLag diff
<int> <chr> <chr> <chr> <dbl>
1 1 11:00:12 11:00:13 11:00:13 1
2 1 11:00:13 11:00:14 11:00:14 1
3 1 11:00:14 NA NA NA
4 2 11:00:15 11:00:16 11:00:16 1
5 2 11:00:16 11:00:17 11:00:17 1
6 2 11:00:17 NA NA NA
7 3 11:00:18 11:00:19 11:00:19 1
8 3 11:00:19 11:00:20 11:00:20 1
9 3 11:00:20 NA NA NA
Is this what you are looking for?
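The same grouped "next value" logic can also be sketched in base R, without dplyr or lubridate. This is an illustrative alternative rather than part of the original answer; the columns NextB and diff_secs and the helper to_time are names introduced here, and only the first two groups of the example data are used.

```r
df <- data.frame(
  ColA = c(1L, 1L, 1L, 2L, 2L, 2L),
  ColB = c("11:00:12", "11:00:13", "11:00:14",
           "11:00:15", "11:00:16", "11:00:17"),
  stringsAsFactors = FALSE
)

# next value of ColB within each ColA group; the last row of a group gets NA
df$NextB <- ave(df$ColB, df$ColA, FUN = function(x) c(x[-1], NA))

# parse the clock times (the date part is irrelevant for the difference)
to_time <- function(x) as.POSIXct(x, format = "%H:%M:%S", tz = "UTC")

# difference in seconds between the next time and the current one
df$diff_secs <- as.numeric(difftime(to_time(df$NextB), to_time(df$ColB),
                                    units = "secs"))
df
```

Because ave applies the function separately to each group, the shifted value never crosses from one ColA value to the next.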
Example Data
df <- structure(list(ColA = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
ColB = c("11:00:12", "11:00:13", "11:00:14", "11:00:15",
"11:00:16", "11:00:17", "11:00:18", "11:00:19", "11:00:20"
), `Lag(ColB)` = c("11:00:13", "11:00:14", NA, "11:00:16",
"11:00:17", NA, "11:00:19", "11:00:20", NA)), row.names = c(NA,
-9L), class = c("data.table", "data.frame"))

Remove rows with specific NA column

I have the following dataset, where some entries (unique values of A) never have data in B and others have it only sometimes.
A B
1 NA
2 NA
3 77
1 NA
2 81
I want to delete the entries that always have NA and keep the rest:
A B
2 NA
3 77
2 81
We can use ave grouped by A and remove the groups in which every value of B is NA
df[!with(df, ave(is.na(B), A, FUN = all)), ]
# A B
#2 2 NA
#3 3 77
#5 2 81
Using the same logic with dplyr
library(dplyr)
df %>%
group_by(A) %>%
filter(!all(is.na(B)))
Assuming the input shown reproducibly in the Note at the end, for each group defined by A we return TRUE if any of its elements in B are not NA.
subset(DF, ave(!is.na(B), A, FUN = any))
Note
Lines <- "
A B
1 NA
2 NA
3 77
1 NA
2 81"
DF <- read.table(text = Lines, header = TRUE)
We can use data.table
library(data.table)
setDT(df1)[, .SD[any(!is.na(B))], A]
# A B
#1: 2 NA
#2: 2 81
#3: 3 77
data
df1 <- structure(list(A = c(1L, 2L, 3L, 1L, 2L), B = c(NA, NA, 77L,
NA, 81L)), class = "data.frame", row.names = c(NA, -5L))

Resources