Delete duplicate rows based on condition in another column

Let's say I have this data frame:
df <- data.frame(
a = c(NA,6,6,8),
x= c(1,2,2,4),
y = c(NA,2,NA,NA),
z = c("apple", 2, "2", NA),
d = c(NA, 5, 5, 5),stringsAsFactors = FALSE)
Rows 2 and 3 are duplicates and row 3 has an NA value. I want to delete the duplicate row with the NA value so that it looks like this:
df <- data.frame(
a = c(NA,6,8),
x= c(1,2,4),
y = c(NA,2,NA),
z = c("apple", 2, NA),
d = c(NA, 5, 5),stringsAsFactors = FALSE)
I tried this but it doesn't work:
df2 <- df %>% group_by (a,x,z,d) %>% filter(y == max(y))
Any suggestions?

df %>%
arrange_all() %>%
filter(!duplicated(fill(., everything())))
a x y z d
1 NA 1 NA apple NA
2 6 2 2 2 5
3 8 4 NA <NA> 5

df %>% arrange(a,x,z,d) %>% distinct(a,x,z,d,.keep_all=TRUE)
a x y z d
1 6 2 2 2 5
2 8 4 NA <NA> 5
3 NA 1 NA apple NA

Fill NA values with previous non-NA and select unique rows with distinct.
library(dplyr)
library(tidyr)
df %>% fill(everything()) %>% distinct()
# a x y z d
#1 NA 1 NA apple NA
#2 6 2 2 2 5
#3 8 4 NA <NA> 5
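Since fill() carries values down from any preceding row, not only from a true duplicate, comparing just the key columns may be safer on larger data. Below is a minimal base R sketch under that assumption. The key columns (a, x, z, d) are taken from the grouping in the question's attempt, and the "|" separator is assumed not to occur in the data.

```r
df <- data.frame(
  a = c(NA, 6, 6, 8),
  x = c(1, 2, 2, 4),
  y = c(NA, 2, NA, NA),
  z = c("apple", 2, "2", NA),
  d = c(NA, 5, 5, 5),
  stringsAsFactors = FALSE
)

# One key per row built from the columns that define a duplicate.
# paste() turns NA into the string "NA", so rows with missing keys
# still get a usable key.
key <- paste(df$a, df$x, df$z, df$d, sep = "|")

# Order rows so that, within each key, rows with a non-NA y come first.
ord <- order(key, is.na(df$y))

# Keep the first row of every key, then restore the original row order.
keep <- sort(ord[!duplicated(key[ord])])
result <- df[keep, ]
result
```

If a key has no row with a non-NA y (as for a = 8 here), its first row is kept as is.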

Related

Reverse the order of non-NA values in a variable

I am interested in reversing the values for a column that has NA values in a tidy way.
The rev call won't do the trick here:
library(tidyverse)
tibble(
Which = LETTERS[1:11],
x = c( c(3,1,4,2,16), NA, NA, 4, rep(NA, 2), 10)) %>%
mutate(y = rev(x))
As it completely reverses the values (NAs included).
I essentially want a tidy mutate command (no splitting / joining) that reverses the values for the Which column so that E has value 1 (max becomes min) B has value 16 (min becomes max), etc - and NA values remain NA (F, G, I & J).
Edit:
Several answers do not achieve intended outcome. The question is aimed at effectively having a reverse (rev) work while keeping NAs in position.
@Moody_Mudskipper has a solution for the case where there are no repeats, but it fails when there are repeats, e.g.:
rev_na <- function(x) setNames(sort(x), sort(x, TRUE))[as.character(x)]
Works here:
tibble(
Which = LETTERS[1:11],
x = c( c(3,1,4,2,16), NA, NA, 4, rep(NA, 2), 10)) %>%
mutate(y = rev_na(x))
Fails here:
tibble(
Which = LETTERS[1:7],
x = c(3,1,9,9,9, 9, 10)
) %>% mutate(y = rev_na(x), z = rev(x))

If you can tolerate a little hack :
tibble(
Which = LETTERS[1:11],
x = c( c(3,1,4,2,16), NA, NA, 4, rep(NA, 2), 10)) %>%
mutate(y = setNames(sort(x), sort(x, TRUE))[as.character(x)])
#> # A tibble: 11 x 3
#> Which x y
#> <chr> <dbl> <dbl>
#> 1 A 3 4
#> 2 B 1 16
#> 3 C 4 3
#> 4 D 2 10
#> 5 E 16 1
#> 6 F NA NA
#> 7 G NA NA
#> 8 H 4 3
#> 9 I NA NA
#> 10 J NA NA
#> 11 K 10 2
Created on 2021-05-11 by the reprex package (v0.3.0)

This will do
data.frame(
Which = LETTERS[1:11],
x = c( c(3,1,4,2,16), NA, NA, 4, rep(NA, 2), 10)) -> df
df %>% group_by(d = is.na(x)) %>%
arrange(x) %>%
mutate(y = ifelse(!d, rev(x), x)) %>%
ungroup %>% select(-d)
# A tibble: 11 x 3
Which x y
<chr> <dbl> <dbl>
1 B 1 16
2 D 2 10
3 A 3 4
4 C 4 4
5 H 4 3
6 K 10 2
7 E 16 1
8 F NA NA
9 G NA NA
10 I NA NA
11 J NA NA
You can restore the original order afterwards, either by arranging on Which if it was already sorted, or by creating a row_number() column at the start of the pipeline and arranging on that at the end.
df %>%
group_by(d = is.na(x)) %>%
arrange(x) %>%
mutate(y = ifelse(!d, rev(x), x)) %>%
ungroup %>% select(-d) %>%
arrange(Which)
# A tibble: 11 x 3
Which x y
<chr> <dbl> <dbl>
1 A 3 4
2 B 1 16
3 C 4 4
4 D 2 10
5 E 16 1
6 F NA NA
7 G NA NA
8 H 4 3
9 I NA NA
10 J NA NA
11 K 10 2
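For the repeated-value case raised in the question, one possible base R sketch is to give each non-NA value the element of the descending sort that sits at its rank. The helper name rev_keep_na is hypothetical, and breaking ties by position (ties.method = "first") is an assumption about what should happen with repeats.

```r
# Hypothetical helper: reverse the ranking of the non-NA values of x
# (smallest becomes largest and so on) while leaving NAs where they are.
rev_keep_na <- function(x) {
  ok <- !is.na(x)
  y <- x
  # rank(..., ties.method = "first") gives every value a distinct rank,
  # so repeated values are each assigned one slot of the reversed sort.
  y[ok] <- sort(x[ok], decreasing = TRUE)[rank(x[ok], ties.method = "first")]
  y
}

x1 <- c(3, 1, 4, 2, 16, NA, NA, 4, NA, NA, 10)
rev_keep_na(x1)

x2 <- c(3, 1, 9, 9, 9, 9, 10)
rev_keep_na(x2)
```

Because it works on a plain vector, it can be dropped into mutate(y = rev_keep_na(x)) without any grouping or reordering of rows.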

Replace column conditional on matching in another column

I would like to replace the values of one column with those of another wherever they match. I'm trying to use the match function but I get NA values.
a <- data.frame( x = c(1,2,3,4,5))
b <- data.frame( y = c(3,4),
z = c("A","B"))
a$x <- b$z[match(a$x, b$y)]
I get:
> a
x
1 <NA>
2 <NA>
3 A
4 B
5 <NA>
I would like :
> a
x
1 1
2 2
3 A
4 B
5 5

First, rename the numeric column of b so that you can merge the two data frames:
b <- b %>% rename(x = y)
Then, merge them, turn variables into character and replace the values of column x with those of z if not NA.
a <- merge(a, b, by = "x", all.x = TRUE) %>%
mutate_all(as.character) %>%
mutate(x = ifelse(is.na(z), x, z))
Result:
x z
1 1 <NA>
2 2 <NA>
3 A A
4 B B
5 5 <NA>

Without renaming, I would propose this, which gives the same result as broti's answer:
tmp.merge<- merge(a,b,by.x = "x", by.y="y", all = TRUE)
for (elm in as.numeric(row.names(tmp.merge[which(!is.na(tmp.merge$z)),]))){
tmp.merge[elm,'x'] <- as.character(tmp.merge[elm,'z'])
}
tmp.merge
result :
> tmp.merge
x z
1 1 <NA>
2 2 <NA>
3 A A
4 B B
5 5 <NA>

The following works, but you need to set stringsAsFactors = FALSE when defining the data frame b:
a <- data.frame( x = c(1,2,3,4,10,13,12,11))
b <- data.frame( y = c(10,12,13),
z = c("A","B","C"),stringsAsFactors = F)
#
a %>% mutate(x = ifelse(x %in% b$y,b$z[match(x,b$y)],x))
Output
x
1 1
2 2
3 3
4 4
5 A
6 C
7 B
8 11
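The NAs in the original attempt come from match() itself, which returns NA for every value of a$x that has no partner in b$y. A minimal base R sketch that overwrites only the matched positions, using the data from the question (the names idx and hit are just illustrative):

```r
a <- data.frame(x = c(1, 2, 3, 4, 5))
b <- data.frame(y = c(3, 4), z = c("A", "B"), stringsAsFactors = FALSE)

idx <- match(a$x, b$y)   # position in b$y, or NA where there is no match
hit <- !is.na(idx)       # TRUE only for rows that found a partner

# The column has to hold both numbers and letters, so convert it first.
a$x <- as.character(a$x)
a$x[hit] <- b$z[idx[hit]]
a
```

The unmatched rows keep their original values because they are never assigned to.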

Replace all NA values for variable with one row equal to 0

Slightly difficult to phrase, as far as I saw none of the similar questions answered my problem.
I have a data.frame such as:
df1 <- data.frame(id = rep(c("a", "b","c"), each = 4),
val = c(NA, NA, NA, NA, 1, 2, 2, 3,NA,2,NA,3))
df1
id val
1 a NA
2 a NA
3 a NA
4 a NA
5 b 1
6 b 2
7 b 2
8 b 3
9 c NA
10 c 2
11 c NA
12 c 3
and I want to get rid of all the NA values (easy enough using e.g. filter()), but if this removes every row for an id (in this case it removes every instance of "a"), one extra row should be inserted for that id with a placeholder value (e.g. a = 0),
so that:
id val
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
6 c 2
7 c 3
obviously easy enough to do this in a roundabout way but I was wondering if there's a tidy/elegant way to do this. I thought tidyr::complete() might help but not entirely sure how to apply it to a case like this
I don't care about the order of the rows
Cheers!
edit: updated with clearer desired output. might make desired answers submitted before that a bit less clear

Another idea using dplyr,
library(dplyr)
df1 %>%
group_by(id) %>%
mutate(val = ifelse(row_number() == 1 & all(is.na(val)), 0, val)) %>%
na.omit()
which gives,
# A tibble: 5 x 2
# Groups: id [2]
id val
<fct> <dbl>
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3

We may do
df1 %>% group_by(id) %>% do(if(all(is.na(.$val))) replace(.[1, ], 2, 0) else na.omit(.))
# A tibble: 5 x 2
# Groups: id [2]
# id val
# <fct> <dbl>
# 1 a 0
# 2 b 1
# 3 b 2
# 4 b 2
# 5 b 3
After grouping by id, if everything in val is NA, then we leave only the first row with the second element replaced by 0, otherwise the same data is returned after applying na.omit.
In a more readable format that would be
df1 %>% group_by(id) %>%
do(if(all(is.na(.$val))) data.frame(id = .$id[1], val = 0) else na.omit(.))
(Here I presume that you indeed want to get rid of all NA values; otherwise there is no need for na.omit.)

df1[is.na(df1)] <- 0
df1[!(duplicated(df1$id) & df1$val == 0), ]
id val
1 a 0
5 b 1
6 b 2
7 b 2
8 b 3

Base R option is to find groups with all NAs and transform them by changing their val to 0 and select only unique rows so that there is only one row per group. We rbind this dataframe with the groups which are !all_NA.
all_NA <- with(df1, ave(is.na(val), id, FUN = all))
rbind(unique(transform(df1[all_NA, ], val = 0)), df1[!all_NA, ])
# id val
#1 a 0
#5 b 1
#6 b 2
#7 b 2
#8 b 3
The dplyr option looks ugly, but one way is to split the data into two sets: groups where every val is NA and groups with no NA values at all. For the all-NA groups we add one row with the group's id and val as 0 and bind this to the other set.
library(dplyr)
bind_rows(df1 %>%
group_by(id) %>%
filter(all(!is.na(val))),
df1 %>%
group_by(id) %>%
filter(all(is.na(val))) %>%
ungroup() %>%
summarise(id = unique(id),
val = 0)) %>%
arrange(id)
# id val
# <fct> <dbl>
#1 a 0
#2 b 1
#3 b 2
#4 b 2
#5 b 3

Changed the df to make example more exhaustive -
df1 <- data.frame(id = rep(c("a", "b","c"), each = 4),
val = c(NA, NA, NA, NA, 1, 2, 2, 3,NA,2,NA,3))
library(dplyr)
df1 %>%
group_by(id) %>%
mutate(case=sum(is.na(val))==n(), row_num=row_number() ) %>%
mutate(val=ifelse(is.na(val)&case,0,val)) %>%
filter( !(case&row_num!=1) ) %>%
select(id, val)
Output
id val
<fct> <dbl>
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
6 c NA
7 c 2
8 c NA
9 c 3

Another base approach, one that doesn't maintain the order of the rows and takes advantage of factors remembering lost values:
df1 <- na.omit(df1)
df1 <- rbind(
df1,
data.frame(
id = levels(df1$id)[!levels(df1$id) %in% df1$id],
val = 0)
)
I do personally prefer the dplyr approach given by Sotos, as I don't like rbind-ing data.frames back together so it's a matter of taste, but this isn't unbearably complicated by my eye. It's easy enough to adapt to a character id column with a unique(df1$id) variable.
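The character-id adaptation mentioned above could look roughly like this. It is a sketch that assumes id is a character column (stringsAsFactors = FALSE); the names all_ids, kept and lost are illustrative only.

```r
df1 <- data.frame(
  id  = rep(c("a", "b", "c"), each = 4),
  val = c(NA, NA, NA, NA, 1, 2, 2, 3, NA, 2, NA, 3),
  stringsAsFactors = FALSE
)

all_ids <- unique(df1$id)            # remember every id before dropping rows
kept    <- df1[!is.na(df1$val), ]    # rows that survive NA removal
lost    <- setdiff(all_ids, kept$id) # ids that disappeared completely

# Add one placeholder row (val = 0) for each id that was lost.
result <- rbind(
  kept,
  data.frame(id = lost, val = rep(0, length(lost)), stringsAsFactors = FALSE)
)
result
```

As with the factor version, the placeholder rows end up at the bottom, which is fine since the row order does not matter here.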

Here is an option too:
df1 %>%
mutate_if(is.factor,as.character) %>%
mutate_all(funs(replace(.,is.na(.),0))) %>%
slice(4:nrow(.))
This gives:
id val
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
Alternative:
df1 %>%
mutate_if(is.factor,as.character) %>%
mutate_all(funs(replace(.,is.na(.),0))) %>%
unique()
UPDATE based on other requirements:
Some users suggested testing on this dataframe. This answer assumes you inspect the data by hand and hard-code the affected id, so it may be less useful on larger data, but here goes:
df1 <- data.frame(id = rep(c("a", "b","c"), each = 4), val = c(NA, NA, NA, NA, 1, 2, 2, 3,NA,2,NA,3))
df1 %>%
mutate_if(is.factor,as.character) %>%
mutate(val=ifelse(id=="a",0,val)) %>%
slice(4:nrow(.))
This yields:
id val
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
6 c NA
7 c 2
8 c NA
9 c 3

Here is a base R solution.
res <- lapply(split(df1, df1$id), function(DF){
if(anyNA(DF$val)) {
i <- is.na(DF$val)
DF$val[i] <- 0
DF <- rbind(DF[i & !duplicated(DF[i, ]), ], DF[!i, ])
}
DF
})
res <- do.call(rbind, res)
row.names(res) <- NULL
res
# id val
#1 a 0
#2 b 1
#3 b 2
#4 b 2
#5 b 3
Edit.
A dplyr solution could be the following.
It was tested with the original dataset posted by the OP, with the dataset in Vivek Kalyanarangan's answer and with the dataset in markus' comment, renamed df2 and df3, respectively.
library(dplyr)
na2zero <- function(DF){
DF %>%
group_by(id) %>%
mutate(val = ifelse(is.na(val), 0, val),
crit = val == 0 & duplicated(val)) %>%
filter(!crit) %>%
select(-crit)
}
na2zero(df1)
na2zero(df2)
na2zero(df3)

One may try this :
df1 = data.frame(id = rep(c("a", "b","c"), each = 4),
val = c(NA, NA, NA, NA, 1, 2, 2, 3,NA,2,NA,3))
df1
# id val
#1 a NA
#2 a NA
#3 a NA
#4 a NA
#5 b 1
#6 b 2
#7 b 2
#8 b 3
#9 c NA
#10 c 2
#11 c NA
#12 c 3
The task is to remove all rows for an id if and only if every val for that id is NA, and to add a new row with this id and val = 0.
In this example, that is id = a.
Note: c also has NAs in val, but not all of its values are NA, so we only need to remove the rows of c where val is NA.
So let's create another column, say val2, which is 0 when all of an id's values are NA and 1 otherwise.
library(dplyr)
df1 = df1 %>%
group_by(id) %>%
mutate(val2 = if_else(condition = all(is.na(val)),true = 0, false = 1))
df1
# A tibble: 12 x 3
# Groups: id [3]
# id val val2
# <fct> <dbl> <dbl>
#1 a NA 0
#2 a NA 0
#3 a NA 0
#4 a NA 0
#5 b 1 1
#6 b 2 1
#7 b 2 1
#8 b 3 1
#9 c NA 1
#10 c 2 1
#11 c NA 1
#12 c 3 1
Get the list of ids whose val is NA in every row.
all_na = unique(df1$id[df1$val2 == 0])
Then remove the rows with val = NA from the dataframe df1.
df1 = na.omit(df1)
df1
# A tibble: 6 x 3
# Groups: id [2]
# id val val2
# <fct> <dbl> <dbl>
# 1 b 1 1
# 2 b 2 1
# 3 b 2 1
# 4 b 3 1
# 5 c 2 1
# 6 c 3 1
And create a new dataframe with ids in all_na and val = 0
all_na_df = data.frame(id = all_na, val = 0)
all_na_df
# id val
# 1 a 0
Then combine these two dataframes.
df1 = bind_rows(all_na_df, df1[,c('id', 'val')])
df1
# id val
# 1 a 0
# 2 b 1
# 3 b 2
# 4 b 2
# 5 b 3
# 6 c 2
# 7 c 3
Hope this helps and Edits are most welcomed :-)

Unequal rows in list from unstack() - how to create a dataframe

I am (trying) to do a Robust ANOVA analysis in R. This requires that my two variables are in a very specific format. Basically, the requirement is to unstack two columns in my current dataframe and form an outcome frequency dataframe based on the predictor (categorical variable). This would usually happen automatically using the unstack() function i.e.
newDataFrame <- unstack(oldDataFrame, scores ~ columns)
However, the list returned has unequal rows for each category. Here is an example:
$A
[1] 2 4 2 3 3
$B
[1] 3 3
$C
[1] 5
$D
[1] 4 4 3
A, B, C and D are my categories, and the numbers are the outcome. The outcome has to be 1, 2, 3, 4, 5 or 6.
What I am working towards is the category as the 'header' and the outcome as a reference column, with the frequencies as the other columns, such that the dataframe looks like this:
A B C D
1 NA NA NA NA
2 2 NA NA NA
3 2 2 NA 1
4 1 NA NA 2
5 NA NA 1 NA
6 NA NA NA NA
What I have tried:
On another SO post, I found this -
library(stringi)
res <- as.data.frame(t(stri_list2matrix(myUnstackedList)))
colnames(res) <- unique(unlist(sapply(myUnstackedList, names)))
Outcome:
res
1 2 4 2 3 3
2 3 3 <NA> <NA> <NA>
3 5 <NA> <NA> <NA> <NA>
4 4 4 3 <NA> <NA>
Note that the categories A, B, C, D have been changed to 1, 2, 3, 4
Also tried this (another SO post):
df <- as.data.frame(plyr::ldply(myUnstackedList, rbind))
Outcome:
df
outcome group score
2 A 2
3 A 2
4 A 1
3 B 2
etc
Any tips?

This gets you most of the way to your answer:
library(dplyr)
library(tidyr)
test <- list(A=c(2,4,2,3,3),
B=c(3,3),
C=c(5),
D=c(4,4,3))
test <- lapply(1:length(test), function(i){
x <- data.frame(names(test)[i], test[i],
stringsAsFactors=FALSE)
names(x) <- c("ID", "Value")
x})
test <- bind_rows(test) %>% table %>% as.data.frame
test <- spread(test, key=ID, value=Freq)
replace(test, test==0, NA)

I'm not sure what the issue was with your previous dplyr attempt, however, I offer
library(tidyr)
library(dplyr)
df <- tibble(
outcome = c(1:5, 1:2, 1, 1:3),
group = c(rep("A", 5), rep("B", 2), "C", rep("D", 3)),
score = c(2, 4, 2, 3, 3, 3, 3, 5, 4, 4, 3)
)
df %>%
group_by(outcome) %>%
spread(group, score) %>%
ungroup() %>%
select(-outcome)
# # A tibble: 5 x 4
# A B C D
# * <dbl> <dbl> <dbl> <dbl>
# 1 2 3 5 4
# 2 4 3 NA 4
# 3 2 NA NA 3
# 4 3 NA NA NA
# 5 3 NA NA NA
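For the frequency layout described in the question (outcomes 1 to 6 as rows, categories as columns), a base R sketch with stack() and table() may be enough. It assumes the unstacked list looks like the example in the question; the names scores, long, freq and res are illustrative.

```r
scores <- list(A = c(2, 4, 2, 3, 3), B = c(3, 3), C = 5, D = c(4, 4, 3))

# Back to long form: one row per score, with its category in `ind`.
long <- stack(scores)

# Count every outcome per category; the factor levels force rows 1..6
# to appear even when an outcome never occurs.
freq <- table(factor(long$values, levels = 1:6), long$ind)

# Convert the counts to a data frame and show zero counts as NA.
res <- as.data.frame.matrix(freq)
res[res == 0] <- NA
res
```

If the original long data frame is still available, the stack() step can be skipped and table() applied to its two columns directly.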

Create NAs for first two rows using group_by

I am trying to replace some of my observations with NAs. I would like to set only the first two observations of each group, as defined by a given ID, to NA.
So from:
id b
1 1 0.1125294
2 1 -0.6871102
3 1 0.1721639
4 2 0.2714921
5 2 0.1012665
6 2 -0.3538989
Get:
id b
1 1 NA
2 1 NA
3 1 0.1721639
4 2 NA
5 2 NA
6 2 -0.3538989
Tried this, but it does not work...
data<- data %>% group_by(id) %>% mutate(data$b[1:2] = NA)
Thanks for any help!

library(dplyr)
df <- data.frame(id = rep(1:2, each = 3), value = rnorm(6))
df %>% group_by(id) %>% mutate(value=replace(value, 1:2, NA))
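The same idea can be sketched in base R with ave(), which computes each row's position within its group. It uses the values from the question; pos is an illustrative name. Unlike replace(value, 1:2, NA), this comparison also works when a group has fewer than two rows.

```r
df <- data.frame(
  id = rep(1:2, each = 3),
  b  = c(0.1125294, -0.6871102, 0.1721639, 0.2714921, 0.1012665, -0.3538989)
)

# Position of each row inside its id group: 1, 2, 3, 1, 2, 3
pos <- ave(seq_along(df$id), df$id, FUN = seq_along)

# Blank out the first two observations of every group.
df$b[pos <= 2] <- NA
df
```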
