Remove semi-duplicate rows in R

I have the following data.frame.
a <- c(rep("A", 3), rep("B", 3), rep("C",2), "D")
b <- c(NA,1,2,4,1,NA,2,NA,NA)
c <- c(1,1,2,4,1,1,2,2,2)
d <- c(1,2,3,4,5,6,7,8,9)
df <-data.frame(a,b,c,d)
a b c d
1 A NA 1 1
2 A 1 1 2
3 A 2 2 3
4 B 4 4 4
5 B 1 1 5
6 B NA 1 6
7 C 2 2 7
8 C NA 2 8
9 D NA 2 9
I want to remove duplicate rows (based on columns a and c) so that the rows with a value in column b are kept. In this example, rows 1, 6, and 8 are removed.

One way to do this is to order by 'a', then 'b', then the logical vector is.na(df$b), so that the NA rows come last within each group of 'a'. Then apply duplicated() to columns 'a' and 'c' and keep only the non-duplicated rows:
df1 <- df[order(df$a, df$b, is.na(df$b)),]
df2 <- df1[!duplicated(df1[c('a', 'c')]),]
df2
# a b c d
#2 A 1 1 2
#3 A 2 2 3
#5 B 1 1 5
#4 B 4 4 4
#7 C 2 2 7
#9 D NA 2 9
setdiff(seq_len(nrow(df)), row.names(df2) )
#[1] 1 6 8
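The same idea can be wrapped in a small reusable function. This is only a sketch: `drop_semi_duplicates` and its arguments `keys` and `value` are hypothetical names, not part of base R, and it sorts by the key columns rather than by 'a' and 'b', so the row order of the result may differ from df2.

```r
# Hypothetical helper (not part of base R): keep one row per combination of
# the `keys` columns, preferring rows where the `value` column is not NA.
drop_semi_duplicates <- function(data, keys, value) {
  # Sort by the key columns, then by is.na(value), so that within each key
  # combination the non-NA rows come first (FALSE sorts before TRUE).
  sort_cols <- c(unname(as.list(data[keys])), list(is.na(data[[value]])))
  sorted <- data[do.call(order, sort_cols), , drop = FALSE]
  # duplicated() flags every row after the first one of each key combination.
  sorted[!duplicated(sorted[keys]), , drop = FALSE]
}

df <- data.frame(
  a = c(rep("A", 3), rep("B", 3), rep("C", 2), "D"),
  b = c(NA, 1, 2, 4, 1, NA, 2, NA, NA),
  c = c(1, 1, 2, 4, 1, 1, 2, 2, 2),
  d = 1:9
)

res <- drop_semi_duplicates(df, keys = c("a", "c"), value = "b")

# Original row numbers that were dropped
removed <- setdiff(seq_len(nrow(df)), as.integer(row.names(res)))
removed
# 1 6 8
```

A key combination whose rows are all NA in `value` (row 9 here) still keeps one row, because duplicated() only removes repeats.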

First create two data sets, one with the rows whose value in column a occurs more than once and one with the rows whose value in column a occurs only once:
x = df[df$a %in% names(which(table(df$a) > 1)), ]
x1 = df[df$a %in% names(which(table(df$a) ==1)), ]
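Now use na.omit() on x to delete its rows containing NA, then rbind the result with x1 to get the final data set. Note that this drops every NA row in a repeated 'a' group, whether or not a non-NA row with the same 'a' and 'c' exists; that is fine for this example.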
rbind(na.omit(x),x1)
Answer:
a b c d
2 A 1 1 2
3 A 2 2 3
4 B 4 4 4
5 B 1 1 5
7 C 2 2 7
9 D NA 2 9

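You can use dplyr to do this. distinct() keeps the first row of each (a, c) combination, so on its own it would keep row 1 with its NA. Sort the rows with a non-NA b to the front first:
library(dplyr)
df %>%
  arrange(a, c, is.na(b)) %>%
  distinct(a, c, .keep_all = TRUE)
Output
a b c d
1 A 1 1 2
2 A 2 2 3
3 B 1 1 5
4 B 4 4 4
5 C 2 2 7
6 D NA 2 9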
There are other options in dplyr, check this question for details: Remove duplicated rows using dplyr

Related

Find corresponding "father" row in dataframe

I've a dataframe with A and B rows. Bs are children of A.
For each B line I need to write the corresponding A father.
The number of Bs for each A is variable.
I'm thinking about a for loop, but I don't think it's the right way.
Here is a simplified example:
> df <- data.frame(Index=c(1:9),
+ Type=c("A","B","A","B","B","B","A","B","B"),
+ Aindex="")
> df
Index Type Aindex
1 1 A
2 2 B
3 3 A
4 4 B
5 5 B
6 6 B
7 7 A
8 8 B
9 9 B
This is the result I'd like to have:
> df2
Index Type Aindex
1 1 A
2 2 B 1
3 3 A
4 4 B 3
5 5 B 3
6 6 B 3
7 7 A
8 8 B 7
9 9 B 7
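In base R, cumsum() works well for this kind of parent-child lookup: it numbers the A rows 1, 2, 3, ..., and every B row inherits the number of the A row above it.
df <- data.frame(Index=c(1:9), Type=c("A","B","A","B","B","B","A","B","B"))
# TRUE for parent (A) rows
df$parent <- df$Type == "A"
# running count of A rows: 1 1 2 2 2 2 3 3 3
df$aindex <- cumsum(df$parent)
# blank out the A rows (this converts the column to character)
df$aindex[df$Type == "A"] <- ""
# for the remaining rows, map the counter to the Index of the matching A row
df$aindex[df$aindex > 0] <- df$Index[df$Type == "A"][as.numeric(df$aindex[df$aindex > 0])]
Result: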
Index Type parent aindex
1 1 A TRUE
2 2 B FALSE 1
3 3 A TRUE
4 4 B FALSE 3
5 5 B FALSE 3
6 6 B FALSE 3
7 7 A TRUE
8 8 B FALSE 7
9 9 B FALSE 7
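The cumsum() lookup can also be written more compactly. The following is a sketch, not the answer's own code: the names `is_a` and `parent_index` are made up for illustration, it assumes the first row is of type A, and it stores NA rather than an empty string for the A rows so that the column stays numeric.

```r
df <- data.frame(Index = 1:9,
                 Type = c("A", "B", "A", "B", "B", "B", "A", "B", "B"))

# TRUE for the parent (A) rows
is_a <- df$Type == "A"

# cumsum(is_a) counts the A rows seen so far (1 1 2 2 2 2 3 3 3), so it can
# index directly into the vector of A-row Index values (1, 3, 7).
# Assumption: the first row is an A row, otherwise the counter starts at 0.
parent_index <- df$Index[is_a][cumsum(is_a)]

# A rows have no parent
parent_index[is_a] <- NA

df$Aindex <- parent_index
df$Aindex
# NA  1 NA  3  3  3 NA  7  7
```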
You can use tidyr::fill:
library(dplyr)
library(tidyr)
df %>%
  # Turn Aindex to NA if Type is 'B'
  mutate(Aindex = replace(Index, Type == 'B', NA)) %>%
  # Fill NA with the value above it
  fill(Aindex) %>%
  # Change Aindex to an empty value where Type is 'A'
  mutate(Aindex = replace(Aindex, Type == 'A', ''))
# Index Type Aindex
#1 1 A
#2 2 B 1
#3 3 A
#4 4 B 3
#5 5 B 3
#6 6 B 3
#7 7 A
#8 8 B 7
#9 9 B 7

How to remove rows with NAs from two dataframes based on NAs from one?

I am trying to remove from df2 the rows that contain NA in df1.
E.g.:
df1
A
1 1
2 NA
3 7
4 NA
df2
A B C D
1 2 4 7 10
2 3 6 1 3
3 9 5 1 3
4 4 9 2 5
Intended outcome:
df1
A
1 1
3 7
df2
A B C D
1 2 4 7 10
3 9 5 1 3
I have already tried things along the lines of...
newdf <- df2[-which(rowSums(is.na(df1))),]
and
noNA <- function(x) { x[!rowSums(!is.na(df1)) == 1]}
NMR_6mos_noNA <- as.data.frame(lapply(df2, noNA))
or
noNA <- function(x) { x[,!is.na(df1)]}
newdf3 <- as.data.frame(lapply(df2, noNA))
We can use is.na() to create a logical index of the rows of 'df1' that are not NA, and use that same index to subset the rows of both 'df1' and 'df2':
i1 <- !is.na(df1$A)
df1[i1, , drop = FALSE]
# A
#1 1
#3 7
df2[i1,]
# A B C D
#1 2 4 7 10
#3 9 5 1 3
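If df1 had more than one column, the same logical-index approach could be sketched with complete.cases(), which marks the rows that contain no NA in any column. The data below is rebuilt from the question; the names `keep`, `df1_clean` and `df2_clean` are illustrative only.

```r
df1 <- data.frame(A = c(1, NA, 7, NA))
df2 <- data.frame(A = c(2, 3, 9, 4),
                  B = c(4, 6, 5, 9),
                  C = c(7, 1, 1, 2),
                  D = c(10, 3, 3, 5))

# TRUE for rows of df1 without any NA
keep <- complete.cases(df1)

# Apply the same row filter to both data frames;
# drop = FALSE keeps the one-column df1 as a data frame.
df1_clean <- df1[keep, , drop = FALSE]
df2_clean <- df2[keep, ]

row.names(df2_clean)
# "1" "3"
```

This assumes both data frames have the same number of rows in the same order, as in the question.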

How to assign a value to a column based on a column index

Given a data frame, I would like to assign a calculated value to the column given by a column index:
df <- data.frame(a = c(2,4,7,3,5,3), b = c(8,3,8,2,6,1))
> df
a b
1 2 8
2 4 3
3 7 8
4 3 2
5 5 6
6 3 1
max <- apply(df, 1, which.max)
> max
[1] 2 1 2 1 2 1
addition <- apply(df, 1, sum)
> addition
[1] 10 7 15 5 11 4
Then some operation which I cannot figure out with the following result being assigned to df2
> df2
a b
1 2 10
2 7 3
3 7 15
4 5 2
5 5 11
6 4 1
I would highly appreciate your ideas and your help. Thank you.
You can use cbind to access your selected columns for each row:
df2 = df
df2[cbind(1:nrow(df2),max)] = addition
df2
a b
1 2 10
2 7 3
3 7 15
4 5 2
5 5 11
6 4 1
Here, cbind returns a matrix of 2 columns and 6 rows that we use to subset the dataframe using matrix subsetting.
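To make the matrix subsetting step explicit, here is a sketch of the same technique on a plain matrix. It swaps apply(df, 1, which.max) for max.col() and apply(df, 1, sum) for rowSums(), and the names `m` and `idx` are illustrative; it also avoids naming a variable `max`, which would mask the base function.

```r
df <- data.frame(a = c(2, 4, 7, 3, 5, 3), b = c(8, 3, 8, 2, 6, 1))
m <- as.matrix(df)

# Two-column index matrix: one (row, column) pair per row,
# where the column is the position of that row's maximum.
idx <- cbind(seq_len(nrow(m)), max.col(m, ties.method = "first"))

# Indexing with a two-column matrix addresses exactly one cell per pair,
# so each row's maximum is replaced by that row's sum.
m[idx] <- rowSums(m)

df2 <- as.data.frame(m)
df2$a
# 2 7 7 5 5 4
df2$b
# 10  3 15  2 11  1
```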
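Because there are only two columns, you can also use vectorised ifelse directly (unlike which.max, this treats b as the maximum when a and b are equal):
with(df, cbind.data.frame(a = ifelse(a > b, a + b, a), b = ifelse(a > b, b, a + b)))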
# a b
#1 2 10
#2 7 3
#3 7 15
#4 5 2
#5 5 11
#6 4 1

Creating new dataframe with missing value

I have a data frame structured like this:
time <- c(1,1,1,1,2,2)
group <- c('a','b','c','d','c','d')
number <- c(2,3,4,1,2,12)
df <- data.frame(time,group,number)
time group number
1 1 a 2
2 1 b 3
3 1 c 4
4 1 d 1
5 2 c 2
6 2 d 12
In order to plot the data, I need it to contain a value for each group (a to d) at each time point, even if that value is zero. So I need a data frame that looks like this:
time group number
1 1 a 2
2 1 b 3
3 1 c 4
4 1 d 1
5 2 a 0
6 2 b 0
7 2 c 2
8 2 d 12
Any help?
You can use expand.grid and merge, like this:
> merge(df, expand.grid(lapply(df[c(1, 2)], unique)), all = TRUE)
time group number
1 1 a 2
2 1 b 3
3 1 c 4
4 1 d 1
5 2 a NA
6 2 b NA
7 2 c 2
8 2 d 12
From there, it's just a simple matter of replacing NA with 0.
new <- merge(df, expand.grid(lapply(df[c(1, 2)], unique)), all.y = TRUE)
new[is.na(new$number),"number"] <- 0
new
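To show what expand.grid contributes, here is the same approach spelled out step by step. This is a sketch; the names `grid` and `full` are illustrative, and stringsAsFactors = FALSE is added only to keep group as a character column.

```r
df <- data.frame(time = c(1, 1, 1, 1, 2, 2),
                 group = c("a", "b", "c", "d", "c", "d"),
                 number = c(2, 3, 4, 1, 2, 12))

# Every combination of the observed time and group values: 2 x 4 = 8 rows
grid <- expand.grid(time = unique(df$time),
                    group = unique(df$group),
                    stringsAsFactors = FALSE)

# Left join onto the full grid: combinations missing from df get number = NA
full <- merge(grid, df, all.x = TRUE)

# Replace the missing values with 0
full$number[is.na(full$number)] <- 0

full$number
# 2  3  4  1  0  0  2 12
```

merge() sorts the result by the shared columns (time, then group), which is why the rows come out in plotting order.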

Select rows in a dataframe in r based on values in one row

I have a toy data-frame.
a = rep(1:5, each=3)
b = rep(c("a","b","c"), each = 5)
df = data.frame(a,b)
a b
1 1 a
2 1 a
3 1 a
4 2 a
5 2 a
6 2 b
7 3 b
8 3 b
9 3 b
10 4 b
11 4 c
12 4 c
13 5 c
14 5 c
15 5 c
I also have an index.
idx = c(2,3,5)
I want to select all the rows where a is 2, 3, or 5, as specified by idx.
I've tried the following, but neither works.
df[df$a==idx, ]
subset(df, df$a==idx)
This shouldn't be too hard.
Use the %in% operator:
df[df$a %in% idx,]
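A short sketch of why == gives the wrong rows here while %in% does not. The names `recycled` and `matched` are illustrative.

```r
a <- rep(1:5, each = 3)
b <- rep(c("a", "b", "c"), each = 5)
df <- data.frame(a, b)
idx <- c(2, 3, 5)

# == compares element by element and recycles idx along df$a
# (2, 3, 5, 2, 3, 5, ...), so only rows that happen to line up are kept.
recycled <- df[df$a == idx, ]
nrow(recycled)
# 3

# %in% tests each value of df$a for membership in idx, regardless of position.
matched <- df[df$a %in% idx, ]
nrow(matched)
# 9
```

Because 15 is a multiple of 3, the recycling in the == version does not even trigger a warning.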

Resources