r - dedupe the rows with value in dataframe

How can I keep only the rows that have a value in a particular column, among rows that are duplicated on other columns?
Example:
df
A B C D
1 NA 8 7
1 5 8 9
2 6 5 8
2 NA 5 6
3 NA 8 5
In the dataset above, the first 4 rows are duplicated on columns A and C, so among them I want to keep only the rows that have a value in column B.
Desired output,
A B C D
1 5 8 9
2 6 5 8
3 NA 8 5
Thanks.

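Using dplyr:
library(dplyr)
df <- read.table(text="A B C D
1 NA 8 7
1 5 8 9
2 6 5 8
2 NA 5 6
3 NA 8 5", header=TRUE)
# within each (A, C) group, keep a row if it is the only one or if B is not missing
df %>%
  group_by(A,C) %>%
  filter(n()==1|!is.na(B))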
A B C D
<int> <int> <int> <int>
1 1 5 8 9
2 2 6 5 8
3 3 NA 8 5
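One thing to watch with this filter: if a duplicated (A, C) group has NA in B on every row, all of its rows are dropped. Below is a minimal sketch of a variation that keeps such groups, assuming that is the behaviour you want. The df2 data and the all(is.na(B)) condition are illustrative additions, not part of the answer above.

```r
library(dplyr)

# Hypothetical data: group (4, 1) is duplicated and has no value in B at all
df2 <- data.frame(A = c(1, 1, 3, 4, 4),
                  B = c(NA, 5, NA, NA, NA),
                  C = c(8, 8, 8, 1, 1),
                  D = c(7, 9, 5, 2, 3))

res <- df2 %>%
  group_by(A, C) %>%
  # keep rows with a value in B, or every row of a group where B is all NA
  filter(!is.na(B) | all(is.na(B))) %>%
  ungroup()
```

A single-row group with NA in B is covered by the second condition, so the n() == 1 check is no longer needed in this variation.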

Keep rows that are duplicated (looking backwards or forwards) and not missing on B, or that are not duplicated at all:
anydup <- duplicated(df[c("A","C")]) | duplicated(df[c("A","C")], fromLast=TRUE)
df[(anydup & (!is.na(df$B))) | (!anydup),]
# A B C D
#2 1 5 8 9
#3 2 6 5 8
#5 3 NA 8 5
Or use ave to check the length per group as per @HubertL's dplyr answer:
df[!is.na(df$B) | ave(df$B, df[c("A","C")], FUN=length)==1,]
# A B C D
#2 1 5 8 9
#3 2 6 5 8
#5 3 NA 8 5

Here is one option with data.table
library(data.table)
setDT(df)[df[, .I[.N==1 | complete.cases(B)] , .(A, C)]$V1]
# A B C D
#1: 1 5 8 9
#2: 2 6 5 8
#3: 3 NA 8 5

Related

I want to delete the IDs that have no information in the remaining columns

Here is a representation of my dataset:
Number<-c(1:10)
AA<-c(head(LETTERS,4), rep(NA,6))
BB<-c(head(letters,6), rep(NA,4))
CC<-c(1:6, rep(NA,4))
DD<-c(10:14, rep(NA,5))
EE<-c(3:8, rep(NA,4))
FF<-c(6:1, rep(NA,4))
mydata<-data.frame(Number,AA,BB,CC,DD,EE,FF)
I want to delete all the IDs (Number) that have no information in the remaining columns, automatically. I want to tell the function that if there is a value in Number but there is only NA in all the remaining columns, delete the row.
The result should be the dataframe below:
Number AA BB CC DD EE FF
1 1 A a 1 10 3 6
2 2 B b 2 11 4 5
3 3 C c 3 12 5 4
4 4 D d 4 13 6 3
5 5 <NA> e 5 14 7 2
6 6 <NA> f 6 NA 8 1
Another possible base R solution:
mydata[rowSums(is.na(mydata[,-1])) != ncol(mydata[,-1]), ]
Output
Number AA BB CC DD EE FF
1 1 A a 1 10 3 6
2 2 B b 2 11 4 5
3 3 C c 3 12 5 4
4 4 D d 4 13 6 3
5 5 <NA> e 5 14 7 2
6 6 <NA> f 6 NA 8 1
Or we could use apply:
mydata[!apply(mydata[,-1], 1, function(x) all(is.na(x))),]
A possible solution, using janitor::remove_empty:
library(dplyr)
library(janitor)
inner_join(mydata, remove_empty(mydata[-1], which = "rows"))
#> Joining, by = c("AA", "BB", "CC", "DD", "EE", "FF")
#> Number AA BB CC DD EE FF
#> 1 1 A a 1 10 3 6
#> 2 2 B b 2 11 4 5
#> 3 3 C c 3 12 5 4
#> 4 4 D d 4 13 6 3
#> 5 5 <NA> e 5 14 7 2
#> 6 6 <NA> f 6 NA 8 1
We can use if_any/if_all
library(dplyr)
mydata %>%
filter(if_any(-Number, complete.cases))
Output
Number AA BB CC DD EE FF
1 1 A a 1 10 3 6
2 2 B b 2 11 4 5
3 3 C c 3 12 5 4
4 4 D d 4 13 6 3
5 5 <NA> e 5 14 7 2
6 6 <NA> f 6 NA 8 1
or
mydata %>%
filter(!if_all(-Number, is.na))
Or with base R
subset(mydata, rowSums(!is.na(mydata[-1])) >0 )
Number AA BB CC DD EE FF
1 1 A a 1 10 3 6
2 2 B b 2 11 4 5
3 3 C c 3 12 5 4
4 4 D d 4 13 6 3
5 5 <NA> e 5 14 7 2
6 6 <NA> f 6 NA 8 1
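The rowSums idea can be wrapped in a small reusable function. This is only a sketch: the name drop_empty_rows and its id_cols parameter are made up for illustration, and the example data is a cut-down version of mydata from the question.

```r
# Hypothetical helper: drop rows in which every non-ID column is NA.
drop_empty_rows <- function(data, id_cols) {
  # all columns except the ID columns, kept as a data frame
  others <- data[, setdiff(names(data), id_cols), drop = FALSE]
  # keep rows that have at least one non-missing value among those columns
  data[rowSums(!is.na(others)) > 0, , drop = FALSE]
}

mydata <- data.frame(
  Number = 1:10,
  AA = c(head(LETTERS, 4), rep(NA, 6)),
  BB = c(head(letters, 6), rep(NA, 4)),
  CC = c(1:6, rep(NA, 4)),
  DD = c(10:14, rep(NA, 5))
)

res <- drop_empty_rows(mydata, "Number")
```

Passing the ID columns by name avoids relying on Number being the first column, which the mydata[-1] versions assume.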
To drop columns (rather than rows) that contain only NA, try this:
df <- df[,colSums(is.na(df))<nrow(df)]
This makes a copy of your data though. If you have a large dataset then you can use:
Filter(function(x)!all(is.na(x)), df)
Or, if you want to use a data.table, which is usually a pretty solid go-to:
library(data.table)
DT <- as.data.table(df)
DT[,which(unlist(lapply(DT, function(x)!all(is.na(x))))),with=FALSE]

Creating two columns of cumulative sum based on the categories of one column

I would like to create two columns with the cumulative frequency of "A" and "B" in the assignment column.
df = data.frame(id = 1:10, assignment= c("B","A","B","B","B","A","B","B","A","B"))
id assignment
1 1 B
2 2 A
3 3 B
4 4 B
5 5 B
6 6 A
7 7 B
8 8 B
9 9 A
10 10 B
The resulting table would have this format
id assignment A B
1 1 B 0 1
2 2 A 1 1
3 3 B 1 2
4 4 B 1 3
5 5 B 1 4
6 6 A 2 4
7 7 B 2 5
8 8 B 2 6
9 9 A 3 6
10 10 B 3 7
How can I generalize the code for more than 2 categories (say "A", "B", "C")?
Thanks
Use lapply over unique values in assignment to create new columns.
vals <- sort(unique(df$assignment))
df[vals] <- lapply(vals, function(x) cumsum(df$assignment == x))
df
# id assignment A B
#1 1 B 0 1
#2 2 A 1 1
#3 3 B 1 2
#4 4 B 1 3
#5 5 B 1 4
#6 6 A 2 4
#7 7 B 2 5
#8 8 B 2 6
#9 9 A 3 6
#10 10 B 3 7
We can use model.matrix with colCumsums
library(matrixStats)
cbind(df, colCumsums(model.matrix(~ assignment - 1, df[-1])))
A base R option
transform(
df,
A = cumsum(assignment == "A"),
B = cumsum(assignment == "B")
)
gives
id assignment A B
1 1 B 0 1
2 2 A 1 1
3 3 B 1 2
4 4 B 1 3
5 5 B 1 4
6 6 A 2 4
7 7 B 2 5
8 8 B 2 6
9 9 A 3 6
10 10 B 3 7
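The transform() version hard-codes one line per category. A minimal sketch of the generalisation the question asks about, using made-up data with a third category "C" (df3 is a hypothetical example, not the data above):

```r
# Hypothetical data with three categories
df3 <- data.frame(id = 1:6,
                  assignment = c("B", "A", "C", "B", "C", "A"),
                  stringsAsFactors = FALSE)

vals <- sort(unique(df3$assignment))

# one column per category: running count of rows equal to that category
counts <- sapply(vals, function(v) cumsum(df3$assignment == v))

res <- cbind(df3, counts)
```

Because sapply names the result columns after the values in vals, the new columns come out as A, B and C without listing them by hand.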

Merging duplicated columns by keeping whichever row value is greater

I have a list of dataframes, and the dataframes have some duplicated columns. I want to merge the duplicated columns by keeping, for each row, the value that is greater than the others (some data frames have many more duplicates).
example data:
temp <- data.frame(seq_len(15), 5, 3)
colnames(temp) <- c("A", "A", "B")
temp$A[5]=NA
temp$A[3]=NA
temp$A[2]=NA
temp[7,2]=NA
A A B
<int> <dbl> <dbl>
1 5 3
NA 5 3
NA 5 3
4 5 3
NA 5 3
6 5 3
7 NA 3
8 5 3
9 5 3
10 5 3
final output
A B
<int> <dbl>
1 3
5 3
5 3
5 3
5 3
6 3
7 3
8 3
9 3
10 3
Thanks for everyone
A base R approach would be to split the data frame by column name and select the row-wise maximum using do.call + pmax.
data.frame(sapply(split.default(temp, names(temp)), function(x)
do.call(pmax, c(x, na.rm = TRUE))))
# A B
#1 5 3
#2 5 3
#3 5 3
#4 5 3
#5 5 3
#6 6 3
#7 7 3
#8 8 3
#9 9 3
#10 10 3
#11 11 3
#12 12 3
#13 13 3
#14 14 3
#15 15 3
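Since the question mentions a list of data frames, the same idea can be wrapped in a function and applied with lapply. This is a sketch under assumptions: merge_dup_cols and df_list are hypothetical names, and the small data frame below is made up for the example.

```r
# Hypothetical helper: collapse same-named columns into their row-wise maximum
merge_dup_cols <- function(d) {
  groups <- split.default(d, names(d))   # one group per column name
  data.frame(lapply(groups, function(x) {
    # unname() so the duplicated column names are not passed as argument names
    do.call(pmax, c(unname(as.list(x)), na.rm = TRUE))
  }))
}

# Small made-up example with a duplicated column name
temp2 <- data.frame(c(1, NA, 4), c(5, 5, NA), 3)
colnames(temp2) <- c("A", "A", "B")

df_list <- list(temp2, temp2)
out <- lapply(df_list, merge_dup_cols)
res <- out[[1]]
```

With na.rm = TRUE, pmax returns NA for a row only when every duplicate is NA in that row.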

Merge 2 rows with duplicated pair of values into a single row

I have the dataframe below, in which some pairs of rows have the same values for columns A and B: the 3rd and 4th rows (2 3) and the 7th and 8th rows (4 6).
master <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8),C=c(5,2,5,7,7,5,7,9,7,8),D=c(1,2,5,3,7,5,9,6,7,0))
A B C D
1 1 1 5 1
2 1 2 2 2
3 2 3 5 5
4 2 3 7 3
5 3 4 7 7
6 3 5 5 5
7 4 6 7 9
8 4 6 9 6
9 5 7 7 7
10 5 8 8 0
I would like to merge these rows into one by putting a pipe | between the values of C and D. The 2nd and 3rd lines, for example, would become:
A B C D
2 3 2|5 2|5
I think the combined pairs in your example are off by a row. Assuming that's the case, this is what you're looking for: we group by the columns that define the duplicates, and then paste the remaining values together with a separator.
library(tidyverse)
# across() needs dplyr >= 1.0.0; the older summarize_all(funs(...)) form is deprecated
master %>% group_by(A,B) %>% summarise(across(everything(), ~ paste0(.x, collapse="|")), .groups = "drop")
A B C D
<dbl> <dbl> <chr> <chr>
1 1 1 5 1
2 1 2 2 2
3 2 3 5|7 5|3
4 3 4 7 7
5 3 5 5 5
6 4 6 7|9 9|6
7 5 7 7 7
8 5 8 8 0
We can do this in base R with aggregate
aggregate(.~ A + B, master, FUN = paste, collapse= '|')
# A B C D
#1 1 1 5 1
#2 1 2 2 2
#3 2 3 5|7 5|3
#4 3 4 7 7
#5 3 5 5 5
#6 4 6 7|9 9|6
#7 5 7 7 7
#8 5 8 8 0

R Subset matching contiguous blocks

I have a dataframe.
dat <- data.frame(k=c("A","A","B","B","B","A","A","A"),
a=c(4,2,4,7,5,8,3,2),b=c(2,5,3,5,8,4,5,8),
stringsAsFactors = F)
k a b
1 A 4 2
2 A 2 5
3 B 4 3
4 B 7 5
5 B 5 8
6 A 8 4
7 A 3 5
8 A 2 8
I would like to subset contiguous blocks based on variable k. This would be a standard approach.
#using rle rather than levels
kval <- rle(dat$k)$values
for(i in 1:length(kval))
{
subdf <- subset(dat,dat$k==kval[i])
print(subdf)
#do something with subdf
}
k a b
1 A 4 2
2 A 2 5
6 A 8 4
7 A 3 5
8 A 2 8
k a b
3 B 4 3
4 B 7 5
5 B 5 8
k a b
1 A 4 2
2 A 2 5
6 A 8 4
7 A 3 5
8 A 2 8
So the subsetting above obviously does not work the way I intended. Any elegant way to get these results?
k a b
1 A 4 2
2 A 2 5
k a b
1 B 4 3
2 B 7 5
3 B 5 8
k a b
1 A 8 4
2 A 3 5
3 A 2 8
We can use rleid from data.table to create a grouping variable
library(data.table)
setDT(dat)[, grp := rleid(k)]
dat
# k a b grp
#1: A 4 2 1
#2: A 2 5 1
#3: B 4 3 2
#4: B 7 5 2
#5: B 5 8 2
#6: A 8 4 3
#7: A 3 5 3
#8: A 2 8 3
We can group by 'grp' and do all the operations within the 'grp' using standard data.table methods.
Here is a base R option to create 'grp'
dat$grp <- with(dat, cumsum(c(TRUE, k[-1]!= k[-length(k)])))
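Once a run identifier exists, the loop from the question can iterate over contiguous blocks instead of over values of k. A minimal base R sketch (the blocks variable is introduced here for illustration):

```r
dat <- data.frame(k = c("A", "A", "B", "B", "B", "A", "A", "A"),
                  a = c(4, 2, 4, 7, 5, 8, 3, 2),
                  b = c(2, 5, 3, 5, 8, 4, 5, 8),
                  stringsAsFactors = FALSE)

# run id: increases by 1 every time k differs from the previous row
grp <- cumsum(c(TRUE, dat$k[-1] != dat$k[-length(dat$k)]))

# one data frame per contiguous block
blocks <- split(dat, grp)

for (subdf in blocks) {
  print(subdf)
  # do something with subdf
}
```

The row names inside each block keep their original positions; use rownames(subdf) <- NULL inside the loop if they should restart at 1 as in the desired output.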

Resources