rbind and overwrite duplicate rows based on key variable? - r

> part1<-data.frame(key=c(5,6,7,8,9),x=c("b","d","a","c","b"))
> part1
key x
1 5 b # key==5,x==b
2 6 d
3 7 a
4 8 c
5 9 b
> part2<-data.frame(key=c(1,2,3,4,5), x=c("c","a","b","d","a"))
> part2
key x
1 1 c
2 2 a
3 3 b
4 4 d
5 5 a # key==5,x==a
There are more than 2 dataframes but I'll just use 2 for this example. I then use lapply to put them all in a list called dflist1, then rbind them. For this example I'll just do it manually.
dflist1<-list(part1,part2)
final<-do.call(rbind,dflist1)
final<-final[order(final$key),] #sort by key
Result:
> final
key x
6 1 c
7 2 a
8 3 b
9 4 d
1 5 b #duplicate from part1
10 5 a #duplicate from part2
2 6 d
3 7 a
4 8 c
5 9 b
I want to get rid of the duplicates. It's easy to use !duplicated() but in this case I specifically want to drop/overwrite the rows from earlier dataframes - i.e., in this case the "5 b" from part1 should get dropped/overwritten by the "5 a" from part2. And if there was a part3 with a value "5 b" then the "5 a" from part2 would then get dropped/overwritten by the "5 b" from part3.
What I want:
key x
6 1 c
7 2 a
8 3 b
9 4 d
10 5 a #this is from part2, no more duplicate from part1
2 6 d
3 7 a
4 8 c
5 9 b
Current solution: The only thing I can think of is to add a function that flags each dataframe with an extra variable, then sort it and use !duplicated on that variable... is there an easier or more elegant solution that doesn't require flagging?

## Create many data.frames
set.seed(1)
## Normally, do this in lapply...
part1 <- data.frame(key=1:6, x=sample(letters, 6))
part2 <- data.frame(key=4:8, x=sample(letters, 5))
part3 <- data.frame(key=8:12, x=sample(letters, 5))
library(data.table)
## Collect all your "parts"
PARTS.LIST <- lapply(ls(pattern="^part\\d+"), function(x) get(x))
DT.part <- rbindlist(PARTS.LIST)
setkey(DT.part, key)
unique(DT.part, by="key")
        ORIGINAL                    UNIQUE
       ---------                  -----------
> DT.part                 > unique(DT.part, by="key")
    key x                     key x
 1:   1 l                  1:   1 l
 2:   2 v                  2:   2 v
 3:   3 k                  3:   3 k
 4:   4 q                  4:   4 q
 5:   4 i                  5:   5 r
 6:   5 r                  6:   6 f
 7:   5 w                  7:   7 v
 8:   6 f                  8:   8 f
 9:   6 o                  9:   9 j
10:   7 v                 10:  10 d
11:   8 f                 11:  11 g
12:   8 l                 12:  12 m
13:   9 j
14:  10 d
15:  11 g
16:  12 m
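Note that unique() keeps the first row it sees for each key, whereas the question asks for the row from the latest data frame to win. A minimal base R sketch of that, using part1 and part2 from the question and duplicated() with fromLast = TRUE (it assumes the parts are bound in oldest-to-newest order):

```r
# Data from the question; stringsAsFactors = FALSE keeps x as character
part1 <- data.frame(key = c(5, 6, 7, 8, 9), x = c("b", "d", "a", "c", "b"),
                    stringsAsFactors = FALSE)
part2 <- data.frame(key = c(1, 2, 3, 4, 5), x = c("c", "a", "b", "d", "a"),
                    stringsAsFactors = FALSE)

# Bind oldest first, newest last
dflist1 <- list(part1, part2)
final <- do.call(rbind, dflist1)

# fromLast = TRUE marks every occurrence of a key except the last one,
# so the row from the most recently bound data frame survives
final <- final[!duplicated(final$key, fromLast = TRUE), ]
final <- final[order(final$key), ]  # sort by key
```

For key 5 this keeps "5 a" from part2 and drops "5 b" from part1. unique() for data.table also accepts fromLast = TRUE, so unique(DT.part, by="key", fromLast=TRUE) should have the same effect, provided the parts were collected in the intended order (ls() sorts names alphabetically, so part10 would come before part2).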

Related

Update data.table with another data.table [duplicate]

Assume that, given a data.table da, there is another data.table db that shares some column names with da. Some of those shared columns also have identical content. My question is: how do I replace the entries in da whose matching columns agree with db with the content of db?
Put more simply:
require(data.table)
da <- data.table(a=1:10,b=10:1,c=LETTERS[1:10])
## All (a,b) pairs in db exist in da, but c is different.
db <- data.table(a=c(2,6,8),b=c(9,5,3),c=c('x','y','z'))
## Data table dx will have c-column values of db$c where (a,b) matches between da and db.
dx <- db[da, .(a, b, c = fifelse(is.na(c), i.c, c)), on = c('a','b')]
Output
> dx
a b c
1: 1 10 A
2: 2 9 x
3: 3 8 C
4: 4 7 D
5: 5 6 E
6: 6 5 y
7: 7 4 G
8: 8 3 z
9: 9 2 I
10: 10 1 J
> da
a b c
1: 1 10 A
2: 2 9 B
3: 3 8 C
4: 4 7 D
5: 5 6 E
6: 6 5 F
7: 7 4 G
8: 8 3 H
9: 9 2 I
10: 10 1 J
> db
a b c
1: 2 9 x
2: 6 5 y
3: 8 3 z
I know that the above achieves my goal, but it feels clumsy. Is there a built-in data.table way to do this?
Okay, I figured it out:
dx <- da[db,c:=i.c,on=c('a','b')]
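One caveat with the := form: it modifies da by reference, so dx and da end up being the same updated table. A minimal sketch that leaves da untouched by updating a copy() instead (same data as in the question):

```r
library(data.table)

da <- data.table(a = 1:10, b = 10:1, c = LETTERS[1:10])
db <- data.table(a = c(2, 6, 8), b = c(9, 5, 3), c = c("x", "y", "z"))

# Update join on a copy: for rows of the copy whose (a, b) match db,
# overwrite c with db's value (i.c refers to column c of db)
dx <- copy(da)[db, c := i.c, on = c("a", "b")]

# da still holds the original letters; dx holds the patched values
da$c
dx$c
```

If overwriting da in place is what you want, the copy() can simply be dropped.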

How do I group_by if the column that I want to summarize with has all the same values

x l
1 1 a
2 3 b
3 2 c
4 3 b
5 2 c
6 4 d
7 5 f
8 2 c
9 1 a
10 1 a
11 3 b
12 4 d
The above is the input.
The below is the output.
x l
1 1 a
2 3 b
3 2 c
4 4 d
5 5 f
I know that column l will have the same value for each group_by(x).
l is a string
# Creation of dataset
x <- c(1,3,2,3,2,4,5,2,1,1,3,4)
l<- c("a","b","c","b","c","d","f","c","a","a","b","d")
df <- data.frame(x,l)
# Simply call unique function on your dataframe
dfu <- unique(df)
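unique(df) compares whole rows, which works here because l is constant within each x. If the data frame had further columns that vary within a group, a sketch that keeps the first row per x only, using duplicated() on that one column, might look like this:

```r
# Same data as above
x <- c(1, 3, 2, 3, 2, 4, 5, 2, 1, 1, 3, 4)
l <- c("a", "b", "c", "b", "c", "d", "f", "c", "a", "a", "b", "d")
df <- data.frame(x, l, stringsAsFactors = FALSE)

# Keep the first row for each distinct value of x
dfu <- df[!duplicated(df$x), ]

# Optional: renumber the rows 1..5 as in the desired output
rownames(dfu) <- NULL
```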

R: reshape a data frame when more than 2 dimensions

I am trying to cast a data frame into another one; see the example below:
> start = data.frame(Aa = c('A','A','A','A','a','a','a','a'),Bb = c('B','B','b','b','B','B','b','b'),Cc = c('C','c','C','c','C','c','C','c'),v=c(1,2,3,4,5,6,7,8))
> start
Aa Bb Cc v
1 A B C 1
2 A B c 2
3 A b C 3
4 A b c 4
5 a B C 5
6 a B c 6
7 a b C 7
8 a b c 8
And I would like to have a data frame like this one:
1 A B 3
2 A b 7
3 a B 11
4 a b 15
5 B C 6
6 B c 8
7 b C 10
8 b c 12
9 A C 4
10 A c 6
11 a C 12
12 a c 14
Line 1 is computed as follows: A-B-C -> 1 and A-B-c -> 2, so A-B -> 3.
I can imagine a solution with for loops over the columns, but I need it to be time-efficient: I can have 100,000 rows and up to 100 columns, and I don't think for loops are really efficient in R.
Do you have any ideas?
Thank you!
Perhaps you can use combn on the column names.
Here, I've used data.table for its efficient aggregation and for the convenience of rbindlist to put the data back together.
library(data.table)
setDT(start)
rbindlist(combn(names(start)[1:3], 2, FUN = function(x) {
start[, sum(v), x]
}, simplify = FALSE))
# Aa Bb V1
# 1: A B 3
# 2: A b 7
# 3: a B 11
# 4: a b 15
# 5: A C 4
# 6: A c 6
# 7: a C 12
# 8: a c 14
# 9: B C 6
# 10: B c 8
# 11: b C 10
# 12: b c 12
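For comparison, the same idea (one grouped sum per pair of columns, then stacking the pieces) can be sketched in base R with aggregate(). It will likely be slower than data.table on large inputs, and the output column names first, second and total are made up for illustration:

```r
start <- data.frame(
  Aa = c("A", "A", "A", "A", "a", "a", "a", "a"),
  Bb = c("B", "B", "b", "b", "B", "B", "b", "b"),
  Cc = c("C", "c", "C", "c", "C", "c", "C", "c"),
  v  = c(1, 2, 3, 4, 5, 6, 7, 8),
  stringsAsFactors = FALSE
)

# All pairs of grouping columns: (Aa, Bb), (Aa, Cc), (Bb, Cc)
pairs <- combn(names(start)[1:3], 2, simplify = FALSE)

pair_sums <- do.call(rbind, lapply(pairs, function(p) {
  # Sum v within each combination of the two columns in p
  agg <- aggregate(start["v"], by = start[p], FUN = sum)
  # Give every piece the same column names so rbind can stack them
  names(agg) <- c("first", "second", "total")
  agg
}))
```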

How to drop factors that have fewer than n members

Is there a way to drop factors that have fewer than N rows, like N = 5, from a data table?
Data:
DT = data.table(x=rep(c("a","b","c"),each=6), y=c(1,3,6), v=1:9,
id=c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4))
Goal: remove the rows of every id group that has fewer than 5 rows. The variable "id" is the grouping variable. In DT, I need to determine which groups have fewer than 5 members (groups "1" and "4") and then remove those rows. Desired result:
x y v id
1: a 3 5 2
2: a 6 6 2
3: b 1 7 2
4: b 3 8 2
5: b 6 9 2
6: b 1 1 3
7: b 3 2 3
8: b 6 3 3
9: c 1 4 3
10: c 3 5 3
11: c 6 6 3
Here's an approach....
Get the length of the factors, and the factors to keep
nFactors<-tapply(DT$id,DT$id,length)
keepFactors <- nFactors >= 5
Then identify the ids to keep, and keep those rows. This generates the desired results, but is there a better way?
idsToKeep <- as.numeric(names(keepFactors[which(keepFactors)]))
DT[DT$id %in% idsToKeep,]
Since you begin with a data.table, this first part uses data.table syntax.
EDIT: Thanks to Arun (comment) for helping me improve this data table answer
DT[DT[, .(I=.I[.N>=5L]), by=id]$I]
# x y v id
# 1: a 3 5 2
# 2: a 6 6 2
# 3: b 1 7 2
# 4: b 3 8 2
# 5: b 6 9 2
# 6: b 1 1 3
# 7: b 3 2 3
# 8: b 6 3 3
# 9: c 1 4 3
# 10: c 3 5 3
# 11: c 6 6 3
In base R you could use
df <- data.frame(DT)
tab <- table(df$id)
df[df$id %in% names(tab[tab >= 5]), ]
# x y v id
# 5 a 3 5 2
# 6 a 6 6 2
# 7 b 1 7 2
# 8 b 3 8 2
# 9 b 6 9 2
# 10 b 1 1 3
# 11 b 3 2 3
# 12 b 6 3 3
# 13 c 1 4 3
# 14 c 3 5 3
# 15 c 6 6 3
If using a data.table is not necessary, you can use dplyr:
library(dplyr)
data.frame(DT) %>%
group_by(id) %>%
filter(n() >= 5)

Multiple joins/merges with data.tables

I have two data.tables, DT and L:
> DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9,key="x")
> L=data.table(yv=c(1L:8L,12L),lu=c(letters[8:1],letters[12]),key="yv")
> DT
x y v
1: a 1 1
2: a 3 2
3: a 6 3
4: b 1 4
5: b 3 5
6: b 6 6
7: c 1 7
8: c 3 8
9: c 6 9
> L
yv lu
1: 1 h
2: 2 g
3: 3 f
4: 4 e
5: 5 d
6: 6 c
7: 7 b
8: 8 a
9: 12 l
I would like to independently look up the corresponding value of lu from L for column y and for column v in DT. The following syntax provides the correct result, but is cumbersome to generate and then understand at a glance later:
> L[setkey(L[setkey(DT,y)],v)][,list(x,y=yv.1,v=yv,lu.1=lu.1,lu.2=lu)]
x y v lu.1 lu.2
1: a 1 1 h h
2: a 2 3 g f
3: a 3 6 f c
4: b 4 1 e h
5: b 5 3 d f
6: b 6 6 c c
7: c 7 1 b h
8: c 8 3 a f
9: c 9 6 NA c
(Edit: original post had L[setkey(L[setkey(DT,y)],v)][,list(x,y=yv,v=yv.1,lu.1=lu,lu.2=lu.1)] above, which incorrectly mixed up the y and v columns and looked up values.)
In SQL this would be simple/straightforward:
SELECT DT.*, L1.lu AS lu1, L2.lu AS lu2
FROM DT
LEFT JOIN L AS L1 ON DT.y = L1.yv
LEFT JOIN L AS L2 ON DT.v = L2.yv
Is there a more elegant way to use data.table to perform multiple joins? Note that I'm joining one table to another table twice in this example, but I am also interested in joining one table to multiple different tables.
Great question. One trick is that i doesn't have to be keyed. Only x must be keyed.
There might be better ways. How about this:
> cbind( L[DT[,list(y)]], L[DT[,list(v)]], DT )
yv lu yv lu x y v
1: 1 h 1 h a 1 1
2: 3 f 2 g a 3 2
3: 6 c 3 f a 6 3
4: 1 h 4 e b 1 4
5: 3 f 5 d b 3 5
6: 6 c 6 c b 6 6
7: 1 h 7 b c 1 7
8: 3 f 8 a c 3 8
9: 6 c 9 NA c 6 9
or, to illustrate, this is the same :
> cbind( L[J(DT$y)], L[J(DT$v)], DT )
yv lu yv lu x y v
1: 1 h 1 h a 1 1
2: 3 f 2 g a 3 2
3: 6 c 3 f a 6 3
4: 1 h 4 e b 1 4
5: 3 f 5 d b 3 5
6: 6 c 6 c b 6 6
7: 1 h 7 b c 1 7
8: 3 f 8 a c 3 8
9: 6 c 9 NA c 6 9
merge could also be used, if the following feature request were implemented:
FR#2033 Add by.x and by.y to merge.data.table
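Another way to express the two LEFT JOINs from the question is one update join per lookup column, using on= so that neither table needs a key. This is a sketch rather than the method given in the answer above; the column names lu1 and lu2 are borrowed from the SQL aliases, and the work is done on a copy so DT itself is not modified:

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
L  <- data.table(yv = c(1L:8L, 12L), lu = c(letters[8:1], letters[12]))

res <- copy(DT)

# LEFT JOIN L AS L1 ON DT.y = L1.yv  (unmatched rows get NA)
res[L, lu1 := i.lu, on = c(y = "yv")]

# LEFT JOIN L AS L2 ON DT.v = L2.yv
res[L, lu2 := i.lu, on = c(v = "yv")]

res
```

The same pattern extends to several different lookup tables: one update join per table, each with its own on= mapping.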
