R: reshape a data frame when there are more than 2 dimensions

I am trying to cast a data frame into another one; see the example below:
> start = data.frame(Aa = c('A','A','A','A','a','a','a','a'),Bb = c('B','B','b','b','B','B','b','b'),Cc = c('C','c','C','c','C','c','C','c'),v=c(1,2,3,4,5,6,7,8))
> start
Aa Bb Cc v
1 A B C 1
2 A B c 2
3 A b C 3
4 A b c 4
5 a B C 5
6 a B c 6
7 a b C 7
8 a b c 8
And I would like to have a data frame like this one:
1 A B 3
2 A b 7
3 a B 11
4 a b 15
5 B C 6
6 B c 8
7 b C 10
8 b c 12
9 A C 4
10 A c 6
11 a C 12
12 a c 14
Line 1 is calculated as follows: we have A-B-C -> 1 and A-B-c -> 2, so A-B -> 3.
I can imagine a solution with some for loops over the columns, but I need it to be time efficient. I can have 100,000 rows and up to 100 columns, so I need something fast, and I don't think for loops are really efficient in R.
Do you have any ideas?
Thank you!

Perhaps you can use combn on the column names.
Here, I've used data.table for its efficient aggregation and for the convenience of rbindlist to put the data back together.
library(data.table)
setDT(start)
rbindlist(combn(names(start)[1:3], 2, FUN = function(x) {
start[, sum(v), x]
}, simplify = FALSE))
# Aa Bb V1
# 1: A B 3
# 2: A b 7
# 3: a B 11
# 4: a b 15
# 5: A C 4
# 6: A c 6
# 7: a C 12
# 8: a c 14
# 9: B C 6
# 10: B c 8
# 11: b C 10
# 12: b c 12
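The same combn idea can be written so that every block of rows records which pair of columns produced it. This is only a sketch under assumptions: the names dims, pairs, key1, key2 and pair are made up for illustration and do not come from the answer above, and v is assumed to be the only value column.

```r
library(data.table)

start <- data.table(
  Aa = c("A", "A", "A", "A", "a", "a", "a", "a"),
  Bb = c("B", "B", "b", "b", "B", "B", "b", "b"),
  Cc = c("C", "c", "C", "c", "C", "c", "C", "c"),
  v  = 1:8
)

# Every column except the value column is treated as a dimension (assumption).
dims  <- setdiff(names(start), "v")
pairs <- combn(dims, 2, simplify = FALSE)
names(pairs) <- vapply(pairs, paste, character(1), collapse = "-")

# Aggregate once per pair, then give the two grouping columns common names
# so that the pieces can be stacked without relying on column position.
res <- rbindlist(lapply(pairs, function(x) {
  out <- start[, .(v = sum(v)), by = x]
  setnames(out, x, c("key1", "key2"))
  out
}), idcol = "pair")

print(res)
```

With 100 columns there are 4,950 pairs, so the number of aggregations grows quickly; whether this is fast enough would have to be measured on the real data.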

Related

Update data.table with another data.table [duplicate]

This question already has answers here:
Update subset of data.table based on join
(3 answers)
Closed 1 year ago.
Assume that given data.table da, there is another data.table, db which has some column names equal to the column names of da. Some of those columns with equal names also have identical content. My question is how to replace those entries in da that have columns matching with db with content of db?
Put more simply:
require(data.table)
da <- data.table(a=1:10,b=10:1,c=LETTERS[1:10])
## All (a,b) pairs in db exist in da, but c is different.
db <- data.table(a=c(2,6,8),b=c(9,5,3),c=c('x','y','z'))
## Data table dx will have c-column values of db$c where (a,b) matches between da and db.
dx <- db[da, .(a, b, c = fifelse(is.na(c), i.c, c)), on = c('a','b')]
Output
> dx
a b c
1: 1 10 A
2: 2 9 x
3: 3 8 C
4: 4 7 D
5: 5 6 E
6: 6 5 y
7: 7 4 G
8: 8 3 z
9: 9 2 I
10: 10 1 J
> da
a b c
1: 1 10 A
2: 2 9 B
3: 3 8 C
4: 4 7 D
5: 5 6 E
6: 6 5 F
7: 7 4 G
8: 8 3 H
9: 9 2 I
10: 10 1 J
> db
a b c
1: 2 9 x
2: 6 5 y
3: 8 3 z
I know that the above achieves my goal, but it feels clumsy. Is there a built-in data.table way to do this?
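Okay, I figured it out. Note that := updates da by reference, so after this line dx and da are the same modified table:
dx <- da[db, c := i.c, on = c('a','b')]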
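If da has to stay untouched, the update join can be run on a copy. A minimal sketch, assuming the same da and db as in the question (the key columns of db are written as integers here only to keep the types identical on both sides):

```r
library(data.table)

da <- data.table(a = 1:10, b = 10:1, c = LETTERS[1:10])
db <- data.table(a = c(2L, 6L, 8L), b = c(9L, 5L, 3L), c = c("x", "y", "z"))

# copy() makes a deep copy, so := modifies the copy and leaves da alone.
# i.c refers to column c of db, the table in the i position.
dx <- copy(da)[db, c := i.c, on = c("a", "b")]

print(dx)
print(da)
```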

Keep all the data.table when aggregating a data.table

I would like to aggregate a data.table by a list of columns and keep all the columns at the end.
A <- c(1,2,3,4,4,6,4)
B <- c("a","b","c","d","e","f","g")
C <- c(10,11,23,8,8,1,3)
D <- c(2,3,5,9,7,8,4)
dt <- data.table(A,B,C,D)
Now I want to aggregate column B, as in paste(B,sep=";"), by A and C, and also keep column D at the end. Do you know a way to do this, please?
EDIT
This is what I obtained using dt[, newCol := toString(B), .(A, C)]:
A B C D newCol
1: 1 a 10 2 a
2: 2 b 11 3 b
3: 3 c 23 5 c
4: 4 d 8 9 d, e
5: 4 e 8 7 d, e
6: 6 f 1 8 f
7: 4 g 3 4 g
But I would like to obtain:
A B C D newCol
1: 1 a 10 2 a
2: 2 b 11 3 b
3: 3 c 23 5 c
4: 4 d 8 9 d, e
6: 6 f 1 8 f
7: 4 g 3 4 g
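One possible way to get from the first table to the second, sketched under the assumption that the first row of each (A, C) group is the one to keep: add newCol by reference as in the edit, then drop the repeated groups with unique(). The name res is only illustrative.

```r
library(data.table)

A <- c(1, 2, 3, 4, 4, 6, 4)
B <- c("a", "b", "c", "d", "e", "f", "g")
C <- c(10, 11, 23, 8, 8, 1, 3)
D <- c(2, 3, 5, 9, 7, 8, 4)
dt <- data.table(A, B, C, D)

# Step 1: concatenate B within each (A, C) group, keeping every row.
dt[, newCol := toString(B), by = .(A, C)]

# Step 2: keep only the first row of each (A, C) group.
# B and D then hold the values of that first row (d and 9 for the group 4/8).
res <- unique(dt, by = c("A", "C"))

print(res)
```

To join with ";" instead of ", ", paste(B, collapse = ";") can replace toString(B); paste(B, sep = ";") on a single vector does not concatenate its elements.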

rbind and overwrite duplicate rows based on key variable?

> part1<-data.frame(key=c(5,6,7,8,9),x=c("b","d","a","c","b"))
> part1
key x
1 5 b # key==5,x==b
2 6 d
3 7 a
4 8 c
5 9 b
> part2<-data.frame(key=c(1,2,3,4,5), x=c("c","a","b","d","a"))
> part2
key x
1 1 c
2 2 a
3 3 b
4 4 d
5 5 a # key==5,x==a
There are more than 2 dataframes but I'll just use 2 for this example. I then use lapply to put them all in a list called dflist1, then rbind them. For this example I'll just do it manually.
dflist1<-list(part1,part2)
final<-do.call(rbind,dflist1)
final<-final[order(final$key),] #sort by key
Result:
> final
key x
6 1 c
7 2 a
8 3 b
9 4 d
1 5 b #duplicate from part1
10 5 a #duplicate from part2
2 6 d
3 7 a
4 8 c
5 9 b
I want to get rid of the duplicates. It's easy to use !duplicated() but in this case I specifically want to drop/overwrite the rows from earlier dataframes - i.e., in this case the "5 b" from part1 should get dropped/overwritten by the "5 a" from part2. And if there was a part3 with a value "5 b" then the "5 a" from part2 would then get dropped/overwritten by the "5 b" from part3.
What I want:
key x
6 1 c
7 2 a
8 3 b
9 4 d
10 5 a #this is from part2, no more duplicate from part1
2 6 d
3 7 a
4 8 c
5 9 b
Current solution: The only thing I can think of is to add a function that flags each dataframe with an extra variable, then sort it and use !duplicated on that variable... is there an easier or more elegant solution that doesn't require flagging?
## Create many data.frames
set.seed(1)
## Normally, do this in lapply...
part1 <- data.frame(key=1:6, x=sample(letters, 6))
part2 <- data.frame(key=4:8, x=sample(letters, 5))
part3 <- data.frame(key=8:12, x=sample(letters, 5))
library(data.table)
## Collect all your "parts"
PARTS.LIST <- lapply(ls(pattern="^part\\d+"), function(x) get(x))
DT.part <- rbindlist(PARTS.LIST)
setkey(DT.part, key)
unique(DT.part, by="key")
      ORIGINAL                   UNIQUE
     ---------                -----------
> DT.part              > unique(DT.part, by="key")
    key x                  key x
 1:   1 l               1:   1 l
 2:   2 v               2:   2 v
 3:   3 k               3:   3 k
 4:   4 q               4:   4 q
 5:   4 i               5:   5 r
 6:   5 r               6:   6 f
 7:   5 w               7:   7 v
 8:   6 f               8:   8 f
 9:   6 o               9:   9 j
10:   7 v              10:  10 d
11:   8 f              11:  11 g
12:   8 l              12:  12 m
13:   9 j
14:  10 d
15:  11 g
16:  12 m
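Note that unique() keeps the first row of each key by default, so in the output above the rows from the earlier parts survive (key 4 keeps q from part1). The question asks for the opposite. A sketch of the "later parts win" variant, using the two data frames from the question and the fromLast argument; the names parts, final and DT are only illustrative:

```r
library(data.table)

part1 <- data.frame(key = c(5, 6, 7, 8, 9), x = c("b", "d", "a", "c", "b"),
                    stringsAsFactors = FALSE)
part2 <- data.frame(key = c(1, 2, 3, 4, 5), x = c("c", "a", "b", "d", "a"),
                    stringsAsFactors = FALSE)
parts <- list(part1, part2)  # later list elements should overwrite earlier ones

# Base R: flag duplicates scanning from the end, so the last occurrence is kept.
final <- do.call(rbind, parts)
final <- final[!duplicated(final$key, fromLast = TRUE), ]
final <- final[order(final$key), ]

# data.table: the same idea with unique(..., fromLast = TRUE).
DT <- rbindlist(parts)
DT <- unique(DT, by = "key", fromLast = TRUE)
setkey(DT, key)

print(final)
print(DT)
```

Both variants rely on the list order, so the parts have to be collected in the intended order. Collecting them with ls(pattern = ...) sorts the names alphabetically, which would put part10 before part2.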

Counting how many times an element occurs in the column of a data.frame

Let's say I have a data.frame with a factor.
d = data.frame(f = c("a","a","a","b","b","b","b","d","d"))
f
1 a
2 a
3 a
4 b
5 b
6 b
7 b
8 d
9 d
And I want to add a column telling me how many times an element occurs.
Like this
f n
1 a 3
2 a 3
3 a 3
4 b 4
5 b 4
6 b 4
7 b 4
8 d 2
9 d 2
How would I do this?
You can also use some plyr functions, join and ddply:
library(plyr)
d <- data.frame(f = c("a","a","a","b","b","b","b","d","d"))
d2 <- join(d, ddply(d, .(f), 'nrow'))
d2
f nrow
1 a 3
2 a 3
3 a 3
4 b 4
5 b 4
6 b 4
7 b 4
8 d 2
9 d 2
You can use table like this:
d$n <- table(d$f)[d$f]
# f n
#1 a 3
#2 a 3
#3 a 3
#4 b 4
#5 b 4
#6 b 4
#7 b 4
#8 d 2
#9 d 2
You can use ave and length:
> d$n <- as.numeric(ave(as.character(d$f), d$f, FUN = length))
> d
f n
1 a 3
2 a 3
3 a 3
4 b 4
5 b 4
6 b 4
7 b 4
8 d 2
9 d 2
With the "data.table" package, you might do something like:
library(data.table)
D <- data.table(d)
D[, n := as.numeric(.N), by = f]
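For completeness, a self-contained sketch of the data.table version with its result. .N is already an integer count, so the as.numeric() wrapper is optional; it is left out here as a simplification.

```r
library(data.table)

d <- data.frame(f = c("a", "a", "a", "b", "b", "b", "b", "d", "d"))

# as.data.table() copies d, so the original data frame is not modified.
D <- as.data.table(d)

# .N is the number of rows in the current group; := recycles it over the group.
D[, n := .N, by = f]

print(D)
```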

Multiple joins/merges with data.tables

I have two data.tables, DT and L:
> DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9,key="x")
> L=data.table(yv=c(1L:8L,12L),lu=c(letters[8:1],letters[12]),key="yv")
> DT
x y v
1: a 1 1
2: a 3 2
3: a 6 3
4: b 1 4
5: b 3 5
6: b 6 6
7: c 1 7
8: c 3 8
9: c 6 9
> L
yv lu
1: 1 h
2: 2 g
3: 3 f
4: 4 e
5: 5 d
6: 6 c
7: 7 b
8: 8 a
9: 12 l
I would like to independently look up the corresponding value of lu from L for column y and for column v in DT. The following syntax provides the correct result, but is cumbersome to generate and then understand at a glance later:
> L[setkey(L[setkey(DT,y)],v)][,list(x,y=yv.1,v=yv,lu.1=lu.1,lu.2=lu)]
x y v lu.1 lu.2
1: a 1 1 h h
2: a 2 3 g f
3: a 3 6 f c
4: b 4 1 e h
5: b 5 3 d f
6: b 6 6 c c
7: c 7 1 b h
8: c 8 3 a f
9: c 9 6 NA c
(Edit: original post had L[setkey(L[setkey(DT,y)],v)][,list(x,y=yv,v=yv.1,lu.1=lu,lu.2=lu.1)] above, which incorrectly mixed up the y and v columns and looked up values.)
In SQL this would be simple/straightforward:
SELECT DT.*, L1.lu AS lu1, L2.lu AS lu2
FROM DT
LEFT JOIN L AS L1 ON DT.y = L1.yv
LEFT JOIN L AS L2 ON DT.v = L2.yv
Is there a more elegant way to use data.table to perform multiple joins? Note that I'm joining one table to another table twice in this example, but I am also interested in joining one table to multiple different tables.
Great question. One trick is that i doesn't have to be keyed. Only x must be keyed.
There might be better ways. How about this:
> cbind( L[DT[,list(y)]], L[DT[,list(v)]], DT )
yv lu yv lu x y v
1: 1 h 1 h a 1 1
2: 3 f 2 g a 3 2
3: 6 c 3 f a 6 3
4: 1 h 4 e b 1 4
5: 3 f 5 d b 3 5
6: 6 c 6 c b 6 6
7: 1 h 7 b c 1 7
8: 3 f 8 a c 3 8
9: 6 c 9 NA c 6 9
or, to illustrate, this is the same :
> cbind( L[J(DT$y)], L[J(DT$v)], DT )
yv lu yv lu x y v
1: 1 h 1 h a 1 1
2: 3 f 2 g a 3 2
3: 6 c 3 f a 6 3
4: 1 h 4 e b 1 4
5: 3 f 5 d b 3 5
6: 6 c 6 c b 6 6
7: 1 h 7 b c 1 7
8: 3 f 8 a c 3 8
9: 6 c 9 NA c 6 9
merge could also be used, if the following feature request was implemented :
FR#2033 Add by.x and by.y to merge.data.table
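With the on= argument of more recent data.table versions, the two lookups can also be written as two update joins, which reads close to the SQL above. This is a sketch rather than part of the original answer; lu.1 and lu.2 follow the column names used in the question. As far as I know, merge() on data.tables has also gained by.x and by.y in the meantime.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9, key = "x")
L  <- data.table(yv = c(1L:8L, 12L), lu = c(letters[8:1], letters[12]), key = "yv")

# on = c(y = "yv") matches DT$y against L$yv; i.lu is the lu column of L.
# Rows of DT without a match keep NA, as in a LEFT JOIN.
DT[L, lu.1 := i.lu, on = c(y = "yv")]
DT[L, lu.2 := i.lu, on = c(v = "yv")]

print(DT)
```

Each lookup is one line, so joining against several different tables only means repeating the pattern with another table in place of L.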

Resources