Combine two data frames of different dimensions into one data frame - r

I am having trouble combining two data frames with different dimensions, each of which has a huge number of rows. Say my sample data frames are d and e, and the expected new data frame is de. I would like to pair every value in a row of d with every value in the same row of e, and collect those pairs in a new data frame (de). Any ideas or help would be really appreciated. Thanks
> d <- data.frame(v1 = c(1,3,5), v2 = c(2,4,6))
> d
v1 v2
1 1 2
2 3 4
3 5 6
> e <- data.frame(v1 = c(11, 14), v2 = c(12,15), v3=c(13,16))
> e
v1 v2 v3
1 11 12 13
2 14 15 16
> de <- data.frame(x = c(1,1,1,2,2,2,3,3,3,4,4,4), y = c(11,12,13,11,12,13,14,15,16,14,15,16))
> de
x y
1 1 11
2 1 12
3 1 13
4 2 11
5 2 12
6 2 13
7 3 14
8 3 15
9 3 16
10 4 14
11 4 15
12 4 16

One solution is to "melt" d and e into long format, then merge, then get rid of the extra columns. If you have very large datasets, data tables are much faster (no difference for this tiny dataset).
library(reshape2) # for melt(...)
library(data.table)
# add id column
d <- cbind(id=1:nrow(d),d)
e <- cbind(id=1:nrow(e),e)
# melt to long format
d.melt <- data.table(melt(d,id.vars="id"), key="id")
e.melt <- data.table(melt(e,id.vars="id"), key="id")
# data table join on id; keep only the paired values
# (value comes from d.melt, i.value is the value column of e.melt)
result <- d.melt[e.melt, .(x = value, y = i.value), allow.cartesian=TRUE]
setkey(result,x,y)
result
x y
1: 1 11
2: 1 12
3: 1 13
4: 2 11
5: 2 12
6: 2 13
7: 3 14
8: 3 15
9: 3 16
10: 4 14
11: 4 15
12: 4 16

If your data are numeric, as they are in this example, this is pretty straightforward in base R too. Conceptually this is the same as @jlhoward's answer: get your data into a long format, and merge:
merge(cbind(id = rownames(d), stack(d)),
cbind(id = rownames(e), stack(e)),
by = "id")[c("values.x", "values.y")]
# values.x values.y
# 1 1 11
# 2 1 12
# 3 1 13
# 4 2 11
# 5 2 12
# 6 2 13
# 7 3 14
# 8 3 15
# 9 3 16
# 10 4 14
# 11 4 15
# 12 4 16
Or, with the "reshape2" package:
merge(melt(as.matrix(d)),
melt(as.matrix(e)),
by = "Var1")[c("value.x", "value.y")]
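As a quick sanity check on the long-format-and-merge idea above, here is a minimal sketch that rebuilds d and e from the question and compares the base R result with the expected de. The helper name pair_rows is hypothetical (it is not part of any answer), and sorting the result before comparing is an assumption made so that row order does not matter.

```r
d <- data.frame(v1 = c(1, 3, 5), v2 = c(2, 4, 6))
e <- data.frame(v1 = c(11, 14), v2 = c(12, 15), v3 = c(13, 16))

# pair_rows() is a hypothetical helper wrapping the stack/merge idea
pair_rows <- function(a, b) {
  long_a <- cbind(id = rownames(a), stack(a))   # long format: id, values, ind
  long_b <- cbind(id = rownames(b), stack(b))
  out <- merge(long_a, long_b, by = "id")[c("values.x", "values.y")]
  names(out) <- c("x", "y")
  out <- out[order(out$x, out$y), ]             # sort so comparison ignores merge order
  rownames(out) <- NULL
  out
}

# expected result from the question
de <- data.frame(x = rep(c(1, 2, 3, 4), each = 3),
                 y = c(11, 12, 13, 11, 12, 13, 14, 15, 16, 14, 15, 16))

res <- pair_rows(d, e)
isTRUE(all.equal(res$x, de$x)) && isTRUE(all.equal(res$y, de$y))
```

Row 3 of d has no counterpart in e, so the inner merge drops it, which matches the expected de.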

Related

Create columns with different rules with data.table in r

I'm trying to better understand the data.table package in r. I want to do different types of calculation with some columns and assign the result to new columns with specific names. Here is an example:
set.seed(122)
df <- data.frame(rain = rep(5,10),temp=1:10, skip = sample(0:2,10,T),
windw_sz = sample(1:2,10,T),city =c(rep("a",5),rep("b",5)),ord=rep(sample(1:5,5),2))
df <- as.data.table(df)
vars <- c("rain","temp")
df[, paste0("mean.",vars) := lapply(mget(vars),mean), by="city" ]
This works just fine. But now I also want to calculate the sum of these variables, so I try:
df[, c(paste0("mean.",vars), paste("sum.",vars)) := list( lapply(mget(vars),mean),
lapply(mget(vars),sum)), by="city" ]
and I get an error.
How could I implement this last part?
Thanks a lot!
Instead of wrapping the two results in list, use c. Each lapply call already returns a list, so wrapping both in list gives a list of lists, whereas c concatenates the two lists end to end (compare c(as.list(1:5), as.list(6:10)) with list(as.list(1:5), as.list(6:10))). Also, instead of mget, make use of .SDcols:
library(data.table)
df[, paste0(rep(c("mean.", "sum."), each = 2), vars) :=
c(lapply(.SD, mean), lapply(.SD, sum)), by = .(city), .SDcols = vars]
df
# rain temp skip windw_sz city ord mean.rain mean.temp sum.rain sum.temp
# 1: 5 1 0 2 a 2 5 3 25 15
# 2: 5 2 1 1 a 5 5 3 25 15
# 3: 5 3 2 2 a 3 5 3 25 15
# 4: 5 4 2 1 a 4 5 3 25 15
# 5: 5 5 2 2 a 1 5 3 25 15
# 6: 5 6 0 1 b 2 5 8 25 40
# 7: 5 7 2 2 b 5 5 8 25 40
# 8: 5 8 1 2 b 3 5 8 25 40
# 9: 5 9 2 1 b 4 5 8 25 40
#10: 5 10 2 2 b 1 5 8 25 40
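The difference between list and c described in this answer can be seen without data.table at all. The following is only an illustrative sketch in base R; the small data frame and the object names nested and flat are made up for the example.

```r
vars <- c("rain", "temp")
df <- data.frame(rain = rep(5, 4), temp = 1:4)

means <- lapply(df[vars], mean)  # a list with one element per variable
sums  <- lapply(df[vars], sum)   # likewise

nested <- list(means, sums)  # 2 elements, each of them a list
flat   <- c(means, sums)     # 4 elements, one per new column

length(nested)
length(flat)

# name the flat list the same way the new columns are named in the answer
names(flat) <- paste0(rep(c("mean.", "sum."), each = length(vars)), vars)
str(flat)
```

`:=` with several left-hand-side names expects one list element per new column, which is what the flat form provides.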

How can I stack columns per x columns in R

I'm looking to transform a data frame of 660 columns into 3 columns just by stacking them on each other per 3 columns without manually re-arranging (since I have 660 columns).
In a small scale example per 2 columns with just 4 columns, I want to go from
A B C D
1 4 7 10
2 5 8 11
3 6 9 12
to
A B
1 4
2 5
3 6
7 10
8 11
9 12
Thanks
reshape to the rescue:
reshape(df, direction="long", varying=split(names(df), rep(seq_len(ncol(df)/2), 2)))
# time A B id
#1.1 1 1 4 1
#2.1 1 2 5 2
#3.1 1 3 6 3
#1.2 2 7 10 1
#2.2 2 8 11 2
#3.2 2 9 12 3
rbind.data.frame requires that all columns match up. So use setNames to replace the names of the C:D columns:
rbind( dat[1:2], setNames(dat[3:4], names(dat[1:2])) )
A B
1 1 4
2 2 5
3 3 6
4 7 10
5 8 11
6 9 12
To generalize that to multiple columns use do.call and lapply:
dat <- setNames( as.data.frame( matrix(1:36, ncol=12) ), LETTERS[1:12])
dat
#----
A B C D E F G H I J K L
1 1 4 7 10 13 16 19 22 25 28 31 34
2 2 5 8 11 14 17 20 23 26 29 32 35
3 3 6 9 12 15 18 21 24 27 30 33 36
do.call( rbind, lapply( seq(1,12, by=3), function(x) setNames(dat[x:(x+2)], LETTERS[1:3]) ))
A B C
1 1 4 7
2 2 5 8
3 3 6 9
4 10 13 16
5 11 14 17
6 12 15 18
7 19 22 25
8 20 23 26
9 21 24 27
10 28 31 34
11 29 32 35
12 30 33 36
The 12 would be replaced by 660 and everything else should work.
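One way to avoid hard-coding the 12 and the 3 is to wrap the same do.call/lapply idea in a small function. This is only a sketch; stack_every is a hypothetical name, and it assumes the number of columns is an exact multiple of k and that the first k column names should be reused for the result.

```r
# stack_every() is a hypothetical helper: stack a data frame k columns at a time
stack_every <- function(dat, k) {
  stopifnot(ncol(dat) %% k == 0)          # assumption: columns divide evenly into blocks
  starts <- seq(1, ncol(dat), by = k)     # first column index of each block
  target <- names(dat)[seq_len(k)]        # reuse the first block's names
  pieces <- lapply(starts, function(s) setNames(dat[s:(s + k - 1)], target))
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

dat <- setNames(as.data.frame(matrix(1:36, ncol = 12)), LETTERS[1:12])
res <- stack_every(dat, 3)
dim(res)
```

For the 660-column case the call would be stack_every(dat, 3), with no other changes.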
A classical split-apply-combine approach will scale flexibly:
as.data.frame(lapply(split(unclass(df),
names(df)[seq(ncol(df) / 2)]),
unlist, use.names = FALSE))
## A B
## 1 1 4
## 2 2 5
## 3 3 6
## 4 7 10
## 5 8 11
## 6 9 12
or with a hint of purrr,
library(purrr)
df %>% unclass() %>% # convert to non-data.frame list
split(names(.)[seq(length(.) / 2)]) %>% # split columns by indexed names
map_df(simplify) # simplify each split to vector, coerce back to data.frame
## # A tibble: 6 × 2
## A B
## <int> <int>
## 1 1 4
## 2 2 5
## 3 3 6
## 4 7 10
## 5 8 11
## 6 9 12
Here is another base R option
i1 <- c(TRUE, FALSE)
`row.names<-`(data.frame(A= unlist(df1[i1]), B = unlist(df1[!i1])), NULL)
# A B
#1 1 4
#2 2 5
#3 3 6
#4 7 10
#5 8 11
#6 9 12
Or another option is melt from data.table
library(data.table)
i1 <- seq(1, ncol(df1), by = 2)
i2 <- seq(2, ncol(df1), by = 2)
melt(setDT(df1), measure = list(i1, i2), value.name = c("A", "B"))
rbindlist from data.table package can also be used for the task and seems to be much more efficient.
# EXAMPLE DATA
df1 <- read.table(text = '
Col1 Col2 Col3 Col4
1 2 3 4
5 6 7 8
1 2 3 4
5 6 7 8', header = TRUE)
library(data.table)
library(microbenchmark)
library(purrr)
microbenchmark(
Map = as.data.frame(Map(c, df1[,1:2], df1[, 3:4])),
Reshape = reshape(df1, direction="long", varying=split(names(df1), rep(seq_len(ncol(df1)/2), 2))),
Purr = df1 %>% unclass() %>% # convert to non-data.frame list
split(names(.)[seq(length(.) / 2)]) %>% # split columns by indexed names
map_df(simplify),
DataTable = rbindlist(list(df1[,1:2], df1[, 3:4])),
Mapply = data.frame(mapply(c, df1[,1:2], df1[, 3:4], SIMPLIFY=FALSE)),
Rbind = rbind(df1[, 1:2],setNames(df1[, 3:4],names(df1[,1:2])))
)
The results are:
Unit: microseconds
expr min lq mean median uq max neval cld
Map 214.724 232.9380 246.2936 244.1240 255.9240 343.611 100 bc
Reshape 716.962 739.8940 778.7912 749.7550 767.6725 2834.192 100 e
Purr 309.559 324.6545 339.2973 334.0440 343.4290 551.746 100 d
DataTable 98.228 111.6080 122.7753 119.2320 129.2640 189.614 100 a
Mapply 233.577 258.2605 271.1881 270.7895 281.6305 339.291 100 c
Rbind 206.001 221.1515 228.5956 226.6850 235.2670 283.957 100 b

How to delete duplicates but keep most recent data in R

I have the following two data frames:
df1 = data.frame(names=c('a','b','c','c','d'),year=c(11,12,13,14,15), Times=c(1,1,3,5,6))
df2 = data.frame(names=c('a','e','e','c','c','d'),year=c(12,12,13,15,16,16), Times=c(2,2,4,6,7,7))
I would like to know how I could merge the two data frames above while keeping only the most recent Times for each name, based on the year. It should look like this:
Names Year Times
a 12 2
b 12 1
c 16 7
d 16 7
e 13 4
I'm guessing that you do not mean to merge these but rather combine by stacking. Your question is ambiguous since the "duplication" could occur at the dataframe level or at the vector level. Your example does not display any duplication at the dataframe level but would at the vector level. The best way to describe the problem is that you want the last (or max) Times entry within each group of names values:
> df1
names year Times
1 a 11 1
2 b 12 1
3 c 13 3
4 c 14 5
5 d 15 6
> df2
names year Times
1 a 12 2
2 e 12 2
3 e 13 4
4 c 15 6
5 c 16 7
6 d 16 7
> dfr <- rbind(df1,df2)
> dfr <-dfr[order(dfr$Times),]
> dfr[!duplicated(dfr, fromLast=TRUE) , ]
names year Times
1 a 11 1
2 b 12 1
6 a 12 2
7 e 12 2
3 c 13 3
8 e 13 4
4 c 14 5
5 d 15 6
9 c 15 6
10 c 16 7
11 d 16 7
> dfr[!duplicated(dfr$names, fromLast=TRUE) , ]
names year Times
2 b 12 1
6 a 12 2
8 e 13 4
10 c 16 7
11 d 16 7
This uses base R functions; there are also newer packages (such as plyr) that many feel make the split-apply-combine process more intuitive.
df <- rbind(df1, df2)
do.call(rbind, lapply(split(df, df$names), function(x) x[which.max(x$year), ]))
## names year Times
## a a 12 2
## b b 12 1
## c c 16 7
## d d 16 7
## e e 13 4
We could also use aggregate:
df <- rbind(df1,df2)
aggregate(cbind(df$year,df$Times)~df$names,df,max)
# df$names V1 V2
# 1 a 12 2
# 2 b 12 1
# 3 c 16 7
# 4 d 16 7
# 5 e 13 4
In case you wanted to see a data.table solution,
# load library
library(data.table)
# bind by row and convert to data.table (by reference)
df <- setDT(rbind(df1, df2))
# get the result
df[order(names, year), .SD[.N], by=.(names)]
The output is as follows:
names year Times
1: a 12 2
2: b 12 1
3: c 16 7
4: d 16 7
5: e 13 4
The final line orders the row-bound data by names and year, and then chooses the last observation (.SD[.N]) for each name.
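The same order-then-take-last idea can also be written in base R. This is a minimal sketch rather than a replacement for the answers above; the object names both and latest are made up for the example.

```r
df1 <- data.frame(names = c("a", "b", "c", "c", "d"),
                  year  = c(11, 12, 13, 14, 15),
                  Times = c(1, 1, 3, 5, 6))
df2 <- data.frame(names = c("a", "e", "e", "c", "c", "d"),
                  year  = c(12, 12, 13, 15, 16, 16),
                  Times = c(2, 2, 4, 6, 7, 7))

both <- rbind(df1, df2)
both <- both[order(both$names, both$year), ]                 # most recent year last within each name
latest <- both[!duplicated(both$names, fromLast = TRUE), ]   # keep the last row per name
rownames(latest) <- NULL
latest
```

Ordering by year rather than by Times keeps the row with the most recent year even if Times does not increase with year.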

How to Deal with Textual Data?

In R, you have a certain data frame with textual data, e.g. the second column has words instead of numbers. How can you remove the rows of the data frame with a certain word (e.g. "total") in the second column? data <- data[-(data[,2] == "total"),] does not work for me.
Besides, is there an easy way to convert these words sequentially into numbers? (I.e., first word becomes 1, second appeared word becomes 2, and so on.) I would rather not use a loop...
You can use ! to negate. For the sequence, use either seq_along or as.numeric(factor(.)) depending on what you are actually looking for.
Here's some sample data:
set.seed(1)
mydf <- data.frame(V1 = 1:15, V2 = sample(LETTERS[1:3], 15, TRUE))
mydf
# V1 V2
# 1 1 A
# 2 2 B
# 3 3 B
# 4 4 C
# 5 5 A
# 6 6 C
# 7 7 C
# 8 8 B
# 9 9 B
# 10 10 A
# 11 11 A
# 12 12 A
# 13 13 C
# 14 14 B
# 15 15 C
Let's remove any rows where there is an "A" in column "V2":
mydf2 <- mydf[!mydf$V2 == "A", ]
mydf2
# V1 V2
# 2 2 B
# 3 3 B
# 4 4 C
# 6 6 C
# 7 7 C
# 8 8 B
# 9 9 B
# 13 13 C
# 14 14 B
# 15 15 C
Now, let's create two new columns. The first sequentially counts each occurrence of each "word" in column "V2". The second converts each unique "word" into a number.
mydf2$Seq <- ave(as.character(mydf2$V2), mydf2$V2, FUN = seq_along)
mydf2$WordAsNum <- as.numeric(factor(mydf2$V2))
mydf2
# V1 V2 Seq WordAsNum
# 2 2 B 1 1
# 3 3 B 2 1
# 4 4 C 1 2
# 6 6 C 2 2
# 7 7 C 3 2
# 8 8 B 3 1
# 9 9 B 4 1
# 13 13 C 4 2
# 14 14 B 5 1
# 15 15 C 5 2
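To see why the attempt in the question, data[-(data[,2] == "total"),], does not behave as intended, it helps to look at what the minus sign does to a logical vector. The sketch below uses a small made-up data frame. The last line uses match() with unique(), which is one possible way (an assumption about what the asker wants) to number words in order of first appearance.

```r
mydf <- data.frame(V1 = 1:5,
                   V2 = c("a", "total", "b", "total", "c"),
                   stringsAsFactors = FALSE)

is_total <- mydf$V2 == "total"

# Negating with "-" coerces TRUE/FALSE to 1/0, giving 0 -1 0 -1 0.
# As row indices, the zeros are ignored and -1 drops only the first row.
-is_total
bad <- mydf[-is_total, ]

# "!" flips the logical vector, so only non-"total" rows are kept
good <- mydf[!is_total, ]

# number each word by order of first appearance
good$WordAsNum <- match(good$V2, unique(good$V2))
good
```

Unlike as.numeric(factor(.)), which numbers words by sorted level order, match() against unique() follows the order in which the words first occur.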

Maintaining order in split-apply-combine problems [duplicate]

Possible Duplicate:
How to ddply() without sorting?
I have the following data frame
dd1 = data.frame(cond = c("D","A","C","B","A","B","D","C"), val = c(11,7,9,4,3,0,5,2))
dd1
cond val
1 D 11
2 A 7
3 C 9
4 B 4
5 A 3
6 B 0
7 D 5
8 C 2
and now need to compute cumulative sums within each level of the factor cond. The result should look like this:
> dd2 = data.frame(cond = c("D","A","C","B","A","B","D","C"), val = c(11,7,9,4,3,0,5,2), cumsum=c(11,7,9,4,10,4,16,11))
> dd2
cond val cumsum
1 D 11 11
2 A 7 7
3 C 9 9
4 B 4 4
5 A 3 10
6 B 0 4
7 D 5 16
8 C 2 11
It is important to receive the result data frame in the same order as the input data frame because there are other variables bound to that.
I tried ddply(dd1, .(cond), summarize, cumsum = cumsum(val)) but it didn't produce the result I expected.
Thanks
Use ave instead.
dd1$cumsum <- ave(dd1$val, dd1$cond, FUN=cumsum)
If doing this by hand is an option, then split() and unsplit() with a suitable lapply() in between will do this for you.
dds <- split(dd1, dd1$cond)
dds <- lapply(dds, function(x) transform(x, cumsum = cumsum(x$val)))
unsplit(dds, dd1$cond)
The last line gives
> unsplit(dds, dd1$cond)
cond val cumsum
1 D 11 11
2 A 7 7
3 C 9 9
4 B 4 4
5 A 3 10
6 B 0 4
7 D 5 16
8 C 2 11
I separated the three steps, but these could be strung together or placed in a function if you are doing a lot of this.
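As a rough illustration of placing the three steps in a function, here is one possible sketch. The name group_cumsum and its arguments are hypothetical, and the final line simply checks the result against the ave() approach shown earlier.

```r
dd1 <- data.frame(cond = c("D", "A", "C", "B", "A", "B", "D", "C"),
                  val  = c(11, 7, 9, 4, 3, 0, 5, 2))

# group_cumsum() is a hypothetical wrapper around split / lapply / unsplit
group_cumsum <- function(dat, group, value) {
  pieces <- split(dat, dat[[group]])            # one data frame per group
  pieces <- lapply(pieces, function(x) {
    x$cumsum <- cumsum(x[[value]])              # running total within the group
    x
  })
  unsplit(pieces, dat[[group]])                 # reassemble in the original row order
}

res <- group_cumsum(dd1, "cond", "val")
res
isTRUE(all.equal(res$cumsum, ave(dd1$val, dd1$cond, FUN = cumsum)))
```

Because unsplit() uses the same grouping vector that split() was given, the rows come back in their input order.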
A data.table solution:
require(data.table)
dt <- data.table(dd1)
dt[, c.val := cumsum(val),by=cond]
> dt
# cond val c.val
# 1: D 11 11
# 2: A 7 7
# 3: C 9 9
# 4: B 4 4
# 5: A 3 10
# 6: B 0 4
# 7: D 5 16
# 8: C 2 11

Resources