I'm trying to combine data frames (hundreds of them), but they have different numbers of rows.
df1 <- data.frame(c(7,5,3,4,5), c(43,56,23,78,89))
df2 <- data.frame(c(7,5,3,4,5,8,5), c(43,56,23,78,89,45,78))
df3 <- data.frame(c(7,5,3,4,5,8,5,6,7), c(43,56,23,78,89,45,78,56,67))
colnames(df1) <- c("xVar1","xVar2")
colnames(df2) <- c("yVar1","yVar2")
colnames(df3) <- c("zVar1","zVar2")
a1 <- list(df1,df2,df3)
a1 is what my initial data actually looks like when I get it.
Now if I do:
b1 <- as.data.frame(a1)
I get an error, because the # of rows is not the same in the data (this would work fine if the # of rows was the same).
How do I make the # of rows equal or work around this issue?
I would like to be able to merge the data in this way (here is a working example with the same # of rows):
df1b <- data.frame(c(7,5,3,4,5), c(43,56,23,78,89))
df2b <- data.frame(c(7,5,3,4,6), c(43,56,24,48,89))
df3b <- data.frame(c(7,5,3,4,5), c(43,56,23,78,89))
colnames(df1b) <- c("xVar1","xVar2")
colnames(df2b) <- c("yVar1","yVar2")
colnames(df3b) <- c("zVar1","zVar2")
a2 <- list(df1b,df2b,df3b)
b2 <- as.data.frame(a2)
Thanks!
cbind.fill from rowr provides functionality for this and fills missing elements with NA:
library(purrr)
library(rowr)
b1 <- purrr::reduce(a1,cbind.fill,fill=NA)
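If rowr is not available, the same padding idea can be sketched in base R. This is only an illustration and not how rowr implements cbind.fill; the names max_rows and padded are introduced here. It relies on the fact that indexing a data frame past its last row yields all-NA rows.

```r
# Rebuild the question's example list
df1 <- data.frame(xVar1 = c(7,5,3,4,5), xVar2 = c(43,56,23,78,89))
df2 <- data.frame(yVar1 = c(7,5,3,4,5,8,5), yVar2 = c(43,56,23,78,89,45,78))
df3 <- data.frame(zVar1 = c(7,5,3,4,5,8,5,6,7),
                  zVar2 = c(43,56,23,78,89,45,78,56,67))
a1 <- list(df1, df2, df3)

# Largest row count across all frames
max_rows <- max(vapply(a1, nrow, integer(1)))

# Pad every frame to max_rows; out-of-range row indices become NA rows
padded <- lapply(a1, function(d) {
  d <- d[seq_len(max_rows), , drop = FALSE]
  rownames(d) <- NULL
  d
})

# All frames now have the same number of rows, so cbind works
b1 <- do.call(cbind, padded)
b1
```

Because the padding is done per frame inside lapply, the same code should work for a list of hundreds of data frames.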
One can add a key (row count as variable value in this case) to each dataframe then merge by the key.
# get list of dfs (should prob import data into a list of dfs instead)
list_df<-mget(ls(pattern = "df[0-9]"))
#add newcolumn -- "key"
list_df<-lapply(list_df, function(df, newcol) {
df[[newcol]]<-seq(nrow(df))
return(df)
}, "key")
#merge function
MergeAllf <- function(x, y){
merge(x, y, by = "key", all = TRUE) # full outer join on "key"
}
#pass list to merge funct
library(tidyverse)
data <- Reduce(MergeAllf, list_df)%>%
select(key, everything())#reorder or can drop "key"
data
key xVar1 xVar2 yVar1 yVar2 zVar1 zVar2
1 1 7 43 7 43 7 43
2 2 5 56 5 56 5 56
3 3 3 23 3 23 3 23
4 4 4 78 4 78 4 78
5 5 5 89 5 89 5 89
6 6 NA NA 8 45 8 45
7 7 NA NA 5 78 5 78
8 8 NA NA NA NA 6 56
9 9 NA NA NA NA 7 67
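The key-and-merge idea can also be applied directly to the list a1 from the question, without collecting objects via mget and without tidyverse. This is a minimal base R sketch; keyed and merged are names chosen here for illustration.

```r
df1 <- data.frame(xVar1 = c(7,5,3,4,5), xVar2 = c(43,56,23,78,89))
df2 <- data.frame(yVar1 = c(7,5,3,4,5,8,5), yVar2 = c(43,56,23,78,89,45,78))
df3 <- data.frame(zVar1 = c(7,5,3,4,5,8,5,6,7),
                  zVar2 = c(43,56,23,78,89,45,78,56,67))
a1 <- list(df1, df2, df3)

# Prepend a row-number key to every frame
keyed <- lapply(a1, function(d) cbind(key = seq_len(nrow(d)), d))

# Full outer join of all frames on the key; missing rows are filled with NA
merged <- Reduce(function(x, y) merge(x, y, by = "key", all = TRUE), keyed)
merged
```

The key column can be dropped afterwards with merged$key <- NULL if it is not needed.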
Solution 1
You can stack the data frames by row with rbindlist(). Columns are matched by position here (use.names = FALSE), and the result takes its column names from the first data frame in the list:
library(data.table)
b1 = data.frame(rbindlist(a1, use.names = FALSE))
> b1
xVar1 xVar2
1 7 43
2 5 56
3 3 23
4 4 78
5 5 89
6 7 43
7 5 56
8 3 23
9 4 78
10 5 89
11 8 45
12 5 78
13 7 43
14 5 56
15 3 23
16 4 78
17 5 89
18 8 45
19 5 78
20 6 56
21 7 67
Solution 2
Alternatively, you can make all the columns have the same name, then bind by row:
b1 = lapply(a1, setNames, c("Var1","Var2"))
Now you can bind by rows:
b1 = do.call(dplyr::bind_rows, b1)
> b1
Var1 Var2
1 7 43
2 5 56
3 3 23
4 4 78
5 5 89
6 7 43
7 5 56
8 3 23
9 4 78
10 5 89
11 8 45
12 5 78
13 7 43
14 5 56
15 3 23
16 4 78
17 5 89
18 8 45
19 5 78
20 6 56
21 7 67
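After binding by row, the result no longer shows which data frame each row came from. If that matters, a source index can be added before binding. The following base R sketch assumes the list a1 from the question; the column name source and the variable renamed are introduced here for illustration.

```r
df1 <- data.frame(xVar1 = c(7,5,3,4,5), xVar2 = c(43,56,23,78,89))
df2 <- data.frame(yVar1 = c(7,5,3,4,5,8,5), yVar2 = c(43,56,23,78,89,45,78))
df3 <- data.frame(zVar1 = c(7,5,3,4,5,8,5,6,7),
                  zVar2 = c(43,56,23,78,89,45,78,56,67))
a1 <- list(df1, df2, df3)

# Give every frame the same column names and record its position in the list
renamed <- lapply(seq_along(a1), function(i) {
  cbind(source = i, setNames(a1[[i]], c("Var1", "Var2")))
})

# Stack all frames by row
b1 <- do.call(rbind, renamed)
table(b1$source)
```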
Related
Let the following be the dataset:
I need to create new columns by multiplying every a column with every b column, naming the new columns
a1_b1, a1_b2, ..., a1_b4, a2_b1, a2_b2, and so on, as shown in the figure.
I am using R for data analysis. Although the example has only two columns by two columns, the real data is 1600 by 25, hence the question.
This might be fast enough:
set.seed(42)
DF <- data.frame(a1 = sample(1:10),
a2 = sample(1:10),
b1 = sample(1:10),
b2 = sample(1:10))
a <- grep("a", names(DF))
b <- grep("b", names(DF))
combs <- expand.grid(a, b)
res <- do.call(mapply, c(list(FUN = \(...) do.call(`*`, DF[, c(...)])), combs))
colnames(res) <- paste(names(DF)[combs[[1]]], names(DF)[combs[[2]]], sep = "_")
cbind(DF, res)
# a1 a2 b1 b2 a1_b1 a2_b1 a1_b2 a2_b2
#1 1 8 9 3 9 72 3 24
#2 5 7 10 1 50 70 5 7
#3 10 4 3 2 30 12 20 8
#4 8 1 4 6 32 4 48 6
#5 2 5 5 10 10 25 20 50
#6 4 10 6 8 24 60 32 80
#7 6 2 1 4 6 2 24 8
#8 9 6 2 5 18 12 45 30
#9 7 9 8 7 56 72 49 63
#10 3 3 7 9 21 21 27 27
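The same pairwise products can be written a little more explicitly by working with column names instead of positions. This is a sketch under the assumption that the a and b columns can be picked out by the prefixes "^a" and "^b"; a_cols, b_cols, pairs and prods are names introduced here.

```r
set.seed(42)
DF <- data.frame(a1 = sample(1:10),
                 a2 = sample(1:10),
                 b1 = sample(1:10),
                 b2 = sample(1:10))

# Column names of the two groups
a_cols <- grep("^a", names(DF), value = TRUE)
b_cols <- grep("^b", names(DF), value = TRUE)

# Every (a, b) pair; the first column of expand.grid varies fastest
pairs <- expand.grid(a = a_cols, b = b_cols, stringsAsFactors = FALSE)

# One product vector per pair, named a_b
prods <- Map(function(a, b) DF[[a]] * DF[[b]], pairs$a, pairs$b)
names(prods) <- paste(pairs$a, pairs$b, sep = "_")

res <- cbind(DF, as.data.frame(prods))
names(res)
```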
The operation in the question is the transpose of the Khatri-Rao product. We use the Matrix package, which comes with R, so it does not have to be installed. Using the input in the Note at the end,
pick out the two portions, transpose them, apply KhatriRao and transpose back, giving a sparse matrix (class "dgCMatrix"). We can use as.matrix to convert to a dense matrix as shown, or as.data.frame(as.matrix(...)) to convert to a data.frame.
library(Matrix)
rownames(dat) <- 1:nrow(dat)
ix <- grep("a", colnames(dat))
as.matrix(t(KhatriRao(t(dat[, -ix]), t(dat[, ix]), make.dimnames = TRUE)))
giving:
a1:b1 a2:b1 a1:b2 a2:b2
1 101 838.3 108.3 898.89
2 204 1050.6 220.6 1136.09
3 309 1957.0 357.0 2261.00
4 416 1664.0 464.0 1856.00
5 525 1638.0 578.0 1803.36
6 749 2118.6 838.6 2372.04
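To see that this really is the column-by-column product, the result can be compared with the products computed by hand. This is a small check, assuming the same dat as in the Note; kr and manual are names introduced here.

```r
library(Matrix)

dat <- setNames(cbind(BOD, BOD + 100), c("a1", "a2", "b1", "b2"))
ix <- grep("a", colnames(dat))

# Transposed Khatri-Rao product of the b block and the a block
kr <- as.matrix(t(KhatriRao(t(dat[, -ix]), t(dat[, ix]))))

# The same four columns computed directly, in the same order
manual <- cbind(dat$a1 * dat$b1, dat$a2 * dat$b1,
                dat$a1 * dat$b2, dat$a2 * dat$b2)

all.equal(unname(kr), manual)
```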
Note
dat <- setNames(cbind(BOD, BOD + 100), c("a1", "a2", "b1", "b2"))
dat
giving
a1 a2 b1 b2
1 1 8.3 101 108.3
2 2 10.3 102 110.3
3 3 19.0 103 119.0
4 4 16.0 104 116.0
5 5 15.6 105 115.6
6 7 19.8 107 119.8
Suppose there are two dataframes as follows with same column names and I want to combine/concatenate one after the other without merging the common columns. There is a way of assigning it columnwise like df1[3]<-df2[1] but would like to know if there's some other way.
df1<-data.frame(A=c(1:10), B=c(2:5, rep(NA,6)))
df2<-data.frame(A=c(12:20), B=c(32:40))
Expected Output:
A B A.1 B.1
1 2 12 32
2 3 13 33
3 4 14 34
4 5 15 35
5 NA 16 36
6 NA 17 37
7 NA 18 38
8 NA 19 39
9 NA 20 40
10 NA NA NA
I tend to work with multiple frames like this as a list of frames. Try this:
LOF <- list(df1, df2)
maxrows <- max(sapply(LOF, nrow))
out <- do.call(cbind, lapply(LOF, function(z) z[seq_len(maxrows),]))
names(out) <- make.names(names(out), unique = TRUE)
out
# A B A.1 B.1
# 1 1 2 12 32
# 2 2 3 13 33
# 3 3 4 14 34
# 4 4 5 15 35
# 5 5 NA 16 36
# 6 6 NA 17 37
# 7 7 NA 18 38
# 8 8 NA 19 39
# 9 9 NA 20 40
# 10 10 NA NA NA
One advantage of this is that it allows you to work with an arbitrary number of frames, not just two.
One base R way could be
setNames(Reduce(cbind.data.frame,
Map(`length<-`, c(df1, df2), max(nrow(df1), nrow(df2)))),
paste0(names(df1), rep(c('', '.1'), each=2)))
# A B A.1 B.1
# 1 1 2 12 32
# 2 2 3 13 33
# 3 3 4 14 34
# 4 4 5 15 35
# 5 5 NA 16 36
# 6 6 NA 17 37
# 7 7 NA 18 38
# 8 8 NA 19 39
# 9 9 NA 20 40
# 10 10 NA NA NA
Another option is to use the merge function. The documentation can be a bit cryptic, so here is a short explanation of the arguments:
by = 0 -- "the name "row.names" or the number 0 specifies the row names"
all = TRUE -- keeps all rows from both data frames
suffixes -- how the duplicated column names are distinguished
sort = FALSE -- do not sort the result on the by column
Note that the result also contains an extra Row.names column.
merge(df1, df2, by = 0, all = TRUE, suffixes = c('', '.1'), sort = FALSE)
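To get from the merge result to the expected output, the Row.names column has to be used for ordering and then dropped. A minimal sketch of that cleanup, assuming the df1 and df2 from the question (m is a name introduced here):

```r
df1 <- data.frame(A = 1:10, B = c(2:5, rep(NA, 6)))
df2 <- data.frame(A = 12:20, B = 32:40)

# Full join on row names; duplicated column names get the suffixes
m <- merge(df1, df2, by = 0, all = TRUE, suffixes = c("", ".1"), sort = FALSE)

# Row names are stored as text, so convert before ordering
m <- m[order(as.integer(as.character(m$Row.names))), ]

# Drop the helper column and reset the row names
m$Row.names <- NULL
rownames(m) <- NULL
m
```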
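One way would be the following. Note that rbind() appends only a single row of NAs here, so this works as written only when df2 is exactly one row shorter than df1, and the result keeps the duplicated column names A and B: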
cbind(
df1,
rbind(
df2,
rep(NA, nrow(df1) - nrow(df2))
)
)
I have several separate data frames that I would like to keep separate, because merging them together would create a very large object.
However, there are variables from another data frame that I would like to merge with all of them now.
Here is an example of what I would like to do:
df1 <- data.frame(ID1 = c(1:10), Var1 = rep(c(1,0),5))
df2 <- data.frame(ID1 = c(1:10), Var2 = c(21:30))
dfs <- Filter(function(x) is(x, "data.frame"), mget(ls()))
mergewith <- data.frame(ID1 = c(1:10), ID2 = c(41:50))
My goal is that df1 and df2 will look like this:
df1
ID1 Var1 ID2
1 1 1 41
2 2 0 42
3 3 1 43
4 4 0 44
5 5 1 45
6 6 0 46
7 7 1 47
8 8 0 48
9 9 1 49
10 10 0 50
df2
ID1 Var2 ID2
1 1 21 41
2 2 22 42
3 3 23 43
4 4 24 44
5 5 25 45
6 6 26 46
7 7 27 47
8 8 28 48
9 9 29 49
10 10 30 50
What I have tried so far is:
dat = lapply(dfs,function(x){
merge(names(x), mergewith, by = "ID1");x})
list2env(dat,.GlobalEnv)
However, then I get the following message:
"'by' must specify a uniquely valid column"
Is it possible to do this without using a loop?
You can try Map
> Map(function(x, y) merge(x, y, by = "ID1"), dfs, list(mergewith))
[[1]]
ID1 Var1 ID2
1 1 1 41
2 2 0 42
3 3 1 43
4 4 0 44
5 5 1 45
6 6 0 46
7 7 1 47
8 8 0 48
9 9 1 49
10 10 0 50
[[2]]
ID1 Var2 ID2
1 1 21 41
2 2 22 42
3 3 23 43
4 4 24 44
5 5 25 45
6 6 26 46
7 7 27 47
8 8 28 48
9 9 29 49
10 10 30 50
You can use lapply to merge all the dataframes in dfs with mergewith. Use list2env to get the changed dataframes in the global environment.
list2env(lapply(dfs, function(x) merge(x, mergewith, by = 'ID1')), .GlobalEnv)
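For a self-contained illustration, the list can also be built explicitly instead of with mget(ls()), which picks up every data frame in the workspace. In this sketch the merged frames are written into a fresh environment called target (a name introduced here) rather than the global environment, so nothing is overwritten while experimenting.

```r
df1 <- data.frame(ID1 = 1:10, Var1 = rep(c(1, 0), 5))
df2 <- data.frame(ID1 = 1:10, Var2 = 21:30)
mergewith <- data.frame(ID1 = 1:10, ID2 = 41:50)

# Named list, so the names can be used when writing the results back
dfs <- list(df1 = df1, df2 = df2)

# Merge each frame with mergewith; extra arguments are passed on to merge()
merged <- lapply(dfs, merge, y = mergewith, by = "ID1")

# Write the results into an environment under their original names
target <- new.env()
list2env(merged, envir = target)
target$df1
```

Replacing target with .GlobalEnv gives the behaviour of the one-liner above.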
Say I have a data frame with 3 columns of data (a,b,c) and 1 column of categories with multiple instances of each category (class).
set.seed(273)
a <- floor(runif(20,0,100))
b <- floor(runif(20,0,100))
c <- floor(runif(20,0,100))
class <- floor(runif(20,0,6))
df1 <- data.frame(a,b,c,class)
print(df1)
a b c class
1 31 73 28 3
2 44 33 57 3
3 19 35 53 0
4 68 70 39 4
5 92 7 57 2
6 13 67 23 3
7 73 50 14 2
8 59 14 91 5
9 37 3 72 5
10 27 3 13 4
11 63 28 0 5
12 51 7 35 4
13 11 36 76 3
14 72 25 8 5
15 23 24 6 3
16 15 1 16 5
17 55 24 5 5
18 2 54 39 1
19 54 95 20 3
20 60 39 65 1
And I have another data frame with the same 3 columns of data and category column, however this only has one instance per category (class).
a <- floor(runif(6,0,20))
b <- floor(runif(6,0,20))
c <- floor(runif(6,0,20))
class <- seq(0,5)
df2 <- data.frame(a,b,c,class)
print(df2)
a b c class
1 8 15 13 0
2 0 3 6 1
3 14 4 0 2
4 7 10 6 3
5 18 18 16 4
6 17 17 11 5
How do I subset the first data frame so that it keeps only rows where a, b, and c are all greater than the corresponding values in the second data frame for that class? For example, I only want rows where class == 0 if a > 8 & b > 15 & c > 13.
Note that I don't want to join the data frames, as the second data frame holds the lowest acceptable values for the first data frame.
As commented by Frank this can be done with non-equi joins.
# coerce to data.table
tmp <- setDT(df1)[
# non-equi join to find which rows of df1 fulfill conditions in df2
setDT(df2), on = .(class, a > a, b > b, c > c), nomatch = 0L, which = TRUE]
# return subset in original order of df1
df1[sort(tmp)]
a b c class
1: 31 73 28 3
2: 44 33 57 3
3: 19 35 53 0
4: 68 70 39 4
5: 92 7 57 2
6: 13 67 23 3
7: 73 50 14 2
8: 11 36 76 3
9: 2 54 39 1
10: 54 95 20 3
11: 60 39 65 1
The parameter which = TRUE returns a vector of the matching row numbers instead of the joined data set. This saves us from creating a row id column before the join. (Credit to @Frank for reminding me of the which parameter!)
Note that there is no row in df1 which fulfills the condition for class == 5 in df2. Therefore, the parameter nomatch = 0L is used to exclude non-matching rows from the result.
This can be put together in a "one-liner":
setDT(df1)[sort(df1[setDT(df2), on = .(class, a > a, b > b, c > c), nomatch = 0L, which = TRUE])]
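The same filter can be sketched in base R by looking up each row's thresholds with match() and comparing column by column. This is an alternative to the non-equi join, not a description of what data.table does internally. The small df1 below uses four rows taken from the question's data; idx and keep are names introduced here.

```r
# Four rows from the question's df1 and the full df2 of thresholds
df1 <- data.frame(a = c(31, 19, 59, 2),
                  b = c(73, 35, 14, 54),
                  c = c(28, 53, 91, 39),
                  class = c(3, 0, 5, 1))
df2 <- data.frame(a = c(8, 0, 14, 7, 18, 17),
                  b = c(15, 3, 4, 10, 18, 17),
                  c = c(13, 6, 0, 6, 16, 11),
                  class = 0:5)

# Position of each row's class in the threshold table
idx <- match(df1$class, df2$class)

# TRUE where all three values exceed the thresholds of the row's class
keep <- df1$a > df2$a[idx] & df1$b > df2$b[idx] & df1$c > df2$c[idx]

# which() drops rows whose class has no thresholds (keep would be NA)
res <- df1[which(keep), ]
res
```

The third row is dropped because b = 14 does not exceed the class 5 threshold of 17.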
I have two data frames and I want to stack one above the other, with the column names of the second inserted as a row of the new data frame. The column names differ, and one data frame has more columns.
For example:
mydf1 <- data.frame(V1=c(1:5), V2=c(21:25))
mydf1
V1 V2
1 1 21
2 2 22
3 3 23
4 4 24
5 5 25
mydf2 <- data.frame(C1=c(1:10), C2=c(21:30),C3=c(41:50))
mydf2
C1 C2 C3
1 1 21 41
2 2 22 42
3 3 23 43
4 4 24 44
5 5 25 45
6 6 26 46
7 7 27 47
8 8 28 48
9 9 29 49
10 10 30 50
Result:
mydf
V1 V2
1 1 21 NA
2 2 22 NA
3 3 23 NA
4 4 24 NA
5 5 25 NA
6 C1 C2 C3
7 1 21 41
8 2 22 42
9 3 23 43
10 4 24 44
11 5 25 45
12 6 26 46
13 7 27 47
14 8 28 48
15 9 29 49
16 10 30 50
I don't care if all numeric values are treated as characters.
Many thanks
You can do this easily without any packages:
mydf1 <- data.frame(V1=c(1:5), V2=c(21:25))
mydf1[,3] <- NA
names(mydf1) <- c("one", "two", "three")
mydf2 <- data.frame(C1=c(1:10), C2=c(21:30),C3=c(41:50))
names <- t(as.data.frame(names(mydf2)))
names <- as.data.frame(names)
names(mydf2) <- c("one", "two", "three")
names(names) <- c("one", "two", "three")
mydf3 <- rbind(mydf1, names)
mydf4 <- rbind(mydf3, mydf2)
> mydf4
one two three
1 1 21 <NA>
2 2 22 <NA>
3 3 23 <NA>
4 4 24 <NA>
5 5 25 <NA>
6 C1 C2 C3
7 1 21 41
8 2 22 42
9 3 23 43
10 4 24 44
11 5 25 45
12 6 26 46
13 7 27 47
14 8 28 48
15 9 29 49
16 10 30 50
Of course, you can edit the <- c("one", "two", "three") to make the final column names whatever you'd like. For example:
> mydf1 <- data.frame(V1=c(1:5), V2=c(21:25))
> mydf1[,3] <- NA
> names(mydf1) <- c("V1", "V2", "NA")
> mydf2 <- data.frame(C1=c(1:10), C2=c(21:30),C3=c(41:50))
> names <- t(as.data.frame(names(mydf2)))
> names <- as.data.frame(names)
> names(mydf2) <- c("V1", "V2", "NA")
> names(names) <- c("V1", "V2", "NA")
> mydf3 <- rbind(mydf1, names)
> mydf4 <- rbind(mydf3, mydf2)
> row.names(mydf4) <- NULL
> mydf4
V1 V2 NA
1 1 21 <NA>
2 2 22 <NA>
3 3 23 <NA>
4 4 24 <NA>
5 5 25 <NA>
6 C1 C2 C3
7 1 21 41
8 2 22 42
9 3 23 43
10 4 24 44
11 5 25 45
12 6 26 46
13 7 27 47
14 8 28 48
15 9 29 49
16 10 30 50
If you need to resort to a package for any reason when scaling this up to your real use case, then try melt from reshape2 or the package plyr. However, use of a package shouldn't be necessary.
I don't know what you tried with write.table, but that seems to me like the way to go.
I would create a function something like this:
myFun <- function(...) {
L <- list(...)
temp <- tempfile()
maxCol <- max(vapply(L, ncol, 1L))
lapply(L, function(x)
suppressWarnings(
write.table(x, file = temp, row.names = FALSE,
sep = ",", append = TRUE)))
read.csv(temp, header = FALSE, fill = TRUE,
col.names = paste0("New_", sequence(maxCol)),
stringsAsFactors = FALSE)
}
Usage would then simply be:
myFun(mydf1, mydf2)
# New_1 New_2 New_3
# 1 V1 V2
# 2 1 21
# 3 2 22
# 4 3 23
# 5 4 24
# 6 5 25
# 7 C1 C2 C3
# 8 1 21 41
# 9 2 22 42
# 10 3 23 43
# 11 4 24 44
# 12 5 25 45
# 13 6 26 46
# 14 7 27 47
# 15 8 28 48
# 16 9 29 49
# 17 10 30 50
The function is written such that you can specify more than two data.frames as input:
mydf3 <- data.frame(matrix(1:8, ncol = 4))
myFun(mydf1, mydf2, mydf3)
# New_1 New_2 New_3 New_4
# 1 V1 V2
# 2 1 21
# 3 2 22
# 4 3 23
# 5 4 24
# 6 5 25
# 7 C1 C2 C3
# 8 1 21 41
# 9 2 22 42
# 10 3 23 43
# 11 4 24 44
# 12 5 25 45
# 13 6 26 46
# 14 7 27 47
# 15 8 28 48
# 16 9 29 49
# 17 10 30 50
# 18 X1 X2 X3 X4
# 19 1 3 5 7
# 20 2 4 6 8
Here's one approach with the rbind.fill function (part of the plyr package).
library(plyr)
setNames(rbind.fill(setNames(mydf1, names(mydf2[seq(mydf1)])),
rbind(names(mydf2), mydf2)), names(mydf1))
V1 V2 NA
1 1 21 <NA>
2 2 22 <NA>
3 3 23 <NA>
4 4 24 <NA>
5 5 25 <NA>
6 C1 C2 C3
7 1 21 41
8 2 22 42
9 3 23 43
10 4 24 44
11 5 25 45
12 6 26 46
13 7 27 47
14 8 28 48
15 9 29 49
16 10 30 50
Give this a try.
Assign the column names from the second data set to a vector, and then replace the second set's names with the names from the first set. Then create a list where the middle element is the vector you assigned. Now when you call rbind, it should be fine since everything is in the right order.
# assuming d1 and d2 are copies of mydf1 and mydf2 from the question
d1 <- mydf1
d2 <- mydf2
d1$V3 <- NA
nm <- names(d2)
names(d2) <- names(d1)
dc <- do.call(rbind, list(d1,nm,d2))
rownames(dc) <- NULL
dc
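The approaches above can be wrapped into a small helper that works regardless of which frame is wider. This is only a sketch: to_chr and stack_with_header are hypothetical names, the output columns are simply called col1, col2, ..., and everything is converted to character, which the question says is acceptable.

```r
mydf1 <- data.frame(V1 = 1:5, V2 = 21:25)
mydf2 <- data.frame(C1 = 1:10, C2 = 21:30, C3 = 41:50)

# Convert a data frame to a character matrix without dimnames
to_chr <- function(d) {
  m <- as.matrix(d)
  storage.mode(m) <- "character"
  unname(m)
}

# Stack `top` over `bottom`, with the column names of `bottom` as a row
stack_with_header <- function(top, bottom) {
  n <- max(ncol(top), ncol(bottom))
  # Pad a matrix on the right with NA columns up to n columns
  pad <- function(m) cbind(m, matrix(NA_character_, nrow(m), n - ncol(m)))
  out <- rbind(pad(to_chr(top)),
               pad(rbind(names(bottom), to_chr(bottom))))
  out <- as.data.frame(out, stringsAsFactors = FALSE)
  names(out) <- paste0("col", seq_len(n))
  out
}

res <- stack_with_header(mydf1, mydf2)
res
```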