tidyverse: binding list elements efficiently - r

I want to column-bind the data.frames in a list that have the same number of rows, as shown below.
df1 <- data.frame(A1 = 1:10, B1 = 11:20)
df2 <- data.frame(A1 = 1:10, C1 = 21:30)
df3 <- data.frame(A2 = 1:15, B2 = 11:25, C2 = 31:45)
df4 <- data.frame(A2 = 1:15, D2 = 11:25, E2 = 51:65)
df5 <- 5
ls <- list(df1, df2, df3, df4, df5)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
bind_cols(ls[1], ls[2], .id = NULL)
#> New names:
#> * A1 -> A1...1
#> * A1 -> A1...3
#> A1...1 B1 A1...3 C1
#> 1 1 11 1 21
#> 2 2 12 2 22
#> 3 3 13 3 23
#> 4 4 14 4 24
#> 5 5 15 5 25
#> 6 6 16 6 26
#> 7 7 17 7 27
#> 8 8 18 8 28
#> 9 9 19 9 29
#> 10 10 20 10 30
bind_cols(ls[3], ls[4], .id = NULL)
#> New names:
#> * A2 -> A2...1
#> * A2 -> A2...4
#> A2...1 B2 C2 A2...4 D2 E2
#> 1 1 11 31 1 11 51
#> 2 2 12 32 2 12 52
#> 3 3 13 33 3 13 53
#> 4 4 14 34 4 14 54
#> 5 5 15 35 5 15 55
#> 6 6 16 36 6 16 56
#> 7 7 17 37 7 17 57
#> 8 8 18 38 8 18 58
#> 9 9 19 39 9 19 59
#> 10 10 20 40 10 20 60
#> 11 11 21 41 11 21 61
#> 12 12 22 42 12 22 62
#> 13 13 23 43 13 23 63
#> 14 14 24 44 14 24 64
#> 15 15 25 45 15 25 65
In my actual list, I have about twenty data.frames with differing numbers of rows. Is there a more efficient way to bind the data.frames that have the same number of rows without spelling out the name and index of each list element?

It is easier to do this by splitting. Create a grouping index with gl (this assumes the data.frames to be bound sit next to each other in the list, in pairs)
grp <- as.integer(gl(length(ls), 2, length(ls)))
and then use split
library(dplyr)
library(purrr)
library(stringr)
split(ls, grp) %>% # // split by the grouping index
map(bind_cols) %>% # // loop over the `list` and use `bind_cols`
set_names(str_c('df', seq_along(.))) %>% # // name the `list`
list2env(.GlobalEnv) # // create objects in global env
-output
head(df1)
# A1...1 B1 A1...3 C1
#1 1 11 1 21
#2 2 12 2 22
#3 3 13 3 23
#4 4 14 4 24
#5 5 15 5 25
#6 6 16 6 26
head(df2)
# A2...1 B2 C2 A2...4 D2 E2
#1 1 11 31 1 11 51
#2 2 12 32 2 12 52
#3 3 13 33 3 13 53
#4 4 14 34 4 14 54
#5 5 15 35 5 15 55
#6 6 16 36 6 16 56
head(df3)
# A tibble: 1 x 1
# ...1
# <dbl>
#1 5
NOTE:
It is better to keep the elements in the list instead of creating objects in the global environment with list2env.
ls is also the name of a base R function; giving an object the name of a function is best avoided because it can lead to confusing bugs.
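If the data.frames with matching row counts are not guaranteed to sit next to each other, the grouping index can be derived from the row counts themselves. The following is a minimal sketch of that idea, not the answer's own code: the names lst, dfs, n_rows and bound are made up for illustration, and dropping non-data.frame elements such as the bare 5 is an assumption about what is wanted.

```r
library(dplyr)

df1 <- data.frame(A1 = 1:10, B1 = 11:20)
df2 <- data.frame(A1 = 1:10, C1 = 21:30)
df3 <- data.frame(A2 = 1:15, B2 = 11:25, C2 = 31:45)
df4 <- data.frame(A2 = 1:15, D2 = 11:25, E2 = 51:65)

# `lst` is a hypothetical name chosen to avoid masking base::ls()
lst <- list(df1, df2, df3, df4, 5)

# Assumption: elements that are not data.frames (the bare 5) are dropped
dfs <- Filter(is.data.frame, lst)

# One row count per data.frame; this replaces the positional gl() index
n_rows <- vapply(dfs, nrow, integer(1))

# split() groups the data.frames by row count; each group is column-bound.
# The result stays in a named list instead of the global environment.
bound <- lapply(split(dfs, n_rows), function(g) suppressMessages(bind_cols(g)))

names(bound)
```

The list elements are named after the row counts, so bound[["10"]] holds the combined 10-row data and bound[["15"]] the combined 15-row data.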

Maybe not the optimal approach, but you can use a loop and bind the dataframes with the same number of rows into new dataframes. The main idea of this code is to get the number of rows of each dataframe and build a vector of the unique values. Then, in the loop, you can subset the dataframes in ls that match each value and bind their columns. Here is the code (updated to handle the little df5; the trick is to treat it as a dataframe):
library(dplyr)
#Data
df1 <- data.frame(A1 = 1:10, B1 = 11:20)
df2 <- data.frame(A1 = 1:10, C1 = 21:30)
df3 <- data.frame(A2 = 1:15, B2 = 11:25, C2 = 31:45)
df4 <- data.frame(A2 = 1:15, D2 = 11:25, E2 = 51:65)
df5 <- 5
#List
ls <- list(df1, df2, df3, df4,df5)
#Index: number of rows of each element (a bare value such as df5 counts as one row)
index <- sapply(ls, function(x) nrow(as.data.frame(x)))
m <- unique(index)
#Loop
for (i in seq_along(m))
{
  assign(paste0('df', i), do.call(bind_cols, ls[index == m[i]]))
}
Output:
df1
A1...1 B1 A1...3 C1
1 1 11 1 21
2 2 12 2 22
3 3 13 3 23
4 4 14 4 24
5 5 15 5 25
6 6 16 6 26
7 7 17 7 27
8 8 18 8 28
9 9 19 9 29
10 10 20 10 30
df2
A2...1 B2 C2 A2...4 D2 E2
1 1 11 31 1 11 51
2 2 12 32 2 12 52
3 3 13 33 3 13 53
4 4 14 34 4 14 54
5 5 15 35 5 15 55
6 6 16 36 6 16 56
7 7 17 37 7 17 57
8 8 18 38 8 18 58
9 9 19 39 9 19 59
10 10 20 40 10 20 60
11 11 21 41 11 21 61
12 12 22 42 12 22 62
13 13 23 43 13 23 63
14 14 24 44 14 24 64
15 15 25 45 15 25 65
df3
...1
1 5

Related

Transpose from long to wide with pair groups in r

I have descriptive statistics for four groups. My sample dataset is:
df <- data.frame(
Grade = c(3,3,3,3,4,4,4,4),
group = c("none","G1","G2","both","none","G1","G2","both"),
mean=c(10,12,13,12,11,18,19,20),
sd=c(22,12,22,12,11,13,14,15),
N=c(35,33,34,32,43,45,46,47))
> df
Grade group mean sd N
1 3 none 10 22 35
2 3 G1 12 12 33
3 3 G2 13 22 34
4 3 both 12 12 32
5 4 none 11 11 43
6 4 G1 18 13 45
7 4 G2 19 14 46
8 4 both 20 15 47
I would like to compare groups as pairs and need the descriptive information side by side for each pair.
Here is what I would like to have:
So, each grade has 6 pairs of groups.
Does anyone have any idea on this?
Thanks!
1) sqldf We can join df to itself on the indicated condition. Note that we escaped group since group is an sql keyword.
library(sqldf)
sqldf('select
a.Grade,
a.[group] Group1, b.[group] Group2,
a.mean mean1, b.mean mean2,
a.sd sd1, b.sd sd2,
a.N n1, b.N n2
from df a
join df b on a.Grade = b.Grade and a.[group] > b.[group]')
giving:
Grade Group1 Group2 mean1 mean2 sd1 sd2 n1 n2
1 3 none G1 10 12 22 12 35 33
2 3 none G2 10 13 22 22 35 34
3 3 none both 10 12 22 12 35 32
4 3 G2 G1 13 12 22 12 34 33
5 3 both G1 12 12 12 12 32 33
6 3 both G2 12 13 12 22 32 34
7 4 none G1 11 18 11 13 43 45
8 4 none G2 11 19 11 14 43 46
9 4 none both 11 20 11 15 43 47
10 4 G2 G1 19 18 14 13 46 45
11 4 both G1 20 18 15 13 47 45
12 4 both G2 20 19 15 14 47 46
2) base R We can perform a merge on part of the condition and then subset it for the remainder. The names are slightly different so you will need to change them if that is important.
subset(merge(df, df, by = "Grade"), group.x > group.y)
giving:
Grade group.x mean.x sd.x N.x group.y mean.y sd.y N.y
2 3 none 10 22 35 G1 12 12 33
3 3 none 10 22 35 G2 13 22 34
4 3 none 10 22 35 both 12 12 32
8 3 G1 12 12 33 both 12 12 32
10 3 G2 13 22 34 G1 12 12 33
12 3 G2 13 22 34 both 12 12 32
18 4 none 11 11 43 G1 18 13 45
19 4 none 11 11 43 G2 19 14 46
20 4 none 11 11 43 both 20 15 47
24 4 G1 18 13 45 both 20 15 47
26 4 G2 19 14 46 G1 18 13 45
28 4 G2 19 14 46 both 20 15 47
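The renaming mentioned for the base R approach can be done inside merge() itself through its suffixes argument. This is a small sketch under the assumption that names like group1/group2 are acceptable; pair_df is a name made up here, and stringsAsFactors = FALSE is set so the comparison works on character values regardless of the R version.

```r
df <- data.frame(
  Grade = c(3, 3, 3, 3, 4, 4, 4, 4),
  group = c("none", "G1", "G2", "both", "none", "G1", "G2", "both"),
  mean  = c(10, 12, 13, 12, 11, 18, 19, 20),
  sd    = c(22, 12, 22, 12, 11, 13, 14, 15),
  N     = c(35, 33, 34, 32, 43, 45, 46, 47),
  stringsAsFactors = FALSE
)

# Self-join on Grade; suffixes = c("1", "2") names the two sides
# group1/mean1/... and group2/mean2/... instead of .x/.y
pair_df <- subset(
  merge(df, df, by = "Grade", suffixes = c("1", "2")),
  group1 > group2  # keep each unordered pair once and drop self-pairs
)

names(pair_df)
table(pair_df$Grade)
```

Which member of a pair ends up on the left depends on the locale's string ordering, but every pair appears exactly once, giving six rows per grade.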

Create multiple new columns in tibble in R based on value of previous row giving prefix to all

I have a tibble as so:
df <- tibble(a = seq(1:10),
b = seq(21,30),
c = seq(31,40))
I want to create a new tibble in which some of the columns are lagged. The new columns should be called prev_ plus the name of the lagged column, e.g. prev_a.
In my actual data, there are a lot of columns, so I don't want to write it out manually. Additionally, I only want to do it for some columns. In this example I have done it manually, but I wanted to know if there is a way to use a function to do it.
df_new <- df %>%
mutate(prev_a = lag(a),
prev_b = lag(b),
prev_c = lag(c))
Thanks for your help!
With the current dplyr version you can create new variable names with mutate_at: when the functions are passed as a named list, the list name is appended to each column name as a suffix. If you want it as a prefix, as in your example, you can use rename_at to correct the variable names. With your real data, you need to adjust the vars() selection. For your example data, matches("[a-c]") works.
library(dplyr)
df <- tibble(a = seq(1:10),
b = seq(21,30),
c = seq(31,40))
df %>%
mutate_at(vars(matches("[a-c]")), list(prev = ~ lag(.x)))
#> # A tibble: 10 x 6
#> a b c a_prev b_prev c_prev
#> <int> <int> <int> <int> <int> <int>
#> 1 1 21 31 NA NA NA
#> 2 2 22 32 1 21 31
#> 3 3 23 33 2 22 32
#> 4 4 24 34 3 23 33
#> 5 5 25 35 4 24 34
#> 6 6 26 36 5 25 35
#> 7 7 27 37 6 26 36
#> 8 8 28 38 7 27 37
#> 9 9 29 39 8 28 38
#> 10 10 30 40 9 29 39
df %>%
mutate_at(vars(matches("[a-c]")), list(prev = ~ lag(.x))) %>%
rename_at(vars(contains( "_prev") ), list( ~paste("prev", gsub("_prev", "", .), sep = "_")))
#> # A tibble: 10 x 6
#> a b c prev_a prev_b prev_c
#> <int> <int> <int> <int> <int> <int>
#> 1 1 21 31 NA NA NA
#> 2 2 22 32 1 21 31
#> 3 3 23 33 2 22 32
#> 4 4 24 34 3 23 33
#> 5 5 25 35 4 24 34
#> 6 6 26 36 5 25 35
#> 7 7 27 37 6 26 36
#> 8 8 28 38 7 27 37
#> 9 9 29 39 8 28 38
#> 10 10 30 40 9 29 39
Created on 2020-04-29 by the reprex package (v0.3.0)
You could do it this way
df_new <- bind_cols(
df,
df %>% mutate_at(.vars = vars("a","b","c"), function(x) lag(x))
)
The names are a bit nasty, but you can rename them afterwards. Or see @Bas's comment to get the names with a suffix.
# A tibble: 10 x 6
a b c a1 b1 c1
<int> <int> <int> <int> <int> <int>
1 1 21 31 NA NA NA
2 2 22 32 1 21 31
3 3 23 33 2 22 32
4 4 24 34 3 23 33
5 5 25 35 4 24 34
6 6 26 36 5 25 35
7 7 27 37 6 26 36
8 8 28 38 7 27 37
9 9 29 39 8 28 38
10 10 30 40 9 29 39
If you have dplyr 1.0 you can use the new across() function.
See some examples from the docs; instead of mean you want lag
df %>% mutate_if(is.numeric, mean, na.rm = TRUE)
# ->
df %>% mutate(across(where(is.numeric), mean, na.rm = TRUE))
df %>% mutate_at(vars(x, starts_with("y")), mean, na.rm = TRUE)
# ->
df %>% mutate(across(c(x, starts_with("y")), mean, na.rm = TRUE))
df %>% mutate_all(mean, na.rm = TRUE)
# ->
df %>% mutate(across(everything(), mean, na.rm = TRUE))
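Applied to the lag question, the across() pattern could look like the following minimal sketch. It assumes dplyr 1.0 or later and that only a and b should be lagged; the .names glue string is what produces the prev_ prefix ({.col} is the spelling used in current releases).

```r
library(dplyr)

df <- tibble(a = 1:10, b = 21:30, c = 31:40)

# Lag only the selected columns; `.names` builds the new column names,
# with {.col} standing for the name of the column being processed.
df_new <- df %>%
  mutate(across(c(a, b), ~ dplyr::lag(.x), .names = "prev_{.col}"))

names(df_new)
```

Any tidyselect helper (starts_with(), matches(), where()) can replace c(a, b) when there are too many columns to list by hand.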

R data.frame add a column depending on row-values

In R, I have a data.frame that looks like this:
X Y
20 7
25 84
15 62
22 12
60 24
40 10
60 60
12 50
11 17
Now I want a new column, let's call it "SumX", that holds the sum of each value of X and the value before it, and a "SumY" column that does the same for Y. So the resulting data.frame would look like this:
X Y SumX SumY
20 7 20 #first row = X 7 #first row = Y
25 84 45 #X0 + X1 91 #Y0 + Y1
15 62 40 #X1 + X2 146 #Y1 + Y2
22 12 37 #X2 + X3 74 #Y2 + Y3
60 24 82 #X3 + X4 36 #Y3 + Y4
40 10 100 #X4 + X5 34 #Y4 + Y5
60 60 100 #and so on 70 #and so on
12 50 72 110
11 17 23 67
I can do simple X + Y into a new column with
myFrame$SumXY <- with(myFrame, X+Y)
but is there a simple way to add two X values (n + (n-1)) into SumX, and two Y values (n + (n-1)) into SumY? Even a while-loop would do, though I would prefer a simpler way (it's a lot of data like this). Any help is much appreciated! (I'm still pretty new to R)
The rollapply function from the zoo package will work here.
The following code block will create the rolling sum of each 2 adjacent values.
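library(zoo)
# rollapply() returns one value fewer than the input, so the first X is prepended to keep the column length
myFrame$SumX <- c(myFrame$X[1], rollapply(myFrame$X, 2, sum)) # this is a rolling sum of every 2 values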
You could add by = 2 as an argument to rollapply in order to not have a rolling sum (i.e. it sums values 1+2, then 3+4, then 5+6 etc.).
Look up ?rollapply for more info.
Here's a dplyr approach.
Use mutate() to add a new column and var + lag(var, default = 0) to compute your variable. Example:
library(dplyr)
d <- data.frame(
x = 1:10,
y = 11:20,
z = 21:30
)
mutate(d, sumx = x + lag(x, default = 0))
#> x y z sumx
#> 1 1 11 21 1
#> 2 2 12 22 3
#> 3 3 13 23 5
#> 4 4 14 24 7
#> 5 5 15 25 9
#> 6 6 16 26 11
#> 7 7 17 27 13
#> 8 8 18 28 15
#> 9 9 19 29 17
#> 10 10 20 30 19
More variables can be handled similarly:
mutate(d, sumx = x + lag(x, default = 0), sumy = y + lag(y, default = 0))
#> x y z sumx sumy
#> 1 1 11 21 1 11
#> 2 2 12 22 3 23
#> 3 3 13 23 5 25
#> 4 4 14 24 7 27
#> 5 5 15 25 9 29
#> 6 6 16 26 11 31
#> 7 7 17 27 13 33
#> 8 8 18 28 15 35
#> 9 9 19 29 17 37
#> 10 10 20 30 19 39
If you know that you want to do this for many, or even EVERY column in your data frame, then here's a standard evaluation approach with mutate_() that uses a custom function I adapted from this blog post (note you need to have the lazyeval package installed). The function gets applied to each column in a for loop (which could probably be optimised).
f <- function(df, col, new_col_name) {
mutate_call <- lazyeval::interp(~ x + lag(x, default = 0), x = as.name(col))
df %>% mutate_(.dots = setNames(list(mutate_call), new_col_name))
}
for (var in names(d)) {
d <- f(d, var, paste0('sum', var))
}
d
#> x y z sumx sumy sumz
#> 1 1 11 21 1 11 21
#> 2 2 12 22 3 23 43
#> 3 3 13 23 5 25 45
#> 4 4 14 24 7 27 47
#> 5 5 15 25 9 29 49
#> 6 6 16 26 11 31 51
#> 7 7 17 27 13 33 53
#> 8 8 18 28 15 35 55
#> 9 9 19 29 17 37 57
#> 10 10 20 30 19 39 59
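With dplyr 1.0 or later, the same every-column result can be sketched without mutate_() and lazyeval by using across(). This is an alternative to the loop above rather than a rewrite of it; d_sum is a name introduced only for this example.

```r
library(dplyr)

d <- data.frame(x = 1:10, y = 11:20, z = 21:30)

# For every column, add the previous row's value (0 for the first row);
# `.names` prefixes each new column with "sum".
d_sum <- d %>%
  mutate(across(everything(),
                ~ .x + dplyr::lag(.x, default = 0),
                .names = "sum{.col}"))

names(d_sum)
```

Replacing everything() with a narrower selection such as c(x, y) limits the operation to those columns.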
Just to continue the tidyverse theme, here's a solution using the purrr package (again, works for all columns, but can subset columns if need to):
library(purrr)
# Create new columns in new data frame.
# Subset `d` here if only want select columns.
# dplyr::lag is spelled out because stats::lag would ignore `default`
# and return x + x instead of x + previous x.
sum_d <- map_df(d, ~ . + dplyr::lag(., default = 0))
# Set names correctly and
# bind back to original data
names(sum_d) <- paste0("sum", names(sum_d))
d <- cbind(d, sum_d)
d
#> x y z sumx sumy sumz
#> 1 1 11 21 1 11 21
#> 2 2 12 22 3 23 43
#> 3 3 13 23 5 25 45
#> 4 4 14 24 7 27 47
#> 5 5 15 25 9 29 49
#> 6 6 16 26 11 31 51
#> 7 7 17 27 13 33 53
#> 8 8 18 28 15 35 55
#> 9 9 19 29 17 37 57
#> 10 10 20 30 19 39 59
You can use the lag function from dplyr to achieve something like this:
n <- nrow(myFrame)
myFrame$SumX <- myFrame$X # the first row keeps X[1]
myFrame$SumX[2:n] <- myFrame$X[2:n] + dplyr::lag(myFrame$X)[2:n]
#SumX
cumsum(df$X) - c(0, 0, cumsum(df$X)[1:(nrow(df)-2)])
#[1] 20 45 40 37 82 100 100 72 23
#SumY
cumsum(df$Y) - c(0, 0, cumsum(df$Y)[1:(nrow(df)-2)])
#[1] 7 91 146 74 36 34 70 110 67
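The cumsum() lines above take df to be the question's data frame. A self-contained base R sketch on the question's numbers, using a shifted copy of the vector instead, might look like this; pair_sum is a hypothetical helper name, not part of any package.

```r
myFrame <- data.frame(
  X = c(20, 25, 15, 22, 60, 40, 60, 12, 11),
  Y = c(7, 84, 62, 12, 24, 10, 60, 50, 17)
)

# Hypothetical helper: each value plus its predecessor.
# head(v, -1) drops the last element, and the leading 0 shifts the
# vector down by one, so the first element is kept unchanged.
pair_sum <- function(v) v + c(0, head(v, -1))

myFrame$SumX <- pair_sum(myFrame$X)
myFrame$SumY <- pair_sum(myFrame$Y)

# Cross-check against the cumsum() formulation
n <- nrow(myFrame)
all.equal(myFrame$SumX,
          cumsum(myFrame$X) - c(0, 0, cumsum(myFrame$X)[1:(n - 2)]))
```

Both formulations need no packages; the shifted-vector version avoids the running totals, which can matter for very long or very large-valued columns.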

Add data frames row wise with [d]plyr

I have two data frames
df1
# a b
# 1 10 20
# 2 11 21
# 3 12 22
# 4 13 23
# 5 14 24
# 6 15 25
df2
# a b
# 1 4 8
I want the following output:
df3
# a b
# 1 14 28
# 2 15 29
# 3 16 30
# 4 17 31
# 5 18 32
# 6 19 33
i.e. add df2 to each row of df1.
Is there a way to get the desired output using plyr (mdplyr??) or dplyr?
I see no reason for "dplyr" for something like this. In base R you could just do:
df1 + unclass(df2)
# a b
# 1 14 28
# 2 15 29
# 3 16 30
# 4 17 31
# 5 18 32
# 6 19 33
Which is the same as df1 + list(4, 8).
One liner with dplyr.
mutate_each(df1, funs(.+ df2$.), a:b)
# a b
#1 14 28
#2 15 29
#3 16 30
#4 17 31
#5 18 32
#6 19 33
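mutate_each() and funs() are deprecated in later dplyr releases. Assuming dplyr 1.0 or later, a comparable one-liner can be sketched with across() and cur_column(), which returns the name of the column currently being processed; the data below is rebuilt from the question's printout.

```r
library(dplyr)

df1 <- data.frame(a = 10:15, b = 20:25)
df2 <- data.frame(a = 4, b = 8)

# For each of the columns a to b, add the single value stored
# under the same column name in df2.
df3 <- df1 %>%
  mutate(across(a:b, ~ .x + df2[[cur_column()]]))

df3
```

This relies on df2 having exactly one row and the same column names as df1.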
A base R solution using sweet function sweep:
sweep(df1, 2, unlist(df2), '+')
# a b
#1 14 28
#2 15 29
#3 16 30
#4 17 31
#5 18 32
#6 19 33

Combine two dataframes one above the other

I have two dataframes and I want to put one above the other "with" column names of second as a row of the new dataframe. Column names are different and one dataframe has more columns.
For example:
mydf1 <- data.frame(V1=c(1:5), V2=c(21:25))
mydf1
V1 V2
1 1 21
2 2 22
3 3 23
4 4 24
5 5 25
mydf2 <- data.frame(C1=c(1:10), C2=c(21:30),C3=c(41:50))
mydf2
C1 C2 C3
1 1 21 41
2 2 22 42
3 3 23 43
4 4 24 44
5 5 25 45
6 6 26 46
7 7 27 47
8 8 28 48
9 9 29 49
10 10 30 50
Result:
mydf
V1 V2
1 1 21 NA
2 2 22 NA
3 3 23 NA
4 4 24 NA
5 5 25 NA
6 C1 C2 C3
7 1 21 41
8 2 22 42
9 3 23 43
10 4 24 44
11 5 25 45
12 6 26 46
13 7 27 47
14 8 28 48
15 9 29 49
16 10 30 50
I don't care if all numeric values are treated as characters.
Many thanks
You can do this easily without any packages:
mydf1 <- data.frame(V1=c(1:5), V2=c(21:25))
mydf1[,3] <- NA
names(mydf1) <- c("one", "two", "three")
mydf2 <- data.frame(C1=c(1:10), C2=c(21:30),C3=c(41:50))
names <- t(as.data.frame(names(mydf2)))
names <- as.data.frame(names)
names(mydf2) <- c("one", "two", "three")
names(names) <- c("one", "two", "three")
mydf3 <- rbind(mydf1, names)
mydf4 <- rbind(mydf3, mydf2)
> mydf4
one two three
1 1 21 <NA>
2 2 22 <NA>
3 3 23 <NA>
4 4 24 <NA>
5 5 25 <NA>
6 C1 C2 C3
7 1 21 41
8 2 22 42
9 3 23 43
10 4 24 44
11 5 25 45
12 6 26 46
13 7 27 47
14 8 28 48
15 9 29 49
16 10 30 50
>
Of course, you can edit the <- c("one", "two", "three") to make the final column names whatever you'd like. For example:
> mydf1 <- data.frame(V1=c(1:5), V2=c(21:25))
> mydf1[,3] <- NA
> names(mydf1) <- c("V1", "V2", "NA")
> mydf2 <- data.frame(C1=c(1:10), C2=c(21:30),C3=c(41:50))
> names <- t(as.data.frame(names(mydf2)))
> names <- as.data.frame(names)
> names(mydf2) <- c("V1", "V2", "NA")
> names(names) <- c("V1", "V2", "NA")
> mydf3 <- rbind(mydf1, names)
> mydf4 <- rbind(mydf3, mydf2)
> row.names(mydf4) <- NULL
> mydf4
V1 V2 NA
1 1 21 <NA>
2 2 22 <NA>
3 3 23 <NA>
4 4 24 <NA>
5 5 25 <NA>
6 C1 C2 C3
7 1 21 41
8 2 22 42
9 3 23 43
10 4 24 44
11 5 25 45
12 6 26 46
13 7 27 47
14 8 28 48
15 9 29 49
16 10 30 50
If you need to resort to a package for any reason when scaling this up to your real use case, then try melt from reshape2 or the plyr package. However, use of a package shouldn't be necessary.
I don't know what you tried with write.table, but that seems to me like the way to go.
I would create a function something like this:
myFun <- function(...) {
L <- list(...)
temp <- tempfile()
maxCol <- max(vapply(L, ncol, 1L))
lapply(L, function(x)
suppressWarnings(
write.table(x, file = temp, row.names = FALSE,
sep = ",", append = TRUE)))
read.csv(temp, header = FALSE, fill = TRUE,
col.names = paste0("New_", sequence(maxCol)),
stringsAsFactors = FALSE)
}
Usage would then simply be:
myFun(mydf1, mydf2)
# New_1 New_2 New_3
# 1 V1 V2
# 2 1 21
# 3 2 22
# 4 3 23
# 5 4 24
# 6 5 25
# 7 C1 C2 C3
# 8 1 21 41
# 9 2 22 42
# 10 3 23 43
# 11 4 24 44
# 12 5 25 45
# 13 6 26 46
# 14 7 27 47
# 15 8 28 48
# 16 9 29 49
# 17 10 30 50
The function is written such that you can specify more than two data.frames as input:
mydf3 <- data.frame(matrix(1:8, ncol = 4))
myFun(mydf1, mydf2, mydf3)
# New_1 New_2 New_3 New_4
# 1 V1 V2
# 2 1 21
# 3 2 22
# 4 3 23
# 5 4 24
# 6 5 25
# 7 C1 C2 C3
# 8 1 21 41
# 9 2 22 42
# 10 3 23 43
# 11 4 24 44
# 12 5 25 45
# 13 6 26 46
# 14 7 27 47
# 15 8 28 48
# 16 9 29 49
# 17 10 30 50
# 18 X1 X2 X3 X4
# 19 1 3 5 7
# 20 2 4 6 8
Here's one approach with the rbind.fill function (part of the plyr package).
library(plyr)
setNames(rbind.fill(setNames(mydf1, names(mydf2[seq(mydf1)])),
rbind(names(mydf2), mydf2)), names(mydf1))
V1 V2 NA
1 1 21 <NA>
2 2 22 <NA>
3 3 23 <NA>
4 4 24 <NA>
5 5 25 <NA>
6 C1 C2 C3
7 1 21 41
8 2 22 42
9 3 23 43
10 4 24 44
11 5 25 45
12 6 26 46
13 7 27 47
14 8 28 48
15 9 29 49
16 10 30 50
Give this a try.
Assign the column names from the second data set to a vector, and then replace the second set's names with the names from the first set. Then create a list where the middle element is the vector you assigned. Now when you call rbind, it should be fine since everything is in the right order.
d1 <- mydf1 # d1 and d2 stand for the question's mydf1 and mydf2
d2 <- mydf2
d1$V3 <- NA
nm <- names(d2)
names(d2) <- names(d1)
dc <- do.call(rbind, list(d1,nm,d2))
rownames(dc) <- NULL
dc
