This seems really basic but I can't figure it out. How do you add two arrays together in R by column name? For example:
a<-matrix(1:9,ncol=3)
colnames(a)<-c("A","B","C")
a
# A B C
#[1,] 1 4 7
#[2,] 2 5 8
#[3,] 3 6 9
b <-matrix(10:18,ncol=3)
colnames(b)<-c("C","B","D")
b
# C B D
#[1,] 10 13 16
#[2,] 11 14 17
#[3,] 12 15 18
I would like to add them together in such a way to yield:
# A B C D
#[1,] 1 17 17 16
#[2,] 2 19 19 17
#[3,] 3 21 21 18
I suppose I could add extra columns to both matrices but it seems like there would be a one line command to accomplish this. Thanks!
Using xtabs, after melting the combined matrix to a long data.frame:
xtabs(Freq ~ ., data=as.data.frame.table(cbind(a,b)))
# Var2
#Var1 A B C D
# A 1 17 17 16
# B 2 19 19 17
# C 3 21 21 18
The row names simply cycle through LETTERS, because the combined matrix has no row names of its own.
We could use melt/acast from reshape2 after cbinding the 'a' and 'b' matrices (inspired by @thelatemail's post).
library(reshape2)
acast(melt(cbind(a,b)), Var1~Var2, value.var='value', sum)
# A B C D
#1 1 17 17 16
#2 2 19 19 17
#3 3 21 21 18
Or we can find the column names common to both matrices with intersect, and the column names found in one matrix but not the other with setdiff. We subset both matrices by the common names and add them together, then cbind the remaining columns of 'a' and 'b' selected by the setdiff output.
nm1 <- intersect(colnames(a), colnames(b))
nm2 <- setdiff(colnames(a), colnames(b))
nm3 <- setdiff(colnames(b), colnames(a))
cbind(a[,nm2, drop=FALSE], a[,nm1]+b[,nm1], b[,nm3,drop=FALSE])
# A B C D
#[1,] 1 17 17 16
#[2,] 2 19 19 17
#[3,] 3 21 21 18
Another option would be to create a new matrix with all the unique column names in 'a' and 'b', and then fill in the values:
nm <- union(colnames(a), colnames(b))
m1 <- matrix(0, ncol=length(nm), nrow=nrow(a), dimnames=list(NULL, nm))
m1[,colnames(a)] <- a
m1[,colnames(b)] <- m1[,colnames(b)] +b
m1
# A B C D
#[1,] 1 17 17 16
#[2,] 2 19 19 17
#[3,] 3 21 21 18
We could also cbind the two matrices and use tapply to get the sums after grouping by column name and row index:
m2 <- cbind(a, b)
t(tapply(m2,list(colnames(m2)[col(m2)], row(m2)), FUN=sum))
Or we loop through 'nm' and get the sum:
sapply(nm, function(i) rowSums(m2[,colnames(m2) ==i, drop=FALSE]))
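If this comes up often, the union-of-names approach above can be wrapped in a small function. The sketch below is only an illustration: add_by_colname is a made-up name, and it assumes every matrix has the same number of rows and no duplicated column names within a single matrix.

```r
# Hypothetical helper (the name is an assumption, not from the answers above).
# Assumes: all inputs have the same number of rows, and no input repeats a column name.
add_by_colname <- function(...) {
  mats <- list(...)
  # All distinct column names, in order of first appearance
  nm <- unique(unlist(lapply(mats, colnames)))
  out <- matrix(0, nrow = nrow(mats[[1]]), ncol = length(nm),
                dimnames = list(NULL, nm))
  for (m in mats) {
    # Add each matrix into the columns that carry its names
    out[, colnames(m)] <- out[, colnames(m), drop = FALSE] + m
  }
  out
}

a <- matrix(1:9, ncol = 3, dimnames = list(NULL, c("A", "B", "C")))
b <- matrix(10:18, ncol = 3, dimnames = list(NULL, c("C", "B", "D")))
res <- add_by_colname(a, b)
res
```

Because it loops over its arguments, the same call should also work for three or more matrices.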
Related
I have a dataframe with 900 columns. I want to use tidyverse to append/bind columns in multiples of three (or another number). For example, append columns 2:3 to 1; columns 5:6 to 4, columns 8:9 to 7, and so on for the entire dataframe. Thus at the end I will have 300 columns, while keeping the name of the main column (where other columns have been appended to).
How do I do this? Thank you very much :)
A tidyverse approach:
library(tidyverse)
# data
df = data.frame(matrix(1:27, ncol=9))
names(df) <- paste('Int', rep(1:3, each=3), 'A', rep(1:3, 3), sep='_')
n = 3
df %>%
  # split the data frame into three data frames
  split.default(rep(1:n, ncol(df) / n)) %>%
  # rename and row bind the three data frames together
  map_df(
    ~ set_names(.x, names(df)[c(T, rep(F, n - 1))]) %>%
      tibble::rownames_to_column('gene')
  )
# gene Int_1_A_1 Int_2_A_1 Int_3_A_1
#1 1 1 10 19
#2 2 2 11 20
#3 3 3 12 21
#4 1 4 13 22
#5 2 5 14 23
#6 3 6 15 24
#7 1 7 16 25
#8 2 8 17 26
#9 3 9 18 27
More notes on set_names: c(T, rep(F, n - 1)) first creates the logical vector c(T, F, F, ...); names(df)[c(T, rep(F, n - 1))] then picks every n-th name, because R recycles the short logical index along the full vector of names.
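To see the recycling on its own, here is a small sketch that applies the same logical index to the column names from the example (the variable names nms, keep and picked are just for illustration):

```r
# Column names as built in the example data
nms <- paste('Int', rep(1:3, each = 3), 'A', rep(1:3, 3), sep = '_')
n <- 3

# Logical index of length n: TRUE, FALSE, FALSE
keep <- c(TRUE, rep(FALSE, n - 1))

# The index is shorter than nms, so R recycles it: positions 1, 4 and 7 are kept
picked <- nms[keep]
picked
# "Int_1_A_1" "Int_2_A_1" "Int_3_A_1"
```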
Or, if you start from a matrix, you can reshape it with the array function by giving the desired dimensions:
m = matrix(1:27, ncol=9)
m
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
#[1,] 1 4 7 10 13 16 19 22 25
#[2,] 2 5 8 11 14 17 20 23 26
#[3,] 3 6 9 12 15 18 21 24 27
array(m, c(nrow(m) * 3, ncol(m) / 3))
# [,1] [,2] [,3]
# [1,] 1 10 19
# [2,] 2 11 20
# [3,] 3 12 21
# [4,] 4 13 22
# [5,] 5 14 23
# [6,] 6 15 24
# [7,] 7 16 25
# [8,] 8 17 26
# [9,] 9 18 27
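The reshape works because R stores a matrix column by column, so changing the dimensions simply re-cuts the same sequence of values into longer columns. A small check of that idea, with the group size k as an assumed parameter (ncol(m) must be a multiple of k):

```r
m <- matrix(1:27, ncol = 9)
k <- 3  # number of adjacent columns to stack; assumes ncol(m) %% k == 0

stacked <- array(m, c(nrow(m) * k, ncol(m) / k))

# The first stacked column is just columns 1..k of m laid end to end
first_ok <- identical(stacked[, 1], c(m[, 1], m[, 2], m[, 3]))
first_ok
```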
To keep the names, you can use data.table::melt:
library(data.table)
Sample Data:
df = data.frame(matrix(1:27, ncol=9))
names(df) <- paste('Int', rep(1:3, each=3), 'A', rep(1:3, 3), sep='_')
df
# Int_1_A_1 Int_1_A_2 Int_1_A_3 Int_2_A_1 Int_2_A_2 Int_2_A_3 Int_3_A_1 Int_3_A_2 Int_3_A_3
#1 1 4 7 10 13 16 19 22 25
#2 2 5 8 11 14 17 20 23 26
#3 3 6 9 12 15 18 21 24 27
# create the patterns that group data frames
cols <- paste0('Int_', seq_len(ncol(df) / 3), '_A')
# melt the data.table based on the column patterns; the first column is an id telling you
# whether each row came from the 1st, 2nd or 3rd column of its group
setNames(melt(setDT(df), measure=patterns(cols)), c('id', cols))
# id Int_1_A Int_2_A Int_3_A
#1: 1 1 10 19
#2: 1 2 11 20
#3: 1 3 12 21
#4: 2 4 13 22
#5: 2 5 14 23
#6: 2 6 15 24
#7: 3 7 16 25
#8: 3 8 17 26
#9: 3 9 18 27
A solution can also be built with tidyr::unite and tidyr::separate_rows. The approach is to first unite the columns in groups of 3 and then use tidyr::separate_rows to expand them into rows.
I have used the data created by @Psidom in his answer. I should also mention that the data.table::melt-based approach is the most appropriate for this problem, but it is worth exploring other ideas.
library(tidyverse)
# data
df = data.frame(matrix(1:27, ncol=9))
names(df) <- paste('Int', rep(1:3, each=3), 'A', rep(1:3, 3), sep='_')
lapply(split(names(df),cut(1:ncol(df),3, labels = seq_len(ncol(df) / 3))),
function(x){unite_(df[,x], paste(x[1],x[3], sep = ":"), x, sep = ",",
remove = TRUE)}) %>%
bind_cols() %>%
separate_rows(., seq_len(ncol(.)), sep = ",")
# Int_1_A_1:Int_1_A_3 Int_2_A_1:Int_2_A_3 Int_3_A_1:Int_3_A_3
# 1 1 10 19
# 2 4 13 22
# 3 7 16 25
# 4 2 11 20
# 5 5 14 23
# 6 8 17 26
# 7 3 12 21
# 8 6 15 24
# 9 9 18 27
A base R solution:
df <- head(mtcars)[-1:-2] # 9 cols
df[(seq(df)-1) %% 3 == 0] <-
lapply(split(seq(df), (seq(df)-1) %/% 3),
function(x) apply(df[x], 1, paste, collapse="_"))
df <- df[(seq(df)-1) %% 3 == 0]
df
# disp wt am
# Mazda RX4 160_110_3.9 2.62_16.46_0 1_4_4
# Mazda RX4 Wag 160_110_3.9 2.875_17.02_0 1_4_4
# Datsun 710 108_93_3.85 2.32_18.61_1 1_4_1
# Hornet 4 Drive 258_110_3.08 3.215_19.44_1 0_3_1
# Hornet Sportabout 360_175_3.15 3.44_17.02_0 0_3_2
# Valiant 225_105_2.76 3.46_20.22_1 0_3_1
I have the following data:
a <- data.frame(ID=c("A","B","Z","H"), a=c(0,1,2,45), b=c(3,4,5,22), c=c(6,7,8,3))
> a
ID a b c
1 A 0 3 6
2 B 1 4 7
3 Z 2 5 8
4 H 45 22 3
b <- data.frame(ID=c("A","B","E","W","Z","H"), a=c(9,10,11,39,5,0), b=c(4,2,7,54,12,34), c=c(12,0,34,23,13,14))
> b
ID a b c
1: A 9 4 12
2: B 10 2 0
3: E 11 7 34
4: W 39 54 23
5: Z 5 12 13
6: H 0 34 14
I want to merge both data frames, keeping only the rows of data.frame a and summing the columns with the same name, so that at the end I get:
> z
ID a b c
1 A 9 7 18
2 B 11 6 7
3 Z 7 17 21
4 H 45 56 17
So far I have tried the following:
merge(a,b,by="ID",all.x=T,all.y=F)
> merge(a,b,by="ID",all.x=T,all.y=F)
ID a.x b.x c.x a.y b.y c.y
1 A 0 3 6 9 4 12
2 B 1 4 7 10 2 0
3 H 45 22 3 0 34 14
4 Z 2 5 8 5 12 13
> join(a,b,type="left",by="ID")
ID a b c a b c
1 A 0 3 6 9 4 12
2 B 1 4 7 10 2 0
3 Z 2 5 8 5 12 13
4 H 45 22 3 0 34 14
I cannot manage to sum the columns.
My data frame is pretty big, so if the solution can speed things up that would be even better.
If your data.frame is very big, then you may consider this option:
library(data.table)
## convert data.frame to data.table
setDT(a)
## convert data.frame to data.table
setDT(b)
## merge the two data.tables
c <- merge(a,b,by='ID')
## extract names of all columns except the first one i.e. ID
col_names <- colnames(a)[-1]
## query building
col_1 <- paste0(col_names,'.x')
col_2 <- paste0(col_names,'.y')
cols <- paste(col_1,col_2,sep=',')
cols_2 <- paste0(col_names," = sum(",cols,")")
cols_3 <- paste(cols_2,collapse=',')
query <- paste0("z <- c[,.(",cols_3,"),by=ID]")
## query execution
eval(parse(text = query))
This works at least for your example:
a <- data.frame(ID=c("A","B","Z","H"), a=c(0,1,2,45), b=c(3,4,5,22), c=c(6,7,8,3))
b <- data.frame(ID=c("A","B","E","W","Z","H"), a=c(9,10,11,39,5,0), b=c(4,2,7,54,12,34), c=c(12,0,34,23,13,14))
match_a <- na.omit(match(b$ID, a$ID))
match_b <- na.omit(match(a$ID, b$ID))
df <- cbind(ID = a$ID[match_a], a[match_a, -1] + b[match_b, -1])
First, get the rows of a that match b and vice versa, so we can be sure that we only have rows that appear in both data frames (and we now know their row indices in both). Then simply use vectorized addition on those matching rows, leaving out ID because a factor cannot be summed, and add ID back manually. Note that the two index vectors line up only if the shared IDs appear in the same relative order in both data frames, as they do here.
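If the IDs might be in a different order in the two data frames, a single match() from a into b keeps the rows aligned. This is a sketch rather than a drop-in replacement: b below holds the same rows as in the question but deliberately shuffled, and treating IDs missing from b as contributing 0 is my assumption.

```r
a <- data.frame(ID = c("A", "B", "Z", "H"),
                a = c(0, 1, 2, 45), b = c(3, 4, 5, 22), c = c(6, 7, 8, 3))
# Same rows as b in the question, shuffled on purpose
b <- data.frame(ID = c("H", "W", "A", "Z", "E", "B"),
                a = c(0, 39, 9, 5, 11, 10),
                b = c(34, 54, 4, 12, 7, 2),
                c = c(14, 23, 12, 13, 34, 0))

idx <- match(a$ID, b$ID)   # row of b for each row of a (NA if absent)
add <- b[idx, -1]          # rows of b reordered to follow a
add[is.na(add)] <- 0       # assumption: an ID missing from b adds nothing
z <- cbind(ID = a$ID, a[-1] + add)
z
```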
You cannot add the two data frames directly because they have different numbers of rows. To make them the same size, keep only the rows of b whose ID is present in a, and then add the two element-wise.
new <- b[b$ID %in% a$ID, ]
cbind(ID = a$ID, a[-1] + new[-1])
# ID a b c
#1 A 9 7 18
#2 B 11 6 7
#3 Z 7 17 21
#4 H 45 56 17
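The subsetting above relies on the shared IDs being in the same order in a and b. As a hedged alternative in base R, you could stack the relevant rows and let aggregate() do the grouping by ID, which does not depend on row order. The names both and z are just for illustration.

```r
a <- data.frame(ID = c("A", "B", "Z", "H"),
                a = c(0, 1, 2, 45), b = c(3, 4, 5, 22), c = c(6, 7, 8, 3))
b <- data.frame(ID = c("A", "B", "E", "W", "Z", "H"),
                a = c(9, 10, 11, 39, 5, 0),
                b = c(4, 2, 7, 54, 12, 34),
                c = c(12, 0, 34, 23, 13, 14))

# Stack a on top of the rows of b whose ID occurs in a, then sum per ID
both <- rbind(a, b[b$ID %in% a$ID, ])
z <- aggregate(. ~ ID, data = both, FUN = sum)

# aggregate() sorts by ID; restore the row order of a
z <- z[match(a$ID, z$ID), ]
z
```

An ID that occurs only in a would still appear in the result, carrying just its values from a.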
I have a dataframe that looks like this:
x<-data.frame(a=6, b=5:1, c=7, d=10:6)
> x
a b c d
1 6 5 7 10
2 6 4 7 9
3 6 3 7 8
4 6 2 7 7
5 6 1 7 6
I am trying to get the sums of columns a & b and c&d in another data frame that should look like:
> new
ab cd
1 11 17
2 10 16
3 9 15
4 8 14
5 7 13
I've tried the rowSums() function but it returns the sum of ALL the columns per row, and I tried rowSums(x[c(1,2), c(3,4)]) but nothing works. Please help!!
You can use rowSums on a column subset.
As a data frame:
data.frame(ab = rowSums(x[c("a", "b")]), cd = rowSums(x[c("c", "d")]))
# ab cd
# 1 11 17
# 2 10 16
# 3 9 15
# 4 8 14
# 5 7 13
As a matrix:
cbind(ab = rowSums(x[1:2]), cd = rowSums(x[3:4]))
For a wider data frame, you can also use sapply over a list of column subsets.
sapply(list(1:2, 3:4), function(y) rowSums(x[y]))
For all pairwise column combinations:
y <- combn(ncol(x), 2L, function(y) rowSums(x[y]))
colnames(y) <- combn(names(x), 2L, paste, collapse = "")
y
# ab ac ad bc bd cd
# [1,] 11 13 16 12 15 17
# [2,] 10 13 15 11 13 16
# [3,] 9 13 14 10 11 15
# [4,] 8 13 13 9 9 14
# [5,] 7 13 12 8 7 13
Here's another option:
> sapply(split.default(x, 0:(length(x)-1) %/% 2), rowSums)
0 1
[1,] 11 17
[2,] 10 16
[3,] 9 15
[4,] 8 14
[5,] 7 13
The 0:(length(x)-1) %/% 2 step creates a sequence of groups of 2 that can be used with split. It will also handle odd numbers of columns (treating the final column as a group of its own). Since there's a different default split "method" for data.frames that splits by rows, you need to specify split.default to split the columns into groups.
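To make the grouping step and the odd-column case concrete, here is a sketch with one extra column added to x (the column e is hypothetical, as are the pasted-together column names):

```r
x <- data.frame(a = 6, b = 5:1, c = 7, d = 10:6, e = 1)  # e is an extra, made-up column

# Integer division turns positions 0,1,2,3,4 into group ids 0,0,1,1,2
grp <- 0:(length(x) - 1) %/% 2

res <- sapply(split.default(x, grp), rowSums)

# Optional: label each result column by pasting its source column names together
colnames(res) <- unname(sapply(split(names(x), grp), paste, collapse = ""))
res
```

The last column, e, ends up in a group of its own, so its "sum" is just the column itself.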
It would be very helpful to me to be able to create an R list object without having to specify the names of each element. For example:
a1 <- 1
a2 <- 20
a3 <- 1:20
b <- list(a1,a2,a3, inherit.name=TRUE)
> b
[[a1]]
[1] 1
[[a2]]
[1] 20
[[a3]]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
This would be ideal. Any suggestions?
The tidyverse package tibble has a function that can do this as well. Try out tibble::lst
tibble::lst(a1, a2, a3)
# $a1
# [1] 1
#
# $a2
# [1] 20
#
# $a3
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Coincidentally, I just wrote this function. It looks a lot like @joran's solution, but it tries not to stomp on already-named arguments.
namedList <- function(...) {
L <- list(...)
snm <- sapply(substitute(list(...)),deparse)[-1]
if (is.null(nm <- names(L))) nm <- snm
if (any(nonames <- nm=="")) nm[nonames] <- snm[nonames]
setNames(L,nm)
}
## TESTING:
a <- b <- c <- 1
namedList(a,b,c)
namedList(a,b,d=c)
namedList(e=a,f=b,d=c)
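For reference, this is what I would expect those three test calls to give for the names, repeating the function so the snippet runs on its own (n1, n2 and n3 are just illustrative variable names):

```r
namedList <- function(...) {
  L <- list(...)
  # Text of each argument expression, e.g. "a" for the argument a
  snm <- sapply(substitute(list(...)), deparse)[-1]
  if (is.null(nm <- names(L))) nm <- snm
  # Only fill in names for arguments that were not explicitly named
  if (any(nonames <- nm == "")) nm[nonames] <- snm[nonames]
  setNames(L, nm)
}

a <- b <- c <- 1
n1 <- names(namedList(a, b, c))              # all names taken from the variables
n2 <- names(namedList(a, b, d = c))          # explicit name d is kept
n3 <- names(namedList(e = a, f = b, d = c))  # all explicit names are kept
```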
Copied from comments: if you want something from a CRAN package, you can use Hmisc::llist:
Hmisc::llist(a, b, c, d=a, labels = FALSE)
The only apparent difference is that the individual vectors also have names in this case.
A random idea:
a1<-1
a2<-20
a3<-1:20
my_list <- function(...){
names <- as.list(substitute(list(...)))[-1L]
result <- list(...)
names(result) <- names
result
}
> my_list(a1,a2,a3)
$a1
[1] 1
$a2
[1] 20
$a3
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
(The idea is stolen from the code in data.frame.)
Another idea:
sapply(ls(pattern='^a[0-9]'), get)
$a1
[1] 1
$a2
[1] 20
$a3
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
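One caveat with the sapply version: it returns a list here only because the objects have different lengths. If they all had the same length, sapply would simplify the result to a vector or matrix. A sketch using mget, which always returns a named list, is below; the separate environments env and env2 are my own addition to keep the example self-contained.

```r
env <- new.env()
assign("a1", 1, envir = env)
assign("a2", 20, envir = env)
assign("a3", 1:20, envir = env)

# mget() looks up several names at once and always returns a named list
lst <- mget(ls(env, pattern = "^a[0-9]"), envir = env)
names(lst)

# With equal-length objects, sapply() simplifies to a plain named vector instead
env2 <- new.env()
assign("a1", 1, envir = env2)
assign("a2", 20, envir = env2)
simplified <- sapply(ls(env2, pattern = "^a[0-9]"), get, envir = env2)
is.list(simplified)
```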
Does anyone know how I can randomize all the data inside my data frame?
I mean, I would like a new data frame in which the data are permuted by rows and by columns, giving a random data frame with the same numbers as the original.
Something like this:
Thanks!
Just use sample() separately on the number of rows and number of columns and then index with the results from sample().
df <- data.frame(matrix(1:25, ncol = 5))
permDF <- function(x) {
nr <- nrow(x)
nc <- ncol(x)
x[sample(nr), sample(nc)]
}
> permDF(df)
X3 X4 X2 X1 X5
4 14 19 9 4 24
5 15 20 10 5 25
1 11 16 6 1 21
3 13 18 8 3 23
2 12 17 7 2 22
> permDF(df)
X1 X2 X4 X3 X5
2 2 7 17 12 22
4 4 9 19 14 24
1 1 6 16 11 21
3 3 8 18 13 23
5 5 10 20 15 25
Note that this keeps the values within each row and each column together; only the order of the rows and columns changes. If you want the data set fully randomised, there isn't a really simple way with a data frame. I would do this using a matrix, but it requires a bit more work, as @DWin shows:
mat <- matrix(1:25, ncol = 5)
pmat <- mat
set.seed(42)
pmat[] <- mat[sample(length(mat))]
pmat
> pmat
[,1] [,2] [,3] [,4] [,5]
[1,] 23 11 24 10 5
[2,] 25 21 20 9 8
[3,] 7 3 13 1 18
[4,] 19 12 4 16 2
[5,] 14 17 6 15 22
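If you need a data frame back at the end, one possible sketch is to go through a matrix and convert back afterwards. shuffle_all is a made-up name, and this assumes all columns have the same type, since as.matrix would otherwise coerce everything to a common type.

```r
# Hypothetical helper: fully shuffle every cell of a data frame.
# Assumes all columns share one type (e.g. all numeric).
shuffle_all <- function(x) {
  m <- as.matrix(x)
  m[] <- sample(m)   # permute all cells; dimensions and column names are kept
  as.data.frame(m)
}

df <- data.frame(matrix(1:25, ncol = 5))
set.seed(1)
out <- shuffle_all(df)
out
```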
You can do what I was doing with the data frame in the same way with the matrix, using slightly different indexing from the one above:
mat[sample(nrow(mat)), sample(ncol(mat))]
> set.seed(42)
> mat[sample(nrow(mat)), sample(ncol(mat))]
[,1] [,2] [,3] [,4] [,5]
[1,] 15 25 5 10 20
[2,] 14 24 4 9 19
[3,] 11 21 1 6 16
[4,] 12 22 2 7 17
[5,] 13 23 3 8 18
It would be a lot faster to do this on a matrix:
dm <- matrix(1:25, ncol = 5); dm
dm[] <- sample(dm); dm
Edit: This is wrong: "I'm pretty sure that permuting first on columns and then on rows should give you the same result as permuting the entire vector and then reshaping to the original dimensions."
The "Simpson method" would give different results and may be what was requested (but it will be faster with a matrix testbed if this is to be done as part of a simulation effort):
dm <- dm[ sample(nrow(dm)), sample( ncol(dm)) ]
The randomize function from the NMF package could be what you are looking for.
From the doc:
randomize permutates independently the entries in each column of a
matrix-like object, to produce random data that can be used in
permutation tests or bootstrap analysis.
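If you would rather not load a package, the same idea (shuffling the entries of each column independently) can be sketched in base R with apply. This is my own approximation of what the documentation describes, not NMF's implementation, and it assumes the matrix has more than one row.

```r
m <- matrix(1:25, ncol = 5)

set.seed(1)
# Shuffle each column separately; values never move between columns
perm <- apply(m, 2, sample)
perm
```

Each column of perm should contain the same values as the corresponding column of m, only in a different order.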