Does anyone know how I can randomize all the data inside my data frame?
I mean, I would like a new data frame in which the data are permuted by rows and by columns, giving a random new data frame that contains the same numbers as the original.
Something like this:
Thanks!
Just use sample() separately on the number of rows and number of columns and then index with the results from sample().
df <- data.frame(matrix(1:25, ncol = 5))
permDF <- function(x) {
nr <- nrow(x)
nc <- ncol(x)
x[sample(nr), sample(nc)]
}
> permDF(df)
X3 X4 X2 X1 X5
4 14 19 9 4 24
5 15 20 10 5 25
1 11 16 6 1 21
3 13 18 8 3 23
2 12 17 7 2 22
> permDF(df)
X1 X2 X4 X3 X5
2 2 7 17 12 22
4 4 9 19 14 24
1 1 6 16 11 21
3 3 8 18 13 23
5 5 10 20 15 25
Note that this keeps the values within each row and each column together; only the order of the rows and columns changes. If you want the data set fully randomised, there isn't a really simple way with a data frame. I would do this using a matrix, but it requires a bit more work, as @DWin shows:
mat <- matrix(1:25, ncol = 5)
pmat <- mat
set.seed(42)
pmat[] <- mat[sample(length(mat))]
pmat
> pmat
[,1] [,2] [,3] [,4] [,5]
[1,] 23 11 24 10 5
[2,] 25 21 20 9 8
[3,] 7 3 13 1 18
[4,] 19 12 4 16 2
[5,] 14 17 6 15 22
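If the result has to be a data frame again, one option is a round trip through a matrix. The sketch below assumes every column has the same type (as.matrix would otherwise coerce everything to character); shuffle_cells is a hypothetical helper name, not part of any package.

```r
# Hypothetical helper: shuffle every cell of a data frame whose columns share one type
shuffle_cells <- function(x) {
  m <- as.matrix(x)                 # coerces all columns to a single type
  m[] <- m[sample(length(m))]       # permute all cells, keep dim and dimnames
  as.data.frame(m)
}

df <- data.frame(matrix(1:25, ncol = 5))
set.seed(42)
shuffled <- shuffle_cells(df)
dim(shuffled)                       # same shape as df
```

Indexing with sample(length(m)) rather than calling sample(m) avoids the special case where sample() treats a single number n as 1:n.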
You can permute the rows and columns of a matrix the same way I did with the data frame, using slightly different indexing from the one above:
mat[sample(nrow(mat)), sample(ncol(mat))]
> set.seed(42)
> mat[sample(nrow(mat)), sample(ncol(mat))]
[,1] [,2] [,3] [,4] [,5]
[1,] 15 25 5 10 20
[2,] 14 24 4 9 19
[3,] 11 21 1 6 16
[4,] 12 22 2 7 17
[5,] 13 23 3 8 18
It would be a lot faster to do this on a matrix:
dm <- matrix(1:25, ncol = 5); dm
dm[] <- sample(dm); dm
Edit: This earlier claim of mine is wrong: "I'm pretty sure that permuting first on columns and then on rows should give you the same result as permuting the entire vector and then reshaping to the original dimensions."
The "Simpson method" gives different results and may be what was requested (but it will be faster with a matrix testbed if this is to be done as part of a simulation effort):
dm <- dm[ sample(nrow(dm)), sample( ncol(dm)) ]
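To see how the two approaches differ, the following sketch checks one property: permuting rows and columns keeps the set of values found in each row, whereas shuffling all cells generally does not. row_sets is a hypothetical helper written only for this check.

```r
dm <- matrix(1:25, ncol = 5)

# Hypothetical helper: one sorted, comma-separated signature per row
row_sets <- function(m) sort(apply(m, 1, function(r) paste(sort(r), collapse = ",")))

set.seed(42)
by_rows_cols <- dm[sample(nrow(dm)), sample(ncol(dm))]  # reorder whole rows and columns
all_cells <- dm
all_cells[] <- dm[sample(length(dm))]                    # shuffle every cell

identical(row_sets(by_rows_cols), row_sets(dm))  # TRUE: each row keeps its values
identical(row_sets(all_cells), row_sets(dm))     # usually FALSE
```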
The randomize function from the NMF package could be what you are looking for.
From the doc:
randomize permutates independently the entries in each column of a
matrix-like object, to produce random data that can be used in
permutation tests or bootstrap analysis.
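For readers who cannot install NMF, a rough base R approximation of the behaviour described in that excerpt is to shuffle each column separately with apply. This is only a sketch of the idea, not NMF's implementation.

```r
m <- matrix(1:25, ncol = 5)
set.seed(42)
# Permute the entries within each column, independently of the other columns
perm <- apply(m, 2, function(col) col[sample.int(length(col))])
perm
```

Each column of perm holds the same values as the matching column of m, only in a different order.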
I have a dataframe with 900 columns. I want to use tidyverse to append/bind columns in multiples of three (or another number). For example, append columns 2:3 to 1; columns 5:6 to 4, columns 8:9 to 7, and so on for the entire dataframe. Thus at the end I will have 300 columns, while keeping the name of the main column (where other columns have been appended to).
How do I do this? Thank you very much :)
A tidyverse approach:
library(tidyverse)
# data
df = data.frame(matrix(1:27, ncol=9))
names(df) <- paste('Int', rep(1:3, each=3), 'A', rep(1:3, 3), sep='_')
n = 3
df %>%
# split the data frame into three data frames
split.default(rep(1:n, ncol(df) / n)) %>%
# rename and row bind the three data frames together
map_df(
~ set_names(.x, names(df)[c(T, rep(F, n - 1))]) %>%
tibble::rownames_to_column('gene')
)
# gene Int_1_A_1 Int_2_A_1 Int_3_A_1
#1 1 1 10 19
#2 2 2 11 20
#3 3 3 12 21
#4 1 4 13 22
#5 2 5 14 23
#6 3 6 15 24
#7 1 7 16 25
#8 2 8 17 26
#9 3 9 18 27
More notes on set_names: c(T, rep(F, n - 1)) first creates the vector c(T, F, F, ...), so names(df)[c(T, rep(F, n - 1))] picks every nth name, starting with the first, because R recycles the logical index.
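The recycling can be seen in isolation with a plain character vector. A small sketch using the same column names as the example data (keep and picked are names chosen for the illustration):

```r
nms <- paste('Int', rep(1:3, each = 3), 'A', rep(1:3, 3), sep = '_')
n <- 3
keep <- c(TRUE, rep(FALSE, n - 1))  # TRUE FALSE FALSE, recycled along nms
picked <- nms[keep]
picked
# [1] "Int_1_A_1" "Int_2_A_1" "Int_3_A_1"
```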
Or, if you start from a matrix, you can reshape it with the array function and the desired shape:
m = matrix(1:27, ncol=9)
m
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
#[1,] 1 4 7 10 13 16 19 22 25
#[2,] 2 5 8 11 14 17 20 23 26
#[3,] 3 6 9 12 15 18 21 24 27
array(m, c(nrow(m) * 3, ncol(m) / 3))
# [,1] [,2] [,3]
# [1,] 1 10 19
# [2,] 2 11 20
# [3,] 3 12 21
# [4,] 4 13 22
# [5,] 5 14 23
# [6,] 6 15 24
# [7,] 7 16 25
# [8,] 8 17 26
# [9,] 9 18 27
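This works because R stores matrices in column-major order, so the reshape simply cuts the same underlying vector into longer columns. As a sketch, the name of the first column of each group can be carried over by hand; the V1..V9 names and the variable k are made up for the example.

```r
m <- matrix(1:27, ncol = 9, dimnames = list(NULL, paste0("V", 1:9)))
k <- 3                                             # columns per group
stacked <- array(m, c(nrow(m) * k, ncol(m) / k))   # dimnames are dropped here
colnames(stacked) <- colnames(m)[seq(1, ncol(m), by = k)]
colnames(stacked)
# [1] "V1" "V4" "V7"
```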
To keep the names, you can use data.table::melt:
library(data.table)
Sample Data:
df = data.frame(matrix(1:27, ncol=9))
names(df) <- paste('Int', rep(1:3, each=3), 'A', rep(1:3, 3), sep='_')
df
# Int_1_A_1 Int_1_A_2 Int_1_A_3 Int_2_A_1 Int_2_A_2 Int_2_A_3 Int_3_A_1 Int_3_A_2 Int_3_A_3
#1 1 4 7 10 13 16 19 22 25
#2 2 5 8 11 14 17 20 23 26
#3 3 6 9 12 15 18 21 24 27
# create the patterns that group data frames
cols <- paste0('Int_', seq_len(ncol(df) / 3), '_A')
# melt the data.table based on the column patterns; the extra id column tells you
# whether each row came from the 1st, 2nd or 3rd column of its group
setNames(melt(setDT(df), measure=patterns(cols)), c('id', cols))
# id Int_1_A Int_2_A Int_3_A
#1: 1 1 10 19
#2: 1 2 11 20
#3: 1 3 12 21
#4: 2 4 13 22
#5: 2 5 14 23
#6: 2 6 15 24
#7: 3 7 16 25
#8: 3 8 17 26
#9: 3 9 18 27
A solution can be achieved using tidyr::unite and tidyr::separate_rows. The approach is to first unite the columns in groups of three and then use tidyr::separate_rows to expand them into rows.
I have used the data created by @Psidom in his answer. I should also mention that the data.table::melt-based approach is the most appropriate for this problem, but one can explore different ideas using different approaches.
library(tidyverse)
# data
df = data.frame(matrix(1:27, ncol=9))
names(df) <- paste('Int', rep(1:3, each=3), 'A', rep(1:3, 3), sep='_')
lapply(split(names(df),cut(1:ncol(df),3, labels = seq_len(ncol(df) / 3))),
function(x){unite_(df[,x], paste(x[1],x[3], sep = ":"), x, sep = ",",
remove = TRUE)}) %>%
bind_cols() %>%
separate_rows(., seq_len(ncol(.)), sep = ",")
# Int_1_A_1:Int_1_A_3 Int_2_A_1:Int_2_A_3 Int_3_A_1:Int_3_A_3
# 1 1 10 19
# 2 4 13 22
# 3 7 16 25
# 4 2 11 20
# 5 5 14 23
# 6 8 17 26
# 7 3 12 21
# 8 6 15 24
# 9 9 18 27
A base R solution:
df <- head(mtcars)[-1:-2] # 9 cols
df[(seq(df)-1) %% 3 == 0] <-
lapply(split(seq(df), (seq(df)-1) %/% 3),
function(x) apply(df[x], 1, paste, collapse="_"))
df <- df[(seq(df)-1) %% 3 == 0]
df
# disp wt am
# Mazda RX4 160_110_3.9 2.62_16.46_0 1_4_4
# Mazda RX4 Wag 160_110_3.9 2.875_17.02_0 1_4_4
# Datsun 710 108_93_3.85 2.32_18.61_1 1_4_1
# Hornet 4 Drive 258_110_3.08 3.215_19.44_1 0_3_1
# Hornet Sportabout 360_175_3.15 3.44_17.02_0 0_3_2
# Valiant 225_105_2.76 3.46_20.22_1 0_3_1
I'm trying to create an empty data frame and then fill it in with a for loop.
I want a data frame with dimensions 5x10 that contains the product of each pair of numbers from the vectors a and b.
This is what I want the end data frame to look like.
So far I'm using a for loop to calculate the products of the two vectors, but I am not able to insert the results into the data frame.
Where am I going wrong?
My Code:
a <- c(1:10)
b <- c(1:5)
#Make a dummy data frame filled with zeros, with length(a) rows and length(b) columns
dummy <- as.data.frame(matrix(0, ncol=5, nrow=10))
heatmap_prep <- function(vector_a,vector_b){
for (i in 1:length(a)){
first_number <- a[i]
for(j in 1:length(b)){
second_number <- b[j]
result <- first_number*second_number
dummy [i,j] <- result
print(result)
}
}
}
Thanks!
Assignments made inside a function change only a local copy, so the global dummy is never modified. Create dummy inside the function, index the function's own arguments (vector_a and vector_b) rather than the globals a and b, and return the modified data frame at the end:
heatmap_prep <- function(vector_a, vector_b) {
  dummy <- as.data.frame(matrix(0, ncol = length(vector_b), nrow = length(vector_a)))
  for (i in seq_along(vector_a)) {
    for (j in seq_along(vector_b)) {
      # cell [i, j] holds the product of the i-th element of vector_a and the j-th of vector_b
      dummy[i, j] <- vector_a[i] * vector_b[j]
    }
  }
  return(dummy)
}
heatmap_prep(a, b)
# V1 V2 V3 V4 V5
# 1 1 2 3 4 5
# 2 2 4 6 8 10
# 3 3 6 9 12 15
# 4 4 8 12 16 20
# 5 5 10 15 20 25
# 6 6 12 18 24 30
# 7 7 14 21 28 35
# 8 8 16 24 32 40
# 9 9 18 27 36 45
# 10 10 20 30 40 50
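The scoping rule behind this can be checked with a much smaller example. In this sketch (counter and bump are made-up names), the assignment inside the function creates a local variable and leaves the global one untouched:

```r
counter <- 0
bump <- function() {
  counter <- counter + 1   # creates a local 'counter'; the global is only read
  counter
}
returned <- bump()
returned   # 1
counter    # still 0
```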
However, in this case the built-in outer function is much more succinct. The output is a matrix, but you can easily coerce it to a data.frame.
outer(a, b)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 2 3 4 5
# [2,] 2 4 6 8 10
# [3,] 3 6 9 12 15
# [4,] 4 8 12 16 20
# [5,] 5 10 15 20 25
# [6,] 6 12 18 24 30
# [7,] 7 14 21 28 35
# [8,] 8 16 24 32 40
# [9,] 9 18 27 36 45
# [10,] 10 20 30 40 50
You can also think of this problem as matrix multiplication. This will give the same result.
a %*% t(b)
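A quick sketch to confirm that the two expressions agree and to coerce the result to a data frame (prod_mat, same and prod_df are names chosen for the example):

```r
a <- 1:10
b <- 1:5
prod_mat <- outer(a, b)                          # 10 x 5 matrix of products
same <- isTRUE(all.equal(prod_mat, a %*% t(b)))
same                                             # TRUE
prod_df <- as.data.frame(prod_mat)               # columns are named V1..V5
dim(prod_df)
# [1] 10  5
```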
I have a matrix (RR) whose column names are integers. When I refer to the elements of the matrix like RR[x, c("5")] it works fine, but when I change it to
Myindex <-5
RR[x, c("Myindex")]
I get the error subscript out of bounds. I could not understand it so far.
BTW, 5 is just an example.
Any idea?
Thanks
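Even though you set the column names as numbers, they are stored as character strings. In your code, "Myindex" in quotes is the literal string "Myindex" rather than the value of the variable, so no column matches. Pass the variable itself, as a character string if you want to match by name or as a number if you want to match by position: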
rr <- matrix(1:15,3,5)
colnames(rr) <- c(21:25)
rr
# 21 22 23 24 25
# [1,] 1 4 7 10 13
# [2,] 2 5 8 11 14
# [3,] 3 6 9 12 15
rr[1,"23"]
# 23 ## column name is 23
# 7
my_index <- 4
rr[3,my_index]
# 24 ## column name is 24
# 12
my_index <- "25"
rr[3,my_index]
# 25 ## column name is 25
# 15
colnames(rr) <- as.integer(c(21:25))
rr
# 21 22 23 24 25
# [1,] 1 4 7 10 13
# [2,] 2 5 8 11 14
# [3,] 3 6 9 12 15
class(colnames(rr))
# [1] "character"
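Applied to the question, this means the variable must be passed unquoted, and converted with as.character if the lookup should go by column name rather than by position. A minimal sketch with the same rr matrix (by_name and res are names chosen for the example):

```r
rr <- matrix(1:15, 3, 5)
colnames(rr) <- 21:25            # stored as "21", "22", ...

my_index <- 23
by_name <- rr[1, as.character(my_index)]   # column named "23"
by_name == 7                               # TRUE

# Quoting the variable name looks for a column literally called "my_index"
res <- tryCatch(rr[1, "my_index"], error = function(e) conditionMessage(e))
res                                        # the "subscript out of bounds" message
```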
This seems really basic but I can't figure it out. How do you add two arrays together in R by column name? For example:
a<-matrix(1:9,ncol=3)
colnames(a)<-c("A","B","C")
a
# A B C
#[1,] 1 4 7
#[2,] 2 5 8
#[3,] 3 6 9
b <-matrix(10:18,ncol=3)
colnames(b)<-c("C","B","D")
b
# C B D
#[1,] 10 13 16
#[2,] 11 14 17
#[3,] 12 15 18
I would like to add them together in such a way to yield:
# A B C D
#[1,] 1 17 17 16
#[2,] 2 19 19 17
#[3,] 3 21 21 18
I suppose I could add extra columns to both matrices but it seems like there would be a one line command to accomplish this. Thanks!
Using xtabs, after melting a combined table to a long data.frame:
xtabs(Freq ~ ., data=as.data.frame.table(cbind(a,b)))
# Var2
#Var1 A B C D
# A 1 17 17 16
# B 2 19 19 17
# C 3 21 21 18
The rownames will just be cycling through LETTERS
We could use melt/acast from reshape2 after cbinding both the 'a' and 'b' matrices (inspired by @thelatemail's post).
library(reshape2)
acast(melt(cbind(a,b)), Var1~Var2, value.var='value', sum)
# A B C D
#1 1 17 17 16
#2 2 19 19 17
#3 3 21 21 18
Or we can find the column names common to both matrices with intersect, and the column names found in one matrix but not the other with setdiff. We subset both matrices by the common names and add them together, then cbind the remaining columns of 'a' and 'b' based on the setdiff output.
nm1 <- intersect(colnames(a), colnames(b))
nm2 <- setdiff(colnames(a), colnames(b))
nm3 <- setdiff(colnames(b), colnames(a))
cbind(a[,nm2, drop=FALSE], a[,nm1]+b[,nm1], b[,nm3,drop=FALSE])
# A B C D
#[1,] 1 17 17 16
#[2,] 2 19 19 17
#[3,] 3 21 21 18
Another option would be to create another matrix with all the unique columns in 'a' and 'b', and then fill in the values:
nm <- union(colnames(a), colnames(b))
m1 <- matrix(0, ncol=length(nm), nrow=nrow(a), dimnames=list(NULL, nm))
m1[,colnames(a)] <- a
m1[,colnames(b)] <- m1[,colnames(b)] +b
m1
# A B C D
#[1,] 1 17 17 16
#[2,] 2 19 19 17
#[3,] 3 21 21 18
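The same idea extends to any number of matrices. The function below is a hypothetical helper (add_by_colname is not from any package) that assumes all inputs have the same number of rows and that column names are unique within each matrix:

```r
add_by_colname <- function(...) {
  mats <- list(...)
  nm <- Reduce(union, lapply(mats, colnames))        # all column names, in order seen
  out <- matrix(0, nrow = nrow(mats[[1]]), ncol = length(nm),
                dimnames = list(NULL, nm))
  for (m in mats) {
    out[, colnames(m)] <- out[, colnames(m)] + m     # accumulate by name
  }
  out
}

a <- matrix(1:9, ncol = 3, dimnames = list(NULL, c("A", "B", "C")))
b <- matrix(10:18, ncol = 3, dimnames = list(NULL, c("C", "B", "D")))
total <- add_by_colname(a, b)
total
```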
We could also cbind both the matrices and use tapply to get the sum after grouping with column and row indices
m2 <- cbind(a, b)
t(tapply(m2,list(colnames(m2)[col(m2)], row(m2)), FUN=sum))
Or we loop through 'nm' and get the sum
sapply(nm, function(i) rowSums(m2[,colnames(m2) ==i, drop=FALSE]))
I have a dataframe that looks like this:
x<-data.frame(a=6, b=5:1, c=7, d=10:6)
> x
a b c d
1 6 5 7 10
2 6 4 7 9
3 6 3 7 8
4 6 2 7 7
5 6 1 7 6
I am trying to get the sums of columns a & b and c&d in another data frame that should look like:
> new
ab cd
1 11 17
2 10 16
3 9 15
4 8 14
5 7 13
I've tried the rowSums() function but it returns the sum of ALL the columns per row, and I tried rowSums(x[c(1,2), c(3,4)]) but nothing works. Please help!!
You can use rowSums on a column subset.
As a data frame:
data.frame(ab = rowSums(x[c("a", "b")]), cd = rowSums(x[c("c", "d")]))
# ab cd
# 1 11 17
# 2 10 16
# 3 9 15
# 4 8 14
# 5 7 13
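The attempt in the question, rowSums(x[c(1,2), c(3,4)]), fails for a different reason: with two index arguments the first selects rows and the second selects columns, so it sums columns c and d over rows 1 and 2 only. A short sketch of the difference (two_args and one_arg are names chosen for the example):

```r
x <- data.frame(a = 6, b = 5:1, c = 7, d = 10:6)

two_args <- x[c(1, 2), c(3, 4)]   # rows 1-2 of columns c and d
dim(two_args)
# [1] 2 2

one_arg <- x[c("a", "b")]         # all rows of columns a and b
dim(one_arg)
# [1] 5 2
```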
As a matrix:
cbind(ab = rowSums(x[1:2]), cd = rowSums(x[3:4]))
For a wider data frame, you can also use sapply over a list of column subsets.
sapply(list(1:2, 3:4), function(y) rowSums(x[y]))
For all pairwise column combinations:
y <- combn(ncol(x), 2L, function(y) rowSums(x[y]))
colnames(y) <- combn(names(x), 2L, paste, collapse = "")
y
# ab ac ad bc bd cd
# [1,] 11 13 16 12 15 17
# [2,] 10 13 15 11 13 16
# [3,] 9 13 14 10 11 15
# [4,] 8 13 13 9 9 14
# [5,] 7 13 12 8 7 13
Here's another option:
> sapply(split.default(x, 0:(length(x)-1) %/% 2), rowSums)
0 1
[1,] 11 17
[2,] 10 16
[3,] 9 15
[4,] 8 14
[5,] 7 13
The 0:(length(x)-1) %/% 2 step creates a sequence of groups of 2 that can be used with split. It will also handle odd numbers of columns (treating the final column as a group of its own). Since there's a different default split "method" for data.frames that splits by rows, you need to specify split.default to split the columns into groups.
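If you want the result labelled ab and cd as in the question, the same grouping vector can be used to build the column names. A sketch (grp and res are names chosen for the example):

```r
x <- data.frame(a = 6, b = 5:1, c = 7, d = 10:6)
grp <- 0:(length(x) - 1) %/% 2                  # 0 0 1 1
res <- sapply(split.default(x, grp), rowSums)
# Paste together the names of the columns in each group
colnames(res) <- unname(sapply(split(names(x), grp), paste, collapse = ""))
colnames(res)
# [1] "ab" "cd"
```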