Extracting multiple columns across matrices to a new matrix in R

I have multiple CSV files that contain data structured as follows:
A,B,C,D,
1,2,3,4,
5,6,7,8,
9,10,11,12,
that were generated using Monte Carlo methods. In order to do some statistical analysis on the data, I need to gather all of the data from the same column in each file into a single matrix (i.e., all the data from column A across the files in one matrix). I know how to do this by brute-forcing things with loops, but is there an easier way to do this in R?
Sample data:
A <- c(1,5,9)
B <- c(2,6,10)
C <- c(3,7,11)
D <- c(4,8,12)
data <- data.frame(A,B,C,D)

I recommend storing the data from all CSV files in a list; then you can use sapply to extract the relevant column from every list entry and collect the results in a matrix (a sketch for building the list from the actual files follows the examples below):
# Sample data
df <- read.csv(text =
"A,B,C,D,
1,2,3,4,
5,6,7,8,
9,10,11,12,", header = T)
# Store data in a list
lst <- list(df, df)
# Extract column A and store as matrix by `cbind`ing entries
cbind(sapply(lst, function(x) x$A))
# [,1] [,2]
#[1,] 1 1
#[2,] 5 5
#[3,] 9 9
Or to do this for columns A, B, C, D in one go:
lapply(c("A", "B", "C", "D"), function(s)
cbind.data.frame(sapply(lst, function(x) x[s])))
#[[1]]
# A A
#1 1 1
#2 5 5
#3 9 9
#
#[[2]]
# B B
#1 2 2
#2 6 6
#3 10 10
#
#[[3]]
# C C
#1 3 3
#2 7 7
#3 11 11
#
#[[4]]
# D D
#1 4 4
#2 8 8
#3 12 12
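In practice, once the real files are on disk, you would build lst from them rather than duplicating df. A minimal sketch, assuming the CSV files all share the same column layout and sit in the working directory (the file pattern here is an assumption):
# Read every CSV in the working directory into a list of data frames
files <- list.files(pattern = "\\.csv$")
lst <- lapply(files, read.csv)
# One matrix per column of interest, e.g. column A
A_mat <- sapply(lst, function(x) x$A)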

Related

Arranging data.frame's columns based on a reference vector [duplicate]

I have a data.frame which has 1000+ columns with similar names, and a vector of those column names. The vector is sorted by the cluster_id (which goes up to 11).
I want to sort the columns in the data frame so that they are in the order of the names in the vector.
A simple example of what I want:
Data:
A B C
1 2 3
4 5 6
Vector:
c("B","C","A")
Sorted:
B C A
2 3 1
5 6 4
Is there a fast way to do this?
UPDATE, with reproducible data added by OP:
df <- read.table(h=T, text="A B C
1 2 3
4 5 6")
vec <- c("B", "C", "A")
df[vec]
Results in:
B C A
1 2 3 1
2 5 6 4
As OP desires.
How about:
df[df.clust$mutation_id]
Where df is the data.frame you want to sort the columns of and df.clust is the data frame that contains the vector with the column order (mutation_id).
This basically treats df as a list and uses standard vector indexing techniques to re-order it.
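For instance, a minimal sketch with a made-up df.clust standing in for the OP's table of column names (mutation_id holding the desired order), reusing df from the reproducible example above:
# Hypothetical lookup table; mutation_id lists the columns in the desired order
df.clust <- data.frame(mutation_id = c("B", "C", "A"), stringsAsFactors = FALSE)
df[df.clust$mutation_id]
# B C A
# 1 2 3 1
# 2 5 6 4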
Brodie's answer does exactly what you're asking for. However, you imply that your data are large, so I will provide an alternative using "data.table", which has a function called setcolorder that will change the column order by reference.
Here's a reproducible example.
Start with some simple data:
mydf <- data.frame(A = 1:2, B = 3:4, C = 5:6)
matches <- data.frame(X = 1:3, Y = c("C", "A", "B"), Z = 4:6)
mydf
# A B C
# 1 1 3 5
# 2 2 4 6
matches
# X Y Z
# 1 1 C 4
# 2 2 A 5
# 3 3 B 6
Provide proof that Brodie's answer works:
out <- mydf[matches$Y]
out
# C A B
# 1 5 1 3
# 2 6 2 4
Show a more memory efficient way to do the same thing.
library(data.table)
setDT(mydf)
mydf
# A B C
# 1: 1 3 5
# 2: 2 4 6
setcolorder(mydf, as.character(matches$Y))
mydf
# C A B
# 1: 5 1 3
# 2: 6 2 4
A5C1D2H2I1M1N2O1R2T1's solution didn't work for my data (I have a similar problem to Yilun Zhang's), so I found another option:
mydf <- data.frame(A = 1:2, B = 3:4, C = 5:6)
# A B C
# 1 1 3 5
# 2 2 4 6
matches <- c("B", "C", "A") #desired order
mydf_reorder <- mydf[,match(matches, colnames(mydf))]
colnames(mydf_reorder)
#[1] "B" "C" "A"
match() finds the positions of the elements of its first argument in its second argument:
match(matches, colnames(mydf))
#[1] 2 3 1
I hope this can offer another solution if anyone is having problems!

R: grouping by name and performing stats (t-test)

I have two data.frames:
word1=c("a","a","a","a","b","b","b")
word2=c("a","a","a","a","c","c","c")
values1 = c(1,2,3,4,5,6,7)
values2 = c(3,3,0,1,2,3,4)
df1 = data.frame(word1,values1)
df2 = data.frame(word2,values2)
df1:
word1 values1
1 a 1
2 a 2
3 a 3
4 a 4
5 b 5
6 b 6
7 b 7
df2:
word2 values2
1 a 3
2 a 3
3 a 0
4 a 1
5 c 2
6 c 3
7 c 4
I would like to split these data.frames by word* and perform two-sample t.tests in R.
For example, the word "a" is in both data.frames. What's the t.test between the data.frames for the word "a"? And do this for all the words that are in both data.frames.
The result should be a data.frame like this:
word tvalues
1 a 0.4778035
Thanks
Find the words common to both dataframes, then loop over these words, subsetting both dataframes and performing the t.test on the subsets.
E.g.:
df1 <- data.frame(word = sample(letters[1:5], 30, replace = TRUE),
                  x = rnorm(30))
df2 <- data.frame(word = sample(letters[1:5], 30, replace = TRUE),
                  x = rnorm(30))
common_words <- sort(intersect(df1$word, df2$word))
setNames(lapply(common_words, function(w) {
  t.test(subset(df1, word == w, x), subset(df2, word == w, x))
}), common_words)
This returns a list, where each element is the output of the t.test for one of the common words. setNames just names the list elements so you can see which words they correspond to.
Note I've created new example data here since your example data only have one word in common (a) and so don't really resemble your true problem.
If you just want a matrix of statistics, you can do something like:
t(sapply(common_words, function(w) {
  test <- t.test(subset(df1, word == w, x), subset(df2, word == w, x))
  c(test$statistic, test$parameter, p = test$p.value,
    `2.5%` = test$conf.int[1], `97.5%` = test$conf.int[2])
}))
## t df p 2.5% 97.5%
## a 0.9141839 8.912307 0.38468553 -0.4808054 1.1313220
## b -0.2182582 7.589109 0.83298193 -1.1536056 0.9558315
## c -0.2927253 8.947689 0.77640684 -1.5340097 1.1827691
## d -2.7244728 12.389709 0.01800568 -2.5016301 -0.2826952
## e -0.3683153 7.872407 0.72234501 -1.9404345 1.4072499
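If you only want the two-column data.frame of words and t-statistics asked for in the question, a minimal sketch reusing common_words, df1 and df2 from above:
# t-statistic per common word, then assemble the requested data.frame
tvals <- sapply(common_words, function(w)
  t.test(subset(df1, word == w, x), subset(df2, word == w, x))$statistic)
result <- data.frame(word = common_words, tvalues = unname(tvals))
result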

Swap (selected/subset) data frame columns in R

What is the simplest way to swap the order of a selected subset of columns in a data frame in R? The answers I have seen (Is it possible to swap columns around in a data frame using R?) use all indices / column names for this. If one has, say, 100 columns and needs either 1) to swap column 99 with column 1, or 2) to move column 99 before column 1 (keeping column 1 now as column 2), the suggested approaches appear cumbersome. Funny that there is no small package around for this (Wickham's "reshape"?) - or can anyone suggest simple code?
If you really want a shortcut for this, you could write a couple of simple functions, such as the following.
To swap the position of two columns:
swapcols <- function(x, col1, col2) {
  if(is.character(col1)) col1 <- match(col1, colnames(x))
  if(is.character(col2)) col2 <- match(col2, colnames(x))
  if(any(is.na(c(col1, col2)))) stop("One or both columns don't exist.")
  i <- seq_len(ncol(x))
  i[col1] <- col2
  i[col2] <- col1
  x[, i]
}
To move a column from one position to another:
movecol <- function(x, col, to.pos) {
  if(is.character(col)) col <- match(col, colnames(x))
  if(is.na(col)) stop("Column doesn't exist.")
  if(to.pos > ncol(x) | to.pos < 1) stop("Invalid position.")
  x[, append(seq_len(ncol(x))[-col], col, to.pos - 1)]
}
And here are examples of each:
(m <- matrix(1:12, ncol=4, dimnames=list(NULL, letters[1:4])))
# a b c d
# [1,] 1 4 7 10
# [2,] 2 5 8 11
# [3,] 3 6 9 12
swapcols(m, col1=1, col2=3) # using column indices
# c b a d
# [1,] 7 4 1 10
# [2,] 8 5 2 11
# [3,] 9 6 3 12
swapcols(m, 'd', 'a') # or using column names
# d b c a
# [1,] 10 4 7 1
# [2,] 11 5 8 2
# [3,] 12 6 9 3
movecol(m, col='a', to.pos=2)
# b a c d
# [1,] 4 1 7 10
# [2,] 5 2 8 11
# [3,] 6 3 9 12

Read multiple files into separate data frames and process every dataframe

For all the files in one directory, I want to read each file into a data frame and then process it, for example calculate cor across columns. For example:
files <- list.files(path=".")
names <- substr(files, 18, 20)
for(i in c(1:length(names))){
  name <- names[i]
  assign(name, read.table(files[i]))
  sapply(3:ncol(name), function(y) cor(name[, 2], name[, y]))
}
But 'name' is just a string in the last statement of the code; how can I process the data frame that 'name' refers to?
This is exactly what R's lists are for. Also, calling sapply to get all of the correlations is unnecessary, since cor returns the correlation matrix, so you can just subset it.
R> files <- list.files(pattern = "tsv")
R> dat <- lapply(files, read.table)
R> dat
[[1]]
a b c
1 2.802164 4.835557 6
2 1.680186 4.974198 3
3 3.002777 4.670041 6
4 2.182691 5.137982 11
5 4.206979 5.170269 5
6 1.307195 4.753041 9
7 2.919497 4.657171 7
8 2.938614 5.305558 9
9 2.575200 4.893604 2
10 1.548161 4.871108 4
[[2]]
a b c
1 -1.8483890 2 6
2 -2.9035164 0 7
3 -0.6490283 1 6
4 -2.8842633 3 2
5 -1.8803775 0 12
6 -3.0267870 1 9
7 0.5287124 0 7
8 -3.7220733 0 2
9 -2.0663912 2 9
10 -1.6232248 1 6
You can then lapply over this list again to process each data frame, or do it as a one-liner.
R> dat <- lapply(files, function(x) cor(read.table(x))[1,-1] )
R> dat
[[1]]
b c
0.27236143 -0.04973541
[[2]]
b c
-0.1440812 0.2771511
The way to do this is to put all the files you wish to read into one folder, and then work with lists:
your.dir <- "" # adjust
files <- list.files(your.dir)
your.dfs <- lapply(file.path(your.dir, files), read.table)
your.dfs is now a list holding all your data frames. You can perform functions on all data frames simultaneously using lapply, or you can access individual data frames with the usual subsetting syntax, for example your.dfs[[1]] to access the first data frame.
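For example, to compute the correlations from the question for every data frame at once (a sketch assuming, as in the question, that column 2 holds the reference variable and columns 3 onwards are the ones to correlate against it):
# One vector of correlations per data frame in the list
cors <- lapply(your.dfs, function(d)
  sapply(3:ncol(d), function(y) cor(d[, 2], d[, y])))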
