Read multiple files into separate data frames and process every data frame

For all the files in one directory, I want to read each file into a data frame and then process it, for example by calculating the correlation across columns. For example:
files <- list.files(path = ".")
names <- substr(files, 18, 20)
for (i in 1:length(names)) {
  name <- names[i]
  assign(name, read.table(files[i]))
  sapply(3:ncol(name), function(y) cor(name[, 2], name[, y]))
}
But 'name' is a string in the last statement, so how can I process the data frame that 'name' refers to?

This is exactly what R's lists are for. Also, calling sapply to get all of the correlations is unnecessary, since cor returns the correlation matrix and you can just subset it:
R> files <- list.files(pattern = "tsv")
R> dat <- lapply(files, read.table)
R> dat
[[1]]
a b c
1 2.802164 4.835557 6
2 1.680186 4.974198 3
3 3.002777 4.670041 6
4 2.182691 5.137982 11
5 4.206979 5.170269 5
6 1.307195 4.753041 9
7 2.919497 4.657171 7
8 2.938614 5.305558 9
9 2.575200 4.893604 2
10 1.548161 4.871108 4
[[2]]
a b c
1 -1.8483890 2 6
2 -2.9035164 0 7
3 -0.6490283 1 6
4 -2.8842633 3 2
5 -1.8803775 0 12
6 -3.0267870 1 9
7 0.5287124 0 7
8 -3.7220733 0 2
9 -2.0663912 2 9
10 -1.6232248 1 6
You can then lapply over this list again to process it, or do it all as a one-liner:
R> dat <- lapply(files, function(x) cor(read.table(x))[1,-1] )
R> dat
[[1]]
b c
0.27236143 -0.04973541
[[2]]
b c
-0.1440812 0.2771511
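If you want to keep track of which file each correlation came from, a small additional step (not part of the original answer) is to name the list elements after the files:
R> names(dat) <- files   # label each result with its source file name
R> dat[[files[1]]]       # access the result for the first file by name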

The way to do this is to put all the files you wish to read into one folder, and then work with lists:
your.dir <- "" # adjust
files <- list.files(your.dir)
your.dfs <- lapply(file.path(your.dir, files), read.table)
your.dfs is now a list holding all your data frames. You can apply functions to all of them at once using lapply, or access an individual data frame with the usual subsetting syntax, for example your.dfs[[1]] for the first data frame.
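For example, a minimal sketch of processing every data frame in the list at once, mirroring the correlation from the question (it assumes each file has at least three numeric columns):
your.cors <- lapply(your.dfs, function(d) cor(d[, 2], d[, 3:ncol(d)]))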


Add column names of a dataframe or from an R object to another dataframe

I'm currently working with a huge count matrix produced by single-cell sequencing ...
So, in order to analyze it with R and my 8 GB of RAM, I had to split it into several sub-matrices.
I simply used split to do that, so I lose the headers of the matrix.
So I would like to add them back with R, or find a better way to split the matrix more efficiently.
My questions are:
1. If I have an object called headers with all the column names stored inside, is there a way to efficiently add this object to a data frame? I tried rbind but it doesn't really solve the problem.
2. Is there a better way to cut those huge count matrices into multiple parts? (I can't do it through R because I don't have enough RAM; R crashes if I try to import the whole matrix.)
If I have an object called headers with all the column names stored inside, is there a way to efficiently add this object to a data frame? I tried rbind but it doesn't really solve the problem.
You can add headers to a data frame like this:
dataframe <- data.frame(c("a", "b", "c"),
                        c("d", "e", "f"))
headers <- c("header_1", "header_2")
names(dataframe) <- headers
dataframe
header_1 header_2
1 a d
2 b e
3 c f
For cutting the huge matrix into parts without loading it into R (your second question), you could use bash for such tasks.
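If you do want to stay in R, a minimal sketch of reading such a file in chunks while re-attaching the headers (the file name, separator and chunk size below are assumptions, not from the question):
# read the header row once
headers <- scan("counts.tsv", what = character(), nlines = 1, sep = "\t")
# read the i-th chunk of 100000 data rows, skipping the header and earlier chunks
i <- 1
chunk <- read.table("counts.tsv", sep = "\t",
                    skip = 1 + (i - 1) * 100000, nrows = 100000,
                    col.names = headers)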
You can access and mutate a data frame's column names with the names function:
df <- data.frame(foo = 1:5, bar = 6:10, opt = 11:15)
original_names <- names(df)
original_names
Returns:
[1] "foo" "bar" "opt"
And to assign new names:
names(df) <- c("new_col1", "new_col2", "new_col3")
Now:
df
Returns:
new_col1 new_col2 new_col3
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
5 5 10 15
And to 'undo' the renaming:
names(df) <- original_names
And df has again its original names:
foo bar opt
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
5 5 10 15
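If you prefer not to modify df in place, setNames does the same thing non-destructively (a small variation, not part of the answer above):
df_renamed <- setNames(df, c("new_col1", "new_col2", "new_col3"))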

Extracting multiple columns across matrices to a new matrix in R

I have multiple CSV files that contain data structured as follows:
A,B,C,D,
1,2,3,4,
5,6,7,8,
9,10,11,12,
that were generated using Monte Carlo methods. In order to do some statistical analysis on the data, I need all of the data from the same column in each file in a single matrix (i.e., all the data from column A across the files in one matrix). I know how to do this by brute-forcing things with loops, but is there an easier way to do it in R?
Sample data:
A <- c(1,5,9)
B <- c(2,6,10)
C <- c(3,7,11)
D <- c(4,8,12)
data <- data.frame(A,B,C,D)
I recommend storing data from all CSV files in a list; then you can use sapply to extract relevant columns and store resulting columns in a matrix:
# Sample data
df <- read.csv(text =
"A,B,C,D,
1,2,3,4,
5,6,7,8,
9,10,11,12,", header = T)
# Store data in a list
lst <- list(df, df);
# Extract column A and store as matrix by `cbind`ing entries
cbind(sapply(lst, function(x) x$A))
# [,1] [,2]
#[1,] 1 1
#[2,] 5 5
#[3,] 9 9
Or to do this for columns A, B, C, D in one go:
lapply(c("A", "B", "C", "D"), function(s)
cbind.data.frame(sapply(lst, function(x) x[s])))
#[[1]]
# A A
#1 1 1
#2 5 5
#3 9 9
#
#[[2]]
# B B
#1 2 2
#2 6 6
#3 10 10
#
#[[3]]
# C C
#1 3 3
#2 7 7
#3 11 11
#
#[[4]]
# D D
#1 4 4
#2 8 8
#3 12 12
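Putting this together for actual CSV files on disk, a minimal sketch (the file pattern is an assumption):
files <- list.files(pattern = "\\.csv$")
lst   <- lapply(files, read.csv)
A_mat <- sapply(lst, function(x) x$A)   # one column per file, rows in original order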

combine list of data frames in list in specific manner

I have a list that contains other lists of data frames.
The outer list elements represent years and the inner lists represent monthly data.
Now I want to create a final list that contains the data for each month across all years: each month's columns will be cbind-ed with the corresponding columns from the other years.
Alldata <- list()
Alldata[[1]] <- list(data.frame(Jan_2015_A=c(1,2), Jan_2015_B=c(3,4)), data.frame(Feb_2015_C=c(5,6), Feb_2015_D=c(7,8)))
Alldata[[2]] <- list(data.frame(Jan_2016_A=c(1,2), Jan_2016_B=c(3,4)), data.frame(Feb_2016_C=c(5,6), Feb_2016_D=c(7,8)))
The expected output list is shown below (finalList at the end of my code).
I've tried using for loops, but the code is complex and I find it hard to follow; I'm hoping for a simpler, tidier way to do this operation.
First, I create a list in which each list item holds one month's data frames across the years:
x2 <- list()
for (l1 in 1:length(Alldata[[1]])) {
  temp <- list()
  for (l2 in 1:length(Alldata)) {
    temp <- append(temp, list(Alldata[[l2]][[l1]]))
  }
  x2 <- append(x2, list(temp))
}
# Then create the final list, with each month's data across successive years as one list item.
# This is primarily used for tracking data across years, for example what the count for "A"
# was in Jan_2015 versus Jan_2016.
finalList <- list()
for (l3 in 1:length(x2)) {
  temp <- x2[[l3]]
  td2 <- as.data.frame(matrix("", nrow = nrow(temp[[1]])))
  rownames(td2)[rownames(temp[[1]]) != ""] <- rownames(temp[[1]])[rownames(temp[[1]]) != ""]
  for (l4 in 1:ncol(temp[[1]])) {
    for (l5 in 1:length(temp)) {
      td2 <- cbind(td2, temp[[l5]][, l4, drop = FALSE])
    }
  }
  finalList <- append(finalList, list(td2))
}
> finalList
[[1]]
V1 Jan_2015_A Jan_2016_A Jan_2015_B Jan_2016_B
1 1 1 3 3
2 2 2 4 4
[[2]]
V1 Feb_2015_C Feb_2016_C Feb_2015_D Feb_2016_D
1 5 5 7 7
2 6 6 8 8
You could do the following. The lapply iterates over the outer list and do.call cbinds the inner list of data frames:
lapply(Alldata, do.call, what = 'cbind')
[[1]]
Jan_2015_A Jan_2015_B Feb_2015_C Feb_2015_D
1 1 3 5 7
2 2 4 6 8
[[2]]
Jan_2016_A Jan_2016_B Feb_2016_C Feb_2016_D
1 1 3 5 7
2 2 4 6 8
You can also use dplyr to get the same results.
library(dplyr)
lapply(Alldata, bind_cols)
Here is a third option proposed by J.R.
lapply(Alldata, Reduce, f = cbind)
EDIT
After clarification from OP, the above solution has been modified (see below) to produce the newly specified output. The solution above has been left there since it is a building block for the solution below.
pattern.vec <- c("Jan", "Feb")
### For a given month/pattern, return that month's columns from every year, bound together.
mon_data <- function(mo) {
  return(bind_cols(sapply(Alldata, function(x) { x[grep(pattern = mo, x)] })))
}
### Loop through months/patterns.
finalList <- lapply(pattern.vec, mon_data)
finalList
## [[1]]
## Jan_2015_A Jan_2015_B Jan_2016_A Jan_2016_B
## 1 1 3 1 3
## 2 2 4 2 4
##
## [[2]]
## Feb_2015_C Feb_2015_D Feb_2016_C Feb_2016_D
## 1 5 7 5 7
## 2 6 8 6 8
## Ordering the columns as specified in the original question.
## sorting is by the last character in the column name (A or B)
## and then the year.
lapply(finalList, function(x) x[ order(gsub('[^_]+_([^_]+)_(.*)', '\\2_\\1', colnames(x))) ])
## [[1]]
## Jan_2015_A Jan_2016_A Jan_2015_B Jan_2016_B
## 1 1 1 3 3
## 2 2 2 4 4
##
## [[2]]
## Feb_2015_C Feb_2016_C Feb_2015_D Feb_2016_D
## 1 5 5 7 7
## 2 6 6 8 8
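The regrouping done by the nested for loops in the question can also be written compactly; a minimal sketch, assuming every year lists its months in the same order (it gives the same column order as finalList before the reordering step above):
by.month  <- do.call(Map, c(list, Alldata))            # pair the i-th data frame of each year
finalList <- lapply(by.month, function(x) do.call(cbind, x))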

returning from list to data.frame after lapply

I have a very simple question about lapply. I am transitioning from Stata to R and I think there is some very basic concept about looping in R that I am not getting. I have been reading about it all afternoon and can't figure out a reasonable way to do this very simple thing.
I have three data frames df1, df2, and df3 that all have the same column names, in the same order, etc.
I want to rename their columns all at once.
I put the data frames in a list:
dflist <- list(df1, df2, df3)
What I want the new names to be:
varlist <- c("newname1", "newname2", "newname3")
Write a function that replaces names with those in varlist, and lapply it over the data frames
ChangeNames <- function(x) {
  names(x) <- varlist
  return(x)
}
dflist <- lapply(dflist, ChangeNames)
So, as far as I understand, R has changed the names of the copies of the data frames that I put in the list, but not the original data frames themselves. I want the data frames themselves to be renamed, not the elements of the list (which are trapped in a list).
Now, I can go
df1 <- as.data.frame(dflist[1])
df2 <- as.data.frame(dflist[2])
df3 <- as.data.frame(dflist[3])
But that seems weird. You need a loop to get back the elements of a loop?
Basically: once you've put some data frames in a list and run your function on them via lapply, how do you get them back out of the list, without starting back at square one?
If you just want to change the names, that isn't too hard in R. Bear in mind that the assignment operator, <-, can be applied in sequence. Hence:
names(df1) <- names(df2) <- names(df3) <- c("newname1", "newname2", "newname3")
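If you do want the renamed data frames back as separate objects in your workspace (rather than leaving them in the list), one option is list2env; a minimal sketch, assuming you first give the list elements the object names you want:
names(dflist) <- c("df1", "df2", "df3")
list2env(dflist, envir = .GlobalEnv)   # recreates df1, df2 and df3 from the list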
I am not sure I understand correctly: do you want to rename the columns of the data frames, or the components of the list that contain the data frames?
If it is the first, please always search before asking; the question has been asked here.
So what you can easily do in case you have even more data frames in the list is:
# Creating some sample data first
> dflist <- list(df1 = data.frame(a = 1:3, b = 2:4, c = 3:5),
+ df2 = data.frame(a = 4:6, b = 5:7, c = 6:8),
+ df3 = data.frame(a = 7:9, b = 8:10, c = 9:11))
# See how it looks like
> dflist
$df1
a b c
1 1 2 3
2 2 3 4
3 3 4 5
$df2
a b c
1 4 5 6
2 5 6 7
3 6 7 8
$df3
a b c
1 7 8 9
2 8 9 10
3 9 10 11
# And do the trick
> dflist <- lapply(dflist, setNames, nm = c("newname1", "newname2", "newname3"))
# See how it looks now
> dflist
$df1
newname1 newname2 newname3
1 1 2 3
2 2 3 4
3 3 4 5
$df2
newname1 newname2 newname3
1 4 5 6
2 5 6 7
3 6 7 8
$df3
newname1 newname2 newname3
1 7 8 9
2 8 9 10
3 9 10 11
So the names were changed from a, b and c to newname1, newname2 and newname3 for each data frame in the list.
If it is the second, you can do this:
> names(dflist) <- c("newname1", "newname2", "newname3")

Merge in loop R

I am using a for loop to merge multiple files with another file:
files <- list.files("path", pattern=".TXT", ignore.case=T)
for (i in 1:length(files)) {
  data <- fread(files[i], header = TRUE)
  # Merge
  mydata <- merge(mydata, data, by = "ID", all.x = TRUE)
  rm(data)
}
"mydata" looks as follows (simplified):
ID x1 x2
1 2 8
2 5 5
3 4 4
4 6 5
5 5 8
"data" looks as follows (around 600 files, in total 100GB). Example of 2 (seperate) files. Integrating all in 1 would be impossible (too large):
ID x3
1 8
2 4
ID x3
3 4
4 5
5 1
When I run my code I get the following dataset:
ID x1 x2 x3.x x3.y
1 2 8 8 NA
2 5 5 4 NA
3 4 4 NA 4
4 6 5 NA 5
5 5 8 NA 1
What I would like to get is:
ID x1 x2 x3
1 2 8 8
2 5 5 4
3 4 4 4
4 6 5 5
5 5 8 1
IDs are unique (never duplicated across the 600 files).
Any idea on how to achieve this as efficiently as possible is much appreciated.
This is better suited as a comment, but I can't comment yet.
Would it not be better to rbind instead of merge?
That seems to be what you want to accomplish.
Set the fill argument to TRUE to take care of differing column counts:
library(data.table)
asd <- data.table(x1 = c(1, 2), x2 = c(4, 5))
a <- data.table(x2 = 5)
rbind(asd, a, fill = TRUE)
x1 x2
1: 1 4
2: 2 5
3: NA 5
Do this with data and then merge into mydata by ID.
Update for comment
files <- list.files("path", pattern=".TXT", ignore.case=T)
ff <- function(input) {
  data <- fread(input)
}
a <- lapply(files, ff)
library(plyr)
binded.data <- ldply(a, function(x) rbind(x, fill = TRUE))
So this creates a function to read files and passes it to lapply, so you get a list containing all your data files, each in its own data frame.
ldply from plyr then rbinds all the data frames into one.
Don't touch mydata yet.
binded.data <- data.table(binded.data, key = "ID")
Depending on your mydata you will perform different merge commands.
See:
https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html
Update 2
files <- list.files("path", pattern=".TXT", ignore.case=T)
ff <- function(input) {
  data <- fread(input)
  # This keeps only the rows of 'data' whose ID matches an ID in 'mydata'
  data <- data[ID %in% mydata[, ID]]
}
a <- lapply(files, ff)
library(plyr)
binded.data <- ldply(a, function(x) rbind(x, fill = TRUE))
Update 3
You can add cat to print the name of the file the function is currently reading, so you can see after which file you run out of memory. That will give you an idea of how many files you can read in one go.
ff <- function(input) {
  # This will print the name of the file being read right now
  cat(input, "\n")
  data <- fread(input)
  # This keeps only the rows of 'data' whose ID matches an ID in 'mydata'
  data <- data[ID %in% mydata[, ID]]
}
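Putting the pieces together, a minimal sketch of the final step, using data.table's rbindlist in place of ldply (the all.x choice simply mirrors the original loop):
library(data.table)
binded.data <- rbindlist(a, fill = TRUE)              # stack all files; a single x3 column
setkey(binded.data, ID)
mydata <- merge(mydata, binded.data, by = "ID", all.x = TRUE)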
