Merge in loop R - r

I am using a for loop to merge multiple files with another file:
files <- list.files("path", pattern=".TXT", ignore.case=T)
for(i in 1:length(files))
{
data <- fread(files[i], header=T)
# Merge
mydata <- merge(mydata, data, by="ID", all.x=TRUE)
rm(data)
}
"mydata" looks as follows (simplified):
ID x1 x2
1 2 8
2 5 5
3 4 4
4 6 5
5 5 8
"data" looks as follows (around 600 files, 100 GB in total). Here is an example of 2 separate files; combining them all into one file would be impossible (too large):
ID x3
1 8
2 4
ID x3
3 4
4 5
5 1
When I run my code I get the following dataset:
ID x1 x2 x3.x x3.y
1 2 8 8 NA
2 5 5 4 NA
3 4 4 NA 4
4 6 5 NA 5
5 5 8 NA 1
What I would like to get is:
ID x1 x2 x3
1 2 8 8
2 5 5 4
3 4 4 4
4 6 5 5
5 5 8 1
IDs are unique (there are never duplicates across the 600 files).
Any ideas on how to achieve this as efficiently as possible would be much appreciated.

This would be better suited as a comment, but I can't comment yet.
Would it not be better to rbind instead of merge?
That seems to be what you want to accomplish.
Set the fill argument to TRUE to take care of differing numbers of columns:
asd <- data.table(x1 = c(1, 2), x2 = c(4, 5))
a <- data.table(x2 = 5)
rbind(asd, a, fill = TRUE)
x1 x2
1: 1 4
2: 2 5
3: NA 5
Do this with data and then merge into mydata by ID.
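As a rough illustration of "stack first, merge once", here is a minimal base R sketch. The small data frames are stand-ins for mydata and for two of the files (values taken from the question); file_a, file_b and stacked are hypothetical names. With data.table objects, rbind(..., fill = TRUE) as shown above would take the place of the plain rbind.

```r
# Stand-ins for mydata and two of the ~600 files (values from the question)
mydata <- data.frame(ID = 1:5, x1 = c(2, 5, 4, 6, 5), x2 = c(8, 5, 4, 5, 8))
file_a <- data.frame(ID = 1:2, x3 = c(8, 4))
file_b <- data.frame(ID = 3:5, x3 = c(4, 5, 1))

# Stack the pieces first: every ID ends up in one shared x3 column
stacked <- do.call(rbind, list(file_a, file_b))

# Merge once at the end, so no x3.x / x3.y suffixes are created
result <- merge(mydata, stacked, by = "ID", all.x = TRUE)
print(result)
```

Because the merge happens only once, the suffixing that the loop in the question triggers on every iteration never occurs.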
Update for comment
files <- list.files("path", pattern=".TXT", ignore.case=T)
ff <- function(input){
data <- fread(input)
}
a <- lapply(files, ff)
library(plyr)
binded.data <- ldply(a, function(x) rbind(x, fill = TRUE))
This defines a function that reads a file and passes it to lapply, so you get a list containing all your data files, each in its own data frame.
ldply from plyr then row-binds all of those data frames into one.
Don't touch mydata yet.
binded.data <- data.table(binded.data, key = "ID")
Depending on your mydata you will perform different merge commands.
See:
https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html
Update 2
files <- list.files("path", pattern=".TXT", ignore.case=T)
ff <- function(input){
data <- fread(input)
# This keeps only the rows of 'data' whose ID matches ID of 'mydata'
data <- data[ID %in% mydata[, ID]]
}
a <- lapply(files, ff)
library(plyr)
binded.data <- ldply(a, function(x) rbind(x, fill = TRUE))
Update 3
You can add cat to print the name of the file the function is currently reading. That way you can see after which file you run out of memory, which tells you roughly how many files you can read in one go.
ff <- function(input){
# This will print name of the file it is reading now
cat(input, "\n")
data <- fread(input)
# This keeps only the rows of 'data' whose ID matches ID of 'mydata'
data <- data[ID %in% mydata[, ID]]
}
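If memory only allows a limited number of files at a time, the same filtering idea can be applied batch by batch. The following base R sketch is only an assumption about how that might look: pieces stands in for the files (in practice each element would come from fread), and batch_size is a hypothetical value you would tune after watching the cat output.

```r
mydata <- data.frame(ID = 1:5, x1 = c(2, 5, 4, 6, 5))

# Stand-ins for three files; ID 99 does not occur in mydata
pieces <- list(
  data.frame(ID = 1:2, x3 = c(8, 4)),
  data.frame(ID = 3:4, x3 = c(4, 5)),
  data.frame(ID = c(5L, 99L), x3 = c(1, 7))
)

batch_size <- 2  # hypothetical: how many files fit in memory at once
batches <- split(seq_along(pieces), ceiling(seq_along(pieces) / batch_size))

kept <- list()
for (b in batches) {
  # Combine one batch, then keep only rows whose ID occurs in mydata
  chunk <- do.call(rbind, pieces[b])
  kept[[length(kept) + 1]] <- chunk[chunk$ID %in% mydata$ID, , drop = FALSE]
}

# Only the filtered rows are stacked and merged
stacked <- do.call(rbind, kept)
result <- merge(mydata, stacked, by = "ID", all.x = TRUE)
print(result)
```

Only the rows that survive the filter are held between batches, so peak memory depends on the batch size rather than on the total size of the files.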

Related

Combine/match/merge vectors by row names

I have a vector of variable names and several single-column matrices.
I want to create a new matrix by matching/merging the matrices on their row names.
Example:
A vector of variable names
Complete_names <- c("D","C","A","B")
Several single-column matrices
Matrix_1 <- matrix(c(1,2,3),3,1)
rownames(Matrix_1) <- c("D","C","B")
Matrix_2 <- matrix(c(4,5,6),3,1)
rownames(Matrix_2) <- c("A","B","C")
Desired output:
Desired_output <- matrix(c(1,2,NA,3,NA,6,4,5),4,2)
rownames(Desired_output) <- c("D","C","A","B")
[,1] [,2]
D 1 NA
C 2 6
A NA 4
B 3 5
I know there are several similar postings like this, but those previous answers do not work perfectly for this one.
The main job can be done with merge, returning a data frame:
merge(Matrix_1, Matrix_2, by = "row.names", all = TRUE)
# Row.names V1.x V1.y
# 1 A NA 4
# 2 B 3 5
# 3 C 2 6
# 4 D 1 NA
Depending on your purposes you may then further modify names or get rid of Row.names.
The answers offered by Julius Vainora and achimneyswallow work well, but just to exactly obtain the desired output I want:
temp <- merge(Matrix_1, Matrix_2, by = "row.names", all = TRUE)
temp$Row.names <- factor(temp$Row.names, levels=Complete_names)
temp <- temp[order(temp$Row.names),]
rownames(temp) <- temp[,1]
Desired_output <- as.matrix(temp[,-1])
V1.x V1.y
D 1 NA
C 2 6
A NA 4
B 3 5
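An alternative that skips the data frame round trip is to look up each matrix's rows with match. This is only a sketch, not one of the answers above; align is a hypothetical helper name, and the second matrix is assumed to carry the row names A, B, C as the desired output implies.

```r
Complete_names <- c("D", "C", "A", "B")
Matrix_1 <- matrix(c(1, 2, 3), 3, 1, dimnames = list(c("D", "C", "B"), NULL))
Matrix_2 <- matrix(c(4, 5, 6), 3, 1, dimnames = list(c("A", "B", "C"), NULL))

# Hypothetical helper: pull the values of m in the order given by nms;
# names that are missing from m yield NA
align <- function(m, nms) m[match(nms, rownames(m)), 1]

Desired_output <- cbind(align(Matrix_1, Complete_names),
                        align(Matrix_2, Complete_names))
rownames(Desired_output) <- Complete_names
print(Desired_output)
```

The row order follows Complete_names directly, so no factor releveling or sorting is needed afterwards.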

returning from list to data.frame after lapply

I have a very simple question about lapply. I am transitioning from Stata to R and I think there is some very basic concept that I am not getting about looping in R. But I have been reading about it all afternoon and can't figure out a reasonable way to do this very simple thing.
I have three data frames df1, df2, and df3 that all have the same column names, in the same order, etc.
I want to rename their columns all at once.
I put the data frames in a list:
dflist <- list(df1, df2, df3)
What I want the new names to be:
varlist <- c("newname1", "newname2", "newname3")
Write a function that replaces names with those in varlist, and lapply it over the data frames
ChangeNames <- function(x) {
names(x) <- varlist
return(x)
}
dflist <- lapply(dflist, ChangeNames)
So, as far as I understand, R has changed the names of the copies of the data frames that I put in the list, but not the original data frames themselves. I want the data frames themselves to be renamed, not the elements of the list (which are trapped in a list).
Now, I can go
df1 <- as.data.frame(dflist[1])
df2 <- as.data.frame(dflist[2])
df3 <- as.data.frame(dflist[3])
But that seems weird. You need a loop to get back the elements of a loop?
Basically: once you've put some data frames in a list and run your function on them via lapply, how do you get them back out of the list, without starting back at square one?
If you just want to change the names, that isn't too hard in R. Bear in mind that the assignment operator, <-, can be applied in sequence. Hence:
names(df1) <- names(df2) <- names(df3) <- c("newname1", "newname2", "newname3")
I am not sure I understand correctly, do you want to rename the columns of the data frames or the components of the list that contain the data frames?
If it is the first, please always search before asking, the question has been asked here.
So what you can easily do in case you have even more data frames in the list is:
# Creating some sample data first
> dflist <- list(df1 = data.frame(a = 1:3, b = 2:4, c = 3:5),
+ df2 = data.frame(a = 4:6, b = 5:7, c = 6:8),
+ df3 = data.frame(a = 7:9, b = 8:10, c = 9:11))
# See how it looks like
> dflist
$df1
a b c
1 1 2 3
2 2 3 4
3 3 4 5
$df2
a b c
1 4 5 6
2 5 6 7
3 6 7 8
$df3
a b c
1 7 8 9
2 8 9 10
3 9 10 11
# And do the trick
> dflist <- lapply(dflist, setNames, nm = c("newname1", "newname2", "newname3"))
# See how it looks now
> dflist
$df1
newname1 newname2 newname3
1 1 2 3
2 2 3 4
3 3 4 5
$df2
newname1 newname2 newname3
1 4 5 6
2 5 6 7
3 6 7 8
$df3
newname1 newname2 newname3
1 7 8 9
2 8 9 10
3 9 10 11
So the names were changed from a, b and c to newname1, newname2 and newname3 for each data frame in the list.
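If the renamed data frames are needed back as separate objects, which is what the question asks about, one possible sketch uses list2env on a named list. It assumes the list elements are named df1, df2 and df3 as in the sample data; list2env then overwrites the objects of the same names in the current environment.

```r
df1 <- data.frame(a = 1:3, b = 2:4, c = 3:5)
df2 <- data.frame(a = 4:6, b = 5:7, c = 6:8)
df3 <- data.frame(a = 7:9, b = 8:10, c = 9:11)

# The list must be named for list2env to know which objects to create
dflist <- list(df1 = df1, df2 = df2, df3 = df3)
dflist <- lapply(dflist, setNames, nm = c("newname1", "newname2", "newname3"))

# Copy each list element back to an object of the same name
invisible(list2env(dflist, envir = environment()))

names(df1)
```

A single element can also be pulled out with double brackets, for example dflist[["df2"]], which returns the data frame itself rather than a one-element list.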
If it is the second, you can do this:
> names(dflist) <- c("newname1", "newname2", "newname3")

Using merge command in r for merging depending upon column values

So, I have several dataframes like this
1 2 a
2 3 b
3 4 c
4 5 d
3 5 e
......
1 2 j
2 3 i
3 4 t
3 5 r
.......
2 3 t
2 4 g
6 7 i
8 9 t
......
What I want is, I want to merge all of these files into one single file showing the values of third column for each pair of values in columns 1 and columns 2 and 0 if that pair is not present.
So, the output for this will be, since, there are three files (there are more)
1 2 aj0
2 3 bit
3 4 ct0
4 5 d00
3 5 er0
6 7 00i
8 9 00t
......
What I did was combine all my text .txt files in a single list.
Then,
L <- lapply(seq_along(L), function(i) {
L[[i]][, paste0('DF', i)] <- 1
L[[i]]
})
Which will indicate the presence of a value when we will be merging them.
I don't know how to proceed further. Any inputs will be great. Thanks!
Here is one way to do it with Reduce
# function to generate dummy data
gen_data <- function(){
data.frame(
x = 1:3,
y = 2:4,
z = sample(LETTERS, 3, replace = TRUE),
# keep z as character so ifelse() below does not return factor codes
stringsAsFactors = FALSE
)
}
# generate list of data frames to merge
L <- lapply(1:3, function(x) gen_data())
# function to merge by x and y and concatenate z
f <- function(x, y){
d <- merge(x, y, by = c('x', 'y'), all = TRUE)
# set merged column to zero if no match is found
d[['z.x']] = ifelse(is.na(d[['z.x']]), 0, d[['z.x']])
d[['z.y']] = ifelse(is.na(d[['z.y']]), 0, d[['z.y']])
d$z <- paste0(d[['z.x']], d[['z.y']])
d['z.x'] <- d['z.y'] <- NULL
return(d)
}
# merge data frames
Reduce(f, L)
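With pairwise concatenation as above, a pair that first appears in a later data frame gets a single 0 for everything before it, however many frames were missing. If one position per data frame is wanted, as in the question's expected output, a variant is to give each z column its own name, merge everything, and paste only at the end. The sketch below uses the three data frames from the question; d1, d2, d3 and zcols are hypothetical names.

```r
d1 <- data.frame(x = c(1, 2, 3, 4, 3), y = c(2, 3, 4, 5, 5),
                 z = c("a", "b", "c", "d", "e"), stringsAsFactors = FALSE)
d2 <- data.frame(x = c(1, 2, 3, 3), y = c(2, 3, 4, 5),
                 z = c("j", "i", "t", "r"), stringsAsFactors = FALSE)
d3 <- data.frame(x = c(2, 2, 6, 8), y = c(3, 4, 7, 9),
                 z = c("t", "g", "i", "t"), stringsAsFactors = FALSE)
L <- list(d1, d2, d3)

# Rename the third column to z1, z2, z3 so merge() never has to add suffixes
zcols <- paste0("z", seq_along(L))
L <- Map(function(d, nm) setNames(d, c("x", "y", nm)), L, zcols)

# Full outer join on (x, y) across all data frames
merged <- Reduce(function(a, b) merge(a, b, by = c("x", "y"), all = TRUE), L)

# A missing pair becomes "0" in that data frame's position
merged[zcols][is.na(merged[zcols])] <- "0"
merged$z <- do.call(paste0, unname(as.list(merged[zcols])))
merged <- merged[order(merged$x, merged$y), c("x", "y", "z")]
print(merged)
```

The pair (2, 4), which occurs only in the third data frame, comes out as 00g here.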

Putting output in R into excel

I have code that generates 2 columns of data (e.g. Number, Median) for a particular person, and I have taken samples from 7 people,
so basically I get this output:
[[1]]
Number Median
1 5
2 3
.....
[[2]]
Number Median
1 6
2 4
....
[[3]]
Number Median
1 3
2 5
So I basically get this output....up til [[7]]
I tried transferring this output in excel using this code
write.csv(cbind(data),"data1.csv")
and I get this type of output:
list(c(Median =.......It lists all the median on the rows
But I want it to save the data referring to the 'median' and 'Number' in columns NOT ROWS
If I just type
write.csv(data,"data1.csv")
I get an error
arguments imply differing number of rows: 157, 179, 178, 180
As Marius said, you have a list of data.frames which can't be written to a .csv file. You need to do:
NewDataFrame <- do.call("rbind", YourList)
write.csv(NewDataFrame, "Data.csv")
do.call calls the function you give it (in this case rbind) once, passing the elements of the list as its arguments, so all the data frames are bound together in a single call.
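After row-binding, the rows no longer show which person they came from. A small sketch of one way to keep that information, assuming the list holds one data frame per person in order; the Person column and the two sample data frames are made up for illustration.

```r
# Stand-ins for the first two of the seven per-person data frames
YourList <- list(
  data.frame(Number = 1:2, Median = c(5, 3)),
  data.frame(Number = 1:2, Median = c(6, 4))
)

# Add a Person column (the position in the list) before binding
YourList <- Map(function(d, i) cbind(Person = i, d), YourList, seq_along(YourList))
NewDataFrame <- do.call(rbind, YourList)

# Write to a temporary file here; use your own path in practice
out <- tempfile(fileext = ".csv")
write.csv(NewDataFrame, out, row.names = FALSE)
print(NewDataFrame)
```

Setting row.names = FALSE keeps the extra unnamed first column out of the spreadsheet.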
Here are two options. Both use the following sample data:
myList <- list(data.frame(matrix(1:4, ncol = 2)),
data.frame(matrix(3:10, ncol = 2)),
data.frame(matrix(11:14, ncol =2)))
myList
# [[1]]
# X1 X2
# 1 1 3
# 2 2 4
#
# [[2]]
# X1 X2
# 1 3 7
# 2 4 8
# 3 5 9
# 4 6 10
#
# [[3]]
# X1 X2
# 1 11 13
# 2 12 14
Option 1: Write a csv file where the data.frames are presented as they are in the list
sink("list_of_dataframes.csv", type="output")
invisible(lapply(myList, function(x) dput(write.csv(x))))
sink()
If you open the resulting "list_of_dataframes.csv" file in a text editor, you will get something that looks like this. When you read this into a spreadsheet program, the first column will include the rownames and NULL separating each data.frame:
"","X1","X2"
"1",1,3
"2",2,4
NULL
"","X1","X2"
"1",3,7
"2",4,8
"3",5,9
"4",6,10
NULL
"","X1","X2"
"1",11,13
"2",12,14
NULL
Option 2: Write or search around for a version of cbind that accommodates binding data.frames with differing number of rows.
Here is one such function that I've written.
cbind2 <- function(datalist) {
nrows <- max(sapply(datalist, nrow))
expandmyrows <- function(mydata, rowsneeded) {
temp1 = names(mydata)
rowsneeded = rowsneeded - nrow(mydata)
temp2 = setNames(data.frame(
matrix(rep(NA, length(temp1) * rowsneeded),
ncol = length(temp1))), temp1)
rbind(mydata, temp2)
}
do.call(cbind, lapply(datalist, expandmyrows, rowsneeded = nrows))
}
And here is that function applied to your list:
cbind2(myList)
# X1 X2 X1 X2 X1 X2
# 1 1 3 3 7 11 13
# 2 2 4 4 8 12 14
# 3 NA NA 5 9 NA NA
# 4 NA NA 6 10 NA NA
That output should be easy for you to use with write.csv and related functions.

Use object names within a list in lapply/ldply

In attempting to answer a question earlier, I ran into a problem that seemed like it should be simple, but I couldn't figure out.
If I have a list of dataframes:
df1 <- data.frame(a=1:3, x=rnorm(3))
df2 <- data.frame(a=1:3, x=rnorm(3))
df3 <- data.frame(a=1:3, x=rnorm(3))
df.list <- list(df1, df2, df3)
That I want to rbind together, I can do the following:
df.all <- ldply(df.list, rbind)
However, I want another column that identifies which data.frame each row came from. I expected to be able to use the deparse(substitute(x)) method (here and elsewhere) to get the name of the relevant data.frame and add a column. This is how I approached it:
fun <- function(x) {
name <- deparse(substitute(x))
x$id <- name
return(x)
}
df.all <- ldply(df.list, fun)
Which returns
a x id
1 1 1.1138062 X[[1L]]
2 2 -0.5742069 X[[1L]]
3 3 0.7546323 X[[1L]]
4 1 1.8358605 X[[2L]]
5 2 0.9107199 X[[2L]]
6 3 0.8313439 X[[2L]]
7 1 0.5827148 X[[3L]]
8 2 -0.9896495 X[[3L]]
9 3 -0.9451503 X[[3L]]
So obviously each element of the list does not contain the name I think it does. Can anyone suggest a way to get what I expected (shown below)?
a x id
1 1 1.1138062 df1
2 2 -0.5742069 df1
3 3 0.7546323 df1
4 1 1.8358605 df2
5 2 0.9107199 df2
6 3 0.8313439 df2
7 1 0.5827148 df3
8 2 -0.9896495 df3
9 3 -0.9451503 df3
Define your list with names and it should give you an .id column with the data.frame name
df.list <- list(df1=df1, df2=df2, df3=df3)
df.all <- ldply(df.list, rbind)
Output:
.id a x
1 df1 1 1.84658809
2 df1 2 -0.01177462
3 df1 3 0.58579469
4 df2 1 -0.64748756
5 df2 2 0.24384614
6 df2 3 0.59012676
7 df3 1 -0.63037679
8 df3 2 -1.17416295
9 df3 3 1.09349618
Then you can know the data.frame name from the column df.all$.id
Edit:
As per @Gary Weissman's comment, if you want to generate the names automatically you can do
names(df.list) <- paste0('df', seq_along(df.list))
Using base only, one could try something like:
dd <- lapply(seq_along(df.list), function(x) cbind(df_name = paste0('df',x),df.list[[x]]))
do.call(rbind,dd)
In your definition, df.list does not have names. Even if it did, the deparse(substitute(x)) idiom does not appear to work easily (lapply calls .Internal(lapply(X, FUN))); you would have to look at the source to see whether the object name is available and how to get it.
Something like
names(df.list) <- paste('df', 1:3, sep = '')
foo <- function(n, .list){
.list[[n]]$id <- n
.list[[n]]
}
a x id
1 1 0.8204213 a
2 2 -0.8881671 a
3 3 1.2880816 a
4 1 -2.2766111 b
5 2 0.3912521 b
6 3 -1.3963381 b
7 1 -1.8057246 c
8 2 0.5862760 c
9 3 0.5605867 c
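The snippet above defines foo, but the call that applies it is not shown. One way it might be used, assuming the intent is to iterate over the names of the list, is sketched below in base R; the x values are fixed stand-ins for the random numbers.

```r
df.list <- list(
  df1 = data.frame(a = 1:3, x = c(0.1, 0.2, 0.3)),
  df2 = data.frame(a = 1:3, x = c(0.4, 0.5, 0.6)),
  df3 = data.frame(a = 1:3, x = c(0.7, 0.8, 0.9))
)

# foo receives a name, looks the data frame up in the list and tags it
foo <- function(n, .list){
  .list[[n]]$id <- n
  .list[[n]]
}

# Iterate over the names, not over the data frames themselves
df.all <- do.call(rbind, lapply(names(df.list), foo, .list = df.list))
rownames(df.all) <- NULL
print(df.all)
```

Passing the name rather than the object is what makes the label available inside the function.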
If you want to use your function, use match.call() instead of deparse(substitute(x)); you want the second element, converted to character:
name <- as.character(match.call()[[2]])

Resources