Merging two data.frames by two columns each - r

I have a huge data.frame that I want to reorder. The idea was to split it in half (as the first half contains different information than the second half) and create a third data frame which would be the combination of the two. As I always need the first two columns of the first data frame followed by the first two columns of the second data frame, I need help.
new1<-all_cont_video_algo[,1:826]
new2<-all_cont_video_algo[,827:length(all_cont_video_algo)]
df3<-data.frame()
The new data frame should look like the following:
new3[new1[1],new1[2],new2[1],new2[2],new1[3],new1[4],new2[3],new2[4],new1[5],new1[6],new2[5],new2[6], etc.].
Pseudoalgorithmically, cbind 2 columns from data frame new1 then cbind 2 columns from data frame new2 etc.
I tried the following now (thanks to Akrun):
new1<-all_cont_video_algo[,1:826]
new2<-all_cont_video_algo[,827:length(all_cont_video_algo)]
new1<-as.data.frame(new1, stringsAsFactors =FALSE)
new2<-as.data.frame(new2, stringsAsFactors =FALSE)
df3<-data.frame()
f1 <- function(Ncol, n) {
as.integer(gl(Ncol, n, Ncol))
}
lst1 <- split.default(new1, f1(ncol(new1), 2))
lst2 <- split.default(new2, f1(ncol(new2), 2))
lst3 <- Map(function(x, y) df3[unlist(cbind(x, y))], lst1, lst2)
However, this gives me an "undefined columns selected" error.

See whether the below code helps
library(tidyverse)
# Two sample data frames of equal number of columns and rows
df1 = mtcars %>% select(-1)
df2 = diamonds %>% slice(1:32)
# get the column names
dn1 = names(df1)
dn2 = names(df2)
# create new ordered list
neworder = map(seq(1,length(dn1),2), # sequence with interval 2
~c(dn1[.x:(.x+1)], dn2[.x:(.x+1)])) %>% # a vector of two columns each
unlist %>% # flatten the list
na.omit # remove NAs arising from odd number of columns
# Get the data frame ordered
df3 = bind_cols(df1, df2) %>%
select(all_of(neworder)) # all_of() marks neworder as an external character vector

It is not clear without a reproducible example. Based on the description, we can split the dataset columns into a list of datasets and use Map to cbind the columns of corresponding datasets, unlist and use that to order the third dataset
1) Create a function to return a grouping column for splitting the dataset
f1 <- function(Ncol, n) {
as.integer(gl(Ncol, n, Ncol))
}
2) split the datasets into a list
lst1 <- split.default(df1, f1(ncol(df1), 2))
lst2 <- split.default(df2, f1(ncol(df2), 2))
3) Map through the corresponding list elements, cbind and unlist and use that to subset the columns of 'df3'
lst3 <- Map(function(x, y) df3[unlist(cbind(x, y))], lst1, lst2)
data
df1 <- as.data.frame(matrix(letters[1:10], 2, 5), stringsAsFactors = FALSE)
df2 <- as.data.frame(matrix(1:10, 2, 5))
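The interleaving described in the question (two columns from new1, then two from new2, and so on) can also be done by reordering column positions. The following is only a minimal base R sketch on toy data: the helper name interleave_blocks and the column names a1..a6 / b1..b6 are made up for illustration, and it assumes both data frames have the same number of rows and columns.

```r
# Toy stand-ins for new1 and new2 (hypothetical data, 2 rows x 6 columns each)
new1 <- as.data.frame(matrix(1:12, nrow = 2))
new2 <- as.data.frame(matrix(101:112, nrow = 2))
names(new1) <- paste0("a", 1:6)
names(new2) <- paste0("b", 1:6)

# Hypothetical helper: take n columns from x, then n from y, and repeat
interleave_blocks <- function(x, y, n = 2) {
  stopifnot(ncol(x) == ncol(y), nrow(x) == nrow(y))
  k <- ncol(x)
  combined <- cbind(x, y)
  block <- (seq_len(k) - 1) %/% n          # block id of each column in its source
  # sort by block first, then by source (x before y), then by original position
  ord <- order(c(block, block), rep(1:2, each = k), c(seq_len(k), seq_len(k)))
  combined[, ord, drop = FALSE]
}

new3 <- interleave_blocks(new1, new2)
names(new3)
# "a1" "a2" "b1" "b2" "a3" "a4" "b3" "b4" "a5" "a6" "b5" "b6"
```

Working with positions rather than cell values avoids the "undefined columns selected" error, which comes from indexing a data frame with something that is not a valid column name or position.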

Related

Obtaining a vector with sapply and use it to remove rows from dataframes in a list with lapply

I have a list with dataframes:
df1 <- data.frame(id = seq(1:10), name = LETTERS[1:10])
df2 <- data.frame(id = seq(11:20), name = LETTERS[11:20])
mylist <- list(df1, df2)
I want to remove rows from each dataframe in the list based on a condition (in this case, the value stored in column id). I create an empty vector where I will store the ids:
ids_to_remove <- c()
Then I apply my function:
sapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
a <- rows_above_th$id # obtain the ids of the rows above the threshold
ids_to_remove <- append(ids_to_remove, a) # append each id to the vector
},
simplify = T
)
However, with or without simplify = T, this returns a matrix, while my desired output (ids_to_remove) would be a vector containing the ids, like this:
ids_to_remove <- c(9,10,9,10)
Because lastly I would use it in this way on single dataframes:
for(i in 1:length(ids_to_remove)){
mylist[[1]] <- mylist[[1]] %>%
filter(!id == ids_to_remove[i])
}
And like this on the whole list (which is not working and I don't get why):
i = 1
lapply(mylist,
function(df) {
for(i in 1:length(ids_to_remove)){
df <- df %>%
filter(!id == ids_to_remove[i])
i = i + 1
}
} )
I suspect the errors are in the append part of the sapply and maybe in the indexing of the lapply. I played around a bit but still couldn't find the errors (or a better way to do this).
EDIT: original data has 70 dataframes (in a list) for a total of 2 million rows
If you are using sapply/lapply, you want to avoid trying to change the values of global variables. Instead, you should return the values you want. For example, generate a vector of IDs to remove for each item in the list, collected as a list:
ids_to_remove <- lapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
rows_above_th$id # obtain the ids of the rows above the threshold
})
And then you can use that list with your data list and mapply to iterate the two lists together
mapply(function(data, ids) {
data %>% dplyr::filter(!id %in% ids)
}, mylist, ids_to_remove, SIMPLIFY=FALSE)
Using base R
Map(\(x, y) subset(x, !id %in% y), mylist, ids_to_remove)
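To see why the original sapply attempt leaves ids_to_remove empty, here is a small base R sketch. The data mirror the question (both id columns run from 1 to 10, which is what seq(11:20) actually produces); the names ids_per_df and filtered are only illustrative.

```r
df1 <- data.frame(id = 1:10, name = LETTERS[1:10])
df2 <- data.frame(id = 1:10, name = LETTERS[11:20])
mylist <- list(df1, df2)

# Assigning with <- inside the function creates a local copy;
# the variable in the calling environment is never modified
ids_to_remove <- c()
invisible(lapply(mylist, function(df) {
  ids_to_remove <- append(ids_to_remove, df$id[df$id > 8])
}))
length(ids_to_remove)  # 0

# Returning the values instead gives one vector of ids per data frame
ids_per_df <- lapply(mylist, function(df) df$id[df$id > 8])
unlist(ids_per_df)     # 9 10 9 10

# Walk both lists in parallel and drop the matching rows
filtered <- Map(function(df, ids) df[!df$id %in% ids, , drop = FALSE],
                mylist, ids_per_df)
sapply(filtered, nrow) # 8 8
```

Using %in% also removes the need for the for loop over individual ids, which matters with 70 data frames and around 2 million rows.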

Turning a data.frame into a list of smaller data.frames in R

Suppose I have a data.frame like this (or see my code below). As you can see, after every few consecutive rows, there is a row with all NAs.
I was wondering how I could split THIS data.frame based on every row of NA?
For example, in my code below, I want my original data.frame to be split into 3 smaller data.frames as there are 2 rows of NAs in the original data.frame.
Here is what I tried with no success:
## The original data.frame:
DF <- read.csv("https://raw.githubusercontent.com/izeh/i/master/m.csv", header = T)
## the index number of rows with "NA"s; Here rows 7 and 14:
b <- as.numeric(rownames(DF[!complete.cases(DF), ]))
## split DF by rows that have "NA"s; that is rows 7 and 14:
split(DF, b)
If we also need the NA rows, create a group with cumsum on the 'study.name' column which is blank (or NA)
library(dplyr)
DF %>%
group_split(grp = cumsum(lag(study.name == "", default = FALSE)), .keep = FALSE) # `keep` is deprecated in favour of `.keep`
Or with base R
split(DF, cumsum(c(FALSE, head(DF$study.name == "", -1))))
Or with NA
i1 <- rowSums(is.na(DF))== ncol(DF)
split(DF, cumsum(c(FALSE, head(i1, -1))))
Or based on 'b'
DF1 <- DF[setdiff(seq_len(nrow(DF)), b), ]
split(DF1, as.character(DF1$study.name))
You can find occurrence of b in sequence of rows in DF and use cumsum to create groups.
split(DF, cumsum(seq_len(nrow(DF)) %in% b))
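Since the CSV from the question may not be at hand, here is a self-contained sketch of the cumsum idea on a made-up data frame. It assumes the separator rows are entirely NA and, unlike some of the variants above, drops them from the result; the names is_sep, grp and pieces are illustrative.

```r
# Toy data: rows 3 and 7 are all-NA separator rows (hypothetical values)
DF <- data.frame(
  study.name = c("A", "A", NA, "B", "B", "B", NA, "C"),
  x          = c(1, 2, NA, 3, 4, 5, NA, 6)
)

is_sep <- rowSums(is.na(DF)) == ncol(DF)  # TRUE for rows that are entirely NA
grp    <- cumsum(is_sep)                  # counter increases at each separator

# Drop the separator rows and split the rest by the running counter
pieces <- split(DF[!is_sep, ], grp[!is_sep])
sapply(pieces, nrow)  # 2 3 1
```

The original split(DF, b) fails because b holds only the separator row numbers, not one group label per row of DF.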

How to chain 2 lapply functions to subset dataframes in R?

I have a list containing 3 dataframes and another list containing 3 vectors of IDs. I'd like to subset each dataframe by checking if the IDs in the 1st dataframe match the ones in the first vector. Same for the second df and 2nd vector and 3rd df and 3rd vector. I can do it using lapply but I get a list of 3 lists, each containing a dataframe subsetted according to each of the 3 values in the list of IDs.
I want to get a list of 3 dataframes: the 1st one made of the rows in the 1st dataframe that have an id in the 1st vector of IDs, the 2nd one made of the rows in the 2nd dataframe that have an id in the 2nd vector of IDs, etc.
n <- seq(1:20)
id <- paste0("ID_", n)
df1 <-data.frame(replicate(3,sample(0:10,10,rep=TRUE)))
df1$id <- replicate(10, sample(id, 1, replace = TRUE))
df2 <-data.frame(replicate(3,sample(0:10,7,rep=TRUE)))
df2$id <- replicate(7, sample(id, 1, replace = TRUE))
df3 <-data.frame(replicate(3,sample(0:10,8,rep=TRUE)))
df3$id <- replicate(8, sample(id, 1, replace = TRUE))
list_df <- list(df1, df2, df3)
list_id <- list(c("ID_13", "ID_1", "ID_5"), c("ID_1", "ID_17", "ID_4",
"ID_9"), c("ID_12", "ID_18"))
subset_df <- lapply(list_df, function(x){
lapply(list_id, function(y) x[x$id %in% y,])
})
Thanks for your help!
As Nicola suggested, you can use Map or mapply in R. mapply takes multiple vectors/lists of the same length as parameters and passes the values at the same index in each of them to the function.
In your example, mapply will pass the 1st data frame of list_df and the 1st vector of list_id to df and id respectively, do the required processing, and then continue with i = 2, 3, ...
mapply(function(df,id){ df[df$id %in% id,]},list_df,list_id,SIMPLIFY = FALSE)
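Because the question's data are random, the pairing is easier to check on fixed values. This is a small sketch with made-up data frames, using Map (which is mapply with SIMPLIFY = FALSE); list_id is taken from the question.

```r
# Deterministic stand-ins for the three data frames (hypothetical values)
list_df <- list(
  data.frame(value = 1:4, id = c("ID_1", "ID_2", "ID_5", "ID_13")),
  data.frame(value = 5:7, id = c("ID_4", "ID_8", "ID_9")),
  data.frame(value = 8:9, id = c("ID_3", "ID_12"))
)
list_id <- list(c("ID_13", "ID_1", "ID_5"),
                c("ID_1", "ID_17", "ID_4", "ID_9"),
                c("ID_12", "ID_18"))

# The i-th data frame is paired with the i-th vector of ids only
subset_df <- Map(function(df, ids) df[df$id %in% ids, , drop = FALSE],
                 list_df, list_id)
sapply(subset_df, nrow)  # 3 2 1
```

The nested lapply in the question instead crosses every data frame with every vector, which is why it returns a list of 3 lists.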

Joining a list of data.frames with intersected genes and redundant columns into a single unique data.frame

I have a list of data.frames. Some of the data.frames are redundant and among the non-redundant ones the rows (indicated by an id column) are not identical but do overlap:
set.seed(2)
ids.1.2 <- paste0("id",sample(30,10,replace = F))
ids.3.4 <- paste0("id",sample(30,20,replace = F))
df.1 <- data.frame(id = ids.1.2,matrix(rnorm(100),10,10,dimnames = list(NULL,paste0("s.1.2:",1:10))))
df.2 <- df.1
df.3 <- data.frame(id = ids.3.4,matrix(rnorm(300),20,15,dimnames = list(NULL,paste0("s.3.4:",1:15))))
df.4 <- df.3
df.list <- list(df.1, df.2, df.3, df.4)
So in this case, df.1 and df.2 are identical, and so are df.3 and df.4, and both sets intersect on ids:
"id6" "id21" "id17" "id5" "id24" "id11" "id12"
Is there a purrr::reduce or similar way to combine this list into a single data.frame with unique columns and the intersecting id's?
I'd use:
purrr::reduce(df.list, dplyr::inner_join,by = "id")
That would work if all data.frames had unique columns, but in my case it adds the .x, .y, ... suffixes to the redundant columns.
I'm not sure if that's what you want, but I'd remove identical dataframes first and then combine the rest. It's not a pretty solution and you may adjust it here and there, but if I got it right, it gives you your desired result. You might want to include a line that removes identical combinations in the combinations dataframe, so that you can be sure that there are no errors when removing the identical dfs from your list.
library(tidyr)
library(dplyr)
# create all possible combinations
names(df.list) <- 1:length(df.list)
combinations <- crossing(names(df.list), names(df.list))
colnames(combinations) <- c("v1", "v2")
# remove self-combinations
combinations <- combinations[!combinations$v1 == combinations$v2,]
# check which cases are identical
combinations$check <- sapply(1:nrow(combinations), function(x){combinations[x,] <- identical(df.list[[combinations$v1[x]]], df.list[[combinations$v2[x]]])})
combinations <- combinations[combinations$check == T,]
# remove identical cases
for(i in 1:length(df.list)){
if(combinations$v1[i] == names(df.list)[i] & combinations$v1[i] %in% names(df.list)){df.list[i] <- NULL}
}
# combine dataframes
bind_rows(df.list)
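As a hedged alternative that stays closer to the inner-join idea from the question, one could drop the exact duplicates with identical() and then join what is left on id. This base R sketch uses tiny made-up data frames rather than the question's simulated ones, and merge() here stands in for dplyr::inner_join.

```r
# Toy list: a appears twice and b appears twice (hypothetical values)
a <- data.frame(id = c("id1", "id2", "id3"), s1 = 1:3)
b <- data.frame(id = c("id2", "id3", "id4"), s2 = 4:6)
df.list <- list(a, a, b, b)

# Keep an element only if no earlier element is identical to it
keep <- !vapply(seq_along(df.list), function(i) {
  any(vapply(seq_len(i - 1),
             function(j) identical(df.list[[i]], df.list[[j]]),
             logical(1)))
}, logical(1))

# Inner join of the remaining data frames on the shared id column
combined <- Reduce(function(x, y) merge(x, y, by = "id"), df.list[keep])
combined
# rows for id2 and id3, with columns id, s1, s2
```

With the duplicates gone there are no repeated column names left, so no .x/.y suffixes appear, and only the intersecting ids survive the join.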

Mapply to Add Column to Each Dataframe in a List

I implemented some code from a previous question:
Lapply to Add Columns to Each Dataframe in a List
Using the method above, I receive corrupt data. While I cannot provide actual data, I am wondering if additional arguments need to be implemented to prevent shuffling.
Basically, this:
library(data.table)
df1 <- data.frame(x = runif(3), y = runif(3))
df2 <- data.frame(x = runif(3), y = runif(3))
dfs <- list(df1, df2)
years <- list(2013, 2014)
a<-Map(cbind, dfs, year = years)
final<-rbindlist(a)
But when this is applied to a list of thousands of data frames, the results are incorrect. Assume that some data frames, say a df 1.5 somewhere between the two data frames above, are empty. Would that affect the order in which Map binds the years to the dfs? Essentially, I have an output with some data belonging to years other than the ones Map attached to it. I tested the length and order of the years list and compared it to the output year in final. They are identical. Any thoughts?
We create a logical index based on the length of each element in 'dfs', use that to subset both the 'dfs' and the 'years' and then do the cbind with Map
i1 <- sapply(dfs, length)>1
Or to make it more stringent
i1 <- sapply(dfs, function(x) is.data.frame(x) & !is.null(x) & length(x) >0 )
a <- Map(cbind, dfs[i1], year = years[i1])
and then do the rbindlist with fill = TRUE in case the number of columns is not the same in all the data.frames in the list.
rbindlist(a, fill = TRUE)
data
dfs[[3]] <- list(NULL)
dfs[[4]] <- data.frame()
years <- 2013:2016
Use the idcol argument to rbindlist and add the year column afterwards:
res = rbindlist(dfs, idcol=TRUE)
res[.(.id = 1:2, year = 2013:2014), on=".id", year := i.year]
X[i, on=cols, z := i.z] merges X with i on cols and then copies z from i to X.
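The filtering idea can be checked without data.table. In this minimal sketch, do.call(rbind, ...) is used as a base R stand-in for rbindlist, the data are made up, and the names keep and tagged are illustrative. The key point is that the same logical index is applied to both the data frames and the years, so each surviving data frame keeps its own year.

```r
dfs <- list(
  data.frame(x = c(0.1, 0.2), y = c(1, 2)),
  data.frame(),                       # empty element in the middle
  data.frame(x = 0.3, y = 3)
)
years <- c(2013, 2014, 2015)

# TRUE only for real data frames that have at least one row
keep <- vapply(dfs, function(d) is.data.frame(d) && nrow(d) > 0, logical(1))

# Subset both inputs with the same index before pairing them
tagged <- Map(function(d, yr) cbind(d, year = yr), dfs[keep], years[keep])
final  <- do.call(rbind, tagged)
final$year  # 2013 2013 2015
```

Map pairs its inputs strictly by position, so a mismatch between data and year is more likely to come from one input being filtered or reordered without the other than from Map itself.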
