How to compare elements of two large datasets as efficiently as possible? - r

I am an R amateur and learning slowly. Here is the situation:
I have two data frames, each with four columns and more than 10,000 rows, that look like this:
df1: df2:
Nº x y attr Nº x y attr
1 45 34 X 1 34 23 x
1 48 45 XX 4 123 45 x
1 41 23 X 4 99 69 xx
4 23 12 X 4 112 80 xx
4 28 16 X 5 78 80 x
5 78 80 XXX 5 69 74 xx
...
I would like to compare both data frames based on x,y (coordinates) to delete in df1 all the values that also appear in df2 (all the values/coordinates that are contained in both datasets, delete them in df1).
So in my example, the last row of df1 would be deleted because the same coordinates are in df2.
What I am doing now is a double for() loop, one loop over each dataset, comparing every pair of values one by one.
I know this is extremely inefficient, and it takes a lot of time as the amount of data grows.
What other ways are there to do this?
There are probably functions for this, but I generally don't know how to use them well and they give me problems.
Thank you very much!!

A library(data.table) method (df1 and df2 must be data.tables, e.g. via setDT() or fread()):
df1[fsetdiff(df1[, .(x,y)] , df2[, .(x,y)] ), on=c('x','y')]
# Nº x y attr
#1: 1 45 34 X
#2: 1 48 45 XX
#3: 1 41 23 X
#4: 4 23 12 X
#5: 4 28 16 X

Not the most elegant solution but gets the job done:
df2 = fread('Nº x y attr
1 34 23 x
4 123 45 x
4 99 69 xx
4 112 80 xx
5 78 80 x
5 69 74 xx')
df1 = fread('Nº x y attr
1 45 34 X
1 48 45 XX
1 41 23 X
4 23 12 X
4 28 16 X
5 78 80 XXX')
df1[!stringr::str_c(df1$x, df1$y, sep="_") %in% stringr::str_c(df2$x, df2$y, sep="_"),]
Nº x y attr
1: 1 45 34 X
2: 1 48 45 XX
3: 1 41 23 X
4: 4 23 12 X
5: 4 28 16 X
Explanation:
It's best to use vectorised functions rather than loops. !stringr::str_c(df1$x, df1$y, sep="_") %in% stringr::str_c(df2$x, df2$y, sep="_") concatenates the x and y columns into a string and then finds elements from df1 that aren't in df2. This creates a logical vector of TRUE FALSE values which we can then use to subset df1.
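For readers without stringr, the same key-matching idea can be sketched in base R. This is a minimal illustration rather than the answerer's code: the column name N stands in for Nº, and key1, key2 and keep are hypothetical names introduced here.

```r
# Sample data from the question (N stands in for the "Nº" column)
df1 <- data.frame(N = c(1, 1, 1, 4, 4, 5),
                  x = c(45, 48, 41, 23, 28, 78),
                  y = c(34, 45, 23, 12, 16, 80),
                  attr = c("X", "XX", "X", "X", "X", "XXX"))
df2 <- data.frame(N = c(1, 4, 4, 4, 5, 5),
                  x = c(34, 123, 99, 112, 78, 69),
                  y = c(23, 45, 69, 80, 80, 74),
                  attr = c("x", "x", "xx", "xx", "x", "xx"))

# Build one key per row. The separator matters: without it,
# (1, 23) and (12, 3) would both collapse to "123".
key1 <- paste(df1$x, df1$y, sep = "_")
key2 <- paste(df2$x, df2$y, sep = "_")

# TRUE for rows of df1 whose (x, y) pair does not occur in df2
keep <- !(key1 %in% key2)
result <- df1[keep, ]
result
```

With the question's data, only the row with coordinates (78, 80) is dropped, since that pair also occurs in df2.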
EDIT:
I was curious to see whether my answer or @dww's answer was faster:
library(microbenchmark)
n=100000
df1 = data.table(x = sample(n), y=sample(n))
df2 = data.table(x = sample(n), y=sample(n))
microbenchmark(
df1[!stringr::str_c(df1$x, df1$y, sep="_") %in% stringr::str_c(df2$x, df2$y, sep="_"),],
df1[fsetdiff(df1[, .(x,y)] , df2[, .(x,y)] ), on=c('x','y')]
)
# Unit: milliseconds
# expr (1): df1[!stringr::str_c(df1$x, df1$y, sep = "_") %in% stringr::str_c(df2$x, df2$y, sep = "_"), ]
# expr (2): df1[fsetdiff(df1[, .(x, y)], df2[, .(x, y)]), on = c("x", "y")]
# expr       min        lq      mean    median        uq      max neval
#  (1) 168.40953 199.37183 219.30054 209.61414 222.08134 364.3458   100
#  (2)  41.07557  42.67679  52.34855  44.34379  59.27378 152.1283   100
Seems like the data.table version of dww is ~5x faster.

3 lines of code
#generate sample data
x1 <- sample(1:50,9001, T)
y1 <- sample(1:50,9001, T)
x2 <- sample(1:50,9001, T)
y2 <- sample(1:50,9001, T)
df1 <- data.frame(id =1:9001, x1,y1, stringsAsFactors = F)
df2 <- data.frame(id =1:9001, x2,y2, stringsAsFactors = F)
#add a match column to each dataframe
df1$match <- paste(df1$x1, df1$y1)
df2$match <- paste(df2$x2, df2$y2)
#overwrite df1 with the data of df1 that does not appear in df2
df1 <- df1[!df1$match %in% df2$match,]

Related

R apply a vector of functions to a dataframe

I am currently working on a dataframe with raw numeric data in cols. Every col contains data for one parameter (for example gene expression data of gene xyz) while each row contains a subject. Some of the data in the cols are normally distributed, while some are far from it. I ran Shapiro tests using apply with margin 2 for different transformations and then picked suitable transformations by comparing shapiro.test()$p.value. I stored my picks as character strings in a vector, giving me a vector of NA, log10, sqrt with length ncol(DataFrame). I now wonder if it is possible to apply the vector to the data frame via an apply function or, if necessary, a for loop. How do I do this, or is there a better way? I guess I could loop over if-else statements, but there has to be a more efficient way because my code is already slow.
Thanks all!
Update: I tried the code below but it is giving me "Error in file(filename, "r") : invalid 'description' argument"
TransformedExampleDF <- apply(exampleDF, 2 , function(x) eval(parse(paste(transformationVector , "(" , x , ")" , sep = "" ))))
exampleDF <- as.data.frame(matrix(c(1,2,3,4,1,10,100,1000,0.1,0.2,0.3,0.4), ncol=3, nrow = 4))
transformationVector <- c(NA, "log10", NA)
So you could do something like this. In the example below, I've cooked up four random functions whose names I've then stored in the character vector my_func_list (Note: the last function converts data to NA; that is intentional).
Then, I created another function func_to_df() that accepts the data.frame and the vector of function names (func_list) as inputs, and applies (i.e., executes using get()) each function to the corresponding column of the data.frame. The output is returned (and in this example, is stored in the data.frame my_df1).
tl;dr: just look at what func_to_df() does. It might also be worthwhile looking into the purrr package (although it hasn't been used here).
#---------------------
#Example function 1
myaddtwo <- function(x){
if(is.numeric(x)){
x = x+2
} else{
warning("Input must be numeric!")
}
return(x)
#Constraints such as the one shown above
#can be added elsewhere to prevent
#inappropriate action
}
#Example function 2
mymulttwo <- function(x){
return(x*2)
}
#Example function 3
mysqrt <- function(x){
return(sqrt(x))
}
#Example function 4
myna <- function(x){
return(NA)
}
#---------------------
#Dummy data
my_df <- data.frame(
matrix(sample(1:100, 40, replace = TRUE),
nrow = 10, ncol = 4),
stringsAsFactors = FALSE)
#User somehow ascertains that
#the following order of functions
#is the right one to be applied to the data.frame
my_func_list <- c("myaddtwo", "mymulttwo", "mysqrt", "myna")
#---------------------
#A function which applies
#the functions from func_list
#to the columns of df
func_to_df <- function(df, func_list){
for(i in 1:length(func_list)){
df[, i] <- get(func_list[i])(df[, i])
#Alternative to get()
#df[, i] <- eval(as.name(func_list[i]))(df[, i])
}
return(df)
}
#---------------------
#Execution
my_df1 <- func_to_df(my_df, my_func_list)
#---------------------
#Output
my_df
# X1 X2 X3 X4
# 1 8 85 6 41
# 2 45 7 8 65
# 3 34 80 16 89
# 4 34 62 9 31
# 5 98 47 51 99
# 6 77 28 40 72
# 7 24 7 41 46
# 8 45 80 75 30
# 9 93 25 39 72
# 10 68 64 87 47
my_df1
# X1 X2 X3 X4
# 1 10 170 2.449490 NA
# 2 47 14 2.828427 NA
# 3 36 160 4.000000 NA
# 4 36 124 3.000000 NA
# 5 100 94 7.141428 NA
# 6 79 56 6.324555 NA
# 7 26 14 6.403124 NA
# 8 47 160 8.660254 NA
# 9 95 50 6.244998 NA
# 10 70 128 9.327379 NA
#---------------------
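Applied to the question's own exampleDF and transformationVector, the same idea can be sketched with Map() and match.fun(). The helper name apply_transformations is hypothetical, and treating NA as "leave the column unchanged" is an assumption about what the asker intends.

```r
exampleDF <- as.data.frame(matrix(c(1, 2, 3, 4,
                                    1, 10, 100, 1000,
                                    0.1, 0.2, 0.3, 0.4),
                                  ncol = 3, nrow = 4))
transformationVector <- c(NA, "log10", NA)

# Hypothetical helper: one function name (or NA) per column
apply_transformations <- function(df, fun_names) {
  stopifnot(length(fun_names) == ncol(df))
  # Map() walks over columns and function names in parallel
  df[] <- Map(function(column, fun_name) {
    if (is.na(fun_name)) {
      column                       # assumption: NA means "no transformation"
    } else {
      match.fun(fun_name)(column)  # look the function up by name and call it
    }
  }, df, fun_names)
  df
}

transformedDF <- apply_transformations(exampleDF, transformationVector)
transformedDF
```

Here only the second column is transformed (log10 of 1, 10, 100, 1000), and no eval(parse()) is needed.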

Merge data frames without Merge Function

I have two data frames. The first is height.txt:
ID: 1 2 3 4 5
Height: 67 60 62 55 69
and the second is weight.txt:
ID: 1 2 4 5 6
Weight: 110 123 150 170 185
The goal is to merge these two data frames without using the merge() function in R; the output should look like the image. How would I do this? This is practice for a class: I know merge() does the trick, but I am supposed to do it without it.
Edit:
Data in a copy&paste format.
ID <- scan(text = "1 2 3 4 5")
Height <- scan(text = "67 60 62 55 69")
df1 <- data.frame(ID, Height)
ID <- scan(text = "1 2 4 5 6")
Weight <- scan(text = "110 123 150 170 185")
df2 <- data.frame(ID, Weight)
It's a simple repeated use of match.
Create a data.frame with all the elements of the common column, ID, with no repetitions.
match the ID's of each of the dataframes with the ID of the result res.
Assign the other columns.
Remember to create each of the other columns before assigning values to them.
res <- data.frame(ID = unique(c(df1$ID, df2$ID)))
i <- match(df1$ID, res$ID)
j <- na.omit(match(res$ID, df1$ID))
res$Height <- NA
res$Height[i] <- df1$Height[j]
i <- match(df2$ID, res$ID)
j <- na.omit(match(res$ID, df2$ID))
res$Weight <- NA
res$Weight[i] <- df2$Weight[j]
res
# ID Height Weight
#1 1 67 110
#2 2 60 123
#3 3 62 NA
#4 4 55 150
#5 5 69 170
#6 6 NA 185
identical(res, merge(df1, df2, all = TRUE))
#[1] TRUE
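A more compact variant of the same match() idea, offered as a sketch: look up each ID of the combined key directly in each source data frame. It assumes the ID values are unique within each data frame; ids and res2 are names introduced here for illustration.

```r
df1 <- data.frame(ID = c(1, 2, 3, 4, 5), Height = c(67, 60, 62, 55, 69))
df2 <- data.frame(ID = c(1, 2, 4, 5, 6), Weight = c(110, 123, 150, 170, 185))

# All IDs that occur in either data frame, sorted
ids <- sort(unique(c(df1$ID, df2$ID)))

# match() gives the row position of each ID in the source, or NA if absent;
# indexing with NA yields NA, which fills the gaps automatically
res2 <- data.frame(ID = ids,
                   Height = df1$Height[match(ids, df1$ID)],
                   Weight = df2$Weight[match(ids, df2$ID)])
res2
```

ID 3 ends up with a missing Weight and ID 6 with a missing Height, as in a full outer join.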
Edit.
Answering to a question in a comment about how general this solution is. From help("merge"):
Details
merge is a generic function whose principal method is for data frames:
the default method coerces its arguments to data frames and calls the
"data.frame" method.
The method merge.data.frame in R 3.6.2 is 158 lines of code long, so this solution is not general at all.
Edit 2.
A function generalizing the code above could be the following.
merge_by_one_col <- function(X, Y, col = "ID"){
common <- unique(c(X[[col]], Y[[col]]))
res <- data.frame(common)
names(res) <- col
i <- match(X[[col]], res[[col]])
j <- na.omit(match(res[[col]], X[[col]]))
for(new in setdiff(names(X), col)){
res[[new]] <- NA
res[[new]][i] <- X[[new]][j]
}
i <- match(Y[[col]], res[[col]])
j <- na.omit(match(res[[col]], Y[[col]]))
for(new in setdiff(names(Y), names(res))){
res[[new]] <- NA
res[[new]][i] <- Y[[new]][j]
}
res
}
merge_by_one_col(df1, df2)
I used cbind after rbinding the missing IDs from each data frame and sorting by ID.
df1_ <- rbind(df1, data.frame(ID=setdiff(df2$ID, df1$ID), Height=NA))
df2_ <- rbind(df2, data.frame(ID=setdiff(df1$ID, df2$ID), Weight=NA))
cbind(df1_[order(df1_$ID),], df2_[order(df2_$ID), -1, drop=FALSE])
ID Height Weight
1 1 67 110
2 2 60 123
3 3 62 NA
4 4 55 150
5 5 69 170
6 6 NA 185
Edit:
Generalizing so that no column names are required (except the "by" column "ID")
n1 <- setdiff(df1$ID, df2$ID); n1
n2 <- setdiff(df2$ID, df1$ID); n2
#one all-NA row for every ID that is missing from each data frame
df1a <- df1[rep(nrow(df1)+1, length(n2)),]; df1a
df2a <- df2[rep(nrow(df2)+1, length(n1)),]; df2a
df1a$ID <- n2
df2a$ID <- n1
df1_ <- rbind(df1, df1a)
df2_ <- rbind(df2, df2a)
res <- cbind(df1_[order(df1_$ID),], df2_[order(df2_$ID), -1, drop=FALSE])
rownames(res) <- 1:nrow(res)
res
ID Height Weight
1 1 67 110
2 2 60 123
3 3 62 NA
4 4 55 150
5 5 69 170
6 6 NA 185
Edit 2: Using rbind.fill from the plyr package:
library(plyr)
df1_ <- rbind.fill(df1, data.frame(ID=setdiff(df2$ID, df1$ID)))
df2_ <- rbind.fill(df2, data.frame(ID=setdiff(df1$ID, df2$ID)))
res <- cbind(df1_[order(df1_$ID),], df2_[order(df2_$ID), -1, drop=FALSE])
identical(res, merge(df1, df2, all=TRUE))
# TRUE

Create list of data.frames with specific rows from list of data.frames

I have a number of data.frames stored in a list (list1) and would like to create a new list (list2) with data.frames where the first contains all the first rows of the data.frames in list1, second contains all the second rows etc. Here is an example:
set.seed(42)
df1 <- data.frame(a=sample.int(100,3),b=sample.int(100,3))
df2 <- data.frame(a=sample.int(100,3),b=sample.int(100,3))
list1 <- list(df1,df2)
list1
[[1]]
a b
1 92 84
2 93 64
3 29 51
[[2]]
a b
1 74 71
2 14 46
3 65 100
From that I would like to create list 2 that should be as follows:
[[1]]
a b
1 92 84
2 74 71
[[2]]
a b
1 93 64
2 14 46
[[3]]
a b
1 29 51
2 65 100
What would be an efficient way to do this in R?
If all of them have the same number of rows:
nr <- nrow(list1[[1]])
lapply(seq_len(nr), function(i) do.call(rbind, lapply(list1, function(x) x[i,])))
Another option is to bind them into a single data.frame, create a row sequence by group, and split on it. This also takes care of lists whose data.frames have unequal numbers of rows:
library(dplyr)
library(data.table)
bind_rows(list1, .id = 'grp') %>%
mutate(rn = rowid(grp)) %>%
{split(.[c('a', 'b')], .$rn)}
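The bind-then-split idea can also be sketched in base R alone. This is an illustration under the assumption that all data frames share the same columns; combined and row_in_source are hypothetical names, and the data are typed in from the question's printed example rather than regenerated with sample.int().

```r
list1 <- list(data.frame(a = c(92, 93, 29), b = c(84, 64, 51)),
              data.frame(a = c(74, 14, 65), b = c(71, 46, 100)))

# Stack all data frames on top of each other
combined <- do.call(rbind, list1)

# For each stacked row, its row position within its original data frame
# (1, 2, 3, 1, 2, 3 here); this also works with unequal row counts
row_in_source <- sequence(vapply(list1, nrow, integer(1)))

# One data frame per original row position
list2 <- split(combined, row_in_source)
list2
```

The row names of the pieces come from the stacked data frame, so they can be reset with rownames(x) <- NULL if that matters.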
Assuming that all the dataframes have the same number of rows and columns, the following works:
split(do.call(rbind, list1),
rep(seq_len(nrow(list1[[1]])), length(list1)))
# $`1`
# a b
# 1 92 84
# 2 74 71
#
# $`2`
# a b
# 1 93 64
# 2 14 46
#
# $`3`
# a b
# 1 29 51
# 2 65 100
d = lapply(list1, asplit, 1)
Map(rbind, d[[1]], d[[2]])

Read multiple csvs in loop and write as columns in a master csv

Let's say I have two .csv tables (I actually have hundreds):
Table 1
x mean_snowcover useless_data
1 80 6546156
2 50 6285465
3 60 2859525
Table 2
x mean_snowcover useless_data
1 91 87178
2 89 987189
3 88 879278927
I want a new table that looks like this:
Mean Snowcover
x Table_1 Table_2
1 80 91
2 50 89
3 60 88
This is my current code:
setwd("C:/Users/evan/Desktop/Finished Data SGMA/test")
master1=read.csv("New folder/AllSGMA.csv")
temp = list.files(pattern="*.csv$",recursive=FALSE)
###READ CSVS IN LOOP###
for(x in 1:length(temp)){
mycsv = read.csv(temp[x])
mean_snowcover=mycsv$mean_snowcover
master2=cbind(master1,mean_snowcover)
}
write.csv(master2,"Mean Snowcover.csv")
But the output is a blank table. I've looked at similar questions on stack overflow but I am unable to figure out what I need to change. I'm fairly new to R.
You can use Reduce and dplyr::left_join:
df1 <- read.table(text =
"x mean_snowcover useless_data
1 80 6546156
2 50 6285465
3 60 2859525", header = T)
df2 <- read.table(text =
"x mean_snowcover useless_data
1 91 87178
2 89 987189
3 88 879278927", header = T)
library(dplyr);
library(magrittr);
Reduce(function(x,y)
left_join(x, y, by = "x") %>% select(x, contains("snowcover")), list(df1, df2))
# x mean_snowcover.x mean_snowcover.y
# 1 1 80 91
# 2 2 50 89
# 3 3 60 88
This will work on any number of data.frames, as long as they share a common x column, and you put them all in a list, i.e.
lst <- list(df1, df2, df3, ....)
Reduce(function(x,y)
left_join(x, y, by = "x") %>% select(x, contains("snowcover")), lst)
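The loop in the question overwrites master2 on every pass instead of accumulating columns. A minimal base R sketch of reading every file and collecting one column from each is shown below. The temporary directory and the file names Table_1.csv and Table_2.csv are assumptions made only so the example is self-contained, and the sketch assumes every file lists the same x values in the same order.

```r
# Write the two example tables to a scratch directory (illustration only)
tmp_dir <- file.path(tempdir(), "snowcover_demo")
dir.create(tmp_dir, showWarnings = FALSE)
write.csv(data.frame(x = 1:3, mean_snowcover = c(80, 50, 60),
                     useless_data = c(6546156, 6285465, 2859525)),
          file.path(tmp_dir, "Table_1.csv"), row.names = FALSE)
write.csv(data.frame(x = 1:3, mean_snowcover = c(91, 89, 88),
                     useless_data = c(87178, 987189, 879278927)),
          file.path(tmp_dir, "Table_2.csv"), row.names = FALSE)

# Read every CSV; note that pattern is a regular expression, not a glob
files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(files, read.csv)
names(tables) <- tools::file_path_sans_ext(basename(files))

# Keep x from the first table plus one mean_snowcover column per file
master <- data.frame(x = tables[[1]]$x,
                     lapply(tables, function(tbl) tbl$mean_snowcover))
master
```

If the files can differ in their x values, joining on x as in the Reduce() approach is the safer choice.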

grouping data with the same name and applying function

I have a matrix like the one below.
I want to group the columns that share the same name and apply a function to the rows of the matrix.
>data
A A A B B C
gene1 1 6 11 16 21 26
gene2 2 7 12 17 22 27
gene3 3 8 13 18 23 28
gene4 4 9 14 19 24 29
gene5 5 10 15 20 25 30
Basically, I want to put columns with the same name into groups (A into group 1, B into group 2, ...) and then run a t-test for each gene across the groups.
Can anybody help me with this? First the grouping, then the t-test, which should return a t score for each gene between the different groups.
The OP hasn't mentioned what form they want in their output, but I'm entirely updating this answer with a possible solution.
First, some reproducible sample data to work with (that will actually work with t.test).
set.seed(1)
mymat <- matrix(sample(100, 40, replace = TRUE),
ncol = 8, dimnames = list(
paste("gene", 1:5, sep = ""),
c("A", "A", "A", "B", "B", "B", "C", "C")))
mymat
# A A A B B B C C
# gene1 27 90 21 50 94 39 49 67
# gene2 38 95 18 72 22 2 60 80
# gene3 58 67 69 100 66 39 50 11
# gene4 91 63 39 39 13 87 19 73
# gene5 21 7 77 78 27 35 83 42
I've left all the hard work to the combn function. Within the combn function, I've made use of the FUN argument to add a function that creates a vector of the t.test "statistic" by each row (I'm assuming one gene per row). I've also added an attribute to the resulting vector to remind us which columns were used in calculating the statistic.
temp <- combn(unique(colnames(mymat)), 2, FUN = function(x) {
out <- vector(length = nrow(mymat))
for (i in sequence(nrow(mymat))) {
out[i] <- t.test(mymat[i, colnames(mymat) %in% x[1]],
mymat[i, colnames(mymat) %in% x[2]])$statistic
}
attr(out, "NAME") <- paste(x, collapse = "")
out
}, simplify = FALSE)
The output of the above is a list of vectors. It might be more convenient to convert this into a matrix. Since we know that each value in a vector represents one row, and each vector overall represents one column value combination (AB, AC, or BC), we can use that for the dimnames of the resulting matrix.
DimNames <- list(rownames(mymat), sapply(temp, attr, "NAME"))
final <- do.call(cbind, temp)
dimnames(final) <- DimNames
final
# AB AC BC
# gene1 -0.5407966 -0.5035088 0.157386919
# gene2 0.5900350 -0.7822292 -1.645448267
# gene3 -0.2040539 1.7263502 1.438525163
# gene4 0.6825062 0.5933218 0.009627409
# gene5 -0.4384258 -0.9283003 -0.611226402
Some manual verification:
## Should be the same as final[1, "AC"]
t.test(mymat[1, colnames(mymat) %in% "A"],
mymat[1, colnames(mymat) %in% "C"])$statistic
# t
# -0.5035088
## Should be the same as final[5, "BC"]
t.test(mymat[5, colnames(mymat) %in% "B"],
mymat[5, colnames(mymat) %in% "C"])$statistic
# t
# -0.6112264
## Should be the same as final[3, "AB"]
t.test(mymat[3, colnames(mymat) %in% "A"],
mymat[3, colnames(mymat) %in% "B"])$statistic
# t
# -0.2040539
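The grouping step on its own can be sketched with split() on the column names. This is an illustration, not part of the answer's code: groups and t_stat are hypothetical names, and the matrix is typed in from the printed mymat so the result does not depend on the random number generator.

```r
mymat <- matrix(c(27, 38, 58, 91, 21,   90, 95, 67, 63, 7,
                  21, 18, 69, 39, 77,   50, 72, 100, 39, 78,
                  94, 22, 66, 13, 27,   39, 2, 39, 87, 35,
                  49, 60, 50, 19, 83,   67, 80, 11, 73, 42),
                ncol = 8,
                dimnames = list(paste0("gene", 1:5),
                                c("A", "A", "A", "B", "B", "B", "C", "C")))

# Column positions grouped by column name: A -> 1:3, B -> 4:6, C -> 7:8
groups <- split(seq_len(ncol(mymat)), colnames(mymat))

# Hypothetical helper: t statistic per gene (row) between two groups
t_stat <- function(g1, g2) {
  apply(mymat, 1, function(row) {
    unname(t.test(row[groups[[g1]]], row[groups[[g2]]])$statistic)
  })
}

ab <- t_stat("A", "B")
ab
```

For gene1 this gives the same A versus B statistic as the first column of final above.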
Update
Building on @EDi's answer, here's another approach. It makes use of melt from "reshape2" to convert the data into a "long" format. From there, as before, it's pretty straightforward subsetting work to get what you want. The output here is transposed relative to the output of the pure combn approach, but the values are the same.
library(reshape2)
mymatL <- melt(mymat)
byGene <- split(mymatL, mymatL$Var1)
RowNames <- combn(unique(as.character(mymatL$Var2)), 2,
FUN = paste, collapse = "")
out <- sapply(byGene, function(combos) {
combn(unique(as.character(mymatL$Var2)), 2, FUN = function(x) {
t.test(value ~ Var2, combos[combos[, "Var2"] %in% x, ])$statistic
}, simplify = TRUE)
})
rownames(out) <- RowNames
out
# gene1 gene2 gene3 gene4 gene5
# AB -0.5407966 0.5900350 -0.2040539 0.682506188 -0.4384258
# AC -0.5035088 -0.7822292 1.7263502 0.593321770 -0.9283003
# BC 0.1573869 -1.6454483 1.4385252 0.009627409 -0.6112264
The first option is considerably faster, at least on this smaller dataset:
microbenchmark(fun1(), fun2())
# Unit: milliseconds
# expr min lq median uq max neval
# fun1() 8.812391 9.012188 9.116896 9.20795 17.55585 100
# fun2() 42.754296 43.388652 44.263760 45.47216 67.10531 100
