I'm having an issue with the speed of using for loops to cross-reference two data frames. The overall aim is to identify rows in data frame 2 that lie between coordinates specified in data frame 1 (and meet other criteria). e.g. df1:
chr start stop strand
1 chr1 179324331 179327814 +
2 chr21 45176033 45182188 +
3 chr5 126887642 126890780 +
4 chr5 148730689 148734146 +
df2:
chr start strand
1 chr1 179326331 +
2 chr21 45175033 +
3 chr5 126886642 +
4 chr5 148729689 +
My current code for this is:
found_log <- NULL   # results accumulate here
for (index in 1:nrow(df1)) {
  found_miRNAs <- ""
  curr_row <- df1[index, ]
  for (index2 in 1:nrow(df2)) {
    curr_target <- df2[index2, ]
    if (curr_row$chr == curr_target$chr &
        curr_row$start < curr_target$start &
        curr_row$stop > curr_target$start &
        curr_row$strand == curr_target$strand) {
      found_miRNAs <- paste(found_miRNAs, curr_target$start, sep = ":")
    }
  }
  curr_row$miRNAs <- found_miRNAs
  found_log <- rbind(found_log, curr_row)
}
My actual data frames are 400 lines for df1 and > 100 000 lines for df2, and I am hoping to do 500 iterations, so, as you can imagine, this is unworkably slow. I'm relatively new to R, so any hints about functions that might make this more efficient would be great.
Maybe not fast enough, but probably faster and a lot easier to read:
df1 <- data.frame(foo=letters[1:5], start=c(1,3,4,6,2), end=c(4,5,5,9,4))
df2 <- data.frame(foo=letters[1:5], start=c(3,2,5,4,1))
where <- sapply(df2$start, function (x) which(x >= df1$start & x <= df1$end))
This will give you a list of the relevant rows in df1 for each row in df2. I just tried it with 500 rows in df1 and 50000 in df2. It finished in a second or two.
To add criteria, change the inner function within sapply. If you then want to put where into your second data frame, you could do e.g.
df2$matching_rows <- sapply(where, paste, collapse=":")
But you probably want to keep it as a list, which is a natural data structure for it.
Actually, you can even have a list column in the data frame:
df2$matching_rows <- where
though this is quite unusual.
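Applied back to the original chromosome data, the extra criteria go inside that same inner function; a minimal sketch, assuming df1 has the columns chr, start, stop, strand and df2 has chr, start, strand as shown in the question:
# For each df2 row, find the df1 intervals that contain it on the same
# chromosome and strand (strict inequalities, as in the original loop).
where <- lapply(seq_len(nrow(df2)), function(i) {
  which(df1$chr    == df2$chr[i] &
        df1$strand == df2$strand[i] &
        df1$start  <  df2$start[i] &
        df1$stop   >  df2$start[i])
})
df2$matching_rows <- sapply(where, paste, collapse = ":")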
You've run into two of the most common mistakes people make when coming to R from another programming language: using for loops instead of vectorised operations, and dynamically appending to a data object. I'd suggest that as you get more fluent you take some time to read Patrick Burns' R Inferno; it provides some interesting insight into these and other problems.
As @David Arenburg and @zx8754 have pointed out in the comments above, there are specialized packages that can solve the problem, and the data.table package and @David's approach can be very efficient for larger datasets. But for your case, base R can do what you need very efficiently as well. I'll document one approach here, with a few more steps than necessary for clarity, just in case you're interested:
set.seed(1001)
ranges <- data.frame(beg=rnorm(400))
ranges$end <- ranges$beg + 0.005
test <- data.frame(value=rnorm(100000))
## Add an ID field for duplicate removal:
test$ID <- 1:nrow(test)
## This is where you'd set your criteria. The apply() function is just
## a wrapper for a for() loop over the rows in the ranges data.frame:
out <- apply(ranges, MARGIN = 1, function(x) test[(x[1] < test$value & x[2] > test$value), "ID"])
selected <- unlist(out)
selected <- unique( selected )
selection <- test[ selected, ]
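If you do reach for data.table as mentioned above, interval overlap joins are what its foverlaps() function is for; a rough sketch, assuming the question's df1/df2 column names rather than the simulated ranges/test data used here:
library(data.table)

dt1 <- as.data.table(df1)        # chr, start, stop, strand
dt2 <- as.data.table(df2)        # chr, start, strand
dt2[, stop := start]             # a point becomes a zero-width interval
setkey(dt1, chr, strand, start, stop)

# All df2 points falling within a df1 interval on the same chr/strand.
# Note: foverlaps() treats interval ends as inclusive, unlike the
# strict < / > comparisons in the original loop.
hits <- foverlaps(dt2, dt1, type = "within", nomatch = 0L)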
I have managed to do a chi-squared test using a loop in R, but it is very slow for large data, and I wonder if you could help me do it faster with something like dplyr. I've tried with dplyr, but I kept getting an error and I'm not sure of the reason.
Here is a short example of my data:
df
1 2 3 4 5
row_1 2260.810 2136.360 3213.750 3574.750 2383.520
row_2 328.050 496.608 184.862 383.408 151.450
row_3 974.544 812.508 1422.010 1307.510 1442.970
row_4 2526.900 826.197 1486.000 2846.630 1486.000
row_5 2300.130 2499.390 1698.760 1690.640 2338.640
row_6 280.980 752.516 277.292 146.398 317.990
row_7 874.159 794.792 1033.330 2383.420 748.868
row_8 437.560 379.278 263.665 674.671 557.739
row_9 1357.350 1641.520 1397.130 1443.840 1092.010
row_10 1749.280 1752.250 3377.870 1534.470 2026.970
cs
1 1 1 2 1 2 2 1 2 3
What I want to do is run a chi-squared test between each row of df and cs, and get the statistic and p-value along with the row names.
Here is my code for the loop:
value = matrix(nrow=ncol(df),ncol=3)
for (i in 1:ncol(df)) {
tst <- chisq.test(df[i,], cs)
value[i,1] <- tst$p.value
value[i,2] <- tst$statistic
value[i,3] <- rownames(df)[i]}
Thanks for your help.
I guess you do want to do this column by column. Knowing the structure of Biobase::exprs(PANCAN_w) would have helped greatly. Even better would have been to use an example from the Biobase package instead of a dataset that cannot be found.
This is an implementation of the code I might have used. Note: you do NOT want to use a matrix to store results if you are expecting a mixture of numeric and character values. You would be coercing all the numerics to character:
value = data.frame(p_val = NA, stat = NA, exprs = rownames(df))
for (i in 1:ncol(df)) {
  # tbl <- table((df[i,]), cs) ### No use seen for this
  # I changed the indexing in the next line to compare columns to the standard `cs`.
  tst <- chisq.test(df[, i], cs)  # chisq.test is not vectorized, need some sort of loop
  value[i, 1:2] <- tst[c('p.value', 'statistic')]  # one assignment per row
}
Obviously, you would need to change every instance of df (not a great name, since there is also a df function) to Biobase::exprs(PANCAN_w).
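If you'd rather not write the loop yourself, the same column-by-column tests can be collected with lapply; a minimal sketch, assuming df and cs as shown in the question (the label column uses colnames, since it is the columns that get tested):
# One chisq.test per column of df against cs, collected into one data.frame.
res <- do.call(rbind, lapply(seq_len(ncol(df)), function(i) {
  tst <- chisq.test(df[, i], cs)
  data.frame(p_val = tst$p.value,
             stat  = unname(tst$statistic),
             exprs = colnames(df)[i])
}))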
I have a dataset/table (called behavioural) with data from 24 participants - these are named in the format: 's1' to 's24'.
The first 6 rows in the table/dataset:
head(behavioural)[c(1,17)]
subj recognition_order
1 s1 2
2 s1 6
3 s1 7
4 s1 8
5 s1 9
6 s1 10
I want to create a subset for each participant and order each of these subsets by the variable recognition_order
I have created a loop to do this:
behavioural <- read.table("*my file path*\behavioral.txt", header = TRUE)
subj_counter <- 1
for(i in 1:24) {
subject <- paste("s", subj_counter, sep = "")
subset_name <- paste(subject, "_subset", sep="")
[subset_name] <- behavioural[which(behavioural$subj == subject), ]
[subset_name] <- subset_name[order(subset_name$recognition_order),]
subj_counter = subj_counter + 1
print(subset_name)
print(subj_counter)
}
And I'm pretty sure the logic is solid, except when I run the loop, it does not create 24 subsets. It just creates one: s24_subset.
What do I need to do to the bit before "<-" in these 2 lines of code?
[subset_name] <- behavioural[which(behavioural$subj == subject), ]
[subset_name] <- subset_name[order(subset_name$recognition_order),]
Because [subset_name] isn't working.
I want the [subset_name] to be dynamic - i.e. each time the loop runs, its value changes and it creates a new subset/variable each time.
I have seen things online about the assign() function, but I'm not quite sure how to implement it in my loop.
Thank you in advance!
If you want to order the items inside the results of a split, then just use lapply to apply the ordering to one data frame at a time (the ordered pieces are re-bundled together into a list by lapply):
my_split_list <- split(behavioural, behavioural$subj)
ord.list <- lapply(my_split_list, function(d) {
                     d[order(d[['recognition_order']]), ]
                   })
This is a common paradigm called "split-apply-combine": "The Split-Apply-Combine Strategy for Data Analysis" https://www.jstatsoft.org/article/view/v040i01/v40i01.pdf
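If you really do want one named object per participant afterwards (s1_subset, s2_subset, ...), the ordered list can be pushed into the workspace in one call; a small sketch building on ord.list above, with the _subset suffix just mirroring the naming in the question:
# Rename the list elements "s1_subset", "s2_subset", ... and create one
# variable per participant in the global environment.
names(ord.list) <- paste0(names(ord.list), "_subset")
list2env(ord.list, envir = globalenv())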
for(i in 1:5) {
assign(paste0("test_",i),i)
}
test_list <- mget(ls(pattern = "test_"))
I hope you get a good answer. A good pattern is to create the variables with assign() and then collect them back with mget() on a matching name pattern, as above. I've written up a number of R and Python questions about dynamically generating variables; see the following link.
- Creating Variables Dynamically (R, Python)
You can accomplish this with eval() and parse(). For example, assuming the subset object named by subset_name has already been created the same way, the ordering step becomes:
eval(parse(text = paste0(subset_name, " <- ", subset_name, "[order(", subset_name, "$recognition_order), ]")))
I'm trying to analyze a large set of data, so I can't use for loops to search for IDs from one data frame in the other and replace the text.
Basically, the first data frame has IDs but no names; the names are in the other data frame.
(Edit) Input dfs
(Edit) df1
ID     Name
1,2,3  NA
4,5    NA
6      NA
(Edit) df2
ID  Name
1   John
2   John
3   John
4   Stacy
5   Stacy
6   Alice
(Edit) Expected output df
ID     Name
1,2,3  John
4,5    Stacy
6      Alice
(Edit) Please note that this is a very simplified version. df1 actually has 63 columns and 8551 rows, and df2 has 5 columns and 37291 rows.
I can search for the IDs and get the names from the second data frame like this. It's super fast!
namer <- function(df2, ids) {
ids <- gsub(',', '|', ids);
names <- df2[which(apply(df2, 1, function(x) any(grepl(ids, x)))),][['Name']];
if (length(names) != 0) {
return(names[[1]]);
} else {
return(NA);
}
}
But I can't do the replacement using the apply family. I know how to do it with for loops, but it's super slow because I have around 8500 rows in the first data frame.
for (k in 1:nrow(df1)) {
df1$Name[k] <- namer(df2, df1$ID[k]);
}
Can you please help me convert the for loop into an apply-family function to speed it up?
Thanks in advance
You can try
df1$Name <- sapply(as.character(df1$ID),
function(x) paste(unique(df2[match(strsplit(x, ",")[[1]], df2$ID), "Name"]), collapse = ","))
df1
# ID Name
# 1 1,2,3 John
# 2 4,5 Stacy
# 3 6 Alice
I doubt sapply will be much faster than a for loop, though. I've also added paste here in case more than one name is matched for a given df1$ID.
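If the sapply is still too slow, one alternative worth trying is to expand the comma-separated IDs into a long two-column table, do a single vectorized match(), and collapse back; a sketch assuming the df1/df2 shown above and that every df1$ID contains at least one ID:
# Expand "1,2,3"-style IDs into one row per single ID.
ids_split <- strsplit(as.character(df1$ID), ",")
long <- data.frame(row = rep(seq_along(ids_split), lengths(ids_split)),
                   ID  = unlist(ids_split),
                   stringsAsFactors = FALSE)
# One vectorized lookup of all the names at once.
long$Name <- df2$Name[match(long$ID, df2$ID)]
# Collapse back to one (unique) name string per original df1 row.
df1$Name <- unname(tapply(long$Name, long$row,
                          function(x) paste(unique(x), collapse = ",")))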
I am working on a function that takes a list of data tables with the same column names as an input and returns a single data table that has the unique rows from each data frame combined using successive rbind as shown below.
The function would be applied to a "very" large data.table (tens of millions of rows), which is why I had to split it up into several smaller data tables and put them in a list so I could use recursion. At each step, depending on whether the length of the list of data tables is odd or even, I take the unique rows of the data table at list index x and the data table at list index x - 1, rbind the two, assign the result to list index x - 1, and remove list index x.
I must be missing something obvious, because although I can produce the final de-duplicated data.table when I print it (e.g. print(listelement[[1]])), when I return(listelement[[1]]) I get NULL. It would help if someone could spot what I am missing, or suggest whether there is a more efficient way to do this.
Also, instead of having to add each data.table to a list, can I add them as "references" in the list? I believe doing something like list(datatable1, datatable2, ...) would actually copy them?
## CODE
returnUnique2 <- function (alist) {
if (length(alist) == 1) {
z <- (alist[[1]])
print (class(z))
print (z) ### This is the issue, if I change to return (z), I get NULL (?)
}
if (length(alist) %% 2 == 0) {
alist[[length(alist) - 1]] <- unique(rbind(unique(alist[[length(alist)]]), unique(alist[[length(alist) - 1]])))
alist[[length(alist)]] <- NULL
returnUnique2(alist)
}
if (length(alist) %% 2 == 1 && length(alist) > 2) {
alist[[length(alist) - 1]] <- unique(rbind(unique(alist[[length(alist)]]), unique(alist[[length(alist) - 1]])))
alist[[length(alist)]] <- NULL
returnUnique2(alist)
}
}
## OUTPUT with print statement
t1 <- data.table(col1=rep("a",10), col2=round(runif(10,1,10)))
t2 <- data.table(col1=rep("a",10), col2=round(runif(10,1,10)))
t3 <- data.table(col1=rep("a",10), col2=round(runif(10,1,10)))
tempList <- list(t1, t2, t3)
returnUnique2(tempList)
[1] "list"
[[1]]
col1 col2
1: a 3
2: a 2
3: a 5
4: a 9
5: a 10
6: a 7
7: a 1
8: a 8
9: a 4
10: a 6
Changing the following,
print (z) ### This is the issue, if I change to return (z), I get NULL (?)
to read
return(z)
still returns NULL.
Thanks in advance.
Please correct me if I misunderstand what you're doing, but it sounds like you have one big data.table and are trying to split it up to run some function on it and would then combine everything back and run a unique on that. The data.table way of doing that would be to use by, e.g.
fn = function(d) {
# do whatever to the subset and return the resulting data.table
# in this case, do nothing
d
}
N = 10 # number of pieces you like
dt[, fn(.SD), by = (seq_len(nrow(dt)) - 1) %/% (nrow(dt)/N)][, seq_len := NULL]
dt = dt[!duplicated(dt)]
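For a concrete run of that idea, here is a sketch with a stand-in table; the grouping column is named explicitly (grp) rather than relying on the default name:
library(data.table)
set.seed(42)
dt <- data.table(col1 = "a", col2 = round(runif(30, 1, 10)))  # stand-in for the big table

fn <- function(d) d   # placeholder: do the real per-piece work here

N <- 3                # number of pieces
dt <- dt[, fn(.SD), by = .(grp = (seq_len(nrow(dt)) - 1) %/% (nrow(dt) / N))][, grp := NULL]
dt <- dt[!duplicated(dt)]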
Seems like this could be a good use case for a for loop. With many rows, the overhead of using a for loop should be relatively small compared to the computation time. I would try combining my data.tables into a list (called ll in my example), then for each one remove duplicated rows, rbind the result to the previous data.table of unique rows, and subset by unique rows again.
If you have many duplicated rows in each chunk, then this might save some time; overall I'm not sure how effective it will be, but it's worth a shot.
# Create empty data.table for results (I have columns x and y in this case)
res <- data.table( x= numeric(0),y=numeric(0))
# loop over all data.tables in a list called 'll'
for( i in 1:length(ll) ){
# rbind the unique rows from the current list element to the results from all previous iterations
res <- rbind( res , ll[[i]][ ! duplicated(ll[[i]]) , ] )
# Keep only unique records at each iteration
res <- res[ ! duplicated(res) , ]
}
On another note, have you looked at the documentation for data.table? It explicitly states,
Because data.tables are usually sorted by key, tests for duplication
are especially quick.
So you might just be better off running on the entire data.table?
DT[ ! duplicated(DT) , ]
Add an id column to each data.table
t1$id=1
t2$id=2
t3$id=3
then combine them all at once and do a unique using by=.
If the data.tables are huge you could use setkey(...) to create an index on id before calling unique.
tall <- rbind(t1, t2, t3)
tall[, unique(.SD), by = id]
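For completeness: if the pieces are already sitting in a list (like tempList in the question), the combine-and-deduplicate step can also be done in one pass with rbindlist(); a minimal sketch assuming the data.table package is loaded:
# Stack every piece, then keep only the rows that are unique across all of them.
tall <- rbindlist(tempList)
tall_unique <- unique(tall)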
I have two R data frames with differing dimensions. However, both data frames have an id column.
df1:
nrow(df1)=22308
c1 c2 c3 pattern1.match
ENSMUSG00000000001_at 10.175115 10.175423 10.109524 0
ENSMUSG00000000003_at 2.133651 2.144733 2.106649 0
ENSMUSG00000000028_at 5.713781 5.714827 5.701983 0
df2:
Genes Pattern.Count
ENSMUSG00000000276 ENSMUSG00000000276_at 1
ENSMUSG00000000876 ENSMUSG00000000876_at 1
ENSMUSG00000001065 ENSMUSG00000001065_at 1
ENSMUSG00000001098 ENSMUSG00000001098_at 1
nrow(df2)=425
I would like to loop through df2, find all genes that have Pattern.Count = 1, and check them against the df1$pattern1.match column.
Basically, I would like to overwrite the fields Genes and pattern1.match with df2$Genes and df2$Pattern.Count. All the elements of df2$Pattern.Count are equal to one.
I wrote this function, but R freezes while looping through all these rows.
idcol <- ncol(df1)
return.frame.matches <- function(df1, df2, idcol) {
for (i in 1:nrow(df1)) {
for (j in 1:nrow(df2))
if(df1[i, 1] == df2[j, 1]) {
df1[i, idcol] = 1
break
}
}
return (df1)
}
Is there another way of doing that without almost killing the computer?
I'm not sure I get exactly what you are doing, but the following should at least get you closer.
The first column of df1 doesn't seem to have a name; are those the rownames?
If so,
df1$Genes <- rownames(df1)
Then you could then do a merge to create a new dataframe with the genes you require:
merge(df1,subset(df2,Pattern.Count==1))
Note they are matching on the common column Genes. I'm not sure what you want to do with the pattern1.match column, but a subset on the df1 part of merge can incorporate conditions on that.
Edit
Going by the extra information in the comments,
df1$pattern1.match <- as.numeric(df1$Genes %in% df2$Genes)
should achieve what you are looking for.
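A tiny demonstration of that %in% line on made-up toy data (the gene names here are hypothetical):
# Only g2_at appears in toy_df2, so only its flag becomes 1.
toy_df1 <- data.frame(Genes = c("g1_at", "g2_at", "g3_at"), pattern1.match = 0)
toy_df2 <- data.frame(Genes = "g2_at", Pattern.Count = 1)
toy_df1$pattern1.match <- as.numeric(toy_df1$Genes %in% toy_df2$Genes)
toy_df1$pattern1.match   # 0 1 0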
Your sample data is not enough to play around with, but here is what I would start with:
dfm <- merge( df1, df2, by = idcol, all = TRUE )
dfm_pc <- subset( dfm, Pattern.Count == 1 )
I took the "idcol" from your code, don't see it in the data.