I have a dataframe as shown below.
Data for replication:
x <- data.frame(cluster=c(1,2,3,4,5),
groups=c('20000127 20000128',
'20000127 20000128 20000134',
'20000129 20000130 20000131 20000132',
'20000133 20000134 20000135 20000136',
'20000128 20000133 20000134 20000135 20000136'),
chr=c(17,26,35,35,44), stringsAsFactors=FALSE)
I'm trying to come up with a way to analyze the 'groups' column for any groups that share elements and remove the rows with the higher character count.
For example, element 20000128 is present in rows 1, 2 and 5. Since row 1 has the lowest character count, I want to remove rows 2 and 5.
Ideally the end result should only have clusters 1, 3 and 4 (the clusters with the lowest character counts), and each element should appear only once. I appreciate any help!
Exploring this problem has been fun. I've learned that this is a variation of the set cover problem and is NP-complete.
It would help to understand the scope of your problem. If we are talking tens of clusters, we could use brute force; if it's thousands of clusters, we are going to have to use an approximation.
I have learned there is an R implementation of the greedy algorithm in the RcppGreedySetCover package.
First we need to convert the data to a two-column long format. We can use the tidyverse for this.
library(tidyverse)
longx <- x %>%
mutate(splitgroups = strsplit(as.character(groups), " ")) %>%
unnest(splitgroups) %>% select(cluster, splitgroups)
Then we can use greedySetCover to approximate the smallest set that covers all elements.
library(RcppGreedySetCover)
greedySetCover(longx)
#100% covered by 3 sets.
# cluster splitgroups
# 1: 2 20000127
# 2: 3 20000129
# 3: 3 20000130
# 4: 3 20000131
# 5: 3 20000132
# 6: 5 20000128
# 7: 5 20000133
# 8: 5 20000134
# 9: 5 20000135
#10: 5 20000136
This suggests that clusters 2, 3 and 5 cover everything. But this does not fully answer your question because, as you know, there is a set of clusters with a lower total character count.
However, we have learned that the minimum cover uses 3 clusters. Now we can test all combinations of 3 clusters.
set.size <- length(unique(greedySetCover(longx)$cluster)) # minimum number of clusters needed
binary.matrix <- table(longx)                             # cluster-by-element incidence matrix
combinations <- combn(unique(x$cluster), set.size)        # every candidate set of that size
total.lengths <- apply(combinations, 2, function(x){
  if(sum(as.logical(colSums(binary.matrix[x,]))) == ncol(binary.matrix)) # does the set cover every element?
  {sum(rowSums(binary.matrix[x,]))}                                      # if so, its total element count
  else {NA}})                                                            # otherwise discard it
min.length <- min(total.lengths, na.rm = TRUE)
min.set <- combinations[,which(total.lengths == min.length)]
x[min.set,]
# cluster groups chr
#1 1 20000127 20000128 17
#3 3 20000129 20000130 20000131 20000132 35
#4 4 20000133 20000134 20000135 20000136 35
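As a quick sanity check (my addition, not part of the original answer), we can confirm that the chosen clusters cover every element without overlap:
chosen <- unlist(strsplit(x$groups[min.set], " "))       # elements of the selected clusters
all.elements <- unique(unlist(strsplit(x$groups, " ")))  # every element in the data
setdiff(all.elements, chosen)                            # character(0) means everything is covered
anyDuplicated(chosen)                                    # 0 means no element appears twice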
Data
x <- data.frame(cluster=c(1,2,3,4,5),
groups=c('20000127 20000128',
'20000127 20000128 20000134',
'20000129 20000130 20000131 20000132',
'20000133 20000134 20000135 20000136',
'20000128 20000133 20000134 20000135 20000136'),
chr=c(17,26,35,35,44), stringsAsFactors=FALSE)
I had to use a while loop, maybe there's a less loopy solution...
foo <- function(x) {
i <- 1
while(i < nrow(x)) {
grps <- strsplit(x$groups, " ")
keep <- unlist(lapply(grps, function(x) identical(x, grps[[i]]) | !any((length(x) > length(grps[[i]]) & duplicated(c(grps[[i]], x))))))
x <- x[keep,]
i <- i+1
}
x
}
foo(x)
cluster groups chr
1 1 20000127 20000128 17
3 3 20000129 20000130 20000131 20000132 35
4 4 20000133 20000134 20000135 20000136 35
Explanation.
# I created a function to keep things compact and allow it to be used for other datasets.
# The `x` is the argument, assumed to be your data frame.
# 1: foo <- function(x) {
# Start the ball rolling with a counter to use in the while loop.
# 2: i <- 1
# This starts the while loop and will continue until "i" reaches the end of the data.
# But note later that the data may change if there are rows that meet your condition.
# 3: while(i < nrow(x)) {
# Split the groups variable at the " " and store in "grps"
# 4: grps <- strsplit(x$groups, " ")
# This next line does the work.
# It creates a vector of logical indices which are used to remove rows of "x"
# I split this into many lines to explain better.
# 5: keep <- unlist(lapply(grps, function(x) # apply a function to every group in "grps"
#       identical(x, grps[[i]]) | # TRUE for the group currently being checked against
#       !any( # negate the next conditions; they identify groups to remove
#         (length(x) > length(grps[[i]]) & # TRUE if this group is longer than the checked group, and...
#          duplicated(c(grps[[i]], x)))))) # ...it shares at least one element with it (so it gets removed)
# Update "x" by keeping only the rows that meet the criteria defined in step 5.
# 6: x <- x[keep,]
# Increase i
# 7: i <- i+1
# 8: } # This ends the while loop
# 9: x # Return the result
# 10: } # End of the function
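Since foo() only relies on a groups column, it can be reused on other data frames; here is a tiny made-up illustration (my own example, not from the question's data):
y <- data.frame(cluster = 1:3,
                groups = c("1 2", "1 2 3", "4 5"),
                stringsAsFactors = FALSE)
foo(y)
#   cluster groups
# 1       1    1 2
# 3       3    4 5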
I would like to subset rows of my data in several stages.
library(data.table); set.seed(333); n <- 100
dat <- data.table(id=1:n, x=runif(n,100,120), y=runif(n,200,220), z=runif(n,300,320))
> head(dat)
id x y z
1: 1 109.3400 208.6732 308.7595
2: 2 101.6920 201.0989 310.1080
3: 3 119.4697 217.8550 313.9384
4: 4 111.4261 205.2945 317.3651
5: 5 100.4024 212.2826 305.1375
6: 6 114.4711 203.6988 319.4913
I am aware that I could apply subset(.) sequentially to achieve this.
> s <- subset(dat, x>119)
> s <- subset(s, y>219)
> subset(s, z>315)
id x y z
1: 55 119.2634 219.0044 315.6556
My problem is that I need to automate this and it might happen that the subset is empty. In this case, I would want to skip the step(s) that result in an empty set. For example, if my data was
dat2 <- dat[1:50]
> s <-subset(dat2,x>119)
> s
id x y z
1: 3 119.4697 217.8550 313.9384
2: 50 119.2519 214.2517 318.8567
the second step subset(s, y>219) would come up empty but I would still want to apply the third step subset(s,z>315). Is there a way to apply a subset-command only if it results in a non-empty set? I imagine something like subset(s, y>219, nonzero=TRUE). I would want to avoid constructions like
s <- dat
if(nrow(subset(s, x>119))>0){s <- subset(s, x>119)}
if(nrow(subset(s, y>219))>0){s <- subset(s, y>219)}
if(nrow(subset(s, z>315))>0){s <- subset(s, z>315)}
because I fear the if-then jungle would be rather slow, especially since I need to apply all of this to different data.tables within a list using lapply(.). That's why I am hoping to find a solution optimized for speed.
PS. I only chose subset(.) for clarity, solutions with e.g. data.table would be just as welcome if not more so.
I agree with Konrad's answer that this should throw a warning or at least report what happens somehow. Here's a data.table way that will take advantage of indices (see package vignettes for details):
f = function(x, ..., verbose=FALSE){
  L = substitute(list(...))[-1]                               # capture the conditions, unevaluated
  mon = data.table(cond = as.character(L))[, skip := FALSE]   # monitor table: one row per condition
  for (i in seq_along(L)){
    d = eval( substitute(x[cond, verbose=v], list(cond = L[[i]], v = verbose)) )
    if (nrow(d)){
      x = d                   # non-empty result: keep it
    } else {
      mon[i, skip := TRUE]    # empty result: skip this condition and record it
    }
  }
  print(mon)
  return(x)
}
Usage
> f(dat, x > 119, y > 219, y > 1e6)
cond skip
1: x > 119 FALSE
2: y > 219 FALSE
3: y > 1e+06 TRUE
id x y z
1: 55 119.2634 219.0044 315.6556
The verbose option will print extra info provided by the data.table package, so you can see when indices are being used. For example, with f(dat, x == 119, verbose=TRUE), the verbose output shows that an index is used.
because I fear the if-then jungle would be rather slow, especially since I need to apply all of this to different data.tables within a list using lapply(.).
If it's for non-interactive use, it may be better to have the function return list(mon = mon, x = x) so you can more easily keep track of what the query was and what happened. The verbose console output could also be captured and returned.
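A rough sketch of that variant (the name f2 is mine; it is the same f as above, just returning the monitor table alongside the data instead of printing it):
library(data.table)
f2 = function(x, ..., verbose=FALSE){
  L = substitute(list(...))[-1]
  mon = data.table(cond = as.character(L))[, skip := FALSE]
  for (i in seq_along(L)){
    d = eval( substitute(x[cond, verbose=v], list(cond = L[[i]], v = verbose)) )
    if (nrow(d)) x = d else mon[i, skip := TRUE]
  }
  list(mon = mon, x = x)   # return both the monitor table and the filtered data
}
out <- f2(dat, x > 119, y > 219, y > 1e6)
out$mon   # which conditions were applied and which were skipped
out$x     # the filtered data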
An interesting approach can be developed using a modified version of the filter function offered in dplyr. If the condition is not met, the non_empty_filter function returns the original data set.
Notes
IMHO, this is fairly non-standard behaviour and should be reported via a warning. Of course, the warning can be removed and has no bearing on the function's results.
Function
library(tidyverse)
library(rlang) # enquo
non_empty_filter <- function(df, expr) {
expr <- enquo(expr)
res <- df %>% filter(!!expr)
if (nrow(res) > 0) {
return(res)
} else {
# Indicate that filter is not applied
warning("No rows meeting conditon")
return(df)
}
}
Condition met
Behaviour: Returning one row for which the condition is met.
dat %>%
non_empty_filter(x > 119 & y > 219)
Results
# id x y z
# 1 55 119.2634 219.0044 315.6556
Condition not met
Behaviour: Returning the full data set as the whole condition is not met due to y > 1e6.
dat %>%
non_empty_filter(x > 119 & y > 219 & y > 1e6)
Results
# id x y z
# 1: 1 109.3400 208.6732 308.7595
# 2: 2 101.6920 201.0989 310.1080
# 3: 3 119.4697 217.8550 313.9384
# 4: 4 111.4261 205.2945 317.3651
# 5: 5 100.4024 212.2826 305.1375
# 6: 6 114.4711 203.6988 319.4913
# 7: 7 112.1879 209.5716 319.6732
# 8: 8 106.1344 202.2453 312.9427
# 9: 9 101.2702 210.5923 309.2864
# 10: 10 106.1071 211.8266 301.0645
Condition met/not met one-by-one
Behaviour: Skipping filter that would return an empty data set.
dat %>%
non_empty_filter(y > 1e6) %>%
non_empty_filter(x > 119) %>%
non_empty_filter(y > 219)
Results
# id x y z
# 1 55 119.2634 219.0044 315.6556
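Since the question mentions applying this to several data.tables within a list via lapply(.), here is a possible usage sketch (my own addition; dat_list is a hypothetical list built from the question's dat):
# apply the same staged, skip-if-empty filtering to every table in a list
dat_list <- list(full = dat, first50 = dat[1:50])
lapply(dat_list, function(d) d %>%
  non_empty_filter(x > 119) %>%
  non_empty_filter(y > 219) %>%
  non_empty_filter(z > 315))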
I have a two column dataframe of number pairs:
ODD <- c(1,1,1,3,3,3,5,7,7,9,9)
EVEN <- c(10,8,2,2,6,4,2,6,8,4,8)
dfPairs <- data.frame(ODD, EVEN)
> dfPairs
ODD EVEN
1 1 10
2 1 8
3 1 2
4 3 2
5 3 6
6 3 4
7 5 2
8 7 6
9 7 8
10 9 4
11 9 8
Each row of this dataframe is a pair of numbers, and I would like to find the longest possible numerically increasing combination of pairs. Conceptually, this is analogous to making a chain link of number pairs, with the added conditions that 1) links can only be formed using the same number and 2) the final chain must increase numerically. Visually, the process I am looking for works like this:
For instance, row three is pair (1,2), which increases left to right. The next link in the chain would need to have a 2 in the EVEN column and increase right to left, such as row four (3,2). Then the pattern repeats, so the next link would need to have a 3 in the ODD column, and increase left to right, such as rows 5 or 6. The chain doesn't have to start at 1, or end at 9 - this was simply a convenient example.
If you try to make all possible linked pairs, you will find that many unique chains of various lengths are possible. I would like to find the longest possible chain. In my real data, I will likely encounter a situation in which more than one chain tie for the longest, in which case I would like all of these returned.
The final result should return the longest possible chain that meets these requirements as a dataframe, or a list of dataframes if more than one solution is possible, containing only the rows in the chain.
Thanks in advance. This one has been perplexing me all morning.
Edited to deal with a df that does not start at 1 and to return maximum chains rather than chain lengths.
Take advantage of a graph data structure using igraph.
Your data, dfPairs
ODD <- c(1,1,1,3,3,3,5,7,7,9,9)
EVEN <- c(10,8,2,2,6,4,2,6,8,4,8)
dfPairs <- data.frame(ODD, EVEN)
New data, dfTest
ODD <- c(3,3,3,5,7,7,9,9)
EVEN <- c(2,6,4,2,6,8,4,8)
dfTest <- data.frame(ODD, EVEN)
Make a graph of your data. A key to my solution is to rbind the column-reversed data frame (rev(dfPairs)) to the original data frame. This allows building directed edges both from odd numbers to even numbers and from even numbers to odd numbers, and graphs make it fairly easy to construct directed paths.
library(igraph)
library(dplyr)
GPairs <- graph_from_data_frame(dplyr::arrange(rbind(setNames(dfPairs, c("X1", "X2")), setNames(rev(dfPairs), c("X1", "X2"))), X1))
GTest <- graph_from_data_frame(dplyr::arrange(rbind(setNames(dfTest, c("X1", "X2")), setNames(rev(dfTest), c("X1", "X2"))), X1))
Here are the first three elements of all_simple_paths(GPairs, 1) (paths starting at 1):
[[1]]
+ 2/10 vertices, named, from f8e4f01:
[1] 1 2
[[2]]
+ 3/10 vertices, named, from f8e4f01:
[1] 1 2 3
[[3]]
+ 4/10 vertices, named, from f8e4f01:
[1] 1 2 3 4
I created a function to 1) convert all simple paths to a list of numeric vectors, 2) trim each numeric vector down to only the elements that are left-to-right increasing, and 3) return the longest such increasing chain(s).
max_chain_only_increasing <- function(gpath) {
list_vec <- lapply(gpath, function(v) as.numeric(names(unclass(v)))) # convert to list of numeric vector
only_increasing <- lapply(list_vec, function(v) v[1:min(which(v >= dplyr::lead(v, default=tail(v, 1))))]) # subset vector for only elements that are left->right increasing
return(unique(only_increasing[lengths(only_increasing) == max(lengths(only_increasing))])) # return the chain(s) of maximum length
}
This is the output of the above function using all paths that start from 1
max_chain_only_increasing(all_simple_paths(GPairs, 1))
# [[1]]
# [1] 1 2 3 6 7 8 9
Now I'll output (the head of) the max chains starting from each unique element in dfPairs, your original data:
start_vals <- sort(unique(unlist(dfPairs)))
# [1] 1 2 3 4 5 6 7 8 9 10
max_chains <- sapply(seq_len(length(start_vals)), function(i) max_chain_only_increasing(all_simple_paths(GPairs, i)))
names(max_chains) <- start_vals
# $`1`
# [1] 1 2 3 6 7 8 9
# $`2`
# [1] 2 3 6 7 8 9
# $`3`
# [1] 3 6 7 8 9
# $`4`
# [1] 4 9
# $`5`
# [1] 5
# etc
And finally with dfTest, the newer data
start_vals <- sort(unique(unlist(dfTest)))
max_chains <- sapply(seq_len(length(start_vals)), function(i) max_chain_only_increasing(all_simple_paths(GTest, i)))
names(max_chains) <- start_vals
# $`2`
# [1] 2 3 6 7 8 9
# $`3`
# [1] 3 6 7 8 9
# $`4`
# [1] 4 9
# $`5`
# [1] 5
# $`6`
# [1] 6 7 8 9
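If you need the result as rows of dfPairs rather than a vertex sequence (as the question asked), one possible way to map a chain back to its rows is sketched below; this simple matching is my own addition and assumes each unordered pair appears at most once in the data:
chain <- c(1, 2, 3, 6, 7, 8, 9)                       # e.g. the max chain starting at 1
links <- cbind(head(chain, -1), tail(chain, -1))      # consecutive vertex pairs in the chain
rows  <- apply(links, 1, function(p)
  which(dfPairs$ODD %in% p & dfPairs$EVEN %in% p)[1]) # find the row holding each pair (either order)
dfPairs[rows, ]
#    ODD EVEN
# 3    1    2
# 4    3    2
# 5    3    6
# 8    7    6
# 9    7    8
# 11   9    8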
In spite of Cpak's efforts I ended up writing my own function to solve this. In essence I realized I could make the right-to-left chain links left-to-right by using this section of code from Cpak's answer:
output <- arrange(rbind(setNames(dfPairs, c("X1", "X2")), setNames(rev(dfPairs), c("X1", "X2"))), X1)
To ensure the resulting chains were sequential, I deleted all decreasing links:
output$increase <- with(output, ifelse(X2>X1, "Greater", "Less"))
output <- filter(output, increase == "Greater")
output <- select(output, -increase)
I realized that if I split the dataframe output by unique values in X1, I could join each of these dataframes sequentially by joining the last column of the first dataframe to the first column of the next dataframe, which would create rows of sequentially increasing chains. The only problem I needed to resolve was the issue of NAs in the last column of the merged dataframe. So I ended up splitting the joined dataframe after each merge, then shifting the dataframe to remove the NAs, and rbind-ing the result back together.
This is the actual code:
out_split <- split(output, output$X1)
df_final <- Reduce(join_shift, out_split)
The function, join_shift, is this:
# requires dplyr (for full_join and filter) and the moveme() helper defined below
join_shift <- function(dtf1,dtf2){
abcd <- full_join(dtf1, dtf2, setNames(colnames(dtf2)[1], colnames(dtf1)[ncol(dtf1)]))
abcd[is.na(abcd)]<-0
colnames(abcd)[ncol(abcd)] <- "end"
# print(abcd)
abcd_na <- filter(abcd, end==0)
# print(abcd_na)
abcd <- filter(abcd, end != 0)
abcd_na <- abcd_na[moveme(names(abcd_na), "end first")]
# print(abcd_na)
names(abcd_na) <- names(abcd)
abcd<- rbind(abcd, abcd_na)
z <- length(colnames(abcd))
colnames(abcd)<- c(paste0("X", 1:z))
# print(abcd)
return(abcd)
}
Finally, I found there were a lot of columns that had only zeros in them, so I wrote this to delete them and trim the final dataframe:
df_final_trim = df_final[,colSums(df_final) > 0]
Overall I'm happy with this. I imagine it could be a little more elegant, but it works on anything, and it works on some rather huge and complicated data. It produces ~241,700 solutions from a dataset of 700 pairs.
I also used a moveme function that I found on stackoverflow (see below). I employed it to move NA values around to achieve the shift aspect of the join_shift function.
moveme <- function (invec, movecommand) {
movecommand <- lapply(strsplit(strsplit(movecommand, ";")[[1]],
",|\\s+"), function(x) x[x != ""])
movelist <- lapply(movecommand, function(x) {
Where <- x[which(x %in% c("before", "after", "first",
"last")):length(x)]
ToMove <- setdiff(x, Where)
list(ToMove, Where)
})
myVec <- invec
for (i in seq_along(movelist)) {
temp <- setdiff(myVec, movelist[[i]][[1]])
A <- movelist[[i]][[2]][1]
if (A %in% c("before", "after")) {
ba <- movelist[[i]][[2]][2]
if (A == "before") {
after <- match(ba, temp) - 1
}
else if (A == "after") {
after <- match(ba, temp)
}
}
else if (A == "first") {
after <- 0
}
else if (A == "last") {
after <- length(myVec)
}
myVec <- append(temp, values = movelist[[i]][[1]], after = after)
}
myVec
}
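For reference, here is a tiny illustrative call of moveme (my own toy input, not from the post):
moveme(c("a", "b", "end"), "end first")
# [1] "end" "a"   "b"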
I have Valence Category for word stimuli in my psychology experiment.
1 = Negative, 2 = Neutral, 3 = Positive
I need to sort the thousands of stimuli with a pseudo-randomised condition.
Val_Category cannot have more than 2 of the same valence stimuli in a row, i.e. no more than two negative stimuli in a row.
for example - 2, 2, 2 = not acceptable
2, 2, 1 = ok
I can't sequence the data i.e. decide the whole experiment will be 1,3,2,3,1,3,2,3,2,2,1 because I'm not allowed to have a pattern.
I tried various approaches such as dplyr and the base functions sample, order and sort, and nothing so far solves the problem.
I think there's a thousand ways to do this, none of which are probably very pretty. I wrote a small function that takes care of the ordering. It's a bit hacky, but it appeared to work for what I tried.
To explain what I did, the function works as follows:
Take the vector of valences and sample from it.
If sequences are found that are longer than the desired length, then (for each such sequence) take the last value of that sequence and place it "somewhere else".
Check if the problem is solved. If so, return the reordered vector. If not, go back to step 2.
# some vector of valences
val <- rep(1:3,each=50)
pseudoRandomize <- function(x, n){
# take an initial sample
out <- sample(x)
# check if the sample is "bad" (containing sequences longer than n)
bad.seq <- any(rle(out)$lengths > n)
# length of the whole sample
l0 <- length(out)
while(bad.seq){
# get lengths of all subsequences
l1 <- rle(out)$lengths
# find the bad ones
ind <- l1 > n
# take the last value of each bad sequence, and...
for(i in cumsum(l1)[ind]){
# take it out of the original sample
tmp <- out[-i]
# pick new position at random
pos <- sample(2:(l0-2),1)
# put the value back into the sample at the new position
out <- c(tmp[1:(pos-1)],out[i],tmp[pos:(l0-1)])
}
# check if bad sequences (still) exist
# if TRUE, then 'while' continues; if FALSE, then it doesn't
bad.seq <- any(rle(out)$lengths > n)
}
# return the reordered sequence
out
}
Example:
The function may be used on a vector with or without names. If the vector was named, then these names will still be present on the pseudo-randomized vector.
# simple unnamed vector
val <- rep(1:3,each=5)
pseudoRandomize(val, 2)
# gives:
# [1] 1 3 2 1 2 3 3 2 1 2 1 3 3 1 2
# when names assigned to the vector
names(val) <- 1:length(val)
pseudoRandomize(val, 2)
# gives (first row shows the names):
# 1 13 9 7 3 11 15 8 10 5 12 14 6 4 2
# 1 3 2 2 1 3 3 2 2 1 3 3 2 1 1
This property can be used for randomizing a whole data frame. To achieve that, the "valence" vector is taken out of the data frame, and names are assigned to it either by row index (1:nrow(dat)) or by row names (rownames(dat)).
# reorder a data.frame using a named vector
dat <- data.frame(val=rep(1:3,each=5), stim=rep(letters[1:5],3))
val <- dat$val
names(val) <- 1:nrow(dat)
new.val <- pseudoRandomize(val, 2)
new.dat <- dat[as.integer(names(new.val)),]
# gives:
# val stim
# 5 1 e
# 2 1 b
# 9 2 d
# 6 2 a
# 3 1 c
# 15 3 e
# ...
I believe this loop will set the valence categories appropriately. I've called the valence category treat.
#Generate example data
s1 = data.frame(id=c(1:10),treat=NA)
#Setting the first two rows
s1[1,"treat"] <- sample(1:3,1)
s1[2,"treat"] <- sample(1:3,1)
#Looping through the remainder of the rows
for (i in 3:length(s1$id))
{
s1[i,"treat"] <- sample(1:3,1)
#Check if the treat value is equal to the previous two values.
if (s1[i,"treat"]==s1[i-1,"treat"] & s1[i-1,"treat"]==s1[i-2,"treat"])
#If so draw one of the values not equal to that value
{
a = 1:3
remove <- s1[i,"treat"]
a=a[!a==remove]
s1[i,"treat"] <- sample(a,1)
}
}
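A quick way to confirm the constraint holds (my own check, not part of the original answer): the longest run of identical treat values should be at most 2.
max(rle(s1$treat)$lengths) <= 2   # TRUE if no value occurs more than twice in a row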
This solution is not particularly elegant. There may be a much faster way to accomplish this by sorting several columns or something.
I am trying to solve the following problem:
Consider 5 simple sequences: 0:100, 100:0, rep(0,101), rep(50,101), rep(100,101)
I need sets of 3 numeric variables, which have the above sequences in all combinations. Since there are 5 sequences and 3 variables, there are 5*5*5 = 125 combinations, hence a total of 12625 (125*101) numbers in each variable (101 for each combination).
These can be grouped in a data.frame of 12625 rows and 4 columns. First column (V) will simply have seq(1:12625) (rownumbers can be used in its place). Other 3 columns (A,B,C) will have above 5 sequences in different combinations. For example, the first 101 rows will have 0:100 in all 3 A,B and C. Next 101 rows will have 0:100 in A and B, and 100:0 in C. And so on...
I can create sequences as:
s = list()
s[[1]] = 0:100
s[[2]] = 100:0
s[[3]] = rep(0,101)
s[[4]] = rep(50,101)
s[[5]] = rep(100,101)
But how to proceed further? I do not really need the data frame but I need a function that returns a list containing the values of c(A,B,C) for the number (first or V column) sent to it. The number can obviously vary from 1 to 12625.
How can I create such a function? I would prefer a vectorized solution or one using apply-family functions to optimize for speed.
You asked for a vectorized solution, so here's one using only data.table (similar to @SimonG's methodology):
library(data.table)
grd <- CJ(A = seq_len(5), B = seq_len(5), C = seq_len(5))
res <- grd[, lapply(.SD, function(x) unlist(s[x]))]
res
# A B C
# 1: 0 0 0
# 2: 1 1 1
# 3: 2 2 2
# 4: 3 3 3
# 5: 4 4 4
# ---
# 12621: 100 100 100
# 12622: 100 100 100
# 12623: 100 100 100
# 12624: 100 100 100
# 12625: 100 100 100
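The question also asked for a function that returns the A, B and C values for a given row number V; once res is built, that is just row indexing. A minimal sketch (the helper name lookupABC is my own):
# look up the A, B, C values for row number V of the 12625-row table
lookupABC <- function(V) as.list(res[V])
lookupABC(102)
# $A
# [1] 0
#
# $B
# [1] 0
#
# $C
# [1] 100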
I came up with two solutions. I find this hard to do with apply and the like, since they tend to give output that is not so nice to handle (maybe someone can "tame" them better than I can :D).
The first solution uses separate calls to lapply, the second one uses a for loop and some programming no-nos. Personally I prefer the second one, though the first one is faster...
grd <- expand.grid(a=1:5,b=1:5,c=1:5)
# apply-ish
A <- lapply(grd[,1], function(z){ s[[z]] })
B <- lapply(grd[,2], function(z){ s[[z]] })
C <- lapply(grd[,3], function(z){ s[[z]] })
dfr <- data.frame(A=do.call(c,A), B=do.call(c,B), C=do.call(c,C))
# for-ish
mat <- NULL
for(i in 1:nrow(grd)){
cur <- grd[i,]
tmp <- cbind(s[[cur[,1]]],s[[cur[,2]]],s[[cur[,3]]])
mat <- rbind(mat,tmp)
}
The output of both dfr and mat seem to be what you describe.
Cheers!
I have a table of 55,000 rows, which looks like this (left table in the picture):
(the code to generate sample data is below)
Now I need to convert every row of this table to 6 rows, each containing one letter of the "hexamer" (right table in the picture), with some calculations:
# input for the function is one row of source table, output is 6 rows
splithexamer <- function(x){
dir <- x$dir # strand direction: +1 or -1
pos <- x$pos # hexamer position
out <- x[0,] # template of output
hexamer <- as.character(x$hexamer)
for (i in 1:nchar(hexamer)) {
letter <- substr(hexamer, i, i)
if (dir==1) {newpos <- pos+i-1;}
else {newpos <- pos+6-i;}
y <- x
y$pos <- newpos
y$letter <- letter
out <- rbind(out,y)
}
return(out);
}
# Sample data generation:
set.seed(123)
size <- 55000
letters <- c("G","A","T","C")
df<-data.frame(
HSid=paste0("Hs.", 1:size),
hexamer=replicate(n=size, paste0(sample(letters,6,replace=T), collapse="")),
chr=sample(c(1:23,"X","Y"),size,replace=T),
pos=sample(1:99999,size,replace=T),
dir=sample(c(1,-1),size,replace=T)
)
Now I would like to get some advice on the most efficient way to apply my function to every row. So far I have tried the following:
# Variant 1: for() with rbind
tmp <- data.frame()
for (i in 1:nrow(df)){
tmp<-rbind(tmp,splithexamer(df[i,]));
}
# Variant 2: for() with direct writing to file
for (i in 1:nrow(df)){
write.table(splithexamer(df[i,]),file="d:/test.txt",append=TRUE,quote=FALSE,col.names=FALSE)
}
# Variant 3: ddply
tmp<-ddply(df, .(HSid), .fun=splithexamer)
# Variant 4: apply - I don't know correct syntax
tmp<-apply(X=df, 1, FUN=splithexamer) # this causes an error
All of the above is extremely slow; I am wondering if there's a better way to solve this task...
Solution using data.table:
df$hexamer <- as.character(df$hexamer)
dt <- data.table(df)
dt[, id := seq_len(nrow(df))]
setkey(dt, "id")
dt.out <- dt[, { mod.pos <- pos:(pos+5); if(dir == -1) mod.pos <- rev(mod.pos);
list(split = unlist(strsplit(hexamer, "")),
mod.pos = mod.pos)}, by=id][dt][, id := NULL]
dt.out
# split mod.pos HSid hexamer chr pos dir
# 1: G 95982 Hs.1 GCTCCA 5 95982 1
# 2: C 95983 Hs.1 GCTCCA 5 95982 1
# 3: T 95984 Hs.1 GCTCCA 5 95982 1
# 4: C 95985 Hs.1 GCTCCA 5 95982 1
# 5: C 95986 Hs.1 GCTCCA 5 95982 1
# ---
# 329996: A 59437 Hs.55000 AATCTG 7 59436 1
# 329997: T 59438 Hs.55000 AATCTG 7 59436 1
# 329998: C 59439 Hs.55000 AATCTG 7 59436 1
# 329999: T 59440 Hs.55000 AATCTG 7 59436 1
# 330000: G 59441 Hs.55000 AATCTG 7 59436 1
Explanation of the main line:
The by=id will group by id and since they are all unique, it'll group by every line, one at a time.
Then, the expression within {} sets mod.pos to pos:(pos+5) (i.e. pos:(pos+6-1)) and, if dir == -1, reverses it.
Now, the list(...) argument: it creates the column split by splitting your hexamer into its 6 nucleotides using strsplit, and also sets mod.pos, which we calculated in the step before.
This will result in a data.table with columns id, split and mod.pos.
The next part, [dt], is a typical usage of data.table's X[Y] syntax, which performs a join on the data.tables based on the key column (= id here). Since there are 6 rows for every id, you get all the other columns of dt duplicated during this join.
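As a tiny generic illustration of that X[Y] join idiom (toy tables of my own, unrelated to the hexamer data):
library(data.table)
X <- data.table(id = 1:2, val = c("a", "b"), key = "id")
Y <- data.table(id = c(1L, 1L, 2L), extra = 1:3, key = "id")
X[Y]   # for each row of Y, look up the matching row of X on the key 'id'
#    id val extra
# 1:  1   a     1
# 2:  1   a     2
# 3:  2   b     3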
I'd suggest you take a look at the data.table FAQ first and then its documentation (the intro vignette). These can be reached by installing and loading the package and then typing ?data.table. I also suggest you work through the many examples in there one by one, with a test data.table, to understand the features of data.table in practice.
Hope this helps.