R: unique combinations from given ranges quickly and with fewer system resources

This is a follow up question from here:
https://stackoverflow.com/a/55912086/3988575
I have a dataset like this:
ID=as.character(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20))
IQ=c(120.5,128.1,126.5,122.5,127.1,129.7,124.2,123.7,121.7,122.3,120.9,122.4,125.7,126.4,128.2,129.1,121.2,128.4,127.6,125.1)
Section=c("A","A","B","B","A","B","B","A","B","A","B","B","A","A","B","B","A","B","B","A")
zz=data.frame(ID,IQ,Section)
zz_new=do.call("rbind", replicate(zz, n=30, simplify = FALSE))
What I would like to do is to match people by the range of their IQ (which was the previous question).
Now I want to create multiple levels of the ranges. For example, one level can be 10 IQ classes: 120-121, 121-122, 122-123, ..., 129-130. Another example is a single IQ class: 120-130. All possible combinations of the above can be obtained by:
IQ_Class=c(120,121,122,123,124,125,126,127,128,129,130)
n = length(IQ_Class)-2
all_combin=expand.grid(replicate(n, 0:1, simplify = FALSE))
all_combin$First=1
all_combin$Last=1
all_combin_new=all_combin[c("First",names(all_combin)[1:(length(names(all_combin))-2)],"Last")] #Reorder columns
all_combin_new = t((apply(all_combin_new,1,function(x)(x*IQ_Class)))) #Multiply by IQ classes
all_combin_new = apply(all_combin_new, 1, function(x) { x[x!=0] })
Note that the final object all_combin_new is a list of all the class-break vectors (512 combinations in total).
Now what I want to do is to take one class (one element from all_combin_new) and create all the combinations of ID's in that particular IQ class by their section. Save this dataset and take the next class from all_combin_new and repeat the operation.
From the previous answer, I was able to modify the code to build the combinations by Section by changing the following lines:
library(tidyverse)

zz1 = vector("list", length(all_combin_new))
for (i in 1:length(all_combin_new)){ #changed this line to run for all combinations in all_combin_new
  zz2 = all_combin_new[[i]]
  zz11 = zz_new %>%
    mutate(ID = as.character(ID), vec = as.character(cut(IQ, zz2, right = F))) %>%
    group_by(vec, Section) %>% #Changed this line
    summarize(if (n() > 1) list(data.frame(t(combn(ID, 2)), stringsAsFactors = F))
              else list(data.frame(X1 = ID, X2 = ID, stringsAsFactors = F))) %>%
    unnest() %>%
    bind_cols(read.csv(text = gsub("[^0-9,]", "", .$vec), h = F))
  zz1[[i]] = as.data.frame(zz11)
}
My actual dataset has about 10K observations (compared to zz_new here) with 20 Sections, leading to 2^18 = 262144 IQ ranges (compared to the 512 elements of all_combin_new here). This causes two main issues:
a) Time: The code is extremely slow. Is there a way to increase the speed?
b) Size of the objects created: In my tests, even with far fewer combinations, the lists grow too big and the code fails. What alternative approaches could I use here? Note that I also need to do further computations on the resulting list of data frames.
Any help will be appreciated. Thanks in advance.
P.S. Please let me know if any part is unclear or if any part of the code has inadvertent errors.

Edit: Now with loop to go through all IQ combos and to include Section as a key on join.
I used the sample data in the linked question. Instead of making a list and looping, this does everything at once.
Note there is a Cartesian product, so it may still run into memory issues. If you're having trouble, you can always try data.table, since it supports non-equi joins (there's a rough sketch after the code below).
library(tidyverse)

zz <- tibble(ID = 1:12
             , IQ = c(120.5,123,125,122.5,122.1,121.7,123.2,123.7,120.7,122.3,120.1,122)
             , Section = c("A","A","B","B","A","B","B","A","B","A","B","B")
             )

IQ_Class <- c(120,122,124,126)

IQ_Classes <- data.frame(First = 1
                         , expand.grid(replicate(length(IQ_Class)-2, 0:1, simplify = FALSE))
                         , Last = 1)
IQ_Classes <- IQ_Classes * IQ_Class[col(IQ_Classes)]
IQ_Classes_List <- apply(IQ_Classes, 1, function(x) { x[x != 0] })

all_combos <- lapply(IQ_Classes_List
                     , function(IQs) {
                       z_cut <- zz %>%
                         mutate(cut_range = cut(IQ, IQs, right = F, labels = F)) # cut by this combination's breaks
                       inner_join(z_cut
                                  , z_cut %>%
                                    select(V2 = ID, cut_range, Section)
                                  , by = c('cut_range', 'Section')) %>%
                         filter(V2 > ID) %>%
                         mutate(Previous_IQ_class = IQs[cut_range],
                                Next_Class = IQs[cut_range + 1])
                     }
) %>%
  bind_rows(.id = 'IQ_List')
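If memory becomes a problem, here is a rough, untested data.table sketch of the same pairing for a single set of breaks (the full IQ_Class vector in this case); the ID < ID condition is the non-equi part that keeps each pair only once:
library(data.table)

zz_dt <- as.data.table(zz)
zz_dt[, cut_range := cut(IQ, IQ_Class, right = FALSE, labels = FALSE)]

pairs <- zz_dt[zz_dt,
               on = .(Section, cut_range, ID < ID),  # self-join within Section and IQ class
               .(ID = x.ID, V2 = i.ID, Section, cut_range),
               nomatch = 0L]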

Related

R optimize loop with Tidyverse

I am working with a large data set and have written working code using for loops. Because of the size of the data, I want to improve the efficiency of the code, and I think there must be a way using Tidyverse.
Briefly, I have a data frame with object IDs (seed IDs) and their x and y coordinates.
object <- c('A','B','C')
x <- c(147, 146, 143)
y <- c(17, 80, 155)
df_Seeds <- data.frame(object, x, y)
df_Seeds$object <- as.character(df_Seeds$object)
I also have an array with another set of objects (radicles) and their x and y coordinates.
x1 <- c(180, 146, 143, 17, 17, 155, 30, 30, 30)
array_Radicles <- array(x1,dim = c(3,3))
The following code outputs an array with the index of any radicle objects within a certain distance of each seed and another array with the seed object ID. Lastly, I cbind the arrays.
seedID_Array <-array(dim=(0:1)) #blank array for seedID
radicleIndex_Array <-array(dim=(0:1)) #blank array for radicle index
for(i in 1:dim(df_Seeds)[1]) { #loops through each seed object
indexRadicles <- which(abs(array_Radicles[,1] - df_Seeds[i, 2]) <= 50 & abs(array_Radicles[,2]- df_Seeds[i,3]) <= 25) #generates vector index of any radicle within distance of seed
if (length(indexRadicles) > 0) { #some seed objects will not have an associated radicle
for (j in 1:length(indexRadicles)) { #loops through each radicle index
singleIndexRadicles <- indexRadicles[j]
seedID_Array <- rbind(seedID_Array, df_Seeds[i,1]) #adds seed object ID to array
radicleIndex_Array <- rbind(radicleIndex_Array, singleIndexRadicles) #adds radicle index to array
}
}
}
combinedArray <- cbind(seedID_Array, radicleIndex_Array)
I appreciate any suggestions or direction to another similar problem that has been solved.
Well, first of all, I highly recommend you take a look at this resource and this other resource. They're both good references on how to avoid some simple mistakes and get your R code going faster.
For your specific problem, I'd say the biggest performance bottleneck is when you're using rbind and cbind. Those functions create a copy of your original object and then fill in with the second argument. This is not very efficient.
Also, in your inner loop, you're essentially collecting all the indexes in indexRadicles and repeatedly rbinding df_Seeds[i,1].
To solve this, a possible solution would be to use a list, indexing by Seed ID. For example:
output <- list()
for(i in 1:dim(df_Seeds)[1]) { #loops through each seed object
indexRadicles <- which(abs(array_Radicles[,1] - df_Seeds[i, 2]) <= 50 & abs(array_Radicles[,2]- df_Seeds[i,3]) <= 25) #generates vector index of any radicle within distance of seed
if (length(indexRadicles) > 0) { #some seed objects will not have an associated radicle
output[df_Seeds$object[i]] <- list(indexRadicles)
}
}
seeds_that_had_index_radicles <- names(output)
all_index_radicles <- unlist(output)
Note that we did not use any tidyverse solution here. I believe it is wrong to assume that tidyverse solutions are always faster or more efficient. I personally think they help you understand some operations better, or at least visualize them better, but you can usually do the same things with the same performance using base R.
Bonus: On a side note, you can always use profvis to help you find the performance bottlenecks in your code. It will show you which lines are taking longer, or which lines are being called the most. Highly recommend taking a look at it: https://rstudio.github.io/profvis/
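For instance, a minimal profvis sketch that wraps the loop from the question (assuming df_Seeds and array_Radicles are already in the workspace):
library(profvis)

profvis({
  output <- list()
  for (i in 1:nrow(df_Seeds)) {
    indexRadicles <- which(abs(array_Radicles[, 1] - df_Seeds[i, 2]) <= 50 &
                             abs(array_Radicles[, 2] - df_Seeds[i, 3]) <= 25)
    if (length(indexRadicles) > 0) {
      output[[df_Seeds$object[i]]] <- indexRadicles
    }
  }
})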
Your question is an example of a non-equi join, where your distance requirement constrains the matches between two tables. dplyr does not currently allow non-equi joins, but in many cases my data is small enough (eg cartesian product table still fits in memory) for a brute force method to work fine and be fast enough.
Option 1: cartesian product then filter
Here I join every radicle to every seedID (this could be untenable if your data is big enough) and then filter out the ones I don't need.
library(tidyverse)
df_Radicles <- tibble(x = array_Radicles[,1],
y = array_Radicles[,2],
misc = array_Radicles[,3],
rad_idx = 1:length(array_Radicles[,1]))
# brute force non-equi join: join all then filter
crossing(object = df_Seeds$object, df_Radicles) %>%
left_join(df_Seeds, by = "object") %>%
filter(abs(x.x - x.y) <= 50, abs(y.x - y.y) <= 25) %>%
select(object, rad_idx)
# A tibble: 3 x 2
object rad_idx
<chr> <int>
1 A 2
2 A 1
3 C 3
Option 2: fuzzyjoin
The fuzzyjoin package allows non-equi joins and has built-in methods for distance joins. In this case you're using a Manhattan distance metric, but since your y tolerance is different (25 vs 50), I scale y by 2 here so that it can be evaluated on the same +/- 50 scale as your x distance. There's also a geo_join option if you're dealing with lat/lon coordinates.
library(fuzzyjoin)
df_Seeds %>%
mutate(y = y * 2) %>% # to use manhattan distance with x + y on same scale
distance_inner_join(
df_Radicles %>% mutate(y = y*2),
by = c("x", "y"),
method = "manhattan",
max_dist = 50) %>%
select(object, rad_idx)
object rad_idx
1 A 1
2 A 2
3 C 3
If these approaches aren't performant on your data, I'd recommend using data.table, which is phenomenally fast for this sort of thing.
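For reference, here is a rough, untested data.table non-equi join sketch for this distance problem (rad_idx and the x_lo/x_hi/y_lo/y_hi bound columns are names made up for this sketch):
library(data.table)

seeds <- as.data.table(df_Seeds)
seeds[, `:=`(x_lo = x - 50, x_hi = x + 50, y_lo = y - 25, y_hi = y + 25)]

rads <- data.table(rad_idx = seq_len(nrow(array_Radicles)),
                   x = array_Radicles[, 1],
                   y = array_Radicles[, 2])

rads[seeds,
     on = .(x >= x_lo, x <= x_hi, y >= y_lo, y <= y_hi),  # distance window as a range join
     .(object = i.object, rad_idx = x.rad_idx),
     nomatch = 0L]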

Assigning values through a loop

I am trying to measure political ideology on Twitter (using Rtweet). I now have a dataframe consisting of 100+ politicians' user_ids along with two ideal-point scores, 'factor 1' and 'factor 2' (both factors range from 1 to 4). It looks like this (called kandidat):
Navne           Faktor 1   Faktor 2
"Politician1"   3.5        1.0
"Politician2"   2.0        4.0
Etc...          X          X
I would then like to detect if random Twitter users follow one or more of the politicians from my dataset. If they follow, for example, two of the politicians in my dataset, "Politician1" and "Politician2", I will then assign the user the mean of the two politicians' ideal-point scores on the two factors. For a Twitter user following these two politicians that would be factor 1 = (3.5+2.0)/2 = 2.75 and factor 2 = (1.0+4.0)/2 = 2.5.
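For illustration, the same averaging in R for the two example politicians above:
mean(c(3.5, 2.0))  # Faktor 1 -> 2.75
mean(c(1.0, 4.0))  # Faktor 2 -> 2.5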
So I've tried to create a simplified loop including only two journalists from Twitter called 'testusers', who both follow a large share of the politicians in my dataset. The loop should then check whether the respective journalists follow one or more of the politicians: If they follow, then the loop should assign the mean of the values like described above. If not, they should be automatically removed from the dataset. The loop below does run, but unfortunately provides a wrong output (see table below the code):
### loop ###
for(i in 1:ncol(testusers)){
#pick politician1 of dataset
politician1_friends <- get_friends(testusers$Navne[1])
#intersect with candidate data
ids_intersect = intersect(politician1_friends$user_id, kandidat$user_id)
if(length(ids_intersect == 0)){
testusers[i, "anyFriends"] <- FALSE #user has no friends in the politicians df
} else {
#assign values to user based on intersect
politicians_friends = kandidat[kandidat$user_id %in% ids_intersect,]
s1_mean <- mean(politicians_friends$faktor1, na.rm=TRUE)
s2_mean <- mean(politicians_friends$faktor2, na.rm=TRUE)
testusers[i, "faktor1"] <- s1_mean
testusers[i, "faktor2"] <- s2_mean
testusers[i, "anyFriends"] <- TRUE #user has friends in the politicians dataset
}
# etc.
}
The code above gives me this output:
Navne           anyFriends
"Politician1"   FALSE
"Politician2"   NA
The structure of testusers is: structure(list(Navne = c("Politician1", "Politician2"), anyFriends = c(FALSE, NA)), row.names = 1:2, class = "data.frame"). I can't post the whole structure of kandidat, since it's too big, but it's a dataframe of politicians with all the information from the function look_up() (user_id, screen_name, text etc.).
So I guess the code needs some minor changes, but I haven't figured them out yet. Ideally the output (df) should consist of only three columns: 1) UserID/Name, 2) Faktor1, 3) Faktor2.
I think what you want is another data.frame or so containing your users, and their 'scores'. R likes to work with such data frames rather than with lists.
I am assuming that you have a data.frame containing your politicians and their scores along the two dimensions, as well as a data.frame with the users you're interested in, such as:
kandidat <- data.frame(user_id = 1:2, name = c("Politician1", "Politician2"), Faktor1 = c(3.5, 2), Faktor2 = c(1,4))
my_users <- data.frame(name = c("Max", "Mara"))
Now if you want to work with a for-loop, you can do something like
find_f <- function(df){
  F1_mean <- c()
  F2_mean <- c()
  anyFriends <- c()
  for(i in 1:nrow(df)){
    #pick user i of dataset
    user_friends <- get_friends(df$name[i])
    #intersect with our candidate data
    ids_intersect = intersect(user_friends$user_id, kandidat$user_id)
    if(length(ids_intersect) == 0){
      # user has no friends in the politicians df; keep all three vectors the same length
      F1_mean <- c(F1_mean, NA)
      F2_mean <- c(F2_mean, NA)
      anyFriends <- c(anyFriends, FALSE)
    } else {
      #assign values to user based on intersect
      kandidat_friends = kandidat[kandidat$user_id %in% ids_intersect,]
      F1_mean <- c(F1_mean, mean(kandidat_friends$Faktor1, na.rm=TRUE))
      F2_mean <- c(F2_mean, mean(kandidat_friends$Faktor2, na.rm=TRUE))
      anyFriends <- c(anyFriends, TRUE) # user has friends in the politicians dataset
    }
  }
  df$Faktor1 <- F1_mean
  df$Faktor2 <- F2_mean
  df$anyFriends <- anyFriends
  return(df[df$anyFriends,])
}
my_users2 <- find_f(my_users)
This is by far not a very brief solution, but I think it is easy to understand. The most important thing is that you work with data.frames rather than lists; it is much easier in R. In each iteration, we get the friends of the user and see whether there is any intersection with the politicians. If not, we assign FALSE to the anyFriends variable in the my_users dataframe, so we can easily filter those users out at the end. If there is an intersection, we take the mean of the two scores of the selected politicians and assign them to the respective user entry.
No need for the IDEOLOGISCORE list in my opinion. Also, please be aware that I didn't test the code above and it might be that there are typos. Just check whether it works for you :)

How to concatenate NOT as character in R?

I want to concatenate iris$SepalLength so I can use it in a function to get the Sepal Length column from the iris data frame. But when I use the paste function, paste("iris$", colnames(iris[3])), the result is a character string (with quotes), "iris$SepalLength". I need the result not to be a character. I have tried noquote(), as.data.frame() etc. but it doesn't work.
freq <- function(y) {
for (i in iris) {
count <-1
y <- paste0("iris$",colnames(iris[count]))
data.frame(as.list(y))
print(y)
span = seq(min(y),max(y), by = 1)
freq = cut(y, breaks = span, right = FALSE)
table(freq)
count = count +1
}
}
freq(1)
The crux of your problem isn't making that object not be a string, it's convincing R to do what you want with the string. You can do this with, e.g., eval(parse(text = foo)). Isolating out a small working example:
y <- "iris$Sepal.Length"
data.frame(as.list(y)) # does not display iris$Sepal.Length
data.frame(as.list(eval(parse(text = y)))) # DOES display iris$Sepal.Length
That said, I wanted to point out some issues with your function:
The input variable appears to not do anything (because it is immediately overwritten), which may not have been intended.
The for loop seems broken, since it resets count to 1 on each pass, which I think you didn't mean. Relatedly, it iterates over all i in iris, but then it doesn't use i in any meaningful way other than to keep a count. Instead, you could do something like for(count in 1:length(iris)), which would establish the count variable and iterate it for you as well.
It's generally better to avoid for loops in R entirely; there's a whole family of apply-style functions for applying a function to (e.g.) every column of a data frame. As a very simple version of this, something like apply(iris, 2, table) will apply the table function along margin 2 (the columns) of iris and, in this case, place the results in a list. The idea would be to build your function to do what you want to a single vector, then pass each vector through it with something from the apply() family. For instance:
cleantable <- function(x) {
myspan = seq(min(x), max(x)) # if unspecified, by = 1
myfreq = cut(x, breaks = myspan, right = FALSE)
table(myfreq)
}
apply(iris[1:4], 2, cleantable) # can only use first 4 columns since 5th isn't numeric
would do what I think you were trying to do on the first 4 columns of iris. This way of programming will be generally more readable and less prone to mistakes.

Find range and number of positions with zero

I have two Excel files: one with per-position coverage values (chromosome, position, coverage) and one with start and end positions per chromosome. I want to know the range and number of positions with 0 coverage, and an output with the columns size, mapped, %mapped and Completeness, where:
size = (end - start)+1
mapped = positions with > 0 Coverage
%mapped = (mapped/size)*100
Completeness = (Total mapped/Total Size)*100
For example, for the above output, Completeness = ((3+2)/(7+5))*100 = 41.66%
I have several such input files to be analyzed. How can I do this in R?
To find which parts of a data.frame satisfy some condition, you can use which(): it gives you all the indexes for which that condition is TRUE, so you can use those to extract the parts you're interested in.
If we assume you have a data.frame called df1 for the first part of your question, and the second image is called df2, then you can get the index-range of the rows in df1 with 'chr1' like this:
range <- which(df1$chr=='chr1')[df2$start[1]]:which(df1$chr=='chr1')[df2$end[1]]
or instead of manually typing 'chr1', you can use df2$chr[1].
For the count, sum(df1[range, 'coverage'] > 0) tells you how many values are more than zero.
Now we need to do that for all rows together; we can use sapply to apply a function over all row indices:
df2$mapped <- sapply(1:nrow(df2), function(row) {
  range <- which(df1$chr==df2$chr[row])[df2$start[row]]:which(df1$chr==df2$chr[row])[df2$end[row]]
  sum(df1[range, 'coverage'] > 0)
})
Your other questions are easier answered than asked, as in R most functions are vectorised: you can do something for multiple values at the same time.
df2$size = (df2$end - df2$start)+1
df2$perc_mapped = (df2$mapped/df2$size)*100
Completeness uses the totals over all rows, sum(df2$mapped) and sum(df2$size).
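Putting the last step together (assuming df2 now has the mapped and size columns computed above):
Completeness <- sum(df2$mapped) / sum(df2$size) * 100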

Block bootstrap for genomic data

I am trying to implement a block bootstrap procedure, but I haven't figured out a way of doing this efficiently.
My data.frame has the following structure:
CHR POS var_A var_B
1 192 0.9 0.7
1 2000 0.8 0.3
2 3 0.21 0.76
2 30009 0.36 0.15
...
The first column is the chromosome identification, the second column is the position, and the last two columns are variables for which I want to calculate a correlation. The problem is that the rows are not entirely independent of one another; the closer two positions are, the more dependent they are, so I cannot simply do cor(df$var_A, df$var_B).
The way out of this problem that is commonly used with this type of data is performing a block bootstrap. That is, I need to divide my data into blocks of length X, randomly select one row inside that block, and then calculate my statistic of interest. Note, however, that these blocks need to be defined based on the column POS, and not based on the row number. Also, this procedure needs to be done for each chromosome.
I tried to implement this, but I came up with the slowest code possible (it didn't even finish running) and I am not 100% sure it works.
x = 1000
cors = numeric()
iter = 1000
for(j in 1:iter) {
df=freq[0,]
for (i in unique(freq$CHR)) {
t = freq[freq$CHR==i,]
fim = t[nrow(t),2]
i = t[1,2]
f = i + x
while(f < fim) {
rows = which(t$POS>=i & t$POS<f)
s = sample(rows)
df = rbind(df,t[s,])
i = f
f = f + x
}
}
cors = c(cors, cor(df$var_A, df$var_B))
}
Could anybody help me out? I am sure there is a more efficient way of doing this.
Thank you in advance.
One efficient approach would be to use the 'boot' package, whose functions include parallel processing capabilities.
In particular, the 'tsboot', or time series boot function, will select ordered blocks of data. This could work if your POS variable is some kind of ordered observation.
The boot package functions are great, but they need a little help first. To use bootstrap functions in the boot package, one must first wrap the statistic of interest in a function which includes an index argument. This is the device the bootstrap generated index will use to pass sampled data to your statistic.
cor_hat <- function(data, index) cor(y = data[index,]$var_A, x = data[index,]$var_B)
Note cor_hat in the arguments below. The sim = "fixed", l = 1000 arguments indicate you want fixed blocks of length (l) 1000. However, you could use blocks of any size, e.g. 5 or 10, if you're trying to capture nearest-neighbour dynamics moving over time. The parallel = "multicore" argument speaks for itself, but it may need to be "snow" if you are using Windows.
library(boot)
tsboot(data, cor_hat, R = 1000, sim = "fixed", l = 1000, parallel = "multicore", ncpus = 4)
In addition, page 194 of Elements of Statistical Learning provides a good example of the framework using the traditional boot function, all of which is relevant to tsboot.
Hope that helps, good luck.
Justin
I hope I understood you right:
# needed for round_any()
library(plyr)
res <- lapply(unique(freq$CHR), function(x){
  freq_sel <- freq[freq$CHR==x,]
  blocks <- lapply(seq(1, round_any(max(freq_sel$POS), 1000, ceiling), 1000),
                   function(ix) freq_sel[freq_sel$POS > ix & freq_sel$POS <= ix+999,])
  do.call(rbind, lapply(blocks, function(x) if (nrow(x) > 1) x[sample(1:nrow(x),1),] else x))
})
This should return a list with an entry for each chromosome. Within each entry, there's an observation per 1kb-block if present. The number of blocks is determined by the maximum POS value.
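To turn one such resample into a correlation estimate, something along these lines should work (a sketch; repeat it for as many bootstrap iterations as you need):
df_boot <- do.call(rbind, res)       # stack the per-chromosome samples
cor(df_boot$var_A, df_boot$var_B)    # one bootstrap replicate of the correlation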
EDIT:
library(doParallel)
library(foreach)
library(plyr)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
res <- foreach(x=unique(freq$CHR), .packages = 'plyr') %dopar% {
  freq_sel <- freq[freq$CHR==x,]
  blocks <- lapply(seq(1, round_any(max(freq_sel$POS), 1000, ceiling), 1000),
                   function(ix) freq_sel[freq_sel$POS > ix & freq_sel$POS <= ix+999,])
  do.call(rbind, lapply(blocks, function(x) if (nrow(x) > 1) x[sample(1:nrow(x),1),] else x))
}
stopCluster(cl)
This is a simple parallelisation with foreach on each Chromosome. It could be better to restructure the function and base the parallel processing on another level (such as the 1000 iterations or maybe the blocks). In any case, I can just stress again what I was saying in my comment: Before you work on parallelising your code, you should be sure that it's as efficient as possible. Meaning you might want to look into the boot package or similar to get an increase in efficiency. That said, with the number of iterations you're planning, parallel processing might be useful once you're comfortable with your function.
So, after a while I came up with an answer to my problem. Here it goes.
You'll need the package dplyr.
library(dplyr)

l = 1000
teste = freq %>%
  mutate(w = ceiling(POS/l)) %>%
  group_by(CHR, w) %>%
  sample_n(1)
This code creates a new variable named w based on the position in the genome (POS). This variable w is the window to which each row was assigned, and it depends on l, which is the length of your window.
You can repeat this code several times, each time sampling one row per window/CHR (with the sample_n(1)) and applying whatever statistic of interest you want, for example as sketched below.
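For example, a sketch that wraps the sampling in a function and repeats it 1000 times (assuming freq has the columns CHR, POS, var_A and var_B):
library(dplyr)

block_cor <- function(freq, l = 1000) {
  freq %>%
    mutate(w = ceiling(POS / l)) %>%   # assign each position to a window of length l
    group_by(CHR, w) %>%
    sample_n(1) %>%                    # one row per window per chromosome
    ungroup() %>%
    summarise(r = cor(var_A, var_B)) %>%
    pull(r)
}

cors <- replicate(1000, block_cor(freq))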
