Is there a quick way to transform intervals (Start and End) into a list of the numbers in each interval in R?

I have a file with interval values such as this for 50M lines:
>data
start_pos end_pos
1 1 10
2 3 6
3 5 9
4 6 11
And I would like to have a table of position occurrences so that I can compute the coverage on each position in the interval file such as this:
>occurence
position coverage
1 1
2 1
3 2
4 2
5 3
6 4
7 3
8 3
9 3
10 2
11 1
Is there a fast and efficient way to complete this task in R?
My plan was to loop through the data and concatenate the sequence in each interval into a vector and convert the final vector into a table.
count <- c()
for (row in 1:nrow(data)) {
  # append every position covered by this interval
  count <- c(count, data[row, ]$start_pos:data[row, ]$end_pos)
}
occurence <- table(count)
The problem is that my file is huge, and it takes way too much time and memory to do this.

The Bioconductor IRanges package does this quickly and efficiently:
library(IRanges)
ir = IRanges(start = c(1, 3, 5, 6), end = c(10, 6, 9, 11))
coverage(ir)
with the result expanded to one value per position:
> coverage(ir) |> as.data.frame()
value
1 1
2 1
3 2
4 2
5 3
6 4
7 3
8 3
9 3
10 2
11 1
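For comparison, here is a base-R sketch of the same computation (my own addition, not part of the IRanges answer): mark +1 at every interval start and -1 just past every interval end, then take a cumulative sum. tabulate handles duplicated positions, and the positions are assumed to be positive integers.
start <- c(1, 3, 5, 6)
end   <- c(10, 6, 9, 11)
n <- max(end)
# +1 at every start, -1 one past every end; the cumulative sum gives the coverage
delta <- tabulate(start, nbins = n + 1) - tabulate(end + 1, nbins = n + 1)
cumsum(delta)[seq_len(n)]
# [1] 1 1 2 2 3 4 3 3 3 2 1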

Related

Adding NA's where data is missing [duplicate]

I have a dataset that looks like the following:
id = c(1,1,1,2,2,2,3,3,4)
cycle = c(1,2,3,1,2,3,1,3,2)
value = 1:9
data.frame(id,cycle,value)
> data.frame(id,cycle,value)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 3 8
9 4 2 9
So basically, there is a variable called id that identifies the sample, a variable called cycle that identifies the timepoint, and a variable called value that gives the value at that timepoint.
As you can see, sample 3 does not have cycle 2 data, and sample 4 is missing cycle 1 and 3 data. What I want to know is whether there is a way, without a loop, to place NA's where there is no data. So I would like my dataset to look like the following:
> data.frame(id,cycle,value)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA
I am able to solve this problem with a lot of loops and if statements, but the code is extremely long and cumbersome (I have many more columns in my real dataset).
Also, the number of samples I have is very large, so I need something that is generalizable.
Using merge and expand.grid, we can come up with a solution. expand.grid creates a data.frame with all combinations of the supplied vectors (so you'd supply it with the id and cycle variables). By merging this grid to your original data (and using all.x = T, which is like a left join in SQL), the rows that are missing from dat are filled in with NA.
id = c(1,1,1,2,2,2,3,3,4)
cycle = c(1,2,3,1,2,3,1,3,2)
value = 1:9
dat <- data.frame(id,cycle,value)
grid_dat <- expand.grid(id = 1:4, cycle = 1:3)
# or you could do (HT @jogo):
# grid_dat <- expand.grid(id = unique(dat$id), cycle = unique(dat$cycle))
merge(x = grid_dat, y = dat, by = c('id','cycle'), all.x = T)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA
A solution based on the tidyverse package:
library(tidyverse)
# Create example data frame
id <- c(1, 1, 1, 2, 2, 2, 3, 3, 4)
cycle <- c(1, 2, 3, 1, 2, 3, 1, 3, 2)
value <- 1:9
dt <- data.frame(id, cycle, value)
# Complete the combination between id and cycle
dt2 <- dt %>% complete(id, cycle)
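If you prefer something other than NA for the missing entries, complete() also takes a fill argument; a small sketch extending the line above (dt3 is just an arbitrary name here):
# fill the missing value entries with 0 instead of NA
dt3 <- dt %>% complete(id, cycle, fill = list(value = 0))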
Here is a solution with data.table doing a cross join:
library("data.table")
d <- data.table(id = c(1,1,1,2,2,2,3,3,4), cycle = c(1,2,3,1,2,3,1,3,2), value = 1:9)
d[CJ(id=id, cycle=cycle, unique=TRUE), on=.(id,cycle)]
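For reference, the CJ() call builds the cross join (every id/cycle combination) that d is then joined onto; written out on its own with explicit columns, it would look something like this:
# a keyed data.table with the 12 unique id/cycle combinations
CJ(id = d$id, cycle = d$cycle, unique = TRUE)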

R: Calculating displacement of variables in order

We are developing an online test in which we present 6 images (in a row) that participants are asked to reshuffle from smallest to largest (the actual objective is more challenging, but that is beside the point here). The test begins with the images presented in a random order. In the end, I would like to calculate the total displacement (or deviation) of the participant's response from the correct response.
To illustrate:
We begin by presenting six images like so:
[img1] [img2] [img3] [img4] [img5] [img6]
A participant might then reshuffle the images to:
[img2] [img4] [img3] [img1] [img6] [img5]
The correct order for this trial might actually be:
[img1] [img4] [img3] [img2] [img5] [img6]
Thus, we see that the participant has not placed all images at the correct position: img1 and img2 are each misplaced by three positions, and img5 and img6 by one position each. The total displacement is therefore 3 + 3 + 1 + 1 = 8.
Is there an elegant way in R to calculate this displacement?
You can try this
sum(abs(match(x,y)-match(x,x)))
Data
x = c(2,4,3,1,6,5)  # the participant's order
y = c(1,4,3,2,5,6)  # the correct order
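Running this on the example data reproduces the expected total of 8; note that match(x, x) is simply seq_along(x) here, since the image IDs are unique:
sum(abs(match(x, y) - match(x, x)))
# [1] 8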
I am decomposing this a lot; obviously you can compress it into two lines if you want to.
This is the correct output:
> correct_output <- data.frame(img=c(1, 4, 3, 2, 5, 6), rank=1:6)
img rank
1 1 1
2 4 2
3 3 3
4 2 4
5 5 5
6 6 6
This is the user output:
> user_output <- data.frame(img=c(2, 4, 3, 1, 6, 5), user_rank=1:6)
img user_rank
1 2 1
2 4 2
3 3 3
4 1 4
5 6 5
6 5 6
Let us bind them together, saving the result as tmp:
> tmp <- merge(correct_output, user_output, by="img")
> tmp
img rank user_rank
1 1 1 4
2 2 4 1
3 3 3 3
4 4 2 2
5 5 5 6
6 6 6 5
From here it is very easy. I am using the dplyr package.
> library(dplyr)
> tmp <- mutate(tmp, penalty = abs(rank - user_rank))
> tmp
img rank user_rank penalty
1 1 1 4 3
2 2 4 1 3
3 3 3 3 0
4 4 2 2 0
5 5 5 6 1
6 6 6 5 1
> sum(tmp$penalty)
[1] 8
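For completeness, the two-line compression alluded to at the start might look like this (a sketch that skips the intermediate prints):
tmp <- merge(correct_output, user_output, by = "img")
sum(abs(tmp$rank - tmp$user_rank))
# [1] 8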

Percolation clustering

Consider the following groupings:
> data.frame(x = c(3:5,7:9,12:14), grp = c(1,1,1,2,2,2,3,3,3))
x grp
1 3 1
2 4 1
3 5 1
4 7 2
5 8 2
6 9 2
7 12 3
8 13 3
9 14 3
Let's say I don't know the grp values but only have a vector x. What is the easiest way to generate grp values, essentially an id field for groups of values that are within a threshold of each other? Is this a percolation algorithm?
One option would be to compare each value with the previous one, check whether the difference is greater than 1, and take the cumulative sum of that logical vector:
df1$grp <- cumsum(c(TRUE, diff(df1$x) > 1))
df1$grp
#[1] 1 1 1 2 2 2 3 3 3
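To generalise the threshold, the same idea works for any maximum allowed gap; group_by_gap below is a hypothetical helper (it assumes x is sorted), not part of the original answer:
group_by_gap <- function(x, threshold = 1) {
  # start a new group whenever the gap to the previous value exceeds the threshold
  cumsum(c(TRUE, diff(x) > threshold))
}
x <- c(3:5, 7:9, 12:14)
group_by_gap(x)     # 1 1 1 2 2 2 3 3 3
group_by_gap(x, 2)  # 1 1 1 1 1 1 2 2 2 (gaps of 2 no longer start a new group)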

apply conditional numbering to grouped data in R

I have a table like the one below with 100's of rows of data.
ID RANK
1 2
1 3
1 3
2 4
2 8
3 3
3 3
3 3
4 6
4 7
4 7
4 7
4 7
4 7
4 6
I want to find a way to group the data by ID so that I can ReRank each group separately. The ReRank column is based on the Rank column, basically renumbering it starting at 1 from least to greatest; it's important to note that the same number can appear more than once in the ReRank column, depending on the ties in the Rank column.
In other words, the output needs to look like this
ID Rank ReRANK
1 3 2
1 2 1
1 3 2
2 4 1
2 8 2
3 3 1
3 3 1
3 3 1
For the life of me, I can't figure out how to ReRank the rows within each group by the value of the Rank column.
This has been my best guess so far, but it definitely is not doing what I need it to do
# (this just increments a counter whenever adjacent Rank values differ,
#  without grouping by ID or ordering by Rank)
ReRANK = mat.or.vec(length(RANK), 1)
ReRANK[1] = counter = 1
for (i in 2:length(RANK)) {
  if (RANK[i] != RANK[i-1]) { counter = counter + 1 }
  ReRANK[i] = counter
}
Thank you in advance for the help!!
Here is a base R method using ave and rank:
df$ReRank <- ave(df$Rank, df$ID, FUN=function(i) rank(i, ties.method="min"))
The ties.method = "min" argument in rank ensures that tied values all receive the minimum of their ranks; the default is to take the mean of the ranks.
If there are ties lower down in a group, rank counts those lower values and continues with the next distinct value at the count of the lower values + 1, so the resulting values are still ordered and distinct, but with gaps. If you really want the ranks to count 1, 2, 3, and so on, rather than 1, 3, 6 or whatever depending on the number of duplicated values, here is a little hack using factor:
df$ReRank <- ave(df$Rank, df$ID, FUN=function(i) {
  as.integer(factor(rank(i, ties.method="min")))
})
Here, we use factor to map the distinct rank values to levels counting upward from 1, and then coerce the result to integer.
For example:
temp <- c(rep(1, 3), 2, 5, 1, 4, 3, 7)
rank(temp)
[1] 2.5 2.5 2.5 5.0 8.0 2.5 7.0 6.0 9.0
rank(temp, ties.method="min")
[1] 1 1 1 5 8 1 7 6 9
as.integer(factor(rank(temp, ties.method="min")))
[1] 1 1 1 2 5 1 4 3 6
data
df <- read.table(header=T, text="ID Rank
1 2
1 3
1 3
2 4
2 8
3 3
3 3
3 3 ")
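If you already use dplyr, a grouped mutate with dense_rank is an equivalent sketch; dense_rank collapses tied ranks to consecutive integers, matching the factor() hack above:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(ReRank = dense_rank(Rank)) %>%
  ungroup()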

R: How to make sequence (1,1,1,2,3,3,3,4,5,5,5,6,7,7,7,8)

Title says it all: how would I code such a repeating sequence, where the base repeat unit is the vector c(1,1,1,2), repeated 4 times but incrementing the values in the vector by 2 each time?
I've tried a variety of rep, times, each, and seq combinations and can't get the wanted result.
c(1,1,1,2) + rep(seq(0, 6, 2), each = 4)
# [1] 1 1 1 2 3 3 3 4 5 5 5 6 7 7 7 8
The rep function allows for a vector of the same length as x to be used in the times argument. We can extend the desired pattern with the super secret rep_len.
rep(1:8, rep_len(c(3, 1), 8))
#[1] 1 1 1 2 3 3 3 4 5 5 5 6 7 7 7 8
I'm not sure if I get it right, but what's wrong with something as simple as this:
rep <- c(1,1,1,2)   # note: `rep` is also the name of a base function; a different variable name would be clearer
step <- 2
vec <- c(rep, step + rep, 2*step + rep, 3*step + rep)
I accepted luke's answer as it is the easiest for me to understand (and closest to what I was already trying, but failing with!).
I have used this final form:
> c(1,1,1,2)+rep(c(0,2,4,6),each=4)
[1] 1 1 1 2 3 3 3 4 5 5 5 6 7 7 7 8
You could do:
pattern <- rep(c(3, 1), len = 50)
unlist(lapply(1:8, function(x) rep(x, pattern[x])))
[1] 1 1 1 2 3 3 3 4 5 5 5 6 7 7 7 8
This lets you adjust the length of the pattern via rep(len = X) and avoids the addition that some of the other answers use.
How about:
input <- c(1,1,1,2)
n <- 4
increment <- 2
sort(rep.int(seq.int(from = 0, by = increment, length.out = n), length(input))) + input
[1] 1 1 1 2 3 3 3 4 5 5 5 6 7 7 7 8
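For reuse, a generalised wrapper over arbitrary base units, repeat counts, and increments might look like this (make_steps is a hypothetical name, not from any of the answers):
make_steps <- function(base = c(1, 1, 1, 2), times = 4, step = 2) {
  # repeat the base unit, shifting each repetition up by an extra `step`
  rep(base, times) + rep(step * (seq_len(times) - 1), each = length(base))
}
make_steps()
# [1] 1 1 1 2 3 3 3 4 5 5 5 6 7 7 7 8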

Resources