How to get the recombined vectors efficiently?

Given two axes, both with positions from 1 to N (N can be several million, but we assume N = 1000 here),
there are two vectors recording the positions of some points on the two axes, respectively. For example:
chrm1 <- c(1, 35, 456, 732) # 4 points on axis 1 at positions 1, 35, 456, 732;
chrm2 <- c(23, 501, 980)
If a recombination occurs at position 300, the points beyond 300 on each axis switch to the other axis.
The two vectors recording the point positions then become:
chrm1 <- c(1, 35, 501, 980)
chrm2 <- c(23, 456, 732)
If a second recombination occurs at position 600, the new vectors will be:
chrm1 <- c(1, 35, 501, 732)
chrm2 <- c(23, 456, 980)
The real data look like this:
set.seed(1)
chrm1 <- sample.int(1e8, 50)
chrm2 <- sample.int(1e8, 50)
breaks.site <- sample.int(1e8, 5)
My brute-force approach was to swap points into the other vector for each break site:
for (i in breaks.site) {
  chrm1.new <- c(chrm1[chrm1 < i], chrm2[chrm2 > i])
  chrm2.new <- c(chrm1[chrm1 > i], chrm2[chrm2 < i])
  chrm1 <- chrm1.new
  chrm2 <- chrm2.new
}
But this is quite slow, because I have to do it 2 x 1000 x 20000 times.
How can I get the recombined vectors efficiently?
Background on recombination:
https://en.wikipedia.org/wiki/Genetic_recombination

Maybe this:
chrm1 <- c(1, 35, 456, 732)
chrm2 <- c(23, 501, 980)
breaks <- c(300, 600)
# check every point against every break, count how many breaks each point
# has passed, and take that count mod 2: an odd count means the point ends
# up on the other chromosome
changepos1 <- rowSums(outer(chrm1, breaks, ">")) %% 2
changepos2 <- rowSums(outer(chrm2, breaks, ">")) %% 2
#assemble results and sort
res1 <- sort(c(chrm1[!changepos1], chrm2[as.logical(changepos2)]))
#[1] 1 35 501 732
res2 <- sort(c(chrm2[!changepos2], chrm1[as.logical(changepos1)]))
#[1] 23 456 980
If outer needs too much memory due to the size of your problem, you can use a loop instead.
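A memory-light alternative, just as a sketch (and assuming no point ever sits exactly on a break position, a case the original loop also treats ambiguously): findInterval counts how many sorted breaks lie at or below each position, replacing the outer matrix with a single vector per chromosome.
changepos1 <- findInterval(chrm1, sort(breaks)) %% 2  # odd count of passed breaks => point has swapped
changepos2 <- findInterval(chrm2, sort(breaks)) %% 2
res1 <- sort(c(chrm1[changepos1 == 0], chrm2[changepos2 == 1]))  # same res1 as above
res2 <- sort(c(chrm2[changepos2 == 0], chrm1[changepos1 == 1]))  # same res2 as above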

Related

Converting Lists into Data Frames

I am trying to make a data frame with 3 columns (and 100 rows) containing random numbers such that the random numbers in each row add to 72.
I am using this code to generate the random numbers:
random_numbers <- diff(c(0, sort(sample(72, 2)), 72))
However, I can't "fit" these random numbers into a data frame because of the format. For example:
i <- 1:100
d <- data.frame(i)
d$rand <- diff(c(0, sort(sample(72, 2)), 72))
Error in `$<-.data.frame`(`*tmp*`, rand, value = c(37, 21, 14)) :
replacement has 3 rows, data has 100
I had another idea with which I can at least create all 100 sets of random numbers:
results <- list()
for (i in 1:100) {
  r_i <- diff(c(0, sort(sample(72, 2)), 72))
  results[[i]] <- r_i
}
results[1]
# [[1]]
# [1] 3 19 50
results[2]
# [[1]]
# [1] 16 11 45
But I am not sure how I can create a data frame with 3 columns and 100 rows from this data.
I know how to do this in the "classical" sense:
i <- 1:100
r_1 <- rnorm(5, 5, 100)
r_2 <- rnorm(5, 5, 100)
r_3 <- rnorm(5, 5, 100)
d <- data.frame(i, r_1, r_2, r_3)
But of course, in the above data frame, these 3 numbers will most certainly not add to 72.
Is it possible to take the 100 random numbers results that I generated above and then place them into a data frame?
We may use replicate with n set to the number of rows of 'd' and assign new columns from the transposed matrix output:
d[paste0("r_", 1:3)] <- t(replicate(nrow(d),
diff(c(0, sort(sample(72, 2)), 72))))
Testing that each row sums to 72:
> all(rowSums(d[-1]) == 72)
[1] TRUE
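Alternatively, the list built in the loop above can be stacked directly. A minimal sketch (assuming results still holds the 100 length-3 vectors from the question): do.call with rbind binds them row-wise into a matrix that data.frame accepts.
d2 <- data.frame(i = 1:100, do.call(rbind, results))  # stack the 100 vectors as rows
names(d2)[-1] <- paste0("r_", 1:3)                    # name the three value columns
all(rowSums(d2[-1]) == 72)                            # TRUE: every row still sums to 72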

Writing a median function in R

I have been tasked with writing my own median function in R, without using the built-in median function. If the count of values is even, average the two middle values, as is usual for the median.
This is something I could probably do in Java, but I struggle with some of the syntax in R. Code:
list1 <- c(7, 24, 9, 42, 12, 88, 91, 131, 47, 71)
sorted=list1[order(list1)]
sorted
n = length(sorted)
n
if(n%2==0) # problem here, implementing mod() and the rest of logic.
Here is a self-written function mymedian:
mymedian <- function(lst) {
  n <- length(lst)
  s <- sort(lst)
  ifelse(n %% 2 == 1, s[(n + 1) / 2], mean(s[n / 2 + 0:1]))
}
Example
list1 <- c(7, 24, 9, 42, 12, 88, 91, 131, 47, 71)
list2 <- c(7, 24, 9, 42, 12, 88, 91, 131, 47)
mymedian(list1)
mymedian(list2)
such that
> mymedian(list1)
[1] 44.5
> mymedian(list2)
[1] 42
I believe this should get you the median you're looking for:
homemade_median <- function(vec) {
  sorted <- sort(vec)
  n <- length(sorted)
  if (n %% 2 == 0) {
    mid <- sorted[c(floor(n / 2), floor(n / 2) + 1)]
    med <- sum(mid) / 2
  } else {
    med <- sorted[ceiling(n / 2)]
  }
  med
}
homemade_median(list1)
median(list1) # for comparison
A short function that does the trick:
my_median <- function(x){
  # Order the vector ascending
  x <- sort(x)
  # For even length, average the two values surrounding the midpoint
  if ((length(x) %% 2) == 0) {
    return((x[length(x) / 2] + x[length(x) / 2 + 1]) / 2)
  }
  # For odd length, just take the value that's right in the center
  else {
    return(x[(length(x) / 2) + 0.5])
  }
}
Check to see whether it returns the desired outcomes:
my_median(list1)
44.5
median(list1)
44.5
#
list2 <- c(1,4,5,90,18)
my_median(list2)
5
median(list2)
5
You don't need to test for evenness; you can just take the mean over an index sequence running from floor((length+1)/2) to ceiling((length+1)/2), which contains one element for odd lengths and two for even lengths:
x <- rnorm(100)
y <- rnorm(101)
my_median <- function(x)
{
  mid <- seq(floor((length(x) + 1) / 2), ceiling((length(x) + 1) / 2))
  mean(sort(x)[mid])
}
my_median(x)
[1] 0.1682606
median(x)
[1] 0.1682606
my_median(y)
[1] 0.2473015
median(y)
[1] 0.2473015

Generating a sequence where the gap between numbers increases

I need to generate a sequence in R where the gap between elements increases each time.
Seq1:
1, 49, 100, 154, ... 19306
Seq2:
48, 99, 153, 210, ..., 19650
Note the gap between Seq1 elements increases by 3 each time: 49 - 1 = 48, 100 - 49 = 51, 154 - 100 = 54, ...
The gap between Seq2 elements also increases by 3 each time: 99 - 48 = 51, 153 - 99 = 54, ...
Given the advice from @Dason:
seq1 <- seq(48, 19306, 3)       # 48, 51, 54, ...; cumulative sums of these give Seq2
which(cumsum(seq1) == 19650)    # 100, so Seq2 has 100 elements
seq2 <- cumsum(seq1)[1:100]     # Seq2: 48, 99, 153, 210, ..., 19650
seq3 <- seq(47, 19306, 3)       # 47, 50, 53, ...
seq4 <- seq2 - seq3[1:100]      # Seq1: 1, 49, 100, 154, ..., 19306
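A more direct construction along the same lines (a sketch; s1 and s2 are just new names so they don't overwrite the objects above): since the gaps themselves form an arithmetic sequence, cumsum over the gaps builds each sequence in one step.
s1 <- cumsum(c(1, seq(48, by = 3, length.out = 99)))   # Seq1: 1, 49, 100, 154, ..., 19306
s2 <- cumsum(seq(48, by = 3, length.out = 100))        # Seq2: 48, 99, 153, 210, ..., 19650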

R - Comparing data of two data.frames with a loop, constant by constant

Basically, I want to compare the same cells (same [X,Y] positions) of two data frames, apply a few operations to them, and store the result in a new data frame. The tricky part is that I need to process all the values of these data frames.
In other words, I need to compare the value of dataA[1,1] with dataB[1,1]; if they satisfy certain conditions I'll perform one operation, otherwise another, then store the result in a new data.frame.
Then rinse and repeat for dataA[1,2] and dataB[1,2], up to dataA[100,100] and dataB[100,100].
Obviously, I have to use a loop here (and some if/else), but I can't seem to figure out the proper structure.
Being used to PHP, I tried the foreach package in R, but it only returns FALSE (and it does so as a vector instead of a matrix with multiple columns). If I do the operation by hand, there are more TRUE than FALSE, so obviously something's wrong here:
x <- foreach(dataIDH, dataPIB) %do% {
  if (dataPIB <= dataIDH + 5 & dataPIB >= rankIDH - 5) {
    x <- mean(dataPIB, dataIDH)
  } else { x <- FALSE }
}
x
I did try a for loop, but I'm simply unable to put the results in a data.frame (let alone one that matches the layout of the data frames used here, which I need to do):
x <- for (idh in 1:nrow(dataIDH)) {
  for (idh in 1:ncol(dataIDH)) {
    for (pib in 1:nrow(dataPIB)) {
      for (pib in 1:ncol(dataPIB)) {
        if (pib <= idh + 5 & pib >= idh - 5) {
          x <- mean(pib, idh)
        } else { x <- FALSE }
      }
    }
  }
}
x
For information: the data frames contain numeric values for a set of countries (rows) over a few years (columns).
Any ideas on how to get out of this mess?
Edit 1: an extract of the two data frames used (the 1st row and column displayed here are actually headers):
dataIDH
CountryCode,2005,2006,2007,2008
AFG,14,14,16,16
ALB,100,98,99,98
DZA,85,86,90,86
dataPIB
CountryCode,2005,2006,2007,2008
AFG, 69, 18, 70, 71
ALB, 102, 98, 97, 63
DZA, 85, 89, 91, 137
Edit 2: the final result should be a new data.frame with the same layout:
x
CountryCode,2005,2006,2007,2008
AFG, FALSE, 16, FALSE, FALSE
ALB, 101, 98, 98, FALSE
DZA, 85, 87.5, 90.5, FALSE
Here is the basic looping way. Hope this helps you.
df <- dataIDH
for (i in 1:length(dataIDH$CountryCode)) {
  for (j in 2:ncol(dataIDH)) {
    if ((dataIDH[i, j] <= dataPIB[i, j] + 5) & (dataPIB[i, j] <= dataIDH[i, j] + 5)) {
      # average the two cells; note that mean(a, b) would not do this, because the
      # second argument of mean() is the trim fraction, so wrap the values in c()
      df[i, j] <- mean(c(dataPIB[i, j], dataIDH[i, j]))
    } else {
      df[i, j] <- "FALSE"
    }
  }
}
Here is another way, wrapped in a reusable function:
df1 <- data.frame(a = rnorm(1000),
                  b = rnorm(1000),
                  c = rnorm(1000))
df2 <- data.frame(aa = rnorm(1000, 3, 3),
                  bb = rnorm(1000, -2, 3),
                  cc = rnorm(1000, 5, 3))
df3 <- data.frame(df1, df2)
test <- function(df, column_number1, column_number2) {
  mean_vec <- apply(df[, c(column_number1, column_number2)], 1,
                    function(x) mean(x, na.rm = TRUE))
  dif_vec <- abs(df[, column_number1] - df[, column_number2])
  ind_true <- dif_vec <= 5
  ind_false <- dif_vec > 5
  column_name <- paste(colnames(df)[column_number1],
                       colnames(df)[column_number2], sep = "_")
  df[ind_true, column_name] <- mean_vec[ind_true]
  df[ind_false, column_name] <- "FALSE"
  return(df)
}
df3 <- test(df3, 1, 4)
df3 <- test(df3, 2, 5)
df3 <- test(df3, 3, 6)
Assuming you don't actually want to convert your data to strings (which would be necessary to include "FALSE" in a numeric vector), R is really good at working with vectors and matrices...
dataIDH <- read.csv(header = TRUE, as.is = TRUE, text = "
CountryCode,2005,2006,2007,2008
AFG,14,14,16,16
ALB,100,98,99,98
DZA,85,86,90,86
")
dataPIB <- read.csv(header = TRUE, as.is = TRUE, text = "
CountryCode,2005,2006,2007,2008
AFG, 69, 18, 70, 71
ALB, 102, 98, 97, 63
DZA, 85, 89, 91, 137
")
x <- abs(dataIDH[-1] - dataPIB[-1]) <= 5
y <- (dataIDH[-1] + dataPIB[-1]) / 2
y[!x] <- NA
cbind(dataIDH[1], y)
# CountryCode X2005 X2006 X2007 X2008
# 1 AFG NA 16.0 NA NA
# 2 ALB 101 98.0 98.0 NA
# 3 DZA 85 87.5 90.5 NA

Intersection in R

I have two tables.
Both tables have only 1 column.
Both have random integer values between 1 and 1000.
I want to intersect these two tables. The catch is that I want to match numbers even if they differ by up to about 10.
1st table -> 5 , 50, 160, 280
2nd table -> 14, 75, 162, 360
Output ->
1st table -> 5, 160
2nd table -> 14, 162
How can I achieve this in R?
You could do this with the sapply function, checking if each element of x or y is sufficiently close to some member of the other vector:
x <- c(5, 50, 160, 280)
y <- c(14, 75, 162, 360)
new.x <- x[sapply(x, function(z) min(abs(z-y)) <= 10)]
new.y <- y[sapply(y, function(z) min(abs(z-x)) <= 10)]
new.x
# [1] 5 160
new.y
# [1] 14 162
Here is an approach that uses the outer function (so your 2 tables will need to be reasonably sized):
x <- c(5, 50, 160, 280)
y <- c(999, 14, 75, 162, 360)
tmp1 <- outer(x, y, function(x, y) abs(x - y))   # all pairwise absolute differences
tmp2 <- which(tmp1 <= 10, arr.ind = TRUE)        # row/column indices of close pairs
rbind(
  x = x[tmp2[, 1]],
  y = y[tmp2[, 2]]
)
This looks at every possible pair between x and y and computes the difference between the 2 values, then finds those with a difference <= 10.
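For the vectors above, the result should look something like this (the pair order follows which(), which scans the difference matrix column by column):
#   [,1] [,2]
# x    5  160
# y   14  162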
