Compare and obtain interval intersections between rows - R

I have a data frame like the following.
pos1<-c(5,15,25,40,80,5,18,22,38,84,5,16,50,92,31,50,20,30,50,70,27,50,60,50,90,20,40)
pos2<-c(10,17,30,42,90,10,20,24,42,87,10,19,52,100,40,70,25,32,60,90,30,60,71,60,100,25,50)
chr<-c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2)
n<-c(25,65,78,56,35,78,58,98,14,25,65,85,98,74,20,36,48,98,52,69,21,47,53,10,12,37,82)
pop<-c("A","A","A","A","A","B","B","B","B","C","C","C","C","C","D","D","A","A","A","A","B","B","B","C","C","D","D")
data<-data.frame(pos1,pos2,chr,pop,n)
pos1 and pos2 designate the start and end points of an interval for each chr and population. My intention is to find which intervals intersect between pops A, B and C (not D), and which intervals are unique to each population.
So, for the unique intervals I would have an outcome data.frame like the following:
pos1.u<-c(25,50,92,20,30,27,90)
pos2.u<-c(30,52,100,25,32,30,100)
chr.u<-c(1,1,1,2,2,2,2)
pop.u<-c("A","B","C","A","A","B","C")
n.u<-c(78,98,74,48,98,21,12)
data.u<-data.frame(pos1.u,pos2.u,chr.u,pop.u,n.u)
And for the intervals that intersect between those 3 populations, a data.frame like the following:
pos1.c<-c(5,15,40,80,5,38,85,5,16,50,70,50,60,50)
pos2.c<-c(10,17,42,90,10,42,87,10,19,60,90,60,71,60)
chr.c<-c(1,1,1,1,1,1,1,1,1,2,2,2,2,2)
pop.c<-c("A","A","A","A","B","B","B","C","C","A","A","B","B","C")
n.c<-c(25,65,56,35,78,14,25,65,85,52,69,47,53,10)
data.c<-data.frame(pos1.c,pos2.c,chr.c,pop.c,n.c)
I don't know how to write a script that does precisely this; can you help me?

I think the following code does what you ask for, although it produces different results from yours, so please check it carefully! The discrepancy, I think, lies in the definition of open and closed intervals. The following assumes that neither end point is included, whereas I suspect this might not be what you mean (otherwise (15,18) and (17,19) would not count as overlapping, as there is no integer value that falls in both). So you might need to adjust the open/closed definitions below; a short demonstration of that setting follows the main code.
pos1<-c(5,15,25,40,80,5,18,22,38,84,5,16,50,92,31,50,20,30,50,70,27,50,60,50,90,20,40)
pos2<-c(10,17,30,42,90,10,20,24,42,87,10,19,52,100,40,70,25,32,60,90,30,60,71,60,100,25,50)
chr<-c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2)
n<-c(25,65,78,56,35,78,58,98,14,25,65,85,98,74,20,36,48,98,52,69,21,47,53,10,12,37,82)
pop<-c("A","A","A","A","A","B","B","B","B","C","C","C","C","C","D","D","A","A","A","A","B","B","B","C","C","D","D")
data<-data.frame(pos1,pos2,chr,pop,n,stringsAsFactors = FALSE)
library(intervals)
data<-data[data$pop!="D",] #remove irrelevant D entries
rownames(data) <- seq_len(nrow(data)) #reset rownames to allow for removed Ds
#set ints as a list of intervals (as required by intervals package)
ints <- tapply(1:nrow(data), data$pop, function(v)
  Intervals(as.matrix(data[v, c("pos1","pos2")]),
            closed=c(FALSE,FALSE), #this is where you adjust open/closed lower and upper ends of the intervals - TRUE means end value included
            type="Z")) #Z is integers
pops <- unique(data$pop) #unique values of pop
popidx <- lapply(pops,function(x) which(data$pop==x)) #list of indices of these values in data
names(popidx) <- pops
#sets is a df of all pairwise combinations to check
sets <- expand.grid(pops,pops,stringsAsFactors = FALSE)
sets <- sets[sets$Var1!=sets$Var2,]
olap <- lapply(1:nrow(sets), function(i)
  interval_overlap(ints[[sets$Var1[i]]], ints[[sets$Var2[i]]])) #list of overlaps
olap <- lapply(1:nrow(sets), function(i) {
  df <- as.data.frame(olap[[i]], stringsAsFactors=FALSE)
  df$pos1 <- as.numeric(rownames(df))
  df$pos2 <- sapply(1:nrow(df), function(j) popidx[[sets$Var2[i]]][df[j,1][[1]][1]])
  return(df)}) #tidy up as dfs, with correct indices in data (rather than in ints)
olap <- do.call(rbind,olap)[,-1] #join dataframes
olap$olaps <- !is.na(olap$pos2) #identify those with overlaps
#group by unique pos1 and identify max and min no of overlaps with other groups
olap <- data.frame(minoverlap=tapply(olap$olaps, olap$pos1, min),
                   maxoverlap=tapply(olap$olaps, olap$pos1, max))
olap$rowno <- as.numeric(rownames(olap))
uniques <- data[olap$rowno[olap$maxoverlap==0],] #intervals appearing in just one pop
commons <- data[olap$rowno[olap$minoverlap>0],] #intervals with an overlap in all other pops
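To illustrate the open/closed distinction mentioned above, here is a minimal sketch using the (15,18) / (17,19) pair from the explanation (made-up intervals, not rows from your data):
library(intervals)
a <- rbind(c(15, 18))
b <- rbind(c(17, 19))
#open integer intervals: (15,18) contains {16,17} and (17,19) contains {18}, so no overlap is reported
interval_overlap(Intervals(a, closed=c(FALSE,FALSE), type="Z"),
                 Intervals(b, closed=c(FALSE,FALSE), type="Z"))
#closed integer intervals: [15,18] and [17,19] share 17 and 18, so an overlap is reported
interval_overlap(Intervals(a, closed=c(TRUE,TRUE), type="Z"),
                 Intervals(b, closed=c(TRUE,TRUE), type="Z"))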

Related

R Linear Interpolation of data into every 1 unit where max(x) is variable and is not a whole number

I have the following two data sets
'''
Percent <- c(0,30.4,99.6)
Value1 <- c(100,80.4,70)
Value2 <- c(0.04,0.06, 0.062)
DF1 <- data.frame(Percent,Value1,Value2)
Percent_A <- c(0,10,50.2,70,90.1,130.6,150,180.3)
Value1_A <- c(100,90,88,70,60,62,62,58)
Value2_A <- c(0.04,0.042,0.05,0.059,0.06,0.066,0.07,0.074)
DF2 <- data.frame(Percent_A, Value1_A, Value2_A)
'''
I would like to interpolate these data frames (in reality I have many of these data frames, so it would be ideal if I could use one approach to deal with all of them). I have some basic understanding of interpolation in R but am running into a couple of problems.
I would like these data frames to be interpolated into new data frames with y values given for every whole-number percentage value. This is causing me issues for two reasons: (1) the data frames have different max(Percent) values; (2) the max(Percent) values are frequently not whole numbers. Based on my limited knowledge of interpolation in R, the approxfun function seems useful, but I am unsure how to interpolate at every 1-percentage-unit step when the maximum known percentage value is not a whole number.
It would be great if I could interpolate Value1 and Value2 (in relation to Percent) for a given DF at the same time.
I would love your knowledge and insight on this matter! Thank you for the time and consideration on this topic.
The intended result is of course quite long (please let me know if you'd like more detail):
Intended result:
'''
DF1$Percent <- c(0:100)
DF1$Value1 <- #interpolated values for percent= 1,2,3....99,100
DF1$Value2 <- #interpolated values for percent= 1,2,3....99,100
DF2$Percent_A <- c(0:180)
DF2$Value1_A <- #interpolated values for percent= 1,2,3....179,180
DF2$Value2_A <- #interpolated values for percent= 1,2,3....179,180
'''
Note: I will then filter DF2 so that only the interpolated values for percent = 1-100 are displayed (as this is all that I am interested in).
You can only interpolate within the range of the data, not beyond it, so here is an approach:
Pct1 <- with(DF1, ceiling(min(Percent)):floor(max(Percent)))
DF1.I1 <- with(DF1, approx(Percent, Value1, xout=Pct1))
DF1.I2 <- with(DF1, approx(Percent, Value2, xout=Pct1))
DF1.interp <- data.frame(Percent=Pct1, Value1=DF1.I1$y, Value2=DF1.I2$y)
The maximum DF1$Percent is 99.6 so you cannot interpolate to 100.
Pct2 <- with(DF2, ceiling(min(Percent_A)):floor(max(Percent_A)))
DF2.I1 <- with(DF2, approx(Percent_A, Value1_A, xout=Pct2))
DF2.I2 <- with(DF2, approx(Percent_A, Value2_A, xout=Pct2))
DF2.interp <- data.frame(Percent_A=Pct2, Value1_A=DF2.I1$y, Value2_A=DF2.I2$y)
Now plots of Value1:
dev.new(width=10, height=6)
par(mfrow=c(1, 2))
plot(Value1~Percent, DF1.interp, type="l")
plot(Value1_A~Percent_A, DF2.interp, type="l")
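Since you mention having many such data frames, the same steps can be wrapped into a small helper and applied to each one. Here is a minimal sketch (the function name interp_df and the assumption that the first column is always the percent scale are mine, not part of the answer above):
#interpolate every other column against the first column, at whole-number steps within its range
interp_df <- function(df) {
  pct <- ceiling(min(df[[1]])):floor(max(df[[1]]))
  out <- data.frame(pct)
  names(out) <- names(df)[1]
  for (col in names(df)[-1]) {
    out[[col]] <- approx(df[[1]], df[[col]], xout=pct)$y
  }
  out
}
DF1.interp <- interp_df(DF1)
DF2.interp <- interp_df(DF2)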

How to find approximately close values in differing numbers of rows from two dataframes?

I have two data frames, one with 24 rows * 2 columns and another with 258 rows * 2 columns. The columns are similar; I am interested in one column and want to find the values in the two data frames that are approximately close to each other.
I am trying to simulate a spectrum and compare with an experiment.
df_exp <- data.frame("Name"=c(exp,Int), "exp" = c(x1, x2, x3, ...,x258),"int"= c(y1,y2,y3,...,y258))
df_sim <- data.frame("Name"=c(sim,Int), "sim" = c(x1, x2, x3, ...,x24),"int" = c(y1,y2,y3,...,y24))
Initial values (exp column from df_exp and sim column from df_sim):
exp sim
206.0396 182.0812
207.1782 229.1183
229.0776 246.1448
232.1367 302.1135
241.1050 319.1401
246.1691 357.1769
250.0235 374.2034
... ...
I tried this r code
match(df_exp$exp[1:258], df_sim$sim[1:24], nomatch = 0)
This code gives me all zero values because there is no exact match; the numbers always vary in the decimal places. I tried rounding the numbers to zero decimal places and finding values that are close, but that is not my intent. I want to find df_exp(229.0776, 246.1691, ...) and df_sim(229.1183, 246.1448, ...) and make a new data frame with all those approximately close values. Can you please suggest some help?
You can define a similarity cutoff and loop over both data frames:
### define your cutoff for similarity
cutoff <- 0.01
### initialize vectors to store the similar values
similar_sim <- vector(); similar_exp <- vector();
### open loop over both DF values
for (sim_value in df_sim$sim) {
  for (exp_value in df_exp$exp) {
    ### if similar (< cutoff) append values to vectors
    if ( abs(sim_value - exp_value) < cutoff ) {
      similar_sim <- append(similar_sim, sim_value)
      similar_exp <- append(similar_exp, exp_value)
    }
  }
}
### recreate a DF with the similar values
similar_df <- as.data.frame(cbind(similar_sim, similar_exp))
That works if you want to keep every value of one data frame that is similar to a value of the other, as it sounds like you do. Otherwise you can skip a loop and use range selection against a single target value, e.g.:
x[ x < target + cutoff & x > target - cutoff ]
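For data this small (24 x 258 pairs), the double loop can also be replaced by a single vectorized comparison. This is just an alternative sketch, not part of the answer above:
### all pairwise absolute differences at once, then keep the pairs under the cutoff
cutoff <- 0.5 #adjust to whatever "close" means for your data
d <- abs(outer(df_sim$sim, df_exp$exp, "-")) #24 x 258 matrix of differences
idx <- which(d < cutoff, arr.ind = TRUE)
similar_df <- data.frame(sim = df_sim$sim[idx[, "row"]],
                         exp = df_exp$exp[idx[, "col"]])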

Vectorizing a column-by-column comparison to separate values

I'm working with data gathered from multi-channel electrode systems, and am trying to make this run faster than it currently is, but I can't find any good way of doing it without loops.
The gist of it is: I have modified averages for each column (each column is a channel), and need to compare each value in a column to the average for that column. If a value is above the adjusted mean, I need to put that value in another data frame so it can be easily read.
Here is some sample code for the problematic bit:
readout <- data.frame(dimnames <- c("Values"))
#need to clear the dataframe in order to run it multiple times without errors
#timeFrame is just a subsection of the original data, 60 channels with upwards of a few million rows
readout <- readout[0,]
for (i in 1:ncol(timeFrame)){
  for (g in 1:nrow(timeFrame)){
    if (timeFrame[g,i] >= posCompValues[i,1])
      append(spikes, timeFrame[g,i])
  }
}
The data ranges from 500 thousand to upwards of 130 million readings, so if anyone could point me in the right direction I'd appreciate it.
Something like this should work:
Return values of x greater than y:
cmpfun <- function(x,y) return(x[x>y])
For each element (column) of timeFrame, compare with the corresponding value of the first column of posCompValues
vals1 <- Map(cmpfun,timeFrame,posCompValues[,1])
Collapse the list into a single vector:
spikes <- unlist(vals1)
If you want to save both the value and the corresponding column, it may be worth unpacking this a bit into a for loop:
resList <- list()
for (i in seq(ncol(timeFrame))) {
  tt <- timeFrame[,i]
  spikes <- tt[tt>posCompValues[i,1]]
  if (length(spikes)>0) {
    resList[[i]] <- data.frame(value=spikes,orig_col=i)
  }
}
res <- do.call(rbind, resList)
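If the loop is still too slow at this scale, a fully vectorized variant is possible. The following is a rough sketch under the same assumptions about timeFrame and posCompValues, not part of the answer above (note it builds a logical matrix the same size as timeFrame, so it trades memory for speed):
#compare every cell against its column's threshold in one step,
#then recover both the value and the originating column from the index matrix
tm  <- as.matrix(timeFrame)
hit <- which(sweep(tm, 2, posCompValues[, 1], ">"), arr.ind = TRUE)
res <- data.frame(value = tm[hit], orig_col = hit[, "col"])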

Extracting raster values, from maximum, to cumulatively sum to x

I am trying to determine the location of raster cells that add up to a given amount, starting with the maximum value and progressing down.
E.g., my raster of 150,000 cells has a total sum of 52,000,000:
raster1 <- raster("myvalues.asc")
cellStats(raster1,sum) = 52,000,000
I can extract the cells above the 95th percentile;
q95 <- raster1
q95[q95 < quantile(q95,0.95)] <- NA
cellStats(q95,sum) = 14,132,000
As you can see, the top 5% of cells (based on quantile maths) account for around 14 million of the 52 million total of 'raster1'.
What i want to do is predetermine the overall sum as 10,000,000 (or x) and then cumulatively sum raster cells, starting with the maximum value and working down, until I have (and can plot) all cells that sum up to x.
I have attempted to convert 'raster1' to a vector, sort it, take the cumulative sum, etc., but I can't tie it back to the raster. Any help here is much appreciated.
S
The below is your own answer, but rewritten so that it is more useful to others (self-contained). I have also changed the %in% to <, which should be much more efficient.
library(raster)
r <- raster(nr=100, nc=100)
r[] <- sample(ncell(r)) #example raster with known values
rs <- sort(as.vector(r), decreasing=TRUE) #cell values, largest first
r_10m <- min( rs[cumsum(rs) < 10000000] ) #smallest value that still fits within the 10,000,000 budget
test <- r
test[test < r_10m ] <- NA #drop every cell below that threshold
cellStats(test, sum)
Couldn't find the edit button...
This is something like what I need, after an hour of scratching my head:
raster1v <- as.vector(raster1)
raster1vdesc <- sort(raster1v, decreasing=T)
raster1_10m <- raster1vdesc[cumsum(raster1vdesc)<10000000]
test <- raster1
test[!test%in%raster1_10m] <- NA
plot(test)
cellStats(test,sum) = 9,968,073
Seems to work, perhaps, I dunno. Anything more elegant would be ideal.
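For something slightly more elegant, the same logic can be wrapped in a small function parameterised by the target sum. A minimal sketch based on the code above (the name top_cells is mine):
library(raster)
#keep only the largest cells whose cumulative sum stays under `target`
top_cells <- function(r, target) {
  v <- sort(as.vector(r), decreasing=TRUE) #cell values, largest first
  cutoff <- min(v[cumsum(v) < target]) #smallest value still within the budget
  out <- r
  out[out < cutoff] <- NA
  out
}
test <- top_cells(raster1, 1e7)
cellStats(test, sum) #should come in just under 10,000,000
plot(test)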

Calculate quantiles for large data

I have about 300 files, each containing 1000 time series realisations (~76 MB each file).
I want to calculate the quantiles (0.05, 0.50, 0.95) at each time step from the full set of 300000 realisations.
I cannot merge the realisations into one file because it would become too large.
What's the most efficient way of doing this?
Each matrix is generated by running a model, however here is a sample containing random numbers:
x <- matrix(rexp(10000000, rate=.1), nrow=1000)
There are at least three options:
Are you sure it has to be from the full set? A 10% sample should be a very, very good approximation here.
300k elements isn't that big of a vector, but a 300k x 100+ column matrix is big. Pull just the column you need into memory rather than the entire matrix (this can be repeated over every column if necessary; a minimal sketch appears after this list).
Do it sequentially, possibly in conjunction with a smaller sample to get you started in the right ballpark. For the 5th percentile, you just need to know how many items are above the current guess and how many are below. So something like:
Take a 1% sample, find the 5th percentile of it. Jump some tolerance above and below, such that you're sure the exact 5th percentile lies in that range.
Read in the matrix in chunks. For each chunk, count the number of observations above the range and below the range. Then retain all observations which lie within the range.
When you've read in the last chunk, you now have three pieces of information (count above, count below, vector of observations within). One way to take a quantile is to sort the whole vector and find the nth observation, and you can do that with the above pieces of information: sort the within-range observations, and find the (n-count_below)th.
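For option (2), here is a minimal sketch, assuming each of the 300 files is a delimited text matrix of 1000 realisations by time steps and that files holds their paths (both the directory name and the file format are assumptions, not stated in the question):
library(data.table)
files <- list.files("realisations", full.names = TRUE) #hypothetical directory holding the 300 files
col_j <- 5 #the time step (column) of interest
#read only column col_j from each file, stack the resulting 300 x 1000 values, take the quantiles
vals <- unlist(lapply(files, function(f) fread(f, select = col_j)[[1]]))
quantile(vals, c(0.05, 0.50, 0.95))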
Edit: Example of (3).
Note that I am not a champion algorithm designer and that someone has almost certainly designed a better algorithm for this. Also, this implementation is not particularly efficient. If speed matters to you, consider Rcpp, or even just more optimized R for this. Making a bunch of lists and then extracting values from them is not so smart, but it was easy to prototype this way so I went with it.
library(plyr)
set.seed(1)
# -- Configuration -- #
desiredQuantile <- .25
# -- Generate sample data -- #
# Use some algorithm (sampling, iteration, or something else to come up with a range you're sure the true value lies within)
guessedrange <- c( .2, .3 )
# Group the observations to correspond to the OP's files
dat <- data.frame( group = rep( seq(100), each=100 ), value = runif(10000) )
# -- Apply the algorithm -- #
# Count the number above/below and return the values within the range, by group
res <- dlply( dat, .( group ), function( x, guessedrange ) {
  above <- x$value > guessedrange[2]
  below <- x$value < guessedrange[1]
  list(
    aboveCount = sum( above ),
    belowCount = sum( below ),
    withinValues = x$value[ !above & !below ]
  )
}, guessedrange = guessedrange )
# Extract the count of values below and the values within the range
belowCount <- sum( sapply( res, function(x) x$belowCount ) )
belowCount
withinValues <- do.call( c, sapply( res, function(x) x$withinValues ) )
str(withinValues)
# Count up until we find the within value we want
desiredQuantileCount <- floor( desiredQuantile * nrow(dat) ) #! Should fix this so it averages when there's a tie
sort(withinValues)[ desiredQuantileCount - belowCount + 1 ]
# Compare to exact value
quantile( dat$value, desiredQuantile )
In the end, the value is a little off from the exact version. I suspect I'm shifted over by one or some equally silly explanation, but maybe I'm missing something fundamental.
