R: Exclude granges overlapped by another range - r

I would like to compute the part of a query range that remains after excluding other ranges. Is there a way to do it? Thanks.
query <- IRanges(1, 10)
ranges2exclude <- IRanges(c(2, 6), c(3, 7))
The output I want:
IRanges(c(1, 4, 8), c(1, 5, 10))
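No answer is recorded for this question. As a hedged pointer, IRanges defines a setdiff method for range objects, so setdiff(query, ranges2exclude) should return the three ranges above (for GRanges objects, GenomicRanges provides the corresponding method). The dependency-free sketch below reproduces the same arithmetic on plain integer positions. The helper name exclude_ranges is hypothetical and is not part of any package, and the approach assumes ranges short enough to enumerate position by position.

```r
# Hypothetical helper: subtract closed integer ranges from one query range.
exclude_ranges <- function(q_start, q_end, ex_start, ex_end) {
  pos <- seq(q_start, q_end)                   # every position in the query
  covered <- unlist(Map(seq, ex_start, ex_end)) # every excluded position
  keep <- setdiff(pos, covered)                # positions that survive
  if (length(keep) == 0) {
    return(data.frame(start = integer(0), end = integer(0)))
  }
  # A new run starts wherever consecutive kept positions are not adjacent.
  breaks <- c(0, which(diff(keep) != 1), length(keep))
  data.frame(start = keep[head(breaks, -1) + 1], end = keep[breaks[-1]])
}

res <- exclude_ranges(1, 10, c(2, 6), c(3, 7))
res
# start: 1 4 8, end: 1 5 10
```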

Related

Create a sequence in R with that alternatively repeats

My homework asks me to create a sequence like this.
(1, 2, 2, 3, 4, 4, 5, 6, 6, . . . , 50, 50)
I'm a newbie to R so would really appreciate your help
You can use rep and alternate the number of repeats. Check ?rep for more information.
rep(1:50, ifelse(seq_len(50) %% 2, 1, 2))
Or, with two reps:
rep(1:50, rep(1:2, 25))
rep(1:50, rep(1:2, length.out = 50))
Short code using integer division:
2:76%/%1.5
This will return a numeric vector.
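As a quick sanity check, the sketch below compares the approaches above. The variable names a, b, d and e are only illustrative. It also shows the type difference mentioned for the integer-division version.

```r
a <- rep(1:50, ifelse(1:50 %% 2 == 1, 1, 2))  # odd values once, even values twice
b <- rep(1:50, rep(1:2, 25))
d <- rep(1:50, rep(1:2, length.out = 50))
e <- 2:76 %/% 1.5                              # ':' binds tighter than %/%

length(a)                    # 75
head(a, 9)                   # 1 2 2 3 4 4 5 6 6
all(a == b, a == d, a == e)  # TRUE
typeof(a)                    # "integer"
typeof(e)                    # "double"
```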

Correlation between variables under the for loop

I have the issue shown below, which I tried to solve without success. I have a data frame, df1, and I need to build a table of correlations between its variables inside a for loop, because I do not want the code to become long and complicated.
df1 <- structure(list(a = c(1, 2, 3, 4, 5), b = c(3, 5, 7, 4, 3), c = c(3,
6, 8, 1, 2), d = c(5, 3, 1, 3, 5)), class = "data.frame", row.names =
c(NA, -5L))
I tried the code below, using two for loops:
fv <- as.data.frame(combn(names(df1),2,paste, collapse="&"))
colnames(fv) <- "ColA"
fv$ColB <- sapply(strsplit(fv$ColA,"\\&"),'[',1)
fv$ColC <- sapply(strsplit(fv$ColA,"\\&"),'[',2)
asd <- list()
for (i in fv$ColB) {
  for (j in fv$ColC) {
    asd[i,j] <- as.data.frame(cor(df1[,i],df1[,j]))
  }
}
What am I doing wrong?
We can apply cor directly to the data.frame and convert the result to 'long' format with melt. Because the values in the lower triangular part mirror those in the upper triangular part, either one of them can be set to NA before calling melt.
library(reshape2)
out <- cor(df1)  # full correlation matrix
out[lower.tri(out, diag = TRUE)] <- NA
melt(out, na.rm = TRUE)
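For readers who prefer to stay close to the pairwise idea in the question, here is a minimal base-R sketch that needs no extra package. It is an alternative to the melt approach, not the answerer's method, and the names pairs and res are illustrative. One probable reason the original loop fails is that asd is a plain list, which cannot be indexed with two subscripts as in asd[i,j].

```r
df1 <- data.frame(a = c(1, 2, 3, 4, 5), b = c(3, 5, 7, 4, 3),
                  c = c(3, 6, 8, 1, 2), d = c(5, 3, 1, 3, 5))

# All unordered pairs of column names: a 2 x 6 character matrix.
pairs <- combn(names(df1), 2)

# One row per pair, with the correlation of the two columns.
res <- data.frame(
  Var1 = pairs[1, ],
  Var2 = pairs[2, ],
  value = apply(pairs, 2, function(p) cor(df1[[p[1]]], df1[[p[2]]]))
)
res
```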

Using foreach to create new observations and deleting erroneous observations in parallel

I am currently trying to clean a very large data set. I have working code to clean it, but it takes about three days to run without any parallelization, so I want to parallelize it. The original code works fine, but I can't figure out how to parallelize it in R using the doParallel and foreach packages or any other pre-built ones.
In particular, if I observe two data points that have the same time stamp, they should really be one data point. The non-parallelized code can accurately identify the points, flag them to be deleted later and create a new data point that is correct.
I've tried adapting the existing code by converting the for loops into foreach loops using the %do% operator from the foreach package. Doing this works fine, but changing %do% to %dopar% causes the code to stop working. I understand that this is the incorrect way to use %dopar%, but I don't know how to correctly accomplish my goal.
library(doParallel)
library(foreach)
df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
                  date = c(10, 1, 9, 4, 11),
                  var2 = c(2, 4, 6, 8, 10),
                  var3 = c(2, 4, 6, 8, 10),
                  ind = c(0, 0, 0, 0, 0)) #Indicator for problem observations
df2 <- data.frame(ID = c(1, 2, 3, 4, 5),
                  date = c(12, 10, 7, 5, 6),
                  var2 = c(2, 4, 6, 8, 10),
                  var3 = c(2, 4, 6, 8, 10),
                  ind = c(0, 0, 0, 0, 0))
foreach (row1 = 1:nrow(df1)) %dopar% {
  for (row2 in 1:nrow(df2)) {
    if(df1[row1, "date"] == df2[row2, "date"]) { #Observations that occur on the same date should be combined
      df1[row1, "ind"] <- 1 #Tag problem observations to delete them later
      df2[row2, "ind"] <- 1
      temp_obs <- data.frame(ID = df2[row2, "ID"],
                             date = df1[row1, "date"],
                             var2 = df1[row1, "var2"],
                             var3 = df1[row1, "var3"] + df2[row2, "var3"],
                             ind = 0)
      df1 <- rbind(df1, temp_obs)
      rm(temp_obs)
    }
  }
}
The sample code demonstrates my problem in a simpler context. It loops through all observations in df1 and df2, and identifies observations with the same date. It should add a 6th observation to df1, and change the indicators from 0 to 1 in the 1st entry of df1 and the second entry of df2 to indicate that they have been matched. As is, this code does not change df1 or df2 at all. It works when %dopar% is replaced with %do%.
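No answer is recorded for this question. A likely explanation for the behavior is that each %dopar% worker operates on its own copy of the data, so assignments made inside the loop body never reach the calling session; foreach is built around returning values that are then combined. Rather than parallelizing the double loop, the sketch below swaps in a different technique: a single base-R merge() on date, which finds all matching pairs at once. The helper columns row1 and row2 and the names left, right, matched and new_obs are illustrative, and the sketch assumes the matching rule is exactly equality of date.

```r
df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
                  date = c(10, 1, 9, 4, 11),
                  var2 = c(2, 4, 6, 8, 10),
                  var3 = c(2, 4, 6, 8, 10),
                  ind = 0)
df2 <- data.frame(ID = c(1, 2, 3, 4, 5),
                  date = c(12, 10, 7, 5, 6),
                  var2 = c(2, 4, 6, 8, 10),
                  var3 = c(2, 4, 6, 8, 10),
                  ind = 0)

# Remember each row's original position so it can be flagged afterwards.
left  <- transform(df1, row1 = seq_len(nrow(df1)))
right <- transform(df2, row2 = seq_len(nrow(df2)))

# One row per (df1 row, df2 row) pair that shares a date.
matched <- merge(left, right, by = "date", suffixes = c(".1", ".2"))

# Flag the problem observations in both data frames.
df1$ind[matched$row1] <- 1
df2$ind[matched$row2] <- 1

# Build the combined observations, following the rule in the original loop.
new_obs <- data.frame(ID = matched$ID.2,
                      date = matched$date,
                      var2 = matched$var2.1,
                      var3 = matched$var3.1 + matched$var3.2,
                      ind = rep(0, nrow(matched)))
df1 <- rbind(df1, new_obs)
df1
```

On the sample data this should flag the 1st row of df1 and the 2nd row of df2 and append a 6th row to df1. Whether it is fast enough for the full data set is untested here.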

Create conditional variable in multiple data.tables (or data.frames)

I want to execute the same action in multiple data.tables (or data.frames). For example, I want to create the same variable conditional on the same rule in all data.tables.
A simple example can be (df1=df2=df3, without loss of generality here)
df1 <- data.frame(var1 = c(1, 2, 2, 2, 1), var2 =c(20, 10, 10, 10, 20), var3 = c(10, 8, 15, 7, 9))
df2 <- data.frame(var1 = c(1, 2, 2, 2, 1), var2 =c(20, 10, 10, 10, 20), var3 = c(10, 8, 15, 7, 9))
df3 <- data.frame(var1 = c(1, 2, 2, 2, 1), var2 =c(20, 10, 10, 10, 20), var3 = c(10, 8, 15, 7, 9))
My approach was: (i) to create a list of the data frames (list.df), (ii) to loop on this list trying to create the variable:
list.df <- vector('list', 3)
for (j in 1:3) {
  name <- paste('df', j, sep = '')
  list.df[j] <- name
}
My (bad) attempt:
for (i in 1:3) {
  a <- get(paste(list.df[[i]], "$var1", sep = ""))
  b <- get(paste(list.df[[i]], "$var2", sep = ""))
  name <- paste(list.df[[i]], "$var.new", sep = "")
  assign(name, ifelse(a == 2 & b == 10, 1, 0))
}
Clearly R cannot create the new variable the way I am doing it, as I get an "object not found" error message.
Any clues on how to fix my bad code? I have a feeling that dplyr could help me, but I don't know how.
We can use mget after creating the strings of object names with paste, so that we get the values, i.e. the data.frames, in a list. We then loop through the list with lapply and transform each dataset by creating the binary variable 'varNew'. We can either use ifelse on the logical condition or just wrap it with + to coerce TRUE/FALSE to 1/0.
lst <- lapply(mget(paste0('df', 1:3)), transform,
              varNew = +(var1==2 & var2==10))
If we need to update the original objects, we can use list2env.
list2env(lst, envir = .GlobalEnv)
df1
df2
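The two ways of building the indicator mentioned in this answer, ifelse and the unary +, can be compared side by side. The sketch below is only an illustration: it keeps the data frames in a named list from the start instead of using mget, and the column names viaPlus and viaIfelse are made up for the comparison.

```r
df1 <- data.frame(var1 = c(1, 2, 2, 2, 1),
                  var2 = c(20, 10, 10, 10, 20),
                  var3 = c(10, 8, 15, 7, 9))
df2 <- df1

# Keep the data frames together in a named list.
lst <- list(df1 = df1, df2 = df2)

lst <- lapply(lst, function(d) {
  cond <- d$var1 == 2 & d$var2 == 10
  d$viaPlus   <- +cond               # logical coerced to integer 0/1
  d$viaIfelse <- ifelse(cond, 1, 0)  # same values, stored as double
  d
})

lst$df1$viaPlus                            # 0 1 1 1 0
all(lst$df1$viaPlus == lst$df1$viaIfelse)  # TRUE
```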

Line in R plot should start at a different timepoint

I have the following example data set:
date<-c(1,2,3,4,5,6,7,8)
valuex<-c(2,1,2,1,2,3,4,2)
valuey<-c(2,3,4,5,6)
Now I plot the date and the valuex variable:
plot(date,valuex,type="l")
Now I want to add a line for the valuey variable, but it should start at the 4th day rather than at the beginning, so I add NA values:
valuexmod<-c(rep(NA,3),valuex)
and I add the line with:
lines(date,valuexmod,type="l",col="red")
But this does not work. R seems to ignore the NA values and the valuexmod line starts at the first day, but it should start at the 4th day.
Given that date and valuex have the same length, I am assuming that you have a typo above.
Try this instead:
date <- c(1, 2, 3, 4, 5, 6, 7, 8)
valuex <- c(2, 1, 2, 1, 2, 3, 4, 2)
valuey <- c(2, 3, 4, 5, 6)
valueymod <- c(rep(NA, 3), valuey)
plot(date, valuex, type = "l", ylim = range(c(valuex, valuey)))
lines(date, valueymod, type = "l", col = "red")
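The resulting plot (image not preserved here) shows valuex as a black line over days 1 to 8 and valuey as a red line that begins at day 4.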
Related to your question is a point made in help("lines")...
The coordinates can contain NA values. If a point contains NA in either its x or y value, it is omitted from the plot, and lines are not drawn to or from such points. Thus missing values can be used to achieve breaks in lines.
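The NA padding used in this answer can be wrapped in a small helper so the starting day is a parameter. This is a minimal sketch; pad_from is a hypothetical name, not a function from base R or any package.

```r
# Hypothetical helper: place y at positions start, start + 1, ... of a
# length-n vector and fill every other position with NA.
pad_from <- function(y, start, n) {
  stopifnot(start >= 1, start - 1 + length(y) <= n)
  c(rep(NA, start - 1), y, rep(NA, n - (start - 1) - length(y)))
}

date <- c(1, 2, 3, 4, 5, 6, 7, 8)
valuey <- c(2, 3, 4, 5, 6)

valueymod <- pad_from(valuey, start = 4, n = length(date))
valueymod
# NA NA NA 2 3 4 5 6
```

The padded vector has the same length as date, so it can be passed to lines(date, valueymod, col = "red") as in the answer.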
