BioConductor IRanges coverage counts and identify segments - r

I have a dataset with interval information for a number of manufacturing circuits:
df <- data.frame(structure(list(circuit = structure(c(2L, 1L, 2L, 1L, 2L, 3L,
1L, 1L, 2L), .Label = c("a", "b", "c"), class = "factor"), start = structure(c(1393621200,
1393627920, 1393628400, 1393631520, 1393650300, 1393646400, 1393656000,
1393668000, 1393666200), class = c("POSIXct", "POSIXt"), tzone = ""),
end = structure(c(1393626600, 1393631519, 1393639200, 1393632000,
1393660500, 1393673400, 1393667999, 1393671600, 1393677000
), class = c("POSIXct", "POSIXt"), tzone = ""), id = structure(1:9, .Label = c("1001",
"1002", "1003", "1004", "1005", "1006", "1007", "1008", "1009"
), class = "factor")), .Names = c("circuit", "start", "end",
"id"), class = "data.frame", row.names = c(NA, -9L)))
circuit: Identifier for the circuit
start: Time the circuit started running
end: Time the circuit stopped running
id: Unique identifier for the row
I'm able to create a new data set that counts the number of overlapping intervals:
library(IRanges)
ir <- IRanges(start = as.numeric(df$start), end = as.numeric(df$end), names = df$id)
cov <- coverage(ir)
start_time <- as.POSIXlt(start(cov), origin = "1970-01-01")
end_time <- as.POSIXlt(end(cov), origin = "1970-01-01")
seconds <- runLength(cov)
circuits_running <- runValue(cov)
res <- data.frame(start_time,end_time,seconds,circuits_running)[-1,]
But what I really need is something that looks more like this:
library(sqldf)
sqldf("select
res.start_time,
res.end_time,
res.seconds,
res.circuits_running,
df.circuit,
df.id
from res left join df on (res.start_time between df.start and df.end)")
The problem is that the sqldf way of using an inequality join is unbearably slow on my full dataset.
How can I get something similar using IRanges alone?
I suspect it has something to do with RangedData but I've not been able to see how to get what I want. Here's what I've tried...
rd <- RangedData(ir, circuit = df$circuit, id = df$id)
coverage(rd) # works but seems to lose the circuit/id info

The coverage can be represented as ranges, dropping the first (the range from 1970 to the first start point)
cov <- coverage(ir)
intervals <- ranges(cov)[-1]
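Your query looks for the circuits running in each interval, so I narrow each interval to a single coordinate and find overlaps (the first argument is the 'query', the second the 'subject'). Note that narrow(intervals, width(intervals)) passes the widths as the relative start, so each interval is reduced to its last position; resize(intervals, 1) would reduce it to its first position instead.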
olaps <- findOverlaps(narrow(intervals, width(intervals)), ir)
The number of circuits running in a particular interval is
tabulate(queryHits(olaps), queryLength(olaps))
and the actual circuits are
df[subjectHits(olaps), c("circuit", "id")]
The pieces can be knit together as, perhaps
df1 <- cbind(uid=seq_along(intervals),
             as.data.frame(intervals),
             circuits_running=tabulate(queryHits(olaps), queryLength(olaps)))
df2 <- cbind(uid=queryHits(olaps),
             df[subjectHits(olaps), c("circuit", "id")])
merge(df1, df2, by="uid", all=TRUE)
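As a cross-check on small inputs, the same kind of join can be sketched in base R. This is only an illustration under stated assumptions: the toy data frame and its integer times are made up, the outer() comparison is quadratic in memory and will not scale the way findOverlaps() does, and the segments here are split at every start or end point rather than merged into runs of equal coverage.

```r
# Toy closed intervals [start, end] in integer time units (hypothetical values)
toy <- data.frame(id      = c("1001", "1002", "1003"),
                  circuit = c("b", "a", "b"),
                  start   = c(1, 5, 8),
                  end     = c(6, 10, 12),
                  stringsAsFactors = FALSE)

# Elementary segments: the set of running circuits can only change at a
# start or just after an end
breaks <- sort(unique(c(toy$start, toy$end + 1)))
seg <- data.frame(start = head(breaks, -1),
                  end   = tail(breaks, -1) - 1)

# hits[i, j] is TRUE when interval j contains the first position of segment i
hits <- outer(seg$start, toy$start, ">=") & outer(seg$start, toy$end, "<=")
seg$circuits_running <- rowSums(hits)

# One output row per (segment, running circuit) pair
idx <- which(hits, arr.ind = TRUE)
res <- cbind(seg[idx[, "row"], ], toy[idx[, "col"], c("circuit", "id")])
res <- res[order(res$start), ]
rownames(res) <- NULL
res
```

Segments with no running circuit simply produce no rows in res; a left join like the one in the question would need them added back.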
Ranges can have associated with them 'metadata' that is accessible and subset in a coordinated way, so the connection between data.frame and ranges does not have to be so loose and ad hoc. I might instead have
ir <- IRanges(start = as.numeric(df$start), end = as.numeric(df$end))
mcols(ir) <- DataFrame(df)
## ...
mcols(ir[subjectHits(olaps)])
perhaps with as.data.frame() when done with IRanges-land.
It's better to ask your questions about IRanges on the Bioconductor mailing list; no subscription required.

Related

matching strings regex exact match

This thread follows on from this answered question: Matching strings loop over multiple columns
I opened a new thread because I would like to update the code to flag exact matches only.
I have a table of keywords in separate columns as follows:
#codes table
codes <- structure(
list(
Support = structure(
c(2L, 3L, NA),
.Label = c("",
"help", "questions"),
class = "factor"
),
Online = structure(
c(1L,
3L, 2L),
.Label = c("activities", "discussion board", "quiz", "sy"),
class = "factor"
),
Resources = structure(
c(3L, 2L, NA),
.Label = c("", "pdf",
"textbook"),
class = "factor"
)
),
row.names = c(NA,-3L),
class = "data.frame"
)
I also have a comments table structured as follows:
#comments table
comments <- structure(
list(
SurveyID = structure(
1:5,
.Label = c("ID_1", "ID_2",
"ID_3", "ID_4", "ID_5"),
class = "factor"
),
Open_comments = structure(
c(2L,
4L, 3L, 5L, 1L),
.Label = c(
"I could never get the pdf to download",
"I could never get the system to work",
"I didn’t get the help I needed on time",
"my questions went unanswered",
"staying motivated to get through the textbook",
"there wasn’t enough engagement in the discussion board"
),
class = "factor"
)
),
class = "data.frame",
row.names = c(NA,-5L)
)
What I am trying to do:
Search for an exact match keyword. The following working code has been provided by #Len Greski and #Ronak Shah from the previous thread (with huge thanks to both):
library(stringi)
resultsList <- lapply(1:ncol(codes), function(x) {
  # TRUE where the comment matches any keyword in column x
  y <- stri_detect_regex(comments$Open_comments, paste(codes[[x]], collapse = "|"))
  ifelse(y == TRUE, 1, 0)
})
results <- as.data.frame(do.call(cbind,resultsList))
colnames(results) <- colnames(codes)
mergedData <- cbind(comments,results)
mergedData
and
comments[names(codes)] <- lapply(codes, function(x)
+(grepl(paste0(na.omit(x), collapse = "|"), comments$Open_comments)))
Both work well, but I have hit a snag and now need to match the keywords exactly. As per the example tables above, if I have a keyword "sy", the code will flag any comment containing the word "system". I would like to modify either of the above pieces of code so that a comment is flagged only when "sy" is present as an exact match.
Many thanks
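One possible approach, assuming that "exact match" means the keyword must appear as a whole word, is to wrap the alternation in word boundaries (\\b). The sketch below uses small made-up tables with the same shape as the ones above, not the original data.

```r
# Hypothetical keyword table and comments mirroring the structure above
codes <- data.frame(Support = c("help", "questions", NA),
                    Online  = c("activities", "quiz", "sy"),
                    stringsAsFactors = FALSE)
comments <- data.frame(
  Open_comments = c("I could never get the system to work",
                    "the sy module needs help"),
  stringsAsFactors = FALSE
)

# Build one pattern per column: \b(kw1|kw2|...)\b matches whole words only
flags <- lapply(codes, function(x) {
  pattern <- paste0("\\b(", paste(na.omit(x), collapse = "|"), ")\\b")
  +grepl(pattern, comments$Open_comments, perl = TRUE)
})
comments[names(codes)] <- flags
comments
```

Here "system" no longer triggers the "sy" keyword, while a comment containing "sy" as a separate word is still flagged. Keywords containing regex metacharacters would need escaping first.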

Connect two points with a line in R

I have a problem connecting two points with the same y value. My dataset looks like this (I hope the formatting is ok):
attackerip,min,max
125.88.146.123,2016-03-29 17:38:17.949778,2016-03-30 07:28:47.912983
58.218.205.101,2016-04-05 15:53:20.69986,2016-05-12 17:32:08.583255
183.3.202.195,2016-04-05 15:58:27.862509,2016-04-15 18:15:13.117774
58.218.199.166,2016-04-05 16:09:34.448588,2016-04-24 06:02:12.237922
58.218.204.107,2016-04-05 16:57:17.624509,2016-05-31 00:52:44.007908
What I have so far is the following:
mydata = read.csv("timeline.csv", sep=',')
mydata$min <- strptime(as.character(mydata$min), format='%Y-%m-%d %H:%M:%S')
mydata$max <- strptime(as.character(mydata$max), format='%Y-%m-%d %H:%M:%S')
plot(mydata$min, mydata$attackerip, col="red")
points(mydata$max, mydata$attackerip, col="blue")
This produces a scatter plot with the min values in red and the max values in blue.
Now I want to connect the points that share the same y-axis value, but I cannot get lines or abline to work. Thanks in advance!
EDIT: dput of data
dput(mydata)
structure(list(attackerip = structure(c(1L, 5L, 2L, 3L, 4L), .Label = c("125.88.146.123",
"183.3.202.195", "58.218.199.166", "58.218.204.107", "58.218.205.101"
), class = "factor"), min = structure(1:5, .Label = c("2016-03-29 17:38:17.949778",
"2016-04-05 15:53:20.69986", "2016-04-05 15:58:27.862509", "2016-04-05 16:09:34.448588",
"2016-04-05 16:57:17.624509"), class = "factor"), max = structure(c(1L,
4L, 2L, 3L, 5L), .Label = c("2016-03-30 07:28:47.912983", "2016-04-15 18:15:13.117774",
"2016-04-24 06:02:12.237922", "2016-05-12 17:32:08.583255", "2016-05-31 00:52:44.007908"
), class = "factor")), .Names = c("attackerip", "min", "max"), class = "data.frame", row.names = c(NA,
-5L))
Final Edit:
The reason plotting lines did not work was that min and max were stored as timestamps. Casting them to numeric values yielded the expected result. Thanks for your help, everyone.
The lines function should work just fine. However, you will need to call it for every pair (or set) of points that share the same y value. Here is a reproducible example:
# get sets of observations with the same y value
dupeVals <- unique(y[duplicated(y) | duplicated(y, fromLast=TRUE)])
# put the corresponding indices into a list
dupesList <- lapply(dupeVals, function(i) which(y == i))
# scatter plot
plot(x, y)
# plot the lines using sapply
sapply(dupesList, function(i) lines(x[i], y[i]))
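This draws a scatter plot with lines joining the points that share a y value. The example data were generated with:
set.seed(1234)
x <- sort(5 * runif(30))
y <- sample(25, 30, replace=TRUE)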
As it appears that you have two separate groups for which you would like to draw these lines, the algorithm would be the following:
For each group (min and max, I believe):
  calculate the duplicate values of the y variable;
  put the indices of these duplicates into a dupesList (maybe dupesListMin and dupesListMax).
Plot the points.
Run one sapply call on each dupesList.
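For data shaped like the question's, where each row already holds both endpoints of one line, a simpler alternative to repeated lines() calls is base graphics' segments(). The sketch below is a minimal illustration on two made-up rows; it converts the timestamps to numeric, as the asker's final edit describes, and the column names are taken from the question.

```r
# Two hypothetical rows in the same layout as the question's data
mydata <- data.frame(
  attackerip = factor(c("125.88.146.123", "58.218.205.101")),
  min = as.POSIXct(c("2016-03-29 17:38:17", "2016-04-05 15:53:20"), tz = "UTC"),
  max = as.POSIXct(c("2016-03-30 07:28:47", "2016-05-12 17:32:08"), tz = "UTC")
)

x0 <- as.numeric(mydata$min)         # seconds since the epoch
x1 <- as.numeric(mydata$max)
y  <- as.integer(mydata$attackerip)  # one horizontal level per IP

par(mar = c(5, 9, 2, 1))             # widen the left margin for the IP labels
plot(c(x0, x1), c(y, y),
     col = rep(c("red", "blue"), each = length(y)),
     xaxt = "n", yaxt = "n", xlab = "time", ylab = "")
segments(x0, y, x1, y)               # one line per row, from min to max
axis(2, at = y, labels = as.character(mydata$attackerip), las = 1)
axis.POSIXct(1, x = c(mydata$min, mydata$max))
```

segments() is vectorised over its arguments, so no loop or sapply is needed when min and max sit on the same row.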

How to convert matrix with tick data into xts?

I have data that I am trying to convert into xts format:
> dput(data)
structure(list(50370788L, 50370777L, 50370694L, 50370620L, 50370504L,
620639L, 620639L, 592639L, 592639L, 592639L, "2015-10-24",
"2015-10-24", "2015-09-04", "2015-09-04", "2015-09-04", structure(list(
id = 12544L, symbol = "GBSN", title = "Great Basin Scientific, Inc."), .Names = c("id",
"symbol", "title"), class = "data.frame", row.names = 1L),
structure(list(id = 12544L, symbol = "GBSN", title = "Great Basin Scientific, Inc."), .Names = c("id",
"symbol", "title"), class = "data.frame", row.names = 1L),
structure(list(id = 12544L, symbol = "GBSN", title = "Great Basin Scientific, Inc."), .Names = c("id",
"symbol", "title"), class = "data.frame", row.names = 1L),
structure(list(id = 12544L, symbol = "GBSN", title = "Great Basin Scientific, Inc."), .Names = c("id",
"symbol", "title"), class = "data.frame", row.names = 1L),
structure(list(id = 12544L, symbol = "GBSN", title = "Great Basin Scientific, Inc."), .Names = c("id",
"symbol", "title"), class = "data.frame", row.names = 1L),
"$GBSN Still sticking with my prediction of FDA coming sometime in March..",
"$GBSN Last time I check NASDAQ gave them till sometime in April to get it together or else they'll see pink. Correct me if in wrong?",
"$GBSN time for retailers to get knocked out of the ring with a 25 to 30 % gain",
"$GBSN market cap will end up around 65 million not enough to comply rs takes it to 21 dollars pps 26$ by august",
"$GBSN shorts are going to attack the sell off"), .Dim = c(5L,
5L), .Dimnames = list(c("2016-02-28 16:59:53", "2016-02-28 16:58:58",
"2016-02-28 16:51:36", "2016-02-28 16:46:09", "2016-02-28 16:34:34"
), c("GBSN.Message_ID", "GBSN.User_ID", "GBSN.User_Join_Date",
"GBSN.Message_Symbols", "GBSN.Message_Body")))
I have been trying to use:
Message_series <- xts(zoo(data, format='%Y-%m-%d %H:%M:%S'))
but I get this error:
Error in zoo(data, format = "%Y-%m-%d %H:%M:%S") :
unused argument (format = "%Y-%m-%d %H:%M:%S")
Your matrix is not tidy. Look at the fourth column (data[,4]). zoo, and hence xts, does not support such a complicated object, only a simple matrix whose elements all have the same type.
The first and second columns are OK, but they are still stored as lists, so the conversion is not entirely straightforward.
library(xts)
data.mat <- matrix(as.numeric(data[,1:2]), ncol = 2)
colnames(data.mat) <- colnames(data)[1:2]
xts(data.mat, order.by = as.POSIXct(rownames(data)))
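The difference between a list-matrix and an atomic matrix can be seen on a small made-up example in base R. The object m below is hypothetical and only mimics the shape of the first three columns of data; it shows why each column has to be converted explicitly before it can go into a numeric matrix.

```r
# A 2 x 2 matrix whose cells are list elements, like the question's data
m <- matrix(list(1L, 2L, "2015-10-24", "2015-09-04"), nrow = 2,
            dimnames = list(c("2016-02-28 16:59:53", "2016-02-28 16:58:58"),
                            c("id", "join_date")))
typeof(m)   # "list": not an atomic matrix
is.list(m)  # TRUE

# Convert each column to an atomic vector of a single type
ids  <- as.numeric(m[, "id"])
join <- as.Date(unlist(m[, "join_date"]))

# Numeric matrix plus a time index: the two ingredients xts() needs
num <- cbind(id = ids, join_date = as.numeric(join))
idx <- as.POSIXct(rownames(m), tz = "UTC")
```

Passing num and idx to xts(num, order.by = idx) would then follow the same pattern as the code above.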
The join date can be converted and included:
data.mat <- cbind(data.mat, as.numeric(as.Date(as.character(data[,3]))))
colnames(data.mat) <- colnames(data)[1:3]
data.xts <- xts(data.mat, order.by = as.POSIXct(rownames(data)))
and converted back:
as.Date(as.numeric(coredata(data.xts['2016-02-28 16:59:53',3])), origin = "1970-01-01")
You can encode the id, symbol and title variables from Message_Symbols in the same way.
I recommend you store Message_Body in a separate object (e.g. data.frame).
Based on the column names of data, it appears that all of your data is or could be of character type. However, data[,4], GBSN.Message_Symbols, contains lists, not an atomic vector so we'll have to flatten using rbind. apply is then used to convert each column to a character vector and combine to form a character matrix. The xts object is formed by converting the rownames to POSIX date/time types and using them as the index. Code would look like
# flatten list data in column 4 to a data frame
mat4 <- do.call(rbind, data[,4])
# convert all data to character type
data.mat <- apply(cbind(data[,-4], mat4), 2, as.character)
# create xts time series
data.xts <- xts(data.mat, order.by = as.POSIXct(rownames(data)))

prop.table doesn't work in a for-loop?

This may be a very simple question, but I don't see how to answer it.
I have the following reproducible code, where I have two small dataframes that I use to calculate a percentage value based on each column total:
#dataframe x
x <- structure(list(PROV = structure(c(1L, 1L), .Label = "AG", class = "factor"),
APT = structure(1:2, .Label = c("AAA", "BBB"), class = "factor"),
PAX.2013 = c(5L, 4L), PAX.2014 = c(4L, 2L), PAX.2015 = c(4L,0L)),
.Names = c("PROV", "APT", "PAX.2013", "PAX.2014", "PAX.2015"),
row.names = 1:2, class = "data.frame")
#dataframe y
y <- structure(list(PROV = structure(c(1L, 1L), .Label = "AQ", class = "factor"),
APT = structure(1:2, .Label = c("CCC", "AAA"), class = "factor"),
PAX.2013 = c(3L, 7L), PAX.2014 = c(2L, 1L), PAX.2015 = c(0L,3L)),
.Names = c("PROV", "APT", "PAX.2013", "PAX.2014", "PAX.2015"),
row.names = 1:2, class = "data.frame")
#list z (with x and y)
z <- list(x,y)
#percentage value of x and y based on columns total
round(prop.table(as.matrix(z[[1]][3:5]), margin = 2)*100,1)
round(prop.table(as.matrix(z[[2]][3:5]), margin = 2)*100,1)
As you can see, this works fine.
Now I want to automate this for the whole list, but I can't figure out how to get the results. This is my simple code:
#for-loop that is not working
for (i in length(z))
{round(prop.table(as.matrix(z[[i]][3:5]), margin = 2)*100,1)}
You have two problems.
First, you have not put a range into your for loop so you are just trying to iterate over a single number and second, you are not assigning your result anywhere on each iteration.
Use 1:length(z) to define a range. Then assign the results to a variable.
This would work:
my_list <- list()
for (i in 1:length(z)){
my_list[[i]] <- round(prop.table(as.matrix(z[[i]][3:5]),
margin = 2)*100,1)
}
my_list
But it would be more efficient and idiomatic to use lapply:
lapply(1:length(z),
function(x) round(prop.table(as.matrix(z[[x]][3:5]), margin = 2)*100,1))
Leaving aside whether a for loop is the best approach, you had two issues. First, your for loop iterates only over 2 (which is length(z)) instead of 1:2. Second, you need to do something with the result of the round(...) call; in this solution, I added a print statement.
for (i in 1:length(z)){
print(round(prop.table(as.matrix(z[[i]][3:5]), margin = 2)*100,1))
}
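Both fixes can also be written without indexing at all. The sketch below is a minimal illustration on a cut-down, made-up version of z (two year columns instead of three): it applies the function to each data frame directly with lapply, and shows why seq_along() is a safer loop range than 1:length() when a list may be empty.

```r
# Cut-down stand-in for the list z from the question (hypothetical values)
z <- list(
  data.frame(APT = c("AAA", "BBB"), PAX.2013 = c(5L, 4L), PAX.2014 = c(4L, 2L)),
  data.frame(APT = c("CCC", "AAA"), PAX.2013 = c(3L, 7L), PAX.2014 = c(2L, 1L))
)

# lapply over the list itself; d is each data frame in turn
pct <- lapply(z, function(d) {
  round(prop.table(as.matrix(d[-1]), margin = 2) * 100, 1)
})
pct[[1]]

# seq_along() yields an empty range for an empty list, whereas
# 1:length() counts down from 1 to 0 and would run the body twice
empty <- list()
n_iter <- 0
for (i in seq_along(empty)) n_iter <- n_iter + 1
n_iter                    # 0
length(1:length(empty))   # 2
```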

xts merge memory performance

I am trying to improve the memory performance for the following example:
baseline df with 4 rows
df <- structure(list(sessionid = structure(c(1L, 2L, 3L, 4L), .Label =
c("AAA1", "AAA2","AAA3", "AAA4"), class = "factor"), bitrateinbps = c(10000000,
10000000, 10000000, 10000000), startdate = structure(c(1326758507, 1326758671,
1326759569, 1326760589), class = c("POSIXct", "POSIXt"), tzone = ""), enddate =
structure(c(1326765780, 1326758734, 1326760629, 1326761592), class = c("POSIXct",
"POSIXt"), tzone = "")), .Names = c("sessionid", "bitrateinbps", "startdate",
"enddate"), row.names = c(NA, 4L), class =
"data.frame")
alternate df with 8 rows
df <- structure(list(sessionid = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L),
.Label = c("AAA1", "AAA2", "AAA3", "AAA4", "AAA5", "AAA6", "AAA7", "AAA8"),
class = "factor"), bitrateinbps =c(10000000, 10000000, 10000000, 10000000,
10000000, 10000000, 10000000, 10000000), startdate = structure(c(1326758507,
1326758671, 1326759569, 1326760589, 1326761589, 1326762589, 1326763589, 1326764589),
class = c("POSIXct",
"POSIXt"), tzone = ""), enddate = structure(c(1326765780, 1326758734, 1326760629,
1326761592, 1326767592,
1326768592, 1326768700, 1326769592), class = c("POSIXct", "POSIXt"), tzone = "")),
.Names = c("sessionid",
"bitrateinbps", "startdate", "enddate"), row.names = c(NA, 8L), class =
"data.frame")
Run the analysis below on df and record the memory usage, then repeat with the alternate df:
library(xts)
fun0 <- function(i, d) {
idx0 <- seq(d$startdate[i],d$enddate[i],1) # create sequence for index
dat0 <- rep(1,length(idx0)) # create data over sequence
xts(dat0, idx0, dimnames=list(NULL,d$sessionid[i])) # xts object
}
# loop over each row and put each row into its own xts object
xl0 <- lapply(1:NROW(df), fun0, d=df)
# merge all the xts objects
xx0 <- do.call(merge, xl0)
# sum each column over each 15-minute period and divide by 900 seconds
xa0 <- period.apply(xx0, endpoints(xx0, 'minutes', 15), colSums, na.rm=TRUE)/900
xa1 <- t(xa0)
# convert from atomic vector to data frame
xa1 = as.data.frame(xa1)
# bind to df
out1 = cbind(df, xa1)
# print aggregate memory usage statistics (memory.size() and memory.limit() are Windows-only)
print(paste('R is using', memory.size(), 'MB out of limit', memory.limit(), 'MB'))
# create function to return matrix of memory consumption
object.sizes <- function()
{
return(rev(sort(sapply(ls(envir=.GlobalEnv), function (object.name)
object.size(get(object.name))))))
}
# print to console in table format
object.sizes()
results as follows:
4 row df:
xx0 = 292104 Bytes .... do.call(merge, xl0)
xl0 = 154648 Bytes .... lapply(1:NROW(df), fun0, d=df)
8 row df:
xx0 = 799480 Bytes .... do.call(merge, xl0)
xl0 = 512808 Bytes .... lapply(1:NROW(df), fun0, d=df)
I'm looking for something a little more memory efficient for the merge and lapply functions, so I can scale out the number of rows, if anyone has any suggestions and can show the comparative results for alternatives.
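One direction that might reduce memory, sketched here as a suggestion rather than a measured result, is to skip the per-second series entirely and compute each session's overlap with each 15-minute bin arithmetically. The toy data, the column names start and end, and the choice to align bins to multiples of 900 seconds are all assumptions; endpoints() in xts aligns to clock time, which can differ in time zones whose offsets are not multiples of 15 minutes.

```r
# Hypothetical sessions, times in seconds; end is inclusive, as in seq(start, end, 1)
sessions <- data.frame(sessionid = c("AAA1", "AAA2"),
                       start = c(100, 1000),
                       end   = c(1899, 1099),
                       stringsAsFactors = FALSE)

bin  <- 900  # 15 minutes in seconds
bins <- seq(floor(min(sessions$start) / bin) * bin,
            floor(max(sessions$end)   / bin) * bin,
            by = bin)

# Overlap in seconds between session i and the half-open bin [bins[j], bins[j] + bin)
ov <- outer(sessions$end + 1, bins + bin, pmin) -
      outer(sessions$start,   bins,       pmax)
ov <- pmax(ov, 0)  # negative values mean no overlap

# Fraction of each bin during which each session was active
frac <- ov / bin
dimnames(frac) <- list(sessions$sessionid, bins)
frac
```

The result has one row per session and one column per bin, so its size grows with the number of bins rather than the number of seconds covered.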
