Printing a list based on a range being met - R

I would like to generate a string output into a list if some conditions are met. I have a table that looks like this:
grp V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
1: 1 go.1 142 144 132 134 0 31 11 F D T hy al qe 34 6 3
2: 2 go.1 313 315 303 305 0 31 11 q z t hr ye er 29 20 41
3: 3 go.1 316 318 306 308 0 31 11 f w y hu er es 64 43 19
4: 4 go.1 319 321 309 311 0 31 11 r a y ie uu qr 26 22 20
5: 5 go.1 322 324 312 314 0 31 11 g w y hp yu re 44 7 0
I'm using this function to generate a desired output:
library(IRanges); library(data.table)
rangeFinder = function(x){
  x.ir = reduce(IRanges(x$V2, x$V3))
  max.idx = which.max(width(x.ir))
  ans = data.table(out = x[1, 1],
                   start = start(x.ir)[max.idx],
                   end = end(x.ir)[max.idx])
  return(ans)
}
rangeFinder(x.out)
out start end
1: 1 313 324
I would also like to generate a list with the letters (from columns V9-V11) that fall between the start and end output from rangeFinder.
For example, the output should look like this.
out
[[go.1]]
[1] "qztfwyraygwy"
rangeFinder looks at the values in columns V2 and V3 and prints the longest run of consecutive numbers. Notice how "FDT" is not included in the list output, because rangeFinder produced an output from 313-324 (and not from 142-324). How can I get the desired output?

reduce has an argument with.revmap that adds a "metadata" column (accessible with mcols()) to the object. This associates with each reduced range the indexes of the original ranges that map to it, as an IntegerList class: basically a list whose elements are guaranteed to be integer vectors. So these are the rows you're interested in:
ir <- with(x, IRanges(V2, V3))
r <- reduce(ir, with.revmap=TRUE)
i <- unlist(mcols(r)[which.max(width(r)), "revmap"])
and the data character string can be munged with something like
j <- paste0("V", 9:11)
paste0(t(as.matrix(x[i, j, drop = FALSE])), collapse = "")  # t() so the letters come out row by row
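For the example above, the row-wise pasting can be checked on its own with plain base R (the letters are typed in by hand from rows 2-5 of the question's table; note the t(), which makes paste0 walk the matrix row by row rather than column by column):

```r
# letters from columns V9-V11, rows 2-5 (the rows covered by 313-324)
m <- matrix(c("q", "z", "t",
              "f", "w", "y",
              "r", "a", "y",
              "g", "w", "y"),
            ncol = 3, byrow = TRUE)

# without t() the letters would come out column by column ("qfrg...")
paste0(t(m), collapse = "")
#> [1] "qztfwyraygwy"
```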
It's better to ask your questions about IRanges on the Bioconductor mailing list; no subscription required.
with.revmap is a convenience argument added relatively recently; I think
h = findOverlaps(ir, r)
i = queryHits(h)[subjectHits(h) == which.max(width(r))]
is a replacement.

Related

R - aggregate data.table columns differently

I am given a large data-table that needs to be aggregated according to the first column:
The problem is the following:
For several columns, one just has to form the sum for each category (given in column 1)
For other columns, one has to calculate the mean
There is a 1-1 correspondence between the entries in the first and second columns, so the entries of the second column should be kept as they are.
The following is a possible example of such a data-table. Let's assume that columns 3-9 need to be summed up and columns 10-12 need to be averaged.
library(data.table)
set.seed(1)
a<-matrix(c("cat1","text1","cat2","text2","cat3","text3"),nrow=3,byrow=TRUE)
M<-do.call(rbind, replicate(1000, a, simplify=FALSE)) # where m is your matrix
M<-cbind(M,matrix(sample(c(1:100),3000*10,replace=TRUE ),ncol=10))
M <- as.data.table(M)
The result should be a table of the form
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1: cat1 text1 27 81 78 95 27 22 12 76 18 76
2: cat2 text2 38 48 70 100 11 97 8 53 56 33
3: cat3 text3 58 18 66 24 14 73 18 27 92 70
but with the entries being the corresponding sums and averages, respectively.
M[, names(M)[-c(1, 2)] := lapply(.SD, as.numeric),
  .SDcols = names(M)[-c(1, 2)]][,
  c(lapply(.SD[, (3:9) - 2, with = FALSE], sum),
    lapply(.SD[, (10:12) - 2, with = FALSE], mean)),
  by = eval(names(M)[c(1, 2)])]
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
#> 1: cat1 text1 51978 49854 48476 49451 49620 49870 50248 50.193 51.516 49.694
#> 2: cat2 text2 50607 50097 50572 50507 48960 51419 48905 49.700 49.631 48.863
#> 3: cat3 text3 51033 50060 49742 50345 51532 51299 50957 50.192 50.227 50.689
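The sum/mean split by group can be cross-checked in base R on a tiny made-up table (the column names s and m are invented for the illustration, not taken from the question):

```r
# one key column, one column to sum, one column to average
d <- data.frame(grp = c("cat1", "cat1", "cat2", "cat2"),
                s   = c(1, 2, 3, 4),
                m   = c(10, 20, 30, 40))

sums  <- aggregate(s ~ grp, d, sum)    # cat1: 3,  cat2: 7
means <- aggregate(m ~ grp, d, mean)   # cat1: 15, cat2: 35
merge(sums, means, by = "grp")
```

The data.table expression above does both aggregations in one pass per group, which matters on large tables; the base version is only meant as a readable sanity check.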

R - Sum range over look-back period, divided by sum of look-back - Excel to R

I am looking to work out a percentage total over a look-back range in R.
I know how to do this in excel with the following formula:
=SUM(B2:B4)/SUM(B2:B4,C2:C4)
This sums column B over a range starting from today and looking back 3 rows. It then divides this sum by the total sum of columns B + C, again looking back 3 rows.
I am looking to achieve the same calculation in R to run across my matrix.
The output would look something like this:
adv dec perct
1 69 376
2 113 293
3 270 150 0.355625492
4 74 371 0.359559402
5 308 96 0.513790386
6 236 173 0.491255962
7 252 134 0.663886572
8 287 129 0.639966969
9 219 187 0.627483444
This is a line of code I could perhaps add the look-back range to:
perct <- apply(data.matrix[,c('adv','dec')], 1, function(x) { (x[1] / x[1] + x[2]) } )
If I could get x[1] to sum over the previous 3-row range, and x[2] to also sum over the previous 3-row range, that would do it. I am still learning how to apply forward and look-back periods in R, so any additional explanation with the answer would be appreciated!
Here are some approaches. The first 3 use rollsumr and/or rollapplyr in zoo and the last one uses only the base of R.
1) rollsumr Create a matrix with rollsumr whose columns contain the rolling sums, convert that to row proportions, and take the "adv" column. Finally, assign that to a new column frac in DF. This approach has the shortest code.
library(zoo)
DF$frac <- prop.table(rollsumr(DF, 3, fill = NA), 1)[, "adv"]
giving:
> DF
adv dec frac
1 69 376 NA
2 113 293 NA
3 270 150 0.3556255
4 74 371 0.3595594
5 308 96 0.5137904
6 236 173 0.4912560
7 252 134 0.6638866
8 287 129 0.6399670
9 219 187 0.6274834
1a) This variation is similar except instead of using prop.table we write out the ratio. The code is longer but you may find it clearer.
m <- rollsumr(DF, 3, fill = NA)
DF$frac <- with(as.data.frame(m), adv / (adv + dec))
1b) This is a variation of (1) that is the same except it uses a magrittr pipeline:
library(magrittr)
DF %>% rollsumr(3, fill = NA) %>% prop.table(1) %>% `[`(TRUE, "adv") -> DF$frac
2) rollapplyr We could use rollapplyr with by.column = FALSE like this. The result is the same.
ratio <- function(x) sum(x[, "adv"]) / sum(x)
DF$frac <- rollapplyr(DF, 3, ratio, by.column = FALSE, fill = NA)
3) Yet another variation is to compute the numerator and denominator separately:
DF$frac <- rollsumr(DF$adv, 3, fill = NA) /
rollapplyr(DF, 3, sum, by.column = FALSE, fill = NA)
4) base This uses embed followed by rowSums on each column to get the rolling sums and then uses prop.table as in (1).
DF$frac <- prop.table(sapply(lapply(rbind(NA, NA, DF), embed, 3), rowSums), 1)[, "adv"]
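To see why (4) works, look at what embed does on a toy vector: each row of its result holds 3 consecutive values (most recent first), so rowSums over those rows gives the rolling sums:

```r
embed(1:5, 3)            # rows are the length-3 windows, newest value first
#>      [,1] [,2] [,3]
#> [1,]    3    2    1
#> [2,]    4    3    2
#> [3,]    5    4    3

rowSums(embed(1:5, 3))   # rolling sums of width 3
#> [1] 6 9 12
```

The rbind(NA, NA, DF) padding in (4) simply shifts things so the result lines up with the original rows, leaving NA where no full window exists.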
Note: The input used in reproducible form is:
Lines <- "adv dec
1 69 376
2 113 293
3 270 150
4 74 371
5 308 96
6 236 173
7 252 134
8 287 129
9 219 187"
DF <- read.table(text = Lines, header = TRUE)
Consider an sapply that loops through the number of rows in order to index two rows back:
DF$pred <- sapply(seq(nrow(DF)), function(i)
  ifelse(i >= 3,
         sum(DF$adv[(i-2):i]) / (sum(DF$adv[(i-2):i]) + sum(DF$dec[(i-2):i])),
         NA))
DF
# adv dec pred
# 1 69 376 NA
# 2 113 293 NA
# 3 270 150 0.3556255
# 4 74 371 0.3595594
# 5 308 96 0.5137904
# 6 236 173 0.4912560
# 7 252 134 0.6638866
# 8 287 129 0.6399670
# 9 219 187 0.6274834
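A cumulative-sum trick gives yet another base-only variant (roll3 is a hypothetical helper, not part of either answer above; it assumes the inputs contain no NAs):

```r
# rolling sum of width 3 via differences of cumulative sums
roll3 <- function(v) {
  cs  <- cumsum(v)
  out <- cs - c(0, 0, 0, head(cs, -3))
  out[1:2] <- NA   # the first two rows have no complete 3-row window
  out
}

DF$frac <- roll3(DF$adv) / roll3(DF$adv + DF$dec)
```

This runs in linear time with no per-row loop, which can help on long series, though for real work the zoo functions above handle NAs and alignment for you.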

Splitting a row into columns using a delimiter in R

My data looks like this:
ID:10:237,204,
ID:11:257,239,
ID:12:309,291,
ID:13:310,272,
ID:14:3202,3184,
ID:15:404,388,
I would like to first separate this into different columns, then apply a function to each row to calculate the difference of the comma-separated values, such as (237-204). This should be done without the use of external library packages.
Try this; if the data is in a file, replace the readLines line with something like L <- readLines("myfile.csv"). Replace the colons with commas using gsub, then read the resulting text and transform it:
# test data
Lines <- "ID:10:237,204,
ID:11:257,239,
ID:12:309,291,
ID:13:310,272,
ID:14:3202,3184,
ID:15:404,388,"
L <- readLines(textConnection(Lines))
DF <- read.table(text = gsub(":", ",", L), sep = ",")
transform(DF, diff = V3 - V4)
giving:
V1 V2 V3 V4 V5 diff
1 ID 10 237 204 NA 33
2 ID 11 257 239 NA 18
3 ID 12 309 291 NA 18
4 ID 13 310 272 NA 38
5 ID 14 3202 3184 NA 18
6 ID 15 404 388 NA 16
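A strsplit-based sketch does the same without read.table (it assumes every line has the shape ID:number:val1,val2, as in the sample):

```r
Lines <- c("ID:10:237,204,", "ID:11:257,239,")

# split on either delimiter; fields 3 and 4 are the two numbers
parts <- strsplit(Lines, "[:,]")
vals  <- t(sapply(parts, function(p) as.numeric(p[3:4])))
vals[, 1] - vals[, 2]
#> [1] 33 18
```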

I am trying to figure out how to parse a webpage

I am working on a summer project to grab course information from my school website.
I start off by going here: http://www.uah.edu/cgi-bin/schedule.pl?file=fall2015.html&segment=
to gather the course departments.
Then I grab info from pages like this one.
I have what I need filtered down to a list like:
[1] "91091 211 01 PRINC OF FINANCIAL ACCOUNTING 3.0 55 22 33 0 MW 12:45PM 02:05PM BAB 106 Rose-Green E"
[2] "91092 211 02 PRINC OF FINANCIAL ACCOUNTING 3.0 53 18 35 0 TR 09:35AM 10:55AM BAB 123 STAFF"
[3] "91093 211 03 PRINC OF FINANCIAL ACCOUNTING 3.0 48 29 19 0 TR 05:30PM 06:50PM BAB 220 Hoskins J"
[4] "91094 212 01 MANAGEMENT ACCOUNTING 3.0 55 33 22 0 MWF 11:30AM 12:25PM BAB 106 Hoskins J"
[5] "91095 212 02 MANAGEMENT ACCOUNTING 3.0 55 27 28 0 TR 02:20PM 03:40PM BAB 106 Bryson R"
However, my issues are as follows:
www.uah.edu/cgi-bin/schedule.pl?file=fall2015.html&segment=CS
I need to add the department from each url. In the link I gave, the department was "CS". I need to have that included with each entry.
I need to turn this into a table, or some other object, where I can reference the data like:
Max Wait
CRN Course Title Credit Enrl Enrl Avail List Days Start End Bldg Room Instructor
------ ---------- ------------------------------ ------ ---- ---- -------- ---- ------- ------- ------- ----- ---------- --------------------
Basically how the data is displayed on the page.
So my end goal is to go through each of those links I grab, get all the course info (except the section type), and then put it into a giant data.frame that has all the courses, like this:
Department CRN Course Title Credit MaxEnrl Enrl Avail WaitList Days Start End Bldg Room Instructor
ACC 91095 212 02 MANAGEMENT ACCOUNTING 3.0 55 27 28 0 TR 02:20PM 03:40PM BAB 106 Bryson R
So far I have this working
require(data.table)
require(gdata)
library(foreach)
uah <- readLines('http://www.uah.edu/cgi-bin/schedule.pl?file=fall2015.html&segment=')
uah <- substring(uah[grep('fall2015', uah)], 10)
uah <- sub("\\\"(.*)", "", uah)
uah <- paste("http://www.uah.edu" , uah , sep = "")
gatherClasses <- function(url){
  dep <- readLines(url)
  dep <- dep[grep('[[:digit:][:digit:][:digit:][:digit:]][[:digit:][:digit:][:digit:]] [[:digit:][:digit:]]', dep)]
  dep <- substring(dep, 6)
  dep <- foreach(i = dep) %do% i[grep('[[:digit:][:digit:][:digit:][:digit:]][[:digit:][:digit:][:digit:]] [[:digit:][:digit:]]', i)]
  dep <- foreach(i = dep) %do% trim(i)
  dep <- dep[2:length(dep)]
  return(dep)
}
x <- gatherClasses(uah[1])
x <-unlist(x)
I am having trouble splitting the data in the right places. I am not sure what I should try next.
EDIT:(Working Now)
require(data.table)
require(gdata)
library(foreach)
uah <- readLines('http://www.uah.edu/cgi-bin/schedule.pl?file=sum2015b.html&segment=')
uah <- substring(uah[grep('sum2015b', uah)], 10)
uah <- sub("\\\"(.*)", "", uah)
uah <- paste("http://www.uah.edu" , uah , sep = "")
gatherClasses <- function(url){
  L <- readLines(url)
  Fields <- sub(" *$", " ", grep("---", L, value = TRUE))
  widths <- diff(c(0, gregexpr(" ", Fields)[[1]]))
  Data <- grep("\\d{5} \\d{3}", L, value = TRUE)
  classes <- read.fwf(textConnection(Data), widths, as.is = TRUE, strip.white = TRUE)
  classes$department <- unlist(strsplit(url, '='))[3]
  return(classes)
}
allClasses = foreach(i=uah) %do% gatherClasses(i)
allClasses <- do.call("rbind", allClasses)
write.table(allClasses, "c:/sum2015b.txt", sep = "\t")
Read the lines into L, grab the "--- ---- etc." line into Fields and ensure that there is exactly one space at the end. Find the character positions of the spaces and difference them to get the field widths. Finally grep out the data portion and read it in using read.fwf which reads fixed width fields. For example, for Art History:
URL <- "http://www.uah.edu/cgi-bin/schedule.pl?file=fall2015.html&segment=ARH"
L <- readLines(URL)
Fields <- sub(" *$", " ", grep("---", L, value = TRUE))
widths <- diff(c(0, gregexpr(" ", Fields)[[1]]))
Data <- grep("\\d{5} \\d{3} \\d{2}", L, value = TRUE)
read.fwf(textConnection(Data), widths, as.is = TRUE, strip.white = TRUE)
giving:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15
1 90628 100 01 ARH SURV:ANCIENT-MEDIEVAL 3 35 27 8 0 TR 12:45PM 02:05PM WIL 168 Joyce L
2 90630 101 01 ARH SURV:RENAISSANCE-MODERN 3 35 14 21 0 MW 12:45PM 02:05PM WIL 168 Stewart D
3 90631 101 02 ARH SURV:RENAISSANCE-MODERN 3 35 8 27 0 MW 03:55PM 05:15PM WIL 168 Stewart D
4 92269 101 03 ARH SURV:RENAISSANCE-MODERN 3 35 5 30 0 TR 11:10AM 12:30PM WIL 168 Shapiro Guanlao M
5 90632 101 04 ARH SURV:RENAISSANCE-MODERN 3 35 13 22 0 TR 02:20PM 03:40PM WIL 168 Shapiro Guanlao M
6 90633 301 01 ANCIENT GREEK ART 3 18 3 15 0 MW 02:20PM 03:40PM WIL 168 Joyce L
7 92266 306 01 COLLAPSE OF CIVILIZATIONS 3 10 4 6 0 TR 12:45PM 02:05PM SST 205 Sever T
8 W 90634 309 01 CONTEMPORARY ART & ISSUES 3 18 10 8 0 TR 09:35AM 10:55AM WIL 168 Stewart D
9 90635 320 01 ST: MODERN ARCHITECTURE 3 12 0 12 0 TR 11:10AM 12:30PM WIL 172 Takacs T
10 90636 400 01 SENIOR THESIS 3 0 0 0 0 TBA TBA TBA TBA Joyce L
11 90637 400 02 SENIOR THESIS 3 0 0 0 0 TBA TBA TBA TBA Stewart D
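The width computation can be illustrated on a toy ruler line (a made-up three-field layout, not the real UAH one; note the single space appended after the last field, which is what the sub() call above guarantees):

```r
ruler  <- "----- --- --------- "
widths <- diff(c(0, gregexpr(" ", ruler)[[1]]))
widths
#> [1]  6  4 10

# those widths then slice a fixed-width record into its fields
read.fwf(textConnection("90628 100 Joyce L"), widths,
         as.is = TRUE, strip.white = TRUE)
```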
I wrote and donated that schedule.pl script about 20 years ago, because they simply published the flat mainframe files of all the courses on offer for each session. The script's job is to break up the whole set and present it in human-consumable chunks. (That, and back then a browser would choke on that much data.) I understand from one of the former UAH IT people that they tried to do away with it once, but got a great hue and cry from users, so they figured out how to keep it working.
It would be easier for you to ask the UAH IT folks if you can't just retrieve the underlying flat file. It used to be on a public-facing URL, but like I said, that was about 20 years ago, so I don't recall the specifics. The output you see when viewing courses is the same as the flat file, but the flat file contains every department, so you don't have to fetch each separately.

finding largest consecutive region in table

I'm trying to find regions in a file that have consecutive lines based on two columns, and I want to find the largest span of consecutive values. If the value in column 4 (V3) comes immediately before the next line's value in column 3 (V2), the lines are consecutive; write the output for the longest such span.
The input looks like this. input:
> x
grp V1 V2 V3 V4 V5 V6
1: 1 DOG.1 142 144 132 134 0
2: 2 DOG.1 313 315 303 305 0
3: 3 DOG.1 316 318 306 308 0
4: 4 DOG.1 319 321 309 311 0
5: 5 DOG.1 322 324 312 314 0
the output should look like this:
out.name in out
[1,] "DOG.1" "313" "324"
Notice how x[1,] was removed, and how the output starts at x[2,3] and ends at x[5,4]. All of these values are consecutive.
One obvious way is to take tail(x$V2, -1L) - head(x$V3, -1L) and get the start and end indices corresponding to the maximum run of consecutive 1s. But I'll skip that here (and leave it to others), as I'd like to show how this can be done with the help of the IRanges package:
require(data.table)
require(IRanges) ## Bioconductor package
x.ir = reduce(IRanges(x$V2, x$V3))
max.idx = which.max(width(x.ir))
ans = data.table(out.name = "DOG.1",
                 `in` = start(x.ir)[max.idx],  # `in` is a reserved word in R, hence the backticks
                 out = end(x.ir)[max.idx])
#    out.name  in out
# 1:    DOG.1 313 324
