Plot a decision tree with R

I have a 440*2 matrix that looks like:
1 144
1 152
1 135
2 3
2 12
2 107
2 31
3 4
3 147
3 0
4 end
4 0
4 0
5 6
5 7
5 10
5 9
The left column contains the starting points, e.g. in the app all the 1s on the left would be on the same page. They lead to three choices: pages 144, 152 and 135. Each of these pages can then lead to another page, and so on until the right-hand column says 'end'. What I would like is a way to visualise the scale of this tree. I realise it will be quite large given the number of rows, so maybe not graph-friendly; for clarity I want to know how many possible routes there are in total (from every start point, down every option it gives, to the end destination of each). I realise there will be overlaps, but that's why I am finding this hard to calculate.
Secondly, each number has an associated title. I would like a function whereby, if you input a given title, it plots all the possible starting points and the paths that lead there. This should be a lot smaller and therefore graph-friendly.
e.g.
dta <- "
14 12 as
186 187 Frac
187 154 Low
23 52 Med
52 11 Lip
15 55 asd
11 42 AAA
42 154 BBB
154 end Coll"
Edited example data to show that some branches are not connected to the desired tree:
dta <- "
14 12 as
186 187 Frac
187 154 Low
23 52 Med
52 11 Lip
11 42 AAA
42 154 BBB
154 end Coll"
dta <- gsub(" ", ",", dta, fixed = TRUE)
df <- read.csv(textConnection(dta), stringsAsFactors = FALSE, header = FALSE)
names(df) <- c("from", "to", "nme")
library(data.tree)
Warning message:
package ‘data.tree’ was built under R version 3.2.5
tree <- FromDataFrameNetwork(df)
Error in FromDataFrameNetwork(df) : Cannot find root name. network is not a tree!
I made this example to show how a value in column 1 leads to a value in column 2, which then appears again in column 1, and so on until you reach 'end'. Different starting points can lead to paths of different lengths to the same destination, so this would look something like a tree (image omitted). Here, I wanted to see how you could go from all start points to 'Coll'.
I would greatly appreciate any help.

If you indeed have a tree (i.e. no cycles), you can use data.tree.
Start by converting to a data.frame:
dta <- "
14 12 as
186 187 Frac
187 154 Low
23 52 Med
52 11 Lip
15 55 asd
11 42 AAA
42 154 BBB
154 end Coll
55 end efg
12 end hij"
dta <- gsub(" ", ",", dta, fixed = TRUE)
df <- read.csv(textConnection(dta), stringsAsFactors = FALSE, header = FALSE)
names(df) <- c("from", "to", "nme")
Now, convert to a data.tree:
library(data.tree)
tree <- FromDataFrameNetwork(df)
tree$leafCount
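For the first part of the question (how many routes there are in total): with this construction the tree ends up rooted at 'end' (see the printed sub-tree further down), so each leaf is a starting page and tree$leafCount is the number of distinct routes into 'end' (4 in this small example). A couple of related one-liners, as a sketch:
tree$totalCount                  # total number of pages (nodes) in the tree
print(tree, "nme", limit = 20)   # peek at the structure without plotting all of it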
You can now navigate to any sub-tree, for analysis and plotting. E.g. using any of the following possibilities:
subTree <- tree$FindNode(187)
subTree <- Climb(tree, nme = "Coll", nme = "Low")
subTree <- tree$`154`$`187`
subTree <- Clone(tree$`154`)
Maybe printing is all you need:
print(subTree, "nme")
This will print like so:
  levelName          nme
1 154                Coll
2  ¦--187            Low
3  ¦   °--186        Frac
4  °--42             BBB
5      °--11         AAA
6          °--52     Lip
7              °--23 Med
Otherwise, use fancy plotting:
SetNodeStyle(subTree, style = "filled,rounded", shape = "box", fontname = "helvetica", label = function(node) node$nme, tooltip = "name")
plot(subTree, direction = "descend")
The result is rendered as a Graphviz-style diagram of the sub-tree, one rounded box per page, labelled with its title (image omitted).
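For the second part of the question (plot everything that leads to a given title), you can wrap the same pieces in a small function. A sketch, where plotByTitle is a made-up name and the lookup uses data.tree's Traverse() to search the nme field:
plotByTitle <- function(tree, title) {
  hits <- Traverse(tree, filterFun = function(node) identical(node$nme, title))
  if (length(hits) == 0) stop("title not found")
  subTree <- Clone(hits[[1]])   # sub-tree of everything leading to that page
  SetNodeStyle(subTree, style = "filled,rounded", shape = "box",
               fontname = "helvetica", label = function(node) node$nme)
  plot(subTree, direction = "descend")
}
plotByTitle(tree, "Coll")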


How to remove just the set of numbers with / in between among other strings? [duplicate]

I need to extract the blood pressure values from a text note. They are typically reported as a larger number over a smaller number separated by "/", with the units mm Hg (it's not a fraction, it's just written like one). In the 4 examples below, I want to extract 114/46, 135/67, 109/50 and 188/98 only, without space before or after, and place the top number in a column called SBP and the bottom number in a column called DBP.
Thank you in advance for your assistance.
bb <- c("PATIENT/TEST INFORMATION (m2): 1.61 m2\n BP (mm Hg): 114/46 HR 60 (bpm)", "PATIENT/TEST INFORMATION:\ 63\n Weight (lb): 100\nBSA (m2): 1.44 m2\nBP (mm Hg): 135/67 HR 75 (bpm)", "PATIENT/TEST INFORMATION:\nIndication: Coronary artery disease. Hypertension. Myocardial infarction.\nWeight (lb): 146\nBP (mm Hg): 109/50 HR (bpm)", "PATIENT/TEST INFORMATION:\nIndication: Aortic stenosis. Congestive heart failure. Shortness of breath.\nHeight: (in) 64\nWeight (lb): 165\nBSA (m2): 1.80 m2\nBP (mm Hg): 188/98 HR 140 (bpm) ")
BP <- head(bb,4)
dput(bb)
Base R solution:
setNames(data.frame(do.call("rbind", strsplit(trimws(gsub("[[:alpha:]]|[[:punct:]][^0-9]+", "",
gsub("HR.*", "", paste0("BP", lapply(strsplit(bb, "BP"), '[', 2)))), "both"), "/"))),
c("SBP", "DBP"))
We can use regmatches/regexpr from base R to extract the required values, and then with read.table, create a two column data.frame
read.table(text = regmatches(bb, regexpr('\\d+/\\d+', bb)),
sep="/", header = FALSE, stringsAsFactors = FALSE)
# V1 V2
#1 114 46
#2 135 67
#3 109 50
#4 188 98
Or using strcapture from base R
strcapture( "(\\d+)\\/(\\d+)", bb, data.frame(X1 = integer(), X2 = integer()))
# X1 X2
#1 114 46
#2 135 67
#3 109 50
#4 188 98
To create these as new columns in the original data.frame, either use cbind to bind the output with the original dataset:
cbind(data, read.table(text = ...))
Or
data[c("V1", "V2")] <- read.table(text = ...)
Or using extract from tidyr
library(dplyr)
library(tidyr)
tibble(bb) %>%
extract(bb, into = c("X1", "X2"), ".*\\b(\\d+)/(\\d+).*", convert = TRUE)
# A tibble: 4 x 2
# X1 X2
# <int> <int>
#1 114 46
#2 135 67
#3 109 50
#4 188 98
If we don't want to remove the original column, use remove = FALSE in extract
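For example, the same call with the original column kept:
tibble(bb) %>%
  extract(bb, into = c("X1", "X2"), ".*\\b(\\d+)/(\\d+).*", convert = TRUE, remove = FALSE)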
You could use str_match and select the numbers which have / in between:
as.data.frame(stringr::str_match(bb, "(\\d+)/(\\d+)")[, 2:3])
# X1 X2
#1 114 46
#2 135 67
#3 109 50
#4 188 98
In base R, we can extract the numbers that follow the pattern a/b, split them on '/' and form two columns.
as.data.frame(do.call(rbind, strsplit(sub(".*?(\\d+/\\d+).*", "\\1", bb), "/")))
You can give them the column names as per your choice using setNames or any other method.

Import in R with column headers across 3 rows. Replace missing with latest non-missing column

I need help importing data where my column header is split across 3 rows, with some header names implied. Here is what my xlsx file looks like
1 USA China
2 Dollars Volume Dollars Volume
3 Category Brand CY2016 CY2017 CY2016 CY2017 CY2016 CY_2017 CY2016 CY2017
4 Chocolate Snickers 100 120 15 18 100 80 20 22
5 Chocolate Twix 70 80 8 10 75 50 55 20
I would like to import the data into R while retaining the headers in rows 1 & 2 as part of the column names. An added challenge is that some headers are implied: if a header cell is blank, I would like it to use the nearest non-blank cell to its left. Here is an example of what I'd like it to import as:
1 Category Brand USA_Dollars_CY2016 USA_Dollars_CY2017 USA_Volume_CY2016 USA_Volume_CY2017 China_Dollars_CY2016 China_Dollars_CY_2017 China_Volume_CY2016 China_Volume_CY2017
2 Chocolate Snickers 100 120 15 18 100 80 20 22
3 Chocolate Twix 70 80 8 10 75 50 55 20
My current method is to import, skipping rows 1 & 2, and then just rename the columns based on known position. However, I was hoping code existed that would spare me this step. Thank you!!
I will assume that you have saved the xlsx data in .csv format, so it can be read in like this:
header <- read.csv("data.csv", header=F, colClasses="character", nrow=3)
dat <- read.csv("data.csv", header=F, skip=3)
The tricky part is the header. This function should do it:
construct_colnames <- function(header) {
  # carry the last non-blank entry forward (left to right) within each header row
  f <- function(x) {
    x <- as.character(x)
    c("", x[!is.na(x) & x != ""])[cumsum(!is.na(x) & x != "") + 1]
  }
  res <- apply(header, 1, f)                    # fill blanks in each of the 3 header rows
  res <- apply(res, 1, paste0, collapse = "_")  # paste the 3 header pieces per column
  sub("^_*", "", res)                           # drop leading "_" for Category/Brand
}
colnames(dat) <- construct_colnames(header)
dat
Result:
Category Brand USA_Dollars_CY2016 USA_Dollars_CY2017 USA_Volume_CY2016 USA_Volume_CY2017 China_Dollars_CY2016
1 Chocolate Snickers 100 120 15 18 100
2 Chocolate Twix 70 80 8 10 75
China_Dollars_CY_2017 China_Volume_CY2016 China_Volume_CY2017
1 80 20 22
2 50 55 20
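If you would rather skip the intermediate .csv, the same idea works straight from the workbook with readxl; a sketch, where "data.xlsx" is a placeholder path and the header is assumed to occupy the first 3 rows as in the example:
library(readxl)
header <- as.data.frame(read_excel("data.xlsx", col_names = FALSE, n_max = 3))
dat <- as.data.frame(read_excel("data.xlsx", col_names = FALSE, skip = 3))
colnames(dat) <- construct_colnames(header)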

R - Sum range over lookback period, divided by sum of lookback - Excel to R

I am looking to work out a percentage total over a look-back range in R.
I know how to do this in excel with the following formula:
=SUM(B2:B4)/SUM(B2:B4,C2:C4)
This sums column B over a range from the current row looking back 3 rows. It then divides this sum by the total sum of columns B + C over the same look-back.
I am looking to achieve the same calculation in R to run across my matrix.
The output would look something like this:
adv dec perct
1 69 376
2 113 293
3 270 150 0.355625492
4 74 371 0.359559402
5 308 96 0.513790386
6 236 173 0.491255962
7 252 134 0.663886572
8 287 129 0.639966969
9 219 187 0.627483444
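(Check against the Excel formula: for row 3, (69 + 113 + 270) / ((69 + 113 + 270) + (376 + 293 + 150)) = 452 / 1271 ≈ 0.3556, which matches the first perct value.)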
This is a line of code to which I could perhaps add the look-back range:
perct <- apply(data.matrix[,c('adv','dec')], 1, function(x) { x[1] / (x[1] + x[2]) } )
If I could get x[1] to sum over the previous 3 rows, and x[2] to do the same, that would give the result.
Still learning how to apply forward and look back periods within R. So any additional learning on the answer would be appreciated!
Here are some approaches. The first 3 use rollsumr and/or rollapplyr in zoo and the last one uses only the base of R.
1) rollsumr Create a matrix with rollsumr whose columns contain the rolling sums, convert that to row proportions and take the "adv" column. Finally, assign that to a new column frac in DF. This approach has the shortest code.
library(zoo)
DF$frac <- prop.table(rollsumr(DF, 3, fill = NA), 1)[, "adv"]
giving:
> DF
adv dec frac
1 69 376 NA
2 113 293 NA
3 270 150 0.3556255
4 74 371 0.3595594
5 308 96 0.5137904
6 236 173 0.4912560
7 252 134 0.6638866
8 287 129 0.6399670
9 219 187 0.6274834
1a) This variation is similar except instead of using prop.table we write out the ratio. The code is longer but you may find it clearer.
m <- rollsumr(DF, 3, fill = NA)
DF$frac <- with(as.data.frame(m), adv / (adv + dec))
1b) This is a variation of (1) that is the same except it uses a magrittr pipeline:
library(magrittr)
DF %>% rollsumr(3, fill = NA) %>% prop.table(1) %>% `[`(TRUE, "adv") -> DF$frac
2) rollapplyr We could use rollapplyr with by.column = FALSE like this. The result is the same.
ratio <- function(x) sum(x[, "adv"]) / sum(x)
DF$frac <- rollapplyr(DF, 3, ratio, by.column = FALSE, fill = NA)
3) Yet another variation is to compute the numerator and denominator separately:
DF$frac <- rollsumr(DF$adv, 3, fill = NA) /
rollapplyr(DF, 3, sum, by.column = FALSE, fill = NA)
4) base This uses embed followed by rowSums on each column to get the rolling sums and then uses prop.table as in (1).
DF$frac <- prop.table(sapply(lapply(rbind(NA, NA, DF), embed, 3), rowSums), 1)[, "adv"]
Note: The input used in reproducible form is:
Lines <- "adv dec
1 69 376
2 113 293
3 270 150
4 74 371
5 308 96
6 236 173
7 252 134
8 287 129
9 219 187"
DF <- read.table(text = Lines, header = TRUE)
Consider an sapply that loops through the number of rows in order to index two rows back:
DF$pred <- sapply(seq(nrow(DF)), function(i)
ifelse(i>=3, sum(DF$adv[(i-2):i])/(sum(DF$adv[(i-2):i]) + sum(DF$dec[(i-2):i])), NA))
DF
# adv dec pred
# 1 69 376 NA
# 2 113 293 NA
# 3 270 150 0.3556255
# 4 74 371 0.3595594
# 5 308 96 0.5137904
# 6 236 173 0.4912560
# 7 252 134 0.6638866
# 8 287 129 0.6399670
# 9 219 187 0.6274834

R: Convert consensus output into a data frame

I'm currently performing a multiple sequence alignment using the 'msa' package from Bioconductor. I'm using this to calculate the consensus sequence (msaConsensusSequence) and conservation score (msaConservationScore). This gives me outputs that are values ...
e.g.
ConsensusSequence:
i.llE etc (str = chr)
(lower case = 20%+ conservation, uppercase = 80%+ conservation, . = <20% conservation)
ConservationScore:
221 -296 579 71 423 etc (str = named num)
I would like to convert these into a table where the first row contains columns where each is a different letter in the consensus sequence and the second row is the corresponding conservation score.
e.g.
i . l l E
221 -296 579 71 423
Could people please advise on the best way to go about this?
Thanks
Natalie
From what you have said in the comments, you can get a data frame like this:
data(BLOSUM62)
alignment <- msa(mySequences)
conservation <- msaConservationScore(alignment, BLOSUM62)
# Now create the data frame
df <- data.frame(consensus = names(conservation), conservation = conservation)
head(df)
consensus conservation
1 T 141
2 E 160
3 E 165
4 E 325
5 ? 179
6 ? 71
7 T 216
8 W 891
9 ? 38
10 T 405
11 L 204
If you prefer to transpose it you can:
df <- t(df)
colnames(df) <- 1:ncol(df)
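If you specifically want the letters as msaConsensusSequence() reports them (with the upper/lower-case and '.' convention from the question), a sketch along the same lines, reusing alignment and conservation from above:
cons <- msaConsensusSequence(alignment)
df <- data.frame(consensus = strsplit(cons, "")[[1]],   # one letter per alignment column
                 conservation = as.numeric(conservation))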

I am trying to figure out how to parse a webpage

I am working on a summer project to grab course information from my school website.
I start off by going here: http://www.uah.edu/cgi-bin/schedule.pl?file=fall2015.html&segment=
to gather the course departments.
Then I grab info from pages like this one.
I have what I need filtered down to a list like:
[1] "91091 211 01 PRINC OF FINANCIAL ACCOUNTING 3.0 55 22 33 0 MW 12:45PM 02:05PM BAB 106 Rose-Green E"
[2] "91092 211 02 PRINC OF FINANCIAL ACCOUNTING 3.0 53 18 35 0 TR 09:35AM 10:55AM BAB 123 STAFF"
[3] "91093 211 03 PRINC OF FINANCIAL ACCOUNTING 3.0 48 29 19 0 TR 05:30PM 06:50PM BAB 220 Hoskins J"
[4] "91094 212 01 MANAGEMENT ACCOUNTING 3.0 55 33 22 0 MWF 11:30AM 12:25PM BAB 106 Hoskins J"
[5] "91095 212 02 MANAGEMENT ACCOUNTING 3.0 55 27 28 0 TR 02:20PM 03:40PM BAB 106 Bryson R"
However my issues are as follows:
www.uah.edu/cgi-bin/schedule.pl?file=fall2015.html&segment=CS
I need to add the department from each url. In the link I gave, the department was "CS". I need to have that included with each entry.
I need to turn this into a table, or some other object where I can reference the data like this:
Max Wait
CRN Course Title Credit Enrl Enrl Avail List Days Start End Bldg Room Instructor
------ ---------- ------------------------------ ------ ---- ---- -------- ---- ------- ------- ------- ----- ---------- --------------------
Basically how the data is displayed on the page.
So my end goal is to go through each of those links I grab, get all the course info (except the section type), then put it into a giant data.frame that has all the courses, like this:
Department CRN Course Title Credit MaxEnrl Enrl Avail WaitList Days Start End Bldg Room Instructor
ACC 91095 212 02 MANAGEMENT ACCOUNTING 3.0 55 27 28 0 TR 02:20PM 03:40PM BAB 106 Bryson R
So far I have this working
require(data.table)
require(gdata)
library(foreach)
uah <- readLines('http://www.uah.edu/cgi-bin/schedule.pl?file=fall2015.html&segment=')
uah <- substring(uah[grep('fall2015', uah)], 10)
uah <- sub("\\\"(.*)", "", uah)
uah <- paste("http://www.uah.edu" , uah , sep = "")
gatherClasses <- function(url){
dep <- readLines(url)
dep <- dep[grep('[[:digit:][:digit:][:digit:][:digit:]][[:digit:][:digit:][:digit:]] [[:digit:][:digit:]]', dep)]
dep <- substring(dep, 6)
dep <- foreach(i=dep) %do% i[grep('[[:digit:][:digit:][:digit:][:digit:]][[:digit:][:digit:][:digit:]] [[:digit:][:digit:]]', i)]
dep <- foreach(i=dep) %do% trim(i)
dep <- dep[2:length(dep)]
return(dep)
}
x <- gatherClasses(uah[1])
x <-unlist(x)
I am having trouble splitting the data in the right places. I am not sure what I should try next.
EDIT: (working now)
require(data.table)
require(gdata)
library(foreach)
uah <- readLines('http://www.uah.edu/cgi-bin/schedule.pl?file=sum2015b.html&segment=')
uah <- substring(uah[grep('sum2015b', uah)], 10)
uah <- sub("\\\"(.*)", "", uah)
uah <- paste("http://www.uah.edu" , uah , sep = "")
gatherClasses <- function(url){
  L <- readLines(url)
  Fields <- sub(" *$", " ", grep("---", L, value = TRUE))
  widths <- diff(c(0, gregexpr(" ", Fields)[[1]]))
  Data <- grep("\\d{5} \\d{3}", L, value = TRUE)
  classes <- read.fwf(textConnection(Data), widths, as.is = TRUE, strip.white = TRUE)
  classes$department <- unlist(strsplit(url, '='))[3]
  return(classes)
}
allClasses <- foreach(i = uah) %do% gatherClasses(i)
allClasses <- do.call("rbind", allClasses)
write.table(allClasses, "c:/sum2015b.txt", sep = "\t")
Read the lines into L, grab the "--- ---- etc." line into Fields and ensure that there is exactly one space at the end. Find the character positions of the spaces and difference them to get the field widths. Finally grep out the data portion and read it in using read.fwf which reads fixed width fields. For example, for Art History:
URL <- "http://www.uah.edu/cgi-bin/schedule.pl?file=fall2015.html&segment=ARH"
L <- readLines(URL)
Fields <- sub(" *$", " ", grep("---", L, value = TRUE))
widths <- diff(c(0, gregexpr(" ", Fields)[[1]]))
Data <- grep("\\d{5} \\d{3} \\d{2}", L, value = TRUE)
read.fwf(textConnection(Data), widths, as.is = TRUE, strip.white = TRUE)
giving:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15
1 90628 100 01 ARH SURV:ANCIENT-MEDIEVAL 3 35 27 8 0 TR 12:45PM 02:05PM WIL 168 Joyce L
2 90630 101 01 ARH SURV:RENAISSANCE-MODERN 3 35 14 21 0 MW 12:45PM 02:05PM WIL 168 Stewart D
3 90631 101 02 ARH SURV:RENAISSANCE-MODERN 3 35 8 27 0 MW 03:55PM 05:15PM WIL 168 Stewart D
4 92269 101 03 ARH SURV:RENAISSANCE-MODERN 3 35 5 30 0 TR 11:10AM 12:30PM WIL 168 Shapiro Guanlao M
5 90632 101 04 ARH SURV:RENAISSANCE-MODERN 3 35 13 22 0 TR 02:20PM 03:40PM WIL 168 Shapiro Guanlao M
6 90633 301 01 ANCIENT GREEK ART 3 18 3 15 0 MW 02:20PM 03:40PM WIL 168 Joyce L
7 92266 306 01 COLLAPSE OF CIVILIZATIONS 3 10 4 6 0 TR 12:45PM 02:05PM SST 205 Sever T
8 W 90634 309 01 CONTEMPORARY ART & ISSUES 3 18 10 8 0 TR 09:35AM 10:55AM WIL 168 Stewart D
9 90635 320 01 ST: MODERN ARCHITECTURE 3 12 0 12 0 TR 11:10AM 12:30PM WIL 172 Takacs T
10 90636 400 01 SENIOR THESIS 3 0 0 0 0 TBA TBA TBA TBA Joyce L
11 90637 400 02 SENIOR THESIS 3 0 0 0 0 TBA TBA TBA TBA Stewart D
I wrote and donated that schedule.pl script about 20 years ago because they simply published the flat mainframe files of all the courses on offer for each session. The script's job is to break up the whole set and present it in human-consumable chunks. (That, and back then a browser would choke on that much data.) I understand from one of the former UAH IT people that they tried to do away with it once, but got a great hue and cry from users, so they figured out how to keep it working.
It would be easier for you to ask the UAH IT folks if you can't just retrieve the underlying flat file. It used to be on a public-facing URL, but like I said, that was about 20 years ago, so I don't recall the specifics. The output you see when viewing courses is the same as the flat file, but the flat file contains every department, so you don't have to fetch each separately.
