I am trying to figure out how to parse a webpage

I am trying to figure out how to parse a webpage - r

I am working on a summer project. To grab course information from my school website.
I start off by going here: http://www.uah.edu/cgi-bin/schedule.pl?file=fall2015.html&segment=
to gather the course departments.
Then I grab info from pages like this one.
I have what I need filtered down to a list like:
[1] "91091 211 01 PRINC OF FINANCIAL ACCOUNTING 3.0 55 22 33 0 MW 12:45PM 02:05PM BAB 106 Rose-Green E"
[2] "91092 211 02 PRINC OF FINANCIAL ACCOUNTING 3.0 53 18 35 0 TR 09:35AM 10:55AM BAB 123 STAFF"
[3] "91093 211 03 PRINC OF FINANCIAL ACCOUNTING 3.0 48 29 19 0 TR 05:30PM 06:50PM BAB 220 Hoskins J"
[4] "91094 212 01 MANAGEMENT ACCOUNTING 3.0 55 33 22 0 MWF 11:30AM 12:25PM BAB 106 Hoskins J"
[5] "91095 212 02 MANAGEMENT ACCOUNTING 3.0 55 27 28 0 TR 02:20PM 03:40PM BAB 106 Bryson R"
However my issues are as follows:
www.uah.edu/cgi-bin/schedule.pl?file=fall2015.html&segment=CS
I need to add the department from each url. In the link I gave, the department was "CS". I need to have that included with each entry.
I need to turn this into a table, or some other object where I can reference the data like
Max Wait
CRN Course Title Credit Enrl Enrl Avail List Days Start End Bldg Room Instructor
------ ---------- ------------------------------ ------ ---- ---- -------- ---- ------- ------- ------- ----- ---------- --------------------
Basically how the data is displayed on the page.
So my end goal is to go through each of those links I grab, get all the course info(except the section type). Then put it into a giant data.frame that has all the courses like this.
Department CRN Course Title Credit MaxEnrl Enrl Avail WaitList Days Start End Bldg Room Instructor
ACC 91095 212 02 MANAGEMENT ACCOUNTING 3.0 55 27 28 0 TR 02:20PM 03:40PM BAB 106 Bryson R
So far I have this working
require(data.table)
require(gdata)
library(foreach)
uah <- readLines('http://www.uah.edu/cgi-bin/schedule.pl?file=fall2015.html&segment=')
uah <- substring(uah[grep('fall2015', uah)], 10)
uah <- sub("\\\"(.*)", "", uah)
uah <- paste("http://www.uah.edu" , uah , sep = "")
gatherClasses <- function(url){
dep <- readLines(url)
dep <- dep[grep('[[:digit:][:digit:][:digit:][:digit:]][[:digit:][:digit:][:digit:]] [[:digit:][:digit:]]', dep)]
dep <- substring(dep, 6)
dep <- foreach(i=dep) %do% i[grep('[[:digit:][:digit:][:digit:][:digit:]][[:digit:][:digit:][:digit:]] [[:digit:][:digit:]]', i)]
dep <- foreach(i=dep) %do% trim(i)
dep <- dep[2:length(dep)]
return(dep)
}
x <- gatherClasses(uah[1])
x <-unlist(x)
I am having trouble split the data in the right places. I am not sure what I should try next.
EDIT:(Working Now)
require(data.table)
require(gdata)
library(foreach)
uah <- readLines('http://www.uah.edu/cgi-bin/schedule.pl?file=sum2015b.html&segment=')
uah <- substring(uah[grep('sum2015b', uah)], 10)
uah <- sub("\\\"(.*)", "", uah)
uah <- paste("http://www.uah.edu" , uah , sep = "")
gatherClasses <- function(url){
L <- readLines(url)
Fields <- sub(" *$", " ", grep("---", L, value = TRUE))
widths <- diff(c(0, gregexpr(" ", Fields)[[1]]))
Data <- grep("\\d{5} \\d{3}", L, value = TRUE)
classes <- read.fwf(textConnection(Data), widths, as.is = TRUE, strip.white = TRUE)
classes$department <- unlist(strsplit(url, '='))[3]
return(classes)
}
allClasses = foreach(i=uah) %do% gatherClasses(i)
allClasses <- do.call("rbind", allClasses)
write.table(mydata, "c:/sum2015b.txt", sep="\t")

Read the lines into L, grab the "--- ---- etc." line into Fields and ensure that there is exactly one space at the end. Find the character positions of the spaces and difference them to get the field widths. Finally grep out the data portion and read it in using read.fwf which reads fixed width fields. For example, for Art History:
URL <- "http://www.uah.edu/cgi-bin/schedule.pl?file=fall2015.html&segment=ARH"
L <- readLines(URL)
Fields <- sub(" *$", " ", grep("---", L, value = TRUE))
widths <- diff(c(0, gregexpr(" ", Fields)[[1]]))
Data <- grep("\\d{5} \\d{3} \\d{2}", L, value = TRUE)
read.fwf(textConnection(Data), widths, as.is = TRUE, strip.white = TRUE)
giving:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15
1 90628 100 01 ARH SURV:ANCIENT-MEDIEVAL 3 35 27 8 0 TR 12:45PM 02:05PM WIL 168 Joyce L
2 90630 101 01 ARH SURV:RENAISSANCE-MODERN 3 35 14 21 0 MW 12:45PM 02:05PM WIL 168 Stewart D
3 90631 101 02 ARH SURV:RENAISSANCE-MODERN 3 35 8 27 0 MW 03:55PM 05:15PM WIL 168 Stewart D
4 92269 101 03 ARH SURV:RENAISSANCE-MODERN 3 35 5 30 0 TR 11:10AM 12:30PM WIL 168 Shapiro Guanlao M
5 90632 101 04 ARH SURV:RENAISSANCE-MODERN 3 35 13 22 0 TR 02:20PM 03:40PM WIL 168 Shapiro Guanlao M
6 90633 301 01 ANCIENT GREEK ART 3 18 3 15 0 MW 02:20PM 03:40PM WIL 168 Joyce L
7 92266 306 01 COLLAPSE OF CIVILIZATIONS 3 10 4 6 0 TR 12:45PM 02:05PM SST 205 Sever T
8 W 90634 309 01 CONTEMPORARY ART & ISSUES 3 18 10 8 0 TR 09:35AM 10:55AM WIL 168 Stewart D
9 90635 320 01 ST: MODERN ARCHITECTURE 3 12 0 12 0 TR 11:10AM 12:30PM WIL 172 Takacs T
10 90636 400 01 SENIOR THESIS 3 0 0 0 0 TBA TBA TBA TBA Joyce L
11 90637 400 02 SENIOR THESIS 3 0 0 0 0 TBA TBA TBA TBA Stewart D

I wrote and donated that schedule.pl script about 20 years ago because they simply published the flat mainframe files of all the courses on offer for each session. The script's job is to break up the whole set and present it in human-consumable chunks. (That, and back then a browser would choke on that much data.) I understand from one of the former UAH IT people that they tried to do away with it once, but got a great hew and cry from users, so they figured out how to keep it working.
It would be easier for you to ask the UAH IT folks if you can't just retrieve the underlying flat file. It used to be on a public-facing URL, but like I said, that was about 20 years ago, so I don't recall the specifics. The output you see when viewing courses is the same as the flat file, but the flat file contains every department, so you don't have to fetch each separately.

Related

writing out .dat file in r

I have a dataset looks like this:
ids <- c(111,12,134,14,155,16,17,18,19,20)
scores.1 <- c(0,1,0,1,1,2,0,1,1,1)
scores.2 <- c(0,0,0,1,1,1,1,1,1,0)
data <- data.frame(ids, scores.1, scores.1)
> data
ids scores.1 scores.1.1
1 111 0 0
2 12 1 1
3 134 0 0
4 14 1 1
5 155 1 1
6 16 2 2
7 17 0 0
8 18 1 1
9 19 1 1
10 20 1 1
ids stands for student ids, scores.1 is the response/score for the first question, and scores.2 is the response/score for the second question. Student ids vary in terms of the number of digits but scores always have 1 digit. I am trying to write out as .dat file by generating some object and use those in write.fwf function in gdata library.
item.count <- dim(data)[2] - 1 # counts the number of questions in the dataset
write.fwf(data, file = "data.dat", width = c(5,rep(1, item.count)),
colnames = FALSE, sep = "")
I would like to separate the student ids and question response with some spaces,so I would like to use 5 spaces for students ids and to specify that I used width = c(5, rep(1, item.count)) in write.fwf() function. However, the output file looks like this having the spaces at the left side of the student ids
11100
1211
13400
1411
15511
1622
1700
1811
1911
2011
rather than at the right side of the ids.
111 00
12 11
134 00
14 11
155 11
16 22
17 00
18 11
19 11
20 11
Any recommendations?
Thanks!

We can use unite to unite the 'score' columns into a single one and then use write.csv
library(dplyr)
library(tidyr)
data %>%
unite(scores, starts_with('scores'), sep='')

with #akrun's help, this gives what I wanted:
library(dplyr)
library(tidyr)
data %>%
unite(scores, starts_with('scores'), sep='')
write.fwf(data, file = "data.dat",
width = c(5,item.count),
colnames = FALSE, sep = " ")
in the .dat file, the dataset looks like this below:
111 00
12 11
134 00
14 11
155 11
16 22
17 00
18 11
19 11
20 11

Import in R with column headers across 3 rows. Replace missing with latest non-missing column

I need help importing data where my column header is split across 3 rows, with some header names implied. Here is what my xlsx file looks like
1 USA China
2 Dollars Volume Dollars Volume
3 Category Brand CY2016 CY2017 CY2016 CY2017 CY2016 CY_2017 CY2016 CY2017
4 Chocolate Snickers 100 120 15 18 100 80 20 22
5 Chocolate Twix 70 80 8 10 75 50 55 20
I would like to import the data into R, except I would like to retain the headers in rows 1 & 2. An added challenge is that some headers are implied. If a header is blank, I would like it to use the cell in the column to the left. An example of what I'd like it to import as.
1 Category Brand USA_Dollars_CY2016 USA_Dollars_CY2017 USA_Volume_CY2016 USA_Volume_CY2017 China_Dollars_CY2016 China_Dollars_CY_2017 China_Volume_CY2016 China_Volume_CY2017
2 Chocolate Snickers 100 120 15 18 100 80 20 22
3 Chocolate Twix 70 80 8 10 75 50 55 20
My current method is to import, skipping rows 1 & 2 and then just rename the columns based on known position. However, I was hoping code existed to that would prevent me from this step. Thank you!!

I will assume that you have saved the xlsx data in .csv format, so it can be read in like this:
header <- read.csv("data.csv", header=F, colClasses="character", nrow=3)
dat <- read.csv("data.csv", header=F, skip=3)
The tricky part is the header. This function should do it:
construct_colnames <- function(header) {
f <- function(x) {
x <- as.character(x)
c("", x[!is.na(x) & x != ""])[cumsum(!is.na(x) & x != "") + 1]
}
res <- apply(header, 1, f)
res <- apply(res, 1, paste0, collapse="_")
sub("^_*", "", res)
}
colnames(dat) <- construct_colnames(header)
dat
Result:
Category Brand USA_Dollars_CY2016 USA_Dollars_CY2017 USA_Volume_CY2016 USA_Volume_CY2017 China_Dollars_CY2016
1 Chocolate Snickers 100 120 15 18 100
2 Chocolate Twix 70 80 8 10 75
China_Dollars_CY_2017 China_Volume_CY2016 China_Volume_CY2017
1 80 20 22
2 50 55 20

Plot a decision tree with R

I have a 440*2 matrix that looks like:
1 144
1 152
1 135
2 3
2 12
2 107
2 31
3 4
3 147
3 0
4 end
4 0
4 0
5 6
5 7
5 10
5 9
The left column are the starting points eg in the app all the 1's on the left would be on the same page. They lead to three choices, pages 144,152,135. These pages can each then lead to another page, and so on until the right hand column says 'end'. What I would like is a way to visualise the scale of this tree. I realise it will be quite large given the nb of rows so maybe not graph friendly, so for clarity I want to know how many possible routes there are in total (from every start point, down every option it gives and the end destinations of each. I realise there will be overlaps but thats why I am finding this hard to calculate).
secondly, each number has an associated title. I would like to have a function whereby if you input a given title it will plot all the possible starting points and their associated paths that will lead there. This should be a lot smaller and therefore graph friendly.
e.g.
dta <- "
14 12 as
186 187 Frac
187 154 Low
23 52 Med
52 11 Lip
15 55 asd
11 42 AAA
42 154 BBB
154 end Coll"
Edited example data to show that some branches are not connected to desired tree
dta <- "
14 12 as
186 187 Frac
187 154 Low
23 52 Med
52 11 Lip
11 42 AAA
42 154 BBB
154 end Coll"
dta <- gsub(" ", ",", dta, fixed = TRUE)
dta <- gsub(" ", ",", dta, fixed = TRUE)
df <- read.csv(textConnection(dta), stringsAsFactors = FALSE, header = FALSE)
names(df) <- c("from", "to", "nme")
library(data.tree)
Warning message:
package ‘data.tree’ was built under R version 3.2.5
tree <- FromDataFrameNetwork(df)
**Error in FromDataFrameNetwork(df) :**
**Cannot find root name. network is not a tree!**
I made this example to show how column 1 leads to a value in column 2 which then refers to a value in column 1 until you reach the end. Different starting points can ultimately lead to different length paths to same destination. so this would look sometigng like:
So here, I wanted to see how you could go from all start points to 'Coll'
greatly appreciate any help

If you have indeed a tree (e.g. no cycles), you can use data.tree:
Start by converting to a data.frame:
dta <- "
14 12 as
186 187 Frac
187 154 Low
23 52 Med
52 11 Lip
15 55 asd
11 42 AAA
42 154 BBB
154 end Coll
55 end efg
12 end hij"
dta <- gsub(" ", ",", dta, fixed = TRUE)
dta <- gsub(" ", ",", dta, fixed = TRUE)
df <- read.csv(textConnection(dta), stringsAsFactors = FALSE, header = FALSE)
names(df) <- c("from", "to", "nme")
Now, convert to a data.tree:
library(data.tree)
tree <- FromDataFrameNetwork(df)
tree$leafCount
You can now navigate to any sub-tree, for analysis and plotting. E.g. using any of the following possibilities:
subTree <- tree$FindNode(187)
subTree <- Climb(tree, nme = "Coll", nme = "Low")
subTree <- tree$`154`$`187`
subTree <- Clone(tree$`154`)
Maybe printing is all you need:
print(subTree , "nme")
This will print like so:
levelName nme
1 154 Coll
2 ¦--187 Low
3 ¦ °--186 Frac
4 °--42 BBB
5 °--11 AAA
6 °--52 Lip
7 °--23 Med
Otherwise, use fancy plotting:
SetNodeStyle(subTree , style = "filled,rounded", shape = "box", fontname = "helvetica", label = function(node) node$nme, tooltip = "name")
plot(subTree , direction = "descend")
This looks like this:

printing a list based on range met

I would like to generate an string output into a list if some values are met. I have a table that looks like this:
grp V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
1: 1 go.1 142 144 132 134 0 31 11 F D T hy al qe 34 6 3
2: 2 go.1 313 315 303 305 0 31 11 q z t hr ye er 29 20 41
3: 3 go.1 316 318 306 308 0 31 11 f w y hu er es 64 43 19
4: 4 go.1 319 321 309 311 0 31 11 r a y ie uu qr 26 22 20
5: 5 go.1 322 324 312 314 0 31 11 g w y hp yu re 44 7 0
I'm using this function to generate a desired output:
library(IRanges); library(data.table)
rangeFinder = function(x){
x.ir = reduce(IRanges(x$V2, x$V3))
max.idx = which.max(width(x.ir))
ans = data.table(out = x[1,1],
start = start(x.ir)[max.idx],
end = end(x.ir)[max.idx])
return(ans)}
rangeFinder(x.out)
out start end
1: 1 313 324
I would also like to generate a list with the letters (from column V9-V11) in the between the start and end output from rangeFinder.
For example, the output should look like this.
out
[[go.1]]
[1] "qztfwyraygwy"
rangeFinder is looking at values in column V2 and V3 and printing the longest match of numbers. Notice how "FDT" is not included in the list output even though rangeFinder produced an output from 313-324 (and not from 142-324). How can I get the desired output?

reduce has an argument with.revmap to add a "metadata" column (accessible with mcols()) to the object. This associates with each reduced range the indexes of the original range that map to the reduced range, as an IntegerList class, basically a list where all elements are guaranteed to be integer vectors. So these are the rows you're interested in
ir <- with(x, IRanges(V2, V3))
r <- reduce(ir, with.revmap=TRUE)
i <- unlist(mcols(r)[which.max(width(r)), "revmap"])
and the data character string can be munged with something like
j <- paste0("V", 9:11)
paste0(as.matrix(x[i, j, drop=FALSE]), collapse="")
It's better to ask your questions about IRanges on the Bioconductor mailing list; no subscription required.
with.revmap is a convenience argument added relatively recently; I think
h = findOverlaps(ir, r)
i = queryHits(h)[subjectHits(h) == which.max(width(r))]
is a replacement.

Count rows for selected column values and remove rows based on count in R

I am new to R and am trying to work on a data frame from a csv file (as seen from the code below). It has hospital data with 46 columns and 4706 rows (one of those columns being 'State'). I made a table showing counts of rows for each value in the State column. So in essence the table shows each state and the number of hospitals in that state. Now what I want to do is subset the data frame and create a new one without the entries for which the state has less than 20 hospitals.
How do I count the occurrences of values in the State column and then remove those that count up to less than 20? Maybe I am supposed to use the table() function, remove the undesired data and put that into a new data frame using something like lappy(), but I'm not sure due to my lack of experience in programming with R.
Any help will be much appreciated. I have seen other examples of removing rows that have certain column values in this site, but not one that does that based on the count of a particular column value.
> outcome <- read.csv("outcome-of-care-measures.csv", colClasses = "character")
> hospital_nos <- table(outcome$State)
> hospital_nos
AK AL AR AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY LA MA MD ME MI
17 98 77 77 341 72 32 8 6 180 132 1 19 109 30 179 124 118 96 114 68 45 37 134
MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA PR RI SC SD TN TX UT VA
133 108 83 54 112 36 90 26 65 40 28 185 170 126 59 175 51 12 63 48 116 370 42 87
VI VT WA WI WV WY
2 15 88 125 54 29

Here is one way to do it. Starting with the following data frame :
df <- data.frame(x=c(1:10), y=c("a","a","a","b","b","b","c","d","d","e"))
If you want to keep only the rows with more than 2 occurrences in df$y, you can do :
tab <- table(df$y)
df[df$y %in% names(tab)[tab>2],]
Which gives :
x y
1 1 a
2 2 a
3 3 a
4 4 b
5 5 b
6 6 b
And here is a one line solution with the plyr package :
ddply(df, "y", function(d) {if(nrow(d)>2) d else NULL})

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

I am trying to figure out how to parse a webpage - r

Related

writing out .dat file in r

Import in R with column headers across 3 rows. Replace missing with latest non-missing column

Plot a decision tree with R

printing a list based on range met

Count rows for selected column values and remove rows based on count in R

Categories

Resources