I am learning web scraping during these lockdown days. What does the second line of this R code mean?
tab <- h %>% html_nodes("table")
tab <- tab[[2]]
The first line returns a list of tables (there are a few tables in the source page). tab[[2]] is the second one.
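In R, `[[` extracts a single element from a list, whereas `[` returns a sub-list; `html_nodes()` returns a list-like collection of nodes, so `tab[[2]]` is the second node itself. A base R illustration of the difference:

```r
# html_nodes() returns a list-like object; the same indexing works on an ordinary list.
tables <- list("first table", "second table", "third table")

sub  <- tables[2]    # single bracket: a list of length 1 containing the second element
elem <- tables[[2]]  # double bracket: the second element itself

is.list(sub)  # TRUE
elem          # "second table"
```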
The text from a PDF I scraped is jumbled up across different elements. On top of that, data was deleted when it was converted to a data frame. It's really hard to tell where the text should have been split, since the code below looks correct to me. How do I split the text so that it looks like the original table?
library(pdftools)
library(dplyr)
library(tidyr)

mintz <- "https://www.mintz.com/sites/default/files/media/documents/2019-02-08/State%20Legislation%20on%20Biosimilars.pdf"
mintzText <- pdf_subset(mintz, pages = 2:23)
mintzText <- pdf_text(mintzText)
q <- data.frame(trimws(mintzText))

mintzdf <- q %>%
  rename(x = trimws.mintzText.) %>%
  mutate(x = strsplit(x, "\\n")) %>%
  unnest(x)

View(mintzdf)

mintzDF <- mintzdf[-c(1:2), ]
mintzDF <- mintzDF %>%
  separate(x, c("a", "State",
                "Substitution Requirements",
                "Pharmacy Notification Requirements (to prescriber, patient, or others)",
                "Recordkeeping Requirements")) %>%
  select(-a)

View(mintzdf)
[Screenshot: what it looks like]
[Screenshot: what it should look like]
The text order stored in a PDF page may be random, or even bottom-up, because the format imposes no ordering rules (PDF was designed around how a laser printer charges its drum, not around reading order).
We are lucky when the order can be sensibly extracted, and this happens to be a very well-ordered PDF. But remember there is no obligation to observe the visual grid: text is simply output row by row, with spaces that, with luck, form columns.
In this case, running poppler's pdftotext with no options, the single-page text order shows the first column headed State and the second starting with Substitution\nRequirements\n. There may be some head-scratching over why State is not spaced away from Alaska, but it is PDF, after all, so expect no rules.
It looks as though the text was written down one column, then across two, then perhaps down the last.
Depending on how much the pages vary, I would try to target vertical strips rather than horizontal rows: set a template of four full-page-height vertical zones, then hope the horizontal breaks can be matched up across them. The alternative (probably better) is to extract with a tabular layout; xpdf's pdftotext may give a better result.
Or use a Python table extractor such as pdfminer.
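For the well-ordered case, the vertical-strip idea can be sketched in base R: cut each extracted line at hand-picked character positions instead of letting separate() guess the breaks. The lines and the column positions below are invented for illustration only:

```r
# Two fake fixed-width lines standing in for pdf_text() output;
# the column widths (10, 10, rest) are made up for this sketch.
lines <- c(sprintf("%-10s%-10s%s", "Alaska",  "yes", "pharmacist notifies"),
           sprintf("%-10s%-10s%s", "Arizona", "yes", "prescriber notified"))

# Slice each line at fixed character positions, one vertical strip per column.
cols <- data.frame(
  State        = trimws(substr(lines, 1, 10)),
  Substitution = trimws(substr(lines, 11, 20)),
  Notification = trimws(substr(lines, 21, nchar(lines))),
  stringsAsFactors = FALSE
)
```

On real pdftotext output you would first inspect a few pages to choose the cut positions, and expect to adjust them per page layout.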
I'm currently working with a really large dataframe (~2M rows) about "landings" and "take-offs", with information such as the time the operation happened, the airport, where the flight was heading, and so on.
What I want to do is filter the whole DF into a new one that considers just "flights", i.e. about half the entries, matching each take-off with its corresponding landing based on the airport codes of the origin and destination airports.
What I did works, but given how large the DF is, it takes about 200 hours to complete:
Loop over all rows of df, checking for df$Operation == "takeoff" {
  Loop over all rows below the one found above, looking for df$Operation == "landing"
  with origin and destination airport codes matching the "takeoff" entry {
    Once found, add the data I need to the new df called Flights
  }
}
(If the inner loop does not find a match within the next 100 rows, it discards the entry and searches for the next "takeoff".)
Is there a function that performs this operation in a more efficient way? If not, do you know of an algorithm that could be much faster than the one I wrote?
I am really not used to data science, nor to R. Any help will be appreciated.
Thanks in advance!
In R we try to avoid using loops. For filtering a dataframe I would use the filter function in dplyr. dplyr is great and easy and fast for working with dataframes. If it's still not fast enough you can try data.table, but it's a bit less user friendly.
This does what you want I think.
library(dplyr)

flights <- df %>%
  arrange(datetime) %>%                                  # make sure the data is in the right order
  group_by(origin, destination) %>%                      # for each flight path
  dplyr::filter(Operation %in% c("takeoff", "landing"))  # get these rows
I recommend the online book R For Data Science:
https://r4ds.had.co.nz/
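The filter above keeps the relevant rows but does not itself pair each take-off with its landing. One way to sketch the pairing in base R, on toy data (the column names time, origin, destination, Operation are assumed from the question, not known for certain), is to split by route and match take-offs with landings in time order:

```r
# Toy stand-in for the 2M-row dataframe; column names are assumptions.
ops <- data.frame(
  time        = 1:6,
  origin      = c("JFK", "LAX", "JFK", "JFK", "LAX", "JFK"),
  destination = c("LAX", "SFO", "LAX", "LAX", "SFO", "LAX"),
  Operation   = c("takeoff", "takeoff", "landing", "takeoff", "landing", "landing"),
  stringsAsFactors = FALSE
)

# For one origin/destination pair: sort by time, then pair the i-th takeoff
# with the i-th landing.
pair_route <- function(d) {
  d  <- d[order(d$time), ]
  to <- d[d$Operation == "takeoff", ]
  ld <- d[d$Operation == "landing", ]
  n  <- min(nrow(to), nrow(ld))
  if (n == 0) return(NULL)
  data.frame(origin       = to$origin[seq_len(n)],
             destination  = to$destination[seq_len(n)],
             takeoff_time = to$time[seq_len(n)],
             landing_time = ld$time[seq_len(n)],
             stringsAsFactors = FALSE)
}

flights <- do.call(rbind, lapply(split(ops, paste(ops$origin, ops$destination)),
                                 pair_route))
```

This assumes take-offs and landings alternate cleanly on each route; real data with unmatched entries would still need something like the 100-row window logic from the question, but the split-by-route step alone removes the quadratic scan over all 2M rows.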
I'm currently in the data preparation phase of my master's thesis and I've run into a problem. I'm trying to scrape a website using a for loop and the rvest package in RStudio, based on a vector of ~90k IDs. How do I add a time interval to the for loop to prevent the HTTP "too many requests" error?
I've also got two minor side problems I need some help with:
If I only use a vector of 10 IDs to test the loop, the loop keeps overwriting the result so that observations 1-9 are overwritten and it only shows observation 10.
I can successfully scrape one element (first name) but the code fails when I try to scrape a second element and add it in another column.
I've been tinkering with the code for a while so I've already had many iterations of it and this is the only version that I got to work. For the overwriting of the observations I've tried adding [i] after 'playernames' but it produces the error 'new columns would leave holes after existing columns'.
# playerid is a vector of ~90k different IDs (it wouldn't work as a list).
# I successfully manage to scrape the first name of the player,
# although it keeps overwriting the observation.
for (i in playerid) {
  websiteX <- paste("http://www.X.com/id=", i, sep = "")
  websiteX <- read_html(websiteX)
  playernames <- data.frame(first = websiteX %>% html_node("dd:nth-child(2)") %>% html_text(),
                            stringsAsFactors = FALSE)
  playernames$playerid <- i
}
# I want to also scrape the surname using: sur =websiteX %>% html_node(websiteX, "dd:nth-child(4)") %>% html_text() ,
In short, what I want is a data frame of the 90k player IDs, followed by a column for the first name and a column for the second name.
What I'm getting is a "too many requests" error if I use the entire set of 90k IDs, even when only scraping the first name. If I only scrape a few names (say 10), it keeps overwriting the observation.
I have found no way to implement the scraping of the surname, nor do I understand how to add a time interval to prevent the overload error.
# Changing
playernames <- data.frame(...)
# to
playernames[i] <- data.frame(...)
# produces the "new columns would leave holes after existing columns" error.
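A common pattern that addresses all three issues at once is to build one row per ID in a pre-allocated list, pause between requests with Sys.sleep(), and bind the rows at the end. The sketch below uses a stand-in fake_scrape() function instead of a live read_html() call so it runs anywhere; the comments show where the rvest calls from the question would go:

```r
ids <- c(101, 102, 103)  # stand-in for the ~90k playerid vector

# Stand-in for the real scrape. With rvest the body would be roughly:
#   page  <- read_html(paste0("http://www.X.com/id=", id))
#   first <- page %>% html_node("dd:nth-child(2)") %>% html_text()
#   sur   <- page %>% html_node("dd:nth-child(4)") %>% html_text()
fake_scrape <- function(id) {
  data.frame(playerid = id,
             first    = paste0("First", id),
             sur      = paste0("Sur", id),
             stringsAsFactors = FALSE)
}

results <- vector("list", length(ids))  # pre-allocate: nothing gets overwritten
for (j in seq_along(ids)) {
  results[[j]] <- fake_scrape(ids[j])
  # Sys.sleep(1)  # in the real loop: pause ~1s between requests
}

playernames <- do.call(rbind, results)  # one data frame: playerid, first, sur
```

Storing a complete one-row data frame per iteration sidesteps the "holes after existing columns" error, and a second column (the surname) is just one more field in that row.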
I'm having trouble understanding how to create a data.tree from a data frame. I have a data frame with two columns:
EmpID
SupervisorEmpID
Code:
library(readr)
library(data.tree)

OfficeOrg <- read_csv("hierarchy")
OfficeOrg$pathString <- paste("Root",
                              OfficeOrg$SupervisorEmpID,
                              OfficeOrg$EmpID,
                              sep = "/")
RptTree <- as.Node(OfficeOrg)
The sample data has 25 rows. By inspecting the data, I can see that there are five levels. That is to say, I expect the RptTree object to show EmpIDs grouped under SupervisorEmpID to a depth of five.
Root
 |_TopLevelSupervisor
   |_SecondLevelSupervisor
     |_ThirdLevelSupervisor
       |_Employee
Instead, I see only three levels: the root, a node for each SupervisorEmpID, and the employees.
Root
 |_Supervisor
   |_Employee
The tree isn't being built by recursing through all levels.
Usually this means that I'm staring something in the face, but not recognizing it.
What am I missing?
After searching off and on for several days, I found the solution to my problem in this Stack Overflow post:
data.tree nodes through Id's
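The underlying issue is that the pathString built above only encodes two levels (the immediate supervisor and the employee), so as.Node() can never produce more than a three-level tree. A base R sketch of the fix, on toy IDs invented for illustration, walks each employee's supervisor chain up to the top before building the path:

```r
# Toy hierarchy: A supervises B, B supervises C, C supervises D and E.
# A has no row of its own, so A is the top of the chain.
h <- data.frame(EmpID           = c("B", "C", "D", "E"),
                SupervisorEmpID = c("A", "B", "C", "C"),
                stringsAsFactors = FALSE)

# Walk the supervisor chain upwards until we fall off the top.
full_path <- function(id, h) {
  path <- id
  repeat {
    sup <- h$SupervisorEmpID[match(id, h$EmpID)]
    if (is.na(sup)) break
    path <- c(sup, path)
    id   <- sup
  }
  paste(c("Root", path), collapse = "/")
}

h$pathString <- vapply(h$EmpID, full_path, character(1), h = h)
# With pathString built this way, as.Node(h) recovers every level of the tree.
```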
I'm trying to build a dataframe that will help me paginate some simple web scraping. What is the best way to build a dataframe where each row uses the same base URL string but varies a few specific characters, which can be specified according to the pagination one needs.
Let's say you have a set of search results where there is a total of 4485 results, 10 per page, spread out over 449 pages. All I want for the moment is to make a dataframe with one variable where each row is a character string of the URL with a variable, sequenced page number along the lines of:
Var1
http://begin.com/start=0/index.html
http://begin.com/start=10/index.html
http://begin.com/start=20/index.html
http://begin.com/start=30/index.html
...
http://begin.com/start=4480/index.html
Here's my original strategy but this fails (and yea it's inefficient and newbish).
startstring <- "http://begin.com/start="
variableterm <- seq(from = 0, to = 4485, by = 10)
endstring <- "/index.html"

df <- as.data.frame(matrix(nrow = 449, ncol = 1))
for (x in 1:length(variableterm)) {
  for (i in variableterm) {
    df[x, ] <- paste(startstring, i, endstring, sep = "")
  }
}
But every single row is equal to http://begin.com/start=4480/index.html. How can I change this so that each row gives the same URL but with a different number increasing like in the desired dataframe above?
I would very much appreciate how to achieve this with my strategy (just to learn) but of course better approaches are welcome also. Thanks!
I am not sure why you would need this to be in a data frame. Here is one way to create a vector of page urls.
sprintf("http://begin.com/start=%s/index.html", seq(0, 4490, 10))
The reason every row ends up with the same value (the last URL) is that you have two loops where only one is needed: the outer loop walks the rows of the data frame, while the inner loop runs through the entire set of URLs each time, so the last URL is what remains in the row before the outer loop moves on to the next one.
This should work as you would expect:
for (i in 1:length(variableterm)) {
  df[i, ] <- paste(startstring, variableterm[i], endstring, sep = "")
}
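For completeness: since paste() and paste0() are vectorized over their arguments, no loop is needed at all. The whole column can be built in one call:

```r
# paste0() recycles the fixed prefix/suffix across the whole numeric sequence.
urls <- paste0("http://begin.com/start=", seq(from = 0, to = 4480, by = 10), "/index.html")
df <- data.frame(Var1 = urls, stringsAsFactors = FALSE)

nrow(df)    # 449
df$Var1[1]  # "http://begin.com/start=0/index.html"
```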