Web Scraping with R in ATPWORLDTOUR

I'm trying to scrape whether the player is right-handed or left-handed from this page (http://www.atpworldtour.com/en/players/novak-djokovic/d643/fedex-atp-win-loss). I used the following code to scrape this info (1603.html is the saved page):
library(XML)
y <- htmlParse('1603.html')
x <- xpathApply(y, "//div[@class='player-profile-hero-table']")
sapply(x, xmlValue)
The code returns me the following:
"Age\r\n\t\t\t\t\t\t\t\t\r\n28\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t(1987.05.22)\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\r\n\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\tTurned Pro\r\n\t\t\t\t\t\t\t\t\r\n2003\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\r\n\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\tWeight\r\n\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t172lbs(78kg)\r\n\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\r\n\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\tHeight\r\n\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t6'2\"(188cm)\r\n\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\r\n\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\r\n\t\t\r\n\t\t\tBirthplace\r\n\t\t\r\n\t\t\r\n\t\t\tBelgrade, Serbia\r\n\t\t\r\n\t\r\n\r\n\t\t\t\t\t\t\r\n\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\tResidence\r\n\t\t\t\t\t\t\t\tMonte-Carlo, Monaco\r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\r\n\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\r\n\t\t\r\n\t\t\tPlays\r\n\t\t\r\n\t\t\r\nRight-Handed, Two-Handed Backhand\t\t\r\n\t\r\n\r\n\t\t\t\t\t\t\r\n\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\r\n\t\tCoach\r\n\t\t\r\n\t\t\tBoris Becker, Marian Vajda"
What can I do to remove all these stray \t and \r sequences in the middle of the result? To find out whether the player is right-handed or left-handed, I think x should be defined as x <- xpathApply(y, "//table[@width='570']"). What should I do?

One solution is to use the wonderful readHTMLTable() to get all the tables from the page, then select the correct table and cell for the information.
This function takes the URL and returns "R" for right-handed or "L" for left-handed. It does that by selecting the cell in the second row, third column of the first table, then using substr() to grab the 16th character. You can adapt it to whatever you like.
library(XML)
scrapey <- function(URL) {
  # the 16th character of cell [2,3] in the first table is the handedness letter
  x <- readHTMLTable(URL, header = FALSE, stringsAsFactors = FALSE)
  substr(x[[1]][2, 3], 16, 16)
}
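If you would rather stay with the xpathApply() approach from the question, note that the stray "t"s and "r"s are not letters at all but tab (\t) and carriage-return (\r) escapes, so gsub() can strip them. A minimal sketch, continuing from the x defined above:
txt <- sapply(x, xmlValue)
clean <- gsub("[\r\n\t]+", " ", txt)   # replace runs of whitespace escapes with a space
clean <- gsub(" {2,}", " ", clean)     # collapse leftover double spaces
grepl("Right-Handed", clean)           # TRUE if the profile lists Right-Handed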

Related

read in csv file in R and make a list out of last column

Content of my.csv
project names,task names
Build Finances,Calculate Earnings
Build Roads,Calculate Equipment Costs
Buy Food, Calculate Grocery Costs
The code I'm using to read /tmp/my.csv into a variable/vector is:
taskNamesAndprojectNames <- read.csv("/tmp/my.csv", header=TRUE)
What I want to do is grab the last column of the my.csv file, which has been read into the taskNamesAndprojectNames variable, and then make a list out of it.
So, something like this:
#!/usr/bin/Rscript
taskNamesAndprojectNames <- read.csv("/tmp/my.csv", header=TRUE)
#str(tasklists)
#tasklists
#tasklists[,ncol(tasklists)]
taskNames <- list(taskNamesAndprojectNames[,-1])
typeof(taskNames)
length(taskNames)
The problem with the above code is that when I run length() on the taskNames variable to confirm it has the correct number of elements, I only get a response of 1, which is not accurate.
[roywell#test01 data]$ ~/readincsv.r
[1] "list"
[1] 1
What am I doing wrong here? Can someone help me correct this code? What I want to do is grab the last column of an Excel CSV sheet, put the values from that last column into a variable, and make a list out of it. Then I want to iterate through the list to confirm that the value provided by a user matches at least one of the elements in the list.
taskNames <- list(taskNamesAndprojectNames[,-1]) makes a list with one element that is a character vector of length 3.
It sounds like you are looking for a vector in this case:
taskNames <- taskNamesAndprojectNames[,-1]
typeof(taskNames)
[1] "character"
length(taskNames)
[1] 3
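For the final step (confirming that a user's value matches at least one element), %in% does the membership test without an explicit loop. A small sketch, with a hypothetical userInput value:
userInput <- "Calculate Earnings"   # hypothetical user-supplied value
userInput %in% taskNames            # TRUE if it matches any element of the vector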

Two PASTE functions in a character vector

attach.files = c(paste("/users/joesmith/nosection_", currentDate, ".csv", sep=""),
                 paste("/users/joesmith/withsection_", currentDate, ".csv", sep=""))
I'm trying to attach files to an automated email, but when I structure the vector like this, it doesn't work. If I wrote the names out manually, like
c("nosection_051418.csv", "withsection_051418.csv")
it would work fine, but since I'm automating this to run every day I can't do that. How can I recreate this so that the character vector is accepted?
I thought your example implied the need for "parallel" inputs for the path stem, the first portion of the file name, and the date portion of those full paths. Consider this illustration, which uses a two-item vector and a one-item vector (produced by Sys.Date(), replacing your currentDate) to populate the %s positions in the sprintf string (suggested by @Gregor):
sprintf("/users/joesmith/%s_%s.csv", c("nosection", "withsection"), Sys.Date() )
[1] "/users/joesmith/nosection_2018-05-14.csv" "/users/joesmith/withsection_2018-05-14.csv"

How to pass vector elements as individual arguments to a function in R

I am working on a web scraping project using rvest.
html_text(html_nodes(url, CSS))
extracts data from url wherever the matching CSS selector is found. My problem is that the website I am scraping uses a unique CSS ID for each listed product (such as ListItem_001_Price). So one selector matches exactly one item's price, and automated web scraping doesn't work.
I can create a vector
V <- c("ListItem_001_Price", "ListItem_002_Price", "ListItem_003_Price")
for all the products' CSS IDs manually. Is it possible to pass its individual elements to the html_nodes() function in one go and so collect the resulting data back as a single vector/data frame?
How to make it work?
You can try using lapply here:
V <- c("ListItem_001_Price", "ListItem_002_Price", "ListItem_003_Price")
results <- lapply(V, function(x) html_text(html_nodes(url, x)))
I assume here that your nested call to html_text will in general return a character vector of the text corresponding to the matching nodes, for each item in V. This would leave you with a list of vectors which you can then access.
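If each ID matches exactly one node, you can then flatten the result into a plain character vector:
prices <- unlist(results)   # one element per ID in V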
html_nodes() needs the initial "." to find your tags by css-class. You could manually create
V <- c(".ListItem_001_Price", ".ListItem_002_Price", ".ListItem_003_Price")
as you suggest, but I recommend using a regex such as 'ListItem_([0-9]{3})_Price' to match the classes, so you can avoid the manual labour. Make sure you run the regex on the actual string of your markup, not on the html-node object (see below).
In R, apply(), lapply(), sapply() and the like work much like a short loop. They apply a function to every member of a data structure that holds numerous values, such as lists, data frames, matrices or vectors.
In your case it's a vector, and a way to begin understanding how they work is to think of it like:
sapply(vector, function(x) THING-TO-DO-WITH-ITEM-IN-VECTOR)
In your case, the thing to do with each item in the vector is fetching the html_text corresponding to that item.
See the code below for an example:
library(rvest)
# An example piece of html
example_markup <- "<ul>
<li class=\"ListItem_041_Price\">Brush</li>
<li class=\"ListItem_031_Price\">Phone</li>
<li class=\"ListItem_002_Price\">Paper clip</li>
<li class=\"ListItem_012_Price\">Bucket</li>
</ul>"
html <- read_html(example_markup)
# Avoid manual creation of css with regex
regex <- 'ListItem_([0-9]{3})_Price'
# Note that ([0-9]{3}) will match three consecutive numeric characters
price_classes <- regmatches(example_markup, gregexpr(regex, example_markup))[[1]]
# Paste leading "." so that html_nodes() can find the class:
price_classes <- paste(".", price_classes, sep="")
# A single entry is found like so:
html %>% html_nodes(".ListItem_031_Price") %>% html_text()
# Use sapply to get a named character vector of your products
# Note how ".ListItem_031_Price" from the line above is replaced by x
# which will be each item of price_classes in turn.
products <- sapply(price_classes, function(x) html %>% html_nodes(x) %>% html_text())
The result in products is a named character vector. Use unname(products) to drop the names.

R Loop error using character

I have the below function which inserts a row into a table (new_scores) based upon the attribute that I feed into it (where the attribute represents a table that I select things from):
library(dplyr)  # for bind_rows() and select()
buildNewScore <- function(x) {
  something <- bind_rows(new_scores, x %>% select(ATT, ADJLOGSCORE))
  return(something)
}
This works fine when I define x directly.
But when I try to create a for loop that feeds the rest of my attributes into the function, it falls over because I'm feeding in a character string.
attlist <- c('Z','Y','X','W','V','U','T','RT','RO')
record_count <- length(attlist)
for (x in c(1:record_count)) {
  buildNewScore(attlist[x])
}
I've tried to convert the attribute into other classes but I can't get the loop to use anything I change it to (name, data.frame etc.).
Anyone have any ideas as to where I'm going wrong - is my attlist vector in the wrong format?
Thanks,
Spikelete.
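
One likely fix, sketched under two assumptions: each string in attlist names a data frame that exists in the global environment, and the accumulated table should replace new_scores on each pass. get() converts the character string into the object it names, and the return value has to be captured, because buildNewScore() returns a new table rather than modifying new_scores in place:
for (x in attlist) {
  # get() looks up the data frame whose name is stored in the string x
  new_scores <- buildNewScore(get(x))
}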

xpath node determination

I'm all new to scraping and I'm trying to understand xpath using R. My objective is to create a vector of people from this website. I'm able to do it using:
library(XML)
library(plyr)  # for ldply()
r <- htmlTreeParse(e)  ## e is after getURL
g.k <- r[[3]][[1]][[2]][[3]][[2]][[2]][[2]][[1]][[4]]
l <- g.k[names(g.k) == "text"]
u <- ldply(l, function(x) {
  w <- xmlValue(x)
  return(w)
})
However, this is cumbersome and I'd prefer to use xpath. How do I go about referencing the path detailed above? Is there a function for this, or can I submit my path somehow referenced as above?
I've come to
xpathApply( htmlTreeParse(e, useInt=T), "//body//text//div//div//p//text()", function(k) xmlValue(k))->kk
But this leaves me a lot of cleaning up to do and I assume it can be done better.
Regards,
//M
EDIT: Sorry for being unclear, but I'm all new to this and rather confused. The XML document is unfortunately too large to paste. I guess my question is whether there is some easy way to find the names of these nodes and the structure of the document, besides using View Source? I've come a little closer to what I'd like:
getNodeSet(htmlTreeParse(e, useInt=T), "//p")[[5]]->e2
gives me the list of what I want, though still in XML with <br> tags. I thought running
xpathApply(e2, "//text()", function(k) xmlValue(k))->kk
would provide a list that could later be unlisted; however, it provides a list with more garbage than e2 displays.
Is there a way to do this directly:
xpathApply(htmlTreeParse(e, useInt=T), "//p[5]//text()", function(k) xmlValue(k))->kk
Link to the web page below; I'm trying to get the names, and only the names, from the page.
getURL("http://legeforeningen.no/id/1712")
I ended up with
xml = htmlTreeParse("http://legeforeningen.no/id/1712", useInternalNodes=TRUE)
(no need for RCurl) and then
sub(",.*$", "", unlist(xpathApply(xml, "//p[4]/text()", xmlValue)))
(subset in xpath) which leaves a final line that is not a name. One could do the text processing in XML, too, but then one would iterate at the R level.
n <- xpathApply(xml, "count(//p[4]/text())") - 1L
sapply(seq_len(n), function(i) {
  xpathApply(xml, sprintf('substring-before(//p[4]/text()[%d], ",")', i))
})
Unfortunately, this XPath approach does not pick up names that do not contain a comma (the sub()-based version above returns comma-less strings unchanged, so it does not share this problem).
Use a mixture of xpath and string manipulation.
#Retrieve and parse the page.
library(XML)
library(RCurl)
page <- getURL("http://legeforeningen.no/id/1712")
parsed <- htmlTreeParse(page, useInternalNodes = TRUE)
Inspecting the parsed variable, which contains the page's source, tells us that instead of sensibly using a list tag (like <ul>), the author just put a paragraph (<p>) of text split with line breaks (<br />). We use xpath to retrieve the <p> elements.
#Inspection tells us we want the fifth paragraph.
name_nodes <- xpathApply(parsed, "//p")[[5]]
Now we convert to character, split on the <br> tags and remove empty lines.
all_names <- as(name_nodes, "character")
all_names <- gsub("</?p>", "", all_names)
all_names <- strsplit(all_names, "<br />")[[1]]
all_names <- all_names[nzchar(all_names)]
all_names
Optionally, separate the names of people and their locations.
strsplit(all_names, ", ")
Or more prettily with stringr.
library(stringr)
str_split_fixed(all_names, ", ", 2)
