xpathSApply skip if text equals postseason - r

I'm running into a roadblock here and I can't figure out what I'm doing wrong. I need to skip over a link when its text equals "postseason". The text is in the second li in the XPaths in my code below.
I tried li[not(.,"postseason")], as I thought that was what I needed to exclude the postseason link, but it doesn't work.
This link will show you an example of what I want to exclude, under Standard Batting > Game Logs > Postseason:
http://www.baseball-reference.com/players/j/jeterde01.shtml
Place this URL in playerURLs and you should see the postseason link returned. How can I skip over the postseason link? Thanks!
#GET YEARS PLAYED LINKS
yplist <- NULL
playerURLs <- paste("http://www.baseball-reference.com", datafile17[, c("hrefs")], sep = "")
for (thisplayerURL in playerURLs) {
  doc <- htmlParse(thisplayerURL)
  yplinks <- data.frame(
    names = xpathSApply(doc, '//*[@id="all_standard_batting"]/div/ul/li[2]/ul/li/a', xmlValue),
    hrefs = xpathSApply(doc, '//*[@id="all_standard_batting"]/div/ul/li[2]/ul/li/a', xmlGetAttr, 'href'))
  yplist <- rbind(yplist, yplinks)
}

I'm not familiar with the R language specifically, but from an XPath point of view you can use a . != "..." or not(contains(., "...")) predicate to exclude an element with a specific inner text value.
The following will exclude an <li> whose inner text exactly equals "postseason":
li[. != "postseason"]
This one will exclude an <li> whose inner text contains "postseason":
li[not(contains(., "postseason"))]
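Applied to the R code in the question, the predicate goes on the second-level li. A minimal sketch (untested; note that XPath 1.0 string functions are case-sensitive, so match the page's exact link text, which appears to be capitalized as "Postseason" on baseball-reference):
names = xpathSApply(doc, '//*[@id="all_standard_batting"]/div/ul/li[2]/ul/li[not(contains(., "Postseason"))]/a', xmlValue)
hrefs = xpathSApply(doc, '//*[@id="all_standard_batting"]/div/ul/li[2]/ul/li[not(contains(., "Postseason"))]/a', xmlGetAttr, 'href')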

Related

In R: Search all emails by subject line, pull comma-separate values from body, then save values in a dataframe

Each day, I get an email with the quantities of fruit sold on a particular day. The structure of the email is as below:
Date of report:,04-JAN-2022
Time report produced:,5-JAN-2022 02:04
Apples,6
Pears,1
Lemons,4
Oranges,2
Grapes,7
Grapefruit,2
I'm trying to build some code in R that will search through my emails, find all emails with a particular subject, iterate through each email to find the variables I'm looking for, and place the values in a dataframe, with the "Date of report" in a date column.
With the assistance of people in the community, I was able to achieve the desired result in Python. However as my project has developed, I need to now achieve the same result in R if at all possible.
Unfortunately, I'm quite new to R and therefore if anyone has any advice on how to take this forward I would greatly appreciate it.
For those interested, my Python code is below:
import win32com.client  # assumed import; the original snippet relied on it implicitly
import pandas as pd     # assumed import; the original snippet relied on it implicitly

#PREP THE STUFF
Fruit_1 = "Apples"
Fruit_2 = "Pears"
searchf = [
    Fruit_1,
    Fruit_2
]

#DEF THE STUFF
def get_report_vals(report, searches):
    dct = {}
    for line in report:
        term, *value = line
        if term.casefold().startswith('date'):
            dct['date'] = pd.to_datetime(value[0])
        elif term in searches:
            dct[term] = float(value[0])
    if len(dct.keys()) != len(searches):
        dct.update({x: None for x in searches if x not in dct})
    return dct

#DO THE STUFF
outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
inbox = outlook.GetDefaultFolder(6)
messages = inbox.Items
messages.Sort("[ReceivedTime]", True)
results = []
for message in messages:
    if message.subject == 'FRUIT QUANTITIES':
        if Fruit_1 in message.body and Fruit_2 in message.body:
            data = [line.strip().split(",") for line in message.body.split('\n')]
            results.append(get_report_vals(data, searchf))
    else:
        pass
fruit_vals = pd.DataFrame(results)
fruit_vals.columns = map(str.upper, fruit_vals.columns)
I'm probably going about this the wrong way, but I'm trying to use the steps I took in Python to achieve the same result in R. So for example I create some variables to hold the fruit sales I'm searching for, then I create a vector to store the searchables, and then when I create an equivalent 'get_vals' function, I create an empty vector.
library(RDCOMClient)
Fruit_1 <- "Apples"
Fruit_2 <- "Pears"
## Create vector to store searchables
searchf <- c(Fruit_1, Fruit_2)
## Create object for Outlook
OutApp <- COMCreate("Outlook.Application")
outlookNameSpace <- OutApp$GetNameSpace("MAPI")
search <- OutApp$AdvancedSearch("Inbox", "urn:schemas:httpmail:subject = 'FRUIT QUANTITIES'")
inbox <- outlookNameSpace$Folders(6)$Folders("Inbox")
vec <- c()
for (x in emails)
{
  subject <- emails(i)$Subject(1)
  if (grepl(search, subject)[1])
  {
    text <- emails(i)$Body()
    print(text)
    break
  }
}
read.table could be a good start for get_report_vals.
The code below outputs the result as a list; exception handling still needs to be implemented:
report <- "
Date of report:,04-JAN-2022
Apples,6
Pears,1
Lemons,4
Oranges,2
Grapes,7
Grapefruit,2
"
get_report_vals <- function(report,searches) {
data <- read.table(text=report,sep=",")
colnames(data) <- c('key','value')
# find date
date <- data[grepl("date",data$key,ignore.case=T),"value"]
# transform dataframe to list
lst <- split(data$value,data$key)
# output result as list
c(list(date=date),lst[searches])
}
get_report_vals(report,c('Lemons','Oranges'))
$date
[1] "04-JAN-2022"
$Lemons
[1] "4"
$Oranges
[1] "2"
The results of various reports can then be concatenated in a data.frame using rbind:
rbind(get_report_vals(report,c('Lemons','Oranges')),get_report_vals(report,c('Lemons','Oranges')))
date Lemons Oranges
[1,] "04-JAN-2022" "4" "2"
[2,] "04-JAN-2022" "4" "2"
The code now functions as intended. The function was written quite differently from those recommended:
library(stringr)  # str_extract
library(dplyr)    # mutate, %>%
library(janitor)  # row_to_names

get_vals <- function(email) {
  body <- email$body()
  date <- str_extract(body, "\\d{2}-[:alpha:]{3}-\\d{4}") %>%
    as.character()
  data <- read.table(text = body, sep = ",", skip = 9, strip.white = T) %>%
    row_to_names(1) %>%
    mutate("Date" = date)
  return(data)
}
In addition I've written this to bind the rows together:
info <- sapply(results, get_vals, simplify = F) %>%
bind_rows()
This may not be what you are expecting to get as an answer, but I must state it here to help other readers avoid such mistakes in the future.
Unfortunately, your Python code is not well-written. For example, I've noticed the following code, where you iterate over all items in a folder and check the subject and message bodies for keywords:
for message in messages:
    if message.subject == 'FRUIT QUANTITIES':
        if Fruit_1 in message.body and Fruit_2 in message.body:
You need to use the Find/FindNext or Restrict methods of the Items class instead (a short R sketch using Restrict appears at the end of this answer), so you don't need to iterate over all items in a folder; you get only the items that correspond to your conditions. Read more about these methods in the following articles:
How To: Use Find and FindNext methods to retrieve Outlook mail items from a folder (C#, VB.NET)
How To: Use Restrict method to retrieve Outlook mail items from a folder
You may combine all your search criteria into a single query. So, you just need to iterate over found items and extract the data.
Also you may find the AdvancedSearch method helpful. The key benefits of using the AdvancedSearch method in Outlook are:
- The search is performed in another thread. You don't need to run another thread manually, since the AdvancedSearch method runs it automatically in the background.
- The possibility to search for any item type: mail, appointment, calendar, notes, etc., in any location, i.e. beyond the scope of a certain folder. The Restrict and Find/FindNext methods can be applied only to a particular Items collection (see the Items property of the Folder class in Outlook).
- Full support for DASL queries (custom properties can be used for searching too). You can read more about this in the Filtering article on MSDN. To improve search performance, Instant Search keywords can be used if Instant Search is enabled for the store (see the IsInstantSearchEnabled property of the Store class).
- You can stop the search process at any moment using the Stop method of the Search class.
See Advanced search in Outlook programmatically: C#, VB.NET for more information.
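For the R side of this question, a minimal sketch of the Restrict approach via RDCOMClient (untested; the COM calls mirror the Python code above and the property-call syntax used elsewhere in this thread, and the Jet filter string is an assumption):
library(RDCOMClient)

OutApp <- COMCreate("Outlook.Application")
ns <- OutApp$GetNameSpace("MAPI")
inbox <- ns$GetDefaultFolder(6)  # 6 = olFolderInbox
# Restrict returns only the matching items, so there is no need to loop over the whole folder
filtered <- inbox$Items()$Restrict("[Subject] = 'FRUIT QUANTITIES'")
for (i in seq_len(filtered$Count())) {
  body <- filtered$Item(i)$Body()
  # pass body to get_report_vals() from the earlier answer
}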

How to scrape options from dropdown list and store them in table?

I am trying to make an interactive dashboard with analysis based on car data. I would like the user to be able to pick a car brand, for example BMW, Audi, etc., and based on this choice be able to pick only BMW/Audi etc. models. My problem: after selecting a brand, I am not able to scrape the models that belong to it. The pages I am scraping:
main page --> https://www.otomoto.pl/osobowe/
sub car brand page example --> https://www.otomoto.pl/osobowe/audi/
I have tried to scrape every option, so that later on I can maybe clean the data to keep only the models.
Code:
library(rvest)  # read_html, html_nodes, html_text

otomoto_models <- paste0("https://www.otomoto.pl/osobowe/", "audi/")
models <- read_html(otomoto_models) %>%
  html_nodes("option") %>%
  html_text()
But it just scrapes the brands and the other options available on the page (engine type etc.), while after inspecting the element I can clearly see the model names.
otomoto <- "https://www.otomoto.pl/osobowe/"
brands <- read_html(otomoto) %>%
html_nodes("option") %>%
html_text()
brands <- data.frame(brands)
for (i in 1:nrow(brands)){
no_marka_pojazdu <- i
if(brands[i,1] == "Marka pojazdu"){
break
}
}
no_marka_pojazdu <- no_marka_pojazdu + 1
for (i in 1:nrow(brands)){
zuk <- i
if(substr(brands[i,1],1,3) == "Żuk"){
break
}
}
Modele_pojazdow <- as.character(brands[no_marka_pojazdu:zuk,1])
Modele_pojazdow <- removeNumbers(Modele_pojazdow)
Modele_pojazdow <- substr(Modele_pojazdow,1,nchar(Modele_pojazdow)-2)
Modele_pojazdow <- data.frame(Modele_pojazdow)
The code above only picks the supported car brands from the webpage and stores them in a data frame. With that I am able to build the HTML link and direct everything to one selected brand.
I would like to have an object similar to "Modele_pojazdow", but with the models limited to the previously selected car brand.
The dropdown list with models appears as a white box with the text "Model pojazdu", next to the "Audi" box on the right side.
Some may frown on the solution language being Python, but the aim here was to give some pointers (a high-level process). I haven't written R in a long time, so Python was quicker.
EDIT: R script now added.
General outline:

1. The first dropdown options can be grabbed from the value attribute of each node returned by the CSS selector #param571 option. This uses an id selector (#) to target the parent dropdown select element, then an option type selector in descendant combination to specify the option tag elements within. The HTML to apply this selector combination to can be retrieved with an XHR request to the URL you initially provided. You want a nodeList returned to iterate over, akin to applying the selector with js document.querySelectorAll.
2. The page uses ajax POST requests to update the second dropdown based on your first dropdown choice. Your first dropdown choice determines the value of the parameter search[filter_enum_make], which is used in the POST request to the server. The response contains a list of the available options (it includes some case alternatives which can be trimmed out). An R sketch of this request follows the outline.
3. I captured the POST request using Fiddler. This showed me the request headers and the params in the request body. A screenshot sample is shown at the end.
4. The simplest way to extract the options from the response text, IMO, is to regex the appropriate string out (I wouldn't normally recommend regex for working with HTML, but in this case it serves us nicely). If you don't want to use regex, you can grab the relevant info from the data-facets attribute of the element with id body-container. For the non-regex version you need to handle unquoted nulls and retrieve the inner dictionary whose key is filter_enum_model. I show a function re-write at the end to handle this.
5. The retrieved string is a string representation of a dictionary. This needs converting to an actual dictionary object, from which you can then extract the option values. Edit: as R doesn't have a dictionary object, a similar structure needs to be found. I will look at this when converting.
6. I create a user-defined function, getOptions(), to return the options for each make. Each car make value comes from the list of possible items in the first dropdown. I loop over those possible values, use the function to return a list of options for that make, and add those lists as values to a dictionary, results, whose keys are the makes of car. Again, for R an object with functionality similar to a Python dictionary is needed.
7. That dictionary of lists needs converting to a dataframe, which includes a transpose operation so the output is tidy: the headers are the car makes and the columns underneath each header contain the associated models.
8. The whole thing can be written to csv at the end.

So, hopefully that gives you an idea of one way to achieve what you want. Perhaps someone else can use this to help write you a solution.
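Before the Python demonstration, a hedged R sketch of the POST request from step 2 (untested; the form field names and the regex are taken from the Python code below, and the case-duplicate trimming is omitted):
library(httr)
library(stringr)
library(jsonlite)

getOptions <- function(make) {
  r <- POST("https://www.otomoto.pl/ajax/search/list/",
            body = list(`search[filter_enum_make]` = make,
                        `search[dist]` = "5",
                        `search[category_id]` = "29"),
            encode = "form")
  txt <- content(r, "text", encoding = "UTF-8")
  # extract the filter_enum_model dictionary, as in the Python version
  m <- str_match(txt, regex('"filter_enum_model":(.*?),"new_used"', dotall = TRUE))[, 2]
  if (is.na(m)) return(character(0))  # no models for this make
  names(fromJSON(m))                  # model names; case duplicates still to be trimmed
}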
Python demonstration of this below:
import requests
from bs4 import BeautifulSoup as bs
import re
import ast
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
}

def getOptions(make):  # function to return options based on make
    data = {
        'search[filter_enum_make]': make,
        'search[dist]': '5',
        'search[category_id]': '29'
    }
    r = requests.post('https://www.otomoto.pl/ajax/search/list/', data=data, headers=headers)
    try:
        # verify the regex here: https://regex101.com/r/emvqXs/1
        data = re.search(r'"filter_enum_model":(.*),"new_used"', r.text, flags=re.DOTALL).group(1)  # regex to extract the string containing the models associated with the car make filter
        aDict = ast.literal_eval(data)  # convert string representation of dictionary to python dictionary
        d = len({k.lower(): v for k, v in aDict.items()}.keys())  # find length of unique keys when accounting for case
        dirtyList = list(aDict)[:d]  # trim to unique values
        cleanedList = [item for item in dirtyList if item != 'other']  # remove 'other' as it doesn't appear in dropdown
    except:
        cleanedList = []  # sometimes there are no associated values in 2nd dropdown
    return cleanedList

r = requests.get('https://www.otomoto.pl/osobowe/')
soup = bs(r.content, 'lxml')
values = [item['value'] for item in soup.select('#param571 option') if item['value'] != '']
results = {}

# build a dictionary of lists to hold options for each make
for value in values:
    results[value] = getOptions(value)  # function call to return options based on make

# turn into a dataframe and transpose so each column header is the make and the options are listed below
df = pd.DataFrame.from_dict(results, orient='index').transpose()

# write to csv
df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8-sig', index=False)
Sample of csv output: (screenshot omitted)
Example of sample JSON for alfa-romeo: (screenshot omitted)
Example of regex match for alfa-romeo:
{"145":1,"146":1,"147":218,"155":1,"156":118,"159":559,"164":2,"166":39,"33":1,"Alfasud":2,"Brera":34,"Crosswagon":2,"GT":89,"GTV":7,"Giulia":251,"Giulietta":378,"Mito":224,"Spider":24,"Sportwagon":2,"Stelvio":242,"alfasud":2,"brera":34,"crosswagon":2,"giulia":251,"giulietta":378,"gt":89,"gtv":7,"mito":224,"spider":24,"sportwagon":2,"stelvio":242}
Example of the filter option list returned from function call with make parameter value alfa-romeo:
['145', '146', '147', '155', '156', '159', '164', '166', '33', 'Alfasud', 'Brera', 'Crosswagon', 'GT', 'GTV', 'Giulia', 'Giulietta', 'Mito', 'Spider', 'Sportwagon', 'Stelvio']
Sample of Fiddler request: (screenshot omitted)
Sample of ajax response html containing options:
<section id="body-container" class="om-offers-list"
data-facets='{"offer_seek":{"offer":2198},"private_business":{"business":1326,"private":872,"all":2198},"categories":{"29":2198,"161":953,"163":953},"categoriesParent":[],"filter_enum_model":{"145":1,"146":1,"147":219,"155":1,"156":116,"159":561,"164":2,"166":37,"33":1,"Alfasud":2,"Brera":34,"Crosswagon":2,"GT":88,"GTV":7,"Giulia":251,"Giulietta":380,"Mito":226,"Spider":25,"Sportwagon":2,"Stelvio":242,"alfasud":2,"brera":34,"crosswagon":2,"giulia":251,"giulietta":380,"gt":88,"gtv":7,"mito":226,"spider":25,"sportwagon":2,"stelvio":242},"new_used":{"new":371,"used":1827,"all":2198},"sellout":null}'
data-showfacets=""
data-pagetitle="Alfa Romeo samochody osobowe - otomoto.pl"
data-ajaxurl="https://www.otomoto.pl/osobowe/alfa-romeo/?search%5Bbrand_program_id%5D%5B0%5D=&search%5Bcountry%5D="
data-searchid=""
data-keys=''
data-vars=""
Alternative version of the function, without regex:
import requests
import ast
from bs4 import BeautifulSoup as bs

def getOptions(make):  # function to return options based on make; uses headers defined in the script above
    data = {
        'search[filter_enum_make]': make,
        'search[dist]': '5',
        'search[category_id]': '29'
    }
    r = requests.post('https://www.otomoto.pl/ajax/search/list/', data=data, headers=headers)
    soup = bs(r.content, 'lxml')
    data = soup.select_one('#body-container')['data-facets'].replace('null', '"null"')
    aDict = ast.literal_eval(data)['filter_enum_model']  # convert string representation of dictionary to python dictionary
    d = len({k.lower(): v for k, v in aDict.items()}.keys())  # find length of unique keys when accounting for case
    dirtyList = list(aDict)[:d]  # trim to unique values
    cleanedList = [item for item in dirtyList if item != 'other']  # remove 'other' as it doesn't appear in dropdown
    return cleanedList

print(getOptions('alfa-romeo'))
R conversion and improved Python:
Whilst converting to R I found a better way of extracting the parameters from a js file on the server. If you open dev tools you can see the file listed in the sources tab.
R (To be improved):
library(httr)
library(jsonlite)

url <- 'https://www.otomoto.pl/ajax/jsdata/params/'
r <- GET(url)
contents <- content(r, "text")
data <- strsplit(contents, "var searchConditions = ")[[1]][2]
data <- strsplit(as.character(data), ";var searchCondition")[[1]][1]
source <- fromJSON(data)$values$'573'$'571'
makes <- names(source)
for (make in makes) {
  print(make)
  print(source[make][[1]]$value)
  #break
}
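A possible continuation of the R script (an untested sketch): collect each make's models into a named list, pad to equal length, and build a wide data frame mirroring the Python csv output.
models_by_make <- lapply(makes, function(make) source[[make]]$value)
names(models_by_make) <- makes
max_len <- max(lengths(models_by_make))
# pad shorter columns with "" so every column has the same length
padded <- lapply(models_by_make, function(x) c(x, rep("", max_len - length(x))))
df <- as.data.frame(padded, check.names = FALSE, stringsAsFactors = FALSE)
write.csv(df, "Data.csv", row.names = FALSE)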
Python:
import requests
import json
import pandas as pd

r = requests.get('https://www.otomoto.pl/ajax/jsdata/params/')
data = r.text.split('var searchConditions = ')[1]
data = data.split(';var searchCondition')[0]
items = json.loads(data)
source = items['values']['573']['571']
makes = [item for item in source]
results = {}
for make in makes:
    df = pd.DataFrame(source[make])  # build a dictionary of lists to hold options for each make
    results[make] = list(df['value'])
dfFinal = pd.DataFrame.from_dict(results, orient='index').transpose()  # turn into a dataframe and transpose so each column header is the make and the options are listed below
mask = dfFinal.applymap(lambda x: x is None)  # tidy up None values to empty strings https://stackoverflow.com/a/31295814/6241235
cols = dfFinal.columns[(mask).any()]
for col in dfFinal[cols]:
    dfFinal.loc[mask[col], col] = ''
print(dfFinal)

Splitting an htmlParse'd HTML document while preserving the class

I'd like to scrape phone numbers from this French public directory. The thing is, it can return multiple results, and I'd like to get them all, but I have a problem with splitting the parsed HTML doc.
Here is my code:
library(httr)
library(XML)
library(foreach)

# example url for reproducibility
url_ <- "http://www.pagesjaunes.fr/recherche/departement/zc-de-vignolles-beaune-21/pagot-&-savoie---espace-aubade"
response <- GET(url_)
doc <- content(response, type = "text/html", encoding = "UTF-8")
parseddoc <- htmlParse(doc)
# I think the problem lies in this next line, let's call it "line A":
boxes <- xpathSApply(parseddoc, "//article[@class='bi-bloc blocs clearfix bi-pro']")
foreach(box = boxes) %do% {
  # and also in this line, let's call it "line B":
  return_line$PJ_phone_number <- xpathApply(box, "//div[@class='item bi-contact-tel']", xmlValue)
}
I've tested line A: xpathSApply() gets all the nodes matching the XPath "//article[@class='bi-bloc blocs clearfix bi-pro']" (basically, each result box from the search on the website) and puts them into a list, which I then go through with foreach. (I've tested this.)
However, for line B to work, "box" needs to be of class "XMLInternalDocument" (parseddoc, for instance, has class "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument" "XMLAbstractDocument"). But in line A, xpathSApply() returns a list of objects of class "XMLInternalElementNode" "XMLInternalNode" "XMLAbstractNode".
So my question is, how can I have line A "split" the parts of parseddoc that I need while keeping the class XMLInternalDocument?
I hope I'm clear enough. Thanks.
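A minimal sketch of the usual fix (untested): as the next answer explains, an XPath beginning with // searches from the document root even when applied to a node, while a leading . restricts the search to the current node, so the boxes never need to be turned back into documents at all:
phone_numbers <- lapply(boxes, function(box) {
  xpathSApply(box, ".//div[@class='item bi-contact-tel']", xmlValue)
})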

R: XPath expression returns links outside of selected element

I am using R to scrape the links from the main table on this page, using XPath syntax. The main table is the third on the page, and I want only the magazine article links.
My code follows:
require(XML)
(x = htmlParse("http://www.numerama.com/magazine/recherche/125/hadopi/date"))
(y = xpathApply(x, "//table")[[3]])
(z = xpathApply(y, "//table//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href"))
(links = unique(z))
If you look at the output, the final links do not come from the main table but from the sidebar, even though I selected the main table in my third line by asking object y to include only the third table.
What am I doing wrong? What is the correct/more efficient way to code this with XPath?
Note: XPath novice writing.
Answered (really quickly), thanks very much! My solution is below.
extract <- function(x) {
  message(x)
  html = htmlParse(paste0("http://www.numerama.com/magazine/recherche/", x, "/hadopi/date"))
  html = xpathApply(html, "//table")[[3]]
  html = xpathApply(html, ".//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href")
  html = gsub("#ac_newscomment", "", html)
  html = unique(html)
}
d = lapply(1:125, extract)
d = unlist(d)
write.table(d, "numerama.hadopi.news.txt", row.names = FALSE)
This saves all links to news items with keyword 'Hadopi' on this website.
You need to start the pattern with . if you want to restrict the search to the current node.
/ goes back to the start of the document (even if the root node is not in y).
xpathSApply(y, ".//a/@href")
Alternatively, you can extract the third table directly with XPath:
xpathApply(x, "//table[3]//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href")

removing data with tags from a vector

I have a string vector which contains HTML tags, e.g.
abc <- "welcome <span class=\"r\">abc</span> Have fun!"
I want to remove these tags and get the following vector:
abc <- "welcome Have fun"
Try
> gsub("(<[^>]*>)","",abc)
What this says is "substitute every instance of < followed by anything that isn't a >, up to a >, with nothing".
You can't just do gsub("<.*>","",abc) because regexps are greedy, and the .* would match up to the last > in your text (and you'd lose the 'abc' in your example).
This solution might fail if you've got > inside your tags - but is <foo class=">" > legal? Doubtless someone will come up with another answer that involves parsing the HTML with a heavyweight XML package.
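A quick demonstration of the greediness point with the example string:
abc <- "welcome <span class=\"r\">abc</span> Have fun!"
gsub("(<[^>]*>)", "", abc)  # "welcome abc Have fun!"  - tags removed, text kept
gsub("<.*>", "", abc)       # "welcome  Have fun!"     - greedy .* also eats "abc"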
You can convert your piece of HTML to an XML document with htmlParse or htmlTreeParse. You can then convert it to text, i.e., strip all the tags, with xmlValue.
abc <- "welcome <span class=\"r\">abc</span> Have fun!"
library(XML)
#doc <- htmlParse(abc, asText=TRUE)
doc <- htmlTreeParse(abc, asText=TRUE)
xmlValue( xmlRoot(doc) )
If you also want to remove the contents of the tags, you can use xmlDOMApply to transform the XML tree.
f <- function(x) if(xmlName(x) == "span") xmlTextNode(" ") else x
d <- xmlDOMApply( xmlRoot(doc), f )
xmlValue(d)
