BeautifulSoup find specific td data-sort="name" - web-scraping

Hi, is there a way similar to
soup.findAll('div', class_="cities")
that filters on the data-sort attribute instead?
I don't know, something like
soup.findAll('td', data-sort_="citiesVillage")
I'd like to find only the rows with a specific data-sort name.
Thanks.

Try this:
soup.findAll('td', attrs={'data-sort': 'citiesVillage'})
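A minimal sketch of why the attrs dictionary is needed: data-sort contains a hyphen, so it cannot be written as a Python keyword argument. The markup and the city names below are made up for illustration, and find_all is the newer spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the data-sort attribute matters here.
HTML = """
<table>
  <tr><td data-sort="citiesVillage">Springfield</td><td data-sort="population">30000</td></tr>
  <tr><td data-sort="citiesVillage">Shelbyville</td><td data-sort="population">25000</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Pass hyphenated attribute names through attrs instead of as keyword arguments.
cells = soup.find_all("td", attrs={"data-sort": "citiesVillage"})
names = [cell.get_text(strip=True) for cell in cells]

# If the whole row is wanted, step up from each matching cell to its <tr>.
rows = [cell.find_parent("tr") for cell in cells]

print(names)
```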

Related

Not able to find the Xpath

I am trying to scrape the IMDb top 250 movies using Scrapy and am stuck finding the XPath for the duration of each movie (I need to extract "2", "h", "44" and "m"). Website link: https://www.imdb.com/title/tt15097216/?ref_=adv_li_tt
Here's the image of the HTML:
I've tried this Xpath but it's not accurate:
//li[@class ='ipc-inline-list__item']/following::li/text()
If it's always in the same position, what about:
//li[@class ='ipc-inline-list__item']/following::li[2]
or more simply:
//li[@class ='ipc-inline-list__item'][3]
or, since the others have hyperlinks as the child, filter to just the li that has text() child nodes:
//li[@class ='ipc-inline-list__item'][text()]
However, the original XPath may be fine; the problem may be in how you are consuming the result. If you are using .get(), which returns only the first match, then try .getall() instead.
You can use this XPath to locate the element:
//span[contains(@class,'Runtime')]
To extract the text you can use this:
//span[contains(@class,'Runtime')]/text()

Extract XML child attribute based on another child attribute

I have the following XML structure. I am trying to extract the StartDate and EndDate elements of the relationship period, that is, only where rr:PeriodType is RELATIONSHIP_PERIOD.
However, the nodes for "relationship" and "accounting" have exactly the same name and I am not sure how to proceed.
<rr:RelationshipPeriods>
<rr:RelationshipPeriod>
<rr:StartDate>2018-01-01T00:00:00.000Z</rr:StartDate>
<rr:EndDate>2018-12-31T00:00:00.000Z</rr:EndDate>
<rr:PeriodType>ACCOUNTING_PERIOD</rr:PeriodType>
</rr:RelationshipPeriod>
<rr:RelationshipPeriod>
<rr:StartDate>2019-01-02T00:00:00.000Z</rr:StartDate>
<rr:PeriodType>RELATIONSHIP_PERIOD</rr:PeriodType>
</rr:RelationshipPeriod>
</rr:RelationshipPeriods>
I tried using this code
ldply(xpathApply(xmlData, '//rr:RelationshipPeriod/rr:StartDate', getChildrenStrings), rbind)
But it doesn't work well, as it's hard to tell whether it is extracting the accounting or the relationship period.
Any help would be greatly appreciated!
For rr:StartDate use XPath:
//rr:RelationshipPeriod[rr:PeriodType='RELATIONSHIP_PERIOD']/rr:StartDate
But probably better to first find the correct rr:RelationshipPeriod using XPath:
//rr:RelationshipPeriod[rr:PeriodType='RELATIONSHIP_PERIOD']
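See this answer on how to reuse the result of an XPath.
When you then query relative to that node, don't put // in front of rr:StartDate and rr:EndDate, because // searches the whole document again instead of just the selected period.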
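The two-step approach above (select the matching rr:RelationshipPeriod first, then read its children with relative paths) can be sketched in Python with the standard library. This is an illustration under assumptions, not the asker's R code: the namespace URI is a placeholder, since the question does not show the real one, and ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Assumption: placeholder namespace URI; the real one is not shown in the question.
NS = {"rr": "http://example.com/rr"}

XML = """
<rr:RelationshipPeriods xmlns:rr="http://example.com/rr">
  <rr:RelationshipPeriod>
    <rr:StartDate>2018-01-01T00:00:00.000Z</rr:StartDate>
    <rr:EndDate>2018-12-31T00:00:00.000Z</rr:EndDate>
    <rr:PeriodType>ACCOUNTING_PERIOD</rr:PeriodType>
  </rr:RelationshipPeriod>
  <rr:RelationshipPeriod>
    <rr:StartDate>2019-01-02T00:00:00.000Z</rr:StartDate>
    <rr:PeriodType>RELATIONSHIP_PERIOD</rr:PeriodType>
  </rr:RelationshipPeriod>
</rr:RelationshipPeriods>
"""

root = ET.fromstring(XML.strip())

# Step 1: pick the period whose PeriodType child equals RELATIONSHIP_PERIOD.
period = root.find("rr:RelationshipPeriod[rr:PeriodType='RELATIONSHIP_PERIOD']", NS)

# Step 2: read children relative to that node (no leading //).
start_date = period.findtext("rr:StartDate", namespaces=NS)
end_date = period.findtext("rr:EndDate", namespaces=NS)  # None when the element is absent

print(start_date, end_date)
```

In the sample data the relationship period has no rr:EndDate, so the relative lookup yields None rather than picking up the accounting period's end date.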

Use do.call to get information out of a list of RC/S4 objects

I have a defined reference class and a list:
RCclass<-setRefClass("RCclass",field=list(info="character"))
A<-RCclass$new(info="a")
B<-RCclass$new(info="b")
testList<-list(A,B)
do.call(function(x){paste0(x$info)},testList)
The do.call function doesn't look quite right, and it doesn't give me the expected string "ab". However I am not sure how to achieve this. Please share your opinions; thanks!
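do.call passes each element of testList as a separate argument, so the one-argument function fails. I found a way around this by extracting the field from each object and then reducing: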
Reduce("paste0",(lapply(testList,FUN=function(x)x$info)))

Add Control in rCharts::Highchart in R

I'd like to use this: http://rstudio-pubs-static.s3.amazonaws.com/5548_c3b680696b084e5db17eecf8c079a3c1.html
but I need a list filter like this: http://rcharts.io/icontrols/#.Vbs9WZP1GlM
Any idea how to do this?

Selenium IDE - Select checkbox on table row

I'm using Selenium IDE and I have a table with many rows and columns. Each row has its own checkbox to select that row.
I was using this command to search for a specific row:
css=tr:contains('US Tester4') input[type="checkbox"]
But the problem is that in this column I have other similar values like "US Tester41", "US Tester42" ... and when I use this command, it selects the wrong row.
I thought replacing "contains" with something like "equals" or "exactly" would work, but it didn't (I don't know the syntax).
Any ideas?
Follow the screenshot:
http://oi41.tinypic.com/2ake9hw.jpg
I'm not familiar with Selenium IDE, but with the selenium webdriver I would use an xpath. So I guess something like this will work for you:
xpath=//tr[td[3][text()='US Tester4']]//input[@type='checkbox']
This worked for me:
//tr//td[.='US Tester4']//input[@type="checkbox"]
against:
<table>
<tr><td>US Tester<input type="checkbox"></td></tr>
<tr><td>US Tester4<input type="checkbox"></td></tr>
<tr><td>US Tester41<input type="checkbox"></td></tr>
<tr><td>US Tester412<input type="checkbox"></td></tr>
</table>
It matched the second element.
This worked for me:
xpath=(//input[@name='uid'])[2]
The 2 is the position of the element among the matches.
I'm not very familiar with the IDE but I have used the WebDriver before. If possible I would use this XPath:
xpath = "//td[.= 'US Tester4']/preceding-sibling::td//input[@type = 'checkbox']"
This should locate only one element on screen. The preceding-sibling and following-sibling axes are very helpful when the exact element you want has no good identifier of its own. In your case the <td> that contains the checkbox has no good identifier, whereas the <td> after it has text that you can match with the '=' operator. You then use preceding-sibling to step back to the <td> with the checkbox.
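The exact-match idea behind these answers can be checked outside the browser with a small Python sketch. The table layout and the checkbox name attributes are assumptions made up for illustration, and ElementTree implements only a subset of XPath (no preceding-sibling axis), so the sketch uses the row-based form: pick the tr that has a td equal to the text, then find the checkbox inside it.

```python
import xml.etree.ElementTree as ET

# Hypothetical table: checkbox in the first cell, tester name in the second.
TABLE = """
<table>
  <tr><td><input type="checkbox" name="row1"/></td><td>US Tester</td></tr>
  <tr><td><input type="checkbox" name="row2"/></td><td>US Tester4</td></tr>
  <tr><td><input type="checkbox" name="row3"/></td><td>US Tester41</td></tr>
</table>
"""

root = ET.fromstring(TABLE.strip())

# [td='US Tester4'] requires a cell whose full text equals the value,
# so rows containing "US Tester41" are not selected.
boxes = root.findall(".//tr[td='US Tester4']//input[@type='checkbox']")
matched_names = [box.get("name") for box in boxes]

print(matched_names)
```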
