Get part of HTML table - web-scraping

I want to extract the contents of a table from a website.
This is the website's source code:
<tr><td><table width='100%'><tr><td valign='top' width='1px' class='GridViewRow1'><img src='/images/pin.gif'></td><td class='GridViewRow1'><a href='Announcements.etc'><b><i>Title num 1</i></b></a><div class='SmallText'>Username</div><div class='SmallText' style='color:#808080;'>date</div></td></tr></table></td></tr>
<tr><td><table width='100%'><tr><td valign='top' width='1px' class='GridViewRow1'><img src='/images/pin.gif'></td><td class='GridViewRow1'><a href='Announcements.etc2'><b><i>Title num 2</i></b></a><div class='SmallText'>username</div><div class='SmallText' style='color:#808080;'>date</div></td></tr></table></td></tr>
This is my code:
Document doc = Jsoup.connect(url).get();
Elements td = doc.select("td.GridViewRow1");
desc = td.get(0).nextElementSibling().text();
The output I get is a single string:
Title num 1 username date
I want to get the title only.
Can someone explain how to get the title, given that it doesn't have a unique tag?

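The title is wrapped in <b><i> tags, so select just those. The <b> is nested inside the <a>, so the child-combinator chain has to include the link:
... td = doc.select("td.GridViewRow1 > a > b > i");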

Document doc = Jsoup.connect(url).get();
Elements td = doc.select("td.GridViewRow1");
desc = td.select("a[href]").first().text();
This was the solution to my issue: selecting the a[href] link inside the cell returns only the title text.
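For readers working outside Java, the same idea (keep only the text of the link inside each td.GridViewRow1 cell and ignore the username and date) can be sketched in Python with the standard-library html.parser module. This is an illustrative stand-in, not Jsoup; the TitleExtractor class and extract_titles function are names made up for this sketch, and it assumes the cells are not nested inside one another.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text of <a href=...> links found inside td.GridViewRow1 cells."""

    def __init__(self):
        super().__init__()
        self.in_row_cell = False  # True while inside a td with class GridViewRow1
        self.in_link = False      # True while inside a link within such a cell
        self.titles = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            classes = (attrs.get("class") or "").split()
            self.in_row_cell = "GridViewRow1" in classes
        elif tag == "a" and self.in_row_cell and "href" in attrs:
            self.in_link = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.titles.append("".join(self._buffer).strip())
            self.in_link = False
        elif tag == "td":
            self.in_row_cell = False

    def handle_data(self, data):
        # Text inside <b><i> is still inside the link, so it is captured here.
        if self.in_link:
            self._buffer.append(data)


def extract_titles(html):
    parser = TitleExtractor()
    parser.feed(html)
    return parser.titles
```

Feeding the two rows from the question to extract_titles should yield one title per row, without the username or date text.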

Related

Python-docx - Center a cell content in an existing table after adding values

I have a .docx template with an empty table, where I am adding values:
def manipulate_table():
    table = doc.tables[0]
    table.cell(0, 0).text = 'A'
    table.cell(0, 1).text = 'B'
    table.cell(0, 2).text = 'C'
    table.cell(0, 3).text = 'D'
After adding these values, the table attribute "Centered" is gone, which is standard behaviour.
How can I loop through my table and center all values again? I've already Googled, but found nothing helpful. For example, this does not work:
for cell in ....????:
    tc = cell._tc
    tcPr = tc.get_or_add_tcPr()
    tcVAlign = OxmlElement('w:vAlign')
    tcVAlign.set(qn('w:val'), "center")
    tcPr.append(tcVAlign)
I appreciate all your help.

The .text property on a cell completely replaces the text in the cell, including the paragraph(s) that were there before.
The "centered" attribute is on each paragraph, not on the cell. So you need to do something like:
from docx.enum.text import WD_ALIGN_PARAGRAPH
cell.paragraphs[0].alignment = WD_ALIGN_PARAGRAPH.CENTER
to each of the "new" paragraphs (assigning to .text will leave you with exactly one in each cell).
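The loop the question asks for can be sketched as below. It relies only on the rows, cells, paragraphs, and alignment attributes that python-docx exposes on tables; the set_table_alignment helper name and the file names in the usage comment are assumptions made for this sketch.

```python
def set_table_alignment(table, alignment):
    """Apply one paragraph alignment to every paragraph in every cell of a table."""
    for row in table.rows:
        for cell in row.cells:
            # Assigning to cell.text leaves one paragraph per cell, but looping
            # over all paragraphs also covers cells that were filled differently.
            for paragraph in cell.paragraphs:
                paragraph.alignment = alignment


# Usage with python-docx (not executed here):
#   from docx import Document
#   from docx.enum.text import WD_ALIGN_PARAGRAPH
#   doc = Document("template.docx")   # hypothetical file name
#   set_table_alignment(doc.tables[0], WD_ALIGN_PARAGRAPH.CENTER)
#   doc.save("output.docx")           # hypothetical file name
```

Call it after all the .text assignments, since each assignment replaces the paragraph and drops its alignment again.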

How to get the last value in Node in XML

I'm trying to figure out how to get the last part of the text inside a node, which is the ID 104543.
<id>tag:website.com:feed/web/main/104543</id>
The output should be 104543.

If you are using .NET 3.5 or higher, then you could use the XElement class in System.Xml.Linq.
You could retrieve the tag element content in the following way:
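// The @ prefix makes this a verbatim string literal.
string str = @"<id>tag:website.com:feed/web/main/104543</id>";
XElement element = XElement.Parse(str);
// Parse returns the <id> element itself, so read its Value directly;
// Descendants("id") would skip the root element and find nothing.
string content = element.Value;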
Now, parsing the content depends on how this is structured: if the code you want to extract will always be placed after the last "/" character, then you could do the following:
string code = content.Split(new[] { "/" }, StringSplitOptions.None).Last();
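The same two steps (read the element's text, then keep whatever follows the last "/") can be sketched in Python with the standard-library xml.etree.ElementTree module. The last_path_segment function name is made up for this sketch, and it assumes, as above, that the ID always comes after the final slash.

```python
import xml.etree.ElementTree as ET


def last_path_segment(xml_text):
    """Return the part of the root element's text that follows the last '/'."""
    element = ET.fromstring(xml_text)
    # rsplit with maxsplit=1 splits only once, starting from the right.
    return element.text.rsplit("/", 1)[-1]


code = last_path_segment("<id>tag:website.com:feed/web/main/104543</id>")
```

If the text contains no slash at all, rsplit returns the whole string, so the function then returns the full text unchanged.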

R: XPath expression returns links outside of selected element

I am using R to scrape the links from the main table on a page, using XPath syntax. The main table is the third on the page, and I want only the links that point to magazine articles.
My code follows:
require(XML)
(x = htmlParse("http://www.numerama.com/magazine/recherche/125/hadopi/date"))
(y = xpathApply(x, "//table")[[3]])
(z = xpathApply(y, "//table//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href"))
(links = unique(z))
If you look at the output, the final links do not come from the main table but from the sidebar, even though I selected the main table in my third line by asking object y to include only the third table.
What am I doing wrong? What is the correct/more efficient way to code this with XPath?
Note: XPath novice writing.
Answered (really quickly), thanks very much! My solution is below.
extract <- function(x) {
  message(x)
  html = htmlParse(paste0("http://www.numerama.com/magazine/recherche/", x, "/hadopi/date"))
  html = xpathApply(html, "//table")[[3]]
  html = xpathApply(html, ".//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href")
  html = gsub("#ac_newscomment", "", html)
  html = unique(html)
}
d = lapply(1:125, extract)
d = unlist(d)
write.table(d, "numerama.hadopi.news.txt", row.names = FALSE)
This saves all links to news items with keyword 'Hadopi' on this website.

You need to start the pattern with . if you want to restrict the search to the current node.
/ goes back to the start of the document (even if the root node is not in y).
xpathSApply(y, ".//a/@href" )
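Alternatively, you can extract the third table directly with XPath. The parentheses matter: (//table)[3] is the third table in the document, whereas //table[3] matches any table that is the third table child of its parent:
xpathApply(x, "(//table)[3]//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href")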
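The effect of the leading "." can be sketched in Python with the standard-library xml.etree.ElementTree module. This is an illustration under assumptions, not the R XML package: the PAGE markup is invented for the sketch, and because ElementTree supports only a subset of XPath (no contains()), the href filter is done in plain Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the third table is the main one, the second is a sidebar.
PAGE = """
<html><body>
  <table><tr><td><a href="/home">home</a></td></tr></table>
  <table><tr><td><a href="/magazine/side-1">sidebar item</a></td></tr></table>
  <table><tr><td>
    <a href="/magazine/article-1">article</a>
    <a href="/magazine/recherche/2">next results page</a>
  </td></tr></table>
</body></html>
"""


def magazine_links(node):
    """Links below `node` that contain /magazine/ but not /recherche/."""
    hrefs = [a.get("href", "") for a in node.findall(".//a")]
    return [h for h in hrefs if "/magazine/" in h and "/recherche/" not in h]


root = ET.fromstring(PAGE)
main_table = root.findall(".//table")[2]  # third table in document order

whole_document = magazine_links(root)     # search starts at the document root
main_only = magazine_links(main_table)    # ".//a" is relative to the table node
```

Here whole_document also picks up the sidebar link, while main_only contains only the link from the third table, which is the difference between searching from the document root and searching from the selected node.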

Need help building a menuopt file for Jenzabar CX

I'm not sure if anyone out here uses Jenzabar & ACE reporting, but the question is specific to that as far as I know.
I'm building an ACE report and the menuopt file for it has to be modified to lookup values for a parameter based on several specific conditions.
The portion of the menuopt file I have now is:
LU7 = crs_rec.title1, optional;
PA7: optional,
comments = "Enter a course number - leave blank if for all"
default = "",
lookup LU7 joining *crs_rec.crs_no,
upshift,
length = 10;
I'm looking to modify the lookup so that it only lists courses that can be found by this SQL statement:
SELECT DISTINCT crs_no
FROM crs_rec
WHERE dept IN ( SELECT dept
FROM dept_table
WHERE div IN ('CCE','HLTH'));
If anyone is familiar with using Jenzabar CX & ACE reporting, any help would be appreciated.
Thanks

I got this answer from someone on a Jenzabar listserv:
Sometimes you can get the same effect by limiting it based on other params.
For example:
LU6 = cat_table.txt;
PA6: optional,
comments = "COMMENT_CAT_TBCODE",
lookup LU6 joining *cat_table.cat,
upshift,
length = 4;
LU7 = crs_rec.title1, optional;
LU7B = crs_rec.dept, optional,
qualifier = "#XXXX,YYYY,ZZZZ,DDDD,EEEE";
LU7C = crs_rec.cat, optional,
qualifier = "field:PA6";
PA7: optional,
comments = "COMMENT_CRS_NO - COMMENT_BLANK_ALL"
default = "",
lookup LU7,LU7B,LU7C joining *crs_rec.crs_no,
upshift,
length = 10;
This would show only the courses in departments XXXX,YYYY,ZZZZ,DDDD, and EEEE in the catalog entered as param PA6.
(the catalog param is basically the only way of doing the "distinct" for the crs_no in the menuopt).
You cannot do the "dept in div" filter unless you make dept another parameter. In that case you could limit the dept selection with a div qualifier and change LU7B to reference field:xxxx (the param for the dept).

Obtain data from dynamically incremented IDs in JQuery

I have a quick question about jQuery. I have dynamically generated paragraphs with IDs that are incremented. I would like to take information from that page and bring it to my main page. Unfortunately, I am unable to read the dynamically generated paragraph IDs to get the values. I am trying this:
var Name = ((data).find("#Name" + id).text());
The ASP.NET code goes like this:
Dim intI As Integer = 0
For Each Item As cItem In alProducts1
    Dim pName As New System.Web.UI.HtmlControls.HtmlGenericControl("p")
    pName.id = "Name" & intI.toString()
    pName.InnerText = Item.Name
    controls.Add(pName)
    intI += 1
Next
Those name values are the values I want (Name1, Name2, Name3), and I want to get each one individually to put in its own textbox. I'm taking the values from the ASP.NET webpage and putting them into an AJAX page.

Your question is not clear about your exact requirement, but you can get the IDs of elements with jQuery's attr method. Here is an example:
alert($('selector').attr('id'));

You want to select all the elements with the incrementing ids, right?
// this will select all the elements
// whose id starts with 'Name'
$(data).find("[id^=Name]")

Thanks for the help, everyone. I found the solution today:
var Name = ($(data).find('#Name' + id.toString()).text());
I forgot the .toString() part and that seems to have made the difference.
