Scraping an item that varies in position from an XML subpage - r

So far I haven't succeeded in scraping the table "Die Verlustursache" from this page:
http://www.ubootarchiv.de/ubootwiki/index.php/U_205
using the libraries XML, rvest and readr.
I can address every table on the page with an individual line of code like
table <-readHTMLTable("http://www.ubootarchiv.de/ubootwiki/index.php/U_203") %>% .[1]
but the index of the table I need differs on the other pages.
See for example: http://www.ubootarchiv.de/ubootwiki/index.php/U_27
I just realized that the table I need always sits four positions before the last one (meaning: the last table minus 4).
In another scraping project, I once used this line to scrape only the last item of a list page:
html_nodes(xpath="/html/body/div/div[3]/div[2]/div[1]/div[2]/div/table/tbody/tr[last()]")
However, I was not able to find a solution for something like "last - 4".
Please advise, and thanks in advance.

You could use this if it is always the last table minus 4:
table <-readHTMLTable("http://www.ubootarchiv.de/ubootwiki/index.php/U_203")
table[length(table) - 4]
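A minimal sketch of the same indexing idea on a plain list, so it can be run without fetching the page. The list tables and the helper nth_from_last() are made up for illustration; readHTMLTable() returns a list, so the same indexing should carry over to its result.

```r
# Stand-in for the list returned by readHTMLTable(); names and values are hypothetical.
tables <- list(a = 1, b = 2, c = 3, d = 4, e = 5, f = 6)

# Return the element that sits `offset` positions before the last one.
# offset = 0 is the last element, offset = 4 is "last minus 4".
nth_from_last <- function(x, offset) {
  idx <- length(x) - offset
  if (idx < 1) stop("list has only ", length(x), " elements")
  x[[idx]]  # [[ ]] returns the element itself, [ ] would return a one-element list
}

wanted <- nth_from_last(tables, 4)
print(wanted)
```

If you would rather stay with XPath, the predicate accepts arithmetic as well: tr[last() - 4] selects the row four positions before the last row of the same parent, and (//table)[last() - 4] does the same for tables across the whole document.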

Related

How to use rvest to scrape the same kind of data point labelled with different ids

I want to use rvest to scrape a particular data point (name, address, phone, etc.) that is repeated in different sections of this page. The span ids all start the same way but are not identical, for example:
docs-internal-guid-049ac94a-f34e-5729-b053-30567fdf050a
docs-internal-guid-765e48e9-f34b-7c88-5d95-042a93fcfda3
What's the best approach? Finding and copying each id is not viable. Thanks
Edit:
You can use the following script to retrieve all star restaurants:
library("rvest")
url_base <- "http://www.straitstimes.com/lifestyle/food/full-list-of-michelin-starred-restaurants-for-2017"
data <- read_html(url_base) %>%
html_nodes("h3") %>%
html_text()
This also gives you the headers ("ONE MICHELIN STAR", "TWO MICHELIN STARS", "THREE MICHELIN STARS"), but this might even be helpful.
Background to the script:
Fortunately, all and only the relevant information is within the h3 selector. The script gives you a char vector as output. Of course, you can further elaborate on this with e.g. %>% as.data.frame() or however you want to store / process the data.
------------------- old answer -------------------
Could you maybe provide the URL of that particular page? To me it sounds like you have to find the right CSS selector (nth-child(x)) that you can use in a loop.
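If the ids really do share a fixed prefix, a CSS attribute selector may be enough. The sketch below assumes the prefix docs-internal-guid- quoted in the question and runs against a small made-up HTML fragment rather than the real page, so the restaurant names are placeholders.

```r
library(rvest)

# Hypothetical page fragment; the ids mimic the ones quoted in the question.
page <- read_html('<html><body>
  <span id="docs-internal-guid-049ac94a-f34e-5729-b053-30567fdf050a">Restaurant A</span>
  <span id="other">skip me</span>
  <span id="docs-internal-guid-765e48e9-f34b-7c88-5d95-042a93fcfda3">Restaurant B</span>
</body></html>')

# [id^='...'] matches every element whose id starts with the given prefix.
names_found <- page %>%
  html_nodes("span[id^='docs-internal-guid-']") %>%
  html_text()

print(names_found)
```

This avoids copying each id by hand; whether the prefix is stable on the real page is something you would have to check.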

Replacing multiple string intervals in R

I am currently working on a data set which has two header rows (the first one acts as an overall category description and the second one contains subcategories). Both happen to contain various <text> intervals. For example:
In the first row (the column names of the data frame), I have a cell that contains:
- text... <span style=\"text-decoration: underline;\">in the office</span> on the activities below. Total must add up to 100%. <br /><br />
The second row contains multiple cells with:
- text <strong>
- text </strong>
Now, I was able to work out how to remove all <text> intervals in the second row through:
data[1,] = gsub("<.*>", "", data[1,])
However, for the column names row, if I use:
colnames(data) = gsub("<.*>", "",colnames(data))
I end up with just "text", which I don't want, because I still want to keep:
text... in the office on the activities below. Total must add up to 100%
If someone has an idea of how to solve this, I would really appreciate it.
Thanks!
You can get what you need by changing the regular expression you are using with the following:
colnames(data) <- gsub("<[^>]+>", "",colnames(data))
This removes each tag, that is, everything from a < up to the next >, and leaves the text between the tags in place. That should give you what you want.
Your current regex is greedy: it consumes everything between the first opening bracket and the last closing bracket. One quick fix is to make the regex non-greedy by using ?:
data[1,] = gsub("<.*?>", "", data[1,])
Note that using regex to parse HTML generally is not a good idea. If you plan on doing anything with nested content then you should consider using an R package which can parse HTML content.
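To see the difference between the three patterns side by side, here is a small sketch that runs them on the header text from the question. The variable names are made up for illustration.

```r
# Header text taken from the question.
x <- "text... <span style=\"text-decoration: underline;\">in the office</span> on the activities below. Total must add up to 100%. <br /><br />"

greedy  <- gsub("<.*>", "", x)     # from the first "<" to the last ">"
lazy    <- gsub("<.*?>", "", x)    # from each "<" to the nearest ">"
negated <- gsub("<[^>]+>", "", x)  # same result via a negated character class

print(greedy)
print(trimws(lazy))  # trimws() drops the space left behind by the trailing tags
```

The greedy version keeps only the text before the first tag, while the other two strip the tags and keep everything in between.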

Adding incomplete columns to a table

I'm new to R and picking it up pretty quick, I think, but I've hit a wall and I'm not even sure what to google to figure this out for myself.
In the code excerpt below, I'm adding a few calculated columns to the table ALLDATA. The problem is with the last line. If I have an ALLDATA table where every entry has an associated QCAnalysisNumber, the code works fine. If only SOME of the entries have a QCAnalysisNumber, that column doesn't populate at all. I would like it to find an appropriate QCAnalysisNumber, and if it can't, just be NA or let me insert text like "No QCAnalysisNumber".
Can you guys tell me where I'm going wrong or point me in the right direction? Even just appropriate search terms for google would be a huge help. Thanks!
ALLDATA$IntResult <- round(ALLDATA$Value, 0)
ALLDATA$ComboResult <- ifelse(toupper(ALLDATA$DetectedResult)=="N", ALLDATA$Value/2, round(ALLDATA$Value, 0))
ALLDATA$ND15Result <- ifelse(toupper(ALLDATA$DetectedResult)=="N", ALLDATA$Value/2, ALLDATA$Value)
ALLDATA$LogComboResult <- ifelse(ALLDATA$DetectedResult=="N", log10(abs(ALLDATA$Value/2)), log10(abs(ALLDATA$Value)))
ALLDATA$LogResult <- log10(abs(ALLDATA$Value))
ALLDATA$QCAnalysisNumber <- ALLDATA$AnalysisNumber[ALLDATA$QCSampleCode!="O" &
ALLDATA$LongName==ALLDATAQC$LongName &
ALLDATA$SampleDate_D==ALLDATAQC$SampleDate_D]
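One possible approach, sketched under the assumption that ALLDATAQC holds at most one row per LongName / SampleDate_D combination and carries its own AnalysisNumber column. The sample data and the key variables are made up for illustration. The == comparisons in the last line above compare the two tables element by element (with recycling) rather than looking rows up, and logical indexing returns only the matching elements, so the result can have a different length than ALLDATA. match() instead returns one position, or NA, per row of ALLDATA.

```r
# Hypothetical sample data; column names follow the question.
ALLDATA <- data.frame(
  LongName     = c("Lead", "Zinc", "Iron"),
  SampleDate_D = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02")),
  stringsAsFactors = FALSE
)
ALLDATAQC <- data.frame(
  LongName       = c("Lead", "Iron"),
  SampleDate_D   = as.Date(c("2020-01-01", "2020-01-02")),
  AnalysisNumber = c(901, 903),
  stringsAsFactors = FALSE
)

# Build one lookup key per row in each table.
key    <- paste(ALLDATA$LongName, ALLDATA$SampleDate_D, sep = "|")
key_qc <- paste(ALLDATAQC$LongName, ALLDATAQC$SampleDate_D, sep = "|")

# match() gives the QC row for each ALLDATA row, or NA when there is none.
ALLDATA$QCAnalysisNumber <- ALLDATAQC$AnalysisNumber[match(key, key_qc)]

# Optional: a text label instead of NA (this column is character, not numeric).
ALLDATA$QCLabel <- ifelse(is.na(ALLDATA$QCAnalysisNumber),
                          "No QCAnalysisNumber",
                          as.character(ALLDATA$QCAnalysisNumber))
print(ALLDATA)
```

Useful search terms for this kind of problem are "lookup", "left join" and merge() with all.x = TRUE.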

Retrieve whole lyrics from URL

I am trying to retrieve the whole lyrics of a band from the web.
I have noticed that they build URLs using ".../lyrics/bandname/songname.html"
Here is an example.
http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html
I was thinking about creating a function that would read.csv the URLs.
That part was kind of easy because I can get the titles by a simple copy paste and save as .csv. Then, use that vector to pass the function for each value in order to construct the URL name.
But I tried to read the first one just to see what it looks like and I found that there will be too much "cleaning the data" if my goal is to build a csv file with each lyric.
x <-read.csv(url("http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html"))
I think my approach is not the best (or maybe I need a better data cleaning strategy)
The HTML page has a tell on where the lyrics begin:
Usage of azlyrics.com content by any third-party lyrics provider is prohibited by our licensing agreement. Sorry about that.
Taking advantage of that, you can detect this string, and then read everything up to the end of the div:
m <- readLines("http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html")
giveaway <- "Sorry about that."
# Use the full sentence instead if a lyric might itself contain this phrase.
start <- grep(giveaway, m, fixed = TRUE)[1] + 1 # first line of the lyric
# First </div> after the start; grep() indexes the subvector, hence the - 1.
end <- start + grep("</div>", m[start:length(m)], fixed = TRUE)[1] - 1
# Example clean-up: drop the remaining tags and join the lines.
lyrics <- paste(gsub("<br>|</div>", "", m[start:end]), collapse = "\n")
And then:
> cat(lyrics) #using cat() prints the line breaks
Ridin' down the highway
Goin' to a show
Stop in all the byways
Playin' rock 'n' roll
.
.
.
Well it's a long way
It's a long way, you should've told me
It's a long way, such a long way
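The marker approach above can be wrapped in a small function and tried on an in-memory vector first. This is only a sketch: extract_between() and the sample lines are made up for illustration, and the real page contains far more markup than this stand-in.

```r
# Hypothetical stand-in for the result of readLines(url).
m <- c(
  "<div>",
  "<!-- Usage of azlyrics.com content by any third-party lyrics provider is prohibited by our licensing agreement. Sorry about that. -->",
  "Ridin' down the highway<br>",
  "Goin' to a show<br>",
  "</div>",
  "<div>footer</div>"
)

# Return the text between the line holding `marker` and the next `closing` tag.
extract_between <- function(lines, marker, closing = "</div>") {
  start <- grep(marker, lines, fixed = TRUE)[1] + 1
  if (is.na(start) || start > length(lines)) return(NA_character_)
  rel_end <- grep(closing, lines[start:length(lines)], fixed = TRUE)[1]
  if (is.na(rel_end)) return(NA_character_)
  end <- start + rel_end - 1                      # convert back to a position in `lines`
  body <- gsub("<[^>]+>", "", lines[start:end])   # strip any remaining tags
  paste(body[nzchar(body)], collapse = "\n")      # drop empty lines and join
}

lyrics <- extract_between(m, "Sorry about that.")
cat(lyrics, "\n")
```

Returning NA when the marker or the closing tag is missing makes it easier to spot pages with a different layout when looping over many songs.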
Assuming that "cleaning the data" means parsing through HTML tags, I recommend using a DOM scraping library that extracts only the lyric text from the page and saves it to CSV, a database or wherever. That way you wouldn't have to do any data cleaning. I don't know what programming language you're using, but a simple Google search will turn up plenty of DOM querying and parsing libraries for any language.
Here is an example with PHP
http://simplehtmldom.sourceforge.net/manual.htm
include('simple_html_dom.php');
$html = file_get_html('http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html');
// Take the element that follows the second div.ringtone (index 1)
$lyrics = $html->find('div.ringtone',1)->next_sibling();
echo $lyrics->innertext;
Now you have the lyrics and can save them (code not tested).
If you're using R, use this library. You will be able to query the DOM and extract the lyrics easily.
https://github.com/hadley/rvest

Edit a data item in Cognos to display as two digits

Disclaimer: I am a Cognos newbie.
I want to format a data item in Cognos (Report Studio 10.2) to always display as 2 digits. For example, if the value of the data item is 2, I want it to be displayed as 02. How can I achieve this?
I have tried
Format$([my_data_item], "00")
Format([my_data_item], '00') - w/o the dollar sign
None worked.
Any help will be highly appreciated. Thanks!
Marcus ...thanks for pointing me in the right direction.
I was able to use the round function in the query with a CASE statement, and it served my purpose with very little modification to the original report. A note to any new Cognos developer: leave formatting to the report page and not the query.
case when [TRANSFORMATION VIEW].[SOME_FACT].[DOLLAR_VAL]<>-99999 then
case when [TRANSFORMATION VIEW].[SOME_FACT].[DOLLAR_VAL] >= 1000000 then
round( [TRANSFORMATION VIEW].[SOME_FACT].[DOLLAR_VAL] /1000000, 2)
else
round([TRANSFORMATION VIEW].[SOME_FACT].[DOLLAR_VAL],0)
end
end
Then on the report itself, I formatted the data item as Numeric and didn't change any of the default settings. That did the magic. The reason why I have round([TRANSFORMATION VIEW].[SOME_FACT].[DOLLAR_VAL],0) is because I don't want to display cents in the dollar amount.
