How can I convert a WARC file to a single page HTML file? - warc

Is there a way to convert a WARC file to a single page HTML file similar to the end result of what monolith or SingleFile produce?

Related

Biztalk File Moving

I have an XML file that contains a "FileId" attribute, which denotes the file ID of a PDF document.
<StoredDocumentRepresentation>
<DigitalFile FileId="3BSE077611_B001.pdf" xlink:href="files/3BSE077611_B001.pdf" xlink:type="SIMPLE"></DigitalFile>
</StoredDocumentRepresentation>
There will be many PDFs with different FileIds in the Receive Location (each PDF has its FileId as its name). I need to pick the PDF with the FileId mentioned in the XML...
Thanks in Advance
I have used the File.Move() function in my orchestration for picking the PDF files automatically from source to destination location, once my xml file is dropped...

scrape xml from a html page with getNodeSet

Hi, I am using R to do some basic web scraping, where I am comfortable parsing XML files and querying them with XPath. However, I am having difficulty parsing a full HTML page and extracting the XML to get into my comfort zone. For example:
parsedhtml <- htmlParse("http://www.w3schools.com/XPath/xpath_examples.asp")
parses the HTML. I am using this because xmlParse only works on .xml files. I know that by using getNodeSet I can isolate specific nodes within the parsed HTML. So I am attempting to extract the embedded XML document under the "The Example XML Document" section by trying:
getNodeSet(parsedhtml, "//div[@class = 'code notranslate']")
where I get the data in the correct node, however it is not in standard xml and I am unable to parse this using xmlParse. My question is how do I use the result of getNodeSet to extract the xml?
Thanks very much

HTML Entity in CSVs

How do I add an html entity to my CSV?
I have an ASP.NET and SQL Server application that generates HTML, Excel, and CSV files. Some of the data needs to have the ‡ entity in it. How do I get it to output to my CSV correctly? If I output it like this: ‡, it gets garbled, but if I output the entity code, the CSV contains that literal text.
Non-printable characters in a field are sometimes escaped using one of several C-style character escape sequences: \### and \o### Octal, \x## Hex, \d### Decimal, and \u#### Unicode.
So just escape your non-ascii character C#-style and you'll be fine.
I'm not sure what you mean by "it gets screwed up".
Regardless, it is up to the receiving program or application to properly interpret the characters.
What this means is that if you put ‡ in your csv file then the application that opens the CSV will have to look for those entities and understand what to do with them. For example, the opening application would have to run an html entity decoder in order to properly display it.
If you are looking at the CSV file with notepad (for example) then of course it won't decode the entities because notepad has no clue what html entities are or even what to do when it finds them.
Even Internet Explorer wouldn't convert the entities for display when opening a CSV file. Now if you gave it a .html extension then IE would handle the display of the file with its HTML rendering engine.
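The point that the receiving program has to decode entities can be illustrated with a short sketch. It is written in Python rather than ASP.NET, the field values are made up, and it assumes the entity in question is `&Dagger;`. It shows that a CSV writer stores entity text verbatim, and that a consumer only gets the actual character by running a decoder such as `html.unescape`.

```python
import csv
import html
import io

# Hypothetical row: one field carries the entity text, the other the real character.
row = ["note&Dagger;", "note\u2021"]

# CSV has no notion of HTML entities, so both fields are written verbatim.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
csv_text = buffer.getvalue()

# A consumer that wants the character must decode the entity itself.
fields = next(csv.reader(io.StringIO(csv_text)))
decoded = [html.unescape(field) for field in fields]
```

If the character itself is written instead of the entity, the remaining question is the file encoding: the producer and the program opening the CSV have to agree on it, otherwise the character shows up garbled.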

Need a way to grab an XML file from a URL, transform it with XSL, and present it back as an XML file that prompts a save?

Requirements:
Raw XML comes via URL from an external website over which I have little control (e.g. http://example.com/raw.xml)
I need to transform it via XSL into another XML file (I already have this XSL file written and it works)
I need to write an asp.net or asp file that takes the url, applies the xsl transform, and outputs the resultant xml that prompts the client to save the xml to the client local disk
End result is an XML file that has been XSL-transformed, based on the XSL and the XML from the external website
This should not be difficult, but I do not see any examples that allow me to do what is stated above. Please help! Thanks in advance!
You can get the external XML using the WebRequest class (for example).
The result can be loaded into an XML document and transformed. The transformed document can then be written to HttpResponse.OutputStream with the correct headers for an XML document (the Content-Type will be either text/xml or application/xml).
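A minimal sketch of the response side, written in Python as a WSGI app rather than in ASP.NET. The `transform` callable is a hypothetical stand-in for the fetch-and-XSL step, and the file name is made up. The part the answer leaves implicit is the `Content-Disposition: attachment` header, which is what asks the browser to save the XML rather than display it.

```python
def make_download_app(transform, filename="result.xml"):
    """Build a WSGI app that serves transformed XML as a file download.

    `transform` is a hypothetical zero-argument callable standing in for
    "fetch the remote XML and apply the XSL"; it must return bytes.
    """
    def app(environ, start_response):
        body = transform()
        headers = [
            ("Content-Type", "application/xml; charset=utf-8"),
            # 'attachment' prompts a save dialog instead of inline display.
            ("Content-Disposition", f'attachment; filename="{filename}"'),
            ("Content-Length", str(len(body))),
        ]
        start_response("200 OK", headers)
        return [body]
    return app
```

In ASP.NET the same idea would presumably be expressed by setting `Response.ContentType` and adding a `Content-Disposition` header before writing the transformed document to the output stream.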

ASP.NET localized files

I've got a web page with a link, and the link is supposed to correspond to a PDF in the given user's language. I'm wondering where I should put these PDF files though. If I put them in App_LocalResources, I can't specify a link to /App_LocalResources/TOS_en-US.pdf, can I?
The PDF should definitely not be in the App_LocalResources folder. That folder is only for RESX files.
The PDF files can go anywhere else in your app. For example, a great place to put them would be in a ~/PDF folder. Then your links will have to be dynamically generated (similar to what Greg has shown):
string cultureSpecificFileName = String.Format("TOS_{0}.pdf", CultureInfo.CurrentCulture.Name);
However, there are some other things to consider:
You need a way to ensure that you actually have a PDF for the given language. If someone shows up at your site and has their culture specified as Klingon, it's unlikely that you have such a PDF.
You need to decide exactly what the file name format will be. In the example given, the file would have to be named TOS_en-US.pdf. If you want to use the 2-letter ISO language names, use CurrentCulture.TwoLetterISOLanguageName and then the file name would be TOS_en.pdf.
I would store the filename somewhere with an argument in it (i.e. "TOS_{0}.pdf" ) and then just add the appropriate suffix in code:
string cultureSpecificFileName = string.Format("TOS_{0}.pdf", CultureInfo.CurrentCulture);
Does the PDF have to have the same file name for each of the different languages? If not, put them all into a directory and just store the path in your resources file.
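The concern about a missing PDF for a given culture can be sketched as a simple fallback chain. This is Python rather than C#, and the function name, the set of available files, and the default culture are assumptions for illustration: try the full culture name, then the two-letter language, then a default.

```python
def pick_tos_pdf(culture, available, default="en-US"):
    """Return the best matching TOS file name for a culture such as 'fr-CA'."""
    candidates = [
        culture,                 # full culture name, e.g. 'fr-CA'
        culture.split("-")[0],   # language part only, e.g. 'fr'
        default,                 # last resort
    ]
    for name in candidates:
        filename = f"TOS_{name}.pdf"
        if filename in available:
            return filename
    raise LookupError("no TOS PDF available, not even the default")


# Hypothetical set of files that actually exist on disk.
available = {"TOS_en-US.pdf", "TOS_fr.pdf"}
exact = pick_tos_pdf("en-US", available)
by_language = pick_tos_pdf("fr-CA", available)
fallback = pick_tos_pdf("tlh", available)  # Klingon falls back to the default
```

Generating the link from the result of such a lookup, instead of formatting the culture name straight into the URL, means a visitor with an unsupported culture still gets a working link.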
