I'm trying to convert an XHTML file to docx and found the following example code:
wordMLPackage.getMainDocumentPart().getContent().addAll(XHTMLImporter.convert(new File(inputfilepath), null, wordMLPackage) );
wordMLPackage.save(new java.io.File(System.getProperty("user.dir") + "/html_output.docx") );
It seems all the data will be loaded into memory, is that right?
If the XHTML (including images) is a big file, it may cause an OutOfMemoryError.
Does anyone know how to prevent this?
Many Thanks!
I used the following in R:
download.file('https://www.census.gov/retail/marts/www/marts_current.xls', method='auto',
destfile='C:/Users/<my name>/Desktop/test.xls')
expecting to see the contents of marts_current.xls in test.xls, but much of the content of the source file is left out.
Can someone help me understand why? How can I get the whole file?
Excel .xls files are binary files, and care should be taken to download them as such. The default for download.file is to assume they are text files. You can control this with the mode= argument. Use:
download.file('https://www.census.gov/retail/marts/www/marts_current.xls', method='auto',
destfile='C:/Users/<my name>/Desktop/test.xls', mode="wb")
We have a problem with non-standard characters being displayed in Excel after they are exported as CSV (using All Export) from WooCommerce. Example below:
However, if you open the same file in Notepad, you can see that the characters are actually exported correctly:
On this page I found that the exported file might be missing something which tells Excel to display the characters correctly, and they provided the code below to fix the issue with their particular plugin.
add_filter( 'tablepress_export_data', 'tablepress_add_bom_to_csv_exports', 10, 4 );
function tablepress_add_bom_to_csv_exports( $export_data, $table, $export_format, $csv_delimiter ) {
    if ( 'csv' === $export_format ) {
        $export_data = "\xEF\xBB\xBF" . $export_data;
    }
    return $export_data;
}
Is there a way to modify this code to work with All Export, or with all exports in general, to fix the issue? The above example is German but the file contains all sorts of languages (as we ship globally).
Thanks
Make sure the encoding is UTF-8, a Unicode encoding that supports almost all languages, and make sure to use a font that contains the glyphs for your languages.
I solved this problem, though not in WordPress but in a Java/Spring web application, by writing the UTF-8 BOM before writing the content to the CSV. This "helps" Excel understand that the .csv is UTF-8 encoded, so it displays umlauts correctly.
If you need code examples in Java, just ask and I will add them here.
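Since the answer offers Java examples on request, here is a minimal sketch of the approach it describes: write the three BOM bytes before the CSV payload. The class and method names (CsvBomWriter, writeCsv) are illustrative, not from Spring or any framework.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class CsvBomWriter {
    // UTF-8 byte order mark: EF BB BF
    private static final byte[] UTF8_BOM = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };

    // Write the BOM first, then the CSV content as UTF-8 bytes,
    // so Excel recognizes the file as UTF-8 encoded.
    public static void writeCsv(OutputStream out, String csvContent) throws IOException {
        out.write(UTF8_BOM);
        out.write(csvContent.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writeCsv(buffer, "Name;Ort\nM\u00fcller;K\u00f6ln\n");
        byte[] bytes = buffer.toByteArray();
        // Check that the first three bytes are the BOM
        System.out.println((bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF);
    }
}
```

In a web application you would write to the response's OutputStream instead of a ByteArrayOutputStream; the BOM-first order is the only essential part.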
To resolve this issue, save the CSV/Excel file in a UTF-8 encoded format.
I would like to open an Excel file saved as a webpage using R, and I keep getting error messages.
The desired steps are:
1) Upload the file into RStudio
2) Change the format into a data frame / tibble
3) Save the file as an xls
The message I get when I open the file in Excel is that the file format (Excel webpage format) and the extension (.xls) differ. I have tried the steps in this answer, but to no avail. I would be grateful for any help!
I don't expect anybody will be able to give you a definitive answer without a link to the actual file. The complication is that many services will write files as .xls or .xlsx without them being valid Excel format. This is done because Excel is so common and some non-technical people feel more confident working with Excel files than a csv file. Now, the files will have been stored in a format that Excel can deal with (hence your warning message), but R's libraries are more strict and don't see the actual file type they were expecting, so they fail.
That said, the below steps worked for me when I last encountered this problem. A service was outputting .xls files which were actually just HTML tables saved with an .xls file extension.
1) Download the file to work with it locally. You can script this of course, e.g. with download.file(), but this step helps eliminate other errors involved in working directly with a webpage or connection.
2) Load the full file with readHTMLTable() from the XML package
library(XML)
dTemp = readHTMLTable([filename], stringsAsFactors = FALSE)
This will return a list of dataframes. Your result set will quite likely be the second element or later (see ?readHTMLTable for an example with explanation). You will probably need to experiment here and explore the list structure as it may have nested lists.
3) Extract the relevant list element, using double brackets so you get the data frame itself rather than a one-element list, e.g.
df <- dTemp[[2]]
You also mention writing out the final data frame as an xls file which suggests you want the old-style format. I would suggest the package WriteXLS for this purpose.
I seriously doubt the Excel file is "saved as a web page". I'm pretty sure the file just sits on a server and all you have to do is go fetch it. Some kinds of files (in particular Excel and h5) are binary rather than text files. This requires an extra setting to warn R that it is a binary file, so that it is handled appropriately.
myurl <- "http://127.0.0.1/imaginary/file.xlsx"
download.file(url=myurl, destfile="localcopy.xlsx", mode="wb")
or, to use the downloader package, try something like this:
myurl <- "http://127.0.0.1/imaginary/file.xlsx"
download(myurl, destfile="localcopy.xlsx", mode="wb")
We are using iText to compress PDFs, but the issue is that we want to compress the files that were compressed before being uploaded to our site. If files are uploaded without compression, we would like to leave those as they are.
To do that, we need to identify whether a PDF is compressed or not. I am wondering whether there is any way to identify whether a PDF is compressed, using iText or some other tool.
I have tried to Google it but couldn't find an appropriate answer.
Kindly let me know if you have any ideas.
Thanks
There are several types of compression you can encounter in a PDF: the data for individual objects can be compressed, and objects themselves can be compressed into object streams.
I voted Mark's answer up because he's right: you won't get an answer if you're not more specific. I'll add my own answer with some extra information.
In PDF 1.0, a PDF file consisted of a mix of ASCII characters for the PDF syntax and binary code for objects such as images. A page stream would contain visible PDF operators and operands, for instance:
56.7 748.5 m
136.2 748.5 l
S
This code tells you that a line has to be drawn (S) between the coordinate (x = 56.7; y = 748.5) (because that's where the cursor is moved to with the m operator) and the coordinate (x = 136.2; y = 748.5) (because a path was constructed using the l operator that adds a line).
Starting with PDF 1.2, one could start using filters for such content streams (page content streams, form XObjects). In most cases, you'll discover a /Filter entry with value /FlateDecode in the stream dictionary. You'll hardly find any "modern" PDFs of which the contents aren't compressed.
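As a crude illustration of that last point, you could scan the raw file bytes for a /FlateDecode filter name. This is not an iText API and not a real check (it misses filters named inside compressed object streams and cannot tell which stream the entry belongs to); the class name below is hypothetical and the sketch is only meant to show the idea.

```java
import java.nio.charset.StandardCharsets;

public class PdfCompressionCheck {
    // Heuristic: look for a /FlateDecode filter name in the raw bytes.
    // ISO-8859-1 maps every byte to a char 1:1, so the search is byte-safe.
    public static boolean mentionsFlateDecode(byte[] pdfBytes) {
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/FlateDecode");
    }

    public static void main(String[] args) {
        byte[] sample = "<< /Filter /FlateDecode /Length 10 >>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(mentionsFlateDecode(sample));
    }
}
```

For anything beyond a quick triage you would parse the file with a proper PDF library and inspect each stream dictionary's /Filter entry instead.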
Up until PDF 1.5, all indirect objects in a PDF document, as well as the cross-reference table, were stored in ASCII in the PDF file. Starting with PDF 1.5, specific types of objects can be stored in an object stream, and the cross-reference table can also be compressed into a stream. iText's PdfReader has an isNewXrefType() method to check whether this is the case. Maybe that's what you're looking for. Maybe you have PDFs that need to be read by software that isn't able to read PDFs of this type, but... you're not telling us.
Maybe we're completely misinterpreting the question. Maybe you want to know if you're receiving an actual PDF or a zip file with a PDF. Or maybe you want to really data-mine the different filters used inside the PDF. In short: your question isn't very clear, and I hope this answer explains why you should clarify.
When converting Markdown to HTML, the default is (I think) to convert an image file into a string and embed it in the HTML file. When running knit on an .Rhtml file this is not the case, though: here a separate figure folder is generated, which is of course a sensible default setting.
But if I want my images to be embedded, is there a way to achieve this with .Rhtml and knitr as well? I can't find any option to declare this.
Thanks, Mark
Alright, I figured it out myself. It seems to work just the same as with .Rmd files: simply pass the string "base64_images" to the options argument of knit2html.
knit2html("foo.Rhtml", options=c("base64_images"))