I've written a program that uses the Google Translate Python API to translate webpages. Most of the time, the API does translation as I expect, but in some cases text within a tag does not get translated.
I tried putting one such tag in the Google Translate web interface and found that the text is still not translated; i.e., the problem has to do with the Google Translate service rather than the way I am using the API.
The specific tag I am looking at is: <div class="someClass">World:</div>
I want the word "World" to be translated in the output, regardless of the language into which I am translating. In certain languages, such as French and Khmer, the word "World" is translated as expected, but in other languages, such as Spanish and Somali, it remains "World." I have noticed that removing the class attribute sometimes helps (translation then works in Spanish but not in Somali), and adding more text seems to help as well (I've never seen this issue when the text is a full sentence or paragraph, for example).
In the context of my project, it is particularly important that the case of a tag with just one word inside be handled correctly. Does anyone know why this is happening or how I can make translation happen consistently? A solution requiring minimal to no changes to the original HTML would be ideal.
Edit: A little more context based on playing around with things. Directly calling google.cloud.translate.Client().translate('<div class="someClass">World:</div>', 'es') actually has the correct behavior: "World" becomes "Mundo." I incrementally lengthened the page text by adding tags that came before and after that div in the original webpage--none of which wrapped more than one word of text--and the text between tags stopped being translated when the page text was around 1,000 characters long. However, when I changed "World:" to a whole sentence, all of the text between tags was translated even when the page text was longer than 1,000 characters.
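Since the short fragment translates correctly on its own, one possible workaround is to send each text node to the service separately instead of the whole page. The following is only a sketch under that assumption: `translate_text_nodes` and the `translate` callable are hypothetical names, and the regex is a rough stand-in for a real HTML parser (it does not special-case scripts, styles, comments, or entities).

```python
import re
from typing import Callable

# Matches the text between a '>' and the next '<', i.e. a text node.
TEXT_NODE = re.compile(r">([^<>]+)<")


def translate_text_nodes(html: str, translate: Callable[[str], str]) -> str:
    """Translate each text node on its own, leaving tags and attributes untouched."""

    def replace(match: re.Match) -> str:
        text = match.group(1)
        if not text.strip():
            # Whitespace-only node: nothing to translate, keep it as-is.
            return match.group(0)
        return ">" + translate(text) + "<"

    return TEXT_NODE.sub(replace, html)


if __name__ == "__main__":
    # Stand-in translator for demonstration; a real one would call the API.
    fake = {"World:": "Mundo:"}
    print(translate_text_nodes('<div class="someClass">World:</div>',
                               lambda s: fake.get(s, s)))
```

With the client from the question, `translate` could presumably wrap something like `Client().translate(text, 'es')['translatedText']`. This costs one request per text node unless the calls are batched, and the original HTML stays unchanged.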
I am new to CSS and web development in general. Hopefully there is a way to accomplish what I am trying to do. What I am trying to do is simple to explain, but I need to give some background info first, sorry for the length of the post.
I have created a webpage that is in the Tibetan language. Tibetan does not have spaces between words, it only has a character called a "tsheg" (་ - U+0F0B) that is used to separate every syllable. It also has a mark called a "shey" (། - U+0F0D) that comes at the end of phrases and clauses and sentences. Although sometimes it is doubled, after a shey is generally a space before the next line of text. When typing in Tibetan this space is represented not as a normal space (U+0020) but instead U+00A0, however when it comes to browsers and HTML/coding in general these two seem to behave the same.
In any Tibetan writing, the ideal aesthetic is for full justification. Traditionally there would be slight spaces placed between the tsheg marks and the shey marks to achieve a perfectly flush left and right alignment. (The exception would be the last line of a text, or a paragraph in contemporary formatting, does not need to be justified). It is acceptable for lines to break mid-word or mid-sentence, but never mid syllable. So the last character on any line is going to be either a tsheg or a shey. It is also not acceptable to start a line with a shey. In the last few years this has been easy to achieve for desktop publishing using MS Word, using "Thai Justification." However that option is not available even in other Office products, never mind outside of the Office environment. Other work-arounds have been to add invisible width characters after every tsheg and shey, allowing for wrapping at any point.
Now comes the question and the difficulty. I am using distributed justification, which seems to be the best option. It does not break syllables up, which is important. But it prefers to break only at the spaces after shey marks. When there is a long string of text without a space it will break elsewhere, but if there is a space it breaks there, sometimes stretching one or two syllables across an entire line, which is obviously not ideal.
Now, when coding the HTML of the text I can use the same work-around that was used for desktop publishing before "Thai justification": I can add a <wbr> after every single tsheg, which will not be visible to the end user and should allow cleaner breaking. However, there are two problems with this. By inserting that many <wbr> tags I am essentially doubling, or close to doubling, my character count, which can make the page take twice as long to load, even if half of those characters are invisible. More important, it disrupts search functionality. Although you may see the word that has the syllables "AB", for instance, if you tried searching for AB you wouldn't find it, because the HTML sees "A<wbr>B". And being able to search is kind of critical, enough so that ugly formatting is preferable to losing the ability to search and to be indexed properly. Obviously, since I need the site to be responsive and I do not know what size screens will be used, I cannot use forced line breaks either, which is another trick used when publishing.
So, finally, my question. Is there a way I can define a style or function or some sort of element that automatically associates a certain character--in my case the tsheg character--as having a <wbr> command after it without actually needing to input that command into my HTML? So when the text is justified it treats every tsheg as a <wbr>? I have a class .Tibetan in my stylesheet that defines the font and the justification and so forth, is there some way I can add some code there that achieves what I am looking for?
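One way to picture "treat every tsheg as a <wbr>" without touching the stored text is to add the break opportunities only when the page is rendered. The sketch below shows that idea in Python for illustration; the function name is hypothetical, the site itself would need the equivalent in its own templating layer, and the input is assumed to be plain text rather than markup. It skips a tsheg that is directly followed by a shey, so a line cannot start with a shey.

```python
import re

TSHEG = "\u0f0b"  # Tibetan mark intersyllabic tsheg
SHEY = "\u0f0d"   # Tibetan mark shad ("shey")

# A tsheg that is NOT immediately followed by a shey.
BREAKABLE_TSHEG = re.compile(TSHEG + "(?!" + SHEY + ")")


def add_break_opportunities(text: str) -> str:
    """Return text with <wbr> after each tsheg, for display only."""
    return BREAKABLE_TSHEG.sub(TSHEG + "<wbr>", text)


if __name__ == "__main__":
    sample = "\u0f40\u0f0b\u0f41\u0f0b\u0f0d"  # two syllables, then a shey
    print(add_break_opportunities(sample))
```

Because the stored text never contains the tag, anything that searches the stored text sees the syllables unbroken.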
The one other thing I tried was replacing all of the spaces with which gave a beautiful justified appearance but it also caused the browser to disregard the tsheg marks entirely and it allowed for the cutting in half of syllables.
If you want to see an example of what I am talking about you can visit this page of my site: http://publishing.simplebuddhistmonk.net/index.php/downloads/critical-editions/ and next to the word "English" click the Tibetan characters and that will bring up a paragraph of prose, or you can look here: http://publishing.simplebuddhistmonk.net/index.php/downloads/tibetan/essence-of-dispelling-errors-tib/ (though the formatting on that latter page is less egregious than the former, at least on my screen).
EDIT: It looks like the solution this person used might be adaptable for my use: "Dynamically add <wbr> tag before punctuation". However, I do not actually understand what I would need to add, and where, to make that work for me. Does anyone think that might apply to this scenario? And if so, what code would I add where?
NEW EDIT: So, I think the problem might be with the search function that comes from my WordPress theme. I used my workaround as mentioned above, adding the <wbr> tag after every tsheg, on this page: http://publishing.simplebuddhistmonk.net/index.php/downloads/tibetan/essence-of-dispelling-errors-tib/ and as you can see, it displays perfectly. But if you search for any phrase from that page using the search function that is up in my header, it will not find it. If you do a Ctrl+F and search on the page, though, it will find it. Even if you copy the text from the page and paste it into the search box it still does not find it. Copying the text into a word processor doesn't reveal any hidden or invisible characters. However, if you search for a term from this page http://publishing.simplebuddhistmonk.net/index.php/downloads/tibetan/beautiful-garland-ten-innermost-jewels-tib/ which I have not added the tags to, you will see that it finds it no problem.
So, that leads me to believe the error is in the search function. Any experience with this? Search is important, but I can quite possibly find alternative search widgets to replace the one that comes with the theme. What is most important, though, is that if you search for a line of text on Google it needs to be found. My site has not been indexed fully by any search engine, so I cannot yet confirm whether this does or does not affect them.
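If the site search compares the query against the stored markup (an assumption, not something confirmed here), then a phrase stored with a tag between its syllables would not match a clean query. A minimal sketch of the usual remedy, stripping break markup before text is indexed or compared, is below; `normalize_for_search` is a hypothetical name, and the zero-width space is included only because it is another common invisible break character.

```python
import re

WBR_TAG = re.compile(r"<wbr\s*/?>", re.IGNORECASE)
ZERO_WIDTH_SPACE = "\u200b"


def normalize_for_search(text: str) -> str:
    """Remove invisible break markup so stored text matches what a user types."""
    return WBR_TAG.sub("", text).replace(ZERO_WIDTH_SPACE, "")


if __name__ == "__main__":
    stored = "\u0f40\u0f0b<wbr>\u0f41\u0f0b<wbr>"
    query = "\u0f40\u0f0b\u0f41\u0f0b"
    print(query in stored)                        # False: the tag interrupts the phrase
    print(query in normalize_for_search(stored))  # True once the tags are stripped
```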
So... At this point I will take any advice I can get: any advice regarding the original question (is there a way to tell the stylesheet "if you are displaying X then treat it like X<wbr>"), or any idea about this issue with the search functionality and how the <wbr> tag may or may not affect search, both from within the site and from search engines.
When you capture through org-protocol and a browser, either through the capture(); function or encodeURIComponent(window.getSelection());, the text appears to be passed to the Emacs org-protocol server as plain text.
Is there a way to pull in some of the HTML heading/CSS style info to keep a minimal amount of formatting for readability? Most sites aren't anything close to plain text, so even selecting across a heading and a couple paragraphs comes out like garbage.
Edit: I found pandoc, which will do HTML to org-mode conversions, but the results are overkill. Is there any way to get just the formatting from the selected objects, rather than a blind parse of the HTML chunk?
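As a rough illustration of what "minimal formatting" could mean, the sketch below converts only headings, paragraphs, bold, and italics to org syntax and drops every other tag. It assumes the selection's HTML is available (for example by serializing `window.getSelection().getRangeAt(0).cloneContents()` in the bookmarklet), which goes beyond what the question sets up. The class and function names are made up for this example.

```python
import re
from html.parser import HTMLParser


class MinimalOrgConverter(HTMLParser):
    """Keep headings, paragraphs, bold and italics; ignore all other tags."""

    HEADINGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}
    INLINE = {"b": "*", "strong": "*", "i": "/", "em": "/"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.parts.append("\n" + "*" * self.HEADINGS[tag] + " ")
        elif tag == "p":
            self.parts.append("\n")
        elif tag in self.INLINE:
            self.parts.append(self.INLINE[tag])

    def handle_endtag(self, tag):
        if tag in self.HEADINGS or tag == "p":
            self.parts.append("\n")
        elif tag in self.INLINE:
            self.parts.append(self.INLINE[tag])

    def handle_data(self, data):
        # Collapse runs of whitespace the way a browser would.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_org(html: str) -> str:
    converter = MinimalOrgConverter()
    converter.feed(html)
    converter.close()
    text = "".join(converter.parts)
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Leave at most one blank line between blocks.
    return re.sub(r"\n{3,}", "\n\n", text).strip()


if __name__ == "__main__":
    print(html_to_org("<h2>Title</h2><p>Some <b>bold</b> text.</p>"))
```

Anything not listed (links, lists, tables, scripts) passes through as plain text, which is the trade-off compared with a full pandoc conversion.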
I have an HTML textarea that will accept a large block of text. The textarea needs to read newlines.
For example. If someone types the following:
Stack overflow is so cool.
I love it.
When I save it to the database and print it out again, it is shown as:
Stack overflow is so cool. I love it.
I need the newline (enter) to be recorded.
Anyone know how to do it?
Replace the HTML textarea with a CKEditor or FCKEditor control, which captures the user's entry as full HTML and saves it for later use and display.
For PHP, you can use the function nl2br:
http://ca3.php.net/manual/en/function.nl2br.php
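The newlines are normally still in the saved text; they just collapse when the text is printed as HTML. For anyone not on PHP, a rough Python equivalent of the same idea is sketched below. The function is named after the PHP one for convenience only, and it also escapes the text first, which PHP's nl2br does not do by itself.

```python
import html


def nl2br(text: str) -> str:
    """Escape user text and insert <br> before each newline for HTML output."""
    escaped = html.escape(text)
    # Normalize Windows and old Mac line endings before inserting the tags.
    normalized = escaped.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.replace("\n", "<br>\n")


if __name__ == "__main__":
    print(nl2br("Stack overflow is so cool.\nI love it."))
```

Store the text exactly as typed and apply this only when printing it, so the original newlines are never lost.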
I would like to display links that are pasted in as links rather than text but freetextbox does not seem to do this. For instance, if somebody pastes in http://www.stackoverflow.com it looks like a link but shows up only as text. Do I need to convert this myself or is there a setting in the editor to take care of this?
It depends. In most cases, you need to convert it yourself. Sometimes when you copy a link you are actually getting a link and not just the text. But yes, you'll have to get your hands dirty here.
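Doing the conversion yourself usually means scanning the submitted text for URLs and wrapping them in anchor tags. The sketch below shows the idea in Python for illustration only (FreeTextBox itself is a .NET control, so a real version would live in that code). It assumes plain-text input, and the URL pattern is deliberately simple rather than a full URL grammar.

```python
import html
import re

# Simple pattern: http(s):// followed by characters that are not
# whitespace, angle brackets, or quotes.
URL = re.compile(r"https?://[^\s<>\"']+")


def linkify(text: str) -> str:
    """Escape plain text and wrap bare http(s) URLs in <a> tags."""
    parts = []
    last = 0
    for match in URL.finditer(text):
        # Trailing punctuation is more likely sentence punctuation than URL.
        url = match.group(0).rstrip(".,;:!?)")
        end = match.start() + len(url)
        parts.append(html.escape(text[last:match.start()]))
        safe = html.escape(url, quote=True)
        parts.append(f'<a href="{safe}">{safe}</a>')
        last = end
    parts.append(html.escape(text[last:]))
    return "".join(parts)


if __name__ == "__main__":
    print(linkify("Visit http://www.stackoverflow.com today."))
```

Running this over content that already contains anchor tags would escape them, so it should only be applied to text that was pasted in as plain text.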