If I have a bunch of .ts files, I would like to create a new .ts file where the translation starts with the English translation, then after that contains the remainder of the longest translation (or filler to that length), obviously removing extra %1, %2, etc. This would allow me to test layout on a translation that I can read, but that overflows in the worst way. If it worked well, it would mean I wouldn't have to look every language for overflow.
For example, if I have "Hello world" and "Bonjour le monde", I'd like a translation that reads "Hello worldmonde" (or maybe "Hello world█onde" or something).
This seems like it's a short Python script, but less short to actually get right; especially if it calculated string length in pixels given my font, rather than just characters.
I can't be the only one wanting something like this, but Googling and searching SO hasn't answered this. Is there a tool for this I'm not finding? This must be a solved problem.
Related
In 2001 German scene group Farbrausch released a demo called "fuenf" (in your face). pouet.net It contains a 5 Byte executable which could be rather considered a troll approach than a demo. If you run it your hear a weird sound and it could crash your computer. At least it produces a sound. Whatever.
The hexadecimal content is:
95cd 21eb fc
And the binary representation is:
10010101 11001101 00100001 11101011 11111100
Using xxd I also get the printable chars from the content, which are:
..!..
And that makes me a little confused. Looking up the values in the ASCII table (e.g. here), I get this as a result:
•Í!ëü
At least the exclamation mark is correct.
But how does 95cd21ebfc translate into ..!..?
Side note:
file -bi fuenf.com sais the encoding is not known:
charset=unknown-8bit
And iconv -f ISO-8859-1 -t UTF-8 fuenf.com returns
Í!ëü
Which leads to the assumption, that XXD simply cannot decode the content and therefore just uses default results, like the dot?
First of all, this is not a text file, so looking at it as one makes no sense. It's instructions.
Secondly, even if it could be interpreted as text, you would need to know the encoding. It's definitely not ASCII, because that only defines symbols in the range 0-127 (and the 3rd byte here is the only one in that range, which maps to '!'). The "extended ASCII" table you link to is only one of many possible code pages that give meaning to the value from 128-255, but there are many of those code pages. Calling it "extended ASCII" is misleading, because it suggests that ASCII created an updated standard for this, which they did not. For a while, computer vendors just did whatever they wanted with those additional characters, and some of them became quasi-standards by virtue of being included in DOS, Windows, etc. Or they got standardized by ISO (you tried iso-8859-1, which is one such standard).
I would like to write a line in a text file at a given position (i) by avoiding the sequential reading.
There is WriteLines base function but I don't know how to insert the text at position (i) given as parameter.
Thanks
Dave
This is — unrelated to R — fundamentally impossible. Most (all common) filesystems do not support inserting or removing content in the middle of a file. The only supported operations are appending (or truncation) at the end, and R only supports appending, not truncation.
The way virtually all software solves your problem is by reading the file, modifying it, and writing it back to disk. If you want to get fancy because the file is very large (at least in the order of hundreds of MiB), you can stream edit the file: Read a part, edit that part, write it back to a new file. Rinse and repeat.
Technical aside: There is one exception to the above with low-level file operations, since files are stored as as non-contiguous “blocks”. But even if R supported this it wouldn’t help you since it doesn’t permit byte-level or line-level granularity: Blocks are typically at least 4 kiB in size.
For example, I have many HTML tabs to style, they use different classes, and will have different backgrounds. Background images files have names corresponding to class names.
The way I found to do it is yank:
.tab.home {
background: ...home.jpg...
}
then paste, then :s/home/about.
This is to be repeated for a few times. I found that & can be used to repeat last substitute, but only for the same target string. What is the quickest way to repeat a substitute with different target string?
Alternatively, probably there are more efficient ways to do such a thing?
I had a quick play with some vim macro magic and came up with the following idea... I apologise for the length. I thought it best to explain the steps..
First, place the text block you want to repeat into a register (I picked register z), so with the cursor at the beginning of the .tab line I pressed "z3Y (select reg z and yank 3 lines).
Then I entered the series of VIM commands I wanted into the buffer as )"zp:.,%s/home/. (Just press i and type the commands)
This translate to;
) go the end of the current '{}' block,
"zp paste a copy of the text in register z,
.,%s/home/ which has two tricks.
The .,% ensures the substitution applies to everything from the start of the .tab to the end of the closing }, and,
The command is incomplete (ie, does not have a at the end), so vim will prompt me to complete the command.
Note that while %s/// will perform a substitution across every line of the file, it is important to realise that % is an alias for range 1,$. Using 1,% as a range, causes the % to be used as the 'jump to matching parenthesis' operator, resulting in a range from the current line to the end of the % match. (which in this example, is the closing brace in the block)
Then, after placing the cursor on the ) at the beginning of the line, I typed "qy$ which means yank all characters to the end of the line into register q.
This is important, because simply yanking the line with Y will include a carriage return in the register, and will cause the macro to fail.
I then executed the content of register q with #q and I was prompted to complete the s/home/ on the command line.
After typing the replacement text and pressing enter, the pasted block (from register z) appeared in the buffer with the substitutions already applied.
At this point you can repeat the last #qby simple typing ##. You don't even need to move the cursor down to the end of the block because the ) at the start of the macro does that for you.
This effectively reduces the process of yanking the original text, inserting it, and executing two manual replace commands into a simple ##.
You can safely delete the macro string from your edit buffer when done.
This is incredibly vim-ish, and might waste a bit of time getting it right, but it could save you even more when you do.
Vim macro's might be the trick you are looking for.
From the manual, I found :s//new-replacement. Seemed to be too much typing.
Looking for a better answer.
Is there a native method in R to test if a file on disk is an ASCII text file, or a binary file? Similar to the file command in Linux, but a method that will work cross platform?
The file.info() function can distinguish a file from a dir, but it doesn't seem to go beyond that.
If all you care about is whether the file is ASCII or binary...
Well, first up definitions. All files are binary at some level:
is.binary <- function(file){
if(system.type() != "quantum computer"){
return(TRUE)
}else{
return(cat=alive&dead)
}
}
ASCII is just an encoding system for characters. It is therefore impossible to tell if a file is ASCII or binary, because ASCII-ness is a matter of interpretation. If I save a file and decide that binary number 01001101 is Q and 01001110 is Z then you might decode this as ASCII but you'll get the wrong message. Luckily the Americans muscled in and said "Hey, everyone use ASCII to code their text! You get 128 characters and a parity bit! Woo! Go USA!". IBM tried to tell people to use EBCDIC but nobody listened. Which was A Good Thing.
So everyone was packing ASCII-coded text into their 8-bit bytes, and using the eighth bit for parity checking. But then people stopped doing parity checking because TCP/IP handled all that, which was also A Good Thing, and the eighth bit was expected to be zero. If not, there was trouble.
Because people (read "Microsoft") started abusing the eighth bit, and making up their own encoding schemes, and so unless you knew what encoding scheme the file was using, you were stuffed. And the file very rarely told you what encoding scheme it was. And now we have Unicode and even more encoding schemes. And that is a third Good Thing. But I digress.
Nowadays when people ask if a file is binary, what they are normally asking is "Does any byte in this file have it's highest bit set?". Which you can do in R by reading a raw file connection as unsigned integers and testing the highest value. Something like:
is.binary <- function(filepath,max=1000){
f=file(filepath,"rb",raw=TRUE)
b=readBin(f,"int",max,size=1,signed=FALSE)
return(max(b)>128)
}
This will by default test only at most the first 1000 characters. I think the file command does something similar.
You may want to change the test to check for printable character codes, and whitespace, and line feed, carriage return, and other codes you might want to consider plausible in your non-binary files...
Well, how would you do that? I guess you can't without reading (parts or all of) the file, which is why files extensions are used to signal content type.
I looked into that years ago---and as I recall, the file(1) apps actually reads the first few header bytes of a file and compares that to what is stored in a lookup table. Sounds like a good candidate for an add-on package to me..
The example section of the manual for ?raw uses this:
isASCII <- function(txt) all(charToRaw(txt) <= as.raw(127))
How to count the words in a document, get the result same as the result of MS OFFICE?
In theory you'd first have to define what you see as a word (see also Jason Williams' post). Then you open the document with whatever language you're planning to use for this. You translate the document from Microsoft's proprietary format to something nice and clean.
Then its simply a matter of counting the occurrences of the afore mentioned word definition.
The hard part here will be the parsing of the office document. Luckily for you, Microsoft has relceased their proprietary format specification!
Its a bit long winded, but perhaps you can find somebody who has done the hard work for you, or you can try doing it from scratch.
Alternatively, if you're willing to reveal what language you're planning on using and what operating system, things can be a lot easier (if you're on Windows and have Office installed, for example, you can use OLE plug-ins.)
Also, have a look at this blog post about that format of Office documents featuring some helpful information (courtesy of google)
Without knowing your environment all I can tell you is that you would need to implement something like this:
Take the entire document as a string.
Split the string on whitespace.
The number of items in the resulting sequence will be the number of words in the document.
Basic word splitting uses whitespace and punctuation (.,?!"'- etc - indeed any non-alphanumeric or character usually) characters to split the words.
Make sure you skip sequences of punctuation/whitespace instead of counting extra "words" between them.
You will have to decide whether numbers are "words" or not. And whether "$123,456.78" is one word or three.
You may also want to apply other rules - for example, if you are looking for words in source code, you may wish to treat +-=*/()&^%$ characters as "whitespace". If you have identifiers in camelCase or PascalCase styles, you may want to take the "words" you have found and check if they have uppercase characters in the middles or the words.
Fundamentally, it's an easy problem - you just have to decide what a "word" is. You can be as simple or as complicated as you like about it.
The best way to get the same word count as Office would be to use macros or automation to use MS Word to load the text and calculate the word count.
If you take the whole document as a String, this code (in java) may work for you:
private int wordCount(String str){
String[] words = str.trim().split("\\s+");
for (int i = 0; i < words.length; i++) {
words[i] = words[i].replaceAll("[^\\w]", "");
}
return words.length;
}