Word count of a string - count

How do I count the words in a document so that the result matches the count reported by MS Office?

In theory you'd first have to define what you see as a word (see also Jason Williams' post). Then you open the document with whatever language you're planning to use for this. You translate the document from Microsoft's proprietary format to something nice and clean.
Then it's simply a matter of counting the occurrences of the aforementioned word definition.
The hard part here will be parsing the Office document. Luckily for you, Microsoft has released its proprietary format specification!
It's a bit long-winded, but perhaps you can find somebody who has done the hard work for you, or you can try doing it from scratch.
Alternatively, if you're willing to reveal what language you're planning on using and what operating system, things can be a lot easier (if you're on Windows and have Office installed, for example, you can use OLE plug-ins.)
Also, have a look at this blog post about the format of Office documents, which has some helpful information (courtesy of Google).

Without knowing your environment all I can tell you is that you would need to implement something like this:
Take the entire document as a string.
Split the string on whitespace.
The number of items in the resulting sequence will be the number of words in the document.

Basic word splitting uses whitespace and punctuation characters (.,?!"'- etc.; indeed, usually any non-alphanumeric character) to split the words.
Make sure you skip sequences of punctuation/whitespace instead of counting extra "words" between them.
You will have to decide whether numbers are "words" or not. And whether "$123,456.78" is one word or three.
You may also want to apply other rules. For example, if you are looking for words in source code, you may wish to treat +-=*/()&^%$ characters as "whitespace". If you have identifiers in camelCase or PascalCase style, you may want to take the "words" you have found and check whether they have uppercase characters in the middle of the words.
Fundamentally, it's an easy problem - you just have to decide what a "word" is. You can be as simple or as complicated as you like about it.
The best way to get the same word count as Office would be to use macros or automation to use MS Word to load the text and calculate the word count.
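As a rough illustration of how much the definition of a "word" changes the answer, here is a minimal Python sketch. The function names and the regular expressions are my own assumptions, not anything Word does internally, so don't expect the counts to match Office exactly.

```python
import re


def count_whitespace_words(text):
    """Count tokens separated by runs of whitespace."""
    # str.split() with no arguments skips leading, trailing and repeated whitespace,
    # so no empty "words" are counted between separators.
    return len(text.split())


def count_alnum_words(text):
    """Count runs of letters, digits and underscores; everything else is a separator."""
    return len(re.findall(r"\w+", text))


def split_camel_case(identifier):
    """Split a camelCase or PascalCase identifier into parts (illustrative rule only)."""
    return re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", identifier)
```

With these definitions, "$123,456.78" is one word under the whitespace rule and three under the alphanumeric rule, which is exactly the decision described above.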

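If you take the whole document as a String, this code (in Java) may work for you:
private int wordCount(String str) {
    // Strip punctuation first, so that tokens made up only of punctuation
    // (for example a lone "-") are not counted as words.
    String cleaned = str.replaceAll("[^\\p{L}\\p{N}\\s]", "").trim();
    if (cleaned.isEmpty()) {
        return 0; // split() on an empty string would still return one element
    }
    // Split on runs of whitespace and count the remaining tokens.
    return cleaned.split("\\s+").length;
}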

Related

How to process latex commands in R?

I work with knitr() and I wish to transform inline LaTeX commands like "\label" and "\ref", depending on the output target (LaTeX or HTML).
In order to do that, I need to (programmatically) generate valid R strings that correctly represent the backslash: for example "\label" should become "\\label". The goal would be to replace all backslashes in a text fragment with double backslashes.
But it seems that I cannot even read these strings, let alone process them. If I define:
okstr <- function(str) "do something"
then when I call
okstr("\label")
I immediately get the error "unrecognized escape sequence"
(of course, since \l is not a valid escape sequence).
So my question is: does anybody know a way to read strings in R without using the escaping mechanism?
Yes, I know I could do it manually, but that's the point: I need to do it programmatically.
There are many questions that are close to this one, and I have spent some time browsing, but I have found none that yields a workable solution for this.
Best regards.
Inside R code, you need to adhere to R’s syntactic conventions. And since \ in strings is used as an escape character, it needs to form a valid escape sequence (and \l isn’t a valid escape sequence in R).
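In a regular string literal there is simply no way around this. (Since R 4.0.0 there are also raw string literals, e.g. r"(\label)", in which backslashes are taken literally.)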
But if you are reading the string from elsewhere, e.g. using readLines, scan or any of the other file reading functions, you are already getting the correct string, and no handling is necessary.
Alternatively, if you absolutely want to write LaTeX-like commands in literal strings inside R, just use a different character for \; for instance, +. Just make sure that your function correctly handles it everywhere, and that you keep a way of getting a literal + back. Here’s a suggestion:
okstr("+label{1 ++ 2}")
The implementation of okstr then needs to replace single + by \, and double ++ by + (making the above result in \label{1 + 2}). But consider in which order this needs to happen, and how you’d like to treat more complex cases; for instance, what should the following yield: okstr("1 +++label")?
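One way to sidestep the ordering problem is to scan the string once from left to right and try the double ++ before the single +. The sketch below shows that idea in Python rather than R, purely for illustration; the function body is my own assumption about how okstr could behave, not a definitive implementation.

```python
import re


def okstr(text):
    """Hypothetical helper: '++' becomes a literal '+', a single '+' becomes a backslash."""
    # The alternation tries '++' first at every position, so a single pass
    # handles both cases without one replacement interfering with the other.
    return re.sub(r"\+\+|\+", lambda m: "+" if m.group() == "++" else "\\", text)
```

Under this left-to-right rule, okstr("1 +++label") gives 1 +\label (the first two plus signs collapse to one, the third becomes the backslash). A different policy would be just as defensible, which is why the question is worth deciding up front.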

I am receiving \u200b in certain translations from the google-translate api

I am new to the Google Translate API, and during testing we noticed that some translations (I have not been able to find a pattern yet) contain \u200b characters in the response. This causes a lot of issues, and above all it does not seem to serve any purpose or make any sense. A simple example:
https://www.googleapis.com/language/translate/v2?key=YOURKEY&source=NL&target=EN&q=Hergeneer%20verkopen
returns:
{
"data": {
"translations": [
{
"translatedText": "Sell \u200b\u200bHerge Down"
}
]
}
}
Our software stumbles over these \u200b strings/characters and I have not found a way to prevent them or get rid of them.
Please read the documentation of the JSON format: https://json.org/
A string is a sequence of zero or more Unicode characters.
A char is either any Unicode character except " or \ or control-character,
[...]
or it is \u followed by four hex-digits.
We are in this last case, \u followed by four hex-digits, and it represents a Unicode character: Unicode Character 'ZERO WIDTH SPACE' (U+200B). It even has its own Wikipedia page: Zero-width space. And its Stack Overflow question: What's HTML character code 8203?.
Now, there are plenty of Unicode characters with special behaviors, and this is one of them: an invisible one, among others. So you need to be aware of how Unicode works, and you should sanitize input and output from third-party APIs (and user input as well).
Just define the list of characters that you actually want to support, and be sure to strip or filter out all the other ones. For instance, if you desire to support NL and EN, then you could strip what is outside the Latin script in Unicode.
Stripping the U+200B that you're encountering and other undesirable characters may save you from potential surprises like with:
big characters ⎲⎳
zalgo characters C̨̦̺̩̲̥͉̭͚̜̻̝̣̼͙̮̯̪o̴̡͇̘͎̞̲͇̦̲͞͡m̸̩̺̝̣̹̱͚̬̥̫̳̼̞̘̯͘ͅẹ͇̺̜́̕͢
invisible characters
emojis 👨‍👩‍👧‍👦#️⃣🏳️‍🌈
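A minimal Python sketch of that kind of sanitizing, assuming the response shape shown in the question. The helper names are hypothetical. The second helper removes every character in the Unicode "format" category (Cf), which is broader than U+200B and would also remove the zero-width joiners used inside emoji sequences, so treat it as one possible policy rather than the right one.

```python
import json
import unicodedata


def strip_zero_width_space(text):
    """Remove only U+200B ZERO WIDTH SPACE."""
    return text.replace("\u200b", "")


def strip_format_characters(text):
    """Remove all characters in Unicode category Cf (includes U+200B and U+200D)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# Response body copied from the question; json.loads turns \u200b into the real character.
body = '{"data": {"translations": [{"translatedText": "Sell \\u200b\\u200bHerge Down"}]}}'
raw = json.loads(body)["data"]["translations"][0]["translatedText"]
clean = strip_zero_width_space(raw)
```

Here raw is 17 characters long and clean is "Sell Herge Down", 15 characters, because the two invisible characters are gone.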

String continuation across multiple lines, no newline characters

I am using the RODBC library to bring data into R. I have a long query that I want to pass a variable to, much like this SO user.
Problem is that R interprets the whitespace/carriage returns in my query as a newline '\n'.
The accepted solution for this question suggests to simply break up the text into chunks and then paste() together - which works, but ideally I'd like to keep the whitespace intact - makes it easier to test/verify the behavior of the query over in the database before pasting into R.
In other languages I'm familiar with there's a simple line continuation character - indeed, several of the comments on the accepted answer are looking for an approach similar to python's \.
I found an aside to a workaround using strwrap deep in the bowels of an R discussion list, so in the interest of making the internet better I will post it here. However, if someone can point me toward a more elegant or straightforward solution, I will happily accept your answer.
I don't know if you will find this helpful or not, but I have eventually gravitated towards keeping my SQL separate from my R scripts. Keeping the query in my R script, except for very very short ones, I find gets unreadable very quickly.
These days, I tend to keep queries that are more than a single line in their own separate .sql file. Then I can keep them nice and formatted and readable in a nice text editor, and read them into R as needed via something like this:
read_sql <- function(path){
stopifnot(file.exists(path))
sql <- readChar(path,nchar = file.info(path)$size)
sql
}
For binding parameters into the queries, I just keep a %s where the parameter will go in the .sql file, and then add in the parameters in R using sprintf.
I've been much happier this way, as I was finding that cluttering up my R scripts with really long paste statements and multi-line character objects was making my code really hard to read.
R's strwrap will destroy whitespace, including newline characters, per the documentation.
Essentially, you can get the desired behavior by initially letting R introduce line breaks/newline \ns, and then immediately stripping them out.
#make query using PASTE
query_1 <- paste("SELECT map.ps_studentid
,students.first_name || ' ' || students.last_name AS full_name
,map.testritscore
,map.termname
,map.measurementscale
FROM map$comprehensive_with_growth map
JOIN students
ON map.ps_studentid = students.id
WHERE map.termname = '",map_term,"'", sep='')
#remove newline characters introduced above.
#width is an arbitrary big number-
#it just needs to be longer than your string.
query_1 <- strwrap(query_1, width=10000, simplify=TRUE)
#execute the query
map_njask <- sqlQuery(XE, query_1)
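query <- gsub(pattern='\\s+', replacement=" ", x=query)
Try using sprintf to get variable substitution, and then collapse all newlines and runs of whitespace into single spaces, as in the line above (replacing them with an empty string would glue the SQL keywords together).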
See my answer to a similar question for details.
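The same collapse-the-whitespace idea, sketched in Python for comparison. The function name and the "Fall 2012" term value are made up for illustration; the table and column names are taken from the query above. Note that this also collapses whitespace inside quoted SQL literals, which is an assumption you would need to check for your own queries.

```python
def collapse_whitespace(query):
    """Replace every run of whitespace (spaces, tabs, newlines) with a single space."""
    # split() with no arguments splits on any whitespace run and drops empty pieces.
    return " ".join(query.split())


# Keep the query readable across lines, fill in the parameter, then flatten it.
template = """
SELECT map.ps_studentid
      ,map.testritscore
FROM map$comprehensive_with_growth map
WHERE map.termname = '%s'
"""
query = collapse_whitespace(template % "Fall 2012")
```

The formatted, multi-line version stays in the source for testing against the database, and only the flattened string is sent.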

Is there a way to check the spelling of words in a character vector?

The text to be checked is in Greek, but I would like to know if it can be done for English words too. My initial idea is described here, and I have already found a way to do it using VBA. But I wonder if there's a way to do it using R. If there isn't a way in R, can you think of something better than Excel VBA?
Alternatively, OpenOffice ships with a dictionary whose entries are stored in a text file. You can read that and remove the word definitions to create your word list.
This was tested on v3.0; the file location may have shifted, and the filename will change depending on which dictionary you want.
library(stringr)
dict <- readLines("C:/Program Files/OpenOffice.org 3/share/uno_packages/cache/uno_packages/174.tmp_/dict-en.oxt/th_en_US_v2.dat")
is_word <- str_detect(dict, "^[^(]")
words <- str_split_fixed(dict[is_word], "\\|", 2)
words <- words[,1]
This list contains some multi-word phrases. You may prefer to split on the first space, and take unique values. You probably also want to write words to file, to save repeating yourself.
Once this is done, checking a word is as easy as
c("persnickety", "sqwrzib") %in% words # TRUE FALSE
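For comparison, here is the same filtering logic as a small Python sketch. It assumes the layout the R code above relies on: entry lines look like word|count, and definition lines start with "(". The sample lines are made up for illustration and are not taken from the real file.

```python
def load_words(lines):
    """Build a set of headwords from thesaurus-style lines (assumed layout, see above)."""
    words = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines and definition lines, which start with "(".
        if not line or line.startswith("("):
            continue
        # Keep only the part before the first "|".
        words.add(line.split("|", 1)[0])
    return words


# Hypothetical sample in the assumed layout.
sample = ["persnickety|1", "(adj)|finicky|fussy", "simple|2", "(adj)|easy", "(noun)|herb"]
words = load_words(sample)
checks = [w in words for w in ("persnickety", "sqwrzib")]
```

A set makes each lookup cheap, which matters once the word list has a few hundred thousand entries.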
There is an open-source GNU spell checker called Aspell with support for various languages. It is a command-line program that I mostly use for scanning batches of text files at once (the output then just goes to the console).
But there also exists a C API and perhaps more interesting for you a Pipe mode which accepts streams of texts and outputs to the standard output.
Hope this helps.

How do I encode this ugly string?

I have a string that is:
!"#$%&'()*+,-./0123456789:;?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[]\^_`abcdefghijklmnopqrstuvwxyz{|}~¡¢£¤¥¦§¨©ª« ®¯°±²³´µ¶•¸¹º»¼½¾¿ÀÁÂÃÄÅàáâäèçéêëìíîïôö÷òóõùúý
I post that to a service using HtmlEncode, and I get this result:
!#$%&'()* ,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~����������� ���������•������������������������������������
This isn't the result I need. How do I get the original string back? Thanks!
Your string is not ASCII, so you are either using a string to represent binary data, or you're not maintaining awareness of multi-byte encoding. In any case, the simplest way to deal with any Internet-based technology (HTTP, SMTP, POP, IMAP) is to encode it as 7-bit clean. One common way is to base64-encode your data, send it across the wire, then base64-decode it before trying to process it.
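A minimal Python sketch of that round trip, assuming the text is UTF-8 on both ends (the question does not say which encoding the service expects). The helper names are hypothetical.

```python
import base64


def to_wire(text):
    """Encode text as UTF-8 bytes, then as 7-bit-clean base64."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def from_wire(payload):
    """Reverse of to_wire: base64-decode, then interpret the bytes as UTF-8."""
    return base64.b64decode(payload).decode("utf-8")


original = "¡¢£¤¥ÀÁÂàáâ"
payload = to_wire(original)
restored = from_wire(payload)
```

The payload contains only ASCII characters, and restored equals original as long as both sides agree on the encoding.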
I believe this is what you're looking for:
!"#$%&&apos;()*+,-./0123456789:;?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[]\\^_`abcdefghijklmnopqrstuvwxyz{|}~¡¢£¤¥¦§¨©ª«®¯°±²³´µ¶•¸¹º»¼½¾¿ÀÁÂÃÄÅàáâäèçéêëìíîïôö÷òóõùúý
You just need to use a better html entity/encoding library or tool. The one I used to generate this is from Ruby - I used the HTML Entities library. The code I wrote to do this follows. I had to put your text in input.txt to preserve Unicode (there was an EOF character in the string), but it worked great.
require 'rubygems'
require 'htmlentities'
str = File.read('input.txt')
coder = HTMLEntities.new
puts coder.encode(str, :named)
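If Ruby is not an option, a similar effect can be sketched with the Python standard library. This is an alternative technique, not the htmlentities gem used above: it escapes the HTML-special characters and writes every non-ASCII character as a numeric character reference instead of a named entity. The function name is my own.

```python
import html


def encode_entities(text):
    """Escape &, <, >, quotes, then turn non-ASCII characters into &#NNN; references."""
    escaped = html.escape(text, quote=True)
    # 'xmlcharrefreplace' replaces anything outside ASCII with a numeric reference.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


encoded = encode_entities("<é & ©>")
decoded = html.unescape(encoded)
```

The encoded form is pure ASCII, and html.unescape turns it back into the original string.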
