Can anyone tell me where the dictionary textfile is located on UNIX systems? Or where I can get a good dictionary textfile? I have been currently using a textfile from SUN but it contains abbreviations that are not followed by a period (or else I could remove them). Could somebody point me in the right direction? I cannot seem to find anything helpful on the Mac developer dictionary tools either. I am looking for something that only contains English words, no abbreviations, and no proper nouns. It is for a word game.
Try /usr/dict/words, /usr/share/dict/, or /var/lib/dict/.
Or google "linux dictionary text" or "linux words" and find:
http://www.linuxquestions.org/questions/linux-newbie-8/dictionary-file-in-linux-652559/
http://en.wikipedia.org/wiki/Words_(Unix)
etc.
Related
Is there a new line constant that's platform independent in R? I'm used to C# and there's Environment.NewLine which will return \r\n on windows and \n otherwise. Searching turned up nothing, but I assume there has to be something somewhere so that scripts can be platform independent.
Related question: Is there a way to detect the platform a script is running on? This could be useful to know for other reasons (which I haven't thought of yet).
EDIT: Here's why I'm asking. I'm downloading files from an FTP server, but want to get a list of files and only download files that are on the server that don't exist locally. Here's how I'm getting the list of files:
filesonserver <- unlist(strsplit(getURL(basePath, ftp.use.epsv=F, dirlistonly=T), "\n"))
On windows, the files are separated by \r\n. On my mac (where I'm currently working), they're separated by \n. I was looking for a way to make this platform independent. I haven't tried just separating by \n on windows, which might work. There might also be a way to get the list of files as a vector without having to split them, which would avoid this entirely...
The package tryCatchLog has a function determine.platform.NewLine():
https://cran.r-project.org/package=tryCatchLog
https://github.com/aryoda/tryCatchLog/blob/master/R/platform_newline.R
If you consequently use this string instead of hard-coded "\n" your new lines will work platform-independently.
The answer to the initial question appears to be there isn't a new line constant like C# has. But it doesn't matter in my case, as the comments pointed out. It didn't occur to me until after I edited in the details that I probably didn't need to worry about it. Splitting by \n works fine on windows, even though the string containing the files names returned by getURL() is split by \r\n.
Is there a native method in R to test if a file on disk is an ASCII text file, or a binary file? Similar to the file command in Linux, but a method that will work cross platform?
The file.info() function can distinguish a file from a dir, but it doesn't seem to go beyond that.
If all you care about is whether the file is ASCII or binary...
Well, first up definitions. All files are binary at some level:
is.binary <- function(file){
if(system.type() != "quantum computer"){
return(TRUE)
}else{
return(cat=alive&dead)
}
}
ASCII is just an encoding system for characters. It is therefore impossible to tell if a file is ASCII or binary, because ASCII-ness is a matter of interpretation. If I save a file and decide that binary number 01001101 is Q and 01001110 is Z then you might decode this as ASCII but you'll get the wrong message. Luckily the Americans muscled in and said "Hey, everyone use ASCII to code their text! You get 128 characters and a parity bit! Woo! Go USA!". IBM tried to tell people to use EBCDIC but nobody listened. Which was A Good Thing.
So everyone was packing ASCII-coded text into their 8-bit bytes, and using the eighth bit for parity checking. But then people stopped doing parity checking because TCP/IP handled all that, which was also A Good Thing, and the eighth bit was expected to be zero. If not, there was trouble.
Because people (read "Microsoft") started abusing the eighth bit, and making up their own encoding schemes, and so unless you knew what encoding scheme the file was using, you were stuffed. And the file very rarely told you what encoding scheme it was. And now we have Unicode and even more encoding schemes. And that is a third Good Thing. But I digress.
Nowadays when people ask if a file is binary, what they are normally asking is "Does any byte in this file have it's highest bit set?". Which you can do in R by reading a raw file connection as unsigned integers and testing the highest value. Something like:
is.binary <- function(filepath,max=1000){
f=file(filepath,"rb",raw=TRUE)
b=readBin(f,"int",max,size=1,signed=FALSE)
return(max(b)>128)
}
This will by default test only at most the first 1000 characters. I think the file command does something similar.
You may want to change the test to check for printable character codes, and whitespace, and line feed, carriage return, and other codes you might want to consider plausible in your non-binary files...
Well, how would you do that? I guess you can't without reading (parts or all of) the file, which is why files extensions are used to signal content type.
I looked into that years ago---and as I recall, the file(1) apps actually reads the first few header bytes of a file and compares that to what is stored in a lookup table. Sounds like a good candidate for an add-on package to me..
The example section of the manual for ?raw uses this:
isASCII <- function(txt) all(charToRaw(txt) <= as.raw(127))
The text to be checked is in Greek, but I would like to know if it can be done for English words too. My initial idea is described here, and I have already found a way to do it using VBA. But I wonder if there's a way to do it using R. If there isn't a way in R, do you think of something better than Excel-vba?
Alternatively, OpenOffice ships with a dictionary that entries stored in a text file. You can read that and remove the word definitions to create your word list.
This was tested on v3.0; the file location may have shifted, and the filename will change depending on which dictionary you want.
library(stringr)
dict <- readLines("C:/Program Files/OpenOffice.org 3/share/uno_packages/cache/uno_packages/174.tmp_/dict-en.oxt/th_en_US_v2.dat")
is_word <- str_detect(dict, "^[^(]")
words <- str_split_fixed(dict[is_word], "\\|", 2)
words <- words[,1]
This list contains some multi-word phrases. You may prefer to split on the first space, and take unique values. You probably also want to write words to file, to save repeating yourself.
Once this is done, checking a word is as easy as
c("persnickety", "sqwrzib") %in% words # TRUE FALSE
There exists an open source GNU spell checker called Aspell with suppot for various languages. This is a command line program which I basically use for scanning bunches of text files at once (then the output is just given to the console).
But there also exists a C API and perhaps more interesting for you a Pipe mode which accepts streams of texts and outputs to the standard output.
Hope this helps.
I'm trying to parse a file that looks sort of hex encoded but mostly not. I contacted support for the vendor who created the file and they said that they it can be parsed using "an 0x116 offset"
What is a 0x116 offset?
It took me 2 weeks to get an answer from the vendor on my first question, so I wanted to see if someone here could help me make sense of! Thank you!
"0x116 offset" means nothing. It could be a value that needs to be added to words or subtracted to remove some naive encoding, or anything else for that matter.
Could you post a part of the file? Is it binary or text? Could you define "mostly not"?
What vendor/software package/device does this file come from?
How to count the words in a document, get the result same as the result of MS OFFICE?
In theory you'd first have to define what you see as a word (see also Jason Williams' post). Then you open the document with whatever language you're planning to use for this. You translate the document from Microsoft's proprietary format to something nice and clean.
Then its simply a matter of counting the occurrences of the afore mentioned word definition.
The hard part here will be the parsing of the office document. Luckily for you, Microsoft has relceased their proprietary format specification!
Its a bit long winded, but perhaps you can find somebody who has done the hard work for you, or you can try doing it from scratch.
Alternatively, if you're willing to reveal what language you're planning on using and what operating system, things can be a lot easier (if you're on Windows and have Office installed, for example, you can use OLE plug-ins.)
Also, have a look at this blog post about that format of Office documents featuring some helpful information (courtesy of google)
Without knowing your environment all I can tell you is that you would need to implement something like this:
Take the entire document as a string.
Split the string on whitespace.
The number of items in the resulting sequence will be the number of words in the document.
Basic word splitting uses whitespace and punctuation (.,?!"'- etc - indeed any non-alphanumeric or character usually) characters to split the words.
Make sure you skip sequences of punctuation/whitespace instead of counting extra "words" between them.
You will have to decide whether numbers are "words" or not. And whether "$123,456.78" is one word or three.
You may also want to apply other rules - for example, if you are looking for words in source code, you may wish to treat +-=*/()&^%$ characters as "whitespace". If you have identifiers in camelCase or PascalCase styles, you may want to take the "words" you have found and check if they have uppercase characters in the middles or the words.
Fundamentally, it's an easy problem - you just have to decide what a "word" is. You can be as simple or as complicated as you like about it.
The best way to get the same word count as Office would be to use macros or automation to use MS Word to load the text and calculate the word count.
If you take the whole document as a String, this code (in java) may work for you:
private int wordCount(String str){
String[] words = str.trim().split("\\s+");
for (int i = 0; i < words.length; i++) {
words[i] = words[i].replaceAll("[^\\w]", "");
}
return words.length;
}