Using R to process Mail Files - r

I've done a bit of searching and after not finding much I thought I would post this question. Actually, because I've not found much, I think that may be an indicator of what the answer will be, but anyway...here it is:
Does anyone have any experience using R to process files for postal mailings...and if so...what packages do you use?
I realize R might not be the best tool for this task but sometimes you have to use the tools you have at hand and sometimes you have to do "extra" things at work to stay employed...so please don't flame me too hard for this question.
Basically I'm looking at merge purge, dup/elim sort of stuff. I've played with the compare() and merge() commands a bit. I'd like to incorporate some equivalencies in the compares such as
ST=St=St.=Street
BLVD=Blvd=Blvd.=Boulevard
etc...
I'm mostly wondering if packages have already been developed for this sort of data processing so I'm not reinventing the wheel.

I'd suggest the following basic workflow:
(1) Read in your data. I don't know what it looks like based on your question, so I'll assume that's easy for you.
(2) Use a mix of gsub, toupper, and other string manipulation tools to convert all the data to the same formats. I.e., get all addresses to use ST instead of St or street, etc.
(3) Merge everything into a single data frame.
(4) Use unique and/or sort/order to clean up the list and remove duplicates.
(5) Output to whatever format you're going for. Again, not clear from the question, so I can't offer specific advice here.
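The steps above can be sketched in base R roughly as follows. This is only a minimal illustration: the two input lists, the column names, the abbreviation table, and the normalize() helper are all made up for the example, and rbind() is used for the "merge" step on the assumption that both lists share the same columns.

```r
# (1) Hypothetical input: two small mailing lists with the same columns.
list_a <- data.frame(name = c("Ann Lee", "Bob Ray"),
                     address = c("12 Main St.", "40 Oak Blvd"),
                     stringsAsFactors = FALSE)
list_b <- data.frame(name = c("ann lee", "Cy Dow"),
                     address = c("12 Main Street", "7 Elm Boulevard"),
                     stringsAsFactors = FALSE)

# Illustrative (far from complete) table of long forms and their abbreviations.
abbrev <- c(STREET = "ST", BOULEVARD = "BLVD")

# (2) Bring strings to one format: upper case, no punctuation,
# single spaces, long forms replaced by abbreviations.
normalize <- function(x) {
  x <- toupper(trimws(x))
  x <- gsub("[.,]", "", x)
  x <- gsub("\\s+", " ", x)
  for (long in names(abbrev)) {
    x <- gsub(paste0("\\b", long, "\\b"), abbrev[[long]], x, perl = TRUE)
  }
  x
}

# (3) Stack the lists into one data frame and add normalized match keys.
combined <- rbind(list_a, list_b)
combined$name_key <- normalize(combined$name)
combined$addr_key <- normalize(combined$address)

# (4) Drop rows whose keys repeat an earlier row, then sort.
deduped <- combined[!duplicated(combined[c("name_key", "addr_key")]), ]
deduped <- deduped[order(deduped$addr_key), ]

# (5) Write the result; a temporary file is used here only to keep the sketch tidy.
out_file <- tempfile(fileext = ".csv")
write.csv(deduped[c("name", "address")], out_file, row.names = FALSE)
```

Matching on normalized keys while keeping the original columns means the output still shows the addresses as they were supplied. Exact matching on keys will not catch misspellings; that would need some form of fuzzy matching on top.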

Related

R-neuralnet: does it randomize the data?

I need to know whether the training data passed to the neuralnet call is randomized inside the routine, or whether the routine uses the data in the order given. I really need this information for a project I am working on, and I have not been able to figure it out by looking at the source.
Thanks!
Look into the code. That's one of the most important advantages of FOSS: you can actually check what it is doing. neuralnet is pure R, so you don't need to dig into FORTRAN or C code, and you can use debug to step through the code with example data to get an overview.
Moreover, if necessary, you can even introduce a new parameter, for example one that lets you switch off randomization.
The package maintainer (run maintainer("neuralnet") to get the contact details) may also be willing to help, and can probably answer much faster than just about anyone else here on SE.
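As a rough first pass before stepping through with debug, you can list which random-number functions a function's body refers to directly. The rng_calls() helper and the two toy functions below are hypothetical and written only for this illustration; the check looks at one function body only, so randomization inside helper functions would be missed, and a hit only tells you where to start reading.

```r
# Hypothetical helper: names of common RNG functions that appear
# directly in the body of f (helpers called by f are not inspected).
rng_calls <- function(f, rng = c("sample", "sample.int", "runif", "rnorm", "set.seed")) {
  intersect(all.names(body(f)), rng)
}

# Two toy functions to show the idea.
shuffles    <- function(x) x[sample(length(x))]
keeps_order <- function(x) cumsum(x)

rng_calls(shuffles)     # contains "sample"
rng_calls(keeps_order)  # empty

# With the package installed, the same check and the step-through would be:
#   library(neuralnet)
#   rng_calls(neuralnet)
#   debug(neuralnet)   # the next call to neuralnet() enters the browser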

Is there a way to use arbitrary type of value as key in environment or named list in R?

I've been looking for a proper implementation of hash map in R, with functionalities similar to the map type in Python.
After some googling and searching the R documentation, I found that environments and named lists are the ONLY options I can use (is that really so?).
But the problem with both is that they only accept character strings as keys for hashing, not even numbers, let alone other types.
So is there a way to use arbitrary things as keys, or at least more than just characters?
Or is there a better implementation of a hash map, with better functionality, that I didn't find?
Thanks in advance.
Edit:
My current problem: I need a map to store the distance relationship between data points. That is, the key of the map is a tuple (p1, p2) and the value is a number.
The reason I asked a generic question instead of a concrete one is that I'm learning R recently and I want to know how to manipulate some of the most fundamental data structures, not only what my problem refers to. So I may need to use other things as key in the future, and I want to avoid asking similar questions with only minor difference every time I run into them.
Edit 2:
I got a lot of very good advice on this topic. It seems I'm still thinking in the Pythonic way rather than the way R expects. I should really get more R-ly! I think my purpose can easily be satisfied by a matrix in R. Thanks, all!
The reason people keep asking you for a specific example is that most problems for which hash tables are the appropriate technique in Python have a good solution in R that does not involve hash tables.
That said, there are certainly times when a real hash table is useful in R, and I recommend you check out the hash package for R. It uses environments as its base but lets you do a lot of R-like vector work with them. It's efficient and I've never run into a problem with it.
Just keep in mind that if you're using hash tables a lot while working with R and your code is running slowly or is buggy, you may be able to get some mileage from figuring out a more R-like way of doing it :)
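For the distance example from the question, both routes can be sketched in base R. The point names and coordinates below are invented for the illustration, and pair_key() is a hypothetical helper: it works around the character-only key restriction by pasting the two point names into one string, which assumes the names never contain the "|" separator.

```r
# Hypothetical data: three named 2-D points.
pts <- matrix(c(0, 0,
                3, 4,
                6, 8),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("p1", "p2", "p3"), c("x", "y")))

# Route 1 (the R-like one): a distance matrix indexed by point name.
d <- as.matrix(dist(pts))
d["p1", "p2"]  # 5

# Route 2: an environment used as a hash table with a composite string key.
# Sorting the pair makes (p1, p2) and (p2, p1) map to the same key.
pair_key <- function(a, b) paste(sort(c(a, b)), collapse = "|")

cache <- new.env(hash = TRUE)
assign(pair_key("p2", "p1"), d["p2", "p1"], envir = cache)

exists(pair_key("p1", "p2"), envir = cache, inherits = FALSE)  # TRUE
get(pair_key("p1", "p2"), envir = cache)                       # 5
```

The matrix stores every pair up front, which is simple and fast to index; the environment only stores the pairs you put in, which may matter when most pairs are never needed.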

Any exercises/tests/exams freely available with answers to test basic R knowledge

I have the responsibility of ensuring that a colleague who is just learning R knows the basics before a course where that is a requirement. The colleague has gone through a couple of tutorials so hopefully she is ok, but I would like to give her a test to gauge it.
I was therefore wondering if anyone knew of any suitable materials on the web, ideally with both questions and answers.
PS Cross-posted to r-help@stat.math.ethz.ch
There is a whole set of exercises with solutions from the book Data Analysis and Graphics Using R. (Maindonald & Braun, 2nd edn, CUP 2007) available online : http://maths.anu.edu.au/~johnm/r-book/2edn/exercises/
Next to that, a quick search using the obscure randomized pagecollector Google brought to me a set of exercises where you can pick out whatever you want. Try the magic phrase "R exercises". ;)
Some I found interesting :
http://www2.imperial.ac.uk/~das01/RCourse/Exercises.pdf (very nice)
http://dial.liacs.nl/Courses/MicroArrayDataAnalysis/Exercises/Introduction_to_R_Exercises_Nov_2004.pdf (rather basic)
http://www.shlrc.mq.edu.au/masters/students/raltwarg/altwargslp802.html (rather basic)
If this is just a one-time, single-person evaluation, then an oral-style exam is probably going to tell you a lot more than a set of fixed problems. Get a data set and have her read it into R, do some basic data manipulation, a couple of plots, and a standard analysis or two. Based on what she does well or finds challenging, you can adjust the direction and the additional questions you ask.
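A session like that might cover something along these lines. This is only a sketch of the kind of tasks meant: the built-in mtcars data stands in for whatever data set you pick, and the specific tasks (subset, group means, two plots, a simple regression) are assumptions about what counts as "the basics".

```r
# Stand-in for "a data set": write the built-in mtcars data to a CSV file
# so that there is something to read in.
csv_file <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_file)

# Task 1: read the data into R and look at it.
cars_df <- read.csv(csv_file, row.names = 1)
str(cars_df)

# Task 2: basic manipulation: recode a variable, subset rows, group means.
cars_df$cyl <- factor(cars_df$cyl)
heavy <- subset(cars_df, wt > 3.5)
mpg_by_cyl <- aggregate(mpg ~ cyl, data = cars_df, FUN = mean)

# Task 3: a couple of plots. pdf(NULL) draws to a null device so the script
# also runs non-interactively; drop it and dev.off() in an interactive session.
pdf(NULL)
plot(mpg ~ wt, data = cars_df)
boxplot(mpg ~ cyl, data = cars_df)
invisible(dev.off())

# Task 4: a standard analysis: simple linear regression.
fit <- lm(mpg ~ wt, data = cars_df)
summary(fit)
```

Asking her to explain each result as she goes (what the factor conversion changes, how to read the regression summary) usually reveals more than whether the code runs.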

Choosing data.table keys in R

How do I choose the right keys for data.table objects?
Are the considerations similar to those for RDBMSs? My first guess was to look for documentation about indexes and keys for RDBMSs. Google came up with this helpful Stack Overflow question related to Oracle.
Do the considerations from that answer apply to data.table objects, perhaps with the exception of those relating to UPDATE, INSERT or DELETE type statements? I'm guessing that our data.table objects won't really be used in that way.
I'm trying to get my head around this stuff by using the documentation and examples, but I haven't seen any discussion on key selection.
PS: Thanks to @crayola for pointing me toward the data.table package in the first place!
I am not sure this is a very helpful answer, but since you mention me in the question I'll say what I think anyway. But remember that I am a bit of a data.table newbie myself.
I personally only use keys when there is a clear benefit, e.g. when merging data.tables, or where it seems clear that doing so will speed things up (e.g. subsetting repeatedly on a variable).
But to my knowledge, there is sometimes no real need to define keys at all; the package is already faster than data.frame without keys.
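To make the two use cases concrete, here is a minimal sketch. It assumes the data.table package is installed; the tables and column names are invented for the example.

```r
library(data.table)

# Hypothetical tables.
orders <- data.table(customer = c("b", "a", "c", "a"),
                     amount   = c(10, 20, 30, 40))
customers <- data.table(customer = c("a", "b", "c"),
                        region   = c("north", "south", "east"))

# Setting a key sorts the table by that column (by reference) and marks it as sorted.
setkey(orders, customer)
setkey(customers, customer)
key(orders)  # "customer"

# Use case 1: repeated subsetting on the key column uses binary search.
a_orders <- orders[.("a")]

# Use case 2: merging. X[Y] joins on the key columns of the two tables.
joined <- customers[orders]

# The same subset without relying on the key also works.
a_orders_scan <- orders[customer == "a"]
```

On tables this small the key makes no measurable difference; the benefit only shows up with many rows and repeated lookups or joins on the same column.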

Do you know a good program to draw DVCS graphs?

Recently I have been trying to introduce improvements to the DVCS workflow at the company I work for. To make it happen I need to write a document describing the changes, and because it's for managers, the more pictures and graphs the better.
Do you know any program (for Windows preferably) in which it's easy to draw graphs representing branches, commits and merges? I've tried Visio but it's not exactly what I expected (or maybe I just need new stencils).
EDIT: The result I would like to achieve is similar to this one: http://nvie.com/posts/a-successful-git-branching-model/ The author didn't answer the questions in the comments about the software used, though.
You can try this online tool http://www.lucidchart.com, although it is not the same one.
I can recommend yEd from yWorks. It is free and can produce some beautiful graphs for a fairly wide array of use cases. It does have some relatively minor foibles, but it gets the job done more than it gets in my way.
