Choosing data.table keys in R

How do I choose the right keys for data.table objects?
Are the considerations similar to those for RDBMSs? My first guess was to look for documentation about indexes and keys for RDBMSs, and Google came up with this helpful Stack Overflow question related to Oracle.
Do the considerations from that answer apply to data.tables, perhaps with the exception of those relating to UPDATE, INSERT, or DELETE type statements? I'm guessing that our data.table objects won't really be used in that way.
I'm trying to get my head around this stuff by using the documentation and examples, but I haven't seen any discussion on key selection.
PS: Thanks to @crayola for pointing me toward the data.table package in the first place!

I am not sure this is a very helpful answer, but since you mention me in the question I'll say what I think anyway. Remember, though, that I am a bit of a data.table newbie myself.
I personally only use keys when there is a clear benefit to doing so, e.g. when merging data.tables, or when it seems clear that a key will speed things up (e.g. subsetting repeatedly on a variable); a minimal sketch of both cases follows.
But to my knowledge there is sometimes no real need to define keys at all; the package is already faster than data.frame even without them.
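Something like this (the table contents and column names are made up for illustration):

    library(data.table)

    # A made-up table of transactions.
    dt <- data.table(id = c(3L, 1L, 2L, 1L), amount = c(10, 20, 30, 40))
    setkey(dt, id)  # sorts the table by id and marks id as the key

    # Keyed subsetting: binary search rather than a full vector scan.
    dt[J(1L)]

    # Keyed merge: both tables are keyed on id, so customers[dt] joins on it.
    customers <- data.table(id = 1:3, name = c("a", "b", "c"), key = "id")
    customers[dt]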

Related

How to get useful names for anonymous java methods using async-profiler

I'm trying to use async-profiler to analyze Apache Spark performance, and I find that, even with codegen disabled, the majority of CPU time is spent in anonymous classes, such as Iterator$$anon$11.hasNext.
Is there a way I can get async-profiler to include outer class names for the objects containing the methods?
Or is there a way I can get the JVM to attach more meaningful names?
Or is there a way I can give names to anonymous classes?
The closest thing I've found to anyone even asking about this is Finding a specific lambda noted in a Profiler, but that doesn't seem super helpful.
I've tried to look up ways to do this, but I'm coming up empty. Thanks.

Using R to process Mail Files

I've done a bit of searching and, after not finding much, I thought I would post this question. Actually, the fact that I've not found much may be an indicator of what the answer will be, but anyway, here it is:
Does anyone have experience using R to process files for postal mailings? If so, what packages do you use?
I realize R might not be the best tool for this task but sometimes you have to use the tools you have at hand and sometimes you have to do "extra" things at work to stay employed...so please don't flame me too hard for this question.
Basically I'm looking at merge/purge and dup/elim sorts of tasks. I've played with the compare() and merge() functions a bit. I'd like to incorporate some equivalencies into the comparisons, such as
ST=St=St.=Street
BLVD=Blvd=Blvd.=Boulevard
etc...
I'm mostly wondering if packages have already been developed for this sort of data processing so I'm not reinventing the wheel.
I'd suggest the following basic workflow:
(1) Read in your data. I don't know what it looks like based on your question, so I'll assume that's easy for you.
(2) Use a mix of gsub, toupper, and other string manipulation tools to convert all the data to the same format, i.e., get all addresses to use ST instead of St or Street, etc. (see the sketch after this list).
(3) Merge everything into a single data frame.
(4) Use unique and/or sort/order to clean up the list and remove duplicates.
(5) Output to whatever format you're going for. Again, not clear from the question, so I can't offer specific advice here.
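A rough sketch of steps (2) through (4), with made-up lists and only two abbreviation rules (a real job would need a much larger equivalence table):

    # Two made-up mailing lists with a 'street' column.
    a <- data.frame(street = c("123 Main Street", "456 Oak Blvd."),
                    stringsAsFactors = FALSE)
    b <- data.frame(street = c("123 MAIN ST", "789 Elm Boulevard"),
                    stringsAsFactors = FALSE)

    # Step 2: one case everywhere, then canonical abbreviations.
    normalize <- function(x) {
      x <- toupper(x)
      x <- gsub("\\bSTREET\\b|\\bST\\.", "ST", x)         # Street / St. -> ST
      x <- gsub("\\bBOULEVARD\\b|\\bBLVD\\.", "BLVD", x)  # Boulevard / Blvd. -> BLVD
      gsub("\\s+", " ", trimws(x))                        # squeeze whitespace
    }

    combined <- rbind(a, b)                        # step 3: one data frame
    combined$street <- normalize(combined$street)
    deduped <- unique(combined[order(combined$street), , drop = FALSE])  # step 4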

Is there a way to use arbitrary type of value as key in environment or named list in R?

I've been looking for a proper implementation of a hash map in R, with functionality similar to the map type in Python.
After some googling and searching the R documentation, I found that environments and named lists are the ONLY options I can use (is that really so?).
But the problem with both is that they can only take characters as keys, not even numbers, let alone other types of values.
So is there a way to use arbitrary things as keys, or at least more than just characters?
Or is there a better implementation of a hash map, with better functionality, that I didn't find?
Thanks in advance.
Edit:
My current problem: I need a map to store the distance relationship between data points. That is, the key of the map is a tuple (p1, p2) and the value is a number.
The reason I asked a generic question instead of a concrete one is that I've been learning R recently and I want to know how to manipulate some of its most fundamental data structures, not only the ones my current problem involves. I may need to use other things as keys in the future, and I want to avoid asking a similar question with only minor differences every time I run into one.
Edit 2:
I got a lot of very good advice on this topic. It seems I'm still thinking quite in the Pythonic way rather than the should-be R way. I should really get more R-ly! I think my purpose can easily be satisfied by a matrix in R. Thanks all!
The reason people keep asking you for a specific example is that most problems for which hash tables are the appropriate technique in Python have a good solution in R that does not involve hash tables.
That said, there are certainly times when a real hash table is useful in R, and I recommend you check out the hash package for R. It uses environments as its base but lets you do a lot of R-like vector work with them. It's efficient and I've never run into a problem with it.
Just keep in mind that if you're using hash tables a lot while working with R and your code is running slowly or is buggy, you may be able to get some mileage from figuring out a more R-like way of doing it :)
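To make both options concrete, here is a minimal sketch for the (p1, p2) distance case; serializing the tuple into a character key is the standard workaround, and the hash calls below are from that package's documented interface:

    # Base R: an environment used as a hash table. Keys must be strings,
    # so serialize the tuple into one yourself.
    dist_env <- new.env(hash = TRUE)
    key <- function(p1, p2) paste(p1, p2, sep = "|")
    assign(key(1, 2), 3.7, envir = dist_env)
    get(key(1, 2), envir = dist_env)   # 3.7

    # The hash package wraps environments in a friendlier interface.
    library(hash)
    h <- hash()
    h[[key(1, 2)]] <- 3.7
    h[[key(1, 2)]]                     # 3.7

    # For pairwise distances specifically, a plain matrix indexed by point
    # number is simpler still, as the question's second edit concludes.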

Ways to filter a dataset

One of the interviewers asked me about the ways to filter a DataSet.
I replied:
(A) DataView
(B) RowFilter
(C) Select
Is there any other way apart from those mentioned above?
Besides those options you can also use LINQ-to-DataSets to filter data in memory.
Additionally, a superior answer in an interview would ask whether filtering a DataSet is the best approach in a given situation. Sometimes it is best to cache data and then filter in memory, and sometimes it's better to add the filters to the original SQL call and let the database do the filtering. Neither option is correct in all situations; it's case by case.
In my opinion a good interview question and answer is more of a discussion of options and pros and cons as opposed to just knowing the answer to some random question.

Should I use an expression parser in my Math game?

I'm writing some children's Math Education software for a class.
I'm going to try to present randomly generated math problems of different types to students of varying skill levels in fun ways.
One of the frustrations of using computer-based math software is its rigidity. If you have ever taken an online Math class, you'll know all about the frustration of taking an online quiz and having your correct answer thrown out because it isn't formatted exactly the way they expect, or because of some weird spacing issue.
So, originally I thought, "I know! I'll use an expression parser on the answer box so I'll be able to evaluate anything they enter and even if it isn't in the same form I'll be able to check if it is the same answer." So I fire up my IDE and start implementing the Shunting Yard Algorithm.
This would solve the problem of fractions not being accepted unless they are in lowest terms, among other issues.
However, it then hit me that for most problems a tricky student would simply be able to enter the problem itself into the answer box, and my expression parser would dutifully parse and evaluate it to the correct answer!
So, should I not be using an expression parser in this instance? Do I really have to generate a single form of the answer and do a string comparison?
One possible solution is to note how many steps your expression evaluator takes to reduce the student's answer, and to compare this to the number of steps needed for the optimal answer. If there's too much difference, then the answer hasn't been reduced enough and you can suggest that the student keep going.
Don't be surprised if students come up with better answers than your own definition of "optimal", though! I was a TA/grader for several classes, and the brightest students routinely had answers on their problem sets that were superior to the ones provided by the professor.
For simple problems where you're looking for an exact answer, then removing whitespace and doing a string compare is reasonable.
For more advanced problems, you might do the Shunting Yard Algorithm (or similar) but perhaps parametrize it so you could turn on/off reductions to guard against the tricky student. You'll notice that "simple" answers can still use the parser, but you would disable all reductions.
For example, on a division question, you'd disable the "/" reduction.
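In R terms (just a sketch; the helper names are made up), the simple case is a whitespace-stripped string compare, and one crude way to "turn off" the "/" reduction is to reject any answer whose parse tree still contains that operator:

    # Simple case: exact answer, compared after stripping whitespace.
    same_string <- function(answer, expected) {
      strip <- function(x) gsub("[[:space:]]+", "", x)
      identical(strip(answer), strip(expected))
    }

    # Does the parse tree of expression e use operator op anywhere?
    uses_op <- function(e, op) {
      if (!is.call(e)) return(FALSE)
      if (identical(e[[1]], as.name(op))) return(TRUE)
      any(vapply(as.list(e)[-1], uses_op, logical(1), op = op))
    }

    # Division question: the evaluator must not do the division for the student.
    check_division <- function(answer, expected) {
      e <- parse(text = answer, keep.source = FALSE)[[1]]
      !uses_op(e, "/") && isTRUE(all.equal(eval(e, baseenv()), expected))
    }

    check_division("2", 2)    # TRUE
    check_division("4/2", 2)  # FALSE: the "/" reduction is disabled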
This is a great question.
If you are writing an expression system and an evaluation/transformation/equivalence engine (isn't there one available somewhere? I am almost 100% sure there is an open-source one), then it's more of an education/algebra problem: is the student's answer algebraically closer to the original expression or to the expected expression?
I'm not sure how to answer that, but here is an idea (not necessarily practical): perhaps your evaluation engine can count transformation steps to equivalence. If the answer takes fewer steps to reach the expected expression than it did to reach the original, it might be OK. If it's too close to the original, it's not.
You could use an expression parser, but apply restrictions on the complexity of the expressions permitted in the answer.
For example, if the goal is to reduce (4/5)*(1/2) and you want to allow either (2/5) or (4/10), then you could restrict the set of allowable answers to expressions whose trees take the form (x/y) and which also evaluate to the correct number. Perhaps you would also allow "0.4", i.e. expressions of the form (x) which evaluate to the correct number.
This is exactly what you would (implicitly) be doing if you graded the problem manually: you would be looking for an answer that is correct but which also falls into an acceptable class.
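Here's a minimal sketch of that restriction in R, assuming answers arrive as text; it accepts only a bare number or a single literal division that evaluates to the expected value:

    # Accept only answers of the form (number) or (number / number)
    # that evaluate to the expected value.
    acceptable_form <- function(e) {
      if (is.numeric(e)) return(TRUE)                   # plain number, e.g. 0.4
      is.call(e) && identical(e[[1]], as.name("/")) &&  # a single division...
        is.numeric(e[[2]]) && is.numeric(e[[3]])        # ...of two numeric literals
    }

    grade <- function(answer, expected) {
      e <- parse(text = answer, keep.source = FALSE)[[1]]
      acceptable_form(e) && isTRUE(all.equal(eval(e, baseenv()), expected))
    }

    grade("2/5", 0.4)          # TRUE
    grade("4/10", 0.4)         # TRUE
    grade("(4/5)*(1/2)", 0.4)  # FALSE: right value, unacceptable form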
The usual way of doing this in mathematics assessment software is to allow the question setter to specify expressions/strings that are not allowed in a correct answer.
If you happen to be interested in existing software, there's the open-source Stack (http://www.stack.bham.ac.uk/), as well as various commercial options such as MapleTA. I suspect most of the problems that you'll come across have also been encountered by Stack, so even if you don't want to use it, it might be educational to look at how it approaches things.
