Two-way Summary Table in Julia (with weight by third variable)

Has anyone implemented a two-way frequency (summary) table in Julia? I'm looking to create summary tables (means, sums, etc.) in Julia, with the option to weight continuous variables by a third variable (very important). Essentially, I want the abilities of proc tabulate/means, for those familiar with SAS.
I searched around and found this was discussed at one point, but it does not appear to have moved anywhere since. I'm at the point where I will look to code this up myself, but I want to see what the community's thoughts are before proceeding.
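For readers without a SAS background, the target computation is a weighted cell statistic: within each cell of the two-way table, the weighted mean sum(w * x) / sum(w). A minimal sketch of that computation in R (the data frame and column names are made up), just to pin down what a Julia implementation would need to produce:

# Made-up data: measurement x, weight w, two grouping variables
df <- data.frame(region = rep(c("N", "S"), each = 4),
                 year   = rep(2014:2015, 4),
                 x      = rnorm(8),
                 w      = runif(8))

# Weighted mean of x per region x year cell: sum(w * x) / sum(w)
df$xw <- df$x * df$w
xtabs(xw ~ region + year, df) / xtabs(w ~ region + year, df)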

Related

Add extra explanatory layer onto PCA

I am trying to run a PCA on a data frame which is accompanied by a metadata table. The PCA table is all normalized, scaled, etc.; the metadata, however, is not.
I want the PCA to cluster not only on the main data frame but also, optionally, on one or more columns from the metadata table as explanatory variables. Again, these are not scaled and normalized with the main dataset. Also, I am not looking to color the plots by a certain data column; I want the column to be considered in the actual clustering.
I am aware that this sounds somewhat vague, but I am having a hard time finding the exact words. After looking around for a bit I found demixed PCA, which seems to be very close to what I want to achieve. Sadly, there is no package in R to run it.
Any recommendations are welcome, and thank you in advance.
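One plain option, short of demixed PCA: scale the metadata columns to match the already-normalized main data, bind them onto it, and run an ordinary PCA, so the metadata takes part in the clustering itself. A minimal sketch in R; main_df and meta_df are hypothetical placeholders, not names from the question:

# Hypothetical stand-ins for the normalized main data and the raw metadata
main_df <- as.data.frame(matrix(rnorm(100 * 10), nrow = 100))
meta_df <- data.frame(age = runif(100, 20, 80), dose = runif(100, 0, 5))

extra    <- scale(meta_df)            # bring metadata onto a comparable scale
combined <- cbind(main_df, extra)     # metadata now enters the PCA itself

pca <- prcomp(combined)               # main_df assumed pre-scaled, so no rescaling here
summary(pca)                          # variance explained per component

This is plain PCA on an augmented matrix, not demixed PCA; whether it is sensible depends on how comparable the scales and variances of the two tables really are.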

R plot data.frame to get more effective overview of data

At work when I want to understand a dataset (I work with portfolio data in life insurance), I would normally use pivot tables in Excel to look at e.g. the development of variables over time or dependencies between variables.
I remembered from university the nice R function where you can plot every column of a data frame against every other column, as in plot(data.frame).
For the dependency between issue.age and duration this plot is actually interesting, because you can clearly see that high issue ages come with shorter policy durations (there is a maximum age for each policy). However, the plots involving the issue year iss.year are much less "visual". In fact you can't see anything from them. I would like to see at one glance whether the distribution of issue ages has changed over the different issue years, e.g. a chart where you could see immediately that the average age of newly issued policies has been increasing from 2014 to 2016.
I don't want to write code that needs to be customized for every dataset I put in, because then I might as well do it manually in Excel.
So my question is: is there an easy way to plot each column of a matrix against every other column, with more flexible chart types than the standard plot(data.frame)?
The ggpairs() function from the GGally library does this. It has a lot of capability for visualizing columns of all different types, and provides a lot of control over what to visualize.
For example, here is a snippet from the package vignette:
library(GGally)

data(tips, package = "reshape")  # example dataset shipped with the reshape package
ggpairs(tips)                    # one panel per pair of columns
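For the issue-year panels specifically, ggpairs() lets you choose per-panel plot types, so continuous-vs-year combinations can become boxplots of the age distribution per year. A small variation on the snippet above; df stands in for the portfolio data, and the panel type name follows the GGally documentation:

library(GGally)

# df is a hypothetical stand-in for the portfolio data frame
df$iss.year <- factor(df$iss.year)   # treat the year as discrete
ggpairs(df,
        columns = c("issue.age", "duration", "iss.year"),
        lower = list(combo = "box"))  # boxplots for continuous-vs-factor panels

With iss.year as a factor, the issue.age/iss.year panel becomes exactly the one-glance distribution-per-year view described above.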

Fastest way to reduce dimensionality for multi-classification in R

What I currently have:
I have a data frame with one column of factors called "Class", which contains 160 different classes. I have 1200 variables, each one an integer, with no individual cell exceeding the value of 1000 (if that helps). About a quarter of the cells are zero. The total dataset contains 60,000 rows. I have already used the nearZeroVar and findCorrelation functions to get it down to this number of variables. In my particular dataset some individual variables may appear unimportant by themselves, but are likely to be predictive when combined with two others.
What I have tried:
First I tried just creating a random forest model, planning to use the varImp property to filter out the useless variables, but gave up after letting it run for days. Then I tried using fscaret, but that ran overnight on an 8-core machine with 64 GB of RAM (same as the previous attempt) and didn't finish. Then I tried feature selection using genetic algorithms; that ran overnight and didn't finish either. I was trying to make principal component analysis work, but for some reason couldn't. I have never been able to successfully do PCA within caret, which could be my problem and solution here. I can follow all the "toy" demo examples on the web, but I still think I am missing something in my case.
What I need:
I need some way to quickly reduce the dimensionality of my dataset so I can make it usable for creating a model. Maybe a good place to start would be an example of using PCA with a dataset like mine using Caret. Of course, I'm happy to hear any other ideas that might get me out of the quicksand I am in right now.
I have done only some toy examples too.
Still, here are some ideas that do not fit into a comment.
All your attributes seem to be numeric. Maybe running the Naive Bayes algorithm on your dataset will give some reasonable classifications? Naive Bayes assumes all attributes are independent of each other, but experience shows (and many scholars say) that its results are often still useful despite that strong assumption.
If you absolutely MUST do attribute selection, e.g. as part of an assignment:
Did you try processing your dataset with the free GUI-based data-mining tool Weka? There is an "attribute selection" tab where you have several algorithms (and algorithm combinations) for removing irrelevant attributes at your disposal. That is an art, and the results are not so easy to interpret, though.
Read this PDF as an introduction and see this video for a walk-through and an introduction to the theoretical approach.
The video assumes familiarity with Weka, but maybe it still helps.
There is an RWeka interface but it's a bit laborious to install, so working with the Weka GUI might be easier.
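On the request above for a caret PCA example: in caret, PCA is a preprocessing step via preProcess() rather than a model. A minimal sketch, assuming the data are in a data frame df with the factor column Class described in the question:

library(caret)

predictors <- df[, setdiff(names(df), "Class")]   # the 1200 integer variables

# center/scale first, then keep enough components for 95% of the variance
pp      <- preProcess(predictors, method = c("center", "scale", "pca"), thresh = 0.95)
reduced <- predict(pp, predictors)                # component scores, far fewer columns

train_df <- data.frame(reduced, Class = df$Class) # ready for model training

PCA is unsupervised, so it ignores Class entirely; it will not preserve interactions that only matter for prediction, but it is fast even at 60,000 rows by 1200 columns.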

Example R source code for multiple linear regression with looping through geographies & products?

Pardon the newbie question, as I just started learning R a couple of weeks ago (but intend to use it actively from now on). However, I could use some help if you already have a working example.
In order to determine own-price elasticity coefficients for each of our products (~100) in each of our states, I want to write a multiple regression that regresses Units on a variety of independent variables. That's straightforward. However, I would like R to cycle through EACH product within a particular state, THEN move on to the next state in the data file and start the regression on the first product, repeating the cycle.
I have attached an example of what I'm trying to accomplish. I would also like R to export the regression coefficients (and summaries: p-values, t-statistics) into a separate worksheet at the end.
Does anyone have an example similar to this? I'm comfortable enough to read source code and modify it to fit my needs, but certainly not yet comfortable enough to write one from scratch. And, alas, I am tired of copying/pasting into Minitab/Excel (which is what I've been using up to this point) to run regressions 1,000 times.
Appreciate any help you could offer!
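Not a full working example, but the standard R pattern here is split/apply: split the data by state and product, fit lm() on each piece, and stack the coefficient tables for export. A sketch with hypothetical names (sales, State, Product, Units, Price, Promo); swap in your own variables:

library(broom)   # tidy() turns an lm fit into a data frame of coefficients

pieces <- split(sales, list(sales$State, sales$Product), drop = TRUE)

results <- do.call(rbind, lapply(pieces, function(d) {
  fit <- lm(log(Units) ~ log(Price) + Promo, data = d)   # log-log slope = elasticity
  cbind(State = d$State[1], Product = d$Product[1], tidy(fit))
}))

# estimate, std.error, statistic (the t-stat), and p.value per coefficient,
# one flat file you can open in Excel
write.csv(results, "elasticities.csv", row.names = FALSE)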

Facilities in R to verify published ANOVA from cell means, SE, and n

I would like to be able to take something like a two-way table of cell means with standard errors and sample sizes from a journal article and compute an ANOVA table, create interaction plots, and so on.
I can do this manually, I am just wondering if there are already some ways (e.g., a package) to do this automatically in R.
Even just some better search terms might help, because searching for ANOVA, cell means, and standard error is obviously not very selective.
Have a look at the HH package. It seems to provide what you are looking for.
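If HH does not pan out, the computation can also be scripted directly: recover each cell SD from its standard error (SD = SE * sqrt(n)), pool those into the error term, and take the factor sums of squares from a weighted fit on the cell means. A sketch for a balanced two-way layout, with made-up numbers:

# Made-up 2 x 3 table of published cell summaries
cells <- expand.grid(A = factor(1:2), B = factor(1:3))
cells$mean <- c(10.1, 12.3, 9.8, 11.0, 10.5, 13.2)
cells$se   <- c(0.4, 0.5, 0.4, 0.6, 0.5, 0.5)
cells$n    <- rep(20, 6)

cells$sd <- cells$se * sqrt(cells$n)          # SE -> within-cell SD

ss_error <- sum((cells$n - 1) * cells$sd^2)   # pooled within-cell SS
df_error <- sum(cells$n - 1)
mse      <- ss_error / df_error

# Factor SS from the cell means (saturated weighted fit; residual SS is 0)
fit <- lm(mean ~ A * B, data = cells, weights = n)
tab <- anova(fit)[1:3, c("Df", "Sum Sq")]

tab$F <- (tab$`Sum Sq` / tab$Df) / mse        # test against the rebuilt error term
tab$p <- pf(tab$F, tab$Df, df_error, lower.tail = FALSE)
tab

interaction.plot(cells$B, cells$A, cells$mean)  # interaction plot from the means

For a balanced design the weighted fit on the means reproduces the factor sums of squares of the raw data; for unbalanced designs the decomposition is no longer unique, so extra care is needed.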
