Adding a calculated field in rpivotTable (R)

I want to create a calculated field to use with the rpivotTable package, similar to the calculated-field functionality in Excel.
For instance, consider the following table:
+--------------+--------+---------+-------------+-----------------+
| Manufacturer | Vendor | Shipper | Total Units | Defective Units |
+--------------+--------+---------+-------------+-----------------+
| A            | P      | X       |      173247 |           34649 |
| A            | P      | Y       |      451598 |          225799 |
| A            | P      | Z       |      759695 |          463414 |
| A            | Q      | X       |      358040 |          225565 |
| A            | Q      | Y       |      102068 |           36744 |
| A            | Q      | Z       |      994961 |          228841 |
| A            | R      | X       |      454672 |          231883 |
| A            | R      | Y       |      275994 |          124197 |
| A            | R      | Z       |      691100 |          165864 |
| B            | P      | X       |      755594 |          302238 |
| .            | .      | .       |           . |               . |
| .            | .      | .       |           . |               . |
+--------------+--------+---------+-------------+-----------------+
(My actual table has many more columns, both dimensions and measures, including time, and I need to define multiple such "calculated columns".)
If I want to calculate a defect rate (which would be Defective Units / Total Units) and aggregate it by any of the first three columns, I'm not able to.
I tried assignment by reference (:=), but that still didn't work: it summed up the row-wise defect rates, i.e. sum(Defective_Units/Total_Units), instead of sum(Defective_Units)/sum(Total_Units):
myData[, Defect.Rate := Defective_Units / Total_Units]
This ended up giving me defect rates greater than 1. Is there any way I can declare a calculated field, i.e. a formula evaluated after aggregation?
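For reference, the value I want per group is the post-aggregation ratio, which I can compute outside the pivot table like this (a sketch assuming myData is a data.table with the column names used above):

library(data.table)
# Post-aggregation defect rate: sum the numerator and denominator first,
# then divide (not the sum of per-row rates).
myData[, .(Defect.Rate = sum(Defective_Units) / sum(Total_Units)), by = Manufacturer]

What I'm missing is a way to get this same behaviour inside the interactive pivot table.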

You're in luck: the creator of pivottable.js foresaw cases like yours (and mine, earlier today) and implemented an aggregator called "Sum over Sum", plus a few more along the same lines; see https://github.com/nicolaskruchten/pivottable/blob/master/src/pivot.coffee#L111 and https://github.com/nicolaskruchten/pivottable/blob/master/src/pivot.coffee#L169.
So we use "Sum over Sum" as the "aggregatorName" parameter, and pass the columns whose quotient we want in the "vals" parameter.
Here's a (numerically meaningless) usage example on the mtcars data, for reproducibility:
require(rpivotTable)
data(mtcars)
rpivotTable(mtcars,
            rows = "gear", cols = c("cyl", "carb"),
            aggregatorName = "Sum over Sum",
            vals = c("mpg", "disp"),
            width = "100%", height = "400px")
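Applied to the data in the question, it would presumably look like the sketch below (untested; column names taken from the sample table, with Defective_Units first as the numerator and Total_Units second as the denominator, and you can swap rows/cols for whichever of the first three columns you want to aggregate by):

rpivotTable(myData,
            rows = "Manufacturer", cols = "Vendor",
            aggregatorName = "Sum over Sum",
            vals = c("Defective_Units", "Total_Units"),
            width = "100%", height = "400px")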

Related

Is it possible to plot two variables using a third one as filter in R?

I am trying to create a plot of two variables (DATE and INT_RATE), using the content of a third variable, GRADE, as a filter.
The problem is that I can't figure out how to use GRADE to filter the rows.
In the section below I provide a sample of the starting data as well as a sketch of the plot I'm trying to achieve.
Thanks in advance.
STARTING DATA
+-------+----------+-------+
| DATE  | INT_RATE | GRADE |
+-------+----------+-------+
| 1-jan | 5%       | A     | <-- A
| 5-feb | 3%       | B     |
| 9-feb | 2%       | D     |
| 1-apr | 3%       | A     | <-- A
| 5-jun | 5%       | A     | <-- A
| 1-aug | 3%       | G     |
| 1-sep | 2%       | E     |
| 3-nov | 1%       | C     |
| 8-dec | 8%       | A     | <-- A
| .     | .        | .     |
| .     | .        | .     |
| .     | .        | .     |
+-------+----------+-------+
And this is the kind of graph I would like to achieve, which is a very basic one except for the filtering work needed beforehand.
WANTED RESULT:
[Sketch: a line chart titled GRADE "A", with DATE on the x-axis and INT_RATE on the y-axis, connecting only the GRADE "A" points: (1-jan, 5%), (1-apr, 3%), (5-jun, 5%), (8-dec, 8%).]
EDIT 1:
Following the valuable help from @apax I managed to get a plot, but the result is not satisfying because of the odd way R displays it (I think this is related to the dataset being very large, about 800k rows). Do you have any suggestions?
By the way, this solved my problem:
plot(INT_RATE ~ DATE, data = filter(df, GRADE == "A"))
I am also uploading a PNG of the malformed chart.
Thanks again to all.
Here's a quick one-liner solution, where I assume your data is stored in an object named df:
library(dplyr)  ## for the filter() function below
plot(INT_RATE ~ DATE, data = filter(df, GRADE == "A"))
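If you'd rather stick with plot()'s default method than the formula interface, an equivalent form (same assumption that the data live in df) is:

library(dplyr)
# Evaluate DATE and INT_RATE inside the filtered data frame
with(filter(df, GRADE == "A"), plot(DATE, INT_RATE))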
You could also use ggplot2 and facet_wrap(...):
library(ggplot2)
ggplot(mtcars, aes(x = mpg, y = disp)) +
  geom_point() +
  facet_wrap(~cyl)
For your data:
ggplot(data, aes(x = DATE, y = INT_RATE)) +
  geom_line() +
  facet_wrap(~GRADE)
P.S. This gives separate graphs for all grades. But that should not be a problem.
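One caveat, as a hedged side note: with DATE stored as text like "1-jan" and INT_RATE as text like "5%", both columns will be treated as discrete. A small preparation sketch (the year and the object name data are assumptions, since the sample omits them, and parsing "%b" assumes an English locale):

library(ggplot2)

# Assumed preprocessing: parse the text columns before plotting
data$DATE     <- as.Date(paste0(data$DATE, "-2016"), format = "%d-%b-%Y")  # year assumed
data$INT_RATE <- as.numeric(sub("%", "", data$INT_RATE)) / 100

ggplot(data, aes(x = DATE, y = INT_RATE)) +
  geom_line() +
  facet_wrap(~GRADE)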

rmarkdown - prevent indentation in list inside a table

When rendering tables such as the one below (using RStudio + knitr), there is unwanted indentation (see the red zone in the image). How can I avoid this indentation?
I imagine there is some CSS involved, but if there were a way to prevent rmarkdown from treating this as a list in the first place, it could simplify matters. This is needed for an R package, so heavy hacks are not really an option, but I'll gladly take all suggestions. Thanks.
The (grid) table:
+------------------------+------------------------------------+
| Variable               | Stats / Values                     |
+========================+====================================+
| SomeVar1               | mean (sd) : 1500000.5 (288675.28)\ |
| [numeric]              | min < med < max :\                 |
|                        | 1000001 < 1500000.5 < 2e+06\       |
|                        | IQR (CV) : 499999.5 (0.19)         |
+------------------------+------------------------------------+
| SomeVar2               | 1. AAAAAA\                         |
| [factor]               | 2. BBBBBB\                         |
|                        | 3. CCCCCC\                         |
|                        | 4. DDDDDD\                         |
|                        | 5. EEEEEE\                         |
|                        | 6. FFFFFF\                         |
|                        | 7. GGGGGG\                         |
|                        | 8. HHHHHH\                         |
|                        | 9. IIIIII\                         |
|                        | 10. JJJJJJ\                        |
|                        | [ 102917 others ]                  |
+------------------------+------------------------------------+
The rendered HTML table (screenshot not reproduced here) shows the numbered items indented inside the cell.
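One direction that might help (a sketch only, not tested against this exact pipeline): pandoc treats each cell line as an ordered-list item because it starts with a number followed by a period, so escaping the periods with a backslash keeps the numbers literal and no list (hence no list indentation) is produced, e.g.:

| SomeVar2               | 1\. AAAAAA\                        |
| [factor]               | 2\. BBBBBB\                        |

Alternatively, you could keep the list and remove the indentation with CSS that targets ol elements inside table cells, but that depends on the stylesheet your package ships.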

How to improve a scatter plot for plotting a large dataset in R?

I am trying to create a plot of two variables (DATE and INT_RATE), using the content of a third variable, GRADE, as a filter.
The following section shows a sample of the very large data set I'm processing, as well as the result I am currently obtaining.
STARTING DATA
+-------+----------+-------+
| DATE  | INT_RATE | GRADE |
+-------+----------+-------+
| 1-jan | 5%       | A     | <-- A
| 5-feb | 3%       | B     |
| 9-feb | 2%       | D     |
| 1-apr | 3%       | A     | <-- A
| 5-jun | 5%       | A     | <-- A
| 1-aug | 3%       | G     |
| 1-sep | 2%       | E     |
| 3-nov | 1%       | C     |
| 8-dec | 8%       | A     | <-- A
| .     | .        | .     |
| .     | .        | .     |
| .     | .        | .     |
+-------+----------+-------+
And this is the kind of graph I would like to achieve.
WANTED RESULT:
[Sketch: a line chart titled GRADE "A", with DATE on the x-axis and INT_RATE on the y-axis, connecting only the GRADE "A" points: (1-jan, 5%), (1-apr, 3%), (5-jun, 5%), (8-dec, 8%).]
This is the relevant section of my R script that I used to build the graph shown below (the filter function is used to filter data):
plot(x = df$issue_d, y = df$int_rate, data=filter(df, df$grade == "A"))
And this is the "broken" graph I'm obtaining:
NOW THE QUESTION
How can I improve this graph? Reading it in this form is simply not possible. Maybe I should go for a different kind of graph altogether, but which one? I do need to filter the data before plotting it.
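For what it's worth, here is a hedged sketch of one common approach for a roughly 800k-row scatter: filter first, make sure the columns have proper types, and fade the points so the overplotting becomes readable (column names as in the script above; the alpha and size values are guesses to tune):

library(dplyr)
library(ggplot2)

df_a <- filter(df, grade == "A")            # subset before plotting

ggplot(df_a, aes(x = issue_d, y = int_rate)) +
  geom_point(alpha = 0.05, size = 0.3) +    # semi-transparent points reveal density
  geom_smooth()                             # optional: overall trend through the cloud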

R apply script output in different formats for similar inputs

I'm using a nested apply() to get a list of p-values from cor.test() between every pair of columns from two tables.
hel_plist<-apply(bc, 2, function(x) { apply(otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}}) })
The otud data.frame is 90 × 11 (90 rows, 11 columns; dim(otud) returns 90 11) and will be used with different data.frames.
bc and hel are both 90 × 2 data.frames, so for each of them I get 2 * 11 = 22 p-values out of these calls:
bc_plist<-apply(bc, 2, function(x) { apply(otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}}) })
hel_plist<-apply(hel, 2, function(x) { apply(otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}}) })
For bc I get an output with dim = NULL: a nested list of p-values indexed as bc-column$otu-name (the format I have always got from these scripts and am happy with).
But for hel I get an output with dim 11 2, i.e. an 11 × 2 table with the p-values written inside.
Shortened examples of the output:
hel_plist
+--------+--------------+--------------+
|        | axis1        | axis2        |
+--------+--------------+--------------+
| Otu037 | 1.126362e-18 | 0.01158251   |
| Otu005 | 3.017458e-2  | NULL         |
| Otu068 | 0.00476002   | NULL         |
| Otu070 | 1.27646e-15  | 5.252419e-07 |
+--------+--------------+--------------+
bc_plist
$axis1
$axis1$Otu037
[1] 1.247717e-06
$axis1$Otu005
[1] 1.990313e-05
$axis1$Otu068
[1] 5.664597e-07
Why is the output different when the input formats are all the same? (Shortened examples of the inputs below.)
bc
+-------+-----------+-----------+
| group | axis1     | axis2     |
+-------+-----------+-----------+
| 1B041 |  0.125219 |  0.246319 |
| 1B060 | -0.022412 | -0.030227 |
| 1B197 | -0.088005 | -0.305351 |
| 1B222 | -0.119624 | -0.144123 |
| 1B227 | -0.148946 | -0.061741 |
+-------+-----------+-----------+
hel
+-------+---------------+---------------+
| group | axis1 | axis2 |
+-------+---------------+---------------+
| 1B041 | -0.0667782322 | -0.1660606406 |
| 1B060 | 0.0214470932 | -0.0611351008 |
| 1B197 | 0.1761876858 | 0.0927570627 |
| 1B222 | 0.0681058251 | 0.0549292399 |
| 1B227 | 0.0516864361 | 0.0774155225 |
| 1B235 | 0.1205676221 | 0.0181712761 |
+-------+---------------+---------------+
How could I force my scripts to always produce "flat" outputs as in the case of bc?
OK, the different outputs are caused by the NULL results from the conditional function in the bc_plist case. If I modify the code to replace possible NULLs with NAs, I get 2-D tables in every case.
So, to keep things consistent:
bc_nmds_plist<-apply(bc_nmds, 2, function(x) { apply(stoma_otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}else NA}) })
And I get a 2-D table out for bc_nmds_plist too.
So I guess this can be called solved, as I now have a piece of code that produces predictable output for any correct input.
If anyone has an idea how to force the output to conform to the previous bc_plist format instead, I would still be interested, as I actually prefer that form:
$axis1
$axis1$Otu037
[1] 1.247717e-06
$axis1$Otu005
[1] 1.990313e-05
$axis1$Otu068
[1] 5.664597e-07
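If that nested-list shape is what you want no matter how many tests pass the cut-off, one option (a sketch along the same lines as the code above) is to build the result with lapply(), which never simplifies to a matrix, and drop the non-significant entries afterwards:

hel_plist <- lapply(hel, function(x) {
  res <- lapply(otud, function(y) {
    p <- cor.test(x, y, method = "spearman", exact = FALSE)$p.value
    if (p < 0.05) p             # NULL when not significant
  })
  Filter(Negate(is.null), res)  # keep only the significant p-values
})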

R: graphing upper and lower bounds with ggplot2

I have a dataset with three variables: one continuous independent variable, one continuous dependent variable, and a binary variable that categorizes how the measurements were taken. Using ggplot, I know that I can make a scatter plot with the points colored by the category:
g <- ggplot(dataset, aes(independent, dependent))
g + geom_point(aes(color=catagory))
However, I want to know if there is a way to make a graph where there is a vertical line coming up from points of category 0 and a vertical line going down from points of category 1. It would look something like this:
- | | |
| | | |
| | | |
| | | |
- | | o |
| | | | |
| | o | | |
| | o | | | |
- | | | o | o
| | | | |
| o | | |
| | | |
+----|-----|-----|-----|-----|
The reason for wanting a plot like this is that one category represents an upper bound (the points with lines going downwards) and one represents a lower bound (the points with lines going upwards). Having these lines would make it easy to visualize the area which is between these bounds, and whether a function plotted on top could accurately represent the data:
- | | |
| | | |
| | | |
| | | |
- | | o | _____
| | | |_|__/
| | o |_/| |
| | o |__/| | |
- | | /| o | o
| _|_|/ | |
| / o | | |
|/ | | |
+----|-----|-----|-----|-----|
If there is any way to do this using ggplot or any other graphing library for R, I would love to know how. However, if it isn't possible, I'd be open to hearing other ways to represent this data. Simply distinguishing the categories by color doesn't do enough to emphasize the upper/lower-bound nature of the categories for my purposes.
The following could work for you; I hope I understood the problem correctly.
First, let's generate some random data for the data frame, since no sample data was provided. The random numbers will make the plot ugly; I hope it looks better with real data:
dataset <- data.frame(
  independent = runif(100),
  dependent   = runif(100),
  catagory    = floor(runif(100) * 2))
Next, set the upper or lower end of each segment (the max or min of the dependent values), based on "catagory", for every case:
dataset$end[which(dataset$catagory == 0)] <- max(dataset$dependent)
dataset$end[which(dataset$catagory == 1)] <- min(dataset$dependent)
Now, we can plot data with geom_segment().
g <- ggplot(dataset, aes(independent, dependent))
g + geom_segment(aes(x = independent, y = dependent, xend = independent, yend = end, color = catagory))
Note that I also added theme_bw() + theme(legend.position = "none") to the plot, as it looked very strange with the random data.
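Putting the pieces together as one sketch (same random data as above; catagory is wrapped in factor() so the two bounds get discrete colours, and the theme tweaks mentioned above are written out):

library(ggplot2)

set.seed(1)
dataset <- data.frame(
  independent = runif(100),
  dependent   = runif(100),
  catagory    = floor(runif(100) * 2))

# Lower-bound points (catagory 0) get a line up to the top,
# upper-bound points (catagory 1) get a line down to the bottom
dataset$end <- ifelse(dataset$catagory == 0,
                      max(dataset$dependent),
                      min(dataset$dependent))

ggplot(dataset, aes(independent, dependent)) +
  geom_segment(aes(xend = independent, yend = end, colour = factor(catagory))) +
  geom_point(aes(colour = factor(catagory))) +
  theme_bw() +
  theme(legend.position = "none")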
