R: graphing upper and lower bounds with ggplot2

I have a dataset with three variables: one continuous independent variable, one continuous dependent variable, and a binary variable that categorizes how the measurements were taken. Using ggplot, I know that I can make a scatter plot with the points colored by the category:
g <- ggplot(dataset, aes(independent, dependent))
g + geom_point(aes(color = category))
However, I want to know if there is a way to make a graph where there is a vertical line coming up from points of category 0 and a vertical line going down from points of category 1. It would look something like this:
[ASCII sketch: a scatter plot in which each category-0 point has a vertical line extending upward from it, and each category-1 point has a vertical line extending downward from it]
The reason for wanting a plot like this is that one category represents an upper bound (the points with lines going downwards) and the other represents a lower bound (the points with lines going upwards). Having these lines would make it easy to visualize the area between these bounds, and to judge whether a function plotted on top could accurately represent the data:
[ASCII sketch: the same plot with a curve drawn through the corridor left open between the upper and lower bounds]
If there is any way to do this using ggplot or any other graphing library for R, I would love to know how. However, if it isn't possible, I'd be open to hearing other ways to represent this data. Simply distinguishing the categories by color doesn't do enough to emphasize the upper/lower-bound nature of the categories for my purposes.

The following could work for you; I hope I understood the problem well.
First, generate some random data for the data frame, as no sample data was provided. The random numbers will make the plot ugly; I hope it will look better with real data:
dataset <- data.frame(
  independent = runif(100),
  dependent = runif(100),
  category = floor(runif(100) * 2))
Next, find the upper or lower edge of the plot (the max/min of the dependent values) based on category for every case:
dataset$end[which(dataset$category == 0)] <- max(dataset$dependent)
dataset$end[which(dataset$category == 1)] <- min(dataset$dependent)
Now we can plot the data with geom_segment():
g <- ggplot(dataset, aes(independent, dependent))
g + geom_segment(aes(xend = independent, yend = end, color = factor(category)))
Note that I also added + theme_bw() + theme(legend.position = "none") to the plot, as it looked very strange with random data.
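If you would rather have the segments run all the way to the panel edges instead of stopping at the observed min/max, ggplot2 also accepts Inf and -Inf in positional aesthetics, so the helper column end is not needed. A minimal sketch of that variant, using the same simulated dataset (my addition, not part of the original answer):

library(ggplot2)

# Lower bounds (category 0) get a segment to the top of the panel,
# upper bounds (category 1) get one to the bottom.
ggplot(dataset, aes(independent, dependent)) +
  geom_segment(aes(xend = independent,
                   yend = ifelse(category == 0, Inf, -Inf),
                   color = factor(category))) +
  geom_point(aes(color = factor(category))) +
  theme_bw() +
  theme(legend.position = "none")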

Related

Is it possible to plot two variables using a third one as filter in R?

I am trying to create a plot of two variables (DATE and INT_RATE), using the content of a third variable, GRADE, as a filter.
The problem is that I can't figure out how to use the variable GRADE to filter the rows.
In the section below I provide a sample of the starting data as well as a sketch of the plot I'm trying to achieve.
Thanks in advance.
STARTING DATA
| DATE  | INT_RATE | GRADE |
|-------|----------|-------|
| 1-jan | 5%       | A     |  <-- A
| 5-feb | 3%       | B     |
| 9-feb | 2%       | D     |
| 1-apr | 3%       | A     |  <-- A
| 5-jun | 5%       | A     |  <-- A
| 1-aug | 3%       | G     |
| 1-sep | 2%       | E     |
| 3-nov | 1%       | C     |
| 8-dec | 8%       | A     |  <-- A
| .     | .        | .     |
| .     | .        | .     |
| .     | .        | .     |
And this is the kind of graph I would like to achieve, which is a very basic one, except for the filtering work needed beforehand.
WANTED RESULT:
GRADE "A"
INT_RATE
|
|
8%-| •
| ̷
| ̷
| ̷
5%-| • •
| \ /
| \ /
| \ /
| \ /
3%-| •
|
|
|
|
––––––––––––––––––––––––––––––––––-–––>
| ˆ ˆ ˆ ˆ DATE
|1-jan 1-apr 5-jun 8-dec
EDIT 1:
Following the helpful advice from @apax, I managed to get a plot, but the result is not satisfying because of the weird way R displays it (I think it might be related to the fact that the dataset in question is very large: 800k rows). Do you have any suggestions?
By the way, this solved my problem:
plot(INT_RATE ~ DATE, data = filter(df, GRADE == "A"))
I am also uploading a PNG of the malformed chart.
Thanks again to all.
Here's a quick one-liner solution, where I assume your data is stored in an object named df:
library(dplyr) ## For the filter() function below
plot(INT_RATE ~ DATE, data = filter(df, GRADE == "A"))
You could use ggplot2 and facet_wrap(...)
library(ggplot2)
ggplot(mtcars, aes(x = mpg, y = disp)) +
  geom_point() +
  facet_wrap(~ cyl)
For your data
ggplot(data, aes(x = DATE, y = INT_RATE)) +
  geom_line() +
  facet_wrap(~ GRADE)
P.S. This gives separate graphs for all grades, but that should not be a problem.
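If you want only grade A on a single panel, as in the sketch in the question, you can also filter first and pipe straight into ggplot2. A sketch, assuming df has the columns shown above; converting INT_RATE from strings like "5%" to numbers is my assumption, since the sample shows formatted text, and DATE would need to be parsed with as.Date() for the points to be ordered chronologically:

library(dplyr)
library(ggplot2)

df %>%
  filter(GRADE == "A") %>%                               # keep only grade-A rows
  mutate(rate = as.numeric(sub("%", "", INT_RATE))) %>%  # "5%" -> 5
  ggplot(aes(x = DATE, y = rate, group = 1)) +           # group = 1 in case DATE is character
  geom_point() +
  geom_line()

As for the malformed chart in EDIT 1: with 800k rows the usual culprit is overplotting, and something like geom_point(alpha = 0.1), or aggregating to one value per date before plotting, tends to help.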

How to do cumulative plots and which statistical test is better?

I need to do some cumulative plots in R, but I really don't know what to use. I have data like the one below.
I want to do some graphs like those shown in the linked images below. The first should show, for example, that 80% of the stops happen when Q is at some value X. The second, starting from the exceedance value (1 mg/L), should show the accumulation of stops against concentration. The third should show the accumulation of stops over time.
+------------+-------+----------+----------------------+
| Date       | Stops | Q (m3/s) | Concentration (mg/L) |
+------------+-------+----------+----------------------+
| 1/01/2009  | no    | 100      | 0,5                  |
| 2/01/2009  | no    | 98       | ---                  |
| 3/01/2009  | no    | 80       | ---                  |
| 4/01/2009  | yes   | 65       | 1,2                  |
| 5/01/2009  | yes   | 60       | ---                  |
| 6/01/2009  | yes   | 67       | ---                  |
| 7/01/2009  | no    | 75       | 0,6                  |
| 8/01/2009  | no    | 70       | ---                  |
| 9/01/2009  | no    | 72       | ---                  |
| 10/01/2009 | yes   | 60       | 1,0                  |
| 11/01/2009 | yes   | 63       | ---                  |
+------------+-------+----------+----------------------+
[%stops and discharge][1] [cumulative stops with concentration][2] [cumulative stops over time][3]
The data I'm using is of course bigger; it covers 10 years.
After doing the plots, I would also like to find the proportion of time during which stops happened with low discharge, or with exceeded concentrations. For example, in the 10-year period, 10 months represent stops.
I'm also looking at the relation of the stops to the other variables, but I'm not sure which test is best for that. I'm planning to use Pearson correlation for the relation of discharge with concentration, although I'm not sure whether the gaps in the concentration data are a problem. For the relation of stops with concentration and discharge, I'm planning to use Spearman rank correlation, but again, I'm not sure whether it is appropriate for a categorical variable (stops) and data with gaps (concentration). What do you think is the best option for relating these variables?
[1]: https://i.stack.imgur.com/hYdkD.png
[2]: https://i.stack.imgur.com/N0qNW.png
[3]: https://i.stack.imgur.com/0nSrF.png
Thank you for your help!
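For the first and third plots, base R's ecdf() and cumsum() are a natural starting point. A sketch, assuming the table above has been read into a data frame d with those column names (the decimal commas suggest importing with read.csv2() or dec = ","):

# Assumed columns: Date, Stops ("yes"/"no"), Q, Concentration
d$Date <- as.Date(d$Date, format = "%d/%m/%Y")
d$stop <- d$Stops == "yes"

# Plot 1: cumulative distribution of discharge on stop days;
# read off the Q value below which e.g. 80% of the stops occur.
plot(ecdf(d$Q[d$stop]),
     xlab = "Q (m3/s)", ylab = "Proportion of stops",
     main = "Q on stop days")

# Plot 3: accumulation of stops over time.
plot(d$Date, cumsum(d$stop), type = "s",
     xlab = "Date", ylab = "Cumulative number of stops")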

Addition of calculated field in rpivotTable

I want to create a calculated field to use with the rpivotTable package, similar to the functionality found in Excel.
For instance, consider the following table:
+--------------+--------+---------+-------------+-----------------+
| Manufacturer | Vendor | Shipper | Total Units | Defective Units |
+--------------+--------+---------+-------------+-----------------+
| A | P | X | 173247 | 34649 |
| A | P | Y | 451598 | 225799 |
| A | P | Z | 759695 | 463414 |
| A | Q | X | 358040 | 225565 |
| A | Q | Y | 102068 | 36744 |
| A | Q | Z | 994961 | 228841 |
| A | R | X | 454672 | 231883 |
| A | R | Y | 275994 | 124197 |
| A | R | Z | 691100 | 165864 |
| B | P | X | 755594 | 302238 |
| . | . | . | . | . |
| . | . | . | . | . |
+--------------+--------+---------+-------------+-----------------+
(my actual table has many more columns, both dimensions and measures, time, etc. and I need to define multiple such "calculated columns")
If I want to calculate the defect rate (which would be Defective Units / Total Units) and aggregate by any of the first three columns, I'm not able to.
I tried assignment by reference (:=), but that still didn't work: it summed up the defect rates (i.e., sum(Defective_Units/Total_Units)) instead of computing sum(Defective_Units)/sum(Total_Units):
myData[, Defect.Rate := Defective_Units / Total_Units]
This ended up giving me defect rates greater than 1. Is there any way I can declare a calculated field that is just a formula evaluated after aggregation?
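(In other words, the desired quantity per group is the ratio of sums. Computed directly in data.table, outside the pivot widget, that would be something like this sketch, assuming myData is a data.table:)

library(data.table)

# Aggregate first, then divide the sums.
myData[, .(Defect.Rate = sum(Defective_Units) / sum(Total_Units)),
       by = Manufacturer]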
You're in luck: the creator of pivottable.js foresaw cases like yours (and mine, earlier today) by implementing an aggregator called "Sum over Sum", along with a few similar ones; cf. https://github.com/nicolaskruchten/pivottable/blob/master/src/pivot.coffee#L111 and https://github.com/nicolaskruchten/pivottable/blob/master/src/pivot.coffee#L169.
So we'll pass "Sum over Sum" as the "aggregatorName" parameter, and the columns whose quotient we want as the "vals" parameter.
Here's a usage example on the mtcars data (numerically meaningless, but reproducible):
require(rpivotTable)
data(mtcars)
rpivotTable(mtcars, rows = "gear", cols = c("cyl", "carb"),
            aggregatorName = "Sum over Sum",
            vals = c("mpg", "disp"),
            width = "100%", height = "400px")

R apply script output in different formats for similar inputs

I'm using nested apply calls to get a list of p-values from cor.test between every pair of columns from two tables.
hel_plist <- apply(hel, 2, function(x) {
  apply(otud, 2, function(y) {
    p <- cor.test(x, y, method = "spearman", exact = FALSE)$p.value
    if (p < 0.05) p
  })
})
The otud data.frame is 90x11 (90 rows, 11 columns, i.e. dim(otud) is 90 11) and will be used with different data.frames.
bc and hel are both 90x2 data.frames, so for each I get 2*11 = 22 p-values out of the functions:
bc_plist <- apply(bc, 2, function(x) {
  apply(otud, 2, function(y) {
    p <- cor.test(x, y, method = "spearman", exact = FALSE)$p.value
    if (p < 0.05) p
  })
})
hel_plist <- apply(hel, 2, function(x) {
  apply(otud, 2, function(y) {
    p <- cor.test(x, y, method = "spearman", exact = FALSE)$p.value
    if (p < 0.05) p
  })
})
For bc I get an output with dim = NULL: a list of elements of the form otuname$bcname$p.value (the format I have always gotten from these scripts and am happy with).
But for hel I get an output of dim 11 2: an 11x2 table with the p-values written inside.
Shortened examples of the output:
hel_plist
+--------+--------------+--------------+
| | axis1 | axis2 |
+--------+--------------+--------------+
| Otu037 | 1.126362e-18 | 0.01158251 |
| Otu005 | 3.017458e-2 | NULL |
| Otu068 | 0.00476002 | NULL |
| Otu070 | 1.27646e-15 | 5.252419e-07 |
+--------+--------------+--------------+
bc_plist
$axis1
$axis1$Otu037
[1] 1.247717e-06
$axis1$Otu005
[1] 1.990313e-05
$axis1$Otu068
[1] 5.664597e-07
Why is it like that, when the input formats are all the same? (Shortened examples:)
bc
+-------+-----------+-----------+
| group | axis1 | axis2 |
+-------+-----------+-----------+
| 1B041 | 0.125219 | 0.246319 |
| 1B060 | -0.022412 | -0.030227 |
| 1B197 | -0.088005 | -0.305351 |
| 1B222 | -0.119624 | -0.144123 |
| 1B227 | -0.148946 | -0.061741 |
+-------+-----------+-----------+
hel
+-------+---------------+---------------+
| group | axis1 | axis2 |
+-------+---------------+---------------+
| 1B041 | -0.0667782322 | -0.1660606406 |
| 1B060 | 0.0214470932 | -0.0611351008 |
| 1B197 | 0.1761876858 | 0.0927570627 |
| 1B222 | 0.0681058251 | 0.0549292399 |
| 1B227 | 0.0516864361 | 0.0774155225 |
| 1B235 | 0.1205676221 | 0.0181712761 |
+-------+---------------+---------------+
How could I force my scripts to always produce "flat" outputs as in the case of bc?
OK, the different outputs are caused by the NULL results from the conditional function in the bc_plist case. If I modify the code to replace possible NULLs with NAs, I get 2D tables in every case.
So, to keep things consistent:
bc_nmds_plist <- apply(bc_nmds, 2, function(x) {
  apply(stoma_otud, 2, function(y) {
    p <- cor.test(x, y, method = "spearman", exact = FALSE)$p.value
    if (p < 0.05) p else NA
  })
})
And I get a 2D table out for bc_nmds_plist too.
So I guess this can be called solved, as I now have a piece of code that produces predictable output on any correct input.
If anyone has an idea of how to force the output to conform to the previous bc_plist format instead, I would still be interested, as I actually prefer that form:
$axis1
$axis1$Otu037
[1] 1.247717e-06
$axis1$Otu005
[1] 1.990313e-05
$axis1$Otu068
[1] 5.664597e-07
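For what it's worth, the simplification itself is the cause: apply() collapses the inner results into a matrix whenever they all have the same length, which happens exactly when no NULLs are produced. Iterating with lapply() over column names never simplifies, so the nested-list form survives no matter how many p-values pass the threshold. A sketch of that approach (my own, assuming bc and otud are data frames as above):

# Always returns a list of lists, named like bc_plist above.
bc_plist <- lapply(setNames(colnames(bc), colnames(bc)), function(i) {
  inner <- lapply(setNames(colnames(otud), colnames(otud)), function(j) {
    p <- cor.test(bc[[i]], otud[[j]], method = "spearman", exact = FALSE)$p.value
    if (p < 0.05) p else NULL
  })
  Filter(Negate(is.null), inner)  # drop the non-significant entries entirely
})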

Combine DataFrame rows into a new column

I am wondering if there is a simple way to achieve this in Julia, besides iterating over the rows in a for-loop.
I have a table with two columns that looks like this:
| Name | Interest |
|------|----------|
| AJ | Football |
| CJ | Running |
| AJ | Running |
| CC | Baseball |
| CC | Football |
| KD | Cricket |
...
I'd like to create a table where each Name in first column is matched with a combined Interest column as follows:
| Name | Interest |
|------|----------------------|
| AJ | Football, Running |
| CJ | Running |
| CC | Baseball, Football |
| KD | Cricket |
...
How do I achieve this?
UPDATE: OK, so after trying a few things, including print_joint and grpby, I realized that the easiest way to do this would be the by() function. I'm 99% there.
by(myTable, :Name, df->DataFrame(Interest = string(df[:Interest])))
This gives me my :Interest column as "UTF8String[\"Running\"]", and I can't figure out which method I should use instead of string() (or where to typecast) to get the desired ASCIIString output.
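A likely fix (my suggestion, in the same older DataFrames.jl syntax used above): string() renders the whole array, including its type prefix, whereas join() concatenates the elements into one comma-separated string:

# join() collapses the grouped Interest values into "Football, Running" etc.
by(myTable, :Name, df -> DataFrame(Interest = join(df[:Interest], ", ")))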
