How to do cumulative plots and which statistical test is better? - r

I need to make some cumulative plots in R, but I really don't know what to use. I have data like the sample below.
I want to make graphs like the ones in the linked images (links below the table). The first shows, for example, that 80% of the stops happen when Q is at some value X. The second, starting from the exceeded value (1 mg/L), shows the accumulation of stops over time. The third shows the accumulation of stops over time.
+---------------------------------------------------------+
| Date | Stops | Q (m3/s) | Concentration (mg/L) |
+---------------------------------------------------------+
| 1/01/2009 | no | 100 | 0,5 |
| 2/01/2009 | no | 98 | --- |
| 3/01/2009 | no | 80 | --- |
| 4/01/2009 | yes | 65 | 1,2 |
| 5/01/2009 | yes | 60 | --- |
| 6/01/2009 | yes | 67 | --- |
| 7/01/2009 | no | 75 | 0,6 |
| 8/01/2009 | no | 70 | --- |
| 9/01/2009 | no | 72 | --- |
| 10/01/2009| yes | 60 | 1,0 |
| 11/01/2009| yes | 63 | --- |
+---------------------------------------------------------+
[%stops and discharge][1] [cumulative stops with concentration][2] [cumulative stops over time][3]
The data I'm using is larger, of course; it covers 10 years.
After making the plots, I would also like to find the proportion of time in which a stop happened with low discharge or with exceeded concentrations. For example, in the 10-year period, 10 months represent stops.
I'm also looking at the relation of the stops to the other variables, but I'm not sure which test is best for that. I'm planning to use Pearson for the relation of discharge to concentration, although I'm not sure whether the discontinuous concentration data is a problem. For the relation of stops to concentration and discharge, I'm planning to use Spearman rank, but again, I'm not sure whether it is all right with a categorical variable (stops) and discontinuous data (concentration). What do you think is the best option for relating these variables?
[1]: https://i.stack.imgur.com/hYdkD.png [2]: https://i.stack.imgur.com/N0qNW.png [3]: https://i.stack.imgur.com/0nSrF.png
Thank you for your help!
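A minimal sketch of the numbers behind the first plot and the stop proportion, written in Python for illustration rather than R. It uses only the eleven sample rows from the table above, and the names `rows`, `stop_q`, `cumulative` and `q80` are assumptions made for this example. It sorts the discharges on stop days, computes the cumulative share of stops up to each discharge, and reports the share of days with a stop.

```python
# (date, stop, Q in m3/s) copied from the sample table in the question
rows = [
    ("1/01/2009", "no", 100), ("2/01/2009", "no", 98), ("3/01/2009", "no", 80),
    ("4/01/2009", "yes", 65), ("5/01/2009", "yes", 60), ("6/01/2009", "yes", 67),
    ("7/01/2009", "no", 75), ("8/01/2009", "no", 70), ("9/01/2009", "no", 72),
    ("10/01/2009", "yes", 60), ("11/01/2009", "yes", 63),
]

# Discharge on the days with a stop, lowest first
stop_q = sorted(q for _, stop, q in rows if stop == "yes")

# Cumulative share of stops at or below each discharge value
cumulative = [(q, (i + 1) / len(stop_q)) for i, q in enumerate(stop_q)]

# Smallest discharge at which at least 80% of the stops have accumulated
q80 = next(q for q, share in cumulative if share >= 0.8)

# Share of all observed days on which a stop happened
stop_share = len(stop_q) / len(rows)

print(cumulative)
print("80% of stops occur at Q <=", q80)
print(f"share of days with a stop: {stop_share:.1%}")
```

Plotting the pairs in `cumulative` as a step line gives the shape of the first linked image. The same cumulative counting applied to dates instead of discharge would give the third.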

Related

Addition of calculated field in rpivotTable

I want to create a calculated field to use with the rpivotTable package, similar to the functionality seen in Excel.
For instance, consider the following table:
+--------------+--------+---------+-------------+-----------------+
| Manufacturer | Vendor | Shipper | Total Units | Defective Units |
+--------------+--------+---------+-------------+-----------------+
| A | P | X | 173247 | 34649 |
| A | P | Y | 451598 | 225799 |
| A | P | Z | 759695 | 463414 |
| A | Q | X | 358040 | 225565 |
| A | Q | Y | 102068 | 36744 |
| A | Q | Z | 994961 | 228841 |
| A | R | X | 454672 | 231883 |
| A | R | Y | 275994 | 124197 |
| A | R | Z | 691100 | 165864 |
| B | P | X | 755594 | 302238 |
| . | . | . | . | . |
| . | . | . | . | . |
+--------------+--------+---------+-------------+-----------------+
(my actual table has many more columns, both dimensions and measures, time, etc. and I need to define multiple such "calculated columns")
If I want to calculate defect rate (which would be Defective Units/Total Units) and I want to aggregate by either of the first three columns, I'm not able to.
I tried assignment by reference (:=), but that still didn't seem to work and summed up defect rates (i.e., sum(Defective_Units/Total_Units)), instead of sum(Defective_Units)/sum(Total_Units):
myData[, Defect.Rate := Defective_Units / Total_Units]
This ended up giving me defect rates greater than 1. Is there anywhere I can declare a calculated field that is just a formula evaluated after aggregation?
You're lucky - the creator of pivottable.js foresaw cases like yours (and mine, earlier today) by implementing an aggregator called "Sum over Sum" and a few more, likewise, cf. https://github.com/nicolaskruchten/pivottable/blob/master/src/pivot.coffee#L111 and https://github.com/nicolaskruchten/pivottable/blob/master/src/pivot.coffee#L169.
So we'll use "Sum over Sum" as parameter "aggregatorName", and the columns whose quotient we want in the "vals" parameter.
Here's a meaningless usage example from the mtcars data for reproducibility:
library(rpivotTable)
data(mtcars)
rpivotTable(mtcars, rows = "gear", cols = c("cyl", "carb"),
            aggregatorName = "Sum over Sum",
            vals = c("mpg", "disp"),
            width = "100%", height = "400px")
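To see why summing a precomputed rate column misbehaves, here is a small arithmetic sketch in Python (used purely for illustration, not part of rpivotTable). It takes the three Manufacturer A / Vendor P rows from the question's table; the variable names are assumptions made for this example.

```python
# (total_units, defective_units) for Manufacturer A, Vendor P, from the question's table
rows = [
    (173247, 34649),   # Shipper X
    (451598, 225799),  # Shipper Y
    (759695, 463414),  # Shipper Z
]

# What a precomputed Defect.Rate column yields once the pivot sums it
sum_of_ratios = sum(defective / total for total, defective in rows)

# Aggregate first, then divide: sum(Defective_Units) / sum(Total_Units)
ratio_of_sums = sum(d for _, d in rows) / sum(t for t, _ in rows)

print(f"sum of ratios: {sum_of_ratios:.3f}")  # about 1.310, not a valid rate
print(f"ratio of sums: {ratio_of_sums:.3f}")  # about 0.523
```

The second figure is the quantity the question asks for, and it always stays between 0 and 1 as long as defective units never exceed total units.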

MS App Analytics timeline chart

I have data in the following format. Is it possible to create a timeline chart using App Analytics? I am trying to easily identify the calls that overlap in my data set.
| Start Time | End Time | Call Name | Duration
|----------------------|----------------------|---------------------|----------
| 17:41:30.5001642Z | 17:41:30.703291Z | CreateDraftEnvelope | 203
| 17:41:31.0711234Z | 17:41:31.0867211Z | CreateLock | 21
| 17:41:31.1189342Z | 17:41:31.1345349Z | addDocument | 17
| 17:41:31.1961265Z | 17:41:31.2117613Z | addDocument | 17
| 17:41:31.4243498Z | 17:41:31.4399953Z | addDocument | 19
| 17:41:31.5242518Z | 17:41:31.5398738Z | addDocument | 17
I am looking for a chart as follows
Unfortunately, Analytics does not today provide this visualization type. Could you please submit it on our UserVoice?

Get weight of words by occurrence

Maybe this belongs on math.stackexchange, but I am afraid that I would get a formula as an answer that I won't understand.
I have products in our database, and I have products from different suppliers in another table.
What I want is to pair these suppliers' products with our products where possible, or at least to get a list of the candidates where the match is strong.
I iterated through all the suppliers' products, split each product name on spaces, and stored each word in a table together with its occurrence count.
The table looks like this.
+--------+-------------+---------------+-------+
| id | word | originalWord | count |
+--------+-------------+---------------+-------+
| 220950 | Tracer | Tracer | 493 |
| 220951 | Destroyer | Destroyer | 3 |
| 220952 | Avago5050 | Avago5050 | 4 |
| 220953 | mouse | mouse | 2535 |
| 220954 | TRAMYS44916 | /TRAMYS44916/ | 2 |
| 220955 | GameZone | GameZone | 16 |
| 220956 | Enduro | Enduro | 3 |
| 220957 | AVAGO | AVAGO | 10 |
| 220958 | 5050 | 5050 | 4 |
| 220959 | optical | optical | 2370 |
| 220960 | USB | USB | 6160 |
+--------+-------------+---------------+-------+
and so on. Of course, in another table I stored which product id each word belongs to.
So what I want is to determine the weight of a word from its occurrence count.
As you can see, the word TRAMYS44916 occurs only twice, so it is almost certainly a part number and should be the heaviest word. Its weight should be 1.
Let's say the most frequent word is USB with 6160 occurrences, so its weight should be something like 0.01, I think.
What is the best way to get the weights of all the words?
There are other tables for other suppliers, so the distribution always changes.
This reminds me of Naive Bayes text classification. To determine which product a name belongs to, you can calculate the tf-idf of all the words.
Then, if you want to pair a product name from another supplier, you can split it into words again and select the product id with the highest term weight. You should probably specify a threshold for this, though, because in some cases the match will not be that clear.
tf-idf = ("number of word matches in product name"/"word count of product name") * log ("number of products" / "number of products that contains the word")
You can see how it is done in the example here (In your case the document will be the product full name): https://en.wikipedia.org/wiki/Tf–idf#Example_of_tf.E2.80.93idf
Example implementation in Java: https://guendouz.wordpress.com/2015/02/17/implementation-of-tf-idf-in-java/
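The formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three product names are hypothetical, assembled from words in the question's table, and the function name `tf_idf` is made up for this example. Each product name is treated as one document.

```python
import math
from collections import Counter

def tf_idf(product_names):
    """Return one {word: weight} dict per product name, using
    tf = count in name / words in name, idf = log(products / products with word)."""
    docs = [name.lower().split() for name in product_names]
    n_docs = len(docs)

    # Number of product names in which each word appears at least once
    doc_freq = Counter()
    for words in docs:
        doc_freq.update(set(words))

    weights = []
    for words in docs:
        counts = Counter(words)
        weights.append({
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

# Hypothetical supplier product names built from words in the question's table
names = [
    "Tracer Destroyer Avago5050 mouse optical USB",
    "GameZone Enduro mouse optical USB",
    "Tracer TRAMYS44916 mouse USB",
]

weights = tf_idf(names)
for word, weight in sorted(weights[2].items(), key=lambda item: -item[1]):
    print(f"{word:12s} {weight:.3f}")
```

In this toy data, "usb" and "mouse" appear in every name and get weight 0, while the part number "tramys44916" gets the highest weight in its product name, which matches the intuition in the question.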

Combine DataFrame rows into a new column

I am wondering if there is a simple way to achieve this in Julia besides iterating over the rows in a for-loop.
I have a table with two columns that looks like this:
| Name | Interest |
|------|----------|
| AJ | Football |
| CJ | Running |
| AJ | Running |
| CC | Baseball |
| CC | Football |
| KD | Cricket |
...
I'd like to create a table where each Name in the first column is matched with a combined Interest column as follows:
| Name | Interest |
|------|----------------------|
| AJ | Football, Running |
| CJ | Running |
| CC | Baseball, Football |
| KD | Cricket |
...
How do I achieve this?
UPDATE: OK, so after trying a few things, including print_joint and grpby, I realized that the easiest way to do this would be the by() function. I'm 99% there.
by(myTable, :Name, df->DataFrame(Interest = string(df[:Interest])))
This gives me my :Interest column as "UTF8String[\"Running\"]", and I can't figure out which method I should use instead of string() (or where to typecast) to get the desired ASCIIString output.

R: graphing upper and lower bounds with ggplot2

I have a dataset with three variables: one continuous independent variable, one continuous dependent variable, and a binary variable that categorizes how the measurements were taken. Using ggplot, I know that I can make a scatter plot with the points colored by category:
g <- ggplot(dataset, aes(independent, dependent))
g + geom_point(aes(color=catagory))
However, I want to know if there is a way to make a graph with a vertical line coming up from points of category 0 and a vertical line going down from points of category 1. It would look something like this:
- | | |
| | | |
| | | |
| | | |
- | | o |
| | | | |
| | o | | |
| | o | | | |
- | | | o | o
| | | | |
| o | | |
| | | |
+----|-----|-----|-----|-----|
The reason for wanting a plot like this is that one category represents an upper bound (the points with lines going downwards) and one represents a lower bound (the points with lines going upwards). Having these lines would make it easy to visualize the area which is between these bounds, and whether a function plotted on top could accurately represent the data:
- | | |
| | | |
| | | |
| | | |
- | | o | _____
| | | |_|__/
| | o |_/| |
| | o |__/| | |
- | | /| o | o
| _|_|/ | |
| / o | | |
|/ | | |
+----|-----|-----|-----|-----|
If there is any way to do this using ggplot or any other graphing library for R, I would love to know how. However, if it isn't possible, I'd be open to hearing other ways to represent this data. Simply distinguishing the categories by color doesn't do enough to emphasize the upper/lower bound nature of the categories for my purposes.
The following could work for you; I hope I understood the problem correctly.
First, generate some random data for the data frame, as no sample data was provided. The random numbers will make the plot ugly; I hope it will look better with real data:
dataset <- data.frame(
  independent = runif(100),
  dependent = runif(100),
  catagory = floor(runif(100) * 2))
Next, find the end point of each line (the maximum or minimum of the dependent values, i.e. the top or bottom of the plot) based on "catagory" for every case:
# catagory 0 is a lower bound (line goes up), catagory 1 an upper bound (line goes down)
dataset$end <- ifelse(dataset$catagory == 0, max(dataset$dependent), min(dataset$dependent))
Now we can plot the data with geom_segment():
library(ggplot2)
g <- ggplot(dataset, aes(independent, dependent))
g + geom_segment(aes(x = independent, y = dependent, xend = independent, yend = end, color = factor(catagory)))
Note that I also added + theme_bw() + theme(legend.position = "none") to the plot, as it looked very strange with random data.

Resources