ggplot2 histogram with outlier bin from tibble?

Much like this previous question, I am trying to create some histograms with ggplot2 and would like to add a "greater than X" bin that collects all values above some cutoff.
In this particular case, I am dealing with a set of bike trip data with trip durations that vary from very small (under a minute) to very, very large (hours to days). I have filter()ed the set to remove the truly extraneous rows, but I would still like to show some of the outliers on the plot.
I had already found the solution given there, the package ggoutlier, but it evidently requires the data to be in a data.frame: I get the error message "is.data.frame(x) is not TRUE". This is correct, x is a tibble. Is there another solution that works with a tibble?
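One manual alternative that works on a tibble directly is to cap the variable before binning, so that everything above the cutoff lands in a single final bin. A minimal sketch, assuming a tibble called trips with a numeric duration column in minutes and a hypothetical cutoff of 120 (adjust names, binwidth and breaks to the real data):

library(dplyr)
library(ggplot2)

cutoff <- 120  # hypothetical "greater than X" threshold, in minutes
trips %>%
  mutate(duration = pmin(duration, cutoff)) %>%  # fold all outliers into the last bin
  ggplot(aes(duration)) +
  geom_histogram(binwidth = 10, boundary = 0) +
  scale_x_continuous(
    breaks = seq(0, cutoff, by = 30),
    labels = function(b) ifelse(b >= cutoff, paste0(">", cutoff), b)
  )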

Related

Determining whether values between rows match for certain columns

I am working with some observation data and have run into a bit of an issue beyond my current capabilities. I surveyed different polygons (the column "PolygonID" in the screenshot) for lizards two times during a survey season. I want to determine the total search effort (shown in the column "Effort") for each individual polygon within each survey round. Problem is, the software I was using to collect the data sometimes creates unnecessary repeats for polygons within a survey round. There is an example of this in the screenshot for the rows with PolygonID P3.
Most of the time this does not affect the effort calculations, because the start and end times for the repeated rows (the fields used to calculate effort) are the same, and I know how to filter the dataset so it only shows one line per polygon per survey. However, I have reason to be concerned that there might be some lines where the software glitched and assigned incorrect start and end times to one of the repeats. Is there a way I can test with R whether the start and end times match for any such repeats, rather than manually going through all the data?
Thank you!
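A minimal dplyr sketch of one way to check this, assuming a data frame called surveys with columns PolygonID, Round, StartTime and EndTime (the real column names may differ):

library(dplyr)

surveys %>%
  group_by(PolygonID, Round) %>%
  filter(n() > 1) %>%                                # keep only repeated polygon rows
  summarise(start_ok = n_distinct(StartTime) == 1,   # TRUE if all start times match
            end_ok   = n_distinct(EndTime) == 1,     # TRUE if all end times match
            .groups = "drop") %>%
  filter(!start_ok | !end_ok)                        # any rows left are suspect repeats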

R - select cases so that the mean of a variable is some given number

I previously worked on a project where we examined some sociological data. I did the descriptive statistics and after several months, I was asked to make some graphs from the stats.
I made the graphs, but something seemed odd, and when I compared the graphs to the numbers in the report, I noticed that they are different. Upon investigating further, I noticed that my cleaning code (which removed participants with duplicate IDs) now results in more rows, i.e. more participants with unique IDs than previously. I now have 730 participants, whereas previously there were 702. I don't know if this was due to package updates, and unfortunately I cannot post the actual data here because it is confidential, but I am trying to find out who these 28 participants are and what happened in the data.
Therefore, I would like to know if there is a method that allows the user to filter the cases so that the mean of some variables is a set number. Ideally it would be something like this, but of course I know that it's not going to work in this form:
iris %>%
  filter_if(mean(.$Petal.Length) == 1.3)
I know that this was an incorrect attempt but I don't know any other way that I would try this, so I am looking for help and suggestions.
I'm not convinced this is a tractable problem, but you may get somewhere by doing the following.
Firstly, work out what the sum of the variable was in your original analysis, and what it is now:
old_sum <- 702 * old_mean
new_sum <- 730 * new_mean
Now work out what the sum of the variable in the extra 28 cases would be:
extra_sum <- new_sum - old_sum
This allows you to work out the relative proportions of the sum of the variable from the old cases and from the extra cases. Put these proportions in a vector:
contributions <- c(extra_sum/new_sum, old_sum/new_sum)
Now, using the functions described in my answer to this question, you can find the optimal solution to partitioning your variable to match these two proportions. The rows which end up in the "extra" partition are likely to be the new ones. Even if they aren't the new ones, you will be left with a sample that has a mean that differs from your original by less than one part in a million.
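As a rough illustration of the partitioning idea, here is a simple greedy sketch (not the optimal method from the linked answer) that splits a numeric vector into groups approximating given proportions of its sum:

# Assign each value, largest first, to whichever group is currently
# furthest below its target share of the total.
partition_by_proportion <- function(x, props) {
  targets <- props * sum(x)
  group <- integer(length(x))
  sums <- numeric(length(props))
  for (i in order(x, decreasing = TRUE)) {
    g <- which.max(targets - sums)
    group[i] <- g
    sums[g] <- sums[g] + x[i]
  }
  group
}

# Hypothetical usage: group 1 approximates the "extra" share, group 2 the "old" share
# labels <- partition_by_proportion(df$score, c(extra_sum/new_sum, old_sum/new_sum))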

SOM Data preparation

Good day.
I am 3 months into R and RStudio but am getting the hang of things. I am implementing a SOM solution with 38k records/observations using Kohonen SuperSOM, following Self-Organising Maps for Customer Segmentation using R.
My data have no missing values, but there are almost 60 columns, many of them dummyVars (I received the data in this format)
I have removed the ONE char column (URL)
My Y column (as I understand it) is "shares" (how many times it was shared)
My data only consist of numerical data (dummyVars are of course 1 or 0)
I have centered and scaled my data (the entire dataFrame)
As per the example I followed, I did convert the entire DF to a matrix (a short sketch of these steps follows below)
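For reference, a minimal sketch of the prep and training steps listed above, assuming the prepared numeric data frame is called df (the grid size and iteration count are placeholders):

library(kohonen)

X <- scale(as.matrix(df))                         # center and scale, as a matrix
som_grid <- somgrid(xdim = 10, ydim = 10, topo = "hexagonal")
som_model <- som(X, grid = som_grid, rlen = 500)  # rlen = training iterations
plot(som_model, type = "changes")                 # the training progress curve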
My problem is that my SOM takes ages to train, even with multi-core processing, and my progress graph does not reach a nice flat-ish plateau: it does come down nicely but is still very erratic. All my other graphs are extremely high in population and there is no nice clustering. I have even tried 500 iterations with a 100x100 grid ;-(
I think/guess it is because of the huge number of columns, mostly dummyVars, e.g. dayOfWeek.Monday, dayOfWeek.Tuesday, category.LifeStile, category.Computers, etc.
What am I to do?
Should I convert the dummyVars back into another format? How, and why?
Please do not just give me a section of code, as I would like to understand the why behind the what.
Thanx

simple R Time Series function plotting

Thank you kindly for your time.
I'm merely trying to plot a simple time series data set, but am running into a number of basic issues (one of which I'll ask about here). For example, I have a plain-text file that starts with:
"x"
"1",2.731
"2",2.562
"3",2.632
"4",2.495
"5",1.978
...and so on...
So R reads it just fine, e.g. myfile = read.table("F:/Documents/myfile.txt", sep=""). However, the values seem to change after conversion with R's ts function, i.e.
myfile = ts(myfile,start=1,end=120,frequency=1)
plot(myfile, type="o",pch=22,lty=1,pty=2,xlab="Month",ylab="Values",main="My File")
So when plotted, the first value starts at 20+ for some reason, as opposed to 2+. Furthermore, R assumes that the y-axis goes from 1 to 120 (mirroring the x-axis), which is not the right scale (it should be roughly 0 through 10). In another data set I tried (using integers), everything was shifted upward by 1. In any event, I believe the issue is probably about how to properly identify the y-axis.
Any ideas on how to tackle this? Thanks!
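For what it's worth, a hedged sketch of reading and converting this file, assuming it really is comma-separated with a one-name header ("x") and quoted row numbers as shown above. Note that sep = "" splits on whitespace, so it does not split these lines on the commas, and that passing a whole data frame (index column included) to ts() can produce exactly this kind of distorted axis:

myfile <- read.table("F:/Documents/myfile.txt", sep = ",", header = TRUE)
# With one fewer header field than data columns, R takes column 1 as
# row names, leaving a single numeric column named x.
x <- ts(myfile$x, start = 1, frequency = 1)
plot(x, type = "o", pch = 22, lty = 1, xlab = "Month",
     ylab = "Values", main = "My File")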

JasperReports CategoryDataset has less data than expected?

I'm trying to develop a ChartCustomizer that takes the data from a chart and converts it into a histogram (because JasperReports does not directly support histograms). It's a fairly simple implementation with hard-coded intervals, etc., mostly as a proof of concept at this point.
The data I'm analyzing is HTTP response-time data of the form [date, response-time] and I have a CSV file with 18512 records in it. In my summary band, I have 3 items:
A text field dumping $V{REPORT_COUNT} (it reports 18512 in iReport's report preview)
A time series showing all the data points [date, response-time]
A category plot containing all the data points in a single series [category=$F{DATE}, value=$F{RESPONSE_TIME}]
I decided that the most straightforward way to build a histogram would be to use the Category plot because it had the right structure for the final histogram chart.
When the ChartCustomizer runs, it dumps out all kinds of good information about the data set, including the size. Strangely, the size is 10252: it's missing something like 8000 data points. I can't understand why the category plot would have fewer data points than the whole data set.
Any ideas?
Answering my own question in case others run across this foolish user error.
The problem was that CategoryDataset only allows one data point per "category", and in my case, "category" was a java.util.Date captured from the web server log. Apparently, nearly half of my dates were duplicates and so part of the data set overwrote the other half, leaving a subset of the data.
That should have been totally obvious to me at the outset, because that is exactly how a category dataset works.
Anyhow, simply changing the category plot series' "category expression" from $F{DATE} to $V{REPORT_COUNT} gave each datum a unique category, which makes everything work.
