histogram with varying bin widths - r

I am trying to replicate the concept of chart Fig 1 from the following paper (http://dx.doi.org/10.1016/j.envsci.2011.08.004):
It is a histogram whose bin widths vary dependent upon the value of x and whose height depends on variable y. The precise values in the chart are not of concern - rather, understanding how to reproduce it.
The following code creates a data frame with two characteristics (abatement and cost) for each measure. The width of each measure's bar is its abatement, and the height is its cost. The measures should be ordered from lowest to highest cost.
measure <- c(LETTERS)
abatement <- c(sample(1:100, 26))
cost <- c(sample(-100:250, 26))
data <- data.frame(cbind(measure, abatement, cost))

Technically speaking, this is a barplot and not a histogram (histograms specifically refer to barplots used to represent binned frequencies of continuous variables) ...
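Your cbind() is messing things up: it builds a character matrix first, so abatement and cost end up as factors (or as character strings in R 4.0 and later) instead of numbers. Build the data frame directly: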
data <- data.frame(measure, abatement, cost)
Here's a start:
with(dplyr::arrange(data,cost),
barplot(width=abatement,height=cost,space=0))
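A base-R sketch of the same idea, without the dplyr dependency. The seed and axis labels are assumptions added for illustration; the data are the random dummy values from the question, not the figures in the paper.

```r
set.seed(42)                              # assumed seed, only for reproducibility
data <- data.frame(
  measure   = LETTERS,
  abatement = sample(1:100, 26),
  cost      = sample(-100:250, 26)
)

ord <- data[order(data$cost), ]           # cheapest measure first

# Right edge of each bar on the x axis = cumulative abatement so far
ord$cum_abatement <- cumsum(ord$abatement)

# width sets the bar widths, space = 0 removes the gaps between bars
barplot(height = ord$cost, width = ord$abatement, space = 0,
        names.arg = ord$measure, cex.names = 0.6,
        xlab = "Measure (bar width = abatement)", ylab = "Cost")
```

Because space = 0, the bars tile the x axis from 0 to the total abatement, so cum_abatement gives the position where each bar ends.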

Maybe I don't fully understand the question, but if you want to order the data frame by cost, I think this could be a good solution:
data2 <- data[ order(cost), ]
Or you can use the dplyr package and its arrange function.

Related

Weight ggridges by another variable

I'm trying to visualize some data with a ridge plot, but I'm wondering if there's a way I can weight the densities of the ridges.
Basically I have the following:
set.seed(1)
example <- data.frame(matrix(nrow=100,ncol=3))
colnames(example) <- c("year","position","weight")
example$year <- as.character(rep(c(1,2,3,4,5),each=20) )
example$position <- runif(100,1,10)
example$weight <- sample(1:3,100,replace = T)
This is a sample of position in 5 different years. I want to plot how the distribution changes over time with a ridge plot, but the dataset also has a "weight" column, meaning that some samples count more than others. Is there a way to incorporate this into my ridge plot? And is there a way to make rows with a larger total sample*weight taller than rows with a smaller total, that is, to not normalize every year's height to one?
ggplot(example,aes(x=position,y=year))+
ggridges::geom_density_ridges()+
theme_classic()
I was thinking I could pipe the dataset through a step that repeats each row "weight" number of times, so that heavier rows get counted more often and change the density. I can't quite figure out how to do that, though. Also, in my real dataset the weights aren't integers, so I'm hoping for a better solution.
Or, is there another package/technique that might achieve that?
For this dataset we can repeat the rows based on weight column and then plot:
library(ggplot2)
library(ggridges)
example2 <- example[rep(seq_along(example$weight), example$weight), ]
ggplot(example2,aes(x=position,y=year))+
ggridges::geom_density_ridges()+
theme_classic()
#> Picking joint bandwidth of 1.02
However, if you have weights that are not integers, this will not work. There's an open issue on GitHub about this that you may want to look at.
Another idea would be to convert the weights in your original dataset to integers by rounding them to a certain number of digits and multiplying them by 10 to the power of your desired precision. You can then apply the previous solution to your actual dataset.
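A minimal sketch of that rounding idea. The non-integer weights and the digits value are made up for illustration; only the row-repetition step comes from the answer above.

```r
# Hypothetical non-integer weights; the original example uses integers 1:3
set.seed(1)
example <- data.frame(
  year     = as.character(rep(1:5, each = 20)),
  position = runif(100, 1, 10),
  weight   = round(runif(100, 0.5, 3), 1)          # one decimal place
)

digits <- 1                                        # assumed precision to keep
reps   <- as.integer(round(example$weight * 10^digits))  # e.g. 1.7 -> 17 copies

# Repeat each row index `reps` times, then subset the data frame
example2 <- example[rep(seq_len(nrow(example)), times = reps), ]
```

example2 can then be passed to the same ggplot() call as before. Note that multiplying every weight by 10^digits inflates the row count, which may affect the automatically chosen bandwidth, so it is worth comparing the result against a fixed bandwidth.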

How to structure data for R?

So... newbie R user here. I have some observations that I'd like to record using R and be able to add to later.
The items are sorted by weights, and the number at each weight recorded. So far what I have looks like this:
weights <- c(rep(171.5, times=1), rep(171.6, times=2), rep(171.7, times=4), rep(171.8, times=18), rep(171.9, times=39), rep(172.0, times=36), rep(172.1, times=34), rep(172.2, times=25))
There will be a total of 500 items being observed.
I'm going to be taking additional observations over time to (hopefully) see how the distribution of weights changes with use/wear. I'd like to be able to produce plots showing either stacked histograms or boxplots.
What would be the best way to format / store this data to facilitate this kind of use case? A matrix, dataframe, something else?
As other comments have suggested, the most versatile (and perhaps most useful) structure for your data would be a data frame, which works directly with library(ggplot2) for your future plotting needs, such as box plots and various histograms.
Toy example
The code below takes your weights vector from above, creates a data frame with some dummy IDs, and draws a box-and-whisker plot.
library(ggplot2)
IDs<-sample(LETTERS[1:5],length(weights),TRUE) #dummy ID values
df<-data.frame(ID=IDs,Weights=weights) #make data frame with your
#original `weights` vector
ggplot(data=df,aes(factor(ID),Weights))+geom_boxplot() #box-plot
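Since the observations arrive as counts per weight and more sessions will be added later, one possible layout is a long-format data frame with one row per (session, weight) pair. The sketch below uses base R only; the session labels and counts are invented for illustration.

```r
# Hypothetical counts per weight for two observation sessions
obs <- data.frame(
  session = rep(c("2013-01", "2013-06"), each = 4),
  weight  = rep(c(171.7, 171.8, 171.9, 172.0), times = 2),
  count   = c(4, 18, 39, 36,   9, 25, 33, 20)
)

# A later session is appended as extra rows
obs <- rbind(obs, data.frame(session = "2014-01",
                             weight  = c(171.7, 171.8),
                             count   = c(12, 30)))

# Expand to one row per item when a plotting function needs raw values
items <- obs[rep(seq_len(nrow(obs)), times = obs$count), c("session", "weight")]

boxplot(weight ~ session, data = items, ylab = "Weight")
```

Keeping the compact count table as the stored form and expanding it only for plotting avoids retyping rep() calls for every new session.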

construct a flat plot using third parameter or with three axis

I don't know how to plot this in a better way.
I have
df1 <- data.frame(x=c(1,3,5), y=c(2,4,6))
df2 <- data.frame(x=c(2,6,10,12), y=c(1,4,7,15))
These data frames have x as time and y as their own value.
The data frames have different numbers of elements.
I want to combine these data by x (time), and I need one of two ways to show them on one plot: a) show df1.y on the x axis of the plot to see the distribution of df2 by df1, so the two data frames are connected by time (x) but each is shown on its own axis; or b) show three axes, with the y axis for df1.y on the right side of the plot.
For clearer terminology, I will rename your example variables according to your sample plots.
df1 <- data.frame(time=c(1,3,5), memory=c(2,4,6))
df2 <- data.frame(time=c(2,6,10,12), threads=c(1,4,7,15))
Your first plot:
From your description, I assume that you want to do the following: For each available time value get the value of df1$memory and df2$threads. However, that value may not always be available. One suitable approach is to fill up missing values by linear interpolation. This may be done using the approx-function:
merged.time <- sort(unique(c(df1$time, df2$time)))
merged.data <- data.frame(time = merged.time,
memory = approx(df1$time, df1$memory, xout=merged.time)$y,
threads = approx(df2$time, df2$threads, xout=merged.time)$y
)
Note that approx(...)$y just extracts the interpolated values.
Plotting may now be done using standard plotting commands or, as your tags suggest, using ggplot2:
ggplot(data=merged.data, aes(x=memory, y=threads)) + geom_line()
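One caveat of the interpolation step: with its default rule = 1, approx() returns NA for time values outside the range of the input series, so rows where either series is missing are dropped from the line plot. The sketch below shows this for the memory series; using rule = 2 to hold the end values is only an assumption about what might be wanted, not part of the answer above.

```r
df1 <- data.frame(time = c(1, 3, 5),      memory  = c(2, 4, 6))
df2 <- data.frame(time = c(2, 6, 10, 12), threads = c(1, 4, 7, 15))

merged.time <- sort(unique(c(df1$time, df2$time)))   # 1 2 3 5 6 10 12

# Default rule = 1: NA outside the observed time range of df1 (1 to 5)
mem_na <- approx(df1$time, df1$memory, xout = merged.time)$y

# rule = 2: hold the nearest end value outside the range
mem_held <- approx(df1$time, df1$memory, xout = merged.time, rule = 2)$y

print(data.frame(time = merged.time, mem_na, mem_held))
```

Whether holding the last value is meaningful depends on the data; for a quantity like memory use it may be safer to keep the NA values and plot only the overlapping time range.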
Your second plot
... is not possible with ggplot2. That is for numerous reasons, for example see here.

What does the "jitter" function do in R?

According to the documentation, the explanation for the jitter function is "Add a small amount of noise to a numeric vector."
What does this mean?
Is a random number associated with each number in the vector and added to it?
Jittering does indeed mean adding random noise to a vector of numeric values. By default, the jitter function does this by drawing samples from a uniform distribution. If the amount parameter is not provided, the range of the noise is chosen according to the data.
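A small sketch of that behaviour. The input vector and seed are made up; the expectation that the default noise stays within 0.2 follows from the documented default for amount (factor/5 times roughly the smallest difference between x values), which for integers spaced 1 apart works out to 0.2.

```r
set.seed(1)                            # assumed seed, only for reproducibility
x <- rep(1:3, each = 5)

j_default <- jitter(x)                 # amount chosen from the data
j_wide    <- jitter(x, amount = 0.4)   # uniform noise in [-0.4, 0.4]

# Size of the noise actually added in each case
range(j_default - x)
range(j_wide - x)
```

Each element gets its own independent random offset, so repeated values such as the five 1s end up at five different positions.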
I think the term 'jittering' also covers distributions other than the uniform, and it is typically used to better visualize overlapping values, such as integer covariates. This helps show where the density of observations is high. It is good practice to mention in the figure legend if some of the values have been jittered, even if it is obvious. Here is an example visualization using the jitter function as well as jittering with a normal distribution, for which I arbitrarily chose sd=0.1:
n <- 500
set.seed(1)
dat <- data.frame(integer = rep(1:3, each=n), continuous = c(rnorm(n, mean=1), rnorm(n, mean=2), rnorm(n, mean=3))^2)
par(mfrow=c(3,1))
plot(dat, main="No jitter for x-axis", xlab="Integer", ylab="Continuous")
plot(jitter(dat[,1]), dat[,2], main="Jittered x-axis (uniform distr.)", xlab="Integer", ylab="Continuous")
plot(dat[,1]+rnorm(3*n, sd=0.1), dat[,2], main="Jittered x-axis (normal distr.)", xlab="Integer", ylab="Continuous")
A really good explanation of the jitter effect and why it is necessary can be found in the Swirl course on Regression Models in R.
It takes Sir Francis Galton's data on the relationship between the heights of parents and their children and plots it first without jitter and then with jitter.
This is the one without jitter (plot(child ~ parent, galton)):
This is the one with jitter (please ignore the regression lines) (plot(jitter(child,4) ~ parent,galton)):
The course says that without jitter, many people have the same height, so points fall on top of each other, which is why some of the circles in the first plot look darker than others. By using R's jitter function on the children's heights, we can spread out the data to simulate measurement error and make high-frequency heights more visible.

R histogram showing time spent in each bin

I'm trying to create a plot similar to the ones here:
Basically I want a histogram where each bin shows how much time was spent in that range of cadence (e.g. 1 hour in 0-20 rpm, 3 hours in 21-40 rpm, etc.).
library("rjson") # 3rd party library, so: install.packages("rjson")
# Load data from Strava API.
# Ride used for example is http://app.strava.com/rides/13542320
url <- "http://app.strava.com/api/v1/streams/13542320?streams[]=cadence,time"
d <- fromJSON(paste(readLines(url)))
Each value in d$cadence (rpm) is paired with the same index in d$time (the number of seconds from the start).
The values are not necessarily uniform (as can be seen if you compare plot(x=d$time, y=d$cadence, type='l') with plot(d$cadence, type='l') )
If I do the simplest possible thing:
hist(d$cadence)
...this produces something very close, but the Y value is "frequency" instead of time, and it ignores the time between data points (so the 0 rpm segment in particular will be underrepresented).
You need to create a new column to account for the time between samples.
I prefer data.frames to lists for this kind of thing, so:
d <- as.data.frame(fromJSON(paste(readLines(url))))
d$sample.time <- 0
d$sample.time[2:nrow(d)] <- d$time[2:nrow(d)]-d$time[1:(nrow(d)-1)]
Now that you've got your sample times, you can simply "repeat" each cadence measure as many times as its sample time and plot a histogram of the result:
hist(rep(x=d$cadence, times=d$sample.time),
main="Histogram of Cadence", xlab="Cadence (RPM)",
ylab="Time (presumably seconds)")
There's bound to be a more elegant solution that wouldn't fall apart for non-integer sample times, but this works with your sample data.
EDIT: re: the more elegant, generalized solution, you can deal with non-integer sample times with something like new.d <- aggregate(sample.time~cadence, data=d, FUN=sum), but then the problem becomes plotting a histogram for something that looks like a frequency table, but with non-integer frequencies. After some poking around, I'm coming to the conclusion you'd have to roll-your-own histogram for this case by further aggregating the data into bins, and then displaying them with a barchart.
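A rough sketch of that roll-your-own approach: bin the cadence values with cut(), sum the sample times per bin, and draw the totals as a bar chart. The data below are made up to stand in for the Strava stream, and the 20 rpm bin width is an arbitrary choice.

```r
# Made-up stream standing in for the Strava data (hypothetical values)
d <- data.frame(
  time    = c(0, 1, 2.5, 4, 4.5, 7, 10, 12),   # seconds from start
  cadence = c(0, 35, 62, 80, 85, 90, 15, 0)    # rpm
)
d$sample.time <- c(0, diff(d$time))             # seconds since previous sample

# Assign each sample to a cadence bin
breaks <- seq(0, 100, by = 20)
d$bin  <- cut(d$cadence, breaks = breaks, include.lowest = TRUE)

# Total time per bin; empty bins come back as NA, so set them to 0
time.in.bin <- tapply(d$sample.time, d$bin, sum)
time.in.bin[is.na(time.in.bin)] <- 0

barplot(time.in.bin, space = 0,
        xlab = "Cadence (RPM)", ylab = "Time (seconds)")
```

This works with non-integer sample times because the times are summed rather than used as repeat counts. Like the answer above, it attributes each interval to the cadence recorded at the end of that interval.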

Resources