Weight ggridges by another variable - r

I'm trying to visualize some data with a ridge plot, but I'm wondering if there's a way I can weight the densities of the ridges.
Basically I have the following:
set.seed(1)
example <- data.frame(matrix(nrow=100,ncol=3))
colnames(example) <- c("year","position","weight")
example$year <- as.character(rep(c(1,2,3,4,5),each=20) )
example$position <- runif(100,1,10)
example$weight <- sample(1:3,100,replace = T)
A sample of position in 5 different years. I want to plot the distribution change over time with a ridge plot, but in the dataset, there is also a column for "weight," meaning that some samples counted more than others. Is there a way to incorporate this into my ridges distribution plot? And also is there a way to make rows with more sample*weight be taller than rows with less? So not normalize every year's height to one?
ggplot(example,aes(x=position,y=year))+
ggridges::geom_density_ridges()+
theme_classic()
I was thinking I could try to pipe the dataset to repeat rows for number of weight value that they have, and so they would get counted more than x number of times (or, "weight" number of times) and change the density. Can't quite figure out how to do that though. Also, in my dataset, the weights aren't integers, so I'm hoping for a better solution.
Or, is there another package/technique that might achieve that?

For this dataset we can repeat the rows based on weight column and then plot:
library(ggplot2)
library(ggridges)
example2 <- example[rep(seq_along(example$weight), example$weight), ]
ggplot(example2,aes(x=position,y=year))+
ggridges::geom_density_ridges()+
theme_classic()
#> Picking joint bandwidth of 1.02
However, if you have weights that are not integers, this will not work. There is an open issue on GitHub about this that may be worth a look.
Another idea would be to turn the weights in your original dataset into integers by rounding them to a certain number of digits and multiplying them by 10 to the power of that precision. You can then apply the previous solution to your actual dataset.
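For non-integer weights, one possible route is to compute the weighted densities yourself and hand them to ggridges as precomputed heights. The following is only a sketch under some assumptions: the non-integer weights are made up, the shared bandwidth from bw.nrd0 and the 256-point grid are arbitrary choices, and multiplying each density by the year's total weight is one way to make heavier years taller rather than normalizing every row.

```r
set.seed(1)
example <- data.frame(
  year     = as.character(rep(1:5, each = 20)),
  position = runif(100, 1, 10),
  weight   = runif(100, 0.5, 3)  # made-up non-integer weights
)

# One shared bandwidth and evaluation grid for all years (arbitrary choices).
bw        <- bw.nrd0(example$position)
grid_from <- min(example$position) - 3 * bw
grid_to   <- max(example$position) + 3 * bw
n_grid    <- 256

ridges <- do.call(rbind, lapply(split(example, example$year), function(d) {
  # density() expects weights that sum to 1
  dens <- density(d$position, weights = d$weight / sum(d$weight),
                  bw = bw, from = grid_from, to = grid_to, n = n_grid)
  # rescale so the area under each curve equals the year's total weight
  data.frame(year = d$year[1], position = dens$x,
             height = dens$y * sum(d$weight))
}))

# Plot only if the packages are installed; stat = "identity" uses the
# precomputed heights instead of estimating densities again.
if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("ggridges", quietly = TRUE)) {
  p <- ggplot2::ggplot(ridges,
                       ggplot2::aes(x = position, y = year, height = height)) +
    ggridges::geom_density_ridges(stat = "identity", scale = 1) +
    ggplot2::theme_classic()
  print(p)
}
```

Because the heights are not normalized per row, a year with a larger total weight gets a proportionally larger ridge.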

Related

How to find outliers in R for MDS?

I'm making a MDS plot with Salary and WAR data from baseball players. I have attached it here. Because there are so many data points I just want to label the Baseball Player names for the outlier data points. Any idea on how to do so?
Well, first it depends on the method you want to detect the outliers. I create a reproducible example, since you did not provide a dataset. I'm using the mtcars dataset, which is a regular dataset already implemented in R.
For outlier detection I'm using Cook's distance (you can check about that in a post here)
First I set up the data and calculate Cook's distance (cooksd) with cooks.distance from stats for a linear model of hp on qsec
df<-cbind( mtcars[,c("hp", "qsec")], cooks.distance(lm(mtcars$hp~mtcars$qsec)))
names(df)[3]<-"cooksd"
plot(cooks.distance(lm(mtcars$hp~mtcars$qsec)))
We see, that we got two outliers. I'm using the threshold of 4*mean(cooksd) and labeling those in my dataset with following loop. Those are considered as outliers.
for (i in 1:nrow(df)){
if (df$cooksd[i] > 4* mean(df$cooksd)){
df$outlier[i] =1
}else {df$outlier[i]=0}
}
Last step is to plot the values and only give a label to those, which are above the threshold. I'm using here the row.names, but you can also use any column, just filter it as the other datapoints.
plot(df$hp, df$qsec)
# add labels only for the rows flagged as outliers
text(df[df$outlier==1,"hp"], df[df$outlier==1,"qsec"], row.names(df[df$outlier == 1,]), cex=0.6, pos=4, col="red")
A big problem with outlier detection is that there are different methods, and the interpretation must be adjusted for every case. Which method or threshold you should use won't be answered here, since that is more a discussion for stackexchange. You can use cooksd, stdev or whatever seems most useful to you. But the solution stays the same:
use your method to find the outliers
label them in your dataset
just implement the text in your plot, which you labelled.
For your three clusters, you just have to fit an lm model for every cluster, label the outliers and then join all three results.
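A minimal sketch of that per-cluster variant, assuming the cluster membership is stored in a column. Here mtcars$cyl merely stands in for the three clusters, and the 4 * mean threshold is the same rule of thumb as above, applied within each cluster:

```r
df <- mtcars[, c("hp", "qsec", "cyl")]  # cyl is a stand-in for the cluster id
df$outlier <- FALSE

for (cl in unique(df$cyl)) {
  idx <- df$cyl == cl
  # Cook's distance from a model fitted on this cluster only
  cd <- cooks.distance(lm(hp ~ qsec, data = df[idx, ]))
  df$outlier[idx] <- cd > 4 * mean(cd)
}

plot(df$hp, df$qsec, col = factor(df$cyl))
if (any(df$outlier)) {
  # label only the flagged points
  text(df$hp[df$outlier], df$qsec[df$outlier],
       row.names(df)[df$outlier], cex = 0.6, pos = 4, col = "red")
}
```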

histogram with varying bin widths

I am trying to replicate the concept of chart Fig 1 from the following paper (http://dx.doi.org/10.1016/j.envsci.2011.08.004):
It is a histogram whose bin widths vary dependent upon the value of x and whose height depends on variable y. The precise values in the chart are not of concern - rather, understanding how to reproduce it.
The following code creates a data frame with two characteristics (abatement and cost) for each measure. the width of measure is the abatement, and the height of measure is cost. The measure should be ordered from least cost to highest cost.
measure <- c(LETTERS)
abatement <- c(sample(1:100, 26))
cost <- c(sample(-100:250, 26))
data <- data.frame(cbind(measure, abatement, cost))
Technically speaking, this is a barplot and not a histogram (histograms specifically refer to barplots used to represent binned frequencies of continuous variables) ...
Your cbind() is messing things up (converting abatement and cost to factors):
data <- data.frame(measure, abatement, cost)
Here's a start:
with(dplyr::arrange(data,cost),
barplot(width=abatement,height=cost,space=0))
Maybe I don't understand well what the question is, but if you are looking for ordering the data frame I think that this could be a good solution:
data2 <- data[ order(cost), ]
Or you can use dplyr package and its arrange function.

Is there a better way to plot multicolor lines in R than splitting the data?

Sequential portions of my time series are under different treatments, and I'd like to separately color a line connecting observations in each portion.
For example, in the series under treatment A I'd have a red line, and in the succeeding series under treatment B I'd have a blue line.
plot(response, type="l",col="treatment") failed - all observations were connected with a line the same color.
This listhost posting proposed just splitting the data by treatment and then separately plotting each subset on the same plot. (http://r.789695.n4.nabble.com/Can-R-plot-multicolor-lines-td791081.html).
Is there a more elegant way?
An alternative using Map that avoids manually plotting segments:
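# treatment must be a factor so that col=x$treatment maps to colour numbers
# (since R 4.0, data.frame() no longer converts strings to factors)
dat <- data.frame(treatment=factor(rep(LETTERS[1:2],3:4)),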
response=c(6,5,2,1,5,6,7),time=1:7)
plot(response ~ time, data=dat, type="n")
Map(
function(x) lines(response ~ time, data=x, col=x$treatment),
split(dat, dat$treatment)
)
There are two popular more elegant ways. One is to use the ggplot2 package. Without more information it's hard to advise you other than look at help or examples in various places. The other is to check out the function matplot. That will require you to first restructure your data as a matrix but it can easily do what you want. Keep in mind that while it says in the help, "Plot the columns of one matrix against the columns of another", the x-axis matrix can be a vector the same length as one column of a matrix containing your line information. The function will just recycle the x vector.
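As a rough illustration of the matplot route: put each treatment's responses in its own column, with NA wherever the other treatment applies, and pass the time vector as x. The data and the colours below are only illustrative.

```r
dat <- data.frame(treatment = rep(LETTERS[1:2], 3:4),
                  response  = c(6, 5, 2, 1, 5, 6, 7),
                  time      = 1:7)

# One column per treatment; NA where the observation belongs to the other one.
trt  <- unique(dat$treatment)
resp <- sapply(trt, function(tr) ifelse(dat$treatment == tr, dat$response, NA))

# The x vector is recycled across the columns of resp.
matplot(dat$time, resp, type = "l", lty = 1, col = c("red", "blue"),
        xlab = "time", ylab = "response")
```

Note that with this layout the line is not drawn across the change of treatment (between time 3 and 4 here), because each column is NA on the other side.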

Plotting distribution of differences in R

I have a dataset with numbers indicating daily difference in some measure.
https://dl.dropbox.com/u/22681355/diff.csv
I would like to create a plot of the distribution of the differences with special emphasis on the rare large changes.
I tried plotting each column using the hist() function but it doesn't really provide a detailed picture of the data.
For example plotting the first column of the dataset produces the following plot:
https://dl.dropbox.com/u/22681355/Rplot.pdf
My problem is that this gives very little detail to the infrequent large deviations.
What is the easiest way to do this?
Also any suggestions on how to summarize this data in a table? For example besides showing the min, max and mean values, would you look at quantiles? Any other ideas?
You could use boxplots to visualize the distribution of the data:
sdiff <- read.csv("https://dl.dropbox.com/u/22681355/diff.csv")
boxplot(sdiff[,-1])
Outliers are printed as circles.
I back @Sven's suggestion for identifying outliers, but you can get more refinement in your histograms by specifying a denser set of breakpoints than what hist chooses by default.
d <- read.csv('https://dl.dropbox.com/u/22681355/diff.csv', header=TRUE, row.names=1)
with(d, hist(a, breaks=seq(min(a), max(a), length.out=100)))
Violin plots could be useful:
df <- read.csv('https://dl.dropbox.com/u/22681355/diff.csv')
library(vioplot)
with(df,vioplot(a,b,c,d,e,f,g,h,i,j))
I would use a boxplot on transformed data, e.g.:
# signed square root; unlike x/sqrt(abs(x)) it does not produce NaN for zeros
boxplot(sign(df[,-1]) * sqrt(abs(df[,-1])))
Obviously a histogram would also look better after transformation.
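To illustrate the effect of such a transformation on a histogram, here is a small sketch on simulated data (the numbers are made up and stand in for one column of the real file): mostly small changes plus a handful of large ones.

```r
set.seed(42)
# made-up data: many small differences and a few rare large ones
x <- c(rnorm(1000, sd = 1), rnorm(10, sd = 20))

# signed square root: compresses large values but keeps the sign
signed_sqrt <- function(v) sign(v) * sqrt(abs(v))

op <- par(mfrow = c(1, 2))
hist(x, breaks = 50, main = "raw")
hist(signed_sqrt(x), breaks = 50, main = "signed sqrt")
par(op)
```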

R histogram showing time spent in each bin

I'm trying to create a plot similar to the ones here:
Basically I want a histogram, where each bin shows how long was spent in that range of cadence (e.g 1 hour in 0-20rpm, 3 hours in 21-40rpm, etc)
library("rjson") # 3rd party library, so: install.packages("rjson")
# Load data from Strava API.
# Ride used for example is http://app.strava.com/rides/13542320
url <- "http://app.strava.com/api/v1/streams/13542320?streams[]=cadence,time"
d <- fromJSON(paste(readLines(url), collapse=""))
Each value in d$cadence (rpm) is paired with the same index in d$time (the number of seconds from the start).
The values are not necessarily uniform (as can be seen if you compare plot(x=d$time, y=d$cadence, type='l') with plot(d$cadence, type='l') )
If I do the simplest possible thing:
hist(d$cadence)
..this produces something very close, but the Y value is "frequency" instead of time, and ignores the time between each data point (so the 0rpm segment in particular will be underrepresented)
You need to create a new column to account for the time between samples.
I prefer data.frames to lists for this kind of thing, so:
d <- as.data.frame(fromJSON(paste(readLines(url), collapse="")))
d$sample.time <- 0
d$sample.time[2:nrow(d)] <- d$time[2:nrow(d)]-d$time[1:(nrow(d)-1)]
Now that you've got your sample times, you can simply "repeat" the cadence measures for anything with a sample time more than 1, and plot a histogram of that:
hist(rep(x=d$cadence, times=d$sample.time),
main="Histogram of Cadence", xlab="Cadence (RPM)",
ylab="Time (presumably seconds)")
There's bound to be a more elegant solution that wouldn't fall apart for non-integer sample times, but this works with your sample data.
EDIT: re: the more elegant, generalized solution, you can deal with non-integer sample times with something like new.d <- aggregate(sample.time~cadence, data=d, FUN=sum), but then the problem becomes plotting a histogram for something that looks like a frequency table, but with non-integer frequencies. After some poking around, I'm coming to the conclusion you'd have to roll-your-own histogram for this case by further aggregating the data into bins, and then displaying them with a barchart.
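A sketch of that roll-your-own approach, using made-up data in place of the Strava stream and an arbitrary bin width of 20 rpm: cut the cadence into bins, sum the sample time per bin, and draw the totals as a bar chart.

```r
# made-up stand-in for the stream: time in seconds, cadence in rpm
d <- data.frame(time    = c(0, 1, 2.5, 3, 6, 7.2, 8),
                cadence = c(0, 45, 62, 80, 0, 95, 88))
d$sample.time <- c(0, diff(d$time))  # seconds since the previous sample

# bin the cadence; the bin edges are an arbitrary choice
breaks <- seq(0, 120, by = 20)
bins   <- cut(d$cadence, breaks = breaks, include.lowest = TRUE, right = FALSE)

# total time per bin; split() keeps empty bins, which sum to 0
time.in.bin <- vapply(split(d$sample.time, bins), sum, numeric(1))

barplot(time.in.bin, space = 0,
        main = "Time spent per cadence range",
        xlab = "Cadence (RPM)", ylab = "Time (seconds)")
```

This works for non-integer sample times because the times are summed rather than used as repeat counts.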
