I'm analysing some house sale transaction data, and I want to produce a geographic plot with the colour indicating average price per (hex-binned) region. Some regions have limited data, and I want to indicate this by adjusting the opacity to reflect the number of points in each region.
This would require me to calculate two statistics for each hex: average price and number of points. The ggplot2 package makes it very easy to calculate and plot one statistic in a chart, but I can't figure out how to calculate two.
To illustrate the point:
library(ggplot2)
N = 1000;
df_demo = data.frame(A=runif(N), B=runif(N), C=runif(N)) # dummy data
# I want to produce a hex-binned version of this:
ggplot(data=df_demo) + geom_point(mapping=aes(x=A, y=B, color=C))
# It's easy to get each hex's average price *or* its point density:
ggplot(data=df_demo) + stat_summary_hex(mapping=aes(x=A,y=B,z=C), fun=mean) # color = average of C across hex, but opacity can't be adjusted
ggplot(data=df_demo) + geom_hex(mapping=aes(x=A, y=B, color=C, alpha=..ndensity..)) # opacity = normalised # of points, but color is *total* value which is wrong
I would like to combine the effects of the last two lines, but that doesn't seem to be an option: the ..ndensity.. statistic doesn't work in the context of stat_summary_hex(), and geom_hex() won't calculate the mean value.
Is there a way to do this that I'm overlooking? Alternatively, is there an obvious way of precomputing the statistics needed before constructing the plot? E.g. by determining the expected hex for each datum during my dplyr pipeline.
One hint that there may not be an easy solution is this non-CRAN package which - if I've understood correctly - solves more or less this problem. However, I'd rather not rely on out-of-CRAN code if at all possible, so I'm holding onto hope that I've missed something obvious.
What about a different geom? E.g. geom_tile - you can create cuts for each dimension (A/B) and then pre-calculate mean and number for each tile and then plot like this:
library(tidyverse)
N = 1000;
df_demo = data.frame(A=runif(N), B=runif(N), C=runif(N)) %>%
  mutate(cuts_a = cut(A, breaks = 20), cuts_b = cut(B, breaks = 20)) %>%
  group_by(cuts_a, cuts_b) %>%
  # summarise() keeps one row per tile, so each tile is drawn once and its alpha is not compounded
  summarise(mean_c = mean(C), n_obs = n())
# Plot the pre-computed tiles: fill = mean of C, opacity = number of points
ggplot(data=df_demo) +
  geom_tile(mapping=aes(x=cuts_a, y=cuts_b, fill=mean_c, alpha=n_obs))
Created on 2020-02-13 by the reprex package (v0.3.0)
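If you do want hexagons rather than tiles, the same pre-computation idea may carry over. Below is a minimal sketch, assuming the CRAN package hexbin (which geom_hex() already relies on) is used to assign each point to a hex cell. The names hex_stats and cells and the choice xbins = 15 are illustrative and not from the question. Drawing the result with geom_hex(stat = "identity") is also an assumption, and the hexagon sizes may need tweaking.

```r
library(hexbin)
library(dplyr)
library(ggplot2)

set.seed(42)
N <- 1000
df_demo <- data.frame(A = runif(N), B = runif(N), C = runif(N)) # dummy data

# Assign every point to a hex cell; IDs = TRUE stores the cell id of each point in hb@cID
hb <- hexbin(df_demo$A, df_demo$B, xbins = 15, IDs = TRUE)

# Centre coordinates of each non-empty cell, in the same order as hb@cell
centres <- hcell2xy(hb)
cells <- data.frame(cell = hb@cell, x = centres$x, y = centres$y)

# Two statistics per hex: mean of C and number of points
hex_stats <- df_demo %>%
  mutate(cell = hb@cID) %>%
  group_by(cell) %>%
  summarise(mean_c = mean(C), n_obs = n()) %>%
  inner_join(cells, by = "cell")

# Assumption: stat = "identity" makes geom_hex() draw the pre-computed cells as given
ggplot(hex_stats) +
  geom_hex(aes(x = x, y = y, fill = mean_c, alpha = n_obs), stat = "identity")
```

The summary table hex_stats can also be inspected or filtered (e.g. dropping hexes with very few points) before plotting.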
I want to show change in job numbers within certain time period. Ideally, I'd like to use a ggplot2 geom_dotplot and then color those dots by the column that they are in for that month. One idea I have not tried yet: do I need to reformat my data using tidyr from a wide to a long format in order to plot this?
Example data
Month Finance Tech Construction Manufacturing
Jan 14,000 6,800 11,000 17,500
Feb 11,500 8,400 9,480 15,000
Mar 15,250 4,200 7,200 12,400
Apr 12,000 6,400 10,300 8,500
My current R code attempt: I know that I need to fill the dot color by a factor of industry type. Maybe I have to have the data in a long format to do so.
library(tidyverse)
g <- ggplot(dat, aes(x = Month)) +
geom_dotplot(stackgroups = TRUE, binwidth = 1000, binpositions = "all") +
theme_light()
g
Here's how the plot I'm trying to make could look. Ideally I'd like to bin the dots as one dot per 1000 in the column value. Is that possible?
Thank you for taking the time to help someone who is new to R and is studying in school. Much appreciated as always,
I could not get geom_dotplot to work; the y-axis always came out wrong. Try something like this instead: first pivot to long format, then repeat each Month + category row once per 1000 in its value. Note that the solution below rounds up:
library(dplyr)
library(tidyr)
library(ggplot2)
test = pivot_longer(dat,-Month,names_to="category") %>%
group_by(Month,category) %>%
summarize(bins=ceiling(value/ 1000)) %>%
uncount(bins)
If you would prefer to round down to the nearest 1000, use floor() instead of ceiling() .
Then plot:
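# dat$Month works whether dat is a data.frame or a tibble (dat[,1] returns a tibble for tibbles)
test$Month = factor(test$Month, levels=dat$Month)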
test %>% ggplot(aes(x=Month,y=1,col=category)) +
geom_point(position=position_stack()) +
scale_y_continuous(labels=scales::number_format(scale=1000))
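For a self-contained check, here is a sketch that builds dat from the example table (assuming the values were read in as plain numbers, without the thousands separators) and compares rounding up with rounding down. The column names dots_up and dots_down are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Example data from the question, entered as numeric values (an assumption)
dat <- data.frame(
  Month         = c("Jan", "Feb", "Mar", "Apr"),
  Finance       = c(14000, 11500, 15250, 12000),
  Tech          = c(6800, 8400, 4200, 6400),
  Construction  = c(11000, 9480, 7200, 10300),
  Manufacturing = c(17500, 15000, 12400, 8500)
)

# Number of dots per Month/category under both rounding rules
bins <- dat %>%
  pivot_longer(-Month, names_to = "category") %>%
  mutate(dots_up = ceiling(value / 1000), dots_down = floor(value / 1000))

# One row per dot, ready for plotting (rounding up)
dots <- uncount(bins, dots_up)

feb_fin <- filter(bins, Month == "Feb", category == "Finance")
```

For Feb/Finance (11,500) this gives 12 dots when rounding up and 11 when rounding down.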
I am creating a frequency plot using the geom_freqpoly function in ggplot2. I have a large data set of social media comments across 14 months and am plotting the number of comments for each week of that data. I am using this code, first converting the UTC timestamps to POSIXct and then drawing the frequency plot:
ggplot(data = TRP) +
geom_freqpoly(mapping = aes(x = created_utc), binwidth = 604800)
This is creating a plot that looks like this:
However, I want to top and tail the plot, as it touches zero at both the start and the end, making it look like there was rapid growth and rapid decline. This is not the case, as this is simply a snapshot of the data, which exists before and after the period I am analysing. The data begins at the 4,000 mark and ends at around 2,000, and I want it represented like that. I have checked the 'pad' argument and have ensured it is set to FALSE.
Any help as to why this may be occurring would be greatly appreciated.
Thanks!
Rather than adjusting the geom_freqpoly to work differently than intended, it might be simpler to calculate the weekly totals yourself and use geom_line:
library(lubridate); library(dplyr); library(ggplot2)
set.seed(1)
df <- data.frame(
datetime = ymd_h(2018010101) + dhours(runif(1000, 0, 14*30*24))
)
df %>%
count(week_count = floor_date(datetime, "1 week")) %>%
ggplot(aes(week_count, n)) +
geom_line()
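If the first and last weeks are only partly covered by the snapshot, their totals will still look artificially low. One possible way to handle this, sketched below on the same kind of dummy data, is to drop the first and last rows of the weekly counts before plotting. The names weekly and trimmed are illustrative, and whether trimming is appropriate depends on your data.

```r
library(lubridate); library(dplyr); library(ggplot2)

set.seed(1)
df <- data.frame(
  datetime = ymd_h("2018-01-01 01") + dhours(runif(1000, 0, 14*30*24))
)

# Weekly totals, one row per week
weekly <- df %>%
  count(week_count = floor_date(datetime, "1 week"))

# Drop the first and last week, which may be incomplete
trimmed <- weekly %>%
  slice(2:(n() - 1))

ggplot(trimmed, aes(week_count, n)) +
  geom_line()
```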
I am trying to develop an animated plot showing how the rates of three point attempts and assists have changed for NBA teams over time. While the points in my plot are transitioning correctly, I tried to add a vertical and horizontal mean line, however this is staying constant for the overall averages rather than shifting year by year.
p<-ggplot(dataBREFPerPossTeams, aes(astPerPossTeam,fg3aPerPossTeam,col=ptsPerPossTeam))+
geom_point()+
scale_color_gradient(low='yellow',high='red')+
theme_classic()+
xlab("Assists Per 100 Possessions")+
ylab("Threes Attempted Per 100 Possessions")+labs(color="Points Per 100 Possessions")+
geom_hline(aes(yintercept = mean(fg3aPerPossTeam)), color='blue',linetype='dashed')+
geom_vline(aes(xintercept = mean(astPerPossTeam)), color='blue',linetype='dashed')
anim<-p+transition_time(as.integer(yearSeason))+labs(title='Year: {frame_time}')
animate(anim, nframes=300)
Ideally, the two dashed lines would shift as the years progress, however, right now they are staying constant. Any ideas on how to fix this?
I am using datasets::airquality since you have not shared your data. The idea is that the values for your other geoms (here, the means) need to exist as a variable in your dataset, so that gganimate can link those values to the frame variable (i.e. the one used in transition_time).
So I grouped by the frame variable (here it is Month; for you it will be yearSeason) and used mutate to add columns holding the per-group averages of the variables of interest. Then, in the geoms, I used those added columns instead of computing the mean inside the geom. See below:
library(datasets) #datasets::airquality
library(ggplot2)
library(gganimate)
library(dplyr)
g <- airquality %>%
group_by(Month) %>%
mutate(mean_wind=mean(Wind),
mean_temp=mean(Temp)) %>%
ggplot()+
geom_point(aes(Wind,Temp, col= Solar.R))+
geom_hline(aes(yintercept = mean_temp), color='blue',linetype='dashed')+
geom_vline(aes(xintercept = mean_wind), color='green',linetype='dashed')+
scale_color_gradient(low='yellow',high='red')+
theme_classic()+
xlab("Wind")+
ylab("Temp")+labs(color="Solar.R")
animated_g <- g + transition_time(as.integer(Month))+labs(title='Month: {frame_time}')
animate(animated_g, nframes=18)
Created on 2019-06-09 by the reprex package (v0.3.0)
Data:
data = data.frame(rnorm(250, 90, sd = 30))
I want to create a histogram with bins of fixed width, except that all observations larger than one arbitrary number, or lower than another arbitrary number, are grouped into their own bins. Taking the above data as an example, I want binwidth = 10, but with all values above 100 together in one bin and all values below 20 together in their own bin.
I looked at some answers, but they make no sense to me since they are mostly code. I would appreciate it greatly if somebody can explain the steps.
The examples below show how to create the desired histogram in base graphics and with ggplot2. Note that the resulting histogram will be quite distorted compared to one with a constant break size.
Base Graphics
The R function hist creates the histogram and allows us to set whatever bins we want using the breaks argument:
# Fake data
set.seed(1049)
dat = data.frame(value=rnorm(250, 90, 30))
hist(dat$value, breaks=c(min(dat$value), seq(20,100,10), max(dat$value)))
In the code above c(min(dat$value), seq(20,100,10), max(dat$value)) sets breaks that start at the lowest data value and end at the highest data value. In between we use seq to create a sequence of breaks that goes from 20 to 100 by increments of 10. Here's what the plot looks like:
ggplot2
library(ggplot2)
ggplot(dat, aes(value)) +
geom_histogram(breaks=c(min(dat$value), seq(20,100,10), max(dat$value)),
aes(y=..density..), color="grey30", fill=hcl(240,100,65)) +
theme_light()
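If the distortion is a concern, one alternative sketch is to assign each value to a labelled bin with cut() and draw a bar chart of the counts, so every bin gets the same bar width. This is a different technique from a true histogram (the x-axis becomes categorical), and the column name bin is made up for illustration.

```r
library(ggplot2)

# Fake data
set.seed(1049)
dat <- data.frame(value = rnorm(250, 90, 30))

# Open-ended first and last bins, width-10 bins in between
dat$bin <- cut(dat$value, breaks = c(-Inf, seq(20, 100, 10), Inf))

# One equally wide bar per bin; bar height = number of observations
ggplot(dat, aes(bin)) +
  geom_bar(color = "grey30", fill = hcl(240, 100, 65)) +
  theme_light()
```

Using -Inf and Inf as the outer breaks means no observation can fall outside the bins, whatever the range of the data.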
I am trying to build from a question similar to mine (and from which I borrowed the self-contained example and title inspiration). I am trying to apply transparency individually to each line of a ggparcoord or somehow add two layers of ggparcoord on top of the other. The detailed description of the problem and format of data I have for the solution to work is provided below.
I have a dataset with thousands of lines; let's call it x.
library(GGally)
x = data.frame(a=runif(100,0,1),b=runif(100,0,1),c=runif(100,0,1),d=runif(100,0,1))
After clustering this data I also get a set of 5 lines, let's call this dataset y.
y = data.frame(a=runif(5,0,1),b=runif(5,0,1),c=runif(5,0,1),d=runif(5,0,1))
In order to see the centroids y overlaying x I use the following code. First I add y to x such that the 5 rows are on the bottom of the final dataframe. This ensures ggparcoord will put them last and therefore stay on top of all the data:
df <- rbind(x,y)
Next I create a new column for df, following the advice in the question I referred to, so that I can color the centroids differently and tell them apart from the data:
df$cluster = "data"
df$cluster[(nrow(df)-4):(nrow(df))] <- "centroids"
Finally I plot it:
p <- ggparcoord(df, columns=1:4, groupColumn=5, scale="globalminmax", alphaLines = 0.99) + xlab("Sample") + ylab("log(Count)")
p + scale_colour_manual(values = c("data" = "grey","centroids" = "#94003C"))
The problem I am stuck with starts at this stage. On my original data, plotting x alone doesn't give much insight, since it is a dense mass of lines (on this data, that is equivalent to using ggparcoord above on x instead of df):
By reducing alphaLines considerably (0.05), I can naturally see some clusters due to the overlapping of the lines (this is again running ggparcoord on x reducing alphaLines):
It makes more sense to observe the centroids added to df on top of the second plot, not the first.
However, since everything is in a single dataframe, applying such a low value for alphaLines makes the centroid lines disappear. My only option is then to use ggparcoord (as provided above) on df without decreasing alphaLines:
My goal is to have the red lines (centroid lines) on top of the second figure with very low alpha. I have thought of two approaches so far, but couldn't get either working:
(1) Is there any way to create a column on the dataframe, similar to what is done for the color, such that I can specify the alpha value for each line?
(2) I originally attempted to create two different ggparcoords and "sum them up" hoping to overlay but an error was raised.
The question may contain too much detail, but I thought this could motivate better the applicability of the answer to serve the interest of other readers.
The answer I am looking for would use the provided data variables in their current format and generate the plot I am looking for. Better ways to restructure the data are also welcome, but using the current structure is preferred.
In this case I think it is easier to just use ggplot and build the graph yourself. We make slight adjustments to how the data is represented (we put it in long format), and then we make the parallel coordinates plot. We can now map cluster to any aesthetic you like, including alpha.
library(dplyr)
library(tidyr)
library(ggplot2)
# I start the same as you
x <- data.frame(a=runif(100,0,1),b=runif(100,0,1),c=runif(100,0,1),d=runif(100,0,1))
y <- data.frame(a=runif(5,0,1),b=runif(5,0,1),c=runif(5,0,1),d=runif(5,0,1))
# I find this an easier way to combine the two data.frames, and have an id column
df <- bind_rows(data = x, centroids = y, .id = 'cluster')
# We need to add id's, so we know which points to connect with a line
df$id <- 1:nrow(df)
# Put the data into long format
df2 <- gather(df, 'column', 'value', a:d)
# And plot:
ggplot(df2, aes(column, value, alpha = cluster, color = cluster, group = id)) +
geom_line() +
scale_colour_manual(values = c("data" = "grey", "centroids" = "#94003C")) +
scale_alpha_manual(values = c("data" = 0.2, "centroids" = 1)) +
theme_minimal()