R scatterplot y-axis grouped - r

My df is a data frame of individuals (rows) and the amount each of them spent (column) on one activity. I want to draw a scatterplot in R that has the following characteristics:
x-axis: log(amount spent)
y-axis: log(number of people that spent this amount)
This is how far I got:
plot(log(df$Amount), log(df$???))
How can I do that? Thanks!
My df looks something like this:
df
Name Surname Amount
John Smith 223
Mary Osborne 127
Mark Bloke 45
This is what I have in mind (taken from a paper by Chen (2012))

Try this:
library(dplyr)
library(scales) # To let you make plotted points transparent
# Make some toy data that matches your df's structure
set.seed(1)
df <- data.frame(Name = rep(letters, 4), Surname = rep(LETTERS, 4), Amount = rnorm(4 * length(LETTERS), 200, 50))
# Use dplyr to count the people in each 5-dollar bin and attach that count to every
# row of the data frame, to use as y values in the plot to come.
dfsum <- df %>%
  mutate(Bins = cut(Amount, breaks = seq(round(min(Amount), -1) - 5, round(max(Amount) + 5, -1), by = 5))) %>% # Per AkhilNair's comment
  group_by(Bins) %>%
  mutate(n = n()) %>% # number of rows that fall in the same bin
  ungroup()
# Make the plot with the new df with the x-axis on a log scale
with(dfsum, plot(x = log(Amount), y = n, ylab="Number around this amount", pch=20, col = alpha("black", 0.5)))
Here's what that produced:
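If the goal is the log-log plot described in the question (log of the amount against log of the number of people who spent exactly that amount), a minimal base-R sketch could look like the following. The toy amounts are made up for illustration and rounded to whole dollars so that amounts repeat; table() then counts how many people share each distinct amount.

```r
# Hypothetical toy data: right-skewed amounts, rounded to whole dollars
set.seed(1)
amount <- round(rlnorm(2000, meanlog = 4, sdlog = 1))
amount <- amount[amount > 0]  # log(0) is -Inf, so drop zero amounts

# Count how many people spent each distinct amount
tab <- table(amount)
x <- as.numeric(names(tab))  # the distinct amounts
y <- as.vector(tab)          # number of people per amount

plot(log(x), log(y), pch = 20,
     xlab = "log(amount spent)",
     ylab = "log(number of people who spent this amount)")
```

For continuous amounts, where exact values rarely repeat, counting within bins as in the answer above is probably the more sensible choice.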

Related

ggplot secondary y axis scale based on data with facet_wrap or grid_arrange

My data consists of time series for 25 sectors. For each sector I want to plot the number of workers (series 1) and the average pay (series 2) in a line graph, with the number of workers on the primary y axis and the average pay on the secondary y axis, and then arrange the graphs on a grid.
Example data:
period avg_wage number_of_workers sector
1990 2000 5000 construction
1991 2020 4970 construction
1992 2050 5050 construction
1990 1000 120 IT
1991 1100 400 IT
1992 1080 500 IT
1990 10000 900 hospital staff
1991 10200 980 hospital staff
1992 10400 1200 hospital staff
I tried to use facet_wrap() for the grid and scale_y_continuous(sec.axis...) as follows:
#fake sample data for reference
dfa=data.frame(order=seq(1,100),workers=rnorm(1000,7),pay=rnorm(1000,3000,500),type="a") #1st sector
dfb=data.frame(order=seq(1,100),workers=rnorm(1000,25),pay=rnorm(1000,1000,500),type="b") #2nd sector
dfc=data.frame(order=seq(1,100),workers=rnorm(1000,400),pay=rnorm(1000,5000,500),type="c") #3rd sector
df=rbind(dfa,dfb,dfc)
colnames(df)=c(
"order", #shared x axis/time value
"workers", #time series 1 (y values for left side y axis)
"pay", #time series 2 (y values for right side y axis)
"type" #different graphs to put on the grid
)
Plotting the data with ggplot:
df=df %>% group_by(l=type) %>% mutate(coeff=max(pay)/max(workers)) %>% ungroup() #creating a coefficient to scale the secondary axis
plot=ggplot(data=df,aes(x=order))+
geom_line(aes(y=workers),linetype="dashed",color="red")+
geom_line(aes(y=pay/coeff)) +
scale_y_continuous(sec.axis=sec_axis(~.*coeff2,name="wage"))+
facet_wrap(~type,scale="free")
But unfortunately this doesn't work, since you can't use data inside sec_axis() (this example doesn't even run).
Another approach I tried is using a for loop and grid.arrange():
plots=list()
for (i in (unique(df$type)))
{
singlesector=df[df$type==i,]
axiscoeff=df$coeff[1]
plot=ggplot(data=singlesector,aes(x=order))+
geom_line(aes(y=workers),linetype="dashed",color="red")+
geom_line(aes(y=pay/coeff)) + labs(title=i)+
scale_y_continuous(sec.axis=sec_axis(~.*axiscoeff,name="wage"))
plots[[i]]=plot
}
grid.arrange(grobs=plots)
But this also doesn't work, because ggplot doesn't save the various values of the variable axiscoeff, so it applies the first value to all of the graphs.
See the result (the axes on the right are messed up and don't conform to the red line's data):
Is there any way to do what I want to do?
I thought about saving all of the plots separately as PNGs and then joining them in some other way, but that seems like an extreme solution that would take too much time to figure out.
As far as I can tell, the issue is the way you rescale your data: by using max(pay) / max(workers), you map the maximum value of pay onto the maximum value of workers, which does not account for the different ranges (spreads) of the two variables.
Instead, you could use scales::rescale so that the range of pay is mapped onto the range of workers.
Besides that, I took a different approach to glue the plots together, which makes use of patchwork. To this end I put the plotting code in a function, split the data by type, used lapply to loop over the split data, and finally glued the plots together using patchwork::wrap_plots.
Note: As your example data included multiple values per order/type, I slightly changed it to get rid of the zig-zag lines.
library(dplyr)
library(ggplot2)
library(patchwork)
library(scales)
df %>%
split(.$type) %>%
lapply(function(df) {
range_pay <- range(df$pay)
range_workers <- range(df$workers)
ggplot(data = df, aes(x = order)) +
geom_line(aes(y = workers), linetype = "dashed", color = "red") +
geom_line(aes(y = rescale(pay, range_workers, range_pay))) +
scale_y_continuous(sec.axis = sec_axis(~ rescale(.x, range_pay, range_workers), name = "wage")) +
facet_wrap(~type)
}) %>%
wrap_plots(ncol = 1)
DATA
set.seed(123)
dfa <- data.frame(order = 1:100, workers = rnorm(100, 7), pay = rnorm(100, 3000, 500), type = "a") # 1st sector
dfb <- data.frame(order = 1:100, workers = rnorm(100, 25), pay = rnorm(100, 1000, 500), type = "b") # 2nd sector
dfc <- data.frame(order = 1:100, workers = rnorm(100, 400), pay = rnorm(100, 5000, 500), type = "c") # 3rd sector
df <- rbind(dfa, dfb, dfc)
names(df) <- c("order", "workers", "pay", "type")
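To see why the two rescale() calls in the answer undo each other, here is a small sketch with made-up numbers (the pay and workers vectors are purely illustrative). The first call squeezes pay into the range of workers for drawing; the second, used inside sec_axis(), maps axis positions back to pay units for labelling.

```r
library(scales)

# Hypothetical values for one sector
pay     <- c(1000, 2000, 3000)
workers <- c(5, 7, 10)

range_pay     <- range(pay)      # 1000 3000
range_workers <- range(workers)  # 5 10

# Map pay onto the workers scale (what the second geom_line() draws)
pay_on_workers_axis <- rescale(pay, to = range_workers, from = range_pay)
pay_on_workers_axis
# 5.0 7.5 10.0

# Map positions on the workers scale back to pay (what sec_axis() labels)
pay_back <- rescale(pay_on_workers_axis, to = range_pay, from = range_workers)
pay_back
# 1000 2000 3000
```

Because the ranges are computed inside the function for each subset, every sector gets its own pair of mappings, which a single faceted plot with one sec_axis() cannot provide.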

kmeans through time retain consistent cluster ID

There are times when we would like to know how the clustering of points might change through time. For example, say you have cities with demographic attributes by decade and you are interested in what cities are the "most similar" based on their attributes for each decade. Here is a toy dataset that illustrates the point:
library(dplyr)
library(ggplot2)
set.seed(1)
centers <- data.frame(cluster=factor(1:3),
size=c(100, 150, 50),
x1=c(5, 0, -3),
x2=c(-1, 1, -2))
year1 <- centers %>%
group_by(cluster) %>%
do(data.frame(x1=rnorm(.$size[1], .$x1[1]),
x2=rnorm(.$size[1], .$x2[1]),
year="year 1",
stringsAsFactors = F)) %>%
data.frame()
year2 <- centers %>%
group_by(cluster) %>%
do(data.frame(x1=rnorm(.$size[1], .$x1[1]),
x2=rnorm(.$size[1], .$x2[1]),
year="year 2",
stringsAsFactors = F)) %>%
data.frame()
points <- rbind(year1,year2)
We can calculate kmeans per year using something like below:
kclusters <- points %>%
select(-cluster) %>%
group_by(year) %>%
do(data.frame(., kclust = kmeans(as.matrix(.[,-3]),centers=3)$cluster)) %>%
mutate(kclust = as.character(kclust))
And here is the resulting plot:
ggplot(kclusters) +
geom_point(aes(x1,x2,color=kclust)) +
facet_wrap(~year) +
theme_bw() +
scale_color_viridis_d()
The code works as expected but notice that the cluster IDs have changed. It doesn't make much difference here because I am plotting the clusters using the original x1 and x2, but my real example is making a map and the points are plotted in space using coordinates and colored according to clusters (i.e., the location of the point never changes). Imagine this same plot for several years--it becomes hard to track the changing cluster membership of individual points each year. Is there a way to keep the IDs consistent?
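One possible approach, offered only as a sketch and not as an established answer to this question, is to seed each year's kmeans() with the centers fitted for the previous year. kmeans() accepts a matrix of starting centers, and cluster k in the result grows out of row k of that matrix, so the numbering tends to carry over as long as the clusters do not move much between years. The make_year() helper below is hypothetical and just mimics the toy data above in base R.

```r
set.seed(1)

# Hypothetical helper: one "year" of points scattered around three fixed centers
make_year <- function() {
  rbind(
    cbind(x1 = rnorm(100, 5),  x2 = rnorm(100, -1)),
    cbind(x1 = rnorm(150, 0),  x2 = rnorm(150, 1)),
    cbind(x1 = rnorm(50, -3),  x2 = rnorm(50, -2))
  )
}
year1 <- make_year()
year2 <- make_year()

# Year 1: ordinary k-means with several random starts
fit1 <- kmeans(year1, centers = 3, nstart = 10)

# Year 2: start from year 1's centers, so cluster k starts where cluster k was
fit2 <- kmeans(year2, centers = fit1$centers)

# The rows of the two center matrices should now line up
round(fit1$centers, 2)
round(fit2$centers, 2)
```

With more years, each fit could be seeded from the previous year or always from one reference year. If clusters drift strongly, merge, or empty out, the seeding may no longer preserve identities, and matching centers after the fact would be needed instead.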

R - How to Plot Multiple Density Plots With ggvis

How would one go about plotting multiple density plots on the same set of axes? I understand how to plot multiple line graphs and scatter plots together, however the matter of having the density plots share a common x-axis is tripping me up. My data is currently set up as such:
name x1 x2 x3
a 123 123 123
b 123 123 123
c 123 123 123
Thanks for the help!
EDIT: Here are some details I was missing which may help make my question clearer.
I have a data frame attr_gains which looks like the example above, and whose variable names are Str, Agi, and Int. So far, I have been able to get a density plot of the Str variable alone with this code:
attr_gains %>%
ggvis(x=~Str)%>%
layer_densities(fill :="red", stroke := "red")
What I would like to do is overlay two more density plots, one each for Agi and Int, so that I have three density plots on the same set of axes.
Directly from the documentation:
PlantGrowth %>%
ggvis(~weight, fill = ~group) %>%
group_by(group) %>%
layer_densities()
Your Case:
set.seed(1000)
library('ggvis')
library('reshape2')
#############################################
df = data.frame(matrix(nrow = 3, ncol = 5))
colnames(df) <- c('names', 'x1', 'x2', 'x3', 'colors')
df['names'] <- c('a','b','c')
df['x1'] <- runif(3, 100.0, 150.0)
df['x2'] <- runif(3, 100.0, 150.0)
df['x3'] <- runif(3, 100.0, 150.0)
df['colors'] <- c("blue","orange","green")
df <- melt(df)
#############################################
df %>%
ggvis( ~value, fill = ~colors ) %>%
group_by(names) %>%
layer_densities()
Please see this SE page for information on controlling ggvis color(s).
Looks like this:

Plotting a line graph with multiple lines

I am trying to plot a line graph with multiple lines in different colors, but not having much luck. My data set consists of 10 states and the voting turnout rates for each state from 9 elections (so the states are listed in the left column, and each subsequent column is an election year from 1980-2012 with the voting turnout rate for each of the 10 states). I would like to have a graph with the year on the X axis and the voting turnout rate on the Y axis, with a line for each state.
I found this previous answer (Plotting multiple lines from a data frame in R) to a similar question but cannot seem to replicate it using my data. Any ideas/suggestions would be immensely appreciated!
Use tidyr::gather or reshape::melt to transform the data to a long form.
## Simulate data
d <- data.frame(state=letters[1:10],
'1980'=runif(10,0,100),
'1981'=runif(10,0,100),
'1982'=runif(10,0,100))
library(dplyr)
library(tidyr)
library(ggplot2)
## Transform to a long df
e <- d %>% gather(., key, value, -state) %>%
mutate(year = as.numeric(substr(as.character(key), 2, 5))) %>%
select(-key)
## Plot
ggplot(data=e,aes(x=year,y=value,color=state)) +
geom_point() +
geom_line()
Please include your data, or sample data, in your question so that we can answer your question directly and help you get to the root of the problem. Pasting your data is simplified by using dput().
Here's another solution to your problem, using scoa's sample data and the reshape2 package instead of the tidyr package:
# Sample data
d <- data.frame(state = letters[1:10],
'1980' = runif(10,0,100),
'1981' = runif(10,0,100),
'1982' = runif(10,0,100))
library(reshape2)
library(ggplot2)
# Melt data and remove X introduced into year name
melt.d <- melt(d, id = "state")
melt.d[["variable"]] <- gsub("X", "", melt.d[["variable"]])
# Plot melted data
ggplot(data = melt.d,
aes(x = variable,
y = value,
group = state,
color = state)) +
geom_point() +
geom_line()
Produces:
Note that I left out the as.numeric() conversion for year from scoa's example, and this is why the graph above does not include the extra x-axis ticks that scoa's does.
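As a variation on the two answers above, the same reshaping can be sketched with tidyr::pivot_longer, which newer versions of tidyr recommend over gather. The data below are simulated and the column name turnout is an assumption chosen for illustration. Setting check.names = FALSE keeps the year columns from being renamed to X1980 and so on, and names_transform converts the year labels to integers in the same step.

```r
library(tidyr)
library(ggplot2)

set.seed(42)
# Hypothetical turnout data: one row per state, one column per election year
d <- data.frame(state = letters[1:10],
                "1980" = runif(10, 0, 100),
                "1984" = runif(10, 0, 100),
                "1988" = runif(10, 0, 100),
                check.names = FALSE)

# Wide to long: one row per state and year, with year stored as an integer
long <- pivot_longer(d,
                     cols = -state,
                     names_to = "year",
                     values_to = "turnout",
                     names_transform = list(year = as.integer))

# One line per state, year on the x axis
p <- ggplot(long, aes(x = year, y = turnout, colour = state)) +
  geom_line() +
  geom_point()
print(p)
```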

Plot table objects with ggplot?

I've got this data:
No Yes
Female 411 130
Male 435 124
which was created using the standard table command. Now with plot I can plot this as such:
plot(table(df$gender, df$fraud))
and it then outputs a 2x2 bar chart.
So my question is: how can I do this with ggplot2? Is there any way to do it without transforming the table object into a data frame? I would do that, but you then need to rename the column and row headers, and it becomes a mess for what is really quite a simple thing.
Something such as
ggplot(as.data.frame(table(df)), aes(x=gender, y = Freq, fill=fraud)) +
geom_bar(stat="identity")
gets a similar chart with a minimum amount of relabelling.
ggplot2 works with data frames, so you have to convert the table into one. Here is some sample code:
myTable <- table(gender = df$gender, fraud = df$fraud) # name the dimensions so the columns are not Var1/Var2
myFrame <- as.data.frame(myTable) # columns: gender, fraud, Freq
Now you can use myFrame in ggplot2:
ggplot(myFrame, aes(x = gender, y = Freq, fill = fraud)) +
geom_bar(stat = "identity")
See Coerce to a Data Frame for more information.
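To make the conversion step concrete, here is a small base-R sketch of what as.data.frame() returns for a two-way table. The data frame is rebuilt from the counts shown in the question; the dimension names gender and fraud are passed to table() explicitly so that the resulting columns do not come out as Var1 and Var2.

```r
# Rebuild raw data matching the counts in the question
df <- data.frame(
  gender = rep(c("Female", "Male"), times = c(411 + 130, 435 + 124)),
  fraud  = c(rep(c("No", "Yes"), times = c(411, 130)),
             rep(c("No", "Yes"), times = c(435, 124)))
)

# Named dimensions become column names after conversion
tab <- table(gender = df$gender, fraud = df$fraud)
counts <- as.data.frame(tab)

names(counts)
# "gender" "fraud" "Freq"
nrow(counts)
# 4 (one row per gender/fraud combination)

# The count column can be renamed during conversion if Freq is not wanted
counts_n <- as.data.frame(tab, responseName = "n")
```

The long layout with one row per cell is what ggplot2 expects, so no further renaming should be needed before mapping gender, fraud, and the count column to aesthetics.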
For the record, janitor::tabyl() outputs contingency tables that are data.frames. As such, they are more convenient for a workflow based on tidyverse tools.
For example:
# Data
df <- data.frame(gender = c(rep("female", times = 411),
rep("female", times = 130),
rep("male", times = 435),
rep("male", times = 124)),
fraud = c(rep("no", times = 411),
rep("yes", times = 130),
rep("no", times = 435),
rep("yes", times = 124)))
# Plotting the tabulation with tidyverse tools
df |>
janitor::tabyl(gender, fraud) |>
tidyr::gather(key = fraud, value = how_many, no:yes) |>
ggplot2::ggplot(ggplot2::aes(y = how_many, x = gender, fill = fraud)) + # namespace-qualified, since ggplot2 is not attached
ggplot2::geom_col()
Note: The sequence of piped results with janitor and tidyr has the benefit of being more transparent, but it essentially replicates the same result achieved with as.data.frame(table(df)).
