Categorical bubble plot for mapping studies - r

How can I create a categorical bubble plot in GNU R, similar to those used in systematic mapping studies (see below)?
EDIT: ok, here's what I've tried so far. First, my dataset (Var1 goes to the x-axis, Var2 goes to the y-axis):
> grid
Var1 Var2 count
1 Does.Not.apply Does.Not.apply 53
2 Not.specified Does.Not.apply 15
3 Active.Learning..general. Does.Not.apply 1
4 Problem.based.Learning Does.Not.apply 2
5 Project.Method Does.Not.apply 4
6 Case.based.Learning Does.Not.apply 22
7 Peer.Learning Does.Not.apply 6
10 Other Does.Not.apply 1
11 Does.Not.apply Not.specified 15
12 Not.specified Not.specified 15
21 Does.Not.apply Active.Learning..general. 1
23 Active.Learning..general. Active.Learning..general. 1
31 Does.Not.apply Problem.based.Learning 2
34 Problem.based.Learning Problem.based.Learning 2
41 Does.Not.apply Project.Method 4
45 Project.Method Project.Method 4
51 Does.Not.apply Case.based.Learning 22
56 Case.based.Learning Case.based.Learning 22
61 Does.Not.apply Peer.Learning 6
67 Peer.Learning Peer.Learning 6
91 Does.Not.apply Other 1
100 Other Other 1
Then, trying to plot the data:
# Based on http://flowingdata.com/2010/11/23/how-to-make-bubble-charts/
grid <- subset(grid, count > 0)
radius <- sqrt( grid$count / pi )
symbols(grid$Var1, grid$Var2, radius, inches=0.30, xlab="Research type", ylab="Research area")
text(grid$Var1, grid$Var2, grid$count, cex=0.5)
Here's the result:
Problems: axis labels are wrong, the dashed grid lines are missing.

Here is a ggplot2 solution. First, add radius as a new variable to your data frame.
grid$radius <- sqrt( grid$count / pi )
You should play around with the size of the points and of the text labels inside the plot until they fit well.
library(ggplot2)
ggplot(grid,aes(Var1,Var2))+
geom_point(aes(size=radius*7.5),shape=21,fill="white")+
geom_text(aes(label=count),size=4)+
scale_size_identity()+
theme(panel.grid.major=element_line(linetype=2,color="black"),
axis.text.x=element_text(angle=90,hjust=1,vjust=0))

This will get you started by adding the tick marks to your x-axis.
To add the lines, just add a line at each level.
ggs <- subset(gg, count > 0)
radius <- sqrt( ggs$count / pi )
# ggs$Var1 <- as.character(ggs$Var1)
# set up your tick marks
# (this can all be put into a single line in `axis`, but it's placed separate here to be more readable)
#--------------
# at which values to place the x tick marks
x_at <- seq_along(levels(gg$Var1))
# the string to place at each tick mark
x_labels <- levels(gg$Var1)
# use xaxt="n" to suppress the standard axis ticks
symbols(ggs$Var1, ggs$Var2, radius, inches=0.30, xlab="Research type", ylab="Research area", xaxt="n")
axis(side=1, at=x_at, labels=x_labels)
text(ggs$Var1, ggs$Var2, ggs$count, cex=0.5)
Also, notice that instead of calling the object grid I called it gg, and then ggs for the subset. grid is a function in R. While it is "allowed" to mask the function with an object of the same name, it is not recommended and can lead to annoying bugs down the line.
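For the dashed grid lines, one option is to set up an empty plot first, draw a dashed line at every factor level with abline, and only then add the bubbles on top. The sketch below is not the original data: the small gg data frame and its level names are made up for illustration, and the white fill (bg) is just one way to keep the lines from showing through the bubbles.

```r
# Hypothetical toy data in the same shape as the question's data frame
gg <- data.frame(
  Var1  = factor(c("A", "B", "C", "A", "B")),
  Var2  = factor(c("A", "A", "A", "B", "B")),
  count = c(53, 15, 1, 15, 15)
)

# Work with the integer codes of the factors so tick positions are explicit
x <- as.integer(gg$Var1)
y <- as.integer(gg$Var2)
radius <- sqrt(gg$count / pi)   # bubble area proportional to count

x_at <- seq_along(levels(gg$Var1))
y_at <- seq_along(levels(gg$Var2))

# Empty plot with room around the outermost bubbles and no default axes
plot(x, y, type = "n", xaxt = "n", yaxt = "n",
     xlim = range(x_at) + c(-0.5, 0.5), ylim = range(y_at) + c(-0.5, 0.5),
     xlab = "Research type", ylab = "Research area")

# Category labels on both axes
axis(side = 1, at = x_at, labels = levels(gg$Var1))
axis(side = 2, at = y_at, labels = levels(gg$Var2), las = 1)

# Dashed grid lines at every level, drawn before the bubbles
abline(v = x_at, h = y_at, lty = 2)

# Bubbles on top of the grid, then the counts inside them
symbols(x, y, circles = radius, inches = 0.30, bg = "white", add = TRUE)
text(x, y, gg$count, cex = 0.5)
```

Drawing the grid before the bubbles is a design choice; calling abline after symbols would put the dashed lines across the circles instead.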

Here is a version using levelplot (from lattice, which is loaded together with latticeExtra).
library(latticeExtra)
levelplot(count~Var1*Var2,data=dat,
panel=function(x,y,z,...)
{
panel.abline(h=y,v=x,lty=2) # dashed grid lines through every bubble
panel.levelplot.points(x,y,z,...,cex=5) # fixed-size bubbles
panel.text(x,y,label=z,cex=0.8)
},scales=list(x=list(abbreviate=TRUE))) ## to get short labels
To get the size of the bubbles proportional to the count, you can do this:
library(latticeExtra)
levelplot(count~Var1*Var2,data=dat,
panel=function(x,y,z,...)
{
panel.abline(h=y,v=x,lty=2)
cex <- 5*sqrt(z/max(z)) # bubble area proportional to count; scale(z) would give negative sizes
panel.levelplot.points(x,y,z,...,cex=cex)
panel.text(x,y,label=z,cex=0.8)
})
I don't display it, since the rendering is not as clear as in the fixed-size case.

Related

ACF plot with multiple time-series in R

I want to follow up on this thread, as it is over 2 years old and I cannot comment on it because I do not have enough posts yet:
plot acf of several timeseries in one plot
I would like to understand why all additional lines start at lag=1 and not 0. How can I have them start at lag=0 like the first one?
Also, is there a way to extend the x-axis to negative values? When I do pairwise ccf, I get values from -10 to +10, which nicely shows the pattern I'm looking for, but with acf I only get lags of either -6 or +6.
Also, please excuse my ignorance, but what does the dashed blue line at 0.4 represent? Significance? I've seen the line at various values in different examples.
Thank you in advance.
Here is the code; I basically used the same as in the link above.
> data3
Maui8 Maui7 Maui6 Olowalu Maalaea
1 1.01532397 0.7583463 -1.45102480 0.37355214 0.093384619
2 0.84997103 0.7802248 -1.47906584 0.57370139 0.000741584
3 0.65297103 0.9325412 -1.31256709 0.29211557 0.077706758
4 0.42029456 0.8041302 -1.36599992 0.15763796 0.018583624
5 0.15063769 0.5932333 -1.00933326 -0.03478742 0.073490340
6 0.14522593 0.4739607 -0.82896012 0.22469641 0.226357256
7 0.03779456 0.4774847 -0.09524122 0.42900612 0.194261484
8 -0.39651917 -0.2433839 0.07535580 -0.03204488 0.384578649
9 -0.99220544 -1.3080379 0.07143167 -0.57821403 0.012594818
10 -1.58116623 -1.3739277 -0.28876112 -1.34129239 -0.543698715
11 -1.68365642 -1.5527201 0.35511326 -0.99125508 -0.574656426
12 -1.67555838 -1.6044574 0.21679237 -1.05519787 -0.731770854
13 -1.64012701 -1.6975577 0.68442918 -1.20809587 -0.888636526
14 -1.22618583 -1.3975012 0.94365182 -0.84284090 -0.611341749
15 -1.12916623 -0.8248387 1.05953344 -0.86989314 -0.242448715
16 -1.11394684 -0.3294150 1.41744881 -0.45954904 -0.331766245
17 -0.41821140 -0.4312582 1.19811924 -0.45322699 -0.384893352
18 0.22428860 -0.2696410 1.14340119 -0.28008162 -0.323007387
19 0.69397114 -0.1249800 1.12954154 0.48571412 0.074298377
20 1.55118345 1.1953590 0.91711047 1.47251236 0.802606648
21 1.76527075 1.6837135 0.50540620 1.30325798 0.951992613
22 1.34356440 1.6247940 -0.09836573 1.21764394 0.794730708
23 1.59601480 0.9492149 -0.69564643 0.87988078 0.490006397
24 1.41023107 0.8847163 -1.09236948 0.73676048 0.436886096
> ACF<-acf(data3)
> plot(ACF, type="l", max.mfrow=1)
> lines(ACF$acf[-1, 2,3], lty=1, col="red", lwd=1)
> lines(ACF$acf[-1, 2,4], lty=1, col="green", lwd=1)
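A possible reading of the code above, offered as a sketch rather than a definitive answer: the index -1 in ACF$acf[-1, 2, 3] drops the lag-0 row, and lines() called without x values plots against the positions 1, 2, 3, ..., so the added curves begin at 1. Passing the lags explicitly avoids both effects. The dashed blue line drawn by plot.acf is the approximate 95% bound qnorm(0.975)/sqrt(n), which for n = 24 observations comes out at about 0.4. The matrix demo below is random data standing in for data3, so its curves will not match the real ones.

```r
# Hypothetical stand-in for data3: 24 observations of 5 series
set.seed(1)
demo <- matrix(rnorm(24 * 5), ncol = 5,
               dimnames = list(NULL, c("Maui8", "Maui7", "Maui6", "Olowalu", "Maalaea")))

# Compute the (cross-)correlations without plotting
acf_demo <- acf(demo, plot = FALSE)

# Lags 0, 1, ..., lag.max, taken from the result itself
lags <- acf_demo$lag[, 1, 1]

# Plot one series against its lags, keeping lag 0
plot(lags, acf_demo$acf[, 1, 1], type = "l", ylim = c(-1, 1),
     xlab = "Lag", ylab = "ACF")

# Add further curves with explicit x values and without dropping lag 0
lines(lags, acf_demo$acf[, 2, 3], col = "red")
lines(lags, acf_demo$acf[, 2, 4], col = "green")

# Approximate 95% bounds, the same quantity plot.acf draws as dashed blue lines
ci <- qnorm(0.975) / sqrt(acf_demo$n.used)
abline(h = c(-ci, ci), lty = 2, col = "blue")
```

For negative lags, the pairwise ccf() already mentioned in the question is the usual route, since acf on a multivariate series stores the non-negative lags of a pair in acf[, i, j] and the other direction in acf[, j, i].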

Plotting a dot for every n observations

I want to achieve the following plot type using ggplot:
using the following data:
t <- read.table(header=T, row.names=NULL,
colClasses=c(rep("factor",3),"numeric"), text=
"week team level n.persons
1 A 1 50
1 A 2 20
1 A 3 30
1 B 1 50
1 B 2 20
2 A 2 20
2 A 3 40
2 A 4 20
2 B 3 30
2 B 4 20")
So far, by applying this transformation
t0 <- t[ rep(1:nrow(t), t$n.persons %/% 10 ) , ]
and plotting
ggplot(t0) + aes(x=week, y=level, fill=team) +
geom_dotplot(binaxis="y", stackdir="center",
position=position_dodge(width=0.2))
I could generate
A: How can I make the dots of different teams dodge each other vertically so that they do not overlap?
B: Is it possible to keep the whole pack of dots centered, i.e. to have no dodging when there are only dots of one team in one place?
The following code stops the overlap:
t0 <- t[ rep(1:nrow(t), t$n.persons %/% 10 ) , ]
t0$level <- as.numeric(t0$level) # This converts the y-axis variable (a factor) to numeric
t0$level <- ifelse(t0$team == "B", (t0$level+.1), t0$level) # This adds .1 to the position on the y-axis if the team is 'B'
ggplot(t0) + aes(x=week, y=level, fill=team) + geom_dotplot(binaxis="y", stackdir="center",
position=position_dodge(width=0.2))
Here is the output:
You could also subtract a value to move the dots downwards if you would prefer that.
If you want the line exactly between the dots, this code should do it:
t0$level <- ifelse(t0$team == "B", (t0$level+.06), t0$level)
t0$level <- ifelse(t0$team == "A", (t0$level-.06), t0$level)
Output:
I'm not sure off the top of my head how to skip the above ifelse when there is only one team at a given coordinate. I'd imagine you'd need to do a count of unique team labels at each coordinate and only if that count was > 1 then run the code above.
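The counting idea could be sketched as follows, using base R's ave to count the distinct teams at each (week, level) coordinate and applying the offset only where that count is greater than 1. The names dots, n_teams, offset and level_shifted are made up for this sketch (dots is used instead of t so that base::t is not masked), and the 0.06 offset is the value from the code above.

```r
# Same data as in the question
dots <- read.table(header = TRUE,
                   colClasses = c(rep("factor", 3), "numeric"), text =
"week team level n.persons
1 A 1 50
1 A 2 20
1 A 3 30
1 B 1 50
1 B 2 20
2 A 2 20
2 A 3 40
2 A 4 20
2 B 3 30
2 B 4 20")

# One row per 10 persons, as in the question
t0 <- dots[rep(seq_len(nrow(dots)), dots$n.persons %/% 10), ]
t0$level <- as.numeric(as.character(t0$level))

# Number of distinct teams present at each (week, level) coordinate
n_teams <- ave(as.integer(t0$team), t0$week, t0$level,
               FUN = function(v) length(unique(v)))

# Shift only where two teams share a coordinate: B up, A down
offset <- ifelse(n_teams > 1, ifelse(t0$team == "B", 0.06, -0.06), 0)
t0$level_shifted <- t0$level + offset
```

Using level_shifted as the y aesthetic in the same geom_dotplot call should then leave single-team packs centered on their level.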

Adding a fitted Weibull distribution (fitdistr) to a geom_bar (ggplot2) categorical plot

I have created a barplot of Age vs. Population size (by gender) from Census data in ggplot2. Similarly, I have used the 'fitdist' function from the fitdistrplus package to derive Weibull parameters for the normalised (by maximum observed population across all Age bins) population data.
What I would like to do is to overlay the plotted data with the distribution as a line plot. I have tried
+ geom_line (denscomp(malefit.w))
plus numerous other (unsuccessful) strategies.
Any help that could be provided would be much appreciated! Please find the syntax appended below:
Data Structure
   Order Age      Male    Female  Total   male.norm  female.norm
1  1     0 - 5    2870000 2820000 5690000 1.00000000 1.00000000
2  2     5 - 9    2430000 2390000 4820000 0.84668990 0.84751773
3  3     10 - 14  2340000 2250000 4590000 0.81533101 0.79787234
4  4     15 - 19  2500000 2500000 5000000 0.87108014 0.88652482
5  5     20 - 24  2690000 2680000 5370000 0.93728223 0.95035461
6  6     25 - 29  2540000 2520000 5060000 0.88501742 0.89361702
7  7     30 - 34  2040000 1990000 4030000 0.71080139 0.70567376
8  8     35 - 39  1710000 1760000 3470000 0.59581882 0.62411348
9  9     40 - 44  1400000 1550000 2950000 0.48780488 0.54964539
10 10    45 - 49  1200000 1420000 2620000 0.41811847 0.50354610
11 11    50 - 54  1010000 1210000 2220000 0.35191638 0.42907801
12 12    55 - 59  812000  985000  1800000 0.28292683 0.34929078
13 13    60 - 64  612000  773000  1390000 0.21324042 0.27411348
14 14    65 - 69  402000  556000  958000  0.14006969 0.19716312
15 15    70 - 74  293000  455000  748000  0.10209059 0.16134752
16 16    75 - 79  165000  316000  481000  0.05749129 0.11205674
17 17    80 - 84  101000  222000  323000  0.03519164 0.07872340
18 18    85 plus  75500   180000  256000  0.02630662 0.06382979
This is the answer to the original question I posed above. In conjunction with the data posted in the question it is a beginning to end solution (i.e. raw data to plot).
Fitting of South-African age-population data (by gender) to a Weibull distribution (Theresa Cain and Ben Small)
Load libraries
library(MASS)
library(ggplot2)
Import dataset
age_gender2 <- read.csv("age_gender2.csv", sep=",", header = T)
Define total population size by gender - that is, sum the entire male / female population across all age bins and place the results in the objects 'total.male' and 'total.female' respectively
total.male <- sum(age_gender2$Male)
total.female <- sum(age_gender2$Female)
The object 'age.groups' is a single number giving the number of age bins in the 'age_gender2' df
age.groups <- length(age_gender2$Age)
The object 'age.all' is a 1 row, 18 column matrix, initialised to zero, that will hold the midpoint age of each age bin (category) in the 'Age' column of the age_gender2 df
age.all <- matrix(0,1,age.groups)
The next lines assign the midpoint age of each bin to the corresponding element of the (1 X 18) matrix (HELP: writing a for loop in R).
Structure of the 'for' loop. RULE (given in parentheses ()): for each element (i), loop from 2 to the value held in the 'age.groups' object (i.e. 18). COMMAND (given in curly brackets {}): take element [1,i] of the 'age.all' matrix and assign ('<-') it the result of the operation ((5 X i) - 2.5). This operation provides the 'middle' age for the ith bin
The first element (row 1, column 1) of the 'age.all' matrix is assigned the value 2.5 directly
age.all[1,1] <- 2.5
for(i in 2:age.groups){
age.all[1,i] <- ((5*(i)) - 2.5)
}
This next command 'rep' creates a (1 X 25190500) vector of all the ages within a particular bin
male.data <- rep(age.all,age_gender2$Male)
female.data <- rep(age.all,age_gender2$Female)
Fit weibull distribution to age for male and female
male.weib <- fitdistr(male.data, "weibull")
female.weib <- fitdistr(female.data, "weibull")
male.shape <- male.weib$estimate[1]
male.scale <- male.weib$estimate[2]
female.shape <- female.weib$estimate[1]
female.scale <- female.weib$estimate[2]
Add column "Age_Median" to 'age_gender2' df with the mid-bin age. Need to transpose as 'age.all' is a 1 row X 18 column matrix.
age_gender2["Age_Median"] <- t(age.all)
Fit weibull distribution
The function 'pweibull' is the CDF (not the PDF): it gives the cumulative probability up to a given age. Therefore we subtract the cumulative probability at the lower edge of a bin from that at its upper edge to find the probability for that bin and hence (by multiplying by the total male population) the expected population for that bin.
male.p.weibull <- matrix(0,1,age.groups)
female.p.weibull <- matrix(0,1,age.groups)
for (i in 1:age.groups){
male.p.weibull[1,i] <- pweibull(age.all[1,i]+2.5, male.shape, male.scale) - pweibull(age.all[1,i]-2.5, male.shape, male.scale)
}
for (i in 1:age.groups){
female.p.weibull[1,i] <- pweibull(age.all[1,i]+2.5, female.shape, female.scale) - pweibull(age.all[1,i]-2.5, female.shape, female.scale)
}
Add columns listing the calculated population per age bin - 'transpose' 1 row x 18 columns -> 18 rows x 1 column
age_gender2["male.prob"] <- t(male.p.weibull * total.male)
age_gender2["female.prob"] <- t(female.p.weibull * total.female)
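As an aside, the per-bin probabilities can also be computed without explicit loops, by evaluating pweibull at the bin edges and taking differences. The sketch below uses made-up values for shape and scale (they are not the fitted estimates) and the names breaks, mid, p_bin and p_loop are introduced only for this illustration; it also assumes all 18 bins are 5 years wide, as the loop version does.

```r
# Hypothetical Weibull parameters, standing in for male.shape / male.scale
shape <- 1.3
scale <- 30

# Edges of the 18 five-year bins: 0, 5, ..., 90
breaks <- seq(0, 90, by = 5)

# Bin midpoints: 2.5, 7.5, ..., 87.5 (same values as age.all)
mid <- head(breaks, -1) + 2.5

# Probability mass in each bin = CDF at upper edge minus CDF at lower edge
p_bin <- diff(pweibull(breaks, shape, scale))

# Loop-style equivalent, written the same way as in the code above
p_loop <- vapply(mid, function(m) {
  pweibull(m + 2.5, shape, scale) - pweibull(m - 2.5, shape, scale)
}, numeric(1))
```

Multiplying p_bin by the total population then gives the expected population per bin, as in the loop version.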
Create bar plots describing Age-Gender population distributions
Males (real data) and super-imposed curve showing Weibull calculated probabilities (ggplot2)
agp.male <- ggplot(age_gender2, aes(x=reorder(Age, Order), y=Male, fill=Male)) + geom_bar(stat="identity") + theme (axis.text.x=element_text(angle=45, hjust=1)) + xlab("Age Group (5 yr bin)") + ylab("Male Population (M)") + geom_smooth(aes(age_gender2$Age,age_gender2$male.prob, group=1))
Females (real data) and super-imposed curve showing Weibull calculated probabilities (ggplot2)
agp.female <- ggplot(age_gender2, aes(x=reorder(Age, Order), y=Female, fill=Female)) + geom_bar(stat="identity") + theme (axis.text.x=element_text(angle=45, hjust=1)) + xlab("Age Group (5 yr bin)") + ylab("Female Population (M)") + geom_smooth(aes(age_gender2$Age,age_gender2$female.prob, group=1))

reverse axis in R

I am trying to plot a simple picture like this, using 3 values (x, y, z) loaded from a text file.
Now I need the X-axis to run from the biggest numbers to the lowest (currently the biggest numbers are on the right; I need them on the left), so that the two zeros meet in the same corner. I am using this simple code:
xyz <- read.table("excel")
scatterplot3d(xyz,xlim = c(0, 100000))
xyz
I have tried "rev" with no success. Picture always looks the same. Help will be greatly appreciated.
Sample data stored in file named "excel":
8884 20964 2
8928 5 1
9033 6 2
9261 61307 1
9435 64914 3
9605 5 2
9626 7 3
9718 5 3
10117 48941 7
10599 399 9
20834 5802 10
21337 3 8
21479 556 8
I want my 0,0,0 point to be in the lower right front corner.
You can choose an angle between 90 and 270 degrees:
scatterplot3d(xyz,xlim = c(0, 100000),angle=ang)
For example:
library(scatterplot3d)
z <- seq(-10, 10, 0.01)
x <- cos(z)+1
y <- sin(z)+1
scatterplot3d(x, y, z, highlight.3d=TRUE, col.axis="blue",angle=120,
col.grid="lightblue", main="scatterplot3d - 1", pch=20)
If you don't mind using the cloud function from the lattice package, then you can simply give the values of xlim in reversed order:
require(lattice)
xyz <- read.table( text =
"0 1 2
1 2 3
2 3 4
3 4 5")
cloud(V3~V1*V2,data = xyz, scales = list(arrows = FALSE), drape = T, xlim = c(3,0))
You can rotate the axes with the screen parameter to make it look the way you like.

How do I write a generic function to pick out distance between positive values?

I have a dataset that looks like so:
x y
1 0.0000 0.4459183993
2 125.1128 0.4068805502
3 250.2257 0.3678521348
4 375.3385 0.3294434397
5 500.4513 0.2922601919
6 625.5642 0.2566381551
7 750.6770 0.2229130927
8 875.7898 0.1914207684
9 1000.9026 0.1624969456
10 1126.0155 0.1364773879
11 1251.1283 0.1136978589
12 1376.2411 0.0944717371
13 1501.3540 0.0786550515
14 1626.4668 0.0656763159
15 1751.5796 0.0549476349
16 1876.6925 0.0458811131
17 2001.8053 0.0378895151
18 2126.9181 0.0304416321
19 2252.0309 0.0231041362
20 2377.1438 0.0154535572
21 2502.2566 0.0070928195
22 2627.3694 -0.0020708606
23 2752.4823 -0.0119351534
24 2877.5951 -0.0223944877
25 3002.7079 -0.0332811155
26 3127.8208 -0.0442410358
27 3252.9336 -0.0548855203
...
Full data available here.
It's easier to see visually by plotting x and y with a zero intercept line:
ggplot(dat,aes(x,y)) + geom_line() + geom_hline(yintercept=0)
You can see the plot here (if you don't want to download the data and plot it yourself.)
I want to pick out 'patches' defined as the distance along x from when the line goes above zero on the y till it goes below zero. This will always happen at least once (since the line starts above zero), but can happen many times.
Picking out the first patch is easy.
patch1=dat[min(which(dat$y<=0.000001)),]
But how would I loop through and pick up subsequent patches?
Here's a complete working solution:
# sample data
df <- data.frame(x=1:10, y=rnorm(10))
# find positive changes in "y"
idx <- which(c(FALSE, diff(df$y > 0) == 1))
# get the change in "x"
patches <- diff(c(0, df[idx, "x"]))
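Note that the code above gives the distance between successive upward crossings. If the length of each stretch where y stays above zero is wanted instead, one option is run-length encoding with rle. This is only a sketch on made-up toy data, and it measures from the first to the last sampled point of each positive run rather than interpolating the exact zero crossings; the column names start_x, end_x and length are introduced for illustration.

```r
# Toy data: y is positive on rows 1-3 and again on rows 6-7
df <- data.frame(x = c(0, 10, 20, 30, 40, 50, 60, 70),
                 y = c(0.4, 0.2, 0.1, -0.1, -0.3, 0.2, 0.5, -0.2))

# Runs of consecutive TRUE/FALSE values of "y is above zero"
runs <- rle(df$y > 0)

# Row index where each run ends and starts
ends <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1

# Keep only the runs where y is positive
pos <- runs$values
patches <- data.frame(start_x = df$x[starts[pos]],
                      end_x   = df$x[ends[pos]])

# Extent of each patch along x
patches$length <- patches$end_x - patches$start_x
```

On the toy data this yields two patches, one per positive run, so the same code handles any number of crossings without an explicit loop.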

Resources