Replacing legend and making distinct colours on scatter plot - r

I am working with countrywide data and looking at the relationship between disease count and flock size. I want to change the legend of the scatter plot so that it shows the names of the regions rather than the codes that currently appear on the plot. I also want to improve the colours representing the 8 regions, because the current ones are hard to tell apart. Any suggestions for improving the plot?
library(lattice)
xyplot(log(Cases2012+1)~ Flock2012, data=orf, groups = Region.Coding,
auto.key =
list(space = "right", points = TRUE))
portion of data:
Region Flock2012
1 190
2 343
1 810
3 1450
1 1125
3 1305
1 750
1 227
3 1800
1 1100
2 1250
1 362
6 800
2 559
4 770
1 900
2 600
1 860
2 1450
6 1014
1 1870
4 950
1 1730
5 353
1 6000
5 1150
1 3100
1 2400
5 278
2 444
2 546
7 775
2 870
5 690
8 1032
2 2351
7 680
3 430
2 931
8 1590
2 70
5 780
2 1366
2 1900
4 730
2 1860
2 1032
7 1700
2 230
2 301
5 565
I tried this, but the plot does not show up:
mycols <- c("red", "blue", "forestgreen", "gold", "black", "cyan", "darkorange", "darkred")
myregions <- c("East", "Midlands", "Wmidlands","NWest","NEast","Yorkshire","SEast","SWest")
xyplot(log(Flock2012+1)~ Flock2012, data=stack, groups = Regions,
col=mycols, pch=1,
key=list(space="right",
text=list(myregions),
points=list(col=mycols, cex=1.5, pch=1)

I think this should work. I would create a vector of the colours that you want and a vector of the region names, in the same order as the region codes.
mycols <- c("red", "blue", "forestgreen", "gold", "black", "cyan", "darkorange", "darkred")
myregions <- c("East", "Midlands", "Wmidlands","NWest","NEast","Yorkshire","SEast","SWest")
Then rather than use the auto.key option, use the key option for a bit more flexibility.
xyplot(log(Cases2012+1)~ Flock2012, data=orf, groups = Region.Coding,
col=mycols, pch=1,
key=list(space="right",
text=list(myregions),
points=list(col=mycols, cex=1.5, pch=1)))
Hope this helps.
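As a variation on the approach above, the codes can be turned into a labelled factor so that auto.key picks up the region names by itself. This is only a sketch: the `orf` data frame below is a made-up stand-in for the real data, and it assumes that codes 1 to 8 correspond to the names in `myregions` in that order.

```r
library(lattice)

# Made-up stand-in for the real `orf` data (assumption: codes run 1..8)
set.seed(1)
orf <- data.frame(
  Region.Coding = rep(1:8, each = 5),
  Flock2012     = round(runif(40, 50, 3000)),
  Cases2012     = rpois(40, 5)
)

mycols    <- c("red", "blue", "forestgreen", "gold", "black", "cyan", "darkorange", "darkred")
myregions <- c("East", "Midlands", "Wmidlands", "NWest", "NEast", "Yorkshire", "SEast", "SWest")

# Labelled factor: the legend text now comes from the factor levels
orf$Region <- factor(orf$Region.Coding, levels = 1:8, labels = myregions)

# par.settings keeps the plotted symbols and the legend symbols in sync
p <- xyplot(log(Cases2012 + 1) ~ Flock2012, data = orf, groups = Region,
            par.settings = list(superpose.symbol = list(col = mycols, pch = 1)),
            auto.key = list(space = "right", points = TRUE))
print(p)  # lattice objects must be printed explicitly inside scripts or functions
```

Note that a call with unbalanced parentheses, like the attempt in the question, never completes, which would also explain a plot that does not appear.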

Related

Cross joining for the computation of a new variable

I have a game data set and I observe the number of points of one player.
da = data.frame(points = c(144,186,220,410,433))
da
points
1 144
2 186
3 220
4 410
5 433
I also know which level the player was in, because I know the ranges of points for the different levels.
ranges = data.frame(level = c(1,2,3,4,5), points_from = c(0,100,200,300,430), points_to = c(100,170,300,430,550))
ranges
level points_from points_to
1 1 0 100
2 2 100 170
3 3 200 300
4 4 300 430
5 5 430 550
Now I want to compute a new variable that indicates how far away the player was from the next level. It is computed as da$points/ranges$points_to for the player's current level.
For example, if the player has 144 points and the next level is reached at 170 points, the level progress is 144/170.
Thus, the data set I want to have looks like this:
da_new = data.frame(points = c(144,186,220,410,433), points_to = c(170,300,300,430,550), level_progress = c(144/170,186/300,220/300,410/430,433/550))
da_new
points points_to level_progress
1 144 170 0.8471
2 186 300 0.6200
3 220 300 0.7333
4 410 430 0.9535
5 433 550 0.7873
How can I now compute this variable?
The main idea is to use merge(da, ranges, all = T) to do a "cross join" between the data. Then, we filter to where points is between points_from and points_to (meaning 186 is not in the final data).
library(dplyr)
merge(da, ranges, all = T) %>%
# keep only where points fall between points_from and points_to
filter(points >= points_from & points <= points_to) %>%
mutate(level_progress = points / points_to)
points level points_from points_to level_progress
1 144 2 100 170 0.8470588
2 220 3 200 300 0.7333333
3 410 4 300 430 0.9534884
4 433 5 430 550 0.7872727
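The cross join itself can be checked in isolation. The sketch below uses only base R and the data from the question; passing by = NULL makes the Cartesian product explicit instead of relying on the two data frames having no column names in common.

```r
da <- data.frame(points = c(144, 186, 220, 410, 433))
ranges <- data.frame(level       = c(1, 2, 3, 4, 5),
                     points_from = c(0, 100, 200, 300, 430),
                     points_to   = c(100, 170, 300, 430, 550))

# Every row of `da` paired with every row of `ranges`: 5 x 5 = 25 rows
cj <- merge(da, ranges, by = NULL)
nrow(cj)

# Keep only the pairs where the points fall inside the level's range
hit <- cj[cj$points >= cj$points_from & cj$points <= cj$points_to, ]
hit$level_progress <- hit$points / hit$points_to
hit
```

Because 186 lies in the gap between 170 and 200, it matches no range and drops out of `hit`.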
Another option is to filter where points <= points_to, and find where points is closest to points_to (this method keeps 186):
merge(da, ranges, all = T) %>%
filter(points <= points_to) %>%
group_by(points) %>%
slice(which.min(abs(points - points_to))) %>%
mutate(level_progress = points / points_to)
points level points_from points_to level_progress
<dbl> <dbl> <dbl> <dbl> <dbl>
1 144 2 100 170 0.847
2 186 3 200 300 0.62
3 220 3 200 300 0.733
4 410 4 300 430 0.953
5 433 5 430 550 0.787
Here is a base R solution using findInterval
da_new <- da
da_new$points_to <- ranges$points_to[findInterval(da_new$points,c(0,ranges$points_to))]
da_new$level_progress <- da_new$points/da_new$points_to
such that
> da_new
points points_to level_progress
1 144 170 0.8470588
2 186 300 0.6200000
3 220 300 0.7333333
4 410 430 0.9534884
5 433 550 0.7872727
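One detail worth checking with the findInterval approach is what happens when a score is exactly equal to a threshold. By default the intervals are closed on the left, so a player with exactly 170 points is treated as already in the next level. The sketch below shows the default and the left.open = TRUE alternative; which one is right depends on how the game defines level boundaries, which the question does not say.

```r
# Same break points as c(0, ranges$points_to) in the answer above
breaks <- c(0, 100, 170, 300, 430, 550)

# Default: intervals are [lower, upper), so 170 falls into [170, 300)
idx_default <- findInterval(170, breaks)

# left.open = TRUE: intervals are (lower, upper], so 170 falls into (100, 170]
idx_left_open <- findInterval(170, breaks, left.open = TRUE)

c(default = idx_default, left_open = idx_left_open)
```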

How to categorize a vector in R to draw a pie chart

I want to categorize rivers dataset into “tiny” (<500), “short” (<1500), “medium” (<3000) and “long”
(>=3000). I want to plot a pie chart that visualizes frequency of these four categories.
I tried:
rivers[rivers >= 3000] = 'long'
rivers[rivers >= 1500 & rivers < 3000] = 'meidum'
rivers[rivers >= 500 & rivers < 1500]='short'
rivers[rivers < 500] = 'tiny'
It seems the third command has no effect on data and they are the same as before!
table(rivers)
rivers
500 505 524 525 529 538 540 545 560 570 600 605
2 1 1 2 1 1 1 1 1 1 3 1
610 618 620 625 630 652 671 680 696 710 720 730
1 1 1 1 1 1 1 1 1 1 2 1
735 760 780 800 840 850 870 890 900 906 981 long
2 1 1 1 1 1 1 1 2 1 1 1
meidum tiny
36 62
What is wrong with my commands, and is it the right way to draw a pie chart for them?
The cut function can easily perform this task:
#random data
rivers<-runif(20, 0, 5000)
#break into desired groups and label
answer<-cut(rivers, breaks=c(0, 500, 1500, 3000, Inf),
labels=c("tiny", "short", "medium", "long"), right=FALSE)
table(answer)
# tiny short medium long
# 1 10 7 2
You are running into this problem because you are assigning character values to a numeric vector. The first assignment silently coerces the whole vector to character, so the later comparisons are string comparisons rather than numeric ones. If you keep the original numeric vector for the comparisons and write the labels into a separate character vector, it works:
> rivers_size <- as.character(rivers)
> rivers_size[rivers >= 3000] = 'long'
> rivers_size[rivers >= 1500 & rivers < 3000] = 'medium'
> rivers_size[rivers >= 500 & rivers < 1500]='short'
> rivers_size[rivers < 500] = 'tiny'
> table(rivers_size)
rivers_size
long medium short tiny
1 5 53 82
> pie(table(rivers_size))
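To see why the commands in the question misbehave, here is a small sketch of the coercion on three illustrative lengths (the values are chosen only for the demonstration).

```r
x <- c(735, 1450, 3710)   # illustrative river lengths
x[x >= 3000] <- "long"    # assigning a string turns the whole vector into character
x

# Numbers in a comparison are converted to strings too, and strings are
# compared character by character, so "1500" sorts before "500"
"1500" < "500"

# Hence nothing can be both >= "500" and < "1500"
third <- x >= 500 & x < 1500
third
```

This is why the 'short' assignment in the question leaves the data unchanged.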
Alternatively, the same thing can be accomplished using cut (as @Dave2e shows):
rivers <- cut(datasets::rivers,
breaks = c(0, 500, 1500, 3000, Inf),
labels = c("tiny", "short", "medium", "long"),
right = FALSE)
pie(table(rivers))
Here is another alternative using dplyr::case_when. It is more verbose than using cut but it is also easier to generalize.
library("tidyverse")
set.seed(1234) # for reproducibility
# `case_when` vectorizes multiple `if-else` statements.
rivers <- sample.int(5000, size = 1000, replace = TRUE)
rivers <- case_when(
rivers >= 3000 ~ "long",
rivers >= 1500 ~ "medium",
rivers >= 500 ~ "short",
TRUE ~ "tiny"
)
table(rivers)
#> rivers
#> long medium short tiny
#> 406 303 199 92
Created on 2019-04-10 by the reprex package (v0.2.1)

Plot observations in same x-axis point which linked with id variable

I need help. This is a view of my database :
482 940 914 1
507 824 1042 2
514 730 1450 3
477 595 913 4
My aim is to plot all the values of each row at the same point on the x-axis.
Example:
at x = 1 I want to plot 482, 940 and 914
at x = 2 I want to plot 507, 824 and 1042.
So there are three vertically aligned points for each x-axis value.
It's a good idea to share the data in a reproducible way. I'm using readClipboard (available on Windows only) to read the copied values into R. Anyway, here's a quick answer:
x <- as.numeric(unlist(strsplit(readClipboard(), " ")))
This makes it into a numeric vector. We now need to split into groups based on the description you provided. I'm using matrix to achieve this and will then convert to data.frame for plotting using ggplot2:
m <- matrix(x, ncol = 4, byrow = T)
> m
[,1] [,2] [,3] [,4]
[1,] 482 940 914 1
[2,] 507 824 1042 2
[3,] 514 730 1450 3
[4,] 477 595 913 4
df <- as.data.frame(m)
# Assign names to the data.frame
names(df) <- letters[1:4]
> df
a b c d
1 482 940 914 1
2 507 824 1042 2
3 514 730 1450 3
4 477 595 913 4
To get the plot:
library(ggplot2)
ggplot(df, aes(x = d)) +
geom_point(aes(y = a), color = "red") +
geom_point(aes(y = b), color = "green") +
geom_point(aes(y = c), color = "blue")
You can play around with ggtitle and xlab etc. to change the plot labels and add legends.
Hope this is helpful!
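If a legend is wanted, one possible variation is to reshape the data to long format first, so that a single geom_point call maps colour to the original column. The sketch below retypes the four rows from the question and uses base R's stack for the reshaping; the column names a, b, c and d follow the answer above.

```r
library(ggplot2)

df <- data.frame(a = c(482, 507, 514, 477),
                 b = c(940, 824, 730, 595),
                 c = c(914, 1042, 1450, 913),
                 d = 1:4)

# stack() puts a, b and c on top of each other: `values` holds the numbers,
# `ind` records which column each number came from
long <- data.frame(d = rep(df$d, times = 3), stack(df[c("a", "b", "c")]))

# One layer; the colour legend is generated automatically from `ind`
p <- ggplot(long, aes(x = d, y = values, colour = ind)) +
  geom_point()
print(p)
```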

DT::datatable in R, flexdashboard

Household Size 0 1 2 3 4 5+
Bedrooms Bedrooms Bedrooms Bedrooms Bedrooms Bedrooms
1 253 4486 2033 930 105 8
2 10 666 3703 947 85 7
3 4 68 1972 1621 52 5
4 1 12 680 1835 164 11
5+ 0 6 147 1230 721 122
I have the above dataframe where 'Bedrooms' is a label on the columns.
I'm trying to change this into a data table I can then use within rmarkdown to add into a flexdashboard. When I use the below code:
DT::datatable(df, rownames = FALSE, extensions = 'FixedColumns', escape=TRUE,options= list(bPaginate = FALSE))
I get the output:
Household Size 0 1 2 3 4 5+
1 253 4486 2033 930 105 8
2 10 666 3703 947 85 7
3 4 68 1972 1621 52 5
4 1 12 680 1835 164 11
5+ 0 6 147 1230 721 122
I have a few problems with this:
The labels that say 'Bedrooms' don't show, so there's no way of knowing what the numbers in the columns actually mean. I'd like to include the labels, or have a row on top of the column names that says "Number of Bedrooms" and spans all of the columns.
The columns Household Size and 5+ are wider than the rest of the columns. I want them either to be the same width or Household Size to be slightly wider than the rest.
I think it's worth noting that the row 5+ and the column 5+ both count any value of 5 or above.
Also, this is just an extra, but I'd like to colour the bottom left cells red and the top right cells green. Is this possible?
I've figured out how to keep 'Bedrooms' in the column titles. It's possible to set the column names within DT::datatable using the code below:
DT::datatable(HS_BED_ALL, rownames = FALSE,
colnames = c('Household Size', '0 Bedrooms', '1 Bedroom', '2 Bedrooms', '3 Bedrooms', '4 Bedrooms', '5+ Bedrooms'),
extensions = 'FixedColumns', escape = TRUE,
options = list(bPaginate = FALSE, dom = 't', buttons = c('excel'))) %>%
formatStyle(1:7, fontSize = '14px')
Which gives the desired output.
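For the other request in the question, a header row that says "Number of Bedrooms" above the column names, DT accepts a custom table header through the container argument. The following is a minimal sketch rather than a tested dashboard component: the data frame is retyped from the table in the question, and `sketch` and `widget` are just illustrative names.

```r
library(DT)

# The table from the question, typed in as a data frame
df <- data.frame(
  `Household Size` = c("1", "2", "3", "4", "5+"),
  `0`  = c(253, 10, 4, 1, 0),
  `1`  = c(4486, 666, 68, 12, 6),
  `2`  = c(2033, 3703, 1972, 680, 147),
  `3`  = c(930, 947, 1621, 1835, 1230),
  `4`  = c(105, 85, 52, 164, 721),
  `5+` = c(8, 7, 5, 11, 122),
  check.names = FALSE
)

# Two header rows: the first holds the spanning label,
# the second holds the individual bedroom counts
sketch <- htmltools::withTags(table(
  class = "display",
  thead(
    tr(
      th(rowspan = 2, "Household Size"),
      th(colspan = 6, "Number of Bedrooms")
    ),
    tr(lapply(names(df)[-1], th))
  )
))

widget <- datatable(df, container = sketch, rownames = FALSE,
                    options = list(paging = FALSE, dom = "t"))
# Print `widget` or leave it as the last expression of an R Markdown chunk to render it
```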

Multiple scatterplot figure in R

I have a slightly complicated plotting task. I am halfway there but not quite sure how to finish it. I have a dataset of the form below, with multiple subjects, each in either Treatgroup 0 or Treatgroup 1, and each subject contributing several rows of data. Each row corresponds to a single timepoint at which there are values in the columns "count1", "count2", "weirdname3", etc.
Task 1. I need to calculate "Days", which is just the visitdate - the startdate, for each row. Should be an apply type function, I guess.
Task 2. I have to make a multiplot figure with one scatterplot for each of the count variables (a plot for count1, one for count2, etc). In each scatterplot, I need to plot the value of the count (y axis) against "Days" (x-axis) and connect the dots for each subject. Subjects in Treatgroup 0 are one color, subjects in treatgroup 1 are another color. Each scatterplot should be labeled with count1, count2 etc as appropriate.
I am trying to use the base plotting function, and have taken the approach of writing a plotting function to call later. I think this can work but need some help with syntax.
#Enter example data
tC <- textConnection("
ID StartDate VisitDate Treatstarted count1 count2 count3 Treatgroup
C0098 13-Jan-07 12-Feb-10 NA 457 343 957 0
C0098 13-Jan-06 2-Jul-10 NA 467 345 56 0
C0098 13-Jan-06 7-Oct-10 NA 420 234 435 0
C0098 13-Jan-05 3-Feb-11 NA 357 243 345 0
C0098 14-Jan-06 8-Jun-11 NA 209 567 254 0
C0098 13-Jan-06 9-Jul-11 NA 223 235 54 0
C0098 13-Jan-06 12-Oct-11 NA 309 245 642 0
C0110 13-Jan-06 23-Jun-10 30-Oct-10 629 2436 45 1
C0110 13-Jan-07 30-Sep-10 30-Oct-10 461 467 453 1
C0110 13-Jan-06 15-Feb-11 30-Oct-10 270 365 234 1
C0110 13-Jan-06 22-Jun-11 30-Oct-10 236 245 23 1
C0151 13-Jan-08 2-Feb-10 30-Oct-10 199 653 456 1
C0151 13-Jan-06 24-Mar-10 3-Apr-10 936 25 654 1
C0151 13-Jan-06 7-Jul-10 3-Apr-10 1147 254 666 1
C0151 13-Jan-06 9-Mar-11 3-Apr-10 1192 254 777 1
")
data1 <- read.table(header=TRUE, tC)
close.connection(tC)
# format date
data1$VisitDate <- with(data1,as.Date(VisitDate,format="%d-%b-%y"))
# stuck: need to define days as VisitDate - StartDate for each row of dataframe (I know I need an apply family fxn here)
data1$Days <- [applyfunction of some kind ](VisitDate,ID,function(x){x-data1$StartDate})))
# Unsure here. Need to define plot function
plot_one <- function(d){
with(d, plot(Days, Count, t="n", tck=1, cex.main = 0.8, ylab = "", yaxt = 'n', xlab = "", xaxt="n", xlim=c(0,1000), ylim=c(0,1200))) # set limits
grid(lwd = 0.3, lty = 7)
with(d[d$Treatgroup == 0,], points(Days, Count1, col = 1))
with(d[d$Treatgroup == 1,], points(Days, Count1, col = 2))
}
#Create multiple plot figure
par(mfrow=c(2,2), oma = c(0.5,0.5,0.5,0.5), mar = c(0.5,0.5,0.5,0.5))
#trouble here. I need to call the column names somehow, with; plyr::d_ply(data1, ???, plot_one)
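Task 1: date arithmetic is vectorised, so no apply-family function is needed. Convert both columns to Date and take the difference in days:
data1$days <- as.numeric(difftime(as.Date(data1$VisitDate, format="%d-%b-%y"),
as.Date(data1$StartDate, format="%d-%b-%y"), units="days"))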
Task 2:
par(mfrow=c(3,1), oma = c(2,0.5,1,0.5), mar = c(2,0.5,1,0.5))
plot(data1$days, data1$count1, col=as.factor(data1$Treatgroup), main="count1")
plot(data1$days, data1$count2, col=as.factor(data1$Treatgroup), main="count2")
plot(data1$days, data1$count3, col=as.factor(data1$Treatgroup), main="count3")
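The plots above show the points but do not connect them per subject. One way to do that in base graphics is sketched below, using a four-row excerpt of the example data so that it runs on its own. `plot_one` and `count_vars` are illustrative names, and the sketch assumes English month abbreviations, which is why it sets the time locale first.

```r
Sys.setlocale("LC_TIME", "C")  # so that "%b" matches "Jan", "Feb", ...

# Excerpt of the example data from the question
d <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
ID StartDate VisitDate count1 count2 Treatgroup
C0098 13-Jan-06 2-Jul-10 467 345 0
C0098 13-Jan-06 7-Oct-10 420 234 0
C0110 13-Jan-06 23-Jun-10 629 2436 1
C0110 13-Jan-06 15-Feb-11 270 365 1
")
d$days <- as.numeric(as.Date(d$VisitDate, format = "%d-%b-%y") -
                     as.Date(d$StartDate, format = "%d-%b-%y"))

# One panel per count variable; one connected line per subject
plot_one <- function(d, yvar) {
  # Empty frame sized to the data, titled with the column name
  plot(d$days, d[[yvar]], type = "n", main = yvar, xlab = "Days", ylab = yvar)
  for (s in split(d, d$ID)) {
    s <- s[order(s$days), ]  # connect visits in time order
    lines(s$days, s[[yvar]], type = "b", col = s$Treatgroup[1] + 1)
  }
}

count_vars <- c("count1", "count2")
par(mfrow = c(1, length(count_vars)))
for (v in count_vars) plot_one(d, v)
```

Passing the column name as a string and indexing with d[[yvar]] is what lets a single function serve every count column, including ones with irregular names.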
