R: Adding a condition to an existing plot

I have a data frame consisting of four variables: asn (an ID), ip_used, domain_used, and correct (binary, 0 or 1). Example data:
asn, ip_used,domain_used,correct
1,234,34,1
30,45,765,1
498,4,765,0
3874,876,8765,1
I have plotted ip_used and domain_used against each other for each asn in a bubble plot, and now I want to give the bubbles of the entries where "correct" equals 1 a different color.
Here is my current plot and my current code:
symbols(log_domused,log_ipused, circles = radius,inches=0.40, fg="black", bg="white",xlab = "# used domain",ylab="# used ips",main="dnsdb distribution of domains per ips for each ASN")
Does anybody have any idea how to do that?

Your data:
myData <- rbind(c(1,234,34,1), c(30,45,765,1), c(498,4,765,0), c(3874,876,8765,1))
colnames(myData) <- c("asn", "ip_used", "domain_used", "correct")
myData
asn ip_used domain_used correct
[1,] 1 234 34 1
[2,] 30 45 765 1
[3,] 498 4 765 0
[4,] 3874 876 8765 1
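You can specify the color of each circle with "fg" (or "bg"). Here fg=myData[,4]+1 selects color 1 (black) when correct is 0 and color 2 (red) when correct is 1, with domain_used on the x axis and ip_used on the y axis:
symbols(myData[,3], myData[,2], circles=c(1,1,1,1), inches=0.40, fg=myData[,4]+1, bg="white",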
xlab = "# used domain",ylab="# used ips",
main="dnsdb distribution of domains per ips for each ASN"
)
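If you would rather fill the bubbles than recolor their outlines, one option is to build the color vector explicitly and pass it to "bg". This is only a minimal sketch: it assumes the data are held in a data frame, that the log_* variables in the question were log10 transforms, and that all bubbles share the same radius (the question's radius variable is not shown).

```r
# Example data from the question, as a data frame
myData <- data.frame(
  asn         = c(1, 30, 498, 3874),
  ip_used     = c(234, 45, 4, 876),
  domain_used = c(34, 765, 765, 8765),
  correct     = c(1, 1, 0, 1)
)

# One fill color per row: red where correct == 1, white otherwise
bubble_col <- ifelse(myData$correct == 1, "red", "white")

# Assumption: log10 axes and equal radii, since the originals are not shown
symbols(log10(myData$domain_used), log10(myData$ip_used),
        circles = rep(1, nrow(myData)), inches = 0.40,
        fg = "black", bg = bubble_col,
        xlab = "# used domain", ylab = "# used ips")
legend("topleft", legend = c("correct = 1", "correct = 0"),
       pch = 21, pt.bg = c("red", "white"))
```

Because bubble_col is computed from the data, the same call works unchanged when rows are added or reordered.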


How to count number of instances above a value within a given range in R?

I have a rather large dataset looking at SNPs across an entire genome. I am trying to generate a heatmap that scales based on how many SNPs have a BF (bayes factor) value over 50 within a sliding window of x base pairs across the genome. For example, there might be 5 SNPs of interest within the first 1,000,000 base pairs, and then 3 in the next 1,000,000, and so on until I reach the end of the genome, which would be used to generate a single row heatmap. Currently, my data are set out like so:
SNP BF BP
0001_107388 11.62814713 107388
0001_193069 2.333472447 193069
0001_278038 51.34452334 278038
0001_328786 5.321968927 328786
0001_523879 50.03245434 523879
0001_804477 -0.51777189 804477
0001_990357 6.235452787 990357
0001_1033297 3.08206707 1033297
0001_1167609 -2.427835577 1167609
0001_1222410 52.96447989 1222410
0001_1490205 10.98099565 1490205
0001_1689133 3.75363951 1689133
0001_1746080 3.519987207 1746080
0001_1746450 -2.86666016 1746450
0001_1777011 0.166999413 1777011
0001_2114817 3.266942137 2114817
0001_2232084 50.43561123 2232084
0001_2332903 -0.15022324 2332903
0001_2347062 -1.209000033 2347062
0001_2426273 1.230915683 2426273
where SNP = the SNP ID, BF = the bayes factor, and BP = the position on the genome (I've fudged a couple of > 50 values in there for the data to be suitable for this example).
The issue is that I don't have a SNP for each genome position; otherwise I could simply split the windows of interest based on line count and then count how many lines in the BF column are over 50. Is there any way I can count the number of SNPs of interest within different windows of the genome positions? Preferably in R, but I have no issue with using other languages like Python or Bash if it gets the job done.
Thanks!
library(slider); library(dplyr)
my_data %>%
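mutate(count = slide_index_dbl(BF, BP, ~sum(.x > 50), .before = 999999))
For each SNP, this counts how many SNPs have BF > 50 within the trailing window of 1,000,000 base pairs that ends at that SNP's BP. BP must be sorted in ascending order. slide_index_dbl returns a numeric vector, whereas plain slide_index returns a list.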
SNP BF BP count
1 0001_107388 11.6281471 107388 0
2 0001_193069 2.3334724 193069 0
3 0001_278038 51.3445233 278038 1
4 0001_328786 5.3219689 328786 1
5 0001_523879 50.0324543 523879 2
6 0001_804477 -0.5177719 804477 2
7 0001_990357 6.2354528 990357 2
8 0001_1033297 3.0820671 1033297 2
9 0001_1167609 -2.4278356 1167609 2
10 0001_1222410 52.9644799 1222410 3
11 0001_1490205 10.9809957 1490205 2
12 0001_1689133 3.7536395 1689133 1
13 0001_1746080 3.5199872 1746080 1
14 0001_1746450 -2.8666602 1746450 1
15 0001_1777011 0.1669994 1777011 1
16 0001_2114817 3.2669421 2114817 1
17 0001_2232084 50.4356112 2232084 1
18 0001_2332903 -0.1502232 2332903 1
19 0001_2347062 -1.2090000 2347062 1
20 0001_2426273 1.2309157 2426273 1
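The question also describes non-overlapping windows (the first 1,000,000 base pairs, then the next 1,000,000, and so on). A base R sketch of that variant is shown here on a subset of the example rows; the names window_size, bin and counts are made up for illustration and are not part of the answer above.

```r
# A subset of the example rows from the question
my_data <- data.frame(
  BF = c(11.62814713, 51.34452334, 50.03245434, 6.235452787,
         52.96447989, 10.98099565, 3.266942137, 50.43561123),
  BP = c(107388, 278038, 523879, 990357,
         1222410, 1490205, 2114817, 2232084)
)

window_size <- 1e6

# Integer division assigns each SNP to a window: 0 = [0, 1e6), 1 = [1e6, 2e6), ...
bin_id <- my_data$BP %/% window_size

# Using a factor with every window as a level keeps empty windows in the result
bin <- factor(bin_id, levels = 0:max(bin_id))

# Count SNPs with BF > 50 per window; windows without any SNP come back as NA
counts <- tapply(my_data$BF > 50, bin, sum)
counts[is.na(counts)] <- 0
counts
```

The resulting vector has one entry per window, so it can be passed as a one-row matrix to a heatmap function such as image().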

Convert Number to Factor using Labels in R

I have a column in my dataset that has various different numeric values in it. However, 3 of the numbers have a specific label, while all others have a general label. Going through the dataset one by one is not an option. It is a very large dataset with 167K obs.
Below shows all the unique values that are in the column:
> unique(NYC_2019_Arrests$JURISDICTION_CODE)
Levels: 0 1 2 3 4 6 7 9 11 12 13 14 15 16 69 71 72 73 74 76 79 85 87 88 97
The levels of JURISDICTION_CODE are defined as follows:
JURISDICTION_CODE - Jurisdiction responsible for arrest. Jurisdiction codes 0(Patrol), 1(Transit) and 2(Housing) represent NYPD whilst codes
3 and more represent non NYPD jurisdictions.
This is the code that I tried to get it to work but just returns an error:
> NYC_2019_Arrests$JURISDICTION_CODE <- factor(NYC_2019_Arrests$JURISDICTION_CODE, levels = c(0,1,2, 3:100), labels = c("Patrol", "Transit", "Housing", "Non-NYPD Jurisdiction"))
Error in factor(NYC_2019_Arrests$JURISDICTION_CODE, levels = c(0, 1, 2, :
invalid 'labels'; length 4 should be 1 or 101
I also tried removing the 3:100 while keeping the labels, but that did not work either.
It would be greatly appreciated if anybody here knows how to give all values of 3 and above the generic label without having to type out all of the numbers individually.
Thanks!
The error message provides some direction. The problem is that the labels vector has length 4 but your levels vector has length 101. I think you are almost there with the original code. Just extend the labels to the correct length with:
reps<-rep("Non-NYPD Jurisdiction",98)
NYC_2019_Arrests$JURISDICTION_CODE <- factor(NYC_2019_Arrests$JURISDICTION_CODE, levels = c(0,1,2, 3:100), labels = c("Patrol", "Transit", "Housing", reps))
Edit with explanation:
Run this code for additional explanation.
#The key is that labels needs the same vector length as levels
#length of levels
levels <- c(0,1,2, 3:100)
print(length(levels))
#length of original labels
labels = c("Patrol", "Transit", "Housing", "Non-NYPD Jurisdiction")
print(length(labels))
#This is problematic: every level after the 4th (values 4 to 100) has no matching label, e.g. labels[5] is NA.
#Therefore we need to repeat "Non-NYPD Jurisdiction" for each of those levels
#since length(3:100) is 98, we need 98 repetitions
reps<-rep("Non-NYPD Jurisdiction",98)
labels <- c("Patrol", "Transit", "Housing", reps)
print(length(labels))
There are several ways to solve this. The simplest and best way I can think of is to use case_when from dplyr
Here is an example:
library(dplyr)
case_when(mtcars$carb == 1 ~ "One",
mtcars$carb == 2 ~ "Two",
mtcars$carb >= 3 ~ "Three or More")
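Applied to the jurisdiction codes, the same idea might look like the sketch below. The codes vector is a hypothetical stand-in for NYC_2019_Arrests$JURISDICTION_CODE. Since that column prints with "Levels:", it is assumed to be a factor, so it is converted to numbers via as.character() first.

```r
library(dplyr)

# Hypothetical stand-in for NYC_2019_Arrests$JURISDICTION_CODE (a factor)
codes <- factor(c(0, 1, 2, 3, 15, 97, 0))

# as.numeric() on a factor returns the internal level index, not the value,
# so convert through as.character() first
code_num <- as.numeric(as.character(codes))

jurisdiction <- factor(
  case_when(
    code_num == 0 ~ "Patrol",
    code_num == 1 ~ "Transit",
    code_num == 2 ~ "Housing",
    code_num >= 3 ~ "Non-NYPD Jurisdiction"
  ),
  levels = c("Patrol", "Transit", "Housing", "Non-NYPD Jurisdiction")
)

table(jurisdiction)
```

This avoids having to know the largest code in advance, because every value of 3 or more falls into the last condition.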

R: Gradient plot on a shapefile

I currently have a shapefile of the UK and have plotted the population of species in different regions of the UK. So far I have just plotted 3 levels of species population and coloured them red=high, orange=med, green=low. What I would like instead is a gradient plot, rather than being bounded by just 3 colours.
So far I have a table called Count that has the regions as the column names and then the count of species for each region below. My lowest count being 0 and my highest being around 2500 and the regions in Count match with the regions in my shapefile. I have a function that determines what is high, med, low based on levels you input yourself
High<-colnames(Count)[which(Count>'input value here')]
and then these are plotted onto the shapefile like this:
plot(ukmap[(ukmap$Region %in% High),],col='red',add=T)
Unfortunately I can't really install any packages, I was thinking of using colorRamp, but I'm not really sure what to do?
EDIT: my data looks something like this
Wales Midlands North Scotland South East South West
1 551 32 124 1 49 28
3 23 99 291 152 164 107
4 1 7 17 11 21 14
7 192 32 12 0 1 9
9 98 97 5 1 21 0
and the first column is just a number that represents the species and currently I have a function that plots the count onto a UK shapefile but based on boundaries of high, med and low. The data above is not attached to my shapefile. I then loop through for each line (species) of my data set and plot a new map for each line (species).
All right, I'll bite. I'm not going to use base R because plot is too hard for me to understand, so instead we will be using ggplot2.
# UK shapefile found via http://www.gadm.org/download
uk.url <- "http://www.filefactory.com/file/s3dz3jt3vr/n/GBR_adm_zip"
# replace following with your working directory - no trailing slash
work.dir <- "C:/Temp/r.temp/gb_map"
# the full file path for storing file
file.loc <- paste0(work.dir, "/uk.zip")
download.file (uk.url, destfile = file.loc, mode = "wb")
unzip(file.loc, exdir = work.dir)
# open the shapefile
require(rgdal)
require(ggplot2)
uk <- readOGR(work.dir, layer = "GBR_adm2")
# use the NAME_2 field (representing counties) to create data frame
uk.map <- fortify(uk, region = "NAME_2")
# create fake count data...
uk.map$count <- round(runif(nrow(uk.map), 0, 2500), 0)
# quick visual check
ggplot(uk.map, aes(x = long, y = lat, group = group, fill = count)) +
geom_polygon(colour = "black", size = 0.5, aes(group = group)) +
theme()
This generates the output below, which may be similar to what you need.
Note that we don't explicitly specify the gradient in this case; we just leave it up to ggplot. If you wish to specify those details it is possible but more involved. If you go down that route you should create another column in uk.map to allocate each count into one of (say) 10 bins using the cut function. The uk.map data frame looks like this:
> str(uk.map)
'data.frame': 427339 obs. of 8 variables:
$ long : num -2.05 -2.05 -2.05 -2.05 -2.05 ...
$ lat : num 57.2 57.2 57.2 57.2 57.2 ...
$ order: int 1 2 3 4 5 6 7 8 9 10 ...
$ hole : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ piece: Factor w/ 234 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ group: Factor w/ 1136 levels "Aberdeen.1","Aberdeenshire.1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ id : chr "Aberdeen" "Aberdeen" "Aberdeen" "Aberdeen" ...
$ count: num 1549 1375 433 427 1282 ...
>
OK, here is an alternative solution that doesn't use ggplot (I will leave the ggplot solution for reference). This code is simple but it should be enough to give you some ideas as to how you can adapt it to your own data.
# UK shapefile found via http://www.gadm.org/download
uk.url <- "http://www.filefactory.com/file/s3dz3jt3vr/n/GBR_adm_zip"
# replace following with your working directory - no trailing slash
work.dir <- "C:/Temp/r.temp/gb_map"
# the full file path for storing file
file.loc <- paste0(work.dir, "/uk.zip")
download.file (uk.url, destfile = file.loc, mode = "wb")
unzip(file.loc, exdir = work.dir)
# open the shapefile
require(rgdal)
uk <- readOGR(work.dir, layer = "GBR_adm2")
# make some fake data to plot
uk@data$count <- round(runif(nrow(uk@data), 0, 2500), 0)
uk@data$count <- as.numeric(uk@data$count)
# and plot it
plot(uk, col = gray(uk@data$count/2500))
The result of the code is the following plot.
EDIT following a request to include a legend, I have tweaked the code a little but in all honesty I don't understand base R's legend function well enough to get something of production quality and I have no wish to research it further. (Incidentally hat tip to this question for ideas.) A look at the plot beneath the code suggests that we need to reorder the legend colours etc but I will leave that to the original poster as an exercise or to post as another question.
# UK shapefile found via http://www.gadm.org/download
uk.url <- "http://www.filefactory.com/file/s3dz3jt3vr/n/GBR_adm_zip"
# replace following with your working directory - no trailing slash
work.dir <- "C:/Temp/r.temp/gb_map"
# the full file path for storing file
file.loc <- paste0(work.dir, "/uk.zip")
download.file (uk.url, destfile = file.loc, mode = "wb")
unzip(file.loc, exdir = work.dir)
# open the shapefile
require(rgdal)
uk <- readOGR(work.dir, layer = "GBR_adm2")
# make some fake data to plot
uk@data$count <- as.numeric(round(runif(nrow(uk@data), 0, 2500), 0))
uk@data$bin <- cut(uk@data$count, seq(0, 2500, by = 250),
include.lowest = TRUE, dig.lab = 4)
# labels for the legend
lev = levels(uk@data$bin)
lev2 <- gsub("\\,", " to ", lev)
lev3 <- gsub("\\]$", "", lev2)
lev4 <- gsub("\\(|\\)", " ", lev3)
lev5 <- gsub("^\\[", " ", lev4)
my.levels <- lev5
# Create a function to generate a continuous color palette
rbPal <- colorRampPalette(c('red','blue'))
uk@data$Col <- rbPal(10)[as.numeric(cut(uk@data$count, seq(0, 2500, by = 250), include.lowest = TRUE))]
# Plot
plot(uk, col = uk@data$Col)
legend("topleft", fill = uk@data$Col, legend = my.levels, col = uk@data$Col)
Have you tried colorRampPalette?
Here is how you could try to build a gradient palette
gradient_color <- colorRampPalette(c("blue", "red"))
gradient_color(10)
[1] "#0000FF" "#1C00E2" "#3800C6" "#5500AA" "#71008D" "#8D0071" "#AA0055"
[8] "#C60038" "#E2001C" "#FF0000"
An example plot
plot(rep(1,10),col=gradient_color(10))
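To connect such a palette to the counts without installing any packages, each count can be mapped to a palette index. The following is a sketch only: the count vector reuses the first species row from the question, the green-orange-red ramp mirrors the question's three colours, and n_cols and max_count are illustrative choices.

```r
# First species row from the question, one count per region
count <- c(Wales = 551, Midlands = 32, North = 124, Scotland = 1,
           "South East" = 49, "South West" = 28)

n_cols    <- 100    # number of gradient steps (illustrative)
max_count <- 2500   # upper end of the scale, per the question

pal <- colorRampPalette(c("green", "orange", "red"))(n_cols)

# Map each count to one of n_cols equal-width intervals on [0, max_count];
# all.inside = TRUE keeps max_count itself inside the last interval
idx <- findInterval(count, seq(0, max_count, length.out = n_cols + 1),
                    all.inside = TRUE)

region_col <- pal[idx]
names(region_col) <- names(count)
region_col

# With the shapefile from the question loaded, the colours could then be used as:
# plot(ukmap, col = region_col[as.character(ukmap$Region)])
```

Using a fixed max_count rather than max(count) keeps the colour scale comparable across the per-species maps.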

Categorical bubble plot for mapping studies

How to create a categorical bubble plot, using GNU R, similar to that used in systematic mapping studies (see below)?
EDIT: ok, here's what I've tried so far. First, my dataset (Var1 goes to the x-axis, Var2 goes to the y-axis):
> grid
Var1 Var2 count
1 Does.Not.apply Does.Not.apply 53
2 Not.specified Does.Not.apply 15
3 Active.Learning..general. Does.Not.apply 1
4 Problem.based.Learning Does.Not.apply 2
5 Project.Method Does.Not.apply 4
6 Case.based.Learning Does.Not.apply 22
7 Peer.Learning Does.Not.apply 6
10 Other Does.Not.apply 1
11 Does.Not.apply Not.specified 15
12 Not.specified Not.specified 15
21 Does.Not.apply Active.Learning..general. 1
23 Active.Learning..general. Active.Learning..general. 1
31 Does.Not.apply Problem.based.Learning 2
34 Problem.based.Learning Problem.based.Learning 2
41 Does.Not.apply Project.Method 4
45 Project.Method Project.Method 4
51 Does.Not.apply Case.based.Learning 22
56 Case.based.Learning Case.based.Learning 22
61 Does.Not.apply Peer.Learning 6
67 Peer.Learning Peer.Learning 6
91 Does.Not.apply Other 1
100 Other Other 1
Then, trying to plot the data:
# Based on http://flowingdata.com/2010/11/23/how-to-make-bubble-charts/
grid <- subset(grid, count > 0)
radius <- sqrt( grid$count / pi )
symbols(grid$Var1, grid$Var2, radius, inches=0.30, xlab="Research type", ylab="Research area")
text(grid$Var1, grid$Var2, grid$count, cex=0.5)
Here's the result:
Problems: axis labels are wrong, the dashed grid lines are missing.
Here is a ggplot2 solution. First, add radius as a new variable to your data frame.
grid$radius <- sqrt( grid$count / pi )
You should play around with the size of the points and of the text labels inside the plot to get a perfect fit.
library(ggplot2)
ggplot(grid,aes(Var1,Var2))+
geom_point(aes(size=radius*7.5),shape=21,fill="white")+
geom_text(aes(label=count),size=4)+
scale_size_identity()+
theme(panel.grid.major=element_line(linetype=2,color="black"),
axis.text.x=element_text(angle=90,hjust=1,vjust=0))
This will get you started by adding the tick marks to your x-axis.
To add the grid lines, just draw a line at each level.
ggs <- subset(gg, count > 0)
radius <- sqrt( ggs$count / pi )
# ggs$Var1 <- as.character(ggs$Var1)
# set up your tick marks
# (this can all be put into a single line in `axis`, but it's placed separate here to be more readable)
#--------------
# at which values to place the x tick marks
x_at <- seq_along(levels(gg$Var1))
# the string to place at each tick mark
x_labels <- levels(gg$Var1)
# use xaxt="n" to suppress the standard axis ticks
symbols(ggs$Var1, ggs$Var2, radius, inches=0.30, xlab="Research type", ylab="Research area", xaxt="n")
axis(side=1, at=x_at, labels=x_labels)
text(ggs$Var1, ggs$Var2, ggs$count, cex=0.5)
Also, notice that instead of calling the object grid I called it gg, and then ggs for the subset. grid is a function in R. While it is "allowed" to overwrite the function with an object, it is not recommended and can lead to annoying bugs down the line.
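For the dashed grid lines, one possible base R approach is abline() at each factor level. The sketch below uses a few rows from the question's data; labelling the y axis the same way as the x axis and redrawing the bubbles over the lines are additions for illustration, not part of the code above.

```r
lv <- c("Does.Not.apply", "Not.specified", "Other")
gg <- data.frame(
  Var1  = factor(c("Does.Not.apply", "Not.specified", "Other",
                   "Does.Not.apply", "Not.specified", "Other"), levels = lv),
  Var2  = factor(c("Does.Not.apply", "Does.Not.apply", "Does.Not.apply",
                   "Not.specified", "Not.specified", "Other"), levels = lv),
  count = c(53, 15, 1, 15, 15, 1)
)

# Factors are plotted at their integer codes 1, 2, 3, ...
x <- as.integer(gg$Var1)
y <- as.integer(gg$Var2)
radius <- sqrt(gg$count / pi)

x_at <- seq_along(levels(gg$Var1))
y_at <- seq_along(levels(gg$Var2))

symbols(x, y, circles = radius, inches = 0.30, bg = "white",
        xlab = "Research type", ylab = "Research area",
        xaxt = "n", yaxt = "n")
axis(side = 1, at = x_at, labels = levels(gg$Var1))
axis(side = 2, at = y_at, labels = levels(gg$Var2))

# Dashed grid line through every level on both axes
abline(v = x_at, h = y_at, lty = 2)

# Redraw the bubbles so the grid lines do not run across them
symbols(x, y, circles = radius, inches = 0.30, bg = "white", add = TRUE)
text(x, y, gg$count, cex = 0.5)
```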
Here is a version using levelplot (from lattice) together with panel.levelplot.points from latticeExtra.
library(latticeExtra)
levelplot(count~Var1*Var2,data=dat,
panel=function(x,y,z,...)
{
panel.abline(h=x,v=y,lty=2)
cex <- scale(z)*3
panel.levelplot.points(x,y,z,...,cex=5)
panel.text(x,y,label=z,cex=0.8)
},scales=list(x=list(abbreviate=TRUE))) ## to get short labels
To make the size of each bubble proportional to the count, you can do this:
library(latticeExtra)
levelplot(count~Var1*Var2,data=dat,
panel=function(x,y,z,...)
{
panel.abline(h=x,v=y,lty=2)
cex <- scale(z)*3
panel.levelplot.points(x,y,z,...,cex=cex)
panel.text(x,y,label=z,cex=0.8)
})
I don't display it since the rendering is not as clear as in the fixed-size case.

Peak detection in Manhattan plot

The attached plot (Manhattan plot) contains on the x axis chromosome positions from the genome and on the Y axis -log(p), where p is a p-value associated with the points (variants) from that specific position.
I have used the following R code to generate it (from the gap package) :
require(gap)
affy <-c(40220, 41400, 33801, 32334, 32056, 31470, 25835, 27457, 22864, 28501, 26273,
24954, 19188, 15721, 14356, 15309, 11281, 14881, 6399, 12400, 7125, 6207)
CM <- cumsum(affy)
n.markers <- sum(affy)
n.chr <- length(affy)
test <- data.frame(chr=rep(1:n.chr,affy),pos=1:n.markers,p=runif(n.markers))
oldpar <- par()
par(cex=0.6)
colors <- c("red","blue","green","cyan","yellow","gray","magenta","red","blue","green", "cyan","yellow","gray","magenta","red","blue","green","cyan","yellow","gray","magenta","red")
mhtplot(test,control=mht.control(colors=colors),pch=19,bg=colors)
> head(test)
chr pos p
1 1 1 0.79296584
2 1 2 0.96675136
3 1 3 0.43870076
4 1 4 0.79825513
5 1 5 0.87554143
6 1 6 0.01207523
I am interested in getting the coordinates of the peaks of the plot above a certain threshold (-log(p)) .
If you want the indices of the values above the 99th percentile:
# Add new column with log values
test = transform(test, log_p = -log10(test[["p"]]))
# Get the 99th percentile
pct99 = quantile(test[["log_p"]], 0.99)
...and get the values from the original data test:
peaks = test[test[["log_p"]] > pct99,]
> head(peaks)
chr pos p log_p
5 1 5 0.002798126 2.553133
135 1 135 0.003077302 2.511830
211 1 211 0.003174833 2.498279
586 1 586 0.005766859 2.239061
598 1 598 0.008864987 2.052322
790 1 790 0.001284629 2.891222
You can use this with any threshold. Note that I have not calculated the first derivative, see this question for some pointers:
How to calculate first derivative of time series
After calculating the first derivative, you can find the peaks by looking at points in the time series where the first derivative is (almost) zero and changes sign from positive to negative. After identifying these peaks, you can check which ones are above the threshold.
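A discrete version of that idea uses diff() in place of a derivative. This is a sketch on a made-up log_p vector with an arbitrary threshold. It does not handle plateaus, where consecutive values are equal, and it would need to be run separately for each chromosome.

```r
# Made-up -log10(p) values, for illustration only
log_p <- c(0.2, 1.5, 3.1, 2.0, 0.4, 1.6, 1.9, 1.7, 0.1, 4.0, 0.3)
threshold <- 2.5

# sign(diff()) is +1 while the series rises and -1 while it falls;
# a change from +1 to -1 (difference of -2) marks a local maximum
turn <- diff(sign(diff(log_p)))
peak_idx <- which(turn == -2) + 1

# Keep only the local maxima above the threshold
peak_idx <- peak_idx[log_p[peak_idx] > threshold]
peak_idx
log_p[peak_idx]
```

Here the local maximum at position 7 (value 1.9) is found by the sign change but dropped by the threshold filter.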
In my experience, after plotting the graph you can use the following R code to find the peak coordinates:
plot(x[,1], x[,2])
identify(x[,1], x[,2], labels=row.names(x))
Note that here x[,1] refers to the x coordinate (the genome coordinate) and x[,2] is your -log10(p) value.
At this point, use your mouse to select a point and hit Enter, which will give you the peak location. Then type the following code to get the coordinates:
coords <- locator(type="l")
coords
