The attached Manhattan plot shows chromosome positions along the genome on the x axis and -log10(p) on the y axis, where p is the p-value of the variant at that position.
I used the following R code (from the gap package) to generate it:
require(gap)
affy <-c(40220, 41400, 33801, 32334, 32056, 31470, 25835, 27457, 22864, 28501, 26273,
24954, 19188, 15721, 14356, 15309, 11281, 14881, 6399, 12400, 7125, 6207)
CM <- cumsum(affy)
n.markers <- sum(affy)
n.chr <- length(affy)
test <- data.frame(chr=rep(1:n.chr,affy),pos=1:n.markers,p=runif(n.markers))
oldpar <- par()
par(cex=0.6)
colors <- c("red","blue","green","cyan","yellow","gray","magenta","red","blue","green", "cyan","yellow","gray","magenta","red","blue","green","cyan","yellow","gray","magenta","red")
mhtplot(test,control=mht.control(colors=colors),pch=19,bg=colors)
> head(test)
chr pos p
1 1 1 0.79296584
2 1 2 0.96675136
3 1 3 0.43870076
4 1 4 0.79825513
5 1 5 0.87554143
6 1 6 0.01207523
I am interested in getting the coordinates of the peaks of the plot that lie above a certain -log10(p) threshold.
If you want the indices of the values above the 99th percentile:
# Add new column with log values
test = transform(test, log_p = -log10(test[["p"]]))
# Get the 99th percentile
pct99 = quantile(test[["log_p"]], 0.99)
...and get the values from the original data test:
peaks = test[test[["log_p"]] > pct99,]
> head(peaks)
chr pos p log_p
5 1 5 0.002798126 2.553133
135 1 135 0.003077302 2.511830
211 1 211 0.003174833 2.498279
586 1 586 0.005766859 2.239061
598 1 598 0.008864987 2.052322
790 1 790 0.001284629 2.891222
You can use this with any threshold. Note that I have not calculated the first derivative; see this question for some pointers:
How to calculate first derivative of time series
After calculating the first derivative, you can find the peaks by looking for points in the series where the first derivative is (almost) zero and changes sign from positive to negative. After identifying these peaks, you can check which ones are above the threshold.
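The idea can be sketched on simulated p-values as follows. This is only a minimal illustration and not part of the code above: the first derivative is approximated with diff(), a local maximum is taken to be a point where that difference switches from positive to negative, and the names log_p, peak_idx and peaks_above are hypothetical. With uniformly random p-values almost every high point is its own local maximum, so this filter matters more for real, smoother signals.

```r
set.seed(42)
p <- runif(1000)                 # simulated p-values, as in the question
log_p <- -log10(p)

# Approximate the first derivative by successive differences
d <- diff(log_p)

# A local maximum is where the difference switches from positive to negative
peak_idx <- which(diff(sign(d)) == -2) + 1

# Keep only the local maxima that lie above the chosen threshold
threshold <- quantile(log_p, 0.99)
peaks_above <- peak_idx[log_p[peak_idx] > threshold]

data.frame(pos = peaks_above, log_p = log_p[peaks_above])
```

Any other threshold can be substituted for the 99th percentile.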
In my experience, after plotting the graph you can use the following R code to find the peak coordinates:
plot(x[,1], x[,2])
identify(x[,1], x[,2], labels=row.names(x))
Note that here x[,1] is the x coordinate (the genome coordinate) and x[,2] is your -log10(p) value.
At this point, use your mouse to select a point and hit Enter, which gives you the peak location; then run the following code to get the coordinates:
coords <- locator(type="l")
coords
There are five polygons for five different cities (see the attached file in the link; it is called bound.shp). I also have a point file, "points.csv", with longitude and latitude, where for each point I know the proportion of people belonging to group m and group h.
I am trying to calculate the spatial segregation measures proposed by Reardon and O’Sullivan in “Measures of Spatial Segregation”.
There is a package called "seg" which should allow us to do it. I have been trying, but so far without success.
Here is the link to the example file: LINK. After downloading "example", this is what I do:
setwd("~/example")
library(seg)
library(sf)
bound <- st_read("bound.shp")
points <- st_read("points.csv", options=c("X_POSSIBLE_NAMES=x","Y_POSSIBLE_NAMES=y"))
#I apply the following formula
seg::spseg(bound, points[ ,c(group_m, group_h)] , smoothing = "kernel", sigma = bandwidth)
Error: 'x' must be a numeric matrix with two columns
Can someone help me solve this issue? Or is there an alternate method which I can use?
Thanks a lot.
I don't know exactly what the spseg function does, but according to the seg package documentation:
The first argument, x, should be a data frame or an object of class Spatial.
The second argument, data, should be a matrix or a data frame.
From the examples for spseg, note that data should have as many rows as there are ids in the Spatial object. In your sample, the ids are the cities, each of which has its own polygon.
First, let's examine the bound data:
setwd("~/example")
library(seg)
library(sf)
#For the fortify function
library(ggplot2)
bound <- st_read("bound.shp")
bound <- as_Spatial(bound)
class(bound)
"SpatialPolygonsDataFrame"
attr(,"package")
"sp"
tail(fortify(bound))
Regions defined for each Polygons
long lat order hole piece id group
5379 83.99410 27.17326 972 FALSE 1 5 5.1
5380 83.99583 27.17339 973 FALSE 1 5 5.1
5381 83.99705 27.17430 974 FALSE 1 5 5.1
5382 83.99792 27.17552 975 FALSE 1 5 5.1
5383 83.99810 27.17690 976 FALSE 1 5 5.1
5384 83.99812 27.17700 977 FALSE 1 5 5.1
So you have 5 ids in your SpatialPolygonsDataFrame. Now, let's read points.csv with the read.csv function, since the data need to be in matrix format for the spseg function.
points <- read.csv("c://Users/cemozen/Downloads/example/points.csv")
tail(points)
group_m group_h x y
950 4.95 78.49000 84.32887 26.81203
951 5.30 86.22167 84.27448 26.76932
952 8.68 77.85333 84.33353 26.80942
953 7.75 82.34000 84.35270 26.82850
954 7.75 82.34000 84.35270 26.82850
955 7.75 82.34000 84.35270 26.82850
The documentation and its example state that the number of rows of the data, which has two attributes (group_m and group_h in our case), should equal the number of ids (the cities). Perhaps you should compute one value per polygon, for example the mean or some other statistic for each city, so that you end up with a single row for each polygon.
That said, I would just like to show that the function works properly when fed a matrix with 5 rows and 2 groups:
sample_spseg <- spseg(bound, as.matrix(points[1:5,c("group_m", "group_h")]))
print(sample_spseg)
Reardon and O'Sullivan's spatial segregation measures
Dissimilarity (D) : 0.0209283
Relative diversity (R): -0.008781
Information theory (H): -0.0066197
Exposure/Isolation (P):
group_m group_h
group_m 0.07577679 0.9242232
group_h 0.07516285 0.9248372
--
The exposure/isolation matrix should be read horizontally.
Read 'help(spseg)' for more details.
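The per-polygon summary suggested above could be sketched as follows. This is a toy illustration only: the data frame pts and its city column are hypothetical (points.csv itself has no city column, so the polygon membership would first have to come from a spatial join), and whether a mean is the appropriate statistic depends on what group_m and group_h represent.

```r
# Toy stand-in for the point data after each point has been assigned to a
# polygon; 'city' is a hypothetical column holding the polygon id.
pts <- data.frame(
  city    = c(1, 1, 2, 2, 3),
  group_m = c(4.95, 5.30, 8.68, 7.75, 6.00),
  group_h = c(78.49, 86.22, 77.85, 82.34, 80.00)
)

# One row per polygon: the mean of each group within each city
per_city <- aggregate(cbind(group_m, group_h) ~ city, data = pts, FUN = mean)

# Matrix with one row per polygon, the shape required for the data argument
city_matrix <- as.matrix(per_city[, c("group_m", "group_h")])
city_matrix
```

With the real data the result would have five rows, one per city, in the same order as the polygons.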
First: I do not have experience with the seg package and its functions.
What I read from your question is that you want to apply the spseg function to the points within each area?
If so, here is a possible approach:
library(sf)
library(tidyverse)
library(seg)
library(mapview) # for quick viewing only
# read polygons, make valid to avoid problems later on
areas <- st_read("./temp/example/bound.shp") %>%
sf::st_make_valid()
# read points and convert to sf object
points <- read.csv("./temp/example/points.csv") %>%
sf::st_as_sf(coords = c("x", "y"), crs = 4326) %>%
#spatial join city (use st_intersection())
sf::st_join(areas)
# what do we have so far??
mapview::mapview(points, zcol = "city")
# get the coordinates back into a data.frame
mydata <- cbind(points, st_coordinates(points))
# drop the geometry, we do not need it anymore
st_geometry(mydata) <- NULL
# looks like...
head(mydata)
# group_m group_h city X Y
# 1 8.02 84.51 2 84.02780 27.31180
# 2 8.02 84.51 2 84.02780 27.31180
# 3 8.02 84.51 2 84.02780 27.31180
# 4 5.01 84.96 2 84.04308 27.27651
# 5 5.01 84.96 2 84.04622 27.27152
# 6 5.01 84.96 2 84.04622 27.27152
# Split to a list by city
L <- split(mydata, mydata$city)
# loop over list and perform spseg function
final <- lapply(L, function(i) spseg(x = i[, 4:5], data = i[, 1:2]))
# test for the first city
final[[1]]
# Reardon and O'Sullivan's spatial segregation measures
#
# Dissimilarity (D) : 0.0063
# Relative diversity (R): -0.0088
# Information theory (H): -0.0067
# Exposure/Isolation (P):
# group_m group_h
# group_m 0.1160976 0.8839024
# group_h 0.1157357 0.8842643
# --
# The exposure/isolation matrix should be read horizontally.
# Read 'help(spseg)' for more details.
spplot(final[[1]], main = "Equal")
I have two large dataframes called intersections (representing intersections of a street system) and users (representing users of a network) as follows:
intersections has three columns: x,y and label_street. They respectively represent the position of an intersection in a squared observation window (say [0,5] x [0,5]) and the street it is located on. Here is an example:
intersections <- data.frame(x=c(0.147674, 0.235356, 0.095337, 0.147674), y=c(0.132956, 0.150813, 0.087345, 0.132956), label_street = c(5,6,5,6))
head(intersections)
x y label_street
1 0.147674 0.132956 5
2 0.235356 0.150813 6
3 0.095337 0.087345 5
4 0.147674 0.132956 6
An intersection being located at the crossing of several streets, every (x,y) combination in the intersections table appears at least twice, but with different label_street (e.g. rows 1 and 4 in the previous example). The label_street may not be the row number (which is why it starts at 5 in my example).
users has 4 columns: x,y, label_street, ID. They respectively represent the position of a user, the street it is located on and a unique ID per user. There are no duplicates in this dataframe, as a user is located on a unique street and has a unique ID. Here is an example (the ID and the label_street may not be the row number)
users <- data.frame(x = c(0.20428152, 0.17840619, 0.12964668, 0.20423856, 0.19349761, 0.10861251), y = c(0.14448448, 0.13921481, 0.11724543, 0.14447573, 0.14228827, 0.09891443), label_street = c(6,6,5,6,6,5), ID = c(2703, 3460, 4325, 12506, 19753, 21282))
head(users)
x y label_street ID
1 0.20428152 0.14448448 6 2703
2 0.17840619 0.13921481 6 3460
3 0.12964668 0.11724543 5 4325
4 0.20423856 0.14447573 6 12506
5 0.19349761 0.14228827 6 19753
6 0.10861251 0.09891443 5 21282
What I want to do is the following: for each point (x,y) of intersections, get the ID of, and the distance to, its closest neighbour in users that shares the same label_street.
I have a working solution using spatstat function nncross for nearest neighbour searching and plyr function adply for working on the data.
My working solution is as follows:
1) Write a user-defined function which gets the ID and the distance to the nearest neighbour of a row in a query table
NN <- function(row,query){
df <- row
window <- c(0,5,0,5) #Need this to convert to ppp objects and compute NN distance using nncross
NN <- nncross(as.ppp(row[,1:2],window),as.ppp(query[,1:2],window))
df$NN.ID <- query$ID[NN$which]
df$NN.dist <- NN$dist
return(df)
}
2) Apply this user-defined function row-wise to my dataframe "intersections", with the query being the subset of users sharing the same label_street as the row:
result <- adply(intersections, 1, function(row) NN(row, users[users$label_street == row$label_street, ]))
The result is as follows on the example:
head(result)
x y label_street NN.ID NN.dist
1 0.147674 0.132956 5 4325 0.02391247
2 0.235356 0.150813 6 2703 0.03171236
3 0.095337 0.087345 5 21282 0.01760940
4 0.147674 0.132956 6 3460 0.03136304
Since my real dataframes will be huge, I think computing distance matrices for looking at the nearest neighbour won't be efficient and that adply will be slow.
Does anyone have an idea for a data.table-like solution? I only know the basics of data.table and have always found it very efficient compared to plyr.
This solution uses the RANN package to find nearest neighbours. The trick is to first ensure that elements with different label_street have a higher distance between them than elements within the same label_street. We do this by adding an additional numeric column with a very large value that is constant within the same label_street but different between different values of label_street. In total, you get:
intersections <- data.frame(x=c(0.147674, 0.235356, 0.095337, 0.147674), y=c(0.132956, 0.150813, 0.087345, 0.132956), label_street = c(5,6,5,6))
users <- data.frame(x = c(0.20428152, 0.17840619, 0.12964668, 0.20423856, 0.19349761, 0.10861251), y = c(0.14448448, 0.13921481, 0.11724543, 0.14447573, 0.14228827, 0.09891443), label_street = c(6,6,5,6,6,5), number = c(2703, 3460, 4325, 12506, 19753, 21282))
# add a numeric column that is constant within each category and has a very large value
intersections$label_street_large <- intersections$label_street * 1e6
users$label_street_large <- users$label_street * 1e6
# call the nearest neighbour function (k = 1 neighbour);
# nn2() searches `data` (users) for the neighbour of each row of `query` (intersections)
nearest_neighbours <- RANN::nn2(
  data = users[, c("x", "y", "label_street_large")],
  query = intersections[, c("x", "y", "label_street_large")],
  k = 1
)
# get original IDs and distances
IDs <- users$number[c(nearest_neighbours$nn.idx)]
distances <- c(nearest_neighbours$nn.dists)
IDs
# [1]  4325  2703 21282  3460
distances
# [1] 0.02391247 0.03171236 0.01760940 0.03136304
I hope this helps you. It should be very fast because it only calls nn2 once, which runs in O(N * log(N)) time.
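To check a fast method like this on a small sample, a brute-force reference in base R can help. The sketch below is not meant for large data (it compares every intersection with every user on the same street), the function name nearest_same_street is hypothetical, and it assumes every street in intersections has at least one user.

```r
intersections <- data.frame(
  x = c(0.147674, 0.235356, 0.095337, 0.147674),
  y = c(0.132956, 0.150813, 0.087345, 0.132956),
  label_street = c(5, 6, 5, 6)
)
users <- data.frame(
  x = c(0.20428152, 0.17840619, 0.12964668, 0.20423856, 0.19349761, 0.10861251),
  y = c(0.14448448, 0.13921481, 0.11724543, 0.14447573, 0.14228827, 0.09891443),
  label_street = c(6, 6, 5, 6, 6, 5),
  ID = c(2703, 3460, 4325, 12506, 19753, 21282)
)

# Hypothetical helper: exhaustive nearest-neighbour search within each street
nearest_same_street <- function(inter, usr) {
  res <- lapply(seq_len(nrow(inter)), function(k) {
    # Candidate users on the same street as intersection k
    cand <- usr[usr$label_street == inter$label_street[k], ]
    # Euclidean distance from intersection k to every candidate
    d <- sqrt((cand$x - inter$x[k])^2 + (cand$y - inter$y[k])^2)
    best <- which.min(d)
    data.frame(NN.ID = cand$ID[best], NN.dist = d[best])
  })
  cbind(inter, do.call(rbind, res))
}

result_check <- nearest_same_street(intersections, users)
result_check
```

On the example data this reproduces the NN.ID and NN.dist columns shown in the question, which gives a baseline to compare any faster approach against.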
I have a set of values (100000 entries) ranging from -0.20 to +0.15, which are return percentages.
The bulk of the values lie between -3.5% and +3.5%.
I am looking to convert this into a factor such that:
returns between -0.035 and +.035 are binned equally in 0.05 increments and
anything between -0.2 and -.035 is binned as one factor level and
anything between 0.05 and .15 is binned as one factor level.
Any thoughts on how I can achieve this in R? I did try cut, but it seems to bin only in equal increments.
First I generated a vector holding the values (drawn from a uniform distribution):
library(data.table)
set.seed(555)#in order to be reproducible
N <- 100000#number of pseudonumbers to be generated
min1=-0.035#arbitrary limits
max1=0.035#idem
samp <- runif(N,min = -0.2,max = 0.15)#create the vector
level1 <- as.factor(ifelse(samp<=min1,paste0("(",min(samp),",",min1,"]"),NA))#create the first level
level2 <- as.factor(ifelse(samp>=max1,paste0("[",max1,",",max(samp),")"),NA))#create the second level
incr <- 0.005
level3 <- cut(samp,seq(min1, max1, by = incr))#create the intermediate levels
dt <- data.table(samp,level1,level2,level3)#put all together
mylevels <- na.omit(unlist(matrix(t(dt[,-1]))))#the vector that contains in which range the samp belongs to
For better visualization of results:
mylevels<-factor(mylevels,levels= unique(mylevels))
dt2<-dt[,.(samp,levels=mylevels)]
samp levels
1: -0.07023653 (-0.199996188434307,-0.035]
2: 0.10889991 [0.035,0.149995080730878)
3: 0.04246077 [0.035,0.149995080730878)
4: -0.01193010 (-0.015,-0.01]
5: 0.02607736 (0.025,0.03]
---
99996: -0.04786692 (-0.199996188434307,-0.035]
99997: -0.08700210 (-0.199996188434307,-0.035]
99998: 0.09989973 [0.035,0.149995080730878)
99999: 0.10095336 [0.035,0.149995080730878)
100000: -0.05555869 (-0.199996188434307,-0.035]
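As an alternative sketch, cut() also accepts a vector of break points, and those need not be equally spaced, so the same binning can be obtained in a single call. The 0.005 increment and the -0.035/0.035 limits below are taken from the code above; using -Inf and Inf for the outer breaks is my own choice so that no value falls outside the bins.

```r
set.seed(555)
samp <- runif(100000, min = -0.2, max = 0.15)

# Unequal breaks: one wide bin per tail, 0.005-wide bins in the middle
breaks <- c(-Inf, seq(-0.035, 0.035, by = 0.005), Inf)
binned <- cut(samp, breaks = breaks)

nlevels(binned)      # 16 levels: two tail bins plus 14 inner bins
head(table(binned))  # counts in the first few bins
```

The result is a single factor, so no merging of separate level columns is needed.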
I am trying to reproduce something similar to this map using ggplot2:
This is what I've done so far:
load("mapdata.Rdata")
> ls() #2 datasets: "depth" for basemap (geom_contour) and "data" is use to construct geom_bin2d
[1] "data" "depth"
> head(data)
latitude longitude GRcounts
740 67.20000 -57.83333 0
741 67.11667 -57.80000 0
742 67.10000 -57.93333 1
743 67.06667 -57.80000 0
751 67.15000 -58.15000 0
762 67.18333 -58.15000 0
ggplot(data=data,aes(x =longitude, y =latitude)) +
  theme_bw() +
  stat_bin2d(binwidth = c(0.5, 0.5)) +
  geom_contour(data=depth,aes(lon, lat, z=dn),colour = "black", bins=5) +
  xlim(c(-67,-56)) + ylim(c(65,71))
Which gives me this map:
The last step is to display, on top of my geom_bin2d tiles, circles with size proportional to the sum of the counts (GRcounts) within each bin.
Any tips on how to do so in ggplot (preferably) would be much appreciated.
Follow-up question: alignment mismatch between stat_bin2d and stat_summary2d when using facet_wrap
When I run the following code on the diamonds data set, there is no apparent problem. However, if I run the same code on my data, I get misalignment problems. Any thoughts on what may cause this?
p<-ggplot(diamonds,aes(x =carat, y =price,colour=cut))+
stat_summary2d(fun=sum,aes(z=depth,group=cut),bins=10)
p+facet_wrap(~cut)
df <- ggplot_build(p)$data[[1]]
summary(df)##now 5 groups, 1 panel
df$x<-with(df,(xmin+xmax)/2)
df$y<-with(df,(ymin+ymax)/2)
plot1<-ggplot(diamonds,aes(carat, price))+ stat_bin2d(bins=10)
plot1+geom_point(data=df,aes(x,y,size=value,group=group),color="red",shape=1)+facet_wrap(~group)
This is my Rcode and plot:
p<-ggplot(dat,aes(x =longitude, y =latitude,colour=SizeClass))+
stat_summary2d(fun=sum,aes(z=GRcounts,group=SizeClass),bins=10)
p+facet_wrap(~SizeClass)
df <- ggplot_build(p)$data[[1]]
summary(df)##now 4 groups, 1 panel
df$x<-with(df,(xmin+xmax)/2)
df$y<-with(df,(ymin+ymax)/2)
plot1<-ggplot(dat,aes(longitude, latitude))+ stat_bin2d(bins=10)
plot1+geom_point(data=df,aes(x,y,size=value,group=group),color="red",shape=1)+facet_wrap(~group)
> head(dat[c(7,8,14,21)])###mydata
latitude longitude GRcounts SizeClass
742 67.10000 -57.93333 1 (100,150)
784 67.21667 -57.95000 1 (100,150)
756 67.11667 -57.80000 1 (<100)
1233 68.80000 -59.55000 2 (100,150)
1266 68.68333 -59.60000 2 (100,150)
1288 68.66667 -59.65000 1 (100,150)
My data set can be downloaded here: data
As your dataset doesn't work on my computer, I will use the diamonds dataset as an example.
Make a new plot of your data with stat_summary2d(), set z= to the variable you want to sum (in your case GRcounts), and provide fun=sum to sum those values. Store the plot as an object.
p<-ggplot(diamonds,aes(carat,price))+stat_summary2d(fun=sum,aes(z=depth))
Use the function ggplot_build() to get the data used for the plot. The coordinates of the rectangles are in columns xmin, xmax, ymin and ymax, and the sums are in column value.
df <- ggplot_build(p)$data[[1]]
head(df)
fill xbin ybin value ymax ymin yint xmax xmin xint PANEL group
1 #55B1F7 [0.2,0.36] [326,943] 641318.2 942.5667 326.0000 1 0.3603333 0.2000000 1 1 1
2 #1A3955 [0.2,0.36] (943,1.56e+03] 75585.5 1559.1333 942.5667 2 0.3603333 0.2000000 1 1 1
3 #132B43 [0.2,0.36] (1.56e+03,2.18e+03] 415.8 2175.7000 1559.1333 3 0.3603333 0.2000000 1 1 1
4 #132B43 [0.2,0.36] (2.18e+03,2.79e+03] 304.4 2792.2667 2175.7000 4 0.3603333 0.2000000 1 1 1
5 #244D71 (0.36,0.521] [326,943] 179486.8 942.5667 326.0000 1 0.5206667 0.3603333 2 1 1
6 #2D5F8A (0.36,0.521] (943,1.56e+03] 271688.9 1559.1333 942.5667 2 0.5206667 0.3603333 2 1 1
For the points calculate x and y positions as mean of xmin,xmax and ymin,ymax.
df$x<-with(df,(xmin+xmax)/2)
df$y<-with(df,(ymin+ymax)/2)
Use this new data frame to add points to your original plot with stat_bin2d().
ggplot(diamonds,aes(carat,price))+stat_bin2d()+
geom_point(data=df,aes(x=x,y=y,size=value),color="red",shape=1)
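The per-bin sums can also be computed without going through ggplot_build(). The following base R sketch is only an illustration: bin_sums is a hypothetical helper, the toy data merely mimic the longitude/latitude/GRcounts columns of the question, and the bins here start at multiples of the bin width, which may not coincide with the bin edges that stat_bin2d() chooses.

```r
# Hypothetical helper: sum z within rectangular bins of the given width
bin_sums <- function(x, y, z, binwidth = c(0.5, 0.5)) {
  # Integer bin index along each axis (bins start at multiples of binwidth)
  xb <- floor(x / binwidth[1])
  yb <- floor(y / binwidth[2])
  out <- aggregate(data.frame(value = z), by = list(xb = xb, yb = yb), FUN = sum)
  # Bin centres, for placing one circle per bin
  out$x <- (out$xb + 0.5) * binwidth[1]
  out$y <- (out$yb + 0.5) * binwidth[2]
  out[, c("x", "y", "value")]
}

toy <- data.frame(
  longitude = c(-57.83, -57.80, -57.93, -58.15),
  latitude  = c(67.20, 67.12, 67.10, 67.15),
  GRcounts  = c(0, 0, 1, 2)
)
sums <- bin_sums(toy$longitude, toy$latitude, toy$GRcounts)
sums
```

The resulting data frame has the same role as df above: x and y give the circle positions and value gives the size.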
UPDATE - solution with faceting
To use facet_wrap() and combine stat_bin2d() with points, you need a workaround, as there seems to be some problem.
First, create two plots - one for sums with stat_summary2d() and one for counts with stat_bin2d(). Both plots should be faceted.
plot1 <- ggplot(dat,aes(x =longitude, y =latitude))+
stat_summary2d(fun=sum,aes(z=GRcounts),bins=10)+facet_wrap(~SizeClass)
plot2 <- ggplot(dat,aes(longitude, latitude))+ stat_bin2d(bins=10)+
facet_wrap(~SizeClass)
Now extract the data from both plots using ggplot_build() and store them as objects. For the sums data frame (df1), calculate the x and y coordinates as in the example above.
df1 <- ggplot_build(plot1)$data[[1]]
df1$x<-with(df1,(xmin+xmax)/2)
df1$y<-with(df1,(ymin+ymax)/2)
df2<-ggplot_build(plot2)$data[[1]]
Now plot your data using those new data frames: df1 for the points and df2 for the rectangles. With geom_rect() you get rectangles whose fill= depends on count. For faceting, use the column PANEL.
ggplot()+geom_rect(data=df2,aes(xmin=xmin,xmax=xmax,
ymin=ymin,ymax=ymax,fill=count))+
geom_point(data=df1,aes(x=x,y=y,size=value),shape=1,color="red")+
facet_wrap(~PANEL)
I'm trying to read a GRIB file, wavedata.grib, with wave heights from the ECMWF ERA-40 website, using an R function. Here is my source code so far:
mylat = 43.75
mylong = 331.25
# read the GRIB file
library(rgdal)
library(sp)
gribfile<-"wavedata.grib"
grib <- readGDAL(gribfile)
summary = GDALinfo(gribfile,silent=TRUE)
save(summary, file="summary.txt",ascii = TRUE)
# >names(summary): rows columns bands ll.x ll.y res.x res.y oblique.x oblique.y
rows = summary[["rows"]]
columns = summary[["columns"]]
bands = summary[["bands"]]
# z=geometry(grib)
# Grid topology:
# cellcentre.offset cellsize cells.dim
# x 326.25 2.5 13
# y 28.75 2.5 7
# SpatialPoints:
# x y
# [1,] 326.25 43.75
# [2,] 328.75 43.75
# [3,] 331.25 43.75
myframe<-t(data.frame(grib))
# myframe[bands+1,3]=331.25 myframe[bands+2,3]=43.75
# myframe[1,3]=2.162918 myframe[2,3]=2.427078 myframe[3,3]=2.211989
# These values should match the values read by Degrib (see below)
# degrib.exe wavedata.grib -P -pnt 43.75,331.25 -Interp 1 > wavedata.txt
# element, unit, refTime, validTime, (43.750000,331.250000)
# SWH, [m], 195709010000, 195709010000, 2.147
# SWH, [m], 195709020000, 195709020000, 2.159
# SWH, [m], 195709030000, 195709030000, 1.931
lines = rows * columns
mycol = 0
for (i in 1:lines) {
if (mylat==myframe[bands+2,i] & mylong==myframe[bands+1,i]) {mycol = i+1}
}
# notice mycol = i+1 in order to get values in column to the right
myvector <- as.numeric(myframe[,mycol])
sink("output.txt")
cat("lat:",myframe[bands+2,mycol],"long:",myframe[bands+1,mycol],"\n")
for (i in 1:bands) { cat(myvector[i],"\n") }
sink()
The wavedata.grib file has gridded SWH values for the period 1957-09-01 to 2002-08-31. Each band refers to a pair of lat/long and has a series of 16436 SWH values at 00h of each day (1 band = 16436 values at a certain lat/long).
myframe has dimensions 16438 x 91. Notice that 91 = 7 rows x 13 columns, and that 16438 is almost equal to the number of bands: the 2 additional rows are the long and lat values, and all other rows should be wave heights corresponding to the 16436 bands.
The problem is that I want to extract SWH (wave heights) at lat/long = 43.75,331.25, but the values don't match those I get when reading the file with the Degrib utility at the same lat/long.
Also, the correct values I want (2.147, 2.159, 1.931, ...) are in column 4 and not column 3 of myframe, even though myframe[16438,3]=43.75 (lat) and myframe[16437,3]=331.25 (long). Why is this? I would like to know which lat/long the myframe[i,j] values actually correspond to, or whether there is some data import error in the process. I'm assuming Degrib has no errors.
Is there any R routine to easily interpolate values in a matrix if I want to extract values between grid points? More generally, I need help in writing an effective R function to extract wave heights like this:
SWH <- function (latitude, longitude, date/time)
Please help.
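On the interpolation part of the question, a minimal base R sketch of bilinear interpolation on a regular grid might look like the following. The function name bilinear and the toy matrix z are hypothetical; the grid coordinates are taken from the grid topology printed above. It assumes z is indexed as z[i, j] with i along x (longitude) and j along y (latitude), both increasing, which may not match the cell order that readGDAL returns.

```r
# Hypothetical helper: bilinear interpolation of matrix z at the point (x0, y0)
# x, y: increasing grid coordinates; dim(z) must be c(length(x), length(y))
bilinear <- function(x, y, z, x0, y0) {
  # Index of the grid cell containing the point (lower-left corner)
  i <- findInterval(x0, x, rightmost.closed = TRUE)
  j <- findInterval(y0, y, rightmost.closed = TRUE)
  stopifnot(i >= 1, i < length(x), j >= 1, j < length(y))
  # Relative position of the point inside the cell, between 0 and 1
  tx <- (x0 - x[i]) / (x[i + 1] - x[i])
  ty <- (y0 - y[j]) / (y[j + 1] - y[j])
  # Weighted average of the four surrounding grid values
  (1 - tx) * (1 - ty) * z[i, j] + tx * (1 - ty) * z[i + 1, j] +
    (1 - tx) * ty * z[i, j + 1] + tx * ty * z[i + 1, j + 1]
}

# Grid as in the topology above: 13 x 7 cells, 2.5 degrees apart
x <- seq(326.25, by = 2.5, length.out = 13)
y <- seq(28.75, by = 2.5, length.out = 7)
# Toy surface (not wave data), linear in x and y
z <- outer(x, y, function(a, b) 2 * a + 3 * b)

bilinear(x, y, z, x0 = 330, y0 = 40)
```

For a time series, the same function would be applied to the matrix of each band in turn, once the mapping from band cells to lat/long has been confirmed.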