How to make multiple lines in R

I have data like the following and want to draw multiple lines in R: one line for each of SC1, SC2, SC3, SC4 and SC5, with chr (from 1 to 10) on the x-axis.
chr pos SC1 SC2 SC3 SC4 SC5
chr01.8.5 1 0.000 2.420907e-02 1.317053e+00 7.171021e-02 3.280758e-03 1.185807e+00
chr01.6.5 1 0.714 0.040931607 1.150449274 0.042270667 0.044192568 0.976696855

A quick and slightly dirty way is to use ?matlines
# assume d is your data
plot(d$chr, d$pos) # plots the data as points
matlines(d$chr, d[,-(1:2)]) # plots every column except 1,2 against d$chr
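If you want the lines, distinct colors, and a legend in one go, matplot is a convenient alternative. A sketch (my addition), assuming d is read so that chr and pos are its first two columns and SC1..SC5 the rest:
# plot every SC column as a line against chr
matplot(d$chr, d[, -(1:2)], type = "l", lty = 1, col = 1:5,
        xlab = "chr", ylab = "value")
legend("topright", legend = names(d)[-(1:2)], col = 1:5, lty = 1)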

Related

Calculating measure of spatial segregation?

There are five polygons for five different cities (see the attached file in the link; it's called bound.shp). I also have a point file "points.csv" with longitude and latitude, where for each point I know the proportion of people belonging to group m and group h.
I am trying to calculate the spatial segregation measures proposed by Reardon and O'Sullivan, "Measures of Spatial Segregation".
There is a package called "seg" which should allow me to do this, but so far I have had no success.
Here is the link to the example file: LINK. After downloading the "example", this is what I do:
setwd("~/example")
library(seg)
library(sf)
bound <- st_read("bound.shp")
points <- st_read("points.csv", options=c("X_POSSIBLE_NAMES=x","Y_POSSIBLE_NAMES=y"))
#I apply the following formula
seg::spseg(bound, points[ ,c(group_m, group_h)] , smoothing = "kernel", sigma = bandwidth)
Error: 'x' must be a numeric matrix with two columns
Can someone help me solve this issue? Or is there an alternate method which I can use?
Thanks a lot.
I don't know exactly what the spseg function does, but evaluating its documentation in the seg package:
The first argument x should be a data frame or an object of class Spatial.
The second argument data should be a matrix or data frame.
Going through the examples for spseg, note that data should have the same number of rows as there are ids in the Spatial object. In your sample, the ids are the cities, each with its own polygon.
First, let's examine the bound data:
setwd("~/example")
library(seg)
library(sf)
#For the fortify function
library(ggplot2)
bound <- st_read("bound.shp")
bound <- as_Spatial(bound)
class(bound)
"SpatialPolygonsDataFrame"
attr(,"package")
"sp"
tail(fortify(bound))
Regions defined for each Polygons
long lat order hole piece id group
5379 83.99410 27.17326 972 FALSE 1 5 5.1
5380 83.99583 27.17339 973 FALSE 1 5 5.1
5381 83.99705 27.17430 974 FALSE 1 5 5.1
5382 83.99792 27.17552 975 FALSE 1 5 5.1
5383 83.99810 27.17690 976 FALSE 1 5 5.1
5384 83.99812 27.17700 977 FALSE 1 5 5.1
So you have 5 ids in your SpatialPolygonsDataFrame. Now let's read points.csv with the read.csv function, since spseg needs the data as a matrix (or data frame).
points <- read.csv("c://Users/cemozen/Downloads/example/points.csv")
tail(points)
group_m group_h x y
950 4.95 78.49000 84.32887 26.81203
951 5.30 86.22167 84.27448 26.76932
952 8.68 77.85333 84.33353 26.80942
953 7.75 82.34000 84.35270 26.82850
954 7.75 82.34000 84.35270 26.82850
955 7.75 82.34000 84.35270 26.82850
The documentation and the example within it state that the number of rows of the data (which here has the two attributes group_m and group_h) must equal the number of ids (that is, the cities). You could compute one value per polygon, for example the mean or some other statistic for each city, as sketched below.
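One possible aggregation, as a hedged sketch: the city column is an assumption here (you could attach it with a spatial join, as the next answer does), and the mean is just one choice of summary statistic.
# a sketch; assumes each point carries a city id matching the polygon order
agg <- aggregate(cbind(group_m, group_h) ~ city, data = points, FUN = mean)
spseg(bound, as.matrix(agg[, c("group_m", "group_h")]))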
That aside, I just want to show that the function works properly when fed a matrix with 5 rows and the 2 groups.
sample_spseg <- spseg(bound, as.matrix(points[1:5,c("group_m", "group_h")]))
print(sample_spseg)
Reardon and O'Sullivan's spatial segregation measures
Dissimilarity (D) : 0.0209283
Relative diversity (R): -0.008781
Information theory (H): -0.0066197
Exposure/Isolation (P):
group_m group_h
group_m 0.07577679 0.9242232
group_h 0.07516285 0.9248372
--
The exposure/isolation matrix should be read horizontally.
Read 'help(spseg)' for more details.
First: I do not have experience with the seg package and its functions.
What I read from your question is that you want to run the spseg function on the points within each area.
If so, here is a possible approach:
library(sf)
library(tidyverse)
library(seg)
library(mapview) # for quick viewing only
# read polygons, make valid to avoid problems later on
areas <- st_read("./temp/example/bound.shp") %>%
sf::st_make_valid()
# read points and convert to sf object
points <- read.csv("./temp/example/points.csv") %>%
sf::st_as_sf(coords = c("x", "y"), crs = 4326) %>%
# spatial join to attach each point's city (st_intersection() is an alternative)
sf::st_join(areas)
# what do we have so far??
mapview::mapview(points, zcol = "city")
# get the coordinates back into a data.frame
mydata <- cbind(points, st_coordinates(points))
# drop the geometry, we do not need it anymore
st_geometry(mydata) <- NULL
# looks like...
head(mydata)
# group_m group_h city X Y
# 1 8.02 84.51 2 84.02780 27.31180
# 2 8.02 84.51 2 84.02780 27.31180
# 3 8.02 84.51 2 84.02780 27.31180
# 4 5.01 84.96 2 84.04308 27.27651
# 5 5.01 84.96 2 84.04622 27.27152
# 6 5.01 84.96 2 84.04622 27.27152
# Split to a list by city
L <- split(mydata, mydata$city)
# loop over the list and run the spseg function
final <- lapply(L, function(i) spseg(x = i[, 4:5], data = i[, 1:2]))
# test for the first city
final[[1]]
# Reardon and O'Sullivan's spatial segregation measures
#
# Dissimilarity (D) : 0.0063
# Relative diversity (R): -0.0088
# Information theory (H): -0.0067
# Exposure/Isolation (P):
# group_m group_h
# group_m 0.1160976 0.8839024
# group_h 0.1157357 0.8842643
# --
# The exposure/isolation matrix should be read horizontally.
# Read 'help(spseg)' for more details.
spplot(final[[1]], main = "Equal")

Performing a 2 sample t test in R with replicates

I have a data frame named R_alltemp in R with 6 columns: 2 groups of data with 3 replicates each. I'm trying to perform a t-test for each row, between the first three values and the last three, and to use apply() so it goes through all the rows in one line. Here is the code I'm using so far.
R_alltemp$p.value<-apply(R_all3,1, function (x) t.test(x(R_alltemp[,1:3]), x(R_alltemp[,4:6]))$p.value)
and here is a snapshot of the table
R1.HCC827 R2.HCC827 R3.HCC827 R1.nci.h1975 R2.nci.h1975 R3.nci.h1975 p.value
1 13.587632 22.225083 15.074230 58.187465 79 82.287573 0.4391160
2 2.717526 1.778007 1.773439 1.763257 2 1.679338 0.4186339
3 203.814478 191.135711 232.320487 253.908939 263 263.656100 0.4904493
4 44.386264 45.339169 54.089884 3.526513 3 5.877684 0.3095634
It runs, but the p-values I'm getting seem wrong just from eyeballing them. For instance, in the first line the average of the first group is way lower than that of the second group, but my p-value is only about 0.44.
I feel like I'm missing something very obvious here, but I've been struggling with it for much longer than I'd like. Any help would be appreciated.
Your code is incorrect. I actually don't understand why it does not return an error. This part in particular: x(R_alltemp[,1:3]) should be x[1:3].
This should be your code:
R_alltemp$p.value2 <- apply(R_alltemp, 1, function(x) t.test(x[1:3], x[4:6])$p.value)
R1.HCC827 R2.HCC827 R3.HCC827 R1.nci.h1975 R2.nci.h1975 R3.nci.h1975 p.value p.value2
1 13.587632 22.225083 15.074230 58.187465 79 82.287573 0.4391160 0.010595829
2 2.717526 1.778007 1.773439 1.763257 2 1.679338 0.4186339 0.477533387
3 203.814478 191.135711 232.320487 253.908939 263 263.656100 0.4904493 0.044883436
4 44.386264 45.339169 54.089884 3.526513 3 5.877684 0.3095634 0.002853154
Remember that by specifying MARGIN = 1 you are telling apply to go over the rows, so function(x) receives each row as a vector, i.e. the equivalent of x <- c(13.587632, 22.225083, 15.074230, 58.187465, 79, 82.287573). You then subset the first three values with x[1:3] and the last three with x[4:6] and apply t.test to them.
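A tiny demonstration of the MARGIN argument (my own addition, just to make the row/column distinction concrete):
m <- matrix(1:6, nrow = 2, byrow = TRUE)  # rows: 1 2 3 and 4 5 6
apply(m, 1, sum)  # MARGIN = 1: one result per row    -> 6 15
apply(m, 2, sum)  # MARGIN = 2: one result per column -> 5 7 9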
A good idea before using apply is to test the function manually, so that when you get odd results like these you know something went wrong with your code.
So the two-tailed p-value for the first row should be:
> g1 <- c(13.587632, 22.225083, 15.074230)
> g2 <- c(58.187465, 79, 82.287573)
> t.test(g1,g2)$p.value
[1] 0.01059583
Applying the function across all rows (I tacked the new p-value on at the end as pval):
> tt$pval <- apply(tt,1,function(x) t.test(x[1:3],x[4:6])$p.value)
> tt
R1.HCC827 R2.HCC827 R3.HCC827 R1.nci.h1975 R2.nci.h1975 R3.nci.h1975 p.value pval
1 13.587632 22.225083 15.074230 58.187465 79 82.287573 0.4391160 0.010595829
2 2.717526 1.778007 1.773439 1.763257 2 1.679338 0.4186339 0.477533387
3 203.814478 191.135711 232.320487 253.908939 263 263.656100 0.4904493 0.044883436
4 44.386264 45.339169 54.089884 3.526513 3 5.877684 0.3095634 0.002853154
Maybe it's the double use of the data frame name inside the function (which you don't need)?
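If the table is large, calling t.test once per row gets slow. As a sketch (my addition, not from the answers above), a Welch t-test can be vectorized over rows in base R, assuming the group columns are 1:3 and 4:6 as before:
# vectorized Welch two-sample t-test over rows
g1 <- as.matrix(R_alltemp[, 1:3]); g2 <- as.matrix(R_alltemp[, 4:6])
m1 <- rowMeans(g1); m2 <- rowMeans(g2)
n1 <- ncol(g1); n2 <- ncol(g2)
v1 <- rowSums((g1 - m1)^2) / (n1 - 1)  # row-wise sample variances
v2 <- rowSums((g2 - m2)^2) / (n2 - 1)
se2 <- v1 / n1 + v2 / n2
tstat <- (m1 - m2) / sqrt(se2)
df <- se2^2 / ((v1 / n1)^2 / (n1 - 1) + (v2 / n2)^2 / (n2 - 1))  # Welch-Satterthwaite
R_alltemp$p.value3 <- 2 * pt(-abs(tstat), df)  # two-sided p, as t.test reports by default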

Peak detection in Manhattan plot

The attached plot (a Manhattan plot) has chromosome positions from the genome on the x-axis and -log(p) on the y-axis, where p is the p-value associated with the variant at that position.
I have used the following R code (from the gap package) to generate it:
require(gap)
affy <-c(40220, 41400, 33801, 32334, 32056, 31470, 25835, 27457, 22864, 28501, 26273,
24954, 19188, 15721, 14356, 15309, 11281, 14881, 6399, 12400, 7125, 6207)
CM <- cumsum(affy)
n.markers <- sum(affy)
n.chr <- length(affy)
test <- data.frame(chr=rep(1:n.chr,affy),pos=1:n.markers,p=runif(n.markers))
oldpar <- par()
par(cex=0.6)
colors <- c("red","blue","green","cyan","yellow","gray","magenta","red","blue","green", "cyan","yellow","gray","magenta","red","blue","green","cyan","yellow","gray","magenta","red")
mhtplot(test,control=mht.control(colors=colors),pch=19,bg=colors)
> head(test)
chr pos p
1 1 1 0.79296584
2 1 2 0.96675136
3 1 3 0.43870076
4 1 4 0.79825513
5 1 5 0.87554143
6 1 6 0.01207523
I am interested in getting the coordinates of the peaks of the plot above a certain -log(p) threshold.
If you want the indices of the values above the 99th percentile:
# Add new column with log values
test = transform(test, log_p = -log10(test[["p"]]))
# Get the 99th percentile
pct99 = quantile(test[["log_p"]], 0.99)
...and get the values from the original data test:
peaks = test[test[["log_p"]] > pct99,]
> head(peaks)
chr pos p log_p
5 1 5 0.002798126 2.553133
135 1 135 0.003077302 2.511830
211 1 211 0.003174833 2.498279
586 1 586 0.005766859 2.239061
598 1 598 0.008864987 2.052322
790 1 790 0.001284629 2.891222
You can use this with any threshold. Note that I have not calculated the first derivative; see this question for some pointers:
How to calculate first derivative of time series
After calculating the first derivative, you can find the peaks by looking at points in the series where the first derivative is (almost) zero, and then check which of those peaks lie above the threshold.
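As a rough sketch of that idea (my addition, reusing the log_p column and pct99 threshold from above): a local maximum is a point where the first difference changes sign from positive to negative.
# first differences of the -log10(p) series
d1 <- diff(test[["log_p"]])
# a peak at position i: the series rises into i and falls after it
is_peak <- c(FALSE, d1[-length(d1)] > 0 & d1[-1] < 0, FALSE)
peaks <- test[is_peak & test[["log_p"]] > pct99, ]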
Based on my experience, after plotting the graph you can use the following R code to find the peak coordinates:
plot(x[,1], x[,2])
identify(x[,1], x[,2], labels=row.names(x))
Note that here x[,1] refers to the x coordinate (the genome coordinate) and x[,2] to your -log10 p-value.
Now use your mouse to select a point and hit Enter, which will give you the peak location; then run the following code to get the coordinates:
coords <- locator(type="l")
coords

If statement for weighted averaged in R

I have a data file that is several million lines long, and contains information from many groups. Below is an abbreviated section:
MARKER GROUP1_A1 GROUP1_A2 GROUP1_FREQ GROUP1_N GROUP2_A1 GROUP2_A2 GROUP2_FREQ GROUP2_N
rs10 A C 0.055 1232 A C 0.055 3221
rs1000 A G 0.208 1232 A G 0.208 3221
rs10000 G C 0.134 1232 C G 0.8624 3221
rs10001 C A 0.229 1232 A C 0.775 3221
I would like to create a weighted average of the frequency (FREQ) variable, which in itself is straightforward; however, some of the rows are mismatched (rows 3 and 4). If the letters do not line up, the frequency of the second group needs to be subtracted from 1 before the weighted mean for that marker is calculated.
I would like to set up a simple IF statement, but I am unsure of the syntax for such a task.
Any insight or direction is appreciated!
Say you've read your data into a data frame called mydata. Then do the following:
mydata$GROUP2_FREQ <- mydata$GROUP2_FREQ - (mydata$GROUP1_A1 != mydata$GROUP2_A1)
It works because R treats TRUE values as 1 and FALSE values as 0.
EDIT: The line above gives FREQ - 1 (a negative value) for mismatched rows rather than the intended 1 - FREQ; try the following instead:
mydata$GROUP2_FREQ <- abs( (as.character(mydata$GROUP1_A1) !=
as.character(mydata$GROUP2_A1)) -
as.numeric(mydata$GROUP2_FREQ) )
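Once the frequencies are aligned, the weighted mean itself is straightforward. A sketch (my addition), assuming the GROUP1_N and GROUP2_N sample sizes are the weights:
mydata$FREQ_WEIGHTED <- (mydata$GROUP1_FREQ * mydata$GROUP1_N +
                         mydata$GROUP2_FREQ * mydata$GROUP2_N) /
                        (mydata$GROUP1_N + mydata$GROUP2_N)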

How to plot two cumulative frequency graphs together

I have data that looks like this:
#val Freq1 Freq2
0.000 178 202
0.001 4611 5300
0.002 99 112
0.003 26 30
0.004 17 20
0.005 15 20
0.006 11 14
0.007 11 13
0.008 13 13
...many more lines..
Full data can be found here:
http://dpaste.com/173536/plain/
What I intend to do is to have a cumulative graph with "val" on the x-axis and "Freq1" and "Freq2" on the y-axis, plotted together in one graph.
I have this code, but it creates two plots instead of one.
dat <- read.table("stat.txt",header=F);
val<-dat$V1
freq1<-dat$V2
freq2<-dat$V3
valf1<-rep(val,freq1)
valf2<-rep(val,freq2)
valfreq1table<- table(valf1)
valfreq2table<- table(valf2)
cumfreq1=c(0,cumsum(valfreq1table))
cumfreq2=c(0,cumsum(valfreq2table))
plot(cumfreq1, ylab="CumFreq",xlab="Loglik Ratio")
lines(cumfreq1)
plot(cumfreq2, ylab="CumFreq",xlab="Loglik Ratio")
lines(cumfreq2)
What's the right way to approach this?
data <- read.table("http://dpaste.com/173536/plain/", header = FALSE)
sample1 <- unlist(apply(as.matrix(data),1,function(x) rep(x[1],x[2])))
sample2 <- unlist(apply(as.matrix(data),1,function(x) rep(x[1],x[3])))
plot(ecdf(sample1), verticals=TRUE, do.p=FALSE,
main="ECDF plot for both samples", xlab="Scores",
ylab="Cumulative Percent",lty="dashed")
lines(ecdf(sample2), verticals=TRUE, do.p=FALSE,
col.h="red", col.v="red",lty="dotted")
legend(100,.8,c("Sample 1","Sample 2"),
col=c("black","red"),lty=c("dashed","dotted"))
Try the ecdf() function in base R (which uses plot.stepfun() if memory serves) or the Ecdf() function in Hmisc by Frank Harrell. Here is an example from help(Ecdf) that uses a grouping variable to show two ecdfs in one plot:
# Example showing how to draw multiple ECDFs from paired data
pre.test <- rnorm(100,50,10)
post.test <- rnorm(100,55,10)
x <- c(pre.test, post.test)
g <- c(rep('Pre',length(pre.test)),rep('Post',length(post.test)))
Ecdf(x, group=g, xlab='Test Results', label.curves=list(keys=1:2))
Just for the record, here is how you get multiple lines in the same plot "by hand":
plot(cumfreq1, ylab="CumFreq",xlab="Loglik Ratio", type="l")
# or type="b" for lines and points
lines(cumfreq2, col="red")
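Note that this plots the cumulative counts against the index of the frequency table rather than the val cutpoints. A variant with the actual values on the x-axis (my addition, reusing the objects built in the question):
xs1 <- as.numeric(names(valfreq1table))  # the distinct val cutpoints
xs2 <- as.numeric(names(valfreq2table))
plot(xs1, cumsum(valfreq1table), type = "l",
     ylab = "CumFreq", xlab = "Loglik Ratio")
lines(xs2, cumsum(valfreq2table), col = "red")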
