Related question
There is another question asking how to build a second-order transition matrix; however, the answer there does not seem to produce a second-order transition matrix.
Second-order transition matrix & scoring a sequence
Let's use this dataset:
set.seed(1)
dat<-data.frame(replicate(20,sample(c("A", "B", "C","D"), size = 100, replace=TRUE)))
What would be the best way to build a second-order transition matrix such that I can easily score a new sequence I encounter, as discussed here? For example, I want to be able to calculate the probability of observing AAABCAD.
Reaction to Julius Vainora's answer
set.seed(1)
mat <-data.frame(replicate(100,sample(c("AAA", "BBB", "CCC","DDD", "ABC", 'ABD'), size = 5, replace=TRUE)))
aux <- apply(mat, 2, function(col) rbind(paste0(head(col, -2), head(col[-1], -1)), col[-1:-2]))
aux <- data.frame(t(matrix(aux, nrow = 2)))
names(aux) <- c("From", "To")
head(aux, 3)
TM <- table(aux)
TM <- TM / rowSums(TM)
x <- as.character(unlist(mat[1,]))
transitions <- cbind(paste0(head(x, -2), head(x[-1], -1)), x[-1:-2])
prAA <- 1 / (4 * 4)
prAA * prod(TM[transitions])
When I ran this code it gave me a probability of 0; however, the sequence for which I calculated the probability (namely the first row of the data frame, here mat) was also used to build the transition matrix. I suppose this should not happen: since the sequence was used to build the transition matrix, none of its transitions can have probability zero, right?
Moreover, when I change the mat creation to this line:
mat <-data.frame(replicate(10,sample(c("AAA", "BBB", "CCC","DDD", "ABC", 'ABD'), size = 5, replace=TRUE)))
It gives the error: Error in `[.default`(TM, transitions) : subscript out of bounds
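A sketch of how one might diagnose both issues (my addition, not part of the original answer; it assumes the six three-letter states above and that aux is rebuilt from the smaller mat):
# Issue 1: x is the first *row* of mat, while TM was estimated from its *columns*,
# so some transitions of x may never occur in the training data; a single zero
# entry in TM makes the whole product zero.
any(TM[transitions] == 0)
# Issue 2: with only 10 columns, some "From" pairs never occur at all, so TM has
# no row for them and TM[transitions] fails. Fixing the factor levels keeps the
# empty rows (their probabilities become NaN; smoothing would avoid that).
states <- c("AAA", "BBB", "CCC", "DDD", "ABC", "ABD")
aux$From <- factor(aux$From, levels = as.vector(outer(states, states, paste0)))
aux$To <- factor(aux$To, levels = states)
TM <- table(aux)
TM <- TM / rowSums(TM)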
Let's start with the data coming in a matrix format:
set.seed(1)
dat <- replicate(20, sample(c("A", "B", "C", "D"), size = 100, replace = TRUE))
To estimate the second-order transition matrix, we first extract the observed transitions:
aux <- apply(dat, 2, function(col) rbind(paste0(head(col, -2), head(col[-1], -1)), col[-1:-2]))
aux <- data.frame(t(matrix(aux, nrow = 2)))
names(aux) <- c("From", "To")
head(aux, 3)
# From To
# 1 DD D
# 2 DD B
# 3 DB A
The transition matrix can then be estimated with
TM <- table(aux)
(TM <- TM / rowSums(TM)) # As expected, everything around 0.25
# To
# From A B C D
# AA 0.2459016 0.2950820 0.2049180 0.2540984
# AB 0.2222222 0.3037037 0.1925926 0.2814815
# AC 0.3162393 0.1794872 0.1709402 0.3333333
# AD 0.3211679 0.2189781 0.1824818 0.2773723
# BA 0.2066116 0.2066116 0.2727273 0.3140496
# BB 0.2517483 0.2587413 0.2167832 0.2727273
# BC 0.2647059 0.2745098 0.2254902 0.2352941
# BD 0.3007519 0.2180451 0.2105263 0.2706767
# CA 0.2500000 0.2931034 0.2068966 0.2500000
# CB 0.2178218 0.3168317 0.2178218 0.2475248
# CC 0.2584270 0.2247191 0.2359551 0.2808989
# CD 0.3083333 0.2583333 0.2500000 0.1833333
# DA 0.2402597 0.2727273 0.2272727 0.2597403
# DB 0.2689076 0.2605042 0.2016807 0.2689076
# DC 0.2416667 0.2750000 0.2166667 0.2666667
# DD 0.2442748 0.2213740 0.2671756 0.2671756
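A quick sanity check (my addition): each row of TM is the conditional distribution of the next state given the previous pair, so the matrix has one row per ordered pair and every row sums to 1.
dim(TM)      # 16 4
rowSums(TM)  # all equal to 1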
In your example we have the sequence and transitions given by
x <- c("A", "A", "A", "B", "C", "A", "D")
(transitions <- cbind(paste0(head(x, -2), head(x[-1], -1)), x[-1:-2]))
# [,1] [,2]
# [1,] "AA" "A"
# [2,] "AA" "B"
# [3,] "AB" "C"
# [4,] "BC" "A"
# [5,] "CA" "D"
Then, analogously to my other answer,
prAA <- 1 / (4 * 4)
prAA * prod(TM[transitions])
# [1] 6.223154e-05
is the probability of observing x, where prAA is the (user-specified) probability of observing the first two elements of the sequence, AA.
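Instead of a flat prAA = 1/16, one could also estimate the starting-pair distribution from the training data; a rough sketch (my addition, and crude with only 20 sequences):
first.pairs <- apply(dat, 2, function(col) paste0(col[1], col[2]))
pr.start <- table(factor(first.pairs, levels = rownames(TM))) / length(first.pairs)
pr.start["AA"] * prod(TM[transitions])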
Generalization: n-th order Markov chain.
n <- 3
aux <- apply(dat, 2, function(col) {
from <- head(apply(embed(col, n)[, n:1], 1, paste, collapse = ""), -1)
to <- col[-1:-n]
rbind(from, to)
})
aux <- data.frame(t(matrix(aux, nrow = 2)))
names(aux) <- c("From", "To")
TM <- table(aux)
TM <- TM / rowSums(TM)
head(TM)
# To
# From A B C D
# AAA 0.3541667 0.2083333 0.2083333 0.2291667
# AAB 0.3103448 0.3103448 0.1724138 0.2068966
# AAC 0.2142857 0.2857143 0.2857143 0.2142857
# AAD 0.1463415 0.3902439 0.2439024 0.2195122
# ABA 0.1200000 0.4800000 0.2000000 0.2000000
# ABB 0.2424242 0.2727273 0.1515152 0.3333333
x <- c("A", "A", "A", "B", "C", "A", "D")
(transitions <- cbind(head(apply(embed(x, n)[, n:1], 1, paste, collapse = ""), -1), x[-1:-n]))
# [,1] [,2]
# [1,] "AAA" "B"
# [2,] "AAB" "C"
# [3,] "ABC" "A"
# [4,] "BCA" "D"
prAAA <- 1 / 4^n
prAAA * prod(TM[transitions])
# [1] 3.048129e-05
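The scoring step can be wrapped in a small helper; a sketch of my own (pr.start is the user-supplied probability of the first n symbols, defaulting to the uniform 1/4^n used above):
score_sequence <- function(x, TM, n, pr.start = 1 / 4^n) {
  # build "From" strings of length n and the corresponding "To" symbols
  from <- head(apply(embed(x, n)[, n:1, drop = FALSE], 1, paste, collapse = ""), -1)
  transitions <- cbind(from, x[-1:-n])
  pr.start * prod(TM[transitions])
}
score_sequence(c("A", "A", "A", "B", "C", "A", "D"), TM, n = 3)
# [1] 3.048129e-05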
In my R function below, how can I get the number of unique elements (which is 2) of each of the two vectors a and b?
Here is what I tried without success:
foo <- function(...){
L <- list(...)
lengths(unique(unlist(L)))
}
a = rep(c("a", "b"), 30) # Vector `a`
b = rep(c("a", "b"), 20) # Vector `b`
foo(a, b) # the function returns 1 1 instead of 2 2
Use lapply() or sapply(), because your object is a list. Also have a look at the difference between length() and lengths(); both exist but do different things. I provide two solutions, foo1 and foo2:
foo1 <- function(...){
L <- list(...)
sapply(L, function(x) length(unique(x)))
}
foo2 <- function(...){
L <- list(...)
lengths(lapply(L, unique))
}
a = rep(c("a", "b"), 30) # Vector `a`
b = rep(c("a", "b"), 20) # Vector `b`
foo1(a, b)
# [1] 2 2
foo2(a, b)
# [1] 2 2
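To make the length()/lengths() distinction concrete (my illustration):
L <- list(a, b)
length(L)                   # 2: the list has two elements
lengths(L)                  # 60 40: the length of each element
lengths(lapply(L, unique))  # 2 2: the number of unique values in each element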
Here is the answer.
You were using unlist(), which collapses both vectors into a single one, so unique() returned just the two distinct values and lengths() then reported the length of each of those one-character elements (hence 1 1).
Use this code instead:
foo <- function(a, b){
  L <- list(a, b)
  lengths(lapply(L, unique))  # number of unique values in each vector: returns 2 2
}
a = rep(c("a", "b"), 30) # Vector `a`
b = rep(c("a", "b"), 20) # Vector `b`
foo(a, b)
I have a list of 701 csv files. Each one has the same number of columns (7) but a different number of rows (between 25,000 and 28,000).
Here is an extract of the first file:
Date,Week,Week Day,Hour,Price,Volume,Sale/Purchase
18/03/2011,11,5,1,-3000.00,17416,Sell
18/03/2011,11,5,1,-1001.10,17427,Sell
18/03/2011,11,5,1,-1000.00,18055,Sell
18/03/2011,11,5,1,-500.10,18057,Sell
18/03/2011,11,5,1,-500.00,18064,Sell
18/03/2011,11,5,1,-400.10,18066,Sell
18/03/2011,11,5,1,-400.00,18066,Sell
18/03/2011,11,5,1,-300.10,18068,Sell
18/03/2011,11,5,1,-300.00,18118,Sell
I fitted a nonlinear regression to the supply curve of the ninth hour for the year 2012. The data for 2012 are in the 290th to 654th csv files.
library(minpack.lm)  # nlsLM() comes from this package

allenamen <- dir(pattern = "*.csv")
alledat <- lapply(allenamen, read.csv, header = TRUE, sep = ",", stringsAsFactors = FALSE)
h <- list()
for (i in 290:654) {
  g <- function(a, b, c, d, p) {a * atan(b * p + c) + d}
  f <- nlsLM(Volume ~ g(a, b, c, d, Price),
             data = subset(alledat[[i - 289]],
                           (Hour == 9) & (Sale.Purchase == "Sell") &
                             (!Price %in% as.character(-50:150))),
             start = list(a = 4000, b = 0.1, c = -5, d = 32000))
  h[[i - 289]] <- coef(f)
}
This works and I get the coefficients a, b, c and d for every day in 2012.
This is the head(h):
[[1]]
a b c d
2.513378e+03 4.668218e-02 -3.181322e+00 2.637142e+04
[[2]]
a b c d
2.803172e+03 6.696201e-02 -4.576432e+00 2.574454e+04
[[3]]
a b c d
3.298991e+03 5.817949e-02 -3.425728e+00 2.393888e+04
[[4]]
a b c d
2.150487e+03 3.810406e-02 -2.658772e+00 2.675609e+04
[[5]]
a b c d
2.326199e+03 3.044967e-02 -1.780965e+00 2.604374e+04
[[6]]
a b c d
2934.0193270 0.0302937 -1.9912913 26283.0300823
And this is dput(head(h)):
list(structure(c(2513.37818972349, 0.0466821822063123, -3.18132213466142,
26371.4241646124), .Names = c("a", "b", "c", "d")), structure(c(2803.17230054557,
0.0669620116294894, -4.57643230249848, 25744.5376725213), .Names = c("a",
"b", "c", "d")), structure(c(3298.99066895304, 0.0581794881246528,
-3.42572804902504, 23938.8754575156), .Names = c("a", "b", "c",
"d")), structure(c(2150.48734655237, 0.0381040636898022, -2.65877160023262,
26756.0907073567), .Names = c("a", "b", "c", "d")), structure(c(2326.19873555633,
0.0304496684589379, -1.7809654498454, 26043.735374657), .Names = c("a",
"b", "c", "d")), structure(c(2934.01932702805, 0.0302937043170001,
-1.99129130343521, 26283.0300823458), .Names = c("a", "b", "c",
"d")))
Now I am trying to get just a column with h$a but I get NULL. How can I get just the a column?
In addition, I want to plot the individual coefficients against Date. I tried this code:
koeffreihe <- function(x) {
files <- list.files(pattern="*.csv")
df <- data.frame()
for(i in 1:length(files)){
xx <- read.csv(as.character(files[i]))
xx <- subset(xx, Sale.Purchase == "Sell" & Hour == 3)
df <- rbind(df, xx)
g <- function(a, b, c, d, p) {a*atan(b*p+c)+d}
f <- nlsLM(Volume ~ g(a,b,c,d,Price), data=subset(alledat[[i]], (Hour==9) & (Sale.Purchase == "Sell") & (!Price %in% as.character(-50:150))), start = list(a=4000, b=0.1, c=-5, d=32000))
h[[i]] <- coef(f)
}
df$Date <- as.Date(as.character(df$Date), format="%d/%m/%Y")
plot(h$x ~ Date, df, xlim = as.Date(c("2012-01-01", "2012-12-31")))
}
koeffreihe(a)
But I get this error:
invalid type (NULL) for variable 'h$x'
So the problem is that h$a is NULL. If someone can fix this problem I guess the code will work too.
Thank you for your help!
First transform your list into a data.frame:
h.df <- setNames(do.call(rbind.data.frame, h), names(h[[1]]))
# a b c d
#1 2513.378 0.04668218 -3.181322 26371.42
#2 2803.172 0.06696201 -4.576432 25744.54
#3 3298.991 0.05817949 -3.425728 23938.88
#4 2150.487 0.03810406 -2.658772 26756.09
#5 2326.199 0.03044967 -1.780965 26043.74
#6 2934.019 0.03029370 -1.991291 26283.03
Then you can extract variables easily:
h.df$a
#[1] 2513.378 2803.172 3298.991 2150.487 2326.199 2934.019
Alternatively you can iterate over the list to extract the variable:
sapply(h, "[", "a")
# a a a a a a
#2513.378 2803.172 3298.991 2150.487 2326.199 2934.019
In this line, even though x is a variable, h$x looks for a column literally named x in h:
plot(h$x ~ Date, df, xlim = as.Date(c("2012-01-01", "2012-12-31")))
You probably want h[[x]] instead.
From ?'[[':
x$name is equivalent to x[["name", exact = FALSE]].
That is, you are looking for a column literally named x.
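A small illustration of the difference, using the h.df built above and a variable holding a column name:
col.name <- "a"
h.df$col.name    # NULL: $ looks for a column literally called "col.name"
h.df[[col.name]] # the a coefficients: [[ evaluates col.name and uses its value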
I want to calculate the distance between two linked sets of spatial coordinates (program and admin in my fake dataset). The data are in a wide format, so both pairs of coordinates are in the same row.
library(sp)
set.seed(1)
n <- 100
program.id <- seq(1, n)
c1 <- cbind(runif(n, -90, 90), runif(n, -180, 180))
c2 <- cbind(runif(n, -90, 90), runif(n, -180, 180))
dat <- data.frame(cbind(program.id, c1, c2))
names(dat) <- c("program.id", "program.lat", "program.long", "admin.lat", "admin.long")
head(dat)
# program.id program.lat program.long admin.lat admin.long
# 1 1 -42.20844 55.70061 -41.848523 62.536404
# 2 2 -23.01770 -52.84898 -50.643849 -145.851172
# 3 3 13.11361 -82.70635 3.023431 -2.665397
# 4 4 73.47740 177.36626 -41.588893 -13.841337
# 5 5 -53.69725 48.05758 -57.389701 -44.922049
# 6 6 71.71014 -103.24507 3.343705 176.795719
I know how to create a matrix of distances among program or admin using the sp package:
ll <- c("program.lat", "program.long")
coords <- dat[ll]
dist <- apply(coords, 1,
function(eachPoint) spDistsN1(as.matrix(coords),
eachPoint, longlat=TRUE))
But what I want to do is create an n × 1 vector of distances (dist.km) between each pair of coordinates and add it to dat.
# program.id program.lat program.long admin.lat admin.long dist.km
# 1 1 -42.20844 55.70061 -41.848523 62.536404 567.35
# 2 2 -23.01770 -52.84898 -50.643849 -145.851172 8267.86
# ...
Any suggestions? I've spent a while going through old SO questions, but nothing seems quite right. Happy to be proven wrong.
Update
@Amit's solution works for my toy dataset:
apply(dat,1,function(x) spDistsN1(matrix(x[2:3],nrow=1),x[3:4],longlat=TRUE))
But I think I need to swap the order of the lat/long columns so that long comes before lat. From ?spDistsN1:
pts: A matrix of 2D points, first column x/longitude, second column y/latitude, or a SpatialPoints or SpatialPointsDataFrame object
Also, unless I've misunderstood the logic, I think Amit's solution should grab cols [2:3] and [4:5], not [2:3] and [3:4].
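With both fixes (long before lat, and columns 4:5 for the admin coordinates), the call on the toy data would presumably become:
dat$dist.km <- apply(dat, 1, function(x)
  spDistsN1(matrix(x[c(3, 2)], nrow = 1), x[c(5, 4)], longlat = TRUE))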
My challenge now is applying this to my actual data. I've reproduced a portion below.
library(sp)
dat <- structure(list(ID = 1:4,
subcounty = c("a", "b", "c", "d"),
pro.long = c(33.47627919, 31.73605491, 31.54073482, 31.51748984),
pro.lat = c(2.73996953, 3.26530095, 3.21327597, 3.17784981),
sub.long = c(33.47552, 31.78307, 31.53083, 31.53083),
sub.lat = c(2.740362, 3.391209, 3.208736, 3.208736)),
.Names = c("ID", "subcounty", "pro.long", "pro.lat", "sub.long", "sub.lat"),
row.names = c(NA, 4L), class = "data.frame")
head(dat)
# ID subcounty pro.long pro.lat sub.long sub.lat
# 1 1 a 33.47628 2.739970 33.47552 2.740362
# 2 2 b 31.73605 3.265301 31.78307 3.391209
# 3 3 c 31.54073 3.213276 31.53083 3.208736
# 4 4 d 31.51749 3.177850 31.53083 3.208736
apply(dat, 1, function(x) spDistsN1(matrix(x[3:4], nrow=1),
x[5:6],
longlat=TRUE))
I get the error: Error in spDistsN1(matrix(x[3:4], nrow = 1), x[5:6], longlat = TRUE) : pts must be numeric
I'm confused because these columns are numeric:
> is.numeric(dat$pro.long)
[1] TRUE
> is.numeric(dat$pro.lat)
[1] TRUE
> is.numeric(dat$sub.long)
[1] TRUE
> is.numeric(dat$sub.lat)
[1] TRUE
The problem you're having is that apply(...) coerces the first argument to a matrix. By definition, a matrix must have all elements of the same data type. Since one of the columns in dat (dat$subcounty) is character, apply(...) coerces everything to character. In your test dataset everything was numeric, so you didn't have this problem.
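A quick way to see the coercion with the data above (my illustration):
m <- as.matrix(dat)  # effectively what apply() builds internally
typeof(m)            # "character": the subcounty column forces everything to character
m[1, "pro.long"]     # a character string, hence the "pts must be numeric" error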
This should work:
dat$dist.km <- sapply(1:nrow(dat),function(i)
spDistsN1(as.matrix(dat[i,3:4]),as.matrix(dat[i,5:6]),longlat=T))
There is a much faster solution using data.table and geosphere.
library(data.table)
library(geosphere)
setDT(dat)[ , dist_km := distGeo(matrix(c(pro.long, pro.lat), ncol = 2),
matrix(c(sub.long, sub.lat), ncol = 2))/1000]
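The new dist_km column is added to dat by reference; a quick look (distGeo() returns metres, hence the division by 1000):
head(dat[, .(ID, dist_km)])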
Benchmark:
library(sp)
jlhoward <- function(dat) { dat$dist.km <- sapply(1:nrow(dat),function(i)
spDistsN1(as.matrix(dat[i,3:4]),as.matrix(dat[i,5:6]),longlat=T)) }
rafa.pereira <- function(dat2) { setDT(dat2)[ , dist_km := distGeo(matrix(c(pro.long, pro.lat), ncol = 2),
matrix(c(sub.long, sub.lat), ncol = 2))/1000] }
> system.time( jlhoward(dat) )
user system elapsed
8.94 0.00 8.94
> system.time( rafa.pereira(dat) )
user system elapsed
0.07 0.00 0.08
Data
dat <- structure(list(ID = 1:4,
subcounty = c("a", "b", "c", "d"),
pro.long = c(33.47627919, 31.73605491, 31.54073482, 31.51748984),
pro.lat = c(2.73996953, 3.26530095, 3.21327597, 3.17784981),
sub.long = c(33.47552, 31.78307, 31.53083, 31.53083),
sub.lat = c(2.740362, 3.391209, 3.208736, 3.208736)),
.Names = c("ID", "subcounty", "pro.long", "pro.lat", "sub.long", "sub.lat"),
row.names = c(NA, 4L), class = "data.frame")
# enlarge dataset to 40,000 pairs
dat <- dat[rep(seq_len(nrow(dat)), 10000), ]
I asked this question a while ago (Recode dataframe based on one column) and the answer worked perfectly. Now, however, I almost want to do the reverse. Namely, I have a 700k × 2000 matrix of 0/1/2 or NA. In a separate data frame I have two columns (Ref and Obs). A 0 corresponds to two instances of Ref, a 1 is one instance of Ref and one instance of Obs, and a 2 is two instances of Obs. To clarify, a data snippet:
Genotype File ---
Ref Obs
A G
T C
G C
Ref <- c("A", "T", "G")
Obs <- c("G", "C", "C")
Current Data---
Sample.1 Sample.2 .... Sample.2000
0 1 2
0 0 0
0 NA 1
mat <- matrix(nrow=3, ncol=3)
mat[,1] <- c(0,0,0)
mat[,2] <- c(1,0,NA)
mat[,3] <- c(2,0,1)
Desired Data format---
Sample.1 Sample.1 Sample.2 Sample.2 Sample.2000 Sample.2000
A A A G G G
T T T T T T
G G 0 0 G C
I think that's right. The desired format has two space-separated columns for each sample; a 0 in this format (a plink ped file, for the bioinformaticians out there) means missing data.
MAJOR ASSUMPTION: your data are in 3-element frames, i.e. you want to apply your mapping to the first 3 rows, then the next 3, and so on, which I think makes sense given DNA frames. If you want a rolling 3-element window this will not work (though the code can be modified to make it work). This will work for an arbitrary number of columns and an arbitrary number of 3-row groups:
# Make up a matrix with your properties (4 cols, 6 rows)
col <- 4L
frame <- 3L
mat <- matrix(sample(c(0:2, NA_integer_), 2 * frame * col, replace=T), ncol=col)
# Mapping data
Ref <- c("A", "T", "G")
Obs <- c("G", "C", "C")
map.base <- cbind(Ref, Obs)
num.to.let <- matrix(c(1, 1, 1, 2, 2, 2), byrow=T, ncol=2) # how many from each of ref obs
# Function to map 0,1,2,NA to Ref/Obs
re_map <- function(mat.small) { # 3 row matrices, with col columns
t(
mapply( # iterate through each row in matrix
function(vals, map, num.to.let) {
vals.2 <- unlist(lapply(vals, function(x) map[num.to.let[x + 1L, ]]))
ifelse(is.na(vals.2), 0, vals.2)
},
vals=split(mat.small, row(mat.small)), # a row
map=split(map.base, row(map.base)), # the mapping for that row
MoreArgs=list(num.to.let=num.to.let) # general conversion of number to Obs/Ref
) )
}
# Split input data frame into 3 row matrices (assumes frame size 3),
# and apply mapping function to each group
mat.split <- split.data.frame(mat, sort(rep(1:(nrow(mat) / frame), frame)))
mat.res <- do.call(rbind, lapply(mat.split, re_map))
colnames(mat.res) <- paste0("Sample.", rep(1:ncol(mat), each=2))
print(mat.res, quote=FALSE)
# Sample.1 Sample.1 Sample.2 Sample.2 Sample.3 Sample.3 Sample.4 Sample.4
# 1 G G A G G G G G
# 2 C C 0 0 T C T C
# 3 0 0 G C G G G G
# 1 A A A A A G A A
# 2 C C C C T C C C
# 3 C C G G 0 0 0 0
I am not sure but this could be what you need:
first some simple data:
geno <- data.frame(Ref = c("A", "T", "G"), Obs = c("G", "C", "C"))
data <- data.frame(s1 = c(0,0,0),s2 = c(1, 0, NA))
then a couple of functions:
f <- function(i , x, geno){
x <- x[i]
if(!is.na(x)){
if (x == 0) {y <- geno[i , c(1,1)]}
if (x == 1) {y <- geno[i, c(1,2)]}
if (x == 2) {y <- geno[i, c(2,2)]}
}
else y <- c(0,0)
names(y) <- c("s1", "s2")
y
}
g <- function(x, geno){
Reduce(rbind, lapply(1:length(x), FUN = f , x = x, geno = geno))
}
The way f() is defined may not be the most elegant, but it does the job.
Then simply run it as a double for loop, in a lapply fashion:
as.data.frame(Reduce(cbind, lapply(data , g , geno = geno )))
hope it helps
Here's one way based on the sample data in your answer:
# create index
idx <- lapply(data, function(x) cbind((x > 1) + 1, (x > 0) + 1))
# list of matrices
lst <- lapply(idx, function(x) {
tmp <- apply(x, 2, function(y) geno[cbind(seq_along(y), y)])
replace(tmp, is.na(tmp), 0)
})
# one data frame
as.data.frame(lst)
# s1.1 s1.2 s2.1 s2.2
# 1 A A A G
# 2 T T T T
# 3 G G 0 0