I took a stratified random sample out of a raster layer using R's raster package and the sampleStratified function:
library(raster)
r <- raster(nrows=5, ncols=5)
r[] <- c(1,0,0,0,1,1,0,1,0,0,1,0,1,0,1,1,1,0,1,0,0,0,1,1,1)
# Stratified random sample, 5 cells per stratum
sampleStratified(r, size=5)
cell layer
[1,] 3 0
[2,] 22 0
[3,] 7 0
[4,] 21 0
[5,] 12 0
[6,] 13 1
[7,] 17 1
[8,] 11 1
[9,] 8 1
[10,] 23 1
What I would like to do now is order the sample by the first column, expand the first column to cover every cell of the raster, and fill the missing values of the second column with NA, so that it looks like this:
[,1] [,2]
[1,] 1 NA
[2,] 2 NA
[3,] 3 0
[4,] 4 NA
[5,] 5 NA
[6,] 6 NA
[7,] 7 0
[8,] 8 1
[9,] 9 NA
[10,] 10 NA
[11,] 11 1
[12,] 12 0
[13,] 13 1
[14,] 14 NA
[15,] 15 NA
[16,] 16 NA
[17,] 17 1
[18,] 18 NA
[19,] 19 NA
[20,] 20 NA
[21,] 21 0
[22,] 22 0
[23,] 23 1
[24,] 24 NA
[25,] 25 NA
I tried something with the approxTime function from the simecol package but failed at the NA filling. I have 10 raster layers with around 500,000 values each, so a fast approach would really be appreciated.
I'd think about it the opposite way. Instead of interpolation, which could be expensive, you already know that the cells to change are exactly those not in the random sample. So use the sample's cell numbers as an index vector and apply the [<- replacement method to the cell indices that do not appear in your stratified sample. We use raster's methods for the base functions [<- and %in%, plus seq_len. Forgive the slightly long-winded example; it's better to show the steps. It should be quite fast, and I don't envisage any problems with rasters of 500,000 cells.
# For reproducible example
set.seed(1)
# Get stratified random sample
ids <- sampleStratified(r, size=5)
# Copy of original raster (to visualise difference)
r2 <- r
# Get set of cell indices
cell_no <- seq_len(ncell(r2))
# Those indices to replace are those not in the random sample
repl <- cell_no[ ! cell_no %in% ids[,1] ]
# Replace cells not in sample with NA
r2[ repl ] <- NA
# Plot to show what is going on
par( mfrow = c(1,2))
plot(r)
plot(r2)
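The same indexing idea works on a plain vector, which may matter at ~500,000 cells per layer. A minimal sketch, taking the raster values and sampled cell numbers from the question's example output (samp_cells is column 1 of that output):

```r
# Values of the example raster and the sampled cell numbers
vals <- c(1,0,0,0,1,1,0,1,0,0,1,0,1,0,1,1,1,0,1,0,0,0,1,1,1)
samp_cells <- c(3, 7, 8, 11, 12, 13, 17, 21, 22, 23)

# Start from an all-NA vector and copy in only the sampled values
filled <- rep(NA_real_, length(vals))
filled[samp_cells] <- vals[samp_cells]

# Two-column result: cell number, value-or-NA
out <- cbind(seq_along(vals), filled)
```

This is a single vectorised assignment, so it scales linearly with the number of cells.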
I would use merge, as @Roland suggested.
mm <- data.frame(col1 = sample(1:100, 50), col2 = sample(0:1, 50, replace = TRUE))
mm <- as.matrix(mm[order(mm[, 1]), ])
mdl <- as.matrix(data.frame(col1 = 1:100, col2 = NA))
merge(mdl, mm, by = "col1", all.x = TRUE)
col1 col2.x col2.y
1 1 NA NA
2 2 NA 0
3 3 NA 0
4 4 NA NA
5 5 NA NA
6 6 NA NA
7 7 NA 0
8 8 NA 1
9 9 NA NA
10 10 NA 0
11 11 NA NA
12 12 NA 0
13 13 NA 1
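A sketch of a merge-free alternative: since the first column of the sample holds the cell numbers, you can allocate the full-length template and assign the sampled values directly by index. This avoids the join (and the leftover all-NA col2.x column) entirely:

```r
set.seed(1)
# Same shape of sample as above: 50 distinct cell numbers with 0/1 values
mm <- cbind(col1 = sample(1:100, 50), col2 = sample(0:1, 50, replace = TRUE))

# Template of the full length, second column all NA
res <- cbind(col1 = 1:100, col2 = NA_real_)
# Use the sampled cell numbers as row indices for direct assignment
res[mm[, "col1"], "col2"] <- mm[, "col2"]
```

Direct indexed assignment is O(n) and should comfortably handle 500,000 cells per layer.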
I have an xts object with monthly returns of stocks. I want to calculate a rolling cumulative return for the stocks. Some of the stocks have NAs in the data. I want the cumulative return to reset to 1, each time an NA is encountered. Here is some sample data:
rets<-read.table(text=
'Date,AFX SJ Equity,DSY SJ Equity
1996-12-31,0.000000000,0.0298516427
1997-01-31,-0.046874751,0.1173840351
1997-02-28,0.088537483,0.0080555362
1997-03-31,-0.003013021,0.2516612299
1997-04-30,-0.003022126,-0.0425537783
1997-05-30,-0.060610279,0.1222167814
1997-06-30,-0.030128416,0.0594070842
1997-07-31,-0.040264811,NA
1997-08-29,0.143354912,NA
1997-09-30,NA,NA
1997-10-31,0.023807612,0.0458311280
1997-11-28,0.011881887,0.1035818306
1997-12-31,0.023445977,-0.0729239783
1998-01-30,-0.064883184,-0.0007773145
1998-02-27,-0.020408576,0.0405326221
1998-03-31,0.124981915,0.1198516418
1998-04-30,0.081499173,-0.0167247568
1998-05-29,-0.143835151,0.1292490014
1998-06-30,-0.189289470,0.1198825615
1998-07-31,-0.130008077,NA
',sep=',',header=TRUE)
library(lubridate)
library(xts)
rets<-xts(rets[,-1],order.by=ymd(rets[,1]))
Here's what I've tried:
cum_ret <- ifelse(is.na(rets)==T, 1, cumprod(1+rets))
Which gives:
AFX.SJ.Equity DSY.SJ.Equity
[1,] 1.0000000 1.029852
[2,] 0.9531252 1.150740
[3,] 1.0375126 1.160010
[4,] 1.0343865 1.451939
[5,] 1.0312605 1.390154
[6,] 0.9687555 1.560054
[7,] 0.9395684 1.652732
[8,] 0.9017369 1.000000
[9,] 1.0310053 1.000000
[10,] 1.0000000 1.000000
[11,] NA NA
[12,] NA NA
[13,] NA NA
[14,] NA NA
[15,] NA NA
[16,] NA NA
[17,] NA NA
[18,] NA NA
[19,] NA NA
[20,] NA 1.000000
This places NAs anywhere there is data after the first NA is encountered, and a 1 where there was an NA in the original data.
My desired output should look like this:
AFX SJ Equity DSY SJ Equity
1996-12-31 1.0000000 1.029852
1997-01-31 0.9531252 1.150740
1997-02-28 1.0375126 1.160010
1997-03-31 1.0343865 1.451939
1997-04-30 1.0312605 1.390154
1997-05-30 0.9687555 1.560054
1997-06-30 0.9395684 1.652732
1997-07-31 0.9017369 NA
1997-08-29 1.0310053 NA
1997-09-30 NA NA
1997-10-31 1.0238076 1.045831
1997-11-28 1.0359724 1.154160
1997-12-31 1.0602618 1.069994
1998-01-30 0.9914686 1.069163
1998-02-27 0.9712341 1.112499
1998-03-31 1.0926208 1.245833
1998-04-30 1.1816685 1.224997
1998-05-29 1.0117031 1.383327
1998-06-30 0.8201983 1.549163
1998-07-31 0.7135659 NA
I don't have xts around, but this process should work equally well. (Because of this, I use lapply to work on rets; you should be able to adapt this to your time series rather directly.)
rets[,-1] <- lapply(rets[,-1], function(ret) {
r <- rle(!is.na(ret))
r2 <- c(0, cumsum(r$lengths))
starts <- 1 + head(r2, n = -1)
ends <- r2[-1]
seqs <- Map(seq, starts[r$values], ends[r$values])
for (s in seqs) {
ret[s] <- cumprod(1 + ret[s])
}
ret
})
rets
# Date AFX.SJ.Equity DSY.SJ.Equity
# 1 1996-12-31 1.0000000 1.029852
# 2 1997-01-31 0.9531252 1.150740
# 3 1997-02-28 1.0375126 1.160010
# 4 1997-03-31 1.0343865 1.451939
# 5 1997-04-30 1.0312605 1.390154
# 6 1997-05-30 0.9687555 1.560054
# 7 1997-06-30 0.9395684 1.652732
# 8 1997-07-31 0.9017369 NA
# 9 1997-08-29 1.0310053 NA
# 10 1997-09-30 NA NA
# 11 1997-10-31 1.0238076 1.045831
# 12 1997-11-28 1.0359724 1.154160
# 13 1997-12-31 1.0602618 1.069994
# 14 1998-01-30 0.9914686 1.069163
# 15 1998-02-27 0.9712341 1.112499
# 16 1998-03-31 1.0926208 1.245833
# 17 1998-04-30 1.1816685 1.224997
# 18 1998-05-29 1.0117031 1.383327
# 19 1998-06-30 0.8201983 1.549163
# 20 1998-07-31 0.7135659 NA
The trick here is to use rle to determine the subsets of each vector that are non-NA (stored in the r variable ... though I shouldn't use single-letter variable names). If we look at the first pass within lapply, we'd see
r
# Run Length Encoding
# lengths: int [1:3] 9 1 10
# values : logi [1:3] TRUE FALSE TRUE
seqs
# [[1]]
# [1] 1 2 3 4 5 6 7 8 9
# [[2]]
# [1] 11 12 13 14 15 16 17 18 19 20
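An alternative sketch that avoids rle entirely: label each run between NAs with cumsum(is.na(x)) and take a grouped cumulative product via ave(). The function name cumret_reset is mine, not from the question:

```r
cumret_reset <- function(x) {
  grp <- cumsum(is.na(x))                  # group id bumps at every NA
  out <- ave(ifelse(is.na(x), 0, x), grp,  # treat the NA itself as a 0 return
             FUN = function(v) cumprod(1 + v))
  out[is.na(x)] <- NA                      # put the NAs back
  out
}
```

Within a group that starts at an NA, the leading 0 makes cumprod restart at 1; that position is then overwritten with NA, and the rest of the group compounds from scratch.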
I have two columns, ID and Trial. The ID column is filled with NAs. The Trial column starts at 0 and ends at an arbitrary number (e.g. 1232), whereupon the next trial sequence commences with 0, and so on. My goal is to create a unique ID for each series of trials.
I am new to R and realize that there are several ways to solve this:
Using two nested loops
Using lapply (or rapply?) together with an abstract (?) function call or handle
Using nextElem from the iterator package together with point 1 or 2
Generating the list first and replacing the values in a second step, e.g. using two seq() based on some kind of iteration on subsets: ex_data[subset]
So far, I have figured out that number of participants is:
N <- dim(filter(ex_data, Trial == 0))[1]
Or more elegantly:
N <- count(ex_data[Trial == 0])
In particular, it is the conditional part that I am struggling with and what would be the most R-like solution.
Pseudocode:
IDs are 1:N
while IDs < N+1
current + 1
while column Trial is > 0
ID is IDs[current]
next Trial
next Trial
How do I make the decision when to use loops over more compact expressions like the apply family? Specifically, how do I generate a new sequence based on a nearly cyclic column?
Example Data (for generation see below)
id t
[1,] NA 0
[2,] NA 1
[3,] NA 2
[4,] NA 3
[5,] NA 4
[6,] NA 5
[7,] NA 0
[8,] NA 1
[9,] NA 2
[10,] NA 3
[11,] NA 4
[12,] NA 5
[13,] NA 6
[14,] NA 7
[15,] NA 0
[16,] NA 1
[17,] NA 2
[18,] NA 3
[19,] NA 4
[20,] NA 5
[21,] NA 6
[22,] NA 7
[23,] NA 8
[24,] NA 9
[25,] NA 10
[26,] NA 11
[27,] NA 12
# Generate Example Data
t <- c(0:5, 0:7, 0:12)
id <- rep(NA, length(t))
dta <- cbind(id, t)
# Optional (using dtplyr)
# dta <- tbl_df(dta)
We can use data.table methods
ex_data[, ID := cumsum(!Trial)]
A base R solution:
ex_data$ID <- cumsum(ex_data$Trial==0 )
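A worked sketch on the example data generated above, showing that the cumsum one-liner produces one ID per trial block:

```r
# Three trial sequences of lengths 6, 8, and 13
t <- c(0:5, 0:7, 0:12)
# The ID increments at every restart (Trial == 0)
id <- cumsum(t == 0)
```

Each restart of the Trial counter flips t == 0 to TRUE, so the running sum labels the first block 1, the second 2, and so on, with no loops at all.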
I am trying to loop over the columns of a matrix and change certain predefined sequences within the columns, which are available in the form of vectors.
Let's say I have the following matrix:
m2 <- matrix(sample(1:36),9,4)
[,1] [,2] [,3] [,4]
[1,] 11 6 1 14
[2,] 22 16 27 3
[3,] 34 10 23 32
[4,] 21 19 31 35
[5,] 17 9 2 4
[6,] 28 18 29 5
[7,] 20 30 13 36
[8,] 26 33 24 15
[9,] 8 12 25 7
As an example, my vector of sequence starts is a and my vector of sequence ends is b. Thus the first sequence to delete in all columns is a[1] to b[1], the second a[2] to b[2], and so on.
My testing code is as follows:
testing <- function(x){
apply(x,2, function(y){
a <- c(1,5)
b <- c(2,8)
mapply(function(y){
y[a:b] <- NA; y
},a,b)
})
}
Expected outcome:
[,1] [,2] [,3] [,4]
[1,] NA NA NA NA
[2,] NA NA NA NA
[3,] 34 10 23 32
[4,] 21 19 31 35
[5,] NA NA NA NA
[6,] NA NA NA NA
[7,] NA NA NA NA
[8,] NA NA NA NA
[9,] 8 12 25 7
Actual result:
Error in (function (y) : unused argument (dots[[2]][[1]])
What is wrong in the above code? I know I could just set the rows to NA, but I am trying to get the above output by using nested apply functions to learn more about them.
We get the sequences between corresponding elements of 'a' and 'b' using Map, unlist them to create a single index vector, and assign NA to those rows of 'm2'.
m2[unlist(Map(":", a, b)),] <- NA
m2
# [,1] [,2] [,3] [,4]
# [1,] NA NA NA NA
# [2,] NA NA NA NA
# [3,] 34 10 23 32
# [4,] 21 19 31 35
# [5,] NA NA NA NA
# [6,] NA NA NA NA
# [7,] NA NA NA NA
# [8,] NA NA NA NA
# [9,] 8 12 25 7
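As for what is wrong in the original code: the anonymous function passed to mapply takes only one argument (y), yet mapply is handed the two vectors a and b, hence the "unused argument" error. If the goal is still a working nested-apply version, a sketch that keeps the per-column apply but builds the row index with Map inside it (assuming the same a and b as the question):

```r
a <- c(1, 5)
b <- c(2, 8)
testing <- function(x) {
  apply(x, 2, function(y) {
    y[unlist(Map(":", a, b))] <- NA  # rows 1:2 and 5:8 in each column
    y
  })
}
```

Note that the direct matrix-row assignment shown above is simpler and faster; the apply version is only worth it as an exercise in nested apply functions.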
I have an Erdős–Rényi graph. I infect a vertex and want to see what sequence of vertices the disease would follow. igraph has helpful functions like get.adjacency() and neighbors().
Details: this is the adjacency matrix with vertex names instead of 0/1 flags, and I'm trying to get the contagion chain out of it, i.e. the flow/sequence of an epidemic through the graph if a certain vertex is infected. Let's not worry about infection probabilities here (assume all vertices hit are infected with probability 1).
So suppose I hit vertex 1 (which is row 1 here). We see that it has outgoing links to vertices 4, 5, 18, 22, 23, 24, 25. So the next vertices will be those connected to 4, 5, 18, ..., 25, i.e. the values in row 4, row 5, row 18, ..., row 25. Then, according to the model, the disease travels through these, and so forth.
I understand that I can pass a string to order the matrix rows. My problem is, I cannot figure out how to generate that sequence.
The matrix looks like this.
> channel
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 4 5 18 22 23 24 25 NA
[2,] 6 10 11 18 25 NA NA NA
[3,] 7 11 18 20 NA NA NA NA
[4,] 24 NA NA NA NA NA NA NA
[5,] 1 3 9 13 14 NA NA NA
[6,] 3 8 9 14 19 23 NA NA
[7,] 3 4 8 15 20 22 NA NA
[8,] 2 3 25 NA NA NA NA NA
[9,] 3 4 11 13 20 NA NA NA
[10,] 4 5 8 15 19 20 21 22
[11,] 3 13 15 18 19 23 NA NA
[12,] 11 13 16 NA NA NA NA NA
[13,] 4 6 14 15 16 17 19 21
[14,] 2 6 13 NA NA NA NA NA
[15,] 3 17 20 NA NA NA NA NA
[16,] 6 15 18 23 NA NA NA NA
[17,] 2 25 NA NA NA NA NA NA
[18,] 2 5 NA NA NA NA NA NA
[19,] 3 11 NA NA NA NA NA NA
[20,] 1 4 7 10 12 21 22 25
[21,] 2 4 6 13 14 16 18 NA
[22,] 1 3 4 15 23 NA NA NA
[23,] 1 16 24 NA NA NA NA NA
[24,] 7 8 19 20 22 NA NA NA
[25,] 7 12 13 17 NA NA NA NA
I want to reorder this matrix based on a selection criteria as follows:
R would be most helpful (but I'm interested in the algorithm, so any Python, Ruby, etc. will be great). The resulting vector will have length 115 (8 x 25 = 200, minus 85 NAs = 115) and would look like this, which is basically how the disease would spread if vertex 1 becomes infected:
4,5,18,22,23,24,25,24,1,3,9,13,14,2,5,1,3,4,15,23,1,16,24,7,8,19,20,22,7,12,13,17,7,8,19,20,22, 4,5,18,22,23,24,25,7,11,18,20...
What I know so far:
1. R has a package, igraph, which lets me calculate neighbors(graph, vertex, "out")
2. The same package can also generate get.adjlist(graph, ...) and get.adjacency(graph)
Finding a "contagion chain" like this is equivalent to a breadth-first search through the graph, e.g.:
library(igraph)
set.seed(50)
g = erdos.renyi.game(20, 0.1)
plot(g)
order = graph.bfs(g, root=14, order=TRUE, unreachable=FALSE)$order
Output:
> order
[1] 14 1 2 11 16 18 4 19 12 17 20 7 8 15 5 13 9 NaN NaN NaN
It's not clear how you define the ordering of the rows, so... just a few hints:
You can select a permutation/combination of rows by passing an index vector:
> (m <- matrix(data=1:9, nrow=3))
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> m[c(2,3,1),]
[,1] [,2] [,3]
[1,] 2 5 8
[2,] 3 6 9
[3,] 1 4 7
The function t() transposes a matrix.
The matrix is stored in columns-first (or column-major) order:
> as.vector(m)
[1] 1 2 3 4 5 6 7 8 9
NA values can be removed by subsetting:
> qq <- c(1,2,NA,5,7,NA,3,NA,NA)
> qq[!is.na(qq)]
[1] 1 2 5 7 3
Also, graph algorithms are provided by Bioconductor's graph or CRAN's igraph packages.
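Putting those hints together, here is a sketch of a hand-rolled BFS directly over a channel-style matrix (rows are vertices, entries are out-neighbours, padded with NA); the function name bfs_chain is mine:

```r
bfs_chain <- function(adj, start) {
  visited <- logical(nrow(adj))
  visited[start] <- TRUE
  queue <- start
  visit_order <- integer(0)
  while (length(queue) > 0) {
    v <- queue[1]; queue <- queue[-1]  # dequeue the next infected vertex
    visit_order <- c(visit_order, v)
    nb <- adj[v, ]
    nb <- nb[!is.na(nb)]               # drop the NA padding
    new <- nb[!visited[nb]]            # keep only unvisited neighbours
    visited[new] <- TRUE
    queue <- c(queue, new)
  }
  visit_order
}
```

Unlike the repeated listing in the question, this records each vertex once, at its first infection; that is usually what you want, since a vertex cannot be infected twice.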
I have a data frame which has some rows with NA entries, and I want to find the row and column indices at which the entries are NA. I am looping in a nested fashion to do that, and it is taking too long. Is there a quicker way to do it? Thanks.
set.seed(123)
dfrm <- data.frame(a=sample(c(1:5, NA), 25, replace=TRUE), b=sample(c(letters, NA), 25, replace=TRUE))
which(is.na(dfrm), arr.ind=TRUE)
row col
[1,] 4 1
[2,] 5 1
[3,] 8 1
[4,] 11 1
[5,] 16 1
[6,] 20 1
[7,] 21 1
[8,] 24 1
[9,] 6 2
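If you instead want the NA positions grouped per column, a small sketch with lapply:

```r
# Toy data frame with NAs in both columns
df <- data.frame(a = c(1, NA, 3), b = c(NA, "x", NA))
# One integer vector of NA row indices per column
na_idx <- lapply(df, function(col) which(is.na(col)))
```

This returns a named list, which can be handier than the arr.ind matrix when you process the columns separately.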