Methods to correct for lookahead bias - r

This is not a regex problem.
I am trying to correct for lookahead bias in data, basically by moving the values up by 1. This is what I came up with. Does anyone have a better, faster, or built-in method to do this?
d<-c(1,2,3,4)
#correct for lookahead bias, move values up by 1
e<-d[-c(1)]
length(e)<-length(d)
cbind(d,e)
> cbind(d,e)
d e
[1,] 1 2
[2,] 2 3
[3,] 3 4
[4,] 4 NA

Here are two ways I could think of to do this. Both are fairly concise one-liners:
base
cbind(d, c(d[-1], NA))
data.table
rev(data.table::shift(rev(d), 1))
If we want to write it as a function, we can do that too. Note that this function does no input validation or error handling.
shift_up <- function(x, n = 1) c(x[-(1:n)], rep(NA, n))
Which is very useful for fans of the comic series Batman:
d <- 1:16
shift_up(d, 16)
# [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA #BATMAN!
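Since the one-liner above skips error handling, here is a minimal sketch of a slightly more defensive variant. The name shift_up_safe is made up for illustration and is not part of any package. One thing it guards against: with n = 0, x[-(1:n)] becomes x[-c(1, 0)], which silently drops the first element instead of returning x unchanged.

```r
# Hypothetical helper (an illustration, not from the answer above):
# shift values up by n, padding the end with NA.
shift_up_safe <- function(x, n = 1L) {
  stopifnot(length(n) == 1L, n >= 0)
  n <- min(as.integer(n), length(x))    # cap n at the vector length
  c(tail(x, length(x) - n), rep(NA, n)) # keep the last length(x) - n values
}

d <- c(1, 2, 3, 4)
shifted_one  <- shift_up_safe(d)      # 2 3 4 NA
shifted_zero <- shift_up_safe(d, 0)   # unchanged: 1 2 3 4
shifted_all  <- shift_up_safe(d, 10)  # all NA
```

If data.table is already loaded, data.table::shift(d, 1, type = "lead") should give the same result as the rev() trick without reversing twice.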

Related

Index TRUE occurrences preserving NA in a new vector

I have what some of you might categorise as a dumb question, but I cannot solve it. I have this vector:
a <- c(NA,NA,TRUE,NA,TRUE,NA,TRUE)
And I want to get this in a new vector:
b <- c(NA,NA,1,NA,2,NA,3)
That simple. All the ways I am trying do not preserve the NA and I need them untouched. I would prefer if there would be a way in base R.
In base R, use cumsum() while excluding the NA values:
a <- c(NA,NA,TRUE,NA,TRUE,NA,TRUE)
a[!is.na(a)] <- cumsum(a[!is.na(a)])
Output:
[1] NA NA 1 NA 2 NA 3
Using replace from base R
b <- replace(a, !is.na(a), seq_len(sum(a, na.rm = TRUE)))
b
[1] NA NA 1 NA 2 NA 3
Or a slightly more compact option (if the values are logical/numeric)
cumsum(!is.na(a)) * a
[1] NA NA 1 NA 2 NA 3
Update
If the OP's vector is
a <- c(NA,NA,TRUE,NA,FALSE,NA,TRUE)
(a|!a) * cumsum(replace(a, is.na(a), 0))
[1] NA NA 1 NA 1 NA 2
Replacing the non-NAs with the cumsum:
replace(a, !is.na(a), cumsum(na.omit(a)))
# [1] NA NA 1 NA 2 NA 3
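To compare how these approaches behave when the vector also contains FALSE, the cumsum idea can be wrapped in a small function. This is only a sketch; index_true is a made-up name. It writes into a fresh integer vector so the input is not overwritten.

```r
# Hypothetical wrapper around the cumsum() approach shown above.
index_true <- function(a) {
  out <- rep(NA_integer_, length(a))  # start with all NA
  keep <- !is.na(a)                   # positions that are TRUE or FALSE
  out[keep] <- cumsum(a[keep])        # running count of TRUE among non-NA
  out
}

r1 <- index_true(c(NA, NA, TRUE, NA, TRUE, NA, TRUE))
# [1] NA NA  1 NA  2 NA  3
r2 <- index_true(c(NA, NA, TRUE, NA, FALSE, NA, TRUE))
# [1] NA NA  1 NA  1 NA  2
```

Note that a FALSE keeps the previous count rather than becoming NA or 0, which matches the "Update" variant above.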

Fast way to insert values in column of data frame in R

I am currently trying to find unique elements between two columns of a data frame and write these to a new final data frame.
This is my code, which works perfectly fine, and creates a result which matches my expectation.
set.seed(42)
df <- data.frame(a = sample(1:15, 10),
b=sample(1:15, 10))
unique_to_a <- df$a[!(df$a %in% df$b)]
unique_to_b <- df$b[!(df$b %in% df$a)]
n <- max(c(unique_to_a, unique_to_b))
out <- data.frame(A=rep(NA,n), B=rep(NA,n))
for (element in unique_to_a){
out[element, "A"] = element
}
for (element in unique_to_b){
out[element, "B"] = element
}
out
The problem is that it is very slow, because the real data contains hundreds of thousands of rows. I am quite sure it is because of the repeated indexing I am doing in the for loop, and I am sure there is a quicker, vectorized way, but I don't see it...
Any ideas on how to speed up the operation are much appreciated.
Cheers!
Didn't compare the speed but at least this is more concise:
elements <- with(df, list(setdiff(a, b), setdiff(b, a)))
data.frame(sapply(elements, \(x) replace(rep(NA, max(unlist(elements))), x, x)))
# X1 X2
# 1 NA NA
# 2 NA NA
# 3 NA 3
# 4 NA NA
# 5 NA NA
# 6 NA NA
# 7 NA NA
# 8 NA NA
# 9 NA NA
# 10 NA NA
# 11 11 NA
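The two for loops can also be replaced by a single vectorized assignment per column, since each element is used both as the row index and as the value. A minimal sketch with small made-up data (not the seeded sample from the question); I have not benchmarked it:

```r
# Toy data chosen for illustration only.
df <- data.frame(a = c(2L, 5L, 7L, 9L),
                 b = c(5L, 9L, 3L, 11L))

unique_to_a <- setdiff(df$a, df$b)  # 2 7
unique_to_b <- setdiff(df$b, df$a)  # 3 11

n <- max(unique_to_a, unique_to_b)
out <- data.frame(A = rep(NA_integer_, n), B = rep(NA_integer_, n))

# One indexed assignment per column instead of one per element.
out$A[unique_to_a] <- unique_to_a
out$B[unique_to_b] <- unique_to_b
```

This assumes at least one of the two sets is non-empty, otherwise max() has nothing to work with.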

Why does R 'sample' some columns more than others?

I am testing the impact of missing data on regression analysis. So, using a simulated dataset, I want to randomly remove a proportion of observations (not entire rows) from a designated set of columns. I am using 'sample' to do this. Unfortunately, this makes some columns have many more missing values than others. See an example below:
#Data frame with 5 columns, 10 rows
DF = data.frame(A = paste(letters[1:10]),B = rnorm(10, 1, 10), C = rnorm(10, 1, 10), D = rnorm(10, 1, 10), E = rnorm(10,1,10))
#Function to randomly delete a proportion (ProportionRemove) of records per column, for a designated set of columns (ColumnStart - ColumnEnd)
RandomSample = function(DataFrame,ColumnStart, ColumnEnd,ProportionRemove){
#ci is the opposite of the proportion
ci = 1-ProportionRemove
Missing = sapply(DataFrame[(ColumnStart:ColumnEnd)], function(x) x[sample(c(TRUE, NA), prob = c(ci,ProportionRemove), size = length(DataFrame), replace = TRUE)])}
#Randomly sample column 2 - 5 within DF, deleting 80% of the observation per column
Test = RandomSample(DF, 2, 5, 0.8)
I understand there is an element of randomness to this, but in 10 trials (10*4 = 40 columns), 17 of the columns had no data, and in one trial, one column still had 6 records (rather than the expected ~2) - see below.
B C D E
[1,] NA 24.004402 7.201558 NA
[2,] NA NA NA NA
[3,] NA 4.029659 NA NA
[4,] NA NA NA NA
[5,] NA 29.377632 NA NA
[6,] NA 3.340918 -2.131747 NA
[7,] NA NA NA NA
[8,] NA 15.967318 NA NA
[9,] NA NA NA NA
[10,] NA -8.078221 NA NA
In summary, I want to replace a proportion of observations with NAs in each column.
Any help is greatly appreciated!!!
This makes sense to me. As @Frank suggested (in a since-deleted comment ... *sigh*), "randomness" can give you really non-random-looking results (Dilbert: Tour of Accounting, 2001-10-25).
If you want random samples with guaranteed ratios, try this:
guaranteedSampling <- function(DataFrame, ProportionRemove) {
n <- max(1L, floor(nrow(DataFrame) * ProportionRemove))
inds <- replicate(ncol(DataFrame), sample(nrow(DataFrame), size=n), simplify=FALSE)
DataFrame[] <- mapply(`[<-`, DataFrame, inds, MoreArgs=list(NA), SIMPLIFY=FALSE)
DataFrame
}
set.seed(2)
guaranteedSampling(DF[2:5], 0.8)
# B C D E
# 1 NA NA NA NA
# 2 NA NA NA NA
# 3 NA NA NA NA
# 4 6.792463 10.582938 NA NA
# 5 NA NA -0.612816 NA
# 6 NA -2.278758 NA NA
# 7 NA NA NA 2.245884
# 8 NA NA NA 5.993387
# 9 7.863310 NA 9.042127 NA
# 10 NA NA NA NA
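A quick way to convince yourself that the ratio is exact is to count the NAs per column. The sketch below uses a simplified stand-in written with lapply (blank_fixed is a made-up name, not the function above) on toy data:

```r
# Hypothetical simplified variant: blank exactly floor(nrow * p) cells per column.
blank_fixed <- function(df, p) {
  n <- floor(nrow(df) * p)
  df[] <- lapply(df, function(col) {
    col[sample(length(col), size = n)] <- NA  # n distinct positions per column
    col
  })
  df
}

set.seed(1)
toy <- data.frame(B = rnorm(10), C = rnorm(10), D = rnorm(10))
res <- blank_fixed(toy, 0.8)
na_counts <- colSums(is.na(res))  # 8 NAs in every column
```

Which cells are blanked still varies from column to column; only the count is fixed.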
Further to @joran's comment, you either wanted nrow(DataFrame) or length(x).
The specific impact in your example is that you are producing a vector with 5 elements (because DF has 5 variables) each with 0.8 probability of being NA and 0.2 of being TRUE.
Then this statement (which is what the sapply is doing to each column you specify and in this case I'm applying to DF$B only):
DF$B[sample(c(TRUE, NA), prob=c(0.2, 0.8), size = 5, replace=TRUE)]
does something that isn't immediately obvious to the uninitiated*. This:
sample(c(TRUE, NA), prob=c(0.2, 0.8), size = 5, replace=TRUE)
gives a logical vector, which when used to extract elements of a vector is silently recycled. So let's say you end up with:
NA TRUE NA TRUE NA
When you subset DF$B you end up getting this:
DF$B[c(NA, TRUE, NA, TRUE, NA, NA, TRUE, NA, TRUE, NA)]
Notice in your example how the top 5 numbers always follow the same pattern as the bottom 5 numbers. This explains why so many columns ended up being all NA, because there is a 0.32768 probability of getting 5 out of 5 NA which gets recycled to the whole column.
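The recycling can be reproduced in isolation. A small sketch, using 1:10 as a stand-in for DF$B so the positions are easy to read:

```r
x <- 1:10
idx <- c(NA, TRUE, NA, TRUE, NA)  # length 5, shorter than x

# The logical index is recycled to length 10, and every NA in it yields NA.
picked <- x[idx]
# [1] NA  2 NA  4 NA NA  7 NA  9 NA

# Chance that all 5 sampled values are NA when each is NA with probability 0.8.
p_all_na <- 0.8^5
# [1] 0.32768
```

The first five results and the last five follow the same NA pattern, which is what shows up in the question's output.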
The other issue with your code is that the last expression in the function is an assignment (Missing = ...), so the result is only returned invisibly; it is clearer to return the value directly. Here it is corrected and cleaned up, following the style guide at http://adv-r.had.co.nz/Style.html:
random_sample <- function(x, col_start, col_end, p) {
sapply(x[col_start:col_end],
function(y) y[sample(c(TRUE, NA), prob = c(1-p, p), size = length(y), replace = TRUE)])
}
*The uninitiated in this case includes me! I had no idea that logical vectors were recycled when used to extract until having a look at this question.

Searching pairs in matrix in R

I am rather new to R, so I would be grateful if anyone could help me :)
I have large matrices, for example:
[image: matrix]
and a vector of genes.
My task is to search the matrix row by row and compile pairs of genes with mutations (in the matrix this is D707H) with the rest of the genes contained in the vector, and add them to a new matrix. I tried to do this with loops but I have no idea how to write it correctly. For this matrix it should look something like this:
PR.02.1431
NBN BRCA1
NBN BRCA2
NBN CHEK2
NBN ELAC2
NBN MSR1
NBN PARP1
NBN RNASEL
Now I have something like this:
[image: my idea]
"a" is my initial matrix.
Can anyone point me in the right direction? :)
Perhaps what you want/need is which(..., arr.ind = TRUE).
Some sample data, for demonstration:
set.seed(2)
n <- 10
mtx <- array(NA, dim = c(n, n))
dimnames(mtx) <- list(letters[1:n], LETTERS[1:n])
mtx[sample(n*n, size = 4)] <- paste0("x", 1:4)
mtx
# A B C D E F G H I J
# a NA NA NA NA NA NA NA NA NA NA
# b NA NA NA NA NA NA NA NA NA NA
# c NA NA NA NA NA NA NA NA NA NA
# d NA NA NA NA NA NA NA NA NA NA
# e NA NA NA NA NA NA NA NA NA NA
# f NA NA NA NA NA NA NA NA NA NA
# g NA "x4" NA NA NA "x3" NA NA NA NA
# h NA NA NA NA NA NA NA NA NA NA
# i NA "x1" NA NA NA NA NA NA NA NA
# j NA NA NA NA NA NA "x2" NA NA NA
In your case, it appears that you want anything that is not an NA or NaN. You might try:
which(! is.na(mtx) & ! is.nan(mtx))
# [1] 17 19 57 70
but that isn't always intuitive when retrieving the row/column pairs (genes, I think?). Try instead:
ind <- which(! is.na(mtx) & ! is.nan(mtx), arr.ind = TRUE)
ind
# row col
# g 7 2
# i 9 2
# g 7 6
# j 10 7
How to use this: the integers are row and column indices, respectively. Assuming your matrix is using row names and column names, you can retrieve the row names with:
rownames(mtx)[ ind[,"row"] ]
# [1] "g" "i" "g" "j"
(An astute reader might suggest I use rownames(ind) instead. It certainly works!) Similarly for the colnames and "col".
Interestingly enough, even though ind is a matrix itself, you can subset mtx fairly easily with:
mtx[ind]
# [1] "x4" "x1" "x3" "x2"
Combining all three together, you might be able to use:
data.frame(
gene1 = rownames(mtx)[ ind[,"row"] ],
gene2 = colnames(mtx)[ ind[,"col"] ],
val = mtx[ind]
)
# gene1 gene2 val
# 1 g B x4
# 2 i B x1
# 3 g F x3
# 4 j G x2
I know where my mistake was, now I have a matrix. Your code works well, but that's not exactly what I want to do.
a, b, c, d etc. are organisms and row names are genes (A, B, C, D etc.). I have to combine pairs of genes where one of them (in the same column) has something other than an NA value. For example, if gene A has value = 4 in column a, I need:
gene1 gene2
a A B
a A C
a A D
a A E
I tried it this way, but the number of elements does not match and I do not know how to solve this.
ind= which(! is.na(a) & ! is.nan(a), arr.ind = TRUE)
ind1=which(macierz==1,arr.ind = TRUE)
ramka= data.frame(
kolumna = rownames(a)[ ind[,"row"] ],
gene1 = colnames(a)[ ind[,"col"] ],
gene2 = colnames(a)[ind1[,"col"]],
#val = macierz[ind]
)
Do you know how to do this in R?
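One possible reading of this follow-up, sketched on toy data: for every non-NA cell, pair that cell's gene with every other gene in the gene vector. The layout here is an assumption (rows are samples, columns are genes, as in the attempt above), and the sample names and gene list are made up for illustration.

```r
# Toy data (assumed layout: rows = samples, columns = genes).
genes <- c("NBN", "BRCA1", "BRCA2", "CHEK2")
a <- matrix(NA_character_, nrow = 2, ncol = 4,
            dimnames = list(c("PR.01", "PR.02"), genes))
a["PR.02", "NBN"] <- "D707H"

# Row/column positions of all non-NA cells.
ind <- which(!is.na(a), arr.ind = TRUE)

# For each hit, pair the mutated gene with all remaining genes.
pairs <- do.call(rbind, lapply(seq_len(nrow(ind)), function(i) {
  g1 <- colnames(a)[ind[i, "col"]]
  data.frame(sample = rownames(a)[ind[i, "row"]],
             gene1  = g1,
             gene2  = setdiff(genes, g1),
             stringsAsFactors = FALSE)
}))
pairs
```

The length mismatch in the attempt above comes from putting vectors of different lengths into one data.frame; building one small data.frame per hit and binding them avoids that, because the single sample and gene1 values are recycled against gene2.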

Difference between intersect and match in R

I am trying to understand the difference between match and intersect in R. Both return the same output in a different format. Are there any functional differences between both?
match(names(set1), names(set2))
# [1] NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 11
intersect(names(set1), names(set2))
# [1] "Year" "ID"
match(a, b) returns an integer vector of length(a), with the i-th element giving the first position j such that a[i] == b[j]. NA is produced by default when there is no match (you can change this with the nomatch argument).
If you want to get the same result as intersect(a, b), use either of the following:
b[na.omit(match(a, b))]
a[na.omit(match(b, a))]
Example
a <- 1:5
b <- 2:6
b[na.omit(match(a, b))]
# [1] 2 3 4 5
a[na.omit(match(b, a))]
# [1] 2 3 4 5
I just wanted to know if there are any other differences between the two. I was able to understand the results myself.
Then let's read the source code:
intersect
#function (x, y)
#{
# y <- as.vector(y)
# unique(y[match(as.vector(x), y, 0L)])
#}
It turns out that intersect is written in terms of match!
Haha, looks like I forgot the unique in the outside. Em, by setting nomatch = 0L we can also get rid of na.omit. Well, R core is more efficient than my guess.
Follow-up
We could also use
a[a %in% b] ## need a `unique`, too
b[b %in% a] ## need a `unique`, too
However, have a read on ?match. In "Details" we can see how "%in%" is defined:
"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0
So, yes, everything is written using match.
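A small sketch with duplicated values makes the practical differences visible: match keeps the length and order of its first argument and returns positions, %in% returns a logical vector of the same length, and intersect returns the unique common values. The vectors are made up for illustration.

```r
a <- c("x", "y", "y", "z")
b <- c("y", "z", "w")

m <- match(a, b)        # positions in b, one per element of a
# [1] NA  1  1  2
inb <- a %in% b         # membership, one per element of a
# [1] FALSE  TRUE  TRUE  TRUE
common <- intersect(a, b)  # unique common values
# [1] "y" "z"

# Rebuilding intersect from match, as in the source shown above.
rebuilt <- unique(b[match(a, b, nomatch = 0L)])
# [1] "y" "z"
```

So match is the one to use when you need to look something up by position, for example to reorder one data set to line up with another.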
