Index TRUE occurrences preserving NA in a new vector - r

I have what some of you might categorise as a dumb question, but I cannot solve it. I have this vector:
a <- c(NA,NA,TRUE,NA,TRUE,NA,TRUE)
And I want to get this in a new vector:
b <- c(NA,NA,1,NA,2,NA,3)
That simple. All the approaches I have tried do not preserve the NAs, and I need them untouched. I would prefer a way to do this in base R.

In base R, use cumsum() while excluding the NA values:
a <- c(NA,NA,TRUE,NA,TRUE,NA,TRUE)
a[!is.na(a)] <- cumsum(a[!is.na(a)])
Output:
[1] NA NA 1 NA 2 NA 3
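For context, a plain cumsum(a) would not work here, because cumsum() propagates the first NA to every later element (a quick check, not part of the original answer):
cumsum(a)
# [1] NA NA NA NA NA NA NA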

Using replace from base R
b <- replace(a, !is.na(a), seq_len(sum(a, na.rm = TRUE)))
b
[1] NA NA 1 NA 2 NA 3
Or a slightly more compact option (if the values are logical/numeric):
cumsum(!is.na(a)) * a
[1] NA NA 1 NA 2 NA 3
Update
If the OP's vector is
a <- c(NA,NA,TRUE,NA,FALSE,NA,TRUE)
(a|!a) * cumsum(replace(a, is.na(a), 0))
[1] NA NA 1 NA 1 NA 2
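For what it's worth, the cumsum()-on-non-NA approach from the first answer handles this case as well (a quick check, not part of the original answer):
a[!is.na(a)] <- cumsum(a[!is.na(a)])
a
# [1] NA NA 1 NA 1 NA 2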

replace()-ing the non-NAs with the cumsum of the non-NA values:
replace(a, !is.na(a), cumsum(na.omit(a)))
# [1] NA NA 1 NA 2 NA 3

Related

How to select n random values from each row of a dataframe in R?

I have a dataframe
df <- data.frame(a = c(56, 23, 15, 10),
                 b = c(43, NA, 90.7, 30.5),
                 c = c(12, 7, 10, 2),
                 d = c(1, 2, 3, 4),
                 e = c(NA, 45, 2, NA))
I want to select two random non-NA values from each row and convert the rest to NA.
Required output (will differ because of randomness):
df <- data.frame(a = c(56, NA, 15, NA),
                 b = c(43, NA, NA, NA),
                 c = c(NA, 7, NA, 2),
                 d = c(NA, NA, 3, 4),
                 e = c(NA, 45, NA, NA))
Code Used
I know how to select random non-NA values from a specific row:
set.seed(2)
sample(which(!is.na(df[1,])),2)
But I have no idea how to apply it to the whole dataframe and get the required output.
You may write a function to keep n random non-NA values in a row.
keep_n_value <- function(x, n) {
  x1 <- which(!is.na(x))
  x[-sample(x1, n)] <- NA
  x
}
Apply the function by row using base R -
set.seed(123)
df[] <- t(apply(df, 1, keep_n_value, 2))
df
# a b c d e
#1 NA NA 12 1 NA
#2 NA NA 7 2 NA
#3 NA 90.7 10 NA NA
#4 NA 30.5 NA 4 NA
Or if you prefer tidyverse -
purrr::pmap_df(df, ~keep_n_value(c(...), 2))
Base R:
You could try a column-wise apply (sapply) and randomly replace two non-NA values with NA, like:
as.data.frame(sapply(df, function(x) replace(x, sample(which(!is.na(x)), 2), NA)))
Example Output:
a b c d e
1 56 NA 12 NA NA
2 23 NA NA 2 NA
3 NA NA 10 3 NA
4 NA 30.5 NA NA NA
One option using dplyr and purrr could be:
df %>%
mutate(pmap_dfr(across(everything()), ~ `[<-`(c(...), !seq_along(c(...)) %in% sample(which(!is.na(c(...))), 2), NA)))
a b c d e
1 56 43.0 NA NA NA
2 23 NA 7 NA NA
3 15 NA NA NA 2
4 NA 30.5 2 NA NA

Delete rows where any column contains a specific number

This is almost certainly a duplicate question, but I can't find an answer anywhere on SO. Most of the other similar questions relate to subsetting from one column, rather than an entire dataframe.
I have a dataframe:
test <- data.frame(A = c(.31562, .48845, .27828, -999),
                   B = c(.5674, 5.7892, .4687, .1345),
                   C = c(-999, .3145, .0641, -999))
I want to drop rows where any column contains -999, so that my dataframe will look like this:
A B C
2 0.48845 5.7892 0.3145
3 0.27828 0.4687 0.0641
I am sure there is an easy way to do this with the subset() function, or apply(), but I just can't figure it out.
I tried this:
test[apply(test, MARGIN = 1, FUN = function(x) {-999 != x}), ]
But it returns:
A B C
1 0.31562 0.5674 -999.0000
2 0.48845 5.7892 0.3145
4 -999.00000 0.1345 -999.0000
NA NA NA NA
NA.1 NA NA NA
NA.2 NA NA NA
NA.3 NA NA NA
NA.4 NA NA NA
NA.5 NA NA NA
Use arr.ind with which to obtain the rows where -999 is present (which(test == -999, arr.ind = TRUE)[,1]) and remove those rows.
test[-unique(which(test == -999, arr.ind = TRUE)[,1]),]
# A B C
#2 0.48845 5.7892 0.3145
#3 0.27828 0.4687 0.0641
We can use Reduce
test[!Reduce(`|`, lapply(test, `==`, -999)),]
# A B C
#2 0.48845 5.7892 0.3145
#3 0.27828 0.4687 0.0641
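For completeness, the OP's own apply() attempt can be repaired by making the row-wise function return a single logical per row (a sketch, not one of the original answers):
test[!apply(test, 1, function(x) any(x == -999)), ]
# A B C
#2 0.48845 5.7892 0.3145
#3 0.27828 0.4687 0.0641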

Searching pairs in matrix in R

I am rather new to R, so I would be grateful if anyone could help me :)
I have large matrices, for example the one shown in the screenshot in the original post,
and a vector of genes.
My task is to search the matrix row by row and compile pairs of the genes carrying mutations (in the example matrix it is D707H) with the rest of the genes contained in the vector, and add them to a new matrix. I tried to do this with loops but I have no idea how to write it correctly. For this matrix it should look something like this:
PR.02.1431
NBN BRCA1
NBN BRCA2
NBN CHEK2
NBN ELAC2
NBN MSR1
NBN PARP1
NBN RNASEL
Now I have something like this (my attempt is shown as a screenshot in the original post).
"a" is my initial matrix.
Can anyone point me in the right direction? :)
Perhaps what you want/need is which(..., arr.ind = TRUE).
Some sample data, for demonstration:
set.seed(2)
n <- 10
mtx <- array(NA, dim = c(n, n))
dimnames(mtx) <- list(letters[1:n], LETTERS[1:n])
mtx[sample(n*n, size = 4)] <- paste0("x", 1:4)
mtx
# A B C D E F G H I J
# a NA NA NA NA NA NA NA NA NA NA
# b NA NA NA NA NA NA NA NA NA NA
# c NA NA NA NA NA NA NA NA NA NA
# d NA NA NA NA NA NA NA NA NA NA
# e NA NA NA NA NA NA NA NA NA NA
# f NA NA NA NA NA NA NA NA NA NA
# g NA "x4" NA NA NA "x3" NA NA NA NA
# h NA NA NA NA NA NA NA NA NA NA
# i NA "x1" NA NA NA NA NA NA NA NA
# j NA NA NA NA NA NA "x2" NA NA NA
In your case, it appears that you want anything that is not an NA or NaN. You might try:
which(! is.na(mtx) & ! is.nan(mtx))
# [1] 17 19 57 70
but that isn't always intuitive when retrieving the row/column pairs (genes, I think?). Try instead:
ind <- which(! is.na(mtx) & ! is.nan(mtx), arr.ind = TRUE)
ind
# row col
# g 7 2
# i 9 2
# g 7 6
# j 10 7
How to use this: the integers are row and column indices, respectively. Assuming your matrix is using row names and column names, you can retrieve the row names with:
rownames(mtx)[ ind[,"row"] ]
# [1] "g" "i" "g" "j"
(An astute reader might suggest I use rownames(ind) instead. It certainly works!) Similarly for the colnames and "col".
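Since which(..., arr.ind = TRUE) carries the matrix's row names along (as the printed ind above shows), the same result can be pulled directly from ind:
rownames(ind)
# [1] "g" "i" "g" "j"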
Interestingly enough, even though ind is a matrix itself, you can subset mtx fairly easily with:
mtx[ind]
# [1] "x4" "x1" "x3" "x2"
Combining all three together, you might be able to use:
data.frame(
gene1 = rownames(mtx)[ ind[,"row"] ],
gene2 = colnames(mtx)[ ind[,"col"] ],
val = mtx[ind]
)
# gene1 gene2 val
# 1 g B x4
# 2 i B x1
# 3 g F x3
# 4 j G x2
I know where my mistake was; now I have a matrix. Analyzing your code, it works well, but that's not exactly what I want to do.
a, b, c, d, etc. are organisms and the row names are genes (A, B, C, D, etc.). I have to combine pairs of genes where one of them (in the same column) has something other than an NA value. For example, if gene A has value = 4 in column a, I have to have:
gene1 gene2
a A B
a A C
a A D
a A E
I tried it this way, but the numbers of elements do not match and I do not know how to solve this.
ind= which(! is.na(a) & ! is.nan(a), arr.ind = TRUE)
ind1=which(macierz==1,arr.ind = TRUE)
ramka= data.frame(
kolumna = rownames(a)[ ind[,"row"] ],
gene1 = colnames(a)[ ind[,"col"] ],
gene2 = colnames(a)[ind1[,"col"]],
#val = macierz[ind]
)
Do you know how to do this in R?
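There is no accepted follow-up in the thread, but one hedged sketch of the pairing, using the demo mtx from the answer above and assuming rows are organisms, columns are genes, and every non-NA cell marks a mutation, could look like this:
ind <- which(!is.na(mtx), arr.ind = TRUE)
pairs <- do.call(rbind, lapply(seq_len(nrow(ind)), function(i) {
  # the mutated gene for this hit
  gene1 <- colnames(mtx)[ind[i, "col"]]
  # pair it with every other gene, keeping the organism label
  data.frame(organism = rownames(mtx)[ind[i, "row"]],
             gene1 = gene1,
             gene2 = setdiff(colnames(mtx), gene1))
}))
head(pairs)
# pairs gene "B" of organism "g" with every other gene (A, C, D, ...), then
# does the same for the remaining hits in ind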

Methods to correct for lookahead bias

This is not a regex problem.
I am trying to correct for lookahead bias in data, basically to move the values up by 1. This is what I came up with. Does anyone have a better/faster/built-in method to do this?
d<-c(1,2,3,4)
#correct for lookahead bias, move values up by 1
e<-d[-c(1)]
length(e)<-length(d)
cbind(d,e)
> cbind(d,e)
d e
[1,] 1 2
[2,] 2 3
[3,] 3 4
[4,] 4 NA
There are a couple of ways I can think of to do this. Both are fairly concise one-liners:
base
cbind(d, c(d[-1], NA))
data.table
rev(data.table::shift(rev(d), 1))
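If I'm not mistaken, data.table::shift() also takes a type argument, so the double rev() can be avoided:
data.table::shift(d, 1, type = "lead")
# [1]  2  3  4 NA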
If we want to write it as a function, we can do that too. Note that this function does not attempt any error handling.
shift_up <- function(x, n = 1) c(x[-(1:n)], rep(NA, n))
Which is very useful for fans of the comic series Batman:
d <- 1:16
shift_up(d, 16)
# [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA #BATMAN!

Removing duplicates in vector but preserving order

Suppose a vector :
vec = c(NA,NA,1,NA,NA,NA,1,NA,NA,0,NA,NA,0,NA,NA,0,NA,NA,1,NA,NA,1,NA,NA,0,NA,0)
I would like to get :
vec = c(NA,NA,1,NA,NA,NA,NA,NA,NA,0,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,0,NA,NA)
I have tried a for loop with an if checking whether the value is equal to the previous non-NA, but it doesn't work when the value is repeated more than once.
The approach from "Remove duplicates in vector to next value" doesn't work either, since I want to keep my NAs.
You can do this with a little bit of logic and a compound [ and [<- operation. First we need to find the duplicates. We'll do this with diff() on all the non-NA values...
diff( vec[ ! is.na( vec ) ] )
[1] 0 -1 0 0 1 0 -1 0
Each 0 is a duplicate. Now we need to find their positions in vec and set them to NA:
# This gives us a vector of TRUE/FALSE values which we will use to subset vec to the values we want to change
dups <- c( 1 , diff( vec[ ! is.na( vec ) ] ) ) == 0
# Now subset vec to non NA values and change the duplicates to NA
vec[ ! is.na( vec ) ][ dups ] <- NA
# [1] NA NA 1 NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA 1 NA NA NA NA NA 0
#[26] NA NA
Use duplicated:
vec[duplicated(vec, incomparables=NA)] <- NA
You could omit the incomparables parameter in your example:
vec[duplicated(vec)] <- NA
According to the documentation this might be faster, but you'd need to benchmark it yourself.
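If you want to check, a minimal timing sketch (assuming the microbenchmark package is available) might be:
microbenchmark::microbenchmark(
  with_incomparables = {v <- vec; v[duplicated(v, incomparables = NA)] <- NA},
  without            = {v <- vec; v[duplicated(v)] <- NA}
)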
Edit:
After clarification:
vec <- c(NA,NA,1,NA,NA,NA,1,NA,NA,NA,NA,0,NA,NA,0,NA,NA,0,NA,NA,NA,1,NA,NA,1,NA,NA,0,NA,0)
vec2 <- c(NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,0,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,0,NA,NA)
tmp <- vec[!is.na(vec)]
tmp[c(FALSE, diff(tmp)==0)] <- NA
vec[!is.na(vec)] <- tmp
identical(vec, vec2)
#[1] TRUE
I think this does it:
vrl <- rle(vec)
vdif <- diff(vrl$values[!is.na(vrl$values)])
vdif <- c(1, vdif)
vrl$values[!is.na(vrl$values)][vdif == 0] <- NA
inverse.rle(vrl)
# [1] NA NA 1 NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA
#[19] 1 NA NA NA NA NA 0 NA NA
The trick in there was to prepend a 1 to the difference vector so that the very first non-NA location is preserved.
