Conditional subsetting of data frame keeping previous row - r

My data frame looks like this
Model w0 p0 w1 p1 w2 p.value
1 Null_model 3.950000e-05 0.7366921 0.988374029 0.000000e+00 1.296464
2 alt_test 1.366006e-02 0.4673263 0.139606503 3.049244e-01 1.146653
3 alt_ref 2.000000e-07 0.4673263 0.000846849 3.049244e-01 1.635038 5.550000e-15
8 Null_model 2.790000e-05 0.7240479 0.987016439 0.000000e+00 1.263556
9 alt_test 7.550000e-09 0.7231176 0.991768899 1.060000e-13 1.369259
10 alt_ref 2.770000e-05 0.7231176 0.995373167 1.060000e-13 1.192839 3.073496e-01
... ... ... ... ... ... ...
What I want is to subset my data.frame so that it keeps every row where p.value < 0.05 and also the row immediately before each of those rows.
So ideally my output would look something like this:
Model w0 w1 w2
2 alt_test 1.4e-02 0.139606503 1.146653
3 alt_ref 2.00e-07 0.000846849 1.635038
I've tried the following but it doesn't work quite right:
subset(v, p.value < 0.05, select = c(Model,w0,w1,w2))
the output doesn't have the alt_test row.
I have also tried
with(v, ifelse(p.value < 0.05, paste(dplyr::lag(c(w0,w1,w2),1)), ""))
and the output in this case looks like
[1] NA NA NA NA "0.013660056" NA NA NA NA ""
[11] NA NA NA NA "" NA NA NA NA ""
[21] NA NA NA NA "" NA NA NA NA ""
[31] NA NA NA NA "" NA NA NA NA ""
[41] NA NA NA NA "" NA NA NA NA ""
[51] NA NA NA NA "1.34e-11" NA NA NA NA "" ...
I also tried
subset(v, p.value < 0.05, select = c(w0, w1,w2, w0-1, w1-1, w2-1))
but this gives the previous column, so I was wondering if something similar can give previous rows instead?
Thank you

If your data.frame always has this alternating structure, with each alt_ref row directly after its alt_test row, you can construct the subset index manually:
library(data.table)
setDT(myDf)
# Keep a row if its own p.value is below 0.05, or, when its own p.value is NA,
# if the next row's p.value is below 0.05.
myDf[Reduce(function(x, y) ifelse(!is.na(x), x, ifelse(!is.na(y), y, FALSE)),
            shift(p.value < 0.05, n = 0:1, type = "lead")), .(Model, w0, w1, w2)]
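The same selection can also be sketched in base R, without data.table. This is only an illustration: the data frame v below is rebuilt by hand from the six rows shown in the question, with the empty p.value cells assumed to be NA.

```r
# Toy copy of the rows shown in the question (empty p.value cells assumed NA).
v <- data.frame(
  Model   = rep(c("Null_model", "alt_test", "alt_ref"), 2),
  w0      = c(3.95e-05, 1.366006e-02, 2e-07, 2.79e-05, 7.55e-09, 2.77e-05),
  w1      = c(0.988374029, 0.139606503, 0.000846849,
              0.987016439, 0.991768899, 0.995373167),
  w2      = c(1.296464, 1.146653, 1.635038, 1.263556, 1.369259, 1.192839),
  p.value = c(NA, NA, 5.55e-15, NA, NA, 3.073496e-01),
  stringsAsFactors = FALSE
)

# Row numbers of the significant cases; which() silently drops the NA comparisons.
hit <- which(v$p.value < 0.05)

# Add the row just before each hit, then drop any index below 1.
keep <- sort(unique(c(hit - 1, hit)))
keep <- keep[keep >= 1]

res <- v[keep, c("Model", "w0", "w1", "w2")]
res
```

Unlike the index built with shift(), this keeps the preceding row whatever its own p.value is, which may or may not be what you want if significant rows can follow each other.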


How to get name row as variable in function and plot density graph

I have issues with my function; I don't know whether the problem is in the function itself or in the way I call it.
I have a big data frame with more than 20000 rows and around 700 columns, where each row is a part of a gene. I want to calculate the density for each row and save the density plot under the name of the gene.
baseM <- read.csv("expansions_full_omim_06_07_21.2.csv", sep = "\t")
rownames(baseM) <- paste(baseM$motif, baseM$chromosome, baseM$intervalle , baseM$gene , baseM$localisation, baseM$OMIM, sep = ".")
baseM.num <- baseM[sapply(baseM, is.numeric)]
names <- rownames(baseM.num.fltr)
d.density <- function(X, n){
#print(X)
d <- density(as.numeric(as.matrix(X)), na.rm=T)
peaks <- NULL
for (i in 2:(length(d$y)-1)) {
if (d$y[i-1] >= d$y[i] & d$y[i] <= d$y[i+1]) {
peaks <- cbind(peaks, c(d$x[i], d$y[i]))
}}
df <- data.frame(test =as.numeric(as.matrix(X)))
g <- ggplot(df, aes(x = as.numeric(as.matrix(test)))) +
geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)
ggsave(filename=paste("/work/gad/shared/analyse/STR/Marine/analysis/output/annotation/R_plots/", n, ".png", sep=""), plot=g)
#q <- plot(d)
#png(file=file_name)
#print(q)
#dev.off()
return(peaks)
}
baseM.num.fltr$peaks <- apply(temp, 1 , d.density, n=names)
I get my peaks correctly, but something is obviously wrong with the plot. I'm not sure whether the way I pass the name is correct, or whether something else would be better or easier. I tried two ways of plotting, with and without ggplot2, but neither works. Thanks for your help!
This is the error I get:
NULL
Error: `device` must be NULL, a string or a function.
>
Example of my data :
> head(baseM)
motif chromosome intervalle gene localisation
1 AAAAAAAAAAAAAAAAAAAC chr2 (69131154, 69132154) BMP10 intergenic
2 AAAAAAAAAAAAAAAAAAAC chr2 (237411093, 237412093) IQCA1 intronic
3 AAAAAAAAAAAAAAAAAAAC chr2 (44378070, 44379070) LRPPRC intergenic
4 AAAAAAAAAAAAAAAAAAAC chr2 (105218444, 105219444) LINC01102 intergenic
5 AAAAAAAAAAAAAAAAAAAC chr2 (124310903, 124311903) LINC01826 intergenic
6 AAAAAAAAAAAAAAAAAAAC chr2 (30730559, 30731559) LCLAT1 intronic
OMIM
1 .
2 .
3 .,Mitochondrial complex IV deficiency, nuclear type 5, (French-Canadian), 220111 (3)
4 .
5 .
6 .
dijen003 dijen004 dijen005 dijen006 dijen007 dijen008 dijen009 dijen010
1 NA NA NA NA NA NA NA NA
2 7 NA NA NA NA NA NA NA
3 NA NA NA 5 NA NA NA NA
4 NA NA NA 5 NA NA NA NA
5 NA NA NA 5 NA NA NA NA
6 NA NA NA NA 5 NA NA NA
dijen011 dijen012 dijen013 dijen014 dijen015 dijen016 dijen017 dijen018
1 NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA
(Sorry, I know it's a short example, but the data is really big, and of course not all rows have that many NAs.)
For the device argument, use png or 'png'. (png() will also work, but only when the filename includes the '.png' extension.)
Example:
library(tidyverse)
set.seed(1L)
df <- tibble(a = rnorm(10))
df %>% ggplot(aes(a)) + geom_density()
ggsave("foo.png", device = "png")
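Separately from the device argument, the way the name is passed may also matter: apply() forwards n = names unchanged to every call, so paste() inside the function builds one file name per element of names rather than a single one. A minimal sketch of passing one name per row, using a hypothetical toy matrix m and a stand-in helper out_file() that only builds the file name (no plotting):

```r
# Hypothetical toy data: two "genes" with a few numeric values each.
m <- matrix(c(5, 7, 5, 6,
              9, 5, 5, 8),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("geneA", "geneB"), NULL))

# Stand-in for the plotting function: it only builds the output file name.
out_file <- function(x, n) paste0(n, ".png")

# Passing the whole name vector (what apply(..., n = names) does on every call)
# yields one file name per name, not one per row.
all_at_once <- out_file(m[1, ], rownames(m))

# Iterating over row indices hands exactly one name to each call.
files <- vapply(seq_len(nrow(m)),
                function(i) out_file(m[i, ], rownames(m)[i]),
                character(1))
files
```

In the real function, the single name would then go into the filename argument of ggsave().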

Assign specific rows in one list as NAs based on a second list of row numbers

Here is some code to simulate my problem. The simulated data has the same dimensions (74 subjects, 178 time points in a time series, 294 variables + 8 nuisance variables)
fulldata = lapply(1:74, function(i) matrix(rnorm(300,0,1), ncol=300,nrow=178))
rownumbers = seq(1:178)
badrows = lapply(1:74, function(i) sample(rownumbers, size=10, rownumbers,replace=FALSE))
Now what I need to do is this: for each matrix in the list "fulldata", replace the rows listed in the corresponding vector of the list "badrows" with NAs.
These are time points that are corrupted and will be interpolated. But first the bad values must be replaced with NAs.
This doesn't work.
lapply(1:74, function(l) lapply(1:74, function(l) fulldata[[l]][badrows[[l]],1:294]<-NA))
returns list that looks like this:
[[74]][[72]]
[1] NA
[[74]][[73]]
[1] NA
[[74]][[74]]
[1] NA
This doesn't work either.
lapply(1:74, function(l) fulldata[[l]][badrows[[l]],1:294]<-NA)
Returns list that looks like this:
[[72]]
[1] NA
[[73]]
[1] NA
[[74]]
[1] NA
This just returns a vector of NAs
sapply(1:74, function(i) fulldata[[i]][badrows[[i]],1:294] <- NA)
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[34] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[67] NA NA NA NA NA NA NA NA
I also tried some stuff with mapply but lost the lines when R froze up and don't recall exactly what I did. What I am expecting is for the output to be like this, where the "bad rows" are replaced by NA just for columns 1-294 and 295-300 are returned unchanged :
(Can't get the table to appear right here, but leaving it anyway)
| var1 | var2 | var3 | ........ | var295 | ..... | var300 |
|------|------|------|----------|--------|-------|--------|
| 3 | 1 | 5 | ....... | .72 | ..... | .23 |
| NA | NA | NA | ........ | .10 | ..... | .98 |
| 5 | 7 | 12 | ........ | .42 | ..... | 1.2 |
Here's one way:
lapply(1:74, function(iii) "[<-"(fulldata[[iii]], badrows[[iii]],, NA))
which is equivalent to
mapply(function(x,y) "[<-"(x, y,, NA), fulldata, badrows, SIMPLIFY = FALSE) # without setting SIMPLIFY to FALSE you get one large matrix
and
mapply("[<-", fulldata, i=badrows, MoreArgs=alist(j=, value=NA), SIMPLIFY=FALSE)
# j= # this corresponds to the empty second argument in [i,j]
Your code above suffers from the fact that subset assignment returns the value that is assigned, not the whole object.
sapply(1:74, function(i) fulldata[[i]][badrows[[i]],1:294] <- NA)
# does the same thing as ...
sapply(1:74, function(i) NA)
To improve this, you could make the function return the whole object (with lapply, so that the result stays a list of matrices):
lapply(1:74, function(i) {fulldata[[i]][badrows[[i]],1:294] <- NA; fulldata[[i]]})
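To check the behaviour on something small, here is a sketch with made-up dimensions (3 matrices of 5 x 4, NAs written only into columns 1 to 3). Map() is used so that the result stays a list of matrices. The sizes and the column range are assumptions for illustration, not the real 178 x 300 data.

```r
set.seed(1)

# Small stand-ins for fulldata and badrows (hypothetical sizes).
fulldata <- lapply(1:3, function(i) matrix(rnorm(5 * 4), nrow = 5, ncol = 4))
badrows  <- lapply(1:3, function(i) sample(5, size = 2))

# Pair each matrix with its own bad rows, blank columns 1:3 in those rows,
# and return the whole modified matrix rather than the assigned value.
cleaned <- Map(function(m, rows) {
  m[rows, 1:3] <- NA
  m
}, fulldata, badrows)

cleaned[[1]]
```

Because the function ends with m, each list element is the full matrix, with the last column left untouched.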

Searching pairs in matrix in R

I am rather new to R, so I would be grateful if anyone could help me :)
I have a large matrix, for example:
matrix
and a vector of genes.
My task is to search the matrix row by row, pair each gene that has a mutation (D707H in the matrix shown) with the rest of the genes contained in the vector, and add the pairs to a new matrix. I tried to do this with loops, but I have no idea how to write it correctly. For this matrix the result should look something like this:
PR.02.1431
NBN BRCA1
NBN BRCA2
NBN CHEK2
NBN ELAC2
NBN MSR1
NBN PARP1
NBN RNASEL
This is what I have so far:
my idea
"a" is my initial matrix.
Can anyone point me in the right direction? :)
Perhaps what you want/need is which(..., arr.ind = TRUE).
Some sample data, for demonstration:
set.seed(2)
n <- 10
mtx <- array(NA, dim = c(n, n))
dimnames(mtx) <- list(letters[1:n], LETTERS[1:n])
mtx[sample(n*n, size = 4)] <- paste0("x", 1:4)
mtx
# A B C D E F G H I J
# a NA NA NA NA NA NA NA NA NA NA
# b NA NA NA NA NA NA NA NA NA NA
# c NA NA NA NA NA NA NA NA NA NA
# d NA NA NA NA NA NA NA NA NA NA
# e NA NA NA NA NA NA NA NA NA NA
# f NA NA NA NA NA NA NA NA NA NA
# g NA "x4" NA NA NA "x3" NA NA NA NA
# h NA NA NA NA NA NA NA NA NA NA
# i NA "x1" NA NA NA NA NA NA NA NA
# j NA NA NA NA NA NA "x2" NA NA NA
In your case, it appears that you want anything that is not an NA or NaN. You might try:
which(! is.na(mtx) & ! is.nan(mtx))
# [1] 17 19 57 70
but that isn't always intuitive when retrieving the row/column pairs (genes, I think?). Try instead:
ind <- which(! is.na(mtx) & ! is.nan(mtx), arr.ind = TRUE)
ind
# row col
# g 7 2
# i 9 2
# g 7 6
# j 10 7
How to use this: the integers are row and column indices, respectively. Assuming your matrix is using row names and column names, you can retrieve the row names with:
rownames(mtx)[ ind[,"row"] ]
# [1] "g" "i" "g" "j"
(An astute reader might suggest I use rownames(ind) instead. It certainly works!) Similarly for the colnames and "col".
Interestingly enough, even though ind is a matrix itself, you can subset mtx fairly easily with:
mtx[ind]
# [1] "x4" "x1" "x3" "x2"
Combining all three together, you might be able to use:
data.frame(
gene1 = rownames(mtx)[ ind[,"row"] ],
gene2 = colnames(mtx)[ ind[,"col"] ],
val = mtx[ind]
)
# gene1 gene2 val
# 1 g B x4
# 2 i B x1
# 3 g F x3
# 4 j G x2
I found my mistake; now I have a matrix. Your code works well, but it is not exactly what I want to do.
a, b, c, d etc. are organisms and row names are genes (A, B, C, D etc.). I have to combine pairs of genes where one of them (in the same column) has something other than an NA value. For example, if gene A has value=4 in column a, I need:
gene1 gene2
a A B
a A C
a A D
a A E
I tried it this way, but the numbers of elements do not match and I do not know how to solve this.
ind= which(! is.na(a) & ! is.nan(a), arr.ind = TRUE)
ind1=which(macierz==1,arr.ind = TRUE)
ramka= data.frame(
kolumna = rownames(a)[ ind[,"row"] ],
gene1 = colnames(a)[ ind[,"col"] ],
gene2 = colnames(a)[ind1[,"col"]],
#val = macierz[ind]
)
Do you know how to do this in R?
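One possible sketch of this pairing, building on which(..., arr.ind = TRUE). It assumes the layout described in the follow-up (genes as row names, organisms as column names) and uses a hypothetical toy matrix m; the column names organism, gene1 and gene2 are chosen here only for illustration.

```r
# Hypothetical toy matrix: genes in rows, organisms in columns.
genes <- c("A", "B", "C", "D", "E")
m <- matrix(NA_real_, nrow = 5, ncol = 2,
            dimnames = list(genes, c("a", "b")))
m["A", "a"] <- 4   # gene A has a value in organism a

# Row/column positions of every non-NA cell.
ind <- which(!is.na(m), arr.ind = TRUE)

# For each hit, pair the gene that has a value with all the other genes.
pairs <- do.call(rbind, lapply(seq_len(nrow(ind)), function(k) {
  g1 <- rownames(m)[ind[k, "row"]]
  data.frame(organism = colnames(m)[ind[k, "col"]],
             gene1    = g1,
             gene2    = setdiff(rownames(m), g1),
             stringsAsFactors = FALSE)
}))
pairs
```

Each non-NA cell contributes one block of rows, so the element counts always match, which is where the attempt with two separate index matrices runs into trouble.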

r: Extracting residuals of regressed data with different dimensions

I am running 500 linear regressions with a different dependent variable each time, but the same independent variables, using the following loop:
for(j in 1:500) {
lmj <- lm(formula = df[, j] ~ x1 + x2, data = df)
coeff[j,] <- t(lmj$coefficients)
}
However, the columns of df all have different 'start' and 'end' times, e.g.
> df[,1]
[1] NA NA NA NA NA NA NA NA
[9] NA NA NA NA NA NA NA NA
[17] NA NA -12.56643 2.90788 -15.80776 10.35763 18.22261 -8.33948
[25] -11.92777 3.35641 -9.13571 -27.46489 -14.18712 -3.75335 3.60028 -0.64753
[33] 1.07798 12.67291 8.83168 2.20233 11.13526 8.75306
> df[,2]
[1] NA NA NA NA NA NA NA 4.59821
[9] 1.80505 0.88652 1.05448 -7.39130 -0.46957 -5.85455 7.66825 -3.12985
[17] -6.58715 -9.43875 NA NA NA NA NA NA
[25] NA NA NA NA NA NA NA NA
[33] NA NA NA NA NA NA
(Note all observations of the dependent variables are consecutive, and there are no NA values in either x1 or x2, i.e. x1 and x2 are both (38x1) vectors. Incidentally, x1 and x2 are the 501st and 502nd columns of df).
How can I save the residuals from each of these 500 regressions?
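One way this could be sketched is to fit each model with na.action = na.exclude, so that residuals() pads the rows dropped for NA and every regression returns a vector of the same length. The data frame below is a small made-up stand-in (two dependent columns and 10 rows instead of 500 and 38), and resid_mat is a name introduced only for this example.

```r
set.seed(1)
n <- 10

# Toy stand-in: y1 starts late, y2 ends early; x1 and x2 have no NAs.
df <- data.frame(
  y1 = c(NA, NA, rnorm(8)),
  y2 = c(rnorm(6), NA, NA, NA, NA),
  x1 = rnorm(n),
  x2 = rnorm(n)
)

n_dep <- 2   # would be 500 for the real data
resid_mat <- matrix(NA_real_, nrow = nrow(df), ncol = n_dep)

for (j in seq_len(n_dep)) {
  # na.exclude makes residuals() return NA for rows that were dropped.
  lmj <- lm(df[, j] ~ x1 + x2, data = df, na.action = na.exclude)
  resid_mat[, j] <- residuals(lmj)
}
resid_mat
```

Column j of resid_mat then lines up row by row with df[, j], with NA wherever the dependent variable was missing.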

Removing duplicates in vector but preserving order

Suppose a vector :
vec = c(NA,NA,1,NA,NA,NA,1,NA,NA,0,NA,NA,0,NA,NA,0,NA,NA,1,NA,NA,1,NA,NA,0,NA,0)
I would like to get :
vec = c(NA,NA,1,NA,NA,NA,NA,NA,NA,0,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,0,NA,NA)
I have tried a for loop with an if checking if the value is equal to the previous non NA, but it doesn't work when it is repeated more than once.
The approach from "Remove duplicates in vector to next value" doesn't work either, since I want to keep my NAs.
You can do this with a little bit of logic and a compound [ and [<- operation. First we need to find the duplicates. We'll do this with diff() on all the non NA values...
diff( vec[ ! is.na( vec ) ] )
[1] 0 -1 0 0 1 0 -1 0
Each 0 is a duplicate. Now we need to find their positions in vec and set them to NA.
# This gives us a vector of TRUE/FALSE values which we will use to subset vec to the values we want to change
dups <- c( 1 , diff( vec[ ! is.na( vec ) ] ) ) == 0
# Now subset vec to non NA values and change the duplicates to NA
vec[ ! is.na( vec ) ][ dups ] <- NA
# [1] NA NA 1 NA NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA 1 NA NA NA
#[26] NA NA 0 NA NA
Use duplicated:
vec[duplicated(vec, incomparables=NA)] <- NA
You could omit the incomparables parameter in your example:
vec[duplicated(vec)] <- NA
According to the documentation this might be faster, but you'd need to benchmark it yourself.
Edit:
After clarification:
vec <- c(NA,NA,1,NA,NA,NA,1,NA,NA,NA,NA,0,NA,NA,0,NA,NA,0,NA,NA,NA,1,NA,NA,1,NA,NA,0,NA,0)
vec2 <- c(NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,0,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,0,NA,NA)
tmp <- vec[!is.na(vec)]
tmp[c(FALSE, diff(tmp)==0)] <- NA
vec[!is.na(vec)] <- tmp
identical(vec, vec2)
#[1] TRUE
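The same steps can be wrapped in a small helper. This is a sketch: drop_repeats is a name made up here, and it is checked against the 27-element vector and the desired result given in the question.

```r
# Set a non-NA value to NA when it equals the previous non-NA value;
# existing NAs keep their positions.
drop_repeats <- function(x) {
  keep <- !is.na(x)
  vals <- x[keep]
  # The first non-NA value is never a repeat, hence the leading FALSE.
  vals[c(FALSE, diff(vals) == 0)] <- NA
  x[keep] <- vals
  x
}

vec <- c(NA,NA,1,NA,NA,NA,1,NA,NA,0,NA,NA,0,NA,NA,0,NA,NA,1,NA,NA,1,NA,NA,0,NA,0)
expected <- c(NA,NA,1,NA,NA,NA,NA,NA,NA,0,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,0,NA,NA)

out <- drop_repeats(vec)
identical(out, expected)
```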
I think this does it:
vrl<-rle(vec)
diff(vrl$values[!is.na(vrl$values)])->vdif
vdif<-c(1,vdif)
vrl$values[!is.na(vrl$values)][vdif==0]<-NA
inverse.rle(vrl)
# [1] NA NA 1 NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA
#[19] 1 NA NA NA NA NA 0 NA NA
The trick in there was to prepend a 1 to the difference vector so that the very first non-NA location is preserved.
