This is almost certainly a duplicate question, but I can't find an answer anywhere on SO. Most of the other similar questions relate to subsetting from one column, rather than an entire dataframe.
I have a dataframe:
test = data.frame(
'A' = c(.31562, .48845, .27828, -999),
'B' = c(.5674, 5.7892, .4687, .1345),
'C' = c(-999, .3145, .0641, -999))
I want to drop rows where any column contains -999, so that my dataframe will look like this:
A B C
2 0.48845 5.7892 0.3145
3 0.27828 0.4687 0.0641
I am sure there is an easy way to do this with the subset() function, or apply(), but I just can't figure it out.
I tried this:
test[apply(test, MARGIN = 1, FUN = function(x) {-999 != x}), ]
But it returns:
A B C
1 0.31562 0.5674 -999.0000
2 0.48845 5.7892 0.3145
4 -999.00000 0.1345 -999.0000
NA NA NA NA
NA.1 NA NA NA
NA.2 NA NA NA
NA.3 NA NA NA
NA.4 NA NA NA
NA.5 NA NA NA
Use which with arr.ind = TRUE to obtain the rows where -999 is present (which(test == -999, arr.ind = TRUE)[,1]) and remove those rows.
test[-unique(which(test == -999, arr.ind = TRUE)[,1]),]
# A B C
#2 0.48845 5.7892 0.3145
#3 0.27828 0.4687 0.0641
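One caveat with negative indexing: if no cell equals -999, which() returns an empty vector, and indexing with -integer(0) selects zero rows rather than all of them. A minimal sketch of a logical-mask alternative, assuming every column is numeric and contains no NA (the names keep, result, clean and clean_result are only for illustration):

```r
test <- data.frame(
  A = c(.31562, .48845, .27828, -999),
  B = c(.5674, 5.7892, .4687, .1345),
  C = c(-999, .3145, .0641, -999))

# TRUE for rows in which no column equals -999
keep <- rowSums(test == -999) == 0
result <- test[keep, ]

# A frame without any -999 keeps all of its rows
clean <- data.frame(A = c(1, 2), B = c(3, 4))
clean_result <- clean[rowSums(clean == -999) == 0, ]
```

Because the mask is logical, an all-TRUE mask simply returns the frame unchanged.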
We can also use Reduce to combine the per-column comparisons with |:
test[!Reduce(`|`, lapply(test, `==`, -999)),]
# A B C
#2 0.48845 5.7892 0.3145
#3 0.27828 0.4687 0.0641
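To see what the Reduce call combines, the intermediate pieces can be inspected one at a time. A small sketch on the same data (the names hits and any_hit are only for illustration):

```r
test <- data.frame(
  A = c(.31562, .48845, .27828, -999),
  B = c(.5674, 5.7892, .4687, .1345),
  C = c(-999, .3145, .0641, -999))

# One logical vector per column: TRUE where the value is -999
hits <- lapply(test, `==`, -999)

# Element-wise OR across the columns: TRUE if any column in that row is -999
any_hit <- Reduce(`|`, hits)

result <- test[!any_hit, ]
```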
I have what some of you might categorise as a dumb question, but I cannot solve it. I have this vector:
a <- c(NA,NA,TRUE,NA,TRUE,NA,TRUE)
And I want to get this in a new vector:
b <- c(NA,NA,1,NA,2,NA,3)
That simple. Everything I have tried fails to preserve the NAs, and I need them untouched. I would prefer a base R solution.
In base R, use cumsum() while excluding the NA values:
a <- c(NA,NA,TRUE,NA,TRUE,NA,TRUE)
a[!is.na(a)] <- cumsum(a[!is.na(a)])
Output:
[1] NA NA 1 NA 2 NA 3
Using replace from base R
b <- replace(a, !is.na(a), seq_len(sum(a, na.rm = TRUE)))
b
[1] NA NA 1 NA 2 NA 3
Or a slightly more compact option (if the values are logical/numeric):
cumsum(!is.na(a)) * a
[1] NA NA 1 NA 2 NA 3
Update
If the OP's vector also contains FALSE values, for example
a <- c(NA,NA,TRUE,NA,FALSE,NA,TRUE)
(a|!a) * cumsum(replace(a, is.na(a), 0))
[1] NA NA 1 NA 1 NA 2
Replacing the non-NAs with their cumsum:
replace(a, !is.na(a), cumsum(na.omit(a)))
# [1] NA NA 1 NA 2 NA 3
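The approaches above can be wrapped in a small helper. This is only a sketch: count_true_keep_na is a hypothetical name, and it assumes a logical input vector.

```r
# Hypothetical helper: running count of TRUE values, leaving NA positions as NA
count_true_keep_na <- function(x) {
  ok <- !is.na(x)                        # positions that hold a real value
  out <- rep(NA_integer_, length(x))     # start with NA everywhere
  out[ok] <- cumsum(x[ok])               # cumulative sum over the non-NA values only
  out
}

b1 <- count_true_keep_na(c(NA, NA, TRUE, NA, TRUE, NA, TRUE))
b2 <- count_true_keep_na(c(NA, NA, TRUE, NA, FALSE, NA, TRUE))
```

Unlike the in-place assignment, this leaves the original vector untouched and also handles vectors that contain FALSE.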
I have a data frame like this:
question <- data.frame("cars"=c(1,1,2,2,34),"bike"=c(1,1,2,2,37),"motorcycle"=c(3,3,2,2,45),
"trycicle"=c(3,3,4,4,56),"skate"=c(1,1,4,4,78))
I want to filter out the repeated values and keep only the numbers that occur once. Is this possible in R?
The new data frame has to look like this:
question2 <- data.frame("cars"=c(NA,NA,NA,NA,34),"bike"=c(NA,NA,NA,NA,37),"motorcycle"=c(NA,NA,NA,NA,45),
"trycicle"=c(NA,NA,NA,NA,56),"skate"=c(NA,NA,NA,NA,78))
You may use duplicated to turn the repeated values to NA.
library(dplyr)
question %>%
  mutate(across(everything(),
    ~replace(., duplicated(.) | duplicated(., fromLast = TRUE), NA)))
# cars bike motorcycle trycicle skate
#1 NA NA NA NA NA
#2 NA NA NA NA NA
#3 NA NA NA NA NA
#4 NA NA NA NA NA
#5 34 37 45 56 78
In base R you may use lapply -
question[] <- lapply(question, function(x)
replace(x, duplicated(x) | duplicated(x, fromLast = TRUE), NA))
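Why two duplicated() calls are needed is easiest to see on a single column. A short sketch using the values of the cars column (the intermediate names are only for illustration):

```r
x <- c(1, 1, 2, 2, 34)

# Flags every occurrence after the first one
first_pass <- duplicated(x)

# Scanning from the end flags every occurrence before the last one
last_pass <- duplicated(x, fromLast = TRUE)

# Together they flag every value that appears more than once
repeated <- first_pass | last_pass

result <- replace(x, repeated, NA)
```

With only the first call, the first 1 and the first 2 would survive; the fromLast pass is what removes them.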
I have a dataframe
df= data.frame(a=c(56,23,15,10),
b=c(43,NA,90.7,30.5),
c=c(12,7,10,2),
d=c(1,2,3,4),
e=c(NA,45,2,NA))
I want to select two random non-NA values from each row and convert the rest to NA.
Required output (will differ because of randomness):
df= data.frame(
a=c(56,NA,15,NA),
b=c(43,NA,NA,NA),
c=c(NA,7,NA,2),
d=c(NA,NA,3,4),
e=c(NA,45,NA,NA))
Code Used
I know how to select random non-NA values from a specific row:
set.seed(2)
sample(which(!is.na(df[1,])),2)
But I have no idea how to apply it to the whole dataframe and get the required output.
You may write a function to keep n random values in a row.
keep_n_value <- function(x, n) {
  x1 <- which(!is.na(x))     # positions of the non-NA values
  x[-sample(x1, n)] <- NA    # set everything except n sampled positions to NA
  x
}
Apply the function by row using base R -
set.seed(123)
df[] <- t(apply(df, 1, keep_n_value, 2))
df
# a b c d e
#1 NA NA 12 1 NA
#2 NA NA 7 2 NA
#3 NA 90.7 10 NA NA
#4 NA 30.5 NA 4 NA
Or if you prefer tidyverse -
purrr::pmap_df(df, ~keep_n_value(c(...), 2))
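Two edge cases are worth guarding against: a row with fewer than n non-NA values makes sample() fail, and a row with exactly one non-NA position p makes sample(p, n) draw from 1:p instead of from p itself. A hedged sketch of a guarded variant (keep_n_safe is a hypothetical name, not part of the answer above):

```r
# Hypothetical variant of keep_n_value() with guards for short rows
keep_n_safe <- function(x, n) {
  idx <- which(!is.na(x))
  if (length(idx) <= n) return(x)        # nothing to blank out
  # Sample positions within idx so a length-one idx is never expanded to 1:idx
  keep <- idx[sample.int(length(idx), n)]
  x[setdiff(seq_along(x), keep)] <- NA
  x
}

df <- data.frame(a = c(56, 23, 15, 10),
                 b = c(43, NA, 90.7, 30.5),
                 c = c(12, 7, 10, 2),
                 d = c(1, 2, 3, 4),
                 e = c(NA, 45, 2, NA))

set.seed(123)
out <- df
out[] <- t(apply(df, 1, keep_n_safe, n = 2))

# A row with a single non-NA value is returned unchanged
short_row <- keep_n_safe(c(NA, 5, NA), 2)
```

Which values survive depends on the seed, but every row of out keeps exactly two of its original values.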
Base R:
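You could try a column-wise apply (sapply) that randomly replaces two non-NA values in each column with NA. Note that this works per column and removes two values, so it does not guarantee exactly two remaining values per row: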
as.data.frame(sapply(df, function(x) replace(x, sample(which(!is.na(x)), 2), NA)))
Example Output:
a b c d e
1 56 NA 12 NA NA
2 23 NA NA 2 NA
3 NA NA 10 3 NA
4 NA 30.5 NA NA NA
One option using dplyr and purrr could be:
library(dplyr)
library(purrr)
df %>%
  mutate(pmap_dfr(across(everything()), ~ `[<-`(c(...), !seq_along(c(...)) %in% sample(which(!is.na(c(...))), 2), NA)))
a b c d e
1 56 43.0 NA NA NA
2 23 NA 7 NA NA
3 15 NA NA NA 2
4 NA 30.5 2 NA NA
I am rather new to R, so I would be grateful if anyone could help me :)
I have large matrices, for example:
[image: matrix]
and a vector of genes.
My task is to search the matrix row by row and pair each gene that has a mutation (D707H in the matrix shown) with the rest of the genes contained in the vector, then add the pairs to a new matrix. I tried to do this with loops, but I have no idea how to write it correctly. For this matrix, it should look something like this:
PR.02.1431
NBN BRCA1
NBN BRCA2
NBN CHEK2
NBN ELAC2
NBN MSR1
NBN PARP1
NBN RNASEL
Now I have something like this:
[image: my idea]
"a" is my initial matrix.
Can anyone point me in the right direction? :)
Perhaps what you want/need is which(..., arr.ind = TRUE).
Some sample data, for demonstration:
set.seed(2)
n <- 10
mtx <- array(NA, dim = c(n, n))
dimnames(mtx) <- list(letters[1:n], LETTERS[1:n])
mtx[sample(n*n, size = 4)] <- paste0("x", 1:4)
mtx
# A B C D E F G H I J
# a NA NA NA NA NA NA NA NA NA NA
# b NA NA NA NA NA NA NA NA NA NA
# c NA NA NA NA NA NA NA NA NA NA
# d NA NA NA NA NA NA NA NA NA NA
# e NA NA NA NA NA NA NA NA NA NA
# f NA NA NA NA NA NA NA NA NA NA
# g NA "x4" NA NA NA "x3" NA NA NA NA
# h NA NA NA NA NA NA NA NA NA NA
# i NA "x1" NA NA NA NA NA NA NA NA
# j NA NA NA NA NA NA "x2" NA NA NA
In your case, it appears that you want anything that is not an NA or NaN. You might try:
which(! is.na(mtx) & ! is.nan(mtx))
# [1] 17 19 57 70
but that isn't always intuitive when retrieving the row/column pairs (genes, I think?). Try instead:
ind <- which(! is.na(mtx) & ! is.nan(mtx), arr.ind = TRUE)
ind
# row col
# g 7 2
# i 9 2
# g 7 6
# j 10 7
How to use this: the integers are row and column indices, respectively. Assuming your matrix is using row names and column names, you can retrieve the row names with:
rownames(mtx)[ ind[,"row"] ]
# [1] "g" "i" "g" "j"
(An astute reader might suggest I use rownames(ind) instead. It certainly works!) Similarly for the colnames and "col".
Interestingly enough, even though ind is a matrix itself, you can subset mtx fairly easily with:
mtx[ind]
# [1] "x4" "x1" "x3" "x2"
Combining all three together, you might be able to use:
data.frame(
gene1 = rownames(mtx)[ ind[,"row"] ],
gene2 = colnames(mtx)[ ind[,"col"] ],
val = mtx[ind]
)
# gene1 gene2 val
# 1 g B x4
# 2 i B x1
# 3 g F x3
# 4 j G x2
I know where my mistake was; now I have a matrix. I analyzed your code and it works well, but that's not exactly what I want to do.
a, b, c, d etc. are organisms and row names are genes (A, B, C, D etc.). I have to combine pairs of genes where one of them (in the same column) has something other than an NA value. For example, if gene A has value=4 in column a, I need:
gene1 gene2
a A B
a A C
a A D
a A E
I tried it this way, but the number of elements does not match and I do not know how to solve this.
ind= which(! is.na(a) & ! is.nan(a), arr.ind = TRUE)
ind1=which(macierz==1,arr.ind = TRUE)
ramka= data.frame(
kolumna = rownames(a)[ ind[,"row"] ],
gene1 = colnames(a)[ ind[,"col"] ],
gene2 = colnames(a)[ind1[,"col"]],
#val = macierz[ind]
)
Do you know how to do this in R?
Probably simple but tricky question especially for larger data sets. Given two dataframes (df1,df2) of equal dimensions as below:
head(df1)
a b c
1 0.8569720 0.45839112 NA
2 0.7789126 0.36591578 NA
3 0.6901663 0.88095485 NA
4 0.7705756 0.54775807 NA
5 0.1743111 0.89087819 NA
6 0.5812786 0.04361905 NA
and
head(df2)
a b c
1 0.21210312 0.7670091 NA
2 0.19767464 0.3050934 1
3 0.08982958 0.4453491 2
4 0.75196925 0.6745908 3
5 0.73216793 0.6418483 4
6 0.73640209 0.7448011 5
How can one find all columns of df1 for which all(is.na(...)) is TRUE (in this case c), then go to df2 and set all values in the matching column (c) to NA?
Desired output
head(df3)
a b c
1 0.21210312 0.7670091 NA
2 0.19767464 0.3050934 NA
3 0.08982958 0.4453491 NA
4 0.75196925 0.6745908 NA
5 0.73216793 0.6418483 NA
6 0.73640209 0.7448011 NA
My actual dataframes have more than 140000 columns.
We can use colSums on the negated logical matrix (!is.na(df1)), negate (!) the resulting vector so that 0 non-NA elements becomes TRUE and all others FALSE, use this to subset the columns of 'df2', and assign NA to them.
df2[!colSums(!is.na(df1))] <- NA
df2
# a b c
#1 0.21210312 0.7670091 NA
#2 0.19767464 0.3050934 NA
#3 0.08982958 0.4453491 NA
#4 0.75196925 0.6745908 NA
#5 0.73216793 0.6418483 NA
#6 0.73640209 0.7448011 NA
Another option is to loop over the columns of 'df1', check whether all the elements are NA to create a logical vector, and use it to subset the columns of 'df2' and assign NA to them:
df2[sapply(df1, function(x) all(is.na(x)))] <- NA
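The column test can be checked on small frames. The values below are made up for illustration and only mimic the shape of the data in the question; non_na_counts, all_na and df3 are illustrative names.

```r
df1 <- data.frame(a = c(0.86, 0.78, 0.69),
                  b = c(0.46, 0.37, 0.88),
                  c = NA)
df2 <- data.frame(a = c(0.21, 0.20, 0.09),
                  b = c(0.77, 0.31, 0.45),
                  c = c(NA, 1, 2))

# Number of non-NA values per column of df1
non_na_counts <- colSums(!is.na(df1))

# Columns of df1 that are entirely NA
all_na <- non_na_counts == 0

# Blank out the matching columns in a copy of df2
df3 <- df2
df3[all_na] <- NA
```

Comparing the counts to zero gives the same logical vector as negating them with !, and may read more clearly.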
If these are big datasets, another option would be set from data.table (should be more efficient as this does the assignment in place)
library(data.table)
setDT(df2)
j1 <- which(sapply(df1, function(x) all(is.na(x))))
for(j in j1){
  set(df2, i = NULL, j = j, value = NA)
}