replace <NA> with NA - r

I have a data frame containing "<NA>" entries (stored as text); it appears that these values are not treated as NA, since is.na() returns FALSE. I would like to convert these values to NA but could not find a way to do it.

Use dfr[dfr=="<NA>"] = NA, where dfr is your data frame.
For example:
> dfr<-data.frame(A=c(1,2,"<NA>",3),B=c("a","b","c","d"))
> dfr
A B
1 1 a
2 2 b
3 <NA> c
4 3 d
> is.na(dfr)
A B
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] FALSE FALSE
> dfr[dfr=="<NA>"] = NA   # key step
> is.na(dfr)
A B
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] TRUE FALSE
[4,] FALSE FALSE

The two classes where this is likely to be an issue are character and factor. This should loop over a data frame and convert the "NA" values into true NAs, but only for those two classes:
make.true.NA <- function(x) {
  if (is.character(x) || is.factor(x)) { is.na(x) <- x == "NA"; x } else x
}
df[] <- lapply(df, make.true.NA)
(Untested in the absence of a data example.) The use of the form df_name[] will attempt to retain the structure of the original data frame, which would otherwise lose its class attribute (see the short check after the next code block). I see that ujjwal thinks your spelling of NA has flanking "<>" characters, so you might try this function as more general:
make.true.NA <- function(x) {
  if (is.character(x) || is.factor(x)) { is.na(x) <- x %in% c("NA", "<NA>"); x } else x
}
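A minimal check (my own illustration with a made-up data frame, not the OP's data) of why the df[] form matters:
dd <- data.frame(x = c("NA", "a"), y = c("b", "NA"), stringsAsFactors = FALSE)
dd[] <- lapply(dd, make.true.NA)   # dd keeps its data.frame class and dimensions
str(dd)                            # 'data.frame': 2 obs. of 2 variables, now with real NAs
# By contrast, dd <- lapply(dd, make.true.NA) would return a plain list.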

You can do this with the naniar package as well, using replace_with_na and associated functions.
dfr <- data.frame(A = c(1, 2, "<NA>", 3), B = c("a", "b", "c", "d"))
library(naniar)
# dev version - devtools::install_github('njtierney/naniar')
is.na(dfr)
#> A B
#> [1,] FALSE FALSE
#> [2,] FALSE FALSE
#> [3,] FALSE FALSE
#> [4,] FALSE FALSE
dfr %>% replace_with_na(replace = list(A = "<NA>")) %>% is.na()
#> A B
#> [1,] FALSE FALSE
#> [2,] FALSE FALSE
#> [3,] TRUE FALSE
#> [4,] FALSE FALSE
# You can also specify how to do this for many variables
dfr %>% replace_with_na_all(~.x == "<NA>")
#> # A tibble: 4 x 2
#> A B
#> <int> <int>
#> 1 2 1
#> 2 3 2
#> 3 NA 3
#> 4 4 4
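Note (not part of the original answer): the <int> columns above are a side effect of data.frame() converting character columns to factors by default on R versions before 4.0; if you want to keep the original values, you can build the example with, e.g.:
dfr <- data.frame(A = c(1, 2, "<NA>", 3), B = c("a", "b", "c", "d"),
                  stringsAsFactors = FALSE)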
You can read more about using replace_with_na in the naniar documentation.

Related

Find which elements of a nested list are dataframes

Suppose that I have a nested list like the following
test <- list(
a = data.frame(x = 1),
b = "foo",
c = list(
d = 1:5,
e = data.frame(y = 1),
f = "a",
list(g = "hello")
)
)
test
#> $a
#> x
#> 1 1
#>
#> $b
#> [1] "foo"
#>
#> $c
#> $c$d
#> [1] 1 2 3 4 5
#>
#> $c$e
#> y
#> 1 1
#>
#> $c$f
#> [1] "a"
#>
#> $c[[4]]
#> $c[[4]]$g
#> [1] "hello"
I want to know the location of character elements in this nested list. In this
case, I want to return a named vector or a named list with TRUE if the element
is a character and FALSE otherwise.
I can do that with rapply(), which unlists everything:
rapply(test, is.character)
#> a.x b c.d c.e.y c.f c.g
#> FALSE TRUE FALSE FALSE TRUE TRUE
However, I can't do that to find all data frames, because rapply() also unlists data frames (note that the first element is a.x and not just a).
rapply(test, is.data.frame)
#> a.x b c.d c.e.y c.f c.g
#> FALSE FALSE FALSE FALSE FALSE FALSE
Therefore, is there a way to find which elements of a nested list are data frames?
Note that the solution should work with any number of levels in the nested
list.
I’m looking for a solution in base R only.
1) rrapply
library(rrapply)
cls <- c("data.frame", "ANY")
rrapply(test, f = is.data.frame, classes = cls, how = "unlist")
## a b c.d c.e c.f c.g
## TRUE FALSE FALSE TRUE FALSE FALSE
2) recursion
findDF <- function(x) {
  if (is.data.frame(x)) TRUE
  else if (is.list(x)) lapply(x, findDF)
  else FALSE
}
unlist(findDF(test))
## a b c.d c.e c.f c.g
## TRUE FALSE FALSE TRUE FALSE FALSE
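If you only want the names of the data-frame elements (my own follow-up, not part of the original answer), you can filter the unlisted logical vector:
found <- unlist(findDF(test))
names(found)[found]
## [1] "a"   "c.e"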

Comparing 2 data frame with different size

I would like to check if the values in my data frame df are larger than the thresholds in df2. I tried making df2 the same size as df to test against the threshold, but is there an alternative way to do this?
> df
A B C
5 12 -5
4 4 0
15 5 9
1 11 1
11 1 -3
> df2
A B C
5 6 3
I tried replicating the row of df2 and then checking if df > df2:
> df2
A B C
5 6 3
5 6 3
5 6 3
5 6 3
5 6 3
dput
> dput(df)
structure(list(A = c(5, 4, 15, 1, 11), B = c(12, 4, 5, 11, 1),
C = c(-5, 0, 9, 1, -3)), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
> dput(df2)
structure(list(A = 5, B = 6, C = 3), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame"))
You can try using sweep:
sweep(df, 2, unlist(df2), `>`)
# A B C
#[1,] FALSE TRUE FALSE
#[2,] FALSE FALSE FALSE
#[3,] TRUE FALSE TRUE
#[4,] FALSE TRUE FALSE
#[5,] TRUE FALSE FALSE
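As a usage sketch (not part of the original answer), the logical matrix can then be used to filter df itself, e.g. to keep rows where at least one value exceeds its threshold:
cmp <- sweep(df, 2, unlist(df2), `>`)
df[rowSums(cmp) > 0, ]   # drops row 2, the only row with no TRUE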
Using tidyverse
library(dplyr)
df %>%
mutate(across(everything(), ~ . > df2[[cur_column()]]))
# A tibble: 5 x 3
# A B C
# <lgl> <lgl> <lgl>
#1 FALSE TRUE FALSE
#2 FALSE FALSE FALSE
#3 TRUE FALSE TRUE
#4 FALSE TRUE FALSE
#5 TRUE FALSE FALSE
Or using map2
library(purrr)
map2_df(df, df2, `>`)
# A tibble: 5 x 3
# A B C
# <lgl> <lgl> <lgl>
#1 FALSE TRUE FALSE
#2 FALSE FALSE FALSE
#3 TRUE FALSE TRUE
#4 FALSE TRUE FALSE
#5 TRUE FALSE FALSE
You can use the following code:
setNames(data.frame(do.call("cbind", lapply(names(df), function(nam) {
  df[[nam]] > df2[[nam]]
}))), names(df))
# A B C
#1 FALSE TRUE FALSE
#2 FALSE FALSE FALSE
#3 TRUE FALSE TRUE
#4 FALSE TRUE FALSE
#5 TRUE FALSE FALSE
If it's enough to get the result as a named matrix (coercing to a data.frame is quite time-consuming if not really needed), you can just do:
comparedMatrix <- do.call("cbind", lapply(names(df), function(nam) {
  df[[nam]] > df2[[nam]]
}))
colnames(comparedMatrix) <- names(df)
comparedMatrix
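If a data frame is needed again later, the matrix converts back cheaply at that point (a small usage note, not part of the original answer):
as.data.frame(comparedMatrix)   # same TRUE/FALSE values, with the column names kept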
A base R option using t + unlist
> t(t(df)> unlist(df2))
A B C
[1,] FALSE TRUE FALSE
[2,] FALSE FALSE FALSE
[3,] TRUE FALSE TRUE
[4,] FALSE TRUE FALSE
[5,] TRUE FALSE FALSE

How to subset data frame using row & col indices stored in another data frame?

I have a data frame with numbers called ‘m_df’ and another logical data frame called ‘pos’.
I saved the coordinates (row and col) of the TRUE values in another data frame (‘true_pos’)
and would like to extract the numbers corresponding to these coordinates from the m_df.
What would be the best way to do this, please?
set.seed(123)
m <- matrix(rnorm(3*4), 3, 4)
m
#> [,1] [,2] [,3] [,4]
#> [1,] -0.5604756 0.07050839 0.4609162 -0.4456620
#> [2,] -0.2301775 0.12928774 -1.2650612 1.2240818
#> [3,] 1.5587083 1.71506499 -0.6868529 0.3598138
m_df <- as.data.frame(m)
pos <- (m_df < 0.36 & m_df > 0.0)
pos
#> V1 V2 V3 V4
#> [1,] FALSE TRUE FALSE FALSE
#> [2,] FALSE TRUE FALSE FALSE
#> [3,] FALSE FALSE FALSE TRUE
true_pos <- which(pos==TRUE, arr.ind = TRUE)
true_pos
#> row col
#> [1,] 1 2
#> [2,] 2 2
#> [3,] 3 4
We can just use the matrix as a row/column index for extracting the elements from either the data.frame or the matrix:
m_df[true_pos]
Also, we don't need to convert to row/col index. Here, just
m_df[pos]
is enough
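Both forms should return the three values flagged TRUE above (a quick check against the matrix printed earlier, so treat the digits as approximate):
m_df[true_pos]
## roughly 0.0705, 0.1293 and 0.3598
m_df[pos]
## the same three values, extracted in column-major order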

Find similar groups of numbers across rows R

I'm trying to find similar patterns of numbers across a dataframe. I have a dataframe with 5 columns and some columns have a random number between 3 and 50. However, for some rows 2 or 3 columns don't have a number.
A B C D E
5 23 6
9 33 7 8 12
33 7 14
6 18 23 48
8 44 33 7 9
I want to know what the recurring numbers are, so I'm interested in:
Rows 1 and 4, which have the numbers 23 and 6,
Rows 2 and 5, which have the numbers 9, 33 and 8,
Rows 2, 3 and 5, which have the numbers 33 and 7.
Basically I'm trying to get the number of different combinations.
I'm a bit stuck about how to do this. I've tried to join the numbers in a list.
for (i in 1:dim(knots_all)[1]) {
knots_all$list_knots <- list(sort(knots_all[i,1:5]))
}
I've also tried intersect but it doesn't seem very efficient as R also considers the NAs which I want to disregard.
I would like to hear some ideas about the best way to achieve this. I've been thinking about this problem but I'm not able to understand how to get to the answer. My mind is stuck so any idea is much appreciated!
Thank you!
There's no specific/target pattern you want to capture. It seems like you need a process to identify the numbers that appear more often in your dataset and then see in which rows they appear.
I'll modify your example dataset to have number 23 appearing twice in the same row in order to illustrate some useful differences in counts.
df = read.table(text = "
A B C D E
5 23 6 23 NA
9 33 7 8 12
33 7 14 NA NA
6 18 23 48 NA
8 44 33 7 9
", header=T)
library(dplyr)
library(tidyr)
df %>%
  mutate(row_id = row_number()) %>%        # add a row flag
  gather(col_name, value, -row_id) %>%     # reshape to long format
  filter(!is.na(value)) %>%                # exclude NAs
  group_by(value) %>%                      # for each number value
  summarise(NumOccurences = n(),                                            # count occurrences
            rows = paste(sort(row_id), collapse = "_"),                     # capture rows
            NumRowOccurences = n_distinct(row_id),                          # count occurrences in unique rows
            unique_rows = paste(sort(unique(row_id)), collapse = "_")) %>%  # capture unique rows
  arrange(desc(NumOccurences))             # order by number popularity (occurrences)
# # A tibble: 12 x 5
# value NumOccurences rows NumRowOccurences unique_rows
# <int> <int> <chr> <int> <chr>
# 1 7 3 2_3_5 3 2_3_5
# 2 23 3 1_1_4 2 1_4
# 3 33 3 2_3_5 3 2_3_5
# 4 6 2 1_4 2 1_4
# 5 8 2 2_5 2 2_5
# 6 9 2 2_5 2 2_5
# 7 5 1 1 1 1
# 8 12 1 2 1 2
# 9 14 1 3 1 3
# 10 18 1 4 1 4
# 11 44 1 5 1 5
# 12 48 1 4 1 4
Make a list of lists: List = [1[], 2[], ..., n[]].
Loop through your data frame and, for your example, add A to the list at index 5, giving List = [1[], 2[], ..., 5[A], ..., n[]]. Do the same for every column.
After this loop, check which of the inner lists are filled and span multiple columns.
This should get you started.
Good luck.
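A rough base-R sketch of this idea (my own interpretation, assuming the data frame is called df, values lie between 3 and 50, and NAs are skipped):
value_index <- setNames(vector("list", 50), 1:50)   # one slot per possible value
for (i in seq_len(nrow(df))) {
  for (j in seq_along(df)) {
    v <- df[i, j]
    if (!is.na(v)) value_index[[v]] <- c(value_index[[v]], paste0("row", i, ":", names(df)[j]))
  }
}
Filter(function(x) length(x) > 1, value_index)   # values that appear in more than one cell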
This is an algorithm which can detect numbers present in two columns.
df <- data.frame(A = c(5, 23, 6, NA, NA),
B = c(9, 33, 7, 8, 12),
C = c(33, 7, 14, NA, NA),
D = c(6, 18, 23, 48, NA),
E = c(8, 44, 33, 7, 9))
L <- as.list(df)
LL <- rep(list(rep(list(NA), length(L))), length(L))
for (i in seq_along(L)) {
  for (j in seq_along(L)) {
    LL[[i]][[j]] <- intersect(L[[i]], L[[j]])
  }
}
To see the overlapping numbers in columns 1 and 4:
LL[[1]][[4]]
[1] 23 6 NA
To see all overlapping numbers:
unique(unlist(LL))
[1] 5 23 6 NA 9 33 7 8 12 14 18 48 44
It could be changed a little bit (by adding a level to the nested loop and an if in the for loop) to see the presence in 3 different columns, etc.
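For example, a sketch of the three-column version (my own extension of the code above; Reduce() makes it easy to generalise to any number of columns):
LLL <- rep(list(LL), length(L))
for (i in seq_along(L)) {
  for (j in seq_along(L)) {
    for (k in seq_along(L)) {
      LLL[[i]][[j]][[k]] <- Reduce(intersect, list(L[[i]], L[[j]], L[[k]]))
    }
  }
}
LLL[[2]][[3]][[5]]   # numbers present in columns B, C and E; should be 33 and 7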
One way of dealing with the NAs would be to temporarily fill them with randomly generated numbers:
# data
df <- data.frame(A = c(5,9,33,6,8),
B = c(23,33,7,18,44),
C = c(6,7,14,23,33),
D = c(NA, 8, NA, 48, 7),
E = c(NA, 12, NA, NA, 9))
# fill NA with random numbers
set.seed(1)
df2 <- as.data.frame(do.call(cbind, lapply(df, function(x) ifelse(is.na(x), rnorm(1), x))))
> df2
A B C D E
1 5 23 6 -0.6264538 0.1836433
2 9 33 7 8.0000000 12.0000000
3 33 7 14 -0.6264538 0.1836433
4 6 18 23 48.0000000 0.1836433
5 8 44 33 7.0000000 9.0000000
# split data by rows
df2 <- split(df2, seq_len(nrow(df2)))
# compare rows with each other
temp <- lapply(lapply(df2, function(x) lapply(df2, function(y) x %in% y)), function(x) do.call(rbind, x))
# delete self comparisons
output <- lapply(1:5, function(x) temp[[x]] <- temp[[x]][-x,])
Result:
[[1]]
[,1] [,2] [,3] [,4] [,5]
2 FALSE FALSE FALSE FALSE FALSE
3 FALSE FALSE FALSE TRUE TRUE
4 FALSE TRUE TRUE FALSE TRUE
5 FALSE FALSE FALSE FALSE FALSE
[[2]]
[,1] [,2] [,3] [,4] [,5]
1 FALSE FALSE FALSE FALSE FALSE
3 FALSE TRUE TRUE FALSE FALSE
4 FALSE FALSE FALSE FALSE FALSE
5 TRUE TRUE TRUE TRUE FALSE
[[3]]
[,1] [,2] [,3] [,4] [,5]
1 FALSE FALSE FALSE TRUE TRUE
2 TRUE TRUE FALSE FALSE FALSE
4 FALSE FALSE FALSE FALSE TRUE
5 TRUE TRUE FALSE FALSE FALSE
[[4]]
[,1] [,2] [,3] [,4] [,5]
1 TRUE FALSE TRUE FALSE TRUE
2 FALSE FALSE FALSE FALSE FALSE
3 FALSE FALSE FALSE FALSE TRUE
5 FALSE FALSE FALSE FALSE FALSE
[[5]]
[,1] [,2] [,3] [,4] [,5]
1 FALSE FALSE FALSE FALSE FALSE
2 TRUE FALSE TRUE TRUE TRUE
3 FALSE FALSE TRUE TRUE FALSE
4 FALSE FALSE FALSE FALSE FALSE

subset using `[`, explain NA output

If we have this data, recently used here:
data <- data.frame(name = rep(letters[1:3], each = 3),
var1 = rep(1:9), var2 = rep(3:5, each = 3))
name var1 var2
1 a 1 3
2 a 2 3
3 a 3 3
4 b 4 4
5 b 5 4
6 b 6 4
7 c 7 5
8 c 8 5
9 c 9 5
We can look for rows where var2 == 4:
data[data[,3] == 4 ,] # equally data[data$var2 == 4 ,]
# name var1 var2
#4 b 4 4
#5 b 5 4
#6 b 6 4
or rows where both var1 and var2 == 4:
data[data[,2] == 4 & data[,3] == 4,]
# name var1 var2
#4 b 4 4
What I don't get is why this:
data[ data[ , 2:3 ] == 4 ,]
gives this:
name var1 var2
4 b 4 4
NA <NA> NA NA
NA.1 <NA> NA NA
NA.2 <NA> NA NA
#I would still hope to get
# name var1 var2
#4 b 4 4
Where do the NAs come from?
Your logical that you're subsetting on is a matrix:
> sel <- data[ , 2:3 ] == 4
> sel
var1 var2
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] TRUE TRUE
[5,] FALSE TRUE
[6,] FALSE TRUE
[7,] FALSE FALSE
[8,] FALSE FALSE
[9,] FALSE FALSE
According to help("[.data.frame"):
Matrix indexing (x[i] with a logical or a 2-column integer matrix i)
using [ is not recommended, and barely supported. For extraction, x is
first coerced to a matrix. For replacement, a logical matrix (only)
can be used to select the elements to be replaced in the same way as
for a matrix.
But that implies this form:
> data[ sel ]
[1] "b" "4" "5" "6" "4"
Badness. What you're doing is even less sensical, though, in that you're telling it you want only the rows (with your trailing comma), and then giving it a matrix to index on!
> data[sel,]
name var1 var2
4 b 4 4
NA <NA> NA NA
NA.1 <NA> NA NA
NA.2 <NA> NA NA
If you really wanted to use the matrix form, you could use apply to apply a logical operation across rows.
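For instance (a sketch of that apply() approach, not part of the original answer):
data[ apply(data[ , 2:3 ] == 4, 1, all), ]   # rows where both columns equal 4; gives row 4 only
data[ apply(data[ , 2:3 ] == 4, 1, any), ]   # rows where either column equals 4; gives rows 4 to 6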
Your data[,2:3]==4 is the following:
R> data[,2:3]==4
var1 var2
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] TRUE TRUE
[5,] FALSE TRUE
[6,] FALSE TRUE
[7,] FALSE FALSE
[8,] FALSE FALSE
[9,] FALSE FALSE
Then you try to index the rows of your data frame with this matrix. To do this, R seems to first convert your matrix to a vector:
R> as.vector(data[,2:3]==4)
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12] FALSE TRUE TRUE TRUE FALSE FALSE FALSE
It then selects the rows of data based on this vector. The 4th TRUE value selects the 4th row, but the three other TRUE values select "out of bounds" rows, so they return NAs.
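You can see the same NA padding with explicit out-of-range row numbers (a quick illustration, not part of the original answer):
R> data[13:15, ]
     name var1 var2
NA   <NA>   NA   NA
NA.1 <NA>   NA   NA
NA.2 <NA>   NA   NA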
data[ data[ , 2 ] == 4 | data[,3] == 4,]
name var1 var2
4 b 4 4
5 b 5 4
6 b 6 4
I suspect your method does not work because c() builds a vector, whereas you need to compare the atomic elements.
Because you're not passing a vector but a matrix to the index:
> data[ , 2:3 ] == 4
var1 var2
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] TRUE TRUE
[5,] FALSE TRUE
[6,] FALSE TRUE
[7,] FALSE FALSE
[8,] FALSE FALSE
[9,] FALSE FALSE
If you want the matrix collapsed into a vector that indexing works with, here are two options:
data[ apply(data[ , 2:3 ] == 4, 1, all) ,]
data[ rowSums(data[ , 2:3 ] == 4) == 2 ,]
