R finding values in a data frame using | operator vs %in% - r

I'm trying to find all instances of certain values in a data frame, and replace them with NA. I tried this two different ways that I thought were equivalent, but I get different results. For example:
df <- data.frame(a=c(1,2),b=c(3,4))
df[df == 1 | df == 4] <- NA
gives me the expected result:
df
# a b
# 1 NA 3
# 2 2 NA
whereas
df <- data.frame(a=c(1,2),b=c(3,4))
df[df %in% c(1,4)] <- NA
does nothing:
df
# a b
# 1 1 3
# 2 2 4
This seems to be because if I use the "|" operator, it searches the data frame element by element, whereas if I use %in% it searches the data frame vector by vector (column by column), but I don't understand why.
df <- data.frame(a=c(1,2),b=c(3,4))
df == 1 | df == 4
# a b
# [1,] TRUE FALSE
# [2,] FALSE TRUE
df %in% c(1,4)
# [1] FALSE FALSE

If we look at the code for %in%
function (x, table)
match(x, table, nomatch = 0L) > 0L
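So it is basically doing a match. For df %in% c(1, 4) the data frame is passed as x and is treated as a list of two columns, so the result has one element per column
match(df, c(1, 4), nomatch = 0L) > 0L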
#[1] FALSE FALSE
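To see why the result has length two, it may help to recall that a data frame is a list of columns. A minimal base R sketch, using the df from the question (the variable names are only for illustration):

```r
df <- data.frame(a = c(1, 2), b = c(3, 4))

# A data frame is a list whose elements are its columns
is_list <- is.list(df)
n_elements <- length(df)   # number of columns, not number of cells

# %in% on the whole data frame tests each column as a single element,
# while %in% on one column tests that column's values
whole_df <- df %in% c(1, 4)
one_column <- df$a %in% c(1, 4)
```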
%in% works on vectors, not on a data.frame as a whole. So we loop through the columns with lapply and apply %in% to each one
lapply(df, `%in%`, c(1, 4))
If we need the result as a logical matrix, then use sapply
df[sapply(df, `%in%`, c(1, 4))] <- NA
We can check that match works on each column vector
sapply(df, match, table = c(1, 4), nomatch = 0L) > 0
# a b
#[1,] TRUE FALSE
#[2,] FALSE TRUE

%in% is only for vectors. In order to perform it on a dataframe you would have to use sapply to apply a function across each of the columns.
df[sapply(df, function(x) x %in% c(1, 4))] <- NA
a b
1 NA 3
2 2 NA
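As a variation on the same idea, the replacement can also be done column by column with lapply, which avoids building a logical matrix first. This is only a sketch; replace_values is a hypothetical helper name, not an existing function:

```r
replace_values <- function(d, values) {
  # d[] <- keeps the data.frame structure while replacing every column
  d[] <- lapply(d, function(col) replace(col, col %in% values, NA))
  d
}

df <- data.frame(a = c(1, 2), b = c(3, 4))
res <- replace_values(df, c(1, 4))
```

Because each column is handled separately, this also behaves sensibly for a one-row data frame, where sapply would return a plain vector rather than a matrix.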

Related

Convert nested list with different names to data.frame filling NA and adding column

I need a base R solution to convert nested list with different names to a data.frame
mylist <- list(list(a=1,b=2), list(a=3), list(b=5), list(a=9, z=list('k')))
convert(mylist)
## returns a data.frame:
##
## a b z
## 1 2 <NULL>
## 3 NA <NULL>
## NA 5 <NULL>
## 9 NA <chr [1]>
I know this could be easily done with dplyr::bind_rows or data.table::rbindlist with fill = TRUE (not ideal though since it fills character column with NULL, not NA), but I do really need a solution in base R. To simplify the problem, it is also fine with a 2-level nested list that has no 3rd level lists such as
mylist <- list(list(a=1,b=2), list(a=3), list(b=5), list(a=9, z='k'))
convert(mylist)
## returns a data.frame:
##
## a b z
## 1 2 NA
## 3 NA NA
## NA 5 NA
## 9 NA k
I have tried something like
convert <- function(L) as.data.frame(do.call(rbind, L))
This neither fills in NA nor adds the additional column z
Update
mylist here is just a simplified example. In reality I could not assume the names of the sublist elements (a, b and z in the example), nor the sublists lengths (2, 1, 1, 2 in the example).
Here are the assumptions for expected data.frame and the input mylist:
The column number of the expected data.frame is determined by the maximum length of the sublists which could vary from 1 to several hundreds. There is no explicit source of information about the length of each sublist (which names will appear or disappear in which sublist is unknown)
max(sapply(mylist, length)) <= 1000 ## ==> TRUE
The row number of the expected data.frame is determined by the length of mylist which could vary from 1 to several thousands
dplyr::between(length(mylist), 0, 10000) ## ==> TRUE
No explicit information for the names of the sublist elements and their orders, therefore the column names and order of the expected data.frame can only be determined intrinsically from mylist
Each sublist contains elements in types of numeric, character or list. To simplify the problem, consider only numeric and character.
A shorter solution in base R would be
make_df <- function(a = NA, b = NA, z = NA) {
data.frame(a = unlist(a), b = unlist(b), z = unlist(z))
}
do.call(rbind, lapply(mylist, function(x) do.call(make_df, x)))
#> a b z
#> 1 1 2 <NA>
#> 2 3 NA <NA>
#> 3 NA 5 <NA>
#> 4 9 NA k
Update
A more general solution using the same method, but which does not require specific names would be:
build_data_frame <- function(obj) {
nms <- unique(unlist(lapply(obj, names)))
frmls <- as.list(setNames(rep(NA, length(nms)), nms))
dflst <- setNames(lapply(nms, function(x) call("unlist", as.symbol(x))), nms)
make_df <- as.function(c(frmls, call("do.call", "data.frame", dflst)))
do.call(rbind, lapply(obj, function(x) do.call(make_df, x)))
}
This allows
build_data_frame(mylist)
#> a b z
#> 1 1 2 <NA>
#> 2 3 NA <NA>
#> 3 NA 5 <NA>
#> 4 9 NA k
We can try the base R code below
subset(
Reduce(
function(...) {
merge(..., all = TRUE)
},
Map(
function(k, x) cbind(id = k, list2DF(x)),
seq_along(mylist), mylist
)
),
select = -id
)
which gives
a b z
1 1 2 NA
2 3 NA NA
3 NA 5 NA
4 9 NA k
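For the simplified 2-level case, another base R possibility is to build the result one column at a time. The sketch below assumes every sublist element is a length-one numeric or character value; convert_sketch is a hypothetical name used only here:

```r
convert_sketch <- function(L) {
  # Union of all names, in order of first appearance
  nms <- unique(unlist(lapply(L, names)))
  # Build one column per name, using NA where a sublist lacks that name
  cols <- lapply(nms, function(nm) {
    unlist(lapply(L, function(x) if (is.null(x[[nm]])) NA else x[[nm]]))
  })
  names(cols) <- nms
  as.data.frame(cols, stringsAsFactors = FALSE)
}

mylist <- list(list(a = 1, b = 2), list(a = 3), list(b = 5), list(a = 9, z = "k"))
res <- convert_sketch(mylist)
```

The column names and their order come only from mylist itself, so nothing about a, b or z is hard-coded.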
You can do something like the following:
mylist <- list(list(a=1,b=2), list(a=3), list(b=5), list(a=9, z='k'))
convert <- function(mylist){
col_names <- NULL
# get all the unique names and create the df
for(i in seq_along(mylist)){
col_names <- c(col_names, names(mylist[[i]]))
}
col_names <- unique(col_names)
df <- data.frame(matrix(ncol=length(col_names),
nrow=length(mylist)))
colnames(df) <- col_names
# join data to row in df
for(i in seq_along(mylist)){
for(j in seq_along(mylist[[i]])){
df[i, names(mylist[[i]])[j]] <- mylist[[i]][names(mylist[[i]])[j]]
}
}
return(df)
}
df <- convert(mylist)
> df
a b z
1 1 2 <NA>
2 3 NA <NA>
3 NA 5 <NA>
4 9 NA k
I've got a solution. Note that the only dependency is the magrittr pipe (%>%); the same steps could be rewritten with the native pipe or plain nested calls.
mylist %>%
#' first, ensure that the 2nd level is flat,
lapply(. %>% lapply(FUN = unlist, recursive = FALSE)) %>%
#' replace missing vars with `NA`
lapply(function(x, vars) {
x[vars[!vars %in% names(x)]]<-NA
x[vars] # same element order everywhere, since rbind matches by position
}, vars = {.} %>% unlist() %>% names() %>% unique()) %>%
do.call(what = rbind) %>%
#' do nothing
identity()
The braces in {.} make sure the pipeline unlist() %>% names() %>% unique() is evaluated on the input. Without them, . %>% unlist() %>% names() only defines a function and does not evaluate it on the input (.).

How do I change 0 values in numeric data type columns to NA, without changing FALSE operators in logical data type columns to NA?

I want to convert all 0 numerical values in a spreadsheet to NA. The code below changes all of my 0 values in numerical columns to NA as intended, however, FALSE values in logical columns are also being changed to NA.
dataframe <- na_if(dataframe, 0)
Is there a way around this that doesn't require me splitting the data frame up into logical and numeric parts, converting 0s to NAs on the numeric data, then merging it? Thank you!
We may have to loop across only the numeric columns, because TRUE/FALSE are otherwise coerced to 1/0
library(dplyr)
dataframe <- dataframe %>%
mutate(across(where(is.numeric), ~ na_if(.x, 0)))
If we check the source code of na_if
...
x[x == y] <- NA
...
which does the conversion
> df1 <- data.frame(v1 = FALSE, v2 = c(0, 1))
> df1 == 0
v1 v2
[1,] TRUE TRUE
[2,] TRUE FALSE
Here, FALSE is also coerced to 0, so == returns TRUE for the logical column. Restricting na_if to the numeric columns avoids this, as the comparison below shows
> df1 %>%
mutate(across(where(is.numeric), ~ na_if(.x, 0)))
v1 v2
1 FALSE NA
2 FALSE 1
> na_if(df1, 0)
v1 v2
1 NA NA
2 NA 1
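The same restriction to numeric columns can be sketched in base R, without dplyr. This is an illustration under the assumption that only exact zeros should become NA; the name num is arbitrary:

```r
df1 <- data.frame(v1 = FALSE, v2 = c(0, 1))

# Identify numeric columns; logical columns are not numeric
num <- vapply(df1, is.numeric, logical(1))

# Replace zeros with NA in those columns only
df1[num] <- lapply(df1[num], function(x) replace(x, x %in% 0, NA))
```

Using %in% rather than == inside replace keeps the index free of NA when a column already has missing values.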

R, how to replace only the numeric values of a dataframe?

I am working on R 3.4.3 on Windows 10. I have a dataframe made of numeric values and characters.
I would like to replace only the numeric values but when I do that the characters also change and are replaced.
How can I edit my function to make it affect only the numeric values and not the characters?
Here is the piece of code of my function:
dataframeChange <- function(dFrame){
thresholdVal <- 20
dFrame[dFrame >= thresholdVal] <- -1
return(dFrame)
}
Here is a dataframe example:
example_df <- data.frame(
myNums = c (1:5),
myChars = c("A","B","C","D","E"),
stringsAsFactors = FALSE
)
Thanks for the help!
As Tim's comment points out, you need to know which columns are numeric. We can locate them with ind <- sapply(dFrame, is.numeric) and restrict the replacement to those columns
dataframeChange <- function(dFrame){
thresholdVal <- 20
# logical index of the numeric columns
ind <- sapply(dFrame, is.numeric)
# compare and replace within the numeric columns only
dFrame[ind][dFrame[ind] >= thresholdVal] <- -1
return(dFrame)
}
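The reason the original function also touches the character column can be checked directly. A small sketch, assuming a typical locale in which digits sort before letters:

```r
example_df <- data.frame(
  myNums = 1:5,
  myChars = c("A", "B", "C", "D", "E"),
  stringsAsFactors = FALSE
)

# Comparing a character value with a number coerces the number to a string,
# so the comparison is alphabetical rather than numeric
char_vs_num <- "A" >= 20          # same as "A" >= "20"

# The same coercion happens column by column in the data frame comparison
mask <- example_df >= 20
```

Here mask is TRUE for every cell of myChars and FALSE for every cell of myNums, which is why dFrame[dFrame >= thresholdVal] <- -1 overwrites the letters.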
Use mutate_if from dplyr:
library(dplyr)
example_df %>% mutate_if(is.numeric, ~ if_else(.x >= thresh, repl, .x))
myNums myChars
1 10 A
2 -1 B
3 -1 C
4 5 D
5 -1 E
Explanation:
The mutate family of functions is for variable assignment or updating.
mutate_if applies the given function (written here as a ~ lambda, since the older funs() wrapper is deprecated) only to columns which satisfy the first argument (in this case, is.numeric())
The updating function is a simple if_else clause based on OP rules.
Data:
thresh <- 20
repl <- -1.0
example_df <- data.frame(
myNums = c(10,20,30,5,70),
myChars = c("A","B","C","D","E"),
stringsAsFactors = FALSE
)
example_df
myNums myChars
1 10 A
2 20 B
3 30 C
4 5 D
5 70 E
Using data.table, we can avoid explicit loops, and it is faster. Here I've set the threshold value as 2:
library(data.table)
# set to data table
setDT(example_df)
# get numeric columns
num_cols <- names(example_df)[sapply(example_df, is.numeric)]
# loop over all columns at once
example_df[,(num_cols) := lapply(.SD, function(x) ifelse(x>2,-1, x)), .SDcols=num_cols]
print(example_df)
myNums myChars
1: 1 A
2: 2 B
3: -1 C
4: -1 D
5: -1 E
Another data.table solution.
library(data.table)
dataframeChange_dt <- function(dFrame){
setDT(dFrame)
for(j in seq_along(dFrame)){
set(dFrame, i= which(dFrame[[j]] < 20), j = j, value = -1)
}
}
dataframeChange_dt(example_df)
example_df
# myNums myChars
# 1: -1 A
# 2: 20 B
# 3: 30 C
# 4: -1 D
# 5: 70 E
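It does not explicitly restrict itself to numeric columns. I tested it on multiple datasets and it did not affect the non-numeric columns, but note that the comparison on a character column is a string comparison, so restricting the loop to numeric columns would be safer.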
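A column loop that skips non-numeric columns explicitly might look like the following. It is written in base R so that it does not depend on data.table, it uses the >= rule from the question, and replace_at_or_above is a hypothetical name:

```r
replace_at_or_above <- function(d, threshold = 20, replacement = -1) {
  for (j in seq_along(d)) {
    # Skip anything that is not numeric, so strings are never compared
    if (!is.numeric(d[[j]])) next
    hit <- which(d[[j]] >= threshold)
    d[[j]][hit] <- replacement
  }
  d
}

example <- data.frame(
  myNums = c(10, 20, 30, 5, 70),
  myChars = c("A", "100", "C", "D", "E"),
  stringsAsFactors = FALSE
)
res <- replace_at_or_above(example)
```

The "100" entry is included on purpose: without the is.numeric check, a string such as "100" would be compared alphabetically against "20".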

negating filter condition with NA present gives counter-intuitive result

I have stumbled across the behaviour of dplyr::filter in a complex statement on a large dataframe, which basically comes down to the treatment of NA values:
df <- tibble(a = c(rep(1,3),
rep(NA, 3)))
A tibble: 6 x 1
a
<dbl>
1 1
2 1
3 1
4 NA
5 NA
6 NA
Filtering for rows that equal 1 gives the expected result:
df %>% filter(a == 1)
A tibble: 3 x 1
a
<dbl>
1 1
2 1
3 1
Filtering for rows that do not equal 1, I would expect the remaining 3 rows of the df to be returned, which is not the case, however:
df %>% filter(!a == 1)
A tibble: 0 x 1
... with 1 variables: a <dbl>
So while in the first case NA is interpreted as not equaling 1, in the second case, it is interpreted as equaling 1. Is there a logic I am missing here?
I know I can use %in% to get the expected result:
df %>% filter(!a %in% 1)
A tibble: 3 x 1
a
<dbl>
1 NA
2 NA
3 NA
but it seems strange to me to use this operator with just one element (rather than a vector).
So my questions to the experts: Is this the intended behaviour of filter? Is it common practice to use %in% when negating a filter condition?
This is due to the behaviour of %in%, not filter.
Let's use a simple example:
a = c(1, 1, 1, NA, NA, NA)
> a == 1
[1] TRUE TRUE TRUE NA NA NA
> a != 1
[1] FALSE FALSE FALSE NA NA NA
> !(a == 1)
[1] FALSE FALSE FALSE NA NA NA
We see that when we use the relational operators == or !=, NA values in the input remain NA in the output. However...
> a %in% 1
[1] TRUE TRUE TRUE FALSE FALSE FALSE
> !(a %in% 1)
[1] FALSE FALSE FALSE TRUE TRUE TRUE
With the %in% operator, NA values in the input become FALSE in the output. Since this is supposed to be the more intuitive interface for match(), let's take a look at that as well:
> match(a, 1)
[1] 1 1 1 NA NA NA
So nope, match() itself doesn't behave this way, at least not with the default arguments. However, the help file ?match explains:
%in% is currently defined as
"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0
There you have it. When we use a %in% 1, we are actually doing the following:
> match(a, 1, nomatch = 0L)
[1] 1 1 1 0 0 0
> match(a, 1, nomatch = 0L) > 0L
[1] TRUE TRUE TRUE FALSE FALSE FALSE
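filter() keeps only the rows where the condition is TRUE and drops rows where it is NA. That is why !a == 1 returns no rows, while ! combined with %in% turns the NA entries into TRUE and returns them.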
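The difference can be reproduced in base R without dplyr. In this sketch, which() stands in for the way filter() discards rows whose condition is NA, and the variable names are illustrative only:

```r
a <- c(1, 1, 1, NA, NA, NA)

# == and != propagate NA, so negating the comparison leaves NA in place
neg_equal <- !(a == 1)

# Two ways to get a mask that is TRUE for the NA entries
neg_in <- !(a %in% 1)
explicit <- is.na(a) | a != 1

# which() drops NA positions, similar to filter() dropping NA conditions
kept_by_equal <- a[which(neg_equal)]
kept_by_in <- a[which(neg_in)]
```

If using %in% with a single value feels odd, the explicit form is.na(a) | a != 1 states the intent directly and gives the same mask.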

Sorting a list of unequal-size vectors in r

Suppose I have several vectors - maybe they're stored in a list, but if there's a better data structure that's fine too:
ll <- list(c(1,3,2),
c(1,2),
c(2,1),
c(1,3,1))
And I want to sort them, using the first number, then the second number to resolve ties, then the third number to resolve remaining ties, etc.:
c(1,2)
c(1,3,1)
c(1,3,2)
c(2,1)
Are there any built in functions that will allow me to do this or do I need to roll my own solution?
(For those who know Python, what I'm after is something that mimics the behavior of sort in Python)
ll <- list(c(1,3,2),
c(1,2),
c(2,1),
c(1,3,1))
I'd prefer using NA for missing values and using rbind.data.frame instead of paste:
sortfun <- function(l) {
l1 <- lapply(l, function(x, n) {
length(x) <- n
x
}, n = max(lengths(l)))
l1 <- do.call(rbind.data.frame, l1)
l[do.call(order, l1)] #order's default is na.last = TRUE
}
sortfun(ll)
#[[1]]
#[1] 1 2
#
#[[2]]
#[1] 1 3 1
#
#[[3]]
#[1] 1 3 2
#
#[[4]]
#[1] 2 1
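One detail worth noting: with na.last = TRUE, a vector that is a prefix of another (say c(1,3) next to c(1,3,1)) sorts after the longer one, whereas Python puts the shorter one first. A sketch of a variant that passes na.last = FALSE to order, assuming the vectors themselves contain no NA (sort_like_python is a hypothetical name):

```r
sort_like_python <- function(l) {
  n <- max(lengths(l))
  # Pad every vector with NA to the same length and stack them as rows
  padded <- do.call(rbind, lapply(l, function(x) { length(x) <- n; x }))
  # One sort key per position; NA first means "shorter vector first"
  cols <- lapply(seq_len(n), function(j) padded[, j])
  l[do.call(order, c(cols, list(na.last = FALSE)))]
}

vectors <- list(c(1, 3, 2), c(1, 2), c(2, 1), c(1, 3, 1), c(1, 3))
sorted <- sort_like_python(vectors)
```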
Here's an approach that uses data.table.
The result is a rectangular data.table with the rows ordered in the form you described. NA values are filled in where the list item was a different length.
library(data.table)
setorderv(data.table(do.call(cbind, transpose(ll))), paste0("V", 1:max(lengths(ll))))[]
# V1 V2 V3
# 1: 1 2 NA
# 2: 1 3 1
# 3: 1 3 2
# 4: 2 1 NA
This is ugly, but you can use the result on your list with something like:
ll[setorderv(
data.table(
do.call(cbind, transpose(ll)))[
, ind := seq_along(ll)][],
paste0("V", seq_len(max(lengths(ll)))))$ind]

Resources