I'm stumped by the following:
z <- data.frame(a=c(1,2,3,4,5,6), b=c("Yes","Yes","No","No","",NA))
is.na(z$b)
[1] FALSE FALSE FALSE FALSE FALSE TRUE
z$a[z$b=="Yes"]
[1] 1 2 NA
is.na(z$a[z$b=="Yes"])
[1] FALSE FALSE TRUE
Why is it that when I select z$b=="Yes", NA appears as a third value for the subsetted z$a?
When I use subset(), however, the NA does not appear:
subset(z, b=="Yes")$a
[1] 1 2
Many thanks in advance.
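A minimal sketch of what seems to be happening, assuming R's documented behaviour that comparing NA with == yields NA and that an NA in a logical index returns an NA element. The names keep, by_which and by_in are introduced here purely for illustration.

```r
z <- data.frame(a = c(1, 2, 3, 4, 5, 6),
                b = c("Yes", "Yes", "No", "No", "", NA))

# Comparing NA with "Yes" is neither TRUE nor FALSE, so the last element is NA
keep <- z$b == "Yes"
keep

# An NA in a logical index produces an NA element in the result
z$a[keep]

# which() keeps only the positions that are TRUE, so the NA is dropped
by_which <- z$a[which(keep)]

# %in% never returns NA, so the missing value simply counts as "no match"
by_in <- z$a[z$b %in% "Yes"]
```

subset() treats an NA in its condition as FALSE, which would explain why subset(z, b=="Yes")$a returns only the two matching values.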
I have a data.frame similar to this:
mydf=data.frame(LETTERS=LETTERS, rev_letters=rev(letters), var1=c(rep('a',10),rep('b',10),rep('c',6)), value=1:26)
> head(mydf)
  LETTERS rev_letters var1 value
1       A           z    a     1
2       B           y    a     2
3       C           x    a     3
4       D           w    a     4
5       E           v    a     5
6       F           u    a     6
I want to select the row indexes that correspond to the columns and values stored in a list, like this one:
mylist=list(LETTERS=c('A','M','X'), var1='b')
> mylist
$LETTERS
[1] "A" "M" "X"
$var1
[1] "b"
I would like to do something like the following, but for all columns and values at once:
> which(mydf[,names(mylist)[1]] %in% mylist[[1]])
[1] 1 13 24
... or even better as a TRUE/FALSE variable:
> mydf[,names(mylist)[1]] %in% mylist[[1]]
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[25] FALSE FALSE
The idea is to end up with a single variable of all the indexes for all the columns and values in the list; in the example above, the result would be:
> indexes
[1] 1 11 12 13 14 15 16 17 18 19 20 24
... or the TRUE/FALSE counterpart:
> indexes
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
[13] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
[25] FALSE FALSE
Thanks!
With %in% + sapply:
mydf=data.frame(LETTERS=LETTERS, rev_letters=rev(letters), var1=c(rep('a',10),rep('b',10),rep('c',6)), value=1:26)
mylist = list(LETTERS = c('A','M','X'), var1 = 'b')
rowSums(sapply(names(mylist), function(x) mydf[[x]] %in% mylist[[x]])) != 0
# [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[11] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#[21] FALSE FALSE FALSE TRUE FALSE FALSE
which(rowSums(sapply(names(mylist), function(x) mydf[[x]] %in% mylist[[x]])) != 0)
#[1] 1 11 12 13 14 15 16 17 18 19 20 24
Loop through names and use which:
sort(unique(unlist(sapply(names(mylist), function(i) {
  which(mydf[, i] %in% mylist[[i]])
}))))
# [1] 1 11 12 13 14 15 16 17 18 19 20 24
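Another way to sketch the same idea, assuming the goal is rows that match any of the listed conditions: build one logical vector per list entry and fold them together with Reduce() and the element-wise OR operator. The names hits and indexes are illustrative only.

```r
mydf <- data.frame(LETTERS = LETTERS, rev_letters = rev(letters),
                   var1 = c(rep("a", 10), rep("b", 10), rep("c", 6)),
                   value = 1:26)
mylist <- list(LETTERS = c("A", "M", "X"), var1 = "b")

# One logical vector per list entry: does the column match any listed value?
hits <- lapply(names(mylist), function(nm) mydf[[nm]] %in% mylist[[nm]])

# Combine the per-column vectors element-wise with logical OR
indexes <- Reduce(`|`, hits)

# Row numbers where at least one condition holds
which(indexes)
```

Swapping the OR operator for & in the Reduce() call would instead require every condition to hold for a row.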
I cannot understand the properties of logical (boolean) values TRUE, FALSE and NA when used with logical OR (|) and logical AND (&). Here are some examples:
NA | TRUE
# [1] TRUE
NA | FALSE
# [1] NA
NA & TRUE
# [1] NA
NA & FALSE
# [1] FALSE
Can you explain these outputs?
To quote from ?Logic:
NA is a valid logical object. Where a component of x or y is NA, the
result will be NA if the outcome is ambiguous. In other words NA &
TRUE evaluates to NA, but NA & FALSE evaluates to FALSE. See the
examples below.
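The key word there is "ambiguous". NA represents a value that is unknown. NA & TRUE could be either TRUE or FALSE depending on what the missing value is, so the result is NA, whereas NA & FALSE is FALSE no matter what the missing value is. The same reasoning applies to |: NA | TRUE is TRUE whatever the missing value is, while NA | FALSE depends on it and so is NA.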
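One way to check this reasoning is to substitute both values the missing element could take and see whether the results agree. This is only a sketch; resolve is a hypothetical helper written for this illustration, not part of R.

```r
# The two values a missing logical could actually be
candidates <- c(TRUE, FALSE)

candidates & TRUE    # results differ, so NA & TRUE is NA
candidates & FALSE   # both FALSE, so NA & FALSE is FALSE
candidates | TRUE    # both TRUE, so NA | TRUE is TRUE
candidates | FALSE   # results differ, so NA | FALSE is NA

# Hypothetical helper: the answer is known only if every candidate agrees
resolve <- function(results) {
  if (length(unique(results)) == 1) results[1] else NA
}

resolve(candidates & FALSE)
resolve(candidates & TRUE)
```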
It's explained in help("|"):
NA is a valid logical object. Where a component of x or y
is NA, the result will be NA if the outcome is ambiguous. In
other words NA & TRUE evaluates to NA, but NA & FALSE
evaluates to FALSE. See the examples below.
From the examples in help("|"):
x <- c(NA, FALSE, TRUE)
names(x) <- as.character(x)
outer(x, x, "&") ## AND table
#        <NA> FALSE  TRUE
# <NA>     NA FALSE    NA
# FALSE FALSE FALSE FALSE
# TRUE     NA FALSE  TRUE
outer(x, x, "|") ## OR table
#       <NA> FALSE TRUE
# <NA>    NA    NA TRUE
# FALSE   NA FALSE TRUE
# TRUE  TRUE  TRUE TRUE
I've created an example data.table
library(data.table)
set.seed(1)
siz <- 10
my <- data.table(
AA=c(rep(NA,siz-1),"11/11/2001"),
BB=sample(c("wrong", "11/11/2001"),siz, prob=c(1000000,1), replace=T),
CC=sample(siz),
DD=rep("11/11/2001",siz),
EE=rep("HELLO", siz)
)
my[2,AA:=1]
        NA wrong  3 11/11/2001 HELLO
         1 wrong  2 11/11/2001 HELLO
        NA wrong  6 11/11/2001 HELLO
        NA wrong 10 11/11/2001 HELLO
        NA wrong  5 11/11/2001 HELLO
        NA wrong  7 11/11/2001 HELLO
        NA wrong  8 11/11/2001 HELLO
        NA wrong  4 11/11/2001 HELLO
        NA wrong  1 11/11/2001 HELLO
11/11/2001 wrong  9 11/11/2001 HELLO
If I run this code
patt <- "^\\d\\d?/\\d\\d?/\\d{4}$"
sapply(my, function(x) (grepl(patt,x )))
I get a table with TRUE whenever there is a date.
         AA    BB    CC   DD    EE
 [1,] FALSE FALSE FALSE TRUE FALSE
 [2,] FALSE FALSE FALSE TRUE FALSE
 [3,] FALSE FALSE FALSE TRUE FALSE
 [4,] FALSE FALSE FALSE TRUE FALSE
 [5,] FALSE FALSE FALSE TRUE FALSE
 [6,] FALSE FALSE FALSE TRUE FALSE
 [7,] FALSE FALSE FALSE TRUE FALSE
 [8,] FALSE FALSE FALSE TRUE FALSE
 [9,] FALSE FALSE FALSE TRUE FALSE
[10,]  TRUE FALSE FALSE TRUE FALSE
But if I do it like this:
my[,lapply(.SD, grepl, patt)]
I just get this result:
   AA    BB    CC    DD    EE
1: NA FALSE FALSE FALSE FALSE
Why?
How can I get the same result while writing everything inside the brackets?
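lapply passes each column as the first unnamed argument of grepl, which is pattern, so in my[,lapply(.SD, grepl, patt)] the column is used as the pattern and patt as the text to search. If we are not using an anonymous function, we need to name the pattern argument: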
my[,lapply(.SD, grepl, pattern = patt)]
Or otherwise with an anonymous function call
my[,lapply(.SD, function(x) grepl(patt, x))]
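The argument matching can be seen without data.table. The sketch below uses a plain list, cols, as a hypothetical stand-in for .SD, and assumes that grepl() only warns (rather than fails) when pattern has more than one element.

```r
patt <- "^\\d\\d?/\\d\\d?/\\d{4}$"

# Hypothetical stand-in for .SD: a plain list of two character columns
cols <- list(DD = rep("11/11/2001", 3), EE = rep("HELLO", 3))

# Positional: each column fills `pattern`, so patt ends up as `x`.
# grepl() then uses only the first pattern element (with a warning),
# and each result has the length of patt, i.e. 1.
positional <- suppressWarnings(lapply(cols, grepl, patt))
lengths(positional)

# Named: patt is bound to `pattern`, so each column falls through to `x`
named <- lapply(cols, grepl, pattern = patt)
named
```

This would also account for the single-row result in the question: each column collapses to one value because the text being searched is the length-one patt.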
I have a very large data set with 250 string and numeric variables. I want to compare the columns in consecutive pairs, i.e. find where they differ: the first variable with the second, the third with the fourth, the fifth with the sixth, and so on.
For example (the data set is structured like the example below), I want to compare number.x with number.y, day.x with day.y, school.x with school.y, and so on.
number.x<-c(1,2,3,4,5,6,7)
number.y<-c(3,4,5,6,1,2,7)
day.x<-c(1,3,4,5,6,7,8)
day.y<-c(4,5,6,7,8,7,8)
school.x<-c("a","b","b","c","n","f","h")
school.y<-c("a","b","b","c","m","g","h")
city.x<- c(1,2,3,7,5,8,7)
city.y<- c(1,2,3,5,5,7,7)
You mean, something like this?
> number.x == number.y
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE
> length(which(number.x==number.y))
[1] 1
> school.x == school.y
[1] TRUE TRUE TRUE TRUE FALSE FALSE TRUE
> test.day <- day.x == day.y
> test.day
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE
EDIT: Given your example variables above, we have:
df <- data.frame(number.x,
                 number.y,
                 day.x,
                 day.y,
                 school.x,
                 school.y,
                 city.x,
                 city.y,
                 stringsAsFactors=FALSE)
n <- ncol(df) # no of columns (assumed EVEN number)
k <- 1
comp <- list() # comparisons will be stored here
while (k <= n-1) {
  l <- (k+1)/2
  comp[[l]] <- df[,k] == df[,k+1]
  k <- k+2
}
After which, you'll have:
> comp
[[1]]
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[[2]]
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE
[[3]]
[1] TRUE TRUE TRUE TRUE FALSE FALSE TRUE
[[4]]
[1] TRUE TRUE TRUE FALSE TRUE FALSE TRUE
To get the comparison result for columns k and k+1 (where k is odd), look at element (k+1)/2 of comp. For example, to get the comparison results for columns 7 and 8, look at element (7+1)/2 = 4 of comp:
> comp[[4]]
[1] TRUE TRUE TRUE FALSE TRUE FALSE TRUE
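As an alternative to tracking indices by hand, the pairing can be sketched with Map(), which walks the odd-numbered columns and their right-hand neighbours together. This assumes, as above, an even number of columns and that each pair shares a prefix ending in .x and .y. The name odd is illustrative, and only two of the example pairs are reproduced to keep the sketch short.

```r
df <- data.frame(number.x = c(1, 2, 3, 4, 5, 6, 7),
                 number.y = c(3, 4, 5, 6, 1, 2, 7),
                 school.x = c("a", "b", "b", "c", "n", "f", "h"),
                 school.y = c("a", "b", "b", "c", "m", "g", "h"),
                 stringsAsFactors = FALSE)

# Positions of the first column in each pair (assumes an even column count)
odd <- seq(1, ncol(df), by = 2)

# Compare each odd-numbered column with the column right after it
comp <- Map(function(x, y) x == y, df[odd], df[odd + 1])

# Name each result after the shared prefix, e.g. "number" for number.x/number.y
names(comp) <- sub("\\.x$", "", names(df)[odd])

comp$school
```

Because the results are named, a comparison can be looked up by variable name rather than by its position in the list.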
EDIT 2: To have the comparisons as new columns in the dataframe:
new.names <- paste0('V', seq_len(n/2)) # V1, V2, ..., one name per comparison
cc <- as.data.frame(comp, optional=TRUE)
names(cc) <- new.names
df.new <- cbind(df, cc)
After which, you have:
> df.new
  number.x number.y day.x day.y school.x school.y city.x city.y    V1    V2    V3    V4
1        1        3     1     4        a        a      1      1 FALSE FALSE  TRUE  TRUE
2        2        4     3     5        b        b      2      2 FALSE FALSE  TRUE  TRUE
3        3        5     4     6        b        b      3      3 FALSE FALSE  TRUE  TRUE
4        4        6     5     7        c        c      7      5 FALSE FALSE  TRUE FALSE
5        5        1     6     8        n        m      5      5 FALSE FALSE FALSE  TRUE
6        6        2     7     7        f        g      8      7 FALSE  TRUE FALSE FALSE
7        7        7     8     8        h        h      7      7  TRUE  TRUE  TRUE  TRUE