Tricky multi-step subset selection - r

I have a matrix:
1 3 NA
1 2 0
1 7 2
1 5 NA
1 9 5
1 6 3
2 5 2
2 6 1
3 NA 4
4 2 9
...
For each number in the first column, I would like to take its values in the second column, look those values up in the first column, and select the rows that have an NA in their own second column.
So the search would go the following way:
look up number in the first column: 1.
check corresponding values in second column: 3,2,7,5,9,6...
look up 3,2,7,5,9,6 in the first column and see if they have NA in their second column
The result in the above case would be:
>3 NA 4<
since this is the only looked-up row which has an NA in its own second column.
Here's what I want to do in words:
Look at the number in column one: I find '1'.
What numbers does 1 have in its second column? 3, 2, 7, 5, 9, 6.
Do these numbers have NA in their own second column? Yes, 3 has an NA.
I would like it to return the matching rows themselves, not their row numbers: the result should be the subset of the original matrix containing the rows which satisfy the condition.
This would be the MATLAB equivalent, where i is the number in column 1:
isnan(matrix(matrix(:,1)==i,2))==1

Using by to get the result per group of column 1, assuming dat is your data frame:
by(dat, dat$V1, FUN = function(x) {
  y <- dat[which(dat$V1 %in% x$V2), ]
  y[is.na(y$V2), ]
})
dat$V1: 1
V1 V2 V3
9 3 NA 4
--------------------------------------------------------------------------------
dat$V1: 2
[1] V1 V2 V3
<0 rows> (or 0-length row.names)
--------------------------------------------------------------------------------
dat$V1: 3
[1] V1 V2 V3
<0 rows> (or 0-length row.names)
--------------------------------------------------------------------------------
dat$V1: 4
[1] V1 V2 V3
<0 rows> (or 0-length row.names)
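If you would rather have a single data frame than the grouped by output, the pieces can be bound together; a minimal sketch, assuming dat as above:
res <- by(dat, dat$V1, FUN = function(x) {
  y <- dat[which(dat$V1 %in% x$V2), ]
  y[is.na(y$V2), ]
})
do.call(rbind, res)  # empty groups contribute no rows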
EDIT
Here I try to translate the MATLAB command into its R equivalent:
isnan(matrix(matrix(:,1)==i,2))==1  ## the MATLAB version; what is i here?
is.na(dat[dat[dat[,1]==1,2],])      ## R equivalent, with i set to 1
V1 V2 V3
3 FALSE FALSE FALSE
2 FALSE FALSE FALSE
7 FALSE FALSE FALSE
5 FALSE FALSE FALSE
9 FALSE TRUE FALSE
6 FALSE FALSE FALSE
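Note that this translation indexes rows by position (the values 3, 2, 7, 5, 9, 6 are used as row numbers), not by matching them against column 1, so the TRUE shows up at the row labeled 9 rather than for the group with V1 == 3. A hedged value-based sketch that matches the column 2 values against column 1 instead:
is.na(dat[dat[,1] %in% dat[dat[,1]==1,2], 2])  ## TRUE where a looked-up group's own column 2 is NA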

This hopefully reads easily as it follows the steps you described:
idx1 <- m[, 1L] == 1L
idx2 <- m[, 1L] %in% m[idx1, 2L]
idx3 <- idx2 & is.na(m[, 2L])
m[idx3, ]
# V1 V2 V3
# 3 NA 4
It is all vectorized and uses integer comparisons, so it should not be terribly slow. However, if it is too slow for your needs, you should use a data.table keyed on your first column.
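A hedged sketch of the data.table route (assuming the data.table package and columns named V1, V2, V3 as above):
library(data.table)
dt <- as.data.table(m)                   # works for a matrix or data frame
setkey(dt, V1)                           # key on the first column
targets <- dt[.(1), V2]                  # second-column values under group 1
dt[.(targets), nomatch = 0L][is.na(V2)]  # join back on the key, keep rows with NA in V2
#    V1 V2 V3
# 1:  3 NA  4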
Note that you don't need any of the assignments, so if you are looking for a one-liner:
m[is.na(m[, 2L]) & m[, 1L] %in% m[m[, 1L] == 1L, 2L], ]
# [1] 3 NA 4
(but definitely harder to read and maintain.)

I am still not totally clear as to what you want, but maybe this would work?
m<-read.table(
textConnection("1 3 NA
1 2 0
1 7 2
1 5 NA
1 9 5
1 6 3
2 5 2
2 6 1
3 NA 4
4 2 9"))
do.call(rbind, lapply(split(m[,2], m[,1]), function(x)
  m[x[!is.na(x)][is.na(m[x[!is.na(x)], 2])], ]))
# V1 V2 V3
# 1 3 NA 4
It would be much nicer if you provided an example where you expect more than one row in the result.

Related

Categorical to Numeric Variable in R

Create a factor vector v1 using 10 random numbers without decimals.
Convert the factor vector to numeric vector v2.
Compare v1 and v2 element-wise. Store the comparison values (true or false) in a vector, and display it.
I have tried this:
v1 <- factor(round(runif(10), 0))
v1
v2<-as.numeric(v1)
v2
comp<-v1==v2
comp
Have a look at the code below.
When v1 is a factor, as.numeric(v1) returns the underlying level code of each element of v1. In this example, the first element is 5, which is the third level of the factor, so as.numeric returns 3.
The second element of v1 is 2, which is also the second level, so as.numeric returns 2 and we get TRUE in the comparison v1 == v2 for that element. Also check the help page ?factor.
Using as.numeric(as.character(v1)) does the expected conversion.
set.seed(2002)
v1 <- factor(round(10*runif(10),0))
v1
# [1] 5 2 9 0 9 8 8 10 10 9
# Levels: 0 2 5 8 9 10
str(v1)
#Factor w/ 6 levels "0","2","5","8",..: 3 2 5 1 5 4 4 6 6 5
v2 <- as.numeric(v1)
v2
# [1] 3 2 5 1 5 4 4 6 6 5
v1 == v2
#[1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
v2 <- as.numeric(as.character(v1))
v1 == v2
#[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
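As an aside, the Warning section of ?factor recommends a slightly more efficient idiom for the same conversion:
v2 <- as.numeric(levels(v1))[v1]
v1 == v2
#[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE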

Sort data frame by column of numbers

I am trying to sort a data frame by a column of numbers, but I get an alphanumeric sorting of the digits instead. If the data frame is converted to a matrix, the sorting works.
df[order(as.numeric(df[,2])),]
V1 V2
1 a 1
3 c 10
2 b 2
4 d 3
> m <- as.matrix(df)
> m[order(as.numeric(m[,2])),]
V1 V2
[1,] "a" "1"
[2,] "b" "2"
[3,] "d" "3"
[4,] "c" "10"
V1 <- letters[1:4]
V2 <- as.character(c(1,10,2,3))
df <- data.frame(V1,V2, stringsAsFactors=FALSE)
df[order(as.numeric(df[,2])),]
gives
V1 V2
1 a 1
3 c 2
4 d 3
2 b 10
But
V1 <- letters[1:4]
V2 <- as.character(c(1,10,2,3))
df <- data.frame(V1,V2)
df[order(as.numeric(df[,2])),]
gives
V1 V2
1 a 1
2 b 10
3 c 2
4 d 3
which is due to factors.
Thanks to the commenters akrun and Imo: inspect each of the two data frames with str(df).
There is also more detail in the help page for the factor() function; scroll down to the 'Warning' section for specifics of the issue at hand.
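A minimal sketch of the fix in the factor case (convert to character first, then to numeric):
df[order(as.numeric(as.character(df[,2]))),]
#  V1 V2
#1  a  1
#3  c  2
#4  d  3
#2  b 10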
Could you be a little more specific about what your initial data frame looks like?
Because by running this code:
df<-data.frame(c("a","b","c","d"),c(1,2,10,3))
colnames(df)<-c("V1","V2")
#print(df)
df.order<-df[order(as.numeric(df[,2])),]
print(df.order)
I get the right answer:
V1 V2
1 a 1
2 b 2
4 d 3
3 c 10
Edit:
The column values are probably being treated as factors.
Try coercing them to character and then to integer.
Example copied and pasted from the console:
> Foo <- data.frame('ABC' = c('a','b','c','d'),'123' = c('1','2','10','3'))
> Foo[order(as.integer(as.character(Foo[,2]))),]
ABC X123
1 a 1
2 b 2
4 d 3
3 c 10

Finding rows which don't have NA in a particular column when that column contains no NAs

I just observed that if one of the columns in my data frame does not contain any NA values (see col2 below) and I unknowingly try to find the rows whose col2 value is not NA, the code below gives me an empty output.
See col1 below, where it works since it has at least one NA value.
The same does not work for col2:
> col1 = c(1,1,1,1,NA)
> col2 = c(2,2,2,2,2)
> df = data.frame(col1,col2)
> df
col1 col2
1 1 2
2 1 2
3 1 2
4 1 2
5 NA 2
> df[-which(is.na(df$col1)),]
col1 col2
1 1 2
2 1 2
3 1 2
4 1 2
> df[-which(is.na(df$col2)),]
[1] col1 col2
<0 rows> (or 0-length row.names)
I was able to get it to work as follows, but I am just wondering whether the behavior above is okay:
> df[which(! is.na(df$col2)),]
col1 col2
1 1 2
2 1 2
3 1 2
4 1 2
5 NA 2
The problem is not limited to NAs; it happens whenever the indexing vector is empty. One might hope that the whole object would be returned, but in fact x[numeric(0)] (x indexed by a zero-length vector) returns an empty object.
For example, consider the following:
> df[ c(-1), ] # Negative indexing
col1 col2
2 1 2
3 1 2
4 1 2
5 NA 2
> df[ c(), ] # numeric(0)
[1] col1 col2
<0 rows> (or 0-length row.names)
> df[ c(1), ] # Positive indexing
col1 col2
1 1 2
See section 8.1.13 in The R Inferno for a more general explanation and workarounds.
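As a rule of thumb, skipping which() entirely avoids the trap, since logical indexing handles the all-FALSE case gracefully; a quick sketch with the same df:
df[!is.na(df$col1), ]  # drops the NA row
df[!is.na(df$col2), ]  # keeps all five rows, since col2 has no NAs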

Find groups of duplicates in data frame by all columns except one

I have a large data frame. For some purposes I need to do the following:
Select one column in this data frame.
Iterate over all rows of the data frame, ignoring the selected column.
Select all rows of this data frame that are equal on all elements except the selected column.
Group them so that the group name is the row index and the group values are the indexes of the duplicated rows.
I have written a function for this task, but it works slowly because of a nested loop. I would like to get some ideas on how this code can be improved.
Say we have a dataframe like this:
V1 V2 V3 V4
1 1 2 1 2
2 1 2 2 1
3 1 1 1 2
4 1 1 2 1
5 2 2 1 2
And we want to get this list as a output:
diff.dataframe("V2", conf.new, conf.new)
Output:
$`1`
[1] 1
$`2`
[1] 2
$`3`
[1] 1 3
$`4`
[1] 2 4
$`5`
[1] 5
The following code reaches the goal, but it works too slowly. Is it possible to improve it somehow?
diff.dataframe <- function(param, df1, df2){
  excl.names <- c(param)
  df1.excl <- data.frame(lapply(df1[, !names(df1) %in% excl.names], as.character), stringsAsFactors = FALSE)
  df2.excl <- data.frame(lapply(df2[, !names(df2) %in% excl.names], as.character), stringsAsFactors = FALSE)
  list.out <- list()
  for (i in 1:nrow(df1.excl)){
    for (j in 1:nrow(df2.excl)){
      if (paste(df1.excl[i,], collapse='') == paste(df2.excl[j,], collapse='')){
        if (!as.character(i) %in% unlist(list.out)){
          list.out[[as.character(i)]] <- c(list.out[[as.character(i)]], j)
        }
      }
    }
  }
  return(list.out)
}
Let's generate some data first
df <- as.data.frame(matrix(sample(2, 20, TRUE), 5))
# Produces df like this
V1 V2 V3 V4
1 2 1 1 1
2 2 1 2 2
3 1 1 2 2
4 1 2 1 1
5 1 2 1 1
We then loop over the row indices with lapply. Each row i is compared to every row of df with apply (including itself). Rows with at most one differing element yield TRUE, the others FALSE, producing a logical vector, which which() converts to a vector of row indices.
lapply(1:nrow(df), function(i)
  unname(which(apply(df, 1, function(x) sum(x != df[i,]) <= 1))))
# Produces output like this
[[1]]
[1] 1
[[2]]
[1] 2 3
[[3]]
[1] 2 3
[[4]]
[1] 4 5
[[5]]
[1] 4 5
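If the groups should be exact matches on every column except the selected one (rather than at most one mismatch anywhere, as above), a hedged sketch that builds a grouping key from the remaining columns and avoids the nested loop entirely:
key <- do.call(paste, df[, setdiff(names(df), "V2"), drop = FALSE])  # one key per row
groups <- split(seq_len(nrow(df)), key)                              # row indices per key
setNames(lapply(key, function(k) groups[[k]]), seq_len(nrow(df)))    # named as in the question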

An NA in subsetting a data.frame does something unexpected

Consider the following code. If you don't explicitly test for NA in your condition, the code will fail at some later date when your data changes.
> # A toy example
> a <- as.data.frame(cbind(col1=c(1,2,3,4),col2=c(2,NA,2,3),col3=c(1,2,3,4),col4=c(4,3,2,1)))
> a
col1 col2 col3 col4
1 1 2 1 4
2 2 NA 2 3
3 3 2 3 2
4 4 3 4 1
>
> # Bummer, there's an NA in my condition
> a$col2==2
[1] TRUE NA TRUE FALSE
>
> # Why is this a good thing to do?
> # It NA'd the whole row, and kept it
> a[a$col2==2,]
col1 col2 col3 col4
1 1 2 1 4
NA NA NA NA NA
3 3 2 3 2
>
> # Yes, this is the right way to do it
> a[!is.na(a$col2) & a$col2==2,]
col1 col2 col3 col4
1 1 2 1 4
3 3 2 3 2
>
> # Subset seems designed to avoid this problem
> subset(a, col2 == 2)
col1 col2 col3 col4
1 1 2 1 4
3 3 2 3 2
Can someone explain why the behavior you get without the is.na check would ever be good or useful?
I definitely agree that this isn't intuitive (I made that point before on SO). In defense of R, I think that knowing when you have a missing value is useful (i.e. this is not a bug). The == operator is explicitly designed to notify the user of NA or NaN values. See ?"==" for more information. It states:
Missing values ('NA') and 'NaN' values are regarded as non-comparable even to themselves, so comparisons involving them will always result in 'NA'.
In other words, a missing value isn't comparable using a binary operator (because it's unknown).
Beyond is.na(), you could also do:
which(a$col2==2) # tests explicitly for TRUE
Or
a$col2 %in% 2 # only checks for 2
%in% is defined using the match() function:
"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0
This is also covered in "The R Inferno".
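With the toy data above, both alternatives silently drop the NA row rather than propagating it:
which(a$col2 == 2)  # [1] 1 3 -- the NA position is simply not returned
a[a$col2 %in% 2, ]  # rows 1 and 3 only, no all-NA row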
Checking for NA values in your data is crucial in R, because many important operators don't handle it the way you expect. Beyond ==, this is also true for things like &, |, <, sum(), and so on. I am always thinking "what would happen if there was an NA here" when I'm writing R code. Requiring an R user to be careful with missing values is "by design".
Update: how is NA handled when there are multiple logical conditions?
NA is a logical constant, and you might get unexpected subsetting if you don't think about what might be returned (e.g. NA | TRUE evaluates to TRUE). These truth tables from ?Logic, computed with x <- c(NA, FALSE, TRUE), may provide a useful illustration:
x <- c(NA, FALSE, TRUE)
outer(x, x, "&") ## AND table
# <NA> FALSE TRUE
#<NA> NA FALSE NA
#FALSE FALSE FALSE FALSE
#TRUE NA FALSE TRUE
outer(x, x, "|") ## OR table
# <NA> FALSE TRUE
#<NA> NA NA TRUE
#FALSE NA FALSE TRUE
#TRUE TRUE TRUE TRUE
