Merging dataframes with all.equal on numeric (float) keys? - r

I have two data frames I want to merge based on a numeric value, however I am having trouble with floating point accuracy. Example:
> df1 <- data.frame(number = 0.1 + seq(0.01,0.1,0.01), letters = letters[1:10])
> df2 <- data.frame(number = seq(0.11,0.2,0.01), LETTERS = LETTERS[1:10])
> (merged <- merge(df1, df2, by = "number", all = TRUE))
number letters LETTERS
1 0.11 a A
2 0.12 <NA> B
3 0.12 b <NA>
4 0.13 c C
5 0.14 d D
6 0.15 <NA> E
7 0.15 e <NA>
8 0.16 f F
9 0.17 g G
10 0.18 h H
11 0.19 i I
12 0.20 j J
Some of the values (0.12 and 0.15) don't match up due to floating point accuracy issues, as discussed in this post. The solution for finding equality there was to use the all.equal function to ignore floating point artifacts; however, I don't believe there is a way to do this within the merge function.
Currently I get around it by forcing one of the number columns to character and then back to numeric after the merge, but this is a little clunky; does anyone have a better solution for this problem?
> df1c <- df1
> df1c[["number"]] <- as.character(df1c[["number"]])
> merged2 <- merge(df1c, df2, by = "number", all = TRUE)
> merged2[["number"]] <- as.numeric(merged2[["number"]])
> merged2
number letters LETTERS
1 0.11 a A
2 0.12 b B
3 0.13 c C
4 0.14 d D
5 0.15 e E
6 0.16 f F
7 0.17 g G
8 0.18 h H
9 0.19 i I
10 0.20 j J
EDIT: A little more about the data
I wanted to keep my question general to make it more applicable to other people's problems, but it seems I may need to be more specific to get an answer.
It is likely that all of the issues with merging will be due to floating point inaccuracy, but it may be a little hard to be sure. The data comes in as a series of time series values, a start time, and a frequency. These are then turned into a time series (ts) object, and a number of functions are called to extract features from the time series (one of which is the time value), which is returned as a data frame. Meanwhile another set of functions is called to get other features from the time series as targets. There are also potentially other series getting features generated to complement the original series. These values then have to be reunited using the time value.
Can't store as POSIXct: Each of these processes (feature extraction, target computation, merging) has to be able to occur independently and be stored in a CSV-type format so it can be passed to other platforms. Storing the time as a POSIXct value would be difficult since the series aren't necessarily stored in calendar time.
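One pattern that works under those constraints (a sketch; the names features, targets, and time below are hypothetical stand-ins for the real tables and key) is to round the time key to an agreed fixed precision at the point where each process writes its CSV, so every downstream step sees identical keys:
# Hypothetical feature and target tables sharing a numeric time key
features <- data.frame(time = 0.1 + seq(0.01, 0.10, 0.01), f1 = 1:10)
targets  <- data.frame(time = seq(0.11, 0.20, 0.01), target = 11:20)

key_digits <- 8   # precision agreed on by every process in the pipeline
features$time <- round(features$time, key_digits)
targets$time  <- round(targets$time, key_digits)

write.csv(features, "features.csv", row.names = FALSE)
write.csv(targets, "targets.csv", row.names = FALSE)

# later, possibly on another platform
merged <- merge(read.csv("features.csv"), read.csv("targets.csv"),
                by = "time", all = TRUE)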

Round to the level of precision that will allow the numbers to be equal.
> df1$number <- round(df1$number, 2)
> df2$number <- round(df2$number, 2)
>
> (merged <- merge(df1, df2, by = "number", all = TRUE))
number letters LETTERS
1 0.11 a A
2 0.12 b B
3 0.13 c C
4 0.14 d D
5 0.15 e E
6 0.16 f F
7 0.17 g G
8 0.18 h H
9 0.19 i I
10 0.20 j J
If you need to choose the level of precision programmatically then you should tell us more about the data and whether we can perhaps assume that it will always be due to floating point inaccuracy. If so, then rounding to 10 decimal places should be fine. The all.equal function uses a tolerance of sqrt(.Machine$double.eps), which is about 1.5e-8, so in usual practice it behaves similarly to round(..., 8).
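For instance (a sketch, assuming floating point noise really is the only source of mismatch), rounding both keys to 8 decimal places before merging gives a tolerance in the same ballpark as all.equal's default:
digits <- 8   # comparable to sqrt(.Machine$double.eps), roughly 1.5e-8
df1$number <- round(df1$number, digits)
df2$number <- round(df2$number, digits)
merged <- merge(df1, df2, by = "number", all = TRUE)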

Related

Comparing changes across two matrices

I'm performing some biogeographic analyses in R and the result is encoded as a pair of matrices. Columns represent geographic regions, rows indicate nodes in a phylogenetic tree and values in the matrix are the probability that the branching event occurred in the geographic region indicated by the column. A very simple example would be:
> One_node<-matrix(c(0,0.8,0.2,0),
+ nrow=1, ncol=4,
+ dimnames = list(c("node_1"),
+ c("A","B","C","D")))
> One_node
A B C D
node_1 0 0.8 0.2 0
In this case, the most probable location for node_1 is region B. In reality, the output of the analysis is encoded as two separate 79x123 matrices. The first is the probabilities of a node occupying a given region before an event and the second is the probabilities of a node occupying a given region after an event (rowSums=1). Some slightly more complicated examples:
before<-matrix(c(0,0,0,0,0.9,
0.8,0.2,0.6,0.4,0.07,
0.2,0.8,0.4,0.6,0.03,
0,0,0,0,0),
nrow=5, ncol=4,
dimnames = list(c("node_1","node_2","node_3","node_4","node_5"),
c("A","B","C","D")))
after<-matrix(c(0,0,0,0,0.9,
0.2,0.8,0.4,0.6,0.03,
0.8,0.2,0.6,0.4,0.07,
0,0,0,0,0),
nrow=5, ncol=4,
dimnames = list(c("node_1","node_2","node_3","node_4","node_5"),
c("A","B","C","D")))
> before
A B C D
node_1 0.0 0.80 0.20 0
node_2 0.0 0.20 0.80 0
node_3 0.0 0.60 0.40 0
node_4 0.0 0.40 0.60 0
node_5 0.9 0.07 0.03 0
> after
A B C D
node_1 0.0 0.20 0.80 0
node_2 0.0 0.80 0.20 0
node_3 0.0 0.40 0.60 0
node_4 0.0 0.60 0.40 0
node_5 0.9 0.03 0.07 0
Specifically, I'm only interested in extracting the row numbers where column B is the highest in before and column C is the highest in after, and vice versa, as I'm trying to extract node numbers in a tree where taxa have moved B->C or C->B.
So the output I'm looking for would be something like:
> BC
[1] 1 3
> CB
[1] 2 4
There will be rows where B>C or C>B but where neither is the highest in the row (node_5) and I need to ignore these. The row numbers are then used to query a separate dataframe that provides the data I want.
I hope this all makes sense. Thanks in advance for any advice!
You could do something like this...
maxBefore <- apply(before, 1, which.max) #find highest columns before (by row)
maxAfter <- apply(after, 1, which.max) #and highest columns after
BC <- which(maxBefore==2 & maxAfter==3) #rows with B highest before, C after
CB <- which(maxBefore==3 & maxAfter==2) #rows with C highest before, B after
BC
node_1 node_3
1 3
CB
node_2 node_4
2 4
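Since the question mentions using the row numbers to query a separate data frame, a possible follow-up (node_data is a hypothetical stand-in for that data frame):
names(BC)         # "node_1" "node_3" -- node labels that moved B->C
names(CB)         # "node_2" "node_4" -- node labels that moved C->B
# node_data[BC, ] # rows of the separate data frame for those nodes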

R inference from one matrix to a data frame

I think this may be a very simple and easy question, but since I'm new to R, I hope someone can give me some outlines of how to solve it step by step. Thanks!
So the question is: I have an (n * 2) matrix (say m) where the first column represents the index of the data in another data frame (say d) and the second column represents some value (a p value).
What I want to do is: if the p value of some row r in m is less than 0.05, I will plot the data in d at the index indicated in the first column of row r of matrix m.
The data is somewhat like what I draw below:
m:
ind p_value
2 0.02
23 0.03
56 0.12
64 0.54
105 0.04
d:
gene_id s1 s2 s3 s4 ... sn
IDH1 0.23 3.01 0 0.54 ... 4.02
IDH2 0.67 0 8.02 10.54 ... 0.72
...
So IDH2 corresponds to the first line in m, whose index column is 2.
toplot <- d[ m[ m[,'p_value'] < .05,'ind'], ] works!
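Broken into steps (assuming m carries the column names 'ind' and 'p_value' shown above), that one-liner does the following:
keep <- m[, 'p_value'] < 0.05   # logical vector marking rows of m with p < 0.05
rows <- m[keep, 'ind']          # indices into d for those rows, here 2, 23 and 105
toplot <- d[rows, ]             # the rows of d to plot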

delete vector entries based on another vector

I have two vectors
a <- c(1:20)
b <- c(2,11,14)
I want to delete the entries in the a vector based on the vector entries in b (I want the 2nd, 11th, and 14th entries deleted).
I've tried several methods, including:
c <- a[!a %in% b]
but that doesn't work: it removes entries by value rather than by position.
Any suggestions? I've tried searching SO, but can only find deleting based on values.
You can simply index into a and remove the elements at indices in b as follows:
a <- c(1:20)
b <- c(2,11,14)
a[-b]
[1] 1 3 4 5 6 7 8 9 10 12 13 15 16 17 18 19 20
I created 3.1 million entries and am randomly sampling 100,000 to remove. As can be seen, it is blazing fast.
a <- 1:3100000
b <- sample(a, 100000)
system.time(a[-b])
user system elapsed
0.024 0.003 0.027
Edited: adding this extra check option, based on the comments below by akrun and thelatemail, to handle the case where b might be empty.
a[if(length(b)) -b else TRUE]
The approach by #Gopala works in most cases except when the 'b' vector is empty (length zero). To make it a bit more general, we can get a logical condition using seq_along(a) with %in%
a[!seq_along(a) %in% b]
#[1] 1 3 4 5 6 7 8 9 10 12 13 15 16 17 18 19 20
Now, if we change 'b' to
b <- vector('integer')
a[-b]
#integer(0)
a[!seq_along(a) %in% b]
#[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
The former returns a vector of length 0, while the %in% approach returns the whole vector 'a'.
The other method is obviously more efficient, but if we need an approach that also works in the case I mentioned, this one can be used.
system.time(a[-b])
# user system elapsed
# 0.07 0.00 0.08
system.time(a[!seq_along(a) %in% b])
# user system elapsed
# 0.17 0.01 0.18
The approach posted by #thelatemail to make the first approach general
system.time(a[if(length(b)==0) TRUE else -b])
# user system elapsed
# 0.05 0.00 0.05
NOTE: Benchmark data from #Gopala's post.
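Another base R variant that also happens to handle an empty 'b' (a sketch, not included in the benchmarks above):
a[setdiff(seq_along(a), b)]
#[1] 1 3 4 5 6 7 8 9 10 12 13 15 16 17 18 19 20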

ANOVA of subsetted data

I am manipulating a data set comprising several factors with several variables. The idea is that I want to do ANOVA analysis between factor levels nested within one level of another factor.
Here is an example similar to my data set:
treatment category trial individual response
1 A big 1 F1 0.10
2 A big 2 F1 0.20
3 A big 1 F2 0.30
4 A big 2 F2 0.11
5 A small 1 F3 0.12
6 A small 2 F3 0.13
7 A small 1 F4 0.20
8 A small 2 F4 0.30
9 B big 1 F5 0.40
10 B big 2 F5 0.21
11 B big 1 F6 0.22
12 B big 2 F6 0.23
13 B small 1 F7 0.31
14 B small 2 F7 0.32
15 B small 1 F8 0.34
16 B small 2 F8 0.25
So basically, I'd like to do an ANOVA between big and small when treatment is A, then B, then the same idea with an ANOVA between big and small when treatment is A and trial is 1... you get the logic.
It seems I have to use:
anova(lm(Y~x,data=dataset))
and add a subset argument, but I can't work out the logic and I can't find any example similar to mine. Any hints? Thank you in advance!
By your description, you want to apply separate ANOVAs to different subsets of your data.
Try this:
df1 <- df[df$treatment=="A",]
df2 <- df[df$treatment=="B",]
aov(response ~ category, data=df1)
aov(response ~ category, data=df2)
If you are interested in the effect of the treatment factor, maybe you should keep it in a more complex model and use a post hoc test for differences within treatments A and B. But it's just a suggestion.
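As a side note on the subset argument mentioned in the question, the same comparisons can also be written without creating df1 and df2 (a sketch using the df from above):
anova(lm(response ~ category, data = df, subset = treatment == "A"))
anova(lm(response ~ category, data = df, subset = treatment == "A" & trial == 1))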

Find the 2 max values for each factor in R

I have a question about finding the two largest values of column C, for each unique ID in column A, then calculating the mean of column B. A sample of my data is here:
ID layer weight
1 0.6843629 0.35
1 0.6360772 0.70
1 0.6392318 0.14
2 0.3848640 0.05
2 0.3882660 0.30
2 0.3877026 0.10
2 0.3964194 0.60
2 0.4273218 0.02
2 0.3869507 0.12
3 0.4748541 0.07
3 0.5853659 0.42
3 0.5383678 0.10
3 0.6060287 0.60
4 0.4859274 0.08
4 0.4720740 0.48
4 0.5126481 0.08
4 0.5280899 0.48
5 0.7492097 0.07
5 0.7220433 0.35
5 0.8750000 0.10
5 0.8302752 0.50
6 0.4306283 0.10
6 0.4890895 0.25
6 0.3790714 0.20
6 0.5139686 0.50
6 0.3885678 0.02
6 0.4706815 0.05
For each ID, I want to calculate the mean value of layer, using only the rows with the two highest weights.
I can do this with the following code in R:
library(plyr)         # for ddply
library(data.table)   # for the anti-join below
# index1 is the sample data above, as a data frame
ind.max1 <- ddply(index1, "ID", function(x) x[which.max(x$weight),])
dt1 <- data.table(index1, key=c("layer"))
dt2 <- data.table(ind.max1, key=c("layer"))
index2 <- dt1[!dt2]
ind.max2 <- ddply(index2, "ID", function(x) x[which.max(x$weight),])
ind.max.all <- merge(ind.max1, ind.max2, all=TRUE)
ind.ndvi.mean <- as.data.frame(tapply(ind.max.all$layer, list(ind.max.all$ID), mean))
This uses ddply to select the highest-weight row per ID and put it into a data frame with layer, then removes those highest-weight rows from the original data frame using data.table. I then repeat the ddply step to select the next highest value, merge the two max-weight data frames into one, and finally compute the mean with tapply.
There must be a more efficient way to do this. Does anyone have any insight? Cheers.
You could use data.table
library(data.table)
setDT(dat)[, list(Meanlayer = mean(layer[order(-weight)[1:2]])), by = ID]
# ID Meanlayer
#1: 1 0.6602200
#2: 2 0.3923427
#3: 3 0.5956973
#4: 4 0.5000819
#5: 5 0.7761593
#6: 6 0.5015291
Order the weight column in descending order with order(-weight).
Select the first two positions per ID group with [1:2].
Subset the corresponding layer values based on that index: layer[order(-weight)[1:2]].
Take the mean.
Alternatively, in 1.9.3 (current development version) or from the next version on, a function setorder is exported for reordering data.tables in any order, by reference:
require(data.table) ## 1.9.3+
setorder(setDT(dat), ID, -weight) ## dat is now reordered as we require
dat[, mean(layer[1:min(.N, 2L)]), by=ID]
By ordering first, we avoid the call to order() for each group (unique value in ID). This'll be more advantageous with more groups. And setorder() is much more efficient than order() as it doesn't need to create a copy of your data.
This actually is a question for StackOverflow... anyway!
Don't know if the version below is efficient enough for you...
s.ind <- tapply(df$weight, df$ID, function(x) order(x, decreasing = TRUE))  # per-ID ordering of weights, largest first
val <- tapply(df$layer, df$ID, function(x) x)                               # per-ID layer values
foo <- function(x, y) list(x[y][1:2])                                       # reorder layers by weight and keep the top two
lapply(mapply(foo, val, s.ind), mean)                                       # mean of the top-two layers for each ID
I think this will do it. Assuming the data is called dat,
> sapply(split(dat, dat$ID), function(x) {
with(x, {
mean(layer[ weight %in% rev(sort(weight))[1:2] ])
})
})
# 1 2 3 4 5 6
# 0.6602200 0.3923427 0.5956973 0.5000819 0.7761593 0.5015291
You'll likely want to include na.rm = TRUE as the second argument to mean to account for any rows that contain NA values.
Alternatively, mapply is probably faster, and has the exact same code just in a different order,
mapply(function(x) {
with(x, {
mean(layer[ weight %in% rev(sort(weight))[1:2] ])
})
}, split(dat, dat$ID))
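If the dplyr package is an option, the same computation can be written this way as well (a sketch, not benchmarked against the answers above):
library(dplyr)
dat %>%
  group_by(ID) %>%
  summarise(Meanlayer = mean(layer[order(-weight)[1:2]]))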
