compare two similar data frames and only display differences in R

I have two similar data frames with the same number of columns, but different numbers of rows. Most of the entries between the two are the same, but in a few places there are differences, and these are what I care about. The first column in both data frames serves as a key.
Ideally, I'd like to be able to see whether they've changed, as well as the values from each of the two data frames. My first solution was to create a merged dataframe and re-organize the columns side-by-side like so:
df1<-data.frame(gene=c('cyp1a1','cyp2a6','srd5a','slc5a5','cox15'), updated=c(TRUE,TRUE,FALSE,TRUE,FALSE),version=c(2,3,1,2,1))
df2<-data.frame(gene=c('cyp1a1','cyp2a6','srd5a','slc5a5'), updated=c(FALSE,TRUE,FALSE,FALSE),version=c(1,2,1,1))
#merge data frames
comp<-merge(df1,df2, by="gene", all=TRUE)
#re-order columns side-by-side
#probably a better way to do this
ordList<-c(1,2,4,3,5)
comp<-comp[ordList]
So now I have a side-by-side comparison data frame, but I am unsure how to iterate over it to perform the comparison. Eventually I would like to create a new data frame that blanks out entries that are identical between the two data frames (replacing them with an empty string) and keeps entries that differ from the first data frame to the second.
This is what comp looks like now:
gene updated.x updated.y version.x version.y
1 cox15 FALSE NA 1 NA
2 cyp1a1 TRUE FALSE 2 1
3 cyp2a6 TRUE TRUE 3 2
4 slc5a5 TRUE FALSE 2 1
5 srd5a FALSE FALSE 1 1
This is what I want it to look like:
gene updated.x updated.y version.x version.y
1 cox15 FALSE NA 1 NA
2 cyp1a1 TRUE FALSE 2 1
3 cyp2a6 3 2
4 slc5a5 TRUE FALSE 2 1
5 srd5a
In my actual data, I have 14 columns in each data frame, and hundreds of rows. I may be doing similar comparisons in the future, so having a functional way of executing this task would be ideal.

Here is my suggestion, considering you have 14 columns:
library(data.table)
library(magrittr)
z = rbindlist(list(df1,df2), idcol=TRUE)
z[, lapply(.SD, . %>% unique %>% paste(collapse=";")), keyby=gene]
# gene .id updated version
# 1: cox15 1 FALSE 1
# 2: cyp1a1 1;2 TRUE;FALSE 2;1
# 3: cyp2a6 1;2 TRUE 3;2
# 4: slc5a5 1;2 TRUE;FALSE 2;1
# 5: srd5a 1;2 FALSE 1
This shows you which data frame each gene appears in (.id) as well as the attributes (updated and version). This display extends naturally to additional tables, like list(df1,df2,df3).
If you really are not interested in unchanged values, you can hide them with an if test:
z[, lapply(.SD, function(x)
if (uniqueN(x)>1) x %>% unique %>% paste(collapse=";")
else ""
), keyby=gene]
# gene .id updated version
# 1: cox15
# 2: cyp1a1 1;2 TRUE;FALSE 2;1
# 3: cyp2a6 1;2 3;2
# 4: slc5a5 1;2 TRUE;FALSE 2;1
# 5: srd5a 1;2
This also hides .id for genes only showing up once, but that can be tweaked.
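For instance, one way to tweak it (a sketch along the same lines, keeping .id visible and blanking only the data columns) would be:
z[, c(list(.id = paste(unique(.id), collapse=";")),
      lapply(.SD, function(x)
        if (uniqueN(x) > 1) paste(unique(x), collapse=";") else "")),
  keyby=gene, .SDcols=setdiff(names(z), c("gene", ".id"))]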
Explanation. z contains all the data, "stacked" or stored in "long" format.
To make the summary table, we use z[, j, keyby=gene] where j works on the Subset of Data, .SD, associated with each keyby=gene group and returns a list of column vectors for the result.
The . %>% unique %>% paste(collapse=";") construct uses a feature of magrittr: a pipe that starts with . defines a function rather than applying the pipeline immediately (a pipe starting with an actual object, like x %>% f, is applied to x right away). It is just an easy-to-read version of function(y) paste(unique(y), collapse=";"), and you can replace it if you prefer to write these in the standard way.
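For example, the same summary written without magrittr would be:
z[, lapply(.SD, function(y) paste(unique(y), collapse=";")), keyby=gene]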

Related

Fastest way to find a big list of strings in a big data table in r

I have a list of around 15,000 user IDs,
> head(ID_data)
[1] "A01Y" "AC43" "BBN5" "JK45" "NT66" "WC44"
and a table with 3 columns and around 100,000 rows as a data.table:
> head(USER_data)
V1 V2 V3
1: 0 John John
2: A01Y Martin 3311290
3: Peter Johnson Peter JK45 x
4: 1 wc44#email.com wc44#email.com
5: NA x
6: 419223 Christian 21221140 ac43#email.com
I want to know the row index of rows that contain a user id somewhere in one of the 3 columns.
In the example above, the code should find row 2, 3, 4 and 6, since they contain "A01Y", "JK45", "WC44" and "AC43" somewhere in one or more of the 3 columns.
The main problem is the sheer amount of data.
I have tried pasting "|" between the IDs and using grep to search for "A01Y|JK45" etc.:
toMatch <- paste(ID_data,collapse="|")
V1.matches <- grep(toMatch, USER_data$V1, ignore.case=TRUE)
V2.matches <- grep(toMatch, USER_data$V2, ignore.case=TRUE)
V3.matches <- grep(toMatch, USER_data$V3, ignore.case=TRUE)
but grep can only take a search pattern of around 2,500 IDs, so I would have to go through the IDs in blocks of 2,500. This takes around 15 minutes to compute.
I have also tried using strapplyc, which can take a search pattern of around 9,999 IDs.
Is there a faster way to find the row indices?
I was thinking of using sqldf() and doing something like
sqldf("SELECT * FROM USER_data, ID_data WHERE USER_data LIKE '%'+ID_data+'%'")
but I'm not sure how to do this exactly.
Thanks a lot in advance for any suggestions.
Not sure if this is fast enough, but I've done it before with many rows and IDs. It took some time, but there is no need to process the IDs in blocks.
# list of ids
IDs = c("A01Y", "AC43", "BBN5", "JK45", "NT66", "WC44")
# example dataframe
dt = data.frame(V1 = c("Christian 21223456","x", "wc44#email.com"),
V2 = c("0 John","1 wc44#email.com", "wc44#email.com"),
V3 = c("1","0","A01Y Martin 3311290"))
dt
# V1 V2 V3
# 1 Christian 21223456 0 John 1
# 2 x 1 wc44#email.com 0
# 3 wc44#email.com wc44#email.com A01Y Martin 3311290
# combine row elements in one big string
dt_rows = apply(dt, 1, function(x) paste(x,collapse = " "))
# update to lower case
IDs = tolower(IDs)
dt_rows = tolower(dt_rows)
# find in which row you have matches
sapply(IDs, grepl, dt_rows)
# a01y ac43 bbn5 jk45 nt66 wc44
# [1,] FALSE FALSE FALSE FALSE FALSE FALSE
# [2,] FALSE FALSE FALSE FALSE FALSE TRUE
# [3,] TRUE FALSE FALSE FALSE FALSE TRUE
# find which row id has a match (at least one match)
which(apply(sapply(IDs, grepl, dt_rows), 1, sum) >= 1)
# [1] 2 3
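If you only need the row indices, the same logical matrix can be reduced in one step (a minor variant of the above):
which(rowSums(sapply(IDs, grepl, dt_rows)) > 0)
# [1] 2 3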

Extract data elements found in a single column

Here is what my data look like.
id interest_string
1 YI{Z0{ZI{
2 ZO{
3 <NA>
4 ZT{
As you can see, there can be multiple codes concatenated into a single column, separated by {. It is also possible for a row to have no interest_string values at all.
How can I manipulate this data frame to extract the values into a format like this:
id interest
1 YI
1 Z0
1 ZI
2 Z0
3 <NA>
4 ZT
I need to complete this task with R.
Thanks in advance.
This is one solution
out <- with(dat, strsplit(as.character(interest_string), "\\{"))
## or
# out <- with(dat, strsplit(as.character(interest_string), "{", fixed = TRUE))
out <- cbind.data.frame(id = rep(dat$id, times = sapply(out, length)),
interest = unlist(out, use.names = FALSE))
Giving:
R> out
id interest
1 1 YI
2 1 Z0
3 1 ZI
4 2 ZO
5 3 <NA>
6 4 ZT
Explanation
The first line of the solution simply splits each element of the interest_string factor in the data object dat, using \\{ as the split pattern. The { has to be escaped, and in an R regular expression that requires two backslashes. (Actually it doesn't if you use fixed = TRUE in the call to strsplit.) The resulting object is a list, which looks like this for the example data:
R> out
[[1]]
[1] "YI" "Z0" "ZI"
[[2]]
[1] "ZO"
[[3]]
[1] "<NA>"
[[4]]
[1] "ZT"
We have almost everything we need in this list to form the output you require. The only thing we need external to this list is the id values that refer to each element of out, which we grab from the original data.
Hence, in the second line, we bind, column-wise (specifying the data frame method so we get a data frame returned) the original id values, each one repeated the required number of times, to the strsplit list (out). By unlisting this list, we unwrap it to a vector which is of the required length as given by your expected output. We get the number of times we need to replicate each id value from the lengths of the components of the list returned by strsplit.
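To make that concrete, the replication counts are just the lengths of the strsplit pieces (computed here on the list before out is overwritten):
sapply(out, length)
# [1] 3 1 1 1
rep(dat$id, times = sapply(out, length))
# [1] 1 1 1 2 3 4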
A nice and tidy data.table solution:
library(data.table)
DT <- data.table( read.table( textConnection("id interest_string
1 YI{Z0{ZI{
2 ZO{
3 <NA>
4 ZT{"), header=TRUE))
DT$interest_string <- as.character(DT$interest_string)
DT[, {
list(interest=unlist(strsplit( interest_string, "{", fixed=TRUE )))
}, by=id]
gives me
id interest
1: 1 YI
2: 1 Z0
3: 1 ZI
4: 2 ZO
5: 3 <NA>
6: 4 ZT
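Equivalently, the braces and explicit list() can be dropped (a minor stylistic variant of the same call):
DT[, .(interest = unlist(strsplit(interest_string, "{", fixed=TRUE))), by=id]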

Subsetting data.table by not head(key(DT),m), using binary search not vector scan

If I specify n columns as a key of a data.table, I'm aware that I can join to fewer columns than are defined in that key as long as I join to the head of key(DT). For example, for n=2 :
X = data.table(A=rep(1:5, each=2), B=rep(1:2, each=5), key=c('A','B'))
X
A B
1: 1 1
2: 1 1
3: 2 1
4: 2 1
5: 3 1
6: 3 2
7: 4 2
8: 4 2
9: 5 2
10: 5 2
X[J(3)]
A B
1: 3 1
2: 3 2
There I only joined to the first column of the 2-column key of DT. I know I can join to both columns of the key like this :
X[J(3,1)]
A B
1: 3 1
But how do I subset using only the second column of the key (e.g. B==2), while still using binary search rather than a vector scan? I'm aware that's a duplicate of:
Subsetting data.table by 2nd column only of a 2 column key, using binary search not vector scan
so I'd like to generalise this question to n. My data set has about a million rows, and the solution provided in the duplicate question linked above doesn't seem optimal.
Here is a simple function that will extract the correct unique values and return a data table to use as a key.
X <- data.table(A=rep(1:5, each=4), B=rep(1:4, each=5),
C = letters[1:20], key=c('A','B','C'))
make.key <- function(ddd, what){
  # the names of the key columns
  zzz <- key(ddd)
  # the key columns for which all unique values are kept
  whichUnique <- setdiff(zzz, names(what))
  ## unique values of those key columns; .. means "look up one level"
  ud <- lapply(ddd[, ..whichUnique], unique)
  ## append the `what` columns and take a Cross Join of the new
  ## key columns
  do.call(CJ, c(ud, what)[zzz])
}
X[make.key(X, what = list(C = c('a','b'))),nomatch=0]
## A B C
## 1: 1 1 a
## 2: 1 1 b
I'm not sure this will be any quicker than a couple of vector scans on a large data.table though.
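For reference, the plain vector-scan version of the same subset is simply:
X[C %in% c('a','b')]
##    A B C
## 1: 1 1 a
## 2: 1 1 b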
Adding secondary keys is on the feature request list:
FR#1007 Build in secondary keys
In the meantime we are stuck with either a vector scan or the approach used in the answer to the n=2 case linked in the question (which @mnel generalises nicely in his answer).

R: Grouping levels of a factor across multiple files

I'm new to R and struggling to group multiple levels of a factor prior to calculating means. This problem is complicated by the fact that I am doing this over hundreds of files that have variable levels of factors that need to be grouped. I see from previous posts how to address this grouping issue for single levels using levels(), but my data is too variable for that method.
Basically, I'd like to calculate both individual and then an overall mean for multiple levels of a factor. For example, I would like to calculate the mean for each species for each of the following factors present in the column Status: Crypt1, Crypt2, Crypt3, Native, Intro, and then also the overall mean for Crypt species (includes Crypt1, Crypt2, and Crypt3, but not Native or Intro). However, a species either has multiple levels of Crypt (variable, and up to Crypt8), or has Native and Intro, and means for all species at each of these levels are ultimately averaged into the same summary sheet.
For example:
Species Status Value
A Crypt1 5
A Crypt1 6
A Crypt2 4
A Crypt2 8
A Crypt3 10
A Crypt3 50
B Native 2
B Native 9
B Intro 9
B Intro 10
I was thinking that I could use the first letter of each factor to group the Crypt factors together, but I am struggling to target the first letter because they are factors, not strings, and I am not sure how to convert between them. I'm ultimately calculating the means using aggregate(), and I can get individual means for each factor, but not for the grouped factors.
Any ideas would be much appreciated, thanks!
For the individual means:
# assuming your data is in data.frame = df
require(plyr)
df.1 <- ddply(df, .(Species, Status), summarise, ind.m.Value = mean(Value))
> df.1
# Species Status ind.m.Value
# 1 A Crypt1 5.5
# 2 A Crypt2 6.0
# 3 A Crypt3 30.0
# 4 B Intro 9.5
# 5 B Native 5.5
For the overall mean, the idea is to remove the numbers present at the end of every entry in Status using sub/gsub.
df.1$Status2 <- gsub("[0-9]+$", "", df.1$Status)
df.2 <- ddply(df.1, .(Species, Status2), summarise, oall.m.Value = mean(ind.m.Value))
> df.2
# Species Status2 oall.m.Value
# 1 A Crypt 13.83333
# 2 B Intro 9.50000
# 3 B Native 5.50000
Is this what you're expecting?
Here's an alternative. Conceptually, it is the same as Arun's answer, but it sticks to functions in base R, and in a way, keeps your workspace and original data somewhat tidy.
I'm assuming we're starting with a data.frame named "temp" and that we want to create two new data.frames, "T1" and "T2" for individual and grouped means.
# Verify that you don't have T1 and T2 in your workspace
ls(pattern = "T[1|2]")
# character(0)
# Use `with` to generate T1 (individual means)
# and to generate T2 (group means)
with(temp, {
  T1 <<- aggregate(Value ~ Species + Status, temp, mean)
  temp$Status <- gsub("\\d+$", "", Status)
  T2 <<- aggregate(Value ~ Species + Status, temp, mean)
})
# Now they're there!
ls(pattern = "T[1|2]")
# [1] "T1" "T2"
Notice that we used <<- to assign the results from within with to the global environment. Not everyone likes using that, but I think it is OK in this particular case. Here is what "T1" and "T2" look like.
T1
# Species Status Value
# 1 A Crypt1 5.5
# 2 A Crypt2 6.0
# 3 A Crypt3 30.0
# 4 B Intro 9.5
# 5 B Native 5.5
T2
# Species Status Value
# 1 A Crypt 13.83333
# 2 B Intro 9.50000
# 3 B Native 5.50000
Looking back at the with command, it might have seemed like we had changed the value of the "Status" column. However, that was only within the environment created by using with. Your original data.frame is the same as it was when you started.
temp
# Species Status Value
# 1 A Crypt1 5
# 2 A Crypt1 6
# 3 A Crypt2 4
# 4 A Crypt2 8
# 5 A Crypt3 10
# 6 A Crypt3 50
# 7 B Native 2
# 8 B Native 9
# 9 B Intro 9
# 10 B Intro 10
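If you would rather avoid with() and <<- entirely, a plain two-step version (a sketch that works on a temporary modified copy, here called temp2) produces the same two tables:
T1 <- aggregate(Value ~ Species + Status, temp, mean)
temp2 <- transform(temp, Status = gsub("\\d+$", "", Status))  # copy with the trailing digits stripped
T2 <- aggregate(Value ~ Species + Status, temp2, mean)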

Fast grouping by list column subsets in data.table

I am working with a large (millions of rows) data table with a list column containing deeply nested lists, which do not have uniform structure, size or order of elements (list(x=1,y=2) and list(y=2,x=1) may both be present and should be treated as identical). I need to repeatedly perform arbitrary groupings that include some columns from the data table as well as a subset of the data in the list column. Not all rows have values that will match the subset.
The approach I've come up with feels overly complicated. Here are the key points:
Identifying values in a nested list structure. My approach is to use ul <- unlist(list_col), which "flattens" nested data structures and builds hierarchical names for direct access to each element, e.g., address.country.code.
Ensuring that permutations of the same unlisted data are considered equal from a grouping standpoint. My approach is to order the unlisted vectors by the names of their values via ul[order(names(ul))] and assign the result as a new character vector column by reference.
Performing grouping on subsets of the flattened values. I was not able to get by= to work in any way with a column whose values are lists or vectors. Therefore, I had to find a way to map unique character vectors to simple values. I did this with digest.
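As a small illustration of points 1 and 2 on a single nested value (not part of the functions below):
ul <- unlist(list(y=2, x=1, address=list(country=list(code="US"))))
ul[order(names(ul))]
# address.country.code                    x                    y
#                 "US"                  "1"                  "2"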
Here are the two workhorse functions:
# Flatten list column in a data.table
flatten_list_col <- function(dt, col_name, flattened_col_name='props') {
  flatten_props <- function(d) {
    if (length(d) > 0) {
      ul <- unlist(d)
      names <- names(ul)
      if (length(names) > 0) {
        ul[order(names)]
      } else {
        NA
      }
    } else {
      NA
    }
  }
  flattened <- lapply(dt[[col_name]], flatten_props)
  dt[, as.character(flattened_col_name) := list(flattened), with=F]
}
# Group by properties in a flattened list column
group_props <- function(prop_group, prop_col_name='props') {
  substitute({
    l <- lapply(eval(as.name(prop_col_name)), function(x) x[names(x) %in% prop_group])
    as.character(lapply(l, digest))
  }, list(prop_group=prop_group, prop_col_name=prop_col_name))
}
Here is a reproducible example:
library(data.table)
dt <- data.table(
id=c(1,1,1,2,2,2),
count=c(1,1,2,2,3,3),
d=list(
list(x=1, y=2),
list(y=2, x=1),
list(x=1, y=2, z=3),
list(y=5, abc=list(a=1, b=2, c=3)),
NA,
NULL
)
)
flatten_list_col(dt, 'd')
dt[, list(total=sum(count)), by=list(id, eval(group_props(c('x', 'y'))))]
The output is:
> flatten_list_col(dt, 'd')
id count d props
1: 1 1 <list> 1,2
2: 1 1 <list> 1,2
3: 1 2 <list> 1,2,3
4: 2 2 <list> 1,2,3,5
5: 2 3 NA NA
6: 2 3 NA
> dt[, list(total=sum(count)), by=list(id, eval(group_props(c('x', 'y'))))]
id group_props total
1: 1 325c6bbb2c33456d0301cf3909dd1572 4
2: 2 7aa1e567cd0d6920848d331d3e49fb7e 2
3: 2 ee7aa3b9ffe6bffdee83b6ecda90faac 6
This approach works but is pretty inefficient because of the need to flatten & order the lists and because of the need to calculate digests. I'm wondering about the following:
Can this be done without having to create a flattened column by instead retrieving values directly from the list column? This will probably require specifying selected properties as expressions as opposed to simple names.
Is there a way to get around the need for digest?
There are a number of issues here. The most important (and one you haven't hit yet because of the others) is that you are assigning by reference but trying to insert more values than the column has rows, which assignment by reference cannot accommodate.
Take this very simple example
DT <- data.table(x=1, y = list(1:5))
DT[,new := unlist(y)]
Warning message:
In `[.data.table`(DT, , `:=`(new, unlist(y))) :
Supplied 5 items to be assigned to 1 items of column 'new' (4 unused)
You will lose all but the first nrow(DT) items in the newly created column, and they won't correspond to the rows of the data.table.
Therefore you will have to create a new data.table that will be large enough for you to explode these list variables. This won't be possible by reference.
newby <- dt[,list(x, props = as.character(unlist(data))), by = list(newby = seq_len(nrow(dt)))][,newby:=NULL]
newby
x props
1: 1 1
2: 1 2
3: 1 2
4: 1 1
5: 1 10
6: 2 1
7: 2 2
8: 2 3
9: 2 5
10: 2 1
11: 2 2
12: 2 3
13: 3 NA
14: 3 NA
Note that as.character is required to ensure that all values are the same type, and a type that won't lose data in the conversion. At the moment you have a logical NA value amongst lists of numeric / integer data.
Another edit forces all components to be character (even the NA); props is now a list with one character vector for each row.
flatten_props <- function(data) {
  if (is.list(data)) {
    ul <- unlist(data)
    if (length(ul) > 1) {
      ul <- ul[order(names(ul))]
    }
    as.character(ul)
  } else {
    as.character(unlist(data))
  }
}
dt[, props := lapply(data, flatten_props)]
dt
x data props
1: 1 <list> 1,2
2: 1 <list> 10,1,2
3: 2 <list> 1,2,3
4: 2 <list> 1,2,3,5
5: 3 NA NA
6: 3
dt[,lapply(props,class)]
V1 V2 V3 V4 V5 V6
1: character character character character character character
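On the digest question: since props is now a plain character vector per row, one way to group without hashing (a sketch using this answer's dt, with columns x, data and props) is to collapse each vector into a single string and group on that:
dt[, .(n = .N), by = .(x, key = sapply(props, paste, collapse=";"))]
(To mimic group_props, the names would have to be preserved during flattening so that each vector could first be subset by name before collapsing.)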

Resources