How to join data.tables when one is a lookup table? - r

I'm having trouble applying a simple data.table join example to a larger (10GB) data set. merge() works just fine on data.frames with the larger dataset, although I'd love to take advantage of the speed in data.table. Could anyone point out what I'm misunderstanding about data.table (and the error message in particular)?
Here is the simple example (derived from this thread: Join of two data.tables fails).
# The data of interest.
(DT <- data.table(id = c(rep(1154:1155, 2), 1160),
                  price = c(1.99, 2.50, 15.63, 15.00, 0.75),
                  key = "id"))
id price
1: 1154 1.99
2: 1154 15.63
3: 1155 2.50
4: 1155 15.00
5: 1160 0.75
# Lookup table.
(lookup <- data.table(id = 1153:1160,
                      version = c(1, 1, 3, 4, 2, 1, 1, 2),
                      yr = rep(2006, 4),
                      key = "id"))
id version yr
1: 1153 1 2006
2: 1154 1 2006
3: 1155 3 2006
4: 1156 4 2006
5: 1157 2 2006
6: 1158 1 2006
7: 1159 1 2006
8: 1160 2 2006
# The desired table. Note: lookup[DT] works as well.
DT[lookup, allow.cartesian = T, nomatch=0]
id price version yr
1: 1154 1.99 1 2006
2: 1154 15.63 1 2006
3: 1155 2.50 3 2006
4: 1155 15.00 3 2006
5: 1160 0.75 2 2006
The larger data set consists of two data.frames: temp.3561 (the dataset of interest) and temp.versions (the lookup dataset). They have the same structure as DT and lookup (above), respectively. Using merge() works well; however, my application of data.table is clearly flawed:
# Merge data.frames: works just fine
long.merged <- merge(temp.versions, temp.3561, by = "id")
# Convert the data.frames to data.tables
DTtemp.3561 <- as.data.table(temp.3561)
DTtemp.versions <- as.data.table(temp.versions)
# Merge the data.tables: doesn't work
setkey(DTtemp.3561, id)
setkey(DTtemp.versions, id)
DTlong.merged <- merge(DTtemp.versions, DTtemp.3561, by = "id")
Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), :
Join results in 11277332 rows; more than 7946667 = max(nrow(x),nrow(i)). Check for duplicate
key values in i, each of which join to the same group in x over and over again. If that's ok,
try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the
large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE.
Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-
help for advice.
DTtemp.versions has the same structure as lookup (in the simple example), and the key "id" consists of 779,473 unique values (no duplicates).
DTtemp.3561 has the same structure as DT (in the simple example) plus a few other variables, but its key "id" only has 829 unique values despite the 7,946,667 observations (lots of duplicates).
Since I'm just trying to add version numbers and years from DTtemp.versions to each observation in DTtemp.3561, the merged data.table should have the same number of observations as DTtemp.3561 (7,946,667). Specifically, I don't understand why merge() generates "excess" observations when using data.table but not when using data.frame.
Likewise
# Same error message, but with 12,055,777 observations
altDTlong.merged <- DTtemp.3561[DTtemp.versions]
# Same error message, but with 11,277,332 observations
alt2DTlong.merged <- DTtemp.versions[DTtemp.3561]
Including allow.cartesian=T and nomatch=0 doesn't drop the "excess" observations.
Oddly, if I truncate the dataset of interest to have 10 observations, merge() works fine on both data.frames and data.tables.
# Merge short DF: works just fine
short.3561 <- temp.3561[-(11:7946667),]
short.merged <- merge(temp.versions, short.3561, by = "id")
# Merge short DT
DTshort.3561 <- data.table(short.3561, key = "id")
DTshort.merged <- merge(DTtemp.versions, DTshort.3561, by = "id")
I've been through the FAQ (http://datatable.r-forge.r-project.org/datatable-faq.pdf, and 1.12 in particular). How would you suggest thinking about this?

Could anyone point out what I'm misunderstanding about data.table (and the error message in particular)?
To answer your question directly: the error message
Join results in 11277332 rows; more than 7946667 = max(nrow(x),nrow(i)). Check for duplicate key values in i...
states that your join produces more rows than the usual case expects. This means the lookup table's key has duplicate values, each of which matches the same group in the other table again and again, multiplying the result.
If that doesn't answer your question, you should restate it.
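To see where the duplicate keys come from, you can count rows per id in each table before joining; a minimal diagnostic sketch, assuming the objects are named as in the question:
library(data.table)
# Any id with N > 1 joins to every matching row in the other table;
# ids duplicated in *both* tables multiply out in the result.
DTtemp.versions[, .N, by = id][N > 1]
DTtemp.3561[, .N, by = id][N > 1]
# Quick sanity check on the number of distinct keys per table
uniqueN(DTtemp.versions$id)
uniqueN(DTtemp.3561$id)
If DTtemp.versions really is unique on id, the keyed join cannot return more rows than DTtemp.3561 has, so any N > 1 reported by the first line above is the culprit.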

Related

Find columns with different values in duplicate rows

I have a data set that has some duplicate records. For those records, most of the column values are the same, but a few ones are different.
I need to identify the columns where the values are different, and then subset those columns.
This would be a sample of my dataset:
library(data.table)
dat <- "ID location date status observationID observationRep observationVal latitude longitude setSource
FJX8KL loc1 2018-11-17 open 445 1 17.6 -52.7 -48.2 XF47
FJX8KL loc2 2018-11-17 open 445 2 1.9 -52.7 -48.2 LT12"
dat <- setDT(read.table(textConnection(dat), header=T))
And this is the output I would expect:
observationRep observationVal setSource
1: 1 17.6 XF47
2: 2 1.9 LT12
One detail is: my original dataset has 189 columns, so I need to check all of them.
How to achieve this?
Two issues: first, use the text= argument rather than textConnection(); second, use as.data.table(), because setDT() modifies an object by reference and at that point the object doesn't exist yet.
dat1 <- data.table::as.data.table(read.table(text=dat, header=TRUE))
dat1[, c('observationRep', 'observationVal', 'setSource')]
# observationRep observationVal setSource
# 1: 1 17.6 XF47
# 2: 2 1.9 LT12
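If you need to find the differing columns programmatically (you mention 189 columns), a possible sketch using data.table's uniqueN is to count the distinct values of every column within each duplicated ID; note that on the sample data this also flags location, which differs between the two rows:
cols <- setdiff(names(dat1), "ID")
# Distinct values of every column within each ID group
n_distinct <- dat1[, lapply(.SD, uniqueN), by = ID, .SDcols = cols]
# A column "differs" if any ID group holds more than one distinct value for it
differs <- sapply(n_distinct[, ..cols], function(x) any(x > 1))
diff_cols <- names(differs)[differs]
dat1[, ..diff_cols]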

Override bad/wrong values in a main table with NA or null values listed on another lookup table in R

The main table is large.
It has certain undesired values that I want to override.
I am writing into a lookup table the keys and the new_value (NA) to override with.
Both tables have 2 keys (session_id and datetime); neither one is unique on its own.
Other similar questions go into replacing an NA with a value, but I want to replace a value with an NA, i.e. clear the cells' contents.
The 2 keys limit the use of match(), which can handle only one key and first occurrences.
left_join or merge operations would create a new large dataframe with added columns, fill them with NA for every non-matching row, and would also require some 'coalescing' into an NA value, which I guess doesn't exist.
I don't want to remove the entire row, as there are many other columns with their own values. I just want to delete the value from those cells.
I think that, in short, it is just an assignment to a subset filtered on the 2 keys. Something like:
table[ lookup_paired_keys(session_ids, lookup_datetimes) ] <- NA
Below is a sample dataset with undesired "0" values to be replaced by NA. The real dataset may contain other kinds of values.
table <- read.table(text = "
session_id datetime CaloriesDaily
1233815059 2016-05-01 5555
8583815123 2016-05-03 4444
8512315059 2016-05-04 2432
8583815059 2016-05-12 0
6290855005 2016-05-10 0
8253242879 2016-04-30 0
1503960366 2016-05-20 0
1583815059 2016-05-19 2343
8586545059 2016-05-20 1111
1290855005 2016-05-11 5425
1253242879 2016-04-25 1234
1111111111 2016-05-09 6542", header = TRUE)
table$datetime = as.POSIXct(table$datetime, tz='UTC')
table
lookup <- read.table(text = "
session_id datetime CaloriesDaily
8583815059 2016-05-12 NA
6290855005 2016-05-10 NA
8253242879 2016-04-30 NA
1503960366 2016-05-12 NA", header = TRUE)
lookup$datetime = as.POSIXct(lookup$datetime, tz='UTC')
lookup$CaloriesDaily = as.numeric(lookup$CaloriesDaily)
lookup
SOLVED
After reading the accepted answer, I want to share the final outcome.
And since my main table is a data.table and I got some warnings regarding nomenclature, be aware that I am no expert, but this works with the example dataset and with my own.
lookup_by : Standard Lookup operation
lookup_by <- function(table, lookup, by) {
  merge(table, lookup, by = by)
}
### usage ###
keys = c('session_id','datetime')
lookup_by( table, lookup, keys)
Adopted solution: match_by
Like match(), but with multiple keys.
It returns vectors of row numbers where the keys match,
so that an assignment like table[ ..matches.. ] <- NA is possible.
match_by <- function(table, lookup, by) {
  table <- setDT(table)[, ..by]
  table$idx1 <- 1:nrow(table)
  lookup <- setDT(lookup)[, ..by]
  lookup$idx2 <- 1:nrow(lookup)
  m <- merge(table, lookup, by = by)
  return(m[, c('idx1', 'idx2')])
}
### usage ###
keys = c('session_id','datetime')
rows = match_by( table, lookup, keys)
overrides <- c(lookup[ rows$idx2, 'CaloriesDaily' ])
table[ rows$idx1, 'CaloriesDaily' ] <- overrides
table
Here’s a solution using dplyr::semi_join() and dplyr::anti_join() to split your dataframe based on whether the id and date keys match your lookup table. I then assign NAs in just the subset with matching keys, then row-bind the subsets back together. Note that this solution doesn’t preserve the original row order.
library(dplyr)
table_ok_vals <- table %>%
  anti_join(lookup, by = c("session_id", "datetime"))
table_replaced_vals <- table %>%
  semi_join(lookup, by = c("session_id", "datetime")) %>%
  mutate(CaloriesDaily = NA_real_)
table <- bind_rows(table_ok_vals, table_replaced_vals)
table
Output:
session_id datetime CaloriesDaily
1 1233815059 2016-05-01 5555
2 8583815123 2016-05-03 4444
3 8512315059 2016-05-04 2432
4 1503960366 2016-05-20 0
5 1583815059 2016-05-19 2343
6 8586545059 2016-05-20 1111
7 1290855005 2016-05-11 5425
8 1253242879 2016-04-25 1234
9 1111111111 2016-05-09 6542
10 8583815059 2016-05-12 NA
11 6290855005 2016-05-10 NA
12 8253242879 2016-04-30 NA
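For completeness: since the question keeps the main table as a data.table, an update join would also do the replacement in place and preserve the original row order. A minimal sketch, assuming the same column names as above:
library(data.table)
setDT(table)
setDT(lookup)
# For every row of `table` whose (session_id, datetime) pair appears in
# `lookup`, overwrite CaloriesDaily with the looked-up value (NA here)
table[lookup, on = .(session_id, datetime), CaloriesDaily := i.CaloriesDaily]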

Obtain the same cross join with merge() and sqldf::sqldf()

I have two data frames: Sales and Clients. I want to perform cross joins on these data frames using sqldf::sqldf() and also using merge() and obtain the exact same result with both methods.
So far I've only been able to obtain two data frames with the rows ordered differently.
This is the code to generate the Sales and Clients data frames:
set.seed(1)
Sales <- data.frame(
  Product = sample(c("Toaster", "Radio", "TV"), size = 7, replace = TRUE),
  CustomerID = c(rep("1_2019", 2), paste(2:3, "2019", sep = "_"), paste(1:3, "2020", sep = "_"))
)
Sales$Price <- round(ifelse(Sales$Product == "TV", rnorm(1, 400, 20),
                     ifelse(Sales$Product == "Toaster", rnorm(1, 40, 2),
                            rnorm(1, 35, 2))))
Clients <- data.frame(
  CustomerID = c(paste(2:4, "2019", sep = "_"), paste(1:2, "2020", sep = "_")),
  State = sample(c("CA", "AZ", "IL", "MA"), size = 5, replace = TRUE)
)
This is what I got:
library(sqldf)
# cross join with base R
out1 <- merge(x = Sales, y = Clients, by = NULL)
# cross join with sqldf
out2 <- sqldf("SELECT *
FROM Sales
CROSS JOIN Clients")
out1 and out2 have different row orderings. How can I tweak the sqldf() call in order for out1 and out2 to be exactly the same?
This is the closest I got:
merge(x = Sales, y = Clients, by = NULL)
sqldf("SELECT *
FROM Sales
CROSS JOIN Clients
ORDER BY State DESC, Clients.CustomerID")
I think including ORDER BY in sqldf is important, since it drives home the fact that in SQL, ordering is never guaranteed unless explicitly directed.
If you were doing a simple ORDER BY with just "increasing" on both variables, then the translation to order in R would be direct. However, since one variable is decreasing and one is increasing, order on its own doesn't handle that. As suggested by https://stackoverflow.com/a/3316719, though, we can do the same with xtfrm.
out1 <- merge(x = Sales, y = Clients, by = NULL)
out1 <- out1[order(-xtfrm(out1$State), out1$CustomerID.y),]
out2 <- sqldf::sqldf(
"SELECT *
FROM Sales
CROSS JOIN Clients
ORDER BY State DESC, Clients.CustomerID")
### proof they are identical
all(unlist(Map(`==`, out1, out2)))
# [1] TRUE
The xtfrm helper function here allows us to negate the "values" of a column for the purposes of sorting. From ?xtfrm:
A generic auxiliary function that produces a numeric vector which will sort in the same order as 'x'.
If the field were already numeric, we could merely do order(-State, CustomerID.y), but the fact that it is character requires the further step. Ergo, xtfrm.
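As a tiny illustration of what xtfrm does with a character vector (hypothetical values, not the question's data):
s <- c("CA", "AZ", "MA")
xtfrm(s)              # 2 1 3 -- numeric ranks that sort in the same order as s
s[order(-xtfrm(s))]   # "MA" "CA" "AZ" -- descending, via the negated ranks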
Edit: in comments, it's determined that the OP wants to mimic the sort-order of merge in the SQL statement. Unfortunately, because this is a cartesian product of the two frames, no sorting is applied: merge merely cbinds all rows of the first frame against the first row of the second frame, then repeats with each row of the second.
This can be demonstrated by using some code from merge:
nx <- nrow(x) # Sales
ny <- nrow(y) # Clients
expand.grid(seq_len(nx), seq_len(ny))
# Var1 Var2
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 5 1
# 6 1 2
# ...
# 33 3 7
# 34 4 7
# 35 5 7
where each number is a row from the respective frames (x for Var1, y for Var2). If the original data is:
## Sales ## Clients
Product CustomerID Price CustomerID State
1 Toaster 1_2019 37 1 2_2019 AZ
2 Radio 1_2019 33 2 3_2019 MA
3 Radio 2_2019 33 3 4_2019 AZ
4 TV 3_2019 408 4 1_2020 IL
5 Toaster 1_2020 37 5 2_2020 MA
6 TV 2_2020 408
7 TV 3_2020 408
then this results in
out1
# Product CustomerID.x Price CustomerID.y State
# 1 Toaster 1_2019 37 2_2019 AZ
# 2 Radio 1_2019 33 2_2019 AZ
# 3 Radio 2_2019 33 2_2019 AZ
# 4 TV 3_2019 408 2_2019 AZ
# 5 Toaster 1_2020 37 2_2019 AZ
# 6 TV 2_2020 408 2_2019 AZ
# 7 TV 3_2020 408 2_2019 AZ
# 8 Toaster 1_2019 37 3_2019 MA
# ...
# 33 Toaster 1_2020 37 2_2020 MA
# 34 TV 2_2020 408 2_2020 MA
# 35 TV 3_2020 408 2_2020 MA
which will very much destroy any sorting present in x (Sales), even if y (Clients) comes pre-sorted (which it does).
Because of this, if you want congruity between R and SQL cross-join solutions, I suggest the most transparent/clear way is to merge in R and then apply post-merge ordering that mirrors what you do in SQL. In fact, from a pedagogic perspective, ask the question: "What ordering makes sense to humans?" Assert during the lesson plan that ordering is never assured until explicitly strong-armed into the process (via dplyr::arrange, x[order(...),], or SQL's ORDER BY clause), then find the intuitive ordering of the data and demonstrate it in both R and SQL.
Side notes:
Your sqldf query results in same-named columns; this causes errors after the sqldf call if you start working with those columns. It can be mitigated with SELECT ... AS ... field-naming (see the sketch after these notes).
Lexicographic sorting of your data is unfortunately counter-intuitive at the moment: having year at the end of a customer id suggests (yes, I'm inferring) a timeline of customer onboarding, yet they will sort first by the leading number. Similar to how "2020-05-04" sorts correctly even as a string, while "05/04/2020" does not, it might support more intuitive sorting to have the most-significant portion be the leading part of id strings. Or make them integers. Or UUIDs (v4, of course), those are always fun.
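A minimal sketch of that renaming, assuming the Sales and Clients frames from the question (the alias names are arbitrary):
library(sqldf)
out2 <- sqldf("SELECT s.Product,
                      s.CustomerID AS SalesCustomerID,
                      s.Price,
                      c.CustomerID AS ClientCustomerID,
                      c.State
               FROM Sales s
               CROSS JOIN Clients c
               ORDER BY c.State DESC, c.CustomerID")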

Equivalent of SAS format (in R)

Suppose I have a dataframe:
sick <- c("daa12", "daa13", "daa14", "daa15", "daa16", "daa17")
code <- c("heart", "heart", "lung", "lung", "cancer", "cancer")
sick_code <- data.frame(sick, code)
And another:
pid <- abs(round(rnorm(6)*1000,0))
sick <- c("-" , "-", "-", "-", "daa16", "SO")
p_sick <- data.frame(pid, sick)
Now I would like to add a new variable to p_sick that "translates" p_sick$sick to sick_code$code. The value in p_sick$sick is a string which may or may not appear in sick_code$sick; when it does not, NA should be returned.
Now I could write a for loop with a simple ifelse statement, but the data I have is 150 million rows long, and the translation table is 15,000 rows long.
I have googled that this is the equivalent of a "proc format" in SAS (but I do not have access to SAS, nor do I have any idea how it works).
Perhaps some variant of merge in plyr, or an apply function?
EDIT: I have accepted both answers, since they work.
I will try to look into the difference (in speed) between the two. Since merge is a built-in function, I am guessing it does lots of checking.
EDIT2: To people getting here via Google: merge has a sort = FALSE argument, which will speed things up. Note that the order is not preserved in any way.
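For example, a sketch of such a call using the frames defined above (the result name is arbitrary):
# Keeps every row of p_sick; sort = FALSE skips sorting the result on the by column
merged <- merge(p_sick, sick_code, by = "sick", all.x = TRUE, sort = FALSE)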
data.table will be suitable in your example:
library(data.table)
setkey(setDT(p_sick),sick)
p_sick[setDT(sick_code),code := i.code][]
pid sick code
1: 3137 - NA
2: 755 - NA
3: 1327 - NA
4: 929 - NA
5: 939 daa16 cancer
6: 906 SO NA
Please see here for a detailed explanation.
You could use merge with all.x = TRUE (to keep values from p_sick with no match in sick_code):
merge(p_sick, sick_code, all.x = TRUE)
An equivalent is using left_join from dplyr:
library(dplyr)
left_join(p_sick, sick_code)
# pid sick code
# 1 212 - <NA>
# 2 2366 - <NA>
# 3 325 - <NA>
# 4 269 - <NA>
# 5 501 daa16 cancer
# 6 1352 SO <NA>
Note that each of these solutions works only because the name sick is shared between the two data frames. Suppose they had different names- say the column was called sickness in sick_code. You could accommodate this with, respectively:
merge(p_sick, sick_code, by.x = "sick", by.y = "sickness", all.x = TRUE)
# or
left_join(p_sick, sick_code, c(sick = "sickness"))
A simple named vector will also work. The named vector can act as a lookup. So instead of defining sick and code as a data frame, define it as a named vector and use it as a decode. Like this:
# Set up named vector
sick_decode <- c("heart", "heart", "lung", "lung", "cancer", "cancer")
names(sick_decode) <- c("daa12", "daa13", "daa14", "daa15", "daa16", "daa17")
# Prepare data
pid <- abs(round(rnorm(6)*1000,0))
sick <- c("-" , "-", "-", "-", "daa16", "SO")
p_sick <- data.frame(pid, sick)
# Create new variable using decode
p_sick$sick_decode <- sick_decode[p_sick$sick]
# Results
#> pid sick sick_decode
#> 1 511 - <NA>
#> 2 1619 - <NA>
#> 3 394 - <NA>
#> 4 641 - <NA>
#> 5 53 daa16 cancer
#> 6 244 SO <NA>
I suspect this method will also be fast, but have not benchmarked it.
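A sketch of how the approaches could be timed, if you want to check (results omitted; assumes the microbenchmark package is installed):
library(microbenchmark)
microbenchmark(
  named_vector = sick_decode[as.character(p_sick$sick)],
  merge        = merge(p_sick, sick_code, by = "sick", all.x = TRUE),
  times        = 100
)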
Also, there is now an R package specifically for replicating SAS format functionality in R. It is called fmtr.

Merge two dataframes with repeated columns

I have several .csv files, each one corresponding to a monthly list of customers and some information about them. Each file consists of the same information about customers such as:
names(data.jan)
ID AGE CITY GENDER
names(data.feb)
ID AGE CITY GENDER
To simplify, I will consider only two months, January and February, but my real set of csv files goes from January to November.
Considering a "customer X", I have three possible scenarios:
1 - Customer X is listed in the January database, but he left and is not listed in February
2 - Customer X is listed in both the January and February databases
3 - Customer X entered the database in February, so he is not listed in January
I am stuck on the following problem: I need to create a single database with all customers from both dataframes and their respective information. However, for a customer that is listed in both dataframes, I want to take his information from his first entry, that is, January.
When I use merge, I have four options, according to http://www.dummies.com/how-to/content/how-to-use-the-merge-function-with-data-sets-in-r.html
data <- merge(data.jan,data.feb, by="ID", all=TRUE)
Regardless of whether I choose all, all.x, or all.y, I get the same undesired output in data:
data[1,]
ID AGE.x CITY.x GENDER.x AGE.y CITY.y GENDER.y
123 25 NY M 25 NY M
I think that what would work here is to merge both databases with this type of join (image omitted), and then merge the resulting dataframe with data.jan using a full outer join. But I don't know how to code this in R.
Thanks,
Bernardo
d1 <- data.frame(x=1:9,y=1:9,z=1:9)
d2 <- data.frame(x=1:10,y=11:20,z=21:30) # example data
d3 <- merge(d1,d2, by="x", all=TRUE) #merge
# keep the original columns from janary (i.e. y.x, z.x)
# but replace the NAs in those columns with the data from february (i.e. y.y,z.y )
d3[is.na(d3[,2]) ,][,2:3] <- d3[is.na(d3[,2]) ,][, 4:5]
#> d3[, 1:3]
# x y.x z.x
#1 1 1 1
#2 2 2 2
#3 3 3 3
#4 4 4 4
#5 5 5 5
#6 6 6 6
#7 7 7 7
#8 8 8 8
#9 9 9 9
#10 10 20 30
This may be tiresome for more than 2 months though; perhaps you should consider #flodel's comments. Also note there are demons when your original Jan data has NAs (and you still want the first month's data, NA or not, retained), although you never mentioned them in your question.
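To make that NA caveat concrete, a small hypothetical illustration (not the question's data):
d1 <- data.frame(x = 1:3, y = c(1, NA, 3), z = 1:3)    # "Jan" with a genuine NA
d2 <- data.frame(x = 1:3, y = 11:13,       z = 21:23)  # "Feb"
d3 <- merge(d1, d2, by = "x", all = TRUE)
d3[is.na(d3[, 2]), ][, 2:3] <- d3[is.na(d3[, 2]), ][, 4:5]
d3[, 1:3]
# Row 2's genuine January NA is silently replaced by February's values,
# which may not be what you want.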
Try:
data <- merge(data.jan,data.frame(ID=data.feb$ID), by="ID")
although I haven't tested it since there is no data to test with, but if you just join the ID column from Feb, it should filter out anything that isn't in both frames.
#user1317221_G's solution is excellent. If your tables are large (lots of customers), data tables might be faster:
library(data.table)
# some sample data
jan <- data.table(id=1:10, age=round(runif(10,25,55)), city=c("NY","LA","BOS","CHI","DC"), gender=rep(c("M","F"),each=5))
new <- data.table(id=11:16, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
feb <- rbind(jan[6:10,],new)
new <- data.table(id=17:22, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
mar <- rbind(jan[1:5,],new)
setkey(jan,id)
setkey(feb,id)
join <- data.table(merge(jan, feb, by="id", all=T))
join[is.na(age.x) , names(join)[2:4]:= join[is.na(age.x),5:7,with=F]]
Edit: This adds processing for multiple months.
f <- function(x, y) {
  setkey(x, id)
  setkey(y, id)
  join <- data.table(merge(x, y, by = "id", all = TRUE))
  join[is.na(age.x), names(join)[2:4] := join[is.na(age.x), 5:7, with = FALSE]]
  join[, names(join)[5:7] := NULL]                  # get rid of extra columns
  setnames(join, 2:4, c("age", "city", "gender"))   # rename columns that remain
  return(join)
}
Reduce("f",list(jan,feb,mar))
Reduce(...) applies the function f(...) to the elements of the list in turn, so first to jan and feb, and then to the result and mar, etc.
