full_join by date plus one or minus one - r

I want to use full_join to join two tables. Below is my pseudo code:
join <- full_join(a, b, by = c("a_ID" = "b_ID" , "a_DATE_MONTH" = "b_DATE_MONTH" +1 | "a_DATE_MONTH" = "b_DATE_MONTH" -1 | "a_DATE_MONTH" = "b_DATE_MONTH"))
a_DATE_MONTH and b_DATE_MONTH are in date format "%Y-%m".
I want to do full join based on condition that a_DATE_MONTH can be one month prior to b_DATE_MONTH, OR one month after b_DATE_MONTH, OR exactly equal to b_DATE_MONTH. Thank you!

While SQL allows for (almost) arbitrary conditions in a join statement (such as a_month = b_month + 1 OR a_month + 1 = b_month), I have not found dplyr to offer the same flexibility.
The only way I have found to join in dplyr on anything other than a_column = b_column is to do a more general join and filter afterwards. Hence I recommend you try something like the following:
join <- full_join(a, b, by = c("a_ID" = "b_ID")) %>%
  filter(abs(a_DATE_MONTH - b_DATE_MONTH) <= 1)
This approach still produces the same records in your final results.
It may perform worse/slower if R does a complete full join before doing any filtering. However, dplyr is designed to use lazy evaluation, which means that (unless you do something unusual) both commands should be evaluated together, much as they would be in a more complex SQL join.
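One caveat: if a_DATE_MONTH and b_DATE_MONTH really are "%Y-%m" strings, they cannot be subtracted directly, so the filter needs a numeric month value. A minimal sketch, where month_index() is a hypothetical helper that converts "%Y-%m" into a running month count:
library(dplyr)

# hypothetical helper: turn "2017-03" into a single month number (year*12 + month)
month_index <- function(ym) {
  d <- as.Date(paste0(ym, "-01"))
  as.integer(format(d, "%Y")) * 12L + as.integer(format(d, "%m"))
}

join <- full_join(a, b, by = c("a_ID" = "b_ID")) %>%
  filter(abs(month_index(a_DATE_MONTH) - month_index(b_DATE_MONTH)) <= 1)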

How to not use loops & IF-statements in R

I have two dataframes in R, one big but incomplete (import), and I want to create a smaller, complete subset of it (export). Every ID in the $unique_name column is unique and does not appear twice. Other columns might be, for example, body mass, but also other categories that correspond to the unique ID. I've written this code, a double loop with an if-statement, and it does work, but it is slow:
for (j in 1:length(export$unique_name)){
  for (i in 1:length(import$unique_name)){
    if (toString(export$unique_name[j]) == toString(import$unique_name[i])){
      export$body_mass[j] <- import$body_mass[i]
    }
  }
}
I'm not very good with R but I know this is a bad way to do it. Any tips on how I can do it with functions like apply() or perhaps the plyr package?
Bjørn
There are many functions to do this. Check out:
library(compare)
compare(DF1,DF2,allowAll=TRUE)
Or, as mentioned by @A.Webb, merge() is a pretty handy function:
merge(x = DF1, y = DF2, by.x = "Unique_ID",by.y = "Unique_ID", all.x = T, sort = F)
If you prefer SQL-style statements, then:
library(sqldf)
sqldf('SELECT * FROM DF1 INTERSECT SELECT * FROM DF2')
This is easy to implement and avoids for loops and if conditions.
As A.Webb suggested, you need a join:
# join data on unique_name
joined <- merge(export, import[c("unique_name", "body_mass")], by = "unique_name")
joined$body_mass <- joined$body_mass.y  # update body_mass from import to export
joined$body_mass.x <- NULL              # remove columns that are no longer needed
joined$body_mass.y <- NULL
export <- joined
Note: as shown below, you can use the which() function. This reduces the loop iterations:
for (j in 1:nrow(export)){
  index <- which(import$unique_name %in% export$unique_name[j])
  if (length(index) >= 1){
    export$body_mass[j] <- import[index[1], "body_mass"]
  }
}
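For what it's worth, the whole loop can also be replaced by a vectorised lookup with base R's match(); a small sketch, assuming (as stated in the question) that each unique_name appears at most once in import:
# match() returns, for each export$unique_name, its row position in import (NA if absent)
idx <- match(export$unique_name, import$unique_name)
hit <- !is.na(idx)
export$body_mass[hit] <- import$body_mass[idx[hit]]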

LHS:RHS vs functional in data.table

Why does the functional ':=' not aggregate unique rows using 'by', yet LHS:RHS does aggregate using 'by'? Below is a .csv sample of 20 rows of data with 58 variables (a simple copy, paste, delim = .csv works; I am still trying to find the best way to post sample data to SO). The two variants of my code are:
prodMatrix <- so.sample[, ':=' (Count = .N), by = eval(names(so.sample)[2:28])]
---this version does not aggregate the rowID using by---
prodMatrix <- so.sample[, (Count = .N), by = eval(names(so.sample)[2:28])]
---this version does aggregate the rowID using by---
"CID","NetIncome_length_Auto Advantage","NetIncome_length_Certificates","NetIncome_length_Comm. Share Draft","NetIncome_length_Escrow Shares","NetIncome_length_HE Fixed","NetIncome_length_HE Variable","NetIncome_length_Holiday Club","NetIncome_length_IRA Certificates","NetIncome_length_IRA Shares","NetIncome_length_Indirect Balloon","NetIncome_length_Indirect New","NetIncome_length_Indirect RV","NetIncome_length_Indirect Used","NetIncome_length_Loanline/CR","NetIncome_length_New Auto","NetIncome_length_Non-Owner","NetIncome_length_Personal","NetIncome_length_Preferred Plus Shares","NetIncome_length_Preferred Shares","NetIncome_length_RV","NetIncome_length_Regular Shares","NetIncome_length_S/L Fixed","NetIncome_length_S/L Variable","NetIncome_length_SBA","NetIncome_length_Share Draft","NetIncome_length_Share/CD Secured","NetIncome_length_Used Auto","NetIncome_sum_Auto Advantage","NetIncome_sum_Certificates","NetIncome_sum_Comm. Share Draft","NetIncome_sum_Escrow Shares","NetIncome_sum_HE Fixed","NetIncome_sum_HE Variable","NetIncome_sum_Holiday Club","NetIncome_sum_IRA Certificates","NetIncome_sum_IRA Shares","NetIncome_sum_Indirect Balloon","NetIncome_sum_Indirect New","NetIncome_sum_Indirect RV","NetIncome_sum_Indirect Used","NetIncome_sum_Loanline/CR","NetIncome_sum_New Auto","NetIncome_sum_Non-Owner","NetIncome_sum_Personal","NetIncome_sum_Preferred Plus Shares","NetIncome_sum_Preferred Shares","NetIncome_sum_RV","NetIncome_sum_Regular Shares","NetIncome_sum_S/L Fixed","NetIncome_sum_S/L Variable","NetIncome_sum_SBA","NetIncome_sum_Share Draft","NetIncome_sum_Share/CD Secured","NetIncome_sum_Used Auto","totNI","Count","totalNI"
93,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,-123.2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,212.97,0,0,0,-71.36,0,0,0,49.01,0,0,67.42,6,404.52
114,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,4,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,14.54,0,0,0,0,0,-285.44,0,0,0,49.01,0,0,-221.89,90,-19970.1
1112,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,60.23,0,0,0,0,-101.55,0,-71.36,0,0,0,98.02,0,0,-14.66,28,-410.48
5366,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,-22.35,77631,-1735052.85
6078,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,7,0,0,0,1,0,0,0,0,0,0,0,0,-17.44,0,0,0,0,0,0,0,14.54,0,0,0,0,0,-499.52,0,0,0,49.01,0,0,-453.41,3,-1360.23
11684,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,-22.35,77631,-1735052.85
47358,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,-14.43,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,0,0,0,-85.79,3194,-274013.26
193761,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-101.55,0,-71.36,0,0,0,49.01,0,0,-123.9,9973,-1235654.7
232530,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,-22.35,77631,-1735052.85
604897,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,-22.35,77631,-1735052.85
1021309,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,0,0,0,-71.36,43262,-3087176.32
1023633,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,0,0,0,-71.36,43262,-3087176.32
1029726,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,60.23,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,37.88,8688,329101.44
1040005,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,-22.35,77631,-1735052.85
1040092,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,-22.35,77631,-1735052.85
1064453,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,14.54,0,212.97,0,0,0,-142.72,0,0,0,0,0,0,84.79,49,4154.71
1067508,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,-123.2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,0,0,0,-194.56,4162,-809758.72
1080303,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,0,0,0,-71.36,43262,-3087176.32
1181005,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,2,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-101.55,0,-142.72,0,0,0,98.02,0,0,-146.25,614,-89797.5
1200484,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-101.55,0,-285.44,0,0,0,0,0,0,-386.99,50,-19349.5
Because := performs operations by reference. That means it does not make an in-memory copy of your dataset; it updates it in place.
Aggregating your dataset produces a copy of its original, unaggregated form.
You can read more about this in the Reference semantics vignette.
This is a design concept in data.table: := is used to update by reference, while the other forms - .(), list(), or a direct expression - are used to query data, and querying data is not a by-reference operation. A by-reference operation cannot aggregate rows; it can only calculate aggregates and put them into the dataset in place. A query can aggregate the dataset because the query result is not the same object in memory as the original data.table.
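A toy illustration of the difference (a sketch with made-up data, not the posted 58-column sample):
library(data.table)
dt <- data.table(id = 1:6, grp = c("a", "a", "b", "b", "b", "c"))

# ':=' updates dt in place: all 6 rows are kept and each gains a Count column
dt[, Count := .N, by = grp]
dt    # still 6 rows

# a query form such as .() builds a new object, so rows collapse to one per group
agg <- dt[, .(Count = .N), by = grp]
agg   # 3 rows: a/2, b/3, c/1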

R select rows based on values returned from another column by data.table, OR merge/chop overlapping ranges

I am trying to subset a data.table by two conditions stored in the same data.table, that are found by a single key.
In practice, I am trying to merge overlapping ranges.
I know how to do this:
dt[, max := max(localRange), by=someGroup]
However, I want to use a range as selectors in i. So something like:
dt[range > min(localRange) & range < max(localRange),
max := max(localRange),
by = someGroup]
where range and localRange refer to the same column; range is just outside the scope of .SD.
Or something like:
dt[col2 > dt[,min(col2),by = col1] & col2 < dt[,max(col2),by = col1],
col2 := max(col2)]
where the two by='s synchronise/share the same col1 value
I have tried it with a for loop using set(), iterating over a list of the min and max of each range as conditions to the data.table. I built the list using split() on a data.table:
for (range in split(
dt[,
list(min = min(rightBound),max = max(rightBound)),
by = leftBound
],
f = 1:nrow(dt[,.GRP,by = leftBound])
)
){
set(
x = dt,
i = dt[rightBound >= range$min & rightBound <= range$max]
j = range$max
)
}
It all became a mess (even errors), though I assume that this could be a (syntactically) fairly straightforward operation. Moreover, this is only a case in which there is a single step: getting the conditions associated with the by= group.
What if I would want to adjust values based on a series of transformations on a value in by= based on data in the data.table outside of .SD? For example: "by each start, select the range of ends, and based on that range find a range of starts", etc.
Here it does not matter at all that we are talking about ranges, as I think this is generally useful functionality.
In case anybody is wondering about the practical case, user971102 provides fine sample data for a simple case:
my.df<- data.frame(name=c("a","b","c","d","e","f","g"), leftBound=as.numeric(c(0,70001,70203,70060, 40004, 50000872, 50000872)), rightBound=as.numeric(c(71200,71200,80001,71051, 42004, 50000890, 51000952)))
dt = as.data.table(my.df)
name leftBound rightBound
   a         0      71200
   b     70001      71200
   c     70203      80001
   d     70060      71051
   e     40004      42004
   f  50000872   50000890
   g  50000872   51000952
Edit:
The IRanges package is going to solve my practical problem. However, I am still very curious to learn a possible solution to the more abstract case of 'chaining' selectors in data.tables
Thanks a bunch Jeremycg and AGstudy. Though it's not the findOverlaps() function, but the reduce() and disjoin() functions.
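For the record, the overlapping ranges can also be merged in data.table itself; below is a sketch using the sorted-cummax idiom on the sample data above (not the IRanges reduce()/disjoin() route that was ultimately used):
library(data.table)
dt <- as.data.table(my.df)   # the sample data from above
setorder(dt, leftBound)

# a new group starts whenever an interval begins after the running maximum right bound
dt[, grp := cumsum(leftBound > shift(cummax(rightBound), fill = -Inf))]

# collapse each group to its merged range
dt[, .(leftBound = min(leftBound), rightBound = max(rightBound)), by = grp]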

Data.table - lapply function within .SD with merge not working

I'm doing some product association work where I have two large data.tables: a rules table (2.4m rows) and a customer product table (3m rows). Effectively, what I want to do is merge the two together and select the top 10 products per customer, but doing this in one go isn't viable due to the size. To get around this, I want to iteratively merge the two tables at the customer level, select the top 10 products, and return them.
The below example probably explains it better:
require(data.table)
customer <- data.table(customer=rep(seq(1:5),3),product_bought=rep(c("A","B","C"),5), number=runif(15))[order(customer)]
rules <- data.table(product_bought=c("A","B","C"),recommended_product=c("D","E","F"),number2=runif(3,min=100,max=200))
customer[, lapply(.SD, function(z){
  a <- merge(z, rules, by = "product_bought")
  a[, new := number*number2]
  a[new == max(new)]
  return(a)
}), by = customer]
But I get the following error:
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
What I want it to do for all customers is this:
z <- customer[customer==1]
a <- merge(z,rules,by="product_bought")
a[,new:=number*number2]
a[new==max(new)]
Which gives:
> a[new==max(new)]
   product_bought customer   number recommended_product  number2     new
1:              C        1 0.613043                   F 168.4335 103.257
I did try using lists, but having a list of 30k data.tables had issues when trying to rbindlist it back up again.
Any ideas why the merge within a .SD doesn't work?
Cheers,
Scott
I guess you were trying to do this:
customer[, {
  a <- merge(.SD, rules, by = "product_bought")
  a[, new := number*number2]
  a[new == max(new)]
}, by = customer]
But it's much better to do a single merge:
customer[rules, on = 'product_bought', new := number * number2]
customer[, .SD[new == max(new)], by = customer]
Or do the .I trick if the last line is too slow.
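The .I idiom mentioned above looks roughly like this (a sketch, reusing the new column created by the single merge):
# .I collects the global row numbers of the per-customer maxima,
# so one subset replaces the per-group .SD copies
idx <- customer[, .I[new == max(new)], by = customer]$V1
customer[idx]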

Comparing specific columns in 2 different files using R

I find myself having to do this very often: compare specific columns from two different files. The columns and formats are the same, but the columns that need comparison contain floating-point/exponential data, e.g. 0.0058104642437413175, -3.459017050577087E-4, etc.
I'm currently using the below R code:
library(sqldf)

test <- read.csv("C:/VBG_TEST/testing/FILE_2010-06-16.txt", header = FALSE, sep = "|", quote = "\"", dec = ".")
prod <- read.csv("C:/VBG_PROD/testing/FILE_2010-06-16.txt", header = FALSE, sep = "|", quote = "\"", dec = ".")

sqldf("select sum(V10), sum(V15) from test")
sqldf("select sum(V10), sum(V15) from prod")
I read in the files, and sum the specific columns -- V10, V15 and then observe the values. This way I can ignore very small differences in floating point data per row.
However, going forward, I would like to set a tolerance percentage, i.e. flag rows where abs( (prod.V10 - test.V10)/prod.V10 ) > 0.01%, and only print those row numbers that exceed this tolerance limit.
Also, if the data is not in the same order, how can I do the comparison by specifying columns that act like a composite primary key?
For e.g., if I did this in Sybase, I'd have written something like:
select A.*, B.*
from tableA A, tableB B
where abs( (A.Col15-B.Col15)/A.Col15) ) > 0.01%
and A.Col1 = B.Col1
and A.Col4 = B.Col4
and A.Col6 = B.Col6
If I try doing the same thing using sqldf in R, it does NOT work as the files contain 500K+ rows of data.
Can anyone point me to how I can do the above in R?
Many thanks,
Chapax.
Au, this sqldf hurts my mind -- better use plain R capabilities than torture yourself with SQL:
which(abs(prod$V10-test$V10)/prod$V10>0.0001)
In a more general version:
which(abs(prod[,colTest]-test[,colTest])/prod[,colTest]>tolerance)
where colTest is the index of the column you want to test and tolerance is the tolerance threshold.
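For the composite-key part of the question (data not in the same order), one possible sketch is to merge on the key columns first and then apply the same tolerance test. The key columns V1, V4 and V6 below are assumptions standing in for Col1/Col4/Col6 from the Sybase example:
# merge prod and test on the assumed composite key; shared value columns get suffixes
keyed <- merge(prod, test, by = c("V1", "V4", "V6"), suffixes = c(".prod", ".test"))

# rows where V15 differs by more than the relative tolerance
tolerance <- 0.0001
keyed[abs((keyed$V15.prod - keyed$V15.test) / keyed$V15.prod) > tolerance, ]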
I don't know R, but I'm suggesting this as general advice: paginate your table and then run your query. In general, I think it is not wise to execute specific comparison instructions over a table that big.
