Efficiently joining two data tables with a condition - r

One data table (let's call it A) contains the ID numbers:
ID
3
5
12
8
...
and another table (let's call it B) contains the lower bound and the upper bound and the name for that ID.
ID_lower ID_upper Name
1 4 James
5 7 Arthur
8 11 Jacob
12 13 Sarah
Based on table B, given an ID from table A, we can find the matching name by locating the row in table B such that
ID_lower <= ID <= ID_upper
and I want to create a table of ID and Name, so in the above example, it would be
ID Name
3 James
5 Arthur
12 Sarah
8 Jacob
... ...
I used a for loop: for each row of A, I looked for the row in B where ID falls between ID_lower and ID_upper and joined the name from there.
However, this method was a bit slow. Is there a faster way of doing it in R?
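For illustration, such a row-by-row loop might look roughly like this (a sketch assuming A and B are data.frames with the columns shown above):
# Slow row-by-row lookup, shown only to illustrate the approach being replaced
A$Name <- NA_character_
for (i in seq_len(nrow(A))) {
  hit <- which(B$ID_lower <= A$ID[i] & A$ID[i] <= B$ID_upper)
  if (length(hit) > 0) A$Name[i] <- as.character(B$Name[hit[1]])
}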

Using the new non-equi joins feature in the current development version of data.table, this is straightforward:
require(data.table) # v1.9.7+
dt2[dt1, .(ID, Name), on=.(ID_lower <= ID, ID_upper >= ID)]
See the installation instructions for the devel version here.
where,
dt1=fread('ID
3
5
12
8')
dt2 = fread('ID_lower ID_upper Name
1 4 James
5 7 Arthur
8 11 Jacob
12 13 Sarah')
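A possible alternative that also works on the CRAN release is foverlaps(); the sketch below treats each ID in dt1 as a zero-width interval [ID, ID]:
dt1[, c("start", "end") := .(ID, ID)]   # helper interval columns, added by reference
setkey(dt2, ID_lower, ID_upper)         # foverlaps() needs the range table keyed on the interval
foverlaps(dt1, dt2, by.x = c("start", "end"))[, .(ID, Name)]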

You can make a look-up table with your second data.frame (B):
lu <- do.call(rbind,
              apply(B, 1, function(x)
                data.frame(ID = c(x[1]:x[2]), Name = x[3], row.names = NULL)))
then you query it with your first data.frame (A):
A$Name <- lu[A$ID,"Name"]
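Note that indexing lu by A$ID relies on the row number coinciding with the ID, which holds here only because the ranges start at 1 and are contiguous. A match()-based lookup (a small sketch) avoids that assumption:
# Safer lookup that does not rely on row numbers coinciding with IDs
A$Name <- lu$Name[match(A$ID, lu$ID)]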

You can try this data.table solution:
data.table::setDT(B)[, .(Name, ID = Map(`:`, ID_lower, ID_upper))]
[, .(ID = unlist(ID)), .(Name)][ID %in% A$ID]
Name ID
1: James 3
2: Arthur 5
3: Sarah 12
4: Jacob 8

I believe findInterval() on ID_lower might be the ideal approach here:
A[,Name:=B[findInterval(ID,ID_lower),Name]];
A;
## ID Name
## 1: 3 James
## 2: 5 Arthur
## 3: 12 Sarah
## 4: 8 Jacob
This will only be correct if (1) B is sorted by ID_lower and (2) all values in A$ID are covered by the ranges in B.
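If condition (2) might not hold, one way to guard against IDs that fall below the first range or into a gap is a sketch along the same lines (assuming Name is a character column):
idx <- findInterval(A$ID, B$ID_lower)   # 0 when ID is below the first range
idx[idx == 0L] <- NA                    # such IDs should yield NA below
A[, Name := ifelse(!is.na(idx) & ID <= B$ID_upper[idx], B$Name[idx], NA_character_)]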

Related

Count non-NA values per row

family_id<-c(1,2,3)
age_mother<-c(30,27,29)
dob_child1<-c("1998-11-12","1999-12-12","1996-04-12")##child one birth day
dob_child2<-c(NA,"1997-09-09",NA)##if no child,NA
dob_child3<-c(NA,"1999-09-01","1996-09-09")
DT<-data.table(family_id,age_mother,dob_child1,dob_child2,dob_child3)
Now that I have DT, how can I use this table to find out how many children each family has, using syntax like this:
DT[,apply..,keyby=family_id]##this code is wrong
This may also work:
> DT$total_child <- as.vector(rowSums(!is.na(DT[, c("dob_child1",
"dob_child2", "dob_child3")])))
> DT
family_id age_mother dob_child1 dob_child2 dob_child3 total_child
1 1 30 1998-11-12 <NA> <NA> 1
2 2 27 1999-12-12 1997-09-09 1999-09-01 3
3 3 29 1996-04-12 <NA> 1996-09-09 2
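Since the question asks for data.table syntax, a data.table-native variant (a sketch assuming the child columns all start with dob_child) is:
child_cols <- grep("^dob_child", names(DT), value = TRUE)        # the dob_child* columns
DT[, total_child := rowSums(!is.na(.SD)), .SDcols = child_cols]  # count non-NA entries per row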
You can use the sqldf package to run a SQL query in R.
I recreated your DT:
family_id<-c(1,2,3)
age_mother<-c(30,27,29)
dob_child1<-c("1998-11-12","1999-12-12","1996-04-12")##child one birth day
dob_child2<-c(NA,"1997-09-09",NA)##if no child,NA
dob_child3<-c(NA,"1999-09-01","1996-09-09")
DT<-data.table(family_id,age_mother,dob_child1,dob_child2,dob_child3)
library(sqldf)
sqldf('select distinct (count(dob_child3)+count(dob_child2)+count(dob_child1)) as total_child,
family_id from DT group by family_id')
The result is the following:
total_child family_id
1 1 1
2 3 2
3 2 3
Is this correct for you?

Classification according to unique values

I have a data frame named Records with two columns, Rank and Name:
Rank Name
1 Ashish
1 Ashish
2 Ashish
3 Mark
4 Mark
1 Mark
3 Spencer
2 Spencer
1 Spencer
2 Mary
4 Joseph
I want every name to be assigned a tag of 1, 2, 3 or 4 depending on its occurrence and uniqueness.
I want to create a new vector named Tagging.
So the output should be:
Rank 1 has three unique names, Mark, Spencer and Ashish, so the tag is 1 for all three.
Rank 2 has one unique record, Mary, as Ashish has already been assigned tag 1, so Mary is tagged 2.
Rank 3 has no unique records, as Spencer and Mark have already been assigned 1, so I cannot tag 3 to anybody.
Rank 4 has one unique record, Joseph, so he gets tagged 4.
Let me know which function can help me do this.
I do not want to use looping, as this is a 1,000,000-row database.
The below solution follows the principle that the highest Rank of a person is going to be that person's tag too.
tbl <- read.table(header=TRUE, text='
Rank Name
1 Ashish
1 Ashish
2 Ashish
3 Mark
4 Mark
1 Mark
3 Spencer
2 Spencer
1 Spencer
2 Mary
4 Joseph
')
Ordering the 'tbl' dataframe by Rank
tbl_ord <- tbl[with(tbl,order(Rank)),]
Removing multiple occurrences of a name within the same Rank
> name_ord<- tbl_ord[duplicated(tbl_ord$Rank),]
> name_ord
Rank Name
2 1 Ashish
6 1 Mark
9 1 Spencer
8 2 Spencer
10 2 Mary
7 3 Spencer
11 4 Joseph
Displaying unique Names
#name_ord[unique(name_ord$Name),] #this will work too
> name_ord[!duplicated(name_ord$Name),]
Rank Name
2 1 Ashish
6 1 Mark
9 1 Spencer
10 2 Mary
11 4 Joseph
Using the setkey function of the data.table package and unique:
library(data.table)
dt<-data.table(Rank=c(1,1,2,3,4,1,3,2,1,2,4), Name=c(rep("Ashish", 3), rep("Mark", 3), rep("Spencer", 3), "Mary", "Joseph"))
setkey(dt, Rank, Name)
dt<-unique(dt)
setkey(dt, Name)
dt<-unique(dt) # works because of the above setkey call which sorted it
setkey(dt, Rank) # if you want to order them by Rank again
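Assuming, as the desired output above implies, that each Name's tag is simply the lowest Rank it appears under, a compact base-R equivalent (a sketch using the tbl defined above) is:
tags <- aggregate(Rank ~ Name, data = tbl, FUN = min)  # each Name's lowest Rank becomes its tag
names(tags)[2] <- "Tag"
tags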

merging data in R

I have a data set A
paper_id author_id
1 521630
1 1611750
2 9
3 627950
4 1456512
8 15
........
and a data set B
author_id author_name author_affiliation
9 Ernest Jordan Cambridge
14 K. MORIBE NA
15 D. Jakominich NA
25 William H. Nailon
37 P. B. Littlewood Cavendish Laboratory|Cambridge University
........
I want to merge these two data sets on author_id, but the result should look like:
paper_id author_id author_name author_affiliation
2 9 Ernest Jordan Cambridge
8 15 D. Jakominich NA
That is, I want the data to stay ordered by paper_id while the merge is performed on author_id, so that the paper_id order doesn't get disturbed.
What I am doing is:
b <- merge(A, B, by="author_id")
and I am getting the following, in which the paper_id order is disturbed:
author_id paper_id author_name author_affiliation
9 1468598 Ernest Jordan cambridge
9 1682105 Ernest Jordan cambridge
and then I have to sort this output by the paper_id column, which is very inefficient.
How could this be done?
Thanks
This should do what you want.
b <-merge(A,B,by="author_id", sort=F)
b <- b[,c(2,1,3,4)]
You can turn off sorting on the by=... columns with sort=F, but merge(...) will always make the sort columns the first columns of the result. The last line of code just reverses columns 1 and 2.
EDIT (Response to @BrianDiggs comment)
@BrianDiggs is correct that, while sort=F will not force a sort on the by=... column, it does not guarantee the original sort order in A. If efficiency is a big concern, then consider the data.table package, which was built for this:
# create an example
A <- data.frame(paper_id=1:10000, author_id=rev(LETTERS[1:4]))
B <- data.frame(author_id=LETTERS[1:4],
author_name=c("Davies","Hawking","Carlyle","Higgs"),
author_affiliation=c("Oxford","Cambridge","UCL","Edinburgh"),
stringsAsFactors=F)
library(data.table)
A <- data.table(A,key="author_id")
B <- data.table(B,key="author_id")
A[B,c("author_name","author_affiliation"):=list(author_name,author_affiliation)]
setkey(A,paper_id)
head(A)
# paper_id author_id author_name author_affiliation
# 1: 1 D Higgs Edinburgh
# 2: 2 C Carlyle UCL
# 3: 3 B Hawking Cambridge
# 4: 4 A Davies Oxford
# 5: 5 D Higgs Edinburgh
# 6: 6 C Carlyle UCL
Unlike sort(...), setting a key in a data table sorts "by reference" using a radix algorithm. Sorting by reference means that the rows are rearranged in memory instead of copying the whole table into a new table. As a result, sorting data tables is extremely fast and memory efficient.
Also, the use of A[B,...] to do the merge is much faster than merging two data frames. In addition, this process appends the new columns to A (rather than creating a copy of A, as with merge(...)).
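In more recent versions of data.table (1.9.6+), the same update join can be written without setting keys, using on= and the i. prefix for B's columns; a sketch:
# Key-less update join: adds B's columns to A by reference, preserving A's row order
A[B, on = "author_id",
  c("author_name", "author_affiliation") := .(i.author_name, i.author_affiliation)]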
If you can consider non-base alternatives, then you may try the plyr equivalent of merge: join. From "Details" in ?join: "Unlike merge, preserves the order of x no matter what join type is used." Also, the order of columns is preserved.
library(plyr)
join(A, B, type = "inner")
# Joining by: author_id
# paper_id author_id author_name author_affiliation
# 1 2 9 ErnestJordan Cambridge
# 2 8 15 Jakominich <NA>
inner_join in dplyr is similar. However, while the order of columns in x is kept, the columns in y seem to be sorted alphabetically:
library(dplyr)
inner_join(x = A, y = B)
# Joining by: "author_id"
# paper_id author_id author_affiliation author_name
# 1 2 9 Cambridge ErnestJordan
# 2 8 15 <NA> Jakominich
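If the original column order matters, it can be restored after the dplyr join with select(); for example:
library(dplyr)
inner_join(A, B, by = "author_id") %>%
  select(paper_id, author_id, author_name, author_affiliation)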
Too long for a comment, but when I run merge I do get what you want:
A <- read.table(text="paper_id author_id
1 521630
1 1611750
2 9
3 627950
4 1456512
8 15", header=T)
B <- read.table(text="author_id author_name author_affiliation
9 Ernest_Jordan Cambridge
14 K._MORIBE NA
15 D._Jakominich NA
25 William_H._Nailon NA
37 P._B._Littlewood Cavendish_Laboratory|Cambridge_University",
header=T)
b <- merge(A, B, by="author_id")
b
# author_id paper_id author_name author_affiliation
# 1 9 2 Ernest_Jordan Cambridge
# 2 15 8 D._Jakominich <NA>
Can you clarify your problem?

How to create dataframe subset of the one patient observation with the lowest score on a variable

Hello I have a dataset with multiple patients, each with multiple observations.
I want to select the earliest observation for each patient.
Example:
Patient ID Tender Swollen pt_visit
101 1 10 6
101 6 12 12
101 4 3 18
102 9 5 18
102 3 6 24
103 5 2 12
103 2 1 18
103 8 0 24
The pt_visit variable is the number of months the patient had been in the study at the time of the observation. What I need is the earliest observation for each patient ID, i.e. the row with the lowest pt_visit value for each patient.
My desired results:
Patient ID Tender Swollen pt_visit
101 1 10 6
102 9 5 18
103 5 2 12
Assuming your data frame is called df, use the ddply function in the plyr package:
require(plyr)
firstObs <- ddply(df, "PatientID", function(x) x[x$pt_visit == min(x$pt_visit), ])
I would use the data.table package:
Data <- data.table(Data)
setkey(Data, Patient_ID, pt_visit)
Data[,.SD[1], by=Patient_ID]
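A small variant of the same idea that avoids the setkey step:
Data[, .SD[which.min(pt_visit)], by = Patient_ID]  # row with the smallest pt_visit per patient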
Assuming that the Patient ID column is actually named Patient_ID, here are a few approaches. DF is assumed to be the name of the input data frame:
sqldf
library(sqldf)
sqldf("select Patient_ID, Tender, Swollen, min(pt_visit) pt_visit
from DF
group by Patient_ID")
or
sqldf("select *, min(pt_visit) pt_visit from DF group by Patient_ID")[-ncol(DF)]
Note: The above two alternatives use an extension to SQL only found in SQLite, so be sure you are using the SQLite backend. (SQLite is the default backend for sqldf unless RH2, RPostgreSQL or RMySQL is loaded.)
subset/ave
subset(DF, ave(pt_visit, Patient_ID, FUN = rank) == 1)
Note: This makes use of the fact that there are no duplicate pt_visit values within the same Patient_ID. If there were, we would need to specify the ties.method= argument to rank.
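For example, if ties were possible, a sketch that keeps the first of any tied rows:
subset(DF, ave(pt_visit, Patient_ID,
               FUN = function(x) rank(x, ties.method = "first")) == 1)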
I almost think there should be a subset parameter named "by" that would do the same as it does in data.table. This is a base solution:
do.call(rbind, lapply( split(dfr, dfr$PatientID),
function(x) x[which.min(x$pt_visit),] ) )
PatientID Tender Swollen pt_visit
101 101 1 10 6
102 102 9 5 18
103 103 5 2 12
I guess you can see why @hadley built 'plyr'.

Locate and merge duplicate rows in a data.frame but ignore column order

I have a data.frame with 1,000 rows and 3 columns. It contains a large number of duplicates and I've used plyr to combine the duplicate rows and add a count for each combination as explained in this thread.
Here's an example of what I have now (I still also have the original data.frame with all of the duplicates if I need to start from there):
name1 name2 name3 total
1 Bob Fred Sam 30
2 Bob Joe Frank 20
3 Frank Sam Tom 25
4 Sam Tom Frank 10
5 Fred Bob Sam 15
However, column order doesn't matter. I just want to know how many rows have the same three entries, in any order. How can I combine the rows that contain the same entries, ignoring order? In this example I would want to combine rows 1 and 5, and rows 3 and 4.
Define another column that's a "sorted paste" of the names, which would have the same value of "Bob~Fred~Sam" for rows 1 and 5. Then aggregate based on that.
Brief code snippet (assumes original data frame is dd): it's all really intuitive. We create a lookup column (take a look and should be self explanatory), get the sums of the total column for each combination, and then filter down to the unique combinations...
dd$lookup=apply(dd[,c("name1","name2","name3")],1,
function(x){paste(sort(x),collapse="~")})
tab1=tapply(dd$total,dd$lookup,sum)
ee=dd[match(unique(dd$lookup),dd$lookup),]
ee$newtotal=as.numeric(tab1)[match(ee$lookup,names(tab1))]
You now have in ee a set of unique rows and their corresponding total counts. Easy - and no external packages needed. And crucially, you can see at every stage of the process what is going on!
(Minor update to help OP:) And if you want a cleaned-up version of the final answer:
outdf = with(ee,data.frame(name1,name2,name3,
total=newtotal,stringsAsFactors=FALSE))
This gives you a neat data frame with the three all-important name columns, and with the aggregated totals in a column called total rather than newtotal.
Sort the index columns, then use ddply to aggregate and sum:
Define the data:
dat <- " name1 name2 name3 total
1 Bob Fred Sam 30
2 Bob Joe Frank 20
3 Frank Sam Tom 25
4 Sam Tom Frank 10
5 Fred Bob Sam 15"
x <- read.table(text=dat, header=TRUE)
Create a copy:
xx <- x
Use apply to sort the columns, then aggregate:
xx[, -4] <- t(apply(xx[, -4], 1, sort))
library(plyr)
ddply(xx, .(name1, name2, name3), numcolwise(sum))
name1 name2 name3 total
1 Bob Frank Joe 20
2 Bob Fred Sam 45
3 Frank Sam Tom 35
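The same aggregation can also be done in base R once the name columns have been sorted within each row:
aggregate(total ~ name1 + name2 + name3, data = xx, FUN = sum)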
