Merging data in R

I have a data set A
paper_id author_id
1 521630
1 1611750
2 9
3 627950
4 1456512
8 15
........
and a data set B
author_id author_name author_affiliation
9 Ernest Jordan Cambridge
14 K. MORIBE NA
15 D. Jakominich NA
25 William H. Nailon
37 P. B. Littlewood Cavendish Laboratory|Cambridge University
........
I want to merge these two data sets on author_id, with the result looking like this:
paper_id author_id author_name author_affiliation
2 9 Ernest Jordan Cambridge
8 15 D. Jakominich NA
That is, I want the merge to be performed on author_id while the result stays in the original paper_id order.
What I am doing now is:
b<-merge(A,B,by="author_id")
but in the result I get, the paper_id order is disturbed:
author_id paper_id author_name author_affiliation
9 1468598 Ernest Jordan cambridge
9 1682105 Ernest Jordan cambridge
I then have to sort this output by the paper_id column, which is very inefficient.
How can this be done?
Thanks

This should do what you want.
b <- merge(A, B, by="author_id", sort=FALSE)
b <- b[,c(2,1,3,4)]
You can turn off sorting on the by=... columns with sort=FALSE, but merge(...) will always make the by columns the first columns of the result. The last line of code just swaps columns 1 and 2.
EDIT (Response to @BrianDiggs' comment)
@BrianDiggs is correct that, while sort=FALSE will not force a sort on the by=... column, it does not guarantee the original row order of A. If efficiency is a big concern, then consider the data.table package, which was built for this:
# create an example
A <- data.frame(paper_id=1:10000, author_id=rev(LETTERS[1:4]))
B <- data.frame(author_id=LETTERS[1:4],
author_name=c("Davies","Hawking","Carlyle","Higgs"),
author_affiliation=c("Oxford","Cambridge","UCL","Edinburgh"),
stringsAsFactors=F)
library(data.table)
A <- data.table(A,key="author_id")
B <- data.table(B,key="author_id")
A[B,c("author_name","author_affiliation"):=list(author_name,author_affiliation)]
setkey(A,paper_id)
head(A)
# paper_id author_id author_name author_affiliation
# 1: 1 D Higgs Edinburgh
# 2: 2 C Carlyle UCL
# 3: 3 B Hawking Cambridge
# 4: 4 A Davies Oxford
# 5: 5 D Higgs Edinburgh
# 6: 6 C Carlyle UCL
Unlike sort(...), setting a key in a data table sorts "by reference" using a radix algorithm. Sorting by reference means that the rows are rearranged in memory instead of copying the whole table into a new table. As a result, sorting data tables is extremely fast and memory efficient.
Also, using A[B,...] to do the merge is much faster than merging two data frames. In addition, this process appends the new columns to A by reference (rather than creating a copy of A, as merge(...) does).
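If you would rather stay in base R, one order-preserving alternative is to look up each row of A in B with match() instead of calling merge(). The following is only a sketch: it assumes author_id is unique in B, it rebuilds the sample rows from the question by hand (the affiliation for author 25 is taken as NA), and the names idx, keep and res are illustrative.

```r
# Sample data from the question (first rows only)
A <- data.frame(paper_id  = c(1, 1, 2, 3, 4, 8),
                author_id = c(521630, 1611750, 9, 627950, 1456512, 15))
B <- data.frame(author_id = c(9, 14, 15, 25, 37),
                author_name = c("Ernest Jordan", "K. MORIBE", "D. Jakominich",
                                "William H. Nailon", "P. B. Littlewood"),
                author_affiliation = c("Cambridge", NA, NA, NA,
                                       "Cavendish Laboratory|Cambridge University"),
                stringsAsFactors = FALSE)

# For every row of A, find the position of its author_id in B (NA if absent).
# Assumption: author_id is unique in B, so the first match is the only match.
idx  <- match(A$author_id, B$author_id)
keep <- !is.na(idx)                      # inner join: drop unmatched rows of A

# Rows come out in the original order of A; no sorting step is involved.
res <- cbind(A[keep, ], B[idx[keep], c("author_name", "author_affiliation")])
rownames(res) <- NULL
res
```

Because the rows of A are never reordered, no sort on paper_id is needed afterwards. If author_id can repeat in B, match() only returns the first hit, so merge() or a data.table join would be the safer choice.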

If you can consider non-base alternatives, you may try join, the plyr equivalent of merge. From "Details" in ?join: "Unlike merge, preserves the order of x no matter what join type is used." The order of columns is also preserved.
library(plyr)
join(A, B, type = "inner")
# Joining by: author_id
# paper_id author_id author_name author_affiliation
# 1 2 9 ErnestJordan Cambridge
# 2 8 15 Jakominich <NA>
inner_join in dplyr is similar. However, while the order of columns in x is kept, the columns in y seem to be sorted alphabetically:
library(dplyr)
inner_join(x = A, y = B)
# Joining by: "author_id"
# paper_id author_id author_affiliation author_name
# 1 2 9 Cambridge ErnestJordan
# 2 8 15 <NA> Jakominich

Too long for a comment
I do get what you want:
A <- read.table(text="paper_id author_id
1 521630
1 1611750
2 9
3 627950
4 1456512
8 15", header=T)
B <- read.table(text="author_id author_name author_affiliation
9 Ernest_Jordan Cambridge
14 K._MORIBE NA
15 D._Jakominich NA
25 William_H._Nailon NA
37 P._B._Littlewood Cavendish_Laboratory|Cambridge_University",
header=T)
b <- merge(A, B, by="author_id")
b
# author_id paper_id author_name author_affiliation
# 1 9 2 Ernest_Jordan Cambridge
# 2 15 8 D._Jakominich <NA>
Can you clarify your problem?

Related

data table lapply and additional columns in output

I am just hoping there is a more convenient way. Imagine I would like to run a model with different transformations of some of the columns, e.g. winsorizing. I would like to provide the transformed data set to the model together with some additional columns that do not need to be transformed. Is there a practical way to do this in one line? I do not want to replace the data using := because I am planning to run the model with different specifications of the transformation.
dt<-data.table(id=1:10, Country=sample(c("Germany", "USA"),10, replace=TRUE), x=rnorm(10,1,10),y=rnorm(10,1,10),factor=factor(sample(LETTERS[1:2],10,replace=TRUE)))
sel.col<-c("x","y")
dt[,lapply(.SD,Winsorize),.SDcols=sel.col,by=factor]
I would need to call data.table again to merge the original dt with the transformed data, and pay attention to the row order:
data.table(dt[,.(id,Country),by=factor],
dt[,lapply(.SD,Winsorize),.SDcols=sel.col,by=factor])
I was hoping that I could include the additional columns with the lapply call
dt[,.(lapply(.SD,Winsorize), id, Country),.SDcols=sel.col,by=factor]
Are there any other solutions?
Do you just need this?
dt[, c(lapply(.SD,Winsorize), list(id = id, Country = Country)), .SDcols=sel.col,by=factor]
Unfortunately this method gets slow with big data. Apparently this was optimised in some recent update, but it is still very slow.
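The idea in the answer above, joining the lapply(.SD, ...) result with extra columns via c(), can be tried without modifying dt. In this sketch winsorize_q is a hypothetical stand-in for DescTools::Winsorize (it clips at the 5% and 95% quantiles), used only so that the example needs no package beyond data.table.

```r
library(data.table)

# Hypothetical stand-in for DescTools::Winsorize: clip at the 5%/95% quantiles
winsorize_q <- function(v, probs = c(0.05, 0.95)) {
  q <- quantile(v, probs, na.rm = TRUE, names = FALSE)
  pmin(pmax(v, q[1]), q[2])
}

set.seed(1)
dt <- data.table(id = 1:10,
                 Country = sample(c("Germany", "USA"), 10, replace = TRUE),
                 x = rnorm(10, 1, 10),
                 y = rnorm(10, 1, 10),
                 factor = factor(rep(c("A", "B"), each = 5)))
sel.col <- c("x", "y")
x_before <- dt$x

# c() joins the list of transformed columns with a list of untouched columns;
# no := is used, so dt itself is left unchanged
res <- dt[, c(lapply(.SD, winsorize_q), list(id = id, Country = Country)),
          .SDcols = sel.col, by = factor]
res
```

Since nothing is assigned by reference, the same call can be repeated with other transformation functions or quantile settings against the untouched dt.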
There is no need to merge, you can assign columns after lapply call:
> library(DescTools)
> library(data.table)
> dt<-data.table(id=1:10, Country=sample(c("Germany", "USA"),10, replace=TRUE), x=rnorm(10,1,10),y=rnorm(10,1,10),factor=factor(sample(LETTERS[1:2],10,replace=TRUE)))
> sel.col<-c("x","y")
> dt
id Country x y factor
1: 1 Germany 13.116248 -0.4609152 B
2: 2 Germany -6.623404 -3.7048052 A
3: 3 USA -18.027532 22.2946805 A
4: 4 USA -13.377736 6.2021252 A
5: 5 Germany -12.585897 0.8255081 B
6: 6 Germany -8.816252 -12.1218135 B
7: 7 USA -3.459926 -11.5710316 B
8: 8 USA 3.180706 6.3262951 B
9: 9 Germany -5.520637 7.2877123 A
10: 10 Germany 15.857069 8.6422997 A
> # Notice an assignment `(sel.col) :=` here:
> dt[,(sel.col) := lapply(.SD,Winsorize),.SDcols=sel.col,by=factor]
> dt
id Country x y factor
1: 1 Germany 11.129140 -0.4609152 B
2: 2 Germany -6.623404 -1.7234191 A
3: 3 USA -17.097573 19.5642043 A
4: 4 USA -13.377736 6.2021252 A
5: 5 Germany -11.831968 0.8255081 B
6: 6 Germany -8.816252 -12.0116571 B
7: 7 USA -3.459926 -11.5710316 B
8: 8 USA 3.180706 5.2261377 B
9: 9 Germany -5.520637 7.2877123 A
10: 10 Germany 11.581528 8.6422997 A

Efficiently joining two data tables with a condition

One data table (let's call it A) contains the ID numbers:
ID
3
5
12
8
...
and another table (let's call it B) contains the lower bound and the upper bound and the name for that ID.
ID_lower ID_upper Name
1 4 James
5 7 Arthur
8 11 Jacob
12 13 Sarah
so based on table B, given the ID from table A, we can find the matching name by finding the name on the row in table B such that
ID_lower <= ID <= ID_upper
and I want to create a table of ID and Name, so in the above example, it would be
ID Name
3 James
5 Arthur
12 Sarah
8 Jacob
... ...
I used a for loop: for each row of A, I look for the row in B such that ID is between ID_lower and ID_upper, and take the name from there.
However, this method was a bit slow. Is there a faster way of doing it in R?
Using the new non-equi joins feature in the current development version of data.table, this is straightforward:
require(data.table) # v1.9.7+
dt2[dt1, .(ID, Name), on=.(ID_lower <= ID, ID_upper >= ID)]
See the installation instructions for devel version here.
where,
dt1=fread('ID
3
5
12
8')
dt2 = fread('ID_lower ID_upper Name
1 4 James
5 7 Arthur
8 11 Jacob
12 13 Sarah')
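You can make a look-up table with your second data.frame (B):
lu <- do.call(rbind,
apply(B,1,function(x)
data.frame(ID=c(x[1]:x[2]),Name=x[3], row.names = NULL)))
then you query it with your first data.frame (A), matching on the ID value rather than on the row position (the two only coincide when the ranges start at 1 and have no gaps):
A$Name <- lu$Name[match(A$ID, lu$ID)]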
You can try this data.table solution:
data.table::setDT(B)[, .(Name, ID = Map(`:`, ID_lower, ID_upper))][
, .(ID = unlist(ID)), .(Name)][ID %in% A$ID]
Name ID
1: James 3
2: Arthur 5
3: Sarah 12
4: Jacob 8
I believe findInterval() on ID_lower might be the ideal approach here:
A[,Name:=B[findInterval(ID,ID_lower),Name]];
A;
## ID Name
## 1: 3 James
## 2: 5 Arthur
## 3: 12 Sarah
## 4: 8 Jacob
This will only be correct if (1) B is sorted by ID_lower and (2) all values in A$ID are covered by the ranges in B.
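A sketch of how the two caveats above could be handled in base R: sort B first, then treat a position of 0 or an ID above ID_upper as "no match". The function name lookup_name is made up for this illustration, and the ranges in B are assumed not to overlap.

```r
# Hypothetical helper: map each id to the Name of the range [ID_lower, ID_upper]
# that contains it, or NA if no range does. Assumes the ranges do not overlap.
lookup_name <- function(ids, B) {
  B <- B[order(B$ID_lower), ]                 # findInterval() needs sorted breaks
  pos <- findInterval(ids, B$ID_lower)        # 0 when id is below every lower bound
  pos[pos == 0] <- NA
  inside <- !is.na(pos) & ids <= B$ID_upper[pos]   # also check the upper bound
  ifelse(inside, B$Name[pos], NA_character_)
}

A <- data.frame(ID = c(3, 5, 12, 8))
B <- data.frame(ID_lower = c(1, 5, 8, 12),
                ID_upper = c(4, 7, 11, 13),
                Name = c("James", "Arthur", "Jacob", "Sarah"),
                stringsAsFactors = FALSE)

A$Name <- lookup_name(A$ID, B)
A
```

IDs that fall outside every range, or into a gap between two ranges, come back as NA instead of silently receiving a neighbouring name.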

Ordering alphabetically after ordering once numerically [duplicate]

Suppose I have a data frame with 3 columns (name, y, sex), where name is a character, y is a numeric value and sex is a factor.
sex<-c("M","M","F","M","F","M","M","M","F")
x<-c("MARK","TOM","SUSAN","LARRY","EMMA","LEONARD","TIM","MATT","VIOLET")
name<-as.character(x)
y<-rnorm(9,8,1)
score<-data.frame(x,y,sex)
score
name y sex
1 MARK 6.767086 M
2 TOM 7.613928 M
3 SUSAN 7.447405 F
4 LARRY 8.040069 M
5 EMMA 8.306875 F
6 LEONARD 8.697268 M
7 TIM 10.385221 M
8 MATT 7.497702 M
9 VIOLET 10.177969 F
If I wanted to order it by y I would use:
score[order(score$y),]
x y sex
1 MARK 6.767086 M
3 SUSAN 7.447405 F
8 MATT 7.497702 M
2 TOM 7.613928 M
4 LARRY 8.040069 M
5 EMMA 8.306875 F
6 LEONARD 8.697268 M
9 VIOLET 10.177969 F
7 TIM 10.385221 M
So far, so good... The names keep the correct score, BUT how could I reorder it so that the M and F levels are not mixed? I need to order by y and at the same time keep the factor levels separated.
Finally, I would like to go a step further and involve the character column. The example doesn't show it, but what if there were tied y values and I had to order again within each factor level (e.g. TIM and TOM both got 8.4 and I have to put them in alphabetical order)?
I was thinking about the by function, but it creates a list and doesn't really help. I think there must be some function like it to apply on data frames and get data frames as return.
TO MAKE CLEAR THE POINT:
sep<-split(score,score$sex)
sep$M<-sep$M[order(sep$M[,2]),]
sep$M
x y sex
1 MARK 6.767086 M
8 MATT 7.497702 M
2 TOM 7.613928 M
4 LARRY 8.040069 M
6 LEONARD 8.697268 M
7 TIM 10.385221 M
sep$F<-sep$F[order(sep$F[,2]),]
sep$F
x y sex
3 SUSAN 7.447405 F
5 EMMA 8.306875 F
9 VIOLET 10.177969 F
merged<-rbind(sep$M,sep$F)
merged
x y sex
1 MARK 6.767086 M
8 MATT 7.497702 M
2 TOM 7.613928 M
4 LARRY 8.040069 M
6 LEONARD 8.697268 M
7 TIM 10.385221 M
3 SUSAN 7.447405 F
5 EMMA 8.306875 F
9 VIOLET 10.177969 F
I know how to do that if I have 2 or 3 factors. But what if I had serious levels of factors, say 20, should I write a for loop?
order takes multiple arguments, and it does just what you want:
with(score, score[order(sex, y, x),])
## x y sex
## 3 SUSAN 6.636370 F
## 5 EMMA 6.873445 F
## 9 VIOLET 8.539329 F
## 6 LEONARD 6.082038 M
## 2 TOM 7.812380 M
## 8 MATT 8.248374 M
## 4 LARRY 8.424665 M
## 7 TIM 8.754023 M
## 1 MARK 8.956372 M
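To illustrate the tie-breaking part of the question, here is a small sketch with made-up values in which TIM and TOM share the same y. Later arguments to order() are only consulted when the earlier ones are tied, so sex groups the rows, y sorts within each group, and x settles ties alphabetically.

```r
# Made-up data: TOM and TIM are tied on y
score <- data.frame(x   = c("TOM", "TIM", "SUSAN", "EMMA"),
                    y   = c(8.4, 8.4, 7.4, 8.3),
                    sex = factor(c("M", "M", "F", "F")),
                    stringsAsFactors = FALSE)

# sex first, then y within sex, then x to break ties in y
sorted <- score[order(score$sex, score$y, score$x), ]
sorted

# A minus sign reverses a numeric key: highest y first within each sex
sorted_desc <- score[order(score$sex, -score$y, score$x), ]
sorted_desc
```

The same call works unchanged for a factor with 20 levels, so no loop over the levels is needed.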
Here is a summary of all methods mentioned in other answers/comments (to serve future searchers). I've added a data.table way of sorting.
# Base R
do.call(rbind, by(score, score$sex, function(x) x[order(x$y),]))
with(score, score[order(sex, y, x),])
score[order(score$sex,score$x),]
# Using plyr
arrange(score, sex,y)
ddply(score, c('sex', 'y'))
# Using `data.table`
library("data.table")
score_dt <- setDT(score)
# setting a key sorts the data.table
setkey(score_dt,sex,x)
print(score_dt)
Here is another question that deals with the same issue.
I think there must be some function like it to apply on data frames
and get data frames as return
Yes there is:
library(plyr)
ddply(score, c('y', 'sex'))
It sounds to me like you're trying to order by score within the males and females and return a combined data frame of sorted males and sorted females.
You are right that by(score, score$sex, function(x) x[order(x$y),]) returns a list of sorted data frames, one for male and one for female. You can use do.call with the rbind function to combine these data frames into a single final data frame:
do.call(rbind, by(score, score$sex, function(x) x[order(x$y),]))
# x y sex
# F.5 EMMA 7.526866 F
# F.9 VIOLET 8.182407 F
# F.3 SUSAN 9.677511 F
# M.4 LARRY 6.929395 M
# M.8 MATT 7.970015 M
# M.7 TIM 8.297137 M
# M.6 LEONARD 8.845588 M
# M.2 TOM 9.035948 M
# M.1 MARK 10.082314 M
I believe the asker wanted to know how to sort by factor order when there are many levels, say 20:
I know how to do that if I have 2 or 3 factors. But what if I had
serious levels of factors, say 20, should I write a for loop?
I have a case with 9 ordered levels and various counts.
stage_name count
<ord> <int>
1 Closed Lost 957
2 Closed Won 1413
3 Evaluation 1773
4 Meeting Scheduled 4104
5 Nurture 1222
6 Opportunity Disqualified 805
7 Order Submitted 1673
8 Qualifying 5138
9 Quoted 4976
In this case you can see that it is displayed in alphabetical order of stage_name, but stage_name is actually meant to be an ordered factor with a very different order.
This code gives the factor that order:
# Make categoricals ----
check_stage$stage_name = ordered(check_stage$stage_name, levels=c(
'Opportunity Disqualified',
'Qualifying',
'Evaluation',
'Meeting Scheduled',
'Quoted',
'Order Submitted',
'Closed Won',
'Closed Lost',
'Nurture'))
Now we can sort by the factor with arrange. This is a dplyr function, but you might need forcats too. I have both libraries installed:
check_stage <- check_stage %>%
arrange(factor(stage_name))
This now gives the output in the factor order as desired:
check_stage
# A tibble: 9 × 2
stage_name count
<ord> <int>
1 Opportunity Disqualified 805
2 Qualifying 5138
3 Evaluation 1773
4 Meeting Scheduled 4104
5 Quoted 4976
6 Order Submitted 1673
7 Closed Won 1413
8 Closed Lost 957
9 Nurture 1222
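The same ordering can be reached in base R, since order() sorts a factor by its level order rather than alphabetically. A minimal sketch using three of the stages and counts shown above (the data frame name stages is illustrative):

```r
stages <- data.frame(stage_name = c("Closed Lost", "Closed Won", "Evaluation"),
                     count = c(957L, 1413L, 1773L),
                     stringsAsFactors = FALSE)

# Declare the intended order of the levels explicitly
stages$stage_name <- factor(stages$stage_name,
                            levels = c("Evaluation", "Closed Won", "Closed Lost"),
                            ordered = TRUE)

# order() uses the underlying level codes, so rows follow the level order
sorted <- stages[order(stages$stage_name), ]
sorted
```

Adding more levels only means extending the levels vector; the sorting call stays the same.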

How do I infill non-adjacent rows with sample data from previous rows in R?

I have data containing a unique identifier, a category, and a description.
Below is a toy dataset.
prjnumber <- c(1,2,3,4,5,6,7,8,9,10)
category <- c("based","trill","lit","cold",NA,"epic", NA,NA,NA,NA)
description <- c("skip class",
"dunk on brayden",
"record deal",
"fame and fortune",
NA,
"female attention",
NA,NA,NA,NA)
toy.df <- data.frame(prjnumber, category, description)
> toy.df
prjnumber category description
1 1 based skip class
2 2 trill dunk on brayden
3 3 lit record deal
4 4 cold fame and fortune
5 5 <NA> <NA>
6 6 epic female attention
7 7 <NA> <NA>
8 8 <NA> <NA>
9 9 <NA> <NA>
10 10 <NA> <NA>
I want to randomly sample the 'category' and 'description' columns from rows that have been filled in to use as infill for rows with missing data.
The final data frame would be complete and would only rely on the initial 5 rows which contain data. The solution would preserve between-column correlation.
An expected output would be:
> toy.df
prjnumber category description
1 1 based skip class
2 2 trill dunk on brayden
3 3 lit record deal
4 4 cold fame and fortune
5 5 lit record deal
6 6 epic female attention
7 7 based skip class
8 8 based skip class
9 9 lit record deal
10 10 trill dunk on brayden
complete = na.omit(toy.df)
toy.df[is.na(toy.df$category), c("category", "description")] =
complete[sample(1:nrow(complete), size = sum(is.na(toy.df$category)), replace = TRUE),
c("category", "description")]
toy.df
# prjnumber category description
# 1 1 based skip class
# 2 2 trill dunk on brayden
# 3 3 lit record deal
# 4 4 cold fame and fortune
# 5 5 lit record deal
# 6 6 epic female attention
# 7 7 cold fame and fortune
# 8 8 based skip class
# 9 9 epic female attention
# 10 10 epic female attention
Though it would seem a little more straightforward if you didn't start with the unique identifiers filled out for the NA rows...
You could try
library(dplyr)
toy.df %>%
mutate_each(funs(replace(., is.na(.), sample(.[!is.na(.)]))), 2:3)
Based on new information, we may need a numeric index to use in the funs.
toy.df %>%
mutate(indx= replace(row_number(), is.na(category),
sample(row_number()[!is.na(category)], replace=TRUE))) %>%
mutate_each(funs(.[indx]), 2:3) %>%
select(-indx)
Using base R to fill in a single field at a time (not preserving the correlation between the fields), use something like:
fields <- c('category','description')
for(field in fields){
missings <- is.na(toy.df[[field]])
toy.df[[field]][missings] <- sample(toy.df[[field]][!missings],sum(missings),TRUE)
}
and to fill them in simultaneously (preserving the correlation between the fields) use something like:
missings <- apply(toy.df[,fields],
1,
function(x)any(is.na(x)))
toy.df[missings,fields] <- toy.df[!missings,fields][sample(sum(!missings),
sum(missings),
TRUE),]
and of course, to avoid the implicit for loop in the apply(x,1,fun), you could use:
rowAny <- function(x) rowSums(x) > 0
missings <- rowAny(is.na(toy.df[,fields]))
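For reference, the correlation-preserving variant can be wrapped in a small function. This is a sketch only: the name infill_rows is invented here, complete.cases() is used to find the donor rows, and set.seed() is there just to make the run repeatable.

```r
# Hypothetical helper: fill rows that have an NA in any of `fields` by copying
# those fields together from a randomly chosen complete row
infill_rows <- function(df, fields) {
  donors  <- which(complete.cases(df[, fields, drop = FALSE]))
  targets <- setdiff(seq_len(nrow(df)), donors)
  if (length(donors) == 0 || length(targets) == 0) return(df)
  # sample positions within `donors` (avoids sample()'s special case for a
  # single number), with replacement so few donors can fill many rows
  picks <- donors[sample.int(length(donors), length(targets), replace = TRUE)]
  df[targets, fields] <- df[picks, fields]
  df
}

toy.df <- data.frame(
  prjnumber   = 1:10,
  category    = c("based", "trill", "lit", "cold", NA, "epic", NA, NA, NA, NA),
  description = c("skip class", "dunk on brayden", "record deal",
                  "fame and fortune", NA, "female attention", NA, NA, NA, NA),
  stringsAsFactors = FALSE)

set.seed(42)
filled <- infill_rows(toy.df, c("category", "description"))
filled
```

Because whole rows are copied from the donors, every filled category keeps the description it originally came with.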

