I am using the RecordLinkage package in R to deduplicate a dataset. The deduped output from the RecordLinkage package has loops in it.
For example:
Table rlinkage:
id  name          id2  name2
1   Jane Johnson  5    Jane Johnson
5   Jane Johnson  17   Jane Johnson
I am trying to make a table that lists each id associated with all other id numbers in the loop of records.
For example:
id1 id2 id3 Name
1 5 17 Jane Johnson
or
Name Ids
Jane Johnson 1,5,17
Is this possible in R? I tried using the sqldf package to join the dataset onto itself multiple times to try to get all ids on the same line.
For example:
rlinkage2 <- sqldf('select a.id,
                           a.id2,
                           b.id as id3,
                           b.id2 as id4
                    from rlinkage a
                    left join rlinkage b
                      on a.id = b.id
                      or a.id = b.id2
                      or a.id2 = b.id
                      or a.id2 = b.id2')
This creates a very messy dataset and will not put all of the ids on the same line unless I join the table rlinkage to itself many times. Is there a better way to do this?
1) sqldf To do this using sqldf, union the two sets of columns and then use group_concat:
sqldf("select name, group_concat(distinct id) ids from (
select id, name from rlinkage
union
select id2 id, name2 name from rlinkage
) group by name")
giving:
name ids
1 Jane Johnson 1,5,17
2) rbind/aggregate With plain R:
long <- rbind(rlinkage[1:2], setNames(rlinkage[3:4], names(rlinkage)[1:2]))
aggregate(id ~ name, long, function(x) toString(unique(x)))
giving:
name id
1 Jane Johnson 1, 5, 17
Note: We used this as the data:
Lines <- "id,name,id2,name2
1,Jane Johnson,5,Jane Johnson
5,Jane Johnson,17,Jane Johnson"
rlinkage <- read.csv(text = Lines, as.is = TRUE)
The answer to this question is to use a graph to identify all connected components. If the nodes in the graph are the ids listed in the question above, we can create an edge list like this:
1 -> 5
5 -> 17
The graph would look like 1 -> 5 -> 17. Finding the connected components within the graph reveals all of the groups.
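For example, this can be sketched with the igraph package (the edge list below mirrors the sample pairs; the choice of igraph is an assumption, not part of the original answer):

```r
library(igraph)

# build an undirected graph from the deduplication pairs
edges <- data.frame(from = c(1, 5), to = c(5, 17))
g <- graph_from_data_frame(edges, directed = FALSE)

# connected components group all linked ids together
comp <- components(g)
split(names(comp$membership), comp$membership)
# one list element per person, e.g. ids "1", "5", "17" in one group
```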
Related
Background:
I'm working with a fairly large (>10,000 rows) dataset of individual cars, and I need to do some analysis on it. I need to keep this dataset d intact, but I'm only going to be analyzing cars made by Japanese companies (e.g. Nissan, Honda, etc.). d contains information like VIN_prefix (the first two letters of a VIN number that indicates the "World Manufacturer Number"), model year, and make, but no explicit indicator of whether the car is made by a Japanese firm. Here's d:
d <- data.frame(
make = c("GMC","Dodge","NA","Subaru","Nissan","Chrysler"),
model_yr = c("1999","2004","1989","1999","2006","2012"),
VIN_prefix = c("1G","1D","JH","JF","NA","2C"),
stringsAsFactors=FALSE)
Here, rows 3, 4, and 5 correspond to Japanese cars: the NA in row 3 is actually an Acura whose make is missing. See below when I get to the other dataset about why this is.
d also lacks some attributes (columns) about cars that I need for my analysis, e.g. the current CEO of Japanese car firms.
Enter another dataset, a, a dataset about Japanese car firms which contains those extra attributes as well as columns that could be used to identify whether a given car (row) in d is made by a Japanese firm. One of those is VIN_prefix; the other is jp_makes, a list of Japanese auto firms. Here's a:
a <- data.frame(
VIN_prefix = c("JH","JF","1N"),
jp_makes = c("Acura","Subaru","Nissan"),
current_ceo = c("Toshihiro Mibe","Tomomi Nakamura","Makoto Ushida"),
stringsAsFactors=FALSE)
Here, we can see that the "Acura" make, missing in the car from row 3 in d, could be identified by its VIN_prefix "JH", which in row 3 of d is not NA.
Goal:
Left join a onto d so that each of the 3 Japanese cars in d gets the relevant corresponding attributes from a - mainly, current_ceo. (Non-Japanese cars in d would have NA for columns joined from a; this is fine.)
Problem:
As you can tell, the two relevant variables in d that could be used as keys in a join - make and VIN_prefix - have missing data in d. The "matching rules" we could use are imperfect: I could match on d$make == a$jp_makes or on d$VIN_prefix == a$VIN_prefix, but they'd each be wrong due to the missing data in d.
What to do?
What I've tried:
I can try left joining on either one of these potential keys, but not all 3 of the Japanese cars in d wind up with their correct information from a:
try1 <- left_join(d, a, by = c("make" = "jp_makes"))
try2 <- left_join(d, a, by = c("VIN_prefix" = "VIN_prefix"))
I can successfully generate a logical 'indicator' variable in d that tells me whether a car is Japanese or not:
entries_make <- a$jp_makes
entries_vin_prefix <- a$VIN_prefix
d <- d %>%
  mutate(is_jp = as.logical(
    ifelse(VIN_prefix %in% entries_vin_prefix | make %in% entries_make, 1, 0)))
But that only gets me halfway: I still need those other columns from a to sit next to those Japanese cars in d. It's unfeasible to manually fill all the missing data in some other way; the real datasets these toy examples correspond to are too big for that and I don't have the manpower or time.
Ideally, I'd like a dataset that looks something like this:
ideal <- data.frame(
make = c("GMC","Dodge","NA","Subaru","Nissan","Chrysler"),
model_yr = c("1999","2004","1989","1999","2006","2012"),
VIN_prefix = c("1G","1D","JH","JF","NA","2C"),
current_ceo = c("NA", "NA", "Toshihiro Mibe","Tomomi Nakamura","Makoto Ushida", "NA"),
stringsAsFactors=FALSE)
What do you all think? I've looked at other posts (e.g. here) but their solutions don't really apply. Any help is much appreciated!
Left join on an OR of the two conditions.
library(sqldf)
sqldf("select d.*, a.current_ceo
from d
left join a on d.VIN_prefix = a.VIN_prefix or d.make = a.jp_makes")
giving:
make model_yr VIN_prefix current_ceo
1 GMC 1999 1G <NA>
2 Dodge 2004 1D <NA>
3 NA 1989 JH Toshihiro Mibe
4 Subaru 1999 JF Tomomi Nakamura
5 Nissan 2006 NA Makoto Ushida
6 Chrysler 2012 2C <NA>
Use a two-pass method. First fill in the missing make (or VIN) values; I'll illustrate by filling in the make values. Do notice that "NA" is not the same as NA: the first is a character value while the latter is a true R missing value, so I'd first convert those to true missing values. In natural language, I am replacing the missing values in d with values of jp_makes that are taken from a on the basis of matching VIN_prefix values:
is.na(d$make) <- d$make == "NA"
d$make[is.na(d$make)] <- a$jp_makes[
  match(d$VIN_prefix[is.na(d$make)], a$VIN_prefix)]
Now you have the make values filled in on the basis of the table lookup in a. It should be trivial to do the merge you wanted using by.x='make', by.y='jp_makes':
merge(d, a, by.x='make', by.y='jp_makes', all.x=TRUE)
make model_yr VIN_prefix.x VIN_prefix.y current_ceo
1 Acura 1989 JH JH Toshihiro Mibe
2 Chrysler 2012 2C <NA> <NA>
3 Dodge 2004 1D <NA> <NA>
4 GMC 1999 1G <NA> <NA>
5 Nissan 2006 NA 1N Makoto Ushida
6 Subaru 1999 JF JF Tomomi Nakamura
You can then use the values in VIN_prefix.y to replace the "NA" values in VIN_prefix.x.
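A minimal sketch of that replacement step, assuming the merge result above is stored in m and that VIN_prefix.x holds the string "NA" rather than true missing values:

```r
m <- merge(d, a, by.x = 'make', by.y = 'jp_makes', all.x = TRUE)

# where VIN_prefix.x holds the string "NA", take the value from VIN_prefix.y
bad <- m$VIN_prefix.x == "NA"
m$VIN_prefix.x[bad] <- m$VIN_prefix.y[bad]
```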
This question already has answers here:
Select only the first row when merging data frames with multiple matches
I want to match values in a specific column to a different dataframe, and retrieve the entire row of data for only the first match, and not all combinations.
I tried using left_join, but it simply retrieves all the data. Below is an example:
# First table
cusip <- c("AAA1", "AAA2","AAA3")
Datecode <- c("201912", "202003", "202006")
FirstTable <- data_frame(cusip,Datecode)
#Lookuptable
cusip <- c("AAA1", "AAA1","AAA2","AAA2","AAA3","AAA3")
Name <- c("Facebook Inc", "Facebook", "Apple","Apple INC", "Amz", "Amazon")
LookupTable <- data_frame(cusip,Name)
So What I want is to create a Name column in FirstTable that retrieves the Name from the Lookuptable. But, I don't care whether it says Facebook or Facebook Inc for AAA1.
A simple left_join keeps all combinations, so it gives me 6 rows when I only want 3.
Hope that someone can help, Thanks!
A relatively new user of R
library(dplyr)
left_join(FirstTable,
LookupTable %>% group_by(cusip) %>% slice(1))
Joining, by = "cusip"
# A tibble: 3 x 3
cusip Datecode Name
<chr> <chr> <chr>
1 AAA1 201912 Facebook Inc
2 AAA2 202003 Apple
3 AAA3 202006 Amz
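An alternative sketch with the same effect is dplyr's distinct(), which also keeps only the first row per cusip (this variant is an addition, not part of the answer above):

```r
library(dplyr)

# distinct() keeps the first occurrence of each cusip before joining
left_join(FirstTable,
          distinct(LookupTable, cusip, .keep_all = TRUE),
          by = "cusip")
```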
fast data.table approach
library(data.table)
#make them data.tables
setDT(LookupTable); setDT(FirstTable)
#perform update join
FirstTable[ LookupTable, Name := i.Name, on = .(cusip)]
# cusip Datecode Name
# 1: AAA1 201912 Facebook
# 2: AAA2 202003 Apple INC
# 3: AAA3 202006 Amazon
I am reading in data from a .txt file that contains thousands of records
table1 <- read.table("teamwork.txt", sep ="|", fill = TRUE)
Looks like:
f_name l_name hours_worked code
Jim Baker 8.5 T
Richard Copton 4.5 M
Tina Bar 10 S
However I only want to read in data that has an 'S' or 'M' code:
I tried to concat the columns:
newdata <- subset(table1, code = 'S' |'M')
However I get this issue:
operations are possible only for numeric, logical or complex types
If there are thousands or tens of thousands of records (maybe not for millions), you should just be able to filter after you read in all the data:
> library(tidyverse)
> df %>% filter(code=="S"|code=="M")
# A tibble: 2 x 4
f_name l_name hours_worked code
<fct> <fct> <dbl> <fct>
1 Richard Copton 4.50 M
2 Tina Bar 10.0 S
If you really want to just pull in the rows that meet your condition, try sqldf package as in example here: How do i read only lines that fulfil a condition from a csv into R?
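That sqldf approach might be sketched with read.csv.sql, which filters while the file is read (the file name and separator come from the question; header = TRUE is an assumption about the file):

```r
library(sqldf)

# only rows with code S or M are loaded into R
table1 <- read.csv.sql("teamwork.txt",
                       sql = "select * from file where code in ('S', 'M')",
                       header = TRUE, sep = "|")
```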
You can try
cols_g <- table1[which(table1$code == "S" | table1$code == "M"), ]
OR
cols_g <- subset(table1, code=="S" | code=="M")
OR
library(dplyr)
cols_g <- table1 %>% filter(code=="S" | code=="M")
If you want to store the result as part of table1 rather than separately, assign the output of any of these three methods to table1$cols_g instead of cols_g.
I have 2 data frames like this
TEAM <- c("PE","PE","MPI","TDT","HPT","ATD")
CODE <- c(NA,"F","A","H","G","D")
df1 <- data.frame(TEAM,CODE)
CODE <- c(NA,"F100","A234","D664","H435","G123","A666","D345","G324",NA)
TEAM <- c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA)
df2 <- data.frame(CODE,TEAM)
I am trying to update the TEAM in df2 by matching the first letter in code column in df1 with the code column in df2
My desired output for df2
CODE TEAM
1 NA PE
2 F100 PE
3 A234 MPI
4 D664 ATD
5 H435 TDT
6 G123 HPT
7 A666 MPI
8 D345 ATD
9 G324 HPT
10 NA PE
I am trying this way with sqldf but it is not right
library(sqldf)
df2 <- sqldf(c("update df2 set TEAM =
case
when CODE like '%F%' then 'PE'
when CODE like '%A%' then 'MPI'
when CODE like '%D%' then 'ATD'
when CODE like '%G%' then 'HPT'
when CODE like '%H%' then 'TDT'
else 'NA'
end"))
Can someone help me provide some directions on achieving this without sqldf?
Using match and substr (both in base R):
df2$TEAM = df1$TEAM[match(substr(df2$CODE, 1, 1), df1$CODE)]
df2
# CODE TEAM
# 1 <NA> PE
# 2 F100 PE
# 3 A234 MPI
# 4 D664 ATD
# 5 H435 TDT
# 6 G123 HPT
# 7 A666 MPI
# 8 D345 ATD
# 9 G324 HPT
# 10 <NA> PE
This is expedient for a single case - if you're doing things like this frequently I would encourage you to just extract the first letter of code into its own column, CODE_1, and then do a regular merge or join.
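That merge-based approach might look like this (CODE_1 is a hypothetical helper column; note that merge will suffix df2's existing TEAM column as TEAM.x):

```r
# extract the first letter of CODE into its own key column
df2$CODE_1 <- substr(df2$CODE, 1, 1)

# then a regular merge against df1's one-letter codes
merged <- merge(df2, df1, by.x = "CODE_1", by.y = "CODE", all.x = TRUE)
```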
Assuming you are looking for an sqldf solution try this:
sqldf("select CODE,
case
when CODE like 'F%' then 'PE'
when CODE like 'A%' then 'MPI'
when CODE like 'D%' then 'ATD'
when CODE like 'G%' then 'HPT'
when CODE like 'H%' then 'TDT'
else 'PE'
end TEAM from df2", method = "raw")
or this:
sqldf("select df2.CODE, coalesce(df1.TEAM, 'PE') TEAM
from df2
left join df1 on substr(df2.CODE, 1, 1) = df1.CODE")
Sorry in advance for the long post.
Although I manage to overcome this using a for-loop, I have a feeling sqldf would be more efficient, but I could not get it right so far.
My first data frame has a unique identifier (Name). It is something like a 1000x5, but in the spirit of this:
Name <- c('Ben','Gary','John','Michael')
Age <- c(13,20,5,57)
dfA <- as.data.frame(cbind(Name,Age))
dfA
> Name Age
> 1 Ben 13
> 2 Gary 20
> 3 John 5
> 4 Michael 57
My second data frame does NOT have a unique key, is also 5000x5, but looks generally like this:
Name <- c('Ben','Ben','Ben','Gary','Michael','Michael','Michael')
Color <- c('Blue','Red','Green','Red','Yellow','Yellow','Black')
Other.Entries <- c('180','200','150','100','70','200','130')
dfB <- as.data.frame(cbind(Name,Color))
dfB
> Name Color Other_Entries(not.related)
>1 Ben Blue 180
>2 Ben Red 180
>3 Ben Green 150
>4 Gary Red 100
>5 Michael Yellow 70
>6 Michael Yellow 200
>7 Michael Black 130
Notice that there are redundancies in the Colors for each Names, and not all Names appear.
My desired output is to:
Retrieve the Color for each Name in data frame B (remove redundant, possibly alphabetically)
Convert these few Colors to a string (by using function "toString" for example)
Add the string as a new entry in the first data frame
At first when I used the for loop I created a new data frame with an empty column like this
dfCombined <- dfA
dfCombined["Color"] <- NA
.. and iterated over all rows, querying from the second data frame.
But perhaps this may not be necessary using something clever.
The end result should be:
dfCombined
> Name Age Color
>1 Ben 13 Blue, Green, Red
>2 Gary 20 Red
>3 John 5
>4 Michael 57 Black, Yellow
Any suggestions?
1a) sqldf with multiple statements Try this:
library(sqldf)
dfB_s <- sqldf("select distinct * from dfB order by Name, Color")
dfB_g <- sqldf("select Name, group_concat(Color) Color
from dfB_s
group by Name")
sqldf("select *
from dfA
left join dfB_g using (Name)")
1b) sqldf with one statement or all in one:
sqldf("select *
from dfA
left join
(select Name, group_concat(Color) Color
from
(select distinct * from dfB order by Name, Color)
group by Name)
using (Name)")
Either of these gives:
Name Age Color
1 Ben 13 Blue,Green,Red
2 Gary 20 Red
3 John 5 <NA>
4 Michael 57 Black,Yellow
2) without packages Without sqldf it would be done like this:
dfB_s <- unique(dfB)[order(dfB$Name, dfB$Color), ]
dfB_g <- aggregate(Color ~ Name, dfB_s, toString)
merge(dfA, dfB_g, all.x = TRUE, by = "Name")
3) data.table If speed is the issue you might want to try data.table:
library(data.table)
unique(data.table(dfB, key = "Name,Color"))[
, toString(Color), by = Name][
data.table(dfA)]
giving:
Name V1 Age
1: Ben Blue, Green, Red 13
2: Gary Red 20
3: John NA 5
4: Michael Black, Yellow 57
4) dplyr and here is a dplyr solution:
library(dplyr)
dfA %>%
  left_join(dfB %>%
              unique() %>%
              arrange(Name, Color) %>%
              group_by(Name) %>%
              summarise(Color = toString(Color)))
ADDED other solutions. Fixed some errors.
To batch process it, do this in real code.
Pseudo code:
Pull name
run while loop for color array
load array variable:$array = array("foo", "bar", "hello", "world");
var_dump($array);
run insert into new table for each name.