Match text across multiple rows in R

My data.frame(Networks) contains the following:
Location <- c("Farm", "Supermarket", "Farm", "Conference",
"Supermarket", "Supermarket")
Instructor <- c("Bob", "Bob", "Louise", "Sally", "Lee", "Jeff")
Operator <- c("Lee", "Lee", "Julie", "Louise", "Bob", "Louise")
Networks <- data.frame(Location, Instructor, Operator, stringsAsFactors=FALSE)
MY QUESTION
I wish to include a new column, Transactions$Count, in a new data.frame Transactions that counts the exchanges between each Instructor and Operator for every Location.
EXPECTED OUTPUT
Location <- c("Farm", "Supermarket", "Farm", "Conference", "Supermarket")
Person1 <- c("Bob", "Bob", "Louise", "Sally", "Jeff")
Person2 <- c("Lee", "Lee", "Julie", "Louise", "Louise")
Count <- c(1, 2, 1, 1, 1)
Transactions <- data.frame(Location, Person1, Person2, Count,
                           stringsAsFactors = FALSE)
For example, there would be a total of 2 exchanges between Bob and Lee at the Supermarket. It does not matter whether a person is an instructor or an operator; I am interested in the exchange between them. In the expected output, the two exchanges between Bob and Lee at the Supermarket are noted, and there is one exchange for every other combination at the other locations.
WHAT I HAVE TRIED
I thought grepl might be of use, but I need to iterate across 1,300 rows of this data, so it may be computationally expensive.
Thank you.

You can consider using "data.table", with pmin and pmax in your by argument.
Example:
Networks <- data.frame(Location, Instructor, Operator, stringsAsFactors = FALSE)
library(data.table)
as.data.table(Networks)[
  , TransCount := .N,
  by = list(Location,
            pmin(Instructor, Operator),
            pmax(Instructor, Operator))][]
# Location Instructor Operator TransCount
# 1: Farm Bob Lee 1
# 2: Supermarket Bob Lee 2
# 3: Farm Louise Julie 1
# 4: Conference Sally Louise 1
# 5: Supermarket Lee Bob 2
# 6: Supermarket Jeff Louise 1
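The trick is that pmin and pmax are vectorized and compare character vectors alphabetically, so each Instructor/Operator pair is reduced to the same canonical key no matter who played which role. A minimal sketch:
# "Bob"/"Lee" and "Lee"/"Bob" collapse to the same (pmin, pmax) pair:
pmin(c("Bob", "Lee"), c("Lee", "Bob"))  # "Bob" "Bob"
pmax(c("Bob", "Lee"), c("Lee", "Bob"))  # "Lee" "Lee"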
Based on your update, it sounds like this might be more appropriate for you:
as.data.table(Networks)[
  , c("Person1", "Person2") := list(
      pmin(Instructor, Operator),
      pmax(Instructor, Operator)),
  by = 1:nrow(Networks)
][
  , list(TransCount = .N),
  by = .(Location, Person1, Person2)
]
# Location Person1 Person2 TransCount
# 1: Farm Bob Lee 1
# 2: Supermarket Bob Lee 2
# 3: Farm Julie Louise 1
# 4: Conference Louise Sally 1
# 5: Supermarket Jeff Louise 1

You may try
library(dplyr)
Networks %>%
  group_by(Location, Person1 = pmin(Instructor, Operator),
           Person2 = pmax(Instructor, Operator)) %>%
  summarise(Count = n())
# Location Person1 Person2 Count
#1 Conference Louise Sally 1
#2 Farm Bob Lee 1
#3 Farm Julie Louise 1
#4 Supermarket Bob Lee 2
#5 Supermarket Jeff Louise 1
Or using base R
# order each pair alphabetically with pmin/pmax, then count rows per combination
d1 <- cbind(Location = Networks[, 1],
            data.frame(setNames(Map(do.call, c('pmin', 'pmax'),
                                    list(Networks[-1])), c('Person1', 'Person2'))))
aggregate(cbind(Count = 1:nrow(d1)) ~ ., d1, FUN = length)
# Location Person1 Person2 Count
#1 Farm Bob Lee 1
#2 Supermarket Bob Lee 2
#3 Supermarket Jeff Louise 1
#4 Farm Julie Louise 1
#5 Conference Louise Sally 1
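In case the Map/do.call line is hard to parse: it simply calls pmin and pmax once each on the two name columns, so an equivalent (if less compact) spelling would be:
d1 <- data.frame(Location = Networks$Location,
                 Person1 = pmin(Networks$Instructor, Networks$Operator),
                 Person2 = pmax(Networks$Instructor, Networks$Operator))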
data
Networks <- data.frame(Location, Instructor, Operator,
                       stringsAsFactors = FALSE)

Related

Merge R dataframes with at least x columns matching

I have 2 dataframes that I need to match based on at least x columns being the same. df1 has columns A:E; df2 has columns A:Z. Columns A:E are the same in both dfs, but the rows are in a different order.
df1 would look something like:
forename surname birthdate code gender
Joe Bloggs 23/03/2001 SW3 m
Anne Anderson 11/11/1999 D37 f
Tom Smith 31/01/2002 SW4 m
Andy Clarke 02/06/1999 B37 m
df2 would look like:
forename surname birthdate code gender eye_colour dinner_option
Jules Anderson 09/01/1986 D37 m blue meat
Katy Collins 03/03/2004 NA f brown meat
Andrew Clarke 02/06/1999 NA m brown veg
Joe Bloggs 23/03/2001 SW3 m green fish
What I need to do is:
compare cols A:E in df1 and df2
find the rows in df2 A:E that match at least 3 columns of df1
for the rows that match 3 or more columns, create df3 with df1[,A:E] and df2[,A:Z]
So the output (df3) would look like the following
forename surname birthdate  code gender forename surname birthdate
Joe      Bloggs  23/03/2001 SW3  m      Joe      Bloggs  23/03/2001
Andy     Clarke  02/06/1999 B37  m      Andrew   Clarke  02/06/1999
code gender eye_colour dinner_option
SW3  m      green      fish
NA   m      brown      veg
Joe Bloggs and Andy Clarke are the only ones where at least 3 of the columns match between df1 and df2.*
Any idea about how I could do this in an efficient way?
I've tried the following, but of course, this only identifies matches where ALL the columns are the same, whereas I only need 3 columns to match, not all of them.
colsToUse <- intersect(colnames(df1), colnames(df2))
matching <- match(do.call("paste", df1[, colsToUse]),
                  do.call("paste", df2[, colsToUse]))
matched <- cbind(df1, df2[matching, ])
Thank you for any help!
*I do realise there is some redundant information in df3, but for now I need it to be like that
This is my ugly first attempt.
It works for your sample data, but probably needs some (= a lot of) testing to find weaknesses.
library(data.table)
# !!df1 and df2 need to be data.table, so use fread() or setDT() !!
df1 <- fread("forename surname birthdate code gender
Joe Bloggs 23/03/2001 SW3 m
Anne Anderson 11/11/1999 D37 f
Tom Smith 31/01/2002 SW4 m
Andy Clarke 02/06/1999 B37 m")
df2 <- fread("forename surname birthdate code gender eye_colour dinner_option
Jules Anderson 09/01/1986 D37 m blue meat
Katy Collins 03/03/2004 NA f brown meat
Andrew Clarke 02/06/1999 NA m brown veg
Joe Bloggs 23/03/2001 SW3 m green fish", sep = " ")
# combinations of colnames to join on
col_join <- combn(intersect(names(df1), names(df2)), 3, simplify = FALSE)
# create df3 with dummy names
df3 <- df2
setnames(df3, paste0(names(df2), ".y"))
df3[, id := .I]
# Create expression to evaluate later
joins <- lapply(col_join, function(x) {
  paste0(sapply(x, function(x) {
    paste0(x, " = ", x, ".y")
  }), collapse = ", ")
})
# update join df1 on all join-combinations (only one match possible per row!!)
lapply(joins, function(x) {
  expr <- paste0("df1[df3, id := i.id, on = .(", x, ")]")
  eval(parse(text = expr))
})
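To see what is actually being evaluated here: each element of joins is one set of join conditions, and with the columns above the first combination expands to:
joins[[1]]
# "forename = forename.y, surname = surname.y, birthdate = birthdate.y"
# so the first update join run is:
# df1[df3, id := i.id, on = .(forename = forename.y, surname = surname.y, birthdate = birthdate.y)]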
# final join on matched rows
df3[df1[!is.na(id), ], on = .(id)][,id := NULL]
# forename.y surname.y birthdate.y code.y gender.y eye_colour.y dinner_option.y forename surname birthdate code gender
# 1: Joe Bloggs 23/03/2001 SW3 m green fish Joe Bloggs 23/03/2001 SW3 m
# 2: Andrew Clarke 02/06/1999 <NA> m brown veg Andy Clarke 02/06/1999 B37 m

How can I count a variable in R conditional on the value of another variable?

I want to count occurrences of a variable in a dataframe by another variable, conditional on the value of a third variable. Here is my data:
Name Store       Purchase   Date
John CVS         Shampoo    1/1/2001
John CVS         Toothpaste 1/1/2001
John Whole Foods Kombucha   1/1/2005
John Kroger      Ice Cream  1/1/2002
Jane CVS         Soap       1/1/2001
Jane Whole Foods Crackers   1/1/2004
For each purchase, I want a count of how many previous purchases were made by the specified person, and how many previous shopping trips, like this:
Name Store       Purchase   Date     Prev_Purchase Prev_trip
John CVS         Shampoo    1/1/2001 0             0
John CVS         Toothpaste 1/1/2001 0             0
John Whole Foods Kombucha   1/1/2005 3             2
John Kroger      Ice Cream  1/1/2002 2             1
Jane CVS         Soap       1/1/2001 0             0
Jane Whole Foods Crackers   1/1/2004 1             1
If I wanted the total number of purchases/trips for each person, I would use count or tapply--is there a way to adapt these functions so that the outputs are conditional on a third variable (date)?
Maybe you can try this base R approach using ave:
transform(df,
          Prev_Purchase = ave(as.numeric(as.Date(Date, "%d/%m/%Y")), Name,
                              FUN = function(x) sapply(x, function(p) sum(p > x))),
          Prev_trip = ave(as.numeric(as.Date(Date, "%d/%m/%Y")), Name,
                          FUN = function(x) sapply(x, function(p) length(unique(x[p > x])))))
which gives
  Name       Store   Purchase     Date Prev_Purchase Prev_trip
1 John         CVS    Shampoo 1/1/2001             0         0
2 John         CVS Toothpaste 1/1/2001             0         0
3 John Whole Foods   Kombucha 1/1/2005             3         2
4 John      Kroger  Ice Cream 1/1/2002             2         1
5 Jane         CVS       Soap 1/1/2001             0         0
6 Jane Whole Foods   Crackers 1/1/2004             1         1
Data
df <- structure(list(Name = c("John", "John", "John", "John", "Jane",
"Jane"), Store = c("CVS", "CVS", "Whole Foods", "Kroger", "CVS",
"Whole Foods"), Purchase = c("Shampoo", "Toothpaste", "Kombucha",
"Ice Cream", "Soap", "Crackers"), Date = c("1/1/2001", "1/1/2001",
"1/1/2005", "1/1/2002", "1/1/2001", "1/1/2004")), class = "data.frame", row.names = c(NA,
-6L))
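For intuition, the inner sapply calls do the per-person counting; with John's four dates the two FUN arguments work out like this:
x <- as.numeric(as.Date(c("1/1/2001", "1/1/2001", "1/1/2005", "1/1/2002"), "%d/%m/%Y"))
sapply(x, function(p) sum(p > x))               # 0 0 3 2  (earlier purchases)
sapply(x, function(p) length(unique(x[p > x]))) # 0 0 2 1  (earlier distinct dates)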
I think this should solve your problem. If your data is huge, it would be worth optimizing this code chunk.
# load environment (dplyr is needed for filter() and bind_rows() below)
library(dplyr)
library(lubridate)
# base function
AddInfo = function(name, date, df) {
  prev_purchase = sum(df$Name == name & df$Date < date)
  prev_trip = length(unique(filter(df, Name == name & Date < date)$Date))
  data = data.frame(
    Prev_purchase = prev_purchase,
    Prev_trip = prev_trip
  )
  return(data)
}
# define data frame
df = data.frame(
  Name = c(rep('John', 4), rep('Jane', 2)),
  Store = c('CVS', 'CVS', 'Whole Foods', 'Kroger', 'CVS', 'Whole Foods'),
  Purchase = c('Shampoo', 'Toothpaste', 'Kombucha', 'Ice Cream', 'Soap', 'Crackers'),
  Date = c('1/1/2001', '1/1/2001', '1/1/2005', '1/1/2002', '1/1/2001', '1/1/2004')
)
# parse the strings into Date objects
df$Date = dmy(df$Date)
# apply function and bind the results
cols = mapply(AddInfo, df$Name, df$Date, MoreArgs = list(df), SIMPLIFY = FALSE)
cols = bind_rows(cols)
df = cbind(df, cols)
Here is the output:
  Name       Store   Purchase       Date Prev_purchase Prev_trip
1 John         CVS    Shampoo 2001-01-01             0         0
2 John         CVS Toothpaste 2001-01-01             0         0
3 John Whole Foods   Kombucha 2005-01-01             3         2
4 John      Kroger  Ice Cream 2002-01-01             2         1
5 Jane         CVS       Soap 2001-01-01             0         0
6 Jane Whole Foods   Crackers 2004-01-01             1         1
We could also use outer
library(dplyr)
library(lubridate)
df %>%
  mutate(Date = dmy(Date)) %>%
  group_by(Name) %>%
  mutate(Prev_Purchase = colSums(outer(Date, Date, FUN = "<")),
         Prev_trip = colSums(outer(unique(Date), Date, FUN = "<")))
# A tibble: 6 x 6
# Groups: Name [2]
#   Name  Store       Purchase   Date       Prev_Purchase Prev_trip
#   <chr> <chr>       <chr>      <date>             <dbl>     <dbl>
# 1 John  CVS         Shampoo    2001-01-01             0         0
# 2 John  CVS         Toothpaste 2001-01-01             0         0
# 3 John  Whole Foods Kombucha   2005-01-01             3         2
# 4 John  Kroger      Ice Cream  2002-01-01             2         1
# 5 Jane  CVS         Soap       2001-01-01             0         0
# 6 Jane  Whole Foods Crackers   2004-01-01             1         1
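For intuition, outer(Date, Date, "<") builds a logical matrix whose (i, j) entry is Date[i] < Date[j], so colSums() gives, for each purchase, the number of that person's purchases dated strictly earlier. A minimal sketch with John's dates:
d <- as.Date(c("2001-01-01", "2001-01-01", "2005-01-01", "2002-01-01"))
outer(d, d, "<")
#       [,1]  [,2]  [,3]  [,4]
# [1,] FALSE FALSE  TRUE  TRUE
# [2,] FALSE FALSE  TRUE  TRUE
# [3,] FALSE FALSE FALSE FALSE
# [4,] FALSE FALSE  TRUE FALSE
colSums(outer(d, d, "<"))  # 0 0 3 2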

Expand data.table so one row per pattern match of each ID

I have a lot of text data in a data.table. I have several text patterns that I'm interested in. I have managed to subset the table so it shows text that matches at least two of the patterns (relevant question here).
I now want to be able to have one row per match, with an additional column that identifies the match - so rows where there are multiple matches will be duplicates apart from that column.
It feels like this shouldn't be too hard but I'm struggling! My vague thoughts are around maybe counting the number of pattern matches, then duplicating the rows that many times...but then I'm not entirely sure how to get the label for each different pattern...(and also not sure that is very efficient anyway).
Thanks for your help!
Example data
library(data.table)
library(stringr)
text_table <- data.table(ID = (1:5),
                         text = c("lucy, sarah and paul live on the same street",
                                  "lucy has only moved here recently",
                                  "lucy and sarah are cousins",
                                  "john is also new to the area",
                                  "paul and john have known each other a long time"))
text_patterns <- as.character(c("lucy", "sarah", "paul|john"))
# Filtering the table to just the IDs with at least two pattern matches
text_table_multiples <- text_table[Reduce(`+`, lapply(text_patterns,
                                     function(x) str_detect(text, x))) > 1]
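For reference, the Reduce(`+`, ...) call just adds up the per-pattern logical vectors, giving a per-row count of how many patterns matched:
Reduce(`+`, lapply(text_patterns, function(x) str_detect(text_table$text, x)))
# [1] 3 1 2 1 1   -> rows 1 and 3 match at least two patterns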
Ideal output
required_table <- data.table(ID = c(1, 1, 1, 2, 3, 3, 4, 5),
                             text = c("lucy, sarah and paul live on the same street",
                                      "lucy, sarah and paul live on the same street",
                                      "lucy, sarah and paul live on the same street",
                                      "lucy has only moved here recently",
                                      "lucy and sarah are cousins",
                                      "lucy and sarah are cousins",
                                      "john is also new to the area",
                                      "paul and john have known each other a long time"),
                             person = c("lucy", "sarah", "paul or john", "lucy", "lucy",
                                        "sarah", "paul or john", "paul or john"))
A way to do that is to create a variable for each indicator and melt:
library(stringi)
text_table[, lucy := stri_detect_regex(text, 'lucy')
           ][, sarah := stri_detect_regex(text, 'sarah')
           ][, `paul or john` := stri_detect_regex(text, 'paul|john')]
melt(text_table, id.vars = c("ID", "text"))[value == TRUE][, -"value"]
## ID text variable
## 1: 1 lucy, sarah and paul live on the same street lucy
## 2: 2 lucy has only moved here recently lucy
## 3: 3 lucy and sarah are cousins lucy
## 4: 1 lucy, sarah and paul live on the same street sarah
## 5: 3 lucy and sarah are cousins sarah
## 6: 1 lucy, sarah and paul live on the same street paul or john
## 7: 4 john is also new to the area paul or john
## 8: 5 paul and john have known each other a long time paul or john
A tidy way of doing the same procedure is:
library(tidyverse)
text_table %>%
  mutate(lucy = stri_detect_regex(text, 'lucy')) %>%
  mutate(sarah = stri_detect_regex(text, 'sarah')) %>%
  mutate(`paul or john` = stri_detect_regex(text, 'paul|john')) %>%
  gather(value = value, key = person, -c(ID, text)) %>%
  filter(value) %>%
  select(-value)
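Note that gather() has since been superseded in tidyr; on tidyr >= 1.0.0 the same reshape can be written with pivot_longer() (an equivalent sketch, assuming that version is available):
text_table %>%
  mutate(lucy = stri_detect_regex(text, 'lucy'),
         sarah = stri_detect_regex(text, 'sarah'),
         `paul or john` = stri_detect_regex(text, 'paul|john')) %>%
  pivot_longer(-c(ID, text), names_to = "person", values_to = "value") %>%
  filter(value) %>%
  select(-value)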
DISCLAIMER: this is not an idiomatic data.table solution
I would build a helper function like the following, which takes a single row and the patterns as input and returns a new data.table with N rows:
library(data.table)
library(tidyverse)
new_rows <- function(dtRow, patterns = text_patterns){
  res <- map(patterns, function(word) {
    textField <- grep(x = dtRow[1, text], pattern = word, value = TRUE) %>%
      ifelse(is.character(.), ., NA)
    personField <- str_extract(string = dtRow[1, text], pattern = word) %>%
      ifelse(. == "paul" | . == "john", "paul or john", .)
    idField <- ifelse(is.na(textField), NA, dtRow[1, ID])
    data.table(ID = idField, text = textField, person = personField)
  }) %>%
    rbindlist()
  res[!is.na(text), ]
}
And I will execute it:
split(text_table, f = text_table[['ID']]) %>%
  map_df(function(r) new_rows(dtRow = r))
The answer is:
ID text person
1: 1 lucy, sarah and paul live on the same street lucy
2: 1 lucy, sarah and paul live on the same street sarah
3: 1 lucy, sarah and paul live on the same street paul or john
4: 2 lucy has only moved here recently lucy
5: 3 lucy and sarah are cousins lucy
6: 3 lucy and sarah are cousins sarah
7: 4 john is also new to the area paul or john
8: 5 paul and john have known each other a long time paul or john
which looks like your required_table (duplicated IDs included)
ID text person
1: 1 lucy, sarah and paul live on the same street lucy
2: 1 lucy, sarah and paul live on the same street sarah
3: 1 lucy, sarah and paul live on the same street paul or john
4: 2 lucy has only moved here recently lucy
5: 3 lucy and sarah are cousins lucy
6: 3 lucy and sarah are cousins sarah
7: 4 john is also new to the area paul or john
8: 5 paul and john have known each other a long time paul or john

More efficient methods than nested for loops in R -- matching

I'm trying to match people who have identical first names, last names, and dates of birth, and keep the smallest numerical ID for each match.
I've created a test database below (much smaller than my actual dataset) and written a nested for-loop that looks like it's doing what it's supposed to.
But it's slow as hell on bigger datasets.
I'm relatively new to the apply functions, and they seem more suited to applying a function than to data wrangling like this.
What's a more efficient alternative for what I'm doing here? I'm sure there's a simple solution that will have me shaking my head for asking here, but I'm not coming to it.
dta.test <- NULL
dta.test$Person_id <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
dta.test$FirstName <- c("John", "James", "John", "Alex", "Alexander", "Jonathan",
                        "John", "Alex", "James", "John", "John")
dta.test$LastName <- c("Smith", "Jones", "Jones", "Jones", "Jones", "Smith",
                       "Jones", "Smith", "Johnson", "Smith", "Smith")
dta.test$DOB <- c("2001-01-01", "2002-01-01", "2003-01-01", "2004-01-01",
                  "2004-01-01", "2001-01-01", "2003-01-01", "2006-01-01",
                  "2006-01-01", "2001-01-01", "2009-01-01")
dta.test$Actual_ID <- c(1, 2, 3, 4, 5, 6, 3, 8, 9, 1, 11)
dta.test <- as.data.frame(dta.test)
for (i in unique(dta.test$FirstName)) {
  for (j in unique(dta.test$LastName)) {
    for (k in unique(dta.test$DOB)) {
      idx <- dta.test$FirstName == i & dta.test$LastName == j & dta.test$DOB == k
      dta.test$Person_id[idx] <- min(dta.test$Person_id[idx], na.rm = TRUE)
    }
  }
}
Here's a dplyr solution
library(dplyr)
dta.test %>%
  group_by(FirstName, LastName, DOB) %>%
  mutate(Person_id = min(Person_id))
# A tibble: 11 x 5
# Groups: FirstName, LastName, DOB [9]
# Person_id FirstName LastName DOB Actual_ID
# <dbl> <fct> <fct> <fct> <dbl>
# 1 1. John Smith 2001-01-01 1.
# 2 2. James Jones 2002-01-01 2.
# 3 3. John Jones 2003-01-01 3.
# 4 4. Alex Jones 2004-01-01 4.
# 5 5. Alexander Jones 2004-01-01 5.
# 6 6. Jonathan Smith 2001-01-01 6.
# 7 3. John Jones 2003-01-01 3.
# 8 8. Alex Smith 2006-01-01 8.
# 9 9. James Johnson 2006-01-01 9.
# 10 1. John Smith 2001-01-01 1.
# 11 11. John Smith 2009-01-01 11.
EDIT - Added Performance comparison
for_loop_approach <- function() {
  for (i in unique(dta.test$FirstName)) {
    for (j in unique(dta.test$LastName)) {
      for (k in unique(dta.test$DOB)) {
        idx <- dta.test$FirstName == i & dta.test$LastName == j & dta.test$DOB == k
        dta.test$Person_id[idx] <- min(dta.test$Person_id[idx], na.rm = TRUE)
      }
    }
  }
}
dplyr_approach <- function() {
  require(dplyr)
  dta.test %>%
    group_by(FirstName, LastName, DOB) %>%
    mutate(Person_id = min(Person_id))
}
library(microbenchmark)
microbenchmark(for_loop_approach(), dplyr_approach(), unit="relative", times=100L)
Unit: relative
                expr      min      lq    mean   median       uq      max neval
 for_loop_approach() 20.97948 20.6478 18.8189 17.81437 17.91815 11.76743   100
    dplyr_approach()  1.00000  1.0000  1.0000  1.00000  1.00000  1.00000   100
There were 50 or more warnings (use warnings() to see the first 50)
I've implemented a base R approach rather than dplyr and it comes out (according to microbenchmark) 7.46 times faster than the dplyr approach of CPak, and 139.4 times faster than the for loop approach. I've just used the match and paste0 functions to get this working, and it will automatically retain the smallest matching id:
dta.test[, "Actual_id"] <- match(paste0(dta.test$FirstName, dta.test$LastName, dta.test$DOB), paste0(dta.test$FirstName, dta.test$LastName, dta.test$DOB))
This approach also outputs it straight to a data frame, rather than a tibble (which you would need to extract the new column from, and add to your data frame):
Person_id FirstName LastName DOB Actual_id
1 1 John Smith 2001-01-01 1
2 2 James Jones 2002-01-01 2
3 3 John Jones 2003-01-01 3
4 4 Alex Jones 2004-01-01 4
5 5 Alexander Jones 2004-01-01 5
6 6 Jonathan Smith 2001-01-01 6
7 7 John Jones 2003-01-01 3
8 8 Alex Smith 2006-01-01 8
9 9 James Johnson 2006-01-01 9
10 10 John Smith 2001-01-01 1
11 11 John Smith 2009-01-01 11
In your real data I expect the person id is not so simple (not just an integer) and doesn't run in numerical order, e.g.
dta.test$Person_id <- paste0(LETTERS[1:11],1:11)
You just need a small tweak to make this still work: pull the matched value from the Person_id column instead of using the match positions directly:
dta.test[, "Actual_id"] <- dta.test[match(paste0(dta.test$FirstName, dta.test$LastName, dta.test$DOB),
                                          paste0(dta.test$FirstName, dta.test$LastName, dta.test$DOB)), "Person_id"]
Giving:
Person_id FirstName LastName DOB Actual_id
1 A1 John Smith 2001-01-01 A1
2 B2 James Jones 2002-01-01 B2
3 C3 John Jones 2003-01-01 C3
4 D4 Alex Jones 2004-01-01 D4
5 E5 Alexander Jones 2004-01-01 E5
6 F6 Jonathan Smith 2001-01-01 F6
7 G7 John Jones 2003-01-01 C3
8 H8 Alex Smith 2006-01-01 H8
9 I9 James Johnson 2006-01-01 I9
10 J10 John Smith 2001-01-01 A1
11 K11 John Smith 2009-01-01 K11
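One caveat with the paste0() keys: with no separator, adjacent fields can collide (for example, "Ann" + "aSmith" and "Anna" + "Smith" produce the same key). Using paste() with a separator that cannot occur in the data avoids this:
key <- paste(dta.test$FirstName, dta.test$LastName, dta.test$DOB, sep = "|")
dta.test[, "Actual_id"] <- dta.test[match(key, key), "Person_id"]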
A data table solution will probably be quickest on large data with lots of groups:
library(data.table)
setDT(dta.test, key = c("FirstName", "LastName", "DOB"))
dta.test[, Actual_ID := min(Person_id, na.rm = TRUE), by = .(FirstName, LastName, DOB)]
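Note that setDT(..., key = ...) physically sorts the table by the key columns, so the row order changes; if the original order matters, the same grouped update works without setting a key:
setDT(dta.test)  # no key: original row order is preserved
dta.test[, Actual_ID := min(Person_id, na.rm = TRUE), by = .(FirstName, LastName, DOB)]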

Re-Populate column in a relational data frame after randomization in R

I have a data frame of individuals and their spouses with some personal information (i.e. last names) that I have randomized with plyr::mapvalues in order to protect identities. Here is a reproducible example of how it looked before and after changing the surnames:
# before
d <- data.frame(id = c(1:6),
                first_name = c('Jeff', 'Marilyn', 'Gwyn',
                               'Alice', 'Sam', 'Sarah'),
                surname = c('Goldbloom', 'Monroe', 'Paltrow', 'Goldbloom',
                            'Smith', 'Silverman'),
                # note: the quoted "NA" makes spouse_id a character column
                spouse_id = c(2, 1, 1, 5, 4, "NA"),
                spouse = c('Marilyn Monroe', 'Jeff Goldbloom', 'Jeff Goldbloom',
                           'Sam Smith', 'Alice Goldbloom', 'NA'))
d
d
> id first_name   surname spouse_id          spouse
   1       Jeff Goldbloom         2  Marilyn Monroe
   2    Marilyn    Monroe         1  Jeff Goldbloom
   3       Gwyn   Paltrow         1  Jeff Goldbloom
   4      Alice Goldbloom         5       Sam Smith
   5        Sam     Smith         4 Alice Goldbloom
   6      Sarah Silverman        NA              NA
# replacement names to serve as surnames (doesn't matter what they are, just
# that the ratios remain the same as before; mapvalues takes care of this)
repnames <- c("Arman", "Clovis", "Garner", "Casey", "Birch")
s <- unique(d$surname)
d$surname <- plyr::mapvalues(d$surname, from = s, to = repnames)  # replace surnames
# After replacement, the dataframe looks like:
d
> id first_name surname spouse_id          spouse
   1       Jeff   Arman         2  Marilyn Monroe
   2    Marilyn  Clovis         1  Jeff Goldbloom
   3       Gwyn  Garner         1  Jeff Goldbloom
   4      Alice   Arman         5       Sam Smith
   5        Sam   Casey         4 Alice Goldbloom
   6      Sarah   Birch        NA              NA
Each person has his or her own id number, but not all people have spouses. If a person does have a spouse, their spouse's individual id is reflected in the spouse_id column. I did this so that I could filter individuals and their spouses separately later using something like dplyr::filter(d, spouse %in% spouse_id).
My question is, how can I use the relational id and spouse_id columns to re-populate the spouse column so that it reflects the new, randomized surnames? i.e. the final expected output would be:
id first_name surname spouse_id         spouse
 1       Jeff   Arman         2 Marilyn Clovis
 2    Marilyn  Clovis         1     Jeff Arman
 3       Gwyn  Garner         1     Jeff Arman
 4      Alice   Arman         5      Sam Casey
 5        Sam   Casey         4    Alice Arman
 6      Sarah   Birch        NA             NA
...So some concatenation will be involved on the first_name and surname columns. I've never done something quite so conditional in R - in Excel I guess it would be nested VLOOKUP functions...
Thanks, sorry it's so specific but hopefully it presents a fun challenge to someone out there.
Assuming that your NAs are actual NAs, then
d$spouse <- paste(d$first_name, d$surname)[d$spouse_id]
d$spouse
#[1] "Marilyn Clovis" "Jeff Arman" "Jeff Arman" "Sam Casey" "Alice Arman" NA
