Lookup values in a vectorized way - r

I keep reading about the importance of vectorized functionality so hopefully someone can help me out here.
Say I have a data frame with two columns: name and ID. I also have another data frame with name and birthplace, but this data frame is much larger than the first and contains some, but not all, of the names from the first data frame. How can I add a third column to the first table that is populated with birthplaces looked up in the second table?
What I have now is:
corresponding.birthplaces <- sapply(table1$Name,
function(name){return(table2$Birthplace[table2$Name==name])})
This seems inefficient. Thoughts? Does anyone know of a good book/resource for using R 'properly'? I get the feeling that I generally think in the least computationally efficient manner conceivable.
Thanks :)

See ?merge, which performs a database-like merge or join.
Here is an example:
set.seed(2)
d1 <- data.frame(ID = 1:5, Name = c("Bill","Bob","Jessica","Jennifer","Robyn"))
d2 <- data.frame(Name = c("Bill", "Gavin", "Bob", "Joris", "Jessica", "Andrie",
"Jennifer","Joshua","Robyn","Iterator"),
Birthplace = sample(c("London","New York",
"San Francisco", "Berlin",
"Tokyo", "Paris"), 10, rep = TRUE))
which gives:
> d1
ID Name
1 1 Bill
2 2 Bob
3 3 Jessica
4 4 Jennifer
5 5 Robyn
> d2
Name Birthplace
1 Bill New York
2 Gavin Tokyo
3 Bob Berlin
4 Joris New York
5 Jessica Paris
6 Andrie Paris
7 Jennifer London
8 Joshua Paris
9 Robyn San Francisco
10 Iterator Berlin
Then we use merge() to do the join:
> merge(d1, d2)
Name ID Birthplace
1 Bill 1 New York
2 Bob 2 Berlin
3 Jennifer 4 London
4 Jessica 3 Paris
5 Robyn 5 San Francisco
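One caveat, sketched under assumptions: by default merge() keeps only the names found in both tables, and the question says some names may be missing from the lookup table. The toy data below is hypothetical ("Zoe" is deliberately absent from d2). It shows a match()-based lookup and the equivalent left join with all.x = TRUE; both keep every row of d1 and fill in NA where no birthplace is found.

```r
# Hypothetical data: "Zoe" has no entry in the lookup table d2
d1 <- data.frame(ID = 1:3, Name = c("Bill", "Bob", "Zoe"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(Name = c("Bob", "Gavin", "Bill"),
                 Birthplace = c("Berlin", "Tokyo", "New York"),
                 stringsAsFactors = FALSE)

# match() gives, for each d1$Name, the row of its first match in d2$Name
# (NA when there is no match), so one indexing step does the whole lookup
idx <- match(d1$Name, d2$Name)
d1$Birthplace <- d2$Birthplace[idx]

# The same result as a left join: all.x = TRUE keeps unmatched rows of d1
joined <- merge(d1[, c("ID", "Name")], d2, by = "Name", all.x = TRUE)
```

Note that match() only returns the first hit, so it assumes names are unique in the lookup table; merge() would instead return one row per matching pair.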

Related

New Column Based on Conditions

To set the scene, I have a set of data where two columns of the data have been mixed up. To give a simple example:
df1 <- data.frame(Name = c("Bob", "John", "Mark", "Will"), City=c("Apple", "Paris", "Orange", "Berlin"), Fruit=c("London", "Pear", "Madrid", "Orange"))
df2 <- data.frame(Cities = c("Paris", "London", "Berlin", "Madrid", "Moscow", "Warsaw"))
As a result, we have two small data sets:
> df1
Name City Fruit
1 Bob Apple London
2 John Paris Pear
3 Mark Orange Madrid
4 Will Berlin Orange
> df2
Cities
1 Paris
2 London
3 Berlin
4 Madrid
5 Moscow
6 Warsaw
My aim is to create a new column where the cities are in the correct place using df2. I am a bit new to R so I don't know how this would work.
I don't really know where to even start with this sort of a problem. My full dataset is much larger and it would be good to have an efficient method of unpicking this issue!
If only the 'City' and 'Fruit' values are swapped, we can loop over the rows, build a logical vector marking which values match 'Cities' in 'df2', and reassemble each row so that the matched value comes second:
df1[] <- t(apply(df1, 1, function(x)
{
i1 <- x %in% df2$Cities
i2 <- !i1
x1 <- x[i2]
c(x1[1], x[i1], x1[2])}))
-output
> df1
Name City Fruit
1 Bob London Apple
2 John Paris Pear
3 Mark Madrid Orange
4 Will Berlin Orange
Using the dplyr package, this solution looks up the City and Fruit values in df1 and takes whichever one exists in the df2 cities list.
If neither of the two is a city name, an empty string is returned; you can replace that with anything you prefer.
library(dplyr)
df1$corrected_City <- case_when(df1$City %in% df2$Cities ~ df1$City,
df1$Fruit%in% df2$Cities ~ df1$Fruit,
TRUE ~ "")
Output: a new column is created, as you wanted, with the city name for each row.
> df1
Name City Fruit corrected_City
1 Bob Apple London London
2 John Paris Pear Paris
3 Mark Orange Madrid Madrid
4 Will Berlin Orange Berlin
Another way is:
library(dplyr)
library(tidyr)
df1 %>%
mutate(across(1:3, ~case_when(. %in% df2$Cities ~ .), .names = 'new_{col}')) %>%
unite(New_Col, starts_with('new'), na.rm = TRUE, sep = ' ')
Name City Fruit New_Col
1 Bob Apple London London
2 John Paris Pear Paris
3 Mark Orange Madrid Madrid
4 Will Berlin Orange Berlin
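If you would rather fix the two columns in place using base R only, here is a minimal sketch. It assumes each row has at most one city, sitting in either City or Fruit, and it builds the data with stringsAsFactors = FALSE so the columns are plain character vectors. The swapped rows are flagged with %in% and the values exchanged by logical indexing, with no per-row loop:

```r
df1 <- data.frame(Name  = c("Bob", "John", "Mark", "Will"),
                  City  = c("Apple", "Paris", "Orange", "Berlin"),
                  Fruit = c("London", "Pear", "Madrid", "Orange"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Cities = c("Paris", "London", "Berlin", "Madrid", "Moscow", "Warsaw"),
                  stringsAsFactors = FALSE)

# Rows where the city ended up in Fruit rather than in City
swapped <- !(df1$City %in% df2$Cities) & (df1$Fruit %in% df2$Cities)

# Exchange the two values on those rows only
tmp <- df1$City[swapped]
df1$City[swapped]  <- df1$Fruit[swapped]
df1$Fruit[swapped] <- tmp
```

Rows where neither column holds a known city are left untouched, which may or may not be what you want on the full dataset.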

Join of 2 dataframes [duplicate]

I have 2 dataframes and I want to join by name, but names are not written exactly the same:
Df1:
ID Name Age
1 Jose 13
2 M. Jose 12
3 Laura 8
4 Karol P 32
Df2:
Name Surname
José Hall
María José Perez
Laura Alza
Karol Smith
I need to join and get this:
ID Name Age Surname
1 Jose 13 Hall
2 M. Jose 12 Perez
3 Laura 8 Alza
4 Karol P 32 Smith
How can I account for the names not being exactly the same before joining?
You can get close to your result using stringdist_left_join from fuzzyjoin
library(fuzzyjoin)
stringdist_left_join(df1, df2, by = "Name")
# ID Name.x Age Name.y Surname
#1 1 Jose 13 José Hall
#2 2 M. Jose 12 <NA> <NA>
#3 3 Laura 8 Laura Alza
#4 4 Karol P 32 Karol Smith
For the example shared, it fails for one entry because it is difficult to match María with M. You can get a match for that entry by raising the max_dist argument (default is 2); however, this would also produce unwanted matches for other rows. If only a few entries are NA after this join (as in the example shared), you could just match them by hand.
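To see why a larger distance threshold is risky, here is a rough sketch using adist() from base R's utils package, which computes plain edit (Levenshtein) distances. This is an assumption for illustration only: stringdist_left_join relies on the stringdist package, and its default method may score some pairs differently. The accented names are written with \u escapes just to keep the script independent of file encoding.

```r
a <- c("Jose", "M. Jose", "Laura", "Karol P")                   # df1$Name
b <- c("Jos\u00e9", "Mar\u00eda Jos\u00e9", "Laura", "Karol")   # df2$Name

# Edit-distance matrix: rows follow a, columns follow b
d <- adist(a, b)
dimnames(d) <- list(a, b)
print(d)

# Position in b of the closest name for each element of a
nearest <- apply(d, 1, which.min)
print(nearest)
```

In this toy case "M. Jose" comes out closer to "José" than to "María José", so a threshold loose enough to admit the intended pair would admit the wrong one as well.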
I would clean the data first (for example by removing the accents; in Excel this is easy with find and replace) and then use
new_df <- merge(df1, df2, by="Name")
or, if possible, you could assign df2 an ID that coincides with the one in df1.

Merge two datasets

I create a node list as follows:
name <- c("Joe","Frank","Peter")
city <- c("New York","Detroit","Maimi")
age <- c(24,55,65)
node_list <- data.frame(name,age,city)
node_list
name age city
1 Joe 24 New York
2 Frank 55 Detroit
3 Peter 65 Maimi
Then I create an edge list as follows:
from <- c("Joe","Frank","Peter","Albert")
to <- c("Frank","Albert","James","Tony")
to_city <- c("Detroit","St. Louis","New York","Carson City")
edge_list <- data.frame(from,to,to_city)
edge_list
from to to_city
1 Joe Frank Detroit
2 Frank Albert St. Louis
3 Peter James New York
4 Albert Tony Carson City
Notice that the names in the node list and edge list do not overlap 100%. I want to create a master node list of all the names, capturing city information as well. This is my dplyr attempt to do this:
new_node <- edge_list %>%
gather("from_to", "name", from, to) %>%
distinct(name) %>%
full_join(node_list)
new_node
name age city
1 Joe 24 New York
2 Frank 55 Detroit
3 Peter 65 Maimi
4 Albert NA <NA>
5 James NA <NA>
6 Tony NA <NA>
I need to figure out how to add to_city information. What do I need to add to my dplyr code to make this happen? Thanks.
Join twice, once on to and once on from, with the irrelevant columns subsetted out:
library(dplyr)
# tibble() replaces the deprecated data_frame()
node_list <- tibble(name = c("Joe", "Frank", "Peter"),
                    city = c("New York", "Detroit", "Maimi"),
                    age = c(24, 55, 65))
edge_list <- tibble(from = c("Joe", "Frank", "Peter", "Albert"),
                    to = c("Frank", "Albert", "James", "Tony"),
                    to_city = c("Detroit", "St. Louis", "New York", "Carson City"))
node_list %>%
    full_join(select(edge_list, name = to, city = to_city)) %>%
    full_join(select(edge_list, name = from))
#> Joining, by = c("name", "city")
#> Joining, by = "name"
#> # A tibble: 6 x 3
#> name city age
#> <chr> <chr> <dbl>
#> 1 Joe New York 24.
#> 2 Frank Detroit 55.
#> 3 Peter Maimi 65.
#> 4 Albert St. Louis NA
#> 5 James New York NA
#> 6 Tony Carson City NA
In this case the second join doesn't do anything because everybody is already included, but it would insert anyone who only existed in the from column.
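For comparison, the same two-step full join can be sketched in base R with merge(..., all = TRUE). The helper tables to_nodes and from_nodes are names made up for this illustration; the data is the question's.

```r
node_list <- data.frame(name = c("Joe", "Frank", "Peter"),
                        age  = c(24, 55, 65),
                        city = c("New York", "Detroit", "Maimi"),
                        stringsAsFactors = FALSE)
edge_list <- data.frame(from    = c("Joe", "Frank", "Peter", "Albert"),
                        to      = c("Frank", "Albert", "James", "Tony"),
                        to_city = c("Detroit", "St. Louis", "New York", "Carson City"),
                        stringsAsFactors = FALSE)

# Rename the edge columns so they line up with the node list
to_nodes   <- data.frame(name = edge_list$to, city = edge_list$to_city,
                         stringsAsFactors = FALSE)
from_nodes <- data.frame(name = edge_list$from, stringsAsFactors = FALSE)

# all = TRUE makes merge() a full outer join; the first call joins on the
# shared columns (name and city), the second on name only
master <- merge(node_list, to_nodes, all = TRUE)
master <- merge(master, from_nodes, by = "name", all = TRUE)
print(master)
```

Unlike full_join(), merge() sorts the result by the join columns, so the row order differs from the dplyr output.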

Construct a vector of names from data frame using R

I have a big data frame that contains data about the outcomes of sports matches. I want to try and extract specific data from the data frame depending on certain criteria. Here's a quick example of what I mean...
Imagine I have a data frame df, which displays data about specific football matches of a tournament on each row, like so:
Winner_Teams Win_Capt_Nm Win_Country Loser_teams Lose_Capt_Nm Lose_Country
1 Man utd John England Barcalona Carlos Spain
2 Liverpool Steve England Juventus Mario Italy
3 Man utd John Scotland R Madrid Juan Spain
4 Paris SG Teirey France Chelsea Mark England
So, for example, in row [1] Man utd won against Barcalona, Man utd's captain's name was John and he is from England. Barcalona's (the losers of the match) captain's name was Carlos and he is from Spain.
I want to construct a vector with the names of all English players in the tournament, where the output should look something like this:
[1] "John" "Mark" "Steve"
Here's what I've tried so far...
My first step was to create a data frame that discards all the matches that don't have English captains
> England_player <- data.frame(filter(df, Win_Country == "England" | Lose_Country == "England"))
> England_player
Winner_Teams Win_Capt_Nm Win_Country Loser_teams Lose_Capt_Nm Lose_Country
1 Man utd John England Barcalona Carlos Spain
2 Liverpool Steve England Juventus Mario Italy
3 Paris SG Teirey France Chelsea Mark England
Then I used select() on England_player to isolate just the names:
> England_player_names <- select(England_player, Win_Capt_Nm, Lose_Capt_Nm)
> England_player_names
Win_Capt_Nm Lose_Capt_Nm
1 John Carlos
2 Steve Mario
3 Teirey Mark
And then I get stuck! As you can see, the output displays the English winner's name and the name of his opponent... which is not what I want!
It's easy to just read the names off this data frame.. but the data frame I'm working with is large, so just reading the values is no good!
Any suggestions as to how I'd do this?
english.players <- union(df$Win_Capt_Nm[df$Win_Country == 'England'], df$Lose_Capt_Nm[df$Lose_Country == 'England'])
[1] "John" "Steve" "Mark"
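Spelled out as a self-contained sketch: the data frame below is retyped from the question's table (with stringsAsFactors = FALSE as an added assumption, so the names are plain character vectors). Each side is subset with a logical condition, and union() combines the two vectors while dropping duplicates.

```r
df <- data.frame(
  Winner_Teams = c("Man utd", "Liverpool", "Man utd", "Paris SG"),
  Win_Capt_Nm  = c("John", "Steve", "John", "Teirey"),
  Win_Country  = c("England", "England", "Scotland", "France"),
  Loser_teams  = c("Barcalona", "Juventus", "R Madrid", "Chelsea"),
  Lose_Capt_Nm = c("Carlos", "Mario", "Juan", "Mark"),
  Lose_Country = c("Spain", "Italy", "Spain", "England"),
  stringsAsFactors = FALSE
)

# Captains on the winning side whose country is England
winners <- df$Win_Capt_Nm[df$Win_Country == "England"]
# Captains on the losing side whose country is England
losers  <- df$Lose_Capt_Nm[df$Lose_Country == "England"]

# union() concatenates both vectors and removes duplicates
english.players <- union(winners, losers)
print(english.players)
```

If you want the names in alphabetical order, as in the question's desired output, wrap the result in sort().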

Turn names into numbers in a dataframe based on the row index of the name in another dataframe

I have two dataframes. One is just the names of my Facebook friends, and the other holds the links, with source and target columns. I want to turn the names in the links dataframe into numbers based on the row index of that name in the friends dataframe.
friends
name
1 Andrewt Thomas
2 Robbie McCord
3 Mohammad Mojadidi
4 Andrew John
5 Professor Owk
6 Joseph Charles
links
source target
1 Andrewt Thomas Andrew John
2 Andrewt Thomas James Zou
3 Robbie McCord Bz Benz
4 Robbie McCord Yousef AL-alawi
5 Robbie McCord Sherhan Asimov
6 Robbie McCord Aigerim Aig
Seems trivial, but I cannot figure it out. Thanks for help.
Just use a simple match
links$source <- match(links$source, friends$name)
links
# source target
# 1 1 Andrew John
# 2 1 James Zou
# 3 2 Bz Benz
# 4 2 Yousef AL-alawi
# 5 2 Sherhan Asimov
# 6 2 Aigerim Aig
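If the target column should be converted too, one possible sketch is to apply match() to every column with lapply(). The data here is a shortened, retyped subset of the question's tables; names that do not appear in friends come back as NA rather than raising an error.

```r
friends <- data.frame(name = c("Andrewt Thomas", "Robbie McCord", "Mohammad Mojadidi",
                               "Andrew John", "Professor Owk", "Joseph Charles"),
                      stringsAsFactors = FALSE)
links <- data.frame(source = c("Andrewt Thomas", "Andrewt Thomas", "Robbie McCord"),
                    target = c("Andrew John", "James Zou", "Bz Benz"),
                    stringsAsFactors = FALSE)

# links[] keeps the data frame structure while each column is replaced by
# the row index of its names in friends$name (NA where a name is absent)
links[] <- lapply(links, match, table = friends$name)
print(links)
```

Whether NA is acceptable for unknown targets depends on what the indices are used for; appending the missing names to friends first would be one way to avoid it.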
Something like this?
links$source <- vapply(links$source, function(x) which(friends$name == x), integer(1))
Full example
links <- data.frame(source = c("John", "John", "Alice"), target = c("Jimmy", "Al", "Chris"))
links$source <- vapply(links$source, function(x) which(friends$name == x), integer(1))
links$source
[1] 3 3 2

Resources