To set the scene, I have a set of data where two columns of the data have been mixed up. To give a simple example:
df1 <- data.frame(Name = c("Bob", "John", "Mark", "Will"), City=c("Apple", "Paris", "Orange", "Berlin"), Fruit=c("London", "Pear", "Madrid", "Orange"))
df2 <- data.frame(Cities = c("Paris", "London", "Berlin", "Madrid", "Moscow", "Warsaw"))
As a result, we have two small data sets:
> df1
Name City Fruit
1 Bob Apple London
2 John Paris Pear
3 Mark Orange Madrid
4 Will Berlin Orange
> df2
Cities
1 Paris
2 London
3 Berlin
4 Madrid
5 Moscow
6 Warsaw
My aim is to create a new column where the cities are in the correct place using df2. I am a bit new to R so I don't know how this would work.
I don't really know where to even start with this sort of a problem. My full dataset is much larger and it would be good to have an efficient method of unpicking this issue!
If the 'City' values are only different. We may loop over the rows, create a logical vector based on the matching values with 'Cities' from 'df2', and concatenate with the rest of the values by getting the matched values second in the order
df1[] <- t(apply(df1, 1, function(x)
{
i1 <- x %in% df2$Cities
i2 <- !i1
x1 <- x[i2]
c(x1[1], x[i1], x1[2])}))
-output
> df1
Name City Fruit
1 Bob London Apple
2 John Paris Pear
3 Mark Madrid Orange
4 Will Berlin Orange
using dplyr package this is a solution, where it looks up the two City and Fruit values in df1, and takes the one that exists in the df2 cities list.
if none of the two are a city name, an empty string is returned, you can replace that with anything you prefer.
library(dplyr)
df1$corrected_City <- case_when(df1$City %in% df2$Cities ~ df1$City,
df1$Fruit%in% df2$Cities ~ df1$Fruit,
TRUE ~ "")
output, a new column created as you wanted with the city name on that row.
> df1
Name City Fruit corrected_City
1 Bob Apple London London
2 John Paris Pear Paris
3 Mark Orange Madrid Madrid
4 Will Berlin Orange Berlin
Another way is:
library(dplyr)
library(tidyr)
df1 %>%
mutate(across(1:3, ~case_when(. %in% df2$Cities ~ .), .names = 'new_{col}')) %>%
unite(New_Col, starts_with('new'), na.rm = TRUE, sep = ' ')
Name City Fruit New_Col
1 Bob Apple London London
2 John Paris Pear Paris
3 Mark Orange Madrid Madrid
4 Will Berlin Orange Berlin
Related
Two big real life tables to join up, but here's a little reprex:
I've got a table of small strings and I want to left join on a second table, with the join being based on whether or not these small strings can be found inside the bigger strings on the second table.
df_1 <- data.frame(index = 1:5,
keyword = c("john", "ella", "mil", "nin", "billi"))
df_2 <- data.frame(index_2 = 1001:1008,
name = c("John Coltrane", "Ella Fitzgerald", "Miles Davis", "Billie Holliday",
"Nina Simone", "Bob Smith", "John Brown", "Tony Montana"))
df_results_i_want <- data.frame(index = c(1, 1:5),
keyword = c("john", "john", "ella", "mil", "nin", "billi"),
index_2 = c(1001, 1007, 1002, 1003, 1005, 1004),
name = c("John Coltrane", "John Brown", "Ella Fitzgerald",
"Miles Davis", "Nina Simone", "Billie Holliday"))
Seems like a str_detect() call and a left_join() call might be part of the solution - ie I'm hoping for something like:
library(tidyverse)
df_results <- df_1 |> left_join(df_2, join_by(blah blah str_detect() blah blah))
I'm using dplyr 1.1 so I can use join_by(), but I'm not sure of the correct way to get what I need - can anyone help please?
I suppose I could do a simple cross join using tidyr::crossing() and then do the str_detect() stuff afterwards (and filter out things that don't match)
df_results <- df_1 |>
crossing(df_2) |>
mutate(match = str_detect(name, fixed(keyword, ignore_case = TRUE))) |>
filter(match) |>
select(-match)
but in my real life example, the cross join would produce an absolutely enormous table that would overwhelm my PC.
Thank you.
You can try fuzzy_join::regex_join():
library(fuzzyjoin)
regex_join(df_2, df_1, by=c("name"="keyword"), ignore_case=T)
Output:
index.x name index.y keyword
1 1001 John Coltrane 1 john
2 1002 Ella Fitzgerald 2 ella
3 1003 Miles Davis 3 mil
4 1004 Billie Holliday 5 billi
5 1005 Nina Simone 4 nin
6 1007 John Brown 1 john
join_by does not support inexact join (but unequal), but you can use fuzzyjoin:
library(dplyr)
library(fuzzyjoin)
df_2 %>%
mutate(name = tolower(name)) %>%
fuzzy_left_join(df_1, ., by = c(keyword = "name"),
match_fun = \(x, y) str_detect(y, x))
index keyword index_2 name
1 1 john 1001 john coltrane
2 1 john 1007 john brown
3 2 ella 1002 ella fitzgerald
4 3 mil 1003 miles davis
5 4 nin 1005 nina simone
6 5 billi 1004 billie holliday
We can use SQL to do that.
library(sqldf)
sqldf("select * from [df_1] A
left join [df_2] B on B.name like '%' || A.keyword || '%'")
giving:
index keyword index_2 name
1 1 john 1001 John Coltrane
2 1 john 1007 John Brown
3 2 ella 1002 Ella Fitzgerald
4 3 mil 1003 Miles Davis
5 4 nin 1005 Nina Simone
6 5 billi 1004 Billie Holliday
It can be placed in a pipeline like this:
library(magrittr)
library(sqldf)
df_1 %>%
{ sqldf("select * from [.] A
left join [df_2] B on B.name like '%' || A.keyword || '%'")
}
I have a dataframe with a column of city names, in each cell of this column there are multiple text values separated by ",".
For example the first 4 rows of the cities column of my df are:
"Barcelona, Milaan, Londen, Paris, Berlin"
"Barcelona"
"Milaan, Barcelona, Berlin"
"London, Berlin"
I want to count for each row of this column
wheter these cities occurs.
For example, the output needs to look like this:
count_cities
5
1
3
2
Thank you in advance!
DATA:
cities <- data.frame(names = c("Barcelona, Milaan, Londen, Paris, Berlin","Barcelona",
"Milaan, Barcelona, Berlin","London, Berlin"), stringsAsFactors = F)
To count how many city namesthere are you can first split the string at ,and count the splits using lengths:
cities$count <- lengths(strsplit(cities$names, ","))
The resulting dataframe is this:
cities
names count
1 Barcelona, Milaan, Londen, Paris, Berlin 5
2 Barcelona 1
3 Milaan, Barcelona, Berlin 3
4 London, Berlin 2
EDIT:
If the strings contain not only city namesbut additional information, you can use str_countto match upper-case letters (because city names begin with an upper-case letter but other words don't, at least not in the example you've given):
cities <- data.frame(names = c("Barcelona, Milaan, Londen, Paris, Berlin","Barcelona (a big city)",
"Milaan, Barcelona, Berlin","London, Berlin (are all capitals, are big cities)"), stringsAsFactors = F)
library(stringr)
cities$count <- str_count(cities$names, "[A-Z][a-z]+")
Alternatively, use str_extract:
cities$count <- lengths(str_extract_all(cities$names, "[A-Z][a-z]+"))
library(tidyverse)
travel <- tibble(CITYS = c("Barcelona, Milaan, Londen, Paris, Berlin",
"Barcelona",
"Milaan, Barcelona, Berlin",
"London, Berlin"))
travel %>%
mutate(CITY.COUNT = map_dbl(str_split(CITYS, ",\\s*"), length))
Yields
# A tibble: 4 x 2
CITYS CITY.COUNT
<chr> <dbl>
1 Barcelona, Milaan, Londen, Paris, Berlin 5
2 Barcelona 1
3 Milaan, Barcelona, Berlin 3
4 London, Berlin 2
Another option is str_count
library(stringr)
str_count(travel$CITYS, "\\w+")
#[1] 5 1 3 2
I have two tables, 1. answers table and 2. solution table.
The answer table is a list of Name+Answer.
name=c("Jenns","Amy","Jake","Alison","Tommy","Jason","Alex","Vivian")
guess_answer=c("sdgf23894011","lp98ung67543","pwerugji22im","21loop98un89","9580ik8584sf","awe25f6ty788","k0o2jgpo146i","rgyhuj87630l")
answer=data.frame(cbind(name,guess_answer))
> answer
name guess_answer
1 Jenns sdgf23894011
2 Amy lp98ung67543
3 Jake pwerugji22im
4 Alison 21loop98un89
5 Tommy 9580ik8584sf
6 Jason awe25f6ty788
7 Alex k0o2jgpo146i
8 Vivian rgyhuj87630l
The solution table is lists of country with a corresponding (digit+alphabet).
corresponding_number=c("2341rg4524gr","9580ik7584sf","pp0or9rjg7n2","g0o2jgpo146i","lp98ung67543","pwerugji22im","lokibh678901")
country=c("US","UK","CN","AU","JP","KR", "NP")
counry_name=c("United State","United Kingdom","China","Australia","Japan","Korea","North Pole")
solution = cbind(country, corresponding_number,counry_name)
solution = data.frame(solution)
> solution
country corresponding_number counry_name
1 US 2341rg4524gr United State
2 UK 9580ik7584sf United Kingdom
3 CN pp0or9rjg7n2 China
4 AU g0o2jgpo146i Australia
5 JP lp98ung67543 Japan
6 KR pwerugji22im Korea
7 NP lokibh678901 North Pole
I would like to compare the answer table to the solution table, in which if the guess_number is the exact same or 1 digit/alphabet different, it is consider as correct. Then I want to create a table with the country, corresponding_number, and the counry_name.
For example:
> newtable
name corresponding_number country_name
[1,] "xxx" "sdgf23894011" "xxx"
[2,] "JP" "lp98ung67543" "Japan"
[3,] "KR" "pwerugji22im" "Korea"
[4,] "xxx" "21loop98un89" "xxx"
[5,] "UK" "9580ik8584sf" "United Kingdom"
[6,] "xxx" "awe25f6ty788" "xxx"
[7,] "AU" "k0o2jgpo146i" "Australia"
[8,] "xxx" "rgyhuj87630l" "xxx"
The name needs to be: either replaced by "xxx" if answer is wrong, or the "country abbreviations" if answer is wrong.
whether the answer is correct or wrong is based on the guess_answer; the guess_answer is correct if it is a)exactly the same as one of the corresponding_number, or b)1 digit/alphabet different.
the guess_answer will not change but the colname will become "corresponding_number"
Include a third columns showing the full country name, if the guess_answer is wrong, the responding full country name will be "xxx" as well .
edit: first condition.
Here one option is stringdist_left_join, after the join and mutate to replace the NA elements with 'xxx'
library(fuzzyjoin)
library(dplyr)
stringdist_left_join(answer, solution,
by = c("guess_answer" = "corresponding_number"))%>%
mutate(corresponding_number = case_when(is.na(corresponding_number)
~ guess_answer, TRUE ~ corresponding_number),
name = case_when(is.na(country) ~ 'xxx', TRUE ~ country),
counry_name = replace(counry_name, is.na(counry_name), 'xxx')) %>%
select(name, corresponding_number = guess_answer, counry_name)
# name corresponding_number counry_name
#1 xxx sdgf23894011 xxx
#2 JP lp98ung67543 Japan
#3 KR pwerugji22im Korea
#4 xxx 21loop98un89 xxx
#5 UK 9580ik8584sf United Kingdom
#6 xxx awe25f6ty788 xxx
#7 AU k0o2jgpo146i Australia
#8 xxx rgyhuj87630l xxx
data
answer <- data.frame(name,guess_answer, stringsAsFactors = FALSE)
solution <- data.frame(country, corresponding_number,
counry_name, stringsAsFactors = FALSE)
In base R, we can use adist.
#Calculate distance between guess_answer and corresponding_number
mat <- adist(answer$guess_answer, solution$corresponding_number)
#assign default value to result column
answer$country_name <- 'xxx'
#select values with distance of less than or equal to 1
mat1 <- which(mat <= 1, arr.ind = TRUE)
#Order them by row
ord <- order(mat1[, 1])
#Assign values to the column
answer$country_name[mat1[ord, 1]] <- solution$counry_name[mat1[ord, 2]]
answer
# name guess_answer country_name
#1 Jenns sdgf23894011 xxx
#2 Amy lp98ung67543 Japan
#3 Jake pwerugji22im Korea
#4 Alison 21loop98un89 xxx
#5 Tommy 9580ik8584sf United Kingdom
#6 Jason awe25f6ty788 xxx
#7 Alex k0o2jgpo146i Australia
#8 Vivian rgyhuj87630l xxx
data
answer <- data.frame(name,guess_answer, stringsAsFactors = FALSE)
solution <- data.frame(country, corresponding_number,counry_name,
stringsAsFactors = FALSE)
I create a node list as follows:
name <- c("Joe","Frank","Peter")
city <- c("New York","Detroit","Maimi")
age <- c(24,55,65)
node_list <- data.frame(name,age,city)
node_list
name age city
1 Joe 24 New York
2 Frank 55 Detroit
3 Peter 65 Maimi
Then I create an edge list as follows:
from <- c("Joe","Frank","Peter","Albert")
to <- c("Frank","Albert","James","Tony")
to_city <- c("Detroit","St. Louis","New York","Carson City")
edge_list <- data.frame(from,to,to_city)
edge_list
from to to_city
1 Joe Frank Detroit
2 Frank Albert St. Louis
3 Peter James New York
4 Albert Tony Carson City
Notice that the names in the node list and edge list do not overlap 100%. I want to create a master node list of all the names, capturing city information as well. This is my dplyr attempt to do this:
new_node <- edge_list %>%
gather("from_to", "name", from, to) %>%
distinct(name) %>%
full_join(node_list)
new_node
name age city
1 Joe 24 New York
2 Frank 55 Detroit
3 Peter 65 Maimi
4 Albert NA <NA>
5 James NA <NA>
6 Tony NA <NA>
I need to figure out how to add to_city information. What do I need to add to my dplyr code to make this happen? Thanks.
Join twice, once on to and once on from, with the irrelevant columns subsetted out:
library(dplyr)
node_list <- data_frame(name = c("Joe", "Frank", "Peter"),
city = c("New York", "Detroit", "Maimi"),
age = c(24, 55, 65))
edge_list <- data_frame(from = c("Joe", "Frank", "Peter", "Albert"),
to = c("Frank", "Albert", "James", "Tony"),
to_city = c("Detroit", "St. Louis", "New York", "Carson City"))
node_list %>%
full_join(select(edge_list, name = to, city = to_city)) %>%
full_join(select(edge_list, name = from))
#> Joining, by = c("name", "city")
#> Joining, by = "name"
#> # A tibble: 6 x 3
#> name city age
#> <chr> <chr> <dbl>
#> 1 Joe New York 24.
#> 2 Frank Detroit 55.
#> 3 Peter Maimi 65.
#> 4 Albert St. Louis NA
#> 5 James New York NA
#> 6 Tony Carson City NA
In this case the second join doesn't do anything because everybody is already included, but it would insert anyone who only existed in the from column.
I keep reading about the importance of vectorized functionality so hopefully someone can help me out here.
Say I have a data frame with two columns: name and ID. Now I also have another data frame with name and birthplace, but this data frame is much larger than the first, and contains some but not all of the names from the first data frame. How can I add a third column to the the first table that is populated with birthplaces looked up using the second table.
What I have is now is:
corresponding.birthplaces <- sapply(table1$Name,
function(name){return(table2$Birthplace[table2$Name==name])})
This seems inefficient. Thoughts? Does anyone know of a good book/resource for using R 'properly'..I get the feeling that I generally do think in the least computationally effective manner conceivable.
Thanks :)
See ?merge which will perform a database link merge or join.
Here is an example:
set.seed(2)
d1 <- data.frame(ID = 1:5, Name = c("Bill","Bob","Jessica","Jennifer","Robyn"))
d2 <- data.frame(Name = c("Bill", "Gavin", "Bob", "Joris", "Jessica", "Andrie",
"Jennifer","Joshua","Robyn","Iterator"),
Birthplace = sample(c("London","New York",
"San Francisco", "Berlin",
"Tokyo", "Paris"), 10, rep = TRUE))
which gives:
> d1
ID Name
1 1 Bill
2 2 Bob
3 3 Jessica
4 4 Jennifer
5 5 Robyn
> d2
Name Birthplace
1 Bill New York
2 Gavin Tokyo
3 Bob Berlin
4 Joris New York
5 Jessica Paris
6 Andrie Paris
7 Jennifer London
8 Joshua Paris
9 Robyn San Francisco
10 Iterator Berlin
Then we use merge() to do the join:
> merge(d1, d2)
Name ID Birthplace
1 Bill 1 New York
2 Bob 2 Berlin
3 Jennifer 4 London
4 Jessica 3 Paris
5 Robyn 5 San Francisco