Join each term with list of keywords - r

Probably a simple problem and you can help me quickly.
I have a vector with all the terms contained in a list of keywords. Now I want to join each term with all keywords that contain this term. Here's an example
vec <- c("small", "boat", "river", "house", "tour", "a", "on", "the", "houseboat", …)
keywords <- c("small boat tour", "a house on the river", "a houseboat", …)
The expected result looks like:
keywords terms
small boat tour small
small boat tour boat
small boat tour tour
a house on the river a
a house on the river house
a house on the river on
a house on the river the
a house on the river river
a houseboat a
a houseboat houseboat

You can use expand.grid to get all combinations, wrap each word of vec in word boundaries, and then filter with grepl via mapply, i.e.
df1 <- expand.grid(vec, keywords)
df1[mapply(grepl, paste0('\\b', df1$Var1, '\\b'), df1$Var2), ]
Var1 Var2
1 small small boat tour
2 boat small boat tour
5 tour small boat tour
12 river a house on the river
13 house a house on the river
15 a a house on the river
16 on a house on the river
17 the a house on the river
24 a a houseboat
27 houseboat a houseboat
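One caveat worth noting (my addition, not part of the original answer): if a term in vec contains regex metacharacters such as ".", the \b-wrapped pattern can over-match. A small sketch of escaping the terms first, using a hypothetical term "3.5":

```r
# Sketch: escape regex metacharacters in a term before wrapping it in \b...\b
# (the term "3.5" and the esc() helper are assumptions for illustration)
esc <- function(x) gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", x)
pat_raw <- paste0("\\b", "3.5", "\\b")       # "." matches any character
pat_esc <- paste0("\\b", esc("3.5"), "\\b")  # "." is now literal
grepl(pat_raw, "item 365 code")    # TRUE  (over-matches "365")
grepl(pat_esc, "item 365 code")    # FALSE
grepl(pat_esc, "a 3.5 inch part")  # TRUE
```

For plain word terms like those in the question, this makes no difference.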

You can do a fuzzyjoin::fuzzy_join using stringr::str_detect as the matching function, and adding \\b word boundaries to each word in vec.
vec <- data.frame(terms = c("small", "boat", "river", "house", "tour", "a", "on", "the", "houseboat"))
keywords <- data.frame(keywords = c("small boat tour", "a house on the river", "a houseboat"))
fuzzyjoin::fuzzy_inner_join(keywords, vec, by = c("keywords" = "terms"),
match_fun = \(x, y) stringr::str_detect(x, paste0("\\b", y, "\\b")))
Output:
keywords terms
1 small boat tour small
2 small boat tour boat
3 small boat tour tour
4 a house on the river river
5 a house on the river house
6 a house on the river a
7 a house on the river on
8 a house on the river the
9 a houseboat a
10 a houseboat houseboat

Another way is to use strsplit and intersect.
. <- lapply(strsplit(keywords, " ", fixed = TRUE), intersect, vec)
data.frame(keywords = rep(keywords, lengths(.)), terms = unlist(.))
# keywords terms
#1 small boat tour small
#2 small boat tour boat
#3 small boat tour tour
#4 a house on the river a
#5 a house on the river house
#6 a house on the river on
#7 a house on the river the
#8 a house on the river river
#9 a houseboat a
#10 a houseboat houseboat
In case vec contains all terms of the keywords, there is no need for a join.
. <- lapply(strsplit(keywords, " ", fixed = TRUE), unique)
data.frame(keywords = rep(keywords, lengths(.)), terms = unlist(.))

The expected result can be reproduced by
library(data.table)
data.table(keywords)[, .(terms = tstrsplit(keywords, "\\W+")), by = keywords]
keywords terms
1: small boat tour small
2: small boat tour boat
3: small boat tour tour
4: a house on the river a
5: a house on the river house
6: a house on the river on
7: a house on the river the
8: a house on the river river
9: a houseboat a
10: a houseboat houseboat
This is a rather simple answer based on an interpretation of OP's sentence
I have a vector with all the terms contained in a list of keywords.
The important point is all the terms. So, it is assumed that we can simply split the keywords into separate terms.
Note that the regex \\W+ is used to separate the terms in case there is more than one non-word character between them, e.g., ", ".
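As a base-R illustration of that point (with a toy input of my own, not from the answer), strsplit with the same regex handles a comma-plus-space separator:

```r
# \W+ splits on any run of non-word characters, so ", " works as a separator
strsplit("small, boat,  tour", "\\W+")[[1]]
# [1] "small" "boat"  "tour"
```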
However, in case the vector does not contain all terms intentionally we need to subset the result, e.g.
vec <- c("small", "boat", "river", "house", "tour", "houseboat")
data.table(keywords)[, .(terms = tstrsplit(keywords, "\\W+")), by = keywords][
terms %in% vec]
keywords terms
1: small boat tour small
2: small boat tour boat
3: small boat tour tour
4: a house on the river house
5: a house on the river river
6: a houseboat houseboat

Here is a base R one-liner with strsplit + stack
> with(keywords, rev(stack(setNames(strsplit(keywords, " "), keywords))))
ind values
1 small boat tour small
2 small boat tour boat
3 small boat tour tour
4 a house on the river a
5 a house on the river house
6 a house on the river on
7 a house on the river the
8 a house on the river river
9 a houseboat a
10 a houseboat houseboat

Related

I am trying to filter on two conditions, but I keep removing all patients with either condition

I'm a beginner on R so apologies for errors, and thank you for helping.
I have a dataset (liver) where rows are patient ID numbers, and columns include what region the patient resides in (London, Yorkshire etc) and what unit the patient was treated in (hospital name). Some of the units are private units. I've identified 120 patients from London, of whom 100 were treated across three private units. I want to remove the 100 London patients treated in private units but I keep accidentally removing all patients treated in the private units (around 900 patients). I'd be grateful for advice on how to just remove the London patients treated privately.
I've tried various combinations of using subset and filter with different exclamation points and brackets in different places including for example:
liver <- filter(liver, region_name != "London" & unit_name!="Primrose Hospital" & unit_name != "Oak Hospital" & unit_name != "Wilson Hospital")
Thank you very much.
Your unit_name condition is zeroing your results. Try using the match function which is more commonly seen in its infix form %in%:
liver <- filter(liver,
region_name != "London",
! unit_name %in% c("Primrose Hospital",
"Oak Hospital",
"Wilson Hospital"))
Also you can separate logical AND conditions using a comma.
Building on Pariksheet's great start (still drops outside-London private hospital patients). Here we need to use the OR | operator within the filter function. I've made an example dataframe which demonstrates how this works for your case. The example tibble contains your three private London hospitals plus one non-private hospital that we want to keep. Plus, it has Manchester patients who attend both Manch and one of the private hospitals, all of whom we want to keep.
EDITED: Now includes character vectors to allow generalisation of combinations to exclude.
library(tibble)
library(dplyr)
liver <- tibble(region_name = rep(c('London', 'Liverpool', 'Glasgow', 'Manchester'), each = 4),
unit_name = c(rep(c('Primrose Hospital',
'Oak Hospital',
'Wilson Hospital',
'State Hospital'), times = 3),
rep(c('Manch General', 'Primrose Hospital'), each = 2)))
liver
# A tibble: 16 x 2
region_name unit_name
<chr> <chr>
1 London Primrose Hospital
2 London Oak Hospital
3 London Wilson Hospital
4 London State Hospital
5 Liverpool Primrose Hospital
6 Liverpool Oak Hospital
7 Liverpool Wilson Hospital
8 Liverpool State Hospital
9 Glasgow Primrose Hospital
10 Glasgow Oak Hospital
11 Glasgow Wilson Hospital
12 Glasgow State Hospital
13 Manchester Manch General
14 Manchester Manch General
15 Manchester Primrose Hospital
16 Manchester Primrose Hospital
excl.private.regions <- c('London',
'Liverpool',
'Glasgow')
excl.private.hospitals <- c('Primrose Hospital',
'Oak Hospital',
'Wilson Hospital')
liver %>%
filter(! region_name %in% excl.private.regions |
! unit_name %in% excl.private.hospitals)
# A tibble: 7 x 2
region_name unit_name
<chr> <chr>
1 London State Hospital
2 Liverpool State Hospital
3 Glasgow State Hospital
4 Manchester Manch General
5 Manchester Manch General
6 Manchester Primrose Hospital
7 Manchester Primrose Hospital
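An equivalent way to express the same logic (a base-R sketch with toy data of my own, not taken from the answers above) is to negate the London-and-private conjunction as a whole:

```r
# Toy data; column and hospital names are assumptions for illustration
liver <- data.frame(
  region_name = c("London", "London", "Yorkshire"),
  unit_name   = c("Primrose Hospital", "City NHS", "Primrose Hospital")
)
private <- c("Primrose Hospital", "Oak Hospital", "Wilson Hospital")

# Drop only rows that are BOTH London AND a private unit; by De Morgan's law
# this is equivalent to the OR condition used above
kept <- subset(liver, !(region_name == "London" & unit_name %in% private))
```

The same logical expression can be passed to dplyr's filter unchanged.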

Is it possible to convert lines from a text file into columns to get a dataframe?

I have a text file containing information on book title, author name, and country of birth, which appear on separate lines as shown below:
Oscar Wilde
De Profundis
Ireland
Nathaniel Hawthorn
Birthmark
USA
James Joyce
Ulysses
Ireland
Walt Whitman
Leaves of Grass
USA
Is there any way to convert the text to a dataframe with these three items appearing as different columns:
ID Author Book Country
1 "Oscar Wilde" "De Profundis" "Ireland"
2 "Nathaniel Hawthorn" "Birthmark" "USA"
There are built-in functions for dealing with this kind of data:
data.frame(scan(text=xx, multi.line=TRUE,
what=list(Author="", Book="", Country=""), sep="\n"))
# Author Book Country
#1 Oscar Wilde De Profundis Ireland
#2 Nathaniel Hawthorn Birthmark USA
#3 James Joyce Ulysses Ireland
#4 Walt Whitman Leaves of Grass USA
You can create a 3-column matrix from one column of data.
dat <- read.table('data.txt', sep = ',')
result <- matrix(dat$V1, ncol = 3, byrow = TRUE) |>
data.frame() |>
setNames(c('Author', 'Book', 'Country'))
result <- cbind(ID = 1:nrow(result), result)
result
# ID Author Book Country
#1 1 Oscar Wilde De Profundis Ireland
#2 2 Nathaniel Hawthorn Birthmark USA
#3 3 James Joyce Ulysses Ireland
#4 4 Walt Whitman Leaves of Grass USA
Alternatively, you can reshape your data after importing it.
#Test data
xx <- "Oscar Wilde
De Profundis
Ireland
Nathaniel Hawthorn
Birthmark
USA
James Joyce
Ulysses
Ireland
Walt Whitman
Leaves of Grass
USA"
writeLines(xx, "test.txt")
And then the code
library(dplyr)
library(tidyr)
lines <- read.csv("test.txt", header=FALSE)
lines %>%
  mutate(
    rid = ((row_number() - 1) %% 3) + 1,
    pid = (row_number() - 1) %/% 3 + 1) %>%
  mutate(col = case_when(rid == 1 ~ "Author", rid == 2 ~ "Book", rid == 3 ~ "Country")) %>%
  select(-rid) %>%
  pivot_wider(names_from = col, values_from = V1)
Which returns
# A tibble: 4 x 4
pid Author Book Country
<dbl> <chr> <chr> <chr>
1 1 Oscar Wilde De Profundis Ireland
2 2 Nathaniel Hawthorn Birthmark USA
3 3 James Joyce Ulysses Ireland
4 4 Walt Whitman Leaves of Grass USA

Conditional Counting in Data Tables

Suppose I have the following data table:
hs_code country city company
1: apples Canada Calgary West Jet
2: apples Canada Calgary United
3: apples US Los Angeles Alaska
4: apples US Chicago Alaska
5: oranges Korea Seoul West Jet
6: oranges China Shanghai John's Freight Co
7: oranges China Harbin John's Freight Co
8: oranges China Ningbo John's Freight Co
Output:
hs_code countries city company
1: apples 2 1,2 2,1,1
2: oranges 2 1,3 1,1,1,1
The logic is as follows:
For each good, I want the first column to summarize the number of unique countries. For apples it is 2. Based on this value, I want a 2-tuple in the city column that summarizes the number of unique cities for each country. So, since there is only one unique city for Canada and two for the US, the value becomes (1,2). Notice that the sum of this tuple is 3. Finally, in the company column, I want a 3-tuple that summarizes the number of unique companies per (country, city) combination. So, since there are West Jet and United for the (Canada, Calgary) pair, I assign a 2. The next two values are 1 and 1 because Los Angeles and Chicago only have one transportation company listed.
I understand this is pretty confusing and involved. But any help would be greatly appreciated. I've tried using data table methods such as
DT[, countries := .uniqueN(country), by =.(hs_code)]
DT[, city:= .uniqueN(city), by = .(hs_code, country)]
but I'm not sure how to get this conveniently into a list form into a data.table recursively.
Thanks!
Well, this is some sort of nested transformation that you can do in three steps:
dt[, .(companies = length(unique(company))), by = .(hs_code, country, city)][,
.(cities = length(unique(city)),
companies = paste0(companies, collapse = ",")), by = .(hs_code, country)][,
.(countries = length(unique(country)),
cities = paste0(cities, collapse = ","),
companies = paste0(companies, collapse = ",")), by = hs_code ]
# hs_code countries cities companies
# 1: apples 2 1,2 2,1,1
# 2: oranges 2 1,3 1,1,1,1
You can use .SD[] notation to create subgroups with more granular grouping:
dt[, .(
countries = uniqueN(country),
cities = c(.SD[, uniqueN(city), .(country)][, .(V1)]),
companies = c(.SD[, uniqueN(company), .(country, city)][, .(V1)])
), .(hs_code)]
# hs_code countries cities companies
# 1: apples 2 1,2 2,1,1
# 2: oranges 2 1,3 1,1,1,1

Extract House Number from (address) string using r

I want to parse apart (extract) addresses into HouseNumber and Streetname.
I should later be able to write the extracted "values" into new columns (shops$HouseNumber and shops$Streetname).
So let's say I have a data frame called "shops":
> shops
Name city street
1 Something Fakecity New Street 3
2 SomethingOther Fakecity Some-Complicated-Casestreet 1-3
3 SomethingDifferent Fakecity Fake Street 14a
So, is there a way to split the street column into two vectors, one with the street names and one with the house numbers, covering cases like "1-3" and "14a", so that in the end the result could be assigned to the data frame and look like:
> shops
Name city Streetname HouseNumber
1 Something Fakecity New Street 3
2 SomethingOther Fakecity Some-Complicated-Casestreet 1-3
3 SomethingDifferent Fakecity Fake Street 14a
Example: Easyfakestreet 5 --> Easyfakestreet , 5
It gets slightly complicated by the fact that some of my street strings will have hyphenated street addresses and have non numerical components.
Examples: New Street 3 --> ['New Street', '3 ']
Some-Complicated-Casestreet 1-3 --> ['Some-Complicated-Casestreet','1-3']
Fake Street 14a --> ['Fake Street', '14a']
I would appreciate some help!
Here's a possible tidyr solution
library(tidyr)
extract(df, "street", c("Streetname", "HouseNumber"), "(\\D+)(\\d.*)")
# Name city Streetname HouseNumber
# 1 Something Fakecity New Street 3
# 2 SomethingOther Fakecity Some-Complicated-Casestreet 1-3
# 3 SomethingDifferent Fakecity Fake Street 14a
You can try:
shops$Streetname <- gsub("(.+)\\s[^ ]+$", "\\1", shops$street)
shops$HouseNumber <- gsub(".+\\s([^ ]+)$", "\\1", shops$street)
data
shops$street
#[1] "New Street 3" "Some-Complicated-Casestreet 1-3" "Fake Street 14a"
results
shops$Streetname
#[1] "New Street" "Some-Complicated-Casestreet" "Fake Street"
shops$HouseNumber
#[1] "3" "1-3" "14a"
Create a pattern with backreferences that matches both the street and the number, and then use sub to replace it with each backreference in turn. No packages are needed:
pat <- "(.*) (\\d.*)"
transform(shops,
street = sub(pat, "\\1", street),
HouseNumber = sub(pat, "\\2", street)
)
giving:
Name city street HouseNumber
1 Something Fakecity New Street 3
2 SomethingOther Fakecity Some-Complicated-Casestreet 1-3
3 SomethingDifferent Fakecity Fake Street 14a
Note:
1) We used this for shops:
shops <-
structure(list(Name = c("Something", "SomethingOther", "SomethingDifferent"
), city = c("Fakecity", "Fakecity", "Fakecity"), street = c("New Street 3",
"Some-Complicated-Casestreet 1-3", "Fake Street 14a")), .Names = c("Name",
"city", "street"), class = "data.frame", row.names = c(NA, -3L))
2) David Arenburg's pattern could alternatively be used here; just set pat to it. The pattern above has the advantage that it allows street names with embedded numbers, while David's has the advantage that the space before the street number may be missing.
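To make that trade-off concrete, here is a small base-R sketch (the example street strings are my own assumptions):

```r
pat1 <- "(.*) (\\d.*)"   # needs a space before the number; embedded digits OK
pat2 <- "(\\D+)(\\d.*)"  # tolerates a missing space; \D+ rejects digits in names
sub(pat1, "\\1", "4th Avenue 12")  # "4th Avenue"  (embedded "4" is fine)
sub(pat2, "\\2", "Fakestreet12")   # "12"          (no separating space needed)
```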
You could use the package unglue
library(unglue)
unglue_unnest(shops, street, "{street} {value=\\d.*}")
#> Name city street value
#> 1 Something Fakecity New Street 3
#> 2 SomethingOther Fakecity Some-Complicated-Casestreet 1-3
#> 3 SomethingDifferent Fakecity Fake Street 14a
Created on 2019-10-08 by the reprex package (v0.3.0)
This is a very complex issue for international addresses. Here is a PHP (not R) regex that attempts to match house numbers:
$re = '/(\d+[\d\/\-\. ,]*[ ,\d\-\w]{0,2} )/m';
$str = '234 Test Road, Testville
456b Tester Road, Testville
789 c Tester Road, Testville
Mystreet 14a
123/3 dsdsdfs
Roobertinkatu 36-40
Flats 1-24 Acacia Avenue
Apartment 9D, 1 Acacia Avenue
Flat 24, 1 Acacia Avenue
Moscow Street, plot,23 building 2
Apartment 5005 no. 7 lane 31 Wuming Rd
Quinta da Redonda Lote 3 - 1 º
102 - 3 Esq
Av 1 Maio 16,2 dt,
Rua de Ceuta Lote 1 Loja 5
11334 Nc Highway 72 E ';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
// Print the entire match result
var_dump($matches);
Output example
https://regex101.com/r/WVPBji/1

R - Finding a specific row of a dataframe and then adding data from that row to another data frame

I have two dataframes. 1 full of data about individuals, including their street name and house number but not their house size. And another with information about each house including street name and house number and house size but not data on the individuals living in that house. I'd like to add the size information to the first dataframe as a new column so I can see the house size for each individual.
I have over 200,000 individuals and around 100,000 houses, and the methods I've tried so far (cutting down the second dataframe for each individual) are painfully slow. Is there an efficient way to do this? Thank you.
Using @jazzurro's example, another option for larger datasets would be to use data.table
library(data.table)
setkey(setDT(df1), street, num)
setkey(setDT(df2), street, num)
df2[df1]
# size street num person
#1: large liliha st 3 bob
#2: NA mahalo st 32 dan
#3: small makiki st 15 ana
#4: NA nehoa st 11 ellen
#5: medium nuuanu ave 8 cathy
Here is my suggestion. Given what you described, I created sample data; please try to provide sample data yourself next time, as you are more likely to receive help, and save people time, when you include data and code. You have two key variables to merge the two data frames: street name and house number. Here, I chose to keep all data points in df1.
df1 <- data.frame(person = c("ana", "bob", "cathy", "dan", "ellen"),
street = c("makiki st", "liliha st", "nuuanu ave", "mahalo st", "nehoa st"),
num = c(15, 3, 8, 32, 11),
stringsAsFactors = FALSE)
#person street num
#1 ana makiki st 15
#2 bob liliha st 3
#3 cathy nuuanu ave 8
#4 dan mahalo st 32
#5 ellen nehoa st 11
df2 <- data.frame(size = c("small", "large", "medium"),
street = c("makiki st", "liliha st", "nuuanu ave"),
num = c(15, 3, 8),
stringsAsFactors = FALSE)
# size street num
#1 small makiki st 15
#2 large liliha st 3
#3 medium nuuanu ave 8
library(dplyr)
left_join(df1, df2)
# street num person size
#1 makiki st 15 ana small
#2 liliha st 3 bob large
#3 nuuanu ave 8 cathy medium
#4 mahalo st 32 dan <NA>
#5 nehoa st 11 ellen <NA>
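For completeness, base R's merge can do the same left join without any packages (a sketch of mine using the same toy-data shape, not from the answers above):

```r
# Toy data mirroring the answer's example
df1 <- data.frame(person = c("ana", "bob", "dan"),
                  street = c("makiki st", "liliha st", "mahalo st"),
                  num = c(15, 3, 32))
df2 <- data.frame(size = c("small", "large"),
                  street = c("makiki st", "liliha st"),
                  num = c(15, 3))
# all.x = TRUE keeps every row of df1, filling size with NA when unmatched
res <- merge(df1, df2, by = c("street", "num"), all.x = TRUE)
```

Note that merge sorts the result by the key columns, unlike left_join, which preserves the row order of df1.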
